Re: [PATCH v10 4/4] vpci/msix: Free MSIX resources when init_msix() fails
On 2025/8/5 16:10, Jan Beulich wrote: > On 05.08.2025 05:49, Jiqian Chen wrote: >> When MSI-X initialization fails vPCI will hide the capability, but >> remove of handlers and data won't be performed until the device is >> deassigned. Introduce a MSI-X cleanup hook that will be called when >> initialization fails to cleanup MSI-X related hooks and free it's >> associated data. >> >> As all supported capabilities have been switched to use the cleanup >> hooks call those from vpci_deassign_device() instead of open-code the >> capability specific cleanup in there. >> >> Signed-off-by: Jiqian Chen >> --- >> cc: "Roger Pau Monné" >> --- >> v9->v10 changes: >> * Call all cleanup hook in vpci_deassign_device() instead of cleanup_msix(). > > Isn't this rather an omission in an earlier change, and hence may want to > come separately and with a Fixes: tag? This is not really an omission, after all, all the cleanup hooks were implemented at the end of this series. And judging from the commit message(which was written by Roger in v8), Roger also agreed to add them in this patch. > >> --- a/xen/drivers/vpci/vpci.c >> +++ b/xen/drivers/vpci/vpci.c >> @@ -321,6 +321,27 @@ void vpci_deassign_device(struct pci_dev *pdev) >> &pdev->domain->vpci_dev_assigned_map); >> #endif >> >> +for ( i = 0; i < NUM_VPCI_INIT; i++ ) >> +{ >> +const vpci_capability_t *capability = &__start_vpci_array[i]; >> +const unsigned int cap = capability->id; >> +unsigned int pos = 0; >> + >> +if ( !capability->is_ext ) >> +pos = pci_find_cap_offset(pdev->sbdf, cap); >> +else if ( is_hardware_domain(pdev->domain) ) >> +pos = pci_find_ext_capability(pdev->sbdf, cap); > > What's the point of doing this when ... > >> +if ( pos && capability->cleanup ) > > ... ->cleanup is NULL? Don't you want to have > > if ( !capability->cleanup ) > continue; > > earlier on? Will change in next version. > > Jan -- Best regards, Jiqian Chen.
Re: [PATCH] x86/HVM: polish hvm_asid_init() a little
On 04.08.2025 17:41, Jan Beulich wrote: > While the logic there covers asymmetric cases, the resulting log > messages would likely raise more confusion than clarify anything. Split > the BSP action from the AP one, indicating the odd CPU in the AP log > message, thus avoiding the impression that global state would have > changed. > > While there also > - move g_disabled into .data.ro_after_init; only the BSP will ever write > to it, > - make the function's parameter unsigned; no negative values may be > passed in. Also reflect this in svm_asid_init(). > > Signed-off-by: Jan Beulich Sorry, I meant to Cc: you on the submission, as it was the reviewing of your copied code which made me spot the shortcomings (which, as indicated, I think it would be nice if you avoided from the very start). Jan > --- a/xen/arch/x86/hvm/asid.c > +++ b/xen/arch/x86/hvm/asid.c > @@ -48,20 +48,22 @@ struct hvm_asid_data { > > static DEFINE_PER_CPU(struct hvm_asid_data, hvm_asid_data); > > -void hvm_asid_init(int nasids) > +void hvm_asid_init(unsigned int nasids) > { > -static int8_t g_disabled = -1; > +static int8_t __ro_after_init g_disabled = -1; > struct hvm_asid_data *data = &this_cpu(hvm_asid_data); > > data->max_asid = nasids - 1; > data->disabled = !opt_asid_enabled || (nasids <= 1); > > -if ( g_disabled != data->disabled ) > +if ( g_disabled < 0 ) > { > -printk("HVM: ASIDs %sabled.\n", data->disabled ? "dis" : "en"); > -if ( g_disabled < 0 ) > -g_disabled = data->disabled; > +g_disabled = data->disabled; > +printk("HVM: ASIDs %sabled\n", data->disabled ? "dis" : "en"); > } > +else if ( g_disabled != data->disabled ) > +printk("HVM: CPU%u: ASIDs %sabled\n", smp_processor_id(), > + data->disabled ? "dis" : "en"); > > /* Zero indicates 'invalid generation', so we start the count at one. */ > data->core_asid_generation = 1; > --- a/xen/arch/x86/hvm/svm/asid.c > +++ b/xen/arch/x86/hvm/svm/asid.c > @@ -12,7 +12,7 @@ > > void svm_asid_init(const struct cpuinfo_x86 *c) > { > -int nasids = 0; > +unsigned int nasids = 0; > > /* Check for erratum #170, and leave ASIDs disabled if it's present. */ > if ( !cpu_has_amd_erratum(c, AMD_ERRATUM_170) ) > --- a/xen/arch/x86/include/asm/hvm/asid.h > +++ b/xen/arch/x86/include/asm/hvm/asid.h > @@ -13,7 +13,7 @@ struct vcpu; > struct hvm_vcpu_asid; > > /* Initialise ASID management for the current physical CPU. */ > -void hvm_asid_init(int nasids); > +void hvm_asid_init(unsigned int nasids); > > /* Invalidate a particular ASID allocation: forces re-allocation. */ > void hvm_asid_flush_vcpu_asid(struct hvm_vcpu_asid *asid);
Re: [PATCH v10 4/4] vpci/msix: Free MSIX resources when init_msix() fails
On 05.08.2025 10:27, Chen, Jiqian wrote: > On 2025/8/5 16:10, Jan Beulich wrote: >> On 05.08.2025 05:49, Jiqian Chen wrote: >>> When MSI-X initialization fails vPCI will hide the capability, but >>> remove of handlers and data won't be performed until the device is >>> deassigned. Introduce a MSI-X cleanup hook that will be called when >>> initialization fails to cleanup MSI-X related hooks and free it's >>> associated data. >>> >>> As all supported capabilities have been switched to use the cleanup >>> hooks call those from vpci_deassign_device() instead of open-code the >>> capability specific cleanup in there. >>> >>> Signed-off-by: Jiqian Chen >>> --- >>> cc: "Roger Pau Monné" >>> --- >>> v9->v10 changes: >>> * Call all cleanup hook in vpci_deassign_device() instead of cleanup_msix(). >> >> Isn't this rather an omission in an earlier change, and hence may want to >> come separately and with a Fixes: tag? > This is not really an omission, after all, all the cleanup hooks were > implemented at the end of this series. > And judging from the commit message(which was written by Roger in v8), Roger > also agreed to add them in this patch. I disagree. Of the two xfree()-s you remove here from vpci_deassign_device(), one should have been removed by patch 3 already. Which would require the part of the patch here to be put in place earlier on. Jan
Re: [PATCH v10 2/4] vpci/rebar: Free Rebar resources when init_rebar() fails
On 05.08.2025 05:49, Jiqian Chen wrote: > --- a/xen/drivers/vpci/rebar.c > +++ b/xen/drivers/vpci/rebar.c > @@ -49,6 +49,32 @@ static void cf_check rebar_ctrl_write(const struct pci_dev > *pdev, > bar->guest_addr = bar->addr; > } > > +static int cf_check cleanup_rebar(const struct pci_dev *pdev) > +{ > +int rc; > +uint32_t ctrl; > +unsigned int nbars; > +unsigned int rebar_offset = pci_find_ext_capability(pdev->sbdf, > + > PCI_EXT_CAP_ID_REBAR); > + > +if ( !rebar_offset || !is_hardware_domain(pdev->domain) ) > +{ > +ASSERT_UNREACHABLE(); > +return 0; > +} > + > +ctrl = pci_conf_read32(pdev->sbdf, rebar_offset + PCI_REBAR_CTRL(0)); > +nbars = MASK_EXTR(ctrl, PCI_REBAR_CTRL_NBAR_MASK); > + > +rc = vpci_remove_registers(pdev->vpci, rebar_offset + PCI_REBAR_CAP(0), > + PCI_REBAR_CTRL(nbars - 1)); > +if ( rc ) > +printk(XENLOG_ERR "%pd %pp: fail to remove Rebar handlers rc=%d\n", > + pdev->domain, &pdev->sbdf, rc); MSI and MSI-X (now) have ASSERT_UNREACHABLE() on their respective paths. Is there a reason this shouldn't be done here as well? MSI and MSI-X further have another add-register below here, to ensure the control register cannot be written. Again - is there a reason the same shouldn't be done here? (If so, I think this may want putting in a comment.) Jan
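For the first point, aligning cleanup_rebar() with the MSI/MSI-X error handling would mean something along these lines (a sketch only, not a proposed hunk):

    rc = vpci_remove_registers(pdev->vpci, rebar_offset + PCI_REBAR_CAP(0),
                               PCI_REBAR_CTRL(nbars - 1));
    if ( rc )
    {
        printk(XENLOG_ERR "%pd %pp: fail to remove Rebar handlers rc=%d\n",
               pdev->domain, &pdev->sbdf, rc);
        /* Removal of registers added by init_rebar() is not expected to fail. */
        ASSERT_UNREACHABLE();
        return rc;
    }

Whether a read-only handler for the per-BAR control registers should then also be re-added is the second, still open question.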
Re: [PATCH v10 4/4] vpci/msix: Free MSIX resources when init_msix() fails
On 05.08.2025 05:49, Jiqian Chen wrote: > --- a/xen/drivers/vpci/msix.c > +++ b/xen/drivers/vpci/msix.c > @@ -655,6 +655,48 @@ int vpci_make_msix_hole(const struct pci_dev *pdev) > return 0; > } > > +static int cf_check cleanup_msix(const struct pci_dev *pdev) > +{ > +int rc; > +struct vpci *vpci = pdev->vpci; > +const unsigned int msix_pos = pdev->msix_pos; > + > +if ( !msix_pos ) > +return 0; > + > +rc = vpci_remove_registers(vpci, msix_control_reg(msix_pos), 2); > +if ( rc ) > +{ > +printk(XENLOG_ERR "%pd %pp: fail to remove MSIX handlers rc=%d\n", > + pdev->domain, &pdev->sbdf, rc); > +ASSERT_UNREACHABLE(); > +return rc; > +} > + > +if ( vpci->msix ) > +{ > +list_del(&vpci->msix->next); > +for ( unsigned int i = 0; i < ARRAY_SIZE(vpci->msix->table); i++ ) > +if ( vpci->msix->table[i] ) > +iounmap(vpci->msix->table[i]); > + > +XFREE(vpci->msix); > +} > + > +/* > + * The driver may not traverse the capability list and think device > + * supports MSIX by default. So here let the control register of MSIX > + * be Read-Only is to ensure MSIX disabled. > + */ > +rc = vpci_add_register(vpci, vpci_hw_read16, NULL, > + msix_control_reg(msix_pos), 2, NULL); > +if ( rc ) > +printk(XENLOG_ERR "%pd %pp: fail to add MSIX ctrl handler rc=%d\n", > + pdev->domain, &pdev->sbdf, rc); Here as well as for MSI: Wouldn't this better be limited to the init-failure case? No point in adding a register hook (and possibly emitting a misleading log message) when we're tearing down anyway. IOW I think the ->cleanup() hook needs a boolean parameter, unless the distinction of the two cases can be (reliably) inferred from some other property. > --- a/xen/drivers/vpci/vpci.c > +++ b/xen/drivers/vpci/vpci.c > @@ -321,6 +321,27 @@ void vpci_deassign_device(struct pci_dev *pdev) > &pdev->domain->vpci_dev_assigned_map); > #endif > > +for ( i = 0; i < NUM_VPCI_INIT; i++ ) > +{ > +const vpci_capability_t *capability = &__start_vpci_array[i]; > +const unsigned int cap = capability->id; > +unsigned int pos = 0; > + > +if ( !capability->is_ext ) > +pos = pci_find_cap_offset(pdev->sbdf, cap); > +else if ( is_hardware_domain(pdev->domain) ) > +pos = pci_find_ext_capability(pdev->sbdf, cap); > + > +if ( pos && capability->cleanup ) > +{ > +int rc = capability->cleanup(pdev); > +if ( rc ) > +printk(XENLOG_ERR "%pd %pp: clean %s cap %u fail rc=%d\n", > + pdev->domain, &pdev->sbdf, > + capability->is_ext ? "extended" : "legacy", cap, rc); > +} > +} With this imo the patch subject is now wrong, too. Jan
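A possible shape for the hook adjustment being asked for, as a sketch only (the parameter name and call-site behaviour are assumptions, not part of the posted series):

static int cf_check cleanup_msix(const struct pci_dev *pdev, bool init_failed)
{
    struct vpci *vpci = pdev->vpci;
    const unsigned int msix_pos = pdev->msix_pos;
    int rc;

    if ( !msix_pos )
        return 0;

    rc = vpci_remove_registers(vpci, msix_control_reg(msix_pos), 2);
    if ( rc )
        return rc;

    /* ... free vpci->msix and unmap the tables as in the patch above ... */

    /* Only re-trap the control register when recovering from a failed init. */
    if ( !init_failed )
        return 0;

    return vpci_add_register(vpci, vpci_hw_read16, NULL,
                             msix_control_reg(msix_pos), 2, NULL);
}

vpci_deassign_device() would then pass false, while the init-failure path in init_msix() passes true.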
Re: [PATCH] misra: remove default case in single-clause switch
On 04.08.2025 19:33, Dmytro Prokopchuk1 wrote: > MISRA Rule 16.4: Every switch statement shall have a default label. > The default clause must contain either a statement or a comment > prior to its terminating break statement. > > However, there is a documented deviation for this rule in Xen: > 'docs/misra/deviations.rst': > * - R16.4 > - A switch statement with a single clause and no default label > may replace an equivalent if statement to improve readability. > - Tagged as `deliberate` for ECLAIR. > > This change removes empty default cases in single-clause switches > to avoid violations of the rule where the `default` clause lacks > a suitable comment or statement. > > Signed-off-by: Dmytro Prokopchuk It's all CPU notifiers that you alter, and iirc the outcome of earlier discussion was that particularly for those we _want_ to add commentary, clarifying why only the given subset of notification need handling in the particular case. It may also well be that the (at least) one case of the possibly missing handling of some other notification still is unaddressed (and might hence be wrongly hidden by the adjustment done here, if it's in one of the function that are being altered). Jan
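For reference, the kind of commentary being asked for in those CPU notifiers would look roughly like this (a sketch; the handled notification and the helper name init_percpu_area() are made up for illustration):

    switch ( action )
    {
    case CPU_UP_PREPARE:
        rc = init_percpu_area(cpu);
        break;

    default:
        /*
         * CPU_DEAD and the other notifications deliberately need no handling
         * here: per-CPU data is kept so a later CPU_UP_PREPARE can reuse it.
         */
        break;
    }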
Re: [PATCH v3 05/20] xen/riscv: construct the P2M pages pool for guests
On 31.07.2025 17:58, Oleksii Kurochko wrote: > @@ -30,3 +34,18 @@ int p2m_init(struct domain *d) > > return 0; > } > + > +/* > + * Set the pool of pages to the required number of pages. > + * Returns 0 for success, non-zero for failure. > + * Call with d->arch.paging.lock held. > + */ > +int p2m_set_allocation(struct domain *d, unsigned long pages, bool > *preempted) Noticed only when looking at the subsequent patch: With this being ... > +{ > +int rc; > + > +if ( (rc = paging_freelist_init(d, pages, preempted)) ) ... a caller of this function, the "init" in the name feels wrong. Jan
Re: [PATCH v3 06/20] xen/riscv: add root page table allocation
On 31.07.2025 17:58, Oleksii Kurochko wrote: > Introduce support for allocating and initializing the root page table > required for RISC-V stage-2 address translation. > > To implement root page table allocation the following is introduced: > - p2m_get_clean_page() and p2m_alloc_root_table(), p2m_allocate_root() > helpers to allocate and zero a 16 KiB root page table, as mandated > by the RISC-V privileged specification for Sv32x4/Sv39x4/Sv48x4/Sv57x4 > modes. > - Update p2m_init() to inititialize p2m_root_order. > - Add maddr_to_page() and page_to_maddr() macros for easier address > manipulation. > - Introduce paging_ret_pages_to_domheap() to return some pages before > allocate 16 KiB pages for root page table. > - Allocate root p2m table after p2m pool is initialized. > - Add construct_hgatp() to construct the hgatp register value based on > p2m->root, p2m->hgatp_mode and VMID. Imo for this to be complete, freeing of the root table also wants taking care of. Much like imo p2m_init() would better immediately be accompanied by the respective teardown function. Once you start using them, you want to use them in pairs, after all. > --- a/xen/arch/riscv/include/asm/riscv_encoding.h > +++ b/xen/arch/riscv/include/asm/riscv_encoding.h > @@ -133,11 +133,13 @@ > #define HGATP_MODE_SV48X4_UL(9) > > #define HGATP32_MODE_SHIFT 31 > +#define HGATP32_MODE_MASK_UL(0x8000) > #define HGATP32_VMID_SHIFT 22 > #define HGATP32_VMID_MASK_UL(0x1FC0) > #define HGATP32_PPN _UL(0x003F) > > #define HGATP64_MODE_SHIFT 60 > +#define HGATP64_MODE_MASK_ULL(0xF000) > #define HGATP64_VMID_SHIFT 44 > #define HGATP64_VMID_MASK_ULL(0x03FFF000) > #define HGATP64_PPN _ULL(0x0FFF) > @@ -170,6 +172,7 @@ > #define HGATP_VMID_SHIFT HGATP64_VMID_SHIFT > #define HGATP_VMID_MASK HGATP64_VMID_MASK > #define HGATP_MODE_SHIFT HGATP64_MODE_SHIFT > +#define HGATP_MODE_MASK HGATP64_MODE_MASK > #else > #define MSTATUS_SD MSTATUS32_SD > #define SSTATUS_SD SSTATUS32_SD > @@ -181,8 +184,11 @@ > #define HGATP_VMID_SHIFT HGATP32_VMID_SHIFT > #define HGATP_VMID_MASK HGATP32_VMID_MASK > #define HGATP_MODE_SHIFT HGATP32_MODE_SHIFT > +#define HGATP_MODE_MASK HGATP32_MODE_MASK > #endif > > +#define GUEST_ROOT_PAGE_TABLE_SIZE KB(16) In another context I already mentioned that imo you want to be careful with the use of "guest" in identifiers. It's not the guest page tables which have an order-2 root table, but the P2M (Xen terminology) or G-stage / second stage (RISC-V spec terminology) ones. As long as you're only doing P2M work, this may not look significant. But once you actually start dealing with guest page tables, it easily can end up confusing. > --- a/xen/arch/riscv/p2m.c > +++ b/xen/arch/riscv/p2m.c > @@ -1,8 +1,86 @@ > +#include > #include > #include > #include > > #include > +#include > +#include > + > +unsigned int __read_mostly p2m_root_order; If this is to be a variable at all, it ought to be __ro_after_init, and hence it shouldn't be written every time p2m_init() is run. If you want to to remain as a variable, what's wrong with const unsigned int p2m_root_order = ilog2(GUEST_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT; or some such? But of course equally well you could have #define P2M_ROOT_ORDER (ilog2(GUEST_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT) > +static void clear_and_clean_page(struct page_info *page) > +{ > +clear_domain_page(page_to_mfn(page)); > + > +/* > + * If the IOMMU doesn't support coherent walks and the p2m tables are > + * shared between the CPU and IOMMU, it is necessary to clean the > + * d-cache. > + */ That is, ... 
> +clean_dcache_va_range(page, PAGE_SIZE); ... this call really wants to be conditional? > +} > + > +static struct page_info *p2m_allocate_root(struct domain *d) With there also being p2m_alloc_root_table() and with that being the sole caller of the function here, I wonder: Is having this in a separate function really outweighing the possible confusion of which of the two functions to use? > +{ > +struct page_info *page; > + > +/* > + * As mentioned in the Priviliged Architecture Spec (version 20240411) > + * in Section 18.5.1, for the paged virtual-memory schemes (Sv32x4, > + * Sv39x4, Sv48x4, and Sv57x4), the root page table is 16 KiB and must > + * be aligned to a 16-KiB boundary. > + */ > +page = alloc_domheap_pages(d, P2M_ROOT_ORDER, MEMF_no_owner); > +if ( !page ) > +return NULL; > + > +for ( unsigned int i = 0; i < P2M_ROOT_PAGES; i++ ) > +clear_and_clean_page(page + i); > + > +return page; > +} > + > +unsigned long construct_hgatp
Re: [PATCH v3 06/20] xen/riscv: add root page table allocation
On 31.07.2025 17:58, Oleksii Kurochko wrote: > +static int p2m_alloc_root_table(struct p2m_domain *p2m) > +{ > +struct domain *d = p2m->domain; > +struct page_info *page; > +const unsigned int nr_root_pages = P2M_ROOT_PAGES; > + > +/* > + * Return back nr_root_pages to assure the root table memory is also > + * accounted against the P2M pool of the domain. > + */ > +if ( !paging_ret_pages_to_domheap(d, nr_root_pages) ) > +return -ENOMEM; > + > +page = p2m_allocate_root(d); > +if ( !page ) > +return -ENOMEM; > + > +p2m->root = page; > + > +return 0; > +} In the success case, shouldn't you bump the paging pool's total_pages by P2M_ROOT_PAGES? (As the freeing side is missing so far, it's not easy to tell whether there's [going to be] a balancing problem in the long run. In the short run there certainly is.) Jan
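The accounting adjustment being suggested would look roughly like this (a sketch; the p2m_total_pages field name is assumed by analogy with the Arm implementation, not taken from this series):

    page = p2m_allocate_root(d);
    if ( !page )
        return -ENOMEM;

    /*
     * The pages handed back to the domheap above are now consumed by the
     * root table again, so account them against the P2M pool's total.
     */
    d->arch.paging.p2m_total_pages += P2M_ROOT_PAGES;

    p2m->root = page;

    return 0;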
Re: [PATCH v3 10/20] xen/riscv: introduce page_{get,set}_xenheap_gfn()
On 31.07.2025 17:58, Oleksii Kurochko wrote: > --- a/xen/arch/riscv/include/asm/mm.h > +++ b/xen/arch/riscv/include/asm/mm.h > @@ -12,6 +12,7 @@ > #include > #include > > +#include > #include > #include > > @@ -247,9 +248,17 @@ static inline bool arch_mfns_in_directmap(unsigned long > mfn, unsigned long nr) > #define PGT_writable_page PG_mask(1, 1) /* has writable mappings? */ > #define PGT_type_mask PG_mask(1, 1) /* Bits 31 or 63. */ > > -/* Count of uses of this frame as its current type. */ > -#define PGT_count_width PG_shift(2) > -#define PGT_count_mask((1UL << PGT_count_width) - 1) > + /* 9-bit count of uses of this frame as its current type. */ Nit: Stray blank at start of line. > +#define PGT_count_maskPG_mask(0x3FF, 10) A 9-bit count corresponds to a mask of 0x1ff, doesn't it? With 0x3ff the count can spill over the type. Jan
Re: [PATCH v13 2/3] tools/tests: introduce unit tests for domain ID allocator
On Wed, Jul 30, 2025 at 05:41:00PM +, dm...@proton.me wrote: > From: Denis Mukhin > > Introduce some basic infrastructure for doing domain ID allocation unit tests, > and add a few tests that ensure correctness of the domain ID allocator. > > Signed-off-by: Denis Mukhin > --- > Changes since v12: > - fixed Makefile > - dropped unused symbols/includes from the test harness header > - s/printk/printf/g in the test code > --- > tools/tests/Makefile | 2 +- > tools/tests/domid/.gitignore | 2 + > tools/tests/domid/Makefile | 48 ++ > tools/tests/domid/include/xen/domain.h | 126 + > tools/tests/domid/test-domid.c | 78 +++ > 5 files changed, 255 insertions(+), 1 deletion(-) > create mode 100644 tools/tests/domid/.gitignore > create mode 100644 tools/tests/domid/Makefile > create mode 100644 tools/tests/domid/include/xen/domain.h > create mode 100644 tools/tests/domid/test-domid.c > > diff --git a/tools/tests/Makefile b/tools/tests/Makefile > index 36928676a666..ff1666425436 100644 > --- a/tools/tests/Makefile > +++ b/tools/tests/Makefile > @@ -1,7 +1,7 @@ > XEN_ROOT = $(CURDIR)/../.. > include $(XEN_ROOT)/tools/Rules.mk > > -SUBDIRS-y := > +SUBDIRS-y := domid > SUBDIRS-y += resource > SUBDIRS-$(CONFIG_X86) += cpu-policy > SUBDIRS-$(CONFIG_X86) += tsx > diff --git a/tools/tests/domid/.gitignore b/tools/tests/domid/.gitignore > new file mode 100644 > index ..70e306b3c074 > --- /dev/null > +++ b/tools/tests/domid/.gitignore > @@ -0,0 +1,2 @@ > +*.o > +test-domid > diff --git a/tools/tests/domid/Makefile b/tools/tests/domid/Makefile > new file mode 100644 > index ..08fbad096aec > --- /dev/null > +++ b/tools/tests/domid/Makefile > @@ -0,0 +1,48 @@ > +# SPDX-License-Identifier: GPL-2.0-only > +# > +# Unit tests for domain ID allocator. > +# > +# Copyright 2025 Ford Motor Company > + > +XEN_ROOT=$(CURDIR)/../../.. > +include $(XEN_ROOT)/tools/Rules.mk > + > +TESTS := test-domid > + > +vpath domid.c $(XEN_ROOT)/xen/common/ > + > +.PHONY: all > +all: $(TESTS) > + > +.PHONY: run > +run: $(TESTS) > + $(foreach t,$(TESTS),./$(t);) > + > +.PHONY: clean > +clean: > + $(RM) -- *.o $(TESTS) $(DEPS_RM) > + > +.PHONY: distclean > +distclean: clean > + $(RM) -- *~ > + > +.PHONY: install > +install: all > + $(INSTALL_DIR) $(DESTDIR)$(LIBEXEC)/tests > + $(INSTALL_PROG) test-domid $(DESTDIR)$(LIBEXEC)/tests > + > +.PHONY: uninstall > +uninstall: > + $(RM) -- $(DESTDIR)$(LIBEXEC)/tests/test-domid > + > +CFLAGS += -D__XEN_TOOLS__ > +CFLAGS += $(APPEND_CFLAGS) > +CFLAGS += $(CFLAGS_xeninclude) > +CFLAGS += -I./include/ > + > +LDFLAGS += $(APPEND_LDFLAGS) > + > +test-domid: domid.o test-domid.o > + $(CC) $^ -o $@ $(LDFLAGS) > + > +-include $(DEPS_INCLUDE) > diff --git a/tools/tests/domid/include/xen/domain.h > b/tools/tests/domid/include/xen/domain.h > new file mode 100644 > index ..e5db0235445e > --- /dev/null > +++ b/tools/tests/domid/include/xen/domain.h > @@ -0,0 +1,126 @@ > +/* SPDX-License-Identifier: GPL-2.0-only */ > +/* > + * Unit test harness for domain ID allocator. > + * > + * Copyright 2025 Ford Motor Company > + */ > + > +#ifndef _TEST_HARNESS_ > +#define _TEST_HARNESS_ > + > +#include > +#include > +#include > +#include > + > +#include > + > +#define BUG_ON(x) assert(!(x)) > +#define ASSERT(x) assert(x) > + > +#define DOMID_FIRST_RESERVED(10) > +#define DOMID_INVALID (11) > + > +#define DEFINE_SPINLOCK(x) unsigned long *(x) I think this shouldn't be a pointer? As you otherwise trigger a NULL pointer dereference in the increases and decreases done below? 
> +#define spin_lock(x)((*(x))++) > +#define spin_unlock(x) ((*(x))--) FWIW, I would use a plain bool: #define DEFINE_SPINLOCK(l) bool l #define spin_lock(l)(*(l) = true) #define spin_unlock(l) (*(l) = false) As you don't expect concurrency tests, you could even assert the lock is in the expected state before taking/releasing it. > + > +#define printk printf > + > +#define BITS_PER_LONG sizeof(unsigned long) That's BYTES_PER_LONG, BITS_PER_LONG would be (sizeof(unsigned long) * 8). > +#define BITS_PER_WORD (8U * BITS_PER_LONG) > +#define BITS_TO_LONGS(bits) \ > +(((bits) + BITS_PER_LONG - 1) / BITS_PER_LONG) > +#define DECLARE_BITMAP(name, bits) \ > +unsigned long name[BITS_TO_LONGS(bits)] > + > +static inline int __test_and_set_bit(unsigned int nr, unsigned long *addr) > +{ > +unsigned long mask = 1UL << (nr % BITS_PER_WORD); > +unsigned long *p = addr + (nr / BITS_PER_WORD); > +int old = (*p & mask) != 0; > + > +*p |= mask; > + > +return old; > +} > + > +static inline int __test_and_clear_bit(unsigned int nr, unsigned long *addr) > +{ > +unsigned long mask = 1UL << (nr % BITS_PER_WORD); > +unsigned long *p = addr + (nr / BITS_P
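With both remarks applied, the relevant harness definitions might end up along these lines (a sketch only, not the actual v13 change):

#define DEFINE_SPINLOCK(l)  bool l
#define spin_lock(l)        (*(l) = true)
#define spin_unlock(l)      (*(l) = false)

#define BYTES_PER_LONG      sizeof(unsigned long)
#define BITS_PER_LONG       (BYTES_PER_LONG * 8)
#define BITS_TO_LONGS(bits) \
    (((bits) + BITS_PER_LONG - 1) / BITS_PER_LONG)

static inline int __test_and_set_bit(unsigned int nr, unsigned long *addr)
{
    unsigned long mask = 1UL << (nr % BITS_PER_LONG);
    unsigned long *p = addr + (nr / BITS_PER_LONG);
    int old = (*p & mask) != 0;

    *p |= mask;
    return old;
}

With BITS_PER_LONG holding the bit count, the separate BITS_PER_WORD macro becomes unnecessary.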
Re: [PATCH v4 4/8] pdx: allow per-arch optimization of PDX conversion helpers
On Tue, Aug 05, 2025 at 02:11:22PM +0200, Jan Beulich wrote: > On 05.08.2025 11:52, Roger Pau Monne wrote: > > There are four performance critical PDX conversion helpers that do the PFN > > to/from PDX and the physical addresses to/from directmap offsets > > translations. > > > > In the absence of an active PDX compression, those functions would still do > > the calculations needed, just to return the same input value as no > > translation is in place and hence PFN and PDX spaces are identity mapped. > > > > To reduce the overhead of having to do the pointless calculations allow > > architectures to implement the translation helpers in a per-arch header. > > Rename the existing conversion functions to add a trailing _xlate suffix, > > so that the per-arch headers can define the non suffixed versions. > > > > Currently only x86 implements meaningful custom handlers to short circuit > > the translation when not active, using asm goto. Other architectures use > > generic macros that map the non-xlate to the xlate variants to keep the > > previous behavior. > > > > Signed-off-by: Roger Pau Monné > > Once again: > Reviewed-by: Jan Beulich Thanks, I didn't carry your RB due to the changes requested by Andrew, that was a bit too much to carry a RB after those. Roger.
Re: [PATCH v4 4/8] pdx: allow per-arch optimization of PDX conversion helpers
On 05.08.2025 11:52, Roger Pau Monne wrote: > There are four performance critical PDX conversion helpers that do the PFN > to/from PDX and the physical addresses to/from directmap offsets > translations. > > In the absence of an active PDX compression, those functions would still do > the calculations needed, just to return the same input value as no > translation is in place and hence PFN and PDX spaces are identity mapped. > > To reduce the overhead of having to do the pointless calculations allow > architectures to implement the translation helpers in a per-arch header. > Rename the existing conversion functions to add a trailing _xlate suffix, > so that the per-arch headers can define the non suffixed versions. > > Currently only x86 implements meaningful custom handlers to short circuit > the translation when not active, using asm goto. Other architectures use > generic macros that map the non-xlate to the xlate variants to keep the > previous behavior. > > Signed-off-by: Roger Pau Monné Once again: Reviewed-by: Jan Beulich Jan
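For the architectures keeping the previous behaviour, the generic mapping described above amounts to little more than the following (a sketch; the directmap-offset helper names are assumed from the description):

/* Fallback when the architecture provides no optimized helpers. */
#ifndef pfn_to_pdx
# define pfn_to_pdx(pfn)             pfn_to_pdx_xlate(pfn)
# define pdx_to_pfn(pdx)             pdx_to_pfn_xlate(pdx)
# define maddr_to_directmapoff(ma)   maddr_to_directmapoff_xlate(ma)
# define directmapoff_to_maddr(off)  directmapoff_to_maddr_xlate(off)
#endif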
Re: [PATCH] misra: fix violations in macros GVA_INFO, TRACE_TIME
On 2025-08-05 13:49, Dmytro Prokopchuk1 wrote: On 7/31/25 19:09, Nicola Vetrini wrote: On 2025-07-31 18:05, Andrew Cooper wrote: On 31/07/2025 4:58 pm, Jan Beulich wrote: On 31.07.2025 17:37, Andrew Cooper wrote: On 31/07/2025 4:16 pm, Dmytro Prokopchuk1 wrote: MISRA Rule 13.1: Initializer lists shall not contain persistent side effects. The violations occur because both the `GVA_INFO` and `TRACE_TIME` macro expansions include expressions with persistent side effects introduced via inline assembly. In the case of `GVA_INFO`, the issue stems from the initializer list containing a direct call to `current`, which evaluates to `this_cpu(curr_vcpu)` and involves persistent side effects via the `asm` statement. To resolve this, the side-effect-producing expression is computed in a separate statement prior to the macro initialization: struct vcpu *current_vcpu = current; The computed value is passed into the `GVA_INFO(current_vcpu)` macro, ensuring that the initializer is clean and free of such side effects. Similarly, the `TRACE_TIME` macro violates this rule when accessing expressions like `current->vcpu_id` and `current->domain->domain_id`, which also depend on `current` and inline assembly. To fix this, the value of `current` is assigned to a temporary variable: struct vcpu *v = current; This temporary variable is then used to access `domain_id` and `vcpu_id`. This ensures that the arguments passed to the `TRACE_TIME` macro are simple expressions free of persistent side effects. Signed-off-by: Dmytro Prokopchuk The macro `current` specifically does not (and must not) have side effects. It is expected to behave like a plain `struct vcpu *current;` variable, and what Eclair is noticing is the thread-local machinery under this_cpu() (or in x86's case, get_current()). In ARM's case, it's literally reading the hardware thread pointer register. Can anything be done to tell Eclair that `this_cpu()` specifically does not have side effects? The only reason that GVA_INFO() and TRACE_TIME() are picked out is because they both contain embedded structure initialisation, and this is is actually an example where trying to comply with MISRA interferes with what is otherwise a standard pattern in Xen. Irrespective of what you say, some of the changes here were eliminating multiple adjacent uses of current, which - iirc - often the compiler can't fold via CSE. Where we have mixed usage, sure. (I'm sure I've got a branch somewhere trying to add some more pure/const around to try and help out here, but I can't find it, and don't recall it being a major improvement either.) The real problem here is that there are a *very few* number of contexts where Eclair refuses to tolerate the use of `current` citing side effects, despite there being no side effects. That is the thing that breaks the principle of least surprise, and we ought to fix it by making Eclair happy with `current` everywhere, rather than force people to learn that 2 macros can't have a `current` in their parameter list. I'll take a look. Likely yes, by adding a handful of properties. There are subtleties, though. Hi, Nicola. Did you have a chance to try configure Eclair to ignore this macro `this_cpu()`? Hi Dmytro, I'm on it, I needed to handle other tasks first. Thanks. Dmytro -- Nicola Vetrini, B.Sc. Software Engineer BUGSENG (https://bugseng.com) LinkedIn: https://www.linkedin.com/in/nicola-vetrini-a42471253
Re: [PATCH] x86/HVM: polish hvm_asid_init() a little
On 2025-08-04 11:41, Jan Beulich wrote: While the logic there covers asymmetric cases, the resulting log messages would likely raise more confusion than clarify anything. Split the BSP action from the AP one, indicating the odd CPU in the AP log message, thus avoiding the impression that global state would have changed. While there also - move g_disabled into .data.ro_after_init; only the BSP will ever write to it, - make the function's parameter unsigned; no negative values may be passed in. Also reflect this in svm_asid_init(). Signed-off-by: Jan Beulich Reviewed-by: Jason Andryuk
Re: [PATCH] console: make printk_ratelimit_{burst,ms} const
On 2025-08-01 03:30, Jan Beulich wrote: Them not being altered by any means, their __read_mostly attribute is actually counter-productive: It causes the compiler to instantiate the variables, when already with just the attributes dropped the compiler can constant-propagate the values into the sole use site. Make the situation yet more explicit by adding const. Also switch the variables away from being plain int, and have the parameters of __printk_ratelimit() follow suit. While there also similarly adjust the type of "missed" and "lost", and - while touching the adjacent line - increase lost_str[] to accommodate any unsigned 32-bit number. Fixes: a8b1845a7845 ("Miscellaneous data placement adjustments") Signed-off-by: Jan Beulich Reviewed-by: Jason Andryuk
Re: [PATCH 1/2] efi: Call FreePages only if needed
On 05/08/2025 5:32 pm, Ross Lagerwall wrote: > If the config file is builtin, cfg.addr will be zero but Xen > unconditionally calls FreePages() on the address. > > Xen may also call FreePages() with a zero address if blexit() is called > after this point since cfg.need_to_free is not set to false. > > The UEFI specification does not say whether calling FreePages() with a > zero address is allowed so let's be cautious and use cfg.need_to_free > properly. > > Signed-off-by: Ross Lagerwall Acked-by: Andrew Cooper
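The fix presumably reduces to guarding the call and clearing the flag, along these lines (a sketch based on the description above, not the actual hunk):

    if ( cfg.need_to_free )
    {
        efi_bs->FreePages(cfg.addr, PFN_UP(cfg.size));
        cfg.need_to_free = false;
    }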
Xen 4.21 Development Update [June-July]
Hello everyone, This email only tracks big items for xen.git tree. Please reply for items you would like to see in 4.21 so that people have an idea what is going on and prioritise accordingly. You're welcome to provide description and use cases of the feature you're working on. = Timeline = The current release schedule could be found here: https://wiki.xenproject.org/wiki/Xen_Project_X.YY_Release_Notes And as a reminder I would like to remind at the of this week we will have Last posting date (Fri Aug 08, 2025). = Updates = The following items ( the links for them could be found int the list below ) were moved to completed: [since Jun2 - Aug5]: Added some tags: [4.21], [next-rel(s)] to the list "Full list of items" below. * x86: - kexec: add kexec support to Mini-OS. - x86: memcpy() / memset() (non-)ERMS flavors plus fallout * Arm: - SMMU handling for PCIe Passthrough on ARM. - Add support for R-Car Gen4 PCI host controller. - First chunk for Arm R82 and MPU support. - Enable R52 support for the first chunk of MPU support - ARM split hardware and control domains. * RISC-V: - Introduce basic UART support and interrupts for hypervisor mode. [since May 6 - Jun2]: * Hypervisor: - tools: remove qemu-traditional * Arm: - PCI devices passthrough on Arm, part 3 * x86: - xen: cache control improvements [since 4.20 relese - May 6]: * Hypervisor: - Move parts of Arm's Dom0less to common code - remove libxenctrl usage from xenstored * Arm: - Enable early bootup of Armv8-R AArch32 systems * x86: - x86/HVM: emulation (MMIO) improvements * RISC-V: - RISC-V some preinit calls. - Fixes for UBSAN & GCOV support for RISC-V. Some new items added: [since May] * x86: - Allow x86 to unflatten DTs - hyperlaunch: move remaining pvh dom0 construction - x86/hyperlaunch: introduce concept of core domains - Confidential computing and AMD SEV support * Arm: - SMMU handling for PCIe Passthrough on ARM - xen/arm: scmi: introduce SCI SCMI SMC multi-agent support - Add initial Xen Suspend-to-RAM support on ARM64 * RISC-V: - introduce p2m functionality [since 4.20 release] * Hypervisor: - tools: remove qemu-traditional - Physical address hypercall ABI ("HVMv2") - xen: Untangle mm.h - xen: introduce CONFIG_SYSCTL - Add support for exact-node memory claims - Several CI cleanups and improvements, plus yet another new runner * x86: - x86/EFI: prevent write-execute sections - x86: Trenchboot Secure Launch DRTM (Xen) - Hyperlaunch device tree for dom0 (v6) - amd-cppc CPU Performance Scaling Driver (v4) - Hyperlaunch domain builder - kexec: add kexec support to Mini-OS - xen: cache control improvements (should be moved to "Hypervisor"?) 
- x86: generate xen.efi image with no write-execute sections - x86/asm: cleanups after toolchain baseline upgrade * Arm: - Add support for R-Car Gen4 PCI host controller (v4) - FF-A VM to VM support (v5) - First chunk for Arm R82 and MPU support (v4) - ARM split hardware and control domains (v5) - MPU mm subsistem skeleton * RISC-V: - introduce basic UART support and interrupts for hypervisor mode * Full list of items : * = Projects = == Hypervisor == * [4.21] xen/console: cleanup console input switch logic (v5) - Denis Mukhin - https://lore.kernel.org/xen-devel/20250530231841.73386-1-dmuk...@ford.com/ * [4.21] xen: introduce CONFIG_SYSCTL (v4 -> v8) - Penny Zheng - https://lore.kernel.org/xen-devel/20250711043158.2566880-1-penny.zh...@amd.com/ * [4.21] Several CI cleanups and improvements, plus yet another new runner - Marek Marczykowski-Górecki - https://lore.kernel.org/xen-devel/cover.7da1777882774486a13e6f39ff4a2096f6b7901e.1744028549.git-series.marma...@invisiblethingslab.com/ - https://patchew.org/Xen/cover.7da1777882774486a13e6f39ff4a2096f6b7901e.1744028549.git-series.marma...@invisiblethingslab.com/ * [4.21] automation: Refresh the remaining Debian containers (v2) - Javi Merino - https://lore.kernel.org/xen-devel/cover.1730743077.git.javi.mer...@cloud.com/T/#m5d9acb7cf5db3c2be3d6527de14b69b07812314e * [4.21] MSI-X support with qemu in stubdomain, and other related changes (v8) - Marek Marczykowski-Górecki - https://lore.kernel.org/xen-devel/cover.33fb4385b7dd6c53bda4acf0a9e91748b3d7b1f7.1715313192.git-series.marma...@invisiblethingslab.com/ - Only automation patch left to be reviewed/merged. * [next-rel(s)] Physical address hypercall ABI ("HVMv2") - Teddy Astie - https://lore.kernel.org/xen-devel/cover.1744981654.git.teddy.as...@vates.tech/ * [next-rel(s)] xen: Untangle mm.h - Andrew Cooper - https://lore.kernel.org/xen-devel/20250312174513.4075066-1-andrew.coop...@citrix.com/ - https://patchew.org/Xen/2
Re: [XEN][PATCH] xen/dom0less: arm: fix hwdom 1:1 low memory allocation
Hi Jason,

On 05.08.25 20:21, Jason Andryuk wrote:
> On 2025-08-01 11:54, Grygorii Strashko wrote:
>> From: Grygorii Strashko
>>
>> Call stack for dom0less hwdom case (1:1) memory:
>> create_domUs
>>  |-construct_domU
>>    |-construct_hwdom()
>>      |-allocate_memory_11()
>>
>> And allocate_memory_11() uses "dom0_mem" as:
>>   min_low_order = get_order_from_bytes(min_t(paddr_t, dom0_mem, MB(128)));
>>
>> In case of dom0less boot the "dom0_mem" is not used and defaulted to 0,
>> which causes min_low_order to get high value > order and so no
>> allocations happens from low memory.
>>
>> Fix it, by using kinfo->unassigned_mem instead of "dom0_mem" has correct
>> memory size in both cases: regular dom0 boot and dom0less boot.
>>
>> Fixes: 43afe6f030244 ("xen/common: dom0less: introduce common dom0less-build.c")
>
> I think I introduced this bug with the dom0less hwdom support, and the
> correct fixes is:
>
> Fixes: 52cb53f1816a ("xen/arm: dom0less hwdom construction")
>
>> Signed-off-by: Grygorii Strashko
>
> dom0_mem is also mentioned in the comment on line 252. With that changed
> as well:
>
> Reviewed-by: Jason Andryuk

Will something like below be ok?

 * We first allocate the largest allocation we can as low as we
 * can. This then becomes the first bank. This bank must be at least
- * 128MB (or dom0_mem if that is smaller).
+ * 128MB (or memory size requested for domain if that is smaller).

--
Best regards,
-grygorii
[PATCH v2] xen/dom0less: arm: fix hwdom 1:1 low memory allocation
From: Grygorii Strashko

Call stack for dom0less hwdom case (1:1) memory:
create_domUs
 |-construct_domU
   |-construct_hwdom()
     |-allocate_memory_11()

And allocate_memory_11() uses "dom0_mem" as:
  min_low_order = get_order_from_bytes(min_t(paddr_t, dom0_mem, MB(128)));

In case of dom0less boot the "dom0_mem" is not used and defaulted to 0,
which causes min_low_order to get high value > order and so no allocations
happens from low memory.

Fix it, by using kinfo->unassigned_mem instead of "dom0_mem" has correct
memory size in both cases: regular dom0 boot and dom0less boot.

Fixes: 52cb53f1816a ("xen/arm: dom0less hwdom construction")
Signed-off-by: Grygorii Strashko
Reviewed-by: Denis Mukhin
Reviewed-by: Jason Andryuk
---
Changes in v2:
- changed 'fixes' tag
- fixed comment for allocate_memory_11()
- added reviewer's tags

 xen/arch/arm/domain_build.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
index 463ae4474d30..a9e4153e3cf9 100644
--- a/xen/arch/arm/domain_build.c
+++ b/xen/arch/arm/domain_build.c
@@ -249,7 +249,7 @@ fail:
  *
  * We first allocate the largest allocation we can as low as we
  * can. This then becomes the first bank. This bank must be at least
- * 128MB (or dom0_mem if that is smaller).
+ * 128MB (or memory size requested for domain if that is smaller).
  *
  * Then we start allocating more memory, trying to allocate the
  * largest possible size and trying smaller sizes until we
@@ -278,7 +278,7 @@ static void __init allocate_memory_11(struct domain *d,
                                       struct kernel_info *kinfo)
 {
     const unsigned int min_low_order =
-        get_order_from_bytes(min_t(paddr_t, dom0_mem, MB(128)));
+        get_order_from_bytes(min_t(paddr_t, kinfo->unassigned_mem, MB(128)));
     const unsigned int min_order = get_order_from_bytes(MB(4));
     struct membanks *mem = kernel_info_get_mem(kinfo);
     struct page_info *pg;
--
2.34.1
Re: [PATCH v1 01/25] xen/x86: move domctl.o out of PV_SHIM_EXCLUSIVE
On 05.08.2025 05:38, Penny, Zheng wrote: > [Public] > >> -Original Message- >> From: Jan Beulich >> Sent: Monday, August 4, 2025 3:43 PM >> To: Penny, Zheng >> Cc: Huang, Ray ; Andrew Cooper >> ; Roger Pau Monné ; >> Anthony PERARD ; Orzel, Michal >> ; Julien Grall ; Stefano Stabellini >> ; xen-devel@lists.xenproject.org >> Subject: Re: [PATCH v1 01/25] xen/x86: move domctl.o out of >> PV_SHIM_EXCLUSIVE >> >> On 03.08.2025 11:47, Penny Zheng wrote: >>> In order to fix CI error of a randconfig picking both >>> PV_SHIM_EXCLUSIVE=y and HVM=y results in hvm.c being built, but >>> domctl.c not being built, which leaves a few functions, like >>> domctl_lock_acquire/release() undefined, causing linking to fail. >>> To fix that, we intend to move domctl.o out of the PV_SHIM_EXCLUSIVE >>> Makefile /hypercall-defs section, with this adjustment, we also need >>> to release redundant vnuma_destroy() stub definition and paging_domctl >>> hypercall-defs from PV_SHIM_EXCLUSIVE guardian, to not break >>> compilation Above change will leave dead code in the shim binary >>> temporarily and will be fixed with the introduction of CONFIG_DOMCTL. >>> >>> Fixes: 568f806cba4c ("xen/x86: remove "depends on >>> !PV_SHIM_EXCLUSIVE"") >>> Reported-by: Jan Beulich >>> Signed-off-by: Penny Zheng >>> --- >>> v1 -> v2: >>> - remove paging_domctl hypercall-defs >> >> And you've run this through a full round of testing this time, in isolation? > > This commit shall be committed together with "xen/x86: complement > PG_log_dirty wrapping", (I've added in change log, idk why it didn't get > delivered in the mail list in the last). And "committed together" still means the two at least build okay independently (i.e. allowing the build-each-commit job to succeed)? As to the missing indication thereof in the submission: Patch 01 has a revlog, so if anything was missing there you must have added it some other way. But the cover letter is lacking that information as well. (As indicated earlier, to increase the chance of such a remark actually being noticed, it's best put in both places.) > As PG_log_dirty is disabled on PV mode, paging_domctl() will still have > "undefined reference" on PV mode, which gets fixed in "xen/x86: complement > PG_log_dirty wrapping". I thought it better sits there. > If it doesn't comply with the commit policy, I'll move according fix here. Let me post a pair of patches dealing with part of the problem, in an imo (longer term) more desirable way. Jan
Re: [PATCH v6 11/19] xen/x86: introduce "cpufreq=amd-cppc" xen cmdline and amd-cppc driver
On 05.08.2025 08:31, Penny, Zheng wrote: > [Public] > >> -Original Message- >> From: Jan Beulich >> Sent: Monday, August 4, 2025 4:48 PM >> To: Penny, Zheng >> Cc: Huang, Ray ; Andrew Cooper >> ; Anthony PERARD ; >> Orzel, Michal ; Julien Grall ; Roger >> Pau >> Monné ; Stefano Stabellini ; >> xen- >> de...@lists.xenproject.org >> Subject: Re: [PATCH v6 11/19] xen/x86: introduce "cpufreq=amd-cppc" xen >> cmdline >> and amd-cppc driver >> >> On 04.08.2025 10:09, Penny, Zheng wrote: >>> [Public] >>> -Original Message- From: Jan Beulich Sent: Thursday, July 17, 2025 12:00 AM To: Penny, Zheng Cc: Huang, Ray ; Andrew Cooper ; Anthony PERARD ; Orzel, Michal ; Julien Grall ; Roger Pau Monné ; Stefano Stabellini ; xen- de...@lists.xenproject.org Subject: Re: [PATCH v6 11/19] xen/x86: introduce "cpufreq=amd-cppc" xen cmdline and amd-cppc driver On 11.07.2025 05:50, Penny Zheng wrote: > --- a/xen/arch/x86/acpi/cpufreq/cpufreq.c > +++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c > @@ -128,12 +128,14 @@ static int __init cf_check > cpufreq_driver_init(void) > > if ( cpufreq_controller == FREQCTL_xen ) > { > +unsigned int i = 0; Pointless initializer; both for() loops set i to 0. But also see further down. > @@ -157,9 +164,70 @@ static int __init cf_check > cpufreq_driver_init(void) > > case X86_VENDOR_AMD: > case X86_VENDOR_HYGON: > -ret = IS_ENABLED(CONFIG_AMD) ? powernow_register_driver() : - ENODEV; > +if ( !IS_ENABLED(CONFIG_AMD) ) > +{ > +ret = -ENODEV; > +break; > +} > +ret = -ENOENT; The code structure is sufficiently different from the Intel counterpart for this to perhaps better move ... > +for ( i = 0; i < cpufreq_xen_cnt; i++ ) > +{ > +switch ( cpufreq_xen_opts[i] ) > +{ > +case CPUFREQ_xen: > +ret = powernow_register_driver(); > +break; > + > +case CPUFREQ_amd_cppc: > +ret = amd_cppc_register_driver(); > +break; > + > +case CPUFREQ_none: > +ret = 0; > +break; > + > +default: > +printk(XENLOG_WARNING > + "Unsupported cpufreq driver for vendor AMD or > Hygon\n"); > +break; ... here. >>> >>> Are we suggesting moving >>> " >>> if ( !IS_ENABLED(CONFIG_AMD) ) >>> { >>> ret = -ENODEV; >>> break; >>> } >>> " here? In which case, When CONFIG_AMD=n and users doesn't provide >>> "cpufreq=xxx", we will have cpufreq_xen_cnt initialized as 1 and >>> cpufreq_xen_opts[0] = CPUFREQ_xen. powernow_register_driver() hence >>> gets invoked. The thing is that we don't have stub for it and it is >>> compiled under CONFIG_AMD I suggest to change to use #ifdef CONFIG_AMD >>> code wrapping >>> > +} > + > +if ( !ret || ret == -EBUSY ) > +break; > +} > + > break; > } > + > +/* > + * After successful cpufreq driver registeration, XEN_PROCESSOR_PM_CPPC > + * and XEN_PROCESSOR_PM_PX shall become exclusive flags. > + */ > +if ( !ret ) > +{ > +ASSERT(i < cpufreq_xen_cnt); > +switch ( cpufreq_xen_opts[i] ) Hmm, this is using the the initializer of i that I commented on. I think there's another default: case missing, where you simply "return 0" (to >> retain prior behavior). But again see also yet further down. > +/* > + * No cpufreq driver gets registered, clear both > + * XEN_PROCESSOR_PM_CPPC and XEN_PROCESSOR_PM_PX > + */ > + xen_processor_pmbits &= ~(XEN_PROCESSOR_PM_CPPC | > + XEN_PROCESSOR_PM_PX); Yet more hmm - this path you want to get through for the case mentioned above. But only this code; specifically not the "switch ( cpufreq_xen_opts[i] )", which really is "switch ( cpufreq_xen_opts[0] )" in that case, and that's pretty clearly wrong to evaluate in then. 
>>> >>> Correct me if I understand you wrongly: >>> The above "case missing" , are we talking about is entering "case >> CPUFREQ_none" ? >>> IMO, it may never be entered. If users doesn't provide "cpufreq=xxx", we >>> will >> have cpufreq_xen_cnt initialized as 1 and cpufreq_
Re: [PATCH v2] xen: rework error handling in vcpu_create
On 04.08.2025 18:57, Stewart Hildebrand wrote: > On 8/4/25 03:57, Jan Beulich wrote: >> On 01.08.2025 22:24, Stewart Hildebrand wrote: >>> @@ -839,6 +839,9 @@ void sched_destroy_vcpu(struct vcpu *v) >>> { >>> struct sched_unit *unit = v->sched_unit; >>> >>> +if ( !unit ) >>> +return; >>> + >>> kill_timer(&v->periodic_timer); >>> kill_timer(&v->singleshot_timer); >>> kill_timer(&v->poll_timer); >> >> What if it's the 2nd error path in sched_init_vcpu() that is taken? Then we >> might take this path (just out of context here) >> >> if ( unit->vcpu_list == v ) >> { >> rcu_read_lock(&sched_res_rculock); >> >> sched_remove_unit(vcpu_scheduler(v), unit); >> sched_free_udata(vcpu_scheduler(v), unit->priv); >> >> and at least Credit1's hook doesn't look to be safe against being passed >> NULL. >> (Not to speak of the risk of unit->priv being used elsewhere while cleaning >> up.) > > > Are you referring to this error path in sched_init_vcpu? No, given the context I thought it was clear that I was referring to static void cf_check csched_free_udata(const struct scheduler *ops, void *priv) { struct csched_unit *svc = priv; BUG_ON( !list_empty(&svc->runq_elem) ); (i.e. particularly this BUG_ON()). Jan
Re: [PATCH v10 4/4] vpci/msix: Free MSIX resources when init_msix() fails
On 05.08.2025 05:49, Jiqian Chen wrote: > When MSI-X initialization fails vPCI will hide the capability, but > remove of handlers and data won't be performed until the device is > deassigned. Introduce a MSI-X cleanup hook that will be called when > initialization fails to cleanup MSI-X related hooks and free it's > associated data. > > As all supported capabilities have been switched to use the cleanup > hooks call those from vpci_deassign_device() instead of open-code the > capability specific cleanup in there. > > Signed-off-by: Jiqian Chen > --- > cc: "Roger Pau Monné" > --- > v9->v10 changes: > * Call all cleanup hook in vpci_deassign_device() instead of cleanup_msix(). Isn't this rather an omission in an earlier change, and hence may want to come separately and with a Fixes: tag? > --- a/xen/drivers/vpci/vpci.c > +++ b/xen/drivers/vpci/vpci.c > @@ -321,6 +321,27 @@ void vpci_deassign_device(struct pci_dev *pdev) > &pdev->domain->vpci_dev_assigned_map); > #endif > > +for ( i = 0; i < NUM_VPCI_INIT; i++ ) > +{ > +const vpci_capability_t *capability = &__start_vpci_array[i]; > +const unsigned int cap = capability->id; > +unsigned int pos = 0; > + > +if ( !capability->is_ext ) > +pos = pci_find_cap_offset(pdev->sbdf, cap); > +else if ( is_hardware_domain(pdev->domain) ) > +pos = pci_find_ext_capability(pdev->sbdf, cap); What's the point of doing this when ... > +if ( pos && capability->cleanup ) ... ->cleanup is NULL? Don't you want to have if ( !capability->cleanup ) continue; earlier on? Jan
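The restructuring suggested above would make the loop look something like this (a sketch only, not the actual follow-up version):

    for ( i = 0; i < NUM_VPCI_INIT; i++ )
    {
        const vpci_capability_t *capability = &__start_vpci_array[i];
        unsigned int pos = 0;
        int rc;

        /* Skip capabilities without a cleanup hook before probing the device. */
        if ( !capability->cleanup )
            continue;

        if ( !capability->is_ext )
            pos = pci_find_cap_offset(pdev->sbdf, capability->id);
        else if ( is_hardware_domain(pdev->domain) )
            pos = pci_find_ext_capability(pdev->sbdf, capability->id);

        if ( !pos )
            continue;

        rc = capability->cleanup(pdev);
        if ( rc )
            printk(XENLOG_ERR "%pd %pp: clean %s cap %u fail rc=%d\n",
                   pdev->domain, &pdev->sbdf,
                   capability->is_ext ? "extended" : "legacy",
                   capability->id, rc);
    }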
[PATCH 0/2] x86/mm: paging build adjustments
This is in the hope that it'll lead to a better overall result than plainly
taking [1] and [2] (or whichever re-work thereof).

1: drop paging_get_mode()
2: correct PG_log_dirty definition

Jan

[1] https://lists.xen.org/archives/html/xen-devel/2025-08/msg00050.html
[2] https://lists.xen.org/archives/html/xen-devel/2025-08/msg00051.html
[PATCH 2/2] x86/mm: correct PG_log_dirty definition
While it is correct that in shim-exclusive mode log-dirty handling is all unreachable code, the present conditional still isn't correct: In a HVM=n and SHADOW_PAGING=n configuration log-dirty code also is all unreachable (and hence violating Misra rule 2.1). As we're aiming at moving away from special casing PV_SHIM_EXCLUSIVE=y, don't retain that part of the conditional. Because of hypercall-defs.c we need to carry out the dependency by introducing a new auxiliary PAGING control. Since compiling out mm/paging.c altogether would entail further changes, merely conditionalize the one function in there (paging_enable()) which would otherwise remain unreachable (Misra rule 2.1 again) when PAGING=n. Fixes: 23d4e0d17b76 ("x86/shim: fix build with PV_SHIM_EXCLUSIVE and SHADOW_PAGING") Signed-off-by: Jan Beulich --- Of course PAGING is at risk of being confused with MEM_PAGING. It not having a prompt, I hope that's tolerable, as I can't really think of a better name. Other PG_log_dirty pre-processor conditionals then likely also want replacing. mm/paging.c and mm/p2m-basic.c could also be compiled out altogether when PAGING=n, at the expense of introducing a few more stubs. FTAOD, the Fixes: tag being referenced does not mean this patch corrects the far more recently introduced build issue with the combination of the two features. That's still work that I expect Penny to carry out (with there still being the option of reverting the final part of the earlier series). --- a/xen/arch/x86/Kconfig +++ b/xen/arch/x86/Kconfig @@ -162,6 +162,9 @@ config SHADOW_PAGING If unsure, say Y. +config PAGING + def_bool HVM || SHADOW_PAGING + config BIGMEM bool "big memory support" default n --- a/xen/arch/x86/domctl.c +++ b/xen/arch/x86/domctl.c @@ -213,11 +213,15 @@ long arch_do_domctl( { case XEN_DOMCTL_shadow_op: +#ifdef CONFIG_PAGING ret = paging_domctl(d, &domctl->u.shadow_op, u_domctl, 0); if ( ret == -ERESTART ) return hypercall_create_continuation( __HYPERVISOR_paging_domctl_cont, "h", u_domctl); copyback = true; +#else +ret = -EOPNOTSUPP; +#endif break; case XEN_DOMCTL_ioport_permission: --- a/xen/arch/x86/include/asm/paging.h +++ b/xen/arch/x86/include/asm/paging.h @@ -55,7 +55,7 @@ #define PG_translate 0 #define PG_external0 #endif -#if defined(CONFIG_HVM) || !defined(CONFIG_PV_SHIM_EXCLUSIVE) +#ifdef CONFIG_PAGING /* Enable log dirty mode */ #define PG_log_dirty (XEN_DOMCTL_SHADOW_ENABLE_LOG_DIRTY << PG_mode_shift) #else --- a/xen/arch/x86/mm/paging.c +++ b/xen/arch/x86/mm/paging.c @@ -864,6 +864,7 @@ void paging_final_teardown(struct domain p2m_final_teardown(d); } +#ifdef CONFIG_PAGING /* Enable an arbitrary paging-assistance mode. Call once at domain * creation. 
*/ int paging_enable(struct domain *d, u32 mode) @@ -889,6 +890,7 @@ int paging_enable(struct domain *d, u32 else return shadow_enable(d, mode); } +#endif #ifdef CONFIG_HVM /* Called from the guest to indicate that a process is being torn down --- a/xen/include/hypercall-defs.c +++ b/xen/include/hypercall-defs.c @@ -197,9 +197,11 @@ dm_op(domid_t domid, unsigned int nr_buf #ifdef CONFIG_SYSCTL sysctl(xen_sysctl_t *u_sysctl) #endif +#if defined(CONFIG_X86) && defined(CONFIG_PAGING) +paging_domctl_cont(xen_domctl_t *u_domctl) +#endif #ifndef CONFIG_PV_SHIM_EXCLUSIVE domctl(xen_domctl_t *u_domctl) -paging_domctl_cont(xen_domctl_t *u_domctl) platform_op(xen_platform_op_t *u_xenpf_op) #endif #ifdef CONFIG_HVM @@ -296,7 +298,7 @@ dm_op compa hypfs_op do do do do do #endif mcado do --- -#ifndef CONFIG_PV_SHIM_EXCLUSIVE +#if defined(CONFIG_X86) && defined(CONFIG_PAGING) paging_domctl_cont do do do do - #endif
[PATCH 1/2] x86/mm: drop paging_get_mode()
The function was introduced without any caller, and never gained any. Thus it has always been violating Misra rule 2.1 (unreachable code). Fixes: dd6de3ab9985 ("Implement Nested-on-Nested") Signed-off-by: Jan Beulich --- a/xen/arch/x86/include/asm/paging.h +++ b/xen/arch/x86/include/asm/paging.h @@ -225,7 +225,6 @@ int paging_enable(struct domain *d, u32 #define paging_get_hostmode(v) ((v)->arch.paging.mode) #define paging_get_nestedmode(v) ((v)->arch.paging.nestedmode) -const struct paging_mode *paging_get_mode(struct vcpu *v); void paging_update_nestedmode(struct vcpu *v); /* Page fault handler --- unstable.orig/xen/arch/x86/mm/paging.c 2025-08-05 08:59:15.512131147 +0200 +++ unstable/xen/arch/x86/mm/paging.c 2025-08-05 09:00:24.160657794 +0200 @@ -946,14 +946,6 @@ void paging_dump_vcpu_info(struct vcpu * } } -const struct paging_mode *paging_get_mode(struct vcpu *v) -{ -if (!nestedhvm_is_n2(v)) -return paging_get_hostmode(v); - -return paging_get_nestedmode(v); -} - #ifdef CONFIG_HVM void paging_update_nestedmode(struct vcpu *v) {
Re: [PATCH v2] xen: rework error handling in vcpu_create
On 8/5/25 05:17, Jan Beulich wrote: > On 05.08.2025 11:06, Stewart Hildebrand wrote: >> On 8/5/25 03:44, Jan Beulich wrote: >>> On 04.08.2025 18:57, Stewart Hildebrand wrote: On 8/4/25 03:57, Jan Beulich wrote: > On 01.08.2025 22:24, Stewart Hildebrand wrote: >> @@ -839,6 +839,9 @@ void sched_destroy_vcpu(struct vcpu *v) >> { >> struct sched_unit *unit = v->sched_unit; >> >> +if ( !unit ) >> +return; >> + >> kill_timer(&v->periodic_timer); >> kill_timer(&v->singleshot_timer); >> kill_timer(&v->poll_timer); > > What if it's the 2nd error path in sched_init_vcpu() that is taken? >> >> ^^ This ^^ is what I'm confused about > > If sched_init_vcpu() took the indicated path, What path? In the one I'm looking at, sched_free_unit() gets called, setting v->sched_unit = NULL, and in that case ... > the if() you add here won't > help, and ... ... the condition is true, and ... > Then we > might take this path (just out of context here) > > if ( unit->vcpu_list == v ) > { > rcu_read_lock(&sched_res_rculock); > > sched_remove_unit(vcpu_scheduler(v), unit); > sched_free_udata(vcpu_scheduler(v), unit->priv); > > and at least Credit1's hook doesn't look to be safe against being passed > NULL. > (Not to speak of the risk of unit->priv being used elsewhere while > cleaning > up.) Are you referring to this error path in sched_init_vcpu? >>> >>> No, given the context I thought it was clear that I was referring to >>> >>> static void cf_check >>> csched_free_udata(const struct scheduler *ops, void *priv) >>> { >>> struct csched_unit *svc = priv; >>> >>> BUG_ON( !list_empty(&svc->runq_elem) ); > > ... we'd make it here (afaict). ... we'd not make it here.
Re: [PATCH v10] xen/console: introduce domain_console struct
On Fri, Jul 25, 2025 at 09:08:02PM +, dm...@proton.me wrote: > From: Denis Mukhin > > Introduce domain_console for grouping data structures used for integrating > domain's diagnostic console with Xen's console driver. > > Group all pbuf-related data structures under domain_console. Rename the moved > fields to plain .buf, .idx and .lock names, since all uses of the fields are > touched. > > Ensure accesses to domain_console pointer are valid in console_switch_input(). > > Bump the domain console buffer allocation size to 256. No extra symbol for the > value since it is used only once during data structure declaration. All size > checks use ARRAY_SIZE(). > > Allocate domain_console from the heap so that the parent domain struct size > stays below PAGE_SIZE boundary to account for more console-related fields > added in the future. > > Finally, update the domain_console allocation and initialization code. > > Not a functional change. > > Signed-off-by: Denis Mukhin Just one request about the allocation of the domain_console in domain_create(). > --- > Changes since v9: > - kept cd as is in hvm_print_line() as requested > - dropped domain lock in hvm_print_line() > > Link to v9: > https://lore.kernel.org/xen-devel/20250723001116.186009-1-dmuk...@ford.com/ > --- > xen/arch/arm/vpl011.c | 2 +- > xen/arch/x86/hvm/hvm.c | 16 +--- > xen/arch/x86/pv/shim.c | 2 +- > xen/common/domain.c| 19 +-- > xen/drivers/char/console.c | 28 > xen/include/xen/sched.h| 24 +--- > 6 files changed, 49 insertions(+), 42 deletions(-) > > diff --git a/xen/arch/arm/vpl011.c b/xen/arch/arm/vpl011.c > index 480fc664fc62..d0d17c76b72c 100644 > --- a/xen/arch/arm/vpl011.c > +++ b/xen/arch/arm/vpl011.c > @@ -713,7 +713,7 @@ int domain_vpl011_init(struct domain *d, struct > vpl011_init_info *info) > } > else > { > -d->console.input_allowed = true; > +d->console->input_allowed = true; > vpl011->backend_in_domain = false; > > vpl011->backend.xen = xzalloc(struct vpl011_xen_backend); > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c > index 56c7de39778b..37af507a8d90 100644 > --- a/xen/arch/x86/hvm/hvm.c > +++ b/xen/arch/x86/hvm/hvm.c > @@ -560,6 +560,7 @@ static int cf_check hvm_print_line( > int dir, unsigned int port, unsigned int bytes, uint32_t *val) > { > struct domain *cd = current->domain; > +struct domain_console *cons = cd->console; > char c = *val; > > ASSERT(bytes == 1 && port == XEN_HVM_DEBUGCONS_IOPORT); > @@ -571,16 +572,17 @@ static int cf_check hvm_print_line( > if ( !is_console_printable(c) ) > return X86EMUL_OKAY; > > -spin_lock(&cd->pbuf_lock); > +spin_lock(&cons->lock); > +ASSERT(cons->idx < ARRAY_SIZE(cons->buf)); > if ( c != '\n' ) > -cd->pbuf[cd->pbuf_idx++] = c; > -if ( (cd->pbuf_idx == (DOMAIN_PBUF_SIZE - 1)) || (c == '\n') ) > +cons->buf[cons->idx++] = c; > +if ( (cons->idx == (ARRAY_SIZE(cons->buf) - 1)) || (c == '\n') ) > { > -cd->pbuf[cd->pbuf_idx] = '\0'; > -guest_printk(cd, XENLOG_G_DEBUG "%s\n", cd->pbuf); > -cd->pbuf_idx = 0; > +cons->buf[cons->idx] = '\0'; > +guest_printk(cd, XENLOG_G_DEBUG "%s\n", cons->buf); > +cons->idx = 0; > } > -spin_unlock(&cd->pbuf_lock); > +spin_unlock(&cons->lock); > > return X86EMUL_OKAY; > } > diff --git a/xen/arch/x86/pv/shim.c b/xen/arch/x86/pv/shim.c > index bc2a7dd5fae5..bd29c53a2d34 100644 > --- a/xen/arch/x86/pv/shim.c > +++ b/xen/arch/x86/pv/shim.c > @@ -239,7 +239,7 @@ void __init pv_shim_setup_dom(struct domain *d, > l4_pgentry_t *l4start, > */ > d->max_pages = domain_tot_pages(d); > > -d->console.input_allowed = true; > +d->console->input_allowed 
= true; > } > > static void write_start_info(struct domain *d) > diff --git a/xen/common/domain.c b/xen/common/domain.c > index 303c338ef293..caef4cc8d649 100644 > --- a/xen/common/domain.c > +++ b/xen/common/domain.c > @@ -669,7 +669,7 @@ static void _domain_destroy(struct domain *d) > BUG_ON(!d->is_dying); > BUG_ON(atomic_read(&d->refcnt) != DOMAIN_DESTROYED); > > -xfree(d->pbuf); > +xvfree(d->console); Nit: even if the struct if being freed just a couple of lines below, I would use XVFREE() here. > > argo_destroy(d); > > @@ -835,8 +835,6 @@ struct domain *domain_create(domid_t domid, > flags |= CDF_hardware; > if ( old_hwdom ) > old_hwdom->cdf &= ~CDF_hardware; > - > -d->console.input_allowed = true; > } > > /* Holding CDF_* internal flags. */ > @@ -866,8 +864,6 @@ struct domain *domain_create(domid_t domid, > spin_lock_init(&d->shutdown_lock); > d->shutdown_code = SHUTDOWN_CODE_INVALID; > > -spin_lock_init(&d->pbuf_lock); > - > rwlock_init(&d->vnuma_rwlock
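As an aside for readers skimming the flattened diff above, the grouped console state amounts to roughly the following sketch, assembled from the hunks shown (the buffer size of 256 comes from the commit message; the exact field order and error handling in the real patch may differ):

    struct domain_console {
        /* Whether this domain may take over the physical console input. */
        bool input_allowed;
        /* Index of the next free slot in buf; always < ARRAY_SIZE(buf). */
        unsigned int idx;
        /* Staging buffer for one line of guest diagnostic output. */
        char buf[256];
        /* Protects idx and buf against concurrent vCPUs printing. */
        spinlock_t lock;
    };

and, in domain_create():

    /* Allocated from the heap so struct domain stays below PAGE_SIZE. */
    d->console = xzalloc(struct domain_console);
    if ( !d->console )
        goto fail;                      /* hypothetical error label */
    spin_lock_init(&d->console->lock);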
Re: [PATCH v2] xen: rework error handling in vcpu_create
On 8/5/25 03:44, Jan Beulich wrote: > On 04.08.2025 18:57, Stewart Hildebrand wrote: >> On 8/4/25 03:57, Jan Beulich wrote: >>> On 01.08.2025 22:24, Stewart Hildebrand wrote: @@ -839,6 +839,9 @@ void sched_destroy_vcpu(struct vcpu *v) { struct sched_unit *unit = v->sched_unit; +if ( !unit ) +return; + kill_timer(&v->periodic_timer); kill_timer(&v->singleshot_timer); kill_timer(&v->poll_timer); >>> >>> What if it's the 2nd error path in sched_init_vcpu() that is taken? ^^ This ^^ is what I'm confused about >>> Then we >>> might take this path (just out of context here) >>> >>> if ( unit->vcpu_list == v ) >>> { >>> rcu_read_lock(&sched_res_rculock); >>> >>> sched_remove_unit(vcpu_scheduler(v), unit); >>> sched_free_udata(vcpu_scheduler(v), unit->priv); >>> >>> and at least Credit1's hook doesn't look to be safe against being passed >>> NULL. >>> (Not to speak of the risk of unit->priv being used elsewhere while cleaning >>> up.) >> >> >> Are you referring to this error path in sched_init_vcpu? > > No, given the context I thought it was clear that I was referring to > > static void cf_check > csched_free_udata(const struct scheduler *ops, void *priv) > { > struct csched_unit *svc = priv; > > BUG_ON( !list_empty(&svc->runq_elem) ); > > (i.e. particularly this BUG_ON()). The comment about credit1 was clear
Re: [PATCH v2] xen: rework error handling in vcpu_create
On 05.08.2025 11:06, Stewart Hildebrand wrote: > On 8/5/25 03:44, Jan Beulich wrote: >> On 04.08.2025 18:57, Stewart Hildebrand wrote: >>> On 8/4/25 03:57, Jan Beulich wrote: On 01.08.2025 22:24, Stewart Hildebrand wrote: > @@ -839,6 +839,9 @@ void sched_destroy_vcpu(struct vcpu *v) > { > struct sched_unit *unit = v->sched_unit; > > +if ( !unit ) > +return; > + > kill_timer(&v->periodic_timer); > kill_timer(&v->singleshot_timer); > kill_timer(&v->poll_timer); What if it's the 2nd error path in sched_init_vcpu() that is taken? > > ^^ This ^^ is what I'm confused about If sched_init_vcpu() took the indicated path, the if() you add here won't help, and ... Then we might take this path (just out of context here) if ( unit->vcpu_list == v ) { rcu_read_lock(&sched_res_rculock); sched_remove_unit(vcpu_scheduler(v), unit); sched_free_udata(vcpu_scheduler(v), unit->priv); and at least Credit1's hook doesn't look to be safe against being passed NULL. (Not to speak of the risk of unit->priv being used elsewhere while cleaning up.) >>> >>> >>> Are you referring to this error path in sched_init_vcpu? >> >> No, given the context I thought it was clear that I was referring to >> >> static void cf_check >> csched_free_udata(const struct scheduler *ops, void *priv) >> { >> struct csched_unit *svc = priv; >> >> BUG_ON( !list_empty(&svc->runq_elem) ); ... we'd make it here (afaict). Jan
Re: [PATCH] misra: fix violations in macros GVA_INFO, TRACE_TIME
On 7/31/25 19:09, Nicola Vetrini wrote: > On 2025-07-31 18:05, Andrew Cooper wrote: >> On 31/07/2025 4:58 pm, Jan Beulich wrote: >>> On 31.07.2025 17:37, Andrew Cooper wrote: On 31/07/2025 4:16 pm, Dmytro Prokopchuk1 wrote: > MISRA Rule 13.1: Initializer lists shall not contain persistent side > effects. > > The violations occur because both the `GVA_INFO` and `TRACE_TIME` > macro > expansions include expressions with persistent side effects introduced > via inline assembly. > > In the case of `GVA_INFO`, the issue stems from the initializer list > containing a direct call to `current`, which evaluates to > `this_cpu(curr_vcpu)` and involves persistent side effects via the > `asm` statement. To resolve this, the side-effect-producing expression > is computed in a separate statement prior to the macro initialization: > > struct vcpu *current_vcpu = current; > > The computed value is passed into the `GVA_INFO(current_vcpu)` macro, > ensuring that the initializer is clean and free of such side effects. > > Similarly, the `TRACE_TIME` macro violates this rule when accessing > expressions like `current->vcpu_id` and `current->domain->domain_id`, > which also depend on `current` and inline assembly. To fix this, the > value of `current` is assigned to a temporary variable: > > struct vcpu *v = current; > > This temporary variable is then used to access `domain_id` and > `vcpu_id`. > This ensures that the arguments passed to the `TRACE_TIME` macro are > simple expressions free of persistent side effects. > > Signed-off-by: Dmytro Prokopchuk The macro `current` specifically does not (and must not) have side effects. It is expected to behave like a plain `struct vcpu *current;` variable, and what Eclair is noticing is the thread-local machinery under this_cpu() (or in x86's case, get_current()). In ARM's case, it's literally reading the hardware thread pointer register. Can anything be done to tell Eclair that `this_cpu()` specifically does not have side effects? The only reason that GVA_INFO() and TRACE_TIME() are picked out is because they both contain embedded structure initialisation, and this is is actually an example where trying to comply with MISRA interferes with what is otherwise a standard pattern in Xen. >>> Irrespective of what you say, some of the changes here were eliminating >>> multiple adjacent uses of current, which - iirc - often the compiler >>> can't fold via CSE. >> >> Where we have mixed usage, sure. (I'm sure I've got a branch somewhere >> trying to add some more pure/const around to try and help out here, but >> I can't find it, and don't recall it being a major improvement either.) >> >> The real problem here is that there are a *very few* number of contexts >> where Eclair refuses to tolerate the use of `current` citing side >> effects, despite there being no side effects. >> >> That is the thing that breaks the principle of least surprise, and we >> ought to fix it by making Eclair happy with `current` everywhere, rather >> than force people to learn that 2 macros can't have a `current` in their >> parameter list. >> > > I'll take a look. Likely yes, by adding a handful of properties. There > are subtleties, though. > Hi, Nicola. Did you have a chance to try configure Eclair to ignore this macro `this_cpu()`? Thanks. Dmytro
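For readers unfamiliar with the rule, the transformation being discussed is simply hoisting the `current` read out of the initializer list. A minimal illustration follows; the structure and macro here are made up for the example, only the hoisting pattern is the point:

    /* Both declarations are illustrative, not the real Xen definitions. */
    struct gva_info {
        const struct vcpu *v;
    };
    #define GVA_INFO(vcpu) ((struct gva_info){ .v = (vcpu) })

    static void example(void)
    {
        /*
         * Non-compliant: the initializer evaluates `current` directly, and
         * `current` expands to this_cpu(curr_vcpu), i.e. inline asm.
         */
        struct gva_info before = GVA_INFO(current);

        /* Compliant: hoist the side-effect-carrying read into its own statement. */
        struct vcpu *curr = current;
        struct gva_info after = GVA_INFO(curr);
    }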
Xen 4.19.3 released
All, we're pleased to announce the release of another bug fixing Xen version. Xen 4.19.3 is available from its git repository http://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=refs/heads/stable-4.19 (tag RELEASE-4.19.3) or (eventually) from the XenProject download page https://xenproject.org/resources/downloads/. We recommend all users of the 4.19 stable series to update to this latest point release. Regards, Jan
All, we're pleased to announce the release of another bug-fix Xen version. Xen 4.19.3 is available from its git repository http://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=refs/heads/stable-4.19 (tag RELEASE-4.19.3) or (eventually) from the XenProject download page https://xenproject.org/resources/downloads/. We recommend that all users of the 4.19 stable series update to this latest point release. Regards, Jan
Re: [PATCH] misra: remove default case in single-clause switch
On 8/5/25 11:15, Jan Beulich wrote: > On 04.08.2025 19:33, Dmytro Prokopchuk1 wrote: >> MISRA Rule 16.4: Every switch statement shall have a default label. >> The default clause must contain either a statement or a comment >> prior to its terminating break statement. >> >> However, there is a documented deviation for this rule in Xen: >> 'docs/misra/deviations.rst': >> * - R16.4 >>- A switch statement with a single clause and no default label >> may replace an equivalent if statement to improve readability. >>- Tagged as `deliberate` for ECLAIR. >> >> This change removes empty default cases in single-clause switches >> to avoid violations of the rule where the `default` clause lacks >> a suitable comment or statement. >> >> Signed-off-by: Dmytro Prokopchuk > > It's all CPU notifiers that you alter, and iirc the outcome of earlier > discussion was that particularly for those we _want_ to add commentary, > clarifying why only the given subset of notification need handling in > the particular case. It may also well be that the (at least) one case > of the possibly missing handling of some other notification still is > unaddressed (and might hence be wrongly hidden by the adjustment done > here, if it's in one of the function that are being altered). > > Jan I understood. Thank you, Jan. Dmytro
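For the record, the shape being asked for in the CPU-notifier cases is roughly the following (a hypothetical notifier, not one of the files touched by this patch): keep the default label, but state why the remaining notifications are deliberately not handled rather than leaving it empty.

    static int cf_check cpu_callback(
        struct notifier_block *nfb, unsigned long action, void *hcpu)
    {
        unsigned int cpu = (unsigned long)hcpu;
        int rc = 0;

        switch ( action )
        {
        case CPU_UP_PREPARE:
            rc = init_percpu_state(cpu);   /* hypothetical helper */
            break;

        default:
            /*
             * CPU_DEAD etc. deliberately ignored: per-CPU state is kept
             * across offlining so it can be reused when the CPU returns.
             */
            break;
        }

        return notifier_from_errno(rc);
    }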
Re: [PATCH v4 7/8] pdx: introduce a new compression algorithm based on region offsets
On 05.08.2025 11:52, Roger Pau Monne wrote: > --- a/xen/common/pdx.c > +++ b/xen/common/pdx.c > @@ -24,6 +24,7 @@ > #include > #include > #include > +#include > > /** > * Maximum (non-inclusive) usable pdx. Must be > @@ -40,6 +41,12 @@ bool __mfn_valid(unsigned long mfn) > > #ifdef CONFIG_PDX_MASK_COMPRESSION > invalid |= mfn & pfn_hole_mask; > +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION) > +{ > +unsigned long base = pfn_bases[PFN_TBL_IDX(mfn)]; > + > +invalid |= mfn < base || mfn >= base + pdx_region_size; > +} > #endif Hmm, didn't notice this earlier on: Brace placement looks odd here. I think they want to be indented by one level, as they aren't starting a function body. > @@ -294,7 +308,245 @@ void __init pfn_pdx_compression_reset(void) > nr_ranges = 0; > } > > -#endif /* CONFIG_PDX_COMPRESSION */ > +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION) /* CONFIG_PDX_MASK_COMPRESSION > */ > + > +unsigned int __ro_after_init pfn_index_shift; > +unsigned int __ro_after_init pdx_index_shift; > + > +unsigned long __ro_after_init pfn_pdx_lookup[CONFIG_PDX_NR_LOOKUP]; > +unsigned long __ro_after_init pdx_pfn_lookup[CONFIG_PDX_NR_LOOKUP]; > +unsigned long __ro_after_init pfn_bases[CONFIG_PDX_NR_LOOKUP]; > +unsigned long __ro_after_init pdx_region_size = ~0UL; For cache locality, might this last one better also move ahead of the arrays? > +bool pdx_is_region_compressible(paddr_t base, unsigned long npages) > +{ > +unsigned long pfn = PFN_DOWN(base); > +unsigned long pfn_base = pfn_bases[PFN_TBL_IDX(pfn)]; > + > +return pfn >= pfn_base && > + pfn + npages <= pfn_base + pdx_region_size; > +} > + > +static int __init cf_check cmp_node(const void *a, const void *b) > +{ > +const struct pfn_range *l = a; > +const struct pfn_range *r = b; > + > +if ( l->base_pfn > r->base_pfn ) > +return 1; > +if ( l->base_pfn < r->base_pfn ) > +return -1; > + > +return 0; > +} > + > +static void __init cf_check swp_node(void *a, void *b) > +{ > +SWAP(a, b); > +} This hasn't changed from v3, and still looks wrong to me. > +bool __init pfn_pdx_compression_setup(paddr_t base) > +{ > +unsigned long mask = PFN_DOWN(pdx_init_mask(base)), idx_mask = 0; > +unsigned long pages = 0; > +unsigned int i; > + > +if ( !nr_ranges ) > +{ > +printk(XENLOG_DEBUG "PFN compression disabled%s\n", > + pdx_compress ? ": no ranges provided" : ""); > +return false; > +} > + > +if ( nr_ranges > ARRAY_SIZE(ranges) ) > +{ > +printk(XENLOG_WARNING > + "Too many PFN ranges (%u > %zu), not attempting PFN > compression\n", > + nr_ranges, ARRAY_SIZE(ranges)); > +return false; > +} > + > +/* Sort ranges by start address. */ > +sort(ranges, nr_ranges, sizeof(*ranges), cmp_node, swp_node); > + > +for ( i = 0; i < nr_ranges; i++ ) > +{ > +unsigned long start = ranges[i].base_pfn; > + > +/* > + * Align range base to MAX_ORDER. This is required so the PDX offset > + * for the bits below MAX_ORDER matches the MFN offset, and pages > + * greater than the minimal order can be used to populate the > + * directmap. > + */ > +ranges[i].base_pfn = start & ~((1UL << MAX_ORDER) - 1); > +ranges[i].pages = start + ranges[i].pages - ranges[i].base_pfn; > + > +/* > + * Only merge overlapped regions now, leave adjacent regions > separated. > + * They would be merged later if both use the same index into the > + * lookup table. 
> + */ > +if ( !i || > + ranges[i].base_pfn >= > + (ranges[i - 1].base_pfn + ranges[i - 1].pages) ) > +{ > +mask |= pdx_region_mask(ranges[i].base_pfn, ranges[i].pages); > +continue; > +} > + > +ranges[i - 1].pages = ranges[i].base_pfn + ranges[i].pages - > + ranges[i - 1].base_pfn; > + > +if ( i + 1 < nr_ranges ) > +memmove(&ranges[i], &ranges[i + 1], > +(nr_ranges - (i + 1)) * sizeof(ranges[0])); > +else /* last range */ > +mask |= pdx_region_mask(ranges[i].base_pfn, ranges[i].pages); > +nr_ranges--; > +i--; > +} > + > +/* > + * Populate a mask with the non-equal bits of the different ranges, do > this > + * to calculate the maximum PFN shift to use as the lookup table index. > + */ > +for ( i = 0; i < nr_ranges; i++ ) > +for ( unsigned int j = 0; j < nr_ranges; j++ ) > +idx_mask |= (ranges[i].base_pfn & ~mask) ^ > +(ranges[j].base_pfn & ~mask); "mask" is loop invariant - can't the AND-ing be pulled out, after the loop? Further, isn't it sufficient for the inner loop to start from i + 1? Jan
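For completeness, the two simplifications suggested at the end rely on `&` distributing over `^` and `|`, and on XOR being symmetric (the i == j terms are zero and the (i, j)/(j, i) pairs are duplicates), so the double loop could shrink to roughly:

    /* Sketch of the suggested form; mask, ranges and nr_ranges as in the patch. */
    for ( i = 0; i < nr_ranges; i++ )
        for ( unsigned int j = i + 1; j < nr_ranges; j++ )
            idx_mask |= ranges[i].base_pfn ^ ranges[j].base_pfn;

    /* mask is loop-invariant, so the masking can be applied once at the end. */
    idx_mask &= ~mask;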
Re: [PATCH v4 8/8] x86/mm: adjust loop in arch_init_memory() to iterate over the PDX space
On 05.08.2025 11:52, Roger Pau Monne wrote: > --- a/xen/arch/x86/mm.c > +++ b/xen/arch/x86/mm.c > @@ -275,7 +275,7 @@ static void __init assign_io_page(struct page_info *page) > > void __init arch_init_memory(void) > { > -unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn; > +unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn, pdx; > > /* > * Basic guest-accessible flags: > @@ -328,9 +328,20 @@ void __init arch_init_memory(void) > destroy_xen_mappings((unsigned long)mfn_to_virt(iostart_pfn), > (unsigned long)mfn_to_virt(ioend_pfn)); > > -/* Mark as I/O up to next RAM region. */ > -for ( ; pfn < rstart_pfn; pfn++ ) > +/* > + * Mark as I/O up to next RAM region. Iterate over the PDX space to > + * skip holes which would always fail the mfn_valid() check. > + * > + * pfn_to_pdx() requires a valid (iow: RAM) PFN to convert to PDX, > + * hence provide pfn - 1, which is the tailing PFN from the last RAM > + * range, or pdx 0 if the input pfn is 0. > + */ > +for ( pdx = pfn ? pfn_to_pdx(pfn - 1) + 1 : 0; > + pdx < pfn_to_pdx(rstart_pfn); > + pdx++ ) > { > +pfn = pdx_to_pfn(pdx); > + > if ( !mfn_valid(_mfn(pfn)) ) > continue; > As much as I would have liked to ack this, I fear there's another caveat here: At the top of the loop we check not only for RAM, but also for UNUSABLE. The latter, like RAM, shouldn't be marked I/O, but we also can't use PFN <-> PDX transformations on any such page. Jan
Re: [PATCH v4 7/8] pdx: introduce a new compression algorithm based on region offsets
On Tue, Aug 05, 2025 at 02:28:22PM +0200, Jan Beulich wrote: > On 05.08.2025 11:52, Roger Pau Monne wrote: > > --- a/xen/common/pdx.c > > +++ b/xen/common/pdx.c > > @@ -24,6 +24,7 @@ > > #include > > #include > > #include > > +#include > > > > /** > > * Maximum (non-inclusive) usable pdx. Must be > > @@ -40,6 +41,12 @@ bool __mfn_valid(unsigned long mfn) > > > > #ifdef CONFIG_PDX_MASK_COMPRESSION > > invalid |= mfn & pfn_hole_mask; > > +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION) > > +{ > > +unsigned long base = pfn_bases[PFN_TBL_IDX(mfn)]; > > + > > +invalid |= mfn < base || mfn >= base + pdx_region_size; > > +} > > #endif > > Hmm, didn't notice this earlier on: Brace placement looks odd here. I think > they want to be indented by one level, as they aren't starting a function > body. Right, I can adjust. Since they are inside of the ifdef block it did look kind of OK to me, and avoided having to indent the content one extra level. > > @@ -294,7 +308,245 @@ void __init pfn_pdx_compression_reset(void) > > nr_ranges = 0; > > } > > > > -#endif /* CONFIG_PDX_COMPRESSION */ > > +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION) /* > > CONFIG_PDX_MASK_COMPRESSION */ > > + > > +unsigned int __ro_after_init pfn_index_shift; > > +unsigned int __ro_after_init pdx_index_shift; > > + > > +unsigned long __ro_after_init pfn_pdx_lookup[CONFIG_PDX_NR_LOOKUP]; > > +unsigned long __ro_after_init pdx_pfn_lookup[CONFIG_PDX_NR_LOOKUP]; > > +unsigned long __ro_after_init pfn_bases[CONFIG_PDX_NR_LOOKUP]; > > +unsigned long __ro_after_init pdx_region_size = ~0UL; > > For cache locality, might this last one better also move ahead of the arrays? Oh, yes, this was a late addition and I clearly didn't place enough attention when adding it. > > +bool pdx_is_region_compressible(paddr_t base, unsigned long npages) > > +{ > > +unsigned long pfn = PFN_DOWN(base); > > +unsigned long pfn_base = pfn_bases[PFN_TBL_IDX(pfn)]; > > + > > +return pfn >= pfn_base && > > + pfn + npages <= pfn_base + pdx_region_size; > > +} > > + > > +static int __init cf_check cmp_node(const void *a, const void *b) > > +{ > > +const struct pfn_range *l = a; > > +const struct pfn_range *r = b; > > + > > +if ( l->base_pfn > r->base_pfn ) > > +return 1; > > +if ( l->base_pfn < r->base_pfn ) > > +return -1; > > + > > +return 0; > > +} > > + > > +static void __init cf_check swp_node(void *a, void *b) > > +{ > > +SWAP(a, b); > > +} > > This hasn't changed from v3, and still looks wrong to me. Oh, I did recall a comment to that regard, but somehow forgot to apply it, I'm sorry, I've now fixed it. > > +bool __init pfn_pdx_compression_setup(paddr_t base) > > +{ > > +unsigned long mask = PFN_DOWN(pdx_init_mask(base)), idx_mask = 0; > > +unsigned long pages = 0; > > +unsigned int i; > > + > > +if ( !nr_ranges ) > > +{ > > +printk(XENLOG_DEBUG "PFN compression disabled%s\n", > > + pdx_compress ? ": no ranges provided" : ""); > > +return false; > > +} > > + > > +if ( nr_ranges > ARRAY_SIZE(ranges) ) > > +{ > > +printk(XENLOG_WARNING > > + "Too many PFN ranges (%u > %zu), not attempting PFN > > compression\n", > > + nr_ranges, ARRAY_SIZE(ranges)); > > +return false; > > +} > > + > > +/* Sort ranges by start address. */ > > +sort(ranges, nr_ranges, sizeof(*ranges), cmp_node, swp_node); > > + > > +for ( i = 0; i < nr_ranges; i++ ) > > +{ > > +unsigned long start = ranges[i].base_pfn; > > + > > +/* > > + * Align range base to MAX_ORDER. 
This is required so the PDX > > offset > > + * for the bits below MAX_ORDER matches the MFN offset, and pages > > + * greater than the minimal order can be used to populate the > > + * directmap. > > + */ > > +ranges[i].base_pfn = start & ~((1UL << MAX_ORDER) - 1); > > +ranges[i].pages = start + ranges[i].pages - ranges[i].base_pfn; > > + > > +/* > > + * Only merge overlapped regions now, leave adjacent regions > > separated. > > + * They would be merged later if both use the same index into the > > + * lookup table. > > + */ > > +if ( !i || > > + ranges[i].base_pfn >= > > + (ranges[i - 1].base_pfn + ranges[i - 1].pages) ) > > +{ > > +mask |= pdx_region_mask(ranges[i].base_pfn, ranges[i].pages); > > +continue; > > +} > > + > > +ranges[i - 1].pages = ranges[i].base_pfn + ranges[i].pages - > > + ranges[i - 1].base_pfn; > > + > > +if ( i + 1 < nr_ranges ) > > +memmove(&ranges[i], &ranges[i + 1], > > +(nr_ranges - (i + 1)) * sizeof(ranges[0])); > > +else /* last range */ > > +mask |= pdx_region_mask(ranges[i].base_pfn, ranges[i].pages); > > +
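As a side note on the swp_node() point raised earlier in this sub-thread: SWAP(a, b) as written only exchanges the two local pointer variables, so the array elements are never actually swapped. One plausible shape of the fix, assuming sort()'s swap callback is meant to exchange the pointed-to elements, is:

    static void __init cf_check swp_node(void *a, void *b)
    {
        /* Swap the pfn_range entries themselves, not the local pointers. */
        SWAP(*(struct pfn_range *)a, *(struct pfn_range *)b);
    }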
Re: [PATCH v3 01/20] xen/riscv: implement sbi_remote_hfence_gvma()
On 8/4/25 3:52 PM, Jan Beulich wrote: On 31.07.2025 17:58, Oleksii Kurochko wrote: Instruct the remote harts to execute one or more HFENCE.GVMA instructions, covering the range of guest physical addresses between start_addr and start_addr + size for all VMIDs. The remote fence operation applies to the entire address space if either: - start_addr and size are both 0, or - size is equal to 2^XLEN-1. Signed-off-by: Oleksii Kurochko Acked-by: Jan Beulich However, ... --- a/xen/arch/riscv/include/asm/sbi.h +++ b/xen/arch/riscv/include/asm/sbi.h @@ -89,6 +89,25 @@ bool sbi_has_rfence(void); int sbi_remote_sfence_vma(const cpumask_t *cpu_mask, vaddr_t start, size_t size); +/* + * Instructs the remote harts to execute one or more HFENCE.GVMA + * instructions, covering the range of guest physical addresses + * between start_addr and start_addr + size for all VMIDs. ... I'd like to ask that you avoid fuzzy terminology like this one. Afaict you mean [start, start + size). Help yourself and future readers by then also saying it exactly like this. (Happy to make a respective edit while committing.) I just tried the following wording in SBI spec. I agree that using [start, start+size) is clearer as each time I'm going to check SBI code to verify if 'start+size' is included or not. It would be happy if you could update this part of commit message during commit. + * Returns 0 if IPI was sent to all the targeted harts successfully + * or negative value if start_addr or size is not valid. This similarly is ambiguous: The union of the success case stated and the error case stated isn't obviously all possible states. The success statement in particular alludes to the possibility of an IPI not actually reaching its target. The same as above this is what SBI spec. tells. I've not checked SBI code deeply, but it seems like the code is waiting while IPI will be reached as looking at the code: /** * As this this function only handlers scalar values of hart mask, it must be * set to all online harts if the intention is to send IPIs to all the harts. * If hmask is zero, no IPIs will be sent. */ int sbi_ipi_send_many(ulong hmask, ulong hbase, u32 event, void *data) { ... /* Send IPIs */ do { retry_needed = false; sbi_hartmask_for_each_hart(i, &target_mask) { rc = sbi_ipi_send(scratch, i, event, data); if (rc == SBI_IPI_UPDATE_RETRY) retry_needed = true; else sbi_hartmask_clear_hart(i, &target_mask); } } while (retry_needed); /* Sync IPIs */ sbi_ipi_sync(scratch, event); return 0; } and static int sbi_ipi_sync(struct sbi_scratch *scratch, u32 event) { const struct sbi_ipi_event_ops *ipi_ops; if ((SBI_IPI_EVENT_MAX <= event) || !ipi_ops_array[event]) return SBI_EINVAL; ipi_ops = ipi_ops_array[event]; if (ipi_ops->sync) ipi_ops->sync(scratch); return 0; } which calls: static void tlb_sync(struct sbi_scratch *scratch) { atomic_t *tlb_sync = sbi_scratch_offset_ptr(scratch, tlb_sync_off); while (atomic_read(tlb_sync) > 0) { /* * While we are waiting for remote hart to set the sync, * consume fifo requests to avoid deadlock. */ tlb_process_once(scratch); } return; } + * The remote fence operation applies to the entire address space if either: + * - start_addr and size are both 0, or + * - size is equal to 2^XLEN-1. Whose XLEN is this? The guest's? The host's? (I assume the latter, but it's not unambiguous, unless there's specific terminology that I'm unaware of, yet which would make this unambiguous.) RISC-V spec quite mixes the terminology (3.1.6.2. 
Base ISA Control in mstatus Register) around XLEN: For RV64 harts, the SXL and UXL fields are WARL fields that control the value of XLEN for S-mode and U-mode, respectively. The encoding of these fields is the same as the MXL field of misa, shown in Table 9. The effective XLEN in S-mode and U-mode are termed SXLEN and UXLEN, respectively Basically, RISC-V privileged architecture defines different XLEN values for various privilege modes: - MXLEN for Machine mode - SXLEN for Supervisor mode. - HSXLEN for Hypervisor
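Distilling the interval point into the eventual wording, the comment above the sbi_remote_hfence_gvma() declaration could read roughly as follows (a suggestion only; the XLEN question raised above still needs an explicit answer in the final text):

    /*
     * Instruct the remote harts to execute one or more HFENCE.GVMA
     * instructions, covering the guest physical address range
     * [start_addr, start_addr + size) for all VMIDs.
     *
     * The fence applies to the entire address space if either start_addr
     * and size are both 0, or size is 2^XLEN - 1.
     *
     * Returns 0 if the request was carried out on all targeted harts, or a
     * negative value if start_addr or size is not valid.
     */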
Re: [PATCH v13 1/3] xen/domain: unify domain ID allocation
On Wed, Jul 30, 2025 at 05:40:54PM +, dm...@proton.me wrote: > From: Denis Mukhin > > Currently, there are two different domain ID allocation implementations: > > 1) Sequential IDs allocation in dom0less Arm code based on max_init_domid; > > 2) Sequential IDs allocation in XEN_DOMCTL_createdomain; does not use > max_init_domid (both Arm and x86). > > The domain ID allocation covers dom0 or late hwdom, predefined domains, > post-boot domains, excluding Xen system domains (domid >= > DOMID_FIRST_RESERVED). > > It makes sense to have a common helper code for such task across architectures > (Arm and x86) and between dom0less / toolstack domU allocation. > > Note, fixing dependency on max_init_domid is out of scope of this patch. > > Wrap the domain ID allocation as an arch-independent function domid_alloc() in > new common/domid.c based on the bitmap. > > Allocation algorithm: > - If an explicit domain ID is provided, verify its availability and use it if > ID is not used; > - If DOMID_INVALID is provided, search the range [1..DOMID_FIRST_RESERVED-1], > starting from the last used ID. > Implementation guarantees that two consecutive calls will never return the > same ID. ID#0 is reserved for the first boot domain (currently, dom0) and > excluded from the allocation range. > > Remove is_free_domid() helper as it is not needed now. > > No functional change intended. > > Signed-off-by: Denis Mukhin > Reviewed-by: Alejandro Vallejo > --- > Changes since v12: > - updated comment for domid_alloc() and commit message > - added Alejandro's R-b > --- > xen/arch/arm/domain_build.c | 7 +- > xen/arch/x86/setup.c| 7 +- > xen/common/Makefile | 1 + > xen/common/device-tree/dom0less-build.c | 15 ++-- > xen/common/domain.c | 2 + > xen/common/domctl.c | 42 ++- > xen/common/domid.c | 94 + > xen/include/xen/domain.h| 3 + > 8 files changed, 124 insertions(+), 47 deletions(-) > create mode 100644 xen/common/domid.c > > diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c > index 463ae4474d30..789f2b9d3ce7 100644 > --- a/xen/arch/arm/domain_build.c > +++ b/xen/arch/arm/domain_build.c > @@ -2050,6 +2050,7 @@ void __init create_dom0(void) > .grant_opts = XEN_DOMCTL_GRANT_version(opt_gnttab_max_version), > }; > unsigned int flags = CDF_privileged | CDF_hardware; > +domid_t domid; > int rc; > > /* The vGIC for DOM0 is exactly emulating the hardware GIC */ > @@ -2074,7 +2075,11 @@ void __init create_dom0(void) > if ( !llc_coloring_enabled ) > flags |= CDF_directmap; > > -dom0 = domain_create(0, &dom0_cfg, flags); > +domid = domid_alloc(0); > +if ( domid == DOMID_INVALID ) > +panic("Error allocating domain ID 0\n"); > + > +dom0 = domain_create(domid, &dom0_cfg, flags); > if ( IS_ERR(dom0) ) > panic("Error creating domain 0 (rc = %ld)\n", PTR_ERR(dom0)); > > diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c > index 1543dd251cc6..2ff7c28c277b 100644 > --- a/xen/arch/x86/setup.c > +++ b/xen/arch/x86/setup.c > @@ -1047,8 +1047,11 @@ static struct domain *__init create_dom0(struct > boot_info *bi) > if ( iommu_enabled ) > dom0_cfg.flags |= XEN_DOMCTL_CDF_iommu; > > -/* Create initial domain. Not d0 for pvshim. */ > -bd->domid = get_initial_domain_id(); > +/* Allocate initial domain ID. Not d0 for pvshim. */ > +bd->domid = domid_alloc(get_initial_domain_id()); > +if ( bd->domid == DOMID_INVALID ) > +panic("Error allocating domain ID %d\n", get_initial_domain_id()); Nit: in other error messages in the same function we handle the domid as an unsigned integer, so %u probably wants using here. 
Unless you have an explicit intention to print IDs >= DOMID_FIRST_RESERVED as negative integers? > + > d = domain_create(bd->domid, &dom0_cfg, >pv_shim ? 0 : CDF_privileged | CDF_hardware); > if ( IS_ERR(d) ) > diff --git a/xen/common/Makefile b/xen/common/Makefile > index c316957fcb36..0c7d0f5d46e1 100644 > --- a/xen/common/Makefile > +++ b/xen/common/Makefile > @@ -11,6 +11,7 @@ obj-$(filter-out $(CONFIG_X86),$(CONFIG_ACPI)) += device.o > obj-$(CONFIG_DEVICE_TREE_PARSE) += device-tree/ > obj-$(CONFIG_IOREQ_SERVER) += dm.o > obj-y += domain.o > +obj-y += domid.o > obj-y += event_2l.o > obj-y += event_channel.o > obj-$(CONFIG_EVTCHN_FIFO) += event_fifo.o > diff --git a/xen/common/device-tree/dom0less-build.c > b/xen/common/device-tree/dom0less-build.c > index 6bb038111de9..f4b6b515d2d2 100644 > --- a/xen/common/device-tree/dom0less-build.c > +++ b/xen/common/device-tree/dom0less-build.c > @@ -833,6 +833,7 @@ void __init create_domUs(void) > { > struct kernel_info ki = KERNEL_INFO_INIT; > int rc = parse_dom0less_node(node,
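For readers not following the whole series, the allocation policy described in the commit message boils down to roughly the sketch below. This is a simplified reconstruction, not the literal common/domid.c; the locking granularity and the exact wrap-around handling in the real patch may differ.

    static DEFINE_SPINLOCK(domid_lock);
    static DECLARE_BITMAP(domid_bitmap, DOMID_FIRST_RESERVED);
    static domid_t domid_last;

    domid_t domid_alloc(domid_t hint)
    {
        domid_t domid = DOMID_INVALID;

        spin_lock(&domid_lock);

        if ( hint != DOMID_INVALID )
        {
            /* Explicit request: only succeed if the ID is still free. */
            if ( hint < DOMID_FIRST_RESERVED && !test_bit(hint, domid_bitmap) )
                domid = hint;
        }
        else
        {
            /* Search (domid_last, DOMID_FIRST_RESERVED), then wrap to [1, domid_last]. */
            unsigned long bit = find_next_zero_bit(domid_bitmap,
                                                   DOMID_FIRST_RESERVED,
                                                   domid_last + 1);

            if ( bit >= DOMID_FIRST_RESERVED )
                bit = find_next_zero_bit(domid_bitmap, domid_last + 1, 1);

            if ( bit < DOMID_FIRST_RESERVED && !test_bit(bit, domid_bitmap) )
                domid = bit;
        }

        if ( domid != DOMID_INVALID )
        {
            __set_bit(domid, domid_bitmap);
            domid_last = domid;
        }

        spin_unlock(&domid_lock);

        return domid;
    }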
Re: [PATCH -next v7 5/7] arm64: entry: Refactor preempt_schedule_irq() check code
Hi Jinjie, On 29/07/2025 02:54, Jinjie Ruan wrote: ARM64 requires an additional check whether to reschedule on return from interrupt. So add arch_irqentry_exit_need_resched() as the default NOP implementation and hook it up into the need_resched() condition in raw_irqentry_exit_cond_resched(). This allows ARM64 to implement the architecture specific version for switching over to the generic entry code. I was a bit confused by this, as I didn't see the link with the generic entry given you implement `raw_irqentry_exit_cond_resched()` in arch/arm64 as well in this patch : I expected the arm64 implementation to be added. I share more thoughts below. What do you think about something along those lines ? Compared to the generic entry code, arm64 does additional checks when deciding to reschedule on return from an interrupt. Introduce arch_irqentry_exit_need_resched() in the need_resched() condition of the generic raw_irqentry_exit_cond_resched(), with a NOP default. This will allow arm64 to implement its architecture specific checks when switching over to the generic entry code. [...] diff --git a/kernel/entry/common.c b/kernel/entry/common.c index b82032777310..4aa9656fa1b4 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -142,6 +142,20 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs) return ret; } +/** + * arch_irqentry_exit_need_resched - Architecture specific need resched function + * + * Invoked from raw_irqentry_exit_cond_resched() to check if need resched. Very nit : "to check if resched is needed." ? + * Defaults return true. + * + * The main purpose is to permit arch to skip preempt a task from an IRQ. If feel that "to avoid preemption of a task" instead of "to skip preempt a task" would make this much clearer, what do you think ? + */ +static inline bool arch_irqentry_exit_need_resched(void); + +#ifndef arch_irqentry_exit_need_resched +static inline bool arch_irqentry_exit_need_resched(void) { return true; } +#endif + I've had some trouble reviewing this patch : on the one hand because I didn't notice `arch_irqentry_exit_need_resched()` was added in the common entry code, which is on me ! On the other hand, I felt that the patch itself was a bit disconnected : we add `arch_irqentry_exit_need_resched()` in the common entry code, with a default NOP, but in the same function we add to arm64, while mentioning that this is for arm64's additional checks, which we only implement in patch 7. Would it make sense to move the `arch_irqentry_exit_need_resched()` part of the patch to patch 7, so that the introduction and arch-specific implementation appear together ? To me it seems easier to wrap my head around, as it would look like "Move arm64 to generic entry, but it does additional checks : add a new arch-specific function controlling re-scheduling, defaulting to true, and implement it for arm64". I feel it could help making patch 7's commit message clearer as well. From what I gathered on the archive `arch_irqentry_exit_need_resched()` being added here was suggested previously, so others might not have the same opinion. Maybe improving the commit message and comment for this would be enough as well, as per my suggestions above. Otherwise the changes make sense and I don't see any functional issues ! Thanks, Ada
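To make the intended use a bit more concrete (the actual arm64 checks only arrive in patch 7 and are not reproduced here), an architecture override of the new hook would end up looking roughly like:

    /* In the arch header pulled in by the generic entry code (illustrative). */
    static inline bool arch_irqentry_exit_need_resched(void)
    {
        /* Return false to veto preemption on this IRQ return. */
        return arch_conditions_allow_preemption();   /* hypothetical helper */
    }
    #define arch_irqentry_exit_need_resched arch_irqentry_exit_need_resched

with raw_irqentry_exit_cond_resched() then only calling preempt_schedule_irq() when both need_resched() and the arch hook agree.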
Re: [PATCH v3 11/20] xen/riscv: implement function to map memory in guest p2m
On 31.07.2025 17:58, Oleksii Kurochko wrote: > Implement map_regions_p2mt() to map a region in the guest p2m with > a specific p2m type. The memory attributes will be derived from the > p2m type. This function is going to be called from dom0less common > code. s/is going to be/is/ ? Such a call exists already, after all. > --- a/xen/arch/riscv/include/asm/p2m.h > +++ b/xen/arch/riscv/include/asm/p2m.h > @@ -121,21 +121,22 @@ static inline int > guest_physmap_mark_populate_on_demand(struct domain *d, > return -EOPNOTSUPP; > } > > -static inline int guest_physmap_add_entry(struct domain *d, > - gfn_t gfn, mfn_t mfn, > - unsigned long page_order, > - p2m_type_t t) > -{ > -BUG_ON("unimplemented"); > -return -EINVAL; > -} > +/* > + * Map a region in the guest p2m with a specific p2m type. What is "the guest p2m"? In your answer, please consider the possible (and at some point likely necessary) existence of altp2m and nestedp2m. In patch 04 you introduce p2m_get_hostp2m(), and I expect it's that what you mean here. > --- a/xen/arch/riscv/p2m.c > +++ b/xen/arch/riscv/p2m.c > @@ -9,6 +9,41 @@ > > unsigned int __read_mostly p2m_root_order; > > +/* > + * Force a synchronous P2M TLB flush. > + * > + * Must be called with the p2m lock held. > + */ > +static void p2m_force_tlb_flush_sync(struct p2m_domain *p2m) > +{ > +struct domain *d = p2m->domain; Pointer-to-const please. Personally, given the implementation of this function (and also ... > +ASSERT(p2m_is_write_locked(p2m)); > + > +sbi_remote_hfence_gvma(d->dirty_cpumask, 0, 0); > + > +p2m->need_flush = false; > +} > + > +void p2m_tlb_flush_sync(struct p2m_domain *p2m) > +{ > +if ( p2m->need_flush ) > +p2m_force_tlb_flush_sync(p2m); > +} ... this one) I'd further ask for the function parameters to also be pointer-to-const, but Andrew may object to that. Andrew - it continues to be unclear to me under what conditions you agree with adding const, and under what conditions you would object to me asking for such. Please can you take the time to clarify this? > +/* Unlock the flush and do a P2M TLB flush if necessary */ > +void p2m_write_unlock(struct p2m_domain *p2m) > +{ > +/* > + * The final flush is done with the P2M write lock taken to avoid > + * someone else modifying the P2M wbefore the TLB invalidation has Nit: Stray 'w'. > + * completed. > + */ > +p2m_tlb_flush_sync(p2m); Wasn't the plan to have this be conditional? > @@ -139,3 +174,33 @@ int p2m_set_allocation(struct domain *d, unsigned long > pages, bool *preempted) > > return 0; > } > + > +static int p2m_set_range(struct p2m_domain *p2m, > + gfn_t sgfn, > + unsigned long nr, > + mfn_t smfn, > + p2m_type_t t) > +{ > +return -EOPNOTSUPP; > +} > + > +static int p2m_insert_mapping(struct p2m_domain *p2m, gfn_t start_gfn, > + unsigned long nr, mfn_t mfn, p2m_type_t t) > +{ > +int rc; > + > +p2m_write_lock(p2m); > +rc = p2m_set_range(p2m, start_gfn, nr, mfn, t); > +p2m_write_unlock(p2m); > + > +return rc; > +} > + > +int map_regions_p2mt(struct domain *d, > + gfn_t gfn, > + unsigned long nr, > + mfn_t mfn, > + p2m_type_t p2mt) > +{ > +return p2m_insert_mapping(p2m_get_hostp2m(d), gfn, nr, mfn, p2mt); > +} And eventually both helper functions will gain further callers? Otherwise it's a little hard to see why they would both need to be separate functions. Jan
Re: [PATCH -next v7 7/7] arm64: entry: Switch to generic IRQ entry
Hi Jinjie, The code changes look good to me, just a small missing clean up I believe. On 29/07/2025 02:54, Jinjie Ruan wrote: Currently, x86, Riscv, Loongarch use the generic entry. Convert arm64 to use the generic entry infrastructure from kernel/entry/*. The generic entry makes maintainers' work easier and codes more elegant. Switch arm64 to generic IRQ entry first, which removed duplicate 100+ LOC. The next patch serise will switch to generic entry completely later. Switch to generic entry in two steps according to Mark's suggestion will make it easier to review. I think the commit message could be clearer, especially since now this series only moves arm64 to generic IRQ entry and not the complete generic entry. What do you think of something like below ? It repeats a bit less and I think it helps understanding what is going on in this specific commit, as you already have details on the larger plans in the cover. Currently, x86, Riscv and Loongarch use the generic entry code, which makes maintainer's work easier and code more elegant. Start converting arm64 to use the generic entry infrastructure from kernel/entry/* by switching it to generic IRQ entry, which removes 100+ lines of duplicate code. arm64 will completely switch to generic entry in a later series. The changes are below: - Remove *enter_from/exit_to_kernel_mode(), and wrap with generic irqentry_enter/exit(). Also remove *enter_from/exit_to_user_mode(), and wrap with generic enter_from/exit_to_user_mode() because they are exactly the same so far. Nit : "so far" can be removed - Remove arm64_enter/exit_nmi() and use generic irqentry_nmi_enter/exit() because they're exactly the same, so the temporary arm64 version irqentry_state can also be removed. - Remove PREEMPT_DYNAMIC code, as generic entry do the same thing if arm64 implement arch_irqentry_exit_need_resched(). This feels unrelated, given that the part that needs `arch_irqentry_exit_need_resched()` is called whether or not PREEMPT_DYNAMIC is enabled ? Given my comments on patch 5, I feel that the commit message should mention explicitly the implementation of `arch_irqentry_exit_need_resched()` and why, even though it was already mentioned in patch 5. (This is what I was referencing in patch 5 : as I feel it's useful to mention again the reasons when implementing it, it doesn't feel too out of place to introduce the generic part at the same time. But again, I might be wrong here.) Then you can have another point explaining that `raw_irqentry_exit_cond_resched()` and the PREEMPT_DYNAMIC code is removed because they are identical to the generic entry code, similarly to your other points. Tested ok with following test cases on Qemu virt platform: - Perf tests. - Different `dynamic preempt` mode switch. - Pseudo NMI tests. - Stress-ng CPU stress test. - MTE test case in Documentation/arch/arm64/memory-tagging-extension.rst and all test cases in tools/testing/selftests/arm64/mte/*. Nit : I'm not sure if the commit message is the best place for this, given you already gave some details in the cover ? But I don't have much experience here, so I'll leave it up to you and others ! Suggested-by: Mark Rutland Signed-off-by: Jinjie Ruan --- [...] 
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c index db3f972f8cd9..1110eeb21f57 100644 --- a/arch/arm64/kernel/signal.c +++ b/arch/arm64/kernel/signal.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -1576,7 +1577,7 @@ static void handle_signal(struct ksignal *ksig, struct pt_regs *regs) * the kernel can handle, and then we build all the user-level signal handling * stack-frames in one go after that. */ -void do_signal(struct pt_regs *regs) +void arch_do_signal_or_restart(struct pt_regs *regs) Given that `do_signal(struct pt_regs *regs)` is defined in `arch/arm64/include/asm/exception.h`, and that there remains no users of `do_signal()`, I think it should be removed there. Thanks, Ada
Re: [PATCH -next v7 3/7] arm64: entry: Rework arm64_preempt_schedule_irq()
On 29/07/2025 02:54, Jinjie Ruan wrote: The generic entry code has the form: | raw_irqentry_exit_cond_resched() | { | if (!preempt_count()) { | ... | if (need_resched()) | preempt_schedule_irq(); | } | } In preparation for moving arm64 over to the generic entry code, align the structure of the arm64 code with raw_irqentry_exit_cond_resched() from the generic entry code. Signed-off-by: Jinjie Ruan --- Reviewed-by: Ada Couprie Diaz
Re: [PATCH -next v7 1/7] arm64: ptrace: Replace interrupts_enabled() with regs_irqs_disabled()
On 29/07/2025 02:54, Jinjie Ruan wrote: The generic entry code expects architecture code to provide regs_irqs_disabled(regs) function, but arm64 does not have this and provides inerrupts_enabled(regs), which has the opposite polarity. Nit: "interrupts_enabled(regs)" In preparation for moving arm64 over to the generic entry code, relace arm64's interrupts_enabled() with regs_irqs_disabled() and update its callers under arch/arm64. For the moment, a definition of interrupts_enabled() is provided for the GICv3 driver. Once arch/arm implement regs_irqs_disabled(), this can be removed. Delete the fast_interrupts_enabled() macro as it is unused and we don't want any new users to show up. No functional changes. Acked-by: Mark Rutland Suggested-by: Mark Rutland Signed-off-by: Jinjie Ruan --- Otherwise looks good to me ! Reviewed-by: Ada Couprie Diaz
Re: [PATCH -next v7 4/7] arm64: entry: Use preempt_count() and need_resched() helper
On 29/07/2025 02:54, Jinjie Ruan wrote: The generic entry code uses preempt_count() and need_resched() helpers to check if it should do preempt_schedule_irq(). Currently, arm64 use its own check logic, that is "READ_ONCE(current_thread_info()->preempt_count == 0", which is equivalent to "preempt_count() == 0 && need_resched()". In preparation for moving arm64 over to the generic entry code, use these helpers to replace arm64's own code and move it ahead. No functional changes. Signed-off-by: Jinjie Ruan --- Reviewed-by: Ada Couprie Diaz
Re: [PATCH -next v7 6/7] arm64: entry: Move arm64_preempt_schedule_irq() into __exit_to_kernel_mode()
On 29/07/2025 02:54, Jinjie Ruan wrote: The arm64 entry code only preempts a kernel context upon a return from a regular IRQ exception. The generic entry code may preempt a kernel context for any exception return where irqentry_exit() is used, and so may preempt other exceptions such as faults. In preparation for moving arm64 over to the generic entry code, align arm64 with the generic behaviour by calling arm64_preempt_schedule_irq() from exit_to_kernel_mode(). To make this possible, arm64_preempt_schedule_irq() and dynamic/raw_irqentry_exit_cond_resched() are moved earlier in the file, with no changes. As Mark pointed out, this change will have the following 2 key impact: - " We'll preempt even without taking a "real" interrupt. That shouldn't result in preemption that wasn't possible before, but it does change the probability of preempting at certain points, and might have a performance impact, so probably warrants a benchmark." - " We will not preempt when taking interrupts from a region of kernel code where IRQs are enabled but RCU is not watching, matching the behaviour of the generic entry code. This has the potential to introduce livelock if we can ever have a screaming interrupt in such a region, so we'll need to go figure out whether that's actually a problem. Having this as a separate patch will make it easier to test/bisect for that specifically." Suggested-by: Mark Rutland Signed-off-by: Jinjie Ruan --- Reviewed-by: Ada Couprie Diaz
Re: [PATCH -next v7 0/7] arm64: entry: Convert to generic irq entry
Hi Jinjie, On 29/07/2025 02:54, Jinjie Ruan wrote: Since commit a70e9f647f50 ("entry: Split generic entry into generic exception and syscall entry") split the generic entry into generic irq entry and generic syscall entry, it is time to convert arm64 to use the generic irq entry. And ARM64 will be completely converted to generic entry in the upcoming patch series. Note : I had to manually cherry-pick a70e9f647f50 when pulling the series on top of the Linux Arm Kernel for-next/core branch, but there might be something I'm missing here. The main convert steps are as follows: - Split generic entry into generic irq entry and generic syscall to make the single patch more concentrated in switching to one thing. - Make arm64 easier to use irqentry_enter/exit(). - Make arm64 closer to the PREEMPT_DYNAMIC code of generic entry. - Switch to generic irq entry. I reviewed the whole series and as expected it looks good ! Just a few nits here and there and some clarifications that I think could be useful. I'm not sure about the generic implementation of `arch_irqentry_exit_need_resched()` in patch 5, I would be tempted to move it to patch 7. I detail my thoughts more on the relevant patches, but I might be wrong and that feels like details : I don't think the code itself has issues. It was tested ok with following test cases on QEMU virt platform: - Perf tests. - Different `dynamic preempt` mode switch. - Pseudo NMI tests. - Stress-ng CPU stress test. - MTE test case in Documentation/arch/arm64/memory-tagging-extension.rst and all test cases in tools/testing/selftests/arm64/mte/*. The test QEMU configuration is as follows: qemu-system-aarch64 \ -M virt,gic-version=3,virtualization=on,mte=on \ -cpu max,pauth-impdef=on \ -kernel Image \ -smp 8,sockets=1,cores=4,threads=2 \ -m 512m \ -nographic \ -no-reboot \ -device virtio-rng-pci \ -append "root=/dev/vda rw console=ttyAMA0 kgdboc=ttyAMA0,115200 \ earlycon preempt=voluntary irqchip.gicv3_pseudo_nmi=1" \ -drive if=none,file=images/rootfs.ext4,format=raw,id=hd0 \ -device virtio-blk-device,drive=hd0 \ I'll spend some time testing the series now, specifically given patch 6's changes, but other than that everything I saw made sense and didn't look like it would be of concern to me. Thanks, Ada
Re: [PATCH -next v7 2/7] arm64: entry: Refactor the entry and exit for exceptions from EL1
Hi, On 29/07/2025 02:54, Jinjie Ruan wrote: The generic entry code uses irqentry_state_t to track lockdep and RCU state across exception entry and return. For historical reasons, arm64 embeds similar fields within its pt_regs structure. In preparation for moving arm64 over to the generic entry code, pull these fields out of arm64's pt_regs, and use a separate structure, matching the style of the generic entry code. No functional changes. As far as I understand and checked, we used the two fields in an exclusive fashion, so there is indeed no functional change. Suggested-by: Mark Rutland Signed-off-by: Jinjie Ruan --- [...] diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c index 8e798f46ad28..97e0741abde1 100644 --- a/arch/arm64/kernel/entry-common.c +++ b/arch/arm64/kernel/entry-common.c [...] @@ -475,73 +497,81 @@ UNHANDLED(el1t, 64, error) static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr) { unsigned long far = read_sysreg(far_el1); + arm64_irqentry_state_t state; - enter_from_kernel_mode(regs); + state = enter_from_kernel_mode(regs); Nit: There is some inconsistencies with some functions splitting state's definition and declaration (like el1_abort here), while some others do it on the same line (el1_undef() below for example). In some cases it is welcome as the entry function is called after some other work, but here for example it doesn't seem to be beneficial ? local_daif_inherit(regs); do_mem_abort(far, esr, regs); local_daif_mask(); - exit_to_kernel_mode(regs); + exit_to_kernel_mode(regs, state); } static void noinstr el1_pc(struct pt_regs *regs, unsigned long esr) { unsigned long far = read_sysreg(far_el1); + arm64_irqentry_state_t state; - enter_from_kernel_mode(regs); + state = enter_from_kernel_mode(regs); local_daif_inherit(regs); do_sp_pc_abort(far, esr, regs); local_daif_mask(); - exit_to_kernel_mode(regs); + exit_to_kernel_mode(regs, state); } static void noinstr el1_undef(struct pt_regs *regs, unsigned long esr) { - enter_from_kernel_mode(regs); + arm64_irqentry_state_t state = enter_from_kernel_mode(regs); + local_daif_inherit(regs); do_el1_undef(regs, esr); local_daif_mask(); - exit_to_kernel_mode(regs); + exit_to_kernel_mode(regs, state); } [...] Other than the small nit: Reviewed-by: Ada Couprie Diaz
Re: [PATCH v4 8/8] x86/mm: adjust loop in arch_init_memory() to iterate over the PDX space
On Tue, Aug 05, 2025 at 02:38:38PM +0200, Jan Beulich wrote: > On 05.08.2025 11:52, Roger Pau Monne wrote: > > --- a/xen/arch/x86/mm.c > > +++ b/xen/arch/x86/mm.c > > @@ -275,7 +275,7 @@ static void __init assign_io_page(struct page_info > > *page) > > > > void __init arch_init_memory(void) > > { > > -unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn; > > +unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn, > > pdx; > > > > /* > > * Basic guest-accessible flags: > > @@ -328,9 +328,20 @@ void __init arch_init_memory(void) > > destroy_xen_mappings((unsigned long)mfn_to_virt(iostart_pfn), > > (unsigned long)mfn_to_virt(ioend_pfn)); > > > > -/* Mark as I/O up to next RAM region. */ > > -for ( ; pfn < rstart_pfn; pfn++ ) > > +/* > > + * Mark as I/O up to next RAM region. Iterate over the PDX space > > to > > + * skip holes which would always fail the mfn_valid() check. > > + * > > + * pfn_to_pdx() requires a valid (iow: RAM) PFN to convert to PDX, > > + * hence provide pfn - 1, which is the tailing PFN from the last > > RAM > > + * range, or pdx 0 if the input pfn is 0. > > + */ > > +for ( pdx = pfn ? pfn_to_pdx(pfn - 1) + 1 : 0; > > + pdx < pfn_to_pdx(rstart_pfn); > > + pdx++ ) > > { > > +pfn = pdx_to_pfn(pdx); > > + > > if ( !mfn_valid(_mfn(pfn)) ) > > continue; > > > > As much as I would have liked to ack this, I fear there's another caveat here: > At the top of the loop we check not only for RAM, but also for UNUSABLE. The > latter, like RAM, shouldn't be marked I/O, but we also can't use PFN <-> PDX > transformations on any such page. Right you are. I'm not sure however why we do this - won't we want the mappings of UNUSABLE regions also be removed from the Xen page-tables? (but not marked as IO) I could do something like: /* Mark as I/O up to next RAM or UNUSABLE region. */ if ( (!pfn || pdx_is_region_compressible(pfn_to_paddr(pfn - 1), 1)) && pdx_is_region_compressible(pfn_to_paddr(rstart_pfn), 1) ) { /* * Iterate over the PDX space to skip holes which would always fail * the mfn_valid() check. * * pfn_to_pdx() requires a valid (iow: RAM) PFN to convert to PDX, * hence provide pfn - 1, which is the tailing PFN from the last * RAM range, or pdx 0 if the input pfn is 0. */ for ( pdx = pfn ? pfn_to_pdx(pfn - 1) + 1 : 0; pdx < pfn_to_pdx(rstart_pfn); pdx++ ) { pfn = pdx_to_pfn(pdx); if ( !mfn_valid(_mfn(pfn)) ) continue; assign_io_page(mfn_to_page(_mfn(pfn))); } } else { /* Slow path, iterate over the PFN space. */ for ( ; pfn < rstart_pfn; pfn++ ) { if ( !mfn_valid(_mfn(pfn)) ) continue; assign_io_page(mfn_to_page(_mfn(pfn))); } } But I find it a bit ugly - I might send v5 without this final patch while I see if I can find a better alternative. Thanks, Roger.
Re: [PATCH v2] xen: rework error handling in vcpu_create
On 05.08.2025 11:33, Stewart Hildebrand wrote: > On 8/5/25 05:17, Jan Beulich wrote: >> On 05.08.2025 11:06, Stewart Hildebrand wrote: >>> On 8/5/25 03:44, Jan Beulich wrote: On 04.08.2025 18:57, Stewart Hildebrand wrote: > On 8/4/25 03:57, Jan Beulich wrote: >> On 01.08.2025 22:24, Stewart Hildebrand wrote: >>> @@ -839,6 +839,9 @@ void sched_destroy_vcpu(struct vcpu *v) >>> { >>> struct sched_unit *unit = v->sched_unit; >>> >>> +if ( !unit ) >>> +return; >>> + >>> kill_timer(&v->periodic_timer); >>> kill_timer(&v->singleshot_timer); >>> kill_timer(&v->poll_timer); >> >> What if it's the 2nd error path in sched_init_vcpu() that is taken? >>> >>> ^^ This ^^ is what I'm confused about >> >> If sched_init_vcpu() took the indicated path, > > What path? In the one I'm looking at, sched_free_unit() gets called, Oh, I see - that wasn't quite obvious, though. Yet of course ... > setting v->sched_unit = NULL, and in that case ... > >> the if() you add here won't >> help, and ... > > ... the condition is true, and ... > >> Then we >> might take this path (just out of context here) >> >> if ( unit->vcpu_list == v ) >> { >> rcu_read_lock(&sched_res_rculock); >> >> sched_remove_unit(vcpu_scheduler(v), unit); >> sched_free_udata(vcpu_scheduler(v), unit->priv); >> >> and at least Credit1's hook doesn't look to be safe against being passed >> NULL. >> (Not to speak of the risk of unit->priv being used elsewhere while >> cleaning >> up.) ... this latter part of my remark still applies. IOW I continue to think that discussing the correctness of this change needs to be extended. Unless of course a scheduler maintainer wants to ack it as is. Jan
[PATCH v4 7/8] pdx: introduce a new compression algorithm based on region offsets
With the appearance of Intel Sierra Forest and Granite Rapids it's now possible to get a production x86 host with the following memory map: SRAT: Node 0 PXM 0 [, 7fff] SRAT: Node 0 PXM 0 [0001, 00807fff] SRAT: Node 1 PXM 1 [063e8000, 06be7fff] SRAT: Node 2 PXM 2 [0c7e8000, 0cfe7fff] SRAT: Node 3 PXM 3 [12be8000, 133e7fff] This is from a four socket Granite Rapids system, with each node having 512GB of memory. The total amount of RAM on the system is 2TB, but without enabling CONFIG_BIGMEM the last range is not accessible, as it's above the 16TB boundary covered by the frame table. Sierra Forest and Granite Rapids are socket compatible, however Sierra Forest only supports 2 socket configurations, while Granite Rapids can go up to 8 sockets. Note that while the memory map is very sparse, it couldn't be compressed using the current PDX_MASK compression algorithm, which relies on all ranges having a shared zeroed region of bits that can be removed. The memory map presented above has the property of all regions being similarly spaced between each other, and all having also a similar size. Use a lookup table to store the offsets to translate from/to PFN and PDX spaces. Such table is indexed based on the input PFN or PDX to translated. The example PFN layout about would get compressed using the following: PFN compression using PFN lookup table shift 29 and PDX region size 0x1000 range 0 [0, 0x807] PFN IDX 0 : 0 range 1 [0x00063e8, 0x0006be7] PFN IDX 3 : 0x00053e8 range 2 [0x000c7e8, 0x000cfe7] PFN IDX 6 : 0x000a7e8 range 3 [0x0012be8, 0x00133e7] PFN IDX 9 : 0x000fbe8 Note how the tow ranges belonging to node 0 get merged into a single PDX region by the compression algorithm. The default size of lookup tables currently set in Kconfig is 64 entries, and the example memory map consumes 10 entries. Such memory map is from a 4 socket Granite Rapids host, which in theory supports up to 8 sockets according to Intel documentation. Assuming the layout of a 8 socket system is similar to the 4 socket one, it would require 21 lookup table entries to support it, way below the current default of 64 entries. The valid range of lookup table size is currently restricted from 1 to 512 elements in Kconfig. An extra array is used to keep track of the base PFN for each translated range. Non used slots are set to ~0UL, so that in mfn_valid() the mfn < base check always fails, thus reporting the mfn as invalid. Introduce __init_or_pdx_mask and use it on some shared functions between PDX mask and offset compression, as otherwise some code becomes unreachable after boot if PDX offset compression is used. Mark the code as __init in that case, so it's pruned after boot. Signed-off-by: Roger Pau Monné --- Changes since v3: - Rename __init_or_pdx_mask. - Reorder logic in pfn_offset_sanitize_ranges() to reduce code. - Print a message if the table index is truncated from what would be required to represent all input ranges independently. - Cast size to paddr_t for pdx_init_mask() call. Changes since v2: - s/PDX_OFFSET_TLB_ORDER/PDX_OFFSET_TBL_ORDER/. - Fix changelog mention of Sapphire Rapids. - Misc fixes in the test harness. - Use SWAP() macro. - Introduce an extra table to keep the bases of the valid ranges. Changes since v1: - Use a lookup table with the offsets. - Split the adding of the test to a pre-patch. - Amend diagram to also show possible padding after compression. 
--- CHANGELOG.md | 3 + tools/tests/pdx/.gitignore | 1 + tools/tests/pdx/Makefile | 3 +- tools/tests/pdx/harness.h | 14 ++ tools/tests/pdx/test-pdx.c | 4 + xen/common/Kconfig | 21 ++- xen/common/pdx.c | 258 - xen/include/xen/pdx.h | 87 - 8 files changed, 385 insertions(+), 6 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 5f31ca08fe3f..f9ef893f4851 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -20,6 +20,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) grant table or foreign memory. ### Added + - Introduce new PDX compression algorithm to cope with Intel Sierra Forest and + Granite Rapids having sparse memory maps. + - On x86: - Option to attempt to fixup p2m page-faults on PVH dom0. - Resizable BARs is supported for PVH dom0. diff --git a/tools/tests/pdx/.gitignore b/tools/tests/pdx/.gitignore index a32c7db4de79..1202a531a7fd 100644 --- a/tools/tests/pdx/.gitignore +++ b/tools/tests/pdx/.gitignore @@ -1,2 +1,3 @@ /pdx.h /test-pdx-mask +/test-pdx-offset diff --git a/tools/tests/pdx/Makefile b/tools/tests/pdx/Makefile index b3734afde686..10b354f0cefd 100644 --- a/tools/tests/pdx/Makefile +++ b/tools/tests/pdx/Makefile @@ -1,7 +1,7 @@ XEN_ROOT=$(CURDIR)/../../.. include $(XEN_ROOT)/tool
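To make the translation scheme above concrete, the hot-path conversions with offset compression amount to roughly the following. This is a reconstruction from the description and the variables visible in the diff, not the literal pdx.h code; in particular the exact index calculation may differ:

    /* Index into the lookup tables, derived from the high bits of the input. */
    #define PFN_TBL_IDX(pfn) \
        (((pfn) >> pfn_index_shift) & (CONFIG_PDX_NR_LOOKUP - 1))
    #define PDX_TBL_IDX(pdx) \
        (((pdx) >> pdx_index_shift) & (CONFIG_PDX_NR_LOOKUP - 1))

    static inline unsigned long pfn_to_pdx(unsigned long pfn)
    {
        /* Each table slot holds the offset of its region in the PDX space. */
        return pfn - pfn_pdx_lookup[PFN_TBL_IDX(pfn)];
    }

    static inline unsigned long pdx_to_pfn(unsigned long pdx)
    {
        return pdx + pdx_pfn_lookup[PDX_TBL_IDX(pdx)];
    }

The mfn_valid() hunk quoted earlier in the thread then only needs to check the input PFN against pfn_bases[] and pdx_region_size for the same table slot.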
[PATCH v4 1/8] kconfig: turn PDX compression into a choice
Rename the current CONFIG_PDX_COMPRESSION to CONFIG_PDX_MASK_COMPRESSION, and make it part of the PDX compression choice block, in preparation for adding further PDX compression algorithms. The PDX compression defaults should still be the same for all architectures, however the choice block cannot be protected under EXPERT and still have a default choice being unconditionally selected. As a result, the new "PDX (Page inDeX) compression" item will be unconditionally visible in Kconfig, even on architectures like x86 that previously had no way to enable PDX compression. As part of this preparation work to introduce new PDX compressions, adjust some of the comments on pdx.h to note they apply to a specific PDX compression. Also shuffle function prototypes and dummy implementations around to make it easier to introduce a new PDX compression. Note all PDX compression implementations are expected to provide a pdx_is_region_compressible() that takes the same set of arguments. Signed-off-by: Roger Pau Monné Acked-by: Jan Beulich --- xen/arch/arm/setup.c | 2 +- xen/common/Kconfig| 18 +++--- xen/common/pdx.c | 4 ++-- xen/include/xen/pdx.h | 32 +++- 4 files changed, 37 insertions(+), 19 deletions(-) diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c index bb35afe56cec..a77b31071ed8 100644 --- a/xen/arch/arm/setup.c +++ b/xen/arch/arm/setup.c @@ -258,7 +258,7 @@ void __init init_pdx(void) paddr_t bank_start, bank_size, bank_end, ram_end = 0; int bank; -#ifdef CONFIG_PDX_COMPRESSION +#ifndef CONFIG_PDX_NONE /* * Arm does not have any restrictions on the bits to compress. Pass 0 to * let the common code further restrict the mask. diff --git a/xen/common/Kconfig b/xen/common/Kconfig index 16936418a6e6..8dad0c923a9d 100644 --- a/xen/common/Kconfig +++ b/xen/common/Kconfig @@ -57,9 +57,10 @@ config EVTCHN_FIFO If unsure, say Y. -config PDX_COMPRESSION - bool "PDX (Page inDeX) compression" if EXPERT && !X86 && !RISCV - default ARM || PPC +choice + prompt "PDX (Page inDeX) compression" + default PDX_MASK_COMPRESSION if !X86 && !RISCV + default PDX_NONE help PDX compression is a technique designed to reduce the memory overhead of physical memory management on platforms with sparse RAM @@ -72,6 +73,17 @@ config PDX_COMPRESSION If your platform does not have sparse RAM banks, do not enable PDX compression. +config PDX_MASK_COMPRESSION + bool "Mask compression" + help + Compression relying on all RAM addresses sharing a zeroed bit region. + +config PDX_NONE + bool "None" + help + No compression +endchoice + config ALTERNATIVE_CALL bool diff --git a/xen/common/pdx.c b/xen/common/pdx.c index b8384e6189df..00aa7e43006d 100644 --- a/xen/common/pdx.c +++ b/xen/common/pdx.c @@ -34,7 +34,7 @@ bool __mfn_valid(unsigned long mfn) { bool invalid = mfn >= max_page; -#ifdef CONFIG_PDX_COMPRESSION +#ifdef CONFIG_PDX_MASK_COMPRESSION invalid |= mfn & pfn_hole_mask; #endif @@ -55,7 +55,7 @@ void set_pdx_range(unsigned long smfn, unsigned long emfn) __set_bit(idx, pdx_group_valid); } -#ifdef CONFIG_PDX_COMPRESSION +#ifdef CONFIG_PDX_MASK_COMPRESSION /* * Diagram to make sense of the following variables. The masks and shifts diff --git a/xen/include/xen/pdx.h b/xen/include/xen/pdx.h index c1423d64a95b..8e373cac8b87 100644 --- a/xen/include/xen/pdx.h +++ b/xen/include/xen/pdx.h @@ -25,7 +25,7 @@ * this by keeping a bitmap of the ranges in the frame table containing * invalid entries and not allocating backing memory for them. 
* - * ## PDX compression + * ## PDX mask compression * * This is a technique to avoid wasting memory on machines known to have * split their machine address space in several big discontinuous and highly @@ -101,22 +101,13 @@ bool __mfn_valid(unsigned long mfn); #define paddr_to_pdx(pa) pfn_to_pdx(paddr_to_pfn(pa)) #define pdx_to_paddr(px) pfn_to_paddr(pdx_to_pfn(px)) -#ifdef CONFIG_PDX_COMPRESSION +#ifdef CONFIG_PDX_MASK_COMPRESSION extern unsigned long pfn_pdx_bottom_mask, ma_va_bottom_mask; extern unsigned int pfn_pdx_hole_shift; extern unsigned long pfn_hole_mask; extern unsigned long pfn_top_mask, ma_top_mask; -/** - * Validate a region's compatibility with the current compression runtime - * - * @param base Base address of the region - * @param npages Number of PAGE_SIZE-sized pages in the region - * @return True iff the region can be used with the current compression - */ -bool pdx_is_region_compressible(paddr_t base, unsigned long npages); - /** * Calculates a mask covering "moving" bits of all addresses of a region * @@ -209,7 +200,9 @@ static inline paddr_t directmapoff_to_maddr(unsigned long offset) */ void pfn_pdx_hole_setup(unsigned long mask); -#else /* !CONFIG_PDX_COMPRESSION */ +#endif /* CONFIG_PDX_MA
[PATCH v4 2/8] pdx: provide a unified set of unit functions
The current setup (pdx_init_mask() and pdx_region_mask()) and init (pfn_pdx_hole_setup()) PDX compression functions are tailored to the existing PDX compression algorithm. In preparation for introducing a new compression algorithm convert the setup and init functions to more generic interfaces that aren't tied to the compression in-use. To accomplish this introduce a function that registers all the PFN RAM ranges, plus an init function. This has the downside of requiring a static array to store such ranges ahead of being processed by the setup function, however it's the only way to pass all the possible information to the different compression setup functions without using per-compression specific setup functions. Per-arch compression setup also need to be adjusted to use the new interface. There's a slight ordering adjustment, in that after PDX compression setup the caller will check whether all the RAM regions are properly covered by the newly setup compression, otherwise compression is disabled by resetting to the initial values. No functional change intended in the resulting PDX compression values. Signed-off-by: Roger Pau Monné Acked-by: Jan Beulich --- Changes since v3: - Rename base to base_pfn and size to pages. Changes since v2: - Fix indentation. Changes since v1: - s/nr/nr_ranges/ --- xen/arch/arm/setup.c | 34 ++-- xen/arch/x86/srat.c | 28 ++ xen/common/pdx.c | 122 -- xen/include/xen/pdx.h | 73 + 4 files changed, 166 insertions(+), 91 deletions(-) diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c index a77b31071ed8..ba35bf1fe3bb 100644 --- a/xen/arch/arm/setup.c +++ b/xen/arch/arm/setup.c @@ -256,9 +256,11 @@ void __init init_pdx(void) { const struct membanks *mem = bootinfo_get_mem(); paddr_t bank_start, bank_size, bank_end, ram_end = 0; -int bank; +unsigned int bank; #ifndef CONFIG_PDX_NONE +for ( bank = 0 ; bank < mem->nr_banks; bank++ ) +pfn_pdx_add_region(mem->bank[bank].start, mem->bank[bank].size); /* * Arm does not have any restrictions on the bits to compress. Pass 0 to * let the common code further restrict the mask. @@ -266,26 +268,24 @@ void __init init_pdx(void) * If the logic changes in pfn_pdx_hole_setup we might have to * update this function too. 
*/ -uint64_t mask = pdx_init_mask(0x0); - -for ( bank = 0 ; bank < mem->nr_banks; bank++ ) -{ -bank_start = mem->bank[bank].start; -bank_size = mem->bank[bank].size; - -mask |= bank_start | pdx_region_mask(bank_start, bank_size); -} +pfn_pdx_compression_setup(0); for ( bank = 0 ; bank < mem->nr_banks; bank++ ) { -bank_start = mem->bank[bank].start; -bank_size = mem->bank[bank].size; - -if (~mask & pdx_region_mask(bank_start, bank_size)) -mask = 0; +if ( !pdx_is_region_compressible( + mem->bank[bank].start, + PFN_UP(mem->bank[bank].start + mem->bank[bank].size) - + PFN_DOWN(mem->bank[bank].start)) ) +{ +pfn_pdx_compression_reset(); +printk(XENLOG_WARNING + "PFN compression disabled, RAM region [%#" PRIpaddr ", %#" + PRIpaddr "] not covered\n", + mem->bank[bank].start, + mem->bank[bank].start + mem->bank[bank].size - 1); +break; +} } - -pfn_pdx_hole_setup(mask >> PAGE_SHIFT); #endif for ( bank = 0 ; bank < mem->nr_banks; bank++ ) diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c index 688f410287d4..747607439fb4 100644 --- a/xen/arch/x86/srat.c +++ b/xen/arch/x86/srat.c @@ -261,8 +261,6 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma) void __init acpi_numa_arch_fixup(void) {} -static uint64_t __initdata srat_region_mask; - static int __init cf_check srat_parse_region( struct acpi_subtable_header *header, const unsigned long end) { @@ -282,15 +280,13 @@ static int __init cf_check srat_parse_region( printk(KERN_INFO "SRAT: %013"PRIx64"-%013"PRIx64"\n", ma->base_address, ma->base_address + ma->length - 1); - srat_region_mask |= ma->base_address | - pdx_region_mask(ma->base_address, ma->length); + pfn_pdx_add_region(ma->base_address, ma->length); return 0; } void __init srat_parse_regions(paddr_t addr) { - u64 mask; unsigned int i; if (acpi_disabled || acpi_numa < 0 || @@ -299,19 +295,29 @@ void __init srat_parse_regions(paddr_t addr) /* Set "PXM" as early as feasible. */ numa_fw_nid_name = "PXM"; - srat_region_mask = pdx_init_mask(addr); acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY, srat_parse_region, 0); - for (mask = srat_region_mask, i
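For illustration only (not part of the posted patch), the caller pattern implied by the new interface boils down to the following, where nr_banks, bank_start() and bank_size() stand in for whatever per-arch RAM bank data is available:

    /* 1) Register every RAM range ahead of setting up compression. */
    for ( bank = 0; bank < nr_banks; bank++ )
        pfn_pdx_add_region(bank_start(bank), bank_size(bank));

    /* 2) Let the selected compression algorithm digest the ranges. */
    pfn_pdx_compression_setup(0);

    /* 3) Verify coverage; fall back to no compression otherwise. */
    for ( bank = 0; bank < nr_banks; bank++ )
        if ( !pdx_is_region_compressible(bank_start(bank),
                                         PFN_UP(bank_start(bank) + bank_size(bank)) -
                                         PFN_DOWN(bank_start(bank))) )
        {
            pfn_pdx_compression_reset();
            break;
        }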
[PATCH v4 5/8] test/pdx: add PDX compression unit tests
Introduce a set of unit tests for PDX compression. The unit tests contains both real and crafted memory maps that are then compressed using the selected PDX algorithm. Note the build system for the unit tests has been done in a way to support adding new compression algorithms easily. That requires generating a new test-pdx- executable that's build with the selected PDX compression enabled. Currently the only generated executable is test-pdx-mask that tests PDX mask compression. Signed-off-by: Roger Pau Monné --- Changes since v2: - Use set -e when running the tests. - Drop stray removal of pdx.c. - Use * instead of ? for header blank space capture. - Add some more memory maps. Changes since v1: - New in this version (partially pulled out from a different patch). --- tools/tests/Makefile | 1 + tools/tests/pdx/.gitignore | 2 + tools/tests/pdx/Makefile | 49 +++ tools/tests/pdx/harness.h | 84 tools/tests/pdx/test-pdx.c | 267 + xen/common/pdx.c | 4 + 6 files changed, 407 insertions(+) create mode 100644 tools/tests/pdx/.gitignore create mode 100644 tools/tests/pdx/Makefile create mode 100644 tools/tests/pdx/harness.h create mode 100644 tools/tests/pdx/test-pdx.c diff --git a/tools/tests/Makefile b/tools/tests/Makefile index 36928676a666..97ba2a13894d 100644 --- a/tools/tests/Makefile +++ b/tools/tests/Makefile @@ -9,6 +9,7 @@ ifneq ($(clang),y) SUBDIRS-$(CONFIG_X86) += x86_emulator endif SUBDIRS-y += xenstore +SUBDIRS-y += pdx SUBDIRS-y += rangeset SUBDIRS-y += vpci SUBDIRS-y += paging-mempool diff --git a/tools/tests/pdx/.gitignore b/tools/tests/pdx/.gitignore new file mode 100644 index ..a32c7db4de79 --- /dev/null +++ b/tools/tests/pdx/.gitignore @@ -0,0 +1,2 @@ +/pdx.h +/test-pdx-mask diff --git a/tools/tests/pdx/Makefile b/tools/tests/pdx/Makefile new file mode 100644 index ..b3734afde686 --- /dev/null +++ b/tools/tests/pdx/Makefile @@ -0,0 +1,49 @@ +XEN_ROOT=$(CURDIR)/../../.. +include $(XEN_ROOT)/tools/Rules.mk + +TARGETS := test-pdx-mask + +.PHONY: all +all: $(TARGETS) + +.PHONY: run +run: $(TARGETS) +ifeq ($(CC),$(HOSTCC)) + set -e; \ + for test in $? ; do \ + ./$$test ; \ + done +else + $(warning HOSTCC != CC, will not run test) +endif + +.PHONY: clean +clean: + $(RM) -- *.o $(TARGETS) $(DEPS_RM) pdx.h + +.PHONY: distclean +distclean: clean + $(RM) -- *~ + +.PHONY: install +install: all + $(INSTALL_DIR) $(DESTDIR)$(LIBEXEC)/tests + $(INSTALL_PROG) $(TARGETS) $(DESTDIR)$(LIBEXEC)/tests + +.PHONY: uninstall +uninstall: + $(RM) -- $(patsubst %,$(DESTDIR)$(LIBEXEC)/tests/%,$(TARGETS)) + +pdx.h: $(XEN_ROOT)/xen/include/xen/pdx.h + sed -E -e '/^#[[:space:]]*include/d' <$< >$@ + +CFLAGS += -D__XEN_TOOLS__ +CFLAGS += $(APPEND_CFLAGS) +CFLAGS += $(CFLAGS_xeninclude) + +test-pdx-mask: CFLAGS += -DCONFIG_PDX_MASK_COMPRESSION + +test-pdx-%: test-pdx.c pdx.h + $(CC) $(CPPFLAGS) $(CFLAGS) $(CFLAGS_$*.o) -o $@ $< $(APPEND_CFLAGS) + +-include $(DEPS_INCLUDE) diff --git a/tools/tests/pdx/harness.h b/tools/tests/pdx/harness.h new file mode 100644 index ..5bef7df650d2 --- /dev/null +++ b/tools/tests/pdx/harness.h @@ -0,0 +1,84 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Unit tests for PDX compression. 
+ * + * Copyright (C) 2025 Cloud Software Group + */ + +#ifndef _TEST_HARNESS_ +#define _TEST_HARNESS_ + +#include +#include +#include +#include +#include +#include + +#include + +#define __init +#define __initdata +#define __ro_after_init +#define cf_check + +#define printk printf +#define XENLOG_INFO +#define XENLOG_DEBUG +#define XENLOG_WARNING +#define KERN_INFO + +#define BITS_PER_LONG (unsigned int)(sizeof(unsigned long) * 8) + +#define PAGE_SHIFT12 +/* Some libcs define PAGE_SIZE in limits.h. */ +#undef PAGE_SIZE +#define PAGE_SIZE (1 << PAGE_SHIFT) +#define MAX_ORDER 18 /* 2 * PAGETABLE_ORDER (9) */ + +#define PFN_DOWN(x) ((x) >> PAGE_SHIFT) +#define PFN_UP(x) (((x) + PAGE_SIZE-1) >> PAGE_SHIFT) + +#define pfn_to_paddr(pfn) ((paddr_t)(pfn) << PAGE_SHIFT) +#define paddr_to_pfn(pa) ((unsigned long)((pa) >> PAGE_SHIFT)) + +#define MAX_RANGES 16 +#define MAX_PFN_RANGES MAX_RANGES + +#define ASSERT assert + +#define CONFIG_DEBUG + +static inline unsigned int find_next( +const unsigned long *addr, unsigned int size, unsigned int off, bool value) +{ +unsigned int i; + +ASSERT(size <= BITS_PER_LONG); + +for ( i = off; i < size; i++ ) +if ( !!(*addr & (1UL << i)) == value ) +return i; + +return size; +} + +#define find_next_zero_bit(a, s, o) find_next(a, s, o, false) +#define find_next_bit(a, s, o) find_next(a, s, o, true) + +#define boolean_param(name, func) + +typedef uint64_t paddr_t; + +#include "pdx.h" + +#endif + +/* + * Local variables: + * mode: C + * c-file-style: "BSD" + * c-basic-offset: 4 + * ind
[PATCH v4 6/8] pdx: move some helpers in preparation for new compression
Move fill_mask(), pdx_region_mask() and pdx_init_mask() to the !CONFIG_PDX_NONE section in preparation of them also being used by a newly added PDX compression. No functional change intended. Signed-off-by: Roger Pau Monné Acked-by: Jan Beulich --- git is not very helpful when generating the diff here, and it ends up moving everything around the functions instead of the functions themselves. --- xen/common/pdx.c | 118 +++ 1 file changed, 59 insertions(+), 59 deletions(-) diff --git a/xen/common/pdx.c b/xen/common/pdx.c index cd8a9e75a836..9e6b36086fbd 100644 --- a/xen/common/pdx.c +++ b/xen/common/pdx.c @@ -101,59 +101,6 @@ void __init pfn_pdx_add_region(paddr_t base, paddr_t size) ranges[nr_ranges++].pages = PFN_UP(base + size) - PFN_DOWN(base); } -#endif /* !CONFIG_PDX_NONE */ - -#ifdef CONFIG_PDX_MASK_COMPRESSION - -/* - * Diagram to make sense of the following variables. The masks and shifts - * are done on mfn values in order to convert to/from pdx: - * - * pfn_hole_mask - * pfn_pdx_hole_shift (mask bitsize) - * | - * |-| - * | | - * V V - * -- - * |HHH|0|LL| <--- mfn - * -- - * ^ ^ ^ ^ - * | | |--| - * | | | - * | | pfn_pdx_bottom_mask - * | | - * |---| - * | - * pfn_top_mask - * - * ma_{top,va_bottom}_mask is simply a shifted pfn_{top,pdx_bottom}_mask, - * where ma_top_mask has zeroes shifted in while ma_va_bottom_mask has - * ones. - */ - -/** Mask for the lower non-compressible bits of an mfn */ -unsigned long __ro_after_init pfn_pdx_bottom_mask = ~0UL; - -/** Mask for the lower non-compressible bits of an maddr or vaddr */ -unsigned long __ro_after_init ma_va_bottom_mask = ~0UL; - -/** Mask for the higher non-compressible bits of an mfn */ -unsigned long __ro_after_init pfn_top_mask = 0; - -/** Mask for the higher non-compressible bits of an maddr or vaddr */ -unsigned long __ro_after_init ma_top_mask = 0; - -/** - * Mask for a pdx compression bit slice. - * - * Invariant: valid(mfn) implies (mfn & pfn_hole_mask) == 0 - */ -unsigned long __ro_after_init pfn_hole_mask = 0; - -/** Number of bits of the "compressible" bit slice of an mfn */ -unsigned int __ro_after_init pfn_pdx_hole_shift = 0; - /* Sets all bits from the most-significant 1-bit down to the LSB */ static uint64_t fill_mask(uint64_t mask) { @@ -196,12 +143,6 @@ static uint64_t pdx_region_mask(uint64_t base, uint64_t len) return fill_mask(base ^ (base + len - 1)); } -bool pdx_is_region_compressible(paddr_t base, unsigned long npages) -{ -return !(paddr_to_pfn(base) & pfn_hole_mask) && - !(pdx_region_mask(base, npages * PAGE_SIZE) & ~ma_va_bottom_mask); -} - /** * Creates the mask to start from when calculating non-compressible bits * @@ -219,6 +160,65 @@ static uint64_t __init pdx_init_mask(uint64_t base_addr) (uint64_t)1 << (MAX_ORDER + PAGE_SHIFT)) - 1); } +#endif /* !CONFIG_PDX_NONE */ + +#ifdef CONFIG_PDX_MASK_COMPRESSION + +/* + * Diagram to make sense of the following variables. The masks and shifts + * are done on mfn values in order to convert to/from pdx: + * + * pfn_hole_mask + * pfn_pdx_hole_shift (mask bitsize) + * | + * |-| + * | | + * V V + * -- + * |HHH|0|LL| <--- mfn + * -- + * ^ ^ ^ ^ + * | | |--| + * | | | + * | | pfn_pdx_bottom_mask + * | | + * |---| + * | + * pfn_top_mask + * + * ma_{top,va_bottom}_mask is simply a shifted pfn_{top,pdx_bottom}_mask, + * where ma_top_mask has zeroes shifted in while ma_va_bottom_mask has + * ones. 
+ */ + +/** Mask for the lower non-compressible bits of an mfn */ +unsigned long __ro_after_init pfn_pdx_bottom_mask = ~0UL; + +/** Mask for the lower non-compressible bits of an maddr or vaddr */ +unsigned long __ro_after_init ma_va_bottom_mask = ~0UL; + +/** Mask for the higher non-compressible bits of an mfn */ +unsigned long __ro_after_init pfn_top_mask = 0; + +/** Mask for the higher non-compressible bits of an maddr or vaddr */ +unsigned long __ro_after_init ma_top_mask = 0; + +/** + * Mask for a pdx compression bit slice. + * + * Invariant: valid(mfn) implies (mfn & pfn_hole_mask) == 0 + */ +unsigned long __ro_after_init pfn_hole_mask = 0; + +/** Number of bits of the "compressible" bit slice of an mfn */ +unsigned int __ro_
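For readers less familiar with these helpers, a worked example of the semantics described in the comments above (the concrete numbers are illustrative only):

    /*
     * fill_mask() sets every bit from the most-significant 1-bit down:
     *     fill_mask(0x04001000) == 0x07ffffff
     *
     * pdx_region_mask(base, len) == fill_mask(base ^ (base + len - 1)),
     * i.e. the bits that vary within the region and hence cannot be
     * compressed away.  For a 2GiB bank at 8GiB:
     *     pdx_region_mask(0x200000000, 0x80000000)
     *         == fill_mask(0x200000000 ^ 0x27fffffff)
     *         == fill_mask(0x7fffffff)
     *         == 0x7fffffff
     */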
[PATCH v4 4/8] pdx: allow per-arch optimization of PDX conversion helpers
There are four performance critical PDX conversion helpers that do the PFN to/from PDX and the physical addresses to/from directmap offsets translations. In the absence of an active PDX compression, those functions would still do the calculations needed, just to return the same input value as no translation is in place and hence PFN and PDX spaces are identity mapped. To reduce the overhead of having to do the pointless calculations allow architectures to implement the translation helpers in a per-arch header. Rename the existing conversion functions to add a trailing _xlate suffix, so that the per-arch headers can define the non suffixed versions. Currently only x86 implements meaningful custom handlers to short circuit the translation when not active, using asm goto. Other architectures use generic macros that map the non-xlate to the xlate variants to keep the previous behavior. Signed-off-by: Roger Pau Monné --- Changes since v3: - Use has_include. Changes since v2: - Consume the skip label as parameter in the PDX optimize macro. Changes since v1: - Pull return out of OPTIMIZE_PDX macro. - undef OPTIMIZE_PDX. --- Would it make sense to move the x86 implementation to the common pdx.h header and let architectures define PDX_ASM_GOTO_SKIP instead? --- xen/arch/x86/include/asm/cpufeatures.h | 1 + xen/arch/x86/include/asm/pdx.h | 71 ++ xen/arch/x86/srat.c| 6 ++- xen/common/pdx.c | 10 ++-- xen/include/xen/pdx.h | 29 --- 5 files changed, 106 insertions(+), 11 deletions(-) create mode 100644 xen/arch/x86/include/asm/pdx.h diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h index bc108c3819e8..71308d9dafc8 100644 --- a/xen/arch/x86/include/asm/cpufeatures.h +++ b/xen/arch/x86/include/asm/cpufeatures.h @@ -42,6 +42,7 @@ XEN_CPUFEATURE(XEN_IBT, X86_SYNTH(27)) /* Xen uses CET Indirect Branch XEN_CPUFEATURE(IBPB_ENTRY_PV, X86_SYNTH(28)) /* MSR_PRED_CMD used by Xen for PV */ XEN_CPUFEATURE(IBPB_ENTRY_HVM,X86_SYNTH(29)) /* MSR_PRED_CMD used by Xen for HVM */ XEN_CPUFEATURE(USE_VMCALL,X86_SYNTH(30)) /* Use VMCALL instead of VMMCALL */ +XEN_CPUFEATURE(PDX_COMPRESSION, X86_SYNTH(31)) /* PDX compression */ /* Bug words follow the synthetic words. */ #define X86_NR_BUG 1 diff --git a/xen/arch/x86/include/asm/pdx.h b/xen/arch/x86/include/asm/pdx.h new file mode 100644 index ..6be7e1185eb1 --- /dev/null +++ b/xen/arch/x86/include/asm/pdx.h @@ -0,0 +1,71 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ + +#ifndef X86_PDX_H +#define X86_PDX_H + +#include + +/* + * Introduce a macro to avoid repeating the same asm goto block in each helper. + * Note the macro is strictly tied to the code in the helpers. 
+ */ +#define PDX_ASM_GOTO(label) \ +asm_inline goto ( \ +ALTERNATIVE(\ +"", \ +"jmp %l0", \ +ALT_NOT(X86_FEATURE_PDX_COMPRESSION)) \ +: : : : label ) + +static inline unsigned long pfn_to_pdx(unsigned long pfn) +{ +PDX_ASM_GOTO(skip); + +return pfn_to_pdx_xlate(pfn); + + skip: +return pfn; +} + +static inline unsigned long pdx_to_pfn(unsigned long pdx) +{ +PDX_ASM_GOTO(skip); + +return pdx_to_pfn_xlate(pdx); + + skip: +return pdx; +} + +static inline unsigned long maddr_to_directmapoff(paddr_t ma) +{ +PDX_ASM_GOTO(skip); + +return maddr_to_directmapoff_xlate(ma); + + skip: +return ma; +} + +static inline paddr_t directmapoff_to_maddr(unsigned long offset) +{ +PDX_ASM_GOTO(skip); + +return directmapoff_to_maddr_xlate(offset); + + skip: +return offset; +} + +#undef PDX_ASM_GOTO_SKIP + +#endif /* X86_PDX_H */ + +/* + * Local variables: + * mode: C + * c-file-style: "BSD" + * c-basic-offset: 4 + * indent-tabs-mode: nil + * End: + */ diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c index 747607439fb4..42ccb0c0f3ae 100644 --- a/xen/arch/x86/srat.c +++ b/xen/arch/x86/srat.c @@ -298,7 +298,8 @@ void __init srat_parse_regions(paddr_t addr) acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY, srat_parse_region, 0); - pfn_pdx_compression_setup(addr); + if (!pfn_pdx_compression_setup(addr)) + return; /* Ensure all RAM ranges in the e820 are covered. */ for (i = 0; i < e820.nr_map; i++) { @@ -318,6 +319,9 @@ void __init srat_parse_regions(paddr_t addr) return; } } + + /* If we got this far compression is working as expected. */ + setup_force_cpu_cap(X86_FEATURE_PDX_COMPRESSION); } unsigned int numa_node_to_arch_nid(nodeid_t n) diff --git a/xen/common/pdx.c b/xen/common/pdx.c index f4a3dcf6cb60..c9ec86729151 100644 --- a/xen/c
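The xen/include/xen/pdx.h hunk is truncated above; for the non-x86 architectures the generic mapping mentioned in the commit message presumably reduces to something along these lines (a sketch of the fallback, not the actual hunk):

    /* Fall back to the translating variants when no <asm/pdx.h> exists. */
    #if __has_include(<asm/pdx.h>)
    # include <asm/pdx.h>
    #else
    # define pfn_to_pdx             pfn_to_pdx_xlate
    # define pdx_to_pfn             pdx_to_pfn_xlate
    # define maddr_to_directmapoff  maddr_to_directmapoff_xlate
    # define directmapoff_to_maddr  directmapoff_to_maddr_xlate
    #endif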
[PATCH v4 8/8] x86/mm: adjust loop in arch_init_memory() to iterate over the PDX space
There's a loop in arch_init_memory() that iterates over holes and non-RAM regions to possibly mark any page_info structures matching those addresses as IO. The looping there is done over the PFN space. PFNs not covered by the PDX space will always fail the mfn_valid() check, hence re-write the loop to iterate over the PDX space and avoid checking any holes that are not covered by the PDX translation. On a system with a ~6TiB hole this change together with using PDX compression reduces boot time in approximately 20 seconds. Xen boot time without the change is ~50s, with the change it's ~30s. Signed-off-by: Roger Pau Monné --- Changes since v3: - Ensure parameter to pfn_to_pdx() is always RAM. Changes since v2: - New in this version. --- xen/arch/x86/mm.c | 17 ++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index e7fd56c7ce90..e27dc28cfdd6 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -275,7 +275,7 @@ static void __init assign_io_page(struct page_info *page) void __init arch_init_memory(void) { -unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn; +unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn, pdx; /* * Basic guest-accessible flags: @@ -328,9 +328,20 @@ void __init arch_init_memory(void) destroy_xen_mappings((unsigned long)mfn_to_virt(iostart_pfn), (unsigned long)mfn_to_virt(ioend_pfn)); -/* Mark as I/O up to next RAM region. */ -for ( ; pfn < rstart_pfn; pfn++ ) +/* + * Mark as I/O up to next RAM region. Iterate over the PDX space to + * skip holes which would always fail the mfn_valid() check. + * + * pfn_to_pdx() requires a valid (iow: RAM) PFN to convert to PDX, + * hence provide pfn - 1, which is the tailing PFN from the last RAM + * range, or pdx 0 if the input pfn is 0. + */ +for ( pdx = pfn ? pfn_to_pdx(pfn - 1) + 1 : 0; + pdx < pfn_to_pdx(rstart_pfn); + pdx++ ) { +pfn = pdx_to_pfn(pdx); + if ( !mfn_valid(_mfn(pfn)) ) continue; -- 2.49.0
[PATCH v4 3/8] pdx: introduce command line compression toggle
Introduce a command line option to allow disabling PDX compression. The disabling is done by turning pfn_pdx_add_region() into a no-op, so when attempting to initialize the selected compression algorithm the array of ranges to compress is empty. Signed-off-by: Roger Pau Monné Reviewed-by: Jan Beulich --- Changes since v1: - New in this version. --- docs/misc/xen-command-line.pandoc | 9 + xen/common/pdx.c | 14 +- 2 files changed, 22 insertions(+), 1 deletion(-) diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc index 6865a61220ca..5dd2ab82b7aa 100644 --- a/docs/misc/xen-command-line.pandoc +++ b/docs/misc/xen-command-line.pandoc @@ -2072,6 +2072,15 @@ for all of them (`true`), only for those subject to XPTI (`xpti`) or for those not subject to XPTI (`no-xpti`). The feature is used only in case INVPCID is supported and not disabled via `invpcid=false`. +### pdx-compress +> `= ` + +> Default: `true` if CONFIG_PDX_NONE is unset + +Only relevant when the hypervisor is build with PFN PDX compression. Controls +whether Xen will engage in PFN compression. The algorithm used for PFN +compression is selected at build time from Kconfig. + ### ple_gap > `= ` diff --git a/xen/common/pdx.c b/xen/common/pdx.c index c5ea58873c0e..f4a3dcf6cb60 100644 --- a/xen/common/pdx.c +++ b/xen/common/pdx.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include @@ -76,9 +77,13 @@ static struct pfn_range { } ranges[MAX_PFN_RANGES] __initdata; static unsigned int __initdata nr_ranges; +static bool __initdata pdx_compress = true; +boolean_param("pdx-compress", pdx_compress); + void __init pfn_pdx_add_region(paddr_t base, paddr_t size) { -if ( !size ) +/* Without ranges there's no PFN compression. */ +if ( !size || !pdx_compress ) return; if ( nr_ranges >= ARRAY_SIZE(ranges) ) @@ -215,6 +220,13 @@ void __init pfn_pdx_compression_setup(paddr_t base) unsigned int i, j, bottom_shift = 0, hole_shift = 0; unsigned long mask = pdx_init_mask(base) >> PAGE_SHIFT; +if ( !nr_ranges ) +{ +printk(XENLOG_DEBUG "PFN compression disabled%s\n", + pdx_compress ? ": no ranges provided" : ""); +return; +} + if ( nr_ranges > ARRAY_SIZE(ranges) ) { printk(XENLOG_WARNING -- 2.49.0
[PATCH v4 0/8] pdx: introduce a new compression algorithm
Hello,

This series implements a new PDX compression algorithm to cope with the sparse memory maps found on the Intel Sierra Forest and Granite Rapids.

Patches 1 to 6 prepare the existing code to make it easier to introduce a new PDX compression, including generalizing the initialization and setup functions and adding a unit test for PDX compression. Patch 7 introduces the new compression. The new compression is only enabled by default on x86; other architectures are left with their previous defaults. Finally, patch 8 optimizes one x86 loop that was iterating over pfn ranges to instead use pdx values.

Thanks, Roger.

Roger Pau Monne (8):
  kconfig: turn PDX compression into a choice
  pdx: provide a unified set of unit functions
  pdx: introduce command line compression toggle
  pdx: allow per-arch optimization of PDX conversion helpers
  test/pdx: add PDX compression unit tests
  pdx: move some helpers in preparation for new compression
  pdx: introduce a new compression algorithm based on region offsets
  x86/mm: adjust loop in arch_init_memory() to iterate over the PDX space

 CHANGELOG.md                           |   3 +
 docs/misc/xen-command-line.pandoc      |   9 +
 tools/tests/Makefile                   |   1 +
 tools/tests/pdx/.gitignore             |   3 +
 tools/tests/pdx/Makefile               |  50 +++
 tools/tests/pdx/harness.h              |  98 ++
 tools/tests/pdx/test-pdx.c             | 271 
 xen/arch/arm/setup.c                   |  36 +--
 xen/arch/x86/include/asm/cpufeatures.h |   1 +
 xen/arch/x86/include/asm/pdx.h         |  71 
 xen/arch/x86/mm.c                      |  17 +-
 xen/arch/x86/srat.c                    |  30 +-
 xen/common/Kconfig                     |  37 ++-
 xen/common/pdx.c                       | 432 +++--
 xen/include/xen/pdx.h                  | 213 
 15 files changed, 1139 insertions(+), 133 deletions(-)
 create mode 100644 tools/tests/pdx/.gitignore
 create mode 100644 tools/tests/pdx/Makefile
 create mode 100644 tools/tests/pdx/harness.h
 create mode 100644 tools/tests/pdx/test-pdx.c
 create mode 100644 xen/arch/x86/include/asm/pdx.h
-- 2.49.0
Re: [PATCH v1 03/25] xen/x86: complement PG_log_dirty wrapping
On 03.08.2025 11:47, Penny Zheng wrote: > --- a/xen/include/hypercall-defs.c > +++ b/xen/include/hypercall-defs.c > @@ -198,7 +198,9 @@ dm_op(domid_t domid, unsigned int nr_bufs, > xen_dm_op_buf_t *bufs) > sysctl(xen_sysctl_t *u_sysctl) > #endif > domctl(xen_domctl_t *u_domctl) > +#if PG_log_dirty > paging_domctl_cont(xen_domctl_t *u_domctl) > +#endif > #ifndef CONFIG_PV_SHIM_EXCLUSIVE > platform_op(xen_platform_op_t *u_xenpf_op) > #endif > @@ -294,6 +296,8 @@ dm_op compat do > compat do do > hypfs_op do do do do do > #endif > mcado do --- > +#if PG_log_dirty > paging_domctl_cont do do do do - > +#endif While putting together my pair of patches, I figured that using PG_log_dirty here is wrong. asm/paging.h isn't (and cannot easily be) included, and hence the compiler will consider PG_log_dirty uniformly 0, no matter what .config holds. Jan
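To illustrate the point (this is standard preprocessor behaviour, not a proposed change): an identifier that isn't #define'd in the translation unit evaluates to 0 inside an #if expression, so without asm/paging.h in scope

    #if PG_log_dirty            /* behaves as "#if 0" whatever .config says */
    paging_domctl_cont(xen_domctl_t *u_domctl)
    #endif

unconditionally drops the entry.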
Re: [PATCH v3 02/20] xen/riscv: introduce sbi_remote_hfence_gvma_vmid()
On 8/4/25 3:55 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> It instructs the remote harts to execute one or more HFENCE.GVMA instructions by making an SBI call, covering the range of guest physical addresses between start_addr and start_addr + size only for the given VMID.
>>
>> The remote fence operation applies to the entire address space if either:
>> - start_addr and size are both 0, or
>> - size is equal to 2^XLEN-1.
>>
>> Signed-off-by: Oleksii Kurochko
> Acked-by: Jan Beulich
> with perhaps a similar on-commit edit as suggested for patch 1.

I would be happy if you could do that during merge.

Thanks.

~ Oleksii
Re: [PATCH v3 01/20] xen/riscv: implement sbi_remote_hfence_gvma()
On 05.08.2025 16:45, Oleksii Kurochko wrote: > On 8/4/25 3:52 PM, Jan Beulich wrote: >> On 31.07.2025 17:58, Oleksii Kurochko wrote: >>> + * Returns 0 if IPI was sent to all the targeted harts successfully >>> + * or negative value if start_addr or size is not valid. >> This similarly is ambiguous: The union of the success case stated and the >> error case stated isn't obviously all possible states. The success >> statement in particular alludes to the possibility of an IPI not actually >> reaching its target. > > The same as above this is what SBI spec. tells. > > I've not checked SBI code deeply, but it seems like the code is waiting while > IPI will be reached as looking at the code: > /** >* As this this function only handlers scalar values of hart mask, it > must be >* set to all online harts if the intention is to send IPIs to all the > harts. >* If hmask is zero, no IPIs will be sent. >*/ > int sbi_ipi_send_many(ulong hmask, ulong hbase, u32 event, void *data) > { > ... > > /* Send IPIs */ > do { > retry_needed = false; > sbi_hartmask_for_each_hart(i, &target_mask) { > rc = sbi_ipi_send(scratch, i, event, data); > if (rc == SBI_IPI_UPDATE_RETRY) > retry_needed = true; > else > sbi_hartmask_clear_hart(i, > &target_mask); > } > } while (retry_needed); > > /* Sync IPIs */ > sbi_ipi_sync(scratch, event); > > return 0; > } > and > static int sbi_ipi_sync(struct sbi_scratch *scratch, u32 event) > { > const struct sbi_ipi_event_ops *ipi_ops; > > if ((SBI_IPI_EVENT_MAX <= event) || > !ipi_ops_array[event]) > return SBI_EINVAL; > ipi_ops = ipi_ops_array[event]; > > if (ipi_ops->sync) > ipi_ops->sync(scratch); > > return 0; > } > which calls: > static void tlb_sync(struct sbi_scratch *scratch) > { > atomic_t *tlb_sync = > sbi_scratch_offset_ptr(scratch, tlb_sync_off); > > while (atomic_read(tlb_sync) > 0) { > /* >* While we are waiting for remote hart to set the sync, >* consume fifo requests to avoid deadlock. >*/ > tlb_process_once(scratch); > } > > return; > } I'll leave that comment as-is then, even if I'm not really happy with it. >>> + * The remote fence operation applies to the entire address space if >>> either: >>> + * - start_addr and size are both 0, or >>> + * - size is equal to 2^XLEN-1. >> Whose XLEN is this? The guest's? The host's? (I assume the latter, but it's >> not unambiguous, unless there's specific terminology that I'm unaware of, >> yet which would make this unambiguous.) > > RISC-V spec quite mixes the terminology (3.1.6.2. Base ISA Control in mstatus > Register) > around XLEN: >For RV64 harts, the SXL and UXL fields are WARL fields that control the > value >of XLEN for S-mode and U-mode, respectively. The encoding of these fields > is >the same as the MXL field of misa, shown in Table 9. The effective XLEN in >S-mode and U-mode are termed SXLEN and UXLEN, respectively > > Basically, RISC-V privileged architecture defines different XLEN values for > various privilege modes: > - MXLEN for Machine mode > - SXLEN for Supervisor mode. > - HSXLEN for Hypervisor-Supervisor mode. > - VSXLEN for Virtual Supervisor mode. > > Considering that SBI is an API that is provided for S-mode I expect that XLEN > = SXLEN > in this case, but SBI spec. is using just XLEN. Very helpful. Jan
Re: [PATCH v4 4/8] pdx: allow per-arch optimization of PDX conversion helpers
On 05.08.2025 16:20, Roger Pau Monné wrote: > On Tue, Aug 05, 2025 at 02:11:22PM +0200, Jan Beulich wrote: >> On 05.08.2025 11:52, Roger Pau Monne wrote: >>> There are four performance critical PDX conversion helpers that do the PFN >>> to/from PDX and the physical addresses to/from directmap offsets >>> translations. >>> >>> In the absence of an active PDX compression, those functions would still do >>> the calculations needed, just to return the same input value as no >>> translation is in place and hence PFN and PDX spaces are identity mapped. >>> >>> To reduce the overhead of having to do the pointless calculations allow >>> architectures to implement the translation helpers in a per-arch header. >>> Rename the existing conversion functions to add a trailing _xlate suffix, >>> so that the per-arch headers can define the non suffixed versions. >>> >>> Currently only x86 implements meaningful custom handlers to short circuit >>> the translation when not active, using asm goto. Other architectures use >>> generic macros that map the non-xlate to the xlate variants to keep the >>> previous behavior. >>> >>> Signed-off-by: Roger Pau Monné >> >> Once again: >> Reviewed-by: Jan Beulich > > Thanks, I didn't carry your RB due to the changes requested by Andrew, > that was a bit too much to carry a RB after those. Of course, and I didn't mean to suggest you should have kept it. All I wanted to indicate is that I'm as okay withe change as I was before. Jan
Re: [PATCH v3 12/20] xen/riscv: implement p2m_set_range()
On 31.07.2025 17:58, Oleksii Kurochko wrote: > This patch introduces p2m_set_range() and its core helper p2m_set_entry() for Nit: This patch doesn't introduce p2m_set_range(); it merely fleshes it out. > --- a/xen/arch/riscv/include/asm/p2m.h > +++ b/xen/arch/riscv/include/asm/p2m.h > @@ -7,11 +7,13 @@ > #include > #include > > +#include > #include > > extern unsigned int p2m_root_order; > #define P2M_ROOT_ORDER p2m_root_order > #define P2M_ROOT_PAGES BIT(P2M_ROOT_ORDER, U) > +#define P2M_ROOT_LEVEL HYP_PT_ROOT_LEVEL I think I commented on this before, and I would have hoped for at least a remark in the description to appear (perhaps even a comment here): It's okay(ish) to tie these together for now, but in the longer run I don't expect this is going to be wanted. If e.g. we ran Xen in Sv57 mode, there would be no reason at all to force all P2Ms to use 5 levels of page tables. > @@ -175,13 +179,257 @@ int p2m_set_allocation(struct domain *d, unsigned long > pages, bool *preempted) > return 0; > } > > +/* > + * Find and map the root page table. The caller is responsible for > + * unmapping the table. > + * > + * The function will return NULL if the offset into the root table is > + * invalid. > + */ > +static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn) > +{ > +unsigned long root_table_indx; > + > +root_table_indx = gfn_x(gfn) >> XEN_PT_LEVEL_ORDER(P2M_ROOT_LEVEL); Right now page table layouts / arrangements are indeed similar enough to share accessor constructs. Nevertheless I find it problematic (doc-wise at the very least) that a Xen page table construct is used to access a P2M page table. If and when these needed to be decoupled, it would likely help of the distinction was already made, by - for now - simply introducing aliases (here e.g. P2M_LEVEL_ORDER(), expanding to XEN_PT_LEVEL_ORDER() for the time being). > +if ( root_table_indx >= P2M_ROOT_PAGES ) > +return NULL; > + > +return __map_domain_page(p2m->root + root_table_indx); > +} > + > +static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte) > +{ > +write_pte(p, pte); > +if ( clean_pte ) > +clean_dcache_va_range(p, sizeof(*p)); Not necessarily for right away, but if multiple adjacent PTEs are written without releasing the lock, this then redundant cache flushing can be a performance issue. > +} > + > +static inline void p2m_clean_pte(pte_t *p, bool clean_pte) > +{ > +pte_t pte; > + > +memset(&pte, 0, sizeof(pte)); Why memset()? Why not simply give the variable an appropriate initializer? Or use ... > +p2m_write_pte(p, pte, clean_pte); ... a compound literal here, like you do ... > +} > + > +static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t) > +{ > +panic("%s: hasn't been implemented yet\n", __func__); > + > +return (pte_t) { .pte = 0 }; ... here? (Just {} would also do, if I'm not mistaken.) > +} > + > +#define P2M_TABLE_MAP_NONE 0 > +#define P2M_TABLE_MAP_NOMEM 1 > +#define P2M_TABLE_SUPER_PAGE 2 > +#define P2M_TABLE_NORMAL 3 > + > +/* > + * Take the currently mapped table, find the corresponding the entry > + * corresponding to the GFN, and map the next table, if available. Nit: Double "corresponding". > + * The previous table will be unmapped if the next level was mapped > + * (e.g P2M_TABLE_NORMAL returned). > + * > + * `alloc_tbl` parameter indicates whether intermediate tables should > + * be allocated when not present. > + * > + * Return values: > + * P2M_TABLE_MAP_NONE: a table allocation isn't permitted. > + * P2M_TABLE_MAP_NOMEM: allocating a new page failed. 
> + * P2M_TABLE_SUPER_PAGE: next level or leaf mapped normally. > + * P2M_TABLE_NORMAL: The next entry points to a superpage. > + */ > +static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl, > + unsigned int level, pte_t **table, > + unsigned int offset) > +{ > +panic("%s: hasn't been implemented yet\n", __func__); > + > +return P2M_TABLE_MAP_NONE; > +} > + > +/* Free pte sub-tree behind an entry */ > +static void p2m_free_subtree(struct p2m_domain *p2m, > + pte_t entry, unsigned int level) > +{ > +panic("%s: hasn't been implemented yet\n", __func__); > +} > + > +/* > + * Insert an entry in the p2m. This should be called with a mapping > + * equal to a page/superpage. > + */ > +static int p2m_set_entry(struct p2m_domain *p2m, > + gfn_t gfn, > + unsigned long page_order, > + mfn_t mfn, > + p2m_type_t t) > +{ > +unsigned int level; > +unsigned int target = page_order / PAGETABLE_ORDER; > +pte_t *entry, *table, orig_pte; > +int rc; > +/* A mapping is removed if the MFN is invalid. */ > +bool removing_mapping = mfn_eq(mfn, INVALID_MFN); Comment and code don't fit together. Many MFNs are invalid (any for which mfn_valid(
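A sketch of the compound-literal p2m_clean_pte() variant suggested earlier in this reply, assuming nothing else needs initialising (illustrative only, not a requested diff):

    static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
    {
        /* An all-zero PTE; no intermediate local or memset() needed. */
        p2m_write_pte(p, (pte_t){}, clean_pte);
    }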
[PATCH] x86/hpet: do local APIC EOI after interrupt processing
The current logic in the HPET interrupt ->ack() hook will perform a local APIC EOI ahead of enabling interrupts, possibly leading to recursion in the interrupt handler. Fix this by doing the local APIC EOI strictly after the window with interrupt enabled, as that prevents the recursion, and would only allow for interrupts with higher priority to be serviced. Use the generic ack_nonmaskable_msi_irq() and end_nonmaskable_irq() functions, removing the need for hpet_msi_ack(). Reported-by: Andrew Cooper Fixes: 3ba523ff957c ('CPUIDLE: enable MSI capable HPET for timer broadcast') Signed-off-by: Roger Pau Monné --- xen/arch/x86/hpet.c | 10 ++ 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/xen/arch/x86/hpet.c b/xen/arch/x86/hpet.c index 192de433cfd1..d05b5eb1361f 100644 --- a/xen/arch/x86/hpet.c +++ b/xen/arch/x86/hpet.c @@ -303,13 +303,6 @@ static unsigned int cf_check hpet_msi_startup(struct irq_desc *desc) #define hpet_msi_shutdown hpet_msi_mask -static void cf_check hpet_msi_ack(struct irq_desc *desc) -{ -irq_complete_move(desc); -move_native_irq(desc); -ack_APIC_irq(); -} - static void cf_check hpet_msi_set_affinity( struct irq_desc *desc, const cpumask_t *mask) { @@ -337,7 +330,8 @@ static hw_irq_controller hpet_msi_type = { .shutdown = hpet_msi_shutdown, .enable= hpet_msi_unmask, .disable= hpet_msi_mask, -.ack= hpet_msi_ack, +.ack= ack_nonmaskable_msi_irq, +.end= end_nonmaskable_irq, .set_affinity = hpet_msi_set_affinity, }; -- 2.49.0
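For context, the effective ordering after this change is roughly the following (hook names from the patch; the surrounding do_IRQ() handling is paraphrased and elided):

    desc->handler->ack(desc);          /* ack_nonmaskable_msi_irq(): no EOI */
    local_irq_enable();
    /* ... run the HPET interrupt handler ... */
    local_irq_disable();
    desc->handler->end(desc, vector);  /* end_nonmaskable_irq(): EOI here */

so a second HPET interrupt can no longer nest inside the handler.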
Re: [PATCH] x86/hpet: do local APIC EOI after interrupt processing
On 05/08/2025 5:03 pm, Roger Pau Monne wrote:
> The current logic in the HPET interrupt ->ack() hook will perform a local
> APIC EOI ahead of enabling interrupts, possibly leading to recursion in the
> interrupt handler.
>
> Fix this by doing the local APIC EOI strictly after the window with
> interrupt enabled, as that prevents the recursion, and would only allow for
> interrupts with higher priority to be serviced.
>
> Use the generic ack_nonmaskable_msi_irq() and end_nonmaskable_irq()
> functions, removing the need for hpet_msi_ack().
>
> Reported-by: Andrew Cooper
> Fixes: 3ba523ff957c ('CPUIDLE: enable MSI capable HPET for timer broadcast')
> Signed-off-by: Roger Pau Monné

Reviewed-by: Andrew Cooper

Good riddance to the mess in our patchqueue caused by this.
Re: [PATCH v1 1/3] arm/pci: Add pci-scan boot argument
Hi Jan, > On 4 Aug 2025, at 09:10, Jan Beulich wrote: > > On 01.08.2025 11:22, Mykyta Poturai wrote: >> From: Edward Pickup >> >> This patch adds a Xen boot arguments that, if enabled, causes a call to >> existing code to scan pci devices enumerated by the firmware. >> >> This patch also makes an existing debug function viewable outside its >> translation unit, and uses this to dump the PCI devices found. >> The debug message is controlled by config DEBUG. >> >> Additionally, this patch modifies segment loading to ensure that PCI >> devices on other segments are properly found. >> >> This will be needed ahead of dom0less support for pci passthrough on >> arm. >> >> Signed-off-by: Luca Fancellu >> Signed-off-by: Edward Pickup > > Considering the From: above and this order of S-o-b: Who is it really that > was the original author here? I think this patch was mine, probably some issues from Edward, anyway he doesn’t work at arm anymore. Cheers, Luca
Re: [PATCH v2] xen/dom0less: arm: fix hwdom 1:1 low memory allocation
Hi Grygorii,

On 05/08/2025 20:00, Grygorii Strashko wrote:
> From: Grygorii Strashko
>
> Call stack for dom0less hwdom case (1:1) memory:
>  create_domUs
>  |-construct_domU
>    |-construct_hwdom()
>      |-allocate_memory_11()
>
> And allocate_memory_11() uses "dom0_mem" as:
>  min_low_order = get_order_from_bytes(min_t(paddr_t, dom0_mem, MB(128)));
>
> In case of dom0less boot the "dom0_mem" is not used and defaulted to 0,

From docs/misc/xen-command-line.pandoc:

---
### dom0_mem (ARM)
> `= `

Set the amount of memory for the initial domain (dom0). It must be greater than zero. This parameter is required.
---

If dom0_mem is effectively optional, then shouldn't the doc be updated?

> which causes min_low_order to get a high value > order and so no allocation
> happens from low memory.
>
> Fix it by using kinfo->unassigned_mem instead of "dom0_mem", which has the
> correct memory size in both cases: regular dom0 boot and dom0less boot.
>
> Fixes: 52cb53f1816a ("xen/arm: dom0less hwdom construction")
> Signed-off-by: Grygorii Strashko
> Reviewed-by: Denis Mukhin
> Reviewed-by: Jason Andryuk

Acked-by: Julien Grall

Cheers,

-- 
Julien Grall
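The patch body itself isn't quoted in the reply above; going by the description, the fix in allocate_memory_11() presumably amounts to something like (a sketch, using only the names from the commit message):

    min_low_order = get_order_from_bytes(
        min_t(paddr_t, kinfo->unassigned_mem, MB(128)));

which yields a sensible low-memory order for both regular dom0 and dom0less hwdom boots.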
Re: [PATCH v2 1/3] xen/arm: irq: drop unreachable pirq callbacks
Hi, On 05/08/2025 19:40, Grygorii Strashko wrote: From: Grygorii Strashko Since commit 341f271cf86f ("xen/evtchn: fully restrict concept of pIRQ for arches with pIRQ support only"), the corresponding Arm arch pIRQ callbacks become unreachable, so drop them. Signed-off-by: Grygorii Strashko Acked-by: Andrew Cooper Reviewed-by: Denis Mukhin Acked-by: Julien Grall Cheers, -- Julien Grall
Re: [PATCH 1/2] x86/mm: drop paging_get_mode()
On 05/08/2025 8:58 am, Jan Beulich wrote: > The function was introduced without any caller, and never gained any. > Thus it has always been violating Misra rule 2.1 (unreachable code). > > Fixes: dd6de3ab9985 ("Implement Nested-on-Nested") > Signed-off-by: Jan Beulich Acked-by: Andrew Cooper I'm still not certain how this will shake out in nested-virt, but I'm reasonably sure it wont end up like this. ~Andrew
Re: Xen 4.21 Development Update [June-July]
On 05.08.25 20:19, Oleksii Kurochko wrote:
Hello everyone,

This email only tracks big items for the xen.git tree. Please reply for items you would like to see in 4.21 so that people have an idea what is going on and can prioritise accordingly. You're welcome to provide a description and use cases of the feature you're working on.

= Timeline =

The current release schedule can be found here:
https://wiki.xenproject.org/wiki/Xen_Project_X.YY_Release_Notes

And as a reminder, at the end of this week we will have the Last posting date (Fri Aug 08, 2025).

= Updates =

The following items (the links for them can be found in the list below) were moved to completed:

[since Jun2 - Aug5]:
Added some tags: [4.21], [next-rel(s)] to the list "Full list of items" below.

* x86:
 - kexec: add kexec support to Mini-OS. And with that:
   - PVH xenstore-stubdom live update
 - x86: memcpy() / memset() (non-)ERMS flavors plus fallout
* Arm:
 - SMMU handling for PCIe Passthrough on ARM.
 - Add support for R-Car Gen4 PCI host controller.
 - First chunk for Arm R82 and MPU support.
 - Enable R52 support for the first chunk of MPU support
 - ARM split hardware and control domains.
* RISC-V:
 - Introduce basic UART support and interrupts for hypervisor mode.
* Hypervisor:
 - tools: support for per-domain Xenstore features

Juergen
Re: [PATCH 1/2] efi: Call FreePages only if needed
On 05.08.2025 18:32, Ross Lagerwall wrote: > If the config file is builtin, cfg.addr will be zero but Xen > unconditionally calls FreePages() on the address. > > Xen may also call FreePages() with a zero address if blexit() is called > after this point since cfg.need_to_free is not set to false. > > The UEFI specification does not say whether calling FreePages() with a > zero address is allowed so let's be cautious and use cfg.need_to_free > properly. Well, no, this paragraph makes no sense. Of course this is allowed, but not as no-op behavior (like free(NULL) would be), but to free memory starting at 0. > Signed-off-by: Ross Lagerwall This pretty clearly wants a Fixes: tag, or maybe it even needs to be two. I've checked the original code in 4.2, and things were consistent there, afaics. So breakage was introduced perhaps in one or two of the many re-works. Jan
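For reference, the kind of guarded free the patch description is after would look roughly like this (a sketch; the cfg.size field, efi_bs and PFN_UP usage are assumed from the surrounding boot code rather than taken from the patch itself):

    if ( cfg.need_to_free )
    {
        efi_bs->FreePages(cfg.addr, PFN_UP(cfg.size));
        cfg.need_to_free = false;
    }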
Re: Xen 4.21 Development Update [June-July]
On 05.08.2025 20:19, Oleksii Kurochko wrote: > * Full list of items : * > > = Projects = > > == Hypervisor == > > * [4.21] xen/console: cleanup console input switch logic (v5) > - Denis Mukhin > - > https://lore.kernel.org/xen-devel/20250530231841.73386-1-dmuk...@ford.com/ > > * [4.21] xen: introduce CONFIG_SYSCTL (v4 -> v8) > - Penny Zheng > - > https://lore.kernel.org/xen-devel/20250711043158.2566880-1-penny.zh...@amd.com/ > > * [4.21] Several CI cleanups and improvements, plus yet another new runner > - Marek Marczykowski-Górecki > - > https://lore.kernel.org/xen-devel/cover.7da1777882774486a13e6f39ff4a2096f6b7901e.1744028549.git-series.marma...@invisiblethingslab.com/ > - > https://patchew.org/Xen/cover.7da1777882774486a13e6f39ff4a2096f6b7901e.1744028549.git-series.marma...@invisiblethingslab.com/ > > * [4.21] automation: Refresh the remaining Debian containers (v2) > - Javi Merino > - > https://lore.kernel.org/xen-devel/cover.1730743077.git.javi.mer...@cloud.com/T/#m5d9acb7cf5db3c2be3d6527de14b69b07812314e > > * [4.21] MSI-X support with qemu in stubdomain, and other related > changes (v8) > - Marek Marczykowski-Górecki > - > https://lore.kernel.org/xen-devel/cover.33fb4385b7dd6c53bda4acf0a9e91748b3d7b1f7.1715313192.git-series.marma...@invisiblethingslab.com/ > - Only automation patch left to be reviewed/merged. > > * [next-rel(s)] Physical address hypercall ABI ("HVMv2") > - Teddy Astie > - > https://lore.kernel.org/xen-devel/cover.1744981654.git.teddy.as...@vates.tech/ > > * [next-rel(s)] xen: Untangle mm.h > - Andrew Cooper > - > https://lore.kernel.org/xen-devel/20250312174513.4075066-1-andrew.coop...@citrix.com/ > - > https://patchew.org/Xen/20250312174513.4075066-1-andrew.coop...@citrix.com/ > > * [next-rel(s)] Add support for exact-node memory claims > - Alejandro Vallejo > - > https://lore.kernel.org/xen-devel/20250314172502.53498-1-alejandro.vall...@cloud.com/ > - > https://patchew.org/Xen/20250314172502.53498-1-alejandro.vall...@cloud.com/ > > * [next-rel(s)] Remove the directmap (v5) > - Alejandro Vallejo > - > https://lore.kernel.org/xen-devel/20250108151822.16030-1-alejandro.vall...@cloud.com/ > - > https://patchew.org/Xen/20250108151822.16030-1-alejandro.vall...@cloud.com/ > > * [next-rel(s)] GRUB: Supporting Secure Boot of xen.gz (v1) > - Ross Lagerwall > - > https://patchew.org/Xen/20240313150748.791236-1-ross.lagerw...@citrix.com/ > > * [next-rel(s)] Introduce xenbindgen to autogen hypercall structs (v1) > - Alejandro Vallejo > - > https://patchew.org/Xen/20241115115200.2824-1-alejandro.vall...@cloud.com/ > > * [next-rel(s)] Introduce NS8250 UART emulator (v2) > - Denis Mukhin > - > https://patchew.org/Xen/20241205-vuart-ns8250-v1-0-e9aa92312...@ford.com/ > > * [next-rel(s)] xen: framework for UART emulators > - Denis Mukhin > - > https://lore.kernel.org/xen-devel/20250624035443.344099-1-dmuk...@ford.com/ > > === x86 === > * [4.21] x86/asm: cleanups after toolchain baseline upgrade (v1 -> v2) > - Denis Mukhin > - > https://lore.kernel.org/xen-devel/20250403182250.3329498-1-dmuk...@ford.com/ > - https://patchew.org/Xen/20250403182250.3329498-1-dmuk...@ford.com/ > > * [4.21?] x86/efi: Fix booting when NX is disabled (v1 -> v2) > - Andrew Cooper > - > https://patchew.org/Xen/20240722101838.3946983-1-andrew.coop...@citrix.com/ > - > https://lore.kernel.org/xen-devel/20240722101838.3946983-1-andrew.coop...@citrix.com/ > > * [4.21?] 
Hyperlaunch device tree for dom0 (v6) > - Alejandro Vallejo > - https://patchew.org/Xen/20250429123629.20839-1-agarc...@amd.com/ > - > https://lore.kernel.org/xen-devel/20250429123629.20839-1-agarc...@amd.com/ > > * [4.21?] Boot modules for Hyperlaunch (v9) > - Daniel P. Smith > - > https://lore.kernel.org/xen-devel/20241115131204.32135-1-dpsm...@apertussolutions.com/ > - > https://patchew.org/Xen/20241115131204.32135-1-dpsm...@apertussolutions.com/ > > * [4.21?] Address Space Isolation FPU preparations (v2->v3) > - Alejandro Vallejo > - > https://patchew.org/Xen/20250110132823.24348-1-alejandro.vall...@cloud.com/ > > * [next-rel(s)] Hyperlaunch domain builder > - Daniel P. Smith > - > https://lore.kernel.org/xen-devel/20250515131744.3843-1-dpsm...@apertussolutions.com/ > > * [next-rel(s)] Confidential computing and AMD SEV support > - Teddy Astie > - https://patchew.org/Xen/cover.1747312394.git.teddy.as...@vates.tech/ > - > https://lore.kernel.org/xen-devel/cover.1747312394.git.teddy.as...@vates.tech/ > > * [next-rel(s)] amd-cppc CPU Performance Scaling Driver (v5 -> v6) > - Penny Zheng > - > https://lore.kernel.org/xen-devel/20250711035106.2540522-1-penny.zh...@amd.com/ > > * [next-rel(s)] x86: Trenchboot Secure Launch DRTM (Xen) (v1 -> v3) > - Sergii Dmytruk > - https://patchew.org/Xen/cover.1745172094.git.sergii.dmyt...@3mdeb.com/ >
Re: [PATCH -next v7 5/7] arm64: entry: Refactor preempt_schedule_irq() check code
On 2025/8/5 23:06, Ada Couprie Diaz wrote: > Hi Jinjie, > > On 29/07/2025 02:54, Jinjie Ruan wrote: >> ARM64 requires an additional check whether to reschedule on return >> from interrupt. So add arch_irqentry_exit_need_resched() as the default >> NOP implementation and hook it up into the need_resched() condition in >> raw_irqentry_exit_cond_resched(). This allows ARM64 to implement >> the architecture specific version for switching over to >> the generic entry code. > I was a bit confused by this, as I didn't see the link with the generic > entry > given you implement `raw_irqentry_exit_cond_resched()` in arch/arm64 > as well in this patch : I expected the arm64 implementation to be added. > I share more thoughts below. > > What do you think about something along those lines ? > > Compared to the generic entry code, arm64 does additional checks > when deciding to reschedule on return from an interrupt. > Introduce arch_irqentry_exit_need_resched() in the need_resched() > condition > of the generic raw_irqentry_exit_cond_resched(), with a NOP default. > This will allow arm64 to implement its architecture specific checks > when > switching over to the generic entry code. This revision makes it easier for people to understand. > >> [...] >> diff --git a/kernel/entry/common.c b/kernel/entry/common.c >> index b82032777310..4aa9656fa1b4 100644 >> --- a/kernel/entry/common.c >> +++ b/kernel/entry/common.c >> @@ -142,6 +142,20 @@ noinstr irqentry_state_t irqentry_enter(struct >> pt_regs *regs) >> return ret; >> } >> +/** >> + * arch_irqentry_exit_need_resched - Architecture specific need >> resched function >> + * >> + * Invoked from raw_irqentry_exit_cond_resched() to check if need >> resched. > Very nit : "to check if resched is needed." ? This is good. >> + * Defaults return true. >> + * >> + * The main purpose is to permit arch to skip preempt a task from an >> IRQ. > If feel that "to avoid preemption of a task" instead of "to skip preempt > a task" > would make this much clearer, what do you think ? Yes, this is more clearer. >> + */ >> +static inline bool arch_irqentry_exit_need_resched(void); >> + >> +#ifndef arch_irqentry_exit_need_resched >> +static inline bool arch_irqentry_exit_need_resched(void) { return >> true; } >> +#endif >> + > > I've had some trouble reviewing this patch : on the one hand because > I didn't notice `arch_irqentry_exit_need_resched()` was added in > the common entry code, which is on me ! > On the other hand, I felt that the patch itself was a bit disconnected : > we add `arch_irqentry_exit_need_resched()` in the common entry code, > with a default NOP, but in the same function we add to arm64, > while mentioning that this is for arm64's additional checks, > which we only implement in patch 7. > > Would it make sense to move the `arch_irqentry_exit_need_resched()` > part of the patch to patch 7, so that the introduction and > arch-specific implementation appear together ? > To me it seems easier to wrap my head around, as it would look like > "Move arm64 to generic entry, but it does additional checks : add a new > arch-specific function controlling re-scheduling, defaulting to true, > and implement it for arm64". I feel it could help making patch 7's > commit message clearer as well. > > From what I gathered on the archive `arch_irqentry_exit_need_resched()` > being added here was suggested previously, so others might not have the > same opinion. > > Maybe improving the commit message and comment for this would be enough > as well, as per my suggestions above. 
> > > Otherwise the changes make sense and I don't see any functional issues ! > > Thanks, > Ada > >
Re: [PATCH 2/2] efi: Stop using StdErr
On 05.08.2025 18:32, Ross Lagerwall wrote: > Xen's use of StdErr is inconsistent. Some boot errors are reported using > PrintErr() which uses StdErr and some are reported using blexit() which > uses StdOut. ... with PrintErrMesg() having StdOut = StdErr; apparently to address at least some of the inconsistencies. Perhaps blexit(), when not passed NULL, should similarly override StdOut. > On my test system using OVMF, StdErr is not displayed on the emulated > screen. Looking at other EFI applications, StdErr is just used for debug > messages if at all. That's hardly how StdErr was meant to be used. And at the risk of being flamed for saying so, looking at other EFI applications (without saying of what prominence or origin they are) can hardly serve as a justification. If OVMF doesn't set up StdErr correctly (despite being configured / set up correctly), and if that can't be fixed there, imo what you want as a workaround is a command line option to override StdErr by StdOut even when SystemTable->StdErr is non-NULL. Along the lines of the comment further up, inconsistencies in the use of StdErr vs StdOut may want addressing (separately). But of course, not being EFI maintainer anymore, all of what I said above may be deemed entirely meaningless by the new maintainers. Jan
Re: [PATCH -next v7 5/7] arm64: entry: Refactor preempt_schedule_irq() check code
On 2025/8/5 23:06, Ada Couprie Diaz wrote: > Hi Jinjie, > > On 29/07/2025 02:54, Jinjie Ruan wrote: >> ARM64 requires an additional check whether to reschedule on return >> from interrupt. So add arch_irqentry_exit_need_resched() as the default >> NOP implementation and hook it up into the need_resched() condition in >> raw_irqentry_exit_cond_resched(). This allows ARM64 to implement >> the architecture specific version for switching over to >> the generic entry code. > I was a bit confused by this, as I didn't see the link with the generic > entry > given you implement `raw_irqentry_exit_cond_resched()` in arch/arm64 > as well in this patch : I expected the arm64 implementation to be added. > I share more thoughts below. > > What do you think about something along those lines ? > > Compared to the generic entry code, arm64 does additional checks > when deciding to reschedule on return from an interrupt. > Introduce arch_irqentry_exit_need_resched() in the need_resched() > condition > of the generic raw_irqentry_exit_cond_resched(), with a NOP default. > This will allow arm64 to implement its architecture specific checks > when > switching over to the generic entry code. > >> [...] >> diff --git a/kernel/entry/common.c b/kernel/entry/common.c >> index b82032777310..4aa9656fa1b4 100644 >> --- a/kernel/entry/common.c >> +++ b/kernel/entry/common.c >> @@ -142,6 +142,20 @@ noinstr irqentry_state_t irqentry_enter(struct >> pt_regs *regs) >> return ret; >> } >> +/** >> + * arch_irqentry_exit_need_resched - Architecture specific need >> resched function >> + * >> + * Invoked from raw_irqentry_exit_cond_resched() to check if need >> resched. > Very nit : "to check if resched is needed." ? >> + * Defaults return true. >> + * >> + * The main purpose is to permit arch to skip preempt a task from an >> IRQ. > If feel that "to avoid preemption of a task" instead of "to skip preempt > a task" > would make this much clearer, what do you think ? >> + */ >> +static inline bool arch_irqentry_exit_need_resched(void); >> + >> +#ifndef arch_irqentry_exit_need_resched >> +static inline bool arch_irqentry_exit_need_resched(void) { return >> true; } >> +#endif >> + > > I've had some trouble reviewing this patch : on the one hand because > I didn't notice `arch_irqentry_exit_need_resched()` was added in > the common entry code, which is on me ! > On the other hand, I felt that the patch itself was a bit disconnected : > we add `arch_irqentry_exit_need_resched()` in the common entry code, > with a default NOP, but in the same function we add to arm64, > while mentioning that this is for arm64's additional checks, > which we only implement in patch 7. Yes, it does. > > Would it make sense to move the `arch_irqentry_exit_need_resched()` > part of the patch to patch 7, so that the introduction and > arch-specific implementation appear together ? > To me it seems easier to wrap my head around, as it would look like > "Move arm64 to generic entry, but it does additional checks : add a new > arch-specific function controlling re-scheduling, defaulting to true, > and implement it for arm64". I feel it could help making patch 7's > commit message clearer as well. > > From what I gathered on the archive `arch_irqentry_exit_need_resched()` > being added here was suggested previously, so others might not have the > same opinion. Yes, introduce `arch_irqentry_exit_need_resched()` here may help understand the patch's refactoring purpose. 
> > Maybe improving the commit message and comment for this would be enough > as well, as per my suggestions above. Thank you! I'll improve the commit message and comment. > > > Otherwise the changes make sense and I don't see any functional issues ! > > Thanks, > Ada > >
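For readers following along, the shape of the hook under discussion is roughly the following (a simplified sketch based on the quoted hunk, not the full patch; arm64's own implementation with its additional checks only appears in patch 7):

    /* kernel/entry/common.c: overridable default, as in the quoted hunk. */
    #ifndef arch_irqentry_exit_need_resched
    static inline bool arch_irqentry_exit_need_resched(void) { return true; }
    #endif

    void raw_irqentry_exit_cond_resched(void)
    {
        if (!preempt_count()) {
            /* ... existing sanity checks elided ... */
            if (need_resched() && arch_irqentry_exit_need_resched())
                preempt_schedule_irq();
        }
    }

An architecture overrides the default by defining its own arch_irqentry_exit_need_resched() and the matching macro, so the generic loop stays unchanged.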
Re: [PATCH v10 2/4] vpci/rebar: Free Rebar resources when init_rebar() fails
On 06.08.2025 05:39, Chen, Jiqian wrote: > On 2025/8/5 16:48, Jan Beulich wrote: >> On 05.08.2025 05:49, Jiqian Chen wrote: >>> --- a/xen/drivers/vpci/rebar.c >>> +++ b/xen/drivers/vpci/rebar.c >>> @@ -49,6 +49,32 @@ static void cf_check rebar_ctrl_write(const struct >>> pci_dev *pdev, >>> bar->guest_addr = bar->addr; >>> } >>> >>> +static int cf_check cleanup_rebar(const struct pci_dev *pdev) >>> +{ >>> +int rc; >>> +uint32_t ctrl; >>> +unsigned int nbars; >>> +unsigned int rebar_offset = pci_find_ext_capability(pdev->sbdf, >>> + >>> PCI_EXT_CAP_ID_REBAR); >>> + >>> +if ( !rebar_offset || !is_hardware_domain(pdev->domain) ) >>> +{ >>> +ASSERT_UNREACHABLE(); >>> +return 0; >>> +} >>> + >>> +ctrl = pci_conf_read32(pdev->sbdf, rebar_offset + PCI_REBAR_CTRL(0)); >>> +nbars = MASK_EXTR(ctrl, PCI_REBAR_CTRL_NBAR_MASK); >>> + >>> +rc = vpci_remove_registers(pdev->vpci, rebar_offset + PCI_REBAR_CAP(0), >>> + PCI_REBAR_CTRL(nbars - 1)); >>> +if ( rc ) >>> +printk(XENLOG_ERR "%pd %pp: fail to remove Rebar handlers rc=%d\n", >>> + pdev->domain, &pdev->sbdf, rc); >> >> MSI and MSI-X (now) have ASSERT_UNREACHABLE() on their respective paths. Is >> there a reason this shouldn't be done here as well? > Will add. > >> >> MSI and MSI-X further have another add-register below here, to ensure the >> control register cannot be written. Again - is there a reason the same >> shouldn't be done here? (If so, I think this may want putting in a comment.) > Since extended capabilities are only exposed to dom0, and dom0 has no > limitations to access devices. > It since there is not much point in adding such a handler for rebar. While the effect is different from MSI / MSI-X, isn't it still a problem if Dom0 changed BAR sizes entirely behind Xen's back? Which could be prevented by similarly discarding writes to the control register, as is done for the other two? Jan
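A sketch of what "discarding writes to the control register" could look like for ReBar, mirroring the MSI/MSI-X read-only handlers elsewhere in this series; this only illustrates the suggestion (placed inside cleanup_rebar(), reusing its rc/nbars/rebar_offset, and assuming the existing vpci_hw_read32() helper), it is not code from the posted patch:

    for ( unsigned int i = 0; i < nbars; i++ )
    {
        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, NULL,
                               rebar_offset + PCI_REBAR_CTRL(i), 4, NULL);
        if ( rc )
            printk(XENLOG_ERR "%pd %pp: fail to add ReBar ctrl handler rc=%d\n",
                   pdev->domain, &pdev->sbdf, rc);
    }

Whether pinning the whole control register read-only is acceptable for dom0, or whether only the size field should be filtered, is exactly what the thread is debating.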
Re: [PATCH v10 4/4] vpci/msix: Free MSIX resources when init_msix() fails
On 06.08.2025 05:35, Chen, Jiqian wrote: > On 2025/8/5 16:43, Jan Beulich wrote: >> On 05.08.2025 05:49, Jiqian Chen wrote: >>> --- a/xen/drivers/vpci/msix.c >>> +++ b/xen/drivers/vpci/msix.c >>> @@ -655,6 +655,48 @@ int vpci_make_msix_hole(const struct pci_dev *pdev) >>> return 0; >>> } >>> >>> +static int cf_check cleanup_msix(const struct pci_dev *pdev) >>> +{ >>> +int rc; >>> +struct vpci *vpci = pdev->vpci; >>> +const unsigned int msix_pos = pdev->msix_pos; >>> + >>> +if ( !msix_pos ) >>> +return 0; >>> + >>> +rc = vpci_remove_registers(vpci, msix_control_reg(msix_pos), 2); >>> +if ( rc ) >>> +{ >>> +printk(XENLOG_ERR "%pd %pp: fail to remove MSIX handlers rc=%d\n", >>> + pdev->domain, &pdev->sbdf, rc); >>> +ASSERT_UNREACHABLE(); >>> +return rc; >>> +} >>> + >>> +if ( vpci->msix ) >>> +{ >>> +list_del(&vpci->msix->next); >>> +for ( unsigned int i = 0; i < ARRAY_SIZE(vpci->msix->table); i++ ) >>> +if ( vpci->msix->table[i] ) >>> +iounmap(vpci->msix->table[i]); >>> + >>> +XFREE(vpci->msix); >>> +} >>> + >>> +/* >>> + * The driver may not traverse the capability list and think device >>> + * supports MSIX by default. So here let the control register of MSIX >>> + * be Read-Only is to ensure MSIX disabled. >>> + */ >>> +rc = vpci_add_register(vpci, vpci_hw_read16, NULL, >>> + msix_control_reg(msix_pos), 2, NULL); >>> +if ( rc ) >>> +printk(XENLOG_ERR "%pd %pp: fail to add MSIX ctrl handler rc=%d\n", >>> + pdev->domain, &pdev->sbdf, rc); >> >> Here as well as for MSI: Wouldn't this better be limited to the init-failure >> case? No point in adding a register hook (and possibly emitting a misleading >> log message) when we're tearing down anyway. IOW I think the ->cleanup() >> hook needs a boolean parameter, unless the distinction of the two cases can >> be (reliably) inferred from some other property. > To make these changes, can I add a new patch as the last patch of this series? > And the new patch will do: > 1. add a boolean parameter for cleanup hook to separate whose calls > cleanup(during initialization or during deassign device). > 2. call all cleanup hooks in vpci_deassign_device(). > 3. remove the MSI/MSIX specific free actions in vpci_deassign_device(). The outline looks okay, but imo it shouldn't be last in the series. Instead I think it wants to come ahead of the last three patches; whether it's patch 1 or patch 2 doesn't really matter. Then (3) would be taken care of incrementally, as ->cleanup hooks are added. Jan
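For clarity, the reworked hook shape being agreed on here might look roughly like this (the parameter name and placement are assumptions; the real follow-up patch may differ):

    /* vPCI capability cleanup hook, gaining a flag for the two call sites. */
    static int cf_check cleanup_msix(const struct pci_dev *pdev, bool init_failed)
    {
        /* ... remove handlers and free vpci->msix as in the quoted code ... */

        /* Only pin the control register when init failed but the device stays assigned. */
        if ( init_failed )
            return vpci_add_register(pdev->vpci, vpci_hw_read16, NULL,
                                     msix_control_reg(pdev->msix_pos), 2, NULL);

        return 0;
    }

vpci_deassign_device() would then call every ->cleanup hook with the flag clear, while a failing init path passes it set.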
Re: [PATCH -next v7 7/7] arm64: entry: Switch to generic IRQ entry
On 2025/8/5 23:07, Ada Couprie Diaz wrote: > Hi Jinjie, > > The code changes look good to me, just a small missing clean up I believe. > > On 29/07/2025 02:54, Jinjie Ruan wrote: > >> Currently, x86, Riscv, Loongarch use the generic entry. Convert arm64 >> to use the generic entry infrastructure from kernel/entry/*. >> The generic entry makes maintainers' work easier and codes >> more elegant. >> >> Switch arm64 to generic IRQ entry first, which removed duplicate 100+ >> LOC. The next patch serise will switch to generic entry completely later. >> Switch to generic entry in two steps according to Mark's suggestion >> will make it easier to review. > > I think the commit message could be clearer, especially since now this > series > only moves arm64 to generic IRQ entry and not the complete generic entry. > > What do you think of something like below ? It repeats a bit less and I > think > it helps understanding what is going on in this specific commit, as you > already > have details on the larger plans in the cover. > > Currently, x86, Riscv and Loongarch use the generic entry code, > which makes > maintainer's work easier and code more elegant. > Start converting arm64 to use the generic entry infrastructure > from kernel/entry/* by switching it to generic IRQ entry, which > removes 100+ > lines of duplicate code. > arm64 will completely switch to generic entry in a later series. > Yes, this is more concise and accurate, and make the motivation more clearer. >> The changes are below: >> - Remove *enter_from/exit_to_kernel_mode(), and wrap with generic >> irqentry_enter/exit(). Also remove *enter_from/exit_to_user_mode(), >> and wrap with generic enter_from/exit_to_user_mode() because they >> are exactly the same so far. > Nit : "so far" can be removed >> - Remove arm64_enter/exit_nmi() and use generic >> irqentry_nmi_enter/exit() >> because they're exactly the same, so the temporary arm64 version >> irqentry_state can also be removed. >> >> - Remove PREEMPT_DYNAMIC code, as generic entry do the same thing >> if arm64 implement arch_irqentry_exit_need_resched(). > This feels unrelated, given that the part that needs > `arch_irqentry_exit_need_resched()` > is called whether or not PREEMPT_DYNAMIC is enabled ? Yes, the language here needs to be reorganized in conjunction with your comments from the fifth patch. > > Given my comments on patch 5, I feel that the commit message should mention > explicitly the implementation of `arch_irqentry_exit_need_resched()` and > why, > even though it was already mentioned in patch 5. > (This is what I was referencing in patch 5 : as I feel it's useful to > mention again > the reasons when implementing it, it doesn't feel too out of place to > introduce > the generic part at the same time. But again, I might be wrong here.) > > Then you can have another point explaining that > `raw_irqentry_exit_cond_resched()` > and the PREEMPT_DYNAMIC code is removed because they are identical to the > generic entry code, similarly to your other points. >> Tested ok with following test cases on Qemu virt platform: >> - Perf tests. >> - Different `dynamic preempt` mode switch. >> - Pseudo NMI tests. >> - Stress-ng CPU stress test. >> - MTE test case in >> Documentation/arch/arm64/memory-tagging-extension.rst >> and all test cases in tools/testing/selftests/arm64/mte/*. > Nit : I'm not sure if the commit message is the best place for this, > given you > already gave some details in the cover ? 
> But I don't have much experience here, so I'll leave it up to you and > others ! Yes, this can be removed as the cover letter already has it. >> Suggested-by: Mark Rutland >> Signed-off-by: Jinjie Ruan >> --- >> [...] >> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c >> index db3f972f8cd9..1110eeb21f57 100644 >> --- a/arch/arm64/kernel/signal.c >> +++ b/arch/arm64/kernel/signal.c >> @@ -9,6 +9,7 @@ >> #include >> #include >> #include >> +#include >> #include >> #include >> #include >> @@ -1576,7 +1577,7 @@ static void handle_signal(struct ksignal *ksig, >> struct pt_regs *regs) >> * the kernel can handle, and then we build all the user-level >> signal handling >> * stack-frames in one go after that. >> */ >> -void do_signal(struct pt_regs *regs) >> +void arch_do_signal_or_restart(struct pt_regs *regs) > Given that `do_signal(struct pt_regs *regs)` is defined in > `arch/arm64/include/asm/exception.h`, > and that there remains no users of `do_signal()`, I think it should be > removed there. Good catch! I'll remove it. > > Thanks, > Ada >
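For reference, the renamed hook is driven from the generic exit-to-user work loop, roughly as below (heavily simplified from kernel/entry/common.c), which is why the now-unused do_signal() declaration in arch/arm64/include/asm/exception.h can simply be dropped:

    static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
                                                unsigned long ti_work)
    {
        while (ti_work & EXIT_TO_USER_MODE_WORK) {
            /* ... resched, uprobe and other work items elided ... */
            if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
                arch_do_signal_or_restart(regs);
            /* ... */
            ti_work = read_thread_flags();
        }
        return ti_work;
    }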
Re: [PATCH v10 4/4] vpci/msix: Free MSIX resources when init_msix() fails
On 2025/8/5 16:43, Jan Beulich wrote: > On 05.08.2025 05:49, Jiqian Chen wrote: >> --- a/xen/drivers/vpci/msix.c >> +++ b/xen/drivers/vpci/msix.c >> @@ -655,6 +655,48 @@ int vpci_make_msix_hole(const struct pci_dev *pdev) >> return 0; >> } >> >> +static int cf_check cleanup_msix(const struct pci_dev *pdev) >> +{ >> +int rc; >> +struct vpci *vpci = pdev->vpci; >> +const unsigned int msix_pos = pdev->msix_pos; >> + >> +if ( !msix_pos ) >> +return 0; >> + >> +rc = vpci_remove_registers(vpci, msix_control_reg(msix_pos), 2); >> +if ( rc ) >> +{ >> +printk(XENLOG_ERR "%pd %pp: fail to remove MSIX handlers rc=%d\n", >> + pdev->domain, &pdev->sbdf, rc); >> +ASSERT_UNREACHABLE(); >> +return rc; >> +} >> + >> +if ( vpci->msix ) >> +{ >> +list_del(&vpci->msix->next); >> +for ( unsigned int i = 0; i < ARRAY_SIZE(vpci->msix->table); i++ ) >> +if ( vpci->msix->table[i] ) >> +iounmap(vpci->msix->table[i]); >> + >> +XFREE(vpci->msix); >> +} >> + >> +/* >> + * The driver may not traverse the capability list and think device >> + * supports MSIX by default. So here let the control register of MSIX >> + * be Read-Only is to ensure MSIX disabled. >> + */ >> +rc = vpci_add_register(vpci, vpci_hw_read16, NULL, >> + msix_control_reg(msix_pos), 2, NULL); >> +if ( rc ) >> +printk(XENLOG_ERR "%pd %pp: fail to add MSIX ctrl handler rc=%d\n", >> + pdev->domain, &pdev->sbdf, rc); > > Here as well as for MSI: Wouldn't this better be limited to the init-failure > case? No point in adding a register hook (and possibly emitting a misleading > log message) when we're tearing down anyway. IOW I think the ->cleanup() > hook needs a boolean parameter, unless the distinction of the two cases can > be (reliably) inferred from some other property. To make these changes, can I add a new patch as the last patch of this series? And the new patch will do: 1. add a boolean parameter for cleanup hook to separate whose calls cleanup(during initialization or during deassign device). 2. call all cleanup hooks in vpci_deassign_device(). 3. remove the MSI/MSIX specific free actions in vpci_deassign_device(). > >> --- a/xen/drivers/vpci/vpci.c >> +++ b/xen/drivers/vpci/vpci.c >> @@ -321,6 +321,27 @@ void vpci_deassign_device(struct pci_dev *pdev) >> &pdev->domain->vpci_dev_assigned_map); >> #endif >> >> +for ( i = 0; i < NUM_VPCI_INIT; i++ ) >> +{ >> +const vpci_capability_t *capability = &__start_vpci_array[i]; >> +const unsigned int cap = capability->id; >> +unsigned int pos = 0; >> + >> +if ( !capability->is_ext ) >> +pos = pci_find_cap_offset(pdev->sbdf, cap); >> +else if ( is_hardware_domain(pdev->domain) ) >> +pos = pci_find_ext_capability(pdev->sbdf, cap); >> + >> +if ( pos && capability->cleanup ) >> +{ >> +int rc = capability->cleanup(pdev); >> +if ( rc ) >> +printk(XENLOG_ERR "%pd %pp: clean %s cap %u fail rc=%d\n", >> + pdev->domain, &pdev->sbdf, >> + capability->is_ext ? "extended" : "legacy", cap, rc); >> +} >> +} > > With this imo the patch subject is now wrong, too. > > Jan -- Best regards, Jiqian Chen.
Re: [PATCH v2] systemd: Add hooks to stop/start xen-watchdog on suspend/resume
On 17/07/2025 9:16 pm, Mykola Kvach wrote: > config/Tools.mk.in| 1 + > m4/systemd.m4 | 14 > tools/hotplug/Linux/systemd/Makefile | 8 - > .../Linux/systemd/xen-watchdog-sleep.sh | 34 +++ This has been committed, but it drops the file: /usr/lib/systemd/system-sleep/xen-watchdog-sleep.sh into a directory which more normally contains: $ file /usr/lib/systemd/system-sleep/* /usr/lib/systemd/system-sleep/hdparm: POSIX shell script, ASCII text executable /usr/lib/systemd/system-sleep/nvidia: POSIX shell script, ASCII text executable /usr/lib/systemd/system-sleep/sysstat.sleep: POSIX shell script, ASCII text executable /usr/lib/systemd/system-sleep/tlp: POSIX shell script, ASCII text executable /usr/lib/systemd/system-sleep/unattended-upgrades: POSIX shell script, ASCII text executable I'd suggest renaming it to xen-watchdog (or perhaps xen-watchdog.sleep). ~Andrew
[PATCH 2/2] efi: Stop using StdErr
Xen's use of StdErr is inconsistent. Some boot errors are reported using PrintErr() which uses StdErr and some are reported using blexit() which uses StdOut. On my test system using OVMF, StdErr is not displayed on the emulated screen. Looking at other EFI applications, StdErr is just used for debug messages if at all. Therefore, switch all boot output to use StdOut. Signed-off-by: Ross Lagerwall --- xen/common/efi/boot.c | 9 + 1 file changed, 1 insertion(+), 8 deletions(-) diff --git a/xen/common/efi/boot.c b/xen/common/efi/boot.c index 50ff1d1bd225..6ba486943466 100644 --- a/xen/common/efi/boot.c +++ b/xen/common/efi/boot.c @@ -153,7 +153,6 @@ static UINT32 __initdata efi_bs_revision; static EFI_HANDLE __initdata efi_ih; static SIMPLE_TEXT_OUTPUT_INTERFACE *__initdata StdOut; -static SIMPLE_TEXT_OUTPUT_INTERFACE *__initdata StdErr; static UINT32 __initdata mdesc_ver; static bool __initdata map_bs; @@ -168,11 +167,7 @@ static void __init PrintStr(const CHAR16 *s) { StdOut->OutputString(StdOut, (CHAR16 *)s ); } - -static void __init PrintErr(const CHAR16 *s) -{ -StdErr->OutputString(StdErr, (CHAR16 *)s ); -} +#define PrintErr PrintStr static CHAR16 *__init FormatDec(UINT64 Val, CHAR16 *Buffer) { @@ -287,7 +282,6 @@ static bool __init match_guid(const EFI_GUID *guid1, const EFI_GUID *guid2) /* generic routine for printing error messages */ static void __init noreturn PrintErrMesg(const CHAR16 *mesg, EFI_STATUS ErrCode) { -StdOut = StdErr; PrintErr(mesg); PrintErr(L": "); @@ -914,7 +908,6 @@ static void __init efi_init(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTabl efi_fw_revision = SystemTable->FirmwareRevision; StdOut = SystemTable->ConOut; -StdErr = SystemTable->StdErr ?: StdOut; } static void __init efi_console_set_mode(void) -- 2.50.1
[PATCH 1/2] efi: Call FreePages only if needed
If the config file is builtin, cfg.addr will be zero but Xen unconditionally calls FreePages() on the address. Xen may also call FreePages() with a zero address if blexit() is called after this point since cfg.need_to_free is not set to false. The UEFI specification does not say whether calling FreePages() with a zero address is allowed so let's be cautious and use cfg.need_to_free properly. Signed-off-by: Ross Lagerwall --- xen/common/efi/boot.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/xen/common/efi/boot.c b/xen/common/efi/boot.c index 778a39cc48e6..50ff1d1bd225 100644 --- a/xen/common/efi/boot.c +++ b/xen/common/efi/boot.c @@ -1534,8 +1534,11 @@ void EFIAPI __init noreturn efi_start(EFI_HANDLE ImageHandle, efi_arch_cfg_file_late(loaded_image, dir_handle, section.s); -efi_bs->FreePages(cfg.addr, PFN_UP(cfg.size)); -cfg.addr = 0; +if ( cfg.need_to_free ) +{ +efi_bs->FreePages(cfg.addr, PFN_UP(cfg.size)); +cfg.need_to_free = false; +} if ( dir_handle ) dir_handle->Close(dir_handle); -- 2.50.1
[PATCH 0/2] Misc EFI boot fixes
Fixes for a couple of annoyances when booting Xen using the native EFI entry point. Ross Lagerwall (2): efi: Call FreePages only if needed efi: Stop using StdErr xen/common/efi/boot.c | 16 ++-- 1 file changed, 6 insertions(+), 10 deletions(-) -- 2.50.1
Re: [XEN][PATCH] xen/dom0less: arm: fix hwdom 1:1 low memory allocation
On 2025-08-01 11:54, Grygorii Strashko wrote: From: Grygorii Strashko Call stack for the dom0less hwdom case (1:1 memory): create_domUs |-construct_domU |-construct_hwdom() |-allocate_memory_11() And allocate_memory_11() uses "dom0_mem" as: min_low_order = get_order_from_bytes(min_t(paddr_t, dom0_mem, MB(128))); In the dom0less boot case "dom0_mem" is not used and defaults to 0, which causes min_low_order to get a high value > order, and so no allocations happen from low memory. Fix it by using kinfo->unassigned_mem instead of "dom0_mem", which holds the correct memory size in both cases: regular dom0 boot and dom0less boot. Fixes: 43afe6f030244 ("xen/common: dom0less: introduce common dom0less-build.c") I think I introduced this bug with the dom0less hwdom support, and the correct Fixes: tag is: Fixes: 52cb53f1816a ("xen/arm: dom0less hwdom construction") Signed-off-by: Grygorii Strashko dom0_mem is also mentioned in the comment on line 252. With that changed as well: Reviewed-by: Jason Andryuk Thanks, Jason
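The fix described above boils down to a one-line change in allocate_memory_11() (reconstructed from the commit message, since the hunk itself isn't quoted in this reply):

    -    min_low_order = get_order_from_bytes(min_t(paddr_t, dom0_mem, MB(128)));
    +    min_low_order = get_order_from_bytes(min_t(paddr_t, kinfo->unassigned_mem,
    +                                               MB(128)));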
[PATCH v2 2/3] xen/ppc: irq: drop unreachable pirq callbacks
From: Grygorii Strashko Since commit 341f271cf86f ("xen/evtchn: fully restrict concept of pIRQ for arches with pIRQ support only"), the corresponding PPC arch pIRQ callbacks become unreachable, so drop them. Signed-off-by: Grygorii Strashko Acked-by: Andrew Cooper Reviewed-by: Denis Mukhin --- xen/arch/ppc/stubs.c | 20 1 file changed, 20 deletions(-) diff --git a/xen/arch/ppc/stubs.c b/xen/arch/ppc/stubs.c index 671e71aa0a60..bdaf474c5cc0 100644 --- a/xen/arch/ppc/stubs.c +++ b/xen/arch/ppc/stubs.c @@ -103,26 +103,6 @@ void smp_send_call_function_mask(const cpumask_t *mask) /* irq.c */ -struct pirq *alloc_pirq_struct(struct domain *d) -{ -BUG_ON("unimplemented"); -} - -int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share) -{ -BUG_ON("unimplemented"); -} - -void pirq_guest_unbind(struct domain *d, struct pirq *pirq) -{ -BUG_ON("unimplemented"); -} - -void pirq_set_affinity(struct domain *d, int pirq, const cpumask_t *mask) -{ -BUG_ON("unimplemented"); -} - void irq_ack_none(struct irq_desc *desc) { BUG_ON("unimplemented"); -- 2.34.1
Re: [XEN][PATCH] xen/dom0less: arm: fix hwdom 1:1 low memory allocation
On 2025-08-05 14:38, Grygorii Strashko wrote: Hi Jason, On 05.08.25 20:21, Jason Andryuk wrote: On 2025-08-01 11:54, Grygorii Strashko wrote: From: Grygorii Strashko Call stack for the dom0less hwdom case (1:1 memory): create_domUs |-construct_domU |-construct_hwdom() |-allocate_memory_11() And allocate_memory_11() uses "dom0_mem" as: min_low_order = get_order_from_bytes(min_t(paddr_t, dom0_mem, MB(128))); In the dom0less boot case "dom0_mem" is not used and defaults to 0, which causes min_low_order to get a high value > order, and so no allocations happen from low memory. Fix it by using kinfo->unassigned_mem instead of "dom0_mem", which holds the correct memory size in both cases: regular dom0 boot and dom0less boot. Fixes: 43afe6f030244 ("xen/common: dom0less: introduce common dom0less-build.c") I think I introduced this bug with the dom0less hwdom support, and the correct Fixes: tag is: Fixes: 52cb53f1816a ("xen/arm: dom0less hwdom construction") Signed-off-by: Grygorii Strashko dom0_mem is also mentioned in the comment on line 252. With that changed as well: Reviewed-by: Jason Andryuk Will something like the below be OK? * We first allocate the largest allocation we can as low as we * can. This then becomes the first bank. This bank must be at least - * 128MB (or dom0_mem if that is smaller). + * 128MB (or memory size requested for domain if that is smaller). LGTM - Thank you. -Jason
[PATCH v2 1/3] xen/arm: irq: drop unreachable pirq callbacks
From: Grygorii Strashko Since commit 341f271cf86f ("xen/evtchn: fully restrict concept of pIRQ for arches with pIRQ support only"), the corresponding Arm arch pIRQ callbacks become unreachable, so drop them. Signed-off-by: Grygorii Strashko Acked-by: Andrew Cooper Reviewed-by: Denis Mukhin --- xen/arch/arm/irq.c | 29 - 1 file changed, 29 deletions(-) diff --git a/xen/arch/arm/irq.c b/xen/arch/arm/irq.c index 03fbb90c6c43..4bbf0b0664df 100644 --- a/xen/arch/arm/irq.c +++ b/xen/arch/arm/irq.c @@ -595,35 +595,6 @@ unlock: return ret; } -/* - * pirq event channels. We don't use these on ARM, instead we use the - * features of the GIC to inject virtualised normal interrupts. - */ -struct pirq *alloc_pirq_struct(struct domain *d) -{ -return NULL; -} - -/* - * These are all unreachable given an alloc_pirq_struct - * which returns NULL, all callers try to lookup struct pirq first - * which will fail. - */ -int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share) -{ -BUG(); -} - -void pirq_guest_unbind(struct domain *d, struct pirq *pirq) -{ -BUG(); -} - -void pirq_set_affinity(struct domain *d, int pirq, const cpumask_t *mask) -{ -BUG(); -} - static bool irq_validate_new_type(unsigned int curr, unsigned int new) { return (curr == IRQ_TYPE_INVALID || curr == new ); -- 2.34.1
[PATCH v2 3/3] xen/riscv: irq: drop unreachable pirq callbacks
From: Grygorii Strashko Since commit 341f271cf86f ("xen/evtchn: fully restrict concept of pIRQ for arches with pIRQ support only"), the corresponding RISCV arch pIRQ callbacks become unreachable, so drop them. Signed-off-by: Grygorii Strashko Reviewed-by: Oleksii Kurochko Acked-by: Andrew Cooper Reviewed-by: Denis Mukhin --- xen/arch/riscv/stubs.c | 20 1 file changed, 20 deletions(-) diff --git a/xen/arch/riscv/stubs.c b/xen/arch/riscv/stubs.c index 8918cebf35c6..1a8c86cd8da2 100644 --- a/xen/arch/riscv/stubs.c +++ b/xen/arch/riscv/stubs.c @@ -82,26 +82,6 @@ void smp_send_call_function_mask(const cpumask_t *mask) /* irq.c */ -struct pirq *alloc_pirq_struct(struct domain *d) -{ -BUG_ON("unimplemented"); -} - -int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share) -{ -BUG_ON("unimplemented"); -} - -void pirq_guest_unbind(struct domain *d, struct pirq *pirq) -{ -BUG_ON("unimplemented"); -} - -void pirq_set_affinity(struct domain *d, int pirq, const cpumask_t *mask) -{ -BUG_ON("unimplemented"); -} - void irq_ack_none(struct irq_desc *desc) { BUG_ON("unimplemented"); -- 2.34.1
[PATCH v2 0/3] xen/arch: irq: drop unreachable pirq callbacks
From: Grygorii Strashko Since the prerequisite patch [1] was merged, the arch-specific pIRQ callbacks can now be dropped for arches without pIRQ support, as those callbacks have become unreachable. [1] commit 341f271cf86f ("xen/evtchn: fully restrict concept of pIRQ for arches with pIRQ support only") Changes in v2: - reworded commit messages as requested - added reviewers' tags Grygorii Strashko (3): xen/arm: irq: drop unreachable pirq callbacks xen/ppc: irq: drop unreachable pirq callbacks xen/riscv: irq: drop unreachable pirq callbacks xen/arch/arm/irq.c | 29 - xen/arch/ppc/stubs.c | 20 xen/arch/riscv/stubs.c | 20 3 files changed, 69 deletions(-) -- 2.34.1
Re: [PATCH -next v7 1/7] arm64: ptrace: Replace interrupts_enabled() with regs_irqs_disabled()
On 2025/8/5 23:05, Ada Couprie Diaz wrote: > On 29/07/2025 02:54, Jinjie Ruan wrote: > >> The generic entry code expects architecture code to provide a >> regs_irqs_disabled(regs) function, but arm64 does not have this and >> provides inerrupts_enabled(regs), which has the opposite polarity. > Nit: "interrupts_enabled(regs)" Good catch! Thank you for the review. >> In preparation for moving arm64 over to the generic entry code, >> replace arm64's interrupts_enabled() with regs_irqs_disabled() and >> update its callers under arch/arm64. >> >> For the moment, a definition of interrupts_enabled() is provided for >> the GICv3 driver. Once arch/arm implements regs_irqs_disabled(), this >> can be removed. >> >> Delete the fast_interrupts_enabled() macro as it is unused and we >> don't want any new users to show up. >> >> No functional changes. >> >> Acked-by: Mark Rutland >> Suggested-by: Mark Rutland >> Signed-off-by: Jinjie Ruan >> --- > Otherwise looks good to me! > Reviewed-by: Ada Couprie Diaz >
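For context, the polarity change being reviewed amounts to roughly the following on arm64 (a sketch; the exact definitions are in the patch, which isn't quoted in full here):

    /* Old helper (roughly, from asm/ptrace.h): true if IRQs were enabled at entry. */
    #define interrupts_enabled(regs)    (!((regs)->pstate & PSR_I_BIT))

    /* New helper expected by the generic entry code, with the opposite polarity: */
    static __always_inline bool regs_irqs_disabled(const struct pt_regs *regs)
    {
        return regs->pstate & PSR_I_BIT;
    }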
Re: [PATCH v10 2/4] vpci/rebar: Free Rebar resources when init_rebar() fails
On 2025/8/5 16:48, Jan Beulich wrote: > On 05.08.2025 05:49, Jiqian Chen wrote: >> --- a/xen/drivers/vpci/rebar.c >> +++ b/xen/drivers/vpci/rebar.c >> @@ -49,6 +49,32 @@ static void cf_check rebar_ctrl_write(const struct >> pci_dev *pdev, >> bar->guest_addr = bar->addr; >> } >> >> +static int cf_check cleanup_rebar(const struct pci_dev *pdev) >> +{ >> +int rc; >> +uint32_t ctrl; >> +unsigned int nbars; >> +unsigned int rebar_offset = pci_find_ext_capability(pdev->sbdf, >> + >> PCI_EXT_CAP_ID_REBAR); >> + >> +if ( !rebar_offset || !is_hardware_domain(pdev->domain) ) >> +{ >> +ASSERT_UNREACHABLE(); >> +return 0; >> +} >> + >> +ctrl = pci_conf_read32(pdev->sbdf, rebar_offset + PCI_REBAR_CTRL(0)); >> +nbars = MASK_EXTR(ctrl, PCI_REBAR_CTRL_NBAR_MASK); >> + >> +rc = vpci_remove_registers(pdev->vpci, rebar_offset + PCI_REBAR_CAP(0), >> + PCI_REBAR_CTRL(nbars - 1)); >> +if ( rc ) >> +printk(XENLOG_ERR "%pd %pp: fail to remove Rebar handlers rc=%d\n", >> + pdev->domain, &pdev->sbdf, rc); > > MSI and MSI-X (now) have ASSERT_UNREACHABLE() on their respective paths. Is > there a reason this shouldn't be done here as well? Will add. > > MSI and MSI-X further have another add-register below here, to ensure the > control register cannot be written. Again - is there a reason the same > shouldn't be done here? (If so, I think this may want putting in a comment.) Since extended capabilities are only exposed to dom0, and dom0 has no restrictions on accessing devices, there does not seem to be much point in adding such a handler for Rebar. > > Jan -- Best regards, Jiqian Chen.
Re: [PATCH -next v7 2/7] arm64: entry: Refactor the entry and exit for exceptions from EL1
On 2025/8/5 23:06, Ada Couprie Diaz wrote: > Hi, > > On 29/07/2025 02:54, Jinjie Ruan wrote: > >> The generic entry code uses irqentry_state_t to track lockdep and RCU >> state across exception entry and return. For historical reasons, arm64 >> embeds similar fields within its pt_regs structure. >> >> In preparation for moving arm64 over to the generic entry code, pull >> these fields out of arm64's pt_regs, and use a separate structure, >> matching the style of the generic entry code. >> >> No functional changes. > As far as I understand and checked, we used the two fields > in an exclusive fashion, so there is indeed no functional change. >> Suggested-by: Mark Rutland >> Signed-off-by: Jinjie Ruan >> --- >> [...] >> diff --git a/arch/arm64/kernel/entry-common.c >> b/arch/arm64/kernel/entry-common.c >> index 8e798f46ad28..97e0741abde1 100644 >> --- a/arch/arm64/kernel/entry-common.c >> +++ b/arch/arm64/kernel/entry-common.c >> [...] >> @@ -475,73 +497,81 @@ UNHANDLED(el1t, 64, error) >> static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr) >> { >> unsigned long far = read_sysreg(far_el1); >> + arm64_irqentry_state_t state; >> - enter_from_kernel_mode(regs); >> + state = enter_from_kernel_mode(regs); > Nit: There is some inconsistencies with some functions splitting state's > definition > and declaration (like el1_abort here), while some others do it on the > same line > (el1_undef() below for example). > In some cases it is welcome as the entry function is called after some > other work, > but here for example it doesn't seem to be beneficial ? Both methods can keep the modifications to `enter_from_kernel_mode()` on the same line as the original code, which will facilitate code review. I think it is also fine to do it on the same line here, which can reduce one line code, which method is better may be a matter of personal opinion. >> local_daif_inherit(regs); >> do_mem_abort(far, esr, regs); >> local_daif_mask(); >> - exit_to_kernel_mode(regs); >> + exit_to_kernel_mode(regs, state); >> } >> static void noinstr el1_pc(struct pt_regs *regs, unsigned long esr) >> { >> unsigned long far = read_sysreg(far_el1); >> + arm64_irqentry_state_t state; >> - enter_from_kernel_mode(regs); >> + state = enter_from_kernel_mode(regs); >> local_daif_inherit(regs); >> do_sp_pc_abort(far, esr, regs); >> local_daif_mask(); >> - exit_to_kernel_mode(regs); >> + exit_to_kernel_mode(regs, state); >> } >> static void noinstr el1_undef(struct pt_regs *regs, unsigned long >> esr) >> { >> - enter_from_kernel_mode(regs); >> + arm64_irqentry_state_t state = enter_from_kernel_mode(regs); >> + >> local_daif_inherit(regs); >> do_el1_undef(regs, esr); >> local_daif_mask(); >> - exit_to_kernel_mode(regs); >> + exit_to_kernel_mode(regs, state); >> } >> >> [...] > Other than the small nit: > Reviewed-by: Ada Couprie Diaz > >
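For reference, the generic entry structure that the new arm64_irqentry_state_t mirrors looks like this (from the generic entry headers); the two fields are used exclusively of one another, which is why pulling the equivalent state out of pt_regs is a pure refactoring:

    typedef struct irqentry_state {
        union {
            bool exit_rcu;
            bool lockdep;
        };
    } irqentry_state_t;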
Consider changing CONFIG_ACPI default on ARM? (was: Re: Xen bootup: issue with Raspberry Pi 5?)
Sigh, resending as I lost some of the intended Cc targets when originally creating the message. Sorry about the duplication for people who have already seen, but I thought this might be worthy of wider discussion. I would like to draw the attention of a few people on xen-devel to the thread which occured on xen-users recently and quoted below: https://lists.xenproject.org/archives/html/xen-users/2025-07/msg1.html On Tue, Jul 01, 2025 at 10:01:13PM +0200, Paul Leiber wrote: > > Unfortunately, I don't have a direct answer to the question (as is so often > the case, due to my limited knowledge and experience). However, I am > successfully running Xen on a RPi 4 (mostly, except for some VLAN related > networking issues). > > I used instructions in [1] to install vanilla Debian on the RPi, including > UEFI boot and grub. I then compiled Xen with expert options and ACPI > enabled. > > I don't know if there are better solutions. For example, I suffer from the > fact that I2C doesn't work when using UEFI boot on a RPi. Nowadays, Debian > provides their own vanilla Debian images for RPi and with working I2C, but > these images are using a different boot method that I didn't know how to use > with Xen. So far, the procedure described above seems to be the easiest > solution for me. > [1] https://forums.raspberrypi.com/viewtopic.php?t=282839 > > Am 30.06.2025 um 12:35 schrieb Sumit Semwal: > > > > I've just begun to experiment with the Raspberry Pi 5, trying to run a > > simple xen + Dom0 setup, using uBoot, and the bookworm based Rpi > > distro. > > > > I've tried combinations of the following setup: > > > > 1. prebuilt Rpi5 kernel + dtbs, and have also tried to build them from > > source [1] > > 2. Xen from upstream [2] and xen-troops [3] > > 3. upstream uBoot from [4] > > > > but with the same result: [short log below; I can provide a fuller log > > if needed] > > > > (XEN) DT: ** translation for device /axi/msi-controller@100013 ** > > (XEN) DT: bus is default (na=2, ns=2) on /axi > > (XEN) DT: translating address:<3> 00ff<3> f000<3> > > (XEN) DT: parent bus is default (na=2, ns=1) on / > > (XEN) DT: walking ranges... > > (XEN) DT: default map, cp=0, s=10, da=fff000 > > (XEN) DT: default map, cp=10, s=1, da=fff000 > > (XEN) DT: default map, cp=14, s=4, da=fff000 > > (XEN) DT: default map, cp=18, s=4, da=fff000 > > (XEN) DT: default map, cp=1c, s=4, da=fff000 > > (XEN) DT: not found ! > > (XEN) Unable to retrieve address 1 for /axi/msi-controller@100013 > > (XEN) Device tree generation failed (-22). > > (XEN) debugtrace_dump() global buffer starting > > 1 cpupool_create(pool=0,sched=6) > > 2 Created cpupool 0 with scheduler SMP Credit Scheduler rev2 (credit2) > > 3 cpupool_add_domain(dom=0,pool=0) n_dom 1 rc 0 > > (XEN) wrap: 0 > > (XEN) debugtrace_dump() global buffer finished > > (XEN) > > (XEN) > > (XEN) Panic on CPU 0: > > (XEN) Could not set up DOM0 guest OS (rc = -22) > > (XEN) > > > > > > I'm certain I'm missing something, but before I delve deeper, I just > > wanted to ask if this is a known issue, and if so, are there any > > workarounds or solutions available for this? > > > > Any help about this is highly appreciated! > > > > Thanks and Best regards, > > Sumit. > > > > [1]: https://github.com/raspberrypi/linux rpi-6.12.y branch > > [2]: git://xenbits.xen.org/xen.git - main branch > > [3] xen-troops https://github.com/xen-troops/xen - rpi5_dev branch > > [4]: https://github.com/u-boot/u-boot.git master branch Ultimately Debian is choosing to leave most defaults alone. 
So far the Xen developers have left CONFIG_ACPI defaulting to off on ARM*. The Debian project doesn't have paid people to support Raspberry Pi hardware, despite it being rather common. As a result there aren't any official Raspberry Pi images, but people associated with Tianocore have gotten generic images to boot on Raspberry Pi hardware. I'm unsure of the likelihood of getting the Debian maintainers to override the default. Yet as this is by far the simplest way to install Debian and Xen on a very common ARM64 platform, perhaps the Xen developers should consider changing the default? -- (\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/) \BS (| ehem+sig...@m5p.com PGP 87145445 |) / \_CS\ | _ -O #include O- _ | / _/ 8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445