Re: [PATCH v10 4/4] vpci/msix: Free MSIX resources when init_msix() fails

2025-08-05 Thread Chen, Jiqian
On 2025/8/5 16:10, Jan Beulich wrote:
> On 05.08.2025 05:49, Jiqian Chen wrote:
>> When MSI-X initialization fails vPCI will hide the capability, but
>> remove of handlers and data won't be performed until the device is
>> deassigned.  Introduce a MSI-X cleanup hook that will be called when
>> initialization fails to cleanup MSI-X related hooks and free it's
>> associated data.
>>
>> As all supported capabilities have been switched to use the cleanup
>> hooks call those from vpci_deassign_device() instead of open-code the
>> capability specific cleanup in there.
>>
>> Signed-off-by: Jiqian Chen 
>> ---
>> cc: "Roger Pau Monné" 
>> ---
>> v9->v10 changes:
>> * Call all cleanup hook in vpci_deassign_device() instead of cleanup_msix().
> 
> Isn't this rather an omission in an earlier change, and hence may want to
> come separately and with a Fixes: tag?
This is not really an omission, after all, all the cleanup hooks were 
implemented at the end of this series.
And judging from the commit message(which was written by Roger in v8), Roger 
also agreed to add them in this patch.

> 
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -321,6 +321,27 @@ void vpci_deassign_device(struct pci_dev *pdev)
>>  &pdev->domain->vpci_dev_assigned_map);
>>  #endif
>>  
>> +for ( i = 0; i < NUM_VPCI_INIT; i++ )
>> +{
>> +const vpci_capability_t *capability = &__start_vpci_array[i];
>> +const unsigned int cap = capability->id;
>> +unsigned int pos = 0;
>> +
>> +if ( !capability->is_ext )
>> +pos = pci_find_cap_offset(pdev->sbdf, cap);
>> +else if ( is_hardware_domain(pdev->domain) )
>> +pos = pci_find_ext_capability(pdev->sbdf, cap);
> 
> What's the point of doing this when ...
> 
>> +if ( pos && capability->cleanup )
> 
> ... ->cleanup is NULL? Don't you want to have
> 
> if ( !capability->cleanup )
> continue;
> 
> earlier on?
Will change in next version.

> 
> Jan

-- 
Best regards,
Jiqian Chen.


Re: [PATCH] x86/HVM: polish hvm_asid_init() a little

2025-08-05 Thread Jan Beulich
On 04.08.2025 17:41, Jan Beulich wrote:
> While the logic there covers asymmetric cases, the resulting log
> messages would likely raise more confusion than clarify anything. Split
> the BSP action from the AP one, indicating the odd CPU in the AP log
> message, thus avoiding the impression that global state would have
> changed.
> 
> While there also
> - move g_disabled into .data.ro_after_init; only the BSP will ever write
>   to it,
> - make the function's parameter unsigned; no negative values may be
>   passed in. Also reflect this in svm_asid_init().
> 
> Signed-off-by: Jan Beulich 

Sorry, I meant to Cc: you on the submission, as it was the reviewing of
your copied code which made me spot the shortcomings (which, as indicated,
I think it would be nice if you avoided from the very start).

Jan

> --- a/xen/arch/x86/hvm/asid.c
> +++ b/xen/arch/x86/hvm/asid.c
> @@ -48,20 +48,22 @@ struct hvm_asid_data {
>  
>  static DEFINE_PER_CPU(struct hvm_asid_data, hvm_asid_data);
>  
> -void hvm_asid_init(int nasids)
> +void hvm_asid_init(unsigned int nasids)
>  {
> -static int8_t g_disabled = -1;
> +static int8_t __ro_after_init g_disabled = -1;
>  struct hvm_asid_data *data = &this_cpu(hvm_asid_data);
>  
>  data->max_asid = nasids - 1;
>  data->disabled = !opt_asid_enabled || (nasids <= 1);
>  
> -if ( g_disabled != data->disabled )
> +if ( g_disabled < 0 )
>  {
> -printk("HVM: ASIDs %sabled.\n", data->disabled ? "dis" : "en");
> -if ( g_disabled < 0 )
> -g_disabled = data->disabled;
> +g_disabled = data->disabled;
> +printk("HVM: ASIDs %sabled\n", data->disabled ? "dis" : "en");
>  }
> +else if ( g_disabled != data->disabled )
> +printk("HVM: CPU%u: ASIDs %sabled\n", smp_processor_id(),
> +   data->disabled ? "dis" : "en");
>  
>  /* Zero indicates 'invalid generation', so we start the count at one. */
>  data->core_asid_generation = 1;
> --- a/xen/arch/x86/hvm/svm/asid.c
> +++ b/xen/arch/x86/hvm/svm/asid.c
> @@ -12,7 +12,7 @@
>  
>  void svm_asid_init(const struct cpuinfo_x86 *c)
>  {
> -int nasids = 0;
> +unsigned int nasids = 0;
>  
>  /* Check for erratum #170, and leave ASIDs disabled if it's present. */
>  if ( !cpu_has_amd_erratum(c, AMD_ERRATUM_170) )
> --- a/xen/arch/x86/include/asm/hvm/asid.h
> +++ b/xen/arch/x86/include/asm/hvm/asid.h
> @@ -13,7 +13,7 @@ struct vcpu;
>  struct hvm_vcpu_asid;
>  
>  /* Initialise ASID management for the current physical CPU. */
> -void hvm_asid_init(int nasids);
> +void hvm_asid_init(unsigned int nasids);
>  
>  /* Invalidate a particular ASID allocation: forces re-allocation. */
>  void hvm_asid_flush_vcpu_asid(struct hvm_vcpu_asid *asid);




Re: [PATCH v10 4/4] vpci/msix: Free MSIX resources when init_msix() fails

2025-08-05 Thread Jan Beulich
On 05.08.2025 10:27, Chen, Jiqian wrote:
> On 2025/8/5 16:10, Jan Beulich wrote:
>> On 05.08.2025 05:49, Jiqian Chen wrote:
>>> When MSI-X initialization fails vPCI will hide the capability, but
>>> remove of handlers and data won't be performed until the device is
>>> deassigned.  Introduce a MSI-X cleanup hook that will be called when
>>> initialization fails to cleanup MSI-X related hooks and free it's
>>> associated data.
>>>
>>> As all supported capabilities have been switched to use the cleanup
>>> hooks call those from vpci_deassign_device() instead of open-code the
>>> capability specific cleanup in there.
>>>
>>> Signed-off-by: Jiqian Chen 
>>> ---
>>> cc: "Roger Pau Monné" 
>>> ---
>>> v9->v10 changes:
>>> * Call all cleanup hook in vpci_deassign_device() instead of cleanup_msix().
>>
>> Isn't this rather an omission in an earlier change, and hence may want to
>> come separately and with a Fixes: tag?
> This is not really an omission, after all, all the cleanup hooks were 
> implemented at the end of this series.
> And judging from the commit message(which was written by Roger in v8), Roger 
> also agreed to add them in this patch.

I disagree. Of the two xfree()-s you remove here from vpci_deassign_device(),
one should have been removed by patch 3 already. Which would require the
part of the patch here to be put in place earlier on.

Jan



Re: [PATCH v10 2/4] vpci/rebar: Free Rebar resources when init_rebar() fails

2025-08-05 Thread Jan Beulich
On 05.08.2025 05:49, Jiqian Chen wrote:
> --- a/xen/drivers/vpci/rebar.c
> +++ b/xen/drivers/vpci/rebar.c
> @@ -49,6 +49,32 @@ static void cf_check rebar_ctrl_write(const struct pci_dev 
> *pdev,
>  bar->guest_addr = bar->addr;
>  }
>  
> +static int cf_check cleanup_rebar(const struct pci_dev *pdev)
> +{
> +int rc;
> +uint32_t ctrl;
> +unsigned int nbars;
> +unsigned int rebar_offset = pci_find_ext_capability(pdev->sbdf,
> +
> PCI_EXT_CAP_ID_REBAR);
> +
> +if ( !rebar_offset || !is_hardware_domain(pdev->domain) )
> +{
> +ASSERT_UNREACHABLE();
> +return 0;
> +}
> +
> +ctrl = pci_conf_read32(pdev->sbdf, rebar_offset + PCI_REBAR_CTRL(0));
> +nbars = MASK_EXTR(ctrl, PCI_REBAR_CTRL_NBAR_MASK);
> +
> +rc = vpci_remove_registers(pdev->vpci, rebar_offset + PCI_REBAR_CAP(0),
> +   PCI_REBAR_CTRL(nbars - 1));
> +if ( rc )
> +printk(XENLOG_ERR "%pd %pp: fail to remove Rebar handlers rc=%d\n",
> +   pdev->domain, &pdev->sbdf, rc);

MSI and MSI-X (now) have ASSERT_UNREACHABLE() on their respective paths. Is
there a reason this shouldn't be done here as well?

MSI and MSI-X further have another add-register below here, to ensure the
control register cannot be written. Again - is there a reason the same
shouldn't be done here? (If so, I think this may want putting in a comment.)

Jan



Re: [PATCH v10 4/4] vpci/msix: Free MSIX resources when init_msix() fails

2025-08-05 Thread Jan Beulich
On 05.08.2025 05:49, Jiqian Chen wrote:
> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -655,6 +655,48 @@ int vpci_make_msix_hole(const struct pci_dev *pdev)
>  return 0;
>  }
>  
> +static int cf_check cleanup_msix(const struct pci_dev *pdev)
> +{
> +int rc;
> +struct vpci *vpci = pdev->vpci;
> +const unsigned int msix_pos = pdev->msix_pos;
> +
> +if ( !msix_pos )
> +return 0;
> +
> +rc = vpci_remove_registers(vpci, msix_control_reg(msix_pos), 2);
> +if ( rc )
> +{
> +printk(XENLOG_ERR "%pd %pp: fail to remove MSIX handlers rc=%d\n",
> +   pdev->domain, &pdev->sbdf, rc);
> +ASSERT_UNREACHABLE();
> +return rc;
> +}
> +
> +if ( vpci->msix )
> +{
> +list_del(&vpci->msix->next);
> +for ( unsigned int i = 0; i < ARRAY_SIZE(vpci->msix->table); i++ )
> +if ( vpci->msix->table[i] )
> +iounmap(vpci->msix->table[i]);
> +
> +XFREE(vpci->msix);
> +}
> +
> +/*
> + * The driver may not traverse the capability list and think device
> + * supports MSIX by default. So here let the control register of MSIX
> + * be Read-Only is to ensure MSIX disabled.
> + */
> +rc = vpci_add_register(vpci, vpci_hw_read16, NULL,
> +   msix_control_reg(msix_pos), 2, NULL);
> +if ( rc )
> +printk(XENLOG_ERR "%pd %pp: fail to add MSIX ctrl handler rc=%d\n",
> +   pdev->domain, &pdev->sbdf, rc);

Here as well as for MSI: Wouldn't this better be limited to the init-failure
case? No point in adding a register hook (and possibly emitting a misleading
log message) when we're tearing down anyway. IOW I think the ->cleanup()
hook needs a boolean parameter, unless the distinction of the two cases can
be (reliably) inferred from some other property.

> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -321,6 +321,27 @@ void vpci_deassign_device(struct pci_dev *pdev)
>  &pdev->domain->vpci_dev_assigned_map);
>  #endif
>  
> +for ( i = 0; i < NUM_VPCI_INIT; i++ )
> +{
> +const vpci_capability_t *capability = &__start_vpci_array[i];
> +const unsigned int cap = capability->id;
> +unsigned int pos = 0;
> +
> +if ( !capability->is_ext )
> +pos = pci_find_cap_offset(pdev->sbdf, cap);
> +else if ( is_hardware_domain(pdev->domain) )
> +pos = pci_find_ext_capability(pdev->sbdf, cap);
> +
> +if ( pos && capability->cleanup )
> +{
> +int rc = capability->cleanup(pdev);
> +if ( rc )
> +printk(XENLOG_ERR "%pd %pp: clean %s cap %u fail rc=%d\n",
> +   pdev->domain, &pdev->sbdf,
> +   capability->is_ext ? "extended" : "legacy", cap, rc);
> +}
> +}

With this imo the patch subject is now wrong, too.

Jan



Re: [PATCH] misra: remove default case in single-clause switch

2025-08-05 Thread Jan Beulich
On 04.08.2025 19:33, Dmytro Prokopchuk1 wrote:
> MISRA Rule 16.4: Every switch statement shall have a default label.
> The default clause must contain either a statement or a comment
> prior to its terminating break statement.
> 
> However, there is a documented deviation for this rule in Xen:
> 'docs/misra/deviations.rst':
> * - R16.4
>   - A switch statement with a single clause and no default label
> may replace an equivalent if statement to improve readability.
>   - Tagged as `deliberate` for ECLAIR.
> 
> This change removes empty default cases in single-clause switches
> to avoid violations of the rule where the `default` clause lacks
> a suitable comment or statement.
> 
> Signed-off-by: Dmytro Prokopchuk 

It's all CPU notifiers that you alter, and iirc the outcome of earlier
discussion was that particularly for those we _want_ to add commentary,
clarifying why only the given subset of notification need handling in
the particular case. It may also well be that the (at least) one case
of the possibly missing handling of some other notification still is
unaddressed (and might hence be wrongly hidden by the adjustment done
here, if it's in one of the function that are being altered).

Jan



Re: [PATCH v3 05/20] xen/riscv: construct the P2M pages pool for guests

2025-08-05 Thread Jan Beulich
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> @@ -30,3 +34,18 @@ int p2m_init(struct domain *d)
>  
>  return 0;
>  }
> +
> +/*
> + * Set the pool of pages to the required number of pages.
> + * Returns 0 for success, non-zero for failure.
> + * Call with d->arch.paging.lock held.
> + */
> +int p2m_set_allocation(struct domain *d, unsigned long pages, bool 
> *preempted)

Noticed only when looking at the subsequent patch: With this being ...

> +{
> +int rc;
> +
> +if ( (rc = paging_freelist_init(d, pages, preempted)) )

... a caller of this function, the "init" in the name feels wrong.

Jan



Re: [PATCH v3 06/20] xen/riscv: add root page table allocation

2025-08-05 Thread Jan Beulich
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> Introduce support for allocating and initializing the root page table
> required for RISC-V stage-2 address translation.
> 
> To implement root page table allocation the following is introduced:
> - p2m_get_clean_page() and p2m_alloc_root_table(), p2m_allocate_root()
>   helpers to allocate and zero a 16 KiB root page table, as mandated
>   by the RISC-V privileged specification for Sv32x4/Sv39x4/Sv48x4/Sv57x4
>   modes.
> - Update p2m_init() to inititialize p2m_root_order.
> - Add maddr_to_page() and page_to_maddr() macros for easier address
>   manipulation.
> - Introduce paging_ret_pages_to_domheap() to return some pages before
>   allocate 16 KiB pages for root page table.
> - Allocate root p2m table after p2m pool is initialized.
> - Add construct_hgatp() to construct the hgatp register value based on
>   p2m->root, p2m->hgatp_mode and VMID.

Imo for this to be complete, freeing of the root table also wants taking
care of. Much like imo p2m_init() would better immediately be accompanied
by the respective teardown function. Once you start using them, you want
to use them in pairs, after all.

> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
> @@ -133,11 +133,13 @@
>  #define HGATP_MODE_SV48X4_UL(9)
>  
>  #define HGATP32_MODE_SHIFT   31
> +#define HGATP32_MODE_MASK_UL(0x8000)
>  #define HGATP32_VMID_SHIFT   22
>  #define HGATP32_VMID_MASK_UL(0x1FC0)
>  #define HGATP32_PPN  _UL(0x003F)
>  
>  #define HGATP64_MODE_SHIFT   60
> +#define HGATP64_MODE_MASK_ULL(0xF000)
>  #define HGATP64_VMID_SHIFT   44
>  #define HGATP64_VMID_MASK_ULL(0x03FFF000)
>  #define HGATP64_PPN  _ULL(0x0FFF)
> @@ -170,6 +172,7 @@
>  #define HGATP_VMID_SHIFT HGATP64_VMID_SHIFT
>  #define HGATP_VMID_MASK  HGATP64_VMID_MASK
>  #define HGATP_MODE_SHIFT HGATP64_MODE_SHIFT
> +#define HGATP_MODE_MASK  HGATP64_MODE_MASK
>  #else
>  #define MSTATUS_SD   MSTATUS32_SD
>  #define SSTATUS_SD   SSTATUS32_SD
> @@ -181,8 +184,11 @@
>  #define HGATP_VMID_SHIFT HGATP32_VMID_SHIFT
>  #define HGATP_VMID_MASK  HGATP32_VMID_MASK
>  #define HGATP_MODE_SHIFT HGATP32_MODE_SHIFT
> +#define HGATP_MODE_MASK  HGATP32_MODE_MASK
>  #endif
>  
> +#define GUEST_ROOT_PAGE_TABLE_SIZE   KB(16)

In another context I already mentioned that imo you want to be careful with
the use of "guest" in identifiers. It's not the guest page tables which have
an order-2 root table, but the P2M (Xen terminology) or G-stage / second
stage (RISC-V spec terminology) ones. As long as you're only doing P2M
work, this may not look significant. But once you actually start dealing
with guest page tables, it easily can end up confusing.

> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -1,8 +1,86 @@
> +#include 
>  #include 
>  #include 
>  #include 
>  
>  #include 
> +#include 
> +#include 
> +
> +unsigned int __read_mostly p2m_root_order;

If this is to be a variable at all, it ought to be __ro_after_init, and
hence it shouldn't be written every time p2m_init() is run. If you want
to to remain as a variable, what's wrong with

const unsigned int p2m_root_order = ilog2(GUEST_ROOT_PAGE_TABLE_SIZE) - 
PAGE_SHIFT;

or some such? But of course equally well you could have

#define P2M_ROOT_ORDER  (ilog2(GUEST_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT)

> +static void clear_and_clean_page(struct page_info *page)
> +{
> +clear_domain_page(page_to_mfn(page));
> +
> +/*
> + * If the IOMMU doesn't support coherent walks and the p2m tables are
> + * shared between the CPU and IOMMU, it is necessary to clean the
> + * d-cache.
> + */

That is, ...

> +clean_dcache_va_range(page, PAGE_SIZE);

... this call really wants to be conditional?

> +}
> +
> +static struct page_info *p2m_allocate_root(struct domain *d)

With there also being p2m_alloc_root_table() and with that being the sole
caller of the function here, I wonder: Is having this in a separate
function really outweighing the possible confusion of which of the two
functions to use?

> +{
> +struct page_info *page;
> +
> +/*
> + * As mentioned in the Priviliged Architecture Spec (version 20240411)
> + * in Section 18.5.1, for the paged virtual-memory schemes  (Sv32x4,
> + * Sv39x4, Sv48x4, and Sv57x4), the root page table is 16 KiB and must
> + * be aligned to a 16-KiB boundary.
> + */
> +page = alloc_domheap_pages(d, P2M_ROOT_ORDER, MEMF_no_owner);
> +if ( !page )
> +return NULL;
> +
> +for ( unsigned int i = 0; i < P2M_ROOT_PAGES; i++ )
> +clear_and_clean_page(page + i);
> +
> +return page;
> +}
> +
> +unsigned long construct_hgatp

Re: [PATCH v3 06/20] xen/riscv: add root page table allocation

2025-08-05 Thread Jan Beulich
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> +static int p2m_alloc_root_table(struct p2m_domain *p2m)
> +{
> +struct domain *d = p2m->domain;
> +struct page_info *page;
> +const unsigned int nr_root_pages = P2M_ROOT_PAGES;
> +
> +/*
> + * Return back nr_root_pages to assure the root table memory is also
> + * accounted against the P2M pool of the domain.
> + */
> +if ( !paging_ret_pages_to_domheap(d, nr_root_pages) )
> +return -ENOMEM;
> +
> +page = p2m_allocate_root(d);
> +if ( !page )
> +return -ENOMEM;
> +
> +p2m->root = page;
> +
> +return 0;
> +}

In the success case, shouldn't you bump the paging pool's total_pages by
P2M_ROOT_PAGES? (As the freeing side is missing so far, it's not easy to
tell whether there's [going to be] a balancing problem in the long run.
In the short run there certainly is.)

Jan



Re: [PATCH v3 10/20] xen/riscv: introduce page_{get,set}_xenheap_gfn()

2025-08-05 Thread Jan Beulich
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/include/asm/mm.h
> +++ b/xen/arch/riscv/include/asm/mm.h
> @@ -12,6 +12,7 @@
>  #include 
>  #include 
>  
> +#include 
>  #include 
>  #include 
>  
> @@ -247,9 +248,17 @@ static inline bool arch_mfns_in_directmap(unsigned long 
> mfn, unsigned long nr)
>  #define PGT_writable_page PG_mask(1, 1)  /* has writable mappings? */
>  #define PGT_type_mask PG_mask(1, 1)  /* Bits 31 or 63. */
>  
> -/* Count of uses of this frame as its current type. */
> -#define PGT_count_width   PG_shift(2)
> -#define PGT_count_mask((1UL << PGT_count_width) - 1)
> + /* 9-bit count of uses of this frame as its current type. */

Nit: Stray blank at start of line.

> +#define PGT_count_maskPG_mask(0x3FF, 10)

A 9-bit count corresponds to a mask of 0x1ff, doesn't it? With 0x3ff the count
can spill over the type.

Jan



Re: [PATCH v13 2/3] tools/tests: introduce unit tests for domain ID allocator

2025-08-05 Thread Roger Pau Monné
On Wed, Jul 30, 2025 at 05:41:00PM +, dm...@proton.me wrote:
> From: Denis Mukhin  
> 
> Introduce some basic infrastructure for doing domain ID allocation unit tests,
> and add a few tests that ensure correctness of the domain ID allocator.
> 
> Signed-off-by: Denis Mukhin 
> ---
> Changes since v12:
> - fixed Makefile
> - dropped unused symbols/includes from the test harness header
> - s/printk/printf/g in the test code
> ---
>  tools/tests/Makefile   |   2 +-
>  tools/tests/domid/.gitignore   |   2 +
>  tools/tests/domid/Makefile |  48 ++
>  tools/tests/domid/include/xen/domain.h | 126 +
>  tools/tests/domid/test-domid.c |  78 +++
>  5 files changed, 255 insertions(+), 1 deletion(-)
>  create mode 100644 tools/tests/domid/.gitignore
>  create mode 100644 tools/tests/domid/Makefile
>  create mode 100644 tools/tests/domid/include/xen/domain.h
>  create mode 100644 tools/tests/domid/test-domid.c
> 
> diff --git a/tools/tests/Makefile b/tools/tests/Makefile
> index 36928676a666..ff1666425436 100644
> --- a/tools/tests/Makefile
> +++ b/tools/tests/Makefile
> @@ -1,7 +1,7 @@
>  XEN_ROOT = $(CURDIR)/../..
>  include $(XEN_ROOT)/tools/Rules.mk
>  
> -SUBDIRS-y :=
> +SUBDIRS-y := domid
>  SUBDIRS-y += resource
>  SUBDIRS-$(CONFIG_X86) += cpu-policy
>  SUBDIRS-$(CONFIG_X86) += tsx
> diff --git a/tools/tests/domid/.gitignore b/tools/tests/domid/.gitignore
> new file mode 100644
> index ..70e306b3c074
> --- /dev/null
> +++ b/tools/tests/domid/.gitignore
> @@ -0,0 +1,2 @@
> +*.o
> +test-domid
> diff --git a/tools/tests/domid/Makefile b/tools/tests/domid/Makefile
> new file mode 100644
> index ..08fbad096aec
> --- /dev/null
> +++ b/tools/tests/domid/Makefile
> @@ -0,0 +1,48 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +#
> +# Unit tests for domain ID allocator.
> +#
> +# Copyright 2025 Ford Motor Company
> +
> +XEN_ROOT=$(CURDIR)/../../..
> +include $(XEN_ROOT)/tools/Rules.mk
> +
> +TESTS := test-domid
> +
> +vpath domid.c $(XEN_ROOT)/xen/common/
> +
> +.PHONY: all
> +all: $(TESTS)
> +
> +.PHONY: run
> +run: $(TESTS)
> + $(foreach t,$(TESTS),./$(t);)
> +
> +.PHONY: clean
> +clean:
> + $(RM) -- *.o $(TESTS) $(DEPS_RM)
> +
> +.PHONY: distclean
> +distclean: clean
> + $(RM) -- *~
> +
> +.PHONY: install
> +install: all
> + $(INSTALL_DIR) $(DESTDIR)$(LIBEXEC)/tests
> + $(INSTALL_PROG) test-domid $(DESTDIR)$(LIBEXEC)/tests
> +
> +.PHONY: uninstall
> +uninstall:
> + $(RM) -- $(DESTDIR)$(LIBEXEC)/tests/test-domid
> +
> +CFLAGS += -D__XEN_TOOLS__
> +CFLAGS += $(APPEND_CFLAGS)
> +CFLAGS += $(CFLAGS_xeninclude)
> +CFLAGS += -I./include/
> +
> +LDFLAGS += $(APPEND_LDFLAGS)
> +
> +test-domid: domid.o test-domid.o
> + $(CC) $^ -o $@ $(LDFLAGS)
> +
> +-include $(DEPS_INCLUDE)
> diff --git a/tools/tests/domid/include/xen/domain.h 
> b/tools/tests/domid/include/xen/domain.h
> new file mode 100644
> index ..e5db0235445e
> --- /dev/null
> +++ b/tools/tests/domid/include/xen/domain.h
> @@ -0,0 +1,126 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Unit test harness for domain ID allocator.
> + *
> + * Copyright 2025 Ford Motor Company
> + */
> +
> +#ifndef _TEST_HARNESS_
> +#define _TEST_HARNESS_
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +#define BUG_ON(x)   assert(!(x))
> +#define ASSERT(x)   assert(x)
> +
> +#define DOMID_FIRST_RESERVED(10)
> +#define DOMID_INVALID   (11)
> +
> +#define DEFINE_SPINLOCK(x)  unsigned long *(x)

I think this shouldn't be a pointer?  As you otherwise trigger a NULL
pointer dereference in the increases and decreases done below?

> +#define spin_lock(x)((*(x))++)
> +#define spin_unlock(x)  ((*(x))--)

FWIW, I would use a plain bool:

#define DEFINE_SPINLOCK(l)  bool l
#define spin_lock(l)(*(l) = true)
#define spin_unlock(l)  (*(l) = false)

As you don't expect concurrency tests, you could even assert the lock
is in the expected state before taking/releasing it.

> +
> +#define printk printf
> +
> +#define BITS_PER_LONG   sizeof(unsigned long)

That's BYTES_PER_LONG, BITS_PER_LONG would be (sizeof(unsigned long) * 8).

> +#define BITS_PER_WORD   (8U * BITS_PER_LONG)
> +#define BITS_TO_LONGS(bits) \
> +(((bits) + BITS_PER_LONG - 1) / BITS_PER_LONG)
> +#define DECLARE_BITMAP(name, bits) \
> +unsigned long name[BITS_TO_LONGS(bits)]
> +
> +static inline int __test_and_set_bit(unsigned int nr, unsigned long *addr)
> +{
> +unsigned long mask = 1UL << (nr % BITS_PER_WORD);
> +unsigned long *p = addr + (nr / BITS_PER_WORD);
> +int old = (*p & mask) != 0;
> +
> +*p |= mask;
> +
> +return old;
> +}
> +
> +static inline int __test_and_clear_bit(unsigned int nr, unsigned long *addr)
> +{
> +unsigned long mask = 1UL << (nr % BITS_PER_WORD);
> +unsigned long *p = addr + (nr / BITS_P

Re: [PATCH v4 4/8] pdx: allow per-arch optimization of PDX conversion helpers

2025-08-05 Thread Roger Pau Monné
On Tue, Aug 05, 2025 at 02:11:22PM +0200, Jan Beulich wrote:
> On 05.08.2025 11:52, Roger Pau Monne wrote:
> > There are four performance critical PDX conversion helpers that do the PFN
> > to/from PDX and the physical addresses to/from directmap offsets
> > translations.
> > 
> > In the absence of an active PDX compression, those functions would still do
> > the calculations needed, just to return the same input value as no
> > translation is in place and hence PFN and PDX spaces are identity mapped.
> > 
> > To reduce the overhead of having to do the pointless calculations allow
> > architectures to implement the translation helpers in a per-arch header.
> > Rename the existing conversion functions to add a trailing _xlate suffix,
> > so that the per-arch headers can define the non suffixed versions.
> > 
> > Currently only x86 implements meaningful custom handlers to short circuit
> > the translation when not active, using asm goto.  Other architectures use
> > generic macros that map the non-xlate to the xlate variants to keep the
> > previous behavior.
> > 
> > Signed-off-by: Roger Pau Monné 
> 
> Once again:
> Reviewed-by: Jan Beulich 

Thanks, I didn't carry your RB due to the changes requested by Andrew,
that was a bit too much to carry a RB after those.

Roger.



Re: [PATCH v4 4/8] pdx: allow per-arch optimization of PDX conversion helpers

2025-08-05 Thread Jan Beulich
On 05.08.2025 11:52, Roger Pau Monne wrote:
> There are four performance critical PDX conversion helpers that do the PFN
> to/from PDX and the physical addresses to/from directmap offsets
> translations.
> 
> In the absence of an active PDX compression, those functions would still do
> the calculations needed, just to return the same input value as no
> translation is in place and hence PFN and PDX spaces are identity mapped.
> 
> To reduce the overhead of having to do the pointless calculations allow
> architectures to implement the translation helpers in a per-arch header.
> Rename the existing conversion functions to add a trailing _xlate suffix,
> so that the per-arch headers can define the non suffixed versions.
> 
> Currently only x86 implements meaningful custom handlers to short circuit
> the translation when not active, using asm goto.  Other architectures use
> generic macros that map the non-xlate to the xlate variants to keep the
> previous behavior.
> 
> Signed-off-by: Roger Pau Monné 

Once again:
Reviewed-by: Jan Beulich 

Jan



Re: [PATCH] misra: fix violations in macros GVA_INFO, TRACE_TIME

2025-08-05 Thread Nicola Vetrini

On 2025-08-05 13:49, Dmytro Prokopchuk1 wrote:

On 7/31/25 19:09, Nicola Vetrini wrote:

On 2025-07-31 18:05, Andrew Cooper wrote:

On 31/07/2025 4:58 pm, Jan Beulich wrote:

On 31.07.2025 17:37, Andrew Cooper wrote:

On 31/07/2025 4:16 pm, Dmytro Prokopchuk1 wrote:
MISRA Rule 13.1: Initializer lists shall not contain persistent 
side

effects.

The violations occur because both the `GVA_INFO` and `TRACE_TIME`
macro
expansions include expressions with persistent side effects 
introduced

via inline assembly.

In the case of `GVA_INFO`, the issue stems from the initializer 
list

containing a direct call to `current`, which evaluates to
`this_cpu(curr_vcpu)` and involves persistent side effects via the
`asm` statement. To resolve this, the side-effect-producing 
expression
is computed in a separate statement prior to the macro 
initialization:


    struct vcpu *current_vcpu = current;

The computed value is passed into the `GVA_INFO(current_vcpu)` 
macro,
ensuring that the initializer is clean and free of such side 
effects.


Similarly, the `TRACE_TIME` macro violates this rule when 
accessing
expressions like `current->vcpu_id` and 
`current->domain->domain_id`,
which also depend on `current` and inline assembly. To fix this, 
the

value of `current` is assigned to a temporary variable:

    struct vcpu *v = current;

This temporary variable is then used to access `domain_id` and
`vcpu_id`.
This ensures that the arguments passed to the `TRACE_TIME` macro 
are

simple expressions free of persistent side effects.

Signed-off-by: Dmytro Prokopchuk 

The macro `current` specifically does not (and must not) have side
effects.  It is expected to behave like a plain `struct vcpu 
*current;`

variable, and what Eclair is noticing is the thread-local machinery
under this_cpu() (or in x86's case, get_current()).

In ARM's case, it's literally reading the hardware thread pointer
register.  Can anything be done to tell Eclair that `this_cpu()`
specifically does not have side effects?

The only reason that GVA_INFO() and TRACE_TIME() are picked out is
because they both contain embedded structure initialisation, and
this is
is actually an example where trying to comply with MISRA interferes
with
what is otherwise a standard pattern in Xen.
Irrespective of what you say, some of the changes here were 
eliminating

multiple adjacent uses of current, which - iirc - often the compiler
can't fold via CSE.


Where we have mixed usage, sure.  (I'm sure I've got a branch 
somewhere
trying to add some more pure/const around to try and help out here, 
but
I can't find it, and don't recall it being a major improvement 
either.)


The real problem here is that there are a *very few* number of 
contexts

where Eclair refuses to tolerate the use of `current` citing side
effects, despite there being no side effects.

That is the thing that breaks the principle of least surprise, and we
ought to fix it by making Eclair happy with `current` everywhere, 
rather
than force people to learn that 2 macros can't have a `current` in 
their

parameter list.



I'll take a look. Likely yes, by adding a handful of properties. There
are subtleties, though.



Hi, Nicola.

Did you have a chance to try configure Eclair to ignore this macro
`this_cpu()`?



Hi Dmytro,

I'm on it, I needed to handle other tasks first.


Thanks.
Dmytro


--
Nicola Vetrini, B.Sc.
Software Engineer
BUGSENG (https://bugseng.com)
LinkedIn: https://www.linkedin.com/in/nicola-vetrini-a42471253



Re: [PATCH] x86/HVM: polish hvm_asid_init() a little

2025-08-05 Thread Jason Andryuk

On 2025-08-04 11:41, Jan Beulich wrote:

While the logic there covers asymmetric cases, the resulting log
messages would likely raise more confusion than clarify anything. Split
the BSP action from the AP one, indicating the odd CPU in the AP log
message, thus avoiding the impression that global state would have
changed.

While there also
- move g_disabled into .data.ro_after_init; only the BSP will ever write
   to it,
- make the function's parameter unsigned; no negative values may be
   passed in. Also reflect this in svm_asid_init().

Signed-off-by: Jan Beulich 


Reviewed-by: Jason Andryuk 



Re: [PATCH] console: make printk_ratelimit_{burst,ms} const

2025-08-05 Thread Jason Andryuk

On 2025-08-01 03:30, Jan Beulich wrote:

Them not being altered by any means, their __read_mostly attribute is
actually counter-productive: It causes the compiler to instantiate the
variables, when already with just the attributes dropped the compiler
can constant-propagate the values into the sole use site. Make the
situation yet more explicit by adding const.

Also switch the variables away from being plain int, and have the
parameters of __printk_ratelimit() follow suit. While there also
similarly adjust the type of "missed" and "lost", and - while touching
the adjacent line - increase lost_str[] to accommodate any unsigned
32-bit number.

Fixes: a8b1845a7845 ("Miscellaneous data placement adjustments")
Signed-off-by: Jan Beulich 


Reviewed-by: Jason Andryuk 



Re: [PATCH 1/2] efi: Call FreePages only if needed

2025-08-05 Thread Andrew Cooper
On 05/08/2025 5:32 pm, Ross Lagerwall wrote:
> If the config file is builtin, cfg.addr will be zero but Xen
> unconditionally calls FreePages() on the address.
>
> Xen may also call FreePages() with a zero address if blexit() is called
> after this point since cfg.need_to_free is not set to false.
>
> The UEFI specification does not say whether calling FreePages() with a
> zero address is allowed so let's be cautious and use cfg.need_to_free
> properly.
>
> Signed-off-by: Ross Lagerwall 

Acked-by: Andrew Cooper 



Xen 4.21 Development Update [June-July]

2025-08-05 Thread Oleksii Kurochko

Hello everyone,

This email only tracks big items for xen.git tree. Please reply for 
items you

would like to see in 4.21 so that people have an idea what is going on and
prioritise accordingly.

You're welcome to provide description and use cases of the feature you're
working on.

= Timeline =

The current release schedule could be found here:
  https://wiki.xenproject.org/wiki/Xen_Project_X.YY_Release_Notes

And as a reminder I would like to remind at the of this week we will have
Last posting date (Fri Aug 08, 2025).

= Updates =

The following items ( the links for them could be found int the list below )
were moved to completed:
  [since Jun2 - Aug5]:
   Added some tags: [4.21], [next-rel(s)] to the list "Full list of items"
   below.
   * x86:
    - kexec: add kexec support to Mini-OS.
    - x86: memcpy() / memset() (non-)ERMS flavors plus fallout
   * Arm:
    - SMMU handling for PCIe Passthrough on ARM.
    - Add support for R-Car Gen4 PCI host controller.
    - First chunk for Arm R82 and MPU support.
    - Enable R52 support for the first chunk of MPU support
    - ARM split hardware and control domains.
   * RISC-V:
    - Introduce basic UART support and interrupts for hypervisor mode.

  [since May 6 - Jun2]:
    * Hypervisor:
  - tools: remove qemu-traditional
    * Arm:
  - PCI devices passthrough on Arm, part 3
    * x86:
  - xen: cache control improvements
  [since 4.20 relese - May 6]:
    * Hypervisor:
  - Move parts of Arm's Dom0less to common code
  - remove libxenctrl usage from xenstored
    * Arm:
  - Enable early bootup of Armv8-R AArch32 systems
    * x86:
  - x86/HVM: emulation (MMIO) improvements
    * RISC-V:
  - RISC-V some preinit calls.
  - Fixes for UBSAN & GCOV support for RISC-V.

Some new items added:
 [since May]
    * x86:
 - Allow x86 to unflatten DTs
 - hyperlaunch: move remaining pvh dom0 construction
 - x86/hyperlaunch: introduce concept of core domains
 - Confidential computing and AMD SEV support
    * Arm:
 - SMMU handling for PCIe Passthrough on ARM
 - xen/arm: scmi: introduce SCI SCMI SMC multi-agent support
 - Add initial Xen Suspend-to-RAM support on ARM64
    * RISC-V:
 - introduce p2m functionality
 [since 4.20 release]
    * Hypervisor:
  - tools: remove qemu-traditional
  - Physical address hypercall ABI ("HVMv2")
  - xen: Untangle mm.h
  - xen: introduce CONFIG_SYSCTL
  - Add support for exact-node memory claims
  - Several CI cleanups and improvements, plus yet another new runner
    * x86:
  - x86/EFI: prevent write-execute sections
  - x86: Trenchboot Secure Launch DRTM (Xen)
  - Hyperlaunch device tree for dom0 (v6)
  - amd-cppc CPU Performance Scaling Driver (v4)
  - Hyperlaunch domain builder
  - kexec: add kexec support to Mini-OS
  - xen: cache control improvements (should be moved to "Hypervisor"?)
  - x86: generate xen.efi image with no write-execute sections
  - x86/asm: cleanups after toolchain baseline upgrade
    * Arm:
  - Add support for R-Car Gen4 PCI host controller (v4)
  - FF-A VM to VM support (v5)
  - First chunk for Arm R82 and MPU support (v4)
  - ARM split hardware and control domains (v5)
  - MPU mm subsistem skeleton
    * RISC-V:
  - introduce basic UART support and interrupts for hypervisor mode

* Full list of items : *

= Projects =

== Hypervisor ==

* [4.21] xen/console: cleanup console input switch logic (v5)
  - Denis Mukhin
  - 
https://lore.kernel.org/xen-devel/20250530231841.73386-1-dmuk...@ford.com/


* [4.21] xen: introduce CONFIG_SYSCTL (v4 -> v8)
  -  Penny Zheng
  - 
https://lore.kernel.org/xen-devel/20250711043158.2566880-1-penny.zh...@amd.com/


* [4.21] Several CI cleanups and improvements, plus yet another new runner
  - Marek Marczykowski-Górecki
  - 
https://lore.kernel.org/xen-devel/cover.7da1777882774486a13e6f39ff4a2096f6b7901e.1744028549.git-series.marma...@invisiblethingslab.com/
  - 
https://patchew.org/Xen/cover.7da1777882774486a13e6f39ff4a2096f6b7901e.1744028549.git-series.marma...@invisiblethingslab.com/


* [4.21] automation: Refresh the remaining Debian containers (v2)
  -  Javi Merino
  - 
https://lore.kernel.org/xen-devel/cover.1730743077.git.javi.mer...@cloud.com/T/#m5d9acb7cf5db3c2be3d6527de14b69b07812314e


* [4.21] MSI-X support with qemu in stubdomain, and other related 
changes (v8)

  -  Marek Marczykowski-Górecki
  - 
https://lore.kernel.org/xen-devel/cover.33fb4385b7dd6c53bda4acf0a9e91748b3d7b1f7.1715313192.git-series.marma...@invisiblethingslab.com/

  -  Only automation patch left to be reviewed/merged.

* [next-rel(s)] Physical address hypercall ABI ("HVMv2")
  - Teddy Astie
  - 
https://lore.kernel.org/xen-devel/cover.1744981654.git.teddy.as...@vates.tech/


* [next-rel(s)] xen: Untangle mm.h
  -  Andrew Cooper
  - 
https://lore.kernel.org/xen-devel/20250312174513.4075066-1-andrew.coop...@citrix.com/
  - 
https://patchew.org/Xen/2

Re: [XEN][PATCH] xen/dom0less: arm: fix hwdom 1:1 low memory allocation

2025-08-05 Thread Grygorii Strashko

Hi Jason,

On 05.08.25 20:21, Jason Andryuk wrote:

On 2025-08-01 11:54, Grygorii Strashko wrote:

From: Grygorii Strashko 

Call stack for dom0less hwdom case (1:1) memory:
create_domUs
|-construct_domU
   |-construct_hwdom()
 |-allocate_memory_11()

And allocate_memory_11() uses "dom0_mem" as:
min_low_order =
   get_order_from_bytes(min_t(paddr_t, dom0_mem, MB(128)));

In case of dom0less boot the "dom0_mem" is not used and defaulted to 0,
which causes min_low_order to get high value > order and so no allocations
happens from low memory.

Fix it, by using kinfo->unassigned_mem instead of "dom0_mem" has correct
memory size in both cases: regular dom0 boot and dom0less boot.

Fixes: 43afe6f030244 ("xen/common: dom0less: introduce common dom0less-build.c")


I think I introduced this bug with the dom0less hwdom support, and the correct 
fixes is:

Fixes: 52cb53f1816a ("xen/arm: dom0less hwdom construction")


Signed-off-by: Grygorii Strashko 


dom0_mem is also mentioned in the comment on line 252.  With that changed as 
well:

Reviewed-by: Jason Andryuk 



Will smth like below be ok?

  * We first allocate the largest allocation we can as low as we
  * can. This then becomes the first bank. This bank must be at least
- * 128MB (or dom0_mem if that is smaller).
+ * 128MB (or memory size requested for domain if that is smaller).


--
Best regards,
-grygorii




[PATCH v2] xen/dom0less: arm: fix hwdom 1:1 low memory allocation

2025-08-05 Thread Grygorii Strashko
From: Grygorii Strashko 

Call stack for dom0less hwdom case (1:1) memory:
create_domUs
|-construct_domU
  |-construct_hwdom()
|-allocate_memory_11()

And allocate_memory_11() uses "dom0_mem" as:
min_low_order =
  get_order_from_bytes(min_t(paddr_t, dom0_mem, MB(128)));

In case of dom0less boot the "dom0_mem" is not used and defaulted to 0,
which causes min_low_order to get high value > order and so no allocations
happens from low memory.

Fix it, by using kinfo->unassigned_mem instead of "dom0_mem" has correct
memory size in both cases: regular dom0 boot and dom0less boot.

Fixes: 52cb53f1816a ("xen/arm: dom0less hwdom construction")
Signed-off-by: Grygorii Strashko 
Reviewed-by: Denis Mukhin 
Reviewed-by: Jason Andryuk 
---
Changes in v2:
- changed 'fixes' tag
- fixed comment for allocate_memory_11()
- added reviewer's tags

 xen/arch/arm/domain_build.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
index 463ae4474d30..a9e4153e3cf9 100644
--- a/xen/arch/arm/domain_build.c
+++ b/xen/arch/arm/domain_build.c
@@ -249,7 +249,7 @@ fail:
  *
  * We first allocate the largest allocation we can as low as we
  * can. This then becomes the first bank. This bank must be at least
- * 128MB (or dom0_mem if that is smaller).
+ * 128MB (or memory size requested for domain if that is smaller).
  *
  * Then we start allocating more memory, trying to allocate the
  * largest possible size and trying smaller sizes until we
@@ -278,7 +278,7 @@ static void __init allocate_memory_11(struct domain *d,
   struct kernel_info *kinfo)
 {
 const unsigned int min_low_order =
-get_order_from_bytes(min_t(paddr_t, dom0_mem, MB(128)));
+get_order_from_bytes(min_t(paddr_t, kinfo->unassigned_mem, MB(128)));
 const unsigned int min_order = get_order_from_bytes(MB(4));
 struct membanks *mem = kernel_info_get_mem(kinfo);
 struct page_info *pg;
-- 
2.34.1



Re: [PATCH v1 01/25] xen/x86: move domctl.o out of PV_SHIM_EXCLUSIVE

2025-08-05 Thread Jan Beulich
On 05.08.2025 05:38, Penny, Zheng wrote:
> [Public]
> 
>> -Original Message-
>> From: Jan Beulich 
>> Sent: Monday, August 4, 2025 3:43 PM
>> To: Penny, Zheng 
>> Cc: Huang, Ray ; Andrew Cooper
>> ; Roger Pau Monné ;
>> Anthony PERARD ; Orzel, Michal
>> ; Julien Grall ; Stefano Stabellini
>> ; xen-devel@lists.xenproject.org
>> Subject: Re: [PATCH v1 01/25] xen/x86: move domctl.o out of
>> PV_SHIM_EXCLUSIVE
>>
>> On 03.08.2025 11:47, Penny Zheng wrote:
>>> In order to fix CI error of a randconfig picking both
>>> PV_SHIM_EXCLUSIVE=y and HVM=y results in hvm.c being built, but
>>> domctl.c not being built, which leaves a few functions, like
>>> domctl_lock_acquire/release() undefined, causing linking to fail.
>>> To fix that, we intend to move domctl.o out of the PV_SHIM_EXCLUSIVE
>>> Makefile /hypercall-defs section, with this adjustment, we also need
>>> to release redundant vnuma_destroy() stub definition and paging_domctl
>>> hypercall-defs from PV_SHIM_EXCLUSIVE guardian, to not break
>>> compilation Above change will leave dead code in the shim binary
>>> temporarily and will be fixed with the introduction of CONFIG_DOMCTL.
>>>
>>> Fixes: 568f806cba4c ("xen/x86: remove "depends on
>>> !PV_SHIM_EXCLUSIVE"")
>>> Reported-by: Jan Beulich 
>>> Signed-off-by: Penny Zheng 
>>> ---
>>> v1 -> v2:
>>> - remove paging_domctl hypercall-defs
>>
>> And you've run this through a full round of testing this time, in isolation?
> 
> This commit shall be committed together with "xen/x86: complement 
> PG_log_dirty wrapping", (I've added in change log, idk why it didn't get 
> delivered in the mail list in the last).

And "committed together" still means the two at least build okay independently
(i.e. allowing the build-each-commit job to succeed)?

As to the missing indication thereof in the submission: Patch 01 has a revlog,
so if anything was missing there you must have added it some other way. But
the cover letter is lacking that information as well. (As indicated earlier,
to increase the chance of such a remark actually being noticed, it's best put
in both places.)

> As PG_log_dirty is disabled on PV mode, paging_domctl() will still have 
> "undefined reference" on PV mode, which gets fixed in "xen/x86: complement 
> PG_log_dirty wrapping".  I thought it better sits there.
> If it doesn't comply with the commit policy, I'll move according fix here.

Let me post a pair of patches dealing with part of the problem, in an imo
(longer term) more desirable way.

Jan



Re: [PATCH v6 11/19] xen/x86: introduce "cpufreq=amd-cppc" xen cmdline and amd-cppc driver

2025-08-05 Thread Jan Beulich
On 05.08.2025 08:31, Penny, Zheng wrote:
> [Public]
> 
>> -Original Message-
>> From: Jan Beulich 
>> Sent: Monday, August 4, 2025 4:48 PM
>> To: Penny, Zheng 
>> Cc: Huang, Ray ; Andrew Cooper
>> ; Anthony PERARD ;
>> Orzel, Michal ; Julien Grall ; Roger 
>> Pau
>> Monné ; Stefano Stabellini ; 
>> xen-
>> de...@lists.xenproject.org
>> Subject: Re: [PATCH v6 11/19] xen/x86: introduce "cpufreq=amd-cppc" xen 
>> cmdline
>> and amd-cppc driver
>>
>> On 04.08.2025 10:09, Penny, Zheng wrote:
>>> [Public]
>>>
 -Original Message-
 From: Jan Beulich 
 Sent: Thursday, July 17, 2025 12:00 AM
 To: Penny, Zheng 
 Cc: Huang, Ray ; Andrew Cooper
 ; Anthony PERARD
 ; Orzel, Michal ;
 Julien Grall ; Roger Pau Monné
 ; Stefano Stabellini ;
 xen- de...@lists.xenproject.org
 Subject: Re: [PATCH v6 11/19] xen/x86: introduce "cpufreq=amd-cppc"
 xen cmdline and amd-cppc driver

 On 11.07.2025 05:50, Penny Zheng wrote:
> --- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
> +++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
> @@ -128,12 +128,14 @@ static int __init cf_check
> cpufreq_driver_init(void)
>
>  if ( cpufreq_controller == FREQCTL_xen )
>  {
> +unsigned int i = 0;

 Pointless initializer; both for() loops set i to 0. But also see further 
 down.

> @@ -157,9 +164,70 @@ static int __init cf_check
> cpufreq_driver_init(void)
>
>  case X86_VENDOR_AMD:
>  case X86_VENDOR_HYGON:
> -ret = IS_ENABLED(CONFIG_AMD) ? powernow_register_driver() : -
 ENODEV;
> +if ( !IS_ENABLED(CONFIG_AMD) )
> +{
> +ret = -ENODEV;
> +break;
> +}
> +ret = -ENOENT;

 The code structure is sufficiently different from the Intel
 counterpart for this to perhaps better move ...

> +for ( i = 0; i < cpufreq_xen_cnt; i++ )
> +{
> +switch ( cpufreq_xen_opts[i] )
> +{
> +case CPUFREQ_xen:
> +ret = powernow_register_driver();
> +break;
> +
> +case CPUFREQ_amd_cppc:
> +ret = amd_cppc_register_driver();
> +break;
> +
> +case CPUFREQ_none:
> +ret = 0;
> +break;
> +
> +default:
> +printk(XENLOG_WARNING
> +   "Unsupported cpufreq driver for vendor AMD or 
> Hygon\n");
> +break;

 ... here.

>>>
>>> Are we suggesting moving
>>> "
>>> if ( !IS_ENABLED(CONFIG_AMD) )
>>> {
>>> ret = -ENODEV;
>>> break;
>>> }
>>> " here? In which case, When CONFIG_AMD=n and users doesn't provide
>>> "cpufreq=xxx", we will have cpufreq_xen_cnt initialized as 1 and
>>> cpufreq_xen_opts[0] = CPUFREQ_xen. powernow_register_driver() hence
>>> gets invoked. The thing is that we don't have stub for it and it is
>>> compiled under CONFIG_AMD I suggest to change to use #ifdef CONFIG_AMD
>>> code wrapping
>>>
> +}
> +
> +if ( !ret || ret == -EBUSY )
> +break;
> +}
> +
>  break;
>  }
> +
> +/*
> + * After successful cpufreq driver registeration,
 XEN_PROCESSOR_PM_CPPC
> + * and XEN_PROCESSOR_PM_PX shall become exclusive flags.
> + */
> +if ( !ret )
> +{
> +ASSERT(i < cpufreq_xen_cnt);
> +switch ( cpufreq_xen_opts[i] )

 Hmm, this is using the the initializer of i that I commented on. I
 think there's another default: case missing, where you simply "return 0" 
 (to
>> retain prior behavior).
 But again see also yet further down.


> +/*
> + * No cpufreq driver gets registered, clear both
> + * XEN_PROCESSOR_PM_CPPC and XEN_PROCESSOR_PM_PX
> + */
> + xen_processor_pmbits &= ~(XEN_PROCESSOR_PM_CPPC |
> +   XEN_PROCESSOR_PM_PX);

 Yet more hmm - this path you want to get through for the case mentioned 
 above.
 But only this code; specifically not the "switch (
 cpufreq_xen_opts[i] )", which really is "switch ( cpufreq_xen_opts[0]
 )" in that case, and that's pretty clearly wrong to evaluate in then.
>>>
>>> Correct me if I understand you wrongly:
>>> The above "case missing" , are we talking about is entering "case
>> CPUFREQ_none" ?
>>> IMO, it may never be entered. If users doesn't provide "cpufreq=xxx", we 
>>> will
>> have cpufreq_xen_cnt initialized as 1 and cpufreq_

Re: [PATCH v2] xen: rework error handling in vcpu_create

2025-08-05 Thread Jan Beulich
On 04.08.2025 18:57, Stewart Hildebrand wrote:
> On 8/4/25 03:57, Jan Beulich wrote:
>> On 01.08.2025 22:24, Stewart Hildebrand wrote:
>>> @@ -839,6 +839,9 @@ void sched_destroy_vcpu(struct vcpu *v)
>>>  {
>>>  struct sched_unit *unit = v->sched_unit;
>>>  
>>> +if ( !unit )
>>> +return;
>>> +
>>>  kill_timer(&v->periodic_timer);
>>>  kill_timer(&v->singleshot_timer);
>>>  kill_timer(&v->poll_timer);
>>
>> What if it's the 2nd error path in sched_init_vcpu() that is taken? Then we
>> might take this path (just out of context here)
>>
>> if ( unit->vcpu_list == v )
>> {
>> rcu_read_lock(&sched_res_rculock);
>>
>> sched_remove_unit(vcpu_scheduler(v), unit);
>> sched_free_udata(vcpu_scheduler(v), unit->priv);
>>
>> and at least Credit1's hook doesn't look to be safe against being passed 
>> NULL.
>> (Not to speak of the risk of unit->priv being used elsewhere while cleaning
>> up.)
> 
> 
> Are you referring to this error path in sched_init_vcpu?

No, given the context I thought it was clear that I was referring to

static void cf_check
csched_free_udata(const struct scheduler *ops, void *priv)
{
struct csched_unit *svc = priv;

BUG_ON( !list_empty(&svc->runq_elem) );

(i.e. particularly this BUG_ON()).

Jan



Re: [PATCH v10 4/4] vpci/msix: Free MSIX resources when init_msix() fails

2025-08-05 Thread Jan Beulich
On 05.08.2025 05:49, Jiqian Chen wrote:
> When MSI-X initialization fails vPCI will hide the capability, but
> remove of handlers and data won't be performed until the device is
> deassigned.  Introduce a MSI-X cleanup hook that will be called when
> initialization fails to cleanup MSI-X related hooks and free it's
> associated data.
> 
> As all supported capabilities have been switched to use the cleanup
> hooks call those from vpci_deassign_device() instead of open-code the
> capability specific cleanup in there.
> 
> Signed-off-by: Jiqian Chen 
> ---
> cc: "Roger Pau Monné" 
> ---
> v9->v10 changes:
> * Call all cleanup hook in vpci_deassign_device() instead of cleanup_msix().

Isn't this rather an omission in an earlier change, and hence may want to
come separately and with a Fixes: tag?

> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -321,6 +321,27 @@ void vpci_deassign_device(struct pci_dev *pdev)
>  &pdev->domain->vpci_dev_assigned_map);
>  #endif
>  
> +for ( i = 0; i < NUM_VPCI_INIT; i++ )
> +{
> +const vpci_capability_t *capability = &__start_vpci_array[i];
> +const unsigned int cap = capability->id;
> +unsigned int pos = 0;
> +
> +if ( !capability->is_ext )
> +pos = pci_find_cap_offset(pdev->sbdf, cap);
> +else if ( is_hardware_domain(pdev->domain) )
> +pos = pci_find_ext_capability(pdev->sbdf, cap);

What's the point of doing this when ...

> +if ( pos && capability->cleanup )

... ->cleanup is NULL? Don't you want to have

if ( !capability->cleanup )
continue;

earlier on?

Jan



[PATCH 0/2] x86/mm: paging build adjustments

2025-08-05 Thread Jan Beulich
This is in the hope that it'll lead to a better overall result than plainly
taking [1] and [2] (or whichever re-work thereof).

1: drop paging_get_mode()
2: correct PG_log_dirty definition

Jan

[1] https://lists.xen.org/archives/html/xen-devel/2025-08/msg00050.html
[2] https://lists.xen.org/archives/html/xen-devel/2025-08/msg00051.html



[PATCH 2/2] x86/mm: correct PG_log_dirty definition

2025-08-05 Thread Jan Beulich
While it is correct that in shim-exclusive mode log-dirty handling is
all unreachable code, the present conditional still isn't correct: In a
HVM=n and SHADOW_PAGING=n configuration log-dirty code also is all
unreachable (and hence violating Misra rule 2.1).

As we're aiming at moving away from special casing PV_SHIM_EXCLUSIVE=y,
don't retain that part of the conditional.

Because of hypercall-defs.c we need to carry out the dependency by
introducing a new auxiliary PAGING control.

Since compiling out mm/paging.c altogether would entail further changes,
merely conditionalize the one function in there (paging_enable()) which
would otherwise remain unreachable (Misra rule 2.1 again) when PAGING=n.

Fixes: 23d4e0d17b76 ("x86/shim: fix build with PV_SHIM_EXCLUSIVE and 
SHADOW_PAGING")
Signed-off-by: Jan Beulich 
---
Of course PAGING is at risk of being confused with MEM_PAGING. It not
having a prompt, I hope that's tolerable, as I can't really think of a
better name.

Other PG_log_dirty pre-processor conditionals then likely also want
replacing. mm/paging.c and mm/p2m-basic.c could also be compiled out
altogether when PAGING=n, at the expense of introducing a few more
stubs.

FTAOD, the Fixes: tag being referenced does not mean this patch corrects
the far more recently introduced build issue with the combination of the
two features. That's still work that I expect Penny to carry out (with
there still being the option of reverting the final part of the earlier
series).

--- a/xen/arch/x86/Kconfig
+++ b/xen/arch/x86/Kconfig
@@ -162,6 +162,9 @@ config SHADOW_PAGING
 
   If unsure, say Y.
 
+config PAGING
+   def_bool HVM || SHADOW_PAGING
+
 config BIGMEM
bool "big memory support"
default n
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -213,11 +213,15 @@ long arch_do_domctl(
 {
 
 case XEN_DOMCTL_shadow_op:
+#ifdef CONFIG_PAGING
 ret = paging_domctl(d, &domctl->u.shadow_op, u_domctl, 0);
 if ( ret == -ERESTART )
 return hypercall_create_continuation(
__HYPERVISOR_paging_domctl_cont, "h", u_domctl);
 copyback = true;
+#else
+ret = -EOPNOTSUPP;
+#endif
 break;
 
 case XEN_DOMCTL_ioport_permission:
--- a/xen/arch/x86/include/asm/paging.h
+++ b/xen/arch/x86/include/asm/paging.h
@@ -55,7 +55,7 @@
 #define PG_translate   0
 #define PG_external0
 #endif
-#if defined(CONFIG_HVM) || !defined(CONFIG_PV_SHIM_EXCLUSIVE)
+#ifdef CONFIG_PAGING
 /* Enable log dirty mode */
 #define PG_log_dirty   (XEN_DOMCTL_SHADOW_ENABLE_LOG_DIRTY << PG_mode_shift)
 #else
--- a/xen/arch/x86/mm/paging.c
+++ b/xen/arch/x86/mm/paging.c
@@ -864,6 +864,7 @@ void paging_final_teardown(struct domain
 p2m_final_teardown(d);
 }
 
+#ifdef CONFIG_PAGING
 /* Enable an arbitrary paging-assistance mode.  Call once at domain
  * creation. */
 int paging_enable(struct domain *d, u32 mode)
@@ -889,6 +890,7 @@ int paging_enable(struct domain *d, u32
 else
 return shadow_enable(d, mode);
 }
+#endif
 
 #ifdef CONFIG_HVM
 /* Called from the guest to indicate that a process is being torn down
--- a/xen/include/hypercall-defs.c
+++ b/xen/include/hypercall-defs.c
@@ -197,9 +197,11 @@ dm_op(domid_t domid, unsigned int nr_buf
 #ifdef CONFIG_SYSCTL
 sysctl(xen_sysctl_t *u_sysctl)
 #endif
+#if defined(CONFIG_X86) && defined(CONFIG_PAGING)
+paging_domctl_cont(xen_domctl_t *u_domctl)
+#endif
 #ifndef CONFIG_PV_SHIM_EXCLUSIVE
 domctl(xen_domctl_t *u_domctl)
-paging_domctl_cont(xen_domctl_t *u_domctl)
 platform_op(xen_platform_op_t *u_xenpf_op)
 #endif
 #ifdef CONFIG_HVM
@@ -296,7 +298,7 @@ dm_op  compa
 hypfs_op   do   do   do   do   do
 #endif
 mcado   do   ---
-#ifndef CONFIG_PV_SHIM_EXCLUSIVE
+#if defined(CONFIG_X86) && defined(CONFIG_PAGING)
 paging_domctl_cont do   do   do   do   -
 #endif
 




[PATCH 1/2] x86/mm: drop paging_get_mode()

2025-08-05 Thread Jan Beulich
The function was introduced without any caller, and never gained any.
Thus it has always been violating Misra rule 2.1 (unreachable code).

Fixes: dd6de3ab9985 ("Implement Nested-on-Nested")
Signed-off-by: Jan Beulich 

--- a/xen/arch/x86/include/asm/paging.h
+++ b/xen/arch/x86/include/asm/paging.h
@@ -225,7 +225,6 @@ int paging_enable(struct domain *d, u32
 
 #define paging_get_hostmode(v) ((v)->arch.paging.mode)
 #define paging_get_nestedmode(v)   ((v)->arch.paging.nestedmode)
-const struct paging_mode *paging_get_mode(struct vcpu *v);
 void paging_update_nestedmode(struct vcpu *v);
 
 /* Page fault handler
--- unstable.orig/xen/arch/x86/mm/paging.c  2025-08-05 08:59:15.512131147 
+0200
+++ unstable/xen/arch/x86/mm/paging.c   2025-08-05 09:00:24.160657794 +0200
@@ -946,14 +946,6 @@ void paging_dump_vcpu_info(struct vcpu *
 }
 }
 
-const struct paging_mode *paging_get_mode(struct vcpu *v)
-{
-if (!nestedhvm_is_n2(v))
-return paging_get_hostmode(v);
-
-return paging_get_nestedmode(v);
-}
-
 #ifdef CONFIG_HVM
 void paging_update_nestedmode(struct vcpu *v)
 {




Re: [PATCH v2] xen: rework error handling in vcpu_create

2025-08-05 Thread Stewart Hildebrand
On 8/5/25 05:17, Jan Beulich wrote:
> On 05.08.2025 11:06, Stewart Hildebrand wrote:
>> On 8/5/25 03:44, Jan Beulich wrote:
>>> On 04.08.2025 18:57, Stewart Hildebrand wrote:
 On 8/4/25 03:57, Jan Beulich wrote:
> On 01.08.2025 22:24, Stewart Hildebrand wrote:
>> @@ -839,6 +839,9 @@ void sched_destroy_vcpu(struct vcpu *v)
>>  {
>>  struct sched_unit *unit = v->sched_unit;
>>  
>> +if ( !unit )
>> +return;
>> +
>>  kill_timer(&v->periodic_timer);
>>  kill_timer(&v->singleshot_timer);
>>  kill_timer(&v->poll_timer);
>
> What if it's the 2nd error path in sched_init_vcpu() that is taken?
>>
>> ^^ This ^^ is what I'm confused about
> 
> If sched_init_vcpu() took the indicated path,

What path? In the one I'm looking at, sched_free_unit() gets called,
setting v->sched_unit = NULL, and in that case ...

> the if() you add here won't
> help, and ...

... the condition is true, and ...

> Then we
> might take this path (just out of context here)
>
> if ( unit->vcpu_list == v )
> {
> rcu_read_lock(&sched_res_rculock);
>
> sched_remove_unit(vcpu_scheduler(v), unit);
> sched_free_udata(vcpu_scheduler(v), unit->priv);
>
> and at least Credit1's hook doesn't look to be safe against being passed 
> NULL.
> (Not to speak of the risk of unit->priv being used elsewhere while 
> cleaning
> up.)


 Are you referring to this error path in sched_init_vcpu?
>>>
>>> No, given the context I thought it was clear that I was referring to
>>>
>>> static void cf_check
>>> csched_free_udata(const struct scheduler *ops, void *priv)
>>> {
>>> struct csched_unit *svc = priv;
>>>
>>> BUG_ON( !list_empty(&svc->runq_elem) );
> 
> ... we'd make it here (afaict).

... we'd not make it here.



Re: [PATCH v10] xen/console: introduce domain_console struct

2025-08-05 Thread Roger Pau Monné
On Fri, Jul 25, 2025 at 09:08:02PM +, dm...@proton.me wrote:
> From: Denis Mukhin  
> 
> Introduce domain_console for grouping data structures used for integrating
> domain's diagnostic console with Xen's console driver.
> 
> Group all pbuf-related data structures under domain_console. Rename the moved
> fields to plain .buf, .idx and .lock names, since all uses of the fields are
> touched.
> 
> Ensure accesses to domain_console pointer are valid in console_switch_input().
> 
> Bump the domain console buffer allocation size to 256. No extra symbol for the
> value since it is used only once during data structure declaration. All size
> checks use ARRAY_SIZE().
> 
> Allocate domain_console from the heap so that the parent domain struct size
> stays below PAGE_SIZE boundary to account for more console-related fields
> added in the future.
> 
> Finally, update the domain_console allocation and initialization code.
> 
> Not a functional change.
> 
> Signed-off-by: Denis Mukhin 

Just one request about the allocation of the domain_console in
domain_create().

> ---
> Changes since v9:
> - kept cd as is in hvm_print_line() as requested
> - dropped domain lock in hvm_print_line()
> 
> Link to v9: 
> https://lore.kernel.org/xen-devel/20250723001116.186009-1-dmuk...@ford.com/
> ---
>  xen/arch/arm/vpl011.c  |  2 +-
>  xen/arch/x86/hvm/hvm.c | 16 +---
>  xen/arch/x86/pv/shim.c |  2 +-
>  xen/common/domain.c| 19 +--
>  xen/drivers/char/console.c | 28 
>  xen/include/xen/sched.h| 24 +---
>  6 files changed, 49 insertions(+), 42 deletions(-)
> 
> diff --git a/xen/arch/arm/vpl011.c b/xen/arch/arm/vpl011.c
> index 480fc664fc62..d0d17c76b72c 100644
> --- a/xen/arch/arm/vpl011.c
> +++ b/xen/arch/arm/vpl011.c
> @@ -713,7 +713,7 @@ int domain_vpl011_init(struct domain *d, struct 
> vpl011_init_info *info)
>  }
>  else
>  {
> -d->console.input_allowed = true;
> +d->console->input_allowed = true;
>  vpl011->backend_in_domain = false;
>  
>  vpl011->backend.xen = xzalloc(struct vpl011_xen_backend);
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index 56c7de39778b..37af507a8d90 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -560,6 +560,7 @@ static int cf_check hvm_print_line(
>  int dir, unsigned int port, unsigned int bytes, uint32_t *val)
>  {
>  struct domain *cd = current->domain;
> +struct domain_console *cons = cd->console;
>  char c = *val;
>  
>  ASSERT(bytes == 1 && port == XEN_HVM_DEBUGCONS_IOPORT);
> @@ -571,16 +572,17 @@ static int cf_check hvm_print_line(
>  if ( !is_console_printable(c) )
>  return X86EMUL_OKAY;
>  
> -spin_lock(&cd->pbuf_lock);
> +spin_lock(&cons->lock);
> +ASSERT(cons->idx < ARRAY_SIZE(cons->buf));
>  if ( c != '\n' )
> -cd->pbuf[cd->pbuf_idx++] = c;
> -if ( (cd->pbuf_idx == (DOMAIN_PBUF_SIZE - 1)) || (c == '\n') )
> +cons->buf[cons->idx++] = c;
> +if ( (cons->idx == (ARRAY_SIZE(cons->buf) - 1)) || (c == '\n') )
>  {
> -cd->pbuf[cd->pbuf_idx] = '\0';
> -guest_printk(cd, XENLOG_G_DEBUG "%s\n", cd->pbuf);
> -cd->pbuf_idx = 0;
> +cons->buf[cons->idx] = '\0';
> +guest_printk(cd, XENLOG_G_DEBUG "%s\n", cons->buf);
> +cons->idx = 0;
>  }
> -spin_unlock(&cd->pbuf_lock);
> +spin_unlock(&cons->lock);
>  
>  return X86EMUL_OKAY;
>  }
> diff --git a/xen/arch/x86/pv/shim.c b/xen/arch/x86/pv/shim.c
> index bc2a7dd5fae5..bd29c53a2d34 100644
> --- a/xen/arch/x86/pv/shim.c
> +++ b/xen/arch/x86/pv/shim.c
> @@ -239,7 +239,7 @@ void __init pv_shim_setup_dom(struct domain *d, 
> l4_pgentry_t *l4start,
>   */
>  d->max_pages = domain_tot_pages(d);
>  
> -d->console.input_allowed = true;
> +d->console->input_allowed = true;
>  }
>  
>  static void write_start_info(struct domain *d)
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 303c338ef293..caef4cc8d649 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -669,7 +669,7 @@ static void _domain_destroy(struct domain *d)
>  BUG_ON(!d->is_dying);
>  BUG_ON(atomic_read(&d->refcnt) != DOMAIN_DESTROYED);
>  
> -xfree(d->pbuf);
> +xvfree(d->console);

Nit: even if the struct if being freed just a couple of lines below, I
would use XVFREE() here.

>  
>  argo_destroy(d);
>  
> @@ -835,8 +835,6 @@ struct domain *domain_create(domid_t domid,
>  flags |= CDF_hardware;
>  if ( old_hwdom )
>  old_hwdom->cdf &= ~CDF_hardware;
> -
> -d->console.input_allowed = true;
>  }
>  
>  /* Holding CDF_* internal flags. */
> @@ -866,8 +864,6 @@ struct domain *domain_create(domid_t domid,
>  spin_lock_init(&d->shutdown_lock);
>  d->shutdown_code = SHUTDOWN_CODE_INVALID;
>  
> -spin_lock_init(&d->pbuf_lock);
> -
>  rwlock_init(&d->vnuma_rwlock

Re: [PATCH v2] xen: rework error handling in vcpu_create

2025-08-05 Thread Stewart Hildebrand
On 8/5/25 03:44, Jan Beulich wrote:
> On 04.08.2025 18:57, Stewart Hildebrand wrote:
>> On 8/4/25 03:57, Jan Beulich wrote:
>>> On 01.08.2025 22:24, Stewart Hildebrand wrote:
 @@ -839,6 +839,9 @@ void sched_destroy_vcpu(struct vcpu *v)
  {
  struct sched_unit *unit = v->sched_unit;
  
 +if ( !unit )
 +return;
 +
  kill_timer(&v->periodic_timer);
  kill_timer(&v->singleshot_timer);
  kill_timer(&v->poll_timer);
>>>
>>> What if it's the 2nd error path in sched_init_vcpu() that is taken?

^^ This ^^ is what I'm confused about

>>> Then we
>>> might take this path (just out of context here)
>>>
>>> if ( unit->vcpu_list == v )
>>> {
>>> rcu_read_lock(&sched_res_rculock);
>>>
>>> sched_remove_unit(vcpu_scheduler(v), unit);
>>> sched_free_udata(vcpu_scheduler(v), unit->priv);
>>>
>>> and at least Credit1's hook doesn't look to be safe against being passed 
>>> NULL.
>>> (Not to speak of the risk of unit->priv being used elsewhere while cleaning
>>> up.)
>>
>>
>> Are you referring to this error path in sched_init_vcpu?
> 
> No, given the context I thought it was clear that I was referring to
> 
> static void cf_check
> csched_free_udata(const struct scheduler *ops, void *priv)
> {
> struct csched_unit *svc = priv;
> 
> BUG_ON( !list_empty(&svc->runq_elem) );
> 
> (i.e. particularly this BUG_ON()).

The comment about credit1 was clear



Re: [PATCH v2] xen: rework error handling in vcpu_create

2025-08-05 Thread Jan Beulich
On 05.08.2025 11:06, Stewart Hildebrand wrote:
> On 8/5/25 03:44, Jan Beulich wrote:
>> On 04.08.2025 18:57, Stewart Hildebrand wrote:
>>> On 8/4/25 03:57, Jan Beulich wrote:
 On 01.08.2025 22:24, Stewart Hildebrand wrote:
> @@ -839,6 +839,9 @@ void sched_destroy_vcpu(struct vcpu *v)
>  {
>  struct sched_unit *unit = v->sched_unit;
>  
> +if ( !unit )
> +return;
> +
>  kill_timer(&v->periodic_timer);
>  kill_timer(&v->singleshot_timer);
>  kill_timer(&v->poll_timer);

 What if it's the 2nd error path in sched_init_vcpu() that is taken?
> 
> ^^ This ^^ is what I'm confused about

If sched_init_vcpu() took the indicated path, the if() you add here won't
help, and ...

 Then we
 might take this path (just out of context here)

 if ( unit->vcpu_list == v )
 {
 rcu_read_lock(&sched_res_rculock);

 sched_remove_unit(vcpu_scheduler(v), unit);
 sched_free_udata(vcpu_scheduler(v), unit->priv);

 and at least Credit1's hook doesn't look to be safe against being passed 
 NULL.
 (Not to speak of the risk of unit->priv being used elsewhere while cleaning
 up.)
>>>
>>>
>>> Are you referring to this error path in sched_init_vcpu?
>>
>> No, given the context I thought it was clear that I was referring to
>>
>> static void cf_check
>> csched_free_udata(const struct scheduler *ops, void *priv)
>> {
>> struct csched_unit *svc = priv;
>>
>> BUG_ON( !list_empty(&svc->runq_elem) );

... we'd make it here (afaict).

Jan



Re: [PATCH] misra: fix violations in macros GVA_INFO, TRACE_TIME

2025-08-05 Thread Dmytro Prokopchuk1


On 7/31/25 19:09, Nicola Vetrini wrote:
> On 2025-07-31 18:05, Andrew Cooper wrote:
>> On 31/07/2025 4:58 pm, Jan Beulich wrote:
>>> On 31.07.2025 17:37, Andrew Cooper wrote:
 On 31/07/2025 4:16 pm, Dmytro Prokopchuk1 wrote:
> MISRA Rule 13.1: Initializer lists shall not contain persistent side
> effects.
>
> The violations occur because both the `GVA_INFO` and `TRACE_TIME` 
> macro
> expansions include expressions with persistent side effects introduced
> via inline assembly.
>
> In the case of `GVA_INFO`, the issue stems from the initializer list
> containing a direct call to `current`, which evaluates to
> `this_cpu(curr_vcpu)` and involves persistent side effects via the
> `asm` statement. To resolve this, the side-effect-producing expression
> is computed in a separate statement prior to the macro initialization:
>
>     struct vcpu *current_vcpu = current;
>
> The computed value is passed into the `GVA_INFO(current_vcpu)` macro,
> ensuring that the initializer is clean and free of such side effects.
>
> Similarly, the `TRACE_TIME` macro violates this rule when accessing
> expressions like `current->vcpu_id` and `current->domain->domain_id`,
> which also depend on `current` and inline assembly. To fix this, the
> value of `current` is assigned to a temporary variable:
>
>     struct vcpu *v = current;
>
> This temporary variable is then used to access `domain_id` and 
> `vcpu_id`.
> This ensures that the arguments passed to the `TRACE_TIME` macro are
> simple expressions free of persistent side effects.
>
> Signed-off-by: Dmytro Prokopchuk 
 The macro `current` specifically does not (and must not) have side
 effects.  It is expected to behave like a plain `struct vcpu *current;`
 variable, and what Eclair is noticing is the thread-local machinery
 under this_cpu() (or in x86's case, get_current()).

 In ARM's case, it's literally reading the hardware thread pointer
 register.  Can anything be done to tell Eclair that `this_cpu()`
 specifically does not have side effects?

 The only reason that GVA_INFO() and TRACE_TIME() are picked out is
 because they both contain embedded structure initialisation, and 
 this is
 is actually an example where trying to comply with MISRA interferes 
 with
 what is otherwise a standard pattern in Xen.
>>> Irrespective of what you say, some of the changes here were eliminating
>>> multiple adjacent uses of current, which - iirc - often the compiler
>>> can't fold via CSE.
>>
>> Where we have mixed usage, sure.  (I'm sure I've got a branch somewhere
>> trying to add some more pure/const around to try and help out here, but
>> I can't find it, and don't recall it being a major improvement either.)
>>
>> The real problem here is that there are a *very few* number of contexts
>> where Eclair refuses to tolerate the use of `current` citing side
>> effects, despite there being no side effects.
>>
>> That is the thing that breaks the principle of least surprise, and we
>> ought to fix it by making Eclair happy with `current` everywhere, rather
>> than force people to learn that 2 macros can't have a `current` in their
>> parameter list.
>>
> 
> I'll take a look. Likely yes, by adding a handful of properties. There 
> are subtleties, though.
> 

Hi, Nicola.

Did you have a chance to try configure Eclair to ignore this macro 
`this_cpu()`?

Thanks.
Dmytro

Xen 4.19.3 released

2025-08-05 Thread Jan Beulich
All,

we're pleased to announce the release of another bug fixing Xen version.

Xen 4.19.3 is available from its git repository
http://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=refs/heads/stable-4.19
(tag RELEASE-4.19.3) or (eventually) from the XenProject download page
https://xenproject.org/resources/downloads/.

We recommend all users of the 4.19 stable series to update to this latest
point release.

Regards, Jan



Re: [PATCH] misra: remove default case in single-clause switch

2025-08-05 Thread Dmytro Prokopchuk1


On 8/5/25 11:15, Jan Beulich wrote:
> On 04.08.2025 19:33, Dmytro Prokopchuk1 wrote:
>> MISRA Rule 16.4: Every switch statement shall have a default label.
>> The default clause must contain either a statement or a comment
>> prior to its terminating break statement.
>>
>> However, there is a documented deviation for this rule in Xen:
>> 'docs/misra/deviations.rst':
>> * - R16.4
>>- A switch statement with a single clause and no default label
>>  may replace an equivalent if statement to improve readability.
>>- Tagged as `deliberate` for ECLAIR.
>>
>> This change removes empty default cases in single-clause switches
>> to avoid violations of the rule where the `default` clause lacks
>> a suitable comment or statement.
>>
>> Signed-off-by: Dmytro Prokopchuk 
> 
> It's all CPU notifiers that you alter, and iirc the outcome of earlier
> discussion was that particularly for those we _want_ to add commentary,
> clarifying why only the given subset of notification need handling in
> the particular case. It may also well be that the (at least) one case
> of the possibly missing handling of some other notification still is
> unaddressed (and might hence be wrongly hidden by the adjustment done
> here, if it's in one of the function that are being altered).
> 
> Jan

I understood.

Thank you, Jan.

Dmytro

Re: [PATCH v4 7/8] pdx: introduce a new compression algorithm based on region offsets

2025-08-05 Thread Jan Beulich
On 05.08.2025 11:52, Roger Pau Monne wrote:
> --- a/xen/common/pdx.c
> +++ b/xen/common/pdx.c
> @@ -24,6 +24,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /**
>   * Maximum (non-inclusive) usable pdx. Must be
> @@ -40,6 +41,12 @@ bool __mfn_valid(unsigned long mfn)
>  
>  #ifdef CONFIG_PDX_MASK_COMPRESSION
>  invalid |= mfn & pfn_hole_mask;
> +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION)
> +{
> +unsigned long base = pfn_bases[PFN_TBL_IDX(mfn)];
> +
> +invalid |= mfn < base || mfn >= base + pdx_region_size;
> +}
>  #endif

Hmm, didn't notice this earlier on: Brace placement looks odd here. I think
they want to be indented by one level, as they aren't starting a function
body.

> @@ -294,7 +308,245 @@ void __init pfn_pdx_compression_reset(void)
>  nr_ranges = 0;
>  }
>  
> -#endif /* CONFIG_PDX_COMPRESSION */
> +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION) /* CONFIG_PDX_MASK_COMPRESSION 
> */
> +
> +unsigned int __ro_after_init pfn_index_shift;
> +unsigned int __ro_after_init pdx_index_shift;
> +
> +unsigned long __ro_after_init pfn_pdx_lookup[CONFIG_PDX_NR_LOOKUP];
> +unsigned long __ro_after_init pdx_pfn_lookup[CONFIG_PDX_NR_LOOKUP];
> +unsigned long __ro_after_init pfn_bases[CONFIG_PDX_NR_LOOKUP];
> +unsigned long __ro_after_init pdx_region_size = ~0UL;

For cache locality, might this last one better also move ahead of the arrays?

> +bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
> +{
> +unsigned long pfn = PFN_DOWN(base);
> +unsigned long pfn_base = pfn_bases[PFN_TBL_IDX(pfn)];
> +
> +return pfn >= pfn_base &&
> +   pfn + npages <= pfn_base + pdx_region_size;
> +}
> +
> +static int __init cf_check cmp_node(const void *a, const void *b)
> +{
> +const struct pfn_range *l = a;
> +const struct pfn_range *r = b;
> +
> +if ( l->base_pfn > r->base_pfn )
> +return 1;
> +if ( l->base_pfn < r->base_pfn )
> +return -1;
> +
> +return 0;
> +}
> +
> +static void __init cf_check swp_node(void *a, void *b)
> +{
> +SWAP(a, b);
> +}

This hasn't changed from v3, and still looks wrong to me.

> +bool __init pfn_pdx_compression_setup(paddr_t base)
> +{
> +unsigned long mask = PFN_DOWN(pdx_init_mask(base)), idx_mask = 0;
> +unsigned long pages = 0;
> +unsigned int i;
> +
> +if ( !nr_ranges )
> +{
> +printk(XENLOG_DEBUG "PFN compression disabled%s\n",
> +   pdx_compress ? ": no ranges provided" : "");
> +return false;
> +}
> +
> +if ( nr_ranges > ARRAY_SIZE(ranges) )
> +{
> +printk(XENLOG_WARNING
> +   "Too many PFN ranges (%u > %zu), not attempting PFN 
> compression\n",
> +   nr_ranges, ARRAY_SIZE(ranges));
> +return false;
> +}
> +
> +/* Sort ranges by start address. */
> +sort(ranges, nr_ranges, sizeof(*ranges), cmp_node, swp_node);
> +
> +for ( i = 0; i < nr_ranges; i++ )
> +{
> +unsigned long start = ranges[i].base_pfn;
> +
> +/*
> + * Align range base to MAX_ORDER.  This is required so the PDX offset
> + * for the bits below MAX_ORDER matches the MFN offset, and pages
> + * greater than the minimal order can be used to populate the
> + * directmap.
> + */
> +ranges[i].base_pfn = start & ~((1UL << MAX_ORDER) - 1);
> +ranges[i].pages = start + ranges[i].pages - ranges[i].base_pfn;
> +
> +/*
> + * Only merge overlapped regions now, leave adjacent regions 
> separated.
> + * They would be merged later if both use the same index into the
> + * lookup table.
> + */
> +if ( !i ||
> + ranges[i].base_pfn >=
> + (ranges[i - 1].base_pfn + ranges[i - 1].pages) )
> +{
> +mask |= pdx_region_mask(ranges[i].base_pfn, ranges[i].pages);
> +continue;
> +}
> +
> +ranges[i - 1].pages = ranges[i].base_pfn + ranges[i].pages -
> +  ranges[i - 1].base_pfn;
> +
> +if ( i + 1 < nr_ranges )
> +memmove(&ranges[i], &ranges[i + 1],
> +(nr_ranges - (i + 1)) * sizeof(ranges[0]));
> +else /* last range */
> +mask |= pdx_region_mask(ranges[i].base_pfn, ranges[i].pages);
> +nr_ranges--;
> +i--;
> +}
> +
> +/*
> + * Populate a mask with the non-equal bits of the different ranges, do 
> this
> + * to calculate the maximum PFN shift to use as the lookup table index.
> + */
> +for ( i = 0; i < nr_ranges; i++ )
> +for ( unsigned int j = 0; j < nr_ranges; j++ )
> +idx_mask |= (ranges[i].base_pfn & ~mask) ^
> +(ranges[j].base_pfn & ~mask);

"mask" is loop invariant - can't the AND-ing be pulled out, after the loop?
Further, isn't it sufficient for the inner loop to start from i + 1?

Jan



Re: [PATCH v4 8/8] x86/mm: adjust loop in arch_init_memory() to iterate over the PDX space

2025-08-05 Thread Jan Beulich
On 05.08.2025 11:52, Roger Pau Monne wrote:
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -275,7 +275,7 @@ static void __init assign_io_page(struct page_info *page)
>  
>  void __init arch_init_memory(void)
>  {
> -unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn;
> +unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn, pdx;
>  
>  /*
>   * Basic guest-accessible flags:
> @@ -328,9 +328,20 @@ void __init arch_init_memory(void)
>  destroy_xen_mappings((unsigned long)mfn_to_virt(iostart_pfn),
>   (unsigned long)mfn_to_virt(ioend_pfn));
>  
> -/* Mark as I/O up to next RAM region. */
> -for ( ; pfn < rstart_pfn; pfn++ )
> +/*
> + * Mark as I/O up to next RAM region.  Iterate over the PDX space to
> + * skip holes which would always fail the mfn_valid() check.
> + *
> + * pfn_to_pdx() requires a valid (iow: RAM) PFN to convert to PDX,
> + * hence provide pfn - 1, which is the tailing PFN from the last RAM
> + * range, or pdx 0 if the input pfn is 0.
> + */
> +for ( pdx = pfn ? pfn_to_pdx(pfn - 1) + 1 : 0;
> +  pdx < pfn_to_pdx(rstart_pfn);
> +  pdx++ )
>  {
> +pfn = pdx_to_pfn(pdx);
> +
>  if ( !mfn_valid(_mfn(pfn)) )
>  continue;
>  

As much as I would have liked to ack this, I fear there's another caveat here:
At the top of the loop we check not only for RAM, but also for UNUSABLE. The
latter, like RAM, shouldn't be marked I/O, but we also can't use PFN <-> PDX
transformations on any such page.

Jan



Re: [PATCH v4 7/8] pdx: introduce a new compression algorithm based on region offsets

2025-08-05 Thread Roger Pau Monné
On Tue, Aug 05, 2025 at 02:28:22PM +0200, Jan Beulich wrote:
> On 05.08.2025 11:52, Roger Pau Monne wrote:
> > --- a/xen/common/pdx.c
> > +++ b/xen/common/pdx.c
> > @@ -24,6 +24,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  /**
> >   * Maximum (non-inclusive) usable pdx. Must be
> > @@ -40,6 +41,12 @@ bool __mfn_valid(unsigned long mfn)
> >  
> >  #ifdef CONFIG_PDX_MASK_COMPRESSION
> >  invalid |= mfn & pfn_hole_mask;
> > +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION)
> > +{
> > +unsigned long base = pfn_bases[PFN_TBL_IDX(mfn)];
> > +
> > +invalid |= mfn < base || mfn >= base + pdx_region_size;
> > +}
> >  #endif
> 
> Hmm, didn't notice this earlier on: Brace placement looks odd here. I think
> they want to be indented by one level, as they aren't starting a function
> body.

Right, I can adjust.  Since they are inside of the ifdef block it did
look kind of OK to me, and avoided having to indent the content one
extra level.

> > @@ -294,7 +308,245 @@ void __init pfn_pdx_compression_reset(void)
> >  nr_ranges = 0;
> >  }
> >  
> > -#endif /* CONFIG_PDX_COMPRESSION */
> > +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION) /* 
> > CONFIG_PDX_MASK_COMPRESSION */
> > +
> > +unsigned int __ro_after_init pfn_index_shift;
> > +unsigned int __ro_after_init pdx_index_shift;
> > +
> > +unsigned long __ro_after_init pfn_pdx_lookup[CONFIG_PDX_NR_LOOKUP];
> > +unsigned long __ro_after_init pdx_pfn_lookup[CONFIG_PDX_NR_LOOKUP];
> > +unsigned long __ro_after_init pfn_bases[CONFIG_PDX_NR_LOOKUP];
> > +unsigned long __ro_after_init pdx_region_size = ~0UL;
> 
> For cache locality, might this last one better also move ahead of the arrays?

Oh, yes, this was a late addition and I clearly didn't place enough
attention when adding it.

> > +bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
> > +{
> > +unsigned long pfn = PFN_DOWN(base);
> > +unsigned long pfn_base = pfn_bases[PFN_TBL_IDX(pfn)];
> > +
> > +return pfn >= pfn_base &&
> > +   pfn + npages <= pfn_base + pdx_region_size;
> > +}
> > +
> > +static int __init cf_check cmp_node(const void *a, const void *b)
> > +{
> > +const struct pfn_range *l = a;
> > +const struct pfn_range *r = b;
> > +
> > +if ( l->base_pfn > r->base_pfn )
> > +return 1;
> > +if ( l->base_pfn < r->base_pfn )
> > +return -1;
> > +
> > +return 0;
> > +}
> > +
> > +static void __init cf_check swp_node(void *a, void *b)
> > +{
> > +SWAP(a, b);
> > +}
> 
> This hasn't changed from v3, and still looks wrong to me.

Oh, I did recall a comment to that regard, but somehow forgot to apply
it, I'm sorry, I've now fixed it.

> > +bool __init pfn_pdx_compression_setup(paddr_t base)
> > +{
> > +unsigned long mask = PFN_DOWN(pdx_init_mask(base)), idx_mask = 0;
> > +unsigned long pages = 0;
> > +unsigned int i;
> > +
> > +if ( !nr_ranges )
> > +{
> > +printk(XENLOG_DEBUG "PFN compression disabled%s\n",
> > +   pdx_compress ? ": no ranges provided" : "");
> > +return false;
> > +}
> > +
> > +if ( nr_ranges > ARRAY_SIZE(ranges) )
> > +{
> > +printk(XENLOG_WARNING
> > +   "Too many PFN ranges (%u > %zu), not attempting PFN 
> > compression\n",
> > +   nr_ranges, ARRAY_SIZE(ranges));
> > +return false;
> > +}
> > +
> > +/* Sort ranges by start address. */
> > +sort(ranges, nr_ranges, sizeof(*ranges), cmp_node, swp_node);
> > +
> > +for ( i = 0; i < nr_ranges; i++ )
> > +{
> > +unsigned long start = ranges[i].base_pfn;
> > +
> > +/*
> > + * Align range base to MAX_ORDER.  This is required so the PDX 
> > offset
> > + * for the bits below MAX_ORDER matches the MFN offset, and pages
> > + * greater than the minimal order can be used to populate the
> > + * directmap.
> > + */
> > +ranges[i].base_pfn = start & ~((1UL << MAX_ORDER) - 1);
> > +ranges[i].pages = start + ranges[i].pages - ranges[i].base_pfn;
> > +
> > +/*
> > + * Only merge overlapped regions now, leave adjacent regions 
> > separated.
> > + * They would be merged later if both use the same index into the
> > + * lookup table.
> > + */
> > +if ( !i ||
> > + ranges[i].base_pfn >=
> > + (ranges[i - 1].base_pfn + ranges[i - 1].pages) )
> > +{
> > +mask |= pdx_region_mask(ranges[i].base_pfn, ranges[i].pages);
> > +continue;
> > +}
> > +
> > +ranges[i - 1].pages = ranges[i].base_pfn + ranges[i].pages -
> > +  ranges[i - 1].base_pfn;
> > +
> > +if ( i + 1 < nr_ranges )
> > +memmove(&ranges[i], &ranges[i + 1],
> > +(nr_ranges - (i + 1)) * sizeof(ranges[0]));
> > +else /* last range */
> > +mask |= pdx_region_mask(ranges[i].base_pfn, ranges[i].pages);
> > +

Re: [PATCH v3 01/20] xen/riscv: implement sbi_remote_hfence_gvma()

2025-08-05 Thread Oleksii Kurochko


On 8/4/25 3:52 PM, Jan Beulich wrote:

On 31.07.2025 17:58, Oleksii Kurochko wrote:

Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
covering the range of guest physical addresses between start_addr and
start_addr + size for all VMIDs.

The remote fence operation applies to the entire address space if either:
  - start_addr and size are both 0, or
  - size is equal to 2^XLEN-1.

Signed-off-by: Oleksii Kurochko

Acked-by: Jan Beulich

However, ...


--- a/xen/arch/riscv/include/asm/sbi.h
+++ b/xen/arch/riscv/include/asm/sbi.h
@@ -89,6 +89,25 @@ bool sbi_has_rfence(void);
  int sbi_remote_sfence_vma(const cpumask_t *cpu_mask, vaddr_t start,
size_t size);
  
+/*

+ * Instructs the remote harts to execute one or more HFENCE.GVMA
+ * instructions, covering the range of guest physical addresses
+ * between start_addr and start_addr + size for all VMIDs.

... I'd like to ask that you avoid fuzzy terminology like this one. Afaict
you mean [start, start + size). Help yourself and future readers by then
also saying it exactly like this. (Happy to make a respective edit while
committing.)


I just tried the following wording in SBI spec.

I agree that using [start, start+size) is clearer as each time I'm going
to check SBI code to verify if 'start+size' is included or not.

It would be happy if you could update this part of commit message during
commit.




+ * Returns 0 if IPI was sent to all the targeted harts successfully
+ * or negative value if start_addr or size is not valid.

This similarly is ambiguous: The union of the success case stated and the
error case stated isn't obviously all possible states. The success
statement in particular alludes to the possibility of an IPI not actually
reaching its target.


The same as above this is what SBI spec. tells.

I've not checked SBI code deeply, but it seems like the code is waiting while
IPI will be reached as looking at the code:
/**
 * As this this function only handlers scalar values of hart mask, it 
must be
 * set to all online harts if the intention is to send IPIs to all the 
harts.
 * If hmask is zero, no IPIs will be sent.
 */
int sbi_ipi_send_many(ulong hmask, ulong hbase, u32 event, void *data)
{
...

/* Send IPIs */
do {
retry_needed = false;
sbi_hartmask_for_each_hart(i, &target_mask) {
rc = sbi_ipi_send(scratch, i, event, data);
if (rc == SBI_IPI_UPDATE_RETRY)
retry_needed = true;
else
sbi_hartmask_clear_hart(i, 
&target_mask);
}
} while (retry_needed);

/* Sync IPIs */
sbi_ipi_sync(scratch, event);

return 0;
}
and
static int sbi_ipi_sync(struct sbi_scratch *scratch, u32 event)
{
const struct sbi_ipi_event_ops *ipi_ops;

if ((SBI_IPI_EVENT_MAX <= event) ||
!ipi_ops_array[event])
return SBI_EINVAL;
ipi_ops = ipi_ops_array[event];

if (ipi_ops->sync)
ipi_ops->sync(scratch);

return 0;
}
which calls:
static void tlb_sync(struct sbi_scratch *scratch)
{
atomic_t *tlb_sync =
sbi_scratch_offset_ptr(scratch, tlb_sync_off);

while (atomic_read(tlb_sync) > 0) {
/*
 * While we are waiting for remote hart to set the sync,
 * consume fifo requests to avoid deadlock.
 */
tlb_process_once(scratch);
}

return;
}




+ * The remote fence operation applies to the entire address space if either:
+ *  - start_addr and size are both 0, or
+ *  - size is equal to 2^XLEN-1.

Whose XLEN is this? The guest's? The host's? (I assume the latter, but it's
not unambiguous, unless there's specific terminology that I'm unaware of,
yet which would make this unambiguous.)


RISC-V spec quite mixes the terminology (3.1.6.2. Base ISA Control in mstatus 
Register)
around XLEN:
  For RV64 harts, the SXL and UXL fields are WARL fields that control the value
  of XLEN for S-mode and U-mode, respectively. The encoding of these fields is
  the same as the MXL field of misa, shown in Table 9. The effective XLEN in
  S-mode and U-mode are termed SXLEN and UXLEN, respectively

Basically, RISC-V privileged architecture defines different XLEN values for
various privilege modes:
 - MXLEN for Machine mode
 - SXLEN for Supervisor mode.
 - HSXLEN for Hypervisor

Re: [PATCH v13 1/3] xen/domain: unify domain ID allocation

2025-08-05 Thread Roger Pau Monné
On Wed, Jul 30, 2025 at 05:40:54PM +, dm...@proton.me wrote:
> From: Denis Mukhin  
> 
> Currently, there are two different domain ID allocation implementations:
> 
>   1) Sequential IDs allocation in dom0less Arm code based on max_init_domid;
> 
>   2) Sequential IDs allocation in XEN_DOMCTL_createdomain; does not use
>  max_init_domid (both Arm and x86).
> 
> The domain ID allocation covers dom0 or late hwdom, predefined domains,
> post-boot domains, excluding Xen system domains (domid >=
> DOMID_FIRST_RESERVED).
> 
> It makes sense to have a common helper code for such task across architectures
> (Arm and x86) and between dom0less / toolstack domU allocation.
> 
> Note, fixing dependency on max_init_domid is out of scope of this patch.
> 
> Wrap the domain ID allocation as an arch-independent function domid_alloc() in
> new common/domid.c based on the bitmap.
> 
> Allocation algorithm:
> - If an explicit domain ID is provided, verify its availability and use it if
>   ID is not used;
> - If DOMID_INVALID is provided, search the range [1..DOMID_FIRST_RESERVED-1],
>   starting from the last used ID.
>   Implementation guarantees that two consecutive calls will never return the
>   same ID. ID#0 is reserved for the first boot domain (currently, dom0) and
>   excluded from the allocation range.
> 
> Remove is_free_domid() helper as it is not needed now.
> 
> No functional change intended.
> 
> Signed-off-by: Denis Mukhin 
> Reviewed-by: Alejandro Vallejo 
> ---
> Changes since v12:
> - updated comment for domid_alloc() and commit message
> - added Alejandro's R-b
> ---
>  xen/arch/arm/domain_build.c |  7 +-
>  xen/arch/x86/setup.c|  7 +-
>  xen/common/Makefile |  1 +
>  xen/common/device-tree/dom0less-build.c | 15 ++--
>  xen/common/domain.c |  2 +
>  xen/common/domctl.c | 42 ++-
>  xen/common/domid.c  | 94 +
>  xen/include/xen/domain.h|  3 +
>  8 files changed, 124 insertions(+), 47 deletions(-)
>  create mode 100644 xen/common/domid.c
> 
> diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
> index 463ae4474d30..789f2b9d3ce7 100644
> --- a/xen/arch/arm/domain_build.c
> +++ b/xen/arch/arm/domain_build.c
> @@ -2050,6 +2050,7 @@ void __init create_dom0(void)
>  .grant_opts = XEN_DOMCTL_GRANT_version(opt_gnttab_max_version),
>  };
>  unsigned int flags = CDF_privileged | CDF_hardware;
> +domid_t domid;
>  int rc;
>  
>  /* The vGIC for DOM0 is exactly emulating the hardware GIC */
> @@ -2074,7 +2075,11 @@ void __init create_dom0(void)
>  if ( !llc_coloring_enabled )
>  flags |= CDF_directmap;
>  
> -dom0 = domain_create(0, &dom0_cfg, flags);
> +domid = domid_alloc(0);
> +if ( domid == DOMID_INVALID )
> +panic("Error allocating domain ID 0\n");
> +
> +dom0 = domain_create(domid, &dom0_cfg, flags);
>  if ( IS_ERR(dom0) )
>  panic("Error creating domain 0 (rc = %ld)\n", PTR_ERR(dom0));
>  
> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index 1543dd251cc6..2ff7c28c277b 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -1047,8 +1047,11 @@ static struct domain *__init create_dom0(struct 
> boot_info *bi)
>  if ( iommu_enabled )
>  dom0_cfg.flags |= XEN_DOMCTL_CDF_iommu;
>  
> -/* Create initial domain.  Not d0 for pvshim. */
> -bd->domid = get_initial_domain_id();
> +/* Allocate initial domain ID.  Not d0 for pvshim. */
> +bd->domid = domid_alloc(get_initial_domain_id());
> +if ( bd->domid == DOMID_INVALID )
> +panic("Error allocating domain ID %d\n", get_initial_domain_id());

Nit: in other error messages in the same function we handle the domid
as an unsigned integer, so %u probably wants using here.  Unless you
have an explicit intention to print IDs >= DOMID_FIRST_RESERVED as
negative integers?

> +
>  d = domain_create(bd->domid, &dom0_cfg,
>pv_shim ? 0 : CDF_privileged | CDF_hardware);
>  if ( IS_ERR(d) )
> diff --git a/xen/common/Makefile b/xen/common/Makefile
> index c316957fcb36..0c7d0f5d46e1 100644
> --- a/xen/common/Makefile
> +++ b/xen/common/Makefile
> @@ -11,6 +11,7 @@ obj-$(filter-out $(CONFIG_X86),$(CONFIG_ACPI)) += device.o
>  obj-$(CONFIG_DEVICE_TREE_PARSE) += device-tree/
>  obj-$(CONFIG_IOREQ_SERVER) += dm.o
>  obj-y += domain.o
> +obj-y += domid.o
>  obj-y += event_2l.o
>  obj-y += event_channel.o
>  obj-$(CONFIG_EVTCHN_FIFO) += event_fifo.o
> diff --git a/xen/common/device-tree/dom0less-build.c 
> b/xen/common/device-tree/dom0less-build.c
> index 6bb038111de9..f4b6b515d2d2 100644
> --- a/xen/common/device-tree/dom0less-build.c
> +++ b/xen/common/device-tree/dom0less-build.c
> @@ -833,6 +833,7 @@ void __init create_domUs(void)
>  {
>  struct kernel_info ki = KERNEL_INFO_INIT;
>  int rc = parse_dom0less_node(node,

Re: [PATCH -next v7 5/7] arm64: entry: Refactor preempt_schedule_irq() check code

2025-08-05 Thread Ada Couprie Diaz

Hi Jinjie,

On 29/07/2025 02:54, Jinjie Ruan wrote:

ARM64 requires an additional check whether to reschedule on return
from interrupt. So add arch_irqentry_exit_need_resched() as the default
NOP implementation and hook it up into the need_resched() condition in
raw_irqentry_exit_cond_resched(). This allows ARM64 to implement
the architecture specific version for switching over to
the generic entry code.
I was a bit confused by this, as I didn't see the link with the generic 
entry

given you implement `raw_irqentry_exit_cond_resched()` in arch/arm64
as well in this patch : I expected the arm64 implementation to be added.
I share more thoughts below.

What do you think about something along those lines ?

Compared to the generic entry code, arm64 does additional checks
when deciding to reschedule on return from an interrupt.
Introduce arch_irqentry_exit_need_resched() in the need_resched() 
condition
of the generic raw_irqentry_exit_cond_resched(), with a NOP default.
This will allow arm64 to implement its architecture specific checks when
switching over to the generic entry code.


[...]
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index b82032777310..4aa9656fa1b4 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -142,6 +142,20 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs 
*regs)
return ret;
  }
  
+/**

+ * arch_irqentry_exit_need_resched - Architecture specific need resched 
function
+ *
+ * Invoked from raw_irqentry_exit_cond_resched() to check if need resched.

Very nit : "to check if resched is needed." ?

+ * Defaults return true.
+ *
+ * The main purpose is to permit arch to skip preempt a task from an IRQ.
If feel that "to avoid preemption of a task" instead of "to skip preempt 
a task"

would make this much clearer, what do you think ?

+ */
+static inline bool arch_irqentry_exit_need_resched(void);
+
+#ifndef arch_irqentry_exit_need_resched
+static inline bool arch_irqentry_exit_need_resched(void) { return true; }
+#endif
+


I've had some trouble reviewing this patch : on the one hand because
I didn't notice `arch_irqentry_exit_need_resched()` was added in
the common entry code, which is on me !
On the other hand, I felt that the patch itself was a bit disconnected :
we add `arch_irqentry_exit_need_resched()` in the common entry code,
with a default NOP, but in the same function we add to arm64,
while mentioning that this is for arm64's additional checks,
which we only implement in patch 7.

Would it make sense to move the `arch_irqentry_exit_need_resched()`
part of the patch to patch 7, so that the introduction and
arch-specific implementation appear together ?
To me it seems easier to wrap my head around, as it would look like
"Move arm64 to generic entry, but it does additional checks : add a new
arch-specific function controlling re-scheduling, defaulting to true,
and implement it for arm64". I feel it could help making patch 7's
commit message clearer as well.

From what I gathered on the archive `arch_irqentry_exit_need_resched()`
being added here was suggested previously, so others might not have the
same opinion.

Maybe improving the commit message and comment for this would be enough
as well, as per my suggestions above.


Otherwise the changes make sense and I don't see any functional issues !

Thanks,
Ada




Re: [PATCH v3 11/20] xen/riscv: implement function to map memory in guest p2m

2025-08-05 Thread Jan Beulich
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> Implement map_regions_p2mt() to map a region in the guest p2m with
> a specific p2m type. The memory attributes will be derived from the
> p2m type. This function is going to be called from dom0less common
> code.

s/is going to be/is/ ? Such a call exists already, after all.

> --- a/xen/arch/riscv/include/asm/p2m.h
> +++ b/xen/arch/riscv/include/asm/p2m.h
> @@ -121,21 +121,22 @@ static inline int 
> guest_physmap_mark_populate_on_demand(struct domain *d,
>  return -EOPNOTSUPP;
>  }
>  
> -static inline int guest_physmap_add_entry(struct domain *d,
> -  gfn_t gfn, mfn_t mfn,
> -  unsigned long page_order,
> -  p2m_type_t t)
> -{
> -BUG_ON("unimplemented");
> -return -EINVAL;
> -}
> +/*
> + * Map a region in the guest p2m with a specific p2m type.

What is "the guest p2m"? In your answer, please consider the possible
(and at some point likely necessary) existence of altp2m and nestedp2m.
In patch 04 you introduce p2m_get_hostp2m(), and I expect it's that
what you mean here.

> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -9,6 +9,41 @@
>  
>  unsigned int __read_mostly p2m_root_order;
>  
> +/*
> + * Force a synchronous P2M TLB flush.
> + *
> + * Must be called with the p2m lock held.
> + */
> +static void p2m_force_tlb_flush_sync(struct p2m_domain *p2m)
> +{
> +struct domain *d = p2m->domain;

Pointer-to-const please. Personally, given the implementation of this
function (and also ...

> +ASSERT(p2m_is_write_locked(p2m));
> +
> +sbi_remote_hfence_gvma(d->dirty_cpumask, 0, 0);
> +
> +p2m->need_flush = false;
> +}
> +
> +void p2m_tlb_flush_sync(struct p2m_domain *p2m)
> +{
> +if ( p2m->need_flush )
> +p2m_force_tlb_flush_sync(p2m);
> +}

... this one) I'd further ask for the function parameters to also be
pointer-to-const, but Andrew may object to that. Andrew - it continues to
be unclear to me under what conditions you agree with adding const, and
under what conditions you would object to me asking for such. Please can
you take the time to clarify this?

> +/* Unlock the flush and do a P2M TLB flush if necessary */
> +void p2m_write_unlock(struct p2m_domain *p2m)
> +{
> +/*
> + * The final flush is done with the P2M write lock taken to avoid
> + * someone else modifying the P2M wbefore the TLB invalidation has

Nit: Stray 'w'.

> + * completed.
> + */
> +p2m_tlb_flush_sync(p2m);

Wasn't the plan to have this be conditional?

> @@ -139,3 +174,33 @@ int p2m_set_allocation(struct domain *d, unsigned long 
> pages, bool *preempted)
>  
>  return 0;
>  }
> +
> +static int p2m_set_range(struct p2m_domain *p2m,
> + gfn_t sgfn,
> + unsigned long nr,
> + mfn_t smfn,
> + p2m_type_t t)
> +{
> +return -EOPNOTSUPP;
> +}
> +
> +static int p2m_insert_mapping(struct p2m_domain *p2m, gfn_t start_gfn,
> +  unsigned long nr, mfn_t mfn, p2m_type_t t)
> +{
> +int rc;
> +
> +p2m_write_lock(p2m);
> +rc = p2m_set_range(p2m, start_gfn, nr, mfn, t);
> +p2m_write_unlock(p2m);
> +
> +return rc;
> +}
> +
> +int map_regions_p2mt(struct domain *d,
> + gfn_t gfn,
> + unsigned long nr,
> + mfn_t mfn,
> + p2m_type_t p2mt)
> +{
> +return p2m_insert_mapping(p2m_get_hostp2m(d), gfn, nr, mfn, p2mt);
> +}

And eventually both helper functions will gain further callers? Otherwise
it's a little hard to see why they would both need to be separate functions.

Jan



Re: [PATCH -next v7 7/7] arm64: entry: Switch to generic IRQ entry

2025-08-05 Thread Ada Couprie Diaz

Hi Jinjie,

The code changes look good to me, just a small missing clean up I believe.

On 29/07/2025 02:54, Jinjie Ruan wrote:


Currently, x86, Riscv, Loongarch use the generic entry. Convert arm64
to use the generic entry infrastructure from kernel/entry/*.
The generic entry makes maintainers' work easier and codes
more elegant.

Switch arm64 to generic IRQ entry first, which removed duplicate 100+
LOC. The next patch serise will switch to generic entry completely later.
Switch to generic entry in two steps according to Mark's suggestion
will make it easier to review.


I think the commit message could be clearer, especially since now this 
series

only moves arm64 to generic IRQ entry and not the complete generic entry.

What do you think of something like below ? It repeats a bit less and I 
think
it helps understanding what is going on in this specific commit, as you 
already

have details on the larger plans in the cover.

Currently, x86, Riscv and Loongarch use the generic entry code, which 
makes
maintainer's work easier and code more elegant.
Start converting arm64 to use the generic entry infrastructure
from kernel/entry/* by switching it to generic IRQ entry, which removes 
100+
lines of duplicate code.
arm64 will completely switch to generic entry in a later series.


The changes are below:
  - Remove *enter_from/exit_to_kernel_mode(), and wrap with generic
irqentry_enter/exit(). Also remove *enter_from/exit_to_user_mode(),
and wrap with generic enter_from/exit_to_user_mode() because they
are exactly the same so far.

Nit : "so far" can be removed

  - Remove arm64_enter/exit_nmi() and use generic irqentry_nmi_enter/exit()
because they're exactly the same, so the temporary arm64 version
irqentry_state can also be removed.

  - Remove PREEMPT_DYNAMIC code, as generic entry do the same thing
if arm64 implement arch_irqentry_exit_need_resched().
This feels unrelated, given that the part that needs 
`arch_irqentry_exit_need_resched()`

is called whether or not PREEMPT_DYNAMIC is enabled ?

Given my comments on patch 5, I feel that the commit message should mention
explicitly the implementation of `arch_irqentry_exit_need_resched()` and 
why,

even though it was already mentioned in patch 5.
(This is what I was referencing in patch 5 : as I feel it's useful to 
mention again
the reasons when implementing it, it doesn't feel too out of place to 
introduce

the generic part at the same time. But again, I might be wrong here.)

Then you can have another point explaining that 
`raw_irqentry_exit_cond_resched()`

and the PREEMPT_DYNAMIC code is removed because they are identical to the
generic entry code, similarly to your other points.

Tested ok with following test cases on Qemu virt platform:
  - Perf tests.
  - Different `dynamic preempt` mode switch.
  - Pseudo NMI tests.
  - Stress-ng CPU stress test.
  - MTE test case in Documentation/arch/arm64/memory-tagging-extension.rst
and all test cases in tools/testing/selftests/arm64/mte/*.
Nit : I'm not sure if the commit message is the best place for this, 
given you

already gave some details in the cover ?
But I don't have much experience here, so I'll leave it up to you and 
others !

Suggested-by: Mark Rutland 
Signed-off-by: Jinjie Ruan 
---
[...]
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index db3f972f8cd9..1110eeb21f57 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -9,6 +9,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -1576,7 +1577,7 @@ static void handle_signal(struct ksignal *ksig, struct 
pt_regs *regs)
   * the kernel can handle, and then we build all the user-level signal handling
   * stack-frames in one go after that.
   */
-void do_signal(struct pt_regs *regs)
+void arch_do_signal_or_restart(struct pt_regs *regs)
Given that `do_signal(struct pt_regs *regs)` is defined in 
`arch/arm64/include/asm/exception.h`,
and that there remains no users of `do_signal()`, I think it should be 
removed there.


Thanks,
Ada



Re: [PATCH -next v7 3/7] arm64: entry: Rework arm64_preempt_schedule_irq()

2025-08-05 Thread Ada Couprie Diaz

On 29/07/2025 02:54, Jinjie Ruan wrote:


The generic entry code has the form:

| raw_irqentry_exit_cond_resched()
| {
|   if (!preempt_count()) {
|   ...
|   if (need_resched())
|   preempt_schedule_irq();
|   }
| }

In preparation for moving arm64 over to the generic entry code, align
the structure of the arm64 code with raw_irqentry_exit_cond_resched() from
the generic entry code.

Signed-off-by: Jinjie Ruan 
---

Reviewed-by: Ada Couprie Diaz 




Re: [PATCH -next v7 1/7] arm64: ptrace: Replace interrupts_enabled() with regs_irqs_disabled()

2025-08-05 Thread Ada Couprie Diaz

On 29/07/2025 02:54, Jinjie Ruan wrote:


The generic entry code expects architecture code to provide
regs_irqs_disabled(regs) function, but arm64 does not have this and
provides inerrupts_enabled(regs), which has the opposite polarity.

Nit: "interrupts_enabled(regs)"

In preparation for moving arm64 over to the generic entry code,
relace arm64's interrupts_enabled() with regs_irqs_disabled() and
update its callers under arch/arm64.

For the moment, a definition of interrupts_enabled() is provided for
the GICv3 driver. Once arch/arm implement regs_irqs_disabled(), this
can be removed.

Delete the fast_interrupts_enabled() macro as it is unused and we
don't want any new users to show up.

No functional changes.

Acked-by: Mark Rutland 
Suggested-by: Mark Rutland 
Signed-off-by: Jinjie Ruan 
---

Otherwise looks good to me !
Reviewed-by: Ada Couprie Diaz 



Re: [PATCH -next v7 4/7] arm64: entry: Use preempt_count() and need_resched() helper

2025-08-05 Thread Ada Couprie Diaz

On 29/07/2025 02:54, Jinjie Ruan wrote:


The generic entry code uses preempt_count() and need_resched() helpers to
check if it should do preempt_schedule_irq(). Currently, arm64 use its own
check logic, that is "READ_ONCE(current_thread_info()->preempt_count == 0",
which is equivalent to "preempt_count() == 0 && need_resched()".

In preparation for moving arm64 over to the generic entry code, use
these helpers to replace arm64's own code and move it ahead.

No functional changes.

Signed-off-by: Jinjie Ruan 
---

Reviewed-by: Ada Couprie Diaz 




Re: [PATCH -next v7 6/7] arm64: entry: Move arm64_preempt_schedule_irq() into __exit_to_kernel_mode()

2025-08-05 Thread Ada Couprie Diaz

On 29/07/2025 02:54, Jinjie Ruan wrote:


The arm64 entry code only preempts a kernel context upon a return from
a regular IRQ exception. The generic entry code may preempt a kernel
context for any exception return where irqentry_exit() is used, and so
may preempt other exceptions such as faults.

In preparation for moving arm64 over to the generic entry code, align
arm64 with the generic behaviour by calling
arm64_preempt_schedule_irq() from exit_to_kernel_mode(). To make this
possible, arm64_preempt_schedule_irq()
and dynamic/raw_irqentry_exit_cond_resched() are moved earlier in
the file, with no changes.

As Mark pointed out, this change will have the following 2 key impact:

- " We'll preempt even without taking a "real" interrupt. That
 shouldn't result in preemption that wasn't possible before,
 but it does change the probability of preempting at certain points,
 and might have a performance impact, so probably warrants a
 benchmark."

- " We will not preempt when taking interrupts from a region of kernel
 code where IRQs are enabled but RCU is not watching, matching the
 behaviour of the generic entry code.

 This has the potential to introduce livelock if we can ever have a
 screaming interrupt in such a region, so we'll need to go figure out
 whether that's actually a problem.

 Having this as a separate patch will make it easier to test/bisect
 for that specifically."

Suggested-by: Mark Rutland 
Signed-off-by: Jinjie Ruan 
---


Reviewed-by: Ada Couprie Diaz 




Re: [PATCH -next v7 0/7] arm64: entry: Convert to generic irq entry

2025-08-05 Thread Ada Couprie Diaz

Hi Jinjie,

On 29/07/2025 02:54, Jinjie Ruan wrote:


Since commit a70e9f647f50 ("entry: Split generic entry into generic
exception and syscall entry") split the generic entry into generic irq
entry and generic syscall entry, it is time to convert arm64 to use
the generic irq entry. And ARM64 will be completely converted to generic
entry in the upcoming patch series.

Note : I had to manually cherry-pick a70e9f647f50 when pulling the series
on top of the Linux Arm Kernel for-next/core branch, but there might be
something I'm missing here.


The main convert steps are as follows:
- Split generic entry into generic irq entry and generic syscall to
   make the single patch more concentrated in switching to one thing.
- Make arm64 easier to use irqentry_enter/exit().
- Make arm64 closer to the PREEMPT_DYNAMIC code of generic entry.
- Switch to generic irq entry.


I reviewed the whole series and as expected it looks good ! Just a few nits
here and there and some clarifications that I think could be useful.

I'm not sure about the generic implementation of 
`arch_irqentry_exit_need_resched()`
in patch 5, I would be tempted to move it to patch 7. I detail my 
thoughts more

on the relevant patches, but I might be wrong and that feels like details :
I don't think the code itself has issues.

It was tested ok with following test cases on QEMU virt platform:
  - Perf tests.
  - Different `dynamic preempt` mode switch.
  - Pseudo NMI tests.
  - Stress-ng CPU stress test.
  - MTE test case in Documentation/arch/arm64/memory-tagging-extension.rst
and all test cases in tools/testing/selftests/arm64/mte/*.

The test QEMU configuration is as follows:

qemu-system-aarch64 \
-M virt,gic-version=3,virtualization=on,mte=on \
-cpu max,pauth-impdef=on \
-kernel Image \
-smp 8,sockets=1,cores=4,threads=2 \
-m 512m \
-nographic \
-no-reboot \
-device virtio-rng-pci \
-append "root=/dev/vda rw console=ttyAMA0 kgdboc=ttyAMA0,115200 
\
earlycon preempt=voluntary irqchip.gicv3_pseudo_nmi=1" \
-drive if=none,file=images/rootfs.ext4,format=raw,id=hd0 \
-device virtio-blk-device,drive=hd0 \


I'll spend some time testing the series now, specifically given patch 6's
changes, but other than that everything I saw made sense and didn't look
like it would be of concern to me.

Thanks,
Ada



Re: [PATCH -next v7 2/7] arm64: entry: Refactor the entry and exit for exceptions from EL1

2025-08-05 Thread Ada Couprie Diaz

Hi,

On 29/07/2025 02:54, Jinjie Ruan wrote:


The generic entry code uses irqentry_state_t to track lockdep and RCU
state across exception entry and return. For historical reasons, arm64
embeds similar fields within its pt_regs structure.

In preparation for moving arm64 over to the generic entry code, pull
these fields out of arm64's pt_regs, and use a separate structure,
matching the style of the generic entry code.

No functional changes.

As far as I understand and checked, we used the two fields
in an exclusive fashion, so there is indeed no functional change.

Suggested-by: Mark Rutland 
Signed-off-by: Jinjie Ruan 
---
[...]
diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index 8e798f46ad28..97e0741abde1 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
[...]
@@ -475,73 +497,81 @@ UNHANDLED(el1t, 64, error)
  static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
  {
unsigned long far = read_sysreg(far_el1);
+   arm64_irqentry_state_t state;
  
-	enter_from_kernel_mode(regs);

+   state = enter_from_kernel_mode(regs);
Nit: There is some inconsistencies with some functions splitting state's 
definition
and declaration (like el1_abort here), while some others do it on the 
same line

(el1_undef() below for example).
In some cases it is welcome as the entry function is called after some 
other work,

but here for example it doesn't seem to be beneficial ?

local_daif_inherit(regs);
do_mem_abort(far, esr, regs);
local_daif_mask();
-   exit_to_kernel_mode(regs);
+   exit_to_kernel_mode(regs, state);
  }
  
  static void noinstr el1_pc(struct pt_regs *regs, unsigned long esr)

  {
unsigned long far = read_sysreg(far_el1);
+   arm64_irqentry_state_t state;
  
-	enter_from_kernel_mode(regs);

+   state = enter_from_kernel_mode(regs);
local_daif_inherit(regs);
do_sp_pc_abort(far, esr, regs);
local_daif_mask();
-   exit_to_kernel_mode(regs);
+   exit_to_kernel_mode(regs, state);
  }
  
  static void noinstr el1_undef(struct pt_regs *regs, unsigned long esr)

  {
-   enter_from_kernel_mode(regs);
+   arm64_irqentry_state_t state = enter_from_kernel_mode(regs);
+
local_daif_inherit(regs);
do_el1_undef(regs, esr);
local_daif_mask();
-   exit_to_kernel_mode(regs);
+   exit_to_kernel_mode(regs, state);
  }

[...]

Other than the small nit:
Reviewed-by: Ada Couprie Diaz 




Re: [PATCH v4 8/8] x86/mm: adjust loop in arch_init_memory() to iterate over the PDX space

2025-08-05 Thread Roger Pau Monné
On Tue, Aug 05, 2025 at 02:38:38PM +0200, Jan Beulich wrote:
> On 05.08.2025 11:52, Roger Pau Monne wrote:
> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -275,7 +275,7 @@ static void __init assign_io_page(struct page_info 
> > *page)
> >  
> >  void __init arch_init_memory(void)
> >  {
> > -unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn;
> > +unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn, 
> > pdx;
> >  
> >  /*
> >   * Basic guest-accessible flags:
> > @@ -328,9 +328,20 @@ void __init arch_init_memory(void)
> >  destroy_xen_mappings((unsigned long)mfn_to_virt(iostart_pfn),
> >   (unsigned long)mfn_to_virt(ioend_pfn));
> >  
> > -/* Mark as I/O up to next RAM region. */
> > -for ( ; pfn < rstart_pfn; pfn++ )
> > +/*
> > + * Mark as I/O up to next RAM region.  Iterate over the PDX space 
> > to
> > + * skip holes which would always fail the mfn_valid() check.
> > + *
> > + * pfn_to_pdx() requires a valid (iow: RAM) PFN to convert to PDX,
> > + * hence provide pfn - 1, which is the tailing PFN from the last 
> > RAM
> > + * range, or pdx 0 if the input pfn is 0.
> > + */
> > +for ( pdx = pfn ? pfn_to_pdx(pfn - 1) + 1 : 0;
> > +  pdx < pfn_to_pdx(rstart_pfn);
> > +  pdx++ )
> >  {
> > +pfn = pdx_to_pfn(pdx);
> > +
> >  if ( !mfn_valid(_mfn(pfn)) )
> >  continue;
> >  
> 
> As much as I would have liked to ack this, I fear there's another caveat here:
> At the top of the loop we check not only for RAM, but also for UNUSABLE. The
> latter, like RAM, shouldn't be marked I/O, but we also can't use PFN <-> PDX
> transformations on any such page.

Right you are.  I'm not sure however why we do this - won't we want
the mappings of UNUSABLE regions also be removed from the Xen
page-tables? (but not marked as IO)

I could do something like:

/* Mark as I/O up to next RAM or UNUSABLE region. */
if ( (!pfn || pdx_is_region_compressible(pfn_to_paddr(pfn - 1), 1)) &&
 pdx_is_region_compressible(pfn_to_paddr(rstart_pfn), 1) )
{
/*
 * Iterate over the PDX space to skip holes which would always fail
 * the mfn_valid() check.
 *
 * pfn_to_pdx() requires a valid (iow: RAM) PFN to convert to PDX,
 * hence provide pfn - 1, which is the tailing PFN from the last
 * RAM range, or pdx 0 if the input pfn is 0.
 */
for ( pdx = pfn ? pfn_to_pdx(pfn - 1) + 1 : 0;
  pdx < pfn_to_pdx(rstart_pfn);
  pdx++ )
{
pfn = pdx_to_pfn(pdx);

if ( !mfn_valid(_mfn(pfn)) )
continue;

assign_io_page(mfn_to_page(_mfn(pfn)));
}
}
else
{
/* Slow path, iterate over the PFN space. */
for ( ; pfn < rstart_pfn; pfn++ )
{
if ( !mfn_valid(_mfn(pfn)) )
continue;

assign_io_page(mfn_to_page(_mfn(pfn)));
}
}

But I find it a bit ugly - I might send v5 without this final patch
while I see if I can find a better alternative.

Thanks, Roger.



Re: [PATCH v2] xen: rework error handling in vcpu_create

2025-08-05 Thread Jan Beulich
On 05.08.2025 11:33, Stewart Hildebrand wrote:
> On 8/5/25 05:17, Jan Beulich wrote:
>> On 05.08.2025 11:06, Stewart Hildebrand wrote:
>>> On 8/5/25 03:44, Jan Beulich wrote:
 On 04.08.2025 18:57, Stewart Hildebrand wrote:
> On 8/4/25 03:57, Jan Beulich wrote:
>> On 01.08.2025 22:24, Stewart Hildebrand wrote:
>>> @@ -839,6 +839,9 @@ void sched_destroy_vcpu(struct vcpu *v)
>>>  {
>>>  struct sched_unit *unit = v->sched_unit;
>>>  
>>> +if ( !unit )
>>> +return;
>>> +
>>>  kill_timer(&v->periodic_timer);
>>>  kill_timer(&v->singleshot_timer);
>>>  kill_timer(&v->poll_timer);
>>
>> What if it's the 2nd error path in sched_init_vcpu() that is taken?
>>>
>>> ^^ This ^^ is what I'm confused about
>>
>> If sched_init_vcpu() took the indicated path,
> 
> What path? In the one I'm looking at, sched_free_unit() gets called,

Oh, I see - that wasn't quite obvious, though. Yet of course ...

> setting v->sched_unit = NULL, and in that case ...
> 
>> the if() you add here won't
>> help, and ...
> 
> ... the condition is true, and ...
> 
>> Then we
>> might take this path (just out of context here)
>>
>> if ( unit->vcpu_list == v )
>> {
>> rcu_read_lock(&sched_res_rculock);
>>
>> sched_remove_unit(vcpu_scheduler(v), unit);
>> sched_free_udata(vcpu_scheduler(v), unit->priv);
>>
>> and at least Credit1's hook doesn't look to be safe against being passed 
>> NULL.
>> (Not to speak of the risk of unit->priv being used elsewhere while 
>> cleaning
>> up.)

... this latter part of my remark still applies. IOW I continue to think
that discussing the correctness of this change needs to be extended.
Unless of course a scheduler maintainer wants to ack it as is.

Jan



[PATCH v4 7/8] pdx: introduce a new compression algorithm based on region offsets

2025-08-05 Thread Roger Pau Monne
With the appearance of Intel Sierra Forest and Granite Rapids it's now
possible to get a production x86 host with the following memory map:

SRAT: Node 0 PXM 0 [, 7fff]
SRAT: Node 0 PXM 0 [0001, 00807fff]
SRAT: Node 1 PXM 1 [063e8000, 06be7fff]
SRAT: Node 2 PXM 2 [0c7e8000, 0cfe7fff]
SRAT: Node 3 PXM 3 [12be8000, 133e7fff]

This is from a four socket Granite Rapids system, with each node having
512GB of memory.  The total amount of RAM on the system is 2TB, but without
enabling CONFIG_BIGMEM the last range is not accessible, as it's above the
16TB boundary covered by the frame table. Sierra Forest and Granite Rapids
are socket compatible, however Sierra Forest only supports 2 socket
configurations, while Granite Rapids can go up to 8 sockets.

Note that while the memory map is very sparse, it couldn't be compressed
using the current PDX_MASK compression algorithm, which relies on all
ranges having a shared zeroed region of bits that can be removed.

The memory map presented above has the property of all regions being
similarly spaced between each other, and all having also a similar size.
Use a lookup table to store the offsets to translate from/to PFN and PDX
spaces.  Such table is indexed based on the input PFN or PDX to translated.
The example PFN layout about would get compressed using the following:

PFN compression using PFN lookup table shift 29 and PDX region size 0x1000
 range 0 [0, 0x807] PFN IDX  0 : 0
 range 1 [0x00063e8, 0x0006be7] PFN IDX  3 : 0x00053e8
 range 2 [0x000c7e8, 0x000cfe7] PFN IDX  6 : 0x000a7e8
 range 3 [0x0012be8, 0x00133e7] PFN IDX  9 : 0x000fbe8

Note how the tow ranges belonging to node 0 get merged into a single PDX
region by the compression algorithm.

The default size of lookup tables currently set in Kconfig is 64 entries,
and the example memory map consumes 10 entries.  Such memory map is from a
4 socket Granite Rapids host, which in theory supports up to 8 sockets
according to Intel documentation.  Assuming the layout of a 8 socket system
is similar to the 4 socket one, it would require 21 lookup table entries to
support it, way below the current default of 64 entries.

The valid range of lookup table size is currently restricted from 1 to 512
elements in Kconfig.

An extra array is used to keep track of the base PFN for each translated
range.  Non used slots are set to ~0UL, so that in mfn_valid() the mfn <
base check always fails, thus reporting the mfn as invalid.

Introduce __init_or_pdx_mask and use it on some shared functions between
PDX mask and offset compression, as otherwise some code becomes unreachable
after boot if PDX offset compression is used.  Mark the code as __init in
that case, so it's pruned after boot.

Signed-off-by: Roger Pau Monné 
---
Changes since v3:
 - Rename __init_or_pdx_mask.
 - Reorder logic in pfn_offset_sanitize_ranges() to reduce code.
 - Print a message if the table index is truncated from what would be
   required to represent all input ranges independently.
 - Cast size to paddr_t for pdx_init_mask() call.

Changes since v2:
 - s/PDX_OFFSET_TLB_ORDER/PDX_OFFSET_TBL_ORDER/.
 - Fix changelog mention of Sapphire Rapids.
 - Misc fixes in the test harness.
 - Use SWAP() macro.
 - Introduce an extra table to keep the bases of the valid ranges.

Changes since v1:
 - Use a lookup table with the offsets.
 - Split the adding of the test to a pre-patch.
 - Amend diagram to also show possible padding after compression.
---
 CHANGELOG.md   |   3 +
 tools/tests/pdx/.gitignore |   1 +
 tools/tests/pdx/Makefile   |   3 +-
 tools/tests/pdx/harness.h  |  14 ++
 tools/tests/pdx/test-pdx.c |   4 +
 xen/common/Kconfig |  21 ++-
 xen/common/pdx.c   | 258 -
 xen/include/xen/pdx.h  |  87 -
 8 files changed, 385 insertions(+), 6 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5f31ca08fe3f..f9ef893f4851 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -20,6 +20,9 @@ The format is based on [Keep a 
Changelog](https://keepachangelog.com/en/1.0.0/)
  grant table or foreign memory.
 
 ### Added
+ - Introduce new PDX compression algorithm to cope with Intel Sierra Forest and
+   Granite Rapids having sparse memory maps.
+
  - On x86:
- Option to attempt to fixup p2m page-faults on PVH dom0.
- Resizable BARs is supported for PVH dom0.
diff --git a/tools/tests/pdx/.gitignore b/tools/tests/pdx/.gitignore
index a32c7db4de79..1202a531a7fd 100644
--- a/tools/tests/pdx/.gitignore
+++ b/tools/tests/pdx/.gitignore
@@ -1,2 +1,3 @@
 /pdx.h
 /test-pdx-mask
+/test-pdx-offset
diff --git a/tools/tests/pdx/Makefile b/tools/tests/pdx/Makefile
index b3734afde686..10b354f0cefd 100644
--- a/tools/tests/pdx/Makefile
+++ b/tools/tests/pdx/Makefile
@@ -1,7 +1,7 @@
 XEN_ROOT=$(CURDIR)/../../..
 include $(XEN_ROOT)/tool

[PATCH v4 1/8] kconfig: turn PDX compression into a choice

2025-08-05 Thread Roger Pau Monne
Rename the current CONFIG_PDX_COMPRESSION to CONFIG_PDX_MASK_COMPRESSION,
and make it part of the PDX compression choice block, in preparation for
adding further PDX compression algorithms.

The PDX compression defaults should still be the same for all
architectures, however the choice block cannot be protected under EXPERT
and still have a default choice being unconditionally selected.  As a
result, the new "PDX (Page inDeX) compression" item will be unconditionally
visible in Kconfig, even on architectures like x86 that previously had no
way to enable PDX compression.

As part of this preparation work to introduce new PDX compressions, adjust
some of the comments on pdx.h to note they apply to a specific PDX
compression.  Also shuffle function prototypes and dummy implementations
around to make it easier to introduce a new PDX compression.  Note all
PDX compression implementations are expected to provide a
pdx_is_region_compressible() that takes the same set of arguments.

Signed-off-by: Roger Pau Monné 
Acked-by: Jan Beulich 
---
 xen/arch/arm/setup.c  |  2 +-
 xen/common/Kconfig| 18 +++---
 xen/common/pdx.c  |  4 ++--
 xen/include/xen/pdx.h | 32 +++-
 4 files changed, 37 insertions(+), 19 deletions(-)

diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index bb35afe56cec..a77b31071ed8 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -258,7 +258,7 @@ void __init init_pdx(void)
 paddr_t bank_start, bank_size, bank_end, ram_end = 0;
 int bank;
 
-#ifdef CONFIG_PDX_COMPRESSION
+#ifndef CONFIG_PDX_NONE
 /*
  * Arm does not have any restrictions on the bits to compress. Pass 0 to
  * let the common code further restrict the mask.
diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index 16936418a6e6..8dad0c923a9d 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -57,9 +57,10 @@ config EVTCHN_FIFO
 
  If unsure, say Y.
 
-config PDX_COMPRESSION
-   bool "PDX (Page inDeX) compression" if EXPERT && !X86 && !RISCV
-   default ARM || PPC
+choice
+   prompt "PDX (Page inDeX) compression"
+   default PDX_MASK_COMPRESSION if !X86 && !RISCV
+   default PDX_NONE
help
  PDX compression is a technique designed to reduce the memory
  overhead of physical memory management on platforms with sparse RAM
@@ -72,6 +73,17 @@ config PDX_COMPRESSION
  If your platform does not have sparse RAM banks, do not enable PDX
  compression.
 
+config PDX_MASK_COMPRESSION
+   bool "Mask compression"
+   help
+ Compression relying on all RAM addresses sharing a zeroed bit region.
+
+config PDX_NONE
+   bool "None"
+   help
+ No compression
+endchoice
+
 config ALTERNATIVE_CALL
bool
 
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index b8384e6189df..00aa7e43006d 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -34,7 +34,7 @@ bool __mfn_valid(unsigned long mfn)
 {
 bool invalid = mfn >= max_page;
 
-#ifdef CONFIG_PDX_COMPRESSION
+#ifdef CONFIG_PDX_MASK_COMPRESSION
 invalid |= mfn & pfn_hole_mask;
 #endif
 
@@ -55,7 +55,7 @@ void set_pdx_range(unsigned long smfn, unsigned long emfn)
 __set_bit(idx, pdx_group_valid);
 }
 
-#ifdef CONFIG_PDX_COMPRESSION
+#ifdef CONFIG_PDX_MASK_COMPRESSION
 
 /*
  * Diagram to make sense of the following variables. The masks and shifts
diff --git a/xen/include/xen/pdx.h b/xen/include/xen/pdx.h
index c1423d64a95b..8e373cac8b87 100644
--- a/xen/include/xen/pdx.h
+++ b/xen/include/xen/pdx.h
@@ -25,7 +25,7 @@
  * this by keeping a bitmap of the ranges in the frame table containing
  * invalid entries and not allocating backing memory for them.
  *
- * ## PDX compression
+ * ## PDX mask compression
  *
  * This is a technique to avoid wasting memory on machines known to have
  * split their machine address space in several big discontinuous and highly
@@ -101,22 +101,13 @@ bool __mfn_valid(unsigned long mfn);
 #define paddr_to_pdx(pa) pfn_to_pdx(paddr_to_pfn(pa))
 #define pdx_to_paddr(px) pfn_to_paddr(pdx_to_pfn(px))
 
-#ifdef CONFIG_PDX_COMPRESSION
+#ifdef CONFIG_PDX_MASK_COMPRESSION
 
 extern unsigned long pfn_pdx_bottom_mask, ma_va_bottom_mask;
 extern unsigned int pfn_pdx_hole_shift;
 extern unsigned long pfn_hole_mask;
 extern unsigned long pfn_top_mask, ma_top_mask;
 
-/**
- * Validate a region's compatibility with the current compression runtime
- *
- * @param base Base address of the region
- * @param npages Number of PAGE_SIZE-sized pages in the region
- * @return True iff the region can be used with the current compression
- */
-bool pdx_is_region_compressible(paddr_t base, unsigned long npages);
-
 /**
  * Calculates a mask covering "moving" bits of all addresses of a region
  *
@@ -209,7 +200,9 @@ static inline paddr_t directmapoff_to_maddr(unsigned long 
offset)
  */
 void pfn_pdx_hole_setup(unsigned long mask);
 
-#else /* !CONFIG_PDX_COMPRESSION */
+#endif /* CONFIG_PDX_MA

[PATCH v4 2/8] pdx: provide a unified set of unit functions

2025-08-05 Thread Roger Pau Monne
The current setup (pdx_init_mask() and pdx_region_mask()) and init
(pfn_pdx_hole_setup()) PDX compression functions are tailored to the
existing PDX compression algorithm.

In preparation for introducing a new compression algorithm convert the
setup and init functions to more generic interfaces that aren't tied to the
compression in-use.  To accomplish this introduce a function that registers
all the PFN RAM ranges, plus an init function.

This has the downside of requiring a static array to store such ranges
ahead of being processed by the setup function, however it's the only way
to pass all the possible information to the different compression setup
functions without using per-compression specific setup functions.
Per-arch compression setup also need to be adjusted to use the new
interface.  There's a slight ordering adjustment, in that after PDX
compression setup the caller will check whether all the RAM regions are
properly covered by the newly setup compression, otherwise compression is
disabled by resetting to the initial values.

No functional change intended in the resulting PDX compression values.

Signed-off-by: Roger Pau Monné 
Acked-by: Jan Beulich 
---
Changes since v3:
 - Rename base to base_pfn and size to pages.

Changes since v2:
 - Fix indentation.

Changes since v1:
 - s/nr/nr_ranges/
---
 xen/arch/arm/setup.c  |  34 ++--
 xen/arch/x86/srat.c   |  28 ++
 xen/common/pdx.c  | 122 --
 xen/include/xen/pdx.h |  73 +
 4 files changed, 166 insertions(+), 91 deletions(-)

diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index a77b31071ed8..ba35bf1fe3bb 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -256,9 +256,11 @@ void __init init_pdx(void)
 {
 const struct membanks *mem = bootinfo_get_mem();
 paddr_t bank_start, bank_size, bank_end, ram_end = 0;
-int bank;
+unsigned int bank;
 
 #ifndef CONFIG_PDX_NONE
+for ( bank = 0 ; bank < mem->nr_banks; bank++ )
+pfn_pdx_add_region(mem->bank[bank].start, mem->bank[bank].size);
 /*
  * Arm does not have any restrictions on the bits to compress. Pass 0 to
  * let the common code further restrict the mask.
@@ -266,26 +268,24 @@ void __init init_pdx(void)
  * If the logic changes in pfn_pdx_hole_setup we might have to
  * update this function too.
  */
-uint64_t mask = pdx_init_mask(0x0);
-
-for ( bank = 0 ; bank < mem->nr_banks; bank++ )
-{
-bank_start = mem->bank[bank].start;
-bank_size = mem->bank[bank].size;
-
-mask |= bank_start | pdx_region_mask(bank_start, bank_size);
-}
+pfn_pdx_compression_setup(0);
 
 for ( bank = 0 ; bank < mem->nr_banks; bank++ )
 {
-bank_start = mem->bank[bank].start;
-bank_size = mem->bank[bank].size;
-
-if (~mask & pdx_region_mask(bank_start, bank_size))
-mask = 0;
+if ( !pdx_is_region_compressible(
+  mem->bank[bank].start,
+  PFN_UP(mem->bank[bank].start + mem->bank[bank].size) -
+  PFN_DOWN(mem->bank[bank].start)) )
+{
+pfn_pdx_compression_reset();
+printk(XENLOG_WARNING
+   "PFN compression disabled, RAM region [%#" PRIpaddr ", %#"
+   PRIpaddr "] not covered\n",
+   mem->bank[bank].start,
+   mem->bank[bank].start + mem->bank[bank].size - 1);
+break;
+}
 }
-
-pfn_pdx_hole_setup(mask >> PAGE_SHIFT);
 #endif
 
 for ( bank = 0 ; bank < mem->nr_banks; bank++ )
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 688f410287d4..747607439fb4 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -261,8 +261,6 @@ acpi_numa_memory_affinity_init(const struct 
acpi_srat_mem_affinity *ma)
 
 void __init acpi_numa_arch_fixup(void) {}
 
-static uint64_t __initdata srat_region_mask;
-
 static int __init cf_check srat_parse_region(
 struct acpi_subtable_header *header, const unsigned long end)
 {
@@ -282,15 +280,13 @@ static int __init cf_check srat_parse_region(
printk(KERN_INFO "SRAT: %013"PRIx64"-%013"PRIx64"\n",
   ma->base_address, ma->base_address + ma->length - 1);
 
-   srat_region_mask |= ma->base_address |
-   pdx_region_mask(ma->base_address, ma->length);
+   pfn_pdx_add_region(ma->base_address, ma->length);
 
return 0;
 }
 
 void __init srat_parse_regions(paddr_t addr)
 {
-   u64 mask;
unsigned int i;
 
if (acpi_disabled || acpi_numa < 0 ||
@@ -299,19 +295,29 @@ void __init srat_parse_regions(paddr_t addr)
 
/* Set "PXM" as early as feasible. */
numa_fw_nid_name = "PXM";
-   srat_region_mask = pdx_init_mask(addr);
acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
  srat_parse_region, 0);
 
-   for (mask = srat_region_mask, i

[PATCH v4 5/8] test/pdx: add PDX compression unit tests

2025-08-05 Thread Roger Pau Monne
Introduce a set of unit tests for PDX compression.  The unit tests contains
both real and crafted memory maps that are then compressed using the
selected PDX algorithm.  Note the build system for the unit tests has been
done in a way to support adding new compression algorithms easily.  That
requires generating a new test-pdx- executable that's build with
the selected PDX compression enabled.

Currently the only generated executable is test-pdx-mask that tests PDX
mask compression.

Signed-off-by: Roger Pau Monné 
---
Changes since v2:
 - Use set -e when running the tests.
 - Drop stray removal of pdx.c.
 - Use * instead of ? for header blank space capture.
 - Add some more memory maps.

Changes since v1:
 - New in this version (partially pulled out from a different patch).
---
 tools/tests/Makefile   |   1 +
 tools/tests/pdx/.gitignore |   2 +
 tools/tests/pdx/Makefile   |  49 +++
 tools/tests/pdx/harness.h  |  84 
 tools/tests/pdx/test-pdx.c | 267 +
 xen/common/pdx.c   |   4 +
 6 files changed, 407 insertions(+)
 create mode 100644 tools/tests/pdx/.gitignore
 create mode 100644 tools/tests/pdx/Makefile
 create mode 100644 tools/tests/pdx/harness.h
 create mode 100644 tools/tests/pdx/test-pdx.c

diff --git a/tools/tests/Makefile b/tools/tests/Makefile
index 36928676a666..97ba2a13894d 100644
--- a/tools/tests/Makefile
+++ b/tools/tests/Makefile
@@ -9,6 +9,7 @@ ifneq ($(clang),y)
 SUBDIRS-$(CONFIG_X86) += x86_emulator
 endif
 SUBDIRS-y += xenstore
+SUBDIRS-y += pdx
 SUBDIRS-y += rangeset
 SUBDIRS-y += vpci
 SUBDIRS-y += paging-mempool
diff --git a/tools/tests/pdx/.gitignore b/tools/tests/pdx/.gitignore
new file mode 100644
index ..a32c7db4de79
--- /dev/null
+++ b/tools/tests/pdx/.gitignore
@@ -0,0 +1,2 @@
+/pdx.h
+/test-pdx-mask
diff --git a/tools/tests/pdx/Makefile b/tools/tests/pdx/Makefile
new file mode 100644
index ..b3734afde686
--- /dev/null
+++ b/tools/tests/pdx/Makefile
@@ -0,0 +1,49 @@
+XEN_ROOT=$(CURDIR)/../../..
+include $(XEN_ROOT)/tools/Rules.mk
+
+TARGETS := test-pdx-mask
+
+.PHONY: all
+all: $(TARGETS)
+
+.PHONY: run
+run: $(TARGETS)
+ifeq ($(CC),$(HOSTCC))
+   set -e; \
+   for test in $? ; do \
+   ./$$test ;  \
+   done
+else
+   $(warning HOSTCC != CC, will not run test)
+endif
+
+.PHONY: clean
+clean:
+   $(RM) -- *.o $(TARGETS) $(DEPS_RM) pdx.h
+
+.PHONY: distclean
+distclean: clean
+   $(RM) -- *~
+
+.PHONY: install
+install: all
+   $(INSTALL_DIR) $(DESTDIR)$(LIBEXEC)/tests
+   $(INSTALL_PROG) $(TARGETS) $(DESTDIR)$(LIBEXEC)/tests
+
+.PHONY: uninstall
+uninstall:
+   $(RM) -- $(patsubst %,$(DESTDIR)$(LIBEXEC)/tests/%,$(TARGETS))
+
+pdx.h: $(XEN_ROOT)/xen/include/xen/pdx.h
+   sed -E -e '/^#[[:space:]]*include/d' <$< >$@
+
+CFLAGS += -D__XEN_TOOLS__
+CFLAGS += $(APPEND_CFLAGS)
+CFLAGS += $(CFLAGS_xeninclude)
+
+test-pdx-mask: CFLAGS += -DCONFIG_PDX_MASK_COMPRESSION
+
+test-pdx-%: test-pdx.c pdx.h
+   $(CC) $(CPPFLAGS) $(CFLAGS) $(CFLAGS_$*.o) -o $@ $< $(APPEND_CFLAGS)
+
+-include $(DEPS_INCLUDE)
diff --git a/tools/tests/pdx/harness.h b/tools/tests/pdx/harness.h
new file mode 100644
index ..5bef7df650d2
--- /dev/null
+++ b/tools/tests/pdx/harness.h
@@ -0,0 +1,84 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Unit tests for PDX compression.
+ *
+ * Copyright (C) 2025 Cloud Software Group
+ */
+
+#ifndef _TEST_HARNESS_
+#define _TEST_HARNESS_
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#define __init
+#define __initdata
+#define __ro_after_init
+#define cf_check
+
+#define printk printf
+#define XENLOG_INFO
+#define XENLOG_DEBUG
+#define XENLOG_WARNING
+#define KERN_INFO
+
+#define BITS_PER_LONG (unsigned int)(sizeof(unsigned long) * 8)
+
+#define PAGE_SHIFT12
+/* Some libcs define PAGE_SIZE in limits.h. */
+#undef  PAGE_SIZE
+#define PAGE_SIZE (1 << PAGE_SHIFT)
+#define MAX_ORDER 18 /* 2 * PAGETABLE_ORDER (9) */
+
+#define PFN_DOWN(x)   ((x) >> PAGE_SHIFT)
+#define PFN_UP(x) (((x) + PAGE_SIZE-1) >> PAGE_SHIFT)
+
+#define pfn_to_paddr(pfn) ((paddr_t)(pfn) << PAGE_SHIFT)
+#define paddr_to_pfn(pa)  ((unsigned long)((pa) >> PAGE_SHIFT))
+
+#define MAX_RANGES 16
+#define MAX_PFN_RANGES MAX_RANGES
+
+#define ASSERT assert
+
+#define CONFIG_DEBUG
+
+static inline unsigned int find_next(
+const unsigned long *addr, unsigned int size, unsigned int off, bool value)
+{
+unsigned int i;
+
+ASSERT(size <= BITS_PER_LONG);
+
+for ( i = off; i < size; i++ )
+if ( !!(*addr & (1UL << i)) == value )
+return i;
+
+return size;
+}
+
+#define find_next_zero_bit(a, s, o) find_next(a, s, o, false)
+#define find_next_bit(a, s, o)  find_next(a, s, o, true)
+
+#define boolean_param(name, func)
+
+typedef uint64_t paddr_t;
+
+#include "pdx.h"
+
+#endif
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * ind

[PATCH v4 6/8] pdx: move some helpers in preparation for new compression

2025-08-05 Thread Roger Pau Monne
Move fill_mask(), pdx_region_mask() and pdx_init_mask() to the
!CONFIG_PDX_NONE section in preparation of them also being used by a newly
added PDX compression.

No functional change intended.

Signed-off-by: Roger Pau Monné 
Acked-by: Jan Beulich 
---
git is not very helpful when generating the diff here, and it ends up
moving everything around the functions instead of the functions themselves.
---
 xen/common/pdx.c | 118 +++
 1 file changed, 59 insertions(+), 59 deletions(-)

diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index cd8a9e75a836..9e6b36086fbd 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -101,59 +101,6 @@ void __init pfn_pdx_add_region(paddr_t base, paddr_t size)
 ranges[nr_ranges++].pages = PFN_UP(base + size) - PFN_DOWN(base);
 }
 
-#endif /* !CONFIG_PDX_NONE */
-
-#ifdef CONFIG_PDX_MASK_COMPRESSION
-
-/*
- * Diagram to make sense of the following variables. The masks and shifts
- * are done on mfn values in order to convert to/from pdx:
- *
- *  pfn_hole_mask
- *  pfn_pdx_hole_shift (mask bitsize)
- *  |
- * |-|
- * | |
- * V V
- * --
- * |HHH|0|LL| <--- mfn
- * --
- * ^   ^ ^  ^
- * |   | |--|
- * |   | |
- * |   | pfn_pdx_bottom_mask
- * |   |
- * |---|
- * |
- * pfn_top_mask
- *
- * ma_{top,va_bottom}_mask is simply a shifted pfn_{top,pdx_bottom}_mask,
- * where ma_top_mask has zeroes shifted in while ma_va_bottom_mask has
- * ones.
- */
-
-/** Mask for the lower non-compressible bits of an mfn */
-unsigned long __ro_after_init pfn_pdx_bottom_mask = ~0UL;
-
-/** Mask for the lower non-compressible bits of an maddr or vaddr */
-unsigned long __ro_after_init ma_va_bottom_mask = ~0UL;
-
-/** Mask for the higher non-compressible bits of an mfn */
-unsigned long __ro_after_init pfn_top_mask = 0;
-
-/** Mask for the higher non-compressible bits of an maddr or vaddr */
-unsigned long __ro_after_init ma_top_mask = 0;
-
-/**
- * Mask for a pdx compression bit slice.
- *
- *  Invariant: valid(mfn) implies (mfn & pfn_hole_mask) == 0
- */
-unsigned long __ro_after_init pfn_hole_mask = 0;
-
-/** Number of bits of the "compressible" bit slice of an mfn */
-unsigned int __ro_after_init pfn_pdx_hole_shift = 0;
-
 /* Sets all bits from the most-significant 1-bit down to the LSB */
 static uint64_t fill_mask(uint64_t mask)
 {
@@ -196,12 +143,6 @@ static uint64_t pdx_region_mask(uint64_t base, uint64_t 
len)
 return fill_mask(base ^ (base + len - 1));
 }
 
-bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
-{
-return !(paddr_to_pfn(base) & pfn_hole_mask) &&
-   !(pdx_region_mask(base, npages * PAGE_SIZE) & ~ma_va_bottom_mask);
-}
-
 /**
  * Creates the mask to start from when calculating non-compressible bits
  *
@@ -219,6 +160,65 @@ static uint64_t __init pdx_init_mask(uint64_t base_addr)
  (uint64_t)1 << (MAX_ORDER + PAGE_SHIFT)) - 1);
 }
 
+#endif /* !CONFIG_PDX_NONE */
+
+#ifdef CONFIG_PDX_MASK_COMPRESSION
+
+/*
+ * Diagram to make sense of the following variables. The masks and shifts
+ * are done on mfn values in order to convert to/from pdx:
+ *
+ *  pfn_hole_mask
+ *  pfn_pdx_hole_shift (mask bitsize)
+ *  |
+ * |-|
+ * | |
+ * V V
+ * --
+ * |HHH|0|LL| <--- mfn
+ * --
+ * ^   ^ ^  ^
+ * |   | |--|
+ * |   | |
+ * |   | pfn_pdx_bottom_mask
+ * |   |
+ * |---|
+ * |
+ * pfn_top_mask
+ *
+ * ma_{top,va_bottom}_mask is simply a shifted pfn_{top,pdx_bottom}_mask,
+ * where ma_top_mask has zeroes shifted in while ma_va_bottom_mask has
+ * ones.
+ */
+
+/** Mask for the lower non-compressible bits of an mfn */
+unsigned long __ro_after_init pfn_pdx_bottom_mask = ~0UL;
+
+/** Mask for the lower non-compressible bits of an maddr or vaddr */
+unsigned long __ro_after_init ma_va_bottom_mask = ~0UL;
+
+/** Mask for the higher non-compressible bits of an mfn */
+unsigned long __ro_after_init pfn_top_mask = 0;
+
+/** Mask for the higher non-compressible bits of an maddr or vaddr */
+unsigned long __ro_after_init ma_top_mask = 0;
+
+/**
+ * Mask for a pdx compression bit slice.
+ *
+ *  Invariant: valid(mfn) implies (mfn & pfn_hole_mask) == 0
+ */
+unsigned long __ro_after_init pfn_hole_mask = 0;
+
+/** Number of bits of the "compressible" bit slice of an mfn */
+unsigned int __ro_

[PATCH v4 4/8] pdx: allow per-arch optimization of PDX conversion helpers

2025-08-05 Thread Roger Pau Monne
There are four performance critical PDX conversion helpers that do the PFN
to/from PDX and the physical addresses to/from directmap offsets
translations.

In the absence of an active PDX compression, those functions would still do
the calculations needed, just to return the same input value as no
translation is in place and hence PFN and PDX spaces are identity mapped.

To reduce the overhead of having to do the pointless calculations allow
architectures to implement the translation helpers in a per-arch header.
Rename the existing conversion functions to add a trailing _xlate suffix,
so that the per-arch headers can define the non suffixed versions.

Currently only x86 implements meaningful custom handlers to short circuit
the translation when not active, using asm goto.  Other architectures use
generic macros that map the non-xlate to the xlate variants to keep the
previous behavior.

Signed-off-by: Roger Pau Monné 
---
Changes since v3:
 - Use has_include.

Changes since v2:
 - Consume the skip label as parameter in the PDX optimize macro.

Changes since v1:
 - Pull return out of OPTIMIZE_PDX macro.
 - undef OPTIMIZE_PDX.
---
Would it make sense to move the x86 implementation to the common pdx.h
header and let architectures define PDX_ASM_GOTO_SKIP instead?
---
 xen/arch/x86/include/asm/cpufeatures.h |  1 +
 xen/arch/x86/include/asm/pdx.h | 71 ++
 xen/arch/x86/srat.c|  6 ++-
 xen/common/pdx.c   | 10 ++--
 xen/include/xen/pdx.h  | 29 ---
 5 files changed, 106 insertions(+), 11 deletions(-)
 create mode 100644 xen/arch/x86/include/asm/pdx.h

diff --git a/xen/arch/x86/include/asm/cpufeatures.h 
b/xen/arch/x86/include/asm/cpufeatures.h
index bc108c3819e8..71308d9dafc8 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -42,6 +42,7 @@ XEN_CPUFEATURE(XEN_IBT,   X86_SYNTH(27)) /* Xen uses 
CET Indirect Branch
 XEN_CPUFEATURE(IBPB_ENTRY_PV, X86_SYNTH(28)) /* MSR_PRED_CMD used by Xen 
for PV */
 XEN_CPUFEATURE(IBPB_ENTRY_HVM,X86_SYNTH(29)) /* MSR_PRED_CMD used by Xen 
for HVM */
 XEN_CPUFEATURE(USE_VMCALL,X86_SYNTH(30)) /* Use VMCALL instead of 
VMMCALL */
+XEN_CPUFEATURE(PDX_COMPRESSION,   X86_SYNTH(31)) /* PDX compression */
 
 /* Bug words follow the synthetic words. */
 #define X86_NR_BUG 1
diff --git a/xen/arch/x86/include/asm/pdx.h b/xen/arch/x86/include/asm/pdx.h
new file mode 100644
index ..6be7e1185eb1
--- /dev/null
+++ b/xen/arch/x86/include/asm/pdx.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef X86_PDX_H
+#define X86_PDX_H
+
+#include 
+
+/*
+ * Introduce a macro to avoid repeating the same asm goto block in each helper.
+ * Note the macro is strictly tied to the code in the helpers.
+ */
+#define PDX_ASM_GOTO(label) \
+asm_inline goto (   \
+ALTERNATIVE(\
+"", \
+"jmp %l0",  \
+ALT_NOT(X86_FEATURE_PDX_COMPRESSION))   \
+: : : : label )
+
+static inline unsigned long pfn_to_pdx(unsigned long pfn)
+{
+PDX_ASM_GOTO(skip);
+
+return pfn_to_pdx_xlate(pfn);
+
+ skip:
+return pfn;
+}
+
+static inline unsigned long pdx_to_pfn(unsigned long pdx)
+{
+PDX_ASM_GOTO(skip);
+
+return pdx_to_pfn_xlate(pdx);
+
+ skip:
+return pdx;
+}
+
+static inline unsigned long maddr_to_directmapoff(paddr_t ma)
+{
+PDX_ASM_GOTO(skip);
+
+return maddr_to_directmapoff_xlate(ma);
+
+ skip:
+return ma;
+}
+
+static inline paddr_t directmapoff_to_maddr(unsigned long offset)
+{
+PDX_ASM_GOTO(skip);
+
+return directmapoff_to_maddr_xlate(offset);
+
+ skip:
+return offset;
+}
+
+#undef PDX_ASM_GOTO_SKIP
+
+#endif /* X86_PDX_H */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 747607439fb4..42ccb0c0f3ae 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -298,7 +298,8 @@ void __init srat_parse_regions(paddr_t addr)
acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
  srat_parse_region, 0);
 
-   pfn_pdx_compression_setup(addr);
+   if (!pfn_pdx_compression_setup(addr))
+   return;
 
/* Ensure all RAM ranges in the e820 are covered. */
for (i = 0; i < e820.nr_map; i++) {
@@ -318,6 +319,9 @@ void __init srat_parse_regions(paddr_t addr)
return;
}
}
+
+   /* If we got this far compression is working as expected. */
+   setup_force_cpu_cap(X86_FEATURE_PDX_COMPRESSION);
 }
 
 unsigned int numa_node_to_arch_nid(nodeid_t n)
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index f4a3dcf6cb60..c9ec86729151 100644
--- a/xen/c

[PATCH v4 8/8] x86/mm: adjust loop in arch_init_memory() to iterate over the PDX space

2025-08-05 Thread Roger Pau Monne
There's a loop in arch_init_memory() that iterates over holes and non-RAM
regions to possibly mark any page_info structures matching those addresses
as IO.  The looping there is done over the PFN space.

PFNs not covered by the PDX space will always fail the mfn_valid() check,
hence re-write the loop to iterate over the PDX space and avoid checking
any holes that are not covered by the PDX translation.

On a system with a ~6TiB hole this change together with using PDX
compression reduces boot time in approximately 20 seconds.  Xen boot time
without the change is ~50s, with the change it's ~30s.

Signed-off-by: Roger Pau Monné 
---
Changes since v3:
 - Ensure parameter to pfn_to_pdx() is always RAM.

Changes since v2:
 - New in this version.
---
 xen/arch/x86/mm.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index e7fd56c7ce90..e27dc28cfdd6 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -275,7 +275,7 @@ static void __init assign_io_page(struct page_info *page)
 
 void __init arch_init_memory(void)
 {
-unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn;
+unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn, pdx;
 
 /*
  * Basic guest-accessible flags:
@@ -328,9 +328,20 @@ void __init arch_init_memory(void)
 destroy_xen_mappings((unsigned long)mfn_to_virt(iostart_pfn),
  (unsigned long)mfn_to_virt(ioend_pfn));
 
-/* Mark as I/O up to next RAM region. */
-for ( ; pfn < rstart_pfn; pfn++ )
+/*
+ * Mark as I/O up to next RAM region.  Iterate over the PDX space to
+ * skip holes which would always fail the mfn_valid() check.
+ *
+ * pfn_to_pdx() requires a valid (iow: RAM) PFN to convert to PDX,
+ * hence provide pfn - 1, which is the tailing PFN from the last RAM
+ * range, or pdx 0 if the input pfn is 0.
+ */
+for ( pdx = pfn ? pfn_to_pdx(pfn - 1) + 1 : 0;
+  pdx < pfn_to_pdx(rstart_pfn);
+  pdx++ )
 {
+pfn = pdx_to_pfn(pdx);
+
 if ( !mfn_valid(_mfn(pfn)) )
 continue;
 
-- 
2.49.0




[PATCH v4 3/8] pdx: introduce command line compression toggle

2025-08-05 Thread Roger Pau Monne
Introduce a command line option to allow disabling PDX compression.  The
disabling is done by turning pfn_pdx_add_region() into a no-op, so when
attempting to initialize the selected compression algorithm the array of
ranges to compress is empty.

Signed-off-by: Roger Pau Monné 
Reviewed-by: Jan Beulich 
---
Changes since v1:
 - New in this version.
---
 docs/misc/xen-command-line.pandoc |  9 +
 xen/common/pdx.c  | 14 +-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/docs/misc/xen-command-line.pandoc 
b/docs/misc/xen-command-line.pandoc
index 6865a61220ca..5dd2ab82b7aa 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2072,6 +2072,15 @@ for all of them (`true`), only for those subject to XPTI 
(`xpti`) or for
 those not subject to XPTI (`no-xpti`). The feature is used only in case
 INVPCID is supported and not disabled via `invpcid=false`.
 
+### pdx-compress
+> `= `
+
+> Default: `true` if CONFIG_PDX_NONE is unset
+
+Only relevant when the hypervisor is build with PFN PDX compression. Controls
+whether Xen will engage in PFN compression.  The algorithm used for PFN
+compression is selected at build time from Kconfig.
+
 ### ple_gap
 > `= `
 
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index c5ea58873c0e..f4a3dcf6cb60 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -76,9 +77,13 @@ static struct pfn_range {
 } ranges[MAX_PFN_RANGES] __initdata;
 static unsigned int __initdata nr_ranges;
 
+static bool __initdata pdx_compress = true;
+boolean_param("pdx-compress", pdx_compress);
+
 void __init pfn_pdx_add_region(paddr_t base, paddr_t size)
 {
-if ( !size )
+/* Without ranges there's no PFN compression. */
+if ( !size || !pdx_compress )
 return;
 
 if ( nr_ranges >= ARRAY_SIZE(ranges) )
@@ -215,6 +220,13 @@ void __init pfn_pdx_compression_setup(paddr_t base)
 unsigned int i, j, bottom_shift = 0, hole_shift = 0;
 unsigned long mask = pdx_init_mask(base) >> PAGE_SHIFT;
 
+if ( !nr_ranges )
+{
+printk(XENLOG_DEBUG "PFN compression disabled%s\n",
+   pdx_compress ? ": no ranges provided" : "");
+return;
+}
+
 if ( nr_ranges > ARRAY_SIZE(ranges) )
 {
 printk(XENLOG_WARNING
-- 
2.49.0




[PATCH v4 0/8] pdx: introduce a new compression algorithm

2025-08-05 Thread Roger Pau Monne
Hello,

This series implements a new PDX compression algorithm to cope with the
spare memory maps found on the Intel Sierra Forest and Granite Rapids.

Patches 1 to 6 prepare the existing code to make it easier to introduce
a new PDX compression, including generalizing the initialization and
setup functions and adding a unit test for PDX compression.

Patch 7 introduce the new compression.  The new compression is only
enabled by default on x86, other architectures are left with their
previous defaults.

Finally patch 8 optimizes one x86 loop that was iterating over pfn
ranges to instead use pdx values.

Thanks, Roger.

Roger Pau Monne (8):
  kconfig: turn PDX compression into a choice
  pdx: provide a unified set of unit functions
  pdx: introduce command line compression toggle
  pdx: allow per-arch optimization of PDX conversion helpers
  test/pdx: add PDX compression unit tests
  pdx: move some helpers in preparation for new compression
  pdx: introduce a new compression algorithm based on region offsets
  x86/mm: adjust loop in arch_init_memory() to iterate over the PDX
space

 CHANGELOG.md   |   3 +
 docs/misc/xen-command-line.pandoc  |   9 +
 tools/tests/Makefile   |   1 +
 tools/tests/pdx/.gitignore |   3 +
 tools/tests/pdx/Makefile   |  50 +++
 tools/tests/pdx/harness.h  |  98 ++
 tools/tests/pdx/test-pdx.c | 271 
 xen/arch/arm/setup.c   |  36 +--
 xen/arch/x86/include/asm/cpufeatures.h |   1 +
 xen/arch/x86/include/asm/pdx.h |  71 
 xen/arch/x86/mm.c  |  17 +-
 xen/arch/x86/srat.c|  30 +-
 xen/common/Kconfig |  37 ++-
 xen/common/pdx.c   | 432 +++--
 xen/include/xen/pdx.h  | 213 
 15 files changed, 1139 insertions(+), 133 deletions(-)
 create mode 100644 tools/tests/pdx/.gitignore
 create mode 100644 tools/tests/pdx/Makefile
 create mode 100644 tools/tests/pdx/harness.h
 create mode 100644 tools/tests/pdx/test-pdx.c
 create mode 100644 xen/arch/x86/include/asm/pdx.h

-- 
2.49.0




Re: [PATCH v1 03/25] xen/x86: complement PG_log_dirty wrapping

2025-08-05 Thread Jan Beulich
On 03.08.2025 11:47, Penny Zheng wrote:
> --- a/xen/include/hypercall-defs.c
> +++ b/xen/include/hypercall-defs.c
> @@ -198,7 +198,9 @@ dm_op(domid_t domid, unsigned int nr_bufs, 
> xen_dm_op_buf_t *bufs)
>  sysctl(xen_sysctl_t *u_sysctl)
>  #endif
>  domctl(xen_domctl_t *u_domctl)
> +#if PG_log_dirty
>  paging_domctl_cont(xen_domctl_t *u_domctl)
> +#endif
>  #ifndef CONFIG_PV_SHIM_EXCLUSIVE
>  platform_op(xen_platform_op_t *u_xenpf_op)
>  #endif
> @@ -294,6 +296,8 @@ dm_op  compat   do   
> compat   do   do
>  hypfs_op   do   do   do   do   do
>  #endif
>  mcado   do   ---
> +#if PG_log_dirty
>  paging_domctl_cont do   do   do   do   -
> +#endif

While putting together my pair of patches, I figured that using PG_log_dirty
here is wrong. asm/paging.h isn't (and cannot easily be) included, and hence
the compiler will consider PG_log_dirty uniformly 0, no matter what .config
holds.

Jan



Re: [PATCH v3 02/20] xen/riscv: introduce sbi_remote_hfence_gvma_vmid()

2025-08-05 Thread Oleksii Kurochko


On 8/4/25 3:55 PM, Jan Beulich wrote:

On 31.07.2025 17:58, Oleksii Kurochko wrote:

It instructs the remote harts to execute one or more HFENCE.GVMA instructions
by making an SBI call, covering the range of guest physical addresses between
start_addr and start_addr + size only for the given VMID.

The remote fence operation applies to the entire address space if either:
   - start_addr and size are both 0, or
   - size is equal to 2^XLEN-1.

Signed-off-by: Oleksii Kurochko

Acked-by: Jan Beulich
with perhaps a similar on-commit edit as suggested for patch 1.


I would happy if you could do that during merge. Thanks.

~ Oleksii


Re: [PATCH v3 01/20] xen/riscv: implement sbi_remote_hfence_gvma()

2025-08-05 Thread Jan Beulich
On 05.08.2025 16:45, Oleksii Kurochko wrote:
> On 8/4/25 3:52 PM, Jan Beulich wrote:
>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>> + * Returns 0 if IPI was sent to all the targeted harts successfully
>>> + * or negative value if start_addr or size is not valid.
>> This similarly is ambiguous: The union of the success case stated and the
>> error case stated isn't obviously all possible states. The success
>> statement in particular alludes to the possibility of an IPI not actually
>> reaching its target.
> 
> The same as above this is what SBI spec. tells.
> 
> I've not checked SBI code deeply, but it seems like the code is waiting while
> IPI will be reached as looking at the code:
>   /**
>* As this this function only handlers scalar values of hart mask, it 
> must be
>* set to all online harts if the intention is to send IPIs to all the 
> harts.
>* If hmask is zero, no IPIs will be sent.
>*/
>   int sbi_ipi_send_many(ulong hmask, ulong hbase, u32 event, void *data)
>   {
>  ...
>   
>   /* Send IPIs */
>   do {
>   retry_needed = false;
>   sbi_hartmask_for_each_hart(i, &target_mask) {
>   rc = sbi_ipi_send(scratch, i, event, data);
>   if (rc == SBI_IPI_UPDATE_RETRY)
>   retry_needed = true;
>   else
>   sbi_hartmask_clear_hart(i, 
> &target_mask);
>   }
>   } while (retry_needed);
>   
>   /* Sync IPIs */
>   sbi_ipi_sync(scratch, event);
>   
>   return 0;
>   }
> and
>   static int sbi_ipi_sync(struct sbi_scratch *scratch, u32 event)
>   {
>   const struct sbi_ipi_event_ops *ipi_ops;
>   
>   if ((SBI_IPI_EVENT_MAX <= event) ||
>   !ipi_ops_array[event])
>   return SBI_EINVAL;
>   ipi_ops = ipi_ops_array[event];
>   
>   if (ipi_ops->sync)
>   ipi_ops->sync(scratch);
>   
>   return 0;
>   }
> which calls:
>   static void tlb_sync(struct sbi_scratch *scratch)
>   {
>   atomic_t *tlb_sync =
>   sbi_scratch_offset_ptr(scratch, tlb_sync_off);
>   
>   while (atomic_read(tlb_sync) > 0) {
>   /*
>* While we are waiting for remote hart to set the sync,
>* consume fifo requests to avoid deadlock.
>*/
>   tlb_process_once(scratch);
>   }
>   
>   return;
>   }

I'll leave that comment as-is then, even if I'm not really happy with it.

>>> + * The remote fence operation applies to the entire address space if 
>>> either:
>>> + *  - start_addr and size are both 0, or
>>> + *  - size is equal to 2^XLEN-1.
>> Whose XLEN is this? The guest's? The host's? (I assume the latter, but it's
>> not unambiguous, unless there's specific terminology that I'm unaware of,
>> yet which would make this unambiguous.)
> 
> RISC-V spec quite mixes the terminology (3.1.6.2. Base ISA Control in mstatus 
> Register)
> around XLEN:
>For RV64 harts, the SXL and UXL fields are WARL fields that control the 
> value
>of XLEN for S-mode and U-mode, respectively. The encoding of these fields 
> is
>the same as the MXL field of misa, shown in Table 9. The effective XLEN in
>S-mode and U-mode are termed SXLEN and UXLEN, respectively
> 
> Basically, RISC-V privileged architecture defines different XLEN values for
> various privilege modes:
>   - MXLEN for Machine mode
>   - SXLEN for Supervisor mode.
>   - HSXLEN for Hypervisor-Supervisor mode.
>   - VSXLEN for Virtual Supervisor mode.
> 
> Considering that SBI is an API that is provided for S-mode I expect that XLEN 
> = SXLEN
> in this case, but SBI spec. is using just XLEN.

Very helpful.

Jan



Re: [PATCH v4 4/8] pdx: allow per-arch optimization of PDX conversion helpers

2025-08-05 Thread Jan Beulich
On 05.08.2025 16:20, Roger Pau Monné wrote:
> On Tue, Aug 05, 2025 at 02:11:22PM +0200, Jan Beulich wrote:
>> On 05.08.2025 11:52, Roger Pau Monne wrote:
>>> There are four performance critical PDX conversion helpers that do the PFN
>>> to/from PDX and the physical addresses to/from directmap offsets
>>> translations.
>>>
>>> In the absence of an active PDX compression, those functions would still do
>>> the calculations needed, just to return the same input value as no
>>> translation is in place and hence PFN and PDX spaces are identity mapped.
>>>
>>> To reduce the overhead of having to do the pointless calculations allow
>>> architectures to implement the translation helpers in a per-arch header.
>>> Rename the existing conversion functions to add a trailing _xlate suffix,
>>> so that the per-arch headers can define the non suffixed versions.
>>>
>>> Currently only x86 implements meaningful custom handlers to short circuit
>>> the translation when not active, using asm goto.  Other architectures use
>>> generic macros that map the non-xlate to the xlate variants to keep the
>>> previous behavior.
>>>
>>> Signed-off-by: Roger Pau Monné 
>>
>> Once again:
>> Reviewed-by: Jan Beulich 
> 
> Thanks, I didn't carry your RB due to the changes requested by Andrew,
> that was a bit too much to carry a RB after those.

Of course, and I didn't mean to suggest you should have kept it. All I
wanted to indicate is that I'm as okay withe change as I was before.

Jan



Re: [PATCH v3 12/20] xen/riscv: implement p2m_set_range()

2025-08-05 Thread Jan Beulich
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> This patch introduces p2m_set_range() and its core helper p2m_set_entry() for

Nit: This patch doesn't introduce p2m_set_range(); it merely fleshes it out.

> --- a/xen/arch/riscv/include/asm/p2m.h
> +++ b/xen/arch/riscv/include/asm/p2m.h
> @@ -7,11 +7,13 @@
>  #include 
>  #include 
>  
> +#include 
>  #include 
>  
>  extern unsigned int p2m_root_order;
>  #define P2M_ROOT_ORDER  p2m_root_order
>  #define P2M_ROOT_PAGES  BIT(P2M_ROOT_ORDER, U)
> +#define P2M_ROOT_LEVEL  HYP_PT_ROOT_LEVEL

I think I commented on this before, and I would have hoped for at least a remark
in the description to appear (perhaps even a comment here): It's okay(ish) to 
tie
these together for now, but in the longer run I don't expect this is going to be
wanted. If e.g. we ran Xen in Sv57 mode, there would be no reason at all to 
force
all P2Ms to use 5 levels of page tables.

> @@ -175,13 +179,257 @@ int p2m_set_allocation(struct domain *d, unsigned long 
> pages, bool *preempted)
>  return 0;
>  }
>  
> +/*
> + * Find and map the root page table. The caller is responsible for
> + * unmapping the table.
> + *
> + * The function will return NULL if the offset into the root table is
> + * invalid.
> + */
> +static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
> +{
> +unsigned long root_table_indx;
> +
> +root_table_indx = gfn_x(gfn) >> XEN_PT_LEVEL_ORDER(P2M_ROOT_LEVEL);

Right now page table layouts / arrangements are indeed similar enough to
share accessor constructs. Nevertheless I find it problematic (doc-wise
at the very least) that a Xen page table construct is used to access a
P2M page table. If and when these needed to be decoupled, it would likely
help of the distinction was already made, by - for now - simply
introducing aliases (here e.g. P2M_LEVEL_ORDER(), expanding to
XEN_PT_LEVEL_ORDER() for the time being).

> +if ( root_table_indx >= P2M_ROOT_PAGES )
> +return NULL;
> +
> +return __map_domain_page(p2m->root + root_table_indx);
> +}
> +
> +static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
> +{
> +write_pte(p, pte);
> +if ( clean_pte )
> +clean_dcache_va_range(p, sizeof(*p));

Not necessarily for right away, but if multiple adjacent PTEs are
written without releasing the lock, this then redundant cache flushing
can be a performance issue.

> +}
> +
> +static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
> +{
> +pte_t pte;
> +
> +memset(&pte, 0, sizeof(pte));

Why memset()? Why not simply give the variable an appropriate initializer?
Or use ...

> +p2m_write_pte(p, pte, clean_pte);

... a compound literal here, like you do ...

> +}
> +
> +static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
> +{
> +panic("%s: hasn't been implemented yet\n", __func__);
> +
> +return (pte_t) { .pte = 0 };

... here? (Just {} would also do, if I'm not mistaken.)

> +}
> +
> +#define P2M_TABLE_MAP_NONE 0
> +#define P2M_TABLE_MAP_NOMEM 1
> +#define P2M_TABLE_SUPER_PAGE 2
> +#define P2M_TABLE_NORMAL 3
> +
> +/*
> + * Take the currently mapped table, find the corresponding the entry
> + * corresponding to the GFN, and map the next table, if available.

Nit: Double "corresponding".

> + * The previous table will be unmapped if the next level was mapped
> + * (e.g P2M_TABLE_NORMAL returned).
> + *
> + * `alloc_tbl` parameter indicates whether intermediate tables should
> + * be allocated when not present.
> + *
> + * Return values:
> + *  P2M_TABLE_MAP_NONE: a table allocation isn't permitted.
> + *  P2M_TABLE_MAP_NOMEM: allocating a new page failed.
> + *  P2M_TABLE_SUPER_PAGE: next level or leaf mapped normally.
> + *  P2M_TABLE_NORMAL: The next entry points to a superpage.
> + */
> +static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
> +  unsigned int level, pte_t **table,
> +  unsigned int offset)
> +{
> +panic("%s: hasn't been implemented yet\n", __func__);
> +
> +return P2M_TABLE_MAP_NONE;
> +}
> +
> +/* Free pte sub-tree behind an entry */
> +static void p2m_free_subtree(struct p2m_domain *p2m,
> + pte_t entry, unsigned int level)
> +{
> +panic("%s: hasn't been implemented yet\n", __func__);
> +}
> +
> +/*
> + * Insert an entry in the p2m. This should be called with a mapping
> + * equal to a page/superpage.
> + */
> +static int p2m_set_entry(struct p2m_domain *p2m,
> +   gfn_t gfn,
> +   unsigned long page_order,
> +   mfn_t mfn,
> +   p2m_type_t t)
> +{
> +unsigned int level;
> +unsigned int target = page_order / PAGETABLE_ORDER;
> +pte_t *entry, *table, orig_pte;
> +int rc;
> +/* A mapping is removed if the MFN is invalid. */
> +bool removing_mapping = mfn_eq(mfn, INVALID_MFN);

Comment and code don't fit together. Many MFNs are invalid (any for which
mfn_valid(

[PATCH] x86/hpet: do local APIC EOI after interrupt processing

2025-08-05 Thread Roger Pau Monne
The current logic in the HPET interrupt ->ack() hook will perform a local
APIC EOI ahead of enabling interrupts, possibly leading to recursion in the
interrupt handler.

Fix this by doing the local APIC EOI strictly after the window with
interrupt enabled, as that prevents the recursion, and would only allow for
interrupts with higher priority to be serviced.

Use the generic ack_nonmaskable_msi_irq() and end_nonmaskable_irq()
functions, removing the need for hpet_msi_ack().

Reported-by: Andrew Cooper 
Fixes: 3ba523ff957c ('CPUIDLE: enable MSI capable HPET for timer broadcast')
Signed-off-by: Roger Pau Monné 
---
 xen/arch/x86/hpet.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/xen/arch/x86/hpet.c b/xen/arch/x86/hpet.c
index 192de433cfd1..d05b5eb1361f 100644
--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -303,13 +303,6 @@ static unsigned int cf_check hpet_msi_startup(struct 
irq_desc *desc)
 
 #define hpet_msi_shutdown hpet_msi_mask
 
-static void cf_check hpet_msi_ack(struct irq_desc *desc)
-{
-irq_complete_move(desc);
-move_native_irq(desc);
-ack_APIC_irq();
-}
-
 static void cf_check hpet_msi_set_affinity(
 struct irq_desc *desc, const cpumask_t *mask)
 {
@@ -337,7 +330,8 @@ static hw_irq_controller hpet_msi_type = {
 .shutdown   = hpet_msi_shutdown,
 .enable= hpet_msi_unmask,
 .disable= hpet_msi_mask,
-.ack= hpet_msi_ack,
+.ack= ack_nonmaskable_msi_irq,
+.end= end_nonmaskable_irq,
 .set_affinity   = hpet_msi_set_affinity,
 };
 
-- 
2.49.0




Re: [PATCH] x86/hpet: do local APIC EOI after interrupt processing

2025-08-05 Thread Andrew Cooper
On 05/08/2025 5:03 pm, Roger Pau Monne wrote:
> The current logic in the HPET interrupt ->ack() hook will perform a local
> APIC EOI ahead of enabling interrupts, possibly leading to recursion in the
> interrupt handler.
>
> Fix this by doing the local APIC EOI strictly after the window with
> interrupt enabled, as that prevents the recursion, and would only allow for
> interrupts with higher priority to be serviced.
>
> Use the generic ack_nonmaskable_msi_irq() and end_nonmaskable_irq()
> functions, removing the need for hpet_msi_ack().
>
> Reported-by: Andrew Cooper 
> Fixes: 3ba523ff957c ('CPUIDLE: enable MSI capable HPET for timer broadcast')
> Signed-off-by: Roger Pau Monné 

Reviewed-by: Andrew Cooper 

Good riddance to the mess in our patchqueue cased by this.



Re: [PATCH v1 1/3] arm/pci: Add pci-scan boot argument

2025-08-05 Thread Luca Fancellu
Hi Jan,

> On 4 Aug 2025, at 09:10, Jan Beulich  wrote:
> 
> On 01.08.2025 11:22, Mykyta Poturai wrote:
>> From: Edward Pickup 
>> 
>> This patch adds a Xen boot arguments that, if enabled, causes a call to
>> existing code to scan pci devices enumerated by the firmware.
>> 
>> This patch also makes an existing debug function viewable outside its
>> translation unit, and uses this to dump the PCI devices found.
>> The debug message is controlled by config DEBUG.
>> 
>> Additionally, this patch modifies segment loading to ensure that PCI
>> devices on other segments are properly found.
>> 
>> This will be needed ahead of dom0less support for pci passthrough on
>> arm.
>> 
>> Signed-off-by: Luca Fancellu 
>> Signed-off-by: Edward Pickup 
> 
> Considering the From: above and this order of S-o-b: Who is it really that
> was the original author here?

I think this patch was mine, probably some issues from Edward, anyway he 
doesn’t work at arm anymore.

Cheers,
Luca

Re: [PATCH v2] xen/dom0less: arm: fix hwdom 1:1 low memory allocation

2025-08-05 Thread Julien Grall

Hi Grygorii,

On 05/08/2025 20:00, Grygorii Strashko wrote:

From: Grygorii Strashko 

Call stack for dom0less hwdom case (1:1) memory:
create_domUs
|-construct_domU
   |-construct_hwdom()
 |-allocate_memory_11()

And allocate_memory_11() uses "dom0_mem" as:
min_low_order =
   get_order_from_bytes(min_t(paddr_t, dom0_mem, MB(128)));

In case of dom0less boot the "dom0_mem" is not used and defaulted to 0,


From docs/mics/xen-command-linux.pandoc:

---

### dom0_mem (ARM)
> `= `

Set the amount of memory for the initial domain (dom0). It must be
greater than zero. This parameter is required.

---

If dom0_mem is effectively optional, then shouldn't the doc be updated?


which causes min_low_order to get high value > order and so no allocations
happens from low memory.
> > Fix it, by using kinfo->unassigned_mem instead of "dom0_mem" has 
correct

memory size in both cases: regular dom0 boot and dom0less boot.

Fixes: 52cb53f1816a ("xen/arm: dom0less hwdom construction")
Signed-off-by: Grygorii Strashko 
Reviewed-by: Denis Mukhin 
Reviewed-by: Jason Andryuk 


Acked-by: Julien Grall 

Cheers,

--
Julien Grall




Re: [PATCH v2 1/3] xen/arm: irq: drop unreachable pirq callbacks

2025-08-05 Thread Julien Grall

Hi,

On 05/08/2025 19:40, Grygorii Strashko wrote:

From: Grygorii Strashko 

Since commit 341f271cf86f ("xen/evtchn: fully restrict concept of pIRQ for
arches with pIRQ support only"), the corresponding Arm arch pIRQ callbacks
become unreachable, so drop them.

Signed-off-by: Grygorii Strashko 
Acked-by: Andrew Cooper 
Reviewed-by: Denis Mukhin 


Acked-by: Julien Grall 

Cheers,

--
Julien Grall




Re: [PATCH 1/2] x86/mm: drop paging_get_mode()

2025-08-05 Thread Andrew Cooper
On 05/08/2025 8:58 am, Jan Beulich wrote:
> The function was introduced without any caller, and never gained any.
> Thus it has always been violating Misra rule 2.1 (unreachable code).
>
> Fixes: dd6de3ab9985 ("Implement Nested-on-Nested")
> Signed-off-by: Jan Beulich 

Acked-by: Andrew Cooper 

I'm still not certain how this will shake out in nested-virt, but I'm
reasonably sure it wont end up like this.

~Andrew



Re: Xen 4.21 Development Update [June-July]

2025-08-05 Thread Jürgen Groß

On 05.08.25 20:19, Oleksii Kurochko wrote:

Hello everyone,

This email only tracks big items for xen.git tree. Please reply for items you
would like to see in 4.21 so that people have an idea what is going on and
prioritise accordingly.

You're welcome to provide description and use cases of the feature you're
working on.

= Timeline =

The current release schedule could be found here:
   https://wiki.xenproject.org/wiki/Xen_Project_X.YY_Release_Notes

And as a reminder I would like to remind at the of this week we will have
Last posting date (Fri Aug 08, 2025).

= Updates =

The following items ( the links for them could be found int the list below )
were moved to completed:
   [since Jun2 - Aug5]:
    Added some tags: [4.21], [next-rel(s)] to the list "Full list of items"
    below.
    * x86:
     - kexec: add kexec support to Mini-OS.


And with that:

- PVH xenstore-stubdom live update


     - x86: memcpy() / memset() (non-)ERMS flavors plus fallout
    * Arm:
     - SMMU handling for PCIe Passthrough on ARM.
     - Add support for R-Car Gen4 PCI host controller.
     - First chunk for Arm R82 and MPU support.
     - Enable R52 support for the first chunk of MPU support
     - ARM split hardware and control domains.
    * RISC-V:
     - Introduce basic UART support and interrupts for hypervisor mode.


* Hypervisor:
  - tools: support for per-domain Xenstore features


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [PATCH 1/2] efi: Call FreePages only if needed

2025-08-05 Thread Jan Beulich
On 05.08.2025 18:32, Ross Lagerwall wrote:
> If the config file is builtin, cfg.addr will be zero but Xen
> unconditionally calls FreePages() on the address.
> 
> Xen may also call FreePages() with a zero address if blexit() is called
> after this point since cfg.need_to_free is not set to false.
> 
> The UEFI specification does not say whether calling FreePages() with a
> zero address is allowed so let's be cautious and use cfg.need_to_free
> properly.

Well, no, this paragraph makes no sense. Of course this is allowed, but
not as no-op behavior (like free(NULL) would be), but to free memory
starting at 0.

> Signed-off-by: Ross Lagerwall 

This pretty clearly wants a Fixes: tag, or maybe it even needs to be two.
I've checked the original code in 4.2, and things were consistent there,
afaics. So breakage was introduced perhaps in one or two of the many
re-works.

Jan



Re: Xen 4.21 Development Update [June-July]

2025-08-05 Thread Jan Beulich
On 05.08.2025 20:19, Oleksii Kurochko wrote:
> * Full list of items : *
> 
> = Projects =
> 
> == Hypervisor ==
> 
> * [4.21] xen/console: cleanup console input switch logic (v5)
>    - Denis Mukhin
>    - 
> https://lore.kernel.org/xen-devel/20250530231841.73386-1-dmuk...@ford.com/
> 
> * [4.21] xen: introduce CONFIG_SYSCTL (v4 -> v8)
>    -  Penny Zheng
>    - 
> https://lore.kernel.org/xen-devel/20250711043158.2566880-1-penny.zh...@amd.com/
> 
> * [4.21] Several CI cleanups and improvements, plus yet another new runner
>    - Marek Marczykowski-Górecki
>    - 
> https://lore.kernel.org/xen-devel/cover.7da1777882774486a13e6f39ff4a2096f6b7901e.1744028549.git-series.marma...@invisiblethingslab.com/
>    - 
> https://patchew.org/Xen/cover.7da1777882774486a13e6f39ff4a2096f6b7901e.1744028549.git-series.marma...@invisiblethingslab.com/
> 
> * [4.21] automation: Refresh the remaining Debian containers (v2)
>    -  Javi Merino
>    - 
> https://lore.kernel.org/xen-devel/cover.1730743077.git.javi.mer...@cloud.com/T/#m5d9acb7cf5db3c2be3d6527de14b69b07812314e
> 
> * [4.21] MSI-X support with qemu in stubdomain, and other related 
> changes (v8)
>    -  Marek Marczykowski-Górecki
>    - 
> https://lore.kernel.org/xen-devel/cover.33fb4385b7dd6c53bda4acf0a9e91748b3d7b1f7.1715313192.git-series.marma...@invisiblethingslab.com/
>    -  Only automation patch left to be reviewed/merged.
> 
> * [next-rel(s)] Physical address hypercall ABI ("HVMv2")
>    - Teddy Astie
>    - 
> https://lore.kernel.org/xen-devel/cover.1744981654.git.teddy.as...@vates.tech/
> 
> * [next-rel(s)] xen: Untangle mm.h
>    -  Andrew Cooper
>    - 
> https://lore.kernel.org/xen-devel/20250312174513.4075066-1-andrew.coop...@citrix.com/
>    - 
> https://patchew.org/Xen/20250312174513.4075066-1-andrew.coop...@citrix.com/
> 
> * [next-rel(s)] Add support for exact-node memory claims
>    -  Alejandro Vallejo
>    - 
> https://lore.kernel.org/xen-devel/20250314172502.53498-1-alejandro.vall...@cloud.com/
>    - 
> https://patchew.org/Xen/20250314172502.53498-1-alejandro.vall...@cloud.com/
> 
> * [next-rel(s)] Remove the directmap (v5)
>    -  Alejandro Vallejo
>    - 
> https://lore.kernel.org/xen-devel/20250108151822.16030-1-alejandro.vall...@cloud.com/
>    - 
> https://patchew.org/Xen/20250108151822.16030-1-alejandro.vall...@cloud.com/
> 
> * [next-rel(s)] GRUB: Supporting Secure Boot of xen.gz (v1)
>    -  Ross Lagerwall
>    - 
> https://patchew.org/Xen/20240313150748.791236-1-ross.lagerw...@citrix.com/
> 
> * [next-rel(s)] Introduce xenbindgen to autogen hypercall structs (v1)
>    -  Alejandro Vallejo
>    - 
> https://patchew.org/Xen/20241115115200.2824-1-alejandro.vall...@cloud.com/
> 
> * [next-rel(s)] Introduce NS8250 UART emulator (v2)
>    -  Denis Mukhin
>    - 
> https://patchew.org/Xen/20241205-vuart-ns8250-v1-0-e9aa92312...@ford.com/
> 
> * [next-rel(s)] xen: framework for UART emulators
>    - Denis Mukhin
>    - 
> https://lore.kernel.org/xen-devel/20250624035443.344099-1-dmuk...@ford.com/
> 
> === x86 ===
> * [4.21] x86/asm: cleanups after toolchain baseline upgrade (v1 -> v2)
>    - Denis Mukhin
>    - 
> https://lore.kernel.org/xen-devel/20250403182250.3329498-1-dmuk...@ford.com/
>    - https://patchew.org/Xen/20250403182250.3329498-1-dmuk...@ford.com/
> 
> * [4.21?] x86/efi: Fix booting when NX is disabled (v1 -> v2)
>    - Andrew Cooper
>    - 
> https://patchew.org/Xen/20240722101838.3946983-1-andrew.coop...@citrix.com/
>    - 
> https://lore.kernel.org/xen-devel/20240722101838.3946983-1-andrew.coop...@citrix.com/
> 
> * [4.21?] Hyperlaunch device tree for dom0 (v6)
>    - Alejandro Vallejo
>    - https://patchew.org/Xen/20250429123629.20839-1-agarc...@amd.com/
>    - 
> https://lore.kernel.org/xen-devel/20250429123629.20839-1-agarc...@amd.com/
> 
> *  [4.21?] Boot modules for Hyperlaunch (v9)
>    -  Daniel P. Smith
>    - 
> https://lore.kernel.org/xen-devel/20241115131204.32135-1-dpsm...@apertussolutions.com/
>    - 
> https://patchew.org/Xen/20241115131204.32135-1-dpsm...@apertussolutions.com/
> 
> *  [4.21?] Address Space Isolation FPU preparations (v2->v3)
>    -  Alejandro Vallejo
>    - 
> https://patchew.org/Xen/20250110132823.24348-1-alejandro.vall...@cloud.com/
> 
> * [next-rel(s)] Hyperlaunch domain builder
>    - Daniel P. Smith
>    - 
> https://lore.kernel.org/xen-devel/20250515131744.3843-1-dpsm...@apertussolutions.com/
> 
> * [next-rel(s)] Confidential computing and AMD SEV support
>    - Teddy Astie
>    - https://patchew.org/Xen/cover.1747312394.git.teddy.as...@vates.tech/
>    - 
> https://lore.kernel.org/xen-devel/cover.1747312394.git.teddy.as...@vates.tech/
> 
> * [next-rel(s)] amd-cppc CPU Performance Scaling Driver (v5 -> v6)
>    - Penny Zheng
>    - 
> https://lore.kernel.org/xen-devel/20250711035106.2540522-1-penny.zh...@amd.com/
> 
> * [next-rel(s)] x86: Trenchboot Secure Launch DRTM (Xen) (v1 -> v3)
>    - Sergii Dmytruk
>    - https://patchew.org/Xen/cover.1745172094.git.sergii.dmyt...@3mdeb.com/
>  

Re: [PATCH -next v7 5/7] arm64: entry: Refactor preempt_schedule_irq() check code

2025-08-05 Thread Jinjie Ruan



On 2025/8/5 23:06, Ada Couprie Diaz wrote:
> Hi Jinjie,
> 
> On 29/07/2025 02:54, Jinjie Ruan wrote:
>> ARM64 requires an additional check whether to reschedule on return
>> from interrupt. So add arch_irqentry_exit_need_resched() as the default
>> NOP implementation and hook it up into the need_resched() condition in
>> raw_irqentry_exit_cond_resched(). This allows ARM64 to implement
>> the architecture specific version for switching over to
>> the generic entry code.
> I was a bit confused by this, as I didn't see the link with the generic
> entry
> given you implement `raw_irqentry_exit_cond_resched()` in arch/arm64
> as well in this patch : I expected the arm64 implementation to be added.
> I share more thoughts below.
> 
> What do you think about something along those lines ?
> 
> Compared to the generic entry code, arm64 does additional checks
> when deciding to reschedule on return from an interrupt.
> Introduce arch_irqentry_exit_need_resched() in the need_resched()
> condition
> of the generic raw_irqentry_exit_cond_resched(), with a NOP default.
> This will allow arm64 to implement its architecture specific checks
> when
> switching over to the generic entry code.

This revision makes it easier for people to understand.

> 
>> [...]
>> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
>> index b82032777310..4aa9656fa1b4 100644
>> --- a/kernel/entry/common.c
>> +++ b/kernel/entry/common.c
>> @@ -142,6 +142,20 @@ noinstr irqentry_state_t irqentry_enter(struct
>> pt_regs *regs)
>>   return ret;
>>   }
>>   +/**
>> + * arch_irqentry_exit_need_resched - Architecture specific need
>> resched function
>> + *
>> + * Invoked from raw_irqentry_exit_cond_resched() to check if need
>> resched.
> Very nit : "to check if resched is needed." ?

This is good.

>> + * Defaults return true.
>> + *
>> + * The main purpose is to permit arch to skip preempt a task from an
>> IRQ.
> If feel that "to avoid preemption of a task" instead of "to skip preempt
> a task"
> would make this much clearer, what do you think ?

Yes, this is more clearer.

>> + */
>> +static inline bool arch_irqentry_exit_need_resched(void);
>> +
>> +#ifndef arch_irqentry_exit_need_resched
>> +static inline bool arch_irqentry_exit_need_resched(void) { return
>> true; }
>> +#endif
>> +
> 
> I've had some trouble reviewing this patch : on the one hand because
> I didn't notice `arch_irqentry_exit_need_resched()` was added in
> the common entry code, which is on me !
> On the other hand, I felt that the patch itself was a bit disconnected :
> we add `arch_irqentry_exit_need_resched()` in the common entry code,
> with a default NOP, but in the same function we add to arm64,
> while mentioning that this is for arm64's additional checks,
> which we only implement in patch 7.
> 
> Would it make sense to move the `arch_irqentry_exit_need_resched()`
> part of the patch to patch 7, so that the introduction and
> arch-specific implementation appear together ?
> To me it seems easier to wrap my head around, as it would look like
> "Move arm64 to generic entry, but it does additional checks : add a new
> arch-specific function controlling re-scheduling, defaulting to true,
> and implement it for arm64". I feel it could help making patch 7's
> commit message clearer as well.
> 
> From what I gathered on the archive `arch_irqentry_exit_need_resched()`
> being added here was suggested previously, so others might not have the
> same opinion.
> 
> Maybe improving the commit message and comment for this would be enough
> as well, as per my suggestions above.
> 
> 
> Otherwise the changes make sense and I don't see any functional issues !
> 
> Thanks,
> Ada
> 
> 



Re: [PATCH 2/2] efi: Stop using StdErr

2025-08-05 Thread Jan Beulich
On 05.08.2025 18:32, Ross Lagerwall wrote:
> Xen's use of StdErr is inconsistent. Some boot errors are reported using
> PrintErr() which uses StdErr and some are reported using blexit() which
> uses StdOut.

... with PrintErrMesg() having

StdOut = StdErr;

apparently to address at least some of the inconsistencies. Perhaps
blexit(), when not passed NULL, should similarly override StdOut.

> On my test system using OVMF, StdErr is not displayed on the emulated
> screen. Looking at other EFI applications, StdErr is just used for debug
> messages if at all.

That's hardly how StdErr was meant to be used. And at the risk of being
flamed for saying so, looking at other EFI applications (without saying
of what prominence or origin they are) can hardly serve as a justification.
If OVMF doesn't set up StdErr correctly (despite being configured / set up
correctly), and if that can't be fixed there, imo what you want as a
workaround is a command line option to override StdErr by StdOut even when
SystemTable->StdErr is non-NULL.

Along the lines of the comment further up, inconsistencies in the use of
StdErr vs StdOut may want addressing (separately).

But of course, not being EFI maintainer anymore, all of what I said above
may be deemed entirely meaningless by the new maintainers.

Jan



Re: [PATCH -next v7 5/7] arm64: entry: Refactor preempt_schedule_irq() check code

2025-08-05 Thread Jinjie Ruan



On 2025/8/5 23:06, Ada Couprie Diaz wrote:
> Hi Jinjie,
> 
> On 29/07/2025 02:54, Jinjie Ruan wrote:
>> ARM64 requires an additional check whether to reschedule on return
>> from interrupt. So add arch_irqentry_exit_need_resched() as the default
>> NOP implementation and hook it up into the need_resched() condition in
>> raw_irqentry_exit_cond_resched(). This allows ARM64 to implement
>> the architecture specific version for switching over to
>> the generic entry code.
> I was a bit confused by this, as I didn't see the link with the generic
> entry
> given you implement `raw_irqentry_exit_cond_resched()` in arch/arm64
> as well in this patch : I expected the arm64 implementation to be added.
> I share more thoughts below.
> 
> What do you think about something along those lines ?
> 
> Compared to the generic entry code, arm64 does additional checks
> when deciding to reschedule on return from an interrupt.
> Introduce arch_irqentry_exit_need_resched() in the need_resched()
> condition
> of the generic raw_irqentry_exit_cond_resched(), with a NOP default.
> This will allow arm64 to implement its architecture specific checks
> when
> switching over to the generic entry code.
> 
>> [...]
>> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
>> index b82032777310..4aa9656fa1b4 100644
>> --- a/kernel/entry/common.c
>> +++ b/kernel/entry/common.c
>> @@ -142,6 +142,20 @@ noinstr irqentry_state_t irqentry_enter(struct
>> pt_regs *regs)
>>   return ret;
>>   }
>>   +/**
>> + * arch_irqentry_exit_need_resched - Architecture specific need
>> resched function
>> + *
>> + * Invoked from raw_irqentry_exit_cond_resched() to check if need
>> resched.
> Very nit : "to check if resched is needed." ?
>> + * Defaults return true.
>> + *
>> + * The main purpose is to permit arch to skip preempt a task from an
>> IRQ.
> If feel that "to avoid preemption of a task" instead of "to skip preempt
> a task"
> would make this much clearer, what do you think ?
>> + */
>> +static inline bool arch_irqentry_exit_need_resched(void);
>> +
>> +#ifndef arch_irqentry_exit_need_resched
>> +static inline bool arch_irqentry_exit_need_resched(void) { return
>> true; }
>> +#endif
>> +
> 
> I've had some trouble reviewing this patch : on the one hand because
> I didn't notice `arch_irqentry_exit_need_resched()` was added in
> the common entry code, which is on me !
> On the other hand, I felt that the patch itself was a bit disconnected :
> we add `arch_irqentry_exit_need_resched()` in the common entry code,
> with a default NOP, but in the same function we add to arm64,
> while mentioning that this is for arm64's additional checks,
> which we only implement in patch 7.

Yes, it does.

> 
> Would it make sense to move the `arch_irqentry_exit_need_resched()`
> part of the patch to patch 7, so that the introduction and
> arch-specific implementation appear together ?
> To me it seems easier to wrap my head around, as it would look like
> "Move arm64 to generic entry, but it does additional checks : add a new
> arch-specific function controlling re-scheduling, defaulting to true,
> and implement it for arm64". I feel it could help making patch 7's
> commit message clearer as well.
> 
> From what I gathered on the archive `arch_irqentry_exit_need_resched()`
> being added here was suggested previously, so others might not have the
> same opinion.

Yes, introduce `arch_irqentry_exit_need_resched()` here may help
understand the patch's refactoring purpose.

> 
> Maybe improving the commit message and comment for this would be enough
> as well, as per my suggestions above.

Thank you! I'll improve the commit message and comment.

> 
> 
> Otherwise the changes make sense and I don't see any functional issues !
> 
> Thanks,
> Ada
> 
> 



Re: [PATCH v10 2/4] vpci/rebar: Free Rebar resources when init_rebar() fails

2025-08-05 Thread Jan Beulich
On 06.08.2025 05:39, Chen, Jiqian wrote:
> On 2025/8/5 16:48, Jan Beulich wrote:
>> On 05.08.2025 05:49, Jiqian Chen wrote:
>>> --- a/xen/drivers/vpci/rebar.c
>>> +++ b/xen/drivers/vpci/rebar.c
>>> @@ -49,6 +49,32 @@ static void cf_check rebar_ctrl_write(const struct 
>>> pci_dev *pdev,
>>>  bar->guest_addr = bar->addr;
>>>  }
>>>  
>>> +static int cf_check cleanup_rebar(const struct pci_dev *pdev)
>>> +{
>>> +int rc;
>>> +uint32_t ctrl;
>>> +unsigned int nbars;
>>> +unsigned int rebar_offset = pci_find_ext_capability(pdev->sbdf,
>>> +
>>> PCI_EXT_CAP_ID_REBAR);
>>> +
>>> +if ( !rebar_offset || !is_hardware_domain(pdev->domain) )
>>> +{
>>> +ASSERT_UNREACHABLE();
>>> +return 0;
>>> +}
>>> +
>>> +ctrl = pci_conf_read32(pdev->sbdf, rebar_offset + PCI_REBAR_CTRL(0));
>>> +nbars = MASK_EXTR(ctrl, PCI_REBAR_CTRL_NBAR_MASK);
>>> +
>>> +rc = vpci_remove_registers(pdev->vpci, rebar_offset + PCI_REBAR_CAP(0),
>>> +   PCI_REBAR_CTRL(nbars - 1));
>>> +if ( rc )
>>> +printk(XENLOG_ERR "%pd %pp: fail to remove Rebar handlers rc=%d\n",
>>> +   pdev->domain, &pdev->sbdf, rc);
>>
>> MSI and MSI-X (now) have ASSERT_UNREACHABLE() on their respective paths. Is
>> there a reason this shouldn't be done here as well?
> Will add.
> 
>>
>> MSI and MSI-X further have another add-register below here, to ensure the
>> control register cannot be written. Again - is there a reason the same
>> shouldn't be done here? (If so, I think this may want putting in a comment.)
> Since extended capabilities are only exposed to dom0, and dom0 has no 
> limitations to access devices.
> It since there is not much point in adding such a handler for rebar.

While the effect is different from MSI / MSI-X, isn't it still a problem if
Dom0 changed BAR sizes entirely behind Xen's back? Which could be prevented
by similarly discarding writes to the control register, as is done for the
other two?

Jan



Re: [PATCH v10 4/4] vpci/msix: Free MSIX resources when init_msix() fails

2025-08-05 Thread Jan Beulich
On 06.08.2025 05:35, Chen, Jiqian wrote:
> On 2025/8/5 16:43, Jan Beulich wrote:
>> On 05.08.2025 05:49, Jiqian Chen wrote:
>>> --- a/xen/drivers/vpci/msix.c
>>> +++ b/xen/drivers/vpci/msix.c
>>> @@ -655,6 +655,48 @@ int vpci_make_msix_hole(const struct pci_dev *pdev)
>>>  return 0;
>>>  }
>>>  
>>> +static int cf_check cleanup_msix(const struct pci_dev *pdev)
>>> +{
>>> +int rc;
>>> +struct vpci *vpci = pdev->vpci;
>>> +const unsigned int msix_pos = pdev->msix_pos;
>>> +
>>> +if ( !msix_pos )
>>> +return 0;
>>> +
>>> +rc = vpci_remove_registers(vpci, msix_control_reg(msix_pos), 2);
>>> +if ( rc )
>>> +{
>>> +printk(XENLOG_ERR "%pd %pp: fail to remove MSIX handlers rc=%d\n",
>>> +   pdev->domain, &pdev->sbdf, rc);
>>> +ASSERT_UNREACHABLE();
>>> +return rc;
>>> +}
>>> +
>>> +if ( vpci->msix )
>>> +{
>>> +list_del(&vpci->msix->next);
>>> +for ( unsigned int i = 0; i < ARRAY_SIZE(vpci->msix->table); i++ )
>>> +if ( vpci->msix->table[i] )
>>> +iounmap(vpci->msix->table[i]);
>>> +
>>> +XFREE(vpci->msix);
>>> +}
>>> +
>>> +/*
>>> + * The driver may not traverse the capability list and think device
>>> + * supports MSIX by default. So here let the control register of MSIX
>>> + * be Read-Only is to ensure MSIX disabled.
>>> + */
>>> +rc = vpci_add_register(vpci, vpci_hw_read16, NULL,
>>> +   msix_control_reg(msix_pos), 2, NULL);
>>> +if ( rc )
>>> +printk(XENLOG_ERR "%pd %pp: fail to add MSIX ctrl handler rc=%d\n",
>>> +   pdev->domain, &pdev->sbdf, rc);
>>
>> Here as well as for MSI: Wouldn't this better be limited to the init-failure
>> case? No point in adding a register hook (and possibly emitting a misleading
>> log message) when we're tearing down anyway. IOW I think the ->cleanup()
>> hook needs a boolean parameter, unless the distinction of the two cases can
>> be (reliably) inferred from some other property.
> To make these changes, can I add a new patch as the last patch of this series?
> And the new patch will do:
> 1. add a boolean parameter for cleanup hook to separate whose calls 
> cleanup(during initialization or during deassign device).
> 2. call all cleanup hooks in vpci_deassign_device().
> 3. remove the MSI/MSIX specific free actions in vpci_deassign_device().

The outline looks okay, but imo it shouldn't be last in the series. Instead I
think it wants to come ahead of the last three patches; whether it's patch 1
or patch 2 doesn't really matter. Then (3) would be taken care of incrementally,
as ->cleanup hooks are added.

Jan



Re: [PATCH -next v7 7/7] arm64: entry: Switch to generic IRQ entry

2025-08-05 Thread Jinjie Ruan



On 2025/8/5 23:07, Ada Couprie Diaz wrote:
> Hi Jinjie,
> 
> The code changes look good to me, just a small missing clean up I believe.
> 
> On 29/07/2025 02:54, Jinjie Ruan wrote:
> 
>> Currently, x86, Riscv, Loongarch use the generic entry. Convert arm64
>> to use the generic entry infrastructure from kernel/entry/*.
>> The generic entry makes maintainers' work easier and codes
>> more elegant.
>>
>> Switch arm64 to generic IRQ entry first, which removed duplicate 100+
>> LOC. The next patch serise will switch to generic entry completely later.
>> Switch to generic entry in two steps according to Mark's suggestion
>> will make it easier to review.
> 
> I think the commit message could be clearer, especially since now this
> series
> only moves arm64 to generic IRQ entry and not the complete generic entry.
> 
> What do you think of something like below ? It repeats a bit less and I
> think
> it helps understanding what is going on in this specific commit, as you
> already
> have details on the larger plans in the cover.
> 
> Currently, x86, Riscv and Loongarch use the generic entry code,
> which makes
> maintainer's work easier and code more elegant.
> Start converting arm64 to use the generic entry infrastructure
> from kernel/entry/* by switching it to generic IRQ entry, which
> removes 100+
> lines of duplicate code.
> arm64 will completely switch to generic entry in a later series.
> 

Yes, this is more concise and accurate, and make the motivation more
clearer.

>> The changes are below:
>>   - Remove *enter_from/exit_to_kernel_mode(), and wrap with generic
>>     irqentry_enter/exit(). Also remove *enter_from/exit_to_user_mode(),
>>     and wrap with generic enter_from/exit_to_user_mode() because they
>>     are exactly the same so far.
> Nit : "so far" can be removed
>>   - Remove arm64_enter/exit_nmi() and use generic
>> irqentry_nmi_enter/exit()
>>     because they're exactly the same, so the temporary arm64 version
>>     irqentry_state can also be removed.
>>
>>   - Remove PREEMPT_DYNAMIC code, as generic entry do the same thing
>>     if arm64 implement arch_irqentry_exit_need_resched().
> This feels unrelated, given that the part that needs
> `arch_irqentry_exit_need_resched()`
> is called whether or not PREEMPT_DYNAMIC is enabled ?

Yes, the language here needs to be reorganized in conjunction with your
comments from the fifth patch.

> 
> Given my comments on patch 5, I feel that the commit message should mention
> explicitly the implementation of `arch_irqentry_exit_need_resched()` and
> why,
> even though it was already mentioned in patch 5.
> (This is what I was referencing in patch 5 : as I feel it's useful to
> mention again
> the reasons when implementing it, it doesn't feel too out of place to
> introduce
> the generic part at the same time. But again, I might be wrong here.)
> 
> Then you can have another point explaining that
> `raw_irqentry_exit_cond_resched()`
> and the PREEMPT_DYNAMIC code is removed because they are identical to the
> generic entry code, similarly to your other points.
>> Tested ok with following test cases on Qemu virt platform:
>>   - Perf tests.
>>   - Different `dynamic preempt` mode switch.
>>   - Pseudo NMI tests.
>>   - Stress-ng CPU stress test.
>>   - MTE test case in
>> Documentation/arch/arm64/memory-tagging-extension.rst
>>     and all test cases in tools/testing/selftests/arm64/mte/*.
> Nit : I'm not sure if the commit message is the best place for this,
> given you
> already gave some details in the cover ?
> But I don't have much experience here, so I'll leave it up to you and
> others !

Yes, this can be removed as the cover letter already has it.

>> Suggested-by: Mark Rutland 
>> Signed-off-by: Jinjie Ruan 
>> ---
>> [...]
>> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
>> index db3f972f8cd9..1110eeb21f57 100644
>> --- a/arch/arm64/kernel/signal.c
>> +++ b/arch/arm64/kernel/signal.c
>> @@ -9,6 +9,7 @@
>>   #include 
>>   #include 
>>   #include 
>> +#include 
>>   #include 
>>   #include 
>>   #include 
>> @@ -1576,7 +1577,7 @@ static void handle_signal(struct ksignal *ksig,
>> struct pt_regs *regs)
>>    * the kernel can handle, and then we build all the user-level
>> signal handling
>>    * stack-frames in one go after that.
>>    */
>> -void do_signal(struct pt_regs *regs)
>> +void arch_do_signal_or_restart(struct pt_regs *regs)
> Given that `do_signal(struct pt_regs *regs)` is defined in
> `arch/arm64/include/asm/exception.h`,
> and that there remains no users of `do_signal()`, I think it should be
> removed there.

Good catch! I'll remove it.

> 
> Thanks,
> Ada
> 



Re: [PATCH v10 4/4] vpci/msix: Free MSIX resources when init_msix() fails

2025-08-05 Thread Chen, Jiqian
On 2025/8/5 16:43, Jan Beulich wrote:
> On 05.08.2025 05:49, Jiqian Chen wrote:
>> --- a/xen/drivers/vpci/msix.c
>> +++ b/xen/drivers/vpci/msix.c
>> @@ -655,6 +655,48 @@ int vpci_make_msix_hole(const struct pci_dev *pdev)
>>  return 0;
>>  }
>>  
>> +static int cf_check cleanup_msix(const struct pci_dev *pdev)
>> +{
>> +int rc;
>> +struct vpci *vpci = pdev->vpci;
>> +const unsigned int msix_pos = pdev->msix_pos;
>> +
>> +if ( !msix_pos )
>> +return 0;
>> +
>> +rc = vpci_remove_registers(vpci, msix_control_reg(msix_pos), 2);
>> +if ( rc )
>> +{
>> +printk(XENLOG_ERR "%pd %pp: fail to remove MSIX handlers rc=%d\n",
>> +   pdev->domain, &pdev->sbdf, rc);
>> +ASSERT_UNREACHABLE();
>> +return rc;
>> +}
>> +
>> +if ( vpci->msix )
>> +{
>> +list_del(&vpci->msix->next);
>> +for ( unsigned int i = 0; i < ARRAY_SIZE(vpci->msix->table); i++ )
>> +if ( vpci->msix->table[i] )
>> +iounmap(vpci->msix->table[i]);
>> +
>> +XFREE(vpci->msix);
>> +}
>> +
>> +/*
>> + * The driver may not traverse the capability list and think device
>> + * supports MSIX by default. So here let the control register of MSIX
>> + * be Read-Only is to ensure MSIX disabled.
>> + */
>> +rc = vpci_add_register(vpci, vpci_hw_read16, NULL,
>> +   msix_control_reg(msix_pos), 2, NULL);
>> +if ( rc )
>> +printk(XENLOG_ERR "%pd %pp: fail to add MSIX ctrl handler rc=%d\n",
>> +   pdev->domain, &pdev->sbdf, rc);
> 
> Here as well as for MSI: Wouldn't this better be limited to the init-failure
> case? No point in adding a register hook (and possibly emitting a misleading
> log message) when we're tearing down anyway. IOW I think the ->cleanup()
> hook needs a boolean parameter, unless the distinction of the two cases can
> be (reliably) inferred from some other property.
To make these changes, can I add a new patch as the last patch of this series?
And the new patch will do:
1. add a boolean parameter for cleanup hook to separate whose calls 
cleanup(during initialization or during deassign device).
2. call all cleanup hooks in vpci_deassign_device().
3. remove the MSI/MSIX specific free actions in vpci_deassign_device().

> 
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -321,6 +321,27 @@ void vpci_deassign_device(struct pci_dev *pdev)
>>  &pdev->domain->vpci_dev_assigned_map);
>>  #endif
>>  
>> +for ( i = 0; i < NUM_VPCI_INIT; i++ )
>> +{
>> +const vpci_capability_t *capability = &__start_vpci_array[i];
>> +const unsigned int cap = capability->id;
>> +unsigned int pos = 0;
>> +
>> +if ( !capability->is_ext )
>> +pos = pci_find_cap_offset(pdev->sbdf, cap);
>> +else if ( is_hardware_domain(pdev->domain) )
>> +pos = pci_find_ext_capability(pdev->sbdf, cap);
>> +
>> +if ( pos && capability->cleanup )
>> +{
>> +int rc = capability->cleanup(pdev);
>> +if ( rc )
>> +printk(XENLOG_ERR "%pd %pp: clean %s cap %u fail rc=%d\n",
>> +   pdev->domain, &pdev->sbdf,
>> +   capability->is_ext ? "extended" : "legacy", cap, rc);
>> +}
>> +}
> 
> With this imo the patch subject is now wrong, too.
> 
> Jan

-- 
Best regards,
Jiqian Chen.



Re: [PATCH v2] systemd: Add hooks to stop/start xen-watchdog on suspend/resume

2025-08-05 Thread Andrew Cooper
On 17/07/2025 9:16 pm, Mykola Kvach wrote:
>  config/Tools.mk.in|  1 +
>  m4/systemd.m4 | 14 
>  tools/hotplug/Linux/systemd/Makefile  |  8 -
>  .../Linux/systemd/xen-watchdog-sleep.sh   | 34 +++

This has been committed, but it drops the file:

  /usr/lib/systemd/system-sleep/xen-watchdog-sleep.sh

into a directory which more normally contains:

$ file /usr/lib/systemd/system-sleep/*
/usr/lib/systemd/system-sleep/hdparm:  POSIX shell script, ASCII 
text executable
/usr/lib/systemd/system-sleep/nvidia:  POSIX shell script, ASCII 
text executable
/usr/lib/systemd/system-sleep/sysstat.sleep:   POSIX shell script, ASCII 
text executable
/usr/lib/systemd/system-sleep/tlp: POSIX shell script, ASCII 
text executable
/usr/lib/systemd/system-sleep/unattended-upgrades: POSIX shell script, ASCII 
text executable


I'd suggest renaming it to xen-watchdog (or perhaps xen-watchdog.sleep).

~Andrew



[PATCH 2/2] efi: Stop using StdErr

2025-08-05 Thread Ross Lagerwall
Xen's use of StdErr is inconsistent. Some boot errors are reported using
PrintErr() which uses StdErr and some are reported using blexit() which
uses StdOut.

On my test system using OVMF, StdErr is not displayed on the emulated
screen. Looking at other EFI applications, StdErr is just used for debug
messages if at all.

Therefore, switch all boot output to use StdOut.

Signed-off-by: Ross Lagerwall 
---
 xen/common/efi/boot.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/xen/common/efi/boot.c b/xen/common/efi/boot.c
index 50ff1d1bd225..6ba486943466 100644
--- a/xen/common/efi/boot.c
+++ b/xen/common/efi/boot.c
@@ -153,7 +153,6 @@ static UINT32 __initdata efi_bs_revision;
 static EFI_HANDLE __initdata efi_ih;
 
 static SIMPLE_TEXT_OUTPUT_INTERFACE *__initdata StdOut;
-static SIMPLE_TEXT_OUTPUT_INTERFACE *__initdata StdErr;
 
 static UINT32 __initdata mdesc_ver;
 static bool __initdata map_bs;
@@ -168,11 +167,7 @@ static void __init PrintStr(const CHAR16 *s)
 {
 StdOut->OutputString(StdOut, (CHAR16 *)s );
 }
-
-static void __init PrintErr(const CHAR16 *s)
-{
-StdErr->OutputString(StdErr, (CHAR16 *)s );
-}
+#define PrintErr PrintStr
 
 static CHAR16 *__init FormatDec(UINT64 Val, CHAR16 *Buffer)
 {
@@ -287,7 +282,6 @@ static bool __init match_guid(const EFI_GUID *guid1, const 
EFI_GUID *guid2)
 /* generic routine for printing error messages */
 static void __init noreturn PrintErrMesg(const CHAR16 *mesg, EFI_STATUS 
ErrCode)
 {
-StdOut = StdErr;
 PrintErr(mesg);
 PrintErr(L": ");
 
@@ -914,7 +908,6 @@ static void __init efi_init(EFI_HANDLE ImageHandle, 
EFI_SYSTEM_TABLE *SystemTabl
 efi_fw_revision = SystemTable->FirmwareRevision;
 
 StdOut = SystemTable->ConOut;
-StdErr = SystemTable->StdErr ?: StdOut;
 }
 
 static void __init efi_console_set_mode(void)
-- 
2.50.1




[PATCH 1/2] efi: Call FreePages only if needed

2025-08-05 Thread Ross Lagerwall
If the config file is builtin, cfg.addr will be zero but Xen
unconditionally calls FreePages() on the address.

Xen may also call FreePages() with a zero address if blexit() is called
after this point since cfg.need_to_free is not set to false.

The UEFI specification does not say whether calling FreePages() with a
zero address is allowed so let's be cautious and use cfg.need_to_free
properly.

Signed-off-by: Ross Lagerwall 
---
 xen/common/efi/boot.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/xen/common/efi/boot.c b/xen/common/efi/boot.c
index 778a39cc48e6..50ff1d1bd225 100644
--- a/xen/common/efi/boot.c
+++ b/xen/common/efi/boot.c
@@ -1534,8 +1534,11 @@ void EFIAPI __init noreturn efi_start(EFI_HANDLE 
ImageHandle,
 
 efi_arch_cfg_file_late(loaded_image, dir_handle, section.s);
 
-efi_bs->FreePages(cfg.addr, PFN_UP(cfg.size));
-cfg.addr = 0;
+if ( cfg.need_to_free )
+{
+efi_bs->FreePages(cfg.addr, PFN_UP(cfg.size));
+cfg.need_to_free = false;
+}
 
 if ( dir_handle )
 dir_handle->Close(dir_handle);
-- 
2.50.1




[PATCH 0/2] Misc EFI boot fixes

2025-08-05 Thread Ross Lagerwall
Fixes for a couple of annoyances when booting Xen using the native EFI
entry point.

Ross Lagerwall (2):
  efi: Call FreePages only if needed
  efi: Stop using StdErr

 xen/common/efi/boot.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

-- 
2.50.1




Re: [XEN][PATCH] xen/dom0less: arm: fix hwdom 1:1 low memory allocation

2025-08-05 Thread Jason Andryuk

On 2025-08-01 11:54, Grygorii Strashko wrote:

From: Grygorii Strashko 

Call stack for dom0less hwdom case (1:1) memory:
create_domUs
|-construct_domU
   |-construct_hwdom()
 |-allocate_memory_11()

And allocate_memory_11() uses "dom0_mem" as:
min_low_order =
   get_order_from_bytes(min_t(paddr_t, dom0_mem, MB(128)));

In case of dom0less boot the "dom0_mem" is not used and defaulted to 0,
which causes min_low_order to get high value > order and so no allocations
happens from low memory.

Fix it, by using kinfo->unassigned_mem instead of "dom0_mem" has correct
memory size in both cases: regular dom0 boot and dom0less boot.

Fixes: 43afe6f030244 ("xen/common: dom0less: introduce common dom0less-build.c")


I think I introduced this bug with the dom0less hwdom support, and the 
correct fixes is:


Fixes: 52cb53f1816a ("xen/arm: dom0less hwdom construction")


Signed-off-by: Grygorii Strashko 


dom0_mem is also mentioned in the comment on line 252.  With that 
changed as well:


Reviewed-by: Jason Andryuk 

Thanks,
Jason



[PATCH v2 2/3] xen/ppc: irq: drop unreachable pirq callbacks

2025-08-05 Thread Grygorii Strashko
From: Grygorii Strashko 

Since commit 341f271cf86f ("xen/evtchn: fully restrict concept of pIRQ for
arches with pIRQ support only"), the corresponding PPC arch pIRQ callbacks
become unreachable, so drop them.

Signed-off-by: Grygorii Strashko 
Acked-by: Andrew Cooper 
Reviewed-by: Denis Mukhin 
---
 xen/arch/ppc/stubs.c | 20 
 1 file changed, 20 deletions(-)

diff --git a/xen/arch/ppc/stubs.c b/xen/arch/ppc/stubs.c
index 671e71aa0a60..bdaf474c5cc0 100644
--- a/xen/arch/ppc/stubs.c
+++ b/xen/arch/ppc/stubs.c
@@ -103,26 +103,6 @@ void smp_send_call_function_mask(const cpumask_t *mask)
 
 /* irq.c */
 
-struct pirq *alloc_pirq_struct(struct domain *d)
-{
-BUG_ON("unimplemented");
-}
-
-int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
-{
-BUG_ON("unimplemented");
-}
-
-void pirq_guest_unbind(struct domain *d, struct pirq *pirq)
-{
-BUG_ON("unimplemented");
-}
-
-void pirq_set_affinity(struct domain *d, int pirq, const cpumask_t *mask)
-{
-BUG_ON("unimplemented");
-}
-
 void irq_ack_none(struct irq_desc *desc)
 {
 BUG_ON("unimplemented");
-- 
2.34.1



Re: [XEN][PATCH] xen/dom0less: arm: fix hwdom 1:1 low memory allocation

2025-08-05 Thread Jason Andryuk

On 2025-08-05 14:38, Grygorii Strashko wrote:

Hi Jason,

On 05.08.25 20:21, Jason Andryuk wrote:

On 2025-08-01 11:54, Grygorii Strashko wrote:

From: Grygorii Strashko 

Call stack for dom0less hwdom case (1:1) memory:
create_domUs
|-construct_domU
   |-construct_hwdom()
 |-allocate_memory_11()

And allocate_memory_11() uses "dom0_mem" as:
min_low_order =
   get_order_from_bytes(min_t(paddr_t, dom0_mem, MB(128)));

In case of dom0less boot the "dom0_mem" is not used and defaulted to 0,
which causes min_low_order to get high value > order and so no 
allocations

happens from low memory.

Fix it, by using kinfo->unassigned_mem instead of "dom0_mem" has correct
memory size in both cases: regular dom0 boot and dom0less boot.

Fixes: 43afe6f030244 ("xen/common: dom0less: introduce common 
dom0less-build.c")


I think I introduced this bug with the dom0less hwdom support, and the 
correct fixes is:


Fixes: 52cb53f1816a ("xen/arm: dom0less hwdom construction")


Signed-off-by: Grygorii Strashko 


dom0_mem is also mentioned in the comment on line 252.  With that 
changed as well:


Reviewed-by: Jason Andryuk 



Will smth like below be ok?

   * We first allocate the largest allocation we can as low as we
   * can. This then becomes the first bank. This bank must be at least
- * 128MB (or dom0_mem if that is smaller).
+ * 128MB (or memory size requested for domain if that is smaller).


LGTM - Thank you.

-Jason



[PATCH v2 1/3] xen/arm: irq: drop unreachable pirq callbacks

2025-08-05 Thread Grygorii Strashko
From: Grygorii Strashko 

Since commit 341f271cf86f ("xen/evtchn: fully restrict concept of pIRQ for
arches with pIRQ support only"), the corresponding Arm arch pIRQ callbacks
become unreachable, so drop them.

Signed-off-by: Grygorii Strashko 
Acked-by: Andrew Cooper 
Reviewed-by: Denis Mukhin 
---
 xen/arch/arm/irq.c | 29 -
 1 file changed, 29 deletions(-)

diff --git a/xen/arch/arm/irq.c b/xen/arch/arm/irq.c
index 03fbb90c6c43..4bbf0b0664df 100644
--- a/xen/arch/arm/irq.c
+++ b/xen/arch/arm/irq.c
@@ -595,35 +595,6 @@ unlock:
 return ret;
 }
 
-/*
- * pirq event channels. We don't use these on ARM, instead we use the
- * features of the GIC to inject virtualised normal interrupts.
- */
-struct pirq *alloc_pirq_struct(struct domain *d)
-{
-return NULL;
-}
-
-/*
- * These are all unreachable given an alloc_pirq_struct
- * which returns NULL, all callers try to lookup struct pirq first
- * which will fail.
- */
-int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
-{
-BUG();
-}
-
-void pirq_guest_unbind(struct domain *d, struct pirq *pirq)
-{
-BUG();
-}
-
-void pirq_set_affinity(struct domain *d, int pirq, const cpumask_t *mask)
-{
-BUG();
-}
-
 static bool irq_validate_new_type(unsigned int curr, unsigned int new)
 {
 return (curr == IRQ_TYPE_INVALID || curr == new );
-- 
2.34.1



[PATCH v2 3/3] xen/riscv: irq: drop unreachable pirq callbacks

2025-08-05 Thread Grygorii Strashko
From: Grygorii Strashko 

Since commit 341f271cf86f ("xen/evtchn: fully restrict concept of pIRQ for
arches with pIRQ support only"), the corresponding RISCV arch pIRQ
callbacks become unreachable, so drop them.

Signed-off-by: Grygorii Strashko 
Reviewed-by: Oleksii Kurochko 
Acked-by: Andrew Cooper 
Reviewed-by: Denis Mukhin 
---
 xen/arch/riscv/stubs.c | 20 
 1 file changed, 20 deletions(-)

diff --git a/xen/arch/riscv/stubs.c b/xen/arch/riscv/stubs.c
index 8918cebf35c6..1a8c86cd8da2 100644
--- a/xen/arch/riscv/stubs.c
+++ b/xen/arch/riscv/stubs.c
@@ -82,26 +82,6 @@ void smp_send_call_function_mask(const cpumask_t *mask)
 
 /* irq.c */
 
-struct pirq *alloc_pirq_struct(struct domain *d)
-{
-BUG_ON("unimplemented");
-}
-
-int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
-{
-BUG_ON("unimplemented");
-}
-
-void pirq_guest_unbind(struct domain *d, struct pirq *pirq)
-{
-BUG_ON("unimplemented");
-}
-
-void pirq_set_affinity(struct domain *d, int pirq, const cpumask_t *mask)
-{
-BUG_ON("unimplemented");
-}
-
 void irq_ack_none(struct irq_desc *desc)
 {
 BUG_ON("unimplemented");
-- 
2.34.1



[PATCH v2 0/3] xen/arch: irq: drop unreachable pirq callbacks

2025-08-05 Thread Grygorii Strashko
From: Grygorii Strashko 

Hence prerequisite patch was merged [1] arch specific pIRQ callback can now
be dropped for arches without pIRQ support as those callback become unreachable.

[1] commit 341f271cf86f ("xen/evtchn: fully restrict concept of pIRQ for arches 
with pIRQ support only")

Changes in v2:
- reworded commit messages as requested
- added reviewer's tags

Grygorii Strashko (3):
  xen/arm: irq: drop unreachable pirq callbacks
  xen/ppc: irq: drop unreachable pirq callbacks
  xen/riscv: irq: drop unreachable pirq callbacks

 xen/arch/arm/irq.c | 29 -
 xen/arch/ppc/stubs.c   | 20 
 xen/arch/riscv/stubs.c | 20 
 3 files changed, 69 deletions(-)

-- 
2.34.1



Re: [PATCH -next v7 1/7] arm64: ptrace: Replace interrupts_enabled() with regs_irqs_disabled()

2025-08-05 Thread Jinjie Ruan



On 2025/8/5 23:05, Ada Couprie Diaz wrote:
> On 29/07/2025 02:54, Jinjie Ruan wrote:
> 
>> The generic entry code expects architecture code to provide
>> regs_irqs_disabled(regs) function, but arm64 does not have this and
>> provides inerrupts_enabled(regs), which has the opposite polarity.
> Nit: "interrupts_enabled(regs)"

Good catch! thank you for the review.

>> In preparation for moving arm64 over to the generic entry code,
>> relace arm64's interrupts_enabled() with regs_irqs_disabled() and
>> update its callers under arch/arm64.
>>
>> For the moment, a definition of interrupts_enabled() is provided for
>> the GICv3 driver. Once arch/arm implement regs_irqs_disabled(), this
>> can be removed.
>>
>> Delete the fast_interrupts_enabled() macro as it is unused and we
>> don't want any new users to show up.
>>
>> No functional changes.
>>
>> Acked-by: Mark Rutland 
>> Suggested-by: Mark Rutland 
>> Signed-off-by: Jinjie Ruan 
>> ---
> Otherwise looks good to me !
> Reviewed-by: Ada Couprie Diaz 
> 



Re: [PATCH v10 2/4] vpci/rebar: Free Rebar resources when init_rebar() fails

2025-08-05 Thread Chen, Jiqian
On 2025/8/5 16:48, Jan Beulich wrote:
> On 05.08.2025 05:49, Jiqian Chen wrote:
>> --- a/xen/drivers/vpci/rebar.c
>> +++ b/xen/drivers/vpci/rebar.c
>> @@ -49,6 +49,32 @@ static void cf_check rebar_ctrl_write(const struct 
>> pci_dev *pdev,
>>  bar->guest_addr = bar->addr;
>>  }
>>  
>> +static int cf_check cleanup_rebar(const struct pci_dev *pdev)
>> +{
>> +int rc;
>> +uint32_t ctrl;
>> +unsigned int nbars;
>> +unsigned int rebar_offset = pci_find_ext_capability(pdev->sbdf,
>> +
>> PCI_EXT_CAP_ID_REBAR);
>> +
>> +if ( !rebar_offset || !is_hardware_domain(pdev->domain) )
>> +{
>> +ASSERT_UNREACHABLE();
>> +return 0;
>> +}
>> +
>> +ctrl = pci_conf_read32(pdev->sbdf, rebar_offset + PCI_REBAR_CTRL(0));
>> +nbars = MASK_EXTR(ctrl, PCI_REBAR_CTRL_NBAR_MASK);
>> +
>> +rc = vpci_remove_registers(pdev->vpci, rebar_offset + PCI_REBAR_CAP(0),
>> +   PCI_REBAR_CTRL(nbars - 1));
>> +if ( rc )
>> +printk(XENLOG_ERR "%pd %pp: fail to remove Rebar handlers rc=%d\n",
>> +   pdev->domain, &pdev->sbdf, rc);
> 
> MSI and MSI-X (now) have ASSERT_UNREACHABLE() on their respective paths. Is
> there a reason this shouldn't be done here as well?
Will add.

> 
> MSI and MSI-X further have another add-register below here, to ensure the
> control register cannot be written. Again - is there a reason the same
> shouldn't be done here? (If so, I think this may want putting in a comment.)
Since extended capabilities are only exposed to dom0, and dom0 has no 
limitations to access devices.
It since there is not much point in adding such a handler for rebar.

> 
> Jan

-- 
Best regards,
Jiqian Chen.


Re: [PATCH -next v7 2/7] arm64: entry: Refactor the entry and exit for exceptions from EL1

2025-08-05 Thread Jinjie Ruan



On 2025/8/5 23:06, Ada Couprie Diaz wrote:
> Hi,
> 
> On 29/07/2025 02:54, Jinjie Ruan wrote:
> 
>> The generic entry code uses irqentry_state_t to track lockdep and RCU
>> state across exception entry and return. For historical reasons, arm64
>> embeds similar fields within its pt_regs structure.
>>
>> In preparation for moving arm64 over to the generic entry code, pull
>> these fields out of arm64's pt_regs, and use a separate structure,
>> matching the style of the generic entry code.
>>
>> No functional changes.
> As far as I understand and checked, we used the two fields
> in an exclusive fashion, so there is indeed no functional change.
>> Suggested-by: Mark Rutland 
>> Signed-off-by: Jinjie Ruan 
>> ---
>> [...]
>> diff --git a/arch/arm64/kernel/entry-common.c
>> b/arch/arm64/kernel/entry-common.c
>> index 8e798f46ad28..97e0741abde1 100644
>> --- a/arch/arm64/kernel/entry-common.c
>> +++ b/arch/arm64/kernel/entry-common.c
>> [...]
>> @@ -475,73 +497,81 @@ UNHANDLED(el1t, 64, error)
>>   static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
>>   {
>>   unsigned long far = read_sysreg(far_el1);
>> +    arm64_irqentry_state_t state;
>>   -    enter_from_kernel_mode(regs);
>> +    state = enter_from_kernel_mode(regs);
> Nit: There is some inconsistencies with some functions splitting state's
> definition
> and declaration (like el1_abort here), while some others do it on the
> same line
> (el1_undef() below for example).
> In some cases it is welcome as the entry function is called after some
> other work,
> but here for example it doesn't seem to be beneficial ?

Both methods can keep the modifications to `enter_from_kernel_mode()` on
the same line as the original code, which will facilitate code review.

I think it is also fine to do it on the same line here, which can reduce
one line code, which method is better may be a matter of personal opinion.

>>   local_daif_inherit(regs);
>>   do_mem_abort(far, esr, regs);
>>   local_daif_mask();
>> -    exit_to_kernel_mode(regs);
>> +    exit_to_kernel_mode(regs, state);
>>   }
>>     static void noinstr el1_pc(struct pt_regs *regs, unsigned long esr)
>>   {
>>   unsigned long far = read_sysreg(far_el1);
>> +    arm64_irqentry_state_t state;
>>   -    enter_from_kernel_mode(regs);
>> +    state = enter_from_kernel_mode(regs);
>>   local_daif_inherit(regs);
>>   do_sp_pc_abort(far, esr, regs);
>>   local_daif_mask();
>> -    exit_to_kernel_mode(regs);
>> +    exit_to_kernel_mode(regs, state);
>>   }
>>     static void noinstr el1_undef(struct pt_regs *regs, unsigned long
>> esr)
>>   {
>> -    enter_from_kernel_mode(regs);
>> +    arm64_irqentry_state_t state = enter_from_kernel_mode(regs);
>> +
>>   local_daif_inherit(regs);
>>   do_el1_undef(regs, esr);
>>   local_daif_mask();
>> -    exit_to_kernel_mode(regs);
>> +    exit_to_kernel_mode(regs, state);
>>   }
>>
>> [...]
> Other than the small nit:
> Reviewed-by: Ada Couprie Diaz 
> 
> 



Consider changing CONFIG_ACPI default on ARM? (was: Re: Xen bootup: issue with Raspberry Pi 5?)

2025-08-05 Thread Elliott Mitchell
Sigh, resending as I lost some of the intended Cc targets when originally
creating the message.  Sorry about the duplication for people who have
already seen, but I thought this might be worthy of wider discussion.



I would like to draw the attention of a few people on xen-devel to the
thread which occured on xen-users recently and quoted below:

https://lists.xenproject.org/archives/html/xen-users/2025-07/msg1.html

On Tue, Jul 01, 2025 at 10:01:13PM +0200, Paul Leiber wrote:
> 
> Unfortunately, I don't have a direct answer to the question (as is so often
> the case, due to my limited knowledge and experience). However, I am
> successfully running Xen on a RPi 4 (mostly, except for some VLAN related
> networking issues).
> 
> I used instructions in [1] to install vanilla Debian on the RPi, including
> UEFI boot and grub. I then compiled Xen with expert options and ACPI
> enabled.
> 
> I don't know if there are better solutions. For example, I suffer from the
> fact that I2C doesn't work when using UEFI boot on a RPi. Nowadays, Debian
> provides their own vanilla Debian images for RPi and with working I2C, but
> these images are using a different boot method that I didn't know how to use
> with Xen.  So far, the procedure described above seems to be the easiest
> solution for me.


> [1] https://forums.raspberrypi.com/viewtopic.php?t=282839
> 
> Am 30.06.2025 um 12:35 schrieb Sumit Semwal:
> > 
> > I've just begun to experiment with the Raspberry Pi 5, trying to run a
> > simple xen + Dom0 setup, using uBoot, and the bookworm based Rpi
> > distro.
> > 
> > I've tried combinations of the following setup:
> > 
> > 1. prebuilt Rpi5 kernel + dtbs, and have also tried to build them from
> > source [1]
> > 2. Xen from upstream [2] and xen-troops [3]
> > 3. upstream uBoot from [4]
> > 
> > but with the same result: [short log below; I can provide a fuller log
> > if needed]
> > 
> > (XEN) DT: ** translation for device /axi/msi-controller@100013 **
> > (XEN) DT: bus is default (na=2, ns=2) on /axi
> > (XEN) DT: translating address:<3> 00ff<3> f000<3>
> > (XEN) DT: parent bus is default (na=2, ns=1) on /
> > (XEN) DT: walking ranges...
> > (XEN) DT: default map, cp=0, s=10, da=fff000
> > (XEN) DT: default map, cp=10, s=1, da=fff000
> > (XEN) DT: default map, cp=14, s=4, da=fff000
> > (XEN) DT: default map, cp=18, s=4, da=fff000
> > (XEN) DT: default map, cp=1c, s=4, da=fff000
> > (XEN) DT: not found !
> > (XEN) Unable to retrieve address 1 for /axi/msi-controller@100013
> > (XEN) Device tree generation failed (-22).
> > (XEN) debugtrace_dump() global buffer starting
> > 1 cpupool_create(pool=0,sched=6)
> > 2 Created cpupool 0 with scheduler SMP Credit Scheduler rev2 (credit2)
> > 3 cpupool_add_domain(dom=0,pool=0) n_dom 1 rc 0
> > (XEN) wrap: 0
> > (XEN) debugtrace_dump() global buffer finished
> > (XEN)
> > (XEN) 
> > (XEN) Panic on CPU 0:
> > (XEN) Could not set up DOM0 guest OS (rc = -22)
> > (XEN) 
> > 
> > 
> > I'm certain I'm missing something, but before I delve deeper, I just
> > wanted to ask if this is a known issue, and if so, are there any
> > workarounds or solutions available for this?
> > 
> > Any help about this is highly appreciated!
> > 
> > Thanks and Best regards,
> > Sumit.
> > 
> > [1]:  https://github.com/raspberrypi/linux rpi-6.12.y branch
> > [2]: git://xenbits.xen.org/xen.git - main branch
> > [3] xen-troops https://github.com/xen-troops/xen - rpi5_dev branch
> > [4]: https://github.com/u-boot/u-boot.git master branch

Ultimately Debian is choosing to leave most defaults alone.  So far the
Xen developers have left CONFIG_ACPI defaulting to off on ARM*.  The
Debian project doesn't have paid people to support Raspberry PI hardware,
despite being rather common.  As a result there aren't any official
Raspberry PI images, but people associated with Tianocore have gotten
generic images to boot on Raspberry PI hardware.

I'm unsure of the likelihood of getting the Debian maintainers to
override the default.  Yet due being by far the simplest way to install
Debian and Xen on a very common ARM64 platform, perhaps the Xen
developers should consider changing?


-- 
(\___(\___(\__  --=> 8-) EHM <=--  __/)___/)___/)
 \BS (| ehem+sig...@m5p.com  PGP 87145445 |)   /
  \_CS\   |  _  -O #include  O-   _  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445