Re: Limitations for Running Xen on KVM Arm64

2025-11-05 Thread haseeb.ash...@siemens.com
Hi,

I have sent out a patch using IPAS2E1IS. The R version, RIPAS2E1IS, would only 
help if we had to invalidate more than one page at a time, which is not possible 
unless a batch version of the hypercall is implemented, because otherwise only 
one page is removed per hypercall. With IPAS2E1IS the number of invocations is 
still the same as with VMALLS12E1IS, but each invocation is much cheaper. With 
Ftrace I got:
handle_ipas2e1is: min-max: 17.580 - 68.260 us.

Thanks again for your great suggestions. Please review my patch; you should 
have received an email.

Regards,
Haseeb


Limitations for Running Xen on KVM Arm64

2025-10-30 Thread haseeb.ash...@siemens.com
Hello Xen development community,

I wanted to discuss the limitations that I have faced while running Xen on KVM 
on Arm64 machines. I hope I am using the right mailing list.

The biggest limitation is the costly emulation of the instruction tlbi 
vmalls12e1is in KVM. The cost grows exponentially with the IPA size exposed by 
KVM for the VM hosting Xen: going from a 40-bit to a 48-bit IPA makes it 
2^8 = 256x more expensive. If I reduce the IPA size to 40 bits in KVM, the issue 
is barely observable, but with a 48-bit IPA it becomes crippling. Xen uses this 
instruction very frequently, the instruction is trapped and emulated by KVM, and 
performance is nowhere near bare-metal hardware. With a 48-bit IPA, domU 
creation can take up to 200 minutes for a guest with just 128M of RAM. I have 
identified two places in Xen that are problematic w.r.t. the use of this 
instruction, and I hope to either reduce its frequency or use a more targeted 
TLBI instruction instead of invalidating all stage-1 and stage-2 translations.
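
For context (an aside, not from the original report): the IPA size is fixed by 
the VMM when the VM is created. A minimal sketch, assuming the standard KVM 
UAPI (KVM_CAP_ARM_VM_IPA_SIZE and KVM_VM_TYPE_ARM_IPA_SIZE() from 
<linux/kvm.h>) and omitting error handling, of how a userspace VMM could cap 
the guest IPA space at 40 bits:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch only: create a VM whose guest physical address space is capped at
 * 40 bits (or the platform maximum, if that is smaller). */
int create_vm_with_small_ipa(void)
{
    int kvm = open("/dev/kvm", O_RDWR);
    int max_ipa = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_ARM_VM_IPA_SIZE);
    int ipa_bits = (max_ipa >= 40) ? 40 : max_ipa;

    /* On arm64 the requested IPA size is encoded in the machine type. */
    return ioctl(kvm, KVM_CREATE_VM, KVM_VM_TYPE_ARM_IPA_SIZE(ipa_bits));
}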


1. During the creation of a domU, the domU memory is first mapped into dom0, 
the images are copied into it, and it is then unmapped. During unmapping, the 
TLB translations are invalidated one by one for each page being unmapped in the 
XENMEM_remove_from_physmap hypercall. Here is the code snippet where the 
decision to flush the TLBs is made during removal of a mapping:

diff --git a/xen/arch/arm/mmu/p2m.c b/xen/arch/arm/mmu/p2m.c
index 7642dbc7c5..e96ff92314 100644
--- a/xen/arch/arm/mmu/p2m.c
+++ b/xen/arch/arm/mmu/p2m.c
@@ -1103,7 +1103,8 @@ static int __p2m_set_entry(struct p2m_domain *p2m,
 
     if ( removing_mapping )
         /* Flush can be deferred if the entry is removed */
-        p2m->need_flush |= !!lpae_is_valid(orig_pte);
+        //p2m->need_flush |= !!lpae_is_valid(orig_pte);
+        p2m->need_flush |= false;
     else
     {
         lpae_t pte = mfn_to_p2m_entry(smfn, t, a);

This can be optimized either by introducing a batch version of this hypercall, 
i.e. XENMEM_remove_from_physmap_batch, and flushing the TLBs only once for all 
pages being removed, or by using a TLBI instruction that only invalidates the 
intended range of addresses instead of all stage-1 and stage-2 translations. I 
understand that no single TLBI instruction performs both stage-1 and stage-2 
invalidation for a given address range, but maybe a combination of instructions 
can be used, such as the sequence below (see also the helper sketch that 
follows it):

; switch to current VMID
tlbi rvae1, guest_vaddr    ; first invalidate stage-1 TLB by guest VA for current VMID
tlbi ripas2e1, guest_paddr ; then invalidate stage-2 TLB by IPA range for current VMID
dsb ish
isb
; switch back the VMID
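
A minimal sketch of what a non-range, Xen-style helper for the stage-2 part 
could look like (hypothetical code, assuming the usual encoding where IPAS2E1IS 
takes the IPA shifted right by 12 bits; barriers are intentionally left to the 
caller so the helper can be used in a loop):

/* Sketch: invalidate the stage-2 TLB entry for one IPA in the current VMID.
 * The caller is expected to wrap a loop of these with dsb()/isb(). */
static inline void flush_guest_tlb_one_s2(paddr_t ipa)
{
    asm volatile("tlbi ipas2e1is, %0;" : : "r" (ipa >> PAGE_SHIFT) : "memory");
}
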

This is the part I am not quite sure about, and I was hoping that someone with 
Arm expertise could sign off on it so that I can work on its implementation in 
Xen. This would be an optimization not only for virtualized hardware but also 
in general for Xen on arm64 machines.


2. The second place in Xen where this is problematic is when multiple vCPUs of 
the same domain are juggled on a single pCPU: the TLBs are invalidated every 
time a different vCPU runs on a pCPU. I do not know how this can be optimized. 
Any support on this is appreciated.

diff --git a/xen/arch/arm/mmu/p2m.c b/xen/arch/arm/mmu/p2m.c
index 7642dbc7c5..e96ff92314 100644
--- a/xen/arch/arm/mmu/p2m.c
+++ b/xen/arch/arm/mmu/p2m.c
@@ -247,7 +247,7 @@ void p2m_restore_state(struct vcpu *n)
      * when running multiple vCPU of the same domain on a single pCPU.
      */
     if ( *last_vcpu_ran != INVALID_VCPU_ID && *last_vcpu_ran != n->vcpu_id )
-        flush_guest_tlb_local();
+        ; // flush_guest_tlb_local();
 
     *last_vcpu_ran = n->vcpu_id;
 }

Thanks & Regards,
Haseeb Ashraf


Re: Limitations for Running Xen on KVM Arm64

2025-11-03 Thread haseeb.ash...@siemens.com
Hi,

> To clarify, Xen is using the local TLB version. So it should be vmalls12e1.
If I understood correctly, won't HCR_EL2.FB make a local TLB invalidation a 
broadcast one?

Mohamed mentioned this in an earlier email:
> If a core-local TLB invalidate was issued, this bit forces it to become a 
> broadcast, so that you don’t have to worry about flushing TLBs when moving a 
> vCPU between different pCPUs. KVM operates with this bit set.

Can you explain in exactly what scenario we can use vmalle1?

> Before going into batching, do you have any data showing how often 
> XENMEM_remove_from_physmap is called in your setup? Similarly, I would be 
> interested to know the number of TLB flushes within one hypercall and whether 
> the regions unmapped were contiguous.
The number of times XENMEM_remove_from_physmap is invoked depends on the size 
of each binary, and each hypercall issues the TLBI instruction once. If I use a 
persistent rootfs, the hypercall is invoked about 7458 times (+8 approx.), 
which equals the total number of kernel and DTB image pages:
domainbuilder: detail: xc_dom_alloc_segment:   kernel       : 0x40000000 -> 
0x41d1f200  (pfn 0x40000 + 0x1d20 pages)
domainbuilder: detail: xc_dom_alloc_segment:   devicetree   : 0x48000000 -> 
0x4800188d  (pfn 0x48000 + 0x2 pages)

And if I use a ramdisk image, the hypercall is invoked about 222815 times 
(+8 approx.), which equals the total number of 4K pages of the kernel, ramdisk 
and DTB images:
domainbuilder: detail: xc_dom_alloc_segment:   kernel       : 0x40000000 -> 
0x41d1f200  (pfn 0x40000 + 0x1d20 pages)
domainbuilder: detail: xc_dom_alloc_segment:   module0      : 0x48000000 -> 
0x7c93d000  (pfn 0x48000 + 0x3493d pages)
domainbuilder: detail: xc_dom_alloc_segment:   devicetree   : 0x7c93d000 -> 
0x7c93e8d9  (pfn 0x7c93d + 0x2 pages)
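
(For reference, the page counts add up: 0x1d20 + 0x2 = 7456 + 2 = 7458 for the 
persistent-rootfs case, and 0x1d20 + 0x3493d + 0x2 = 7456 + 215357 + 2 = 222815 
with the ramdisk, matching the hypercall counts above.)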

You can see the address ranges in the logs above; the addresses are contiguous 
in this address space, so at best we could reduce the number of flushes to 3, 
one at the end of each image when it is removed from the physmap.

> we may still send a few TLBs because:
> * We need to avoid long-running operations, so the hypercall may restart. So 
> we will have to flush at minimum before every restart
> * The current way we handle batching is we will process one item at a time. 
> As this may free memory (either leaf or intermediate page-tables), we will 
> need to flush the TLBs first to prevent the domain accessing the wrong 
> memory. This could be solved by keeping track of the list of memory to free. 
> But this is going to require some work and I am not entirely sure this is 
> worth it at the moment.
I think the numbers above show that 222815 TLB flushes are far too many and 
that even a few flushes would be a lot better; fewer than 10 flushes are hardly 
noticeable. A rough sketch of the batched flow is below.
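
To illustrate the shape this could take (a rough sketch only, reusing existing 
Xen primitives such as hypercall_preempt_check() and p2m_force_tlb_flush_sync(); 
the per-page unmap helper name is hypothetical):

    /* Sketch of the batched removal loop: flush once for the whole batch
     * (and before any restart point), instead of once per page. */
    for ( i = 0; i < nr_gfns; i++ )
    {
        rc = remove_one_mapping(d, gfns[i]);     /* hypothetical helper */
        if ( rc )
            break;

        if ( hypercall_preempt_check() && (i + 1) < nr_gfns )
        {
            p2m_force_tlb_flush_sync(p2m);       /* flush before restarting */
            /* ...arrange for the hypercall to be continued later... */
            break;
        }
    }
    p2m_force_tlb_flush_sync(p2m);               /* single flush for the batch */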

> We could use a series of TLBI IPAS2E1IS which I think is what TLBI range is 
> meant to replace (so long as the addresses are contiguous in the given space).
Isn't IPAS2E1IS a range TLBI instruction? My understanding is that it is only 
available on processors with range TLBI support, but I could be wrong. I saw 
its KVM emulation, which falls back to a full invalidation if range TLBI is not 
supported 
(https://github.com/torvalds/linux/blob/master/arch/arm64/kvm/hyp/pgtable.c#L647).

> On the KVM side, it would be worth looking at whether the implementation can 
> be optimized. Is this really walking block by block? Can it skip over large 
> holes (e.g. if we know a level 1 entry doesn't exist, then we can increment by 
> 1GB).
Yes, this should also be looked at from the KVM side. I think that to solve 
this problem we need optimizations in both places, Xen and KVM: Xen is invoking 
this instruction far too many times, and unless KVM can provide performance 
close to a bare-metal tlbi, this will remain a problem. A rough illustration of 
the hole-skipping idea is sketched below.
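
As a rough illustration of that skipping idea (not actual KVM code; the 
entry-lookup helpers are hypothetical stand-ins for the real page-table 
walker):

#include <stdbool.h>
#include <stdint.h>

#define SZ_4K  (1ULL << 12)
#define SZ_2M  (1ULL << 21)
#define SZ_1G  (1ULL << 30)

/* Hypothetical helpers standing in for the real stage-2 walker. */
extern bool l1_entry_present(uint64_t ipa);
extern bool l2_entry_present(uint64_t ipa);
extern void invalidate_leaf(uint64_t ipa);

/* Sketch: walk [ipa, end) but jump over whole 1GB/2MB holes as soon as an
 * upper-level entry is found to be absent, instead of visiting every 4K slot. */
static void walk_and_invalidate(uint64_t ipa, uint64_t end)
{
    while (ipa < end) {
        if (!l1_entry_present(ipa))
            ipa = (ipa + SZ_1G) & ~(SZ_1G - 1);   /* skip to next 1GB boundary */
        else if (!l2_entry_present(ipa))
            ipa = (ipa + SZ_2M) & ~(SZ_2M - 1);   /* skip to next 2MB boundary */
        else {
            invalidate_leaf(ipa);
            ipa += SZ_4K;
        }
    }
}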

Regards,
Haseeb


Re: Limitations for Running Xen on KVM Arm64

2025-11-03 Thread haseeb.ash...@siemens.com
Hi,

> Does this mean only one ioctl call will be issued per blob?
Yes, one ioctl (IOCTL_PRIVCMD_MMAPBATCH_V2) is issued to add all pages to the 
physmap, and then all pages are removed from the physmap as a result of 
munmap().
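
Roughly what the dom0 userspace side does, for readers unfamiliar with the flow 
(a simplified sketch; the struct and ioctl names follow Linux's 
include/uapi/xen/privcmd.h, header paths and exact types are assumptions, error 
handling is omitted, and in practice the toolstack goes through 
libxenforeignmemory/libxc rather than raw ioctls):

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <xen/privcmd.h>   /* struct privcmd_mmapbatch_v2 (assumed path) */

/* Sketch: map `num` guest frames of domain `dom` into our address space with
 * a single ioctl, copy an image into them, then munmap(). The munmap() is
 * what ends up issuing one XENMEM_remove_from_physmap per page in Xen. */
static void copy_blob_into_guest(int privcmd_fd, uint16_t dom,
                                 const uint64_t *gfns, int *errs,
                                 unsigned int num,
                                 const void *blob, size_t len)
{
    size_t map_len = (size_t)num << 12;        /* num 4K pages */
    void *va = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                    MAP_SHARED, privcmd_fd, 0);

    struct privcmd_mmapbatch_v2 batch = {
        .num  = num,
        .dom  = dom,
        .addr = (uint64_t)(uintptr_t)va,
        .arr  = (const xen_pfn_t *)gfns,       /* guest frame numbers to map */
        .err  = errs,                          /* per-page error codes */
    };
    ioctl(privcmd_fd, IOCTL_PRIVCMD_MMAPBATCH_V2, &batch);  /* one ioctl call */

    memcpy(va, blob, len);                     /* copy the kernel/initrd blob */
    munmap(va, map_len);                       /* unmap -> per-page removal */
}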

> At least to me, it feels like switching to TLBI range (or a series of 
> IPAS2E1IS) is an easier win. But if you feel like doing the larger rework, I 
> would be happy to have a look to check whether it would be an acceptable 
> change for upstream.
Thank you. Yes, I agree. I just wanted a solution that also works for older 
CPUs. A series of IPAS2E1IS can work for older CPUs, but there will be a lot of 
invocations (222815 4K invalidations, using the same example). Still, each 
invocation is much cheaper than VMALLS12E1IS, so it seems like a viable 
solution. I shall evaluate this and let you know.

> IPAS2E1IS only allows you to invalidate one address at a time and is 
> available on all processors. The R version is only available when the 
> processor supports TLBI range and allows you to invalidate multiple contiguous 
> addresses.
Thanks, got it.

Regards,
Haseeb



Re: Limitations for Running Xen on KVM Arm64

2025-10-30 Thread haseeb.ash...@siemens.com
Adding @[email protected] and replying to the questions he asked 
over #XenDevel:matrix.org.

> can you add some details why the implementation cannot be optimized in KVM? 
> Asking because I have never seen such issue when running Xen on QEMU (without 
> nested virt enabled).
AFAIK, when Xen is run on QEMU without virtualization, the instructions are 
emulated in QEMU, while with KVM they ideally run directly on hardware except 
in some special cases (those trapped by FGT/CGT), such as this one, where KVM 
maintains shadow page tables for each VM. It traps these instructions and 
emulates them with callbacks such as handle_vmalls12e1is(). The way this 
callback is implemented, it has to iterate over the whole address space and 
clean up the page tables, which is a costly operation. Regardless of this, it 
should still be optimized in Xen, as invalidating a selective range would be 
much better than invalidating the whole 48-bit address space.

> Some details about your platform and use case would be helpful. I am 
> interested to know whether you are using all the features for nested virt.
I am using AWS G4. My use case is to run Xen as a guest hypervisor. Yes, most 
of the features are enabled except VHE or those which are disabled by KVM.

Regards,
Haseeb Ashraf

From: Ashraf, Haseeb (DI SW EDA HAV SLS EPS RTOS LIN)
Sent: Thursday, October 30, 2025 11:12 AM
To: [email protected] 
Subject: Limitations for Running Xen on KVM Arm64


Re: Limitations for Running Xen on KVM Arm64

2025-10-31 Thread haseeb.ash...@siemens.com
Hello,

Thanks for your reply.

> You mean Graviton4 (for reference to others, from a bare metal instance)? 
> Interesting to see people caring about nested virt there :) - and hopefully 
> using it wasn't too much of a pain for you to deal with.
Yes, I am using Graviton4 (r8g.metal-24xl). Nope, it wasn't much of an issue to 
use G4.
> KVM has a similar logic see "last_vcpu_ran" and "__kvm_flush_cpu_context()". 
> That said... they are using "vmalle1" whereas we are using "vmalls12e1". So 
> maybe we can relax it. Not sure if this would make any difference for the 
> performance though.
I have seen no such performance issue with nested KVM. For Xen, if this can be 
relaxed from vmalls12e1 to vmalle1, it would still be a huge performance 
improvement. I used Ftrace to get the execution time of each of these handler 
functions:
handle_vmalls12e1is() min-max = 1464441 - 9495486 us
handle_tlbi_el1() min-max = 10 - 27 us

So, to summarize: using HCR_EL2.FB (which Xen already enables?) and then using 
vmalle1 instead of vmalls12e1 should resolve issue 2 for vCPUs switching on 
pCPUs. A rough sketch of the relaxed flush follows.

Coming back to issue 1, what do you think about creating a batch version of the 
XENMEM_remove_from_physmap hypercall (other batch versions already exist, such 
as XENMEM_add_to_physmap_batch) and doing the TLB invalidation only once per 
hypercall? I just realized that ripas2e1 is a range TLBI instruction which is 
only supported from Armv8.4 onwards, as indicated by ID_AA64ISAR0_EL1.TLB == 2. 
So, on older architectures a full stage-2 invalidation would be required. For 
an architecture-independent solution, creating a batch version seems to be the 
better way; a sketch of what such an interface could look like is below.
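
A sketch of such an interface, modelled loosely on the existing struct 
xen_add_to_physmap_batch in xen/include/public/memory.h (this is a hypothetical 
proposal, not an existing Xen ABI):

/* Hypothetical XENMEM_remove_from_physmap_batch argument. */
struct xen_remove_from_physmap_batch {
    /* Which domain to act on. */
    domid_t domid;
    /* Number of GPFNs to unmap; the TLB flush happens once per hypercall
     * (or once per preemption point), not once per page. */
    uint16_t size;
    /* Array of guest frame numbers to remove from the physmap. */
    XEN_GUEST_HANDLE(xen_pfn_t) gpfns;
};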

Regards,
Haseeb

From: Julien Grall 
Sent: Friday, October 31, 2025 2:18 PM
To: Mohamed Mediouni 
Cc: Ashraf, Haseeb (DI SW EDA HAV SLS EPS RTOS LIN) 
; [email protected] 
; [email protected] 

Subject: Re: Limitations for Running Xen on KVM Arm64



On 31/10/2025 00:20, Mohamed Mediouni wrote:
>
>
>> On 31. Oct 2025, at 00:55, Julien Grall  wrote:
>>
>> Hi Mohamed,
>>
>> On 30/10/2025 18:33, Mohamed Mediouni wrote:
>>>> On 30. Oct 2025, at 14:41, [email protected] wrote:
>>>>
>>>> Adding @[email protected] and replying to his questions he asked over 
>>>> #XenDevel:matrix.org.
>>>>
>>>> can you add some details why the implementation cannot be optimized in 
>>>> KVM? Asking because I have never seen such issue when running Xen on QEMU 
>>>> (without nested virt enabled).
>>>> AFAIK when Xen is run on QEMU without virtualization, then instructions 
>>>> are emulated in QEMU while with KVM, ideally the instruction should run 
>>>> directly on hardware except in some special cases (those trapped by 
>>>> FGT/CGT). Such as this one where KVM maintains shadow page tables for each 
>>>> VM. It traps these instructions and emulates them with callback such as 
>>>> handle_vmalls12e1is(). The way this callback is implemented, it has to 
>>>> iterate over the whole address space and clean-up the page tables which is 
>>>> a costly operation. Regardless of this, it should still be optimized in 
>>>> Xen as invalidating a selective range would be much better than 
>>>> invalidating a whole range of 48-bit address space.
>>>> Some details about your platform and use case would be helpful. I am 
>>>> interested to know whether you are using all the features for nested virt.
>>>> I am using AWS G4. My use case is to run Xen as guest hypervisor. Yes, 
>>>> most of the features are enabled except VHE or those which are disabled by 
>>>> KVM.
>>> Hello,
>>> You mean Graviton4 (for reference to others, from a bare metal instance)? 
>>> Interesting to see people caring about nested virt there :) - and hopefully 
>>> using it wasn’t too much of a pain for you to deal with.
>>>>
>>>> ; switch to current VMID
>>>> tlbi rvae1, guest_vaddr ; first invalidate stage-1 TLB by guest VA for 
>>>> current VMID
>>>> tlbi ripas2e1, guest_paddr ; then invalidate stage-2 TLB by IPA range for 
>>>> current VMID
>>>> dsb ish
>>>> isb
>>>> ; switch back the VMID
>>>>  • This is where I am not quite sure and I was hoping that if someone 
>>>> with Arm expertise could sign off on this so that I can work on its 
>>>> implementation in Xen. This will be an optimization not only for 
>>>> virtualized hardware but also in general for Xen on arm64 machines.
>>>>
>>> Note that the documentation says
>>>> The invalidati

Re: [XEN PATCH] xen/arm/p2m: perform IPA-based TLBI for arm64 when IPA is known

2025-11-19 Thread haseeb.ash...@siemens.com
Hi Julien,

Thanks for your review.

> > The first one is addressed by relaxing VMALLS12E1IS -> VMALLE1IS.
> > Each CPU has its own private TLBs, so a flush between vCPUs of the
> > same domain is required to avoid translations from vCPUx "leaking"
> > to vCPUy.
>
> This doesn't really tell me why we don't need to flush the S2. The key
> point is (barring altp2m) the stage-2 is common between all the vCPUs of
> a VM.

Alright, I'll update the commit message in version 2.

> > This can be achieved by using VMALLE1. If FEAT_nTLBPA
> > is present then VMALLE1 can also be avoided.
>
> I had a look at the Arm ARM and I can't figure out why it is fine to
> skip the flush. Can you provide a pointer? BTW, in general, it is useful
> to quote the Arm ARM for the reviewer and future reader. It makes it easier
> to find what you are talking about.

Okay. This was pointed out by @Mohamed Mediouni. From the Arm ARM:
> Translation table entry caching that is used for stage 1 translations and is 
> indexed by the intermediate physical
> address of the location holding the translation table entry. However, 
> FEAT_nTLBPA allows software
> discoverability of whether such caches exist, such that if FEAT_nTLBPA is 
> implemented, such caching is not
> implemented.

> > +/*
> > + * FLush TLB by IPA. This will likely be used in a loop, so the caller
> > + * is responsible to use the appropriate memory barriers before/after
> > + * the sequence.
>
> If the goal is to call TLB_HELPER_IPA() in a loop, then the current
> implementation is too expensive.
>
> If the CPU doesn't need the repeat TLBI workaround, then you only need
> to do the dsb; isb once.
>
> If the CPU need the repeat TLBI workaround, looking at the Cortex A76
> errata doc (https://developer.arm.com/documentation/SDEN885749/latest/)
> then I think you might be able to do:
>
> "Flush TLBs"
> "DSB"
> "ISB"
> "Flush TLBs"
> "DSB"
> "ISB"

Yes, I deliberately did not use dsb/isb inside the TLB_HELPER_IPA() helper. 
That is what the comment explains: the caller is responsible for issuing the 
dsb/isb outside, since the helper can be invoked in a loop, so dsb() and isb() 
should be added before and after that loop. (I forgot the isb() in my patch; 
I'll update that.) I kept the sequence with the repeat-TLBI workaround the same 
as in TLB_HELPER_VA(), and it is also the same in the Linux kernel: 
https://github.com/torvalds/linux/blob/master/arch/arm64/include/asm/tlbflush.h#L32.

> > diff --git a/xen/arch/arm/include/asm/mmu/p2m.h 
> > b/xen/arch/arm/include/asm/mmu/p2m.h
> > index 58496c0b09..fc2e08bbe8 100644
> > --- a/xen/arch/arm/include/asm/mmu/p2m.h
> > +++ b/xen/arch/arm/include/asm/mmu/p2m.h
> > @@ -10,6 +10,10 @@ extern unsigned int p2m_root_level;
> >
> >   struct p2m_domain;
> >   void p2m_force_tlb_flush_sync(struct p2m_domain *p2m);
> > +#ifdef CONFIG_ARM_64
>
> We should also handle Arm 32-bit. Barring nTLBA, the code should be the
> same.

Okay, the nTLBPA feature is also available on Arm 32-bit. I'll update this.

> > diff --git a/xen/arch/arm/mmu/p2m.c b/xen/arch/arm/mmu/p2m.c
> > index 51abf3504f..28268fb67f 100644
> > --- a/xen/arch/arm/mmu/p2m.c
> > +++ b/xen/arch/arm/mmu/p2m.c
> > @@ -235,7 +235,12 @@ void p2m_restore_state(struct vcpu *n)
> >* when running multiple vCPU of the same domain on a single pCPU.
> >*/
> >   if ( *last_vcpu_ran != INVALID_VCPU_ID && *last_vcpu_ran != 
> > n->vcpu_id )
> > +#ifdef CONFIG_ARM_64
> > +if ( system_cpuinfo.mm64.ntlbpa != MM64_NTLBPA_SUPPORT_IMP )
>
> If we decide to use nTLBA, then we should introduce a capability so the
> check can be patched at aboot time.

Alright, I need to go through how a CPU capability is added in Xen. Is there 
any commit I can use as a reference?

> > +/*
> > + * ARM64_WORKAROUND_AT_SPECULATE: We need to stop AT to allocate
> > + * TLBs entries because the context is partially modified. We
> > + * only need the VMID for flushing the TLBs, so we can generate
> > + * a new VTTBR with the VMID to flush and the empty root table.
> > + */
> > +if ( !cpus_have_const_cap(ARM64_WORKAROUND_AT_SPECULATE) )
> > +vttbr = p2m->vttbr;
> > +else
> > +vttbr = generate_vttbr(p2m->vmid, empty_root_mfn);
> > +
> > +WRITE_SYSREG64(vttbr, VTTBR_EL2);
> > +
> > +/* Ensure VTTBR_EL2 is synchronized before flushing the TLBs */
> > +isb();
> > +}
>
> I don't really like the idea to duplicate the AT speculation logic.
> Could we try to consolidate by introducing helper to load and unload the
> VTTBR?

Okay, I'll create helpers for load_vttbr() and restore_vttbr().

> > +
> > +/* Ensure prior page-tables updates have completed */
> > +dsb(ishst);
> > +
> > +/* Invalidate stage-2 TLB entries by IPA range */
> > +for ( i = 0; i < page_count; i++ ) {
> > +flush_guest_tlb_one_s2(ipa);
> > +ipa += 1UL << PAGE_SHIFT;
> > +}
>
> In