Re: [Xen-devel] [RFC 5/6] arm64: call enter_hypervisor_head only when it is needed

2019-08-02 Thread Andrii Anisov



On 01.08.19 13:17, Julien Grall wrote:
All the commit message is based on "performance improvement" Now you are selling it as this is confusing. 


Sorry Julien, I have no more arguments for you.
I'll drop these two patches for the next iteration.

--
Sincerely,
Andrii Anisov.

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC 5/6] arm64: call enter_hypervisor_head only when it is needed

2019-08-01 Thread Julien Grall

Hi Andrii,

On 01/08/2019 08:33, Andrii Anisov wrote:



On 31.07.19 14:02, Julien Grall wrote:
Everything is in the hot path here, yet there are a lot of other branches. So 
why this branch in particular?


This branch and function call is particular because I find this piece of code 
quite confusing:


All the commit message is based on "performance improvement" Now you are 
selling it as this is confusing. What are the real reasons for this patch?




I'm not seeing any benefits in calling `enter_hypervisor_head()` from 
functions named `do_trap_hyp_*`. That code is confusing for me.
IMHO, dividing `do_trap_irq/fiq()` into guest and hyp specific functions is 
not a big deal. Even for ARM32. Moreover, it will make more obvious the fact 
that nothing from `enter_hypervisor_head()` is done for IRQ traps taken from 
hyp.


And I believe this patch balances patch "xen/arm: Re-enable interrupt later in 
the trap path" what you NAKed because of increased IRQ latency.


I never NAKed the patch as you keep claiming it. You are sending a patch without 
any justification three time in a row, so it is normal to request more details 
and to be slightly annoyed.


If you expect me to ack a patch without understanding the implications, then I 
am afraid this is not going to happen. Additionally, it is important to keep 
track of the reasoning of we can come back in 2 years time and find out quickly 
why interrupts are masked for a long period of time.


As I pointed out Xen supports multiple use-cases. I am concerned you are trying 
to optimize for your use-case and disregard the rest. I have actually requested 
more details on your use case to understand a bit more where you are coming 
from. See my e-mail [1].


I know I wrote the patch but it was from debugging other than real improvement. 
From my understanding, you are using to optimize the case where all LRs are 
full. Is it something you have seen during your testing?


If so, how many LRs does the platform provide? Can you provide more details on 
your use case? I don't need the full details, but roughly the number of 
interrupts used and often they trigger.


Additionally, it would be good to know the usage over the time.  You could 
modify Xen to record the number of LRs used to each entry.



While them together make the code more straight forward and clear, because:
  - you start all C-coded common trap handlers with IRQs locked, and free from 
asm code misunderstanding


Well, there are only one (two if you count the FIQ) case where interrupts are 
kept disabled, this is when receiving an interrupt. I don't see it as a good 
enough justification to have to impose that to all the handlers.


  - all common trap handlers are distinct in their naming, purpose and action. 
In terms of calling `enter_hypervisor_head()` only from the traps taken from guest.


There are nearly no difference between receiving an interrupt while running the 
guest mode and while running in hypervisor mode. So what do you really gain with 
the duplication?


Cheers,

[1] https://lists.xenproject.org/archives/html/xen-devel/2019-07/msg02335.html

--
Julien Grall

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC 5/6] arm64: call enter_hypervisor_head only when it is needed

2019-08-01 Thread Andrii Anisov



On 31.07.19 14:02, Julien Grall wrote:

Everything is in the hot path here, yet there are a lot of other branches. So 
why this branch in particular?


This branch and function call is particular because I find this piece of code 
quite confusing:


I'm not seeing any benefits in calling `enter_hypervisor_head()` from functions 
named `do_trap_hyp_*`. That code is confusing for me.
IMHO, dividing `do_trap_irq/fiq()` into guest and hyp specific functions is not 
a big deal. Even for ARM32. Moreover, it will make more obvious the fact that 
nothing from `enter_hypervisor_head()` is done for IRQ traps taken from hyp.


And I believe this patch balances patch "xen/arm: Re-enable interrupt later in the 
trap path" what you NAKed because of increased IRQ latency.
While them together make the code more straight forward and clear, because:
 - you start all C-coded common trap handlers with IRQs locked, and free from 
asm code misunderstanding
 - all common trap handlers are distinct in their naming, purpose and action. 
In terms of calling `enter_hypervisor_head()` only from the traps taken from 
guest.


If you are really worry of the impact of branch then there are a few more 
important places (with a greater benefits) to look:
     1) It seems the compiler use a jump table for the switch case in 
do_trap_guest_sync(), so it will result to multiple direct branch everytime.
     2) Indirect branch have a non-negligible cost compare to direct branch. This 
is a lot used in the interrupt code (see gic_hw_ops->read_irq()). All of them 
are known at boot time, so they could be replace with direct branch. x86 recently 
introduced alternative_call() for this purpose. This could be re-used on Arm.


I remember this. But I'm not ready to code it. I admit that I have not yet good 
understanding about how alternatives work.

--
Sincerely,
Andrii Anisov.

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC 5/6] arm64: call enter_hypervisor_head only when it is needed

2019-07-31 Thread Andre Przywara
On Wed, 31 Jul 2019 12:02:20 +0100
Julien Grall  wrote:

Hi,

> On 30/07/2019 18:35, Andrii Anisov wrote:
> > 
> > On 26.07.19 13:59, Julien Grall wrote:  
> >> Hi,
> >>
> >> On 26/07/2019 11:37, Andrii Anisov wrote:  
> >>> From: Andrii Anisov 
> >>>
> >>> On ARM64 we know exactly if trap happened from hypervisor or guest, so
> >>> we do not need to take that decision. This reduces a condition for
> >>> all enter_hypervisor_head calls and the function call for traps from
> >>> the hypervisor mode.  
> >>
> >> One condition lost but ...  
> > 
> > ...In the hot path (actually at any trap).  
> 
> Everything is in the hot path here, yet there are a lot of other branches. So 
> why this branch in particular?
> 
> As I have mentioned a few times before, there are a difference between the 
> theory and the practice. In theory, removing a branch looks nice. But in 
> practice this may not be the case.
> 
> In this particular case, I don't believe this is going to have a real impact 
> on 
> the performance.
> 
> The PSTATE has been saved a few instructions before in cpu_user_regs, so there
> are high chance the value will still be in the L1 cache.

I agree on this, and second the idea of *not* micro-optimising code just for 
the sake of it. If you have numbers that back this up, it would be a different 
story.

> The compiler may also decide to do the direct branch when not in guest_mode. 
> A 
> trap from the hypervisor is mostly for interrupts. So there are chance this 
> is 
> not going to have a real impact on the overall of the interrupt handling.
> 
> If you are really worry of the impact of branch then there are a few more 
> important places (with a greater benefits) to look:
>  1) It seems the compiler use a jump table for the switch case in 
> do_trap_guest_sync(), so it will result to multiple direct branch everytime.

I don't think it's worth to "fix" this issue. The compiler has done this for a 
reason, and I would guess it figured that this is cheaper than other ways of 
solving this. If you are really paranoid about this, I would try to compile 
this with tuning for your particular core (-mtune), so that the compiler can 
throw in more micro-architectural knowledge about the cost of certain 
instructions.

>  2) Indirect branch have a non-negligible cost compare to direct branch. 
> This is a lot used in the interrupt code (see gic_hw_ops->read_irq()). All of 
> them are known at boot time, so they could be replace with direct branch. x86 
> recently introduced alternative_call() for this purpose. This could be 
> re-used 
> on Arm.

This is indeed something I was always worried about: It looks cheap and elegant 
in the C source code, but is potentially expensive on hardware. The particular 
snippet is:
...
  249024:   d5033fdfisb
  249028:   f9401e80ldr x0, [x20, #56]
  24902c:   f9407801ldr x1, [x0, #240]
  249030:   2a1303e0mov w0, w19
  249034:   d63f0020blr x1
...
In case of an interrupt, the first load will probably miss the cache, and the 
CPU is stuck now, because due to the dependencies it can't do much else. It 
can't even predict the branch and speculatively execute anything, because the 
destination address is yet another dependent load away.
This might not matter for little cores like A53s, because they wouldn't 
speculate anyway. But better cores (A72, for instance) would most likely 
benefit from an optimisation in this area.

Cheers,
Andre.

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC 5/6] arm64: call enter_hypervisor_head only when it is needed

2019-07-31 Thread Julien Grall

Hi Andrii,

On 30/07/2019 18:35, Andrii Anisov wrote:


On 26.07.19 13:59, Julien Grall wrote:

Hi,

On 26/07/2019 11:37, Andrii Anisov wrote:

From: Andrii Anisov 

On ARM64 we know exactly if trap happened from hypervisor or guest, so
we do not need to take that decision. This reduces a condition for
all enter_hypervisor_head calls and the function call for traps from
the hypervisor mode.


One condition lost but ...


...In the hot path (actually at any trap).


Everything is in the hot path here, yet there are a lot of other branches. So 
why this branch in particular?


As I have mentioned a few times before, there are a difference between the 
theory and the practice. In theory, removing a branch looks nice. But in 
practice this may not be the case.


In this particular case, I don't believe this is going to have a real impact on 
the performance.


The PSTATE has been saved a few instructions before in cpu_user_regs, so there
are high chance the value will still be in the L1 cache.

The compiler may also decide to do the direct branch when not in guest_mode. A 
trap from the hypervisor is mostly for interrupts. So there are chance this is 
not going to have a real impact on the overall of the interrupt handling.


If you are really worry of the impact of branch then there are a few more 
important places (with a greater benefits) to look:
1) It seems the compiler use a jump table for the switch case in 
do_trap_guest_sync(), so it will result to multiple direct branch everytime.
2) Indirect branch have a non-negligible cost compare to direct branch. 
This is a lot used in the interrupt code (see gic_hw_ops->read_irq()). All of 
them are known at boot time, so they could be replace with direct branch. x86 
recently introduced alternative_call() for this purpose. This could be re-used 
on Arm.


Cheers,

--
Julien Grall

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC 5/6] arm64: call enter_hypervisor_head only when it is needed

2019-07-30 Thread Andrii Anisov


On 26.07.19 13:59, Julien Grall wrote:

Hi,

On 26/07/2019 11:37, Andrii Anisov wrote:

From: Andrii Anisov 

On ARM64 we know exactly if trap happened from hypervisor or guest, so
we do not need to take that decision. This reduces a condition for
all enter_hypervisor_head calls and the function call for traps from
the hypervisor mode.


One condition lost but ...


...In the hot path (actually at any trap).
And the function call for traps from hyp.


Currently, it is implemented for ARM64 only. Integrating the stuff
with ARM32 requires moving ` if ( guest_mode(regs) )` condition
into ARM32 specific traps.


... one more divergence between arm32 and arm64.

There are probably dozens of more conditions in the code that are not necessary 
for one of the architectures.
Yet there are value to keep everything common because the benefits outweigh the 
likely non performance improvement.


I'm not seeing any benefits in calling `enter_hypervisor_head()` from functions 
named `do_trap_hyp_*`. That code is confusing for me.
IMHO, dividing `do_trap_irq/fiq()` into guest and hyp specific functions is not 
a big deal. Even for ARM32. Moreover, it will make more obvious the fact that 
nothing from `enter_hypervisor_head()` is done for IRQ traps taken from hyp.


So I am not convinced that this has any value for Xen.


OK, I wouldn't insist.

--
Sincerely,
Andrii Anisov.

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC 5/6] arm64: call enter_hypervisor_head only when it is needed

2019-07-26 Thread Julien Grall

Hi,

On 26/07/2019 11:37, Andrii Anisov wrote:

From: Andrii Anisov 

On ARM64 we know exactly if trap happened from hypervisor or guest, so
we do not need to take that decision. This reduces a condition for
all enter_hypervisor_head calls and the function call for traps from
the hypervisor mode.


One condition lost but ...



Currently, it is implemented for ARM64 only. Integrating the stuff
with ARM32 requires moving ` if ( guest_mode(regs) )` condition
into ARM32 specific traps.


... one more divergence between arm32 and arm64.

There are probably dozens of more conditions in the code that are not necessary 
for one of the architectures. Yet there are value to keep everything common 
because the benefits outweigh the likely non performance improvement.


So I am not convinced that this has any value for Xen.

Cheers,

--
Julien Grall

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC 5/6] arm64: call enter_hypervisor_head only when it is needed

2019-07-26 Thread Andrii Anisov
From: Andrii Anisov 

On ARM64 we know exactly if trap happened from hypervisor or guest, so
we do not need to take that decision. This reduces a condition for
all enter_hypervisor_head calls and the function call for traps from
the hypervisor mode.

Currently, it is implemented for ARM64 only. Integrating the stuff
with ARM32 requires moving ` if ( guest_mode(regs) )` condition
into ARM32 specific traps.c

Signed-off-by: Andrii Anisov 
---
 xen/arch/arm/arm64/entry.S |  6 ++--
 xen/arch/arm/traps.c   | 75 --
 2 files changed, 43 insertions(+), 38 deletions(-)

diff --git a/xen/arch/arm/arm64/entry.S b/xen/arch/arm/arm64/entry.S
index 8f28789..21c710d 100644
--- a/xen/arch/arm/arm64/entry.S
+++ b/xen/arch/arm/arm64/entry.S
@@ -211,7 +211,7 @@ hyp_irq:
 entry   hyp=1
 msr daifclr, #4
 mov x0, sp
-bl  do_trap_irq
+bl  do_trap_hyp_irq
 exithyp=1
 
 guest_sync:
@@ -321,7 +321,7 @@ guest_irq:
 SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT)
 msr daifclr, #4
 mov x0, sp
-bl  do_trap_irq
+bl  do_trap_guest_irq
 1:
 exithyp=0, compat=0
 
@@ -364,7 +364,7 @@ guest_irq_compat:
 SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT)
 msr daifclr, #4
 mov x0, sp
-bl  do_trap_irq
+bl  do_trap_guest_irq
 1:
 exithyp=0, compat=1
 
diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
index 5a9dc66..13726db 100644
--- a/xen/arch/arm/traps.c
+++ b/xen/arch/arm/traps.c
@@ -2011,48 +2011,45 @@ static inline bool needs_ssbd_flip(struct vcpu *v)
  cpu_require_ssbd_mitigation();
 }
 
-static void enter_hypervisor_head(struct cpu_user_regs *regs)
+static void enter_hypervisor_head(void)
 {
-if ( guest_mode(regs) )
-{
-struct vcpu *v = current;
+struct vcpu *v = current;
 
-ASSERT(!local_irq_is_enabled());
+ASSERT(!local_irq_is_enabled());
 
-/* If the guest has disabled the workaround, bring it back on. */
-if ( needs_ssbd_flip(v) )
-arm_smccc_1_1_smc(ARM_SMCCC_ARCH_WORKAROUND_2_FID, 1, NULL);
+/* If the guest has disabled the workaround, bring it back on. */
+if ( needs_ssbd_flip(v) )
+arm_smccc_1_1_smc(ARM_SMCCC_ARCH_WORKAROUND_2_FID, 1, NULL);
 
-/*
- * If we pended a virtual abort, preserve it until it gets cleared.
- * See ARM ARM DDI 0487A.j D1.14.3 (Virtual Interrupts) for details,
- * but the crucial bit is "On taking a vSError interrupt, HCR_EL2.VSE
- * (alias of HCR.VA) is cleared to 0."
- */
-if ( v->arch.hcr_el2 & HCR_VA )
-v->arch.hcr_el2 = READ_SYSREG(HCR_EL2);
+/*
+ * If we pended a virtual abort, preserve it until it gets cleared.
+ * See ARM ARM DDI 0487A.j D1.14.3 (Virtual Interrupts) for details,
+ * but the crucial bit is "On taking a vSError interrupt, HCR_EL2.VSE
+ * (alias of HCR.VA) is cleared to 0."
+ */
+if ( v->arch.hcr_el2 & HCR_VA )
+v->arch.hcr_el2 = READ_SYSREG(HCR_EL2);
 
 #ifdef CONFIG_NEW_VGIC
-/*
- * We need to update the state of our emulated devices using level
- * triggered interrupts before syncing back the VGIC state.
- *
- * TODO: Investigate whether this is necessary to do on every
- * trap and how it can be optimised.
- */
-vtimer_update_irqs(v);
-vcpu_update_evtchn_irq(v);
+/*
+ * We need to update the state of our emulated devices using level
+ * triggered interrupts before syncing back the VGIC state.
+ *
+ * TODO: Investigate whether this is necessary to do on every
+ * trap and how it can be optimised.
+ */
+vtimer_update_irqs(v);
+vcpu_update_evtchn_irq(v);
 #endif
 
-vgic_sync_from_lrs(v);
-}
+vgic_sync_from_lrs(v);
 }
 
 void do_trap_guest_sync(struct cpu_user_regs *regs)
 {
 const union hsr hsr = { .bits = regs->hsr };
 
-enter_hypervisor_head(regs);
+enter_hypervisor_head();
 local_irq_enable();
 
 switch ( hsr.ec )
@@ -2188,7 +2185,6 @@ void do_trap_hyp_sync(struct cpu_user_regs *regs)
 {
 const union hsr hsr = { .bits = regs->hsr };
 
-enter_hypervisor_head(regs);
 local_irq_enable();
 
 switch ( hsr.ec )
@@ -2227,7 +2223,6 @@ void do_trap_hyp_sync(struct cpu_user_regs *regs)
 
 void do_trap_hyp_serror(struct cpu_user_regs *regs)
 {
-enter_hypervisor_head(regs);
 local_irq_enable();
 
 __do_trap_serror(regs, VABORT_GEN_BY_GUEST(regs));
@@ -2235,21 +2230,31 @@ void do_trap_hyp_serror(struct cpu_user_regs *regs)
 
 void do_trap_guest_serror(struct cpu_user_regs *regs)
 {
-enter_hypervisor_head(regs);
+enter_hypervisor_head();
 local_irq_enable();
 
 __do_trap_serror(regs, true);
 }
 
-void do_trap_irq(struct cpu_user_regs *regs)
+void do_trap_guest_irq(struct