Re: [RFC 00/11] KVM, EFI, arm64: EFI Runtime Services Sandboxing

2017-09-22 Thread Ard Biesheuvel
On 25 August 2017 at 01:31, Florent Revest  wrote:
> Hi,
>
> This series implements a mechanism to sandbox EFI Runtime Services on arm64.
> It can be enabled with CONFIG_EFI_SANDBOX. At boot it spawns an internal KVM
> virtual machine that is run every time an EFI Runtime Service is called. This
> limits the possible security and stability impact of EFI runtime on the 
> kernel.
>
> The patch set is split as follows:
>  - Patches 1 and 2: Give more control over HVC handling to KVM
>  - Patches 3 to 6: Introduce the concept of KVM "internal VMs"
>  - Patches 7 to 9: Reorder KVM and EFI initialization on ARM
>  - Patch 10: Introduces the EFI sandboxing VM and wrappers
>  - Patch 11: Works around some EFI Runtime Services relying on EL3
>
> The sandboxing has been tested to work reliably (rtc and efivars) on a
> SoftIron OverDrive 1000 box and on an ARMv8.3 model with VHE enabled. Normal
> userspace KVM instances have also been tested to still work correctly.
>
> Those patches apply cleanly on Linus' v4.13-rc6 tag and have no other
> dependencies.
>
> Florent Revest (11):
>   arm64: Add an SMCCC function IDs header
>   KVM: arm64: Return an Unknown ID on unhandled HVC
>   KVM: Allow VM lifecycle management without userspace
>   KVM, arm, arm64: Offer PAs to IPAs idmapping to internal VMs
>   KVM: Expose VM/VCPU creation functions
>   KVM, arm64: Expose a VCPU initialization function
>   KVM: Allow initialization before the module target
>   KVM, arm, arm64: Initialize KVM's core earlier
>   EFI, arm, arm64: Enable EFI Runtime Services later
>   efi, arm64: Sandbox Runtime Services in a VM
>   KVM, arm64: Don't trap internal VMs SMC calls
>

Hello Florent,

This is really nice work. Thanks for contributing it.

From the EFI side, there are some minor concerns on my part regarding
the calling convention, and the fact that we can no longer invoke
runtime services from a kernel running at EL1, but those all seem
fixable. I will respond to the patches in question in greater detail
at a later time.

In the meantime, Christoffer has raised a number of valid concerns,
and those need to be addressed first before it makes sense to talk
about EFI specifics. I hope you will find more time to invest in this:
I would really love to have this feature upstream.

Regards,
Ard.


Re: [PATCH v2 03/11] KVM: arm64: Change hyp_panic()s dependency on tpidr_el2

2017-09-22 Thread James Morse
Hi Christoffer,

On 17/09/17 15:43, Christoffer Dall wrote:
> On Tue, Aug 08, 2017 at 05:46:08PM +0100, James Morse wrote:
>> Make tpidr_el2 a cpu-offset for per-cpu variables in the same way the
>> host uses tpidr_el1. This lets tpidr_el{1,2} have the same value, and
>> on VHE they can be the same register.
>>
>> KVM calls hyp_panic() when anything unexpected happens. This may occur
>> while a guest owns the EL1 registers. KVM stashes the vcpu pointer in
>> tpidr_el2, which it uses to find the host context in order to restore
>> the host EL1 registers before parachuting into the host's panic().
>>
>> The host context is a struct kvm_cpu_context allocated in the per-cpu
>> area, and mapped to hyp. Given the per-cpu offset for this CPU, this is
>> easy to find. Change hyp_panic() to take a pointer to the
>> struct kvm_cpu_context. Wrap these calls with an asm function that
>> retrieves the struct kvm_cpu_context from the host's per-cpu area.
>>
>> Copy the per-cpu offset from the hosts tpidr_el1 into tpidr_el2 during
>> kvm init. (Later patches will make this unnecessary for VHE hosts)
>>
>> We print out the vcpu pointer as part of the panic message. Add a back
>> reference to the 'running vcpu' in the host cpu context to preserve this.

>> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
>> index 945e79c641c4..235d615cee30 100644
>> --- a/arch/arm64/kvm/hyp/switch.c
>> +++ b/arch/arm64/kvm/hyp/switch.c
>> @@ -286,9 +286,9 @@ int __hyp_text __kvm_vcpu_run(struct kvm_vcpu *vcpu)
>>  u64 exit_code;
>>  
>>  vcpu = kern_hyp_va(vcpu);
>> -write_sysreg(vcpu, tpidr_el2);
>>  
>>  host_ctxt = kern_hyp_va(vcpu->arch.host_cpu_context);
>> +host_ctxt->__hyp_running_vcpu = vcpu;
> 
> I'm fine with this for now, but eventually when we optimize KVM further
> we may want to avoid doing this in every world-switch.  One option is to
> just set this pointer in vcpu_get and vcpu_load, but would it also be
> possible to use the kvm_arm_running_vcpu per-cpu array directly?

Yes, that would have been better, I didn't know that existed...
After this point we can find per-cpu variables easily; they just need mapping to
HYP.

This pointer is just for the panic message; I'm not sure how useful it is.
For kdump the kvm_arm_running_vcpu array holds the same information, so is
printing this out just so we can spot if it's totally bogus?


As you're fine with this for now, I'll put tidying it up on my todo list...


>>  guest_ctxt = &vcpu->arch.ctxt;
>>  
>>  __sysreg_save_host_state(host_ctxt);


> Reviewed-by: Christoffer Dall 

Thanks!

James





[PATCH v3 19/20] KVM: arm/arm64: Get rid of kvm_timer_flush_hwstate

2017-09-22 Thread Christoffer Dall
Now that both the vtimer and the ptimer, whether using the in-kernel
vgic emulation or a userspace IRQ chip, are driven by the timer signals
and by the vcpu load/put boundaries, instead of recomputing the timer
state at every entry/exit to/from the guest, we can get rid of the
flush hwstate function entirely.

Signed-off-by: Christoffer Dall 
---
 include/kvm/arm_arch_timer.h |  1 -
 virt/kvm/arm/arch_timer.c| 24 
 virt/kvm/arm/arm.c   |  1 -
 3 files changed, 26 deletions(-)

diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
index 8e5ed54..af29563 100644
--- a/include/kvm/arm_arch_timer.h
+++ b/include/kvm/arm_arch_timer.h
@@ -61,7 +61,6 @@ int kvm_timer_hyp_init(void);
 int kvm_timer_enable(struct kvm_vcpu *vcpu);
 int kvm_timer_vcpu_reset(struct kvm_vcpu *vcpu);
 void kvm_timer_vcpu_init(struct kvm_vcpu *vcpu);
-void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu);
 void kvm_timer_sync_hwstate(struct kvm_vcpu *vcpu);
 bool kvm_timer_should_notify_user(struct kvm_vcpu *vcpu);
 void kvm_timer_update_run(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
index aa18a5d..f92459a 100644
--- a/virt/kvm/arm/arch_timer.c
+++ b/virt/kvm/arm/arch_timer.c
@@ -302,12 +302,6 @@ static void kvm_timer_update_state(struct kvm_vcpu *vcpu)
struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
 
-   /*
-* If userspace modified the timer registers via SET_ONE_REG before
-* the vgic was initialized, we mustn't set the vtimer->irq.level value
-* because the guest would never see the interrupt.  Instead wait
-* until we call this function from kvm_timer_flush_hwstate.
-*/
if (unlikely(!timer->enabled))
return;
 
@@ -493,24 +487,6 @@ bool kvm_timer_should_notify_user(struct kvm_vcpu *vcpu)
   ptimer->irq.level != plevel;
 }
 
-/**
- * kvm_timer_flush_hwstate - prepare timers before running the vcpu
- * @vcpu: The vcpu pointer
- *
- * Check if the virtual timer has expired while we were running in the host,
- * and inject an interrupt if that was the case, making sure the timer is
- * masked or disabled on the host so that we keep executing.  Also schedule a
- * software timer for the physical timer if it is enabled.
- */
-void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu)
-{
-   struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
-   struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
-
-   if (unlikely(!timer->enabled))
-   return;
-}
-
 void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
 {
	struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 132d39a..14c50d1 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -656,7 +656,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
 
local_irq_disable();
 
-   kvm_timer_flush_hwstate(vcpu);
kvm_vgic_flush_hwstate(vcpu);
 
/*
-- 
2.9.0



[PATCH v3 14/20] KVM: arm/arm64: Avoid timer save/restore in vcpu entry/exit

2017-09-22 Thread Christoffer Dall
We don't need to save and restore the hardware timer state and examine
if it generates interrupts on every entry/exit to the guest.  The
timer hardware is perfectly capable of telling us when it has expired
by signaling interrupts.

When taking a vtimer interrupt in the host, we don't want to mess with
the timer configuration, we just want to forward the physical interrupt
to the guest as a virtual interrupt.  We can use the split priority drop
and deactivate feature of the GIC to do this, which leaves an EOI'ed
interrupt active on the physical distributor, making sure we don't keep
taking timer interrupts which would prevent the guest from running.  We
can then forward the physical interrupt to the VM using the HW bit in
the LR of the GIC VE, like we do already, which lets the guest directly
deactivate both the physical and virtual timer simultaneously, allowing
the timer hardware to exit the VM and generate a new physical interrupt
when the timer output is again asserted later on.
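
For illustration, the host vtimer interrupt handler then boils down to
something like the following (a simplified sketch, not the exact code in
this patch; error handling and the userspace-irqchip case are omitted):

static irqreturn_t kvm_arch_timer_handler(int irq, void *dev_id)
{
	struct kvm_vcpu *vcpu = *(struct kvm_vcpu **)dev_id;
	struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);

	/*
	 * The GIC has already done the priority drop (EOI) for us, and the
	 * interrupt stays active on the distributor so we don't take it
	 * again.  All we do here is raise the virtual interrupt line.
	 */
	if (!vtimer->irq.level)
		kvm_timer_update_irq(vcpu, true, vtimer);

	return IRQ_HANDLED;
}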

We do need to capture this state when migrating VCPUs between physical
CPUs, however, and we use the vcpu put/load functions for that; these are
called through preempt notifiers whenever the thread is scheduled away
from the CPU, or directly when we return from the ioctl to
userspace.

One caveat is that we cannot restore the timer state during
kvm_timer_vcpu_load, because the flow of sleeping a VCPU is:

  1. kvm_vcpu_block
  2. kvm_timer_schedule
  3. schedule
  4. kvm_timer_vcpu_put (preempt notifier)
  5. schedule (vcpu thread gets scheduled back)
  6. kvm_timer_vcpu_load
< We restore the hardware state here, but the bg_timer
  hrtimer may have scheduled a work function that also
  changes the timer state here.
  7. kvm_timer_unschedule
< We can restore the state here instead

So, while we do need to restore the timer state in step (6) in all other
cases than when we called kvm_vcpu_block(), we have to defer the restore
to step (7) when coming back after kvm_vcpu_block().  Note that we
cannot simply call cancel_work_sync() in step (6), because vcpu_load can
be called from a preempt notifier.
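
To make that control flow concrete, a rough sketch of the split is shown
below; vcpu_was_blocked() is a stand-in used only to illustrate the idea,
and vtimer_restore_state() mirrors the vtimer_save_state() helper used
elsewhere in this patch:

void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu)
{
	/*
	 * Step (6): if we are coming back from kvm_vcpu_block(), the
	 * bg_timer work function may still be updating the software state,
	 * so defer the hardware restore to kvm_timer_unschedule().
	 */
	if (vcpu_was_blocked(vcpu))
		return;

	vtimer_restore_state(vcpu);
}

void kvm_timer_unschedule(struct kvm_vcpu *vcpu)
{
	/* Step (7): safe to restore the hardware state here. */
	vtimer_restore_state(vcpu);
}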

An added benefit beyond not having to read and write the timer sysregs
on every entry and exit is that we no longer have to actively write the
active state to the physical distributor, because we set the affinity of
the vtimer interrupt when loading the timer state, so that the interrupt
automatically stays active after firing.

Signed-off-by: Christoffer Dall 
---
 include/kvm/arm_arch_timer.h |   9 +-
 virt/kvm/arm/arch_timer.c| 238 +++
 virt/kvm/arm/arm.c   |  19 +++-
 virt/kvm/arm/hyp/timer-sr.c  |   8 +-
 4 files changed, 174 insertions(+), 100 deletions(-)

diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
index 16887c0..8e5ed54 100644
--- a/include/kvm/arm_arch_timer.h
+++ b/include/kvm/arm_arch_timer.h
@@ -31,8 +31,8 @@ struct arch_timer_context {
/* Timer IRQ */
struct kvm_irq_levelirq;
 
-   /* Active IRQ state caching */
-   boolactive_cleared_last;
+   /* Is the timer state loaded on the hardware timer */
+   boolloaded;
 
/* Virtual offset */
u64 cntvoff;
@@ -80,10 +80,15 @@ void kvm_timer_unschedule(struct kvm_vcpu *vcpu);
 
 u64 kvm_phys_timer_read(void);
 
+void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu);
 void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu);
 
 void kvm_timer_init_vhe(void);
 
 #define vcpu_vtimer(v) (&(v)->arch.timer_cpu.vtimer)
 #define vcpu_ptimer(v) (&(v)->arch.timer_cpu.ptimer)
+
+void enable_el1_phys_timer_access(void);
+void disable_el1_phys_timer_access(void);
+
 #endif
diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
index 4275f8f..70110ea 100644
--- a/virt/kvm/arm/arch_timer.c
+++ b/virt/kvm/arm/arch_timer.c
@@ -46,10 +46,9 @@ static const struct kvm_irq_level default_vtimer_irq = {
.level  = 1,
 };
 
-void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
-{
-   vcpu_vtimer(vcpu)->active_cleared_last = false;
-}
+static bool kvm_timer_irq_can_fire(struct arch_timer_context *timer_ctx);
+static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level,
+struct arch_timer_context *timer_ctx);
 
 u64 kvm_phys_timer_read(void)
 {
@@ -74,17 +73,37 @@ static void soft_timer_cancel(struct hrtimer *hrt, struct 
work_struct *work)
cancel_work_sync(work);
 }
 
-static irqreturn_t kvm_arch_timer_handler(int irq, void *dev_id)
+static void kvm_vtimer_update_mask_user(struct kvm_vcpu *vcpu)
 {
-   struct kvm_vcpu *vcpu = *(struct kvm_vcpu **)dev_id;
+   struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
 
/*
-* We disable the timer in the world switch and let it be
-* handled by kvm_timer_sync_hwstate(). Getting a timer
-* 

[PATCH v3 02/20] arm64: Use physical counter for in-kernel reads

2017-09-22 Thread Christoffer Dall
Using the physical counter allows KVM to retain the offset between the
virtual and physical counter as long as it is actively running a VCPU.

As soon as a VCPU is released, another thread is scheduled or we start
running userspace applications, we reset the offset to 0, so that
userspace accessing the virtual timer can still read the virtual counter
and get the same view of time as the kernel.

This opens up potential improvements for KVM performance.

VHE kernels or kernels continuing to use the virtual timer are
unaffected.

Signed-off-by: Christoffer Dall 
---
 arch/arm64/include/asm/arch_timer.h  | 9 -
 drivers/clocksource/arm_arch_timer.c | 3 +--
 2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/arch_timer.h 
b/arch/arm64/include/asm/arch_timer.h
index a652ce0..1859a1c 100644
--- a/arch/arm64/include/asm/arch_timer.h
+++ b/arch/arm64/include/asm/arch_timer.h
@@ -148,11 +148,10 @@ static inline void arch_timer_set_cntkctl(u32 cntkctl)
 
 static inline u64 arch_counter_get_cntpct(void)
 {
-   /*
-* AArch64 kernel and user space mandate the use of CNTVCT.
-*/
-   BUG();
-   return 0;
+   u64 cval;
+   isb();
+   asm volatile("mrs %0, cntpct_el0" : "=r" (cval));
+   return cval;
 }
 
 static inline u64 arch_counter_get_cntvct(void)
diff --git a/drivers/clocksource/arm_arch_timer.c 
b/drivers/clocksource/arm_arch_timer.c
index fd4b7f6..9b3322a 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -890,8 +890,7 @@ static void __init arch_counter_register(unsigned type)
 
/* Register the CP15 based counter if we have one */
if (type & ARCH_TIMER_TYPE_CP15) {
-   if (IS_ENABLED(CONFIG_ARM64) ||
-   arch_timer_uses_ppi == ARCH_TIMER_VIRT_PPI)
+   if (arch_timer_uses_ppi == ARCH_TIMER_VIRT_PPI)
arch_timer_read_counter = arch_counter_get_cntvct;
else
arch_timer_read_counter = arch_counter_get_cntpct;
-- 
2.9.0



[PATCH v3 18/20] KVM: arm/arm64: Avoid phys timer emulation in vcpu entry/exit

2017-09-22 Thread Christoffer Dall
There is no need to schedule and cancel a hrtimer when entering and
exiting the guest, because we know when the physical timer is going to
fire when the guest programs it, and we can simply program the hrtimer
at that point.

Now when the register modifications from the guest go through the
kvm_arm_timer_set/get_reg functions, which always call
kvm_timer_update_state(), we can simply consider the timer state in this
function and schedule and cancel the timers as needed.

This avoids looking at the physical timer emulation state when entering
and exiting the VCPU, allowing for faster servicing of the VM when
needed.

Signed-off-by: Christoffer Dall 
---
 virt/kvm/arm/arch_timer.c | 75 ---
 1 file changed, 51 insertions(+), 24 deletions(-)

diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
index 1f82c21..aa18a5d 100644
--- a/virt/kvm/arm/arch_timer.c
+++ b/virt/kvm/arm/arch_timer.c
@@ -199,7 +199,27 @@ static enum hrtimer_restart kvm_bg_timer_expire(struct 
hrtimer *hrt)
 
 static enum hrtimer_restart kvm_phys_timer_expire(struct hrtimer *hrt)
 {
-   WARN(1, "Timer only used to ensure guest exit - unexpected event.");
+   struct arch_timer_context *ptimer;
+   struct arch_timer_cpu *timer;
+   struct kvm_vcpu *vcpu;
+   u64 ns;
+
+   timer = container_of(hrt, struct arch_timer_cpu, phys_timer);
+   vcpu = container_of(timer, struct kvm_vcpu, arch.timer_cpu);
+   ptimer = vcpu_ptimer(vcpu);
+
+   /*
+* Check that the timer has really expired from the guest's
+* PoV (NTP on the host may have forced it to expire
+* early). If not ready, schedule for a later time.
+*/
+   ns = kvm_timer_compute_delta(ptimer);
+   if (unlikely(ns)) {
+   hrtimer_forward_now(hrt, ns_to_ktime(ns));
+   return HRTIMER_RESTART;
+   }
+
+   kvm_timer_update_irq(vcpu, true, ptimer);
return HRTIMER_NORESTART;
 }
 
@@ -253,24 +273,28 @@ static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, 
bool new_level,
 }
 
 /* Schedule the background timer for the emulated timer. */
-static void phys_timer_emulate(struct kvm_vcpu *vcpu,
- struct arch_timer_context *timer_ctx)
+static void phys_timer_emulate(struct kvm_vcpu *vcpu)
 {
	struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
+   struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
 
-   if (kvm_timer_should_fire(timer_ctx))
-   return;
-
-   if (!kvm_timer_irq_can_fire(timer_ctx))
+   /*
+* If the timer can fire now we have just raised the IRQ line and we
+* don't need to have a soft timer scheduled for the future.  If the
+* timer cannot fire at all, then we also don't need a soft timer.
+*/
+   if (kvm_timer_should_fire(ptimer) || !kvm_timer_irq_can_fire(ptimer)) {
+   soft_timer_cancel(&timer->phys_timer, NULL);
return;
+   }
 
-   /*  The timer has not yet expired, schedule a background timer */
-   soft_timer_start(&timer->phys_timer, kvm_timer_compute_delta(timer_ctx));
+   soft_timer_start(&timer->phys_timer, kvm_timer_compute_delta(ptimer));
 }
 
 /*
- * Check if there was a change in the timer state (should we raise or lower
- * the line level to the GIC).
+ * Check if there was a change in the timer state, so that we should either
+ * raise or lower the line level to the GIC or schedule a background timer to
+ * emulate the physical timer.
  */
 static void kvm_timer_update_state(struct kvm_vcpu *vcpu)
 {
@@ -292,6 +316,8 @@ static void kvm_timer_update_state(struct kvm_vcpu *vcpu)
 
if (kvm_timer_should_fire(ptimer) != ptimer->irq.level)
kvm_timer_update_irq(vcpu, !ptimer->irq.level, ptimer);
+
+   phys_timer_emulate(vcpu);
 }
 
 static void vtimer_save_state(struct kvm_vcpu *vcpu)
@@ -445,6 +471,9 @@ void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu)
 
if (has_vhe())
disable_el1_phys_timer_access();
+
+   /* Set the background timer for the physical timer emulation. */
+   phys_timer_emulate(vcpu);
 }
 
 bool kvm_timer_should_notify_user(struct kvm_vcpu *vcpu)
@@ -480,12 +509,6 @@ void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu)
 
if (unlikely(!timer->enabled))
return;
-
-   if (kvm_timer_should_fire(ptimer) != ptimer->irq.level)
-   kvm_timer_update_irq(vcpu, !ptimer->irq.level, ptimer);
-
-   /* Set the background timer for the physical timer emulation. */
-   phys_timer_emulate(vcpu, vcpu_ptimer(vcpu));
 }
 
 void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
@@ -500,6 +523,17 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
 
vtimer_save_state(vcpu);
 
+   /*
+* Cancel the physical timer emulation, because the only case where we
+* need it after a vcpu_put is in the context of a sleeping VCPU, and
+* in that case we 

[PATCH v3 08/20] KVM: arm/arm64: Rename soft timer to bg_timer

2017-09-22 Thread Christoffer Dall
As we are about to introduce a separate hrtimer for the physical timer,
call this timer bg_timer, because we refer to this timer as the
background timer in the code and comments elsewhere.

Signed-off-by: Christoffer Dall 
---
 include/kvm/arm_arch_timer.h |  3 +--
 virt/kvm/arm/arch_timer.c| 22 +++---
 2 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
index f0053f8..dcbb2e1 100644
--- a/include/kvm/arm_arch_timer.h
+++ b/include/kvm/arm_arch_timer.h
@@ -43,8 +43,7 @@ struct arch_timer_cpu {
struct arch_timer_context   ptimer;
 
/* Background timer used when the guest is not running */
-   struct hrtimer  timer;
-
+   struct hrtimer  bg_timer;
/* Work queued with the above timer expires */
struct work_struct  expired;
 
diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
index 871d8ae..c2e8326 100644
--- a/virt/kvm/arm/arch_timer.c
+++ b/virt/kvm/arm/arch_timer.c
@@ -56,7 +56,7 @@ u64 kvm_phys_timer_read(void)
return timecounter->cc->read(timecounter->cc);
 }
 
-static bool soft_timer_is_armed(struct arch_timer_cpu *timer)
+static bool bg_timer_is_armed(struct arch_timer_cpu *timer)
 {
return timer->armed;
 }
@@ -154,13 +154,13 @@ static u64 kvm_timer_earliest_exp(struct kvm_vcpu *vcpu)
return min(min_virt, min_phys);
 }
 
-static enum hrtimer_restart kvm_timer_expire(struct hrtimer *hrt)
+static enum hrtimer_restart kvm_bg_timer_expire(struct hrtimer *hrt)
 {
struct arch_timer_cpu *timer;
struct kvm_vcpu *vcpu;
u64 ns;
 
-   timer = container_of(hrt, struct arch_timer_cpu, timer);
+   timer = container_of(hrt, struct arch_timer_cpu, bg_timer);
vcpu = container_of(timer, struct kvm_vcpu, arch.timer_cpu);
 
/*
@@ -267,7 +267,7 @@ static void kvm_timer_emulate(struct kvm_vcpu *vcpu,
return;
 
/*  The timer has not yet expired, schedule a background timer */
-   soft_timer_start(&timer->timer, kvm_timer_compute_delta(timer_ctx));
+   soft_timer_start(&timer->bg_timer, kvm_timer_compute_delta(timer_ctx));
 }
 
 /*
@@ -281,7 +281,7 @@ void kvm_timer_schedule(struct kvm_vcpu *vcpu)
struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
 
-   BUG_ON(soft_timer_is_armed(timer));
+   BUG_ON(bg_timer_is_armed(timer));
 
/*
 * No need to schedule a background timer if any guest timer has
@@ -303,14 +303,14 @@ void kvm_timer_schedule(struct kvm_vcpu *vcpu)
 * Set the earliest expiration time among the guest timers.
 */
timer->armed = true;
-   soft_timer_start(&timer->timer, kvm_timer_earliest_exp(vcpu));
+   soft_timer_start(&timer->bg_timer, kvm_timer_earliest_exp(vcpu));
 }
 
 void kvm_timer_unschedule(struct kvm_vcpu *vcpu)
 {
	struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
 
	soft_timer_cancel(&timer->timer, &timer->expired);
+   soft_timer_cancel(&timer->bg_timer, &timer->expired);
timer->armed = false;
 }
 
@@ -447,7 +447,7 @@ void kvm_timer_sync_hwstate(struct kvm_vcpu *vcpu)
 * This is to cancel the background timer for the physical timer
 * emulation if it is set.
 */
-   soft_timer_cancel(&timer->timer, &timer->expired);
+   soft_timer_cancel(&timer->bg_timer, &timer->expired);
 
/*
 * The guest could have modified the timer registers or the timer
@@ -504,8 +504,8 @@ void kvm_timer_vcpu_init(struct kvm_vcpu *vcpu)
vcpu_ptimer(vcpu)->cntvoff = 0;
 
	INIT_WORK(&timer->expired, kvm_timer_inject_irq_work);
-   hrtimer_init(&timer->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
-   timer->timer.function = kvm_timer_expire;
+   hrtimer_init(&timer->bg_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+   timer->bg_timer.function = kvm_bg_timer_expire;
 
vtimer->irq.irq = default_vtimer_irq.irq;
ptimer->irq.irq = default_ptimer_irq.irq;
@@ -614,7 +614,7 @@ void kvm_timer_vcpu_terminate(struct kvm_vcpu *vcpu)
	struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
 
-   soft_timer_cancel(&timer->timer, &timer->expired);
+   soft_timer_cancel(&timer->bg_timer, &timer->expired);
kvm_vgic_unmap_phys_irq(vcpu, vtimer->irq.irq);
 }
 
-- 
2.9.0



[PATCH v3 20/20] KVM: arm/arm64: Rework kvm_timer_should_fire

2017-09-22 Thread Christoffer Dall
kvm_timer_should_fire() can be called in two different situations from
kvm_vcpu_block().

The first case is before calling kvm_timer_schedule(), used for wait
polling, and in this case the VCPU thread is running and the timer state
is loaded onto the hardware so all we have to do is check if the virtual
interrupt lines are asserted, because the timer interrupt handler
functions will raise those lines as appropriate.

The second case is inside the wait loop of kvm_vcpu_block(), where we
have already called kvm_timer_schedule() and therefore the hardware will
be disabled and the software view of the timer state is up to date
(timer->loaded is false), and so we can simply check if the timer should
fire by looking at the software state.

Signed-off-by: Christoffer Dall 
---
 include/kvm/arm_arch_timer.h |  3 ++-
 virt/kvm/arm/arch_timer.c| 22 +-
 virt/kvm/arm/arm.c   |  3 +--
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
index af29563..250db34 100644
--- a/include/kvm/arm_arch_timer.h
+++ b/include/kvm/arm_arch_timer.h
@@ -73,7 +73,8 @@ int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, struct 
kvm_device_attr *attr);
 int kvm_arm_timer_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr 
*attr);
 int kvm_arm_timer_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr 
*attr);
 
-bool kvm_timer_should_fire(struct arch_timer_context *timer_ctx);
+bool kvm_timer_is_pending(struct kvm_vcpu *vcpu);
+
 void kvm_timer_schedule(struct kvm_vcpu *vcpu);
 void kvm_timer_unschedule(struct kvm_vcpu *vcpu);
 
diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
index f92459a..1d0cd3a 100644
--- a/virt/kvm/arm/arch_timer.c
+++ b/virt/kvm/arm/arch_timer.c
@@ -49,6 +49,7 @@ static const struct kvm_irq_level default_vtimer_irq = {
 static bool kvm_timer_irq_can_fire(struct arch_timer_context *timer_ctx);
 static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level,
 struct arch_timer_context *timer_ctx);
+static bool kvm_timer_should_fire(struct arch_timer_context *timer_ctx);
 
 u64 kvm_phys_timer_read(void)
 {
@@ -223,7 +224,7 @@ static enum hrtimer_restart kvm_phys_timer_expire(struct 
hrtimer *hrt)
return HRTIMER_NORESTART;
 }
 
-bool kvm_timer_should_fire(struct arch_timer_context *timer_ctx)
+static bool kvm_timer_should_fire(struct arch_timer_context *timer_ctx)
 {
u64 cval, now;
 
@@ -236,6 +237,25 @@ bool kvm_timer_should_fire(struct arch_timer_context 
*timer_ctx)
return cval <= now;
 }
 
+bool kvm_timer_is_pending(struct kvm_vcpu *vcpu)
+{
+   struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
+   struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
+
+   if (vtimer->irq.level || ptimer->irq.level)
+   return true;
+
+   /*
+* When this is called from within the wait loop of kvm_vcpu_block(),
+* the software view of the timer state is up to date (timer->loaded
+* is false), and so we can simply check if the timer should fire now.
+*/
+   if (!vtimer->loaded && kvm_timer_should_fire(vtimer))
+   return true;
+
+   return kvm_timer_should_fire(ptimer);
+}
+
 /*
  * Reflect the timer output level into the kvm_run structure
  */
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 14c50d1..bc126fb 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -307,8 +307,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 
 int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
 {
-   return kvm_timer_should_fire(vcpu_vtimer(vcpu)) ||
-  kvm_timer_should_fire(vcpu_ptimer(vcpu));
+   return kvm_timer_is_pending(vcpu);
 }
 
 void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
-- 
2.9.0



[PATCH v3 04/20] KVM: arm/arm64: Guard kvm_vgic_map_is_active against !vgic_initialized

2017-09-22 Thread Christoffer Dall
If the vgic is not initialized, don't try to grab its spinlocks or
traverse its data structures.

This is important because we soon have to start considering the active
state of virtual interrupts when doing vcpu_load, which may happen
early on before the vgic is initialized.

Signed-off-by: Christoffer Dall 
---
 virt/kvm/arm/vgic/vgic.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/virt/kvm/arm/vgic/vgic.c b/virt/kvm/arm/vgic/vgic.c
index fed717e..e1f7dbc 100644
--- a/virt/kvm/arm/vgic/vgic.c
+++ b/virt/kvm/arm/vgic/vgic.c
@@ -777,6 +777,9 @@ bool kvm_vgic_map_is_active(struct kvm_vcpu *vcpu, unsigned 
int virt_irq)
struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, virt_irq);
bool map_is_active;
 
+   if (!vgic_initialized(vcpu->kvm))
+   return false;
+
	spin_lock(&irq->irq_lock);
	map_is_active = irq->hw && irq->active;
	spin_unlock(&irq->irq_lock);
-- 
2.9.0



[PATCH v3 10/20] KVM: arm/arm64: Move timer/vgic flush/sync under disabled irq

2017-09-22 Thread Christoffer Dall
As we are about to play tricks with the timer to be more lazy about saving
and restoring state, we need to move the timer sync and flush functions
under a disabled-irq section, and since we have to flush the vgic state
after the timer and PMU state, we do the whole flush/sync sequence with
irqs disabled.

The only downside is a slightly longer delay before being able to
process hardware interrupts and run softirqs.

Signed-off-by: Christoffer Dall 
---
 virt/kvm/arm/arm.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index b9f68e4..27db222 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -654,11 +654,11 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
 
kvm_pmu_flush_hwstate(vcpu);
 
+   local_irq_disable();
+
kvm_timer_flush_hwstate(vcpu);
kvm_vgic_flush_hwstate(vcpu);
 
-   local_irq_disable();
-
/*
 * If we have a singal pending, or need to notify a userspace
 * irqchip about timer or PMU level changes, then we exit (and
@@ -683,10 +683,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
if (ret <= 0 || need_new_vmid_gen(vcpu->kvm) ||
kvm_request_pending(vcpu)) {
vcpu->mode = OUTSIDE_GUEST_MODE;
-   local_irq_enable();
kvm_pmu_sync_hwstate(vcpu);
kvm_timer_sync_hwstate(vcpu);
kvm_vgic_sync_hwstate(vcpu);
+   local_irq_enable();
preempt_enable();
continue;
}
@@ -710,6 +710,16 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
kvm_arm_clear_debug(vcpu);
 
/*
+* We must sync the PMU and timer state before the vgic state so
+* that the vgic can properly sample the updated state of the
+* interrupt line.
+*/
+   kvm_pmu_sync_hwstate(vcpu);
+   kvm_timer_sync_hwstate(vcpu);
+
+   kvm_vgic_sync_hwstate(vcpu);
+
+   /*
 * We may have taken a host interrupt in HYP mode (ie
 * while executing the guest). This interrupt is still
 * pending, as we haven't serviced it yet!
@@ -732,16 +742,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
guest_exit();
trace_kvm_exit(ret, kvm_vcpu_trap_get_class(vcpu), 
*vcpu_pc(vcpu));
 
-   /*
-* We must sync the PMU and timer state before the vgic state so
-* that the vgic can properly sample the updated state of the
-* interrupt line.
-*/
-   kvm_pmu_sync_hwstate(vcpu);
-   kvm_timer_sync_hwstate(vcpu);
-
-   kvm_vgic_sync_hwstate(vcpu);
-
preempt_enable();
 
ret = handle_exit(vcpu, run, ret);
-- 
2.9.0



[PATCH v3 12/20] genirq: Document vcpu_info usage for percpu_devid interrupts

2017-09-22 Thread Christoffer Dall
It is currently unclear how to set the VCPU affinity for a percpu_devid
interrupt, since the Linux irq_data structure describes the state for
multiple interrupts, one for each physical CPU on the system.  Since
each such interrupt can be associated with different VCPUs or none at
all, associating a single VCPU state with such an interrupt does not
capture the necessary semantics.

The implementers of irq_set_affinity are the Intel and AMD IOMMUs, and
the ARM GIC irqchip.  The Intel and AMD callers do not appear to use
percpu_devid interrupts, and the ARM GIC implementation only checks the
pointer against NULL vs. non-NULL.

Therefore, simply update the function documentation to explain the
expected use in the context of percpu_devid interrupts, allowing future
changes or additions to irqchip implementers to do the right thing.

This allows us to set the VCPU affinity for the virtual timer interrupt
in KVM/ARM, which is a percpu_devid (PPI) interrupt.
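
For reference, the KVM/ARM call site this enables (in a later patch of
this series) looks roughly like the sketch below, passing the per-cpu
'running vcpu' pointer array as the vcpu_info cookie:

	/* Sketch: host_vtimer_irq is the PPI used by the virtual timer. */
	err = irq_set_vcpu_affinity(host_vtimer_irq, kvm_get_running_vcpus());
	if (err)
		kvm_err("kvm_arch_timer: error setting vcpu affinity\n");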

Cc: Thomas Gleixner 
Cc: Marc Zyngier 
Signed-off-by: Christoffer Dall 
---
 kernel/irq/manage.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 573dc52..2b2c94f 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -381,7 +381,8 @@ int irq_select_affinity_usr(unsigned int irq)
 /**
  * irq_set_vcpu_affinity - Set vcpu affinity for the interrupt
  * @irq: interrupt number to set affinity
- * @vcpu_info: vCPU specific data
+ * @vcpu_info: vCPU specific data or pointer to a percpu array of vCPU
+ * specific data for percpu_devid interrupts
  *
  * This function uses the vCPU specific data to set the vCPU
  * affinity for an irq. The vCPU specific data is passed from
-- 
2.9.0



[PATCH v3 03/20] arm64: Use the physical counter when available for read_cycles

2017-09-22 Thread Christoffer Dall
Currently get_cycles() is hardwired to arch_counter_get_cntvct() on
arm64, but as we move to using the physical timer for the in-kernel
time-keeping, we need to make that more flexible.

First, we need to make sure the physical counter can be read on equal
terms to the virtual counter, which includes adding physical counter
read functions for timers that require errata.

Second, we need to make a choice between reading the physical vs virtual
counter, depending on which timer is used for time keeping in the kernel
otherwise.  We can do this using a static key to avoid a performance
penalty during runtime when reading the counter.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Signed-off-by: Christoffer Dall 
---
 arch/arm64/include/asm/arch_timer.h  | 15 ---
 arch/arm64/include/asm/timex.h   |  2 +-
 drivers/clocksource/arm_arch_timer.c | 32 ++--
 3 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/arch_timer.h 
b/arch/arm64/include/asm/arch_timer.h
index 1859a1c..c56d8cd 100644
--- a/arch/arm64/include/asm/arch_timer.h
+++ b/arch/arm64/include/asm/arch_timer.h
@@ -30,6 +30,8 @@
 
 #include 
 
+extern struct static_key_false arch_timer_phys_counter_available;
+
 #if IS_ENABLED(CONFIG_ARM_ARCH_TIMER_OOL_WORKAROUND)
 extern struct static_key_false arch_timer_read_ool_enabled;
 #define needs_unstable_timer_counter_workaround() \
@@ -52,6 +54,7 @@ struct arch_timer_erratum_workaround {
const char *desc;
u32 (*read_cntp_tval_el0)(void);
u32 (*read_cntv_tval_el0)(void);
+   u64 (*read_cntpct_el0)(void);
u64 (*read_cntvct_el0)(void);
int (*set_next_event_phys)(unsigned long, struct clock_event_device *);
int (*set_next_event_virt)(unsigned long, struct clock_event_device *);
@@ -148,10 +151,8 @@ static inline void arch_timer_set_cntkctl(u32 cntkctl)
 
 static inline u64 arch_counter_get_cntpct(void)
 {
-   u64 cval;
isb();
-   asm volatile("mrs %0, cntpct_el0" : "=r" (cval));
-   return cval;
+   return arch_timer_reg_read_stable(cntpct_el0);
 }
 
 static inline u64 arch_counter_get_cntvct(void)
@@ -160,6 +161,14 @@ static inline u64 arch_counter_get_cntvct(void)
return arch_timer_reg_read_stable(cntvct_el0);
 }
 
+static inline u64 arch_counter_get_cycles(void)
+{
+   if (static_branch_unlikely(&arch_timer_phys_counter_available))
+   return arch_counter_get_cntpct();
+   else
+   return arch_counter_get_cntvct();
+}
+
 static inline int arch_timer_arch_init(void)
 {
return 0;
diff --git a/arch/arm64/include/asm/timex.h b/arch/arm64/include/asm/timex.h
index 81a076e..c0d214c 100644
--- a/arch/arm64/include/asm/timex.h
+++ b/arch/arm64/include/asm/timex.h
@@ -22,7 +22,7 @@
  * Use the current timer as a cycle counter since this is what we use for
  * the delay loop.
  */
-#define get_cycles()   arch_counter_get_cntvct()
+#define get_cycles()   arch_counter_get_cycles()
 
 #include 
 
diff --git a/drivers/clocksource/arm_arch_timer.c 
b/drivers/clocksource/arm_arch_timer.c
index 9b3322a..f35da20 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -77,6 +77,9 @@ static bool arch_timer_mem_use_virtual;
 static bool arch_counter_suspend_stop;
 static bool vdso_default = true;
 
+DEFINE_STATIC_KEY_FALSE(arch_timer_phys_counter_available);
+EXPORT_SYMBOL_GPL(arch_timer_phys_counter_available);
+
 static bool evtstrm_enable = IS_ENABLED(CONFIG_ARM_ARCH_TIMER_EVTSTREAM);
 
 static int __init early_evtstrm_cfg(char *buf)
@@ -217,6 +220,11 @@ static u32 notrace fsl_a008585_read_cntv_tval_el0(void)
return __fsl_a008585_read_reg(cntv_tval_el0);
 }
 
+static u64 notrace fsl_a008585_read_cntpct_el0(void)
+{
+   return __fsl_a008585_read_reg(cntpct_el0);
+}
+
 static u64 notrace fsl_a008585_read_cntvct_el0(void)
 {
return __fsl_a008585_read_reg(cntvct_el0);
@@ -258,6 +266,11 @@ static u32 notrace hisi_161010101_read_cntv_tval_el0(void)
return __hisi_161010101_read_reg(cntv_tval_el0);
 }
 
+static u64 notrace hisi_161010101_read_cntpct_el0(void)
+{
+   return __hisi_161010101_read_reg(cntpct_el0);
+}
+
 static u64 notrace hisi_161010101_read_cntvct_el0(void)
 {
return __hisi_161010101_read_reg(cntvct_el0);
@@ -288,6 +301,15 @@ static struct ate_acpi_oem_info hisi_161010101_oem_info[] 
= {
 #endif
 
 #ifdef CONFIG_ARM64_ERRATUM_858921
+static u64 notrace arm64_858921_read_cntpct_el0(void)
+{
+   u64 old, new;
+
+   old = read_sysreg(cntpct_el0);
+   new = read_sysreg(cntpct_el0);
+   return (((old ^ new) >> 32) & 1) ? old : new;
+}
+
 static u64 notrace arm64_858921_read_cntvct_el0(void)
 {
u64 old, new;
@@ -346,6 +368,7 @@ static const struct arch_timer_erratum_workaround 
ool_workarounds[] = {

[PATCH v3 05/20] KVM: arm/arm64: Support calling vgic_update_irq_pending from irq context

2017-09-22 Thread Christoffer Dall
We are about to optimize our timer handling logic which involves
injecting irqs to the vgic directly from the irq handler.

Unfortunately, the injection path can take any AP list lock and irq lock
and we must therefore make sure to use spin_lock_irqsave wherever
interrupts are enabled and we are taking any of those locks, to avoid
deadlocking between process context and the ISR.

This changes a lot of the VGIC code, but the good news is that the
changes are mostly mechanical.

Signed-off-by: Christoffer Dall 
---
 virt/kvm/arm/vgic/vgic-its.c | 17 +++-
 virt/kvm/arm/vgic/vgic-mmio-v2.c | 22 +--
 virt/kvm/arm/vgic/vgic-mmio-v3.c | 17 +++-
 virt/kvm/arm/vgic/vgic-mmio.c| 44 +
 virt/kvm/arm/vgic/vgic-v2.c  |  5 ++--
 virt/kvm/arm/vgic/vgic-v3.c  | 12 
 virt/kvm/arm/vgic/vgic.c | 60 +---
 virt/kvm/arm/vgic/vgic.h |  3 +-
 8 files changed, 108 insertions(+), 72 deletions(-)

diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c
index f51c1e1..9f5e347 100644
--- a/virt/kvm/arm/vgic/vgic-its.c
+++ b/virt/kvm/arm/vgic/vgic-its.c
@@ -278,6 +278,7 @@ static int update_lpi_config(struct kvm *kvm, struct 
vgic_irq *irq,
u64 propbase = GICR_PROPBASER_ADDRESS(kvm->arch.vgic.propbaser);
u8 prop;
int ret;
+   unsigned long flags;
 
ret = kvm_read_guest(kvm, propbase + irq->intid - GIC_LPI_OFFSET,
			     &prop, 1);
@@ -285,15 +286,15 @@ static int update_lpi_config(struct kvm *kvm, struct 
vgic_irq *irq,
if (ret)
return ret;
 
-   spin_lock(&irq->irq_lock);
+   spin_lock_irqsave(&irq->irq_lock, flags);
 
if (!filter_vcpu || filter_vcpu == irq->target_vcpu) {
irq->priority = LPI_PROP_PRIORITY(prop);
irq->enabled = LPI_PROP_ENABLE_BIT(prop);
 
-   vgic_queue_irq_unlock(kvm, irq);
+   vgic_queue_irq_unlock(kvm, irq, flags);
} else {
-   spin_unlock(&irq->irq_lock);
+   spin_unlock_irqrestore(&irq->irq_lock, flags);
}
 
return 0;
@@ -393,6 +394,7 @@ static int its_sync_lpi_pending_table(struct kvm_vcpu *vcpu)
int ret = 0;
u32 *intids;
int nr_irqs, i;
+   unsigned long flags;
 
	nr_irqs = vgic_copy_lpi_list(vcpu, &intids);
if (nr_irqs < 0)
@@ -420,9 +422,9 @@ static int its_sync_lpi_pending_table(struct kvm_vcpu *vcpu)
}
 
irq = vgic_get_irq(vcpu->kvm, NULL, intids[i]);
-   spin_lock(&irq->irq_lock);
+   spin_lock_irqsave(&irq->irq_lock, flags);
irq->pending_latch = pendmask & (1U << bit_nr);
-   vgic_queue_irq_unlock(vcpu->kvm, irq);
+   vgic_queue_irq_unlock(vcpu->kvm, irq, flags);
vgic_put_irq(vcpu->kvm, irq);
}
 
@@ -515,6 +517,7 @@ static int vgic_its_trigger_msi(struct kvm *kvm, struct 
vgic_its *its,
 {
struct kvm_vcpu *vcpu;
struct its_ite *ite;
+   unsigned long flags;
 
if (!its->enabled)
return -EBUSY;
@@ -530,9 +533,9 @@ static int vgic_its_trigger_msi(struct kvm *kvm, struct 
vgic_its *its,
if (!vcpu->arch.vgic_cpu.lpis_enabled)
return -EBUSY;
 
-   spin_lock(&ite->irq->irq_lock);
+   spin_lock_irqsave(&ite->irq->irq_lock, flags);
ite->irq->pending_latch = true;
-   vgic_queue_irq_unlock(kvm, ite->irq);
+   vgic_queue_irq_unlock(kvm, ite->irq, flags);
 
return 0;
 }
diff --git a/virt/kvm/arm/vgic/vgic-mmio-v2.c b/virt/kvm/arm/vgic/vgic-mmio-v2.c
index b3d4a10..e21e2f4 100644
--- a/virt/kvm/arm/vgic/vgic-mmio-v2.c
+++ b/virt/kvm/arm/vgic/vgic-mmio-v2.c
@@ -74,6 +74,7 @@ static void vgic_mmio_write_sgir(struct kvm_vcpu *source_vcpu,
int mode = (val >> 24) & 0x03;
int c;
struct kvm_vcpu *vcpu;
+   unsigned long flags;
 
switch (mode) {
case 0x0:   /* as specified by targets */
@@ -97,11 +98,11 @@ static void vgic_mmio_write_sgir(struct kvm_vcpu 
*source_vcpu,
 
irq = vgic_get_irq(source_vcpu->kvm, vcpu, intid);
 
-   spin_lock(&irq->irq_lock);
+   spin_lock_irqsave(&irq->irq_lock, flags);
irq->pending_latch = true;
irq->source |= 1U << source_vcpu->vcpu_id;
 
-   vgic_queue_irq_unlock(source_vcpu->kvm, irq);
+   vgic_queue_irq_unlock(source_vcpu->kvm, irq, flags);
vgic_put_irq(source_vcpu->kvm, irq);
}
 }
@@ -131,6 +132,7 @@ static void vgic_mmio_write_target(struct kvm_vcpu *vcpu,
u32 intid = VGIC_ADDR_TO_INTID(addr, 8);
	u8 cpu_mask = GENMASK(atomic_read(&vcpu->kvm->online_vcpus) - 1, 0);
int i;
+   unsigned long flags;
 
/* GICD_ITARGETSR[0-7] are read-only */
if (intid < VGIC_NR_PRIVATE_IRQS)
@@ -140,13 +142,13 @@ static void 

[PATCH v3 17/20] KVM: arm/arm64: Move phys_timer_emulate function

2017-09-22 Thread Christoffer Dall
We are about to call phys_timer_emulate() from kvm_timer_update_state()
and modify phys_timer_emulate() at the same time.  Moving the function
and modifying it in a single patch makes the diff hard to read, so do
this separately first.

No functional change.

Signed-off-by: Christoffer Dall 
---
 virt/kvm/arm/arch_timer.c | 32 
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
index d5b632d..1f82c21 100644
--- a/virt/kvm/arm/arch_timer.c
+++ b/virt/kvm/arm/arch_timer.c
@@ -252,6 +252,22 @@ static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, 
bool new_level,
}
 }
 
+/* Schedule the background timer for the emulated timer. */
+static void phys_timer_emulate(struct kvm_vcpu *vcpu,
+ struct arch_timer_context *timer_ctx)
+{
+   struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
+
+   if (kvm_timer_should_fire(timer_ctx))
+   return;
+
+   if (!kvm_timer_irq_can_fire(timer_ctx))
+   return;
+
+   /*  The timer has not yet expired, schedule a background timer */
+   soft_timer_start(&timer->phys_timer, kvm_timer_compute_delta(timer_ctx));
+}
+
 /*
  * Check if there was a change in the timer state (should we raise or lower
  * the line level to the GIC).
@@ -278,22 +294,6 @@ static void kvm_timer_update_state(struct kvm_vcpu *vcpu)
kvm_timer_update_irq(vcpu, !ptimer->irq.level, ptimer);
 }
 
-/* Schedule the background timer for the emulated timer. */
-static void phys_timer_emulate(struct kvm_vcpu *vcpu,
- struct arch_timer_context *timer_ctx)
-{
-   struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
-
-   if (kvm_timer_should_fire(timer_ctx))
-   return;
-
-   if (!kvm_timer_irq_can_fire(timer_ctx))
-   return;
-
-   /*  The timer has not yet expired, schedule a background timer */
-   soft_timer_start(&timer->phys_timer, kvm_timer_compute_delta(timer_ctx));
-}
-
 static void vtimer_save_state(struct kvm_vcpu *vcpu)
 {
	struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
-- 
2.9.0



[PATCH v3 06/20] KVM: arm/arm64: Check that system supports split eoi/deactivate

2017-09-22 Thread Christoffer Dall
Some systems without proper firmware and/or hardware description data
don't support the split EOI and deactivate operation.

On such systems, we cannot leave the physical interrupt active after the
timer handler on the host has run, so we cannot support KVM with an
in-kernel GIC with the timer changes we are about to introduce.

This patch makes sure that trying to initialize the KVM GIC code will
fail on such systems.

Cc: Marc Zyngier 
Signed-off-by: Christoffer Dall 
---
 drivers/irqchip/irq-gic.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
index f641e8e..ab12bf4 100644
--- a/drivers/irqchip/irq-gic.c
+++ b/drivers/irqchip/irq-gic.c
@@ -1420,7 +1420,8 @@ static void __init gic_of_setup_kvm_info(struct 
device_node *node)
if (ret)
return;
 
-   gic_set_kvm_info(&gic_v2_kvm_info);
+   if (static_key_true(&supports_deactivate))
+   gic_set_kvm_info(&gic_v2_kvm_info);
 }
 
 int __init
-- 
2.9.0



[PATCH v3 16/20] KVM: arm/arm64: Use kvm_arm_timer_set/get_reg for guest register traps

2017-09-22 Thread Christoffer Dall
When trapping on a guest access to one of the timer registers, we were
messing with the internals of the timer state from the sysregs handling
code, and that logic was about to receive even more complexity when
optimizing the timer handling code.

Therefore, since we already have timer register access functions (to
access registers from userspace), reuse those for the timer register
traps from a VM and let the timer code maintain its own consistency.

Signed-off-by: Christoffer Dall 
---
 arch/arm64/kvm/sys_regs.c | 41 ++---
 1 file changed, 14 insertions(+), 27 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 2e070d3..bb0e41b 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -841,13 +841,16 @@ static bool access_cntp_tval(struct kvm_vcpu *vcpu,
struct sys_reg_params *p,
const struct sys_reg_desc *r)
 {
-   struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
u64 now = kvm_phys_timer_read();
+   u64 cval;
 
-   if (p->is_write)
-   ptimer->cnt_cval = p->regval + now;
-   else
-   p->regval = ptimer->cnt_cval - now;
+   if (p->is_write) {
+   kvm_arm_timer_set_reg(vcpu, KVM_REG_ARM_PTIMER_CVAL,
+ p->regval + now);
+   } else {
+   cval = kvm_arm_timer_get_reg(vcpu, KVM_REG_ARM_PTIMER_CVAL);
+   p->regval = cval - now;
+   }
 
return true;
 }
@@ -856,24 +859,10 @@ static bool access_cntp_ctl(struct kvm_vcpu *vcpu,
struct sys_reg_params *p,
const struct sys_reg_desc *r)
 {
-   struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
-
-   if (p->is_write) {
-   /* ISTATUS bit is read-only */
-   ptimer->cnt_ctl = p->regval & ~ARCH_TIMER_CTRL_IT_STAT;
-   } else {
-   u64 now = kvm_phys_timer_read();
-
-   p->regval = ptimer->cnt_ctl;
-   /*
-* Set ISTATUS bit if it's expired.
-* Note that according to ARMv8 ARM Issue A.k, ISTATUS bit is
-* UNKNOWN when ENABLE bit is 0, so we chose to set ISTATUS bit
-* regardless of ENABLE bit for our implementation convenience.
-*/
-   if (ptimer->cnt_cval <= now)
-   p->regval |= ARCH_TIMER_CTRL_IT_STAT;
-   }
+   if (p->is_write)
+   kvm_arm_timer_set_reg(vcpu, KVM_REG_ARM_PTIMER_CTL, p->regval);
+   else
+   p->regval = kvm_arm_timer_get_reg(vcpu, KVM_REG_ARM_PTIMER_CTL);
 
return true;
 }
@@ -882,12 +871,10 @@ static bool access_cntp_cval(struct kvm_vcpu *vcpu,
struct sys_reg_params *p,
const struct sys_reg_desc *r)
 {
-   struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
-
if (p->is_write)
-   ptimer->cnt_cval = p->regval;
+   kvm_arm_timer_set_reg(vcpu, KVM_REG_ARM_PTIMER_CVAL, p->regval);
else
-   p->regval = ptimer->cnt_cval;
+   p->regval = kvm_arm_timer_get_reg(vcpu, 
KVM_REG_ARM_PTIMER_CVAL);
 
return true;
 }
-- 
2.9.0



[PATCH v3 09/20] KVM: arm/arm64: Use separate timer for phys timer emulation

2017-09-22 Thread Christoffer Dall
We were using the same hrtimer for emulating the physical timer and for
making sure a blocking VCPU thread would be eventually woken up.  That
worked fine in the previous arch timer design, but as we are about to
actually use the soft timer expire function for the physical timer
emulation, change the logic to use a dedicated hrtimer.

This has the added benefit of not having to cancel any work in the sync
path, which in turn allows us to run the flush and sync with IRQs
disabled.

Signed-off-by: Christoffer Dall 
---
 include/kvm/arm_arch_timer.h |  3 +++
 virt/kvm/arm/arch_timer.c| 18 ++
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
index dcbb2e1..16887c0 100644
--- a/include/kvm/arm_arch_timer.h
+++ b/include/kvm/arm_arch_timer.h
@@ -47,6 +47,9 @@ struct arch_timer_cpu {
/* Work queued with the above timer expires */
struct work_struct  expired;
 
+   /* Physical timer emulation */
+   struct hrtimer  phys_timer;
+
/* Background timer active */
boolarmed;
 
diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
index c2e8326..7f87099 100644
--- a/virt/kvm/arm/arch_timer.c
+++ b/virt/kvm/arm/arch_timer.c
@@ -178,6 +178,12 @@ static enum hrtimer_restart kvm_bg_timer_expire(struct 
hrtimer *hrt)
return HRTIMER_NORESTART;
 }
 
+static enum hrtimer_restart kvm_phys_timer_expire(struct hrtimer *hrt)
+{
+   WARN(1, "Timer only used to ensure guest exit - unexpected event.");
+   return HRTIMER_NORESTART;
+}
+
 bool kvm_timer_should_fire(struct arch_timer_context *timer_ctx)
 {
u64 cval, now;
@@ -255,7 +261,7 @@ static void kvm_timer_update_state(struct kvm_vcpu *vcpu)
 }
 
 /* Schedule the background timer for the emulated timer. */
-static void kvm_timer_emulate(struct kvm_vcpu *vcpu,
+static void phys_timer_emulate(struct kvm_vcpu *vcpu,
  struct arch_timer_context *timer_ctx)
 {
	struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
@@ -267,7 +273,7 @@ static void kvm_timer_emulate(struct kvm_vcpu *vcpu,
return;
 
/*  The timer has not yet expired, schedule a background timer */
-   soft_timer_start(&timer->bg_timer, kvm_timer_compute_delta(timer_ctx));
+   soft_timer_start(&timer->phys_timer, kvm_timer_compute_delta(timer_ctx));
 }
 
 /*
@@ -424,7 +430,7 @@ void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu)
kvm_timer_update_state(vcpu);
 
/* Set the background timer for the physical timer emulation. */
-   kvm_timer_emulate(vcpu, vcpu_ptimer(vcpu));
+   phys_timer_emulate(vcpu, vcpu_ptimer(vcpu));
 
if (unlikely(!irqchip_in_kernel(vcpu->kvm)))
kvm_timer_flush_hwstate_user(vcpu);
@@ -447,7 +453,7 @@ void kvm_timer_sync_hwstate(struct kvm_vcpu *vcpu)
 * This is to cancel the background timer for the physical timer
 * emulation if it is set.
 */
-   soft_timer_cancel(&timer->bg_timer, &timer->expired);
+   soft_timer_cancel(&timer->phys_timer, NULL);
 
/*
 * The guest could have modified the timer registers or the timer
@@ -507,6 +513,9 @@ void kvm_timer_vcpu_init(struct kvm_vcpu *vcpu)
	hrtimer_init(&timer->bg_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
timer->bg_timer.function = kvm_bg_timer_expire;
 
+   hrtimer_init(&timer->phys_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+   timer->phys_timer.function = kvm_phys_timer_expire;
+
vtimer->irq.irq = default_vtimer_irq.irq;
ptimer->irq.irq = default_ptimer_irq.irq;
 }
@@ -615,6 +624,7 @@ void kvm_timer_vcpu_terminate(struct kvm_vcpu *vcpu)
struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
 
	soft_timer_cancel(&timer->bg_timer, &timer->expired);
+   soft_timer_cancel(&timer->phys_timer, NULL);
kvm_vgic_unmap_phys_irq(vcpu, vtimer->irq.irq);
 }
 
-- 
2.9.0



[PATCH v3 11/20] KVM: arm/arm64: Move timer save/restore out of the hyp code

2017-09-22 Thread Christoffer Dall
As we are about to be lazy with saving and restoring the timer
registers, we prepare by moving all possible timer configuration logic
out of the hyp code.  All virtual timer registers can be programmed from
EL1 and since the arch timer is always a level triggered interrupt we
can safely do this with interrupts disabled in the host kernel on the
way to the guest without taking vtimer interrupts in the host kernel
(yet).

The downside is that the cntvoff register can only be programmed from
hyp mode, so we jump into hyp mode and back to program it.  This is also
safe, because the host kernel doesn't use the virtual timer in the KVM
code.  It may add a little performance performance penalty, but only
until following commits where we move this operation to vcpu load/put.

Signed-off-by: Christoffer Dall 
---
 arch/arm/include/asm/kvm_asm.h   |  2 ++
 arch/arm/include/asm/kvm_hyp.h   |  4 +--
 arch/arm/kvm/hyp/switch.c|  7 ++--
 arch/arm64/include/asm/kvm_asm.h |  2 ++
 arch/arm64/include/asm/kvm_hyp.h |  4 +--
 arch/arm64/kvm/hyp/switch.c  |  6 ++--
 virt/kvm/arm/arch_timer.c| 40 ++
 virt/kvm/arm/hyp/timer-sr.c  | 74 +---
 8 files changed, 87 insertions(+), 52 deletions(-)

diff --git a/arch/arm/include/asm/kvm_asm.h b/arch/arm/include/asm/kvm_asm.h
index 14d68a4..36dd296 100644
--- a/arch/arm/include/asm/kvm_asm.h
+++ b/arch/arm/include/asm/kvm_asm.h
@@ -68,6 +68,8 @@ extern void __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, 
phys_addr_t ipa);
 extern void __kvm_tlb_flush_vmid(struct kvm *kvm);
 extern void __kvm_tlb_flush_local_vmid(struct kvm_vcpu *vcpu);
 
+extern void __kvm_timer_set_cntvoff(u32 cntvoff_low, u32 cntvoff_high);
+
 extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu);
 
 extern void __init_stage2_translation(void);
diff --git a/arch/arm/include/asm/kvm_hyp.h b/arch/arm/include/asm/kvm_hyp.h
index 14b5903..ab20ffa 100644
--- a/arch/arm/include/asm/kvm_hyp.h
+++ b/arch/arm/include/asm/kvm_hyp.h
@@ -98,8 +98,8 @@
 #define cntvoff_el2CNTVOFF
 #define cnthctl_el2CNTHCTL
 
-void __timer_save_state(struct kvm_vcpu *vcpu);
-void __timer_restore_state(struct kvm_vcpu *vcpu);
+void __timer_enable_traps(struct kvm_vcpu *vcpu);
+void __timer_disable_traps(struct kvm_vcpu *vcpu);
 
 void __vgic_v2_save_state(struct kvm_vcpu *vcpu);
 void __vgic_v2_restore_state(struct kvm_vcpu *vcpu);
diff --git a/arch/arm/kvm/hyp/switch.c b/arch/arm/kvm/hyp/switch.c
index ebd2dd4..330c9ce 100644
--- a/arch/arm/kvm/hyp/switch.c
+++ b/arch/arm/kvm/hyp/switch.c
@@ -174,7 +174,7 @@ int __hyp_text __kvm_vcpu_run(struct kvm_vcpu *vcpu)
__activate_vm(vcpu);
 
__vgic_restore_state(vcpu);
-   __timer_restore_state(vcpu);
+   __timer_enable_traps(vcpu);
 
__sysreg_restore_state(guest_ctxt);
__banked_restore_state(guest_ctxt);
@@ -191,7 +191,8 @@ int __hyp_text __kvm_vcpu_run(struct kvm_vcpu *vcpu)
 
__banked_save_state(guest_ctxt);
__sysreg_save_state(guest_ctxt);
-   __timer_save_state(vcpu);
+   __timer_disable_traps(vcpu);
+
__vgic_save_state(vcpu);
 
__deactivate_traps(vcpu);
@@ -237,7 +238,7 @@ void __hyp_text __noreturn __hyp_panic(int cause)
 
vcpu = (struct kvm_vcpu *)read_sysreg(HTPIDR);
host_ctxt = kern_hyp_va(vcpu->arch.host_cpu_context);
-   __timer_save_state(vcpu);
+   __timer_disable_traps(vcpu);
__deactivate_traps(vcpu);
__deactivate_vm(vcpu);
__banked_restore_state(host_ctxt);
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 26a64d0..ab4d0a9 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -55,6 +55,8 @@ extern void __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, 
phys_addr_t ipa);
 extern void __kvm_tlb_flush_vmid(struct kvm *kvm);
 extern void __kvm_tlb_flush_local_vmid(struct kvm_vcpu *vcpu);
 
+extern void __kvm_timer_set_cntvoff(u32 cntvoff_low, u32 cntvoff_high);
+
 extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu);
 
 extern u64 __vgic_v3_get_ich_vtr_el2(void);
diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index 4572a9b..08d3bb6 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -129,8 +129,8 @@ void __vgic_v3_save_state(struct kvm_vcpu *vcpu);
 void __vgic_v3_restore_state(struct kvm_vcpu *vcpu);
 int __vgic_v3_perform_cpuif_access(struct kvm_vcpu *vcpu);
 
-void __timer_save_state(struct kvm_vcpu *vcpu);
-void __timer_restore_state(struct kvm_vcpu *vcpu);
+void __timer_enable_traps(struct kvm_vcpu *vcpu);
+void __timer_disable_traps(struct kvm_vcpu *vcpu);
 
 void __sysreg_save_host_state(struct kvm_cpu_context *ctxt);
 void __sysreg_restore_host_state(struct kvm_cpu_context *ctxt);
diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c

[PATCH v3 00/20] KVM: arm/arm64: Optimize arch timer register handling

2017-09-22 Thread Christoffer Dall
We currently spend a measurable amount of time on each entry/exit to the
guest dealing with arch timer registers, even when the timer is not
pending and not doing anything (on certain architectures).

We can do much better by moving the arch timer save/restore to the
vcpu_load and vcpu_put functions, but this means that if we don't read
back the timer state on every exit from the guest, then we have to be
able to start taking timer interrupts for the virtual timer in KVM and
handle that properly.
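
As a rough sketch of the shape this takes (the save/restore helpers here are
hypothetical stand-ins; enable/disable_percpu_irq() and host_vtimer_irq are
existing kernel pieces the series builds on):

/* Sketch only: program the guest timer state once per load/put instead of
 * on every world-switch, and let the host take the vtimer IRQ in between. */
void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu)
{
	timer_restore_guest_state(vcpu);	/* hypothetical helper */
	enable_percpu_irq(host_vtimer_irq, 0);
}

void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
{
	timer_save_guest_state(vcpu);		/* hypothetical helper */
	disable_percpu_irq(host_vtimer_irq);
}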

That has a number of entertaining consequences, such as having to make
sure we don't deadlock between any of the vgic code and interrupt
injection happening from an ISR.  On the plus side, being able to inject
virtual interrupts corresponding to a physical interrupt directly from
an ISR is probably a good system design change overall.

We also have to change the use of the physical vs. virtual counter in
the arm64 kernel to avoid having to save/restore the CNTVOFF_EL2
register on every return to the hypervisor.  The only reason I could
find for using the virtual counter for the kernel on systems with access
to the physical counter is to detect if firmware did not properly clear
CNTVOFF_EL2, and this change has to be weighed against the existing check
(assuming I got this right).

On a non-VHE system (AMD Seattle) I have measured this to improve the
world-switch time by about ~100 cycles, but on an EL2 kernel (emulating
VHE behavior on the same hardware) this gives us around ~250 cycles
worth of improvement, and on Thunder-X we seem to get ~650 cycles
improvement, because we can avoid the extra configuration of trapping
accesses to the physical timer from EL1 on every switch.

These patches require that the GICv2 hardware (on such systems) is
properly reported by firmware to have the extra CPU interface page for
the deactivate register.

Code is also available here:
git://git.kernel.org/pub/scm/linux/kernel/git/cdall/linux.git timer-optimize-v3

Based on v4.14-rc1

Some recent numbers I ran on Thunder-X with v4.14-rc1 with/without these
patches (averaged over a few hundred thousand executions) for a base
hypercall cost:

Without this series, avg. cycles: 12,476
Without this series, min. cycles: 12,052

With this series, avg. cycles: 11,782
With this series, min. cycles: 11,435

Improvement ~650 cycles (over 5%)

Changes since v2:
 - Removed RFC tag
 - Included Marc's patch to support EOI/deactivate on broken firmware
   systems
 - Simplified patch 6 (was patch 5 in RFC v2)
 - Clarify percpu_devid interrupts in patch 12 (was patch 11)

Thanks,
  Christoffer

Christoffer Dall (19):
  arm64: Use physical counter for in-kernel reads
  arm64: Use the physical counter when available for read_cycles
  KVM: arm/arm64: Guard kvm_vgic_map_is_active against !vgic_initialized
  KVM: arm/arm64: Support calling vgic_update_irq_pending from irq
context
  KVM: arm/arm64: Check that system supports split eoi/deactivate
  KVM: arm/arm64: Make timer_arm and timer_disarm helpers more generic
  KVM: arm/arm64: Rename soft timer to bg_timer
  KVM: arm/arm64: Use separate timer for phys timer emulation
  KVM: arm/arm64: Move timer/vgic flush/sync under disabled irq
  KVM: arm/arm64: Move timer save/restore out of the hyp code
  genirq: Document vcpu_info usage for percpu_devid interrupts
  KVM: arm/arm64: Set VCPU affinity for virt timer irq
  KVM: arm/arm64: Avoid timer save/restore in vcpu entry/exit
  KVM: arm/arm64: Support EL1 phys timer register access in set/get reg
  KVM: arm/arm64: Use kvm_arm_timer_set/get_reg for guest register traps
  KVM: arm/arm64: Move phys_timer_emulate function
  KVM: arm/arm64: Avoid phys timer emulation in vcpu entry/exit
  KVM: arm/arm64: Get rid of kvm_timer_flush_hwstate
  KVM: arm/arm64: Rework kvm_timer_should_fire

Marc Zyngier (1):
  irqchip/gic: Deal with broken firmware exposing only 4kB of GICv2 CPU
interface

 Documentation/admin-guide/kernel-parameters.txt |   7 +
 arch/arm/include/asm/kvm_asm.h  |   2 +
 arch/arm/include/asm/kvm_hyp.h  |   4 +-
 arch/arm/include/uapi/asm/kvm.h |   6 +
 arch/arm/kvm/hyp/switch.c   |   7 +-
 arch/arm64/include/asm/arch_timer.h |  18 +-
 arch/arm64/include/asm/kvm_asm.h|   2 +
 arch/arm64/include/asm/kvm_hyp.h|   4 +-
 arch/arm64/include/asm/timex.h  |   2 +-
 arch/arm64/include/uapi/asm/kvm.h   |   6 +
 arch/arm64/kvm/hyp/switch.c |   6 +-
 arch/arm64/kvm/sys_regs.c   |  41 +--
 drivers/clocksource/arm_arch_timer.c|  33 +-
 drivers/irqchip/irq-gic.c   |  74 +++-
 include/kvm/arm_arch_timer.h|  19 +-
 kernel/irq/manage.c |   3 +-
 virt/kvm/arm/arch_timer.c   | 446 
 virt/kvm/arm/arm.c  |  45 ++-
 virt/kvm/arm/hyp/timer-sr.c

[PATCH v3 13/20] KVM: arm/arm64: Set VCPU affinity for virt timer irq

2017-09-22 Thread Christoffer Dall
As we are about to take physical interrupts for the virtual timer on the
host but want to leave those active while running the VM (and let the VM
deactivate them), we need to set the vtimer PPI affinity accordingly.

Signed-off-by: Christoffer Dall 
---
 virt/kvm/arm/arch_timer.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
index 4254f88..4275f8f 100644
--- a/virt/kvm/arm/arch_timer.c
+++ b/virt/kvm/arm/arch_timer.c
@@ -650,11 +650,20 @@ int kvm_timer_hyp_init(void)
return err;
}
 
+   err = irq_set_vcpu_affinity(host_vtimer_irq, kvm_get_running_vcpus());
+   if (err) {
+   kvm_err("kvm_arch_timer: error setting vcpu affinity\n");
+   goto out_free_irq;
+   }
+
kvm_info("virtual timer IRQ%d\n", host_vtimer_irq);
 
cpuhp_setup_state(CPUHP_AP_KVM_ARM_TIMER_STARTING,
  "kvm/arm/timer:starting", kvm_timer_starting_cpu,
  kvm_timer_dying_cpu);
+   return 0;
+out_free_irq:
+   free_percpu_irq(host_vtimer_irq, kvm_get_running_vcpus());
return err;
 }
 
-- 
2.9.0



[PATCH v3 01/20] irqchip/gic: Deal with broken firmware exposing only 4kB of GICv2 CPU interface

2017-09-22 Thread Christoffer Dall
From: Marc Zyngier 

There is a lot of broken firmware out there that doesn't really
expose the information the kernel requires when it comes to dealing
with GICv2:

(1) Firmware that only describes the first 4kB of GICv2
(2) Firmware that describes 128kB of CPU interface, while
the usable portion of the address space is between
60 and 68kB

So far, we only deal with (2). But we have platforms exhibiting
behaviour (1), resulting in two sub-cases:
(a) The GIC is occupying 8kB, as required by the GICv2 architecture
(b) It is actually spread over 128kB, and this is likely to be a version
of (2)

This patch tries to work around both (a) and (b) by poking at
the outside of the described memory region, and trying to work out
what is actually there. This is of course unsafe, and should
only be enabled if there is no way to otherwise fix the DT provided
by the firmware (we provide a "irqchip.gicv2_force_probe" option
to that effect).

Note that for the time being, we restrict ourselves to GICv2
implementations provided by ARM, since I have no knowledge
of alternative implementations. This could be relaxed if such
an implementation comes to light on a broken platform.

Signed-off-by: Marc Zyngier 
Signed-off-by: Christoffer Dall 
---
 Documentation/admin-guide/kernel-parameters.txt |  7 +++
 drivers/irqchip/irq-gic.c   | 71 +
 2 files changed, 69 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 0549662..3daa0a5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1713,6 +1713,13 @@
irqaffinity=[SMP] Set the default irq affinity mask
The argument is a cpu list, as described above.
 
+   irqchip.gicv2_force_probe=
+   [ARM, ARM64]
+   Format: 
+   Force the kernel to look for the second 4kB page
+   of a GICv2 controller even if the memory range
+   exposed by the device tree is too small.
+
irqfixup[HW]
When an interrupt is not handled search all handlers
for it. Intended to get systems with badly broken
diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
index 651d726..f641e8e 100644
--- a/drivers/irqchip/irq-gic.c
+++ b/drivers/irqchip/irq-gic.c
@@ -1256,6 +1256,19 @@ static void gic_teardown(struct gic_chip_data *gic)
 
 #ifdef CONFIG_OF
 static int gic_cnt __initdata;
+static bool gicv2_force_probe;
+
+static int __init gicv2_force_probe_cfg(char *buf)
+{
+   return strtobool(buf, &gicv2_force_probe);
+}
+early_param("irqchip.gicv2_force_probe", gicv2_force_probe_cfg);
+
+static bool gic_check_gicv2(void __iomem *base)
+{
+   u32 val = readl_relaxed(base + GIC_CPU_IDENT);
+   return (val & 0xff0fff) == 0x02043B;
+}
 
 static bool gic_check_eoimode(struct device_node *node, void __iomem **base)
 {
@@ -1265,20 +1278,60 @@ static bool gic_check_eoimode(struct device_node *node, 
void __iomem **base)
 
if (!is_hyp_mode_available())
return false;
-   if (resource_size(&cpuif_res) < SZ_8K)
-   return false;
-   if (resource_size(&cpuif_res) == SZ_128K) {
-   u32 val_low, val_high;
+   if (resource_size(&cpuif_res) < SZ_8K) {
+   void __iomem *alt;
+   /*
+* Check for a stupid firmware that only exposes the
+* first page of a GICv2.
+*/
+   if (!gic_check_gicv2(*base))
+   return false;
 
+   if (!gicv2_force_probe) {
+   pr_warn("GIC: GICv2 detected, but range too small and 
irqchip.gicv2_force_probe not set\n");
+   return false;
+   }
+
+   alt = ioremap(cpuif_res.start, SZ_8K);
+   if (!alt)
+   return false;
+   if (!gic_check_gicv2(alt + SZ_4K)) {
+   /*
+* The first page was that of a GICv2, and
+* the second was *something*. Let's trust it
+* to be a GICv2, and update the mapping.
+*/
+   pr_warn("GIC: GICv2 at %pa, but range is too small 
(broken DT?), assuming 8kB\n",
+   &cpuif_res.start);
+   iounmap(*base);
+   *base = alt;
+   return true;
+   }
+
+   /*
+* We detected *two* initial GICv2 pages in a
+* row. Could be a GICv2 aliased over two 64kB
+* pages. Update the resource, map the iospace, and
+* pray.
+*/
+   

Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-09-22 Thread James Morse
Hi gengdongjiu,

On 18/09/17 14:36, gengdongjiu wrote:
> On 2017/9/14 21:00, James Morse wrote:
>> On 13/09/17 08:32, gengdongjiu wrote:
>>> On 2017/9/8 0:30, James Morse wrote:
 On 28/08/17 11:38, Dongjiu Geng wrote:
 For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by
 an access or not.
>>
>> Actually it looks like we can: I thought 'BUS_MCEERR_AR' could be triggered 
>> via
>> some CPER flags, but its not. The only code that flags MF_ACTION_REQUIRED is
>> x86's kernel-first handling, which nicely matches this 'direct access' 
>> problem.
>> BUS_MCEERR_AR also come from KVM stage2 faults (and the x86 equivalent). 
>> Powerpc
>> also triggers these directly, both from what look to be synchronous paths, 
>> so I
>> think its fair to equate BUS_MCEERR_AR to a synchronous access and 
>> BUS_MCEERR_AO
>> to something_else.
> 
> James, thanks for your explanation.
> can I understand that your meaning that "BUS_MCEERR_AR" stands for 
> synchronous access and BUS_MCEERR_AO stands for asynchronous access?

Not 'stands for', as the AR is Action-Required and AO Action-Optional. My point
was I can't find a case where Action-Required is used for an error that isn't
synchronous.

We should run this past the people who maintain the existing BUS_MCEERR_AR
users, in case its just a severity to them.


> Then for "BUS_MCEERR_AO", how to distinguish it is asynchronous data 
> access(SError) and PCIE AER error?

How would userspace get one of these memory errors for a PCIe error?


> In the user space, we can check the si_code, if it is "BUS_MCEERR_AR", we use 
> SEA notification type for the guest;
> if it is "BUS_MCEERR_AO", we use SEI notification type for the guest.
> Because there are only two values for si_code("BUS_MCEERR_AR" and 
> BUS_MCEERR_AO), in which case we can use the GSIV(IRQ) notification type?

This is for Qemu/kvmtool to decide; it depends on what sort of machine they are
emulating.

For example, the physical machine's memory-controller may notify the CPU about
memory errors by triggering SError trapped to EL3, or with a dedicated FIQ, also
routed to EL3. By the time this gets to the host kernel the distinction doesn't
matter. The host has handled the error.

For a guest, your memory-controller is effectively the host kernel. It will give
you a BUS_MCEERR_AO signal for any guest memory that is affected, and a
BUS_MCEERR_AR if the guest directly accesses a page of affected memory.

What Qemu/kvmtool do with this is up to them. If they're emulating a machine
with no RAS features, they can just print an error and exit.

Otherwise BUS_MCEERR_AR could be notified as one of the flavours of IRQ, unless
the affected vcpu has interrupts masked, in which case an SEA notification gives
you some NMI-like behaviour.

For BUS_MCEERR_AO you could use SEI, IRQ or polled notification. My choice would
be IRQ, as you can't know if the guest supports SEI and it would be a shame to
kill it with an SError if the affected memory was free. SEA for synchronous
errors is still a good choice even if the guest doesn't support it as that
memory is still gone so it's still a valid guest:Synchronous-external-abort.
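
As a rough illustration of the user-space side of this (a sketch only: the
BUS_MCEERR_* si_codes and sigaction() are standard, while the notify_guest_*()
helpers are hypothetical VMM functions):

#include <signal.h>
#include <stdlib.h>

/* Sketch of a Qemu/kvmtool-style SIGBUS handler dispatching on si_code. */
static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	void *addr = si->si_addr;		/* guest page affected by the error */

	switch (si->si_code) {
	case BUS_MCEERR_AR:			/* synchronous access to poisoned memory */
		notify_guest_sea(addr);		/* hypothetical: SEA notification */
		break;
	case BUS_MCEERR_AO:			/* memory affected, but not (yet) accessed */
		notify_guest_irq(addr);		/* hypothetical: IRQ or polled notification */
		break;
	default:
		abort();
	}
}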


[...]

>>> 1. Let us firstly discuss the SEA and SEI, there are different workflow for 
>>> the two different Errors.

>> user-space can choose whether to use SEA or SEI, it doesn't have to choose 
>> the
>> same notification type that firmware used, which in turn doesn't have to be 
>> the
>> same as that used by the CPU to notify firmware.
>>
>> The choice only matters because these notifications hang on an existing 
>> pieces
>> of the Arm-architecture, so the notification can only add to the 
>> architecturally
>> defined meaning. (i.e. You can only send an SEA for something that can 
>> already
>> be described as a synchronous external abort).
>>
>> Once we get to user-space, for memory_failure() notifications, (which so far 
>> is
>> all we are talking about here), the only thing that could matter is whether 
>> the
>> guest hit a PG_hwpoison page as a stage2 fault. These can be described as
>> Synchronous-External-Abort.
>>
>> The Synchronous-External-Abort/SError-Interrupt distinction matters for the 
>> CPU
>> because it can't always make an error synchronous. For memory_failure()
>> notifications to a KVM guest we really can do this, and we already have this
>> behaviour for free. An example:
>>
>> A guest touches some hardware:poisoned memory, for whatever reason the CPU 
>> can't
>> put the world back together to make this a synchronous exception, so it 
>> reports
>> it to firmware as an SError-interrupt.
> 
>> Linux gets an APEI notification and memory_failure() causes the affected 
>> page to
>> be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space.
>>
>> Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. 
>> AO->
>> action optional, probably asynchronous.

> If so, in this case, Qemu/kvmtool only got a little information(receive a 
> 

Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-09-22 Thread James Morse
Hi gengdongjiu,

On 21/09/17 08:55, gengdongjiu wrote:
> On 2017/9/14 21:00, James Morse wrote:
>> user-space can choose whether to use SEA or SEI, it doesn't have to choose 
>> the
>> same notification type that firmware used, which in turn doesn't have to be 
>> the
>> same as that used by the CPU to notify firmware.
>>
>> The choice only matters because these notifications hang on an existing 
>> pieces
>> of the Arm-architecture, so the notification can only add to the 
>> architecturally
>> defined meaning. (i.e. You can only send an SEA for something that can 
>> already
>> be described as a synchronous external abort).
>>
>> Once we get to user-space, for memory_failure() notifications, (which so far 
>> is
>> all we are talking about here), the only thing that could matter is whether 
>> the
>> guest hit a PG_hwpoison page as a stage2 fault. These can be described as
>> Synchronous-External-Abort.
>>
>> The Synchronous-External-Abort/SError-Interrupt distinction matters for the 
>> CPU
>> because it can't always make an error synchronous. For memory_failure()
>> notifications to a KVM guest we really can do this, and we already have this
>> behaviour for free. An example:
>>
>> A guest touches some hardware:poisoned memory, for whatever reason the CPU 
>> can't
>> put the world back together to make this a synchronous exception, so it 
>> reports
>> it to firmware as an SError-interrupt.
>> Linux gets an APEI notification and memory_failure() causes the affected 
>> page to
>> be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space.
>>
>> Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. 
>> AO->
>> action optional, probably asynchronous.
>>
>> But in our example it wasn't really asynchronous, that was just a property of
>> the original CPU->firmware notification. What happens? The guest vcpu is 
>> re-run,
>> it re-runs the same instructions (this was a contained error so KVM's ELR 
>> points
>> at/before the instruction that steps in the problem). This time KVM takes a
>> stage2 fault, which the mm code will refuse to fixup because the relevant 
>> page
>> was marked as PG_hwpoision by memory_failure(). KVM signals Qemu/kvmtool with
>> SIGBUS_MCEERR_AR. Now Qemu/kvmtool can notify the guest using SEA.
> 
> CC Achin
> 
> I have some personal opinion, if you think it is not right, hope you can 
> point out.
> 
> Synchronous External Abort and SError Interrupt are hardware 
> exception(hardware concept),
> which is independent of software notification,
> in armv8 without RAS, the two concepts already exist. In the APEI spec, in 
> order to
> better describe the two exceptions, so use SEA and SEI notification to stand
for them.

> SEA notification stands for Synchronous External Abort, so may be it is not 
> only a
> notification, it also stands for a hardware error type.
> SEI notification stands for SError Interrupt, so may be it is not only a 
> notification,
> it also stands for a hardware error type.

> In the OS, it has different handling flow to the two exception(two 
> notification):
> when the guest OS running, if the hardware generates a Synchronous External 
> Abort, we
> told the guest OS this error is SError Interrupt instead of Synchronous
External Abort.

This should only happen when APEI doesn't claim the external-abort as a RAS
notification. If there were CPER records to process then the error is handled by
the host, and we can return to the guest.

If this wasn't a firmware-first notification, then you're right: KVM hands the
guest an asynchronous external abort. This could be considered a bug in KVM (we
can discuss with Marc and Christoffer what it should do), but:

I'm not sure what scenario you could see this in: surely all your
CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first
notifications. So they should always be claimed by APEI.


> guest OS uses SEI notification handling flow to deal with it, I am not sure 
> whether it
> will have problem, because the true hardware exception is Synchronous External
> Abort, but software treats it as SError interrupt to handle.

Once you're into a guest the original 'true hardware exception' shouldn't
matter. In this scenario KVM has handed the guest an SError; our question is 'is
it an SEI notification?':

For firmware first the guest OS should poke around in the CPER buffers, find
nothing to do, and return to the arch code for (future) kernel-first.
For kernel first the guest OS should trawl through the v8.2 ERR registers, find
nothing to do, and continue to the default case:

By default, we should panic on SError, unless its classified as a non-fatal RAS
error. (I'm tempted to pr_warn_once() if we get RAS notifications but there is
no work to do).


What you may be seeing is some awkwardness with the change in the SError ESR
with v8.2. Previously the VSE mechanism injected an impdef SError (but they
were all impdef so it didn't matter).
With VSESR_EL2 KVM has to 

[PATCH v3 11/13] firmware: arm_sdei: add support for CPU private events

2017-09-22 Thread James Morse
Private SDE events are per-cpu, and need to be registered and enabled
on each CPU.

Hide this detail from the caller by adapting our {,un}register and
{en,dis}able calls to send an IPI to each CPU if the event is private.

CPU private events are unregistered when the CPU is powered-off, and
re-registered when the CPU is brought back online. This saves bringing
secondary cores back online to call private_reset() on shutdown, kexec
and resume from hibernate.

Signed-off-by: James Morse 
---
 drivers/firmware/arm_sdei.c | 242 +---
 1 file changed, 228 insertions(+), 14 deletions(-)

diff --git a/drivers/firmware/arm_sdei.c b/drivers/firmware/arm_sdei.c
index 5656830efff4..3bee2719d6f4 100644
--- a/drivers/firmware/arm_sdei.c
+++ b/drivers/firmware/arm_sdei.c
@@ -20,9 +20,11 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -65,12 +67,49 @@ struct sdei_event {
u8  priority;
 
/* This pointer is handed to firmware as the event argument. */
-   struct sdei_registered_event *registered;
+   union {
+   /* Shared events */
+   struct sdei_registered_event *registered;
+
+   /* CPU private events */
+   struct sdei_registered_event __percpu *private_registered;
+   };
 };
 
 static LIST_HEAD(sdei_events);
 static DEFINE_SPINLOCK(sdei_events_lock);
 
+/* When frozen, cpu-hotplug notifiers shouldn't unregister/re-register events 
*/
+static bool frozen;
+
+/* Private events are registered/enabled via IPI passing one of these */
+struct sdei_crosscall_args {
+   struct sdei_event *event;
+   atomic_t errors;
+   int first_error;
+};
+
+#define CROSSCALL_INIT(arg, event) (arg.event = event, \
+arg.first_error = 0, \
+atomic_set(&arg.errors, 0))
+
+static inline int sdei_do_cross_call(void *fn, struct sdei_event * event)
+{
+   struct sdei_crosscall_args arg;
+
+   CROSSCALL_INIT(arg, event);
+   on_each_cpu(fn, &arg, true);
+
+   return arg.first_error;
+}
+
+static inline void
+sdei_cross_call_return(struct sdei_crosscall_args *arg, int err)
+{
+   if (err && (atomic_inc_return(&arg->errors) == 1))
+   arg->first_error = err;
+}
+
 static int sdei_to_linux_errno(unsigned long sdei_err)
 {
switch (sdei_err) {
@@ -214,6 +253,26 @@ static struct sdei_event *sdei_event_create(u32 event_num,
reg->callback = cb;
reg->callback_arg = cb_arg;
event->registered = reg;
+   } else {
+   int cpu;
+   struct sdei_registered_event __percpu *regs;
+
+   regs = alloc_percpu(struct sdei_registered_event);
+   if (!regs) {
+   kfree(event);
+   return ERR_PTR(-ENOMEM);
+   }
+
+   for_each_possible_cpu(cpu) {
+   reg = per_cpu_ptr(regs, cpu);
+
+   reg->event_num = event->event_num;
+   reg->priority = event->priority;
+   reg->callback = cb;
+   reg->callback_arg = cb_arg;
+   }
+
+   event->private_registered = regs;
}
 
if (sdei_event_find(event_num)) {
@@ -235,6 +294,8 @@ static void sdei_event_destroy(struct sdei_event *event)
 
if (event->type == SDEI_EVENT_TYPE_SHARED)
kfree(event->registered);
+   else
+   free_percpu(event->private_registered);
 
kfree(event);
 }
@@ -265,11 +326,6 @@ static void _ipi_mask_cpu(void *ignored)
sdei_mask_local_cpu();
 }
 
-static int sdei_cpuhp_down(unsigned int ignored)
-{
-   return sdei_mask_local_cpu();
-}
-
 int sdei_unmask_local_cpu(void)
 {
int err;
@@ -291,11 +347,6 @@ static void _ipi_unmask_cpu(void *ignored)
sdei_unmask_local_cpu();
 }
 
-static int sdei_cpuhp_up(unsigned int ignored)
-{
-   return sdei_unmask_local_cpu();
-}
-
 static void _ipi_private_reset(void *ignored)
 {
int err;
@@ -346,6 +397,16 @@ static int sdei_api_event_enable(u32 event_num)
  0, NULL);
 }
 
+static void _ipi_event_enable(void *data)
+{
+   int err;
+   struct sdei_crosscall_args *arg = data;
+
+   err = sdei_api_event_enable(arg->event->event_num);
+
+   sdei_cross_call_return(arg, err);
+}
+
 int sdei_event_enable(u32 event_num)
 {
int err = -EINVAL;
@@ -357,6 +418,8 @@ int sdei_event_enable(u32 event_num)
err = -ENOENT;
else if (event->type == SDEI_EVENT_TYPE_SHARED)
err = sdei_api_event_enable(event->event_num);
+   else
+   err = sdei_do_cross_call(_ipi_event_enable, event);
spin_unlock(&sdei_events_lock);
 
return err;
@@ -369,6 +432,16 @@ static int 

[PATCH v3 07/13] firmware: arm_sdei: Add driver for Software Delegated Exceptions

2017-09-22 Thread James Morse
The Software Delegated Exception Interface (SDEI) is an ARM standard
for registering callbacks from the platform firmware into the OS.
This is typically used to implement firmware notifications (such as
firmware-first RAS) or to handle an IRQ that has been promoted to a
firmware-assisted NMI.

Add the code for detecting the SDEI version and the framework for
registering and unregistering events. Subsequent patches will add the
arch-specific backend code and the necessary power management hooks.

Only shared events are supported; power management, private events and
discovery for ACPI systems will be added by later patches.

Signed-off-by: James Morse 
---
Changes since v2:
 * Copy the priority into the structure the arch asm handler gets. This
   is used for VMAP stacks where we can't know if a critical event interrupted
   a normal priority event, thus they need separate stacks.

Changes since v1:
 * Changed entry point to unsigned long, if we support non-vhe systems this
   won't be a valid pointer
 * Made invoke_sdei_fn() pass the only register we are interested in, instead
   of the whole arm_smccc_res
 * Made all the locking WARN_ON()s lockdep_assert_held()s.
 * Moved more messages from 'err' to 'warn'.
 * Made IS_SDEI_CALL() not depend on whether the config option is selected.
 * Made 'event failed' messages rate limited.

 drivers/firmware/Kconfig|   7 +
 drivers/firmware/Makefile   |   1 +
 drivers/firmware/arm_sdei.c | 616 
 include/linux/sdei.h| 100 +++
 include/uapi/linux/sdei.h   |  91 +++
 5 files changed, 815 insertions(+)
 create mode 100644 drivers/firmware/arm_sdei.c
 create mode 100644 include/linux/sdei.h
 create mode 100644 include/uapi/linux/sdei.h

diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig
index 6e4ed5a9c6fd..d8a9859f6102 100644
--- a/drivers/firmware/Kconfig
+++ b/drivers/firmware/Kconfig
@@ -48,6 +48,13 @@ config ARM_SCPI_POWER_DOMAIN
  This enables support for the SCPI power domains which can be
  enabled or disabled via the SCP firmware
 
+config ARM_SDE_INTERFACE
+   bool "ARM Software Delegated Exception Interface (SDEI)"
+   help
+ The Software Delegated Exception Interface (SDEI) is an ARM
+ standard for registering callbacks from the platform firmware
+ into the OS. This is typically used to implement RAS notifications.
+
 config EDD
tristate "BIOS Enhanced Disk Drive calls determine boot disk"
depends on X86
diff --git a/drivers/firmware/Makefile b/drivers/firmware/Makefile
index a37f12e8d137..229984feca4a 100644
--- a/drivers/firmware/Makefile
+++ b/drivers/firmware/Makefile
@@ -5,6 +5,7 @@ obj-$(CONFIG_ARM_PSCI_FW)   += psci.o
 obj-$(CONFIG_ARM_PSCI_CHECKER) += psci_checker.o
 obj-$(CONFIG_ARM_SCPI_PROTOCOL)+= arm_scpi.o
 obj-$(CONFIG_ARM_SCPI_POWER_DOMAIN) += scpi_pm_domain.o
+obj-$(CONFIG_ARM_SDE_INTERFACE)+= arm_sdei.o
 obj-$(CONFIG_DMI)  += dmi_scan.o
 obj-$(CONFIG_DMI_SYSFS)+= dmi-sysfs.o
 obj-$(CONFIG_EDD)  += edd.o
diff --git a/drivers/firmware/arm_sdei.c b/drivers/firmware/arm_sdei.c
new file mode 100644
index ..4b3c7fd99c34
--- /dev/null
+++ b/drivers/firmware/arm_sdei.c
@@ -0,0 +1,616 @@
+/*
+ * Software Delegated Exception Interface (SDEI) driver code.
+ *
+ * Copyright (C) 2017 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program. If not, see .
+ */
+
+#define pr_fmt(fmt) "sdei: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * The call to use to reach the firmware.
+ */
+static asmlinkage void (*sdei_firmware_call)(unsigned long function_id,
+ unsigned long arg0, unsigned long arg1,
+ unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, struct arm_smccc_res *res);
+
+/* entry point from firmware to arch asm code */
+static unsigned long sdei_entry_point;
+
+struct sdei_event {
+   struct list_headlist;
+   u32 event_num;
+   u8  type;
+   u8  priority;
+
+   /* This pointer is handed to firmware as the event 

[PATCH v3 04/13] arm64: alternatives: use tpidr_el2 on VHE hosts

2017-09-22 Thread James Morse
Now that KVM uses tpidr_el2 in the same way as Linux's cpu_offset in
tpidr_el1, merge the two. This saves KVM from save/restoring tpidr_el1
on VHE hosts, and allows future code to blindly access per-cpu variables
without triggering world-switch.

Signed-off-by: James Morse 
Reviewed-by: Christoffer Dall 
---
Changes since v1:
 * cpu_copy_el2regs()'s 'have I been patched' test now always sets a register,
   just in case the compiler optimises out part of the logic.

 arch/arm64/include/asm/assembler.h |  8 
 arch/arm64/include/asm/percpu.h| 11 +--
 arch/arm64/include/asm/processor.h |  1 +
 arch/arm64/kernel/cpufeature.c | 23 +++
 arch/arm64/mm/proc.S   |  8 
 5 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/assembler.h 
b/arch/arm64/include/asm/assembler.h
index d58a6253c6ab..1ba12f59dec0 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -242,7 +242,11 @@ lr .reqx30 // link register
 #else
adr_l   \dst, \sym
 #endif
+alternative_if_not ARM64_HAS_VIRT_HOST_EXTN
mrs \tmp, tpidr_el1
+alternative_else
+   mrs \tmp, tpidr_el2
+alternative_endif
add \dst, \dst, \tmp
.endm
 
@@ -253,7 +257,11 @@ lr .reqx30 // link register
 */
.macro ldr_this_cpu dst, sym, tmp
adr_l   \dst, \sym
+alternative_if_not ARM64_HAS_VIRT_HOST_EXTN
mrs \tmp, tpidr_el1
+alternative_else
+   mrs \tmp, tpidr_el2
+alternative_endif
ldr \dst, [\dst, \tmp]
.endm
 
diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
index 3bd498e4de4c..43393208229e 100644
--- a/arch/arm64/include/asm/percpu.h
+++ b/arch/arm64/include/asm/percpu.h
@@ -16,11 +16,15 @@
 #ifndef __ASM_PERCPU_H
 #define __ASM_PERCPU_H
 
+#include 
 #include 
 
 static inline void set_my_cpu_offset(unsigned long off)
 {
-   asm volatile("msr tpidr_el1, %0" :: "r" (off) : "memory");
+   asm volatile(ALTERNATIVE("msr tpidr_el1, %0",
+"msr tpidr_el2, %0",
+ARM64_HAS_VIRT_HOST_EXTN)
+   :: "r" (off) : "memory");
 }
 
 static inline unsigned long __my_cpu_offset(void)
@@ -31,7 +35,10 @@ static inline unsigned long __my_cpu_offset(void)
 * We want to allow caching the value, so avoid using volatile and
 * instead use a fake stack read to hazard against barrier().
 */
-   asm("mrs %0, tpidr_el1" : "=r" (off) :
+   asm(ALTERNATIVE("mrs %0, tpidr_el1",
+   "mrs %0, tpidr_el2",
+   ARM64_HAS_VIRT_HOST_EXTN)
+   : "=r" (off) :
"Q" (*(const unsigned long *)current_stack_pointer));
 
return off;
diff --git a/arch/arm64/include/asm/processor.h 
b/arch/arm64/include/asm/processor.h
index 29adab8138c3..8f2d0f7d193b 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -193,5 +193,6 @@ static inline void spin_lock_prefetch(const void *ptr)
 
 int cpu_enable_pan(void *__unused);
 int cpu_enable_cache_maint_trap(void *__unused);
+int cpu_copy_el2regs(void *__unused);
 
 #endif /* __ASM_PROCESSOR_H */
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index cd52d365d1f0..8e4c7da2b126 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -865,6 +865,7 @@ static const struct arm64_cpu_capabilities arm64_features[] 
= {
.capability = ARM64_HAS_VIRT_HOST_EXTN,
.def_scope = SCOPE_SYSTEM,
.matches = runs_at_el2,
+   .enable = cpu_copy_el2regs,
},
{
.desc = "32-bit EL0 Support",
@@ -1308,3 +1309,25 @@ static int __init enable_mrs_emulation(void)
 }
 
 late_initcall(enable_mrs_emulation);
+
+int cpu_copy_el2regs(void *__unused)
+{
+   int do_copyregs = 0;
+
+   /*
+* Copy register values that aren't redirected by hardware.
+*
+* Before code patching, we only set tpidr_el1, all CPUs need to copy
+* this value to tpidr_el2 before we patch the code. Once we've done
+* that, freshly-onlined CPUs will set tpidr_el2, so we don't need to
+* do anything here.
+*/
+   asm volatile(ALTERNATIVE("mov %0, #1", "mov %0, #0",
+ARM64_HAS_VIRT_HOST_EXTN)
+   : "=r" (do_copyregs) : : );
+
+   if (do_copyregs)
+   write_sysreg(read_sysreg(tpidr_el1), tpidr_el2);
+
+   return 0;
+}
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 877d42fb0df6..109d43a15aaf 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -70,7 +70,11 @@ ENTRY(cpu_do_suspend)
mrs x8, mdscr_el1
mrs x9, oslsr_el1
mrs x10, sctlr_el1

[PATCH v3 01/13] KVM: arm64: Store vcpu on the stack during __guest_enter()

2017-09-22 Thread James Morse
KVM uses tpidr_el2 as its private vcpu register, which makes sense for
non-vhe world switch as only KVM can access this register. This means
vhe Linux has to use tpidr_el1, which KVM has to save/restore as part
of the host context.

If the SDEI handler code runs behind KVM's back, it mustn't access any
per-cpu variables. To allow this on systems with vhe we need to make
the host use tpidr_el2, saving KVM from save/restoring it.

__guest_enter() stores the host_ctxt on the stack, do the same with
the vcpu.

Signed-off-by: James Morse 
Reviewed-by: Christoffer Dall 
---
Changes since v2:
 * Added middle paragraph of commit message.

 arch/arm64/kvm/hyp/entry.S | 10 +++---
 arch/arm64/kvm/hyp/hyp-entry.S |  6 +++---
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S
index 12ee62d6d410..9a8ab59e 100644
--- a/arch/arm64/kvm/hyp/entry.S
+++ b/arch/arm64/kvm/hyp/entry.S
@@ -62,8 +62,8 @@ ENTRY(__guest_enter)
// Store the host regs
save_callee_saved_regs x1
 
-   // Store the host_ctxt for use at exit time
-   str x1, [sp, #-16]!
+   // Store host_ctxt and vcpu for use at exit time
+   stp x1, x0, [sp, #-16]!
 
add x18, x0, #VCPU_CONTEXT
 
@@ -159,6 +159,10 @@ abort_guest_exit_end:
 ENDPROC(__guest_exit)
 
 ENTRY(__fpsimd_guest_restore)
+   // x0: esr
+   // x1: vcpu
+   // x2-x29,lr: vcpu regs
+   // vcpu x0-x1 on the stack
stp x2, x3, [sp, #-16]!
stp x4, lr, [sp, #-16]!
 
@@ -173,7 +177,7 @@ alternative_else
 alternative_endif
isb
 
-   mrs x3, tpidr_el2
+   mov x3, x1
 
ldr x0, [x3, #VCPU_HOST_CONTEXT]
kern_hyp_va x0
diff --git a/arch/arm64/kvm/hyp/hyp-entry.S b/arch/arm64/kvm/hyp/hyp-entry.S
index 5170ce1021da..fce7cc507e0a 100644
--- a/arch/arm64/kvm/hyp/hyp-entry.S
+++ b/arch/arm64/kvm/hyp/hyp-entry.S
@@ -104,6 +104,7 @@ el1_trap:
/*
 * x0: ESR_EC
 */
+   ldr x1, [sp, #16 + 8]   // vcpu stored by __guest_enter
 
/*
 * We trap the first access to the FP/SIMD to save the host context
@@ -116,19 +117,18 @@ alternative_if_not ARM64_HAS_NO_FPSIMD
b.eq__fpsimd_guest_restore
 alternative_else_nop_endif
 
-   mrs x1, tpidr_el2
mov x0, #ARM_EXCEPTION_TRAP
b   __guest_exit
 
 el1_irq:
stp x0, x1, [sp, #-16]!
-   mrs x1, tpidr_el2
+   ldr x1, [sp, #16 + 8]
mov x0, #ARM_EXCEPTION_IRQ
b   __guest_exit
 
 el1_error:
stp x0, x1, [sp, #-16]!
-   mrs x1, tpidr_el2
+   ldr x1, [sp, #16 + 8]
mov x0, #ARM_EXCEPTION_EL1_SERROR
b   __guest_exit
 
-- 
2.13.3



[PATCH v3 02/13] KVM: arm/arm64: Convert kvm_host_cpu_state to a static per-cpu allocation

2017-09-22 Thread James Morse
kvm_host_cpu_state is a per-cpu allocation made from kvm_arch_init()
used to store the host EL1 registers when KVM switches to a guest.

Make it easier for ASM to generate pointers into this per-cpu memory
by making it a static allocation.

Signed-off-by: James Morse 
Acked-by: Christoffer Dall 
---
 virt/kvm/arm/arm.c | 18 +++---
 1 file changed, 3 insertions(+), 15 deletions(-)

diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index b9f68e4add71..3786dd7e277d 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -51,8 +51,8 @@
 __asm__(".arch_extension   virt");
 #endif
 
+DEFINE_PER_CPU(kvm_cpu_context_t, kvm_host_cpu_state);
 static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
-static kvm_cpu_context_t __percpu *kvm_host_cpu_state;
 
 /* Per-CPU variable containing the currently running vcpu. */
 static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_arm_running_vcpu);
@@ -351,7 +351,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
}
 
vcpu->cpu = cpu;
-   vcpu->arch.host_cpu_context = this_cpu_ptr(kvm_host_cpu_state);
+   vcpu->arch.host_cpu_context = this_cpu_ptr(&kvm_host_cpu_state);
 
kvm_arm_set_running_vcpu(vcpu);
 
@@ -1255,19 +1255,8 @@ static inline void hyp_cpu_pm_exit(void)
 }
 #endif
 
-static void teardown_common_resources(void)
-{
-   free_percpu(kvm_host_cpu_state);
-}
-
 static int init_common_resources(void)
 {
-   kvm_host_cpu_state = alloc_percpu(kvm_cpu_context_t);
-   if (!kvm_host_cpu_state) {
-   kvm_err("Cannot allocate host CPU state\n");
-   return -ENOMEM;
-   }
-
/* set size of VMID supported by CPU */
kvm_vmid_bits = kvm_get_vmid_bits();
kvm_info("%d-bit VMID\n", kvm_vmid_bits);
@@ -1412,7 +1401,7 @@ static int init_hyp_mode(void)
for_each_possible_cpu(cpu) {
kvm_cpu_context_t *cpu_ctxt;
 
-   cpu_ctxt = per_cpu_ptr(kvm_host_cpu_state, cpu);
+   cpu_ctxt = per_cpu_ptr(&kvm_host_cpu_state, cpu);
err = create_hyp_mappings(cpu_ctxt, cpu_ctxt + 1, PAGE_HYP);
 
if (err) {
@@ -1490,7 +1479,6 @@ int kvm_arch_init(void *opaque)
 out_hyp:
teardown_hyp_mode();
 out_err:
-   teardown_common_resources();
return err;
 }
 
-- 
2.13.3



[PATCH v3 06/13] Docs: dt: add devicetree binding for describing arm64 SDEI firmware

2017-09-22 Thread James Morse
The Software Delegated Exception Interface (SDEI) is an ARM standard
for registering callbacks from the platform firmware into the OS.
This is typically used to implement RAS notifications, or to handle an
IRQ that has been promoted to a firmware-assisted NMI.

Add a new devicetree binding to describe the SDE firmware interface.

Signed-off-by: James Morse 
Acked-by: Rob Herring 
---
Changes since v2:
 * Added Rob's Ack
 * Fixed 'childe node' typo

Changes since v1:
* Added bound IRQ description for binding,
* Reference SMC-CC, not 'AAPCS like'
* Move sdei node under firmware node (and the file path)

 .../devicetree/bindings/arm/firmware/sdei.txt  | 42 ++
 1 file changed, 42 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/firmware/sdei.txt

diff --git a/Documentation/devicetree/bindings/arm/firmware/sdei.txt 
b/Documentation/devicetree/bindings/arm/firmware/sdei.txt
new file mode 100644
index ..ee3f0ff49889
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/firmware/sdei.txt
@@ -0,0 +1,42 @@
+* Software Delegated Exception Interface (SDEI)
+
+Firmware implementing the SDEI functions described in ARM document number
+ARM DEN 0054A ("Software Delegated Exception Interface") can be used by
+Linux to receive notification of events such as those generated by
+firmware-first error handling, or from an IRQ that has been promoted to
+a firmware-assisted NMI.
+
+The interface provides a number of API functions for registering callbacks
+and enabling/disabling events. Functions are invoked by trapping to the
+privilege level of the SDEI firmware (specified as part of the binding
+below) and passing arguments in a manner specified by the "SMC Calling
+Convention (ARM DEN 0028B):
+
+r0 => 32-bit Function ID / return value
+   {r1 - r3}   => Parameters
+
+Note that the immediate field of the trapping instruction must be set
+to #0.
+
+The SDEI_EVENT_REGISTER function registers a callback in the kernel
+text to handle the specified event number.
+
+The sdei node should be a child node of '/firmware' and have required
+properties:
+
+ - compatible: should contain:
+   * "arm,sdei-1.0" : For implementations complying to SDEI version 1.x.
+
+ - method: The method of calling the SDEI firmware. Permitted
+   values are:
+   * "smc" : SMC #0, with the register assignments specified in this
+ binding.
+   * "hvc" : HVC #0, with the register assignments specified in this
+ binding.
+Example:
+   firmware {
+   sdei {
+   compatible  = "arm,sdei-1.0";
+   method  = "smc";
+   };
+   };
-- 
2.13.3



[PATCH v3 05/13] KVM: arm64: Stop save/restoring host tpidr_el1 on VHE

2017-09-22 Thread James Morse
Now that a VHE host uses tpidr_el2 for the cpu offset we no longer
need KVM to save/restore tpidr_el1. Move this from the 'common' code
into the non-vhe code. While we're at it, on VHE we don't need to
save the ELR or SPSR as kernel_entry in entry.S will have pushed these
onto the kernel stack, and will restore them from there. Move these
to the non-vhe code as we need them to get back to the host.

Finally, remove the always-copy-tpidr we hid in the stage2 setup
code; cpufeature's enable callback will do this for VHE, so we only
need KVM to do it for non-vhe. Add the copy into kvm-init instead.

Signed-off-by: James Morse 
Reviewed-by: Christoffer Dall 
---
Changes since v1:
 * Switched KVM<->arm64 in the subject.

 arch/arm64/kvm/hyp-init.S  |  4 
 arch/arm64/kvm/hyp/s2-setup.c  |  3 ---
 arch/arm64/kvm/hyp/sysreg-sr.c | 16 
 3 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/kvm/hyp-init.S b/arch/arm64/kvm/hyp-init.S
index 3f9615582377..fbf259893f6a 100644
--- a/arch/arm64/kvm/hyp-init.S
+++ b/arch/arm64/kvm/hyp-init.S
@@ -122,6 +122,10 @@ CPU_BE(orr x4, x4, #SCTLR_ELx_EE)
kern_hyp_va x2
msr vbar_el2, x2
 
+   /* copy tpidr_el1 into tpidr_el2 for use by HYP */
+   mrs x1, tpidr_el1
+   msr tpidr_el2, x1
+
/* Hello, World! */
eret
 ENDPROC(__kvm_hyp_init)
diff --git a/arch/arm64/kvm/hyp/s2-setup.c b/arch/arm64/kvm/hyp/s2-setup.c
index 7fb88274eba1..a81f5e10fc8c 100644
--- a/arch/arm64/kvm/hyp/s2-setup.c
+++ b/arch/arm64/kvm/hyp/s2-setup.c
@@ -84,8 +84,5 @@ u32 __hyp_text __init_stage2_translation(void)
 
write_sysreg(val, vtcr_el2);
 
-   /* copy tpidr_el1 into tpidr_el2 for use by HYP */
-   write_sysreg(read_sysreg(tpidr_el1), tpidr_el2);
-
return parange;
 }
diff --git a/arch/arm64/kvm/hyp/sysreg-sr.c b/arch/arm64/kvm/hyp/sysreg-sr.c
index 934137647837..c54cc2afb92b 100644
--- a/arch/arm64/kvm/hyp/sysreg-sr.c
+++ b/arch/arm64/kvm/hyp/sysreg-sr.c
@@ -27,8 +27,8 @@ static void __hyp_text __sysreg_do_nothing(struct 
kvm_cpu_context *ctxt) { }
 /*
  * Non-VHE: Both host and guest must save everything.
  *
- * VHE: Host must save tpidr*_el[01], actlr_el1, mdscr_el1, sp0, pc,
- * pstate, and guest must save everything.
+ * VHE: Host must save tpidr*_el0, actlr_el1, mdscr_el1, sp_el0,
+ * and guest must save everything.
  */
 
 static void __hyp_text __sysreg_save_common_state(struct kvm_cpu_context *ctxt)
@@ -36,11 +36,8 @@ static void __hyp_text __sysreg_save_common_state(struct 
kvm_cpu_context *ctxt)
ctxt->sys_regs[ACTLR_EL1]   = read_sysreg(actlr_el1);
ctxt->sys_regs[TPIDR_EL0]   = read_sysreg(tpidr_el0);
ctxt->sys_regs[TPIDRRO_EL0] = read_sysreg(tpidrro_el0);
-   ctxt->sys_regs[TPIDR_EL1]   = read_sysreg(tpidr_el1);
ctxt->sys_regs[MDSCR_EL1]   = read_sysreg(mdscr_el1);
ctxt->gp_regs.regs.sp   = read_sysreg(sp_el0);
-   ctxt->gp_regs.regs.pc   = read_sysreg_el2(elr);
-   ctxt->gp_regs.regs.pstate   = read_sysreg_el2(spsr);
 }
 
 static void __hyp_text __sysreg_save_state(struct kvm_cpu_context *ctxt)
@@ -62,10 +59,13 @@ static void __hyp_text __sysreg_save_state(struct 
kvm_cpu_context *ctxt)
ctxt->sys_regs[AMAIR_EL1]   = read_sysreg_el1(amair);
ctxt->sys_regs[CNTKCTL_EL1] = read_sysreg_el1(cntkctl);
ctxt->sys_regs[PAR_EL1] = read_sysreg(par_el1);
+   ctxt->sys_regs[TPIDR_EL1]   = read_sysreg(tpidr_el1);
 
ctxt->gp_regs.sp_el1= read_sysreg(sp_el1);
ctxt->gp_regs.elr_el1   = read_sysreg_el1(elr);
ctxt->gp_regs.spsr[KVM_SPSR_EL1]= read_sysreg_el1(spsr);
+   ctxt->gp_regs.regs.pc   = read_sysreg_el2(elr);
+   ctxt->gp_regs.regs.pstate   = read_sysreg_el2(spsr);
 }
 
 static hyp_alternate_select(__sysreg_call_save_host_state,
@@ -89,11 +89,8 @@ static void __hyp_text __sysreg_restore_common_state(struct 
kvm_cpu_context *ctx
write_sysreg(ctxt->sys_regs[ACTLR_EL1],   actlr_el1);
write_sysreg(ctxt->sys_regs[TPIDR_EL0],   tpidr_el0);
write_sysreg(ctxt->sys_regs[TPIDRRO_EL0], tpidrro_el0);
-   write_sysreg(ctxt->sys_regs[TPIDR_EL1],   tpidr_el1);
write_sysreg(ctxt->sys_regs[MDSCR_EL1],   mdscr_el1);
write_sysreg(ctxt->gp_regs.regs.sp,   sp_el0);
-   write_sysreg_el2(ctxt->gp_regs.regs.pc,   elr);
-   write_sysreg_el2(ctxt->gp_regs.regs.pstate, spsr);
 }
 
 static void __hyp_text __sysreg_restore_state(struct kvm_cpu_context *ctxt)
@@ -115,10 +112,13 @@ static void __hyp_text __sysreg_restore_state(struct 
kvm_cpu_context *ctxt)
write_sysreg_el1(ctxt->sys_regs[AMAIR_EL1], amair);
write_sysreg_el1(ctxt->sys_regs[CNTKCTL_EL1],   cntkctl);
write_sysreg(ctxt->sys_regs[PAR_EL1],   par_el1);
+   write_sysreg(ctxt->sys_regs[TPIDR_EL1], 

[PATCH v3 00/13] arm64/firmware: Software Delegated Exception Interface

2017-09-22 Thread James Morse
Hello!

The Software Delegated Exception Interface (SDEI) is an ARM specification
for registering callbacks from the platform firmware into the OS.
This is intended to be used to implement firmware-first RAS notifications,
but also supports vendor-defined events and binding IRQs as events.

The document is here:
http://infocenter.arm.com/help/topic/com.arm.doc.den0054a/ARM_DEN0054A_Software_Delegated_Exception_Interface.pdf

I anticipate that once reviewed this series will go via the arm-soc tree, as the
bulk of the code is under /drivers/firmware/.

The major change in v3 is due to VMAP stacks. We can't trust sp when we take
an event, so we need our own SDEI stack. Events can nest, so we need two. Private
events can be delivered per-cpu, so we need two per-cpu SDEI stacks.

The ACPICA support for the SDEI table is now in mainline, so this series has
grown the detection code for non-DT systems. The APEI/GHES wire-up is still
being held back as APEI's ghes_proc() isn't (yet) safe for multiple sources
of NMI.

Exposing the HVC range through KVM to user-space is gone; this will be
re-incarnated as something that better fits KVM's architecture emulation,
as opposed to assuming all-the-world's-SMC-CC.


This series (juggles some registers with KVM+VHE, then) adds a DT binding to
trigger probing of the interface and support for the SDEI API.

SDEI runs between adjacent exception levels, so events will always be delivered
to EL2 if firmware is at EL3. For VHE hosts we run the SDEI event handler
behind KVM's back with all exceptions masked. Once the handler has done its
work we return to the appropriate vbar+irq entry. This allows KVM to
world-switch and deliver any signals sent by the handler to Qemu/kvmtool. We
do the same thing if we interrupt host EL0. If we interrupted code with
interrupts masked, we use a different API call to return to the interrupted
context.
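
A hedged sketch of that return-path decision (names here are illustrative
stand-ins; the real logic lives in the arch-specific entry code later in the
series):

/* Sketch only: choose the SDEI completion call for the interrupted context.
 * 'vbar_irq_entry' is a hypothetical stand-in for VBAR_EL1 + the IRQ vector
 * offset. */
static unsigned long sdei_pick_return(struct pt_regs *regs)
{
	if (regs->pstate & PSR_I_BIT)
		return SDEI_EV_HANDLED;	/* COMPLETE: go back exactly where we were */

	return vbar_irq_entry;		/* COMPLETE_AND_RESUME: pretend it was an IRQ */
}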

What about non-VHE KVM? If you don't have VHE support and boot at EL2, the
kernel drops to EL1. This driver will print an error message then give up. This
is because events would still be delivered to EL2 hitting either KVM, or the
hyp-stub. Supporting this is complicated, but because the main use-case is
RAS, and ARM v8.2's RAS extensions imply v8.1's Virtual Host Extensions, we
can assume all platforms with SDEI will support VHE too. (I have some ideas
on how to support non-VHE).

Running the event handler behind VHE-KVM's back has some side effects: The
event handler will blindly use any registers that are shared between the host
and guest. The two that I think matter are TPIDR_EL1, and the debug state. The
guest may have set MDSCR_EL1 so debug exceptions must remain masked. The
guest's TPIDR_EL1 will be used by the event handler if it accesses per-cpu
variables. This needs fixing. The first part of this series juggles KVMs use
of TPIDR_EL2 so that we share it with the host on VHE systems. An equivalent
change for 32bit is (still) on my todo list. (the alternative to this is to
have a parody world switch in the SDEI event handler, but this would mean
special casing interrupted guests, and be an ABI link to KVM.)

Is this another begins-with-S RAS mechanism for arm64? Yes.
Why? Any notification delivered as an exception will overwrite the exception
registers. This is fatal for the running thread if it happens during entry.S's
kernel_enter or kernel_exit. Instead of adding masking and routing controls,
events are delivered to a registered address at a fixed exception level and
don't change the exception registers when delivered.

This series can be retrieved from:
git://linux-arm.org/linux-jm.git -b sdei/v3/base

Questions and contradictions welcome!

Thanks,

James

[Changes since previous versions are noted on each patch]


James Morse (13):
  KVM: arm64: Store vcpu on the stack during __guest_enter()
  KVM: arm/arm64: Convert kvm_host_cpu_state to a static per-cpu
allocation
  KVM: arm64: Change hyp_panic()s dependency on tpidr_el2
  arm64: alternatives: use tpidr_el2 on VHE hosts
  KVM: arm64: Stop save/restoring host tpidr_el1 on VHE
  Docs: dt: add devicetree binding for describing arm64 SDEI firmware
  firmware: arm_sdei: Add driver for Software Delegated Exceptions
  arm64: Add vmap_stack header file
  arm64: kernel: Add arch-specific SDEI entry code and CPU masking
  firmware: arm_sdei: Add support for CPU and system power states
  firmware: arm_sdei: add support for CPU private events
  arm64: acpi: Remove __init from acpi_psci_use_hvc() for use by SDEI
  firmware: arm_sdei: Discover SDEI support via ACPI

 .../devicetree/bindings/arm/firmware/sdei.txt  |   42 +
 arch/arm64/Kconfig |2 +-
 arch/arm64/include/asm/assembler.h |8 +
 arch/arm64/include/asm/kvm_host.h  |2 +
 arch/arm64/include/asm/percpu.h|   11 +-
 arch/arm64/include/asm/processor.h |1 +
 arch/arm64/include/asm/sdei.h  |   63 ++
 

[PATCH v3 03/13] KVM: arm64: Change hyp_panic()s dependency on tpidr_el2

2017-09-22 Thread James Morse
Make tpidr_el2 a cpu-offset for per-cpu variables in the same way the
host uses tpidr_el1. This lets tpidr_el{1,2} have the same value, and
on VHE they can be the same register.

KVM calls hyp_panic() when anything unexpected happens. This may occur
while a guest owns the EL1 registers. KVM stashes the vcpu pointer in
tpidr_el2, which it uses to find the host context in order to restore
the host EL1 registers before parachuting into the host's panic().

The host context is a struct kvm_cpu_context allocated in the per-cpu
area, and mapped to hyp. Given the per-cpu offset for this CPU, this is
easy to find. Change hyp_panic() to take a pointer to the
struct kvm_cpu_context. Wrap these calls with an asm function that
retrieves the struct kvm_cpu_context from the host's per-cpu area.

Copy the per-cpu offset from the host's tpidr_el1 into tpidr_el2 during
kvm init. (Later patches will make this unnecessary for VHE hosts)

We print out the vcpu pointer as part of the panic message. Add a back
reference to the 'running vcpu' in the host cpu context to preserve this.

Signed-off-by: James Morse 
Reviewed-by: Christoffer Dall 
---
Changes since v1:
 * Added a comment explaining how =kvm_host_cpu_state gets from a host-va
   to a hyp va.
 * Added the first paragraph to the commit message.

 arch/arm64/include/asm/kvm_host.h |  2 ++
 arch/arm64/kvm/hyp/hyp-entry.S| 12 
 arch/arm64/kvm/hyp/s2-setup.c |  3 +++
 arch/arm64/kvm/hyp/switch.c   | 25 +
 4 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index e923b58606e2..806ccef7470a 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -191,6 +191,8 @@ struct kvm_cpu_context {
u64 sys_regs[NR_SYS_REGS];
u32 copro[NR_COPRO_REGS];
};
+
+   struct kvm_vcpu *__hyp_running_vcpu;
 };
 
 typedef struct kvm_cpu_context kvm_cpu_context_t;
diff --git a/arch/arm64/kvm/hyp/hyp-entry.S b/arch/arm64/kvm/hyp/hyp-entry.S
index fce7cc507e0a..e4f37b9dd47c 100644
--- a/arch/arm64/kvm/hyp/hyp-entry.S
+++ b/arch/arm64/kvm/hyp/hyp-entry.S
@@ -163,6 +163,18 @@ ENTRY(__hyp_do_panic)
eret
 ENDPROC(__hyp_do_panic)
 
+ENTRY(__hyp_panic)
+   /*
+* '=kvm_host_cpu_state' is a host VA from the constant pool, it may
+* not be accessible by this address from EL2, hyp_panic() converts
+* it with kern_hyp_va() before use.
+*/
+   ldr x0, =kvm_host_cpu_state
+   mrs x1, tpidr_el2
+   add x0, x0, x1
+   b   hyp_panic
+ENDPROC(__hyp_panic)
+
 .macro invalid_vector  label, target = __hyp_panic
.align  2
 \label:
diff --git a/arch/arm64/kvm/hyp/s2-setup.c b/arch/arm64/kvm/hyp/s2-setup.c
index a81f5e10fc8c..7fb88274eba1 100644
--- a/arch/arm64/kvm/hyp/s2-setup.c
+++ b/arch/arm64/kvm/hyp/s2-setup.c
@@ -84,5 +84,8 @@ u32 __hyp_text __init_stage2_translation(void)
 
write_sysreg(val, vtcr_el2);
 
+   /* copy tpidr_el1 into tpidr_el2 for use by HYP */
+   write_sysreg(read_sysreg(tpidr_el1), tpidr_el2);
+
return parange;
 }
diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
index 945e79c641c4..235d615cee30 100644
--- a/arch/arm64/kvm/hyp/switch.c
+++ b/arch/arm64/kvm/hyp/switch.c
@@ -286,9 +286,9 @@ int __hyp_text __kvm_vcpu_run(struct kvm_vcpu *vcpu)
u64 exit_code;
 
vcpu = kern_hyp_va(vcpu);
-   write_sysreg(vcpu, tpidr_el2);
 
host_ctxt = kern_hyp_va(vcpu->arch.host_cpu_context);
+   host_ctxt->__hyp_running_vcpu = vcpu;
guest_ctxt = &vcpu->arch.ctxt;
 
__sysreg_save_host_state(host_ctxt);
@@ -393,7 +393,8 @@ int __hyp_text __kvm_vcpu_run(struct kvm_vcpu *vcpu)
 
 static const char __hyp_panic_string[] = "HYP panic:\nPS:%08llx PC:%016llx 
ESR:%08llx\nFAR:%016llx HPFAR:%016llx PAR:%016llx\nVCPU:%p\n";
 
-static void __hyp_text __hyp_call_panic_nvhe(u64 spsr, u64 elr, u64 par)
+static void __hyp_text __hyp_call_panic_nvhe(u64 spsr, u64 elr, u64 par,
+struct kvm_vcpu *vcpu)
 {
unsigned long str_va;
 
@@ -407,35 +408,35 @@ static void __hyp_text __hyp_call_panic_nvhe(u64 spsr, 
u64 elr, u64 par)
__hyp_do_panic(str_va,
   spsr,  elr,
   read_sysreg(esr_el2),   read_sysreg_el2(far),
-  read_sysreg(hpfar_el2), par,
-  (void *)read_sysreg(tpidr_el2));
+  read_sysreg(hpfar_el2), par, vcpu);
 }
 
-static void __hyp_text __hyp_call_panic_vhe(u64 spsr, u64 elr, u64 par)
+static void __hyp_text __hyp_call_panic_vhe(u64 spsr, u64 elr, u64 par,
+   struct kvm_vcpu *vcpu)
 {
panic(__hyp_panic_string,
  spsr,  elr,
  read_sysreg_el2(esr),   read_sysreg_el2(far),
- 

[PATCH v3 09/13] arm64: kernel: Add arch-specific SDEI entry code and CPU masking

2017-09-22 Thread James Morse
The Software Delegated Exception Interface (SDEI) is an ARM standard
for registering callbacks from the platform firmware into the OS.
This is typically used to implement RAS notifications.

Such notifications enter the kernel at the registered entry-point
with the register values of the interrupted CPU context. Because this
is not a CPU exception, it cannot reuse the existing entry code
(crucially, we don't implicitly know which exception level we interrupted).

Add an sdei-entry.S asm file to set us up for calling into C code. If
the event interrupted code that had interrupts masked, we always return
to that location. Otherwise we pretend this was an IRQ, and use SDEI's
complete_and_resume call to return to vbar_el1 + offset.

This allows the kernel to deliver signals to user space processes. For
KVM this triggers the world switch, a quick spin round vcpu_run, then
back into the guest, unless there are pending signals.

Add sdei_mask_local_cpu() calls to the smp_send_stop() code; this covers
the panic() code-path, which doesn't invoke cpuhotplug notifiers.

Because we can interrupt entry-from/exit-to another EL, we can't trust the
value in sp_el0 or x29, even if we interrupted the kernel; in this case
the code in entry.S will save/restore sp_el0 and use the value in
__entry_task.

When we have VMAP stacks we can interrupt the stack-overflow test, which
stirs x0 into sp, meaning we have to have our own VMAP stack.
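
For reference, a minimal sketch of the return-address decision described
above, assuming the C handler hands a resume address back to the asm
wrapper. The helper name and placement are illustrative only (the sdei.c
hunk is truncated in this mail), not the patch itself:

#include <asm/ptrace.h>
#include <asm/sysreg.h>

/* Illustrative only: pick where firmware should resume the interrupted context. */
static unsigned long sdei_pick_return_address(struct pt_regs *regs)
{
        unsigned long vbar = read_sysreg(vbar_el1);

        /* Interrupted code had interrupts masked: always return in place. */
        if (!user_mode(regs) && !interrupts_enabled(regs))
                return regs->pc;

        /*
         * Otherwise pretend this was an IRQ so pending signals get delivered
         * and KVM takes its world switch: resume at vbar_el1 plus the IRQ
         * vector offset of the interrupted context (0x280 for EL1h, 0x480
         * for 64-bit EL0).
         */
        return vbar + (user_mode(regs) ? 0x480 : 0x280);
}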

Signed-off-by: James Morse 
---
Changes since v2:
 * Added PAN-setting as we didn't take an exception to get here.
 * Added VMAP stack allocation and switching.
 * overwrite x29 on entry from not-the-kernel.
 * Added a pe-mask to crash_smp_send_stop()'s 'no secondaries' path (oops).
 * Added an ELR-hasn't-changed test. Any exception in the handler will panic
   KVM, as it's switched VBAR_EL1, but would go silently undetected otherwise.
   Generate a warning.

 arch/arm64/Kconfig  |   2 +-
 arch/arm64/include/asm/sdei.h   |  63 ++
 arch/arm64/include/asm/stacktrace.h |   3 +
 arch/arm64/kernel/Makefile  |   1 +
 arch/arm64/kernel/asm-offsets.c |   5 +
 arch/arm64/kernel/sdei-entry.S  | 122 +++
 arch/arm64/kernel/sdei.c| 235 
 arch/arm64/kernel/smp.c |  11 +-
 drivers/firmware/Kconfig|   1 +
 drivers/firmware/arm_sdei.c |   1 +
 include/linux/sdei.h|   2 +
 11 files changed, 444 insertions(+), 2 deletions(-)
 create mode 100644 arch/arm64/include/asm/sdei.h
 create mode 100644 arch/arm64/kernel/sdei-entry.S
 create mode 100644 arch/arm64/kernel/sdei.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0df64a6a56d4..70dfe4e9ccc5 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -98,7 +98,7 @@ config ARM64
select HAVE_IRQ_TIME_ACCOUNTING
select HAVE_MEMBLOCK
select HAVE_MEMBLOCK_NODE_MAP if NUMA
-   select HAVE_NMI if ACPI_APEI_SEA
+   select HAVE_NMI
select HAVE_PATA_PLATFORM
select HAVE_PERF_EVENTS
select HAVE_PERF_REGS
diff --git a/arch/arm64/include/asm/sdei.h b/arch/arm64/include/asm/sdei.h
new file mode 100644
index ..ed329e01a301
--- /dev/null
+++ b/arch/arm64/include/asm/sdei.h
@@ -0,0 +1,63 @@
+/*
+ * Copyright (C) 2017 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+#ifndef __ASM_SDEI_H
+#define __ASM_SDEI_H
+
+/* Values for sdei_exit_mode */
+#define SDEI_EXIT_HVC  0
+#define SDEI_EXIT_SMC  1
+
+#define SDEI_STACK_SIZE	IRQ_STACK_SIZE
+
+#ifndef __ASSEMBLY__
+
+#include 
+#include 
+
+#include 
+
+extern unsigned long sdei_exit_mode;
+
+/* Software Delegated Exception entry point from firmware*/
+asmlinkage void __sdei_asm_handler(unsigned long event_num, unsigned long arg,
+  unsigned long pc, unsigned long pstate);
+
+/*
+ * The above entry point does the minimum to call C code. This function does
+ * anything else, before calling the driver.
+ */
+struct sdei_registered_event;
+asmlinkage unsigned long __sdei_handler(struct pt_regs *regs,
+   struct sdei_registered_event *arg);
+
+extern unsigned long sdei_arch_get_entry_point(int conduit);
+#define sdei_arch_get_entry_point(x)   sdei_arch_get_entry_point(x)
+
+bool _on_sdei_stack(unsigned long sp);
+static inline bool on_sdei_stack(unsigned long sp)
+{

[PATCH v3 12/13] arm64: acpi: Remove __init from acpi_psci_use_hvc() for use by SDEI

2017-09-22 Thread James Morse
SDEI inherits the 'use hvc' bit that is also used by PSCI. PSCI does all
its initialisation early; SDEI does its late.

Remove the __init annotation from acpi_psci_use_hvc().

Signed-off-by: James Morse 
---
The function name is unchanged as this bit is named 'PSCI_USE_HVC'
in table 5-37 of ACPIv6.2.

 arch/arm64/kernel/acpi.c | 2 +-
 include/linux/psci.h | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kernel/acpi.c b/arch/arm64/kernel/acpi.c
index b3162715ed78..252396a96c78 100644
--- a/arch/arm64/kernel/acpi.c
+++ b/arch/arm64/kernel/acpi.c
@@ -117,7 +117,7 @@ bool __init acpi_psci_present(void)
 }
 
 /* Whether HVC must be used instead of SMC as the PSCI conduit */
-bool __init acpi_psci_use_hvc(void)
+bool acpi_psci_use_hvc(void)
 {
return acpi_gbl_FADT.arm_boot_flags & ACPI_FADT_PSCI_USE_HVC;
 }
diff --git a/include/linux/psci.h b/include/linux/psci.h
index bdea1cb5e1db..e25a2a82 100644
--- a/include/linux/psci.h
+++ b/include/linux/psci.h
@@ -46,10 +46,11 @@ static inline int psci_dt_init(void) { return 0; }
 #if defined(CONFIG_ARM_PSCI_FW) && defined(CONFIG_ACPI)
 int __init psci_acpi_init(void);
 bool __init acpi_psci_present(void);
-bool __init acpi_psci_use_hvc(void);
+bool acpi_psci_use_hvc(void);
 #else
 static inline int psci_acpi_init(void) { return 0; }
 static inline bool acpi_psci_present(void) { return false; }
+static inline bool acpi_psci_use_hvc(void) {return false; }
 #endif
 
 #endif /* __LINUX_PSCI_H */
-- 
2.13.3

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v3 10/13] firmware: arm_sdei: Add support for CPU and system power states

2017-09-22 Thread James Morse
When a CPU enters an idle lower-power state or is powering off, we
need to mask SDE events so that no events can be delivered while we
are messing with the MMU, as the registered entry points won't be valid.

If the system reboots, we want to unregister all events and mask the CPUs.
For kexec this allows us to hand a clean slate to the next kernel
instead of relying on it to call sdei_{private,system}_data_reset().

For hibernate we unregister all events and re-register them on restore,
in case we restored with the SDE code loaded at a different address
(e.g. due to KASLR).

Add all the notifiers necessary to do this. We only support shared events
so all events are left registered and enabled over CPU hotplug.
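
The CPU-idle hook this implies has roughly the shape sketched below; this
is an illustration of a cpu_pm notifier built on the masking calls above,
not the code from this patch (the hunk registering the notifiers is
truncated in this mail):

#include <linux/cpu_pm.h>
#include <linux/notifier.h>

/* Illustrative: mask SDE events around low-power states on this CPU. */
static int sdei_cpu_pm_notify(struct notifier_block *nb, unsigned long action,
                              void *data)
{
        int err;

        switch (action) {
        case CPU_PM_ENTER:
                err = sdei_mask_local_cpu();
                break;
        case CPU_PM_EXIT:
                err = sdei_unmask_local_cpu();
                break;
        default:
                return NOTIFY_DONE;
        }

        return notifier_from_errno(err);
}

static struct notifier_block sdei_pm_nb = {
        .notifier_call = sdei_cpu_pm_notify,
};

/* Registered once at probe time with cpu_pm_register_notifier(&sdei_pm_nb). */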

Signed-off-by: James Morse 
---
 drivers/firmware/arm_sdei.c | 228 +++-
 include/linux/cpuhotplug.h  |   1 +
 include/linux/sdei.h|   3 +
 3 files changed, 231 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/arm_sdei.c b/drivers/firmware/arm_sdei.c
index 3ea6a19ae439..5656830efff4 100644
--- a/drivers/firmware/arm_sdei.c
+++ b/drivers/firmware/arm_sdei.c
@@ -22,6 +22,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -29,12 +31,15 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -260,6 +265,11 @@ static void _ipi_mask_cpu(void *ignored)
sdei_mask_local_cpu();
 }
 
+static int sdei_cpuhp_down(unsigned int ignored)
+{
+   return sdei_mask_local_cpu();
+}
+
 int sdei_unmask_local_cpu(void)
 {
int err;
@@ -281,6 +291,11 @@ static void _ipi_unmask_cpu(void *ignored)
sdei_unmask_local_cpu();
 }
 
+static int sdei_cpuhp_up(unsigned int ignored)
+{
+   return sdei_unmask_local_cpu();
+}
+
 static void _ipi_private_reset(void *ignored)
 {
int err;
@@ -319,6 +334,12 @@ static int sdei_platform_reset(void)
return err;
 }
 
+static int sdei_api_event_status(u32 event_num, u64 *result)
+{
+   return invoke_sdei_fn(SDEI_1_0_FN_SDEI_EVENT_STATUS, event_num, 0, 0, 0,
+ 0, result);
+}
+
 static int sdei_api_event_enable(u32 event_num)
 {
return invoke_sdei_fn(SDEI_1_0_FN_SDEI_EVENT_ENABLE, event_num, 0, 0, 0,
@@ -373,8 +394,23 @@ static int sdei_api_event_unregister(u32 event_num)
 
 static int _sdei_event_unregister(struct sdei_event *event)
 {
-   if (event->type == SDEI_EVENT_TYPE_SHARED)
+   int err;
+   u64 result;
+   bool enabled;
+
+   if (event->type == SDEI_EVENT_TYPE_SHARED) {
+   /*
+* We re-register events after power-management, remember if
+* this event was previously enabled.
+*/
+   err = sdei_api_event_status(event->event_num, &result);
+   if (err)
+   return err;
+   enabled = !!(result & BIT(SDEI_EVENT_STATUS_ENABLED));
+   event->registered->was_enabled = enabled;
+
return sdei_api_event_unregister(event->event_num);
+   }
 
return -EINVAL;
 }
@@ -407,6 +443,26 @@ int sdei_event_unregister(u32 event_num)
 }
 EXPORT_SYMBOL(sdei_event_unregister);
 
+/*
+ * unregister events, but don't destroy them as they are re-registered by
+ * sdei_reregister_events().
+ */
+static int sdei_event_unregister_all(void)
+{
+   int err = 0;
+   struct sdei_event *event;
+
+   spin_lock(&sdei_events_lock);
+   list_for_each_entry(event, &sdei_events, list) {
+   err = _sdei_event_unregister(event);
+   if (err)
+   break;
+   }
+   spin_unlock(&sdei_events_lock);
+
+   return err;
+}
+
 static int sdei_api_event_register(u32 event_num, unsigned long entry_point,
   void *arg, u64 flags, u64 affinity)
 {
@@ -463,6 +519,148 @@ int sdei_event_register(u32 event_num, sdei_event_callback *cb, void *arg,
 }
 EXPORT_SYMBOL(sdei_event_register);
 
+static int sdei_reregister_event(struct sdei_event *event)
+{
+   int err;
+
+   lockdep_assert_held(&sdei_events_lock);
+
+   err = _sdei_event_register(event);
+   if (err) {
+   pr_err("Failed to re-register event %u\n", event->event_num);
+   sdei_event_destroy(event);
+   return err;
+   }
+
+   if (event->type == SDEI_EVENT_TYPE_SHARED) {
+   if (event->registered->was_enabled)
+   err = sdei_api_event_enable(event->event_num);
+   }
+
+   if (err)
+   pr_err("Failed to re-enable event %u\n", event->event_num);
+
+   return err;
+}
+
+static int sdei_reregister_events(void)
+{
+   int err = 0;
+   struct sdei_event *event;
+
+   spin_lock(&sdei_events_lock);
+   list_for_each_entry(event, &sdei_events, list) {
+   err = sdei_reregister_event(event);
+   if 

[PATCH v3 08/13] arm64: Add vmap_stack header file

2017-09-22 Thread James Morse
Today the arm64 arch code allocates an extra IRQ stack per-cpu. If we
also have SDEI and VMAP stacks we need two extra per-cpu VMAP stacks.

Move the VMAP stack allocation out to a helper in a new header file.
This avoids missing THREADINFO_GFP, or getting the all-important alignment
wrong.
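
As a usage sketch (an assumption based on the SDEI patch later in this
series, not part of this one), a second user would allocate its per-cpu
stacks with the same helper; the sdei_stack_ptr name is illustrative:

#include <linux/errno.h>
#include <linux/percpu.h>
#include <linux/topology.h>
#include <asm/sdei.h>            /* SDEI_STACK_SIZE, from the later SDEI patch */
#include <asm/vmap_stack.h>

static DEFINE_PER_CPU(unsigned long *, sdei_stack_ptr);

/* Illustrative: one correctly-aligned VMAP'd SDEI stack per possible CPU. */
static int init_sdei_stacks(void)
{
        int cpu;
        unsigned long *p;

        for_each_possible_cpu(cpu) {
                p = arch_alloc_vmap_stack(SDEI_STACK_SIZE, cpu_to_node(cpu));
                if (!p)
                        return -ENOMEM;
                per_cpu(sdei_stack_ptr, cpu) = p;
        }

        return 0;
}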

Signed-off-by: James Morse 

---
This patch is new for v3. Having a bare prototype if VMAP stacks aren't
enabled allows us to avoid ifdeffery; any code that uses the symbol without
CONFIG_VMAP_STACK will still fail, at link time.

 arch/arm64/include/asm/vmap_stack.h | 41 +
 arch/arm64/kernel/irq.c | 13 ++--
 2 files changed, 43 insertions(+), 11 deletions(-)
 create mode 100644 arch/arm64/include/asm/vmap_stack.h

diff --git a/arch/arm64/include/asm/vmap_stack.h b/arch/arm64/include/asm/vmap_stack.h
new file mode 100644
index ..f41d043cac31
--- /dev/null
+++ b/arch/arm64/include/asm/vmap_stack.h
@@ -0,0 +1,41 @@
+/*
+ * Copyright (C) 2017 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+#ifndef __ASM_VMAP_STACK_H
+#define __ASM_VMAP_STACK_H
+
+#include 
+#include 
+#include 
+#include 
+
+#ifdef CONFIG_VMAP_STACK
+/*
+ * To ensure that VMAP'd stack overflow detection works correctly, all VMAP'd
+ * stacks need to have the same alignment.
+ */
+static inline unsigned long *arch_alloc_vmap_stack(size_t stack_size, int node)
+{
+   return __vmalloc_node_range(stack_size, THREAD_ALIGN,
+   VMALLOC_START, VMALLOC_END,
+   THREADINFO_GFP, PAGE_KERNEL, 0, node,
+   __builtin_return_address(0));
+}
+
+#else
+unsigned long *arch_alloc_vmap_stack(size_t stack_size, int node); // link error
+
+#endif /* CONFIG_VMAP_STACK */
+#endif /* __ASM_VMAP_STACK_H */
diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index 713561e5bcab..60e5fc661f74 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 unsigned long irq_err_count;
 
@@ -58,17 +59,7 @@ static void init_irq_stacks(void)
unsigned long *p;
 
for_each_possible_cpu(cpu) {
-   /*
-   * To ensure that VMAP'd stack overflow detection works
-   * correctly, the IRQ stacks need to have the same
-   * alignment as other stacks.
-   */
-   p = __vmalloc_node_range(IRQ_STACK_SIZE, THREAD_ALIGN,
-VMALLOC_START, VMALLOC_END,
-THREADINFO_GFP, PAGE_KERNEL,
-0, cpu_to_node(cpu),
-__builtin_return_address(0));
-
+   p = arch_alloc_vmap_stack(IRQ_STACK_SIZE, cpu_to_node(cpu));
per_cpu(irq_stack_ptr, cpu) = p;
}
 }
-- 
2.13.3

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v3 13/13] firmware: arm_sdei: Discover SDEI support via ACPI

2017-09-22 Thread James Morse
SDEI defines a new ACPI table to indicate the presence of the interface.
The conduit is discovered in the same way as PSCI.

For ACPI we need to create the platform device ourselves as SDEI doesn't
have an entry in the DSDT.

The SDEI platform device should be created after ACPI has been initialised
so that we can parse the table, but before GHES devices are created, which
may register SDE events if they use SDEI as their notification type.

Signed-off-by: James Morse 
---
 drivers/firmware/arm_sdei.c | 41 -
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/arm_sdei.c b/drivers/firmware/arm_sdei.c
index 3bee2719d6f4..50b6b1d0fe72 100644
--- a/drivers/firmware/arm_sdei.c
+++ b/drivers/firmware/arm_sdei.c
@@ -912,6 +912,14 @@ static int sdei_get_conduit(struct platform_device *pdev)
}
 
pr_warn("invalid \"method\" property: %s\n", method);
+   } else if (IS_ENABLED(CONFIG_ACPI) && !acpi_disabled) {
+   if (acpi_psci_use_hvc()) {
+   sdei_firmware_call = &sdei_smccc_hvc;
+   return CONDUIT_HVC;
+   } else {
+   sdei_firmware_call = &sdei_smccc_smc;
+   return CONDUIT_SMC;
+   }
}
 
return CONDUIT_INVALID;
@@ -1025,14 +1033,45 @@ static bool __init sdei_present_dt(void)
return true;
 }
 
+static bool __init sdei_present_acpi(void)
+{
+   acpi_status status;
+   struct platform_device *pdev;
+   struct acpi_table_header *sdei_table_header;
+
+   if (acpi_disabled)
+   return false;
+
+   status = acpi_get_table(ACPI_SIG_SDEI, 0, &sdei_table_header);
+   if (ACPI_FAILURE(status) && status != AE_NOT_FOUND) {
+   const char *msg = acpi_format_exception(status);
+
+   pr_info("Failed to get ACPI:SDEI table, %s\n", msg);
+   }
+   if (ACPI_FAILURE(status))
+   return false;
+
+   pdev = platform_device_register_simple(sdei_driver.driver.name, 0, NULL,
+  0);
+   if (IS_ERR(pdev))
+   return false;
+
+   return true;
+}
+
 static int __init sdei_init(void)
 {
-   if (sdei_present_dt())
+   if (sdei_present_dt() || sdei_present_acpi())
	platform_driver_register(&sdei_driver);
 
return 0;
 }
 
+/*
+ * On an ACPI system SDEI needs to be ready before HEST:GHES tries to register
+ * its events. ACPI is initialised from a subsys_initcall(), GHES is initialised
+ * by device_initcall(). We want to be called in the middle.
+ */
 subsys_initcall_sync(sdei_init);
 
 int sdei_event_handler(struct pt_regs *regs,
-- 
2.13.3

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm