Re: [PATCH v2 00/15] KVM: Dynamically size memslot arrays

2019-10-23 Thread Christoffer Dall
On Mon, Oct 21, 2019 at 05:35:22PM -0700, Sean Christopherson wrote:
> The end goal of this series is to dynamically size the memslot array so
> that KVM allocates memory based on the number of memslots in use, as
> opposed to unconditionally allocating memory for the maximum number of
> memslots.  On x86, each memslot consumes 88 bytes, and so with 2 address
> spaces of 512 memslots, each VM consumes ~90k bytes for the memslots.
> E.g. given a VM that uses a total of 30 memslots, dynamic sizing reduces
> the memory footprint from 90k to ~2.6k bytes.
> 
> The changes required to support dynamic sizing are relatively small,
> e.g. are essentially contained in patches 12/13 and 13/13.  Patches 1-11
> clean up the memslot code, which has gotten quite crusty, especially
> __kvm_set_memory_region().  The clean up is likely not strictly necessary
> to switch to dynamic sizing, but I didn't have a remotely reasonable
> level of confidence in the correctness of the dynamic sizing without first
> doing the clean up.
> 
> Testing, especially non-x86 platforms, would be greatly appreciated.  The
> non-x86 changes are for all intents and purposes untested, e.g. I compile
> tested pieces of the code by copying them into x86, but that's it.  In
> theory, the vast majority of the functional changes are arch agnostic, in
> theory...

I've built this for arm/arm64, and I've run my usual set of tests, which
pass fine.  I've also run the selftest framework's dirty logging tests
and the migration loop test for arm64, and they pass fine as well.

You can add my (for arm64):

Tested-by: Christoffer Dall 


Re: [PATCH 00/45] KVM: Refactor vCPU creation

2019-10-22 Thread Christoffer Dall
Hi Sean,

On Mon, Oct 21, 2019 at 06:58:40PM -0700, Sean Christopherson wrote:
> *** DISCLAIMER **
> The non-x86 arch specific patches are completely untested.  Although the
> changes are conceptually straightforward, I'm not remotely confident that
> the patches are bug free, e.g. checkpatch caught several blatant typos
> that would break compilation.
> *
> 
> The end goal of this series is to strip down the interface between common
> KVM code and arch specific code so that there is precisely one arch hook
> for creating a vCPU and one hook for destroying a vCPU.  In addition to
> cleaning up the code base, simplifying the interface gives architectures
> more freedom to organize their vCPU creation code.
> 
> KVM's vCPU creation code is comically messy.  kvm_vm_ioctl_create_vcpu()
> calls three separate arch hooks: init(), create() and setup().  The init()
> call is especially nasty as it's hidden away in a common KVM function,
> kvm_vcpu_init(), that for all intents and purposes must be immediately
> invoked after the vcpu object is allocated.
> 
> Not to be outdone, vCPU destruction also has three arch hooks: uninit(),
> destroy() and free(), the latter of which isn't actually invoked by common
> KVM code, but the hook declaration still exists because architectures are
> relying on its forward declaration.
> 
> Eliminating the extra arch hooks is relatively straightforward, just
> tedious.  For the most part, there is no fundamental constraint that
> necessitated the proliferation of arch hooks, rather they crept in over
> time, usually when x86-centric code was moved out of generic KVM and into
> x86 code.
> 
> E.g. kvm_arch_vcpu_setup() was added to allow x86 to do vcpu_load(), which
> can only be done after preempt_notifier initialization, but adding setup()
> overlooked the fact that the preempt_notifier was only initialized after
> kvm_arch_vcpu_create() because preemption support was added when x86's MMU
> setup (the vcpu_load() user) was called from common KVM code.
> 
> For all intents and purposes, there is no true functional change in this
> series.  The order of some allocations will change, and a few memory leaks
> are fixed, but the actual functionality of a guest should be unaffected.
> 
> Patches 01-03 are bug fixes in error handling paths that were found by
> inspection when refactoring the associated code.
> 
> Patches 04-43 refactor each arch implementation so that the unwanted arch
> hooks can be dropped without a functional change, e.g. move code out of
> kvm_arch_vcpu_setup() so that all implementations are empty, then drop the
> functions and caller.
> 
> Patches 44-45 are minor clean up to eliminate kvm_vcpu_uninit().
> 
> 
> The net result is to go from this:
> 
> vcpu = kvm_arch_vcpu_create(kvm, id);
>        |
>        |-> kvm_vcpu_init()
>               |
>               |-> kvm_arch_vcpu_init()
> 
> if (IS_ERR(vcpu)) {
> 	r = PTR_ERR(vcpu);
> 	goto vcpu_decrement;
> }
> 
> preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);
> 
> r = kvm_arch_vcpu_setup(vcpu);
> if (r)
> 	goto vcpu_destroy;
> 
> to this:
> 
> r = kvm_arch_vcpu_precreate(kvm, id);
> if (r)
> 	goto vcpu_decrement;
> 
> vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL);
> if (!vcpu) {
> 	r = -ENOMEM;
> 	goto vcpu_decrement;
> }
> 
> page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> if (!page) {
> 	r = -ENOMEM;
> 	goto vcpu_free;
> }
> vcpu->run = page_address(page);
> 
> kvm_vcpu_init(vcpu, kvm, id);
> 
> r = kvm_arch_vcpu_create(vcpu);
> if (r)
> 	goto vcpu_free_run_page;
> 

What a fantastically welcome piece of work!  Thanks for doing this,
many's the time I waded through all those calls to ensure a patch was
doing the right thing.

Modulo the nit in patch 42, the arm64 changes build fine and survive a
guest boot + hackbench.  Not changing the arm-specific destroy function
to return void also causes a series of warnings on a 32-bit arm build,
but otherwise that builds fine as well.

You can add my:

  Acked-by: Christoffer Dall 

To the arm/arm64 and generic parts.


Thanks,

Christoffer


Re: [PATCH v9 01/26] arm64: Fix HCR.TGE status for NMI contexts

2019-01-31 Thread Christoffer Dall
On Thu, Jan 31, 2019 at 09:40:02AM +, Julien Thierry wrote:
> 
> 
> On 31/01/2019 09:27, Christoffer Dall wrote:
> > On Thu, Jan 31, 2019 at 08:56:04AM +, Julien Thierry wrote:
> >>
> >>
> >> On 31/01/2019 08:19, Christoffer Dall wrote:
> >>> On Mon, Jan 28, 2019 at 03:42:42PM +, Julien Thierry wrote:
> >>>> Hi James,
> >>>>
> >>>> On 28/01/2019 11:48, James Morse wrote:
> >>>>> Hi Julien,
> >>>>>
> >>>>> On 21/01/2019 15:33, Julien Thierry wrote:
> >>>>>> When using VHE, the host needs to clear HCR_EL2.TGE bit in order
> >>>>>> to interract with guest TLBs, switching from EL2&0 translation regime
> >>>>>
> >>>>> (interact)
> >>>>>
> >>>>>
> >>>>>> to EL1&0.
> >>>>>>
> >>>>>> However, some non-maskable asynchronous event could happen while TGE is
> >>>>>> cleared like SDEI. Because of this address translation operations
> >>>>>> relying on EL2&0 translation regime could fail (tlb invalidation,
> >>>>>> userspace access, ...).
> >>>>>>
> >>>>>> Fix this by properly setting HCR_EL2.TGE when entering NMI context and
> >>>>>> clear it if necessary when returning to the interrupted context.
> >>>>>
> >>>>> Yes please. This would not have been fun to debug!
> >>>>>
> >>>>> Reviewed-by: James Morse 
> >>>>>
> >>>>>
> >>>>
> >>>> Thanks.
> >>>>
> >>>>>
> >>>>> I was looking for why we need core code to do this, instead of updating the
> >>>>> arch's call sites. Your 'irqdesc: Add domain handlers for NMIs' patch (pointed
> >>>>> to from the cover letter) is the reason: core-code calls nmi_enter()/nmi_exit()
> >>>>> itself.
> >>>>>
> >>>>
> >>>> Yes, that's the main reason.
> >>>>
> >>> I wondered the same thing, but I don't understand the explanation :(
> >>>
> >>> Why can't we do a local_daif_mask() around the (very small) calls that
> >>> clear TGE instead?
> >>>
> >>
> >> That would protect against the pseudo-NMIs, but you can still get an
> >> SDEI at that point even with all daif bits set. Or did I misunderstand
> >> how SDEI works?
> >>
> > 
> > I don't know the details of SDEI.  From looking at this patch, the
> > logical conclusion would be that SDEIs can then only be delivered once
> > we've called nmi_enter, but since we don't call this directly from the
> > code that clears TGE for doing guest TLB invalidation (or do we?) then
> > masking interrupts at the PSTATE level should be sufficient.
> > 
> > Surely I'm missing some part of the bigger picture here.
> > 
> 
> I'm not sure I understand. SDEI uses the NMI context and AFAIU, it is an
> interrupt that the firmware sends to the OS, and it is sent regardless
> of the PSTATE at the OS EL.
> 
> So, the worrying part is:
> - Hyp clears TGE
> - Exception/interrupt taken to EL3
> - Firmware decides it's a good time to send an SDEI to the OS
> - SDEI handler (at EL2 for VHE) does nmi_enter()
> - SDEI handler needs to do cache invalidation or something with the
> EL2&0 translation regime but TGE is cleared
> 
> We don't expect the code that clears TGE to call nmi_enter().
> 

You do understand :)

I didn't understand that the SDEI handler calls nmi_enter() -- and to be
fair the commit message didn't really provide that link -- but it
makes perfect sense now.  I naively thought that SDEI respected the
PSTATE bits before, and that this was only becoming a problem with
the introduction of pseudo-NMIs, but I clearly came at this from the
wrong direction.


Thanks for the explanation!

Christoffer


Re: [PATCH v9 01/26] arm64: Fix HCR.TGE status for NMI contexts

2019-01-31 Thread Christoffer Dall
On Thu, Jan 31, 2019 at 08:56:04AM +, Julien Thierry wrote:
> 
> 
> On 31/01/2019 08:19, Christoffer Dall wrote:
> > On Mon, Jan 28, 2019 at 03:42:42PM +, Julien Thierry wrote:
> >> Hi James,
> >>
> >> On 28/01/2019 11:48, James Morse wrote:
> >>> Hi Julien,
> >>>
> >>> On 21/01/2019 15:33, Julien Thierry wrote:
> >>>> When using VHE, the host needs to clear HCR_EL2.TGE bit in order
> >>>> to interract with guest TLBs, switching from EL2&0 translation regime
> >>>
> >>> (interact)
> >>>
> >>>
> >>>> to EL1&0.
> >>>>
> >>>> However, some non-maskable asynchronous event could happen while TGE is
> >>>> cleared like SDEI. Because of this address translation operations
> >>>> relying on EL2&0 translation regime could fail (tlb invalidation,
> >>>> userspace access, ...).
> >>>>
> >>>> Fix this by properly setting HCR_EL2.TGE when entering NMI context and
> >>>> clear it if necessary when returning to the interrupted context.
> >>>
> >>> Yes please. This would not have been fun to debug!
> >>>
> >>> Reviewed-by: James Morse 
> >>>
> >>>
> >>
> >> Thanks.
> >>
> >>>
> >>> I was looking for why we need core code to do this, instead of updating the
> >>> arch's call sites. Your 'irqdesc: Add domain handlers for NMIs' patch (pointed
> >>> to from the cover letter) is the reason: core-code calls nmi_enter()/nmi_exit()
> >>> itself.
> >>>
> >>
> >> Yes, that's the main reason.
> >>
> > I wondered the same thing, but I don't understand the explanation :(
> > 
> > Why can't we do a local_daif_mask() around the (very small) calls that
> > clear TGE instead?
> > 
> 
> That would protect against the pseudo-NMIs, but you can still get an
> SDEI at that point even with all daif bits set. Or did I misunderstand
> how SDEI works?
> 

I don't know the details of SDEI.  From looking at this patch, the
logical conclusion would be that SDEIs can then only be delivered once
we've called nmi_enter, but since we don't call this directly from the
code that clears TGE for doing guest TLB invalidation (or do we?) then
masking interrupts at the PSTATE level should be sufficient.

Surely I'm missing some part of the bigger picture here.

Thanks,

Christoffer


Re: [PATCH v9 01/26] arm64: Fix HCR.TGE status for NMI contexts

2019-01-31 Thread Christoffer Dall
On Mon, Jan 28, 2019 at 03:42:42PM +, Julien Thierry wrote:
> Hi James,
> 
> On 28/01/2019 11:48, James Morse wrote:
> > Hi Julien,
> > 
> > On 21/01/2019 15:33, Julien Thierry wrote:
> >> When using VHE, the host needs to clear HCR_EL2.TGE bit in order
> >> to interract with guest TLBs, switching from EL2&0 translation regime
> > 
> > (interact)
> > 
> > 
> >> to EL1&0.
> >>
> >> However, some non-maskable asynchronous event could happen while TGE is
> >> cleared like SDEI. Because of this address translation operations
> >> relying on EL2&0 translation regime could fail (tlb invalidation,
> >> userspace access, ...).
> >>
> >> Fix this by properly setting HCR_EL2.TGE when entering NMI context and
> >> clear it if necessary when returning to the interrupted context.
> > 
> > Yes please. This would not have been fun to debug!
> > 
> > Reviewed-by: James Morse 
> > 
> > 
> 
> Thanks.
> 
> > 
> > I was looking for why we need core code to do this, instead of updating the
> > arch's call sites. Your 'irqdesc: Add domain handlers for NMIs' patch (pointed
> > to from the cover letter) is the reason: core-code calls nmi_enter()/nmi_exit()
> > itself.
> > 
> 
> Yes, that's the main reason.
> 
I wondered the same thing, but I don't understand the explanation :(

Why can't we do a local_daif_mask() around the (very small) calls that
clear TGE instead?


Thanks,

Christoffer


Re: [PATCH v9 10/26] arm64: kvm: Unmask PMR before entering guest

2019-01-30 Thread Christoffer Dall
On Mon, Jan 21, 2019 at 03:33:29PM +, Julien Thierry wrote:
> Interrupts masked by ICC_PMR_EL1 will not be signaled to the CPU. This
> means that the hypervisor will not receive masked interrupts while
> running a guest.
> 

You could add to the commit description how this works overall,
something along the lines of:

We need to make sure that all maskable interrupts are masked from the
time we call local_irq_disable() in the main run loop, and remain so
until we call local_irq_enable() after returning from the guest, and we
need to ensure that we see no interrupts at all (including pseudo-NMIs)
in the middle of the VM world-switch, while at the same time we need to
ensure we exit the guest when there are interrupts for the host.

We can accomplish this with pseudo-NMIs enabled by:
  (1) local_irq_disable: set the priority mask
  (2) enter guest:       set PSTATE.I
  (3)                    clear the priority mask
  (4) eret to guest
  (5) exit guest:        set the priority mask
                         clear PSTATE.I (and restore other host PSTATE bits)
  (6) local_irq_enable:  clear the priority mask.

Also, took me a while to realize that when we come back from the guest,
we call local_daif_restore with DAIF_PROCCTX_NOIRQ, which actually does
both of the things in (5).

> Avoid this by making sure ICC_PMR_EL1 is unmasked when we enter a guest.
> 
> Signed-off-by: Julien Thierry 
> Acked-by: Catalin Marinas 
> Cc: Christoffer Dall 
> Cc: Marc Zyngier 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: kvm...@lists.cs.columbia.edu
> ---
>  arch/arm64/include/asm/kvm_host.h | 12 
>  arch/arm64/kvm/hyp/switch.c   | 16 
>  2 files changed, 28 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 7732d0b..a1f9f55 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -24,6 +24,7 @@
>  
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -474,6 +475,17 @@ static inline int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
>  static inline void kvm_arm_vhe_guest_enter(void)
>  {
>   local_daif_mask();
> +
> + /*
> +  * Having IRQs masked via PMR when entering the guest means the GIC
> +  * will not signal the CPU of interrupts of lower priority, and the
> +  * only way to get out will be via guest exceptions.
> +  * Naturally, we want to avoid this.
> +  */
> + if (system_uses_irq_prio_masking()) {
> + gic_write_pmr(GIC_PRIO_IRQON);
> + dsb(sy);
> + }
>  }
>  
>  static inline void kvm_arm_vhe_guest_exit(void)
> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> index b0b1478..6a4c2d6 100644
> --- a/arch/arm64/kvm/hyp/switch.c
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -22,6 +22,7 @@
>  
>  #include 
>  
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -521,6 +522,17 @@ int __hyp_text __kvm_vcpu_run_nvhe(struct kvm_vcpu *vcpu)
>   struct kvm_cpu_context *guest_ctxt;
>   u64 exit_code;
>  
> + /*
> +  * Having IRQs masked via PMR when entering the guest means the GIC
> +  * will not signal the CPU of interrupts of lower priority, and the
> +  * only way to get out will be via guest exceptions.
> +  * Naturally, we want to avoid this.
> +  */
> + if (system_uses_irq_prio_masking()) {
> + gic_write_pmr(GIC_PRIO_IRQON);
> + dsb(sy);
> + }
> +
>   vcpu = kern_hyp_va(vcpu);
>  
>   host_ctxt = kern_hyp_va(vcpu->arch.host_cpu_context);
> @@ -573,6 +585,10 @@ int __hyp_text __kvm_vcpu_run_nvhe(struct kvm_vcpu *vcpu)
>*/
>   __debug_switch_to_host(vcpu);
>  
> + /* Returning to host will clear PSR.I, remask PMR if needed */
> + if (system_uses_irq_prio_masking())
> + gic_write_pmr(GIC_PRIO_IRQOFF);
> +
>   return exit_code;
>  }
>  

nit: you could consider moving the non-vhe part into a new
kvm_arm_nvhe_guest_enter, for symmetry with the vhe part.

Otherwise looks good to me:

Reviewed-by: Christoffer Dall 


Re: [PATCH 0/3] KVM: arm/arm64: trivial header path sanitization

2019-01-25 Thread Christoffer Dall
On Fri, Jan 25, 2019 at 04:57:27PM +0900, Masahiro Yamada wrote:
> My main motivation is to get rid of crappy header search path manipulation
> from Kbuild core.
> 
> Before that, I want to finish as many cleanup works as possible.
> 
> If you are interested in the big picture of this work,
> the full patch set is available at:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git 
> build-test

Changes appear fine to me:

Acked-by: Christoffer Dall 


[PATCH] clocksource/arm_arch_timer: Store physical timer IRQ number for KVM on VHE

2019-01-24 Thread Christoffer Dall
From: Andre Przywara 

A host running in VHE mode gets the EL2 physical timer as its time
source (accessed using the EL1 sysreg accessors, which get re-directed
to the EL2 sysregs by VHE).

The EL1 physical timer remains unused by the host kernel, allowing us to
pass it on directly to a KVM guest and saving us from emulating this
timer for the guest on VHE systems.

Store the EL1 Physical Timer's IRQ number in
struct arch_timer_kvm_info on VHE systems to allow KVM to use it.

Signed-off-by: Andre Przywara 
Signed-off-by: Marc Zyngier 
Signed-off-by: Christoffer Dall 
---
Patches in preparation for nested virtualization on KVM/Arm depend on this
change, so we would like to merge this via the kvmarm tree or have a stable
branch including this patch.

Please let us know your preference.  Thanks.

 drivers/clocksource/arm_arch_timer.c | 11 +--
 include/clocksource/arm_arch_timer.h |  1 +
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index 9a7d4dc00b6e..b9243e2328b4 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -1206,6 +1206,13 @@ static enum arch_timer_ppi_nr __init arch_timer_select_ppi(void)
return ARCH_TIMER_PHYS_SECURE_PPI;
 }
 
+static void __init arch_timer_populate_kvm_info(void)
+{
+   arch_timer_kvm_info.virtual_irq = arch_timer_ppi[ARCH_TIMER_VIRT_PPI];
+   if (is_kernel_in_hyp_mode())
+       arch_timer_kvm_info.physical_irq = arch_timer_ppi[ARCH_TIMER_PHYS_NONSECURE_PPI];
+}
+
 static int __init arch_timer_of_init(struct device_node *np)
 {
int i, ret;
@@ -1220,7 +1227,7 @@ static int __init arch_timer_of_init(struct device_node *np)
for (i = ARCH_TIMER_PHYS_SECURE_PPI; i < ARCH_TIMER_MAX_TIMER_PPI; i++)
arch_timer_ppi[i] = irq_of_parse_and_map(np, i);
 
-   arch_timer_kvm_info.virtual_irq = arch_timer_ppi[ARCH_TIMER_VIRT_PPI];
+   arch_timer_populate_kvm_info();
 
rate = arch_timer_get_cntfrq();
arch_timer_of_configure_rate(rate, np);
@@ -1550,7 +1557,7 @@ static int __init arch_timer_acpi_init(struct acpi_table_header *table)
arch_timer_ppi[ARCH_TIMER_HYP_PPI] =
acpi_gtdt_map_ppi(ARCH_TIMER_HYP_PPI);
 
-   arch_timer_kvm_info.virtual_irq = arch_timer_ppi[ARCH_TIMER_VIRT_PPI];
+   arch_timer_populate_kvm_info();
 
/*
 * When probing via ACPI, we have no mechanism to override the sysreg
diff --git a/include/clocksource/arm_arch_timer.h b/include/clocksource/arm_arch_timer.h
index 349e5957c949..702967d996bb 100644
--- a/include/clocksource/arm_arch_timer.h
+++ b/include/clocksource/arm_arch_timer.h
@@ -74,6 +74,7 @@ enum arch_timer_spi_nr {
 struct arch_timer_kvm_info {
struct timecounter timecounter;
int virtual_irq;
+   int physical_irq;
 };
 
 struct arch_timer_mem_frame {
-- 
2.18.0



Re: [PATCH v10 0/8] kvm: arm64: Support PUD hugepage at stage 2

2018-12-18 Thread Christoffer Dall
On Tue, Dec 11, 2018 at 05:10:33PM +, Suzuki K Poulose wrote:
> This series is an update to the PUD hugepage support previously posted
> at [0]. This patchset adds support for PUD hugepages at stage 2, a
> feature that is useful on cores that have support for large sized TLB
> mappings (e.g., 1GB for 4K granule).
> 
> The patches are based on v4.20-rc4
> 
> The patches have been tested on AMD Seattle system with the following
> hugepage sizes - 2M and 1G.
> 
> Right now the PUD hugepage for stage2 is only supported if the stage2
> has 4 levels, i.e., with an IPA size of at least 44 bits with 4K pages.
> This could be relaxed to stage2 with 3 levels, with the stage1 PUD huge
> page mapped in the entry level of the stage2 (i.e, pgd). I have not
> added the change here to keep this version stable w.r.t the previous
> version. I could post a patch later after further discussions in the
> list.
> 

For the series:

Reviewed-by: Christoffer Dall 


Re: [PATCH 10/10] perf/doc: update design.txt for exclude_{host|guest} flags

2018-12-12 Thread Christoffer Dall
On Tue, Dec 11, 2018 at 01:59:03PM +, Andrew Murray wrote:
> On Tue, Dec 11, 2018 at 10:06:53PM +1100, Michael Ellerman wrote:
> > [ Reviving old thread. ]
> > 
> > Andrew Murray  writes:
> > > On Tue, Nov 20, 2018 at 10:31:36PM +1100, Michael Ellerman wrote:
> > >> Andrew Murray  writes:
> > >> 
> > >> > Update design.txt to reflect the presence of the exclude_host
> > >> > and exclude_guest perf flags.
> > >> >
> > >> > Signed-off-by: Andrew Murray 
> > >> > ---
> > >> >  tools/perf/design.txt | 4 
> > >> >  1 file changed, 4 insertions(+)
> > >> >
> > >> > diff --git a/tools/perf/design.txt b/tools/perf/design.txt
> > >> > index a28dca2..7de7d83 100644
> > >> > --- a/tools/perf/design.txt
> > >> > +++ b/tools/perf/design.txt
> > >> > @@ -222,6 +222,10 @@ The 'exclude_user', 'exclude_kernel' and 'exclude_hv' bits provide a
> > >> >  way to request that counting of events be restricted to times when the
> > >> >  CPU is in user, kernel and/or hypervisor mode.
> > >> >  
> > >> > +Furthermore the 'exclude_host' and 'exclude_guest' bits provide a way
> > >> > +to request counting of events restricted to guest and host contexts when
> > >> > +using virtualisation.
> > >> 
> > >> How does exclude_host differ from exclude_hv ?
> > >
> > > I believe exclude_host / exclude_guest are intended to distinguish
> > > between host and guest in the hosted hypervisor context (KVM).
> > 
> > OK yeah, from the perf-list man page:
> > 
> >u - user-space counting
> >k - kernel counting
> >h - hypervisor counting
> >I - non idle counting
> >G - guest counting (in KVM guests)
> >H - host counting (not in KVM guests)
> > 
> > > Whereas exclude_hv allows to distinguish between guest and
> > > hypervisor in the bare-metal type hypervisors.
> > 
> > Except that's exactly not how we use them on powerpc :)
> > 
> > We use exclude_hv to exclude "the hypervisor", regardless of whether
> > it's KVM or PowerVM (which is a bare-metal hypervisor).
> > 
> > We don't use exclude_host / exclude_guest at all, which I guess is a
> > bug, except I didn't know they existed until this thread.
> > 
> > eg, in a KVM guest:
> > 
> >   $ perf record -e cycles:G /bin/bash -c "for i in {0..10}; do :;done"
> >   $ perf report -D | grep -Fc "dso: [hypervisor]"
> >   16
> > 
> > 
> > > In the case of arm64 - if VHE extensions are present then the host
> > > kernel will run at a higher privilege than the guest kernel, in which
> > > case there is no distinction between hypervisor and host so we ignore
> > > exclude_hv. But where VHE extensions are not present then the host
> > > kernel runs at the same privilege level as the guest and we use a
> > > higher privilege level to switch between them - in this case we can
> > > use exclude_hv to discount that hypervisor role of switching between
> > > guests.
> > 
> > I couldn't find any arm64 perf code using exclude_host/guest at all?
> 
> Correct - but this is in flight, as I am currently adding support for
> this; see [1].
> 
> > 
> > And I don't see any x86 code using exclude_hv.
> 
> I can't find any either.
> 
> > 
> > But maybe that's OK, I just worry this is confusing for users.
> 
> There is some extra context regarding this where exclude_guest/exclude_host
> was added, see [2] and where exclude_hv was added, see [3]
> 
> Generally it seems that exclude_guest/exclude_host relies upon switching
> counters off/on on guest/host switch code (which works well in the nested
> virt case). Whereas exclude_hv tends to rely solely on hardware capability
> based on privilege level (which works well in the bare metal case where
> the guest doesn't run at same privilege as the host).
> 
> I think from the user perspective exclude_hv allows you to see your overhead
> if you are a guest (i.e. work done by bare metal hypervisor associated with
> you as the guest). Whereas exclude_guest/exclude_host doesn't allow you to
> see events above you (i.e. the kernel hypervisor) if you are the guest...
> 
> At least that's how I read this, I've copied in others that may provide
> more authoritative feedback.
> 
> [1] https://lists.cs.columbia.edu/pipermail/kvmarm/2018-December/033698.html
> [2] https://www.spinics.net/lists/kvm/msg53996.html
> [3] https://lore.kernel.org/patchwork/patch/143918/
> 

I'll try to answer this in a different way, based on previous
discussions with Joerg et al. who introduced these flags.  Assume no
support for nested virtualization as a first approximation:

  If you are running as a guest:
   - exclude_hv: stop counting events when the hypervisor runs
   - exclude_host: has no effect
   - exclude_guest: has no effect

  If you are running as a host/hypervisor:
   - exclude_hv: has no effect
   - exclude_host: only count events when the guest is running
   - exclude_guest: only count events when the host is running

With nested virtualization, you get the natural union of the above.

**This has 

Re: [PATCH v2 4/4] KVM: arm/arm64: vgic: Make vgic_cpu->ap_list_lock a raw_spinlock

2018-12-11 Thread Christoffer Dall
On Mon, Nov 26, 2018 at 06:26:47PM +, Julien Thierry wrote:
> vgic_cpu->ap_list_lock must always be taken with interrupts disabled as
> it is used in interrupt context.
> 
> For configurations such as PREEMPT_RT_FULL, this means that it should
> be a raw_spinlock since RT spinlocks are interruptible.
> 
> Signed-off-by: Julien Thierry 
> Cc: Christoffer Dall 
> Cc: Marc Zyngier 


Acked-by: Christoffer Dall 


Re: [PATCH v2 1/4] KVM: arm/arm64: vgic: Do not cond_resched_lock() with IRQs disabled

2018-12-11 Thread Christoffer Dall
On Mon, Nov 26, 2018 at 06:26:44PM +, Julien Thierry wrote:
> To change the active state of an MMIO, halt is requested for all vcpus of
> the affected guest before modifying the IRQ state. This is done by calling
> cond_resched_lock() in vgic_mmio_change_active(). However interrupts are
> disabled at this point and we cannot reschedule a vcpu.
> 
> Solve this by waiting for all vcpus to be halted after emitting the halt
> request.
> 
> Signed-off-by: Julien Thierry 
> Suggested-by: Marc Zyngier 
> Cc: Christoffer Dall 
> Cc: Marc Zyngier 
> Cc: sta...@vger.kernel.org
> ---
>  virt/kvm/arm/vgic/vgic-mmio.c | 36 ++--
>  1 file changed, 14 insertions(+), 22 deletions(-)
> 
> diff --git a/virt/kvm/arm/vgic/vgic-mmio.c b/virt/kvm/arm/vgic/vgic-mmio.c
> index f56ff1c..5c76a92 100644
> --- a/virt/kvm/arm/vgic/vgic-mmio.c
> +++ b/virt/kvm/arm/vgic/vgic-mmio.c
> @@ -313,27 +313,6 @@ static void vgic_mmio_change_active(struct kvm_vcpu *vcpu, struct vgic_irq *irq,
>  
>   spin_lock_irqsave(&irq->irq_lock, flags);
>  
> - /*
> -  * If this virtual IRQ was written into a list register, we
> -  * have to make sure the CPU that runs the VCPU thread has
> -  * synced back the LR state to the struct vgic_irq.
> -  *
> -  * As long as the conditions below are true, we know the VCPU thread
> -  * may be on its way back from the guest (we kicked the VCPU thread in
> -  * vgic_change_active_prepare)  and still has to sync back this IRQ,
> -  * so we release and re-acquire the spin_lock to let the other thread
> -  * sync back the IRQ.
> -  *
> -  * When accessing VGIC state from user space, requester_vcpu is
> -  * NULL, which is fine, because we guarantee that no VCPUs are running
> -  * when accessing VGIC state from user space so irq->vcpu->cpu is
> -  * always -1.
> -  */
> - while (irq->vcpu && /* IRQ may have state in an LR somewhere */
> -irq->vcpu != requester_vcpu && /* Current thread is not the VCPU thread */
> -irq->vcpu->cpu != -1) /* VCPU thread is running */
> - cond_resched_lock(&irq->irq_lock);
> -
>   if (irq->hw) {
>   vgic_hw_irq_change_active(vcpu, irq, active, !requester_vcpu);
>   } else {
> @@ -368,8 +347,21 @@ static void vgic_mmio_change_active(struct kvm_vcpu *vcpu, struct vgic_irq *irq,
>   */
>  static void vgic_change_active_prepare(struct kvm_vcpu *vcpu, u32 intid)
>  {
> - if (intid > VGIC_NR_PRIVATE_IRQS)
> + if (intid > VGIC_NR_PRIVATE_IRQS) {
> + struct kvm_vcpu *tmp;
> + int i;
> +
>   kvm_arm_halt_guest(vcpu->kvm);
> +
> + /* Wait for each vcpu to be halted */
> + kvm_for_each_vcpu(i, tmp, vcpu->kvm) {
> + if (tmp == vcpu)
> + continue;
> +
> + while (tmp->cpu != -1)
> + cond_resched();
> + }

I'm actually thinking we don't need this loop at all after the request
rework, which causes:

 1. kvm_arm_halt_guest() to use kvm_make_all_cpus_request(kvm, KVM_REQ_SLEEP), and
 2. KVM_REQ_SLEEP uses REQ_WAIT, and
 3. REQ_WAIT requires the VCPU to respond to IPIs before returning, and
 4. a VCPU thread can only respond when it enables interrupt, and
 5. enabling interrupts when running a VCPU only happens after syncing
    the VGIC hwstate.

Does that make sense?

It would be good if someone can validate this, but if it holds, this
patch just becomes a nice deletion of the logic in
vgic_mmio_change_active().


Thanks,

Christoffer


Re: [PATCH] kvm/arm: return 0 when the number of objects is not lessthan min

2018-12-10 Thread Christoffer Dall
On Thu, Dec 06, 2018 at 09:56:30AM +0800, peng.h...@zte.com.cn wrote:
> >On Wed, Dec 05, 2018 at 09:15:51AM +0800, Peng Hao wrote:
> >> Return 0 when there is enough kvm_mmu_memory_cache object.
> >>
> >> Signed-off-by: Peng Hao 
> >> ---
> >>  virt/kvm/arm/mmu.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >> index ed162a6..fcda0ce 100644
> >> --- a/virt/kvm/arm/mmu.c
> >> +++ b/virt/kvm/arm/mmu.c
> >> @@ -127,7 +127,7 @@ static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
> >>  while (cache->nobjs < max) {
> >>  page = (void *)__get_free_page(PGALLOC_GFP);
> >>  if (!page)
> >> -return -ENOMEM;
> >> +return cache->nobjs >= min ? 0 : -ENOMEM;
> >
> >This condition will never be true here, as the exact same condition is
> >already checked above, and if it had been true, then we wouldn't be here.
> >
> >What problem are you attempting to solve?
> >
> if (cache->nobjs >= min)  --here cache->nobjs can continue downward 
>  return 0;
> while (cache->nobjs < max) {
> page = (void *)__get_free_page(PGALLOC_GFP);
> if (!page)
> return -ENOMEM; ---here it is possible that (cache->nobjs >= min) and (cache->nobjs < max)
> cache->objects[cache->nobjs++] = page; ---here cache->nobjs increasing
>   }
> 
> I just think the logic of this function is to return 0 as long as 
> (cache->nobjs >= min).
> thanks.

That's not the intention, nor is it on any of the other architectures
implementing the same thing (this one goes on the list of stuff we
should be sharing between architectures).

The idea is that you fill up the cache when it goes below min, and you
are always able to fill up to max.

If you're not able to fill up to max, then your system is seriously low
on memory and continuing to run this VM is not likely to be a good idea,
so you might as well tell user space to do something now instead of
waiting until the situation is even worse.
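(Editorial aside: the intended contract described above can be sketched as a self-contained userspace toy; `malloc()` stands in for `__get_free_page()`, and the capacity is illustrative.)

```c
#include <errno.h>
#include <stdlib.h>

#define CACHE_MAX 8  /* illustrative capacity; callers must keep max <= CACHE_MAX */

struct toy_mmu_cache {
    int nobjs;
    void *objects[CACHE_MAX];
};

/* Sketch of the mmu_topup_memory_cache() contract: return 0 without
 * allocating when the cache already holds at least `min` objects,
 * otherwise try to fill all the way to `max`, and fail hard with
 * -ENOMEM if that is not possible. */
static int toy_topup(struct toy_mmu_cache *cache, int min, int max)
{
    if (cache->nobjs >= min)
        return 0;
    while (cache->nobjs < max) {
        void *page = malloc(4096);   /* stands in for __get_free_page() */
        if (!page)
            return -ENOMEM;          /* seriously low on memory: report it now */
        cache->objects[cache->nobjs++] = page;
    }
    return 0;
}
```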


Thanks,

Christoffer


Re: [PATCH v9 1/8] KVM: arm/arm64: Share common code in user_mem_abort()

2018-12-10 Thread Christoffer Dall
On Mon, Dec 10, 2018 at 10:47:42AM +, Suzuki K Poulose wrote:
> 
> 
> On 10/12/2018 08:56, Christoffer Dall wrote:
> >On Mon, Dec 03, 2018 at 01:37:37PM +, Suzuki K Poulose wrote:
> >>Hi Anshuman,
> >>
> >>On 03/12/2018 12:11, Anshuman Khandual wrote:
> >>>
> >>>
> >>>On 10/31/2018 11:27 PM, Punit Agrawal wrote:
> >>>>The code for operations such as marking the pfn as dirty, and
> >>>>dcache/icache maintenance during stage 2 fault handling is duplicated
> >>>>between normal pages and PMD hugepages.
> >>>>
> >>>>Instead of creating another copy of the operations when we introduce
> >>>>PUD hugepages, let's share them across the different pagesizes.
> >>>>
> >>>>Signed-off-by: Punit Agrawal 
> >>>>Reviewed-by: Suzuki K Poulose 
> >>>>Cc: Christoffer Dall 
> >>>>Cc: Marc Zyngier 
> >>>>---
> >>>>  virt/kvm/arm/mmu.c | 49 --
> >>>>  1 file changed, 30 insertions(+), 19 deletions(-)
> >>>>
> >>>>diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >>>>index 5eca48bdb1a6..59595207c5e1 100644
> >>>>--- a/virt/kvm/arm/mmu.c
> >>>>+++ b/virt/kvm/arm/mmu.c
> >>>>@@ -1475,7 +1475,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> >>>>phys_addr_t fault_ipa,
> >>>>unsigned long fault_status)
> >>>>  {
> >>>>  int ret;
> >>>>- bool write_fault, exec_fault, writable, hugetlb = false, force_pte = 
> >>>>false;
> >>>>+ bool write_fault, exec_fault, writable, force_pte = false;
> >>>>  unsigned long mmu_seq;
> >>>>  gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> >>>>  struct kvm *kvm = vcpu->kvm;
> >>>>@@ -1484,7 +1484,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> >>>>phys_addr_t fault_ipa,
> >>>>  kvm_pfn_t pfn;
> >>>>  pgprot_t mem_type = PAGE_S2;
> >>>>  bool logging_active = memslot_is_logging(memslot);
> >>>>- unsigned long flags = 0;
> >>>>+ unsigned long vma_pagesize, flags = 0;
> >>>
> >>>A small nit: s/vma_pagesize/pagesize. Why call it VMA? It's implicit.
> >>
> >>Maybe we could call it mapsize. pagesize is confusing.
> >>
> >
> >I'm ok with mapsize.  I see the vma_pagesize name coming from the fact
> >that this is initially set to the return value from vma_kernel_pagesize.
> >
> >I have no problems with either vma_pagesize or mapsize.
> >
> >>>
> >>>>  write_fault = kvm_is_write_fault(vcpu);
> >>>>  exec_fault = kvm_vcpu_trap_is_iabt(vcpu);
> >>>>@@ -1504,10 +1504,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> >>>>phys_addr_t fault_ipa,
> >>>>  return -EFAULT;
> >>>>  }
> >>>>- if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
> >>>>- hugetlb = true;
> >>>>+ vma_pagesize = vma_kernel_pagesize(vma);
> >>>>+ if (vma_pagesize == PMD_SIZE && !logging_active) {
> >>>>  gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
> >>>>  } else {
> >>>>+ /*
> >>>>+  * Fallback to PTE if it's not one of the Stage 2
> >>>>+  * supported hugepage sizes
> >>>>+  */
> >>>>+ vma_pagesize = PAGE_SIZE;
> >>>
> >>>This seems redundant and should be dropped. vma_kernel_pagesize() here 
> >>>either
> >>>calls hugetlb_vm_op_pagesize (via hugetlb_vm_ops->pagesize) or simply 
> >>>returns
> >>>PAGE_SIZE. The vm_ops path is taken if the QEMU VMA covering any given HVA 
> >>>is
> >>>backed either by HugeTLB pages or simply normal pages. vma_pagesize would
> >>>either has a value of PMD_SIZE (HugeTLB hstate based) or PAGE_SIZE. Hence 
> >>>if
> >>>its not PMD_SIZE it must be PAGE_SIZE which should not be assigned again.
> >>
> >>We may want to force using the PTE mappings when logging_active (e.g, 
> >>migration
> >>?) to prevent keep tracking of huge pages. So the check is still valid

Re: [PATCH v9 5/8] KVM: arm64: Support PUD hugepage in stage2_is_exec()

2018-12-10 Thread Christoffer Dall
On Wed, Dec 05, 2018 at 05:57:51PM +, Suzuki K Poulose wrote:
> 
> 
> On 01/11/2018 13:38, Christoffer Dall wrote:
> >On Wed, Oct 31, 2018 at 05:57:42PM +, Punit Agrawal wrote:
> >>In preparation for creating PUD hugepages at stage 2, add support for
> >>detecting execute permissions on PUD page table entries. Faults due to
> >>lack of execute permissions on page table entries is used to perform
> >>i-cache invalidation on first execute.
> >>
> >>Provide trivial implementations of arm32 helpers to allow sharing of
> >>code.
> >>
> >>Signed-off-by: Punit Agrawal 
> >>Reviewed-by: Suzuki K Poulose 
> >>Cc: Christoffer Dall 
> >>Cc: Marc Zyngier 
> >>Cc: Russell King 
> >>Cc: Catalin Marinas 
> >>Cc: Will Deacon 
> >>---
> >>  arch/arm/include/asm/kvm_mmu.h |  6 +++
> >>  arch/arm64/include/asm/kvm_mmu.h   |  5 +++
> >>  arch/arm64/include/asm/pgtable-hwdef.h |  2 +
> >>  virt/kvm/arm/mmu.c | 53 +++---
> >>  4 files changed, 61 insertions(+), 5 deletions(-)
> >>
> >>diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> >>index 37bf85d39607..839a619873d3 100644
> >>--- a/arch/arm/include/asm/kvm_mmu.h
> >>+++ b/arch/arm/include/asm/kvm_mmu.h
> >>@@ -102,6 +102,12 @@ static inline bool kvm_s2pud_readonly(pud_t *pud)
> >>return false;
> >>  }
> >>+static inline bool kvm_s2pud_exec(pud_t *pud)
> >>+{
> >>+   BUG();
> >
> >nit: I think this should be WARN() now :)
> >
> >>+   return false;
> >>+}
> >>+
> >>  static inline pte_t kvm_s2pte_mkwrite(pte_t pte)
> >>  {
> >>pte_val(pte) |= L_PTE_S2_RDWR;
> >>diff --git a/arch/arm64/include/asm/kvm_mmu.h 
> >>b/arch/arm64/include/asm/kvm_mmu.h
> >>index 8da6d1b2a196..c755b37b3f92 100644
> >>--- a/arch/arm64/include/asm/kvm_mmu.h
> >>+++ b/arch/arm64/include/asm/kvm_mmu.h
> >>@@ -261,6 +261,11 @@ static inline bool kvm_s2pud_readonly(pud_t *pudp)
> >>return kvm_s2pte_readonly((pte_t *)pudp);
> >>  }
> >>+static inline bool kvm_s2pud_exec(pud_t *pudp)
> >>+{
> >>+   return !(READ_ONCE(pud_val(*pudp)) & PUD_S2_XN);
> >>+}
> >>+
> >>  #define hyp_pte_table_empty(ptep) kvm_page_empty(ptep)
> >>  #ifdef __PAGETABLE_PMD_FOLDED
> >>diff --git a/arch/arm64/include/asm/pgtable-hwdef.h 
> >>b/arch/arm64/include/asm/pgtable-hwdef.h
> >>index 1d7d8da2ef9b..336e24cddc87 100644
> >>--- a/arch/arm64/include/asm/pgtable-hwdef.h
> >>+++ b/arch/arm64/include/asm/pgtable-hwdef.h
> >>@@ -193,6 +193,8 @@
> >>  #define PMD_S2_RDWR   (_AT(pmdval_t, 3) << 6)   /* HAP[2:1] */
> >>  #define PMD_S2_XN (_AT(pmdval_t, 2) << 53)  /* XN[1:0] */
> >>+#define PUD_S2_XN  (_AT(pudval_t, 2) << 53)  /* XN[1:0] */
> >>+
> >>  /*
> >>   * Memory Attribute override for Stage-2 (MemAttr[3:0])
> >>   */
> >>diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >>index 1c669c3c1208..8e44dccd1b47 100644
> >>--- a/virt/kvm/arm/mmu.c
> >>+++ b/virt/kvm/arm/mmu.c
> >>@@ -1083,23 +1083,66 @@ static int stage2_set_pmd_huge(struct kvm *kvm, 
> >>struct kvm_mmu_memory_cache
> >>return 0;
> >>  }
> >>-static bool stage2_is_exec(struct kvm *kvm, phys_addr_t addr)
> >>+/*
> >>+ * stage2_get_leaf_entry - walk the stage2 VM page tables and return
> >>+ * true if a valid and present leaf-entry is found. A pointer to the
> >>+ * leaf-entry is returned in the appropriate level variable - pudpp,
> >>+ * pmdpp, ptepp.
> >>+ */
> >>+static bool stage2_get_leaf_entry(struct kvm *kvm, phys_addr_t addr,
> >>+ pud_t **pudpp, pmd_t **pmdpp, pte_t **ptepp)
> >
> >Do we need this type madness or could this just return a u64 pointer
> >(NULL if nothing is found) and pass that to kvm_s2pte_exec (because we
> >know it's the same bit we need to check regardless of the pgtable level
> >on both arm and arm64)?
> >
> >Or do we consider that bad for some reason?
> 
> Practically, yes the bit positions are same and thus we should be able
> to do this assuming that it is just a pte. When we get to independent stage2
> pgtable implementation which treats all page table entries as a single type
> with a level information, we should be able to get rid of these.
> But since we have followed the Linux way of page-table manipulation, we
> have "level"-specific accessors. The other option is to open-code the
> walking sequence from the pgd to the leaf entry everywhere.
> 
> I am fine with changing this code, if you like.
> 

Meh, it just looked a bit over-engineered to me when I originally looked
at the patches, but you're right, they align with the rest of the
implementation.
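(Editorial aside: the observation above — that the stage-2 XN field sits at the same bit position for every leaf level, since PMD_S2_XN and PUD_S2_XN are both `_AT(..., 2) << 53` per the patch — can be sketched as a single level-agnostic check on a raw descriptor value. Toy userspace code, not the kernel helpers.)

```c
#include <stdbool.h>
#include <stdint.h>

/* XN[1:0] lives at bits 54:53 of an arm64 stage-2 leaf descriptor,
 * regardless of whether it is a PTE, PMD or PUD entry; the value
 * 0b10 means execute-never. */
#define S2_XN ((uint64_t)2 << 53)

/* A single helper could therefore test executability at any level. */
static bool toy_s2_leaf_is_exec(uint64_t desc)
{
    return !(desc & S2_XN);
}
```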

Thanks,

Christoffer


Re: [PATCH v9 3/8] KVM: arm/arm64: Introduce helpers to manipulate page table entries

2018-12-10 Thread Christoffer Dall
On Mon, Dec 03, 2018 at 07:20:08PM +0530, Anshuman Khandual wrote:
> 
> 
> On 10/31/2018 11:27 PM, Punit Agrawal wrote:
> > Introduce helpers to abstract architectural handling of the conversion
> > of pfn to page table entries and marking a PMD page table entry as a
> > block entry.
> 
> Why is this necessary ? we would still need to access PMD, PUD as is
> without any conversion. IOW KVM knows the breakdown of the page table
> at various levels. Is this something required from generic KVM code ?
>   
> > 
> > The helpers are introduced in preparation for supporting PUD hugepages
> > at stage 2 - which are supported on arm64 but do not exist on arm.
> 
> Some of these patches (including the earlier two) are good on their
> own. Do we still need to mention the incoming PUD enablement in the
> commit message as the reason for these cleanup patches?
> 

Does it hurt?  What is your concern here?


Thanks,

Christoffer


Re: [PATCH v9 2/8] KVM: arm/arm64: Re-factor setting the Stage 2 entry to exec on fault

2018-12-10 Thread Christoffer Dall
On Wed, Dec 05, 2018 at 10:47:10AM +, Suzuki K Poulose wrote:
> 
> 
> On 03/12/2018 13:32, Anshuman Khandual wrote:
> >
> >
> >On 10/31/2018 11:27 PM, Punit Agrawal wrote:
> >>Stage 2 fault handler marks a page as executable if it is handling an
> >>execution fault or if it was a permission fault in which case the
> >>executable bit needs to be preserved.
> >>
> >>The logic to decide if the page should be marked executable is
> >>duplicated for PMD and PTE entries. To avoid creating another copy
> >>when support for PUD hugepages is introduced refactor the code to
> >>share the checks needed to mark a page table entry as executable.
> >>
> >>Signed-off-by: Punit Agrawal 
> >>Reviewed-by: Suzuki K Poulose 
> >>Cc: Christoffer Dall 
> >>Cc: Marc Zyngier 
> >>---
> >>  virt/kvm/arm/mmu.c | 28 +++-
> >>  1 file changed, 15 insertions(+), 13 deletions(-)
> >>
> >>diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >>index 59595207c5e1..6912529946fb 100644
> >>--- a/virt/kvm/arm/mmu.c
> >>+++ b/virt/kvm/arm/mmu.c
> >>@@ -1475,7 +1475,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> >>phys_addr_t fault_ipa,
> >>  unsigned long fault_status)
> >>  {
> >>int ret;
> >>-   bool write_fault, exec_fault, writable, force_pte = false;
> >>+   bool write_fault, writable, force_pte = false;
> >>+   bool exec_fault, needs_exec;
> >
> >New line not required, still within 80 characters.
> >
> >>unsigned long mmu_seq;
> >>gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> >>struct kvm *kvm = vcpu->kvm;
> >>@@ -1598,19 +1599,25 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> >>phys_addr_t fault_ipa,
> >>if (exec_fault)
> >>invalidate_icache_guest_page(pfn, vma_pagesize);
> >>+   /*
> >>+* If we took an execution fault we have made the
> >>+* icache/dcache coherent above and should now let the s2
> >
> >Coherent or invalidated with invalidate_icache_guest_page ?
> 
> We also do clean_dcache above if needed. So that makes sure
> the data is coherent. Am I missing something here ?
> 

I think you've got it right.  We have made the icache coherent with the
data/instructions in the page by invalidating the icache.  I think the
comment is ok either way.
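(Editorial aside: the maintenance rules being discussed can be restated as a toy decision table — clean the dcache for any non-permission fault, and invalidate the icache only on an execution fault, which is what makes the icache coherent before the stage-2 entry is marked executable. FSC_PERM's value matches the arm/arm64 permission-fault status code.)

```c
#include <stdbool.h>

#define FSC_PERM 0x0c  /* permission fault status code */

/* A fresh mapping (any non-permission fault) needs its dcache cleaned
 * so the guest sees the page contents. */
static bool toy_need_dcache_clean(unsigned long fault_status)
{
    return fault_status != FSC_PERM;
}

/* Only an execution fault needs the icache invalidated before the
 * entry is made executable. */
static bool toy_need_icache_inval(bool exec_fault)
{
    return exec_fault;
}
```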

Thanks,

Christoffer


Re: [PATCH v9 2/8] KVM: arm/arm64: Re-factor setting the Stage 2 entry to exec on fault

2018-12-10 Thread Christoffer Dall
On Mon, Dec 03, 2018 at 07:02:23PM +0530, Anshuman Khandual wrote:
> 
> 
> On 10/31/2018 11:27 PM, Punit Agrawal wrote:
> > Stage 2 fault handler marks a page as executable if it is handling an
> > execution fault or if it was a permission fault in which case the
> > executable bit needs to be preserved.
> > 
> > The logic to decide if the page should be marked executable is
> > duplicated for PMD and PTE entries. To avoid creating another copy
> > when support for PUD hugepages is introduced refactor the code to
> > share the checks needed to mark a page table entry as executable.
> > 
> > Signed-off-by: Punit Agrawal 
> > Reviewed-by: Suzuki K Poulose 
> > Cc: Christoffer Dall 
> > Cc: Marc Zyngier 
> > ---
> >  virt/kvm/arm/mmu.c | 28 +++-
> >  1 file changed, 15 insertions(+), 13 deletions(-)
> > 
> > diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> > index 59595207c5e1..6912529946fb 100644
> > --- a/virt/kvm/arm/mmu.c
> > +++ b/virt/kvm/arm/mmu.c
> > @@ -1475,7 +1475,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> > phys_addr_t fault_ipa,
> >   unsigned long fault_status)
> >  {
> > int ret;
> > -   bool write_fault, exec_fault, writable, force_pte = false;
> > +   bool write_fault, writable, force_pte = false;
> > +   bool exec_fault, needs_exec;
> 
> New line not required, still within 80 characters.
> 

He's trying to logically group the two variables.  I don't see a problem
with that.


Thanks,

Christoffer


Re: [PATCH v9 1/8] KVM: arm/arm64: Share common code in user_mem_abort()

2018-12-10 Thread Christoffer Dall
On Mon, Dec 03, 2018 at 01:37:37PM +, Suzuki K Poulose wrote:
> Hi Anshuman,
> 
> On 03/12/2018 12:11, Anshuman Khandual wrote:
> >
> >
> >On 10/31/2018 11:27 PM, Punit Agrawal wrote:
> >>The code for operations such as marking the pfn as dirty, and
> >>dcache/icache maintenance during stage 2 fault handling is duplicated
> >>between normal pages and PMD hugepages.
> >>
> >>Instead of creating another copy of the operations when we introduce
> >>PUD hugepages, let's share them across the different pagesizes.
> >>
> >>Signed-off-by: Punit Agrawal 
> >>Reviewed-by: Suzuki K Poulose 
> >>Cc: Christoffer Dall 
> >>Cc: Marc Zyngier 
> >>---
> >>  virt/kvm/arm/mmu.c | 49 --
> >>  1 file changed, 30 insertions(+), 19 deletions(-)
> >>
> >>diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >>index 5eca48bdb1a6..59595207c5e1 100644
> >>--- a/virt/kvm/arm/mmu.c
> >>+++ b/virt/kvm/arm/mmu.c
> >>@@ -1475,7 +1475,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> >>phys_addr_t fault_ipa,
> >>  unsigned long fault_status)
> >>  {
> >>int ret;
> >>-   bool write_fault, exec_fault, writable, hugetlb = false, force_pte = 
> >>false;
> >>+   bool write_fault, exec_fault, writable, force_pte = false;
> >>unsigned long mmu_seq;
> >>gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> >>struct kvm *kvm = vcpu->kvm;
> >>@@ -1484,7 +1484,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> >>phys_addr_t fault_ipa,
> >>kvm_pfn_t pfn;
> >>pgprot_t mem_type = PAGE_S2;
> >>bool logging_active = memslot_is_logging(memslot);
> >>-   unsigned long flags = 0;
> >>+   unsigned long vma_pagesize, flags = 0;
> >
> >A small nit: s/vma_pagesize/pagesize. Why call it VMA? It's implicit.
> 
> Maybe we could call it mapsize. pagesize is confusing.
> 

I'm ok with mapsize.  I see the vma_pagesize name coming from the fact
that this is initially set to the return value from vma_kernel_pagesize.

I have no problems with either vma_pagesize or mapsize.

> >
> >>write_fault = kvm_is_write_fault(vcpu);
> >>exec_fault = kvm_vcpu_trap_is_iabt(vcpu);
> >>@@ -1504,10 +1504,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> >>phys_addr_t fault_ipa,
> >>return -EFAULT;
> >>}
> >>-   if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
> >>-   hugetlb = true;
> >>+   vma_pagesize = vma_kernel_pagesize(vma);
> >>+   if (vma_pagesize == PMD_SIZE && !logging_active) {
> >>gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
> >>} else {
> >>+   /*
> >>+* Fallback to PTE if it's not one of the Stage 2
> >>+* supported hugepage sizes
> >>+*/
> >>+   vma_pagesize = PAGE_SIZE;
> >
> >This seems redundant and should be dropped. vma_kernel_pagesize() here either
> >calls hugetlb_vm_op_pagesize (via hugetlb_vm_ops->pagesize) or simply returns
> >PAGE_SIZE. The vm_ops path is taken if the QEMU VMA covering any given HVA is
> >backed either by HugeTLB pages or simply normal pages. vma_pagesize would
> >either has a value of PMD_SIZE (HugeTLB hstate based) or PAGE_SIZE. Hence if
> >its not PMD_SIZE it must be PAGE_SIZE which should not be assigned again.
> 
> We may want to force using the PTE mappings when logging_active (e.g, 
> migration
> ?) to prevent keep tracking of huge pages. So the check is still valid.
> 
> 

Agreed, and let's not additionally try to change the logic and flow with
this patch.
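(Editorial aside: the point being made — that the PAGE_SIZE fallback is not redundant — can be sketched as a toy function: even when the VMA is PMD-sized, active dirty logging must force PTE-granularity mappings, so vma_pagesize has to be overridden on that path. Sizes below assume a 4K granule; userspace toy, not the kernel code.)

```c
#include <stdbool.h>

#define TOY_PAGE_SIZE 4096UL
#define TOY_PMD_SIZE  (512 * TOY_PAGE_SIZE)  /* 2M with 4K pages */

/* Decide the stage-2 mapping granularity for a fault, per the logic
 * under review in user_mem_abort(). */
static unsigned long toy_stage2_map_size(unsigned long vma_pagesize,
                                         bool logging_active)
{
    if (vma_pagesize == TOY_PMD_SIZE && !logging_active)
        return TOY_PMD_SIZE;
    /* Fallback to PTE: logging is active, or the VMA page size is not
     * one of the supported stage-2 hugepage sizes. */
    return TOY_PAGE_SIZE;
}
```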

> >
> >>+
> >>/*
> >> * Pages belonging to memslots that don't have the same
> >> * alignment for userspace and IPA cannot be mapped using
> >>@@ -1573,23 +1579,33 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> >>phys_addr_t fault_ipa,
> >>if (mmu_notifier_retry(kvm, mmu_seq))
> >>goto out_unlock;
> >>-   if (!hugetlb && !force_pte)
> >>-   hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
> >>+   if (vma_pagesize == PAGE_SIZE && !force_pte) {
> >>+   /*
> >>+* Only PMD_SIZE transparent hugepages(THP) are
> >>+* currently supported. This code will need to be
> >>+* updated to support other THP sizes.

Re: [PATCH] kvm: arm/arm64 : fix vm's hanging at startup time

2018-11-21 Thread Christoffer Dall
On Wed, Nov 21, 2018 at 12:17:45PM +, Julien Thierry wrote:
> 
> 
> On 21/11/18 11:06, Christoffer Dall wrote:
> >Hi,
> >
> >On Wed, Nov 21, 2018 at 04:56:54PM +0800, peng.h...@zte.com.cn wrote:
> >>>On 19/11/2018 09:10, Mark Rutland wrote:
> >>>>On Sat, Nov 17, 2018 at 10:58:37AM +0800, peng.h...@zte.com.cn wrote:
> >>>>>>On 16/11/18 00:23, peng.h...@zte.com.cn wrote:
> >>>>>>>>Hi,
> >>>>>>>>>When virtual machine starts, hang up.
> >>>>>>>>
> >>>>>>>>I take it you mean the *guest* hangs? Because it doesn't get a timer
> >>>>>>>>interrupt?
> >>>>>>>>
> >>>>>>>>>The kernel version of guest
> >>>>>>>>>is 4.16. Host support vgic_v3.
> >>>>>>>>
> >>>>>>>>Your host kernel is something recent, I guess?
> >>>>>>>>
> >>>>>>>>>It was mainly due to the incorrect vgic_irq's(intid=27) group value
> >>>>>>>>>during injection interruption. when kvm_vgic_vcpu_init is called,
> >>>>>>>>>dist is not initialized at this time. Unable to get vgic V3 or V2
> >>>>>>>>>correctly, so group is not set.
> >>>>>>>>
> >>>>>>>>Mmh, that shouldn't happen with (v)GICv3. Do you use QEMU (which
> >>>>>>>>version?) or some other userland tool?
> >>>>>>>>
> >>>>>>>
> >>>>>>>QEMU emulator version 3.0.50 .
> >>>>>>>
> >>>>>>>>>group is setted to 1 when vgic_mmio_write_group is invoked at some
> >>>>>>>>>time.
> >>>>>>>>>when irq->group=0 (intid=27), No ICH_LR_GROUP flag was set and
> >>>>>>>>>interrupt injection failed.
> >>>>>>>>>
> >>>>>>>>>Signed-off-by: Peng Hao 
> >>>>>>>>>---
> >>>>>>>>>   virt/kvm/arm/vgic/vgic-v3.c | 2 +-
> >>>>>>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>>>>>
> >>>>>>>>>diff --git a/virt/kvm/arm/vgic/vgic-v3.c 
> >>>>>>>>>b/virt/kvm/arm/vgic/vgic-v3.c
> >>>>>>>>>index 9c0dd23..d101000 100644
> >>>>>>>>>--- a/virt/kvm/arm/vgic/vgic-v3.c
> >>>>>>>>>+++ b/virt/kvm/arm/vgic/vgic-v3.c
> >>>>>>>>>@@ -198,7 +198,7 @@ void vgic_v3_populate_lr(struct kvm_vcpu *vcpu,
> >>>>>>>>>struct vgic_irq *irq, int lr) if (vgic_irq_is_mapped_level(irq) &&
> >>>>>>>>>(val & ICH_LR_PENDING_BIT)) irq->line_level = false;
> >>>>>>>>>
> >>>>>>>>>-if (irq->group)
> >>>>>>>>>+if (model == KVM_DEV_TYPE_ARM_VGIC_V3)
> >>>>>>>>
> >>>>>>>>This is not the right fix, not only because it basically reverts the
> >>>>>>>>GICv3 part of 87322099052 (KVM: arm/arm64: vgic: Signal IRQs using
> >>>>>>>>their configured group).
> >>>>>>>>
> >>>>>>>>Can you try to work out why kvm_vgic_vcpu_init() is apparently called
> >>>>>>>>before dist->vgic_model is set, also what value it has?
> >>>>>>>>If I understand the code correctly, that shouldn't happen for a GICv3.
> >>>>>>>>
> >>>>>>>Even if the value of  group is correctly assigned in 
> >>>>>>>kvm_vgic_vcpu_init, the group is then written 0 through 
> >>>>>>>vgic_mmio_write_group.
> >>>>>>>   If the interrupt comes at this time, the interrupt injection fails.
> >>>>>>
> >>>>>>Does that mean that the guest is configuring its interrupts as Group0?
> >>>>>>That sounds wrong, Linux should configure all its interrupts as
> >>>>>>non-secure group1.
> >>>>>
> >>>>>No, I think that UEFI does this, not Linux.
> >>>>>1. kvm_vgic_vcpu_init
> >>>>>2. vgic_create
> >>

Re: [PATCH v2 2/4] KVM: arm/arm64: Introduce helpers to manipulate page table entries

2018-05-04 Thread Christoffer Dall
On Tue, May 01, 2018 at 02:00:43PM +0100, Punit Agrawal wrote:
> Hi Suzuki,
> 
> Thanks for having a look.
> 
> Suzuki K Poulose  writes:
> 
> > On 01/05/18 11:26, Punit Agrawal wrote:
> >> Introduce helpers to abstract architectural handling of the conversion
> >> of pfn to page table entries and marking a PMD page table entry as a
> >> block entry.
> >>
> >> The helpers are introduced in preparation for supporting PUD hugepages
> >> at stage 2 - which are supported on arm64 but do not exist on arm.
> >
> > Punit,
> >
> > The change are fine by me. However, we usually do not define kvm_*
> > accessors for something which we know matches with the host variant.
> > i.e, PMD and PTE helpers, which are always present and we make use
> > of them directly. (see unmap_stage2_pmds for e.g)
> 
> In general, I agree - it makes sense to avoid duplication.
> 
> Having said that, the helpers here allow following a common pattern for
> handling the various page sizes - pte, pmd and pud - during stage 2
> fault handling (see patch 4).
> 
> As you've said you're OK with this change, I'd prefer to keep this patch
> but will drop it if any others reviewers are concerned about the
> duplication as well.

There are arguments for both keeping the kvm_ wrappers and not having
them.  I see no big harm or increase in complexity by keeping them
though.

Thanks,
-Christoffer


Re: [PATCH v2 4/4] KVM: arm64: Add support for PUD hugepages at stage 2

2018-05-04 Thread Christoffer Dall
On Tue, May 01, 2018 at 11:26:59AM +0100, Punit Agrawal wrote:
> KVM currently supports PMD hugepages at stage 2. Extend the stage 2
> fault handling to add support for PUD hugepages.
> 
> Addition of pud hugepage support enables additional hugepage
> sizes (e.g., 1G with 4K granule) which can be useful on cores that
> support mapping larger block sizes in the TLB entries.
> 
> Signed-off-by: Punit Agrawal 
> Cc: Christoffer Dall 
> Cc: Marc Zyngier 
> Cc: Russell King 
> Cc: Catalin Marinas 
> Cc: Will Deacon 

Reviewed-by: Christoffer Dall 

> ---
>  arch/arm/include/asm/kvm_mmu.h | 19 
>  arch/arm64/include/asm/kvm_mmu.h   | 15 ++
>  arch/arm64/include/asm/pgtable-hwdef.h |  4 +++
>  arch/arm64/include/asm/pgtable.h   |  2 ++
>  virt/kvm/arm/mmu.c | 40 --
>  5 files changed, 77 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 224c22c0a69c..155916dbdd7e 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -77,8 +77,11 @@ void kvm_clear_hyp_idmap(void);
>  
>  #define kvm_pfn_pte(pfn, prot)   pfn_pte(pfn, prot)
>  #define kvm_pfn_pmd(pfn, prot)   pfn_pmd(pfn, prot)
> +#define kvm_pfn_pud(pfn, prot)   (__pud(0))
>  
>  #define kvm_pmd_mkhuge(pmd)  pmd_mkhuge(pmd)
> +/* No support for pud hugepages */
> +#define kvm_pud_mkhuge(pud)  (pud)
>  
>  /*
>   * The following kvm_*pud*() functions are provided strictly to allow
> @@ -95,6 +98,22 @@ static inline bool kvm_s2pud_readonly(pud_t *pud)
>   return false;
>  }
>  
> +static inline void kvm_set_pud(pud_t *pud, pud_t new_pud)
> +{
> + BUG();
> +}
> +
> +static inline pud_t kvm_s2pud_mkwrite(pud_t pud)
> +{
> + BUG();
> + return pud;
> +}
> +
> +static inline pud_t kvm_s2pud_mkexec(pud_t pud)
> +{
> + BUG();
> + return pud;
> +}
>  
>  static inline void kvm_set_pmd(pmd_t *pmd, pmd_t new_pmd)
>  {
> diff --git a/arch/arm64/include/asm/kvm_mmu.h 
> b/arch/arm64/include/asm/kvm_mmu.h
> index f440cf216a23..f49a68fcbf26 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -172,11 +172,14 @@ void kvm_clear_hyp_idmap(void);
>  
>  #define  kvm_set_pte(ptep, pte)  set_pte(ptep, pte)
>  #define  kvm_set_pmd(pmdp, pmd)  set_pmd(pmdp, pmd)
> +#define kvm_set_pud(pudp, pud)   set_pud(pudp, pud)
>  
>  #define kvm_pfn_pte(pfn, prot)   pfn_pte(pfn, prot)
>  #define kvm_pfn_pmd(pfn, prot)   pfn_pmd(pfn, prot)
> +#define kvm_pfn_pud(pfn, prot)   pfn_pud(pfn, prot)
>  
>  #define kvm_pmd_mkhuge(pmd)  pmd_mkhuge(pmd)
> +#define kvm_pud_mkhuge(pud)  pud_mkhuge(pud)
>  
>  static inline pte_t kvm_s2pte_mkwrite(pte_t pte)
>  {
> @@ -190,6 +193,12 @@ static inline pmd_t kvm_s2pmd_mkwrite(pmd_t pmd)
>   return pmd;
>  }
>  
> +static inline pud_t kvm_s2pud_mkwrite(pud_t pud)
> +{
> + pud_val(pud) |= PUD_S2_RDWR;
> + return pud;
> +}
> +
>  static inline pte_t kvm_s2pte_mkexec(pte_t pte)
>  {
>   pte_val(pte) &= ~PTE_S2_XN;
> @@ -202,6 +211,12 @@ static inline pmd_t kvm_s2pmd_mkexec(pmd_t pmd)
>   return pmd;
>  }
>  
> +static inline pud_t kvm_s2pud_mkexec(pud_t pud)
> +{
> + pud_val(pud) &= ~PUD_S2_XN;
> + return pud;
> +}
> +
>  static inline void kvm_set_s2pte_readonly(pte_t *ptep)
>  {
>   pteval_t old_pteval, pteval;
> diff --git a/arch/arm64/include/asm/pgtable-hwdef.h 
> b/arch/arm64/include/asm/pgtable-hwdef.h
> index fd208eac9f2a..e327665e94d1 100644
> --- a/arch/arm64/include/asm/pgtable-hwdef.h
> +++ b/arch/arm64/include/asm/pgtable-hwdef.h
> @@ -193,6 +193,10 @@
>  #define PMD_S2_RDWR  (_AT(pmdval_t, 3) << 6)   /* HAP[2:1] */
>  #define PMD_S2_XN(_AT(pmdval_t, 2) << 53)  /* XN[1:0] */
>  
> +#define PUD_S2_RDONLY(_AT(pudval_t, 1) << 6)   /* HAP[2:1] */
> +#define PUD_S2_RDWR  (_AT(pudval_t, 3) << 6)   /* HAP[2:1] */
> +#define PUD_S2_XN(_AT(pudval_t, 2) << 53)  /* XN[1:0] */
> +
>  /*
>   * Memory Attribute override for Stage-2 (MemAttr[3:0])
>   */
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 7c4c8f318ba9..31ea9fda07e3 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -386,6 +386,8 @@ static inline int pmd_protnone(pmd_t pmd)
>  
>  #define pud_write(pud)   pte_write(pud_pte(pud))
>  
> +#define pud_m

Re: [PATCH v2 1/4] KVM: arm/arm64: Share common code in user_mem_abort()

2018-05-04 Thread Christoffer Dall
On Tue, May 01, 2018 at 11:26:56AM +0100, Punit Agrawal wrote:
> The code for operations such as marking the pfn as dirty, and
> dcache/icache maintenance during stage 2 fault handling is duplicated
> between normal pages and PMD hugepages.
> 
> Instead of creating another copy of the operations when we introduce
> PUD hugepages, let's share them across the different pagesizes.
> 
> Signed-off-by: Punit Agrawal 
> Reviewed-by: Christoffer Dall 
> Cc: Marc Zyngier 
> ---
>  virt/kvm/arm/mmu.c | 66 +++---
>  1 file changed, 39 insertions(+), 27 deletions(-)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 7f6a944db23d..686fc6a4b866 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1396,6 +1396,21 @@ static void invalidate_icache_guest_page(kvm_pfn_t 
> pfn, unsigned long size)
>   __invalidate_icache_guest_page(pfn, size);
>  }
>  
> +static bool stage2_should_exec(struct kvm *kvm, phys_addr_t addr,
> +bool exec_fault, unsigned long fault_status)
> +{
> + /*
> +  * If we took an execution fault we will have made the
> +  * icache/dcache coherent and should now let the s2 mapping be
> +  * executable.
> +  *
> +  * Write faults (!exec_fault && FSC_PERM) are orthogonal to
> +  * execute permissions, and we preserve whatever we have.
> +  */
> + return exec_fault ||
> + (fault_status == FSC_PERM && stage2_is_exec(kvm, addr));
> +}
> +
>  static void kvm_send_hwpoison_signal(unsigned long address,
>struct vm_area_struct *vma)
>  {
> @@ -1428,7 +1443,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   kvm_pfn_t pfn;
>   pgprot_t mem_type = PAGE_S2;
>   bool logging_active = memslot_is_logging(memslot);
> - unsigned long flags = 0;
> + unsigned long vma_pagesize, flags = 0;
>  
>   write_fault = kvm_is_write_fault(vcpu);
>   exec_fault = kvm_vcpu_trap_is_iabt(vcpu);
> @@ -1448,7 +1463,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   return -EFAULT;
>   }
>  
> - if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
> + vma_pagesize = vma_kernel_pagesize(vma);
> + if (vma_pagesize == PMD_SIZE && !logging_active) {
>   hugetlb = true;
>   gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
>   } else {
> @@ -1517,28 +1533,34 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   if (mmu_notifier_retry(kvm, mmu_seq))
>   goto out_unlock;
>  
> - if (!hugetlb && !force_pte)
> + if (!hugetlb && !force_pte) {
>   hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
> + /*
> +  * Only PMD_SIZE transparent hugepages(THP) are
> +  * currently supported. This code will need to be
> +  * updated to support other THP sizes.
> +  */
> + if (hugetlb)
> + vma_pagesize = PMD_SIZE;

nit: this is a bit of a trap waiting to happen, as the suggested
semantics of hugetlb is now hugetlbfs and not THP.

It may be slightly nicer to do:

if (transparent_hugepage_adjust(&pfn, &fault_ipa))
vma_pagesize = PMD_SIZE;

> + }
> +
> + if (writable)
> + kvm_set_pfn_dirty(pfn);
> +
> + if (fault_status != FSC_PERM)
> + clean_dcache_guest_page(pfn, vma_pagesize);
> +
> + if (exec_fault)
> + invalidate_icache_guest_page(pfn, vma_pagesize);
>  
>   if (hugetlb) {
>   pmd_t new_pmd = pfn_pmd(pfn, mem_type);
>   new_pmd = pmd_mkhuge(new_pmd);
> - if (writable) {
> + if (writable)
>   new_pmd = kvm_s2pmd_mkwrite(new_pmd);
> - kvm_set_pfn_dirty(pfn);
> - }
>  
> - if (fault_status != FSC_PERM)
> - clean_dcache_guest_page(pfn, PMD_SIZE);
> -
> - if (exec_fault) {
> + if (stage2_should_exec(kvm, fault_ipa, exec_fault, 
> fault_status))
>   new_pmd = kvm_s2pmd_mkexec(new_pmd);
> - invalidate_icache_guest_page(pfn, PMD_SIZE);
> - } else if (fault_status == FSC_PERM) {
> - /* Preserve execute if XN was already cleared */
> - if (stage2_is_exec(kvm, fault_ipa))
> - new_pmd = kvm_s2pmd_mkexec(new_pmd)

Re: [PATCH v5 11/12] KVM: arm/arm64: Implement KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION

2018-04-30 Thread Christoffer Dall
On Mon, Apr 30, 2018 at 11:07:43AM +0200, Eric Auger wrote:
> Now all the internals are ready to handle multiple redistributor
> regions, let's allow the userspace to register them.
> 
> Signed-off-by: Eric Auger 

Reviewed-by: Christoffer Dall 

> 
> ---
> v4 -> v5:
> - s/uint_t/u
> - fix KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION read
> - fix read path
> - return -ENOENT instead of -ENODEV when reading the
>   attributes of a region that does not exist
> 
> v3 -> v4:
> - vgic_v3_rdist_region_from_index is introduced in this patch
> 
> v2 -> v3:
> - early exit if vgic_v3_rdist_region_from_index() fails
> ---
>  virt/kvm/arm/vgic/vgic-kvm-device.c | 40 
> -
>  virt/kvm/arm/vgic/vgic-mmio-v3.c|  4 ++--
>  virt/kvm/arm/vgic/vgic-v3.c | 14 +
>  virt/kvm/arm/vgic/vgic.h| 13 +++-
>  4 files changed, 67 insertions(+), 4 deletions(-)
> 
> diff --git a/virt/kvm/arm/vgic/vgic-kvm-device.c 
> b/virt/kvm/arm/vgic/vgic-kvm-device.c
> index 76ab369..6ada243 100644
> --- a/virt/kvm/arm/vgic/vgic-kvm-device.c
> +++ b/virt/kvm/arm/vgic/vgic-kvm-device.c
> @@ -92,7 +92,7 @@ int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 
> *addr, bool write)
>   if (r)
>   break;
>   if (write) {
> - r = vgic_v3_set_redist_base(kvm, *addr);
> + r = vgic_v3_set_redist_base(kvm, 0, *addr, 0);
>   goto out;
>   }
>   rdreg = list_first_entry(&vgic->rd_regions,
> @@ -103,6 +103,43 @@ int kvm_vgic_addr(struct kvm *kvm, unsigned long type, 
> u64 *addr, bool write)
>   addr_ptr = &rdreg->base;
>   break;
>   }
> + case KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION:
> + {
> + struct vgic_redist_region *rdreg;
> + u8 index;
> +
> + r = vgic_check_type(kvm, KVM_DEV_TYPE_ARM_VGIC_V3);
> + if (r)
> + break;
> +
> + index = *addr & KVM_VGIC_V3_RDIST_INDEX_MASK;
> +
> + if (write) {
> + gpa_t base = *addr & KVM_VGIC_V3_RDIST_BASE_MASK;
> + u32 count = (*addr & KVM_VGIC_V3_RDIST_COUNT_MASK)
> + >> KVM_VGIC_V3_RDIST_COUNT_SHIFT;
> + u8 flags = (*addr & KVM_VGIC_V3_RDIST_FLAGS_MASK)
> + >> KVM_VGIC_V3_RDIST_FLAGS_SHIFT;
> +
> + if (!count || flags)
> + r = -EINVAL;
> + else
> + r = vgic_v3_set_redist_base(kvm, index,
> + base, count);
> + goto out;
> + }
> +
> + rdreg = vgic_v3_rdist_region_from_index(kvm, index);
> + if (!rdreg) {
> + r = -ENOENT;
> + goto out;
> + }
> +
> + *addr = index;
> + *addr |= rdreg->base;
> + *addr |= (u64)rdreg->count << KVM_VGIC_V3_RDIST_COUNT_SHIFT;
> + goto out;
> + }
>   default:
>   r = -ENODEV;
>   }
> @@ -674,6 +711,7 @@ static int vgic_v3_has_attr(struct kvm_device *dev,
>   switch (attr->attr) {
>   case KVM_VGIC_V3_ADDR_TYPE_DIST:
>   case KVM_VGIC_V3_ADDR_TYPE_REDIST:
> + case KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION:
>   return 0;
>   }
>   break;
> diff --git a/virt/kvm/arm/vgic/vgic-mmio-v3.c 
> b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> index 1b1e07f..fcebd42 100644
> --- a/virt/kvm/arm/vgic/vgic-mmio-v3.c
> +++ b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> @@ -764,11 +764,11 @@ static int vgic_v3_insert_redist_region(struct kvm 
> *kvm, uint32_t index,
>   return ret;
>  }
>  
> -int vgic_v3_set_redist_base(struct kvm *kvm, u64 addr)
> +int vgic_v3_set_redist_base(struct kvm *kvm, u32 index, u64 addr, u32 count)
>  {
>   int ret;
>  
> - ret = vgic_v3_insert_redist_region(kvm, 0, addr, 0);
> + ret = vgic_v3_insert_redist_region(kvm, index, addr, count);
>   if (ret)
>   return ret;
>  
> diff --git a/virt/kvm/arm/vgic/vgic-v3.c b/virt/kvm/arm/vgic/vgic-v3.c
> index 5563671..70d5ba1 100644
> --- a/virt/kvm/arm/vgic/vgic-v3.c
> +++ b/virt/kvm/arm/vgic/vgic-v3.c
> @@ -482,6 +482,20 @@ struct vgic_redist_region 
> *vgic_v3_rdist_free_slot(struct list_head *rd_regions)
>   return NULL;
>  }

Re: [PATCH v5 07/12] KVM: arm/arm64: Helper to register a new redistributor region

2018-04-30 Thread Christoffer Dall
 if (!vgic_v3_check_base(kvm)) {
> - ret = -EINVAL;
> - goto out;
> - }
> + rdreg->base = base;
> + rdreg->count = count;
> + rdreg->free_index = 0;
> + rdreg->index = index;
>  
> - list_add(&rdreg->list, &vgic->rd_regions);
> + list_add_tail(&rdreg->list, rd_regions);
> + return 0;
> +free:
> + kfree(rdreg);
> + return ret;
> +}
> +
> +int vgic_v3_set_redist_base(struct kvm *kvm, u64 addr)
> +{
> + int ret;
> +
> + ret = vgic_v3_insert_redist_region(kvm, 0, addr, 0);
> + if (ret)
> + return ret;
>  
>   /*
>* Register iodevs for each existing VCPU.  Adding more VCPUs
> @@ -717,10 +778,6 @@ int vgic_v3_set_redist_base(struct kvm *kvm, u64 addr)
>   return ret;
>  
>   return 0;
> -
> -out:
> - kfree(rdreg);
> - return ret;
>  }
>  
>  int vgic_v3_has_attr_regs(struct kvm_device *dev, struct kvm_device_attr 
> *attr)
> diff --git a/virt/kvm/arm/vgic/vgic.h b/virt/kvm/arm/vgic/vgic.h
> index e6e3ae9..6af7d8a 100644
> --- a/virt/kvm/arm/vgic/vgic.h
> +++ b/virt/kvm/arm/vgic/vgic.h
> @@ -272,6 +272,14 @@ vgic_v3_rd_region_size(struct kvm *kvm, struct 
> vgic_redist_region *rdreg)
>  }
>  bool vgic_v3_rdist_overlap(struct kvm *kvm, gpa_t base, size_t size);
>  
> +static inline bool vgic_dist_overlap(struct kvm *kvm, gpa_t base, size_t 
> size)
> +{
> + struct vgic_dist *d = &kvm->arch.vgic;
> +
> + return (base + size > d->vgic_dist_base) &&
> + (base < d->vgic_dist_base + KVM_VGIC_V3_DIST_SIZE);
> +}
> +
>  int vgic_its_resolve_lpi(struct kvm *kvm, struct vgic_its *its,
>u32 devid, u32 eventid, struct vgic_irq **irq);
>  struct vgic_its *vgic_msi_to_its(struct kvm *kvm, struct kvm_msi *msi);
> -- 
> 2.5.5
> 

Besides the nit about using list_last_entry():

Reviewed-by: Christoffer Dall 


Re: [PATCH v5 02/12] KVM: arm/arm64: Document KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION

2018-04-30 Thread Christoffer Dall
On Mon, Apr 30, 2018 at 11:07:34AM +0200, Eric Auger wrote:
> We introduce a new KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION attribute in
> KVM_DEV_ARM_VGIC_GRP_ADDR group. It allows userspace to provide the
> base address and size of a redistributor region
> 
> Compared to KVM_VGIC_V3_ADDR_TYPE_REDIST, this new attribute allows
> to declare several separate redistributor regions.
> 
> So the whole redist space does not need to be contiguous anymore.
> 
> Signed-off-by: Eric Auger 

Acked-by: Christoffer Dall 

> 
> ---
> v4 -> v5:
> - Document read access
> - removed Peter's R-b
> 
> v3 -> v4:
> - Added Peter's R-b
> ---
>  Documentation/virtual/kvm/devices/arm-vgic-v3.txt | 30 
> +--
>  1 file changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/arm-vgic-v3.txt 
> b/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> index 9293b45..dcb7504 100644
> --- a/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> +++ b/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> @@ -27,16 +27,42 @@ Groups:
>VCPU and all of the redistributor pages are contiguous.
>Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
>This address needs to be 64K aligned.
> +
> +KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION (rw, 64-bit)
> +  The attribute data pointed by kvm_device_attr.addr is a __u64 value:
> +  bits: | 63  ....  52  |  51  ....  16 | 15 - 12  |11 - 0
> +  values:   | count  |   base  |  flags   | index
> +  - index encodes the unique redistributor region index
> +  - flags: reserved for future use, currently 0
> +  - base field encodes bits [51:16] of the guest physical base address
> +of the first redistributor in the region.
> +  - count encodes the number of redistributors in the region. Must be
> +greater than 0.
> +  There are two 64K pages for each redistributor in the region and
> +  redistributors are laid out contiguously within the region. Regions
> +  are filled with redistributors in the index order. The sum of all
> +  region count fields must be greater than or equal to the number of
> +  VCPUs. Redistributor regions must be registered in the incremental
> +  index order, starting from index 0.
> +  The characteristics of a specific redistributor region can be read
> +  by presetting the index field in the attr data.
> +  Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
> +
> +  It is invalid to mix calls with KVM_VGIC_V3_ADDR_TYPE_REDIST and
> +  KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION attributes.
> +
>Errors:
>  -E2BIG:  Address outside of addressable IPA range
> --EINVAL: Incorrectly aligned address
> +-EINVAL: Incorrectly aligned address, bad redistributor region
> + count/index, mixed redistributor region attribute usage
>  -EEXIST: Address already configured
> +-ENOENT: Attempt to read the characteristics of a non existing
> + redistributor region
>  -ENXIO:  The group or attribute is unknown/unsupported for this device
>   or hardware support is missing.
>  -EFAULT: Invalid user pointer for attr->addr.
>  
>  
> -
>KVM_DEV_ARM_VGIC_GRP_DIST_REGS
>KVM_DEV_ARM_VGIC_GRP_REDIST_REGS
>Attributes:
> -- 
> 2.5.5
> 
> ___
> kvmarm mailing list
> kvm...@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
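For readers following the attribute layout documented above, the encode/decode steps can be sketched in userspace C as below. The mask names and values here are illustrative, derived from the documented bit ranges; the kernel's own definitions (KVM_VGIC_V3_RDIST_*_MASK/SHIFT) live in virt/kvm/arm/vgic/vgic.h.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative masks mirroring the documented layout. */
#define RDIST_INDEX_MASK  0x0000000000000fffULL /* bits 11:0  */
#define RDIST_FLAGS_MASK  0x000000000000f000ULL /* bits 15:12, must be 0 */
#define RDIST_BASE_MASK   0x000fffffffff0000ULL /* bits 51:16 */
#define RDIST_COUNT_SHIFT 52
#define RDIST_COUNT_MASK  0xfff0000000000000ULL /* bits 63:52 */

/* Build the attr value for a write: base contributes only bits [51:16],
 * so it must be 64K aligned; flags are currently reserved as zero. */
static uint64_t rdist_region_encode(uint64_t base, uint64_t count,
				    uint64_t index)
{
	return (base & RDIST_BASE_MASK) |
	       (count << RDIST_COUNT_SHIFT) |
	       (index & RDIST_INDEX_MASK);
}

/* Field extraction for the read path. */
static uint64_t rdist_region_base(uint64_t attr)
{
	return attr & RDIST_BASE_MASK;
}

static uint64_t rdist_region_count(uint64_t attr)
{
	return attr >> RDIST_COUNT_SHIFT;
}

static uint64_t rdist_region_index(uint64_t attr)
{
	return attr & RDIST_INDEX_MASK;
}
```

On a read, userspace presets only the index field and gets back base and count for that region, matching the read path fixed in this revision.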


Re: [PATCH v4 09/12] KVM: arm/arm64: Check all vcpu redistributors are set on map_resources

2018-04-29 Thread Christoffer Dall
On Fri, Apr 27, 2018 at 04:15:02PM +0200, Eric Auger wrote:
> On vcpu first run, we eventually know the actual number of vcpus.
> This is a synchronization point to check all redistributors
> were assigned. On kvm_vgic_map_resources() we check both dist and
> redist were set, eventually check potential base address inconsistencies.
> 
> Signed-off-by: Eric Auger 

Reviewed-by: Christoffer Dall 

> 
> ---
> 
> v3 -> v4:
> - use kvm_debug
> ---
>  virt/kvm/arm/vgic/vgic-v3.c | 19 ++-
>  1 file changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/virt/kvm/arm/vgic/vgic-v3.c b/virt/kvm/arm/vgic/vgic-v3.c
> index c4a2a46..5563671 100644
> --- a/virt/kvm/arm/vgic/vgic-v3.c
> +++ b/virt/kvm/arm/vgic/vgic-v3.c
> @@ -484,16 +484,25 @@ struct vgic_redist_region 
> *vgic_v3_rdist_free_slot(struct list_head *rd_regions)
>  
>  int vgic_v3_map_resources(struct kvm *kvm)
>  {
> - int ret = 0;
>   struct vgic_dist *dist = &kvm->arch.vgic;
> - struct vgic_redist_region *rdreg =
> - list_first_entry(&dist->rd_regions,
> -  struct vgic_redist_region, list);
> + struct kvm_vcpu *vcpu;
> + int ret = 0;
> + int c;
>  
>   if (vgic_ready(kvm))
>   goto out;
>  
> - if (IS_VGIC_ADDR_UNDEF(dist->vgic_dist_base) || !rdreg) {
> + kvm_for_each_vcpu(c, vcpu, kvm) {
> + struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
> +
> + if (IS_VGIC_ADDR_UNDEF(vgic_cpu->rd_iodev.base_addr)) {
> + kvm_debug("vcpu %d redistributor base not set\n", c);
> + ret = -ENXIO;
> + goto out;
> + }
> + }
> +
> + if (IS_VGIC_ADDR_UNDEF(dist->vgic_dist_base)) {
>   kvm_err("Need to set vgic distributor addresses first\n");
>   ret = -ENXIO;
>   goto out;
> -- 
> 2.5.5
> 


Re: [PATCH v4 12/12] KVM: arm/arm64: Bump VGIC_V3_MAX_CPUS to 512

2018-04-29 Thread Christoffer Dall
On Fri, Apr 27, 2018 at 04:15:05PM +0200, Eric Auger wrote:
> Let's raise the number of supported vcpus along with
> vgic v3 now that HW is looming with more physical CPUs.
> 
> Signed-off-by: Eric Auger 

Acked-by: Christoffer Dall 

> ---
>  include/kvm/arm_vgic.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
> index e5c16d1..a9a9f4a 100644
> --- a/include/kvm/arm_vgic.h
> +++ b/include/kvm/arm_vgic.h
> @@ -28,7 +28,7 @@
>  
>  #include 
>  
> -#define VGIC_V3_MAX_CPUS 255
> +#define VGIC_V3_MAX_CPUS 512
>  #define VGIC_V2_MAX_CPUS 8
>  #define VGIC_NR_IRQS_LEGACY 256
>  #define VGIC_NR_SGIS 16
> -- 
> 2.5.5
> 


Re: [PATCH v4 08/12] KVM: arm/arm64: Check vcpu redist base before registering an iodev

2018-04-29 Thread Christoffer Dall
On Fri, Apr 27, 2018 at 04:15:01PM +0200, Eric Auger wrote:
> As we are going to register several redist regions,
> vgic_register_all_redist_iodevs() may be called several times. We need
> to register a redist_iodev for a given vcpu only once. So let's
> check if the base address has already been set. Initialize this latter
> in kvm_vgic_vcpu_early_init().
> 
> Signed-off-by: Eric Auger 

Acked-by: Christoffer Dall 

> ---
>  virt/kvm/arm/vgic/vgic-init.c| 3 +++
>  virt/kvm/arm/vgic/vgic-mmio-v3.c | 3 +++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/virt/kvm/arm/vgic/vgic-init.c b/virt/kvm/arm/vgic/vgic-init.c
> index 6456371..7e040e7 100644
> --- a/virt/kvm/arm/vgic/vgic-init.c
> +++ b/virt/kvm/arm/vgic/vgic-init.c
> @@ -82,6 +82,9 @@ void kvm_vgic_vcpu_early_init(struct kvm_vcpu *vcpu)
>   INIT_LIST_HEAD(&vgic_cpu->ap_list_head);
>   spin_lock_init(&vgic_cpu->ap_list_lock);
>  
> + vgic_cpu->rd_iodev.base_addr = VGIC_ADDR_UNDEF;
> + vgic_cpu->sgi_iodev.base_addr = VGIC_ADDR_UNDEF;
> +
>   /*
>* Enable and configure all SGIs to be edge-triggered and
>* configure all PPIs as level-triggered.
> diff --git a/virt/kvm/arm/vgic/vgic-mmio-v3.c 
> b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> index cbf8f4e..1b1e07f 100644
> --- a/virt/kvm/arm/vgic/vgic-mmio-v3.c
> +++ b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> @@ -592,6 +592,9 @@ int vgic_register_redist_iodev(struct kvm_vcpu *vcpu)
>   gpa_t rd_base, sgi_base;
>   int ret;
>  
> + if (!IS_VGIC_ADDR_UNDEF(vgic_cpu->rd_iodev.base_addr))
> + return 0;
> +
>   /*
>* We may be creating VCPUs before having set the base address for the
>* redistributor region, in which case we will come back to this
> -- 
> 2.5.5
> 


Re: [PATCH v4 06/12] KVM: arm/arm64: Adapt vgic_v3_check_base to multiple rdist regions

2018-04-29 Thread Christoffer Dall
On Fri, Apr 27, 2018 at 04:14:59PM +0200, Eric Auger wrote:
> vgic_v3_check_base() currently only handles the case of a unique
> legacy redistributor region whose size is not explicitly set but
> infered, instead, from the number of online vcpus.

nit: inferred

> 
> We adapt it to handle the case of multiple redistributor regions
> with explicitly defined size. We rely on two new helpers:
> - vgic_v3_rdist_overlap() is used to detect overlap with the dist
>   region if defined
> - vgic_v3_rd_region_size computes the size of the redist region,
>   would it be a legacy unique region or a new explicitly sized
>   region.
> 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> v3 -> v4:
> - squash vgic_v3_check_base adaptation and vgic_v3_rdist_overlap
>   + vgic_v3_rd_region_size introduction  and put this patch
>   before v3 patch 6
> ---
>  virt/kvm/arm/vgic/vgic-v3.c | 49 
> +
>  virt/kvm/arm/vgic/vgic.h| 10 +
>  2 files changed, 42 insertions(+), 17 deletions(-)
> 
> diff --git a/virt/kvm/arm/vgic/vgic-v3.c b/virt/kvm/arm/vgic/vgic-v3.c
> index f81a63a..c4a2a46 100644
> --- a/virt/kvm/arm/vgic/vgic-v3.c
> +++ b/virt/kvm/arm/vgic/vgic-v3.c
> @@ -410,6 +410,29 @@ int vgic_v3_save_pending_tables(struct kvm *kvm)
>   return 0;
>  }
>  
> +/**
> + * vgic_v3_rdist_overlap - check if a region overlaps with any
> + * existing redistributor region
> + *
> + * @kvm: kvm handle
> + * @base: base of the region
> + * @size: size of region
> + *
> + * Return: true if there is an overlap
> + */
> +bool vgic_v3_rdist_overlap(struct kvm *kvm, gpa_t base, size_t size)
> +{
> + struct vgic_dist *d = &kvm->arch.vgic;
> + struct vgic_redist_region *rdreg;
> +
> + list_for_each_entry(rdreg, &d->rd_regions, list) {
> + if ((base + size > rdreg->base) &&
> + (base < rdreg->base + vgic_v3_rd_region_size(kvm, rdreg)))
> + return true;
> + }
> + return false;
> +}
> +
>  /*
>   * Check for overlapping regions and for regions crossing the end of memory
>   * for base addresses which have already been set.
> @@ -417,31 +440,23 @@ int vgic_v3_save_pending_tables(struct kvm *kvm)
>  bool vgic_v3_check_base(struct kvm *kvm)
>  {
>   struct vgic_dist *d = &kvm->arch.vgic;
> - gpa_t redist_size = KVM_VGIC_V3_REDIST_SIZE;
> - struct vgic_redist_region *rdreg =
> - list_first_entry(&d->rd_regions,
> -  struct vgic_redist_region, list);
> -
> - redist_size *= atomic_read(&kvm->online_vcpus);
> + struct vgic_redist_region *rdreg;
>  
>   if (!IS_VGIC_ADDR_UNDEF(d->vgic_dist_base) &&
>   d->vgic_dist_base + KVM_VGIC_V3_DIST_SIZE < d->vgic_dist_base)
>   return false;
>  
> - if (rdreg && (rdreg->base + redist_size < rdreg->base))
> - return false;
> + list_for_each_entry(rdreg, &d->rd_regions, list) {
> + if (rdreg->base + vgic_v3_rd_region_size(kvm, rdreg) <
> + rdreg->base)
> + return false;
> + }
>  
> - /* Both base addresses must be set to check if they overlap */
> - if (IS_VGIC_ADDR_UNDEF(d->vgic_dist_base) || !rdreg)
> + if (IS_VGIC_ADDR_UNDEF(d->vgic_dist_base))
>   return true;
>  
> - if (d->vgic_dist_base + KVM_VGIC_V3_DIST_SIZE <= rdreg->base)
> - return true;
> -
> - if (rdreg->base + redist_size <= d->vgic_dist_base)
> - return true;
> -
> - return false;
> + return !vgic_v3_rdist_overlap(kvm, d->vgic_dist_base,
> +   KVM_VGIC_V3_DIST_SIZE);
>  }
>  
>  /**
> diff --git a/virt/kvm/arm/vgic/vgic.h b/virt/kvm/arm/vgic/vgic.h
> index fea32cb..e6e3ae9 100644
> --- a/virt/kvm/arm/vgic/vgic.h
> +++ b/virt/kvm/arm/vgic/vgic.h
> @@ -262,6 +262,16 @@ vgic_v3_redist_region_full(struct vgic_redist_region 
> *region)
>  
>  struct vgic_redist_region *vgic_v3_rdist_free_slot(struct list_head *rdregs);
>  
> +static inline size_t
> +vgic_v3_rd_region_size(struct kvm *kvm, struct vgic_redist_region *rdreg)
> +{
> + if (!rdreg->count)
> + return atomic_read(&kvm->online_vcpus) * KVM_VGIC_V3_REDIST_SIZE;
> + else
> + return rdreg->count * KVM_VGIC_V3_REDIST_SIZE;
> +}
> +bool vgic_v3_rdist_overlap(struct kvm *kvm, gpa_t base, size_t size);
> +
>  int vgic_its_resolve_lpi(struct kvm *kvm, struct vgic_its *its,
>u32 devid, u32 eventid, struct vgic_irq **irq);
>  struct vgic_its *vgic_msi_to_its(struct kvm *kvm, struct kvm_msi *msi);
> -- 
> 2.5.5
> 

Reviewed-by: Christoffer Dall 
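The two helpers this patch relies on reduce to a region-size computation with a legacy (count == 0) fallback and a half-open interval overlap test. A minimal standalone sketch, assuming the usual two 64K frames per redistributor (KVM_VGIC_V3_REDIST_SIZE):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define REDIST_REGION_STRIDE 0x20000ULL /* two 64K frames per redistributor */

/* Region size: an explicit count, or - for the legacy region registered
 * with count == 0 - one slot per online VCPU, as in
 * vgic_v3_rd_region_size(). */
static uint64_t rd_region_size(uint32_t count, uint32_t online_vcpus)
{
	if (!count)
		return (uint64_t)online_vcpus * REDIST_REGION_STRIDE;
	return (uint64_t)count * REDIST_REGION_STRIDE;
}

/* Half-open interval test: does [a, a + a_size) intersect
 * [b, b + b_size)?  Same shape as vgic_v3_rdist_overlap(). */
static bool ranges_overlap(uint64_t a, uint64_t a_size,
			   uint64_t b, uint64_t b_size)
{
	return (a + a_size > b) && (a < b + b_size);
}
```

Adjacent regions (one ending exactly where the next begins) do not overlap under this test, which is why the distributor can sit flush against a redistributor region.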


Re: [PATCH v4 11/12] KVM: arm/arm64: Implement KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION

2018-04-27 Thread Christoffer Dall
Hi Eric,

On Fri, Apr 27, 2018 at 04:15:04PM +0200, Eric Auger wrote:
> Now all the internals are ready to handle multiple redistributor
> regions, let's allow the userspace to register them.
> 
> Signed-off-by: Eric Auger 
> 
> ---
> v3 -> v4:
> - vgic_v3_rdist_region_from_index is introduced in this patch
> 
> v2 -> v3:
> - early exit if vgic_v3_rdist_region_from_index() fails
> ---
>  virt/kvm/arm/vgic/vgic-kvm-device.c | 42 
> +++--
>  virt/kvm/arm/vgic/vgic-mmio-v3.c|  4 ++--
>  virt/kvm/arm/vgic/vgic-v3.c | 14 +
>  virt/kvm/arm/vgic/vgic.h| 13 +++-
>  4 files changed, 68 insertions(+), 5 deletions(-)
> 
> diff --git a/virt/kvm/arm/vgic/vgic-kvm-device.c 
> b/virt/kvm/arm/vgic/vgic-kvm-device.c
> index e7b5a86..00e03d3 100644
> --- a/virt/kvm/arm/vgic/vgic-kvm-device.c
> +++ b/virt/kvm/arm/vgic/vgic-kvm-device.c
> @@ -65,7 +65,8 @@ int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 
> *addr, bool write)
>  {
>   int r = 0;
>   struct vgic_dist *vgic = &kvm->arch.vgic;
> - phys_addr_t *addr_ptr, alignment;
> + phys_addr_t *addr_ptr = NULL;
> + phys_addr_t alignment;
>   uint64_t undef_value = VGIC_ADDR_UNDEF;
>  
>   mutex_lock(&kvm->lock);
> @@ -92,7 +93,7 @@ int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 
> *addr, bool write)
>   if (r)
>   break;
>   if (write) {
> - r = vgic_v3_set_redist_base(kvm, *addr);
> + r = vgic_v3_set_redist_base(kvm, 0, *addr, 0);
>   goto out;
>   }
>   rdreg = list_first_entry(&vgic->rd_regions,
> @@ -103,6 +104,42 @@ int kvm_vgic_addr(struct kvm *kvm, unsigned long type, 
> u64 *addr, bool write)
>   addr_ptr = &rdreg->base;
>   break;
>   }
> + case KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION:
> + {
> + struct vgic_redist_region *rdreg;
> + uint8_t index;
> +
> + r = vgic_check_type(kvm, KVM_DEV_TYPE_ARM_VGIC_V3);
> + if (r)
> + break;
> +
> + index = *addr & KVM_VGIC_V3_RDIST_INDEX_MASK;
> +
> + if (write) {
> + gpa_t base = *addr & KVM_VGIC_V3_RDIST_BASE_MASK;
> + uint32_t count = (*addr & KVM_VGIC_V3_RDIST_COUNT_MASK)
> + >> KVM_VGIC_V3_RDIST_COUNT_SHIFT;
> + uint8_t flags = (*addr & KVM_VGIC_V3_RDIST_FLAGS_MASK)
> + >> KVM_VGIC_V3_RDIST_FLAGS_SHIFT;
> +
> + if (!count || flags)
> + r = -EINVAL;
> + else
> + r = vgic_v3_set_redist_base(kvm, index,
> + base, count);
> + goto out;
> + }
> +
> + rdreg = vgic_v3_rdist_region_from_index(kvm, index);
> + if (!rdreg) {
> + r = -ENODEV;
> + goto out;
> + }
> +
> + *addr_ptr = rdreg->base & index &
> + (uint64_t)rdreg->count << KVM_VGIC_V3_RDIST_COUNT_SHIFT;

This still looks wrong, please see my reply to v3.

Thanks,
-Christoffer
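The &-versus-| point is easy to demonstrate in isolation: index, base and count occupy disjoint bit ranges, so OR composes them, while AND of disjoint fields always yields zero. A toy sketch (COUNT_SHIFT as in the patch; the values are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define COUNT_SHIFT 52 /* as KVM_VGIC_V3_RDIST_COUNT_SHIFT */

/* What the read path should do: the fields are disjoint, so OR them. */
static uint64_t compose_or(uint64_t base, uint64_t count, uint64_t index)
{
	return base | index | (count << COUNT_SHIFT);
}

/* The variant under review: AND-ing disjoint fields loses everything. */
static uint64_t compose_and(uint64_t base, uint64_t count, uint64_t index)
{
	return base & index & (count << COUNT_SHIFT);
}
```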


Re: [PATCH v4 02/12] KVM: arm/arm64: Document KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION

2018-04-27 Thread Christoffer Dall
On Fri, Apr 27, 2018 at 04:14:55PM +0200, Eric Auger wrote:
> We introduce a new KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION attribute in
> KVM_DEV_ARM_VGIC_GRP_ADDR group. It allows userspace to provide the
> base address and size of a redistributor region
> 
> Compared to KVM_VGIC_V3_ADDR_TYPE_REDIST, this new attribute allows
> to declare several separate redistributor regions.
> 
> So the whole redist space does not need to be contiguous anymore.
> 
> Signed-off-by: Eric Auger 
> Reviewed-by: Peter Maydell 

Acked-by: Christoffer Dall 

> 
> ---
> v3 -> v4:
> - Added Peter's R-b
> ---
>  Documentation/virtual/kvm/devices/arm-vgic-v3.txt | 25 
> ++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/arm-vgic-v3.txt 
> b/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> index 9293b45..cbc4328 100644
> --- a/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> +++ b/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> @@ -27,9 +27,32 @@ Groups:
>VCPU and all of the redistributor pages are contiguous.
>Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
>This address needs to be 64K aligned.
> +
> +KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION (rw, 64-bit)
> +  The attr field of kvm_device_attr encodes 3 values:
> +  bits: | 63  ....  52  |  51  ....  16 | 15 - 12  |11 - 0
> +  values:   | count  |   base  |  flags   | index
> +  - index encodes the unique redistributor region index
> +  - flags: reserved for future use, currently 0
> +  - base field encodes bits [51:16] of the guest physical base address
> +of the first redistributor in the region.
> +  - count encodes the number of redistributors in the region. Must be
> +greater than 0.
> +  There are two 64K pages for each redistributor in the region and
> +  redistributors are laid out contiguously within the region. Regions
> +  are filled with redistributors in the index order. The sum of all
> +  region count fields must be greater than or equal to the number of
> +  VCPUs. Redistributor regions must be registered in the incremental
> +  index order, starting from index 0.
> +  Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
> +
> +  It is invalid to mix calls with KVM_VGIC_V3_ADDR_TYPE_REDIST and
> +  KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION attributes.
> +
>Errors:
>  -E2BIG:  Address outside of addressable IPA range
> --EINVAL: Incorrectly aligned address
> +-EINVAL: Incorrectly aligned address, bad redistributor region
> + count/index, mixed redistributor region attribute usage
>  -EEXIST: Address already configured
>  -ENXIO:  The group or attribute is unknown/unsupported for this device
>   or hardware support is missing.
> -- 
> 2.5.5
> 


Re: [PATCH v3 11/12] KVM: arm/arm64: Implement KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION

2018-04-27 Thread Christoffer Dall
On Fri, Apr 13, 2018 at 10:20:57AM +0200, Eric Auger wrote:
> Now all the internals are ready to handle multiple redistributor
> regions, let's allow the userspace to register them.
> 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> v2 -> v3:
> - early exit if vgic_v3_rdist_region_from_index() fails
> ---
>  virt/kvm/arm/vgic/vgic-kvm-device.c | 42 
> +++--
>  virt/kvm/arm/vgic/vgic-mmio-v3.c|  4 ++--
>  virt/kvm/arm/vgic/vgic.h|  9 +++-
>  3 files changed, 50 insertions(+), 5 deletions(-)
> 
> diff --git a/virt/kvm/arm/vgic/vgic-kvm-device.c 
> b/virt/kvm/arm/vgic/vgic-kvm-device.c
> index e7b5a86..00e03d3 100644
> --- a/virt/kvm/arm/vgic/vgic-kvm-device.c
> +++ b/virt/kvm/arm/vgic/vgic-kvm-device.c
> @@ -65,7 +65,8 @@ int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 
> *addr, bool write)
>  {
>   int r = 0;
>   struct vgic_dist *vgic = &kvm->arch.vgic;
> - phys_addr_t *addr_ptr, alignment;
> + phys_addr_t *addr_ptr = NULL;
> + phys_addr_t alignment;
>   uint64_t undef_value = VGIC_ADDR_UNDEF;
>  
>   mutex_lock(&kvm->lock);
> @@ -92,7 +93,7 @@ int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 
> *addr, bool write)
>   if (r)
>   break;
>   if (write) {
> - r = vgic_v3_set_redist_base(kvm, *addr);
> + r = vgic_v3_set_redist_base(kvm, 0, *addr, 0);
>   goto out;
>   }
>   rdreg = list_first_entry(&vgic->rd_regions,
> @@ -103,6 +104,42 @@ int kvm_vgic_addr(struct kvm *kvm, unsigned long type, 
> u64 *addr, bool write)
>   addr_ptr = &rdreg->base;
>   break;
>   }
> + case KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION:
> + {
> + struct vgic_redist_region *rdreg;
> + uint8_t index;
> +
> + r = vgic_check_type(kvm, KVM_DEV_TYPE_ARM_VGIC_V3);
> + if (r)
> + break;
> +
> + index = *addr & KVM_VGIC_V3_RDIST_INDEX_MASK;
> +
> + if (write) {
> + gpa_t base = *addr & KVM_VGIC_V3_RDIST_BASE_MASK;
> + uint32_t count = (*addr & KVM_VGIC_V3_RDIST_COUNT_MASK)
> + >> KVM_VGIC_V3_RDIST_COUNT_SHIFT;
> + uint8_t flags = (*addr & KVM_VGIC_V3_RDIST_FLAGS_MASK)
> + >> KVM_VGIC_V3_RDIST_FLAGS_SHIFT;
> +
> + if (!count || flags)
> + r = -EINVAL;
> + else
> + r = vgic_v3_set_redist_base(kvm, index,
> + base, count);
> + goto out;
> + }
> +
> + rdreg = vgic_v3_rdist_region_from_index(kvm, index);
> + if (!rdreg) {
> + r = -ENODEV;
> + goto out;
> + }
> +
> + *addr_ptr = rdreg->base & index &
> + (uint64_t)rdreg->count << KVM_VGIC_V3_RDIST_COUNT_SHIFT;

I still think this is a clear NULL-pointer dereference.  It's also
wrong, as you use & where you want to use |.

You should also change the types you use above.

Could you please have a look at my last reply to this patch (I'm happy
to re-send if it got lost somehow) where I suggest how you can handle
this?

Thanks,
-Christoffer

> + break;
> + }
>   default:
>   r = -ENODEV;
>   }
> @@ -674,6 +711,7 @@ static int vgic_v3_has_attr(struct kvm_device *dev,
>   switch (attr->attr) {
>   case KVM_VGIC_V3_ADDR_TYPE_DIST:
>   case KVM_VGIC_V3_ADDR_TYPE_REDIST:
> + case KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION:
>   return 0;
>   }
>   break;
> diff --git a/virt/kvm/arm/vgic/vgic-mmio-v3.c 
> b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> index df23e66..f603fdf 100644
> --- a/virt/kvm/arm/vgic/vgic-mmio-v3.c
> +++ b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> @@ -770,11 +770,11 @@ static int vgic_v3_insert_redist_region(struct kvm 
> *kvm, uint32_t index,
>   return ret;
>  }
>  
> -int vgic_v3_set_redist_base(struct kvm *kvm, u64 addr)
> +int vgic_v3_set_redist_base(struct kvm *kvm, u32 index, u64 addr, u32 count)
>  {
>   int ret;
>  
> - ret = vgic_v3_insert_redist_region(kvm, 0, addr, 0);
> + ret = vgic_v3_insert_redist_region(kvm, index, addr, count);
>   if (ret)
>   return ret;
>  
> diff --git a/virt/kvm/arm/vgic/vgic.h b/virt/kvm/arm/vgic/vgic.h
> index 95b8345..0a95b43 100644
> --- a/virt/kvm/arm/vgic/vgic.h
> +++ b/virt/kvm/arm/vgic/vgic.h
> @@ -96,6 +96,13 @@
>  /* we only support 64 kB translation table page size */
>  #define KVM_ITS_L1E_ADDR_MASKGENMASK_ULL(51, 16)
>  
> +#define KVM_VGIC_V3_RDIST_INDEX_MASK GENMASK_ULL(11, 0)
> +#define KVM_VGIC_V3_RDIST_FLAGS_MASK GENM

Re: [PATCH 3/4] KVM: arm64: Support dirty page tracking for PUD hugepages

2018-04-27 Thread Christoffer Dall
On Fri, Apr 20, 2018 at 03:54:08PM +0100, Punit Agrawal wrote:
> In preparation for creating PUD hugepages at stage 2, add support for
> write protecting PUD hugepages when they are encountered. Write
> protecting guest tables is used to track dirty pages when migrating
> VMs.
> 
> Also, provide trivial implementations of required kvm_s2pud_* helpers
> to allow sharing of code with arm32.
> 
> Signed-off-by: Punit Agrawal 
> Cc: Christoffer Dall 
> Cc: Marc Zyngier 
> Cc: Russell King 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> ---
>  arch/arm/include/asm/kvm_mmu.h   | 16 
>  arch/arm64/include/asm/kvm_mmu.h | 10 ++
>  virt/kvm/arm/mmu.c   |  9 ++---
>  3 files changed, 32 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 5907a81ad5c1..224c22c0a69c 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -80,6 +80,22 @@ void kvm_clear_hyp_idmap(void);
>  
>  #define kvm_pmd_mkhuge(pmd)  pmd_mkhuge(pmd)
>  
> +/*
> + * The following kvm_*pud*() functionas are provided strictly to allow
> + * sharing code with arm64. They should never be called in practice.
> + */
> +static inline void kvm_set_s2pud_readonly(pud_t *pud)
> +{
> + BUG();
> +}
> +
> +static inline bool kvm_s2pud_readonly(pud_t *pud)
> +{
> + BUG();
> + return false;
> +}
> +
> +
>  static inline void kvm_set_pmd(pmd_t *pmd, pmd_t new_pmd)
>  {
>   *pmd = new_pmd;
> diff --git a/arch/arm64/include/asm/kvm_mmu.h 
> b/arch/arm64/include/asm/kvm_mmu.h
> index d962508ce4b3..f440cf216a23 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -240,6 +240,16 @@ static inline bool kvm_s2pmd_exec(pmd_t *pmdp)
>   return !(READ_ONCE(pmd_val(*pmdp)) & PMD_S2_XN);
>  }
>  
> +static inline void kvm_set_s2pud_readonly(pud_t *pudp)
> +{
> + kvm_set_s2pte_readonly((pte_t *)pudp);
> +}
> +
> +static inline bool kvm_s2pud_readonly(pud_t *pudp)
> +{
> + return kvm_s2pte_readonly((pte_t *)pudp);
> +}
> +
>  static inline bool kvm_page_empty(void *ptr)
>  {
>   struct page *ptr_page = virt_to_page(ptr);
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index f72ae7a6dea0..5f53909da90e 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1286,9 +1286,12 @@ static void  stage2_wp_puds(pgd_t *pgd, phys_addr_t 
> addr, phys_addr_t end)
>   do {
>   next = stage2_pud_addr_end(addr, end);
>   if (!stage2_pud_none(*pud)) {
> - /* TODO:PUD not supported, revisit later if supported */
> - BUG_ON(stage2_pud_huge(*pud));
> - stage2_wp_pmds(pud, addr, next);
> + if (stage2_pud_huge(*pud)) {
> + if (!kvm_s2pud_readonly(pud))
> + kvm_set_s2pud_readonly(pud);
> + } else {
> + stage2_wp_pmds(pud, addr, next);
> + }
>   }
>   } while (pud++, addr = next, addr != end);
>  }
> -- 
> 2.17.0
> 

Reviewed-by: Christoffer Dall 
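The write-protection helpers above boil down to clearing and testing a write-permission bit in the stage-2 descriptor, with a check-before-write to skip descriptors that are already read-only (as stage2_wp_puds() does). A toy model - the bit position here is illustrative, not the real arm64 S2AP encoding:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stage-2 descriptor: only a write-permission bit is modeled.
 * Bit 7 is an arbitrary illustrative choice. */
#define S2_WRITE (1ULL << 7)

/* Drop write permission, like kvm_set_s2pud_readonly(). */
static uint64_t s2_set_readonly(uint64_t desc)
{
	return desc & ~S2_WRITE;
}

/* Test for it, like kvm_s2pud_readonly(). */
static bool s2_readonly(uint64_t desc)
{
	return !(desc & S2_WRITE);
}
```

Guarding the clear with the test avoids rewriting (and later re-faulting on) descriptors that are already write-protected, which matters when dirty logging repeatedly walks the same tables.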


Re: [PATCH 2/4] KVM: arm/arm64: Introduce helpers to manupulate page table entries

2018-04-27 Thread Christoffer Dall
On Fri, Apr 20, 2018 at 03:54:07PM +0100, Punit Agrawal wrote:
> Introduce helpers to abstract architectural handling of the conversion
> of pfn to page table entries and marking a PMD page table entry as a
> block entry.
> 
> The helpers are introduced in preparation for supporting PUD hugepages
> at stage 2 - which are supported on arm64 but do not exist on arm.
> 
> Signed-off-by: Punit Agrawal 
> Cc: Christoffer Dall 
> Cc: Marc Zyngier 
> Cc: Russell King 
> Cc: Catalin Marinas 
> Cc: Will Deacon 

Acked-by: Christoffer Dall 

> ---
>  arch/arm/include/asm/kvm_mmu.h   | 5 +
>  arch/arm64/include/asm/kvm_mmu.h | 5 +
>  virt/kvm/arm/mmu.c   | 7 ---
>  3 files changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 707a1f06dc5d..5907a81ad5c1 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -75,6 +75,11 @@ phys_addr_t kvm_get_idmap_vector(void);
>  int kvm_mmu_init(void);
>  void kvm_clear_hyp_idmap(void);
>  
> +#define kvm_pfn_pte(pfn, prot)   pfn_pte(pfn, prot)
> +#define kvm_pfn_pmd(pfn, prot)   pfn_pmd(pfn, prot)
> +
> +#define kvm_pmd_mkhuge(pmd)  pmd_mkhuge(pmd)
> +
>  static inline void kvm_set_pmd(pmd_t *pmd, pmd_t new_pmd)
>  {
>   *pmd = new_pmd;
> diff --git a/arch/arm64/include/asm/kvm_mmu.h 
> b/arch/arm64/include/asm/kvm_mmu.h
> index 082110993647..d962508ce4b3 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -173,6 +173,11 @@ void kvm_clear_hyp_idmap(void);
>  #define  kvm_set_pte(ptep, pte)  set_pte(ptep, pte)
>  #define  kvm_set_pmd(pmdp, pmd)  set_pmd(pmdp, pmd)
>  
> +#define kvm_pfn_pte(pfn, prot)   pfn_pte(pfn, prot)
> +#define kvm_pfn_pmd(pfn, prot)   pfn_pmd(pfn, prot)
> +
> +#define kvm_pmd_mkhuge(pmd)  pmd_mkhuge(pmd)
> +
>  static inline pte_t kvm_s2pte_mkwrite(pte_t pte)
>  {
>   pte_val(pte) |= PTE_S2_RDWR;
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index db382c7c7cd7..f72ae7a6dea0 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1538,8 +1538,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   invalidate_icache_guest_page(pfn, vma_pagesize);
>  
>   if (hugetlb) {
> - pmd_t new_pmd = pfn_pmd(pfn, mem_type);
> - new_pmd = pmd_mkhuge(new_pmd);
> + pmd_t new_pmd = kvm_pfn_pmd(pfn, mem_type);
> +
> + new_pmd = kvm_pmd_mkhuge(new_pmd);
>   if (writable)
>   new_pmd = kvm_s2pmd_mkwrite(new_pmd);
>  
> @@ -1553,7 +1554,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>  
>   ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
>   } else {
> - pte_t new_pte = pfn_pte(pfn, mem_type);
> + pte_t new_pte = kvm_pfn_pte(pfn, mem_type);
>  
>   if (writable) {
>   new_pte = kvm_s2pte_mkwrite(new_pte);
> -- 
> 2.17.0
> 


Re: [PATCH 1/4] KVM: arm/arm64: Share common code in user_mem_abort()

2018-04-27 Thread Christoffer Dall
On Fri, Apr 20, 2018 at 03:54:06PM +0100, Punit Agrawal wrote:
> The code for operations such as marking the pfn as dirty, and
> dcache/icache maintenance during stage 2 fault handling is duplicated
> between normal pages and PMD hugepages.
> 
> Instead of creating another copy of the operations when we introduce
> PUD hugepages, let's share them across the different pagesizes.
> 
> Signed-off-by: Punit Agrawal 
> Cc: Christoffer Dall 
> Cc: Marc Zyngier 
> ---
>  virt/kvm/arm/mmu.c | 36 +---
>  1 file changed, 21 insertions(+), 15 deletions(-)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 7f6a944db23d..db382c7c7cd7 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1428,7 +1428,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   kvm_pfn_t pfn;
>   pgprot_t mem_type = PAGE_S2;
>   bool logging_active = memslot_is_logging(memslot);
> - unsigned long flags = 0;
> + unsigned long vma_pagesize, flags = 0;
>  
>   write_fault = kvm_is_write_fault(vcpu);
>   exec_fault = kvm_vcpu_trap_is_iabt(vcpu);
> @@ -1448,7 +1448,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   return -EFAULT;
>   }
>  
> - if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
> + vma_pagesize = vma_kernel_pagesize(vma);
> + if (vma_pagesize == PMD_SIZE && !logging_active) {
>   hugetlb = true;
>   gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
>   } else {
> @@ -1517,23 +1518,33 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   if (mmu_notifier_retry(kvm, mmu_seq))
>   goto out_unlock;
>  
> - if (!hugetlb && !force_pte)
> + if (!hugetlb && !force_pte) {
> + /*
> +  * Only PMD_SIZE transparent hugepages(THP) are
> +  * currently supported. This code will need to be
> +  * updated if other THP sizes are supported.
> +  */
>   hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
> + vma_pagesize = PMD_SIZE;
> + }
> +
> + if (writable)
> + kvm_set_pfn_dirty(pfn);
> +
> + if (fault_status != FSC_PERM)
> + clean_dcache_guest_page(pfn, vma_pagesize);
> +
> + if (exec_fault)
> + invalidate_icache_guest_page(pfn, vma_pagesize);
>  
>   if (hugetlb) {
>   pmd_t new_pmd = pfn_pmd(pfn, mem_type);
>   new_pmd = pmd_mkhuge(new_pmd);
> - if (writable) {
> + if (writable)
>   new_pmd = kvm_s2pmd_mkwrite(new_pmd);
> - kvm_set_pfn_dirty(pfn);
> - }
> -
> - if (fault_status != FSC_PERM)
> - clean_dcache_guest_page(pfn, PMD_SIZE);
>  
>   if (exec_fault) {
>   new_pmd = kvm_s2pmd_mkexec(new_pmd);
> - invalidate_icache_guest_page(pfn, PMD_SIZE);
>   } else if (fault_status == FSC_PERM) {

This could now be rewritten to:

	if (exec_fault ||
	    (fault_status == FSC_PERM && stage2_is_exec(kvm, fault_ipa)))
		new_pmd = kvm_s2pmd_mkexec(new_pmd);

which we could even consider making

static bool stage2_should_exec(struct kvm *kvm, phys_addr_t fault_ipa,
			       bool exec_fault, unsigned long fault_status)
{
	/*
	 * If we took an execution fault we will have made the
	 * icache/dcache coherent and should now let the s2 mapping be
	 * executable.
	 *
	 * Write faults (!exec_fault && FSC_PERM) are orthogonal to
	 * execute permissions, and we preserve whatever we have.
	 */
	return exec_fault ||
	       (fault_status == FSC_PERM && stage2_is_exec(kvm, fault_ipa));
}

The benefit would be to have this documentation in a single place and
slightly simplify both the hugetlb and !hugetlb blocks.


>   /* Preserve execute if XN was already cleared */
>   if (stage2_is_exec(kvm, fault_ipa))
> @@ -1546,16 +1557,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>  
>   if (writable) {
>   new_pte = kvm_s2pte_mkwrite(new_pte);
> - kvm_set_pfn_dirty(pfn);
>   mark_page_dirty(kvm, gfn);
>   }
>  
> - if (fault_status != FSC_PERM)
> - clean_dcache_guest_page(pfn, PAGE_SIZE);
> -
>  

Re: [PATCH 4/4] KVM: arm64: Add support for PUD hugepages at stage 2

2018-04-27 Thread Christoffer Dall
On Fri, Apr 20, 2018 at 03:54:09PM +0100, Punit Agrawal wrote:
> KVM only supports PMD hugepages at stage 2. Extend the stage 2 fault
> handling to add support for PUD hugepages.
> 
> Addition of pud hugepage support enables additional hugepage
> sizes (e.g., 1G with 4K granule) which can be useful on cores that
> support mapping larger block sizes in the TLB entries.
> 
> Signed-off-by: Punit Agrawal 
> Cc: Christoffer Dall 
> Cc: Marc Zyngier 
> Cc: Russell King 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> ---
>  arch/arm/include/asm/kvm_mmu.h | 19 +
>  arch/arm64/include/asm/kvm_mmu.h   | 15 +++
>  arch/arm64/include/asm/pgtable-hwdef.h |  4 ++
>  arch/arm64/include/asm/pgtable.h   |  2 +
>  virt/kvm/arm/mmu.c | 54 --
>  5 files changed, 83 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 224c22c0a69c..155916dbdd7e 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -77,8 +77,11 @@ void kvm_clear_hyp_idmap(void);
>  
>  #define kvm_pfn_pte(pfn, prot)   pfn_pte(pfn, prot)
>  #define kvm_pfn_pmd(pfn, prot)   pfn_pmd(pfn, prot)
> +#define kvm_pfn_pud(pfn, prot)   (__pud(0))
>  
>  #define kvm_pmd_mkhuge(pmd)  pmd_mkhuge(pmd)
> +/* No support for pud hugepages */
> +#define kvm_pud_mkhuge(pud)  (pud)
>  
>  /*
>   * The following kvm_*pud*() functionas are provided strictly to allow
> @@ -95,6 +98,22 @@ static inline bool kvm_s2pud_readonly(pud_t *pud)
>   return false;
>  }
>  
> +static inline void kvm_set_pud(pud_t *pud, pud_t new_pud)
> +{
> + BUG();
> +}
> +
> +static inline pud_t kvm_s2pud_mkwrite(pud_t pud)
> +{
> + BUG();
> + return pud;
> +}
> +
> +static inline pud_t kvm_s2pud_mkexec(pud_t pud)
> +{
> + BUG();
> + return pud;
> +}
>  
>  static inline void kvm_set_pmd(pmd_t *pmd, pmd_t new_pmd)
>  {
> diff --git a/arch/arm64/include/asm/kvm_mmu.h 
> b/arch/arm64/include/asm/kvm_mmu.h
> index f440cf216a23..f49a68fcbf26 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -172,11 +172,14 @@ void kvm_clear_hyp_idmap(void);
>  
>  #define  kvm_set_pte(ptep, pte)  set_pte(ptep, pte)
>  #define  kvm_set_pmd(pmdp, pmd)  set_pmd(pmdp, pmd)
> +#define kvm_set_pud(pudp, pud)   set_pud(pudp, pud)
>  
>  #define kvm_pfn_pte(pfn, prot)   pfn_pte(pfn, prot)
>  #define kvm_pfn_pmd(pfn, prot)   pfn_pmd(pfn, prot)
> +#define kvm_pfn_pud(pfn, prot)   pfn_pud(pfn, prot)
>  
>  #define kvm_pmd_mkhuge(pmd)  pmd_mkhuge(pmd)
> +#define kvm_pud_mkhuge(pud)  pud_mkhuge(pud)
>  
>  static inline pte_t kvm_s2pte_mkwrite(pte_t pte)
>  {
> @@ -190,6 +193,12 @@ static inline pmd_t kvm_s2pmd_mkwrite(pmd_t pmd)
>   return pmd;
>  }
>  
> +static inline pud_t kvm_s2pud_mkwrite(pud_t pud)
> +{
> + pud_val(pud) |= PUD_S2_RDWR;
> + return pud;
> +}
> +
>  static inline pte_t kvm_s2pte_mkexec(pte_t pte)
>  {
>   pte_val(pte) &= ~PTE_S2_XN;
> @@ -202,6 +211,12 @@ static inline pmd_t kvm_s2pmd_mkexec(pmd_t pmd)
>   return pmd;
>  }
>  
> +static inline pud_t kvm_s2pud_mkexec(pud_t pud)
> +{
> + pud_val(pud) &= ~PUD_S2_XN;
> + return pud;
> +}
> +
>  static inline void kvm_set_s2pte_readonly(pte_t *ptep)
>  {
>   pteval_t old_pteval, pteval;
> diff --git a/arch/arm64/include/asm/pgtable-hwdef.h 
> b/arch/arm64/include/asm/pgtable-hwdef.h
> index fd208eac9f2a..e327665e94d1 100644
> --- a/arch/arm64/include/asm/pgtable-hwdef.h
> +++ b/arch/arm64/include/asm/pgtable-hwdef.h
> @@ -193,6 +193,10 @@
>  #define PMD_S2_RDWR  (_AT(pmdval_t, 3) << 6)   /* HAP[2:1] */
>  #define PMD_S2_XN(_AT(pmdval_t, 2) << 53)  /* XN[1:0] */
>  
> +#define PUD_S2_RDONLY(_AT(pudval_t, 1) << 6)   /* HAP[2:1] */
> +#define PUD_S2_RDWR  (_AT(pudval_t, 3) << 6)   /* HAP[2:1] */
> +#define PUD_S2_XN(_AT(pudval_t, 2) << 53)  /* XN[1:0] */
> +
>  /*
>   * Memory Attribute override for Stage-2 (MemAttr[3:0])
>   */
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 7e2c27e63cd8..5efb4585c879 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -386,6 +386,8 @@ static inline int pmd_protnone(pmd_t pmd)
>  
>  #define pud_write(pud)   pte_write(pud_pte(pud))
>  
> +#define pud_mkhuge(pud)  (__pud(pud_val(pud)

Re: [PATCHv3 03/11] arm64/kvm: hide ptrauth from guests

2018-04-27 Thread Christoffer Dall
On Tue, Apr 17, 2018 at 07:37:27PM +0100, Mark Rutland wrote:
> In subsequent patches we're going to expose ptrauth to the host kernel
> and userspace, but things are a bit trickier for guest kernels. For the
> time being, let's hide ptrauth from KVM guests.
> 
> Regardless of how well-behaved the guest kernel is, guest userspace
> could attempt to use ptrauth instructions, triggering a trap to EL2,
> resulting in noise from kvm_handle_unknown_ec(). So let's write up a
> handler for the PAC trap, which silently injects an UNDEF into the
> guest, as if the feature were really missing.
> 
> Signed-off-by: Mark Rutland 
> Cc: Christoffer Dall 
> Cc: Marc Zyngier 
> Cc: kvm...@lists.cs.columbia.edu
> ---
>  arch/arm64/kvm/handle_exit.c | 18 ++
>  arch/arm64/kvm/sys_regs.c|  9 +
>  2 files changed, 27 insertions(+)
> 
> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
> index e5e741bfffe1..5114ad691eae 100644
> --- a/arch/arm64/kvm/handle_exit.c
> +++ b/arch/arm64/kvm/handle_exit.c
> @@ -173,6 +173,23 @@ static int handle_sve(struct kvm_vcpu *vcpu, struct 
> kvm_run *run)
>   return 1;
>  }
>  
> +/*
> + * Guest usage of a ptrauth instruction (which the guest EL1 did not turn 
> into
> + * a NOP), or guest EL1 access to a ptrauth register.
> + */
> +static int kvm_handle_ptrauth(struct kvm_vcpu *vcpu, struct kvm_run *run)
> +{
> + /*
> +  * We don't currently suport ptrauth in a guest, and we mask the ID
> +  * registers to prevent well-behaved guests from trying to make use of
> +  * it.
> +  *
> +  * Inject an UNDEF, as if the feature really isn't present.
> +  */
> + kvm_inject_undefined(vcpu);
> + return 1;
> +}
> +
>  static exit_handle_fn arm_exit_handlers[] = {
>   [0 ... ESR_ELx_EC_MAX]  = kvm_handle_unknown_ec,
>   [ESR_ELx_EC_WFx]= kvm_handle_wfx,
> @@ -195,6 +212,7 @@ static exit_handle_fn arm_exit_handlers[] = {
>   [ESR_ELx_EC_BKPT32] = kvm_handle_guest_debug,
>   [ESR_ELx_EC_BRK64]  = kvm_handle_guest_debug,
>   [ESR_ELx_EC_FP_ASIMD]   = handle_no_fpsimd,
> + [ESR_ELx_EC_PAC]= kvm_handle_ptrauth,
>  };
>  
>  static exit_handle_fn kvm_get_exit_handler(struct kvm_vcpu *vcpu)
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 806b0b126a64..eee399c35e84 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -1000,6 +1000,15 @@ static u64 read_id_reg(struct sys_reg_desc const *r, 
> bool raz)
>   task_pid_nr(current));
>  
>   val &= ~(0xfUL << ID_AA64PFR0_SVE_SHIFT);
> + } else if (id == SYS_ID_AA64ISAR1_EL1) {
> + const u64 ptrauth_mask = (0xfUL << ID_AA64ISAR1_APA_SHIFT) |
> +  (0xfUL << ID_AA64ISAR1_API_SHIFT) |
> +  (0xfUL << ID_AA64ISAR1_GPA_SHIFT) |
> +  (0xfUL << ID_AA64ISAR1_GPI_SHIFT);
> + if (val & ptrauth_mask)
> + pr_err_once("kvm [%i]: ptrauth unsupported for guests, 
> suppressing\n",
> + task_pid_nr(current));
> + val &= ~ptrauth_mask;
>   } else if (id == SYS_ID_AA64MMFR1_EL1) {
>   if (val & (0xfUL << ID_AA64MMFR1_LOR_SHIFT))
>   pr_err_once("kvm [%i]: LORegions unsupported for 
> guests, suppressing\n",
> -- 
> 2.11.0
> 

With the change to the debugging print:

Reviewed-by: Christoffer Dall 


Re: [PATCHv3 04/11] arm64: Don't trap host pointer auth use to EL2

2018-04-27 Thread Christoffer Dall
On Tue, Apr 17, 2018 at 07:37:28PM +0100, Mark Rutland wrote:
> To allow EL0 (and/or EL1) to use pointer authentication functionality,
> we must ensure that pointer authentication instructions and accesses to
> pointer authentication keys are not trapped to EL2.
> 
> This patch ensures that HCR_EL2 is configured appropriately when the
> kernel is booted at EL2. For non-VHE kernels we set HCR_EL2.{API,APK},
> ensuring that EL1 can access keys and permit EL0 use of instructions.
> For VHE kernels host EL0 (TGE && E2H) is unaffected by these settings,
> and it doesn't matter how we configure HCR_EL2.{API,APK}, so we don't
> bother setting them.
> 
> This does not enable support for KVM guests, since KVM manages HCR_EL2
> itself when running VMs.
> 
> Signed-off-by: Mark Rutland 
> Cc: Christoffer Dall 
> Cc: Catalin Marinas 
> Cc: Marc Zyngier 
> Cc: Will Deacon 
> Cc: kvm...@lists.cs.columbia.edu

FWIW: Acked-by: Christoffer Dall 

> ---
>  arch/arm64/include/asm/kvm_arm.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_arm.h 
> b/arch/arm64/include/asm/kvm_arm.h
> index 89b3dda7e3cb..4d57e1e58323 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -23,6 +23,8 @@
>  #include 
>  
>  /* Hyp Configuration Register (HCR) bits */
> +#define HCR_API  (UL(1) << 41)
> +#define HCR_APK  (UL(1) << 40)
>  #define HCR_TEA  (UL(1) << 37)
>  #define HCR_TERR (UL(1) << 36)
>  #define HCR_TLOR (UL(1) << 35)
> @@ -86,7 +88,7 @@
>HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW | HCR_TLOR | \
>HCR_FMO | HCR_IMO)
>  #define HCR_VIRT_EXCP_MASK (HCR_VSE | HCR_VI | HCR_VF)
> -#define HCR_HOST_NVHE_FLAGS (HCR_RW)
> +#define HCR_HOST_NVHE_FLAGS (HCR_RW | HCR_API | HCR_APK)
>  #define HCR_HOST_VHE_FLAGS (HCR_RW | HCR_TGE | HCR_E2H)
>  
>  /* TCR_EL2 Registers bits */
> -- 
> 2.11.0
> 
> ___
> kvmarm mailing list
> kvm...@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCHv3 02/11] arm64/kvm: consistently handle host HCR_EL2 flags

2018-04-27 Thread Christoffer Dall
On Tue, Apr 17, 2018 at 07:37:26PM +0100, Mark Rutland wrote:
> In KVM we define the configuration of HCR_EL2 for a VHE HOST in
> HCR_HOST_VHE_FLAGS, but we don't ahve a similar definition for the

nit: have

> non-VHE host flags, and open-code HCR_RW. Further, in head.S we
> open-code the flags for VHE and non-VHE configurations.
> 
> In future, we're going to want to configure more flags for the host, so
> lets add a HCR_HOST_NVHE_FLAGS defintion, adn consistently use both

nit: and

> HCR_HOST_VHE_FLAGS and HCR_HOST_NVHE_FLAGS in the kvm code and head.S.
> 
> We now use mov_q to generate the HCR_EL2 value, as we use when
> configuring other registers in head.S.
> 
> Signed-off-by: Mark Rutland 
> Cc: Catalin Marinas 
> Cc: Christoffer Dall 
> Cc: Marc Zyngier 
> Cc: Will Deacon 
> Cc: kvm...@lists.cs.columbia.edu
> ---
>  arch/arm64/include/asm/kvm_arm.h | 1 +
>  arch/arm64/kernel/head.S | 5 ++---
>  arch/arm64/kvm/hyp/switch.c  | 2 +-
>  3 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_arm.h 
> b/arch/arm64/include/asm/kvm_arm.h
> index 6dd285e979c9..89b3dda7e3cb 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -86,6 +86,7 @@
>HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW | HCR_TLOR | \
>HCR_FMO | HCR_IMO)
>  #define HCR_VIRT_EXCP_MASK (HCR_VSE | HCR_VI | HCR_VF)
> +#define HCR_HOST_NVHE_FLAGS (HCR_RW)
>  #define HCR_HOST_VHE_FLAGS (HCR_RW | HCR_TGE | HCR_E2H)
>  
>  /* TCR_EL2 Registers bits */
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index b0853069702f..651a06b1980f 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -494,10 +494,9 @@ ENTRY(el2_setup)
>  #endif
>  
>   /* Hyp configuration. */
> - mov x0, #HCR_RW // 64-bit EL1
> + mov_q   x0, HCR_HOST_NVHE_FLAGS
>   cbz x2, set_hcr
> - orr x0, x0, #HCR_TGE// Enable Host Extensions
> - orr x0, x0, #HCR_E2H
> + mov_q   x0, HCR_HOST_VHE_FLAGS
>  set_hcr:
>   msr hcr_el2, x0
>   isb
> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> index d9645236e474..cdae330e15e9 100644
> --- a/arch/arm64/kvm/hyp/switch.c
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -143,7 +143,7 @@ static void __hyp_text __deactivate_traps_nvhe(void)
>   mdcr_el2 |= MDCR_EL2_E2PB_MASK << MDCR_EL2_E2PB_SHIFT;
>  
>   write_sysreg(mdcr_el2, mdcr_el2);
> - write_sysreg(HCR_RW, hcr_el2);
> + write_sysreg(HCR_HOST_NVHE_FLAGS, hcr_el2);
>   write_sysreg(CPTR_EL2_DEFAULT, cptr_el2);
>  }
>  
> -- 
> 2.11.0
> 
> 
Reviewed-by: Christoffer Dall 


Re: [PATCH v3 09/12] KVM: arm/arm64: Check all vcpu redistributors are set on map_resources

2018-04-26 Thread Christoffer Dall
On Thu, Apr 26, 2018 at 11:56:10AM +0200, Auger Eric wrote:
> 
> 
> On 04/24/2018 11:08 PM, Christoffer Dall wrote:
> > On Fri, Apr 13, 2018 at 10:20:55AM +0200, Eric Auger wrote:
> >> On vcpu first run, we eventually know the actual number of vcpus.
> >> This is a synchronization point to check all redistributors regions
> >> were assigned.
> > 
> > Isn't it the other way around?  We want to check that all redistributors
> > (one for every VCPU) has its base address set?  (I don't suppose we care
> > about unused redistributor regions).
> Yes I meant "to check all redistributors were assigned"
> > 
> >> On kvm_vgic_map_resources() we check both dist and
> >> redist were set, eventually check potential base address inconsistencies.
> > 
> > Do we need to check that again?  Didn't we check that at
> > creation/register time?
> Yes we need to, to handle the case where the userspace does not provide
> sufficient rdist region space
> 
> create 8 vcpus (no iodev registered)
> define a redist region for 4 (4 iodevs registered)
> start the vcpus
> > 
> >>
> >> Signed-off-by: Eric Auger 
> >> ---
> >>  virt/kvm/arm/vgic/vgic-v3.c | 19 ++-
> >>  1 file changed, 14 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/virt/kvm/arm/vgic/vgic-v3.c b/virt/kvm/arm/vgic/vgic-v3.c
> >> index b80f650..feabc24 100644
> >> --- a/virt/kvm/arm/vgic/vgic-v3.c
> >> +++ b/virt/kvm/arm/vgic/vgic-v3.c
> >> @@ -484,16 +484,25 @@ struct vgic_redist_region 
> >> *vgic_v3_rdist_region_from_index(struct kvm *kvm,
> >>  
> >>  int vgic_v3_map_resources(struct kvm *kvm)
> >>  {
> >> -  int ret = 0;
> >>struct vgic_dist *dist = &kvm->arch.vgic;
> >> -  struct vgic_redist_region *rdreg =
> >> -  list_first_entry(&dist->rd_regions,
> >> -   struct vgic_redist_region, list);
> >> +  struct kvm_vcpu *vcpu;
> >> +  int ret = 0;
> >> +  int c;
> >>  
> >>if (vgic_ready(kvm))
> >>goto out;
> >>  
> >> -  if (IS_VGIC_ADDR_UNDEF(dist->vgic_dist_base) || !rdreg) {
> >> +  kvm_for_each_vcpu(c, vcpu, kvm) {
> >> +  struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
> >> +
> >> +  if (IS_VGIC_ADDR_UNDEF(vgic_cpu->rd_iodev.base_addr)) {
> >> +  kvm_err("vcpu %d redistributor base not set\n", c);
> > 
> > Shouldn't this be a kvm_debug instead?  I think the user can just
> > provoke this by failing to register enough redistributor regions.
> yes indeed.
> > 
> > I'm also wondering if we could check this on vgic_init time for gicv3,
> > which should have a defined vgic init ordering requirement.  That would
> > make it slightly easier to debug than "my machine isn't
> > starting, and I get -ENXIO, and it can mean anything".
> 
> Well at vgic_init time, vcpus are frozen but dist and redist regions are
> not forced to be set, hence the late check we do in map_resources, on
> 1st VCPU run. Not sure I get this one?
> 

I thought we required userspace to register the redist regions before
initializing the vgic, but if there is no defined order there, then
fine.

(Although I'm getting the feeling we should have been more strict on our
ordering requirement, and that the whole "you can create anything you
want in any order you want" makes the kernel implementation really
complex.  I may be reversing my own opinion on this matter here.)

Thanks,
-Christoffer


Re: [PATCH v3 08/12] KVM: arm/arm64: Check vcpu redist base before registering an iodev

2018-04-26 Thread Christoffer Dall
On Thu, Apr 26, 2018 at 11:25:06AM +0200, Auger Eric wrote:
> Hi Christoffer,
> 
> On 04/24/2018 11:07 PM, Christoffer Dall wrote:
> > On Fri, Apr 13, 2018 at 10:20:54AM +0200, Eric Auger wrote:
> >> As we are going to register several redist regions,
> >> vgic_register_all_redist_iodevs() may be called several times. We need
> >> to register a redist_iodev for a given vcpu only once. 
> > 
> > Wouldn't it be more natural to change that caller to only register the
> > iodevs for that region?
> 
> vgic_register_redist_iodev() is the place where we decide where we map a
> given vcpu redist into a given redist region.
> 
> Calling vgic_register_redist_iodev for only the vcpus mapping to the
> redist region would force to inverse the logic. I think it would bring
> more upheavals in the code than bringing benefit?
> 
> This new check somehow corresponds to what we had before:
> "
> if (IS_VGIC_ADDR_UNDEF(vgic->vgic_redist_base))
> return 0;
> "

Ah, this is because we don't enforce any ordering between creating the
redistributors and initializing the vcpus; this always confuses me.

Fine, let's leave it as you suggest here.

Thanks,
-Christoffer


Re: [PATCH v3 07/12] KVM: arm/arm64: Adapt vgic_v3_check_base to multiple rdist regions

2018-04-26 Thread Christoffer Dall
On Thu, Apr 26, 2018 at 10:29:35AM +0200, Auger Eric wrote:
> Hi Christoffer,
> On 04/24/2018 11:07 PM, Christoffer Dall wrote:
> > On Fri, Apr 13, 2018 at 10:20:53AM +0200, Eric Auger wrote:
> >> We introduce a new helper to check there is no overlap between
> >> dist region (if set) and registered rdist regions. This both
> >> handles the case of legacy single rdist region (implicitly sized
> >> with the number of online vcpus) and the new case of multiple
> >> explicitly sized rdist regions.
> > 
> > I don't understand this change, really.  Is this just a cleanup, or
> > changing some functionality (why?).
> > 
> > I think this could have come with the introduction of
> > vgic_v3_rdist_overlap() before patch 6, and then patch 6 could have been
> > simplified (hopefully) to just call this "check that nothing in the
> > world ever collides with itself" function.
> I have merged this patch and vgic_v3_rd_region_size +
> vgic_v3_rdist_overlap and put it before this patch.
> 
> Also I reworked the commit message which was unclear I acknowledge.
> 
> With respect to using the adapted  vgic_v3_check_base() in
> vgic_v3_insert_redist_region(), it is less obvious to me.
> 
> In vgic_v3_insert_redist_region we do the checks *before* inserting the
> new rdist region in the list of redist regions. While
> vgic_v3_check_base() does the checks on already registered rdist and
> dist regions. So I would be tempted to leave
> vgic_v3_insert_redist_region() implementation as it is.
> 

ok, but do see my suggestion there to factor out the check, which should
make that function slightly easier to read.

Then perhaps you can use that function from vgic_v3_check_base to check
that each rdist doesn't overlap with the distributor?

Thanks,
-Christoffer


Re: [PATCH v3 06/12] KVM: arm/arm64: Helper to register a new redistributor region

2018-04-26 Thread Christoffer Dall
On Thu, Apr 26, 2018 at 09:32:49AM +0200, Auger Eric wrote:
> Hi Christoffer,
> 
> On 04/24/2018 06:47 PM, Christoffer Dall wrote:
> > On Fri, Apr 13, 2018 at 10:20:52AM +0200, Eric Auger wrote:
> >> We introduce a new helper that creates and inserts a new redistributor
> >> region into the rdist region list. This helper both handles the case
> >> where the redistributor region size is known at registration time
> >> and the legacy case where it is not (eventually depending on the number
> >> of online vcpus). Depending on pfns, we perform all the possible checks
> >> that we can do:
> >>
> >> - end of memory crossing
> >> - incorrect alignment of the base address
> >> - collision with distributor region if already defined
> >> - collision with already registered rdist regions
> >> - check of the new index
> >>
> >> Rdist regions must be inserted by increasing order of indices. Indices
> >> must be contiguous.
> >>
> >> We also introduce vgic_v3_rdist_region_from_index() which will be used
> >> from the vgic kvm-device, later on.
> >>
> >> Signed-off-by: Eric Auger 
> >> ---
> >>  | 95 +---
> >>  virt/kvm/arm/vgic/vgic-v3.c  | 29 
> >>  virt/kvm/arm/vgic/vgic.h | 14 ++
> >>  3 files changed, 122 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/virt/kvm/arm/vgic/vgic-mmio-v3.c 
> >> b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> >> index ce5c927..5273fb8 100644
> >> --- a/virt/kvm/arm/vgic/vgic-mmio-v3.c
> >> +++ b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> >> @@ -680,14 +680,66 @@ static int vgic_register_all_redist_iodevs(struct 
> >> kvm *kvm)
> >>return ret;
> >>  }
> >>  
> >> -int vgic_v3_set_redist_base(struct kvm *kvm, u64 addr)
> >> +/**
> >> + * vgic_v3_insert_redist_region - Insert a new redistributor region
> >> + *
> >> + * Performs various checks before inserting the rdist region in the list.
> >> + * Those tests depend on whether the size of the rdist region is known
> >> + * (ie. count != 0). The list is sorted by rdist region index.
> >> + *
> >> + * @kvm: kvm handle
> >> + * @index: redist region index
> >> + * @base: base of the new rdist region
> >> + * @count: number of redistributors the region is made of (of 0 in the 
> >> old style
> >> + * single region, whose size is induced from the number of vcpus)
> >> + *
> >> + * Return 0 on success, < 0 otherwise
> >> + */
> >> +static int vgic_v3_insert_redist_region(struct kvm *kvm, uint32_t index,
> >> +  gpa_t base, uint32_t count)
> >>  {
> >> -  struct vgic_dist *vgic = &kvm->arch.vgic;
> >> +  struct vgic_dist *d = &kvm->arch.vgic;
> >>struct vgic_redist_region *rdreg;
> >> +  struct list_head *rd_regions = &d->rd_regions;
> >> +  struct list_head *last = rd_regions->prev;
> >> +
> > 
> > nit: extra blank line?
> > 
> >> +  gpa_t new_start, new_end;
> >> +  size_t size = count * KVM_VGIC_V3_REDIST_SIZE;
> >>int ret;
> >>  
> >> -  /* vgic_check_ioaddr makes sure we don't do this twice */
> >> -  if (!list_empty(&vgic->rd_regions))
> >> +  /* single rdist region already set ?*/
> >> +  if (!count && !list_empty(rd_regions))
> >> +  return -EINVAL;
> >> +
> >> +  /* cross the end of memory ? */
> >> +  if (base + size < base)
> >> +  return -EINVAL;
> > 
> > what is the size of memory?  This seems to check for a gpa_t overflow,
> > but not againt the IPA space of the VM...
> Yes it checks for a gpa_t overflow. This check is currently done in
> vgic_v3_check_base() for dist and redist region and I replicated it.

fair enough, the comment is a bit misleading though.  We could also
consider checking against KVM_PHYS_SHIFT.

> > 
> >> +
> >> +  if (list_empty(rd_regions)) {
> >> +  if (index != 0)
> >> +  return -EINVAL;
> > 
> > note, I think this can be simplified if we can rid of the index.
> So I eventually keep the index.

Yes.

> > 
> >> +  } else {
> >> +  rdreg = list_entry(last, struct vgic_redist_region, list);
> > 
> > you can use list_last_entry here and get rid of the 'last' temporary
> > variable above.
> 

Re: [PATCH v3 02/12] KVM: arm/arm64: Document KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION

2018-04-24 Thread Christoffer Dall
On Tue, Apr 24, 2018 at 05:50:37PM +0100, Peter Maydell wrote:
> On 24 April 2018 at 17:46, Christoffer Dall  wrote:
> > On Fri, Apr 13, 2018 at 10:20:48AM +0200, Eric Auger wrote:
> >> --- a/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> >> +++ b/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> >> @@ -27,9 +27,32 @@ Groups:
> >>VCPU and all of the redistributor pages are contiguous.
> >>Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
> >>This address needs to be 64K aligned.
> >> +
> >> +KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION (rw, 64-bit)
> >> +  The attr field of kvm_device_attr encodes 3 values:
> >> +  bits:   | 63 .... 52 | 51 .... 16 | 15 - 12 | 11 - 0
> >> +  values: |   count    |    base    |  flags  |  index
> >> +  - index encodes the unique redistributor region index
> >
> > I'm not entirely sure I understand the purpose of the index field.
> > Isn't a redistributor region identified uniquely by its base address?
> 
> You need a way to tell the difference beween:
>  (1) redistributors for CPUs 0..63 at 0x4000, redistributors
>  for 64..127 at 0x8000
>  (2) redistributors for CPUs 0..63 at 0x8000, redistributors
>  for 64..127 at 0x4000
> 
> The index field tells you which order the redistributor
> regions go in.

ah, right.  This could be implied by the order in which the regions are
created, but ok, in that case it's nicer for userspace to state it
explicitly.

Thanks,
-Christoffer


Re: [PATCH v3 07/12] KVM: arm/arm64: Adapt vgic_v3_check_base to multiple rdist regions

2018-04-24 Thread Christoffer Dall
On Fri, Apr 13, 2018 at 10:20:53AM +0200, Eric Auger wrote:
> We introduce a new helper to check there is no overlap between
> dist region (if set) and registered rdist regions. This both
> handles the case of legacy single rdist region (implicitly sized
> with the number of online vcpus) and the new case of multiple
> explicitly sized rdist regions.

I don't understand this change, really.  Is this just a cleanup, or
changing some functionality (why?).

I think this could have come with the introduction of
vgic_v3_rdist_overlap() before patch 6, and then patch 6 could have been
simplified (hopefully) to just call this "check that nothing in the
world ever collides with itself" function.

> 
> Signed-off-by: Eric Auger 
> ---
>  virt/kvm/arm/vgic/vgic-v3.c | 26 +-
>  1 file changed, 9 insertions(+), 17 deletions(-)
> 
> diff --git a/virt/kvm/arm/vgic/vgic-v3.c b/virt/kvm/arm/vgic/vgic-v3.c
> index dbcba5f..b80f650 100644
> --- a/virt/kvm/arm/vgic/vgic-v3.c
> +++ b/virt/kvm/arm/vgic/vgic-v3.c
> @@ -432,31 +432,23 @@ bool vgic_v3_rdist_overlap(struct kvm *kvm, gpa_t base, 
> size_t size)
>  bool vgic_v3_check_base(struct kvm *kvm)
>  {
>   struct vgic_dist *d = &kvm->arch.vgic;
> - gpa_t redist_size = KVM_VGIC_V3_REDIST_SIZE;
> - struct vgic_redist_region *rdreg =
> - list_first_entry(&d->rd_regions,
> -  struct vgic_redist_region, list);
> -
> - redist_size *= atomic_read(&kvm->online_vcpus);
> + struct vgic_redist_region *rdreg;
>  
>   if (!IS_VGIC_ADDR_UNDEF(d->vgic_dist_base) &&
>   d->vgic_dist_base + KVM_VGIC_V3_DIST_SIZE < d->vgic_dist_base)
>   return false;
>  
> - if (rdreg && (rdreg->base + redist_size < rdreg->base))
> - return false;
> + list_for_each_entry(rdreg, &d->rd_regions, list) {
> + if (rdreg->base + vgic_v3_rd_region_size(kvm, rdreg) <
> + rdreg->base)
> + return false;
> + }
>  
> - /* Both base addresses must be set to check if they overlap */
> - if (IS_VGIC_ADDR_UNDEF(d->vgic_dist_base) || !rdreg)
> + if (IS_VGIC_ADDR_UNDEF(d->vgic_dist_base))
>   return true;
>  
> - if (d->vgic_dist_base + KVM_VGIC_V3_DIST_SIZE <= rdreg->base)
> - return true;
> -
> - if (rdreg->base + redist_size <= d->vgic_dist_base)
> - return true;
> -
> - return false;
> + return !vgic_v3_rdist_overlap(kvm, d->vgic_dist_base,
> +   KVM_VGIC_V3_DIST_SIZE);
>  }
>  
>  /**
> -- 
> 2.5.5
> 
Otherwise this patch looks correct to me.

Thanks,
-Christoffer


Re: [PATCH v3 10/12] KVM: arm/arm64: Add KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION

2018-04-24 Thread Christoffer Dall
On Fri, Apr 13, 2018 at 10:20:56AM +0200, Eric Auger wrote:
> This new attribute allows the userspace to set the base address
> of a reditributor region, relaxing the constraint of having all
> consecutive redistibutor frames contiguous.
> 
> Signed-off-by: Eric Auger 
> ---
>  arch/arm/include/uapi/asm/kvm.h   | 7 ---
>  arch/arm64/include/uapi/asm/kvm.h | 7 ---
>  2 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h
> index 2ba95d6..11725bb 100644
> --- a/arch/arm/include/uapi/asm/kvm.h
> +++ b/arch/arm/include/uapi/asm/kvm.h
> @@ -88,9 +88,10 @@ struct kvm_regs {
>  #define KVM_VGIC_V2_CPU_SIZE 0x2000
>  
>  /* Supported VGICv3 address types  */
> -#define KVM_VGIC_V3_ADDR_TYPE_DIST   2
> -#define KVM_VGIC_V3_ADDR_TYPE_REDIST 3
> -#define KVM_VGIC_ITS_ADDR_TYPE   4
> +#define KVM_VGIC_V3_ADDR_TYPE_DIST   2
> +#define KVM_VGIC_V3_ADDR_TYPE_REDIST 3
> +#define KVM_VGIC_ITS_ADDR_TYPE   4

I think I'd prefer that we just leave these lines above as it's only an
indentation thing, and this is an exported header so will propagate into
userspace updates, but I don't have a strong feeling about it, nor do I
know if there's a general policy.

> +#define KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION  5
>  
>  #define KVM_VGIC_V3_DIST_SIZESZ_64K
>  #define KVM_VGIC_V3_REDIST_SIZE  (2 * SZ_64K)
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index 9abbf30..ef8ad3b 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -88,9 +88,10 @@ struct kvm_regs {
>  #define KVM_VGIC_V2_CPU_SIZE 0x2000
>  
>  /* Supported VGICv3 address types  */
> -#define KVM_VGIC_V3_ADDR_TYPE_DIST   2
> -#define KVM_VGIC_V3_ADDR_TYPE_REDIST 3
> -#define KVM_VGIC_ITS_ADDR_TYPE   4
> +#define KVM_VGIC_V3_ADDR_TYPE_DIST   2
> +#define KVM_VGIC_V3_ADDR_TYPE_REDIST 3
> +#define KVM_VGIC_ITS_ADDR_TYPE   4
> +#define KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION  5
>  
>  #define KVM_VGIC_V3_DIST_SIZESZ_64K
>  #define KVM_VGIC_V3_REDIST_SIZE  (2 * SZ_64K)
> -- 
> 2.5.5
> 

Otherwise:

Acked-by: Christoffer Dall 


Re: [PATCH v3 09/12] KVM: arm/arm64: Check all vcpu redistributors are set on map_resources

2018-04-24 Thread Christoffer Dall
On Fri, Apr 13, 2018 at 10:20:55AM +0200, Eric Auger wrote:
> On vcpu first run, we eventually know the actual number of vcpus.
> This is a synchronization point to check all redistributors regions
> were assigned.

Isn't it the other way around?  We want to check that all redistributors
(one for every VCPU) have their base addresses set?  (I don't suppose we care
about unused redistributor regions).

> On kvm_vgic_map_resources() we check both dist and
> redist were set, eventually check potential base address inconsistencies.

Do we need to check that again?  Didn't we check that at
creation/register time?

> 
> Signed-off-by: Eric Auger 
> ---
>  virt/kvm/arm/vgic/vgic-v3.c | 19 ++-
>  1 file changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/virt/kvm/arm/vgic/vgic-v3.c b/virt/kvm/arm/vgic/vgic-v3.c
> index b80f650..feabc24 100644
> --- a/virt/kvm/arm/vgic/vgic-v3.c
> +++ b/virt/kvm/arm/vgic/vgic-v3.c
> @@ -484,16 +484,25 @@ struct vgic_redist_region 
> *vgic_v3_rdist_region_from_index(struct kvm *kvm,
>  
>  int vgic_v3_map_resources(struct kvm *kvm)
>  {
> - int ret = 0;
>   struct vgic_dist *dist = &kvm->arch.vgic;
> - struct vgic_redist_region *rdreg =
> - list_first_entry(&dist->rd_regions,
> -  struct vgic_redist_region, list);
> + struct kvm_vcpu *vcpu;
> + int ret = 0;
> + int c;
>  
>   if (vgic_ready(kvm))
>   goto out;
>  
> - if (IS_VGIC_ADDR_UNDEF(dist->vgic_dist_base) || !rdreg) {
> + kvm_for_each_vcpu(c, vcpu, kvm) {
> + struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
> +
> + if (IS_VGIC_ADDR_UNDEF(vgic_cpu->rd_iodev.base_addr)) {
> + kvm_err("vcpu %d redistributor base not set\n", c);

Shouldn't this be a kvm_debug instead?  I think the user can just
provoke this by failing to register enough redistributor regions.

I'm also wondering if we could check this on vgic_init time for gicv3,
which should have a defined vgic init ordering requirement.  That would
make it slightly easier to debug than "my machine isn't starting, and I
get -ENXIO, and it can mean anything".

> + ret = -ENXIO;
> + goto out;
> + }
> + }
> +
> + if (IS_VGIC_ADDR_UNDEF(dist->vgic_dist_base)) {
>   kvm_err("Need to set vgic distributor addresses first\n");
>   ret = -ENXIO;
>   goto out;
> -- 
> 2.5.5
> 

Thanks,
-Christoffer


Re: [PATCH v3 08/12] KVM: arm/arm64: Check vcpu redist base before registering an iodev

2018-04-24 Thread Christoffer Dall
On Fri, Apr 13, 2018 at 10:20:54AM +0200, Eric Auger wrote:
> As we are going to register several redist regions,
> vgic_register_all_redist_iodevs() may be called several times. We need
> to register a redist_iodev for a given vcpu only once. 

Wouldn't it be more natural to change that caller to only register the
iodevs for that region?

Thanks,
-Christoffer

> So let's
> check if the base address has already been set. Initialize this latter
> in kvm_vgic_vcpu_early_init().
> 
> Signed-off-by: Eric Auger 
> ---
>  virt/kvm/arm/vgic/vgic-init.c| 3 +++
>  virt/kvm/arm/vgic/vgic-mmio-v3.c | 3 +++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/virt/kvm/arm/vgic/vgic-init.c b/virt/kvm/arm/vgic/vgic-init.c
> index 6456371..7e040e7 100644
> --- a/virt/kvm/arm/vgic/vgic-init.c
> +++ b/virt/kvm/arm/vgic/vgic-init.c
> @@ -82,6 +82,9 @@ void kvm_vgic_vcpu_early_init(struct kvm_vcpu *vcpu)
>   INIT_LIST_HEAD(&vgic_cpu->ap_list_head);
>   spin_lock_init(&vgic_cpu->ap_list_lock);
>  
> + vgic_cpu->rd_iodev.base_addr = VGIC_ADDR_UNDEF;
> + vgic_cpu->sgi_iodev.base_addr = VGIC_ADDR_UNDEF;
> +
>   /*
>* Enable and configure all SGIs to be edge-triggered and
>* configure all PPIs as level-triggered.
> diff --git a/virt/kvm/arm/vgic/vgic-mmio-v3.c b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> index 5273fb8..df23e66 100644
> --- a/virt/kvm/arm/vgic/vgic-mmio-v3.c
> +++ b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> @@ -592,6 +592,9 @@ int vgic_register_redist_iodev(struct kvm_vcpu *vcpu)
>   gpa_t rd_base, sgi_base;
>   int ret;
>  
> + if (!IS_VGIC_ADDR_UNDEF(vgic_cpu->rd_iodev.base_addr))
> + return 0;
> +
>   /*
>* We may be creating VCPUs before having set the base address for the
>* redistributor region, in which case we will come back to this
> -- 
> 2.5.5
> 



Re: [PATCH v3 04/12] KVM: arm/arm64: Helper to locate free rdist index

2018-04-24 Thread Christoffer Dall
On Fri, Apr 13, 2018 at 10:20:50AM +0200, Eric Auger wrote:
> We introduce vgic_v3_rdist_free_slot to help identifying
> where we can place a new 2x64KB redistributor.
> 
> Signed-off-by: Eric Auger 
> ---
>  virt/kvm/arm/vgic/vgic-mmio-v3.c |  3 +--
>  virt/kvm/arm/vgic/vgic-v3.c  | 17 +
>  virt/kvm/arm/vgic/vgic.h | 11 +++
>  3 files changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/arm/vgic/vgic-mmio-v3.c b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> index d1aab18..49ca176 100644
> --- a/virt/kvm/arm/vgic/vgic-mmio-v3.c
> +++ b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> @@ -593,8 +593,7 @@ int vgic_register_redist_iodev(struct kvm_vcpu *vcpu)
>* function for all VCPUs when the base address is set.  Just return
>* without doing any work for now.
>*/
> - rdreg = list_first_entry(&vgic->rd_regions,
> -  struct vgic_redist_region, list);
> + rdreg = vgic_v3_rdist_free_slot(&vgic->rd_regions);
>   if (!rdreg)
>   return 0;
>  
> diff --git a/virt/kvm/arm/vgic/vgic-v3.c b/virt/kvm/arm/vgic/vgic-v3.c
> index 94de6cd..820012a 100644
> --- a/virt/kvm/arm/vgic/vgic-v3.c
> +++ b/virt/kvm/arm/vgic/vgic-v3.c
> @@ -444,6 +444,23 @@ bool vgic_v3_check_base(struct kvm *kvm)
>   return false;
>  }
>  
> +/**
> + * vgic_v3_rdist_free_slot - Look up registered rdist regions and identify one
> + * which has free space to put a new rdist regions

Can this structure ever be sparse or do we always find the first empty
one, as we fill them consecutively?

I assume there is some mapping between the regions and the VCPUs'
redistributors, so perhaps the wording in this comment can be more
precise.

> + *
> + * If any, return this redist region handle, otherwise returns NULL.
> + */
> +struct vgic_redist_region *vgic_v3_rdist_free_slot(struct list_head *rd_regions)
> +{
> + struct vgic_redist_region *rdreg;
> +
> + list_for_each_entry(rdreg, rd_regions, list) {
> + if (!vgic_v3_redist_region_full(rdreg))
> + return rdreg;
> + }
> + return NULL;
> +}
> +
>  int vgic_v3_map_resources(struct kvm *kvm)
>  {
>   int ret = 0;
> diff --git a/virt/kvm/arm/vgic/vgic.h b/virt/kvm/arm/vgic/vgic.h
> index 830e815..fea32cb 100644
> --- a/virt/kvm/arm/vgic/vgic.h
> +++ b/virt/kvm/arm/vgic/vgic.h
> @@ -251,6 +251,17 @@ static inline int vgic_v3_max_apr_idx(struct kvm_vcpu *vcpu)
>   }
>  }
>  
> +static inline bool
> +vgic_v3_redist_region_full(struct vgic_redist_region *region)
> +{
> + if (!region->count)
> + return false;
> +
> + return (region->free_index >= region->count);
> +}
> +
> +struct vgic_redist_region *vgic_v3_rdist_free_slot(struct list_head *rdregs);
> +
>  int vgic_its_resolve_lpi(struct kvm *kvm, struct vgic_its *its,
>u32 devid, u32 eventid, struct vgic_irq **irq);
>  struct vgic_its *vgic_msi_to_its(struct kvm *kvm, struct kvm_msi *msi);
> -- 
> 2.5.5
> 

Asides from the above:

Reviewed-by: Christoffer Dall 
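
As a sanity check on the semantics discussed above (regions are filled consecutively, so the scan returns the first region with a free slot, and a legacy count == 0 region is never reported full), a small userspace model of the helper pair — the struct and names are illustrative, not the kernel's:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of a redistributor region: count == 0 denotes the
 * legacy, implicitly sized single region. */
struct rdreg_model {
	uint32_t count;
	uint32_t free_index;   /* index of the next free redistributor slot */
};

static bool region_full(const struct rdreg_model *r)
{
	if (!r->count)
		return false;  /* legacy region: size unknown, never "full" */
	return r->free_index >= r->count;
}

/* Walk regions in registration order and return the first one with a
 * free slot, mirroring vgic_v3_rdist_free_slot()'s list walk. */
static struct rdreg_model *first_free(struct rdreg_model *regs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!region_full(&regs[i]))
			return &regs[i];
	return NULL;
}
```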


Re: [PATCH v3 05/12] KVM: arm/arm64: Revisit Redistributor TYPER last bit computation

2018-04-24 Thread Christoffer Dall
On Fri, Apr 13, 2018 at 10:20:51AM +0200, Eric Auger wrote:
> The TYPER of an redistributor reflects whether the rdist is
> the last one of the redistributor region. Let's compare the TYPER
> GPA against the address of the last occupied slot within the
> redistributor region.
> 
> Signed-off-by: Eric Auger 
> ---
>  virt/kvm/arm/vgic/vgic-mmio-v3.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/arm/vgic/vgic-mmio-v3.c b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> index 49ca176..ce5c927 100644
> --- a/virt/kvm/arm/vgic/vgic-mmio-v3.c
> +++ b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> @@ -184,12 +184,17 @@ static unsigned long vgic_mmio_read_v3r_typer(struct kvm_vcpu *vcpu,
> gpa_t addr, unsigned int len)
>  {
>   unsigned long mpidr = kvm_vcpu_get_mpidr_aff(vcpu);
> + struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
> + struct vgic_redist_region *rdreg = vgic_cpu->rdreg;
>   int target_vcpu_id = vcpu->vcpu_id;
> + gpa_t last_rdist_typer = rdreg->base + GICR_TYPER +
> + (rdreg->free_index - 1) * KVM_VGIC_V3_REDIST_SIZE;
>   u64 value;
>  
>   value = (u64)(mpidr & GENMASK(23, 0)) << 32;
>   value |= ((target_vcpu_id & 0x) << 8);
> - if (target_vcpu_id == atomic_read(&vcpu->kvm->online_vcpus) - 1)
> +
> + if (addr == last_rdist_typer)
>   value |= GICR_TYPER_LAST;
>   if (vgic_has_its(vcpu->kvm))
>   value |= GICR_TYPER_PLPIS;
> -- 
> 2.5.5
> 

Reviewed-by: Christoffer Dall 
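
For reference, the last_rdist_typer computation in the quoted hunk can be modelled in userspace as below. The constants are written out by hand — two 64K frames per redistributor, and GICR_TYPER at offset 0x8 of the RD frame — so treat them as assumptions rather than the kernel's definitions:

```c
#include <stdint.h>

#define REDIST_SIZE    (2 * 0x10000ULL)  /* KVM_VGIC_V3_REDIST_SIZE: two 64K frames */
#define GICR_TYPER_OFF 0x0008ULL         /* GICR_TYPER offset within the RD frame */

/*
 * GPA of the TYPER register of the last occupied slot in a region:
 * slots 0..free_index-1 are in use, so the last one starts at
 * base + (free_index - 1) * REDIST_SIZE.
 */
static uint64_t last_rdist_typer(uint64_t region_base, uint32_t free_index)
{
	return region_base + GICR_TYPER_OFF +
	       (uint64_t)(free_index - 1) * REDIST_SIZE;
}
```

A read of GICR_TYPER then reports GICR_TYPER_LAST iff the accessed address equals this value.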


Re: [PATCH v3 03/12] KVM: arm/arm64: Replace the single rdist region by a list

2018-04-24 Thread Christoffer Dall
On Fri, Apr 13, 2018 at 10:20:49AM +0200, Eric Auger wrote:
> At the moment KVM supports a single rdist region. We want to
> support several separate rdist regions so let's introduce a list
> of them. This patch currently only cares about a single
> entry in this list as the functionality to register several redist
> regions is not yet there. So this only translates the existing code
> into something functionally similar using that new data struct.
> 
> The redistributor region handle is stored in the vgic_cpu structure
> to allow later computation of the TYPER last bit.
> 
> Signed-off-by: Eric Auger 

Reviewed-by: Christoffer Dall 

> ---
>  include/kvm/arm_vgic.h  | 14 +
>  virt/kvm/arm/vgic/vgic-init.c   | 16 --
>  virt/kvm/arm/vgic/vgic-kvm-device.c | 13 ++--
>  virt/kvm/arm/vgic/vgic-mmio-v3.c| 42 -
>  virt/kvm/arm/vgic/vgic-v3.c | 20 +++---
>  5 files changed, 79 insertions(+), 26 deletions(-)
> 
> diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
> index 24f0394..e5c16d1 100644
> --- a/include/kvm/arm_vgic.h
> +++ b/include/kvm/arm_vgic.h
> @@ -200,6 +200,14 @@ struct vgic_its {
>  
>  struct vgic_state_iter;
>  
> +struct vgic_redist_region {
> + uint32_t index;
> + gpa_t base;
> + uint32_t count; /* number of redistributors or 0 if single region */
> + uint32_t free_index; /* index of the next free redistributor */
> + struct list_head list;
> +};
> +
>  struct vgic_dist {
>   boolin_kernel;
>   boolready;
> @@ -219,10 +227,7 @@ struct vgic_dist {
>   /* either a GICv2 CPU interface */
>   gpa_t   vgic_cpu_base;
>   /* or a number of GICv3 redistributor regions */
> - struct {
> - gpa_t   vgic_redist_base;
> - gpa_t   vgic_redist_free_offset;
> - };
> + struct list_head rd_regions;
>   };
>  
>   /* distributor enabled */
> @@ -310,6 +315,7 @@ struct vgic_cpu {
>*/
>   struct vgic_io_device   rd_iodev;
>   struct vgic_io_device   sgi_iodev;
> + struct vgic_redist_region *rdreg;
>  
>   /* Contains the attributes and gpa of the LPI pending tables. */
>   u64 pendbaser;
> diff --git a/virt/kvm/arm/vgic/vgic-init.c b/virt/kvm/arm/vgic/vgic-init.c
> index c52f03d..6456371 100644
> --- a/virt/kvm/arm/vgic/vgic-init.c
> +++ b/virt/kvm/arm/vgic/vgic-init.c
> @@ -167,8 +167,11 @@ int kvm_vgic_create(struct kvm *kvm, u32 type)
>   kvm->arch.vgic.vgic_model = type;
>  
>   kvm->arch.vgic.vgic_dist_base = VGIC_ADDR_UNDEF;
> - kvm->arch.vgic.vgic_cpu_base = VGIC_ADDR_UNDEF;
> - kvm->arch.vgic.vgic_redist_base = VGIC_ADDR_UNDEF;
> +
> + if (type == KVM_DEV_TYPE_ARM_VGIC_V2)
> + kvm->arch.vgic.vgic_cpu_base = VGIC_ADDR_UNDEF;
> + else
> + INIT_LIST_HEAD(&kvm->arch.vgic.rd_regions);
>  
>  out_unlock:
>   for (; vcpu_lock_idx >= 0; vcpu_lock_idx--) {
> @@ -303,6 +306,7 @@ int vgic_init(struct kvm *kvm)
>  static void kvm_vgic_dist_destroy(struct kvm *kvm)
>  {
>   struct vgic_dist *dist = &kvm->arch.vgic;
> + struct vgic_redist_region *rdreg, *next;
>  
>   dist->ready = false;
>   dist->initialized = false;
> @@ -311,6 +315,14 @@ static void kvm_vgic_dist_destroy(struct kvm *kvm)
>   dist->spis = NULL;
>   dist->nr_spis = 0;
>  
> + if (kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3) {
> + list_for_each_entry_safe(rdreg, next, &dist->rd_regions, list) {
> + list_del(&rdreg->list);
> + kfree(rdreg);
> + }
> + INIT_LIST_HEAD(&dist->rd_regions);
> + }
> +
>   if (vgic_supports_direct_msis(kvm))
>   vgic_v4_teardown(kvm);
>  }
> diff --git a/virt/kvm/arm/vgic/vgic-kvm-device.c b/virt/kvm/arm/vgic/vgic-kvm-device.c
> index 10ae6f3..e7b5a86 100644
> --- a/virt/kvm/arm/vgic/vgic-kvm-device.c
> +++ b/virt/kvm/arm/vgic/vgic-kvm-device.c
> @@ -66,6 +66,7 @@ int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 
> *addr, bool write)
>   int r = 0;
>   struct vgic_dist *vgic = &kvm->arch.vgic;
>   phys_addr_t *addr_ptr, alignment;
> + uint64_t undef_value = VGIC_ADDR_UNDEF;
>  
>   mutex_lock(&kvm->lock);
>   switch (type) {
> @@ -84,7 +85,9 @@ int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 
> *addr, bool

Re: [PATCH v3 11/12] KVM: arm/arm64: Implement KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION

2018-04-24 Thread Christoffer Dall
On Fri, Apr 13, 2018 at 10:20:57AM +0200, Eric Auger wrote:
> Now all the internals are ready to handle multiple redistributor
> regions, let's allow the userspace to register them.
> 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> v2 -> v3:
> - early exit if vgic_v3_rdist_region_from_index() fails
> ---
>  virt/kvm/arm/vgic/vgic-kvm-device.c | 42 +++--
>  virt/kvm/arm/vgic/vgic-mmio-v3.c|  4 ++--
>  virt/kvm/arm/vgic/vgic.h|  9 +++-
>  3 files changed, 50 insertions(+), 5 deletions(-)
> 
> diff --git a/virt/kvm/arm/vgic/vgic-kvm-device.c b/virt/kvm/arm/vgic/vgic-kvm-device.c
> index e7b5a86..00e03d3 100644
> --- a/virt/kvm/arm/vgic/vgic-kvm-device.c
> +++ b/virt/kvm/arm/vgic/vgic-kvm-device.c
> @@ -65,7 +65,8 @@ int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 *addr, bool write)
>  {
>   int r = 0;
>   struct vgic_dist *vgic = &kvm->arch.vgic;
> - phys_addr_t *addr_ptr, alignment;
> + phys_addr_t *addr_ptr = NULL;
> + phys_addr_t alignment;
>   uint64_t undef_value = VGIC_ADDR_UNDEF;

nit: missed this one before, type should be u64

>  
>   mutex_lock(&kvm->lock);
> @@ -92,7 +93,7 @@ int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 *addr, bool write)
>   if (r)
>   break;
>   if (write) {
> - r = vgic_v3_set_redist_base(kvm, *addr);
> + r = vgic_v3_set_redist_base(kvm, 0, *addr, 0);
>   goto out;
>   }
>   rdreg = list_first_entry(&vgic->rd_regions,
> @@ -103,6 +104,42 @@ int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 *addr, bool write)
>   addr_ptr = &rdreg->base;
>   break;
>   }
> + case KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION:
> + {
> + struct vgic_redist_region *rdreg;
> + uint8_t index;
> +

we tend to use u8, u32, etc. in the kernel.

> + r = vgic_check_type(kvm, KVM_DEV_TYPE_ARM_VGIC_V3);
> + if (r)
> + break;
> +
> + index = *addr & KVM_VGIC_V3_RDIST_INDEX_MASK;
> +
> + if (write) {
> + gpa_t base = *addr & KVM_VGIC_V3_RDIST_BASE_MASK;
> + uint32_t count = (*addr & KVM_VGIC_V3_RDIST_COUNT_MASK)
> + >> KVM_VGIC_V3_RDIST_COUNT_SHIFT;
> + uint8_t flags = (*addr & KVM_VGIC_V3_RDIST_FLAGS_MASK)
> + >> KVM_VGIC_V3_RDIST_FLAGS_SHIFT;
> +
> + if (!count || flags)
> + r = -EINVAL;
> + else
> + r = vgic_v3_set_redist_base(kvm, index,
> + base, count);
> + goto out;
> + }
> +
> + rdreg = vgic_v3_rdist_region_from_index(kvm, index);
> + if (!rdreg) {
> + r = -ENODEV;
> + goto out;
> + }
> +
> + *addr_ptr = rdreg->base & index &
> + (uint64_t)rdreg->count << KVM_VGIC_V3_RDIST_COUNT_SHIFT;

This looks fairly broken, isn't this a clear null pointer dereference?

(If we're making this ioctl read-only using the parameter as both in/out
for set/get, that should also be documented in the API text, then you
should consider writing a small test along with your userspace
implementation to actually test that functionality - otherwise we should
just make this write-only and omit the index part.  It could be said
that retrieving what the kernel actually has is a reasonable debug
feature.)

I think you want (notice the | instead of & as well):

*addr = index;
*addr |= rdreg->base;
*addr |= (u64)rdreg->count << KVM_VGIC_V3_RDIST_COUNT_SHIFT;
goto out;

It is then debatable if the addr_ptr construct gets too convoluted when
not used in every case, and if the logic should be embedded into each
case, and the addr_ptr variable dropped.  Meh, I don't mind leaving it
for now.
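
To make the attr packing concrete, a hedged userspace sketch of the documented bit layout (count in bits 63:52, base in 51:16, flags in 15:12, index in 11:0). The mask values are written out from the bits described in the API text; the names are illustrative stand-ins for the KVM_VGIC_V3_RDIST_* macros:

```c
#include <stdint.h>

/* Illustrative masks derived from the documented layout, not the uapi header. */
#define RDIST_INDEX_MASK  0x0000000000000fffULL  /* bits 11:0  */
#define RDIST_FLAGS_MASK  0x000000000000f000ULL  /* bits 15:12 */
#define RDIST_FLAGS_SHIFT 12
#define RDIST_BASE_MASK   0x000fffffffff0000ULL  /* bits 51:16 */
#define RDIST_COUNT_MASK  0xfff0000000000000ULL  /* bits 63:52 */
#define RDIST_COUNT_SHIFT 52

/* Pack a redistributor region attribute value; flags are left zero as
 * they are reserved for future use. */
static uint64_t encode_redist_region(uint64_t base, uint32_t count,
				     uint32_t index)
{
	return (base & RDIST_BASE_MASK) |
	       ((uint64_t)count << RDIST_COUNT_SHIFT) |
	       (index & RDIST_INDEX_MASK);
}
```

Note the fields are combined with |, exactly the point being made about the broken & expression in the quoted hunk.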


> + break;
> + }
>   default:
>   r = -ENODEV;
>   }
> @@ -674,6 +711,7 @@ static int vgic_v3_has_attr(struct kvm_device *dev,
>   switch (attr->attr) {
>   case KVM_VGIC_V3_ADDR_TYPE_DIST:
>   case KVM_VGIC_V3_ADDR_TYPE_REDIST:
> + case KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION:
>   return 0;
>   }
>   break;
> diff --git a/virt/kvm/arm/vgic/vgic-mmio-v3.c b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> index df23e66..f603fdf 100644
> --- a/virt/kvm/arm/vgic/vgic-mmio-v3.c
> +++ b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> @@ -770,11 +770,11 @@ static int vgic_v3_insert_redist_region(struct kvm *kvm, uint32_t index,
>   return ret;
>  }
>  
> -int vgic_v3_set_redist_base(struct kvm *kvm, u

Re: [PATCH v3 01/12] KVM: arm/arm64: Set dist->spis to NULL after kfree

2018-04-24 Thread Christoffer Dall
On Fri, Apr 13, 2018 at 10:20:47AM +0200, Eric Auger wrote:
> in case kvm_vgic_map_resources() fails, typically if the vgic
> distributor is not defined, __kvm_vgic_destroy will be called
> several times. Indeed kvm_vgic_map_resources() is called on
> first vcpu run. As a result dist->spis is freed more than once
> and on the second time it causes a "kernel BUG at mm/slub.c:3912!"
> 
> Set dist->spis to NULL to avoid the crash.
> 
> Fixes: ad275b8bb1e6 ("KVM: arm/arm64: vgic-new: vgic_init: implement
> vgic_init")
> 
> Signed-off-by: Eric Auger 
> Reviewed-by: Marc Zyngier 

Reviewed-by: Christoffer Dall 

> 
> ---
> 
> v2 -> v3:
> - added Marc's R-b and Fixed commit
> ---
>  virt/kvm/arm/vgic/vgic-init.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/virt/kvm/arm/vgic/vgic-init.c b/virt/kvm/arm/vgic/vgic-init.c
> index 68378fe..c52f03d 100644
> --- a/virt/kvm/arm/vgic/vgic-init.c
> +++ b/virt/kvm/arm/vgic/vgic-init.c
> @@ -308,6 +308,7 @@ static void kvm_vgic_dist_destroy(struct kvm *kvm)
>   dist->initialized = false;
>  
>   kfree(dist->spis);
> + dist->spis = NULL;
>   dist->nr_spis = 0;
>  
>   if (vgic_supports_direct_msis(kvm))
> -- 
> 2.5.5
> 
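
The fix pattern itself is worth spelling out: clearing the pointer after kfree() makes a destroy path that can run more than once idempotent, since kfree(NULL) — like free(NULL) in userspace C — is a no-op. A minimal userspace model (struct and names are illustrative):

```c
#include <stdlib.h>

struct dist_model {
	int *spis;
	int nr_spis;
};

/* Mirrors the fixed kvm_vgic_dist_destroy(): safe to call repeatedly
 * because the pointer is cleared after being freed. */
static void dist_destroy(struct dist_model *d)
{
	free(d->spis);     /* free(NULL) is a harmless no-op on later calls */
	d->spis = NULL;
	d->nr_spis = 0;
}
```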


Re: [PATCH v3 06/12] KVM: arm/arm64: Helper to register a new redistributor region

2018-04-24 Thread Christoffer Dall
On Fri, Apr 13, 2018 at 10:20:52AM +0200, Eric Auger wrote:
> We introduce a new helper that creates and inserts a new redistributor
> region into the rdist region list. This helper both handles the case
> where the redistributor region size is known at registration time
> and the legacy case where it is not (eventually depending on the number
> of online vcpus). Depending on pfns, we perform all the possible checks
> that we can do:
> 
> - end of memory crossing
> - incorrect alignment of the base address
> - collision with distributor region if already defined
> - collision with already registered rdist regions
> - check of the new index
> 
> Rdist regions must be inserted by increasing order of indices. Indices
> must be contiguous.
> 
> We also introduce vgic_v3_rdist_region_from_index() which will be used
> from the vgic kvm-device, later on.
> 
> Signed-off-by: Eric Auger 
> ---
>  virt/kvm/arm/vgic/vgic-mmio-v3.c | 95 +---
>  virt/kvm/arm/vgic/vgic-v3.c  | 29 
>  virt/kvm/arm/vgic/vgic.h | 14 ++
>  3 files changed, 122 insertions(+), 16 deletions(-)
> 
> diff --git a/virt/kvm/arm/vgic/vgic-mmio-v3.c b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> index ce5c927..5273fb8 100644
> --- a/virt/kvm/arm/vgic/vgic-mmio-v3.c
> +++ b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> @@ -680,14 +680,66 @@ static int vgic_register_all_redist_iodevs(struct kvm *kvm)
>   return ret;
>  }
>  
> -int vgic_v3_set_redist_base(struct kvm *kvm, u64 addr)
> +/**
> + * vgic_v3_insert_redist_region - Insert a new redistributor region
> + *
> + * Performs various checks before inserting the rdist region in the list.
> + * Those tests depend on whether the size of the rdist region is known
> + * (ie. count != 0). The list is sorted by rdist region index.
> + *
> + * @kvm: kvm handle
> + * @index: redist region index
> + * @base: base of the new rdist region
> + * @count: number of redistributors the region is made of (or 0 in the old style
> + * single region, whose size is induced from the number of vcpus)
> + *
> + * Return 0 on success, < 0 otherwise
> + */
> +static int vgic_v3_insert_redist_region(struct kvm *kvm, uint32_t index,
> + gpa_t base, uint32_t count)
>  {
> - struct vgic_dist *vgic = &kvm->arch.vgic;
> + struct vgic_dist *d = &kvm->arch.vgic;
>   struct vgic_redist_region *rdreg;
> + struct list_head *rd_regions = &d->rd_regions;
> + struct list_head *last = rd_regions->prev;
> +

nit: extra blank line?

> + gpa_t new_start, new_end;
> + size_t size = count * KVM_VGIC_V3_REDIST_SIZE;
>   int ret;
>  
> - /* vgic_check_ioaddr makes sure we don't do this twice */
> - if (!list_empty(&vgic->rd_regions))
> + /* single rdist region already set ?*/
> + if (!count && !list_empty(rd_regions))
> + return -EINVAL;
> +
> + /* cross the end of memory ? */
> + if (base + size < base)
> + return -EINVAL;

what is the size of memory?  This seems to check for a gpa_t overflow,
but not against the IPA space of the VM...

> +
> + if (list_empty(rd_regions)) {
> + if (index != 0)
> + return -EINVAL;

note, I think this can be simplified if we can rid of the index.

> + } else {
> + rdreg = list_entry(last, struct vgic_redist_region, list);

you can use list_last_entry here and get rid of the 'last' temporary
variable above.

> + if (index != rdreg->index + 1)
> + return -EINVAL;
> +
> + /* Cannot add an explicitly sized regions after legacy region */
> + if (!rdreg->count)
> + return -EINVAL;
> + }
> +
> + /*
> +  * collision with already set dist region ?
> +  * this assumes we know the size of the new rdist region (pfns != 0)
> +  * otherwise we can only test this when all vcpus are registered
> +  */

I don't really understand this commentary... :(

> + if (!count && !IS_VGIC_ADDR_UNDEF(d->vgic_dist_base) &&
> + (!(d->vgic_dist_base + KVM_VGIC_V3_DIST_SIZE <= base)) &&
> + (!(base + size <= d->vgic_dist_base)))
> + return -EINVAL;

Can't you call vgic_v3_check_base() here instead?

> +
> + /* collision with any other rdist region? */
> + if (vgic_v3_rdist_overlap(kvm, base, size))
>   return -EINVAL;
>  
>   rdreg = kzalloc(sizeof(*rdreg), GFP_KERNEL);
> @@ -696,17 +748,32 @@ int vgic_v3_set_redist_base(struct kvm *kvm, u64 addr)
>  
>   rdreg->base = VGIC_ADDR_UNDEF;
>  
> - ret = vgic_check_ioaddr(kvm, &rdreg->base, addr, SZ_64K);
> + ret = vgic_check_ioaddr(kvm, &rdreg->base, base, SZ_64K);
>   if (ret)
> - goto out;
> + goto free;
>  
> - rdreg->base = addr;
> - if (!vgic_v3_check_base(kvm)) {
> - ret = -EINVAL;
> - goto out;
> - }
> + rdreg->base = base;
> + 
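
The registration-order rules being reviewed here (indices contiguous, starting at 0, and no explicitly sized region after a legacy one) can be condensed into a small userspace predicate — a hedged model of the checks, not the kernel code:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of vgic_v3_insert_redist_region()'s index rules: the first
 * region must have index 0; each subsequent region's index must be
 * last_index + 1, and may not follow a legacy (count == 0) region.
 */
static bool index_ok(bool list_empty, uint32_t last_index,
		     uint32_t last_count, uint32_t new_index)
{
	if (list_empty)
		return new_index == 0;
	if (!last_count)        /* cannot add after a legacy region */
		return false;
	return new_index == last_index + 1;
}
```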

Re: [PATCH v3 02/12] KVM: arm/arm64: Document KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION

2018-04-24 Thread Christoffer Dall
On Fri, Apr 13, 2018 at 10:20:48AM +0200, Eric Auger wrote:
> We introduce a new KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION attribute in
> KVM_DEV_ARM_VGIC_GRP_ADDR group. It allows userspace to provide the
> base address and size of a redistributor region
> 
> Compared to KVM_VGIC_V3_ADDR_TYPE_REDIST, this new attribute allows
> to declare several separate redistributor regions.
> 
> So the whole redist space does not need to be contiguous anymore.
> 
> Signed-off-by: Eric Auger 
> ---
>  Documentation/virtual/kvm/devices/arm-vgic-v3.txt | 25 ++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/arm-vgic-v3.txt b/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> index 9293b45..cbc4328 100644
> --- a/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> +++ b/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> @@ -27,9 +27,32 @@ Groups:
>VCPU and all of the redistributor pages are contiguous.
>Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
>This address needs to be 64K aligned.
> +
> +KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION (rw, 64-bit)
> +  The attr field of kvm_device_attr encodes 3 values:
> +  bits:     | 63  ....  52 |  51  ....  16 | 15 - 12  | 11 - 0
> +  values:   | count  |   base  |  flags   | index
> +  - index encodes the unique redistributor region index

I'm not entirely sure I understand the purpose of the index field.
Isn't a redistributor region identified uniquely by its base address?

Otherwise this looks good.

Thanks,
-Christoffer


> +  - flags: reserved for future use, currently 0
> +  - base field encodes bits [51:16] of the guest physical base address
> +of the first redistributor in the region.
> +  - count encodes the number of redistributors in the region. Must be
> +greater than 0.
> +  There are two 64K pages for each redistributor in the region and
> +  redistributors are laid out contiguously within the region. Regions
> +  are filled with redistributors in the index order. The sum of all
> +  region count fields must be greater than or equal to the number of
> +  VCPUs. Redistributor regions must be registered in the incremental
> +  index order, starting from index 0.
> +  Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
> +
> +  It is invalid to mix calls with KVM_VGIC_V3_ADDR_TYPE_REDIST and
> +  KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION attributes.
> +
>Errors:
>  -E2BIG:  Address outside of addressable IPA range
> --EINVAL: Incorrectly aligned address
> +-EINVAL: Incorrectly aligned address, bad redistributor region
> + count/index, mixed redistributor region attribute usage
>  -EEXIST: Address already configured
>  -ENXIO:  The group or attribute is unknown/unsupported for this device
>   or hardware support is missing.
> -- 
> 2.5.5
> 


[PATCH] MAINTAINERS: Update e-mail address for Christoffer Dall

2018-04-16 Thread Christoffer Dall
Update my e-mail address to a working address.

Signed-off-by: Christoffer Dall 
---
 MAINTAINERS | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 0a1410d5a621..3e9c99d2620b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7738,7 +7738,7 @@ F:arch/x86/include/asm/svm.h
 F: arch/x86/kvm/svm.c
 
 KERNEL VIRTUAL MACHINE FOR ARM (KVM/arm)
-M: Christoffer Dall 
+M: Christoffer Dall 
 M: Marc Zyngier 
 L: linux-arm-ker...@lists.infradead.org (moderated for non-subscribers)
 L: kvm...@lists.cs.columbia.edu
@@ -7752,7 +7752,7 @@ F:virt/kvm/arm/
 F: include/kvm/arm_*
 
 KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64)
-M: Christoffer Dall 
+M: Christoffer Dall 
 M: Marc Zyngier 
 L: linux-arm-ker...@lists.infradead.org (moderated for non-subscribers)
 L: kvm...@lists.cs.columbia.edu
-- 
2.14.2



Re: [RFC v2 02/12] KVM: arm/arm64: Document KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION

2018-04-09 Thread Christoffer Dall
Hi Eric,

On Tue, Mar 27, 2018 at 04:04:06PM +0200, Eric Auger wrote:
> We introduce a new KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION attribute in
> KVM_DEV_ARM_VGIC_GRP_ADDR group. It allows userspace to provide the
> base address and size of a redistributor region
> 
> Compared to KVM_VGIC_V3_ADDR_TYPE_REDIST, this new attribute allows
> to declare several separate redistributor regions.
> 
> So the whole redist space does not need to be contiguous anymore.
> 
> Signed-off-by: Eric Auger 
> ---
>  Documentation/virtual/kvm/devices/arm-vgic-v3.txt | 18 ++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/Documentation/virtual/kvm/devices/arm-vgic-v3.txt b/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> index 9293b45..0ded904 100644
> --- a/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> +++ b/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> @@ -27,6 +27,24 @@ Groups:
>VCPU and all of the redistributor pages are contiguous.
>Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
>This address needs to be 64K aligned.
> +
> +KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION (rw, 64-bit)
> +  The attr field of kvm_device_attr encodes 3 values:
> +  bits:     | 63  ....  52 |  51  ....  16 | 15 - 12  | 11 - 0
> +  values:   | count  |   base  |  flags   | index
> +  - index encodes the unique redistributor region index
> +  - flags: reserved for future use, currently 0
> +  - base field encodes bits [51:16] of the guest physical base address
> +of the first redistributor in the region. There are two 64K pages
> +for each VCPU and all of the redistributor pages are contiguous

should this be two 64K pages for the number of redistributors in this
region as specified by count ?

> +within the redistributor region.
> +  - count encodes the number of redistributors in the region.

I assume it's implied that the user must register a total number of
redistributors across all the regions that matches the number of vcpus,
and that otherwise something bad happens?

> +  Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
> +
> +  It is invalid to mix calls with KVM_VGIC_V3_ADDR_TYPE_REDIST and
> +  KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION attributes. When attempted an
> +  -EINVAL error is returned.
> +
>Errors:
>  -E2BIG:  Address outside of addressable IPA range
>  -EINVAL: Incorrectly aligned address
> -- 
> 2.5.5
> 

Thanks,
-Christoffer


Re: [PATCHv2 09/12] arm64/kvm: preserve host HCR_EL2 value

2018-04-09 Thread Christoffer Dall
On Mon, Apr 09, 2018 at 03:57:09PM +0100, Mark Rutland wrote:
> On Tue, Feb 06, 2018 at 01:39:15PM +0100, Christoffer Dall wrote:
> > On Mon, Nov 27, 2017 at 04:38:03PM +, Mark Rutland wrote:
> > > diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> > > index 525c01f48867..2205f0be3ced 100644
> > > --- a/arch/arm64/kvm/hyp/switch.c
> > > +++ b/arch/arm64/kvm/hyp/switch.c
> > > @@ -71,6 +71,8 @@ static void __hyp_text __activate_traps(struct kvm_vcpu 
> > > *vcpu)
> > >  {
> > >   u64 val;
> > >  
> > > + vcpu->arch.host_hcr_el2 = read_sysreg(hcr_el2);
> > > +
> > 
> > Looking back at this, it seems excessive to switch this at every
> > round-trip.  I think it should be possible to have this as a single
> > global (or per-CPU) variable that gets restored directly when returning
> > from the VM.
> 
> I suspect this needs to be per-cpu, to account for heterogeneous
> systems.
> 
> I guess if we move hcr_el2 into kvm_cpu_context, that gives us a
> per-vcpu copy for guests, and a per-cpu copy for the host (in the global
> kvm_host_cpu_state).
> 
> I'll have a look at how gnarly that turns out. I'm not sure how we can
> initialise that sanely for the !VHE case to match whatever el2_setup
> did.

There's no harm in jumping down to EL2 to read a register during the
initialization phase.  All it requires is an annotation of the callee
function, and a kvm_call_hyp(), and it's actually quite fast unless you
start saving/restoring a bunch of additional system registers.  See how
we call __kvm_set_tpidr_el2() for example.

Thanks,
-Christoffer


Re: [REPOST PATCH] arm/arm64: KVM: Add PSCI version selection API

2018-04-09 Thread Christoffer Dall
On Mon, Apr 09, 2018 at 01:47:50PM +0100, Marc Zyngier wrote:
> +Drew, who's look at the whole save/restore thing extensively
> 
> On 09/04/18 13:30, Christoffer Dall wrote:
> > On Thu, Mar 15, 2018 at 07:26:48PM +, Marc Zyngier wrote:
> >> On 15/03/18 19:13, Peter Maydell wrote:
> >>> On 15 March 2018 at 19:00, Marc Zyngier  wrote:
> >>>> On 06/03/18 09:21, Andrew Jones wrote:
> >>>>> On Mon, Mar 05, 2018 at 04:47:55PM +, Peter Maydell wrote:
> >>>>>> On 2 March 2018 at 11:11, Marc Zyngier  wrote:
> >>>>>>> On Fri, 02 Mar 2018 10:44:48 +,
> >>>>>>> Auger Eric wrote:
> >>>>>>>> I understand the get/set is called as part of the migration process.
> >>>>>>>> So my understanding is the benefit of this series is migration fails 
> >>>>>>>> in
> >>>>>>>> those cases:
> >>>>>>>>
> >>>>>>>> >=0.2 source -> 0.1 destination
> >>>>>>>> 0.1 source -> >=0.2 destination
> >>>>>>>
> >>>>>>> It also fails in the case where you migrate a 1.0 guest to something
> >>>>>>> that cannot support it.
> >>>>>>
> >>>>>> I think it would be useful if we could write out the various
> >>>>>> combinations of source, destination and what we expect/want to
> >>>>>> have happen. My gut feeling here is that we're sacrificing
> >>>>>> exact migration compatibility in favour of having the guest
> >>>>>> automatically get the variant-2 mitigations, but it's not clear
> >>>>>> to me exactly which migration combinations that's intended to
> >>>>>> happen for. Marc?
> >>>>>>
> >>>>>> If this wasn't a mitigation issue the desired behaviour would be
> >>>>>> straightforward:
> >>>>>>  * kernel should default to 0.2 on the basis that
> >>>>>>that's what it did before
> >>>>>>  * new QEMU version should enable 1.0 by default for virt-2.12
> >>>>>>and 0.2 for virt-2.11 and earlier
> >>>>>>  * PSCI version info shouldn't appear in migration stream unless
> >>>>>>it's something other than 0.2
> >>>>>> But that would leave some setups (which?) unnecessarily without the
> >>>>>> mitigation, so we're not doing that. The question is, exactly
> >>>>>> what *are* we aiming for?
> >>>>>
> >>>>> The reason Marc dropped this patch from the series it was first 
> >>>>> introduced
> >>>>> in was because we didn't have the aim 100% understood. We want the
> >>>>> mitigation by default, but also to have the least chance of migration
> >>>>> failure, and when we must fail (because we're not doing the
> >>>>> straightforward approach listed above, which would prevent failures), 
> >>>>> then
> >>>>> we want to fail with the least amount of damage to the user.
> >>>>>
> >>>>> I experimented with a couple different approaches and provided tables[1]
> >>>>> with my results. I even recommended an approach, but I may have changed
> >>>>> my mind after reading Marc's follow-up[2]. The thread continues from
> >>>>> there as well with follow-ups from Christoffer, Marc, and myself. 
> >>>>> Anyway,
> >>>>> Marc did this repost for us to debate it and work out the best approach
> >>>>> here.
> >>>> It doesn't look like we've made much progress on this, which makes me
> >>>> think that we probably don't need anything of the like.
> >>>
> >>> I was waiting for a better explanation from you of what we're trying to
> >>> achieve. If you want to take the "do nothing" approach then a list
> >>> also of what migrations succeed/fail/break in that case would also
> >>> be useful.
> >>>
> >>> (I am somewhat lazily trying to avoid having to spend time reverse
> >>> engineering the "what are we trying to do and what effects are
> >>> we accepting" parts from the patch and the code that's already gone
>

Re: [PATCHv2 10/12] arm64/kvm: context-switch ptrauth registers

2018-04-09 Thread Christoffer Dall
Hi Mark,

[Sorry for late reply]

On Fri, Mar 09, 2018 at 02:28:38PM +, Mark Rutland wrote:
> On Tue, Feb 06, 2018 at 01:38:47PM +0100, Christoffer Dall wrote:
> > On Mon, Nov 27, 2017 at 04:38:04PM +, Mark Rutland wrote:
> > > When pointer authentication is supported, a guest may wish to use it.
> > > This patch adds the necessary KVM infrastructure for this to work, with
> > > a semi-lazy context switch of the pointer auth state.
> > > 
> > > When we schedule a vcpu, 
> > 
> > That's not quite what the code does, the code only does this when we
> > schedule back a preempted or blocked vcpu thread.
> 
> Does that only leave the case of the vCPU being scheduled for the first
> time? Or am I missing something else?
> 
> [...]

In the current patch, you're only calling kvm_arm_vcpu_ptrauth_disable()
from kvm_arch_sched_in() which is only called on the preempt notifier
path, which leaves out every time we enter the guest from userspace and
therefore also the initial run of the vCPU (assuming there's no
preemption in the kernel prior to running the first time).

vcpu_load() takes care of all the cases.

> 

[...]

> > 
> > I still find this decision to begin trapping again quite arbitrary, and
> > would at least prefer this to be in vcpu_load (which would make the
> > behavior match the commit text as well).
> 
> Sure, done.
> 
> > My expectation would be that if a guest is running software with pointer
> > authentication enabled, then it's likely to either keep using the
> > feature, or not use it at all, so I would make this a one-time flag.
> 
> I think it's likely that some applications will use ptrauth while others
> do not. Even if the guest OS supports ptrauth, KVM may repeatedly preempt
> an application that doesn't use it, and we'd win in that case.
> 
> There are also some rarer cases, like kexec in a guest from a
> ptrauth-aware kernel to a ptrauth-oblivious one.
> 
> I don't have strong feelings either way, and I have no data.
> 

I think your intuition sounds sane, and let's reset the flag on every
vcpu_load, and we can always revisit when we have hardware and data if
someone reports a performance issue.

Thanks,
-Christoffer


Re: [REPOST PATCH] arm/arm64: KVM: Add PSCI version selection API

2018-04-09 Thread Christoffer Dall
ion. That's quite a big deal.
> 
> (3, not related to migration) A guest having a hardcoded knowledge of
> PSCI 0.2 will se that we've changed something, and may decide to catch
> fire. Oh well.
> 
> If we take this patch:
> 
> (1) still exists

No problem, IMHO.

> 
> (2) will now fail to migrate. I see this as a feature.

Yes, I agree.  This is actually the most important reason for doing
anything beyond what's already merged.

> 
> (3) can be worked around by setting the "PSCI version pseudo register"
> to 0.2.

Nice to have, but we're probably not expecting this to be of major
concern.  I initially thought it was a nice debugging feature as well,
but that may be a ridiculous point.

> 
> These are the main things I can think of at the moment.

So I think we we should merge this patch.

If userspace then wants to support "migrate from explicitly set v0.2 new
kernel to old kernel", then it must add specific support to filter out
the register from the register list; not that I think anyone will need
that or bother to implement it.

In other words, I think you should merge this:

Reviewed-by: Christoffer Dall 


Re: [RFC PATCH] KVM: arm/arm64: vgic: change condition for level interrupt resampling

2018-03-10 Thread Christoffer Dall
On Sat, Mar 10, 2018 at 12:20 PM, Marc Zyngier  wrote:
> On Fri, 09 Mar 2018 21:36:12 +,
> Christoffer Dall wrote:
>>
>> On Thu, Mar 08, 2018 at 05:28:44PM +, Marc Zyngier wrote:
>> > I'd be more confident if we did forbid P+A for such interrupts
>> > altogether, as they really feel like another kind of HW interrupt.
>>
>> How about a slightly bigger hammer:  Can we avoid doing P+A for level
>> interrupts completely?  I don't think that really makes much sense, and
>> I think we simplify everything if we just come back out and resample the
>> line.  For an edge, something like a network card, there's a potential
>> performance win to appending a new pending state, but I doubt that this
>> is the case for level interrupts.
>
> I started implementing the same thing yesterday. Somehow, it feels
> slightly better to have the same flow for all level interrupts,
> including the timer, and we only use the MI on EOI as a way to trigger
> the next state of injection. Still testing, but looking good so far.
>
> I'm still puzzled that we have this level-but-not-quite behaviour for
> VFIO interrupts. At some point, it is going to bite us badly.
>

Where is the departure from level-triggered behavior with VFIO?  As
far as I can tell, the GIC flow of these interrupts is just that of a
level interrupt; we only need to make sure the resamplefd mechanism is
supported for both types of interrupts.  Whether or not that's a
decent mechanism seems orthogonal to me, but that's a discussion for
another day I think.

Thanks,
-Christoffer


Re: [RFC PATCH] KVM: arm/arm64: vgic: change condition for level interrupt resampling

2018-03-09 Thread Christoffer Dall
On Thu, Mar 08, 2018 at 05:28:44PM +, Marc Zyngier wrote:
> On Thu, 08 Mar 2018 16:19:00 +,
> Christoffer Dall wrote:
> > 
> > On Thu, Mar 08, 2018 at 11:54:27AM +, Marc Zyngier wrote:
> > > On 08/03/18 09:49, Marc Zyngier wrote:

[...]

> > > The state is now pending, we've really EOI'd the interrupt, and
> > > yet lr_signals_eoi_mi() returns false, since the state is not 0.
> > > The result is that we won't signal anything on the corresponding
> > > irqfd, which people complain about. Meh.
> > 
> > So the core of the problem is that when we've entered the guest with
> > PENDING+ACTIVE and when we exit (for some reason) we don't signal the
> > resamplefd, right?  The solution seems to me that we don't ever do
> > PENDING+ACTIVE if you need to resample after each deactivate.  What
> > would be the point of appending a pending state that you only know to be
> > valid after a resample anyway?
> 
> The question is then to identify that a given source needs to be
> signalled back to VFIO. Calling into the eventfd code on the hot path
> is pretty horrid (I'm not sure if we can really call into this with
> interrupts disabled, for example).
> 

This feels like a bad layering violation to me as well.

> > 
> > > 
> > > Example 2:
> > > P+A -> guest EOI -> P -> delayed MI -> guest IAR -> A -> MI fires
> > 
> > We could be more clever and do the following calculation on every exit:
> > 
> > If you enter with P, and exit with either A or 0, then signal.
> > 
> > If you enter with P+A, and you exit with either P, A, or 0, then signal.
> > 
> > Wouldn't that also solve it?  (Although I have a feeling you'd miss some
> > exits in this case).
> 
> I'd be more confident if we did forbid P+A for such interrupts
> altogether, as they really feel like another kind of HW interrupt.

How about a slightly bigger hammer:  Can we avoid doing P+A for level
interrupts completely?  I don't think that really makes much sense, and
I think we simplify everything if we just come back out and resample the
line.  For an edge, something like a network card, there's a potential
performance win to appending a new pending state, but I doubt that this
is the case for level interrupts.

The timer would be unaffected, because it's a HW interrupt.

Thanks,
-Christoffer
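The per-exit rule quoted earlier in this thread ("enter with P, exit with A or
0, then signal; enter with P+A, exit with P, A, or 0, then signal") can be
expressed as a small pure function. This is only a sketch of the rule as
stated in the discussion; the state encoding and helper name are illustrative,
not KVM's actual LR handling:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified LR state values, mirroring the GIC LR state field:
 * 0 = invalid, 1 = pending, 2 = active, 3 = pending+active. */
enum lr_state {
	LR_INVALID        = 0,
	LR_PENDING        = 1,
	LR_ACTIVE         = 2,
	LR_PENDING_ACTIVE = 3,
};

/* Signal the resamplefd when the guest has consumed the pending state
 * we entered the guest with, per the rule discussed above. */
bool should_signal_resample(enum lr_state entry_state, enum lr_state exit_state)
{
	if (entry_state == LR_PENDING)
		return exit_state == LR_ACTIVE || exit_state == LR_INVALID;
	if (entry_state == LR_PENDING_ACTIVE)
		return exit_state != LR_PENDING_ACTIVE;
	return false;
}
```

As the thread notes, a transition such as P+A -> P -> A between exits could
still be missed by any scheme that only samples the LR at entry and exit.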


Re: [RFC PATCH] KVM: arm/arm64: vgic: change condition for level interrupt resampling

2018-03-08 Thread Christoffer Dall
On Thu, Mar 08, 2018 at 11:54:27AM +, Marc Zyngier wrote:
> On 08/03/18 09:49, Marc Zyngier wrote:
> > [updated Christoffer's email address]
> > 
> > Hi Shunyong,
> > 
> > On 08/03/18 07:01, Shunyong Yang wrote:
> >> When resampling irqfds is enabled, level interrupt should be
> >> de-asserted when resampling happens. On page 4-47 of GIC v3
> >> specification IHI0069D, it said,
> >> "When the PE acknowledges an SGI, a PPI, or an SPI at the CPU
> >> interface, the IRI changes the status of the interrupt to active
> >> and pending if:
> >> • It is an edge-triggered interrupt, and another edge has been
> >> detected since the interrupt was acknowledged.
> >> • It is a level-sensitive interrupt, and the level has not been
> >> deasserted since the interrupt was acknowledged."
> >>
> >> GIC v2 specification IHI0048B.b has similar description on page
> >> 3-42 for state machine transition.
> >>
> >> When some VFIO device, like mtty(8250 VFIO mdev emulation driver
> >> in samples/vfio-mdev) triggers a level interrupt, the status
> >> transition in LR is pending-->active-->active and pending.
> >> Then it will wait resampling to de-assert the interrupt.
> >>
> >> Current design of lr_signals_eoi_mi() will return false if state
> >> in LR is not invalid(Inactive). It causes resampling will not happen
> >> in mtty case.
> > 
> > Let me rephrase this, and tell me if I understood it correctly:
> > 
> > - A level interrupt is injected, activated by the guest (LR state=active)
> > - guest exits, re-enters, (LR state=pending+active)
> > - guest EOIs the interrupt (LR state=pending)
> > - maintenance interrupt
> > - we don't signal the resampling because we're not in an invalid state
> > 
> > Is that correct?
> > 
> > That's an interesting case, because it seems to invalidate some of the 
> > optimization that went in over a year ago.
> > 
> > 096f31c4360f KVM: arm/arm64: vgic: Get rid of MISR and EISR fields
> > b6095b084d87 KVM: arm/arm64: vgic: Get rid of unnecessary 
> > save_maint_int_state
> > af0614991ab6 KVM: arm/arm64: vgic: Get rid of unnecessary 
> > process_maintenance operation
> > 
> > We could compare the value of the LR before the guest entry with
> > the value at exit time, but we still could miss it if we have a
> > transition such as P+A -> P -> A and assume a long enough propagation
> > delay for the maintenance interrupt (which is very likely).
> > 
> > In essence, we have lost the benefit of EISR, which was to give us a
> > way to deal with asynchronous signalling.
> > 
> >>
> >> This will cause interrupt fired continuously to guest even 8250 IIR
> >> has no interrupt. When 8250's interrupt is configured in shared mode,
> >> it will pass interrupt to other drivers to handle. However, there
> >> is no other driver involved. Then, a "nobody cared" kernel complaint
> >> occurs.
> >>
> >> / # cat /dev/ttyS0
> >> [4.826836] random: crng init done
> >> [6.373620] irq 41: nobody cared (try booting with the "irqpoll"
> >> option)
> >> [6.376414] CPU: 0 PID: 1307 Comm: cat Not tainted 4.16.0-rc4 #4
> >> [6.378927] Hardware name: linux,dummy-virt (DT)
> >> [6.380876] Call trace:
> >> [6.381937]  dump_backtrace+0x0/0x180
> >> [6.383495]  show_stack+0x14/0x1c
> >> [6.384902]  dump_stack+0x90/0xb4
> >> [6.386312]  __report_bad_irq+0x38/0xe0
> >> [6.387944]  note_interrupt+0x1f4/0x2b8
> >> [6.389568]  handle_irq_event_percpu+0x54/0x7c
> >> [6.391433]  handle_irq_event+0x44/0x74
> >> [6.393056]  handle_fasteoi_irq+0x9c/0x154
> >> [6.394784]  generic_handle_irq+0x24/0x38
> >> [6.396483]  __handle_domain_irq+0x60/0xb4
> >> [6.398207]  gic_handle_irq+0x98/0x1b0
> >> [6.399796]  el1_irq+0xb0/0x128
> >> [6.401138]  _raw_spin_unlock_irqrestore+0x18/0x40
> >> [6.403149]  __setup_irq+0x41c/0x678
> >> [6.404669]  request_threaded_irq+0xe0/0x190
> >> [6.406474]  univ8250_setup_irq+0x208/0x234
> >> [6.408250]  serial8250_do_startup+0x1b4/0x754
> >> [6.410123]  serial8250_startup+0x20/0x28
> >> [6.411826]  uart_startup.part.21+0x78/0x144
> >> [6.413633]  uart_port_activate+0x50/0x68
> >> [6.415328]  tty_port_open+0x84/0xd4
> >> [6.416851]  uart_open+0x34/0x44
> >> [6.418229]  tty_open+0xec/0x3c8
> >> [6.419610]  chrdev_open+0xb0/0x198
> >> [6.421093]  do_dentry_open+0x200/0x310
> >> [6.422714]  vfs_open+0x54/0x84
> >> [6.424054]  path_openat+0x2dc/0xf04
> >> [6.425569]  do_filp_open+0x68/0xd8
> >> [6.427044]  do_sys_open+0x16c/0x224
> >> [6.428563]  SyS_openat+0x10/0x18
> >> [6.429972]  el0_svc_naked+0x30/0x34
> >> [6.431494] handlers:
> >> [6.432479] [<0e9fb4bb>] serial8250_interrupt
> >> [6.434597] Disabling IRQ #41
> >>
> >> This patch changes the lr state condition in lr_signals_eoi_mi() from
> >> invalid(Inactive) to active and pending to avoid this.
> >>
> >> I am not sure about the original design of the condition of
> >> invalid(active). So, This RFC is 

Re: [RFC PATCH] KVM: arm/arm64: vgic: change condition for level interrupt resampling

2018-03-08 Thread Christoffer Dall
On Thu, Mar 08, 2018 at 09:49:43AM +, Marc Zyngier wrote:
> [updated Christoffer's email address]
> 
> Hi Shunyong,
> 
> On 08/03/18 07:01, Shunyong Yang wrote:
> > When resampling irqfds is enabled, level interrupt should be
> > de-asserted when resampling happens. On page 4-47 of GIC v3
> > specification IHI0069D, it said,
> > "When the PE acknowledges an SGI, a PPI, or an SPI at the CPU
> > interface, the IRI changes the status of the interrupt to active
> > and pending if:
> > • It is an edge-triggered interrupt, and another edge has been
> > detected since the interrupt was acknowledged.
> > • It is a level-sensitive interrupt, and the level has not been
> > deasserted since the interrupt was acknowledged."
> > 
> > GIC v2 specification IHI0048B.b has similar description on page
> > 3-42 for state machine transition.
> > 
> > When some VFIO device, like mtty(8250 VFIO mdev emulation driver
> > in samples/vfio-mdev) triggers a level interrupt, the status
> > transition in LR is pending-->active-->active and pending.
> > Then it will wait resampling to de-assert the interrupt.
> > 
> > Current design of lr_signals_eoi_mi() will return false if state
> > in LR is not invalid(Inactive). It causes resampling will not happen
> > in mtty case.
> 
> Let me rephrase this, and tell me if I understood it correctly:
> 
> - A level interrupt is injected, activated by the guest (LR state=active)
> - guest exits, re-enters, (LR state=pending+active)
> - guest EOIs the interrupt (LR state=pending)
> - maintenance interrupt
> - we don't signal the resampling because we're not in an invalid state
> 
> Is that correct?
> 
> That's an interesting case, because it seems to invalidate some of the 
> optimization that went in over a year ago.
> 
> 096f31c4360f KVM: arm/arm64: vgic: Get rid of MISR and EISR fields
> b6095b084d87 KVM: arm/arm64: vgic: Get rid of unnecessary save_maint_int_state
> af0614991ab6 KVM: arm/arm64: vgic: Get rid of unnecessary process_maintenance 
> operation
> 
> We could compare the value of the LR before the guest entry with
> the value at exit time, but we still could miss it if we have a
> transition such as P+A -> P -> A and assume a long enough propagation
> delay for the maintenance interrupt (which is very likely).
> 
> In essence, we have lost the benefit of EISR, which was to give us a
> way to deal with asynchronous signalling.
> 

I don't understand why EISR gives us anything beyond looking at the LR
and evaluating if the state is 00.  My reading of the spec is that the
EISR is merely a shortcut to knowing the state of the LRs but contains
no record or information beyond what you can read from the LRs.

What am I missing?

Thanks,
-Christoffer
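The point made above — that EISR is merely a shortcut, fully recoverable from
the LRs themselves — can be illustrated with a simplified model: one bit per
LR that is now invalid (state 00) and requested EOI maintenance. The field
layout below is illustrative only, not the architectural LR format:

```c
#include <assert.h>
#include <stdint.h>

#define LR_STATE_MASK 0x3u /* illustrative: two state bits per LR */
#define LR_EOI        0x4u /* illustrative: EOI-maintenance request bit */

/* Compute an EISR-equivalent bitmask purely from the LR contents:
 * a bit is set for each LR whose state is invalid (00) and which
 * asked for an EOI maintenance interrupt. */
uint32_t compute_eisr(const uint32_t *lr, int nr_lrs)
{
	uint32_t eisr = 0;

	for (int i = 0; i < nr_lrs; i++)
		if ((lr[i] & LR_STATE_MASK) == 0 && (lr[i] & LR_EOI))
			eisr |= 1u << i;
	return eisr;
}
```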


Re: [PATCH] KVM: arm/arm64: No need to zero CNTVOFF in kvm_timer_vcpu_put() for VHE

2018-02-22 Thread Christoffer Dall
Hi Shanker,

On Mon, Feb 19, 2018 at 09:38:07AM -0600, Shanker Donthineni wrote:
> In AArch64/AArch32, the virtual counter uses a fixed virtual offset
> of zero in the following situations as per ARMv8 specifications:
> 
> 1) HCR_EL2.E2H is 1, and CNTVCT_EL0/CNTVCT are read from EL2.
> 2) HCR_EL2.{E2H, TGE} is {1, 1}, and either:
>— CNTVCT_EL0 is read from Non-secure EL0 or EL2.
>— CNTVCT is read from Non-secure EL0.
> 
> So, no need to zero CNTVOFF_EL2/CNTVOFF for VHE case.
> 
> Signed-off-by: Shanker Donthineni 
> ---
>  virt/kvm/arm/arch_timer.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
> index 70268c0..86eca324 100644
> --- a/virt/kvm/arm/arch_timer.c
> +++ b/virt/kvm/arm/arch_timer.c
> @@ -541,9 +541,11 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
>* The kernel may decide to run userspace after calling vcpu_put, so
>* we reset cntvoff to 0 to ensure a consistent read between user
>* accesses to the virtual counter and kernel access to the physical
> -  * counter.
> +  * counter of non-VHE case. For VHE, the virtual counter uses a fixed
> +  * virtual offset of zero, so no need to zero CNTVOFF_EL2 register.
>*/
> - set_cntvoff(0);
> + if (!has_vhe())
> + set_cntvoff(0);
>  }
>  
>  /*

I'm okay with this change.  I don't think there's a huge gain here
though.

Marc, any thoughts or concerns?

Thanks,
-Christoffer
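The architectural relationship behind this discussion is CNTVCT = CNTPCT -
CNTVOFF, with the offset treated as a fixed zero in the VHE cases listed in
the commit message. A toy model of that behavior, with made-up helper names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: the virtual counter is the physical counter minus CNTVOFF,
 * except that under VHE (HCR_EL2.E2H == 1, for the read contexts listed
 * above) the architecture applies a fixed virtual offset of zero, so the
 * stored CNTVOFF value is simply ignored. */
uint64_t read_cntvct(uint64_t cntpct, uint64_t cntvoff, bool vhe)
{
	return cntpct - (vhe ? 0 : cntvoff);
}
```

This is why zeroing CNTVOFF on vcpu_put buys nothing on VHE hosts: the
register value never reaches the counter reads in question.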


Re: [PATCH] KVM: arm: Enable emulation of the physical timer

2018-02-16 Thread Christoffer Dall
On Tue, Feb 13, 2018 at 11:41:16AM +0100, Jérémy Fanguède wrote:
> Set the handlers to emulate read and write operations for CNTP_CTL,
> CNTP_CVAL and CNTP_TVAL registers in such a way that VMs can use the
> physical timer.
> 
> Signed-off-by: Jérémy Fanguède 
> ---
> 
> This patch is the equivalent of this one: [1], but for arm 32bits
> instead of ARMv8 aarch32.
> 
> [1] https://patchwork.kernel.org/patch/10207019/
> 

Thanks, both queued.

-Christoffer

> ---
>  arch/arm/kvm/coproc.c | 61 
> +++
>  1 file changed, 61 insertions(+)
> 
> diff --git a/arch/arm/kvm/coproc.c b/arch/arm/kvm/coproc.c
> index 6d1d2e2..3a02e76 100644
> --- a/arch/arm/kvm/coproc.c
> +++ b/arch/arm/kvm/coproc.c
> @@ -270,6 +270,60 @@ static bool access_gic_sre(struct kvm_vcpu *vcpu,
>   return true;
>  }
>  
> +static bool access_cntp_tval(struct kvm_vcpu *vcpu,
> +  const struct coproc_params *p,
> +  const struct coproc_reg *r)
> +{
> + u64 now = kvm_phys_timer_read();
> + u64 val;
> +
> + if (p->is_write) {
> + val = *vcpu_reg(vcpu, p->Rt1);
> + kvm_arm_timer_set_reg(vcpu, KVM_REG_ARM_PTIMER_CVAL, val + now);
> + } else {
> + val = kvm_arm_timer_get_reg(vcpu, KVM_REG_ARM_PTIMER_CVAL);
> + *vcpu_reg(vcpu, p->Rt1) = val - now;
> + }
> +
> + return true;
> +}
> +
> +static bool access_cntp_ctl(struct kvm_vcpu *vcpu,
> + const struct coproc_params *p,
> + const struct coproc_reg *r)
> +{
> + u32 val;
> +
> + if (p->is_write) {
> + val = *vcpu_reg(vcpu, p->Rt1);
> + kvm_arm_timer_set_reg(vcpu, KVM_REG_ARM_PTIMER_CTL, val);
> + } else {
> + val = kvm_arm_timer_get_reg(vcpu, KVM_REG_ARM_PTIMER_CTL);
> + *vcpu_reg(vcpu, p->Rt1) = val;
> + }
> +
> + return true;
> +}
> +
> +static bool access_cntp_cval(struct kvm_vcpu *vcpu,
> +  const struct coproc_params *p,
> +  const struct coproc_reg *r)
> +{
> + u64 val;
> +
> + if (p->is_write) {
> + val = (u64)*vcpu_reg(vcpu, p->Rt2) << 32;
> + val |= *vcpu_reg(vcpu, p->Rt1);
> + kvm_arm_timer_set_reg(vcpu, KVM_REG_ARM_PTIMER_CVAL, val);
> + } else {
> + val = kvm_arm_timer_get_reg(vcpu, KVM_REG_ARM_PTIMER_CVAL);
> + *vcpu_reg(vcpu, p->Rt1) = val;
> + *vcpu_reg(vcpu, p->Rt2) = val >> 32;
> + }
> +
> + return true;
> +}
> +
>  /*
>   * We could trap ID_DFR0 and tell the guest we don't support performance
>   * monitoring.  Unfortunately the patch to make the kernel check ID_DFR0 was
> @@ -423,10 +477,17 @@ static const struct coproc_reg cp15_regs[] = {
>   { CRn(13), CRm( 0), Op1( 0), Op2( 4), is32,
>   NULL, reset_unknown, c13_TID_PRIV },
>  
> + /* CNTP */
> + { CRm64(14), Op1( 2), is64, access_cntp_cval},
> +
>   /* CNTKCTL: swapped by interrupt.S. */
>   { CRn(14), CRm( 1), Op1( 0), Op2( 0), is32,
>   NULL, reset_val, c14_CNTKCTL, 0x },
>  
> + /* CNTP */
> + { CRn(14), CRm( 2), Op1( 0), Op2( 0), is32, access_cntp_tval },
> + { CRn(14), CRm( 2), Op1( 0), Op2( 1), is32, access_cntp_ctl },
> +
>   /* The Configuration Base Address Register. */
>   { CRn(15), CRm( 0), Op1( 4), Op2( 0), is32, access_cbar},
>  };
> -- 
> 2.7.4
> 


Re: [PATCH v1 15/16] kvm: arm64: Allow configuring physical address space size

2018-02-09 Thread Christoffer Dall
On Thu, Feb 08, 2018 at 05:53:17PM +, Suzuki K Poulose wrote:
> On 08/02/18 11:14, Christoffer Dall wrote:
> >On Tue, Jan 09, 2018 at 07:04:10PM +, Suzuki K Poulose wrote:
> >>Allow the guests to choose a larger physical address space size.
> >>The default and minimum size is 40bits. A guest can change this
> >>right after the VM creation, but before the stage2 entry page
> >>tables are allocated (i.e, before it registers a memory range
> >>or maps a device address). The size is restricted to the maximum
> >>supported by the host. Also, the guest can only increase the PA size,
> >>from the existing value, as reducing it could break the devices which
> >>may have verified their physical address for validity and may do a
> >>lazy mapping(e.g, VGIC).
> >>
> >>Cc: Marc Zyngier 
> >>Cc: Christoffer Dall 
> >>Cc: Peter Maydell 
> >>Signed-off-by: Suzuki K Poulose 
> >>---
> >>  Documentation/virtual/kvm/api.txt | 27 ++
> >>  arch/arm/include/asm/kvm_host.h   |  7 +++
> >>  arch/arm64/include/asm/kvm_host.h |  1 +
> >>  arch/arm64/include/asm/kvm_mmu.h  | 41 
> >> ++-
> >>  arch/arm64/kvm/reset.c| 28 ++
> >>  include/uapi/linux/kvm.h  |  4 
> >>  virt/kvm/arm/arm.c|  2 +-
> >>  7 files changed, 100 insertions(+), 10 deletions(-)
> >>
> >>diff --git a/Documentation/virtual/kvm/api.txt 
> >>b/Documentation/virtual/kvm/api.txt
> >>index 57d3ee9e4bde..a203faf768c4 100644
> >>--- a/Documentation/virtual/kvm/api.txt
> >>+++ b/Documentation/virtual/kvm/api.txt
> >>@@ -3403,6 +3403,33 @@ invalid, if invalid pages are written to (e.g. after 
> >>the end of memory)
> >>  or if no page table is present for the addresses (e.g. when using
> >>  hugepages).
> >>+4.109 KVM_ARM_GET_PHYS_SHIFT
> >>+
> >>+Capability: KVM_CAP_ARM_CONFIG_PHYS_SHIFT
> >>+Architectures: arm64
> >>+Type: vm ioctl
> >>+Parameters: __u32 (out)
> >>+Returns: 0 on success, a negative value on error
> >>+
> >>+This ioctl is used to get the current maximum physical address size for
> >>+the VM. The value is Log2(Maximum_Physical_Address). This is neither the
> >>+ amount of physical memory assigned to the VM nor the maximum physical 
> >>address
> >>+that a real CPU on the host can handle. Rather, this is the upper limit of 
> >>the
> >>+guest physical address that can be used by the VM.
> >
> >What is the point of this?  Presumably if userspace has set the size, it
> >already knows the size?
> 
> This can help the userspace know what the "default" limit is. As such I am
> not particular about keeping this around.
> 

Userspace has to already know, since otherwise things don't work today,
so I think we can omit this.

> >
> >>+
> >>+4.109 KVM_ARM_SET_PHYS_SHIFT
> >>+
> >>+Capability: KVM_CAP_ARM_CONFIG_PHYS_SHIFT
> >>+Architectures: arm64
> >>+Type: vm ioctl
> >>+Parameters: __u32 (in)
> >>+Returns: 0 on success, a negative value on error
> >>+
> >>+This ioctl is used to set the maximum physical address size for
> >>+the VM. The value is Log2(Maximum_Physical_Address). The value can only
> >>+be increased from the existing setting. The value cannot be changed
> >>+after the stage-2 page tables are allocated and will return an error.
> >>+
> >
> >Is there a way for userspace to discover what the underlying hardware
> >can actually support, beyond trial-and-error on this ioctl?
> 
> Unfortunately, there is none. We don't expose ID_AA64MMFR0 via mrs emulation.
> 

We should probably think about that.  Perhaps it could be tied to the
return value of KVM_CAP_ARM_CONFIG_PHYS_SHIFT?
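For reference, the ID_AA64MMFR0_EL1.PARange field read by the quoted code
encodes the supported physical address size roughly as follows (a sketch of
the mapping defined in the Arm ARM; the helper name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* ID_AA64MMFR0_EL1.PARange value -> physical address bits.
 * Encodings 0-6 map to 32/36/40/42/44/48/52 bits; anything else
 * is treated as unknown here. */
uint32_t parange_to_phys_shift(uint32_t parange)
{
	static const uint32_t shift[] = { 32, 36, 40, 42, 44, 48, 52 };

	return parange < 7 ? shift[parange] : 0;
}
```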

> >>+static inline int kvm_reconfig_stage2(struct kvm *kvm, u32 phys_shift)
> >>+{
> >>+   int rc = 0;
> >>+   unsigned int pa_max, parange;
> >>+
> >>+   parange = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1) & 7;
> >>+   pa_max = id_aa64mmfr0_parange_to_phys_shift(parange);
> >>+   /* Raise it to 40bits for backward compatibility */
> >>+   pa_max = (pa_max < 40) ? 40 : pa_max;
> >>+   /* Make sure the size is supported/available */
> >>+   if (phys_shift > PHYS_MASK_SHIFT || phys_shift > pa_max)
> >>+   return -EINVAL;
> >>+   /*
> >>+* The stage

Re: [PATCH v1 14/16] kvm: arm64: Switch to per VM IPA

2018-02-09 Thread Christoffer Dall
On Thu, Feb 08, 2018 at 05:22:29PM +, Suzuki K Poulose wrote:
> On 08/02/18 11:00, Christoffer Dall wrote:
> >On Tue, Jan 09, 2018 at 07:04:09PM +, Suzuki K Poulose wrote:
> >>Now that we can manage the stage2 page table per VM, switch the
> >>configuration details to per VM instance. We keep track of the
> >>IPA bits, number of page table levels and the VTCR bits (which
> >>depends on the IPA and the number of levels).
> >>
> >>Cc: Marc Zyngier 
> >>Cc: Christoffer Dall 
> >>Signed-off-by: Suzuki K Poulose 
> >>---
> >>  arch/arm/include/asm/kvm_mmu.h  |  1 +
> >>  arch/arm64/include/asm/kvm_host.h   | 12 
> >>  arch/arm64/include/asm/kvm_mmu.h| 22 --
> >>  arch/arm64/include/asm/stage2_pgtable.h |  1 -
> >>  arch/arm64/kvm/hyp/switch.c |  3 +--
> >>  virt/kvm/arm/arm.c  |  2 +-
> >>  6 files changed, 35 insertions(+), 6 deletions(-)
> >>
> >>diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> >>index 440c80589453..dd592fe45660 100644
> >>--- a/arch/arm/include/asm/kvm_mmu.h
> >>+++ b/arch/arm/include/asm/kvm_mmu.h
> >>@@ -48,6 +48,7 @@
> >>  #define kvm_vttbr_baddr_mask(kvm) VTTBR_BADDR_MASK
> >>  #define stage2_pgd_size(kvm)  (PTRS_PER_S2_PGD * sizeof(pgd_t))
> >>+#define kvm_init_stage2_config(kvm)do { } while (0)
> >>  int create_hyp_mappings(void *from, void *to, pgprot_t prot);
> >>  int create_hyp_io_mappings(void *from, void *to, phys_addr_t);
> >>  void free_hyp_pgds(void);
> >>diff --git a/arch/arm64/include/asm/kvm_host.h 
> >>b/arch/arm64/include/asm/kvm_host.h
> >>index 9a9ddeb33c84..1e66e5ab3dde 100644
> >>--- a/arch/arm64/include/asm/kvm_host.h
> >>+++ b/arch/arm64/include/asm/kvm_host.h
> >>@@ -64,6 +64,18 @@ struct kvm_arch {
> >>/* VTTBR value associated with above pgd and vmid */
> >>u64vttbr;
> >>+   /* Private bits of VTCR_EL2 for this VM */
> >>+   u64vtcr_private;
> >
> >As to my comments in the previous patch, why isn't this simply u64 vtcr;
> 
> nit: I haven't received your response to the previous patch.

It got stuck in my drafts folder somehow, hopefully you received it now.

> 
> We could. I thought this gives a bit more clarity on what changes per-VM.
> 

Since there's a performance issue involved, I think it's cleaner to just
calculate the vtcr once per VM and store it.

Thanks,
-Christoffer


Re: [PATCH v1 08/16] kvm: arm/arm64: Clean up stage2 pgd life time

2018-02-09 Thread Christoffer Dall
On Thu, Feb 08, 2018 at 05:19:22PM +, Suzuki K Poulose wrote:
> On 08/02/18 11:00, Christoffer Dall wrote:
> >On Tue, Jan 09, 2018 at 07:04:03PM +, Suzuki K Poulose wrote:
> >>On arm/arm64 we pre-allocate the entry level page tables when
> >>a VM is created and is free'd when either all the mm users are
> >>gone or the KVM is about to get destroyed. i.e, kvm_free_stage2_pgd
> >>is triggered via kvm_arch_flush_shadow_all() which can be invoked
> >>from two different paths :
> >>
> >>  1) do_exit()-> .-> mmu_notifier->release()-> ..-> 
> >> kvm_arch_flush_shadow_all()
> >> OR
> >>  2) kvm_destroy_vm()-> mmu_notifier_unregister-> 
> >> kvm_arch_flush_shadow_all()
> >>
> >>This has created lot of race conditions in the past as some of
> >>the VCPUs could be active when we free the stage2 via path (1).
> >
> >How??  mmu_notifier->release() is called via __mmput->exit_mmap(), which
> >is only called if mm_users == 0, which means there are no more threads
> >left than the one currently doing exit().
> 
> IIRC, if the process is sent a fatal signal, that could cause all the threads
> to exit, leaving the "last" thread to do the clean up. The files could still
> be open, implying that the KVM fds are still active, without a stage2, even
> though we are not going to run anything. (The race was fixed by moving the
> stage2 teardown to mmu_notifier->release()).
> 
> 

Hmm, if the last thread is do_exit(), by definition there can't be any
other VCPU thread (because then it wouldn't be the last one) and
therefore only this last exiting thread can have the fd open, and since
it's in the middle of do_exit(), it will close the fds before anything
will have a chance to run.

> >
> >>
> >>On a closer look, all we need to do with kvm_arch_flush_shadow_all() is,
> >>to ensure that the stage2 mappings are cleared. This doesn't mean we
> >>have to free up the stage2 entry level page tables yet, which could
> >>be delayed until the kvm is destroyed. This would avoid issues
> >>of use-after-free,
> >
> >do we have any of those left?
> 
> None that we know of.
> 

Then I think this commit text is misleading and pretty confusing.  If
we have a correct implementation, but we want to clean something up,
then this commit message shouldn't talk about races or use-after-free,
it should just say that we rearrange code to change the flow, and
describe why/how.

> >
> >>This will be later used for delaying
> >>the allocation of the stage2 entry level page tables until we really
> >>need to do something with it.
> >
> >Fine, but you don't actually explain why this change of flow is
> >necessary for what you're trying to do later?
> 
> This patch is not mandatory for the series. But, since we are delaying
> the "allocation" stage2 tables anyway later, I thought it would be
> good to clean up the "free" path.
> 

Hmm, I'm not really sure it is a cleanup.  In any case, the motivation
for this change should be clear.  I do like the idea of getting rid of
the kvm->arch.pgd checks in the various stage2 manipulation functions.

> >>  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >>index 78253fe00fc4..c94c61ac38b9 100644
> >>--- a/virt/kvm/arm/mmu.c
> >>+++ b/virt/kvm/arm/mmu.c
> >>@@ -298,11 +298,10 @@ static void unmap_stage2_range(struct kvm *kvm, 
> >>phys_addr_t start, u64 size)
> >>pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> >>do {
> >>/*
> >>-* Make sure the page table is still active, as another thread
> >>-* could have possibly freed the page table, while we released
> >>-* the lock.
> >>+* The page table shouldn't be free'd as we still hold a reference
> >>+* to the KVM.
> >
> >To avoid confusion about references to the kernel module KVM and a
> >specific KVM VM instance, please s/KVM/VM/.
> 
> ok.
> 
> >
> >> */
> >>-   if (!READ_ONCE(kvm->arch.pgd))
> >>+   if (WARN_ON(!READ_ONCE(kvm->arch.pgd)))
> >
> >This reads a lot like a defensive implementation now, and I think for
> >this patch to make sense, we shouldn't try to handle a buggy super-racy
> >implementation gracefully, but rather have VM_BUG_ON() (if we care about
> >performance of the check) or simpl

Re: [PATCH v1 13/16] kvm: arm64: Configure VTCR per VM

2018-02-08 Thread Christoffer Dall
On Tue, Jan 09, 2018 at 07:04:08PM +, Suzuki K Poulose wrote:
> We set VTCR_EL2 very early during the stage2 init and don't
> touch it ever. This is fine as we had a fixed IPA size. This
> patch changes the behavior to set the VTCR for a given VM,
> depending on its stage2 table. The common configuration for
> VTCR is still performed during the early init. But the SL0
> and T0SZ are programmed for each VM and is cleared once we
> exit the VM.
>
> Cc: Marc Zyngier 
> Cc: Christoffer Dall 
> Signed-off-by: Suzuki K Poulose 
> ---
>  arch/arm64/include/asm/kvm_arm.h  | 16 ++--
>  arch/arm64/include/asm/kvm_asm.h  |  2 +-
>  arch/arm64/include/asm/kvm_host.h |  8 +---
>  arch/arm64/kvm/hyp/s2-setup.c | 16 +---
>  arch/arm64/kvm/hyp/switch.c   |  9 +
>  5 files changed, 22 insertions(+), 29 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_arm.h 
> b/arch/arm64/include/asm/kvm_arm.h
> index eb90d349e55f..d5c40816f073 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -115,9 +115,7 @@
>  #define VTCR_EL2_IRGN0_WBWA TCR_IRGN0_WBWA
>  #define VTCR_EL2_SL0_SHIFT 6
>  #define VTCR_EL2_SL0_MASK (3 << VTCR_EL2_SL0_SHIFT)
> -#define VTCR_EL2_SL0_LVL1 (1 << VTCR_EL2_SL0_SHIFT)
>  #define VTCR_EL2_T0SZ_MASK 0x3f
> -#define VTCR_EL2_T0SZ_40B 24
>  #define VTCR_EL2_VS_SHIFT 19
>  #define VTCR_EL2_VS_8BIT (0 << VTCR_EL2_VS_SHIFT)
>  #define VTCR_EL2_VS_16BIT (1 << VTCR_EL2_VS_SHIFT)
> @@ -139,38 +137,36 @@
>   * D4-23 and D4-25 in ARM DDI 0487A.b.
>   */
>
> -#define VTCR_EL2_T0SZ_IPA VTCR_EL2_T0SZ_40B
>  #define VTCR_EL2_COMMON_BITS (VTCR_EL2_SH0_INNER | VTCR_EL2_ORGN0_WBWA | \
>   VTCR_EL2_IRGN0_WBWA | VTCR_EL2_RES1)
> +#define VTCR_EL2_PRIVATE_MASK (VTCR_EL2_SL0_MASK | VTCR_EL2_T0SZ_MASK)
>
>  #ifdef CONFIG_ARM64_64K_PAGES
>  /*
>   * Stage2 translation configuration:
>   * 64kB pages (TG0 = 1)
> - * 2 level page tables (SL = 1)
>   */
> -#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
> +#define VTCR_EL2_TGRAN VTCR_EL2_TG0_64K
>  #define VTCR_EL2_TGRAN_SL0_BASE 3UL
>
>  #elif defined(CONFIG_ARM64_16K_PAGES)
>  /*
>   * Stage2 translation configuration:
>   * 16kB pages (TG0 = 2)
> - * 2 level page tables (SL = 1)
>   */
> -#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_16K | VTCR_EL2_SL0_LVL1)
> +#define VTCR_EL2_TGRAN VTCR_EL2_TG0_16K
>  #define VTCR_EL2_TGRAN_SL0_BASE 3UL
>  #else /* 4K */
>  /*
>   * Stage2 translation configuration:
>   * 4kB pages (TG0 = 0)
> - * 3 level page tables (SL = 1)
>   */
> -#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_4K | VTCR_EL2_SL0_LVL1)
> +#define VTCR_EL2_TGRAN VTCR_EL2_TG0_4K
>  #define VTCR_EL2_TGRAN_SL0_BASE 2UL
>  #endif
>
> -#define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN_FLAGS)
> +#define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN)
> +
>  /*
>   * VTCR_EL2:SL0 indicates the entry level for Stage2 translation.
>   * Interestingly, it depends on the page size.
> diff --git a/arch/arm64/include/asm/kvm_asm.h 
> b/arch/arm64/include/asm/kvm_asm.h
> index ab4d0a926043..21cfd1fe692c 100644
> --- a/arch/arm64/include/asm/kvm_asm.h
> +++ b/arch/arm64/include/asm/kvm_asm.h
> @@ -66,7 +66,7 @@ extern void __vgic_v3_init_lrs(void);
>
>  extern u32 __kvm_get_mdcr_el2(void);
>
> -extern u32 __init_stage2_translation(void);
> +extern void __init_stage2_translation(void);
>
>  #endif
>
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index ea6cb5b24258..9a9ddeb33c84 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -380,10 +380,12 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>
>  static inline void __cpu_init_stage2(void)
>  {
> - u32 parange = kvm_call_hyp(__init_stage2_translation);
> + u32 ps;
>
> - WARN_ONCE(parange < 40,
> -  "PARange is %d bits, unsupported configuration!", parange);
> + kvm_call_hyp(__init_stage2_translation);
> + ps = id_aa64mmfr0_parange_to_phys_shift(read_sysreg(id_aa64mmfr0_el1));
> + WARN_ONCE(ps < 40,
> +  "PARange is %d bits, unsupported configuration!", ps);
>  }
>
>  /*
> diff --git a/arch/arm64/kvm/hyp/s2-setup.c b/arch/arm64/kvm/hyp/s2-setup.c
> index b1129c83c531..5c26ad4b8ac9 100644
> --- a/arch/arm64/kvm/hyp/s2-setup.c
> +++ b/arch/arm64/kvm/hyp/s2-setup.c
> @@ -19,13 +19,11 @@
>  #include 
>  #include 
>  #include 
> -#include 
>
> -u32 __hyp_text __init_stage2_translation(void)
> +void __hyp_text __init_stage2_translation(void)
>  {
>   u64 val = VTCR_EL2_FLAGS;
>   u64 parange;
> -

Re: [PATCH v1 02/16] irqchip: gicv3-its: Add helpers for handling 52bit address

2018-02-08 Thread Christoffer Dall
On Thu, Feb 08, 2018 at 11:20:02AM +, Suzuki K Poulose wrote:
> On 07/02/18 15:10, Christoffer Dall wrote:
> >Hi Suzuki,
> >
> >On Tue, Jan 09, 2018 at 07:03:57PM +, Suzuki K Poulose wrote:
> >>Add helpers for encoding/decoding 52bit address in GICv3 ITS BASER
> >>register. When ITS uses 64K page size, the 52bits of physical address
> >>are encoded in BASER[47:12] as follows :
> >>
> >>  Bits[47:16] of the register => bits[47:16] of the physical address
> >>  Bits[15:12] of the register => bits[51:48] of the physical address
> >> bits[15:0] of the physical address are 0.
> >>
> >>Also adds a mask for CBASER address. This will be used for adding 52bit
> >>support for VGIC ITS. More importantly ignore the upper bits if 52bit
> >>support is not enabled.
> >>
> >>Cc: Shanker Donthineni 
> >>Cc: Marc Zyngier 
> >>Signed-off-by: Suzuki K Poulose 
> >>---
> 
> 
> >>+
> >>+/*
> >>+ * With 64K page size, the physical address can be upto 52bit and
> >>+ * uses the following encoding in the GITS_BASER[47:12]:
> >>+ *
> >>+ * Bits[47:16] of the register => bits[47:16] of the base physical address.
> >>+ * Bits[15:12] of the register => bits[51:48] of the base physical address.
> >>+ * bits[15:0] of the base physical address are 0.
> >>+ * Clear the upper bits if the kernel doesn't support 52bits.
> >>+ */
> >>+#define GITS_BASER_ADDR64K_LO_MASK GENMASK_ULL(47, 16)
> >>+#define GITS_BASER_ADDR64K_HI_SHIFT12
> >>+#define GITS_BASER_ADDR64K_HI_MOVE (48 - GITS_BASER_ADDR64K_HI_SHIFT)
> >>+#define GITS_BASER_ADDR64K_HI_MASK (GITS_PA_HI_MASK << 
> >>GITS_BASER_ADDR64K_HI_SHIFT)
> >>+#define GITS_BASER_ADDR64K_TO_PHYS(x)  
> >>\
> >>+   (((x) & GITS_BASER_ADDR64K_LO_MASK) |   \
> >>+(((x) & GITS_BASER_ADDR64K_HI_MASK) << GITS_BASER_ADDR64K_HI_MOVE))
> >>+#define GITS_BASER_ADDR64K_FROM_PHYS(p)
> >>\
> >>+   (((p) & GITS_BASER_ADDR64K_LO_MASK) |   \
> >>+(((p) >> GITS_BASER_ADDR64K_HI_MOVE) & GITS_BASER_ADDR64K_HI_MASK))
> >
> >I don't understand why you need this masking logic embedded in these
> >macros?  Isn't it strictly an error if anyone passes a physical address
> >with any of bits [51:48] set to the ITS on a system that doesn't support
> >52 bit PAs, and just silently masking off those bits could lead to some
> >interesting cases.
> 
> What do you think is the best way to handle such cases ? May be I could add
> some checks where we get those addresses and handle it before we use this
> macro ?
> 

I don't think the conversion macros should try to hide programming
errors.  I think we should limit the functionality in the macros to be
simple bit masking and shifting.

Any validation and masking depending on 52 PA support in the kernel
should be done in the context of the functionality, just like the ITS
driver already does.

> >
> >This is also notably more difficult to read than the existing macro.
> >
> >If anything, I think it would be more useful to have
> >GITS_BASER_TO_PHYS(x) and GITS_PHYS_TO_BASER(x) which takes into account
> >CONFIG_ARM64_64K_PAGES.
> 
> I thought the 64K_PAGES is not kernel page size, but the page-size configured
> by the "requester" for ITS. So, it doesn't really mean CONFIG_ARM64_64K_PAGES.

You're right, I skimmed this logic too quickly.

> But the other way around, we can't handle 52bit address unless 
> CONFIG_ARM64_64K_PAGES
> is selected. Also, if the guest uses a 4K page size and uses a 48 bit address,
> we could potentially mask Bits[15:12] to 0, which is not nice.
> 
> So I still think we need to have a special macro for handling addresses with 
> 64K
> page size in ITS.
> 
I think it's easier to have the current GITS_BASER_PHYS_52_to_48 and
have a corresponding GITS_BASER_PHYS_48_to_52, following Robin's
observation.

Any additional logic can be written directly in the C code to check
consistency etc.

Thanks,
-Christoffer


Re: [PATCH v1 05/16] arm64: Helper for parange to PASize

2018-02-08 Thread Christoffer Dall
On Thu, Feb 08, 2018 at 11:08:18AM +, Suzuki K Poulose wrote:
> On 08/02/18 11:00, Christoffer Dall wrote:
> >On Tue, Jan 09, 2018 at 07:04:00PM +, Suzuki K Poulose wrote:
> >>Add a helper to convert ID_AA64MMFR0_EL1:PARange to they physical
> >   *the*
> >>size shift. Limit the size to the maximum supported by the kernel.
> >
> >Is this just a cleanup or are we actually going to need this feature in
> >the subsequent patches?  That would be nice to motivate in the commit
> >letter.
> 
> It is a cleanup, plus we are going to move the user of the code around from
> one place to the other. So this makes it a bit easier and cleaner.
> 

On its own I'm not sure it really is a cleanup, so it would be good to
mention in the commit letter that this is to make a later operation
easier.

> 
> >>
> >>Cc: Mark Rutland 
> >>Cc: Catalin Marinas 
> >>Cc: Will Deacon 
> >>Cc: Marc Zyngier 
> >>Signed-off-by: Suzuki K Poulose 
> >>---
> >>  arch/arm64/include/asm/cpufeature.h | 16 
> >>  arch/arm64/kvm/hyp/s2-setup.c   | 28 +---
> >>  2 files changed, 21 insertions(+), 23 deletions(-)
> >>
> >>diff --git a/arch/arm64/include/asm/cpufeature.h 
> >>b/arch/arm64/include/asm/cpufeature.h
> >>index ac67cfc2585a..0564e14616eb 100644
> >>--- a/arch/arm64/include/asm/cpufeature.h
> >>+++ b/arch/arm64/include/asm/cpufeature.h
> >>@@ -304,6 +304,22 @@ static inline u64 read_zcr_features(void)
> >>return zcr;
> >>  }
> >>+static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange)
> >>+{
> >>+   switch (parange) {
> >>+   case 0: return 32;
> >>+   case 1: return 36;
> >>+   case 2: return 40;
> >>+   case 3: return 42;
> >>+   case 4: return 44;
> >>+
> >>+   default:
> >
> >What is the case we want to cater for with making parange == 5 the
> >default for unrecognized values?
> >
> >(I have a feeling that default label comes from making the compiler
> >happy about potentially uninitialized values once upon a time before a
> >lot of refactoring happened here.)
> 
> That is there to make sure we return 48 iff 52bit support (for that matter,
> if there is a new limit in the future) is not enabled.
> 

duh, yeah, it's obvious when I look at it again now.

> >
> >>+   case 5: return 48;
> >>+#ifdef CONFIG_ARM64_PA_BITS_52
> >>+   case 6: return 52;
> >>+#endif
> >>+   }
> >>+}
> >>  #endif /* __ASSEMBLY__ */
> 
Thanks,
-Christoffer


Re: [PATCH 00/16] kvm: arm64: Support for dynamic IPA size

2018-02-08 Thread Christoffer Dall
Hi Suzuki,

On Tue, Jan 09, 2018 at 07:03:55PM +, Suzuki K Poulose wrote:
> On arm64 we have a static limit of 40bits of physical address space
> for the VM with KVM. This series lifts the limitation and allows the
> VM to configure the physical address space upto 52bit on systems
> where it is supported. We retain the default and minimum size to 40bits
> to avoid breaking backward compatibility.
> 
> The interface provided is an IOCTL on the VM fd. The guest can
> only increase the limit from what is already configured, to prevent
> breaking the devices which may have already been configured with a
> particular guest PA. The guest can issue the request until something
> is actually mapped into the stage2 table (e.g, memory region or device).
> This also implies that we now have per VM configuration of stage2
> control registers (VTCR_EL2 bits).
> 
> The arm64 page table level helpers are defined based on the page
> table levels used by the host VA. So, the accessors may not work
> if the guest uses more levels in stage2 than the stage1
> of the host. In order to provide an independent stage2 page table,
> we refactor the arm64 page table helpers to give us raw accessors
> for each level, which should only used when that level is present.
> And then, based on the VM, we make the decision of the stage2
> page table using the raw accessors.
> 

This may come a bit out of left field, but have we considered decoupling
the KVM stage 2 page table manipulation functions even further from the
host page table helpers?  I found some of the patches a bit hard to read
with all the wrappers and folding logic considered, so I'm wondering if
it's possible to write something more generic for stage 2 page table
manipulations which doesn't have to fit within a Linux page table
manipulation nomenclature?

Wasn't this also the decision taken for IOMMU page table allocation, and
why was that the right approach for the IOMMU but not for KVM stage 2
page tables?  Is there room for reuse of the IOMMU page table allocation
logic in KVM as well?

This may have been discussed already, but I'd like to know the arguments
for doing things the way proposed in this series.

Thanks,
-Christoffer

> 
> The series also adds :
>  - Support for handling 52bit IPA for vgic ITS.
>  - Cleanup in virtio to handle errors when the PFN used in
>the virtio transport doesn't fit in 32bit.
> 
> Tested with
>   - Modified kvmtool, which can only be used for (patches included in
> the series for reference / testing):
> * with virtio-pci upto 44bit PA (Due to 4K page size for virtio-pci
>   legacy implemented by kvmtool)
> * Upto 48bit PA with virtio-mmio, due to 32bit PFN limitation.
>   - Hacked Qemu (boot loader support for highmem, phys-shift support)
> * with virtio-pci GIC-v3 ITS & MSI upto 52bit on Foundation model.
> 
> The series applies on arm64 for-next/core tree with 52bit PA support patches.
> One would need the fix for virtio_mmio cleanup [1] on top of the arm64
> tree to remove the warnings from virtio.
> 
> [1] https://marc.info/?l=linux-virtualization&m=151308636322117&w=2
> 
> Kristina Martsenko (1):
>   vgic: its: Add support for 52bit guest physical address
> 
> Suzuki K Poulose (15):
>   virtio: Validate queue pfn for 32bit transports
>   irqchip: gicv3-its: Add helpers for handling 52bit address
>   arm64: Make page table helpers reusable
>   arm64: Refactor pud_huge for reusability
>   arm64: Helper for parange to PASize
>   kvm: arm/arm64: Fix stage2_flush_memslot for 4 level page table
>   kvm: arm/arm64: Remove spurious WARN_ON
>   kvm: arm/arm64: Clean up stage2 pgd life time
>   kvm: arm/arm64: Delay stage2 page table allocation
>   kvm: arm/arm64: Prepare for VM specific stage2 translations
>   kvm: arm64: Make stage2 page table layout dynamic
>   kvm: arm64: Dynamic configuration of VTCR and VTTBR mask
>   kvm: arm64: Configure VTCR per VM
>   kvm: arm64: Switch to per VM IPA
>   kvm: arm64: Allow configuring physical address space size
> 
>  Documentation/virtual/kvm/api.txt |  27 +++
>  arch/arm/include/asm/kvm_arm.h|   2 -
>  arch/arm/include/asm/kvm_host.h   |   7 +
>  arch/arm/include/asm/kvm_mmu.h|  13 +-
>  arch/arm/include/asm/stage2_pgtable.h |  46 +++---
>  arch/arm64/include/asm/cpufeature.h   |  16 ++
>  arch/arm64/include/asm/kvm_arm.h  | 112 +++--
>  arch/arm64/include/asm/kvm_asm.h  |   2 +-
>  arch/arm64/include/asm/kvm_host.h |  21 ++-
>  arch/arm64/include/asm/kvm_mmu.h  |  83 --
>  arch/arm64/include/asm/pgalloc.h  |  32 +++-
>  arch/arm64/include/asm/pgtable.h  |  61 ---
>  arch/arm64/include/asm/stage2_pgtable-nopmd.h |  42 -
>  arch/arm64/include/asm/stage2_pgtable-nopud.h |  39 -
>  arch/arm64/include/asm/stage2_pgtable.h   | 211 
>  arch/arm64/kvm/hyp/s2-setup.c

Re: [PATCH v1 15/16] kvm: arm64: Allow configuring physical address space size

2018-02-08 Thread Christoffer Dall
On Tue, Jan 09, 2018 at 07:04:10PM +, Suzuki K Poulose wrote:
> Allow the guests to choose a larger physical address space size.
> The default and minimum size is 40bits. A guest can change this
> right after the VM creation, but before the stage2 entry page
> tables are allocated (i.e, before it registers a memory range
> or maps a device address). The size is restricted to the maximum
> supported by the host. Also, the guest can only increase the PA size,
> from the existing value, as reducing it could break the devices which
> may have verified their physical address for validity and may do a
> lazy mapping (e.g., VGIC).
> 
> Cc: Marc Zyngier 
> Cc: Christoffer Dall 
> Cc: Peter Maydell 
> Signed-off-by: Suzuki K Poulose 
> ---
>  Documentation/virtual/kvm/api.txt | 27 ++
>  arch/arm/include/asm/kvm_host.h   |  7 +++
>  arch/arm64/include/asm/kvm_host.h |  1 +
>  arch/arm64/include/asm/kvm_mmu.h  | 41 
> ++-
>  arch/arm64/kvm/reset.c| 28 ++
>  include/uapi/linux/kvm.h  |  4 
>  virt/kvm/arm/arm.c|  2 +-
>  7 files changed, 100 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 57d3ee9e4bde..a203faf768c4 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -3403,6 +3403,33 @@ invalid, if invalid pages are written to (e.g. after 
> the end of memory)
>  or if no page table is present for the addresses (e.g. when using
>  hugepages).
>  
> +4.109 KVM_ARM_GET_PHYS_SHIFT
> +
> +Capability: KVM_CAP_ARM_CONFIG_PHYS_SHIFT
> +Architectures: arm64
> +Type: vm ioctl
> +Parameters: __u32 (out)
> +Returns: 0 on success, a negative value on error
> +
> +This ioctl is used to get the current maximum physical address size for
> +the VM. The value is Log2(Maximum_Physical_Address). This is neither the
> + amount of physical memory assigned to the VM nor the maximum physical 
> address
> +that a real CPU on the host can handle. Rather, this is the upper limit of 
> the
> +guest physical address that can be used by the VM.

What is the point of this?  Presumably if userspace has set the size, it
already knows the size?

> +
> +4.109 KVM_ARM_SET_PHYS_SHIFT
> +
> +Capability: KVM_CAP_ARM_CONFIG_PHYS_SHIFT
> +Architectures: arm64
> +Type: vm ioctl
> +Parameters: __u32 (in)
> +Returns: 0 on success, a negative value on error
> +
> +This ioctl is used to set the maximum physical address size for
> +the VM. The value is Log2(Maximum_Physical_Address). The value can only
> +be increased from the existing setting. The value cannot be changed
> +after the stage-2 page tables are allocated and will return an error.
> +

Is there a way for userspace to discover what the underlying hardware
can actually support, beyond trial-and-error on this ioctl?

>  5. The kvm_run structure
>  
>  
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index a9f7d3f47134..fa8e68a4f692 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -268,6 +268,13 @@ static inline int 
> kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext)
>   return 0;
>  }
>  
> +static inline long kvm_arch_dev_vm_ioctl(struct kvm *kvm,
> +  unsigned int ioctl,
> +  unsigned long arg)
> +{
> + return -EINVAL;
> +}
> +
>  int kvm_perf_init(void);
>  int kvm_perf_teardown(void);
>  
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 1e66e5ab3dde..2895c2cda8fc 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -50,6 +50,7 @@
>  int __attribute_const__ kvm_target_cpu(void);
>  int kvm_reset_vcpu(struct kvm_vcpu *vcpu);
>  int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext);
> +long kvm_arch_dev_vm_ioctl(struct kvm *kvm, unsigned int ioctl, unsigned 
> long arg);
>  void __extended_idmap_trampoline(phys_addr_t boot_pgd, phys_addr_t 
> idmap_start);
>  
>  struct kvm_arch {
> diff --git a/arch/arm64/include/asm/kvm_mmu.h 
> b/arch/arm64/include/asm/kvm_mmu.h
> index ab6a8b905065..ab7f50f20bcd 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -347,21 +347,44 @@ static inline u64 kvm_vttbr_baddr_mask(struct kvm *kvm)
>   return GENMASK_ULL(PHYS_MASK_SHIFT - 1, x);
>  }
>  
> +static inline int kvm_reconfig_stage2(struct kvm *kvm, u32 phys_shift)
> +{
> + int rc = 0;
> 

Re: [PATCH v1 05/16] arm64: Helper for parange to PASize

2018-02-08 Thread Christoffer Dall
On Tue, Jan 09, 2018 at 07:04:00PM +, Suzuki K Poulose wrote:
> Add a helper to convert ID_AA64MMFR0_EL1:PARange to they physical
  *the*
> size shift. Limit the size to the maximum supported by the kernel.

Is this just a cleanup or are we actually going to need this feature in
the subsequent patches?  That would be nice to motivate in the commit
letter.

> 
> Cc: Mark Rutland 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Marc Zyngier 
> Signed-off-by: Suzuki K Poulose 
> ---
>  arch/arm64/include/asm/cpufeature.h | 16 
>  arch/arm64/kvm/hyp/s2-setup.c   | 28 +---
>  2 files changed, 21 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/cpufeature.h 
> b/arch/arm64/include/asm/cpufeature.h
> index ac67cfc2585a..0564e14616eb 100644
> --- a/arch/arm64/include/asm/cpufeature.h
> +++ b/arch/arm64/include/asm/cpufeature.h
> @@ -304,6 +304,22 @@ static inline u64 read_zcr_features(void)
>   return zcr;
>  }
>  
> +static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange)
> +{
> + switch (parange) {
> + case 0: return 32;
> + case 1: return 36;
> + case 2: return 40;
> + case 3: return 42;
> + case 4: return 44;
> +
> + default:

What is the case we want to cater for with making parange == 5 the
default for unrecognized values?

(I have a feeling that default label comes from making the compiler
happy about potentially uninitialized values once upon a time before a
lot of refactoring happened here.)

> + case 5: return 48;
> +#ifdef CONFIG_ARM64_PA_BITS_52
> + case 6: return 52;
> +#endif
> + }
> +}
>  #endif /* __ASSEMBLY__ */
>  
>  #endif
> diff --git a/arch/arm64/kvm/hyp/s2-setup.c b/arch/arm64/kvm/hyp/s2-setup.c
> index 603e1ee83e89..b1129c83c531 100644
> --- a/arch/arm64/kvm/hyp/s2-setup.c
> +++ b/arch/arm64/kvm/hyp/s2-setup.c
> @@ -19,11 +19,13 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  u32 __hyp_text __init_stage2_translation(void)
>  {
>   u64 val = VTCR_EL2_FLAGS;
>   u64 parange;
> + u32 phys_shift;
>   u64 tmp;
>  
>   /*
> @@ -37,27 +39,7 @@ u32 __hyp_text __init_stage2_translation(void)
>   val |= parange << 16;
>  
>   /* Compute the actual PARange... */
> - switch (parange) {
> - case 0:
> - parange = 32;
> - break;
> - case 1:
> - parange = 36;
> - break;
> - case 2:
> - parange = 40;
> - break;
> - case 3:
> - parange = 42;
> - break;
> - case 4:
> - parange = 44;
> - break;
> - case 5:
> - default:
> - parange = 48;
> - break;
> - }
> + phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange);
>  
>   /*
>* ... and clamp it to 40 bits, unless we have some braindead
> @@ -65,7 +47,7 @@ u32 __hyp_text __init_stage2_translation(void)
>* return that value for the rest of the kernel to decide what
>* to do.
>*/
> - val |= 64 - (parange > 40 ? 40 : parange);
> + val |= 64 - (phys_shift > 40 ? 40 : phys_shift);
>  
>   /*
>* Check the availability of Hardware Access Flag / Dirty Bit
> @@ -86,5 +68,5 @@ u32 __hyp_text __init_stage2_translation(void)
>  
>   write_sysreg(val, vtcr_el2);
>  
> - return parange;
> + return phys_shift;
>  }
> -- 
> 2.13.6
> 

Could you fold this change into the commit as well:

diff --git a/arch/arm64/kvm/hyp/s2-setup.c b/arch/arm64/kvm/hyp/s2-setup.c
index 603e1ee83e89..eea2fbd68b8a 100644
--- a/arch/arm64/kvm/hyp/s2-setup.c
+++ b/arch/arm64/kvm/hyp/s2-setup.c
@@ -29,7 +29,8 @@ u32 __hyp_text __init_stage2_translation(void)
/*
 * Read the PARange bits from ID_AA64MMFR0_EL1 and set the PS
 * bits in VTCR_EL2. Amusingly, the PARange is 4 bits, while
-* PS is only 3. Fortunately, bit 19 is RES0 in VTCR_EL2...
+* PS is only 3. Fortunately, only three bits are actually used to
+* encode the supported PARange values.
 */
parange = read_sysreg(id_aa64mmfr0_el1) & 7;
if (parange > ID_AA64MMFR0_PARANGE_MAX)


Thanks,
-Christoffer


Re: [PATCH v1 09/16] kvm: arm/arm64: Delay stage2 page table allocation

2018-02-08 Thread Christoffer Dall
On Tue, Jan 09, 2018 at 07:04:04PM +, Suzuki K Poulose wrote:
> We allocate the entry level page tables for stage2 when the
> VM is created. This doesn't give us the flexibility of configuring
> the physical address space size for a VM. In order to allow
> the VM to choose the required size, we delay the allocation of
> stage2 entry level tables until we really try to map something.
> 
> This could be either when the VM creates a memory range or when
> it tries to map a device memory. So we add in a hook to these
> two places to make sure the tables are allocated. We use
> kvm->slots_lock to serialize the allocation entry point, since
> we add hooks to the arch specific call back with the mutex held.
> 
> Cc: Marc Zyngier 
> Cc: Christoffer Dall 
> Signed-off-by: Suzuki K Poulose 
> ---
>  virt/kvm/arm/arm.c | 18 ++--
>  virt/kvm/arm/mmu.c | 61 
> +-
>  2 files changed, 57 insertions(+), 22 deletions(-)
> 
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index 19b720ddedce..d06f0054 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -127,13 +127,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long 
> type)
>   for_each_possible_cpu(cpu)
>   *per_cpu_ptr(kvm->arch.last_vcpu_ran, cpu) = -1;
>  
> - ret = kvm_alloc_stage2_pgd(kvm);
> - if (ret)
> - goto out_fail_alloc;
> -
>   ret = create_hyp_mappings(kvm, kvm + 1, PAGE_HYP);
> - if (ret)
> - goto out_free_stage2_pgd;
> + if (ret) {
> + free_percpu(kvm->arch.last_vcpu_ran);
> + kvm->arch.last_vcpu_ran = NULL;
> + return ret;
> + }
> +
>  
>   kvm_vgic_early_init(kvm);
>  
> @@ -145,12 +145,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>   kvm_vgic_get_max_vcpus() : KVM_MAX_VCPUS;
>  
>   return ret;
> -out_free_stage2_pgd:
> - kvm_free_stage2_pgd(kvm);
> -out_fail_alloc:
> - free_percpu(kvm->arch.last_vcpu_ran);
> - kvm->arch.last_vcpu_ran = NULL;
> - return ret;
>  }
>  
>  bool kvm_arch_has_vcpu_debugfs(void)
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index c94c61ac38b9..257f2a8ccfc7 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1011,15 +1011,39 @@ static int stage2_pmdp_test_and_clear_young(pmd_t 
> *pmd)
>   return stage2_ptep_test_and_clear_young((pte_t *)pmd);
>  }
>  
> -/**
> - * kvm_phys_addr_ioremap - map a device range to guest IPA
> - *
> - * @kvm: The KVM pointer
> - * @guest_ipa:   The IPA at which to insert the mapping
> - * @pa:  The physical address of the device
> - * @size:The size of the mapping
> +/*
> + * Finalise the stage2 page table layout. Must be called with kvm->slots_lock
> + * held.
>   */
> -int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> +static int __kvm_init_stage2_table(struct kvm *kvm)
> +{
> + /* Double check if somebody has already allocated it */

dubious comment: Either leave it out or explain that we need to check
again with the mutex held.

> + if (likely(kvm->arch.pgd))
> + return 0;
> + return kvm_alloc_stage2_pgd(kvm);
> +}
> +
> +static int kvm_init_stage2_table(struct kvm *kvm)
> +{
> + int rc;
> +
> + /*
> +  * Once allocated, the stage2 entry level tables are only
> +  * freed when the KVM instance is destroyed. So, if we see
> +  * something valid here, that guarantees that we have
> +  * done the one time allocation and it is something valid
> +  * and won't go away until the last reference to the KVM
> +  * is gone.
> +  */

Really not sure if this comment adds something beyond what's described
by the code already?

Thanks,
-Christoffer

> + if (likely(kvm->arch.pgd))
> + return 0;
> + mutex_lock(&kvm->slots_lock);
> + rc = __kvm_init_stage2_table(kvm);
> + mutex_unlock(&kvm->slots_lock);
> + return rc;
> +}
> +
> +static int __kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> phys_addr_t pa, unsigned long size, bool writable)
>  {
>   phys_addr_t addr, end;
> @@ -1055,6 +1079,23 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t 
> guest_ipa,
>   return ret;
>  }
>  
> +/**
> + * kvm_phys_addr_ioremap - map a device range to guest IPA.
> + * Acquires kvm->slots_lock for making sure that the stage2 is initialized.
> + *
> + * @kvm: The KVM pointer
> + * @guest_ipa:   The IPA at which to ins

Re: [PATCH v1 07/16] kvm: arm/arm64: Remove spurious WARN_ON

2018-02-08 Thread Christoffer Dall
On Tue, Jan 09, 2018 at 07:04:02PM +, Suzuki K Poulose wrote:
> On a 4-level page table pgd entry can be empty, unlike a 3-level
> page table. Remove the spurious WARN_ON() in stage2_get_pud().

Acked-by: Christoffer Dall 

> 
> Cc: Marc Zyngier 
> Cc: Christoffer Dall 
> Signed-off-by: Suzuki K Poulose 
> ---
>  virt/kvm/arm/mmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index e6548c85c495..78253fe00fc4 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -870,7 +870,7 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct 
> kvm_mmu_memory_cache *cache
>   pud_t *pud;
>  
>   pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> - if (WARN_ON(stage2_pgd_none(*pgd))) {
> + if (stage2_pgd_none(*pgd)) {
>   if (!cache)
>   return NULL;
>   pud = mmu_memory_cache_alloc(cache);
> -- 
> 2.13.6
> 


Re: [PATCH v1 08/16] kvm: arm/arm64: Clean up stage2 pgd life time

2018-02-08 Thread Christoffer Dall
On Tue, Jan 09, 2018 at 07:04:03PM +, Suzuki K Poulose wrote:
> On arm/arm64 we pre-allocate the entry level page tables when
> a VM is created and is free'd when either all the mm users are
> gone or the KVM is about to get destroyed. i.e, kvm_free_stage2_pgd
> is triggered via kvm_arch_flush_shadow_all() which can be invoked
> from two different paths :
> 
>  1) do_exit()-> .-> mmu_notifier->release()-> ..-> kvm_arch_flush_shadow_all()
> OR
>  2) kvm_destroy_vm()-> mmu_notifier_unregister-> kvm_arch_flush_shadow_all()
> 
> This has created lot of race conditions in the past as some of
> the VCPUs could be active when we free the stage2 via path (1).

How??  mmu_notifier->release() is called via __mmput()->exit_mmap(), which
is only called once mm_users reaches 0, meaning there are no threads
left other than the one currently doing exit().

> 
> On a closer look, all we need to do with kvm_arch_flush_shadow_all() is,
> to ensure that the stage2 mappings are cleared. This doesn't mean we
> have to free up the stage2 entry level page tables yet, which could
> be delayed until the kvm is destroyed. This would avoid issues
> of use-after-free,

do we have any of those left?

> This will be later used for delaying
> the allocation of the stage2 entry level page tables until we really
> need to do something with it.

Fine, but you don't actually explain why this change of flow is
necessary for what you're trying to do later?

> 
> Cc: Marc Zyngier 
> Cc: Christoffer Dall 
> Signed-off-by: Suzuki K Poulose 
> ---
>  virt/kvm/arm/arm.c |  1 +
>  virt/kvm/arm/mmu.c | 56 
> --
>  2 files changed, 30 insertions(+), 27 deletions(-)
> 
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index c8d49879307f..19b720ddedce 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -189,6 +189,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>   }
>   }
>   atomic_set(&kvm->online_vcpus, 0);
> + kvm_free_stage2_pgd(kvm);
>  }
>  
>  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 78253fe00fc4..c94c61ac38b9 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -298,11 +298,10 @@ static void unmap_stage2_range(struct kvm *kvm, 
> phys_addr_t start, u64 size)
>   pgd = kvm->arch.pgd + stage2_pgd_index(addr);
>   do {
>   /*
> -  * Make sure the page table is still active, as another thread
> -  * could have possibly freed the page table, while we released
> -  * the lock.
> +  * The page table shouldn't be free'd as we still hold a 
> reference
> +  * to the KVM.

To avoid confusion about references to the kernel module KVM and a
specific KVM VM instance, please s/KVM/VM/.

>*/
> - if (!READ_ONCE(kvm->arch.pgd))
> + if (WARN_ON(!READ_ONCE(kvm->arch.pgd)))

This reads a lot like a defensive implementation now, and I think for
this patch to make sense, we shouldn't try to handle a buggy super-racy
implementation gracefully, but rather have VM_BUG_ON() (if we care about
performance of the check) or simply BUG_ON().

The rationale being that if we've gotten this flow incorrect and freed
the pgd at the wrong time, we don't want to leave a ticking bomb to blow
up somewhere else randomly (which it will!), but instead crash and burn.

>   break;
>   next = stage2_pgd_addr_end(addr, end);
>   if (!stage2_pgd_none(*pgd))
> @@ -837,30 +836,33 @@ void stage2_unmap_vm(struct kvm *kvm)
>   up_read(&current->mm->mmap_sem);
>   srcu_read_unlock(&kvm->srcu, idx);
>  }
> -
> -/**
> - * kvm_free_stage2_pgd - free all stage-2 tables
> - * @kvm: The KVM struct pointer for the VM.
> - *
> - * Walks the level-1 page table pointed to by kvm->arch.pgd and frees all
> - * underlying level-2 and level-3 tables before freeing the actual level-1 
> table
> - * and setting the struct pointer to NULL.
> +/*
> + * kvm_flush_stage2_all: Unmap the entire stage2 mappings including
> + * device and regular RAM backing memory.
>   */
> -void kvm_free_stage2_pgd(struct kvm *kvm)
> +static void kvm_flush_stage2_all(struct kvm *kvm)
>  {
> - void *pgd = NULL;
> -
>   spin_lock(&kvm->mmu_lock);
> - if (kvm->arch.pgd) {
> + if (kvm->arch.pgd)
>   unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
> - pgd = READ_ONCE(kvm->arch.pgd);
> - kvm->arch.pgd = NULL;
> - }
>   spin_unloc

Re: [PATCH v1 06/16] kvm: arm/arm64: Fix stage2_flush_memslot for 4 level page table

2018-02-08 Thread Christoffer Dall
On Tue, Jan 09, 2018 at 07:04:01PM +, Suzuki K Poulose wrote:
> So far we have only supported 3 level page table with fixed IPA of 40bits.
> Fix stage2_flush_memslot() to accommodate for 4 level tables.
> 

Acked-by: Christoffer Dall 

> Cc: Marc Zyngier 
> Cc: Christoffer Dall 
> Signed-off-by: Suzuki K Poulose 
> ---
>  virt/kvm/arm/mmu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 761787befd3b..e6548c85c495 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -375,7 +375,8 @@ static void stage2_flush_memslot(struct kvm *kvm,
>   pgd = kvm->arch.pgd + stage2_pgd_index(addr);
>   do {
>   next = stage2_pgd_addr_end(addr, end);
> - stage2_flush_puds(kvm, pgd, addr, next);
> + if (!stage2_pgd_none(*pgd))
> + stage2_flush_puds(kvm, pgd, addr, next);
>   } while (pgd++, addr = next, addr != end);
>  }
>  
> -- 
> 2.13.6
> 


Re: [PATCH v1 14/16] kvm: arm64: Switch to per VM IPA

2018-02-08 Thread Christoffer Dall
On Tue, Jan 09, 2018 at 07:04:09PM +, Suzuki K Poulose wrote:
> Now that we can manage the stage2 page table per VM, switch the
> configuration details to per VM instance. We keep track of the
> IPA bits, number of page table levels and the VTCR bits (which
> depends on the IPA and the number of levels).
> 
> Cc: Marc Zyngier 
> Cc: Christoffer Dall 
> Signed-off-by: Suzuki K Poulose 
> ---
>  arch/arm/include/asm/kvm_mmu.h  |  1 +
>  arch/arm64/include/asm/kvm_host.h   | 12 
>  arch/arm64/include/asm/kvm_mmu.h| 22 --
>  arch/arm64/include/asm/stage2_pgtable.h |  1 -
>  arch/arm64/kvm/hyp/switch.c |  3 +--
>  virt/kvm/arm/arm.c  |  2 +-
>  6 files changed, 35 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 440c80589453..dd592fe45660 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -48,6 +48,7 @@
>  #define kvm_vttbr_baddr_mask(kvm)VTTBR_BADDR_MASK
>  
>  #define stage2_pgd_size(kvm) (PTRS_PER_S2_PGD * sizeof(pgd_t))
> +#define kvm_init_stage2_config(kvm)  do { } while (0)
>  int create_hyp_mappings(void *from, void *to, pgprot_t prot);
>  int create_hyp_io_mappings(void *from, void *to, phys_addr_t);
>  void free_hyp_pgds(void);
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 9a9ddeb33c84..1e66e5ab3dde 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -64,6 +64,18 @@ struct kvm_arch {
>   /* VTTBR value associated with above pgd and vmid */
>   u64vttbr;
>  
> + /* Private bits of VTCR_EL2 for this VM */
> + u64vtcr_private;

As to my comments in the previous patch, why isn't this simply u64 vtcr;

Thanks,
-Christoffer

> + /* Size of the PA size for this guest */
> + u8 phys_shift;
> + /*
> +  * Number of levels in page table. We could always calculate
> +  * it from phys_shift above. We cache it for faster switches
> +  * in stage2 page table helpers.
> +  */
> + u8 s2_levels;
> +
> +
>   /* The last vcpu id that ran on each physical CPU */
>   int __percpu *last_vcpu_ran;
>  
> diff --git a/arch/arm64/include/asm/kvm_mmu.h 
> b/arch/arm64/include/asm/kvm_mmu.h
> index 483185ed2ecd..ab6a8b905065 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -134,11 +134,12 @@ static inline unsigned long __kern_hyp_va(unsigned long 
> v)
>  /*
>   * We currently only support a 40bit IPA.
>   */
> -#define KVM_PHYS_SHIFT   (40)
> +#define KVM_PHYS_SHIFT_DEFAULT   (40)
>  
> -#define kvm_phys_shift(kvm)  KVM_PHYS_SHIFT
> +#define kvm_phys_shift(kvm)  (kvm->arch.phys_shift)
>  #define kvm_phys_size(kvm)   (_AC(1, ULL) << kvm_phys_shift(kvm))
>  #define kvm_phys_mask(kvm)   (kvm_phys_size(kvm) - _AC(1, ULL))
> +#define kvm_stage2_levels(kvm)   (kvm->arch.s2_levels)
>  
>  static inline bool kvm_page_empty(void *ptr)
>  {
> @@ -346,5 +347,22 @@ static inline u64 kvm_vttbr_baddr_mask(struct kvm *kvm)
>   return GENMASK_ULL(PHYS_MASK_SHIFT - 1, x);
>  }
>  
> +/*
> + * kvm_init_stage2_config: Initialise the VM specific stage2 page table
> + * details to default IPA size.
> + */
> +static inline void kvm_init_stage2_config(struct kvm *kvm)
> +{
> + /*
> +  * The stage2 PGD is dependent on the settings we initialise here
> +  * and should be allocated only after this step.
> +  */
> + VM_BUG_ON(kvm->arch.pgd != NULL);
> + kvm->arch.phys_shift = KVM_PHYS_SHIFT_DEFAULT;
> + kvm->arch.s2_levels = stage2_pt_levels(kvm->arch.phys_shift);
> + kvm->arch.vtcr_private = VTCR_EL2_SL0(kvm->arch.s2_levels) |
> +  TCR_T0SZ(kvm->arch.phys_shift);
> +}
> +
>  #endif /* __ASSEMBLY__ */
>  #endif /* __ARM64_KVM_MMU_H__ */
> diff --git a/arch/arm64/include/asm/stage2_pgtable.h 
> b/arch/arm64/include/asm/stage2_pgtable.h
> index 33e8ebb25037..9b75b83da643 100644
> --- a/arch/arm64/include/asm/stage2_pgtable.h
> +++ b/arch/arm64/include/asm/stage2_pgtable.h
> @@ -44,7 +44,6 @@
>   */
>  #define __s2_pgd_ptrs(pa, lvls)  (1 << ((pa) - 
> pt_levels_pgdir_shift((lvls
>  
> -#define kvm_stage2_levels(kvm)   
> stage2_pt_levels(kvm_phys_shift(kvm))
>  #define stage2_pgdir_shift(kvm)  \
>   pt_levels_pgdir_shift(kvm_stage2_levels(kvm))
>  #define stage2_pgdir_size(kvm)   (_AC(1, UL) << 
> stage2_p

Re: [PATCH v1 02/16] irqchip: gicv3-its: Add helpers for handling 52bit address

2018-02-07 Thread Christoffer Dall
Hi Suzuki,

On Tue, Jan 09, 2018 at 07:03:57PM +, Suzuki K Poulose wrote:
> Add helpers for encoding/decoding 52bit address in GICv3 ITS BASER
> register. When ITS uses 64K page size, the 52bits of physical address
> are encoded in BASER[47:12] as follows :
> 
>  Bits[47:16] of the register => bits[47:16] of the physical address
>  Bits[15:12] of the register => bits[51:48] of the physical address
> bits[15:0] of the physical address are 0.
> 
> Also adds a mask for CBASER address. This will be used for adding 52bit
> support for VGIC ITS. More importantly ignore the upper bits if 52bit
> support is not enabled.
> 
> Cc: Shanker Donthineni 
> Cc: Marc Zyngier 
> Signed-off-by: Suzuki K Poulose 
> ---
>  drivers/irqchip/irq-gic-v3-its.c   |  2 +-
>  include/linux/irqchip/arm-gic-v3.h | 32 ++--
>  2 files changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/irqchip/irq-gic-v3-its.c 
> b/drivers/irqchip/irq-gic-v3-its.c
> index 4039e64cd342..e6aa84f806f7 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -1615,7 +1615,7 @@ static int its_setup_baser(struct its_node *its, struct 
> its_baser *baser,
>   }
>  
>   /* Convert 52bit PA to 48bit field */
> - baser_phys = GITS_BASER_PHYS_52_to_48(baser_phys);
> + baser_phys = GITS_BASER_ADDR64K_FROM_PHYS(baser_phys);
>   }
>  
>  retry_baser:
> diff --git a/include/linux/irqchip/arm-gic-v3.h 
> b/include/linux/irqchip/arm-gic-v3.h
> index c00c4c33e432..b880b6682fa6 100644
> --- a/include/linux/irqchip/arm-gic-v3.h
> +++ b/include/linux/irqchip/arm-gic-v3.h
> @@ -320,6 +320,15 @@
>  #define GITS_IIDR_REV(r) (((r) >> GITS_IIDR_REV_SHIFT) & 0xf)
>  #define GITS_IIDR_PRODUCTID_SHIFT24
>  
> +#ifdef CONFIG_ARM64_PA_BITS_52
> +#define GITS_PA_HI_MASK  (0xfULL)
> +#define GITS_PA_SHIFT52
> +#else
> +/* Do not use the bits [51-48] if we don't support 52bit */
> +#define GITS_PA_HI_MASK  0
> +#define GITS_PA_SHIFT48
> +#endif
> +
>  #define GITS_CBASER_VALID(1ULL << 63)
>  #define GITS_CBASER_SHAREABILITY_SHIFT   (10)
>  #define GITS_CBASER_INNER_CACHEABILITY_SHIFT (59)
> @@ -343,6 +352,7 @@
>  #define GITS_CBASER_WaWb GIC_BASER_CACHEABILITY(GITS_CBASER, INNER, WaWb)
>  #define GITS_CBASER_RaWaWt   GIC_BASER_CACHEABILITY(GITS_CBASER, INNER, 
> RaWaWt)
>  #define GITS_CBASER_RaWaWb   GIC_BASER_CACHEABILITY(GITS_CBASER, INNER, 
> RaWaWb)
> +#define GITS_CBASER_ADDRESS(x)   ((x) & GENMASK_ULL(GITS_PA_SHIFT, 12))
>  
>  #define GITS_BASER_NR_REGS   8
>  
> @@ -373,8 +383,26 @@
>  #define GITS_BASER_ENTRY_SIZE_SHIFT  (48)
>  #define GITS_BASER_ENTRY_SIZE(r) ((((r) >> GITS_BASER_ENTRY_SIZE_SHIFT) 
> & 0x1f) + 1)
>  #define GITS_BASER_ENTRY_SIZE_MASK   GENMASK_ULL(52, 48)
> -#define GITS_BASER_PHYS_52_to_48(phys)   
> \
> - (((phys) & GENMASK_ULL(47, 16)) | (((phys) >> 48) & 0xf) << 12)
> +
> +/*
> + * With 64K page size, the physical address can be upto 52bit and
> + * uses the following encoding in the GITS_BASER[47:12]:
> + *
> + * Bits[47:16] of the register => bits[47:16] of the base physical address.
> + * Bits[15:12] of the register => bits[51:48] of the base physical address.
> + *bits[15:0] of the base physical address 
> are 0.
> + * Clear the upper bits if the kernel doesn't support 52bits.
> + */
> +#define GITS_BASER_ADDR64K_LO_MASK   GENMASK_ULL(47, 16)
> +#define GITS_BASER_ADDR64K_HI_SHIFT  12
> +#define GITS_BASER_ADDR64K_HI_MOVE   (48 - GITS_BASER_ADDR64K_HI_SHIFT)
> +#define GITS_BASER_ADDR64K_HI_MASK   (GITS_PA_HI_MASK << 
> GITS_BASER_ADDR64K_HI_SHIFT)
> +#define GITS_BASER_ADDR64K_TO_PHYS(x)
> \
> + (((x) & GITS_BASER_ADDR64K_LO_MASK) |   \
> +  (((x) & GITS_BASER_ADDR64K_HI_MASK) << GITS_BASER_ADDR64K_HI_MOVE))
> +#define GITS_BASER_ADDR64K_FROM_PHYS(p)  
> \
> + (((p) & GITS_BASER_ADDR64K_LO_MASK) |   \
> +  (((p) >> GITS_BASER_ADDR64K_HI_MOVE) & GITS_BASER_ADDR64K_HI_MASK))

I don't understand why you need this masking logic embedded in these
macros?  Isn't it strictly an error if anyone passes a physical address
with any of bits [51:48] set to the ITS on a system that doesn't support
52 bit PAs, and just silently masking off those bits could lead to some
interesting cases.

This is also notably more difficult to read than the existing macro.

If anything, I think it would be more useful to have
GITS_BASER_TO_PHYS(x) and GITS_PHYS_TO_BASER(x) which takes into account
CONFIG_ARM64_64K_PAGES.

>  #define GITS_BASER_SHAREABILITY_SHIFT(10)
>  #define GITS_BASER_InnerShareable\

Re: [PATCH v4 02/17] arm: KVM: Fix SMCCC handling of unimplemented SMC/HVC calls

2018-02-07 Thread Christoffer Dall
On Tue, Feb 06, 2018 at 05:56:06PM +, Marc Zyngier wrote:
> KVM doesn't follow the SMCCC when it comes to unimplemented calls,
> and inject an UNDEF instead of returning an error. Since firmware
> calls are now used for security mitigation, they are becoming more
> common, and the undef is counter productive.
> 
> Instead, let's follow the SMCCC which states that -1 must be returned
> to the caller when getting an unknown function number.

Apparently I forgot to review this:

Reviewed-by: Christoffer Dall 

> 
> Cc: 
> Tested-by: Ard Biesheuvel 
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm/kvm/handle_exit.c | 13 +++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c
> index cf8bf6bf87c4..a4bf0f6f024a 100644
> --- a/arch/arm/kvm/handle_exit.c
> +++ b/arch/arm/kvm/handle_exit.c
> @@ -38,7 +38,7 @@ static int handle_hvc(struct kvm_vcpu *vcpu, struct kvm_run 
> *run)
>  
>   ret = kvm_psci_call(vcpu);
>   if (ret < 0) {
> - kvm_inject_undefined(vcpu);
> + vcpu_set_reg(vcpu, 0, ~0UL);
>   return 1;
>   }
>  
> @@ -47,7 +47,16 @@ static int handle_hvc(struct kvm_vcpu *vcpu, struct 
> kvm_run *run)
>  
>  static int handle_smc(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  {
> - kvm_inject_undefined(vcpu);
> + /*
> +  * "If an SMC instruction executed at Non-secure EL1 is
> +  * trapped to EL2 because HCR_EL2.TSC is 1, the exception is a
> +  * Trap exception, not a Secure Monitor Call exception [...]"
> +  *
> +  * We need to advance the PC after the trap, as it would
> +  * otherwise return to the same address...
> +  */
> + vcpu_set_reg(vcpu, 0, ~0UL);
> + kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
>   return 1;
>  }
>  
> -- 
> 2.14.2
> 


Re: linux-next: manual merge of the kvm tree with the arm64 tree

2018-02-07 Thread Christoffer Dall
On Wed, Feb 07, 2018 at 12:27:53PM +1100, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the kvm tree got a conflict in:
> 
>   arch/arm64/include/asm/pgtable-prot.h
> 
> between commit:
> 
>   41acec624087 ("arm64: kpti: Make use of nG dependent on 
> arm64_kernel_unmapped_at_el0()")
> 
> from the arm64 tree and commit:
> 
>   d0e22b4ac3ba ("KVM: arm/arm64: Limit icache invalidation to prefetch 
> aborts")
> 
> from the kvm tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
> 
> -- 
> Cheers,
> Stephen Rothwell
> 
> diff --cc arch/arm64/include/asm/pgtable-prot.h
> index 2db84df5eb42,4e12dabd342b..
> --- a/arch/arm64/include/asm/pgtable-prot.h
> +++ b/arch/arm64/include/asm/pgtable-prot.h
> @@@ -53,24 -47,23 +53,24 @@@
>   #define PROT_SECT_NORMAL(PROT_SECT_DEFAULT | PMD_SECT_PXN | 
> PMD_SECT_UXN | PMD_ATTRINDX(MT_NORMAL))
>   #define PROT_SECT_NORMAL_EXEC   (PROT_SECT_DEFAULT | PMD_SECT_UXN | 
> PMD_ATTRINDX(MT_NORMAL))
>   
>  -#define _PAGE_DEFAULT   (PROT_DEFAULT | PTE_ATTRINDX(MT_NORMAL))
>  +#define _PAGE_DEFAULT   (_PROT_DEFAULT | 
> PTE_ATTRINDX(MT_NORMAL))
>  +#define _HYP_PAGE_DEFAULT   _PAGE_DEFAULT
>   
>  -#define PAGE_KERNEL __pgprot(_PAGE_DEFAULT | PTE_PXN | PTE_UXN | 
> PTE_DIRTY | PTE_WRITE)
>  -#define PAGE_KERNEL_RO  __pgprot(_PAGE_DEFAULT | PTE_PXN | 
> PTE_UXN | PTE_DIRTY | PTE_RDONLY)
>  -#define PAGE_KERNEL_ROX __pgprot(_PAGE_DEFAULT | PTE_UXN | 
> PTE_DIRTY | PTE_RDONLY)
>  -#define PAGE_KERNEL_EXEC__pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | 
> PTE_WRITE)
>  -#define PAGE_KERNEL_EXEC_CONT   __pgprot(_PAGE_DEFAULT | PTE_UXN | 
> PTE_DIRTY | PTE_WRITE | PTE_CONT)
>  +#define PAGE_KERNEL __pgprot(PROT_NORMAL)
>  +#define PAGE_KERNEL_RO  __pgprot((PROT_NORMAL & ~PTE_WRITE) | 
> PTE_RDONLY)
>  +#define PAGE_KERNEL_ROX __pgprot((PROT_NORMAL & ~(PTE_WRITE | 
> PTE_PXN)) | PTE_RDONLY)
>  +#define PAGE_KERNEL_EXEC__pgprot(PROT_NORMAL & ~PTE_PXN)
>  +#define PAGE_KERNEL_EXEC_CONT   __pgprot((PROT_NORMAL & ~PTE_PXN) | 
> PTE_CONT)
>   
>  -#define PAGE_HYP__pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_HYP_XN)
>  -#define PAGE_HYP_EXEC   __pgprot(_PAGE_DEFAULT | PTE_HYP | 
> PTE_RDONLY)
>  -#define PAGE_HYP_RO __pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_RDONLY | 
> PTE_HYP_XN)
>  +#define PAGE_HYP__pgprot(_HYP_PAGE_DEFAULT | PTE_HYP | 
> PTE_HYP_XN)
>  +#define PAGE_HYP_EXEC   __pgprot(_HYP_PAGE_DEFAULT | PTE_HYP | 
> PTE_RDONLY)
>  +#define PAGE_HYP_RO __pgprot(_HYP_PAGE_DEFAULT | PTE_HYP | 
> PTE_RDONLY | PTE_HYP_XN)
>   #define PAGE_HYP_DEVICE __pgprot(PROT_DEVICE_nGnRE | PTE_HYP)
>   
> - #define PAGE_S2 __pgprot(_PROT_DEFAULT | 
> PTE_S2_MEMATTR(MT_S2_NORMAL) | PTE_S2_RDONLY)
> - #define PAGE_S2_DEVICE  __pgprot(_PROT_DEFAULT | 
> PTE_S2_MEMATTR(MT_S2_DEVICE_nGnRE) | PTE_S2_RDONLY | PTE_UXN)
>  -#define PAGE_S2 __pgprot(PROT_DEFAULT | 
> PTE_S2_MEMATTR(MT_S2_NORMAL) | PTE_S2_RDONLY | PTE_S2_XN)
>  -#define PAGE_S2_DEVICE  __pgprot(PROT_DEFAULT | 
> PTE_S2_MEMATTR(MT_S2_DEVICE_nGnRE) | PTE_S2_RDONLY | PTE_S2_XN)
> ++#define PAGE_S2 __pgprot(_PROT_DEFAULT | 
> PTE_S2_MEMATTR(MT_S2_NORMAL) | PTE_S2_RDONLY | PTE_S2_XN)
> ++#define PAGE_S2_DEVICE  __pgprot(_PROT_DEFAULT | 
> PTE_S2_MEMATTR(MT_S2_DEVICE_nGnRE) | PTE_S2_RDONLY | PTE_S2_XN)
>   
>  -#define PAGE_NONE   __pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | 
> PTE_PROT_NONE | PTE_RDONLY | PTE_PXN | PTE_UXN)
>  +#define PAGE_NONE   __pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | 
> PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
>   #define PAGE_SHARED __pgprot(_PAGE_DEFAULT | PTE_USER | PTE_NG | 
> PTE_PXN | PTE_UXN | PTE_WRITE)
>   #define PAGE_SHARED_EXEC__pgprot(_PAGE_DEFAULT | PTE_USER | PTE_NG | 
> PTE_PXN | PTE_WRITE)
>   #define PAGE_READONLY   __pgprot(_PAGE_DEFAULT | PTE_USER | 
> PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)


This looks correct to me.

Thanks,
-Christoffer


Re: [RFC 2/4] KVM: arm64: Support dirty page tracking for PUD hugepages

2018-02-06 Thread Christoffer Dall
On Wed, Jan 10, 2018 at 07:07:27PM +, Punit Agrawal wrote:
> In preparation for creating PUD hugepages at stage 2, add support for
> write protecting PUD hugepages when they are encountered. Write
> protecting guest tables is used to track dirty pages when migrating VMs.
> 
> Also, provide trivial implementations of required kvm_s2pud_* helpers to
> allow code to compile on arm32.
> 
> Signed-off-by: Punit Agrawal 
> Cc: Christoffer Dall 
> Cc: Marc Zyngier 
> ---
>  arch/arm/include/asm/kvm_mmu.h   |  9 +
>  arch/arm64/include/asm/kvm_mmu.h | 10 ++
>  virt/kvm/arm/mmu.c   |  9 ++---
>  3 files changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index fa6f2174276b..3fbe919b9181 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -103,6 +103,15 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
>   return (pmd_val(*pmd) & L_PMD_S2_RDWR) == L_PMD_S2_RDONLY;
>  }
>  
> +static inline void kvm_set_s2pud_readonly(pud_t *pud)
> +{
> +}
> +
> +static inline bool kvm_s2pud_readonly(pud_t *pud)
> +{
> + return true;

why true?  Shouldn't this return the pgd's readonly value, strictly
speaking, or if we rely on this never being called, have VM_BUG_ON()?

In any case, a comment explaining why we unconditionally return true
would be nice.

> +}
> +
>  static inline bool kvm_page_empty(void *ptr)
>  {
>   struct page *ptr_page = virt_to_page(ptr);
> diff --git a/arch/arm64/include/asm/kvm_mmu.h 
> b/arch/arm64/include/asm/kvm_mmu.h
> index 672c8684d5c2..dbfd18e08cfb 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -201,6 +201,16 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
>   return kvm_s2pte_readonly((pte_t *)pmd);
>  }
>  
> +static inline void kvm_set_s2pud_readonly(pud_t *pud)
> +{
> + kvm_set_s2pte_readonly((pte_t *)pud);
> +}
> +
> +static inline bool kvm_s2pud_readonly(pud_t *pud)
> +{
> + return kvm_s2pte_readonly((pte_t *)pud);
> +}
> +
>  static inline bool kvm_page_empty(void *ptr)
>  {
>   struct page *ptr_page = virt_to_page(ptr);
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 9dea96380339..02eefda5d71e 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1155,9 +1155,12 @@ static void  stage2_wp_puds(pgd_t *pgd, phys_addr_t 
> addr, phys_addr_t end)
>   do {
>   next = stage2_pud_addr_end(addr, end);
>   if (!stage2_pud_none(*pud)) {
> - /* TODO:PUD not supported, revisit later if supported */
> - BUG_ON(stage2_pud_huge(*pud));
> - stage2_wp_pmds(pud, addr, next);
> + if (stage2_pud_huge(*pud)) {
> + if (!kvm_s2pud_readonly(pud))
> + kvm_set_s2pud_readonly(pud);
> +         } else {
> + stage2_wp_pmds(pud, addr, next);
> + }
>   }
>   } while (pud++, addr = next, addr != end);
>  }
> -- 
> 2.15.1
> 

Otherwise:

Reviewed-by: Christoffer Dall 


Re: [RFC 4/4] KVM: arm64: Add support for PUD hugepages at stage 2

2018-02-06 Thread Christoffer Dall
On Wed, Jan 10, 2018 at 07:07:29PM +, Punit Agrawal wrote:
> KVM only supports PMD hugepages at stage 2. Extend the stage 2 fault
> handling to add support for PUD hugepages.
> 
> Addition of PUD hugpage support enables additional hugepage sizes (1G

 *hugepage

> with 4K granule and 4TB with 64k granule) which can be useful on cores
> that have support for mapping larger block sizes in the TLB entries.
> 
> Signed-off-by: Punit Agrawal 
> Cc: Marc Zyngier 
> Cc: Christoffer Dall 
> Cc: Catalin Marinas 
> ---
>  arch/arm/include/asm/kvm_mmu.h | 10 +
>  arch/arm/include/asm/pgtable-3level.h  |  2 +
>  arch/arm64/include/asm/kvm_mmu.h   | 19 +
>  arch/arm64/include/asm/pgtable-hwdef.h |  2 +
>  arch/arm64/include/asm/pgtable.h   |  4 ++
>  virt/kvm/arm/mmu.c | 72 
> +-
>  6 files changed, 99 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 3fbe919b9181..6e2e34348cb3 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -59,6 +59,10 @@ phys_addr_t kvm_get_idmap_vector(void);
>  int kvm_mmu_init(void);
>  void kvm_clear_hyp_idmap(void);
>  
> +static inline void kvm_set_pud(pud_t *pud, pud_t new_pud)
> +{
> +}
> +
>  static inline void kvm_set_pmd(pmd_t *pmd, pmd_t new_pmd)
>  {
>   *pmd = new_pmd;
> @@ -230,6 +234,12 @@ static inline unsigned int kvm_get_vmid_bits(void)
>   return 8;
>  }
>  
> +static inline pud_t stage2_build_pud(kvm_pfn_t pfn, pgprot_t mem_type,
> +  bool writable)
> +{
> + return __pud(0);
> +}
> +
>  #endif   /* !__ASSEMBLY__ */
>  
>  #endif /* __ARM_KVM_MMU_H__ */
> diff --git a/arch/arm/include/asm/pgtable-3level.h 
> b/arch/arm/include/asm/pgtable-3level.h
> index 1a7a17b2a1ba..97e04fdbfa85 100644
> --- a/arch/arm/include/asm/pgtable-3level.h
> +++ b/arch/arm/include/asm/pgtable-3level.h
> @@ -249,6 +249,8 @@ PMD_BIT_FUNC(mkyoung,   |= PMD_SECT_AF);
>  #define pfn_pmd(pfn,prot)(__pmd(((phys_addr_t)(pfn) << PAGE_SHIFT) | 
> pgprot_val(prot)))
>  #define mk_pmd(page,prot)pfn_pmd(page_to_pfn(page),prot)
>  
> +#define pud_pfn(pud) (((pud_val(pud) & PUD_MASK) & PHYS_MASK) >> 
> PAGE_SHIFT)
> +

does this make sense on 32-bit arm?  Is this ever going to get called
and return something meaningful in that case?

>  /* represent a notpresent pmd by faulting entry, this is used by 
> pmdp_invalidate */
>  static inline pmd_t pmd_mknotpresent(pmd_t pmd)
>  {
> diff --git a/arch/arm64/include/asm/kvm_mmu.h 
> b/arch/arm64/include/asm/kvm_mmu.h
> index dbfd18e08cfb..89eac3dbe123 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -160,6 +160,7 @@ void kvm_clear_hyp_idmap(void);
>  
>  #define  kvm_set_pte(ptep, pte)  set_pte(ptep, pte)
>  #define  kvm_set_pmd(pmdp, pmd)  set_pmd(pmdp, pmd)
> +#define kvm_set_pud(pudp, pud)   set_pud(pudp, pud)
>  
>  static inline pte_t kvm_s2pte_mkwrite(pte_t pte)
>  {
> @@ -173,6 +174,12 @@ static inline pmd_t kvm_s2pmd_mkwrite(pmd_t pmd)
>   return pmd;
>  }
>  
> +static inline pud_t kvm_s2pud_mkwrite(pud_t pud)
> +{
> + pud_val(pud) |= PUD_S2_RDWR;
> + return pud;
> +}
> +
>  static inline void kvm_set_s2pte_readonly(pte_t *pte)
>  {
>   pteval_t old_pteval, pteval;
> @@ -319,5 +326,17 @@ static inline unsigned int kvm_get_vmid_bits(void)
>   return (cpuid_feature_extract_unsigned_field(reg, 
> ID_AA64MMFR1_VMIDBITS_SHIFT) == 2) ? 16 : 8;
>  }
>  
> +static inline pud_t stage2_build_pud(kvm_pfn_t pfn, pgprot_t mem_type,
> +  bool writable)
> +{
> + pud_t pud = pfn_pud(pfn, mem_type);
> +
> + pud = pud_mkhuge(pud);
> + if (writable)
> + pud = kvm_s2pud_mkwrite(pud);
> +
> + return pud;
> +}
> +
>  #endif /* __ASSEMBLY__ */
>  #endif /* __ARM64_KVM_MMU_H__ */
> diff --git a/arch/arm64/include/asm/pgtable-hwdef.h 
> b/arch/arm64/include/asm/pgtable-hwdef.h
> index 40a998cdd399..a091a6192eee 100644
> --- a/arch/arm64/include/asm/pgtable-hwdef.h
> +++ b/arch/arm64/include/asm/pgtable-hwdef.h
> @@ -181,6 +181,8 @@
>  #define PMD_S2_RDONLY(_AT(pmdval_t, 1) << 6)   /* HAP[2:1] */
>  #define PMD_S2_RDWR  (_AT(pmdval_t, 3) << 6)   /* HAP[2:1] */
>  
> +#define PUD_S2_RDWR  (_AT(pudval_t, 3) << 6)   /* HAP[2:1] */
> +
>  /*
>   * Memory Attribute override for Stage-2 (MemAttr[3:0])
>   */
> diff --git a/a

Re: [PATCHv2 09/12] arm64/kvm: preserve host HCR_EL2 value

2018-02-06 Thread Christoffer Dall
On Mon, Nov 27, 2017 at 04:38:03PM +, Mark Rutland wrote:
> When restoring HCR_EL2 for the host, KVM uses HCR_HOST_VHE_FLAGS, which
> is a constant value. This works today, as the host HCR_EL2 value is
> always the same, but this will get in the way of supporting extensions
> that require HCR_EL2 bits to be set conditionally for the host.
> 
> To allow such features to work without KVM having to explicitly handle
> every possible host feature combination, this patch has KVM save/restore
> the host HCR when switching to/from a guest HCR.
> 
> For __{activate,deactivate}_traps(), the HCR save/restore is made common
> across the !VHE and VHE paths. As the host and guest HCR values must
> have E2H set when VHE is in use, register redirection should always be
> in effect at EL2, and this change should not adversely affect the VHE
> code.
> 
> For the hyp TLB maintenance code, __tlb_switch_to_host_vhe() is updated
> to toggle the TGE bit with a RMW sequence, as we already do in
> __tlb_switch_to_guest_vhe().
> 
> The now unused HCR_HOST_VHE_FLAGS definition is removed.
> 
> Signed-off-by: Mark Rutland 
> Reviewed-by: Christoffer Dall 
> Cc: Marc Zyngier 
> Cc: kvm...@lists.cs.columbia.edu
> ---
>  arch/arm64/include/asm/kvm_arm.h  | 1 -
>  arch/arm64/include/asm/kvm_host.h | 5 -
>  arch/arm64/kvm/hyp/switch.c   | 5 +++--
>  arch/arm64/kvm/hyp/tlb.c  | 6 +-
>  4 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_arm.h 
> b/arch/arm64/include/asm/kvm_arm.h
> index 62854d5d1d3b..aa02b05430e8 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -84,7 +84,6 @@
>HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW)
>  #define HCR_VIRT_EXCP_MASK (HCR_VSE | HCR_VI | HCR_VF)
>  #define HCR_INT_OVERRIDE   (HCR_FMO | HCR_IMO)
> -#define HCR_HOST_VHE_FLAGS (HCR_RW | HCR_TGE | HCR_E2H)
>  
>  /* TCR_EL2 Registers bits */
>  #define TCR_EL2_RES1 ((1 << 31) | (1 << 23))
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 674912d7a571..39184aa3e2f2 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -199,10 +199,13 @@ typedef struct kvm_cpu_context kvm_cpu_context_t;
>  struct kvm_vcpu_arch {
>   struct kvm_cpu_context ctxt;
>  
> - /* HYP configuration */
> + /* Guest HYP configuration */
>   u64 hcr_el2;
>   u32 mdcr_el2;
>  
> + /* Host HYP configuration */
> + u64 host_hcr_el2;
> +
>   /* Exception Information */
>   struct kvm_vcpu_fault_info fault;
>  
> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> index 525c01f48867..2205f0be3ced 100644
> --- a/arch/arm64/kvm/hyp/switch.c
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -71,6 +71,8 @@ static void __hyp_text __activate_traps(struct kvm_vcpu 
> *vcpu)
>  {
>   u64 val;
>  
> + vcpu->arch.host_hcr_el2 = read_sysreg(hcr_el2);
> +

Looking back at this, it seems excessive to switch this at every
round-trip.  I think it should be possible to have this as a single
global (or per-CPU) variable that gets restored directly when returning
from the VM.

Thanks,
-Christoffer

>   /*
>* We are about to set CPTR_EL2.TFP to trap all floating point
>* register accesses to EL2, however, the ARM ARM clearly states that
> @@ -116,7 +118,6 @@ static void __hyp_text __deactivate_traps_vhe(void)
>   MDCR_EL2_TPMS;
>  
>   write_sysreg(mdcr_el2, mdcr_el2);
> - write_sysreg(HCR_HOST_VHE_FLAGS, hcr_el2);
>   write_sysreg(CPACR_EL1_DEFAULT, cpacr_el1);
>   write_sysreg(vectors, vbar_el1);
>  }
> @@ -129,7 +130,6 @@ static void __hyp_text __deactivate_traps_nvhe(void)
>   mdcr_el2 |= MDCR_EL2_E2PB_MASK << MDCR_EL2_E2PB_SHIFT;
>  
>   write_sysreg(mdcr_el2, mdcr_el2);
> - write_sysreg(HCR_RW, hcr_el2);
>   write_sysreg(CPTR_EL2_DEFAULT, cptr_el2);
>  }
>  
> @@ -151,6 +151,7 @@ static void __hyp_text __deactivate_traps(struct kvm_vcpu *vcpu)
>   __deactivate_traps_arch()();
>   write_sysreg(0, hstr_el2);
>   write_sysreg(0, pmuserenr_el0);
> + write_sysreg(vcpu->arch.host_hcr_el2, hcr_el2);
>  }
>  
>  static void __hyp_text __activate_vm(struct kvm_vcpu *vcpu)
> diff --git a/arch/arm64/kvm/hyp/tlb.c b/arch/arm64/kvm/hyp/tlb.c
> index 73464a96c365..c2b0680efa2c 100644
> --- a/arch/arm64/kvm/hyp/tlb.c
> +++ b/arch/arm64/kvm/hyp/tlb.c
> @@ -49,12 +49,16 @@ static hyp_alternate_select(__tlb_switch_to_guest,
>  
>  static void __hyp_text __tlb_switch_to_host_vhe(struct kvm *kvm)
>  {

Re: [PATCHv2 05/12] arm64: Don't trap host pointer auth use to EL2

2018-02-06 Thread Christoffer Dall
Hi Mark,

On Mon, Nov 27, 2017 at 04:37:59PM +, Mark Rutland wrote:
> To allow EL0 (and/or EL1) to use pointer authentication functionality,
> we must ensure that pointer authentication instructions and accesses to
> pointer authentication keys are not trapped to EL2 (where we will not be
> able to handle them).

...on non-VHE systems, presumably?

> 
> This patch ensures that HCR_EL2 is configured appropriately when the
> kernel is booted at EL2. For non-VHE kernels we set HCR_EL2.{API,APK},
> ensuring that EL1 can access keys and permit EL0 use of instructions.
> For VHE kernels, EL2 access is controlled by EL3, and we need not set
> anything.


For VHE kernels, host EL0 (TGE && E2H) is unaffected by these settings,
so it doesn't matter how we configure HCR_EL2.{API,APK}.

(Because you do actually set these bits when the features are present if
I read the code correctly).


> 
> This does not enable support for KVM guests, since KVM manages HCR_EL2
> itself.

(...when running VMs.)


Besides the nits:

Acked-by: Christoffer Dall 

> 
> Signed-off-by: Mark Rutland 
> Cc: Catalin Marinas 
> Cc: Christoffer Dall 
> Cc: Marc Zyngier 
> Cc: Will Deacon 
> Cc: kvm...@lists.cs.columbia.edu
> ---
>  arch/arm64/include/asm/kvm_arm.h |  2 ++
>  arch/arm64/kernel/head.S | 19 +--
>  2 files changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index 7f069ff37f06..62854d5d1d3b 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -23,6 +23,8 @@
>  #include 
>  
>  /* Hyp Configuration Register (HCR) bits */
> +#define HCR_API  (UL(1) << 41)
> +#define HCR_APK  (UL(1) << 40)
>  #define HCR_E2H  (UL(1) << 34)
>  #define HCR_ID   (UL(1) << 33)
>  #define HCR_CD   (UL(1) << 32)
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 67e86a0f57ac..06a96e9af26b 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -415,10 +415,25 @@ CPU_LE( bic x0, x0, #(1 << 25)  )   // Clear the EE bit for EL2
>  
>   /* Hyp configuration. */
>   mov x0, #HCR_RW // 64-bit EL1
> - cbz x2, set_hcr
> + cbz x2, 1f
>   orr x0, x0, #HCR_TGE// Enable Host Extensions
>   orr x0, x0, #HCR_E2H
> -set_hcr:
> +1:
> +#ifdef CONFIG_ARM64_POINTER_AUTHENTICATION
> + /*
> +  * Disable pointer authentication traps to EL2. The HCR_EL2.{APK,API}
> +  * bits exist iff at least one authentication mechanism is implemented.
> +  */
> + mrs x1, id_aa64isar1_el1
> + mov_q   x3, ((0xf << ID_AA64ISAR1_GPI_SHIFT) | \
> +  (0xf << ID_AA64ISAR1_GPA_SHIFT) | \
> +  (0xf << ID_AA64ISAR1_API_SHIFT) | \
> +  (0xf << ID_AA64ISAR1_APA_SHIFT))
> + and x1, x1, x3
> + cbz x1, 1f
> + orr x0, x0, #(HCR_APK | HCR_API)
> +1:
> +#endif
>   msr hcr_el2, x0
>   isb
>  
> -- 
> 2.11.0
> 


Re: [PATCHv2 10/12] arm64/kvm: context-switch ptrauth registers

2018-02-06 Thread Christoffer Dall
On Mon, Nov 27, 2017 at 04:38:04PM +, Mark Rutland wrote:
> When pointer authentication is supported, a guest may wish to use it.
> This patch adds the necessary KVM infrastructure for this to work, with
> a semi-lazy context switch of the pointer auth state.
> 
> When we schedule a vcpu, 

That's not quite what the code does; it only does this when we
schedule back a preempted or blocked vcpu thread.

> we disable guest usage of pointer
> authentication instructions and accesses to the keys. While these are
> disabled, we avoid context-switching the keys. When we trap the guest
> trying to use pointer authentication functionality, we change to eagerly
> context-switching the keys, and enable the feature. The next time the
> vcpu is scheduled out/in, we start again.
> 
> Pointer authentication consists of address authentication and generic
> authentication, and CPUs in a system might have varied support for
> either. Where support for either feature is not uniform, it is hidden
> from guests via ID register emulation, as a result of the cpufeature
> framework in the host.
> 
> Unfortunately, address authentication and generic authentication cannot
> be trapped separately, as the architecture provides a single EL2 trap
> covering both. If we wish to expose one without the other, we cannot
> prevent a (badly-written) guest from intermittently using a feature
> which is not uniformly supported (when scheduled on a physical CPU which
> supports the relevant feature). 

We could choose to always trap and emulate in software in this case,
couldn't we?  (not saying we should though).

Also, this patch doesn't let userspace decide if we should hide or
expose the feature to guests, and will expose new system registers to
userspace.  That means that on hardware supporting pointer
authentication, with this patch, it's not possible to migrate to a
machine which doesn't have the support.  That's probably a good thing
(false sense of security etc.), but I wonder if we should have a
mechanism for userspace to ask for pointer authentication in the guest
and only if that's enabled, do we expose the feature to the guest and in
the system register list to user space as well?

> When the guest is scheduled on a
> physical CPU lacking the feature, these atetmps will result in an UNDEF

attempts

> being taken by the guest.
> 
> Signed-off-by: Mark Rutland 
> Cc: Christoffer Dall 
> Cc: Marc Zyngier 
> Cc: kvm...@lists.cs.columbia.edu
> ---
>  arch/arm64/include/asm/kvm_host.h | 23 +-
>  arch/arm64/include/asm/kvm_hyp.h  |  7 +++
>  arch/arm64/kvm/handle_exit.c  | 21 +
>  arch/arm64/kvm/hyp/Makefile   |  1 +
>  arch/arm64/kvm/hyp/ptrauth-sr.c   | 91 +++
>  arch/arm64/kvm/hyp/switch.c   |  4 ++
>  arch/arm64/kvm/sys_regs.c | 32 ++
>  7 files changed, 178 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/kvm/hyp/ptrauth-sr.c
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 39184aa3e2f2..2fc21a2a75a7 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -136,6 +136,18 @@ enum vcpu_sysreg {
>   PMSWINC_EL0,/* Software Increment Register */
>   PMUSERENR_EL0,  /* User Enable Register */
>  
> + /* Pointer Authentication Registers */
> + APIAKEYLO_EL1,
> + APIAKEYHI_EL1,
> + APIBKEYLO_EL1,
> + APIBKEYHI_EL1,
> + APDAKEYLO_EL1,
> + APDAKEYHI_EL1,
> + APDBKEYLO_EL1,
> + APDBKEYHI_EL1,
> + APGAKEYLO_EL1,
> + APGAKEYHI_EL1,
> +
>   /* 32bit specific registers. Keep them at the end of the range */
>   DACR32_EL2, /* Domain Access Control Register */
>   IFSR32_EL2, /* Instruction Fault Status Register */
> @@ -363,10 +375,19 @@ static inline void __cpu_init_hyp_mode(phys_addr_t 
> pgd_ptr,
>   __kvm_call_hyp((void *)pgd_ptr, hyp_stack_ptr, vector_ptr);
>  }
>  
> +void kvm_arm_vcpu_ptrauth_enable(struct kvm_vcpu *vcpu);
> +void kvm_arm_vcpu_ptrauth_disable(struct kvm_vcpu *vcpu);
> +void kvm_arm_vcpu_ptrauth_trap(struct kvm_vcpu *vcpu);
> +
>  static inline void kvm_arch_hardware_unsetup(void) {}
>  static inline void kvm_arch_sync_events(struct kvm *kvm) {}
>  static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
> -static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
> +
> +static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
> +{
> + kvm_arm_vcpu_ptrauth_disable(vcpu);
> +}
> +

I still find this decision to begin trapping again quite arbitrary, and
would at least prefer this to be in vcpu_load (which would make the

Re: [PATCH v3 08/18] arm/arm64: KVM: Add PSCI version selection API

2018-02-05 Thread Christoffer Dall
On Mon, Feb 05, 2018 at 10:42:44AM +, Marc Zyngier wrote:
> On 05/02/18 09:58, Andrew Jones wrote:
> > On Mon, Feb 05, 2018 at 09:24:33AM +, Marc Zyngier wrote:
> >> On 04/02/18 12:37, Christoffer Dall wrote:
> 
> [...]
> 
> >>> Given the urgency of adding mitigation towards variant 2 which is the
> >>> driver for this work, I think we should drop the compat functionality in
> >>> this series and work this out later on if needed.  I think we can just
> >>> tweak the previous patch to enable PSCI 1.0 by default and drop this
> >>> patch for the current merge window.
> >>
> >> I'd be fine with that, as long as we have a clear agreement on the
> >> impact of such a move.
> > 
> > Yeah, that's what I was trying to figure out with my fancy tables. I might
> > be coming around more to your approach now, though. Ensuring the new->old
> > migration fails is a nice feature of this series. It would be good if
> > we could preserve that behavior without committing to a new userspace
> > interface, but I'm not sure how. Maybe I should just apologize for the
> > noise, and this patch be left as is...
> 
> How about we don't decide now?
> 
> I can remove this patch from the series so that the core stuff can make
> it into the arm64 tree ASAP (I think Catalin wants to queue something
> early this week so that it can hit Linus' tree before the end of the
> merge window), and then repost this single patch on its own (with fixes
> for the things that Christoffer found in his review) after -rc1.
> 
> It leaves us time to haggle over the userspace ABI (which is
> realistically not going to affect anyone), and we get the core stuff in
> place for SoC vendors to start updating their firmware.
> 
I agree, that's what I tried to suggest in my e-mail as well.  Just
remember to tweak the previous patch to actually enable PSCI 1.0 by
default.

Thanks,
-Christoffer


Re: [PATCH v3 12/18] arm64: KVM: Add SMCCC_ARCH_WORKAROUND_1 fast handling

2018-02-05 Thread Christoffer Dall
On Mon, Feb 05, 2018 at 09:08:31AM +, Marc Zyngier wrote:
> On 04/02/18 18:39, Christoffer Dall wrote:
> > On Thu, Feb 01, 2018 at 11:46:51AM +, Marc Zyngier wrote:
> >> We want SMCCC_ARCH_WORKAROUND_1 to be fast. As fast as possible.
> >> So let's intercept it as early as we can by testing for the
> >> function call number as soon as we've identified a HVC call
> >> coming from the guest.
> > 
> > Hmmm.  How often is this expected to happen and what is the expected
> > extra cost of doing the early-exit handling in the C code vs. here?
> 
> Pretty often. On each context switch of a Linux guest, for example. It
> is almost as bad as if we were trapping all VM ops. Moving it to C is
> definitely visible on something like hackbench (I remember something
> like a 10-12% degradation on Seattle, but I'd need to rerun the tests to
> give you something accurate). 

If it's that easily visible (although hackbench is clearly the
pathological case here), then we should try to optimize it.  Let's hope
we don't have to add too many of these workarounds in the future.

> It is the whole GPR save/restore dance
> that costs us a lot (31 registers for the guest, 12 for the host), plus
> some the extra SError synchronization that doesn't come for free either.
> 

Fair enough.

> > I think we'd be better off if we only had a single early-exit path (and
> > we should move the FP/SIMD trap to that path as well), but if there's a
> > measurable benefit of having this logic in assembly as opposed to in the
> > C code, then I'm ok with this as well.
> 
> I agree that the multiplication of "earlier than early" paths is
> becoming annoying. Moving the FP/SIMD stuff to C would be less
> problematic, as we have patches to move some of that to load/put, and
> we'd only take the trap once per time slice (as opposed to once per
> entry at the moment).

Yes, and we can even improve on that (see separate discussions around
KVM support for SVE with Dave).

> 
> Here, we're trying hard to do exactly nothing, because each instruction
> is just an extra overhead (we've already nuked the BP). I even
> considered inserting that code as part of the per-CPU-type vectors (and
> leave the rest of the KVM code alone), but it felt like a step too far.
> 

We can always look at adjusting this more in the future if we want.

Reviewed-by: Christoffer Dall 


Re: [PATCH 1/2] ARM: kvm: fix building with gcc-8

2018-02-05 Thread Christoffer Dall
On Sun, Feb 04, 2018 at 09:57:49PM +0100, Arnd Bergmann wrote:
> On Sun, Feb 4, 2018 at 7:45 PM, Christoffer Dall wrote:
> > Hi Arnd,
> >
> > On Fri, Feb 02, 2018 at 04:07:34PM +0100, Arnd Bergmann wrote:
> >> In banked-sr.c, we use a top-level '__asm__(".arch_extension virt")'
> >> statement to allow compilation of a multi-CPU kernel for ARMv6
> >> and older ARMv7-A that don't normally support access to the banked
> >> registers.
> >>
> >> This is considered to be a programming error by the gcc developers
> >> and will no longer work in gcc-8, where we now get a build error:
> >>
> >> /tmp/cc4Qy7GR.s:34: Error: Banked registers are not available with this architecture. -- `mrs r3,SP_usr'
> >> /tmp/cc4Qy7GR.s:41: Error: Banked registers are not available with this architecture. -- `mrs r3,ELR_hyp'
> >> /tmp/cc4Qy7GR.s:55: Error: Banked registers are not available with this architecture. -- `mrs r3,SP_svc'
> >> /tmp/cc4Qy7GR.s:62: Error: Banked registers are not available with this architecture. -- `mrs r3,LR_svc'
> >> /tmp/cc4Qy7GR.s:69: Error: Banked registers are not available with this architecture. -- `mrs r3,SPSR_svc'
> >> /tmp/cc4Qy7GR.s:76: Error: Banked registers are not available with this architecture. -- `mrs r3,SP_abt'
> >>
> >> Passing the '-march=armv7ve' flag to gcc works, and is ok here, because
> >> we know the functions won't ever be called on pre-ARMv7VE machines.
> >> Unfortunately, older compiler versions (4.8 and earlier) do not understand
> >> that flag, so we still need to keep the asm around.
> >
> > Does "not understand" mean "ignores" or do we get an error?
> 
> We get an error, which is why I used the $(call cc-option) Makefile
> helper to check if the compiler supports it.
> 

Right.

> >> Backporting to stable kernels (4.6+) is needed to allow those to be built
> >> with future compilers as well.
> >
> > This builds on the toolchains I have on my machine, so:
> >
> > Acked-by: Christoffer Dall 
> >
> > Are you applying this via a tree with other fixes or would you like me
> > to carry it in the kvmarm tree?
> 
> Please pick it up in your tree.
> 
Will do.

Thanks,
-Christoffer


Re: [PATCH 1/2] ARM: kvm: fix building with gcc-8

2018-02-04 Thread Christoffer Dall
Hi Arnd,

On Fri, Feb 02, 2018 at 04:07:34PM +0100, Arnd Bergmann wrote:
> In banked-sr.c, we use a top-level '__asm__(".arch_extension virt")'
> statement to allow compilation of a multi-CPU kernel for ARMv6
> and older ARMv7-A that don't normally support access to the banked
> registers.
> 
> This is considered to be a programming error by the gcc developers
> and will no longer work in gcc-8, where we now get a build error:
> 
> /tmp/cc4Qy7GR.s:34: Error: Banked registers are not available with this architecture. -- `mrs r3,SP_usr'
> /tmp/cc4Qy7GR.s:41: Error: Banked registers are not available with this architecture. -- `mrs r3,ELR_hyp'
> /tmp/cc4Qy7GR.s:55: Error: Banked registers are not available with this architecture. -- `mrs r3,SP_svc'
> /tmp/cc4Qy7GR.s:62: Error: Banked registers are not available with this architecture. -- `mrs r3,LR_svc'
> /tmp/cc4Qy7GR.s:69: Error: Banked registers are not available with this architecture. -- `mrs r3,SPSR_svc'
> /tmp/cc4Qy7GR.s:76: Error: Banked registers are not available with this architecture. -- `mrs r3,SP_abt'
> 
> Passing the '-march=armv7ve' flag to gcc works, and is ok here, because
> we know the functions won't ever be called on pre-ARMv7VE machines.
> Unfortunately, older compiler versions (4.8 and earlier) do not understand
> that flag, so we still need to keep the asm around.

Does "not understand" mean "ignores" or do we get an error?

> 
> Backporting to stable kernels (4.6+) is needed to allow those to be built
> with future compilers as well.

This builds on the toolchains I have on my machine, so:

Acked-by: Christoffer Dall 

Are you applying this via a tree with other fixes or would you like me
to carry it in the kvmarm tree?

Thanks,
-Christoffer

> 
> Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84129
> Fixes: 33280b4cd1dc ("ARM: KVM: Add banked registers save/restore")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Arnd Bergmann 
> ---
>  arch/arm/kvm/hyp/Makefile    | 5 +
>  arch/arm/kvm/hyp/banked-sr.c | 4 
>  2 files changed, 9 insertions(+)
> 
> diff --git a/arch/arm/kvm/hyp/Makefile b/arch/arm/kvm/hyp/Makefile
> index 5638ce0c9524..63d6b404d88e 100644
> --- a/arch/arm/kvm/hyp/Makefile
> +++ b/arch/arm/kvm/hyp/Makefile
> @@ -7,6 +7,8 @@ ccflags-y += -fno-stack-protector -DDISABLE_BRANCH_PROFILING
>  
>  KVM=../../../../virt/kvm
>  
> +CFLAGS_ARMV7VE  :=$(call cc-option, -march=armv7ve)
> +
>  obj-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/hyp/vgic-v2-sr.o
>  obj-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/hyp/vgic-v3-sr.o
>  obj-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/hyp/timer-sr.o
> @@ -15,7 +17,10 @@ obj-$(CONFIG_KVM_ARM_HOST) += tlb.o
>  obj-$(CONFIG_KVM_ARM_HOST) += cp15-sr.o
>  obj-$(CONFIG_KVM_ARM_HOST) += vfp.o
>  obj-$(CONFIG_KVM_ARM_HOST) += banked-sr.o
> +CFLAGS_banked-sr.o  += $(CFLAGS_ARMV7VE)
> +
>  obj-$(CONFIG_KVM_ARM_HOST) += entry.o
>  obj-$(CONFIG_KVM_ARM_HOST) += hyp-entry.o
>  obj-$(CONFIG_KVM_ARM_HOST) += switch.o
> +CFLAGS_switch.o += $(CFLAGS_ARMV7VE)
>  obj-$(CONFIG_KVM_ARM_HOST) += s2-setup.o
> diff --git a/arch/arm/kvm/hyp/banked-sr.c b/arch/arm/kvm/hyp/banked-sr.c
> index 111bda8cdebd..be4b8b0a40ad 100644
> --- a/arch/arm/kvm/hyp/banked-sr.c
> +++ b/arch/arm/kvm/hyp/banked-sr.c
> @@ -20,6 +20,10 @@
>  
>  #include 
>  
> +/*
> + * gcc before 4.9 doesn't understand -march=armv7ve, so we have to
> + * trick the assembler.
> + */
>  __asm__(".arch_extension virt");
>  
>  void __hyp_text __banked_save_state(struct kvm_cpu_context *ctxt)
> -- 
> 2.9.0
> 


Re: [PATCH v3 12/18] arm64: KVM: Add SMCCC_ARCH_WORKAROUND_1 fast handling

2018-02-04 Thread Christoffer Dall
On Thu, Feb 01, 2018 at 11:46:51AM +, Marc Zyngier wrote:
> We want SMCCC_ARCH_WORKAROUND_1 to be fast. As fast as possible.
> So let's intercept it as early as we can by testing for the
> function call number as soon as we've identified a HVC call
> coming from the guest.

Hmmm.  How often is this expected to happen and what is the expected
extra cost of doing the early-exit handling in the C code vs. here?

I think we'd be better off if we only had a single early-exit path (and
we should move the FP/SIMD trap to that path as well), but if there's a
measurable benefit of having this logic in assembly as opposed to in the
C code, then I'm ok with this as well.

The code in this patch looks fine otherwise.

Thanks,
-Christoffer

> 
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/kvm/hyp/hyp-entry.S | 20 ++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/hyp/hyp-entry.S b/arch/arm64/kvm/hyp/hyp-entry.S
> index e4f37b9dd47c..f36464bd57c5 100644
> --- a/arch/arm64/kvm/hyp/hyp-entry.S
> +++ b/arch/arm64/kvm/hyp/hyp-entry.S
> @@ -15,6 +15,7 @@
>   * along with this program.  If not, see .
>   */
>  
> +#include 
>  #include 
>  
>  #include 
> @@ -64,10 +65,11 @@ alternative_endif
>   lsr x0, x1, #ESR_ELx_EC_SHIFT
>  
>   cmp x0, #ESR_ELx_EC_HVC64
> + ccmp x0, #ESR_ELx_EC_HVC32, #4, ne
>   b.neel1_trap
>  
> - mrs x1, vttbr_el2   // If vttbr is valid, the 64bit guest
> - cbnz x1, el1_trap   // called HVC
> + mrs x1, vttbr_el2   // If vttbr is valid, the guest
> + cbnz x1, el1_hvc_guest  // called HVC
>  
>   /* Here, we're pretty sure the host called HVC. */
>   ldp x0, x1, [sp], #16
> @@ -100,6 +102,20 @@ alternative_endif
>  
>   eret
>  
> +el1_hvc_guest:
> + /*
> +  * Fastest possible path for ARM_SMCCC_ARCH_WORKAROUND_1.
> +  * The workaround has already been applied on the host,
> +  * so let's quickly get back to the guest. We don't bother
> +  * restoring x1, as it can be clobbered anyway.
> +  */
> + ldr x1, [sp]    // Guest's x0
> + eor w1, w1, #ARM_SMCCC_ARCH_WORKAROUND_1
> + cbnz w1, el1_trap
> + mov x0, x1
> + add sp, sp, #16
> + eret
> +
>  el1_trap:
>   /*
>* x0: ESR_EC
> -- 
> 2.14.2
> 


Re: [PATCH v3 11/18] arm64: KVM: Report SMCCC_ARCH_WORKAROUND_1 BP hardening support

2018-02-04 Thread Christoffer Dall
On Thu, Feb 01, 2018 at 11:46:50AM +, Marc Zyngier wrote:
> A new feature of SMCCC 1.1 is that it offers firmware-based CPU
> workarounds. In particular, SMCCC_ARCH_WORKAROUND_1 provides
> BP hardening for CVE-2017-5715.
> 
> If the host has some mitigation for this issue, report that
> we deal with it using SMCCC_ARCH_WORKAROUND_1, as we apply the
> host workaround on every guest exit.

Reviewed-by: Christoffer Dall 

> 
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm/include/asm/kvm_host.h   | 7 +++
>  arch/arm64/include/asm/kvm_host.h | 6 ++
>  include/linux/arm-smccc.h | 5 +
>  virt/kvm/arm/psci.c   | 9 -
>  4 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index e9d57060d88c..6c05e3b13081 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -309,4 +309,11 @@ static inline void kvm_fpsimd_flush_cpu_state(void) {}
>  
>  static inline void kvm_arm_vhe_guest_enter(void) {}
>  static inline void kvm_arm_vhe_guest_exit(void) {}
> +
> +static inline bool kvm_arm_harden_branch_predictor(void)
> +{
> + /* No way to detect it yet, pretend it is not there. */
> + return false;
> +}
> +
>  #endif /* __ARM_KVM_HOST_H__ */
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 10af386642c6..448d3b9a58cb 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -418,4 +418,10 @@ static inline void kvm_arm_vhe_guest_exit(void)
>  {
>   local_daif_restore(DAIF_PROCCTX_NOIRQ);
>  }
> +
> +static inline bool kvm_arm_harden_branch_predictor(void)
> +{
> + return cpus_have_const_cap(ARM64_HARDEN_BRANCH_PREDICTOR);
> +}
> +
>  #endif /* __ARM64_KVM_HOST_H__ */
> diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
> index dc68aa5a7261..e1ef944ef1da 100644
> --- a/include/linux/arm-smccc.h
> +++ b/include/linux/arm-smccc.h
> @@ -73,6 +73,11 @@
>  ARM_SMCCC_SMC_32,\
>  0, 1)
>  
> +#define ARM_SMCCC_ARCH_WORKAROUND_1  \
> + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
> +ARM_SMCCC_SMC_32,\
> +0, 0x8000)
> +
>  #ifndef __ASSEMBLY__
>  
>  #include 
> diff --git a/virt/kvm/arm/psci.c b/virt/kvm/arm/psci.c
> index 2efacbe7b1a2..22c24561d07d 100644
> --- a/virt/kvm/arm/psci.c
> +++ b/virt/kvm/arm/psci.c
> @@ -406,13 +406,20 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
>  {
>   u32 func_id = smccc_get_function(vcpu);
>   u32 val = PSCI_RET_NOT_SUPPORTED;
> + u32 feature;
>  
>   switch (func_id) {
>   case ARM_SMCCC_VERSION_FUNC_ID:
>   val = ARM_SMCCC_VERSION_1_1;
>   break;
>   case ARM_SMCCC_ARCH_FEATURES_FUNC_ID:
> - /* Nothing supported yet */
> + feature = smccc_get_arg1(vcpu);
> + switch(feature) {
> + case ARM_SMCCC_ARCH_WORKAROUND_1:
> + if (kvm_arm_harden_branch_predictor())
> + val = 0;
> + break;
> + }
>   break;
>   default:
>   return kvm_psci_call(vcpu);
> -- 
> 2.14.2
> 


Re: [PATCH v3 10/18] arm/arm64: KVM: Turn kvm_psci_version into a static inline

2018-02-04 Thread Christoffer Dall
On Thu, Feb 01, 2018 at 11:46:49AM +, Marc Zyngier wrote:
> We're about to need kvm_psci_version in HYP too. So let's turn it
> into a static inline, and pass the kvm structure as a second
> parameter (so that HYP can do a kern_hyp_va on it).
> 

Reviewed-by: Christoffer Dall 

> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/kvm/hyp/switch.c | 20 
>  include/kvm/arm_psci.h  | 26 +-
>  virt/kvm/arm/psci.c | 25 +++--
>  3 files changed, 40 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> index 036e1f3d77a6..408c04d789a5 100644
> --- a/arch/arm64/kvm/hyp/switch.c
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -19,6 +19,8 @@
>  #include 
>  #include 
>  
> +#include 
> +
>  #include 
>  #include 
>  #include 
> @@ -350,14 +352,16 @@ int __hyp_text __kvm_vcpu_run(struct kvm_vcpu *vcpu)
>  
>   if (exit_code == ARM_EXCEPTION_TRAP &&
>   (kvm_vcpu_trap_get_class(vcpu) == ESR_ELx_EC_HVC64 ||
> -  kvm_vcpu_trap_get_class(vcpu) == ESR_ELx_EC_HVC32) &&
> - vcpu_get_reg(vcpu, 0) == PSCI_0_2_FN_PSCI_VERSION) {
> - u64 val = PSCI_RET_NOT_SUPPORTED;
> - if (test_bit(KVM_ARM_VCPU_PSCI_0_2, vcpu->arch.features))
> - val = 2;
> -
> - vcpu_set_reg(vcpu, 0, val);
> - goto again;
> +  kvm_vcpu_trap_get_class(vcpu) == ESR_ELx_EC_HVC32)) {
> + u32 val = vcpu_get_reg(vcpu, 0);
> +
> + if (val == PSCI_0_2_FN_PSCI_VERSION) {
> + val = kvm_psci_version(vcpu, kern_hyp_va(vcpu->kvm));
> + if (unlikely(val == KVM_ARM_PSCI_0_1))
> + val = PSCI_RET_NOT_SUPPORTED;
> + vcpu_set_reg(vcpu, 0, val);
> + goto again;
> + }
>   }
>  
>   if (static_branch_unlikely(&vgic_v2_cpuif_trap) &&
> diff --git a/include/kvm/arm_psci.h b/include/kvm/arm_psci.h
> index 7b2e12697d4f..9b699f91171f 100644
> --- a/include/kvm/arm_psci.h
> +++ b/include/kvm/arm_psci.h
> @@ -18,6 +18,7 @@
>  #ifndef __KVM_ARM_PSCI_H__
>  #define __KVM_ARM_PSCI_H__
>  
> +#include 
>  #include 
>  
>  #define KVM_ARM_PSCI_0_1 PSCI_VERSION(0, 1)
> @@ -26,7 +27,30 @@
>  
>  #define KVM_ARM_PSCI_LATEST  KVM_ARM_PSCI_1_0
>  
> -int kvm_psci_version(struct kvm_vcpu *vcpu);
> +/*
> + * We need the KVM pointer independently from the vcpu as we can call
> + * this from HYP, and need to apply kern_hyp_va on it...
> + */
> +static inline int kvm_psci_version(struct kvm_vcpu *vcpu, struct kvm *kvm)
> +{
> + /*
> +  * Our PSCI implementation stays the same across versions from
> +  * v0.2 onward, only adding the few mandatory functions (such
> +  * as FEATURES with 1.0) that are required by newer
> +  * revisions. It is thus safe to return the latest, unless
> +  * userspace has instructed us otherwise.
> +  */
> + if (test_bit(KVM_ARM_VCPU_PSCI_0_2, vcpu->arch.features)) {
> + if (kvm->arch.psci_version)
> + return kvm->arch.psci_version;
> +
> + return KVM_ARM_PSCI_LATEST;
> + }
> +
> + return KVM_ARM_PSCI_0_1;
> +}
> +
> +
>  int kvm_hvc_call_handler(struct kvm_vcpu *vcpu);
>  
>  struct kvm_one_reg;
> diff --git a/virt/kvm/arm/psci.c b/virt/kvm/arm/psci.c
> index 53272e8e0d37..2efacbe7b1a2 100644
> --- a/virt/kvm/arm/psci.c
> +++ b/virt/kvm/arm/psci.c
> @@ -124,7 +124,7 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
>   if (!vcpu)
>   return PSCI_RET_INVALID_PARAMS;
>   if (!vcpu->arch.power_off) {
> - if (kvm_psci_version(source_vcpu) != KVM_ARM_PSCI_0_1)
> + if (kvm_psci_version(source_vcpu, kvm) != KVM_ARM_PSCI_0_1)
>   return PSCI_RET_ALREADY_ON;
>   else
>   return PSCI_RET_INVALID_PARAMS;
> @@ -233,25 +233,6 @@ static void kvm_psci_system_reset(struct kvm_vcpu *vcpu)
>   kvm_prepare_system_event(vcpu, KVM_SYSTEM_EVENT_RESET);
>  }
>  
> -int kvm_psci_version(struct kvm_vcpu *vcpu)
> -{
> - /*
> -  * Our PSCI implementation stays the same across versions from
> -  * v0.2 onward, only adding the few mandatory functions (such
> -  * as FEATURES with 1.0) that are required by newer
> -  * revisions. It is thus safe to return the latest, unless
> -  * userspace has instructed us otherwise.
> -  */
> - if (test_bit(KVM_ARM_VCPU_PSCI_0_2, vcpu->ar

Re: [PATCH v3 09/18] arm/arm64: KVM: Advertise SMCCC v1.1

2018-02-04 Thread Christoffer Dall
On Thu, Feb 01, 2018 at 11:46:48AM +, Marc Zyngier wrote:
> The new SMC Calling Convention (v1.1) allows for a reduced overhead
> when calling into the firmware, and provides a new feature discovery
> mechanism.
> 
> Make it visible to KVM guests.
> 

Reviewed-by: Christoffer Dall 

> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm/kvm/handle_exit.c   |  2 +-
>  arch/arm64/kvm/handle_exit.c |  2 +-
>  include/kvm/arm_psci.h   |  2 +-
>  include/linux/arm-smccc.h| 13 +
>  virt/kvm/arm/psci.c  | 24 +++-
>  5 files changed, 39 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c
> index 230ae4079108..910bd8dabb3c 100644
> --- a/arch/arm/kvm/handle_exit.c
> +++ b/arch/arm/kvm/handle_exit.c
> @@ -36,7 +36,7 @@ static int handle_hvc(struct kvm_vcpu *vcpu, struct kvm_run *run)
> kvm_vcpu_hvc_get_imm(vcpu));
>   vcpu->stat.hvc_exit_stat++;
>  
> - ret = kvm_psci_call(vcpu);
> + ret = kvm_hvc_call_handler(vcpu);
>   if (ret < 0) {
>   vcpu_set_reg(vcpu, 0, ~0UL);
>   return 1;
> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
> index 588f910632a7..e5e741bfffe1 100644
> --- a/arch/arm64/kvm/handle_exit.c
> +++ b/arch/arm64/kvm/handle_exit.c
> @@ -52,7 +52,7 @@ static int handle_hvc(struct kvm_vcpu *vcpu, struct kvm_run *run)
>   kvm_vcpu_hvc_get_imm(vcpu));
>   vcpu->stat.hvc_exit_stat++;
>  
> - ret = kvm_psci_call(vcpu);
> + ret = kvm_hvc_call_handler(vcpu);
>   if (ret < 0) {
>   vcpu_set_reg(vcpu, 0, ~0UL);
>   return 1;
> diff --git a/include/kvm/arm_psci.h b/include/kvm/arm_psci.h
> index 4ee098c39e01..7b2e12697d4f 100644
> --- a/include/kvm/arm_psci.h
> +++ b/include/kvm/arm_psci.h
> @@ -27,7 +27,7 @@
>  #define KVM_ARM_PSCI_LATEST  KVM_ARM_PSCI_1_0
>  
>  int kvm_psci_version(struct kvm_vcpu *vcpu);
> -int kvm_psci_call(struct kvm_vcpu *vcpu);
> +int kvm_hvc_call_handler(struct kvm_vcpu *vcpu);
>  
>  struct kvm_one_reg;
>  
> diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
> index 4c5bca38c653..dc68aa5a7261 100644
> --- a/include/linux/arm-smccc.h
> +++ b/include/linux/arm-smccc.h
> @@ -60,6 +60,19 @@
>  #define ARM_SMCCC_QUIRK_NONE 0
>  #define ARM_SMCCC_QUIRK_QCOM_A6  1 /* Save/restore register a6 */
>  
> +#define ARM_SMCCC_VERSION_1_0    0x10000
> +#define ARM_SMCCC_VERSION_1_1    0x10001
> +
> +#define ARM_SMCCC_VERSION_FUNC_ID\
> + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
> +ARM_SMCCC_SMC_32,\
> +0, 0)
> +
> +#define ARM_SMCCC_ARCH_FEATURES_FUNC_ID\
> + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
> +ARM_SMCCC_SMC_32,\
> +0, 1)
> +
>  #ifndef __ASSEMBLY__
>  
>  #include 
> diff --git a/virt/kvm/arm/psci.c b/virt/kvm/arm/psci.c
> index 5c8366b71639..53272e8e0d37 100644
> --- a/virt/kvm/arm/psci.c
> +++ b/virt/kvm/arm/psci.c
> @@ -15,6 +15,7 @@
>   * along with this program.  If not, see <http://www.gnu.org/licenses/>.
>   */
>  
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -351,6 +352,7 @@ static int kvm_psci_1_0_call(struct kvm_vcpu *vcpu)
>   case PSCI_0_2_FN_SYSTEM_OFF:
>   case PSCI_0_2_FN_SYSTEM_RESET:
>   case PSCI_1_0_FN_PSCI_FEATURES:
> + case ARM_SMCCC_VERSION_FUNC_ID:
>   val = 0;
>   break;
>   default:
> @@ -405,7 +407,7 @@ static int kvm_psci_0_1_call(struct kvm_vcpu *vcpu)
>   * Errors:
>   * -EINVAL: Unrecognized PSCI function
>   */
> -int kvm_psci_call(struct kvm_vcpu *vcpu)
> +static int kvm_psci_call(struct kvm_vcpu *vcpu)
>  {
>   switch (kvm_psci_version(vcpu)) {
>   case KVM_ARM_PSCI_1_0:
> @@ -419,6 +421,26 @@ int kvm_psci_call(struct kvm_vcpu *vcpu)
>   };
>  }
>  
> +int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
> +{
> + u32 func_id = smccc_get_function(vcpu);
> + u32 val = PSCI_RET_NOT_SUPPORTED;
> +
> + switch (func_id) {
> + case ARM_SMCCC_VERSION_FUNC_ID:
> + val = ARM_SMCCC_VERSION_1_1;
> + break;
> + case ARM_SMCCC_ARCH_FEATURES_FUNC_ID:
> + /* Nothing supported yet */
> + break;
> + default:
> + return kvm_psci_call(vcpu);
> + }
> +
> + smccc_set_retval(vcpu, val, 0, 0, 0);
> + return 1;
> +}
> +
>  int kvm_arm_get_fw_num_regs(struct kvm_vcpu *vcpu)
>  {
>   return 1;   /* PSCI version */
> -- 
> 2.14.2
> 


Re: [PATCH v3 08/18] arm/arm64: KVM: Add PSCI version selection API

2018-02-04 Thread Christoffer Dall
Hi Marc,

[ I know we're discussing the overall approach in parallel, but here are
  some comments on the specifics of this patch, should it end up being
  used in some capacity ]

On Thu, Feb 01, 2018 at 11:46:47AM +, Marc Zyngier wrote:
> Although we've implemented PSCI 1.0 and 1.1, nothing can select them.
> Since all the new PSCI versions are backward compatible, we decide to
> default to the latest version of the PSCI implementation. This is no
> different from doing a firmware upgrade on KVM.
>
> But in order to give a chance to hypothetical badly implemented guests
> that would have a fit by discovering something other than PSCI 0.2,
> let's provide a new API that allows userspace to pick one particular
> version of the API.
>
> This is implemented as a new class of "firmware" registers, where
> we expose the PSCI version. This allows the PSCI version to be
> save/restored as part of a guest migration, and also set to
> any supported version if the guest requires it.
>
> Signed-off-by: Marc Zyngier 
> ---
>  Documentation/virtual/kvm/api.txt  |  3 +-
>  Documentation/virtual/kvm/arm/psci.txt | 30 +++
>  arch/arm/include/asm/kvm_host.h|  3 ++
>  arch/arm/include/uapi/asm/kvm.h|  6 +++
>  arch/arm/kvm/guest.c   | 13 +++
>  arch/arm64/include/asm/kvm_host.h  |  3 ++
>  arch/arm64/include/uapi/asm/kvm.h  |  6 +++
>  arch/arm64/kvm/guest.c | 14 ++-
>  include/kvm/arm_psci.h |  9 +
>  virt/kvm/arm/psci.c| 68 +-
>  10 files changed, 151 insertions(+), 4 deletions(-)
>  create mode 100644 Documentation/virtual/kvm/arm/psci.txt
>
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 57d3ee9e4bde..334905202141 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2493,7 +2493,8 @@ Possible features:
>and execute guest code when KVM_RUN is called.
>   - KVM_ARM_VCPU_EL1_32BIT: Starts the CPU in a 32bit mode.
>Depends on KVM_CAP_ARM_EL1_32BIT (arm64 only).
> - - KVM_ARM_VCPU_PSCI_0_2: Emulate PSCI v0.2 for the CPU.
> + - KVM_ARM_VCPU_PSCI_0_2: Emulate PSCI v0.2 (or a future revision
> +  backward compatible with v0.2) for the CPU.
>Depends on KVM_CAP_ARM_PSCI_0_2.
>   - KVM_ARM_VCPU_PMU_V3: Emulate PMUv3 for the CPU.
>Depends on KVM_CAP_ARM_PMU_V3.

Can we add this to api.txt as well ?:

8><--

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index fc3ae951bc07..c88aa04bcbe8 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1959,6 +1959,8 @@ arm64 CCSIDR registers are demultiplexed by CSSELR value:
 arm64 system registers have the following id bit patterns:
  0x6030 0000 0013 <op0:2> <op1:3> <crn:4> <crm:4> <op2:3>
 
+ARM/arm64 firmware pseudo-registers have the following bit pattern:
+  0x4030 0000 0014 <regno:16>
 
 MIPS registers are mapped using the lower 32 bits.  The upper 16 of that is
 the register group type:

8><--

> diff --git a/Documentation/virtual/kvm/arm/psci.txt 
> b/Documentation/virtual/kvm/arm/psci.txt
> new file mode 100644
> index ..aafdab887b04
> --- /dev/null
> +++ b/Documentation/virtual/kvm/arm/psci.txt
> @@ -0,0 +1,30 @@
> +KVM implements the PSCI (Power State Coordination Interface)
> +specification in order to provide services such as CPU on/off, reset
> +and power-off to the guest.
> +
> +The PSCI specification is regularly updated to provide new features,
> +and KVM implements these updates if they make sense from a virtualization
> +point of view.
> +
> +This means that a guest booted on two different versions of KVM can
> +observe two different "firmware" revisions. This could cause issues if
> +a given guest is tied to a particular PSCI revision (unlikely), or if
> +a migration causes a different PSCI version to be exposed out of the
> +blue to an unsuspecting guest.
> +
> +In order to remedy this situation, KVM exposes a set of "firmware
> +pseudo-registers" that can be manipulated using the GET/SET_ONE_REG
> +interface. These registers can be saved/restored by userspace, and set
> +to a convenient value if required.
> +
> +The following register is defined:
> +
> +* KVM_REG_ARM_PSCI_VERSION:
> +
> +  - Only valid if the vcpu has the KVM_ARM_VCPU_PSCI_0_2 feature set
> +(and thus has already been initialized)
> +  - Returns the current PSCI version on GET_ONE_REG (defaulting to the
> +highest PSCI version implemented by KVM and compatible with v0.2)
> +  - Allows any PSCI version implemented by KVM and compatible with
> +v0.2 to be set with SET_ONE_REG
> +  - Affects the whole VM (even if the register view is per-vcpu)
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index acbf9ec7b396..e9d57060d88c 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -75,6

Re: [PATCH v3 08/18] arm/arm64: KVM: Add PSCI version selection API

2018-02-04 Thread Christoffer Dall
On Sat, Feb 03, 2018 at 11:59:32AM +, Marc Zyngier wrote:
> On Fri, 2 Feb 2018 21:17:06 +0100
> Andrew Jones  wrote:
> 
> > On Thu, Feb 01, 2018 at 11:46:47AM +, Marc Zyngier wrote:
> > > Although we've implemented PSCI 1.0 and 1.1, nothing can select them.
> > > Since all the new PSCI versions are backward compatible, we decide to
> > > default to the latest version of the PSCI implementation. This is no
> > > different from doing a firmware upgrade on KVM.
> > > 
> > > But in order to give a chance to hypothetical badly implemented guests
> > > that would have a fit by discovering something other than PSCI 0.2,
> > > let's provide a new API that allows userspace to pick one particular
> > > version of the API.
> > > 
> > > This is implemented as a new class of "firmware" registers, where
> > > we expose the PSCI version. This allows the PSCI version to be
> > > save/restored as part of a guest migration, and also set to
> > > any supported version if the guest requires it.
> > > 
> > > Signed-off-by: Marc Zyngier 
> > > ---
> > >  Documentation/virtual/kvm/api.txt  |  3 +-
> > >  Documentation/virtual/kvm/arm/psci.txt | 30 +++
> > >  arch/arm/include/asm/kvm_host.h|  3 ++
> > >  arch/arm/include/uapi/asm/kvm.h|  6 +++
> > >  arch/arm/kvm/guest.c   | 13 +++
> > >  arch/arm64/include/asm/kvm_host.h  |  3 ++
> > >  arch/arm64/include/uapi/asm/kvm.h  |  6 +++
> > >  arch/arm64/kvm/guest.c | 14 ++-
> > >  include/kvm/arm_psci.h |  9 +
> > >  virt/kvm/arm/psci.c| 68 +-
> > >  10 files changed, 151 insertions(+), 4 deletions(-)
> > >  create mode 100644 Documentation/virtual/kvm/arm/psci.txt
> > > 
> > > diff --git a/Documentation/virtual/kvm/api.txt 
> > > b/Documentation/virtual/kvm/api.txt
> > > index 57d3ee9e4bde..334905202141 100644
> > > --- a/Documentation/virtual/kvm/api.txt
> > > +++ b/Documentation/virtual/kvm/api.txt
> > > @@ -2493,7 +2493,8 @@ Possible features:
> > > and execute guest code when KVM_RUN is called.
> > >   - KVM_ARM_VCPU_EL1_32BIT: Starts the CPU in a 32bit mode.
> > > Depends on KVM_CAP_ARM_EL1_32BIT (arm64 only).
> > > - - KVM_ARM_VCPU_PSCI_0_2: Emulate PSCI v0.2 for the CPU.
> > > + - KVM_ARM_VCPU_PSCI_0_2: Emulate PSCI v0.2 (or a future revision
> > > +  backward compatible with v0.2) for the CPU.
> > > Depends on KVM_CAP_ARM_PSCI_0_2.
> > >   - KVM_ARM_VCPU_PMU_V3: Emulate PMUv3 for the CPU.
> > > Depends on KVM_CAP_ARM_PMU_V3.
> > > diff --git a/Documentation/virtual/kvm/arm/psci.txt 
> > > b/Documentation/virtual/kvm/arm/psci.txt
> > > new file mode 100644
> > > index ..aafdab887b04
> > > --- /dev/null
> > > +++ b/Documentation/virtual/kvm/arm/psci.txt
> > > @@ -0,0 +1,30 @@
> > > +KVM implements the PSCI (Power State Coordination Interface)
> > > +specification in order to provide services such as CPU on/off, reset
> > > +and power-off to the guest.
> > > +
> > > +The PSCI specification is regularly updated to provide new features,
> > > +and KVM implements these updates if they make sense from a virtualization
> > > +point of view.
> > > +
> > > +This means that a guest booted on two different versions of KVM can
> > > +observe two different "firmware" revisions. This could cause issues if
> > > +a given guest is tied to a particular PSCI revision (unlikely), or if
> > > +a migration causes a different PSCI version to be exposed out of the
> > > +blue to an unsuspecting guest.
> > > +
> > > +In order to remedy this situation, KVM exposes a set of "firmware
> > > +pseudo-registers" that can be manipulated using the GET/SET_ONE_REG
> > > +interface. These registers can be saved/restored by userspace, and set
> > > +to a convenient value if required.
> > > +
> > > +The following register is defined:
> > > +
> > > +* KVM_REG_ARM_PSCI_VERSION:
> > > +
> > > +  - Only valid if the vcpu has the KVM_ARM_VCPU_PSCI_0_2 feature set
> > > +(and thus has already been initialized)
> > > +  - Returns the current PSCI version on GET_ONE_REG (defaulting to the
> > > +highest PSCI version implemented by KVM and compatible with v0.2)
> > > +  - Allows any PSCI version implemented by KVM and compatible with
> > > +v0.2 to be set with SET_ONE_REG
> > > +  - Affects the whole VM (even if the register view is per-vcpu)  
> > 
> 
> Hi Drew,
> 
> Thanks for looking into this, and for the exhaustive data.
> 
> > 
> > I've put some more thought and experimentation into this. I think we
> > should change to a vcpu feature bit. The feature bit would be used to
> > force compat mode, v0.2, so KVM would still enable the new PSCI
> > version by default. Below are two tables describing why I think we
> > should switch to something other than a new sysreg, and below those
> > tables are notes as to why I think we should use a vcpu feature. The
> > asterisks in the tables point out behaviors that ar
