Re: [PATCH 1/3] KVM: arm64: Narrow PMU sysreg reset values to architectural requirements

2021-07-15 Thread Robin Murphy

On 2021-07-15 12:11, Marc Zyngier wrote:

Hi Alex,

On Wed, 14 Jul 2021 16:48:07 +0100,
Alexandru Elisei  wrote:


Hi Marc,

On 7/13/21 2:58 PM, Marc Zyngier wrote:

A number of the PMU sysregs expose reset values that are not
compliant with the architecture (set bits in the RES0 ranges,
for example).

This in turn has the effect that we need to pointlessly mask
some registers when using them.

Let's start by making sure we don't have illegal values in the
shadow registers at reset time. This affects all the registers
that dedicate one bit per counter, the counters themselves,
PMEVTYPERn_EL0 and PMSELR_EL0.

Reported-by: Alexandre Chartre 
Signed-off-by: Marc Zyngier 
---
  arch/arm64/kvm/sys_regs.c | 46 ---
  1 file changed, 43 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index f6f126eb6ac1..95ccb8f45409 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -603,6 +603,44 @@ static unsigned int pmu_visibility(const struct kvm_vcpu 
*vcpu,
return REG_HIDDEN;
  }
  
+static void reset_pmu_reg(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)

+{
+   u64 n, mask;
+
+   /* No PMU available, any PMU reg may UNDEF... */
+   if (!kvm_arm_support_pmu_v3())
+   return;
+
+   n = read_sysreg(pmcr_el0) >> ARMV8_PMU_PMCR_N_SHIFT;


Isn't this going to cause a lot of unnecessary traps with NV? Is
that going to be a problem?


We'll get new traps at L2 VM creation if we expose a PMU to the L1
guest, and if L2 gets one too. I don't think that's a real problem, as
the performance of an L2 PMU is bound to be hilarious, and if we are
really worried about that, we can always cache it locally. Which is
likely the best thing to do if you think of big-little.

Let's not think of big-little.

Another thing is that we could perfectly ignore the number of counters
on the host and always expose the architectural maximum, given that
the PMU is completely emulated. With that, no trap.
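
For illustration only, that could boil down to something like this (a
sketch, not what the patch currently does; the helper name is made up):

static u64 kvm_pmu_emulated_counter_mask(void)
{
	/*
	 * Ignore the host's PMCR_EL0.N entirely: always advertise the
	 * architectural maximum of 31 event counters plus the cycle
	 * counter, so there is no pmcr_el0 access and hence no trap.
	 */
	return BIT(ARMV8_PMU_CYCLE_IDX) |
	       GENMASK(ARMV8_PMU_CYCLE_IDX - 1, 0);
}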


Although that would deliberately exacerbate the existing problem of 
guest counters mysteriously under-reporting due to the host event 
getting multiplexed, thus arguably making the L2 PMU even less useful.


But then trying to analyse application performance under NV at all seems 
to stand a high chance of being akin to shovelling fog, so...


Robin.


Re: Any way to disable KVM VHE extension?

2021-07-15 Thread Robin Murphy

On 2021-07-15 10:44, Qu Wenruo wrote:



On 2021/7/15 下午5:28, Robin Murphy wrote:

On 2021-07-15 09:55, Qu Wenruo wrote:

Hi,

Recently I've been playing around with the Nvidia Xavier AGX board, which 
has VHE extension support.


In theory, considering the CPU and memory, it should be pretty 
powerful compared to boards like RPI CM4.


But to my surprise, KVM runs pretty poorly on Xavier.

Just booting the edk2 firmware can take over 10s, and it takes 20s to fully 
boot the kernel.
Even my VM on the RPI CM4 has a much faster boot time, despite running on 
just a PCIe 2.0 x1 lane NVMe and just four 2.1GHz A72 cores.


This is definitely not what I expected; I double-checked to be sure 
that it's running in KVM mode.


But further digging shows that, since the Xavier AGX CPU supports VHE, 
KVM is running in VHE mode rather than the HYP mode used on the CM4.


Is there any way to manually disable VHE mode to test the more common 
HYP mode on Xavier?


According to kernel-parameters.txt, "kvm-arm.mode=nvhe" (or its 
low-level equivalent "id_aa64mmfr1.vh=0") on the command line should 
do that.


Thanks for this one, I stupidly only searched modinfo of kvm, and didn't 
even bother to search arch/arm64/kvm...




However I'd imagine the discrepancy is likely to be something more 
fundamental to the wildly different microarchitectures. There's 
certainly no harm in giving non-VHE a go for comparison, but I 
wouldn't be surprised if it turns out even slower...


You're totally right, with nvhe mode, it's still the same slow speed.

BTW, what did you mean by the "wildly different microarch"?
Is ARMv8.2 arch that different from ARMv8 of RPI4?


I don't mean Armv8.x architectural features, I mean the actual 
implementation of NVIDIA's Carmel core is very, very different from 
Cortex-A72 or indeed our newer v8.2 Cortex-A designs.



Are there any extra methods I could try to explore the reason for the slowness?


I guess the first check would be whether you're trapping and exiting the 
VM significantly more. I believe there are stats somewhere, but I don't 
know exactly where, sorry - I know very little about actually *using* KVM :)


If it's not that, then it might just be that EDK2 is doing a lot of 
cache maintenance or system register modification or some other 
operation that happens to be slower on Carmel compared to Cortex-A72.


Robin.


At least the RPI CM4 has exceeded my expectations and is working fine.

Thanks,
Qu



Robin.

BTW, this is the dmesg related to KVM on Xavier, running v5.13 
upstream kernel, with 64K page size:

[    0.852357] kvm [1]: IPA Size Limit: 40 bits
[    0.857378] kvm [1]: vgic interrupt IRQ9
[    0.862122] kvm: pmu event creation failed -2
[    0.866734] kvm [1]: VHE mode initialized successfully

While on CM4, the host runs v5.12.10 upstream kernel (with downstream 
dtb), with 4K page size:

[    1.276818] kvm [1]: IPA Size Limit: 44 bits
[    1.278425] kvm [1]: vgic interrupt IRQ9
[    1.278620] kvm [1]: Hyp mode initialized successfully

Could it be the page size causing the problem?

Thanks,
Qu




Re: Any way to disable KVM VHE extension?

2021-07-15 Thread Robin Murphy

On 2021-07-15 09:55, Qu Wenruo wrote:

Hi,

Recently I've been playing around with the Nvidia Xavier AGX board, which has VHE 
extension support.


In theory, considering the CPU and memory, it should be pretty powerful 
compared to boards like RPI CM4.


But to my surprise, KVM runs pretty poorly on Xavier.

Just booting the edk2 firmware can take over 10s, and it takes 20s to fully 
boot the kernel.
Even my VM on the RPI CM4 has a much faster boot time, despite running on 
just a PCIe 2.0 x1 lane NVMe and just four 2.1GHz A72 cores.


This is definitely not what I expected; I double-checked to be sure 
that it's running in KVM mode.


But further digging shows that, since the Xavier AGX CPU supports VHE, KVM 
is running in VHE mode rather than the HYP mode used on the CM4.


Is there any way to manually disable VHE mode to test the more common HYP 
mode on Xavier?


According to kernel-parameters.txt, "kvm-arm.mode=nvhe" (or its 
low-level equivalent "id_aa64mmfr1.vh=0") on the command line should do 
that.


However I'd imagine the discrepancy is likely to be something more 
fundamental to the wildly different microarchitectures. There's 
certainly no harm in giving non-VHE a go for comparison, but I wouldn't 
be surprised if it turns out even slower...


Robin.

BTW, this is the dmesg related to KVM on Xavier, running v5.13 upstream 
kernel, with 64K page size:

[    0.852357] kvm [1]: IPA Size Limit: 40 bits
[    0.857378] kvm [1]: vgic interrupt IRQ9
[    0.862122] kvm: pmu event creation failed -2
[    0.866734] kvm [1]: VHE mode initialized successfully

While on CM4, the host runs v5.12.10 upstream kernel (with downstream 
dtb), with 4K page size:

[    1.276818] kvm [1]: IPA Size Limit: 44 bits
[    1.278425] kvm [1]: vgic interrupt IRQ9
[    1.278620] kvm [1]: Hyp mode initialized successfully

Could it be the page size causing the problem?

Thanks,
Qu




Re: [PATCH v2] KVM: arm64: Disabling disabled PMU counters wastes a lot of time

2021-07-12 Thread Robin Murphy

On 2021-07-12 16:51, Alexandru Elisei wrote:

Hi Robin,

On 7/12/21 4:44 PM, Robin Murphy wrote:

On 2021-07-12 16:17, Alexandre Chartre wrote:

In a KVM guest on arm64, performance counter interrupts have an
unnecessary overhead which slows down execution when using the "perf
record" command and limits the "perf record" sampling period.

The problem is that when a guest VM disables counters by clearing the
PMCR_EL0.E bit (bit 0), KVM will disable all counters defined in
PMCR_EL0 even if they are not enabled in PMCNTENSET_EL0.

KVM disables a counter by calling into the perf framework, in particular
by calling perf_event_create_kernel_counter() which is a time consuming
operation. So, for example, with a Neoverse N1 CPU core which has 6 event
counters and one cycle counter, KVM will always disable all 7 counters
even if only one is enabled.

This typically happens when using the "perf record" command in a guest
VM: perf will disable all event counters with PMCNTENSET_EL0 and only
use the cycle counter. And when using the "perf record" -F option with
a high profiling frequency, the overhead of KVM disabling all counters
instead of one on every counter interrupt becomes very noticeable.

The problem is fixed by having KVM disable only counters which are
enabled in PMCNTENSET_EL0. If a counter is not enabled in PMCNTENSET_EL0
then KVM will not enable it when setting PMCR_EL0.E and it will remain
disabled as long as it is not enabled in PMCNTENSET_EL0. So there is
effectively no need to disable a counter when clearing PMCR_EL0.E if it
is not enabled in PMCNTENSET_EL0.

Signed-off-by: Alexandre Chartre 
---
The patch is based on
https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pmu/reset-values

   arch/arm64/kvm/pmu-emul.c | 8 +---
   1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index fae4e95b586c..1f317c3dac61 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -563,21 +563,23 @@ void kvm_pmu_software_increment(struct kvm_vcpu *vcpu,
u64 val)
    */
   void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u64 val)
   {
-    unsigned long mask = kvm_pmu_valid_counter_mask(vcpu);
+    unsigned long mask;
   int i;
     if (val & ARMV8_PMU_PMCR_E) {
   kvm_pmu_enable_counter_mask(vcpu,
  __vcpu_sys_reg(vcpu, PMCNTENSET_EL0));
   } else {
-    kvm_pmu_disable_counter_mask(vcpu, mask);
+    kvm_pmu_disable_counter_mask(vcpu,
+   __vcpu_sys_reg(vcpu, PMCNTENSET_EL0));
   }
     if (val & ARMV8_PMU_PMCR_C)
   kvm_pmu_set_counter_value(vcpu, ARMV8_PMU_CYCLE_IDX, 0);
     if (val & ARMV8_PMU_PMCR_P) {
-    mask &= ~BIT(ARMV8_PMU_CYCLE_IDX);
+    mask = kvm_pmu_valid_counter_mask(vcpu)
+    & BIT(ARMV8_PMU_CYCLE_IDX);


This looks suspiciously opposite of what it replaces;


It always sets the bit, which goes against the architecture and the code it was
replacing, yes.


however did we even need to do a bitwise operation here in the first place?
Couldn't we skip the cycle counter by just limiting the for_each_set_bit
iteration below to 31 bits?


To quote myself [1]:

"Entertained the idea of restricting the number of bits in for_each_set_bit() to
31 since Linux (and the architecture, to some degree) treats the cycle count
register as the 32nd event counter.


FWIW I wouldn't say there's any degree to it - we're iterating over the 
bits in a register where the cycle counter enable is unequivocally the 
32nd bit.



Settled on this approach because I think it's
clearer."

To expand on that, incorrectly resetting the cycle counter was introduced by a
refactoring, so I preferred making it very clear that PMCR_EL0.P is not supposed
to clear the cycle counter.


Fair enough, but if this has turned out to be a contentious hot path 
then masking the bit to zero and then deliberately iterating to see if 
it's set (find_next_bit() isn't exactly free) adds up to more overhead 
than a comment ;)
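
Concretely, the alternative I had in mind is just (sketch only, reusing
the helpers already used by the patch):

	if (val & ARMV8_PMU_PMCR_P) {
		unsigned long mask = kvm_pmu_valid_counter_mask(vcpu);

		/*
		 * Walk bits 0..30 only; bit 31 (the cycle counter) is
		 * simply never visited, so no masking and no wasted
		 * find_next_bit() on it.
		 */
		for_each_set_bit(i, &mask, ARMV8_PMU_CYCLE_IDX)
			kvm_pmu_set_counter_value(vcpu, i, 0);
	}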


Robin.



[1] 
https://lore.kernel.org/kvmarm/20210618105139.83795-1-alexandru.eli...@arm.com/

Thanks,

Alex



Robin.


   for_each_set_bit(i, &mask, 32)
   kvm_pmu_set_counter_value(vcpu, i, 0);
   }

base-commit: 83f870a663592797c576846db3611e0a1664eda2




Re: [PATCH v2] KVM: arm64: Disabling disabled PMU counters wastes a lot of time

2021-07-12 Thread Robin Murphy

On 2021-07-12 16:17, Alexandre Chartre wrote:

In a KVM guest on arm64, performance counter interrupts have an
unnecessary overhead which slows down execution when using the "perf
record" command and limits the "perf record" sampling period.

The problem is that when a guest VM disables counters by clearing the
PMCR_EL0.E bit (bit 0), KVM will disable all counters defined in
PMCR_EL0 even if they are not enabled in PMCNTENSET_EL0.

KVM disables a counter by calling into the perf framework, in particular
by calling perf_event_create_kernel_counter() which is a time consuming
operation. So, for example, with a Neoverse N1 CPU core which has 6 event
counters and one cycle counter, KVM will always disable all 7 counters
even if only one is enabled.

This typically happens when using the "perf record" command in a guest
VM: perf will disable all event counters with PMCNTENSET_EL0 and only
use the cycle counter. And when using the "perf record" -F option with
a high profiling frequency, the overhead of KVM disabling all counters
instead of one on every counter interrupt becomes very noticeable.

The problem is fixed by having KVM disable only counters which are
enabled in PMCNTENSET_EL0. If a counter is not enabled in PMCNTENSET_EL0
then KVM will not enable it when setting PMCR_EL0.E and it will remain
disabled as long as it is not enabled in PMCNTENSET_EL0. So there is
effectively no need to disable a counter when clearing PMCR_EL0.E if it
is not enabled in PMCNTENSET_EL0.

Signed-off-by: Alexandre Chartre 
---
The patch is based on 
https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pmu/reset-values

  arch/arm64/kvm/pmu-emul.c | 8 +---
  1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index fae4e95b586c..1f317c3dac61 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -563,21 +563,23 @@ void kvm_pmu_software_increment(struct kvm_vcpu *vcpu, 
u64 val)
   */
  void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u64 val)
  {
-   unsigned long mask = kvm_pmu_valid_counter_mask(vcpu);
+   unsigned long mask;
int i;
  
  	if (val & ARMV8_PMU_PMCR_E) {

kvm_pmu_enable_counter_mask(vcpu,
   __vcpu_sys_reg(vcpu, PMCNTENSET_EL0));
} else {
-   kvm_pmu_disable_counter_mask(vcpu, mask);
+   kvm_pmu_disable_counter_mask(vcpu,
+  __vcpu_sys_reg(vcpu, PMCNTENSET_EL0));
}
  
  	if (val & ARMV8_PMU_PMCR_C)

kvm_pmu_set_counter_value(vcpu, ARMV8_PMU_CYCLE_IDX, 0);
  
  	if (val & ARMV8_PMU_PMCR_P) {

-   mask &= ~BIT(ARMV8_PMU_CYCLE_IDX);
+   mask = kvm_pmu_valid_counter_mask(vcpu)
+   & BIT(ARMV8_PMU_CYCLE_IDX);


This looks suspiciously opposite of what it replaces; however did we 
even need to do a bitwise operation here in the first place? Couldn't we 
skip the cycle counter by just limiting the for_each_set_bit iteration 
below to 31 bits?


Robin.


for_each_set_bit(i, &mask, 32)
kvm_pmu_set_counter_value(vcpu, i, 0);
}

base-commit: 83f870a663592797c576846db3611e0a1664eda2




Re: [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM

2021-02-09 Thread Robin Murphy

On 2021-02-09 11:57, Yi Sun wrote:

On 21-02-07 18:40:36, Keqian Zhu wrote:

Hi Yi,

On 2021/2/7 17:56, Yi Sun wrote:

Hi,

On 21-01-28 23:17:41, Keqian Zhu wrote:

[...]


+static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
+struct vfio_dma *dma)
+{
+   struct vfio_domain *d;
+
+   list_for_each_entry(d, &iommu->domain_list, next) {
+   /* Go through all domain anyway even if we fail */
+   iommu_split_block(d->domain, dma->iova, dma->size);
+   }
+}


This should be a switch to prepare for dirty log start. Per the Intel
VT-d spec, there is SLADE defined in the Scalable-Mode PASID Table Entry.
It enables Accessed/Dirty Flags in second-level paging entries.
So a generic iommu interface here is better. For the Intel iommu, it
enables SLADE. For ARM, it splits blocks.

Indeed, a generic interface name is better.

The vendor iommu driver performs vendor-specific actions to start dirty logging, and 
the Intel iommu and ARM smmu may differ. Besides, we may add more actions in the ARM 
smmu driver in the future.

One question: though I am not familiar with the Intel iommu, I think it also should 
split block mappings besides enabling SLADE. Right?


I am not familiar with the ARM smmu. :) So I want to clarify whether the block
in smmu is a big page, e.g. a 2M page? Intel VT-d manages memory per
page, 4KB/2MB/1GB.


Indeed, what you call large pages, we call blocks :)

Robin.


There are two ways to manage dirty pages.
1. Keep the default granularity. Just set SLADE to enable dirty tracking.
2. Split big pages into 4KB pages to get finer granularity.

But the question about the second solution is whether it can benefit user
space, e.g. live migration. If my understanding of the smmu block (i.e.
the big page) is correct, have you collected some performance data to
prove that the split can improve performance? Thanks!


Thanks,
Keqian



Re: [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM

2021-02-09 Thread Robin Murphy

On 2021-02-07 09:56, Yi Sun wrote:

Hi,

On 21-01-28 23:17:41, Keqian Zhu wrote:

[...]


+static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
+struct vfio_dma *dma)
+{
+   struct vfio_domain *d;
+
+   list_for_each_entry(d, &iommu->domain_list, next) {
+   /* Go through all domain anyway even if we fail */
+   iommu_split_block(d->domain, dma->iova, dma->size);
+   }
+}


This should be a switch to prepare for dirty log start. Per the Intel
VT-d spec, there is SLADE defined in the Scalable-Mode PASID Table Entry.
It enables Accessed/Dirty Flags in second-level paging entries.
So a generic iommu interface here is better. For the Intel iommu, it
enables SLADE. For ARM, it splits blocks.


From a quick look, VT-D's SLADE and SMMU's HTTU appear to be the exact 
same thing. This step isn't about enabling or disabling that feature 
itself (the proposal for SMMU is to simply leave HTTU enabled all the 
time), it's about controlling the granularity at which the dirty status 
can be detected/reported at all, since that's tied to the pagetable 
structure.


However, if an IOMMU were to come along with some other way of reporting 
dirty status that didn't depend on the granularity of individual 
mappings, then indeed it wouldn't need this operation.
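
To put that in code terms, the hook I'd imagine looks roughly like this
(names invented purely for illustration, not an existing API):

int iommu_dirty_log_prepare(struct iommu_domain *domain,
			    unsigned long iova, size_t size)
{
	const struct iommu_ops *ops = domain->ops;

	/*
	 * Ask the driver to make dirty state trackable at the requested
	 * granularity: SMMU would split block mappings here, while an
	 * IOMMU whose dirty reporting doesn't depend on mapping size can
	 * simply leave this unimplemented.
	 */
	if (!ops->dirty_log_prepare)
		return 0;

	return ops->dirty_log_prepare(domain, iova, size);
}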


Robin.


+
+static void vfio_dma_dirty_log_stop(struct vfio_iommu *iommu,
+   struct vfio_dma *dma)
+{
+   struct vfio_domain *d;
+
+   list_for_each_entry(d, &iommu->domain_list, next) {
+   /* Go through all domain anyway even if we fail */
+   iommu_merge_page(d->domain, dma->iova, dma->size,
+d->prot | dma->prot);
+   }
+}


Same as above comment, a generic interface is required here.


+
+static void vfio_iommu_dirty_log_switch(struct vfio_iommu *iommu, bool start)
+{
+   struct rb_node *n;
+
+   /* Split and merge even if all iommu don't support HWDBM now */
+   for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
+   struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
+
+   if (!dma->iommu_mapped)
+   continue;
+
+   /* Go through all dma range anyway even if we fail */
+   if (start)
+   vfio_dma_dirty_log_start(iommu, dma);
+   else
+   vfio_dma_dirty_log_stop(iommu, dma);
+   }
+}
+
  static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
unsigned long arg)
  {
@@ -2812,8 +2900,10 @@ static int vfio_iommu_type1_dirty_pages(struct 
vfio_iommu *iommu,
pgsize = 1 << __ffs(iommu->pgsize_bitmap);
if (!iommu->dirty_page_tracking) {
ret = vfio_dma_bitmap_alloc_all(iommu, pgsize);
-   if (!ret)
+   if (!ret) {
iommu->dirty_page_tracking = true;
+   vfio_iommu_dirty_log_switch(iommu, true);
+   }
}
	mutex_unlock(&iommu->lock);
return ret;
@@ -2822,6 +2912,7 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu 
*iommu,
if (iommu->dirty_page_tracking) {
iommu->dirty_page_tracking = false;
vfio_dma_bitmap_free_all(iommu);
+   vfio_iommu_dirty_log_switch(iommu, false);
}
	mutex_unlock(&iommu->lock);
return 0;
--
2.19.1



Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU

2021-02-05 Thread Robin Murphy

On 2021-02-05 11:48, Robin Murphy wrote:

On 2021-02-05 09:13, Keqian Zhu wrote:

Hi Robin and Jean,

On 2021/2/5 3:50, Robin Murphy wrote:

On 2021-01-28 15:17, Keqian Zhu wrote:

From: jiangkunkun 

The SMMU which supports HTTU (Hardware Translation Table Update) can
update the access flag and the dirty state of the TTD in hardware. This is
essential for tracking dirty pages of DMA.

This adds feature detection; no functional change.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 
   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 
   include/linux/io-pgtable.h  |  1 +
   3 files changed, 25 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c

index 8ca7415d785d..0f0fe71cc10d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct 
iommu_domain *domain,

   .pgsize_bitmap    = smmu->pgsize_bitmap,
   .ias    = ias,
   .oas    = oas,
+    .httu_hd    = smmu->features & ARM_SMMU_FEAT_HTTU_HD,
   .coherent_walk    = smmu->features & 
ARM_SMMU_FEAT_COHERENCY,

   .tlb    = _smmu_flush_ops,
   .iommu_dev    = smmu->dev,
@@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct 
arm_smmu_device *smmu)

   if (reg & IDR0_HYP)
   smmu->features |= ARM_SMMU_FEAT_HYP;
   +    switch (FIELD_GET(IDR0_HTTU, reg)) {


We need to accommodate the firmware override as well if we need this 
to be meaningful. Jean-Philippe is already carrying a suitable patch 
in the SVA stack[1].

Robin, Thanks for pointing it out.

Jean, I see that the IORT HTTU flag overrides the hardware register 
info unconditionally. I have some concern about it:


If the override flag has HTTU but the hardware doesn't support it, then 
the driver will use this feature but receive access faults or permission 
faults from the SMMU unexpectedly.

1) If IOPF is not supported, then the kernel cannot work normally.
2) If IOPF is supported, the kernel will perform useless actions, such as 
HTTU-based DMA dirty tracking (this series).


Yes, if the IORT describes the SMMU incorrectly, things will not work 
well. Just like if it describes the wrong base address or the wrong 
interrupt numbers, things will also not work well. The point is that 
incorrect firmware can be updated in the field fairly easily; incorrect 
hardware can not.


Say the SMMU designer hard-codes the ID register field to 0x2 because 
the SMMU itself is capable of HTTU, and they assume it's always going to 
be wired up coherently, but then a customer integrates it to a 
non-coherent interconnect. Firmware needs to override that value to 
prevent an OS thinking that the claimed HTTU capability is ever going to 
work.


Or say the SMMU *is* integrated correctly, but due to an erratum 
discovered later in the interconnect or SMMU itself, it turns out DBM 
doesn't always work reliably, but AF is still OK. Firmware needs to 
downgrade the indicated level of support from that which was intended to 
that which works reliably.


Or say someone forgets to set an integration tieoff so their SMMU 
reports 0x0 even though it and the interconnect *can* happily support 
HTTU. In that case, firmware may want to upgrade the value to *allow* an 
OS to use HTTU despite the ID register being wrong.


As the IORT spec doesn't give an explicit explanation of the HTTU 
override, can we interpret it as a mask for the HTTU-related hardware 
register field?

So the logic becomes: smmu->features = HTTU override & IDR0_HTTU;


No, it literally states that the OS must use the value of the firmware 
field *instead* of the value from the hardware field.


Oops, apologies for an oversight there - I've been reviewing IORT spec 
updates lately so naturally had the newest version open already. Turns 
out these descriptions were only clarified in the most recent release, 
so if you were looking at an older document they *were* horribly vague.


Robin.


+    case IDR0_HTTU_NONE:
+    break;
+    case IDR0_HTTU_HA:
+    smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
+    break;
+    case IDR0_HTTU_HAD:
+    smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
+    smmu->features |= ARM_SMMU_FEAT_HTTU_HD;
+    break;
+    default:
+    dev_err(smmu->dev, "unknown/unsupported HTTU!\n");
+    return -ENXIO;
+    }
+
   /*
    * The coherency feature as set by FW is used in preference 
to the ID

    * register, but warn on mismatch.
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h

index 96c2e9565e00..e91bea44519e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -33,6 +33,10 @@
   #define IDR0_ASID1

Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU

2021-02-05 Thread Robin Murphy

On 2021-02-05 09:13, Keqian Zhu wrote:

Hi Robin and Jean,

On 2021/2/5 3:50, Robin Murphy wrote:

On 2021-01-28 15:17, Keqian Zhu wrote:

From: jiangkunkun 

The SMMU which supports HTTU (Hardware Translation Table Update) can
update the access flag and the dirty state of the TTD in hardware. This is
essential for tracking dirty pages of DMA.

This adds feature detection; no functional change.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 
   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 
   include/linux/io-pgtable.h  |  1 +
   3 files changed, 25 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8ca7415d785d..0f0fe71cc10d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain,
   .pgsize_bitmap= smmu->pgsize_bitmap,
   .ias= ias,
   .oas= oas,
+.httu_hd= smmu->features & ARM_SMMU_FEAT_HTTU_HD,
   .coherent_walk= smmu->features & ARM_SMMU_FEAT_COHERENCY,
   .tlb= _smmu_flush_ops,
   .iommu_dev= smmu->dev,
@@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct 
arm_smmu_device *smmu)
   if (reg & IDR0_HYP)
   smmu->features |= ARM_SMMU_FEAT_HYP;
   +switch (FIELD_GET(IDR0_HTTU, reg)) {


We need to accommodate the firmware override as well if we need this to be 
meaningful. Jean-Philippe is already carrying a suitable patch in the SVA 
stack[1].

Robin, Thanks for pointing it out.

Jean, I see that the IORT HTTU flag overrides the hardware register info 
unconditionally. I have some concern about it:

If the override flag has HTTU but the hardware doesn't support it, then the driver 
will use this feature but receive access faults or permission faults from the SMMU 
unexpectedly.
1) If IOPF is not supported, then the kernel cannot work normally.
2) If IOPF is supported, the kernel will perform useless actions, such as HTTU-based 
DMA dirty tracking (this series).


Yes, if the IORT describes the SMMU incorrectly, things will not work 
well. Just like if it describes the wrong base address or the wrong 
interrupt numbers, things will also not work well. The point is that 
incorrect firmware can be updated in the field fairly easily; incorrect 
hardware can not.


Say the SMMU designer hard-codes the ID register field to 0x2 because 
the SMMU itself is capable of HTTU, and they assume it's always going to 
be wired up coherently, but then a customer integrates it to a 
non-coherent interconnect. Firmware needs to override that value to 
prevent an OS thinking that the claimed HTTU capability is ever going to 
work.


Or say the SMMU *is* integrated correctly, but due to an erratum 
discovered later in the interconnect or SMMU itself, it turns out DBM 
doesn't always work reliably, but AF is still OK. Firmware needs to 
downgrade the indicated level of support from that which was intended to 
that which works reliably.


Or say someone forgets to set an integration tieoff so their SMMU 
reports 0x0 even though it and the interconnect *can* happily support 
HTTU. In that case, firmware may want to upgrade the value to *allow* an 
OS to use HTTU despite the ID register being wrong.



As the IORT spec doesn't give an explicit explanation of the HTTU override, can we 
interpret it as a mask for the HTTU-related hardware register field?
So the logic becomes: smmu->features = HTTU override & IDR0_HTTU;


No, it literally states that the OS must use the value of the firmware 
field *instead* of the value from the hardware field.
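
In code terms the required behaviour is simply this sort of thing
(a sketch; get_fw_httu_override() is a made-up stand-in for however the
IORT value reaches the driver):

	u32 httu = FIELD_GET(IDR0_HTTU, reg);
	u32 fw_httu;

	/*
	 * A firmware-supplied override *replaces* the ID register field
	 * outright - it is not ANDed with it.
	 */
	if (get_fw_httu_override(smmu->dev, &fw_httu))
		httu = fw_httu;

	/* ...then feed 'httu' into the switch statement below. */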



+case IDR0_HTTU_NONE:
+break;
+case IDR0_HTTU_HA:
+smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
+break;
+case IDR0_HTTU_HAD:
+smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
+smmu->features |= ARM_SMMU_FEAT_HTTU_HD;
+break;
+default:
+dev_err(smmu->dev, "unknown/unsupported HTTU!\n");
+return -ENXIO;
+}
+
   /*
* The coherency feature as set by FW is used in preference to the ID
* register, but warn on mismatch.
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 96c2e9565e00..e91bea44519e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -33,6 +33,10 @@
   #define IDR0_ASID16(1 << 12)
   #define IDR0_ATS(1 << 10)
   #define IDR0_HYP(1 << 9)
+#define IDR0_HTTUGENMASK(7, 6)
+#define IDR0_HTTU_NONE0
+#define IDR0_HTTU_HA1
+#define IDR0_HTTU_HAD2
   #define IDR0_COHACC(1 << 4)
   #define IDR0_TTFG

Re: [RFC PATCH 06/11] iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log

2021-02-04 Thread Robin Murphy

On 2021-01-28 15:17, Keqian Zhu wrote:

From: jiangkunkun 

During dirty log tracking, the user will try to retrieve the dirty log from
the iommu if it supports hardware dirty log. This adds a new interface
named sync_dirty_log in the iommu layer and arm smmuv3 implements it,
which scans leaf TTDs and treats one as dirty if it's writable (as we
just enable HTTU for stage 1, we check that AP[2] is not set).

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 27 +++
  drivers/iommu/io-pgtable-arm.c  | 90 +
  drivers/iommu/iommu.c   | 41 ++
  include/linux/io-pgtable.h  |  4 +
  include/linux/iommu.h   | 17 
  5 files changed, 179 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 2434519e4bb6..43d0536b429a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2548,6 +2548,32 @@ static size_t arm_smmu_merge_page(struct iommu_domain 
*domain, unsigned long iov
return ops->merge_page(ops, iova, paddr, size, prot);
  }
  
+static int arm_smmu_sync_dirty_log(struct iommu_domain *domain,

+  unsigned long iova, size_t size,
+  unsigned long *bitmap,
+  unsigned long base_iova,
+  unsigned long bitmap_pgshift)
+{
+   struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
+   struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
+
+   if (!(smmu->features & ARM_SMMU_FEAT_HTTU_HD)) {
+   dev_err(smmu->dev, "don't support HTTU_HD and sync dirty 
log\n");
+   return -EPERM;
+   }
+
+   if (!ops || !ops->sync_dirty_log) {
+   pr_err("don't support sync dirty log\n");
+   return -ENODEV;
+   }
+
+   /* To ensure all inflight transactions are completed */
+   arm_smmu_flush_iotlb_all(domain);


What about transactions that arrive between the point that this 
completes, and the point - potentially much later - that we actually 
access any given PTE during the walk? I don't see what this is supposed 
to be synchronising against, even if it were just a CMD_SYNC (I 
especially don't see why we'd want to knock out the TLBs).



+
+   return ops->sync_dirty_log(ops, iova, size, bitmap,
+   base_iova, bitmap_pgshift);
+}
+
  static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
  {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2649,6 +2675,7 @@ static struct iommu_ops arm_smmu_ops = {
.domain_set_attr= arm_smmu_domain_set_attr,
.split_block= arm_smmu_split_block,
.merge_page = arm_smmu_merge_page,
+   .sync_dirty_log = arm_smmu_sync_dirty_log,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 17390f258eb1..6cfe1ef3fedd 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -877,6 +877,95 @@ static size_t arm_lpae_merge_page(struct io_pgtable_ops 
*ops, unsigned long iova
return __arm_lpae_merge_page(data, iova, paddr, size, lvl, ptep, prot);
  }
  
+static int __arm_lpae_sync_dirty_log(struct arm_lpae_io_pgtable *data,

+unsigned long iova, size_t size,
+int lvl, arm_lpae_iopte *ptep,
+unsigned long *bitmap,
+unsigned long base_iova,
+unsigned long bitmap_pgshift)
+{
+   arm_lpae_iopte pte;
+   struct io_pgtable *iop = &data->iop;
+   size_t base, next_size;
+   unsigned long offset;
+   int nbits, ret;
+
+   if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+   return -EINVAL;
+
+   ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+   pte = READ_ONCE(*ptep);
+   if (WARN_ON(!pte))
+   return -EINVAL;
+
+   if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+   if (iopte_leaf(pte, lvl, iop->fmt)) {
+   if (pte & ARM_LPAE_PTE_AP_RDONLY)
+   return 0;
+
+   /* It is writable, set the bitmap */
+   nbits = size >> bitmap_pgshift;
+   offset = (iova - base_iova) >> bitmap_pgshift;
+   bitmap_set(bitmap, offset, nbits);
+   return 0;
+   } else {
+   /* To traverse next level */
+   next_size = ARM_LPAE_BLOCK_SIZE(lvl + 1, data);
+  

Re: [RFC PATCH 05/11] iommu/arm-smmu-v3: Merge a span of page to block descriptor

2021-02-04 Thread Robin Murphy

On 2021-01-28 15:17, Keqian Zhu wrote:

From: jiangkunkun 

When we stop dirty log tracking, we need to recover all block descriptors
which were split when dirty log tracking started. This adds a new
interface named merge_page in the iommu layer and arm smmuv3 implements it,
which reinstalls block mappings and unmaps the span of page mappings.

It's the caller's duty to find contiguous physical memory.

During page merging, other interfaces are not expected to be working,
so race conditions do not exist. And we flush all iotlbs after the merge
procedure is completed to ease the pressure on the iommu, as we will
generally merge a huge range of page mappings.


Again, I think we need better reasoning than "race conditions don't 
exist because we don't expect them to exist".



Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 20 ++
  drivers/iommu/io-pgtable-arm.c  | 78 +
  drivers/iommu/iommu.c   | 75 
  include/linux/io-pgtable.h  |  2 +
  include/linux/iommu.h   | 10 +++
  5 files changed, 185 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5469f4fca820..2434519e4bb6 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2529,6 +2529,25 @@ static size_t arm_smmu_split_block(struct iommu_domain 
*domain,
return ops->split_block(ops, iova, size);
  }
  
+static size_t arm_smmu_merge_page(struct iommu_domain *domain, unsigned long iova,

+ phys_addr_t paddr, size_t size, int prot)
+{
+   struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
+   struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
+
+   if (!(smmu->features & (ARM_SMMU_FEAT_BBML1 | ARM_SMMU_FEAT_BBML2))) {
+   dev_err(smmu->dev, "don't support BBML1/2 and merge page\n");
+   return 0;
+   }
+
+   if (!ops || !ops->merge_page) {
+   pr_err("don't support merge page\n");
+   return 0;
+   }
+
+   return ops->merge_page(ops, iova, paddr, size, prot);
+}
+
  static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
  {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2629,6 +2648,7 @@ static struct iommu_ops arm_smmu_ops = {
.domain_get_attr= arm_smmu_domain_get_attr,
.domain_set_attr= arm_smmu_domain_set_attr,
.split_block= arm_smmu_split_block,
+   .merge_page = arm_smmu_merge_page,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index f3b7f7115e38..17390f258eb1 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -800,6 +800,83 @@ static size_t arm_lpae_split_block(struct io_pgtable_ops 
*ops,
return __arm_lpae_split_block(data, iova, size, lvl, ptep);
  }
  
+static size_t __arm_lpae_merge_page(struct arm_lpae_io_pgtable *data,

+   unsigned long iova, phys_addr_t paddr,
+   size_t size, int lvl, arm_lpae_iopte *ptep,
+   arm_lpae_iopte prot)
+{
+   arm_lpae_iopte pte, *tablep;
+   struct io_pgtable *iop = &data->iop;
+   struct io_pgtable_cfg *cfg = &data->iop.cfg;
+
+   if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+   return 0;
+
+   ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+   pte = READ_ONCE(*ptep);
+   if (WARN_ON(!pte))
+   return 0;
+
+   if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+   if (iopte_leaf(pte, lvl, iop->fmt))
+   return size;
+
+   /* Race does not exist */
+   if (cfg->bbml == 1) {
+   prot |= ARM_LPAE_PTE_NT;
+   __arm_lpae_init_pte(data, paddr, prot, lvl, ptep);
+   io_pgtable_tlb_flush_walk(iop, iova, size,
+ ARM_LPAE_GRANULE(data));
+
+   prot &= ~(ARM_LPAE_PTE_NT);
+   __arm_lpae_init_pte(data, paddr, prot, lvl, ptep);
+   } else {
+   __arm_lpae_init_pte(data, paddr, prot, lvl, ptep);
+   }
+
+   tablep = iopte_deref(pte, data);
+   __arm_lpae_free_pgtable(data, lvl + 1, tablep);
+   return size;
+   } else if (iopte_leaf(pte, lvl, iop->fmt)) {
+   /* The size is too small, already merged */
+   return size;
+   }
+
+   /* Keep on walkin */
+   ptep = iopte_deref(pte, data);
+   return 

Re: [RFC PATCH 04/11] iommu/arm-smmu-v3: Split block descriptor to a span of page

2021-02-04 Thread Robin Murphy

On 2021-01-28 15:17, Keqian Zhu wrote:

From: jiangkunkun 

A block descriptor is not a proper granule for dirty log tracking. This
adds a new interface named split_block in the iommu layer and arm smmuv3
implements it, which splits a block descriptor into an equivalent span of
page descriptors.

During block splitting, other interfaces are not expected to be working,
so race conditions do not exist. And we flush all iotlbs after the split
procedure is completed to ease the pressure on the iommu, as we will
generally split a huge range of block mappings.


"Not expected to be" is not the same thing as "can not". Presumably the 
whole point of dirty log tracking is that it can be run speculatively in 
the background, so is there any actual guarantee that the guest can't, 
say, issue a hotplug event that would cause some memory to be released 
back to the host and unmapped while a scan might be in progress? Saying 
effectively "there is no race condition as long as you assume there is 
no race condition" isn't all that reassuring...


That said, it's not very clear why patches #4 and #5 are here at all, 
given that patches #6 and #7 appear quite happy to handle block entries.



Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  20 
  drivers/iommu/io-pgtable-arm.c  | 122 
  drivers/iommu/iommu.c   |  40 +++
  include/linux/io-pgtable.h  |   2 +
  include/linux/iommu.h   |  10 ++
  5 files changed, 194 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 9208881a571c..5469f4fca820 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2510,6 +2510,25 @@ static int arm_smmu_domain_set_attr(struct iommu_domain 
*domain,
return ret;
  }
  
+static size_t arm_smmu_split_block(struct iommu_domain *domain,

+  unsigned long iova, size_t size)
+{
+   struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
+   struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
+
+   if (!(smmu->features & (ARM_SMMU_FEAT_BBML1 | ARM_SMMU_FEAT_BBML2))) {
+   dev_err(smmu->dev, "don't support BBML1/2 and split block\n");
+   return 0;
+   }
+
+   if (!ops || !ops->split_block) {
+   pr_err("don't support split block\n");
+   return 0;
+   }
+
+   return ops->split_block(ops, iova, size);
+}
+
  static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
  {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2609,6 +2628,7 @@ static struct iommu_ops arm_smmu_ops = {
.device_group   = arm_smmu_device_group,
.domain_get_attr= arm_smmu_domain_get_attr,
.domain_set_attr= arm_smmu_domain_set_attr,
+   .split_block= arm_smmu_split_block,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index e299a44808ae..f3b7f7115e38 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -79,6 +79,8 @@
  #define ARM_LPAE_PTE_SH_IS(((arm_lpae_iopte)3) << 8)
  #define ARM_LPAE_PTE_NS   (((arm_lpae_iopte)1) << 5)
  #define ARM_LPAE_PTE_VALID(((arm_lpae_iopte)1) << 0)
+/* Block descriptor bits */
+#define ARM_LPAE_PTE_NT(((arm_lpae_iopte)1) << 16)
  
  #define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)

  /* Ignore the contiguous bit for block splitting */
@@ -679,6 +681,125 @@ static phys_addr_t arm_lpae_iova_to_phys(struct 
io_pgtable_ops *ops,
return iopte_to_paddr(pte, data) | iova;
  }
  
+static size_t __arm_lpae_split_block(struct arm_lpae_io_pgtable *data,

+unsigned long iova, size_t size, int lvl,
+arm_lpae_iopte *ptep);
+
+static size_t arm_lpae_do_split_blk(struct arm_lpae_io_pgtable *data,
+   unsigned long iova, size_t size,
+   arm_lpae_iopte blk_pte, int lvl,
+   arm_lpae_iopte *ptep)
+{
+   struct io_pgtable_cfg *cfg = &data->iop.cfg;
+   arm_lpae_iopte pte, *tablep;
+   phys_addr_t blk_paddr;
+   size_t tablesz = ARM_LPAE_GRANULE(data);
+   size_t split_sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
+   int i;
+
+   if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+   return 0;
+
+   tablep = __arm_lpae_alloc_pages(tablesz, GFP_ATOMIC, cfg);
+   if (!tablep)
+   return 0;
+
+   blk_paddr = iopte_to_paddr(blk_pte, data);
+   

Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU

2021-02-04 Thread Robin Murphy

On 2021-01-28 15:17, Keqian Zhu wrote:

From: jiangkunkun 

The SMMU which supports HTTU (Hardware Translation Table Update) can
update the access flag and the dirty state of the TTD in hardware. This is
essential for tracking dirty pages of DMA.

This adds feature detection; no functional change.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 
  include/linux/io-pgtable.h  |  1 +
  3 files changed, 25 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8ca7415d785d..0f0fe71cc10d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain,
.pgsize_bitmap  = smmu->pgsize_bitmap,
.ias= ias,
.oas= oas,
+   .httu_hd= smmu->features & ARM_SMMU_FEAT_HTTU_HD,
.coherent_walk  = smmu->features & ARM_SMMU_FEAT_COHERENCY,
.tlb= _smmu_flush_ops,
.iommu_dev  = smmu->dev,
@@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct 
arm_smmu_device *smmu)
if (reg & IDR0_HYP)
smmu->features |= ARM_SMMU_FEAT_HYP;
  
+	switch (FIELD_GET(IDR0_HTTU, reg)) {


We need to accommodate the firmware override as well if we need this to 
be meaningful. Jean-Philippe is already carrying a suitable patch in the 
SVA stack[1].



+   case IDR0_HTTU_NONE:
+   break;
+   case IDR0_HTTU_HA:
+   smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
+   break;
+   case IDR0_HTTU_HAD:
+   smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
+   smmu->features |= ARM_SMMU_FEAT_HTTU_HD;
+   break;
+   default:
+   dev_err(smmu->dev, "unknown/unsupported HTTU!\n");
+   return -ENXIO;
+   }
+
/*
 * The coherency feature as set by FW is used in preference to the ID
 * register, but warn on mismatch.
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 96c2e9565e00..e91bea44519e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -33,6 +33,10 @@
  #define IDR0_ASID16   (1 << 12)
  #define IDR0_ATS  (1 << 10)
  #define IDR0_HYP  (1 << 9)
+#define IDR0_HTTU  GENMASK(7, 6)
+#define IDR0_HTTU_NONE 0
+#define IDR0_HTTU_HA   1
+#define IDR0_HTTU_HAD  2
  #define IDR0_COHACC   (1 << 4)
  #define IDR0_TTF  GENMASK(3, 2)
  #define IDR0_TTF_AARCH64  2
@@ -286,6 +290,8 @@
  #define CTXDESC_CD_0_TCR_TBI0 (1ULL << 38)
  
  #define CTXDESC_CD_0_AA64		(1UL << 41)

+#define CTXDESC_CD_0_HD(1UL << 42)
+#define CTXDESC_CD_0_HA(1UL << 43)
  #define CTXDESC_CD_0_S(1UL << 44)
  #define CTXDESC_CD_0_R(1UL << 45)
  #define CTXDESC_CD_0_A(1UL << 46)
@@ -604,6 +610,8 @@ struct arm_smmu_device {
  #define ARM_SMMU_FEAT_RANGE_INV   (1 << 15)
  #define ARM_SMMU_FEAT_BTM (1 << 16)
  #define ARM_SMMU_FEAT_SVA (1 << 17)
+#define ARM_SMMU_FEAT_HTTU_HA  (1 << 18)
+#define ARM_SMMU_FEAT_HTTU_HD  (1 << 19)
u32 features;
  
  #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)

diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index ea727eb1a1a9..1a00ea8562c7 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -97,6 +97,7 @@ struct io_pgtable_cfg {
unsigned long   pgsize_bitmap;
unsigned intias;
unsigned intoas;
+   boolhttu_hd;


This is very specific to the AArch64 stage 1 format, not a generic 
capability - I think it should be a quirk flag rather than a common field.
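
Something like the below is what I mean (sketch only; the flag value is
made up and would need allocating alongside the existing quirks):

	/* in io-pgtable.h, as a quirk rather than a new cfg field */
	#define IO_PGTABLE_QUIRK_ARM_HD	BIT(16)	/* hypothetical value */

	/* in arm_smmu_domain_finalise() */
	if (smmu->features & ARM_SMMU_FEAT_HTTU_HD)
		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;

io-pgtable-arm can then key the DBM handling off cfg->quirks instead of
a common field that only makes sense for one format.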


Robin.

[1] 
https://jpbrucker.net/git/linux/commit/?h=sva/current&id=1ef7d512fb9082450dfe0d22ca4f7e35625a097b



boolcoherent_walk;
const struct iommu_flush_ops*tlb;
struct device   *iommu_dev;




Re: [PATCH] arm64: Work around broken GCC handling of "S" constraint

2020-12-07 Thread Robin Murphy

On 2020-12-07 19:04, Marc Zyngier wrote:

Hi Robin,

On Mon, 07 Dec 2020 18:42:23 +,
Robin Murphy  wrote:


On 2020-12-07 17:47, Ard Biesheuvel wrote:

On Mon, 7 Dec 2020 at 18:41, Marc Zyngier  wrote:


On 2020-12-07 17:19, Ard Biesheuvel wrote:

(resend with David's email address fixed)


Irk. Thanks for that.


+#ifdef CONFIG_CC_HAS_BROKEN_S_CONSTRAINT
+#define SYM_CONSTRAINT "i"
+#else
+#define SYM_CONSTRAINT "S"
+#endif
+


Could we just check GCC_VERSION here?


I guess we could. But I haven't investigated which exact range of
compiler is broken (GCC 6.3 seems fixed, but that's the oldest
I have apart from the offending 4.9).



I tried 5.4 on godbolt, and it seems happy. And the failure will be
obvious, so we can afford to get it slightly wrong and refine it
later.


FWIW the Linaro 14.11, 15.02 and 15.05 releases of GCC 4.9.3 seem to
build rc7 without complaint. The only thing older that I have to hand
is Ubuntu's GCC 4.8.4, which Kbuild chokes on entirely now.


Can you try kvmarm/next? David's PSCI relay is breaking badly here.


Ah, gotcha... Yes, they're all falling over on that :(

The 15.08 release of 5.1.1 is happy though, so Ard's probably right 
about generalising it to 4.x.
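
i.e. something like this (untested sketch; the exact cutoff is a guess
until more 5.x releases get tried):

#if defined(CONFIG_CC_IS_GCC) && GCC_VERSION < 50000
#define SYM_CONSTRAINT	"i"
#else
#define SYM_CONSTRAINT	"S"
#endif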


Cheers,
Robin.


Re: [PATCH] arm64: Work around broken GCC handling of "S" constraint

2020-12-07 Thread Robin Murphy

On 2020-12-07 17:47, Ard Biesheuvel wrote:

On Mon, 7 Dec 2020 at 18:41, Marc Zyngier  wrote:


On 2020-12-07 17:19, Ard Biesheuvel wrote:

(resend with David's email address fixed)


Irk. Thanks for that.


+#ifdef CONFIG_CC_HAS_BROKEN_S_CONSTRAINT
+#define SYM_CONSTRAINT "i"
+#else
+#define SYM_CONSTRAINT "S"
+#endif
+


Could we just check GCC_VERSION here?


I guess we could. But I haven't investigated which exact range of
compiler is broken (GCC 6.3 seems fixed, but that's the oldest
I have apart from the offending 4.9).



I tried 5.4 on godbolt, and it seems happy. And the failure will be
obvious, so we can afford to get it slightly wrong and refine it
later.


FWIW the Linaro 14.11, 15.02 and 15.05 releases of GCC 4.9.3 seem to 
build rc7 without complaint. The only thing older that I have to hand is 
Ubuntu's GCC 4.8.4, which Kbuild chokes on entirely now.


Robin.


Re: [PATCH v2] KVM: arm64: Allow to limit number of PMU counters

2020-09-10 Thread Robin Murphy

On 2020-09-10 17:46, Alexander Graf wrote:



On 10.09.20 17:52, Robin Murphy wrote:


On 2020-09-10 11:18, Alexander Graf wrote:



On 10.09.20 12:06, Marc Zyngier wrote:


On 2020-09-08 21:57, Alexander Graf wrote:

We currently pass through the number of PMU counters that we have
available
in hardware to guests. So if my host supports 10 concurrently active
PMU
counters, my guest will be able to spawn 10 counters as well.

This is undesirable if we also want to use the PMU on the host for
monitoring. In that case, we want to split the PMU between guest and
host.

To help that case, let's add a PMU attr that allows us to limit the
number
of PMU counters that we expose. With this patch in place, user space
can
keep some counters free for host use.

Signed-off-by: Alexander Graf 

---

Because this patch touches the same code paths as the vPMU filtering
one
and the vPMU filtering generalized a few conditions in the attr path,
I've based it on top. Please let me know if you want it independent
instead.

v1 -> v2:

  - Add documentation
  - Add read support
---
 Documentation/virt/kvm/devices/vcpu.rst | 25 
+

 arch/arm64/include/uapi/asm/kvm.h   |  7 ---
 arch/arm64/kvm/pmu-emul.c   | 32

 arch/arm64/kvm/sys_regs.c   |  5 +
 include/kvm/arm_pmu.h   |  1 +
 5 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst
b/Documentation/virt/kvm/devices/vcpu.rst
index 203b91e93151..1a1c8d8c8b1d 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -102,6 +102,31 @@ isn't strictly speaking an event. Filtering the
cycle counter is possible
 using event 0x11 (CPU_CYCLES).


+1.4 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_NUM_EVENTS
+-
+
+:Parameters: in kvm_device_attr.addr the address for the limit of
concurrent
+ events is a pointer to an int
+
+:Returns:
+
+  ===  ==
+  -ENODEV: PMUv3 not supported
+  -EBUSY:  PMUv3 already initialized
+  -EINVAL: Too large number of events
+  ===  ==
+
+Reconfigure the limit of concurrent PMU events that the guest can
monitor.
+This number is directly exposed as part of the PMCR_EL0 register.
+
+On vcpu creation, this attribute is set to the hardware limit of the
current
+platform. If you need to determine the hardware limit, you can read
this
+attribute before setting it.
+
+Restrictions: The default value for this property is the number of
hardware
+supported events. Only values that are smaller than the hardware 
limit

can
+be set.
+
 2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
 =

diff --git a/arch/arm64/include/uapi/asm/kvm.h
b/arch/arm64/include/uapi/asm/kvm.h
index 7b1511d6ce44..db025c0b5a40 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -342,9 +342,10 @@ struct kvm_vcpu_events {

 /* Device Control API on vcpu fd */
 #define KVM_ARM_VCPU_PMU_V3_CTRL 0
-#define   KVM_ARM_VCPU_PMU_V3_IRQ    0
-#define   KVM_ARM_VCPU_PMU_V3_INIT   1
-#define   KVM_ARM_VCPU_PMU_V3_FILTER 2
+#define   KVM_ARM_VCPU_PMU_V3_IRQ    0
+#define   KVM_ARM_VCPU_PMU_V3_INIT   1
+#define   KVM_ARM_VCPU_PMU_V3_FILTER 2
+#define   KVM_ARM_VCPU_PMU_V3_NUM_EVENTS 3
 #define KVM_ARM_VCPU_TIMER_CTRL  1
 #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER  0
 #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER  1
diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index 0458860bade2..c7915b95fec0 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -253,6 +253,8 @@ void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu)

  for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++)
  pmu->pmc[i].idx = i;
+
+ pmu->num_events = perf_num_counters() - 1;
 }

 /**
@@ -978,6 +980,25 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu
*vcpu, struct kvm_device_attr *attr)

  return 0;
  }
+ case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
+ u64 mask = ARMV8_PMU_PMCR_N_MASK <<
ARMV8_PMU_PMCR_N_SHIFT;
+ int __user *uaddr = (int __user *)(long)attr->addr;
+ u32 num_events;
+
+ if (get_user(num_events, uaddr))
+ return -EFAULT;
+
+ if (num_events >= perf_num_counters())
+ return -EINVAL;
+
+ vcpu->arch.pmu.num_events = num_events;
+
+ num_events <<= ARMV8_PMU_PMCR_N_SHIFT;
+ __vcpu_sys_reg(vcpu, SYS_PMCR_EL0) &= ~mask;
+ __vcpu_sys_reg(vcpu, SYS_PMCR_EL0) |= num_events;
+
+ return 0;
+ }
  case KVM_ARM_VCPU_PMU_V3_INIT:
  return kvm_arm_pmu_v3_init(vcpu);
  }
@@ -1004,6 +1025,16 @@ int kv

Re: [PATCH v2] KVM: arm64: Allow to limit number of PMU counters

2020-09-10 Thread Robin Murphy

On 2020-09-10 11:18, Alexander Graf wrote:



On 10.09.20 12:06, Marc Zyngier wrote:


On 2020-09-08 21:57, Alexander Graf wrote:

We currently pass through the number of PMU counters that we have
available
in hardware to guests. So if my host supports 10 concurrently active
PMU
counters, my guest will be able to spawn 10 counters as well.

This is undesirable if we also want to use the PMU on the host for
monitoring. In that case, we want to split the PMU between guest and
host.

To help that case, let's add a PMU attr that allows us to limit the
number
of PMU counters that we expose. With this patch in place, user space
can
keep some counters free for host use.

Signed-off-by: Alexander Graf 

---

Because this patch touches the same code paths as the vPMU filtering
one
and the vPMU filtering generalized a few conditions in the attr path,
I've based it on top. Please let me know if you want it independent
instead.

v1 -> v2:

  - Add documentation
  - Add read support
---
 Documentation/virt/kvm/devices/vcpu.rst | 25 +
 arch/arm64/include/uapi/asm/kvm.h   |  7 ---
 arch/arm64/kvm/pmu-emul.c   | 32

 arch/arm64/kvm/sys_regs.c   |  5 +
 include/kvm/arm_pmu.h   |  1 +
 5 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst
b/Documentation/virt/kvm/devices/vcpu.rst
index 203b91e93151..1a1c8d8c8b1d 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -102,6 +102,31 @@ isn't strictly speaking an event. Filtering the
cycle counter is possible
 using event 0x11 (CPU_CYCLES).


+1.4 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_NUM_EVENTS
+-
+
+:Parameters: in kvm_device_attr.addr the address for the limit of
concurrent
+ events is a pointer to an int
+
+:Returns:
+
+  ===  ==
+  -ENODEV: PMUv3 not supported
+  -EBUSY:  PMUv3 already initialized
+  -EINVAL: Too large number of events
+  ===  ==
+
+Reconfigure the limit of concurrent PMU events that the guest can
monitor.
+This number is directly exposed as part of the PMCR_EL0 register.
+
+On vcpu creation, this attribute is set to the hardware limit of the
current
+platform. If you need to determine the hardware limit, you can read
this
+attribute before setting it.
+
+Restrictions: The default value for this property is the number of
hardware
+supported events. Only values that are smaller than the hardware limit
can
+be set.
+
 2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
 =

diff --git a/arch/arm64/include/uapi/asm/kvm.h
b/arch/arm64/include/uapi/asm/kvm.h
index 7b1511d6ce44..db025c0b5a40 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -342,9 +342,10 @@ struct kvm_vcpu_events {

 /* Device Control API on vcpu fd */
 #define KVM_ARM_VCPU_PMU_V3_CTRL 0
-#define   KVM_ARM_VCPU_PMU_V3_IRQ    0
-#define   KVM_ARM_VCPU_PMU_V3_INIT   1
-#define   KVM_ARM_VCPU_PMU_V3_FILTER 2
+#define   KVM_ARM_VCPU_PMU_V3_IRQ    0
+#define   KVM_ARM_VCPU_PMU_V3_INIT   1
+#define   KVM_ARM_VCPU_PMU_V3_FILTER 2
+#define   KVM_ARM_VCPU_PMU_V3_NUM_EVENTS 3
 #define KVM_ARM_VCPU_TIMER_CTRL  1
 #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER  0
 #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER  1
diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index 0458860bade2..c7915b95fec0 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -253,6 +253,8 @@ void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu)

  for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++)
  pmu->pmc[i].idx = i;
+
+ pmu->num_events = perf_num_counters() - 1;
 }

 /**
@@ -978,6 +980,25 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu
*vcpu, struct kvm_device_attr *attr)

  return 0;
  }
+ case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
+ u64 mask = ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT;

+ int __user *uaddr = (int __user *)(long)attr->addr;
+ u32 num_events;
+
+ if (get_user(num_events, uaddr))
+ return -EFAULT;
+
+ if (num_events >= perf_num_counters())
+ return -EINVAL;
+
+ vcpu->arch.pmu.num_events = num_events;
+
+ num_events <<= ARMV8_PMU_PMCR_N_SHIFT;
+ __vcpu_sys_reg(vcpu, SYS_PMCR_EL0) &= ~mask;
+ __vcpu_sys_reg(vcpu, SYS_PMCR_EL0) |= num_events;
+
+ return 0;
+ }
  case KVM_ARM_VCPU_PMU_V3_INIT:
  return kvm_arm_pmu_v3_init(vcpu);
  }
@@ -1004,6 +1025,16 @@ int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu
*vcpu, struct kvm_device_attr *attr)
  irq = vcpu->arch.pmu.irq_num;
 

Re: [trivial PATCH] treewide: Convert switch/case fallthrough; to break;

2020-09-10 Thread Robin Murphy

On 2020-09-09 21:06, Joe Perches wrote:

fallthrough to a separate case/default label break; isn't very readable.

Convert pseudo-keyword fallthrough; statements to a simple break; when
the next label is case or default and the only statement in the next
label block is break;

Found using:

$ grep-2.5.4 -rP --include=*.[ch] -n 
"fallthrough;(\s*(case\s+\w+|default)\s*:\s*){1,7}break;" *

Miscellanea:

o Move or coalesce a couple label blocks above a default: block.

Signed-off-by: Joe Perches 
---

Compiled allyesconfig x86-64 only.
A few files for other arches were not compiled.



[...]

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index c192544e874b..743db1abec40 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3777,7 +3777,7 @@ static int arm_smmu_device_hw_probe(struct 
arm_smmu_device *smmu)
switch (FIELD_GET(IDR0_TTF, reg)) {
case IDR0_TTF_AARCH32_64:
smmu->ias = 40;
-   fallthrough;
+   break;
case IDR0_TTF_AARCH64:
break;
default:


I have to say I don't really agree with the readability argument for 
this one - a fallthrough is semantically correct here, since the first 
case is a superset of the second. It just happens that anything we would 
do for the common subset is implicitly assumed (there are other 
potential cases we simply haven't added support for at the moment), thus 
the second case is currently empty.


This change actively obfuscates that distinction.
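
Spelled out with comments (purely illustrative), the relationship the
fallthrough documents is:

	case IDR0_TTF_AARCH32_64:
		smmu->ias = 40;		/* the AArch32+64 case needs this extra step... */
		fallthrough;		/* ...plus everything the AArch64-only case needs */
	case IDR0_TTF_AARCH64:
		break;			/* nothing extra required at the moment */
	default:
		return -ENXIO;		/* other formats not supported */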

Robin.


Re: [PATCH 1/2] KVM: arm64: Make vcpu_cp1x() work on Big Endian hosts

2020-06-09 Thread Robin Murphy

On 2020-06-09 09:49, Marc Zyngier wrote:

AArch32 CP1x registers are overlayed on their AArch64 counterparts
in the vcpu struct. This leads to an interesting problem as they
are stored in their CPU-local format, and thus a CP1x register
doesn't "hit" the lower 32bit portion of the AArch64 register on
a BE host.

To workaround this unfortunate situation, introduce a bias trick
in the vcpu_cp1x() accessors which picks the correct half of the
64bit register.

Cc: sta...@vger.kernel.org
Reported-by: James Morse 
Signed-off-by: Marc Zyngier 
---
  arch/arm64/include/asm/kvm_host.h | 10 --
  1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 59029e90b557..e80c0e06f235 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -404,8 +404,14 @@ void vcpu_write_sys_reg(struct kvm_vcpu *vcpu, u64 val, 
int reg);
   * CP14 and CP15 live in the same array, as they are backed by the
   * same system registers.
   */
-#define vcpu_cp14(v,r) ((v)->arch.ctxt.copro[(r)])
-#define vcpu_cp15(v,r) ((v)->arch.ctxt.copro[(r)])
+#ifdef CPU_BIG_ENDIAN


Ahem... I think you're missing a "CONFIG_" there ;)

Bonus trickery - for a 0 or 1 value you can simply use IS_ENABLED().
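
i.e. (untested, purely illustrative):

	/* IS_ENABLED() already evaluates to 1 or 0, so no #ifdef needed */
	#define CPx_OFFSET	IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)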

Robin.


+#define CPx_OFFSET 1
+#else
+#define CPx_OFFSET 0
+#endif
+
+#define vcpu_cp14(v,r) ((v)->arch.ctxt.copro[(r) ^ CPx_OFFSET])
+#define vcpu_cp15(v,r) ((v)->arch.ctxt.copro[(r) ^ CPx_OFFSET])
  
  struct kvm_vm_stat {

ulong remote_tlb_flush;




Re: [PATCHv6 0/3] arm64: perf: Add support for ARMv8.5-PMU 64-bit counters

2020-03-10 Thread Robin Murphy

On 03/03/2020 7:07 pm, Mark Rutland wrote:

On Mon, Mar 02, 2020 at 06:17:49PM +, Mark Rutland wrote:

This is a respin of Andrew Murray's series to enable support for 64-bit
counters as introduced in ARMv8.5.

I've given this a spin on (ARMv8.2) hardware, to test that there are no
regressions, but I have not had the chance to test in an ARMv8.5 model (which I
believe Andrew had previously tested).


Bad news; this is broken. :(

While perf-stat works as expected, perf-record doesn't get samples for
any of the programmable counters.

In ARMv8.4 mode I can do:

| / # perf record -a -c 1 -e armv8_pmuv3/inst_retired/ true
| [ perf record: Woken up 1 times to write data ]
| [ perf record: Captured and wrote 0.023 MB perf.data (367 samples) ]
| / # perf record -a -c 1 -e armv8_pmuv3/inst_retired,long/ true
| [ perf record: Woken up 1 times to write data ]
| [ perf record: Captured and wrote 0.022 MB perf.data (353 samples) ]

... so regular 32-bit and chained events work correctly.

But in ARMv8.5 mode I get no samples in either case:

| / # perf record -a -c 1 -e armv8_pmuv3/inst_retired/ true
| [ perf record: Woken up 1 times to write data ]
| [ perf record: Captured and wrote 0.008 MB perf.data ]
| / # perf report | grep samples
| Error:
| The perf.data file has no samples!
| / # perf record -a -c 1 -e armv8_pmuv3/inst_retired,long/ true
| [ perf record: Woken up 1 times to write data ]
| [ perf record: Captured and wrote 0.008 MB perf.data ]
| / # perf report | grep samples
| Error:
| The perf.data file has no samples!

I'll have to trace the driver to see what's going on. I suspect we've
missed some bias handling, but it's possible that this is a model bug.


For the record, further evidence has indeed pointed to there being a bug 
in the model's implementation of ARMv8.5-PMU. It's been raised with the 
models team, so we'll have to wait and see what they say...


Robin.


Re: [RFC PATCH 0/5] Removing support for 32bit KVM/arm host

2020-02-20 Thread Robin Murphy

On 20/02/2020 2:01 pm, Marc Zyngier wrote:

On 2020-02-20 13:32, Robin Murphy wrote:

On 20/02/2020 1:15 pm, Marc Zyngier wrote:

Hi Marek,

On 2020-02-20 12:44, Marek Szyprowski wrote:

Hi Marc,

On 10.02.2020 15:13, Marc Zyngier wrote:

KVM/arm was merged just over 7 years ago, and has lived a very quiet
life so far. It mostly works if you're prepared to deal with its
limitations, it has been a good prototype for the arm64 version,
but it suffers a few problems:

- It is incomplete (no debug support, no PMU)
- It hasn't followed any of the architectural evolutions
- It has zero users (I don't count myself here)
- It is more and more getting in the way of new arm64 developments


That is rather sad news. Mainline Exynos finally got everything
that was needed to run it on the quite popular Samsung Exynos5422-based
Odroid XU4/HC1/MC1 boards. According to the Odroid related forums it is
being used. We also use it internally at Samsung.


Something like "too little, too late" springs to mind, but let's be
constructive. Is anyone using it in a production environment, where
they rely on the latest mainline kernel having KVM support?

The current proposal is to still have KVM support in 5.6, as well as
ongoing support for stable kernels. If that's not enough, can you please
explain your precise use case?


Presumably there's no *technical* reason why the stable subset of v7
support couldn't be stripped down and brought back private to arch/arm
if somebody really wants and is willing to step up and look after it?


There is no technical reason at all, just a maintenance effort.

The main killer is the whole MMU code, which I'm butchering with NV,
and that I suspect Will will also turn upside down with his stuff.
Not to mention the hypercall interface that will need a complete overhaul.

If we wanted to decouple the two, we'd need to make the MMU code, the
hypercalls, arm.c and a number of other bits private to 32bit.


Right, the prospective kvm-arm maintainer's gameplan would essentially 
be an equivalent "move virt/kvm/arm to arch/arm/kvm" patch, but then 
ripping out all the Armv8 and GICv3 gubbins instead. Yes, there would 
then be lots of *similar* code to start with, but it would only diverge 
further as v8 architecture development continues independently.


Anyway, I just thought it seemed worth saying out loud, to reassure 
folks that a realistic middle-ground between "yay bye!" and "oh no the 
end of the world!" does exist, namely "someone else's problem" :)


Robin.


Re: [RFC PATCH 0/5] Removing support for 32bit KVM/arm host

2020-02-20 Thread Robin Murphy

On 20/02/2020 1:15 pm, Marc Zyngier wrote:

Hi Marek,

On 2020-02-20 12:44, Marek Szyprowski wrote:

Hi Marc,

On 10.02.2020 15:13, Marc Zyngier wrote:

KVM/arm was merged just over 7 years ago, and has lived a very quiet
life so far. It mostly works if you're prepared to deal with its
limitations, it has been a good prototype for the arm64 version,
but it suffers a few problems:

- It is incomplete (no debug support, no PMU)
- It hasn't followed any of the architectural evolutions
- It has zero users (I don't count myself here)
- It is more and more getting in the way of new arm64 developments


That is rather sad news. Mainline Exynos finally got everything
that was needed to run it on the quite popular Samsung Exynos5422-based
Odroid XU4/HC1/MC1 boards. According to the Odroid related forums it is
being used. We also use it internally at Samsung.


Something like "too little, too late" springs to mind, but let's be
constructive. Is anyone using it in a production environment, where
they rely on the latest mainline kernel having KVM support?

The current proposal is to still have KVM support in 5.6, as well as
ongoing support for stable kernels. If that's not enough, can you please
explain your precise use case?


Presumably there's no *technical* reason why the stable subset of v7 
support couldn't be stripped down and brought back private to arch/arm 
if somebody really wants and is willing to step up and look after it?


Robin.


Re: [PATCH 1/5] KVM: arm64: Fix missing RES1 in emulation of DBGBIDR

2020-02-18 Thread Robin Murphy

On 18/02/2020 5:43 pm, James Morse wrote:

Hi Marc,

$subject typo: ~/DBGBIDR/DBGDIDR/

On 16/02/2020 18:53, Marc Zyngier wrote:

The AArch32 CP14 DBGDIDR has bit 15 set to RES1, which our current
emulation doesn't set. Just add the missing bit.


So it does.

Reviewed-by: James Morse 



diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 3e909b117f0c..da82c4b03aab 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1658,7 +1658,7 @@ static bool trap_dbgidr(struct kvm_vcpu *vcpu,
p->regval = ((((dfr >> ID_AA64DFR0_WRPS_SHIFT) & 0xf) << 28) |
 (((dfr >> ID_AA64DFR0_BRPS_SHIFT) & 0xf) << 24) |
 (((dfr >> ID_AA64DFR0_CTX_CMPS_SHIFT) & 0xf) << 20)
-| (6 << 16) | (el3 << 14) | (el3 << 12));
+| (6 << 16) | (1 << 15) | (el3 << 14) | (el3 << 12));


Hmmm, where el3 is:
| u32 el3 = !!cpuid_feature_extract_unsigned_field(pfr, ID_AA64PFR0_EL3_SHIFT);

Aren't we depending on the compilers 'true' being 1 here?


Pretty much, but thankfully the only compilers we support are C compilers:

"The result of the logical negation operator ! is 0 if the value of its 
operand compares unequal to 0, 1 if the value of its operand compares 
equal to 0. The result has type int."
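
So here, for illustration:

	u32 el3 = !!cpuid_feature_extract_unsigned_field(pfr, ID_AA64PFR0_EL3_SHIFT);
	/* el3 is guaranteed to be exactly 0 or 1, never any other non-zero
	 * value, so (el3 << 14) | (el3 << 12) cleanly sets both EL3-dependent
	 * bits or neither. */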


And now I have you to thank for flashbacks to bitwise logical operators 
in Visual Basic... :P


Robin.


Re: [PATCH 1/2] KVM: arm64: Add PMU event filtering infrastructure

2020-02-17 Thread Robin Murphy

On 15/02/2020 10:28 am, Marc Zyngier wrote:

On Fri, 14 Feb 2020 22:01:01 +,
Robin Murphy  wrote:

Hi Robin,



Hi Marc,

On 2020-02-14 6:36 pm, Marc Zyngier wrote:
[...]

@@ -585,6 +585,14 @@ static void kvm_pmu_create_perf_event(struct kvm_vcpu 
*vcpu, u64 select_idx)
pmc->idx != ARMV8_PMU_CYCLE_IDX)
return;
   +/*
+* If we have a filter in place and the event isn't allowed, do
+* not install a perf event either.
+*/
+   if (vcpu->kvm->arch.pmu_filter &&
+   !test_bit(eventsel, vcpu->kvm->arch.pmu_filter))
+   return;


If I'm reading the derivation of eventsel right, this will end up
treating cycle counter events (aliased to SW_INCR) differently from
CPU_CYCLES, which doesn't seem desirable.


Indeed, this doesn't look quite right.

Looking at the description of event 0x11, it doesn't seem to count
exactly like the cycle counter (there are a number of PMCR controls
affecting it). But none of these actually apply to our PMU emulation
(no secure mode, and the idea of dealing with virtual EL2 in the
context of the PMU is... not appealing).

Now, given that we implement the cycle counter with event 0x11 anyway,
I don't think there is any reason to deal with them separately.


Right, from the user's PoV they can only ask for event 0x11, and where 
it gets scheduled is more of a black-box implementation detail. Reading 
the Arm ARM doesn't leave me entirely convinced that cycles couldn't 
ever leak idle/not-idle information between closely-coupled PEs, so this 
might not be entirely academic.
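
Something like this in kvm_pmu_create_perf_event() should keep the two in
step (untested, purely illustrative, reusing the existing ARMV8_PMU_CYCLE_IDX
and ARMV8_PMUV3_PERFCTR_CPU_CYCLES definitions):

	/* Filter the cycle counter exactly as if it were event 0x11 */
	if (pmc->idx == ARMV8_PMU_CYCLE_IDX)
		eventsel = ARMV8_PMUV3_PERFCTR_CPU_CYCLES;

	if (vcpu->kvm->arch.pmu_filter &&
	    !test_bit(eventsel, vcpu->kvm->arch.pmu_filter))
		return;

and SW_INCR stays naturally out of scope, since it never results in a
hardware perf event being created in the first place.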



Also, if the user did try to blacklist SW_INCR for ridiculous
reasons, we'd need to special-case kvm_pmu_software_increment() to
make it (not) work as expected, right?


I thought of that one, and couldn't see a reason to blacklist it
(after all, the guest could also increment a variable) and send itself
an interrupt. I'm tempted to simply document that event 0 is never
filtered.


I'd say you're on even stronger ground simply because KVM's 
implementation of SW_INCR doesn't go near the PMU hardware at all, thus 
is well beyond the purpose of the blacklist anyway. I believe it's 
important that how the code behaves matches expectations, but there's no 
harm in changing the latter as appropriate ;)


Cheers,
Robin.


Re: [PATCH 1/2] KVM: arm64: Add PMU event filtering infrastructure

2020-02-14 Thread Robin Murphy

Hi Marc,

On 2020-02-14 6:36 pm, Marc Zyngier wrote:
[...]

@@ -585,6 +585,14 @@ static void kvm_pmu_create_perf_event(struct kvm_vcpu 
*vcpu, u64 select_idx)
pmc->idx != ARMV8_PMU_CYCLE_IDX)
return;
  
+	/*

+* If we have a filter in place and the event isn't allowed, do
+* not install a perf event either.
+*/
+   if (vcpu->kvm->arch.pmu_filter &&
+   !test_bit(eventsel, vcpu->kvm->arch.pmu_filter))
+   return;


If I'm reading the derivation of eventsel right, this will end up 
treating cycle counter events (aliased to SW_INCR) differently from 
CPU_CYCLES, which doesn't seem desirable.


Also, if the user did try to blacklist SW_INCR for ridiculous reasons, 
we'd need to special-case kvm_pmu_software_increment() to make it (not) 
work as expected, right?


Robin.


+
	memset(&attr, 0, sizeof(struct perf_event_attr));
attr.type = PERF_TYPE_RAW;
attr.size = sizeof(attr);



Re: [PATCH v2] KVM: arm64: Skip more of the SError vaxorcism

2019-06-10 Thread Robin Murphy

Hi James,

On 10/06/2019 17:30, James Morse wrote:

During __guest_exit() we need to consume any SError left pending by the
guest so it doesn't contaminate the host. With v8.2 we use the
ESB-instruction. For systems without v8.2, we use dsb+isb and unmask
SError. We do this on every guest exit.

Use the same dsb+isr_el1 trick; this lets us know if an SError is pending
after the dsb, allowing us to skip the isb and self-synchronising PSTATE
write if it's not.

This means SError remains masked during KVM's world-switch, so any SError
that occurs during this time is reported by the host, instead of causing
a hyp-panic.

If you give gcc likely()/unlikely() hints in an if() condition, it
shuffles the generated assembly so that the likely case is immediately
after the branch. Let's do the same here.

Signed-off-by: James Morse 
---
This patch was previously posted as part of:
[v1] 
https://lore.kernel.org/linux-arm-kernel/20190604144551.188107-1-james.mo...@arm.com/

  arch/arm64/kvm/hyp/entry.S | 14 ++
  1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S
index a5a4254314a1..c2de1a1faaf4 100644
--- a/arch/arm64/kvm/hyp/entry.S
+++ b/arch/arm64/kvm/hyp/entry.S
@@ -161,18 +161,24 @@ alternative_if ARM64_HAS_RAS_EXTN
orr x0, x0, #(1<

It doesn't appear that anyone cares much about x2 containing the masked 
value after returning, so is this just a needlessly long-form TBNZ?


Robin.


+   ret
+
+2:
+   // We know we have a pending asynchronous abort, now is the
+   // time to flush it out. From your VAXorcist book, page 666:
// "Threaten me not, oh Evil one!  For I speak with
// the power of DEC, and I command thee to show thyself!"
mrs x2, elr_el2
+alternative_endif
mrs x3, esr_el2
mrs x4, spsr_el2
mov x5, x0
  
-	dsb	sy		// Synchronize against in-flight ld/st

msr daifclr, #4 // Unmask aborts
-alternative_endif
  
  	// This is our single instruction exception window. A pending

// SError is guaranteed to occur at the earliest when we unmask




Re: [PATCH v7 14/23] iommu/smmuv3: Implement cache_invalidate

2019-05-13 Thread Robin Murphy

On 13/05/2019 13:16, Auger Eric wrote:

Hi Robin,
On 5/8/19 5:01 PM, Robin Murphy wrote:

On 08/04/2019 13:19, Eric Auger wrote:

Implement domain-selective and page-selective IOTLB invalidations.

Signed-off-by: Eric Auger 

---
v6 -> v7
- check the uapi version

v3 -> v4:
- adapt to changes in the uapi
- add support for leaf parameter
- do not use arm_smmu_tlb_inv_range_nosync or arm_smmu_tlb_inv_context
    anymore

v2 -> v3:
- replace __arm_smmu_tlb_sync by arm_smmu_cmdq_issue_sync

v1 -> v2:
- properly pass the asid
---
   drivers/iommu/arm-smmu-v3.c | 60 +
   1 file changed, 60 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 1486baf53425..4366921d8318 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2326,6 +2326,65 @@ static void arm_smmu_detach_pasid_table(struct
iommu_domain *domain)
    mutex_unlock(&smmu_domain->init_mutex);
   }
   +static int
+arm_smmu_cache_invalidate(struct iommu_domain *domain, struct device
*dev,
+  struct iommu_cache_invalidate_info *inv_info)
+{
+    struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+    struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+    if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+    return -EINVAL;
+
+    if (!smmu)
+    return -EINVAL;
+
+    if (inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
+    return -EINVAL;
+
+    if (inv_info->cache & IOMMU_CACHE_INV_TYPE_IOTLB) {
+    if (inv_info->granularity == IOMMU_INV_GRANU_PASID) {
+    struct arm_smmu_cmdq_ent cmd = {
+    .opcode = CMDQ_OP_TLBI_NH_ASID,
+    .tlbi = {
+    .vmid = smmu_domain->s2_cfg.vmid,
+    .asid = inv_info->pasid,
+    },
+    };
+
+    arm_smmu_cmdq_issue_cmd(smmu, &cmd);
+    arm_smmu_cmdq_issue_sync(smmu);


I'd much rather make arm_smmu_tlb_inv_context() understand nested
domains than open-code commands all over the place.






+
+    } else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR) {
+    struct iommu_inv_addr_info *info = &inv_info->addr_info;
+    size_t size = info->nb_granules * info->granule_size;
+    bool leaf = info->flags & IOMMU_INV_ADDR_FLAGS_LEAF;
+    struct arm_smmu_cmdq_ent cmd = {
+    .opcode = CMDQ_OP_TLBI_NH_VA,
+    .tlbi = {
+    .addr = info->addr,
+    .vmid = smmu_domain->s2_cfg.vmid,
+    .asid = info->pasid,
+    .leaf = leaf,
+    },
+    };
+
+    do {
+    arm_smmu_cmdq_issue_cmd(smmu, &cmd);
+    cmd.tlbi.addr += info->granule_size;
+    } while (size -= info->granule_size);
+    arm_smmu_cmdq_issue_sync(smmu);


An this in particular I would really like to go all the way through
io_pgtable_tlb_add_flush()/io_pgtable_sync() if at all possible. Hooking
up range-based invalidations is going to be a massive headache if the
abstraction isn't solid.


The concern is the host does not "own" the s1 config asid
(smmu_domain->s1_cfg.cd.asid is not set, practically). In our case the
asid is only passed by userspace on the CACHE_INVALIDATE ioctl call.

arm_smmu_tlb_inv_context and arm_smmu_tlb_inv_range_nosync use this field


Right, but that's not exactly hard to solve. Even just something like 
the (untested, purely illustrative) refactoring below would be beneficial.


Robin.

->8-
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index d3880010c6cf..31ef703cf671 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1423,11 +1423,9 @@ static void arm_smmu_tlb_inv_context(void *cookie)
arm_smmu_cmdq_issue_sync(smmu);
 }

-static void arm_smmu_tlb_inv_range_nosync(unsigned long iova, size_t size,
- size_t granule, bool leaf, void 
*cookie)
+static void __arm_smmu_tlb_inv_range(struct arm_smmu_domain 
*smmu_domain, u16 asid,

+   unsigned long iova, size_t size, size_t granule, bool leaf)
 {
-   struct arm_smmu_domain *smmu_domain = cookie;
-   struct arm_smmu_device *smmu = smmu_domain->smmu;
struct arm_smmu_cmdq_ent cmd = {
.tlbi = {
.leaf   = leaf,
@@ -1437,18 +1435,27 @@ static void 
arm_smmu_tlb_inv_range_nosync(unsigned long iova, size_t size,


if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
cmd.opcode  = CMDQ_OP_TLBI_NH_VA;
-   cmd.tlbi.asid   = smmu_domain->s1_cfg.cd.asid;
+   cmd.tlbi.asid   = asid;
} else {
cmd.opcode  = CMDQ_OP_TLBI_S2_IPA;
cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
}

do {
-   

Re: [PATCH v7 18/23] iommu/smmuv3: Report non recoverable faults

2019-05-13 Thread Robin Murphy

On 13/05/2019 13:32, Auger Eric wrote:

Hi Robin,
On 5/13/19 1:54 PM, Robin Murphy wrote:

On 13/05/2019 08:46, Auger Eric wrote:

Hi Robin,

On 5/8/19 7:20 PM, Robin Murphy wrote:

On 08/04/2019 13:19, Eric Auger wrote:

When a stage 1 related fault event is read from the event queue,
let's propagate it to potential external fault listeners, ie. users
who registered a fault handler.

Signed-off-by: Eric Auger 

---
v4 -> v5:
- s/IOMMU_FAULT_PERM_INST/IOMMU_FAULT_PERM_EXEC
---
    drivers/iommu/arm-smmu-v3.c | 169
+---
    1 file changed, 158 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 805bc32a..1fd320788dcb 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -167,6 +167,26 @@
    #define ARM_SMMU_PRIQ_IRQ_CFG1    0xd8
    #define ARM_SMMU_PRIQ_IRQ_CFG2    0xdc
    +/* Events */
+#define ARM_SMMU_EVT_F_UUT    0x01
+#define ARM_SMMU_EVT_C_BAD_STREAMID    0x02
+#define ARM_SMMU_EVT_F_STE_FETCH    0x03
+#define ARM_SMMU_EVT_C_BAD_STE    0x04
+#define ARM_SMMU_EVT_F_BAD_ATS_TREQ    0x05
+#define ARM_SMMU_EVT_F_STREAM_DISABLED    0x06
+#define ARM_SMMU_EVT_F_TRANSL_FORBIDDEN    0x07
+#define ARM_SMMU_EVT_C_BAD_SUBSTREAMID    0x08
+#define ARM_SMMU_EVT_F_CD_FETCH    0x09
+#define ARM_SMMU_EVT_C_BAD_CD    0x0a
+#define ARM_SMMU_EVT_F_WALK_EABT    0x0b
+#define ARM_SMMU_EVT_F_TRANSLATION    0x10
+#define ARM_SMMU_EVT_F_ADDR_SIZE    0x11
+#define ARM_SMMU_EVT_F_ACCESS    0x12
+#define ARM_SMMU_EVT_F_PERMISSION    0x13
+#define ARM_SMMU_EVT_F_TLB_CONFLICT    0x20
+#define ARM_SMMU_EVT_F_CFG_CONFLICT    0x21
+#define ARM_SMMU_EVT_E_PAGE_REQUEST    0x24
+
    /* Common MSI config fields */
    #define MSI_CFG0_ADDR_MASK    GENMASK_ULL(51, 2)
    #define MSI_CFG2_SH    GENMASK(5, 4)
@@ -332,6 +352,15 @@
    #define EVTQ_MAX_SZ_SHIFT    7
      #define EVTQ_0_ID    GENMASK_ULL(7, 0)
+#define EVTQ_0_SSV    GENMASK_ULL(11, 11)
+#define EVTQ_0_SUBSTREAMID    GENMASK_ULL(31, 12)
+#define EVTQ_0_STREAMID    GENMASK_ULL(63, 32)
+#define EVTQ_1_PNU    GENMASK_ULL(33, 33)
+#define EVTQ_1_IND    GENMASK_ULL(34, 34)
+#define EVTQ_1_RNW    GENMASK_ULL(35, 35)
+#define EVTQ_1_S2    GENMASK_ULL(39, 39)
+#define EVTQ_1_CLASS    GENMASK_ULL(40, 41)
+#define EVTQ_3_FETCH_ADDR    GENMASK_ULL(51, 3)
      /* PRI queue */
    #define PRIQ_ENT_DWORDS    2
@@ -639,6 +668,64 @@ struct arm_smmu_domain {
    spinlock_t    devices_lock;
    };
    +/* fault propagation */
+
+#define IOMMU_FAULT_F_FIELDS    (IOMMU_FAULT_UNRECOV_PASID_VALID | \
+ IOMMU_FAULT_UNRECOV_PERM_VALID | \
+ IOMMU_FAULT_UNRECOV_ADDR_VALID)
+
+struct arm_smmu_fault_propagation_data {
+    enum iommu_fault_reason reason;
+    bool s1_check;
+    u32 fields; /* IOMMU_FAULT_UNRECOV_*_VALID bits */
+};
+
+/*
+ * Describes how SMMU faults translate into generic IOMMU faults
+ * and if they need to be reported externally
+ */
+static const struct arm_smmu_fault_propagation_data
fault_propagation[] = {
+[ARM_SMMU_EVT_F_UUT]    = { },
+[ARM_SMMU_EVT_C_BAD_STREAMID]    = { },
+[ARM_SMMU_EVT_F_STE_FETCH]    = { },
+[ARM_SMMU_EVT_C_BAD_STE]    = { },
+[ARM_SMMU_EVT_F_BAD_ATS_TREQ]    = { },
+[ARM_SMMU_EVT_F_STREAM_DISABLED]    = { },
+[ARM_SMMU_EVT_F_TRANSL_FORBIDDEN]    = { },
+[ARM_SMMU_EVT_C_BAD_SUBSTREAMID]    =
{IOMMU_FAULT_REASON_PASID_INVALID,
+   false,
+   IOMMU_FAULT_UNRECOV_PASID_VALID
+  },
+[ARM_SMMU_EVT_F_CD_FETCH]    = {IOMMU_FAULT_REASON_PASID_FETCH,
+   false,
+   IOMMU_FAULT_UNRECOV_PASID_VALID |


It doesn't make sense to presume validity here, or in any of the faults
below...






+   IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID
+  },
+[ARM_SMMU_EVT_C_BAD_CD]    =
{IOMMU_FAULT_REASON_BAD_PASID_ENTRY,
+   false,
+   IOMMU_FAULT_UNRECOV_PASID_VALID
+  },
+[ARM_SMMU_EVT_F_WALK_EABT]    = {IOMMU_FAULT_REASON_WALK_EABT,
true,
+   IOMMU_FAULT_F_FIELDS |
+   IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID
+  },
+[ARM_SMMU_EVT_F_TRANSLATION]    = {IOMMU_FAULT_REASON_PTE_FETCH,
true,
+   IOMMU_FAULT_F_FIELDS
+  },
+[ARM_SMMU_EVT_F_ADDR_SIZE]    = {IOMMU_FAULT_REASON_OOR_ADDRESS,
true,
+   IOMMU_FAULT_F_FIELDS
+  },
+[ARM_SMMU_EVT_F_ACCESS]    = {IOMMU_FAULT_REASON_ACCESS, true,
+   IOMMU_FAULT_F_FIELDS
+  },
+[ARM_SMMU_EVT_F_PERMISSION]    = {IOMMU_FAULT_REASON_PERMISSION,
true,
+   IOMMU_FAULT_F_FIE

Re: [PATCH v7 13/23] iommu/smmuv3: Implement attach/detach_pasid_table

2019-05-13 Thread Robin Murphy

On 10/05/2019 15:35, Auger Eric wrote:

Hi Robin,

On 5/8/19 4:38 PM, Robin Murphy wrote:

On 08/04/2019 13:19, Eric Auger wrote:

On attach_pasid_table() we program STE S1 related info set
by the guest into the actual physical STEs. At minimum
we need to program the context descriptor GPA and compute
whether the stage1 is translated/bypassed or aborted.

Signed-off-by: Eric Auger 

---
v6 -> v7:
- check versions and comment the fact we don't need to take
    into account s1dss and s1fmt
v3 -> v4:
- adapt to changes in iommu_pasid_table_config
- different programming convention at s1_cfg/s2_cfg/ste.abort

v2 -> v3:
- callback now is named set_pasid_table and struct fields
    are laid out differently.

v1 -> v2:
- invalidate the STE before changing them
- hold init_mutex
- handle new fields
---
   drivers/iommu/arm-smmu-v3.c | 121 
   1 file changed, 121 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index e22e944ffc05..1486baf53425 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2207,6 +2207,125 @@ static void arm_smmu_put_resv_regions(struct
device *dev,
   kfree(entry);
   }
   +static int arm_smmu_attach_pasid_table(struct iommu_domain *domain,
+   struct iommu_pasid_table_config *cfg)
+{
+    struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+    struct arm_smmu_master_data *entry;
+    struct arm_smmu_s1_cfg *s1_cfg;
+    struct arm_smmu_device *smmu;
+    unsigned long flags;
+    int ret = -EINVAL;
+
+    if (cfg->format != IOMMU_PASID_FORMAT_SMMUV3)
+    return -EINVAL;
+
+    if (cfg->version != PASID_TABLE_CFG_VERSION_1 ||
+    cfg->smmuv3.version != PASID_TABLE_SMMUV3_CFG_VERSION_1)
+    return -EINVAL;
+
+    mutex_lock(&smmu_domain->init_mutex);
+
+    smmu = smmu_domain->smmu;
+
+    if (!smmu)
+    goto out;
+
+    if (!((smmu->features & ARM_SMMU_FEAT_TRANS_S1) &&
+  (smmu->features & ARM_SMMU_FEAT_TRANS_S2))) {
+    dev_info(smmu_domain->smmu->dev,
+ "does not implement two stages\n");
+    goto out;
+    }


That check is redundant (and frankly looks a little bit spammy). If the
one below is not enough, there is a problem elsewhere - if it's possible
for smmu_domain->stage to ever get set to ARM_SMMU_DOMAIN_NESTED without
both stages of translation present, we've already gone fundamentally wrong.


Makes sense. Moved that check to arm_smmu_domain_finalise() instead and
removed the redundant ones.


Urgh, I forgot exactly how the crazy domain-allocation dance worked, 
such that we're not in a position to refuse the domain_set_attr() call 
itself, but that does sound like the best compromise for now.
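
i.e. something like this in arm_smmu_domain_finalise() (untested, purely
illustrative):

	/* Nested translation needs both stages implemented in hardware */
	if (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED &&
	    (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1) ||
	     !(smmu->features & ARM_SMMU_FEAT_TRANS_S2)))
		return -EINVAL;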


Thanks,
Robin.


Re: [PATCH v7 18/23] iommu/smmuv3: Report non recoverable faults

2019-05-13 Thread Robin Murphy

On 13/05/2019 08:46, Auger Eric wrote:

Hi Robin,

On 5/8/19 7:20 PM, Robin Murphy wrote:

On 08/04/2019 13:19, Eric Auger wrote:

When a stage 1 related fault event is read from the event queue,
let's propagate it to potential external fault listeners, ie. users
who registered a fault handler.

Signed-off-by: Eric Auger 

---
v4 -> v5:
- s/IOMMU_FAULT_PERM_INST/IOMMU_FAULT_PERM_EXEC
---
   drivers/iommu/arm-smmu-v3.c | 169 +---
   1 file changed, 158 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 805bc32a..1fd320788dcb 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -167,6 +167,26 @@
   #define ARM_SMMU_PRIQ_IRQ_CFG1    0xd8
   #define ARM_SMMU_PRIQ_IRQ_CFG2    0xdc
   +/* Events */
+#define ARM_SMMU_EVT_F_UUT    0x01
+#define ARM_SMMU_EVT_C_BAD_STREAMID    0x02
+#define ARM_SMMU_EVT_F_STE_FETCH    0x03
+#define ARM_SMMU_EVT_C_BAD_STE    0x04
+#define ARM_SMMU_EVT_F_BAD_ATS_TREQ    0x05
+#define ARM_SMMU_EVT_F_STREAM_DISABLED    0x06
+#define ARM_SMMU_EVT_F_TRANSL_FORBIDDEN    0x07
+#define ARM_SMMU_EVT_C_BAD_SUBSTREAMID    0x08
+#define ARM_SMMU_EVT_F_CD_FETCH    0x09
+#define ARM_SMMU_EVT_C_BAD_CD    0x0a
+#define ARM_SMMU_EVT_F_WALK_EABT    0x0b
+#define ARM_SMMU_EVT_F_TRANSLATION    0x10
+#define ARM_SMMU_EVT_F_ADDR_SIZE    0x11
+#define ARM_SMMU_EVT_F_ACCESS    0x12
+#define ARM_SMMU_EVT_F_PERMISSION    0x13
+#define ARM_SMMU_EVT_F_TLB_CONFLICT    0x20
+#define ARM_SMMU_EVT_F_CFG_CONFLICT    0x21
+#define ARM_SMMU_EVT_E_PAGE_REQUEST    0x24
+
   /* Common MSI config fields */
   #define MSI_CFG0_ADDR_MASK    GENMASK_ULL(51, 2)
   #define MSI_CFG2_SH    GENMASK(5, 4)
@@ -332,6 +352,15 @@
   #define EVTQ_MAX_SZ_SHIFT    7
     #define EVTQ_0_ID    GENMASK_ULL(7, 0)
+#define EVTQ_0_SSV    GENMASK_ULL(11, 11)
+#define EVTQ_0_SUBSTREAMID    GENMASK_ULL(31, 12)
+#define EVTQ_0_STREAMID    GENMASK_ULL(63, 32)
+#define EVTQ_1_PNU    GENMASK_ULL(33, 33)
+#define EVTQ_1_IND    GENMASK_ULL(34, 34)
+#define EVTQ_1_RNW    GENMASK_ULL(35, 35)
+#define EVTQ_1_S2    GENMASK_ULL(39, 39)
+#define EVTQ_1_CLASS    GENMASK_ULL(40, 41)
+#define EVTQ_3_FETCH_ADDR    GENMASK_ULL(51, 3)
     /* PRI queue */
   #define PRIQ_ENT_DWORDS    2
@@ -639,6 +668,64 @@ struct arm_smmu_domain {
   spinlock_t    devices_lock;
   };
   +/* fault propagation */
+
+#define IOMMU_FAULT_F_FIELDS    (IOMMU_FAULT_UNRECOV_PASID_VALID | \
+ IOMMU_FAULT_UNRECOV_PERM_VALID | \
+ IOMMU_FAULT_UNRECOV_ADDR_VALID)
+
+struct arm_smmu_fault_propagation_data {
+    enum iommu_fault_reason reason;
+    bool s1_check;
+    u32 fields; /* IOMMU_FAULT_UNRECOV_*_VALID bits */
+};
+
+/*
+ * Describes how SMMU faults translate into generic IOMMU faults
+ * and if they need to be reported externally
+ */
+static const struct arm_smmu_fault_propagation_data
fault_propagation[] = {
+[ARM_SMMU_EVT_F_UUT]    = { },
+[ARM_SMMU_EVT_C_BAD_STREAMID]    = { },
+[ARM_SMMU_EVT_F_STE_FETCH]    = { },
+[ARM_SMMU_EVT_C_BAD_STE]    = { },
+[ARM_SMMU_EVT_F_BAD_ATS_TREQ]    = { },
+[ARM_SMMU_EVT_F_STREAM_DISABLED]    = { },
+[ARM_SMMU_EVT_F_TRANSL_FORBIDDEN]    = { },
+[ARM_SMMU_EVT_C_BAD_SUBSTREAMID]    = {IOMMU_FAULT_REASON_PASID_INVALID,
+   false,
+   IOMMU_FAULT_UNRECOV_PASID_VALID
+  },
+[ARM_SMMU_EVT_F_CD_FETCH]    = {IOMMU_FAULT_REASON_PASID_FETCH,
+   false,
+   IOMMU_FAULT_UNRECOV_PASID_VALID |


It doesn't make sense to presume validity here, or in any of the faults
below...






+   IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID
+  },
+[ARM_SMMU_EVT_C_BAD_CD]    =
{IOMMU_FAULT_REASON_BAD_PASID_ENTRY,
+   false,
+   IOMMU_FAULT_UNRECOV_PASID_VALID
+  },
+[ARM_SMMU_EVT_F_WALK_EABT]    = {IOMMU_FAULT_REASON_WALK_EABT, true,
+   IOMMU_FAULT_F_FIELDS |
+   IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID
+  },
+[ARM_SMMU_EVT_F_TRANSLATION]    = {IOMMU_FAULT_REASON_PTE_FETCH,
true,
+   IOMMU_FAULT_F_FIELDS
+  },
+[ARM_SMMU_EVT_F_ADDR_SIZE]    = {IOMMU_FAULT_REASON_OOR_ADDRESS,
true,
+   IOMMU_FAULT_F_FIELDS
+  },
+[ARM_SMMU_EVT_F_ACCESS]    = {IOMMU_FAULT_REASON_ACCESS, true,
+   IOMMU_FAULT_F_FIELDS
+  },
+[ARM_SMMU_EVT_F_PERMISSION]    = {IOMMU_FAULT_REASON_PERMISSION,
true,
+   IOMMU_FAULT_F_FIELDS
+  },
+[ARM_SMMU_EVT_F_TLB_CONFLICT]    = { },
+[ARM_SMMU_EVT_F_CFG_CONFLICT]    = { },
+[ARM_SMMU_EVT_E_PAGE_REQU

Re: [PATCH v7 12/23] iommu/smmuv3: Get prepared for nested stage support

2019-05-13 Thread Robin Murphy

On 10/05/2019 15:34, Auger Eric wrote:

Hi Robin,

On 5/8/19 4:24 PM, Robin Murphy wrote:

On 08/04/2019 13:19, Eric Auger wrote:

To allow nested stage support, we need to store both
stage 1 and stage 2 configurations (and remove the former
union).

A nested setup is characterized by both s1_cfg and s2_cfg
set.

We introduce a new ste.abort field that will be set upon
guest stage1 configuration passing. If s1_cfg is NULL and
ste.abort is set, traffic can't pass. If ste.abort is not set,
S1 is bypassed.

arm_smmu_write_strtab_ent() is modified to write both stage
fields in the STE and deal with the abort field.

In nested mode, only stage 2 is "finalized" as the host does
not own/configure the stage 1 context descriptor; the guest does.

Signed-off-by: Eric Auger 

---

v4 -> v5:
- reset ste.abort on detach

v3 -> v4:
- s1_cfg.nested_abort and nested_bypass removed.
- s/ste.nested/ste.abort
- arm_smmu_write_strtab_ent modifications with introduction
    of local abort, bypass and translate local variables
- comment updated

v1 -> v2:
- invalidate the STE before moving from a live STE config to another
- add the nested_abort and nested_bypass fields
---
   drivers/iommu/arm-smmu-v3.c | 35 ---
   1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 21d027695181..e22e944ffc05 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -211,6 +211,7 @@
   #define STRTAB_STE_0_CFG_BYPASS    4
   #define STRTAB_STE_0_CFG_S1_TRANS    5
   #define STRTAB_STE_0_CFG_S2_TRANS    6
+#define STRTAB_STE_0_CFG_NESTED    7
     #define STRTAB_STE_0_S1FMT    GENMASK_ULL(5, 4)
   #define STRTAB_STE_0_S1FMT_LINEAR    0
@@ -514,6 +515,7 @@ struct arm_smmu_strtab_ent {
    * configured according to the domain type.
    */
   bool    assigned;
+    bool    abort;
   struct arm_smmu_s1_cfg    *s1_cfg;
   struct arm_smmu_s2_cfg    *s2_cfg;
   };
@@ -628,10 +630,8 @@ struct arm_smmu_domain {
   bool    non_strict;
     enum arm_smmu_domain_stage    stage;
-    union {
-    struct arm_smmu_s1_cfg    s1_cfg;
-    struct arm_smmu_s2_cfg    s2_cfg;
-    };
+    struct arm_smmu_s1_cfg    s1_cfg;
+    struct arm_smmu_s2_cfg    s2_cfg;
     struct iommu_domain    domain;
   @@ -1108,12 +1108,13 @@ static void arm_smmu_write_strtab_ent(struct
arm_smmu_device *smmu, u32 sid,
     __le64 *dst, struct arm_smmu_strtab_ent *ste)
   {
   /*
- * This is hideously complicated, but we only really care about
- * three cases at the moment:
+ * We care about the following transitions:
    *
    * 1. Invalid (all zero) -> bypass/fault (init)
- * 2. Bypass/fault -> translation/bypass (attach)
- * 3. Translation/bypass -> bypass/fault (detach)
+ * 2. Bypass/fault -> single stage translation/bypass (attach)
+ * 3. single stage Translation/bypass -> bypass/fault (detach)
+ * 4. S2 -> S1 + S2 (attach_pasid_table)
+ * 5. S1 + S2 -> S2 (detach_pasid_table)
    *
    * Given that we can't update the STE atomically and the SMMU
    * doesn't read the thing in a defined order, that leaves us
@@ -1124,7 +1125,7 @@ static void arm_smmu_write_strtab_ent(struct
arm_smmu_device *smmu, u32 sid,
    * 3. Update Config, sync
    */
   u64 val = le64_to_cpu(dst[0]);
-    bool ste_live = false;
+    bool abort, bypass, translate, ste_live = false;
   struct arm_smmu_cmdq_ent prefetch_cmd = {
   .opcode    = CMDQ_OP_PREFETCH_CFG,
   .prefetch    = {
@@ -1138,11 +1139,11 @@ static void arm_smmu_write_strtab_ent(struct
arm_smmu_device *smmu, u32 sid,
   break;
   case STRTAB_STE_0_CFG_S1_TRANS:
   case STRTAB_STE_0_CFG_S2_TRANS:
+    case STRTAB_STE_0_CFG_NESTED:
   ste_live = true;
   break;
   case STRTAB_STE_0_CFG_ABORT:
-    if (disable_bypass)
-    break;
+    break;
   default:
   BUG(); /* STE corruption */
   }
@@ -1152,8 +1153,13 @@ static void arm_smmu_write_strtab_ent(struct
arm_smmu_device *smmu, u32 sid,
   val = STRTAB_STE_0_V;
     /* Bypass/fault */
-    if (!ste->assigned || !(ste->s1_cfg || ste->s2_cfg)) {
-    if (!ste->assigned && disable_bypass)
+
+    abort = (!ste->assigned && disable_bypass) || ste->abort;
+    translate = ste->s1_cfg || ste->s2_cfg;
+    bypass = !abort && !translate;
+
+    if (abort || bypass) {
+    if (abort)
   val |= FIELD_PREP(STRTAB_STE_0_CFG,
STRTAB_STE_0_CFG_ABORT);
   else
   val |= FIELD_PREP(STRTAB_STE_0_CFG,
STRTAB_STE_0_CFG_BYPASS);
@@ -1172,7 +1178,6 @@ static void arm_smmu_write_strtab_ent(struct
arm_smmu_device *smmu, u32 sid,
  

Re: [PATCH v7 18/23] iommu/smmuv3: Report non recoverable faults

2019-05-08 Thread Robin Murphy

On 08/04/2019 13:19, Eric Auger wrote:

When a stage 1 related fault event is read from the event queue,
let's propagate it to potential external fault listeners, ie. users
who registered a fault handler.

Signed-off-by: Eric Auger 

---
v4 -> v5:
- s/IOMMU_FAULT_PERM_INST/IOMMU_FAULT_PERM_EXEC
---
  drivers/iommu/arm-smmu-v3.c | 169 +---
  1 file changed, 158 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 805bc32a..1fd320788dcb 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -167,6 +167,26 @@
  #define ARM_SMMU_PRIQ_IRQ_CFG10xd8
  #define ARM_SMMU_PRIQ_IRQ_CFG20xdc
  
+/* Events */

+#define ARM_SMMU_EVT_F_UUT 0x01
+#define ARM_SMMU_EVT_C_BAD_STREAMID0x02
+#define ARM_SMMU_EVT_F_STE_FETCH   0x03
+#define ARM_SMMU_EVT_C_BAD_STE 0x04
+#define ARM_SMMU_EVT_F_BAD_ATS_TREQ0x05
+#define ARM_SMMU_EVT_F_STREAM_DISABLED 0x06
+#define ARM_SMMU_EVT_F_TRANSL_FORBIDDEN0x07
+#define ARM_SMMU_EVT_C_BAD_SUBSTREAMID 0x08
+#define ARM_SMMU_EVT_F_CD_FETCH0x09
+#define ARM_SMMU_EVT_C_BAD_CD  0x0a
+#define ARM_SMMU_EVT_F_WALK_EABT   0x0b
+#define ARM_SMMU_EVT_F_TRANSLATION 0x10
+#define ARM_SMMU_EVT_F_ADDR_SIZE   0x11
+#define ARM_SMMU_EVT_F_ACCESS  0x12
+#define ARM_SMMU_EVT_F_PERMISSION  0x13
+#define ARM_SMMU_EVT_F_TLB_CONFLICT0x20
+#define ARM_SMMU_EVT_F_CFG_CONFLICT0x21
+#define ARM_SMMU_EVT_E_PAGE_REQUEST0x24
+
  /* Common MSI config fields */
  #define MSI_CFG0_ADDR_MASKGENMASK_ULL(51, 2)
  #define MSI_CFG2_SH   GENMASK(5, 4)
@@ -332,6 +352,15 @@
  #define EVTQ_MAX_SZ_SHIFT 7
  
  #define EVTQ_0_ID			GENMASK_ULL(7, 0)

+#define EVTQ_0_SSV GENMASK_ULL(11, 11)
+#define EVTQ_0_SUBSTREAMID GENMASK_ULL(31, 12)
+#define EVTQ_0_STREAMIDGENMASK_ULL(63, 32)
+#define EVTQ_1_PNU GENMASK_ULL(33, 33)
+#define EVTQ_1_IND GENMASK_ULL(34, 34)
+#define EVTQ_1_RNW GENMASK_ULL(35, 35)
+#define EVTQ_1_S2  GENMASK_ULL(39, 39)
+#define EVTQ_1_CLASS   GENMASK_ULL(40, 41)
+#define EVTQ_3_FETCH_ADDR  GENMASK_ULL(51, 3)
  
  /* PRI queue */

  #define PRIQ_ENT_DWORDS   2
@@ -639,6 +668,64 @@ struct arm_smmu_domain {
spinlock_t  devices_lock;
  };
  
+/* fault propagation */

+
+#define IOMMU_FAULT_F_FIELDS   (IOMMU_FAULT_UNRECOV_PASID_VALID | \
+IOMMU_FAULT_UNRECOV_PERM_VALID | \
+IOMMU_FAULT_UNRECOV_ADDR_VALID)
+
+struct arm_smmu_fault_propagation_data {
+   enum iommu_fault_reason reason;
+   bool s1_check;
+   u32 fields; /* IOMMU_FAULT_UNRECOV_*_VALID bits */
+};
+
+/*
+ * Describes how SMMU faults translate into generic IOMMU faults
+ * and if they need to be reported externally
+ */
+static const struct arm_smmu_fault_propagation_data fault_propagation[] = {
+[ARM_SMMU_EVT_F_UUT]   = { },
+[ARM_SMMU_EVT_C_BAD_STREAMID]  = { },
+[ARM_SMMU_EVT_F_STE_FETCH] = { },
+[ARM_SMMU_EVT_C_BAD_STE]   = { },
+[ARM_SMMU_EVT_F_BAD_ATS_TREQ]  = { },
+[ARM_SMMU_EVT_F_STREAM_DISABLED]   = { },
+[ARM_SMMU_EVT_F_TRANSL_FORBIDDEN]  = { },
+[ARM_SMMU_EVT_C_BAD_SUBSTREAMID]   = {IOMMU_FAULT_REASON_PASID_INVALID,
+  false,
+  IOMMU_FAULT_UNRECOV_PASID_VALID
+ },
+[ARM_SMMU_EVT_F_CD_FETCH]  = {IOMMU_FAULT_REASON_PASID_FETCH,
+  false,
+  IOMMU_FAULT_UNRECOV_PASID_VALID |


It doesn't make sense to presume validity here, or in any of the faults 
below...
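
i.e. rather than baking the VALID bits into the static table, only set them
once the field has actually been decoded from the event record - something
like this (untested, purely illustrative, with made-up local names):

	/* Only advertise a PASID if the event record says one was supplied */
	if (FIELD_GET(EVTQ_0_SSV, evt[0])) {
		fault.pasid = FIELD_GET(EVTQ_0_SUBSTREAMID, evt[0]);
		fault.flags |= IOMMU_FAULT_UNRECOV_PASID_VALID;
	}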



+  IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID
+ },
+[ARM_SMMU_EVT_C_BAD_CD]= 
{IOMMU_FAULT_REASON_BAD_PASID_ENTRY,
+  false,
+  IOMMU_FAULT_UNRECOV_PASID_VALID
+ },
+[ARM_SMMU_EVT_F_WALK_EABT] = {IOMMU_FAULT_REASON_WALK_EABT, true,
+  IOMMU_FAULT_F_FIELDS |
+  IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID
+ },
+[ARM_SMMU_EVT_F_TRANSLATION]   = {IOMMU_FAULT_REASON_PTE_FETCH, true,
+  IOMMU_FAULT_F_FIELDS
+ },
+[ARM_SMMU_EVT_F_ADDR_SIZE] = {IOMMU_FAULT_REASON_OOR_ADDRESS, true,
+  

Re: [PATCH v7 15/23] dma-iommu: Implement NESTED_MSI cookie

2019-05-08 Thread Robin Murphy

On 08/04/2019 13:19, Eric Auger wrote:

Up to now, when the type was UNMANAGED, we used to
allocate IOVA pages within a reserved IOVA MSI range.

If both the host and the guest have SMMUs exposed to them, each
would allocate an IOVA. The guest allocates an IOVA (gIOVA)
to map onto the guest MSI doorbell (gDB). The Host allocates
another IOVA (hIOVA) to map onto the physical doorbell (hDB).

So we end up with 2 unrelated mappings, at S1 and S2:
  S1 S2
gIOVA-> gDB
hIOVA->hDB

The PCI device would be programmed with hIOVA.
No stage 1 mapping would exist, causing the MSIs to fault.

iommu_dma_bind_guest_msi() allows passing gIOVA/gDB
to the host so that gIOVA can be used by the host instead of
re-allocating a new hIOVA.

  S1   S2
gIOVA->gDB->hDB

this time, the PCI device can be programmed with the gIOVA MSI
doorbell which is correctly mapped through both stages.


Hmm, this implies that both the guest kernel and host userspace are 
totally broken if hDB is a hardware MSI region...


Robin.


Signed-off-by: Eric Auger 

---
v6 -> v7:
- removed device handle

v3 -> v4:
- change function names; add unregister
- protect with msi_lock

v2 -> v3:
- also store the device handle on S1 mapping registration.
   This guarantees the associated S2 mapping binds
   to the correct physical MSI controller.

v1 -> v2:
- unmap stage2 on put()
---
  drivers/iommu/dma-iommu.c | 129 +-
  include/linux/dma-iommu.h |  17 +
  2 files changed, 143 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 77aabe637a60..9905260ad342 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -35,12 +35,15 @@
  struct iommu_dma_msi_page {
struct list_headlist;
dma_addr_t  iova;
+   dma_addr_t  gpa;
phys_addr_t phys;
+   size_t  s1_granule;
  };
  
  enum iommu_dma_cookie_type {

IOMMU_DMA_IOVA_COOKIE,
IOMMU_DMA_MSI_COOKIE,
+   IOMMU_DMA_NESTED_MSI_COOKIE,
  };
  
  struct iommu_dma_cookie {

@@ -110,14 +113,17 @@ EXPORT_SYMBOL(iommu_get_dma_cookie);
   *
   * Users who manage their own IOVA allocation and do not want DMA API support,
   * but would still like to take advantage of automatic MSI remapping, can use
- * this to initialise their own domain appropriately. Users should reserve a
+ * this to initialise their own domain appropriately. Users may reserve a
   * contiguous IOVA region, starting at @base, large enough to accommodate the
   * number of PAGE_SIZE mappings necessary to cover every MSI doorbell address
- * used by the devices attached to @domain.
+ * used by the devices attached to @domain. The other way round is to provide
+ * usable iova pages through the iommu_dma_bind_doorbell API (nested stages
+ * use case)
   */
  int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
  {
struct iommu_dma_cookie *cookie;
+   int nesting, ret;
  
  	if (domain->type != IOMMU_DOMAIN_UNMANAGED)

return -EINVAL;
@@ -125,7 +131,12 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, 
dma_addr_t base)
if (domain->iova_cookie)
return -EEXIST;
  
-	cookie = cookie_alloc(IOMMU_DMA_MSI_COOKIE);

+   ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nesting);
+   if (!ret && nesting)
+   cookie = cookie_alloc(IOMMU_DMA_NESTED_MSI_COOKIE);
+   else
+   cookie = cookie_alloc(IOMMU_DMA_MSI_COOKIE);
+
if (!cookie)
return -ENOMEM;
  
@@ -146,6 +157,7 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)

  {
struct iommu_dma_cookie *cookie = domain->iova_cookie;
struct iommu_dma_msi_page *msi, *tmp;
+   bool s2_unmap = false;
  
  	if (!cookie)

return;
@@ -153,7 +165,15 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
if (cookie->type == IOMMU_DMA_IOVA_COOKIE && cookie->iovad.granule)
		put_iova_domain(&cookie->iovad);
  
+	if (cookie->type == IOMMU_DMA_NESTED_MSI_COOKIE)

+   s2_unmap = true;
+
	list_for_each_entry_safe(msi, tmp, &cookie->msi_page_list, list) {
+   if (s2_unmap && msi->phys) {
+   size_t size = cookie_msi_granule(cookie);
+
+   WARN_ON(iommu_unmap(domain, msi->gpa, size) != size);
+   }
		list_del(&msi->list);
kfree(msi);
}
@@ -162,6 +182,82 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
  }
  EXPORT_SYMBOL(iommu_put_dma_cookie);
  
+/**

+ * iommu_dma_bind_guest_msi - Allows to pass the stage 1
+ * binding of a virtual MSI doorbell used by @dev.
+ *
+ * @domain: domain handle
+ * @iova: guest iova
+ * @gpa: gpa of the virtual doorbell
+ * @size: size of the granule used for the stage1 mapping
+ *
+ * In nested 

Re: [PATCH v7 14/23] iommu/smmuv3: Implement cache_invalidate

2019-05-08 Thread Robin Murphy

On 08/04/2019 13:19, Eric Auger wrote:

Implement domain-selective and page-selective IOTLB invalidations.

Signed-off-by: Eric Auger 

---
v6 -> v7
- check the uapi version

v3 -> v4:
- adapt to changes in the uapi
- add support for leaf parameter
- do not use arm_smmu_tlb_inv_range_nosync or arm_smmu_tlb_inv_context
   anymore

v2 -> v3:
- replace __arm_smmu_tlb_sync by arm_smmu_cmdq_issue_sync

v1 -> v2:
- properly pass the asid
---
  drivers/iommu/arm-smmu-v3.c | 60 +
  1 file changed, 60 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 1486baf53425..4366921d8318 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2326,6 +2326,65 @@ static void arm_smmu_detach_pasid_table(struct 
iommu_domain *domain)
	mutex_unlock(&smmu_domain->init_mutex);
  }
  
+static int

+arm_smmu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
+ struct iommu_cache_invalidate_info *inv_info)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+   return -EINVAL;
+
+   if (!smmu)
+   return -EINVAL;
+
+   if (inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
+   return -EINVAL;
+
+   if (inv_info->cache & IOMMU_CACHE_INV_TYPE_IOTLB) {
+   if (inv_info->granularity == IOMMU_INV_GRANU_PASID) {
+   struct arm_smmu_cmdq_ent cmd = {
+   .opcode = CMDQ_OP_TLBI_NH_ASID,
+   .tlbi = {
+   .vmid = smmu_domain->s2_cfg.vmid,
+   .asid = inv_info->pasid,
+   },
+   };
+
+   arm_smmu_cmdq_issue_cmd(smmu, &cmd);
+   arm_smmu_cmdq_issue_sync(smmu);


I'd much rather make arm_smmu_tlb_inv_context() understand nested 
domains than open-code commands all over the place.
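
What I have in mind is more along the lines of a helper that takes the ASID
explicitly, so this path doesn't need to build the command by hand (untested,
purely illustrative):

	static void __arm_smmu_tlb_inv_asid(struct arm_smmu_domain *smmu_domain,
					    u16 asid)
	{
		struct arm_smmu_device *smmu = smmu_domain->smmu;
		struct arm_smmu_cmdq_ent cmd = {
			.opcode	= CMDQ_OP_TLBI_NH_ASID,
			.tlbi	= {
				.asid	= asid,
				.vmid	= smmu_domain->s2_cfg.vmid,
			},
		};

		arm_smmu_cmdq_issue_cmd(smmu, &cmd);
		arm_smmu_cmdq_issue_sync(smmu);
	}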



+
+   } else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR) {
+   struct iommu_inv_addr_info *info = &inv_info->addr_info;
+   size_t size = info->nb_granules * info->granule_size;
+   bool leaf = info->flags & IOMMU_INV_ADDR_FLAGS_LEAF;
+   struct arm_smmu_cmdq_ent cmd = {
+   .opcode = CMDQ_OP_TLBI_NH_VA,
+   .tlbi = {
+   .addr = info->addr,
+   .vmid = smmu_domain->s2_cfg.vmid,
+   .asid = info->pasid,
+   .leaf = leaf,
+   },
+   };
+
+   do {
+   arm_smmu_cmdq_issue_cmd(smmu, &cmd);
+   cmd.tlbi.addr += info->granule_size;
+   } while (size -= info->granule_size);
+   arm_smmu_cmdq_issue_sync(smmu);


An this in particular I would really like to go all the way through 
io_pgtable_tlb_add_flush()/io_pgtable_sync() if at all possible. Hooking 
up range-based invalidations is going to be a massive headache if the 
abstraction isn't solid.


Robin.


+   } else {
+   return -EINVAL;
+   }
+   }
+   if (inv_info->cache & IOMMU_CACHE_INV_TYPE_PASID ||
+   inv_info->cache & IOMMU_CACHE_INV_TYPE_DEV_IOTLB) {
+   return -ENOENT;
+   }
+   return 0;
+}
+
  static struct iommu_ops arm_smmu_ops = {
.capable= arm_smmu_capable,
.domain_alloc   = arm_smmu_domain_alloc,
@@ -2346,6 +2405,7 @@ static struct iommu_ops arm_smmu_ops = {
.put_resv_regions   = arm_smmu_put_resv_regions,
.attach_pasid_table = arm_smmu_attach_pasid_table,
.detach_pasid_table = arm_smmu_detach_pasid_table,
+   .cache_invalidate   = arm_smmu_cache_invalidate,
.pgsize_bitmap  = -1UL, /* Restricted during device attach */
  };
  




Re: [PATCH v7 13/23] iommu/smmuv3: Implement attach/detach_pasid_table

2019-05-08 Thread Robin Murphy

On 08/04/2019 13:19, Eric Auger wrote:

On attach_pasid_table() we program STE S1 related info set
by the guest into the actual physical STEs. At minimum
we need to program the context descriptor GPA and compute
whether the stage1 is translated/bypassed or aborted.

Signed-off-by: Eric Auger 

---
v6 -> v7:
- check versions and comment the fact we don't need to take
   into account s1dss and s1fmt
v3 -> v4:
- adapt to changes in iommu_pasid_table_config
- different programming convention at s1_cfg/s2_cfg/ste.abort

v2 -> v3:
- callback now is named set_pasid_table and struct fields
   are laid out differently.

v1 -> v2:
- invalidate the STE before changing them
- hold init_mutex
- handle new fields
---
  drivers/iommu/arm-smmu-v3.c | 121 
  1 file changed, 121 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index e22e944ffc05..1486baf53425 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2207,6 +2207,125 @@ static void arm_smmu_put_resv_regions(struct device 
*dev,
kfree(entry);
  }
  
+static int arm_smmu_attach_pasid_table(struct iommu_domain *domain,

+  struct iommu_pasid_table_config *cfg)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct arm_smmu_master_data *entry;
+   struct arm_smmu_s1_cfg *s1_cfg;
+   struct arm_smmu_device *smmu;
+   unsigned long flags;
+   int ret = -EINVAL;
+
+   if (cfg->format != IOMMU_PASID_FORMAT_SMMUV3)
+   return -EINVAL;
+
+   if (cfg->version != PASID_TABLE_CFG_VERSION_1 ||
+   cfg->smmuv3.version != PASID_TABLE_SMMUV3_CFG_VERSION_1)
+   return -EINVAL;
+
+   mutex_lock(&smmu_domain->init_mutex);
+
+   smmu = smmu_domain->smmu;
+
+   if (!smmu)
+   goto out;
+
+   if (!((smmu->features & ARM_SMMU_FEAT_TRANS_S1) &&
+ (smmu->features & ARM_SMMU_FEAT_TRANS_S2))) {
+   dev_info(smmu_domain->smmu->dev,
+"does not implement two stages\n");
+   goto out;
+   }


That check is redundant (and frankly looks a little bit spammy). If the 
one below is not enough, there is a problem elsewhere - if it's possible 
for smmu_domain->stage to ever get set to ARM_SMMU_DOMAIN_NESTED without 
both stages of translation present, we've already gone fundamentally wrong.



+
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+   goto out;
+
+   switch (cfg->config) {
+   case IOMMU_PASID_CONFIG_ABORT:
+   spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+   list_for_each_entry(entry, &smmu_domain->devices, list) {
+   entry->ste.s1_cfg = NULL;
+   entry->ste.abort = true;
+   arm_smmu_install_ste_for_dev(entry->dev->iommu_fwspec);
+   }
+   spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+   ret = 0;
+   break;
+   case IOMMU_PASID_CONFIG_BYPASS:
+   spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+   list_for_each_entry(entry, &smmu_domain->devices, list) {
+   entry->ste.s1_cfg = NULL;
+   entry->ste.abort = false;
+   arm_smmu_install_ste_for_dev(entry->dev->iommu_fwspec);
+   }
+   spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+   ret = 0;
+   break;
+   case IOMMU_PASID_CONFIG_TRANSLATE:
+   /*
+* we currently support a single CD so s1fmt and s1dss
+* fields are also ignored
+*/
+   if (cfg->pasid_bits)
+   goto out;
+
+   s1_cfg = &smmu_domain->s1_cfg;
+   s1_cfg->cdptr_dma = cfg->base_ptr;
+
+   spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+   list_for_each_entry(entry, &smmu_domain->devices, list) {
+   entry->ste.s1_cfg = s1_cfg;


Either we reject valid->valid transitions outright, or we need to remove 
and invalidate the existing S1 context from the STE at this point, no?



+   entry->ste.abort = false;
+   arm_smmu_install_ste_for_dev(entry->dev->iommu_fwspec);
+   }
+   spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+   ret = 0;
+   break;
+   default:
+   break;
+   }
+out:
+   mutex_unlock(&smmu_domain->init_mutex);
+   return ret;
+}
+
+static void arm_smmu_detach_pasid_table(struct iommu_domain *domain)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct arm_smmu_master_data *entry;
+   struct arm_smmu_device *smmu;
+   unsigned long flags;
+
+   mutex_lock(&smmu_domain->init_mutex);
+
+   smmu = smmu_domain->smmu;
+
+   if 

Re: [PATCH v7 12/23] iommu/smmuv3: Get prepared for nested stage support

2019-05-08 Thread Robin Murphy

On 08/04/2019 13:19, Eric Auger wrote:

To allow nested stage support, we need to store both
stage 1 and stage 2 configurations (and remove the former
union).

A nested setup is characterized by both s1_cfg and s2_cfg
set.

We introduce a new ste.abort field that will be set upon
guest stage1 configuration passing. If s1_cfg is NULL and
ste.abort is set, traffic can't pass. If ste.abort is not set,
S1 is bypassed.

arm_smmu_write_strtab_ent() is modified to write both stage
fields in the STE and deal with the abort field.

In nested mode, only stage 2 is "finalized" as the host does
not own/configure the stage 1 context descriptor; the guest does.

Signed-off-by: Eric Auger 

---

v4 -> v5:
- reset ste.abort on detach

v3 -> v4:
- s1_cfg.nested_abort and nested_bypass removed.
- s/ste.nested/ste.abort
- arm_smmu_write_strtab_ent modifications with introduction
   of local abort, bypass and translate local variables
- comment updated

v1 -> v2:
- invalidate the STE before moving from a live STE config to another
- add the nested_abort and nested_bypass fields
---
  drivers/iommu/arm-smmu-v3.c | 35 ---
  1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 21d027695181..e22e944ffc05 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -211,6 +211,7 @@
  #define STRTAB_STE_0_CFG_BYPASS   4
  #define STRTAB_STE_0_CFG_S1_TRANS 5
  #define STRTAB_STE_0_CFG_S2_TRANS 6
+#define STRTAB_STE_0_CFG_NESTED7
  
  #define STRTAB_STE_0_S1FMT		GENMASK_ULL(5, 4)

  #define STRTAB_STE_0_S1FMT_LINEAR 0
@@ -514,6 +515,7 @@ struct arm_smmu_strtab_ent {
 * configured according to the domain type.
 */
boolassigned;
+   boolabort;
struct arm_smmu_s1_cfg  *s1_cfg;
struct arm_smmu_s2_cfg  *s2_cfg;
  };
@@ -628,10 +630,8 @@ struct arm_smmu_domain {
boolnon_strict;
  
  	enum arm_smmu_domain_stage	stage;

-   union {
-   struct arm_smmu_s1_cfg  s1_cfg;
-   struct arm_smmu_s2_cfg  s2_cfg;
-   };
+   struct arm_smmu_s1_cfg  s1_cfg;
+   struct arm_smmu_s2_cfg  s2_cfg;
  
  	struct iommu_domain		domain;
  
@@ -1108,12 +1108,13 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,

  __le64 *dst, struct arm_smmu_strtab_ent 
*ste)
  {
/*
-* This is hideously complicated, but we only really care about
-* three cases at the moment:
+* We care about the following transitions:
 *
 * 1. Invalid (all zero) -> bypass/fault (init)
-* 2. Bypass/fault -> translation/bypass (attach)
-* 3. Translation/bypass -> bypass/fault (detach)
+* 2. Bypass/fault -> single stage translation/bypass (attach)
+* 3. single stage Translation/bypass -> bypass/fault (detach)
+* 4. S2 -> S1 + S2 (attach_pasid_table)
+* 5. S1 + S2 -> S2 (detach_pasid_table)
 *
 * Given that we can't update the STE atomically and the SMMU
 * doesn't read the thing in a defined order, that leaves us
@@ -1124,7 +1125,7 @@ static void arm_smmu_write_strtab_ent(struct 
arm_smmu_device *smmu, u32 sid,
 * 3. Update Config, sync
 */
u64 val = le64_to_cpu(dst[0]);
-   bool ste_live = false;
+   bool abort, bypass, translate, ste_live = false;
struct arm_smmu_cmdq_ent prefetch_cmd = {
.opcode = CMDQ_OP_PREFETCH_CFG,
.prefetch   = {
@@ -1138,11 +1139,11 @@ static void arm_smmu_write_strtab_ent(struct 
arm_smmu_device *smmu, u32 sid,
break;
case STRTAB_STE_0_CFG_S1_TRANS:
case STRTAB_STE_0_CFG_S2_TRANS:
+   case STRTAB_STE_0_CFG_NESTED:
ste_live = true;
break;
case STRTAB_STE_0_CFG_ABORT:
-   if (disable_bypass)
-   break;
+   break;
default:
BUG(); /* STE corruption */
}
@@ -1152,8 +1153,13 @@ static void arm_smmu_write_strtab_ent(struct 
arm_smmu_device *smmu, u32 sid,
val = STRTAB_STE_0_V;
  
  	/* Bypass/fault */

-   if (!ste->assigned || !(ste->s1_cfg || ste->s2_cfg)) {
-   if (!ste->assigned && disable_bypass)
+
+   abort = (!ste->assigned && disable_bypass) || ste->abort;
+   translate = ste->s1_cfg || ste->s2_cfg;
+   bypass = !abort && !translate;
+
+   if (abort || bypass) {
+   if (abort)
val |= FIELD_PREP(STRTAB_STE_0_CFG, 
STRTAB_STE_0_CFG_ABORT);
else
val |= FIELD_PREP(STRTAB_STE_0_CFG, 

Re: [PATCH v7 11/23] iommu/arm-smmu-v3: Maintain a SID->device structure

2019-05-08 Thread Robin Murphy

On 08/04/2019 13:18, Eric Auger wrote:

From: Jean-Philippe Brucker 

When handling faults from the event or PRI queue, we need to find the
struct device associated to a SID. Add a rb_tree to keep track of SIDs.


Out of curiosity, have you looked at whether an xarray might now be a 
more efficient option for this?


Robin.
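For reference, a minimal sketch of what the xarray alternative could look like; the container name, locking assumptions and GFP flags here are illustrative, not taken from the patch:

	/* sketch: replace the rb_tree with an xarray keyed by stream ID */
	/* at probe time: xa_init(&smmu->streams); */

	/* insert one entry per ID in the fwspec; xa_insert() rejects duplicates */
	ret = xa_insert(&smmu->streams, fwspec->ids[i], master, GFP_KERNEL);
	if (ret == -EBUSY)
		dev_warn(master->dev, "stream %u already registered\n",
			 fwspec->ids[i]);

	/* lookup from the event/PRI handlers */
	master = xa_load(&smmu->streams, sid);

	/* removal */
	xa_erase(&smmu->streams, fwspec->ids[i]);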


Signed-off-by: Jean-Philippe Brucker 
---
  drivers/iommu/arm-smmu-v3.c | 136 ++--
  1 file changed, 132 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index ff998c967a0a..21d027695181 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -588,6 +588,16 @@ struct arm_smmu_device {
  
  	/* IOMMU core code handle */

struct iommu_device iommu;
+
+   struct rb_root  streams;
+   struct mutexstreams_mutex;
+
+};
+
+struct arm_smmu_stream {
+   u32 id;
+   struct arm_smmu_master_data *master;
+   struct rb_node  node;
  };
  
  /* SMMU private data for each master */

@@ -597,6 +607,7 @@ struct arm_smmu_master_data {
  
  	struct arm_smmu_domain		*domain;

struct list_headlist; /* domain->devices */
+   struct arm_smmu_stream  *streams;
  
  	struct device			*dev;

  };
@@ -1243,6 +1254,32 @@ static int arm_smmu_init_l2_strtab(struct 
arm_smmu_device *smmu, u32 sid)
return 0;
  }
  
+__maybe_unused

+static struct arm_smmu_master_data *
+arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
+{
+   struct rb_node *node;
+   struct arm_smmu_stream *stream;
+   struct arm_smmu_master_data *master = NULL;
+
+   mutex_lock(&smmu->streams_mutex);
+   node = smmu->streams.rb_node;
+   while (node) {
+   stream = rb_entry(node, struct arm_smmu_stream, node);
+   if (stream->id < sid) {
+   node = node->rb_right;
+   } else if (stream->id > sid) {
+   node = node->rb_left;
+   } else {
+   master = stream->master;
+   break;
+   }
+   }
+   mutex_unlock(&smmu->streams_mutex);
+
+   return master;
+}
+
  /* IRQ and event handlers */
  static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
  {
@@ -1881,6 +1918,71 @@ static bool arm_smmu_sid_in_range(struct arm_smmu_device 
*smmu, u32 sid)
return sid < limit;
  }
  
+static int arm_smmu_insert_master(struct arm_smmu_device *smmu,

+ struct arm_smmu_master_data *master)
+{
+   int i;
+   int ret = 0;
+   struct arm_smmu_stream *new_stream, *cur_stream;
+   struct rb_node **new_node, *parent_node = NULL;
+   struct iommu_fwspec *fwspec = master->dev->iommu_fwspec;
+
+   master->streams = kcalloc(fwspec->num_ids,
+ sizeof(struct arm_smmu_stream), GFP_KERNEL);
+   if (!master->streams)
+   return -ENOMEM;
+
+   mutex_lock(&smmu->streams_mutex);
+   for (i = 0; i < fwspec->num_ids && !ret; i++) {
+   new_stream = &master->streams[i];
+   new_stream->id = fwspec->ids[i];
+   new_stream->master = master;
+
+   new_node = &(smmu->streams.rb_node);
+   while (*new_node) {
+   cur_stream = rb_entry(*new_node, struct arm_smmu_stream,
+ node);
+   parent_node = *new_node;
+   if (cur_stream->id > new_stream->id) {
+   new_node = &((*new_node)->rb_left);
+   } else if (cur_stream->id < new_stream->id) {
+   new_node = &((*new_node)->rb_right);
+   } else {
+   dev_warn(master->dev,
+"stream %u already in tree\n",
+cur_stream->id);
+   ret = -EINVAL;
+   break;
+   }
+   }
+
+   if (!ret) {
+   rb_link_node(&new_stream->node, parent_node, new_node);
+   rb_insert_color(&new_stream->node, &smmu->streams);
+   }
+   }
+   mutex_unlock(&smmu->streams_mutex);
+
+   return ret;
+}
+
+static void arm_smmu_remove_master(struct arm_smmu_device *smmu,
+  struct arm_smmu_master_data *master)
+{
+   int i;
+   struct iommu_fwspec *fwspec = master->dev->iommu_fwspec;
+
+   if (!master->streams)
+   return;
+
+   mutex_lock(&smmu->streams_mutex);
+   for (i = 0; i < fwspec->num_ids; i++)
+   rb_erase(&master->streams[i].node, &smmu->streams);
+   mutex_unlock(&smmu->streams_mutex);
+
+   kfree(master->streams);
+}
+
  static struct iommu_ops arm_smmu_ops;
  
  static 

Re: [PATCH v7 06/23] iommu: Introduce bind/unbind_guest_msi

2019-05-08 Thread Robin Murphy

On 08/04/2019 13:18, Eric Auger wrote:

On ARM, MSI are translated by the SMMU. An IOVA is allocated
for each MSI doorbell. If both the host and the guest are exposed
with SMMUs, we end up with 2 different IOVAs allocated by each.
guest allocates an IOVA (gIOVA) to map onto the guest MSI
doorbell (gDB). The Host allocates another IOVA (hIOVA) to map
onto the physical doorbell (hDB).

So we end up with 2 untied mappings:
         S1             S2
 gIOVA    ->    gDB
                    hIOVA    ->    hDB

Currently the PCI device is programmed by the host with hIOVA
as MSI doorbell. So this does not work.

This patch introduces an API to pass gIOVA/gDB to the host so
that gIOVA can be reused by the host instead of re-allocating
a new IOVA. So the goal is to create the following nested mapping:

         S1             S2
 gIOVA    ->    gDB    ->    hDB

and program the PCI device with gIOVA MSI doorbell.

In case we have several devices attached to this nested domain
(devices belonging to the same group), they cannot be isolated
on guest side either. So they should also end up in the same domain
on guest side. We will enforce that all the devices attached to
the host iommu domain use the same physical doorbell and similarly
a single virtual doorbell mapping gets registered (1 single
virtual doorbell is used on guest as well).

Signed-off-by: Eric Auger 

---
v6 -> v7:
- remove the device handle parameter.
- Add comments saying there can only be a single MSI binding
   registered per iommu_domain
v5 -> v6:
-fix compile issue when IOMMU_API is not set

v3 -> v4:
- add unbind

v2 -> v3:
- add a struct device handle
---
  drivers/iommu/iommu.c | 37 +
  include/linux/iommu.h | 23 +++
  2 files changed, 60 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 6d6cb4005ca5..0d160bbd6f81 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1575,6 +1575,43 @@ static void __iommu_detach_device(struct iommu_domain 
*domain,
trace_detach_device_from_domain(dev);
  }
  
+/**

+ * iommu_bind_guest_msi - Passes the stage1 GIOVA/GPA mapping of a
+ * virtual doorbell
+ *
+ * @domain: iommu domain the stage 1 mapping will be attached to
+ * @iova: iova allocated by the guest
+ * @gpa: guest physical address of the virtual doorbell
+ * @size: granule size used for the mapping
+ *
+ * The associated IOVA can be reused by the host to create a nested
+ * stage2 binding mapping translating into the physical doorbell used
+ * by the devices attached to the domain.
+ *
+ * All devices within the domain must share the same physical doorbell.
+ * A single MSI GIOVA/GPA mapping can be attached to an iommu_domain.
+ */
+
+int iommu_bind_guest_msi(struct iommu_domain *domain,
+dma_addr_t giova, phys_addr_t gpa, size_t size)
+{
+   if (unlikely(!domain->ops->bind_guest_msi))
+   return -ENODEV;
+
+   return domain->ops->bind_guest_msi(domain, giova, gpa, size);
+}
+EXPORT_SYMBOL_GPL(iommu_bind_guest_msi);
+
+void iommu_unbind_guest_msi(struct iommu_domain *domain,
+   dma_addr_t iova)
+{
+   if (unlikely(!domain->ops->unbind_guest_msi))
+   return;
+
+   domain->ops->unbind_guest_msi(domain, iova);
+}
+EXPORT_SYMBOL_GPL(iommu_unbind_guest_msi);
+
  void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
  {
struct iommu_group *group;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 7c7c6bad1420..a2f3f964ead2 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -192,6 +192,8 @@ struct iommu_resv_region {
   * @attach_pasid_table: attach a pasid table
   * @detach_pasid_table: detach the pasid table
   * @cache_invalidate: invalidate translation caches
+ * @bind_guest_msi: provides a stage1 giova/gpa MSI doorbell mapping
+ * @unbind_guest_msi: withdraw a stage1 giova/gpa MSI doorbell mapping
   * @pgsize_bitmap: bitmap of all possible supported page sizes
   */
  struct iommu_ops {
@@ -243,6 +245,10 @@ struct iommu_ops {
int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
struct iommu_cache_invalidate_info *inv_info);
  
+	int (*bind_guest_msi)(struct iommu_domain *domain,

+ dma_addr_t giova, phys_addr_t gpa, size_t size);
+   void (*unbind_guest_msi)(struct iommu_domain *domain, dma_addr_t giova);
+
unsigned long pgsize_bitmap;
  };
  
@@ -356,6 +362,11 @@ extern void iommu_detach_pasid_table(struct iommu_domain *domain);

  extern int iommu_cache_invalidate(struct iommu_domain *domain,
  struct device *dev,
  struct iommu_cache_invalidate_info *inv_info);
+extern int iommu_bind_guest_msi(struct iommu_domain *domain,
+   dma_addr_t giova, phys_addr_t gpa, size_t size);
+extern void 

Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2

2019-03-18 Thread Robin Murphy

On 16/03/2019 04:56, Leo Yan wrote:

Hi Robin,

On Fri, Mar 15, 2019 at 12:54:10PM +, Robin Murphy wrote:

Hi Leo,

Sorry for the delay - I'm on holiday this week, but since I've made the
mistake of glancing at my inbox I should probably save you from wasting any
more time...


Sorry for disturbing you in holiday and appreciate your help.  It's no
rush to reply.


On 2019-03-15 11:03 am, Auger Eric wrote:

Hi Leo,

+ Jean-Philippe

On 3/15/19 10:37 AM, Leo Yan wrote:

Hi Eric, Robin,

On Wed, Mar 13, 2019 at 11:24:25AM +0100, Auger Eric wrote:

[...]


If the NIC supports MSIs they logically are used. This can be easily
checked on host by issuing "cat /proc/interrupts | grep vfio". Can you
check whether the guest received any interrupt? I remember that Robin
said in the past that on Juno, the MSI doorbell was in the PCI host
bridge window and possibly transactions towards the doorbell could not
reach it since considered as peer to peer.


I found back Robin's explanation. It was not related to MSI IOVA being
within the PCI host bridge window but RAM GPA colliding with host PCI
config space?

"MSI doorbells integral to PCIe root complexes (and thus untranslatable)
typically have a programmable address, so could be anywhere. In the more
general category of "special hardware addresses", QEMU's default ARM
guest memory map puts RAM starting at 0x4000; on the ARM Juno
platform, that happens to be where PCI config space starts; as Juno's
PCIe doesn't support ACS, peer-to-peer or anything clever, if you assign
the PCI bus to a guest (all of it, given the lack of ACS), the root
complex just sees the guest's attempts to DMA to "memory" as the device
attempting to access config space and aborts them."


Below is some following investigation at my side:

Firstly, I must admit that I don't fully understand the paragraph above; so
based on the description I am wondering if I can use INTx mode and whether
that is enough to avoid this hardware pitfall.


The problem above is that during the assignment process, the virtualizer
maps the whole guest RAM though the IOMMU (+ the MSI doorbell on ARM) to
allow the device, programmed in GPA to access the whole guest RAM.
Unfortunately if the device emits a DMA request with 0x4000 IOVA
address, this IOVA is interpreted by the Juno RC as a transaction
towards the PCIe config space. So this DMA request will not go beyond
the RC, will never reach the IOMMU and will never reach the guest RAM.
So globally the device is not able to reach part of the guest RAM.
That's how I interpret the above statement. Then I don't know the
details of the collision, I don't have access to this HW. I don't know
either if this problem still exists on the r2 HW.


Thanks a lot for rephrasing, Eric :)


The short answer is that if you want PCI passthrough to work on Juno, the
guest memory map has to look like a Juno.

The PCIe root complex uses an internal lookup table to generate appropriate
AXI attributes for outgoing PCIe transactions; unfortunately this has no
notion of 'default' attributes, so addresses *must* match one of the
programmed windows in order to be valid. From memory, EDK2 sets up a 2GB
window covering the lower DRAM bank, an 8GB window covering the upper DRAM
bank, and a 1MB (or thereabouts) window covering the GICv2m region with
Device attributes.


I checked kernel memory blocks info, it gives out below result:

root@debian:~# cat /sys/kernel/debug/memblock/memory
   0: 0x0000000080000000..0x00000000feffffff
   1: 0x0000000880000000..0x00000009ffffffff

So I think the lower 2GB DRAM window is [0x8000_0000..0xfeff_ffff]
and the high DRAM window is [0x8_8000_0000..0x9_ffff_ffff].

BTW, now I am using uboot rather than UEFI, so not sure if uboot has
programmed memory windows for PCIe.  Could you help give a point for
which registers should be set in UEFI thus I also can check related
configurations in uboot?


U-Boot does the same thing[1] - you can confirm that by whether PCIe 
works at all on the host ;)



Any PCIe transactions to addresses not within one of
those windows will be aborted by the RC without ever going out to the AXI
side where the SMMU lies (and I think anything matching the config space or
I/O space windows or a region claimed by a BAR will be aborted even earlier
as a peer-to-peer attempt regardless of the AXI Translation Table setup).

You could potentially modify the firmware to change the window
configuration, but the alignment restrictions make it awkward. I've only
ever tested passthrough on Juno using kvmtool, which IIRC already has guest
RAM in an appropriate place (and is trivially easy to hack if not) - I don't
remember if I ever actually tried guest MSI with that.


I made several attempts with kvmtool to tweak the memory regions, but had no
luck.  Since the host uses [0x8000_0000..0xfeff_ffff] as the first
valid memory window for PCIe, I tried to change all memory/IO
regions to fall within this window with the changes below, but it's

Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2

2019-03-15 Thread Robin Murphy

Hi Leo,

Sorry for the delay - I'm on holiday this week, but since I've made the 
mistake of glancing at my inbox I should probably save you from wasting 
any more time...


On 2019-03-15 11:03 am, Auger Eric wrote:

Hi Leo,

+ Jean-Philippe

On 3/15/19 10:37 AM, Leo Yan wrote:

Hi Eric, Robin,

On Wed, Mar 13, 2019 at 11:24:25AM +0100, Auger Eric wrote:

[...]


If the NIC supports MSIs they logically are used. This can be easily
checked on host by issuing "cat /proc/interrupts | grep vfio". Can you
check whether the guest received any interrupt? I remember that Robin
said in the past that on Juno, the MSI doorbell was in the PCI host
bridge window and possibly transactions towards the doorbell could not
reach it since considered as peer to peer.


I found back Robin's explanation. It was not related to MSI IOVA being
within the PCI host bridge window but RAM GPA colliding with host PCI
config space?

"MSI doorbells integral to PCIe root complexes (and thus untranslatable)
typically have a programmable address, so could be anywhere. In the more
general category of "special hardware addresses", QEMU's default ARM
guest memory map puts RAM starting at 0x4000; on the ARM Juno
platform, that happens to be where PCI config space starts; as Juno's
PCIe doesn't support ACS, peer-to-peer or anything clever, if you assign
the PCI bus to a guest (all of it, given the lack of ACS), the root
complex just sees the guest's attempts to DMA to "memory" as the device
attempting to access config space and aborts them."


Below is some following investigation at my side:

Firstly, I must admit that I don't fully understand the paragraph above; so
based on the description I am wondering if I can use INTx mode and whether
that is enough to avoid this hardware pitfall.


The problem above is that during the assignment process, the virtualizer
maps the whole guest RAM though the IOMMU (+ the MSI doorbell on ARM) to
allow the device, programmed in GPA to access the whole guest RAM.
Unfortunately if the device emits a DMA request with 0x4000 IOVA
address, this IOVA is interpreted by the Juno RC as a transaction
towards the PCIe config space. So this DMA request will not go beyond
the RC, will never reach the IOMMU and will never reach the guest RAM.
So globally the device is not able to reach part of the guest RAM.
That's how I interpret the above statement. Then I don't know the
details of the collision, I don't have access to this HW. I don't know
either if this problem still exists on the r2 HW.


The short answer is that if you want PCI passthrough to work on Juno, 
the guest memory map has to look like a Juno.


The PCIe root complex uses an internal lookup table to generate 
appropriate AXI attributes for outgoing PCIe transactions; unfortunately 
this has no notion of 'default' attributes, so addresses *must* match 
one of the programmed windows in order to be valid. From memory, EDK2 
sets up a 2GB window covering the lower DRAM bank, an 8GB window 
covering the upper DRAM bank, and a 1MB (or thereabouts) window covering 
the GICv2m region with Device attributes. Any PCIe transactions to 
addresses not within one of those windows will be aborted by the RC 
without ever going out to the AXI side where the SMMU lies (and I think 
anything matching the config space or I/O space windows or a region 
claimed by a BAR will be aborted even earlier as a peer-to-peer attempt 
regardless of the AXI Translation Table setup).


You could potentially modify the firmware to change the window 
configuration, but the alignment restrictions make it awkward. I've only 
ever tested passthrough on Juno using kvmtool, which IIRC already has 
guest RAM in an appropriate place (and is trivially easy to hack if not) 
- I don't remember if I ever actually tried guest MSI with that.


Robin.


But when I want to rollback to use INTx mode I found there have issue
for kvmtool to support INTx mode, so this is why I wrote the patch [1]
to fix the issue.  Alternatively, we also can set the NIC driver
module parameter 'sky2.disable_msi=1' thus can totally disable msi and
only use INTx mode.

Anyway, finally I can get INTx mode enabled and I can see the
interrupt will be registered successfully on both host and guest:

Host side:

            CPU0   CPU1   CPU2   CPU3   CPU4   CPU5
  41:          0      0      0      0      0      0  GICv2  54 Level arm-pmu
  42:          0      0      0      0      0      0  GICv2  58 Level arm-pmu
  43:          0      0      0      0      0      0  GICv2  62 Level arm-pmu
  45:        772      0      0      0      0      0  GICv2 171 Level vfio-intx(:08:00.0)

Guest side:

# cat /proc/interrupts
            CPU0   CPU1   CPU2   CPU3   CPU4   CPU5
  12:          0      0      0      0      0      0  GIC-0  96 Level eth1

So you could see the 

Re: [RFC v3 09/21] iommu/smmuv3: Get prepared for nested stage support

2019-01-25 Thread Robin Murphy

On 08/01/2019 10:26, Eric Auger wrote:

To allow nested stage support, we need to store both
stage 1 and stage 2 configurations (and remove the former
union).

arm_smmu_write_strtab_ent() is modified to write both stage
fields in the STE.

We add a nested_bypass field to the S1 configuration as the first
stage can be bypassed. Also the guest may force the STE to abort:
this information gets stored into the nested_abort field.

Only S2 stage is "finalized" as the host does not configure
S1 CD, guest does.

Signed-off-by: Eric Auger 

---

v1 -> v2:
- invalidate the STE before moving from a live STE config to another
- add the nested_abort and nested_bypass fields
---
  drivers/iommu/arm-smmu-v3.c | 43 -
  1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 9af68266bbb1..9716a301d9ae 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -212,6 +212,7 @@
  #define STRTAB_STE_0_CFG_BYPASS   4
  #define STRTAB_STE_0_CFG_S1_TRANS 5
  #define STRTAB_STE_0_CFG_S2_TRANS 6
+#define STRTAB_STE_0_CFG_NESTED7
  
  #define STRTAB_STE_0_S1FMT		GENMASK_ULL(5, 4)

  #define STRTAB_STE_0_S1FMT_LINEAR 0
@@ -491,6 +492,10 @@ struct arm_smmu_strtab_l1_desc {
  struct arm_smmu_s1_cfg {
__le64  *cdptr;
dma_addr_t  cdptr_dma;
+   /* in nested mode, tells s1 must be bypassed */
+   boolnested_bypass;


Couldn't that be inferred from "s1_cfg == NULL"?


+   /* in nested mode, abort is forced by guest */
+   boolnested_abort;


Couldn't that be inferred from "s1_cfg == NULL && s2_cfg == NULL && 
smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED"?



struct arm_smmu_ctx_desc {
u16 asid;
@@ -515,6 +520,7 @@ struct arm_smmu_strtab_ent {
 * configured according to the domain type.
 */
boolassigned;
+   boolnested;


AFAICS, "nested" really only serves a differentiator between the 
assigned-as-bypass and assigned-as-fault cases. The latter isn't 
actually unique to nested though, I'd say it's more just that nobody's 
found reason to do anything with IOMMU_DOMAIN_BLOCKED yet. There's some 
argument for upgrading "assigned" into a tristate enum, but I think it 
might have a few drawbacks elsewhere, so an extra flag here seems 
reasonable, but I think it should just be named "abort". If we have both 
s1_cfg and s2_cfg set, we can see it's nested; if we only have s2_cfg, I 
don't think we really care whether the host or guest asked for stage 1 
bypass; and if in future we care about the difference between host- vs. 
guest-requested abort, leaving s2_cfg set for the latter would probably 
suffice.



struct arm_smmu_s1_cfg  *s1_cfg;
struct arm_smmu_s2_cfg  *s2_cfg;
  };
@@ -629,10 +635,8 @@ struct arm_smmu_domain {
boolnon_strict;
  
  	enum arm_smmu_domain_stage	stage;

-   union {
-   struct arm_smmu_s1_cfg  s1_cfg;
-   struct arm_smmu_s2_cfg  s2_cfg;
-   };
+   struct arm_smmu_s1_cfg  s1_cfg;
+   struct arm_smmu_s2_cfg  s2_cfg;
  
  	struct iommu_domain		domain;
  
@@ -1139,10 +1143,11 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,

break;
case STRTAB_STE_0_CFG_S1_TRANS:
case STRTAB_STE_0_CFG_S2_TRANS:
+   case STRTAB_STE_0_CFG_NESTED:
ste_live = true;
break;
case STRTAB_STE_0_CFG_ABORT:
-   if (disable_bypass)
+   if (disable_bypass || ste->nested)
break;
default:
BUG(); /* STE corruption */
@@ -1154,7 +1159,8 @@ static void arm_smmu_write_strtab_ent(struct 
arm_smmu_device *smmu, u32 sid,
  
  	/* Bypass/fault */

if (!ste->assigned || !(ste->s1_cfg || ste->s2_cfg)) {
-   if (!ste->assigned && disable_bypass)
+   if ((!ste->assigned && disable_bypass) ||
+   (ste->s1_cfg && ste->s1_cfg->nested_abort))


Yikes, these conditions were hard enough to follow before...


I think what I've proposed above might allow the logic here to be a bit 
less convoluted, but even then it may be time to hoist all these checks 
out and have a temporary decision variable for the bypass/abort/valid 
config outcome.


Robin.
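For instance, something like the sketch below (illustrative only, using the field names from this RFC; the v7 posting quoted earlier in this archive ends up taking much the same shape), so that the bypass/fault block only needs to test the precomputed flags:

	/* sketch: compute the config outcome once, up front */
	bool abort = (!ste->assigned && disable_bypass) ||
		     (ste->s1_cfg && ste->s1_cfg->nested_abort);
	bool translate = ste->s1_cfg || ste->s2_cfg;
	bool bypass = !abort && !translate;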


val |= FIELD_PREP(STRTAB_STE_0_CFG, 
STRTAB_STE_0_CFG_ABORT);
else
val |= FIELD_PREP(STRTAB_STE_0_CFG, 
STRTAB_STE_0_CFG_BYPASS);
@@ -1172,8 +1178,17 @@ static void arm_smmu_write_strtab_ent(struct 
arm_smmu_device *smmu, u32 

Re: [virtio-dev] Re: [PATCH v5 5/7] iommu: Add virtio-iommu driver

2018-12-13 Thread Robin Murphy

On 2018-12-12 3:27 pm, Auger Eric wrote:

Hi,

On 12/12/18 3:56 PM, Michael S. Tsirkin wrote:

On Fri, Dec 07, 2018 at 06:52:31PM +, Jean-Philippe Brucker wrote:

Sorry for the delay, I wanted to do a little more performance analysis
before continuing.

On 27/11/2018 18:10, Michael S. Tsirkin wrote:

On Tue, Nov 27, 2018 at 05:55:20PM +, Jean-Philippe Brucker wrote:

+   if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1) ||
+   !virtio_has_feature(vdev, VIRTIO_IOMMU_F_MAP_UNMAP))


Why bother with a feature bit for this then btw?


We'll need a new feature bit for sharing page tables with the hardware,
because they require different requests (attach_table/invalidate instead
of map/unmap.) A future device supporting page table sharing won't
necessarily need to support map/unmap.


I don't see virtio iommu being extended to support ARM specific
requests. This just won't scale, too many different
descriptor formats out there.


They aren't really ARM specific requests. The two new requests are
ATTACH_TABLE and INVALIDATE, which would be used by x86 IOMMUs as well.

Sharing CPU address space with the HW IOMMU (SVM) has been in the scope
of virtio-iommu since the first RFC, and I've been working with that
extension in mind since the beginning. As an example you can have a look
at my current draft for this [1], which is inspired from the VFIO work
we've been doing with Intel.

The negotiation phase inevitably requires vendor-specific fields in the
descriptors - host tells which formats are supported, guest chooses a
format and attaches page tables. But invalidation and fault reporting
descriptors are fairly generic.


We need to tread carefully here.  People expect it that if user does
lspci and sees a virtio device then it's reasonably portable.


If you want to go that way down the road, you should avoid
virtio iommu, instead emulate and share code with the ARM SMMU (probably
with a different vendor id so you can implement the
report on map for devices without PRI).


vSMMU has to stay in userspace though. The main reason we're proposing
virtio-iommu is that emulating every possible vIOMMU model in the kernel
would be unmaintainable. With virtio-iommu we can process the fast path
in the host kernel, through vhost-iommu, and do the heavy lifting in
userspace.


Interesting.


As said above, I'm trying to keep the fast path for
virtio-iommu generic.

More notes on what I consider to be the fast path, and comparison with
vSMMU:

(1) The primary use-case we have in mind for vIOMMU is something like
DPDK in the guest, assigning a hardware device to guest userspace. DPDK
maps a large amount of memory statically, to be used by a pass-through
device. For this case I don't think we care about vIOMMU performance.
Setup and teardown need to be reasonably fast, sure, but the MAP/UNMAP
requests don't have to be optimal.


(2) If the assigned device is owned by the guest kernel, then mappings
are dynamic and require dma_map/unmap() to be fast, but there generally
is no need for a vIOMMU, since device and drivers are trusted by the
guest kernel. Even when the user does enable a vIOMMU for this case
(allowing to over-commit guest memory, which needs to be pinned
otherwise),


BTW that's in theory in practice it doesn't really work.


we generally play tricks like lazy TLBI (non-strict mode) to
make it faster.


Simple lazy TLB for guest/userspace drivers would be a big no no.
You need something smarter.


Here device and drivers are trusted, therefore the
vulnerability window of lazy mode isn't a concern.

If the reason to enable the vIOMMU is over-comitting guest memory
however, you can't use nested translation because it requires pinning
the second-level tables. For this case performance matters a bit,
because your invalidate-on-map needs to be fast, even if you enable lazy
mode and only receive inval-on-unmap every 10ms. It won't ever be as
fast as nested translation, though. For this case I think vSMMU+Caching
Mode and userspace virtio-iommu with MAP/UNMAP would perform similarly
(given page-sized payloads), because the pagetable walk doesn't add a
lot of overhead compared to the context switch. But given the results
below, vhost-iommu would be faster than vSMMU+CM.


(3) Then there is SVM. For SVM, any destructive change to the process
address space requires a synchronous invalidation command to the
hardware (at least when using PCI ATS). Given that SVM is based on page
faults, fault reporting from host to guest also needs to be fast, as
well as fault response from guest to host.

I think this is where performance matters the most. To get a feel of the
advantage we get with virtio-iommu, I compared the vSMMU page-table
sharing implementation [2] and vhost-iommu + VFIO with page table
sharing (based on Tomasz Nowicki's vhost-iommu prototype). That's on a
ThunderX2 with a 10Gb NIC assigned to the guest kernel, which
corresponds to case (2) above, with nesting page tables and without the
lazy mode. The 

Re: [RFC v2 12/20] dma-iommu: Implement NESTED_MSI cookie

2018-10-24 Thread Robin Murphy

On 2018-10-24 7:44 pm, Auger Eric wrote:

Hi Robin,

On 10/24/18 8:02 PM, Robin Murphy wrote:

Hi Eric,

On 2018-09-18 3:24 pm, Eric Auger wrote:

Up to now, when the type was UNMANAGED, we used to
allocate IOVA pages within a range provided by the user.
This does not work in nested mode.

If both the host and the guest are exposed with SMMUs, each
would allocate an IOVA. The guest allocates an IOVA (gIOVA)
to map onto the guest MSI doorbell (gDB). The Host allocates
another IOVA (hIOVA) to map onto the physical doorbell (hDB).

So we end up with 2 unrelated mappings, at S1 and S2:
   S1 S2
gIOVA    -> gDB
     hIOVA    ->    hDB

The PCI device would be programmed with hIOVA.

iommu_dma_bind_doorbell allows to pass gIOVA/gDB to the host
so that gIOVA can be used by the host instead of re-allocating
a new IOVA. That way the host can create the following nested
mapping:

   S1   S2
gIOVA    ->    gDB    ->    hDB

this time, the PCI device will be programmed with the gIOVA MSI
doorbell which is correctly map through the 2 stages.


If I'm understanding things correctly, this plus a couple of the
preceding patches all add up to a rather involved way of coercing an
automatic allocator to only "allocate" predetermined addresses in an
entirely known-ahead-of-time manner.

agreed
  Given that the guy calling

iommu_dma_bind_doorbell() could seemingly just as easily call
iommu_map() at that point and not bother with an allocator cookie and
all this machinery at all, what am I missing?

Well iommu_dma_map_msi_msg() gets called and is part of this existing
MSI mapping machinery. If we do not do anything this function allocates
an hIOVA that is not involved in any nested setup. So either we coerce
the allocator in place (which is what this series does) or we unplug the
allocator to replace this latter with a simple S2 mapping, as you
suggest, ie. iommu_map(gDB, hDB). Assuming we unplug the allocator, the
guy who actually calls  iommu_dma_bind_doorbell() knows gDB but does not
know hDB. So I don't really get how we can simplify things.


OK, there's what I was missing :D

But that then seems to reveal a somewhat bigger problem - if the callers 
are simply registering IPAs, and relying on the ITS driver to grab an 
entry and fill in a PA later, then how does either one know *which* PA 
is supposed to belong to a given IPA in the case where you have multiple 
devices with different ITS targets assigned to the same guest? (and if 
it's possible to assume a guest will use per-device stage 1 mappings and 
present it with a single vITS backed by multiple pITSes, I think things 
start breaking even harder.)


Other than allowing arbitrary disjoint IOVA pages, I'm not sure this 
really works any differently from the existing MSI cookie now that I 
look more closely :/


Robin.


Re: [RFC v2 12/20] dma-iommu: Implement NESTED_MSI cookie

2018-10-24 Thread Robin Murphy

Hi Eric,

On 2018-09-18 3:24 pm, Eric Auger wrote:

Up to now, when the type was UNMANAGED, we used to
allocate IOVA pages within a range provided by the user.
This does not work in nested mode.

If both the host and the guest are exposed with SMMUs, each
would allocate an IOVA. The guest allocates an IOVA (gIOVA)
to map onto the guest MSI doorbell (gDB). The Host allocates
another IOVA (hIOVA) to map onto the physical doorbell (hDB).

So we end up with 2 unrelated mappings, at S1 and S2:
         S1             S2
 gIOVA    ->    gDB
                    hIOVA    ->    hDB

The PCI device would be programmed with hIOVA.

iommu_dma_bind_doorbell allows to pass gIOVA/gDB to the host
so that gIOVA can be used by the host instead of re-allocating
a new IOVA. That way the host can create the following nested
mapping:

         S1             S2
 gIOVA    ->    gDB    ->    hDB

this time, the PCI device will be programmed with the gIOVA MSI
doorbell which is correctly map through the 2 stages.


If I'm understanding things correctly, this plus a couple of the 
preceding patches all add up to a rather involved way of coercing an 
automatic allocator to only "allocate" predetermined addresses in an 
entirely known-ahead-of-time manner. Given that the guy calling 
iommu_dma_bind_doorbell() could seemingly just as easily call 
iommu_map() at that point and not bother with an allocator cookie and 
all this machinery at all, what am I missing?


Robin.
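i.e., in the caller, something along these lines (purely illustrative; gdb_iova, hdb_phys and granule are placeholder names, and the follow-up in this thread explains why the caller does not actually know hDB):

	/* sketch: install the gIOVA -> hDB stage 2 mapping directly */
	ret = iommu_map(domain, gdb_iova, hdb_phys, granule,
			IOMMU_WRITE | IOMMU_MMIO);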



Signed-off-by: Eric Auger 

---

v1 -> v2:
- unmap stage2 on put()
---
  drivers/iommu/dma-iommu.c | 97 +--
  include/linux/dma-iommu.h | 11 +
  2 files changed, 105 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 511ff9a1d6d9..53444c3e8f2f 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -37,12 +37,14 @@
  struct iommu_dma_msi_page {
struct list_headlist;
dma_addr_t  iova;
+   dma_addr_t  ipa;
phys_addr_t phys;
  };
  
  enum iommu_dma_cookie_type {

IOMMU_DMA_IOVA_COOKIE,
IOMMU_DMA_MSI_COOKIE,
+   IOMMU_DMA_NESTED_MSI_COOKIE,
  };
  
  struct iommu_dma_cookie {

@@ -109,14 +111,17 @@ EXPORT_SYMBOL(iommu_get_dma_cookie);
   *
   * Users who manage their own IOVA allocation and do not want DMA API support,
   * but would still like to take advantage of automatic MSI remapping, can use
- * this to initialise their own domain appropriately. Users should reserve a
+ * this to initialise their own domain appropriately. Users may reserve a
   * contiguous IOVA region, starting at @base, large enough to accommodate the
   * number of PAGE_SIZE mappings necessary to cover every MSI doorbell address
- * used by the devices attached to @domain.
+ * used by the devices attached to @domain. The other way round is to provide
+ * usable iova pages through the iommu_dma_bind_doorbell API (nested stages
+ * use case)
   */
  int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
  {
struct iommu_dma_cookie *cookie;
+   int nesting, ret;
  
  	if (domain->type != IOMMU_DOMAIN_UNMANAGED)

return -EINVAL;
@@ -124,7 +129,12 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, 
dma_addr_t base)
if (domain->iova_cookie)
return -EEXIST;
  
-	cookie = cookie_alloc(IOMMU_DMA_MSI_COOKIE);

+   ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nesting);
+   if (!ret && nesting)
+   cookie = cookie_alloc(IOMMU_DMA_NESTED_MSI_COOKIE);
+   else
+   cookie = cookie_alloc(IOMMU_DMA_MSI_COOKIE);
+
if (!cookie)
return -ENOMEM;
  
@@ -145,6 +155,7 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)

  {
struct iommu_dma_cookie *cookie = domain->iova_cookie;
struct iommu_dma_msi_page *msi, *tmp;
+   bool s2_unmap = false;
  
  	if (!cookie)

return;
@@ -152,7 +163,15 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
if (cookie->type == IOMMU_DMA_IOVA_COOKIE && cookie->iovad.granule)
put_iova_domain(&cookie->iovad);
  
+	if (cookie->type == IOMMU_DMA_NESTED_MSI_COOKIE)

+   s2_unmap = true;
+
list_for_each_entry_safe(msi, tmp, &cookie->msi_page_list, list) {
+   if (s2_unmap && msi->phys) {
+   size_t size = cookie_msi_granule(cookie);
+
+   WARN_ON(iommu_unmap(domain, msi->ipa, size) != size);
+   }
list_del(&msi->list);
kfree(msi);
}
@@ -161,6 +180,50 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
  }
  EXPORT_SYMBOL(iommu_put_dma_cookie);
  
+/**

+ * iommu_dma_bind_doorbell - Allows to provide a usable IOVA page
+ * @domain: domain handle
+ * @binding: IOVA/IPA binding
+ *
+ * In nested stage use case, the user can provide IOVA/IPA bindings
+ * corresponding to a guest MSI 

Re: [PATCH] kvm: arm64: fix caching of host MDCR_EL2 value

2018-10-17 Thread Robin Murphy

On 17/10/18 17:42, Mark Rutland wrote:

At boot time, KVM stashes the host MDCR_EL2 value, but only does this
when the kernel is not running in hyp mode (i.e. is non-VHE). In these
cases, the stashed value of MDCR_EL2.HPMN happens to be zero, which can
lead to CONSTRAINED UNPREDICTABLE behaviour.

Since we use this value to derive the MDCR_EL2 value when switching
to/from a guest, after a guest have been run, the performance counters
do not behave as expected. This has been observed to result in accesses
via PMXEVTYPER_EL0 and PMXEVCNTR_EL0 not affecting the relevant
counters, resulting in events not being counted. In these cases, only
the fixed-purpose cycle counter appears to work as expected.

Fix this by always stashing the host MDCR_EL2 value, regardless of VHE.


FWIW,

Tested-by: Robin Murphy 


Fixes: 1e947bad0b63b351 ("arm64: KVM: Skip HYP setup when already running in 
HYP")
Signed-off-by: Mark Rutland 
Cc: Christopher Dall 
Cc: James Morse 
Cc: Marc Zyngier 
Cc: Robin Murphy 
Cc: Will Deacon 
Cc: kvmarm@lists.cs.columbia.edu
---
  virt/kvm/arm/arm.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index c92053bc3f96..8fb31a7cc22c 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -1295,8 +1295,6 @@ static void cpu_init_hyp_mode(void *dummy)
  
  	__cpu_init_hyp_mode(pgd_ptr, hyp_stack_ptr, vector_ptr);

__cpu_init_stage2();
-
-   kvm_arm_init_debug();
  }
  
  static void cpu_hyp_reset(void)

@@ -1320,6 +1318,8 @@ static void cpu_hyp_reinit(void)
cpu_init_hyp_mode(NULL);
}
  
+	kvm_arm_init_debug();

+
if (vgic_present)
kvm_vgic_init_cpu_hardware();
  }




Re: [PATCH v3 3/7] PCI: OF: Allow endpoints to bypass the iommu

2018-10-15 Thread Robin Murphy

On 12/10/18 20:41, Bjorn Helgaas wrote:

s/iommu/IOMMU/ in subject

On Fri, Oct 12, 2018 at 03:59:13PM +0100, Jean-Philippe Brucker wrote:

Using the iommu-map binding, endpoints in a given PCI domain can be
managed by different IOMMUs. Some virtual machines may allow a subset of
endpoints to bypass the IOMMU. In some case the IOMMU itself is presented


s/case/cases/


as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). Currently, when a
PCI root complex has an iommu-map property, the driver requires all
endpoints to be described by the property. Allow the iommu-map property to
have gaps.


I'm not an IOMMU or virtio expert, so it's not obvious to me why it is
safe to allow devices to bypass the IOMMU.  Does this mean a typo in
iommu-map could inadvertently allow devices to bypass it?  Should we
indicate something in dmesg (and/or sysfs) about devices that bypass
it?


It's not really "allow devices to bypass the IOMMU" so much as "allow DT 
to describe devices which the IOMMU doesn't translate". It's a bit of an 
edge case for not-really-PCI devices, but FWIW I can certainly think of 
several ways to build real hardware like that. As for inadvertent errors 
leaving out IDs which *should* be in the map, that really depends on the 
IOMMU/driver implementation - e.g. SMMUv2 with arm-smmu.disable_bypass=0 
would treat the device as untranslated, whereas SMMUv3 would always 
generate a fault upon any transaction due to no valid stream table entry 
being programmed (not even a bypass one).


I reckon it's a sufficiently unusual case that keeping some sort of 
message probably is worthwhile (at pr_info rather than pr_err) in case 
someone does hit it by mistake.
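For example, roughly (a sketch against the hunk quoted below, not a tested patch):

	/* sketch: leave a breadcrumb when a RID falls outside the map */
	pr_info("%pOF: no %s translation for rid 0x%x, assuming bypass\n",
		np, map_name, rid);
	if (id_out)
		*id_out = rid;
	return 0;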



Relaxing of_pci_map_rid also allows the msi-map property to have gaps,


At worst, I suppose we could always add yet another parameter for each 
caller to choose whether a missing entry is considered an error or not.


Robin.


s/of_pci_map_rid/of_pci_map_rid()/


which is invalid since MSIs always reach an MSI controller. Thankfully
Linux will error out later, when attempting to find an MSI domain for the
device.


Not clear to me what "error out" means here.  In a userspace program,
I would infer that the program exits with an error message, but I
doubt you mean that Linux exits.


Signed-off-by: Jean-Philippe Brucker 
---
  drivers/pci/of.c | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/of.c b/drivers/pci/of.c
index 1836b8ddf292..2f5015bdb256 100644
--- a/drivers/pci/of.c
+++ b/drivers/pci/of.c
@@ -451,9 +451,10 @@ int of_pci_map_rid(struct device_node *np, u32 rid,
return 0;
}
  
-	pr_err("%pOF: Invalid %s translation - no match for rid 0x%x on %pOF\n",

-   np, map_name, rid, target && *target ? *target : NULL);
-   return -EFAULT;
+   /* Bypasses translation */
+   if (id_out)
+   *id_out = rid;
+   return 0;
  }
  
  #if IS_ENABLED(CONFIG_OF_IRQ)

--
2.19.1




Re: [PATCH v2 1/5] dt-bindings: virtio: Specify #iommu-cells value for a virtio-iommu

2018-07-04 Thread Robin Murphy

On 27/06/18 18:46, Rob Herring wrote:

On Tue, Jun 26, 2018 at 11:59 AM Jean-Philippe Brucker
 wrote:


On 25/06/18 20:27, Rob Herring wrote:

On Thu, Jun 21, 2018 at 08:06:51PM +0100, Jean-Philippe Brucker wrote:

A virtio-mmio node may represent a virtio-iommu device. This is discovered
by the virtio driver at probe time, but the DMA topology isn't
discoverable and must be described by firmware. For DT the standard IOMMU
description is used, as specified in bindings/iommu/iommu.txt and
bindings/pci/pci-iommu.txt. Like many other IOMMUs, virtio-iommu
distinguishes masters by their endpoint IDs, which requires one IOMMU cell
in the "iommus" property.

Signed-off-by: Jean-Philippe Brucker 
---
  Documentation/devicetree/bindings/virtio/mmio.txt | 8 
  1 file changed, 8 insertions(+)

diff --git a/Documentation/devicetree/bindings/virtio/mmio.txt 
b/Documentation/devicetree/bindings/virtio/mmio.txt
index 5069c1b8e193..337da0e3a87f 100644
--- a/Documentation/devicetree/bindings/virtio/mmio.txt
+++ b/Documentation/devicetree/bindings/virtio/mmio.txt
@@ -8,6 +8,14 @@ Required properties:
  - reg:  control registers base address and size including 
configuration space
  - interrupts:   interrupt generated by the device

+Required properties for virtio-iommu:
+
+- #iommu-cells: When the node describes a virtio-iommu device, it is
+linked to DMA masters using the "iommus" property as
+described in devicetree/bindings/iommu/iommu.txt. For
+virtio-iommu #iommu-cells must be 1, each cell describing
+a single endpoint ID.


The iommus property should also be documented for the client side.


Isn't section "IOMMU master node" of iommu.txt sufficient? Since the
iommus property applies to any DMA master, not only virtio-mmio devices,
the canonical description in iommu.txt seems the best place for it, and
I'm not sure what to add in this file. Maybe a short example below the
virtio_block one?


No, because somewhere we have to capture if 'iommus' is valid for
'virtio-mmio' or not. Hopefully soon we'll actually be able to
validate that.


Indeed, it's rather unusual to have a single compatible which may either 
be an IOMMU or an IOMMU client (but not both at once, I hope!), so 
nailing down the exact semantics as clearly as possible would definitely 
be desirable.


Robin.


Re: [PATCH 02/14] arm64: Call ARCH_WORKAROUND_2 on transitions between EL0 and EL1

2018-05-24 Thread Robin Murphy

On 24/05/18 11:52, Mark Rutland wrote:

On Wed, May 23, 2018 at 10:23:20AM +0100, Julien Grall wrote:

Hi Marc,

On 05/22/2018 04:06 PM, Marc Zyngier wrote:

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index ec2ee720e33e..f33e6aed3037 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -18,6 +18,7 @@
* along with this program.  If not, see .
*/
+#include 
   #include 
   #include 
@@ -137,6 +138,18 @@ alternative_else_nop_endif
add \dst, \dst, #(\sym - .entry.tramp.text)
.endm
+   // This macro corrupts x0-x3. It is the caller's duty
+   // to save/restore them if required.


NIT: Shouldn't you use /* ... */ for multi-line comments?


There's no requirement to do so, and IIRC even Torvalds prefers '//'
comments for multi-line things these days.


Also, this is assembly code, not C; '//' is the actual A64 assembler 
comment syntax, so it is arguably more appropriate here in spite of being 
moot thanks to preprocessing.


Robin.


Re: [PATCH 2/4] iommu/virtio: Add probe request

2018-03-23 Thread Robin Murphy

On 14/02/18 14:53, Jean-Philippe Brucker wrote:

When the device offers the probe feature, send a probe request for each
device managed by the IOMMU. Extract RESV_MEM information. When we
encounter a MSI doorbell region, set it up as a IOMMU_RESV_MSI region.
This will tell other subsystems that there is no need to map the MSI
doorbell in the virtio-iommu, because MSIs bypass it.

Signed-off-by: Jean-Philippe Brucker 
---
  drivers/iommu/virtio-iommu.c  | 163 --
  include/uapi/linux/virtio_iommu.h |  37 +
  2 files changed, 193 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index a9c9245e8ba2..3ac4b38eaf19 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -45,6 +45,7 @@ struct viommu_dev {
struct iommu_domain_geometrygeometry;
u64 pgsize_bitmap;
u8  domain_bits;
+   u32 probe_size;
  };
  
  struct viommu_mapping {

@@ -72,6 +73,7 @@ struct viommu_domain {
  struct viommu_endpoint {
struct viommu_dev   *viommu;
struct viommu_domain*vdomain;
+   struct list_headresv_regions;
  };
  
  struct viommu_request {

@@ -140,6 +142,10 @@ static int viommu_get_req_size(struct viommu_dev *viommu,
case VIRTIO_IOMMU_T_UNMAP:
size = sizeof(r->unmap);
break;
+   case VIRTIO_IOMMU_T_PROBE:
+   *bottom += viommu->probe_size;
+   size = sizeof(r->probe) + *bottom;
+   break;
default:
return -EINVAL;
}
@@ -448,6 +454,105 @@ static int viommu_replay_mappings(struct viommu_domain 
*vdomain)
return ret;
  }
  
+static int viommu_add_resv_mem(struct viommu_endpoint *vdev,

+  struct virtio_iommu_probe_resv_mem *mem,
+  size_t len)
+{
+   struct iommu_resv_region *region = NULL;
+   unsigned long prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
+
+   u64 addr = le64_to_cpu(mem->addr);
+   u64 size = le64_to_cpu(mem->size);
+
+   if (len < sizeof(*mem))
+   return -EINVAL;
+
+   switch (mem->subtype) {
+   case VIRTIO_IOMMU_RESV_MEM_T_MSI:
+   region = iommu_alloc_resv_region(addr, size, prot,
+IOMMU_RESV_MSI);
+   break;
+   case VIRTIO_IOMMU_RESV_MEM_T_RESERVED:
+   default:
+   region = iommu_alloc_resv_region(addr, size, 0,
+IOMMU_RESV_RESERVED);
+   break;
+   }
+
+   list_add(&vdev->resv_regions, &region->list);
+
+   /*
+* Treat unknown subtype as RESERVED, but urge users to update their
+* driver.
+*/
+   if (mem->subtype != VIRTIO_IOMMU_RESV_MEM_T_RESERVED &&
+   mem->subtype != VIRTIO_IOMMU_RESV_MEM_T_MSI)
+   pr_warn("unknown resv mem subtype 0x%x\n", mem->subtype);


Might as well avoid the extra comparisons by incorporating this into the 
switch statement, i.e.:


default:
dev_warn(vdev->viommu_dev->dev, ...);
/* Fallthrough */
case VIRTIO_IOMMU_RESV_MEM_T_RESERVED:
...

(dev_warn is generally preferable to pr_warn when feasible)
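Spelled out in full, that restructuring might look like this (illustrative only, reusing the names from the quoted hunk):

	switch (mem->subtype) {
	default:
		dev_warn(vdev->viommu->dev, "unknown resv mem subtype 0x%x\n",
			 mem->subtype);
		/* Fallthrough */
	case VIRTIO_IOMMU_RESV_MEM_T_RESERVED:
		region = iommu_alloc_resv_region(addr, size, 0,
						 IOMMU_RESV_RESERVED);
		break;
	case VIRTIO_IOMMU_RESV_MEM_T_MSI:
		region = iommu_alloc_resv_region(addr, size, prot,
						 IOMMU_RESV_MSI);
		break;
	}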


+
+   return 0;
+}
+
+static int viommu_probe_endpoint(struct viommu_dev *viommu, struct device *dev)
+{
+   int ret;
+   u16 type, len;
+   size_t cur = 0;
+   struct virtio_iommu_req_probe *probe;
+   struct virtio_iommu_probe_property *prop;
+   struct iommu_fwspec *fwspec = dev->iommu_fwspec;
+   struct viommu_endpoint *vdev = fwspec->iommu_priv;
+
+   if (!fwspec->num_ids)
+   /* Trouble ahead. */
+   return -EINVAL;
+
+   probe = kzalloc(sizeof(*probe) + viommu->probe_size +
+   sizeof(struct virtio_iommu_req_tail), GFP_KERNEL);
+   if (!probe)
+   return -ENOMEM;
+
+   probe->head.type = VIRTIO_IOMMU_T_PROBE;
+   /*
+* For now, assume that properties of an endpoint that outputs multiple
+* IDs are consistent. Only probe the first one.
+*/
+   probe->endpoint = cpu_to_le32(fwspec->ids[0]);
+
+   ret = viommu_send_req_sync(viommu, probe);
+   if (ret)
+   goto out_free;
+
+   prop = (void *)probe->properties;
+   type = le16_to_cpu(prop->type) & VIRTIO_IOMMU_PROBE_T_MASK;
+
+   while (type != VIRTIO_IOMMU_PROBE_T_NONE &&
+  cur < viommu->probe_size) {
+   len = le16_to_cpu(prop->length);
+
+   switch (type) {
+   case VIRTIO_IOMMU_PROBE_T_RESV_MEM:
+   ret = viommu_add_resv_mem(vdev, (void *)prop->value, 
len);
+   break;
+

Re: [PATCH 1/4] iommu: Add virtio-iommu driver

2018-03-23 Thread Robin Murphy

On 14/02/18 14:53, Jean-Philippe Brucker wrote:

The virtio IOMMU is a para-virtualized device, allowing to send IOMMU
requests such as map/unmap over virtio-mmio transport without emulating
page tables. This implementation handles ATTACH, DETACH, MAP and UNMAP
requests.

The bulk of the code transforms calls coming from the IOMMU API into
corresponding virtio requests. Mappings are kept in an interval tree
instead of page tables.

Signed-off-by: Jean-Philippe Brucker 
---
  MAINTAINERS   |   6 +
  drivers/iommu/Kconfig |  11 +
  drivers/iommu/Makefile|   1 +
  drivers/iommu/virtio-iommu.c  | 960 ++
  include/uapi/linux/virtio_ids.h   |   1 +
  include/uapi/linux/virtio_iommu.h | 116 +
  6 files changed, 1095 insertions(+)
  create mode 100644 drivers/iommu/virtio-iommu.c
  create mode 100644 include/uapi/linux/virtio_iommu.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 3bdc260e36b7..2a181924d420 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14818,6 +14818,12 @@ S: Maintained
  F:drivers/virtio/virtio_input.c
  F:include/uapi/linux/virtio_input.h
  
+VIRTIO IOMMU DRIVER

+M: Jean-Philippe Brucker 
+S: Maintained
+F: drivers/iommu/virtio-iommu.c
+F: include/uapi/linux/virtio_iommu.h
+
  VIRTUAL BOX GUEST DEVICE DRIVER
  M:Hans de Goede 
  M:Arnd Bergmann 
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index f3a21343e636..1ea0ec74524f 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -381,4 +381,15 @@ config QCOM_IOMMU
help
  Support for IOMMU on certain Qualcomm SoCs.
  
+config VIRTIO_IOMMU

+   bool "Virtio IOMMU driver"
+   depends on VIRTIO_MMIO
+   select IOMMU_API
+   select INTERVAL_TREE
+   select ARM_DMA_USE_IOMMU if ARM
+   help
+ Para-virtualised IOMMU driver with virtio.
+
+ Say Y here if you intend to run this kernel as a guest.
+
  endif # IOMMU_SUPPORT
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 1fb695854809..9c68be1365e1 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
  obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
  obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
  obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o
+obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
new file mode 100644
index ..a9c9245e8ba2
--- /dev/null
+++ b/drivers/iommu/virtio-iommu.c
@@ -0,0 +1,960 @@
+/*
+ * Virtio driver for the paravirtualized IOMMU
+ *
+ * Copyright (C) 2018 ARM Limited
+ * Author: Jean-Philippe Brucker 
+ *
+ * SPDX-License-Identifier: GPL-2.0


This wants to be a // comment at the very top of the file (thankfully 
the policy is now properly documented in-tree since 
Documentation/process/license-rules.rst got merged)
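i.e. roughly:

	// SPDX-License-Identifier: GPL-2.0
	/*
	 * Virtio driver for the paravirtualized IOMMU
	 *
	 * Copyright (C) 2018 ARM Limited
	 */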



+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#define MSI_IOVA_BASE  0x800
+#define MSI_IOVA_LENGTH0x10
+
+struct viommu_dev {
+   struct iommu_device iommu;
+   struct device   *dev;
+   struct virtio_device*vdev;
+
+   struct ida  domain_ids;
+
+   struct virtqueue*vq;
+   /* Serialize anything touching the request queue */
+   spinlock_t  request_lock;
+
+   /* Device configuration */
+   struct iommu_domain_geometrygeometry;
+   u64 pgsize_bitmap;
+   u8  domain_bits;
+};
+
+struct viommu_mapping {
+   phys_addr_t paddr;
+   struct interval_tree_node   iova;
+   union {
+   struct virtio_iommu_req_map map;
+   struct virtio_iommu_req_unmap unmap;
+   } req;
+};
+
+struct viommu_domain {
+   struct iommu_domain domain;
+   struct viommu_dev   *viommu;
+   struct mutexmutex;
+   unsigned intid;
+
+   spinlock_t  mappings_lock;
+   struct rb_root_cached   mappings;
+
+   /* Number of endpoints attached to this domain */
+   unsigned long   endpoints;
+};
+
+struct viommu_endpoint {
+   struct viommu_dev   *viommu;
+   struct viommu_domain*vdomain;
+};
+
+struct viommu_request {
+   struct scatterlist  top;
+   struct scatterlist  bottom;
+
+   int 

Re: [PATCH 1/4] iommu: Add virtio-iommu driver

2018-03-21 Thread Robin Murphy

On 21/03/18 13:14, Jean-Philippe Brucker wrote:

On 21/03/18 06:43, Tian, Kevin wrote:
[...]

+
+#include 
+
+#define MSI_IOVA_BASE  0x800
+#define MSI_IOVA_LENGTH0x10


this is ARM specific, and according to virtio-iommu spec isn't it
better probed on the endpoint instead of hard-coding here?


These values are arbitrary, not really ARM-specific even if ARM is the
only user yet: we're just reserving a random IOVA region for mapping MSIs.
It is hard-coded because of the way iommu-dma.c works, but I don't quite
remember why that allocation isn't dynamic.


The host kernel needs to have *some* MSI region in place before the 
guest can start configuring interrupts, otherwise it won't know what 
address to give to the underlying hardware. However, as soon as the host 
kernel has picked a region, host userspace needs to know that it can no 
longer use addresses in that region for DMA-able guest memory. It's a 
lot easier when the address is fixed in hardware and the host userspace 
will never be stupid enough to try and VFIO_IOMMU_DMA_MAP it, but in the 
more general case where MSI writes undergo IOMMU address translation so 
it's an arbitrary IOVA, this has the potential to conflict with stuff 
like guest memory hotplug.


What we currently have is just the simplest option, with the host kernel 
just picking something up-front and pretending to host userspace that 
it's a fixed hardware address. There's certainly scope for it to be a 
bit more dynamic in the sense of adding an interface to let userspace 
move it around (before attaching any devices, at least), but I don't 
think it's feasible for the host kernel to second-guess userspace enough 
to make it entirely transparent like it is in the DMA API domain case.
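
For illustration, the current fixed-region approach amounts to the host-side
driver advertising a software-MSI reserved region, roughly as in the sketch
below (not taken from the patch under review; the function name is made up,
and the iommu_alloc_resv_region()/IOMMU_RESV_SW_MSI interface of that era
plus the MSI_IOVA_* constants quoted above are assumed to be available):

static void example_get_resv_regions(struct device *dev, struct list_head *head)
{
	int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
	struct iommu_resv_region *region;

	/* Advertise the fixed IOVA window that iommu-dma will use for MSIs */
	region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH,
					 prot, IOMMU_RESV_SW_MSI);
	if (!region)
		return;

	list_add_tail(&region->list, head);
}

Host userspace can then discover the window (e.g. via the group's
reserved_regions file) and keep DMA-able guest memory out of it.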


Of course, that's all assuming the host itself is using a virtio-iommu 
(e.g. in a nested virt or emulation scenario). When it's purely within a 
guest then an MSI reservation shouldn't matter so much, since the guest 
won't be anywhere near the real hardware configuration anyway.


Robin.


As said on the v0.6 spec thread, I'm not sure allocating the IOVA range in
the host is preferable. With nested translation the guest has to map it
anyway, and I believe dealing with IOVA allocation should be left to the
guest when possible.

Thanks,
Jean


Re: [PATCH] KVM: arm/arm64: replacing per-VM's per-CPU variable

2018-03-13 Thread Robin Murphy

On 13/03/18 13:01, Marc Zyngier wrote:

[You're repeatedly posting to the kvmarm mailing list without being
subscribed to it. I've flushed the queue now, but please consider
subscribing to the list, it will help everyone]

On 13/03/18 21:03, Peng Hao wrote:

Using a global per-CPU variable instead of per-VM's per-CPU variable.

Signed-off-by: Peng Hao 
---
  arch/arm/include/asm/kvm_host.h   |  3 ---
  arch/arm64/include/asm/kvm_host.h |  3 ---
  virt/kvm/arm/arm.c| 26 ++
  3 files changed, 6 insertions(+), 26 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 248b930..4224f3b 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -59,9 +59,6 @@ struct kvm_arch {
/* VTTBR value associated with below pgd and vmid */
u64vttbr;
  
-	/* The last vcpu id that ran on each physical CPU */

-   int __percpu *last_vcpu_ran;
-
/*
 * Anything that is not used directly from assembly code goes
 * here.
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 596f8e4..5035a08 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -67,9 +67,6 @@ struct kvm_arch {
/* VTTBR value associated with above pgd and vmid */
u64vttbr;
  
-	/* The last vcpu id that ran on each physical CPU */

-   int __percpu *last_vcpu_ran;
-
/* The maximum number of vCPUs depends on the used GIC model */
int max_vcpus;
  
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c

index 86941f6..a67ffb0 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -59,6 +59,8 @@
  /* Per-CPU variable containing the currently running vcpu. */
  static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_arm_running_vcpu);
  
+static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_last_ran_vcpu);

+
  /* The VMID used in the VTTBR */
  static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1);
  static u32 kvm_next_vmid;
@@ -115,18 +117,11 @@ void kvm_arch_check_processor_compat(void *rtn)
   */
  int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
  {
-   int ret, cpu;
+   int ret;
  
  	if (type)

return -EINVAL;
  
-	kvm->arch.last_vcpu_ran = alloc_percpu(typeof(*kvm->arch.last_vcpu_ran));

-   if (!kvm->arch.last_vcpu_ran)
-   return -ENOMEM;
-
-   for_each_possible_cpu(cpu)
-   *per_cpu_ptr(kvm->arch.last_vcpu_ran, cpu) = -1;
-
ret = kvm_alloc_stage2_pgd(kvm);
if (ret)
goto out_fail_alloc;
@@ -147,9 +142,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
return ret;
  out_free_stage2_pgd:
kvm_free_stage2_pgd(kvm);
-out_fail_alloc:
-   free_percpu(kvm->arch.last_vcpu_ran);
-   kvm->arch.last_vcpu_ran = NULL;
+out_fail_alloc:
return ret;
  }
  
@@ -179,9 +172,6 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
  
  	kvm_vgic_destroy(kvm);
  
-	free_percpu(kvm->arch.last_vcpu_ran);

-   kvm->arch.last_vcpu_ran = NULL;
-
for (i = 0; i < KVM_MAX_VCPUS; ++i) {
if (kvm->vcpus[i]) {
kvm_arch_vcpu_free(kvm->vcpus[i]);
@@ -343,17 +333,13 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
  
  void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)

  {
-   int *last_ran;
-
-   last_ran = this_cpu_ptr(vcpu->kvm->arch.last_vcpu_ran);
-
/*
 * We might get preempted before the vCPU actually runs, but
 * over-invalidation doesn't affect correctness.
 */
-   if (*last_ran != vcpu->vcpu_id) {
+   if (per_cpu(kvm_last_ran_vcpu, cpu) != vcpu) {
kvm_call_hyp(__kvm_tlb_flush_local_vmid, vcpu);
-   *last_ran = vcpu->vcpu_id;
+   per_cpu(kvm_last_ran_vcpu, cpu) = vcpu;
}
  
  	vcpu->cpu = cpu;




Have you read and understood what this code is about? The whole point of
this code is to track physical CPUs on a per-VM basis. Making it global
completely defeats the point, and can result in guest memory corruption.
Please see commit 94d0e5980d67.


I won't comment on the patch itself (AFAICS it is rather broken), but I 
suppose there is a grain of sense in the general idea, since the set of 
physical CPUs itself is fundamentally a global thing. Given a large 
number of pCPUs and a large number of VMs it could well be more 
space-efficient to keep a single per-pCPU record of a {vmid,vcpu_id} 
tuple or some other *globally-unique* vCPU namespace (I guess just the 
struct kvm_vcpu pointer might work, but then it would be hard to avoid 
unnecessary invalidation when switching VMIDs entirely).
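
Purely to illustrate the space argument (sketch only, with made-up names;
correctly handling VMID recycling and VM teardown is exactly the hard part
and is not addressed here):

/* One record per physical CPU, however many VMs exist */
struct kvm_last_ran {
	struct kvm	*kvm;		/* or a {VMID, generation} pair */
	int		vcpu_id;
};

static DEFINE_PER_CPU(struct kvm_last_ran, kvm_last_ran);

Storage then scales with the number of pCPUs rather than with VMs times
pCPUs, at the cost of having to invalidate or revalidate the record whenever
a kvm instance or VMID can be reused.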


Robin.


Re: [PATCH v6] arm64: Add support for new control bits CTR_EL0.DIC and CTR_EL0.IDC

2018-03-06 Thread Robin Murphy

On 01/03/18 04:14, Shanker Donthineni wrote:
[...]

diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 2985a06..0b64b55 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -199,12 +199,12 @@ static int __init register_cpu_hwcaps_dumper(void)
  };
  
  static const struct arm64_ftr_bits ftr_ctr[] = {

-   ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_EXACT, 31, 1, 1),   
/* RES1 */


Nit: you may as well leave this line as-is...


-   ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, 29, 1, 1),  
/* DIC */
-   ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, 28, 1, 1),  
/* IDC */
-   ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_HIGHER_SAFE, 24, 4, 0), 
/* CWG */
-   ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_HIGHER_SAFE, 20, 4, 0), 
/* ERG */
-   ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, 16, 4, 1),  
/* DminLine */
+   ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_EXACT, 31, 1, 1),   
 /* RES1 */
+   ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, CTR_DIC_SHIFT, 
1, 1),/* DIC */
+   ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, CTR_IDC_SHIFT, 
1, 1),/* IDC */
+   ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_HIGHER_SAFE, CTR_CWG_SHIFT, 
4, 0),   /* CWG */
+   ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_HIGHER_SAFE, CTR_ERG_SHIFT, 
4, 0),   /* ERG */
+   ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, 
CTR_DMLINE_SHIFT, 4, 1), /* DminLine */


...because with properly-named macros the rest of these comments no 
longer add anything that isn't already obvious - best to just remove them.


Robin.


Re: [PATCH v4] arm64: Add support for new control bits CTR_EL0.DIC and CTR_EL0.IDC

2018-02-22 Thread Robin Murphy

On 22/02/18 16:33, Mark Rutland wrote:

On Thu, Feb 22, 2018 at 04:28:03PM +, Robin Murphy wrote:

[Apologies to keep elbowing in, and if I'm being thick here...]

On 22/02/18 15:22, Mark Rutland wrote:

On Thu, Feb 22, 2018 at 08:51:30AM -0600, Shanker Donthineni wrote:

+#define CTR_B31_SHIFT  31


Since this is just a RES1 bit, I think we don't need a mnemonic for it,
but I'll defer to Will and Catalin on that.


   ENTRY(invalidate_icache_range)
+#ifdef CONFIG_ARM64_SKIP_CACHE_POU
+alternative_if ARM64_HAS_CACHE_DIC
+   mov x0, xzr
+   dsb ishst
+   isb
+   ret
+alternative_else_nop_endif
+#endif


As commented on v3, I don't believe you need the DSB here. If prior
stores haven't been completed at this point, the existing implementation
would not work correctly here.


True in terms of ordering between stores prior to entry and the IC IVAU
itself, but what about the DSB ISH currently issued *after* the IC IVAU
before returning? Is it provably impossible that existing callers might be
relying on that ordering for *anything*, or would we risk losing something
subtle by effectively removing it?


AFAIK, the only caller of this is KVM, before page table updates occur
to add execute permissions to the page this is applied to.

At least in that case, I do not beleive there would be breakage.

If we're worried about subtleties in callers, then we'd need to stick
with DSB ISH rather than optimising to DSB ISHST.


Hmm, I probably am just squawking needlessly. It is indeed hard to 
imagine how callers could be relying on invalidating the I-cache for 
ordering unless doing something unreasonably stupid, and if the current 
caller is clearly OK then there should be nothing to worry about.


This *has* helped me realise that I was indeed being somewhat thick 
before, because the existing barrier is of course not about memory 
ordering per se, but about completing the maintenance operation. Hooray 
for overloaded semantics...


On a different track, I'm now wondering whether the extra complexity of 
these alternatives might justify removing some obvious duplication and 
letting __flush_cache_user_range() branch directly into 
invalidate_icache_range(), or might that adversely affect the user fault 
fixup path?


Robin.


Re: [PATCH v4] arm64: Add support for new control bits CTR_EL0.DIC and CTR_EL0.IDC

2018-02-22 Thread Robin Murphy

[Apologies to keep elbowing in, and if I'm being thick here...]

On 22/02/18 15:22, Mark Rutland wrote:

On Thu, Feb 22, 2018 at 08:51:30AM -0600, Shanker Donthineni wrote:

+#define CTR_B31_SHIFT  31


Since this is just a RES1 bit, I think we don't need a mnemonic for it,
but I'll defer to Will and Catalin on that.


  ENTRY(invalidate_icache_range)
+#ifdef CONFIG_ARM64_SKIP_CACHE_POU
+alternative_if ARM64_HAS_CACHE_DIC
+   mov x0, xzr
+   dsb ishst
+   isb
+   ret
+alternative_else_nop_endif
+#endif


As commented on v3, I don't believe you need the DSB here. If prior
stores haven't been completed at this point, the existing implementation
would not work correctly here.


True in terms of ordering between stores prior to entry and the IC IVAU 
itself, but what about the DSB ISH currently issued *after* the IC IVAU 
before returning? Is it provably impossible that existing callers might be 
relying on that ordering for *anything*, or would we risk losing something 
subtle by effectively removing it?


Robin.


Re: [PATCH v3] arm64: Add support for new control bits CTR_EL0.DIC and CTR_EL0.IDC

2018-02-21 Thread Robin Murphy

On 21/02/18 16:14, Shanker Donthineni wrote:
[...]

@@ -1100,6 +1114,20 @@ static int cpu_copy_el2regs(void *__unused)
.enable = cpu_clear_disr,
},
  #endif /* CONFIG_ARM64_RAS_EXTN */
+#ifdef CONFIG_ARM64_SKIP_CACHE_POU
+   {
+   .desc = "DCache clean to POU",


This description is confusing, and sounds like it's describing DC CVAU, rather
than the ability to ellide it. How about:



Sure, I'll take your suggestion.


Can we at least spell "elision" correctly please? ;)

Personally I read DIC and IDC as "D-cache to I-cache coherency" and 
"I-cache to D-cache coherency" respectively (just my interpretation, 
I've not looked into the spec work for any hints of rationale), but out 
loud those do sound so poorly-defined that keeping things in terms of 
the required maintenance probably is better.



.desc = "D-cache maintenance ellision (IDC)"


+   .capability = ARM64_HAS_CACHE_IDC,
+   .def_scope = SCOPE_SYSTEM,
+   .matches = has_cache_idc,
+   },
+   {
+   .desc = "ICache invalidation to POU",


... and correspondingly:

.desc = "I-cache maintenance ellision (DIC)"


+   .capability = ARM64_HAS_CACHE_DIC,
+   .def_scope = SCOPE_SYSTEM,
+   .matches = has_cache_dic,
+   },
+#endif /* CONFIG_ARM64_CACHE_DIC */
{},
  };

[...]

+alternative_if ARM64_HAS_CACHE_DIC
+   isb


Why have we gained an ISB here if DIC is set?



I believe synchronization barrier (ISB) is required here to support 
self-modifying/jump-labels
code.
   

This is for a user address, and I can't see why DIC would imply we need an
extra ISB kernel-side.



This is for user and kernel addresses, alternatives and jumplabel patching logic
calls flush_icache_range().


There's an ISB hidden in invalidate_icache_by_line(), so it probably 
would be unsafe to start implicitly skipping that.



+   b   8f
+alternative_else_nop_endif
invalidate_icache_by_line x0, x1, x2, x3, 9f
-   mov x0, #0
+8: mov x0, #0
  1:
uaccess_ttbr0_disable x1, x2
ret
@@ -80,6 +87,12 @@ ENDPROC(__flush_cache_user_range)
   *- end - virtual end address of region
   */
  ENTRY(invalidate_icache_range)
+alternative_if ARM64_HAS_CACHE_DIC
+   mov x0, xzr
+   dsb ish


Do we actually need a DSB in this case?



I'll remove if everyone agree.

Will, Can you comment on this?


As-is, this function *only* invalidates the I-cache, so we already assume that
the data is visible at the PoU at this point. I don't see what extra gaurantee
we'd need the DSB for.


If so, then ditto for the existing invalidate_icache_by_line() code 
presumably.


Robin.


Re: [PATCH 4/4] vfio: Allow type-1 IOMMU instantiation with a virtio-iommu

2018-02-14 Thread Robin Murphy

On 14/02/18 15:26, Alex Williamson wrote:

On Wed, 14 Feb 2018 14:53:40 +
Jean-Philippe Brucker  wrote:


When enabling both VFIO and VIRTIO_IOMMU modules, automatically select
VFIO_IOMMU_TYPE1 as well.

Signed-off-by: Jean-Philippe Brucker 
---
  drivers/vfio/Kconfig | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index c84333eb5eb5..65a1e691110c 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -21,7 +21,7 @@ config VFIO_VIRQFD
  menuconfig VFIO
tristate "VFIO Non-Privileged userspace driver framework"
depends on IOMMU_API
-   select VFIO_IOMMU_TYPE1 if (X86 || S390 || ARM_SMMU || ARM_SMMU_V3)
+   select VFIO_IOMMU_TYPE1 if (X86 || S390 || ARM_SMMU || ARM_SMMU_V3 || 
VIRTIO_IOMMU)
select ANON_INODES
help
  VFIO provides a framework for secure userspace device drivers.


Why are we basing this on specific IOMMU drivers in the first place?
Only ARM is doing that.  Shouldn't IOMMU_API only be enabled for ARM
targets that support it and therefore we can forget about the specific
IOMMU drivers?  Thanks,


Makes sense - the majority of ARM systems (and mobile/embedded ARM64 
ones) making use of IOMMU_API won't actually support VFIO, but it can't 
hurt to allow them to select the type 1 driver regardless. Especially as 
multiplatform configs are liable to be pulling in the SMMU driver(s) anyway.


Robin.


Re: [PATCH v1 02/16] irqchip: gicv3-its: Add helpers for handling 52bit address

2018-02-08 Thread Robin Murphy

On 08/02/18 11:20, Suzuki K Poulose wrote:

On 07/02/18 15:10, Christoffer Dall wrote:

Hi Suzuki,

On Tue, Jan 09, 2018 at 07:03:57PM +, Suzuki K Poulose wrote:

Add helpers for encoding/decoding 52bit address in GICv3 ITS BASER
register. When ITS uses 64K page size, the 52bits of physical address
are encoded in BASER[47:12] as follows :

  Bits[47:16] of the register => bits[47:16] of the physical address
  Bits[15:12] of the register => bits[51:48] of the physical address
 bits[15:0] of the physical address 
are 0.


Also adds a mask for CBASER address. This will be used for adding 52bit
support for VGIC ITS. More importantly ignore the upper bits if 52bit
support is not enabled.

Cc: Shanker Donthineni 
Cc: Marc Zyngier 
Signed-off-by: Suzuki K Poulose 
---




+
+/*
+ * With 64K page size, the physical address can be upto 52bit and
+ * uses the following encoding in the GITS_BASER[47:12]:
+ *
+ * Bits[47:16] of the register => bits[47:16] of the base physical 
address.
+ * Bits[15:12] of the register => bits[51:48] of the base physical 
address.
+ *    bits[15:0] of the base physical 
address are 0.

+ * Clear the upper bits if the kernel doesn't support 52bits.
+ */
+#define GITS_BASER_ADDR64K_LO_MASK    GENMASK_ULL(47, 16)
+#define GITS_BASER_ADDR64K_HI_SHIFT    12
+#define GITS_BASER_ADDR64K_HI_MOVE    (48 - 
GITS_BASER_ADDR64K_HI_SHIFT)
+#define GITS_BASER_ADDR64K_HI_MASK    (GITS_PA_HI_MASK << 
GITS_BASER_ADDR64K_HI_SHIFT)

+#define GITS_BASER_ADDR64K_TO_PHYS(x)    \
+    (((x) & GITS_BASER_ADDR64K_LO_MASK) | \
+ (((x) & GITS_BASER_ADDR64K_HI_MASK) << 
GITS_BASER_ADDR64K_HI_MOVE))

+#define GITS_BASER_ADDR64K_FROM_PHYS(p)    \
+    (((p) & GITS_BASER_ADDR64K_LO_MASK) | \
+ (((p) >> GITS_BASER_ADDR64K_HI_MOVE) & 
GITS_BASER_ADDR64K_HI_MASK))


I don't understand why you need this masking logic embedded in these
macros?  Isn't it strictly an error if anyone passes a physical address
with any of bits [51:48] set to the ITS on a system that doesn't support
52 bit PAs, and just silently masking off those bits could lead to some
interesting cases.


What do you think is the best way to handle such cases ? May be I could add
some checks where we get those addresses and handle it before we use this
macro ?



This is also notably more difficult to read than the existing macro.

If anything, I think it would be more useful to have
GITS_BASER_TO_PHYS(x) and GITS_PHYS_TO_BASER(x) which takes into account
CONFIG_ARM64_64K_PAGES.


I thought the 64K_PAGES is not kernel page size, but the page-size 
configured
by the "requester" for ITS. So, it doesn't really mean 
CONFIG_ARM64_64K_PAGES.
But the other way around, we can't handle 52bit address unless 
CONFIG_ARM64_64K_PAGES
is selected. Also, if the guest uses a 4K page size and uses a 48 bit 
address,

we could potentially mask Bits[15:12] to 0, which is not nice.

So I still think we need to have a special macro for handling addresses 
with 64K

page size in ITS.


If it's allowed to go wrong for invalid input, then you don't even need 
to consider the page size at all, except if you care about 
micro-optimising out a couple of instructions. For valid page-aligned 
addresses, [51:48] and [15:12] can never *both* be nonzero, therefore 
just this should be fine for all granules:


-   (((phys) & GENMASK_ULL(47, 16)) | (((phys) >> 48) & 0xf) << 12)
+   (((phys) & GENMASK_ULL(47, 0)) | (((phys) >> 48) & 0xf) << 12)
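
To make that concrete, a quick standalone check of the suggested expression
(illustrative userspace snippet, not kernel code; a simplified GENMASK_ULL
stands in for the kernel's):

#include <assert.h>

#define GENMASK_ULL(h, l)	(((~0ULL) << (l)) & (~0ULL >> (63 - (h))))
#define BASER_FROM_PHYS(p)	(((p) & GENMASK_ULL(47, 0)) | ((((p) >> 48) & 0xf) << 12))

int main(void)
{
	/* 64K-aligned 52-bit PA: bits [51:48] land in register bits [15:12] */
	assert(BASER_FROM_PHYS(0x000ffffffff10000ULL) == 0xfffffff1f000ULL);
	/* sub-48-bit PA (any granule): passes straight through */
	assert(BASER_FROM_PHYS(0x0000123456789000ULL) == 0x123456789000ULL);
	return 0;
}

The two fields can't clash because a 64K-aligned address has bits [15:0]
clear, while an address with bits [51:48] set can only legitimately come
from a 64K-granule configuration in the first place.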

Robin.


Re: [PATCH 1/2] ARM: kvm: fix building with gcc-8

2018-02-02 Thread Robin Murphy

On 02/02/18 16:29, Arnd Bergmann wrote:

On Fri, Feb 2, 2018 at 5:23 PM, Robin Murphy <robin.mur...@arm.com> wrote:

On 02/02/18 15:55, Robin Murphy wrote:


On 02/02/18 15:07, Arnd Bergmann wrote:


In banked-sr.c, we use a top-level '__asm__(".arch_extension virt")'
statement to allow compilation of a multi-CPU kernel for ARMv6
and older ARMv7-A that don't normally support access to the banked
registers.

This is considered to be a programming error by the gcc developers
and will no longer work in gcc-8, where we now get a build error:

/tmp/cc4Qy7GR.s:34: Error: Banked registers are not available with this
architecture. -- `mrs r3,SP_usr'
/tmp/cc4Qy7GR.s:41: Error: Banked registers are not available with this
architecture. -- `mrs r3,ELR_hyp'
/tmp/cc4Qy7GR.s:55: Error: Banked registers are not available with this
architecture. -- `mrs r3,SP_svc'
/tmp/cc4Qy7GR.s:62: Error: Banked registers are not available with this
architecture. -- `mrs r3,LR_svc'
/tmp/cc4Qy7GR.s:69: Error: Banked registers are not available with this
architecture. -- `mrs r3,SPSR_svc'
/tmp/cc4Qy7GR.s:76: Error: Banked registers are not available with this
architecture. -- `mrs r3,SP_abt'

Passing the '-march=armv7ve' flag to gcc works, and is ok here, because
we know the functions won't ever be called on pre-ARMv7VE machines.
Unfortunately, older compiler versions (4.8 and earlier) do not
understand
that flag, so we still need to keep the asm around.

Backporting to stable kernels (4.6+) is needed to allow those to be built
with future compilers as well.



Is "-Wa,arch=armv7-a+virt" (as we appear to do for a couple of files
already) viable as a possibly cleaner alternative, or is GCC itself now
policing the contents of inline asms?



In fact, looking at the binutils history, any version capable of assembling
this file should understand that (modulo my typo), so hopefully it ought to
be feasible to replace these global asms with assembler flags entirely.


No, this only works for .S files, not .c, since gcc starts the output with
an explicit .arch setting that overrides the command line. I think this
was done intentionally to prevent such a hack from working, and have
more reliable checks on the validity of the assembler instruction in
inline asm statements (which we try to circumvent here).


Ah, I see, that is unfortunate. Thanks for clarifying.

Robin.


Re: [PATCH 1/2] ARM: kvm: fix building with gcc-8

2018-02-02 Thread Robin Murphy

On 02/02/18 15:55, Robin Murphy wrote:

On 02/02/18 15:07, Arnd Bergmann wrote:

In banked-sr.c, we use a top-level '__asm__(".arch_extension virt")'
statement to allow compilation of a multi-CPU kernel for ARMv6
and older ARMv7-A that don't normally support access to the banked
registers.

This is considered to be a programming error by the gcc developers
and will no longer work in gcc-8, where we now get a build error:

/tmp/cc4Qy7GR.s:34: Error: Banked registers are not available with 
this architecture. -- `mrs r3,SP_usr'
/tmp/cc4Qy7GR.s:41: Error: Banked registers are not available with 
this architecture. -- `mrs r3,ELR_hyp'
/tmp/cc4Qy7GR.s:55: Error: Banked registers are not available with 
this architecture. -- `mrs r3,SP_svc'
/tmp/cc4Qy7GR.s:62: Error: Banked registers are not available with 
this architecture. -- `mrs r3,LR_svc'
/tmp/cc4Qy7GR.s:69: Error: Banked registers are not available with 
this architecture. -- `mrs r3,SPSR_svc'
/tmp/cc4Qy7GR.s:76: Error: Banked registers are not available with 
this architecture. -- `mrs r3,SP_abt'


Passing the '-march=armv7ve' flag to gcc works, and is ok here, because
we know the functions won't ever be called on pre-ARMv7VE machines.
Unfortunately, older compiler versions (4.8 and earlier) do not 
understand

that flag, so we still need to keep the asm around.

Backporting to stable kernels (4.6+) is needed to allow those to be built
with future compilers as well.


Is "-Wa,arch=armv7-a+virt" (as we appear to do for a couple of files 
already) viable as a possibly cleaner alternative, or is GCC itself now 
policing the contents of inline asms?


In fact, looking at the binutils history, any version capable of 
assembling this file should understand that (modulo my typo), so 
hopefully it ought to be feasible to replace these global asms with 
assembler flags entirely.


Robin.


Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84129
Fixes: 33280b4cd1dc ("ARM: KVM: Add banked registers save/restore")
Cc: sta...@vger.kernel.org
Signed-off-by: Arnd Bergmann <a...@arndb.de>
---
  arch/arm/kvm/hyp/Makefile    | 5 +
  arch/arm/kvm/hyp/banked-sr.c | 4 
  2 files changed, 9 insertions(+)

diff --git a/arch/arm/kvm/hyp/Makefile b/arch/arm/kvm/hyp/Makefile
index 5638ce0c9524..63d6b404d88e 100644
--- a/arch/arm/kvm/hyp/Makefile
+++ b/arch/arm/kvm/hyp/Makefile
@@ -7,6 +7,8 @@ ccflags-y += -fno-stack-protector 
-DDISABLE_BRANCH_PROFILING

  KVM=../../../../virt/kvm
+CFLAGS_ARMV7VE   :=$(call cc-option, -march=armv7ve)
+
  obj-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/hyp/vgic-v2-sr.o
  obj-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/hyp/vgic-v3-sr.o
  obj-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/hyp/timer-sr.o
@@ -15,7 +17,10 @@ obj-$(CONFIG_KVM_ARM_HOST) += tlb.o
  obj-$(CONFIG_KVM_ARM_HOST) += cp15-sr.o
  obj-$(CONFIG_KVM_ARM_HOST) += vfp.o
  obj-$(CONFIG_KVM_ARM_HOST) += banked-sr.o
+CFLAGS_banked-sr.o   += $(CFLAGS_ARMV7VE)
+
  obj-$(CONFIG_KVM_ARM_HOST) += entry.o
  obj-$(CONFIG_KVM_ARM_HOST) += hyp-entry.o
  obj-$(CONFIG_KVM_ARM_HOST) += switch.o
+CFLAGS_switch.o   += $(CFLAGS_ARMV7VE)
  obj-$(CONFIG_KVM_ARM_HOST) += s2-setup.o
diff --git a/arch/arm/kvm/hyp/banked-sr.c b/arch/arm/kvm/hyp/banked-sr.c
index 111bda8cdebd..be4b8b0a40ad 100644
--- a/arch/arm/kvm/hyp/banked-sr.c
+++ b/arch/arm/kvm/hyp/banked-sr.c
@@ -20,6 +20,10 @@
  #include 
+/*
+ * gcc before 4.9 doesn't understand -march=armv7ve, so we have to
+ * trick the assembler.
+ */
  __asm__(".arch_extension virt");


Would it be worth wrapping this in a preprocessor check for compilers 
that won't understand the command-line flag? I believe LLVM tends to 
choke on these global asm statements entirely, so minimising exposure 
might be a good thing to do in general.


Robin.


  void __hyp_text __banked_save_state(struct kvm_cpu_context *ctxt)





Re: [PATCH 1/2] ARM: kvm: fix building with gcc-8

2018-02-02 Thread Robin Murphy

On 02/02/18 15:07, Arnd Bergmann wrote:

In banked-sr.c, we use a top-level '__asm__(".arch_extension virt")'
statement to allow compilation of a multi-CPU kernel for ARMv6
and older ARMv7-A that don't normally support access to the banked
registers.

This is considered to be a programming error by the gcc developers
and will no longer work in gcc-8, where we now get a build error:

/tmp/cc4Qy7GR.s:34: Error: Banked registers are not available with this 
architecture. -- `mrs r3,SP_usr'
/tmp/cc4Qy7GR.s:41: Error: Banked registers are not available with this 
architecture. -- `mrs r3,ELR_hyp'
/tmp/cc4Qy7GR.s:55: Error: Banked registers are not available with this 
architecture. -- `mrs r3,SP_svc'
/tmp/cc4Qy7GR.s:62: Error: Banked registers are not available with this 
architecture. -- `mrs r3,LR_svc'
/tmp/cc4Qy7GR.s:69: Error: Banked registers are not available with this 
architecture. -- `mrs r3,SPSR_svc'
/tmp/cc4Qy7GR.s:76: Error: Banked registers are not available with this 
architecture. -- `mrs r3,SP_abt'

Passing the '-march=armv7ve' flag to gcc works, and is ok here, because
we know the functions won't ever be called on pre-ARMv7VE machines.
Unfortunately, older compiler versions (4.8 and earlier) do not understand
that flag, so we still need to keep the asm around.

Backporting to stable kernels (4.6+) is needed to allow those to be built
with future compilers as well.


Is "-Wa,arch=armv7-a+virt" (as we appear to do for a couple of files 
already) viable as a possibly cleaner alternative, or is GCC itself now 
policing the contents of inline asms?



Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84129
Fixes: 33280b4cd1dc ("ARM: KVM: Add banked registers save/restore")
Cc: sta...@vger.kernel.org
Signed-off-by: Arnd Bergmann 
---
  arch/arm/kvm/hyp/Makefile| 5 +
  arch/arm/kvm/hyp/banked-sr.c | 4 
  2 files changed, 9 insertions(+)

diff --git a/arch/arm/kvm/hyp/Makefile b/arch/arm/kvm/hyp/Makefile
index 5638ce0c9524..63d6b404d88e 100644
--- a/arch/arm/kvm/hyp/Makefile
+++ b/arch/arm/kvm/hyp/Makefile
@@ -7,6 +7,8 @@ ccflags-y += -fno-stack-protector -DDISABLE_BRANCH_PROFILING
  
  KVM=../../../../virt/kvm
  
+CFLAGS_ARMV7VE		   :=$(call cc-option, -march=armv7ve)

+
  obj-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/hyp/vgic-v2-sr.o
  obj-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/hyp/vgic-v3-sr.o
  obj-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/hyp/timer-sr.o
@@ -15,7 +17,10 @@ obj-$(CONFIG_KVM_ARM_HOST) += tlb.o
  obj-$(CONFIG_KVM_ARM_HOST) += cp15-sr.o
  obj-$(CONFIG_KVM_ARM_HOST) += vfp.o
  obj-$(CONFIG_KVM_ARM_HOST) += banked-sr.o
+CFLAGS_banked-sr.o+= $(CFLAGS_ARMV7VE)
+
  obj-$(CONFIG_KVM_ARM_HOST) += entry.o
  obj-$(CONFIG_KVM_ARM_HOST) += hyp-entry.o
  obj-$(CONFIG_KVM_ARM_HOST) += switch.o
+CFLAGS_switch.o   += $(CFLAGS_ARMV7VE)
  obj-$(CONFIG_KVM_ARM_HOST) += s2-setup.o
diff --git a/arch/arm/kvm/hyp/banked-sr.c b/arch/arm/kvm/hyp/banked-sr.c
index 111bda8cdebd..be4b8b0a40ad 100644
--- a/arch/arm/kvm/hyp/banked-sr.c
+++ b/arch/arm/kvm/hyp/banked-sr.c
@@ -20,6 +20,10 @@
  
  #include 
  
+/*

+ * gcc before 4.9 doesn't understand -march=armv7ve, so we have to
+ * trick the assembler.
+ */
  __asm__(".arch_extension virt");


Would it be worth wrapping this in a preprocessor check for compilers 
that won't understand the command-line flag? I believe LLVM tends to 
choke on these global asm statements entirely, so minimising exposure 
might be a good thing to do in general.
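
For illustration, such a guard might look roughly like this (sketch only;
the 4.9 cut-off is an assumption, that being the first gcc release to
accept -march=armv7ve):

/*
 * Only fall back to the global asm when the compiler is too old for the
 * Makefile to have passed it -march=armv7ve.
 */
#if !defined(__clang__) && \
	(__GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 9))
__asm__(".arch_extension virt");
#endif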


Robin.

  
  void __hyp_text __banked_save_state(struct kvm_cpu_context *ctxt)





Re: [PATCH v3 16/18] arm/arm64: smccc: Implement SMCCC v1.1 inline primitive

2018-02-01 Thread Robin Murphy

On 01/02/18 13:54, Marc Zyngier wrote:

On 01/02/18 13:34, Robin Murphy wrote:

On 01/02/18 11:46, Marc Zyngier wrote:

One of the major improvement of SMCCC v1.1 is that it only clobbers
the first 4 registers, both on 32 and 64bit. This means that it
becomes very easy to provide an inline version of the SMC call
primitive, and avoid performing a function call to stash the
registers that would otherwise be clobbered by SMCCC v1.0.

Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
   include/linux/arm-smccc.h | 143 
++
   1 file changed, 143 insertions(+)

diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index dd44d8458c04..575aabe85905 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -150,5 +150,148 @@ asmlinkage void __arm_smccc_hvc(unsigned long a0, 
unsigned long a1,
   
   #define arm_smccc_hvc_quirk(...) __arm_smccc_hvc(__VA_ARGS__)
   
+/* SMCCC v1.1 implementation madness follows */

+#ifdef CONFIG_ARM64
+
+#define SMCCC_SMC_INST "smc   #0"
+#define SMCCC_HVC_INST "hvc   #0"


Nit: Maybe the argument can go in the template and we just define the
instruction mnemonics here?


+
+#endif
+
+#ifdef CONFIG_ARM


#elif ?


Sure, why not.




+#include 
+#include 
+
+#define SMCCC_SMC_INST __SMC(0)
+#define SMCCC_HVC_INST __HVC(0)


Oh, I see, it was to line up with this :(

I do wonder if we could just embed an asm(".arch armv7-a+virt\n") (if
even necessary) for ARM, then take advantage of the common mnemonics for
all 3 instruction sets instead of needing manual encoding tricks? I
don't think we should ever be pulling this file in for non-v7 builds.

I suppose that, strictly speaking, this appears to need binutils 2.21 rather
than the officially supported minimum of 2.20, but are people going to be
throwing SMCCC configs at antique toolchains in practice?


It has been an issue in the past, back when we merged KVM. We settled on
a hybrid solution where code outside of KVM would not rely on a newer
toolchain, hence the macros that Dave introduced. Maybe we've moved on
and we can take that bold step?


Either way I think we can happily throw that on the "future cleanup" 
pile right now as it's not directly relevant to the purpose of the 
patch; I'm sure we don't want to make potential backporting even more 
difficult.





+
+#endif
+
+#define ___count_args(_0, _1, _2, _3, _4, _5, _6, _7, _8, x, ...) x
+
+#define __count_args(...)  \
+   ___count_args(__VA_ARGS__, 7, 6, 5, 4, 3, 2, 1, 0)
+
+#define __constraint_write_0   \
+   "+r" (r0), "=&r" (r1), "=&r" (r2), "=&r" (r3)
+#define __constraint_write_1   \
+   "+r" (r0), "+r" (r1), "=&r" (r2), "=&r" (r3)
+#define __constraint_write_2   \
+   "+r" (r0), "+r" (r1), "+r" (r2), "=&r" (r3)
+#define __constraint_write_3   \
+   "+r" (r0), "+r" (r1), "+r" (r2), "+r" (r3)
+#define __constraint_write_4   __constraint_write_3
+#define __constraint_write_5   __constraint_write_4
+#define __constraint_write_6   __constraint_write_5
+#define __constraint_write_7   __constraint_write_6
+
+#define __constraint_read_0
+#define __constraint_read_1
+#define __constraint_read_2
+#define __constraint_read_3
+#define __constraint_read_4"r" (r4)
+#define __constraint_read_5__constraint_read_4, "r" (r5)
+#define __constraint_read_6__constraint_read_5, "r" (r6)
+#define __constraint_read_7__constraint_read_6, "r" (r7)
+
+#define __declare_arg_0(a0, res)   \
+   struct arm_smccc_res   *___res = res;   \


Looks like the declaration of ___res could simply be factored out to the
template...


Tried that. But...




+   register u32   r0 asm("r0") = a0; \
+   register unsigned long r1 asm("r1");  \
+   register unsigned long r2 asm("r2");  \
+   register unsigned long r3 asm("r3")
+
+#define __declare_arg_1(a0, a1, res)   \
+   struct arm_smccc_res   *___res = res;   \
+   register u32   r0 asm("r0") = a0; \
+   register typeof(a1)r1 asm("r1") = a1; \
+   register unsigned long r2 asm("r2");  \
+   register unsigned long r3 asm("r3")
+
+#define __declare_arg_2(a0, a1, a2, res)   \
+   struct arm_smccc_res   *___res = res;   \
+   register u32   r0 asm("r0") = a0; \
+   register typeof(a1)r1 asm("r1") = a1; \
+   register typeof(a2)r2 asm("r2") = a2; \
+   register typeof(a3)r3 asm("r3") = a3
+
+#define __declare_arg_4(a0, a1, a2, a3, a4, res)   \
+   __declare_arg_3(a0, a1, a2, a3, res);   \
+   register typeof(a4) r4 asm("r4") = a4
+
+#define __declare_arg_5(a0, a1, a2, a3, a4, a5, res)   \
+   __declare_arg_4(a0, a1, a2, a3, a4, res);   \
+   register typeof(a5) r5 asm("r5") = a5
+
+#define __declare_arg_6(a0, a1, a2, a3, a4, a5, a6, res)   \
+   __declare_arg_5(a0, a1, a2, a3, a4, a5, res);   \
+   register typeof(a6) r6 asm("r6") = a6
+
+#define __declare_arg_7(a0, a1, a2, a3, a4, a5, a6, a7, res)   \
+   __declare_arg_6(a0, a1, a2, a3, a4, a5, a6, res);   \
+   register typeof(a7) r7 asm("r7") = a7
+
+#define ___declare_args(count, ...) __declare_arg_ ## count(__VA_ARGS__)
+#define __declare_args(count, ...)  ___declare_args(count, __VA_ARGS__)
+
+#define ___constraints(count)  \
+   : __constraint_write_ ## count  \
+   : __constraint_read_ ## count   \
+   : "memory"
+#define __constraints(count)   ___constraints(count)
+
+/*
+ * We have an output list that is not necessarily used, and GCC feels
+ * entitled to optimise the whole sequence away. "volatile" is what
+ * makes it stick.
+ */
+#define __arm_smccc_1_1(inst, ...) \
+   do {\
+   __declare_args(__count_args(__VA_ARGS__), __VA_ARGS__); \
+   asm volatile(inst "\n"\
+__constraints(__count_args(__VA_ARGS__))); \
+   if (___res) \
+   *___res = (typeof(*___res)){r0, r1, r2, r3};\


...especially since there's no obvious indication of where it comes from 
when you're looking here.


Otherwise, though, this has already turned out pretty sleek;

Reviewed-by: Robin Murphy <robin.mur...@arm.com>


+   } while (0)
+
+/*
+ * arm_smccc_1_1_smc() - make an SMCCC v1.1 compliant SMC call
+ *
+ * This is a variadic macro taking one to eight source arguments, and
+ * an optional return structure.
+ *
+ * @a0-a7: arguments passed in registers 0 to 7
+ * @res: result values from registers 0 to 3
+ *
+ * This macro is used to make SMC calls following SMC Calling Convention v1.1.
+ * The content of the supplied param are copied to registers 0 to 7 prior
+ * to the SMC instruction. The return values are updated with the content
+ * from register 0 to 3 on return from the SMC instruction if not NULL.
+ */
+#define arm_smccc_1_1_smc(...) __arm_smccc_1_1(SMCCC_SMC_INST, __VA_ARGS__)
+
+/*
+ * arm_smccc_1_1_hvc() - make an SMCCC v1.1 compliant HVC call
+ *
+ * This is a variadic macro taking one to eight source arguments, and
+ * an optional return structure.
+ *
+ * @a0-a7: arguments passed in registers 0 to 7
+ * @res: result values from registers 0 to 3
+ *
+ * This macro is used to make HVC calls following SMC Calling Convention v1.1.
+ * The content of the supplied param are copied to registers 0 to 7 prior
+ * to the HVC instruction. The return values are updated with the content
+ * from register 0 to 3 on return from the HVC instruction if not NULL.
+ */
+#define arm_smccc_1_1_hvc(...) __arm_smccc_1_1(SMCCC_HVC_INST, __VA_ARGS__)
+
  #endif /*__ASSEMBLY__*/
  #endif /*__LINUX_ARM_SMCCC_H*/




Re: [PATCH v3 15/18] arm/arm64: smccc: Make function identifiers an unsigned quantity

2018-02-01 Thread Robin Murphy

On 01/02/18 11:46, Marc Zyngier wrote:

Function identifiers are a 32bit, unsigned quantity. But we never
tell so to the compiler, resulting in the following:

  4ac:   b26187e0        mov     x0, #0xffffffff80000001

We thus rely on the firmware narrowing it for us, which is not
always a reasonable expectation.


I think technically it might be OK, since SMCCC states "A Function 
Identifier is passed in register W0.", which implies that a conforming 
implementation should also read w0, not x0, but it's certainly far 
easier to be completely right than to justify being possibly wrong.
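
For illustration of where the widening comes from (not from the patch):
built from plain ints, the 'fast call, 32-bit convention' function ID
0x80000001 is a negative 32-bit value, and a negative value sign-extends
when widened into a 64-bit register, whereas the _AC(x,U) constants keep
the whole expression unsigned:

#include <stdio.h>

int main(void)
{
	int s = -2147483647;		/* 0x80000001 read as a signed int  */
	unsigned int u = 0x80000001u;	/* what the _AC(1,U) variant yields */

	printf("%llx\n", (unsigned long long)(long long)s);	/* ffffffff80000001 */
	printf("%llx\n", (unsigned long long)u);		/* 80000001 */
	return 0;
}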


Reviewed-by: Robin Murphy <robin.mur...@arm.com>


Cc: sta...@vger.kernel.org
Reported-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
Acked-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
Tested-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
  include/linux/arm-smccc.h | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index e1ef944ef1da..dd44d8458c04 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -14,14 +14,16 @@
  #ifndef __LINUX_ARM_SMCCC_H
  #define __LINUX_ARM_SMCCC_H
  
+#include 

+
  /*
   * This file provides common defines for ARM SMC Calling Convention as
   * specified in
   * http://infocenter.arm.com/help/topic/com.arm.doc.den0028a/index.html
   */
  
-#define ARM_SMCCC_STD_CALL		0

-#define ARM_SMCCC_FAST_CALL		1
+#define ARM_SMCCC_STD_CALL		_AC(0,U)
+#define ARM_SMCCC_FAST_CALL		_AC(1,U)
  #define ARM_SMCCC_TYPE_SHIFT  31
  
  #define ARM_SMCCC_SMC_32		0





Re: [PATCH v3 14/18] firmware/psci: Expose SMCCC version through psci_ops

2018-02-01 Thread Robin Murphy

On 01/02/18 11:46, Marc Zyngier wrote:

Since PSCI 1.0 allows the SMCCC version to be (indirectly) probed,
let's do that at boot time, and expose the version of the calling
convention as part of the psci_ops structure.

Acked-by: Lorenzo Pieralisi <lorenzo.pieral...@arm.com>
Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
  drivers/firmware/psci.c | 19 +++
  include/linux/psci.h|  6 ++
  2 files changed, 25 insertions(+)

diff --git a/drivers/firmware/psci.c b/drivers/firmware/psci.c
index e9493da2b111..8631906c414c 100644
--- a/drivers/firmware/psci.c
+++ b/drivers/firmware/psci.c
@@ -61,6 +61,7 @@ bool psci_tos_resident_on(int cpu)
  
  struct psci_operations psci_ops = {

.conduit = PSCI_CONDUIT_NONE,
+   .smccc_version = SMCCC_VERSION_1_0,
  };
  
  typedef unsigned long (psci_fn)(unsigned long, unsigned long,

@@ -511,6 +512,23 @@ static void __init psci_init_migrate(void)
pr_info("Trusted OS resident on physical CPU 0x%lx\n", cpuid);
  }
  
+static void __init psci_init_smccc(u32 ver)

+{
+   int feature;
+
+   feature = psci_features(ARM_SMCCC_VERSION_FUNC_ID);
+
+   if (feature != PSCI_RET_NOT_SUPPORTED) {
+   ver = invoke_psci_fn(ARM_SMCCC_VERSION_FUNC_ID, 0, 0, 0);
+   if (ver != ARM_SMCCC_VERSION_1_1)
+   psci_ops.smccc_version = SMCCC_VERSION_1_0;


AFAICS, unless you somehow run psci_probe() twice *and* have 
schizophrenic firmware, this assignment now does precisely nothing.


With the condition flipped and the redundant else case removed (or an 
explanation of why I'm wrong...)
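
i.e. something like (sketch of the suggested shape rather than a tested
patch; the static .smccc_version = SMCCC_VERSION_1_0 default already covers
every failure path):

static void __init psci_init_smccc(u32 ver)
{
	int feature = psci_features(ARM_SMCCC_VERSION_FUNC_ID);

	if (feature != PSCI_RET_NOT_SUPPORTED) {
		ver = invoke_psci_fn(ARM_SMCCC_VERSION_FUNC_ID, 0, 0, 0);
		if (ver == ARM_SMCCC_VERSION_1_1)
			psci_ops.smccc_version = SMCCC_VERSION_1_1;
	}

	pr_info("SMC Calling Convention v1.%d\n", psci_ops.smccc_version);
}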


Reviewed-by: Robin Murphy <robin.mur...@arm.com>


+   else
+   psci_ops.smccc_version = SMCCC_VERSION_1_1;
+   }
+
+   pr_info("SMC Calling Convention v1.%d\n", psci_ops.smccc_version);
+}
+
  static void __init psci_0_2_set_functions(void)
  {
pr_info("Using standard PSCI v0.2 function IDs\n");
@@ -559,6 +577,7 @@ static int __init psci_probe(void)
psci_init_migrate();
  
  	if (PSCI_VERSION_MAJOR(ver) >= 1) {

+   psci_init_smccc(ver);
psci_init_cpu_suspend();
psci_init_system_suspend();
}
diff --git a/include/linux/psci.h b/include/linux/psci.h
index f2679e5faa4f..8b1b3b5935ab 100644
--- a/include/linux/psci.h
+++ b/include/linux/psci.h
@@ -31,6 +31,11 @@ enum psci_conduit {
PSCI_CONDUIT_HVC,
  };
  
+enum smccc_version {

+   SMCCC_VERSION_1_0,
+   SMCCC_VERSION_1_1,
+};
+
  struct psci_operations {
u32 (*get_version)(void);
int (*cpu_suspend)(u32 state, unsigned long entry_point);
@@ -41,6 +46,7 @@ struct psci_operations {
unsigned long lowest_affinity_level);
int (*migrate_info_type)(void);
enum psci_conduit conduit;
+   enum smccc_version smccc_version;
  };
  
  extern struct psci_operations psci_ops;





Re: [PATCH v3 13/18] firmware/psci: Expose PSCI conduit

2018-02-01 Thread Robin Murphy

On 01/02/18 11:46, Marc Zyngier wrote:

In order to call into the firmware to apply workarounds, it is
useful to find out whether we're using HVC or SMC. Let's expose
this through the psci_ops.


Reviewed-by: Robin Murphy <robin.mur...@arm.com>


Acked-by: Lorenzo Pieralisi <lorenzo.pieral...@arm.com>
Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
  drivers/firmware/psci.c | 28 +++-
  include/linux/psci.h|  7 +++
  2 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/drivers/firmware/psci.c b/drivers/firmware/psci.c
index 8b25d31e8401..e9493da2b111 100644
--- a/drivers/firmware/psci.c
+++ b/drivers/firmware/psci.c
@@ -59,7 +59,9 @@ bool psci_tos_resident_on(int cpu)
return cpu == resident_cpu;
  }
  
-struct psci_operations psci_ops;

+struct psci_operations psci_ops = {
+   .conduit = PSCI_CONDUIT_NONE,
+};
  
  typedef unsigned long (psci_fn)(unsigned long, unsigned long,

unsigned long, unsigned long);
@@ -210,6 +212,22 @@ static unsigned long psci_migrate_info_up_cpu(void)
  0, 0, 0);
  }
  
+static void set_conduit(enum psci_conduit conduit)

+{
+   switch (conduit) {
+   case PSCI_CONDUIT_HVC:
+   invoke_psci_fn = __invoke_psci_fn_hvc;
+   break;
+   case PSCI_CONDUIT_SMC:
+   invoke_psci_fn = __invoke_psci_fn_smc;
+   break;
+   default:
+   WARN(1, "Unexpected PSCI conduit %d\n", conduit);
+   }
+
+   psci_ops.conduit = conduit;
+}
+
  static int get_set_conduit_method(struct device_node *np)
  {
const char *method;
@@ -222,9 +240,9 @@ static int get_set_conduit_method(struct device_node *np)
}
  
  	if (!strcmp("hvc", method)) {

-   invoke_psci_fn = __invoke_psci_fn_hvc;
+   set_conduit(PSCI_CONDUIT_HVC);
} else if (!strcmp("smc", method)) {
-   invoke_psci_fn = __invoke_psci_fn_smc;
+   set_conduit(PSCI_CONDUIT_SMC);
} else {
pr_warn("invalid \"method\" property: %s\n", method);
return -EINVAL;
@@ -654,9 +672,9 @@ int __init psci_acpi_init(void)
pr_info("probing for conduit method from ACPI.\n");
  
  	if (acpi_psci_use_hvc())

-   invoke_psci_fn = __invoke_psci_fn_hvc;
+   set_conduit(PSCI_CONDUIT_HVC);
else
-   invoke_psci_fn = __invoke_psci_fn_smc;
+   set_conduit(PSCI_CONDUIT_SMC);
  
  	return psci_probe();

  }
diff --git a/include/linux/psci.h b/include/linux/psci.h
index f724fd8c78e8..f2679e5faa4f 100644
--- a/include/linux/psci.h
+++ b/include/linux/psci.h
@@ -25,6 +25,12 @@ bool psci_tos_resident_on(int cpu);
  int psci_cpu_init_idle(unsigned int cpu);
  int psci_cpu_suspend_enter(unsigned long index);
  
+enum psci_conduit {

+   PSCI_CONDUIT_NONE,
+   PSCI_CONDUIT_SMC,
+   PSCI_CONDUIT_HVC,
+};
+
  struct psci_operations {
u32 (*get_version)(void);
int (*cpu_suspend)(u32 state, unsigned long entry_point);
@@ -34,6 +40,7 @@ struct psci_operations {
int (*affinity_info)(unsigned long target_affinity,
unsigned long lowest_affinity_level);
int (*migrate_info_type)(void);
+   enum psci_conduit conduit;
  };
  
  extern struct psci_operations psci_ops;





Re: [PATCH v2 04/16] arm/arm64: KVM: Add PSCI_VERSION helper

2018-01-30 Thread Robin Murphy

On 29/01/18 17:45, Marc Zyngier wrote:

As we're about to trigger a PSCI version explosion, it doesn't
hurt to introduce a PSCI_VERSION helper that is going to be
used everywhere.

Signed-off-by: Marc Zyngier 
---
  include/kvm/arm_psci.h | 5 +++--
  virt/kvm/arm/psci.c| 2 +-
  2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/kvm/arm_psci.h b/include/kvm/arm_psci.h
index 2042bb909474..3a408c846c09 100644
--- a/include/kvm/arm_psci.h
+++ b/include/kvm/arm_psci.h
@@ -18,8 +18,9 @@
  #ifndef __KVM_ARM_PSCI_H__
  #define __KVM_ARM_PSCI_H__
  
-#define KVM_ARM_PSCI_0_1	1

-#define KVM_ARM_PSCI_0_2   2
+#define PSCI_VERSION(x,y)  ((((x) & 0x7fff) << 16) | ((y) & 0xffff))


I see virt/kvm/arm/psci.c already pulls in uapi/linux/psci.h, so maybe 
this guy could go in there alongside the other PSCI_VERSION_* gubbins?


Robin.


+#define KVM_ARM_PSCI_0_1   PSCI_VERSION(0, 1)
+#define KVM_ARM_PSCI_0_2   PSCI_VERSION(0, 2)
  
  int kvm_psci_version(struct kvm_vcpu *vcpu);

  int kvm_psci_call(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/arm/psci.c b/virt/kvm/arm/psci.c
index b322e46fd142..c00bb324e14e 100644
--- a/virt/kvm/arm/psci.c
+++ b/virt/kvm/arm/psci.c
@@ -222,7 +222,7 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
 * Bits[31:16] = Major Version = 0
 * Bits[15:0] = Minor Version = 2
 */
-   val = 2;
+   val = KVM_ARM_PSCI_0_2;
break;
case PSCI_0_2_FN_CPU_SUSPEND:
case PSCI_0_2_FN64_CPU_SUSPEND:




Re: [PATCH v2 13/16] firmware/psci: Expose SMCCC version through psci_ops

2018-01-30 Thread Robin Murphy

On 29/01/18 17:45, Marc Zyngier wrote:

Since PSCI 1.0 allows the SMCCC version to be (indirectly) probed,
let's do that at boot time, and expose the version of the calling
convention as part of the psci_ops structure.

Signed-off-by: Marc Zyngier 
---
  drivers/firmware/psci.c | 21 +
  include/linux/psci.h|  6 ++
  2 files changed, 27 insertions(+)

diff --git a/drivers/firmware/psci.c b/drivers/firmware/psci.c
index e9493da2b111..dd035aaa1c33 100644
--- a/drivers/firmware/psci.c
+++ b/drivers/firmware/psci.c
@@ -511,6 +511,26 @@ static void __init psci_init_migrate(void)
pr_info("Trusted OS resident on physical CPU 0x%lx\n", cpuid);
  }
  
+static void __init psci_init_smccc(u32 ver)

+{
+   int feature = PSCI_RET_NOT_SUPPORTED;
+
+   if (PSCI_VERSION_MAJOR(ver) >= 1)
+   feature = psci_features(ARM_SMCCC_VERSION_FUNC_ID);
+
+   if (feature == PSCI_RET_NOT_SUPPORTED) {
+   psci_ops.variant = SMCCC_VARIANT_1_0;


Presumably at some point in the future we might want to update PSCI 
itself to use 'fast' calls if available, at which point this 
initialisation is a bit backwards given that you've already made a call 
using *some* convention. Even so, I think relying on the enum value working 
out OK is a bit subtle, so perhaps it's best to default-initialise 
psci_ops.variant to 1.0 (either statically, or dynamically before the 
first call), then only update it to 1.1 if and when we discover that.


Robin.


+   } else {
+   ver = invoke_psci_fn(ARM_SMCCC_VERSION_FUNC_ID, 0, 0, 0);
+   if (ver != ARM_SMCCC_VERSION_1_1)
+   psci_ops.variant = SMCCC_VARIANT_1_0;
+   else
+   psci_ops.variant = SMCCC_VARIANT_1_1;
+   }
+
+   pr_info("SMC Calling Convention v1.%d\n", psci_ops.variant);
+}
+
  static void __init psci_0_2_set_functions(void)
  {
pr_info("Using standard PSCI v0.2 function IDs\n");
@@ -557,6 +577,7 @@ static int __init psci_probe(void)
psci_0_2_set_functions();
  
  	psci_init_migrate();

+   psci_init_smccc(ver);
  
  	if (PSCI_VERSION_MAJOR(ver) >= 1) {

psci_init_cpu_suspend();
diff --git a/include/linux/psci.h b/include/linux/psci.h
index f2679e5faa4f..83fd16a37be3 100644
--- a/include/linux/psci.h
+++ b/include/linux/psci.h
@@ -31,6 +31,11 @@ enum psci_conduit {
PSCI_CONDUIT_HVC,
  };
  
+enum smccc_variant {

+   SMCCC_VARIANT_1_0,
+   SMCCC_VARIANT_1_1,
+};
+
  struct psci_operations {
u32 (*get_version)(void);
int (*cpu_suspend)(u32 state, unsigned long entry_point);
@@ -41,6 +46,7 @@ struct psci_operations {
unsigned long lowest_affinity_level);
int (*migrate_info_type)(void);
enum psci_conduit conduit;
+   enum smccc_variant variant;
  };
  
  extern struct psci_operations psci_ops;





Re: [PATCH v2 10/16] arm64: KVM: Report SMCCC_ARCH_WORKAROUND_1 BP hardening support

2018-01-30 Thread Robin Murphy

On 29/01/18 17:45, Marc Zyngier wrote:

A new feature of SMCCC 1.1 is that it offers firmware-based CPU
workarounds. In particular, SMCCC_ARCH_WORKAROUND_1 provides
BP hardening for CVE-2017-5715.

If the host has some mitigation for this issue, report that
we deal with it using SMCCC_ARCH_WORKAROUND_1, as we apply the
host workaround on every guest exit.

Signed-off-by: Marc Zyngier 
---
  include/linux/arm-smccc.h |  5 +
  virt/kvm/arm/psci.c   | 17 +++--
  2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index dc68aa5a7261..e1ef944ef1da 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -73,6 +73,11 @@
   ARM_SMCCC_SMC_32,\
   0, 1)
  
+#define ARM_SMCCC_ARCH_WORKAROUND_1	\

+   ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
+  ARM_SMCCC_SMC_32,\
+  0, 0x8000)
+
  #ifndef __ASSEMBLY__
  
  #include 

diff --git a/virt/kvm/arm/psci.c b/virt/kvm/arm/psci.c
index a021b62ed762..5677d16abc71 100644
--- a/virt/kvm/arm/psci.c
+++ b/virt/kvm/arm/psci.c
@@ -407,14 +407,27 @@ static int kvm_psci_call(struct kvm_vcpu *vcpu)
  int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
  {
u32 func_id = smccc_get_function(vcpu);
-   u32 val;
+   u32 val, feature;
  
  	switch (func_id) {

case ARM_SMCCC_VERSION_FUNC_ID:
val = ARM_SMCCC_VERSION_1_1;
break;
case ARM_SMCCC_ARCH_FEATURES_FUNC_ID:
-   val = -1;   /* Nothing supported yet */


Conceptually, might it still make sense to initialise val to 
NOT_SUPPORTED here, then overwrite it if and when a feature actually is 
present? It would in this case save a few lines as well, but I know 
multiple assignment can be one of those religious issues, so I'm not too 
fussed either way.
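
i.e. the case could collapse to something like this (sketch, reusing the
identifiers from the patch as-is):

	case ARM_SMCCC_ARCH_FEATURES_FUNC_ID:
		val = -1;	/* default: not supported */
		feature = smccc_get_arg1(vcpu);
		switch (feature) {
#ifdef CONFIG_ARM64
		case ARM_SMCCC_ARCH_WORKAROUND_1:
			if (cpus_have_const_cap(ARM64_HARDEN_BRANCH_PREDICTOR))
				val = 0;
			break;
#endif
		}
		break;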


Robin.


+   feature = smccc_get_arg1(vcpu);
+   switch(feature) {
+#ifdef CONFIG_ARM64
+   case ARM_SMCCC_ARCH_WORKAROUND_1:
+   if (cpus_have_const_cap(ARM64_HARDEN_BRANCH_PREDICTOR))
+   val = 0;
+   else
+   val = -1;
+   break;
+#endif
+   default:
+   val = -1;
+   break;
+   }
break;
default:
return kvm_psci_call(vcpu);




Re: [PATCH v2 15/16] arm/arm64: smccc: Implement SMCCC v1.1 inline primitive

2018-01-29 Thread Robin Murphy

On 29/01/18 17:45, Marc Zyngier wrote:

One of the major improvement of SMCCC v1.1 is that it only clobbers
the first 4 registers, both on 32 and 64bit. This means that it
becomes very easy to provide an inline version of the SMC call
primitive, and avoid performing a function call to stash the
registers that would otherwise be clobbered by SMCCC v1.0.


This is disgusting... I love it :D


Signed-off-by: Marc Zyngier 
---
  include/linux/arm-smccc.h | 157 ++
  1 file changed, 157 insertions(+)

diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index dd44d8458c04..bc5843728909 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -150,5 +150,162 @@ asmlinkage void __arm_smccc_hvc(unsigned long a0, 
unsigned long a1,
  
  #define arm_smccc_hvc_quirk(...) __arm_smccc_hvc(__VA_ARGS__)
  
+/* SMCCC v1.1 implementation madness follows */

+#ifdef CONFIG_ARM64
+
+#define SMCCC_SMC_INST "smc   #0"
+#define SMCCC_HVC_INST "hvc   #0"
+
+#define __arm_smccc_1_1_prologue(inst) \
+   inst "\n" \
+   "cbz   %[ptr], 1f\n"  \
+   "stp   %x[r0], %x[r1], %[ra0]\n"  \
+   "stp   %x[r2], %x[r3], %[ra2]\n"  \
+   "1:\n"\
+   : [ra0] "=Ump" (*(&___res->a0)),   \
+ [ra2] "=Ump" (*(&___res->a2)),


Rather than embedding a guaranteed spill to memory, I wonder if there's 
money in just always declaring r0-r3 as in-out operands, and propagating 
them by value afterwards, i.e.:


asm(...);
if (___res)
*___res = (struct arm_smccc_res){ r0, r1, r2, r3 };

In theory, for sufficiently simple callers that might allow res to stay 
in registers for its entire lifetime and give nicer codegen. It *might* 
also simplify some of this macro machinery too, although at this point 
in the evening I can't really tell...
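
In macro form the suggestion amounts to roughly this (sketch only):

#define __arm_smccc_1_1(inst, ...)					\
	do {								\
		__declare_args(__count_args(__VA_ARGS__), __VA_ARGS__);\
		asm volatile(inst "\n"					\
			     __constraints(__count_args(__VA_ARGS__)));	\
		if (___res)						\
			*___res = (struct arm_smccc_res){r0, r1, r2, r3};\
	} while (0)

with the unused result registers then only needing ordinary register output
constraints rather than memory operands.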


Robin.


+
+#define __arm_smccc_1_1_epilogue   : "memory"
+   
+#endif
+
+#ifdef CONFIG_ARM
+#include 
+#include 
+
+#define SMCCC_SMC_INST __SMC(0)
+#define SMCCC_HVC_INST __HVC(0)
+
+#define __arm_smccc_1_1_prologue(inst) \
+   inst "\n" \
+   "cmp   %[ptr], #0\n"  \
+   "stmne %[ptr], {%[r0], %[r1], %[r2], %[r3]}\n"\
+   : "=m" (*___res),
+
+#define __arm_smccc_1_1_epilogue   : "memory", "cc"
+   
+#endif
+
+#define __constraint_write_0   \
+   [r0] "+r" (r0), [r1] "=r" (r1), [r2] "=r" (r2), [r3] "=r" (r3)
+#define __constraint_write_1   \
+   [r0] "+r" (r0), [r1] "+r" (r1), [r2] "=r" (r2), [r3] "=r" (r3)
+#define __constraint_write_2   \
+   [r0] "+r" (r0), [r1] "+r" (r1), [r2] "+r" (r2), [r3] "=r" (r3)
+#define __constraint_write_3   \
+   [r0] "+r" (r0), [r1] "+r" (r1), [r2] "+r" (r2), [r3] "+r" (r3)
+#define __constraint_write_4   __constraint_write_3
+#define __constraint_write_5   __constraint_write_3
+#define __constraint_write_6   __constraint_write_3
+#define __constraint_write_7   __constraint_write_3
+
+#define __constraint_read_0: [ptr] "r" (___res)
+#define __constraint_read_1__constraint_read_0
+#define __constraint_read_2__constraint_read_0
+#define __constraint_read_3__constraint_read_0
+#define __constraint_read_4__constraint_read_3, "r" (r4)
+#define __constraint_read_5__constraint_read_4, "r" (r5)
+#define __constraint_read_6__constraint_read_5, "r" (r6)
+#define __constraint_read_7__constraint_read_6, "r" (r7)
+
+#define ___count_args(_0, _1, _2, _3, _4, _5, _6, _7, _8, x, ...) x
+
+#define __count_args(...)  \
+   ___count_args(__VA_ARGS__, 7, 6, 5, 4, 3, 2, 1, 0)
+
+#define __declare_arg_0(a0, res)   \
+   struct arm_smccc_res   *___res = res;   \
+   register u32   r0 asm("r0") = a0; \
+   register unsigned long r1 asm("r1");  \
+   register unsigned long r2 asm("r2");  \
+   register unsigned long r3 asm("r3")
+
+#define __declare_arg_1(a0, a1, res)   \
+   struct arm_smccc_res   *___res = res;   \
+   register u32   r0 asm("r0") = a0; \
+   register typeof(a1)r1 asm("r1") = a1; \
+   register unsigned long r2 asm("r2");   

Re: [PATCH 14/14] arm64: Add ARM_SMCCC_ARCH_WORKAROUND_1 BP hardening support

2018-01-26 Thread Robin Murphy

On 26/01/18 14:28, Marc Zyngier wrote:

Add the detection and runtime code for ARM_SMCCC_ARCH_WORKAROUND_1.
It is lovely. Really.

Signed-off-by: Marc Zyngier 
---
  arch/arm64/kernel/bpi.S| 20 
  arch/arm64/kernel/cpu_errata.c | 71 +-
  2 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/bpi.S b/arch/arm64/kernel/bpi.S
index 76225c2611ea..add7e08a018d 100644
--- a/arch/arm64/kernel/bpi.S
+++ b/arch/arm64/kernel/bpi.S
@@ -17,6 +17,7 @@
   */
  
  #include 

+#include 
  
  .macro ventry target

.rept 31
@@ -85,3 +86,22 @@ ENTRY(__qcom_hyp_sanitize_link_stack_start)
.endr
ldp x29, x30, [sp], #16
  ENTRY(__qcom_hyp_sanitize_link_stack_end)
+
+.macro smccc_workaround_1 inst
+   sub sp, sp, #(8 * 4)
+   stp x2, x3, [sp, #(16 * 0)]


This seems unnecessarily confusing - using either units of registers, or 
of register pairs, is fine, but mixing both in the same sequence just 
hurts more than it needs to.



+   stp x0, x1, [sp, #(16 * 1)]
+   orr w0, wzr, #ARM_SMCCC_ARCH_WORKAROUND_1


Writing this as a MOV like a sane person would make things 0.37% more 
lovely, I promise ;)



+   \inst   #0
+   ldp x2, x3, [sp, #(16 * 0)]
+   ldp x0, x1, [sp, #(16 * 1)]
+   add sp, sp, #(8 * 4)
+.endm
+
+ENTRY(__smccc_workaround_1_smc_start)
+   smccc_workaround_1  smc
+ENTRY(__smccc_workaround_1_smc_end)
+
+ENTRY(__smccc_workaround_1_hvc_start)
+   smccc_workaround_1  hvc
+ENTRY(__smccc_workaround_1_hvc_end)


That said, should we not be implementing this lot in smccc-call.S...


diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index ed6881882231..f1501873f2e4 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -70,6 +70,10 @@ DEFINE_PER_CPU_READ_MOSTLY(struct bp_hardening_data, 
bp_hardening_data);
  extern char __psci_hyp_bp_inval_start[], __psci_hyp_bp_inval_end[];
  extern char __qcom_hyp_sanitize_link_stack_start[];
  extern char __qcom_hyp_sanitize_link_stack_end[];
+extern char __smccc_workaround_1_smc_start[];
+extern char __smccc_workaround_1_smc_end[];
+extern char __smccc_workaround_1_hvc_start[];
+extern char __smccc_workaround_1_hvc_end[];
  
  static void __copy_hyp_vect_bpi(int slot, const char *hyp_vecs_start,

const char *hyp_vecs_end)
@@ -116,6 +120,10 @@ static void __install_bp_hardening_cb(bp_hardening_cb_t fn,
  #define __psci_hyp_bp_inval_end   NULL
  #define __qcom_hyp_sanitize_link_stack_start  NULL
  #define __qcom_hyp_sanitize_link_stack_endNULL
+#define __smccc_workaround_1_smc_start NULL
+#define __smccc_workaround_1_smc_end   NULL
+#define __smccc_workaround_1_hvc_start NULL
+#define __smccc_workaround_1_hvc_end   NULL
  
  static void __install_bp_hardening_cb(bp_hardening_cb_t fn,

  const char *hyp_vecs_start,
@@ -142,17 +150,78 @@ static void  install_bp_hardening_cb(const struct 
arm64_cpu_capabilities *entry,
__install_bp_hardening_cb(fn, hyp_vecs_start, hyp_vecs_end);
  }
  
+#include 

+#include 
  #include 
  
+static void call_smc_arch_workaround_1(void)

+{
+   register int w0 asm("w0") = ARM_SMCCC_ARCH_WORKAROUND_1;
+   asm volatile("smc  #0\n"
+: "+r" (w0));
+}
+
+static void call_hvc_arch_workaround_1(void)
+{
+   register int w0 asm("w0") = ARM_SMCCC_ARCH_WORKAROUND_1;
+   asm volatile("hvc  #0\n"
+: "+r" (w0));
+}


...such that these could simply be something like:

static void call_{smc,hvc}_arch_workaround_1(void)
{
arm_smccc_v1_1_{smc,hvc}(ARM_SMCCC_ARCH_WORKAROUND_1);
}

?
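
Spelled out, that would be roughly the following (a sketch only, assuming
the SMCCC 1.1 helpers from earlier in the series - the exact name and
whether a trailing result pointer is required may differ; NULL here stands
for "no result wanted"):

	static void call_smc_arch_workaround_1(void)
	{
		arm_smccc_1_1_smc(ARM_SMCCC_ARCH_WORKAROUND_1, NULL);
	}

	static void call_hvc_arch_workaround_1(void)
	{
		arm_smccc_1_1_hvc(ARM_SMCCC_ARCH_WORKAROUND_1, NULL);
	}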

Robin.


+
+static bool check_smccc_arch_workaround_1(const struct arm64_cpu_capabilities 
*entry)
+{
+   bp_hardening_cb_t cb;
+   void *smccc_start, *smccc_end;
+   struct arm_smccc_res res;
+
+   if (psci_ops.variant == SMCCC_VARIANT_1_0)
+   return false;
+
+   switch (psci_ops.conduit) {
+   case PSCI_CONDUIT_HVC:
+   arm_smccc_hvc(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
+ ARM_SMCCC_ARCH_WORKAROUND_1, 0, 0, 0, 0, 0, 0,
+ );
+   if (res.a0)
+   return false;
+   cb = call_hvc_arch_workaround_1;
+   smccc_start = __smccc_workaround_1_hvc_start;
+   smccc_end = __smccc_workaround_1_hvc_end;
+   break;
+
+   case PSCI_CONDUIT_SMC:
+   arm_smccc_smc(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
+ ARM_SMCCC_ARCH_WORKAROUND_1, 0, 0, 0, 0, 0, 0,
+ );
+   if (res.a0)
+   return false;
+   cb = call_smc_arch_workaround_1;
+   smccc_start = 

Re: [PATCH 3/3] arm64: Add software workaround for Falkor erratum 1041

2017-11-03 Thread Robin Murphy
On 03/11/17 03:27, Shanker Donthineni wrote:
> The ARM architecture defines the memory locations that are permitted
> to be accessed as the result of a speculative instruction fetch from
> an exception level for which all stages of translation are disabled.
> Specifically, the core is permitted to speculatively fetch from the
> 4KB region containing the current program counter and next 4KB.
> 
> When translation is changed from enabled to disabled for the running
> exception level (SCTLR_ELn[M] changed from a value of 1 to 0), the
> Falkor core may errantly speculatively access memory locations outside
> of the 4KB region permitted by the architecture. The errant memory
> access may lead to one of the following unexpected behaviors.
> 
> 1) A System Error Interrupt (SEI) being raised by the Falkor core due
>to the errant memory access attempting to access a region of memory
>that is protected by a slave-side memory protection unit.
> 2) Unpredictable device behavior due to a speculative read from device
>memory. This behavior may only occur if the instruction cache is
>disabled prior to or coincident with translation being changed from
>enabled to disabled.
> 
> To avoid the errant behavior, software must execute an ISB immediately
> prior to executing the MSR that will change SCTLR_ELn[M] from 1 to 0.
> 
> Signed-off-by: Shanker Donthineni 
> ---
>  Documentation/arm64/silicon-errata.txt |  1 +
>  arch/arm64/Kconfig | 10 ++
>  arch/arm64/include/asm/assembler.h | 17 +
>  arch/arm64/include/asm/cpucaps.h   |  3 ++-
>  arch/arm64/kernel/cpu_errata.c | 16 
>  arch/arm64/kernel/efi-entry.S  |  4 ++--
>  arch/arm64/kernel/head.S   |  4 ++--
>  7 files changed, 50 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/arm64/silicon-errata.txt 
> b/Documentation/arm64/silicon-errata.txt
> index 66e8ce1..704770c0 100644
> --- a/Documentation/arm64/silicon-errata.txt
> +++ b/Documentation/arm64/silicon-errata.txt
> @@ -74,3 +74,4 @@ stable kernels.
>  | Qualcomm Tech. | Falkor v1   | E1003   | 
> QCOM_FALKOR_ERRATUM_1003|
>  | Qualcomm Tech. | Falkor v1   | E1009   | 
> QCOM_FALKOR_ERRATUM_1009|
>  | Qualcomm Tech. | QDF2400 ITS | E0065   | 
> QCOM_QDF2400_ERRATUM_0065   |
> +| Qualcomm Tech. | Falkor v{1,2}   | E1041   | 
> QCOM_FALKOR_ERRATUM_1041|
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 0df64a6..7e933fb 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -539,6 +539,16 @@ config QCOM_QDF2400_ERRATUM_0065
>  
> If unsure, say Y.
>  
> +config QCOM_FALKOR_ERRATUM_1041
> + bool "Falkor E1041: Speculative instruction fetches might cause errant 
> memory access"
> + default y
> + help
> +   Falkor CPU may speculatively fetch instructions from an improper
> +   memory location when MMU translation is changed from SCTLR_ELn[M]=1
> +   to SCTLR_ELn[M]=0. Prefix an ISB instruction to fix the problem.
> +
> +   If unsure, say Y.
> +
>  endmenu
>  
>  
> diff --git a/arch/arm64/include/asm/assembler.h 
> b/arch/arm64/include/asm/assembler.h
> index b6dfb4f..4c91efb 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -30,6 +30,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /*
>   * Enable and disable interrupts.
> @@ -514,6 +515,22 @@
>   *   reg: the value to be written.
>   */
>   .macro  write_sctlr, eln, reg
> +#ifdef CONFIG_QCOM_FALKOR_ERRATUM_1041
> +alternative_if ARM64_WORKAROUND_QCOM_FALKOR_E1041
> + tbnz\reg, #0, 8000f  // enable MMU?

Do we really need the branch here? It's not like enabling the MMU is
something we do on the syscall fastpath, and I can't imagine an extra
ISB hurts much (and is probably comparable to a mispredicted branch
anyway). In fact, is there any noticeable hit on other
microarchitectures if we save the alternative bother and just do it
unconditionally always?

Robin.

> + isb
> +8000:
> +alternative_else_nop_endif
> +#endif
> + msr sctlr_\eln, \reg
> + .endm
> +
> + .macro  early_write_sctlr, eln, reg
> +#ifdef CONFIG_QCOM_FALKOR_ERRATUM_1041
> + tbnz\reg, #0, 8000f  // enable MMU?
> + isb
> +8000:
> +#endif
>   msr sctlr_\eln, \reg
>   .endm
>  
> diff --git a/arch/arm64/include/asm/cpucaps.h 
> b/arch/arm64/include/asm/cpucaps.h
> index 8da6216..7f7a59d 100644
> --- a/arch/arm64/include/asm/cpucaps.h
> +++ b/arch/arm64/include/asm/cpucaps.h
> @@ -40,7 +40,8 @@
>  #define ARM64_WORKAROUND_858921  19
>  #define ARM64_WORKAROUND_CAVIUM_30115    20
>  #define ARM64_HAS_DCPOP  21
> +#define ARM64_WORKAROUND_QCOM_FALKOR_E1041   22
>  
> -#define ARM64_NCAPS  22
> +#define ARM64_NCAPS

Re: [PATCH v1 1/3] arm64: add a macro for SError synchronization

2017-11-01 Thread Robin Murphy
On 01/11/17 12:54, gengdongjiu wrote:
> Hi Robin,
> 
> On 2017/11/1 19:24, Robin Murphy wrote:
>>> +   esb
>>> +alternative_else_nop_endif
>>> +1:
>>> +   .endm
>> Having a branch in here is pretty horrible, and furthermore using label
>> number 1 has a pretty high chance of subtly breaking code where this
>> macro is inserted.
>>
>> Can we not somehow nest or combine the alternative conditions here?
> 
> I found it will report error if combine the alternative conditions here.
> 
> For example:
> 
> + .macro  error_synchronize
> +alternative_if ARM64_HAS_IESB
> +alternative_if ARM64_HAS_RAS_EXTN
> + esb
> +alternative_else_nop_endif
> +alternative_else_nop_endif
> + .endm
> 
> And even using b.eq/cbz instruction in the alternative instruction in 
> arch/arm64/kernel/entry.S,
> it will report Error.
> 
> For example below
> 
> alternative_if ARM64_HAS_PAN
>   
> b.eqx
> alternative_else_nop_endif
> 
> I do not dig it deeply, do you know the reason about it or good suggestion 
> about that?
> Thanks a lot in advance.

Actually, on second look ARM64_HAS_RAS_EXTN doesn't even matter - ESB is
a hint, so if the CPU doesn't have RAS it should behave as a NOP anyway.

On which note, since I don't see one here - are any of those other
patches defining an "esb" assembly macro similar to the inline asm case?
If not then this isn't going to build with older toolchains - perhaps we
should just use the raw hint syntax directly.
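
For instance, something along these lines (a sketch only, relying on ESB
being encoded as HINT #16, which is a NOP wherever the RAS extension is
not implemented):

	/* e.g. in asm/barrier.h */
	#define esb()	asm volatile("hint #16" : : : "memory")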

Robin.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v1 1/3] arm64: add a macro for SError synchronization

2017-11-01 Thread Robin Murphy
On 01/11/17 19:14, Dongjiu Geng wrote:
> ARMv8.2 adds a control bit to each SCTLR_ELx to insert implicit
> Error Synchronization Barrier(IESB) operations at exception handler entry
> and exit. But not all hardware platform which support RAS Extension
> can support IESB. So for this case, software needs to manually insert
> Error Synchronization Barrier(ESB) operations.
> 
> In this macro, if the system supports the RAS Extension but not IESB,
> it will insert an ESB instruction.
> 
> Signed-off-by: Dongjiu Geng 
> ---
>  arch/arm64/include/asm/assembler.h | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/assembler.h 
> b/arch/arm64/include/asm/assembler.h
> index d4c0adf..e6c79c4 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -517,4 +517,13 @@
>  #endif
>   .endm
>  
> + .macro  error_synchronize
> +alternative_if ARM64_HAS_IESB
> + b   1f
> +alternative_else_nop_endif
> +alternative_if ARM64_HAS_RAS_EXTN
> + esb
> +alternative_else_nop_endif
> +1:
> + .endm

Having a branch in here is pretty horrible, and furthermore using label
number 1 has a pretty high chance of subtly breaking code where this
macro is inserted.

Can we not somehow nest or combine the alternative conditions here?

Robin.

>  #endif   /* __ASM_ASSEMBLER_H */
> 

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v1 0/3] manually add Error Synchronization Barrier at exception handler entry and exit

2017-11-01 Thread Robin Murphy
On 01/11/17 19:14, Dongjiu Geng wrote:
> Some hardware platform can support RAS Extension, but not support IESB,
> such as Huawei's platform, so software need to insert Synchronization Barrier
> operations at exception handler entry.
> 
> This series patches are based on  James's series patches "SError rework +
> RAS for firmware first support". In Huawei's platform, we do not
> support IESB, so software needs to insert that.
> 
> 
> Dongjiu Geng (3):
>   arm64: add a macro for SError synchronization
>   arm64: add error synchronization barrier in kernel_entry/kernel_exit
>   KVM: arm64: add ESB in exception handler entry and exit.
> 
> James Morse (18):
>   arm64: explicitly mask all exceptions
>   arm64: introduce an order for exceptions
>   arm64: Move the async/fiq helpers to explicitly set process context
> flags
>   arm64: Mask all exceptions during kernel_exit
>   arm64: entry.S: Remove disable_dbg
>   arm64: entry.S: convert el1_sync
>   arm64: entry.S convert el0_sync
>   arm64: entry.S: convert elX_irq
>   KVM: arm/arm64: mask/unmask daif around VHE guests
>   arm64: kernel: Survive corrected RAS errors notified by SError
>   arm64: cpufeature: Enable IESB on exception entry/return for
> firmware-first
>   arm64: kernel: Prepare for a DISR user
>   KVM: arm64: Set an impdef ESR for Virtual-SError using VSESR_EL2.
>   KVM: arm64: Save/Restore guest DISR_EL1
>   KVM: arm64: Save ESR_EL2 on guest SError
>   KVM: arm64: Handle RAS SErrors from EL1 on guest exit
>   KVM: arm64: Handle RAS SErrors from EL2 on guest exit
>   KVM: arm64: Take any host SError before entering the guest
> 
> Xie XiuQi (2):
>   arm64: entry.S: move SError handling into a C function for future
> expansion
>   arm64: cpufeature: Detect CPU RAS Extentions
> 
>  arch/arm64/Kconfig   | 33 +-
>  arch/arm64/include/asm/assembler.h   | 59 +---
>  arch/arm64/include/asm/barrier.h |  1 +
>  arch/arm64/include/asm/cpucaps.h |  4 +-
>  arch/arm64/include/asm/daifflags.h   | 61 +
>  arch/arm64/include/asm/esr.h | 17 +++
>  arch/arm64/include/asm/exception.h   | 14 ++
>  arch/arm64/include/asm/irqflags.h| 40 ++--
>  arch/arm64/include/asm/kvm_emulate.h | 10 
>  arch/arm64/include/asm/kvm_host.h| 16 +++
>  arch/arm64/include/asm/processor.h   |  2 +
>  arch/arm64/include/asm/sysreg.h  |  6 +++
>  arch/arm64/include/asm/traps.h   | 36 +++
>  arch/arm64/kernel/asm-offsets.c  |  1 +
>  arch/arm64/kernel/cpufeature.c   | 43 ++
>  arch/arm64/kernel/debug-monitors.c   |  5 +-
>  arch/arm64/kernel/entry.S| 88 
> +---
>  arch/arm64/kernel/hibernate.c|  5 +-
>  arch/arm64/kernel/machine_kexec.c|  4 +-
>  arch/arm64/kernel/process.c  |  3 ++
>  arch/arm64/kernel/setup.c|  8 ++--
>  arch/arm64/kernel/signal.c   |  8 +++-
>  arch/arm64/kernel/smp.c  | 12 ++---
>  arch/arm64/kernel/suspend.c  |  7 +--
>  arch/arm64/kernel/traps.c| 64 +-
>  arch/arm64/kvm/handle_exit.c | 19 +++-
>  arch/arm64/kvm/hyp-init.S|  3 ++
>  arch/arm64/kvm/hyp/entry.S   | 15 ++
>  arch/arm64/kvm/hyp/hyp-entry.S   |  1 +
>  arch/arm64/kvm/hyp/switch.c  | 19 ++--
>  arch/arm64/kvm/hyp/sysreg-sr.c   |  6 +++
>  arch/arm64/kvm/inject_fault.c| 13 +-
>  arch/arm64/kvm/sys_regs.c|  1 +
>  arch/arm64/mm/proc.S | 14 --
>  virt/kvm/arm/arm.c   |  4 ++
>  35 files changed, 527 insertions(+), 115 deletions(-)
>  create mode 100644 arch/arm64/include/asm/daifflags.h

If you're sending a patch series, please have the cover letter describe
*those patches*, rather than dozens of other patches that aren't part of
it. This diffstat and summary is very confusing.

Robin.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [report] boot a vm that with PCI only hierarchy devices and with GICv3 , it's failed.

2017-07-18 Thread Robin Murphy
On 18/07/17 12:07, wanghaibin wrote:
> On 2017/7/18 18:02, Robin Murphy wrote:
> 
>> On 18/07/17 10:15, Marc Zyngier wrote:
>>> On 18/07/17 05:07, wanghaibin wrote:
>>>> Hi, all:
>>>>
>>>> I met a problem, I just try to test PCI only hierarchy devices model 
>>>> (qemu/docs/pcie.txt  sections 2.3)
>>>>
>>>> Here is part of qemu cmd:
>>>> -device i82801b11-bridge,id=pci.1,bus=pcie.0,addr=0x1 -device 
>>>> pci-bridge,chassis_nr=2,id=pci.2,bus=pci.1,addr=0x0 -device 
>>>> usb-ehci,id=usb,bus=pci.2,addr=0x2
>>>> -device virtio-scsi-pci,id=scsi0,bus=pci.2,addr=0x3 -netdev 
>>>> tap,fd=27,id=hostnet0,vhost=on,vhostfd=28 -device 
>>>> virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:60:6b:1d,bus=pci.2,addr=0x1
>>>> -vnc 0.0.0.0:0 -device virtio-gpu-pci,id=video0,bus=pci.2,addr=0x4
>>>>
>>>> A single DMI-PCI Bridge, a single PCI-PCI Bridge attached to it.  Four 
>>>> PCI_DEV legacy devices (usb, virtio-scsi-pci, virtio-gpu-pci, 
>>>> virtio-net-pci)attached to the PCI-PCI Bridge.
>>>> Boot the vm, it's failed.
>>
>> What's the nature of the failure? Does it hit some actual error case in
>> the GIC code, or does it simply hang up probing the virtio devices
>> because interrupts never arrive?
> 
> 
> Qemu cmdline, xml info, qemu version info, guest kernel version info at the 
> bottom of this mail.
> 
> 
> Guest hang log:
> 
> [  242.740171] INFO: task kworker/u16:4:446 blocked for more than 120 seconds.
> [  242.741102]   Not tainted 4.12.0+ #18
> [  242.741619] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  242.742610] kworker/u16:4   D0   446  2 0x
> [  242.743339] Workqueue: scsi_tmf_0 scmd_eh_abort_handler
> [  242.744014] Call trace:
> [  242.744375] [] __switch_to+0x94/0xa8
> [  242.745042] [] __schedule+0x1a0/0x5e0
> [  242.745716] [] schedule+0x38/0xa0
> [  242.746346] [] schedule_timeout+0x194/0x2b8
> [  242.747092] [] wait_for_common+0xa0/0x148
> [  242.747810] [] wait_for_completion+0x14/0x20
> [  242.748595] [] virtscsi_tmf.constprop.15+0x88/0xf0
> [  242.749408] [] virtscsi_abort+0x9c/0xb8
> [  242.750099] [] scmd_eh_abort_handler+0x5c/0x108
> [  242.750887] [] process_one_work+0x124/0x2a8
> [  242.751618] [] worker_thread+0x5c/0x3d8
> [  242.752330] [] kthread+0xfc/0x128
> [  242.752960] [] ret_from_fork+0x10/0x50
> 
> But I still doubt the total vector count takes this problem in, I add some 
> log in guest:
> In guest boot, pci_dev: 02:04:00(virtio-gpu-pci) first load, and only alloc 4 
> ITT entries, the log as follow:
> 
> [0.986233] ~~its_pci_msi_prepare:pci dev: 2,32, nvec:3~~
> [0.998952] ~~its_pci_msi_prepare:devid:8,alias count:4~~
> [1.28] **its_msi_prepare:devid:8, nves:4**
> [1.001001] ##its_create_device:devid: 8, ITT 4 entries, 2 bits, lpi 
> base:8192, nr:32##
> [1.002585] **its_msi_prepare:ITT 4 entries, 2 bits**
> [1.003593] !!msi_domain_alloc_irqs: to alloc: desc->nvec_used:1!!
> [1.004880] ID:0 pID:8192 vID:52
> [1.005529] !!msi_domain_alloc_irqs: to alloc: desc->nvec_used:1!!
> [1.006777] ID:1 pID:8193 vID:53
> [1.007437] !!msi_domain_alloc_irqs: to alloc: desc->nvec_used:1!!
> [1.008718] ID:2 pID:8194 vID:54
> [1.009366] !!msi_domain_alloc_irqs: to active!!
> [1.010281] ^^^SEND mapti: hwirq:8192,event:0^^
> [1.011224] !!msi_domain_alloc_irqs: to active!!
> [1.012161] ^^^SEND mapti: hwirq:8193,event:1^^
> [1.013095] !!msi_domain_alloc_irqs: to active!!
> [1.014013] ^^^SEND mapti: hwirq:8194,event:2^^
> 
> and the guest booted continue, when load the pci_dev: 02:03:00 (virtio-scsi), 
> the log shows it shared the same devid
> with virtio-gpu-pci, shared ite_dev, reusing ITT. So that, the  
> virtio-gpu-pci dev only alloc 4 ITT, and the virtio-scsi send
> mapti with eventid 5/6, this will be captured by Eric's commit:
> guest log:
> [1.057978] !!msi_domain_alloc_irqs: to prepare: nvec:4!!
> [1.072773] ~~its_pci_msi_prepare:devid:8,alias count:5~~
> [1.073943] **its_msi_prepare:devid:8, nves:5**
> [1.074850] **its_msi_prepare:Reusing ITT for devID:8**
> [1.075873] !!msi_domain_alloc_irqs: to alloc: desc->nvec_used:1!!
> [1.077154] ID:3 pID:8195 vID:55
> [1.077813] !!msi_domain_alloc_irqs: to alloc: desc->nvec_used:1!!
> [1.079044] ID:4 pID:8196 vID:56
> [1.079683] !!msi_domain_alloc_irqs: to alloc: desc->nvec_used:1!!
> [1.080947] ID:5 pID:8197 vID:57
> [1.081592] !!msi_domain_alloc_irqs: 

Re: [report] boot a vm that with PCI only hierarchy devices and with GICv3 , it's failed.

2017-07-18 Thread Robin Murphy
On 18/07/17 10:15, Marc Zyngier wrote:
> On 18/07/17 05:07, wanghaibin wrote:
>> Hi, all:
>>
>> I met a problem, I just try to test PCI only hierarchy devices model 
>> (qemu/docs/pcie.txt  sections 2.3)
>>
>> Here is part of qemu cmd:
>> -device i82801b11-bridge,id=pci.1,bus=pcie.0,addr=0x1 -device 
>> pci-bridge,chassis_nr=2,id=pci.2,bus=pci.1,addr=0x0 -device 
>> usb-ehci,id=usb,bus=pci.2,addr=0x2
>> -device virtio-scsi-pci,id=scsi0,bus=pci.2,addr=0x3 -netdev 
>> tap,fd=27,id=hostnet0,vhost=on,vhostfd=28 -device 
>> virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:60:6b:1d,bus=pci.2,addr=0x1
>> -vnc 0.0.0.0:0 -device virtio-gpu-pci,id=video0,bus=pci.2,addr=0x4
>>
>> A single DMI-PCI Bridge, a single PCI-PCI Bridge attached to it.  Four 
>> PCI_DEV legacy devices (usb, virtio-scsi-pci, virtio-gpu-pci, 
>> virtio-net-pci)attached to the PCI-PCI Bridge.
>> Boot the vm, it's failed.

What's the nature of the failure? Does it hit some actual error case in
the GIC code, or does it simply hang up probing the virtio devices
because interrupts never arrive?

>> I try to debug this problem, and the info just as follow:
>> (1) Since Eric Auger commit (0d44cdb631ef53ea75be056886cf0541311e48df: KVM: 
>> arm64: vgic-its: Interpret MAPD Size field and check related errors), This 
>> problem has been exposed.
>> Of course, I think this commit must be correct surely.
>>
>> (2) For guestOS, I notice Marc commit 
>> (e8137f4f5088d763ced1db82d3974336b76e1bd2: irqchip: gicv3-its: Iterate over 
>> PCI aliases to generate ITS configuration).  This commit brings in that the
>> four PCI_DEV legacy devices shared the same devID, same its_dev, same 
>> ITT tables, but I think here calculate with wrong total msi vector count.
>> (Currently, It seems the total count is the vector count of 
>> virtio-net-pci + PCI-PCI bridge + dmi-pci bridge, maybe here should be the 
>> total count of the four PCI_DEV legacy devices vector count),
>> So that, any pci device using the over bounds eventID and mapti at a 
>> certain moment , the abnormal behavior will captured by Eric's commit.

Now, at worst that patch *should* result in the same number of vectors
being reserved as before - never fewer. Does anything change with it
reverted?

>> Actually, I don't understand very well about non-transparent bridge, PCI 
>> aliases. So just supply these message.

Note that there are further issues with PCI RID to DevID mappings in the
face of aliases[1], but I think the current code does happen to work out
OK for the PCI-PCIe bridge case already.

> +Robin, who is the author of that patch.
> 
> Regarding (2), the number of MSIs should be the total number of devices
> that are going to generate the same DevID. Since the bridge is
> non-transparent, everything behind it aliases with it. So you should
> probably see all the virtio devices and the bridges themselves being
> counted. If that's not the case, then "we have a bug"(tm).
> 
> Can you please post your full qemu cmd line so that we can reproduce it
> and investigate the issue?

Yes, that would be good.

Robin.

[1] https://patchwork.ozlabs.org/patch/769303/

> Thanks,
> 
>   M.
> 

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 3/3] kvm: arm/arm64: Fix locking for kvm_free_stage2_pgd

2017-03-15 Thread Robin Murphy
Hi Marc,

On 15/03/17 13:43, Marc Zyngier wrote:
> On 15/03/17 13:35, Christoffer Dall wrote:
>> On Wed, Mar 15, 2017 at 01:28:07PM +, Marc Zyngier wrote:
>>> On 15/03/17 10:56, Christoffer Dall wrote:
 On Wed, Mar 15, 2017 at 09:39:26AM +, Marc Zyngier wrote:
> On 15/03/17 09:21, Christoffer Dall wrote:
>> On Tue, Mar 14, 2017 at 02:52:34PM +, Suzuki K Poulose wrote:
>>> In kvm_free_stage2_pgd() we don't hold the kvm->mmu_lock while calling
>>> unmap_stage2_range() on the entire memory range for the guest. This 
>>> could
>>> cause problems with other callers (e.g, munmap on a memslot) trying to
>>> unmap a range.
>>>
>>> Fixes: commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
>>> Cc: sta...@vger.kernel.org # v3.10+
>>> Cc: Marc Zyngier 
>>> Cc: Christoffer Dall 
>>> Signed-off-by: Suzuki K Poulose 
>>> ---
>>>  arch/arm/kvm/mmu.c | 3 +++
>>>  1 file changed, 3 insertions(+)
>>>
>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>> index 13b9c1f..b361f71 100644
>>> --- a/arch/arm/kvm/mmu.c
>>> +++ b/arch/arm/kvm/mmu.c
>>> @@ -831,7 +831,10 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
>>> if (kvm->arch.pgd == NULL)
>>> return;
>>>  
>>> +   spin_lock(&kvm->mmu_lock);
>>> unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
>>> +   spin_unlock(&kvm->mmu_lock);
>>> +
>>
>> This ends up holding the spin lock for potentially quite a while, where
>> we can do things like __flush_dcache_area(), which I think can fault.
>
> I believe we're always using the linear mapping (or kmap on 32bit) in
> order not to fault.
>

 ok, then there's just the concern that we may be holding a spinlock for
 a very long time.  I seem to recall Mario once added something where he
 unlocked and gave a chance to schedule something else for each PUD or
 something like that, because he ran into the issue during migration.  Am
 I confusing this with something else?
>>>
>>> That definitely rings a bell: stage2_wp_range() uses that kind of trick
>>> to give the system a chance to breathe. Maybe we could use a similar
>>> trick in our S2 unmapping code? How about this (completely untested) patch:
>>>
>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>> index 962616fd4ddd..1786c24212d4 100644
>>> --- a/arch/arm/kvm/mmu.c
>>> +++ b/arch/arm/kvm/mmu.c
>>> @@ -292,8 +292,13 @@ static void unmap_stage2_range(struct kvm *kvm, 
>>> phys_addr_t start, u64 size)
>>> phys_addr_t addr = start, end = start + size;
>>> phys_addr_t next;
>>>  
>>> +   BUG_ON(!spin_is_locked(&kvm->mmu_lock));

Nit: assert_spin_locked() is somewhat more pleasant (and currently looks
to expand to the exact same code).
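
i.e., as a drop-in for the line above:

	assert_spin_locked(&kvm->mmu_lock);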

Robin.

>>> +
>>> pgd = kvm->arch.pgd + stage2_pgd_index(addr);
>>> do {
>>> +   if (need_resched() || spin_needbreak(&kvm->mmu_lock))
>>> +   cond_resched_lock(&kvm->mmu_lock);
>>> +
>>> next = stage2_pgd_addr_end(addr, end);
>>> if (!stage2_pgd_none(*pgd))
>>> unmap_stage2_puds(kvm, pgd, addr, next);
>>>
>>> The additional BUG_ON() is just for my own peace of mind - we seem to
>>> have missed a couple of these lately, and the "breathing" code makes
>>> it imperative that this lock is being taken prior to entering the
>>> function.
>>>
>>
>> Looks good to me!
> 
> OK. I'll stash that on top of Suzuki's series, and start running some
> actual tests... ;-)
> 
> Thanks,
> 
>   M.
> 

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 1/1] arm64: Correcting format specifier for printing 64 bit addresses

2016-12-06 Thread Robin Murphy
On 05/12/16 08:09, Maninder Singh wrote:
> This patch corrects format specifier for printing 64 bit addresses.
> 
> Signed-off-by: Maninder Singh 
> Signed-off-by: Vaneet Narang 
> ---
>  arch/arm64/kernel/signal.c |  2 +-
>  arch/arm64/kvm/sys_regs.c  |  8 ++--
>  arch/arm64/mm/fault.c  | 15 ++-
>  arch/arm64/mm/mmu.c|  4 ++--
>  4 files changed, 19 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
> index c7b6de6..c89d5fd 100644
> --- a/arch/arm64/kernel/signal.c
> +++ b/arch/arm64/kernel/signal.c
> @@ -155,7 +155,7 @@ asmlinkage long sys_rt_sigreturn(struct pt_regs *regs)
>  
>  badframe:
>   if (show_unhandled_signals)
> - pr_info_ratelimited("%s[%d]: bad frame in %s: pc=%08llx 
> sp=%08llx\n",
> + pr_info_ratelimited("%s[%d]: bad frame in %s: pc=%016llx 
> sp=%016llx\n",
>   current->comm, task_pid_nr(current), 
> __func__,
>   regs->pc, regs->sp);
>   force_sig(SIGSEGV, current);
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 87e7e66..89bf5c1 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -1554,8 +1554,12 @@ static void unhandled_cp_access(struct kvm_vcpu *vcpu,
>   WARN_ON(1);
>   }
>  
> - kvm_err("Unsupported guest CP%d access at: %08lx\n",
> - cp, *vcpu_pc(vcpu));
> + if (params->is_32bit)
> + kvm_err("Unsupported guest CP%d access at: %08lx\n",
> + cp, *vcpu_pc(vcpu));
> + else
> + kvm_err("Unsupported guest CP%d access at: %016lx\n",
> + cp, *vcpu_pc(vcpu));

As with the other patch - use '%0*lx' in these cases rather than
pointlessly duplicating everything.

Robin.

>   print_sys_reg_instr(params);
>   kvm_inject_undefined(vcpu);
>  }
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index a78a5c4..d96a42a 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -77,7 +77,7 @@ void show_pte(struct mm_struct *mm, unsigned long addr)
>  
>   pr_alert("pgd = %p\n", mm->pgd);
>   pgd = pgd_offset(mm, addr);
> - pr_alert("[%08lx] *pgd=%016llx", addr, pgd_val(*pgd));
> + pr_alert("[%016lx] *pgd=%016llx", addr, pgd_val(*pgd));
>  
>   do {
>   pud_t *pud;
> @@ -177,7 +177,7 @@ static void __do_kernel_fault(struct mm_struct *mm, 
> unsigned long addr,
>* No handler, we'll have to terminate things with extreme prejudice.
>*/
>   bust_spinlocks(1);
> - pr_alert("Unable to handle kernel %s at virtual address %08lx\n",
> + pr_alert("Unable to handle kernel %s at virtual address %016lx\n",
>(addr < PAGE_SIZE) ? "NULL pointer dereference" :
>"paging request", addr);
>  
> @@ -198,9 +198,14 @@ static void __do_user_fault(struct task_struct *tsk, 
> unsigned long addr,
>   struct siginfo si;
>  
>   if (unhandled_signal(tsk, sig) && show_unhandled_signals_ratelimited()) 
> {
> - pr_info("%s[%d]: unhandled %s (%d) at 0x%08lx, esr 0x%03x\n",
> - tsk->comm, task_pid_nr(tsk), fault_name(esr), sig,
> - addr, esr);
> + if (compat_user_mode(regs))
> + pr_info("%s[%d]: unhandled %s (%d) at 0x%08lx, esr 
> 0x%03x\n",
> + tsk->comm, task_pid_nr(tsk), fault_name(esr), 
> sig,
> + addr, esr);
> + else
> + pr_info("%s[%d]: unhandled %s (%d) at 0x%016lx, esr 
> 0x%03x\n",
> + tsk->comm, task_pid_nr(tsk), fault_name(esr), 
> sig,
> + addr, esr);
>   show_pte(tsk->mm, addr);
>   show_regs(regs);
>   }
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 17243e4..cbf444c 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -683,9 +683,9 @@ void __init early_fixmap_init(void)
>   pr_warn("pmd %p != %p, %p\n",
>   pmd, fixmap_pmd(fix_to_virt(FIX_BTMAP_BEGIN)),
>   fixmap_pmd(fix_to_virt(FIX_BTMAP_END)));
> - pr_warn("fix_to_virt(FIX_BTMAP_BEGIN): %08lx\n",
> + pr_warn("fix_to_virt(FIX_BTMAP_BEGIN): %016lx\n",
>   fix_to_virt(FIX_BTMAP_BEGIN));
> - pr_warn("fix_to_virt(FIX_BTMAP_END):   %08lx\n",
> + pr_warn("fix_to_virt(FIX_BTMAP_END):   %016lx\n",
>   fix_to_virt(FIX_BTMAP_END));
>  
>   pr_warn("FIX_BTMAP_END:   %d\n", FIX_BTMAP_END);
> 

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 1/1] arm64: Correcting format specifier for printing 64 bit addresses

2016-12-06 Thread Robin Murphy
On 06/12/16 16:11, Christoffer Dall wrote:
> On Mon, Dec 05, 2016 at 01:39:53PM +0530, Maninder Singh wrote:
>> This patch corrects format specifier for printing 64 bit addresses.
>>
>> Signed-off-by: Maninder Singh 
>> Signed-off-by: Vaneet Narang 
>> ---
>>  arch/arm64/kernel/signal.c |  2 +-
>>  arch/arm64/kvm/sys_regs.c  |  8 ++--
>>  arch/arm64/mm/fault.c  | 15 ++-
>>  arch/arm64/mm/mmu.c|  4 ++--
>>  4 files changed, 19 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
>> index c7b6de6..c89d5fd 100644
>> --- a/arch/arm64/kernel/signal.c
>> +++ b/arch/arm64/kernel/signal.c
>> @@ -155,7 +155,7 @@ asmlinkage long sys_rt_sigreturn(struct pt_regs *regs)
>>  
>>  badframe:
>>  if (show_unhandled_signals)
>> -pr_info_ratelimited("%s[%d]: bad frame in %s: pc=%08llx 
>> sp=%08llx\n",
>> +pr_info_ratelimited("%s[%d]: bad frame in %s: pc=%016llx 
>> sp=%016llx\n",
>>  current->comm, task_pid_nr(current), 
>> __func__,
>>  regs->pc, regs->sp);
>>  force_sig(SIGSEGV, current);
>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>> index 87e7e66..89bf5c1 100644
>> --- a/arch/arm64/kvm/sys_regs.c
>> +++ b/arch/arm64/kvm/sys_regs.c
>> @@ -1554,8 +1554,12 @@ static void unhandled_cp_access(struct kvm_vcpu *vcpu,
>>  WARN_ON(1);
>>  }
>>  
>> -kvm_err("Unsupported guest CP%d access at: %08lx\n",
>> -cp, *vcpu_pc(vcpu));
>> +if (params->is_32bit)
>> +kvm_err("Unsupported guest CP%d access at: %08lx\n",
>> +cp, *vcpu_pc(vcpu));
>> +else
>> +kvm_err("Unsupported guest CP%d access at: %016lx\n",
>> +cp, *vcpu_pc(vcpu));
> 
> It feels a bit much to me to have an if-statement to differentiate the
> number of leading zeros, so if it's important to always have fixed
> widths then I would just use %016lx in both cases.

Actually, it looks like vsnprintf does support the '*' field width
specifier, so even if the format _is_ critical there's still no reason
to have such duplicated code.
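
For example (sketch only, reusing the is_32bit flag from the quoted hunk):

	kvm_err("Unsupported guest CP%d access at: %0*lx\n",
		cp, params->is_32bit ? 8 : 16, *vcpu_pc(vcpu));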

Robin.

>>  print_sys_reg_instr(params);
>>  kvm_inject_undefined(vcpu);
>>  }
>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index a78a5c4..d96a42a 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -77,7 +77,7 @@ void show_pte(struct mm_struct *mm, unsigned long addr)
>>  
>>  pr_alert("pgd = %p\n", mm->pgd);
>>  pgd = pgd_offset(mm, addr);
>> -pr_alert("[%08lx] *pgd=%016llx", addr, pgd_val(*pgd));
>> +pr_alert("[%016lx] *pgd=%016llx", addr, pgd_val(*pgd));
>>  
>>  do {
>>  pud_t *pud;
>> @@ -177,7 +177,7 @@ static void __do_kernel_fault(struct mm_struct *mm, 
>> unsigned long addr,
>>   * No handler, we'll have to terminate things with extreme prejudice.
>>   */
>>  bust_spinlocks(1);
>> -pr_alert("Unable to handle kernel %s at virtual address %08lx\n",
>> +pr_alert("Unable to handle kernel %s at virtual address %016lx\n",
>>   (addr < PAGE_SIZE) ? "NULL pointer dereference" :
>>   "paging request", addr);
>>  
>> @@ -198,9 +198,14 @@ static void __do_user_fault(struct task_struct *tsk, 
>> unsigned long addr,
>>  struct siginfo si;
>>  
>>  if (unhandled_signal(tsk, sig) && show_unhandled_signals_ratelimited()) 
>> {
>> -pr_info("%s[%d]: unhandled %s (%d) at 0x%08lx, esr 0x%03x\n",
>> -tsk->comm, task_pid_nr(tsk), fault_name(esr), sig,
>> -addr, esr);
>> +if (compat_user_mode(regs))
>> +pr_info("%s[%d]: unhandled %s (%d) at 0x%08lx, esr 
>> 0x%03x\n",
>> +tsk->comm, task_pid_nr(tsk), fault_name(esr), 
>> sig,
>> +addr, esr);
>> +else
>> +pr_info("%s[%d]: unhandled %s (%d) at 0x%016lx, esr 
>> 0x%03x\n",
>> +tsk->comm, task_pid_nr(tsk), fault_name(esr), 
>> sig,
>> +addr, esr);
> 
> same here.
> 
> Thanks,
> -Christoffer
> 
> ___
> linux-arm-kernel mailing list
> linux-arm-ker...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v2] kvm: arm: Enforce some NS-SVC initialisation

2016-08-16 Thread Robin Murphy
Since the non-secure copies of banked registers lack architecturally
defined reset values, there is no actual guarantee when entering in Hyp
from secure-only firmware that the Non-Secure PL1 state will look the
way that kernel entry (in particular the decompressor stub) expects.
So far, we've been getting away with it thanks to implementation details
of ARMv7 cores and/or bootloader behaviour, but for the sake of forwards
compatibility let's try to ensure that we have a minimally sane state
before dropping down into it.

Signed-off-by: Robin Murphy <robin.mur...@arm.com>
---

v2: Initialise SED/ITD to safe values as well.

 arch/arm/kernel/hyp-stub.S | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/arm/kernel/hyp-stub.S b/arch/arm/kernel/hyp-stub.S
index 0b1e4a93d67e..15d073ae5da2 100644
--- a/arch/arm/kernel/hyp-stub.S
+++ b/arch/arm/kernel/hyp-stub.S
@@ -142,6 +142,19 @@ ARM_BE8(orr	r7, r7, #(1 << 25)) @ HSCTLR.EE
and r7, #0x1f   @ Preserve HPMN
mcr p15, 4, r7, c1, c1, 1   @ HDCR
 
+   @ Make sure NS-SVC is initialised appropriately
+   mrc p15, 0, r7, c1, c0, 0   @ SCTLR
+   orr r7, #(1 << 5)   @ CP15 barriers enabled
+   bic r7, #(3 << 7)   @ Clear SED/ITD for v8 (RES0 for v7)
+   bic r7, #(3 << 19)  @ WXN and UWXN disabled
+   mcr p15, 0, r7, c1, c0, 0   @ SCTLR
+
+   mrc p15, 0, r7, c0, c0, 0   @ MIDR
+   mcr p15, 4, r7, c0, c0, 0   @ VPIDR
+
+   mrc p15, 0, r7, c0, c0, 5   @ MPIDR
+   mcr p15, 4, r7, c0, c0, 5   @ VMPIDR
+
 #if !defined(ZIMAGE) && defined(CONFIG_ARM_ARCH_TIMER)
@ make CNTP_* and CNTPCT accessible from PL1
mrc p15, 0, r7, c0, c1, 1   @ ID_PFR1
-- 
2.8.1.dirty

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH] kvm: arm: Enforce some NS-SVC initialisation

2016-08-16 Thread Robin Murphy
Hi Marc,

On 16/08/16 14:33, Marc Zyngier wrote:
> On 21/07/16 13:01, Robin Murphy wrote:
>> Since the non-secure copies of banked registers lack architecturally
>> defined reset values, there is no actual guarantee when entering in Hyp
>> from secure-only firmware that the non-secure PL1 state will look the
>> way that kernel entry (in particular the decompressor stub) expects.
>> So far, we've been getting away with it thanks to implementation details
>> of ARMv7 cores and/or bootloader behaviour, but for the sake of forwards
>> compatibility let's try to ensure that we have a minimally sane state
>> before dropping down into it.
>>
>> Signed-off-by: Robin Murphy <robin.mur...@arm.com>
>> ---
>>  arch/arm/kernel/hyp-stub.S | 12 
>>  1 file changed, 12 insertions(+)
>>
>> diff --git a/arch/arm/kernel/hyp-stub.S b/arch/arm/kernel/hyp-stub.S
>> index 0b1e4a93d67e..7de3fe15ab21 100644
>> --- a/arch/arm/kernel/hyp-stub.S
>> +++ b/arch/arm/kernel/hyp-stub.S
>> @@ -142,6 +142,18 @@ ARM_BE8(orr r7, r7, #(1 << 25)) @ HSCTLR.EE
>>  and r7, #0x1f   @ Preserve HPMN
>>  mcr p15, 4, r7, c1, c1, 1   @ HDCR
>>  
>> +@ Make sure NS-SVC is initialised appropriately
>> +mrc p15, 0, r7, c1, c0, 0   @ SCTLR
>> +orr r7, #(1 << 5)   @ CP15 barriers enabled
>> +bic r7, #(3 << 19)  @ WXN and UWXN disabled
> 
> I think that while you're doing this, you also may want to clear SED and
> ITD so that a BE kernel has a chance to survive its first instruction
> (assuming it it uses the decompressor...).

Good point; I wrote this from the v7 perspective and neglected those,
and I think I was actually trying to achieve something useful at the
time which precluded cracking out the big-endian Thumb-2 kernel ;)

From a quick correlation between ARM ARMs, those bits should be reliably
safe to unconditionally clear on v7VE, so let's do it. I'll respin shortly.

>> +mcr p15, 0, r7, c1, c0, 0   @ SCTLR
>> +
>> +mrc p15, 0, r7, c0, c0, 0   @ MIDR
>> +mcr p15, 4, r7, c0, c0, 0   @ VPIDR
>> +
>> +mrc p15, 0, r7, c0, c0, 5   @ MPIDR
>> +mcr p15, 4, r7, c0, c0, 5   @ VMPIDR
>> +
>>  #if !defined(ZIMAGE) && defined(CONFIG_ARM_ARCH_TIMER)
>>  @ make CNTP_* and CNTPCT accessible from PL1
>>  mrc p15, 0, r7, c0, c1, 1   @ ID_PFR1
>>
> 
> Otherwise looks good.

Cheers,
Robin.

> 
> Thanks,
> 
>   M.
> 

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH] arm64: KVM: Turn kvm_ksym_ref into a NOP on VHE

2016-03-19 Thread Robin Murphy

Hi Marc,

On 18/03/16 17:25, Marc Zyngier wrote:

When running with VHE, there is no need to translate kernel pointers
to the EL2 memory space, since we're already there (and we have a much
saner memory map to start with).

Unfortunately, kvm_ksym_ref is getting in the way, and the first
call into the "hypervisor" section is going to end up in fireworks,
since we're now branching into nowhereland. Meh.

A potential solution is to test if VHE is engaged or not, and only
perform the translation in the negative case. With this in place,
VHE is able to run again.

Signed-off-by: Marc Zyngier 
---
  arch/arm64/include/asm/kvm_asm.h | 8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 226f49d..282f907 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -26,7 +26,13 @@
  #define KVM_ARM64_DEBUG_DIRTY_SHIFT   0
  #define KVM_ARM64_DEBUG_DIRTY (1 << KVM_ARM64_DEBUG_DIRTY_SHIFT)

-#define kvm_ksym_ref(sym)  phys_to_virt((u64)&sym - kimage_voffset)
+#define kvm_ksym_ref(sym)  \
+   ({  \
+   void *val = &sym;\
+   if (!is_kernel_in_hyp_mode())   \
+   val = phys_to_virt((u64)&sym - kimage_voffset); \


Is it definitely OK to evaluate sym twice here?
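
If that's a concern, a single-evaluation variant is a fairly mechanical
tweak (sketch only, same semantics assumed):

	#define kvm_ksym_ref(sym)					\
	({								\
		void *val = &sym;					\
		if (!is_kernel_in_hyp_mode())				\
			val = phys_to_virt((u64)val - kimage_voffset);	\
		val;							\
	})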

Robin.


+   val;\
+})

  #ifndef __ASSEMBLY__
  struct kvm;



___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC v3 05/15] iommu/arm-smmu: implement alloc/free_reserved_iova_domain

2016-02-18 Thread Robin Murphy

Hi Eric,

On 12/02/16 08:13, Eric Auger wrote:

Implement alloc/free_reserved_iova_domain for arm-smmu. we use
the iova allocator (iova.c). The iova_domain is attached to the
arm_smmu_domain struct. A mutex is introduced to protect it.


The IOMMU API currently leaves IOVA management entirely up to the caller 
- VFIO is already managing its own IOVA space, so what warrants this 
being pushed all the way down to the IOMMU driver? All I see here is 
abstract code with no hardware-specific details that'll have to be 
copy-pasted into other IOMMU drivers (e.g. SMMUv3), which strongly 
suggests it's the wrong place to do it.


As I understand the problem, VFIO has a generic "configure an IOMMU to 
point at an MSI doorbell" step to do in the process of attaching a 
device, which hasn't needed implementing yet due to VT-d's 
IOMMU_CAP_I_AM_ALSO_ACTUALLY_THE_MSI_CONTROLLER_IN_DISGUISE flag, which 
most of us have managed to misinterpret so far. AFAICS all the IOMMU 
driver should need to know about this is an iommu_map() call (which will 
want a slight extension[1] to make things behave properly). We should be 
fixing the abstraction to be less x86-centric, not hacking up all the 
ARM drivers to emulate x86 hardware behaviour in software.
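
In other words, something of this shape on the VFIO side would suffice
(purely illustrative - the function name is made up and the IOMMU_MMIO
prot flag is the proposed extension from [1], not existing API):

	/* Map the MSI doorbell at a caller-chosen IOVA, like any other page -
	 * no IOMMU-driver-private IOVA allocator required. */
	static int map_msi_doorbell(struct iommu_domain *domain,
				    unsigned long iova, phys_addr_t doorbell_pa)
	{
		return iommu_map(domain, iova, doorbell_pa, PAGE_SIZE,
				 IOMMU_WRITE | IOMMU_MMIO);
	}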


Robin.

[1]:http://article.gmane.org/gmane.linux.kernel.cross-arch/30833


Signed-off-by: Eric Auger 

---
v2 -> v3:
- select IOMMU_IOVA when ARM_SMMU or ARM_SMMU_V3 is set

v1 -> v2:
- formerly implemented in vfio_iommu_type1
---
  drivers/iommu/Kconfig|  2 ++
  drivers/iommu/arm-smmu.c | 87 +++-
  2 files changed, 74 insertions(+), 15 deletions(-)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index a1e75cb..1106528 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -289,6 +289,7 @@ config ARM_SMMU
bool "ARM Ltd. System MMU (SMMU) Support"
depends on (ARM64 || ARM) && MMU
select IOMMU_API
+   select IOMMU_IOVA
select IOMMU_IO_PGTABLE_LPAE
select ARM_DMA_USE_IOMMU if ARM
help
@@ -302,6 +303,7 @@ config ARM_SMMU_V3
bool "ARM Ltd. System MMU Version 3 (SMMUv3) Support"
depends on ARM64 && PCI
select IOMMU_API
+   select IOMMU_IOVA
select IOMMU_IO_PGTABLE_LPAE
select GENERIC_MSI_IRQ_DOMAIN
help
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index c8b7e71..f42341d 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -42,6 +42,7 @@
  #include 
  #include 
  #include 
+#include 

  #include 

@@ -347,6 +348,9 @@ struct arm_smmu_domain {
enum arm_smmu_domain_stage  stage;
	struct mutex		init_mutex; /* Protects smmu pointer */
struct iommu_domain domain;
+   struct iova_domain  *reserved_iova_domain;
+   /* protects reserved domain manipulation */
+   struct mutex		reserved_mutex;
  };

  static struct iommu_ops arm_smmu_ops;
@@ -975,6 +979,7 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned 
type)
return NULL;

	mutex_init(&smmu_domain->init_mutex);
+   mutex_init(&smmu_domain->reserved_mutex);
	spin_lock_init(&smmu_domain->pgtbl_lock);

	return &smmu_domain->domain;
@@ -1446,22 +1451,74 @@ out_unlock:
return ret;
  }

+static int arm_smmu_alloc_reserved_iova_domain(struct iommu_domain *domain,
+  dma_addr_t iova, size_t size,
+  unsigned long order)
+{
+   unsigned long granule, mask;
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   int ret = 0;
+
+   granule = 1UL << order;
+   mask = granule - 1;
+   if (iova & mask || (!size) || (size & mask))
+   return -EINVAL;
+
+   if (smmu_domain->reserved_iova_domain)
+   return -EEXIST;
+
+   mutex_lock(&smmu_domain->reserved_mutex);
+
+   smmu_domain->reserved_iova_domain =
+   kzalloc(sizeof(struct iova_domain), GFP_KERNEL);
+   if (!smmu_domain->reserved_iova_domain) {
+   ret = -ENOMEM;
+   goto unlock;
+   }
+
+   init_iova_domain(smmu_domain->reserved_iova_domain,
+granule, iova >> order, (iova + size - 1) >> order);
+
+unlock:
+   mutex_unlock(&smmu_domain->reserved_mutex);
+   return ret;
+}
+
+static void arm_smmu_free_reserved_iova_domain(struct iommu_domain *domain)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct iova_domain *iovad = smmu_domain->reserved_iova_domain;
+
+   if (!iovad)
+   return;
+
+   mutex_lock(&smmu_domain->reserved_mutex);
+
+   put_iova_domain(iovad);
+   kfree(iovad);
+
+   mutex_unlock(&smmu_domain->reserved_mutex);
+}
+
  static struct iommu_ops arm_smmu_ops = {
-   .capable= arm_smmu_capable,
-   

Re: [PATCH 1/2] arm64: KVM: Fix AArch32 to AArch64 register mapping

2015-11-17 Thread Robin Murphy

Hi Marc,

On 16/11/15 10:28, Marc Zyngier wrote:

When running a 32bit guest under a 64bit hypervisor, the ARMv8
architecture defines a mapping of the 32bit registers in the 64bit
space. This includes banked registers that are being demultiplexed
over the 64bit ones.

On exception caused by an operation involving a 32bit register, the
HW exposes the register number in the ESR_EL2 register. It was so
far understood that SW had to compute which register was AArch64
register was used (based on the current AArch32 mode and register
number).

It turns out that I misinterpreted the ARM ARM, and the clue is in
D1.20.1: "For some exceptions, the exception syndrome given in the
ESR_ELx identifies one or more register numbers from the issued
instruction that generated the exception. Where the exception is
taken from an Exception level using AArch32 these register numbers
give the AArch64 view of the register."

Which means that the HW is already giving us the translated version,
and that we shouldn't try to interpret it at all (for example, doing
an MMIO operation from the IRQ mode using the LR register leads to
very unexpected behaviours).

The fix is thus not to perform a call to vcpu_reg32() at all from
vcpu_reg(), and use whatever register number is supplied directly.
The only case we need to find out about the mapping is when we
actively generate a register access, which only occurs when injecting
a fault in a guest.

Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
  arch/arm64/include/asm/kvm_emulate.h | 8 +---
  arch/arm64/kvm/inject_fault.c| 2 +-
  2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h 
b/arch/arm64/include/asm/kvm_emulate.h
index 17e92f0..3ca894e 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -99,11 +99,13 @@ static inline void vcpu_set_thumb(struct kvm_vcpu *vcpu)
*vcpu_cpsr(vcpu) |= COMPAT_PSR_T_BIT;
  }

+/*
+ * vcpu_reg should always be passed a register number coming from a
+ * read of ESR_EL2. Otherwise, it may give the wrong result on AArch32
+ * with banked registers.
+ */
  static inline unsigned long *vcpu_reg(const struct kvm_vcpu *vcpu, u8 reg_num)
  {
-   if (vcpu_mode_is_32bit(vcpu))
-   return vcpu_reg32(vcpu, reg_num);
-
	return (unsigned long *)&vcpu_gp_regs(vcpu)->regs.regs[reg_num];
  }

diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
index 85c5715..648112e 100644
--- a/arch/arm64/kvm/inject_fault.c
+++ b/arch/arm64/kvm/inject_fault.c
@@ -48,7 +48,7 @@ static void prepare_fault32(struct kvm_vcpu *vcpu, u32 mode, 
u32 vect_offset)

/* Note: These now point to the banked copies */
*vcpu_spsr(vcpu) = new_spsr_value;
-   *vcpu_reg(vcpu, 14) = *vcpu_pc(vcpu) + return_offset;
+   *vcpu_reg32(vcpu, 14) = *vcpu_pc(vcpu) + return_offset;


To the best of my knowledge after picking through all the uses of 
vcpu_reg, particularly in the shared 32-bit code, this does seem to be 
the only one which involves a potentially-banked register number that 
didn't originally come from an ESR read, and thus needs translation.


Reviewed-by: Robin Murphy <robin.mur...@arm.com>

(unfortunately I don't have an actual test-case as it was already a 
third-hand report when I started trying to look into it).


Thanks for picking this up,
Robin.



/* Branch to exception vector */
if (sctlr & (1 << 13))



___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm