Re: [RFC v16 1/9] iommu: Introduce attach/detach_pasid_table API

2021-12-08 Thread Jean-Philippe Brucker
On Wed, Dec 08, 2021 at 08:56:16AM -0400, Jason Gunthorpe wrote:
> From a progress perspective I would like to start with simple 'page
> tables in userspace', ie no PASID in this step.
> 
> 'page tables in userspace' means an iommufd ioctl to create an
> iommu_domain where the IOMMU HW is directly traversing a
> device-specific page table structure in user space memory. All the HW
> today implements this by using another iommu_domain to allow the IOMMU
> HW DMA access to user memory - ie nesting or multi-stage or whatever.
> 
> This would come along with some ioctls to invalidate the IOTLB.
> 
> I'm imagining this step as an iommu_group->op->create_user_domain()
> driver callback which will create a new kind of domain with
> domain-unique ops. Ie map/unmap related should all be NULL as those
> are impossible operations.
> 
> From there the usual struct device (ie RID) attach/detach stuff needs
> to take care of routing DMAs to this iommu_domain.
> 
> Step two would be to add the ability for an iommufd using driver to
> request that a RID is connected to an iommu_domain. This
> connection can be requested for any kind of iommu_domain, kernel owned
> or user owned.
> 
> I don't quite have an answer how exactly the SMMUv3 vs Intel
> difference in PASID routing should be resolved.

In SMMUv3 the user pgd is always stored in the PASID table (actually
called "context descriptor table" but I want to avoid confusion with the
VT-d "context table"). And to access the PASID table, the SMMUv3 first
translate its GPA into a PA using the stage-2 page table. For userspace to
pass individual pgds to the kernel, as opposed to passing whole PASID
tables, the host kernel needs to reserve GPA space and map it in stage-2,
so it can store the PASID table in there. Userspace manages GPA space.
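
For illustration, here is a rough sketch of the lookup order described above
for a nested (stage-1 + stage-2) configuration. It is not SMMUv3 driver code:
all types and helpers are invented, the CD table is single-level, and the CD
is simplified (real CDs are 64 bytes).

#include <stddef.h>
#include <stdint.h>

struct ste { uint64_t s1_ctx_ptr; uint64_t s2_ttb; };  /* stream table entry */
struct cd  { uint64_t ttb0; };                         /* context descriptor */

/* invented helpers standing in for the hardware walker */
struct ste *stream_table_lookup(uint32_t sid);                  /* PA space  */
uint64_t stage2_walk(uint64_t s2_ttb, uint64_t ipa);            /* IPA -> PA */
uint64_t stage1_walk_via_s2(uint64_t ttb0_ipa, uint64_t s2_ttb,
                            uint64_t iova);                     /* VA -> IPA */
void *fetch(uint64_t pa, size_t size);

uint64_t nested_translate(uint32_t sid, uint32_t pasid, uint64_t iova)
{
	struct ste *ste = stream_table_lookup(sid);
	/* the CD (PASID) table pointer in the STE is a guest PA (IPA)... */
	uint64_t cd_ipa = ste->s1_ctx_ptr + pasid * sizeof(struct cd);
	/* ...so fetching a descriptor from it goes through stage 2 */
	struct cd *cd = fetch(stage2_walk(ste->s2_ttb, cd_ipa), sizeof(*cd));
	/* the stage-1 tables also live in guest memory, so each step of
	 * the stage-1 walk is translated by stage 2 as well */
	uint64_t ipa = stage1_walk_via_s2(cd->ttb0, ste->s2_ttb, iova);
	return stage2_walk(ste->s2_ttb, ipa);
}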

This would be easy for a single pgd. In this case the PASID table has a
single entry and userspace could just pass one GPA page during
registration. However it isn't easily generalized to full PASID support,
because managing a multi-level PASID table will require runtime GPA
allocation, and that API is awkward. That's why we opted for "attach PASID
table" operation rather than "attach page table" (back then the choice was
easy since VT-d used the same concept).

So I think the simplest way to support nesting is still to have separate
modes of operations depending on the hardware.

Thanks,
Jean

> 
> to get answers I'm hoping to start building some sketch RFCs for these
> different things on iommufd, hopefully in January. I'm looking at user
> page tables, PASID, dirty tracking and userspace IO fault handling as
> the main features iommufd must tackle.
> 
> The purpose of the sketches would be to validate that the HW features
> we want to expose can work well with the choices the base is making.
> 
> Jason


Re: [RFC PATCH 4/5] iommu/arm-smmu-v3: Use pinned VMID for NESTED stage with BTM

2021-07-22 Thread Jean-Philippe Brucker
Hi Shameer,

On Wed, Jul 21, 2021 at 08:54:00AM +, Shameerali Kolothum Thodi wrote:
> > More generally I think this pinned VMID set conflicts with that of
> > stage-2-only domains (which is the default state until a guest attaches a
> > PASID table). Say you have one guest using DOMAIN_NESTED without PASID
> > table, just DMA to IPA using VMID 0x8000. Now another guest attaches a
> > PASID table and obtains the same VMID from KVM. The stage-2 translation
> > might use TLB entries from the other guest, no?  They'll both create
> > stage-2 TLB entries with {StreamWorld=NS-EL1, VMID=0x8000}
> 
> Now that we are trying to align the KVM VMID allocation algorithm with
> that of the ASID allocator [1], I attempted to use that for the SMMU pinned 
> VMID allocation. But the issue you have mentioned above is still valid. 
> 
> And as a solution, what I have tried now is to follow what pinned ASID does
> in SVA:
>  - Use an xarray for private VMIDs
>  - Get a pinned VMID from KVM for DOMAIN_NESTED with a PASID table
>  - If the new pinned VMID is in use by a private domain, then update the
>    private VMID (a VMID update to a live STE).
> 
> This seems to work, but I still need to run more tests with this though.
>
> > It's tempting to allocate all VMIDs through KVM instead, but that will
> > force a dependency on KVM to use VFIO_TYPE1_NESTING_IOMMU and might
> > break
> > existing users of that extension (though I'm not sure there are any).
> > Instead we might need to restrict the SMMU VMID bitmap to match the
> > private VMID set in KVM.
> 
> Another solution I have in mind is, make the new KVM VMID allocator common
> between SMMUv3 and KVM. This will help to avoid all the private and shared
> VMID splitting, also no need for live updates to STE VMID. One possible 
> drawback
> is fewer available KVM VMIDs, but with a 16-bit VMID space I am not sure
> how much that is a concern.

Yes, I think that works too. In practice there shouldn't be many VMIDs on
the SMMU side: the feature is only enabled when a user wants to assign
devices with nested translation (unlike ASIDs, where each device in the
system gets a private ASID by default).

Note that you still need to pin all VMIDs used by the SMMU, otherwise
you'll have to update the STE after rollover.

The problem we have with VFIO_TYPE1_NESTING_IOMMU might be solved by the
upcoming deprecation of VFIO_*_IOMMU [2]. We need a specific sequence from
userspace:
1. Attach VFIO group to KVM (KVM_DEV_VFIO_GROUP_ADD)
2. Create nesting IOMMU domain and attach the group to it
   (VFIO_GROUP_SET_CONTAINER, VFIO_SET_IOMMU becomes
    IOMMU_IOASID_ALLOC, VFIO_DEVICE_ATTACH_IOASID)
Currently QEMU does 2 then 1, which would cause the SMMU to allocate a
separate VMID. If we wanted to extend VFIO_TYPE1_NESTING_IOMMU with PASID
tables we'd need to mandate 1-2 and may break existing users. In the new
design we can require from the start that creating a nesting IOMMU
container through /dev/iommu *must* come with a KVM context, that way
we're sure to reuse the existing VMID.
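
To make the ordering concrete, a sketch of the userspace side (step 1 uses
the existing KVM VFIO pseudo-device; the /dev/iommu ioctl names in step 2 are
only placeholders from the proposal in [2], so that part stays in comments):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int attach_group_for_nesting(int vm_fd, int vfio_group_fd)
{
	struct kvm_create_device cd = { .type = KVM_DEV_TYPE_VFIO };
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr  = KVM_DEV_VFIO_GROUP_ADD,
		.addr  = (uint64_t)(uintptr_t)&vfio_group_fd,
	};

	/* 1. Attach the VFIO group to KVM first (KVM_DEV_VFIO_GROUP_ADD),
	 *    so the SMMU driver can find the VM and reuse its VMID. */
	if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0)
		return -1;
	if (ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &attr) < 0)
		return -1;

	/* 2. Only then create the nesting domain and attach the device:
	 *      ioctl(iommu_fd, IOMMU_IOASID_ALLOC, ...);
	 *      ioctl(device_fd, VFIO_DEVICE_ATTACH_IOASID, ...);
	 *    (placeholder names, see [2]) */
	return 0;
}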

Thanks,
Jean

[2] 
https://lore.kernel.org/linux-iommu/bn9pr11mb5433b1e4ae5b0480369f97178c...@bn9pr11mb5433.namprd11.prod.outlook.com/


Re: [RFC PATCH 0/5] KVM: arm64: Pass PSCI to userspace

2021-07-21 Thread Jean-Philippe Brucker
On Tue, Jun 08, 2021 at 05:48:01PM +0200, Jean-Philippe Brucker wrote:
> Allow userspace to request handling PSCI calls from guests. Our goal is
> to enable a vCPU hot-add solution for Arm where the VMM presents
> possible resources to the guest at boot, and controls which vCPUs can be
> brought up by allowing or denying PSCI CPU_ON calls.

Since it looks like vCPU hot-add will be implemented differently, I don't
intend to resend this series at the moment. But some of it could be
useful for other projects and to avoid the helpful review effort going to
waste, I fixed it up and will leave it on branch
https://jpbrucker.net/git/linux/log/?h=kvm/psci-to-userspace
It now only uses KVM_CAP_EXIT_HYPERCALL introduced in v5.14.

Thanks,
Jean


Re: [RFC PATCH 0/5] KVM: arm64: Pass PSCI to userspace

2021-07-21 Thread Jean-Philippe Brucker
On Mon, Jul 19, 2021 at 12:37:52PM -0700, Oliver Upton wrote:
> On Mon, Jul 19, 2021 at 11:02 AM Jean-Philippe Brucker
>  wrote:
> > We forward the whole PSCI function range, so it's either KVM or userspace.
> > If KVM manages PSCI and the guest calls an unimplemented function, that
> > returns directly to the guest without going to userspace.
> >
> > The concern is valid for any other range, though. If userspace enables the
> > HVC cap it receives function calls that at some point KVM might need to
> > handle itself. So we need some negotiation between user and KVM about the
> > specific HVC ranges that userspace can and will handle.
> 
> Are we going to use KVM_CAPs for every interesting HVC range that
> userspace may want to trap? I wonder if a more generic interface for
> hypercall filtering would have merit to handle the aforementioned
> cases, and whatever else a VMM will want to intercept down the line.
> 
> For example, x86 has the concept of 'MSR filtering', wherein userspace
> can specify a set of registers that it wants to intercept. Doing
> something similar for HVCs would avoid the need for a kernel change
> each time a VMM wishes to intercept a new hypercall.

Yes we could introduce a VM device group for this:
* User reads attribute KVM_ARM_VM_HVC_NR_SLOTS, which defines the number
  of available HVC ranges.
* User writes attribute KVM_ARM_VM_HVC_SET_RANGE with one range
  struct kvm_arm_hvc_range {
	__u32 slot;
  #define KVM_ARM_HVC_USER (1 << 0) /* Enable range. 0 disables it */
	__u16 flags;
	__u16 imm;
	__u32 fn_start;
	__u32 fn_end;
  };
* KVM forwards any HVC within this range to userspace.
* If one of the ranges is PSCI functions, disable KVM PSCI.
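
For illustration, userspace programming one slot could look like the fragment
below. None of these names exist in KVM today; they are only the attributes
proposed above, and the function-ID range is just an example (the 32-bit PSCI
range from SMCCC):

	struct kvm_arm_hvc_range range = {
		.slot     = 0,
		.flags    = KVM_ARM_HVC_USER,   /* enable forwarding */
		.imm      = 0,                  /* HVC #0 */
		.fn_start = 0x84000000,
		.fn_end   = 0x8400001f,
	};
	struct kvm_device_attr attr = {
		/* .group would name the (also hypothetical) HVC attribute group */
		.attr = KVM_ARM_VM_HVC_SET_RANGE,
		.addr = (uint64_t)&range,
	};

	/* VM-wide attribute, so the ioctl targets the VM fd */
	ioctl(vm_fd, KVM_SET_DEVICE_ATTR, &attr);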

Since it's more work for KVM to keep track of ranges, I didn't include it
in the RFC, and I'm going to leave it to the next person dealing with this
stuff :)

Thanks,
Jean


Re: [RFC PATCH 0/5] KVM: arm64: Pass PSCI to userspace

2021-07-19 Thread Jean-Philippe Brucker
Hi Alex,

I'm not planning to resend this work at the moment, because it looks like
vcpu hot-add will go a different way so I don't have a user. But I'll
probably address the feedback so far and park it on some branch, in case
anyone else needs it.

On Mon, Jul 19, 2021 at 04:29:18PM +0100, Alexandru Elisei wrote:
> 1. Why does forwarding PSCI calls to userspace depend on enabling forwarding
> for other HVC calls? As I understand from the patches, those handle distinct
> function IDs.

The HVC cap from patch 4 enables returning from the VCPU_RUN ioctl with
KVM_EXIT_HYPERCALL, for any HVC not handled by KVM. This one should
definitely be improved, either by letting userspace choose the ranges of
HVC it wants, or at least by reporting ranges reserved by KVM to
userspace.

The PSCI cap from patch 5 disables the in-kernel PSCI implementation. As a
result those HVCs are forwarded to userspace.

It was suggested that other users will want to handle HVC calls (SDEI for
example [1]), hence splitting into two capabilities rather than just the
PSCI cap. In v5.14 x86 added KVM_CAP_EXIT_HYPERCALL [2], which lets
userspace receive specific hypercalls. We could reuse that and have PSCI
be one bit of that capability's parameter.
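
Either way the userspace side would look roughly like this (a sketch only:
per patch 4, kvm_run.hypercall.args[] holds x0-x5, so the SMCCC function ID
is args[0]; writing the PSCI return value back to x0 with KVM_SET_ONE_REG is
omitted):

#include <linux/kvm.h>
#include <linux/psci.h>

static void handle_psci_exit(struct kvm_run *run)
{
	if (run->exit_reason != KVM_EXIT_HYPERCALL)
		return;

	switch (run->hypercall.args[0]) {	/* x0 = SMCCC function ID */
	case PSCI_0_2_FN_CPU_ON:
	case PSCI_0_2_FN64_CPU_ON:
		/* VMM policy: allow or deny bringing up the target vCPU */
		break;
	default:
		/* other PSCI/SMCCC functions handled or rejected here */
		break;
	}
}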

[1] 
https://lore.kernel.org/linux-arm-kernel/20170808164616.25949-12-james.mo...@arm.com/
[2] 
https://lore.kernel.org/kvm/90778988e1ee01926ff9cac447aacb745f954c8c.1623174621.git.ashish.ka...@amd.com/

> 2. HVC call forwarding to userspace also forwards PSCI functions which are 
> defined
> in ARM DEN 0022D, but not (yet) implemented by KVM. What happens if KVM's PSCI
> implementation gets support for one of those functions? How does userspace 
> know
> that now it also needs to enable PSCI call forwarding to be able to handle 
> that
> function?

We forward the whole PSCI function range, so it's either KVM or userspace.
If KVM manages PSCI and the guest calls an unimplemented function, that
returns directly to the guest without going to userspace.

The concern is valid for any other range, though. If userspace enables the
HVC cap it receives function calls that at some point KVM might need to
handle itself. So we need some negotiation between user and KVM about the
specific HVC ranges that userspace can and will handle.

> It looks to me like the boundary between the functions that are forwarded 
> when HVC
> call forwarding is enabled and the functions that are forwarded when PSCI call
> forwarding is enabled is based on what Linux v5.13 handles. Have you 
> considered
> choosing this boundary based on something less arbitrary, like the function 
> types
> specified in ARM DEN 0028C, table 2-1?

For PSCI I've used the range 0-0x1f as the boundary, which is reserved for
PSCI by SMCCC (table 6-4 in that document).

> 
> In my opinion, setting the MP state to HALTED looks like a sensible approach 
> to
> implementing PSCI_SUSPEND. I'll take a closer look at the patches after I get 
> a
> better understanding about what is going on.
> 
> On 6/8/21 4:48 PM, Jean-Philippe Brucker wrote:
> > Allow userspace to request handling PSCI calls from guests. Our goal is
> > to enable a vCPU hot-add solution for Arm where the VMM presents
> > possible resources to the guest at boot, and controls which vCPUs can be
> > brought up by allowing or denying PSCI CPU_ON calls. Passing HVC and
> > PSCI to userspace has been discussed on the list in the context of vCPU
> > hot-add [1,2] but it can also be useful for implementing other SMCCC and
> > vendor hypercalls [3,4,5].
> >
> > Patches 1-3 allow userspace to request WFI to be executed in KVM. That
> 
> I don't understand this. KVM, in kvm_vcpu_block(), does not execute a WFI.
> PSCI_SUSPEND is documented as being indistinguishable from a WFI from the
> guest's point of view, but its implementation is not architecturally defined.

Yes, that was an oversimplification on my part.

Thanks,
Jean


[RFC PATCH 4/5] KVM: arm64: Pass hypercalls to userspace

2021-06-08 Thread Jean-Philippe Brucker
Let userspace request to handle all hypercalls that aren't handled by
KVM, by setting the KVM_CAP_ARM_HVC_TO_USER capability.

With the help of another capability, this will allow userspace to handle
PSCI calls.

Suggested-by: James Morse 
Signed-off-by: Jean-Philippe Brucker 
---

Notes on this implementation:

* A similar mechanism was proposed for SDEI some time ago [1]. This RFC
  generalizes the idea to all hypercalls, since that was suggested on
  the list [2, 3].

* We're reusing kvm_run.hypercall. I copied x0-x5 into
  kvm_run.hypercall.args[] to help userspace but I'm tempted to remove
  this, because:
  - Most user handlers will need to write results back into the
registers (x0-x3 for SMCCC), so if we keep this shortcut we should
go all the way and synchronize them on return to kernel.
  - QEMU doesn't care about this shortcut, it pulls all vcpu regs before
handling the call.
  - SMCCC uses x0-x16 for parameters.
  x0 does contain the SMCCC function ID and may be useful for fast
  dispatch; we could keep that plus the immediate number.

* Should we add a flag in the kvm_run.hypercall telling whether this is
  HVC or SMC?  Can be added later in those bottom longmode and pad
  fields.

* On top of this we could share with userspace which HVC ranges are
  available and which ones are handled by KVM. That can actually be
  added independently, through a vCPU/VM device attribute (which doesn't
  consume a new ioctl):
  - userspace issues HAS_ATTR ioctl on the VM fd to query whether this
feature is available.
  - userspace queries the number N of HVC ranges using one GET_ATTR.
  - userspace passes an array of N ranges using another GET_ATTR.
The array is filled and returned by KVM.

* Untested for AArch32 guests.

[1] 
https://lore.kernel.org/linux-arm-kernel/20170808164616.25949-12-james.mo...@arm.com/
[2] 
https://lore.kernel.org/linux-arm-kernel/bf7e83f1-c58e-8d65-edd0-d08f27b8b...@arm.com/
[3] 
https://lore.kernel.org/linux-arm-kernel/f56cf420-affc-35f0-2355-801a924b8...@arm.com/
---
 Documentation/virt/kvm/api.rst| 17 +++--
 arch/arm64/include/asm/kvm_host.h |  1 +
 include/kvm/arm_psci.h|  4 
 include/uapi/linux/kvm.h  |  1 +
 arch/arm64/kvm/arm.c  |  5 +
 arch/arm64/kvm/hypercalls.c   | 28 +++-
 6 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index e4fe7fb60d5d..3d8c1661e7b2 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5228,8 +5228,12 @@ to the byte array.
__u32 pad;
} hypercall;
 
-Unused.  This was once used for 'hypercall to userspace'.  To implement
-such functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO (all except s390).
+On x86 this was once used for 'hypercall to userspace'.  To implement such
+functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO (all except s390).
+
+On arm64 it is used for hypercalls, when the KVM_CAP_ARM_HVC_TO_USER capability
+is enabled. 'nr' contains the HVC or SMC immediate. 'args' contains registers
+x0 - x5. The other parameters are unused.
 
 .. note:: KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO.
 
@@ -6894,3 +6898,12 @@ This capability is always enabled.
 This capability indicates that the KVM virtual PTP service is
 supported in the host. A VMM can check whether the service is
 available to the guest on migration.
+
+8.33 KVM_CAP_ARM_HVC_TO_USER
+----------------------------
+
+:Architecture: arm64
+
+This capability indicates that KVM can pass unhandled hypercalls to userspace,
+if the VMM enables it. Hypercalls are passed with KVM_EXIT_HYPERCALL in
+kvm_run::hypercall.
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 3ca732feb9a5..25554ce97045 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -123,6 +123,7 @@ struct kvm_arch {
 * supported.
 */
bool return_nisv_io_abort_to_user;
+   bool hvc_to_user;
 
/*
 * VM-wide PMU filter, implemented as a bitmap and big enough for
diff --git a/include/kvm/arm_psci.h b/include/kvm/arm_psci.h
index 5b58bd2fe088..d6b71a48fbb1 100644
--- a/include/kvm/arm_psci.h
+++ b/include/kvm/arm_psci.h
@@ -16,6 +16,10 @@
 
#define KVM_ARM_PSCI_LATEST	KVM_ARM_PSCI_1_0
 
+#define KVM_PSCI_FN_LAST   KVM_PSCI_FN(3)
+#define PSCI_0_2_FN_LAST   PSCI_0_2_FN(0x3f)
+#define PSCI_0_2_FN64_LAST PSCI_0_2_FN64(0x3f)
+
 /*
  * We need the KVM pointer independently from the vcpu as we can call
  * this from HYP, and need to apply kern_hyp_va on it...
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 06ba64c49737..aa831986a399 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1084,6 +1084,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_VM_COPY_ENC_CONTEXT_FROM 197
 #define KVM_CAP_PTP_KVM

[RFC PATCH 5/5] KVM: arm64: Pass PSCI calls to userspace

2021-06-08 Thread Jean-Philippe Brucker
Let userspace request to handle PSCI calls, by setting the new
KVM_CAP_ARM_PSCI_TO_USER capability.

SMCCC probe requires PSCI v1.x. If userspace only implements PSCI v0.2,
the guest won't query SMCCC support through PSCI and won't use the
spectre workarounds. We could hijack PSCI_VERSION and pretend to support
v1.0 if userspace does not, then handle all v1.0 calls ourselves
(including guessing the PSCI feature set implemented by the guest), but
that seems unnecessary. After all the API already allows userspace to
force a version lower than v1.0 using the firmware pseudo-registers.

The KVM_REG_ARM_PSCI_VERSION pseudo-register currently resets to either
v0.1 if userspace doesn't set KVM_ARM_VCPU_PSCI_0_2, or
KVM_ARM_PSCI_LATEST (1.0).

Suggested-by: James Morse 
Signed-off-by: Jean-Philippe Brucker 
---
 Documentation/virt/kvm/api.rst  | 14 ++
 Documentation/virt/kvm/arm/psci.rst |  1 +
 arch/arm64/include/asm/kvm_host.h   |  1 +
 include/kvm/arm_hypercalls.h|  1 +
 include/uapi/linux/kvm.h|  1 +
 arch/arm64/kvm/arm.c| 10 +++---
 arch/arm64/kvm/hypercalls.c |  2 +-
 arch/arm64/kvm/psci.c   | 13 +
 8 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 3d8c1661e7b2..f24eb70e575d 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6907,3 +6907,17 @@ available to the guest on migration.
 This capability indicates that KVM can pass unhandled hypercalls to userspace,
 if the VMM enables it. Hypercalls are passed with KVM_EXIT_HYPERCALL in
 kvm_run::hypercall.
+
+8.34 KVM_CAP_ARM_PSCI_TO_USER
+-----------------------------
+
+:Architectures: arm64
+
+When the VMM enables this capability, all PSCI calls are passed to userspace
+instead of being handled by KVM. Capability KVM_CAP_ARM_HVC_TO_USER must be
+enabled first.
+
+Userspace should support at least PSCI v1.0. Otherwise SMCCC features won't be
+available to the guest. Userspace does not need to handle the SMCCC_VERSION
+parameter for the PSCI_FEATURES function. The KVM_ARM_VCPU_PSCI_0_2 vCPU
+feature should be set even if this capability is enabled.
diff --git a/Documentation/virt/kvm/arm/psci.rst 
b/Documentation/virt/kvm/arm/psci.rst
index d52c2e83b5b8..110011d1fa3f 100644
--- a/Documentation/virt/kvm/arm/psci.rst
+++ b/Documentation/virt/kvm/arm/psci.rst
@@ -34,6 +34,7 @@ The following register is defined:
   - Allows any PSCI version implemented by KVM and compatible with
 v0.2 to be set with SET_ONE_REG
   - Affects the whole VM (even if the register view is per-vcpu)
+  - Defaults to PSCI 1.0 if userspace enables KVM_CAP_ARM_PSCI_TO_USER.
 
 * KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1:
 Holds the state of the firmware support to mitigate CVE-2017-5715, as
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 25554ce97045..5d74b769c16d 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -124,6 +124,7 @@ struct kvm_arch {
 */
bool return_nisv_io_abort_to_user;
bool hvc_to_user;
+   bool psci_to_user;
 
/*
 * VM-wide PMU filter, implemented as a bitmap and big enough for
diff --git a/include/kvm/arm_hypercalls.h b/include/kvm/arm_hypercalls.h
index 0e2509d27910..b66c6a000ef3 100644
--- a/include/kvm/arm_hypercalls.h
+++ b/include/kvm/arm_hypercalls.h
@@ -6,6 +6,7 @@
 
 #include 
 
+int kvm_hvc_user(struct kvm_vcpu *vcpu);
 int kvm_hvc_call_handler(struct kvm_vcpu *vcpu);
 
 static inline u32 smccc_get_function(struct kvm_vcpu *vcpu)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index aa831986a399..2b8e55aa7e1e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1085,6 +1085,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PTP_KVM 198
 #define KVM_CAP_ARM_MP_HALTED 199
 #define KVM_CAP_ARM_HVC_TO_USER 200
+#define KVM_CAP_ARM_PSCI_TO_USER 201
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 074197721e97..bc3e63b0b3ad 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -83,7 +83,7 @@ int kvm_arch_check_processor_compat(void *opaque)
 int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
struct kvm_enable_cap *cap)
 {
-   int r;
+   int r = -EINVAL;
 
if (cap->flags)
return -EINVAL;
@@ -97,8 +97,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
r = 0;
kvm->arch.hvc_to_user = true;
break;
-   default:
-   r = -EINVAL;
+   case KVM_CAP_ARM_PSCI_TO_USER:
+   if (kvm->arch.hvc_to_user) {
+   r = 0;
+   kvm->arch.psci_to_user = true;
+   }
break;
}
 
@@ -213,6 +216,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)

[RFC PATCH 1/5] KVM: arm64: Replace power_off with mp_state in struct kvm_vcpu_arch

2021-06-08 Thread Jean-Philippe Brucker
In order to add a new "suspend" power state, replace power_off with
mp_state in struct kvm_vcpu_arch. Factor the vcpu_off() function while
we're here.

No functional change intended.

Signed-off-by: Jean-Philippe Brucker 
---
 arch/arm64/include/asm/kvm_host.h |  6 --
 arch/arm64/kvm/arm.c  | 29 +++--
 arch/arm64/kvm/psci.c | 19 ++-
 3 files changed, 25 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 7cd7d5c8c4bc..55a04f4d5919 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -340,8 +340,8 @@ struct kvm_vcpu_arch {
u32 mdscr_el1;
} guest_debug_preserved;
 
-   /* vcpu power-off state */
-   bool power_off;
+   /* vcpu power state (runnable, stopped, halted) */
+   u32 mp_state;
 
/* Don't run the guest (internal implementation need) */
bool pause;
@@ -720,6 +720,8 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
   struct kvm_device_attr *attr);
 int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
   struct kvm_device_attr *attr);
+void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
+bool kvm_arm_vcpu_is_off(struct kvm_vcpu *vcpu);
 
 /* Guest/host FPSIMD coordination helpers */
 int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e720148232a0..bcc24adb9c0a 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -435,21 +435,22 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
vcpu->cpu = -1;
 }
 
-static void vcpu_power_off(struct kvm_vcpu *vcpu)
+void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu)
 {
-   vcpu->arch.power_off = true;
+   vcpu->arch.mp_state = KVM_MP_STATE_STOPPED;
kvm_make_request(KVM_REQ_SLEEP, vcpu);
kvm_vcpu_kick(vcpu);
 }
 
+bool kvm_arm_vcpu_is_off(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.mp_state == KVM_MP_STATE_STOPPED;
+}
+
 int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
struct kvm_mp_state *mp_state)
 {
-   if (vcpu->arch.power_off)
-   mp_state->mp_state = KVM_MP_STATE_STOPPED;
-   else
-   mp_state->mp_state = KVM_MP_STATE_RUNNABLE;
-
+   mp_state->mp_state = vcpu->arch.mp_state;
return 0;
 }
 
@@ -460,10 +461,10 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
 
switch (mp_state->mp_state) {
case KVM_MP_STATE_RUNNABLE:
-   vcpu->arch.power_off = false;
+   vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
break;
case KVM_MP_STATE_STOPPED:
-   vcpu_power_off(vcpu);
+   kvm_arm_vcpu_power_off(vcpu);
break;
default:
ret = -EINVAL;
@@ -483,7 +484,7 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
 {
bool irq_lines = *vcpu_hcr(v) & (HCR_VI | HCR_VF);
return ((irq_lines || kvm_vgic_vcpu_pending_irq(v))
-   && !v->arch.power_off && !v->arch.pause);
+   && !kvm_arm_vcpu_is_off(v) && !v->arch.pause);
 }
 
 bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
@@ -643,10 +644,10 @@ static void vcpu_req_sleep(struct kvm_vcpu *vcpu)
struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
 
rcuwait_wait_event(wait,
-  (!vcpu->arch.power_off) &&(!vcpu->arch.pause),
+  !kvm_arm_vcpu_is_off(vcpu) && !vcpu->arch.pause,
   TASK_INTERRUPTIBLE);
 
-   if (vcpu->arch.power_off || vcpu->arch.pause) {
+   if (kvm_arm_vcpu_is_off(vcpu) || vcpu->arch.pause) {
/* Awaken to handle a signal, request we sleep again later. */
kvm_make_request(KVM_REQ_SLEEP, vcpu);
}
@@ -1087,9 +1088,9 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu 
*vcpu,
 * Handle the "start in power-off" case.
 */
if (test_bit(KVM_ARM_VCPU_POWER_OFF, vcpu->arch.features))
-   vcpu_power_off(vcpu);
+   kvm_arm_vcpu_power_off(vcpu);
else
-   vcpu->arch.power_off = false;
+   vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
 
return 0;
 }
diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c
index db4056ecccfd..24b4a2265dbd 100644
--- a/arch/arm64/kvm/psci.c
+++ b/arch/arm64/kvm/psci.c
@@ -52,13 +52,6 @@ static unsigned long kvm_psci_vcpu_suspend(struct kvm_vcpu 
*vcpu)
return PSCI_RET_SUCCESS;
 }
 
-static void kvm_psci_vcpu_off(struct kvm_vcpu *vcpu)
-{
-   vcpu->arch.power_off = true;
-   kvm_make_request(KVM_REQ_SLEEP, vcpu);
-   kvm_vcpu_kick(vcpu);
-}
-
 static u

[RFC PATCH 3/5] KVM: arm64: Allow userspace to request WFI

2021-06-08 Thread Jean-Philippe Brucker
To help userspace implement PSCI CPU_SUSPEND, allow setting the "HALTED"
MP state to request a WFI before returning to the guest.

Userspace won't obtain a HALTED mp_state from a KVM_GET_MP_STATE call
unless they set it themselves. When set by KVM, to handle wfi or
CPU_SUSPEND, it is consumed before returning to userspace.
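
As an illustration of the intended use (a sketch, not part of this patch), a
VMM that handles PSCI CPU_SUSPEND itself could request the WFI with the
existing KVM_SET_MP_STATE ioctl:

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int vcpu_request_wfi(int vcpu_fd)
{
	/* only valid on arm64 when KVM_CAP_ARM_MP_HALTED is present */
	struct kvm_mp_state mp = { .mp_state = KVM_MP_STATE_HALTED };

	return ioctl(vcpu_fd, KVM_SET_MP_STATE, &mp);
}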

Signed-off-by: Jean-Philippe Brucker 
---
 Documentation/virt/kvm/api.rst | 15 +--
 include/uapi/linux/kvm.h   |  1 +
 arch/arm64/kvm/arm.c   | 11 ++-
 3 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 7fcb2fd38f42..e4fe7fb60d5d 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1416,8 +1416,8 @@ Possible values are:
  which has not yet received an INIT signal 
[x86]
KVM_MP_STATE_INIT_RECEIVEDthe vcpu has received an INIT signal, and is
  now ready for a SIPI [x86]
-   KVM_MP_STATE_HALTED   the vcpu has executed a HLT instruction and
- is waiting for an interrupt [x86]
+   KVM_MP_STATE_HALTED   the vcpu has executed a HLT/WFI instruction
+ and is waiting for an interrupt [x86,arm64]
KVM_MP_STATE_SIPI_RECEIVEDthe vcpu has just received a SIPI (vector
  accessible via KVM_GET_VCPU_EVENTS) [x86]
KVM_MP_STATE_STOPPED  the vcpu is stopped [s390,arm/arm64]
@@ -1435,8 +1435,9 @@ these architectures.
 For arm/arm64:
 ^^
 
-The only states that are valid are KVM_MP_STATE_STOPPED and
-KVM_MP_STATE_RUNNABLE which reflect if the vcpu is paused or not.
+Valid states are KVM_MP_STATE_STOPPED and KVM_MP_STATE_RUNNABLE which reflect
+if the vcpu is paused or not. If KVM_CAP_ARM_MP_HALTED is present, state
+KVM_MP_STATE_HALTED is also valid.
 
 4.39 KVM_SET_MP_STATE
 -
@@ -1457,8 +1458,10 @@ these architectures.
 For arm/arm64:
 ^^
 
-The only states that are valid are KVM_MP_STATE_STOPPED and
-KVM_MP_STATE_RUNNABLE which reflect if the vcpu should be paused or not.
+Valid states are KVM_MP_STATE_STOPPED and KVM_MP_STATE_RUNNABLE which reflect
+if the vcpu should be paused or not. If KVM_CAP_ARM_MP_HALTED is present,
+KVM_MP_STATE_HALTED can be set, to wait for interrupts targeted at the vcpu
+before running it.
 
 4.40 KVM_SET_IDENTITY_MAP_ADDR
 --
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 79d9c44d1ad7..06ba64c49737 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1083,6 +1083,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_SGX_ATTRIBUTE 196
 #define KVM_CAP_VM_COPY_ENC_CONTEXT_FROM 197
 #define KVM_CAP_PTP_KVM 198
+#define KVM_CAP_ARM_MP_HALTED 199
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index d8cbaa0373c7..d6ad977fea5f 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -207,6 +207,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_SET_GUEST_DEBUG:
case KVM_CAP_VCPU_ATTRIBUTES:
case KVM_CAP_PTP_KVM:
+   case KVM_CAP_ARM_MP_HALTED:
r = 1;
break;
case KVM_CAP_SET_GUEST_DEBUG2:
@@ -469,6 +470,9 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
case KVM_MP_STATE_RUNNABLE:
vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
break;
+   case KVM_MP_STATE_HALTED:
+   kvm_arm_vcpu_suspend(vcpu);
+   break;
case KVM_MP_STATE_STOPPED:
kvm_arm_vcpu_power_off(vcpu);
break;
@@ -699,7 +703,12 @@ static void check_vcpu_requests(struct kvm_vcpu *vcpu)
preempt_enable();
}
 
-   if (kvm_check_request(KVM_REQ_SUSPEND, vcpu)) {
+   /*
+* Check mp_state again in case userspace changed their mind
+* after requesting suspend.
+*/
+   if (kvm_check_request(KVM_REQ_SUSPEND, vcpu) &&
+   vcpu->arch.mp_state == KVM_MP_STATE_HALTED) {
if (!irq_pending) {
kvm_vcpu_block(vcpu);
kvm_clear_request(KVM_REQ_UNHALT, vcpu);
-- 
2.31.1



[RFC PATCH 0/5] KVM: arm64: Pass PSCI to userspace

2021-06-08 Thread Jean-Philippe Brucker
Allow userspace to request handling PSCI calls from guests. Our goal is
to enable a vCPU hot-add solution for Arm where the VMM presents
possible resources to the guest at boot, and controls which vCPUs can be
brought up by allowing or denying PSCI CPU_ON calls. Passing HVC and
PSCI to userspace has been discussed on the list in the context of vCPU
hot-add [1,2] but it can also be useful for implementing other SMCCC and
vendor hypercalls [3,4,5].

Patches 1-3 allow userspace to request WFI to be executed in KVM. That
way the VMM can easily implement the PSCI CPU_SUSPEND function, which is
mandatory from PSCI v0.2 onwards (even if it doesn't have a more useful
implementation than WFI, natively available to the guest).

Patch 4 lets userspace request any HVC that isn't handled by KVM, and
patch 5 lets userspace request PSCI calls, disabling in-kernel PSCI
handling.

I'm focusing on the PSCI bits, but a complete prototype of vCPU hot-add
for arm64 on Linux and QEMU, most of it from Salil and James, is
available at [6].

[1] https://lore.kernel.org/kvmarm/82879258-46a7-a6e9-ee54-fc3692c1c...@arm.com/
[2] 
https://lore.kernel.org/linux-arm-kernel/20200625133757.22332-1-salil.me...@huawei.com/
(Followed by KVM forum and Linaro Open discussions)
[3] 
https://lore.kernel.org/linux-arm-kernel/f56cf420-affc-35f0-2355-801a924b8...@arm.com/
[4] https://lore.kernel.org/kvm/bf7e83f1-c58e-8d65-edd0-d08f27b8b...@arm.com/
[5] 
https://lore.kernel.org/kvm/1569338454-26202-2-git-send-email-guoh...@huawei.com/
[6] https://jpbrucker.net/git/linux/log/?h=cpuhp/devel
https://jpbrucker.net/git/qemu/log/?h=cpuhp/devel

Jean-Philippe Brucker (5):
  KVM: arm64: Replace power_off with mp_state in struct kvm_vcpu_arch
  KVM: arm64: Move WFI execution to check_vcpu_requests()
  KVM: arm64: Allow userspace to request WFI
  KVM: arm64: Pass hypercalls to userspace
  KVM: arm64: Pass PSCI calls to userspace

 Documentation/virt/kvm/api.rst  | 46 +++
 Documentation/virt/kvm/arm/psci.rst |  1 +
 arch/arm64/include/asm/kvm_host.h   | 10 +++-
 include/kvm/arm_hypercalls.h|  1 +
 include/kvm/arm_psci.h  |  4 ++
 include/uapi/linux/kvm.h|  3 ++
 arch/arm64/kvm/arm.c| 71 +
 arch/arm64/kvm/handle_exit.c|  3 +-
 arch/arm64/kvm/hypercalls.c | 28 +++-
 arch/arm64/kvm/psci.c   | 69 ++--
 10 files changed, 170 insertions(+), 66 deletions(-)

-- 
2.31.1



[RFC PATCH 2/5] KVM: arm64: Move WFI execution to check_vcpu_requests()

2021-06-08 Thread Jean-Philippe Brucker
Prepare for WFI requests from userspace, by adding a suspend request and
moving the WFI execution into check_vcpu_requests(), next to the
power-off logic.

vcpu->arch.mp_state, previously only RUNNABLE or STOPPED, supports an
additional state HALTED and two new state transitions:

  RUNNABLE -> HALTEDfrom WFI or PSCI CPU_SUSPEND (same vCPU)
  HALTED -> RUNNABLEvGIC IRQ, pending timer, signal

There shouldn't be any functional change with this patch, even though
the KVM_GET_MP_STATE ioctl could now in theory return
KVM_MP_STATE_HALTED, which would break some users' mp_state support. In
practice it should not happen because we do not return to userspace with
HALTED state. Both WFI and PSCI CPU_SUSPEND stay in the vCPU run loop
until the suspend request is consumed. It does feel fragile though,
maybe we should explicitly return RUNNABLE in KVM_GET_MP_STATE in place
of HALTED, to prevent future breakage.

Signed-off-by: Jean-Philippe Brucker 
---
 arch/arm64/include/asm/kvm_host.h |  2 ++
 arch/arm64/kvm/arm.c  | 18 ++-
 arch/arm64/kvm/handle_exit.c  |  3 +--
 arch/arm64/kvm/psci.c | 37 +--
 4 files changed, 35 insertions(+), 25 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 55a04f4d5919..3ca732feb9a5 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -46,6 +46,7 @@
 #define KVM_REQ_VCPU_RESET KVM_ARCH_REQ(2)
 #define KVM_REQ_RECORD_STEAL   KVM_ARCH_REQ(3)
 #define KVM_REQ_RELOAD_GICv4   KVM_ARCH_REQ(4)
+#define KVM_REQ_SUSPENDKVM_ARCH_REQ(5)
 
 #define KVM_DIRTY_LOG_MANUAL_CAPS   (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
 KVM_DIRTY_LOG_INITIALLY_SET)
@@ -722,6 +723,7 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
   struct kvm_device_attr *attr);
 void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
 bool kvm_arm_vcpu_is_off(struct kvm_vcpu *vcpu);
+void kvm_arm_vcpu_suspend(struct kvm_vcpu *vcpu);
 
 /* Guest/host FPSIMD coordination helpers */
 int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index bcc24adb9c0a..d8cbaa0373c7 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -447,6 +447,12 @@ bool kvm_arm_vcpu_is_off(struct kvm_vcpu *vcpu)
return vcpu->arch.mp_state == KVM_MP_STATE_STOPPED;
 }
 
+void kvm_arm_vcpu_suspend(struct kvm_vcpu *vcpu)
+{
+   vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
+   kvm_make_request(KVM_REQ_SUSPEND, vcpu);
+}
+
 int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
struct kvm_mp_state *mp_state)
 {
@@ -667,6 +673,8 @@ static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
 
 static void check_vcpu_requests(struct kvm_vcpu *vcpu)
 {
+   bool irq_pending;
+
if (kvm_request_pending(vcpu)) {
if (kvm_check_request(KVM_REQ_SLEEP, vcpu))
vcpu_req_sleep(vcpu);
@@ -678,7 +686,7 @@ static void check_vcpu_requests(struct kvm_vcpu *vcpu)
 * Clear IRQ_PENDING requests that were made to guarantee
 * that a VCPU sees new virtual interrupts.
 */
-   kvm_check_request(KVM_REQ_IRQ_PENDING, vcpu);
+   irq_pending = kvm_check_request(KVM_REQ_IRQ_PENDING, vcpu);
 
if (kvm_check_request(KVM_REQ_RECORD_STEAL, vcpu))
kvm_update_stolen_time(vcpu);
@@ -690,6 +698,14 @@ static void check_vcpu_requests(struct kvm_vcpu *vcpu)
vgic_v4_load(vcpu);
preempt_enable();
}
+
+   if (kvm_check_request(KVM_REQ_SUSPEND, vcpu)) {
+   if (!irq_pending) {
+   kvm_vcpu_block(vcpu);
+   kvm_clear_request(KVM_REQ_UNHALT, vcpu);
+   }
+   vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+   }
}
 }
 
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index 6f48336b1d86..9717df3104cf 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -95,8 +95,7 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
} else {
trace_kvm_wfx_arm64(*vcpu_pc(vcpu), false);
vcpu->stat.wfi_exit_stat++;
-   kvm_vcpu_block(vcpu);
-   kvm_clear_request(KVM_REQ_UNHALT, vcpu);
+   kvm_arm_vcpu_suspend(vcpu);
}
 
kvm_incr_pc(vcpu);
diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c
index 24b4a2265dbd..42a307ceb95f 100644
--- a/arch/arm64/kvm/psci.c
+++ b/arch/arm64/kvm/psci.c
@@ -31,27 +31,6 @@ static unsigned long psci_affinity_mask(unsigned long 
affinity_level)
return 0;
 }
 
-static unsigned long kvm_psci_vcpu_suspend(

Re: [RFC PATCH 0/3] kvm/arm: New VMID allocator based on asid(2nd approach)

2021-06-07 Thread Jean-Philippe Brucker
On Fri, Jun 04, 2021 at 04:27:39PM +0100, Marc Zyngier wrote:
> > > Plus, I've found this nugget:
> > > 
> > >  > >   max_pinned_vmids = NUM_USER_VMIDS - num_possible_cpus() - 2;
> > > 
> > > 
> > > What is this "- 2"? My hunch is that it should really be "- 1" as VMID
> > > 0 is reserved, and we have no equivalent of KPTI for S2.
> > 
> > I think this is more related to the "pinned vmid" stuff and was borrowed 
> > from
> > the asid_update_limit() fn in arch/arm64/mm/context.c. But I missed the
> > comment that explains the reason behind it. It says,
> > 
> > ---x---
> > /*
> >  * There must always be an ASID available after rollover. Ensure that,
> >  * even if all CPUs have a reserved ASID and the maximum number of ASIDs
> >  * are pinned, there still is at least one empty slot in the ASID map.
> >  */
> > max_pinned_asids = num_available_asids - num_possible_cpus() - 2;
> > ---x---
> > 
> > So this is to make sure we will have at least one VMID available
> > after rollover in case we have pinned the max number of VMIDs. I
> > will include that comment to make it clear.
> 
> That doesn't really explain the -2. Or is that that we have one for
> the extra empty slot, and one for the reserved?
> 
> Jean-Philippe?

Yes, -2 is for ASID#0 and the extra empty slot. A comment higher in
asids_update_limit() hints at that, but it could definitely be clearer:

/*
 * Expect allocation after rollover to fail if we don't have at least
 * one more ASID than CPUs. ASID #0 is reserved for init_mm.
 */
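
Purely as an illustration of the arithmetic (numbers made up), the limit
works out like this:

#include <stdio.h>

int main(void)
{
	unsigned int num_user_vmids = 1 << 16;	/* e.g. a 16-bit VMID space */
	unsigned int possible_cpus  = 256;	/* stand-in for num_possible_cpus() */

	/* one reserved ID per CPU across rollover, plus VMID/ASID #0,
	 * plus one slot that must stay empty so allocation after a
	 * rollover cannot fail */
	unsigned int max_pinned = num_user_vmids - possible_cpus - 2;

	printf("max pinned = %u\n", max_pinned);	/* 65278 */
	return 0;
}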

Thanks,
Jean


Re: [PATCH] KVM: arm64: Skip CMOs when updating a PTE pointing to non-memory

2021-04-27 Thread Jean-Philippe Brucker
On Tue, Apr 27, 2021 at 03:52:46PM +0100, Alexandru Elisei wrote:
> The comment [1] suggested that the panic is triggered during page aging.

I think only with an out-of-tree patch applied
https://jpbrucker.net/git/linux/commit/?h=sva/2021-03-01=d32d8baaf293aaefef8a1c9b8a4508ab2ec46c61
which probably is not going upstream.

Thanks,
Jean

> vfio_pci_mmap() sets the VM_PFNMAP for the VMA and I see in the Documentation 
> that
> pages with VM_PFNMAP are added to the unevictable LRU list, doesn't that mean 
> it's
> not subject the page aging? I feel like there's something I'm missing.
> 
> [1]
> https://lore.kernel.org/kvm/by5pr12mb37642b9ac7e5d907f5a664f6b3...@by5pr12mb3764.namprd12.prod.outlook.com/
> 
> Thanks,
> 
> Alex


Re: [PATCH v14 00/13] SMMUv3 Nested Stage Setup (IOMMU part)

2021-04-23 Thread Jean-Philippe Brucker
Hi Sumit,

On Thu, Apr 22, 2021 at 08:34:38PM +0530, Sumit Gupta wrote:
> Had to revert patch "mm: notify remote TLBs when dirtying a PTE".

Did that patch cause any issue, or is it just not needed on your system?
It fixes a hypothetical problem with the way ATS is implemented. Maybe I
actually observed it on an old software model; I don't remember. Either
way it's unlikely to go upstream but I'd like to know if I should drop it
from my tree.

Thanks,
Jean


Re: [PATCH v12 03/13] vfio: VFIO_IOMMU_SET_MSI_BINDING

2021-03-05 Thread Jean-Philippe Brucker
Hi,

On Tue, Feb 23, 2021 at 10:06:15PM +0100, Eric Auger wrote:
> This patch adds the VFIO_IOMMU_SET_MSI_BINDING ioctl which aim
> to (un)register the guest MSI binding to the host. This latter
> then can use those stage 1 bindings to build a nested stage
> binding targeting the physical MSIs.

Now that RMR is in the IORT spec, could it be used for the nested MSI
problem?  For virtio-iommu tables I was planning to do it like this:

MSI is mapped at stage-2 with an arbitrary IPA->doorbell PA. We report
this IPA to userspace through iommu_groups/X/reserved_regions. No change
there. Then to the guest we report a reserved identity mapping at IPA
(using RMR, an equivalent DT binding, or probed RESV_MEM for
virtio-iommu). The guest creates that mapping at stage-1, and that's it.
Unless I overlooked something we'd only reuse existing infrastructure and
avoid the SET_MSI_BINDING interface.

Thanks,
Jean


Re: [RFC PATCH 4/5] iommu/arm-smmu-v3: Use pinned VMID for NESTED stage with BTM

2021-03-04 Thread Jean-Philippe Brucker
Hi Shameer,

On Mon, Feb 22, 2021 at 03:53:37PM +, Shameer Kolothum wrote:
> If the SMMU supports BTM and the device belongs to NESTED domain
> with shared pasid table, we need to use the VMID allocated by the
> KVM for the s2 configuration. Hence, request a pinned VMID from KVM.
> 
> Signed-off-by: Shameer Kolothum 
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 49 -
>  1 file changed, 47 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 26bf7da1bcd0..04f83f7c8319 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -2195,6 +2196,33 @@ static void arm_smmu_bitmap_free(unsigned long *map, 
> int idx)
>   clear_bit(idx, map);
>  }
>  
> +static int arm_smmu_pinned_vmid_get(struct arm_smmu_domain *smmu_domain)
> +{
> + struct arm_smmu_master *master;
> +
> + master = list_first_entry_or_null(&smmu_domain->devices,
> +   struct arm_smmu_master, domain_head);

This probably needs to hold devices_lock while using master.

> + if (!master)
> + return -EINVAL;
> +
> + return kvm_pinned_vmid_get(master->dev);
> +}
> +
> +static int arm_smmu_pinned_vmid_put(struct arm_smmu_domain *smmu_domain)
> +{
> + struct arm_smmu_master *master;
> +
> + master = list_first_entry_or_null(&smmu_domain->devices,
> +   struct arm_smmu_master, domain_head);
> + if (!master)
> + return -EINVAL;
> +
> + if (smmu_domain->s2_cfg.vmid)
> + return kvm_pinned_vmid_put(master->dev);
> +
> + return 0;
> +}
> +
>  static void arm_smmu_domain_free(struct iommu_domain *domain)
>  {
>   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> @@ -2215,8 +2243,11 @@ static void arm_smmu_domain_free(struct iommu_domain 
> *domain)
>   mutex_unlock(&arm_smmu_asid_lock);
>   }
>   if (s2_cfg->set) {
> - if (s2_cfg->vmid)
> - arm_smmu_bitmap_free(smmu->vmid_map, s2_cfg->vmid);
> + if (s2_cfg->vmid) {
> + if (!(smmu->features & ARM_SMMU_FEAT_BTM) &&
> + smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
> + arm_smmu_bitmap_free(smmu->vmid_map, 
> s2_cfg->vmid);
> + }
>   }
>  
>   kfree(smmu_domain);
> @@ -3199,6 +3230,17 @@ static int arm_smmu_attach_pasid_table(struct 
> iommu_domain *domain,
>   !(smmu->features & ARM_SMMU_FEAT_2_LVL_CDTAB))
>   goto out;
>  
> + if (smmu->features & ARM_SMMU_FEAT_BTM) {
> + ret = arm_smmu_pinned_vmid_get(smmu_domain);
> + if (ret < 0)
> + goto out;
> +
> + if (smmu_domain->s2_cfg.vmid)
> + arm_smmu_bitmap_free(smmu->vmid_map, 
> smmu_domain->s2_cfg.vmid);
> +
> + smmu_domain->s2_cfg.vmid = (u16)ret;

That will require a TLB invalidation on the old VMID, once the STE is
rewritten.

More generally I think this pinned VMID set conflicts with that of
stage-2-only domains (which is the default state until a guest attaches a
PASID table). Say you have one guest using DOMAIN_NESTED without PASID
table, just DMA to IPA using VMID 0x8000. Now another guest attaches a
PASID table and obtains the same VMID from KVM. The stage-2 translation
might use TLB entries from the other guest, no?  They'll both create
stage-2 TLB entries with {StreamWorld=NS-EL1, VMID=0x8000}

It's tempting to allocate all VMIDs through KVM instead, but that will
force a dependency on KVM to use VFIO_TYPE1_NESTING_IOMMU and might break
existing users of that extension (though I'm not sure there are any).
Instead we might need to restrict the SMMU VMID bitmap to match the
private VMID set in KVM.

Besides we probably want to restrict this feature to systems supporting
VMID16 on both SMMU and CPUs, or at least check that they are compatible.

> + }
> +
>   smmu_domain->s1_cfg.cdcfg.cdtab_dma = cfg->base_ptr;
>   smmu_domain->s1_cfg.s1cdmax = cfg->pasid_bits;
>   smmu_domain->s1_cfg.s1fmt = cfg->vendor_data.smmuv3.s1fmt;
> @@ -3221,6 +3263,7 @@ static int arm_smmu_attach_pasid_table(struct 
> iommu_domain *domain,
>  static void arm_smmu_detach_pasid_table(struct iommu_domain *domain)
>  {
>   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> + struct arm_smmu_device *smmu = smmu_domain->smmu;
>   struct arm_smmu_master *master;
>   unsigned long flags;
>  
> @@ -3237,6 +3280,8 @@ static void arm_smmu_detach_pasid_table(struct 
> iommu_domain *domain)
>   arm_smmu_install_ste_for_dev(master);
>   

Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU

2021-02-05 Thread Jean-Philippe Brucker
Hi Keqian,

On Fri, Feb 05, 2021 at 05:13:50PM +0800, Keqian Zhu wrote:
> > We need to accommodate the firmware override as well if we need this to be 
> > meaningful. Jean-Philippe is already carrying a suitable patch in the SVA 
> > stack[1].
> Robin, Thanks for pointing it out.
> 
> Jean, I see that the IORT HTTU flag overrides the hardware register info 
> unconditionally. I have some concern about it:
> 
> If the override flag has HTTU but hardware doesn't support it, then driver 
> will use this feature but receive access fault or permission fault from SMMU 
> unexpectedly.
> 1) If IOPF is not supported, then kernel can not work normally.
> 2) If IOPF is supported, kernel will perform useless actions, such as HTTU 
> based dma dirty tracking (this series).
> 
> As the IORT spec doesn't give an explicit explanation for HTTU override, can 
> we comprehend it as a mask for HTTU related hardware register?

To me "Overrides the value of SMMU_IDR0.HTTU" is clear enough: disregard
the value of SMMU_IDR0.HTTU and use the one specified by IORT instead. And
that's both ways, since there is no validity mask for the IORT value: if
there is an IORT table, always ignore SMMU_IDR0.HTTU.

That's how the SMMU driver implements the COHACC bit, which has the same
wording in IORT. So I think we should implement HTTU the same way.

One complication is that there is no equivalent override for device tree.
I think it can be added later if necessary, because unlike IORT it can be
tri-state (property not present, overridden positive, overridden negative).

Thanks,
Jean



Re: [PATCH v13 03/15] iommu/arm-smmu-v3: Maintain a SID->device structure

2021-02-01 Thread Jean-Philippe Brucker
On Mon, Feb 01, 2021 at 08:26:41PM +0800, Keqian Zhu wrote:
> > +static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
> > + struct arm_smmu_master *master)
> > +{
> > +   int i;
> > +   int ret = 0;
> > +   struct arm_smmu_stream *new_stream, *cur_stream;
> > +   struct rb_node **new_node, *parent_node = NULL;
> > +   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(master->dev);
> > +
> > +   master->streams = kcalloc(fwspec->num_ids,
> > + sizeof(struct arm_smmu_stream), GFP_KERNEL);
> > +   if (!master->streams)
> > +   return -ENOMEM;
> > +   master->num_streams = fwspec->num_ids;
> This is not roll-backed when fail.

No need, the caller frees master

> > +
> > +   mutex_lock(&smmu->streams_mutex);
> > +   for (i = 0; i < fwspec->num_ids && !ret; i++) {
> Check ret at here, makes it hard to decide the start index of rollback.
> 
> If we fail at here, then start index is (i-2).
> If we fail in the loop, then start index is (i-1).
> 
[...]
> > +   if (ret) {
> > +   for (; i > 0; i--)
> should be (i >= 0)?
> And the start index seems not correct.

Indeed, this whole bit is wrong. I'll fix it while resending the IOPF
series.

Thanks,
Jean



Re: [PATCH v13 07/15] iommu/smmuv3: Allow stage 1 invalidation with unmanaged ASIDs

2021-01-14 Thread Jean-Philippe Brucker
Hi Eric,

On Thu, Jan 14, 2021 at 05:58:27PM +0100, Auger Eric wrote:
> >>  The uacce-devel branches from
> >>> https://github.com/Linaro/linux-kernel-uadk do provide this at the moment
> >>> (they track the latest sva/zip-devel branch
> >>> https://jpbrucker.net/git/linux/ which is roughly based on mainline.)
> As I plan to respin shortly, please could you confirm the best branch to
> rebase on still is that one (uacce-devel from the linux-kernel-uadk git
> repo). Is it up to date? Commits seem to be quite old there.

Right I meant the uacce-devel-X branches. The uacce-devel-5.11 branch
currently has the latest patches

Thanks,
Jean


Re: [PATCH v13 07/15] iommu/smmuv3: Allow stage 1 invalidation with unmanaged ASIDs

2020-12-04 Thread Jean-Philippe Brucker
Hi Shameer,

On Thu, Dec 03, 2020 at 06:42:57PM +, Shameerali Kolothum Thodi wrote:
> Hi Jean/zhangfei,
> Is it possible to have a branch with minimum required SVA/UACCE related 
> patches
> that are already public and can be a "stable" candidate for future respin of 
> Eric's series?
> Please share your thoughts.

By "stable" you mean a fixed branch with the latest SVA/UACCE patches
based on mainline?  The uacce-devel branches from
https://github.com/Linaro/linux-kernel-uadk do provide this at the moment
(they track the latest sva/zip-devel branch
https://jpbrucker.net/git/linux/ which is roughly based on mainline.)

Thanks,
Jean



Re: [RFC] Use SMMU HTTU for DMA dirty page tracking

2020-05-27 Thread Jean-Philippe Brucker
On Wed, May 27, 2020 at 08:40:47AM +, Tian, Kevin wrote:
> > From: Xiang Zheng 
> > Sent: Wednesday, May 27, 2020 2:45 PM
> > 
> > 
> > On 2020/5/27 11:27, Tian, Kevin wrote:
> > >> From: Xiang Zheng
> > >> Sent: Monday, May 25, 2020 7:34 PM
> > >>
> > >> [+cc Kirti, Yan, Alex]
> > >>
> > >> On 2020/5/23 1:14, Jean-Philippe Brucker wrote:
> > >>> Hi,
> > >>>
> > >>> On Tue, May 19, 2020 at 05:42:55PM +0800, Xiang Zheng wrote:
> > >>>> Hi all,
> > >>>>
> > >>>> Is there any plan for enabling SMMU HTTU?
> > >>>
> > >>> Not outside of SVA, as far as I know.
> > >>>
> > >>
> > >>>> I have seen the patch locates in the SVA series patch, which adds
> > >>>> support for HTTU:
> > >>>> https://www.spinics.net/lists/arm-kernel/msg798694.html
> > >>>>
> > >>>> HTTU reduces the number of access faults on SMMU fault queue
> > >> (permission faults also benefit from it).
> > >>>>
> > >>>> Besides reducing the faults, HTTU also helps to track dirty pages for
> > >>>> device DMA. Is it feasible to utilize HTTU to get dirty pages on device
> > >>>> DMA during VFIO live migration?
> > >>>
> > >>> As you know there is a VFIO interface for this under discussion:
> > >>> https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email-
> > >> kwankh...@nvidia.com/
> > >>> It doesn't implement an internal API to communicate with the IOMMU
> > >> driver
> > >>> about dirty pages.
> > >
> > > We plan to add such API later, e.g. to utilize A/D bit in VT-d 2nd-level
> > > page tables (Rev 3.0).
> > >
> > 
> > Thank you, Kevin.
> > 
> > When will you send this series patches? Maybe(Hope) we can also support
> > hardware-based dirty pages tracking via common APIs based on your
> > patches. :)
> 
> Yan is working with Kirti on basic live migration support now. After that
> part is done, we will start working on A/D bit support. Yes, common APIs
> are definitely the goal here.
> 
> > 
> > >>
> > >>>
> > >>>> If SMMU can track dirty pages, devices are not required to implement
> > >>>> additional dirty pages tracking to support VFIO live migration.
> > >>>
> > >>> It seems feasible, though tracking it in the device might be more
> > >>> efficient. I might have misunderstood but I think for live migration of
> > >>> the Intel NIC they trap guest accesses to the device and introspect its
> > >>> state to figure out which pages it is accessing.
> > >
> > > Does HTTU implement A/D-like mechanism in SMMU page tables, or just
> > > report dirty pages in a log buffer? Either way tracking dirty pages in 
> > > IOMMU
> > > side is generic thus doesn't require device-specific tweak like in Intel 
> > > NIC.
> > >
> > 
> > Currently HTTU just implement A/D-like mechanism in SMMU page tables.
> > We certainly
> > expect SMMU can also implement PML-like feature so that we can avoid
> > walking the
> > whole page table to get the dirty pages.

There is no reporting of dirty pages in log buffer. It might be possible
to do software logging based on PRI or Stall, but that requires special
support in the endpoint as well as the SMMU.

> Is there a link to HTTU introduction?

I don't know any gentle introduction, but there are sections D5.4.11
"Hardware management of the Access flag and dirty state" in the ARM
Architecture Reference Manual (DDI0487E), and section 3.13 "Translation
table entries and Access/Dirty flags" in the SMMU specification
(IHI0070C). HTTU stands for "Hardware Translation Table Update".

In short, when HTTU is enabled, the SMMU translation performs an atomic
read-modify-write on the leaf translation table descriptor, setting some
bits depending on the type of memory access. This can be enabled
independently on both stage-1 and stage-2 tables (equivalent to your 1st
and 2nd page tables levels, I think).
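
A rough sketch of how software could then scan stage-2 leaf descriptors for
dirty pages (not the io-pgtable code; bit positions follow my reading of the
Arm ARM, with DBM at bit 51 and S2AP[1], the write permission, at bit 7; a
real implementation also needs atomics and TLB invalidation after cleaning):

#include <stdbool.h>
#include <stdint.h>

#define S2_DESC_DBM	(1ULL << 51)	/* Dirty Bit Modifier */
#define S2_DESC_S2AP_W	(1ULL << 7)	/* S2AP[1]: write permission */

static bool s2_desc_dirty(uint64_t desc)
{
	/* writeable-dirty: HW set S2AP[1] on a write because DBM was set */
	return (desc & S2_DESC_DBM) && (desc & S2_DESC_S2AP_W);
}

static uint64_t s2_desc_mkclean(uint64_t desc)
{
	/* back to writeable-clean; needs a TLBI before trusting it again */
	return desc & ~S2_DESC_S2AP_W;
}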

Thanks,
Jean


Re: [RFC] Use SMMU HTTU for DMA dirty page tracking

2020-05-22 Thread Jean-Philippe Brucker
Hi,

On Tue, May 19, 2020 at 05:42:55PM +0800, Xiang Zheng wrote:
> Hi all,
> 
> Is there any plan for enabling SMMU HTTU?

Not outside of SVA, as far as I know.

> I have seen the patch locates in the SVA series patch, which adds
> support for HTTU:
> https://www.spinics.net/lists/arm-kernel/msg798694.html
> 
> HTTU reduces the number of access faults on SMMU fault queue
> (permission faults also benefit from it).
> 
> Besides reducing the faults, HTTU also helps to track dirty pages for
> device DMA. Is it feasible to utilize HTTU to get dirty pages on device
> DMA during VFIO live migration?

As you know there is a VFIO interface for this under discussion:
https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email-kwankh...@nvidia.com/
It doesn't implement an internal API to communicate with the IOMMU driver
about dirty pages.

> If SMMU can track dirty pages, devices are not required to implement
> additional dirty pages tracking to support VFIO live migration.

It seems feasible, though tracking it in the device might be more
efficient. I might have misunderstood but I think for live migration of
the Intel NIC they trap guest accesses to the device and introspect its
state to figure out which pages it is accessing.

With HTTU I suppose (without much knowledge about live migration) that
you'd need several new interfaces to the IOMMU drivers:

* A way for VFIO to query HTTU support in the SMMU. There are some
  discussions about communicating more IOMMU capabilities through VFIO but
  no implementation yet. When HTTU isn't supported, the DIRTY_PAGES bitmap
  would report all pages as dirty, as it does now.

* VFIO_IOMMU_DIRTY_PAGES_FLAG_START/STOP would clear the dirty bit
  for all VFIO mappings (which is going to take some time). There is a
  walker in io-pgtable for iova_to_phys() which could be extended. I
  suppose it's also possible to atomically switch the HA and HD bits in
  context descriptors.

* VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP would query the dirty bit for all
  VFIO mappings (a rough sketch of the userspace sequence follows below).
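
For illustration, the userspace side of that sequence could look roughly like
the following. This assumes the uapi from the series linked above lands more
or less as proposed there (a VFIO_IOMMU_DIRTY_PAGES ioctl taking struct
vfio_iommu_type1_dirty_bitmap, with a vfio_iommu_type1_dirty_bitmap_get range
as payload); it is only a sketch, with error handling omitted.

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/vfio.h>

/* Sketch: start logging, fetch the dirty bitmap for one range, stop */
static void dirty_track(int container, __u64 iova, __u64 size,
			__u64 pgsize, __u64 *bitmap)
{
	struct vfio_iommu_type1_dirty_bitmap start = {
		.argsz = sizeof(start),
		.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START,
	};
	struct vfio_iommu_type1_dirty_bitmap stop = {
		.argsz = sizeof(stop),
		.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP,
	};
	size_t argsz = sizeof(struct vfio_iommu_type1_dirty_bitmap) +
		       sizeof(struct vfio_iommu_type1_dirty_bitmap_get);
	struct vfio_iommu_type1_dirty_bitmap *get = calloc(1, argsz);
	struct vfio_iommu_type1_dirty_bitmap_get *range = (void *)get->data;

	get->argsz = argsz;
	get->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
	range->iova = iova;
	range->size = size;
	range->bitmap.pgsize = pgsize;
	/* one bit per page, rounded up to 64-bit words */
	range->bitmap.size = (size / pgsize + 63) / 64 * 8;
	range->bitmap.data = bitmap;

	/* Start logging; with HTTU the IOMMU driver clears the dirty bits */
	ioctl(container, VFIO_IOMMU_DIRTY_PAGES, &start);
	/* ... let DMA run, then harvest, possibly several times ... */
	ioctl(container, VFIO_IOMMU_DIRTY_PAGES, get);
	/* Stop logging once migration completes */
	ioctl(container, VFIO_IOMMU_DIRTY_PAGES, &stop);
	free(get);
}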

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH RFC 11/14] arm64: Move the ASID allocator code in a separate file

2019-09-20 Thread Jean-Philippe Brucker
On Fri, Sep 20, 2019 at 08:07:38AM +0800, Guo Ren wrote:
> On Thu, Sep 19, 2019 at 11:18 PM Jean-Philippe Brucker
>  wrote:
> 
> >
> > The SMMU does support PCI Virtual Function - an hypervisor can assign a
> > VF to a guest, and let that guest partition the VF into smaller contexts
> > by using PASID.  What it can't support is assigning partitions of a PCI
> > function (VF or PF) to multiple Virtual Machines, since there is a
> > single S2 PGD per function (in the Stream Table Entry), rather than one
> > S2 PGD per PASID context.
> >
> In my concept, the two sentences "The SMMU does support PCI Virtual
> Function" v.s. "What it can't support is assigning partitions of a PCI
> function (VF or PF) to multiple Virtual Machines" are in conflict, and I
> don't want to play a naming game :)

That's fine. But to prevent the spread of misinformation: Arm SMMU
supports PCI Virtual Functions.

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH RFC 11/14] arm64: Move the ASID allocator code in a separate file

2019-09-19 Thread Jean-Philippe Brucker
On Thu, Sep 19, 2019 at 09:07:15PM +0800, Guo Ren wrote:
> > The solution I had to this problem is pinning the ASID [1] used by the
> > IOMMU, to prevent the CPU from recycling the ASID on rollover. This way
> > the CPU doesn't have to wait for IOMMU invalidations to complete, when
> > scheduling a task that might not even have anything to do with the IOMMU.
> >
> 
> > In the Arm SMMU, ASID and IOASID (PASID) are separate identifiers. IOASID
> > indexes an entry in the context descriptor table, which contains the ASID.
> > So with unpinned shared ASID you don't need to invalidate the ATC on
> > rollover, since the IOASID doesn't change, but you do need to modify the
> > context descriptor and invalidate cached versions of it.
> The terminology confused me a lot. I prefer to use PASID for the IOMMU and
> ASID for the CPU.
> Arm's entry of the context descriptor table contains an "IOASID"

The terminology I've been using so far is different:
* IOASID is PASID
* The entry in the context descriptor table contains an ASID, which
  is either "shared" with CPUs or "private" to the SMMU (the SMMU spec
  says "shared" or "non-shared").
* So the CPU and SMMU TLBs use ASIDs, and the PCI ATC uses IOASID

> IOASID != ASID for CPU_TLB and IOMMU_TLB.
> 
> When you say "since the IOASID doesn't change", is it PASID or my IOASID? -_*!

I was talking about PASID. Maybe we can drop "IOASID" and talk only
about ASID and PASID :)

> PASID in PCI-SIG is used to determine the transfer address space.
> For Intel, the entry which is indexed by PASID also contains S1/S2.PGD
> and DID(VMID).
> For Arm, the entry which is indexed by PASID only contains S1.PGD and
> IOASID. Compared to Intel VT-d Scalable Mode, Arm's design can't
> support PCI Virtual Functions.

The SMMU does support PCI Virtual Function - an hypervisor can assign a
VF to a guest, and let that guest partition the VF into smaller contexts
by using PASID.  What it can't support is assigning partitions of a PCI
function (VF or PF) to multiple Virtual Machines, since there is a
single S2 PGD per function (in the Stream Table Entry), rather than one
S2 PGD per PASID context.

Thanks,
Jean

> > Once you have pinned ASIDs, you could also declare that IOASID = ASID. I
> > don't remember finding an argument to strictly forbid it, even though ASID
> > and IOASID have different sizes on Arm (respectively 8/16 and 20 bits).
> ASID and IOASID are hard to keep the same between CPU system and IOMMU
> system. So I introduce S1/S2.PGD.PPN as a bridge between CPUs and
> IOMMUs.
> See my proposal [1]
> 
> 1: 
> https://lore.kernel.org/linux-csky/1568896556-28769-1-git-send-email-guo...@kernel.org/T/#u
> -- 
> Best Regards
>  Guo Ren
> 
> ML: https://lore.kernel.org/linux-csky/
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH RFC 11/14] arm64: Move the ASID allocator code in a separate file

2019-09-16 Thread Jean-Philippe Brucker
Hi,

On 13/09/2019 09:13, Guo Ren wrote:
> Another idea is to separate remote TLB invalidation into two instructions:
> 
>  - sfence.vma.b.asyc
>  - sfence.vma.b.barrier // wait for all async TLB invalidate operations to
> finish for all harts.

It's not clear to me how this helps, but I probably don't have the whole
picture. If you have a place where it is safe to wait for the barrier to
complete, why not do the whole invalidate there?

> (I remember someone mentioned to me separating them into two instructions
> after the session. Anup? Is the idea right?)
> 
> Actually, I never considered async TLB invalidation before, because our
> current light IOMMU did not need it.
> 
> Thanks to everyone who attended the session :) Let's continue the talk.
> 
> 
> Guo Ren <guo...@kernel.org> wrote on Thursday, 12 September 2019 at 22:59:
> 
> Thanks Will for the reply.
> 
> On Thu, Sep 12, 2019 at 3:03 PM Will Deacon wrote:
> >
> > On Sun, Sep 08, 2019 at 07:52:55AM +0800, Guo Ren wrote:
> > > On Mon, Jun 24, 2019 at 6:40 PM Will Deacon wrote:
> > > > > I'll keep my system use the same ASID for SMP + IOMMU :P
> > > >
> > > > You will want a separate allocator for that:
> > > >
> > > >
> 
> https://lkml.kernel.org/r/20190610184714.6786-2-jean-philippe.bruc...@arm.com
> > >
> > > Yes, it is hard to maintain ASID between IOMMU and CPUMMU or different
> > > system, because it's difficult to synchronize the IO_ASID when the CPU
> > > ASID is rollover.
> > > But we could still use hardware broadcast TLB invalidation instruction
> > > to uniformly manage the ASID and IO_ASID, or OTHER_ASID in our IOMMU.
> >
> > That's probably a bad idea, because you'll likely stall execution on the
> > CPU until the IOTLB has completed invalidation. In the case of ATS,
> I think
> > an endpoint ATC is permitted to take over a minute to respond. In
> reality, I
> > suspect the worst you'll ever see would be in the msec range, but that's
> > still an unacceptable period of time to hold a CPU.
> As I said in the session, IOTLB invalidate delay is another topic. My main
> proposal is to introduce stage1.pgd and stage2.pgd as address space
> identifiers between different TLB systems, based on vmid and asid. The last
> part of my slides shows how to translate stage1/2.pgd to as/vmid in a PCI
> ATS system, and the method could work with SMMU-v3 and Intel VT-d. (I
> regret there was no time to show you the whole slides.)
> 
> In our light IOMMU implementation, there's no IOTLB invalidate delay
> problem, because the IOMMU is very close to the CPU MMU and the
> interconnect's delay is the same as for the SMP CPUs' MMU (no PCI, VM
> supported).
> 
> To solve the problem, we could define an async mode in sfence.vma.b and
> finish with per_cpu_irq/exception.

The solution I had to this problem is pinning the ASID [1] used by the
IOMMU, to prevent the CPU from recycling the ASID on rollover. This way
the CPU doesn't have to wait for IOMMU invalidations to complete, when
scheduling a task that might not even have anything to do with the IOMMU.

In the Arm SMMU, ASID and IOASID (PASID) are separate identifiers. IOASID
indexes an entry in the context descriptor table, which contains the ASID.
So with unpinned shared ASID you don't need to invalidate the ATC on
rollover, since the IOASID doesn't change, but you do need to modify the
context descriptor and invalidate cached versions of it.

Once you have pinned ASIDs, you could also declare that IOASID = ASID. I
don't remember finding an argument to strictly forbid it, even though ASID
and IOASID have different sizes on Arm (respectively 8/16 and 20 bits).
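
As a rough illustration of the pinning idea above (made-up names, not the
actual patch, and locking against concurrent allocation omitted), the
allocator side boils down to a per-ASID refcount that generation rollover
skips:

#include <linux/bitmap.h>
#include <linux/refcount.h>

#define NUM_ASIDS		(1 << 16)

static unsigned long pinned_asid_map[BITS_TO_LONGS(NUM_ASIDS)];
static refcount_t pinned_refs[NUM_ASIDS];

/* Called by the IOMMU/SVA code before sharing @asid with a device */
static void asid_pin(unsigned int asid)
{
	if (!test_and_set_bit(asid, pinned_asid_map))
		refcount_set(&pinned_refs[asid], 1);
	else
		refcount_inc(&pinned_refs[asid]);
}

static void asid_unpin(unsigned int asid)
{
	if (refcount_dec_and_test(&pinned_refs[asid]))
		clear_bit(asid, pinned_asid_map);
}

/* On generation rollover, pinned ASIDs stay reserved and keep their value,
 * so the IOMMU doesn't need to be told anything. */
static void flush_generation(unsigned long *asid_map)
{
	bitmap_copy(asid_map, pinned_asid_map, NUM_ASIDS);
	/* every non-pinned ASID is now free for the new generation */
}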

Thanks,
Jean

[1]
https://lore.kernel.org/linux-iommu/20180511190641.23008-17-jean-philippe.bruc...@arm.com/
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH] MAINTAINERS: Update my email address

2019-07-22 Thread Jean-Philippe Brucker
Update MAINTAINERS and .mailmap with my @linaro.org address, since I
don't have access to my @arm.com address anymore.

Signed-off-by: Jean-Philippe Brucker 
---
 .mailmap| 1 +
 MAINTAINERS | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/.mailmap b/.mailmap
index 0fef932de3db..8ce554b9c9f1 100644
--- a/.mailmap
+++ b/.mailmap
@@ -98,6 +98,7 @@ Jason Gunthorpe  

 Javi Merino  
  
 Jean Tourrilhes 
+ 
 Jeff Garzik 
 Jeff Layton  
 Jeff Layton  
diff --git a/MAINTAINERS b/MAINTAINERS
index 783569e3c4b4..bded78c84701 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17123,7 +17123,7 @@ F:  drivers/virtio/virtio_input.c
 F: include/uapi/linux/virtio_input.h
 
 VIRTIO IOMMU DRIVER
-M: Jean-Philippe Brucker 
+M: Jean-Philippe Brucker 
 L: virtualizat...@lists.linux-foundation.org
 S: Maintained
 F: drivers/iommu/virtio-iommu.c
-- 
2.22.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v8 26/29] vfio-pci: Register an iommu fault handler

2019-06-19 Thread Jean-Philippe Brucker
On 19/06/2019 01:19, Jacob Pan wrote:
>>> I see this as a future extension due to limited testing,   
>>
>> I'm wondering how we deal with:
>> (1) old userspace that won't fill the new private_data field in
>> page_response. A new kernel still has to support it.
>> (2) old kernel that won't recognize the new PRIVATE_DATA flag.
>> Currently iommu_page_response() rejects page responses with unknown
>> flags.
>>
>> I guess we'll need a two-way negotiation, where userspace queries
>> whether the kernel supports the flag (2), and the kernel learns
>> whether it should expect the private data to come back (1).
>>
> I am not sure case (1) exists, in that there is no existing userspace that
> supports PRQ w/o private data. Am I missing something?
> 
> For VT-d emulation, private data is always part of the scalable mode
> PASID capability. If vIOMMU query host supports PASID and scalable
> mode, it will always support private data once PRQ is enabled.

Right, if VT-d won't ever support page_response without private data then
I don't think we have to worry about (1).

> So I think we only need to negotiate (2) which should be covered by
> VT-d PASID cap.
> 
>>> perhaps for
>>> now, can you add paddings similar to page request? Make it 64B as
>>> well.  
>>
>> I don't think padding is necessary, because iommu_page_response is
>> sent by userspace to the kernel, unlike iommu_fault which is
>> allocated by userspace and filled by the kernel.
>>
>> Page response looks a lot more like existing VFIO mechanisms, so I
>> suppose we'll wrap the iommu_page_response structure and include an
>> argsz parameter at the top:
>>
>>  struct vfio_iommu_page_response {
>>  u32 argsz;
>>  struct iommu_page_response pr;
>>  };
>>
>>  struct vfio_iommu_page_response vpr = {
>>  .argsz = sizeof(vpr),
>>  .pr = ...
>>  ...
>>  };
>>
>>  ioctl(devfd, VFIO_IOMMU_PAGE_RESPONSE, &vpr);
>>
>> In that case supporting private data can be done by simply appending a
>> field at the end (plus the negotiation above).
>>
> Do you mean at the end of struct vfio_iommu_page_response{}? Or at
> the end of struct iommu_page_response{}?
> 
> The consumer of the private data is iommu driver not vfio. So I think
> you want to add the new field at the end of struct iommu_page_response,
> right?

Yes that's what I meant

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v8 26/29] vfio-pci: Register an iommu fault handler

2019-06-18 Thread Jean-Philippe Brucker
On 12/06/2019 19:53, Jacob Pan wrote:
>>> You are right, the worst case of the spurious PS is to terminate the
>>> group prematurely. Need to know the scope of the HW damage in case
>>> of mdev where group IDs can be shared among mdevs belong to the
>>> same PF.  
>>
>> But from the IOMMU fault API point of view, the full page request is
>> identified by both PRGI and PASID. Given that each mdev has its own
>> set of PASIDs, it should be easy to isolate page responses per mdev.
>>
> On Intel platform, devices sending page request with private data must
> receive page response with matching private data. If we solely depend
> on PRGI and PASID, we may send stale private data to the device in
> those incorrect page response. Since private data may represent PF
> device wide contexts, the consequence of sending page response with
> wrong private data may affect other mdev/PASID.
> 
> One solution we are thinking to do is to inject the sequence #(e.g.
> ktime raw mono clock) as vIOMMU private data into to the guest. Guest
> would return this fake private data in page response, then host will
> send page response back to the device that matches PRG1 and PASID and
> private_data.
> 
> This solution does not expose HW context related private data to the
> guest but need to extend page response in iommu uapi.
> 
> /**
>  * struct iommu_page_response - Generic page response information
>  * @version: API version of this structure
>  * @flags: encodes whether the corresponding fields are valid
>  * (IOMMU_FAULT_PAGE_RESPONSE_* values)
>  * @pasid: Process Address Space ID
>  * @grpid: Page Request Group Index
>  * @code: response code from  iommu_page_response_code
>  * @private_data: private data for the matching page request
>  */
> struct iommu_page_response {
> #define IOMMU_PAGE_RESP_VERSION_1 1
>   __u32   version;
> #define IOMMU_PAGE_RESP_PASID_VALID   (1 << 0)
> #define IOMMU_PAGE_RESP_PRIVATE_DATA  (1 << 1)
>   __u32   flags;
>   __u32   pasid;
>   __u32   grpid;
>   __u32   code;
>   __u32   padding;
>   __u64   private_data[2];
> };
> 
> There is also the change needed for separating storage for the real and
> fake private data.
> 
> Sorry for the last minute change, did not realize the HW implications.
> 
> I see this as a future extension due to limited testing, 

I'm wondering how we deal with:
(1) old userspace that won't fill the new private_data field in
page_response. A new kernel still has to support it.
(2) old kernel that won't recognize the new PRIVATE_DATA flag. Currently
iommu_page_response() rejects page responses with unknown flags.

I guess we'll need a two-way negotiation, where userspace queries
whether the kernel supports the flag (2), and the kernel learns whether
it should expect the private data to come back (1).

> perhaps for
> now, can you add paddings similar to page request? Make it 64B as well.

I don't think padding is necessary, because iommu_page_response is sent
by userspace to the kernel, unlike iommu_fault which is allocated by
userspace and filled by the kernel.

Page response looks a lot more like existing VFIO mechanisms, so I
suppose we'll wrap the iommu_page_response structure and include an
argsz parameter at the top:

struct vfio_iommu_page_response {
u32 argsz;
struct iommu_page_response pr;
};

struct vfio_iommu_page_response vpr = {
.argsz = sizeof(vpr),
.pr = ...
...
};

ioctl(devfd, VFIO_IOMMU_PAGE_RESPONSE, &vpr);

In that case supporting private data can be done by simply appending a
field at the end (plus the negotiation above).

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v8 26/29] vfio-pci: Register an iommu fault handler

2019-06-11 Thread Jean-Philippe Brucker
On 10/06/2019 22:31, Jacob Pan wrote:
> On Mon, 10 Jun 2019 13:45:02 +0100
> Jean-Philippe Brucker  wrote:
> 
>> On 07/06/2019 18:43, Jacob Pan wrote:
>>>>> So it seems we agree on the following:
>>>>> - iommu_unregister_device_fault_handler() will never fail
>>>>> - iommu driver cleans up all pending faults when handler is
>>>>> unregistered
>>>>> - assume device driver or guest not sending more page response
>>>>> _after_ handler is unregistered.
>>>>> - system will tolerate rare spurious response
>>>>>
>>>>> Sounds right?
>>>>
>>>> Yes, I'll add that to the fault series  
>>> Hold on a second please, I think we need more clarifications. Ashok
>>> pointed out to me that the spurious response can be harmful to other
>>> devices when it comes to mdev, where PRQ group id is not per PASID,
>>> device may reuse the group number and receiving spurious page
>>> response can confuse the entire PF.   
>>
>> I don't understand how mdev differs from the non-mdev situation (but I
>> also still don't fully get how mdev+PASID will be implemented). Is the
>> following the case you're worried about?
>>
>>   M#: mdev #
>>
>>  #  Dev            Host             mdev drv    VFIO/QEMU          Guest
>>
>>  1                                              <- reg(handler)
>>  2  PR1 G1 P1 ->                    M1 PR1 G1   inject ->          M1 PR1 G1
>>  3                                              <- unreg(handler)
>>  4                 <- PS1 G1 P1 (F)                    |
>>  5                                               unreg(handler)
>>  6                                              <- reg(handler)
>>  7  PR2 G1 P1 ->                    M2 PR2 G1   inject ->          M2 PR2 G1
>>  8                                                                 <- M1 PS1 G1
>>  9  accept ??      <- PS1 G1 P1
>> 10                                                                 <- M2 PS2 G1
>> 11 accept          <- PS2 G1 P1
>>
> Not really. I am not worried about PASID reuse or unbind. Just within
> the same PASID bind lifetime of a single mdev, back to back
> register/unregister fault handler.
> After Step 4, device will think G1 is done. Device could reuse G1 for
> the next PR, if we accept PS1 in step 9, device will terminate G1 before
> the real G1 PS arrives in Step 11. The real G1 PS might have a
> different response code. Then we just drop the PS in Step 11?

Yes, I think we do. Two possibilities:

* G1 is reused at step 7 for the same PASID context, which means that it
is for the same mdev. The problem is then identical to the non-mdev
case, new page faults and old page response may cross:

# Dev Hostmdev drv   VFIO/QEMUGuest

7 PR2 G1 P1  --.
8   \ .- M1 PS1 G1
9'->  PR2 G1 P1  ->  /   inject  --> M1 PR2 G1
10   accept <---  PS1 G1 P1  <--'
11   reject <---  PS2 G1 P1  <-- M1 PS2 G1

And the incorrect page response is returned to the guest. However, it
affects a single mdev/guest context; it doesn't affect other mdevs.

* Or G1 is reused at step 7 for a different PASID. At step 10 the fault
handler rejects the page response because the PASID is different, and
step 11 is accepted.


>>> Having spurious page response is also not
>>> abiding the PCIe spec. exactly.  
>>
>> We are following the PCI spec though, in that we don't send page
>> responses for PRGIs that aren't in flight.
>>
> You are right, the worst case of the spurious PS is to terminate the
> group prematurely. Need to know the scope of the HW damage in case of mdev
> where group IDs can be shared among mdevs belong to the same PF.

But from the IOMMU fault API point of view, the full page request is
identified by both PRGI and PASID. Given that each mdev has its own set
of PASIDs, it should be easy to isolate page responses per mdev.

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v8 26/29] vfio-pci: Register an iommu fault handler

2019-06-10 Thread Jean-Philippe Brucker
On 07/06/2019 18:43, Jacob Pan wrote:
>>> So it seems we agree on the following:
>>> - iommu_unregister_device_fault_handler() will never fail
>>> - iommu driver cleans up all pending faults when handler is
>>> unregistered
>>> - assume device driver or guest not sending more page response
>>> _after_ handler is unregistered.
>>> - system will tolerate rare spurious response
>>>
>>> Sounds right?  
>>
>> Yes, I'll add that to the fault series
> Hold on a second please, I think we need more clarifications. Ashok
> pointed out to me that the spurious response can be harmful to other
> devices when it comes to mdev, where PRQ group id is not per PASID,
> device may reuse the group number and receiving spurious page response
> can confuse the entire PF. 

I don't understand how mdev differs from the non-mdev situation (but I
also still don't fully get how mdev+PASID will be implemented). Is the
following the case you're worried about?

  M#: mdev #

 #  Dev            Host             mdev drv    VFIO/QEMU          Guest

 1                                              <- reg(handler)
 2  PR1 G1 P1 ->                    M1 PR1 G1   inject ->          M1 PR1 G1
 3                                              <- unreg(handler)
 4                 <- PS1 G1 P1 (F)                    |
 5                                               unreg(handler)
 6                                              <- reg(handler)
 7  PR2 G1 P1 ->                    M2 PR2 G1   inject ->          M2 PR2 G1
 8                                                                 <- M1 PS1 G1
 9  accept ??      <- PS1 G1 P1
10                                                                 <- M2 PS2 G1
11 accept          <- PS2 G1 P1


Step 2 injects PR1 for mdev#1. Step 4 auto-responds to PR1. Between
steps 5 and 6, we re-allocate PASID #1 for mdev #2. At step 7, we inject
PR2 for mdev #2. Step 8 is the spurious Page Response for PR1.

But I don't think step 9 is possible, because the mdev driver knows that
mdev #1 isn't using PASID #1 anymore. If the configuration is valid at
all (a page response channel still exists for mdev #1), then mdev #1 now
has a different PASID, e.g. #2, and step 9 would be "<- PS1 G1 P2" which
is rejected by iommu.c (no such pending page request). And step 11 will
be accepted.

If PASIDs are allocated through VCMD, then the situation seems similar:
at step 2 you inject "M1 PR1 G1 P1" into the guest, and at step 8 the
spurious response is "M1 PS1 G1 P1". If mdev #1 doesn't have PASID #1
anymore, then the mdev driver can check that the PASID is invalid and
can reject the page response.
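
As an illustration (hypothetical helper name, not an existing API), the check
in the mdev driver's page-response path could be as simple as:

	/* Sketch: drop responses whose PASID no longer belongs to this mdev,
	 * e.g. because it was reassigned since the fault was injected */
	if (!mdev_owns_pasid(mdev, resp->pasid))
		return -EINVAL;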

> Having spurious page response is also not
> abiding the PCIe spec. exactly.

We are following the PCI spec though, in that we don't send page
responses for PRGIs that aren't in flight.

> We have two options here:
> 1. unregister handler will get -EBUSY if outstanding fault exists.
>   -PROs: block offending device unbind only, eventually timeout
>   will clear.
>   -CONs: flooded faults can prevent clearing
> 2. unregister handler will block until all faults are cleared in the host.
>Never fails unregistration

Does the host complete the faults itself here, or wait for a response from
the guest? I'm slightly confused by the word "blocking". I'd rather we
don't introduce an uninterruptible sleep in the IOMMU core, since it's
unlikely to ever finish if we rely on the guest to complete things.

>   -PROs: simple flow for VFIO, no need to worry about device
>   holding reference.
>   -CONs: spurious page response may come from
>   misbehaving/malicious guest if guest does unregister and
>   register back to back.

> It seems the only way to prevent spurious page response is to introduce
> a SW token or sequence# for each PRQ that needs a response. I still
> think option 2 is good.
> 
> Consider the following time line:
> decoding
>  PR#: page request
>  G#:  group #
>  P#:  PASID
>  S#:  sequence #
>  A#:  address
>  PS#: page response
>  (F): Fail
>  (S): Success
> 
> # Dev        Host        VFIO/QEMU        Guest
> ===   
> 1 <-reg(handler)
> 2 PR1G1S1A1   ->  inject  ->  PR1G1S1A1
> 3 PR2G1S2A2   ->  inject  ->  PR2G1S2A2
> 4.<-unreg(handler)
> 5.<-PR1G1S1A1(F)  | 
> 6.<-PR2G1S2A2(F)  V
> 7.<-unreg(handler)
> 8.<-reg(handler)
> 9 PR3G1S3A1   ->  inject  ->  PR3G1S3A1
> 10.   <-PS1G1S1A1
> 11.   
> 11.<-PS3G1S3A1
> 12.PS3G1S3A1(S)
> 
> The spurious page response comes in at step 10 where the guest sends
> response for the request in step 1. But since the sequence # is 1, host
> IOMMU driver will reject it. At step 11, we accept page response for
> the matching sequence # then respond SUCCESS to the device.
> 
> So would it be OK to add this sequence# to iommu_fault and page

Re: [PATCH v8 26/29] vfio-pci: Register an iommu fault handler

2019-06-07 Thread Jean-Philippe Brucker
On 26/05/2019 17:10, Eric Auger wrote:
> +int vfio_pci_iommu_dev_fault_handler(struct iommu_fault_event *evt, void 
> *data)
> +{
> + struct vfio_pci_device *vdev = (struct vfio_pci_device *) data;
> + struct vfio_region_fault_prod *prod_region =
> + (struct vfio_region_fault_prod *)vdev->fault_pages;
> + struct vfio_region_fault_cons *cons_region =
> + (struct vfio_region_fault_cons *)(vdev->fault_pages + 2 * 
> PAGE_SIZE);
> + struct iommu_fault *new =
> + (struct iommu_fault *)(vdev->fault_pages + prod_region->offset +
> + prod_region->prod * prod_region->entry_size);
> + int prod, cons, size;
> +
> + mutex_lock(&vdev->fault_queue_lock);
> +
> + if (!vdev->fault_abi)
> + goto unlock;
> +
> + prod = prod_region->prod;
> + cons = cons_region->cons;
> + size = prod_region->nb_entries;
> +
> + if (CIRC_SPACE(prod, cons, size) < 1)
> + goto unlock;
> +
> + *new = evt->fault;

Could you check fault.type and return an error if it's not UNRECOV here?
If the fault is recoverable (very unlikely since the PRI capability is
disabled, but allowed) and we return an error here, then the caller
takes care of completing the fault. If we forward it to the guest
instead, the producer will wait indefinitely for a response.
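
Something like the following near the top of vfio_pci_iommu_dev_fault_handler()
would do, as a sketch (IOMMU_FAULT_DMA_UNRECOV being the fault type defined by
the fault API series):

	/* Sketch: only unrecoverable faults can be forwarded for now; by
	 * returning an error we let the IOMMU core complete anything else */
	if (evt->fault.type != IOMMU_FAULT_DMA_UNRECOV)
		return -EINVAL;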

Thanks,
Jean

> + prod = (prod + 1) % size;
> + prod_region->prod = prod;
> + mutex_unlock(&vdev->fault_queue_lock);
> +
> + mutex_lock(&vdev->igate);
> + if (vdev->dma_fault_trigger)
> + eventfd_signal(vdev->dma_fault_trigger, 1);
> + mutex_unlock(&vdev->igate);
> + return 0;
> +
> +unlock:
> + mutex_unlock(&vdev->fault_queue_lock);
> + return -EINVAL;
> +}
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v8 25/29] vfio-pci: Add a new VFIO_REGION_TYPE_NESTED region type

2019-06-07 Thread Jean-Philippe Brucker
On 07/06/2019 09:28, Auger Eric wrote:
>>> +static const struct vfio_pci_fault_abi fault_abi_versions[] = {
>>> +   [0] = {
>>> +   .entry_size = sizeof(struct iommu_fault),
>>> +   },
>>> +};
>>> +
>>> +#define NR_FAULT_ABIS ARRAY_SIZE(fault_abi_versions)
>>
>> This looks like it's leading to some dangerous complicated code to
>> support multiple user selected ABIs.  How many ABIs do we plan to
>> support?  The region capability also exposes a type, sub-type, and
>> version.  How much of this could be exposed that way?  ie. if we need
>> to support multiple versions, expose multiple regions.
> 
> This is something that was discussed earlier: Jean-Philippe suggested
> that we may need to support several versions of the ABI
> (typically when adding PRI support).
> Exposing multiple region is an interesting idea and I will explore that
> direction.

At the moment the ABI supports errors and PRI. We're considering setting
the fault report structure to 64 or 128 bytes (see "[PATCH v2 2/4]
iommu: Introduce device fault data"). 64 bytes allows for 2 additional
fields before we have to introduce a new ABI version, while 128 bytes
should last us a while.
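
For reference, a fixed 64-byte record could look roughly like this (a sketch
following the fault series; the explicit padding is what keeps the layout
extensible without an ABI bump):

#include <linux/types.h>

/* Sketch: fixed 64-byte fault record. The per-type payloads defined by the
 * fault series (unrecoverable fault, page request) each fit in the 56-byte
 * union below, leaving room for a couple of new fields. */
struct iommu_fault {
	__u32	type;		/* unrecoverable fault or page request */
	__u32	padding;
	union {
		/* struct iommu_fault_unrecoverable / iommu_fault_page_request
		 * from the fault series go here */
		__u8 payload[56];	/* 8 + 56 = 64 bytes total */
	};
};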

But that's for adding new fields to existing fault types. It's probably
a good idea to have different region types in VFIO for different fault
types, since userspace isn't necessarily prepared to deal with them. For
example right now userspace doesn't have a method to complete
recoverable faults, so we can't add them to the queue.

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v8 26/29] vfio-pci: Register an iommu fault handler

2019-06-07 Thread Jean-Philippe Brucker
On 06/06/2019 21:29, Jacob Pan wrote:
>> iommu_unregister_device_fault_handler(&vdev->pdev->dev);
>
>
> But this can fail if there are pending faults which leaves a
> device reference and then the system is broken :(
 This series only features unrecoverable errors and for those the
 unregistration cannot fail. Now unrecoverable errors were added I
 admit this is confusing. We need to sort this out or clean the
 dependencies.  
>>> As Alex pointed out in 4/29, we can make
>>> iommu_unregister_device_fault_handler() never fail and clean up all
>>> the pending faults in the host IOMMU belong to that device. But the
>>> problem is that if a fault, such as PRQ, has already been injected
>>> into the guest, the page response may come back after handler is
>>> unregistered and registered again.  
>>
>> I'm trying to figure out if that would be harmful in any way. I guess
>> it can be a bit nasty if we handle the page response right after
>> having injected a new page request that uses the same PRGI. In any
>> other case we discard the page response, but here we forward it to
>> the endpoint and:
>>
>> * If the response status is success, endpoint retries the
>> translation. The guest probably hasn't had time to handle the new
>> page request and translation will fail, which may lead the endpoint
>> to give up (two unsuccessful translation requests). Or send a new
>> request
>>
> Good point, there shouldn't be any harm if the page response is a
> "fake" success. In fact it could happen in the normal operation when
> PRQs to two devices share the same non-leaf translation structure. The
> worst case is just a retry. I am not aware of the retry limit, is it in
> the PCIe spec? I cannot find it.

I don't think so, it's the implementation's choice. In general I don't
think devices will have a retry limit, but it doesn't seem like the PCI
spec prevents them from implementing one either. It could be useful to
stop retrying after a certain number of faults, for preventing livelocks
when the OS doesn't fix up the page tables and the device would just
repeat the fault indefinitely.

> I think we should just document it, similar to having a spurious
> interrupt. The PRQ trace event should capture that as well.
> 
>> * otherwise the endpoint won't retry the access, and could also
>> disable PRI if the status is failure.
>>
> That would be true regardless this race condition with handler
> registration. So should be fine.

We do give an invalid response for the old PRG (because of unregistering),
but also for the new one, which has a different address that the guest
might be able to page in and would normally return success.

>>> We need a way to reject such page response belong
>>> to the previous life of the handler. Perhaps a sync call to the
>>> guest with your fault queue eventfd? I am not sure.  
>>
>> We could simply expect the device driver not to send any page response
>> after unregistering the fault handler. Is there any reason VFIO would
>> need to unregister and re-register the fault handler on a live guest?
>>
> There is no reason for VFIO to unregister and register again, I was
> just thinking from security perspective. Someone could write a VFIO app
> do this attack. But I agree the damage is within the device, may get
> PRI disabled as a result.

Yes, I think the damage would always be contained within the misbehaving
software.

> So it seems we agree on the following:
> - iommu_unregister_device_fault_handler() will never fail
> - iommu driver cleans up all pending faults when handler is unregistered
> - assume device driver or guest not sending more page response _after_
>   handler is unregistered.
> - system will tolerate rare spurious response
> 
> Sounds right?

Yes, I'll add that to the fault series

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v8 26/29] vfio-pci: Register an iommu fault handler

2019-06-06 Thread Jean-Philippe Brucker
On 05/06/2019 23:45, Jacob Pan wrote:
> On Tue, 4 Jun 2019 18:11:08 +0200
> Auger Eric  wrote:
> 
>> Hi Alex,
>>
>> On 6/4/19 12:31 AM, Alex Williamson wrote:
>>> On Sun, 26 May 2019 18:10:01 +0200
>>> Eric Auger  wrote:
>>>   
 This patch registers a fault handler which records faults in
 a circular buffer and then signals an eventfd. This buffer is
 exposed within the fault region.

 Signed-off-by: Eric Auger 

 ---

 v3 -> v4:
 - move iommu_unregister_device_fault_handler to vfio_pci_release
 ---
  drivers/vfio/pci/vfio_pci.c | 49
 + drivers/vfio/pci/vfio_pci_private.h
 |  1 + 2 files changed, 50 insertions(+)

 diff --git a/drivers/vfio/pci/vfio_pci.c
 b/drivers/vfio/pci/vfio_pci.c index f75f61127277..52094ba8
 100644 --- a/drivers/vfio/pci/vfio_pci.c
 +++ b/drivers/vfio/pci/vfio_pci.c
 @@ -30,6 +30,7 @@
  #include 
  #include 
  #include 
 +#include 
  
  #include "vfio_pci_private.h"
  
 @@ -296,6 +297,46 @@ static const struct vfio_pci_regops
 vfio_pci_fault_prod_regops = { .add_capability =
 vfio_pci_fault_prod_add_capability, };
  
 +int vfio_pci_iommu_dev_fault_handler(struct iommu_fault_event
 *evt, void *data) +{
 +  struct vfio_pci_device *vdev = (struct vfio_pci_device *)
 data;
 +  struct vfio_region_fault_prod *prod_region =
 +  (struct vfio_region_fault_prod
 *)vdev->fault_pages;
 +  struct vfio_region_fault_cons *cons_region =
 +  (struct vfio_region_fault_cons
 *)(vdev->fault_pages + 2 * PAGE_SIZE);
 +  struct iommu_fault *new =
 +  (struct iommu_fault *)(vdev->fault_pages +
 prod_region->offset +
 +  prod_region->prod *
 prod_region->entry_size);
 +  int prod, cons, size;
 +
 +  mutex_lock(&vdev->fault_queue_lock);
 +
 +  if (!vdev->fault_abi)
 +  goto unlock;
 +
 +  prod = prod_region->prod;
 +  cons = cons_region->cons;
 +  size = prod_region->nb_entries;
 +
 +  if (CIRC_SPACE(prod, cons, size) < 1)
 +  goto unlock;
 +
 +  *new = evt->fault;
 +  prod = (prod + 1) % size;
 +  prod_region->prod = prod;
 +  mutex_unlock(&vdev->fault_queue_lock);
 +
 +  mutex_lock(&vdev->igate);
 +  if (vdev->dma_fault_trigger)
 +  eventfd_signal(vdev->dma_fault_trigger, 1);
 +  mutex_unlock(&vdev->igate);
 +  return 0;
 +
 +unlock:
 +  mutex_unlock(&vdev->fault_queue_lock);
 +  return -EINVAL;
 +}
 +
  static int vfio_pci_init_fault_region(struct vfio_pci_device
 *vdev) {
struct vfio_region_fault_prod *header;
 @@ -328,6 +369,13 @@ static int vfio_pci_init_fault_region(struct
 vfio_pci_device *vdev) header = (struct vfio_region_fault_prod
 *)vdev->fault_pages; header->version = -1;
header->offset = PAGE_SIZE;
 +
 +  ret =
 iommu_register_device_fault_handler(&vdev->pdev->dev,
 +
 vfio_pci_iommu_dev_fault_handler,
 +  vdev);
 +  if (ret)
 +  goto out;
 +
return 0;
  out:
kfree(vdev->fault_pages);
 @@ -570,6 +618,7 @@ static void vfio_pci_release(void *device_data)
if (!(--vdev->refcnt)) {
vfio_spapr_pci_eeh_release(vdev->pdev);
vfio_pci_disable(vdev);
 +
 iommu_unregister_device_fault_handler(&vdev->pdev->dev);
>>>
>>>
>>> But this can fail if there are pending faults which leaves a device
>>> reference and then the system is broken :(  
>> This series only features unrecoverable errors and for those the
>> unregistration cannot fail. Now unrecoverable errors were added I
>> admit this is confusing. We need to sort this out or clean the
>> dependencies.
> As Alex pointed out in 4/29, we can make
> iommu_unregister_device_fault_handler() never fail and clean up all the
> pending faults in the host IOMMU belong to that device. But the problem
> is that if a fault, such as PRQ, has already been injected into the
> guest, the page response may come back after handler is unregistered
> and registered again.

I'm trying to figure out if that would be harmful in any way. I guess it
can be a bit nasty if we handle the page response right after having
injected a new page request that uses the same PRGI. In any other case we
discard the page response, but here we forward it to the endpoint and:

* If the response status is success, endpoint retries the translation. The
guest probably hasn't had time to handle the new page request and
translation will fail, which may lead the endpoint to give up (two
unsuccessful translation requests). Or send a new request

* otherwise the endpoint won't retry the access, and could also disable
PRI if the status is failure.

> We need a way to reject such page response belong
> to the previous life of the handler. Perhaps a sync call to 

Re: [PATCH v8 05/29] iommu: Add a timeout parameter for PRQ response

2019-06-04 Thread Jean-Philippe Brucker
On 03/06/2019 23:32, Alex Williamson wrote:
> It doesn't seem to make much sense to include this patch without also
> including "iommu: handle page response timeout".  Was that one lost?
> Dropped?  Lives elsewhere?

The first 7 patches come from my sva/api branch, where I had forgotten
to add the "handle page response timeout" patch. I added it back,
probably after Eric sent this version. But I don't think the patch is
ready for upstream, as we still haven't decided how to proceed with
timeouts. Patches 6 and 7 are for debugging, I don't know if they should
go upstream.

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [virtio-dev] Re: [PATCH v8 2/7] dt-bindings: virtio: Add virtio-pci-iommu node

2019-05-31 Thread Jean-Philippe Brucker
On 30/05/2019 18:45, Michael S. Tsirkin wrote:
> On Thu, May 30, 2019 at 06:09:24PM +0100, Jean-Philippe Brucker wrote:
>> Some systems implement virtio-iommu as a PCI endpoint. The operating
>> system needs to discover the relationship between IOMMU and masters long
>> before the PCI endpoint gets probed. Add a PCI child node to describe the
>> virtio-iommu device.
>>
>> The virtio-pci-iommu is conceptually split between a PCI programming
>> interface and a translation component on the parent bus. The latter
>> doesn't have a node in the device tree. The virtio-pci-iommu node
>> describes both, by linking the PCI endpoint to "iommus" property of DMA
>> master nodes and to "iommu-map" properties of bus nodes.
>>
>> Reviewed-by: Rob Herring 
>> Reviewed-by: Eric Auger 
>> Signed-off-by: Jean-Philippe Brucker 
> 
> So this is just an example right?
> We are not defining any new properties or anything like that.

Yes it's just an example. The properties already exist but it's good to
describe how to put them together for this particular case, because
there isn't a precedent describing the topology for an IOMMU that
appears on the PCI bus.

> I think down the road for non dt platforms we want to put this
> info in the config space of the device. I do not think ACPI
> is the best option for this since not all systems have it.
> But that can wait.

There is the probe order problem - PCI needs this info before starting
to probe devices on the bus. Maybe we could store the info in a separate
memory region, that is referenced on the command-line and that the guest
can read early.

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v8 6/7] iommu/virtio: Add probe request

2019-05-30 Thread Jean-Philippe Brucker
When the device offers the probe feature, send a probe request for each
device managed by the IOMMU. Extract RESV_MEM information. When we
encounter a MSI doorbell region, set it up as a IOMMU_RESV_MSI region.
This will tell other subsystems that there is no need to map the MSI
doorbell in the virtio-iommu, because MSIs bypass it.

Acked-by: Joerg Roedel 
Reviewed-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/virtio-iommu.c  | 157 --
 include/uapi/linux/virtio_iommu.h |  36 +++
 2 files changed, 187 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index b2719a87c3c5..5d4947c47420 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -49,6 +49,7 @@ struct viommu_dev {
u32 last_domain;
/* Supported MAP flags */
u32 map_flags;
+   u32 probe_size;
 };
 
 struct viommu_mapping {
@@ -71,8 +72,10 @@ struct viommu_domain {
 };
 
 struct viommu_endpoint {
+   struct device   *dev;
struct viommu_dev   *viommu;
struct viommu_domain*vdomain;
+   struct list_headresv_regions;
 };
 
 struct viommu_request {
@@ -125,6 +128,9 @@ static off_t viommu_get_write_desc_offset(struct viommu_dev 
*viommu,
 {
size_t tail_size = sizeof(struct virtio_iommu_req_tail);
 
+   if (req->type == VIRTIO_IOMMU_T_PROBE)
+   return len - viommu->probe_size - tail_size;
+
return len - tail_size;
 }
 
@@ -399,6 +405,110 @@ static int viommu_replay_mappings(struct viommu_domain 
*vdomain)
return ret;
 }
 
+static int viommu_add_resv_mem(struct viommu_endpoint *vdev,
+  struct virtio_iommu_probe_resv_mem *mem,
+  size_t len)
+{
+   size_t size;
+   u64 start64, end64;
+   phys_addr_t start, end;
+   struct iommu_resv_region *region = NULL;
+   unsigned long prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
+
+   start = start64 = le64_to_cpu(mem->start);
+   end = end64 = le64_to_cpu(mem->end);
+   size = end64 - start64 + 1;
+
+   /* Catch any overflow, including the unlikely end64 - start64 + 1 = 0 */
+   if (start != start64 || end != end64 || size < end64 - start64)
+   return -EOVERFLOW;
+
+   if (len < sizeof(*mem))
+   return -EINVAL;
+
+   switch (mem->subtype) {
+   default:
+   dev_warn(vdev->dev, "unknown resv mem subtype 0x%x\n",
+mem->subtype);
+   /* Fall-through */
+   case VIRTIO_IOMMU_RESV_MEM_T_RESERVED:
+   region = iommu_alloc_resv_region(start, size, 0,
+IOMMU_RESV_RESERVED);
+   break;
+   case VIRTIO_IOMMU_RESV_MEM_T_MSI:
+   region = iommu_alloc_resv_region(start, size, prot,
+IOMMU_RESV_MSI);
+   break;
+   }
+   if (!region)
+   return -ENOMEM;
+
+   list_add(>resv_regions, >list);
+   return 0;
+}
+
+static int viommu_probe_endpoint(struct viommu_dev *viommu, struct device *dev)
+{
+   int ret;
+   u16 type, len;
+   size_t cur = 0;
+   size_t probe_len;
+   struct virtio_iommu_req_probe *probe;
+   struct virtio_iommu_probe_property *prop;
+   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+   struct viommu_endpoint *vdev = fwspec->iommu_priv;
+
+   if (!fwspec->num_ids)
+   return -EINVAL;
+
+   probe_len = sizeof(*probe) + viommu->probe_size +
+   sizeof(struct virtio_iommu_req_tail);
+   probe = kzalloc(probe_len, GFP_KERNEL);
+   if (!probe)
+   return -ENOMEM;
+
+   probe->head.type = VIRTIO_IOMMU_T_PROBE;
+   /*
+* For now, assume that properties of an endpoint that outputs multiple
+* IDs are consistent. Only probe the first one.
+*/
+   probe->endpoint = cpu_to_le32(fwspec->ids[0]);
+
+   ret = viommu_send_req_sync(viommu, probe, probe_len);
+   if (ret)
+   goto out_free;
+
+   prop = (void *)probe->properties;
+   type = le16_to_cpu(prop->type) & VIRTIO_IOMMU_PROBE_T_MASK;
+
+   while (type != VIRTIO_IOMMU_PROBE_T_NONE &&
+  cur < viommu->probe_size) {
+   len = le16_to_cpu(prop->length) + sizeof(*prop);
+
+   switch (type) {
+   case VIRTIO_IOMMU_PROBE_T_RESV_MEM:
+   ret = viommu_add_resv_mem(vdev, (void *)prop, len);
+   break;
+   default:
+   dev_err(dev, "unknown viommu prop 0x%x\n", type);
+  

[PATCH v8 5/7] iommu: Add virtio-iommu driver

2019-05-30 Thread Jean-Philippe Brucker
The virtio IOMMU is a para-virtualized device, allowing to send IOMMU
requests such as map/unmap over virtio transport without emulating page
tables. This implementation handles ATTACH, DETACH, MAP and UNMAP
requests.

The bulk of the code transforms calls coming from the IOMMU API into
corresponding virtio requests. Mappings are kept in an interval tree
instead of page tables. A little more work is required for modular and x86
support, so for the moment the driver depends on CONFIG_VIRTIO=y and
CONFIG_ARM64.

Acked-by: Joerg Roedel 
Signed-off-by: Jean-Philippe Brucker 
---
 MAINTAINERS   |   7 +
 drivers/iommu/Kconfig |  11 +
 drivers/iommu/Makefile|   1 +
 drivers/iommu/virtio-iommu.c  | 934 ++
 include/uapi/linux/virtio_ids.h   |   1 +
 include/uapi/linux/virtio_iommu.h | 110 
 6 files changed, 1064 insertions(+)
 create mode 100644 drivers/iommu/virtio-iommu.c
 create mode 100644 include/uapi/linux/virtio_iommu.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 429c6c624861..62bd1834d95a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16807,6 +16807,13 @@ S: Maintained
 F: drivers/virtio/virtio_input.c
 F: include/uapi/linux/virtio_input.h
 
+VIRTIO IOMMU DRIVER
+M: Jean-Philippe Brucker 
+L: virtualizat...@lists.linux-foundation.org
+S: Maintained
+F: drivers/iommu/virtio-iommu.c
+F: include/uapi/linux/virtio_iommu.h
+
 VIRTUAL BOX GUEST DEVICE DRIVER
 M: Hans de Goede 
 M: Arnd Bergmann 
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 83664db5221d..e15cdcd8cb3c 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -473,4 +473,15 @@ config HYPERV_IOMMU
  Stub IOMMU driver to handle IRQs as to allow Hyper-V Linux
  guests to run with x2APIC mode enabled.
 
+config VIRTIO_IOMMU
+   bool "Virtio IOMMU driver"
+   depends on VIRTIO=y
+   depends on ARM64
+   select IOMMU_API
+   select INTERVAL_TREE
+   help
+ Para-virtualised IOMMU driver with virtio.
+
+ Say Y here if you intend to run this kernel as a guest.
+
 endif # IOMMU_SUPPORT
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 8c71a15e986b..f13f36ae1af6 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -33,3 +33,4 @@ obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
 obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
 obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o
 obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
+obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
new file mode 100644
index ..b2719a87c3c5
--- /dev/null
+++ b/drivers/iommu/virtio-iommu.c
@@ -0,0 +1,934 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Virtio driver for the paravirtualized IOMMU
+ *
+ * Copyright (C) 2019 Arm Limited
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#define MSI_IOVA_BASE			0x8000000
+#define MSI_IOVA_LENGTH			0x100000
+
+#define VIOMMU_REQUEST_VQ  0
+#define VIOMMU_NR_VQS  1
+
+struct viommu_dev {
+   struct iommu_device iommu;
+   struct device   *dev;
+   struct virtio_device*vdev;
+
+   struct ida  domain_ids;
+
+   struct virtqueue*vqs[VIOMMU_NR_VQS];
+   spinlock_t  request_lock;
+   struct list_headrequests;
+
+   /* Device configuration */
+   struct iommu_domain_geometrygeometry;
+   u64 pgsize_bitmap;
+   u32 first_domain;
+   u32 last_domain;
+   /* Supported MAP flags */
+   u32 map_flags;
+};
+
+struct viommu_mapping {
+   phys_addr_t paddr;
+   struct interval_tree_node   iova;
+   u32 flags;
+};
+
+struct viommu_domain {
+   struct iommu_domain domain;
+   struct viommu_dev   *viommu;
+   struct mutexmutex; /* protects viommu pointer */
+   unsigned intid;
+   u32 map_flags;
+
+   spinlock_t  mappings_lock;
+   struct rb_root_cached   mappings;
+
+   unsigned long   nr_endpoints;
+};
+
+struct viommu_endpoint {
+   struct viommu_dev   *viommu;
+   struct viommu_domain*vdomain;
+};
+
+struct viommu_request {
+   struct list_headlist;
+   void*writeback;
+   unsigned int

[PATCH v8 7/7] iommu/virtio: Add event queue

2019-05-30 Thread Jean-Philippe Brucker
The event queue offers a way for the device to report access faults from
endpoints. It is implemented on virtqueue #1. Whenever the host needs to
signal a fault, it fills one of the buffers offered by the guest and
interrupts it.

Acked-by: Joerg Roedel 
Reviewed-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/virtio-iommu.c  | 115 +++---
 include/uapi/linux/virtio_iommu.h |  19 +
 2 files changed, 125 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 5d4947c47420..2688cdcac6e5 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -29,7 +29,8 @@
 #define MSI_IOVA_LENGTH			0x100000
 
 #define VIOMMU_REQUEST_VQ  0
-#define VIOMMU_NR_VQS  1
+#define VIOMMU_EVENT_VQ1
+#define VIOMMU_NR_VQS  2
 
 struct viommu_dev {
struct iommu_device iommu;
@@ -41,6 +42,7 @@ struct viommu_dev {
struct virtqueue*vqs[VIOMMU_NR_VQS];
spinlock_t  request_lock;
struct list_headrequests;
+   void*evts;
 
/* Device configuration */
struct iommu_domain_geometrygeometry;
@@ -86,6 +88,15 @@ struct viommu_request {
charbuf[];
 };
 
+#define VIOMMU_FAULT_RESV_MASK		0xffffff00
+
+struct viommu_event {
+   union {
+   u32 head;
+   struct virtio_iommu_fault fault;
+   };
+};
+
 #define to_viommu_domain(domain)   \
container_of(domain, struct viommu_domain, domain)
 
@@ -509,6 +520,68 @@ static int viommu_probe_endpoint(struct viommu_dev 
*viommu, struct device *dev)
return ret;
 }
 
+static int viommu_fault_handler(struct viommu_dev *viommu,
+   struct virtio_iommu_fault *fault)
+{
+   char *reason_str;
+
+   u8 reason   = fault->reason;
+   u32 flags   = le32_to_cpu(fault->flags);
+   u32 endpoint= le32_to_cpu(fault->endpoint);
+   u64 address = le64_to_cpu(fault->address);
+
+   switch (reason) {
+   case VIRTIO_IOMMU_FAULT_R_DOMAIN:
+   reason_str = "domain";
+   break;
+   case VIRTIO_IOMMU_FAULT_R_MAPPING:
+   reason_str = "page";
+   break;
+   case VIRTIO_IOMMU_FAULT_R_UNKNOWN:
+   default:
+   reason_str = "unknown";
+   break;
+   }
+
+   /* TODO: find EP by ID and report_iommu_fault */
+   if (flags & VIRTIO_IOMMU_FAULT_F_ADDRESS)
+   dev_err_ratelimited(viommu->dev, "%s fault from EP %u at %#llx 
[%s%s%s]\n",
+   reason_str, endpoint, address,
+   flags & VIRTIO_IOMMU_FAULT_F_READ ? "R" : 
"",
+   flags & VIRTIO_IOMMU_FAULT_F_WRITE ? "W" : 
"",
+   flags & VIRTIO_IOMMU_FAULT_F_EXEC ? "X" : 
"");
+   else
+   dev_err_ratelimited(viommu->dev, "%s fault from EP %u\n",
+   reason_str, endpoint);
+   return 0;
+}
+
+static void viommu_event_handler(struct virtqueue *vq)
+{
+   int ret;
+   unsigned int len;
+   struct scatterlist sg[1];
+   struct viommu_event *evt;
+   struct viommu_dev *viommu = vq->vdev->priv;
+
+   while ((evt = virtqueue_get_buf(vq, &len)) != NULL) {
+   if (len > sizeof(*evt)) {
+   dev_err(viommu->dev,
+   "invalid event buffer (len %u != %zu)\n",
+   len, sizeof(*evt));
+   } else if (!(evt->head & VIOMMU_FAULT_RESV_MASK)) {
+   viommu_fault_handler(viommu, >fault);
+   }
+
+   sg_init_one(sg, evt, sizeof(*evt));
+   ret = virtqueue_add_inbuf(vq, sg, 1, evt, GFP_ATOMIC);
+   if (ret)
+   dev_err(viommu->dev, "could not add event buffer\n");
+   }
+
+   virtqueue_kick(vq);
+}
+
 /* IOMMU API */
 
 static struct iommu_domain *viommu_domain_alloc(unsigned type)
@@ -895,16 +968,35 @@ static struct iommu_ops viommu_ops = {
 static int viommu_init_vqs(struct viommu_dev *viommu)
 {
struct virtio_device *vdev = dev_to_virtio(viommu->dev);
-   const char *name = "request";
-   void *ret;
+   const char *names[] = { "request", "event" };
+   vq_callback_t *callbacks[] = {
+   NULL, /* No async requests */
+   viommu_event_handler,
+   };
 
-   ret = virtio_find_single_vq(vde

[PATCH v8 1/7] dt-bindings: virtio-mmio: Add IOMMU description

2019-05-30 Thread Jean-Philippe Brucker
The nature of a virtio-mmio node is discovered by the virtio driver at
probe time. However the DMA relation between devices must be described
statically. When a virtio-mmio node is a virtio-iommu device, it needs an
"#iommu-cells" property as specified by bindings/iommu/iommu.txt.

Otherwise, the virtio-mmio device may perform DMA through an IOMMU, which
requires an "iommus" property. Describe these requirements in the
device-tree bindings documentation.

Reviewed-by: Rob Herring 
Reviewed-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 
---
 .../devicetree/bindings/virtio/mmio.txt   | 30 +++
 1 file changed, 30 insertions(+)

diff --git a/Documentation/devicetree/bindings/virtio/mmio.txt 
b/Documentation/devicetree/bindings/virtio/mmio.txt
index 5069c1b8e193..21af30fbb81f 100644
--- a/Documentation/devicetree/bindings/virtio/mmio.txt
+++ b/Documentation/devicetree/bindings/virtio/mmio.txt
@@ -8,10 +8,40 @@ Required properties:
 - reg: control registers base address and size including configuration 
space
 - interrupts:  interrupt generated by the device
 
+Required properties for virtio-iommu:
+
+- #iommu-cells:When the node corresponds to a virtio-iommu device, it 
is
+   linked to DMA masters using the "iommus" or "iommu-map"
+   properties [1][2]. #iommu-cells specifies the size of the
+   "iommus" property. For virtio-iommu #iommu-cells must be
+   1, each cell describing a single endpoint ID.
+
+Optional properties:
+
+- iommus:  If the device accesses memory through an IOMMU, it should
+   have an "iommus" property [1]. Since virtio-iommu itself
+   does not access memory through an IOMMU, the "virtio,mmio"
+   node cannot have both an "#iommu-cells" and an "iommus"
+   property.
+
 Example:
 
virtio_block@3000 {
compatible = "virtio,mmio";
reg = <0x3000 0x100>;
interrupts = <41>;
+
+   /* Device has endpoint ID 23 */
+   iommus = <&viommu 23>
}
+
+   viommu: iommu@3100 {
+   compatible = "virtio,mmio";
+   reg = <0x3100 0x100>;
+   interrupts = <42>;
+
+   #iommu-cells = <1>
+   }
+
+[1] Documentation/devicetree/bindings/iommu/iommu.txt
+[2] Documentation/devicetree/bindings/pci/pci-iommu.txt
-- 
2.21.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v8 4/7] PCI: OF: Initialize dev->fwnode appropriately

2019-05-30 Thread Jean-Philippe Brucker
For PCI devices that have an OF node, set the fwnode as well. This way
drivers that rely on fwnode don't need the special case described by
commit f94277af03ea ("of/platform: Initialise dev->fwnode appropriately").

Acked-by: Bjorn Helgaas 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/pci/of.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/pci/of.c b/drivers/pci/of.c
index 73d5adec0a28..c4f1b5507b40 100644
--- a/drivers/pci/of.c
+++ b/drivers/pci/of.c
@@ -22,12 +22,15 @@ void pci_set_of_node(struct pci_dev *dev)
return;
dev->dev.of_node = of_pci_find_child_device(dev->bus->dev.of_node,
dev->devfn);
+   if (dev->dev.of_node)
+   dev->dev.fwnode = &dev->dev.of_node->fwnode;
 }
 
 void pci_release_of_node(struct pci_dev *dev)
 {
of_node_put(dev->dev.of_node);
dev->dev.of_node = NULL;
+   dev->dev.fwnode = NULL;
 }
 
 void pci_set_bus_of_node(struct pci_bus *bus)
@@ -42,12 +45,15 @@ void pci_set_bus_of_node(struct pci_bus *bus)
bus->self->untrusted = true;
}
bus->dev.of_node = node;
+   if (node)
+   bus->dev.fwnode = &node->fwnode;
 }
 
 void pci_release_bus_of_node(struct pci_bus *bus)
 {
of_node_put(bus->dev.of_node);
bus->dev.of_node = NULL;
+   bus->dev.fwnode = NULL;
 }
 
 struct device_node * __weak pcibios_get_phb_of_node(struct pci_bus *bus)
-- 
2.21.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v8 3/7] of: Allow the iommu-map property to omit untranslated devices

2019-05-30 Thread Jean-Philippe Brucker
In PCI root complex nodes, the iommu-map property describes the IOMMU that
translates each endpoint. On some platforms, the IOMMU itself is presented
as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). This isn't supported
by the current OF driver, which expects all endpoints to have an IOMMU.
Allow the iommu-map property to have gaps.

Relaxing of_map_rid() also allows the msi-map property to have gaps, which
is invalid since MSIs always reach an MSI controller. In that case
pci_msi_setup_msi_irqs() will return an error when attempting to find the
device's MSI domain.

Reviewed-by: Rob Herring 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/of/base.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index 20e0e7ee4edf..55e7f5bb0549 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -2294,8 +2294,12 @@ int of_map_rid(struct device_node *np, u32 rid,
return 0;
}
 
-   pr_err("%pOF: Invalid %s translation - no match for rid 0x%x on %pOF\n",
-   np, map_name, rid, target && *target ? *target : NULL);
-   return -EFAULT;
+   pr_info("%pOF: no %s translation for rid 0x%x on %pOF\n", np, map_name,
+   rid, target && *target ? *target : NULL);
+
+   /* Bypasses translation */
+   if (id_out)
+   *id_out = rid;
+   return 0;
 }
 EXPORT_SYMBOL_GPL(of_map_rid);
-- 
2.21.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v8 2/7] dt-bindings: virtio: Add virtio-pci-iommu node

2019-05-30 Thread Jean-Philippe Brucker
Some systems implement virtio-iommu as a PCI endpoint. The operating
system needs to discover the relationship between IOMMU and masters long
before the PCI endpoint gets probed. Add a PCI child node to describe the
virtio-iommu device.

The virtio-pci-iommu is conceptually split between a PCI programming
interface and a translation component on the parent bus. The latter
doesn't have a node in the device tree. The virtio-pci-iommu node
describes both, by linking the PCI endpoint to "iommus" property of DMA
master nodes and to "iommu-map" properties of bus nodes.

Reviewed-by: Rob Herring 
Reviewed-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 
---
 .../devicetree/bindings/virtio/iommu.txt  | 66 +++
 1 file changed, 66 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/virtio/iommu.txt

diff --git a/Documentation/devicetree/bindings/virtio/iommu.txt 
b/Documentation/devicetree/bindings/virtio/iommu.txt
new file mode 100644
index ..2407fea0651c
--- /dev/null
+++ b/Documentation/devicetree/bindings/virtio/iommu.txt
@@ -0,0 +1,66 @@
+* virtio IOMMU PCI device
+
+When virtio-iommu uses the PCI transport, its programming interface is
+discovered dynamically by the PCI probing infrastructure. However the
+device tree statically describes the relation between IOMMU and DMA
+masters. Therefore, the PCI root complex that hosts the virtio-iommu
+contains a child node representing the IOMMU device explicitly.
+
+Required properties:
+
+- compatible:  Should be "virtio,pci-iommu"
+- reg: PCI address of the IOMMU. As defined in the PCI Bus
+   Binding reference [1], the reg property is a five-cell
+   address encoded as (phys.hi phys.mid phys.lo size.hi
+   size.lo). phys.hi should contain the device's BDF as
+   0b00000000 bbbbbbbb dddddfff 00000000. The other cells
+   should be zero.
+- #iommu-cells:Each platform DMA master managed by the IOMMU is 
assigned
+   an endpoint ID, described by the "iommus" property [2].
+   For virtio-iommu, #iommu-cells must be 1.
+
+Notes:
+
+- DMA from the IOMMU device isn't managed by another IOMMU. Therefore the
+  virtio-iommu node doesn't have an "iommus" property, and is omitted from
+  the iommu-map property of the root complex.
+
+Example:
+
+pcie@10000000 {
+   compatible = "pci-host-ecam-generic";
+   ...
+
+   /* The IOMMU programming interface uses slot 00:01.0 */
+   iommu0: iommu@0008 {
+   compatible = "virtio,pci-iommu";
+   reg = <0x0800 0 0 0 0>;
+   #iommu-cells = <1>;
+   };
+
+   /*
+* The IOMMU manages all functions in this PCI domain except
+* itself. Omit BDF 00:01.0.
+*/
+   iommu-map = <0x0 &iommu0 0x0 0x8>
+               <0x9 &iommu0 0x9 0xfff7>;
+};
+
+pcie@20000000 {
+   compatible = "pci-host-ecam-generic";
+   ...
+   /*
+* The IOMMU also manages all functions from this domain,
+* with endpoint IDs 0x10000 - 0x1ffff
+*/
+   iommu-map = <0x0 &iommu0 0x10000 0x10000>;
+};
+
+ethernet@fe001000 {
+   ...
+   /* The IOMMU manages this platform device with endpoint ID 0x20000 */
+   iommus = <&iommu0 0x20000>;
+};
+
+[1] Documentation/devicetree/bindings/pci/pci.txt
+[2] Documentation/devicetree/bindings/iommu/iommu.txt
-- 
2.21.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v8 0/7] Add virtio-iommu driver

2019-05-30 Thread Jean-Philippe Brucker
Implement the virtio-iommu driver, following specification v0.12 [1].
Since last version [2] we've worked on improving the specification,
which resulted in the following changes to the interface:
* Remove the EXEC flag.
* Add feature bit for the MMIO flag.
* Change domain_bits to domain_range.

Given that there were small changes to patch 5/7, I removed the review
and test tags. Please find the code at [3].

[1] Virtio-iommu specification v0.12, sources and pdf
git://linux-arm.org/virtio-iommu.git virtio-iommu/v0.12
http://jpbrucker.net/virtio-iommu/spec/v0.12/virtio-iommu-v0.12.pdf

http://jpbrucker.net/virtio-iommu/spec/diffs/virtio-iommu-dev-diff-v0.11-v0.12.pdf

[2] [PATCH v7 0/7] Add virtio-iommu driver

https://lore.kernel.org/linux-pci/0ba215f5-e856-bf31-8dd9-a85710714...@arm.com/T/

[3] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.12
git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.12

Jean-Philippe Brucker (7):
  dt-bindings: virtio-mmio: Add IOMMU description
  dt-bindings: virtio: Add virtio-pci-iommu node
  of: Allow the iommu-map property to omit untranslated devices
  PCI: OF: Initialize dev->fwnode appropriately
  iommu: Add virtio-iommu driver
  iommu/virtio: Add probe request
  iommu/virtio: Add event queue

 .../devicetree/bindings/virtio/iommu.txt  |   66 +
 .../devicetree/bindings/virtio/mmio.txt   |   30 +
 MAINTAINERS   |7 +
 drivers/iommu/Kconfig |   11 +
 drivers/iommu/Makefile|1 +
 drivers/iommu/virtio-iommu.c  | 1176 +
 drivers/of/base.c |   10 +-
 drivers/pci/of.c  |6 +
 include/uapi/linux/virtio_ids.h   |1 +
 include/uapi/linux/virtio_iommu.h |  165 +++
 10 files changed, 1470 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/virtio/iommu.txt
 create mode 100644 drivers/iommu/virtio-iommu.c
 create mode 100644 include/uapi/linux/virtio_iommu.h

-- 
2.21.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v7 0/7] Add virtio-iommu driver

2019-05-28 Thread Jean-Philippe Brucker
On 27/05/2019 16:15, Michael S. Tsirkin wrote:
> On Mon, May 27, 2019 at 11:26:04AM +0200, Joerg Roedel wrote:
>> On Sun, May 12, 2019 at 12:31:59PM -0400, Michael S. Tsirkin wrote:
>>> OK this has been in next for a while.
>>>
>>> Last time IOMMU maintainers objected. Are objections
>>> still in force?
>>>
>>> If not could we get acks please?
>>
>> No objections against the code, I only hesitated because the Spec was
>> not yet official.
>>
>> So for the code:
>>
>>  Acked-by: Joerg Roedel 
> 
> Last spec patch had a bunch of comments not yet addressed.
> But I do not remember whether comments are just about wording
> or about the host/guest interface as well.
> Jean-Philippe could you remind me please?

It's mostly wording, but there is a small change in the config space
layout and two new feature bits. I'll send a new version of the driver
when possible.

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v7 04/23] iommu: Introduce attach/detach_pasid_table API

2019-05-15 Thread Jean-Philippe Brucker
On 15/05/2019 14:06, Auger Eric wrote:
> Hi Jean-Philippe,
> 
> On 5/15/19 2:09 PM, Jean-Philippe Brucker wrote:
>> On 08/04/2019 13:18, Eric Auger wrote:
>>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>>> index edcc0dda7993..532a64075f23 100644
>>> --- a/include/uapi/linux/iommu.h
>>> +++ b/include/uapi/linux/iommu.h
>>> @@ -112,4 +112,51 @@ struct iommu_fault {
>>> struct iommu_fault_page_request prm;
>>> };
>>>  };
>>> +
>>> +/**
>>> + * SMMUv3 Stream Table Entry stage 1 related information
>>> + * The PASID table is referred to as the context descriptor (CD) table.
>>> + *
>>> + * @s1fmt: STE s1fmt (format of the CD table: single CD, linear table
>>> +   or 2-level table)
>>
>> Running "scripts/kernel-doc -v -none" on this header produces some
>> warnings. Not sure if we want to get rid of all of them, but we should
>> at least fix the coding style for this comment (line must start with
>> " * "). I'm fixing it up on my sva/api branch
> Thanks!
> 
> Let me know if you want me to do the job for additional fixes.

I fixed the other warnings as well, in case we ever want to include
this in the kernel doc.

Thanks,
Jean

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v7 04/23] iommu: Introduce attach/detach_pasid_table API

2019-05-15 Thread Jean-Philippe Brucker
On 08/04/2019 13:18, Eric Auger wrote:
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index edcc0dda7993..532a64075f23 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -112,4 +112,51 @@ struct iommu_fault {
>   struct iommu_fault_page_request prm;
>   };
>  };
> +
> +/**
> + * SMMUv3 Stream Table Entry stage 1 related information
> + * The PASID table is referred to as the context descriptor (CD) table.
> + *
> + * @s1fmt: STE s1fmt (format of the CD table: single CD, linear table
> +   or 2-level table)

Running "scripts/kernel-doc -v -none" on this header produces some
warnings. Not sure if we want to get rid of all of them, but we should
at least fix the coding style for this comment (line must start with
" * "). I'm fixing it up on my sva/api branch

Thanks,
Jean

> + * @s1dss: STE s1dss (specifies the behavior when pasid_bits != 0
> +   and no pasid is passed along with the incoming transaction)
> + * Please refer to the smmu 3.x spec (ARM IHI 0070A) for full details
> + */
> +struct iommu_pasid_smmuv3 {
> +#define PASID_TABLE_SMMUV3_CFG_VERSION_1 1
> + __u32   version;
> + __u8 s1fmt;
> + __u8 s1dss;
> + __u8 padding[2];
> +};
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v7 11/23] iommu/arm-smmu-v3: Maintain a SID->device structure

2019-05-08 Thread Jean-Philippe Brucker
On 08/05/2019 15:05, Robin Murphy wrote:
> On 08/04/2019 13:18, Eric Auger wrote:
>> From: Jean-Philippe Brucker 
>>
>> When handling faults from the event or PRI queue, we need to find the
>> struct device associated to a SID. Add a rb_tree to keep track of SIDs.
> 
> Out of curiosity, have you looked at whether an xarray might now be a
> more efficient option for this?

I hadn't looked into it yet, but it's a welcome distraction.

* Searching by SID will be more efficient with xarray (which still is a
radix tree, with a better API). Rather than O(log2(n)) we walk
O(log_c(n)) nodes in the worst case, with c = XA_CHUNK_SIZE = 64. We
don't care about insertion/deletion time.

* Memory consumption is worse than rb-tree, when the SID space is a
little sparse. For PCI devices the three LSBs (function number) might
not be in use, meaning that 88% of the leaf slots would be unused. And
it gets worse if the system has lots of bridges, as each bus number
requires its own xa slot, i.e. 98% unused.

  It's not too bad though, and in general I think the distribution of
SIDs would be good enough to justify using xarray. Plugging in more
devices would increase the memory consumption fast, but creating virtual
functions wouldn't. On one machine (TX2, a few discrete PCI cards) I
need 16 xa slots to store 42 device IDs. That's 16 * 576 bytes = 9 kB,
versus 42 * 40 bytes = 1.6 kB for the rb-tree. On another machine (x86,
lots of RC integrated endpoints) I need 18 slots to store 181 device
IDs, 10 kB vs. 7 kB with the rb-tree.

* Using xa would make this code a lot nicer.
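
A rough sketch of what that could look like (the function names are made up for illustration, not the actual SMMUv3 driver code):

#include <linux/xarray.h>
#include <linux/device.h>

/* One xarray per SMMU instance, indexed by StreamID */
static int sid_tree_insert(struct xarray *sids, u32 sid, struct device *dev)
{
	/* xa_insert() returns -EBUSY if the SID is already registered */
	return xa_insert(sids, sid, dev, GFP_KERNEL);
}

static struct device *sid_tree_find(struct xarray *sids, u32 sid)
{
	/* O(log64(n)) node walks in the worst case, as discussed above */
	return xa_load(sids, sid);
}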

Shame that we can't store the device pointer directly in the STE, though;
there is already plenty of unused space in there.

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v7 05/23] iommu: Introduce cache_invalidate API

2019-05-07 Thread Jean-Philippe Brucker
On 02/05/2019 17:46, Jacob Pan wrote:
> On Thu, 2 May 2019 11:53:34 +0100
> Jean-Philippe Brucker  wrote:
> 
>> On 02/05/2019 07:58, Auger Eric wrote:
>>> Hi Jean-Philippe,
>>>
>>> On 5/1/19 12:38 PM, Jean-Philippe Brucker wrote:  
>>>> On 08/04/2019 13:18, Eric Auger wrote:  
>>>>> +int iommu_cache_invalidate(struct iommu_domain *domain, struct
>>>>> device *dev,
>>>>> +struct iommu_cache_invalidate_info
>>>>> *inv_info) +{
>>>>> + int ret = 0;
>>>>> +
>>>>> + if (unlikely(!domain->ops->cache_invalidate))
>>>>> + return -ENODEV;
>>>>> +
>>>>> + ret = domain->ops->cache_invalidate(domain, dev,
>>>>> inv_info); +
>>>>> + return ret;  
>>>>
>>>> Nit: you don't really need ret
>>>>
>>>> The UAPI looks good to me, so
>>>>
>>>> Reviewed-by: Jean-Philippe Brucker
>>>>   
>>> Just to make sure, do you accept changes proposed by Jacob in
>>> https://lkml.org/lkml/2019/4/29/659 ie.
>>> - the addition of NR_IOMMU_INVAL_GRANU in enum
>>> iommu_inv_granularity and
>>> - the addition of NR_IOMMU_CACHE_TYPE  
>>
>> Ah sorry, I forgot about that, I'll review the next version. Yes they
>> can be useful (maybe call them IOMMU_INV_GRANU_NR and
>> IOMMU_CACHE_INV_TYPE_NR?). I guess it's legal to export in UAPI values
>> that will change over time, as VFIO also does it in its enums.
>>
> I am fine with the names. Maybe you can put this patch in your sva/api
> branch once you reviewed it? Having a common branch for common code
> makes life so much easier.

Done, with minor whitespace and name fixes

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v7 05/23] iommu: Introduce cache_invalidate API

2019-05-02 Thread Jean-Philippe Brucker
On 02/05/2019 07:58, Auger Eric wrote:
> Hi Jean-Philippe,
> 
> On 5/1/19 12:38 PM, Jean-Philippe Brucker wrote:
>> On 08/04/2019 13:18, Eric Auger wrote:
>>> +int iommu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
>>> +  struct iommu_cache_invalidate_info *inv_info)
>>> +{
>>> +   int ret = 0;
>>> +
>>> +   if (unlikely(!domain->ops->cache_invalidate))
>>> +   return -ENODEV;
>>> +
>>> +   ret = domain->ops->cache_invalidate(domain, dev, inv_info);
>>> +
>>> +   return ret;
>>
>> Nit: you don't really need ret
>>
>> The UAPI looks good to me, so
>>
>> Reviewed-by: Jean-Philippe Brucker 
> Just to make sure, do you accept changes proposed by Jacob in
> https://lkml.org/lkml/2019/4/29/659 ie.
> - the addition of NR_IOMMU_INVAL_GRANU in enum iommu_inv_granularity and
> - the addition of NR_IOMMU_CACHE_TYPE

Ah sorry, I forgot about that, I'll review the next version. Yes they
can be useful (maybe call them IOMMU_INV_GRANU_NR and
IOMMU_CACHE_INV_TYPE_NR?). I guess it's legal to export in UAPI values
that will change over time, as VFIO also does it in its enums.

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v7 05/23] iommu: Introduce cache_invalidate API

2019-05-01 Thread Jean-Philippe Brucker
On 08/04/2019 13:18, Eric Auger wrote:
> +int iommu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
> +struct iommu_cache_invalidate_info *inv_info)
> +{
> + int ret = 0;
> +
> + if (unlikely(!domain->ops->cache_invalidate))
> + return -ENODEV;
> +
> + ret = domain->ops->cache_invalidate(domain, dev, inv_info);
> +
> + return ret;

Nit: you don't really need ret

The UAPI looks good to me, so

Reviewed-by: Jean-Philippe Brucker 
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH] vfio-pci: Fix MSI IRQ forwarding for without per-vector masking

2019-04-10 Thread Jean-Philippe Brucker
Hi Leo,

On 22/03/2019 05:23, Leo Yan wrote:
> If MSI doesn't support per-vector masking capability and
> PCI_MSI_FLAGS_MASKBIT isn't set in message control field, the function
> vfio_pci_msi_vector_write() will directly bail out for this case and
> every vector's 'virt_state' keeps setting bit VFIO_PCI_MSI_STATE_MASKED.
> 
> This results in the state maintained in 'virt_state' cannot really
> reflect the MSI hardware state; finally it will mislead the function
> vfio_pci_update_msi_entry() to skip IRQ forwarding with below flow:
> 
> vfio_pci_update_msi_entry() {
> 
>   [...]
> 
>   if (msi_is_masked(entry->virt_state) == msi_is_masked(entry->phys_state))
>   return 0;  ==> skip IRQ forwarding
> 
>   [...]
> }
> 
> To fix this issue, when detect PCI_MSI_FLAGS_MASKBIT is not set in the
> message control field, this patch simply clears bit
> VFIO_PCI_MSI_STATE_MASKED for all vectors 'virt_state'; at the end
> vfio_pci_update_msi_entry() can forward MSI IRQ successfully.
> 
> Signed-off-by: Leo Yan 
> ---
>  vfio/pci.c | 12 +++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/vfio/pci.c b/vfio/pci.c
> index ba971eb..4fd24ac 100644
> --- a/vfio/pci.c
> +++ b/vfio/pci.c
> @@ -363,8 +363,18 @@ static int vfio_pci_msi_vector_write(struct kvm *kvm, 
> struct vfio_device *vdev,
>   struct vfio_pci_device *pdev = &vdev->pci;
>   struct msi_cap_64 *msi_cap_64 = PCI_CAP(&pdev->hdr, pdev->msi.pos);
>  
> - if (!(msi_cap_64->ctrl & PCI_MSI_FLAGS_MASKBIT))
> + if (!(msi_cap_64->ctrl & PCI_MSI_FLAGS_MASKBIT)) {
> + /*
> +  * If MSI doesn't support per-vector masking capability,
> +  * simply unmask for all vectors.
> +  */
> + for (i = 0; i < pdev->msi.nr_entries; i++) {
> + entry = &pdev->msi.entries[i];
> + msi_set_masked(entry->virt_state, false);
> + }
> +

This seems like the wrong place for this fix.
vfio_pci_msi_vector_write() is called every time the guest pokes the MSI
capability, and checks whether the access was on the Mask Bits Register.
If the function doesn't support per-vector masking, then the register
isn't implemented and we shouldn't do anything here.

To fix the problem I think we need to set masked(virt_state) properly at
init time, instead of blindly setting it to true. In fact from the
guest's point of view, MSIs and MSI-X are unmasked (and disabled) at
boot, so you could always set masked(virt_state) to false, but it may be
safer to copy the actual state of the MSI's Mask Bits into virt_state,
since for MSI, it could be non zero. If per-vector masking isn't
supported, then the virt state should be false.
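
Something along these lines could do it (a sketch only; the mask_bits field name of the MSI capability is an assumption about kvmtool's structures, not checked against the tree):

static void vfio_pci_msi_init_state(struct vfio_pci_device *pdev)
{
	struct msi_cap_64 *msi_cap_64 = PCI_CAP(&pdev->hdr, pdev->msi.pos);
	bool can_mask = msi_cap_64->ctrl & PCI_MSI_FLAGS_MASKBIT;
	unsigned int i;

	for (i = 0; i < pdev->msi.nr_entries; i++) {
		/* Assumed field: mask_bits mirrors the MSI Mask Bits register */
		bool masked = can_mask && (msi_cap_64->mask_bits & (1 << i));

		msi_set_masked(pdev->msi.entries[i].virt_state, masked);
	}
}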

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH kvmtool v3 2/3] vfio-pci: Add new function for INTx one-time initialisation

2019-04-04 Thread Jean-Philippe Brucker
On 26/03/2019 07:41, Leo Yan wrote:
> To support INTx enabling for multiple times, we need firstly to extract
> one-time initialisation and move the related code into a new function
> vfio_pci_init_intx(); if later disable and re-enable the INTx, we can
> skip these one-time operations.
> 
> This patch move below three main operations for INTx one-time
> initialisation from function vfio_pci_enable_intx() into function
> vfio_pci_init_intx():
> 
> - Reserve 2 FDs for INTx;
> - Sanity check with ioctl VFIO_DEVICE_GET_IRQ_INFO;
> - Setup pdev->intx_gsi.
> 
> Suggested-by: Jean-Philippe Brucker 
> Signed-off-by: Leo Yan 

Thanks for the patches

Reviewed-by: Jean-Philippe Brucker 

> ---
>  vfio/pci.c | 67 --
>  1 file changed, 40 insertions(+), 27 deletions(-)
> 
> diff --git a/vfio/pci.c b/vfio/pci.c
> index 5224fee..3c39844 100644
> --- a/vfio/pci.c
> +++ b/vfio/pci.c
> @@ -1018,30 +1018,7 @@ static int vfio_pci_enable_intx(struct kvm *kvm, 
> struct vfio_device *vdev)
>   struct vfio_irq_eventfd trigger;
>   struct vfio_irq_eventfd unmask;
>   struct vfio_pci_device *pdev = &vdev->pci;
> - int gsi = pdev->hdr.irq_line - KVM_IRQ_OFFSET;
> -
> - struct vfio_irq_info irq_info = {
> - .argsz = sizeof(irq_info),
> - .index = VFIO_PCI_INTX_IRQ_INDEX,
> - };
> -
> - vfio_pci_reserve_irq_fds(2);
> -
> - ret = ioctl(vdev->fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
> - if (ret || irq_info.count == 0) {
> - vfio_dev_err(vdev, "no INTx reported by VFIO");
> - return -ENODEV;
> - }
> -
> - if (!(irq_info.flags & VFIO_IRQ_INFO_EVENTFD)) {
> - vfio_dev_err(vdev, "interrupt not eventfd capable");
> - return -EINVAL;
> - }
> -
> - if (!(irq_info.flags & VFIO_IRQ_INFO_AUTOMASKED)) {
> - vfio_dev_err(vdev, "INTx interrupt not AUTOMASKED");
> - return -EINVAL;
> - }
> + int gsi = pdev->intx_gsi;
>  
>   /*
>* PCI IRQ is level-triggered, so we use two eventfds. trigger_fd
> @@ -1097,8 +1074,6 @@ static int vfio_pci_enable_intx(struct kvm *kvm, struct 
> vfio_device *vdev)
>  
>   pdev->intx_fd = trigger_fd;
>   pdev->unmask_fd = unmask_fd;
> - /* Guest is going to ovewrite our irq_line... */
> - pdev->intx_gsi = gsi;
>  
>   return 0;
>  
> @@ -1117,6 +1092,39 @@ err_close:
>   return ret;
>  }
>  
> +static int vfio_pci_init_intx(struct kvm *kvm, struct vfio_device *vdev)
> +{
> + int ret;
> + struct vfio_pci_device *pdev = &vdev->pci;
> + struct vfio_irq_info irq_info = {
> + .argsz = sizeof(irq_info),
> + .index = VFIO_PCI_INTX_IRQ_INDEX,
> + };
> +
> + vfio_pci_reserve_irq_fds(2);
> +
> + ret = ioctl(vdev->fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
> + if (ret || irq_info.count == 0) {
> + vfio_dev_err(vdev, "no INTx reported by VFIO");
> + return -ENODEV;
> + }
> +
> + if (!(irq_info.flags & VFIO_IRQ_INFO_EVENTFD)) {
> + vfio_dev_err(vdev, "interrupt not eventfd capable");
> + return -EINVAL;
> + }
> +
> + if (!(irq_info.flags & VFIO_IRQ_INFO_AUTOMASKED)) {
> + vfio_dev_err(vdev, "INTx interrupt not AUTOMASKED");
> + return -EINVAL;
> + }
> +
> + /* Guest is going to ovewrite our irq_line... */
> + pdev->intx_gsi = pdev->hdr.irq_line - KVM_IRQ_OFFSET;
> +
> + return 0;
> +}
> +
>  static int vfio_pci_configure_dev_irqs(struct kvm *kvm, struct vfio_device 
> *vdev)
>  {
>   int ret = 0;
> @@ -1142,8 +1150,13 @@ static int vfio_pci_configure_dev_irqs(struct kvm 
> *kvm, struct vfio_device *vdev
>   return ret;
>   }
>  
> - if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_INTX)
> + if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_INTX) {
> + ret = vfio_pci_init_intx(kvm, vdev);
> + if (ret)
> + return ret;
> +
>   ret = vfio_pci_enable_intx(kvm, vdev);
> + }
>  
>   return ret;
>  }
> 

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH kvmtool v3 3/3] vfio-pci: Re-enable INTx mode when disable MSI/MSIX

2019-04-04 Thread Jean-Philippe Brucker
On 26/03/2019 07:41, Leo Yan wrote:
> Since PCI forbids enabling INTx, MSI or MSIX at the same time, it's by
> default to disable INTx mode when enable MSI/MSIX mode; but this logic is
> easily broken if the guest PCI driver detects the MSI/MSIX cannot work as
> expected and tries to rollback to use INTx mode.  In this case, the INTx
> mode has been disabled and has no chance to re-enable it, thus both INTx
> mode and MSI/MSIX mode cannot work in vfio.
> 
> Below shows the detailed flow for introducing this issue:
> 
>   vfio_pci_configure_dev_irqs()
> `-> vfio_pci_enable_intx()
> 
>   vfio_pci_enable_msis()
> `-> vfio_pci_disable_intx()
> 
>   vfio_pci_disable_msis()   => Guest PCI driver disables MSI
> 
> To fix this issue, when disable MSI/MSIX we need to check if INTx mode
> is available for this device or not; if the device can support INTx then
> re-enable it so that the device can fallback to use it.
> 
> Since vfio_pci_disable_intx() / vfio_pci_enable_intx() pair functions
> may be called for multiple times, this patch uses 'intx_fd == -1' to
> denote the INTx is disabled, the pair functions can directly bail out
> when detect INTx has been disabled and enabled respectively.
> 
> Suggested-by: Jean-Philippe Brucker 
> Signed-off-by: Leo Yan 
> ---
>  vfio/pci.c | 41 ++---
>  1 file changed, 30 insertions(+), 11 deletions(-)
> 
> diff --git a/vfio/pci.c b/vfio/pci.c
> index 3c39844..3b2b1e7 100644
> --- a/vfio/pci.c
> +++ b/vfio/pci.c
> @@ -28,6 +28,7 @@ struct vfio_irq_eventfd {
>   msi_update_state(state, val, VFIO_PCI_MSI_STATE_EMPTY)
>  
>  static void vfio_pci_disable_intx(struct kvm *kvm, struct vfio_device *vdev);
> +static int vfio_pci_enable_intx(struct kvm *kvm, struct vfio_device *vdev);
>  
>  static int vfio_pci_enable_msis(struct kvm *kvm, struct vfio_device *vdev,
>   bool msix)
> @@ -50,17 +51,14 @@ static int vfio_pci_enable_msis(struct kvm *kvm, struct 
> vfio_device *vdev,
>   if (!msi_is_enabled(msis->virt_state))
>   return 0;
>  
> - if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_INTX) {
> - /*
> -  * PCI (and VFIO) forbids enabling INTx, MSI or MSIX at the same
> -  * time. Since INTx has to be enabled from the start (we don't
> -  * have a reliable way to know when the user starts using it),
> -  * disable it now.
> -  */
> + /*
> +  * PCI (and VFIO) forbids enabling INTx, MSI or MSIX at the same
> +  * time. Since INTx has to be enabled from the start (after enabling
> +  * 'pdev->intx_fd' will be assigned to an eventfd and doesn't equal
> +  * to the init value -1), disable it now.
> +  */

I don't think the comment change is useful, we don't need that much
detail. The text that you replaced was trying to explain why we enable
INTx from the start, and would still apply (although it should have been
s/user/guest/)

> + if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_INTX)
>   vfio_pci_disable_intx(kvm, vdev);
> - /* Permanently disable INTx */
> - pdev->irq_modes &= ~VFIO_PCI_IRQ_MODE_INTX;
> - }
>  
>   eventfds = (void *)msis->irq_set + sizeof(struct vfio_irq_set);
>  
> @@ -162,7 +160,16 @@ static int vfio_pci_disable_msis(struct kvm *kvm, struct 
> vfio_device *vdev,
>   msi_set_enabled(msis->phys_state, false);
>   msi_set_empty(msis->phys_state, true);
>  
> - return 0;
> + /*
> +  * When MSI or MSIX is disabled, this might be called when
> +  * PCI driver detects the MSI interrupt failure and wants to
> +  * rollback to INTx mode.  Thus enable INTx if the device
> +  * supports INTx mode in this case.
> +  */
> + if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_INTX)
> + ret = vfio_pci_enable_intx(kvm, vdev);
> +
> + return ret >= 0 ? 0 : ret;
>  }
>  
>  static int vfio_pci_update_msi_entry(struct kvm *kvm, struct vfio_device 
> *vdev,
> @@ -1002,6 +1009,10 @@ static void vfio_pci_disable_intx(struct kvm *kvm, 
> struct vfio_device *vdev)
>   .index  = VFIO_PCI_INTX_IRQ_INDEX,
>   };
>  
> + /* INTx mode has been disabled */

Here as well, the comments on intx_fd seem unnecessary. But these are
only nits, the code is fine and I tested for regressions on my hardware, so:

Reviewed-by: Jean-Philippe Brucker 

> + if (pdev->intx_fd == -1)
> + return;
> +
>   pr_debug("user requested MSI, disabling INTx %d", gsi);
>  
>   ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, _set);
> @@ -

Re: [PATCH kvmtool v2 3/3] vfio-pci: Re-enable INTx mode when disable MSI/MSIX

2019-03-25 Thread Jean-Philippe Brucker
On 20/03/2019 06:20, Leo Yan wrote:
> Since PCI forbids enabling INTx, MSI or MSIX at the same time, it's by
> default to disable INTx mode when enable MSI/MSIX mode; but this logic is
> easily broken if the guest PCI driver detects the MSI/MSIX cannot work as
> expected and tries to rollback to use INTx mode.  The INTx mode has been
> disabled and it has no chance to be enabled again, thus both INTx mode
> and MSI/MSIX mode will not be enabled in vfio for this case.
> 
> Below shows the detailed flow for introducing this issue:
> 
>   vfio_pci_configure_dev_irqs()
> `-> vfio_pci_enable_intx()
> 
>   vfio_pci_enable_msis()
> `-> vfio_pci_disable_intx()
> 
>   vfio_pci_disable_msis()   => Guest PCI driver disables MSI
> 
> To fix this issue, when disable MSI/MSIX we need to check if INTx mode
> is available for this device or not; if the device can support INTx then
> we need to re-enable it so the device can fallback to use it.
> 
> In this patch, should note two minor changes:
> 
> - vfio_pci_disable_intx() may be called multiple times (each time the
>   guest enables one MSI vector).  This patch changes to use
>   'intx_fd == -1' to denote the INTx disabled, vfio_pci_disable_intx()
>   and vfio_pci_enable_intx will directly bail out when detect INTx has
>   been disabled and enabled respectively.
> 
> - Since pci_device_header will be corrupted after PCI configuration
>   and all irq related info will be lost.  Before re-enabling INTx
>   mode, this patch restores 'irq_pin' and 'irq_line' fields in struct
>   pci_device_header.
> 
> Signed-off-by: Leo Yan 
> ---
>  vfio/pci.c | 59 --
>  1 file changed, 48 insertions(+), 11 deletions(-)
> 
> diff --git a/vfio/pci.c b/vfio/pci.c
> index d025581..ba971eb 100644
> --- a/vfio/pci.c
> +++ b/vfio/pci.c
> @@ -28,6 +28,7 @@ struct vfio_irq_eventfd {
>   msi_update_state(state, val, VFIO_PCI_MSI_STATE_EMPTY)
>  
>  static void vfio_pci_disable_intx(struct kvm *kvm, struct vfio_device *vdev);
> +static int vfio_pci_enable_intx(struct kvm *kvm, struct vfio_device *vdev);
>  
>  static int vfio_pci_enable_msis(struct kvm *kvm, struct vfio_device *vdev,
>   bool msix)
> @@ -50,17 +51,14 @@ static int vfio_pci_enable_msis(struct kvm *kvm, struct 
> vfio_device *vdev,
>   if (!msi_is_enabled(msis->virt_state))
>   return 0;
>  
> - if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_INTX) {
> - /*
> -  * PCI (and VFIO) forbids enabling INTx, MSI or MSIX at the same
> -  * time. Since INTx has to be enabled from the start (we don't
> -  * have a reliable way to know when the user starts using it),
> -  * disable it now.
> -  */
> + /*
> +  * PCI (and VFIO) forbids enabling INTx, MSI or MSIX at the same
> +  * time. Since INTx has to be enabled from the start (after enabling
> +  * 'pdev->intx_fd' will be assigned to an eventfd and doesn't equal
> +  * to the init value -1), disable it now.
> +  */
> + if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_INTX)
>   vfio_pci_disable_intx(kvm, vdev);
> - /* Permanently disable INTx */
> - pdev->irq_modes &= ~VFIO_PCI_IRQ_MODE_INTX;
> - }
>  
>   eventfds = (void *)msis->irq_set + sizeof(struct vfio_irq_set);
>  
> @@ -162,7 +160,34 @@ static int vfio_pci_disable_msis(struct kvm *kvm, struct 
> vfio_device *vdev,
>   msi_set_enabled(msis->phys_state, false);
>   msi_set_empty(msis->phys_state, true);
>  
> - return 0;
> + /*
> +  * When MSI or MSIX is disabled, this might be called when
> +  * PCI driver detects the MSI interrupt failure and wants to
> +  * rollback to INTx mode.  Thus enable INTx if the device
> +  * supports INTx mode in this case.
> +  */
> + if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_INTX) {
> + /*
> +  * Struct pci_device_header is not only used for header,
> +  * it also is used for PCI configuration; and in the function
> +  * vfio_pci_cfg_write() it firstly writes configuration space
> +  * and then read back the configuration space data into the
> +  * header structure; thus 'irq_pin' and 'irq_line' in the
> +  * header will be overwritten.
> +  *
> +  * If want to enable INTx mode properly, firstly needs to
> +  * restore 'irq_pin' and 'irq_line' values; we can simply set 1
> +  * to 'irq_pin', and 'pdev->intx_gsi' keeps gsi value when
> +  * enable INTx mode previously so we can simply use it to
> +  * recover irq line number by adding offset KVM_IRQ_OFFSET.
> +  */
> + pdev->hdr.irq_pin = 1;
> + pdev->hdr.irq_line = pdev->intx_gsi + KVM_IRQ_OFFSET;

That doesn't look right. We shouldn't change irq_line at runtime, it's
reserved to the guest (and 

Re: [PATCH kvmtool v2 2/3] vfio-pci: Remove useless FDs reservation in vfio_pci_enable_intx()

2019-03-25 Thread Jean-Philippe Brucker
On 20/03/2019 06:20, Leo Yan wrote:
> Since INTx only uses 2 FDs, it's not particularly useful to reserve FDs
> in function vfio_pci_enable_intx(); so this patch is to remove FDs
> reservation in this function.
> 
> Signed-off-by: Leo Yan 

The main reason for this is that we want to call enable_intx() multiple
times and reserve_irq_fds() increments a static variable. That function
makes an approximation of the number of fds used by kvmtool and given
that we count a margin of 100 fds, the 2 INTx fds are already included
in that approximation (assuming we're not assigning hundreds of devices
to the guest).

But given that patch 3 highlights the need for a one-time init function,
maybe we can move the reserve_irq_fds() call there as well?
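
In other words something like this (a sketch; the v3 version of this series ends up doing essentially the same thing):

static int vfio_pci_init_intx(struct kvm *kvm, struct vfio_device *vdev)
{
	struct vfio_pci_device *pdev = &vdev->pci;
	struct vfio_irq_info irq_info = {
		.argsz = sizeof(irq_info),
		.index = VFIO_PCI_INTX_IRQ_INDEX,
	};

	/* Reserve the two INTx eventfds once, instead of on every enable */
	vfio_pci_reserve_irq_fds(2);

	if (ioctl(vdev->fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info) ||
	    irq_info.count == 0)
		return -ENODEV;

	/* Guest is going to overwrite our irq_line... */
	pdev->intx_gsi = pdev->hdr.irq_line - KVM_IRQ_OFFSET;
	return 0;
}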

Thanks,
Jean

> ---
>  vfio/pci.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/vfio/pci.c b/vfio/pci.c
> index 5224fee..d025581 100644
> --- a/vfio/pci.c
> +++ b/vfio/pci.c
> @@ -1025,8 +1025,6 @@ static int vfio_pci_enable_intx(struct kvm *kvm, struct 
> vfio_device *vdev)
>   .index = VFIO_PCI_INTX_IRQ_INDEX,
>   };
>  
> - vfio_pci_reserve_irq_fds(2);
> -
>   ret = ioctl(vdev->fd, VFIO_DEVICE_GET_IRQ_INFO, _info);
>   if (ret || irq_info.count == 0) {
>   vfio_dev_err(vdev, "no INTx reported by VFIO");
> 

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH kvmtool v2 1/3] vfio-pci: Release INTx's unmask eventfd properly

2019-03-25 Thread Jean-Philippe Brucker
Hi Leo,

Thanks for the patches

On 20/03/2019 06:20, Leo Yan wrote:
> The PCI device INTx uses event fd 'unmask_fd' to signal the deassertion
> of the line from guest to host; but this eventfd isn't released properly
> when disable INTx.
> 
> This patch firstly adds field 'unmask_fd' in struct vfio_pci_device for
> storing unmask eventfd and close it when disable INTx.
> 
> Signed-off-by: Leo Yan 

Reviewed-by: Jean-Philippe Brucker 

> ---
>  include/kvm/vfio.h | 1 +
>  vfio/pci.c | 2 ++
>  2 files changed, 3 insertions(+)
> 
> diff --git a/include/kvm/vfio.h b/include/kvm/vfio.h
> index 60e6c54..28223cf 100644
> --- a/include/kvm/vfio.h
> +++ b/include/kvm/vfio.h
> @@ -74,6 +74,7 @@ struct vfio_pci_device {
>  
>   unsigned long   irq_modes;
>   int intx_fd;
> + int unmask_fd;
>   unsigned intintx_gsi;
>   struct vfio_pci_msi_common  msi;
>   struct vfio_pci_msi_common  msix;
> diff --git a/vfio/pci.c b/vfio/pci.c
> index 03de3c1..5224fee 100644
> --- a/vfio/pci.c
> +++ b/vfio/pci.c
> @@ -1008,6 +1008,7 @@ static void vfio_pci_disable_intx(struct kvm *kvm, 
> struct vfio_device *vdev)
>   irq__del_irqfd(kvm, gsi, pdev->intx_fd);
>  
>   close(pdev->intx_fd);
> + close(pdev->unmask_fd);
>  }
>  
>  static int vfio_pci_enable_intx(struct kvm *kvm, struct vfio_device *vdev)
> @@ -1095,6 +1096,7 @@ static int vfio_pci_enable_intx(struct kvm *kvm, struct 
> vfio_device *vdev)
>   }
>  
>   pdev->intx_fd = trigger_fd;
> + pdev->unmask_fd = unmask_fd;
>   /* Guest is going to ovewrite our irq_line... */
>   pdev->intx_gsi = gsi;
>  
> 

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v6 05/22] iommu: Introduce cache_invalidate API

2019-03-21 Thread Jean-Philippe Brucker
On 21/03/2019 13:54, Auger Eric wrote:
> Hi Jacob, Jean-Philippe,
> 
> On 3/20/19 5:50 PM, Jean-Philippe Brucker wrote:
>> On 20/03/2019 16:37, Jacob Pan wrote:
>> [...]
>>>> +struct iommu_inv_addr_info {
>>>> +#define IOMMU_INV_ADDR_FLAGS_PASID(1 << 0)
>>>> +#define IOMMU_INV_ADDR_FLAGS_ARCHID   (1 << 1)
>>>> +#define IOMMU_INV_ADDR_FLAGS_LEAF (1 << 2)
>>>> +  __u32   flags;
>>>> +  __u32   archid;
>>>> +  __u64   pasid;
>>>> +  __u64   addr;
>>>> +  __u64   granule_size;
>>>> +  __u64   nb_granules;
>>>> +};
>>>> +
>>>> +/**
>>>> + * First level/stage invalidation information
>>>> + * @cache: bitfield that allows to select which caches to invalidate
>>>> + * @granularity: defines the lowest granularity used for the
>>>> invalidation:
>>>> + * domain > pasid > addr
>>>> + *
>>>> + * Not all the combinations of cache/granularity make sense:
>>>> + *
>>>> + * type        |   DEV_IOTLB   |     IOTLB     |     PASID     |
>>>> + * granularity |               |               |     cache     |
>>>> + * ------------+---------------+---------------+---------------+
>>>> + * DOMAIN      |      N/A      |       Y       |       Y       |
>>>> + * PASID       |       Y       |       Y       |       Y       |
>>>> + * ADDR        |       Y       |       Y       |      N/A      |
>>>> + */
>>>> +struct iommu_cache_invalidate_info {
>>>> +#define IOMMU_CACHE_INVALIDATE_INFO_VERSION_1 1
>>>> +  __u32   version;
>>>> +/* IOMMU paging structure cache */
>>>> +#define IOMMU_CACHE_INV_TYPE_IOTLB(1 << 0) /* IOMMU IOTLB */
>>>> +#define IOMMU_CACHE_INV_TYPE_DEV_IOTLB  (1 << 1) /* Device IOTLB */
>>>> +#define IOMMU_CACHE_INV_TYPE_PASID      (1 << 2) /* PASID cache */
>>> Just a clarification, this used to be an enum. You do intend to issue a
>>> single invalidation request on multiple cache types? Perhaps for
>>> virtio-IOMMU? I only see a single cache type in your patch #14. For VT-d
>>> we plan to issue one cache type at a time for now. So this format works
>>> for us.
>>
>> Yes for virtio-iommu I'd like as little overhead as possible, which
>> means a single invalidation message to hit both IOTLB and ATC at once,
>> and the ability to specify multiple pages with @nb_granules.
> The original request/explanation from Jean-Philippe can be found here:
> https://lkml.org/lkml/2019/1/28/1497
> 
>>
>>> However, if multiple cache types are issued in a single invalidation.
>>> They must share a single granularity, not all combinations are valid.
>>> e.g. dev IOTLB does not support domain granularity. Just a reminder,
>>> not an issue. Driver could filter out invalid combinations.
> Sure I will add a comment about this restriction.
>>
>> Agreed. Even the core could filter out invalid combinations based on the
>> table above: IOTLB and domain granularity are N/A.
> I don't get this sentence. What about vtd IOTLB domain-selective
> invalidation:

My mistake: I meant dev-IOTLB and domain granularity are N/A

Thanks,
Jean

> "
> • IOTLB entries caching mappings associated with the specified domain-id
> are invalidated.
> • Paging-structure-cache entries caching mappings associated with the
> specified domain-id are invalidated.
> "
> 
> Thanks
> 
> Eric
> 
>>
>> Thanks,
>> Jean
>>
>>>
>>>> +  __u8cache;
>>>> +  __u8granularity;
>>>> +  __u8padding[2];
>>>> +  union {
>>>> +  __u64   pasid;
>>>> +  struct iommu_inv_addr_info addr_info;
>>>> +  };
>>>> +};
>>>> +
>>>> +
>>>>  #endif /* _UAPI_IOMMU_H */
>>>
>>> [Jacob Pan]
>>> ___
>>> iommu mailing list
>>> io...@lists.linux-foundation.org
>>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>>>
>>
> ___
> iommu mailing list
> io...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
> 

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v6 05/22] iommu: Introduce cache_invalidate API

2019-03-20 Thread Jean-Philippe Brucker
On 20/03/2019 16:37, Jacob Pan wrote:
[...]
>> +struct iommu_inv_addr_info {
>> +#define IOMMU_INV_ADDR_FLAGS_PASID  (1 << 0)
>> +#define IOMMU_INV_ADDR_FLAGS_ARCHID (1 << 1)
>> +#define IOMMU_INV_ADDR_FLAGS_LEAF   (1 << 2)
>> +__u32   flags;
>> +__u32   archid;
>> +__u64   pasid;
>> +__u64   addr;
>> +__u64   granule_size;
>> +__u64   nb_granules;
>> +};
>> +
>> +/**
>> + * First level/stage invalidation information
>> + * @cache: bitfield that allows to select which caches to invalidate
>> + * @granularity: defines the lowest granularity used for the
>> invalidation:
>> + * domain > pasid > addr
>> + *
>> + * Not all the combinations of cache/granularity make sense:
>> + *
>> + * type        |   DEV_IOTLB   |     IOTLB     |     PASID     |
>> + * granularity |               |               |     cache     |
>> + * ------------+---------------+---------------+---------------+
>> + * DOMAIN      |      N/A      |       Y       |       Y       |
>> + * PASID       |       Y       |       Y       |       Y       |
>> + * ADDR        |       Y       |       Y       |      N/A      |
>> + */
>> +struct iommu_cache_invalidate_info {
>> +#define IOMMU_CACHE_INVALIDATE_INFO_VERSION_1 1
>> +__u32   version;
>> +/* IOMMU paging structure cache */
>> +#define IOMMU_CACHE_INV_TYPE_IOTLB  (1 << 0) /* IOMMU IOTLB */
>> +#define IOMMU_CACHE_INV_TYPE_DEV_IOTLB  (1 << 1) /* Device IOTLB */
>> +#define IOMMU_CACHE_INV_TYPE_PASID      (1 << 2) /* PASID cache */
> Just a clarification, this used to be an enum. You do intend to issue a
> single invalidation request on multiple cache types? Perhaps for
> virtio-IOMMU? I only see a single cache type in your patch #14. For VT-d
> we plan to issue one cache type at a time for now. So this format works
> for us.

Yes for virtio-iommu I'd like as little overhead as possible, which
means a single invalidation message to hit both IOTLB and ATC at once,
and the ability to specify multiple pages with @nb_granules.
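
For instance, a single request covering both caches for a batch of contiguous pages could be built like this (a sketch against the UAPI quoted above; pasid, iova and nr_pages are assumed inputs):

static int invalidate_range(struct iommu_domain *domain, struct device *dev,
			    u64 pasid, u64 iova, u64 nr_pages)
{
	struct iommu_cache_invalidate_info inv = {
		.version	= IOMMU_CACHE_INVALIDATE_INFO_VERSION_1,
		/* One message hits both the IOTLB and the device's ATC */
		.cache		= IOMMU_CACHE_INV_TYPE_IOTLB |
				  IOMMU_CACHE_INV_TYPE_DEV_IOTLB,
		.granularity	= IOMMU_INV_GRANU_ADDR,
		.addr_info	= {
			.flags		= IOMMU_INV_ADDR_FLAGS_PASID,
			.pasid		= pasid,
			.addr		= iova,
			.granule_size	= SZ_4K,
			.nb_granules	= nr_pages,	/* batch of pages */
		},
	};

	return iommu_cache_invalidate(domain, dev, &inv);
}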

> However, if multiple cache types are issued in a single invalidation.
> They must share a single granularity, not all combinations are valid.
> e.g. dev IOTLB does not support domain granularity. Just a reminder,
> not an issue. Driver could filter out invalid combinations.

Agreed. Even the core could filter out invalid combinations based on the
table above: IOTLB and domain granularity are N/A.

Thanks,
Jean

> 
>> +__u8cache;
>> +__u8granularity;
>> +__u8padding[2];
>> +union {
>> +__u64   pasid;
>> +struct iommu_inv_addr_info addr_info;
>> +};
>> +};
>> +
>> +
>>  #endif /* _UAPI_IOMMU_H */
> 
> [Jacob Pan]
> ___
> iommu mailing list
> io...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
> 

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v5 05/22] iommu: Introduce cache_invalidate API

2019-03-18 Thread Jean-Philippe Brucker
On 17/03/2019 16:43, Auger Eric wrote:
>>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>>> index 532a64075f23..e4c6a447e85a 100644
>>> --- a/include/uapi/linux/iommu.h
>>> +++ b/include/uapi/linux/iommu.h
>>> @@ -159,4 +159,75 @@ struct iommu_pasid_table_config {
>>> };
>>>  };
>>>  
>>> +/* defines the granularity of the invalidation */
>>> +enum iommu_inv_granularity {
>>> +   IOMMU_INV_GRANU_DOMAIN, /* domain-selective invalidation */
>>> +   IOMMU_INV_GRANU_PASID,  /* pasid-selective invalidation */
>>> +   IOMMU_INV_GRANU_ADDR,   /* page-selective invalidation */
>>> +};
>>> +
>>> +/**
>>> + * Address Selective Invalidation Structure
>>> + *
>>> + * @flags indicates the granularity of the address-selective
>>> invalidation
>>> + * - if PASID bit is set, @pasid field is populated and the
>>> invalidation
>>> + *   relates to cache entries tagged with this PASID and matching the
>>> + *   address range.
>>> + * - if ARCHID bit is set, @archid is populated and the invalidation
>>> relates
>>> + *   to cache entries tagged with this architecture specific id and
>>> matching
>>> + *   the address range.
>>> + * - Both PASID and ARCHID can be set as they may tag different
>>> caches.
>>> + * - if neither PASID or ARCHID is set, global addr invalidation
>>> applies
>>> + * - LEAF flag indicates whether only the leaf PTE caching needs to
>>> be
>>> + *   invalidated and other paging structure caches can be preserved.
>>> + * @pasid: process address space id
>>> + * @archid: architecture-specific id
>>> + * @addr: first stage/level input address
>>> + * @granule_size: page/block size of the mapping in bytes
>>> + * @nb_granules: number of contiguous granules to be invalidated
>>> + */
>>> +struct iommu_inv_addr_info {
>>> +#define IOMMU_INV_ADDR_FLAGS_PASID (1 << 0)
>>> +#define IOMMU_INV_ADDR_FLAGS_ARCHID(1 << 1)
>>> +#define IOMMU_INV_ADDR_FLAGS_LEAF  (1 << 2)
>>> +   __u32   flags;
>>> +   __u32   archid;
>>> +   __u64   pasid;
>>> +   __u64   addr;
>>> +   __u64   granule_size;
>>> +   __u64   nb_granules;
>>> +};
>>> +
>>> +/**
>>> + * First level/stage invalidation information
>>> + * @cache: bitfield that allows to select which caches to invalidate
>>> + * @granularity: defines the lowest granularity used for the
>>> invalidation:
>>> + * domain > pasid > addr
>>> + *
>>> + * Not all the combinations of cache/granularity make sense:
>>> + *
>>> + * type        |   DEV_IOTLB   |     IOTLB     |     PASID     |
>>> + * granularity |               |               |     cache     |
>>> + * ------------+---------------+---------------+---------------+
>>> + * DOMAIN      |      N/A      |       Y       |       Y       |
>>> + * PASID       |       Y       |       Y       |       Y       |
>>> + * ADDR        |       Y       |       Y       |      N/A      |
>>> + */
>>> +struct iommu_cache_invalidate_info {
>>> +#define IOMMU_CACHE_INVALIDATE_INFO_VERSION_1 1
>>> +   __u32   version;
>>> +/* IOMMU paging structure cache */
>>> +#define IOMMU_CACHE_INV_TYPE_IOTLB (1 << 0) /* IOMMU IOTLB */
>>> +#define IOMMU_CACHE_INV_TYPE_DEV_IOTLB  (1 << 1) /* Device IOTLB */
>>> +#define IOMMU_CACHE_INV_TYPE_PASID      (1 << 2) /* PASID cache */
>>> +   __u8cache;
>>> +   __u8granularity;
>>> +   __u8padding[2];
>>> +   union {
>>> +   __u64   pasid;
>> just realized there is already a pasid field in the addr_info, do we
>> still need this?
> I think so. Either you do a PASID based invalidation and you directly
> use the pasid field or you do an address based invalidation and you use
> the addr_info where the pasid may or not be passed.

I guess a comment would be useful?

- Invalidations by %IOMMU_INV_GRANU_ADDR use field @addr_info.
- Invalidations by %IOMMU_INV_GRANU_PASID use field @pasid.
- Invalidations by %IOMMU_INV_GRANU_DOMAIN don't take an argument.
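
Concretely, a driver consuming these would dispatch on @granularity along these lines (sketch; the invalidate_* helpers are hypothetical):

static int handle_invalidation(struct iommu_domain *domain,
			       struct iommu_cache_invalidate_info *inv_info)
{
	switch (inv_info->granularity) {
	case IOMMU_INV_GRANU_DOMAIN:
		/* No argument */
		return invalidate_domain(domain);
	case IOMMU_INV_GRANU_PASID:
		/* Uses @pasid */
		return invalidate_pasid(domain, inv_info->pasid);
	case IOMMU_INV_GRANU_ADDR:
		/* Uses @addr_info, which may itself carry a PASID */
		return invalidate_addr_range(domain, &inv_info->addr_info);
	default:
		return -EINVAL;
	}
}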

Thanks,
Jean

> 
> Thanks
> 
> Eric
>>> +   struct iommu_inv_addr_info addr_info;
>>> +   };
>>> +};
>>> +
>>> +
>>>  #endif /* _UAPI_IOMMU_H */
>>
>> [Jacob Pan]
>>

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH kvmtool v1 2/2] vfio-pci: Fallback to INTx mode when disable MSI/MSIX

2019-03-15 Thread Jean-Philippe Brucker
On 15/03/2019 08:33, Leo Yan wrote:
> Since PCI forbids enabling INTx, MSI or MSIX at the same time, it's by
> default to disable INTx mode when enable MSI/MSIX mode; but this logic is
> easily broken if the guest PCI driver detects the MSI/MSIX cannot work as
> expected and tries to rollback to use INTx mode.  The INTx mode has been
> disabled and it has no chance to be enabled again, thus both INTx mode
> and MSI/MSIX mode will not be enabled in vfio for this case.
> 
> Below shows the detailed flow for introducing this issue:
> 
>   vfio_pci_configure_dev_irqs()
> `-> vfio_pci_enable_intx()
> 
>   vfio_pci_enable_msis()
> `-> vfio_pci_disable_intx()
> 
>   vfio_pci_disable_msis()   => Guest PCI driver disables MSI
> 
> To fix this issue, when disable MSI/MSIX we need to check if INTx mode
> is available for this device or not; if the device can support INTx then
> we need to re-enable it so the device can fallback to use it.
> 
> Signed-off-by: Leo Yan 
> ---
>  vfio/pci.c | 17 -
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/vfio/pci.c b/vfio/pci.c
> index c0683f6..44727bb 100644
> --- a/vfio/pci.c
> +++ b/vfio/pci.c
> @@ -28,6 +28,7 @@ struct vfio_irq_eventfd {
>   msi_update_state(state, val, VFIO_PCI_MSI_STATE_EMPTY)
>  
>  static void vfio_pci_disable_intx(struct kvm *kvm, struct vfio_device *vdev);
> +static int vfio_pci_enable_intx(struct kvm *kvm, struct vfio_device *vdev);
>  
>  static int vfio_pci_enable_msis(struct kvm *kvm, struct vfio_device *vdev,
>   bool msix)
> @@ -50,7 +51,7 @@ static int vfio_pci_enable_msis(struct kvm *kvm, struct 
> vfio_device *vdev,
>   if (!msi_is_enabled(msis->virt_state))
>   return 0;
>  
> - if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_INTX) {
> + if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_INTX)
>   /*
>* PCI (and VFIO) forbids enabling INTx, MSI or MSIX at the same
>* time. Since INTx has to be enabled from the start (we don't
> @@ -58,9 +59,6 @@ static int vfio_pci_enable_msis(struct kvm *kvm, struct 
> vfio_device *vdev,
>* disable it now.
>*/
>   vfio_pci_disable_intx(kvm, vdev);
> - /* Permanently disable INTx */
> - pdev->irq_modes &= ~VFIO_PCI_IRQ_MODE_INTX;

As a result vfio_pci_disable_intx() may be called multiple times (each
time the guest enables one MSI vector). Could you make
vfio_pci_disable_intx() safe against that (maybe use intx_fd == -1 to
denote the INTx state)?
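
Something like this guard would do (a sketch; the elided body is the existing teardown code):

static void vfio_pci_disable_intx(struct kvm *kvm, struct vfio_device *vdev)
{
	struct vfio_pci_device *pdev = &vdev->pci;

	if (pdev->intx_fd == -1)
		return;		/* INTx already disabled, nothing to do */

	/* ... existing teardown: VFIO_DEVICE_SET_IRQS, irq__del_irqfd(), close() ... */

	pdev->intx_fd = -1;
}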

> - }
>  
>   eventfds = (void *)msis->irq_set + sizeof(struct vfio_irq_set);
>  
> @@ -162,7 +160,16 @@ static int vfio_pci_disable_msis(struct kvm *kvm, struct 
> vfio_device *vdev,
>   msi_set_enabled(msis->phys_state, false);
>   msi_set_empty(msis->phys_state, true);
>  
> - return 0;
> + if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_INTX)
> + /*
> +  * When MSI or MSIX is disabled, this might be called when
> +  * PCI driver detects the MSI interrupt failure and wants to
> +  * rollback to INTx mode.  Thus enable INTx if the device
> +  * supports INTx mode in this case.
> +  */
> + ret = vfio_pci_enable_intx(kvm, vdev);

Let's remove vfio_pci_reserve_irq_fds(2) from vfio_pci_enable_intx(); it
should only be called once per run, and isn't particularly useful here
since INTx only uses 2 fds. It's used to bump the fd rlimit when a
device needs ~2048 file descriptors for MSI-X.

Thanks,
Jean

> +
> + return ret >= 0 ? 0 : ret;
>  }
>  
>  static int vfio_pci_update_msi_entry(struct kvm *kvm, struct vfio_device 
> *vdev,
> 

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH kvmtool v1 1/2] vfio-pci: Release INTx's guest to host eventfd properly

2019-03-15 Thread Jean-Philippe Brucker
Hi,

On 15/03/2019 08:33, Leo Yan wrote:
> The PCI device INTx uses event fd 'unmask_fd' to signal the deassertion
> of the line from guest to host; but this eventfd isn't released properly
> when disable INTx.
> 
> When disable INTx this patch firstly unbinds interrupt signal by calling
> ioctl VFIO_DEVICE_SET_IRQS and then it uses the new added field
> 'unmask_fd' in struct vfio_pci_device to close event fd.
> 
> Signed-off-by: Leo Yan 
> ---
>  include/kvm/vfio.h |  1 +
>  vfio/pci.c | 15 ---
>  2 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/include/kvm/vfio.h b/include/kvm/vfio.h
> index 60e6c54..28223cf 100644
> --- a/include/kvm/vfio.h
> +++ b/include/kvm/vfio.h
> @@ -74,6 +74,7 @@ struct vfio_pci_device {
>  
>   unsigned long   irq_modes;
>   int intx_fd;
> + int unmask_fd;
>   unsigned intintx_gsi;
>   struct vfio_pci_msi_common  msi;
>   struct vfio_pci_msi_common  msix;
> diff --git a/vfio/pci.c b/vfio/pci.c
> index 03de3c1..c0683f6 100644
> --- a/vfio/pci.c
> +++ b/vfio/pci.c
> @@ -996,18 +996,26 @@ static void vfio_pci_disable_intx(struct kvm *kvm, 
> struct vfio_device *vdev)
>  {
>   struct vfio_pci_device *pdev = &vdev->pci;
>   int gsi = pdev->intx_gsi;
> - struct vfio_irq_set irq_set = {
> - .argsz  = sizeof(irq_set),
> + struct vfio_irq_set trigger_irq = {
> + .argsz  = sizeof(trigger_irq),
>   .flags  = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
>   .index  = VFIO_PCI_INTX_IRQ_INDEX,
>   };
>  
> + struct vfio_irq_set unmask_irq = {
> + .argsz  = sizeof(unmask_irq),
> + .flags  = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_UNMASK,
> + .index  = VFIO_PCI_INTX_IRQ_INDEX,
> + };
> +
>   pr_debug("user requested MSI, disabling INTx %d", gsi);
>  
> - ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> + ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &trigger_irq);
> + ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &unmask_irq);

The patch makes sense, we do need to close unmask_fd, but I don't think
we need the additional ioctl. VFIO removes the unmask trigger when we
disable INTx in the first ioctl, so an additional ioctl to remove the
unmask trigger will return EINVAL.
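
In other words the disable path could simply be (sketch):

static void vfio_pci_disable_intx(struct kvm *kvm, struct vfio_device *vdev)
{
	struct vfio_pci_device *pdev = &vdev->pci;
	int gsi = pdev->intx_gsi;
	struct vfio_irq_set irq_set = {
		.argsz	= sizeof(irq_set),
		.flags	= VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
		.index	= VFIO_PCI_INTX_IRQ_INDEX,
	};

	pr_debug("user requested MSI, disabling INTx %d", gsi);

	/* Disabling the trigger also removes the unmask eventfd on the VFIO side */
	ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
	irq__del_irqfd(kvm, gsi, pdev->intx_fd);

	close(pdev->intx_fd);
	close(pdev->unmask_fd);
}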

Thanks,
Jean

>   irq__del_irqfd(kvm, gsi, pdev->intx_fd);
>  
>   close(pdev->intx_fd);
> + close(pdev->unmask_fd);
>  }
>  
>  static int vfio_pci_enable_intx(struct kvm *kvm, struct vfio_device *vdev)
> @@ -1095,6 +1103,7 @@ static int vfio_pci_enable_intx(struct kvm *kvm, struct 
> vfio_device *vdev)
>   }
>  
>   pdev->intx_fd = trigger_fd;
> + pdev->unmask_fd = unmask_fd;
>   /* Guest is going to ovewrite our irq_line... */
>   pdev->intx_gsi = gsi;
>  
> 

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v4 03/22] iommu: introduce device fault report API

2019-03-07 Thread Jean-Philippe Brucker
On 06/03/2019 23:46, Jacob Pan wrote:
> On Tue, 5 Mar 2019 15:03:41 +
> Jean-Philippe Brucker  wrote:
> 
>> On 18/02/2019 13:54, Eric Auger wrote:
>> [...]> +/**
>> > + * iommu_register_device_fault_handler() - Register a device fault
>> > handler
>> > + * @dev: the device
>> > + * @handler: the fault handler
>> > + * @data: private data passed as argument to the handler
>> > + *
>> > + * When an IOMMU fault event is received, call this handler with
>> > the fault event
>> > + * and data as argument. The handler should return 0 on success.
>> > If the fault is
>> > + * recoverable (IOMMU_FAULT_PAGE_REQ), the handler can also
>> > complete
>> > + * the fault by calling iommu_page_response() with one of the
>> > following
>> > + * response code:
>> > + * - IOMMU_PAGE_RESP_SUCCESS: retry the translation
>> > + * - IOMMU_PAGE_RESP_INVALID: terminate the fault
>> > + * - IOMMU_PAGE_RESP_FAILURE: terminate the fault and stop
>> > reporting
>> > + *   page faults if possible.  
>> 
>> The comment refers to function and values that haven't been defined
>> yet. Either the page_response() patch should come before, or we need
>> to split this patch.
>> 
>> Something I missed before: if the handler fails (returns != 0) it
>> should complete the fault by calling iommu_page_response(), if we're
>> not doing it in iommu_report_device_fault(). It should be indicated
>> in this comment. It's safe for the handler to call page_response()
>> since we're not holding fault_param->lock when calling the handler.
>> 
> If the page request fault is to be reported to a guest, the report
> function cannot wait for the completion status. As long as the fault is
> injected into the guest, the handler should complete with success. If
> the PRQ report fails, IMHO, the caller of iommu_report_device_fault()
> should send page_response, perhaps after clean up all partial response
> of the group too.

OK, the caller (the IOMMU driver) sending the page_response if
iommu_report_device_fault() fails does make more sense. Agreed on the
partial cleanup as well; we don't keep track of those here, but I need to
add that to the io-pgfault layer. However, some cleanup should probably
happen in here...

>> > +   /* we only report device fault if there is a handler
>> > registered */
>> > +   mutex_lock(&dev->iommu_param->lock);
>> > +   if (!dev->iommu_param->fault_param ||
>> > +   !dev->iommu_param->fault_param->handler) {
>> > +   ret = -EINVAL;
>> > +   goto done_unlock;
>> > +   }
>> > +   fparam = dev->iommu_param->fault_param;
>> > +   if (evt->fault.type == IOMMU_FAULT_PAGE_REQ &&
>> > +   evt->fault.prm.flags &
>> > IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE) {
>> > +   evt_pending = kmemdup(evt, sizeof(struct
>> > iommu_fault_event),
>> > +   GFP_KERNEL);
>> > +   if (!evt_pending) {
>> > +   ret = -ENOMEM;
>> > +   goto done_unlock;
>> > +   }
>> > +   mutex_lock(&fparam->lock);
>> > +   list_add_tail(&evt_pending->list, &fparam->faults);
>> > +   mutex_unlock(&fparam->lock);
>> > +   }
>> > +   ret = fparam->handler(evt, fparam->data);

... if ret != 0, removing and freeing the pending event seems more
appropriate here than asking our caller to do it
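
Concretely, something like this in iommu_report_device_fault() (a sketch against the code quoted above, assuming evt_pending is initialised to NULL when no event was queued):

	ret = fparam->handler(evt, fparam->data);
	if (ret && evt_pending) {
		/* The handler rejected the fault, drop the pending event */
		mutex_lock(&fparam->lock);
		list_del(&evt_pending->list);
		mutex_unlock(&fparam->lock);
		kfree(evt_pending);
	}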

Thanks,
Jean

>> > +done_unlock:
>> > +   mutex_unlock(&dev->iommu_param->lock);
>> > +   return ret;
>> > +}
>> > +EXPORT_SYMBOL_GPL(iommu_report_device_fault);  
>> [...]
> 
> [Jacob Pan]

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v4 02/22] iommu: introduce device fault data

2019-03-06 Thread Jean-Philippe Brucker
On 06/03/2019 14:30, Auger Eric wrote:
>>> +#define IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE (1 << 1)
>>> +#define IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA (1 << 2)
>>> +   __u32   flags;
>>> +   __u32   pasid;
>>> +   __u32   grpid;
>>> +   __u32   perm;
>>> +   __u64   addr;
>>
>> Given that we'll be reporting stall faults using this struct, it would
>> be good to have the fetch_addr field and flag here as well.
> As the stall model looks really ARM specific shouldn't we introduce a
> dedicated struct and iommu_fault_type enum value?

There is no reason for the generic page fault handler to differentiate
between stall and PRI; both are page requests. For a stall we write the
STAG into grpid and set LAST_PAGE=1. Then the SMMU driver writes the page
response either as a PRI_RESP or a RESUME, depending on the device type.
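
For illustration, a stall event would be reported roughly like this (a sketch only; ssid, stag, iova and the WRITE permission are assumed values taken from the SMMU event record, and the report call follows the fault API patches discussed in this series):

static void report_stall(struct device *dev, u32 ssid, u16 stag, u64 iova)
{
	struct iommu_fault_event evt = {
		.fault = {
			.type = IOMMU_FAULT_PAGE_REQ,
			.prm = {
				/* The STAG goes into grpid, a stall is always "last" */
				.flags	= IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE |
					  IOMMU_FAULT_PAGE_REQUEST_PASID_PRESENT,
				.pasid	= ssid,
				.grpid	= stag,
				.perm	= IOMMU_FAULT_PERM_WRITE,
				.addr	= iova,
			},
		},
	};

	iommu_report_device_fault(dev, &evt);
}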

> Also for stall faults don't we need to expose the stall tag (STAG) that,
> as far as I understand is going to be used by the guest we it wants to
> retry or terminate the faulted transaction. In practice doesn't the
> stall fault have the same fields of the unrecoverable fault + STAG? I am
> afraid adding the fetch_addr in the page request struct may "pollute"
> the PRI struct that can be understood by both aarch64 and x86 parties atm.

Let's leave out the fetch_addr field then; I was suggesting it for
completeness, but I don't need it immediately, at least not for host SVA.
For dual-stage SVA (where both stage-1 and stage-2 are shared with the
CPU) we'll need the IPA field, but that's still a long way away.

> Also couldn't we envision to put this STALL struct in a new revision of
> the fault ABI.
As said above, generic code doesn't have to know the difference until we
start implementing nested SVA. Also, we need stall support in the fault
handler soon, since there is hardware supporting it.

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v4 02/22] iommu: introduce device fault data

2019-03-06 Thread Jean-Philippe Brucker
On 06/03/2019 09:38, Auger Eric wrote:
>>> +struct iommu_fault_unrecoverable {
>>> +    __u32   reason; /* enum iommu_fault_reason */
>>> +#define IOMMU_FAULT_UNRECOV_PASID_VALID (1 << 0)
>>> +#define IOMMU_FAULT_UNRECOV_PERM_VALID  (1 << 1)
>> 
>> Not needed, since @perm is already a bitfield
> not exactly, READ is encoded as 0. We need to differentiate read fault
> from no perm provided. However if I follow your recommendation below and
> transform the READ FAULT into a set bit this makes sense.

Ah yes, seeing four defines I assumed read was in there. No need for
INST, I think; it's already described by EXEC.

>>> +#define IOMMU_FAULT_UNRECOV_ADDR_VALID  (1 << 2)
>>> +#define IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID    (1 << 3)
>>> +    __u32   flags;
>>> +    __u32   pasid;
>>> +#define IOMMU_FAULT_PERM_WRITE  (1 << 0) /* write */
>>> +#define IOMMU_FAULT_PERM_EXEC   (1 << 1) /* exec */
>>> +#define IOMMU_FAULT_PERM_PRIV   (1 << 2) /* priviledged */
>> 
>> typo "privileged"
> OK
>> 
>>> +#define IOMMU_FAULT_PERM_INST   (1 << 3) /* instruction */
>> 
>> Could you move these outside the struct definition? They are shared with
>> the other struct. And it would be less confusing, from the device driver
>> point of view, to merge those with the existing IOMMU_FAULT_* defines
>> (but moving them to UAPI and making them bits)
> ok I will look at this. Need to check if the read fault value is not
> hardcoded anywhere.

Oh right, it looks like a couple of IOMMU drivers do. Hard to say if they
mean READ or just "don't care", at first glance. I guess we can keep the
FAULT_PERM variant until we actually unify the fault reporting API (not
overly complicated since there are only three users; I have patches for
that buried somewhere).

>> 
>>> +    __u32   perm;
>>> +    __u64   addr;
>>> +    __u64   fetch_addr;
>>> +};
>>> +
>>> +/*
>>> + * Page Request data (aka. recoverable fault data)
>>> + * @flags : encodes whether the pasid is valid and whether this
>>> + * is the last page in group
>>> + * @pasid: pasid
>>> + * @grpid: page request group index
>>> + * @perm: requested page permissions
>>> + * @addr: page address
>>> + */
>>> +struct iommu_fault_page_request {
>>> +#define IOMMU_FAULT_PAGE_REQUEST_PASID_PRESENT  (1 << 0)
>> 
>> PASID_VALID, to be consistent with the other set of flags?
> OK
>> 
>>> +#define IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE  (1 << 1)
>>> +#define IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA  (1 << 2)
>>> +    __u32   flags;
>>> +    __u32   pasid;
>>> +    __u32   grpid;
>>> +    __u32   perm;
>>> +    __u64   addr;
>> 
>> Given that we'll be reporting stall faults using this struct, it would
>> be good to have the fetch_addr field and flag here as well.
> OK
>> 
>>> +    __u64   private_data[2];
>>> +};
>>> +
>>> +/**
>>> + * struct iommu_fault - Generic fault data
>>> + *
>>> + * @type contains fault type
>>> + */
>>> +
>>> +struct iommu_fault {
>>> +    __u32   type;   /* enum iommu_fault_type */
>>> +    __u32   reserved;
>>> +    union {
>>> +    struct iommu_fault_unrecoverable event;
>>> +    struct iommu_fault_page_request prm;
>> 
>> What's the 'm' in "prm"? Maybe just "pr"?
> This stands for page request message, I think this is the Intel's naming?

Looks like it's the PCI naming, let's stick with it then

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v4 05/22] iommu: Introduce cache_invalidate API

2019-03-05 Thread Jean-Philippe Brucker
On 18/02/2019 13:54, Eric Auger wrote:
> From: "Liu, Yi L" 
> 
> In any virtualization use case, when the first translation stage
> is "owned" by the guest OS, the host IOMMU driver has no knowledge
> of caching structure updates unless the guest invalidation activities
> are trapped by the virtualizer and passed down to the host.
> 
> Since the invalidation data are obtained from user space and will be
> written into physical IOMMU, we must allow security check at various
> layers. Therefore, generic invalidation data format are proposed here,
> model specific IOMMU drivers need to convert them into their own format.
> 
> Signed-off-by: Liu, Yi L 
> Signed-off-by: Jean-Philippe Brucker 
> Signed-off-by: Jacob Pan 
> Signed-off-by: Ashok Raj 
> Signed-off-by: Eric Auger 
> 
> ---
> v3 -> v4:
> - full reshape of the API following Alex' comments
> 
> v1 -> v2:
> - add arch_id field
> - renamed tlb_invalidate into cache_invalidate as this API allows
>   to invalidate context caches on top of IOTLBs
> 
> v1:
> renamed sva_invalidate into tlb_invalidate and add iommu_ prefix in
> header. Commit message reworded.
> ---
>  drivers/iommu/iommu.c  | 14 
>  include/linux/iommu.h  | 14 
>  include/uapi/linux/iommu.h | 71 ++
>  3 files changed, 99 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index b3adb77cb14c..bcb8eb15426c 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1564,6 +1564,20 @@ void iommu_detach_pasid_table(struct iommu_domain 
> *domain)
>  }
>  EXPORT_SYMBOL_GPL(iommu_detach_pasid_table);
>  
> +int iommu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
> +struct iommu_cache_invalidate_info *inv_info)
> +{
> + int ret = 0;
> +
> + if (unlikely(!domain->ops->cache_invalidate))
> + return -ENODEV;
> +
> + ret = domain->ops->cache_invalidate(domain, dev, inv_info);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_cache_invalidate);
> +
>  static void __iommu_detach_device(struct iommu_domain *domain,
> struct device *dev)
>  {
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 7045e26f3a7d..a3b879d0753c 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -189,6 +189,7 @@ struct iommu_resv_region {
>   * @pgsize_bitmap: bitmap of all possible supported page sizes
>   * @attach_pasid_table: attach a pasid table
>   * @detach_pasid_table: detach the pasid table
> + * @cache_invalidate: invalidate translation caches
>   */
>  struct iommu_ops {
>   bool (*capable)(enum iommu_cap);
> @@ -235,6 +236,9 @@ struct iommu_ops {
> struct iommu_pasid_table_config *cfg);
>   void (*detach_pasid_table)(struct iommu_domain *domain);
>  
> + int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
> + struct iommu_cache_invalidate_info *inv_info);
> +
>   unsigned long pgsize_bitmap;
>  };
>  
> @@ -348,6 +352,9 @@ extern void iommu_detach_device(struct iommu_domain 
> *domain,
>  extern int iommu_attach_pasid_table(struct iommu_domain *domain,
>   struct iommu_pasid_table_config *cfg);
>  extern void iommu_detach_pasid_table(struct iommu_domain *domain);
> +extern int iommu_cache_invalidate(struct iommu_domain *domain,
> +   struct device *dev,
> +   struct iommu_cache_invalidate_info *inv_info);
>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
> @@ -798,6 +805,13 @@ void iommu_detach_pasid_table(struct iommu_domain 
> *domain)
>  {
>   return -ENODEV;
>  }
> +static inline int
> +iommu_cache_invalidate(struct iommu_domain *domain,
> +struct device *dev,
> +struct iommu_cache_invalidate_info *inv_info)
> +{
> + return -ENODEV;
> +}
>  
>  #endif /* CONFIG_IOMMU_API */
>  
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index e9065bfa5b24..ae41385b0a7e 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -159,4 +159,75 @@ struct iommu_pasid_table_config {
>   };
>  };
>  
> +/* defines the granularity of the invalidation */
> +enum iommu_inv_granularity {
> + IOMMU_INV_GRANU_DOMAIN, /* domain-selective invalidati

Re: [PATCH v4 04/22] iommu: Introduce attach/detach_pasid_table API

2019-03-05 Thread Jean-Philippe Brucker
On 18/02/2019 13:54, Eric Auger wrote:
> From: Jacob Pan 
> 
> In virtualization use case, when a guest is assigned
> a PCI host device, protected by a virtual IOMMU on the guest,
> the physical IOMMU must be programmed to be consistent with
> the guest mappings. If the physical IOMMU supports two
> translation stages it makes sense to program guest mappings
> onto the first stage/level (ARM/Intel terminology) while the host
> owns the stage/level 2.
> 
> In that case, it is mandated to trap on guest configuration
> settings and pass those to the physical iommu driver.
> 
> This patch adds a new API to the iommu subsystem that allows
> to set/unset the pasid table information.
> 
> A generic iommu_pasid_table_config struct is introduced in
> a new iommu.h uapi header. This is going to be used by the VFIO
> user API.
> 
> Signed-off-by: Jean-Philippe Brucker 
> Signed-off-by: Liu, Yi L 
> Signed-off-by: Ashok Raj 
> Signed-off-by: Jacob Pan 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> This patch generalizes the API introduced by Jacob & co-authors in
> https://lwn.net/Articles/754331/
> 
> v3 -> v4:
> - s/set_pasid_table/attach_pasid_table
> - restore detach_pasid_table. Detach can be used on unwind path.
> - add padding
> - remove @abort
> - signature used for config and format
> - add comments for fields in the SMMU struct
> 
> v2 -> v3:
> - replace unbind/bind by set_pasid_table
> - move table pointer and pasid bits in the generic part of the struct
> 
> v1 -> v2:
> - restore the original pasid table name
> - remove the struct device * parameter in the API
> - reworked iommu_pasid_smmuv3
> ---
>  drivers/iommu/iommu.c  | 19 +++
>  include/linux/iommu.h  | 22 ++
>  include/uapi/linux/iommu.h | 47 ++
>  3 files changed, 88 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index c297cdcf7f89..b3adb77cb14c 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1545,6 +1545,25 @@ int iommu_attach_device(struct iommu_domain *domain, 
> struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>  
> +int iommu_attach_pasid_table(struct iommu_domain *domain,
> +  struct iommu_pasid_table_config *cfg)
> +{
> + if (unlikely(!domain->ops->attach_pasid_table))
> + return -ENODEV;
> +
> + return domain->ops->attach_pasid_table(domain, cfg);
> +}
> +EXPORT_SYMBOL_GPL(iommu_attach_pasid_table);
> +
> +void iommu_detach_pasid_table(struct iommu_domain *domain)
> +{
> + if (unlikely(!domain->ops->detach_pasid_table))
> + return;
> +
> + domain->ops->detach_pasid_table(domain);
> +}
> +EXPORT_SYMBOL_GPL(iommu_detach_pasid_table);
> +
>  static void __iommu_detach_device(struct iommu_domain *domain,
> struct device *dev)
>  {
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index b38e0c100940..7045e26f3a7d 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -187,6 +187,8 @@ struct iommu_resv_region {
>   * @domain_window_disable: Disable a particular window for a domain
>   * @of_xlate: add OF master IDs to iommu grouping
>   * @pgsize_bitmap: bitmap of all possible supported page sizes
> + * @attach_pasid_table: attach a pasid table
> + * @detach_pasid_table: detach the pasid table

Should go before pgsize_bitmap

>   */
>  struct iommu_ops {
>   bool (*capable)(enum iommu_cap);
> @@ -229,6 +231,10 @@ struct iommu_ops {
>   int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
>   bool (*is_attach_deferred)(struct iommu_domain *domain, struct device 
> *dev);
>  
> + int (*attach_pasid_table)(struct iommu_domain *domain,
> +   struct iommu_pasid_table_config *cfg);
> + void (*detach_pasid_table)(struct iommu_domain *domain);
> +
>   unsigned long pgsize_bitmap;
>  };
>  
> @@ -339,6 +345,9 @@ extern int iommu_attach_device(struct iommu_domain 
> *domain,
>  struct device *dev);
>  extern void iommu_detach_device(struct iommu_domain *domain,
>   struct device *dev);
> +extern int iommu_attach_pasid_table(struct iommu_domain *domain,
> + struct iommu_pasid_table_config *cfg);
> +extern void iommu_detach_pasid_table(struct iommu_domain *domain);
>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
>  extern int i

Re: [PATCH v4 03/22] iommu: introduce device fault report API

2019-03-05 Thread Jean-Philippe Brucker
On 18/02/2019 13:54, Eric Auger wrote:
[...]
> +/**
> + * iommu_register_device_fault_handler() - Register a device fault handler
> + * @dev: the device
> + * @handler: the fault handler
> + * @data: private data passed as argument to the handler
> + *
> + * When an IOMMU fault event is received, call this handler with the fault 
> event
> + * and data as argument. The handler should return 0 on success. If the 
> fault is
> + * recoverable (IOMMU_FAULT_PAGE_REQ), the handler can also complete
> + * the fault by calling iommu_page_response() with one of the following
> + * response code:
> + * - IOMMU_PAGE_RESP_SUCCESS: retry the translation
> + * - IOMMU_PAGE_RESP_INVALID: terminate the fault
> + * - IOMMU_PAGE_RESP_FAILURE: terminate the fault and stop reporting
> + *   page faults if possible.

The comment refers to function and values that haven't been defined yet.
Either the page_response() patch should come before, or we need to split
this patch.

Something I missed before: if the handler fails (returns != 0), it should
complete the fault by calling iommu_page_response(), unless we do that in
iommu_report_device_fault() itself. That requirement should be indicated
in this comment. It's safe for the handler to call page_response(), since
we're not holding fault_param->lock when calling the handler.
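
For instance (a rough sketch; my_forward_to_guest() and
my_complete_fault() are hypothetical helpers, the latter just wrapping
iommu_page_response() with whatever arguments the companion patch ends
up defining):

    static int my_fault_handler(struct iommu_fault_event *evt, void *cookie)
    {
        struct my_ctx *ctx = cookie;    /* hypothetical driver context */

        if (my_forward_to_guest(ctx, evt))
            /* Couldn't forward the fault, terminate it ourselves */
            return my_complete_fault(ctx, evt, IOMMU_PAGE_RESP_FAILURE);

        return 0;
    }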

> + *
> + * Return 0 if the fault handler was installed successfully, or an error.
> + */
[...]
> +/**
> + * iommu_report_device_fault() - Report fault event to device
> + * @dev: the device
> + * @evt: fault event data
> + *
> + * Called by IOMMU model specific drivers when fault is detected, typically
> + * in a threaded IRQ handler.
> + *
> + * Return 0 on success, or an error.
> + */
> +int iommu_report_device_fault(struct device *dev, struct iommu_fault_event 
> *evt)
> +{
> + int ret = 0;
> + struct iommu_fault_event *evt_pending;
> + struct iommu_fault_param *fparam;
> +
> + /* iommu_param is allocated when device is added to group */
> + if (!dev->iommu_param | !evt)

Typo: ||

Thanks,
Jean

> + return -EINVAL;
> + /* we only report device fault if there is a handler registered */
> + mutex_lock(&dev->iommu_param->lock);
> + if (!dev->iommu_param->fault_param ||
> + !dev->iommu_param->fault_param->handler) {
> + ret = -EINVAL;
> + goto done_unlock;
> + }
> + fparam = dev->iommu_param->fault_param;
> + if (evt->fault.type == IOMMU_FAULT_PAGE_REQ &&
> + evt->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE) {
> + evt_pending = kmemdup(evt, sizeof(struct iommu_fault_event),
> + GFP_KERNEL);
> + if (!evt_pending) {
> + ret = -ENOMEM;
> + goto done_unlock;
> + }
> + mutex_lock(&fparam->lock);
> + list_add_tail(&evt_pending->list, &fparam->faults);
> + mutex_unlock(&fparam->lock);
> + }
> + ret = fparam->handler(evt, fparam->data);
> +done_unlock:
> + mutex_unlock(&dev->iommu_param->lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_report_device_fault);
[...]
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v4 02/22] iommu: introduce device fault data

2019-03-05 Thread Jean-Philippe Brucker
On 18/02/2019 13:54, Eric Auger wrote:
> From: Jacob Pan 
> 
> Device faults detected by IOMMU can be reported outside the IOMMU
> subsystem for further processing. This patch introduces
> a generic device fault data structure.
> 
> The fault can be either an unrecoverable fault or a page request,
> also referred to as a recoverable fault.
> 
> We only care about non internal faults that are likely to be reported
> to an external subsystem.
> 
> Signed-off-by: Jacob Pan 
> Signed-off-by: Jean-Philippe Brucker 
> Signed-off-by: Liu, Yi L 
> Signed-off-by: Ashok Raj 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> v3 -> v4:
> - use a union containing aither an unrecoverable fault or a page
>   request message. Move the device private data in the page request
>   structure. Reshuffle the fields and use flags.
> - move fault perm attributes to the uapi
> - remove a bunch of iommu_fault_reason enum values that were related
>   to internal errors
> ---
>  include/linux/iommu.h  |  47 +++
>  include/uapi/linux/iommu.h | 115 +
>  2 files changed, 162 insertions(+)
>  create mode 100644 include/uapi/linux/iommu.h
> 
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index e90da6b6f3d1..032d33894723 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -25,6 +25,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define IOMMU_READ   (1 << 0)
>  #define IOMMU_WRITE  (1 << 1)
> @@ -48,6 +49,7 @@ struct bus_type;
>  struct device;
>  struct iommu_domain;
>  struct notifier_block;
> +struct iommu_fault_event;
>  
>  /* iommu fault flags */
>  #define IOMMU_FAULT_READ 0x0
> @@ -55,6 +57,7 @@ struct notifier_block;
>  
>  typedef int (*iommu_fault_handler_t)(struct iommu_domain *,
>   struct device *, unsigned long, int, void *);
> +typedef int (*iommu_dev_fault_handler_t)(struct iommu_fault_event *, void *);
>  
>  struct iommu_domain_geometry {
>   dma_addr_t aperture_start; /* First address that can be mapped*/
> @@ -243,6 +246,49 @@ struct iommu_device {
>   struct device *dev;
>  };
>  
> +/**
> + * struct iommu_fault_event - Generic per device fault data
> + *
> + * - PCI and non-PCI devices
> + * - Recoverable faults (e.g. page request), information based on PCI ATS
> + *   and PASID spec.

"for example information based on PCI PRI and PASID extensions"? ATS+PRI
have been integrated into the main spec, and only PRI is relevant here.

> + * - Un-recoverable faults of device interest

"of interest to device drivers"?

> + * - DMA remapping and IRQ remapping faults
> + *
> + * @fault: fault descriptor
> + * @iommu_private: used by the IOMMU driver for storing fault-specific
> + * data. Users should not modify this field before
> + * sending the fault response.
> + */
> +struct iommu_fault_event {
> + struct iommu_fault fault;
> + u64 iommu_private;
> +};
> +
> +/**
> + * struct iommu_fault_param - per-device IOMMU fault data
> + * @dev_fault_handler: Callback function to handle IOMMU faults at device 
> level
> + * @data: handler private data
> + *
> + */
> +struct iommu_fault_param {
> + iommu_dev_fault_handler_t handler;
> + void *data;
> +};
> +
> +/**
> + * struct iommu_param - collection of per-device IOMMU data
> + *
> + * @fault_param: IOMMU detected device fault reporting data
> + *
> + * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
> + *   struct iommu_group  *iommu_group;
> + *   struct iommu_fwspec *iommu_fwspec;
> + */
> +struct iommu_param {
> + struct iommu_fault_param *fault_param;
> +};
> +
>  int  iommu_device_register(struct iommu_device *iommu);
>  void iommu_device_unregister(struct iommu_device *iommu);
>  int  iommu_device_sysfs_add(struct iommu_device *iommu,
> @@ -418,6 +464,7 @@ struct iommu_ops {};
>  struct iommu_group {};
>  struct iommu_fwspec {};
>  struct iommu_device {};
> +struct iommu_fault_param {};
>  
>  static inline bool iommu_present(struct bus_type *bus)
>  {
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> new file mode 100644
> index ..7ebf23ed6ccb
> --- /dev/null
> +++ b/include/uapi/linux/iommu.h
> @@ -0,0 +1,115 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * IOMMU user API definitions
> + */
> +
> +#ifndef _UAPI_IOMMU_H
> +#define _UAPI_IOMMU_H
> +
> +#include 
> +
> +/*  Generic fault types, can be expanded IRQ remapping fault */
> +enum iommu_fault_

Re: [PATCH v7 0/7] Add virtio-iommu driver

2019-02-22 Thread Jean-Philippe Brucker
Hi Thiago,

On 21/02/2019 22:18, Thiago Jung Bauermann wrote:
> 
> Hello Jean-Philippe,
> 
> Jean-Philippe Brucker  writes:
>> Makes sense, though I think other virtio devices have been developed a
>> little more organically: device and driver code got upstreamed first,
>> and then the specification describing their interface got merged into
>> the standard. For example I believe that code for crypto, input and GPU
>> devices were upstreamed long before the specification was merged. Once
>> an implementation is upstream, the interface is expected to be
>> backward-compatible (all subsequent changes are introduced using feature
>> bits).
>>
>> So I've been working with this process in mind, also described by Jens
>> at KVM forum 2017 [3]:
>> (1) Reserve a device ID, and get that merged into virtio (ID 23 for
>> virtio-iommu was reserved last year)
>> (2) Open-source an implementation (this driver and Eric's device)
>> (3) Formalize and upstream the device specification
>>
>> But I get that some overlap between (2) and (3) would have been better.
>> So far the spec document has been reviewed mainly from the IOMMU point
>> of view, and might require more changes to be in line with the other
>> virtio devices -- hopefully just wording changes. I'll kick off step
>> (3), but I think the virtio folks are a bit busy with finalizing the 1.1
>> spec so I expect it to take a while.
> 
> I read v0.9 of the spec and have some minor comments, hope this is a
> good place to send them:

Thanks a lot, I'll fix them in my next posting. Note that I recently
sent v0.10 to virtio-comment, to request inclusion into the standard [1]
but your comments still apply to v0.10.

[1]
https://lists.oasis-open.org/archives/virtio-comment/201901/msg00016.html

> 1. In section 2.6.2, one reads
> 
> If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered and the range
> described by fields virt_start and virt_end doesn’t fit in the range
> described by input_range, the device MAY set status to VIRTIO_-
> IOMMU_S_RANGE and ignore the request.
> 
> Shouldn't int say "If the VIRTIO_IOMMU_F_INPUT_RANGE feature is
> negotiated" instead?

Yes, that seems clearer and more consistent with other devices, I'll
change it. In this case "offered" is equivalent to "negotiated", because
the driver SHOULD accept the feature or else the device may refuse to
set FEATURES_OK. A valid input_range field generally indicates that the
device is incapable of creating mappings outside this range, and it's
important that the driver acknowledges it.

> 2. There's a typo at the end of section 2.6.5:
> 
> The VIRTIO_IOMMU_MAP_F_MMIO flag is a memory type rather than a
> protection lag.
> 
> s/lag/flag/

Fixed in v0.10

> 3. In section 3.1.2.1.1, the viommu compatible field says "virtio,mmio".
> Shouldn't it say "virtio,mmio-iommu" instead, to be consistent with
> "virtio,pci-iommu"?

"virtio,mmio" already exists, and allows the virtio-mmio driver to pick
up any virtio device. The device type is then discovered while probing,
and doesn't need to be in the compatible string.

"virtio,pci-iommu" is something I introduced specifically for the
virtio-iommu, since it's the only virtio-pci device that requires a
device tree node - to describe the IOMMU topology earlier than the PCI
probe. If we want symmetry I'd rather replace "virtio,pci-iommu" with
"virtio,pci", but it wouldn't be used by other virtio device types. And
I have to admit I'm reluctant to change this binding now, given that it
has been reviewed (patch 2/7) and is ready to go.

> 4. There's a typo in section 3.3:
> 
> A host bridge may limit the input address space – transaction
> accessing some addresses won’t reach the physical IOMMU.
> 
> s/transaction/transactions/

I'll fix it, thanks

> I also have one last comment which you may freely ignore, considering
> it's clearly just personal opinion and also considering that the
> specification is mature at this point: it specifies memory ranges by
> specifying start and end addresses. My experience has been that this is
> error prone, leading to confusion and bugs regarding whether the end
> address is inclusive or exclusive. I tend to prefer expressing memory
> ranges by specifying a start address and a length, which eliminates
> ambiguity.

While the initial versions had start and length, I changed it because
that scheme cannot express the whole 64-bit range. If the guest wants to do
unmap-all (and the input range is 64-bit), then it can send a single
UNMAP request with start=0, end=~0ULL, which wouldn't be possible with
start and length. Arguably a very rare use-case, but one I've tried to
implement at least twice with VFIO :) I'll see if I can make it more
obvious that end is inclusive, since the word doesn't appear at all in
the current draft.
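
For the record, an unmap-all request from the guest then looks something
like this (sketch only; field names as in the proposed Linux uapi header,
domain_id being whatever domain the endpoints are attached to):

    struct virtio_iommu_req_unmap req = {
        .head.type  = VIRTIO_IOMMU_T_UNMAP,
        .domain     = cpu_to_le32(domain_id),
        .virt_start = cpu_to_le64(0),
        .virt_end   = cpu_to_le64(~0ULL),   /* end is inclusive */
    };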

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC v3 02/21] iommu: Introduce cache_invalidate API

2019-01-28 Thread Jean-Philippe Brucker
Hi Eric,

On 25/01/2019 16:49, Auger Eric wrote:
[...]
>>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>>> index 7a7cf7a3de7c..4605f5cfac84 100644
>>> --- a/include/uapi/linux/iommu.h
>>> +++ b/include/uapi/linux/iommu.h
>>> @@ -47,4 +47,99 @@ struct iommu_pasid_table_config {
>>> };
>>>  };
>>>  
>>> +/**
>>> + * enum iommu_inv_granularity - Generic invalidation granularity
>>> + * @IOMMU_INV_GRANU_DOMAIN_ALL_PASID:  TLB entries or PASID caches of 
>>> all
>>> + * PASIDs associated with a domain ID
>>> + * @IOMMU_INV_GRANU_PASID_SEL: TLB entries or PASID cache 
>>> associated
>>> + * with a PASID and a domain
>>> + * @IOMMU_INV_GRANU_PAGE_PASID:TLB entries of selected page 
>>> range
>>> + * within a PASID
>>> + *
>>> + * When an invalidation request is passed down to IOMMU to flush 
>>> translation
>>> + * caches, it may carry different granularity levels, which can be specific
>>> + * to certain types of translation caches.
>>> + * This enum is a collection of granularities for all types of translation
>>> + * caches. The idea is to make it easy for IOMMU model specific driver to
>>> + * convert from generic to model specific value. Each IOMMU driver
>>> + * can enforce check based on its own conversion table. The conversion is
>>> + * based on 2D look-up with inputs as follows:
>>> + * - translation cache types
>>> + * - granularity
>>> + *
>>> + * type |   DTLB|TLB|   PASID   |
>>> + *  granule |   |   |   cache   |
>>> + * -+---+---+---+
>>> + *  DN_ALL_PASID|   Y   |   Y   |   Y   |
>>> + *  PASID_SEL   |   Y   |   Y   |   Y   |
>>> + *  PAGE_PASID  |   Y   |   Y   |   N/A |
>>> + *
>>> + */
>>> +enum iommu_inv_granularity {
>>> +   IOMMU_INV_GRANU_DOMAIN_ALL_PASID,
>>> +   IOMMU_INV_GRANU_PASID_SEL,
>>> +   IOMMU_INV_GRANU_PAGE_PASID,
>>> +   IOMMU_INV_NR_GRANU,
>>> +};
>>> +
>>> +/**
>>> + * enum iommu_inv_type - Generic translation cache types for invalidation
>>> + *
>>> + * @IOMMU_INV_TYPE_DTLB:   device IOTLB
>>> + * @IOMMU_INV_TYPE_TLB:IOMMU paging structure cache
>>> + * @IOMMU_INV_TYPE_PASID:  PASID cache
>>> + * Invalidation requests sent to IOMMU for a given device need to indicate
>>> + * which type of translation cache to be operated on. Combined with enum
>>> + * iommu_inv_granularity, model specific driver can do a simple lookup to
>>> + * convert from generic to model specific value.
>>> + */
>>> +enum iommu_inv_type {
>>> +   IOMMU_INV_TYPE_DTLB,
>>> +   IOMMU_INV_TYPE_TLB,
>>> +   IOMMU_INV_TYPE_PASID,
>>> +   IOMMU_INV_NR_TYPE
>>> +};
>>> +
>>> +/**
>>> + * Translation cache invalidation header that contains mandatory meta data.
>>> + * @version:   info format version, expecting future extesions
>>> + * @type:  type of translation cache to be invalidated
>>> + */
>>> +struct iommu_cache_invalidate_hdr {
>>> +   __u32 version;
>>> +#define TLB_INV_HDR_VERSION_1 1
>>> +   enum iommu_inv_type type;
>>> +};
>>> +
>>> +/**
>>> + * Translation cache invalidation information, contains generic IOMMU
>>> + * data which can be parsed based on model ID by model specific drivers.
>>> + * Since the invalidation of second level page tables are included in the
>>> + * unmap operation, this info is only applicable to the first level
>>> + * translation caches, i.e. DMA request with PASID.
>>> + *
>>> + * @granularity:   requested invalidation granularity, type dependent
>>> + * @size:  2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
>>
>> Why is this a 4K page centric interface?
> This matches the vt-d Address Mask (AM) field of the IOTLB Invalidate
> Descriptor. We can pass a log2size instead.
>>
>>> + * @nr_pages:  number of pages to invalidate
>>> + * @pasid: processor address space ID value per PCI spec.
>>> + * @arch_id:   architecture dependent id characterizing a 
>>> context
>>> + * and tagging the caches, ie. domain Identfier on VTD,
>>> + * asid on ARM SMMU
>>> + * @addr:  page address to be invalidated
>>> + * @flags  IOMMU_INVALIDATE_ADDR_LEAF: leaf paging entries
>>> + * IOMMU_INVALIDATE_GLOBAL_PAGE: global pages
>>
>> Shouldn't some of these be tied the the granularity of the
>> invalidation?  It seems like this should be more similar to
>> iommu_pasid_table_config where the granularity of the invalidation
>> defines which entry within a union at the end of the structure is valid
>> and populated.  Otherwise we have fields that don't make sense for
>> certain invalidations.
> 
> I am a little bit embarrassed here as this API version is the outcome of
> long discussions held by Jacob, jean-Philippe and many others. I don't
> want to hijack that work as I am "simply" reusing this API. 

Re: [RFC v3 01/21] iommu: Introduce set_pasid_table API

2019-01-25 Thread Jean-Philippe Brucker
On 25/01/2019 08:55, Auger Eric wrote:
> Hi Jean-Philippe,
> 
> On 1/25/19 9:39 AM, Auger Eric wrote:
>> Hi Jean-Philippe,
>>
>> On 1/11/19 7:16 PM, Jean-Philippe Brucker wrote:
>>> On 08/01/2019 10:26, Eric Auger wrote:
>>>> From: Jacob Pan 
>>>>
>>>> In virtualization use case, when a guest is assigned
>>>> a PCI host device, protected by a virtual IOMMU on a guest,
>>>> the physical IOMMU must be programmed to be consistent with
>>>> the guest mappings. If the physical IOMMU supports two
>>>> translation stages it makes sense to program guest mappings
>>>> onto the first stage/level (ARM/VTD terminology) while the host
>>>> owns the stage/level 2.
>>>>
>>>> In that case, it is mandated to trap on guest configuration
>>>> settings and pass those to the physical iommu driver.
>>>>
>>>> This patch adds a new API to the iommu subsystem that allows
>>>> to set the pasid table information.
>>>>
>>>> A generic iommu_pasid_table_config struct is introduced in
>>>> a new iommu.h uapi header. This is going to be used by the VFIO
>>>> user API. We foresee at least two specializations of this struct,
>>>> for PASID table passing and ARM SMMUv3.
>>>
>>> Last sentence is a bit confusing. With SMMUv3 it is also used for the
>>> PASID table, even when it only has one entry and PASID is disabled.
>> OK removed
>>>
>>>> Signed-off-by: Jean-Philippe Brucker 
>>>> Signed-off-by: Liu, Yi L 
>>>> Signed-off-by: Ashok Raj 
>>>> Signed-off-by: Jacob Pan 
>>>> Signed-off-by: Eric Auger 
>>>>
>>>> ---
>>>>
>>>> This patch generalizes the API introduced by Jacob & co-authors in
>>>> https://lwn.net/Articles/754331/
>>>>
>>>> v2 -> v3:
>>>> - replace unbind/bind by set_pasid_table
>>>> - move table pointer and pasid bits in the generic part of the struct
>>>>
>>>> v1 -> v2:
>>>> - restore the original pasid table name
>>>> - remove the struct device * parameter in the API
>>>> - reworked iommu_pasid_smmuv3
>>>> ---
>>>>  drivers/iommu/iommu.c  | 10 
>>>>  include/linux/iommu.h  | 14 +++
>>>>  include/uapi/linux/iommu.h | 50 ++
>>>>  3 files changed, 74 insertions(+)
>>>>  create mode 100644 include/uapi/linux/iommu.h
>>>>
>>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>>> index 3ed4db334341..0f2b7f1fc7c8 100644
>>>> --- a/drivers/iommu/iommu.c
>>>> +++ b/drivers/iommu/iommu.c
>>>> @@ -1393,6 +1393,16 @@ int iommu_attach_device(struct iommu_domain 
>>>> *domain, struct device *dev)
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>>>>  
>>>> +int iommu_set_pasid_table(struct iommu_domain *domain,
>>>> +struct iommu_pasid_table_config *cfg)
>>>> +{
>>>> +  if (unlikely(!domain->ops->set_pasid_table))
>>>> +  return -ENODEV;
>>>> +
>>>> +  return domain->ops->set_pasid_table(domain, cfg);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(iommu_set_pasid_table);
>>>> +
>>>>  static void __iommu_detach_device(struct iommu_domain *domain,
>>>>  struct device *dev)
>>>>  {
>>>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>>>> index e90da6b6f3d1..1da2a2357ea4 100644
>>>> --- a/include/linux/iommu.h
>>>> +++ b/include/linux/iommu.h
>>>> @@ -25,6 +25,7 @@
>>>>  #include 
>>>>  #include 
>>>>  #include 
>>>> +#include 
>>>>  
>>>>  #define IOMMU_READ    (1 << 0)
>>>>  #define IOMMU_WRITE   (1 << 1)
>>>> @@ -184,6 +185,7 @@ struct iommu_resv_region {
>>>>   * @domain_window_disable: Disable a particular window for a domain
>>>>   * @of_xlate: add OF master IDs to iommu grouping
>>>>   * @pgsize_bitmap: bitmap of all possible supported page sizes
>>>> + * @set_pasid_table: set pasid table
>>>>   */
>>>>  struct iommu_ops {
>>>>bool (*capable)(enum iommu_cap);
>>>> @@ -226,6 +228,9 @@ struct iommu_ops {
>>>>int (*of_xlat

Re: [PATCH v7 0/7] Add virtio-iommu driver

2019-01-24 Thread Jean-Philippe Brucker
Hi Joerg,

On 23/01/2019 08:34, Joerg Roedel wrote:
> Hi Jean-Philippe,
> 
> thanks for all your hard work on this!
> 
> On Tue, Jan 15, 2019 at 12:19:52PM +, Jean-Philippe Brucker wrote:
>> Implement the virtio-iommu driver, following specification v0.9 [1].
> 
> To make progress on this I think the spec needs to be close to something
> that can be included into the official virtio-specification. Have you
> proposed the specification for inclusion there?

I haven't yet. I did send a few drafts of the spec to the mailing list,
using arbitrary version numbers (0.1 - 0.9), and received excellent
feedback from Eric, Kevin, Ashok and others [2], but I hadn't formally
asked for inclusion yet. Since I haven't made any major change to the
interface in a while, I'll get on that.

> This is because I can't merge a driver that might be incompatible to
> future implementations because the specification needs to be changed on
> its way to an official standard.

Makes sense, though I think other virtio devices have been developed a
little more organically: device and driver code got upstreamed first,
and then the specification describing their interface got merged into
the standard. For example I believe that code for crypto, input and GPU
devices were upstreamed long before the specification was merged. Once
an implementation is upstream, the interface is expected to be
backward-compatible (all subsequent changes are introduced using feature
bits).

So I've been working with this process in mind, also described by Jens
at KVM forum 2017 [3]:
(1) Reserve a device ID, and get that merged into virtio (ID 23 for
virtio-iommu was reserved last year)
(2) Open-source an implementation (this driver and Eric's device)
(3) Formalize and upstream the device specification

But I get that some overlap between (2) and (3) would have been better.
So far the spec document has been reviewed mainly from the IOMMU point
of view, and might require more changes to be in line with the other
virtio devices -- hopefully just wording changes. I'll kick off step
(3), but I think the virtio folks are a bit busy with finalizing the 1.1
spec so I expect it to take a while.

Thanks,
Jean

[2] RFC https://markmail.org/thread/l6b2rpc46nua4egs
0.4 https://markmail.org/thread/f5k37mab7tnrslin
0.5 https://markmail.org/thread/tz65oolu5do7hi6n
0.6 https://markmail.org/thread/dppbg6owzrx2km2n
0.7 https://markmail.org/thread/dgdy4hicswpakmsq

[3] The future of virtio: riddles, myths and surprises
https://www.linux-kvm.org/images/0/03/Virtio_fall_2017.pdf
https://www.youtube.com/watch?v=z9cWwgYH97A


> I had a short discussion with Michael S. Tsirkin about that and from
> what I understood the spec needs to be proposed for inclusion on the
> virtio-comment[1] mailing list and later the TC needs to vote on it.
> Please work with Michael on this to get the specification official (or
> at least pretty close to something that will be part of the official
> virtio standard).
> 
> Regards,
> 
>   Joerg
> 
> [1] https://www.oasis-open.org/committees/comments/index.php?wg_abbrev=virtio

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH kvmtool v2 00/13] Implement reset of virtio devices

2019-01-22 Thread Jean-Philippe Brucker
On 22/01/2019 07:07, Will Deacon wrote:
> On Thu, Jan 10, 2019 at 02:12:37PM +, Julien Thierry wrote:
>> This series was developed by Jean-Philippe and is needed for a series
>> I'll be posting shortly after to load firmwares on arm kvmtool.
>>
>> Currently, when a guest tries to reset a device, a lot of resources
>> aren't reset (threads keep running, virtio queues keep their state, etc).
>>
>> When the guest only does the reset to initialize the device and there
>> were no previous users, there is no noticeable issue. But when a guest
>> has a firmware + Linux, if the firmware uses a virtio device, Linux will
>> fail to probe that device.
>>
>> This series aim to properly reset the virtio resources when the guests
>> requests it.
>>
>> Reset of net vhost is unsupported for now.
>>
>> Patch 1 is a bug fix on ioeventfd
>> Patch 2-6 provide the core support so devices can implement their reset
>> Patch 7-13 implements the reset for the various virtio devices
> 
> Cheers, I'll pick this up. There's a vague comment in patch 10 about aio
> being busted. If that's the case then we should either fix it or remove
> it...

Thanks! I think my main concern with aio was that the iocb structures
are allocated on the stack, and get trashed after io_submit(). The aio
man page says that the control block should remain valid until the I/O
completes (which might still apply even if we're using raw iocb instead
of aiocb - I can't find where the kernel copies those buffers). I'll take
a better look and implement reset if possible.

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v7 0/7] Add virtio-iommu driver

2019-01-21 Thread Jean-Philippe Brucker
Hi,

On 18/01/2019 15:51, Michael S. Tsirkin wrote:
> 
> On Tue, Jan 15, 2019 at 12:19:52PM +, Jean-Philippe Brucker wrote:
>> Implement the virtio-iommu driver, following specification v0.9 [1].
>>
>> This is a simple rebase onto Linux v5.0-rc2. We now use the
>> dev_iommu_fwspec_get() helper introduced in v5.0 instead of accessing
>> dev->iommu_fwspec, but there aren't any functional change from v6 [2].
>>
>> Our current goal for virtio-iommu is to get a paravirtual IOMMU working
>> on Arm, and enable device assignment to guest userspace. In this
>> use-case the mappings are static, and don't require optimal performance,
>> so this series tries to keep things simple. However there is plenty more
>> to do for features and optimizations, and having this base in v5.1 would
>> be good. Given that most of the changes are to drivers/iommu, I believe
>> the driver and future changes should go via the IOMMU tree.
>>
>> You can find Linux driver and kvmtool device on v0.9.2 branches [3],
>> module and x86 support on virtio-iommu/devel. Also tested with Eric's
>> QEMU device [4]. Please note that the series depends on Robin's
>> probe-deferral fix [5], which will hopefully land in v5.0.
>>
>> [1] Virtio-iommu specification v0.9, sources and pdf
>> git://linux-arm.org/virtio-iommu.git virtio-iommu/v0.9
>> http://jpbrucker.net/virtio-iommu/spec/v0.9/virtio-iommu-v0.9.pdf
>>
>> [2] [PATCH v6 0/7] Add virtio-iommu driver
>> 
>> https://lists.linuxfoundation.org/pipermail/iommu/2018-December/032127.html
>>
>> [3] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.9.2
>> git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.9.2
>>
>> [4] [RFC v9 00/17] VIRTIO-IOMMU device
>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg575578.html
>>
>> [5] [PATCH] iommu/of: Fix probe-deferral
>> https://www.spinics.net/lists/arm-kernel/msg698371.html
> 
> Thanks for the work!
> So really my only issue with this is that there's no
> way for the IOMMU to describe the devices that it
> covers.
> 
> As a result that is then done in a platform-specific way.
> 
> And this means that for example it does not solve the problem that e.g.
> some power people have in that their platform simply does not have a way
> to specify which devices are covered by the IOMMU.

Isn't power using device tree? I haven't looked much at power because I
was told a while ago that they already paravirtualize their IOMMU and
don't need virtio-iommu, except perhaps for some legacy platforms. Or
something along those lines. But I would certainly be interested in
enabling the IOMMU for more architectures.

As for the enumeration problem, I still don't think we can get much
better than DT and ACPI as solutions (and IMO they are necessary to make
this device portable). But I believe that getting DT and ACPI support is
just a one-off inconvenience. That is, once the required bindings are
accepted, any future extension can then be done at the virtio level with
feature bits and probe requests, without having to update ACPI or DT.

Thanks,
Jean

> Solving that problem would make me much more excited about
> this device.
> 
> On the other hand I can see that while there have been some
> developments most of the code has been stable for quite a while now.
> 
> So what I am trying to do right about now, is making a small module that
> loads early and pokes at the IOMMU sufficiently to get the data about
> which devices use the IOMMU out of it using standard virtio config
> space.  IIUC it's claimed to be impossible without messy changes to the
> boot sequence.
> 
> If I succeed at least on some platforms I'll ask that this design is
> worked into this device, minimizing info that goes through DT/ACPI.  If
> I see I can't make it in time to meet the next merge window, I plan
> merging the existing patches using DT (barring surprises).
> 
> As I only have a very small amount of time to spend on this attempt, If
> someone else wants to try doing that in parallel, that would be great!
> 
> 
>> Jean-Philippe Brucker (7):
>>   dt-bindings: virtio-mmio: Add IOMMU description
>>   dt-bindings: virtio: Add virtio-pci-iommu node
>>   of: Allow the iommu-map property to omit untranslated devices
>>   PCI: OF: Initialize dev->fwnode appropriately
>>   iommu: Add virtio-iommu driver
>>   iommu/virtio: Add probe request
>>   iommu/virtio: Add event queue
>>
>>  .../devicetree/bindings/virtio/iommu.txt  |   66 +
>>  .../devicetree/bindings/virtio/mmio.txt   |   30 +
>>  MAINTAINERS   |7 +
>>  drivers/iommu/Kconfig  

Re: [RFC v3 14/21] iommu: introduce device fault data

2019-01-16 Thread Jean-Philippe Brucker
On 15/01/2019 21:27, Auger Eric wrote:
[...]
  /* iommu fault flags */
 -#define IOMMU_FAULT_READ  0x0
 -#define IOMMU_FAULT_WRITE 0x1
 +#define IOMMU_FAULT_READ  (1 << 0)
 +#define IOMMU_FAULT_WRITE (1 << 1)
 +#define IOMMU_FAULT_EXEC  (1 << 2)
 +#define IOMMU_FAULT_PRIV  (1 << 3)
  
  typedef int (*iommu_fault_handler_t)(struct iommu_domain *,
struct device *, unsigned long, int, void *);
 +typedef int (*iommu_dev_fault_handler_t)(struct iommu_fault_event *,
 void *); 
  struct iommu_domain_geometry {
dma_addr_t aperture_start; /* First address that can be
 mapped*/ @@ -255,6 +259,52 @@ struct iommu_device {
struct device *dev;
  };
  
 +/**
 + * struct iommu_fault_event - Generic per device fault data
 + *
 + * - PCI and non-PCI devices
 + * - Recoverable faults (e.g. page request), information based on
 PCI ATS
 + * and PASID spec.
 + * - Un-recoverable faults of device interest
 + * - DMA remapping and IRQ remapping faults
 + *
 + * @fault: fault descriptor
 + * @device_private: if present, uniquely identify device-specific
 + *  private data for an individual page request.
 + * @iommu_private: used by the IOMMU driver for storing
 fault-specific
 + * data. Users should not modify this field before
 + * sending the fault response.
 + */
 +struct iommu_fault_event {
 +  struct iommu_fault fault;
 +  u64 device_private;
>>> I think we want to move device_private to uapi since it gets injected
>>> into the guest, then returned by guest in case of page response. For
>>> VT-d we also need 128 bits of private data. VT-d spec. 7.7.1
>>
>> Ah, I didn't notice the format changed in VT-d rev3. On that topic, how
>> do we manage future extensions to the iommu_fault struct? Should we add
>> ~48 bytes of padding after device_private, along with some flags telling
>> which field is valid, or deal with it using a structure version like we
>> do for the invalidate and bind structs? In the first case, iommu_fault
>> wouldn't fit in a 64-byte cacheline anymore, but I'm not sure we care.
>>
>>> For exception tracking (e.g. unanswered page request), I can add timer
>>> and list info later when I include PRQ. sounds ok?
 +  u64 iommu_private;
>> [...]
 +/**
 + * struct iommu_fault - Generic fault data
 + *
 + * @type contains fault type
 + * @reason fault reasons if relevant outside IOMMU driver.
 + * IOMMU driver internal faults are not reported.
 + * @addr: tells the offending page address
 + * @fetch_addr: tells the address that caused an abort, if any
 + * @pasid: contains process address space ID, used in shared virtual
 memory
 + * @page_req_group_id: page request group index
 + * @last_req: last request in a page request group
 + * @pasid_valid: indicates if the PRQ has a valid PASID
 + * @prot: page access protection flag:
 + *IOMMU_FAULT_READ, IOMMU_FAULT_WRITE
 + */
 +
 +struct iommu_fault {
 +  __u32   type;   /* enum iommu_fault_type */
 +  __u32   reason; /* enum iommu_fault_reason */
 +  __u64   addr;
 +  __u64   fetch_addr;
 +  __u32   pasid;
 +  __u32   page_req_group_id;
 +  __u32   last_req;
 +  __u32   pasid_valid;
 +  __u32   prot;
 +  __u32   access;
>>
>> What does @access contain? Can it be squashed into @prot?
> it was related to F_ACCESS event record and was a placeholder for
> reporting access attributes of the input transaction (Rnw, InD, PnU
> fields). But I wonder whether this is needed to implement such fine
> level fault reporting. Do we really care?

I think we do, to properly inject PRI/Stall later. But RnW, InD and PnU
can already be described with the IOMMU_FAULT_* flags defined above.
We're missing CLASS and S2, which could also be useful for debugging.
CLASS is specific to SMMUv3 but could probably be represented with
@reason. For S2, we could keep printing stage-2 faults in the driver,
and not report them to userspace.

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC v3 14/21] iommu: introduce device fault data

2019-01-16 Thread Jean-Philippe Brucker
On 14/01/2019 22:32, Jacob Pan wrote:
>> [...]
 +/**
 + * struct iommu_fault - Generic fault data
 + *
 + * @type contains fault type
 + * @reason fault reasons if relevant outside IOMMU driver.
 + * IOMMU driver internal faults are not reported.
 + * @addr: tells the offending page address
 + * @fetch_addr: tells the address that caused an abort, if any
 + * @pasid: contains process address space ID, used in shared
 virtual memory
 + * @page_req_group_id: page request group index
 + * @last_req: last request in a page request group
 + * @pasid_valid: indicates if the PRQ has a valid PASID
 + * @prot: page access protection flag:
 + *IOMMU_FAULT_READ, IOMMU_FAULT_WRITE
 + */
 +
 +struct iommu_fault {
 +  __u32   type;   /* enum iommu_fault_type */
 +  __u32   reason; /* enum iommu_fault_reason */
 +  __u64   addr;
 +  __u64   fetch_addr;
 +  __u32   pasid;
 +  __u32   page_req_group_id;
 +  __u32   last_req;
 +  __u32   pasid_valid;
 +  __u32   prot;
 +  __u32   access;  
>>
>> What does @access contain? Can it be squashed into @prot?
>>
> I agreed.
> 
> how about this?
> #define IOMMU_FAULT_VERSION_V1 0x1
> struct iommu_fault {
>   __u16 version;

Right, but the version field becomes redundant when we present a batch
of these to userspace, in patch 18 (assuming we don't want to mix fault
structure versions within a batch... I certainly don't).

When introducing IOMMU_FAULT_VERSION_V2, in a distant future, I think we
still need to support a userspace that uses IOMMU_FAULT_VERSION_V1. One
strategy for this:

* We define structs iommu_fault_v1 (the current iommu_fault) and
  iommu_fault_v2.
* Userspace selects IOMMU_FAULT_VERSION_V1 when registering the fault
  queue
* The IOMMU driver fills iommu_fault_v2 and passes it to VFIO
* VFIO does its best to translate this into a iommu_fault_v1 struct

So what we need now, is a way for userspace to tell the kernel which
structure version it expects. I'm not sure we even need to pass the
actual version number we're using back to userspace. Agreeing on one
version at registration should be sufficient.
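
As a very rough sketch of that last step (names invented here for
illustration, they don't come from any patch):

    /* Userspace registered the fault queue with IOMMU_FAULT_VERSION_V1 */
    static void vfio_fault_to_v1(const struct iommu_fault_v2 *in,
                                 struct iommu_fault_v1 *out)
    {
        memset(out, 0, sizeof(*out));
        out->type   = in->type;
        out->reason = in->reason;
        out->addr   = in->addr;
        out->pasid  = in->pasid;
        /* Fields that only exist in v2 are silently dropped */
    }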

>   __u16 type;
>   __u32 reason;
>   __u64 addr;

I'm in favor of keeping @fetch_addr as well; it can contain useful
information. For example, while attempting to translate an IOVA
0xf000, the IOMMU can't find the PASID table that we installed with
address 0xdead - the guest passed an invalid address to
bind_pasid_table(). We can then report 0xf000 in @addr, and 0xdead
in @fetch_addr.

>   __u32 pasid;
>   __u32 page_req_group_id;
>   __u32 last_req : 1;
>   __u32 pasid_valid : 1;

Agreed, with some explicit padding, or combined into a @flags field. In
fact if we do add the @fetch_addr field, I think we need a bit that
indicates its validity as well.
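
i.e. roughly something like this (only a sketch, based on Jacob's
proposal above with the bitfields folded into a flags word):

    struct iommu_fault {
        __u16   type;
        __u16   reserved;
        __u32   reason;
        __u64   addr;
        __u64   fetch_addr;
        __u32   pasid;
        __u32   page_req_group_id;
    #define IOMMU_FAULT_LAST_REQ            (1 << 0)
    #define IOMMU_FAULT_PASID_VALID         (1 << 1)
    #define IOMMU_FAULT_FETCH_ADDR_VALID    (1 << 2)
        __u32   flags;
        __u32   prot;
        __u64   device_private[2];
        __u8    padding[40];
    };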

Thanks,
Jean

>   __u32 prot;
>   __u64 device_private[2];
>   __u8 padding[48];
> };
> 
> 
>> Thanks,
>> Jean
>>
>>> relocated to uapi, Yi can you confirm?
>>> __u64 device_private[2];
>>>   
 +};
  #endif /* _UAPI_IOMMU_H */  
>>>
>>> ___
>>> iommu mailing list
>>> io...@lists.linux-foundation.org
>>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>>>   
>>
> 
> [Jacob Pan]
> ___
> iommu mailing list
> io...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
> 

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC v3 17/21] iommu/smmuv3: Report non recoverable faults

2019-01-16 Thread Jean-Philippe Brucker
On 15/01/2019 21:06, Auger Eric wrote:
>>> +   iommu_report_device_fault(master->dev, );
>>
>> We should return here if the fault is successfully injected
> 
> Even if the fault gets injected in the guest can't it be still useful to
> get the message below on host side?

I don't think we should let the guest flood the host log by issuing
invalid DMA (or are there other cases where the guest can freely print
stuff in the host?). We do print all errors at the moment, but we should
tighten this once there is an upstream solution that lets the guest
control DMA mappings.

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v7 7/7] iommu/virtio: Add event queue

2019-01-15 Thread Jean-Philippe Brucker
The event queue offers a way for the device to report access faults from
endpoints. It is implemented on virtqueue #1. Whenever the host needs to
signal a fault, it fills one of the buffers offered by the guest and
interrupts it.

Tested-by: Bharat Bhushan 
Tested-by: Eric Auger 
Reviewed-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/virtio-iommu.c  | 115 +++---
 include/uapi/linux/virtio_iommu.h |  19 +
 2 files changed, 125 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 5e194493a531..4620dd221ffd 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -29,7 +29,8 @@
 #define MSI_IOVA_LENGTH   0x100000
 
 #define VIOMMU_REQUEST_VQ  0
-#define VIOMMU_NR_VQS  1
+#define VIOMMU_EVENT_VQ  1
+#define VIOMMU_NR_VQS  2
 
 struct viommu_dev {
struct iommu_device iommu;
@@ -41,6 +42,7 @@ struct viommu_dev {
 struct virtqueue   *vqs[VIOMMU_NR_VQS];
spinlock_t  request_lock;
 struct list_head   requests;
+   void   *evts;
 
/* Device configuration */
 struct iommu_domain_geometry   geometry;
@@ -82,6 +84,15 @@ struct viommu_request {
 char   buf[];
 };
 
+#define VIOMMU_FAULT_RESV_MASK 0xffffff00
+
+struct viommu_event {
+   union {
+   u32 head;
+   struct virtio_iommu_fault fault;
+   };
+};
+
 #define to_viommu_domain(domain)   \
container_of(domain, struct viommu_domain, domain)
 
@@ -503,6 +514,68 @@ static int viommu_probe_endpoint(struct viommu_dev 
*viommu, struct device *dev)
return ret;
 }
 
+static int viommu_fault_handler(struct viommu_dev *viommu,
+   struct virtio_iommu_fault *fault)
+{
+   char *reason_str;
+
+   u8 reason   = fault->reason;
+   u32 flags   = le32_to_cpu(fault->flags);
+   u32 endpoint= le32_to_cpu(fault->endpoint);
+   u64 address = le64_to_cpu(fault->address);
+
+   switch (reason) {
+   case VIRTIO_IOMMU_FAULT_R_DOMAIN:
+   reason_str = "domain";
+   break;
+   case VIRTIO_IOMMU_FAULT_R_MAPPING:
+   reason_str = "page";
+   break;
+   case VIRTIO_IOMMU_FAULT_R_UNKNOWN:
+   default:
+   reason_str = "unknown";
+   break;
+   }
+
+   /* TODO: find EP by ID and report_iommu_fault */
+   if (flags & VIRTIO_IOMMU_FAULT_F_ADDRESS)
+   dev_err_ratelimited(viommu->dev, "%s fault from EP %u at %#llx 
[%s%s%s]\n",
+   reason_str, endpoint, address,
+   flags & VIRTIO_IOMMU_FAULT_F_READ ? "R" : 
"",
+   flags & VIRTIO_IOMMU_FAULT_F_WRITE ? "W" : 
"",
+   flags & VIRTIO_IOMMU_FAULT_F_EXEC ? "X" : 
"");
+   else
+   dev_err_ratelimited(viommu->dev, "%s fault from EP %u\n",
+   reason_str, endpoint);
+   return 0;
+}
+
+static void viommu_event_handler(struct virtqueue *vq)
+{
+   int ret;
+   unsigned int len;
+   struct scatterlist sg[1];
+   struct viommu_event *evt;
+   struct viommu_dev *viommu = vq->vdev->priv;
+
+   while ((evt = virtqueue_get_buf(vq, &len)) != NULL) {
+   if (len > sizeof(*evt)) {
+   dev_err(viommu->dev,
+   "invalid event buffer (len %u != %zu)\n",
+   len, sizeof(*evt));
+   } else if (!(evt->head & VIOMMU_FAULT_RESV_MASK)) {
+   viommu_fault_handler(viommu, >fault);
+   }
+
+   sg_init_one(sg, evt, sizeof(*evt));
+   ret = virtqueue_add_inbuf(vq, sg, 1, evt, GFP_ATOMIC);
+   if (ret)
+   dev_err(viommu->dev, "could not add event buffer\n");
+   }
+
+   virtqueue_kick(vq);
+}
+
 /* IOMMU API */
 
 static struct iommu_domain *viommu_domain_alloc(unsigned type)
@@ -886,16 +959,35 @@ static struct iommu_ops viommu_ops = {
 static int viommu_init_vqs(struct viommu_dev *viommu)
 {
struct virtio_device *vdev = dev_to_virtio(viommu->dev);
-   const char *name = "request";
-   void *ret;
+   const char *names[] = { "request", "event" };
+   vq_callback_t *callbacks[] = {
+   NULL, /* No async requests */
+   viommu_event_handler,
+   };
 
-   ret = virtio_find_si

[PATCH v7 1/7] dt-bindings: virtio-mmio: Add IOMMU description

2019-01-15 Thread Jean-Philippe Brucker
The nature of a virtio-mmio node is discovered by the virtio driver at
probe time. However the DMA relation between devices must be described
statically. When a virtio-mmio node is a virtio-iommu device, it needs an
"#iommu-cells" property as specified by bindings/iommu/iommu.txt.

Otherwise, the virtio-mmio device may perform DMA through an IOMMU, which
requires an "iommus" property. Describe these requirements in the
device-tree bindings documentation.

Reviewed-by: Rob Herring 
Reviewed-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 
---
 .../devicetree/bindings/virtio/mmio.txt   | 30 +++
 1 file changed, 30 insertions(+)

diff --git a/Documentation/devicetree/bindings/virtio/mmio.txt 
b/Documentation/devicetree/bindings/virtio/mmio.txt
index 5069c1b8e193..21af30fbb81f 100644
--- a/Documentation/devicetree/bindings/virtio/mmio.txt
+++ b/Documentation/devicetree/bindings/virtio/mmio.txt
@@ -8,10 +8,40 @@ Required properties:
 - reg: control registers base address and size including configuration 
space
 - interrupts:  interrupt generated by the device
 
+Required properties for virtio-iommu:
+
+- #iommu-cells:When the node corresponds to a virtio-iommu device, it 
is
+   linked to DMA masters using the "iommus" or "iommu-map"
+   properties [1][2]. #iommu-cells specifies the size of the
+   "iommus" property. For virtio-iommu #iommu-cells must be
+   1, each cell describing a single endpoint ID.
+
+Optional properties:
+
+- iommus:  If the device accesses memory through an IOMMU, it should
+   have an "iommus" property [1]. Since virtio-iommu itself
+   does not access memory through an IOMMU, the "virtio,mmio"
+   node cannot have both an "#iommu-cells" and an "iommus"
+   property.
+
 Example:
 
virtio_block@3000 {
compatible = "virtio,mmio";
reg = <0x3000 0x100>;
interrupts = <41>;
+
+   /* Device has endpoint ID 23 */
+   iommus = <&viommu 23>
}
+
+   viommu: iommu@3100 {
+   compatible = "virtio,mmio";
+   reg = <0x3100 0x100>;
+   interrupts = <42>;
+
+   #iommu-cells = <1>
+   }
+
+[1] Documentation/devicetree/bindings/iommu/iommu.txt
+[2] Documentation/devicetree/bindings/pci/pci-iommu.txt
-- 
2.19.1

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v7 2/7] dt-bindings: virtio: Add virtio-pci-iommu node

2019-01-15 Thread Jean-Philippe Brucker
Some systems implement virtio-iommu as a PCI endpoint. The operating
system needs to discover the relationship between IOMMU and masters long
before the PCI endpoint gets probed. Add a PCI child node to describe the
virtio-iommu device.

The virtio-pci-iommu is conceptually split between a PCI programming
interface and a translation component on the parent bus. The latter
doesn't have a node in the device tree. The virtio-pci-iommu node
describes both, by linking the PCI endpoint to "iommus" property of DMA
master nodes and to "iommu-map" properties of bus nodes.

Reviewed-by: Rob Herring 
Reviewed-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 
---
 .../devicetree/bindings/virtio/iommu.txt  | 66 +++
 1 file changed, 66 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/virtio/iommu.txt

diff --git a/Documentation/devicetree/bindings/virtio/iommu.txt 
b/Documentation/devicetree/bindings/virtio/iommu.txt
new file mode 100644
index ..2407fea0651c
--- /dev/null
+++ b/Documentation/devicetree/bindings/virtio/iommu.txt
@@ -0,0 +1,66 @@
+* virtio IOMMU PCI device
+
+When virtio-iommu uses the PCI transport, its programming interface is
+discovered dynamically by the PCI probing infrastructure. However the
+device tree statically describes the relation between IOMMU and DMA
+masters. Therefore, the PCI root complex that hosts the virtio-iommu
+contains a child node representing the IOMMU device explicitly.
+
+Required properties:
+
+- compatible:  Should be "virtio,pci-iommu"
+- reg: PCI address of the IOMMU. As defined in the PCI Bus
+   Binding reference [1], the reg property is a five-cell
+   address encoded as (phys.hi phys.mid phys.lo size.hi
+   size.lo). phys.hi should contain the device's BDF as
+   0b00000000 bbbbbbbb dddddfff 00000000. The other cells
+   should be zero.
+- #iommu-cells:Each platform DMA master managed by the IOMMU is 
assigned
+   an endpoint ID, described by the "iommus" property [2].
+   For virtio-iommu, #iommu-cells must be 1.
+
+Notes:
+
+- DMA from the IOMMU device isn't managed by another IOMMU. Therefore the
+  virtio-iommu node doesn't have an "iommus" property, and is omitted from
+  the iommu-map property of the root complex.
+
+Example:
+
+pcie@10000000 {
+   compatible = "pci-host-ecam-generic";
+   ...
+
+   /* The IOMMU programming interface uses slot 00:01.0 */
+   iommu0: iommu@0008 {
+   compatible = "virtio,pci-iommu";
+   reg = <0x00000800 0 0 0 0>;
+   #iommu-cells = <1>;
+   };
+
+   /*
+* The IOMMU manages all functions in this PCI domain except
+* itself. Omit BDF 00:01.0.
+*/
+   iommu-map = <0x0 &iommu0 0x0 0x8>
+   <0x9 &iommu0 0x9 0xfff7>;
+};
+
+pcie@20000000 {
+   compatible = "pci-host-ecam-generic";
+   ...
+   /*
+* The IOMMU also manages all functions from this domain,
+* with endpoint IDs 0x10000 - 0x1ffff
+*/
+   iommu-map = <0x0 &iommu0 0x10000 0x10000>;
+};
+
+ethernet@fe001000 {
+   ...
+   /* The IOMMU manages this platform device with endpoint ID 0x20000 */
+   iommus = <&iommu0 0x20000>;
+};
+
+[1] Documentation/devicetree/bindings/pci/pci.txt
+[2] Documentation/devicetree/bindings/iommu/iommu.txt
-- 
2.19.1

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v7 4/7] PCI: OF: Initialize dev->fwnode appropriately

2019-01-15 Thread Jean-Philippe Brucker
For PCI devices that have an OF node, set the fwnode as well. This way
drivers that rely on fwnode don't need the special case described by
commit f94277af03ea ("of/platform: Initialise dev->fwnode appropriately").

Acked-by: Bjorn Helgaas 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/pci/of.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/pci/of.c b/drivers/pci/of.c
index 4c4217d0c3f1..c272ecfcd038 100644
--- a/drivers/pci/of.c
+++ b/drivers/pci/of.c
@@ -21,12 +21,15 @@ void pci_set_of_node(struct pci_dev *dev)
return;
dev->dev.of_node = of_pci_find_child_device(dev->bus->dev.of_node,
dev->devfn);
+   if (dev->dev.of_node)
+   dev->dev.fwnode = &dev->dev.of_node->fwnode;
 }
 
 void pci_release_of_node(struct pci_dev *dev)
 {
of_node_put(dev->dev.of_node);
dev->dev.of_node = NULL;
+   dev->dev.fwnode = NULL;
 }
 
 void pci_set_bus_of_node(struct pci_bus *bus)
@@ -35,12 +38,16 @@ void pci_set_bus_of_node(struct pci_bus *bus)
bus->dev.of_node = pcibios_get_phb_of_node(bus);
else
bus->dev.of_node = of_node_get(bus->self->dev.of_node);
+
+   if (bus->dev.of_node)
+   bus->dev.fwnode = &bus->dev.of_node->fwnode;
 }
 
 void pci_release_bus_of_node(struct pci_bus *bus)
 {
of_node_put(bus->dev.of_node);
bus->dev.of_node = NULL;
+   bus->dev.fwnode = NULL;
 }
 
 struct device_node * __weak pcibios_get_phb_of_node(struct pci_bus *bus)
-- 
2.19.1

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v7 5/7] iommu: Add virtio-iommu driver

2019-01-15 Thread Jean-Philippe Brucker
The virtio IOMMU is a para-virtualized device, allowing the guest to send IOMMU
requests such as map/unmap over virtio transport without emulating page
tables. This implementation handles ATTACH, DETACH, MAP and UNMAP
requests.

The bulk of the code transforms calls coming from the IOMMU API into
corresponding virtio requests. Mappings are kept in an interval tree
instead of page tables. A little more work is required for modular and x86
support, so for the moment the driver depends on CONFIG_VIRTIO=y and
CONFIG_ARM64.

Tested-by: Bharat Bhushan 
Tested-by: Eric Auger 
Reviewed-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 
---
 MAINTAINERS   |   7 +
 drivers/iommu/Kconfig |  11 +
 drivers/iommu/Makefile|   1 +
 drivers/iommu/virtio-iommu.c  | 916 ++
 include/uapi/linux/virtio_ids.h   |   1 +
 include/uapi/linux/virtio_iommu.h | 106 
 6 files changed, 1042 insertions(+)
 create mode 100644 drivers/iommu/virtio-iommu.c
 create mode 100644 include/uapi/linux/virtio_iommu.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 4d04cebb4a71..1ef06a1e0525 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16274,6 +16274,13 @@ S: Maintained
 F: drivers/virtio/virtio_input.c
 F: include/uapi/linux/virtio_input.h
 
+VIRTIO IOMMU DRIVER
+M: Jean-Philippe Brucker 
+L: virtualizat...@lists.linux-foundation.org
+S: Maintained
+F: drivers/iommu/virtio-iommu.c
+F: include/uapi/linux/virtio_iommu.h
+
 VIRTUAL BOX GUEST DEVICE DRIVER
 M: Hans de Goede 
 M: Arnd Bergmann 
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index d9a25715650e..d507fd754214 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -435,4 +435,15 @@ config QCOM_IOMMU
help
  Support for IOMMU on certain Qualcomm SoCs.
 
+config VIRTIO_IOMMU
+   bool "Virtio IOMMU driver"
+   depends on VIRTIO=y
+   depends on ARM64
+   select IOMMU_API
+   select INTERVAL_TREE
+   help
+ Para-virtualised IOMMU driver with virtio.
+
+ Say Y here if you intend to run this kernel as a guest.
+
 endif # IOMMU_SUPPORT
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index a158a68c8ea8..48d831a39281 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -32,3 +32,4 @@ obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
 obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
 obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
 obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o
+obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
new file mode 100644
index ..6fa012cd727e
--- /dev/null
+++ b/drivers/iommu/virtio-iommu.c
@@ -0,0 +1,916 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Virtio driver for the paravirtualized IOMMU
+ *
+ * Copyright (C) 2018 Arm Limited
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#define MSI_IOVA_BASE  0x8000000
+#define MSI_IOVA_LENGTH  0x100000
+
+#define VIOMMU_REQUEST_VQ  0
+#define VIOMMU_NR_VQS  1
+
+struct viommu_dev {
+   struct iommu_device iommu;
+   struct device   *dev;
+   struct virtio_device*vdev;
+
+   struct ida  domain_ids;
+
+   struct virtqueue*vqs[VIOMMU_NR_VQS];
+   spinlock_t  request_lock;
+   struct list_headrequests;
+
+   /* Device configuration */
+   struct iommu_domain_geometrygeometry;
+   u64 pgsize_bitmap;
+   u8  domain_bits;
+};
+
+struct viommu_mapping {
+   phys_addr_t paddr;
+   struct interval_tree_node   iova;
+   u32 flags;
+};
+
+struct viommu_domain {
+   struct iommu_domain domain;
+   struct viommu_dev   *viommu;
+   struct mutexmutex; /* protects viommu pointer */
+   unsigned intid;
+
+   spinlock_t  mappings_lock;
+   struct rb_root_cached   mappings;
+
+   unsigned long   nr_endpoints;
+};
+
+struct viommu_endpoint {
+   struct viommu_dev   *viommu;
+   struct viommu_domain*vdomain;
+};
+
+struct viommu_request {
+   struct list_headlist;
+   void*writeback;
+   unsigned intwrite_offset;
+   unsigned intlen;
+   charbuf[];
+};
+
+#define to_viommu_domain(domain)   \
+   container_of(domain, 

[PATCH v7 6/7] iommu/virtio: Add probe request

2019-01-15 Thread Jean-Philippe Brucker
When the device offers the probe feature, send a probe request for each
device managed by the IOMMU. Extract RESV_MEM information. When we
encounter an MSI doorbell region, set it up as an IOMMU_RESV_MSI region.
This will tell other subsystems that there is no need to map the MSI
doorbell in the virtio-iommu, because MSIs bypass it.

Tested-by: Bharat Bhushan 
Tested-by: Eric Auger 
Reviewed-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/virtio-iommu.c  | 157 --
 include/uapi/linux/virtio_iommu.h |  36 +++
 2 files changed, 187 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 6fa012cd727e..5e194493a531 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -46,6 +46,7 @@ struct viommu_dev {
struct iommu_domain_geometrygeometry;
u64 pgsize_bitmap;
u8  domain_bits;
+   u32 probe_size;
 };
 
 struct viommu_mapping {
@@ -67,8 +68,10 @@ struct viommu_domain {
 };
 
 struct viommu_endpoint {
+   struct device   *dev;
struct viommu_dev   *viommu;
struct viommu_domain*vdomain;
+   struct list_headresv_regions;
 };
 
 struct viommu_request {
@@ -119,6 +122,9 @@ static off_t viommu_get_write_desc_offset(struct viommu_dev 
*viommu,
 {
size_t tail_size = sizeof(struct virtio_iommu_req_tail);
 
+   if (req->type == VIRTIO_IOMMU_T_PROBE)
+   return len - viommu->probe_size - tail_size;
+
return len - tail_size;
 }
 
@@ -393,6 +399,110 @@ static int viommu_replay_mappings(struct viommu_domain 
*vdomain)
return ret;
 }
 
+static int viommu_add_resv_mem(struct viommu_endpoint *vdev,
+  struct virtio_iommu_probe_resv_mem *mem,
+  size_t len)
+{
+   size_t size;
+   u64 start64, end64;
+   phys_addr_t start, end;
+   struct iommu_resv_region *region = NULL;
+   unsigned long prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
+
+   start = start64 = le64_to_cpu(mem->start);
+   end = end64 = le64_to_cpu(mem->end);
+   size = end64 - start64 + 1;
+
+   /* Catch any overflow, including the unlikely end64 - start64 + 1 = 0 */
+   if (start != start64 || end != end64 || size < end64 - start64)
+   return -EOVERFLOW;
+
+   if (len < sizeof(*mem))
+   return -EINVAL;
+
+   switch (mem->subtype) {
+   default:
+   dev_warn(vdev->dev, "unknown resv mem subtype 0x%x\n",
+mem->subtype);
+   /* Fall-through */
+   case VIRTIO_IOMMU_RESV_MEM_T_RESERVED:
+   region = iommu_alloc_resv_region(start, size, 0,
+IOMMU_RESV_RESERVED);
+   break;
+   case VIRTIO_IOMMU_RESV_MEM_T_MSI:
+   region = iommu_alloc_resv_region(start, size, prot,
+IOMMU_RESV_MSI);
+   break;
+   }
+   if (!region)
+   return -ENOMEM;
+
+   list_add(&region->list, &vdev->resv_regions);
+   return 0;
+}
+
+static int viommu_probe_endpoint(struct viommu_dev *viommu, struct device *dev)
+{
+   int ret;
+   u16 type, len;
+   size_t cur = 0;
+   size_t probe_len;
+   struct virtio_iommu_req_probe *probe;
+   struct virtio_iommu_probe_property *prop;
+   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+   struct viommu_endpoint *vdev = fwspec->iommu_priv;
+
+   if (!fwspec->num_ids)
+   return -EINVAL;
+
+   probe_len = sizeof(*probe) + viommu->probe_size +
+   sizeof(struct virtio_iommu_req_tail);
+   probe = kzalloc(probe_len, GFP_KERNEL);
+   if (!probe)
+   return -ENOMEM;
+
+   probe->head.type = VIRTIO_IOMMU_T_PROBE;
+   /*
+* For now, assume that properties of an endpoint that outputs multiple
+* IDs are consistent. Only probe the first one.
+*/
+   probe->endpoint = cpu_to_le32(fwspec->ids[0]);
+
+   ret = viommu_send_req_sync(viommu, probe, probe_len);
+   if (ret)
+   goto out_free;
+
+   prop = (void *)probe->properties;
+   type = le16_to_cpu(prop->type) & VIRTIO_IOMMU_PROBE_T_MASK;
+
+   while (type != VIRTIO_IOMMU_PROBE_T_NONE &&
+  cur < viommu->probe_size) {
+   len = le16_to_cpu(prop->length) + sizeof(*prop);
+
+   switch (type) {
+   case VIRTIO_IOMMU_PROBE_T_RESV_MEM:
+   ret = viommu_add_resv_mem(vdev, (void *)prop, len);
+   break;
+   default:
+   dev_e

[PATCH v7 0/7] Add virtio-iommu driver

2019-01-15 Thread Jean-Philippe Brucker
Implement the virtio-iommu driver, following specification v0.9 [1].

This is a simple rebase onto Linux v5.0-rc2. We now use the
dev_iommu_fwspec_get() helper introduced in v5.0 instead of accessing
dev->iommu_fwspec, but there aren't any functional change from v6 [2].

Our current goal for virtio-iommu is to get a paravirtual IOMMU working
on Arm, and enable device assignment to guest userspace. In this
use-case the mappings are static, and don't require optimal performance,
so this series tries to keep things simple. However there is plenty more
to do for features and optimizations, and having this base in v5.1 would
be good. Given that most of the changes are to drivers/iommu, I believe
the driver and future changes should go via the IOMMU tree.

You can find Linux driver and kvmtool device on v0.9.2 branches [3],
module and x86 support on virtio-iommu/devel. Also tested with Eric's
QEMU device [4]. Please note that the series depends on Robin's
probe-deferral fix [5], which will hopefully land in v5.0.

[1] Virtio-iommu specification v0.9, sources and pdf
git://linux-arm.org/virtio-iommu.git virtio-iommu/v0.9
http://jpbrucker.net/virtio-iommu/spec/v0.9/virtio-iommu-v0.9.pdf

[2] [PATCH v6 0/7] Add virtio-iommu driver
https://lists.linuxfoundation.org/pipermail/iommu/2018-December/032127.html

[3] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.9.2
git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.9.2

[4] [RFC v9 00/17] VIRTIO-IOMMU device
https://www.mail-archive.com/qemu-devel@nongnu.org/msg575578.html

[5] [PATCH] iommu/of: Fix probe-deferral
https://www.spinics.net/lists/arm-kernel/msg698371.html

Jean-Philippe Brucker (7):
  dt-bindings: virtio-mmio: Add IOMMU description
  dt-bindings: virtio: Add virtio-pci-iommu node
  of: Allow the iommu-map property to omit untranslated devices
  PCI: OF: Initialize dev->fwnode appropriately
  iommu: Add virtio-iommu driver
  iommu/virtio: Add probe request
  iommu/virtio: Add event queue

 .../devicetree/bindings/virtio/iommu.txt  |   66 +
 .../devicetree/bindings/virtio/mmio.txt   |   30 +
 MAINTAINERS   |7 +
 drivers/iommu/Kconfig |   11 +
 drivers/iommu/Makefile|1 +
 drivers/iommu/virtio-iommu.c  | 1158 +
 drivers/of/base.c |   10 +-
 drivers/pci/of.c  |7 +
 include/uapi/linux/virtio_ids.h   |1 +
 include/uapi/linux/virtio_iommu.h |  161 +++
 10 files changed, 1449 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/virtio/iommu.txt
 create mode 100644 drivers/iommu/virtio-iommu.c
 create mode 100644 include/uapi/linux/virtio_iommu.h

-- 
2.19.1

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC v3 01/21] iommu: Introduce set_pasid_table API

2019-01-11 Thread Jean-Philippe Brucker
On 08/01/2019 10:26, Eric Auger wrote:
> From: Jacob Pan 
> 
> In the virtualization use case, when a guest is assigned
> a PCI host device, protected by a virtual IOMMU on the guest,
> the physical IOMMU must be programmed to be consistent with
> the guest mappings. If the physical IOMMU supports two
> translation stages, it makes sense to program guest mappings
> onto the first stage/level (ARM/VT-d terminology) while the host
> owns stage/level 2.
> 
> In that case, it is mandated to trap on guest configuration
> settings and pass those to the physical iommu driver.
> 
> This patch adds a new API to the iommu subsystem that allows
> setting the pasid table information.
> 
> A generic iommu_pasid_table_config struct is introduced in
> a new iommu.h uapi header. This is going to be used by the VFIO
> user API. We foresee at least two specializations of this struct,
> for PASID table passing and ARM SMMUv3.

Last sentence is a bit confusing. With SMMUv3 it is also used for the
PASID table, even when it only has one entry and PASID is disabled.

> Signed-off-by: Jean-Philippe Brucker 
> Signed-off-by: Liu, Yi L 
> Signed-off-by: Ashok Raj 
> Signed-off-by: Jacob Pan 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> This patch generalizes the API introduced by Jacob & co-authors in
> https://lwn.net/Articles/754331/
> 
> v2 -> v3:
> - replace unbind/bind by set_pasid_table
> - move table pointer and pasid bits in the generic part of the struct
> 
> v1 -> v2:
> - restore the original pasid table name
> - remove the struct device * parameter in the API
> - reworked iommu_pasid_smmuv3
> ---
>  drivers/iommu/iommu.c  | 10 
>  include/linux/iommu.h  | 14 +++
>  include/uapi/linux/iommu.h | 50 ++
>  3 files changed, 74 insertions(+)
>  create mode 100644 include/uapi/linux/iommu.h
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 3ed4db334341..0f2b7f1fc7c8 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1393,6 +1393,16 @@ int iommu_attach_device(struct iommu_domain *domain, 
> struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>  
> +int iommu_set_pasid_table(struct iommu_domain *domain,
> +   struct iommu_pasid_table_config *cfg)
> +{
> + if (unlikely(!domain->ops->set_pasid_table))
> + return -ENODEV;
> +
> + return domain->ops->set_pasid_table(domain, cfg);
> +}
> +EXPORT_SYMBOL_GPL(iommu_set_pasid_table);
> +
>  static void __iommu_detach_device(struct iommu_domain *domain,
> struct device *dev)
>  {
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index e90da6b6f3d1..1da2a2357ea4 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -25,6 +25,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define IOMMU_READ   (1 << 0)
>  #define IOMMU_WRITE  (1 << 1)
> @@ -184,6 +185,7 @@ struct iommu_resv_region {
>   * @domain_window_disable: Disable a particular window for a domain
>   * @of_xlate: add OF master IDs to iommu grouping
>   * @pgsize_bitmap: bitmap of all possible supported page sizes
> + * @set_pasid_table: set pasid table
>   */
>  struct iommu_ops {
>   bool (*capable)(enum iommu_cap);
> @@ -226,6 +228,9 @@ struct iommu_ops {
>   int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
>   bool (*is_attach_deferred)(struct iommu_domain *domain, struct device 
> *dev);
>  
> + int (*set_pasid_table)(struct iommu_domain *domain,
> +struct iommu_pasid_table_config *cfg);
> +
>   unsigned long pgsize_bitmap;
>  };
>  
> @@ -287,6 +292,8 @@ extern int iommu_attach_device(struct iommu_domain 
> *domain,
>  struct device *dev);
>  extern void iommu_detach_device(struct iommu_domain *domain,
>   struct device *dev);
> +extern int iommu_set_pasid_table(struct iommu_domain *domain,
> +  struct iommu_pasid_table_config *cfg);
>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
> @@ -696,6 +703,13 @@ const struct iommu_ops *iommu_ops_from_fwnode(struct 
> fwnode_handle *fwnode)
>   return NULL;
>  }
>  
> +static inline
> +int iommu_set_pasid_table(struct iommu_domain *domain,
> +   struct iommu_pasid_table_config *cfg)
> +{
> + return -ENODEV;
> +}
> +
> 

Re: [RFC v3 17/21] iommu/smmuv3: Report non recoverable faults

2019-01-11 Thread Jean-Philippe Brucker
On 08/01/2019 10:26, Eric Auger wrote:
> When a stage 1 related fault event is read from the event queue,
> let's propagate it to potential external fault listeners, ie. users
> who registered a fault handler.
> 
> Signed-off-by: Eric Auger 
> ---
>  drivers/iommu/arm-smmu-v3.c | 124 
>  1 file changed, 113 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 999ee470a2ae..6a711cbbb228 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -168,6 +168,26 @@
>  #define ARM_SMMU_PRIQ_IRQ_CFG1   0xd8
>  #define ARM_SMMU_PRIQ_IRQ_CFG2   0xdc
>  
> +/* Events */
> +#define ARM_SMMU_EVT_F_UUT   0x01
> +#define ARM_SMMU_EVT_C_BAD_STREAMID  0x02
> +#define ARM_SMMU_EVT_F_STE_FETCH 0x03
> +#define ARM_SMMU_EVT_C_BAD_STE   0x04
> +#define ARM_SMMU_EVT_F_BAD_ATS_TREQ  0x05
> +#define ARM_SMMU_EVT_F_STREAM_DISABLED   0x06
> +#define ARM_SMMU_EVT_F_TRANSL_FORBIDDEN  0x07
> +#define ARM_SMMU_EVT_C_BAD_SUBSTREAMID   0x08
> +#define ARM_SMMU_EVT_F_CD_FETCH  0x09
> +#define ARM_SMMU_EVT_C_BAD_CD0x0a
> +#define ARM_SMMU_EVT_F_WALK_EABT 0x0b
> +#define ARM_SMMU_EVT_F_TRANSLATION   0x10
> +#define ARM_SMMU_EVT_F_ADDR_SIZE 0x11
> +#define ARM_SMMU_EVT_F_ACCESS0x12
> +#define ARM_SMMU_EVT_F_PERMISSION0x13
> +#define ARM_SMMU_EVT_F_TLB_CONFLICT  0x20
> +#define ARM_SMMU_EVT_F_CFG_CONFLICT  0x21
> +#define ARM_SMMU_EVT_E_PAGE_REQUEST  0x24
> +
>  /* Common MSI config fields */
>  #define MSI_CFG0_ADDR_MASK   GENMASK_ULL(51, 2)
>  #define MSI_CFG2_SH  GENMASK(5, 4)
> @@ -333,6 +353,11 @@
>  #define EVTQ_MAX_SZ_SHIFT7
>  
>  #define EVTQ_0_IDGENMASK_ULL(7, 0)
> +#define EVTQ_0_SUBSTREAMID   GENMASK_ULL(31, 12)
> +#define EVTQ_0_STREAMID  GENMASK_ULL(63, 32)
> +#define EVTQ_1_S2GENMASK_ULL(39, 39)
> +#define EVTQ_1_CLASS GENMASK_ULL(40, 41)
> +#define EVTQ_3_FETCH_ADDRGENMASK_ULL(51, 3)
>  
>  /* PRI queue */
>  #define PRIQ_ENT_DWORDS  2
> @@ -1270,7 +1295,6 @@ static int arm_smmu_init_l2_strtab(struct 
> arm_smmu_device *smmu, u32 sid)
>   return 0;
>  }
>  
> -__maybe_unused
>  static struct arm_smmu_master_data *
>  arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
>  {
> @@ -1296,24 +1320,102 @@ arm_smmu_find_master(struct arm_smmu_device *smmu, 
> u32 sid)
>   return master;
>  }
>  
> +static void arm_smmu_report_event(struct arm_smmu_device *smmu, u64 *evt)
> +{
> + u64 fetch_addr = FIELD_GET(EVTQ_3_FETCH_ADDR, evt[3]);
> + u32 sid = FIELD_GET(EVTQ_0_STREAMID, evt[0]);
> + bool s1 = !FIELD_GET(EVTQ_1_S2, evt[1]);
> + u8 type = FIELD_GET(EVTQ_0_ID, evt[0]);
> + struct arm_smmu_master_data *master;
> + struct iommu_fault_event event;
> + bool propagate = true;
> + u64 addr = evt[2];
> + int i;
> +
> + master = arm_smmu_find_master(smmu, sid);
> + if (WARN_ON(!master))
> + return;
> +
> + event.fault.type = IOMMU_FAULT_DMA_UNRECOV;
> +
> + switch (type) {
> + case ARM_SMMU_EVT_C_BAD_STREAMID:
> + event.fault.reason = IOMMU_FAULT_REASON_SOURCEID_INVALID;
> + break;
> + case ARM_SMMU_EVT_F_STREAM_DISABLED:
> + case ARM_SMMU_EVT_C_BAD_SUBSTREAMID:
> + event.fault.reason = IOMMU_FAULT_REASON_PASID_INVALID;
> + break;
> + case ARM_SMMU_EVT_F_CD_FETCH:
> + event.fault.reason = IOMMU_FAULT_REASON_PASID_FETCH;
> + break;
> + case ARM_SMMU_EVT_F_WALK_EABT:
> + event.fault.reason = IOMMU_FAULT_REASON_WALK_EABT;
> + event.fault.addr = addr;
> + event.fault.fetch_addr = fetch_addr;
> + propagate = s1;
> + break;
> + case ARM_SMMU_EVT_F_TRANSLATION:
> + event.fault.reason = IOMMU_FAULT_REASON_PTE_FETCH;
> + event.fault.addr = addr;
> + event.fault.fetch_addr = fetch_addr;
> + propagate = s1;
> + break;
> + case ARM_SMMU_EVT_F_PERMISSION:
> + event.fault.reason = IOMMU_FAULT_REASON_PERMISSION;
> + event.fault.addr = addr;
> + propagate = s1;
> + break;
> + case ARM_SMMU_EVT_F_ACCESS:
> + event.fault.reason = IOMMU_FAULT_REASON_ACCESS;
> + event.fault.addr = addr;
> + propagate = s1;
> + break;
> + case ARM_SMMU_EVT_C_BAD_STE:
> + event.fault.reason = 
> IOMMU_FAULT_REASON_BAD_DEVICE_CONTEXT_ENTRY;
> + break;
> + case ARM_SMMU_EVT_C_BAD_CD:
> + event.fault.reason = IOMMU_FAULT_REASON_BAD_PASID_ENTRY;
> + break;
> + case ARM_SMMU_EVT_F_ADDR_SIZE:
> + event.fault.reason = 

Re: [RFC v3 11/21] iommu/smmuv3: Implement cache_invalidate

2019-01-11 Thread Jean-Philippe Brucker
On 08/01/2019 10:26, Eric Auger wrote:
> Implement IOMMU_INV_TYPE_TLB invalidations. When
> nr_pages is null we interpret this as a context
> invalidation.
> 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> The user API needs to be refined to discriminate context
> invalidations from NH_VA invalidations. Also the leaf attribute
> is not yet properly handled.
> 
> v2 -> v3:
> - replace __arm_smmu_tlb_sync by arm_smmu_cmdq_issue_sync
> 
> v1 -> v2:
> - properly pass the asid
> ---
>  drivers/iommu/arm-smmu-v3.c | 40 +
>  1 file changed, 40 insertions(+)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 0e006babc8a6..ca72e0ce92f6 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -2293,6 +2293,45 @@ static int arm_smmu_set_pasid_table(struct 
> iommu_domain *domain,
>   return ret;
>  }
>  
> +static int
> +arm_smmu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
> +   struct iommu_cache_invalidate_info *inv_info)
> +{
> + struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> + struct arm_smmu_device *smmu = smmu_domain->smmu;
> +
> + if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
> + return -EINVAL;
> +
> + if (!smmu)
> + return -EINVAL;
> +
> + switch (inv_info->hdr.type) {
> + case IOMMU_INV_TYPE_TLB:
> + /*
> +  * TODO: On context invalidation, the userspace sets nr_pages
> +  * to 0. Refine the API to add a dedicated flags and also
> +  * properly handle the leaf parameter.
> +  */

That's what inv->granularity is for: if inv->granularity is PASID_SEL,
then the invalidation is for the whole context (and nr_pages, size,
addr, etc. should be ignored). If inv->granularity is PAGE_PASID, then
it's a range. The names could probably be improved but it's already in
the API
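
As a rough sketch, the SMMUv3 handler could key off that field instead of
nr_pages (the IOMMU_INV_GRANU_* spellings below are placeholders for the
PASID_SEL/PAGE_PASID values mentioned above, not necessarily the final
uapi names):

	switch (inv_info->granularity) {
	case IOMMU_INV_GRANU_PASID_SEL:
		/* Whole-context invalidation: addr/nr_pages/size are ignored */
		smmu_domain->s1_cfg.cd.asid = inv_info->arch_id;
		arm_smmu_tlb_inv_context(smmu_domain);
		return 0;
	case IOMMU_INV_GRANU_PAGE_PASID: {
		/* Range invalidation */
		size_t granule = 1UL << (inv_info->size + 12);
		size_t size = inv_info->nr_pages * granule;

		smmu_domain->s1_cfg.cd.asid = inv_info->arch_id;
		arm_smmu_tlb_inv_range_nosync(inv_info->addr, size, granule,
					      false, smmu_domain);
		arm_smmu_cmdq_issue_sync(smmu);
		return 0;
	}
	default:
		return -EINVAL;
	}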

Thanks,
Jean

> + if (!inv_info->nr_pages) {
> + smmu_domain->s1_cfg.cd.asid = inv_info->arch_id;
> + arm_smmu_tlb_inv_context(smmu_domain);
> + } else {
> + size_t granule = 1 << (inv_info->size + 12);
> + size_t size = inv_info->nr_pages * granule;
> +
> + smmu_domain->s1_cfg.cd.asid = inv_info->arch_id;
> + arm_smmu_tlb_inv_range_nosync(inv_info->addr, size,
> +   granule, false,
> +   smmu_domain);
> + arm_smmu_cmdq_issue_sync(smmu);
> + }
> + return 0;
> + default:
> + return -EINVAL;
> + }
> +}
> +
>  static struct iommu_ops arm_smmu_ops = {
>   .capable= arm_smmu_capable,
>   .domain_alloc   = arm_smmu_domain_alloc,
> @@ -2312,6 +2351,7 @@ static struct iommu_ops arm_smmu_ops = {
>   .get_resv_regions   = arm_smmu_get_resv_regions,
>   .put_resv_regions   = arm_smmu_put_resv_regions,
>   .set_pasid_table= arm_smmu_set_pasid_table,
> + .cache_invalidate   = arm_smmu_cache_invalidate,
>   .pgsize_bitmap  = -1UL, /* Restricted during device attach */
>  };
>  
> 

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC v3 09/21] iommu/smmuv3: Get prepared for nested stage support

2019-01-11 Thread Jean-Philippe Brucker
Hi Eric,

On 08/01/2019 10:26, Eric Auger wrote:
> To allow nested stage support, we need to store both
> stage 1 and stage 2 configurations (and remove the former
> union).
> 
> arm_smmu_write_strtab_ent() is modified to write both stage
> fields in the STE.
> 
> We add a nested_bypass field to the S1 configuration as the first
> stage can be bypassed. Also the guest may force the STE to abort:
> this information gets stored into the nested_abort field.
> 
> Only S2 stage is "finalized" as the host does not configure
> S1 CD, guest does.
> 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> v1 -> v2:
> - invalidate the STE before moving from a live STE config to another
> - add the nested_abort and nested_bypass fields
> ---
>  drivers/iommu/arm-smmu-v3.c | 43 -
>  1 file changed, 33 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 9af68266bbb1..9716a301d9ae 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -212,6 +212,7 @@
>  #define STRTAB_STE_0_CFG_BYPASS  4
>  #define STRTAB_STE_0_CFG_S1_TRANS5
>  #define STRTAB_STE_0_CFG_S2_TRANS6
> +#define STRTAB_STE_0_CFG_NESTED  7
>  
>  #define STRTAB_STE_0_S1FMT   GENMASK_ULL(5, 4)
>  #define STRTAB_STE_0_S1FMT_LINEAR0
> @@ -491,6 +492,10 @@ struct arm_smmu_strtab_l1_desc {
>  struct arm_smmu_s1_cfg {
>   __le64  *cdptr;
>   dma_addr_t  cdptr_dma;
> + /* in nested mode, tells s1 must be bypassed */
> + boolnested_bypass;
> + /* in nested mode, abort is forced by guest */
> + boolnested_abort;
>  
>   struct arm_smmu_ctx_desc {
>   u16 asid;
> @@ -515,6 +520,7 @@ struct arm_smmu_strtab_ent {
>* configured according to the domain type.
>*/
>   boolassigned;
> + boolnested;
>   struct arm_smmu_s1_cfg  *s1_cfg;
>   struct arm_smmu_s2_cfg  *s2_cfg;
>  };
> @@ -629,10 +635,8 @@ struct arm_smmu_domain {
>   boolnon_strict;
>  
>   enum arm_smmu_domain_stage  stage;
> - union {
> - struct arm_smmu_s1_cfg  s1_cfg;
> - struct arm_smmu_s2_cfg  s2_cfg;
> - };
> + struct arm_smmu_s1_cfg  s1_cfg;
> + struct arm_smmu_s2_cfg  s2_cfg;
>  
>   struct iommu_domain domain;
>  
> @@ -1139,10 +1143,11 @@ static void arm_smmu_write_strtab_ent(struct 
> arm_smmu_device *smmu, u32 sid,

Could you also update the "This is hideously complicated..." comment
with the nested case? This function was complicated before, but it
becomes hell when adding nested and SVA support, so we really need the
comments :)

>   break;
>   case STRTAB_STE_0_CFG_S1_TRANS:
>   case STRTAB_STE_0_CFG_S2_TRANS:
> + case STRTAB_STE_0_CFG_NESTED:
>   ste_live = true;
>   break;
>   case STRTAB_STE_0_CFG_ABORT:
> - if (disable_bypass)
> + if (disable_bypass || ste->nested)
>   break;
>   default:
>   BUG(); /* STE corruption */
> @@ -1154,7 +1159,8 @@ static void arm_smmu_write_strtab_ent(struct 
> arm_smmu_device *smmu, u32 sid,
>  
>   /* Bypass/fault */
>   if (!ste->assigned || !(ste->s1_cfg || ste->s2_cfg)) {
> - if (!ste->assigned && disable_bypass)
> + if ((!ste->assigned && disable_bypass) ||
> + (ste->s1_cfg && ste->s1_cfg->nested_abort))

I don't think we're ever reaching this, given that ste->assigned is true
and ste->s2_cfg is set.

Something I find noteworthy is that with STRTAB_STE_0_CFG_ABORT, no
event is recorded in case of DMA fault. For vSMMU you'd want to emulate
the SMMU behavior closely, so you don't want to inject faults if the
guest sets CFG_ABORT, but this way you also can't report errors to the
VMM. If we did want to notify the VMM of faults, we'd need to implement
nested_abort differently, for example by installing an empty context
descriptor with Config=s1translate-s2translate.

>   val |= FIELD_PREP(STRTAB_STE_0_CFG, 
> STRTAB_STE_0_CFG_ABORT);
>   else
>   val |= FIELD_PREP(STRTAB_STE_0_CFG, 
> STRTAB_STE_0_CFG_BYPASS);
> @@ -1172,8 +1178,17 @@ static void arm_smmu_write_strtab_ent(struct 
> arm_smmu_device *smmu, u32 sid,
>   return;
>   }
>  
> + if (ste->nested && ste_live) {
> + /*
> +  * When enabling nested, the STE may be transitionning from

transitioning (my bad)

> +  * s2 to nested and back. Invalidate the STE before changing it.
> +  */
> + dst[0] = cpu_to_le64(0);
> + 

Re: [PATCH v6 0/7] Add virtio-iommu driver

2019-01-11 Thread Jean-Philippe Brucker
On 11/01/2019 12:28, Joerg Roedel wrote:
> Hi Jean-Philippe,
> 
> On Thu, Dec 13, 2018 at 12:50:29PM +0000, Jean-Philippe Brucker wrote:
>> We already do deferred flush: UNMAP requests are added to the queue by
>> iommu_unmap(), and then flushed out by iotlb_sync(). So we switch to the
>> host only on iotlb_sync(), or when the request queue is full.
> 
> So the mappings can stay in place until iotlb_sync() returns? What
> happens when the guest sends a map request for a region it already sent
> an unmap request for, but did not call iotlb_sync in between?

At that point the unmap is still in the request queue, and the host will
handle it before getting to the map request. For correctness requests
are necessarily handled in-order by the host. So if the map and unmap
refer to the same domain and IOVA, the host will remove the old mapping
before creating the new one.
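
Sketching it with the core API helpers (a hedged example using the
v5.0-era iommu_map/iommu_unmap_fast/iommu_tlb_sync signatures; pa0 and
pa1 are arbitrary physical addresses):

	iommu_map(domain, iova, pa0, SZ_4K, IOMMU_READ | IOMMU_WRITE);
	iommu_unmap_fast(domain, iova, SZ_4K);	/* UNMAP only queued here */
	iommu_map(domain, iova, pa1, SZ_4K, IOMMU_READ | IOMMU_WRITE);
	iommu_tlb_sync(domain);

	/*
	 * Whenever the queue is flushed, the device dequeues MAP(pa0),
	 * UNMAP and MAP(pa1) in that order, so the final state is
	 * iova -> pa1: the second MAP can never be undone by the earlier
	 * UNMAP.
	 */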

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC v3 14/21] iommu: introduce device fault data

2019-01-11 Thread Jean-Philippe Brucker
On 10/01/2019 18:45, Jacob Pan wrote:
> On Tue,  8 Jan 2019 11:26:26 +0100
> Eric Auger  wrote:
> 
>> From: Jacob Pan 
>>
>> Device faults detected by the IOMMU can be reported outside the IOMMU
>> subsystem for further processing. This patch intends to provide
>> generic device fault data such that device drivers can be informed
>> of IOMMU faults without model-specific knowledge.
>>
>> The proposed format is the result of discussion at:
>> https://lkml.org/lkml/2017/11/10/291
>> Part of the code is based on Jean-Philippe Brucker's patchset
>> (https://patchwork.kernel.org/patch/9989315/).
>>
>> The assumption is that a model-specific IOMMU driver can filter and
>> handle most of the internal faults if the cause is within IOMMU driver
>> control. Therefore, the fault reasons that can be reported are grouped
>> and generalized based on common specifications such as PCI ATS.
>>
>> Signed-off-by: Jacob Pan 
>> Signed-off-by: Jean-Philippe Brucker 
>> Signed-off-by: Liu, Yi L 
>> Signed-off-by: Ashok Raj 
>> Signed-off-by: Eric Auger 
>> [moved part of the iommu_fault_event struct in the uapi, enriched
>>  the fault reasons to be able to map unrecoverable SMMUv3 errors]
>> ---
>>  include/linux/iommu.h  | 55 -
>>  include/uapi/linux/iommu.h | 83
>> ++ 2 files changed, 136
>> insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 244c1a3d5989..1dedc2d247c2 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -49,13 +49,17 @@ struct bus_type;
>>  struct device;
>>  struct iommu_domain;
>>  struct notifier_block;
>> +struct iommu_fault_event;
>>  
>>  /* iommu fault flags */
>> -#define IOMMU_FAULT_READ0x0
>> -#define IOMMU_FAULT_WRITE   0x1
>> +#define IOMMU_FAULT_READ(1 << 0)
>> +#define IOMMU_FAULT_WRITE   (1 << 1)
>> +#define IOMMU_FAULT_EXEC(1 << 2)
>> +#define IOMMU_FAULT_PRIV(1 << 3)
>>  
>>  typedef int (*iommu_fault_handler_t)(struct iommu_domain *,
>>  struct device *, unsigned long, int, void *);
>> +typedef int (*iommu_dev_fault_handler_t)(struct iommu_fault_event *,
>> void *); 
>>  struct iommu_domain_geometry {
>>  dma_addr_t aperture_start; /* First address that can be
>> mapped*/ @@ -255,6 +259,52 @@ struct iommu_device {
>>  struct device *dev;
>>  };
>>  
>> +/**
>> + * struct iommu_fault_event - Generic per device fault data
>> + *
>> + * - PCI and non-PCI devices
>> + * - Recoverable faults (e.g. page request), information based on
>> PCI ATS
>> + * and PASID spec.
>> + * - Un-recoverable faults of device interest
>> + * - DMA remapping and IRQ remapping faults
>> + *
>> + * @fault: fault descriptor
>> + * @device_private: if present, uniquely identify device-specific
>> + *  private data for an individual page request.
>> + * @iommu_private: used by the IOMMU driver for storing
>> fault-specific
>> + * data. Users should not modify this field before
>> + * sending the fault response.
>> + */
>> +struct iommu_fault_event {
>> +struct iommu_fault fault;
>> +u64 device_private;
> I think we want to move device_private to uapi since it gets injected
> into the guest, then returned by guest in case of page response. For
> VT-d we also need 128 bits of private data. VT-d spec. 7.7.1

Ah, I didn't notice the format changed in VT-d rev3. On that topic, how
do we manage future extensions to the iommu_fault struct? Should we add
~48 bytes of padding after device_private, along with some flags telling
which field is valid, or deal with it using a structure version like we
do for the invalidate and bind structs? In the first case, iommu_fault
wouldn't fit in a 64-byte cacheline anymore, but I'm not sure we care.
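
Something like this, purely as an illustration of the first option (field
names and sizes are made up for the example, not a proposal):

	struct iommu_fault {
		__u32	type;
		__u32	reason;
		__u64	addr;
		__u64	fetch_addr;
		__u32	pasid;
		__u32	flags;			/* which fields are valid */
		__u64	device_private[2];	/* 128 bits for VT-d rev3 */
		__u8	padding[48];		/* room for future fields */
	};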

> For exception tracking (e.g. unanswered page request), I can add timer
> and list info later when I include PRQ. sounds ok?
>> +u64 iommu_private;
[...]
>> +/**
>> + * struct iommu_fault - Generic fault data
>> + *
>> + * @type contains fault type
>> + * @reason fault reasons if relevant outside IOMMU driver.
>> + * IOMMU driver internal faults are not reported.
>> + * @addr: tells the offending page address
>> + * @fetch_addr: tells the address that caused an abort, if any
>> + * @pasid: contains process address space ID, used in shared virtual
>> 

Re: [PATCH v6 0/7] Add virtio-iommu driver

2018-12-20 Thread Jean-Philippe Brucker
On 19/12/2018 23:09, Michael S. Tsirkin wrote:
> On Thu, Dec 13, 2018 at 12:50:29PM +0000, Jean-Philippe Brucker wrote:
>>>> [3] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.9.1
>>>>  git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.9
>>>
>>> Unfortunately gitweb seems to be broken on linux-arm.org. What is missing
>>> in this patch-set to make this work on x86?
>>
>> You should be able to access it here:
>> http://www.linux-arm.org/git?p=linux-jpb.git;a=shortlog;h=refs/heads/virtio-iommu/devel
>>
>> That branch contains missing bits for x86 support:
>>
>> * ACPI support. We have the code but it's waiting for an IORT spec
>> update, to reserve the IORT node ID. I expect it to take a while, given
>> that I'm alone requesting a change for something that's not upstream or
>> in hardware.
> 
> Frankly I think you should take a hard look at just getting the data
> needed from the PCI device itself.  You don't need to depend on virtio,
> it can be a small driver that gets you that data from the device config
> space and then just goes away.
> 
> If you want help with writing such a small driver let me know.
> 
> If there's an advantage to virtio-iommu then that would be its
> portability, and it all goes out of the window because
> of dependencies on ACPI and DT and OF and the rest of the zoo.

But the portable solutions are ACPI and DT.

Describing the DMA dependency through a device would require the guest
to probe the device before all others. How do we communicate this?
* pass a kernel parameter saying something like "probe_first=00:01.0"
* make sure that the PCI root complex is probed before any other
platform device (since the IOMMU can manage DMA of platform devices).
* change DT, ACPI and PCI core code to handle this probe_first kernel
parameter.

Better go with something standard, that any OS and hypervisor knows how
to use, and that other IOMMU devices already use.

>> * DMA ops for x86 (see "HACK" commit). I'd like to use dma-iommu but I'm
>> not sure how to implement the glue that sets dma_ops properly.
>>
>> Thanks,
>> Jean
> 
> OK so IIUC you are looking into Christoph's suggestions to fix that up?

Eventually yes. I'll give it a try next year, once the dma-iommu changes
are on the list. It's not a priority for me, given that x86 already has
a pvIOMMU with VT-d, and that Arm still needs one. It shouldn't block
this series.

> There's still a bit of time left before the merge window,
> maybe you can make above changes.

I'll wait to see if Joerg has other concerns about the design or the
code, and resend in January. I think that IOMMU driver changes should go
through his tree.

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v6 0/7] Add virtio-iommu driver

2018-12-13 Thread Jean-Philippe Brucker
Hi Joerg,

On 12/12/2018 10:35, Joerg Roedel wrote:
> Hi,
> 
> to make progress on this, we should first agree on the protocol used
> between guest and host. I have a few points to discuss on the protocol
> first.
> 
> On Tue, Dec 11, 2018 at 06:20:57PM +, Jean-Philippe Brucker wrote:
>> [1] Virtio-iommu specification v0.9, sources and pdf
>> git://linux-arm.org/virtio-iommu.git virtio-iommu/v0.9
>> http://jpbrucker.net/virtio-iommu/spec/v0.9/virtio-iommu-v0.9.pdf
> 
> Looking at this I wonder why it doesn't make the IOTLB visible to the
> guest. The UNMAP requests seem to require that the TLB is already
> flushed to make the unmap visible.
> 
> I think that will cost significant performance for both the vfio and
> dma-iommu use-cases, which both do deferred flushing (vfio at least to
> some degree).

We already do deferred flush: UNMAP requests are added to the queue by
iommu_unmap(), and then flushed out by iotlb_sync(). So we switch to the
host only on iotlb_sync(), or when the request queue is full.
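
Roughly, the split in the driver looks like this (simplified; unmap only
queues the request, the sync callback is the one that talks to the host):

	static void viommu_iotlb_sync(struct iommu_domain *domain)
	{
		struct viommu_domain *vdomain = to_viommu_domain(domain);

		/* Kick the device and wait for queued MAP/UNMAP requests */
		viommu_sync_req(vdomain->viommu);
	}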

> I also wonder whether the protocol should implement a
> protocol version handshake and iommu-feature set queries.

With the virtio transport there is a handshake when the device (IOMMU)
is initialized, through feature bits and global config fields. Feature
bits are made of both transport-specific features, including the version
number, and device-specific features defined in section 2.3 of the above
document (the transport is described in the virtio 1.0 specification).
The device presents features that it supports in a register, and the
driver masks out the feature bits that it doesn't support. Then the
driver sets the global status to FEATURES_OK and initialization continues.

In addition virtio-iommu has per-endpoint features through the PROBE
request, since the vIOMMU may manage hardware (VFIO) and software
(virtio) endpoints at the same time, which don't have the same DMA
capabilities (different IOVA ranges, page granularity, reserved ranges,
pgtable sharing, etc). At the moment this is a one-way probe, not a
handshake. The device simply fills the properties of each endpoint, but
the driver doesn't have to ack them. Initially there was a way to
negotiate each PROBE property but it was deemed unnecessary during
review. By leaving a few spare bits in the property headers I made sure
it can be added back with a feature bit if we ever need it.
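
To illustrate the driver side (a hedged sketch; the VIRTIO_IOMMU_F_* and
config field names are from the v0.9 header, the rest is standard virtio
driver boilerplate):

	static unsigned int features[] = {
		VIRTIO_IOMMU_F_MAP_UNMAP,
		VIRTIO_IOMMU_F_INPUT_RANGE,
		VIRTIO_IOMMU_F_PROBE,
	};

	/*
	 * The virtio core offers 'features' to the device, masks out what
	 * the device doesn't support and sets FEATURES_OK. Afterwards the
	 * driver only relies on what was actually negotiated:
	 */
	if (virtio_has_feature(vdev, VIRTIO_IOMMU_F_PROBE))
		virtio_cread(vdev, struct virtio_iommu_config, probe_size,
			     &viommu->probe_size);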

>> [3] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.9.1
>> git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.9
> 
> Unfortunately gitweb seems to be broken on linux-arm.org. What is missing
> in this patch-set to make this work on x86?

You should be able to access it here:
http://www.linux-arm.org/git?p=linux-jpb.git;a=shortlog;h=refs/heads/virtio-iommu/devel

That branch contains missing bits for x86 support:

* ACPI support. We have the code but it's waiting for an IORT spec
update, to reserve the IORT node ID. I expect it to take a while, given
that I'm alone requesting a change for something that's not upstream or
in hardware.

* DMA ops for x86 (see "HACK" commit). I'd like to use dma-iommu but I'm
not sure how to implement the glue that sets dma_ops properly.

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v6 0/7] Add virtio-iommu driver

2018-12-11 Thread Jean-Philippe Brucker
On 11/12/2018 18:31, Christoph Hellwig wrote:
> On Tue, Dec 11, 2018 at 06:20:57PM +0000, Jean-Philippe Brucker wrote:
>> Implement the virtio-iommu driver, following specification v0.9 [1].
>>
>> Only minor changes since v5 [2]. I fixed issues reported by Michael and
>> added tags from Eric and Bharat. Thanks!
>>
>> You can find Linux driver and kvmtool device on v0.9 branches [3],
>> module and x86 support on virtio-iommu/devel. Also tested with Eric's
>> QEMU device [4].
> 
> Just curious, what is the use case for it?

The main use case is assigning a device to guest userspace, using VFIO
both in the host and in the guest (the most cited example being DPDK).
There are others, and I wrote a little more about them last week:
https://www.spinics.net/lists/linux-pci/msg78529.html

Thanks,
Jean
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v6 6/7] iommu/virtio: Add probe request

2018-12-11 Thread Jean-Philippe Brucker
When the device offers the probe feature, send a probe request for each
device managed by the IOMMU. Extract RESV_MEM information. When we
encounter an MSI doorbell region, set it up as an IOMMU_RESV_MSI region.
This will tell other subsystems that there is no need to map the MSI
doorbell in the virtio-iommu, because MSIs bypass it.

Tested-by: Bharat Bhushan 
Tested-by: Eric Auger 
Reviewed-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/virtio-iommu.c  | 156 --
 include/uapi/linux/virtio_iommu.h |  36 +++
 2 files changed, 186 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 7540dab9c8dc..0c7a7fa2628d 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -46,6 +46,7 @@ struct viommu_dev {
struct iommu_domain_geometrygeometry;
u64 pgsize_bitmap;
u8  domain_bits;
+   u32 probe_size;
 };
 
 struct viommu_mapping {
@@ -67,8 +68,10 @@ struct viommu_domain {
 };
 
 struct viommu_endpoint {
+   struct device   *dev;
struct viommu_dev   *viommu;
struct viommu_domain*vdomain;
+   struct list_headresv_regions;
 };
 
 struct viommu_request {
@@ -119,6 +122,9 @@ static off_t viommu_get_write_desc_offset(struct viommu_dev 
*viommu,
 {
size_t tail_size = sizeof(struct virtio_iommu_req_tail);
 
+   if (req->type == VIRTIO_IOMMU_T_PROBE)
+   return len - viommu->probe_size - tail_size;
+
return len - tail_size;
 }
 
@@ -393,6 +399,110 @@ static int viommu_replay_mappings(struct viommu_domain 
*vdomain)
return ret;
 }
 
+static int viommu_add_resv_mem(struct viommu_endpoint *vdev,
+  struct virtio_iommu_probe_resv_mem *mem,
+  size_t len)
+{
+   size_t size;
+   u64 start64, end64;
+   phys_addr_t start, end;
+   struct iommu_resv_region *region = NULL;
+   unsigned long prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
+
+   start = start64 = le64_to_cpu(mem->start);
+   end = end64 = le64_to_cpu(mem->end);
+   size = end64 - start64 + 1;
+
+   /* Catch any overflow, including the unlikely end64 - start64 + 1 = 0 */
+   if (start != start64 || end != end64 || size < end64 - start64)
+   return -EOVERFLOW;
+
+   if (len < sizeof(*mem))
+   return -EINVAL;
+
+   switch (mem->subtype) {
+   default:
+   dev_warn(vdev->dev, "unknown resv mem subtype 0x%x\n",
+mem->subtype);
+   /* Fall-through */
+   case VIRTIO_IOMMU_RESV_MEM_T_RESERVED:
+   region = iommu_alloc_resv_region(start, size, 0,
+IOMMU_RESV_RESERVED);
+   break;
+   case VIRTIO_IOMMU_RESV_MEM_T_MSI:
+   region = iommu_alloc_resv_region(start, size, prot,
+IOMMU_RESV_MSI);
+   break;
+   }
+   if (!region)
+   return -ENOMEM;
+
+   list_add(&region->list, &vdev->resv_regions);
+   return 0;
+}
+
+static int viommu_probe_endpoint(struct viommu_dev *viommu, struct device *dev)
+{
+   int ret;
+   u16 type, len;
+   size_t cur = 0;
+   size_t probe_len;
+   struct virtio_iommu_req_probe *probe;
+   struct virtio_iommu_probe_property *prop;
+   struct iommu_fwspec *fwspec = dev->iommu_fwspec;
+   struct viommu_endpoint *vdev = fwspec->iommu_priv;
+
+   if (!fwspec->num_ids)
+   return -EINVAL;
+
+   probe_len = sizeof(*probe) + viommu->probe_size +
+   sizeof(struct virtio_iommu_req_tail);
+   probe = kzalloc(probe_len, GFP_KERNEL);
+   if (!probe)
+   return -ENOMEM;
+
+   probe->head.type = VIRTIO_IOMMU_T_PROBE;
+   /*
+* For now, assume that properties of an endpoint that outputs multiple
+* IDs are consistent. Only probe the first one.
+*/
+   probe->endpoint = cpu_to_le32(fwspec->ids[0]);
+
+   ret = viommu_send_req_sync(viommu, probe, probe_len);
+   if (ret)
+   goto out_free;
+
+   prop = (void *)probe->properties;
+   type = le16_to_cpu(prop->type) & VIRTIO_IOMMU_PROBE_T_MASK;
+
+   while (type != VIRTIO_IOMMU_PROBE_T_NONE &&
+  cur < viommu->probe_size) {
+   len = le16_to_cpu(prop->length) + sizeof(*prop);
+
+   switch (type) {
+   case VIRTIO_IOMMU_PROBE_T_RESV_MEM:
+   ret = viommu_add_resv_mem(vdev, (void *)prop, len);
+   break;
+   default:
+   de

[PATCH v6 5/7] iommu: Add virtio-iommu driver

2018-12-11 Thread Jean-Philippe Brucker
The virtio IOMMU is a para-virtualized device, allowing the guest to send IOMMU
requests such as map/unmap over virtio transport without emulating page
tables. This implementation handles ATTACH, DETACH, MAP and UNMAP
requests.

The bulk of the code transforms calls coming from the IOMMU API into
corresponding virtio requests. Mappings are kept in an interval tree
instead of page tables. A little more work is required for modular and x86
support, so for the moment the driver depends on CONFIG_VIRTIO=y and
CONFIG_ARM64.

Tested-by: Bharat Bhushan 
Tested-by: Eric Auger 
Reviewed-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 
---
 MAINTAINERS   |   7 +
 drivers/iommu/Kconfig |  11 +
 drivers/iommu/Makefile|   1 +
 drivers/iommu/virtio-iommu.c  | 916 ++
 include/uapi/linux/virtio_ids.h   |   1 +
 include/uapi/linux/virtio_iommu.h | 106 
 6 files changed, 1042 insertions(+)
 create mode 100644 drivers/iommu/virtio-iommu.c
 create mode 100644 include/uapi/linux/virtio_iommu.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 8119141a926f..6d250bc7a4ae 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16041,6 +16041,13 @@ S: Maintained
 F: drivers/virtio/virtio_input.c
 F: include/uapi/linux/virtio_input.h
 
+VIRTIO IOMMU DRIVER
+M: Jean-Philippe Brucker 
+L: virtualizat...@lists.linux-foundation.org
+S: Maintained
+F: drivers/iommu/virtio-iommu.c
+F: include/uapi/linux/virtio_iommu.h
+
 VIRTUAL BOX GUEST DEVICE DRIVER
 M: Hans de Goede 
 M: Arnd Bergmann 
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index d9a25715650e..d507fd754214 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -435,4 +435,15 @@ config QCOM_IOMMU
help
  Support for IOMMU on certain Qualcomm SoCs.
 
+config VIRTIO_IOMMU
+   bool "Virtio IOMMU driver"
+   depends on VIRTIO=y
+   depends on ARM64
+   select IOMMU_API
+   select INTERVAL_TREE
+   help
+ Para-virtualised IOMMU driver with virtio.
+
+ Say Y here if you intend to run this kernel as a guest.
+
 endif # IOMMU_SUPPORT
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index a158a68c8ea8..48d831a39281 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -32,3 +32,4 @@ obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
 obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
 obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
 obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o
+obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
new file mode 100644
index ..7540dab9c8dc
--- /dev/null
+++ b/drivers/iommu/virtio-iommu.c
@@ -0,0 +1,916 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Virtio driver for the paravirtualized IOMMU
+ *
+ * Copyright (C) 2018 Arm Limited
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#define MSI_IOVA_BASE  0x8000000
+#define MSI_IOVA_LENGTH  0x100000
+
+#define VIOMMU_REQUEST_VQ  0
+#define VIOMMU_NR_VQS  1
+
+struct viommu_dev {
+   struct iommu_device iommu;
+   struct device   *dev;
+   struct virtio_device*vdev;
+
+   struct ida  domain_ids;
+
+   struct virtqueue*vqs[VIOMMU_NR_VQS];
+   spinlock_t  request_lock;
+   struct list_headrequests;
+
+   /* Device configuration */
+   struct iommu_domain_geometrygeometry;
+   u64 pgsize_bitmap;
+   u8  domain_bits;
+};
+
+struct viommu_mapping {
+   phys_addr_t paddr;
+   struct interval_tree_node   iova;
+   u32 flags;
+};
+
+struct viommu_domain {
+   struct iommu_domain domain;
+   struct viommu_dev   *viommu;
+   struct mutexmutex; /* protects viommu pointer */
+   unsigned intid;
+
+   spinlock_t  mappings_lock;
+   struct rb_root_cached   mappings;
+
+   unsigned long   nr_endpoints;
+};
+
+struct viommu_endpoint {
+   struct viommu_dev   *viommu;
+   struct viommu_domain*vdomain;
+};
+
+struct viommu_request {
+   struct list_headlist;
+   void*writeback;
+   unsigned intwrite_offset;
+   unsigned intlen;
+   charbuf[];
+};
+
+#define to_viommu_domain(domain)   \
+   container_of(domain, 

[PATCH v6 7/7] iommu/virtio: Add event queue

2018-12-11 Thread Jean-Philippe Brucker
The event queue offers a way for the device to report access faults from
endpoints. It is implemented on virtqueue #1. Whenever the host needs to
signal a fault, it fills one of the buffers offered by the guest and
interrupts it.

Tested-by: Bharat Bhushan 
Tested-by: Eric Auger 
Reviewed-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/virtio-iommu.c  | 115 +++---
 include/uapi/linux/virtio_iommu.h |  19 +
 2 files changed, 125 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 0c7a7fa2628d..e6ff515d41c0 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -29,7 +29,8 @@
 #define MSI_IOVA_LENGTH  0x100000
 
 #define VIOMMU_REQUEST_VQ  0
-#define VIOMMU_NR_VQS  1
+#define VIOMMU_EVENT_VQ1
+#define VIOMMU_NR_VQS  2
 
 struct viommu_dev {
struct iommu_device iommu;
@@ -41,6 +42,7 @@ struct viommu_dev {
struct virtqueue*vqs[VIOMMU_NR_VQS];
spinlock_t  request_lock;
struct list_headrequests;
+   void*evts;
 
/* Device configuration */
struct iommu_domain_geometrygeometry;
@@ -82,6 +84,15 @@ struct viommu_request {
charbuf[];
 };
 
+#define VIOMMU_FAULT_RESV_MASK 0xffffff00
+
+struct viommu_event {
+   union {
+   u32 head;
+   struct virtio_iommu_fault fault;
+   };
+};
+
 #define to_viommu_domain(domain)   \
container_of(domain, struct viommu_domain, domain)
 
@@ -503,6 +514,68 @@ static int viommu_probe_endpoint(struct viommu_dev 
*viommu, struct device *dev)
return ret;
 }
 
+static int viommu_fault_handler(struct viommu_dev *viommu,
+   struct virtio_iommu_fault *fault)
+{
+   char *reason_str;
+
+   u8 reason   = fault->reason;
+   u32 flags   = le32_to_cpu(fault->flags);
+   u32 endpoint= le32_to_cpu(fault->endpoint);
+   u64 address = le64_to_cpu(fault->address);
+
+   switch (reason) {
+   case VIRTIO_IOMMU_FAULT_R_DOMAIN:
+   reason_str = "domain";
+   break;
+   case VIRTIO_IOMMU_FAULT_R_MAPPING:
+   reason_str = "page";
+   break;
+   case VIRTIO_IOMMU_FAULT_R_UNKNOWN:
+   default:
+   reason_str = "unknown";
+   break;
+   }
+
+   /* TODO: find EP by ID and report_iommu_fault */
+   if (flags & VIRTIO_IOMMU_FAULT_F_ADDRESS)
+   dev_err_ratelimited(viommu->dev, "%s fault from EP %u at %#llx 
[%s%s%s]\n",
+   reason_str, endpoint, address,
+   flags & VIRTIO_IOMMU_FAULT_F_READ ? "R" : 
"",
+   flags & VIRTIO_IOMMU_FAULT_F_WRITE ? "W" : 
"",
+   flags & VIRTIO_IOMMU_FAULT_F_EXEC ? "X" : 
"");
+   else
+   dev_err_ratelimited(viommu->dev, "%s fault from EP %u\n",
+   reason_str, endpoint);
+   return 0;
+}
+
+static void viommu_event_handler(struct virtqueue *vq)
+{
+   int ret;
+   unsigned int len;
+   struct scatterlist sg[1];
+   struct viommu_event *evt;
+   struct viommu_dev *viommu = vq->vdev->priv;
+
+   while ((evt = virtqueue_get_buf(vq, &len)) != NULL) {
+   if (len > sizeof(*evt)) {
+   dev_err(viommu->dev,
+   "invalid event buffer (len %u != %zu)\n",
+   len, sizeof(*evt));
+   } else if (!(evt->head & VIOMMU_FAULT_RESV_MASK)) {
+   viommu_fault_handler(viommu, &evt->fault);
+   }
+
+   sg_init_one(sg, evt, sizeof(*evt));
+   ret = virtqueue_add_inbuf(vq, sg, 1, evt, GFP_ATOMIC);
+   if (ret)
+   dev_err(viommu->dev, "could not add event buffer\n");
+   }
+
+   virtqueue_kick(vq);
+}
+
 /* IOMMU API */
 
 static struct iommu_domain *viommu_domain_alloc(unsigned type)
@@ -885,16 +958,35 @@ static struct iommu_ops viommu_ops = {
 static int viommu_init_vqs(struct viommu_dev *viommu)
 {
struct virtio_device *vdev = dev_to_virtio(viommu->dev);
-   const char *name = "request";
-   void *ret;
+   const char *names[] = { "request", "event" };
+   vq_callback_t *callbacks[] = {
+   NULL, /* No async requests */
+   viommu_event_handler,
+   };
 
-   ret = virtio_find_si

[PATCH v6 3/7] of: Allow the iommu-map property to omit untranslated devices

2018-12-11 Thread Jean-Philippe Brucker
In PCI root complex nodes, the iommu-map property describes the IOMMU that
translates each endpoint. On some platforms, the IOMMU itself is presented
as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). This isn't supported
by the current OF driver, which expects all endpoints to have an IOMMU.
Allow the iommu-map property to have gaps.

Relaxing of_map_rid() also allows the msi-map property to have gaps, which
is invalid since MSIs always reach an MSI controller. In that case
pci_msi_setup_msi_irqs() will return an error when attempting to find the
device's MSI domain.
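
For illustration, with this change a host bridge node can leave the IOMMU's
own requester ID out of iommu-map. The fragment below is only a sketch: the
node name, unit address, "pci-host-ecam-generic" compatible and the
RID/endpoint values are made up for the example, and &viommu stands for a
virtio-iommu node label defined elsewhere:

	pcie@40000000 {
		compatible = "pci-host-ecam-generic";
		/* ... other host bridge properties ... */

		/*
		 * RIDs 0x0-0x7 and 0x9-0xffff are translated by the
		 * virtio-iommu using matching endpoint IDs. RID 0x8, the
		 * IOMMU's own function, is absent from the map, so
		 * of_map_rid() now returns it untranslated instead of
		 * failing with -EFAULT.
		 */
		iommu-map = <0x0 &viommu 0x0 0x8>,
			    <0x9 &viommu 0x9 0xfff7>;
	};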

Reviewed-by: Rob Herring 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/of/base.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index 09692c9b32a7..99f6bfa9b898 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -2237,8 +2237,12 @@ int of_map_rid(struct device_node *np, u32 rid,
return 0;
}
 
-   pr_err("%pOF: Invalid %s translation - no match for rid 0x%x on %pOF\n",
-   np, map_name, rid, target && *target ? *target : NULL);
-   return -EFAULT;
+   pr_info("%pOF: no %s translation for rid 0x%x on %pOF\n", np, map_name,
+   rid, target && *target ? *target : NULL);
+
+   /* Bypasses translation */
+   if (id_out)
+   *id_out = rid;
+   return 0;
 }
 EXPORT_SYMBOL_GPL(of_map_rid);
-- 
2.19.1



[PATCH v6 4/7] PCI: OF: Initialize dev->fwnode appropriately

2018-12-11 Thread Jean-Philippe Brucker
For PCI devices that have an OF node, set the fwnode as well. This way
drivers that rely on fwnode don't need the special case described by
commit f94277af03ea ("of/platform: Initialise dev->fwnode appropriately").

Acked-by: Bjorn Helgaas 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/pci/of.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/pci/of.c b/drivers/pci/of.c
index 4c4217d0c3f1..c272ecfcd038 100644
--- a/drivers/pci/of.c
+++ b/drivers/pci/of.c
@@ -21,12 +21,15 @@ void pci_set_of_node(struct pci_dev *dev)
return;
dev->dev.of_node = of_pci_find_child_device(dev->bus->dev.of_node,
dev->devfn);
+   if (dev->dev.of_node)
+   dev->dev.fwnode = &dev->dev.of_node->fwnode;
 }
 
 void pci_release_of_node(struct pci_dev *dev)
 {
of_node_put(dev->dev.of_node);
dev->dev.of_node = NULL;
+   dev->dev.fwnode = NULL;
 }
 
 void pci_set_bus_of_node(struct pci_bus *bus)
@@ -35,12 +38,16 @@ void pci_set_bus_of_node(struct pci_bus *bus)
bus->dev.of_node = pcibios_get_phb_of_node(bus);
else
bus->dev.of_node = of_node_get(bus->self->dev.of_node);
+
+   if (bus->dev.of_node)
+   bus->dev.fwnode = &bus->dev.of_node->fwnode;
 }
 
 void pci_release_bus_of_node(struct pci_bus *bus)
 {
of_node_put(bus->dev.of_node);
bus->dev.of_node = NULL;
+   bus->dev.fwnode = NULL;
 }
 
 struct device_node * __weak pcibios_get_phb_of_node(struct pci_bus *bus)
-- 
2.19.1



[PATCH v6 1/7] dt-bindings: virtio-mmio: Add IOMMU description

2018-12-11 Thread Jean-Philippe Brucker
The nature of a virtio-mmio node is discovered by the virtio driver at
probe time. However the DMA relation between devices must be described
statically. When a virtio-mmio node is a virtio-iommu device, it needs an
"#iommu-cells" property as specified by bindings/iommu/iommu.txt.

Otherwise, the virtio-mmio device may perform DMA through an IOMMU, which
requires an "iommus" property. Describe these requirements in the
device-tree bindings documentation.

Reviewed-by: Rob Herring 
Reviewed-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 
---
 .../devicetree/bindings/virtio/mmio.txt   | 30 +++
 1 file changed, 30 insertions(+)

diff --git a/Documentation/devicetree/bindings/virtio/mmio.txt b/Documentation/devicetree/bindings/virtio/mmio.txt
index 5069c1b8e193..21af30fbb81f 100644
--- a/Documentation/devicetree/bindings/virtio/mmio.txt
+++ b/Documentation/devicetree/bindings/virtio/mmio.txt
@@ -8,10 +8,40 @@ Required properties:
 - reg: control registers base address and size including configuration space
 - interrupts:  interrupt generated by the device
 
+Required properties for virtio-iommu:
+
+- #iommu-cells: When the node corresponds to a virtio-iommu device, it is
+   linked to DMA masters using the "iommus" or "iommu-map"
+   properties [1][2]. #iommu-cells specifies the size of the
+   "iommus" property. For virtio-iommu #iommu-cells must be
+   1, each cell describing a single endpoint ID.
+
+Optional properties:
+
+- iommus:  If the device accesses memory through an IOMMU, it should
+   have an "iommus" property [1]. Since virtio-iommu itself
+   does not access memory through an IOMMU, the "virtio,mmio"
+   node cannot have both an "#iommu-cells" and an "iommus"
+   property.
+
 Example:
 
virtio_block@3000 {
compatible = "virtio,mmio";
reg = <0x3000 0x100>;
interrupts = <41>;
+
+   /* Device has endpoint ID 23 */
+   iommus = <&viommu 23>
}
+
+   viommu: iommu@3100 {
+   compatible = "virtio,mmio";
+   reg = <0x3100 0x100>;
+   interrupts = <42>;
+
+   #iommu-cells = <1>
+   }
+
+[1] Documentation/devicetree/bindings/iommu/iommu.txt
+[2] Documentation/devicetree/bindings/pci/pci-iommu.txt
-- 
2.19.1


