Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

2022-06-01 Thread Joao Martins
On 6/1/22 13:33, Jason Gunthorpe wrote:
> On Wed, Jun 01, 2022 at 01:18:52PM +0100, Joao Martins wrote:
> 
>>> So having safe racy reading in the kernel is probably best, and so RCU
>>> would be good here too.
>>
>> Reading dirties ought to be similar to map/unmap but slightly simpler, as
>> I supposedly don't need to care about the pte changing under the hood (or
>> so I initially thought). I was wrestling at some point with whether
>> test-and-clear was enough, or whether to switch back to cmpxchg to detect
>> that the pte has changed and only mark dirty based on the old value[*].
>> The latter would align with how map/unmap performs the pte updates.
> 
> test-and-clear should be fine, but this all needs to be done under a
> RCU context while the page tables themselves are freed by RCU. Then
> you can safely chase the page table pointers down to each level
> without fear of UAF.
> 

I was actually thinking more towards holding the same IOVA range lock, to
align with the rest of map/unmap/demote/etc. All of these IO page table
manipulations have the same performance requirements.

>> I am not sure yet whether dynamic demote/promote of page sizes changes this.
> 
> For this kind of primitive the caller must provide the locking, just
> like map/unmap.
> 
Ah OK.

> Effectively you can consider the iommu_domain as having externally
> provided range-locks over the IOVA space. map/unmap/demote/promote
> must run serially over intersecting IOVA ranges.
> 
> In terms of iommufd this means we always have to hold a lock related
> to the area (which is the IOVA range) before issuing any iommu call on
> the domain.

/me nods


Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

2022-06-01 Thread Joao Martins
On 6/1/22 00:10, Jason Gunthorpe wrote:
> On Tue, May 31, 2022 at 10:22:32PM +0100, Robin Murphy wrote:
>> There are only 3 instances where we'll free a table while the domain is
>> live. The first is the one legitimate race condition, where two map requests
>> targeting relatively nearby PTEs both go to fill in an intermediate level of
>> table; whoever loses that race frees the table they allocated, but it was
>> never visible to anyone else so that's definitely fine. The second is if
>> we're mapping a block entry, and find that there's already a table entry
>> there, wherein we assume the table must be empty, clear the entry,
>> invalidate any walk caches, install the block entry, then free the orphaned
>> table; since we're mapping the entire IOVA range covered by that table,
>> there should be no other operations on that IOVA range attempting to walk
>> the table at the same time, so it's fine. 
> 
> I saw these two in the Intel driver
> 
>> The third is effectively the inverse, if we get a block-sized unmap
>> but find a table entry rather than a block at that point (on the
>> assumption that it's de-facto allowed for a single unmap to cover
>> multiple adjacent mappings as long as it does so exactly); similarly
>> we assume that the table must be full, and no other operations
>> should be racing because we're unmapping its whole IOVA range, so we
>> remove the table entry, invalidate, and free as before.
> 
> Not sure I noticed this one though
> 
> This all makes sense though.
> 

I saw all three incantations in the AMD iommu, /I think/.

AMD has a sort of replicated-PTE concept, where a power-of-2 page size is
represented by multiple PTEs at the 'default' page sizes (4K, 2M, 1G), and we
need to check every single one of them. Upon writing the RFC I realized it's
not really the most efficient thing to keep, considering the IOMMU hardware
doesn't guarantee the dirty bit is set on every replicated pte. And maybe it's
obscuring the fact that we don't attempt to pick the best page-size for the
IOVA mapping (like Intel does), instead allowing all powers of 2 page sizes
and letting the IOMMU hw try its best to cope.
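
Roughly, the per-mapping check ends up as something like this (untested
sketch; PAGE_SIZE_PTE_COUNT() is the existing AMD helper, IOMMU_PTE_DIRTY()
the one this RFC adds):

/* A power-of-2 mapping is backed by PAGE_SIZE_PTE_COUNT(pgsize) replicated
 * entries and hardware may set the dirty bit on any subset of them, so OR
 * the dirty state of every entry together. */
static bool amd_iommu_pte_range_dirty(u64 *pte, unsigned long pgsize)
{
	unsigned long i, count = PAGE_SIZE_PTE_COUNT(pgsize);
	bool dirty = false;

	for (i = 0; i < count; i++)
		dirty |= !!IOMMU_PTE_DIRTY(READ_ONCE(pte[i]));

	return dirty;
}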

>> Although we don't have debug dumping for io-pgtable-arm, it's good to be
>> thinking about this, since it's made me realise that dirty-tracking sweeps
>> per that proposal might pose a similar kind of concern, so we might still
>> need to harden these corners for the sake of that.
> 
> Lets make sure Joao sees this..
> 
> It is interesting because we probably don't want the big latency
> spikes that would come from using locking to block map/unmap while
> dirty reading is happening - eg at the iommfd level.
> 
Especially when we might be scanning big IOVA ranges.

> From a consistency POV the only case that matters is unmap and unmap
> should already be doing a dedicated dirty read directly prior to the
> unmap (as per that other long thread)
> 
/me nods, yes

FWIW, I've already removed the unmap_read_dirty ops.

> So having safe racy reading in the kernel is probably best, and so RCU
> would be good here too.
> 

Reading dirties ought to be similar to map/unmap but slightly simpler, as
I supposedly don't need to care about the pte changing under the hood (or
so I initially thought). I was wrestling at some point with whether
test-and-clear was enough, or whether to switch back to cmpxchg to detect
that the pte has changed and only mark dirty based on the old value[*].
The latter would align with how map/unmap performs the pte updates.

[*] e.g. potentially the new entry hasn't been 'seen' by the IOMMU walker,
or it might be a smaller size than what got dirtied (with an iopte split or
racing with a new map).
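
For reference, the test-and-clear flavour I have in mind looks roughly like
this (untested sketch; fetch_pte() is the existing AMD walker,
pte_test_and_clear_dirty() a hypothetical atomic helper, and
iommu_dirty_bitmap_record() the recording helper from this series):

/* Walk one IOVA range and record dirty entries, assuming the page tables
 * stay valid for the duration of the walk (IOVA range lock or RCU, per the
 * discussion above). */
static void read_and_clear_dirty(struct amd_io_pgtable *pgtable,
				 unsigned long iova, size_t size,
				 struct iommu_dirty_bitmap *dirty)
{
	unsigned long end = iova + size;

	while (iova < end) {
		size_t pgsize = PAGE_SIZE;
		u64 *pte = fetch_pte(pgtable, iova, &pgsize);

		if (pte && pte_test_and_clear_dirty(pte))
			iommu_dirty_bitmap_record(dirty, iova, pgsize);

		/* advance by at least a base page when nothing is mapped */
		iova += max_t(size_t, pgsize, PAGE_SIZE);
	}
}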

>> that somewhere I have some half-finished patches making io-pgtable-arm use
>> the iommu_iotlb_gather freelist, so maybe I'll tackle both concerns at once
>> (perhaps we might even be able to RCU-ify the freelist generically? I'll see
>> how it goes when I get there).
> 
> This is all very similar to how the mm's tlb gather stuff works,
> especially on PPC which requires RCU protection of page tables instead
> of the IPI trick.
> 
> Currently the rcu_head and the lru overlap in the struct page. To do
> this I'd suggest following what was done for slab - see ca1a46d6f506
> ("Merge tag 'slab-for-5.17'..) and create a 'struct page' sub-class
> like 'struct slab' for use with iommu page tables and put the
> list_head and rcu_head sequentially in the struct iommu_pt_page.
> 
> Continue to use put_pages_list() just with the list_head in the new
> struct not the lru.
> 
> If we need it for dirty tracking then it makes sense to start
> progressing parts at least..
> 
The suggestions here seem to apply to map/unmap too, not just read_dirty()
alone, IIUC.
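
Something along these lines, I suppose (untested sketch; the struct name
follows the suggestion above, the field names and the overlay-on-struct-page
detail are my own assumptions):

/* 'struct page' sub-class for IOMMU page tables, in the spirit of
 * 'struct slab': a list_head for put_pages_list()-style batched freeing and
 * an rcu_head for RCU-deferred freeing, placed sequentially. A real version
 * would static_assert() that the layout stays compatible with struct page,
 * which is omitted here. */
struct iommu_pt_page {
	struct list_head free_list;	/* replaces the use of page->lru */
	struct rcu_head rcu;		/* for call_rcu() based freeing */
};

static void iommu_pt_page_free_rcu(struct rcu_head *rcu)
{
	struct iommu_pt_page *pt = container_of(rcu, struct iommu_pt_page, rcu);

	/* sketch only: relies on the overlay with struct page */
	__free_page((struct page *)pt);
}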

I am not sure yet whether dynamic demote/promote of page sizes changes this.


Re: [PATCH RFC 10/19] iommu/amd: Add unmap_read_dirty() support

2022-05-31 Thread Joao Martins
On 5/31/22 13:39, Suravee Suthikulpanit wrote:
> On 4/29/22 4:09 AM, Joao Martins wrote:
>> AMD implementation of unmap_read_dirty() is pretty simple as
>> mostly reuses unmap code with the extra addition of marshalling
>> the dirty bit into the bitmap as it walks the to-be-unmapped
>> IOPTE.
>>
>> Extra care is taken though, to switch over to cmpxchg as opposed
>> to a non-serialized store to the PTE and testing the dirty bit
>> only set until cmpxchg succeeds to set to 0.
>>
>> Signed-off-by: Joao Martins 
>> ---
>>   drivers/iommu/amd/io_pgtable.c | 44 +-
>>   drivers/iommu/amd/iommu.c  | 22 +
>>   2 files changed, 60 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
>> index 8325ef193093..1868c3b58e6d 100644
>> --- a/drivers/iommu/amd/io_pgtable.c
>> +++ b/drivers/iommu/amd/io_pgtable.c
>> @@ -355,6 +355,16 @@ static void free_clear_pte(u64 *pte, u64 pteval, struct 
>> list_head *freelist)
>>  free_sub_pt(pt, mode, freelist);
>>   }
>>   
>> +static bool free_pte_dirty(u64 *pte, u64 pteval)
> 
> Nitpick: Since we free and clearing the dirty bit, should we change
> the function name to free_clear_pte_dirty()?
> 

We free and *read* the dirty bit. It just so happens that we clear the dirty
bit, and every other bit, in the process. Just to be clear, I am not clearing
the dirty bit explicitly (unlike read_and_clear_dirty()).

>> +{
>> +bool dirty = false;
>> +
>> +while (IOMMU_PTE_DIRTY(cmpxchg64(pte, pteval, 0)))
> 
> We should use 0ULL instead of 0.
>

ack.

>> +dirty = true;
>> +
>> +return dirty;
>> +}
>> +
> 
> Actually, what do you think if we enhance the current free_clear_pte()
> to also handle the check dirty as well?
> 
See further below, about dropping this patch.

>>   /*
>>* Generic mapping functions. It maps a physical address into a DMA
>>* address space. It allocates the page table pages if necessary.
>> @@ -428,10 +438,11 @@ static int iommu_v1_map_page(struct io_pgtable_ops 
>> *ops, unsigned long iova,
>>  return ret;
>>   }
>>   
>> -static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
>> -  unsigned long iova,
>> -  size_t size,
>> -  struct iommu_iotlb_gather *gather)
>> +static unsigned long __iommu_v1_unmap_page(struct io_pgtable_ops *ops,
>> +   unsigned long iova,
>> +   size_t size,
>> +   struct iommu_iotlb_gather *gather,
>> +   struct iommu_dirty_bitmap *dirty)
>>   {
>>  struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
>>  unsigned long long unmapped;
>> @@ -445,11 +456,15 @@ static unsigned long iommu_v1_unmap_page(struct 
>> io_pgtable_ops *ops,
>>  while (unmapped < size) {
>>  pte = fetch_pte(pgtable, iova, &unmap_size);
>>  if (pte) {
>> -int i, count;
>> +unsigned long i, count;
>> +bool pte_dirty = false;
>>   
>>  count = PAGE_SIZE_PTE_COUNT(unmap_size);
>>  for (i = 0; i < count; i++)
>> -pte[i] = 0ULL;
>> +pte_dirty |= free_pte_dirty(&pte[i], pte[i]);
>> +
> 
> Actually, what if we change the existing free_clear_pte() to 
> free_and_clear_dirty_pte(),
> and incorporate the logic for
> 
Likewise (per the note above about dropping this patch), but otherwise it
would be a good idea.

>> ...
>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>> index 0a86392b2367..a8fcb6e9a684 100644
>> --- a/drivers/iommu/amd/iommu.c
>> +++ b/drivers/iommu/amd/iommu.c
>> @@ -2144,6 +2144,27 @@ static size_t amd_iommu_unmap(struct iommu_domain 
>> *dom, unsigned long iova,
>>  return r;
>>   }
>>   
>> +static size_t amd_iommu_unmap_read_dirty(struct iommu_domain *dom,
>> + unsigned long iova, size_t page_size,
>> + struct iommu_iotlb_gather *gather,
>> + struct iommu_dirty_bitmap *dirty)
>> +{
>> +struct protection_domain *domain = to_pdomain(dom);
>> +struct io_pgtable_ops *ops = &domain->iop.iop.ops;
>> +

Re: [PATCH RFC 09/19] iommu/amd: Access/Dirty bit support in IOPTEs

2022-05-31 Thread Joao Martins
On 5/31/22 12:34, Suravee Suthikulpanit wrote:
> Joao,
> 
> On 4/29/22 4:09 AM, Joao Martins wrote:
>> .
>> +static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
>> +bool enable)
>> +{
>> +struct protection_domain *pdomain = to_pdomain(domain);
>> +struct iommu_dev_data *dev_data;
>> +bool dom_flush = false;
>> +
>> +if (!amd_iommu_had_support)
>> +return -EOPNOTSUPP;
>> +
>> +list_for_each_entry(dev_data, &pdomain->dev_list, list) {
> 
> Since we iterate through device list for the domain, we would need to
> call spin_lock_irqsave(&pdomain->lock, flags) here.
> 
Ugh, yes. Will fix.

>> +struct amd_iommu *iommu;
>> +u64 pte_root;
>> +
>> +iommu = amd_iommu_rlookup_table[dev_data->devid];
>> +pte_root = amd_iommu_dev_table[dev_data->devid].data[0];
>> +
>> +/* No change? */
>> +if (!(enable ^ !!(pte_root & DTE_FLAG_HAD)))
>> +continue;
>> +
>> +pte_root = (enable ?
>> +pte_root | DTE_FLAG_HAD : pte_root & ~DTE_FLAG_HAD);
>> +
>> +/* Flush device DTE */
>> +amd_iommu_dev_table[dev_data->devid].data[0] = pte_root;
>> +device_flush_dte(dev_data);
>> +dom_flush = true;
>> +}
>> +
>> +/* Flush IOTLB to mark IOPTE dirty on the next translation(s) */
>> +if (dom_flush) {
>> +unsigned long flags;
>> +
>> +spin_lock_irqsave(&pdomain->lock, flags);
>> +amd_iommu_domain_flush_tlb_pde(pdomain);
>> +amd_iommu_domain_flush_complete(pdomain);
>> +spin_unlock_irqrestore(&pdomain->lock, flags);
>> +}
> 
> And call spin_unlock_irqrestore(&pdomain->lock, flags); here.

ack

Additionally, something I am thinking of for v2 is to have a @had bool field
in iommu_dev_data. That would align better with the rest of the amd iommu
code, rather than me introducing this pattern of reading the flags back from
the hardware DTE (where the PTE root lives). Let me know if you disagree.
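
Roughly like the following (untested sketch; both the @had field and the
set_dte_had() helper are made-up names for illustration):

	/* Cache the HAD state per device instead of re-reading
	 * amd_iommu_dev_table[devid].data[0] every time. */
	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
		/* No change? */
		if (dev_data->had == enable)
			continue;

		dev_data->had = enable;
		set_dte_had(dev_data, enable);	/* toggle DTE_FLAG_HAD */
		device_flush_dte(dev_data);
		dom_flush = true;
	}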

>> +
>> +return 0;
>> +}
>> +
>> +static bool amd_iommu_get_dirty_tracking(struct iommu_domain *domain)
>> +{
>> +struct protection_domain *pdomain = to_pdomain(domain);
>> +struct iommu_dev_data *dev_data;
>> +u64 dte;
>> +
> 
> Also call spin_lock_irqsave(&pdomain->lock, flags) here
> 
ack
>> +list_for_each_entry(dev_data, &pdomain->dev_list, list) {
>> +dte = amd_iommu_dev_table[dev_data->devid].data[0];
>> +if (!(dte & DTE_FLAG_HAD))
>> +return false;
>> +}
>> +
> 
> And call spin_unlock_irqrestore(&pdomain->lock, flags) here
> 
ack

Same comment as above: replace the @dte checking with a check of this new
variable instead.

>> +return true;
>> +}
>> +
>> +static int amd_iommu_read_and_clear_dirty(struct iommu_domain *domain,
>> +  unsigned long iova, size_t size,
>> +  struct iommu_dirty_bitmap *dirty)
>> +{
>> +struct protection_domain *pdomain = to_pdomain(domain);
>> +struct io_pgtable_ops *ops = &pdomain->iop.iop.ops;
>> +
>> +if (!amd_iommu_get_dirty_tracking(domain))
>> +return -EOPNOTSUPP;
>> +
>> +if (!ops || !ops->read_and_clear_dirty)
>> +return -ENODEV;
> 
> We should move this check before the amd_iommu_get_dirty_tracking().
> 

Yeap, better fail earlier.

> Best Regards,
> Suravee
> 
>> +
>> +return ops->read_and_clear_dirty(ops, iova, size, dirty);
>> +}
>> +
>> +
>>   static void amd_iommu_get_resv_regions(struct device *dev,
>> struct list_head *head)
>>   {
>> @@ -2293,6 +2368,8 @@ const struct iommu_ops amd_iommu_ops = {
>>  .flush_iotlb_all = amd_iommu_flush_iotlb_all,
>>  .iotlb_sync = amd_iommu_iotlb_sync,
>>  .free   = amd_iommu_domain_free,
>> +.set_dirty_tracking = amd_iommu_set_dirty_tracking,
>> +.read_and_clear_dirty = amd_iommu_read_and_clear_dirty,
>>  }
>>   };
>>   


Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking

2022-05-10 Thread Joao Martins
On 5/10/22 02:38, Tian, Kevin wrote:
>> From: Jason Gunthorpe 
>> Sent: Friday, May 6, 2022 7:46 PM
>>
>> On Fri, May 06, 2022 at 03:51:40AM +, Tian, Kevin wrote:
 From: Jason Gunthorpe 
 Sent: Thursday, May 5, 2022 10:08 PM

 On Thu, May 05, 2022 at 07:40:37AM +, Tian, Kevin wrote:

> In concept this is an iommu property instead of a domain property.

 Not really, domains shouldn't be changing behaviors once they are
 created. If a domain supports dirty tracking and I attach a new device
 then it still must support dirty tracking.
>>>
>>> That sort of suggests that userspace should specify whether a domain
>>> supports dirty tracking when it's created. But how does userspace
>>> know that it should create the domain in this way in the first place?
>>> live migration is triggered on demand and it may not happen in the
>>> lifetime of a VM.
>>
>> The best you could do is to look at the devices being plugged in at VM
>> startup, and if they all support live migration then request dirty
>> tracking, otherwise don't.
> 
> Yes, this is how a device capability can help.
> 
>>
>> However, it costs nothing to have dirty tracking as long as all iommus
>> support it in the system - which seems to be the normal case today.
>>
>> We should just always turn it on at this point.
> 
> Then still need a way to report " all iommus support it in the system"
> to userspace since many old systems don't support it at all. If we all
> agree that a device capability flag would be helpful on this front (like
> you also said below), probably can start building the initial skeleton
> with that in mind?
> 

This would capture device-specific and maybe iommu-instance features, but
there's a slightly odd semantic here. Nothing actually depends on the device
to support any of this; it is rather the IOMMU instance that sits below the
device, independent of the device's own capabilities. PRI, on the other hand,
would be a perfect fit for a device capability (?), but conveying dirty
tracking over a device capability would be a convenience rather than an exact
hw representation.

Thinking out loud about how we would go about a device/iommu capability [to
see if this matches what people have in mind]: we would add a dirty-tracking
feature bit via the existing kAPI for iommu device features (e.g.
IOMMU_DEV_FEAT_AD), and on iommufd we would maybe add an
IOMMUFD_CMD_DEV_GET_IOMMU_FEATURES ioctl (rough sketch below) which would
take a u64 dev_id as input (from the @out_dev_id returned by vfio-pci
BIND_IOMMUFD) and a u64 features output bitmap of synthetic feature bits,
with IOMMUFD_FEATURE_AD the only one we query for now (and
IOMMUFD_FEATURE_{SVA,IOPF} as potential future candidates). Qemu would then,
at start of day, check whether /all devices/ support it; it would still do
the blind set-tracking, but bail out preemptively if any device's IOMMU
doesn't support dirty-tracking. I don't think we have any case today where
we have to deal with different IOMMU instances that have different features.

Either that, or, as discussed in the beginning, perhaps add an iommufd (or
iommufd hwpt) ioctl call (e.g. IOMMUFD_CMD_CAP) taking an input value (e.g. a
subop IOMMU_FEATURES) which would give us a structure of things (e.g. for the
IOMMU_FEATURES subop, the common feature-set bitmap across all iommu
instances). This would answer the 'all iommus support it in the system'
question. Albeit the device one might have more concrete longevity if there
are further plans aside from dirty tracking.
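
To make the first option concrete, the uAPI shape I'm picturing is roughly
the following (everything here is made up for discussion, none of it exists):

#define IOMMUFD_FEATURE_AD	(1ULL << 0)	/* IOMMU Access/Dirty tracking */
#define IOMMUFD_FEATURE_SVA	(1ULL << 1)	/* potential future candidate */
#define IOMMUFD_FEATURE_IOPF	(1ULL << 2)	/* potential future candidate */

/* Hypothetical payload for an IOMMUFD_CMD_DEV_GET_IOMMU_FEATURES ioctl */
struct iommu_device_get_features {
	__u32 size;		/* sizeof(struct iommu_device_get_features) */
	__u32 __reserved;	/* must be zero */
	__aligned_u64 dev_id;	/* from vfio-pci BIND_IOMMUFD @out_dev_id */
	__aligned_u64 features;	/* out: bitmap of IOMMUFD_FEATURE_* */
};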

>>
>>> and if the user always creates domain to allow dirty tracking by default,
>>> how does it know a failed attach is due to missing dirty tracking support
>>> by the IOMMU and then creates another domain which disables dirty
>>> tracking and retry-attach again?
>>
>> The automatic logic is complicated for sure, if you had a device flag
>> it would have to figure it out that way
>>
> 
> Yes. That is the model in my mind.
> 
> Thanks
> Kevin


Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking

2022-05-05 Thread Joao Martins
On 5/5/22 12:03, Tian, Kevin wrote:
>> From: Joao Martins 
>> Sent: Thursday, May 5, 2022 6:07 PM
>>
>> On 5/5/22 08:42, Tian, Kevin wrote:
>>>> From: Jason Gunthorpe 
>>>> Sent: Tuesday, May 3, 2022 2:53 AM
>>>>
>>>> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
>>>>> On Fri, 29 Apr 2022 05:45:20 +
>>>>> "Tian, Kevin"  wrote:
>>>>>>> From: Joao Martins 
>>>>>>>  3) Unmapping an IOVA range while returning its dirty bit prior to
>>>>>>> unmap. This case is specific for non-nested vIOMMU case where an
>>>>>>> erronous guest (or device) DMAing to an address being unmapped at
>>>> the
>>>>>>> same time.
>>>>>>
>>>>>> an erroneous attempt like above cannot anticipate which DMAs can
>>>>>> succeed in that window thus the end behavior is undefined. For an
>>>>>> undefined behavior nothing will be broken by losing some bits dirtied
>>>>>> in the window between reading back dirty bits of the range and
>>>>>> actually calling unmap. From guest p.o.v. all those are black-box
>>>>>> hardware logic to serve a virtual iotlb invalidation request which just
>>>>>> cannot be completed in one cycle.
>>>>>>
>>>>>> Hence in reality probably this is not required except to meet vfio
>>>>>> compat requirement. Just in concept returning dirty bits at unmap
>>>>>> is more accurate.
>>>>>>
>>>>>> I'm slightly inclined to abandon it in iommufd uAPI.
>>>>>
>>>>> Sorry, I'm not following why an unmap with returned dirty bitmap
>>>>> operation is specific to a vIOMMU case, or in fact indicative of some
>>>>> sort of erroneous, racy behavior of guest or device.
>>>>
>>>> It is being compared against the alternative which is to explicitly
>>>> query dirty then do a normal unmap as two system calls and permit a
>>>> race.
>>>>
>>>> The only case with any difference is if the guest is racing DMA with
>>>> the unmap - in which case it is already indeterminate for the guest if
>>>> the DMA will be completed or not.
>>>>
>>>> eg on the vIOMMU case if the guest races DMA with unmap then we are
>>>> already fine with throwing away that DMA because that is how the race
>>>> resolves during non-migration situations, so resovling it as throwing
>>>> away the DMA during migration is OK too.
>>>>
>>>>> We need the flexibility to support memory hot-unplug operations
>>>>> during migration,
>>>>
>>>> I would have thought that hotplug during migration would simply
>>>> discard all the data - how does it use the dirty bitmap?
>>>>
>>>>> This was implemented as a single operation specifically to avoid
>>>>> races where ongoing access may be available after retrieving a
>>>>> snapshot of the bitmap.  Thanks,
>>>>
>>>> The issue is the cost.
>>>>
>>>> On a real iommu elminating the race is expensive as we have to write
>>>> protect the pages before query dirty, which seems to be an extra IOTLB
>>>> flush.
>>>>
>>>> It is not clear if paying this cost to become atomic is actually
>>>> something any use case needs.
>>>>
>>>> So, I suggest we think about a 3rd op 'write protect and clear
>>>> dirties' that will be followed by a normal unmap - the extra op will
>>>> have the extra oveheard and userspace can decide if it wants to pay or
>>>> not vs the non-atomic read dirties operation. And lets have a use case
>>>> where this must be atomic before we implement it..
>>>
>>> and write-protection also relies on the support of I/O page fault...
>>>
>> /I think/ all IOMMUs in this series already support permission/unrecoverable
>> I/O page faults for a long time IIUC.
>>
>> The earlier suggestion was just to discard the I/O page fault after
>> write-protection happens. fwiw, some IOMMUs also support suppressing
>> the event notification (like AMD).
> 
> iiuc the purpose of 'write-protection' here is to capture in-flight dirty
> pages in the said race window until the unmap and iotlb invalidation are
> completed.
> 
But then we depend on PRS being th

Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking

2022-05-05 Thread Joao Martins
On 5/5/22 08:42, Tian, Kevin wrote:
>> From: Jason Gunthorpe 
>> Sent: Tuesday, May 3, 2022 2:53 AM
>>
>> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
>>> On Fri, 29 Apr 2022 05:45:20 +
>>> "Tian, Kevin"  wrote:
>>>>> From: Joao Martins 
>>>>>  3) Unmapping an IOVA range while returning its dirty bit prior to
>>>>> unmap. This case is specific for non-nested vIOMMU case where an
>>>>> erronous guest (or device) DMAing to an address being unmapped at
>> the
>>>>> same time.
>>>>
>>>> an erroneous attempt like above cannot anticipate which DMAs can
>>>> succeed in that window thus the end behavior is undefined. For an
>>>> undefined behavior nothing will be broken by losing some bits dirtied
>>>> in the window between reading back dirty bits of the range and
>>>> actually calling unmap. From guest p.o.v. all those are black-box
>>>> hardware logic to serve a virtual iotlb invalidation request which just
>>>> cannot be completed in one cycle.
>>>>
>>>> Hence in reality probably this is not required except to meet vfio
>>>> compat requirement. Just in concept returning dirty bits at unmap
>>>> is more accurate.
>>>>
>>>> I'm slightly inclined to abandon it in iommufd uAPI.
>>>
>>> Sorry, I'm not following why an unmap with returned dirty bitmap
>>> operation is specific to a vIOMMU case, or in fact indicative of some
>>> sort of erroneous, racy behavior of guest or device.
>>
>> It is being compared against the alternative which is to explicitly
>> query dirty then do a normal unmap as two system calls and permit a
>> race.
>>
>> The only case with any difference is if the guest is racing DMA with
>> the unmap - in which case it is already indeterminate for the guest if
>> the DMA will be completed or not.
>>
>> eg on the vIOMMU case if the guest races DMA with unmap then we are
>> already fine with throwing away that DMA because that is how the race
>> resolves during non-migration situations, so resovling it as throwing
>> away the DMA during migration is OK too.
>>
>>> We need the flexibility to support memory hot-unplug operations
>>> during migration,
>>
>> I would have thought that hotplug during migration would simply
>> discard all the data - how does it use the dirty bitmap?
>>
>>> This was implemented as a single operation specifically to avoid
>>> races where ongoing access may be available after retrieving a
>>> snapshot of the bitmap.  Thanks,
>>
>> The issue is the cost.
>>
>> On a real iommu elminating the race is expensive as we have to write
>> protect the pages before query dirty, which seems to be an extra IOTLB
>> flush.
>>
>> It is not clear if paying this cost to become atomic is actually
>> something any use case needs.
>>
>> So, I suggest we think about a 3rd op 'write protect and clear
>> dirties' that will be followed by a normal unmap - the extra op will
>> have the extra oveheard and userspace can decide if it wants to pay or
>> not vs the non-atomic read dirties operation. And lets have a use case
>> where this must be atomic before we implement it..
> 
> and write-protection also relies on the support of I/O page fault...
> 
/I think/ all IOMMUs in this series already support permission/unrecoverable
I/O page faults for a long time IIUC.

The earlier suggestion was just to discard the I/O page fault after
write-protection happens. fwiw, some IOMMUs also support suppressing
the event notification (like AMD).


Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support

2022-05-05 Thread Joao Martins
On 5/5/22 08:25, Shameerali Kolothum Thodi wrote:
>> -Original Message-
>> From: Joao Martins [mailto:joao.m.mart...@oracle.com]
>> Sent: 29 April 2022 12:05
>> To: Tian, Kevin 
>> Cc: Joerg Roedel ; Suravee Suthikulpanit
>> ; Will Deacon ; Robin
>> Murphy ; Jean-Philippe Brucker
>> ; zhukeqian ;
>> Shameerali Kolothum Thodi ;
>> David Woodhouse ; Lu Baolu
>> ; Jason Gunthorpe ; Nicolin Chen
>> ; Yishai Hadas ; Eric Auger
>> ; Liu, Yi L ; Alex Williamson
>> ; Cornelia Huck ;
>> k...@vger.kernel.org; iommu@lists.linux-foundation.org
>> Subject: Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add
>> set_dirty_tracking_range() support
>>
>> On 4/29/22 09:28, Tian, Kevin wrote:
>>>> From: Joao Martins 
>>>> Sent: Friday, April 29, 2022 5:09 AM
>>>>
>>>> Similar to .read_and_clear_dirty() use the page table
>>>> walker helper functions and set DBM|RDONLY bit, thus
>>>> switching the IOPTE to writeable-clean.
>>>
>>> this should not be one-off if the operation needs to be
>>> applied to IOPTE. Say a map request comes right after
>>> set_dirty_tracking() is called. If it's agreed to remove
>>> the range op then smmu driver should record the tracking
>>> status internally and then apply the modifier to all the new
>>> mappings automatically before dirty tracking is disabled.
>>> Otherwise the same logic needs to be kept in iommufd to
>>> call set_dirty_tracking_range() explicitly for every new
>>> iopt_area created within the tracking window.
>>
>> Gah, I totally missed that by mistake. New mappings aren't
>> carrying over the "DBM is set". This needs a new io-pgtable
>> quirk added post dirty-tracking toggling.
>>
>> I can adjust, but I am at odds on including this in a future
>> iteration given that I can't really test any of this stuff.
>> Might drop the driver until I have hardware/emulation I can
>> use (or maybe others can take over this). It was included
>> for revising the iommu core ops and whether iommufd was
>> affected by it.
> 
> [+Kunkun Jiang]. I think he is now looking into this and might have
> a test setup to verify this.

I'll keep him CC'ed on the next iterations. Thanks!

FWIW, this should change a bit on the next iteration (simpler) by always
enabling DBM from the start. SMMUv3's ::set_dirty_tracking() becomes a
simpler function that tests quirks (i.e. DBM set) and what not, and calls
read_and_clear_dirty() without a bitmap argument to clear the dirties.
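
Something along these lines (untested sketch; the quirk check is only hinted
at, and treating a NULL bitmap as 'clear only' is an assumed convention for
the next iteration, not existing code):

/* DBM is always enabled when the domain is finalised, so "start tracking"
 * is just a read-and-clear sweep whose result is discarded. */
static int arm_smmu_set_dirty_tracking(struct iommu_domain *domain, bool enable)
{
	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
	size_t size = domain->geometry.aperture_end -
		      domain->geometry.aperture_start + 1;

	/* a check that DBM was actually enabled (e.g. an io-pgtable quirk)
	 * would go here */
	if (!ops || !ops->read_and_clear_dirty)
		return -ENODEV;

	if (!enable)
		return 0;

	/* clear any dirty state accumulated while tracking was "off" */
	return ops->read_and_clear_dirty(ops, domain->geometry.aperture_start,
					 size, NULL);
}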


Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking

2022-05-03 Thread Joao Martins
On 5/2/22 19:52, Jason Gunthorpe wrote:
> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
>> On Fri, 29 Apr 2022 05:45:20 +
>> "Tian, Kevin"  wrote:
>>>> From: Joao Martins 
>>>>  3) Unmapping an IOVA range while returning its dirty bit prior to
>>>> unmap. This case is specific for non-nested vIOMMU case where an
>>>> erronous guest (or device) DMAing to an address being unmapped at the
>>>> same time.  
>>>
>>> an erroneous attempt like above cannot anticipate which DMAs can
>>> succeed in that window thus the end behavior is undefined. For an
>>> undefined behavior nothing will be broken by losing some bits dirtied
>>> in the window between reading back dirty bits of the range and
>>> actually calling unmap. From guest p.o.v. all those are black-box
>>> hardware logic to serve a virtual iotlb invalidation request which just
>>> cannot be completed in one cycle.
>>>
>>> Hence in reality probably this is not required except to meet vfio
>>> compat requirement. Just in concept returning dirty bits at unmap
>>> is more accurate.
>>>
>>> I'm slightly inclined to abandon it in iommufd uAPI.
>>
>> Sorry, I'm not following why an unmap with returned dirty bitmap
>> operation is specific to a vIOMMU case, or in fact indicative of some
>> sort of erroneous, racy behavior of guest or device.
> 
> It is being compared against the alternative which is to explicitly
> query dirty then do a normal unmap as two system calls and permit a
> race.
> 
> The only case with any difference is if the guest is racing DMA with
> the unmap - in which case it is already indeterminate for the guest if
> the DMA will be completed or not. 
> 
> eg on the vIOMMU case if the guest races DMA with unmap then we are
> already fine with throwing away that DMA because that is how the race
> resolves during non-migration situations, so resovling it as throwing
> away the DMA during migration is OK too.
> 

Exactly.

Even the current unmap (ignoring dirties) isn't race-free, and DMA could
still be happening between clearing the PTE and the IOTLB flush.

The code in this series *attempted* to tackle races against hw IOMMU updates
to the A/D bits at the same time we are clearing the IOPTEs. But it didn't
fully address the race with DMA.

The current code (IIUC) just assumes it is dirty if it is pinned and DMA
mapped, so maybe it avoided some of these fundamental questions...

So really the comparison is whether we care about fixing the race *during
unmap* -- for an IOVA range the device shouldn't be DMA-ing to in the first
place -- such that we need to go out of our way to block DMA writes, then
fetch dirties, and then unmap. Or whether we can fetch dirties and then unmap
as two separate operations.

>> We need the flexibility to support memory hot-unplug operations
>> during migration,
> 
> I would have thought that hotplug during migration would simply
> discard all the data - how does it use the dirty bitmap?
> 

hmmm, I don't follow either -- why would one care about hot-unplugged memory
being dirty? Unless Alex is thinking that the guest would take the initiative
in hot-unplugging+hot-plugging and expect the same data to be there, pmem
style...?

>> This was implemented as a single operation specifically to avoid
>> races where ongoing access may be available after retrieving a
>> snapshot of the bitmap.  Thanks,
> 
> The issue is the cost.
> 
> On a real iommu elminating the race is expensive as we have to write
> protect the pages before query dirty, which seems to be an extra IOTLB
> flush.
> 

... and that is only the DMA performance part affecting the endpoint
device. In software, there's also the extra overhead of walking the IOMMU
pagetables twice. So it's like unmap being 2x more expensive.


> It is not clear if paying this cost to become atomic is actually
> something any use case needs.
> 
> So, I suggest we think about a 3rd op 'write protect and clear
> dirties' that will be followed by a normal unmap - the extra op will
> have the extra oveheard and userspace can decide if it wants to pay or
> not vs the non-atomic read dirties operation. And lets have a use case
> where this must be atomic before we implement it..
> 

Definitely, I am happy to implement it if there's a use-case. But I am not
sure there's one right now, aside from theory only? Have we seen issues that
would otherwise require this?

> The downside is we loose a little bit of efficiency by unbundling
> these steps, the upside is that it doesn't require quite as many
> special iommu_domain/etc paths.
> 
> (Also 

Re: [PATCH RFC 18/19] iommu/intel: Access/Dirty bit support for SL domains

2022-05-02 Thread Joao Martins
On 4/30/22 07:12, Baolu Lu wrote:
> On 2022/4/29 05:09, Joao Martins wrote:
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -5089,6 +5089,113 @@ static void intel_iommu_iotlb_sync_map(struct 
>> iommu_domain *domain,
>>  }
>>   }
>>   
>> +static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
>> +  bool enable)
>> +{
>> +struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +struct device_domain_info *info;
>> +unsigned long flags;
>> +int ret = -EINVAL;
> 
>   if (domain_use_first_level(dmar_domain))
>   return -EOPNOTSUPP;
> 
Will add.

>> +
>> +spin_lock_irqsave(&device_domain_lock, flags);
>> +if (list_empty(&dmar_domain->devices)) {
>> +spin_unlock_irqrestore(&device_domain_lock, flags);
>> +return ret;
>> +}
> 
> I agreed with Kevin's suggestion in his reply.
> 
/me nods

>> +
>> +list_for_each_entry(info, &dmar_domain->devices, link) {
>> +if (!info->dev || (info->domain != dmar_domain))
>> +continue;
> 
> This check is redundant.
> 

I'll drop it.

>> +
>> +/* Dirty tracking is second-stage level SM only */
>> +if ((info->domain && domain_use_first_level(info->domain)) ||
>> +!ecap_slads(info->iommu->ecap) ||
>> +!sm_supported(info->iommu) || !intel_iommu_sm) {
>> +ret = -EOPNOTSUPP;
>> +continue;
> 
> Perhaps break and return -EOPNOTSUPP directly here? We are not able to
> support a mixed mode, right?
> 
Correct, I should return early here.

>> +}
>> +
>> +ret = intel_pasid_setup_dirty_tracking(info->iommu, 
>> info->domain,
>> + info->dev, PASID_RID2PASID,
>> + enable);
>> +if (ret)
>> +break;
>> +}
>> +spin_unlock_irqrestore(&device_domain_lock, flags);
>> +
>> +/*
>> + * We need to flush context TLB and IOTLB with any cached translations
>> + * to force the incoming DMA requests to have their IOTLB entries tagged
>> + * with A/D bits
>> + */
>> +intel_flush_iotlb_all(domain);
>> +return ret;
>> +}
> 
> Best regards,
> baolu


Re: [PATCH RFC 04/19] iommu: Add an unmap API that returns dirtied IOPTEs

2022-05-02 Thread Joao Martins
On 4/30/22 06:12, Baolu Lu wrote:
> On 2022/4/29 05:09, Joao Martins wrote:
>> Today, the dirty state is lost and the page wouldn't be migrated to
>> destination potentially leading the guest into error.
>>
>> Add an unmap API that reads the dirty bit and sets it in the
>> user passed bitmap. This unmap iommu API tackles a potentially
>> racy update to the dirty bit *when* doing DMA on a iova that is
>> being unmapped at the same time.
>>
>> The new unmap_read_dirty/unmap_pages_read_dirty does not replace
>> the unmap pages, but rather only when explicit called with an dirty
>> bitmap data passed in.
>>
>> It could be said that the guest is buggy and rather than a special unmap
>> path tackling the theoretical race ... it would suffice fetching the
>> dirty bits (with GET_DIRTY_IOVA), and then unmap the IOVA.
> 
> I am not sure whether this API could solve the race.
> 

Yeah, it doesn't fully solve the race, as DMA can still potentially be
occurring until the IOMMU needs to re-walk the page tables (i.e. after the
IOTLB flush).


> size_t iommu_unmap(struct iommu_domain *domain,
> unsigned long iova, size_t size)
> {
>  struct iommu_iotlb_gather iotlb_gather;
>  size_t ret;
> 
>  iommu_iotlb_gather_init(&iotlb_gather);
>  ret = __iommu_unmap(domain, iova, size, &iotlb_gather);
>  iommu_iotlb_sync(domain, &iotlb_gather);
> 
>  return ret;
> }
> 
> The PTEs are cleared before iotlb invalidation. What if a DMA write
> happens after PTE clearing and before the iotlb invalidation with the
> PTE happening to be cached?


Yeap. Jason/Robin also reiterated similarly.

To fully handle this we need to force the PTEs read-only, and check the dirty
bit after. So perhaps, if we want to go to the extent of fully stopping DMA --
which none of the unmap APIs ever guarantee -- we need more of a write-protect
API that optionally fetches the dirties. And then unmap remains as is (prior
to this series).
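
Shape-wise I am picturing something like this (purely hypothetical; the op
name and signature are made up, nothing of the sort exists in the series):

	/* Write-protect the IOPTEs, flush the IOTLB, record the dirty bits,
	 * and let the caller follow up with a normal unmap. */
	int (*wrprotect_read_dirty)(struct iommu_domain *domain,
				    unsigned long iova, size_t size,
				    struct iommu_dirty_bitmap *dirty);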

Now, whether this race is worth solving isn't clear (bearing in mind that
solving the race will add a lot of overhead), and git/mailing-list
archaeology doesn't answer whether this was ever useful in practice :(

Joao


Re: [PATCH RFC 03/19] iommufd: Dirty tracking data support

2022-05-02 Thread Joao Martins
On 4/30/22 05:11, Baolu Lu wrote:
> On 2022/4/29 05:09, Joao Martins wrote:
>> Add an IO pagetable API iopt_read_and_clear_dirty_data() that
>> performs the reading of dirty IOPTEs for a given IOVA range and
>> then copying back to userspace from each area-internal bitmap.
>>
>> Underneath it uses the IOMMU equivalent API which will read the
>> dirty bits, as well as atomically clearing the IOPTE dirty bit
>> and flushing the IOTLB at the end. The dirty bitmaps pass an
>> iotlb_gather to allow batching the dirty-bit updates.
>>
>> Most of the complexity, though, is in the handling of the user
>> bitmaps to avoid copies back and forth. The bitmap user addresses
>> need to be iterated through, pinned and then passing the pages
>> into iommu core. The amount of bitmap data passed at a time for a
>> read_and_clear_dirty() is 1 page worth of pinned base page
>> pointers. That equates to 16M bits, or rather 64G of data that
>> can be returned as 'dirtied'. The flush the IOTLB at the end of
>> the whole scanned IOVA range, to defer as much as possible the
>> potential DMA performance penalty.
>>
>> Signed-off-by: Joao Martins 
>> ---
>>   drivers/iommu/iommufd/io_pagetable.c| 169 
>>   drivers/iommu/iommufd/iommufd_private.h |  44 ++
>>   2 files changed, 213 insertions(+)
>>
>> diff --git a/drivers/iommu/iommufd/io_pagetable.c 
>> b/drivers/iommu/iommufd/io_pagetable.c
>> index f4609ef369e0..835b5040fce9 100644
>> --- a/drivers/iommu/iommufd/io_pagetable.c
>> +++ b/drivers/iommu/iommufd/io_pagetable.c
>> @@ -14,6 +14,7 @@
>>   #include 
>>   #include 
>>   #include 
>> +#include 
>>   
>>   #include "io_pagetable.h"
>>   
>> @@ -347,6 +348,174 @@ int iopt_set_dirty_tracking(struct io_pagetable *iopt,
>>  return ret;
>>   }
>>   
>> +int iommufd_dirty_iter_init(struct iommufd_dirty_iter *iter,
>> +struct iommufd_dirty_data *bitmap)
>> +{
>> +struct iommu_dirty_bitmap *dirty = &iter->dirty;
>> +unsigned long bitmap_len;
>> +
>> +bitmap_len = dirty_bitmap_bytes(bitmap->length >> dirty->pgshift);
>> +
>> +import_single_range(WRITE, bitmap->data, bitmap_len,
>> +&iter->bitmap_iov, &iter->bitmap_iter);
>> +iter->iova = bitmap->iova;
>> +
>> +/* Can record up to 64G at a time */
>> +dirty->pages = (struct page **) __get_free_page(GFP_KERNEL);
>> +
>> +return !dirty->pages ? -ENOMEM : 0;
>> +}
>> +
>> +void iommufd_dirty_iter_free(struct iommufd_dirty_iter *iter)
>> +{
>> +struct iommu_dirty_bitmap *dirty = &iter->dirty;
>> +
>> +if (dirty->pages) {
>> +free_page((unsigned long) dirty->pages);
>> +dirty->pages = NULL;
>> +}
>> +}
>> +
>> +bool iommufd_dirty_iter_done(struct iommufd_dirty_iter *iter)
>> +{
>> +return iov_iter_count(&iter->bitmap_iter) > 0;
>> +}
>> +
>> +static inline unsigned long iommufd_dirty_iter_bytes(struct 
>> iommufd_dirty_iter *iter)
>> +{
>> +unsigned long left = iter->bitmap_iter.count - 
>> iter->bitmap_iter.iov_offset;
>> +
>> +left = min_t(unsigned long, left, (iter->dirty.npages << PAGE_SHIFT));
>> +
>> +return left;
>> +}
>> +
>> +unsigned long iommufd_dirty_iova_length(struct iommufd_dirty_iter *iter)
>> +{
>> +unsigned long left = iommufd_dirty_iter_bytes(iter);
>> +
>> +return ((BITS_PER_BYTE * left) << iter->dirty.pgshift);
>> +}
>> +
>> +unsigned long iommufd_dirty_iova(struct iommufd_dirty_iter *iter)
>> +{
>> +unsigned long skip = iter->bitmap_iter.iov_offset;
>> +
>> +return iter->iova + ((BITS_PER_BYTE * skip) << iter->dirty.pgshift);
>> +}
>> +
>> +void iommufd_dirty_iter_advance(struct iommufd_dirty_iter *iter)
>> +{
>> +iov_iter_advance(&iter->bitmap_iter, iommufd_dirty_iter_bytes(iter));
>> +}
>> +
>> +void iommufd_dirty_iter_put(struct iommufd_dirty_iter *iter)
>> +{
>> +struct iommu_dirty_bitmap *dirty = &iter->dirty;
>> +
>> +if (dirty->npages)
>> +unpin_user_pages(dirty->pages, dirty->npages);
>> +}
>> +
>> +int iommufd_dirty_iter_get(struct iommufd_dirty_iter *iter)
>> +{
>> +struct iommu_dirty_bitmap

Re: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable

2022-05-02 Thread Joao Martins
On 4/30/22 00:51, Baolu Lu wrote:
> On 2022/4/29 05:09, Joao Martins wrote:
>> +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
>> +struct iommu_domain *domain, bool enable)
>> +{
>> +struct iommu_domain *dom;
>> +unsigned long index;
>> +int ret = -EOPNOTSUPP;
>> +
>> +down_write(&iopt->iova_rwsem);
>> +if (!domain) {
>> +down_write(&iopt->domains_rwsem);
>> +xa_for_each(&iopt->domains, index, dom) {
>> +ret = iommu_set_dirty_tracking(dom, iopt, enable);
>> +if (ret < 0)
>> +break;
> 
> Do you need to roll back to the original state before return failure?
> Partial domains have already had dirty bit tracking enabled.
> 
Yeap, will fix the unwinding for next iteration.
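
Probably along these lines (untested sketch of the unwinding, reusing the
iommu_set_dirty_tracking() call from this RFC):

	/* If toggling fails on one domain, restore the previous state on the
	 * domains that were already switched. */
	down_write(&iopt->domains_rwsem);
	xa_for_each(&iopt->domains, index, dom) {
		ret = iommu_set_dirty_tracking(dom, iopt, enable);
		if (ret) {
			struct iommu_domain *undo_dom;
			unsigned long undo_index;

			xa_for_each(&iopt->domains, undo_index, undo_dom) {
				if (undo_index >= index)
					break;
				iommu_set_dirty_tracking(undo_dom, iopt,
							 !enable);
			}
			break;
		}
	}
	up_write(&iopt->domains_rwsem);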

>> +}
>> +up_write(&iopt->domains_rwsem);
>> +} else {
>> +ret = iommu_set_dirty_tracking(domain, iopt, enable);
>> +}
>> +
>> +up_write(&iopt->iova_rwsem);
>> +return ret;
>> +}
> 
> Best regards,
> baolu



Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support

2022-05-02 Thread Joao Martins
[my mua made the message a tad crooked with the quotations]

On 5/2/22 12:52, Joao Martins wrote:
> On 4/29/22 20:20, Robin Murphy wrote:
>> On 2022-04-29 17:40, Joao Martins wrote:
>>> On 4/29/22 17:11, Jason Gunthorpe wrote:
>>>> On Fri, Apr 29, 2022 at 03:45:23PM +0100, Joao Martins wrote:
>>>>> On 4/29/22 13:23, Jason Gunthorpe wrote:
>>>>>> On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:
>>>>>>
>>>>>>>> TBH I'd be inclined to just enable DBM unconditionally in
>>>>>>>> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it
>>>>>>>> dynamically (especially on a live domain) seems more trouble that it's
>>>>>>>> worth.
>>>>>>>
>>>>>>> Hmmm, but then it would strip userland/VMM from any sort of control 
>>>>>>> (contrary
>>>>>>> to what we can do on the CPU/KVM side). e.g. the first time you do
>>>>>>> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
>>>>>>> of guest time, as opposed to those only after you enabled 
>>>>>>> dirty-tracking.
>>>>>>
>>>>>> It just means that on SMMU the start tracking op clears all the dirty
>>>>>> bits.
>>>>>>
>>>>> Hmm, OK. But aren't we really just picking a poison here? On ARM it's
>>>>> the difference between setting the DBM bit and putting the IOPTE as
>>>>> writeable-clean (which is clearing another bit), versus
>>>>> read-and-clear-when-dirty-track-start, which means we need to re-walk
>>>>> the pagetables to clear one bit.
>>>>
>>>> Yes, I don't think a iopte walk is avoidable?
>>>>
>>> Correct -- exactly why I am still leaning more towards enabling the DBM
>>> bit only at start, versus enabling DBM at domain-creation while clearing
>>> dirties at start.
>>
>> I'd say it's largely down to whether you want the bother of 
>> communicating a dynamic behaviour change into io-pgtable. The big 
>> advantage of having it just use DBM all the time is that you don't have 
>> to do that, and the "start tracking" operation is then nothing more than 
>> a normal "read and clear" operation but ignoring the read result.
>>
>> At this point I'd much rather opt for simplicity, and leave the fancier 
>> stuff to revisit later if and when somebody does demonstrate a 
>> significant overhead from using DBM when not strictly needed.
> OK -- I did get the code simplicity part[*]. Albeit my concern is that last
> point: if there's anything fundamentally affecting DMA performance then
> any SMMU user would see it even if they don't care at all about DBM (i.e. 
> regular
> baremetal/non-vm iommu usage).
> 

I can switch the SMMUv3 one to the always-enabled DBM bit.

> [*] It was how I had this initially PoC-ed. And really all IOMMU drivers 
> dirty tracking
> could be simplified to be always-enabled, and start/stop is essentially 
> flushing/clearing
> dirties. Albeit I like that this is only really used (by hardware) when 
> needed and any
> other DMA user isn't affected.


Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support

2022-05-02 Thread Joao Martins
On 4/29/22 20:20, Robin Murphy wrote:
> On 2022-04-29 17:40, Joao Martins wrote:
>> On 4/29/22 17:11, Jason Gunthorpe wrote:
>>> On Fri, Apr 29, 2022 at 03:45:23PM +0100, Joao Martins wrote:
>>>> On 4/29/22 13:23, Jason Gunthorpe wrote:
>>>>> On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:
>>>>>
>>>>>>> TBH I'd be inclined to just enable DBM unconditionally in
>>>>>>> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it
>>>>>>> dynamically (especially on a live domain) seems more trouble that it's
>>>>>>> worth.
>>>>>>
>>>>>> Hmmm, but then it would strip userland/VMM from any sort of control 
>>>>>> (contrary
>>>>>> to what we can do on the CPU/KVM side). e.g. the first time you do
>>>>>> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
>>>>>> of guest time, as opposed to those only after you enabled dirty-tracking.
>>>>>
>>>>> It just means that on SMMU the start tracking op clears all the dirty
>>>>> bits.
>>>>>
>>>> Hmm, OK. But aren't we really just picking a poison here? On ARM it's
>>>> the difference between setting the DBM bit and putting the IOPTE as
>>>> writeable-clean (which is clearing another bit), versus
>>>> read-and-clear-when-dirty-track-start, which means we need to re-walk
>>>> the pagetables to clear one bit.
>>>
>>> Yes, I don't think a iopte walk is avoidable?
>>>
>> Correct -- exactly why I am still leaning more towards enabling the DBM
>> bit only at start, versus enabling DBM at domain-creation while clearing
>> dirties at start.
> 
> I'd say it's largely down to whether you want the bother of 
> communicating a dynamic behaviour change into io-pgtable. The big 
> advantage of having it just use DBM all the time is that you don't have 
> to do that, and the "start tracking" operation is then nothing more than 
> a normal "read and clear" operation but ignoring the read result.
> 
> At this point I'd much rather opt for simplicity, and leave the fancier 
> stuff to revisit later if and when somebody does demonstrate a 
> significant overhead from using DBM when not strictly needed.
> 
OK -- I did get the code simplicity part[*]. Albeit my concern is that last
point: if there's anything fundamentally affecting DMA performance then
any SMMU user would see it even if they don't care at all about DBM (i.e. 
regular
baremetal/non-vm iommu usage).

[*] It was how I had this initially PoC-ed. And really all IOMMU drivers dirty 
tracking
could be simplified to be always-enabled, and start/stop is essentially 
flushing/clearing
dirties. Albeit I like that this is only really used (by hardware) when needed 
and any
other DMA user isn't affected.


Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support

2022-04-29 Thread Joao Martins
On 4/29/22 17:11, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 03:45:23PM +0100, Joao Martins wrote:
>> On 4/29/22 13:23, Jason Gunthorpe wrote:
>>> On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:
>>>
>>>>> TBH I'd be inclined to just enable DBM unconditionally in 
>>>>> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it 
>>>>> dynamically (especially on a live domain) seems more trouble that it's 
>>>>> worth.
>>>>
>>>> Hmmm, but then it would strip userland/VMM from any sort of control 
>>>> (contrary
>>>> to what we can do on the CPU/KVM side). e.g. the first time you do
>>>> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
>>>> of guest time, as opposed to those only after you enabled dirty-tracking.
>>>
>>> It just means that on SMMU the start tracking op clears all the dirty
>>> bits.
>>>
>> Hmm, OK. But aren't we really just picking a poison here? On ARM it's the
>> difference between setting the DBM bit and putting the IOPTE as
>> writeable-clean (which is clearing another bit), versus
>> read-and-clear-when-dirty-track-start, which means we need to re-walk the
>> pagetables to clear one bit.
> 
> Yes, I don't think a iopte walk is avoidable?
> 
Correct -- exactly why I am still leaning more towards enabling the DBM bit
only at start, versus enabling DBM at domain-creation while clearing dirties
at start.

>> It's walking over ranges regardless.
> 
> Also, keep in mind start should always come up in a 0 dirties state on
> all platforms. So all implementations need to do something to wipe the
> dirty state, either explicitly during start or restoring all clean
> during stop.
> 
> A common use model might be to just destroy the iommu_domain without
> doing stop so prefering the clearing io page table at stop might be a
> better overall design.

If we want to ensure that the IOPTE dirty state is immutable before start
and after stop, maybe this behaviour could be a new flag in
set-dirty-tracking (or be implicit, as you suggest). But... hmm, at the same
time, I wonder if it's better to let userspace fetch the dirties that were
there /right after stopping/ (via GET_DIRTY_IOVA) rather than just discarding
them implicitly at SET_DIRTY_TRACKING(0|1).


Re: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking

2022-04-29 Thread Joao Martins
On 4/29/22 14:40, Baolu Lu wrote:
> Hi Joao,
> 
> Thanks for doing this.
> 
> On 2022/4/29 05:09, Joao Martins wrote:
>> Add to iommu domain operations a set of callbacks to
>> perform dirty tracking, particularly to start and stop
>> tracking and finally to test and clear the dirty data.
>>
>> Drivers are expected to dynamically change their hw protection
>> domain bits to toggle the tracking and flush some form of
>> control state structure that stands in the IOVA translation
>> path.
>>
>> For reading and clearing dirty data, in all IOMMUs a transition
>> from any of the PTE access bits (Access, Dirty) implies flushing
>> the IOTLB to invalidate any stale data in the IOTLB as to whether
>> or not the IOMMU should update the said PTEs. The iommu core APIs
>> introduce a new structure for storing the dirties, albeit vendor
>> IOMMUs implementing .read_and_clear_dirty() just use
>> iommu_dirty_bitmap_record() to set the memory storing dirties.
>> The underlying tracking/iteration of user bitmap memory is instead
>> done by iommufd which takes care of initializing the dirty bitmap
>> *prior* to passing to the IOMMU domain op.
>>
>> So far, for the currently/to-be-supported IOMMUs with dirty tracking
>> support, this is the case particularly because the tracking is part of
>> the first stage tables and part of address translation. Below
>> it is mentioned how hardware deal with the hardware protection
>> domain control bits, to justify the added iommu core APIs.
>> vendor IOMMU implementation will also explain in more detail on
>> the dirty bit usage/clearing in the IOPTEs.
>>
>> * x86 AMD:
>>
>> The same thing applies for AMD, particularly the Device Table,
>> followed by flushing the Device IOTLB. On AMD[1],
>> section "2.2.1 Updating Shared Tables", e.g.
>>
>>> Each table can also have its contents cached by the IOMMU or
>> peripheral IOTLBs. Therefore, after
>> updating a table entry that can be cached, system software must
>> send the IOMMU an appropriate
>> invalidate command. Information in the peripheral IOTLBs must
>> also be invalidated.
>>
>> There's no mention of particular bits that are cached or
>> not but fetching a dev entry is part of address translation
>> as also depicted, so invalidate the device table to make
>> sure the next translations fetch a DTE entry with the HD bits set.
>>
>> * x86 Intel (rev3.0+):
>>
>> Likewise[2] set the SSADE bit in the scalable-entry second stage table
>> to enable Access/Dirty bits in the second stage page table. See manual,
>> particularly on "6.2.3.1 Scalable-Mode PASID-Table Entry Programming
>> Considerations"
>>
>>> When modifying root-entries, scalable-mode root-entries,
>> context-entries, or scalable-mode context
>> entries:
>>> Software must serially invalidate the context-cache,
>> PASID-cache (if applicable), and the IOTLB.  The serialization is
>> required since hardware may utilize information from the
>> context-caches (e.g., Domain-ID) to tag new entries inserted to
>> the PASID-cache and IOTLB for processing in-flight requests.
>> Section 6.5 describe the invalidation operations.
>>
>> And also the whole of "Table 23. Guidance to
>> Software for Invalidations" in "6.5.3.3 Guidance to Software for
>> Invalidations" explicitly mentions
>>
>>> SSADE transition from 0 to 1 in a scalable-mode PASID-table
>> entry with PGTT value of Second-stage or Nested
>>
>> * ARM SMMUV3.2:
>>
>> SMMUv3.2 needs to toggle the dirty bit descriptor
>> over the CD (or S2CD) for toggling and flush/invalidate
>> the IOMMU dev IOTLB.
>>
>> Reference[0]: SMMU spec, "5.4.1 CD notes",
>>
>>> The following CD fields are permitted to be cached as part of a
>> translation or TLB entry, and alteration requires
>> invalidation of any TLB entry that might have cached these
>> fields, in addition to CD structure cache invalidation:
>>
>> ...
>> HA, HD
>> ...
>>
>> Although, the ARM SMMUv3 case is a tad different from its x86
>> counterparts. Rather than changing *only* the IOMMU domain device entry to
>> enable dirty tracking (and having a dedicated bit for dirtyness in IOPTE)
>> ARM instead uses a dirty-bit modifier which is separately enabled, and
>> changes the *existing* meaning of access bits (for ro/rw), to the point
>> that marking the access bit read-only but with the dirty-bit-modifier
>> enabled doesn't trigger a perm io page

Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking

2022-04-29 Thread Joao Martins
On 4/29/22 13:38, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 11:27:58AM +0100, Joao Martins wrote:
>>>>  3) Unmapping an IOVA range while returning its dirty bit prior to
>>>> unmap. This case is specific for non-nested vIOMMU case where an
>>>> erronous guest (or device) DMAing to an address being unmapped at the
>>>> same time.
>>>
>>> an erroneous attempt like above cannot anticipate which DMAs can
>>> succeed in that window thus the end behavior is undefined. For an
>>> undefined behavior nothing will be broken by losing some bits dirtied
>>> in the window between reading back dirty bits of the range and
>>> actually calling unmap. From guest p.o.v. all those are black-box
>>> hardware logic to serve a virtual iotlb invalidation request which just
>>> cannot be completed in one cycle.
>>>
>>> Hence in reality probably this is not required except to meet vfio
>>> compat requirement. Just in concept returning dirty bits at unmap
>>> is more accurate.
>>>
>>> I'm slightly inclined to abandon it in iommufd uAPI.
>>
>> OK, it seems I am not far off from your thoughts.
>>
>> I'll see what others think too, and if so I'll remove the unmap_dirty.
>>
>> Because if vfio-compat doesn't get the iommu hw dirty support, then there 
>> would
>> be no users of unmap_dirty.
> 
> I'm inclined to agree with Kevin.
> 
> If the VM does do a rogue DMA while unmapping its vIOMMU then already
> it will randomly get or lose that DMA. Adding the dirty tracking race
> during live migration just further biases that randomness toward
> losing.  Since we don't relay protection faults to the guest there is
> no guest observable difference, IMHO.
> 
Hmm, we don't /yet/. I don't know if that is going to change at some point.

We do propagate MCEs for example (and AER?). And I suppose with nesting
IO page faults will be propagated. Albeit that is a different thing from the
problem above.

Albeit even if we do, the IO page faults induced by unmap-and-read-dirty
ought not to be propagated to the guest.

> In any case, I don't think the implementation here for unmap_dirty is
> race free?  So, if we are doing all this complexity just to make the
> race smaller, I don't see the point.
> 
+1

> To make it race free I think you have to write protect the IOPTE then
> synchronize the IOTLB, read back the dirty, then unmap and synchronize
> the IOTLB again. 

That would indeed fully close the race with the IOTLB. But damn, it would
be expensive.
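
In iopt terms that two-step flow would look something like the sketch below,
just to spell out the cost; the three iopt_*() helpers are made-up names for
illustration, not functions that exist in this series:

static int iopt_wrprotect_read_dirty_unmap(struct io_pagetable *iopt,
					   unsigned long iova, size_t length,
					   struct iommufd_dirty_data *bitmap)
{
	int ret;

	/* 1) Write-protect the IOPTEs and sync the IOTLB, so the device can
	 *    no longer set new dirty bits in this range. */
	ret = iopt_wrprotect_range(iopt, iova, length);
	if (ret)
		return ret;

	/* 2) Now the dirty bits we read back are stable. */
	ret = iopt_read_dirty(iopt, iova, length, bitmap);
	if (ret)
		return ret;

	/* 3) Finally unmap and sync the IOTLB a second time. */
	return iopt_unmap_range(iopt, iova, length);
}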

> That has such a high performance cost I'm not
> convinced it is worthwhile - and if it has to be two step like this
> then it would be cleaner to introduce a 'writeprotect and read dirty'
> op instead of overloading unmap. 

I can switch to that kind of primitive, should the group deem it
necessary. But it feels like we are leaning more towards a no.

> We don't need to microoptimize away
> the extra io page table walk when we are already doing two
> invalidations in the overhead..
> 
IIUC fully closing the race as above might be incompatible with SMMUv3,
given that we would need to clear DBM (or CD.HD) to move the IOPTEs
from writeable-clean to read-only, but then the dirty bit loses its
meaning. Oh wait, unless rather than checking for writeable-clean
we clear DBM and then just check whether the PTE was RO or RW to determine
dirty (provided we discard any IO page faults happening between wrprotect
and read-dirty).

>>>> * There's no capabilities API in IOMMUFD, and in this RFC each vendor 
>>>> tracks
>>>
>>> there was discussion adding device capability uAPI somewhere.
>>>
Ack, let me know if there are snippets of that conversation, as I seem to have
missed it.
> 
> It was just discussion pending something we actually needed to report.
> 
> Would be a very simple ioctl taking in the device ID and filling a
> struct of stuff.
>  
Yeap.

>>> probably this can be reported as a device cap as supporting of dirty bit is
>>> an immutable property of the iommu serving that device. 
> 
> It is an easier fit to read it out of the iommu_domain after device
> attach though - since we don't need to build new kernel infrastructure
> to query it from a device.
>  
That would be more like working on a hwpt_id instead of a device_id for that
previously mentioned ioctl. Something like IOMMUFD_CHECK_EXTENSION, which
receives a capability nr (or additionally a hwpt_id) and returns a struct of
something. That is more future-proof towards new kinds of stuff, e.g. fetching
the whole domain hardware capabilities.
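
A minimal sketch of what such a query could look like (purely illustrative;
the struct layout, field names and capability bits below are made up and are
not part of this series):

struct iommu_hwpt_get_info {
	__u32 size;		/* sizeof(struct iommu_hwpt_get_info) */
	__u32 hwpt_id;		/* or a dev_id, depending on the object we settle on */
	__aligned_u64 flags;
	__aligned_u64 capabilities;	/* e.g. bit 0 == dirty tracking supported */
	__aligned_u64 pgsize_bitmap;	/* supported IO page sizes */
	__aligned_u64 out_reserved[4];	/* room for future capabilities */
};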

Re: [PATCH RFC 07/19] iommufd/vfio-compat: Dirty tracking IOCTLs compatibility

2022-04-29 Thread Joao Martins
On 4/29/22 15:36, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 03:27:00PM +0100, Joao Martins wrote:
> 
>>> We've made a qemu patch to allow qemu to be happy if dirty tracking is
>>> not supported in the vfio container for migration, which is part of
>>> the v2 enablement series. That seems like the better direction.
>>>
>> So in my auditing/testing, the listener callbacks are called but the dirty
>> ioctls return an error at start, and bail out early on sync. I suppose migration
>> won't really work, as no pages are set and whatnot, but it could
>> cope with no-dirty-tracking support. So by 'making qemu happy' is this mainly
>> cleaning out the constant error messages you get and not even attempting
>> migration by introducing a migration blocker early on ... should it fetch
>> no migration capability?
> 
> It really just means pre-copy doesn't work and we can skip it, though
> I'm not sure exactly what the qemu patch ended up doing.. I think it
> will be posted by Monday
> 
Ha, or that :D i.e.

Why bother checking if there are dirty pages periodically when we can just do it
at the beginning, and at the end when we pause the guest (and DMA). Maybe it
prevents a whole bunch of copying in the interim, and this patch of yours might
be an improvement.


Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support

2022-04-29 Thread Joao Martins
On 4/29/22 13:23, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:
> 
>>> TBH I'd be inclined to just enable DBM unconditionally in 
>>> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it 
>>> dynamically (especially on a live domain) seems more trouble that it's 
>>> worth.
>>
>> Hmmm, but then it would strip userland/VMM from any sort of control (contrary
>> to what we can do on the CPU/KVM side). e.g. the first time you do
>> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
>> of guest time, as opposed to those only after you enabled dirty-tracking.
> 
> It just means that on SMMU the start tracking op clears all the dirty
> bits.
> 
Hmm, OK. But aren't we really picking a poison here? On ARM it's the difference
between setting the DBM bit and putting the IOPTE as writeable-clean (which
is clearing another bit) versus read-and-clear-when-dirty-tracking-starts, which
means we need to re-walk the pagetables to clear one bit.

It's walking over ranges regardless.

> I also suppose you'd also want to install the IOPTEs as dirty to
> avoid a performance regression writing out new dirties for cases where
> we don't dirty track? And then the start tracking op will switch this
> so map creates non-dirty IOPTEs?

If we end up always enabling DBM + CD.HD, perhaps it makes sense for the IOTLB
to cache the dirty bit until we clear those bits.

But really, the way this series was /trying/ to do it still feels like the least
pain, and that way we have the same expectations from all iommus from the
iommufd perspective too.


Re: [PATCH RFC 05/19] iommufd: Add a dirty bitmap to iopt_unmap_iova()

2022-04-29 Thread Joao Martins
On 4/29/22 13:14, Jason Gunthorpe wrote:
> On Thu, Apr 28, 2022 at 10:09:19PM +0100, Joao Martins wrote:
> 
>> +static void iommu_unmap_read_dirty_nofail(struct iommu_domain *domain,
>> +  unsigned long iova, size_t size,
>> +  struct iommufd_dirty_data *bitmap,
>> +  struct iommufd_dirty_iter *iter)
>> +{
> 
> This shouldn't be a nofail - that is only for path that trigger from
> destroy/error unwindow, which read dirty never does. The return code
> has to be propogated.
> 
> It needs some more thought how to organize this.. only unfill_domains
> needs this path, but it is shared with the error unwind paths and
> cannot generally fail..

It's part of the reason I split this part out, as it didn't strike me as a
natural extension of the API.


Re: [PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML

2022-04-29 Thread Joao Martins
On 4/29/22 13:26, Robin Murphy wrote:
> On 2022-04-29 12:54, Joao Martins wrote:
>> On 4/29/22 12:11, Robin Murphy wrote:
>>> On 2022-04-28 22:09, Joao Martins wrote:
>>>> From: Kunkun Jiang 
>>>>
>>>> This detects BBML feature and if SMMU supports it, transfer BBMLx
>>>> quirk to io-pgtable.
>>>>
>>>> BBML1 requires still marking PTE nT prior to performing a
>>>> translation table update, while BBML2 requires neither break-before-make
>>>> nor PTE nT bit being set. For dirty tracking it needs to clear
>>>> the dirty bit so checking BBML2 tells us the prerequisite. See SMMUv3.2
>>>> manual, section "3.21.1.3 When SMMU_IDR3.BBML == 2 (Level 2)" and
>>>> "3.21.1.2 When SMMU_IDR3.BBML == 1 (Level 1)"
>>>
>>> You can drop this, and the dependencies on BBML elsewhere, until you get
>>> round to the future large-page-splitting work, since that's the only
>>> thing this represents. Not much point having the feature flags without
>>> an actual implementation, or any users.
>>>
>> OK.
>>
>> My thinking was that BBML2 meant *also* that we don't need that break-before-make
>> thingie upon switching translation table entries. It seems that from what you
>> say, BBML2 then just refers to this but only in the context of switching between
>> hugepages/normal pages (?), not in general to all bits of the PTE (which we
>> would .. upon switching from writeable-dirty to writeable-clean with DBM set).
> 
> Yes, BBML is purely about swapping between a block (hugepage) entry and 
> a table representing the exact equivalent mapping.
> 
> A break-before-make procedure isn't required when just changing 
> permissions, and AFAICS it doesn't apply to changing the DBM bit either, 
> but as mentioned I think we could probably just not do that anyway.

Interesting, thanks for the clarification.


Re: [PATCH RFC 03/19] iommufd: Dirty tracking data support

2022-04-29 Thread Joao Martins
On 4/29/22 13:09, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 11:54:16AM +0100, Joao Martins wrote:
>> On 4/29/22 09:12, Tian, Kevin wrote:
>>>> From: Joao Martins 
>>>> Sent: Friday, April 29, 2022 5:09 AM
>>> [...]
>>>> +
>>>> +static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
>>>> +struct iommufd_dirty_data *bitmap)
>>>
>>> In a glance this function and all previous helpers doesn't rely on any
>>> iommufd objects except that the new structures are named as
>>> iommufd_xxx. 
>>>
>>> I wonder whether moving all of them to the iommu layer would make
>>> more sense here.
>>>
>> I suppose, instinctively, I was trying to make this tie to iommufd only,
>> to avoid getting it called in cases we don't expect when made as a generic
>> exported kernel facility.
>>
>> (note: iommufd can be built as a module).
> 
> Yeah, I think that is a reasonable reason to put iommufd only stuff in
> iommufd.ko rather than bloat the static kernel.
> 
> You could put it in a new .c file though so there is some logical
> modularity?

I can do that (iommu.c / dirty.c if no better idea comes to mind,
suggestions welcome :)).

Although I should say that there's some dependency on iopt structures and
whatnot, so I have to see if this is a change for the better. I'll respond
here should it be dubious.


Re: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable

2022-04-29 Thread Joao Martins
On 4/29/22 12:56, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 08:07:14AM +, Tian, Kevin wrote:
>>> From: Joao Martins 
>>> Sent: Friday, April 29, 2022 5:09 AM
>>>
>>> +static int __set_dirty_tracking_range_locked(struct iommu_domain
>>> *domain,
>>
>> suppose anything using iommu_domain as the first argument should
>> be put in the iommu layer. Here it's more reasonable to use iopt
>> as the first argument or simply merge with the next function.
>>
>>> +struct io_pagetable *iopt,
>>> +bool enable)
>>> +{
>>> +   const struct iommu_domain_ops *ops = domain->ops;
>>> +   struct iommu_iotlb_gather gather;
>>> +   struct iopt_area *area;
>>> +   int ret = -EOPNOTSUPP;
>>> +   unsigned long iova;
>>> +   size_t size;
>>> +
>>> +   iommu_iotlb_gather_init(&gather);
>>> +
>>> +   for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
>>> +area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
>>
>> how is this different from leaving the iommu driver to walk the page table
>> and then poke the modifier bit for all present PTEs? As commented in the last
>> patch this may allow removing the range op completely.
> 
> Yea, I'm not super keen on the two ops either, especially since they
> are so wildly different.
> 
/me ack

> I would expect that set_dirty_tracking turns on tracking for the
> entire iommu domain, for all present and future maps
> 
Yes.

I didn't do that correctly on ARM, nor on device-attach
(for x86, e.g. on hotplug).

> While set_dirty_tracking_range - I guess it only does the range, so if
> we make a new map then the new range will be untracked? But that is
> now racy, we have to map and then call set_dirty_tracking_range
> 
> It seems better for the iommu driver to deal with this and ARM should
> atomically make the new maps dirty tracking..
> 

Next iteration I'll need to fix the way IOMMU drivers handle dirty-tracking
probing and track it in their private intermediate structures.

But yes, I was trying to transfer this to the iommu driver (perhaps in a
convoluted way).
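
Something along these lines on the driver side, I suppose (a sketch only; the
foo_* names and the dirty_tracking field are made up for illustration):

static int foo_iommu_set_dirty_tracking(struct iommu_domain *domain, bool enable)
{
	struct foo_domain *fdom = to_foo_domain(domain);

	/* Remember the domain-wide state so map() installs any new IOPTEs
	 * with the modifier (e.g. DBM) already applied while tracking is on. */
	fdom->dirty_tracking = enable;

	/* ... then walk and update all existing IOPTEs and sync the IOTLB ... */
	return 0;
}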

>>> +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
>>> +   struct iommu_domain *domain, bool enable)
>>> +{
>>> +   struct iommu_domain *dom;
>>> +   unsigned long index;
>>> +   int ret = -EOPNOTSUPP;
> 
> Returns EOPNOTSUPP if the xarray is empty?
> 
Argh no. Maybe -EINVAL is better here.


Re: [PATCH RFC 07/19] iommufd/vfio-compat: Dirty tracking IOCTLs compatibility

2022-04-29 Thread Joao Martins
On 4/29/22 13:19, Jason Gunthorpe wrote:
> On Thu, Apr 28, 2022 at 10:09:21PM +0100, Joao Martins wrote:
>> Add the correspondent APIs for performing VFIO dirty tracking,
>> particularly VFIO_IOMMU_DIRTY_PAGES ioctl subcmds:
>> * VFIO_IOMMU_DIRTY_PAGES_FLAG_START: Start dirty tracking and allocates
>>   the area @dirty_bitmap
>> * VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP: Stop dirty tracking and frees
>>  the area @dirty_bitmap
>> * VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP: Fetch dirty bitmap while dirty
>> tracking is active.
>>
>> Advertise the VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION,
>> where the domain-configured page size gets set the same as
>> iopt::iova_alignment and the maximum dirty bitmap size the same
>> as VFIO. Compared to the VFIO type1 iommu, the perpetual dirtying is
>> not implemented and userspace gets -EOPNOTSUPP, which is handled by
>> today's userspace.
>>
>> Move iommufd_get_pagesizes() definition prior to unmap for
>> iommufd_vfio_unmap_dma() dirty support to validate the user bitmap page
>> size against IOPT pagesize.
>>
>> Signed-off-by: Joao Martins 
>> ---
>>  drivers/iommu/iommufd/vfio_compat.c | 221 ++--
>>  1 file changed, 209 insertions(+), 12 deletions(-)
> 
> I think I would probably not do this patch, it has behavior that is
> quite different from the current vfio - ie the interaction with the
> mdevs, and I don't intend to fix that. 

I'll drop this, until I hear otherwise.

I wasn't sure what people were leaning towards, and keeping the perpetual-dirty
stuff didn't feel right for a new UAPI either.

> So, with this patch and a mdev
> then vfio_compat will return all-not-dirty but current vfio will
> return all-dirty - and that is significant enough to break qemu.
> 
Ack

> We've made a qemu patch to allow qemu to be happy if dirty tracking is
> not supported in the vfio container for migration, which is part of
> the v2 enablement series. That seems like the better direction.
> 
So in my auditing/testing, the listener callbacks are called but the dirty
ioctls return an error at start, and bail out early on sync. I suppose migration
won't really work, as no pages are set and whatnot, but it could
cope with no-dirty-tracking support. So by 'making qemu happy' is this mainly
cleaning out the constant error messages you get and not even attempting
migration by introducing a migration blocker early on ... should it fetch
no migration capability?

> I can see why this is useful to test with the current qemu however.

Yes, it is indeed useful for testing.

I am wondering if we can still emulate that in userspace, given that the
expectation from each GET_BITMAP call is to get all dirties, likewise for the
type1 unmap dirty. Unless I missed something obvious.


Re: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking

2022-04-29 Thread Joao Martins
On 4/29/22 13:08, Jason Gunthorpe wrote:
> On Thu, Apr 28, 2022 at 10:09:15PM +0100, Joao Martins wrote:
>> +
>> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
>> +   unsigned long iova, unsigned long length)
>> +{
> 
> Lets put iommu_dirty_bitmap in its own patch, the VFIO driver side
> will want to use this same data structure.
> 
OK.

>> +while (nbits > 0) {
>> +kaddr = kmap(dirty->pages[idx]) + start_offset;
> 
> kmap_local?
> 
/me nods

>> +/**
>> + * struct iommu_dirty_bitmap - Dirty IOVA bitmap state
>> + *
>> + * @iova: IOVA representing the start of the bitmap, the first bit of the 
>> bitmap
>> + * @pgshift: Page granularity of the bitmap
>> + * @gather: Range information for a pending IOTLB flush
>> + * @start_offset: Offset of the first user page
>> + * @pages: User pages representing the bitmap region
>> + * @npages: Number of user pages pinned
>> + */
>> +struct iommu_dirty_bitmap {
>> +unsigned long iova;
>> +unsigned long pgshift;
>> +struct iommu_iotlb_gather *gather;
>> +unsigned long start_offset;
>> +unsigned long npages;
>> +struct page **pages;
> 
> In many (all?) cases I would expect this to be called from a process
> context, can we just store the __user pointer here, or is the idea
> that with modern kernels poking a u64 to userspace is slower than a
> kmap?
> 
I have both options implemented, I'll need to measure it. Code-wise it would be
a lot simpler to just poke at the userspace addresses (that was my first
prototype of this), but I felt that poking at kernel addresses was safer and
avoids assumptions about the context (from the iommu driver). I can bring back
the former alternative if this was the wrong thing to do.
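
For reference, the two alternatives being weighed, roughly (a fragment only,
mixing the names from the hunk above with made-up ones like val and
bitmap_uptr; not the actual series code):

	/* (a) poke the user VA directly; only safe from a sleepable process
	 *     context, which is exactly the context assumption in question: */
	if (put_user(val, (u64 __user *)(bitmap_uptr + idx)))
		return -EFAULT;

	/* (b) what the RFC does: pin the bitmap pages up front (GUP), then
	 *     kmap and bitmap_set() on the kernel mapping from the walker: */
	kaddr = kmap(dirty->pages[idx]) + start_offset;
	bitmap_set(kaddr, offset, size);
	kunmap(dirty->pages[idx]);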

> I'm particularly concerend that this starts to require high
> order allocations with more than 2M of bitmap.. Maybe one direction is
> to GUP 2M chunks at a time and walk the __user pointer.
> 
That's what I am doing here. We GUP 2M of *bitmap* at a time,
which is about 1 page's worth of struct page pointers. That is enough
to read 64G of IOVA dirties in the worst-case scenario (i.e. with base pages).
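
As a quick sanity check on those numbers (assuming 4K pages and 1 bit per base
page):

  2M of bitmap = 2^21 bytes = 2^24 bits  -> covers 2^24 * 4K = 64G of IOVA
  2M of bitmap = 512 user pages pinned   -> 512 * 8 bytes of struct page
                                            pointers = 4K, i.e. ~1 page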

>> +static inline void iommu_dirty_bitmap_init(struct iommu_dirty_bitmap *dirty,
>> +   unsigned long base,
>> +   unsigned long pgshift,
>> +   struct iommu_iotlb_gather *gather)
>> +{
>> +memset(dirty, 0, sizeof(*dirty));
>> +dirty->iova = base;
>> +dirty->pgshift = pgshift;
>> +dirty->gather = gather;
>> +
>> +if (gather)
>> +iommu_iotlb_gather_init(dirty->gather);
>> +}
> 
> I would expect all the GUPing logic to be here too?

I had this in the iommufd_dirty_iter logic given that the iommu iteration
logic is in the parent structure that stores iommu_dirty_data.

My thinking with this patch was just to have what the IOMMU driver needs.

Which, actually, if anything means this helper above ought to be in a later patch.


Re: [PATCH RFC 16/19] iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping

2022-04-29 Thread Joao Martins
On 4/29/22 12:35, Robin Murphy wrote:
> On 2022-04-28 22:09, Joao Martins wrote:
>> From: Kunkun Jiang 
>>
>> As nested mode is not upstreamed now, we just aim to support dirty
>> log tracking for stage1 with io-pgtable mapping (means not support
>> SVA mapping). If HTTU is supported, we enable HA/HD bits in the SMMU
>> CD and transfer ARM_HD quirk to io-pgtable.
>>
>> We additionally filter out HD|HA if not supported. The CD.HD bit
>> is not particularly useful unless we toggle the DBM bit in the PTE
>> entries.
>>
>> Co-developed-by: Keqian Zhu 
>> Signed-off-by: Keqian Zhu 
>> Signed-off-by: Kunkun Jiang 
>> [joaomart:Convey HD|HA bits over to the context descriptor
>>   and update commit message]
>> Signed-off-by: Joao Martins 
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 +++
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
>>   include/linux/io-pgtable.h  |  1 +
>>   3 files changed, 15 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 1ca72fcca930..5f728f8f20a2 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -1077,10 +1077,18 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain 
>> *smmu_domain, int ssid,
>>   * this substream's traffic
>>   */
>>  } else { /* (1) and (2) */
>> +struct arm_smmu_device *smmu = smmu_domain->smmu;
>> +u64 tcr = cd->tcr;
>> +
>>  cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
>>  cdptr[2] = 0;
>>  cdptr[3] = cpu_to_le64(cd->mair);
>>   
>> +if (!(smmu->features & ARM_SMMU_FEAT_HD))
>> +tcr &= ~CTXDESC_CD_0_TCR_HD;
>> +if (!(smmu->features & ARM_SMMU_FEAT_HA))
>> +tcr &= ~CTXDESC_CD_0_TCR_HA;
> 
> This is very backwards...
> 
Yes.

>> +
>>  /*
>>   * STE is live, and the SMMU might read dwords of this CD in any
>>   * order. Ensure that it observes valid values before reading
>> @@ -2100,6 +2108,7 @@ static int arm_smmu_domain_finalise_s1(struct 
>> arm_smmu_domain *smmu_domain,
>>FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, tcr->orgn) |
>>FIELD_PREP(CTXDESC_CD_0_TCR_SH0, tcr->sh) |
>>FIELD_PREP(CTXDESC_CD_0_TCR_IPS, tcr->ips) |
>> +  CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |
> 
> ...these should be set in io-pgtable's TCR value *if* io-pgatble is 
> using DBM, then propagated through from there like everything else.
> 

So the DBM bit supersedes the TCR bit -- that's strange? Say if you mark a PTE
as writeable-clean with DBM set but TCR.HD unset .. then it won't trigger a
perm-fault?
I need to re-read that section of the manual, as I didn't get that impression
from the above.

>>CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
>>  cfg->cd.mair= pgtbl_cfg->arm_lpae_s1_cfg.mair;
>>   
>> @@ -2203,6 +2212,8 @@ static int arm_smmu_domain_finalise(struct 
>> iommu_domain *domain,
>>  .iommu_dev  = smmu->dev,
>>  };
>>   
>> +if (smmu->features & ARM_SMMU_FEAT_HD)
>> +pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;
> 
> You need to depend on ARM_SMMU_FEAT_COHERENCY for this as well, not 
> least because you don't have any of the relevant business for 
> synchronising non-coherent PTEs in your walk functions, but it's also 
> implementation-defined whether HTTU even operates on non-cacheable 
> pagetables, and frankly you just don't want to go there ;)
> 
/me nods OK.

> Robin.
> 
>>  if (smmu->features & ARM_SMMU_FEAT_BBML1)
>>  pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
>>  else if (smmu->features & ARM_SMMU_FEAT_BBML2)
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> index e15750be1d95..ff32242f2fdb 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> @@ -292,6 +292,9 @@
>>   #define CTXDESC_CD_0_TCR_IPS   GENMASK_ULL(34, 32)
>>   #define CTXDESC_CD_0_TCR_TBI0  (1ULL << 38)
>>   
>> +#define CTXDESC_CD_0_TCR_HA        (1UL << 43)

Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support

2022-04-29 Thread Joao Martins
On 4/29/22 12:19, Robin Murphy wrote:
> On 2022-04-29 12:05, Joao Martins wrote:
>> On 4/29/22 09:28, Tian, Kevin wrote:
>>>> From: Joao Martins 
>>>> Sent: Friday, April 29, 2022 5:09 AM
>>>>
>>>> Similar to .read_and_clear_dirty() use the page table
>>>> walker helper functions and set DBM|RDONLY bit, thus
>>>> switching the IOPTE to writeable-clean.
>>>
>>> this should not be one-off if the operation needs to be
>>> applied to IOPTE. Say a map request comes right after
>>> set_dirty_tracking() is called. If it's agreed to remove
>>> the range op then smmu driver should record the tracking
>>> status internally and then apply the modifier to all the new
>>> mappings automatically before dirty tracking is disabled.
>>> Otherwise the same logic needs to be kept in iommufd to
>>> call set_dirty_tracking_range() explicitly for every new
>>> iopt_area created within the tracking window.
>>
>> Gah, I totally missed that by mistake. New mappings aren't
>> carrying over the "DBM is set". This needs a new io-pgtable
>> quirk added post dirty-tracking toggling.
>>
>> I can adjust, but I am at odds on including this in a future
>> iteration given that I can't really test any of this stuff.
>> Might drop the driver until I have hardware/emulation I can
>> use (or maybe others can take over this). It was included
>> for revising the iommu core ops and whether iommufd was
>> affected by it.
>>
>> I'll delete the range op, and let smmu v3 driver walk its
>> own IO pgtables.
> 
> TBH I'd be inclined to just enable DBM unconditionally in 
> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it 
> dynamically (especially on a live domain) seems more trouble that it's 
> worth.

Hmmm, but then it would strip userland/VMM from any sort of control (contrary
to what we can do on the CPU/KVM side). e.g. the first time you do
GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
of guest time, as opposed to those only after you enabled dirty-tracking.

We do add the TCR values unconditionally if supported, but not
the actual dirty tracking.


Re: [PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML

2022-04-29 Thread Joao Martins
On 4/29/22 12:11, Robin Murphy wrote:
> On 2022-04-28 22:09, Joao Martins wrote:
>> From: Kunkun Jiang 
>>
>> This detects BBML feature and if SMMU supports it, transfer BBMLx
>> quirk to io-pgtable.
>>
>> BBML1 requires still marking PTE nT prior to performing a
>> translation table update, while BBML2 requires neither break-before-make
>> nor PTE nT bit being set. For dirty tracking it needs to clear
>> the dirty bit so checking BBML2 tells us the prerequisite. See SMMUv3.2
>> manual, section "3.21.1.3 When SMMU_IDR3.BBML == 2 (Level 2)" and
>> "3.21.1.2 When SMMU_IDR3.BBML == 1 (Level 1)"
> 
> You can drop this, and the dependencies on BBML elsewhere, until you get 
> round to the future large-page-splitting work, since that's the only 
> thing this represents. Not much point having the feature flags without 
> an actual implementation, or any users.
> 
OK.

My thinking was that BBML2 meant *also* that we don't need that break-before-make
thingie upon switching translation table entries. It seems that from what you
say, BBML2 then just refers to this but only in the context of switching between
hugepages/normal pages (?), not in general to all bits of the PTE (which we
would .. upon switching from writeable-dirty to writeable-clean with DBM set).


Re: [PATCH RFC 18/19] iommu/intel: Access/Dirty bit support for SL domains

2022-04-29 Thread Joao Martins
On 4/29/22 10:03, Tian, Kevin wrote:
>> From: Joao Martins 
>> Sent: Friday, April 29, 2022 5:10 AM
>>
>> IOMMU advertises Access/Dirty bits if the extended capability
>> DMAR register reports it (ECAP, mnemonic ECAP.SSADS). The first
>> stage table, though, has no bit for advertising, unless referenced via
> 
> first-stage is compatible to CPU page table thus a/d bit support is
> implied. 

Ah! That clarifies something which the manual wasn't quite so clear about :)
I mean I understood what you just said from reading the manual but was
not /really 100% sure/.

> But for dirty tracking I'm I'm fine with only supporting it
> with second-stage as first-stage will be used only for guest in the
> nesting case (though in concept first-stage could also be used for
> IOVA when nesting is disabled there is no plan to do so on Intel
> platforms).
> 
Cool.

>> a scalable-mode PASID Entry. Relevant Intel IOMMU SDM ref for first stage
>> table "3.6.2 Accessed, Extended Accessed, and Dirty Flags" and second
>> stage table "3.7.2 Accessed and Dirty Flags".
>>
>> To enable it, scalable-mode for the second-stage table is required,
>> so limit the use of the dirty bit to scalable-mode and discard the
>> first-stage configured DMAR domains. To use SSADS, we set a bit in
> 
> above is inaccurate. dirty bit is only supported in scalable mode so
> there is no limit per se.
> 
OK.

>> the scalable-mode PASID Table entry, by setting bit 9 (SSADE). When
> 
> "To use SSADS, we set bit 9 (SSADE) in the scalable-mode PASID table
> entry"
> 
/me nods

>> doing so, flush all iommu caches. Relevant SDM refs:
>>
>> "3.7.2 Accessed and Dirty Flags"
>> "6.5.3.3 Guidance to Software for Invalidations,
>>  Table 23. Guidance to Software for Invalidations"
>>
>> Dirty bit on the PTE is located in the same location (bit 9). The IOTLB
> 
> I'm not sure what information 'same location' here tries to convey...
> 

The bit in the scalable-mode PASID-table entry (SSADE) *and* the PTE dirty bit
are both bit 9. (On AMD, for example, they are on different bits.)

That's what 'location' meant, not the actual storage of those bits of course :)

>> caches some attributes when SSADE is enabled and dirty-ness information,
> 
> be direct that the dirty bit is cached in IOTLB thus any change of that
> bit requires flushing IOTLB
> 
OK, will make it clearer.

>> so we also need to flush IOTLB to make sure IOMMU attempts to set the
>> dirty bit again. Relevant manuals over the hardware translation is
>> chapter 6 with some special mention to:
>>
>> "6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
>> "6.2.4 IOTLB"
>>
>> Signed-off-by: Joao Martins 
>> ---
>> We probably shouldn't be as aggressive as to flush all; needs
>> checking with hardware (and the invalidation guidance) to understand
>> what exactly needs flushing.
> 
> yes, definitely not required to flush all. You can follow table 23
> for software guidance for invalidations.
> 
/me nods

>> ---
>>  drivers/iommu/intel/iommu.c | 109
>> 
>>  drivers/iommu/intel/pasid.c |  76 +
>>  drivers/iommu/intel/pasid.h |   7 +++
>>  include/linux/intel-iommu.h |  14 +
>>  4 files changed, 206 insertions(+)
>>
>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>> index ce33f85c72ab..92af43f27241 100644
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -5089,6 +5089,113 @@ static void intel_iommu_iotlb_sync_map(struct
>> iommu_domain *domain,
>>  }
>>  }
>>
>> +static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
>> +  bool enable)
>> +{
>> +struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +struct device_domain_info *info;
>> +unsigned long flags;
>> +int ret = -EINVAL;
>> +
>> +spin_lock_irqsave(&device_domain_lock, flags);
>> +if (list_empty(&dmar_domain->devices)) {
>> +spin_unlock_irqrestore(&device_domain_lock, flags);
>> +return ret;
>> +}
> 
> or return success here and just don't set any dirty bitmap in
> read_and_clear_dirty()?
> 
Yeap.

> btw I think every iommu driver needs to record the tracking status
> so later if a device which doesn't claim dirty tracking support is
> attached to a domain which already has dirty_tracking enabled
> then the attach request should be rejected.

Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support

2022-04-29 Thread Joao Martins
On 4/29/22 09:28, Tian, Kevin wrote:
>> From: Joao Martins 
>> Sent: Friday, April 29, 2022 5:09 AM
>>
>> Similar to .read_and_clear_dirty() use the page table
>> walker helper functions and set DBM|RDONLY bit, thus
>> switching the IOPTE to writeable-clean.
> 
> this should not be one-off if the operation needs to be
> applied to IOPTE. Say a map request comes right after
> set_dirty_tracking() is called. If it's agreed to remove
> the range op then smmu driver should record the tracking
> status internally and then apply the modifier to all the new
> mappings automatically before dirty tracking is disabled.
> Otherwise the same logic needs to be kept in iommufd to
> call set_dirty_tracking_range() explicitly for every new
> iopt_area created within the tracking window.

Gah, I totally missed that by mistake. New mappings aren't
carrying over the "DBM is set". This needs a new io-pgtable
quirk added post dirty-tracking toggling.

I can adjust, but I am at odds on including this in a future
iteration given that I can't really test any of this stuff.
Might drop the driver until I have hardware/emulation I can
use (or maybe others can take over this). It was included
for revising the iommu core ops and whether iommufd was
affected by it.

I'll delete the range op, and let smmu v3 driver walk its
own IO pgtables.


Re: [PATCH RFC 03/19] iommufd: Dirty tracking data support

2022-04-29 Thread Joao Martins
On 4/29/22 09:12, Tian, Kevin wrote:
>> From: Joao Martins 
>> Sent: Friday, April 29, 2022 5:09 AM
> [...]
>> +
>> +static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
>> +  struct iommufd_dirty_data *bitmap)
> 
> In a glance this function and all previous helpers doesn't rely on any
> iommufd objects except that the new structures are named as
> iommufd_xxx. 
> 
> I wonder whether moving all of them to the iommu layer would make
> more sense here.
> 
I suppose, instinctively, I was trying to make this tie to iommufd only,
to avoid getting it called in cases we don't expect when made as a generic
exported kernel facility.

(note: iommufd can be built as a module).

>> +{
>> +const struct iommu_domain_ops *ops = domain->ops;
>> +struct iommu_iotlb_gather gather;
>> +struct iommufd_dirty_iter iter;
>> +int ret = 0;
>> +
>> +if (!ops || !ops->read_and_clear_dirty)
>> +return -EOPNOTSUPP;
>> +
>> +iommu_dirty_bitmap_init(&iter.dirty, bitmap->iova,
>> +__ffs(bitmap->page_size), &gather);
>> +ret = iommufd_dirty_iter_init(&iter, bitmap);
>> +if (ret)
>> +return -ENOMEM;
>> +
>> +for (; iommufd_dirty_iter_done(&iter);
>> + iommufd_dirty_iter_advance(&iter)) {
>> +ret = iommufd_dirty_iter_get(&iter);
>> +if (ret)
>> +break;
>> +
>> +ret = ops->read_and_clear_dirty(domain,
>> +iommufd_dirty_iova(&iter),
>> +iommufd_dirty_iova_length(&iter), &iter.dirty);
>> +
>> +iommufd_dirty_iter_put(&iter);
>> +
>> +if (ret)
>> +break;
>> +}
>> +
>> +iommu_iotlb_sync(domain, &gather);
>> +iommufd_dirty_iter_free(&iter);
>> +
>> +return ret;
>> +}
>> +


Re: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable

2022-04-29 Thread Joao Martins
On 4/29/22 09:07, Tian, Kevin wrote:
>> From: Joao Martins 
>> Sent: Friday, April 29, 2022 5:09 AM
>>
>> +static int __set_dirty_tracking_range_locked(struct iommu_domain
>> *domain,
> 
> suppose anything using iommu_domain as the first argument should
> be put in the iommu layer. Here it's more reasonable to use iopt
> as the first argument or simply merge with the next function.
> 
OK

>> + struct io_pagetable *iopt,
>> + bool enable)
>> +{
>> +const struct iommu_domain_ops *ops = domain->ops;
>> +struct iommu_iotlb_gather gather;
>> +struct iopt_area *area;
>> +int ret = -EOPNOTSUPP;
>> +unsigned long iova;
>> +size_t size;
>> +
>> +iommu_iotlb_gather_init(&gather);
>> +
>> +for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
>> + area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> 
> how is this different from leaving the iommu driver to walk the page table
> and then poke the modifier bit for all present PTEs?

It isn't. Moving towards a single op makes this simpler for the iommu core API.

> As commented in last
> patch this may allow removing the range op completely.
> 
Yes.

>> +iova = iopt_area_iova(area);
>> +size = iopt_area_last_iova(area) - iova;
>> +
>> +if (ops->set_dirty_tracking_range) {
>> +ret = ops->set_dirty_tracking_range(domain, iova,
>> +size, &gather,
>> +enable);
>> +if (ret < 0)
>> +break;
>> +}
>> +}
>> +
>> +iommu_iotlb_sync(domain, &gather);
>> +
>> +return ret;
>> +}
>> +
>> +static int iommu_set_dirty_tracking(struct iommu_domain *domain,
>> +struct io_pagetable *iopt, bool enable)
> 
> similarly rename to __iopt_set_dirty_tracking() and use iopt as the
> leading argument.
> 
/me nods


Re: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking

2022-04-29 Thread Joao Martins
On 4/29/22 08:54, Tian, Kevin wrote:
>> From: Joao Martins 
>> Sent: Friday, April 29, 2022 5:09 AM
>>
>> Add to iommu domain operations a set of callbacks to
>> perform dirty tracking, particularly to start and stop
>> tracking and finally to test and clear the dirty data.
> 
> to be consistent with other context, s/test/read/
> 
/me nods

>>
>> Drivers are expected to dynamically change their hw protection
>> domain bits to toggle the tracking and flush some form of
> 
> 'hw protection domain bits' sounds a bit weird. what about
> just using 'translation structures'?
> 
I'll replace it with that instead.

>> control state structure that stands in the IOVA translation
>> path.
>>
>> For reading and clearing dirty data, in all IOMMUs a transition
>> from any of the PTE access bits (Access, Dirty) implies flushing
>> the IOTLB to invalidate any stale data in the IOTLB as to whether
>> or not the IOMMU should update the said PTEs. The iommu core APIs
>> introduce a new structure for storing the dirties, albeit vendor
>> IOMMUs implementing .read_and_clear_dirty() just use
> 
> s/vendor IOMMUs/iommu drivers/
> 
> btw according to past history in iommu mailing list sounds like
> 'vendor' is not a term welcomed in the kernel, while there are
> many occurrences in this series.
> 
Hmm, I wasn't aware actually.

Will move away from using 'vendor'.

> [...]
>> The ARM SMMUv3 case, though, is a tad different from its x86
>> counterparts. Rather than changing *only* the IOMMU domain device entry to
>> enable dirty tracking (and having a dedicated bit for dirtiness in the IOPTE),
>> ARM instead uses a dirty-bit modifier which is separately enabled, and
>> changes the *existing* meaning of access bits (for ro/rw), to the point
>> that marking the access bit read-only but with the dirty-bit-modifier enabled
>> doesn't trigger a permission IO page fault.
>>
>> In practice this means that changing the iommu context isn't enough
>> and is in fact mostly useless IIUC (and can always be enabled). Dirtying
>> is only really enabled when the DBM pte bit is enabled (with the
>> CD.HD bit as a prereq).
>>
>> To capture this h/w construct an iommu core API is added which enables
>> dirty tracking on an IOVA range rather than a device/context entry.
>> iommufd picks one or the other, and IOMMUFD core will favour
>> device-context op followed by IOVA-range alternative.
> 
> Above doesn't convince me on the necessity of introducing two ops
> here. Even for ARM it can accept a per-domain op and then walk the
> page table to manipulate any modifier for existing mappings. It
> doesn't matter whether it sets one bit in the context entry or multiple
> bits in the page table.
> 
OK

> [...]
>> +
> 
> Miss comment for this function.
> 
ack

>> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap
>> *dirty,
>> +   unsigned long iova, unsigned long length)
>> +{
>> +unsigned long nbits, offset, start_offset, idx, size, *kaddr;
>> +
>> +nbits = max(1UL, length >> dirty->pgshift);
>> +offset = (iova - dirty->iova) >> dirty->pgshift;
>> +idx = offset / (PAGE_SIZE * BITS_PER_BYTE);
>> +offset = offset % (PAGE_SIZE * BITS_PER_BYTE);
>> +start_offset = dirty->start_offset;
> 
> could you elaborate the purpose of dirty->start_offset? Why dirty->iova
> doesn't start at offset 0 of the bitmap?
> 

It is to deal with page-unaligned addresses.

Like if the start of the bitmap -- and hence the bitmap base IOVA for the first
bit of the bitmap -- isn't page-aligned and starts at some offset into a given
page. Thus start_offset (i.e. essentially the offset of the bitmap within the
first pinned page) is there so we know which bit in the pinned page dirty::iova
corresponds to.

>> +
>> +while (nbits > 0) {
>> +kaddr = kmap(dirty->pages[idx]) + start_offset;
>> +size = min(PAGE_SIZE * BITS_PER_BYTE - offset, nbits);
>> +bitmap_set(kaddr, offset, size);
>> +kunmap(dirty->pages[idx]);
> 
> what about the overhead of kmap/kunmap when it's done for every
> dirtied page (as done in patch 18)?

Isn't it an overhead mainly with highmem? Otherwise it ends up being 
page_to_virt(...)

But anyway the kmaps should be cached, and torn down when pinning the next
user data.

Performance analysis is also something I want to fully hash out (as mentioned 
in the cover
letter).


Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking

2022-04-29 Thread Joao Martins
On 4/29/22 06:45, Tian, Kevin wrote:
>> From: Joao Martins 
>> Sent: Friday, April 29, 2022 5:09 AM
>>
>> Presented herewith is a series that extends IOMMUFD to have IOMMU
>> hardware support for dirty bit in the IOPTEs.
>>
>> Today, AMD Milan (which been out for a year now) supports it while ARM
>> SMMUv3.2+ alongside VT-D rev3.x are expected to eventually come along.
>> The intended use-case is to support Live Migration with SR-IOV, with
> 
> this should not be restricted to SR-IOV.
> 
True. Should have written PCI Devices as that is orthogonal to SF/S-IOV/SR-IOV.

>> IOMMUs
>> that support it. Yishai Hadas will be soon submiting an RFC that covers the
>> PCI device dirty tracker via vfio.
>>
>> At a quick glance, IOMMUFD lets the userspace VMM create the IOAS with a
>> set of IOVA ranges mapped to some physical memory composing an IO
>> pagetable. This is then attached to a particular device, consequently
>> creating the protection domain to share a common IO page table
>> representing the endpoint DMA-addressable guest address space.
>> (Hopefully I am not twisting the terminology here.) The resultant object
> 
> Just remove VMM/guest/... since iommufd is not specific to virtualization. 
> 
/me nods

>> is an hw_pagetable object which represents the iommu_domain
>> object that will be directly manipulated. For more background on
>> IOMMUFD have a look at these two series[0][1] on the kernel and qemu
>> consumption respectivally. The IOMMUFD UAPI, kAPI and the iommu core
>> kAPI is then extended to provide:
>>
>>  1) Enabling or disabling dirty tracking on the iommu_domain. Model
>> as the most common case of changing hardware protection domain control
> 
> didn't get what 'most common case' here tries to explain
> 
Most common case because, out of the 3 analyzed IOMMUs, two of them require
per-device context bits to be changed (Intel, AMD) rather than page table
entries (ARM).

>> bits, and ARM specific case of having to enable the per-PTE DBM control
>> bit. The 'real' tracking of whether dirty tracking is enabled or not is
>> stored in the vendor IOMMU, hence no new fields are added to iommufd
>> pagetable structures.
>>
>>  2) Read the I/O PTEs and marshal their dirtiness into a bitmap. The bitmap
>> thus describes the IOVAs that got written by the device. While performing
>> the marshalling the vendors also need to clear the dirty bits from the IOPTE and
> 
> s/vendors/iommu drivers/ 
> 
OK, I will avoid the `vendor` term going forward.

>> allow the kAPI caller to batch the much needed IOTLB flush.
>> There's no copy of bitmaps to userspace backed memory, all is zerocopy
>> based. So far this is a test-and-clear kind of interface given that the
>> IOPT walk is going to be expensive. It occurred to me to separate
>> the readout of dirty, and the clearing of dirty from IOPTEs.
>> I haven't opted for that one, given that it would mean two lengthy IOPTE
>> walks and felt counter-performant.
> 
> me too. that doesn't feel like a performant way.
> 
>>
>>  3) Unmapping an IOVA range while returning its dirty bit prior to
>> unmap. This case is specific for non-nested vIOMMU case where an
>> erroneous guest (or device) DMAing to an address being unmapped at the
>> same time.
> 
> an erroneous attempt like above cannot anticipate which DMAs can
> succeed in that window thus the end behavior is undefined. For an
> undefined behavior nothing will be broken by losing some bits dirtied
> in the window between reading back dirty bits of the range and
> actually calling unmap. From guest p.o.v. all those are black-box
> hardware logic to serve a virtual iotlb invalidation request which just
> cannot be completed in one cycle.
> 
> Hence in reality probably this is not required except to meet vfio
> compat requirement. Just in concept returning dirty bits at unmap
> is more accurate.
> 
> I'm slightly inclined to abandon it in iommufd uAPI.
> 

OK, it seems I am not far off from your thoughts.

I'll see what others think too, and if so I'll remove the unmap_dirty.

Because if vfio-compat doesn't get the iommu hw dirty support, then there would
be no users of unmap_dirty.

>>
>> [See at the end too, on general remarks, specifically the one regarding
>>  probing dirty tracking via a dedicated iommufd cap ioctl]
>>
>> The series is organized as follows:
>>
>> * Patches 1-3: Takes care of the iommu domain operations to be added and
>> extends iommufd io-pagetable to set/clear dirty tracking, as well as
>> reading the dirty bits from the vendor pagetables

[PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML

2022-04-28 Thread Joao Martins
From: Kunkun Jiang 

This detects BBML feature and if SMMU supports it, transfer BBMLx
quirk to io-pgtable.

BBML1 still requires marking the PTE nT prior to performing a
translation table update, while BBML2 requires neither break-before-make
nor the PTE nT bit being set. For dirty tracking it needs to clear
the dirty bit so checking BBML2 tells us the prerequisite. See SMMUv3.2
manual, section "3.21.1.3 When SMMU_IDR3.BBML == 2 (Level 2)" and
"3.21.1.2 When SMMU_IDR3.BBML == 1 (Level 1)"

Co-developed-by: Keqian Zhu 
Signed-off-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
[joaomart: massage commit message with the need to have BBML quirk
 and add the Quirk io-pgtable flags]
Signed-off-by: Joao Martins 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 19 +++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  6 ++
 include/linux/io-pgtable.h  |  3 +++
 3 files changed, 28 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 14609ece4e33..4dba53bde2e3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2203,6 +2203,11 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain,
.iommu_dev  = smmu->dev,
};
 
+   if (smmu->features & ARM_SMMU_FEAT_BBML1)
+   pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
+   else if (smmu->features & ARM_SMMU_FEAT_BBML2)
+   pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML2;
+
pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
if (!pgtbl_ops)
return -ENOMEM;
@@ -3591,6 +3596,20 @@ static int arm_smmu_device_hw_probe(struct 
arm_smmu_device *smmu)
 
/* IDR3 */
reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
+   switch (FIELD_GET(IDR3_BBML, reg)) {
+   case IDR3_BBML0:
+   break;
+   case IDR3_BBML1:
+   smmu->features |= ARM_SMMU_FEAT_BBML1;
+   break;
+   case IDR3_BBML2:
+   smmu->features |= ARM_SMMU_FEAT_BBML2;
+   break;
+   default:
+   dev_err(smmu->dev, "unknown/unsupported BBM behavior level\n");
+   return -ENXIO;
+   }
+
if (FIELD_GET(IDR3_RIL, reg))
smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 1487a80fdf1b..e15750be1d95 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -54,6 +54,10 @@
 #define IDR1_SIDSIZE   GENMASK(5, 0)
 
 #define ARM_SMMU_IDR3  0xc
+#define IDR3_BBML  GENMASK(12, 11)
+#define IDR3_BBML0 0
+#define IDR3_BBML1 1
+#define IDR3_BBML2 2
 #define IDR3_RIL   (1 << 10)
 
 #define ARM_SMMU_IDR5  0x14
@@ -644,6 +648,8 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_E2H  (1 << 18)
 #define ARM_SMMU_FEAT_HA   (1 << 19)
 #define ARM_SMMU_FEAT_HD   (1 << 20)
+#define ARM_SMMU_FEAT_BBML1    (1 << 21)
+#define ARM_SMMU_FEAT_BBML2    (1 << 22)
u32 features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0)
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index c2ebfe037f5d..d7626ca67dbf 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -85,6 +85,9 @@ struct io_pgtable_cfg {
#define IO_PGTABLE_QUIRK_ARM_MTK_EXTBIT(3)
#define IO_PGTABLE_QUIRK_ARM_TTBR1  BIT(5)
#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA BIT(6)
+   #define IO_PGTABLE_QUIRK_ARM_BBML1  BIT(7)
+   #define IO_PGTABLE_QUIRK_ARM_BBML2  BIT(8)
+
unsigned long   quirks;
unsigned long   pgsize_bitmap;
unsigned intias;
-- 
2.17.2



[PATCH RFC 19/19] iommu/intel: Add unmap_read_dirty() support

2022-04-28 Thread Joao Martins
Similar to other IOMMUs, base unmap_read_dirty() on how unmap() works, with
the exception of having a non-racy clear of the PTE so as to return whether it
was dirty or not.

Signed-off-by: Joao Martins 
---
 drivers/iommu/intel/iommu.c | 43 -
 include/linux/intel-iommu.h | 16 ++
 2 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 92af43f27241..e80e98f5202b 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1317,7 +1317,8 @@ static void dma_pte_list_pagetables(struct dmar_domain 
*domain,
 static void dma_pte_clear_level(struct dmar_domain *domain, int level,
struct dma_pte *pte, unsigned long pfn,
unsigned long start_pfn, unsigned long last_pfn,
-   struct list_head *freelist)
+   struct list_head *freelist,
+   struct iommu_dirty_bitmap *dirty)
 {
struct dma_pte *first_pte = NULL, *last_pte = NULL;
 
@@ -1338,7 +1339,11 @@ static void dma_pte_clear_level(struct dmar_domain 
*domain, int level,
if (level > 1 && !dma_pte_superpage(pte))
dma_pte_list_pagetables(domain, level - 1, pte, 
freelist);
 
-   dma_clear_pte(pte);
+   if (dma_clear_pte_dirty(pte) && dirty)
+   iommu_dirty_bitmap_record(dirty,
+   pfn << VTD_PAGE_SHIFT,
+   level_size(level) << VTD_PAGE_SHIFT);
+
if (!first_pte)
first_pte = pte;
last_pte = pte;
@@ -1347,7 +1352,7 @@ static void dma_pte_clear_level(struct dmar_domain 
*domain, int level,
dma_pte_clear_level(domain, level - 1,
phys_to_virt(dma_pte_addr(pte)),
level_pfn, start_pfn, last_pfn,
-   freelist);
+   freelist, dirty);
}
 next:
pfn = level_pfn + level_size(level);
@@ -1362,7 +1367,8 @@ static void dma_pte_clear_level(struct dmar_domain 
*domain, int level,
the page tables, and may have cached the intermediate levels. The
pages can only be freed after the IOTLB flush has been done. */
 static void domain_unmap(struct dmar_domain *domain, unsigned long start_pfn,
-unsigned long last_pfn, struct list_head *freelist)
+unsigned long last_pfn, struct list_head *freelist,
+struct iommu_dirty_bitmap *dirty)
 {
BUG_ON(!domain_pfn_supported(domain, start_pfn));
BUG_ON(!domain_pfn_supported(domain, last_pfn));
@@ -1370,7 +1376,8 @@ static void domain_unmap(struct dmar_domain *domain, 
unsigned long start_pfn,
 
/* we don't need lock here; nobody else touches the iova range */
dma_pte_clear_level(domain, agaw_to_level(domain->agaw),
-   domain->pgd, 0, start_pfn, last_pfn, freelist);
+   domain->pgd, 0, start_pfn, last_pfn, freelist,
+   dirty);
 
/* free pgd */
if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {
@@ -2031,7 +2038,8 @@ static void domain_exit(struct dmar_domain *domain)
if (domain->pgd) {
LIST_HEAD(freelist);
 
-   domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist);
+   domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist,
+NULL);
put_pages_list(&freelist);
}
 
@@ -4125,7 +4133,8 @@ static int intel_iommu_memory_notifier(struct 
notifier_block *nb,
struct intel_iommu *iommu;
LIST_HEAD(freelist);
 
-   domain_unmap(si_domain, start_vpfn, last_vpfn, 
&freelist);
+   domain_unmap(si_domain, start_vpfn, last_vpfn,
+&freelist, NULL);
 
rcu_read_lock();
for_each_active_iommu(iommu, drhd)
@@ -4737,7 +4746,8 @@ static int intel_iommu_map_pages(struct iommu_domain 
*domain,
 
 static size_t intel_iommu_unmap(struct iommu_domain *domain,
unsigned long iova, size_t size,
-   struct iommu_iotlb_gather *gather)
+   struct iommu_iotlb_gather *gather,
+   struct iommu_dirty_bitmap *dirty)
 {
struct dmar_domain *dmar_domain = to_dmar_domain(domain);
unsigned long start_pfn, last_pfn;

[PATCH RFC 18/19] iommu/intel: Access/Dirty bit support for SL domains

2022-04-28 Thread Joao Martins
IOMMU advertises Access/Dirty bits if the extended capability
DMAR register reports it (ECAP, mnemonic ECAP.SSADS). The first
stage table, though, has no bit for advertising, unless referenced via
a scalable-mode PASID Entry. Relevant Intel IOMMU SDM ref for first stage
table "3.6.2 Accessed, Extended Accessed, and Dirty Flags" and second
stage table "3.7.2 Accessed and Dirty Flags".

To enable it, scalable-mode for the second-stage table is required,
so limit the use of the dirty bit to scalable-mode and discard the
first-stage configured DMAR domains. To use SSADS, we set a bit in
the scalable-mode PASID Table entry, by setting bit 9 (SSADE). When
doing so, flush all iommu caches. Relevant SDM refs:

"3.7.2 Accessed and Dirty Flags"
"6.5.3.3 Guidance to Software for Invalidations,
 Table 23. Guidance to Software for Invalidations"

The dirty bit on the PTE is located at the same bit position (bit 9). The IOTLB
caches some attributes, and dirtiness information, when SSADE is enabled,
so we also need to flush the IOTLB to make sure the IOMMU attempts to set the
dirty bit again. Relevant manuals over the hardware translation is
chapter 6 with some special mention to:

"6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
"6.2.4 IOTLB"

Signed-off-by: Joao Martins 
---
We probably shouldn't be as aggressive as to flush all; needs
checking with hardware (and the invalidation guidance) to understand
what exactly needs flushing.
---
 drivers/iommu/intel/iommu.c | 109 
 drivers/iommu/intel/pasid.c |  76 +
 drivers/iommu/intel/pasid.h |   7 +++
 include/linux/intel-iommu.h |  14 +
 4 files changed, 206 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ce33f85c72ab..92af43f27241 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5089,6 +5089,113 @@ static void intel_iommu_iotlb_sync_map(struct 
iommu_domain *domain,
}
 }
 
+static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
+ bool enable)
+{
+   struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+   struct device_domain_info *info;
+   unsigned long flags;
+   int ret = -EINVAL;
+
+   spin_lock_irqsave(&device_domain_lock, flags);
+   if (list_empty(&dmar_domain->devices)) {
+   spin_unlock_irqrestore(&device_domain_lock, flags);
+   return ret;
+   }
+
+   list_for_each_entry(info, &dmar_domain->devices, link) {
+   if (!info->dev || (info->domain != dmar_domain))
+   continue;
+
+   /* Dirty tracking is second-stage level SM only */
+   if ((info->domain && domain_use_first_level(info->domain)) ||
+   !ecap_slads(info->iommu->ecap) ||
+   !sm_supported(info->iommu) || !intel_iommu_sm) {
+   ret = -EOPNOTSUPP;
+   continue;
+   }
+
+   ret = intel_pasid_setup_dirty_tracking(info->iommu, 
info->domain,
+info->dev, PASID_RID2PASID,
+enable);
+   if (ret)
+   break;
+   }
+   spin_unlock_irqrestore(&device_domain_lock, flags);
+
+   /*
+* We need to flush context TLB and IOTLB with any cached translations
+* to force the incoming DMA requests for have its IOTLB entries tagged
+* with A/D bits
+*/
+   intel_flush_iotlb_all(domain);
+   return ret;
+}
+
+static int intel_iommu_get_dirty_tracking(struct iommu_domain *domain)
+{
+   struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+   struct device_domain_info *info;
+   unsigned long flags;
+   int ret = 0;
+
+   spin_lock_irqsave(&device_domain_lock, flags);
+   list_for_each_entry(info, &dmar_domain->devices, link) {
+   if (!info->dev || (info->domain != dmar_domain))
+   continue;
+
+   /* Dirty tracking is second-stage level SM only */
+   if ((info->domain && domain_use_first_level(info->domain)) ||
+   !ecap_slads(info->iommu->ecap) ||
+   !sm_supported(info->iommu) || !intel_iommu_sm) {
+   ret = -EOPNOTSUPP;
+   continue;
+   }
+
+   if (!intel_pasid_dirty_tracking_enabled(info->iommu, 
info->domain,
+info->dev, PASID_RID2PASID)) {
+   ret = -EINVAL;
+   break;
+   }
+   }
+   spin_unlock_irqrestore(&device_domain_lock, fl

[PATCH RFC 14/19] iommu/arm-smmu-v3: Add read_and_clear_dirty() support

2022-04-28 Thread Joao Martins
.read_and_clear_dirty() IOMMU domain op takes care of
reading the dirty bits (i.e. the PTE has DBM set and AP[2] clear)
and marshalling them into a bitmap of a given page size.

While reading the dirty bits we also set the PTE AP[2]
bit to mark it as writable-clean.

Structure it in a way that the IOPTE walker is generic,
and so we pass a function pointer over what to do on a per-PTE
basis. This is useful for a followup patch where we supply an
io-pgtable op to enable DBM when starting/stopping dirty tracking.

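Purely as an aside, a tiny standalone model of the DBM/AP[2] encoding may help
when reading the walker below; the bit positions match the io-pgtable
definitions, everything else is invented. With DBM set, AP[2]=1 means
writable-clean and AP[2]=0 means the page was dirtied by hardware; cleaning a
dirty entry sets AP[2] back.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_DBM         (1ULL << 51)    /* dirty bit modifier */
#define PTE_AP_RDONLY   (1ULL << 7)     /* AP[2] */

static bool pte_is_dirty(uint64_t pte)
{
        return (pte & PTE_DBM) && !(pte & PTE_AP_RDONLY);
}

static uint64_t pte_mkclean(uint64_t pte)
{
        return pte | PTE_AP_RDONLY;     /* back to writable-clean */
}

int main(void)
{
        uint64_t clean = PTE_DBM | PTE_AP_RDONLY;
        uint64_t dirty = PTE_DBM;       /* hardware cleared AP[2] on a write */

        printf("clean pte dirty? %d\n", pte_is_dirty(clean));
        printf("dirty pte dirty? %d\n", pte_is_dirty(dirty));
        printf("after cleaning:  %d\n", pte_is_dirty(pte_mkclean(dirty)));
        return 0;
}
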
Co-developed-by: Keqian Zhu 
Signed-off-by: Keqian Zhu 
Co-developed-by: Kunkun Jiang 
Signed-off-by: Kunkun Jiang 
Signed-off-by: Joao Martins 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  27 ++
 drivers/iommu/io-pgtable-arm.c  | 102 +++-
 2 files changed, 128 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 4dba53bde2e3..232057d20197 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2743,6 +2743,32 @@ static int arm_smmu_enable_nesting(struct iommu_domain 
*domain)
return ret;
 }
 
+static int arm_smmu_read_and_clear_dirty(struct iommu_domain *domain,
+unsigned long iova, size_t size,
+struct iommu_dirty_bitmap *dirty)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+   int ret;
+
+   if (!(smmu->features & ARM_SMMU_FEAT_HD) ||
+   !(smmu->features & ARM_SMMU_FEAT_BBML2))
+   return -ENODEV;
+
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
+   return -EINVAL;
+
+   if (!ops || !ops->read_and_clear_dirty) {
+   pr_err_once("io-pgtable don't support dirty tracking\n");
+   return -ENODEV;
+   }
+
+   ret = ops->read_and_clear_dirty(ops, iova, size, dirty);
+
+   return ret;
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2871,6 +2897,7 @@ static struct iommu_ops arm_smmu_ops = {
.iova_to_phys   = arm_smmu_iova_to_phys,
.enable_nesting = arm_smmu_enable_nesting,
.free   = arm_smmu_domain_free,
+   .read_and_clear_dirty   = arm_smmu_read_and_clear_dirty,
}
 };
 
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 94ff319ae8ac..3c99028d315a 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -75,6 +75,7 @@
 
 #define ARM_LPAE_PTE_NSTABLE   (((arm_lpae_iopte)1) << 63)
 #define ARM_LPAE_PTE_XN(((arm_lpae_iopte)3) << 53)
+#define ARM_LPAE_PTE_DBM   (((arm_lpae_iopte)1) << 51)
 #define ARM_LPAE_PTE_AF(((arm_lpae_iopte)1) << 10)
 #define ARM_LPAE_PTE_SH_NS (((arm_lpae_iopte)0) << 8)
 #define ARM_LPAE_PTE_SH_OS (((arm_lpae_iopte)2) << 8)
@@ -84,7 +85,7 @@
 
 #define ARM_LPAE_PTE_ATTR_LO_MASK  (((arm_lpae_iopte)0x3ff) << 2)
 /* Ignore the contiguous bit for block splitting */
-#define ARM_LPAE_PTE_ATTR_HI_MASK  (((arm_lpae_iopte)6) << 52)
+#define ARM_LPAE_PTE_ATTR_HI_MASK  (((arm_lpae_iopte)13) << 51)
 #define ARM_LPAE_PTE_ATTR_MASK (ARM_LPAE_PTE_ATTR_LO_MASK |\
 ARM_LPAE_PTE_ATTR_HI_MASK)
 /* Software bit for solving coherency races */
@@ -93,6 +94,9 @@
 /* Stage-1 PTE */
 #define ARM_LPAE_PTE_AP_UNPRIV (((arm_lpae_iopte)1) << 6)
 #define ARM_LPAE_PTE_AP_RDONLY (((arm_lpae_iopte)2) << 6)
+#define ARM_LPAE_PTE_AP_RDONLY_BIT 7
+#define ARM_LPAE_PTE_AP_WRITABLE   (ARM_LPAE_PTE_AP_RDONLY | \
+ARM_LPAE_PTE_DBM)
 #define ARM_LPAE_PTE_ATTRINDX_SHIFT2
 #define ARM_LPAE_PTE_nG(((arm_lpae_iopte)1) << 11)
 
@@ -737,6 +741,101 @@ static phys_addr_t arm_lpae_iova_to_phys(struct 
io_pgtable_ops *ops,
return iopte_to_paddr(pte, data) | iova;
 }
 
+static int __arm_lpae_read_and_clear_dirty(unsigned long iova, size_t size,
+  arm_lpae_iopte *ptep, void *opaque)
+{
+   struct iommu_dirty_bitmap *dirty = opaque;
+   arm_lpae_iopte pte;
+
+   pte = READ_ONCE(*ptep);
+   if (WARN_ON(!pte))
+   return -EINVAL;
+
+   if ((pte & ARM_LPAE_PTE_AP_WRITABLE) == ARM_LPAE_PTE_AP_WRITABLE)
+   return 0;
+
+   if (!(pte & ARM_LPAE_PTE_DBM))
+   return 0;
+
+   iommu_dirty_bitmap_record(dirty, iova, size);
+   set_bit(ARM_LPAE_PTE_A

[PATCH RFC 17/19] iommu/arm-smmu-v3: Add unmap_read_dirty() support

2022-04-28 Thread Joao Martins
This mostly reuses the existing unmap code, with the extra addition of
marshalling dirty bits into a bitmap at a given page size. To tackle the
race, switch away from a plain store to a cmpxchg() and check whether the
IOVA was dirtied or not once it succeeds.

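For illustration, a user-space sketch of that race handling, with invented
names: the PTE is cleared with a compare-and-exchange loop rather than a plain
store, and dirtiness is decided from the value that was actually swapped out,
so a hardware write landing between the read and the clear is not lost.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_DBM         (1ULL << 51)
#define PTE_AP_RDONLY   (1ULL << 7)

static bool clear_pte_read_dirty(uint64_t *ptep)
{
        uint64_t old = __atomic_load_n(ptep, __ATOMIC_RELAXED);

        /* retry until we are the ones who cleared the PTE */
        while (!__atomic_compare_exchange_n(ptep, &old, 0, false,
                                            __ATOMIC_SEQ_CST,
                                            __ATOMIC_RELAXED))
                ;       /* 'old' was refreshed with the latest value */

        return (old & PTE_DBM) && !(old & PTE_AP_RDONLY);
}

int main(void)
{
        uint64_t pte = PTE_DBM;         /* dirtied: AP[2] cleared by hardware */

        printf("unmapped dirty? %d (pte now 0x%llx)\n",
               clear_pte_read_dirty(&pte), (unsigned long long)pte);
        return 0;
}
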
Signed-off-by: Joao Martins 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 17 +
 drivers/iommu/io-pgtable-arm.c  | 78 +
 2 files changed, 82 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5f728f8f20a2..d1fb757056cc 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2499,6 +2499,22 @@ static size_t arm_smmu_unmap_pages(struct iommu_domain 
*domain, unsigned long io
return ops->unmap_pages(ops, iova, pgsize, pgcount, gather);
 }
 
+static size_t arm_smmu_unmap_pages_read_dirty(struct iommu_domain *domain,
+ unsigned long iova, size_t pgsize,
+ size_t pgcount,
+ struct iommu_iotlb_gather *gather,
+ struct iommu_dirty_bitmap *dirty)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+
+   if (!ops)
+   return 0;
+
+   return ops->unmap_pages_read_dirty(ops, iova, pgsize, pgcount,
+  gather, dirty);
+}
+
 static void arm_smmu_flush_iotlb_all(struct iommu_domain *domain)
 {
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
@@ -2938,6 +2954,7 @@ static struct iommu_ops arm_smmu_ops = {
.free   = arm_smmu_domain_free,
.read_and_clear_dirty   = arm_smmu_read_and_clear_dirty,
.set_dirty_tracking_range = arm_smmu_set_dirty_tracking,
+   .unmap_pages_read_dirty = arm_smmu_unmap_pages_read_dirty,
}
 };
 
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 361410aa836c..143ee7d73f88 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -259,10 +259,30 @@ static void __arm_lpae_clear_pte(arm_lpae_iopte *ptep, 
struct io_pgtable_cfg *cf
__arm_lpae_sync_pte(ptep, 1, cfg);
 }
 
+static bool __arm_lpae_clear_dirty_pte(arm_lpae_iopte *ptep,
+  struct io_pgtable_cfg *cfg)
+{
+   arm_lpae_iopte tmp;
+   bool dirty = false;
+
+   do {
+   tmp = cmpxchg64(ptep, *ptep, 0);
+   if ((tmp & ARM_LPAE_PTE_DBM) &&
+   !(tmp & ARM_LPAE_PTE_AP_RDONLY))
+   dirty = true;
+   } while (tmp);
+
+   if (!cfg->coherent_walk)
+   __arm_lpae_sync_pte(ptep, 1, cfg);
+
+   return dirty;
+}
+
 static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
   struct iommu_iotlb_gather *gather,
   unsigned long iova, size_t size, size_t pgcount,
-  int lvl, arm_lpae_iopte *ptep);
+  int lvl, arm_lpae_iopte *ptep,
+  struct iommu_dirty_bitmap *dirty);
 
 static void __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
phys_addr_t paddr, arm_lpae_iopte prot,
@@ -306,8 +326,13 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable 
*data,
size_t sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
 
tblp = ptep - ARM_LPAE_LVL_IDX(iova, lvl, data);
+
+   /*
+* No need for dirty bitmap as arm_lpae_init_pte() is
+* only called from __arm_lpae_map()
+*/
if (__arm_lpae_unmap(data, NULL, iova + i * sz, sz, 1,
-lvl, tblp) != sz) {
+lvl, tblp, NULL) != sz) {
WARN_ON(1);
return -EINVAL;
}
@@ -564,7 +589,8 @@ static size_t arm_lpae_split_blk_unmap(struct 
arm_lpae_io_pgtable *data,
   struct iommu_iotlb_gather *gather,
   unsigned long iova, size_t size,
   arm_lpae_iopte blk_pte, int lvl,
-  arm_lpae_iopte *ptep, size_t pgcount)
+  arm_lpae_iopte *ptep, size_t pgcount,
+  struct iommu_dirty_bitmap *dirty)
 {
struct io_pgtable_cfg *cfg = &data->iop.cfg;
arm_lpae_iopte pte, *tablep;
@@ -617,13 +643,15 @@ static size_t arm_lpae_split_blk_unma

[PATCH RFC 16/19] iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping

2022-04-28 Thread Joao Martins
From: Kunkun Jiang 

As nested mode is not upstreamed yet, we only aim to support dirty
log tracking for stage 1 with io-pgtable mappings (i.e. SVA mappings
are not supported). If HTTU is supported, we enable the HA/HD bits in
the SMMU CD and convey the ARM_HD quirk to io-pgtable.

We additionally filter out HD|HA if they are not supported. The CD.HD bit
is not particularly useful unless we toggle the DBM bit in the PTE
entries.

Co-developed-by: Keqian Zhu 
Signed-off-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
[joaomart:Convey HD|HA bits over to the context descriptor
 and update commit message]
Signed-off-by: Joao Martins 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 +++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
 include/linux/io-pgtable.h  |  1 +
 3 files changed, 15 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 1ca72fcca930..5f728f8f20a2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1077,10 +1077,18 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain 
*smmu_domain, int ssid,
 * this substream's traffic
 */
} else { /* (1) and (2) */
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+   u64 tcr = cd->tcr;
+
cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
cdptr[2] = 0;
cdptr[3] = cpu_to_le64(cd->mair);
 
+   if (!(smmu->features & ARM_SMMU_FEAT_HD))
+   tcr &= ~CTXDESC_CD_0_TCR_HD;
+   if (!(smmu->features & ARM_SMMU_FEAT_HA))
+   tcr &= ~CTXDESC_CD_0_TCR_HA;
+
/*
 * STE is live, and the SMMU might read dwords of this CD in any
 * order. Ensure that it observes valid values before reading
@@ -2100,6 +2108,7 @@ static int arm_smmu_domain_finalise_s1(struct 
arm_smmu_domain *smmu_domain,
  FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, tcr->orgn) |
  FIELD_PREP(CTXDESC_CD_0_TCR_SH0, tcr->sh) |
  FIELD_PREP(CTXDESC_CD_0_TCR_IPS, tcr->ips) |
+ CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |
  CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
cfg->cd.mair= pgtbl_cfg->arm_lpae_s1_cfg.mair;
 
@@ -2203,6 +2212,8 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain,
.iommu_dev  = smmu->dev,
};
 
+   if (smmu->features & ARM_SMMU_FEAT_HD)
+   pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;
if (smmu->features & ARM_SMMU_FEAT_BBML1)
pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
else if (smmu->features & ARM_SMMU_FEAT_BBML2)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index e15750be1d95..ff32242f2fdb 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -292,6 +292,9 @@
 #define CTXDESC_CD_0_TCR_IPS   GENMASK_ULL(34, 32)
 #define CTXDESC_CD_0_TCR_TBI0  (1ULL << 38)
 
+#define CTXDESC_CD_0_TCR_HA(1UL << 43)
+#define CTXDESC_CD_0_TCR_HD(1UL << 42)
+
 #define CTXDESC_CD_0_AA64  (1UL << 41)
 #define CTXDESC_CD_0_S (1UL << 44)
 #define CTXDESC_CD_0_R (1UL << 45)
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index d7626ca67dbf..a11902ae9cf1 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -87,6 +87,7 @@ struct io_pgtable_cfg {
#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA BIT(6)
#define IO_PGTABLE_QUIRK_ARM_BBML1  BIT(7)
#define IO_PGTABLE_QUIRK_ARM_BBML2  BIT(8)
+   #define IO_PGTABLE_QUIRK_ARM_HD BIT(9)
 
unsigned long   quirks;
unsigned long   pgsize_bitmap;
-- 
2.17.2



[PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support

2022-04-28 Thread Joao Martins
Similar to .read_and_clear_dirty(), use the page table
walker helper functions and set the DBM|RDONLY bits, thus
switching the IOPTE to writable-clean.

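A short standalone model of the per-PTE toggle (illustrative only, bit
positions as in io-pgtable-arm.c; genuinely read-only entries, which the real
walker rejects, are not handled here): enabling turns a writable entry into
writable-clean by setting DBM together with AP[2], disabling clears both
again, and entries already in the requested state are left untouched.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_DBM         (1ULL << 51)
#define PTE_AP_RDONLY   (1ULL << 7)
#define PTE_HW_DBM      (PTE_DBM | PTE_AP_RDONLY)

static uint64_t set_dirty_tracking(uint64_t pte, bool enable)
{
        if (enable == !!(pte & PTE_DBM))        /* already as requested */
                return pte;

        return enable ? (pte | PTE_HW_DBM) : (pte & ~PTE_HW_DBM);
}

int main(void)
{
        uint64_t pte = (1ULL << 10) | 0x701;    /* some writable leaf, made up */

        pte = set_dirty_tracking(pte, true);
        printf("tracking on:  0x%llx\n", (unsigned long long)pte);
        pte = set_dirty_tracking(pte, false);
        printf("tracking off: 0x%llx\n", (unsigned long long)pte);
        return 0;
}
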
Signed-off-by: Joao Martins 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29 
 drivers/iommu/io-pgtable-arm.c  | 52 +
 2 files changed, 81 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 232057d20197..1ca72fcca930 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2769,6 +2769,34 @@ static int arm_smmu_read_and_clear_dirty(struct 
iommu_domain *domain,
return ret;
 }
 
+static int arm_smmu_set_dirty_tracking(struct iommu_domain *domain,
+  unsigned long iova, size_t size,
+  struct iommu_iotlb_gather *iotlb_gather,
+  bool enabled)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+   int ret;
+
+   if (!(smmu->features & ARM_SMMU_FEAT_HD) ||
+   !(smmu->features & ARM_SMMU_FEAT_BBML2))
+   return -ENODEV;
+
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
+   return -EINVAL;
+
+   if (!ops || !ops->set_dirty_tracking) {
+   pr_err_once("io-pgtable don't support dirty tracking\n");
+   return -ENODEV;
+   }
+
+   ret = ops->set_dirty_tracking(ops, iova, size, enabled);
+   iommu_iotlb_gather_add_range(iotlb_gather, iova, size);
+
+   return ret;
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2898,6 +2926,7 @@ static struct iommu_ops arm_smmu_ops = {
.enable_nesting = arm_smmu_enable_nesting,
.free   = arm_smmu_domain_free,
.read_and_clear_dirty   = arm_smmu_read_and_clear_dirty,
+   .set_dirty_tracking_range = arm_smmu_set_dirty_tracking,
}
 };
 
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 3c99028d315a..361410aa836c 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -76,6 +76,7 @@
 #define ARM_LPAE_PTE_NSTABLE   (((arm_lpae_iopte)1) << 63)
 #define ARM_LPAE_PTE_XN(((arm_lpae_iopte)3) << 53)
 #define ARM_LPAE_PTE_DBM   (((arm_lpae_iopte)1) << 51)
+#define ARM_LPAE_PTE_DBM_BIT   51
 #define ARM_LPAE_PTE_AF(((arm_lpae_iopte)1) << 10)
 #define ARM_LPAE_PTE_SH_NS (((arm_lpae_iopte)0) << 8)
 #define ARM_LPAE_PTE_SH_OS (((arm_lpae_iopte)2) << 8)
@@ -836,6 +837,56 @@ static int arm_lpae_read_and_clear_dirty(struct 
io_pgtable_ops *ops,
 __arm_lpae_read_and_clear_dirty, dirty);
 }
 
+static int __arm_lpae_set_dirty_modifier(unsigned long iova, size_t size,
+arm_lpae_iopte *ptep, void *opaque)
+{
+   bool enabled = *((bool *) opaque);
+   arm_lpae_iopte pte;
+
+   pte = READ_ONCE(*ptep);
+   if (WARN_ON(!pte))
+   return -EINVAL;
+
+   if ((pte & ARM_LPAE_PTE_AP_WRITABLE) == ARM_LPAE_PTE_AP_RDONLY)
+   return -EINVAL;
+
+   if (!(enabled ^ !(pte & ARM_LPAE_PTE_DBM)))
+   return 0;
+
+   pte = enabled ? pte | (ARM_LPAE_PTE_DBM | ARM_LPAE_PTE_AP_RDONLY) :
+   pte & ~(ARM_LPAE_PTE_DBM | ARM_LPAE_PTE_AP_RDONLY);
+
+   WRITE_ONCE(*ptep, pte);
+   return 0;
+}
+
+
+static int arm_lpae_set_dirty_tracking(struct io_pgtable_ops *ops,
+  unsigned long iova, size_t size,
+  bool enabled)
+{
+   struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+   struct io_pgtable_cfg *cfg = &data->iop.cfg;
+   arm_lpae_iopte *ptep = data->pgd;
+   int lvl = data->start_level;
+   long iaext = (s64)iova >> cfg->ias;
+
+   if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
+   return -EINVAL;
+
+   if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
+   iaext = ~iaext;
+   if (WARN_ON(iaext))
+   return -EINVAL;
+
+   if (data->iop.fmt != ARM_64_LPAE_S1 &&
+   data->iop.fmt != ARM_32_LPAE_S1)
+   return -EINVAL;
+
+   return __arm_lpae_iopte_walk(data, iova, size, lvl, ptep,
+__arm_lpae_set_dirty_modifier, &enabled);
+}
+
 static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *c

[PATCH RFC 12/19] iommu/arm-smmu-v3: Add feature detection for HTTU

2022-04-28 Thread Joao Martins
From: Jean-Philippe Brucker 

If the SMMU supports it, probe support for Hardware Translation Table
Update (HTTU), which essentially enables hardware updates of the access
and dirty flags.

Probe and set the smmu::features bits for Hardware Dirty and Hardware
Access. This is in preparation for enabling them on the stage-1 format
context descriptors.

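As a side note, the IDR0.HTTU decode is small enough to model standalone; the
field values come from the SMMUv3 spec (0 = none, 1 = access flag, 2 = access
+ dirty), and dirty support implies access support. The feature flags here are
stand-ins for the driver's ARM_SMMU_FEAT_* bits.

#include <stdint.h>
#include <stdio.h>

#define IDR0_HTTU_SHIFT         6       /* IDR0[7:6] */
#define IDR0_HTTU_MASK          0x3
#define IDR0_HTTU_ACCESS        1
#define IDR0_HTTU_ACCESS_DIRTY  2

#define FEAT_HA                 (1U << 0)
#define FEAT_HD                 (1U << 1)

static unsigned int httu_features(uint32_t idr0)
{
        unsigned int features = 0;

        switch ((idr0 >> IDR0_HTTU_SHIFT) & IDR0_HTTU_MASK) {
        case IDR0_HTTU_ACCESS_DIRTY:
                features |= FEAT_HD;
                /* fall through: dirty implies access */
        case IDR0_HTTU_ACCESS:
                features |= FEAT_HA;
        }
        return features;
}

int main(void)
{
        printf("HTTU=2 -> 0x%x (expect HA|HD)\n",
               httu_features(2u << IDR0_HTTU_SHIFT));
        printf("HTTU=1 -> 0x%x (expect HA)\n",
               httu_features(1u << IDR0_HTTU_SHIFT));
        return 0;
}
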
Signed-off-by: Jean-Philippe Brucker 
[joaomart: Change commit message to reflect the underlying changes]
Signed-off-by: Joao Martins 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 32 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  5 
 2 files changed, 37 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index fd49282c03a3..14609ece4e33 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3424,6 +3424,28 @@ static int arm_smmu_device_reset(struct arm_smmu_device 
*smmu, bool bypass)
return 0;
 }
 
+static void arm_smmu_get_httu(struct arm_smmu_device *smmu, u32 reg)
+{
+   u32 fw_features = smmu->features & (ARM_SMMU_FEAT_HA | 
ARM_SMMU_FEAT_HD);
+   u32 features = 0;
+
+   switch (FIELD_GET(IDR0_HTTU, reg)) {
+   case IDR0_HTTU_ACCESS_DIRTY:
+   features |= ARM_SMMU_FEAT_HD;
+   fallthrough;
+   case IDR0_HTTU_ACCESS:
+   features |= ARM_SMMU_FEAT_HA;
+   }
+
+   if (smmu->dev->of_node)
+   smmu->features |= features;
+   else if (features != fw_features)
+   /* ACPI IORT sets the HTTU bits */
+   dev_warn(smmu->dev,
+"IDR0.HTTU overridden by FW configuration (0x%x)\n",
+fw_features);
+}
+
 static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 {
u32 reg;
@@ -3484,6 +3506,8 @@ static int arm_smmu_device_hw_probe(struct 
arm_smmu_device *smmu)
smmu->features |= ARM_SMMU_FEAT_E2H;
}
 
+   arm_smmu_get_httu(smmu, reg);
+
/*
 * The coherency feature as set by FW is used in preference to the ID
 * register, but warn on mismatch.
@@ -3669,6 +3693,14 @@ static int arm_smmu_device_acpi_probe(struct 
platform_device *pdev,
if (iort_smmu->flags & ACPI_IORT_SMMU_V3_COHACC_OVERRIDE)
smmu->features |= ARM_SMMU_FEAT_COHERENCY;
 
+   switch (FIELD_GET(ACPI_IORT_SMMU_V3_HTTU_OVERRIDE, iort_smmu->flags)) {
+   case IDR0_HTTU_ACCESS_DIRTY:
+   smmu->features |= ARM_SMMU_FEAT_HD;
+   fallthrough;
+   case IDR0_HTTU_ACCESS:
+   smmu->features |= ARM_SMMU_FEAT_HA;
+   }
+
return 0;
 }
 #else
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index cd48590ada30..1487a80fdf1b 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -33,6 +33,9 @@
 #define IDR0_ASID16(1 << 12)
 #define IDR0_ATS   (1 << 10)
 #define IDR0_HYP   (1 << 9)
+#define IDR0_HTTU  GENMASK(7, 6)
+#define IDR0_HTTU_ACCESS   1
+#define IDR0_HTTU_ACCESS_DIRTY 2
 #define IDR0_COHACC(1 << 4)
 #define IDR0_TTF   GENMASK(3, 2)
 #define IDR0_TTF_AARCH64   2
@@ -639,6 +642,8 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_BTM  (1 << 16)
 #define ARM_SMMU_FEAT_SVA  (1 << 17)
 #define ARM_SMMU_FEAT_E2H  (1 << 18)
+#define ARM_SMMU_FEAT_HA   (1 << 19)
+#define ARM_SMMU_FEAT_HD   (1 << 20)
u32 features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0)
-- 
2.17.2



[PATCH RFC 11/19] iommu/amd: Print access/dirty bits if supported

2022-04-28 Thread Joao Martins
Print the feature, much like other kernel-supported features.

One can still probe its actual hw support via sysfs, regardless
of what the kernel does.

Signed-off-by: Joao Martins 
---
 drivers/iommu/amd/init.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 27f2cf61d0c6..c410d127eb58 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -1936,6 +1936,10 @@ static void print_iommu_info(void)
 
if (iommu->features & FEATURE_GAM_VAPIC)
pr_cont(" GA_vAPIC");
+   if (iommu->features & FEATURE_HASUP)
+   pr_cont(" HASup");
+   if (iommu->features & FEATURE_HDSUP)
+   pr_cont(" HDSup");
 
pr_cont("\n");
}
-- 
2.17.2



[PATCH RFC 09/19] iommu/amd: Access/Dirty bit support in IOPTEs

2022-04-28 Thread Joao Martins
IOMMU advertises Access/Dirty bits if the extended feature register
reports it. Relevant AMD IOMMU SDM ref[0]
"1.3.8 Enhanced Support for Access and Dirty Bits"

To enable it, set the DTE flags in bits 7 and 8 to enable access, or
access+dirty. With that, the IOMMU starts marking the D and A flags on
every memory request or ATS translation request. It is up to the VMM to
steer whether to enable dirty tracking or not, rather than the IOMMU
wrongly deciding on its own. Relevant AMD IOMMU SDM ref [0], "Table 7.
Device Table Entry (DTE) Field Definitions", particularly the entry "HAD".

Toggling it on and off is relatively simple, as it amounts to setting
two bits in the DTE and flushing the device DTE cache.

To read out what has been dirtied, use the existing AMD io-pgtable support,
walking the page tables over each IOVA with fetch_pte(). The IOTLB flushing
is left to the caller (much like unmap), and iommu_dirty_bitmap_record() is
the one recording the page ranges to report (and later invalidate). This
allows the caller to batch the flush over a big span of IOVA space, without
the IOMMU driver having to decide when to flush.

Worthwhile sections from AMD IOMMU SDM:

"2.2.3.1 Host Access Support"
"2.2.3.2 Host Dirty Support"

For details on how the IOMMU hardware updates the dirty bit, and what it
expects from its subsequent clearing by the CPU, see:

"2.2.7.4 Updating Accessed and Dirty Bits in the Guest Address Tables"
"2.2.7.5 Clearing Accessed and Dirty Bits"

Quoting the SDM:

"The setting of accessed and dirty status bits in the page tables is
visible to both the CPU and the peripheral when sharing guest page
tables. The IOMMU interlocked operations to update A and D bits must be
64-bit operations and naturally aligned on a 64-bit boundary"

.. and for the IOMMU update sequence for the Dirty bit, it essentially states:

1. Decodes the read and write intent from the memory access.
2. If P=0 in the page descriptor, fail the access.
3. Compare the A & D bits in the descriptor with the read and write
intent in the request.
4. If the A or D bits need to be updated in the descriptor:
* Start atomic operation.
* Read the descriptor as a 64-bit access.
* If the descriptor no longer appears to require an update, release the
atomic lock with
no further action and continue to step 5.
* Calculate the new A & D bits.
* Write the descriptor as a 64-bit access.
* End atomic operation.
5. Continue to the next stage of translation or to the memory access.

Access/Dirty bit readout also needs to consider the non-default page
sizes (aka replicated PTEs, as mentioned by the manual), as AMD
supports all power-of-two page sizes (except 512G) even though the
underlying IOTLB mappings are restricted to the same ones as supported
by the CPU (4K, 2M, 1G). It makes one wonder whether AMD_IOMMU_PGSIZES
ought to avoid advertising non-default page sizes at all when creating
an UNMANAGED domain, or when dirty tracking is toggled on.

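To make the replicated-PTE remark concrete, a small user-space model (names
invented; bit 6 as the host-dirty bit, matching the definition added below):
a non-default page size is backed by a power-of-two run of identical PTEs, so
the page counts as dirty if any replica was dirtied, and every replica must be
cleared.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define IOMMU_PTE_HD    (1ULL << 6)     /* host-dirty bit in the IOPTE */

static bool read_and_clear_replicated(uint64_t *pte, unsigned long count)
{
        bool dirty = false;

        for (unsigned long i = 0; i < count; i++) {
                uint64_t old = __atomic_fetch_and(&pte[i], ~IOMMU_PTE_HD,
                                                  __ATOMIC_SEQ_CST);

                dirty |= !!(old & IOMMU_PTE_HD);
        }
        return dirty;
}

int main(void)
{
        uint64_t ptes[8] = { 0 };       /* a 32K "page" as eight 4K replicas */

        ptes[5] |= IOMMU_PTE_HD;        /* hardware dirtied one replica */
        printf("32K page dirty?    %d\n", read_and_clear_replicated(ptes, 8));
        printf("second pass dirty? %d\n", read_and_clear_replicated(ptes, 8));
        return 0;
}
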
Signed-off-by: Joao Martins 
---
 drivers/iommu/amd/amd_iommu.h   |  1 +
 drivers/iommu/amd/amd_iommu_types.h | 11 +
 drivers/iommu/amd/init.c|  8 ++-
 drivers/iommu/amd/io_pgtable.c  | 56 +
 drivers/iommu/amd/iommu.c   | 77 +
 5 files changed, 152 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index 1ab31074f5b3..2f16ad8f7514 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -34,6 +34,7 @@ extern int amd_iommu_reenable(int);
 extern int amd_iommu_enable_faulting(void);
 extern int amd_iommu_guest_ir;
 extern enum io_pgtable_fmt amd_iommu_pgtable;
+extern bool amd_iommu_had_support;
 
 /* IOMMUv2 specific functions */
 struct iommu_domain;
diff --git a/drivers/iommu/amd/amd_iommu_types.h 
b/drivers/iommu/amd/amd_iommu_types.h
index 47108ed44fbb..c1eba8fce4bb 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -93,7 +93,9 @@
 #define FEATURE_HE (1ULL<<8)
 #define FEATURE_PC (1ULL<<9)
 #define FEATURE_GAM_VAPIC  (1ULL<<21)
+#define FEATURE_HASUP  (1ULL<<49)
 #define FEATURE_EPHSUP (1ULL<<50)
+#define FEATURE_HDSUP  (1ULL<<52)
 #define FEATURE_SNP(1ULL<<63)
 
 #define FEATURE_PASID_SHIFT32
@@ -197,6 +199,7 @@
 /* macros and definitions for device table entries */
 #define DEV_ENTRY_VALID 0x00
 #define DEV_ENTRY_TRANSLATION   0x01
+#define DEV_ENTRY_HAD   0x07
 #define DEV_ENTRY_PPR   0x34
 #define DEV_ENTRY_IR0x3d
 #define DEV_ENTRY_IW0x3e
@@ -350,10 +353,16 @@
 #define PTE_LEVEL_PAGE_SIZE(level) \
(1ULL << (12 + (9 * (level
 
+/*
+ * The IOPTE dirty bit
+ */
+#define IOMMU_PTE_HD_BIT (6)
+
 /*
  * Bit value definition for I/O PTE fields
  */
 #define IOMMU_PTE_PR (1ULL << 0)
+#define IOMM

[PATCH RFC 10/19] iommu/amd: Add unmap_read_dirty() support

2022-04-28 Thread Joao Martins
The AMD implementation of unmap_read_dirty() is pretty simple, as it
mostly reuses the unmap code with the extra addition of marshalling
the dirty bit into the bitmap as it walks the to-be-unmapped
IOPTEs.

Extra care is taken, though, to switch over to cmpxchg as opposed
to a non-serialized store to the PTE, testing the dirty bit on the
observed value until the cmpxchg that clears the PTE to 0 succeeds.

Signed-off-by: Joao Martins 
---
 drivers/iommu/amd/io_pgtable.c | 44 +-
 drivers/iommu/amd/iommu.c  | 22 +
 2 files changed, 60 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
index 8325ef193093..1868c3b58e6d 100644
--- a/drivers/iommu/amd/io_pgtable.c
+++ b/drivers/iommu/amd/io_pgtable.c
@@ -355,6 +355,16 @@ static void free_clear_pte(u64 *pte, u64 pteval, struct 
list_head *freelist)
free_sub_pt(pt, mode, freelist);
 }
 
+static bool free_pte_dirty(u64 *pte, u64 pteval)
+{
+   bool dirty = false;
+
+   while (IOMMU_PTE_DIRTY(cmpxchg64(pte, pteval, 0)))
+   dirty = true;
+
+   return dirty;
+}
+
 /*
  * Generic mapping functions. It maps a physical address into a DMA
  * address space. It allocates the page table pages if necessary.
@@ -428,10 +438,11 @@ static int iommu_v1_map_page(struct io_pgtable_ops *ops, 
unsigned long iova,
return ret;
 }
 
-static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
- unsigned long iova,
- size_t size,
- struct iommu_iotlb_gather *gather)
+static unsigned long __iommu_v1_unmap_page(struct io_pgtable_ops *ops,
+  unsigned long iova,
+  size_t size,
+  struct iommu_iotlb_gather *gather,
+  struct iommu_dirty_bitmap *dirty)
 {
struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
unsigned long long unmapped;
@@ -445,11 +456,15 @@ static unsigned long iommu_v1_unmap_page(struct 
io_pgtable_ops *ops,
while (unmapped < size) {
pte = fetch_pte(pgtable, iova, &unmap_size);
if (pte) {
-   int i, count;
+   unsigned long i, count;
+   bool pte_dirty = false;
 
count = PAGE_SIZE_PTE_COUNT(unmap_size);
for (i = 0; i < count; i++)
-   pte[i] = 0ULL;
+   pte_dirty |= free_pte_dirty(&pte[i], pte[i]);
+
+   if (unlikely(pte_dirty && dirty))
+   iommu_dirty_bitmap_record(dirty, iova, 
unmap_size);
}
 
iova = (iova & ~(unmap_size - 1)) + unmap_size;
@@ -461,6 +476,22 @@ static unsigned long iommu_v1_unmap_page(struct 
io_pgtable_ops *ops,
return unmapped;
 }
 
+static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
+unsigned long iova,
+size_t size,
+struct iommu_iotlb_gather *gather)
+{
+   return __iommu_v1_unmap_page(ops, iova, size, gather, NULL);
+}
+
+static unsigned long iommu_v1_unmap_page_read_dirty(struct io_pgtable_ops *ops,
+   unsigned long iova, size_t size,
+   struct iommu_iotlb_gather *gather,
+   struct iommu_dirty_bitmap *dirty)
+{
+   return __iommu_v1_unmap_page(ops, iova, size, gather, dirty);
+}
+
 static phys_addr_t iommu_v1_iova_to_phys(struct io_pgtable_ops *ops, unsigned 
long iova)
 {
struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
@@ -575,6 +606,7 @@ static struct io_pgtable *v1_alloc_pgtable(struct 
io_pgtable_cfg *cfg, void *coo
pgtable->iop.ops.unmap= iommu_v1_unmap_page;
pgtable->iop.ops.iova_to_phys = iommu_v1_iova_to_phys;
pgtable->iop.ops.read_and_clear_dirty = iommu_v1_read_and_clear_dirty;
+   pgtable->iop.ops.unmap_read_dirty = iommu_v1_unmap_page_read_dirty;
 
return &pgtable->iop;
 }
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 0a86392b2367..a8fcb6e9a684 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2144,6 +2144,27 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, 
unsigned long iova,
return r;
 }
 
+static size_t amd_iommu_unmap_read_dirty(struct iommu_domain *dom,
+unsigned long iova, size_t page_size,
+struct iommu_iotlb_gather *gather,
+struct iommu_dirty_bitmap *dirty)
+{
+   struct protection_domain *domain = to_pdo

[PATCH RFC 08/19] iommufd: Add a test for dirty tracking ioctls

2022-04-28 Thread Joao Martins
Add a new test ioctl for simulating dirty IOVAs
in the mock domain, and implement the mock iommu domain ops
that support dirty tracking.

The selftest exercises the usual main workflow of:

1) Setting/Clearing dirty tracking from the iommu domain
2) Read and clear dirty IOPTEs
3) Unmap and read dirty back

Signed-off-by: Joao Martins 
---
 drivers/iommu/iommufd/iommufd_test.h|   9 ++
 drivers/iommu/iommufd/selftest.c| 137 +++-
 tools/testing/selftests/iommu/Makefile  |   1 +
 tools/testing/selftests/iommu/iommufd.c | 135 +++
 4 files changed, 279 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommufd/iommufd_test.h 
b/drivers/iommu/iommufd/iommufd_test.h
index d22ef484af1a..90dafa513078 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -14,6 +14,7 @@ enum {
IOMMU_TEST_OP_MD_CHECK_REFS,
IOMMU_TEST_OP_ACCESS_PAGES,
IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+   IOMMU_TEST_OP_DIRTY,
 };
 
 enum {
@@ -57,6 +58,14 @@ struct iommu_test_cmd {
struct {
__u32 limit;
} memory_limit;
+   struct {
+   __u32 flags;
+   __aligned_u64 iova;
+   __aligned_u64 length;
+   __aligned_u64 page_size;
+   __aligned_u64 uptr;
+   __aligned_u64 out_nr_dirty;
+   } dirty;
};
__u32 last;
 };
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index a665719b493e..b02309722436 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -13,6 +13,7 @@
 size_t iommufd_test_memory_limit = 65536;
 
 enum {
+   MOCK_DIRTY_TRACK = 1,
MOCK_IO_PAGE_SIZE = PAGE_SIZE / 2,
 
/*
@@ -25,9 +26,11 @@ enum {
_MOCK_PFN_START = MOCK_PFN_MASK + 1,
MOCK_PFN_START_IOVA = _MOCK_PFN_START,
MOCK_PFN_LAST_IOVA = _MOCK_PFN_START,
+   MOCK_PFN_DIRTY_IOVA = _MOCK_PFN_START << 1,
 };
 
 struct mock_iommu_domain {
+   unsigned long flags;
struct iommu_domain domain;
struct xarray pfns;
 };
@@ -133,7 +136,7 @@ static size_t mock_domain_unmap_pages(struct iommu_domain 
*domain,
 
for (cur = 0; cur != pgsize; cur += MOCK_IO_PAGE_SIZE) {
ent = xa_erase(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
-   WARN_ON(!ent);
+
/*
 * iommufd generates unmaps that must be a strict
 * superset of the map's performend So every starting
@@ -143,12 +146,12 @@ static size_t mock_domain_unmap_pages(struct iommu_domain 
*domain,
 * passed to map_pages
 */
if (first) {
-   WARN_ON(!(xa_to_value(ent) &
+   WARN_ON(ent && !(xa_to_value(ent) &
  MOCK_PFN_START_IOVA));
first = false;
}
if (pgcount == 1 && cur + MOCK_IO_PAGE_SIZE == pgsize)
-   WARN_ON(!(xa_to_value(ent) &
+   WARN_ON(ent && !(xa_to_value(ent) &
  MOCK_PFN_LAST_IOVA));
 
iova += MOCK_IO_PAGE_SIZE;
@@ -171,6 +174,75 @@ static phys_addr_t mock_domain_iova_to_phys(struct 
iommu_domain *domain,
return (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE;
 }
 
+static int mock_domain_set_dirty_tracking(struct iommu_domain *domain,
+ bool enable)
+{
+   struct mock_iommu_domain *mock =
+   container_of(domain, struct mock_iommu_domain, domain);
+   unsigned long flags = mock->flags;
+
+   /* No change? */
+   if (!(enable ^ !!(flags & MOCK_DIRTY_TRACK)))
+   return -EINVAL;
+
+   flags = (enable ?
+flags | MOCK_DIRTY_TRACK : flags & ~MOCK_DIRTY_TRACK);
+
+   mock->flags = flags;
+   return 0;
+}
+
+static int mock_domain_read_and_clear_dirty(struct iommu_domain *domain,
+   unsigned long iova, size_t size,
+   struct iommu_dirty_bitmap *dirty)
+{
+   struct mock_iommu_domain *mock =
+   container_of(domain, struct mock_iommu_domain, domain);
+   unsigned long i, max = size / MOCK_IO_PAGE_SIZE;
+   void *ent, *old;
+
+   if (!(mock->flags & MOCK_DIRTY_TRACK))
+   return -EINVAL;
+
+   for (i = 0; i < max; i++) {
+   unsigned long cur = iova + i * MOCK_IO_PAGE_SIZE;
+
+   ent = xa_load(&mock->pfns, cur / MOCK_IO_PAGE_

[PATCH RFC 07/19] iommufd/vfio-compat: Dirty tracking IOCTLs compatibility

2022-04-28 Thread Joao Martins
Add the correspondent APIs for performing VFIO dirty tracking,
particularly VFIO_IOMMU_DIRTY_PAGES ioctl subcmds:
* VFIO_IOMMU_DIRTY_PAGES_FLAG_START: Start dirty tracking and allocates
 the area @dirty_bitmap
* VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP: Stop dirty tracking and frees
the area @dirty_bitmap
* VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP: Fetch dirty bitmap while dirty
tracking is active.

Advertise VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION, where the reported page
size is the domain-configured one (the same as iopt::iova_alignment) and
the maximum dirty bitmap size is the same as VFIO's. Compared to the VFIO
type1 iommu, perpetual dirtying is not implemented and userspace gets
-EOPNOTSUPP, which today's userspace handles.

Move the iommufd_get_pagesizes() definition before unmap so that the
iommufd_vfio_unmap_dma() dirty support can validate the user bitmap page
size against the IOPT page size.

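Since the bitmap checks above are easy to get wrong, here is a standalone
model of the validation (illustrative only, not the exact kernel helpers or
limits): the bitmap carries one bit per page of the range at the IOMMU's
smallest supported page size, rounded up to whole 64-bit words, and is bounded
before it is dereferenced.

#include <stdint.h>
#include <stdio.h>

#define BITMAP_SIZE_MAX         (256UL * 1024 * 1024)   /* example cap only */

static unsigned long bitmap_bytes(unsigned long npages)
{
        return ((npages + 63) / 64) * 8;        /* bits -> whole u64 words */
}

static int check_dirty_bitmap(uint64_t iova, uint64_t length,
                              uint64_t bitmap_pgsize, uint64_t bitmap_size,
                              uint64_t iommu_pgsize)
{
        unsigned long npages;

        if (bitmap_pgsize != iommu_pgsize)      /* only smallest pgsize */
                return -1;
        if (!length || iova % iommu_pgsize || length % iommu_pgsize)
                return -1;
        npages = length / bitmap_pgsize;
        if (!npages || bitmap_size > BITMAP_SIZE_MAX ||
            bitmap_size < bitmap_bytes(npages))
                return -1;
        return 0;
}

int main(void)
{
        /* a 2M unmap at 4K granularity needs 512 bits, i.e. 64 bytes */
        printf("ok?        %d\n",
               check_dirty_bitmap(0x100000, 0x200000, 4096, 64, 4096));
        printf("too small? %d\n",
               check_dirty_bitmap(0x100000, 0x200000, 4096, 32, 4096));
        return 0;
}
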
Signed-off-by: Joao Martins 
---
 drivers/iommu/iommufd/vfio_compat.c | 221 ++--
 1 file changed, 209 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/iommufd/vfio_compat.c 
b/drivers/iommu/iommufd/vfio_compat.c
index dbe39404a105..2802f49cc10d 100644
--- a/drivers/iommu/iommufd/vfio_compat.c
+++ b/drivers/iommu/iommufd/vfio_compat.c
@@ -56,6 +56,16 @@ create_compat_ioas(struct iommufd_ctx *ictx)
return ioas;
 }
 
+static u64 iommufd_get_pagesizes(struct iommufd_ioas *ioas)
+{
+   /* FIXME: See vfio_update_pgsize_bitmap(), for compat this should return
+* the high bits too, and we need to decide if we should report that
+* iommufd supports less than PAGE_SIZE alignment or stick to strict
+* compatibility. qemu only cares about the first set bit.
+*/
+   return ioas->iopt.iova_alignment;
+}
+
 int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd)
 {
struct iommu_vfio_ioas *cmd = ucmd->cmd;
@@ -130,9 +140,14 @@ static int iommufd_vfio_unmap_dma(struct iommufd_ctx 
*ictx, unsigned int cmd,
  void __user *arg)
 {
size_t minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
-   u32 supported_flags = VFIO_DMA_UNMAP_FLAG_ALL;
+   u32 supported_flags = VFIO_DMA_UNMAP_FLAG_ALL |
+   VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
+   struct iommufd_dirty_data dirty, *dirtyp = NULL;
struct vfio_iommu_type1_dma_unmap unmap;
+   struct vfio_bitmap bitmap;
struct iommufd_ioas *ioas;
+   unsigned long pgshift;
+   size_t pgsize;
int rc;
 
if (copy_from_user(&unmap, arg, minsz))
@@ -141,14 +156,53 @@ static int iommufd_vfio_unmap_dma(struct iommufd_ctx 
*ictx, unsigned int cmd,
if (unmap.argsz < minsz || unmap.flags & ~supported_flags)
return -EINVAL;
 
+   if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
+   unsigned long npages;
+
+   if (copy_from_user(&bitmap,
+  (void __user *)(arg + minsz),
+  sizeof(bitmap)))
+   return -EFAULT;
+
+   if (!access_ok((void __user *)bitmap.data, bitmap.size))
+   return -EINVAL;
+
+   pgshift = __ffs(bitmap.pgsize);
+   npages = unmap.size >> pgshift;
+
+   if (!npages || !bitmap.size ||
+   (bitmap.size > DIRTY_BITMAP_SIZE_MAX) ||
+   (bitmap.size < dirty_bitmap_bytes(npages)))
+   return -EINVAL;
+
+   dirty.iova = unmap.iova;
+   dirty.length = unmap.size;
+   dirty.data = bitmap.data;
+   dirty.page_size = 1 << pgshift;
+   dirtyp = &dirty;
+   }
+
ioas = get_compat_ioas(ictx);
if (IS_ERR(ioas))
return PTR_ERR(ioas);
 
+   pgshift = __ffs(iommufd_get_pagesizes(ioas)),
+   pgsize = (size_t)1 << pgshift;
+
+   /* When dirty tracking is enabled, allow only min supported pgsize */
+   if ((unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
+   (bitmap.pgsize != pgsize)) {
+   rc = -EINVAL;
+   goto out_put;
+   }
+
if (unmap.flags & VFIO_DMA_UNMAP_FLAG_ALL)
rc = iopt_unmap_all(&ioas->iopt);
else
-   rc = iopt_unmap_iova(&ioas->iopt, unmap.iova, unmap.size, NULL);
+   rc = iopt_unmap_iova(&ioas->iopt, unmap.iova, unmap.size,
+dirtyp);
+
+out_put:
iommufd_put_object(&ioas->obj);
return rc;
 }
@@ -222,16 +276,6 @@ static int iommufd_vfio_set_iommu(struct iommufd_ctx 
*ictx, unsigned long type)
return 0;
 }
 
-static u64 iommufd_get_pagesizes(struct iommufd_ioas *ioas)
-{
-   /* FIXME: See vfio_update_pgsize_bitmap(), for compat this should 

[PATCH RFC 04/19] iommu: Add an unmap API that returns dirtied IOPTEs

2022-04-28 Thread Joao Martins
Today, the dirty state is lost on unmap and the page wouldn't be migrated
to the destination, potentially leading the guest into error.

Add an unmap API that reads the dirty bit and sets it in the
user-passed bitmap. This unmap iommu API tackles a potentially
racy update to the dirty bit *when* DMA is done on an IOVA that is
being unmapped at the same time.

The new unmap_read_dirty/unmap_pages_read_dirty ops do not replace
unmap/unmap_pages, but are used only when explicitly called with dirty
bitmap data passed in.

It could be said that the guest is buggy, and rather than a special unmap
path tackling the theoretical race ... it would suffice to fetch the
dirty bits (with GET_DIRTY_IOVA) and then unmap the IOVA.

Signed-off-by: Joao Martins 
---
 drivers/iommu/iommu.c  | 43 +++---
 include/linux/io-pgtable.h | 10 +
 include/linux/iommu.h  | 12 +++
 3 files changed, 58 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d18b9ddbcce4..cc04263709ee 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2289,12 +2289,25 @@ EXPORT_SYMBOL_GPL(iommu_map_atomic);
 
 static size_t __iommu_unmap_pages(struct iommu_domain *domain,
  unsigned long iova, size_t size,
- struct iommu_iotlb_gather *iotlb_gather)
+ struct iommu_iotlb_gather *iotlb_gather,
+ struct iommu_dirty_bitmap *dirty)
 {
const struct iommu_domain_ops *ops = domain->ops;
size_t pgsize, count;
 
pgsize = iommu_pgsize(domain, iova, iova, size, &count);
+
+   if (dirty) {
+   if (!ops->unmap_read_dirty && !ops->unmap_pages_read_dirty)
+   return 0;
+
+   return ops->unmap_pages_read_dirty ?
+  ops->unmap_pages_read_dirty(domain, iova, pgsize,
+  count, iotlb_gather, dirty) :
+  ops->unmap_read_dirty(domain, iova, pgsize,
+iotlb_gather, dirty);
+   }
+
return ops->unmap_pages ?
   ops->unmap_pages(domain, iova, pgsize, count, iotlb_gather) :
   ops->unmap(domain, iova, pgsize, iotlb_gather);
@@ -2302,7 +2315,8 @@ static size_t __iommu_unmap_pages(struct iommu_domain 
*domain,
 
 static size_t __iommu_unmap(struct iommu_domain *domain,
unsigned long iova, size_t size,
-   struct iommu_iotlb_gather *iotlb_gather)
+   struct iommu_iotlb_gather *iotlb_gather,
+   struct iommu_dirty_bitmap *dirty)
 {
const struct iommu_domain_ops *ops = domain->ops;
size_t unmapped_page, unmapped = 0;
@@ -2337,9 +2351,8 @@ static size_t __iommu_unmap(struct iommu_domain *domain,
 * or we hit an area that isn't mapped.
 */
while (unmapped < size) {
-   unmapped_page = __iommu_unmap_pages(domain, iova,
-   size - unmapped,
-   iotlb_gather);
+   unmapped_page = __iommu_unmap_pages(domain, iova, size - 
unmapped,
+   iotlb_gather, dirty);
if (!unmapped_page)
break;
 
@@ -2361,18 +2374,34 @@ size_t iommu_unmap(struct iommu_domain *domain,
size_t ret;
 
iommu_iotlb_gather_init(&iotlb_gather);
-   ret = __iommu_unmap(domain, iova, size, &iotlb_gather);
+   ret = __iommu_unmap(domain, iova, size, &iotlb_gather, NULL);
iommu_iotlb_sync(domain, &iotlb_gather);
 
return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_unmap);
 
+size_t iommu_unmap_read_dirty(struct iommu_domain *domain,
+ unsigned long iova, size_t size,
+ struct iommu_dirty_bitmap *dirty)
+{
+   struct iommu_iotlb_gather iotlb_gather;
+   size_t ret;
+
+   iommu_iotlb_gather_init(&iotlb_gather);
+   ret = __iommu_unmap(domain, iova, size, &iotlb_gather, dirty);
+   iommu_iotlb_sync(domain, &iotlb_gather);
+
+   return ret;
+
+}
+EXPORT_SYMBOL_GPL(iommu_unmap_read_dirty);
+
 size_t iommu_unmap_fast(struct iommu_domain *domain,
unsigned long iova, size_t size,
struct iommu_iotlb_gather *iotlb_gather)
 {
-   return __iommu_unmap(domain, iova, size, iotlb_gather);
+   return __iommu_unmap(domain, iova, size, iotlb_gather, NULL);
 }
 EXPORT_SYMBOL_GPL(iommu_unmap_fast);
 
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index 82b39925c21f..c2ebfe037f5d 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -171,6 +171,16 @@ str

[PATCH RFC 06/19] iommufd: Dirty tracking IOCTLs for the hw_pagetable

2022-04-28 Thread Joao Martins
Every IOMMU driver should be able to implement the needed
iommu domain ops to perform dirty tracking.

Connect a hw_pagetable to the IOMMU core dirty tracking ops.
It exposes all of the functionality for the UAPI:

- Enable/Disable dirty tracking on an IOMMU domain (hw_pagetable id)
- Read the dirtied IOVAs (which clear IOMMU domain bitmap under the hood)
- Unmap and get the dirtied IOVAs

In doing so the previously internal iommufd_dirty_data structure is
moved over as the UAPI intermediate structure for representing iommufd
dirty bitmaps.

Contrary to past incarnations, the IOVA range to be scanned or unmapped
is tied to the bitmap size, which puts the heavy lifting on the
application to make sure it passes a precisely sized bitmap, as opposed
to allowing base_iova != iova; this simplifies things further.

Signed-off-by: Joao Martins 
---
 drivers/iommu/iommufd/hw_pagetable.c| 79 +
 drivers/iommu/iommufd/ioas.c| 33 +++
 drivers/iommu/iommufd/iommufd_private.h | 22 ---
 drivers/iommu/iommufd/main.c|  9 +++
 include/uapi/linux/iommufd.h| 78 
 5 files changed, 214 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c 
b/drivers/iommu/iommufd/hw_pagetable.c
index bafd7d07918b..943bcc3898a4 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -3,6 +3,7 @@
  * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
  */
 #include 
+#include 
 
 #include "iommufd_private.h"
 
@@ -140,3 +141,81 @@ void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
}
iommufd_object_destroy_user(ictx, &hwpt->obj);
 }
+
+int iommufd_hwpt_set_dirty(struct iommufd_ucmd *ucmd)
+{
+   struct iommu_hwpt_set_dirty *cmd = ucmd->cmd;
+   struct iommufd_hw_pagetable *hwpt;
+   struct iommufd_ioas *ioas;
+   int rc = -EOPNOTSUPP;
+   bool enable;
+
+   hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
+   if (IS_ERR(hwpt))
+   return PTR_ERR(hwpt);
+
+   ioas = hwpt->ioas;
+   enable = cmd->flags & IOMMU_DIRTY_TRACKING_ENABLED;
+
+   rc = iopt_set_dirty_tracking(&ioas->iopt, hwpt->domain, enable);
+
+   iommufd_put_object(&hwpt->obj);
+   return rc;
+}
+
+int iommufd_check_iova_range(struct iommufd_ioas *ioas,
+struct iommufd_dirty_data *bitmap)
+{
+   unsigned long pgshift, npages;
+   size_t iommu_pgsize;
+   int rc = -EINVAL;
+   u64 bitmap_size;
+
+   pgshift = __ffs(bitmap->page_size);
+   npages = bitmap->length >> pgshift;
+   bitmap_size = dirty_bitmap_bytes(npages);
+
+   if (!npages || (bitmap_size > DIRTY_BITMAP_SIZE_MAX))
+   return rc;
+
+   if (!access_ok((void __user *) bitmap->data, bitmap_size))
+   return rc;
+
+   iommu_pgsize = 1 << __ffs(ioas->iopt.iova_alignment);
+
+   /* allow only smallest supported pgsize */
+   if (bitmap->page_size != iommu_pgsize)
+   return rc;
+
+   if (bitmap->iova & (iommu_pgsize - 1))
+   return rc;
+
+   if (!bitmap->length || bitmap->length & (iommu_pgsize - 1))
+   return rc;
+
+   return 0;
+}
+
+int iommufd_hwpt_get_dirty_iova(struct iommufd_ucmd *ucmd)
+{
+   struct iommu_hwpt_get_dirty_iova *cmd = ucmd->cmd;
+   struct iommufd_hw_pagetable *hwpt;
+   struct iommufd_ioas *ioas;
+   int rc = -EOPNOTSUPP;
+
+   hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
+   if (IS_ERR(hwpt))
+   return PTR_ERR(hwpt);
+
+   ioas = hwpt->ioas;
+   rc = iommufd_check_iova_range(ioas, &cmd->bitmap);
+   if (rc)
+   goto out_put;
+
+   rc = iopt_read_and_clear_dirty_data(&ioas->iopt, hwpt->domain,
+   &cmd->bitmap);
+
+out_put:
+   iommufd_put_object(&hwpt->obj);
+   return rc;
+}
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 19d6591aa005..50bef46bc0bb 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -243,6 +243,7 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
rc = -EOVERFLOW;
goto out_put;
}
+
rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length, NULL);
}
 
@@ -250,3 +251,35 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
iommufd_put_object(&ioas->obj);
return rc;
 }
+
+int iommufd_ioas_unmap_dirty(struct iommufd_ucmd *ucmd)
+{
+   struct iommu_ioas_unmap_dirty *cmd = ucmd->cmd;
+   struct iommufd_dirty_data *bitmap;
+   struct iommufd_ioas *ioas;
+   int rc;
+
+   ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+   if (IS_ERR(io

[PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking

2022-04-28 Thread Joao Martins
Add to the iommu domain operations a set of callbacks to
perform dirty tracking, particularly to start and stop
tracking and finally to test and clear the dirty data.

Drivers are expected to dynamically change their hw protection
domain bits to toggle the tracking and flush some form of
control state structure that stands in the IOVA translation
path.

For reading and clearing dirty data, on all IOMMUs a transition
of any of the PTE access bits (Access, Dirty) implies flushing the
IOTLB to invalidate any stale data that would affect whether or not
the IOMMU should update said PTEs. The iommu core APIs
introduce a new structure for storing the dirties; vendor
IOMMUs implementing .read_and_clear_dirty() just use
iommu_dirty_bitmap_record() to set the memory storing the dirties.
The underlying tracking/iteration of user bitmap memory is instead
done by iommufd, which takes care of initializing the dirty bitmap
*prior* to passing it to the IOMMU domain op.

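A user-space model of that contract may help (field and function names are
invented): the core hands the driver a bitmap pre-initialised with a base IOVA
and a page size, the driver only reports dirty IOVA ranges via a record call,
and which bit a range lands on stays the core's business.

#include <stdint.h>
#include <stdio.h>

struct dirty_bitmap {
        uint64_t base_iova;     /* IOVA covered by bit 0 */
        unsigned int pgshift;   /* granularity of one bit */
        uint64_t *bitmap;       /* storage provided by the caller */
};

static void dirty_bitmap_record(struct dirty_bitmap *d,
                                uint64_t iova, uint64_t length)
{
        uint64_t start = (iova - d->base_iova) >> d->pgshift;
        uint64_t nbits = length >> d->pgshift;

        for (uint64_t i = start; i < start + nbits; i++)
                d->bitmap[i / 64] |= 1ULL << (i % 64);
}

int main(void)
{
        uint64_t storage[2] = { 0 };
        struct dirty_bitmap d = {
                .base_iova = 0x100000, .pgshift = 12, .bitmap = storage,
        };

        /* the driver found a dirty 16K run at IOVA 0x104000 */
        dirty_bitmap_record(&d, 0x104000, 0x4000);
        printf("bitmap[0] = 0x%llx\n", (unsigned long long)storage[0]);
        return 0;
}
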
So far this covers the currently and to-be-supported IOMMUs with dirty
tracking support, particularly because the tracking is part of the
first-stage tables and part of address translation. Below it is
mentioned how hardware deals with the hardware protection
domain control bits, to justify the added iommu core APIs. The
vendor IOMMU implementations also explain in more detail
the dirty bit usage/clearing in the IOPTEs.

* x86 AMD:

The same applies for AMD, particularly for the Device Table,
followed by flushing the Device IOTLB. On AMD[1],
section "2.2.1 Updating Shared Tables", e.g.

> Each table can also have its contents cached by the IOMMU or
peripheral IOTLBs. Therefore, after
updating a table entry that can be cached, system software must
send the IOMMU an appropriate
invalidate command. Information in the peripheral IOTLBs must
also be invalidated.

There's no mention of particular bits that are cached or
not but fetching a dev entry is part of address translation
as also depicted, so invalidate the device table to make
sure the next translations fetch a DTE entry with the HD bits set.

* x86 Intel (rev3.0+):

Likewise[2] set the SSADE bit in the scalable-entry second stage table
to enable Access/Dirty bits in the second stage page table. See manual,
particularly on "6.2.3.1 Scalable-Mode PASID-Table Entry Programming
Considerations"

> When modifying root-entries, scalable-mode root-entries,
context-entries, or scalable-mode context
entries:
> Software must serially invalidate the context-cache,
PASID-cache (if applicable), and the IOTLB.  The serialization is
required since hardware may utilize information from the
context-caches (e.g., Domain-ID) to tag new entries inserted to
the PASID-cache and IOTLB for processing in-flight requests.
Section 6.5 describe the invalidation operations.

And also "Table 23. Guidance to Software for Invalidations" in
"6.5.3.3 Guidance to Software for Invalidations" explicitly mentions

> SSADE transition from 0 to 1 in a scalable-mode PASID-table
entry with PGTT value of Second-stage or Nested

* ARM SMMUV3.2:

SMMUv3.2 needs the dirty bit toggled in the CD (or S2CD)
descriptor, and the IOMMU device IOTLB flushed/invalidated.

Reference[0]: SMMU spec, "5.4.1 CD notes",

> The following CD fields are permitted to be cached as part of a
translation or TLB entry, and alteration requires
invalidation of any TLB entry that might have cached these
fields, in addition to CD structure cache invalidation:

...
HA, HD
...

The ARM SMMUv3 case is a tad different than its x86
counterparts, though. Rather than changing *only* the IOMMU domain device
entry to enable dirty tracking (and having a dedicated bit for dirtiness
in the IOPTE), ARM instead uses a dirty-bit modifier which is separately
enabled and changes the *existing* meaning of the access bits (for ro/rw),
to the point that marking the access bits read-only with the
dirty-bit modifier enabled doesn't trigger a permission IO page fault.

In practice this means that changing the iommu context isn't enough,
and is in fact mostly useless IIUC (and can always be enabled). Dirtying
is only really enabled when the DBM pte bit is enabled (with the
CD.HD bit as a prereq).

To capture this h/w construct an iommu core API is added which enables
dirty tracking on an IOVA range rather than a device/context entry.
iommufd picks one or the other, and IOMMUFD core will favour
device-context op followed by IOVA-range alternative.

[0] https://developer.arm.com/documentation/ihi0070/latest
[1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
[2] https://cdrdv2.intel.com/v1/dl/getContent/671081

Signed-off-by: Joao Martins 
---
 drivers/iommu/iommu.c  | 28 
 include/linux/io-pgtable.h |  6 +
 include/linux/iommu.h  | 52 ++
 3 files changed, 86 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu

[PATCH RFC 00/19] IOMMUFD Dirty Tracking

2022-04-28 Thread Joao Martins
lear dirty tracking, and implicitly clearing
dirty bits on the readout. Given the lack of hardware and the difficulty
of getting this into an emulated SMMUv3 (given the dependency on the PE HTTU
and BBML2, IIUC), this is only compile-tested. Hopefully I am not
getting the attribution wrong.

* Patches 18-19: Intel IOMMU rev3.x implementation. Tested with a Qemu
based intel-iommu with SSADS/SLADS emulation support.

To help testing/prototyping, qemu iommu emulation bits were written
to increase coverage of this code and hopefully make this more broadly
available for fellow contributors/devs. A separate series is submitted right
after this covering the Qemu IOMMUFD extensions for dirty tracking, alongside
A/D bit emulation for its x86 iommus. Meanwhile it's also on github
(https://github.com/jpemartins/qemu/commits/iommufd)

Remarks / Observations:

* There's no capabilities API in IOMMUFD, and in this RFC each vendor checks
what it supports in each of the newly added ops. Initially I was thinking of
having a HWPT_GET_DIRTY to probe how dirty tracking is supported (rather than
bailing out with EOPNOTSUPP) as well as a get_dirty_tracking
iommu-core API. On the UAPI, perhaps it might be better to have a single API
for capabilities in general (similar to KVM), where at its simplest a subop
conveys the necessary info on a per-subop basis?

* The UAPI/kAPI could be generalized over the next iteration to also cover
Access bit (or Intel's Extended Access bit that tracks non-CPU usage).
It wasn't done, as I was not aware of a use-case. I am wondering
if the access-bits could be used to do some form of zero page detection
(to just send the pages that got touched), although dirty-bits could be
used just the same way. Happy to adjust for RFCv2. The algorithms, IOPTE
walk and marshalling into bitmaps as well as the necessary IOTLB flush
batching are all the same. The focus is on the dirty bit given that the
dirtiness IOVA feedback is used to select the pages that need to be transferred
to the destination while migration is happening.
Sidebar: Sadly, there are a lot fewer clever tricks that can be
done (compared to the CPU/KVM) without having the PCI device cooperate (like
userfaultfd, wrprotect, etc, as those would turn into nefarious IOMMU
perm faults and devices with DMA target aborts).
If folks think the UAPI/iommu-kAPI should be agnostic to any PTE A/D
bits, we can instead have the ioctls be named after
HWPT_SET_TRACKING() and add another argument which asks which bits to
enable tracking for (IOMMUFD_ACCESS/IOMMUFD_DIRTY/IOMMUFD_ACCESS_NONCPU).
Likewise for the read_and_clear(), as all PTE bits follow the same logic
as dirty. Happy to readjust if folks think it is worthwhile.

* IOMMU Nesting /shouldn't/ matter in this work, as it is expected that we
only care about the first stage of IOMMU pagetables for hypervisors i.e.
tracking dirty GPAs (and not caring about dirty GIOVAs).

* Dirty bit tracking alone is not enough. Large IO pages tend to be the norm
when DMA mapping large ranges of IOVA space, when really the VMM wants the
smallest granularity possible to track (i.e. host base pages). A separate bit
of work will need to take care demoting IOPTE page sizes at guest-runtime to
increase/decrease the dirty tracking granularity, likely under the form of a
IOAS demote/promote page-size within a previously mapped IOVA range.

Feedback is very much appreciated!

[0] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_...@nvidia.com/
[1] https://lore.kernel.org/kvm/20220414104710.28534-1-yi.l@intel.com/
[2] 
https://lore.kernel.org/linux-arm-kernel/20210413085457.25400-1-zhukeqi...@huawei.com/

Joao

TODOs:
* More selftests for large/small iopte sizes;
* Better vIOMMU+VFIO testing (AMD doesn't support it);
* Performance efficiency of GET_DIRTY_IOVA in various workloads;
* Testing with a live migrateable VF;

Jean-Philippe Brucker (1):
  iommu/arm-smmu-v3: Add feature detection for HTTU

Joao Martins (16):
  iommu: Add iommu_domain ops for dirty tracking
  iommufd: Dirty tracking for io_pagetable
  iommufd: Dirty tracking data support
  iommu: Add an unmap API that returns dirtied IOPTEs
  iommufd: Add a dirty bitmap to iopt_unmap_iova()
  iommufd: Dirty tracking IOCTLs for the hw_pagetable
  iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
  iommufd: Add a test for dirty tracking ioctls
  iommu/amd: Access/Dirty bit support in IOPTEs
  iommu/amd: Add unmap_read_dirty() support
  iommu/amd: Print access/dirty bits if supported
  iommu/arm-smmu-v3: Add read_and_clear_dirty() support
  iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  iommu/arm-smmu-v3: Add unmap_read_dirty() support
  iommu/intel: Access/Dirty bit support for SL domains
  iommu/intel: Add unmap_read_dirty() support

Kunkun Jiang (2):
  iommu/arm-smmu-v3: Add feature detection for BBML
  iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping

 drivers/iommu/amd/amd_iommu.h

[PATCH RFC 05/19] iommufd: Add a dirty bitmap to iopt_unmap_iova()

2022-04-28 Thread Joao Martins
Add an argument to the kAPI that unmaps an IOVA from the attached
domains, to also receive a bitmap.

When passed an iommufd_dirty_data::bitmap we call the special
dirty unmap (iommu_unmap_read_dirty()). The bitmap data is
iterated in IOVA chunks, similarly to read_and_clear_dirty(),
using the previously added iommufd_dirty_iter* helper
functions.

Signed-off-by: Joao Martins 
---
 drivers/iommu/iommufd/io_pagetable.c| 13 ++--
 drivers/iommu/iommufd/io_pagetable.h|  3 +-
 drivers/iommu/iommufd/ioas.c|  2 +-
 drivers/iommu/iommufd/iommufd_private.h |  4 +-
 drivers/iommu/iommufd/pages.c   | 79 +
 drivers/iommu/iommufd/vfio_compat.c |  2 +-
 6 files changed, 80 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 835b5040fce9..6f4117c629d4 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -542,13 +542,14 @@ struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
 }
 
 static int __iopt_unmap_iova(struct io_pagetable *iopt, struct iopt_area *area,
-struct iopt_pages *pages)
+struct iopt_pages *pages,
+struct iommufd_dirty_data *bitmap)
 {
/* Drivers have to unpin on notification. */
if (WARN_ON(atomic_read(&area->num_users)))
return -EBUSY;
 
-   iopt_area_unfill_domains(area, pages);
+   iopt_area_unfill_domains(area, pages, bitmap);
WARN_ON(atomic_read(&area->num_users));
iopt_abort_area(area);
iopt_put_pages(pages);
@@ -560,12 +561,13 @@ static int __iopt_unmap_iova(struct io_pagetable *iopt, struct iopt_area *area,
  * @iopt: io_pagetable to act on
  * @iova: Starting iova to unmap
  * @length: Number of bytes to unmap
+ * @bitmap: Bitmap of dirtied IOVAs
  *
  * The requested range must exactly match an existing range.
  * Splitting/truncating IOVA mappings is not allowed.
  */
 int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
-   unsigned long length)
+   unsigned long length, struct iommufd_dirty_data *bitmap)
 {
struct iopt_pages *pages;
struct iopt_area *area;
@@ -590,7 +592,8 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
area->pages = NULL;
up_write(&iopt->iova_rwsem);
 
-   rc = __iopt_unmap_iova(iopt, area, pages);
+   rc = __iopt_unmap_iova(iopt, area, pages, bitmap);
+
up_read(&iopt->domains_rwsem);
return rc;
 }
@@ -614,7 +617,7 @@ int iopt_unmap_all(struct io_pagetable *iopt)
area->pages = NULL;
up_write(&iopt->iova_rwsem);
 
-   rc = __iopt_unmap_iova(iopt, area, pages);
+   rc = __iopt_unmap_iova(iopt, area, pages, NULL);
if (rc)
goto out_unlock_domains;
 
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index c8b6a60ff24c..c8baab25ab08 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -48,7 +48,8 @@ struct iopt_area {
 };
 
 int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages);
-void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages);
+void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages,
+ struct iommufd_dirty_data *bitmap);
 
 int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain);
 void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 48149988c84b..19d6591aa005 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -243,7 +243,7 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
rc = -EOVERFLOW;
goto out_put;
}
-   rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length);
+   rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length, NULL);
}
 
 out_put:
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 4c12b4a8f1a6..3e3a97f623a1 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -47,8 +47,6 @@ int iopt_map_user_pages(struct io_pagetable *iopt, unsigned long *iova,
 int iopt_map_pages(struct io_pagetable *iopt, struct iopt_pages *pages,
   unsigned long *dst_iova, unsigned long start_byte,
   unsigned long length, int iommu_prot, unsigned int flags);
-int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
-   unsigned long length);
 int iopt

[PATCH RFC 03/19] iommufd: Dirty tracking data support

2022-04-28 Thread Joao Martins
Add an IO pagetable API iopt_read_and_clear_dirty_data() that
performs the reading of dirty IOPTEs for a given IOVA range and
then copies the result back to userspace from each area-internal
bitmap.

Underneath it uses the equivalent IOMMU API, which reads the dirty
bits, atomically clears the IOPTE dirty bit, and flushes the IOTLB
at the end. The dirty bitmaps pass an iotlb_gather to allow
batching the dirty-bit updates.

Most of the complexity, though, is in the handling of the user
bitmaps to avoid copies back and forth. The bitmap user addresses
need to be iterated through, pinned, and then passed into the
iommu core. The amount of bitmap data passed at a time for a
read_and_clear_dirty() is 1 page worth of pinned base page
pointers. That equates to 16M bits, i.e. 64G worth of IOVA space
that can be reported as 'dirtied' in one go. The IOTLB is flushed
at the end of the whole scanned IOVA range, to defer as much as
possible the potential DMA performance penalty.
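
For reference, a small standalone sketch of the arithmetic behind the
'16M bits / 64G' figure, assuming 4K base pages and 8-byte page
pointers (the numbers, not the kernel code, are what is illustrated):

	#include <stdio.h>

	int main(void)
	{
		unsigned long page_size = 4096;			/* 4K base pages */
		unsigned long bitmap_pages = page_size / 8;	/* 512 pinned bitmap pages */
		unsigned long bits = bitmap_pages * page_size * 8; /* 16M dirty bits */
		unsigned long iova_bytes = bits * page_size;	/* 64G of IOVA covered */

		printf("%lu bitmap pages -> %lu Mbits -> %lu GiB of IOVA\n",
		       bitmap_pages, bits >> 20, iova_bytes >> 30);
		return 0;
	}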

Signed-off-by: Joao Martins 
---
 drivers/iommu/iommufd/io_pagetable.c| 169 
 drivers/iommu/iommufd/iommufd_private.h |  44 ++
 2 files changed, 213 insertions(+)

diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index f4609ef369e0..835b5040fce9 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "io_pagetable.h"
 
@@ -347,6 +348,174 @@ int iopt_set_dirty_tracking(struct io_pagetable *iopt,
return ret;
 }
 
+int iommufd_dirty_iter_init(struct iommufd_dirty_iter *iter,
+   struct iommufd_dirty_data *bitmap)
+{
+   struct iommu_dirty_bitmap *dirty = &iter->dirty;
+   unsigned long bitmap_len;
+
+   bitmap_len = dirty_bitmap_bytes(bitmap->length >> dirty->pgshift);
+
+   import_single_range(WRITE, bitmap->data, bitmap_len,
+   &iter->bitmap_iov, &iter->bitmap_iter);
+   iter->iova = bitmap->iova;
+
+   /* Can record up to 64G at a time */
+   dirty->pages = (struct page **) __get_free_page(GFP_KERNEL);
+
+   return !dirty->pages ? -ENOMEM : 0;
+}
+
+void iommufd_dirty_iter_free(struct iommufd_dirty_iter *iter)
+{
+   struct iommu_dirty_bitmap *dirty = &iter->dirty;
+
+   if (dirty->pages) {
+   free_page((unsigned long) dirty->pages);
+   dirty->pages = NULL;
+   }
+}
+
+bool iommufd_dirty_iter_done(struct iommufd_dirty_iter *iter)
+{
+   return iov_iter_count(&iter->bitmap_iter) > 0;
+}
+
+static inline unsigned long iommufd_dirty_iter_bytes(struct iommufd_dirty_iter *iter)
+{
+   unsigned long left = iter->bitmap_iter.count - iter->bitmap_iter.iov_offset;
+
+   left = min_t(unsigned long, left, (iter->dirty.npages << PAGE_SHIFT));
+
+   return left;
+}
+
+unsigned long iommufd_dirty_iova_length(struct iommufd_dirty_iter *iter)
+{
+   unsigned long left = iommufd_dirty_iter_bytes(iter);
+
+   return ((BITS_PER_BYTE * left) << iter->dirty.pgshift);
+}
+
+unsigned long iommufd_dirty_iova(struct iommufd_dirty_iter *iter)
+{
+   unsigned long skip = iter->bitmap_iter.iov_offset;
+
+   return iter->iova + ((BITS_PER_BYTE * skip) << iter->dirty.pgshift);
+}
+
+void iommufd_dirty_iter_advance(struct iommufd_dirty_iter *iter)
+{
+   iov_iter_advance(&iter->bitmap_iter, iommufd_dirty_iter_bytes(iter));
+}
+
+void iommufd_dirty_iter_put(struct iommufd_dirty_iter *iter)
+{
+   struct iommu_dirty_bitmap *dirty = &iter->dirty;
+
+   if (dirty->npages)
+   unpin_user_pages(dirty->pages, dirty->npages);
+}
+
+int iommufd_dirty_iter_get(struct iommufd_dirty_iter *iter)
+{
+   struct iommu_dirty_bitmap *dirty = &iter->dirty;
+   unsigned long npages;
+   unsigned long ret;
+   void *addr;
+
+   addr = iter->bitmap_iov.iov_base + iter->bitmap_iter.iov_offset;
+   npages = iov_iter_npages(&iter->bitmap_iter,
+PAGE_SIZE / sizeof(struct page *));
+
+   ret = pin_user_pages_fast((unsigned long) addr, npages,
+ FOLL_WRITE, dirty->pages);
+   if (ret <= 0)
+   return -EINVAL;
+
+   dirty->npages = ret;
+   dirty->iova = iommufd_dirty_iova(iter);
+   dirty->start_offset = offset_in_page(addr);
+   return 0;
+}
+
+static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
+ struct iommufd_dirty_data *bitmap)
+{
+   const struct iommu_domain_ops *ops = domain->ops;
+   struct iommu_iotlb_gather gather;
+   struct iommufd_dirty_iter iter;
+   int ret = 0;
+
+   if (!ops || !ops->read_and_clear_dirty)
+   return -EOPNOTSUPP;
+

[PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable

2022-04-28 Thread Joao Martins
Add an io_pagetable kernel API to toggle dirty tracking:

* iopt_set_dirty_tracking(iopt, [domain], state)

It receives either NULL (which means all domains) or an
iommu_domain. The intended caller of this is the hw_pagetable
object that is created on device attach, which passes an
iommu_domain. For now, the all-domains case is left for vfio-compat.

The hw protection domain dirty control is favored over the IOVA-range
alternative. For the latter, it iterates over all IOVA areas and calls
the iommu domain op to enable/disable tracking for each range.
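
As a usage sketch (the hw_pagetable attach path is not part of this
patch, so the hwpt/ioas names below are assumptions for illustration):

	/* On device attach, enable dirty tracking on that domain only: */
	rc = iopt_set_dirty_tracking(&ioas->iopt, hwpt->domain, true);

	/* vfio-compat: no specific domain, toggle every attached domain: */
	rc = iopt_set_dirty_tracking(&ioas->iopt, NULL, true);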

Signed-off-by: Joao Martins 
---
 drivers/iommu/iommufd/io_pagetable.c| 71 +
 drivers/iommu/iommufd/iommufd_private.h |  3 ++
 2 files changed, 74 insertions(+)

diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index f9f3b06946bf..f4609ef369e0 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -276,6 +276,77 @@ int iopt_map_user_pages(struct io_pagetable *iopt, unsigned long *iova,
return 0;
 }
 
+static int __set_dirty_tracking_range_locked(struct iommu_domain *domain,
+struct io_pagetable *iopt,
+bool enable)
+{
+   const struct iommu_domain_ops *ops = domain->ops;
+   struct iommu_iotlb_gather gather;
+   struct iopt_area *area;
+   int ret = -EOPNOTSUPP;
+   unsigned long iova;
+   size_t size;
+
+   iommu_iotlb_gather_init(&gather);
+
+   for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+   iova = iopt_area_iova(area);
+   size = iopt_area_last_iova(area) - iova;
+
+   if (ops->set_dirty_tracking_range) {
+   ret = ops->set_dirty_tracking_range(domain, iova,
+   size, &gather,
+   enable);
+   if (ret < 0)
+   break;
+   }
+   }
+
+   iommu_iotlb_sync(domain, &gather);
+
+   return ret;
+}
+
+static int iommu_set_dirty_tracking(struct iommu_domain *domain,
+   struct io_pagetable *iopt, bool enable)
+{
+   const struct iommu_domain_ops *ops = domain->ops;
+   int ret = -EOPNOTSUPP;
+
+   if (ops->set_dirty_tracking)
+   ret = ops->set_dirty_tracking(domain, enable);
+   else if (ops->set_dirty_tracking_range)
+   ret = __set_dirty_tracking_range_locked(domain, iopt,
+   enable);
+
+   return ret;
+}
+
+int iopt_set_dirty_tracking(struct io_pagetable *iopt,
+   struct iommu_domain *domain, bool enable)
+{
+   struct iommu_domain *dom;
+   unsigned long index;
+   int ret = -EOPNOTSUPP;
+
+   down_write(&iopt->iova_rwsem);
+   if (!domain) {
+   down_write(&iopt->domains_rwsem);
+   xa_for_each(&iopt->domains, index, dom) {
+   ret = iommu_set_dirty_tracking(dom, iopt, enable);
+   if (ret < 0)
+   break;
+   }
+   up_write(&iopt->domains_rwsem);
+   } else {
+   ret = iommu_set_dirty_tracking(domain, iopt, enable);
+   }
+
+   up_write(&iopt->iova_rwsem);
+   return ret;
+}
+
 struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
  unsigned long *start_byte,
  unsigned long length)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index f55654278ac4..d00ef3b785c5 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -49,6 +49,9 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
unsigned long length);
 int iopt_unmap_all(struct io_pagetable *iopt);
 
+int iopt_set_dirty_tracking(struct io_pagetable *iopt,
+   struct iommu_domain *domain, bool enable);
+
 int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
  unsigned long npages, struct page **out_pages, bool write);
 void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
-- 
2.17.2



Re: [PATCH] iommu/arm: Expose ARM SMMUv3 related registers via sysfs

2022-03-25 Thread Joao Martins
On 3/25/22 18:03, Robin Murphy wrote:
> On 2022-03-23 15:09, Miguel Luis wrote:
>>> On 23 Mar 2022, at 12:40, Robin Murphy  wrote:
>>> On 2022-03-23 12:54, Miguel Luis wrote:
 Allows userspace to check for SMMUv3 features.
>>>
>>> What will userspace do with that information?
>>>
>>> It hardly matters what fancy new features might be present, if the kernel 
>>> and/or the abstracted interfaces available to userspace aren't using them. 
>>> Any functionality which is supported by a usable interface should ideally 
>>> be represented via that interface itself.
>>>
>>
>> The inspiration was the same that Intel (cap/ecap) and AMD (cap/features) 
>> took
>> exposing it's iommu feature registers on sysfs. It's an easy way to 
>> understand
>> which features are supported by the hardware regardless of what the kernel
>> supports.
> 
> OK, so what do end-users of Intel and AMD systems do with that 
> understanding then?
> 
It's feature probing and informational purposes really -- that's all
one can do with these registers regardless of IOMMU implementation.
There are some others, which print the version of the iommu, that also
appear as RO sysfs entries.

Those two registers on those two implementations have proved useful over
the last couple of years when I got new SDPs/machines.

At the end of the day some tool (or script) pretty-prints what's
supported, which usually is depicted in very lengthy manuals.
And that's fed into a toolbox that prints all hardware and software
capabilities alongside diagnostics / etc (for troubleshooting).
You know, similar to `cpuid` on the x86 side (albeit you could
argue that it's behind a 'special' instruction), which 'decodes' into
something human-friendly what goes in those lengthy SDMs. `/proc/cpuinfo`
translates that into a set of keywords (which may be software-specific or
hardware features). And more recently `kcpuid` prints the raw supported
features. I am sure there are more examples.

>> For example I could print the smmu->features and that would cover kernel
>> supported features but wouldn't help when new hardware arrives to know which
>> features are supported by the hardware.
> 
> Indeed the driver already prints the supported features at boot, largely 
> because it's useful for debugging. But again, what's the advantage of 
> knowing what you might theoretically be able to do with a machine if you 
> don't have the software support to actually do it? Are there users out 
> there who aren't going to update their OS *unless* they can cling to the 
> hope that a new OS might see the opportunity to use foreign-endianness 
> pagetables and finally take it?
> 
> I appreciate the natural human instinct to want to poke around at and 
> evaluate a new shiny thing, but a sufficiently interested user can 
> already do that with /dev/mem or any number of other ways. We don't need 
> the burden of maintaining an upstream sysfs ABI for curiosity.
> 
/dev/mem is an interesting alternative -- I wasn't quite sure you could use
the register addresses arbitrarily like that on ARM. Anyway, I suppose it's a
worthwhile option in case it works.

> The other fact of the matter is that a great deal of systems with SMMUv3 
> will be using one of Arm Ltd's implementations, and as soon as one knows 
> that one can readily look up all the details in Arm's documentation. 
> Especially when one already has that site open to find the SMMUv3 
> architecture spec to make sense of cryptic register dumps, right?
> 
Heh, the idea was to avoid going to those specs all the time, but I understand
your earlier point(s).

Also, IOMMUs are getting bigger in terms of featureset -- so it would be nice to
better introspect what iommus can do (in whatever mix of implementations)
in a less chaotic manner. Maybe implementation-specific sysfs entries are not it,
but hopefully something else in a more generic way, with the advent of a more
direct interface to iommus.


Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-03-25 Thread Joao Martins
On 3/24/22 23:11, Jason Gunthorpe wrote:
> On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
>> On Wed, 23 Mar 2022 21:33:42 -0300
>> Jason Gunthorpe  wrote:
>>> On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
>>> I don't think this is compatibility. No kernel today triggers qemu to
>>> use this feature as no kernel supports live migration. No existing
>>> qemu will trigger this feature with new kernels that support live
>>> migration v2. Therefore we can adjust qemu's dirty tracking at the
>>> same time we enable migration v2 in qemu.
>>
>> I guess I was assuming that enabling v2 migration in QEMU was dependent
>> on the existing type1 dirty tracking because it's the only means we
>> have to tell QEMU that all memory is perpetually dirty when we have a
>> DMA device.  Is that not correct?
> 
> I haven't looked closely at this part in qemu, but IMHO, if qemu sees
> that it has VFIO migration support but does not have any DMA dirty
> tracking capability it should not do precopy flows.
> 
> If there is a bug here we should certainly fix it before progressing
> the v2 patches. I'll ask Yishai & Co to take a look.

I think that's already the case.

wrt VFIO IOMMU type1, the kernel always exports a migration capability
and the page sizes it supports. In the VMM, if that matches the page size
qemu is using (on x86 it is PAGE_SIZE), Qemu determines it will /use/ the
vfio container ioctls. Which, well, I guess is always the case if the
syscall is there, considering we dirty every page.

In qemu, the start and stop of dirty tracking is actually unbounded (it attempts
to do it without checking if the capability is there); when syncing the dirties
from vfio against Qemu's private tracking, though, it does check that dirty page
tracking is supported prior to even trying the sync via the ioctl. /Most
importantly/, prior to all of this starting/stopping/syncing of dirty tracking,
Qemu adds a live migration blocker if either the device doesn't support
migration or the VFIO container doesn't support it (so migration won't even
start). So I think the VMM knows how to deal with the lack of the dirty
container ioctls, as far as my understanding goes.

TBH, I am not overly concerned with dirty page tracking in the vfio-compat
layer -- I have been doing both in tandem (old and new). We mainly need to
decide what we want to maintain in the compat layer. I can drop the IOMMU
support code I have from vfio-compat, or we do the 'perpetual dirtying' that
the current code does, or we don't support the dirty ioctls in vfio-compat at
all. Maybe the latter makes more sense, as it mimics more accurately what the
hardware supports, and deprives VMMs from even starting migration. The second
looks useful for testing, but dirtying all DMA-mapped memory seems to be too
much in a real-world migration scenario :( especially as the guest size
increases.


Re: [PATCH v2] iommu/amd: Restore IRTE.RemapEn bit for amd_iommu_activate_guest_mode

2020-09-16 Thread Joao Martins
On 9/16/20 12:17 PM, Suravee Suthikulpanit wrote:
> Commit e52d58d54a32 ("iommu/amd: Use cmpxchg_double() when updating
> 128-bit IRTE") removed an assumption that modify_irte_ga always set
> the valid bit, which requires the callers to set the appropriate value
> for the struct irte_ga.valid bit before calling the function.
> 
> Similar to the commit 26e495f34107 ("iommu/amd: Restore IRTE.RemapEn
> bit after programming IRTE"), which is for the function
> amd_iommu_deactivate_guest_mode().
> 
> The same change is also needed for the amd_iommu_activate_guest_mode().
> Otherwise, this could trigger IO_PAGE_FAULT for the VFIO based VMs with
> AVIC enabled.
> 
> Reported-by: Maxim Levitsky 
> Tested-by: Maxim Levitsky 
> Cc: Joao Martins 
> Fixes: e52d58d54a321 ("iommu/amd: Use cmpxchg_double() when updating 128-bit 
> IRTE")
> Signed-off-by: Suravee Suthikulpanit 
> ---
>  drivers/iommu/amd/iommu.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index e938677af8bc..db4fb840c59c 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -3900,14 +3900,18 @@ int amd_iommu_activate_guest_mode(void *data)
>  {
>   struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
>   struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
> + u64 valid;
>  
>   if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
>   !entry || entry->lo.fields_vapic.guest_mode)
>   return 0;
>  
> + valid = entry->lo.fields_vapic.valid;
> +
>   entry->lo.val = 0;
>   entry->hi.val = 0;
>  
> + entry->lo.fields_vapic.valid   = valid;
>   entry->lo.fields_vapic.guest_mode  = 1;
>   entry->lo.fields_vapic.ga_log_intr = 1;
>   entry->hi.fields.ga_root_ptr   = ir_data->ga_root_ptr;
> 
FWIW,

Reviewed-by: Joao Martins 


Re: [PATCH] iommu/amd: fix interrupt remapping for avic

2020-09-15 Thread Joao Martins
On 9/15/20 1:30 PM, Suravee Suthikulpanit wrote:
> On 9/15/20 6:25 PM, Maxim Levitsky wrote:
>> On Mon, 2020-09-14 at 21:48 +0700, Suravee Suthikulpanit wrote:
>>> Could you please try with the following patch instead?
>>>
>>> --- a/drivers/iommu/amd/iommu.c
>>> +++ b/drivers/iommu/amd/iommu.c
>>> @@ -3840,14 +3840,18 @@ int amd_iommu_activate_guest_mode(void *data)
>>>{
>>>   struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
>>>   struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
>>> +   u64 valid;
>>>
>>>   if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
>>>   !entry || entry->lo.fields_vapic.guest_mode)
>>>   return 0;
>>>
>>> +   valid = entry->lo.fields_vapic.valid;
>>> +
>>>   entry->lo.val = 0;
>>>   entry->hi.val = 0;
>>>
>>> +   entry->lo.fields_vapic.valid   = valid;
>>>   entry->lo.fields_vapic.guest_mode  = 1;
>>>   entry->lo.fields_vapic.ga_log_intr = 1;
>>>   entry->hi.fields.ga_root_ptr   = ir_data->ga_root_ptr;
>>> @@ -3864,12 +3868,14 @@ int amd_iommu_deactivate_guest_mode(void *data)
>>>   struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
>>>   struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
>>>   struct irq_cfg *cfg = ir_data->cfg;
>>> -   u64 valid = entry->lo.fields_remap.valid;
>>> +   u64 valid;
>>>
>>>   if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
>>>   !entry || !entry->lo.fields_vapic.guest_mode)
>>>   return 0;
>>>
>>> +   valid = entry->lo.fields_remap.valid;
>>> +
>>>   entry->lo.val = 0;
>>>   entry->hi.val = 0;
>> I see. I based my approach on the fact that valid bit was
>> set always to true anyway before, plus that amd_iommu_activate_guest_mode
>> should be really only called when someone activates a valid interrupt 
>> remapping
>> entry, but IMHO the approach of preserving the valid bit is safer anyway.
>>
>> It works on my system (I applied the patch manually, since either your or my 
>> email client,
>> seems to mangle the patch)
>>
> 
> Sorry for the mangled patch. I'll submit the patch w/ your information. 
> Thanks for your help reporting, debugging, and 
> testing the patch.
> 
I assume you're only doing the valid bit preservation in
amd_iommu_activate_guest_mode()?
The null deref in amd_iommu_deactivate_guest_mode() was fixed elsewhere[0],
or are you planning on merging both changes like the diff you attached?

Asking also because commit 26e495f341 ("iommu/amd: Restore IRTE.RemapEn bit 
after
programming IRTE") was added in v5.4 and v5.8 stable trees but the v5.4 
backport didn't
include e52d58d54a321 ("iommu/amd: Use cmpxchg_double() when updating 128-bit 
IRTE").

Joao

[0] https://lore.kernel.org/linux-iommu/20200910171621.12879-1-joao.m.mart...@oracle.com/


[PATCH] iommu/amd: Fix potential @entry null deref

2020-09-10 Thread Joao Martins
After commit 26e495f34107 ("iommu/amd: Restore IRTE.RemapEn bit after
programming IRTE"), smatch warns:

drivers/iommu/amd/iommu.c:3870 amd_iommu_deactivate_guest_mode()
warn: variable dereferenced before check 'entry' (see line 3867)

Fix this by moving the @valid assignment to after @entry has been checked
for NULL.

Cc: Suravee Suthikulpanit 
Fixes: 26e495f34107 ("iommu/amd: Restore IRTE.RemapEn bit after programming IRTE")
Reported-by: Dan Carpenter 
Signed-off-by: Joao Martins 
---
 drivers/iommu/amd/iommu.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 07ae8b93887e..8abe1c7ad45b 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3864,12 +3864,14 @@ int amd_iommu_deactivate_guest_mode(void *data)
struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
struct irq_cfg *cfg = ir_data->cfg;
-   u64 valid = entry->lo.fields_remap.valid;
+   u64 valid;
 
if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
!entry || !entry->lo.fields_vapic.guest_mode)
return 0;
 
+   valid = entry->lo.fields_remap.valid;
+
entry->lo.val = 0;
entry->hi.val = 0;
 
-- 
2.17.1



Re: [PATCH 1/2] iommu: amd: Restore IRTE.RemapEn bit after programming IRTE

2020-09-02 Thread Joao Martins
On 9/2/20 5:51 AM, Suravee Suthikulpanit wrote:
> Currently, the RemapEn (valid) bit is accidentally cleared when
> programming IRTE w/ guestMode=0. It should be restored to
> the prior state.
> 
Probably requires:

 Fixes: b9fc6b56f478 ("iommu/amd: Implements irq_set_vcpu_affinity() hook to setup vapic mode for pass-through devices")

?

> Signed-off-by: Suravee Suthikulpanit 

FWIW,

 Reviewed-by: Joao Martins 


Re: [PATCH 2/2] iommu: amd: Use cmpxchg_double() when updating 128-bit IRTE

2020-09-02 Thread Joao Martins
On 9/2/20 5:51 AM, Suravee Suthikulpanit wrote:
> When using 128-bit interrupt-remapping table entry (IRTE) (a.k.a GA mode),
> current driver disables interrupt remapping when it updates the IRTE
> so that the upper and lower 64-bit values can be updated safely.
> 
> However, this creates a small window, where the interrupt could
> arrive and result in IO_PAGE_FAULT (for interrupt) as shown below.
> 
>   IOMMU Driver        Device IRQ
>   ============        ==========
>   irte.RemapEn=0
>        ...
>   change IRTE         IRQ from device ==> IO_PAGE_FAULT !!
>        ...
>   irte.RemapEn=1
> 
> This scenario has been observed when changing irq affinity on a system
> running I/O-intensive workload, in which the destination APIC ID
> in the IRTE is updated.
> 
> Instead, use cmpxchg_double() to update the 128-bit IRTE at once without
> disabling the interrupt remapping. However, this means several features,
> which require GA (128-bit IRTE) support will also be affected if cmpxchg16b
> is not supported (which is unprecedented for AMD processors w/ IOMMU).
> 
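As an aside, the lockless update described above amounts to roughly the
following (a sketch using the irte_ga lo/hi fields; the actual
modify_irte_ga() hunk isn't quoted here):

	/* Publish a new 128-bit IRTE in one shot, so RemapEn never needs
	 * to be cleared around the update; fails if the entry changed
	 * under us, in which case the caller can retry.
	 */
	static bool irte_ga_update(struct irte_ga *entry, struct irte_ga *new_irte)
	{
		return cmpxchg_double(&entry->lo.val, &entry->hi.val,
				      entry->lo.val, entry->hi.val,
				      new_irte->lo.val, new_irte->hi.val);
	}
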
Probably requires:

 Fixes: 880ac60e2538 ("iommu/amd: Introduce interrupt remapping ops structure")

?

> Reported-by: Sean Osborne 
> Tested-by: Erik Rockstrom 
> Signed-off-by: Suravee Suthikulpanit 

With the comments below addressed, FWIW:

 Reviewed-by: Joao Martins 

> diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
> index c652f16eb702..ad30467f6930 100644
> --- a/drivers/iommu/amd/init.c
> +++ b/drivers/iommu/amd/init.c
> @@ -1511,7 +1511,14 @@ static int __init init_iommu_one(struct amd_iommu 
> *iommu, struct ivhd_header *h)
>   iommu->mmio_phys_end = MMIO_REG_END_OFFSET;
>   else
>   iommu->mmio_phys_end = MMIO_CNTR_CONF_OFFSET;
> - if (((h->efr_attr & (0x1 << IOMMU_FEAT_GASUP_SHIFT)) == 0))
> +
> + /*
> +  * Note: GA (128-bit IRTE) mode requires cmpxchg16b supports.
> +  * GAM also requires GA mode. Therefore, we need to
> +  * check cmbxchg16b support before enabling it.
> +  */

s/cmbxchg16b/cmpxchg16b

> + if (!boot_cpu_has(X86_FEATURE_CX16) ||
> + ((h->efr_attr & (0x1 << IOMMU_FEAT_GASUP_SHIFT)) == 0))
>   amd_iommu_guest_ir = AMD_IOMMU_GUEST_IR_LEGACY;
>   break;
>   case 0x11:
> @@ -1520,8 +1527,18 @@ static int __init init_iommu_one(struct amd_iommu 
> *iommu, struct ivhd_header *h)
>   iommu->mmio_phys_end = MMIO_REG_END_OFFSET;
>   else
>   iommu->mmio_phys_end = MMIO_CNTR_CONF_OFFSET;
> - if (((h->efr_reg & (0x1 << IOMMU_EFR_GASUP_SHIFT)) == 0))
> +
> + /*
> +  * Note: GA (128-bit IRTE) mode requires cmpxchg16b supports.
> +  * XT, GAM also requires GA mode. Therefore, we need to
> +  * check cmbxchg16b support before enabling them.

s/cmbxchg16b/cmpxchg16b

> +  */
> + if (boot_cpu_has(X86_FEATURE_CX16) ||

You probably want !boot_cpu_has(X86_FEATURE_CX16) ?

> + ((h->efr_reg & (0x1 << IOMMU_EFR_GASUP_SHIFT)) == 0)) {
>   amd_iommu_guest_ir = AMD_IOMMU_GUEST_IR_LEGACY;
> + break;
> + }
> +
>   /*
>* Note: Since iommu_update_intcapxt() leverages
>* the IOMMU MMIO access to MSI capability block registers