Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-08-11 Thread Eugenio Perez Martin
On Tue, Aug 11, 2020 at 7:01 PM Eugenio Perez Martin
 wrote:
>
> On Fri, Jun 26, 2020 at 11:29 PM Peter Xu  wrote:
> >
> > Hi, Eugenio,
> >
> > (CCing Eric, Yan and Michael too)
> >
> > On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
> > > diff --git a/memory.c b/memory.c
> > > index 2f15a4b250..7f789710d2 100644
> > > --- a/memory.c
> > > +++ b/memory.c
> > > @@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> > >  return;
> > >  }
> > >
> > > -assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> >
> > I can understand removing the assertion should solve the issue, however imho
> > the major issue is not about this single assertion but the whole addr_mask
> > issue behind with virtio...
> >
> > For normal IOTLB invalidations, we were trying our best to always make
> > IOMMUTLBEntry contain a valid addr_mask to be 2**N-1.  E.g., that's what
> > we're doing with the loop in vtd_address_space_unmap().
> >
> > But this is not the first time that we may want to break this assumption for
> > virtio so that we make the IOTLB a tuple of (start, len), then that len can
> > be not an address mask any more.  That seems to be more efficient for things
> > like vhost because iotlbs there are not page based, so it'll be inefficient
> > if we always guarantee the addr_mask because it'll be quite a lot more
> > roundtrips of the same range of invalidation.  Here we've encountered
> > another issue of triggering the assertion with virtio-net, but only with
> > the old RHEL7 guest.
> >
> > I'm thinking whether we can make the IOTLB invalidation configurable by
> > specifying whether the backend of the notifier can handle arbitrary address
> > ranges in some way.  So we still have the guaranteed addr_masks by default
> > (since I still don't think totally breaking the addr_mask restriction is
> > wise...), however we can allow the special backends to take advantage of
> > using arbitrary (start, len) ranges for reasons like performance.
> >
> > To do that, a quick idea is to introduce a flag IOMMU_NOTIFIER_ARBITRARY_MASK
> > to IOMMUNotifierFlag, to declare that the iommu notifier (and its backend)
> > can take an arbitrary address mask, then it can be any value and finally
> > becomes a length rather than an addr_mask.  Then for every iommu notify()
> > we can directly deliver whatever we've got from the upper layer to this
> > notifier.  With the new flag, vhost can do iommu_notifier_init() with
> > UNMAP|ARBITRARY_MASK so it declares this capability.  Then no matter for
> > device iotlb or normal iotlb, we skip the complicated procedure to split a
> > big range into small ranges that are with strict addr_mask, but directly
> > deliver the message to the iommu notifier.  E.g., we can skip the loop in
> > vtd_address_space_unmap() if the notifier is with the ARBITRARY flag set.
> >
> > Then, the assert() is not accurate either, and may become something like:
> >
> > diff --git a/memory.c b/memory.c
> > index 2f15a4b250..99d0492509 100644
> > --- a/memory.c
> > +++ b/memory.c
> > @@ -1906,6 +1906,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> >  {
> >  IOMMUNotifierFlag request_flags;
> >  hwaddr entry_end = entry->iova + entry->addr_mask;
> > +IOMMUTLBEntry tmp = *entry;
> >
> >  /*
> >   * Skip the notification if the notification does not overlap
> > @@ -1915,7 +1916,13 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> >  return;
> >  }
> >
> > -assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> > +if (notifier->notifier_flags & IOMMU_NOTIFIER_ARBITRARY_MASK) {
> > +tmp.iova = MAX(tmp.iova, notifier->start);
>
> Hi!
>
> If I modify the tmp.iova, the guest will complain (in dmesg):
> [  154.426828] DMAR: DRHD: handling fault status reg 2
> [  154.427700] DMAR: [DMA Read] Request device [01:00.0] fault addr
> 90d53fada000 [fault reason 04] Access beyond MGAW
>
> And will not forward packets anymore on that interface. Guests are
> totally ok if I only modify addr_mask.
>
> Still investigating the issue.
>
> Thanks!
>

Sorry, it seems that I lost the nitpick Yan pointed out :).

Sending RFC v3.

>
> > +tmp.addr_mask = MIN(tmp.addr_mask, notifier->end);
> > +assert(tmp.iova <= tmp.addr_mask);
> > +} else {
> > +assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> > +}
> >
> >  if (entry->perm & IOMMU_RW) {
> >  request_flags = IOMMU_NOTIFIER_MAP;
> > @@ -1924,7 +1931,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> >  }
> >
> >  if (notifier->notifier_flags & request_flags) {
> > -notifier->notify(notifier, entry);
> > +notifier->notify(notifier, &tmp);
> >  }
> >  }
> >
> > Then we can keep the assert() for e.g. vfio, however vhost can skip it and
> > even get some further performance boosts..  Does that make sense?

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-08-11 Thread Eugenio Perez Martin
On Fri, Jun 26, 2020 at 11:29 PM Peter Xu  wrote:
>
> Hi, Eugenio,
>
> (CCing Eric, Yan and Michael too)
>
> On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
> > diff --git a/memory.c b/memory.c
> > index 2f15a4b250..7f789710d2 100644
> > --- a/memory.c
> > +++ b/memory.c
> > @@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> >  return;
> >  }
> >
> > -assert(entry->iova >= notifier->start && entry_end <= notifier->end);
>
> I can understand removing the assertion should solve the issue, however imho
> the major issue is not about this single assertion but the whole addr_mask
> issue behind with virtio...
>
> For normal IOTLB invalidations, we were trying our best to always make
> IOMMUTLBEntry contain a valid addr_mask to be 2**N-1.  E.g., that's what we're
> doing with the loop in vtd_address_space_unmap().
>
> But this is not the first time that we may want to break this assumption for
> virtio so that we make the IOTLB a tuple of (start, len), then that len can be
> not an address mask any more.  That seems to be more efficient for things like
> vhost because iotlbs there are not page based, so it'll be inefficient if we
> always guarantee the addr_mask because it'll be quite a lot more roundtrips of
> the same range of invalidation.  Here we've encountered another issue of
> triggering the assertion with virtio-net, but only with the old RHEL7 guest.
>
> I'm thinking whether we can make the IOTLB invalidation configurable by
> specifying whether the backend of the notifier can handle arbitrary address
> range in some way.  So we still have the guaranteed addr_masks by default
> (since I still don't think totally breaking the addr_mask restriction is
> wise...),
> however we can allow the special backends to take advantage of using arbitrary
> (start, len) ranges for reasons like performance.
>
> To do that, a quick idea is to introduce a flag IOMMU_NOTIFIER_ARBITRARY_MASK
> to IOMMUNotifierFlag, to declare that the iommu notifier (and its backend) can
> take an arbitrary address mask, then it can be any value and finally becomes a
> length rather than an addr_mask.  Then for every iommu notify() we can 
> directly
> deliver whatever we've got from the upper layer to this notifier.  With the 
> new
> flag, vhost can do iommu_notifier_init() with UNMAP|ARBITRARY_MASK so it
> declares this capability.  Then no matter for device iotlb or normal iotlb, we
> skip the complicated procedure to split a big range into small ranges that are
> with strict addr_mask, but directly deliver the message to the iommu notifier.
> E.g., we can skip the loop in vtd_address_space_unmap() if the notifier is 
> with
> ARBITRARY flag set.
>
> Then, the assert() is not accurate either, and may become something like:
>
> diff --git a/memory.c b/memory.c
> index 2f15a4b250..99d0492509 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1906,6 +1906,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>  {
>  IOMMUNotifierFlag request_flags;
>  hwaddr entry_end = entry->iova + entry->addr_mask;
> +IOMMUTLBEntry tmp = *entry;
>
>  /*
>   * Skip the notification if the notification does not overlap
> @@ -1915,7 +1916,13 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>  return;
>  }
>
> -assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> +if (notifier->notifier_flags & IOMMU_NOTIFIER_ARBITRARY_MASK) {
> +tmp.iova = MAX(tmp.iova, notifier->start);

Hi!

If I modify the tmp.iova, the guest will complain (in dmesg):
[  154.426828] DMAR: DRHD: handling fault status reg 2
[  154.427700] DMAR: [DMA Read] Request device [01:00.0] fault addr
90d53fada000 [fault reason 04] Access beyond MGAW

And will not forward packets anymore on that interface. Guests are
totally ok if I only modify addr_mask.

Still investigating the issue.

Thanks!


> +tmp.addr_mask = MIN(tmp.addr_mask, notifier->end);
> +assert(tmp.iova <= tmp.addr_mask);
> +} else {
> +assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> +}
>
>  if (entry->perm & IOMMU_RW) {
>  request_flags = IOMMU_NOTIFIER_MAP;
> @@ -1924,7 +1931,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>  }
>
>  if (notifier->notifier_flags & request_flags) {
> -notifier->notify(notifier, entry);
> +notifier->notify(notifier, &tmp);
>  }
>  }
>
> Then we can keep the assert() for e.g. vfio, however vhost can skip it and 
> even
> get some further performance boosts..  Does that make sense?
>
> Thanks,
>
> --
> Peter Xu
>

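Peter's (start, len) clamping suggestion above can be illustrated outside QEMU. The following is a minimal, self-contained sketch; `Entry` and `Notifier` are hypothetical stand-ins (real QEMU's IOMMUTLBEntry/IOMMUNotifier carry more fields), and the clamping here intersects the two ranges in absolute address space before converting back to a length:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* Hypothetical stand-ins for QEMU's IOMMUTLBEntry / IOMMUNotifier.
 * The covered range is [iova, iova + addr_mask], endpoints inclusive. */
typedef struct { hwaddr iova; hwaddr addr_mask; } Entry;
typedef struct { hwaddr start; hwaddr end; } Notifier;

/* Intersect the invalidated range with the notifier window.  With the
 * ARBITRARY_MASK idea, addr_mask becomes "length - 1" and need not stay
 * a 2**N-1 power-of-two mask after clamping. */
static int clamp_to_notifier(const Entry *in, const Notifier *n, Entry *out)
{
    hwaddr entry_end = in->iova + in->addr_mask;

    if (entry_end < n->start || in->iova > n->end) {
        return 0;  /* no overlap: the notification would be skipped */
    }
    out->iova = in->iova > n->start ? in->iova : n->start;
    hwaddr end = entry_end < n->end ? entry_end : n->end;
    out->addr_mask = end - out->iova;  /* length - 1, not necessarily a mask */
    return 1;
}
```

Note the intersection is computed on absolute addresses and only converted back to a length at the end; whether the guest-visible iova may be moved at all is exactly what Eugenio's MGAW fault report above calls into question.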



Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-08-04 Thread Jason Wang



On 2020/8/5 4:30 AM, Peter Xu wrote:

On Mon, Aug 03, 2020 at 06:00:34PM +0200, Eugenio Pérez wrote:

On Fri, 2020-07-03 at 15:24 +0800, Jason Wang wrote:

On 2020/7/2 11:45 PM, Peter Xu wrote:

On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:

So I think we agree that a new notifier is needed?

Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?

That should work but I wonder something as following is better.

Instead of introducing new flags, how about carry the type of event in
the notifier then the device (vhost) can choose the message it want to
process like:

static void vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)

{

switch (event->type) {

case IOMMU_MAP:
case IOMMU_UNMAP:
case IOMMU_DEV_IOTLB_UNMAP:
...

}

Thanks



Hi!

Sorry, I thought I had this clear but now it seems not so clear to me. Do you 
mean to add that switch to the current
vhost_iommu_unmap_notify, and then the "type" field to the IOMMUTLBEntry? Is 
that the scope of the changes, or there is
something I'm missing?

If that is correct, what is the advantage for vhost or other notifiers? I
understand that moving the IOMMUTLBEntry (addr, len) -> (iova, mask)
split/transformation to the different notifier implementations could pollute
them, but this is an even deeper change, and vhost is not interested in any
event but IOMMU_UNMAP, isn't it?

On the other hand, who decides what type of event it is? If I follow the
backtrace of the assert in
https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg01015.html, it seems
to me that it should be vtd_process_device_iotlb_desc. How do I know if it
should be IOMMU_UNMAP or IOMMU_DEV_IOTLB_UNMAP? If I set it in some function
of memory.c, I should decide the type looking at the actual notifier,
shouldn't I?

(Since Jason didn't reply yesterday, I'll try to; Jason, feel free to correct
  me...)

IMHO whether to put the type into the IOMMUTLBEntry is not important.  The
important change should be that we introduce IOMMU_DEV_IOTLB_UNMAP (or I'd
rather call it IOMMU_DEV_IOTLB directly which is shorter and cleaner).  With
that information we can make the failing assertion conditional for MAP/UNMAP
only.



Or having another dedicated device IOTLB notifier.



   We can also allow dev-iotlb messages to take arbitrary addr_mask (so it
becomes a length of address range; imho we can keep using addr_mask for
simplicity, but we can comment for addr_mask that for dev-iotlb it can be not a
real mask).



Yes.

Thanks




Thanks,





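Jason's event-type dispatch above can be fleshed out as a compilable sketch. Everything here (the `IOMMU_EVT_*` names, the simplified `IOMMUTLBEvent`, the handler body) is a hypothetical illustration of the proposal, not the actual QEMU API under discussion:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* Hypothetical: an event type delivered alongside the TLB entry, so the
 * device decides which messages it cares about. */
typedef enum {
    IOMMU_EVT_MAP,
    IOMMU_EVT_UNMAP,
    IOMMU_EVT_DEV_IOTLB_UNMAP,
} IOMMUEventType;

typedef struct {
    IOMMUEventType type;
    hwaddr iova;
    hwaddr addr_mask;
} IOMMUTLBEvent;

static int invalidations_handled;

/* vhost-style handler: it only cares about invalidations, and can treat a
 * dev-iotlb unmap (arbitrary length) like a regular unmap. */
static void vhost_iommu_event(IOMMUTLBEvent *event)
{
    switch (event->type) {
    case IOMMU_EVT_UNMAP:
    case IOMMU_EVT_DEV_IOTLB_UNMAP:
        invalidations_handled++;  /* would drop the range from vhost's IOTLB */
        break;
    case IOMMU_EVT_MAP:
        /* vhost fills its IOTLB on demand via translate(); MAPs are ignored */
        break;
    }
}
```

The design point in the thread: with the type carried in the event, the assertion on 2**N-1 addr_masks can stay for MAP/UNMAP consumers like vfio while dev-iotlb events pass arbitrary lengths through.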

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-08-04 Thread Peter Xu
On Mon, Aug 03, 2020 at 06:00:34PM +0200, Eugenio Pérez wrote:
> On Fri, 2020-07-03 at 15:24 +0800, Jason Wang wrote:
> > On 2020/7/2 11:45 PM, Peter Xu wrote:
> > > On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:
> > > > So I think we agree that a new notifier is needed?
> > > Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?
> > 
> > That should work but I wonder something as following is better.
> > 
> > Instead of introducing new flags, how about carry the type of event in 
> > the notifier then the device (vhost) can choose the message it want to 
> > process like:
> > 
> > static void vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
> > 
> > {
> > 
> > switch (event->type) {
> > 
> > case IOMMU_MAP:
> > case IOMMU_UNMAP:
> > case IOMMU_DEV_IOTLB_UNMAP:
> > ...
> > 
> > }
> > 
> > Thanks
> > 
> > 
> 
> Hi!
> 
> Sorry, I thought I had this clear but now it seems not so clear to me. Do you 
> mean to add that switch to the current
> vhost_iommu_unmap_notify, and then the "type" field to the IOMMUTLBEntry? Is 
> that the scope of the changes, or there is
> something I'm missing?
> 
> If that is correct, what is the advantage for vhost or other notifiers? I
> understand that moving the IOMMUTLBEntry (addr, len) -> (iova, mask)
> split/transformation to the different notifier implementations could pollute
> them, but this is an even deeper change, and vhost is not interested in any
> event but IOMMU_UNMAP, isn't it?
> 
> On the other hand, who decides what type of event it is? If I follow the
> backtrace of the assert in
> https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg01015.html, it seems
> to me that it should be
> vtd_process_device_iotlb_desc. How do I know if it should be IOMMU_UNMAP or
> IOMMU_DEV_IOTLB_UNMAP? If I set it in some
> function of memory.c, I should decide the type looking at the actual
> notifier, shouldn't I?

(Since Jason didn't reply yesterday, I'll try to; Jason, feel free to correct
 me...)

IMHO whether to put the type into the IOMMUTLBEntry is not important.  The
important change should be that we introduce IOMMU_DEV_IOTLB_UNMAP (or I'd
rather call it IOMMU_DEV_IOTLB directly which is shorter and cleaner).  With
that information we can make the failing assertion conditional for MAP/UNMAP
only.  We can also allow dev-iotlb messages to take arbitrary addr_mask (so it
becomes a length of address range; imho we can keep using addr_mask for
simplicity, but we can comment for addr_mask that for dev-iotlb it can be not a
real mask).

Thanks,

-- 
Peter Xu




Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-08-03 Thread Eugenio Pérez
On Fri, 2020-07-03 at 15:24 +0800, Jason Wang wrote:
> On 2020/7/2 11:45 PM, Peter Xu wrote:
> > On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:
> > > So I think we agree that a new notifier is needed?
> > Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?
> 
> That should work but I wonder something as following is better.
> 
> Instead of introducing new flags, how about carry the type of event in 
> the notifier then the device (vhost) can choose the message it want to 
> process like:
> 
> static void vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
> 
> {
> 
> switch (event->type) {
> 
> case IOMMU_MAP:
> case IOMMU_UNMAP:
> case IOMMU_DEV_IOTLB_UNMAP:
> ...
> 
> }
> 
> Thanks
> 
> 

Hi!

Sorry, I thought I had this clear but now it seems not so clear to me. Do you 
mean to add that switch to the current
vhost_iommu_unmap_notify, and then the "type" field to the IOMMUTLBEntry? Is 
that the scope of the changes, or there is
something I'm missing?

If that is correct, what is the advantage for vhost or other notifiers? I
understand that moving the IOMMUTLBEntry (addr, len) -> (iova, mask)
split/transformation to the different notifier implementations could pollute
them, but this is an even deeper change, and vhost is not interested in any
event but IOMMU_UNMAP, isn't it?

On the other hand, who decides what type of event it is? If I follow the
backtrace of the assert in
https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg01015.html, it seems
to me that it should be vtd_process_device_iotlb_desc. How do I know if it
should be IOMMU_UNMAP or IOMMU_DEV_IOTLB_UNMAP? If I set it in some function
of memory.c, I should decide the type looking at the actual notifier,
shouldn't I?

Thanks!




Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-21 Thread Peter Xu
On Tue, Jul 21, 2020 at 02:20:01PM +0800, Jason Wang wrote:
> 
> On 2020/7/20 9:03 PM, Peter Xu wrote:
> > On Mon, Jul 20, 2020 at 12:02:06PM +0800, Jason Wang wrote:
> > > Right, so there's no need to deal with unmap in vtd's replay 
> > > implementation
> > > (as what generic one did).
> > We don't even for now; see vtd_page_walk_info.notify_unmap.  Thanks,
> 
> 
> Right, but I meant the vtd_address_space_unmap() in vtd_iommu_replay(). It
> looks to me it will trigger UNMAP notifiers.

Should be pretty safe for now, but I agree it seems optional (as we've
discussed about this previously).  Thanks,

-- 
Peter Xu




Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-21 Thread Jason Wang



On 2020/7/20 9:03 PM, Peter Xu wrote:

On Mon, Jul 20, 2020 at 12:02:06PM +0800, Jason Wang wrote:

Right, so there's no need to deal with unmap in vtd's replay implementation
(as what generic one did).

We don't even for now; see vtd_page_walk_info.notify_unmap.  Thanks,



Right, but I meant the vtd_address_space_unmap() in vtd_iommu_replay().
It looks to me it will trigger UNMAP notifiers.


Thanks





Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-20 Thread Peter Xu
On Mon, Jul 20, 2020 at 12:02:06PM +0800, Jason Wang wrote:
> Right, so there's no need to deal with unmap in vtd's replay implementation
> (as what generic one did).

We don't even for now; see vtd_page_walk_info.notify_unmap.  Thanks,

-- 
Peter Xu




Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-19 Thread Jason Wang



On 2020/7/17 10:18 PM, Peter Xu wrote:

On Thu, Jul 16, 2020 at 10:54:31AM +0800, Jason Wang wrote:

On 2020/7/16 9:00 AM, Peter Xu wrote:

On Mon, Jul 13, 2020 at 12:04:16PM +0800, Jason Wang wrote:

On 2020/7/10 9:30 PM, Peter Xu wrote:

On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:

On 2020/7/9 10:10 PM, Peter Xu wrote:

On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:

- If we care about the performance, it's better to implement the MAP event for
vhost, otherwise it could be a lot of IOTLB misses

I feel like these are two things.

So far what we are talking about is whether vt-d should have knowledge about
what kind of events one iommu notifier is interested in.  I still think we
should keep this as answered in question 1.

The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
events even without vDMA, so that vhost can establish the mapping even before
IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
the guest is using dynamic iommu page mappings, I feel like that can be even
slower, because then the worst case is for each IO we'll need to vmexit twice:

   - The first vmexit caused by an invalidation to MAP the page tables, so 
vhost
 will setup the page table before IO starts

   - IO/DMA triggers and completes

   - The second vmexit caused by another invalidation to UNMAP the page 
tables

So it seems to be worse than when vhost only uses UNMAP like right now.  At
least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
request from kernel to userspace, but IMHO that's cheaper than the vmexit.

Right but then I would still prefer to have another notifier.

Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
dedicated command for flushing device IOTLB. But the check for
vtd_as_has_map_notifier is used to skip the device which can do demand
paging via ATS or device specific way. If we have two different notifiers,
vhost will be on the device iotlb notifier so we don't need it at all?

But we can still have iommu notifier that only registers to UNMAP even after we
introduce dev-iotlb notifier?  We don't want to do page walk for them as well.
TCG should be the only one so far, but I don't know.. maybe there can still be
new ones?

I think you're right. But looking at the codes, it looks like the check of
vtd_as_has_map_notifier() was only used in:

1) vtd_iommu_replay()
2) vtd_iotlb_page_invalidate_notify() (PSI)

For the replay, it's expensive anyhow. For PSI, I think it's just about one
or few mappings, not sure it will have obvious performance impact.

And I had two questions:

1) The code doesn't check MAP for DSI or GI; does this match what the spec
says? (It looks to me the spec is unclear in this part)

Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
vtd_iotlb_domain_invalidate().

I meant the code doesn't check whether there's a MAP notifier :)

It's actually checked, because it loops over vtd_as_with_notifiers, and only
MAP notifiers register to that. :)


I may miss something but I don't find the code to block UNMAP notifiers?

vhost_iommu_region_add()
     memory_region_register_iommu_notifier()
         memory_region_update_iommu_notify_flags()
             imrc->notify_flag_changed()
                 vtd_iommu_notify_flag_changed()

?

Yeah I think you're right.  I might have confused with some previous
implementations.  Maybe we should also do similar thing for DSI/GI just like
what we do in PSI.



Ok.





2) for the replay() I don't see other implementations (either spapr or
generic one) that did unmap (actually they skip unmap explicitly), any
reason for doing this in intel IOMMU?

I could be wrong, but I'd guess it's because vt-d implemented the caching mode
by leveraging the same invalidation structure, so it's harder to make all
things right (IOW, we can't clearly identify MAP with UNMAP when we receive an
invalidation request, because MAP/UNMAP requests look the same).

I didn't check others, but I believe spapr is doing it differently by using
some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to
what virtio-iommu is doing.  Anyway, the point is if we have explicit MAP/UNMAP
from the guest, logically the replay indeed does not need to do any unmap
because we don't need to call replay() on an already existing device but only
for e.g. hot plug.

But this looks in conflict with what memory_region_iommu_replay() did; for an
IOMMU that doesn't have a replay method, it skips UNMAP requests:

      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
      iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
      if (iotlb.perm != IOMMU_NONE) {
      n->notify(n, &iotlb);
      }

I guess there's no knowledge of whether the guest has an explicit MAP/UNMAP for
this generic code. Or does replay imply that the guest doesn't have explicit
MAP/UNMAP?

I think it matches exactly with a hot plug case?  Note that IOMMU_NONE could
also mean the translation does not exist.

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-17 Thread Peter Xu
On Thu, Jul 16, 2020 at 10:54:31AM +0800, Jason Wang wrote:
> 
> > On 2020/7/16 9:00 AM, Peter Xu wrote:
> > > On Mon, Jul 13, 2020 at 12:04:16PM +0800, Jason Wang wrote:
> > > > On 2020/7/10 9:30 PM, Peter Xu wrote:
> > > > > On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:
> > > > > > On 2020/7/9 10:10 PM, Peter Xu wrote:
> > > > > > On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:
> > > > > > > > > - If we care the performance, it's better to implement the 
> > > > > > > > > MAP event for
> > > > > > > > > vhost, otherwise it could be a lot of IOTLB miss
> > > > > > > > I feel like these are two things.
> > > > > > > > 
> > > > > > > > So far what we are talking about is whether vt-d should have 
> > > > > > > > knowledge about
> > > > > > > > what kind of events one iommu notifier is interested in.  I 
> > > > > > > > still think we
> > > > > > > > should keep this as answered in question 1.
> > > > > > > > 
> > > > > > > > The other question is whether we want to switch vhost from 
> > > > > > > > UNMAP to MAP/UNMAP
> > > > > > > > events even without vDMA, so that vhost can establish the 
> > > > > > > > mapping even before
> > > > > > > > IO starts.  IMHO it's doable, but only if the guest runs DPDK 
> > > > > > > > workloads.  When
> > > > > > > > the guest is using dynamic iommu page mappings, I feel like 
> > > > > > > > that can be even
> > > > > > > > slower, because then the worst case is for each IO we'll need 
> > > > > > > > to vmexit twice:
> > > > > > > > 
> > > > > > > >   - The first vmexit caused by an invalidation to MAP the 
> > > > > > > > page tables, so vhost
> > > > > > > > will setup the page table before IO starts
> > > > > > > > 
> > > > > > > >   - IO/DMA triggers and completes
> > > > > > > > 
> > > > > > > >   - The second vmexit caused by another invalidation to 
> > > > > > > > UNMAP the page tables
> > > > > > > > 
> > > > > > > > So it seems to be worse than when vhost only uses UNMAP like 
> > > > > > > > right now.  At
> > > > > > > > least we only have one vmexit (when UNMAP).  We'll have a vhost 
> > > > > > > > translate()
> > > > > > > > request from kernel to userspace, but IMHO that's cheaper than 
> > > > > > > > the vmexit.
> > > > > > > Right but then I would still prefer to have another notifier.
> > > > > > > 
> > > > > > > Since vtd_page_walk has nothing to do with device IOTLB. IOMMU 
> > > > > > > have a
> > > > > > > dedicated command for flushing device IOTLB. But the check for
> > > > > > > vtd_as_has_map_notifier is used to skip the device which can do 
> > > > > > > demand
> > > > > > > paging via ATS or device specific way. If we have two different 
> > > > > > > notifiers,
> > > > > > > vhost will be on the device iotlb notifier so we don't need it at 
> > > > > > > all?
> > > > > > But we can still have iommu notifier that only registers to UNMAP 
> > > > > > even after we
> > > > > > introduce dev-iotlb notifier?  We don't want to do page walk for 
> > > > > > them as well.
> > > > > > TCG should be the only one so far, but I don't know.. maybe there 
> > > > > > can still be
> > > > > > new ones?
> > > > > I think you're right. But looking at the codes, it looks like the 
> > > > > check of
> > > > > vtd_as_has_map_notifier() was only used in:
> > > > > 
> > > > > 1) vtd_iommu_replay()
> > > > > 2) vtd_iotlb_page_invalidate_notify() (PSI)
> > > > > 
> > > > > For the replay, it's expensive anyhow. For PSI, I think it's just 
> > > > > about one
> > > > > or few mappings, not sure it will have obvious performance impact.
> > > > > 
> > > > > And I had two questions:
> > > > > 
> > > > > 1) The codes doesn't check map for DSI or GI, does this match what 
> > > > > spec
> > > > > said? (It looks to me the spec is unclear in this part)
> > > > Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
> > > > vtd_iotlb_domain_invalidate().
> > > 
> > > I meant the code doesn't check whether there's an MAP notifier :)
> > It's actually checked, because it loops over vtd_as_with_notifiers, and only
> > MAP notifiers register to that. :)
> 
> 
> I may miss something but I don't find the code to block UNMAP notifiers?
> 
> vhost_iommu_region_add()
>     memory_region_register_iommu_notifier()
>         memory_region_update_iommu_notify_flags()
>             imrc->notify_flag_changed()
>                 vtd_iommu_notify_flag_changed()
> 
> ?

Yeah I think you're right.  I might have confused with some previous
implementations.  Maybe we should also do similar thing for DSI/GI just like
what we do in PSI.

> > > > > 2) for the replay() I don't see other implementations (either spapr or
> > > > > generic one) that did unmap (actually they skip unmap explicitly), any
> > > > > reason for doing this in intel IOMMU?
> > > > I could be wrong, but I'd guess it's because vt-d implemented the 
> > > > caching mode
> > > > by leveraging the same invalidation structure, so it's harder to make
> > > > all
> > > > things right (IOW, we can't clearly identify MAP with UNMAP when we
> > > > receive an invalidation request, because MAP/UNMAP requests look the
> > > > same).

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-15 Thread Jason Wang



On 2020/7/16 9:00 AM, Peter Xu wrote:

On Mon, Jul 13, 2020 at 12:04:16PM +0800, Jason Wang wrote:

On 2020/7/10 9:30 PM, Peter Xu wrote:

On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:

On 2020/7/9 10:10 PM, Peter Xu wrote:

On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:

- If we care about the performance, it's better to implement the MAP event for
vhost, otherwise it could be a lot of IOTLB misses

I feel like these are two things.

So far what we are talking about is whether vt-d should have knowledge about
what kind of events one iommu notifier is interested in.  I still think we
should keep this as answered in question 1.

The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
events even without vDMA, so that vhost can establish the mapping even before
IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
the guest is using dynamic iommu page mappings, I feel like that can be even
slower, because then the worst case is for each IO we'll need to vmexit twice:

  - The first vmexit caused by an invalidation to MAP the page tables, so 
vhost
will setup the page table before IO starts

  - IO/DMA triggers and completes

  - The second vmexit caused by another invalidation to UNMAP the page 
tables

So it seems to be worse than when vhost only uses UNMAP like right now.  At
least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
request from kernel to userspace, but IMHO that's cheaper than the vmexit.

Right but then I would still prefer to have another notifier.

Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
dedicated command for flushing device IOTLB. But the check for
vtd_as_has_map_notifier is used to skip the device which can do demand
paging via ATS or device specific way. If we have two different notifiers,
vhost will be on the device iotlb notifier so we don't need it at all?

But we can still have iommu notifier that only registers to UNMAP even after we
introduce dev-iotlb notifier?  We don't want to do page walk for them as well.
TCG should be the only one so far, but I don't know.. maybe there can still be
new ones?

I think you're right. But looking at the codes, it looks like the check of
vtd_as_has_map_notifier() was only used in:

1) vtd_iommu_replay()
2) vtd_iotlb_page_invalidate_notify() (PSI)

For the replay, it's expensive anyhow. For PSI, I think it's just about one
or few mappings, not sure it will have obvious performance impact.

And I had two questions:

1) The code doesn't check MAP for DSI or GI; does this match what the spec
says? (It looks to me the spec is unclear in this part)

Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
vtd_iotlb_domain_invalidate().


I meant the code doesn't check whether there's a MAP notifier :)

It's actually checked, because it loops over vtd_as_with_notifiers, and only
MAP notifiers register to that. :)



I may miss something but I don't find the code to block UNMAP notifiers?

vhost_iommu_region_add()
    memory_region_register_iommu_notifier()
        memory_region_update_iommu_notify_flags()
            imrc->notify_flag_changed()
                vtd_iommu_notify_flag_changed()

?




But I agree with you that it should be cleaner to introduce the dev-iotlb
notifier type.



Yes, it can solve the duplicated UNMAP issue of vhost.





2) For the replay(), I don't see other implementations (either spapr or
the generic one) that do unmap (actually they skip unmap explicitly); any
reason for doing this in the Intel IOMMU?

I could be wrong, but I'd guess it's because vt-d implemented the caching mode
by leveraging the same invalidation structure, so it's harder to make all
things right (IOW, we can't clearly distinguish MAP from UNMAP when we receive
an invalidation request, because MAP/UNMAP requests look the same).

I didn't check others, but I believe spapr is doing it differently by using
some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to
what virtio-iommu is doing.  Anyway, the point is if we have explicit MAP/UNMAP
from the guest, logically the replay indeed does not need to do any unmap
because we don't need to call replay() on an already existing device but only
for e.g. hot plug.


But this looks to conflict with what memory_region_iommu_replay() did: for an
IOMMU that doesn't have a replay method, it skips UNMAP requests:

     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
         iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
         if (iotlb.perm != IOMMU_NONE) {
             n->notify(n, &iotlb);
         }
     }

I guess there's no knowledge of whether the guest has an explicit MAP/UNMAP
for this generic code. Or does replay imply that the guest doesn't have
explicit MAP/UNMAP?

I think it matches exactly with a hot plug case?  Note that IOMMU_NONE
could also mean the translation does not exist.  So it's actually trying to map

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-15 Thread Peter Xu
On Mon, Jul 13, 2020 at 12:04:16PM +0800, Jason Wang wrote:
> 
> On 2020/7/10 下午9:30, Peter Xu wrote:
> > On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:
> > > On 2020/7/9 下午10:10, Peter Xu wrote:
> > > > On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:
> > > > > > > - If we care the performance, it's better to implement the MAP 
> > > > > > > event for
> > > > > > > vhost, otherwise it could be a lot of IOTLB miss
> > > > > > I feel like these are two things.
> > > > > > 
> > > > > > So far what we are talking about is whether vt-d should have 
> > > > > > knowledge about
> > > > > > what kind of events one iommu notifier is interested in.  I still 
> > > > > > think we
> > > > > > should keep this as answered in question 1.
> > > > > > 
> > > > > > The other question is whether we want to switch vhost from UNMAP to 
> > > > > > MAP/UNMAP
> > > > > > events even without vDMA, so that vhost can establish the mapping 
> > > > > > even before
> > > > > > IO starts.  IMHO it's doable, but only if the guest runs DPDK 
> > > > > > workloads.  When
> > > > > > the guest is using dynamic iommu page mappings, I feel like that 
> > > > > > can be even
> > > > > > slower, because then the worst case is for each IO we'll need to 
> > > > > > vmexit twice:
> > > > > > 
> > > > > >  - The first vmexit caused by an invalidation to MAP the page 
> > > > > > tables, so vhost
> > > > > >will setup the page table before IO starts
> > > > > > 
> > > > > >  - IO/DMA triggers and completes
> > > > > > 
> > > > > >  - The second vmexit caused by another invalidation to UNMAP 
> > > > > > the page tables
> > > > > > 
> > > > > > So it seems to be worse than when vhost only uses UNMAP like right 
> > > > > > now.  At
> > > > > > least we only have one vmexit (when UNMAP).  We'll have a vhost 
> > > > > > translate()
> > > > > > request from kernel to userspace, but IMHO that's cheaper than the 
> > > > > > vmexit.
> > > > > Right but then I would still prefer to have another notifier.
> > > > > 
> > > > > Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
> > > > > dedicated command for flushing device IOTLB. But the check for
> > > > > vtd_as_has_map_notifier is used to skip the device which can do demand
> > > > > paging via ATS or device specific way. If we have two different 
> > > > > notifiers,
> > > > > vhost will be on the device iotlb notifier so we don't need it at all?
> > > > But we can still have iommu notifier that only registers to UNMAP even 
> > > > after we
> > > > introduce dev-iotlb notifier?  We don't want to do page walk for them 
> > > > as well.
> > > > TCG should be the only one so far, but I don't know.. maybe there can 
> > > > still be
> > > > new ones?
> > > 
> > > I think you're right. But looking at the codes, it looks like the check of
> > > vtd_as_has_map_notifier() was only used in:
> > > 
> > > 1) vtd_iommu_replay()
> > > 2) vtd_iotlb_page_invalidate_notify() (PSI)
> > > 
> > > For the replay, it's expensive anyhow. For PSI, I think it's just about 
> > > one
> > > or few mappings, not sure it will have obvious performance impact.
> > > 
> > > And I had two questions:
> > > 
> > > 1) The codes doesn't check map for DSI or GI, does this match what spec
> > > said? (It looks to me the spec is unclear in this part)
> > Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
> > vtd_iotlb_domain_invalidate().
> 
> 
> I meant the code doesn't check whether there's a MAP notifier :)

It's actually checked, because it loops over vtd_as_with_notifiers, and only
MAP notifiers register to that. :)

But I agree with you that it should be cleaner to introduce the dev-iotlb
notifier type.

> 
> 
> > 
> > > 2) for the replay() I don't see other implementations (either spapr or
> > > generic one) that did unmap (actually they skip unmap explicitly), any
> > > reason for doing this in intel IOMMU?
> > I could be wrong, but I'd guess it's because vt-d implemented the caching 
> > mode
> > by leveraging the same invalidation structure, so it's harder to make all
> > things right (IOW, we can't clearly identify MAP with UNMAP when we receive 
> > an
> > invalidation request, because MAP/UNMAP requests look the same).
> > 
> > I didn't check others, but I believe spapr is doing it differently by using
> > some hypercalls to deliver IOMMU map/unmap requests, which seems a bit 
> > close to
> > what virtio-iommu is doing.  Anyway, the point is if we have explicit 
> > MAP/UNMAP
> > from the guest, logically the replay indeed does not need to do any unmap
> > because we don't need to call replay() on an already existing device but 
> > only
> > for e.g. hot plug.
> 
> 
> But this looks conflict with what memory_region_iommu_replay( ) did, for
> IOMMU that doesn't have a replay method, it skips UNMAP request:
> 
>     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
>     iotlb = imrc->translate(iommu_mr, addr, 

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-12 Thread Jason Wang



On 2020/7/10 下午9:30, Peter Xu wrote:

On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:

On 2020/7/9 下午10:10, Peter Xu wrote:

On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:

- If we care the performance, it's better to implement the MAP event for
vhost, otherwise it could be a lot of IOTLB miss

I feel like these are two things.

So far what we are talking about is whether vt-d should have knowledge about
what kind of events one iommu notifier is interested in.  I still think we
should keep this as answered in question 1.

The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
events even without vDMA, so that vhost can establish the mapping even before
IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
the guest is using dynamic iommu page mappings, I feel like that can be even
slower, because then the worst case is for each IO we'll need to vmexit twice:

 - The first vmexit caused by an invalidation to MAP the page tables, so 
vhost
   will setup the page table before IO starts

 - IO/DMA triggers and completes

 - The second vmexit caused by another invalidation to UNMAP the page tables

So it seems to be worse than when vhost only uses UNMAP like right now.  At
least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
request from kernel to userspace, but IMHO that's cheaper than the vmexit.

Right but then I would still prefer to have another notifier.

Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
dedicated command for flushing device IOTLB. But the check for
vtd_as_has_map_notifier is used to skip the device which can do demand
paging via ATS or device specific way. If we have two different notifiers,
vhost will be on the device iotlb notifier so we don't need it at all?

But we can still have an iommu notifier that only registers for UNMAP even
after we introduce the dev-iotlb notifier?  We don't want to do the page walk
for them as well.  TCG should be the only one so far, but I don't know...
maybe there can still be new ones?


I think you're right. But looking at the code, it looks like the check of
vtd_as_has_map_notifier() was only used in:

1) vtd_iommu_replay()
2) vtd_iotlb_page_invalidate_notify() (PSI)

For the replay, it's expensive anyhow. For PSI, I think it's just about one
or a few mappings, so I'm not sure it will have an obvious performance impact.

And I had two questions:

1) The code doesn't check map for DSI or GI; does this match what the spec
says? (It looks to me the spec is unclear in this part)

Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
vtd_iotlb_domain_invalidate().



I meant the code doesn't check whether there's a MAP notifier :)





2) For the replay(), I don't see other implementations (either spapr or
the generic one) that do unmap (actually they skip unmap explicitly); any
reason for doing this in the Intel IOMMU?

I could be wrong, but I'd guess it's because vt-d implemented the caching mode
by leveraging the same invalidation structure, so it's harder to make all
things right (IOW, we can't clearly distinguish MAP from UNMAP when we receive
an invalidation request, because MAP/UNMAP requests look the same).

I didn't check others, but I believe spapr is doing it differently by using
some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to
what virtio-iommu is doing.  Anyway, the point is if we have explicit MAP/UNMAP
from the guest, logically the replay indeed does not need to do any unmap
because we don't need to call replay() on an already existing device but only
for e.g. hot plug.



But this looks to conflict with what memory_region_iommu_replay() did: for an
IOMMU that doesn't have a replay method, it skips UNMAP requests:


    for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
        iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
        if (iotlb.perm != IOMMU_NONE) {
            n->notify(n, &iotlb);
        }
    }

I guess there's no knowledge of whether the guest has an explicit MAP/UNMAP
for this generic code. Or does replay imply that the guest doesn't have
explicit MAP/UNMAP?


(btw, the code shortcuts memory_region_notify_one(); not sure of the reason)



  VT-d does not have that clear interface, so VT-d needs to
maintain its own mapping structures, and also vt-d is using the same replay &
page_walk operations to sync all these structures, which complicated the vt-d
replay a bit.  With that, we assume replay() can be called anytime on a device,
and we won't notify duplicated MAPs to lower layer like vfio if it is mapped
before.  In the meantime, since we'll compare the latest mapping with the one
we cached in the iova tree, UNMAP becomes possible too.
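The compare-against-cache idea described here can be sketched in a few lines. The toy below is purely illustrative (a per-page bitmap stands in for QEMU's iova interval tree, and every name is invented), not the actual implementation:

```c
#include <assert.h>
#include <stdbool.h>

#define N_PAGES 8

/* Toy model of the replay idea: diff the latest guest mapping state
 * against what we cached earlier, emitting MAP only for pages that became
 * mapped and UNMAP only for pages that vanished, so no duplicated MAPs
 * reach lower layers like vfio. */
typedef void (*notify_fn)(int page, bool map);

static void replay_diff(const bool cached[N_PAGES],
                        const bool latest[N_PAGES],
                        bool new_cache[N_PAGES], notify_fn notify)
{
    for (int i = 0; i < N_PAGES; i++) {
        if (latest[i] && !cached[i]) {
            notify(i, true);       /* newly mapped: emit MAP */
        } else if (!latest[i] && cached[i]) {
            notify(i, false);      /* no longer mapped: emit UNMAP */
        }                          /* unchanged pages: stay silent */
        new_cache[i] = latest[i];
    }
}

/* Simple counting notifier used for demonstration. */
static int g_maps, g_unmaps;
static void count_notify(int page, bool map)
{
    (void)page;
    if (map) {
        g_maps++;
    } else {
        g_unmaps++;
    }
}
```

Replaying with such a diff is what makes it safe to call replay() at any time: existing mappings produce no duplicated MAP events, and disappeared ones can still be UNMAPped.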



AFAIK vtd_iommu_replay() did a complete UNMAP:

    /*
     * The replay can be triggered by either an invalidation or a newly
     * created entry. No matter what, we release existing mappings
     * (it means flushing caches 

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-10 Thread Peter Xu
On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:
> 
> On 2020/7/9 下午10:10, Peter Xu wrote:
> > On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:
> > > > > - If we care the performance, it's better to implement the MAP event 
> > > > > for
> > > > > vhost, otherwise it could be a lot of IOTLB miss
> > > > I feel like these are two things.
> > > > 
> > > > So far what we are talking about is whether vt-d should have knowledge 
> > > > about
> > > > what kind of events one iommu notifier is interested in.  I still think 
> > > > we
> > > > should keep this as answered in question 1.
> > > > 
> > > > The other question is whether we want to switch vhost from UNMAP to 
> > > > MAP/UNMAP
> > > > events even without vDMA, so that vhost can establish the mapping even 
> > > > before
> > > > IO starts.  IMHO it's doable, but only if the guest runs DPDK 
> > > > workloads.  When
> > > > the guest is using dynamic iommu page mappings, I feel like that can be 
> > > > even
> > > > slower, because then the worst case is for each IO we'll need to vmexit 
> > > > twice:
> > > > 
> > > > - The first vmexit caused by an invalidation to MAP the page 
> > > > tables, so vhost
> > > >   will setup the page table before IO starts
> > > > 
> > > > - IO/DMA triggers and completes
> > > > 
> > > > - The second vmexit caused by another invalidation to UNMAP the 
> > > > page tables
> > > > 
> > > > So it seems to be worse than when vhost only uses UNMAP like right now. 
> > > >  At
> > > > least we only have one vmexit (when UNMAP).  We'll have a vhost 
> > > > translate()
> > > > request from kernel to userspace, but IMHO that's cheaper than the 
> > > > vmexit.
> > > 
> > > Right but then I would still prefer to have another notifier.
> > > 
> > > Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
> > > dedicated command for flushing device IOTLB. But the check for
> > > vtd_as_has_map_notifier is used to skip the device which can do demand
> > > paging via ATS or device specific way. If we have two different notifiers,
> > > vhost will be on the device iotlb notifier so we don't need it at all?
> > But we can still have iommu notifier that only registers to UNMAP even 
> > after we
> > introduce dev-iotlb notifier?  We don't want to do page walk for them as 
> > well.
> > TCG should be the only one so far, but I don't know.. maybe there can still 
> > be
> > new ones?
> 
> 
> I think you're right. But looking at the codes, it looks like the check of
> vtd_as_has_map_notifier() was only used in:
> 
> 1) vtd_iommu_replay()
> 2) vtd_iotlb_page_invalidate_notify() (PSI)
> 
> For the replay, it's expensive anyhow. For PSI, I think it's just about one
> or few mappings, not sure it will have obvious performance impact.
> 
> And I had two questions:
> 
> 1) The codes doesn't check map for DSI or GI, does this match what spec
> said? (It looks to me the spec is unclear in this part)

Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
vtd_iotlb_domain_invalidate().

> 2) for the replay() I don't see other implementations (either spapr or
> generic one) that did unmap (actually they skip unmap explicitly), any
> reason for doing this in intel IOMMU?

I could be wrong, but I'd guess it's because vt-d implemented the caching mode
by leveraging the same invalidation structure, so it's harder to make all
things right (IOW, we can't clearly distinguish MAP from UNMAP when we receive
an invalidation request, because MAP/UNMAP requests look the same).

I didn't check others, but I believe spapr is doing it differently by using
some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to
what virtio-iommu is doing.  Anyway, the point is if we have explicit MAP/UNMAP
from the guest, logically the replay indeed does not need to do any unmap
because we don't need to call replay() on an already existing device but only
for e.g. hot plug.  VT-d does not have that clear interface, so VT-d needs to
maintain its own mapping structures, and also vt-d is using the same replay &
page_walk operations to sync all these structures, which complicated the vt-d
replay a bit.  With that, we assume replay() can be called anytime on a device,
and we won't notify duplicated MAPs to lower layer like vfio if it is mapped
before.  In the meantime, since we'll compare the latest mapping with the one
we cached in the iova tree, UNMAP becomes possible too.

Thanks,

-- 
Peter Xu




Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-10 Thread Jason Wang



On 2020/7/9 下午10:10, Peter Xu wrote:

On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:

- If we care the performance, it's better to implement the MAP event for
vhost, otherwise it could be a lot of IOTLB miss

I feel like these are two things.

So far what we are talking about is whether vt-d should have knowledge about
what kind of events one iommu notifier is interested in.  I still think we
should keep this as answered in question 1.

The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
events even without vDMA, so that vhost can establish the mapping even before
IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
the guest is using dynamic iommu page mappings, I feel like that can be even
slower, because then the worst case is for each IO we'll need to vmexit twice:

- The first vmexit caused by an invalidation to MAP the page tables, so 
vhost
  will setup the page table before IO starts

- IO/DMA triggers and completes

- The second vmexit caused by another invalidation to UNMAP the page tables

So it seems to be worse than when vhost only uses UNMAP like right now.  At
least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
request from kernel to userspace, but IMHO that's cheaper than the vmexit.


Right but then I would still prefer to have another notifier.

Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
dedicated command for flushing device IOTLB. But the check for
vtd_as_has_map_notifier is used to skip the device which can do demand
paging via ATS or device specific way. If we have two different notifiers,
vhost will be on the device iotlb notifier so we don't need it at all?

But we can still have an iommu notifier that only registers for UNMAP even
after we introduce the dev-iotlb notifier?  We don't want to do the page walk
for them as well.  TCG should be the only one so far, but I don't know...
maybe there can still be new ones?



I think you're right. But looking at the code, it looks like the check 
of vtd_as_has_map_notifier() was only used in:


1) vtd_iommu_replay()
2) vtd_iotlb_page_invalidate_notify() (PSI)

For the replay, it's expensive anyhow. For PSI, I think it's just about 
one or a few mappings, so I'm not sure it will have an obvious performance impact.


And I had two questions:

1) The code doesn't check map for DSI or GI; does this match what the spec 
says? (It looks to me the spec is unclear in this part)
2) For the replay(), I don't see other implementations (either spapr or 
the generic one) that do unmap (actually they skip unmap explicitly); any 
reason for doing this in the Intel IOMMU?


Thanks




Thanks,






Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-09 Thread Peter Xu
On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:
> > > - If we care the performance, it's better to implement the MAP event for
> > > vhost, otherwise it could be a lot of IOTLB miss
> > I feel like these are two things.
> > 
> > So far what we are talking about is whether vt-d should have knowledge about
> > what kind of events one iommu notifier is interested in.  I still think we
> > should keep this as answered in question 1.
> > 
> > The other question is whether we want to switch vhost from UNMAP to 
> > MAP/UNMAP
> > events even without vDMA, so that vhost can establish the mapping even 
> > before
> > IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  
> > When
> > the guest is using dynamic iommu page mappings, I feel like that can be even
> > slower, because then the worst case is for each IO we'll need to vmexit 
> > twice:
> > 
> >- The first vmexit caused by an invalidation to MAP the page tables, so 
> > vhost
> >  will setup the page table before IO starts
> > 
> >- IO/DMA triggers and completes
> > 
> >- The second vmexit caused by another invalidation to UNMAP the page 
> > tables
> > 
> > So it seems to be worse than when vhost only uses UNMAP like right now.  At
> > least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
> > request from kernel to userspace, but IMHO that's cheaper than the vmexit.
> 
> 
> Right but then I would still prefer to have another notifier.
> 
> Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
> dedicated command for flushing device IOTLB. But the check for
> vtd_as_has_map_notifier is used to skip the device which can do demand
> paging via ATS or device specific way. If we have two different notifiers,
> vhost will be on the device iotlb notifier so we don't need it at all?

But we can still have an iommu notifier that only registers for UNMAP even
after we introduce the dev-iotlb notifier?  We don't want to do the page walk
for them as well.  TCG should be the only one so far, but I don't know...
maybe there can still be new ones?

Thanks,

-- 
Peter Xu




Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-08 Thread Jason Wang



On 2020/7/8 下午10:16, Peter Xu wrote:

On Wed, Jul 08, 2020 at 01:42:30PM +0800, Jason Wang wrote:

So it should be functional equivalent to vtd_as_has_notifier().

For example: in vtd_iommu_replay() we'll skip the replay if vhost has
registered the iommu notifier because vtd_as_has_map_notifier() will return
false.


Two questions:

- Do we care about the performance here? If not, vhost may just ignore the MAP
event?

I think we care, because vtd_page_walk() can be expensive.



Ok.





- If we care about the performance, it's better to implement the MAP event for
vhost; otherwise there could be a lot of IOTLB misses

I feel like these are two things.

So far what we are talking about is whether vt-d should have knowledge about
what kind of events one iommu notifier is interested in.  I still think we
should keep this as answered in question 1.

The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
events even without vDMA, so that vhost can establish the mapping even before
IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
the guest is using dynamic iommu page mappings, I feel like that can be even
slower, because then the worst case is for each IO we'll need to vmexit twice:

   - The first vmexit caused by an invalidation to MAP the page tables, so vhost
 will setup the page table before IO starts

   - IO/DMA triggers and completes

   - The second vmexit caused by another invalidation to UNMAP the page tables

So it seems to be worse than when vhost only uses UNMAP like right now.  At
least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
request from kernel to userspace, but IMHO that's cheaper than the vmexit.



Right but then I would still prefer to have another notifier.

Since vtd_page_walk has nothing to do with device IOTLB, and the IOMMU has a 
dedicated command for flushing the device IOTLB. But the check for 
vtd_as_has_map_notifier is used to skip devices which can do demand 
paging via ATS or a device-specific way. If we have two different 
notifiers, vhost will be on the device IOTLB notifier, so we don't need 
it at all?


Thanks









Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-08 Thread Peter Xu
On Wed, Jul 08, 2020 at 01:42:30PM +0800, Jason Wang wrote:
> > > So it should be functional equivalent to vtd_as_has_notifier().
> > For example: in vtd_iommu_replay() we'll skip the replay if vhost has
> > registered the iommu notifier because vtd_as_has_map_notifier() will return
> > false.
> 
> 
> Two questions:
> 
> - Do we care the performance here? If not, vhost may just ignore the MAP
> event?

I think we care, because vtd_page_walk() can be expensive.

> - If we care the performance, it's better to implement the MAP event for
> vhost, otherwise it could be a lot of IOTLB miss

I feel like these are two things.

So far what we are talking about is whether vt-d should have knowledge about
what kind of events one iommu notifier is interested in.  I still think we
should keep this as answered in question 1.

The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
events even without vDMA, so that vhost can establish the mapping even before
IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
the guest is using dynamic iommu page mappings, I feel like that can be even
slower, because then the worst case is for each IO we'll need to vmexit twice:

  - The first vmexit caused by an invalidation to MAP the page tables, so vhost
will setup the page table before IO starts

  - IO/DMA triggers and completes

  - The second vmexit caused by another invalidation to UNMAP the page tables

So it seems to be worse than when vhost only uses UNMAP like right now.  At
least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
request from kernel to userspace, but IMHO that's cheaper than the vmexit.

-- 
Peter Xu




Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-08 Thread Jason Wang



On 2020/7/8 上午3:54, Peter Xu wrote:

On Tue, Jul 07, 2020 at 04:03:10PM +0800, Jason Wang wrote:

On 2020/7/3 下午9:03, Peter Xu wrote:

On Fri, Jul 03, 2020 at 03:24:19PM +0800, Jason Wang wrote:

On 2020/7/2 下午11:45, Peter Xu wrote:

On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:

So I think we agree that a new notifier is needed?

Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?

That should work, but I wonder whether something like the following is better.

Instead of introducing new flags, how about carrying the type of event in the
notifier, so the device (vhost) can choose the messages it wants to process,
like:

static void vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
{
    switch (event->type) {
    case IOMMU_MAP:
    case IOMMU_UNMAP:
    case IOMMU_DEV_IOTLB_UNMAP:
        ...
    }
}

Looks ok to me, though imo we should still keep the registration information,
so VT-d knows which notifiers is interested in which events.  E.g., we can
still do something like vtd_as_has_map_notifier().


Is this for better performance?

I wonder whether it's worthwhile, since we can't have both vhost and vtd
registered in the same AS.

/me failed to follow this sentence.. :(



Sorry, I meant "vfio" instead "vtd".





So it should be functional equivalent to vtd_as_has_notifier().

For example: in vtd_iommu_replay() we'll skip the replay if vhost has
registered the iommu notifier because vtd_as_has_map_notifier() will return
false.



Two questions:

- Do we care about the performance here? If not, vhost may just ignore the MAP 
event?
- If we care about the performance, it's better to implement the MAP event for 
vhost; otherwise there could be a lot of IOTLB misses


Thanks



   It'll only return true if the device is a vfio-pci device.

Without vtd_as_has_map_notifier(), how could we do that?






Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-07 Thread Peter Xu
On Tue, Jul 07, 2020 at 04:03:10PM +0800, Jason Wang wrote:
> 
> On 2020/7/3 下午9:03, Peter Xu wrote:
> > On Fri, Jul 03, 2020 at 03:24:19PM +0800, Jason Wang wrote:
> > > On 2020/7/2 下午11:45, Peter Xu wrote:
> > > > On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:
> > > > > So I think we agree that a new notifier is needed?
> > > > Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?
> > > 
> > > That should work but I wonder something as following is better.
> > > 
> > > Instead of introducing new flags, how about carry the type of event in the
> > > notifier then the device (vhost) can choose the message it want to process
> > > like:
> > > 
> > > static vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
> > > 
> > > {
> > > 
> > > switch (event->type) {
> > > 
> > > case IOMMU_MAP:
> > > case IOMMU_UNMAP:
> > > case IOMMU_DEV_IOTLB_UNMAP:
> > > ...
> > > 
> > > }
> > Looks ok to me, though imo we should still keep the registration 
> > information,
> > so VT-d knows which notifiers is interested in which events.  E.g., we can
> > still do something like vtd_as_has_map_notifier().
> 
> 
> Is this for a better performance?
> 
> I wonder whether it's worth since we can't not have both vhost and vtd to be
> registered into the same as.

/me failed to follow this sentence.. :(

> 
> So it should be functional equivalent to vtd_as_has_notifier().

For example: in vtd_iommu_replay() we'll skip the replay if vhost has
registered the iommu notifier because vtd_as_has_map_notifier() will return
false.  It'll only return true if the device is a vfio-pci device.

Without vtd_as_has_map_notifier(), how could we do that?

-- 
Peter Xu




Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-07 Thread Jason Wang



On 2020/7/3 下午9:03, Peter Xu wrote:

On Fri, Jul 03, 2020 at 03:24:19PM +0800, Jason Wang wrote:

On 2020/7/2 下午11:45, Peter Xu wrote:

On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:

So I think we agree that a new notifier is needed?

Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?


That should work, but I wonder whether something like the following is better.

Instead of introducing new flags, how about carrying the type of event in the
notifier, so the device (vhost) can choose the messages it wants to process,
like:

static void vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
{
    switch (event->type) {
    case IOMMU_MAP:
    case IOMMU_UNMAP:
    case IOMMU_DEV_IOTLB_UNMAP:
        ...
    }
}

Looks ok to me, though imo we should still keep the registration information,
so VT-d knows which notifiers is interested in which events.  E.g., we can
still do something like vtd_as_has_map_notifier().



Is this for better performance?

I wonder whether it's worthwhile, since we can't have both vhost and vtd 
registered in the same AS.


So it should be functional equivalent to vtd_as_has_notifier().

Thanks




So these are probably two different things: the new IOMMU_NOTIFIER_DEV_IOTLB
flag is one as discussed; the other one is whether we would like to introduce
IOMMUTLBEvent to include the type, so that we can avoid introduce two notifiers
for one device majorly to identify dev-iotlb from unmaps.

Thanks,






Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-03 Thread Peter Xu
On Fri, Jul 03, 2020 at 03:24:19PM +0800, Jason Wang wrote:
> 
> On 2020/7/2 下午11:45, Peter Xu wrote:
> > On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:
> > > So I think we agree that a new notifier is needed?
> > Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?
> 
> 
> That should work but I wonder something as following is better.
> 
> Instead of introducing new flags, how about carry the type of event in the
> notifier then the device (vhost) can choose the message it want to process
> like:
> 
> static vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
> 
> {
> 
> switch (event->type) {
> 
> case IOMMU_MAP:
> case IOMMU_UNMAP:
> case IOMMU_DEV_IOTLB_UNMAP:
> ...
> 
> }

Looks ok to me, though imo we should still keep the registration information,
so VT-d knows which notifiers are interested in which events.  E.g., we can
still do something like vtd_as_has_map_notifier().

So these are probably two different things: the new IOMMU_NOTIFIER_DEV_IOTLB
flag is one as discussed; the other one is whether we would like to introduce
IOMMUTLBEvent to include the type, so that we can avoid introducing two
notifiers for one device, mainly to distinguish dev-iotlb from unmaps.

Thanks,

-- 
Peter Xu




Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-03 Thread Jason Wang



On 2020/7/2 下午11:45, Peter Xu wrote:

On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:

So I think we agree that a new notifier is needed?

Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?



That should work, but I wonder whether something like the following would be better.

Instead of introducing new flags, how about carrying the type of event in
the notifier, so that the device (vhost) can choose the messages it wants to
process, like:


static void vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
{
    switch (event->type) {
    case IOMMU_MAP:
    case IOMMU_UNMAP:
    case IOMMU_DEV_IOTLB_UNMAP:
    ...
    }
}

Thanks
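To make the event-typed proposal above concrete, here is a self-contained sketch of what the dispatch could look like. The IOMMUTLBEvent layout, the event-type names, and the `handled` counter are assumptions drawn from this discussion, not the committed QEMU API:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical event types, assumed from the proposal above. */
typedef enum {
    IOMMU_MAP,
    IOMMU_UNMAP,
    IOMMU_DEV_IOTLB_UNMAP,
} IOMMUEventType;

/* Hypothetical event carrying its type plus a plain (iova, len) range. */
typedef struct {
    IOMMUEventType type;
    uint64_t iova;
    uint64_t len;               /* arbitrary length, not required to be 2^N - 1 */
} IOMMUTLBEvent;

typedef struct IOMMUNotifier IOMMUNotifier;
typedef void (*iommu_notify_fn)(IOMMUNotifier *n, IOMMUTLBEvent *event);

struct IOMMUNotifier {
    iommu_notify_fn notify;
    int handled;                /* illustration hook: events the backend acted on */
};

/* vhost-style handler: picks only the messages it wants to process. */
static void vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
{
    switch (event->type) {
    case IOMMU_UNMAP:
    case IOMMU_DEV_IOTLB_UNMAP:
        /* would invalidate the vhost IOTLB for [iova, iova + len) */
        n->handled++;
        break;
    default:
        /* vhost does not care about MAP events */
        break;
    }
}
```

With one notifier per device, a dev-iotlb invalidation and a regular unmap both reach the same callback and are told apart by the type field, which is the point being argued for above.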









Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-02 Thread Peter Xu
On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:
> So I think we agree that a new notifier is needed?

Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?

-- 
Peter Xu
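The flag-based alternative could look roughly like this. The names and bit values are a sketch extrapolated from QEMU's existing MAP/UNMAP notifier flags; the eventual definition may differ:

```c
#include <assert.h>

/* Hypothetical extension of the notifier-flag bitmask. */
typedef enum {
    IOMMU_NOTIFIER_NONE      = 0,
    IOMMU_NOTIFIER_UNMAP     = 1 << 0, /* existing: page-aligned unmaps */
    IOMMU_NOTIFIER_MAP       = 1 << 1, /* existing: new mappings */
    IOMMU_NOTIFIER_DEV_IOTLB = 1 << 2, /* proposed: dev-iotlb invalidations,
                                          arbitrary (start, len) ranges */
} IOMMUNotifierFlag;

/* A vhost-style registration would then ask only for what it handles,
 * and VT-d could check this at invalidation time. */
static int wants_dev_iotlb(IOMMUNotifierFlag flags)
{
    return (flags & IOMMU_NOTIFIER_DEV_IOTLB) != 0;
}
```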




Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-01 Thread Jason Wang



On 2020/7/1 下午4:09, Jason Wang wrote:


On 2020/6/30 下午11:39, Peter Xu wrote:

On Tue, Jun 30, 2020 at 10:41:10AM +0800, Jason Wang wrote:

  /* According to ATS spec table 2.4:
   * S = 0, bits 15:12 = xxxx range size: 4K
   * S = 1, bits 15:12 = xxx0 range size: 8K
   * S = 1, bits 15:12 = xx01 range size: 16K
   * S = 1, bits 15:12 = x011 range size: 32K
   * S = 1, bits 15:12 = 0111 range size: 64K
   * ...
   */


Right, but the comment is probably misleading here, since it's for the PCI-E
transaction between IOMMU and device, not for the device IOTLB invalidation
descriptor.

For device IOTLB invalidation descriptor, spec allows a [0, ~0ULL]
invalidation:

"

6.5.2.5 Device-TLB Invalidate Descriptor

...

Size (S): The size field indicates the number of consecutive pages targeted
by this invalidation request. If S field is zero, a single page at page
address specified by Address [63:12] is requested to be invalidated. If S
field is Set, the least significant bit in the Address field with value 0b
indicates the invalidation address range. For example, if S field is Set and
Address[12] is Clear, it indicates an 8KB invalidation address range with
base address in Address [63:13]. If S field and Address[12] is Set and bit 13
is Clear, it indicates a 16KB invalidation address range with base address in
Address [63:14], etc.

"

So if we receive an address whose [63] is 0 and the rest is all 1, it's then
a [0, ~0ULL] invalidation.

Yes.  I think invalidating the whole range is always fine.  It's still not
arbitrary, right?  E.g., we can't even invalidate (0x1000, 0x3000) with
device-iotlb because of the address mask, not to say sub-pages.



Yes.






How about just convert to use a range [start, end] for any notifier and move
the checks (e.g the assert) into the actual notifier implemented (vhost or
vfio)?

IOMMUTLBEntry itself is the abstraction layer of TLB entry.  Hardware TLB entry
is definitely not arbitrary range either (because AFAICT the hardware should
only cache PFN rather than address, so at least PAGE_SIZE aligned).
Introducing this flag will already make this trickier just to avoid introducing
another similar struct to IOMMUTLBEntry, but I really don't want to make it a
default option...  Not to mention I probably have no reason to urge the rest
iommu notifier users (tcg, vfio) to change their existing good code to suit
any of the backend who can cooperate with arbitrary address ranges...

Ok, so it looks like we need a dedicated notifier for device IOTLB.

Or we can also make a new flag for device iotlb just like current UNMAP?  Then
we replace the vhost type from UNMAP to DEVICE_IOTLB.  But IMHO using the
ARBITRARY_LENGTH flag would work in a similar way.  DEVICE_IOTLB flag could
also allow virtio/vhost to only receive one invalidation (now IIUC it'll
receive both iotlb and device-iotlb for unmapping a page when ats=on), but then
ats=on will be a must and it could break some old (misconfigured) qemu because
afaict previously virtio/vhost could even work with vIOMMU (accidentally) even
without ats=on.


That's a bug and I don't think we need to work around misconfigured qemu :)

IMHO it depends on the strictness we want on the qemu cmdline API. :)

We should at least check libvirt to make sure it's using ats=on always, then I
agree maybe we can avoid considering the rest...

Thanks,



Cc libvirt list, but I think we should fix libvirt if they don't 
provide "ats=on".


Thanks



Libvirt looks fine, according to the domain  XML documentation[1]:

 QEMU's virtio devices have some attributes related to the virtio 
transport under the driver element: The iommu attribute enables the use 
of emulated IOMMU by the device. The attribute ats controls the Address 
Translation Service support for PCIe devices. This is needed to make use 
of IOTLB support (see IOMMU device). Possible values are on or off. 
Since 3.5.0


So I think we agree that a new notifier is needed?

Thanks

[1] https://libvirt.org/formatdomain.html











Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-01 Thread Jason Wang



On 2020/7/1 下午8:41, Peter Xu wrote:

On Wed, Jul 01, 2020 at 08:30:07PM +0800, Jason Wang wrote:

I overlooked myself that the IR region will be there even if ir=off.


Yes, but the point still stands; the issue is still there if ir=off.



So I
think the assert should stand.


Do you mean vhost can't trigger the assert()? If yes, I don't get how it
can't.

Yes.  vhost can only trigger the translate() via memory API.  No matter whether
ir is on/off, the memory region is always enabled, so any access to 0xfeexxxxx
from memory API should still land at s->mr_ir, afaict, rather than the dmar 
region.



Right.

Thanks








Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-01 Thread Peter Xu
On Wed, Jul 01, 2020 at 08:30:07PM +0800, Jason Wang wrote:
> > I overlooked myself that the IR region will be there even if ir=off.
> 
> 
> Yes, but the point still stands; the issue is still there if ir=off.
> 
> 
> >So I
> > think the assert should stand.
> 
> 
> Do you mean vhost can't trigger the assert()? If yes, I don't get how it
> can't.

Yes.  vhost can only trigger the translate() via memory API.  No matter whether
ir is on/off, the memory region is always enabled, so any access to 0xfeexxxxx
from memory API should still land at s->mr_ir, afaict, rather than the dmar 
region.

-- 
Peter Xu




Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-01 Thread Jason Wang



On 2020/7/1 下午8:16, Peter Xu wrote:

On Wed, Jul 01, 2020 at 04:11:49PM +0800, Jason Wang wrote:

On 2020/6/30 下午11:20, Peter Xu wrote:

On Tue, Jun 30, 2020 at 05:23:31PM +0800, Jason Wang wrote:

Ok, we had a dedicated mr for interrupt:

memory_region_add_subregion_overlap(MEMORY_REGION(_dev_as->iommu),
VTD_INTERRUPT_ADDR_FIRST,
_dev_as->iommu_ir, 1);

So it should be fine. I guess the reason that I'm asking is that I thought
"IR" means "Interrupt remapping" but in fact it means "Interrupt Region"?

I was meaning "interrupt remapping", and of course it's the interrupt region
too when IR enabled...


Right.



But I'm still not clear about the invalidation part for interrupt region,
maybe you can elaborate a little more on this.

Btw, I think guest can trigger the assert in vtd_do_iommu_translate() if we
teach vhost to DMA to that region:

Why would we want to?

I meant a buggy(malicious) guest driver.

Yes seems possible.  Do you want to post a patch?  Let me know if you want me
to...  Thanks,


Yes please.

Oh wait...  Actually the comment above explains...

 /*
  * We have standalone memory region for interrupt addresses, we
  * should never receive translation requests in this region.
  */
 assert(!vtd_is_interrupt_addr(addr));

I overlooked myself that the IR region will be there even if ir=off.



Yes, but the point still stands; the issue is still there if ir=off.



   So I
think the assert should stand.



Do you mean vhost can't trigger the assert()? If yes, I don't get how it 
can't.


Thanks









Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-01 Thread Peter Xu
On Wed, Jul 01, 2020 at 04:11:49PM +0800, Jason Wang wrote:
> 
> On 2020/6/30 下午11:20, Peter Xu wrote:
> > On Tue, Jun 30, 2020 at 05:23:31PM +0800, Jason Wang wrote:
> > > > > Ok, we had a dedicated mr for interrupt:
> > > > > 
> > > > > memory_region_add_subregion_overlap(MEMORY_REGION(_dev_as->iommu),
> > > > > VTD_INTERRUPT_ADDR_FIRST,
> > > > > _dev_as->iommu_ir, 1);
> > > > > 
> > > > > So it should be fine. I guess the reason that I'm asking is that I 
> > > > > thought
> > > > > "IR" means "Interrupt remapping" but in fact it means "Interrupt 
> > > > > Region"?
> > I was meaning "interrupt remapping", and of course it's the interrupt region
> > too when IR enabled...
> 
> 
> Right.
> 
> 
> > 
> > > > > But I'm still not clear about the invalidation part for interrupt 
> > > > > region,
> > > > > maybe you can elaborate a little more on this.
> > > > > 
> > > > > Btw, I think guest can trigger the assert in vtd_do_iommu_translate() 
> > > > > if we
> > > > > teach vhost to DMA to that region:
> > > > Why would we want to?
> > > 
> > > I meant a buggy(malicious) guest driver.
> > Yes seems possible.  Do you want to post a patch?  Let me know if you want 
> > me
> > to...  Thanks,
> 
> 
> Yes please.

Oh wait...  Actually the comment above explains...

/*
 * We have standalone memory region for interrupt addresses, we
 * should never receive translation requests in this region.
 */
assert(!vtd_is_interrupt_addr(addr));

I overlooked myself that the IR region will be there even if ir=off.  So I
think the assert should stand.

-- 
Peter Xu
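For reference, the check under discussion can be sketched as below. The macro names are taken from the quoted snippets, and the 0xfee00000-0xfeefffff bounds follow the VT-d interrupt address range (0xFEEx_xxxx); treat this as an approximation of QEMU's intel_iommu code rather than a verbatim copy:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* VT-d interrupt address range (spec section 3.14, 0xFEEx_xxxx);
 * macro names assumed from the quoted intel_iommu snippets. */
#define VTD_INTERRUPT_ADDR_FIRST 0xfee00000ULL
#define VTD_INTERRUPT_ADDR_LAST  0xfeefffffULL

/* True when addr falls inside the interrupt region, where translation
 * requests should never arrive (hence the assert in the discussion). */
static bool vtd_is_interrupt_addr(uint64_t addr)
{
    return addr >= VTD_INTERRUPT_ADDR_FIRST && addr <= VTD_INTERRUPT_ADDR_LAST;
}
```

Because the IR subregion is mapped with higher priority over the DMAR region, a translate() coming through the memory API should land in the IR region first, which is why the assert is considered unreachable when only the memory API is used.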




Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-01 Thread Jason Wang



On 2020/6/30 下午11:20, Peter Xu wrote:

On Tue, Jun 30, 2020 at 05:23:31PM +0800, Jason Wang wrote:

Ok, we had a dedicated mr for interrupt:

memory_region_add_subregion_overlap(MEMORY_REGION(_dev_as->iommu),
VTD_INTERRUPT_ADDR_FIRST,
_dev_as->iommu_ir, 1);

So it should be fine. I guess the reason that I'm asking is that I thought
"IR" means "Interrupt remapping" but in fact it means "Interrupt Region"?

I was meaning "interrupt remapping", and of course it's the interrupt region
too when IR enabled...



Right.





But I'm still not clear about the invalidation part for interrupt region,
maybe you can elaborate a little more on this.

Btw, I think guest can trigger the assert in vtd_do_iommu_translate() if we
teach vhost to DMA to that region:

Why would we want to?


I meant a buggy(malicious) guest driver.

Yes seems possible.  Do you want to post a patch?  Let me know if you want me
to...  Thanks,



Yes please.

Thanks









Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-01 Thread Jason Wang



On 2020/6/30 下午11:39, Peter Xu wrote:

On Tue, Jun 30, 2020 at 10:41:10AM +0800, Jason Wang wrote:

  /* According to ATS spec table 2.4:
   * S = 0, bits 15:12 = xxxx range size: 4K
   * S = 1, bits 15:12 = xxx0 range size: 8K
   * S = 1, bits 15:12 = xx01 range size: 16K
   * S = 1, bits 15:12 = x011 range size: 32K
   * S = 1, bits 15:12 = 0111 range size: 64K
   * ...
   */


Right, but the comment is probably misleading here, since it's for the PCI-E
transaction between IOMMU and device not for the device IOTLB invalidation
descriptor.

For device IOTLB invalidation descriptor, spec allows a [0, ~0ULL]
invalidation:

"

6.5.2.5 Device-TLB Invalidate Descriptor

...

Size (S): The size field indicates the number of consecutive pages targeted
by this invalidation
request. If S field is zero, a single page at page address specified by
Address [63:12] is requested
to be invalidated. If S field is Set, the least significant bit in the
Address field with value 0b
indicates the invalidation address range. For example, if S field is Set and
Address[12] is Clear, it
indicates an 8KB invalidation address range with base address in Address
[63:13]. If S field and
Address[12] is Set and bit 13 is Clear, it indicates a 16KB invalidation
address range with base
address in Address [63:14], etc.

"

So if we receive an address whose [63] is 0 and the rest is all 1, it's then
a [0, ~0ULL] invalidation.
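The Size-field encoding quoted above can be decoded mechanically: with S set, the lowest clear bit at or above bit 12 of the Address field determines the range size. A hedged sketch (not QEMU's actual code; `__builtin_ctzll` assumes GCC/Clang):

```c
#include <assert.h>
#include <stdint.h>

/* Decode the Address field of a VT-d Device-TLB invalidate descriptor
 * with S set (spec 6.5.2.5): the lowest clear bit at or above bit 12
 * encodes the size of the invalidated range. */
static void decode_dev_iotlb(uint64_t addr, uint64_t *base, uint64_t *size)
{
    uint64_t inv = ~(addr >> 12);           /* search for the lowest 0 bit */
    unsigned n = __builtin_ctzll(inv) + 12; /* bit position of that 0 bit */

    if (n >= 63) {
        /* Address[63] clear and bits [62:12] all set: the whole address
         * space, i.e. the [0, ~0ULL] invalidation discussed above. */
        *base = 0;
        *size = UINT64_MAX;
        return;
    }
    *size = 1ULL << (n + 1);                /* bit 12 clear -> 8K, etc. */
    *base = addr & ~(*size - 1);
}
```

Note the decoded size is always a power of two and the base is aligned to it, which is exactly why a device-iotlb invalidation cannot express an arbitrary range like (0x1000, 0x3000).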

Yes.  I think invalidating the whole range is always fine.  It's still not
arbitrary, right?  E.g., we can't even invalidate (0x1000, 0x3000) with
device-iotlb because of the address mask, not to say sub-pages.



Yes.







How about just convert to use a range [start, end] for any notifier and move
the checks (e.g the assert) into the actual notifier implemented (vhost or
vfio)?

IOMMUTLBEntry itself is the abstraction layer of TLB entry.  Hardware TLB entry
is definitely not arbitrary range either (because AFAICT the hardware should
only cache PFN rather than address, so at least PAGE_SIZE aligned).
Introducing this flag will already make this trickier just to avoid introducing
another similar struct to IOMMUTLBEntry, but I really don't want to make it a
default option...  Not to mention I probably have no reason to urge the rest
iommu notifier users (tcg, vfio) to change their existing good code to suit
any of the backend who can cooperate with arbitrary address ranges...

Ok, so it looks like we need a dedicated notifier for device IOTLB.

Or we can also make a new flag for device iotlb just like current UNMAP? Then
we replace the vhost type from UNMAP to DEVICE_IOTLB.  But IMHO using the
ARBITRARY_LENGTH flag would work in a similar way.  DEVICE_IOTLB flag could
also allow virtio/vhost to only receive one invalidation (now IIUC it'll
receive both iotlb and device-iotlb for unmapping a page when ats=on), but then
ats=on will be a must and it could break some old (misconfigured) qemu because
afaict previously virtio/vhost could even work with vIOMMU (accidentally) even
without ats=on.


That's a bug and I don't think we need to work around misconfigured qemu :)

IMHO it depends on the strictness we want on the qemu cmdline API. :)

We should at least check libvirt to make sure it's using ats=on always, then I
agree maybe we can avoid considering the rest...

Thanks,



Cc libvirt list, but I think we should fix libvirt if they don't provide 
"ats=on".


Thanks





Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-06-30 Thread Peter Xu
On Tue, Jun 30, 2020 at 10:41:10AM +0800, Jason Wang wrote:
> >  /* According to ATS spec table 2.4:
> >   * S = 0, bits 15:12 = xxxx range size: 4K
> >   * S = 1, bits 15:12 = xxx0 range size: 8K
> >   * S = 1, bits 15:12 = xx01 range size: 16K
> >   * S = 1, bits 15:12 = x011 range size: 32K
> >   * S = 1, bits 15:12 = 0111 range size: 64K
> >   * ...
> >   */
> 
> 
> Right, but the comment is probably misleading here, since it's for the PCI-E
> transaction between IOMMU and device not for the device IOTLB invalidation
> descriptor.
> 
> For device IOTLB invalidation descriptor, spec allows a [0, ~0ULL]
> invalidation:
> 
> "
> 
> 6.5.2.5 Device-TLB Invalidate Descriptor
> 
> ...
> 
> Size (S): The size field indicates the number of consecutive pages targeted
> by this invalidation
> request. If S field is zero, a single page at page address specified by
> Address [63:12] is requested
> to be invalidated. If S field is Set, the least significant bit in the
> Address field with value 0b
> indicates the invalidation address range. For example, if S field is Set and
> Address[12] is Clear, it
> indicates an 8KB invalidation address range with base address in Address
> [63:13]. If S field and
> Address[12] is Set and bit 13 is Clear, it indicates a 16KB invalidation
> address range with base
> address in Address [63:14], etc.
> 
> "
> 
> So if we receive an address whose [63] is 0 and the rest is all 1, it's then
> a [0, ~0ULL] invalidation.

Yes.  I think invalidating the whole range is always fine.  It's still not
arbitrary, right?  E.g., we can't even invalidate (0x1000, 0x3000) with
device-iotlb because of the address mask, not to say sub-pages.

> 
> 
> > 
> > > 
> > > > > How about just convert to use a range [start, end] for any notifier 
> > > > > and move
> > > > > the checks (e.g the assert) into the actual notifier implemented 
> > > > > (vhost or
> > > > > vfio)?
> > > > IOMMUTLBEntry itself is the abstraction layer of TLB entry.  Hardware 
> > > > TLB entry
> > > > is definitely not arbitrary range either (because AFAICT the hardware 
> > > > should
> > > > only cache PFN rather than address, so at least PAGE_SIZE aligned).
> > > > Introducing this flag will already make this trickier just to avoid 
> > > > introducing
> > > > another similar struct to IOMMUTLBEntry, but I really don't want to 
> > > > make it a
> > > > default option...  Not to mention I probably have no reason to urge the 
> > > > rest
> > > > iommu notifier users (tcg, vfio) to change their existing good code to 
> > > > suit
> > > > any of the backend who can cooperate with arbitrary address ranges...
> > > 
> > > Ok, so it looks like we need a dedicated notifier for device IOTLB.
> > Or we can also make a new flag for device iotlb just like current UNMAP? 
> > Then
> > we replace the vhost type from UNMAP to DEVICE_IOTLB.  But IMHO using the
> > ARBITRARY_LENGTH flag would work in a similar way.  DEVICE_IOTLB flag could
> > also allow virtio/vhost to only receive one invalidation (now IIUC it'll
> > receive both iotlb and device-iotlb for unmapping a page when ats=on), but 
> > then
> > ats=on will be a must and it could break some old (misconfigured) qemu because
> > afaict previously virtio/vhost could even work with vIOMMU (accidentally) 
> > even
> > without ats=on.
> 
> 
> That's a bug and I don't think we need to work around misconfigured qemu :)

IMHO it depends on the strictness we want on the qemu cmdline API. :)

We should at least check libvirt to make sure it's using ats=on always, then I
agree maybe we can avoid considering the rest...

Thanks,

-- 
Peter Xu




Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-06-30 Thread Peter Xu
On Tue, Jun 30, 2020 at 05:23:31PM +0800, Jason Wang wrote:
> > > Ok, we had a dedicated mr for interrupt:
> > > 
> > > memory_region_add_subregion_overlap(MEMORY_REGION(_dev_as->iommu),
> > > VTD_INTERRUPT_ADDR_FIRST,
> > > _dev_as->iommu_ir, 1);
> > > 
> > > So it should be fine. I guess the reason that I'm asking is that I thought
> > > "IR" means "Interrupt remapping" but in fact it means "Interrupt Region"?

I was meaning "interrupt remapping", and of course it's the interrupt region
too when IR enabled...

> > > 
> > > But I'm still not clear about the invalidation part for interrupt region,
> > > maybe you can elaborate a little more on this.
> > > 
> > > Btw, I think guest can trigger the assert in vtd_do_iommu_translate() if 
> > > we
> > > teach vhost to DMA to that region:
> > 
> > Why would we want to?
> 
> 
> I meant a buggy(malicious) guest driver.

Yes seems possible.  Do you want to post a patch?  Let me know if you want me
to...  Thanks,

-- 
Peter Xu




Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-06-30 Thread Michael S. Tsirkin
On Tue, Jun 30, 2020 at 04:29:19PM +0800, Jason Wang wrote:
> 
> On 2020/6/30 上午10:41, Jason Wang wrote:
> > 
> > On 2020/6/29 下午9:34, Peter Xu wrote:
> > > On Mon, Jun 29, 2020 at 01:51:47PM +0800, Jason Wang wrote:
> > > > On 2020/6/28 下午10:47, Peter Xu wrote:
> > > > > On Sun, Jun 28, 2020 at 03:03:41PM +0800, Jason Wang wrote:
> > > > > > On 2020/6/27 上午5:29, Peter Xu wrote:
> > > > > > > Hi, Eugenio,
> > > > > > > 
> > > > > > > (CCing Eric, Yan and Michael too)
> > > > > > > 
> > > > > > > On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
> > > > > > > > diff --git a/memory.c b/memory.c
> > > > > > > > index 2f15a4b250..7f789710d2 100644
> > > > > > > > --- a/memory.c
> > > > > > > > +++ b/memory.c
> > > > > > > > @@ -1915,8 +1915,6 @@ void
> > > > > > > > memory_region_notify_one(IOMMUNotifier
> > > > > > > > *notifier,
> > > > > > > >     return;
> > > > > > > >     }
> > > > > > > > -    assert(entry->iova >= notifier->start &&
> > > > > > > > entry_end <= notifier->end);
> > > > > > > I can understand removing the assertion should solve
> > > > > > > the issue, however imho
> > > > > > > the major issue is not about this single assertion
> > > > > > > but the whole addr_mask
> > > > > > > issue behind with virtio...
> > > > > > I don't get it here; it looks to me that the range was from
> > > > > > guest IOMMU drivers.
> > > > > Yes.  Note that I didn't mean that it's a problem in virtio,
> > > > > it's just the fact
> > > > > that virtio is the only one I know that would like to
> > > > > support arbitrary address
> > > > > range for the translated region.  I don't know about tcg,
> > > > > but vfio should still
> > > > > need some kind of page alignment in both the address and the
> > > > > addr_mask.  We
> > > > > have that assumption too across the memory core when we do
> > > > > translations.
> > > > 
> > > > Right but it looks to me the issue is not the alignment.
> > > > 
> > > > 
> > > > > A further cause of the issue is the MSI region when vIOMMU
> > > > > enabled - currently
> > > > > we implemented the interrupt region using another memory
> > > > > region so it split the
> > > > > whole DMA region into two parts.  That's really a clean approach to IR
> > > > > implementation, however that's also a burden to the
> > > > > invalidation part because
> > > > > then we'll need to handle things like this when the listened
> > > > > range is not page
> > > > > aligned at all (neither 0-0xfedfffff, nor 0xfef00000-MAX).  If
> > > > > without the IR
> > > > > region (so the whole iommu address range will be a single FlatRange),
> > > > 
> > > > Is this a bug? I remember that at least for vtd, it won't do any
> > > > DMAR on the
> > > > interrupt address range
> > > I don't think it's a bug, at least it's working as how I
> > > understand...  that
> > > interrupt range is using an IR region, that's why I said the IR
> > > region splits
> > > the DMAR region into two pieces, so we have two FlatRange for the same
> > > IOMMUMemoryRegion.
> > 
> > 
> > I didn't check the qemu code, but if "a single FlatRange" means
> > 0xFEEx_xxxx is subject to DMA remapping, the OS needs to set up passthrough
> > mapping for that range in order to get MSI to work. This is not what vtd
> > spec said:
> > 
> > """
> > 
> > 3.14 Handling Requests to Interrupt Address Range
> > 
> > Requests without PASID to address range 0xFEEx_xxxx are treated as
> > potential interrupt requests and are not subjected to DMA remapping
> > (even if translation structures specify a mapping for this
> > range). Instead, remapping hardware can be enabled to subject such
> > interrupt requests to interrupt remapping.
> > 
> > """
> > 
> > My understanding is vtd won't do any DMA translation on 0xFEEx_xxxx even
> > if IR is not enabled.
> 
> 
> Ok, we had a dedicated mr for interrupt:
> 
> memory_region_add_subregion_overlap(MEMORY_REGION(_dev_as->iommu),
> VTD_INTERRUPT_ADDR_FIRST,
> _dev_as->iommu_ir, 1);
> 
> So it should be fine. I guess the reason that I'm asking is that I thought
> "IR" means "Interrupt remapping" but in fact it means "Interrupt Region"?
> 
> But I'm still not clear about the invalidation part for interrupt region,
> maybe you can elaborate a little more on this.
> 
> Btw, I think guest can trigger the assert in vtd_do_iommu_translate() if we
> teach vhost to DMA to that region:


Why would we want to?

> 
>     /*
>  * We have standalone memory region for interrupt addresses, we
>  * should never receive translation requests in this region.
>  */
>     assert(!vtd_is_interrupt_addr(addr));
> 
> Is this better to return false here? (We can work on the fix for vhost but
> it should be not trivial)
> 
> Thanks
> 




Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-06-30 Thread Jason Wang



On 2020/6/30 下午5:21, Michael S. Tsirkin wrote:

On Tue, Jun 30, 2020 at 04:29:19PM +0800, Jason Wang wrote:

On 2020/6/30 上午10:41, Jason Wang wrote:

On 2020/6/29 下午9:34, Peter Xu wrote:

On Mon, Jun 29, 2020 at 01:51:47PM +0800, Jason Wang wrote:

On 2020/6/28 下午10:47, Peter Xu wrote:

On Sun, Jun 28, 2020 at 03:03:41PM +0800, Jason Wang wrote:

On 2020/6/27 上午5:29, Peter Xu wrote:

Hi, Eugenio,

(CCing Eric, Yan and Michael too)

On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:

diff --git a/memory.c b/memory.c
index 2f15a4b250..7f789710d2 100644
--- a/memory.c
+++ b/memory.c
@@ -1915,8 +1915,6 @@ void
memory_region_notify_one(IOMMUNotifier
*notifier,
     return;
     }
-    assert(entry->iova >= notifier->start &&
entry_end <= notifier->end);

I can understand removing the assertion should solve
the issue, however imho
the major issue is not about this single assertion
but the whole addr_mask
issue behind with virtio...

I don't get it here; it looks to me that the range was from
guest IOMMU drivers.

Yes.  Note that I didn't mean that it's a problem in virtio,
it's just the fact
that virtio is the only one I know that would like to
support arbitrary address
range for the translated region.  I don't know about tcg,
but vfio should still
need some kind of page alignment in both the address and the
addr_mask.  We
have that assumption too across the memory core when we do
translations.

Right but it looks to me the issue is not the alignment.



A further cause of the issue is the MSI region when vIOMMU enabled - currently
we implemented the interrupt region using another memory region so it split the
whole DMA region into two parts.  That's really a clean approach to IR
implementation, however that's also a burden to the invalidation part because
then we'll need to handle things like this when the listened range is not page
aligned at all (neither 0-0xfedfffff, nor 0xfef00000-MAX).  If without the IR
region (so the whole iommu address range will be a single FlatRange),

Is this a bug? I remember that at least for vtd, it won't do any DMAR on the
interrupt address range

I don't think it's a bug, at least it's working as how I
understand...  that
interrupt range is using an IR region, that's why I said the IR
region splits
the DMAR region into two pieces, so we have two FlatRange for the same
IOMMUMemoryRegion.


I didn't check the qemu code, but if "a single FlatRange" means
0xFEEx_xxxx is subject to DMA remapping, the OS needs to set up passthrough
mapping for that range in order to get MSI to work. This is not what vtd
spec said:

"""

3.14 Handling Requests to Interrupt Address Range

Requests without PASID to address range 0xFEEx_xxxx are treated as
potential interrupt requests and are not subjected to DMA remapping
(even if translation structures specify a mapping for this
range). Instead, remapping hardware can be enabled to subject such
interrupt requests to interrupt remapping.

"""

My understanding is vtd won't do any DMA translation on 0xFEEx_xxxx even
if IR is not enabled.


Ok, we had a dedicated mr for interrupt:

memory_region_add_subregion_overlap(MEMORY_REGION(_dev_as->iommu),
VTD_INTERRUPT_ADDR_FIRST,
_dev_as->iommu_ir, 1);

So it should be fine. I guess the reason that I'm asking is that I thought
"IR" means "Interrupt remapping" but in fact it means "Interrupt Region"?

But I'm still not clear about the invalidation part for interrupt region,
maybe you can elaborate a little more on this.

Btw, I think guest can trigger the assert in vtd_do_iommu_translate() if we
teach vhost to DMA to that region:


Why would we want to?



I meant a buggy(malicious) guest driver.

Thanks





     /*
  * We have standalone memory region for interrupt addresses, we
  * should never receive translation requests in this region.
  */
     assert(!vtd_is_interrupt_addr(addr));

Is this better to return false here? (We can work on the fix for vhost but
it should be not trivial)

Thanks






Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-06-30 Thread Jason Wang



On 2020/6/30 上午10:41, Jason Wang wrote:


On 2020/6/29 下午9:34, Peter Xu wrote:

On Mon, Jun 29, 2020 at 01:51:47PM +0800, Jason Wang wrote:

On 2020/6/28 下午10:47, Peter Xu wrote:

On Sun, Jun 28, 2020 at 03:03:41PM +0800, Jason Wang wrote:

On 2020/6/27 上午5:29, Peter Xu wrote:

Hi, Eugenio,

(CCing Eric, Yan and Michael too)

On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:

diff --git a/memory.c b/memory.c
index 2f15a4b250..7f789710d2 100644
--- a/memory.c
+++ b/memory.c
@@ -1915,8 +1915,6 @@ void 
memory_region_notify_one(IOMMUNotifier *notifier,

    return;
    }
-    assert(entry->iova >= notifier->start && entry_end <= 
notifier->end);
I can understand removing the assertion should solve the issue, however imho
the major issue is not about this single assertion but the whole addr_mask
issue behind with virtio...
I don't get it here; it looks to me that the range was from guest IOMMU
drivers.
Yes.  Note that I didn't mean that it's a problem in virtio, it's 
just the fact
that virtio is the only one I know that would like to support 
arbitrary address
range for the translated region.  I don't know about tcg, but vfio 
should still
need some kind of page alignment in both the address and the 
addr_mask.  We
have that assumption too across the memory core when we do 
translations.


Right but it looks to me the issue is not the alignment.


A further cause of the issue is the MSI region when vIOMMU enabled - currently
we implemented the interrupt region using another memory region so it split the
whole DMA region into two parts.  That's really a clean approach to IR
implementation, however that's also a burden to the invalidation part because
then we'll need to handle things like this when the listened range is not page
aligned at all (neither 0-0xfedfffff, nor 0xfef00000-MAX).  If without the IR
region (so the whole iommu address range will be a single FlatRange),


Is this a bug? I remember that at least for vtd, it won't do any DMAR on the
interrupt address range

I don't think it's a bug, at least it's working as how I understand...  that
interrupt range is using an IR region, that's why I said the IR region splits
the DMAR region into two pieces, so we have two FlatRange for the same
IOMMUMemoryRegion.



I didn't check the qemu code, but if "a single FlatRange" means
0xFEEx_xxxx is subject to DMA remapping, the OS needs to set up passthrough
mapping for that range in order to get MSI to work. This is not what vtd
spec said:


"""

3.14 Handling Requests to Interrupt Address Range

Requests without PASID to address range 0xFEEx_xxxx are treated as
potential interrupt requests and are not subjected to DMA remapping
(even if translation structures specify a mapping for this
range). Instead, remapping hardware can be enabled to subject such
interrupt requests to interrupt remapping.

"""

My understanding is vtd won't do any DMA translation on 0xFEEx_xxxx even
if IR is not enabled.



Ok, we had a dedicated mr for interrupt:

memory_region_add_subregion_overlap(MEMORY_REGION(_dev_as->iommu),
VTD_INTERRUPT_ADDR_FIRST,
_dev_as->iommu_ir, 1);

So it should be fine. I guess the reason that I'm asking is that I 
thought "IR" means "Interrupt remapping" but in fact it means "Interrupt 
Region"?


But I'm still not clear about the invalidation part for interrupt 
region, maybe you can elaborate a little more on this.


Btw, I think the guest can trigger the assert in vtd_do_iommu_translate() if 
we teach vhost to DMA to that region:

    /*
     * We have standalone memory region for interrupt addresses, we
     * should never receive translation requests in this region.
     */
    assert(!vtd_is_interrupt_addr(addr));

Is it better to return false here? (We can work on the fix for vhost, 
but it won't be trivial)


Thanks





Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-06-29 Thread Jason Wang



On 2020/6/29 9:34 PM, Peter Xu wrote:

On Mon, Jun 29, 2020 at 01:51:47PM +0800, Jason Wang wrote:

On 2020/6/28 10:47 PM, Peter Xu wrote:

On Sun, Jun 28, 2020 at 03:03:41PM +0800, Jason Wang wrote:

On 2020/6/27 5:29 AM, Peter Xu wrote:

Hi, Eugenio,

(CCing Eric, Yan and Michael too)

On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:

diff --git a/memory.c b/memory.c
index 2f15a4b250..7f789710d2 100644
--- a/memory.c
+++ b/memory.c
@@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
return;
}
-assert(entry->iova >= notifier->start && entry_end <= notifier->end);

I can understand removing the assertion should solve the issue, however imho
the major issue is not about this single assertion but the whole addr_mask
issue behind with virtio...

I don't get it here; it looks to me the range was from guest IOMMU drivers.

Yes.  Note that I didn't mean that it's a problem in virtio, it's just the fact
that virtio is the only one I know that would like to support arbitrary address
range for the translated region.  I don't know about tcg, but vfio should still
need some kind of page alignment in both the address and the addr_mask.  We
have that assumption too across the memory core when we do translations.


Right but it looks to me the issue is not the alignment.



A further cause of the issue is the MSI region when vIOMMU enabled - currently
we implemented the interrupt region using another memory region so it split the
whole DMA region into two parts.  That's really a clean approach to IR
implementation, however that's also a burden to the invalidation part because
then we'll need to handle things like this when the listened range is not page
aligned at all (neither 0-0xfed, nor 0xfef-MAX).  If without the IR
region (so the whole iommu address range will be a single FlatRange),


Is this a bug? I remember that at least for vtd, it won't do any DMAR on the
interrupt address range

I don't think it's a bug, at least it's working as how I understand it...  that
interrupt range is using an IR region, that's why I said the IR region splits
the DMAR region into two pieces, so we have two FlatRanges for the same
IOMMUMemoryRegion.



I didn't check the qemu code, but if "a single FlatRange" means 
0xFEEx_ is subject to DMA remapping, the OS needs to set up a passthrough 
mapping for that range in order to get MSI to work. This is not what the vtd 
spec says:


"""

3.14 Handling Requests to Interrupt Address Range

Requests without PASID to address range 0xFEEx_ are treated as
potential interrupt requests and are not subjected to DMA remapping
(even if translation structures specify a mapping for this
range). Instead, remapping hardware can be enabled to subject such
interrupt requests to interrupt remapping.

"""

My understanding is vtd won't do any DMA translation on 0xFEEx_ even 
if IR is not enabled.








   I think
we probably don't need most of the logic in vtd_address_space_unmap() at all,
then we can directly deliver all the IOTLB invalidations without splitting into
small page aligned ranges to all the iommu notifiers.  Sadly, so far I still
don't have ideal solution for it, because we definitely need IR.


Another possible (theoretical) issue (for vhost) is that it can't trigger
interrupt through the interrupt range.

Hmm.. Could you explain?  When IR is enabled, all devices including virtio
who send interrupt to 0xfeeX should be trapped by IR.



I meant vhost, not virtio: if you teach vhost to DMA to 0xFEEx_, it 
can't generate interrupts as expected.








For normal IOTLB invalidations, we were trying our best to always make
IOMMUTLBEntry contain a valid addr_mask to be 2**N-1.  E.g., that's what we're
doing with the loop in vtd_address_space_unmap().

I'm not sure such an assumption can work for any type of IOMMU.



But this is not the first time that we may want to break this assumption for
virtio so that we make the IOTLB a tuple of (start, len), then that len can be
not an address mask any more.  That seems to be more efficient for things like
vhost because iotlbs there are not page based, so it'll be inefficient if we
always guarantee the addr_mask because it'll be quite a lot more roundtrips of
the same range of invalidation.  Here we've encountered another issue of
triggering the assertion with virtio-net, but only with the old RHEL7 guest.

I'm thinking whether we can make the IOTLB invalidation configurable by
specifying whether the backend of the notifier can handle arbitrary address
range in some way.  So we still have the guaranteed addr_masks by default
(since I still don't think totally breaking the addr_mask restriction is wise...),
however we can allow the special backends to take advantage of using arbitrary
(start, len) ranges for reasons like performance.

To do that, a quick idea is to introduce a flag IOMMU_NOTIFIER_ARBITRARY_MASK
to IOMMUNotifierFlag, to declare that the iommu notifier (and its backend) can

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-06-29 Thread Peter Xu
On Mon, Jun 29, 2020 at 01:51:47PM +0800, Jason Wang wrote:
> 
> > On 2020/6/28 10:47 PM, Peter Xu wrote:
> > On Sun, Jun 28, 2020 at 03:03:41PM +0800, Jason Wang wrote:
> > > On 2020/6/27 5:29 AM, Peter Xu wrote:
> > > > Hi, Eugenio,
> > > > 
> > > > (CCing Eric, Yan and Michael too)
> > > > 
> > > > On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
> > > > > diff --git a/memory.c b/memory.c
> > > > > index 2f15a4b250..7f789710d2 100644
> > > > > --- a/memory.c
> > > > > +++ b/memory.c
> > > > > @@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier 
> > > > > *notifier,
> > > > >return;
> > > > >}
> > > > > -assert(entry->iova >= notifier->start && entry_end <= 
> > > > > notifier->end);
> > > > I can understand removing the assertion should solve the issue, however 
> > > > imho
> > > > the major issue is not about this single assertion but the whole 
> > > > addr_mask
> > > > issue behind with virtio...
> > > 
> > > I don't get it here; it looks to me the range was from guest IOMMU drivers.
> > Yes.  Note that I didn't mean that it's a problem in virtio, it's just the 
> > fact
> > that virtio is the only one I know that would like to support arbitrary 
> > address
> > range for the translated region.  I don't know about tcg, but vfio should 
> > still
> > need some kind of page alignment in both the address and the addr_mask.  We
> > have that assumption too across the memory core when we do translations.
> 
> 
> Right but it looks to me the issue is not the alignment.
> 
> 
> > 
> > A further cause of the issue is the MSI region when vIOMMU enabled - 
> > currently
> > we implemented the interrupt region using another memory region so it split 
> > the
> > whole DMA region into two parts.  That's really a clean approach to IR
> > implementation, however that's also a burden to the invalidation part 
> > because
> > then we'll need to handle things like this when the listened range is not 
> > page
> > aligned at all (neither 0-0xfed, nor 0xfef-MAX).  If without the IR
> > region (so the whole iommu address range will be a single FlatRange),
> 
> 
> Is this a bug? I remember that at least for vtd, it won't do any DMAR on the
> interrupt address range

I don't think it's a bug, at least it's working as how I understand it...  that
interrupt range is using an IR region, that's why I said the IR region splits
the DMAR region into two pieces, so we have two FlatRanges for the same
IOMMUMemoryRegion.

> 
> 
> >   I think
> > we probably don't need most of the logic in vtd_address_space_unmap() at 
> > all,
> > then we can directly deliver all the IOTLB invalidations without splitting 
> > into
> > small page aligned ranges to all the iommu notifiers.  Sadly, so far I still
> > don't have ideal solution for it, because we definitely need IR.
> 
> 
> Another possible (theoretical) issue (for vhost) is that it can't trigger
> interrupt through the interrupt range.

Hmm.. Could you explain?  When IR is enabled, all devices including virtio
who send interrupt to 0xfeeX should be trapped by IR.

> 
> 
> > 
> > > 
> > > > For normal IOTLB invalidations, we were trying our best to always make
> > > > IOMMUTLBEntry contain a valid addr_mask to be 2**N-1.  E.g., that's 
> > > > what we're
> > > > doing with the loop in vtd_address_space_unmap().
> > > 
> > > I'm not sure such an assumption can work for any type of IOMMU.
> > > 
> > > 
> > > > But this is not the first time that we may want to break this 
> > > > assumption for
> > > > virtio so that we make the IOTLB a tuple of (start, len), then that len 
> > > > can be
> > > > > not an address mask any more.  That seems to be more efficient for 
> > > > things like
> > > > vhost because iotlbs there are not page based, so it'll be inefficient 
> > > > if we
> > > > always guarantee the addr_mask because it'll be quite a lot more 
> > > > roundtrips of
> > > > the same range of invalidation.  Here we've encountered another issue of
> > > > triggering the assertion with virtio-net, but only with the old RHEL7 
> > > > guest.
> > > > 
> > > > I'm thinking whether we can make the IOTLB invalidation configurable by
> > > > > specifying whether the backend of the notifier can handle arbitrary 
> > > > address
> > > > range in some way.  So we still have the guaranteed addr_masks by 
> > > > default
> > > > (since I still don't think totally break the addr_mask restriction is 
> > > > wise...),
> > > > > however we can allow the special backends to take advantage of using 
> > > > > arbitrary
> > > > (start, len) ranges for reasons like performance.
> > > > 
> > > > To do that, a quick idea is to introduce a flag 
> > > > IOMMU_NOTIFIER_ARBITRARY_MASK
> > > > to IOMMUNotifierFlag, to declare that the iommu notifier (and its 
> > > > backend) can
> > > > take arbitrary address mask, then it can be any value and finally 
> > > > becomes a
> > > > length rather than an addr_mask.  Then for every iommu notify() we can 
> > > > directly
> 

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-06-28 Thread Jason Wang



On 2020/6/28 10:47 PM, Peter Xu wrote:

On Sun, Jun 28, 2020 at 03:03:41PM +0800, Jason Wang wrote:

On 2020/6/27 5:29 AM, Peter Xu wrote:

Hi, Eugenio,

(CCing Eric, Yan and Michael too)

On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:

diff --git a/memory.c b/memory.c
index 2f15a4b250..7f789710d2 100644
--- a/memory.c
+++ b/memory.c
@@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
   return;
   }
-assert(entry->iova >= notifier->start && entry_end <= notifier->end);

I can understand removing the assertion should solve the issue, however imho
the major issue is not about this single assertion but the whole addr_mask
issue behind with virtio...


I don't get it here; it looks to me the range was from guest IOMMU drivers.

Yes.  Note that I didn't mean that it's a problem in virtio, it's just the fact
that virtio is the only one I know that would like to support arbitrary address
range for the translated region.  I don't know about tcg, but vfio should still
need some kind of page alignment in both the address and the addr_mask.  We
have that assumption too across the memory core when we do translations.



Right but it looks to me the issue is not the alignment.




A further cause of the issue is the MSI region when vIOMMU enabled - currently
we implemented the interrupt region using another memory region so it split the
whole DMA region into two parts.  That's really a clean approach to IR
implementation, however that's also a burden to the invalidation part because
then we'll need to handle things like this when the listened range is not page
aligned at all (neither 0-0xfed, nor 0xfef-MAX).  If without the IR
region (so the whole iommu address range will be a single FlatRange),



Is this a bug? I remember that at least for vtd, it won't do any DMAR on 
the interrupt address range




  I think
we probably don't need most of the logic in vtd_address_space_unmap() at all,
then we can directly deliver all the IOTLB invalidations without splitting into
small page aligned ranges to all the iommu notifiers.  Sadly, so far I still
don't have ideal solution for it, because we definitely need IR.



Another possible (theoretical) issue (for vhost) is that it can't 
trigger interrupt through the interrupt range.








For normal IOTLB invalidations, we were trying our best to always make
IOMMUTLBEntry contain a valid addr_mask to be 2**N-1.  E.g., that's what we're
doing with the loop in vtd_address_space_unmap().


I'm not sure such an assumption can work for any type of IOMMU.



But this is not the first time that we may want to break this assumption for
virtio so that we make the IOTLB a tuple of (start, len), then that len can be
not an address mask any more.  That seems to be more efficient for things like
vhost because iotlbs there are not page based, so it'll be inefficient if we
always guarantee the addr_mask because it'll be quite a lot more roundtrips of
the same range of invalidation.  Here we've encountered another issue of
triggering the assertion with virtio-net, but only with the old RHEL7 guest.

I'm thinking whether we can make the IOTLB invalidation configurable by
specifying whether the backend of the notifier can handle arbitrary address
range in some way.  So we still have the guaranteed addr_masks by default
(since I still don't think totally breaking the addr_mask restriction is wise...),
however we can allow the special backends to take advantage of using arbitrary
(start, len) ranges for reasons like performance.

To do that, a quick idea is to introduce a flag IOMMU_NOTIFIER_ARBITRARY_MASK
to IOMMUNotifierFlag, to declare that the iommu notifier (and its backend) can
take arbitrary address mask, then it can be any value and finally becomes a
length rather than an addr_mask.  Then for every iommu notify() we can directly
deliver whatever we've got from the upper layer to this notifier.  With the new
flag, vhost can do iommu_notifier_init() with UNMAP|ARBITRARY_MASK so it
declares this capability.  Then no matter for device iotlb or normal iotlb, we
skip the complicated procedure to split a big range into small ranges that are
with strict addr_mask, but directly deliver the message to the iommu notifier.
E.g., we can skip the loop in vtd_address_space_unmap() if the notifier is with
ARBITRARY flag set.


I'm not sure coupling IOMMU capability to notifier is the best choice.

IMHO it's not an IOMMU capability.  The flag I wanted to introduce is a
capability of the one who listens to the IOMMU TLB updates.  For our case, it's
virtio/vhost's capability to allow arbitrary length. The IOMMU itself
definitely has some limitation on the address range to be bound to an IOTLB
invalidation, e.g., the device-iotlb we're talking about here only accepts both the
iova address and addr_mask to be aligned to 2**N-1.



I think this goes back to one of our previous discussions of whether to 
introduce dedicated notifiers for device IOTLB.


For IOMMU, 

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-06-28 Thread Peter Xu
On Sun, Jun 28, 2020 at 03:03:41PM +0800, Jason Wang wrote:
> 
> On 2020/6/27 5:29 AM, Peter Xu wrote:
> > Hi, Eugenio,
> > 
> > (CCing Eric, Yan and Michael too)
> > 
> > On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
> > > diff --git a/memory.c b/memory.c
> > > index 2f15a4b250..7f789710d2 100644
> > > --- a/memory.c
> > > +++ b/memory.c
> > > @@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier 
> > > *notifier,
> > >   return;
> > >   }
> > > -assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> > I can understand removing the assertion should solve the issue, however imho
> > the major issue is not about this single assertion but the whole addr_mask
> > issue behind with virtio...
> 
> 
> I don't get it here; it looks to me the range was from guest IOMMU drivers.

Yes.  Note that I didn't mean that it's a problem in virtio, it's just the fact
that virtio is the only one I know that would like to support arbitrary address
range for the translated region.  I don't know about tcg, but vfio should still
need some kind of page alignment in both the address and the addr_mask.  We
have that assumption too across the memory core when we do translations.

A further cause of the issue is the MSI region when vIOMMU enabled - currently
we implemented the interrupt region using another memory region so it split the
whole DMA region into two parts.  That's really a clean approach to IR
implementation, however that's also a burden to the invalidation part because
then we'll need to handle things like this when the listened range is not page
aligned at all (neither 0-0xfed, nor 0xfef-MAX).  If without the IR
region (so the whole iommu address range will be a single FlatRange), I think
we probably don't need most of the logic in vtd_address_space_unmap() at all,
then we can directly deliver all the IOTLB invalidations without splitting into
small page aligned ranges to all the iommu notifiers.  Sadly, so far I still
don't have ideal solution for it, because we definitely need IR.

> 
> 
> > 
> > For normal IOTLB invalidations, we were trying our best to always make
> > IOMMUTLBEntry contain a valid addr_mask to be 2**N-1.  E.g., that's what 
> > we're
> > doing with the loop in vtd_address_space_unmap().
> 
> 
> I'm not sure such an assumption can work for any type of IOMMU.
> 
> 
> > 
> > But this is not the first time that we may want to break this assumption for
> > virtio so that we make the IOTLB a tuple of (start, len), then that len can 
> > be
> > not an address mask any more.  That seems to be more efficient for things 
> > like
> > vhost because iotlbs there are not page based, so it'll be inefficient if we
> > always guarantee the addr_mask because it'll be quite a lot more roundtrips 
> > of
> > the same range of invalidation.  Here we've encountered another issue of
> > triggering the assertion with virtio-net, but only with the old RHEL7 guest.
> > 
> > I'm thinking whether we can make the IOTLB invalidation configurable by
> > specifying whether the backend of the notifier can handle arbitrary address
> > range in some way.  So we still have the guaranteed addr_masks by default
> > (since I still don't think totally break the addr_mask restriction is 
> > wise...),
> > however we can allow the special backends to take advantage of using 
> > arbitrary
> > (start, len) ranges for reasons like performance.
> > 
> > To do that, a quick idea is to introduce a flag 
> > IOMMU_NOTIFIER_ARBITRARY_MASK
> > to IOMMUNotifierFlag, to declare that the iommu notifier (and its backend) 
> > can
> > take arbitrary address mask, then it can be any value and finally becomes a
> > length rather than an addr_mask.  Then for every iommu notify() we can 
> > directly
> > deliver whatever we've got from the upper layer to this notifier.  With the 
> > new
> > flag, vhost can do iommu_notifier_init() with UNMAP|ARBITRARY_MASK so it
> > declares this capability.  Then no matter for device iotlb or normal iotlb, 
> > we
> > skip the complicated procedure to split a big range into small ranges that 
> > are
> > with strict addr_mask, but directly deliver the message to the iommu 
> > notifier.
> > E.g., we can skip the loop in vtd_address_space_unmap() if the notifier is 
> > with
> > ARBITRARY flag set.
> 
> 
> I'm not sure coupling IOMMU capability to notifier is the best choice.

IMHO it's not an IOMMU capability.  The flag I wanted to introduce is a
capability of the one who listens to the IOMMU TLB updates.  For our case, it's
virtio/vhost's capability to allow arbitrary length. The IOMMU itself
definitely has some limitation on the address range to be bound to an IOTLB
invalidation, e.g., the device-iotlb we're talking about here only accepts both the
iova address and addr_mask to be aligned to 2**N-1.

> 
> How about just converting to use a range [start, end] for any notifier and
> moving the checks (e.g. the assert) into the actual notifier implementation (vhost or
> 

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-06-28 Thread Jason Wang



On 2020/6/27 5:29 AM, Peter Xu wrote:

Hi, Eugenio,

(CCing Eric, Yan and Michael too)

On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:

diff --git a/memory.c b/memory.c
index 2f15a4b250..7f789710d2 100644
--- a/memory.c
+++ b/memory.c
@@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
  return;
  }
  
-assert(entry->iova >= notifier->start && entry_end <= notifier->end);

I can understand removing the assertion should solve the issue, however imho
the major issue is not about this single assertion but the whole addr_mask
issue behind with virtio...



I don't get it here; it looks to me the range was from guest IOMMU drivers.




For normal IOTLB invalidations, we were trying our best to always make
IOMMUTLBEntry contain a valid addr_mask to be 2**N-1.  E.g., that's what we're
doing with the loop in vtd_address_space_unmap().



I'm not sure such an assumption can work for any type of IOMMU.




But this is not the first time that we may want to break this assumption for
virtio so that we make the IOTLB a tuple of (start, len), then that len can be
not an address mask any more.  That seems to be more efficient for things like
vhost because iotlbs there are not page based, so it'll be inefficient if we
always guarantee the addr_mask because it'll be quite a lot more roundtrips of
the same range of invalidation.  Here we've encountered another issue of
triggering the assertion with virtio-net, but only with the old RHEL7 guest.

I'm thinking whether we can make the IOTLB invalidation configurable by
specifying whether the backend of the notifier can handle arbitrary address
range in some way.  So we still have the guaranteed addr_masks by default
(since I still don't think totally breaking the addr_mask restriction is wise...),
however we can allow the special backends to take advantage of using arbitrary
(start, len) ranges for reasons like performance.

To do that, a quick idea is to introduce a flag IOMMU_NOTIFIER_ARBITRARY_MASK
to IOMMUNotifierFlag, to declare that the iommu notifier (and its backend) can
take arbitrary address mask, then it can be any value and finally becomes a
length rather than an addr_mask.  Then for every iommu notify() we can directly
deliver whatever we've got from the upper layer to this notifier.  With the new
flag, vhost can do iommu_notifier_init() with UNMAP|ARBITRARY_MASK so it
declares this capability.  Then no matter for device iotlb or normal iotlb, we
skip the complicated procedure to split a big range into small ranges that are
with strict addr_mask, but directly deliver the message to the iommu notifier.
E.g., we can skip the loop in vtd_address_space_unmap() if the notifier is with
ARBITRARY flag set.



I'm not sure coupling IOMMU capability to notifier is the best choice.

How about just converting to use a range [start, end] for any notifier and 
moving the checks (e.g. the assert) into the actual notifier implementation 
(vhost or vfio)?


Thanks




Then, the assert() is not accurate either, and may become something like:

diff --git a/memory.c b/memory.c
index 2f15a4b250..99d0492509 100644
--- a/memory.c
+++ b/memory.c
@@ -1906,6 +1906,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
  {
  IOMMUNotifierFlag request_flags;
  hwaddr entry_end = entry->iova + entry->addr_mask;
+IOMMUTLBEntry tmp = *entry;

  /*
   * Skip the notification if the notification does not overlap
@@ -1915,7 +1916,13 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
  return;
  }

-assert(entry->iova >= notifier->start && entry_end <= notifier->end);
+if (notifier->notifier_flags & IOMMU_NOTIFIER_ARBITRARY_MASK) {
+tmp.iova = MAX(tmp.iova, notifier->start);
+tmp.addr_mask = MIN(tmp.addr_mask, notifier->end);
+assert(tmp.iova <= tmp.addr_mask);
+} else {
+assert(entry->iova >= notifier->start && entry_end <= notifier->end);
+}

  if (entry->perm & IOMMU_RW) {
  request_flags = IOMMU_NOTIFIER_MAP;
@@ -1924,7 +1931,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
  }

  if (notifier->notifier_flags & request_flags) {
-notifier->notify(notifier, entry);
+notifier->notify(notifier, &tmp);
  }
  }

Then we can keep the assert() for e.g. vfio, however vhost can skip it and even
get some further performance boosts..  Does that make sense?

Thanks,






Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-06-27 Thread Yan Zhao
On Sat, Jun 27, 2020 at 08:57:14AM -0400, Peter Xu wrote:
> On Sat, Jun 27, 2020 at 03:26:45AM -0400, Yan Zhao wrote:
> > > -assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> > > +if (notifier->notifier_flags & IOMMU_NOTIFIER_ARBITRARY_MASK) {
> > > +tmp.iova = MAX(tmp.iova, notifier->start);
> > > +tmp.addr_mask = MIN(tmp.addr_mask, notifier->end);
> > NIT:
> >tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
> 
> Right.  Thanks. :)
> 
> > > +assert(tmp.iova <= tmp.addr_mask);
> > no this assertion then.
> 
> Or change it into:
> 
>   assert(MIN(entry_end, notifier->end) >= tmp.iova);
> 
> To double confirm no overflow.
>
What about asserting in this way, so that it's also useful to check overflow
in the other condition:

hwaddr entry_end = entry->iova + entry->addr_mask;
+
+ assert(notifier->end >= notifier->start && entry_end >= entry->iova);


then as there's a following filter
if (notifier->start > entry_end || notifier->end < entry->iova) {
return;
}

we can conclude that

entry_end >= entry->iova (tmp.iova)
entry_end >= notifier->start,
--> entry_end >= MAX(tmp.iova, notifier->start)
--> entry_end >= tmp.iova


notifier->end >= entry->iova (tmp.iova),
notifier->end >= notifier->start,
--> notifier->end >= MAX(tmp.iova, notifier->start)
--> notifier->end >= tmp.iova

==> MIN(entry_end, notifier->end) >= tmp.iova

Thanks
Yan



Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-06-27 Thread Peter Xu
On Sat, Jun 27, 2020 at 03:26:45AM -0400, Yan Zhao wrote:
> > -assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> > +if (notifier->notifier_flags & IOMMU_NOTIFIER_ARBITRARY_MASK) {
> > +tmp.iova = MAX(tmp.iova, notifier->start);
> > +tmp.addr_mask = MIN(tmp.addr_mask, notifier->end);
> NIT:
>tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;

Right.  Thanks. :)

> > +assert(tmp.iova <= tmp.addr_mask);
> no this assertion then.

Or change it into:

  assert(MIN(entry_end, notifier->end) >= tmp.iova);

To double confirm no overflow.

-- 
Peter Xu




Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-06-27 Thread Yan Zhao
On Fri, Jun 26, 2020 at 05:29:17PM -0400, Peter Xu wrote:
> Hi, Eugenio,
> 
> (CCing Eric, Yan and Michael too)

> On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
> > diff --git a/memory.c b/memory.c
> > index 2f15a4b250..7f789710d2 100644
> > --- a/memory.c
> > +++ b/memory.c
> > @@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> >  return;
> >  }
> >  
> > -assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> 
> I can understand removing the assertion should solve the issue, however imho
> the major issue is not about this single assertion but the whole addr_mask
> issue behind with virtio...
Yes, the background for this assertion is
https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04218.html


> 
> For normal IOTLB invalidations, we were trying our best to always make
> IOMMUTLBEntry contain a valid addr_mask to be 2**N-1.  E.g., that's what we're
> doing with the loop in vtd_address_space_unmap().
> 
> But this is not the first time that we may want to break this assumption for
> virtio so that we make the IOTLB a tuple of (start, len), then that len can be
> not an address mask any more.  That seems to be more efficient for things like
> vhost because iotlbs there are not page based, so it'll be inefficient if we
> always guarantee the addr_mask because it'll be quite a lot more roundtrips of
> the same range of invalidation.  Here we've encountered another issue of
> triggering the assertion with virtio-net, but only with the old RHEL7 guest.
> 
> I'm thinking whether we can make the IOTLB invalidation configurable by
> specifying whether the backend of the notifier can handle arbitrary address
> range in some way.  So we still have the guaranteed addr_masks by default
> (since I still don't think totally break the addr_mask restriction is 
> wise...),
> however we can allow the special backends to take advantage of using arbitrary
> (start, len) ranges for reasons like performance.
> 
> To do that, a quick idea is to introduce a flag IOMMU_NOTIFIER_ARBITRARY_MASK
> to IOMMUNotifierFlag, to declare that the iommu notifier (and its backend) can
> take arbitrary address mask, then it can be any value and finally becomes a
> length rather than an addr_mask.  Then for every iommu notify() we can 
> directly
> deliver whatever we've got from the upper layer to this notifier.  With the 
> new
> flag, vhost can do iommu_notifier_init() with UNMAP|ARBITRARY_MASK so it
> declares this capability.  Then no matter for device iotlb or normal iotlb, we
> skip the complicated procedure to split a big range into small ranges that are
> with strict addr_mask, but directly deliver the message to the iommu notifier.
> E.g., we can skip the loop in vtd_address_space_unmap() if the notifier is 
> with
> ARBITRARY flag set.
> 
> Then, the assert() is not accurate either, and may become something like:
> 
> diff --git a/memory.c b/memory.c
> index 2f15a4b250..99d0492509 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1906,6 +1906,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>  {
>  IOMMUNotifierFlag request_flags;
>  hwaddr entry_end = entry->iova + entry->addr_mask;
> +IOMMUTLBEntry tmp = *entry;
> 
>  /*
>   * Skip the notification if the notification does not overlap
> @@ -1915,7 +1916,13 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>  return;
>  }
> 
> -assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> +if (notifier->notifier_flags & IOMMU_NOTIFIER_ARBITRARY_MASK) {
> +tmp.iova = MAX(tmp.iova, notifier->start);
> +tmp.addr_mask = MIN(tmp.addr_mask, notifier->end);
NIT:
   tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
> +assert(tmp.iova <= tmp.addr_mask);
no this assertion then.

Thanks
Yan
   
> +} else {
> +assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> +}
> 
>  if (entry->perm & IOMMU_RW) {
>  request_flags = IOMMU_NOTIFIER_MAP;
> @@ -1924,7 +1931,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>  }
> 
>  if (notifier->notifier_flags & request_flags) {
> -notifier->notify(notifier, entry);
> +notifier->notify(notifier, &tmp);
>  }
>  }
> 
> Then we can keep the assert() for e.g. vfio, however vhost can skip it and 
> even
> get some further performance boosts..  Does that make sense?
> 
> Thanks,
> 
> -- 
> Peter Xu
> 
> 



Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-06-26 Thread Peter Xu
Hi, Eugenio,

(CCing Eric, Yan and Michael too)

On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
> diff --git a/memory.c b/memory.c
> index 2f15a4b250..7f789710d2 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>          return;
>      }
> 
> -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);

I can understand that removing the assertion should solve the issue; however,
IMHO the major issue is not this single assertion but the whole addr_mask
problem behind it with virtio...

For normal IOTLB invalidations, we were trying our best to always make the
IOMMUTLBEntry carry a valid addr_mask of the form 2**N-1.  E.g., that's what
we're doing with the loop in vtd_address_space_unmap().

But this is not the first time that we may want to break this assumption for
virtio, making the IOTLB a tuple of (start, len) where that len no longer needs
to be an address mask.  That seems to be more efficient for things like vhost,
because iotlbs there are not page based, so it'll be inefficient if we always
guarantee the addr_mask: it means quite a lot more roundtrips for the same
range of invalidation.  Here we've encountered another issue of triggering the
assertion with virtio-net, but only with the old RHEL7 guest.

I'm thinking whether we can make the IOTLB invalidation configurable by
specifying whether the backend of the notifier can handle arbitrary address
ranges in some way.  So we still have the guaranteed addr_masks by default
(since I still don't think totally breaking the addr_mask restriction is wise...),
however we can allow the special backends to take advantage of using arbitrary
(start, len) ranges for reasons like performance.

To do that, a quick idea is to introduce a flag IOMMU_NOTIFIER_ARBITRARY_MASK
to IOMMUNotifierFlag, to declare that the iommu notifier (and its backend) can
take an arbitrary address mask; then it can be any value, effectively becoming
a length rather than an addr_mask.  Then for every iommu notify() we can
directly deliver whatever we've got from the upper layer to this notifier.
With the new flag, vhost can do iommu_notifier_init() with UNMAP|ARBITRARY_MASK
so it declares this capability.  Then, whether for device iotlb or normal
iotlb, we skip the complicated procedure of splitting a big range into small
ranges with strict addr_masks, and directly deliver the message to the iommu
notifier.  E.g., we can skip the loop in vtd_address_space_unmap() if the
notifier has the ARBITRARY flag set.

Then, the assert() is not accurate either, and may become something like:

diff --git a/memory.c b/memory.c
index 2f15a4b250..99d0492509 100644
--- a/memory.c
+++ b/memory.c
@@ -1906,6 +1906,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
 {
     IOMMUNotifierFlag request_flags;
     hwaddr entry_end = entry->iova + entry->addr_mask;
+    IOMMUTLBEntry tmp = *entry;

     /*
      * Skip the notification if the notification does not overlap
@@ -1915,7 +1916,13 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
         return;
     }

-    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
+    if (notifier->notifier_flags & IOMMU_NOTIFIER_ARBITRARY_MASK) {
+        tmp.iova = MAX(tmp.iova, notifier->start);
+        tmp.addr_mask = MIN(tmp.addr_mask, notifier->end);
+        assert(tmp.iova <= tmp.addr_mask);
+    } else {
+        assert(entry->iova >= notifier->start && entry_end <= notifier->end);
+    }

     if (entry->perm & IOMMU_RW) {
         request_flags = IOMMU_NOTIFIER_MAP;
@@ -1924,7 +1931,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
     }

     if (notifier->notifier_flags & request_flags) {
-        notifier->notify(notifier, entry);
+        notifier->notify(notifier, &tmp);
     }
 }

Then we can keep the assert() for e.g. vfio, however vhost can skip it and even
get some further performance boosts.  Does that make sense?

Thanks,

-- 
Peter Xu