Re: [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty page sync
On Wed, Jun 07, 2023 at 03:14:07AM +, Duan, Zhenzhong wrote:
> >-Original Message-
> >From: Peter Xu
> >Sent: Tuesday, June 6, 2023 11:42 PM
> >Subject: Re: [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty
> >page sync
> >
> ...
> >> >> a/include/exec/memory.h b/include/exec/memory.h
> >> >> index c3661b2276c7..eecc3eec6702 100644
> >> >> --- a/include/exec/memory.h
> >> >> +++ b/include/exec/memory.h
> >> >> @@ -142,6 +142,10 @@ struct IOMMUTLBEntry {
> >> >>  * events (e.g. VFIO). Both notifications must be accurate so that
> >> >>  * the shadow page table is fully in sync with the guest view.
> >> >>  *
> >> >> + * Besides MAP, there is a special use case called FULL_MAP which
> >> >> + * requests notification for all the existent mappings (e.g. VFIO
> >> >> + * dirty page sync).
> >> >
> >> >Why do we need FULL_MAP? Can we simply reimpl MAP?
> >>
> >> Sorry, I just realized IOMMU_NOTIFIER_FULL_MAP is confusing.
> >> Maybe IOMMU_NOTIFIER_MAP_FAST_PATH could be a bit more accurate.
> >>
> >> IIUC, currently replay() is called from two paths, one is VFIO device
> >> address space switch which walks over the IOMMU page table to set up
> >> the initial mapping and cache it in the IOVA tree. The other is VFIO
> >> dirty sync which walks over the IOMMU page table to notify the mapping;
> >> because we already cache the mapping in the IOVA tree and VFIO dirty
> >> sync is protected by the BQL, I think it's fine to pick mappings from
> >> the IOVA tree directly instead of walking over the IOMMU page table.
> >> That's the reason for FULL_MAP (IOMMU_NOTIFIER_MAP_FAST_PATH is better).
> >>
> >> About "reimpl MAP", do you mean to walk over the IOMMU page table to
> >> notify all existing MAP events without checking with the IOVA tree for
> >> differences? If you prefer, I'll rewrite the implementation this way.
> >
> >We still need to maintain the iova tree. IIUC that's the major complexity
> >of vt-d emulation, because we have that extra cache layer to sync with the
> >real guest iommu pgtables.
>
> Can't agree more, looks like only intel-iommu and virtio-iommu implement
> such an optimization for now.
>
> >
> >But I think we were just wrong to also notify in the unmap_all() procedure.
> >
> >IIUC the right thing to do (keeping replay() the interface as-is, per how
> >it used to be defined) is we should replace the unmap_all() to only
> >evacuate the iova tree (keeping all host mappings untouched, IOW, don't
> >notify UNMAP), and do a full resync there, which will notify all existing
> >mappings as MAP. Then we don't interrupt any existing mapping if there is
> >one (e.g. for the dirty sync case), meanwhile we still keep the sync up to
> >date (for moving a vfio device into an existing iommu group).
> >
> >Do you think that'll work for us?
>
> Yes, I think I get your point.
> Below simple change will work in your suggested way, do you agree?
>
> @@ -3825,13 +3833,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>      IntelIOMMUState *s = vtd_as->iommu_state;
>      uint8_t bus_n = pci_bus_num(vtd_as->bus);
>      VTDContextEntry ce;
> +    DMAMap map = { .iova = 0, .size = HWADDR_MAX };
>
> -    /*
> -     * The replay can be triggered by either a invalidation or a newly
> -     * created entry. No matter what, we release existing mappings
> -     * (it means flushing caches for UNMAP-only registers).
> -     */
> -    vtd_address_space_unmap(vtd_as, n);
> +    /* replay is protected by BQL, page walk will re-setup IOVA tree safely */
> +    iova_tree_remove(vtd_as->iova_tree, map);
>
>      if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
>          trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" :

Yes, thanks!

--
Peter Xu
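To make the agreed-on semantics concrete, below is a minimal, self-contained
model -- not QEMU code; every name in it (Range, cache_evacuate, notify_map,
replay) is invented for illustration. It shows the key property of the final
change: only the QEMU-side IOVA cache is evacuated, no UNMAP notification is
ever produced, and the subsequent walk re-announces every current guest
mapping as MAP.

/*
 * replay-model.c: toy model of "evacuate cache + full MAP resync".
 * Build: cc -o replay-model replay-model.c
 */
#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_RANGES 8

/* Stand-in for DMAMap: a range [iova, iova + size], size inclusive. */
typedef struct {
    uint64_t iova;
    uint64_t size;
    bool valid;
} Range;

static Range cache[MAX_RANGES];        /* models the per-device IOVA tree */
static const Range guest_pgtable[] = { /* models the guest IOMMU page table */
    { 0x1000,  0xfff, true },
    { 0x40000, 0xfff, true },
};
static int unmap_events;               /* UNMAP notifications sent to the host */

static void notify_map(const Range *r)
{
    printf("MAP notify: iova=0x%" PRIx64 " size=0x%" PRIx64 "\n",
           r->iova, r->size);
}

/* Evacuate the local cache only: the host mappings are never touched. */
static void cache_evacuate(void)
{
    for (unsigned i = 0; i < MAX_RANGES; i++) {
        cache[i].valid = false;        /* note: unmap_events stays at 0 */
    }
}

/* The replay: drop the stale cache, then re-walk the guest table,
 * refilling the cache and notifying every existing mapping as MAP. */
static void replay(void)
{
    cache_evacuate();
    for (unsigned i = 0; i < sizeof(guest_pgtable) / sizeof(guest_pgtable[0]); i++) {
        cache[i] = guest_pgtable[i];
        notify_map(&cache[i]);
    }
}

int main(void)
{
    replay();   /* e.g. triggered by a VFIO dirty sync */
    replay();   /* idempotent: repeating it still sends no UNMAP */
    printf("UNMAP notifications: %d\n", unmap_events);   /* prints 0 */
    return 0;
}

Because the host view is never emptied, a DMA arriving at any point during
the replay keeps working -- that is exactly the window the original
unmap_all() call opened.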
RE: [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty page sync
>-Original Message-
>From: Peter Xu
>Sent: Tuesday, June 6, 2023 11:42 PM
>Subject: Re: [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty
>page sync
>
...
>> >> a/include/exec/memory.h b/include/exec/memory.h
>> >> index c3661b2276c7..eecc3eec6702 100644
>> >> --- a/include/exec/memory.h
>> >> +++ b/include/exec/memory.h
>> >> @@ -142,6 +142,10 @@ struct IOMMUTLBEntry {
>> >>  * events (e.g. VFIO). Both notifications must be accurate so that
>> >>  * the shadow page table is fully in sync with the guest view.
>> >>  *
>> >> + * Besides MAP, there is a special use case called FULL_MAP which
>> >> + * requests notification for all the existent mappings (e.g. VFIO
>> >> + * dirty page sync).
>> >
>> >Why do we need FULL_MAP? Can we simply reimpl MAP?
>>
>> Sorry, I just realized IOMMU_NOTIFIER_FULL_MAP is confusing.
>> Maybe IOMMU_NOTIFIER_MAP_FAST_PATH could be a bit more accurate.
>>
>> IIUC, currently replay() is called from two paths, one is VFIO device
>> address space switch which walks over the IOMMU page table to set up
>> the initial mapping and cache it in the IOVA tree. The other is VFIO
>> dirty sync which walks over the IOMMU page table to notify the mapping;
>> because we already cache the mapping in the IOVA tree and VFIO dirty
>> sync is protected by the BQL, I think it's fine to pick mappings from
>> the IOVA tree directly instead of walking over the IOMMU page table.
>> That's the reason for FULL_MAP (IOMMU_NOTIFIER_MAP_FAST_PATH is better).
>>
>> About "reimpl MAP", do you mean to walk over the IOMMU page table to
>> notify all existing MAP events without checking with the IOVA tree for
>> differences? If you prefer, I'll rewrite the implementation this way.
>
>We still need to maintain the iova tree. IIUC that's the major complexity
>of vt-d emulation, because we have that extra cache layer to sync with the
>real guest iommu pgtables.

Can't agree more, looks like only intel-iommu and virtio-iommu implement
such an optimization for now.

>
>But I think we were just wrong to also notify in the unmap_all() procedure.
>
>IIUC the right thing to do (keeping replay() the interface as-is, per how
>it used to be defined) is we should replace the unmap_all() to only
>evacuate the iova tree (keeping all host mappings untouched, IOW, don't
>notify UNMAP), and do a full resync there, which will notify all existing
>mappings as MAP. Then we don't interrupt any existing mapping if there is
>one (e.g. for the dirty sync case), meanwhile we still keep the sync up to
>date (for moving a vfio device into an existing iommu group).
>
>Do you think that'll work for us?

Yes, I think I get your point.
Below simple change will work in your suggested way, do you agree?

@@ -3825,13 +3833,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     IntelIOMMUState *s = vtd_as->iommu_state;
     uint8_t bus_n = pci_bus_num(vtd_as->bus);
     VTDContextEntry ce;
+    DMAMap map = { .iova = 0, .size = HWADDR_MAX };

-    /*
-     * The replay can be triggered by either a invalidation or a newly
-     * created entry. No matter what, we release existing mappings
-     * (it means flushing caches for UNMAP-only registers).
-     */
-    vtd_address_space_unmap(vtd_as, n);
+    /* replay is protected by BQL, page walk will re-setup IOVA tree safely */
+    iova_tree_remove(vtd_as->iova_tree, map);

     if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
         trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" :

Thanks
Zhenzhong
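The distinction drawn above -- the page walk only notifies *changed*
mappings against the IOVA tree -- is the crux of why plain MAP notifications
are not enough for dirty sync. The toy program below (invented names, not
QEMU code) models that diff-based walk: a second walk over an unchanged
guest table produces zero events, so a listener that registers late, like
VFIO dirty sync, would learn nothing about the existing mappings.

/*
 * diff-walk.c: toy model of diff-based MAP/UNMAP notification.
 * Build: cc -o diff-walk diff-walk.c
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define N 4

typedef struct {
    uint64_t iova;
    bool valid;
} Entry;

static Entry cache[N];   /* the IOVA tree: what QEMU has already notified */

/* Walk the guest table and notify only the differences against the cache. */
static int walk_and_notify_diff(const Entry *guest, int n)
{
    int notified = 0;
    for (int i = 0; i < n; i++) {
        if (guest[i].valid && !cache[i].valid) {
            printf("MAP notify: iova=0x%llx\n",
                   (unsigned long long)guest[i].iova);
            cache[i] = guest[i];
            notified++;
        } else if (!guest[i].valid && cache[i].valid) {
            printf("UNMAP notify: iova=0x%llx\n",
                   (unsigned long long)cache[i].iova);
            cache[i].valid = false;
            notified++;
        }
        /* unchanged entries are skipped: no event is generated */
    }
    return notified;
}

int main(void)
{
    Entry guest[N] = { { 0x1000, true }, { 0x2000, true } };

    walk_and_notify_diff(guest, N);   /* first walk: two MAP events */
    int later = walk_and_notify_diff(guest, N);
    printf("second walk notified %d events\n", later);   /* 0: a late
                                                            listener sees nothing */
    return 0;
}

That gap is what the series closes -- v2 by adding a FULL_MAP replay that
reads the tree itself, and the final approach by evacuating the tree first
so the normal walk re-notifies everything.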
Re: [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty page sync
On Tue, Jun 06, 2023 at 02:35:41AM +, Duan, Zhenzhong wrote:
> >-Original Message-
> >From: Peter Xu
> >Sent: Tuesday, June 6, 2023 2:39 AM
> >To: Duan, Zhenzhong
> >Cc: qemu-devel@nongnu.org; m...@redhat.com; jasow...@redhat.com;
> >pbonz...@redhat.com; richard.hender...@linaro.org; edua...@habkost.net;
> >marcel.apfelb...@gmail.com; alex.william...@redhat.com;
> >c...@redhat.com; da...@redhat.com; phi...@linaro.org;
> >kwankh...@nvidia.com; c...@nvidia.com; Liu, Yi L ; Peng, Chao P
> >Subject: Re: [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty
> >page sync
> >
> >On Thu, Jun 01, 2023 at 02:33:18PM +0800, Zhenzhong Duan wrote:
> >> Peter Xu found a potential issue:
> >>
> >> "The other thing is when I am looking at the new code I found that we
> >> actually extended the replay() to be used also in dirty tracking of vfio,
> >> in vfio_sync_dirty_bitmap(). For that maybe it's already broken if
> >> unmap_all() because afaiu log_sync() can be called in migration thread
> >> anytime during DMA so I think it means the device is prone to DMA with the
> >> IOMMU pgtable quickly erased and rebuilt here, which means the DMA could
> >> fail unexpectedly. Copy Alex, Kirti and Neo."
> >>
> >> To eliminate this small window with empty mapping, we should remove the
> >> call to unmap_all(). Besides that, introduce a new notifier type called
> >> IOMMU_NOTIFIER_FULL_MAP to get full mappings as intel_iommu only notifies
> >> changed mappings while VFIO dirty page sync needs full mappings. Thanks
> >> to current implementation of iova tree, we could pick mappings from iova
> >> trees directly instead of walking through guest IOMMU page table.
> >>
> >> IOMMU_NOTIFIER_MAP is still used to get changed mappings for optimization
> >> purpose. As long as notification for IOMMU_NOTIFIER_MAP could ensure
> >> shadow page table in sync, then it's OK.
> >>
> >> Signed-off-by: Zhenzhong Duan
> >> ---
> >>  hw/i386/intel_iommu.c | 49 ++++++++++++++++++++++++++++++++++++---------
> >>  hw/vfio/common.c      |  2 +-
> >>  include/exec/memory.h | 13 +++++++++++++
> >>  softmmu/memory.c      |  4 ++++
> >>  4 files changed, 58 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> >> index 94d52f4205d2..061fcded0dfb 100644
> >> --- a/hw/i386/intel_iommu.c
> >> +++ b/hw/i386/intel_iommu.c
> >> @@ -3819,6 +3819,41 @@ static int vtd_replay_hook(IOMMUTLBEvent *event, void *private)
> >>      return 0;
> >>  }
> >>
> >> +static gboolean vtd_replay_full_map(DMAMap *map, gpointer *private)
> >> +{
> >> +    IOMMUTLBEvent event;
> >> +
> >> +    event.type = IOMMU_NOTIFIER_MAP;
> >> +    event.entry.iova = map->iova;
> >> +    event.entry.addr_mask = map->size;
> >> +    event.entry.target_as = &address_space_memory;
> >> +    event.entry.perm = map->perm;
> >> +    event.entry.translated_addr = map->translated_addr;
> >> +
> >> +    return vtd_replay_hook(&event, private);
> >> +}
> >> +
> >> +/*
> >> + * This is a fast path to notify the full mappings falling in the scope
> >> + * of IOMMU notifier. The call site should ensure no iova tree update by
> >> + * taking necessary locks (e.g. BQL).
> >
> >We should be accurate on the locking - I think it's the BQL so far.
>
> Will update comments.
>
> >
> >> + */
> >> +static int vtd_page_walk_full_map_fast_path(IOVATree *iova_tree,
> >> +                                            IOMMUNotifier *n)
> >> +{
> >> +    DMAMap map;
> >> +
> >> +    map.iova = n->start;
> >> +    map.size = n->end - n->start;
> >> +    if (!iova_tree_find(iova_tree, &map)) {
> >> +        return 0;
> >> +    }
> >> +
> >> +    iova_tree_foreach_range_data(iova_tree, &map, vtd_replay_full_map,
> >> +                                 (gpointer *)n);
> >> +    return 0;
> >> +}
> >> +
> >>  static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> >>  {
> >>      VTDAddressSpace *vtd_as = container_of(iommu_mr, VTDAddressSpace, iommu);
> >> @@ -3826,13 +3861,6 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> >
RE: [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty page sync
>-Original Message-
>From: Peter Xu
>Sent: Tuesday, June 6, 2023 2:39 AM
>To: Duan, Zhenzhong
>Cc: qemu-devel@nongnu.org; m...@redhat.com; jasow...@redhat.com;
>pbonz...@redhat.com; richard.hender...@linaro.org; edua...@habkost.net;
>marcel.apfelb...@gmail.com; alex.william...@redhat.com;
>c...@redhat.com; da...@redhat.com; phi...@linaro.org;
>kwankh...@nvidia.com; c...@nvidia.com; Liu, Yi L ; Peng, Chao P
>Subject: Re: [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty
>page sync
>
>On Thu, Jun 01, 2023 at 02:33:18PM +0800, Zhenzhong Duan wrote:
>> Peter Xu found a potential issue:
>>
>> "The other thing is when I am looking at the new code I found that we
>> actually extended the replay() to be used also in dirty tracking of vfio,
>> in vfio_sync_dirty_bitmap(). For that maybe it's already broken if
>> unmap_all() because afaiu log_sync() can be called in migration thread
>> anytime during DMA so I think it means the device is prone to DMA with the
>> IOMMU pgtable quickly erased and rebuilt here, which means the DMA could
>> fail unexpectedly. Copy Alex, Kirti and Neo."
>>
>> To eliminate this small window with empty mapping, we should remove the
>> call to unmap_all(). Besides that, introduce a new notifier type called
>> IOMMU_NOTIFIER_FULL_MAP to get full mappings as intel_iommu only notifies
>> changed mappings while VFIO dirty page sync needs full mappings. Thanks
>> to current implementation of iova tree, we could pick mappings from iova
>> trees directly instead of walking through guest IOMMU page table.
>>
>> IOMMU_NOTIFIER_MAP is still used to get changed mappings for optimization
>> purpose. As long as notification for IOMMU_NOTIFIER_MAP could ensure
>> shadow page table in sync, then it's OK.
>>
>> Signed-off-by: Zhenzhong Duan
>> ---
>>  hw/i386/intel_iommu.c | 49 ++++++++++++++++++++++++++++++++++++---------
>>  hw/vfio/common.c      |  2 +-
>>  include/exec/memory.h | 13 +++++++++++++
>>  softmmu/memory.c      |  4 ++++
>>  4 files changed, 58 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 94d52f4205d2..061fcded0dfb 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -3819,6 +3819,41 @@ static int vtd_replay_hook(IOMMUTLBEvent *event, void *private)
>>      return 0;
>>  }
>>
>> +static gboolean vtd_replay_full_map(DMAMap *map, gpointer *private)
>> +{
>> +    IOMMUTLBEvent event;
>> +
>> +    event.type = IOMMU_NOTIFIER_MAP;
>> +    event.entry.iova = map->iova;
>> +    event.entry.addr_mask = map->size;
>> +    event.entry.target_as = &address_space_memory;
>> +    event.entry.perm = map->perm;
>> +    event.entry.translated_addr = map->translated_addr;
>> +
>> +    return vtd_replay_hook(&event, private);
>> +}
>> +
>> +/*
>> + * This is a fast path to notify the full mappings falling in the scope
>> + * of IOMMU notifier. The call site should ensure no iova tree update by
>> + * taking necessary locks (e.g. BQL).
>
>We should be accurate on the locking - I think it's the BQL so far.

Will update comments.

>
>> + */
>> +static int vtd_page_walk_full_map_fast_path(IOVATree *iova_tree,
>> +                                            IOMMUNotifier *n)
>> +{
>> +    DMAMap map;
>> +
>> +    map.iova = n->start;
>> +    map.size = n->end - n->start;
>> +    if (!iova_tree_find(iova_tree, &map)) {
>> +        return 0;
>> +    }
>> +
>> +    iova_tree_foreach_range_data(iova_tree, &map, vtd_replay_full_map,
>> +                                 (gpointer *)n);
>> +    return 0;
>> +}
>> +
>>  static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>>  {
>>      VTDAddressSpace *vtd_as = container_of(iommu_mr, VTDAddressSpace, iommu);
>> @@ -3826,13 +3861,6 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>>      uint8_t bus_n = pci_bus_num(vtd_as->bus);
>>      VTDContextEntry ce;
>>
>> -    /*
>> -     * The replay can be triggered by either a invalidation or a newly
>> -     * created entry. No matter what, we release existing mappings
>> -     * (it means flushing caches for UNMAP-only registers).
>> -     */
>> -    vtd_address_space_unmap(vtd_as, n);
>> -
>>      if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
>>          trace_vtd_replay_ce_valid(s->root_scalable ? "
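One detail worth spelling out in the fast path quoted above: map.size is not
a byte count. The sketch below (invented names, not QEMU code) models the
inclusive-size convention this relies on, assuming DMAMap.size and
IOMMUTLBEntry.addr_mask both carry "length - 1" style values -- which is
also why the DMAMap of { .iova = 0, .size = HWADDR_MAX } proposed later in
the thread denotes the whole address space.

/*
 * inclusive-size.c: toy model of the inclusive range convention.
 * Build: cc -o inclusive-size inclusive-size.c
 */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t iova;
    uint64_t size;   /* inclusive: the range covers [iova, iova + size] */
} Range;

/* Does r contain addr, under the inclusive-size convention? */
static int range_contains(const Range *r, uint64_t addr)
{
    return addr >= r->iova && addr <= r->iova + r->size;
}

int main(void)
{
    /* A 4K page at 0x1000: size is 0xfff, not 0x1000. */
    Range page = { .iova = 0x1000, .size = 0xfff };

    printf("0x1000: %d\n", range_contains(&page, 0x1000));   /* 1 */
    printf("0x1fff: %d\n", range_contains(&page, 0x1fff));   /* 1 */
    printf("0x2000: %d\n", range_contains(&page, 0x2000));   /* 0 */
    return 0;
}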
Re: [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty page sync
On Thu, Jun 01, 2023 at 02:33:18PM +0800, Zhenzhong Duan wrote:
> Peter Xu found a potential issue:
>
> "The other thing is when I am looking at the new code I found that we
> actually extended the replay() to be used also in dirty tracking of vfio,
> in vfio_sync_dirty_bitmap(). For that maybe it's already broken if
> unmap_all() because afaiu log_sync() can be called in migration thread
> anytime during DMA so I think it means the device is prone to DMA with the
> IOMMU pgtable quickly erased and rebuilt here, which means the DMA could
> fail unexpectedly. Copy Alex, Kirti and Neo."
>
> To eliminate this small window with empty mapping, we should remove the
> call to unmap_all(). Besides that, introduce a new notifier type called
> IOMMU_NOTIFIER_FULL_MAP to get full mappings as intel_iommu only notifies
> changed mappings while VFIO dirty page sync needs full mappings. Thanks
> to current implementation of iova tree, we could pick mappings from iova
> trees directly instead of walking through guest IOMMU page table.
>
> IOMMU_NOTIFIER_MAP is still used to get changed mappings for optimization
> purpose. As long as notification for IOMMU_NOTIFIER_MAP could ensure shadow
> page table in sync, then it's OK.
>
> Signed-off-by: Zhenzhong Duan
> ---
>  hw/i386/intel_iommu.c | 49 ++++++++++++++++++++++++++++++++++++---------
>  hw/vfio/common.c      |  2 +-
>  include/exec/memory.h | 13 +++++++++++++
>  softmmu/memory.c      |  4 ++++
>  4 files changed, 58 insertions(+), 10 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 94d52f4205d2..061fcded0dfb 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -3819,6 +3819,41 @@ static int vtd_replay_hook(IOMMUTLBEvent *event, void *private)
>      return 0;
>  }
>
> +static gboolean vtd_replay_full_map(DMAMap *map, gpointer *private)
> +{
> +    IOMMUTLBEvent event;
> +
> +    event.type = IOMMU_NOTIFIER_MAP;
> +    event.entry.iova = map->iova;
> +    event.entry.addr_mask = map->size;
> +    event.entry.target_as = &address_space_memory;
> +    event.entry.perm = map->perm;
> +    event.entry.translated_addr = map->translated_addr;
> +
> +    return vtd_replay_hook(&event, private);
> +}
> +
> +/*
> + * This is a fast path to notify the full mappings falling in the scope
> + * of IOMMU notifier. The call site should ensure no iova tree update by
> + * taking necessary locks (e.g. BQL).

We should be accurate on the locking - I think it's the BQL so far.

> + */
> +static int vtd_page_walk_full_map_fast_path(IOVATree *iova_tree,
> +                                            IOMMUNotifier *n)
> +{
> +    DMAMap map;
> +
> +    map.iova = n->start;
> +    map.size = n->end - n->start;
> +    if (!iova_tree_find(iova_tree, &map)) {
> +        return 0;
> +    }
> +
> +    iova_tree_foreach_range_data(iova_tree, &map, vtd_replay_full_map,
> +                                 (gpointer *)n);
> +    return 0;
> +}
> +
>  static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>  {
>      VTDAddressSpace *vtd_as = container_of(iommu_mr, VTDAddressSpace, iommu);
> @@ -3826,13 +3861,6 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>      uint8_t bus_n = pci_bus_num(vtd_as->bus);
>      VTDContextEntry ce;
>
> -    /*
> -     * The replay can be triggered by either a invalidation or a newly
> -     * created entry. No matter what, we release existing mappings
> -     * (it means flushing caches for UNMAP-only registers).
> -     */
> -    vtd_address_space_unmap(vtd_as, n);
> -
>      if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
>          trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" :
>                                    "legacy mode",
> @@ -3850,8 +3878,11 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>              .as = vtd_as,
>              .domain_id = vtd_get_domain_id(s, &ce, vtd_as->pasid),
>          };
> -
> -        vtd_page_walk(s, &ce, 0, ~0ULL, &info, vtd_as->pasid);
> +        if (n->notifier_flags & IOMMU_NOTIFIER_FULL_MAP) {
> +            vtd_page_walk_full_map_fast_path(vtd_as->iova_tree, n);
> +        } else {
> +            vtd_page_walk(s, &ce, 0, ~0ULL, &info, vtd_as->pasid);
> +        }
>          }
>      } else {
>          trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 78358ede2764..5dae4502b908 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1890,7 +1890,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container,
>
>          iommu_notifier_init(&gdn.n,
>                              vfio_iommu_map_dirty_notify,
> -                            IOMMU_NOTIFIER_MAP,
> +                            IOMMU_NOTIFIER_FULL_MAP,
>                              section->offset_within_region,
>