Re: [PATCH] vfio/pci: Handle concurrent vma faults

2021-03-12 Thread Jason Gunthorpe
On Fri, Mar 12, 2021 at 01:58:44PM -0700, Alex Williamson wrote:

> Yeah, we can indeed use memalloc_nofs_save/restore().  It seems we're
> trying to allocate something for pfnmap tracking and that enables lots
> of lockdep specific tests.  Is it valid to wrap io_remap_pfn_range()
> around clearing this flag or am I just masking a bug?  Thanks,

Yes, I think it is fine. Those functions are meant to be used in a
no-fs kind of region exactly like this.

no-fs tells the allocator not to do reclaim, which is forbidden
under the locks here (reclaim would also attempt to take these locks).

I would defer to Michal Hocko though, maybe cc him on the final patch
series version.
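The shape being discussed would be roughly the following (a sketch only; the helper name is hypothetical, and where exactly the save/restore pair belongs is precisely the open question above):

```c
/*
 * Sketch: enter a nofs scope around the remap so the allocation done
 * for pfnmap tracking (memtype_reserve()) cannot recurse into fs
 * reclaim while i_mmap_rwsem is held.  Helper name is hypothetical.
 */
#include <linux/sched/mm.h>	/* memalloc_nofs_save/restore */

static int vfio_remap_nofs(struct vm_area_struct *vma, unsigned long pfn,
			   unsigned long size)
{
	unsigned int nofs_flags = memalloc_nofs_save();
	int ret = io_remap_pfn_range(vma, vma->vm_start, pfn, size,
				     vma->vm_page_prot);

	memalloc_nofs_restore(nofs_flags);
	return ret;
}
```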

Jason


Re: [PATCH] vfio/pci: Handle concurrent vma faults

2021-03-12 Thread Alex Williamson
On Fri, 12 Mar 2021 13:09:38 -0700
Alex Williamson  wrote:

> On Fri, 12 Mar 2021 15:41:47 -0400
> Jason Gunthorpe  wrote:
> 
> 
> ==
> WARNING: possible circular locking dependency detected
> 5.12.0-rc1+ #18 Not tainted
> --
> CPU 0/KVM/1406 is trying to acquire lock:
> a5a58d60 (fs_reclaim){+.+.}-{0:0}, at: fs_reclaim_acquire+0x83/0xd0
> 
> but task is already holding lock:
> 94c0f3e8fb08 (&mapping->i_mmap_rwsem){++++}-{3:3}, at: vfio_device_io_remap_mapping_range+0x31/0x120 [vfio]
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (&mapping->i_mmap_rwsem){++++}-{3:3}:
>down_write+0x3d/0x70
>dma_resv_lockdep+0x1b0/0x298
>do_one_initcall+0x5b/0x2d0
>kernel_init_freeable+0x251/0x298
>kernel_init+0xa/0x111
>ret_from_fork+0x22/0x30
> 
> -> #0 (fs_reclaim){+.+.}-{0:0}:  
>__lock_acquire+0x111f/0x1e10
>lock_acquire+0xb5/0x380
>fs_reclaim_acquire+0xa3/0xd0
>kmem_cache_alloc_trace+0x30/0x2c0
>memtype_reserve+0xc3/0x280
>reserve_pfn_range+0x86/0x160
>track_pfn_remap+0xa6/0xe0
>remap_pfn_range+0xa8/0x610
>vfio_device_io_remap_mapping_range+0x93/0x120 [vfio]
>vfio_pci_test_and_up_write_memory_lock+0x34/0x40 [vfio_pci]
>vfio_basic_config_write+0x12d/0x230 [vfio_pci]
>vfio_pci_config_rw+0x1b7/0x3a0 [vfio_pci]
>vfs_write+0xea/0x390
>__x64_sys_pwrite64+0x72/0xb0
>do_syscall_64+0x33/0x40
>entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
..
> > Does current_gfp_context()/memalloc_nofs_save()/etc solve it?  

Yeah, we can indeed use memalloc_nofs_save/restore().  It seems we're
trying to allocate something for pfnmap tracking and that enables lots
of lockdep specific tests.  Is it valid to wrap io_remap_pfn_range()
around clearing this flag or am I just masking a bug?  Thanks,

Alex



Re: [PATCH] vfio/pci: Handle concurrent vma faults

2021-03-12 Thread Alex Williamson
On Fri, 12 Mar 2021 15:41:47 -0400
Jason Gunthorpe  wrote:

> On Fri, Mar 12, 2021 at 12:16:11PM -0700, Alex Williamson wrote:
> > On Wed, 10 Mar 2021 14:40:11 -0400
> > Jason Gunthorpe  wrote:
> >   
> > > On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
> > >   
> > > > > I think after the address_space changes this should try to stick with
> > > > > a normal io_remap_pfn_range() done outside the fault handler.
> > > > 
> > > > I assume you're suggesting calling io_remap_pfn_range() when device
> > > > memory is enabled,
> > > 
> > > Yes, I think I saw Peter thinking along these lines too
> > > 
> > > Then fault just always causes SIGBUS if it gets called  
> > 
> > While trying the address_space approach (otherwise we'd just be adding
> > back vma list tracking), it looks like we can't call
> > io_remap_pfn_range() while holding the address_space i_mmap_rwsem via
> > i_mmap_lock_write(), as done in unmap_mapping_range().  lockdep
> > identifies a circular lock ordering issue against fs_reclaim.  Minimally
> > we also need vma_interval_tree_iter_{first,next} exported in order to
> > use vma_interval_tree_foreach().  Suggestions?  Thanks,  
> 
> You are asking how to put the BAR back into every VMA when it is
> enabled again after it has been zap'd?

Exactly.
 
> What did the lockdep splat look like? Is it a memory allocation?


======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc1+ #18 Not tainted
------------------------------------------------------
CPU 0/KVM/1406 is trying to acquire lock:
a5a58d60 (fs_reclaim){+.+.}-{0:0}, at: fs_reclaim_acquire+0x83/0xd0

but task is already holding lock:
94c0f3e8fb08 (&mapping->i_mmap_rwsem){++++}-{3:3}, at: vfio_device_io_remap_mapping_range+0x31/0x120 [vfio]

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (&mapping->i_mmap_rwsem){++++}-{3:3}:
   down_write+0x3d/0x70
   dma_resv_lockdep+0x1b0/0x298
   do_one_initcall+0x5b/0x2d0
   kernel_init_freeable+0x251/0x298
   kernel_init+0xa/0x111
   ret_from_fork+0x22/0x30

-> #0 (fs_reclaim){+.+.}-{0:0}:
   __lock_acquire+0x111f/0x1e10
   lock_acquire+0xb5/0x380
   fs_reclaim_acquire+0xa3/0xd0
   kmem_cache_alloc_trace+0x30/0x2c0
   memtype_reserve+0xc3/0x280
   reserve_pfn_range+0x86/0x160
   track_pfn_remap+0xa6/0xe0
   remap_pfn_range+0xa8/0x610
   vfio_device_io_remap_mapping_range+0x93/0x120 [vfio]
   vfio_pci_test_and_up_write_memory_lock+0x34/0x40 [vfio_pci]
   vfio_basic_config_write+0x12d/0x230 [vfio_pci]
   vfio_pci_config_rw+0x1b7/0x3a0 [vfio_pci]
   vfs_write+0xea/0x390
   __x64_sys_pwrite64+0x72/0xb0
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&mapping->i_mmap_rwsem);
                               lock(fs_reclaim);
                               lock(&mapping->i_mmap_rwsem);
  lock(fs_reclaim);

 *** DEADLOCK ***

2 locks held by CPU 0/KVM/1406:
 #0: 94c0f9c71ef0 (&vdev->memory_lock){++++}-{3:3}, at: vfio_basic_config_write+0x19a/0x230 [vfio_pci]
 #1: 94c0f3e8fb08 (&mapping->i_mmap_rwsem){++++}-{3:3}, at: vfio_device_io_remap_mapping_range+0x31/0x120 [vfio]

stack backtrace:
CPU: 3 PID: 1406 Comm: CPU 0/KVM Not tainted 5.12.0-rc1+ #18
Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
Call Trace:
 dump_stack+0x7f/0xa1
 check_noncircular+0xcf/0xf0
 __lock_acquire+0x111f/0x1e10
 lock_acquire+0xb5/0x380
 ? fs_reclaim_acquire+0x83/0xd0
 ? pat_enabled+0x10/0x10
 ? memtype_reserve+0xc3/0x280
 fs_reclaim_acquire+0xa3/0xd0
 ? fs_reclaim_acquire+0x83/0xd0
 kmem_cache_alloc_trace+0x30/0x2c0
 memtype_reserve+0xc3/0x280
 reserve_pfn_range+0x86/0x160
 track_pfn_remap+0xa6/0xe0
 remap_pfn_range+0xa8/0x610
 ? lock_acquire+0xb5/0x380
 ? vfio_device_io_remap_mapping_range+0x31/0x120 [vfio]
 ? lock_is_held_type+0xa5/0x120
 vfio_device_io_remap_mapping_range+0x93/0x120 [vfio]
 vfio_pci_test_and_up_write_memory_lock+0x34/0x40 [vfio_pci]
 vfio_basic_config_write+0x12d/0x230 [vfio_pci]
 vfio_pci_config_rw+0x1b7/0x3a0 [vfio_pci]
 vfs_write+0xea/0x390
 __x64_sys_pwrite64+0x72/0xb0
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f80176152ff
Code: 08 89 3c 24 48 89 4c 24 18 e8 3d f3 ff ff 4c 8b 54 24 18 48 8b 54 24 10 
41 89 c0 48 8b 74 24 08 8b 3c 24 b8 12 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 
44 89 c7 48 89 04 24 e8 6d f3 ff ff 48 8b
RSP: 002b:7f7efa5f72f0 EFLAGS: 0293 ORIG_RAX: 0012
RAX: ffda RBX: 0004 RCX: 7f80176152ff
RDX: 0002 RSI: 7f7efa5f736c RDI: 002d
RBP: 55b66913d530 R08:  R09: 
R10: 0704 R11: 0293 R12: 0004

Re: [PATCH] vfio/pci: Handle concurrent vma faults

2021-03-12 Thread Jason Gunthorpe
On Fri, Mar 12, 2021 at 12:16:11PM -0700, Alex Williamson wrote:
> On Wed, 10 Mar 2021 14:40:11 -0400
> Jason Gunthorpe  wrote:
> 
> > On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
> > 
> > > > I think after the address_space changes this should try to stick with
> > > > a normal io_remap_pfn_range() done outside the fault handler.  
> > > 
> > > I assume you're suggesting calling io_remap_pfn_range() when device
> > > memory is enabled,  
> > 
> > Yes, I think I saw Peter thinking along these lines too
> > 
> > Then fault just always causes SIGBUS if it gets called
> 
> While trying the address_space approach (otherwise we'd just be adding
> back vma list tracking), it looks like we can't call
> io_remap_pfn_range() while holding the address_space i_mmap_rwsem via
> i_mmap_lock_write(), as done in unmap_mapping_range().  lockdep
> identifies a circular lock ordering issue against fs_reclaim.  Minimally
> we also need vma_interval_tree_iter_{first,next} exported in order to
> use vma_interval_tree_foreach().  Suggestions?  Thanks,

You are asking how to put the BAR back into every VMA when it is
enabled again after it has been zap'd?

What did the lockdep splat look like? Is it a memory allocation?

Does current_gfp_context()/memalloc_nofs_save()/etc solve it?

The easiest answer is to continue to use fault and the
vmf_insert_page()..

But it feels like it would be OK to export enough i_mmap machinery to
enable this. Cleaner than building your own tracking, which would
still have the same ugly mmap_sem inversion issue that was blocking
this last time.
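If that machinery were exported, the re-install walk might look roughly like this (a sketch under that assumption; the function name and locking shape are inferred from the splat, not taken from a final patch):

```c
/*
 * Hypothetical sketch: walk every vma mapping the BAR via the
 * address_space interval tree and re-install the PTEs.  This is the
 * walk that currently trips lockdep, because io_remap_pfn_range()
 * can allocate (memtype_reserve()) while i_mmap_rwsem is held.
 */
static void vfio_remap_mapping_range(struct address_space *mapping,
				     loff_t start, loff_t len,
				     unsigned long base_pfn)
{
	struct vm_area_struct *vma;

	i_mmap_lock_write(mapping);
	vma_interval_tree_foreach(vma, &mapping->i_mmap,
				  start >> PAGE_SHIFT,
				  (start + len - 1) >> PAGE_SHIFT) {
		io_remap_pfn_range(vma, vma->vm_start,
				   base_pfn + vma->vm_pgoff,
				   vma->vm_end - vma->vm_start,
				   vma->vm_page_prot);
	}
	i_mmap_unlock_write(mapping);
}
```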

Jason


Re: [PATCH] vfio/pci: Handle concurrent vma faults

2021-03-12 Thread Alex Williamson
On Wed, 10 Mar 2021 14:40:11 -0400
Jason Gunthorpe  wrote:

> On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
> 
> > > I think after the address_space changes this should try to stick with
> > > a normal io_remap_pfn_range() done outside the fault handler.  
> > 
> > I assume you're suggesting calling io_remap_pfn_range() when device
> > memory is enabled,  
> 
> Yes, I think I saw Peter thinking along these lines too
> 
> Then fault just always causes SIGBUS if it gets called

While trying the address_space approach (otherwise we'd just be adding
back vma list tracking), it looks like we can't call
io_remap_pfn_range() while holding the address_space i_mmap_rwsem via
i_mmap_lock_write(), as done in unmap_mapping_range().  lockdep
identifies a circular lock ordering issue against fs_reclaim.  Minimally
we also need vma_interval_tree_iter_{first,next} exported in order to
use vma_interval_tree_foreach().  Suggestions?  Thanks,

Alex



Re: [PATCH] vfio/pci: Handle concurrent vma faults

2021-03-11 Thread Peter Xu
On Thu, Mar 11, 2021 at 11:35:24AM +, Christoph Hellwig wrote:
> On Wed, Mar 10, 2021 at 03:06:07PM -0500, Peter Xu wrote:
> > On Wed, Mar 10, 2021 at 02:40:11PM -0400, Jason Gunthorpe wrote:
> > > On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
> > > 
> > > > > I think after the address_space changes this should try to stick with
> > > > > a normal io_remap_pfn_range() done outside the fault handler.
> > > > 
> > > > I assume you're suggesting calling io_remap_pfn_range() when device
> > > > memory is enabled,
> > > 
> > > Yes, I think I saw Peter thinking along these lines too
> > > 
> > > Then fault just always causes SIGBUS if it gets called
> 
> I feel much more comfortable having the io_remap_pfn_range in place.

It's just that Jason convinced me with the fact that io_remap_pfn_range() will
modify vma flags, and I tend to agree that's not a good thing to do during a
fault() handler (in remap_pfn_range):

vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;

Although this case is special and seems harmless, since all four of
these flags are already set by vfio_pci_mmap() anyway, the flags don't
actually change with the current code base.  It's still cleaner not to
use io_remap_pfn_range() in vfio's fault(), since a future change to
io_remap_pfn_range() isn't guaranteed to stay in sync with vfio's mmap().
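For reference, this is (an abridged sketch of) where vfio already sets those flags at mmap() time, which is why the fault path never needs io_remap_pfn_range()'s flag update:

```c
/* Abridged sketch of the existing setup: the vma is marked once at
 * mmap() time, so the fault handler need not touch vm_flags at all. */
static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
{
	/* ... offset/size validation elided ... */
	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	vma->vm_ops = &vfio_pci_mmap_ops;
	return 0;
}
```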

> 
> > 
> > Indeed that looks better than looping in the fault().
> > 
> > But I don't know whether it'll be easy to move io_remap_pfn_range() to
> > device memory enablement.  If it's a two-step thing, we can fix the
> > BUG_ON and vma duplication issue first, then the full rework can be done
> > in the bigger series, with whatever is chosen as the final approach.
> 
> What kind of problems do you envision?  It seems pretty simple to do,
> at least when combined with the unmap_mapping_range patch.

Moving the prefault into device memory enablement will even remove the
delay on the first MMIO access that would otherwise trigger this fault().
In that case I think we can also call io_remap_pfn_range() directly and
safely, rather than looping over vmf_insert_pfn_prot().

Thanks,

-- 
Peter Xu



Re: [PATCH] vfio/pci: Handle concurrent vma faults

2021-03-11 Thread Christoph Hellwig
On Wed, Mar 10, 2021 at 03:06:07PM -0500, Peter Xu wrote:
> On Wed, Mar 10, 2021 at 02:40:11PM -0400, Jason Gunthorpe wrote:
> > On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
> > 
> > > > I think after the address_space changes this should try to stick with
> > > > a normal io_remap_pfn_range() done outside the fault handler.
> > > 
> > > I assume you're suggesting calling io_remap_pfn_range() when device
> > > memory is enabled,
> > 
> > Yes, I think I saw Peter thinking along these lines too
> > 
> > Then fault just always causes SIGBUS if it gets called

I feel much more comfortable having the io_remap_pfn_range in place.

> 
> Indeed that looks better than looping in the fault().
> 
> But I don't know whether it'll be easy to move io_remap_pfn_range() to device
> memory enablement.  If it's a two-step thing, we can fix the BUG_ON and vma
> duplication issue first, then the full rework can be done in the bigger
> series, with whatever is chosen as the final approach.

What kind of problems do you envision?  It seems pretty simple to do,
at least when combined with the unmap_mapping_range patch.


Re: [PATCH] vfio/pci: Handle concurrent vma faults

2021-03-10 Thread Peter Xu
On Wed, Mar 10, 2021 at 02:40:11PM -0400, Jason Gunthorpe wrote:
> On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
> 
> > > I think after the address_space changes this should try to stick with
> > > a normal io_remap_pfn_range() done outside the fault handler.
> > 
> > I assume you're suggesting calling io_remap_pfn_range() when device
> > memory is enabled,
> 
> Yes, I think I saw Peter thinking along these lines too
> 
> Then fault just always causes SIGBUS if it gets called

Indeed that looks better than looping in the fault().

But I don't know whether it'll be easy to move io_remap_pfn_range() to device
memory enablement.  If it's a two-step thing, we can fix the BUG_ON and vma
duplication issue first, then the full rework can be done in the bigger
series, with whatever is chosen as the final approach.

Thanks,

-- 
Peter Xu



Re: [PATCH] vfio/pci: Handle concurrent vma faults

2021-03-10 Thread Jason Gunthorpe
On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:

> > I think after the address_space changes this should try to stick with
> > a normal io_remap_pfn_range() done outside the fault handler.
> 
> I assume you're suggesting calling io_remap_pfn_range() when device
> memory is enabled,

Yes, I think I saw Peter thinking along these lines too

Then fault just always causes SIGBUS if it gets called

Jason


Re: [PATCH] vfio/pci: Handle concurrent vma faults

2021-03-10 Thread Alex Williamson
On Wed, 10 Mar 2021 14:14:46 -0400
Jason Gunthorpe  wrote:

> On Wed, Mar 10, 2021 at 10:53:29AM -0700, Alex Williamson wrote:
> 
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 65e7e6b44578..ae723808e08b 100644
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -1573,6 +1573,11 @@ static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
> >  {
> > struct vfio_pci_mmap_vma *mmap_vma;
> >  
> > +   list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
> > +   if (mmap_vma->vma == vma)
> > +   return 0; /* Swallow the error, the vma is tracked */
> > +   }
> > +
> > mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
> > if (!mmap_vma)
> > return -ENOMEM;
> > @@ -1612,31 +1617,32 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> >  {
> > struct vm_area_struct *vma = vmf->vma;
> > struct vfio_pci_device *vdev = vma->vm_private_data;
> > -   vm_fault_t ret = VM_FAULT_NOPAGE;
> > +   unsigned long vaddr = vma->vm_start, pfn = vma->vm_pgoff;
> > +   vm_fault_t ret = VM_FAULT_SIGBUS;
> >  
> > mutex_lock(&vdev->vma_lock);
> > down_read(&vdev->memory_lock);
> >  
> > -   if (!__vfio_pci_memory_enabled(vdev)) {
> > -   ret = VM_FAULT_SIGBUS;
> > -   mutex_unlock(&vdev->vma_lock);
> > +   if (!__vfio_pci_memory_enabled(vdev))
> > goto up_out;
> > +
> > +   for (; vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
> > +   ret = vmf_insert_pfn_prot(vma, vaddr, pfn,
> > + pgprot_decrypted(vma->vm_page_prot)); 
> >  
> 
> I investigated this, I think the above pgprot_decrypted() should be
> moved here:
> 
> static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> {
> vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +   vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
> 
> 
> And since:
> 
> vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>   unsigned long pfn)
> {
>   return vmf_insert_pfn_prot(vma, addr, pfn, vma->vm_page_prot);
> 
> The above can just use vmf_insert_pfn()

Cool, easy enough.  Thanks for looking.
 
> The only thing that makes me nervous about this arrangement is losing
> the track_pfn_remap() which was in remap_pfn_range() - I think it
> means we miss out on certain PAT manipulations.. I *suspect* this is
> not a problem for VFIO because it will rely on the MTRRs generally on
> x86 - but I also don't know this mechanism too well.

Yeah, for VM use cases the MTRRs generally override.

> I think after the address_space changes this should try to stick with
> a normal io_remap_pfn_range() done outside the fault handler.

I assume you're suggesting calling io_remap_pfn_range() when device
memory is enabled, do you mean using vma_interval_tree_foreach() like
unmap_mapping_range() does to avoid re-adding a vma list?  Thanks,

Alex



Re: [PATCH] vfio/pci: Handle concurrent vma faults

2021-03-10 Thread Jason Gunthorpe
On Wed, Mar 10, 2021 at 10:53:29AM -0700, Alex Williamson wrote:

> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 65e7e6b44578..ae723808e08b 100644
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1573,6 +1573,11 @@ static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
>  {
>   struct vfio_pci_mmap_vma *mmap_vma;
>  
> + list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
> + if (mmap_vma->vma == vma)
> + return 0; /* Swallow the error, the vma is tracked */
> + }
> +
>   mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
>   if (!mmap_vma)
>   return -ENOMEM;
> @@ -1612,31 +1617,32 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  {
>   struct vm_area_struct *vma = vmf->vma;
>   struct vfio_pci_device *vdev = vma->vm_private_data;
> - vm_fault_t ret = VM_FAULT_NOPAGE;
> + unsigned long vaddr = vma->vm_start, pfn = vma->vm_pgoff;
> + vm_fault_t ret = VM_FAULT_SIGBUS;
>  
>   mutex_lock(&vdev->vma_lock);
>   down_read(&vdev->memory_lock);
>  
> - if (!__vfio_pci_memory_enabled(vdev)) {
> - ret = VM_FAULT_SIGBUS;
> - mutex_unlock(&vdev->vma_lock);
> + if (!__vfio_pci_memory_enabled(vdev))
>   goto up_out;
> +
> + for (; vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
> + ret = vmf_insert_pfn_prot(vma, vaddr, pfn,
> +   pgprot_decrypted(vma->vm_page_prot));

I investigated this, I think the above pgprot_decrypted() should be
moved here:

static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
{
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+   vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);


And since:

vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn)
{
return vmf_insert_pfn_prot(vma, addr, pfn, vma->vm_page_prot);

The above can just use vmf_insert_pfn()
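With pgprot_decrypted() folded into vfio_pci_mmap() as suggested, the fault loop in the patch could then shrink to something like this (sketch):

```c
/* Sketch: once vm_page_prot already carries pgprot_decrypted(),
 * vmf_insert_pfn() picks it up and no explicit prot is needed. */
unsigned long vaddr, pfn;
vm_fault_t ret = VM_FAULT_SIGBUS;

for (vaddr = vma->vm_start, pfn = vma->vm_pgoff;
     vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
	ret = vmf_insert_pfn(vma, vaddr, pfn);
	if (ret != VM_FAULT_NOPAGE)
		break;
}
```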

The only thing that makes me nervous about this arrangement is losing
the track_pfn_remap() which was in remap_pfn_range() - I think it
means we miss out on certain PAT manipulations.. I *suspect* this is
not a problem for VFIO because it will rely on the MTRRs generally on
x86 - but I also don't know this mechanism too well.

I think after the address_space changes this should try to stick with
a normal io_remap_pfn_range() done outside the fault handler.

Jason


[PATCH] vfio/pci: Handle concurrent vma faults

2021-03-10 Thread Alex Williamson
vfio_pci_mmap_fault() incorrectly makes use of io_remap_pfn_range()
from within a vm_ops fault handler.  This function will trigger a
BUG_ON if it encounters a populated pte within the remapped range,
where any fault is meant to populate the entire vma.  Concurrent
inflight faults to the same vma will therefore hit this issue,
triggering traces such as:

[ 1591.733256] kernel BUG at mm/memory.c:2177!
[ 1591.739515] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 1591.747381] Modules linked in: vfio_iommu_type1 vfio_pci vfio_virqfd vfio 
pv680_mii(O)
[ 1591.760536] CPU: 2 PID: 227 Comm: lcore-worker-2 Tainted: G O 5.11.0-rc3+ #1
[ 1591.770735] Hardware name:  , BIOS HiFPGA 1P B600 V121-1
[ 1591.778872] pstate: 4049 (nZcv daif +PAN -UAO -TCO BTYPE=--)
[ 1591.786134] pc : remap_pfn_range+0x214/0x340
[ 1591.793564] lr : remap_pfn_range+0x1b8/0x340
[ 1591.799117] sp : 80001068bbd0
[ 1591.803476] x29: 80001068bbd0 x28: 042eff6f
[ 1591.810404] x27: 00110091 x26: 00130091
[ 1591.817457] x25: 00680fd3 x24: a92f1338e358
[ 1591.825144] x23: 00114000 x22: 0041
[ 1591.832506] x21: 00130091 x20: a92f141a4000
[ 1591.839520] x19: 001100a0 x18: 
[ 1591.846108] x17:  x16: a92f11844540
[ 1591.853570] x15:  x14: 
[ 1591.860768] x13: fc00 x12: 0880
[ 1591.868053] x11: 0821bf3d01d0 x10: 5ef2abd89000
[ 1591.875932] x9 : a92f12ab0064 x8 : a92f136471c0
[ 1591.883208] x7 : 00114091 x6 : 0002
[ 1591.890177] x5 : 0001 x4 : 0001
[ 1591.896656] x3 :  x2 : 016804400fd3
[ 1591.903215] x1 : 082126261880 x0 : fc2084989868
[ 1591.910234] Call trace:
[ 1591.914837]  remap_pfn_range+0x214/0x340
[ 1591.921765]  vfio_pci_mmap_fault+0xac/0x130 [vfio_pci]
[ 1591.931200]  __do_fault+0x44/0x12c
[ 1591.937031]  handle_mm_fault+0xcc8/0x1230
[ 1591.942475]  do_page_fault+0x16c/0x484
[ 1591.948635]  do_translation_fault+0xbc/0xd8
[ 1591.954171]  do_mem_abort+0x4c/0xc0
[ 1591.960316]  el0_da+0x40/0x80
[ 1591.965585]  el0_sync_handler+0x168/0x1b0
[ 1591.971608]  el0_sync+0x174/0x180
[ 1591.978312] Code: eb1b027f 54c0 f9400022 b4fffe02 (d421)

Switch to using vmf_insert_pfn_prot() so that we can retain the
decrypted memory protection from io_remap_pfn_range(), but allow
concurrent page table updates.  Tracking of vmas is also updated to
prevent duplicate entries.

Fixes: 11c4cd07ba11 ("vfio-pci: Fault mmaps to enable vma tracking")
Reported-by: Zeng Tao 
Suggested-by: Zeng Tao 
Signed-off-by: Alex Williamson 
---

Zeng Tao, I hope you don't mind me sending a new version to keep
this moving.  Testing and review appreciated, thanks!

 drivers/vfio/pci/vfio_pci.c |   30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 65e7e6b44578..ae723808e08b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1573,6 +1573,11 @@ static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
 {
struct vfio_pci_mmap_vma *mmap_vma;
 
+   list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
+   if (mmap_vma->vma == vma)
+   return 0; /* Swallow the error, the vma is tracked */
+   }
+
mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
if (!mmap_vma)
return -ENOMEM;
@@ -1612,31 +1617,32 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
struct vfio_pci_device *vdev = vma->vm_private_data;
-   vm_fault_t ret = VM_FAULT_NOPAGE;
+   unsigned long vaddr = vma->vm_start, pfn = vma->vm_pgoff;
+   vm_fault_t ret = VM_FAULT_SIGBUS;
 
mutex_lock(&vdev->vma_lock);
down_read(&vdev->memory_lock);
 
-   if (!__vfio_pci_memory_enabled(vdev)) {
-   ret = VM_FAULT_SIGBUS;
-   mutex_unlock(&vdev->vma_lock);
+   if (!__vfio_pci_memory_enabled(vdev))
goto up_out;
+
+   for (; vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
+   ret = vmf_insert_pfn_prot(vma, vaddr, pfn,
+ pgprot_decrypted(vma->vm_page_prot));
+   if (ret != VM_FAULT_NOPAGE) {
+   zap_vma_ptes(vma, vma->vm_start, vaddr - vma->vm_start);
+   goto up_out;
+   }
}
 
if (__vfio_pci_add_vma(vdev, vma)) {
ret = VM_FAULT_OOM;
-   mutex_unlock(&vdev->vma_lock);
-   goto up_out;
+   zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
}
 
-   mutex_unlock(&vdev->vma_lock);
-
-   if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-  vma->vm_end - vma->vm_start, vma->vm_page_prot))
-