Re: [PATCH v4 5/8] mm: Device exclusive memory access

2021-03-08 Thread Alistair Popple
On Tuesday, 9 March 2021 6:44:41 AM AEDT Ralph Campbell wrote:
> 
> On 3/3/21 10:16 PM, Alistair Popple wrote:
> > Some devices require exclusive write access to shared virtual
> > memory (SVM) ranges to perform atomic operations on that memory. This
> > requires CPU page tables to be updated to deny access whilst atomic
> > operations are occurring.
> > 
> > In order to do this introduce a new swap entry
> > type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for
> > exclusive access by a device all page table mappings for the particular
> > range are replaced with device exclusive swap entries. This causes any
> > CPU access to the page to result in a fault.
> > 
> > Faults are resovled by replacing the faulting entry with the original
> > mapping. This results in MMU notifiers being called which a driver uses
> > to update access permissions such as revoking atomic access. After
> > notifiers have been called the device will no longer have exclusive
> > access to the region.
> > 
> > Signed-off-by: Alistair Popple 
> 
> I see in the next two patches how make_device_exclusive_entry() and
> check_device_exclusive_range() are used. This points out a similar
> problem that migrate_vma_setup() had before I added the
> mmu_notifier_range_init_migrate() helper to pass a cookie from
> migrate_vma_setup() to the invalidation callback so the device driver
> could ignore an invalidation callback triggered by the caller and thus
> resulting in a deadlock or having to invalidate device PTEs that
> wouldn't be migrating.
> 
> I think you can eliminate the need for check_device_exclusive_range() in
> the same way by adding a "void *" pointer to make_device_exclusive_entry()
> and passing that through to try_to_protect(), setting rmap_walk_control 
rwc.arg
> and then passing arg to mmu_notifier_range_init_migrate().

Thanks for the idea, I had missed there was already a way of passing a "void 
*" as part of mmu_notifier_range_init_migrate(). Agree that should allow a 
single pass without needing check_device_exclusive_range().

As Jason points out still need to check the GUP page is mapped at the expected 
address but that can be done as part of installing the exclusive swap entry in 
try_to_protect_one().

> Although, maybe it would be better to define a new
> mmu_notifier_range_init_exclusive() and event type MMU_NOTIFY_EXCLUSIVE so
> that a device driver can revoke atomic/exclusive access but keep read/write
> access to other parts of the page.

Agree, I don't think overloading mmu_notifier_range_init_migrate() with the 
exclusive usage is correct. Better to define a new helper.

> I thought about how make_device_exclusive_entry() is similar to 
hmm_range_fault()
> and whether it would be possible to add a new HMM_PFN_REQ_EXCLUSIVE flag but 
I
> see that make_device_exclusive_entry() returns the pages locked and with an
> additional get_page() reference. This doesn't fit well with the other
> hmm_range_fault() entries being returned as a "snapshot" so having a 
different
> API makes sense. I think it would be useful to add a HMM_PFN_EXCLUSIVE flag 
so
> that snapshots of the page tables can at least report that a page is 
exclusively
> being accessed by *some* device. Unfortunately, there is no pgmap pointer to 
be
> able to tell which device has exclusive access (since any struct page could 
be
> exclusively accessed, not just device private ones).
 
I have also experimented with integrating this with HMM but it just didn't end 
up being a good fit for the reasons you mention.

I also don't think adding HMM_PFN_EXCLUSIVE to read page table snapshots is 
that useful because there is no way to tell *which* device has exclusive 
access. So unless I've missed some particular usage for it now I think it can 
probably be added as a future improvement to HMM if/when it is needed.

 - Alistair





Re: [PATCH v4 5/8] mm: Device exclusive memory access

2021-03-08 Thread Ralph Campbell



On 3/3/21 10:16 PM, Alistair Popple wrote:

Some devices require exclusive write access to shared virtual
memory (SVM) ranges to perform atomic operations on that memory. This
requires CPU page tables to be updated to deny access whilst atomic
operations are occurring.

In order to do this introduce a new swap entry
type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for
exclusive access by a device all page table mappings for the particular
range are replaced with device exclusive swap entries. This causes any
CPU access to the page to result in a fault.

Faults are resovled by replacing the faulting entry with the original
mapping. This results in MMU notifiers being called which a driver uses
to update access permissions such as revoking atomic access. After
notifiers have been called the device will no longer have exclusive
access to the region.

Signed-off-by: Alistair Popple 


I see in the next two patches how make_device_exclusive_entry() and
check_device_exclusive_range() are used. This points out a similar
problem that migrate_vma_setup() had before I added the
mmu_notifier_range_init_migrate() helper to pass a cookie from
migrate_vma_setup() to the invalidation callback so the device driver
could ignore an invalidation callback triggered by the caller and thus
resulting in a deadlock or having to invalidate device PTEs that
wouldn't be migrating.

I think you can eliminate the need for check_device_exclusive_range() in
the same way by adding a "void *" pointer to make_device_exclusive_entry()
and passing that through to try_to_protect(), setting rmap_walk_control rwc.arg
and then passing arg to mmu_notifier_range_init_migrate().
Although, maybe it would be better to define a new
mmu_notifier_range_init_exclusive() and event type MMU_NOTIFY_EXCLUSIVE so
that a device driver can revoke atomic/exclusive access but keep read/write
access to other parts of the page.

I thought about how make_device_exclusive_entry() is similar to 
hmm_range_fault()
and whether it would be possible to add a new HMM_PFN_REQ_EXCLUSIVE flag but I
see that make_device_exclusive_entry() returns the pages locked and with an
additional get_page() reference. This doesn't fit well with the other
hmm_range_fault() entries being returned as a "snapshot" so having a different
API makes sense. I think it would be useful to add a HMM_PFN_EXCLUSIVE flag so
that snapshots of the page tables can at least report that a page is exclusively
being accessed by *some* device. Unfortunately, there is no pgmap pointer to be
able to tell which device has exclusive access (since any struct page could be
exclusively accessed, not just device private ones).


[PATCH v4 5/8] mm: Device exclusive memory access

2021-03-03 Thread Alistair Popple
Some devices require exclusive write access to shared virtual
memory (SVM) ranges to perform atomic operations on that memory. This
requires CPU page tables to be updated to deny access whilst atomic
operations are occurring.

In order to do this introduce a new swap entry
type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for
exclusive access by a device all page table mappings for the particular
range are replaced with device exclusive swap entries. This causes any
CPU access to the page to result in a fault.

Faults are resovled by replacing the faulting entry with the original
mapping. This results in MMU notifiers being called which a driver uses
to update access permissions such as revoking atomic access. After
notifiers have been called the device will no longer have exclusive
access to the region.

Signed-off-by: Alistair Popple 

---

v4:
* Add function to check that mappings are still valid and exclusive.
* s/long/unsigned long/ in make_device_exclusive_entry().
---
 Documentation/vm/hmm.rst |  15 +++
 include/linux/rmap.h |   5 +
 include/linux/swap.h |   4 +-
 include/linux/swapops.h  |  44 +++-
 mm/hmm.c |   5 +
 mm/memory.c  | 107 +-
 mm/mprotect.c|   8 ++
 mm/page_vma_mapped.c |   9 +-
 mm/rmap.c| 230 +++
 9 files changed, 420 insertions(+), 7 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 09e28507f5b2..f526ac4d8798 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -405,6 +405,21 @@ between device driver specific code and shared common code:
 
The lock can now be released.
 
+Exclusive access memory
+===
+
+Not all devices support atomic access to system memory. To support atomic
+operations to a shared virtual memory page such a device needs access to that
+page which is exclusive of any userspace access from the CPU. The
+``make_device_exclusive_range()`` function can be used to make a memory range
+inaccessible from userspace.
+
+This replaces all mappings for pages in the given range with special swap
+entries. Any attempt to access the swap entry results in a fault which is
+resovled by replacing the entry with the original mapping. A driver gets
+notified that the mapping has been changed by MMU notifiers, after which point
+it will no longer have exclusive access to the page.
+
 Memory cgroup (memcg) and rss accounting
 
 
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 77fa17de51d7..93cbb88f7253 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -193,6 +193,11 @@ int page_referenced(struct page *, int is_locked,
 bool try_to_migrate(struct page *page, enum ttu_flags flags);
 bool try_to_unmap(struct page *, enum ttu_flags flags);
 
+int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
+   unsigned long end, struct page **pages);
+bool check_device_exclusive_range(struct mm_struct *mm, unsigned long start,
+ unsigned long end, struct page **pages);
+
 /* Avoid racy checks */
 #define PVMW_SYNC  (1 << 0)
 /* Look for migarion entries rather than present PTEs */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 57a7690966a4..2d8b8e7fb4c2 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -63,9 +63,11 @@ static inline int current_is_kswapd(void)
  * to a special SWP_DEVICE_* entry.
  */
 #ifdef CONFIG_DEVICE_PRIVATE
-#define SWP_DEVICE_NUM 2
+#define SWP_DEVICE_NUM 4
 #define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
 #define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
+#define SWP_DEVICE_EXCLUSIVE_WRITE 
(MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
+#define SWP_DEVICE_EXCLUSIVE_READ 
(MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+3)
 #else
 #define SWP_DEVICE_NUM 0
 #endif
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 81008b0179cc..20b726e27f37 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -120,6 +120,27 @@ static inline bool 
is_writable_device_private_entry(swp_entry_t entry)
 {
return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
 }
+
+static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
+{
+   return swp_entry(SWP_DEVICE_EXCLUSIVE_READ, offset);
+}
+
+static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
+{
+   return swp_entry(SWP_DEVICE_EXCLUSIVE_WRITE, offset);
+}
+
+static inline bool is_device_exclusive_entry(swp_entry_t entry)
+{
+   return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_READ ||
+   swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE;
+}
+
+static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
+{
+   return unlikely(swp_type(entry) ==