Re: [PATCH v1 2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-04-15 Thread David Hildenbrand

On 07.04.21 12:31, David Hildenbrand wrote:

On 30.03.21 18:31, David Hildenbrand wrote:

On 30.03.21 18:30, David Hildenbrand wrote:

On 30.03.21 18:21, Jann Horn wrote:

On Tue, Mar 30, 2021 at 5:01 PM David Hildenbrand  wrote:

+long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
+			    unsigned long end, bool write, int *locked)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long nr_pages = (end - start) / PAGE_SIZE;
+	int gup_flags;
+
+	VM_BUG_ON(!PAGE_ALIGNED(start));
+	VM_BUG_ON(!PAGE_ALIGNED(end));
+	VM_BUG_ON_VMA(start < vma->vm_start, vma);
+	VM_BUG_ON_VMA(end > vma->vm_end, vma);
+	mmap_assert_locked(mm);
+
+	/*
+	 * FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT when we hit
+	 *		  a poisoned page.
+	 * FOLL_POPULATE: Always populate memory with VM_LOCKONFAULT.
+	 * !FOLL_FORCE: Require proper access permissions.
+	 */
+	gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | FOLL_HWPOISON;
+	if (write)
+		gup_flags |= FOLL_WRITE;
+
+	/*
+	 * See check_vma_flags(): Will return -EFAULT on incompatible mappings
+	 * or with insufficient permissions.
+	 */
+	return __get_user_pages(mm, start, nr_pages, gup_flags,
+				NULL, NULL, locked);


You mentioned in the commit message that you don't want to actually
dirty all the file pages and force writeback; but doesn't
POPULATE_WRITE still do exactly that? In follow_page_pte(), if
FOLL_TOUCH and FOLL_WRITE are set, we mark the page as dirty:


Well, I mention that POPULATE_READ explicitly doesn't do that. I
primarily set it because populate_vma_page_range() also sets it.

Is it safe to *not* set it? IOW, fault something writable into a page
table (where the CPU could dirty it without additional page faults)
without marking it accessed? For me, this made logical sense. Thus I
also understood why populate_vma_page_range() set it.


FOLL_TOUCH doesn't have anything to do with installing the PTE - it
essentially means "the caller of get_user_pages wants to read/write
the contents of the returned page, so please do the same things you
would do if userspace was accessing the page". So in particular, if
you look up a page via get_user_pages() with FOLL_WRITE|FOLL_TOUCH,
that tells the MM subsystem "I will be writing into this page directly
from the kernel, bypassing the userspace page tables, so please mark
it as dirty now so that it will be properly written back later". Part
of that is that it marks the page as recently used, which has an
effect on LRU pageout behavior, I think - as far as I understand, that
is why populate_vma_page_range() uses FOLL_TOUCH.

If you look at __get_user_pages(), you can see that it is split up
into two major parts: faultin_page() for creating PTEs, and
follow_page_mask() for grabbing pages from PTEs. faultin_page()
ignores FOLL_TOUCH completely; only follow_page_mask() uses it.

In a way I guess maybe you do want the "mark as recently accessed"
part that FOLL_TOUCH would give you without FOLL_WRITE? But I think
you very much don't want the dirtying that FOLL_TOUCH|FOLL_WRITE leads
to. Maybe the ideal approach would be to add a new FOLL flag to say "I
only want to mark as recently used, I don't want to dirty". Or maybe
it's enough to just leave out the FOLL_TOUCH entirely, I don't know.


Any thoughts why populate_vma_page_range() does it?


Sorry, I missed the explanation above - thanks!


Looking into the details, adjusting the FOLL_TOUCH logic won't make too
much of a difference for MADV_POPULATE_WRITE I guess. AFAICS, the
biggest impact of FOLL_TOUCH is actually with FOLL_FORCE - which we are
not using, but populate_vma_page_range() is.


If a page was not faulted in yet,
faultin_page(FOLL_WRITE)->handle_mm_fault(FAULT_FLAG_WRITE) will already
mark the PTE/PMD/... dirty and accessed. One example is
handle_pte_fault(). We will mark the page accessed again via FOLL_TOUCH,
which doesn't seem to be strictly required.


If the page was already faulted in, we have three cases:

1. Page faulted in writable. The page should already be dirty (otherwise
we would be in trouble I guess). We will mark it accessed.

2. Page faulted in readable. handle_mm_fault() will fault it in writable
and set the page dirty.

3. Page faulted in readable and we have FOLL_FORCE. We mark the page
dirty and accessed.


So doing a MADV_POPULATE_WRITE, whereby we prefault page tables
writable, doesn't seem to fly without marking the pages dirty. That's
one reason why I included MADV_POPULATE_READ.

We could

a) Drop FOLL_TOUCH. We are not marking the page accessed, which would
mean it gets evicted sooner rather than later.

b) Introduce FOLL_ACCESSED which won't do the dirtying. But then, the
pages are already dirty as explained above, so there isn't a real
observable change.

c) Keep it as is: Mark the page accessed and 

Re: [PATCH v1 2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-04-07 Thread David Hildenbrand

On 30.03.21 18:31, David Hildenbrand wrote:

On 30.03.21 18:30, David Hildenbrand wrote:

On 30.03.21 18:21, Jann Horn wrote:

On Tue, Mar 30, 2021 at 5:01 PM David Hildenbrand  wrote:

+long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
+			    unsigned long end, bool write, int *locked)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long nr_pages = (end - start) / PAGE_SIZE;
+	int gup_flags;
+
+	VM_BUG_ON(!PAGE_ALIGNED(start));
+	VM_BUG_ON(!PAGE_ALIGNED(end));
+	VM_BUG_ON_VMA(start < vma->vm_start, vma);
+	VM_BUG_ON_VMA(end > vma->vm_end, vma);
+	mmap_assert_locked(mm);
+
+	/*
+	 * FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT when we hit
+	 *		  a poisoned page.
+	 * FOLL_POPULATE: Always populate memory with VM_LOCKONFAULT.
+	 * !FOLL_FORCE: Require proper access permissions.
+	 */
+	gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | FOLL_HWPOISON;
+	if (write)
+		gup_flags |= FOLL_WRITE;
+
+	/*
+	 * See check_vma_flags(): Will return -EFAULT on incompatible mappings
+	 * or with insufficient permissions.
+	 */
+	return __get_user_pages(mm, start, nr_pages, gup_flags,
+				NULL, NULL, locked);


You mentioned in the commit message that you don't want to actually
dirty all the file pages and force writeback; but doesn't
POPULATE_WRITE still do exactly that? In follow_page_pte(), if
FOLL_TOUCH and FOLL_WRITE are set, we mark the page as dirty:


Well, I mention that POPULATE_READ explicitly doesn't do that. I
primarily set it because populate_vma_page_range() also sets it.

Is it safe to *not* set it? IOW, fault something writable into a page
table (where the CPU could dirty it without additional page faults)
without marking it accessed? For me, this made logical sense. Thus I
also understood why populate_vma_page_range() set it.


FOLL_TOUCH doesn't have anything to do with installing the PTE - it
essentially means "the caller of get_user_pages wants to read/write
the contents of the returned page, so please do the same things you
would do if userspace was accessing the page". So in particular, if
you look up a page via get_user_pages() with FOLL_WRITE|FOLL_TOUCH,
that tells the MM subsystem "I will be writing into this page directly
from the kernel, bypassing the userspace page tables, so please mark
it as dirty now so that it will be properly written back later". Part
of that is that it marks the page as recently used, which has an
effect on LRU pageout behavior, I think - as far as I understand, that
is why populate_vma_page_range() uses FOLL_TOUCH.

If you look at __get_user_pages(), you can see that it is split up
into two major parts: faultin_page() for creating PTEs, and
follow_page_mask() for grabbing pages from PTEs. faultin_page()
ignores FOLL_TOUCH completely; only follow_page_mask() uses it.

In a way I guess maybe you do want the "mark as recently accessed"
part that FOLL_TOUCH would give you without FOLL_WRITE? But I think
you very much don't want the dirtying that FOLL_TOUCH|FOLL_WRITE leads
to. Maybe the ideal approach would be to add a new FOLL flag to say "I
only want to mark as recently used, I don't want to dirty". Or maybe
it's enough to just leave out the FOLL_TOUCH entirely, I don't know.


Any thoughts why populate_vma_page_range() does it?


Sorry, I missed the explanation above - thanks!


Looking into the details, adjusting the FOLL_TOUCH logic won't make too
much of a difference for MADV_POPULATE_WRITE I guess. AFAICS, the
biggest impact of FOLL_TOUCH is actually with FOLL_FORCE - which we are
not using, but populate_vma_page_range() is.



If a page was not faulted in yet, 
faultin_page(FOLL_WRITE)->handle_mm_fault(FAULT_FLAG_WRITE) will already 
mark the PTE/PMD/... dirty and accessed. One example is 
handle_pte_fault(). We will mark the page accessed again via FOLL_TOUCH, 
which doesn't seem to be strictly required.



If the page was already faulted in, we have three cases:

1. Page faulted in writable. The page should already be dirty (otherwise 
we would be in trouble I guess). We will mark it accessed.


2. Page faulted in readable. handle_mm_fault() will fault it in writable 
and set the page dirty.


3. Page faulted in readable and we have FOLL_FORCE. We mark the page 
dirty and accessed.



So doing a MADV_POPULATE_WRITE, whereby we prefault page tables 
writable, doesn't seem to fly without marking the pages dirty. That's 
one reason why I included MADV_POPULATE_READ.


We could

a) Drop FOLL_TOUCH. We are not marking the page accessed, which would
mean it gets evicted sooner rather than later.


b) Introduce FOLL_ACCESSED which won't do the dirtying. But then, the 
pages are already dirty as explained above, so there isn't a real 
observable change.


c) Keep it as is: Mark the page accessed and dirty. As it's already 

Re: [PATCH v1 2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-03-30 Thread David Hildenbrand

On 30.03.21 18:30, David Hildenbrand wrote:

On 30.03.21 18:21, Jann Horn wrote:

On Tue, Mar 30, 2021 at 5:01 PM David Hildenbrand  wrote:

+long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
+			    unsigned long end, bool write, int *locked)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long nr_pages = (end - start) / PAGE_SIZE;
+	int gup_flags;
+
+	VM_BUG_ON(!PAGE_ALIGNED(start));
+	VM_BUG_ON(!PAGE_ALIGNED(end));
+	VM_BUG_ON_VMA(start < vma->vm_start, vma);
+	VM_BUG_ON_VMA(end > vma->vm_end, vma);
+	mmap_assert_locked(mm);
+
+	/*
+	 * FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT when we hit
+	 *		  a poisoned page.
+	 * FOLL_POPULATE: Always populate memory with VM_LOCKONFAULT.
+	 * !FOLL_FORCE: Require proper access permissions.
+	 */
+	gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | FOLL_HWPOISON;
+	if (write)
+		gup_flags |= FOLL_WRITE;
+
+	/*
+	 * See check_vma_flags(): Will return -EFAULT on incompatible mappings
+	 * or with insufficient permissions.
+	 */
+	return __get_user_pages(mm, start, nr_pages, gup_flags,
+				NULL, NULL, locked);


You mentioned in the commit message that you don't want to actually
dirty all the file pages and force writeback; but doesn't
POPULATE_WRITE still do exactly that? In follow_page_pte(), if
FOLL_TOUCH and FOLL_WRITE are set, we mark the page as dirty:


Well, I mention that POPULATE_READ explicitly doesn't do that. I
primarily set it because populate_vma_page_range() also sets it.

Is it safe to *not* set it? IOW, fault something writable into a page
table (where the CPU could dirty it without additional page faults)
without marking it accessed? For me, this made logical sense. Thus I
also understood why populate_vma_page_range() set it.


FOLL_TOUCH doesn't have anything to do with installing the PTE - it
essentially means "the caller of get_user_pages wants to read/write
the contents of the returned page, so please do the same things you
would do if userspace was accessing the page". So in particular, if
you look up a page via get_user_pages() with FOLL_WRITE|FOLL_TOUCH,
that tells the MM subsystem "I will be writing into this page directly
from the kernel, bypassing the userspace page tables, so please mark
it as dirty now so that it will be properly written back later". Part
of that is that it marks the page as recently used, which has an
effect on LRU pageout behavior, I think - as far as I understand, that
is why populate_vma_page_range() uses FOLL_TOUCH.

If you look at __get_user_pages(), you can see that it is split up
into two major parts: faultin_page() for creating PTEs, and
follow_page_mask() for grabbing pages from PTEs. faultin_page()
ignores FOLL_TOUCH completely; only follow_page_mask() uses it.

In a way I guess maybe you do want the "mark as recently accessed"
part that FOLL_TOUCH would give you without FOLL_WRITE? But I think
you very much don't want the dirtying that FOLL_TOUCH|FOLL_WRITE leads
to. Maybe the ideal approach would be to add a new FOLL flag to say "I
only want to mark as recently used, I don't want to dirty". Or maybe
it's enough to just leave out the FOLL_TOUCH entirely, I don't know.


Any thoughts why populate_vma_page_range() does it?


Sorry, I missed the explanation above - thanks!


--
Thanks,

David / dhildenb



Re: [PATCH v1 2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-03-30 Thread David Hildenbrand

On 30.03.21 18:21, Jann Horn wrote:

On Tue, Mar 30, 2021 at 5:01 PM David Hildenbrand  wrote:

+long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
+			    unsigned long end, bool write, int *locked)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long nr_pages = (end - start) / PAGE_SIZE;
+	int gup_flags;
+
+	VM_BUG_ON(!PAGE_ALIGNED(start));
+	VM_BUG_ON(!PAGE_ALIGNED(end));
+	VM_BUG_ON_VMA(start < vma->vm_start, vma);
+	VM_BUG_ON_VMA(end > vma->vm_end, vma);
+	mmap_assert_locked(mm);
+
+	/*
+	 * FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT when we hit
+	 *		  a poisoned page.
+	 * FOLL_POPULATE: Always populate memory with VM_LOCKONFAULT.
+	 * !FOLL_FORCE: Require proper access permissions.
+	 */
+	gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | FOLL_HWPOISON;
+	if (write)
+		gup_flags |= FOLL_WRITE;
+
+	/*
+	 * See check_vma_flags(): Will return -EFAULT on incompatible mappings
+	 * or with insufficient permissions.
+	 */
+	return __get_user_pages(mm, start, nr_pages, gup_flags,
+				NULL, NULL, locked);


You mentioned in the commit message that you don't want to actually
dirty all the file pages and force writeback; but doesn't
POPULATE_WRITE still do exactly that? In follow_page_pte(), if
FOLL_TOUCH and FOLL_WRITE are set, we mark the page as dirty:


Well, I mention that POPULATE_READ explicitly doesn't do that. I
primarily set it because populate_vma_page_range() also sets it.

Is it safe to *not* set it? IOW, fault something writable into a page
table (where the CPU could dirty it without additional page faults)
without marking it accessed? For me, this made logical sense. Thus I
also understood why populate_vma_page_range() set it.


FOLL_TOUCH doesn't have anything to do with installing the PTE - it
essentially means "the caller of get_user_pages wants to read/write
the contents of the returned page, so please do the same things you
would do if userspace was accessing the page". So in particular, if
you look up a page via get_user_pages() with FOLL_WRITE|FOLL_TOUCH,
that tells the MM subsystem "I will be writing into this page directly
from the kernel, bypassing the userspace page tables, so please mark
it as dirty now so that it will be properly written back later". Part
of that is that it marks the page as recently used, which has an
effect on LRU pageout behavior, I think - as far as I understand, that
is why populate_vma_page_range() uses FOLL_TOUCH.

If you look at __get_user_pages(), you can see that it is split up
into two major parts: faultin_page() for creating PTEs, and
follow_page_mask() for grabbing pages from PTEs. faultin_page()
ignores FOLL_TOUCH completely; only follow_page_mask() uses it.

In a way I guess maybe you do want the "mark as recently accessed"
part that FOLL_TOUCH would give you without FOLL_WRITE? But I think
you very much don't want the dirtying that FOLL_TOUCH|FOLL_WRITE leads
to. Maybe the ideal approach would be to add a new FOLL flag to say "I
only want to mark as recently used, I don't want to dirty". Or maybe
it's enough to just leave out the FOLL_TOUCH entirely, I don't know.


Any thoughts why populate_vma_page_range() does it?

--
Thanks,

David / dhildenb



Re: [PATCH v1 2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-03-30 Thread Jann Horn
On Tue, Mar 30, 2021 at 5:01 PM David Hildenbrand  wrote:
> >> +long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
> >> +			    unsigned long end, bool write, int *locked)
> >> +{
> >> +	struct mm_struct *mm = vma->vm_mm;
> >> +	unsigned long nr_pages = (end - start) / PAGE_SIZE;
> >> +	int gup_flags;
> >> +
> >> +	VM_BUG_ON(!PAGE_ALIGNED(start));
> >> +	VM_BUG_ON(!PAGE_ALIGNED(end));
> >> +	VM_BUG_ON_VMA(start < vma->vm_start, vma);
> >> +	VM_BUG_ON_VMA(end > vma->vm_end, vma);
> >> +	mmap_assert_locked(mm);
> >> +
> >> +	/*
> >> +	 * FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT when we hit
> >> +	 *		  a poisoned page.
> >> +	 * FOLL_POPULATE: Always populate memory with VM_LOCKONFAULT.
> >> +	 * !FOLL_FORCE: Require proper access permissions.
> >> +	 */
> >> +	gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | FOLL_HWPOISON;
> >> +	if (write)
> >> +		gup_flags |= FOLL_WRITE;
> >> +
> >> +	/*
> >> +	 * See check_vma_flags(): Will return -EFAULT on incompatible mappings
> >> +	 * or with insufficient permissions.
> >> +	 */
> >> +	return __get_user_pages(mm, start, nr_pages, gup_flags,
> >> +				NULL, NULL, locked);
> >
> > You mentioned in the commit message that you don't want to actually
> > dirty all the file pages and force writeback; but doesn't
> > POPULATE_WRITE still do exactly that? In follow_page_pte(), if
> > FOLL_TOUCH and FOLL_WRITE are set, we mark the page as dirty:
>
> Well, I mention that POPULATE_READ explicitly doesn't do that. I
> primarily set it because populate_vma_page_range() also sets it.
>
> Is it safe to *not* set it? IOW, fault something writable into a page
> table (where the CPU could dirty it without additional page faults)
> without marking it accessed? For me, this made logical sense. Thus I
> also understood why populate_vma_page_range() set it.

FOLL_TOUCH doesn't have anything to do with installing the PTE - it
essentially means "the caller of get_user_pages wants to read/write
the contents of the returned page, so please do the same things you
would do if userspace was accessing the page". So in particular, if
you look up a page via get_user_pages() with FOLL_WRITE|FOLL_TOUCH,
that tells the MM subsystem "I will be writing into this page directly
from the kernel, bypassing the userspace page tables, so please mark
it as dirty now so that it will be properly written back later". Part
of that is that it marks the page as recently used, which has an
effect on LRU pageout behavior, I think - as far as I understand, that
is why populate_vma_page_range() uses FOLL_TOUCH.

If you look at __get_user_pages(), you can see that it is split up
into two major parts: faultin_page() for creating PTEs, and
follow_page_mask() for grabbing pages from PTEs. faultin_page()
ignores FOLL_TOUCH completely; only follow_page_mask() uses it.

In a way I guess maybe you do want the "mark as recently accessed"
part that FOLL_TOUCH would give you without FOLL_WRITE? But I think
you very much don't want the dirtying that FOLL_TOUCH|FOLL_WRITE leads
to. Maybe the ideal approach would be to add a new FOLL flag to say "I
only want to mark as recently used, I don't want to dirty". Or maybe
it's enough to just leave out the FOLL_TOUCH entirely, I don't know.
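
As a rough illustration of the split described above, here is a toy user-space model in C. It is not kernel code: the flag values, struct toy_page and all toy_* functions are invented for this sketch, and it models only how FOLL_TOUCH is handled during the lookup step. Whatever the fault handler itself does to dirty or accessed state is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy flag values for this sketch only; not the kernel's definitions. */
enum { TOY_FOLL_WRITE = 1 << 0, TOY_FOLL_TOUCH = 1 << 1 };

/* Hypothetical stand-in for the state of one mapped page. */
struct toy_page {
	bool mapped;
	bool writable;
	bool dirty;
	bool accessed;
};

/* Step 1: create the mapping. TOY_FOLL_TOUCH is ignored here. */
static void toy_faultin_page(struct toy_page *pg, unsigned int flags)
{
	pg->mapped = true;
	if (flags & TOY_FOLL_WRITE)
		pg->writable = true;
}

/* Step 2: look the page up. Only this step looks at TOY_FOLL_TOUCH. */
static void toy_follow_page(struct toy_page *pg, unsigned int flags)
{
	assert(pg->mapped);
	if (!(flags & TOY_FOLL_TOUCH))
		return;
	if (flags & TOY_FOLL_WRITE)
		pg->dirty = true;	/* caller announced a write to the page */
	pg->accessed = true;		/* page counts as recently used */
}

/* Run both steps on a fresh page and return the resulting state. */
static struct toy_page toy_gup(unsigned int flags)
{
	struct toy_page pg = { false, false, false, false };

	toy_faultin_page(&pg, flags);
	toy_follow_page(&pg, flags);
	return pg;
}
```

In this model, toy_gup(TOY_FOLL_WRITE | TOY_FOLL_TOUCH) yields a dirty, accessed page, toy_gup(TOY_FOLL_TOUCH) yields an accessed but clean page, and toy_gup(TOY_FOLL_WRITE) changes neither bit. The middle case corresponds to the "only mark as recently used" behavior a new flag would have to provide for write populates.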


Re: [PATCH v1 2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-03-30 Thread David Hildenbrand

[...]



Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE with the
following semantics:
1. MADV_POPULATE_READ can be used to preallocate backend memory and
prefault page tables just like manually reading each individual page.
This will not break any COW mappings -- e.g., it will populate the
shared zeropage when applicable.


Please clarify what is meant by "backend memory". As far as I can tell
from looking at the code, MADV_POPULATE_READ on file mappings will
allocate zeroed memory in the page cache, and map it as readonly pages
into userspace, but any attempt to actually write to that memory will
trigger the filesystem's ->page_mkwrite handler; and e.g. ext4 will
only try to allocate disk blocks at that point, which may fail. So as
far as I can tell, for files on filesystems like ext4, the current
implementation of MADV_POPULATE_READ does not replace fallocate(). Am
I missing something?


Thanks for pointing that out, I guess I was blinded by tmpfs/hugetlbfs 
behavior. There might be cases (!tmpfs, !hugetlbfs) where we indeed need 
fallocate()+MADV_POPULATE_READ on file mappings.


The logic is essentially what mlock()/MAP_POPULATE does via 
populate_vma_page_range() on shared mappings, so I assumed it would 
always properly allocate backend memory.


/*
 * We want to touch writable mappings with a write fault in order
 * to break COW, except for shared mappings because these don't COW
 * and we would not want to dirty them for nothing.
 */

My tests with MADV_POPULATE_READ:
1. MAP_SHARED on tmpfs: memory in the file is allocated
2. MAP_PRIVATE on tmpfs: memory in the file is allocated
3. MAP_SHARED on hugetlbfs: memory in the file is allocated
4. MAP_PRIVATE on hugetlbfs: memory in the file is *not* allocated
5. MAP_SHARED on ext4: memory in the file is *not* allocated
6. MAP_PRIVATE on ext4: memory in the file is *not* allocated

1..4 are also the reason why it works with memfd as expected.

For 4 and 6 it's not bad: writing to the private mapping will not result 
in backend storage/blocks having to get allocated. So the backend 
storage is actually RAM (although we don't allocate backend storage here 
but use the shared zero page, but that's a different story).


For 5. we indeed need fallocate() before MADV_POPULATE_READ in case we 
could have holes.


Thanks for pointing that out.
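
For case 5, the user-space sequence could look roughly like the sketch below. Assumptions: the MADV_POPULATE_READ value is the one found in later mainline uapi headers and may not match this v1 series, posix_fallocate() stands in for fallocate(), and both helper names are made up for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: value taken from later mainline headers, not from this patch. */
#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22
#endif

/* Hypothetical helper: create an already-unlinked scratch file in /tmp. */
static int open_scratch_file(void)
{
	char path[] = "/tmp/populate-XXXXXX";
	int fd = mkstemp(path);

	if (fd >= 0)
		unlink(path);
	return fd;
}

/*
 * Hypothetical helper: reserve backing store for the first len bytes of fd,
 * map them MAP_SHARED and ask the kernel to prefault them readable.
 * Returns the mapping or NULL. *populate_err is set to 0 or to the errno of
 * the madvise() call, so "mapped but not prefaulted" can be told apart.
 */
static void *map_shared_prealloc(int fd, size_t len, int *populate_err)
{
	void *p;
	int rc;

	assert(len > 0);
	/* Fill holes first so later write faults need no new disk blocks. */
	rc = posix_fallocate(fd, 0, (off_t)len);
	if (rc != 0) {
		errno = rc;	/* posix_fallocate() returns the error number */
		return NULL;
	}
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return NULL;
	*populate_err = madvise(p, len, MADV_POPULATE_READ) == 0 ? 0 : errno;
	return p;
}
```

A kernel without MADV_POPULATE_READ is expected to reject the advice with EINVAL, in which case the mapping is still usable and the blocks are still reserved; only the prefaulting is missing.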



If the desired semantics are that disk blocks should be preallocated,
I think you may have to look up the ->vm_file and then internally call
vfs_fallocate() to address this, kinda like in madvise_remove()?


Does not sound too complicated, but the devil might be in the details. At
least for MAP_SHARED this might be the right thing to do. As discussed
above, for MAP_PRIVATE we usually don't want to do that (and SHMEM is
just weird).


I honestly do wonder if breaking with MAP_POPULATE semantics is 
beneficial. For my use cases, doing fallocate() plus MADV_POPULATE_READ 
on shared, file-backed mappings would certainly be sufficient. But 
having a simple, consistent behavior would be much nicer.


I'll give it a thought!


2. If MADV_POPULATE_READ succeeds, all page tables have been populated
(prefaulted) readable once.
3. MADV_POPULATE_WRITE can be used to preallocate backend memory and
prefault page tables just like manually writing (or
reading+writing) each individual page. This will break any COW
mappings -- e.g., the shared zeropage is never populated.
4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
(prefaulted) writable once.
5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special
mappings marked with VM_PFNMAP and VM_IO. Also, proper access
permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such
mapping is encountered, madvise() fails with -EINVAL.
6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables
might have been populated. In that case, madvise() fails with
-ENOMEM.


AFAICS that's not true (or misphrased). If MADV_POPULATE_*
successfully populates a bunch of pages, then fails because of an
error (e.g. EHWPOISON), it will return EHWPOISON, not ENOMEM, right?


Indeed, leftover from previous version. It's clearer in the man page I 
prepared, will fix it up.





7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON
when encountering a HW poisoned page in the range.
8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE
cannot protect from the OOM (Out Of Memory) handler killing the
process.

While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
preallocate memory and prefault page tables for VMs), there are valid use
cases for MADV_POPULATE_READ:
1. Efficiently populate page tables with zero pages (i.e., shared
zeropage). This is necessary when using userfaultfd() WP (Write-Protect)
to properly catch all modifications within a mapping: for
write-protection to be effective for a virtual address, there has to be
 

Re: [PATCH v1 2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-03-30 Thread Jann Horn
On Wed, Mar 17, 2021 at 12:07 PM David Hildenbrand  wrote:
> I. Background: Sparse Memory Mappings
>
> When we manage sparse memory mappings dynamically in user space - also
> sometimes involving MAP_NORESERVE - we want to dynamically populate/
> discard memory inside such a sparse memory region. Example users are
> hypervisors (especially implementing memory ballooning or similar
> technologies like virtio-mem) and memory allocators. In addition, we want
> to fail in a nice way (instead of generating SIGBUS) if populating does not
> succeed because we are out of backend memory (which can happen easily with
> file-based mappings, especially tmpfs and hugetlbfs).
>
> While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for
> reliably discarding memory, there is no generic approach to populate
> page tables and preallocate memory.
>
> Although mmap() supports MAP_POPULATE, it is not applicable to the concept
> of sparse memory mappings, where we want to do populate/discard
> dynamically and avoid expensive/problematic remappings. In addition,
> we never actually report errors during the final populate phase - it is
> best-effort only.
>
> fallocate() can be used to preallocate file-based memory and fail in a safe
> way. However, it cannot really be used for any private mappings on
> anonymous files via memfd due to COW semantics. In addition, fallocate()
> does not actually populate page tables, so we still always get
> pagefaults on first access - which is sometimes undesired (e.g., real-time
> workloads) and requires real prefaulting of page tables, not just a
> preallocation of backend storage. There might be interesting use cases
> for sparse memory regions along with mlockall(MCL_ONFAULT) which
> fallocate() cannot satisfy as it does not prefault page tables.
>
> II. On preallocation/prefaulting from user space
>
> Because we don't have a proper interface, what applications
> (like QEMU and databases) end up doing is touching (i.e., reading+writing
> one byte to not overwrite existing data) all individual pages.
>
> However, that approach
> 1) Can result in wear on storage backing, because we end up writing
>    and thereby dirtying each page --- i.e., disks or pmem.
> 2) Can result in mmap_sem contention when prefaulting via multiple
>    threads.
> 3) Requires expensive signal handling, especially to catch SIGBUS in case
>    of hugetlbfs/shmem/file-backed memory. For example, this is
>    problematic in hypervisors like QEMU where SIGBUS handlers might already
>    be used by other subsystems concurrently to e.g., handle hardware errors.
>    "Simply" doing preallocation concurrently from another thread is not that
>    easy.
>
> III. On MADV_WILLNEED
>
> Extending MADV_WILLNEED is not an option because
> 1. It would change the semantics: "Expect access in the near future." and
>    "might be a good idea to read some pages" vs. "Definitely populate/
>    preallocate all memory and definitely fail on errors.".
> 2. Existing users (like virtio-balloon in QEMU when deflating the balloon)
>    don't want populate/prealloc semantics. They treat this rather as a hint
>    to give a little performance boost without too much overhead - and don't
>    expect that a lot of memory might get consumed or a lot of time
>    might be spent.
>
> IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE
>
> Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE with the
> following semantics:
> 1. MADV_POPULATE_READ can be used to preallocate backend memory and
>    prefault page tables just like manually reading each individual page.
>    This will not break any COW mappings -- e.g., it will populate the
>    shared zeropage when applicable.

Please clarify what is meant by "backend memory". As far as I can tell
from looking at the code, MADV_POPULATE_READ on file mappings will
allocate zeroed memory in the page cache, and map it as readonly pages
into userspace, but any attempt to actually write to that memory will
trigger the filesystem's ->page_mkwrite handler; and e.g. ext4 will
only try to allocate disk blocks at that point, which may fail. So as
far as I can tell, for files on filesystems like ext4, the current
implementation of MADV_POPULATE_READ does not replace fallocate(). Am
I missing something?

If the desired semantics are that disk blocks should be preallocated,
I think you may have to look up the ->vm_file and then internally call
vfs_fallocate() to address this, kinda like in madvise_remove()?

> 2. If MADV_POPULATE_READ succeeds, all page tables have been populated
>    (prefaulted) readable once.
> 3. MADV_POPULATE_WRITE can be used to preallocate backend memory and
>    prefault page tables just like manually writing (or
>    reading+writing) each individual page. This will break any COW
>    mappings -- e.g., the shared zeropage is never populated.
> 4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
>    (prefaulted) writable once.
> 5. MADV_POPULATE_READ and 

[PATCH v1 2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-03-17 Thread David Hildenbrand
I. Background: Sparse Memory Mappings

When we manage sparse memory mappings dynamically in user space - also
sometimes involving MAP_NORESERVE - we want to dynamically populate/
discard memory inside such a sparse memory region. Example users are
hypervisors (especially implementing memory ballooning or similar
technologies like virtio-mem) and memory allocators. In addition, we want
to fail in a nice way (instead of generating SIGBUS) if populating does not
succeed because we are out of backend memory (which can happen easily with
file-based mappings, especially tmpfs and hugetlbfs).

While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for
reliably discarding memory, there is no generic approach to populate
page tables and preallocate memory.

Although mmap() supports MAP_POPULATE, it is not applicable to the concept
of sparse memory mappings, where we want to do populate/discard
dynamically and avoid expensive/problematic remappings. In addition,
we never actually report errors during the final populate phase - it is
best-effort only.

fallocate() can be used to preallocate file-based memory and fail in a safe
way. However, it cannot really be used for any private mappings on
anonymous files via memfd due to COW semantics. In addition, fallocate()
does not actually populate page tables, so we still always get
pagefaults on first access - which is sometimes undesired (e.g., real-time
workloads) and requires real prefaulting of page tables, not just a
preallocation of backend storage. There might be interesting use cases
for sparse memory regions along with mlockall(MCL_ONFAULT) which
fallocate() cannot satisfy as it does not prefault page tables.

II. On preallocation/prefaulting from user space

Because we don't have a proper interface, what applications
(like QEMU and databases) end up doing is touching (i.e., reading+writing
one byte to not overwrite existing data) all individual pages.

However, that approach
1) Can result in wear on storage backing, because we end up writing
   and thereby dirtying each page --- i.e., disks or pmem.
2) Can result in mmap_sem contention when prefaulting via multiple
   threads.
3) Requires expensive signal handling, especially to catch SIGBUS in case
   of hugetlbfs/shmem/file-backed memory. For example, this is
   problematic in hypervisors like QEMU where SIGBUS handlers might already
   be used by other subsystems concurrently to e.g., handle hardware errors.
   "Simply" doing preallocation concurrently from another thread is not that
   easy.
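
For reference, the touching approach criticized above can be sketched in user space roughly as follows. This is an illustration only: the helper names prefault_by_touching() and map_anon() are made up here and are not taken from QEMU or any other project.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Prefault [addr, addr + len) by reading one byte per page and writing the
 * same value back. Existing data is preserved, but every page still takes a
 * write fault and ends up dirty. If a fault cannot be satisfied, the process
 * receives SIGBUS instead of an error code. The read-modify-write is not
 * atomic with respect to other threads writing the same byte.
 */
static void prefault_by_touching(void *addr, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	volatile unsigned char *p = addr;

	assert(len > 0);
	for (size_t off = 0; off < len; off += page)
		p[off] = p[off];
}

/* Hypothetical helper for the sketch: map len bytes of anonymous memory. */
static void *map_anon(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}
```

The loop has no way to report failure to its caller, which is the reason point 3) above involves signal handling.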

III. On MADV_WILLNEED

Extending MADV_WILLNEED is not an option because
1. It would change the semantics: "Expect access in the near future." and
   "might be a good idea to read some pages" vs. "Definitely populate/
   preallocate all memory and definitely fail on errors.".
2. Existing users (like virtio-balloon in QEMU when deflating the balloon)
   don't want populate/prealloc semantics. They treat this rather as a hint
   to give a little performance boost without too much overhead - and don't
   expect that a lot of memory might get consumed or a lot of time
   might be spent.

IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE

Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE with the
following semantics:
1. MADV_POPULATE_READ can be used to preallocate backend memory and
   prefault page tables just like manually reading each individual page.
   This will not break any COW mappings -- e.g., it will populate the
   shared zeropage when applicable.
2. If MADV_POPULATE_READ succeeds, all page tables have been populated
   (prefaulted) readable once.
3. MADV_POPULATE_WRITE can be used to preallocate backend memory and
   prefault page tables just like manually writing (or
   reading+writing) each individual page. This will break any COW
   mappings -- e.g., the shared zeropage is never populated.
4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
   (prefaulted) writable once.
5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special
   mappings marked with VM_PFNMAP and VM_IO. Also, proper access
   permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such
   mapping is encountered, madvise() fails with -EINVAL.
6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables
   might have been populated. In that case, madvise() fails with
   -ENOMEM.
7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON
   when encountering a HW poisoned page in the range.
8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE
   cannot protect from the OOM (Out Of Memory) handler killing the
   process.
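
From user space, a call following the semantics listed above might look like the minimal sketch below. Assumptions: the two advice values are the ones found in later mainline uapi headers and may differ from this v1 series, and populate_range() is a hypothetical wrapper invented for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: values taken from later mainline headers, not from this patch. */
#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22
#endif
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Hypothetical wrapper: populate [addr, addr + len) readable (write == 0)
 * or writable (write != 0). Returns 0 on success or a negative errno value.
 * On failure some page tables may already have been populated. A kernel
 * without these advice values is expected to answer with -EINVAL as well.
 */
static int populate_range(void *addr, size_t len, int write)
{
	int advice = write ? MADV_POPULATE_WRITE : MADV_POPULATE_READ;

	assert(len > 0);
	if (madvise(addr, len, advice) == 0)
		return 0;
	return -errno;
}
```

Unlike touching each page by hand, a failure shows up as a return value that the caller can act on, for example by discarding the range again, rather than as a SIGBUS.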

While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
preallocate memory and prefault page tables for VMs), there are valid use
cases for MADV_POPULATE_READ:
1. Efficiently populate page tables with zero pages (i.e., shared
   zeropage). This is necessary when using userfaultfd() WP (Write-Protect)
   to properly catch