Re: [PATCH RFCv2] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-03-15 Thread David Hildenbrand

On 15.03.21 14:26, David Hildenbrand wrote:

On 15.03.21 14:03, Kirill A. Shutemov wrote:

On Mon, Mar 15, 2021 at 01:25:40PM +0100, David Hildenbrand wrote:

On 15.03.21 13:22, Kirill A. Shutemov wrote:

On Mon, Mar 08, 2021 at 05:45:20PM +0100, David Hildenbrand wrote:

+   case -EHWPOISON: /* Skip over any poisoned pages. */
+   start += PAGE_SIZE;
+   continue;


Why is it good approach? It's not abvious to me.


My main motivation was to simplify return code handling. I don't want to
return -EHWPOISON to user space


Why? Hiding the problem under the rug doesn't help anybody. SIGBUS later
is not better than an error upfront.


Well, if you think about "prefaulting page tables", the first intuition
is certainly not to check for poisoned pages, right? After all, you are
not actually accessing memory, you are allocating memory if required and
fill page tables. OTOH, mlock() will also choke on poisoned pages.

With the current semantics, you can start and run a VM just fine.
Preallocation/prefaulting succeeded after all. On access you will get a
SIGBUS, from which e.g., QEMU can recover by injecting an MCE into the
guest - just like if you would hit a poisoned page later.

The problem we are talking about is most probably very rare, especially
when using MADV_POPULATE_ for actual preallocation.

I don't have a strong opinion; not bailing out on poisoned pages felt
like the right thing to do.


I'll switch to propagating -EHWPOISON, it matches how e.g., mlock() 
behaves -- not ignoring poisoned pages. Thanks!


--
Thanks,

David / dhildenb



Re: [PATCH RFCv2] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-03-15 Thread David Hildenbrand

On 15.03.21 14:03, Kirill A. Shutemov wrote:

On Mon, Mar 15, 2021 at 01:25:40PM +0100, David Hildenbrand wrote:

On 15.03.21 13:22, Kirill A. Shutemov wrote:

On Mon, Mar 08, 2021 at 05:45:20PM +0100, David Hildenbrand wrote:

+   case -EHWPOISON: /* Skip over any poisoned pages. */
+   start += PAGE_SIZE;
+   continue;


Why is it good approach? It's not abvious to me.


My main motivation was to simplify return code handling. I don't want to
return -EHWPOISON to user space


Why? Hiding the problem under the rug doesn't help anybody. SIGBUS later
is not better than an error upfront.


Well, if you think about "prefaulting page tables", the first intuition 
is certainly not to check for poisoned pages, right? After all, you are 
not actually accessing memory, you are allocating memory if required and 
fill page tables. OTOH, mlock() will also choke on poisoned pages.


With the current semantics, you can start and run a VM just fine. 
Preallocation/prefaulting succeeded after all. On access you will get a 
SIGBUS, from which e.g., QEMU can recover by injecting an MCE into the 
guest - just like if you would hit a poisoned page later.


The problem we are talking about is most probably very rare, especially 
when using MADV_POPULATE_ for actual preallocation.


I don't have a strong opinion; not bailing out on poisoned pages felt 
like the right thing to do.


--
Thanks,

David / dhildenb



Re: [PATCH RFCv2] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-03-15 Thread Kirill A. Shutemov
On Mon, Mar 15, 2021 at 01:25:40PM +0100, David Hildenbrand wrote:
> On 15.03.21 13:22, Kirill A. Shutemov wrote:
> > On Mon, Mar 08, 2021 at 05:45:20PM +0100, David Hildenbrand wrote:
> > > + case -EHWPOISON: /* Skip over any poisoned pages. */
> > > + start += PAGE_SIZE;
> > > + continue;
> > 
> > Why is it good approach? It's not abvious to me.
> 
> My main motivation was to simplify return code handling. I don't want to
> return -EHWPOISON to user space

Why? Hiding the problem under the rug doesn't help anybody. SIGBUS later
is not better than an error upfront.

-- 
 Kirill A. Shutemov


Re: [PATCH RFCv2] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-03-15 Thread David Hildenbrand

On 15.03.21 13:22, Kirill A. Shutemov wrote:

On Mon, Mar 08, 2021 at 05:45:20PM +0100, David Hildenbrand wrote:

+   case -EHWPOISON: /* Skip over any poisoned pages. */
+   start += PAGE_SIZE;
+   continue;


Why is it good approach? It's not abvious to me.


My main motivation was to simplify return code handling. I don't want to 
return -EHWPOISON to user space and I don't want to convert it into 
something misleading like -ENOMEM or -EINVAL. So I decided to handle 
such stuff internally.


What would be you take on that?

--
Thanks,

David / dhildenb



Re: [PATCH RFCv2] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-03-15 Thread Kirill A. Shutemov
On Mon, Mar 08, 2021 at 05:45:20PM +0100, David Hildenbrand wrote:
> + case -EHWPOISON: /* Skip over any poisoned pages. */
> + start += PAGE_SIZE;
> + continue;

Why is it good approach? It's not abvious to me.

-- 
 Kirill A. Shutemov


Re: [PATCH RFCv2] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-03-10 Thread David Hildenbrand

On 08.03.21 17:45, David Hildenbrand wrote:

I. Background: Sparse Memory Mappings

When we manage sparse memory mappings dynamically in user space - also
sometimes involving MAP_NORESERVE - we want to dynamically populate/
discard memory inside such a sparse memory region. Example users are
hypervisors (especially implementing memory ballooning or similar
technologies like virtio-mem) and memory allocators. In addition, we want
to fail in a nice way (instead of generating SIGBUS) if populating does not
succeed because we are out of backend memory (which can happen easily with
file-based mappings, especially tmpfs and hugetlbfs).

While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for
reliably discarding memory, there is no generic approach to populate
page tables and preallocate memory.

Although mmap() supports MAP_POPULATE, it is not applicable to the concept
of sparse memory mappings, where we want to do populate/discard
dynamically and avoid expensive/problematic remappings. In addition,
we never actually report errors during the final populate phase - it is
best-effort only.

fallocate() can be used to preallocate file-based memory and fail in a safe
way. However, it cannot really be used for any private mappings on
anonymous files via memfd due to COW semantics. In addition, fallocate()
does not actually populate page tables, so we still always get
pagefaults on first access - which is sometimes undesired (i.e., real-time
workloads) and requires real prefaulting of page tables, not just a
preallocation of backend storage. There might be interesting use cases
for sparse memory regions along with mlockall(MCL_ONFAULT) which
fallocate() cannot satisfy as it does not prefault page tables.

II. On preallcoation/prefaulting from user space

Because we don't have a proper interface, what applications
(like QEMU and databases) end up doing is touching (i.e., reading+writing
one byte to not overwrite existing data) all individual pages.

However, that approach
1) Can result in wear on storage backing, because we end up writing
and thereby dirtying each page --- i.e., disks or pmem.
2) Can result in mmap_sem contention when prefaulting via multiple
threads.
3) Requires expensive signal handling, especially to catch SIGBUS in case
of hugetlbfs/shmem/file-backed memory. For example, this is
problematic in hypervisors like QEMU where SIGBUS handlers might already
be used by other subsystems concurrently to e.g, handle hardware errors.
"Simply" doing preallocation concurrently from other thread is not that
easy.

III. On MADV_WILLNEED

Extending MADV_WILLNEED is not an option because
1. It would change the semantics: "Expect access in the near future." and
"might be a good idea to read some pages" vs. "Definitely populate/
preallocate all memory and definitely fail on errors.".
2. Existing users (like virtio-balloon in QEMU when deflating the balloon)
don't want populate/prealloc semantics. They treat this rather as a hint
to give a little performance boost without too much overhead - and don't
expect that a lot of memory might get consumed or a lot of time
might be spent.

IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE

Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE with the
following semantics:
1. MADV_POPULATE_READ can be used to preallocate backend memory and
prefault page tables just like manually reading each individual page.
This will not break any COW mappings -- e.g., it will populate the
shared zeropage when applicable.
2. If MADV_POPULATE_READ succeeds, all page tables have been populated
(prefaulted) readable once.
3. MADV_POPULATE_WRITE can be used to preallocate backend memory and
prefault page tables just like manually writing (or
reading+writing) each individual page. This will break any COW
mappings -- e.g., the shared zeropage is never populated.
4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
(prefaulted) writable once.
5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special
mappings marked with VM_PFNMAP and VM_IO. Also, proper access
permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such
mapping is encountered, madvise() fails with -EINVAL.
6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables
might have been populated. In that case, madvise() fails with
-ENOMEM.
7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will ignore any poisoned
pages in the range.
8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE
 cannot protect from the OOM (Out Of Memory) handler killing the
 process.

While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
preallocate memory and prefault page tables for VMs), there are valid use
cases for MADV_POPULATE_READ:
1. Efficiently populate page tables with zero pages (i.e., shared
zeropage). This is necessary when using userfau

Re: [PATCH RFCv2] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-03-09 Thread David Hildenbrand

On 09.03.21 08:35, Rolf Eike Beer wrote:

diff --git a/mm/internal.h b/mm/internal.h
index 9902648f2206..a5c4ed23b1db 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -340,6 +340,9 @@ void __vma_unlink_list(struct mm_struct *mm,
struct vm_area_struct *vma);
  #ifdef CONFIG_MMU
  extern long populate_vma_page_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end, int *nonblocking);
+extern long faultin_vma_page_range(struct vm_area_struct *vma,
+  unsigned long start, unsigned long end,
+  bool write, int *nonblocking);
  extern void munlock_vma_pages_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end);
  static inline void munlock_vma_pages_all(struct vm_area_struct *vma)


The parameter name does not match the one in the implementation.

Otherwise the implementation looks fine AFAICT.


Hehe, you can tell how I copy-pasted from populate_vma_page_range(), 
because there, the variable names are messed up, too :)


Will fix (most probably populate_vma_page_range() as well in a cleanup 
patch), thanks!


--
Thanks,

David / dhildenb



Re: [PATCH RFCv2] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-03-08 Thread Rolf Eike Beer

diff --git a/mm/internal.h b/mm/internal.h
index 9902648f2206..a5c4ed23b1db 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -340,6 +340,9 @@ void __vma_unlink_list(struct mm_struct *mm,
struct vm_area_struct *vma);
 #ifdef CONFIG_MMU
 extern long populate_vma_page_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end, int *nonblocking);
+extern long faultin_vma_page_range(struct vm_area_struct *vma,
+  unsigned long start, unsigned long end,
+  bool write, int *nonblocking);
 extern void munlock_vma_pages_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end);
 static inline void munlock_vma_pages_all(struct vm_area_struct *vma)


The parameter name does not match the one in the implementation.

Otherwise the implementation looks fine AFAICT.

Eike