Re: [PATCH RFCv2] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory
On 15.03.21 14:26, David Hildenbrand wrote: On 15.03.21 14:03, Kirill A. Shutemov wrote: On Mon, Mar 15, 2021 at 01:25:40PM +0100, David Hildenbrand wrote: On 15.03.21 13:22, Kirill A. Shutemov wrote: On Mon, Mar 08, 2021 at 05:45:20PM +0100, David Hildenbrand wrote: + case -EHWPOISON: /* Skip over any poisoned pages. */ + start += PAGE_SIZE; + continue; Why is it good approach? It's not abvious to me. My main motivation was to simplify return code handling. I don't want to return -EHWPOISON to user space Why? Hiding the problem under the rug doesn't help anybody. SIGBUS later is not better than an error upfront. Well, if you think about "prefaulting page tables", the first intuition is certainly not to check for poisoned pages, right? After all, you are not actually accessing memory, you are allocating memory if required and fill page tables. OTOH, mlock() will also choke on poisoned pages. With the current semantics, you can start and run a VM just fine. Preallocation/prefaulting succeeded after all. On access you will get a SIGBUS, from which e.g., QEMU can recover by injecting an MCE into the guest - just like if you would hit a poisoned page later. The problem we are talking about is most probably very rare, especially when using MADV_POPULATE_ for actual preallocation. I don't have a strong opinion; not bailing out on poisoned pages felt like the right thing to do. I'll switch to propagating -EHWPOISON, it matches how e.g., mlock() behaves -- not ignoring poisoned pages. Thanks! -- Thanks, David / dhildenb
Re: [PATCH RFCv2] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory
On 15.03.21 14:03, Kirill A. Shutemov wrote: On Mon, Mar 15, 2021 at 01:25:40PM +0100, David Hildenbrand wrote: On 15.03.21 13:22, Kirill A. Shutemov wrote: On Mon, Mar 08, 2021 at 05:45:20PM +0100, David Hildenbrand wrote: + case -EHWPOISON: /* Skip over any poisoned pages. */ + start += PAGE_SIZE; + continue; Why is it good approach? It's not abvious to me. My main motivation was to simplify return code handling. I don't want to return -EHWPOISON to user space Why? Hiding the problem under the rug doesn't help anybody. SIGBUS later is not better than an error upfront. Well, if you think about "prefaulting page tables", the first intuition is certainly not to check for poisoned pages, right? After all, you are not actually accessing memory, you are allocating memory if required and fill page tables. OTOH, mlock() will also choke on poisoned pages. With the current semantics, you can start and run a VM just fine. Preallocation/prefaulting succeeded after all. On access you will get a SIGBUS, from which e.g., QEMU can recover by injecting an MCE into the guest - just like if you would hit a poisoned page later. The problem we are talking about is most probably very rare, especially when using MADV_POPULATE_ for actual preallocation. I don't have a strong opinion; not bailing out on poisoned pages felt like the right thing to do. -- Thanks, David / dhildenb
Re: [PATCH RFCv2] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory
On Mon, Mar 15, 2021 at 01:25:40PM +0100, David Hildenbrand wrote: > On 15.03.21 13:22, Kirill A. Shutemov wrote: > > On Mon, Mar 08, 2021 at 05:45:20PM +0100, David Hildenbrand wrote: > > > + case -EHWPOISON: /* Skip over any poisoned pages. */ > > > + start += PAGE_SIZE; > > > + continue; > > > > Why is it good approach? It's not abvious to me. > > My main motivation was to simplify return code handling. I don't want to > return -EHWPOISON to user space Why? Hiding the problem under the rug doesn't help anybody. SIGBUS later is not better than an error upfront. -- Kirill A. Shutemov
Re: [PATCH RFCv2] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory
On 15.03.21 13:22, Kirill A. Shutemov wrote: On Mon, Mar 08, 2021 at 05:45:20PM +0100, David Hildenbrand wrote: + case -EHWPOISON: /* Skip over any poisoned pages. */ + start += PAGE_SIZE; + continue; Why is it good approach? It's not abvious to me. My main motivation was to simplify return code handling. I don't want to return -EHWPOISON to user space and I don't want to convert it into something misleading like -ENOMEM or -EINVAL. So I decided to handle such stuff internally. What would be you take on that? -- Thanks, David / dhildenb
Re: [PATCH RFCv2] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory
On Mon, Mar 08, 2021 at 05:45:20PM +0100, David Hildenbrand wrote: > + case -EHWPOISON: /* Skip over any poisoned pages. */ > + start += PAGE_SIZE; > + continue; Why is it good approach? It's not abvious to me. -- Kirill A. Shutemov
Re: [PATCH RFCv2] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory
On 08.03.21 17:45, David Hildenbrand wrote: I. Background: Sparse Memory Mappings When we manage sparse memory mappings dynamically in user space - also sometimes involving MAP_NORESERVE - we want to dynamically populate/ discard memory inside such a sparse memory region. Example users are hypervisors (especially implementing memory ballooning or similar technologies like virtio-mem) and memory allocators. In addition, we want to fail in a nice way (instead of generating SIGBUS) if populating does not succeed because we are out of backend memory (which can happen easily with file-based mappings, especially tmpfs and hugetlbfs). While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for reliably discarding memory, there is no generic approach to populate page tables and preallocate memory. Although mmap() supports MAP_POPULATE, it is not applicable to the concept of sparse memory mappings, where we want to do populate/discard dynamically and avoid expensive/problematic remappings. In addition, we never actually report errors during the final populate phase - it is best-effort only. fallocate() can be used to preallocate file-based memory and fail in a safe way. However, it cannot really be used for any private mappings on anonymous files via memfd due to COW semantics. In addition, fallocate() does not actually populate page tables, so we still always get pagefaults on first access - which is sometimes undesired (i.e., real-time workloads) and requires real prefaulting of page tables, not just a preallocation of backend storage. There might be interesting use cases for sparse memory regions along with mlockall(MCL_ONFAULT) which fallocate() cannot satisfy as it does not prefault page tables. II. On preallcoation/prefaulting from user space Because we don't have a proper interface, what applications (like QEMU and databases) end up doing is touching (i.e., reading+writing one byte to not overwrite existing data) all individual pages. However, that approach 1) Can result in wear on storage backing, because we end up writing and thereby dirtying each page --- i.e., disks or pmem. 2) Can result in mmap_sem contention when prefaulting via multiple threads. 3) Requires expensive signal handling, especially to catch SIGBUS in case of hugetlbfs/shmem/file-backed memory. For example, this is problematic in hypervisors like QEMU where SIGBUS handlers might already be used by other subsystems concurrently to e.g, handle hardware errors. "Simply" doing preallocation concurrently from other thread is not that easy. III. On MADV_WILLNEED Extending MADV_WILLNEED is not an option because 1. It would change the semantics: "Expect access in the near future." and "might be a good idea to read some pages" vs. "Definitely populate/ preallocate all memory and definitely fail on errors.". 2. Existing users (like virtio-balloon in QEMU when deflating the balloon) don't want populate/prealloc semantics. They treat this rather as a hint to give a little performance boost without too much overhead - and don't expect that a lot of memory might get consumed or a lot of time might be spent. IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE with the following semantics: 1. MADV_POPULATE_READ can be used to preallocate backend memory and prefault page tables just like manually reading each individual page. This will not break any COW mappings -- e.g., it will populate the shared zeropage when applicable. 2. If MADV_POPULATE_READ succeeds, all page tables have been populated (prefaulted) readable once. 3. MADV_POPULATE_WRITE can be used to preallocate backend memory and prefault page tables just like manually writing (or reading+writing) each individual page. This will break any COW mappings -- e.g., the shared zeropage is never populated. 4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated (prefaulted) writable once. 5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special mappings marked with VM_PFNMAP and VM_IO. Also, proper access permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such mapping is encountered, madvise() fails with -EINVAL. 6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables might have been populated. In that case, madvise() fails with -ENOMEM. 7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will ignore any poisoned pages in the range. 8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot protect from the OOM (Out Of Memory) handler killing the process. While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e., preallocate memory and prefault page tables for VMs), there are valid use cases for MADV_POPULATE_READ: 1. Efficiently populate page tables with zero pages (i.e., shared zeropage). This is necessary when using userfau
Re: [PATCH RFCv2] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory
On 09.03.21 08:35, Rolf Eike Beer wrote: diff --git a/mm/internal.h b/mm/internal.h index 9902648f2206..a5c4ed23b1db 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -340,6 +340,9 @@ void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma); #ifdef CONFIG_MMU extern long populate_vma_page_range(struct vm_area_struct *vma, unsigned long start, unsigned long end, int *nonblocking); +extern long faultin_vma_page_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end, + bool write, int *nonblocking); extern void munlock_vma_pages_range(struct vm_area_struct *vma, unsigned long start, unsigned long end); static inline void munlock_vma_pages_all(struct vm_area_struct *vma) The parameter name does not match the one in the implementation. Otherwise the implementation looks fine AFAICT. Hehe, you can tell how I copy-pasted from populate_vma_page_range(), because there, the variable names are messed up, too :) Will fix (most probably populate_vma_page_range() as well in a cleanup patch), thanks! -- Thanks, David / dhildenb
Re: [PATCH RFCv2] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory
diff --git a/mm/internal.h b/mm/internal.h index 9902648f2206..a5c4ed23b1db 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -340,6 +340,9 @@ void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma); #ifdef CONFIG_MMU extern long populate_vma_page_range(struct vm_area_struct *vma, unsigned long start, unsigned long end, int *nonblocking); +extern long faultin_vma_page_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end, + bool write, int *nonblocking); extern void munlock_vma_pages_range(struct vm_area_struct *vma, unsigned long start, unsigned long end); static inline void munlock_vma_pages_all(struct vm_area_struct *vma) The parameter name does not match the one in the implementation. Otherwise the implementation looks fine AFAICT. Eike