Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT
On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote: MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if userland touches a still unmapped virtual address, a sigbus signal is sent instead of allocating a new page. The sigbus signal handler will then resolve the page fault in userland by calling the remap_anon_pages syscall. Hm. I wounder if this functionality really fits madvise(2) interface: as far as I understand it, it provides a way to give a *hint* to kernel which may or may not trigger an action from kernel side. I don't think an application will behaive reasonably if kernel ignore the *advise* and will not send SIGBUS, but allocate memory. I would suggest to consider to use some other interface for the functionality: a new syscall or, perhaps, mprotect(). -- Kirill A. Shutemov
Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT
* Kirill A. Shutemov (kir...@shutemov.name) wrote: On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote: MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if userland touches a still unmapped virtual address, a sigbus signal is sent instead of allocating a new page. The sigbus signal handler will then resolve the page fault in userland by calling the remap_anon_pages syscall. Hm. I wounder if this functionality really fits madvise(2) interface: as far as I understand it, it provides a way to give a *hint* to kernel which may or may not trigger an action from kernel side. I don't think an application will behaive reasonably if kernel ignore the *advise* and will not send SIGBUS, but allocate memory. Aren't DONTNEED and DONTDUMP similar cases of madvise operations that are expected to do what they say ? I would suggest to consider to use some other interface for the functionality: a new syscall or, perhaps, mprotect(). Dave -- Kirill A. Shutemov -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT
On Tue, Oct 07, 2014 at 11:46:04AM +0100, Dr. David Alan Gilbert wrote: * Kirill A. Shutemov (kir...@shutemov.name) wrote: On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote: MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if userland touches a still unmapped virtual address, a sigbus signal is sent instead of allocating a new page. The sigbus signal handler will then resolve the page fault in userland by calling the remap_anon_pages syscall. Hm. I wounder if this functionality really fits madvise(2) interface: as far as I understand it, it provides a way to give a *hint* to kernel which may or may not trigger an action from kernel side. I don't think an application will behaive reasonably if kernel ignore the *advise* and will not send SIGBUS, but allocate memory. Aren't DONTNEED and DONTDUMP similar cases of madvise operations that are expected to do what they say ? No. If kernel would ignore MADV_DONTNEED or MADV_DONTDUMP it will not affect correctness, just behaviour will be suboptimal: more than needed memory used or wasted space in coredump. -- Kirill A. Shutemov
Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT
* Kirill A. Shutemov (kir...@shutemov.name) wrote: On Tue, Oct 07, 2014 at 11:46:04AM +0100, Dr. David Alan Gilbert wrote: * Kirill A. Shutemov (kir...@shutemov.name) wrote: On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote: MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if userland touches a still unmapped virtual address, a sigbus signal is sent instead of allocating a new page. The sigbus signal handler will then resolve the page fault in userland by calling the remap_anon_pages syscall. Hm. I wounder if this functionality really fits madvise(2) interface: as far as I understand it, it provides a way to give a *hint* to kernel which may or may not trigger an action from kernel side. I don't think an application will behaive reasonably if kernel ignore the *advise* and will not send SIGBUS, but allocate memory. Aren't DONTNEED and DONTDUMP similar cases of madvise operations that are expected to do what they say ? No. If kernel would ignore MADV_DONTNEED or MADV_DONTDUMP it will not affect correctness, just behaviour will be suboptimal: more than needed memory used or wasted space in coredump. That's not how the manpage reads for DONTNEED; it calls it out as a special case near the top, and explicitly says what will happen if you read the area marked as DONTNEED. It looks like there are openssl patches that use DONTDUMP to explicitly make sure keys etc don't land in cores. Dave -- Kirill A. Shutemov -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT
On Tue, Oct 07, 2014 at 12:01:02PM +0100, Dr. David Alan Gilbert wrote: * Kirill A. Shutemov (kir...@shutemov.name) wrote: On Tue, Oct 07, 2014 at 11:46:04AM +0100, Dr. David Alan Gilbert wrote: * Kirill A. Shutemov (kir...@shutemov.name) wrote: On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote: MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if userland touches a still unmapped virtual address, a sigbus signal is sent instead of allocating a new page. The sigbus signal handler will then resolve the page fault in userland by calling the remap_anon_pages syscall. Hm. I wounder if this functionality really fits madvise(2) interface: as far as I understand it, it provides a way to give a *hint* to kernel which may or may not trigger an action from kernel side. I don't think an application will behaive reasonably if kernel ignore the *advise* and will not send SIGBUS, but allocate memory. Aren't DONTNEED and DONTDUMP similar cases of madvise operations that are expected to do what they say ? No. If kernel would ignore MADV_DONTNEED or MADV_DONTDUMP it will not affect correctness, just behaviour will be suboptimal: more than needed memory used or wasted space in coredump. That's not how the manpage reads for DONTNEED; it calls it out as a special case near the top, and explicitly says what will happen if you read the area marked as DONTNEED. Your are right. MADV_DONTNEED doesn't fit the interface too. That's bad and we can't fix it. But it's not a reason to make this mistake again. Read the next sentence: The kernel is free to ignore the advice. Note, POSIX_MADV_DONTNEED has totally different semantics. It looks like there are openssl patches that use DONTDUMP to explicitly make sure keys etc don't land in cores. That's nice to have. But openssl works on systems without the interface, meaning it's not essential for functionality. -- Kirill A. Shutemov
Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT
Hi Kirill, On Tue, Oct 07, 2014 at 01:36:45PM +0300, Kirill A. Shutemov wrote: On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote: MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if userland touches a still unmapped virtual address, a sigbus signal is sent instead of allocating a new page. The sigbus signal handler will then resolve the page fault in userland by calling the remap_anon_pages syscall. Hm. I wounder if this functionality really fits madvise(2) interface: as far as I understand it, it provides a way to give a *hint* to kernel which may or may not trigger an action from kernel side. I don't think an application will behaive reasonably if kernel ignore the *advise* and will not send SIGBUS, but allocate memory. I would suggest to consider to use some other interface for the functionality: a new syscall or, perhaps, mprotect(). I didn't feel like adding PROT_USERFAULT to mprotect, which looks hardwired to just these flags: PROT_NONE The memory cannot be accessed at all. PROT_READ The memory can be read. PROT_WRITE The memory can be modified. PROT_EXEC The memory can be executed. Normally mprotect doesn't just alter the vmas but it also alters pte/hugepmds protection bits, that's something that is never needed with VM_USERFAULT so I didn't feel like VM_USERFAULT is a protection change to the VMA. mprotect is also hardwired to mangle only the VM_READ|WRITE|EXEC flags, while madvise is ideal to set arbitrary vma flags. From an implementation standpoint the perfect place to set a flag in a vma is madvise. This is what MADV_DONTFORK (it sets VM_DONTCOPY) already does too in an identical way to MADV_USERFAULT/VM_USERFAULT. MADV_DONTFORK is as critical as MADV_USERFAULT because people depends on it for example to prevent the O_DIRECT vs fork race condition that results in silent data corruption during I/O with threads that may fork. The other reason why MADV_DONTFORK is critical is that fork() would otherwise fail with OOM unless full overcommit is enabled (i.e. pci hotplug crashes the guest if you forget to set MADV_DONTFORK). Another madvise that would generate a failure if not obeyed by the kernel is MADV_DONTNEED that if it does nothing it could run lead to OOM killing. We don't inflate virt balloons using munmap just to make an example. Various other apps (maybe JVM garbage collection too) makes extensive use of MADV_DONTNEED and depend on it. Said that I can change it to mprotect, the only thing that I don't like is that it'll result in a less clean patch and I can't possibly see a practical risk in keeping it simpler with madvise, as long as we always return -EINVAL whenever we encounter a vma type that cannot raise userfaults yet (that is something I already enforced). Yet another option would be to drop MADV_USERFAULT and vm_flagsVM_USERFAULT entirely and in turn the ability to handle userfaults with SIGBUS, and retain only the userfaultfd. The new userfaultfd protocol requires registering each created userfaultfd into its own private virtual memory ranges (that is to allow an unlimited number of userfaultfd per process). Currently the userfaultfd engages iff the fault address intersects both the MADV_USERFAULT range and the userfaultfd registered ranges. So I could drop MADV_USERFAULT and VM_USERFAULT and just check for vma-vm_userfaultfd_ctx!=NULL to know if the userfaultfd protocol needs to be engaged during the first page fault for a still unmapped virtual address. I just thought it would be more flexibile to also allow SIGBUS without forcing people to use userfaultfd (that's in fact the only reason to still retain madvise(MADV_USERFAULT)!). Volatile pages earlier patches only supported SIGBUS behavior for example.. and I didn't intend to force them to use userfaultfd if they're guaranteed to access the memory with the CPU and never through a kernel syscall (that is something the app can enforce by design). userfaultfd becomes necessary the moment you want to handle userfaults through syscalls/gup etc... qemu obviously requires userfaultfd and it never uses the userfaultfd-less SIGBUS behavior as it touches the memory in all possible ways (first and foremost with the KVM page fault that uses almost all variants of gup..). So here somebody should comment and choose between: 1) set VM_USERFAULT with mprotect(PROT_USERFAULT) instead of the current madvise(MADV_USERFAULT) 2) drop MADV_USERFAULT and VM_USERFAULT and force the usage of the userfaultfd protocol as the only way for userland to catch userfaults (each userfaultfd must already register itself into its own virtual memory ranges so it's a trivial change for userfaultfd users that deletes just 1 or 2 lines of userland code, but it would prevent to use the SIGBUS behavior with info-si_addr=faultaddr for other users) 3) keep things as they are now:
Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT
On Tue, Oct 07, 2014 at 03:24:58PM +0200, Andrea Arcangeli wrote: Hi Kirill, On Tue, Oct 07, 2014 at 01:36:45PM +0300, Kirill A. Shutemov wrote: On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote: MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if userland touches a still unmapped virtual address, a sigbus signal is sent instead of allocating a new page. The sigbus signal handler will then resolve the page fault in userland by calling the remap_anon_pages syscall. Hm. I wounder if this functionality really fits madvise(2) interface: as far as I understand it, it provides a way to give a *hint* to kernel which may or may not trigger an action from kernel side. I don't think an application will behaive reasonably if kernel ignore the *advise* and will not send SIGBUS, but allocate memory. I would suggest to consider to use some other interface for the functionality: a new syscall or, perhaps, mprotect(). I didn't feel like adding PROT_USERFAULT to mprotect, which looks hardwired to just these flags: PROT_NOALLOC may be? PROT_NONE The memory cannot be accessed at all. PROT_READ The memory can be read. PROT_WRITE The memory can be modified. PROT_EXEC The memory can be executed. To be complete: PROT_GROWSDOWN, PROT_GROWSUP and unused PROT_SEM. So here somebody should comment and choose between: 1) set VM_USERFAULT with mprotect(PROT_USERFAULT) instead of the current madvise(MADV_USERFAULT) 2) drop MADV_USERFAULT and VM_USERFAULT and force the usage of the userfaultfd protocol as the only way for userland to catch userfaults (each userfaultfd must already register itself into its own virtual memory ranges so it's a trivial change for userfaultfd users that deletes just 1 or 2 lines of userland code, but it would prevent to use the SIGBUS behavior with info-si_addr=faultaddr for other users) 3) keep things as they are now: use MADV_USERFAULT for SIGBUS userfaults, with optional intersection between the vm_flagsVM_USERFAULT ranges and the userfaultfd registered ranges with vma-vm_userfaultfd_ctx!=NULL to know if to engage the userfaultfd protocol instead of the plain SIGBUS 4) new syscall? I will update the code accordingly to feedback, so please comment. I don't have strong points on this. Just *feel* it doesn't fit advice semantics. The only userspace interface I've designed was not proven good by time. I would listen what senior maintainers say. :) -- Kirill A. Shutemov
Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT
Hi, On Sat, Oct 04, 2014 at 08:13:36AM +0900, Mike Hommey wrote: On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote: MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if userland touches a still unmapped virtual address, a sigbus signal is sent instead of allocating a new page. The sigbus signal handler will then resolve the page fault in userland by calling the remap_anon_pages syscall. What does unmapped virtual address mean in this context? To clarify this I added this in a second sentence in the commit header: still unmapped virtual address of the previous sentence in this context means that the pte/trans_huge_pmd is null. It means it's an hole inside the anonymous vma (the kind of hole that doesn't account for RSS but only virtual size of the process). It is the same state all anonymous virtual memory is, right after mmap. The same state that if you read from it, will map a zeropage into the faulting virtual address. If the page is swapped out, it will not trigger userfaults. If something isn't clear let me know. Thanks, Andrea
Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT
On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote: MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if userland touches a still unmapped virtual address, a sigbus signal is sent instead of allocating a new page. The sigbus signal handler will then resolve the page fault in userland by calling the remap_anon_pages syscall. What does unmapped virtual address mean in this context? Mike