Re: [Qemu-devel] [PATCH 00/23] userfaultfd v4
Hi Andrea,

On Thu, May 21, 2015 at 05:52:51PM +0200, Andrea Arcangeli wrote:
> Hi Kirill,
>
> On Thu, May 21, 2015 at 04:11:11PM +0300, Kirill Smelkov wrote:
> > Sorry for maybe speaking up too late, but here is additional real
>
> Not too late, in fact I don't think there's any change required for
> this at this stage, but it'd be great if you could help me to review.

Thanks.

> > Since arrays can be large, it would be slow and thus not practical to
> [..]
> > So I've implemented a scheme where array data is initially PROT_READ
> > protected, then we catch SIGSEGV, if it is write and area belongs to
> > array
>
> In the case of postcopy live migration (for qemu and/or containers)
> and postcopy live snapshotting, splitting the vmas is not an option
> because we may run out of them. If your PROT_READ areas are limited
> perhaps this isn't an issue, but with hundreds-of-GB guests (currently
> plenty in production) that need to live migrate fully reliably and
> fast, the vmas could exceed the limit if we were to use mprotect. If
> your arrays are very large and the PROT_READ areas aren't limited,
> userfaultfd isn't only an optimization for you too: it's actually a
> must to avoid a potential -ENOMEM.

I understand. To somehow mitigate this issue, for every array/file I
try to allocate RAM pages from a separate file on tmpfs at the same
offset. This way, if we allocate a lot of pages and mmap them in with
PROT_READ, and they are adjacent to each other, the kernel will merge
the adjacent vmas into one vma:

https://lab.nexedi.cn/kirr/wendelin.core/blob/ca064f75/include/wendelin/bigfile/ram.h#L100
https://lab.nexedi.cn/kirr/wendelin.core/blob/ca064f75/bigfile/ram_shmfs.c#L102
https://lab.nexedi.cn/kirr/wendelin.core/blob/ca064f75/bigfile/virtmem.c#L435

I agree this is only a half-measure - the file parts accessed could be
sparse, and there is also in-shmfs overhead for maintaining tables of
which real pages are allocated to which part of the file on shmfs.
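[ The vma-merging trick just described can be sketched as below. This is
a hedged, minimal illustration, not wendelin.core's actual code: it uses
memfd_create as a stand-in for a file on tmpfs, and the function name is
made up. It maps two adjacent pages of the same file, at matching file
offsets and with matching protection, and then counts how many
/proc/self/maps entries cover the region - one entry means the kernel
merged the vmas. ]

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map two adjacent pages of the same tmpfs-backed file with matching
 * protection and matching file offsets, then return the number of
 * /proc/self/maps entries covering the region: 1 if the kernel merged
 * the two mappings into a single vma. */
int count_vmas_after_adjacent_maps(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    int fd = memfd_create("ram", 0);     /* stand-in for a tmpfs file */
    ftruncate(fd, 2 * ps);

    /* Reserve an address range, then map each file page over it at its
     * matching offset, one mmap call per page. */
    char *base = mmap(NULL, 2 * ps, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    mmap(base, ps, PROT_READ, MAP_SHARED | MAP_FIXED, fd, 0);
    mmap(base + ps, ps, PROT_READ, MAP_SHARED | MAP_FIXED, fd, ps);

    /* Count how many maps lines fall inside [base, base + 2*ps). */
    FILE *f = fopen("/proc/self/maps", "r");
    char line[256];
    int n = 0;
    while (fgets(line, sizeof(line), f)) {
        unsigned long start, end;
        if (sscanf(line, "%lx-%lx", &start, &end) != 2)
            continue;
        if (start >= (unsigned long)base &&
            end <= (unsigned long)base + 2 * ps)
            n++;
    }
    fclose(f);
    munmap(base, 2 * ps);
    close(fd);
    return n;
}
```

With mismatched offsets or protections the two mappings would stay as
two vmas, which is exactly the mprotect-driven vma explosion discussed
above.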
So yes, if userfaultfd allows us to overcome the vma layer and work
directly on page tables, this helps.

> > Also, since arrays could be large - bigger than RAM, and only sparse
> > parts of it could be needed to get needed information, for reading
> > it also makes sense to lazily load data in SIGSEGV handler with
> > initial PROT_NONE protection.
>
> Similarly, I heard somebody wrote a fastresume to load the suspended
> (on-disk) guest ram using userfaultfd. That is a slightly less
> fundamental case than postcopy, because you could also do it with
> MAP_SHARED, but it's still interesting in allowing to compress or
> decompress the suspended ram on the fly with lz4, for example -
> something MAP_PRIVATE/MAP_SHARED wouldn't do (plus there's the
> additional benefit of not having an orphaned inode left open even if
> the file is deleted, which prevents unmounting the filesystem for the
> whole lifetime of the guest).

I see. Just a note - transparent compression/decompression could be
achieved with MAP_SHARED if the compression is performed by the
underlying filesystem - e.g. one implemented with FUSE. (I have not
measured performance though.)

> > This is very similar to how memory mapped files work, but adds
> > transactionality which, as far as I know, is not provided by any
> > currently in-kernel filesystem on Linux.
>
> That's another benefit, yes.

> > The gist of virtual memory-manager is this:
> >
> > https://lab.nexedi.cn/kirr/wendelin.core/blob/master/include/wendelin/bigfile/virtmem.h
> > https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/virtmem.c
> > (vma_on_pagefault)
>
> I'll check it more in detail ASAP, thanks for the pointers!

> > For operations it currently needs
> >
> > - establishing virtual memory areas and connecting to tracking it
>
> That's the UFFDIO_REGISTER/UNREGISTER.

Yes.

> > - changing pages protection
> >
> >   PROT_NONE or absent - initially
>
> absent is what works with -mm already. The lazy loading already works.

Yes.

> >   PROT_NONE -> PROT_READ - after read
>
> The current UFFDIO_COPY will map it using vma->vm_page_prot.
> We'll need a new flag for UFFDIO_COPY to map it readonly. This is
> already contemplated:
>
> 	/*
> 	 * There will be a wrprotection flag later that allows to map
> 	 * pages wrprotected on the fly. And such a flag will be
> 	 * available if the wrprotection ioctl are implemented for the
> 	 * range according to the uffdio_register.ioctls.
> 	 */
> #define UFFDIO_COPY_MODE_DONTWAKE	((__u64)1<<0)
> 	__u64 mode;
>
> If the memory protection framework exists (either through the
> uffdio_register.ioctls out value, or through the uffdio_api.features
> out-only value) you can pass a new flag (MODE_WP) above to transition
> from absent to PROT_READ.

Yes. The same probably applies to UFFDIO_ZEROPAGE (to mmap-in the
zeropage as RO on read, if that part of the file is currently a hole).

So do we settle on adding UFFDIO_COPY_MODE_WP and
UFFDIO_ZEROPAGE_MODE_WP? Or
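[ As context for the UFFDIO_COPY discussion above, here is a sketch of
the part that already works in -mm: resolving a missing-page fault from
a monitor thread with UFFDIO_COPY (no wrprotect flag yet). It is written
against the userfaultfd API as it was later merged; demo_uffd_fill and
the 0x41 fill pattern are illustrative, and the sketch returns -1
instead of failing where the syscall is unavailable or restricted (e.g.
vm.unprivileged_userfaultfd=0). ]

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <pthread.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Monitor thread: wait for one missing-page fault message, then
 * resolve it by copying in a page filled with 0x41 via UFFDIO_COPY. */
static void *fault_handler(void *arg)
{
    int uffd = *(int *)arg;
    struct pollfd pfd = { .fd = uffd, .events = POLLIN };
    poll(&pfd, 1, -1);

    struct uffd_msg msg;
    if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
        return NULL;

    static char page[65536];            /* covers any common page size */
    memset(page, 0x41, sizeof(page));

    long ps = sysconf(_SC_PAGESIZE);
    struct uffdio_copy copy = {
        .dst = msg.arg.pagefault.address & ~(unsigned long)(ps - 1),
        .src = (unsigned long)page,
        .len = ps,
        .mode = 0,                      /* wakes the blocked fault */
    };
    ioctl(uffd, UFFDIO_COPY, &copy);
    return NULL;
}

/* Returns 0x41 on success, -1 if userfaultfd is unavailable. */
int demo_uffd_fill(void)
{
    long uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0)
        return -1;

    struct uffdio_api api = { .api = UFFD_API };
    if (ioctl(uffd, UFFDIO_API, &api))
        return -1;

    long ps = sysconf(_SC_PAGESIZE);
    char *area = mmap(NULL, ps, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)area, .len = ps },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };
    if (ioctl(uffd, UFFDIO_REGISTER, &reg))
        return -1;

    pthread_t thr;
    int fd = uffd;
    pthread_create(&thr, NULL, fault_handler, &fd);

    int c = area[0];                    /* absent page: blocks in the
                                           kernel until UFFDIO_COPY */
    pthread_join(thr, NULL);
    munmap(area, ps);
    close(uffd);
    return c;
}
```

A UFFDIO_COPY_MODE_WP flag, as discussed above, would be passed in
`copy.mode` so the filled page lands write-protected instead of with
the full vma->vm_page_prot permissions.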
Re: [Qemu-devel] [PATCH 00/23] userfaultfd v4
Hello up there,

On Thu, May 14, 2015 at 07:30:57PM +0200, Andrea Arcangeli wrote:
> Hello everyone,
>
> This is the latest userfaultfd patchset against mm-v4.1-rc3
> 2015-05-14-10:04.
>
> The postcopy live migration feature on the qemu side is mostly ready
> to be merged and it entirely depends on the userfaultfd syscall to be
> merged as well. So it'd be great if this patchset could be reviewed
> for merging in -mm.
>
> Userfaults allow to implement on-demand paging from userland, and
> more generally they allow userland to take control of the behavior of
> page faults more efficiently than what was available before
> (PROT_NONE + SIGSEGV trap). The use cases are:
> [...]
> Even though there wasn't a real use case requesting it yet, it also
> allows to implement distributed shared memory in a way that readonly
> shared mappings can exist simultaneously in different hosts and can
> become exclusive at the first wrprotect fault.

Sorry for maybe speaking up too late, but here is an additional real
potential use-case which in my view overlaps with the above:

Recently we needed to implement persistency for NumPy arrays - that is,
to track changes made to array memory, and transactionally either
abandon the changes on transaction abort or store them back to storage
on transaction commit.

Since arrays can be large, it would be slow and thus not practical to
keep a copy of the original data and compare memory against it to find
which parts of an array have been changed.

So I've implemented a scheme where array data is initially PROT_READ
protected; we then catch SIGSEGV, and if it is a write and the area
belongs to array data, we mark that page as PROT_WRITE and continue. At
commit time we know which parts were modified.

Also, since arrays could be large - bigger than RAM - and only sparse
parts of them could be needed to get the needed information, for
reading it also makes sense to lazily load data in the SIGSEGV handler,
with initial PROT_NONE protection.
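[ The mprotect/SIGSEGV write-tracking scheme described above can be
sketched as follows. This is a minimal illustration, not wendelin.core's
actual handler (virtmem.c does much more); the function names are made
up, and calling mprotect from a signal handler is not formally
async-signal-safe, though it is the classic way this technique is done
in practice. ]

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 4
static char *base;
static long pagesize;
static volatile int dirty[NPAGES];

/* On a write fault inside the tracked range: remember the page as
 * dirty, upgrade it to read-write, and return so the faulting write
 * is retried and now succeeds. */
static void segv_handler(int sig, siginfo_t *si, void *ctx)
{
    char *addr = (char *)si->si_addr;
    if (base && addr >= base && addr < base + NPAGES * pagesize) {
        long page = (addr - base) / pagesize;
        dirty[page] = 1;
        if (mprotect(base + page * pagesize, pagesize,
                     PROT_READ | PROT_WRITE) == 0)
            return;
    }
    _exit(1);                /* fault is not ours: bail out */
}

/* Map NPAGES read-only, install the handler, dirty two pages, and
 * return how many pages were recorded as modified. */
int demo_write_tracking(void)
{
    pagesize = sysconf(_SC_PAGESIZE);
    base = mmap(NULL, NPAGES * pagesize, PROT_READ,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa, old;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, &old);

    base[0] = 'x';                 /* faults: handler marks page 0 */
    base[2 * pagesize] = 'y';      /* faults: handler marks page 2 */

    sigaction(SIGSEGV, &old, NULL);
    int n = 0;
    for (int i = 0; i < NPAGES; i++)
        n += dirty[i];
    munmap(base, NPAGES * pagesize);
    base = NULL;
    return n;
}
```

At commit time the dirty[] set is exactly the list of pages to write
back; the lazy-read variant starts from PROT_NONE instead of PROT_READ
and loads the page contents in the handler before unprotecting.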
This is very similar to how memory mapped files work, but adds
transactionality which, as far as I know, is not provided by any
currently in-kernel filesystem on Linux.

The system is done as files, and arrays are then built on top of such
memory-mapped files. So from now on we can forget about NumPy arrays
and only talk about files, their mapping, lazy loading, and
transactionally storing in-memory changes back to file storage.

To get this working, a custom user-space virtual memory manager is
used, which manages RAM memory pages and file mappings into the virtual
address-space, tracks page protection, and handles SIGSEGV
appropriately. The gist of the virtual memory-manager is this:

https://lab.nexedi.cn/kirr/wendelin.core/blob/master/include/wendelin/bigfile/virtmem.h
https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/virtmem.c
(vma_on_pagefault)

For operations it currently needs:

- establishing virtual memory areas and connecting them to tracking

- changing page protection:

    PROT_NONE or absent - initially
    PROT_NONE      -> PROT_READ      - after read
    PROT_READ      -> PROT_READWRITE - after write
    PROT_READWRITE -> PROT_READ      - after commit
    PROT_READWRITE -> PROT_NONE or absent (again) - after abort
    PROT_READ      -> PROT_NONE or absent (again) - on reclaim

- working with aliasable memory (thus taken from tmpfs): there could be
  two overlapping-in-file mappings of a file (array), requested at
  different times, and changes through one mapping should propagate to
  the other - for the common parts only one page should be
  memory-mapped into the two places in the address-space.

So what is currently lacking on the userfaultfd side is:

- the ability to remove / make PROT_NONE already-mapped pages
  (UFFDIO_REMAP was recently dropped)

- the ability to arbitrarily change page protection (e.g. RW -> R)

- injecting aliasable memory from tmpfs (or better hugetlbfs) into
  several places (UFFDIO_REMAP + some mapping-copy semantic).

The code is ugly because it is only a prototype.
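[ The "aliasable memory" requirement above - one physical page visible
through two mappings, with writes through one immediately visible in
the other - is what tmpfs-backed shared mappings already provide. A
minimal hedged sketch, with memfd_create standing in for a tmpfs file
and an illustrative function name: ]

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Map one tmpfs page at two different addresses; both vmas share the
 * same physical page, so a store through one alias is observed through
 * the other without any copying.  Returns 1 on success. */
int demo_alias(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    int fd = memfd_create("alias", 0);  /* stand-in for a tmpfs file */
    if (fd < 0)
        return 0;
    ftruncate(fd, ps);

    char *a = mmap(NULL, ps, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, ps, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    a[0] = 'z';                         /* write through the first alias */
    int ok = (b[0] == 'z');             /* ...seen through the second */

    munmap(a, ps);
    munmap(b, ps);
    close(fd);
    return ok;
}
```

What the email asks of userfaultfd is the missing piece on top of this:
injecting such pages into a *tracked* range on fault, rather than
mapping them eagerly.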
You can clone/read it all from here:

https://lab.nexedi.cn/kirr/wendelin.core

The virtual memory-manager even has tests, and from them it can be seen
how the system is supposed to work (after each access - which pages are
mapped, where, and how):

https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/tests/test_virtmem.c

The performance currently is not great, partly because of page clearing
when getting RAM from tmpfs, and partly because of mprotect/SIGSEGV/vma
overhead and other dumb things on my side. I still wanted to show the
case, as userfaultfd here has the potential to remove the kernel-related
overhead.

Thanks beforehand for feedback,
Kirill

P.S. some context

http://www.wendelin.io/NXD-Wendelin.Core.Non.Secret/asEntireHTML
Re: [Qemu-devel] [PATCH 00/23] userfaultfd v4
Hi Kirill,

On Thu, May 21, 2015 at 04:11:11PM +0300, Kirill Smelkov wrote:
> Sorry for maybe speaking up too late, but here is additional real

Not too late, in fact I don't think there's any change required for
this at this stage, but it'd be great if you could help me to review.

> Since arrays can be large, it would be slow and thus not practical to
[..]
> So I've implemented a scheme where array data is initially PROT_READ
> protected, then we catch SIGSEGV, if it is write and area belongs to
> array

In the case of postcopy live migration (for qemu and/or containers)
and postcopy live snapshotting, splitting the vmas is not an option
because we may run out of them. If your PROT_READ areas are limited
perhaps this isn't an issue, but with hundreds-of-GB guests (currently
plenty in production) that need to live migrate fully reliably and
fast, the vmas could exceed the limit if we were to use mprotect. If
your arrays are very large and the PROT_READ areas aren't limited,
userfaultfd isn't only an optimization for you too: it's actually a
must to avoid a potential -ENOMEM.

> Also, since arrays could be large - bigger than RAM, and only sparse
> parts of it could be needed to get needed information, for reading it
> also makes sense to lazily load data in SIGSEGV handler with initial
> PROT_NONE protection.

Similarly, I heard somebody wrote a fastresume to load the suspended
(on-disk) guest ram using userfaultfd. That is a slightly less
fundamental case than postcopy, because you could also do it with
MAP_SHARED, but it's still interesting in allowing to compress or
decompress the suspended ram on the fly with lz4, for example -
something MAP_PRIVATE/MAP_SHARED wouldn't do (plus there's the
additional benefit of not having an orphaned inode left open even if
the file is deleted, which prevents unmounting the filesystem for the
whole lifetime of the guest).
> This is very similar to how memory mapped files work, but adds
> transactionality which, as far as I know, is not provided by any
> currently in-kernel filesystem on Linux.

That's another benefit, yes.

> The gist of virtual memory-manager is this:
>
> https://lab.nexedi.cn/kirr/wendelin.core/blob/master/include/wendelin/bigfile/virtmem.h
> https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/virtmem.c
> (vma_on_pagefault)

I'll check it more in detail ASAP, thanks for the pointers!

> For operations it currently needs
>
> - establishing virtual memory areas and connecting to tracking it

That's the UFFDIO_REGISTER/UNREGISTER.

> - changing pages protection
>
>   PROT_NONE or absent - initially

absent is what works with -mm already. The lazy loading already works.

>   PROT_NONE -> PROT_READ - after read

The current UFFDIO_COPY will map it using vma->vm_page_prot. We'll need
a new flag for UFFDIO_COPY to map it readonly. This is already
contemplated:

	/*
	 * There will be a wrprotection flag later that allows to map
	 * pages wrprotected on the fly. And such a flag will be
	 * available if the wrprotection ioctl are implemented for the
	 * range according to the uffdio_register.ioctls.
	 */
#define UFFDIO_COPY_MODE_DONTWAKE	((__u64)1<<0)
	__u64 mode;

If the memory protection framework exists (either through the
uffdio_register.ioctls out value, or through the uffdio_api.features
out-only value) you can pass a new flag (MODE_WP) above to transition
from absent to PROT_READ.

>   PROT_READ -> PROT_READWRITE - after write

This will need a new UFFDIO_MPROTECT.

>   PROT_READWRITE -> PROT_READ - after commit

UFFDIO_MPROTECT again (but harder if going from rw to ro, because of a
slight mess to solve with regard to FAULT_FLAG_TRIED, in case you want
to run this UFFDIO_MPROTECT without stopping the threads that are
accessing the memory concurrently). And this should only work if the
uffdio_register.mode had MODE_WP set, so we don't run into the races
created by COWs (gup vs fork race).
>   PROT_READWRITE -> PROT_NONE or absent (again) - after abort

UFFDIO_MPROTECT again, but you won't be able to read the page contents
inside the memory manager thread (the one working with userfaultfd).
The manager is at all times forbidden to touch the memory it is
tracking with userfaultfd (if it does, it'll deadlock, but kill -9 will
get rid of it). gdb, ironically because it is using an underoptimized
access_process_vm, wouldn't hang: FAULT_FLAG_RETRY won't be set in
handle_userfault in the gdb context, and it'll just receive a SIGBUS if
the user touches the memory by mistake. Even if gdb will hang later,
once get_user_pages_locked|unlocked gets used there too, kill -9 would
solve gdb too.

Back to the problem of accessing the UFFDIO_MPROTECT(PROT_NONE) memory:
to do that, a new ioctl would be required. I'd rather not go back to
the route of UFFDIO_REMAP, but it could copy the
Re: [Qemu-devel] [PATCH 00/23] userfaultfd v4
Hi Andrew,

On Tue, May 19, 2015 at 02:38:01PM -0700, Andrew Morton wrote:
> On Thu, 14 May 2015 19:30:57 +0200 Andrea Arcangeli
> <aarca...@redhat.com> wrote:
> > This is the latest userfaultfd patchset against mm-v4.1-rc3
> > 2015-05-14-10:04.
>
> It would be useful to have some userfaultfd testcases in
> tools/testing/selftests/. Partly as an aid to arch maintainers when
> enabling this. And also as a standalone thing to give people a
> practical way of exercising this interface.

Agreed. I was also thinking about writing a trinity module for it; I
wrote one for an older version, but it was much easier to do back then,
before we had ioctls. Now it's more tricky because the ioctls require
the fd to be open first etc... it's not enough to just call a syscall
with a flood of supervised-random params anymore.

> What are your thoughts on enabling userfaultfd for other
> architectures, btw? Are there good use cases, are people working on
> it, etc?

powerpc should be enabled and functional already. There's not much
arch-dependent code in it, so in theory, if the postcopy live migration
patchset is applied to qemu, it should work on powerpc out of the box.
Nobody tested it yet, but I don't expect trouble on the kernel side.
Adding support for all other archs is just a few-liner patch that
defines the syscall number. I didn't do that out of tree because every
time a new syscall materialized I would get more rejects during rebase.

> Also, I assume a manpage is in the works? Sooner rather than later
> would be good - Michael's review of proposed kernel interfaces has
> often been valuable.

Yes, the manpage was certainly planned. It will require updates as we
keep adding features (like the wrprotect tracking, the non-cooperative
usage, and extending the availability of the ioctls to tmpfs). We can
definitely write a manpage with the current features.

Ok, so I'll continue working on the testcase and on the manpage.

Thanks!!
Andrea
Re: [Qemu-devel] [PATCH 00/23] userfaultfd v4
Hello Richard,

On Tue, May 19, 2015 at 11:59:42PM +0200, Richard Weinberger wrote:
> On Tue, May 19, 2015 at 11:38 PM, Andrew Morton
> <a...@linux-foundation.org> wrote:
> > On Thu, 14 May 2015 19:30:57 +0200 Andrea Arcangeli
> > <aarca...@redhat.com> wrote:
> > > This is the latest userfaultfd patchset against mm-v4.1-rc3
> > > 2015-05-14-10:04.
> >
> > It would be useful to have some userfaultfd testcases in
> > tools/testing/selftests/. Partly as an aid to arch maintainers when
> > enabling this. And also as a standalone thing to give people a
> > practical way of exercising this interface.
> >
> > What are your thoughts on enabling userfaultfd for other
> > architectures, btw? Are there good use cases, are people working on
> > it, etc?
>
> UML is using SIGSEGV for page faults, i.e. the UML process receives a
> SIGSEGV, learns the faulting address from the mcontext, and resolves
> the fault by installing a new mapping. If userfaultfd is faster than
> the SIGSEGV notification, it could speed up UML a bit. For UML I'm
> only interested in the notification, not the resolving part. The
> missing data is present; only a new mapping is needed. No copy of
> data.
>
> Andrea, what do you think?

I think you need some kind of UFFDIO_MPROTECT ioctl, the same ioctl
that wrprotect tracking also needs. At the moment we focused the future
plans mostly on wrprotection tracking, but it could be extended to
protnone tracking, either with the same feature flag as wrprotection
(with a generic UFFDIO_MPROTECT) or with two separate feature flags and
two separate ioctls.

Your pages are not missing; like in the postcopy live snapshotting
case, the pages are there. The userfaultfd memory protection ioctl
would not modify the VMA; it would just selectively mark
pte/trans_huge_pmd entries wrprotected/protnone in order to get the
faults. In the case of postcopy live snapshotting, a single ioctl call
would mark the entire guest address space readonly.
For live snapshotting the fault resolution is a no-brainer: when you
get the fault, the page is still readable and just needs to be copied
off by the live snapshotting thread to a different location, and then
UFFDIO_MPROTECT would be called again to make the page writable and
wake the blocked fault.

For protnone, you need to modify the page before waking the blocked
userfault; you can't just remove the protnone, or other threads could
modify it (if there are other threads). You'd need a further ioctl to
copy the page off to a different place by using its kernel address (the
userland address is not mapped) and to copy it back to overwrite the
original page. Alternatively, once we extend handle_userfault to tmpfs,
you could map the page in two virtual mappings and track the faults in
one mapping (where the tracked app runs) while reading/writing the page
contents through the other mapping, which isn't tracked by the
userfault.

These are the first thoughts that come to mind without knowing exactly
what you need to do after you get the fault address, and without
knowing exactly why you need to mark the region PROT_NONE.

There will be some complications in adding the wrprotection/protnone
feature: if faults can already happen when the wrprotect/protnone is
armed, handle_userfault() could be invoked in a retry-fault; that is
not ok without allowing the userfault to return VM_FAULT_RETRY even
during a refault (i.e. FAULT_FLAG_TRIED set but FAULT_FLAG_ALLOW_RETRY
not set). The invariants of vma->vm_page_prot and pte/trans_huge_pmd
permissions must also not break anywhere. These are the two main
reasons why the features that require flipping protection bits are left
to be implemented later, and made visible later, with uffdio_api.feature
flags and/or through uffdio_register.ioctls during UFFDIO_REGISTER.
Re: [Qemu-devel] [PATCH 00/23] userfaultfd v4
On Tue, May 19, 2015 at 11:38 PM, Andrew Morton
<a...@linux-foundation.org> wrote:
> On Thu, 14 May 2015 19:30:57 +0200 Andrea Arcangeli
> <aarca...@redhat.com> wrote:
> > This is the latest userfaultfd patchset against mm-v4.1-rc3
> > 2015-05-14-10:04.
>
> It would be useful to have some userfaultfd testcases in
> tools/testing/selftests/. Partly as an aid to arch maintainers when
> enabling this. And also as a standalone thing to give people a
> practical way of exercising this interface.
>
> What are your thoughts on enabling userfaultfd for other
> architectures, btw? Are there good use cases, are people working on
> it, etc?

UML is using SIGSEGV for page faults, i.e. the UML process receives a
SIGSEGV, learns the faulting address from the mcontext, and resolves
the fault by installing a new mapping. If userfaultfd is faster than
the SIGSEGV notification, it could speed up UML a bit. For UML I'm only
interested in the notification, not the resolving part. The missing
data is present; only a new mapping is needed. No copy of data.

Andrea, what do you think?

--
Thanks,
//richard
Re: [Qemu-devel] [PATCH 00/23] userfaultfd v4
On Thu, 14 May 2015 19:30:57 +0200 Andrea Arcangeli
<aarca...@redhat.com> wrote:
> This is the latest userfaultfd patchset against mm-v4.1-rc3
> 2015-05-14-10:04.

It would be useful to have some userfaultfd testcases in
tools/testing/selftests/. Partly as an aid to arch maintainers when
enabling this. And also as a standalone thing to give people a
practical way of exercising this interface.

What are your thoughts on enabling userfaultfd for other architectures,
btw? Are there good use cases, are people working on it, etc?

Also, I assume a manpage is in the works? Sooner rather than later
would be good - Michael's review of proposed kernel interfaces has
often been valuable.
Re: [Qemu-devel] [PATCH 00/23] userfaultfd v4
On 05/14/2015 08:30 PM, Andrea Arcangeli wrote:
> Hello everyone,
>
> This is the latest userfaultfd patchset against mm-v4.1-rc3
> 2015-05-14-10:04.
>
> The postcopy live migration feature on the qemu side is mostly ready
> to be merged and it entirely depends on the userfaultfd syscall to be
> merged as well. So it'd be great if this patchset could be reviewed
> for merging in -mm.
>
> Userfaults allow to implement on demand paging from userland and more
> generally they allow userland to more efficiently take control of the
> behavior of page faults than what was available before (PROT_NONE +
> SIGSEGV trap).

Not to spam with 23 e-mails, all patches are

Acked-by: Pavel Emelyanov <xe...@parallels.com>

Thanks!

-- Pavel
[Qemu-devel] [PATCH 00/23] userfaultfd v4
Hello everyone,

This is the latest userfaultfd patchset against mm-v4.1-rc3
2015-05-14-10:04.

The postcopy live migration feature on the qemu side is mostly ready to
be merged and it entirely depends on the userfaultfd syscall to be
merged as well. So it'd be great if this patchset could be reviewed for
merging in -mm.

Userfaults allow to implement on-demand paging from userland, and more
generally they allow userland to take control of the behavior of page
faults more efficiently than what was available before (PROT_NONE +
SIGSEGV trap). The use cases are:

1) KVM postcopy live migration (one form of cloud memory
   externalization). KVM postcopy live migration is the primary driver
   of this work:

   http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/
   http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html

2) postcopy live migration of binaries inside linux containers:

   http://thread.gmane.org/gmane.linux.kernel.mm/132662

3) KVM postcopy live snapshotting (allowing to limit/throttle the
   memory usage, unlike fork would, plus avoiding the fork overhead in
   the first place). While the wrprotect tracking is not implemented
   yet, the syscall API already contemplates wrprotect fault tracking
   and is generic enough to allow its later implementation in a
   backwards compatible fashion.

4) KVM userfaults on shared memory. The UFFDIO_COPY lowlevel method
   should be extended to also work on tmpfs; then
   uffdio_register.ioctls will notify userland that UFFDIO_COPY is
   available even when the registered virtual memory range is tmpfs
   backed.

5) an alternate mechanism to notify web browsers or apps on embedded
   devices that volatile pages have been reclaimed. This basically
   avoids the need to run a syscall before the app can access with the
   CPU the virtual regions marked volatile. This depends on point 4)
   being fulfilled first, as volatile pages happily apply to tmpfs.
Even though there wasn't a real use case requesting it yet, it also
allows to implement distributed shared memory in a way that readonly
shared mappings can exist simultaneously in different hosts and can
become exclusive at the first wrprotect fault.

The development version can also be cloned here:

git clone --reference linux -b userfault git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

Slides from the LSF/MM summit (but beware that they're not up to date):

https://www.kernel.org/pub/linux/kernel/people/andrea/userfaultfd/userfaultfd-LSFMM-2015.pdf

Comments welcome.

Thanks,
Andrea

Changelog of the major changes since the last RFC v3:

o The API has been slightly modified to avoid having to introduce a
  second revision of the API in order to support the non-cooperative
  usage.

o Various mixed fixes thanks to the feedback from Dave Hansen and David
  Gilbert. The most notable one is the use of mm_users instead of
  mm_count to pin the mm, to avoid crashes that assumed the vma still
  existed (in the userfaultfd_release method and in the various
  ioctls). exit_mmap doesn't even set mm->mmap to NULL, so unless I
  introduce a userfaultfd_exit to call in mmput, I have to pin mm_users
  to be safe. This is a visible change mainly for the non-cooperative
  usage.

o Userfaults are woken immediately even if they haven't been read yet;
  this can lead to POLLIN false positives (so I only allow poll if the
  fd is open in nonblocking mode, to be sure it won't hang).

  http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=f222d9de0a5302dc8ac62d6fab53a84251098751

o Optimize read to return entries in O(1), and poll, which was already
  O(1), becomes lockless. This required splitting the waitqueue in two,
  one for pending faults and one for non-pending faults; the faults are
  refiled across the two waitqueues when they're read. Both waitqueues
  are protected by a single lock to be simpler and faster at runtime
  (the fault_pending_wqh one).
  http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=9aa033ed43a1134c2223dac8c5d9e02e0100fca1

o Allocate the ctx with kmem_cache_alloc.

  http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=f5a8db16d2876eed8906a4d36f1d0e06ca5490f6

o Originally qemu had two bitflags for each page and kept 3 states (of
  the 4 possible with two bits) for each page, in order to deal with
  the races that can happen if one thread is reading the userfaults and
  another thread is calling the UFFDIO_COPY ioctl in the background.
  This patch solves all races in the kernel, so the two bits per page
  can be dropped from the qemu codebase. I started documenting the
  races that can materialize by using 2 threads (instead of running the
  workload single threaded with a single poll event loop) and how
  userland had to solve them, until I decided it was simpler to fix the
  race in the kernel.