Re: DANGER WILL ROBINSON, DANGER
On 04/10/19 11:41, Mircea CIRJALIU - MELIU wrote: > I get it so far. I have a patch that does mirroring in a separate VMA. > We create an extra VMA with VM_PFNMAP/VM_MIXEDMAP that mirrors the > source VMA in the other QEMU and is refreshed by the device MMU notifier. So for example on the host you'd have a new ioctl on the kvm file descriptor. You pass a size and you get back a file descriptor for that guest's physical memory, which is mmap-able up to the size you specified in the ioctl. In turn, the file descriptor would have ioctls to map/unmap ranges of the guest memory into its mmap-able range. Accessing an unmapped range produces a SIGSEGV. When asked via the QEMU monitor, QEMU will create the file descriptor and pass it back via SCM_RIGHTS. The management application can then use it to hotplug memory into the destination... > Create a new memslot based on the mirror VMA, hotplug it into the guest as > new memory device (is this possible?) and have a guest-side driver allocate > pages from that area. ... using the existing ivshmem device, whose BAR can be accessed and mmap-ed from the guest via sysfs. In other words, the hotplugging will use the file descriptor returned by QEMU when creating the ivshmem device. We then need an additional mechanism to invoke the map/unmap ioctls from the guest. Without writing a guest-side driver it is possible to: - pass a socket into the "create guest physical memory view" ioctl above. KVM will then associate that KVMI socket with the newly created file descriptor. - use KVMI messages to that socket to map/unmap sections of memory > Redirect (some) GFN->HVA translations into the new VMA based on a table > of addresses required by the introspector process. That would be tricky because there are multiple paths (gfn_to_page, gfn_to_pfn, etc.). There is some complication in this because the new device has to be plumbed at multiple levels (KVM, QEMU, libvirt). But it seems like a very easily separated piece of code (except for the KVMI socket part, which can be added later), so I suggest that you contribute the KVM parts first. Paolo ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: DANGER WILL ROBINSON, DANGER
On 03/10/19 20:31, Jerome Glisse wrote: > So in summary, the source qemu process has anonymous vma (regular > libc malloc for instance). The introspector qemu process which > mirror the the source qemu use mmap on /dev/kvm (assuming you can > reuse the kvm device file for this otherwise you can introduce a > new kvm device file). It should be a new device, something like /dev/kvmmem. BitDefender's RFC patches already have the right userspace API, that was not an issue. Paolo ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: DANGER WILL ROBINSON, DANGER
On Thu, Oct 03, 2019 at 04:42:20PM +, Mircea CIRJALIU - MELIU wrote: > > On 03/10/19 17:42, Jerome Glisse wrote: > > > All that is needed is to make sure that vm_normal_page() will see > > > those pte (inside the process that is mirroring the other process) as > > > special which is the case either because insert_pfn() mark the pte as > > > special or the kvm device driver which control the vm_operation struct > > > set a > > > find_special_page() callback that always return NULL, or the vma has > > > either VM_PFNMAP or VM_MIXEDMAP set (which is the case with > > insert_pfn). > > > > > > So you can keep the existing kvm code unmodified. > > > > Great, thanks. And KVM is already able to handle > > VM_PFNMAP/VM_MIXEDMAP, so that should work. > > This means setting VM_PFNMAP/VM_MIXEDMAP on the anon VMA that acts as the > VM's system RAM. > Will it have any side effects? You do not set it up on the anonymous vma but on the mmap of the kvm device file, the resulting vma is under the control of the kvm device file and is not an anonymous vma but a "device" special vma. So in summary, the source qemu process has anonymous vma (regular libc malloc for instance). The introspector qemu process which mirror the the source qemu use mmap on /dev/kvm (assuming you can reuse the kvm device file for this otherwise you can introduce a new kvm device file). The resulting mmap inside the introspector qemu process is a vma which has vma->vm_file pointing to the kvm device file and has VM_PFNMAP or VM_MIXEDMAP (i think you want the former). On architecture with ARCH_SPECIAL_PTE the pte will be mark as special when using insert_pfn() on other architecture you can either rely on VM_PFNMAP/VM_MIXEDMAP flag or set a specific find_special_page() callbacks in vm_ops. I am at a conference right now but i will put an example of what i mean next week. Cheers, Jérôme ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: DANGER WILL ROBINSON, DANGER
On 03/10/19 17:42, Jerome Glisse wrote: > All that is needed is to make sure that vm_normal_page() will see those > pte (inside the process that is mirroring the other process) as special > which is the case either because insert_pfn() mark the pte as special or > the kvm device driver which control the vm_operation struct set a > find_special_page() callback that always return NULL, or the vma has > either VM_PFNMAP or VM_MIXEDMAP set (which is the case with insert_pfn). > > So you can keep the existing kvm code unmodified. Great, thanks. And KVM is already able to handle VM_PFNMAP/VM_MIXEDMAP, so that should work. Paolo ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: DANGER WILL ROBINSON, DANGER
On Wed, Oct 02, 2019 at 10:10:18PM +0200, Paolo Bonzini wrote: > On 02/10/19 19:04, Jerome Glisse wrote: > > On Wed, Oct 02, 2019 at 06:18:06PM +0200, Paolo Bonzini wrote: > If the mapping of the source VMA changes, mirroring can update the > target VMA via insert_pfn. But what ensures that KVM's MMU notifier > dismantles its own existing page tables (so that they can be recreated > with the new mapping from the source VMA)? > >> > >> The KVM inspector process is also (or can be) a QEMU that will have to > >> create its own KVM guest page table. So if a page in the source VMA is > >> unmapped we want: > >> > >> - the source KVM to invalidate its guest page table (done by the KVM MMU > >> notifier) > >> > >> - the target VMA to be invalidated (easy using mirroring) > >> > >> - the target KVM to invalidate its guest page table, as a result of > >> invalidation of the target VMA > > > > You can do the target KVM invalidation inside the mirroring invalidation > > code. > > Why should the source and target KVMs behave differently? If the source > invalidates its guest page table via MMU notifiers, so should the target. > > The KVM MMU notifier exists so that nothing (including mirroring) needs > to know that there is KVM on the other side. Any interaction between > KVM page tables and VMAs must be mediated by MMU notifiers, anything > else is unacceptable. > > If it is possible to invoke the MMU notifiers around the calls to > insert_pfn, that of course would be perfect. Ok and yes you can do that exactly ie inside the mmu notifier callback from the target. For instance it is as easy as: target_mirror_notifier_start_callback(start, end) { struct kvm_mirror_struct *kvmms = from_mmun(...); unsigned long target_foff, size; size = end - start; target_foff = kvmms_convert_mirror_address(start); take_lock(kvmms->mirror_fault_exclusion_lock); unmap_mapping_range(kvmms->address_space, target_foff, size, 1); drop_lock(kvmms->mirror_fault_exclusion_lock); } All that is needed is to make sure that vm_normal_page() will see those pte (inside the process that is mirroring the other process) as special which is the case either because insert_pfn() mark the pte as special or the kvm device driver which control the vm_operation struct set a find_special_page() callback that always return NULL, or the vma has either VM_PFNMAP or VM_MIXEDMAP set (which is the case with insert_pfn). So you can keep the existing kvm code unmodified. Cheers, Jérôme ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: DANGER WILL ROBINSON, DANGER
On 02/10/19 19:04, Jerome Glisse wrote: > On Wed, Oct 02, 2019 at 06:18:06PM +0200, Paolo Bonzini wrote: If the mapping of the source VMA changes, mirroring can update the target VMA via insert_pfn. But what ensures that KVM's MMU notifier dismantles its own existing page tables (so that they can be recreated with the new mapping from the source VMA)? >> >> The KVM inspector process is also (or can be) a QEMU that will have to >> create its own KVM guest page table. So if a page in the source VMA is >> unmapped we want: >> >> - the source KVM to invalidate its guest page table (done by the KVM MMU >> notifier) >> >> - the target VMA to be invalidated (easy using mirroring) >> >> - the target KVM to invalidate its guest page table, as a result of >> invalidation of the target VMA > > You can do the target KVM invalidation inside the mirroring invalidation > code. Why should the source and target KVMs behave differently? If the source invalidates its guest page table via MMU notifiers, so should the target. The KVM MMU notifier exists so that nothing (including mirroring) needs to know that there is KVM on the other side. Any interaction between KVM page tables and VMAs must be mediated by MMU notifiers, anything else is unacceptable. If it is possible to invoke the MMU notifiers around the calls to insert_pfn, that of course would be perfect. Thanks, Paolo ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: DANGER WILL ROBINSON, DANGER
On Wed, Oct 02, 2019 at 06:18:06PM +0200, Paolo Bonzini wrote: > On 02/10/19 16:15, Jerome Glisse wrote: > >>> Why would you need to target mmu notifier on target vma ? > >> If the mapping of the source VMA changes, mirroring can update the > >> target VMA via insert_pfn. But what ensures that KVM's MMU notifier > >> dismantles its own existing page tables (so that they can be recreated > >> with the new mapping from the source VMA)? > >> > > So just to make sure i follow we have: > > - qemu process on host with anonymous vma > > -> host cpu page table > > - kvm which maps host anonymous vma to guest > > -> kvm guest page table > > - kvm inspector process which mirror vma from qemu process > > -> inspector process page table > > > > AFAIK the KVM notifier's will clear the kvm guest page table whenever > > necessary (through kvm_mmu_notifier_invalidate_range_start). This is > > what ensure that KVM's dismatles its own mapping, it abides to mmu- > > notifier callbacks. If you did not you would have bugs (at least i > > expect so). Am i wrong here ? > > The KVM inspector process is also (or can be) a QEMU that will have to > create its own KVM guest page table. Ok missed that part, thank you for explaining > > So if a page in the source VMA is unmapped we want: > > - the source KVM to invalidate its guest page table (done by the KVM MMU > notifier) > > - the target VMA to be invalidated (easy using mirroring) > > - the target KVM to invalidate its guest page table, as a result of > invalidation of the target VMA You can do the target KVM invalidation inside the mirroring invalidation code. Cheers, Jérôme ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: DANGER WILL ROBINSON, DANGER
On 02/10/19 16:15, Jerome Glisse wrote: >>> Why would you need to target mmu notifier on target vma ? >> If the mapping of the source VMA changes, mirroring can update the >> target VMA via insert_pfn. But what ensures that KVM's MMU notifier >> dismantles its own existing page tables (so that they can be recreated >> with the new mapping from the source VMA)? >> > So just to make sure i follow we have: > - qemu process on host with anonymous vma > -> host cpu page table > - kvm which maps host anonymous vma to guest > -> kvm guest page table > - kvm inspector process which mirror vma from qemu process > -> inspector process page table > > AFAIK the KVM notifier's will clear the kvm guest page table whenever > necessary (through kvm_mmu_notifier_invalidate_range_start). This is > what ensure that KVM's dismatles its own mapping, it abides to mmu- > notifier callbacks. If you did not you would have bugs (at least i > expect so). Am i wrong here ? The KVM inspector process is also (or can be) a QEMU that will have to create its own KVM guest page table. So if a page in the source VMA is unmapped we want: - the source KVM to invalidate its guest page table (done by the KVM MMU notifier) - the target VMA to be invalidated (easy using mirroring) - the target KVM to invalidate its guest page table, as a result of invalidation of the target VMA Paolo ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: DANGER WILL ROBINSON, DANGER
On Wed, Oct 02, 2019 at 03:46:30PM +0200, Paolo Bonzini wrote: > On 02/10/19 21:27, Jerome Glisse wrote: > > On Tue, Sep 10, 2019 at 07:49:51AM +, Mircea CIRJALIU - MELIU wrote: > >>> On 05/09/19 20:09, Jerome Glisse wrote: > Not sure i understand, you are saying that the solution i outline > above does not work ? If so then i think you are wrong, in the above > solution the importing process mmap a device file and the resulting > vma is then populated using insert_pfn() and constantly keep > synchronize with the target process through mirroring which means that > you never have to look at the struct page ... you can mirror any kind > of memory from the remote process. > >>> > >>> If insert_pfn in turn calls MMU notifiers for the target VMA (which would > >>> be > >>> the KVM MMU notifier), then that would work. Though I guess it would be > >>> possible to call MMU notifier update callbacks around the call to > >>> insert_pfn. > >> > >> Can't do that. > >> First, insert_pfn() uses set_pte_at() which won't trigger the MMU notifier > >> on > >> the target VMA. It's also static, so I'll have to access it thru > >> vmf_insert_pfn() > >> or vmf_insert_mixed(). > > > > Why would you need to target mmu notifier on target vma ? > > If the mapping of the source VMA changes, mirroring can update the > target VMA via insert_pfn. But what ensures that KVM's MMU notifier > dismantles its own existing page tables (so that they can be recreated > with the new mapping from the source VMA)? > So just to make sure i follow we have: - qemu process on host with anonymous vma -> host cpu page table - kvm which maps host anonymous vma to guest -> kvm guest page table - kvm inspector process which mirror vma from qemu process -> inspector process page table AFAIK the KVM notifier's will clear the kvm guest page table whenever necessary (through kvm_mmu_notifier_invalidate_range_start). This is what ensure that KVM's dismatles its own mapping, it abides to mmu- notifier callbacks. If you did not you would have bugs (at least i expect so). Am i wrong here ? The mirroring kernel driver would also register the notifier against the quemu process and would also abide to notifier callbacks. What you want to maintain at all times is that none of the actors above ever look at different page for the same virtual address (ie one looking at older page while another look at new page). This is where you have helper like HMM that make sure that you can not populate the mirroring vma while a notifier is on going. Which means that everything is serialize on the notifier. Cheers, Jérôme ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: DANGER WILL ROBINSON, DANGER
On 02/10/19 21:27, Jerome Glisse wrote: > On Tue, Sep 10, 2019 at 07:49:51AM +, Mircea CIRJALIU - MELIU wrote: >>> On 05/09/19 20:09, Jerome Glisse wrote: Not sure i understand, you are saying that the solution i outline above does not work ? If so then i think you are wrong, in the above solution the importing process mmap a device file and the resulting vma is then populated using insert_pfn() and constantly keep synchronize with the target process through mirroring which means that you never have to look at the struct page ... you can mirror any kind of memory from the remote process. >>> >>> If insert_pfn in turn calls MMU notifiers for the target VMA (which would be >>> the KVM MMU notifier), then that would work. Though I guess it would be >>> possible to call MMU notifier update callbacks around the call to >>> insert_pfn. >> >> Can't do that. >> First, insert_pfn() uses set_pte_at() which won't trigger the MMU notifier on >> the target VMA. It's also static, so I'll have to access it thru >> vmf_insert_pfn() >> or vmf_insert_mixed(). > > Why would you need to target mmu notifier on target vma ? If the mapping of the source VMA changes, mirroring can update the target VMA via insert_pfn. But what ensures that KVM's MMU notifier dismantles its own existing page tables (so that they can be recreated with the new mapping from the source VMA)? Thanks, Paolo > You do not need > that. The workflow is: > > userspace: > ptr = mmap(/dev/kvm-mirroring-device, virtual_addresse_of_target) > > Then when the mirroring process access ptr it triggers page fault that > endup in the vm_operation_struct->fault() which is just doing: > > kernel-kvm-mirroring-function: > kvm_mirror_page_fault(struct vm_fault *vmf) { > struct kvm_mirror_struct *kvmms; > > kvmms = kvm_mirror_struct_from_file(vmf->vma->vm_file); > ... > again: > hmm_range_register(); > hmm_range_snapshot(); > take_lock(kvmms->update); > if (!hmm_range_valid()) { > vm_insert_pfn(); > drop_lock(kvmms->update); > hmm_range_unregister(); > return VM_FAULT_NOPAGE; > } > drop_lock(kvmms->update); > goto again; > } > > The notifier callback: > kvmms_notifier_start() { > take_lock(kvmms->update); > clear_pte(start, end); > drop_lock(kvmms->update); > } > >> >> Our model (the importing process is encapsulated in another VM) forces us >> to mirror certain pages from the anon VMA backing one VM's system RAM to >> the other VM's anon VMA. > > The mirror does not have to be an anon vma it can very well be a > device vma ie mmap of a device file. I do not see any reasons why > the mirror need to be an anon vma. Please explain why. > >> >> Using the functions above means setting VM_PFNMAP|VM_MIXEDMAP on >> the target anon VMA, but I guess this breaks the VMA. Is this recommended? > > The mirror vma should not be an anon vma. > >> >> Then, mapping anon pages from one VMA to another without fixing the >> refcount and the mapcount breaks the daemons that think they're working >> on a pure anon VMA (kcompactd, khugepaged). > > Note here the target vma ie the mirroring one is a mmap of device file > and thus is skip by all of the above (kcompactd, khugepaged, ...) it is > fully ignore by core mm. > > Thus you do not need to fix the refcount in any way. If any of the core > mm try to reclaim memory from the original vma then you will get mmu > notifier callbacks and all you have to do is clear the page table of your > device vma. > > I did exactly that as a tools in the past and it works just fine with > no change to core mm whatsoever. > > Cheers, > Jérôme > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: DANGER WILL ROBINSON, DANGER
On Tue, Sep 10, 2019 at 07:49:51AM +, Mircea CIRJALIU - MELIU wrote: > > On 05/09/19 20:09, Jerome Glisse wrote: > > > Not sure i understand, you are saying that the solution i outline > > > above does not work ? If so then i think you are wrong, in the above > > > solution the importing process mmap a device file and the resulting > > > vma is then populated using insert_pfn() and constantly keep > > > synchronize with the target process through mirroring which means that > > > you never have to look at the struct page ... you can mirror any kind > > > of memory from the remote process. > > > > If insert_pfn in turn calls MMU notifiers for the target VMA (which would be > > the KVM MMU notifier), then that would work. Though I guess it would be > > possible to call MMU notifier update callbacks around the call to > > insert_pfn. > > Can't do that. > First, insert_pfn() uses set_pte_at() which won't trigger the MMU notifier on > the target VMA. It's also static, so I'll have to access it thru > vmf_insert_pfn() > or vmf_insert_mixed(). Why would you need to target mmu notifier on target vma ? You do not need that. The workflow is: userspace: ptr = mmap(/dev/kvm-mirroring-device, virtual_addresse_of_target) Then when the mirroring process access ptr it triggers page fault that endup in the vm_operation_struct->fault() which is just doing: kernel-kvm-mirroring-function: kvm_mirror_page_fault(struct vm_fault *vmf) { struct kvm_mirror_struct *kvmms; kvmms = kvm_mirror_struct_from_file(vmf->vma->vm_file); ... again: hmm_range_register(); hmm_range_snapshot(); take_lock(kvmms->update); if (!hmm_range_valid()) { vm_insert_pfn(); drop_lock(kvmms->update); hmm_range_unregister(); return VM_FAULT_NOPAGE; } drop_lock(kvmms->update); goto again; } The notifier callback: kvmms_notifier_start() { take_lock(kvmms->update); clear_pte(start, end); drop_lock(kvmms->update); } > > Our model (the importing process is encapsulated in another VM) forces us > to mirror certain pages from the anon VMA backing one VM's system RAM to > the other VM's anon VMA. The mirror does not have to be an anon vma it can very well be a device vma ie mmap of a device file. I do not see any reasons why the mirror need to be an anon vma. Please explain why. > > Using the functions above means setting VM_PFNMAP|VM_MIXEDMAP on > the target anon VMA, but I guess this breaks the VMA. Is this recommended? The mirror vma should not be an anon vma. > > Then, mapping anon pages from one VMA to another without fixing the > refcount and the mapcount breaks the daemons that think they're working > on a pure anon VMA (kcompactd, khugepaged). Note here the target vma ie the mirroring one is a mmap of device file and thus is skip by all of the above (kcompactd, khugepaged, ...) it is fully ignore by core mm. Thus you do not need to fix the refcount in any way. If any of the core mm try to reclaim memory from the original vma then you will get mmu notifier callbacks and all you have to do is clear the page table of your device vma. I did exactly that as a tools in the past and it works just fine with no change to core mm whatsoever. Cheers, Jérôme ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: DANGER WILL ROBINSON, DANGER
On 05/09/19 20:09, Jerome Glisse wrote: > Not sure i understand, you are saying that the solution i outline above > does not work ? If so then i think you are wrong, in the above solution > the importing process mmap a device file and the resulting vma is then > populated using insert_pfn() and constantly keep synchronize with the > target process through mirroring which means that you never have to look > at the struct page ... you can mirror any kind of memory from the remote > process. If insert_pfn in turn calls MMU notifiers for the target VMA (which would be the KVM MMU notifier), then that would work. Though I guess it would be possible to call MMU notifier update callbacks around the call to insert_pfn. Paolo ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: DANGER WILL ROBINSON, DANGER
On Fri, Aug 23, 2019 at 12:39:21PM +, Mircea CIRJALIU - MELIU wrote: > > On Thu, Aug 15, 2019 at 03:19:29PM -0400, Jerome Glisse wrote: > > > On Tue, Aug 13, 2019 at 02:01:35PM +0300, Adalbert Lazăr wrote: > > > > On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox > > wrote: > > > > > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote: > > > > > > +++ b/include/linux/page-flags.h > > > > > > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY) > > > > > > */ > > > > > > #define PAGE_MAPPING_ANON 0x1 > > > > > > #define PAGE_MAPPING_MOVABLE 0x2 > > > > > > +#define PAGE_MAPPING_REMOTE0x4 > > > > > > > > > > Uh. How do you know page->mapping would otherwise have bit 2 > > clear? > > > > > Who's guaranteeing that? > > > > > > > > > > This is an awfully big patch to the memory management code, buried > > > > > in the middle of a gigantic series which almost guarantees nobody > > > > > would look at it. I call shenanigans. > > > > > > > > > > > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page > > *page, struct vm_area_struct *vma) > > > > > > * __page_set_anon_rmap - set up new anonymous rmap > > > > > > * @page: Page or Hugepage to add to rmap > > > > > > * @vma: VM area to add page to. > > > > > > - * @address: User virtual address of the mapping > > > > > > + * @address: User virtual address of the mapping > > > > > > > > > > And mixing in fluff changes like this is a real no-no. Try again. > > > > > > > > > > > > > No bad intentions, just overzealous. > > > > I didn't want to hide anything from our patches. > > > > Once we advance with the introspection patches related to KVM we'll > > > > be back with the remote mapping patch, split and cleaned. > > > > > > They are not bit left in struct page ! Looking at the patch it seems > > > you want to have your own pin count just for KVM. This is bad, we are > > > already trying to solve the GUP thing (see all various patchset about > > > GUP posted recently). > > > > > > You need to rethink how you want to achieve this. Why not simply a > > > remote read()/write() into the process memory ie KVMI would call an > > > ioctl that allow to read or write into a remote process memory like > > > ptrace() but on steroid ... > > > > > > Adding this whole big complex infrastructure without justification of > > > why we need to avoid round trip is just too much really. > > > > Thinking a bit more about this, you can achieve the same thing without > > adding a single line to any mm code. Instead of having mmap with > > PROT_NONE | MAP_LOCKED you have userspace mmap some kvm device > > file (i am assuming this is something you already have and can control the > > mmap callback). > > > > So now kernel side you have a vma with a vm_operations_struct under your > > control this means that everything you want to block mm wise from within > > the inspector process can be block through those call- backs > > (find_special_page() specificaly for which you have to return NULL all the > > time). > > > > To mirror target process memory you can use hmm_mirror, when you > > populate the inspector process page table you use insert_pfn() (mmap of > > the kvm device file must mark this vma as PFNMAP). > > > > By following the hmm_mirror API, anytime the target process has a change in > > its page table (ie virtual address -> page) you will get a callback and all > > you > > have to do is clear the page table within the inspector process and flush > > tlb > > (use zap_page_range). > > > > On page fault within the inspector process the fault callback of vm_ops will > > get call and from there you call hmm_mirror following its API. > > > > Oh also mark the vma with VM_WIPEONFORK to avoid any issue if the > > inspector process use fork() (you could support fork but then you would > > need to mark the vma as SHARED and use unmap_mapping_pages instead of > > zap_page_range). > > > > > > There everything you want to do with already upstream mm code. > > I'm the author of remote mapping, so I owe everybody some explanations. > My requirement was to map pages from one QEMU process to another QEMU > process (our inspector process works in a virtual machine of its own). So I > had > to implement a KSM-like page sharing between processes, where an anon page > from the target QEMU's working memory is promoted to a remote page and > mapped in the inspector QEMU's working memory (both anon VMAs). > The extra page flag is for differentiating the page for rmap walking. > > The mapping requests come at PAGE_SIZE granularity for random addresses > within the target/inspector QEMUs, so I couldn't do any linear mapping that > would keep things simpler. > > I have an extra patch that does remote mapping by mirroring an entire VMA > from the target process by way of a device file. This thing creates a > separate > mirror VMA in my inspector process (at the moment a QEMU), but then I > bumped into the KVM hva->gpa mapping, which
Re: DANGER WILL ROBINSON, DANGER
On Thu, Aug 15, 2019 at 04:16:30PM -0400, Jerome Glisse wrote: > On Thu, Aug 15, 2019 at 03:19:29PM -0400, Jerome Glisse wrote: > > On Tue, Aug 13, 2019 at 02:01:35PM +0300, Adalbert Lazăr wrote: > > > On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox > > > wrote: > > > > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote: > > > > > +++ b/include/linux/page-flags.h > > > > > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY) > > > > > */ > > > > > #define PAGE_MAPPING_ANON0x1 > > > > > #define PAGE_MAPPING_MOVABLE 0x2 > > > > > +#define PAGE_MAPPING_REMOTE 0x4 > > > > > > > > Uh. How do you know page->mapping would otherwise have bit 2 clear? > > > > Who's guaranteeing that? > > > > > > > > This is an awfully big patch to the memory management code, buried in > > > > the middle of a gigantic series which almost guarantees nobody would > > > > look at it. I call shenanigans. > > > > > > > > > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page *page, > > > > > struct vm_area_struct *vma) > > > > > * __page_set_anon_rmap - set up new anonymous rmap > > > > > * @page:Page or Hugepage to add to rmap > > > > > * @vma: VM area to add page to. > > > > > - * @address: User virtual address of the mapping > > > > > + * @address: User virtual address of the mapping > > > > > > > > And mixing in fluff changes like this is a real no-no. Try again. > > > > > > > > > > No bad intentions, just overzealous. > > > I didn't want to hide anything from our patches. > > > Once we advance with the introspection patches related to KVM we'll be > > > back with the remote mapping patch, split and cleaned. > > > > They are not bit left in struct page ! Looking at the patch it seems > > you want to have your own pin count just for KVM. This is bad, we are > > already trying to solve the GUP thing (see all various patchset about > > GUP posted recently). > > > > You need to rethink how you want to achieve this. Why not simply a > > remote read()/write() into the process memory ie KVMI would call > > an ioctl that allow to read or write into a remote process memory > > like ptrace() but on steroid ... > > > > Adding this whole big complex infrastructure without justification > > of why we need to avoid round trip is just too much really. > > Thinking a bit more about this, you can achieve the same thing without > adding a single line to any mm code. Instead of having mmap with > PROT_NONE | MAP_LOCKED you have userspace mmap some kvm device file > (i am assuming this is something you already have and can control > the mmap callback). > > So now kernel side you have a vma with a vm_operations_struct under > your control this means that everything you want to block mm wise > from within the inspector process can be block through those call- > backs (find_special_page() specificaly for which you have to return > NULL all the time). I'm actually aware of a couple of use cases that would like to mirror the VA of one process into another. One big one in the HPC world is the out of tree 'xpmem' still in wide use today. xpmem is basically what Jerome described above. If you do an approach like Jerome describes it would be nice if was a general facility and not buried in kvm. I know past xpmem adventures ran into trouble with locking/etc - ie getting the mm_struct of the victim seemed a bit hard for some reason, but maybe that could be done with a FD pass 'ioctl(I AM THE VICITM)' ? Jason ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: DANGER WILL ROBINSON, DANGER
On Thu, Aug 15, 2019 at 03:19:29PM -0400, Jerome Glisse wrote: > On Tue, Aug 13, 2019 at 02:01:35PM +0300, Adalbert Lazăr wrote: > > On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox > > wrote: > > > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote: > > > > +++ b/include/linux/page-flags.h > > > > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY) > > > > */ > > > > #define PAGE_MAPPING_ANON 0x1 > > > > #define PAGE_MAPPING_MOVABLE 0x2 > > > > +#define PAGE_MAPPING_REMOTE0x4 > > > > > > Uh. How do you know page->mapping would otherwise have bit 2 clear? > > > Who's guaranteeing that? > > > > > > This is an awfully big patch to the memory management code, buried in > > > the middle of a gigantic series which almost guarantees nobody would > > > look at it. I call shenanigans. > > > > > > > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page *page, > > > > struct vm_area_struct *vma) > > > > * __page_set_anon_rmap - set up new anonymous rmap > > > > * @page: Page or Hugepage to add to rmap > > > > * @vma: VM area to add page to. > > > > - * @address: User virtual address of the mapping > > > > + * @address: User virtual address of the mapping > > > > > > And mixing in fluff changes like this is a real no-no. Try again. > > > > > > > No bad intentions, just overzealous. > > I didn't want to hide anything from our patches. > > Once we advance with the introspection patches related to KVM we'll be > > back with the remote mapping patch, split and cleaned. > > They are not bit left in struct page ! Looking at the patch it seems > you want to have your own pin count just for KVM. This is bad, we are > already trying to solve the GUP thing (see all various patchset about > GUP posted recently). > > You need to rethink how you want to achieve this. Why not simply a > remote read()/write() into the process memory ie KVMI would call > an ioctl that allow to read or write into a remote process memory > like ptrace() but on steroid ... > > Adding this whole big complex infrastructure without justification > of why we need to avoid round trip is just too much really. Thinking a bit more about this, you can achieve the same thing without adding a single line to any mm code. Instead of having mmap with PROT_NONE | MAP_LOCKED you have userspace mmap some kvm device file (i am assuming this is something you already have and can control the mmap callback). So now kernel side you have a vma with a vm_operations_struct under your control this means that everything you want to block mm wise from within the inspector process can be block through those call- backs (find_special_page() specificaly for which you have to return NULL all the time). To mirror target process memory you can use hmm_mirror, when you populate the inspector process page table you use insert_pfn() (mmap of the kvm device file must mark this vma as PFNMAP). By following the hmm_mirror API, anytime the target process has a change in its page table (ie virtual address -> page) you will get a callback and all you have to do is clear the page table within the inspector process and flush tlb (use zap_page_range). On page fault within the inspector process the fault callback of vm_ops will get call and from there you call hmm_mirror following its API. Oh also mark the vma with VM_WIPEONFORK to avoid any issue if the inspector process use fork() (you could support fork but then you would need to mark the vma as SHARED and use unmap_mapping_pages instead of zap_page_range). There everything you want to do with already upstream mm code. Cheers, Jérôme ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: DANGER WILL ROBINSON, DANGER
On Tue, Aug 13, 2019 at 02:01:35PM +0300, Adalbert Lazăr wrote: > On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox wrote: > > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote: > > > +++ b/include/linux/page-flags.h > > > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY) > > > */ > > > #define PAGE_MAPPING_ANON0x1 > > > #define PAGE_MAPPING_MOVABLE 0x2 > > > +#define PAGE_MAPPING_REMOTE 0x4 > > > > Uh. How do you know page->mapping would otherwise have bit 2 clear? > > Who's guaranteeing that? > > > > This is an awfully big patch to the memory management code, buried in > > the middle of a gigantic series which almost guarantees nobody would > > look at it. I call shenanigans. > > > > > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page *page, struct > > > vm_area_struct *vma) > > > * __page_set_anon_rmap - set up new anonymous rmap > > > * @page:Page or Hugepage to add to rmap > > > * @vma: VM area to add page to. > > > - * @address: User virtual address of the mapping > > > + * @address: User virtual address of the mapping > > > > And mixing in fluff changes like this is a real no-no. Try again. > > > > No bad intentions, just overzealous. > I didn't want to hide anything from our patches. > Once we advance with the introspection patches related to KVM we'll be > back with the remote mapping patch, split and cleaned. They are not bit left in struct page ! Looking at the patch it seems you want to have your own pin count just for KVM. This is bad, we are already trying to solve the GUP thing (see all various patchset about GUP posted recently). You need to rethink how you want to achieve this. Why not simply a remote read()/write() into the process memory ie KVMI would call an ioctl that allow to read or write into a remote process memory like ptrace() but on steroid ... Adding this whole big complex infrastructure without justification of why we need to avoid round trip is just too much really. Cheers, Jérôme ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: DANGER WILL ROBINSON, DANGER
On 13/08/19 13:24, Matthew Wilcox wrote: >>> >>> This is an awfully big patch to the memory management code, buried in >>> the middle of a gigantic series which almost guarantees nobody would >>> look at it. I call shenanigans. >> Are you calling shenanigans on the patch submitter (which is gratuitous) >> or on the KVM maintainers/reviewers? > > On the patch submitter, of course. How can I possibly be criticising you > for something you didn't do? No idea. "Nobody would look at it" definitely includes me though. In any case, water under the bridge. The submitter did duly mark the series as RFC, I don't see anything wrong in what he did apart from not having testcases. :) Paolo ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: DANGER WILL ROBINSON, DANGER
On Tue, Aug 13, 2019 at 11:29:07AM +0200, Paolo Bonzini wrote: > On 09/08/19 18:24, Matthew Wilcox wrote: > > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote: > >> +++ b/include/linux/page-flags.h > >> @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY) > >> */ > >> #define PAGE_MAPPING_ANON 0x1 > >> #define PAGE_MAPPING_MOVABLE 0x2 > >> +#define PAGE_MAPPING_REMOTE 0x4 > > Uh. How do you know page->mapping would otherwise have bit 2 clear? > > Who's guaranteeing that? > > > > This is an awfully big patch to the memory management code, buried in > > the middle of a gigantic series which almost guarantees nobody would > > look at it. I call shenanigans. > > Are you calling shenanigans on the patch submitter (which is gratuitous) > or on the KVM maintainers/reviewers? On the patch submitter, of course. How can I possibly be criticising you for something you didn't do? ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: DANGER WILL ROBINSON, DANGER
On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox wrote: > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote: > > +++ b/include/linux/page-flags.h > > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY) > > */ > > #define PAGE_MAPPING_ANON 0x1 > > #define PAGE_MAPPING_MOVABLE 0x2 > > +#define PAGE_MAPPING_REMOTE0x4 > > Uh. How do you know page->mapping would otherwise have bit 2 clear? > Who's guaranteeing that? > > This is an awfully big patch to the memory management code, buried in > the middle of a gigantic series which almost guarantees nobody would > look at it. I call shenanigans. > > > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page *page, struct > > vm_area_struct *vma) > > * __page_set_anon_rmap - set up new anonymous rmap > > * @page: Page or Hugepage to add to rmap > > * @vma: VM area to add page to. > > - * @address: User virtual address of the mapping > > + * @address: User virtual address of the mapping > > And mixing in fluff changes like this is a real no-no. Try again. > No bad intentions, just overzealous. I didn't want to hide anything from our patches. Once we advance with the introspection patches related to KVM we'll be back with the remote mapping patch, split and cleaned. Thanks ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: DANGER WILL ROBINSON, DANGER
On 09/08/19 18:24, Matthew Wilcox wrote: > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote: >> +++ b/include/linux/page-flags.h >> @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY) >> */ >> #define PAGE_MAPPING_ANON 0x1 >> #define PAGE_MAPPING_MOVABLE0x2 >> +#define PAGE_MAPPING_REMOTE 0x4 > Uh. How do you know page->mapping would otherwise have bit 2 clear? > Who's guaranteeing that? > > This is an awfully big patch to the memory management code, buried in > the middle of a gigantic series which almost guarantees nobody would > look at it. I call shenanigans. Are you calling shenanigans on the patch submitter (which is gratuitous) or on the KVM maintainers/reviewers? It's not true that nobody would look at it. Of course no one from linux-mm is going to look at it, but the maintainer that looks at the gigantic series is very much expected to look at it and explain to the submitter that this patch is unacceptable as is. In fact I shouldn't have to to explain this to you; you know better than believing that I would try to sneak it past the mm folks. I am puzzled. Paolo ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
DANGER WILL ROBINSON, DANGER
On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote: > +++ b/include/linux/page-flags.h > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY) > */ > #define PAGE_MAPPING_ANON0x1 > #define PAGE_MAPPING_MOVABLE 0x2 > +#define PAGE_MAPPING_REMOTE 0x4 Uh. How do you know page->mapping would otherwise have bit 2 clear? Who's guaranteeing that? This is an awfully big patch to the memory management code, buried in the middle of a gigantic series which almost guarantees nobody would look at it. I call shenanigans. > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page *page, struct > vm_area_struct *vma) > * __page_set_anon_rmap - set up new anonymous rmap > * @page:Page or Hugepage to add to rmap > * @vma: VM area to add page to. > - * @address: User virtual address of the mapping > + * @address: User virtual address of the mapping And mixing in fluff changes like this is a real no-no. Try again. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization