Re: DANGER WILL ROBINSON, DANGER

2019-10-04 Thread Paolo Bonzini
On 04/10/19 11:41, Mircea CIRJALIU - MELIU wrote:
> I get it so far. I have a patch that does mirroring in a separate VMA.
> We create an extra VMA with VM_PFNMAP/VM_MIXEDMAP that mirrors the 
> source VMA in the other QEMU and is refreshed by the device MMU notifier.

So for example on the host you'd have a new ioctl on the kvm file
descriptor.  You pass a size and you get back a file descriptor for that
guest's physical memory, which is mmap-able up to the size you specified
in the ioctl.

In turn, the file descriptor would have ioctls to map/unmap ranges of
the guest memory into its mmap-able range.  Accessing an unmapped range
produces a SIGSEGV.

When asked via the QEMU monitor, QEMU will create the file descriptor
and pass it back via SCM_RIGHTS.  The management application can then
use it to hotplug memory into the destination...

> Create a new memslot based on the mirror VMA, hotplug it into the guest as
> new memory device (is this possible?) and have a guest-side driver allocate 
> pages from that area.

... using the existing ivshmem device, whose BAR can be accessed and
mmap-ed from the guest via sysfs.  In other words, the hotplugging will
use the file descriptor returned by QEMU when creating the ivshmem device.

We then need an additional mechanism to invoke the map/unmap ioctls from
the guest.  Without writing a guest-side driver it is possible to:

- pass a socket into the "create guest physical memory view" ioctl
above.  KVM will then associate that KVMI socket with the newly created
file descriptor.

- use KVMI messages to that socket to map/unmap sections of memory

> Redirect (some) GFN->HVA translations into the new VMA based on a table 
> of addresses required by the introspector process.

That would be tricky because there are multiple paths (gfn_to_page,
gfn_to_pfn, etc.).

There is some complication in this because the new device has to be
plumbed at multiple levels (KVM, QEMU, libvirt).  But it seems like a
very easily separated piece of code (except for the KVMI socket part,
which can be added later), so I suggest that you contribute the KVM
parts first.

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: DANGER WILL ROBINSON, DANGER

2019-10-03 Thread Paolo Bonzini
On 03/10/19 20:31, Jerome Glisse wrote:
> So in summary, the source qemu process has anonymous vma (regular
> libc malloc for instance). The introspector qemu process which
> mirror the the source qemu use mmap on /dev/kvm (assuming you can
> reuse the kvm device file for this otherwise you can introduce a
> new kvm device file). 

It should be a new device, something like /dev/kvmmem.  BitDefender's
RFC patches already have the right userspace API, that was not an issue.

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: DANGER WILL ROBINSON, DANGER

2019-10-03 Thread Jerome Glisse
On Thu, Oct 03, 2019 at 04:42:20PM +, Mircea CIRJALIU - MELIU wrote:
> > On 03/10/19 17:42, Jerome Glisse wrote:
> > > All that is needed is to make sure that vm_normal_page() will see
> > > those pte (inside the process that is mirroring the other process) as
> > > special which is the case either because insert_pfn() mark the pte as
> > > special or the kvm device driver which control the vm_operation struct
> > > set a
> > > find_special_page() callback that always return NULL, or the vma has
> > > either VM_PFNMAP or VM_MIXEDMAP set (which is the case with
> > insert_pfn).
> > >
> > > So you can keep the existing kvm code unmodified.
> > 
> > Great, thanks.  And KVM is already able to handle
> > VM_PFNMAP/VM_MIXEDMAP, so that should work.
> 
> This means setting VM_PFNMAP/VM_MIXEDMAP on the anon VMA that acts as the 
> VM's system RAM.
> Will it have any side effects?

You do not set it up on the anonymous vma but on the mmap of the
kvm device file, the resulting vma is under the control of the
kvm device file and is not an anonymous vma but a "device" special
vma.

So in summary, the source qemu process has anonymous vma (regular
libc malloc for instance). The introspector qemu process which
mirror the the source qemu use mmap on /dev/kvm (assuming you can
reuse the kvm device file for this otherwise you can introduce a
new kvm device file). The resulting mmap inside the introspector
qemu process is a vma which has vma->vm_file pointing to the kvm
device file and has VM_PFNMAP or VM_MIXEDMAP (i think you want the
former). On architecture with ARCH_SPECIAL_PTE the pte will be
mark as special when using insert_pfn() on other architecture you
can either rely on VM_PFNMAP/VM_MIXEDMAP flag or set a specific
find_special_page() callbacks in vm_ops.


I am at a conference right now but i will put an example of what
i mean next week.

Cheers,
Jérôme
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: DANGER WILL ROBINSON, DANGER

2019-10-03 Thread Paolo Bonzini
On 03/10/19 17:42, Jerome Glisse wrote:
> All that is needed is to make sure that vm_normal_page() will see those
> pte (inside the process that is mirroring the other process) as special
> which is the case either because insert_pfn() mark the pte as special or
> the kvm device driver which control the vm_operation struct set a
> find_special_page() callback that always return NULL, or the vma has
> either VM_PFNMAP or VM_MIXEDMAP set (which is the case with insert_pfn).
> 
> So you can keep the existing kvm code unmodified.

Great, thanks.  And KVM is already able to handle VM_PFNMAP/VM_MIXEDMAP,
so that should work.

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: DANGER WILL ROBINSON, DANGER

2019-10-03 Thread Jerome Glisse
On Wed, Oct 02, 2019 at 10:10:18PM +0200, Paolo Bonzini wrote:
> On 02/10/19 19:04, Jerome Glisse wrote:
> > On Wed, Oct 02, 2019 at 06:18:06PM +0200, Paolo Bonzini wrote:
>  If the mapping of the source VMA changes, mirroring can update the
>  target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
>  dismantles its own existing page tables (so that they can be recreated
>  with the new mapping from the source VMA)?
> >>
> >> The KVM inspector process is also (or can be) a QEMU that will have to
> >> create its own KVM guest page table.  So if a page in the source VMA is
> >> unmapped we want:
> >>
> >> - the source KVM to invalidate its guest page table (done by the KVM MMU
> >> notifier)
> >>
> >> - the target VMA to be invalidated (easy using mirroring)
> >>
> >> - the target KVM to invalidate its guest page table, as a result of
> >> invalidation of the target VMA
> > 
> > You can do the target KVM invalidation inside the mirroring invalidation
> > code.
> 
> Why should the source and target KVMs behave differently?  If the source
> invalidates its guest page table via MMU notifiers, so should the target.
> 
> The KVM MMU notifier exists so that nothing (including mirroring) needs
> to know that there is KVM on the other side.  Any interaction between
> KVM page tables and VMAs must be mediated by MMU notifiers, anything
> else is unacceptable.
> 
> If it is possible to invoke the MMU notifiers around the calls to
> insert_pfn, that of course would be perfect.

Ok and yes you can do that exactly ie inside the mmu notifier callback
from the target. For instance it is as easy as:
target_mirror_notifier_start_callback(start, end) {
struct kvm_mirror_struct *kvmms = from_mmun(...);
unsigned long target_foff, size;

size = end - start;
target_foff = kvmms_convert_mirror_address(start);
take_lock(kvmms->mirror_fault_exclusion_lock);
unmap_mapping_range(kvmms->address_space, target_foff, size, 1);
drop_lock(kvmms->mirror_fault_exclusion_lock);
}

All that is needed is to make sure that vm_normal_page() will see those
pte (inside the process that is mirroring the other process) as special
which is the case either because insert_pfn() mark the pte as special or
the kvm device driver which control the vm_operation struct set a
find_special_page() callback that always return NULL, or the vma has
either VM_PFNMAP or VM_MIXEDMAP set (which is the case with insert_pfn).

So you can keep the existing kvm code unmodified.

Cheers,
Jérôme
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: DANGER WILL ROBINSON, DANGER

2019-10-02 Thread Paolo Bonzini
On 02/10/19 19:04, Jerome Glisse wrote:
> On Wed, Oct 02, 2019 at 06:18:06PM +0200, Paolo Bonzini wrote:
 If the mapping of the source VMA changes, mirroring can update the
 target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
 dismantles its own existing page tables (so that they can be recreated
 with the new mapping from the source VMA)?
>>
>> The KVM inspector process is also (or can be) a QEMU that will have to
>> create its own KVM guest page table.  So if a page in the source VMA is
>> unmapped we want:
>>
>> - the source KVM to invalidate its guest page table (done by the KVM MMU
>> notifier)
>>
>> - the target VMA to be invalidated (easy using mirroring)
>>
>> - the target KVM to invalidate its guest page table, as a result of
>> invalidation of the target VMA
> 
> You can do the target KVM invalidation inside the mirroring invalidation
> code.

Why should the source and target KVMs behave differently?  If the source
invalidates its guest page table via MMU notifiers, so should the target.

The KVM MMU notifier exists so that nothing (including mirroring) needs
to know that there is KVM on the other side.  Any interaction between
KVM page tables and VMAs must be mediated by MMU notifiers, anything
else is unacceptable.

If it is possible to invoke the MMU notifiers around the calls to
insert_pfn, that of course would be perfect.

Thanks,

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: DANGER WILL ROBINSON, DANGER

2019-10-02 Thread Jerome Glisse
On Wed, Oct 02, 2019 at 06:18:06PM +0200, Paolo Bonzini wrote:
> On 02/10/19 16:15, Jerome Glisse wrote:
> >>> Why would you need to target mmu notifier on target vma ?
> >> If the mapping of the source VMA changes, mirroring can update the
> >> target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
> >> dismantles its own existing page tables (so that they can be recreated
> >> with the new mapping from the source VMA)?
> >>
> > So just to make sure i follow we have:
> >   - qemu process on host with anonymous vma
> > -> host cpu page table
> >   - kvm which maps host anonymous vma to guest
> > -> kvm guest page table
> >   - kvm inspector process which mirror vma from qemu process
> > -> inspector process page table
> > 
> > AFAIK the KVM notifier's will clear the kvm guest page table whenever
> > necessary (through kvm_mmu_notifier_invalidate_range_start). This is
> > what ensure that KVM's dismatles its own mapping, it abides to mmu-
> > notifier callbacks. If you did not you would have bugs (at least i
> > expect so). Am i wrong here ?
> 
> The KVM inspector process is also (or can be) a QEMU that will have to
> create its own KVM guest page table.

Ok missed that part, thank you for explaining

> 
> So if a page in the source VMA is unmapped we want:
> 
> - the source KVM to invalidate its guest page table (done by the KVM MMU
> notifier)
> 
> - the target VMA to be invalidated (easy using mirroring)
> 
> - the target KVM to invalidate its guest page table, as a result of
> invalidation of the target VMA

You can do the target KVM invalidation inside the mirroring invalidation
code.

Cheers,
Jérôme
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: DANGER WILL ROBINSON, DANGER

2019-10-02 Thread Paolo Bonzini
On 02/10/19 16:15, Jerome Glisse wrote:
>>> Why would you need to target mmu notifier on target vma ?
>> If the mapping of the source VMA changes, mirroring can update the
>> target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
>> dismantles its own existing page tables (so that they can be recreated
>> with the new mapping from the source VMA)?
>>
> So just to make sure i follow we have:
>   - qemu process on host with anonymous vma
> -> host cpu page table
>   - kvm which maps host anonymous vma to guest
> -> kvm guest page table
>   - kvm inspector process which mirror vma from qemu process
> -> inspector process page table
> 
> AFAIK the KVM notifier's will clear the kvm guest page table whenever
> necessary (through kvm_mmu_notifier_invalidate_range_start). This is
> what ensure that KVM's dismatles its own mapping, it abides to mmu-
> notifier callbacks. If you did not you would have bugs (at least i
> expect so). Am i wrong here ?

The KVM inspector process is also (or can be) a QEMU that will have to
create its own KVM guest page table.

So if a page in the source VMA is unmapped we want:

- the source KVM to invalidate its guest page table (done by the KVM MMU
notifier)

- the target VMA to be invalidated (easy using mirroring)

- the target KVM to invalidate its guest page table, as a result of
invalidation of the target VMA

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: DANGER WILL ROBINSON, DANGER

2019-10-02 Thread Jerome Glisse
On Wed, Oct 02, 2019 at 03:46:30PM +0200, Paolo Bonzini wrote:
> On 02/10/19 21:27, Jerome Glisse wrote:
> > On Tue, Sep 10, 2019 at 07:49:51AM +, Mircea CIRJALIU - MELIU wrote:
> >>> On 05/09/19 20:09, Jerome Glisse wrote:
>  Not sure i understand, you are saying that the solution i outline
>  above does not work ? If so then i think you are wrong, in the above
>  solution the importing process mmap a device file and the resulting
>  vma is then populated using insert_pfn() and constantly keep
>  synchronize with the target process through mirroring which means that
>  you never have to look at the struct page ... you can mirror any kind
>  of memory from the remote process.
> >>>
> >>> If insert_pfn in turn calls MMU notifiers for the target VMA (which would 
> >>> be
> >>> the KVM MMU notifier), then that would work.  Though I guess it would be
> >>> possible to call MMU notifier update callbacks around the call to 
> >>> insert_pfn.
> >>
> >> Can't do that.
> >> First, insert_pfn() uses set_pte_at() which won't trigger the MMU notifier 
> >> on
> >> the target VMA. It's also static, so I'll have to access it thru 
> >> vmf_insert_pfn()
> >> or vmf_insert_mixed().
> > 
> > Why would you need to target mmu notifier on target vma ?
> 
> If the mapping of the source VMA changes, mirroring can update the
> target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
> dismantles its own existing page tables (so that they can be recreated
> with the new mapping from the source VMA)?
> 

So just to make sure i follow we have:
  - qemu process on host with anonymous vma
-> host cpu page table
  - kvm which maps host anonymous vma to guest
-> kvm guest page table
  - kvm inspector process which mirror vma from qemu process
-> inspector process page table

AFAIK the KVM notifier's will clear the kvm guest page table whenever
necessary (through kvm_mmu_notifier_invalidate_range_start). This is
what ensure that KVM's dismatles its own mapping, it abides to mmu-
notifier callbacks. If you did not you would have bugs (at least i
expect so). Am i wrong here ?

The mirroring kernel driver would also register the notifier against
the quemu process and would also abide to notifier callbacks.

What you want to maintain at all times is that none of the actors
above ever look at different page for the same virtual address (ie
one looking at older page while another look at new page).

This is where you have helper like HMM that make sure that you can
not populate the mirroring vma while a notifier is on going. Which
means that everything is serialize on the notifier.

Cheers,
Jérôme
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: DANGER WILL ROBINSON, DANGER

2019-10-02 Thread Paolo Bonzini
On 02/10/19 21:27, Jerome Glisse wrote:
> On Tue, Sep 10, 2019 at 07:49:51AM +, Mircea CIRJALIU - MELIU wrote:
>>> On 05/09/19 20:09, Jerome Glisse wrote:
 Not sure i understand, you are saying that the solution i outline
 above does not work ? If so then i think you are wrong, in the above
 solution the importing process mmap a device file and the resulting
 vma is then populated using insert_pfn() and constantly keep
 synchronize with the target process through mirroring which means that
 you never have to look at the struct page ... you can mirror any kind
 of memory from the remote process.
>>>
>>> If insert_pfn in turn calls MMU notifiers for the target VMA (which would be
>>> the KVM MMU notifier), then that would work.  Though I guess it would be
>>> possible to call MMU notifier update callbacks around the call to 
>>> insert_pfn.
>>
>> Can't do that.
>> First, insert_pfn() uses set_pte_at() which won't trigger the MMU notifier on
>> the target VMA. It's also static, so I'll have to access it thru 
>> vmf_insert_pfn()
>> or vmf_insert_mixed().
> 
> Why would you need to target mmu notifier on target vma ?

If the mapping of the source VMA changes, mirroring can update the
target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
dismantles its own existing page tables (so that they can be recreated
with the new mapping from the source VMA)?

Thanks,

Paolo

> You do not need
> that. The workflow is:
> 
> userspace:
> ptr = mmap(/dev/kvm-mirroring-device, virtual_addresse_of_target)
> 
> Then when the mirroring process access ptr it triggers page fault that
> endup in the vm_operation_struct->fault() which is just doing:
> 
> kernel-kvm-mirroring-function:
> kvm_mirror_page_fault(struct vm_fault *vmf) {
> struct kvm_mirror_struct *kvmms;
> 
> kvmms = kvm_mirror_struct_from_file(vmf->vma->vm_file);
> ...
> again:
> hmm_range_register();
> hmm_range_snapshot();
> take_lock(kvmms->update);
> if (!hmm_range_valid()) {
> vm_insert_pfn();
> drop_lock(kvmms->update);
> hmm_range_unregister();
> return VM_FAULT_NOPAGE;
> }
> drop_lock(kvmms->update);
> goto again;
> }
> 
> The notifier callback:
> kvmms_notifier_start() {
> take_lock(kvmms->update);
> clear_pte(start, end);
> drop_lock(kvmms->update);
> }
> 
>>
>> Our model (the importing process is encapsulated in another VM) forces us
>> to mirror certain pages from the anon VMA backing one VM's system RAM to 
>> the other VM's anon VMA. 
> 
> The mirror does not have to be an anon vma it can very well be a
> device vma ie mmap of a device file. I do not see any reasons why
> the mirror need to be an anon vma. Please explain why.
> 
>>
>> Using the functions above means setting VM_PFNMAP|VM_MIXEDMAP on 
>> the target anon VMA, but I guess this breaks the VMA. Is this recommended?
> 
> The mirror vma should not be an anon vma.
> 
>>
>> Then, mapping anon pages from one VMA to another without fixing the 
>> refcount and the mapcount breaks the daemons that think they're working 
>> on a pure anon VMA (kcompactd, khugepaged).
> 
> Note here the target vma ie the mirroring one is a mmap of device file
> and thus is skip by all of the above (kcompactd, khugepaged, ...) it is
> fully ignore by core mm.
> 
> Thus you do not need to fix the refcount in any way. If any of the core
> mm try to reclaim memory from the original vma then you will get mmu
> notifier callbacks and all you have to do is clear the page table of your
> device vma.
> 
> I did exactly that as a tools in the past and it works just fine with
> no change to core mm whatsoever.
> 
> Cheers,
> Jérôme
> 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: DANGER WILL ROBINSON, DANGER

2019-10-02 Thread Jerome Glisse
On Tue, Sep 10, 2019 at 07:49:51AM +, Mircea CIRJALIU - MELIU wrote:
> > On 05/09/19 20:09, Jerome Glisse wrote:
> > > Not sure i understand, you are saying that the solution i outline
> > > above does not work ? If so then i think you are wrong, in the above
> > > solution the importing process mmap a device file and the resulting
> > > vma is then populated using insert_pfn() and constantly keep
> > > synchronize with the target process through mirroring which means that
> > > you never have to look at the struct page ... you can mirror any kind
> > > of memory from the remote process.
> > 
> > If insert_pfn in turn calls MMU notifiers for the target VMA (which would be
> > the KVM MMU notifier), then that would work.  Though I guess it would be
> > possible to call MMU notifier update callbacks around the call to 
> > insert_pfn.
> 
> Can't do that.
> First, insert_pfn() uses set_pte_at() which won't trigger the MMU notifier on
> the target VMA. It's also static, so I'll have to access it thru 
> vmf_insert_pfn()
> or vmf_insert_mixed().

Why would you need to target mmu notifier on target vma ? You do not need
that. The workflow is:

userspace:
ptr = mmap(/dev/kvm-mirroring-device, virtual_addresse_of_target)

Then when the mirroring process access ptr it triggers page fault that
endup in the vm_operation_struct->fault() which is just doing:

kernel-kvm-mirroring-function:
kvm_mirror_page_fault(struct vm_fault *vmf) {
struct kvm_mirror_struct *kvmms;

kvmms = kvm_mirror_struct_from_file(vmf->vma->vm_file);
...
again:
hmm_range_register();
hmm_range_snapshot();
take_lock(kvmms->update);
if (!hmm_range_valid()) {
vm_insert_pfn();
drop_lock(kvmms->update);
hmm_range_unregister();
return VM_FAULT_NOPAGE;
}
drop_lock(kvmms->update);
goto again;
}

The notifier callback:
kvmms_notifier_start() {
take_lock(kvmms->update);
clear_pte(start, end);
drop_lock(kvmms->update);
}

> 
> Our model (the importing process is encapsulated in another VM) forces us
> to mirror certain pages from the anon VMA backing one VM's system RAM to 
> the other VM's anon VMA. 

The mirror does not have to be an anon vma it can very well be a
device vma ie mmap of a device file. I do not see any reasons why
the mirror need to be an anon vma. Please explain why.

> 
> Using the functions above means setting VM_PFNMAP|VM_MIXEDMAP on 
> the target anon VMA, but I guess this breaks the VMA. Is this recommended?

The mirror vma should not be an anon vma.

> 
> Then, mapping anon pages from one VMA to another without fixing the 
> refcount and the mapcount breaks the daemons that think they're working 
> on a pure anon VMA (kcompactd, khugepaged).

Note here the target vma ie the mirroring one is a mmap of device file
and thus is skip by all of the above (kcompactd, khugepaged, ...) it is
fully ignore by core mm.

Thus you do not need to fix the refcount in any way. If any of the core
mm try to reclaim memory from the original vma then you will get mmu
notifier callbacks and all you have to do is clear the page table of your
device vma.

I did exactly that as a tools in the past and it works just fine with
no change to core mm whatsoever.

Cheers,
Jérôme
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: DANGER WILL ROBINSON, DANGER

2019-09-09 Thread Paolo Bonzini
On 05/09/19 20:09, Jerome Glisse wrote:
> Not sure i understand, you are saying that the solution i outline above
> does not work ? If so then i think you are wrong, in the above solution
> the importing process mmap a device file and the resulting vma is then
> populated using insert_pfn() and constantly keep synchronize with the
> target process through mirroring which means that you never have to look
> at the struct page ... you can mirror any kind of memory from the remote
> process.

If insert_pfn in turn calls MMU notifiers for the target VMA (which
would be the KVM MMU notifier), then that would work.  Though I guess it
would be possible to call MMU notifier update callbacks around the call
to insert_pfn.

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: DANGER WILL ROBINSON, DANGER

2019-09-05 Thread Jerome Glisse
On Fri, Aug 23, 2019 at 12:39:21PM +, Mircea CIRJALIU - MELIU wrote:
> > On Thu, Aug 15, 2019 at 03:19:29PM -0400, Jerome Glisse wrote:
> > > On Tue, Aug 13, 2019 at 02:01:35PM +0300, Adalbert Lazăr wrote:
> > > > On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox 
> > wrote:
> > > > > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> > > > > > +++ b/include/linux/page-flags.h
> > > > > > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
> > > > > >   */
> > > > > >  #define PAGE_MAPPING_ANON  0x1
> > > > > >  #define PAGE_MAPPING_MOVABLE   0x2
> > > > > > +#define PAGE_MAPPING_REMOTE0x4
> > > > >
> > > > > Uh.  How do you know page->mapping would otherwise have bit 2
> > clear?
> > > > > Who's guaranteeing that?
> > > > >
> > > > > This is an awfully big patch to the memory management code, buried
> > > > > in the middle of a gigantic series which almost guarantees nobody
> > > > > would look at it.  I call shenanigans.
> > > > >
> > > > > > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page
> > *page, struct vm_area_struct *vma)
> > > > > >   * __page_set_anon_rmap - set up new anonymous rmap
> > > > > >   * @page:  Page or Hugepage to add to rmap
> > > > > >   * @vma:   VM area to add page to.
> > > > > > - * @address:   User virtual address of the mapping
> > > > > > + * @address:   User virtual address of the mapping
> > > > >
> > > > > And mixing in fluff changes like this is a real no-no.  Try again.
> > > > >
> > > >
> > > > No bad intentions, just overzealous.
> > > > I didn't want to hide anything from our patches.
> > > > Once we advance with the introspection patches related to KVM we'll
> > > > be back with the remote mapping patch, split and cleaned.
> > >
> > > They are not bit left in struct page ! Looking at the patch it seems
> > > you want to have your own pin count just for KVM. This is bad, we are
> > > already trying to solve the GUP thing (see all various patchset about
> > > GUP posted recently).
> > >
> > > You need to rethink how you want to achieve this. Why not simply a
> > > remote read()/write() into the process memory ie KVMI would call an
> > > ioctl that allow to read or write into a remote process memory like
> > > ptrace() but on steroid ...
> > >
> > > Adding this whole big complex infrastructure without justification of
> > > why we need to avoid round trip is just too much really.
> > 
> > Thinking a bit more about this, you can achieve the same thing without
> > adding a single line to any mm code. Instead of having mmap with
> > PROT_NONE | MAP_LOCKED you have userspace mmap some kvm device
> > file (i am assuming this is something you already have and can control the
> > mmap callback).
> > 
> > So now kernel side you have a vma with a vm_operations_struct under your
> > control this means that everything you want to block mm wise from within
> > the inspector process can be block through those call- backs
> > (find_special_page() specificaly for which you have to return NULL all the
> > time).
> > 
> > To mirror target process memory you can use hmm_mirror, when you
> > populate the inspector process page table you use insert_pfn() (mmap of
> > the kvm device file must mark this vma as PFNMAP).
> > 
> > By following the hmm_mirror API, anytime the target process has a change in
> > its page table (ie virtual address -> page) you will get a callback and all 
> > you
> > have to do is clear the page table within the inspector process and flush 
> > tlb
> > (use zap_page_range).
> > 
> > On page fault within the inspector process the fault callback of vm_ops will
> > get call and from there you call hmm_mirror following its API.
> > 
> > Oh also mark the vma with VM_WIPEONFORK to avoid any issue if the
> > inspector process use fork() (you could support fork but then you would
> > need to mark the vma as SHARED and use unmap_mapping_pages instead of
> > zap_page_range).
> > 
> > 
> > There everything you want to do with already upstream mm code.
> 
> I'm the author of remote mapping, so I owe everybody some explanations.
> My requirement was to map pages from one QEMU process to another QEMU 
> process (our inspector process works in a virtual machine of its own). So I 
> had 
> to implement a KSM-like page sharing between processes, where an anon page
> from the target QEMU's working memory is promoted to a remote page and 
> mapped in the inspector QEMU's working memory (both anon VMAs). 
> The extra page flag is for differentiating the page for rmap walking.
> 
> The mapping requests come at PAGE_SIZE granularity for random addresses 
> within the target/inspector QEMUs, so I couldn't do any linear mapping that
> would keep things simpler. 
> 
> I have an extra patch that does remote mapping by mirroring an entire VMA
> from the target process by way of a device file. This thing creates a 
> separate 
> mirror VMA in my inspector process (at the moment a QEMU), but then I 
> bumped into the KVM hva->gpa mapping, which 

Re: DANGER WILL ROBINSON, DANGER

2019-08-16 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 04:16:30PM -0400, Jerome Glisse wrote:
> On Thu, Aug 15, 2019 at 03:19:29PM -0400, Jerome Glisse wrote:
> > On Tue, Aug 13, 2019 at 02:01:35PM +0300, Adalbert Lazăr wrote:
> > > On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox  
> > > wrote:
> > > > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> > > > > +++ b/include/linux/page-flags.h
> > > > > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
> > > > >   */
> > > > >  #define PAGE_MAPPING_ANON0x1
> > > > >  #define PAGE_MAPPING_MOVABLE 0x2
> > > > > +#define PAGE_MAPPING_REMOTE  0x4
> > > > 
> > > > Uh.  How do you know page->mapping would otherwise have bit 2 clear?
> > > > Who's guaranteeing that?
> > > > 
> > > > This is an awfully big patch to the memory management code, buried in
> > > > the middle of a gigantic series which almost guarantees nobody would
> > > > look at it.  I call shenanigans.
> > > > 
> > > > > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page *page, 
> > > > > struct vm_area_struct *vma)
> > > > >   * __page_set_anon_rmap - set up new anonymous rmap
> > > > >   * @page:Page or Hugepage to add to rmap
> > > > >   * @vma: VM area to add page to.
> > > > > - * @address: User virtual address of the mapping 
> > > > > + * @address: User virtual address of the mapping
> > > > 
> > > > And mixing in fluff changes like this is a real no-no.  Try again.
> > > > 
> > > 
> > > No bad intentions, just overzealous.
> > > I didn't want to hide anything from our patches.
> > > Once we advance with the introspection patches related to KVM we'll be
> > > back with the remote mapping patch, split and cleaned.
> > 
> > They are not bit left in struct page ! Looking at the patch it seems
> > you want to have your own pin count just for KVM. This is bad, we are
> > already trying to solve the GUP thing (see all various patchset about
> > GUP posted recently).
> > 
> > You need to rethink how you want to achieve this. Why not simply a
> > remote read()/write() into the process memory ie KVMI would call
> > an ioctl that allow to read or write into a remote process memory
> > like ptrace() but on steroid ...
> > 
> > Adding this whole big complex infrastructure without justification
> > of why we need to avoid round trip is just too much really.
> 
> Thinking a bit more about this, you can achieve the same thing without
> adding a single line to any mm code. Instead of having mmap with
> PROT_NONE | MAP_LOCKED you have userspace mmap some kvm device file
> (i am assuming this is something you already have and can control
> the mmap callback).
> 
> So now kernel side you have a vma with a vm_operations_struct under
> your control this means that everything you want to block mm wise
> from within the inspector process can be block through those call-
> backs (find_special_page() specificaly for which you have to return
> NULL all the time).

I'm actually aware of a couple of use cases that would like to
mirror the VA of one process into another. One big one in the HPC
world is the out of tree 'xpmem' still in wide use today. xpmem is
basically what Jerome described above.

If you do an approach like Jerome describes it would be nice if was a
general facility and not buried in kvm.

I know past xpmem adventures ran into trouble with locking/etc - ie
getting the mm_struct of the victim seemed a bit hard for some reason,
but maybe that could be done with a FD pass 'ioctl(I AM THE VICITM)' ?

Jason
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: DANGER WILL ROBINSON, DANGER

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 03:19:29PM -0400, Jerome Glisse wrote:
> On Tue, Aug 13, 2019 at 02:01:35PM +0300, Adalbert Lazăr wrote:
> > On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox  
> > wrote:
> > > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> > > > +++ b/include/linux/page-flags.h
> > > > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
> > > >   */
> > > >  #define PAGE_MAPPING_ANON  0x1
> > > >  #define PAGE_MAPPING_MOVABLE   0x2
> > > > +#define PAGE_MAPPING_REMOTE0x4
> > > 
> > > Uh.  How do you know page->mapping would otherwise have bit 2 clear?
> > > Who's guaranteeing that?
> > > 
> > > This is an awfully big patch to the memory management code, buried in
> > > the middle of a gigantic series which almost guarantees nobody would
> > > look at it.  I call shenanigans.
> > > 
> > > > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page *page, 
> > > > struct vm_area_struct *vma)
> > > >   * __page_set_anon_rmap - set up new anonymous rmap
> > > >   * @page:  Page or Hugepage to add to rmap
> > > >   * @vma:   VM area to add page to.
> > > > - * @address:   User virtual address of the mapping 
> > > > + * @address:   User virtual address of the mapping
> > > 
> > > And mixing in fluff changes like this is a real no-no.  Try again.
> > > 
> > 
> > No bad intentions, just overzealous.
> > I didn't want to hide anything from our patches.
> > Once we advance with the introspection patches related to KVM we'll be
> > back with the remote mapping patch, split and cleaned.
> 
> They are not bit left in struct page ! Looking at the patch it seems
> you want to have your own pin count just for KVM. This is bad, we are
> already trying to solve the GUP thing (see all various patchset about
> GUP posted recently).
> 
> You need to rethink how you want to achieve this. Why not simply a
> remote read()/write() into the process memory ie KVMI would call
> an ioctl that allow to read or write into a remote process memory
> like ptrace() but on steroid ...
> 
> Adding this whole big complex infrastructure without justification
> of why we need to avoid round trip is just too much really.

Thinking a bit more about this, you can achieve the same thing without
adding a single line to any mm code. Instead of having mmap with
PROT_NONE | MAP_LOCKED you have userspace mmap some kvm device file
(i am assuming this is something you already have and can control
the mmap callback).

So now kernel side you have a vma with a vm_operations_struct under
your control this means that everything you want to block mm wise
from within the inspector process can be block through those call-
backs (find_special_page() specificaly for which you have to return
NULL all the time).

To mirror target process memory you can use hmm_mirror, when you
populate the inspector process page table you use insert_pfn()
(mmap of the kvm device file must mark this vma as PFNMAP).

By following the hmm_mirror API, anytime the target process has
a change in its page table (ie virtual address -> page) you will
get a callback and all you have to do is clear the page table
within the inspector process and flush tlb (use zap_page_range).

On page fault within the inspector process the fault callback of
vm_ops will get call and from there you call hmm_mirror following
its API.

Oh also mark the vma with VM_WIPEONFORK to avoid any issue if the
inspector process use fork() (you could support fork but then you
would need to mark the vma as SHARED and use unmap_mapping_pages
instead of zap_page_range).


There everything you want to do with already upstream mm code.

Cheers,
Jérôme
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: DANGER WILL ROBINSON, DANGER

2019-08-15 Thread Jerome Glisse
On Tue, Aug 13, 2019 at 02:01:35PM +0300, Adalbert Lazăr wrote:
> On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox  wrote:
> > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> > > +++ b/include/linux/page-flags.h
> > > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
> > >   */
> > >  #define PAGE_MAPPING_ANON0x1
> > >  #define PAGE_MAPPING_MOVABLE 0x2
> > > +#define PAGE_MAPPING_REMOTE  0x4
> > 
> > Uh.  How do you know page->mapping would otherwise have bit 2 clear?
> > Who's guaranteeing that?
> > 
> > This is an awfully big patch to the memory management code, buried in
> > the middle of a gigantic series which almost guarantees nobody would
> > look at it.  I call shenanigans.
> > 
> > > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page *page, struct 
> > > vm_area_struct *vma)
> > >   * __page_set_anon_rmap - set up new anonymous rmap
> > >   * @page:Page or Hugepage to add to rmap
> > >   * @vma: VM area to add page to.
> > > - * @address: User virtual address of the mapping 
> > > + * @address: User virtual address of the mapping
> > 
> > And mixing in fluff changes like this is a real no-no.  Try again.
> > 
> 
> No bad intentions, just overzealous.
> I didn't want to hide anything from our patches.
> Once we advance with the introspection patches related to KVM we'll be
> back with the remote mapping patch, split and cleaned.

They are not bit left in struct page ! Looking at the patch it seems
you want to have your own pin count just for KVM. This is bad, we are
already trying to solve the GUP thing (see all various patchset about
GUP posted recently).

You need to rethink how you want to achieve this. Why not simply a
remote read()/write() into the process memory ie KVMI would call
an ioctl that allow to read or write into a remote process memory
like ptrace() but on steroid ...

Adding this whole big complex infrastructure without justification
of why we need to avoid round trip is just too much really.

Cheers,
Jérôme
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: DANGER WILL ROBINSON, DANGER

2019-08-13 Thread Paolo Bonzini
On 13/08/19 13:24, Matthew Wilcox wrote:
>>>
>>> This is an awfully big patch to the memory management code, buried in
>>> the middle of a gigantic series which almost guarantees nobody would
>>> look at it.  I call shenanigans.
>> Are you calling shenanigans on the patch submitter (which is gratuitous)
>> or on the KVM maintainers/reviewers?
>
> On the patch submitter, of course.  How can I possibly be criticising you
> for something you didn't do?

No idea.  "Nobody would look at it" definitely includes me though.

In any case, water under the bridge.  The submitter did duly mark the
series as RFC, I don't see anything wrong in what he did apart from not
having testcases. :)

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: DANGER WILL ROBINSON, DANGER

2019-08-13 Thread Matthew Wilcox
On Tue, Aug 13, 2019 at 11:29:07AM +0200, Paolo Bonzini wrote:
> On 09/08/19 18:24, Matthew Wilcox wrote:
> > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> >> +++ b/include/linux/page-flags.h
> >> @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
> >>   */
> >>  #define PAGE_MAPPING_ANON 0x1
> >>  #define PAGE_MAPPING_MOVABLE  0x2
> >> +#define PAGE_MAPPING_REMOTE   0x4
> > Uh.  How do you know page->mapping would otherwise have bit 2 clear?
> > Who's guaranteeing that?
> > 
> > This is an awfully big patch to the memory management code, buried in
> > the middle of a gigantic series which almost guarantees nobody would
> > look at it.  I call shenanigans.
> 
> Are you calling shenanigans on the patch submitter (which is gratuitous)
> or on the KVM maintainers/reviewers?

On the patch submitter, of course.  How can I possibly be criticising you
for something you didn't do?

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: DANGER WILL ROBINSON, DANGER

2019-08-13 Thread Adalbert Lazăr
On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox  wrote:
> On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> > +++ b/include/linux/page-flags.h
> > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
> >   */
> >  #define PAGE_MAPPING_ANON  0x1
> >  #define PAGE_MAPPING_MOVABLE   0x2
> > +#define PAGE_MAPPING_REMOTE0x4
> 
> Uh.  How do you know page->mapping would otherwise have bit 2 clear?
> Who's guaranteeing that?
> 
> This is an awfully big patch to the memory management code, buried in
> the middle of a gigantic series which almost guarantees nobody would
> look at it.  I call shenanigans.
> 
> > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page *page, struct 
> > vm_area_struct *vma)
> >   * __page_set_anon_rmap - set up new anonymous rmap
> >   * @page:  Page or Hugepage to add to rmap
> >   * @vma:   VM area to add page to.
> > - * @address:   User virtual address of the mapping 
> > + * @address:   User virtual address of the mapping
> 
> And mixing in fluff changes like this is a real no-no.  Try again.
> 

No bad intentions, just overzealous.
I didn't want to hide anything from our patches.
Once we advance with the introspection patches related to KVM we'll be
back with the remote mapping patch, split and cleaned.

Thanks
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: DANGER WILL ROBINSON, DANGER

2019-08-13 Thread Paolo Bonzini
On 09/08/19 18:24, Matthew Wilcox wrote:
> On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
>> +++ b/include/linux/page-flags.h
>> @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
>>   */
>>  #define PAGE_MAPPING_ANON   0x1
>>  #define PAGE_MAPPING_MOVABLE0x2
>> +#define PAGE_MAPPING_REMOTE 0x4
> Uh.  How do you know page->mapping would otherwise have bit 2 clear?
> Who's guaranteeing that?
> 
> This is an awfully big patch to the memory management code, buried in
> the middle of a gigantic series which almost guarantees nobody would
> look at it.  I call shenanigans.

Are you calling shenanigans on the patch submitter (which is gratuitous)
or on the KVM maintainers/reviewers?

It's not true that nobody would look at it.  Of course no one from
linux-mm is going to look at it, but the maintainer that looks at the
gigantic series is very much expected to look at it and explain to the
submitter that this patch is unacceptable as is.

In fact I shouldn't have to to explain this to you; you know better than
believing that I would try to sneak it past the mm folks.  I am puzzled.

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

DANGER WILL ROBINSON, DANGER

2019-08-09 Thread Matthew Wilcox
On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> +++ b/include/linux/page-flags.h
> @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
>   */
>  #define PAGE_MAPPING_ANON0x1
>  #define PAGE_MAPPING_MOVABLE 0x2
> +#define PAGE_MAPPING_REMOTE  0x4

Uh.  How do you know page->mapping would otherwise have bit 2 clear?
Who's guaranteeing that?

This is an awfully big patch to the memory management code, buried in
the middle of a gigantic series which almost guarantees nobody would
look at it.  I call shenanigans.

> @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page *page, struct 
> vm_area_struct *vma)
>   * __page_set_anon_rmap - set up new anonymous rmap
>   * @page:Page or Hugepage to add to rmap
>   * @vma: VM area to add page to.
> - * @address: User virtual address of the mapping 
> + * @address: User virtual address of the mapping

And mixing in fluff changes like this is a real no-no.  Try again.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization