Re: [RFC PATCH v1 1/2] virt: memctl: control guest physical memory properties

2024-05-14 Thread Greg Kroah-Hartman
On Tue, May 14, 2024 at 06:21:57PM -0700, Yuanchu Xie wrote:
> On Tue, May 14, 2024 at 9:06 AM Greg Kroah-Hartman
>  wrote:
> >
> > On Mon, May 13, 2024 at 07:03:00PM -0700, Yuanchu Xie wrote:
> > > Memctl provides a way for the guest to control its physical memory
> > > properties, and enables optimizations and security features. For
> > > example, the guest can inform the host that parts of a hugepage may be
> > > unbacked, or that sensitive data should not be swapped out, etc.
> > >...
> > Pretty generic name for a hardware-specific driver :(
> It's not for real hardware btw. Its use case is similar to pvpanic
> where the device is emulated by the VMM. I can change the name if it's
> a problem.

This file is only used for a single PCI device, that is very
hardware-specific even if that hardware is "fake" :)

Please make the name more specific as well.

thanks,

greg k-h



Re: [RFC PATCH v1 1/2] virt: memctl: control guest physical memory properties

2024-05-14 Thread Yuanchu Xie
On Tue, May 14, 2024 at 9:06 AM Greg Kroah-Hartman
 wrote:
>
> On Mon, May 13, 2024 at 07:03:00PM -0700, Yuanchu Xie wrote:
> > Memctl provides a way for the guest to control its physical memory
> > properties, and enables optimizations and security features. For
> > example, the guest can inform the host that parts of a hugepage may be
> > unbacked, or that sensitive data should not be swapped out, etc.
> >...
> Pretty generic name for a hardware-specific driver :(
It's not for real hardware btw. Its use case is similar to pvpanic
where the device is emulated by the VMM. I can change the name if it's
a problem.

> Yup, you write this to hardware, please use proper structures and types
> for that, otherwise you will have problems in the near future.
Thanks for the review and comments on endianness and using proper
types. Will do.

Thanks,
Yuanchu



Re: [RFC PATCH v1 1/2] virt: memctl: control guest physical memory properties

2024-05-14 Thread Greg Kroah-Hartman
On Mon, May 13, 2024 at 07:03:00PM -0700, Yuanchu Xie wrote:
> +/*
> + * Used for internal kernel memctl calls, i.e. to better support kernel stacks,
> + * or to efficiently zero hugetlb pages.
> + */
> +long memctl_vmm_call(__u64 func_code, __u64 addr, __u64 length, __u64 arg,
> +  struct memctl_buf *buf)
> +{
> + buf->call.func_code = func_code;
> + buf->call.addr = addr;
> + buf->call.length = length;
> + buf->call.arg = arg;
> +
> + return __memctl_vmm_call(buf);
> +}
> +EXPORT_SYMBOL(memctl_vmm_call);

You export something that is never actually called, which implies that
this is not tested at all (i.e. it is dead code.)  Please remove.

Also, why not EXPORT_SYMBOL_GPL()?   (I have to ask, sorry.)
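
For reference, the GPL-only form of the export is the one-line variant:

EXPORT_SYMBOL_GPL(memctl_vmm_call);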

thanks,

greg k-h



Re: [RFC PATCH v1 1/2] virt: memctl: control guest physical memory properties

2024-05-14 Thread Greg Kroah-Hartman
On Mon, May 13, 2024 at 07:03:00PM -0700, Yuanchu Xie wrote:
> Memctl provides a way for the guest to control its physical memory
> properties, and enables optimizations and security features. For
> example, the guest can inform the host that parts of a hugepage may be
> unbacked, or that sensitive data should not be swapped out, etc.
> 
> Memctl allows the guest to manipulate its gPTE entries in the SLAT, and
> also some other properties of the host memory mapping that backs guest memory.
> This is achieved by using the KVM_CAP_SYNC_MMU capability. When this
> capability is available, the changes in the backing of the memory region
> on the host are automatically reflected into the guest. For example, an
> mmap() or madvise() that affects the region will be made visible
> immediately.
> 
> There are two components of the implementation: the guest Linux driver
> and the Virtual Machine Monitor (VMM) device. A guest-allocated shared
> buffer is negotiated per-cpu through a few PCI MMIO registers; the VMM
> device assigns a unique command for each per-cpu buffer. The guest
> writes its memctl request in the per-cpu buffer, then writes the
> corresponding command into the command register, calling into the VMM
> device to perform the memctl request.
> 
> The synchronous per-cpu shared buffer approach avoids the kick and busy
> waiting that the guest would have to do with virtio virtqueue transport.
> 
> We provide both kernel and userspace APIs.
> Kernel API
> long memctl_vmm_call(__u64 func_code, __u64 addr, __u64 length, __u64 arg,
>        struct memctl_buf *buf);
> 
> Kernel drivers can take advantage of the memctl calls to provide
> paravirtualization of kernel stacks or page zeroing.
> 
> User API
> From userland, the memctl guest driver is controlled via an ioctl(2)
> call. It requires CAP_SYS_ADMIN.
> 
> ioctl(fd, MEMCTL_IOCTL, union memctl_vmm *memctl_vmm);
> 
> Guest userland applications can tag VMAs and guest hugepages, or advise
> the host on how to handle sensitive guest pages.
> 
> Supported function codes and their use cases:
> MEMCTL_FREE/REMOVE/DONTNEED/PAGEOUT. For the guest, one can reduce the
> struct page and page table lookup overhead by using hugepages backed by
> smaller pages on the host. These memctl commands allow for partial
> freeing of private guest hugepages to save memory. They also allow
> kernel memory, such as kernel stacks and task_structs, to be
> paravirtualized.
> 
> MEMCTL_UNMERGEABLE is useful for security, when the VM does not want to
> share its backing pages.
> The same applies to MADV_DONTDUMP, so that sensitive pages are not
> included in a dump.
> MEMCTL_MLOCK/UNLOCK can advise the host that sensitive information
> should not be swapped out on the host.
> 
> MEMCTL_MPROTECT_NONE/R/W/RW. For guest stacks backed by hugepages, stack
> guard pages can be handled in the host and memory can be saved in the
> hugepage.
> 
> MEMCTL_SET_VMA_ANON_NAME is useful for observability and debugging how
> guest memory is being mapped on the host.
> 
> Sample program making use of MEMCTL_SET_VMA_ANON_NAME and
> MEMCTL_DONTNEED:
> https://github.com/Dummyc0m/memctl-set-anon-vma-name/tree/main
> https://github.com/Dummyc0m/memctl-set-anon-vma-name/tree/dontneed
> 
> The VMM implementation is being proposed for Cloud Hypervisor:
> https://github.com/Dummyc0m/cloud-hypervisor/
> 
> Cloud Hypervisor issue:
> https://github.com/cloud-hypervisor/cloud-hypervisor/issues/6318
> 
> Signed-off-by: Yuanchu Xie 
> ---
>  .../userspace-api/ioctl/ioctl-number.rst  |   2 +
>  drivers/virt/Kconfig  |   2 +
>  drivers/virt/Makefile |   1 +
>  drivers/virt/memctl/Kconfig   |  10 +
>  drivers/virt/memctl/Makefile  |   2 +
>  drivers/virt/memctl/memctl.c  | 425 ++
>  include/linux/memctl.h|  27 ++
>  include/uapi/linux/memctl.h   |  81 

You are mixing your PCI driver in with the memctl core code, is that
intentional?  Will there never be another PCI device for this type of
interface other than this one PCI device?

And if so, why export anything, why isn't this all in one body of code?

>  8 files changed, 550 insertions(+)
>  create mode 100644 drivers/virt/memctl/Kconfig
>  create mode 100644 drivers/virt/memctl/Makefile
>  create mode 100644 drivers/virt/memctl/memctl.c
>  create mode 100644 include/linux/memctl.h
>  create mode 100644 include/uapi/linux/memctl.h
> 
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 457e16f06e04..789d1251c0be 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -368,6 +368,8 @@ Code  Seq#    Include File                                           Comments
>  0xCD  01     linux/reiserfs_fs.h
>  0xCE  01-02  uapi/linux/cxl_mem.h

[RFC PATCH v1 1/2] virt: memctl: control guest physical memory properties

2024-05-13 Thread Yuanchu Xie
Memctl provides a way for the guest to control its physical memory
properties, and enables optimizations and security features. For
example, the guest can inform the host that parts of a hugepage may be
unbacked, or that sensitive data should not be swapped out, etc.

Memctl allows the guest to manipulate its gPTE entries in the SLAT, and
also some other properties of the host memory mapping that backs guest memory.
This is achieved by using the KVM_CAP_SYNC_MMU capability. When this
capability is available, the changes in the backing of the memory region
on the host are automatically reflected into the guest. For example, an
mmap() or madvise() that affects the region will be made visible
immediately.
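
For illustration, a host-side sketch (not taken from this patch;
guest_mem, offset, and len are hypothetical) of a backing change the
guest would observe:

#include <sys/mman.h>

/* With KVM_CAP_SYNC_MMU, dropping the host pages behind part of the
 * guest mapping takes effect in the guest immediately.
 */
static int unback_guest_range(void *guest_mem, size_t offset, size_t len)
{
	return madvise((char *)guest_mem + offset, len, MADV_DONTNEED);
}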

There are two components of the implementation: the guest Linux driver
and the Virtual Machine Monitor (VMM) device. A guest-allocated shared
buffer is negotiated per-cpu through a few PCI MMIO registers; the VMM
device assigns a unique command for each per-cpu buffer. The guest
writes its memctl request in the per-cpu buffer, then writes the
corresponding command into the command register, calling into the VMM
device to perform the memctl request.
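
In rough pseudo-C, the guest-side call path looks like the following
sketch (the register offset, per-cpu variables, and result field are
illustrative placeholders, not the driver's actual definitions):

#include <linux/io.h>
#include <linux/percpu.h>

#define MEMCTL_REG_COMMAND	0x08	/* illustrative register offset */

/* Placeholders, not the driver's actual definitions. */
static DEFINE_PER_CPU(struct memctl_buf, memctl_percpu_buf);
static DEFINE_PER_CPU(u32, memctl_percpu_cmd);	/* command assigned by the VMM */
static void __iomem *memctl_base;		/* mapped PCI MMIO BAR */

static long memctl_issue(struct memctl_buf *buf)
{
	long ret;

	/* Stay on one CPU so the request and reply use that CPU's buffer. */
	preempt_disable();
	*this_cpu_ptr(&memctl_percpu_buf) = *buf;
	/* The MMIO write traps to the VMM, which services the request
	 * synchronously; no kick plus busy-wait cycle is needed. */
	iowrite32(__this_cpu_read(memctl_percpu_cmd),
		  memctl_base + MEMCTL_REG_COMMAND);
	*buf = *this_cpu_ptr(&memctl_percpu_buf);
	ret = buf->ret;	/* hypothetical status field */
	preempt_enable();

	return ret;
}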

The synchronous per-cpu shared buffer approach avoids the kick and busy
waiting that the guest would have to do with virtio virtqueue transport.

We provide both kernel and userspace APIs.
Kernel API
long memctl_vmm_call(__u64 func_code, __u64 addr, __u64 length, __u64 arg,
 struct memctl_buf *buf);

Kernel drivers can take advantage of the memctl calls to provide
paravirtualization of kernel stacks or page zeroing.
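
For instance, a kernel caller might look like this sketch
(MEMCTL_DONTNEED is one of the function codes listed below; the return
convention is an assumption):

#include <linux/memctl.h>

/* Release the host backing of a range inside a guest hugepage. */
static long guest_dontneed_range(unsigned long addr, unsigned long len)
{
	struct memctl_buf buf;

	return memctl_vmm_call(MEMCTL_DONTNEED, addr, len, 0, &buf);
}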

User API
From userland, the memctl guest driver is controlled via an ioctl(2)
call. It requires CAP_SYS_ADMIN.

ioctl(fd, MEMCTL_IOCTL, union memctl_vmm *memctl_vmm);

Guest userland applications can tag VMAs and guest hugepages, or advise
the host on how to handle sensitive guest pages.
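
A minimal userspace sketch (the union memctl_vmm field names and the
/dev/memctl node are assumptions, not copied from the uapi header):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/memctl.h>

static int memctl_dontneed(unsigned long addr, unsigned long len)
{
	union memctl_vmm req = {0};	/* field names are assumptions */
	int fd, ret;

	req.call.func_code = MEMCTL_DONTNEED;
	req.call.addr = addr;
	req.call.length = len;

	fd = open("/dev/memctl", O_RDONLY);	/* node name assumed */
	if (fd < 0)
		return -1;
	ret = ioctl(fd, MEMCTL_IOCTL, &req);	/* requires CAP_SYS_ADMIN */
	close(fd);
	return ret;
}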

Supported function codes and their use cases:
MEMCTL_FREE/REMOVE/DONTNEED/PAGEOUT. For the guest, one can reduce the
struct page and page table lookup overhead by using hugepages backed by
smaller pages on the host. These memctl commands allow for partial
freeing of private guest hugepages to save memory. They also allow
kernel memory, such as kernel stacks and task_structs, to be
paravirtualized.

MEMCTL_UNMERGEABLE is useful for security, when the VM does not want to
share its backing pages.
The same applies to MADV_DONTDUMP, so that sensitive pages are not
included in a dump.
MEMCTL_MLOCK/UNLOCK can advise the host that sensitive information
should not be swapped out on the host.

MEMCTL_MPROTECT_NONE/R/W/RW. For guest stacks backed by hugepages, stack
guard pages can be handled in the host and memory can be saved in the
hugepage.

MEMCTL_SET_VMA_ANON_NAME is useful for observability and debugging how
guest memory is being mapped on the host.

Sample program making use of MEMCTL_SET_VMA_ANON_NAME and
MEMCTL_DONTNEED:
https://github.com/Dummyc0m/memctl-set-anon-vma-name/tree/main
https://github.com/Dummyc0m/memctl-set-anon-vma-name/tree/dontneed

The VMM implementation is being proposed for Cloud Hypervisor:
https://github.com/Dummyc0m/cloud-hypervisor/

Cloud Hypervisor issue:
https://github.com/cloud-hypervisor/cloud-hypervisor/issues/6318

Signed-off-by: Yuanchu Xie 
---
 .../userspace-api/ioctl/ioctl-number.rst  |   2 +
 drivers/virt/Kconfig  |   2 +
 drivers/virt/Makefile |   1 +
 drivers/virt/memctl/Kconfig   |  10 +
 drivers/virt/memctl/Makefile  |   2 +
 drivers/virt/memctl/memctl.c  | 425 ++
 include/linux/memctl.h|  27 ++
 include/uapi/linux/memctl.h   |  81 
 8 files changed, 550 insertions(+)
 create mode 100644 drivers/virt/memctl/Kconfig
 create mode 100644 drivers/virt/memctl/Makefile
 create mode 100644 drivers/virt/memctl/memctl.c
 create mode 100644 include/linux/memctl.h
 create mode 100644 include/uapi/linux/memctl.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 457e16f06e04..789d1251c0be 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -368,6 +368,8 @@ Code  Seq#    Include File                                           Comments
 0xCD  01     linux/reiserfs_fs.h
 0xCE  01-02  uapi/linux/cxl_mem.h                                    Compute Express Link Memory Devices
 0xCF  02     fs/smb/client/cifs_ioctl.h
+0xDA  00     linux/memctl.h                                          Memctl Device
+
 0xDB  00-0F  drivers/char/mwave/mwavepub.h
 0xDD  00-3F                                                          ZFCP device driver see drivers/s390/scsi/