Re: [PATCH] hw/i386/acpi: Set PCAT_COMPAT bit only when pic is not disabled

2024-04-03 Thread Kirill A. Shutemov
On Wed, Apr 03, 2024 at 10:03:15AM +0800, Xiaoyao Li wrote:
> On 4/2/2024 10:31 PM, Michael S. Tsirkin wrote:
> > On Tue, Apr 02, 2024 at 09:18:44PM +0800, Xiaoyao Li wrote:
> > > On 4/2/2024 6:02 PM, Michael S. Tsirkin wrote:
> > > > On Tue, Apr 02, 2024 at 04:25:16AM -0400, Xiaoyao Li wrote:
> > > > > Set MADT.FLAGS[bit 0].PCAT_COMPAT based on x86ms->pic.
> > > > > 
> > > > > Signed-off-by: Xiaoyao Li 
> > > > 
> > > > Please include more info in the commit log:
> > > > what is the behaviour you observe, why it is wrong,
> > > > how does the patch fix it, what is guest behaviour
> > > > before and after.
> > > 
> > > Sorry, I thought it was straightforward.
> > > 
> > > A value of 1 in PCAT_COMPAT (bit 0) of MADT.Flags indicates that the system
> > > also has a PC-AT-compatible dual-8259 setup, i.e., the PIC.
> > > 
> > > When the PIC is not enabled for an x86 machine, the PCAT_COMPAT bit needs to
> > > be cleared. Otherwise, the guest thinks a PIC is present even when it is
> > > booted with pic=off on QEMU.
> > > 
> > > (I haven't seen a real issue from a Linux guest. The only user of the PIC
> > > inside the guest seems to be PIT calibration, and whether PIT calibration is
> > > triggered depends on other things. But logically, the current code is wrong,
> > > so we need to fix it anyway.
> > > 
> > > @Isaku, please share more info if you have)
> > > 
> 
> + Kirill,
> 
> There seems to be an issue with legacy IRQs when PCAT_COMPAT is set to 1 while
> there is no PIC on the QEMU side. Kirill, could you elaborate?

A TDX guest cannot support the PIC because the platform doesn't allow direct
interrupt injection, only posted interrupts.

For the TDX guest kernel we had a patch[1] that forces no-PIC, but it is not
upstreamable as it is a hack.

I looked around to find The Right Way™ to achieve the same effect and
discovered that we only have PIC ops hooked up because the kernel bypasses[2]
PIC enumeration when PCAT_COMPAT is set. That is wrong for a TDX guest and for
other platforms without a PIC.

I am not aware of any user-visible issues due to it, but maybe they just
have not been discovered yet.
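
For context, a minimal sketch of what the QEMU-side fix could look like
(build_append_int_noprefix() and x86ms->pic exist in QEMU, but the exact hunk
below is only an illustration of the idea, not the posted patch):

	/* MADT flags: advertise the dual-8259 setup only when the machine has a PIC */
	build_append_int_noprefix(table_data,
				  x86ms->pic != ON_OFF_AUTO_OFF ? 1 /* PCAT_COMPAT */ : 0,
				  4); /* Flags */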

[1] 
https://lore.kernel.org/linux-kernel/b29f00c1eb5cff585ec2b999b69923c13418ecc4.1619458733.git.sathyanarayanan.kuppusw...@linux.intel.com/
[2] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/i8259.c#n322

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

2023-04-13 Thread Kirill A. Shutemov
On Wed, Apr 12, 2023 at 06:07:28PM -0700, Sean Christopherson wrote:
> On Wed, Jan 25, 2023, Kirill A. Shutemov wrote:
> > On Wed, Jan 25, 2023 at 12:20:26AM +, Sean Christopherson wrote:
> > > On Tue, Jan 24, 2023, Liam Merwick wrote:
> > > > On 14/01/2023 00:37, Sean Christopherson wrote:
> > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > > This patch series implements KVM guest private memory for 
> > > > > > confidential
> > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > > TDX-protected guest memory, machine check can happen which can 
> > > > > > further
> > > > > > crash the running host system, this is terrible for multi-tenant
> > > > > > configurations. The host accesses include those from KVM userspace 
> > > > > > like
> > > > > > QEMU. This series addresses KVM userspace induced crash by 
> > > > > > introducing
> > > > > > new mm and KVM interfaces so KVM userspace can still manage guest 
> > > > > > memory
> > > > > > via a fd-based approach, but it can never access the guest memory
> > > > > > content.
> > > > > > 
> > > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any 
> > > > > > other
> > > > > > reviews are always welcome.
> > > > > >- 01: mm change, target for mm tree
> > > > > >- 02-09: KVM change, target for KVM tree
> > > > > 
> > > > > A version with all of my feedback, plus reworked versions of Vishal's 
> > > > > selftest,
> > > > > is available here:
> > > > > 
> > > > >g...@github.com:sean-jc/linux.git x86/upm_base_support
> > > > > 
> > > > > It compiles and passes the selftest, but it's otherwise barely 
> > > > > tested.  There are
> > > > > a few todos (2 I think?) and many of the commits need changelogs, 
> > > > > i.e. it's still
> > > > > a WIP.
> > > > > 
> > > > 
> > > > When running LTP (https://github.com/linux-test-project/ltp) on the v10
> > > > bits (and also with Sean's branch above) I encounter the following NULL
> > > > pointer dereference with testcases/kernel/syscalls/madvise/madvise01
> > > > (100% reproducible).
> > > > 
> > > > It appears that in restrictedmem_error_page()
> > > > inode->i_mapping->private_data is NULL in the
> > > > list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) but I
> > > > don't know why.
> > > 
> > > Kirill, can you take a look?  Or pass the buck to someone who can? :-)
> > 
> > The patch below should help.
> > 
> > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> > index 15c52301eeb9..39ada985c7c0 100644
> > --- a/mm/restrictedmem.c
> > +++ b/mm/restrictedmem.c
> > @@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping)
> >  
> > spin_lock(&sb->s_inode_list_lock);
> > list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> > -   struct restrictedmem *rm = inode->i_mapping->private_data;
> > struct restrictedmem_notifier *notifier;
> > -   struct file *memfd = rm->memfd;
> > +   struct restrictedmem *rm;
> > unsigned long index;
> > +   struct file *memfd;
> >  
> > -   if (memfd->f_mapping != mapping)
> > +   if (atomic_read(&inode->i_count))
> 
> Kirill, should this be
> 
>   if (!atomic_read(&inode->i_count))
>   continue;
> 
> i.e. skip unreferenced inodes, not skip referenced inodes?

Ouch. Yes.

But looking at other instances of s_inodes usage, I think we can drop the
check altogether. An inode cannot be completely freed until it is removed
from the s_inodes list.

While there, replace list_for_each_entry_safe() with
list_for_each_entry() as we don't remove anything from the list.

diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index 55e99e6c09a1..8e8a4420d3d1 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -194,22 +194,19 @@ static int restricted_error_remove_page(struct address_space *mapping,
struct page *page)
 {
struct super_block *sb = restrictedmem_mnt->mnt_sb;
-   struct inode *inode, *next;
+   struct inode *inode;
pgoff_t start, end;
 
start = page->index;
end = start + thp_nr_pages(page);
 
spin_lock(&sb->s_inode_list_lock);
-   list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+   list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
struct restrictedmem_notifier *notifier;
struct restrictedmem *rm;
unsigned long index;
struct file *memfd;
 
-   if (atomic_read(&inode->i_count))
-   continue;
-
spin_lock(&inode->i_lock);
if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
spin_unlock(&inode->i_lock);
-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted

2023-04-04 Thread Kirill A. Shutemov
 return -EBADF;
> +
> + ret = -EINVAL;
> + if (!is_mount_root(f.file))
> + goto out;
> +
> + mnt = f.file->f_path.mnt;
> + if (!is_shmem_mount(mnt))
> + goto out;
> +
> + ret = file_permission(f.file, MAY_WRITE | MAY_EXEC);

Why MAY_EXEC?

> + if (ret)
> + goto out;
> +
> + ret = mnt_want_write(mnt);
> + if (unlikely(ret))
> + goto out;
> +
> + ret = restrictedmem_create(mnt);
> +
> + mnt_drop_write(mnt);
> +out:
> + fdput(f);
> +
> + return ret;
> +}

We need review from fs folks. Looks mostly sensible, but I have no
experience in fs.

> +
> +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
> +{
> + if (flags & ~RMFD_USERMNT)
> + return -EINVAL;
> +
> + if (flags == RMFD_USERMNT) {
> + if (mount_fd < 0)
> + return -EINVAL;
> +
> + return restrictedmem_create_on_user_mount(mount_fd);
> + } else {
> + return restrictedmem_create(NULL);
> + }

Maybe restructure this with a single restrictedmem_create() call?

struct vfsmount *mnt = NULL;

if (flags == RMFD_USERMNT) {
...
mnt = ...();
}

return restrictedmem_create(mnt);
> +}
> +
>  int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
>  struct restrictedmem_notifier *notifier, bool exclusive)
>  {
> --
> 2.40.0.348.gf938b09366-goog

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [RFC PATCH 1/2] mm: restrictedmem: Allow userspace to specify mount_path for memfd_restricted

2023-02-16 Thread Kirill A. Shutemov
On Thu, Feb 16, 2023 at 12:41:16AM +, Ackerley Tng wrote:
> By default, the backing shmem file for a restrictedmem fd is created
> on shmem's kernel space mount.
> 
> With this patch, an optional tmpfs mount can be specified, which will
> be used as the mountpoint for backing the shmem file associated with a
> restrictedmem fd.
> 
> This change is modeled after how sys_open() can create an unnamed
> temporary file in a given directory with O_TMPFILE.
> 
> This will help restrictedmem fds inherit the properties of the
> provided tmpfs mounts, for example, hugepage allocation hints, NUMA
> binding hints, etc.
> 
> Signed-off-by: Ackerley Tng 
> ---
>  include/linux/syscalls.h   |  2 +-
>  include/uapi/linux/restrictedmem.h |  8 
>  mm/restrictedmem.c | 63 +++---
>  3 files changed, 66 insertions(+), 7 deletions(-)
>  create mode 100644 include/uapi/linux/restrictedmem.h
> 
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index f9e9e0c820c5..4b8efe9a8680 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,7 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
>  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned 
> long len,
>   unsigned long home_node,
>   unsigned long flags);
> -asmlinkage long sys_memfd_restricted(unsigned int flags);
> +asmlinkage long sys_memfd_restricted(unsigned int flags, const char __user 
> *mount_path);
>  
>  /*
>   * Architecture-specific system calls

I'm not sure what the right practice is now: do we provide a string that
contains the mount path, or an fd that represents the filesystem (returned
from fsmount(2) or open_tree(2))?

An fd seems more flexible: it allows specifying unbound mounts.
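
As an illustration only, a hedged userspace sketch of the fd-based variant
(RMFD_USERMNT comes from a later revision of this series; the syscall usage
and the mount path below are made up for the example):

	/* detach a configured tmpfs subtree and hand it to memfd_restricted() */
	int mnt_fd = syscall(__NR_open_tree, AT_FDCWD, "/mnt/restricted-tmpfs",
			     OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
	int rfd = syscall(__NR_memfd_restricted, RMFD_USERMNT, mnt_fd);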

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

2023-01-25 Thread Kirill A. Shutemov
On Wed, Jan 25, 2023 at 12:20:26AM +, Sean Christopherson wrote:
> On Tue, Jan 24, 2023, Liam Merwick wrote:
> > On 14/01/2023 00:37, Sean Christopherson wrote:
> > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > This patch series implements KVM guest private memory for confidential
> > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > TDX-protected guest memory, machine check can happen which can further
> > > > crash the running host system, this is terrible for multi-tenant
> > > > configurations. The host accesses include those from KVM userspace like
> > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > via a fd-based approach, but it can never access the guest memory
> > > > content.
> > > > 
> > > > The patch series touches both core mm and KVM code. I appreciate
> > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > reviews are always welcome.
> > > >- 01: mm change, target for mm tree
> > > >- 02-09: KVM change, target for KVM tree
> > > 
> > > A version with all of my feedback, plus reworked versions of Vishal's 
> > > selftest,
> > > is available here:
> > > 
> > >g...@github.com:sean-jc/linux.git x86/upm_base_support
> > > 
> > > It compiles and passes the selftest, but it's otherwise barely tested.  
> > > There are
> > > a few todos (2 I think?) and many of the commits need changelogs, i.e. 
> > > it's still
> > > a WIP.
> > > 
> > 
> > When running LTP (https://github.com/linux-test-project/ltp) on the v10
> > bits (and also with Sean's branch above) I encounter the following NULL
> > pointer dereference with testcases/kernel/syscalls/madvise/madvise01
> > (100% reproducible).
> > 
> > It appears that in restrictedmem_error_page() inode->i_mapping->private_data
> > is NULL
> > in the list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list)
> > but I don't know why.
> 
> Kirill, can you take a look?  Or pass the buck to someone who can? :-)

The patch below should help.

diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index 15c52301eeb9..39ada985c7c0 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping)
 
spin_lock(&sb->s_inode_list_lock);
list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
-   struct restrictedmem *rm = inode->i_mapping->private_data;
struct restrictedmem_notifier *notifier;
-   struct file *memfd = rm->memfd;
+   struct restrictedmem *rm;
unsigned long index;
+   struct file *memfd;
 
-   if (memfd->f_mapping != mapping)
+   if (atomic_read(&inode->i_count))
continue;
 
+   spin_lock(&inode->i_lock);
+   if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
+   spin_unlock(&inode->i_lock);
+   continue;
+   }
+
+   rm = inode->i_mapping->private_data;
+   memfd = rm->memfd;
+
+   if (memfd->f_mapping != mapping) {
+   spin_unlock(>i_lock);
+   continue;
+   }
+   spin_unlock(&inode->i_lock);
+
xa_for_each_range(&rm->bindings, index, notifier, start, end)
notifier->ops->error(notifier, start, end);
break;
-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

2023-01-23 Thread Kirill A. Shutemov
t page *page)
+{
+   struct super_block *sb = restrictedmem_mnt->mnt_sb;
+   struct inode *inode, *next;
+   pgoff_t start, end;
+
+   start = page->index;
+   end = start + thp_nr_pages(page);
+
+   spin_lock(&sb->s_inode_list_lock);
+   list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+   struct restrictedmem *rm = inode->i_mapping->private_data;
+   struct restrictedmem_notifier *notifier;
+   struct file *memfd = rm->memfd;
+   unsigned long index;
+
+   if (memfd->f_mapping != mapping)
+   continue;
+
+   xa_for_each_range(&rm->bindings, index, notifier, start, end)
+   notifier->ops->error(notifier, start, end);
+   break;
+   }
+   spin_unlock(&sb->s_inode_list_lock);
+
+   return 0;
+}
+
+#ifdef CONFIG_MIGRATION
+static int restricted_folio(struct address_space *mapping, struct folio *dst,
+   struct folio *src, enum migrate_mode mode)
+{
+   return -EBUSY;
+}
+#endif
+
+static struct address_space_operations restricted_aops = {
+   .dirty_folio= noop_dirty_folio,
+   .error_remove_page = restricted_error_remove_page,
+#ifdef CONFIG_MIGRATION
+   .migrate_folio  = restricted_folio,
+#endif
+};
+
 SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
 {
struct file *file, *restricted_file;
@@ -209,6 +254,8 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
file->f_flags |= O_LARGEFILE;
 
+   file->f_mapping->a_ops = &restricted_aops;
+
restricted_file = restrictedmem_file_create(file);
if (IS_ERR(restricted_file)) {
err = PTR_ERR(restricted_file);
@@ -293,31 +340,3 @@ int restrictedmem_get_page(struct file *file, pgoff_t offset,
 }
 EXPORT_SYMBOL_GPL(restrictedmem_get_page);
 
-void restrictedmem_error_page(struct page *page, struct address_space *mapping)
-{
-   struct super_block *sb = restrictedmem_mnt->mnt_sb;
-   struct inode *inode, *next;
-   pgoff_t start, end;
-
-   if (!shmem_mapping(mapping))
-   return;
-
-   start = page->index;
-   end = start + thp_nr_pages(page);
-
-   spin_lock(&sb->s_inode_list_lock);
-   list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
-   struct restrictedmem *rm = inode->i_mapping->private_data;
-   struct restrictedmem_notifier *notifier;
-   struct file *memfd = rm->memfd;
-   unsigned long index;
-
-   if (memfd->f_mapping != mapping)
-   continue;
-
-   xa_for_each_range(&rm->bindings, index, notifier, start, end)
-   notifier->ops->error(notifier, start, end);
-   break;
-   }
-   spin_unlock(&sb->s_inode_list_lock);
-}
diff --git a/mm/shmem.c b/mm/shmem.c
index c1d8b8a1aa3b..3df4d95784b9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -231,7 +231,7 @@ static inline void shmem_inode_unacct_blocks(struct inode *inode, long pages)
 }
 
 static const struct super_operations shmem_ops;
-const struct address_space_operations shmem_aops;
+static const struct address_space_operations shmem_aops;
 static const struct file_operations shmem_file_operations;
 static const struct inode_operations shmem_inode_operations;
 static const struct inode_operations shmem_dir_inode_operations;
@@ -3894,7 +3894,7 @@ static int shmem_error_remove_page(struct address_space *mapping,
return 0;
 }
 
-const struct address_space_operations shmem_aops = {
+static const struct address_space_operations shmem_aops = {
.writepage  = shmem_writepage,
.dirty_folio= noop_dirty_folio,
 #ifdef CONFIG_TMPFS
@@ -3906,7 +3906,6 @@ const struct address_space_operations shmem_aops = {
 #endif
.error_remove_page = shmem_error_remove_page,
 };
-EXPORT_SYMBOL(shmem_aops);
 
 static const struct file_operations shmem_file_operations = {
.mmap   = shmem_mmap,
-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

2023-01-23 Thread Kirill A. Shutemov
On Mon, Jan 23, 2023 at 03:03:45PM +0100, Vlastimil Babka wrote:
> On 12/22/22 01:37, Huang, Kai wrote:
> >>> I argue that this page pinning (or page migration prevention) is not
> >>> tied to where the page comes from, instead related to how the page will
> >>> be used. Whether the page is restrictedmem backed or GUP() backed, once
> >>> it's used by current version of TDX then the page pinning is needed. So
> >>> such page migration prevention is really TDX thing, even not KVM generic
> >>> thing (that's why I think we don't need change the existing logic of
> >>> kvm_release_pfn_clean()). 
> >>>
> > This essentially boils down to who "owns" page migration handling, and 
> > sadly,
> > page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle 
> > page
> > migration by itself -- it's just a passive receiver.
> > 
> > For normal pages, page migration is totally done by the core-kernel (i.e. it
> > unmaps page from VMA, allocates a new page, and uses migrate_page() or
> > a_ops->migrate_page() to actually migrate the page).
> > In the sense of TDX, conceptually it should be done in the same way. The 
> > more
> > important thing is: yes KVM can use get_page() to prevent page migration, 
> > but
> > when KVM wants to support it, KVM cannot just remove get_page(), as the 
> > core-
> > kernel will still just do migrate_page() which won't work for TDX (given
> > restricted_memfd doesn't have a_ops->migrate_page() implemented).
> > 
> > So I think the restricted_memfd filesystem should own page migration 
> > handling,
> > (i.e. by implementing a_ops->migrate_page() to either just reject page 
> > migration
> > or somehow support it).
> 
> While this thread seems to be settled on refcounts already, just wanted
> to point out that it wouldn't be ideal to prevent migrations by
> a_ops->migrate_page() rejecting them. It would mean cputime wasted (i.e.
> by memory compaction) by isolating the pages for migration and then
> releasing them after the callback rejects it (at least we wouldn't waste
> time creating and undoing migration entries in the userspace page tables
> as there's no mmap). Elevated refcount on the other hand is detected
> very early in compaction so no isolation is attempted, so from that
> aspect it's optimal.

Hm. Do we need a new hook in a_ops to check whether the page is migratable
before going down the longer path to migrate_page()?

Or maybe add AS_UNMOVABLE?
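
A rough sketch of the AS_UNMOVABLE idea (neither the flag nor the helper below
exists at this point; the names are assumptions):

	/* mark the whole mapping as unmovable when the restricted memfd is set up */
	static inline void mapping_set_unmovable(struct address_space *mapping)
	{
		set_bit(AS_UNMOVABLE, &mapping->flags);
	}

	/* compaction/migration could then bail out before isolating the page */
	if (mapping && test_bit(AS_UNMOVABLE, &mapping->flags))
		goto isolate_fail;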

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

2023-01-16 Thread Kirill A. Shutemov
On Sat, Jan 14, 2023 at 12:37:59AM +, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > TDX-protected guest memory, machine check can happen which can further
> > crash the running host system, this is terrible for multi-tenant
> > configurations. The host accesses include those from KVM userspace like
> > QEMU. This series addresses KVM userspace induced crash by introducing
> > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > via a fd-based approach, but it can never access the guest memory
> > content.
> > 
> > The patch series touches both core mm and KVM code. I appreciate
> > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > reviews are always welcome.
> >   - 01: mm change, target for mm tree
> >   - 02-09: KVM change, target for KVM tree
> 
> A version with all of my feedback, plus reworked versions of Vishal's 
> selftest,
> is available here:
> 
>   g...@github.com:sean-jc/linux.git x86/upm_base_support
> 
> It compiles and passes the selftest, but it's otherwise barely tested.  There 
> are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's 
> still
> a WIP.
> 
> As for next steps, can you (handwaving all of the TDX folks) take a look at 
> what
> I pushed and see if there's anything horrifically broken, and that it still 
> works
> for TDX?

Minor build fix:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6eb5336ccc65..4a9e9fa2552a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7211,8 +7211,8 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
int level;
bool mixed;
 
-   lockdep_assert_held_write(kvm->mmu_lock);
-   lockdep_assert_held(kvm->slots_lock);
+   lockdep_assert_held_write(&kvm->mmu_lock);
+   lockdep_assert_held(&kvm->slots_lock);
 
/*
 * KVM x86 currently only supports KVM_MEMORY_ATTRIBUTE_PRIVATE, skip
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 467916943c73..4ef60ba7eb1d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2304,7 +2304,7 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
-   lockdep_assert_held(kvm->mmu_lock);
+   lockdep_assert_held(&kvm->mmu_lock);
 
return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
 }
-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

2022-12-02 Thread Kirill A . Shutemov
On Fri, Dec 02, 2022 at 02:49:09PM +0800, Chao Peng wrote:
> On Thu, Dec 01, 2022 at 06:16:46PM -0800, Vishal Annapurve wrote:
> > On Tue, Oct 25, 2022 at 8:18 AM Chao Peng  
> > wrote:
> > >
> ...
> > > +}
> > > +
> > > +SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> > > +{
> > 
> > Looking at the underlying shmem implementation, there seems to be no
> > way to enable transparent huge pages specifically for restricted memfd
> > files.
> > 
> > Michael discussed earlier about tweaking
> > /sys/kernel/mm/transparent_hugepage/shmem_enabled setting to allow
> > hugepages to be used while backing restricted memfd. Such a change
> > will affect the rest of the shmem usecases as well. Even setting the
> > shmem_enabled policy to "advise" wouldn't help unless file based
> > advise for hugepage allocation is implemented.
> 
> Had a look at fadvise() and looks it does not support HUGEPAGE for any
> filesystem yet.

Yes, I think fadvise() is the right direction here. The problem is similar
to NUMA policy, where the existing APIs are focused on virtual memory
addresses. We need to extend the ABI to take fd+offset as input instead.
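
A hedged illustration of what an fd+offset interface could look like from
userspace (POSIX_FADV_HUGEPAGE does not exist; it only shows the shape of the
API being suggested):

	/* hypothetical: request THP-backed allocation for a range of the memfd */
	posix_fadvise(memfd, 0, memfd_size, POSIX_FADV_HUGEPAGE);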

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

2022-11-29 Thread Kirill A. Shutemov
On Mon, Nov 28, 2022 at 06:06:32PM -0600, Michael Roth wrote:
> On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > From: "Kirill A. Shutemov" 
> > 
> 
> 
> 
> > +static struct file *restrictedmem_file_create(struct file *memfd)
> > +{
> > +   struct restrictedmem_data *data;
> > +   struct address_space *mapping;
> > +   struct inode *inode;
> > +   struct file *file;
> > +
> > +   data = kzalloc(sizeof(*data), GFP_KERNEL);
> > +   if (!data)
> > +   return ERR_PTR(-ENOMEM);
> > +
> > +   data->memfd = memfd;
> > +   mutex_init(&data->lock);
> > +   INIT_LIST_HEAD(&data->notifiers);
> > +
> > +   inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> > +   if (IS_ERR(inode)) {
> > +   kfree(data);
> > +   return ERR_CAST(inode);
> > +   }
> > +
> > +   inode->i_mode |= S_IFREG;
> > +   inode->i_op = &restrictedmem_iops;
> > +   inode->i_mapping->private_data = data;
> > +
> > +   file = alloc_file_pseudo(inode, restrictedmem_mnt,
> > +"restrictedmem", O_RDWR,
> > +&restrictedmem_fops);
> > +   if (IS_ERR(file)) {
> > +   iput(inode);
> > +   kfree(data);
> > +   return ERR_CAST(file);
> > +   }
> > +
> > +   file->f_flags |= O_LARGEFILE;
> > +
> > +   mapping = memfd->f_mapping;
> > +   mapping_set_unevictable(mapping);
> > +   mapping_set_gfp_mask(mapping,
> > +mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> 
> Is this supposed to prevent migration of pages being used for
> restrictedmem/shmem backend?

Yes, my bad. I expected it to prevent migration, but it is not true.

Looks like we need to bump the refcount in restrictedmem_get_page() and drop
it again when KVM no longer uses the page.

Chao, could you adjust it?
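
A minimal sketch of the pinning idea (based on the restrictedmem_get_page()
shape quoted elsewhere in this series; the body below is an assumption, not
the posted code):

	/* keep the reference taken by shmem_getpage() as the "pin" that prevents
	 * migration; KVM does put_page() when it unmaps the page */
	ret = shmem_getpage(file_inode(data->memfd), offset, &page, SGP_WRITE);
	if (ret)
		return ret;
	unlock_page(page);
	*pagep = page;
	if (order)
		*order = thp_order(compound_head(page));
	return 0;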

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

2022-11-15 Thread Kirill A. Shutemov
On Wed, Nov 09, 2022 at 06:54:04PM +0300, Kirill A. Shutemov wrote:
> On Mon, Nov 07, 2022 at 04:41:41PM -0800, Isaku Yamahata wrote:
> > On Thu, Nov 03, 2022 at 05:43:52PM +0530,
> > Vishal Annapurve  wrote:
> > 
> > > On Tue, Oct 25, 2022 at 8:48 PM Chao Peng  
> > > wrote:
> > > >
> > > > This patch series implements KVM guest private memory for confidential
> > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > TDX-protected guest memory, machine check can happen which can further
> > > > crash the running host system, this is terrible for multi-tenant
> > > > configurations. The host accesses include those from KVM userspace like
> > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > via a fd-based approach, but it can never access the guest memory
> > > > content.
> > > >
> > > > The patch series touches both core mm and KVM code. I appreciate
> > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > reviews are always welcome.
> > > >   - 01: mm change, target for mm tree
> > > >   - 02-08: KVM change, target for KVM tree
> > > >
> > > > Given KVM is the only current user for the mm part, I have chatted with
> > > > Paolo and he is OK to merge the mm change through KVM tree, but
> > > > reviewed-by/acked-by is still expected from the mm people.
> > > >
> > > > The patches have been verified in Intel TDX environment, but Vishal has
> > > > done an excellent work on the selftests[4] which are dedicated for this
> > > > series, making it possible to test this series without innovative
> > > > hardware and fancy steps of building a VM environment. See Test section
> > > > below for more info.
> > > >
> > > >
> > > > Introduction
> > > > 
> > > > KVM userspace being able to crash the host is horrible. Under current
> > > > KVM architecture, all guest memory is inherently accessible from KVM
> > > > userspace and is exposed to the mentioned crash issue. The goal of this
> > > > series is to provide a solution to align mm and KVM, on a userspace
> > > > inaccessible approach of exposing guest memory.
> > > >
> > > > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> > > > virtual address (hva) from core mm page table (e.g. x86 userspace page
> > > > table). This requires guest memory being mmaped into KVM userspace, but
> > > > this is also the source where the mentioned crash issue can happen. In
> > > > theory, apart from those 'shared' memory for device emulation etc, guest
> > > > memory doesn't have to be mmaped into KVM userspace.
> > > >
> > > > This series introduces fd-based guest memory which will not be mmaped
> > > > into KVM userspace. KVM populates secondary page table by using a
> > > 
> > > With no mappings in place for userspace VMM, IIUC, looks like the host
> > > kernel will not be able to find the culprit userspace process in case
> > > of Machine check error on guest private memory. As implemented in
> > > hwpoison_user_mappings, host kernel tries to look at the processes
> > > which have mapped the pfns with hardware error.
> > > 
> > > Is there a modification needed in mce handling logic of the host
> > > kernel to immediately send a signal to the vcpu thread accessing
> > > faulting pfn backing guest private memory?
> > 
> > mce_register_decode_chain() can be used.  MCE physical address(p->mce_addr)
> > includes host key id in addition to real physical address.  By searching 
> > used
> > hkid by KVM, we can determine if the page is assigned to guest TD or not. If
> > yes, send SIGBUS.
> > 
> > kvm_machine_check() can be enhanced for KVM specific use.  This is before
> > memory_failure() is called, though.
> > 
> > any other ideas?
> 
> That's too KVM-centric. It will not work for other possible user of
> restricted memfd.
> 
> I tried to find a way to get it right: we need to get restricted memfd
> code info about corrupted page so it can invalidate its users. On the next
> request of the page the user will see an error. In case of KVM, the error
> will likely escalate to SIGBUS.
> 
> The problem is that core-mm code that handle

Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

2022-11-14 Thread Kirill A. Shutemov
On Mon, Nov 14, 2022 at 03:02:37PM +0100, Vlastimil Babka wrote:
> On 11/1/22 16:19, Michael Roth wrote:
> > On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> >> > 
> >> >   1) restoring kernel directmap:
> >> > 
> >> >  Currently SNP (and I believe TDX) need to either split or remove 
> >> > kernel
> >> >  direct mappings for restricted PFNs, since there is no guarantee 
> >> > that
> >> >  other PFNs within a 2MB range won't be used for non-restricted
> >> >  (which will cause an RMP #PF in the case of SNP since the 2MB
> >> >  mapping overlaps with guest-owned pages)
> >> 
> >> Has the splitting and restoring been a well-discussed direction? I'm
> >> just curious whether there is other options to solve this issue.
> > 
> > For SNP it's been discussed for quite some time, and either splitting or
> > removing private entries from directmap are the well-discussed way I'm
> > aware of to avoid RMP violations due to some other kernel process using
> > a 2MB mapping to access shared memory if there are private pages that
> > happen to be within that range.
> > 
> > In both cases the issue of how to restore directmap as 2M becomes a
> > problem.
> > 
> > I was also under the impression TDX had similar requirements. If so,
> > do you know what the plan is for handling this for TDX?
> > 
> > There are also 2 potential alternatives I'm aware of, but these haven't
> > been discussed in much detail AFAIK:
> > 
> > a) Ensure confidential guests are backed by 2MB pages. shmem has a way to
> >request 2MB THP pages, but I'm not sure how reliably we can guarantee
> >that enough THPs are available, so if we went that route we'd probably
> >be better off requiring the use of hugetlbfs as the backing store. But
> >obviously that's a bit limiting and it would be nice to have the option
> >of using normal pages as well. One nice thing with invalidation
> >scheme proposed here is that this would "Just Work" if implement
> >hugetlbfs support, so an admin that doesn't want any directmap
> >splitting has this option available, otherwise it's done as a
> >best-effort.
> > 
> > b) Implement general support for restoring directmap as 2M even when
> >subpages might be in use by other kernel threads. This would be the
> >most flexible approach since it requires no special handling during
> >invalidations, but I think it's only possible if all the CPA
> >attributes for the 2M range are the same at the time the mapping is
> >restored/unsplit, so some potential locking issues there and still
> >chance for splitting directmap over time.
> 
> I've been hoping that
> 
> c) using a mechanism such as [1] [2] where the goal is to group together
> these small allocations that need to increase directmap granularity so
> maximum number of large mappings are preserved.

As I mentioned in the other thread, the restricted memfd can be backed by
secretmem instead of a plain memfd. Secretmem already handles the direct map
with care.

But I don't think it has to be part of the initial restricted memfd
implementation. It is an SEV-specific requirement, and AMD folks can extend
the implementation as needed later.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

2022-11-09 Thread Kirill A. Shutemov
On Mon, Nov 07, 2022 at 04:41:41PM -0800, Isaku Yamahata wrote:
> On Thu, Nov 03, 2022 at 05:43:52PM +0530,
> Vishal Annapurve  wrote:
> 
> > On Tue, Oct 25, 2022 at 8:48 PM Chao Peng  
> > wrote:
> > >
> > > This patch series implements KVM guest private memory for confidential
> > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > TDX-protected guest memory, machine check can happen which can further
> > > crash the running host system, this is terrible for multi-tenant
> > > configurations. The host accesses include those from KVM userspace like
> > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > via a fd-based approach, but it can never access the guest memory
> > > content.
> > >
> > > The patch series touches both core mm and KVM code. I appreciate
> > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > reviews are always welcome.
> > >   - 01: mm change, target for mm tree
> > >   - 02-08: KVM change, target for KVM tree
> > >
> > > Given KVM is the only current user for the mm part, I have chatted with
> > > Paolo and he is OK to merge the mm change through KVM tree, but
> > > reviewed-by/acked-by is still expected from the mm people.
> > >
> > > The patches have been verified in Intel TDX environment, but Vishal has
> > > done an excellent work on the selftests[4] which are dedicated for this
> > > series, making it possible to test this series without innovative
> > > hardware and fancy steps of building a VM environment. See Test section
> > > below for more info.
> > >
> > >
> > > Introduction
> > > 
> > > KVM userspace being able to crash the host is horrible. Under current
> > > KVM architecture, all guest memory is inherently accessible from KVM
> > > userspace and is exposed to the mentioned crash issue. The goal of this
> > > series is to provide a solution to align mm and KVM, on a userspace
> > > inaccessible approach of exposing guest memory.
> > >
> > > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> > > virtual address (hva) from core mm page table (e.g. x86 userspace page
> > > table). This requires guest memory being mmaped into KVM userspace, but
> > > this is also the source where the mentioned crash issue can happen. In
> > > theory, apart from those 'shared' memory for device emulation etc, guest
> > > memory doesn't have to be mmaped into KVM userspace.
> > >
> > > This series introduces fd-based guest memory which will not be mmaped
> > > into KVM userspace. KVM populates secondary page table by using a
> > 
> > With no mappings in place for userspace VMM, IIUC, looks like the host
> > kernel will not be able to find the culprit userspace process in case
> > of Machine check error on guest private memory. As implemented in
> > hwpoison_user_mappings, host kernel tries to look at the processes
> > which have mapped the pfns with hardware error.
> > 
> > Is there a modification needed in mce handling logic of the host
> > kernel to immediately send a signal to the vcpu thread accessing
> > faulting pfn backing guest private memory?
> 
> mce_register_decode_chain() can be used.  MCE physical address(p->mce_addr)
> includes host key id in addition to real physical address.  By searching used
> hkid by KVM, we can determine if the page is assigned to guest TD or not. If
> yes, send SIGBUS.
> 
> kvm_machine_check() can be enhanced for KVM specific use.  This is before
> memory_failure() is called, though.
> 
> any other ideas?

That's too KVM-centric. It will not work for other possible users of
restricted memfd.

I tried to find a way to get it right: we need to give the restricted memfd
code info about the corrupted page so it can invalidate its users. On the next
request of the page, the user will see an error. In the case of KVM, the error
will likely escalate to SIGBUS.

The problem is that core-mm code that handles memory failure knows nothing
about restricted memfd. It only sees that the page belongs to a normal
memfd.

AFAICS, there's no way to intercept it at the shim level. The shmem code has
to be patched: shmem_error_remove_page() has to call into the restricted
memfd code.

Hugh, are you okay with this? Or maybe you have a better idea?
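
A minimal sketch of that plumbing (restrictedmem_error_page() is the name used
in a later revision quoted elsewhere in this digest; treat the hunk below as
an assumption, not the eventual patch):

	static int shmem_error_remove_page(struct address_space *mapping,
					   struct page *page)
	{
		/* new: let the restricted memfd shim notify its bound users (KVM) */
		restrictedmem_error_page(page, mapping);

		/* existing shmem handling of the poisoned page stays as-is */
		return 0;
	}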

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

2022-11-03 Thread Kirill A. Shutemov
On Wed, Nov 02, 2022 at 05:07:00PM -0500, Michael Roth wrote:
> On Thu, Nov 03, 2022 at 12:14:04AM +0300, Kirill A. Shutemov wrote:
> > On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> > > 
> > > In v8 there was some discussion about potentially passing the page/folio
> > > and order as part of the invalidation callback, I ended up needing
> > > something similar for SEV-SNP, and think it might make sense for other
> > > platforms. This main reasoning is:
> > > 
> > >   1) restoring kernel directmap:
> > > 
> > >  Currently SNP (and I believe TDX) need to either split or remove 
> > > kernel
> > >  direct mappings for restricted PFNs, since there is no guarantee that
> > >  other PFNs within a 2MB range won't be used for non-restricted
> > >  (which will cause an RMP #PF in the case of SNP since the 2MB
> > >  mapping overlaps with guest-owned pages)
> > 
> > That's news to me. Where does the restriction for SNP come from?
> 
> Sorry, missed your first question.
> 
> For SNP at least, the restriction is documented in APM Volume 2, Section
> 15.36.10, First row of Table 15-36 (preceeding paragraph has more
> context). I forgot to mention this is only pertaining to writes by the
> host to 2MB pages that contain guest-owned subpages, for reads it's
> not an issue, but I think the implementation requirements end up being
> the same either way:
> 
>   https://www.amd.com/system/files/TechDocs/24593.pdf

Looks like you wanted the restricted memfd to be backed by secretmem rather
than a normal memfd. It would help preserve the direct map.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

2022-11-02 Thread Kirill A. Shutemov
On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> 
> In v8 there was some discussion about potentially passing the page/folio
> and order as part of the invalidation callback, I ended up needing
> something similar for SEV-SNP, and think it might make sense for other
> platforms. This main reasoning is:
> 
>   1) restoring kernel directmap:
> 
>  Currently SNP (and I believe TDX) need to either split or remove kernel
>  direct mappings for restricted PFNs, since there is no guarantee that
>  other PFNs within a 2MB range won't be used for non-restricted
>  (which will cause an RMP #PF in the case of SNP since the 2MB
>  mapping overlaps with guest-owned pages)

That's news to me. Where does the restriction for SNP come from? There's no
such limitation on the TDX side, AFAIK.

Could you point me to the relevant documentation, if there is any?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-24 Thread Kirill A . Shutemov
On Fri, Oct 21, 2022 at 04:18:14PM +, Sean Christopherson wrote:
> On Fri, Oct 21, 2022, Chao Peng wrote:
> > > 
> > > In the context of userspace inaccessible memfd, what would be a
> > > suggested way to enforce NUMA memory policy for physical memory
> > > allocation? mbind[1] won't work here in absence of virtual address
> > > range.
> > 
> > How about set_mempolicy():
> > https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html
> 
> Andy Lutomirski brought this up in an off-list discussion way back when the 
> whole
> private-fd thing was first being proposed.
> 
>   : The current Linux NUMA APIs (mbind, move_pages) work on virtual 
> addresses.  If
>   : we want to support them for TDX private memory, we either need TDX private
>   : memory to have an HVA or we need file-based equivalents. Arguably we 
> should add
>   : fmove_pages and fbind syscalls anyway, since the current API is quite 
> awkward
>   : even for tools like numactl.

Yeah, we definitely have gaps in the API wrt NUMA, but I don't think it has
to be addressed in the initial submission.

BTW, it is not a regression compared to old KVM slots if the memory is backed
by a memfd or another file:

MBIND(2)
   The  specified policy will be ignored for any MAP_SHARED mappings in the
   specified memory range.  Rather the pages will be allocated according to
   the  memory  policy  of the thread that caused the page to be allocated.
   Again, this may not be the thread that called mbind().

It is not clear how to define fbind(2) semantics, considering that multiple
processes may compete for the same region of page cache.

Should it be per-inode or per-fd? Or maybe per-range in inode/fd?

fmove_pages(2) should be relatively straightforward, since it is best-effort
and does not guarantee that the page will not be moved somewhere else right
after the syscall returns.
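
Hypothetical prototypes, purely to illustrate the shape of such an ABI
(neither syscall exists):

	/* mbind(2)/move_pages(2) equivalents keyed by fd+offset instead of
	 * virtual addresses */
	long fbind(int fd, loff_t offset, size_t len, int mode,
		   const unsigned long *nodemask, unsigned long maxnode,
		   unsigned int flags);

	long fmove_pages(int fd, loff_t offset, unsigned long count,
			 const int *nodes, int *status, int flags);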

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-19 Thread Kirill A . Shutemov
On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov
>  wrote:
> >
> > On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > > From: "Kirill A. Shutemov" 
> > > > > >
> > > > > > KVM can use memfd-provided memory for guest memory. For normal 
> > > > > > userspace
> > > > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into 
> > > > > > its
> > > > > > virtual address space and then tells KVM to use the virtual address 
> > > > > > to
> > > > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > > >
> > > > > > With confidential computing technologies like Intel TDX, the
> > > > > > memfd-provided memory may be encrypted with special key for special
> > > > > > software domain (e.g. KVM guest) and is not expected to be directly
> > > > > > accessed by userspace. Precisely, userspace access to such encrypted
> > > > > > memory may lead to host crash so it should be prevented.
> > > > > >
> > > > > > This patch introduces userspace inaccessible memfd (created with
> > > > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > > > > in-kernel interface so KVM can directly interact with core-mm 
> > > > > > without
> > > > > > the need to map the memory into KVM userspace.
> > > > > >
> > > > > > It provides semantics required for KVM guest private(encrypted) 
> > > > > > memory
> > > > > > support that a file descriptor with this flag set is going to be 
> > > > > > used as
> > > > > > the source of guest memory in confidential computing environments 
> > > > > > such
> > > > > > as Intel TDX/AMD SEV.
> > > > > >
> > > > > > KVM userspace is still in charge of the lifecycle of the memfd. It
> > > > > > should pass the opened fd to KVM. KVM uses the kernel APIs newly 
> > > > > > added
> > > > > > in this patch to obtain the physical memory address and then 
> > > > > > populate
> > > > > > the secondary page table entries.
> > > > > >
> > > > > > The userspace inaccessible memfd can be fallocate-ed and 
> > > > > > hole-punched
> > > > > > from userspace. When hole-punching happens, KVM can get notified 
> > > > > > through
> > > > > > inaccessible_notifier it then gets chance to remove any mapped 
> > > > > > entries
> > > > > > of the range in the secondary page tables.
> > > > > >
> > > > > > The userspace inaccessible memfd itself is implemented as a shim 
> > > > > > layer
> > > > > > on top of real memory file systems like tmpfs/hugetlbfs but this 
> > > > > > patch
> > > > > > only implemented tmpfs. The allocated memory is currently marked as
> > > > > > unmovable and unevictable, this is required for current confidential
> > > > > > usage. But in future this might be changed.
> > > > > >
> > > > > > Signed-off-by: Kirill A. Shutemov 
> > > > > > Signed-off-by: Chao Peng 
> > > > > > ---
> > > > >
> > > > > ...
> > > > >
> > > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > > +  loff_t offset, loff_t len)
> > > > > > +{
> > > > > > +   struct inaccessible_data *data = 
> > > > > > file->f_mapping->private_data;
> > > > > > +   struct file *memfd = data->memfd;
> > > > > > +   int ret;
> > > > > > +
> > > > > > +   if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > > +   if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> 

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-17 Thread Kirill A . Shutemov
On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > On 9/15/22 16:29, Chao Peng wrote:
> > > > From: "Kirill A. Shutemov" 
> > > > 
> > > > KVM can use memfd-provided memory for guest memory. For normal userspace
> > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > > virtual address space and then tells KVM to use the virtual address to
> > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > 
> > > > With confidential computing technologies like Intel TDX, the
> > > > memfd-provided memory may be encrypted with special key for special
> > > > software domain (e.g. KVM guest) and is not expected to be directly
> > > > accessed by userspace. Precisely, userspace access to such encrypted
> > > > memory may lead to host crash so it should be prevented.
> > > > 
> > > > This patch introduces userspace inaccessible memfd (created with
> > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > > in-kernel interface so KVM can directly interact with core-mm without
> > > > the need to map the memory into KVM userspace.
> > > > 
> > > > It provides semantics required for KVM guest private(encrypted) memory
> > > > support that a file descriptor with this flag set is going to be used as
> > > > the source of guest memory in confidential computing environments such
> > > > as Intel TDX/AMD SEV.
> > > > 
> > > > KVM userspace is still in charge of the lifecycle of the memfd. It
> > > > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > > > in this patch to obtain the physical memory address and then populate
> > > > the secondary page table entries.
> > > > 
> > > > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > > > from userspace. When hole-punching happens, KVM can get notified through
> > > > inaccessible_notifier it then gets chance to remove any mapped entries
> > > > of the range in the secondary page tables.
> > > > 
> > > > The userspace inaccessible memfd itself is implemented as a shim layer
> > > > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > > > only implemented tmpfs. The allocated memory is currently marked as
> > > > unmovable and unevictable, this is required for current confidential
> > > > usage. But in future this might be changed.
> > > > 
> > > > Signed-off-by: Kirill A. Shutemov 
> > > > Signed-off-by: Chao Peng 
> > > > ---
> > > 
> > > ...
> > > 
> > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > +  loff_t offset, loff_t len)
> > > > +{
> > > > +   struct inaccessible_data *data = file->f_mapping->private_data;
> > > > +   struct file *memfd = data->memfd;
> > > > +   int ret;
> > > > +
> > > > +   if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > +   if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > +   return -EINVAL;
> > > > +   }
> > > > +
> > > > +   ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > +   inaccessible_notifier_invalidate(data, offset, offset + len);
> > > 
> > > Wonder if invalidate should precede the actual hole punch, otherwise we 
> > > open
> > > a window where the page tables point to memory no longer valid?
> > 
> > Yes, you are right. Thanks for catching this.
> 
> I also noticed this. But then thought the memory would be anyways zeroed
> (hole punched) before this call?

Hole punching can free pages, given that offset/len covers a full page.
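
A hedged sketch of the reordering (based on the quoted inaccessible_fallocate();
not the posted fix):

	/* invalidate first, so no stale secondary page table entry points at
	 * pages the hole punch below may free */
	if (mode & FALLOC_FL_PUNCH_HOLE)
		inaccessible_notifier_invalidate(data, offset, offset + len);

	return memfd->f_op->fallocate(memfd, mode, offset, len);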

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-17 Thread Kirill A . Shutemov
On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> On 9/15/22 16:29, Chao Peng wrote:
> > From: "Kirill A. Shutemov" 
> > 
> > KVM can use memfd-provided memory for guest memory. For normal userspace
> > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > virtual address space and then tells KVM to use the virtual address to
> > setup the mapping in the secondary page table (e.g. EPT).
> > 
> > With confidential computing technologies like Intel TDX, the
> > memfd-provided memory may be encrypted with special key for special
> > software domain (e.g. KVM guest) and is not expected to be directly
> > accessed by userspace. Precisely, userspace access to such encrypted
> > memory may lead to host crash so it should be prevented.
> > 
> > This patch introduces userspace inaccessible memfd (created with
> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > in-kernel interface so KVM can directly interact with core-mm without
> > the need to map the memory into KVM userspace.
> > 
> > It provides semantics required for KVM guest private(encrypted) memory
> > support that a file descriptor with this flag set is going to be used as
> > the source of guest memory in confidential computing environments such
> > as Intel TDX/AMD SEV.
> > 
> > KVM userspace is still in charge of the lifecycle of the memfd. It
> > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > in this patch to obtain the physical memory address and then populate
> > the secondary page table entries.
> > 
> > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > from userspace. When hole-punching happens, KVM can get notified through
> > inaccessible_notifier it then gets chance to remove any mapped entries
> > of the range in the secondary page tables.
> > 
> > The userspace inaccessible memfd itself is implemented as a shim layer
> > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > only implemented tmpfs. The allocated memory is currently marked as
> > unmovable and unevictable, this is required for current confidential
> > usage. But in future this might be changed.
> > 
> > Signed-off-by: Kirill A. Shutemov 
> > Signed-off-by: Chao Peng 
> > ---
> 
> ...
> 
> > +static long inaccessible_fallocate(struct file *file, int mode,
> > +  loff_t offset, loff_t len)
> > +{
> > +   struct inaccessible_data *data = file->f_mapping->private_data;
> > +   struct file *memfd = data->memfd;
> > +   int ret;
> > +
> > +   if (mode & FALLOC_FL_PUNCH_HOLE) {
> > +   if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > +   return -EINVAL;
> > +   }
> > +
> > +   ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > +   inaccessible_notifier_invalidate(data, offset, offset + len);
> 
> Wonder if invalidate should precede the actual hole punch, otherwise we open
> a window where the page tables point to memory no longer valid?

Yes, you are right. Thanks for catching this.

> > +   return ret;
> > +}
> > +
> 
> ...
> 
> > +
> > +static struct file_system_type inaccessible_fs = {
> > +   .owner  = THIS_MODULE,
> > +   .name   = "[inaccessible]",
> 
> Dunno where exactly is this name visible, but shouldn't it better be
> "[memfd:inaccessible]"?

Maybe. And skip brackets.

> 
> > +   .init_fs_context = inaccessible_init_fs_context,
> > +   .kill_sb= kill_anon_super,
> > +};
> > +
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-06 Thread Kirill A. Shutemov
On Thu, Oct 06, 2022 at 09:50:28AM +0100, Fuad Tabba wrote:
> Hi,
> 
> <...>
> 
> 
> > diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> > new file mode 100644
> > index ..2d33cbdd9282
> > --- /dev/null
> > +++ b/mm/memfd_inaccessible.c
> 
> <...>
> 
> > +struct file *memfd_mkinaccessible(struct file *memfd)
> > +{
> > +   struct inaccessible_data *data;
> > +   struct address_space *mapping;
> > +   struct inode *inode;
> > +   struct file *file;
> > +
> > +   data = kzalloc(sizeof(*data), GFP_KERNEL);
> > +   if (!data)
> > +   return ERR_PTR(-ENOMEM);
> > +
> > +   data->memfd = memfd;
> > +   mutex_init(&data->lock);
> > +   INIT_LIST_HEAD(&data->notifiers);
> > +
> > +   inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
> > +   if (IS_ERR(inode)) {
> > +   kfree(data);
> > +   return ERR_CAST(inode);
> > +   }
> > +
> > +   inode->i_mode |= S_IFREG;
> > +   inode->i_op = &inaccessible_iops;
> > +   inode->i_mapping->private_data = data;
> > +
> > +   file = alloc_file_pseudo(inode, inaccessible_mnt,
> > +"[memfd:inaccessible]", O_RDWR,
> > +&inaccessible_fops);
> > +   if (IS_ERR(file)) {
> > +   iput(inode);
> > +   kfree(data);
> 
> I think this might be missing a return at this point.

Good catch! Thanks!

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-03 Thread Kirill A. Shutemov
On Mon, Oct 03, 2022 at 08:33:13AM +0100, Fuad Tabba wrote:
> > I think it is "don't do that" category. inaccessible_register_notifier()
> > caller has to know what file it operates on, no?
> 
> The thing is, you could oops the kernel from userspace. For that, all
> you have to do is a memfd_create without the MFD_INACCESSIBLE,
> followed by a KVM_SET_USER_MEMORY_REGION using that as the private_fd.
> I ran into this using my port of this patch series to arm64.

My point is that it has to be handled at a different level. KVM has to
reject private_fd if it is not inaccessible. It should be trivial by
checking file->f_inode->i_sb->s_magic.
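
A minimal sketch of that check on the KVM side (INACCESSIBLE_MAGIC is an
assumed name for whatever magic the inaccessible filesystem registers):

	/* reject fds that are not backed by the inaccessible memfd filesystem */
	if (file->f_inode->i_sb->s_magic != INACCESSIBLE_MAGIC)
		return -EINVAL;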

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-30 Thread Kirill A . Shutemov
On Fri, Sep 30, 2022 at 05:14:00PM +0100, Fuad Tabba wrote:
> Hi,
> 
> <...>
> 
> > diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> > new file mode 100644
> > index ..2d33cbdd9282
> > --- /dev/null
> > +++ b/mm/memfd_inaccessible.c
> > @@ -0,0 +1,219 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "linux/sbitmap.h"
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +struct inaccessible_data {
> > +   struct mutex lock;
> > +   struct file *memfd;
> > +   struct list_head notifiers;
> > +};
> > +
> > +static void inaccessible_notifier_invalidate(struct inaccessible_data 
> > *data,
> > +pgoff_t start, pgoff_t end)
> > +{
> > +   struct inaccessible_notifier *notifier;
> > +
> > +   mutex_lock(&data->lock);
> > +   list_for_each_entry(notifier, &data->notifiers, list) {
> > +   notifier->ops->invalidate(notifier, start, end);
> > +   }
> > +   mutex_unlock(&data->lock);
> > +}
> > +
> > +static int inaccessible_release(struct inode *inode, struct file *file)
> > +{
> > +   struct inaccessible_data *data = inode->i_mapping->private_data;
> > +
> > +   fput(data->memfd);
> > +   kfree(data);
> > +   return 0;
> > +}
> > +
> > +static long inaccessible_fallocate(struct file *file, int mode,
> > +  loff_t offset, loff_t len)
> > +{
> > +   struct inaccessible_data *data = file->f_mapping->private_data;
> > +   struct file *memfd = data->memfd;
> > +   int ret;
> > +
> > +   if (mode & FALLOC_FL_PUNCH_HOLE) {
> > +   if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > +   return -EINVAL;
> > +   }
> > +
> > +   ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> 
> I think that shmem_file_operations.fallocate is only set if
> CONFIG_TMPFS is enabled (shmem.c). Should there be a check at
> initialization that fallocate is set, or maybe a config dependency, or
> can we count on it always being enabled?

It is already there:

config MEMFD_CREATE
def_bool TMPFS || HUGETLBFS

And we reject inaccessible memfd_create() for HUGETLBFS.

But if we go with a separate syscall, yes, we need the dependency.

> > +   inaccessible_notifier_invalidate(data, offset, offset + len);
> > +   return ret;
> > +}
> > +
> 
> <...>
> 
> > +void inaccessible_register_notifier(struct file *file,
> > +   struct inaccessible_notifier *notifier)
> > +{
> > +   struct inaccessible_data *data = file->f_mapping->private_data;
> > +
> > +   mutex_lock(&data->lock);
> > +   list_add(&notifier->list, &data->notifiers);
> > +   mutex_unlock(&data->lock);
> > +}
> > +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
> 
> If the memfd wasn't marked as inaccessible, or more generally
> speaking, if the file isn't a memfd_inaccessible file, this ends up
> accessing an uninitialized pointer for the notifier list. Should there
> be a check for that here, and have this function return an error if
> that's not the case?

I think it is "don't do that" category. inaccessible_register_notifier()
caller has to know what file it operates on, no?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-28 Thread Kirill A. Shutemov
On Tue, Sep 27, 2022 at 11:23:24PM +, Sean Christopherson wrote:
> On Mon, Sep 26, 2022, David Hildenbrand wrote:
> > On 26.09.22 16:48, Kirill A. Shutemov wrote:
> > > On Mon, Sep 26, 2022 at 12:35:34PM +0200, David Hildenbrand wrote:
> > > > When using DAX, what happens with the shared <->private conversion? 
> > > > Which
> > > > "type" is supposed to use dax, which not?
> > > > 
> > > > In other word, I'm missing too many details on the bigger picture of how
> > > > this would work at all to see why it makes sense right now to prepare 
> > > > for
> > > > that.
> > > 
> > > IIUC, KVM doesn't really care about pages or folios. They need PFN to
> > > populate SEPT. Returning page/folio would make KVM do additional steps to
> > > extract PFN and one more place to have a bug.
> > 
> > Fair enough. Smells KVM specific, though.
> 
> TL;DR: I'm good with either approach, though providing a "struct page" might 
> avoid
>refactoring the API in the nearish future.
> 
> Playing devil's advocate for a second, the counter argument is that KVM is the
> only user for the foreseeable future.
> 
> That said, it might make sense to return a "struct page" from the core API and
> force KVM to do page_to_pfn().  KVM already does that for HVA-based memory, so
> it's not exactly new code.

Core MM tries to move away from struct page in favour of struct folio. We
can make the interface return a folio.

But it would require more work on KVM side.

folio_pfn(folio) + offset % folio_nr_pages(folio) would give you the
base-pagesize PFN for a given offset. I guess it is not too hard.
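Sketch of the KVM-side helper (the name is illustrative, not part of the
series):

static unsigned long folio_offset_to_pfn(struct folio *folio, pgoff_t offset)
{
	/* base-pagesize PFN for the given offset within the folio */
	return folio_pfn(folio) + offset % folio_nr_pages(folio);
}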

It also gives KVM the capability to populate multiple EPT entries for a
non-zero-order folio and save a few cycles.

Does it work for you?

> More importantly, KVM may actually need/want the "struct page" in the 
> not-too-distant
> future to support mapping non-refcounted "struct page" memory into the guest. 
>  The
> ChromeOS folks have a use case involving virtio-gpu blobs where KVM can get 
> handed a
> "struct page" that _isn't_ refcounted[*].  Once the lack of mmu_notifier 
> integration
> is fixed, the remaining issue is that KVM doesn't currently have a way to 
> determine
> whether or not it holds a reference to the page.  Instead, KVM assumes that 
> if the
> page is "normal", it's refcounted, e.g. see kvm_release_pfn_clean().
> 
> KVM's current workaround for this is to refuse to map these pages into the 
> guest,
> i.e. KVM simply forces its assumption that normal pages are refcounted to be 
> true.
> To remove that workaround, the likely solution will be to pass around a tuple 
> of
> page+pfn, where "page" is non-NULL if the pfn is a refcounted "struct page".
> 
> At that point, getting handed a "struct page" from the core API would be a 
> good
> thing as KVM wouldn't need to probe the PFN to determine whether or not it's a
> refcounted page.
> 
> Note, I still want the order to be provided by the API so that KVM doesn't 
> need
> to run through a bunch of helpers to try and figure out the allowed mapping 
> size.
> 
> [*] 
> https://lore.kernel.org/all/CAD=HUj736L5oxkzeL2JoPV8g1S6Rugy_TquW=prt73ymfzp...@mail.gmail.com

These non-refcounted "struct page" confuses me.

IIUC (probably not), the idea is to share a buffer between host and guest
and avoid double buffering in page cache on the guest ("guest shadow
buffer" means page cache, right?). Don't we already have DAX interfaces to
bypass guest page cache?

And do you think it would need to be handled at the inaccessible API level,
or is it a KVM-only thing that uses the inaccessible API for some use-cases?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-26 Thread Kirill A. Shutemov
On Mon, Sep 26, 2022 at 12:35:34PM +0200, David Hildenbrand wrote:
> On 23.09.22 02:58, Kirill A . Shutemov wrote:
> > On Mon, Sep 19, 2022 at 11:12:46AM +0200, David Hildenbrand wrote:
> > > > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > > > index 6325d1d0e90f..9d066be3d7e8 100644
> > > > --- a/include/uapi/linux/magic.h
> > > > +++ b/include/uapi/linux/magic.h
> > > > @@ -101,5 +101,6 @@
> > > >#define DMA_BUF_MAGIC0x444d4142  /* "DMAB" */
> > > >#define DEVMEM_MAGIC 0x454d444d  /* "DMEM" */
> > > >#define SECRETMEM_MAGIC  0x5345434d  /* "SECM" */
> > > > +#define INACCESSIBLE_MAGIC 0x494e4143  /* "INAC" */
> > > 
> > > 
> > > [...]
> > > 
> > > > +
> > > > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > > > +int *order)
> > > > +{
> > > > +   struct inaccessible_data *data = file->f_mapping->private_data;
> > > > +   struct file *memfd = data->memfd;
> > > > +   struct page *page;
> > > > +   int ret;
> > > > +
> > > > +   ret = shmem_getpage(file_inode(memfd), offset, &page, 
> > > > SGP_WRITE);
> > > > +   if (ret)
> > > > +   return ret;
> > > > +
> > > > +   *pfn = page_to_pfn_t(page);
> > > > +   *order = thp_order(compound_head(page));
> > > > +   SetPageUptodate(page);
> > > > +   unlock_page(page);
> > > > +
> > > > +   return 0;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> > > > +
> > > > +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> > > > +{
> > > > +   struct page *page = pfn_t_to_page(pfn);
> > > > +
> > > > +   if (WARN_ON_ONCE(!page))
> > > > +   return;
> > > > +
> > > > +   put_page(page);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> > > 
> > > Sorry, I missed your reply regarding get/put interface.
> > > 
> > > https://lore.kernel.org/linux-mm/20220810092532.gd862...@chaop.bj.intel.com/
> > > 
> > > "We have a design assumption that somedays this can even support non-page
> > > based backing stores."
> > > 
> > > As long as there is no such user in sight (especially how to get the memfd
> > > from even allocating such memory which will require bigger changes), I
> > > prefer to keep it simple here and work on pages/folios. No need to
> > > over-complicate it for now.
> > 
> > Sean, Paolo , what is your take on this? Do you have conrete use case of
> > pageless backend for the mechanism in sight? Maybe DAX?
> 
> The problem I'm having with this is how to actually get such memory into the
> memory backend (that triggers notifiers) and what the semantics are at all
> with memory that is not managed by the buddy.
> 
> memfd with fixed PFNs doesn't make too much sense.

What do you mean by "fixed PFN". It is as fixed as struct page/folio, no?
PFN covers more possible backends.

> When using DAX, what happens with the shared <->private conversion? Which
> "type" is supposed to use dax, which not?
> 
> In other word, I'm missing too many details on the bigger picture of how
> this would work at all to see why it makes sense right now to prepare for
> that.

IIUC, KVM doesn't really care about pages or folios. They need PFN to
populate SEPT. Returning page/folio would make KVM do additional steps to
extract PFN and one more place to have a bug.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-22 Thread Kirill A . Shutemov
On Mon, Sep 19, 2022 at 11:12:46AM +0200, David Hildenbrand wrote:
> > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > index 6325d1d0e90f..9d066be3d7e8 100644
> > --- a/include/uapi/linux/magic.h
> > +++ b/include/uapi/linux/magic.h
> > @@ -101,5 +101,6 @@
> >   #define DMA_BUF_MAGIC 0x444d4142  /* "DMAB" */
> >   #define DEVMEM_MAGIC  0x454d444d  /* "DMEM" */
> >   #define SECRETMEM_MAGIC   0x5345434d  /* "SECM" */
> > +#define INACCESSIBLE_MAGIC 0x494e4143  /* "INAC" */
> 
> 
> [...]
> 
> > +
> > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > +int *order)
> > +{
> > +   struct inaccessible_data *data = file->f_mapping->private_data;
> > +   struct file *memfd = data->memfd;
> > +   struct page *page;
> > +   int ret;
> > +
> > +   ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > +   if (ret)
> > +   return ret;
> > +
> > +   *pfn = page_to_pfn_t(page);
> > +   *order = thp_order(compound_head(page));
> > +   SetPageUptodate(page);
> > +   unlock_page(page);
> > +
> > +   return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> > +
> > +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> > +{
> > +   struct page *page = pfn_t_to_page(pfn);
> > +
> > +   if (WARN_ON_ONCE(!page))
> > +   return;
> > +
> > +   put_page(page);
> > +}
> > +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> 
> Sorry, I missed your reply regarding get/put interface.
> 
> https://lore.kernel.org/linux-mm/20220810092532.gd862...@chaop.bj.intel.com/
> 
> "We have a design assumption that somedays this can even support non-page
> based backing stores."
> 
> As long as there is no such user in sight (especially how to get the memfd
> from even allocating such memory which will require bigger changes), I
> prefer to keep it simple here and work on pages/folios. No need to
> over-complicate it for now.

Sean, Paolo , what is your take on this? Do you have conrete use case of
pageless backend for the mechanism in sight? Maybe DAX?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-22 Thread Kirill A . Shutemov
On Thu, Sep 22, 2022 at 07:49:18PM +, Sean Christopherson wrote:
> On Thu, Sep 22, 2022, Wang, Wei W wrote:
> > On Thursday, September 15, 2022 10:29 PM, Chao Peng wrote:
> > > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > > +  int *order)
> > 
> > Better to remove "order" from this interface?
> 
> Hard 'no'.
> 
> > Some callers only need to get pfn, and no need to bother with
> > defining and inputting something unused. For callers who need the "order",
> > can easily get it via thp_order(pfn_to_page(pfn)) on their own.
> 
> That requires (a) assuming the pfn is backed by struct page, and (b) assuming 
> the
> struct page is a transparent huge page.  That might be true for the current
> implementation, but it most certainly will not always be true.
> 
> KVM originally did things like this, where there was dedicated code for THP 
> vs.
> HugeTLB, and it was a mess.  The goal here is very much to avoid repeating 
> those
> mistakes.  Have the backing store _tell_ KVM how big the mapping is, don't 
> force
> KVM to rediscover the info on its own.

I guess we can allow the order pointer to be NULL to cover callers that
don't need to know the order. Is it useful?
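I mean something like this at the end of inaccessible_get_pfn() (sketch):

	*pfn = page_to_pfn_t(page);
	if (order)
		*order = thp_order(compound_head(page));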

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-09-13 Thread Kirill A. Shutemov
On Tue, Sep 13, 2022 at 02:53:25PM +, Sean Christopherson wrote:
> > > Switching topics, what actually prevents mmapp() on the shim?  I tried to 
> > > follow,
> > > but I don't know these areas well enough.
> > 
> > It has no f_op->mmap, so mmap() will fail with -ENODEV. See do_mmap().
> > (I did not read the switch statement correctly at first. Note there are
> > two 'fallthrough' there.)
> 
> Ah, validate_mmap_request().  Thought not implementing ->mmap() was the key, 
> but
> couldn't find the actual check.

validate_mmap_request() is in mm/nommu.c which is not relevant for real
computers.

I was talking about this check:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/mmap.c#n1495
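Paraphrasing the relevant part of do_mmap() (see the link for the exact
code):

	if (!file->f_op->mmap)
		return -ENODEV;	/* the shim installs no ->mmap, so this triggers */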

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-09-13 Thread Kirill A. Shutemov
On Tue, Sep 13, 2022 at 09:44:27AM +, Sean Christopherson wrote:
> On Thu, Sep 08, 2022, Kirill A. Shutemov wrote:
> > On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
> > > On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
> > > > > I will try next week to rework it as shim to top of shmem. Does it 
> > > > > work
> > > > > for you?
> > > > 
> > > > Yes, please do, thanks.  It's a compromise between us: the initial TDX
> > > > case has no justification to use shmem at all, but doing it that way
> > > > will help you with some of the infrastructure, and will probably be
> > > > easiest for KVM to extend to other more relaxed fd cases later.
> > > 
> > > Okay, below is my take on the shim approach.
> > > 
> > > I don't hate how it turned out. It is easier to understand without
> > > callback exchange thing.
> > > 
> > > The only caveat is I had to introduce external lock to protect against
> > > race between lookup and truncate.
> 
> As before, I think this lock is unnecessary.  Or at least it's unnessary to 
> hold
> the lock across get/put.  The ->invalidate() call will ensure that the pfn is
> never actually used if get() races with truncation.

The updated version you are replying to does not use the lock to protect
against truncation anymore. The lock protects the notifier list.

> Switching topics, what actually prevents mmapp() on the shim?  I tried to 
> follow,
> but I don't know these areas well enough.

It has no f_op->mmap, so mmap() will fail with -ENODEV. See do_mmap().
(I did not read the switch statement correctly at first. Note there are
two 'fallthrough' there.)

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-09-09 Thread Kirill A . Shutemov
On Fri, Sep 09, 2022 at 12:11:05PM -0700, Andy Lutomirski wrote:
> 
> 
> On Fri, Sep 9, 2022, at 7:32 AM, Kirill A . Shutemov wrote:
> > On Thu, Sep 08, 2022 at 09:48:35PM -0700, Andy Lutomirski wrote:
> >> On 8/19/22 17:27, Kirill A. Shutemov wrote:
> >> > On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
> >> > > On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
> >> > > > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> >> > > > > 
> >> > > > > If your memory could be swapped, that would be enough of a good 
> >> > > > > reason
> >> > > > > to make use of shmem.c: but it cannot be swapped; and although 
> >> > > > > there
> >> > > > > are some references in the mailthreads to it perhaps being 
> >> > > > > swappable
> >> > > > > in future, I get the impression that will not happen soon if ever.
> >> > > > > 
> >> > > > > If your memory could be migrated, that would be some reason to use
> >> > > > > filesystem page cache (because page migration happens to understand
> >> > > > > that type of memory): but it cannot be migrated.
> >> > > > 
> >> > > > Migration support is in pipeline. It is part of TDX 1.5 [1]. And 
> >> > > > swapping
> >> > > > theoretically possible, but I'm not aware of any plans as of now.
> >> > > > 
> >> > > > [1] 
> >> > > > https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> >> > > 
> >> > > I always forget, migration means different things to different 
> >> > > audiences.
> >> > > As an mm person, I was meaning page migration, whereas a virtualization
> >> > > person thinks VM live migration (which that reference appears to be 
> >> > > about),
> >> > > a scheduler person task migration, an ornithologist bird migration, 
> >> > > etc.
> >> > > 
> >> > > But you're an mm person too: you may have cited that reference in the
> >> > > knowledge that TDX 1.5 Live Migration will entail page migration of the
> >> > > kind I'm thinking of.  (Anyway, it's not important to clarify that 
> >> > > here.)
> >> > 
> >> > TDX 1.5 brings both.
> >> > 
> >> > In TDX speak, mm migration called relocation. See TDH.MEM.PAGE.RELOCATE.
> >> > 
> >> 
> >> This seems to be a pretty bad fit for the way that the core mm migrates
> >> pages.  The core mm unmaps the page, then moves (in software) the contents
> >> to a new address, then faults it in.  TDH.MEM.PAGE.RELOCATE doesn't fit 
> >> into
> >> that workflow very well.  I'm not saying it can't be done, but it won't 
> >> just
> >> work.
> >
> > Hm. From what I see we have all necessary infrastructure in place.
> >
> > Unmaping is NOP for inaccessible pages as it is never mapped and we have
> > mapping->a_ops->migrate_folio() callback that allows to replace software
> > copying with whatever is needed, like TDH.MEM.PAGE.RELOCATE.
> >
> > What do I miss?
> 
> Hmm, maybe this isn't as bad as I thought.
> 
> Right now, unless I've missed something, the migration workflow is to
> unmap (via try_to_migrate) all mappings, then migrate the backing store
> (with ->migrate_folio(), although it seems like most callers expect the
> actual copy to happen outside of ->migrate_folio(),

Most? I guess you are talking about MIGRATE_SYNC_NO_COPY, right? AFAICS,
it is an HMM thing, not a common one.

> and then make new
> mappings.  With the *current* (vma-based, not fd-based) model for KVM
> memory, this won't work -- we can't unmap before calling
> TDH.MEM.PAGE.RELOCATE.

We don't need to unmap. The page is not mapped from core-mm PoV.

> But maybe it's actually okay with some care or maybe mild modifications
> with the fd-based model.  We don't have any mmaps, per se, to unmap for
> secret / INACCESSIBLE memory.  So maybe we can get all the way to
> ->migrate_folio() without zapping anything in the secure EPT and just
> call TDH-MEM.PAGE.RELOCATE from inside migrate_folio().  And there will
> be nothing to fault back in.  From the core code's perspective, it's
> like migrating a memfd that doesn't happen to have my mappings at the
> time.

Modifications are needed if we want to initiate migration from userspace.
IIRC, we don't have any API that can initiate page migration for file
ranges without mapping the file.

But the kernel can do it fine for its own housekeeping: compaction, for
example, doesn't need any VMA. And we need compaction working for the
long-term stability of the system.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-09-09 Thread Kirill A . Shutemov
On Thu, Sep 08, 2022 at 09:48:35PM -0700, Andy Lutomirski wrote:
> On 8/19/22 17:27, Kirill A. Shutemov wrote:
> > On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
> > > On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
> > > > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > > > > 
> > > > > If your memory could be swapped, that would be enough of a good reason
> > > > > to make use of shmem.c: but it cannot be swapped; and although there
> > > > > are some references in the mailthreads to it perhaps being swappable
> > > > > in future, I get the impression that will not happen soon if ever.
> > > > > 
> > > > > If your memory could be migrated, that would be some reason to use
> > > > > filesystem page cache (because page migration happens to understand
> > > > > that type of memory): but it cannot be migrated.
> > > > 
> > > > Migration support is in pipeline. It is part of TDX 1.5 [1]. And 
> > > > swapping
> > > > theoretically possible, but I'm not aware of any plans as of now.
> > > > 
> > > > [1] 
> > > > https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> > > 
> > > I always forget, migration means different things to different audiences.
> > > As an mm person, I was meaning page migration, whereas a virtualization
> > > person thinks VM live migration (which that reference appears to be 
> > > about),
> > > a scheduler person task migration, an ornithologist bird migration, etc.
> > > 
> > > But you're an mm person too: you may have cited that reference in the
> > > knowledge that TDX 1.5 Live Migration will entail page migration of the
> > > kind I'm thinking of.  (Anyway, it's not important to clarify that here.)
> > 
> > TDX 1.5 brings both.
> > 
> > In TDX speak, mm migration called relocation. See TDH.MEM.PAGE.RELOCATE.
> > 
> 
> This seems to be a pretty bad fit for the way that the core mm migrates
> pages.  The core mm unmaps the page, then moves (in software) the contents
> to a new address, then faults it in.  TDH.MEM.PAGE.RELOCATE doesn't fit into
> that workflow very well.  I'm not saying it can't be done, but it won't just
> work.

Hm. From what I see, we have all the necessary infrastructure in place.

Unmapping is a NOP for inaccessible pages as they are never mapped, and we
have the mapping->a_ops->migrate_folio() callback that allows replacing the
software copy with whatever is needed, like TDH.MEM.PAGE.RELOCATE.
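For illustration only (tdx_relocate_page() is a made-up name, error
handling trimmed), the callback could look roughly like:

static int inaccessible_migrate_folio(struct address_space *mapping,
				      struct folio *dst, struct folio *src,
				      enum migrate_mode mode)
{
	int ret;

	ret = folio_migrate_mapping(mapping, dst, src, 0);
	if (ret != MIGRATEPAGE_SUCCESS)
		return ret;

	/* Let the TDX module move the contents instead of memcpy() */
	ret = tdx_relocate_page(folio_pfn(src), folio_pfn(dst));
	if (ret)
		return ret;

	folio_migrate_flags(dst, src);
	return MIGRATEPAGE_SUCCESS;
}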

What do I miss?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-09-07 Thread Kirill A. Shutemov
On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
> On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
> > > I will try next week to rework it as shim to top of shmem. Does it work
> > > for you?
> > 
> > Yes, please do, thanks.  It's a compromise between us: the initial TDX
> > case has no justification to use shmem at all, but doing it that way
> > will help you with some of the infrastructure, and will probably be
> > easiest for KVM to extend to other more relaxed fd cases later.
> 
> Okay, below is my take on the shim approach.
> 
> I don't hate how it turned out. It is easier to understand without
> callback exchange thing.
> 
> The only caveat is I had to introduce external lock to protect against
> race between lookup and truncate. Otherwise, looks pretty reasonable to me.
> 
> I did very limited testing. And it lacks integration with KVM, but API
> changed not substantially, any it should be easy to adopt.
> 
> Any comments?

Updated version below. Nothing major. Some simplification and cleanups.

diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..334ddff08377 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -3,6 +3,7 @@
 #define __LINUX_MEMFD_H
 
 #include 
+#include 
 
 #ifdef CONFIG_MEMFD_CREATE
 extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long 
arg);
@@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned int 
c, unsigned long a)
 }
 #endif
 
+struct inaccessible_notifier;
+
+struct inaccessible_notifier_ops {
+   void (*invalidate)(struct inaccessible_notifier *notifier,
+  pgoff_t start, pgoff_t end);
+};
+
+struct inaccessible_notifier {
+   struct list_head list;
+   const struct inaccessible_notifier_ops *ops;
+};
+
+void inaccessible_register_notifier(struct file *file,
+   struct inaccessible_notifier *notifier);
+void inaccessible_unregister_notifier(struct file *file,
+ struct inaccessible_notifier *notifier);
+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+int *order);
+void inaccessible_put_pfn(struct file *file, pfn_t pfn);
+
+struct file *memfd_mkinaccessible(struct file *memfd);
+
 #endif /* __LINUX_MEMFD_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..9d066be3d7e8 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
 #define DMA_BUF_MAGIC  0x444d4142  /* "DMAB" */
 #define DEVMEM_MAGIC   0x454d444d  /* "DMEM" */
 #define SECRETMEM_MAGIC0x5345434d  /* "SECM" */
+#define INACCESSIBLE_MAGIC 0x494e4143  /* "INAC" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..48750474b904 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,7 @@
 #define MFD_CLOEXEC0x0001U
 #define MFD_ALLOW_SEALING  0x0002U
 #define MFD_HUGETLB0x0004U
+#define MFD_INACCESSIBLE   0x0008U
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/Makefile b/mm/Makefile
index 9a564f836403..f82e5d4b4388 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -126,7 +126,7 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
-obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_MEMFD_CREATE) += memfd.o memfd_inaccessible.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
diff --git a/mm/memfd.c b/mm/memfd.c
index 08f5f8304746..1853a90f49ff 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -261,7 +261,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, 
unsigned long arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
 
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+  MFD_INACCESSIBLE)
 
 SYSCALL_DEFINE2(memfd_create,
const char __user *, uname,
@@ -283,6 +284,14 @@ SYSCALL_DEFINE2(memfd_create,
return -EINVAL;
}
 
+   /* Disallow sealing when MFD_INACCESSIBLE is set. */
+   if ((flags & MFD_INACCESSIBLE) && (flags & MFD_ALLOW_SEALING))
+   return -EINVAL;
+
+   /* TODO: add hugetlb support */
+   if ((flags & MFD_INACCESSIBLE) && (flags & MFD_HUGETLB))
+   return -EINVAL;
+
/* len

Re: [PATCH v7 05/14] mm/memfd: Introduce MFD_INACCESSIBLE flag

2022-09-07 Thread Kirill A. Shutemov
On Fri, Aug 05, 2022 at 03:28:50PM +0200, David Hildenbrand wrote:
> On 06.07.22 10:20, Chao Peng wrote:
> > Introduce a new memfd_create() flag indicating the content of the
> > created memfd is inaccessible from userspace through ordinary MMU
> > access (e.g., read/write/mmap). However, the file content can be
> > accessed via a different mechanism (e.g. KVM MMU) indirectly.
> > 
> > It provides semantics required for KVM guest private memory support
> > that a file descriptor with this flag set is going to be used as the
> > source of guest memory in confidential computing environments such
> > as Intel TDX/AMD SEV but may not be accessible from host userspace.
> > 
> > The flag can not coexist with MFD_ALLOW_SEALING, future sealing is
> > also impossible for a memfd created with this flag.
> 
> It's kind of weird to have it that way. Why should the user have to
> care? It's the notifier requirement to have that, no?
> 
> Why can't we handle that when register a notifier? If anything is
> already mapped, fail registering the notifier if the notifier has these
> demands. If registering succeeds, block it internally.
> 
> Or what am I missing? We might not need the memfile set flag semantics
> eventually and would not have to expose such a flag to user space.

Well, with the new shim-based[1] implementation the approach without uAPI
does not work.

We now have two struct files: one is a normal, accessible memfd and the
other one is a wrapper around it that hides the memfd from userspace and
filters the allowed operations. If we first create an accessible memfd
that userspace can see, it would be hard to hide it later, as by that time
userspace may have multiple fds in different processes that point to the
same struct file.

[1] 
https://lore.kernel.org/all/20220831142439.65q2gi4g2d2z4...@box.shutemov.name

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-09-02 Thread Kirill A . Shutemov
On Fri, Sep 02, 2022 at 06:27:57PM +0800, Chao Peng wrote:
> > +   if (flags & MFD_INACCESSIBLE) {
> > +   struct file *inaccessible_file;
> > +
> > +   inaccessible_file = memfd_mkinaccessible(file);
> > +   if (IS_ERR(inaccessible_file)) {
> > +   error = PTR_ERR(inaccessible_file);
> > +   goto err_file;
> > +   }
> 
> The new file should alse be marked as O_LARGEFILE otherwise setting the
> initial size greater than 2^31 on the fd will be refused by ftruncate().
> 
> +   inaccessible_file->f_flags |= O_LARGEFILE;
> +

Good catch. Thanks.

I will modify memfd_mkinaccessible() to do this.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-08-31 Thread Kirill A . Shutemov
+   data = kzalloc(sizeof(*data), GFP_KERNEL);
+   if (!data)
+   return ERR_PTR(-ENOMEM);
+
+   data->memfd = memfd;
+   init_rwsem(&data->lock);
+   INIT_LIST_HEAD(&data->notifiers);
+
+   inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
+   if (IS_ERR(inode)) {
+   kfree(data);
+   return ERR_CAST(inode);
+   }
+
+   inode->i_mode |= S_IFREG;
+   inode->i_op = &inaccessible_iops;
+   inode->i_mapping->private_data = data;
+
+   file = alloc_file_pseudo(inode, inaccessible_mnt,
+"[memfd:inaccessible]", O_RDWR,
+&inaccessible_fops);
+   if (IS_ERR(file)) {
+   iput(inode);
+   kfree(data);
+   }
+
+   mapping = memfd->f_mapping;
+   mapping_set_unevictable(mapping);
+   mapping_set_gfp_mask(mapping,
+mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
+
+   return file;
+}
+
+int inaccessible_register_notifier(struct file *file,
+ struct inaccessible_notifier *notifier)
+{
+   struct inaccessible_data *data = file->f_mapping->private_data;
+
+   down_write(&data->lock);
+   list_add(&notifier->list, &data->notifiers);
+   up_write(&data->lock);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
+
+void inaccessible_unregister_notifier(struct file *file,
+ struct inaccessible_notifier *notifier)
+{
+   struct inaccessible_data *data = file->f_mapping->private_data;
+
+   down_write(&data->lock);
+   list_del_rcu(&notifier->list);
+   up_write(&data->lock);
+}
+EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier);
+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+int *order)
+{
+   struct inaccessible_data *data = file->f_mapping->private_data;
+   struct file *memfd = data->memfd;
+   struct page *page;
+   int ret;
+
+   down_read(&data->lock);
+
+   ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
+   if (ret) {
+   up_read(&data->lock);
+   return ret;
+   }
+
+   *pfn = page_to_pfn_t(page);
+   *order = thp_order(compound_head(page));
+   return 0;
+}
+EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
+
+void inaccessible_put_pfn(struct file *file, pfn_t pfn)
+{
+   struct page *page = pfn_t_to_page(pfn);
+   struct inaccessible_data *data = file->f_mapping->private_data;
+
+   if (WARN_ON_ONCE(!page))
+   return;
+
+   SetPageUptodate(page);
+   unlock_page(page);
+   put_page(page);
+   up_read(&data->lock);
+}
+EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-08-19 Thread Kirill A. Shutemov
On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
> On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
> > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > > 
> > > If your memory could be swapped, that would be enough of a good reason
> > > to make use of shmem.c: but it cannot be swapped; and although there
> > > are some references in the mailthreads to it perhaps being swappable
> > > in future, I get the impression that will not happen soon if ever.
> > > 
> > > If your memory could be migrated, that would be some reason to use
> > > filesystem page cache (because page migration happens to understand
> > > that type of memory): but it cannot be migrated.
> > 
> > Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
> > theoretically possible, but I'm not aware of any plans as of now.
> > 
> > [1] 
> > https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> 
> I always forget, migration means different things to different audiences.
> As an mm person, I was meaning page migration, whereas a virtualization
> person thinks VM live migration (which that reference appears to be about),
> a scheduler person task migration, an ornithologist bird migration, etc.
> 
> But you're an mm person too: you may have cited that reference in the
> knowledge that TDX 1.5 Live Migration will entail page migration of the
> kind I'm thinking of.  (Anyway, it's not important to clarify that here.)

TDX 1.5 brings both.

In TDX speak, mm migration called relocation. See TDH.MEM.PAGE.RELOCATE.

> > > Some of these impressions may come from earlier iterations of the
> > > patchset (v7 looks better in several ways than v5).  I am probably
> > > underestimating the extent to which you have taken on board other
> > > usages beyond TDX and SEV private memory, and rightly want to serve
> > > them all with similar interfaces: perhaps there is enough justification
> > > for shmem there, but I don't see it.  There was mention of userfaultfd
> > > in one link: does that provide the justification for using shmem?
> > > 
> > > I'm afraid of the special demands you may make of memory allocation
> > > later on - surprised that huge pages are not mentioned already;
> > > gigantic contiguous extents? secretmem removed from direct map?
> > 
> > The design allows for extension to hugetlbfs if needed. Combination of
> > MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero
> > implications for shmem. It is going to be separate struct 
> > memfile_backing_store.
> 
> Last year's MFD_HUGEPAGE proposal would have allowed you to do it with
> memfd via tmpfs without needing to involve hugetlbfs; but you may prefer
> the determinism of hugetlbfs, relying on /proc/sys/vm/nr_hugepages etc.
> 
> But I've yet to see why you want to involve this or that filesystem
> (with all its filesystem-icity suppressed) at all.  The backing store
> is host memory, and tmpfs and hugetlbfs just impose their own
> idiosyncrasies on how that memory is allocated; but I think you would
> do better to choose your own idiosyncrasies in allocation directly -
> you don't need a different "backing store" to choose between 4k or 2M
> or 1G or whatever allocations.

These idiosyncrasies are well known: a user who used hugetlbfs may want a
direct replacement that would tap into the same hugetlb reserves and give
the same allocation guarantees. Admins know where to look if ENOMEM
happens.

For THP, an admin may know how to tweak the allocation/defrag policy to
their liking and how to track whether the pages got allocated.

> tmpfs and hugetlbfs and page cache are designed around sharing memory:
> TDX is designed around absolutely not sharing memory; and the further
> uses which Sean foresees appear not to need it as page cache either.
> 
> Except perhaps for page migration reasons.  It's somewhat incidental,  
> but of course page migration knows how to migrate page cache, so
> masquerading as page cache will give a short cut to page migration,
> when page migration becomes at all possible.
> 
> > 
> > I'm not sure secretmem is a fit here as we want to extend MFD_INACCESSIBLE
> > to be movable if platform supports it and secretmem is not migratable by
> > design (without direct mapping fragmentations).
> > 
> > > Here's what I would prefer, and imagine much easier for you to maintain;
> > > but I'm no system designer, and may be misunderstanding throughout.
> > > 
> > > QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
> > > the fallocate 

Re: [PATCH v6 6/8] KVM: Handle page fault for private memory

2022-08-18 Thread Kirill A. Shutemov
On Fri, Jun 17, 2022 at 09:30:53PM +, Sean Christopherson wrote:
> > @@ -4088,7 +4144,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, 
> > struct kvm_page_fault *fault
> > read_unlock(&vcpu->kvm->mmu_lock);
> > else
> > write_unlock(&vcpu->kvm->mmu_lock);
> > -   kvm_release_pfn_clean(fault->pfn);
> > +
> > +   if (fault->is_private)
> > +   kvm_private_mem_put_pfn(fault->slot, fault->pfn);
> 
> Why does the shmem path lock the page, and then unlock it here?

The lock is required to avoid a race with truncate / punch hole: if
truncation happens after get_pfn() but before the page gets into the SEPT,
we are screwed.

> Same question for why this path marks it dirty?  The guest has the page mapped
> so the dirty flag is immediately stale.

If the page is clean and its refcount is not elevated, vmscan is free to
drop it from the page cache. I don't think we want this.

> In other words, why does KVM need to do something different for private pfns?

Because in the traditional KVM memslot scheme, core mm takes care of
this.

The changes in v7 are wrong. The page has to stay locked until it lands in
the SEPT, and it must be marked dirty before unlocking.
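I.e. the intended sequence on the KVM fault path is roughly (a sketch;
sept_install() is a stand-in name for the real SEPT mapping code):

	/* page comes back locked and with an elevated refcount */
	ret = inaccessible_get_pfn(file, offset, &pfn, &order);
	if (ret)
		return ret;

	ret = sept_install(vcpu, gfn, pfn, order);

	/* marks the page dirty, unlocks it and drops the reference */
	inaccessible_put_pfn(file, pfn);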

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-08-18 Thread Kirill A . Shutemov
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> On Wed, 6 Jul 2022, Chao Peng wrote:
> > This is the v7 of this series which tries to implement the fd-based KVM
> > guest private memory.
> 
> Here at last are my reluctant thoughts on this patchset.
> 
> fd-based approach for supporting KVM guest private memory: fine.
> 
> Use or abuse of memfd and shmem.c: mistaken.
> 
> memfd_create() was an excellent way to put together the initial prototype.
> 
> But since then, TDX in particular has forced an effort into preventing
> (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
> 
> Are any of the shmem.c mods useful to existing users of shmem.c? No.
> Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
> 
> What use do you have for a filesystem here?  Almost none.
> IIUC, what you want is an fd through which QEMU can allocate kernel
> memory, selectively free that memory, and communicate fd+offset+length
> to KVM.  And perhaps an interface to initialize a little of that memory
> from a template (presumably copied from a real file on disk somewhere).
> 
> You don't need shmem.c or a filesystem for that!
> 
> If your memory could be swapped, that would be enough of a good reason
> to make use of shmem.c: but it cannot be swapped; and although there
> are some references in the mailthreads to it perhaps being swappable
> in future, I get the impression that will not happen soon if ever.
> 
> If your memory could be migrated, that would be some reason to use
> filesystem page cache (because page migration happens to understand
> that type of memory): but it cannot be migrated.

Migration support is in the pipeline. It is part of TDX 1.5 [1]. And
swapping is theoretically possible, but I'm not aware of any plans as of
now.

[1] 
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html

> Some of these impressions may come from earlier iterations of the
> patchset (v7 looks better in several ways than v5).  I am probably
> underestimating the extent to which you have taken on board other
> usages beyond TDX and SEV private memory, and rightly want to serve
> them all with similar interfaces: perhaps there is enough justification
> for shmem there, but I don't see it.  There was mention of userfaultfd
> in one link: does that provide the justification for using shmem?
> 
> I'm afraid of the special demands you may make of memory allocation
> later on - surprised that huge pages are not mentioned already;
> gigantic contiguous extents? secretmem removed from direct map?

The design allows for extension to hugetlbfs if needed. The combination of
MFD_INACCESSIBLE | MFD_HUGETLB should route that way. There should be zero
implications for shmem; it is going to be a separate struct
memfile_backing_store.

I'm not sure secretmem is a fit here, as we want to extend MFD_INACCESSIBLE
to be movable if the platform supports it, and secretmem is not migratable
by design (without direct mapping fragmentation).

> Here's what I would prefer, and imagine much easier for you to maintain;
> but I'm no system designer, and may be misunderstanding throughout.
> 
> QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
> the fallocate syscall interface itself) to allocate and free the memory,
> ioctl for initializing some of it too.  KVM in control of whether that
> fd can be read or written or mmap'ed or whatever, no need to prevent it
> in shmem.c, no need for flags, seals, notifications to and fro because
> KVM is already in control and knows the history.  If shmem actually has
> value, call into it underneath - somewhat like SysV SHM, and /dev/zero
> mmap, and i915/gem make use of it underneath.  If shmem has nothing to
> add, just allocate and free kernel memory directly, recorded in your
> own xarray.

I guess a shim layer on top of shmem *can* work. I don't immediately see
why it would not. But I'm not sure it is the right direction. We risk
creating yet another parallel VM with its own rules/locking/accounting
that is opaque to core-mm.

Note that on machines that run TDX guests, such memory would likely be the
bulk of memory use. Treating it as a fringe case may bite us one day.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd

2022-08-17 Thread Kirill A. Shutemov
On Fri, Aug 05, 2022 at 07:55:38PM +0200, Paolo Bonzini wrote:
> On 7/21/22 11:44, David Hildenbrand wrote:
> > 
> > Also, I*think*  you can place pages via userfaultfd into shmem. Not
> > sure if that would count "auto alloc", but it would certainly bypass
> > fallocate().
> 
> Yeah, userfaultfd_register would probably have to forbid this for
> F_SEAL_AUTO_ALLOCATE vmas.  Maybe the memfile_node can be reused for this,
> adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags?  Then userfault_register
> would do something like memfile_node_get_flags(vma->vm_file) and check the
> result.

I dunno, memory allocation with userfaultfd looks pretty intentional to
me. Why would F_SEAL_AUTO_ALLOCATE prevent it?

Maybe we would need it in the future for post-copy migration or something?

Or do existing practices around userfaultfd touch memory randomly and are
therefore incompatible with the F_SEAL_AUTO_ALLOCATE intent?

Note that userfaultfd is only relevant for shared memory, as it requires a
VMA, which we don't have for MFD_INACCESSIBLE.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-08-16 Thread Kirill A . Shutemov
On Tue, Aug 16, 2022 at 01:33:00PM +0200, Gupta, Pankaj wrote:
> Hi Chao,
> 
> > 
> > Actually the current version allows you to delay the allocation to a
> > later time (e.g. page fault time) if you don't call fallocate() on the
> > private fd. fallocate() is necessary in previous versions because we
> > treat the existense in the fd as 'private' but in this version we track
> > private/shared info in KVM so we don't rely on that fact from memory
> > backstores.
> 
> Does this also mean reservation of guest physical memory with secure
> processor (both for SEV-SNP & TDX) will also happen at page fault time?
> 
> Do we plan to keep it this way?

If you are talking about memory acceptance by the guest, it is initiated
by the guest and has nothing to do with page-fault-time vs. fallocate()
allocation of host memory. I mean, acceptance happens after host memory
allocation, but they are not in lockstep; acceptance can happen much
later.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-12 Thread Kirill A. Shutemov
On Mon, Mar 28, 2022 at 01:16:48PM -0700, Andy Lutomirski wrote:
> On Thu, Mar 10, 2022 at 6:09 AM Chao Peng  wrote:
> >
> > This is the v5 of this series which tries to implement the fd-based KVM
> > guest private memory. The patches are based on latest kvm/queue branch
> > commit:
> >
> >   d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
> 
> Can this series be run and a VM booted without TDX?  A feature like
> that might help push it forward.

It would require enlightenment of the guest code. We have two options.

The simple one is to limit enabling to the guest kernel, but it would
require a non-destructive conversion from shared to private memory. This
does not seem to be compatible with the current design.

The other option is to get memory private from time 0 of VM boot, but it
requires modifying the virtual BIOS to set up shared ranges as needed. I'm
not sure if anybody will volunteer to work on BIOS code to make it happen.

Hm.

-- 
 Kirill A. Shutemov



Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK

2022-04-12 Thread Kirill A. Shutemov
On Tue, Apr 12, 2022 at 09:39:25PM +0800, Chao Peng wrote:
> On Mon, Apr 11, 2022 at 06:32:33PM +0300, Kirill A. Shutemov wrote:
> > On Thu, Apr 07, 2022 at 04:05:36PM +, Sean Christopherson wrote:
> > > Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting 
> > > the swap, so
> > > maybe it's just the page migration path that needs to be updated?
> > 
> > My early version prevented migration with -ENOTSUPP for
> > address_space_operations::migratepage().
> > 
> > What's wrong with that approach?
> 
> I previously thought migratepage will not be called since we already
> marked the pages as UNMOVABLE, sounds not correct?

Do you mean the missing __GFP_MOVABLE? I can be wrong, but I don't see
that it directly affects whether the page is migratable. It is a hint to
the page allocator to group unmovable pages into separate page blocks and
improve the availability of higher-order pages this way. The page
allocator tries to allocate unmovable pages from page blocks that already
have unmovable pages.
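The "marked as UNMOVABLE" part is just the per-mapping gfp mask, i.e.
something like:

	mapping_set_gfp_mask(mapping,
			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);

It steers allocations towards unmovable page blocks, but by itself it does
not stop the migration path from being called.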

-- 
 Kirill A. Shutemov



Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK

2022-04-11 Thread Kirill A. Shutemov
On Thu, Apr 07, 2022 at 04:05:36PM +, Sean Christopherson wrote:
> Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting the 
> swap, so
> maybe it's just the page migration path that needs to be updated?

My early version prevented migration with -ENOTSUPP for
address_space_operations::migratepage().

What's wrong with that approach?

-- 
 Kirill A. Shutemov



Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK

2022-04-11 Thread Kirill A. Shutemov
On Fri, Apr 08, 2022 at 09:02:54PM +0800, Chao Peng wrote:
> > I think the correct approach is to not do the locking automatically for 
> > SHM_F_INACCESSIBLE,
> > and instead require userspace to do shmctl(.., SHM_LOCK, ...) if userspace 
> > knows the
> > consumers don't support migrate/swap.  That'd require wrapping 
> > migrate_page() and then
> > wiring up notifier hooks for migrate/swap, but IMO that's a good thing to 
> > get sorted
> > out sooner than later.  KVM isn't planning on support migrate/swap for TDX 
> > or SNP,
> > but supporting at least migrate for a software-only implementation a la 
> > pKVM should
> > be relatively straightforward.  On the notifiee side, KVM can terminate the 
> > VM if it
> > gets an unexpected migrate/swap, e.g. so that TDX/SEV VMs don't die later 
> > with
> > exceptions and/or data corruption (pre-SNP SEV guests) in the guest.
> 
> SHM_LOCK sounds like a good match.

Emm, no. shmctl(2) and SHM_LOCK are a SysV IPC thing. I don't see how they
fit here.

-- 
 Kirill A. Shutemov



Re: [PATCH v5 01/13] mm/memfd: Introduce MFD_INACCESSIBLE flag

2022-04-11 Thread Kirill A. Shutemov
On Thu, Mar 10, 2022 at 10:08:59PM +0800, Chao Peng wrote:
> From: "Kirill A. Shutemov" 
> 
> Introduce a new memfd_create() flag indicating the content of the
> created memfd is inaccessible from userspace through ordinary MMU
> access (e.g., read/write/mmap). However, the file content can be
> accessed via a different mechanism (e.g. KVM MMU) indirectly.
> 
> It provides semantics required for KVM guest private memory support
> that a file descriptor with this flag set is going to be used as the
> source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV but may not be accessible from host userspace.
> 
> Since page migration/swapping is not yet supported for such usages
> so these pages are currently marked as UNMOVABLE and UNEVICTABLE
> which makes them behave like long-term pinned pages.
> 
> The flag can not coexist with MFD_ALLOW_SEALING, future sealing is
> also impossible for a memfd created with this flag.
> 
> At this time only shmem implements this flag.
> 
> Signed-off-by: Kirill A. Shutemov 
> Signed-off-by: Chao Peng 
> ---
>  include/linux/shmem_fs.h   |  7 +
>  include/uapi/linux/memfd.h |  1 +
>  mm/memfd.c | 26 +++--
>  mm/shmem.c | 57 ++
>  4 files changed, 88 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index e65b80ed09e7..2dde843f28ef 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -12,6 +12,9 @@
>  
>  /* inode in-kernel data */
>  
> +/* shmem extended flags */
> +#define SHM_F_INACCESSIBLE   0x0001  /* prevent ordinary MMU access (e.g. 
> read/write/mmap) to file content */
> +
>  struct shmem_inode_info {
>   spinlock_t  lock;
>   unsigned intseals;  /* shmem seals */
> @@ -24,6 +27,7 @@ struct shmem_inode_info {
>   struct shared_policypolicy; /* NUMA memory alloc policy */
>   struct simple_xattrsxattrs; /* list of xattrs */
>   atomic_tstop_eviction;  /* hold when working on inode */
> + unsigned intxflags; /* shmem extended flags */
>   struct inodevfs_inode;
>  };
>  

AFAICS, only two bits of 'flags' are used. And it is very strange that VM_
flags are used for the purpose. My guess is that someone was too lazy to
introduce new constants for this.

I think we should fix this: introduce SHM_F_LOCKED and SHM_F_NORESERVE
alongside SHM_F_INACCESSIBLE and stuff them all into info->flags. It also
makes shmem_file_setup_xflags() go away.
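I.e. something like (values illustrative):

/* shmem extended flags, stored in shmem_inode_info::flags */
#define SHM_F_LOCKED		0x0001
#define SHM_F_NORESERVE		0x0002
#define SHM_F_INACCESSIBLE	0x0004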

-- 
 Kirill A. Shutemov



Re: [PATCH v5 03/13] mm/shmem: Support memfile_notifier

2022-04-11 Thread Kirill A. Shutemov
On Thu, Mar 10, 2022 at 10:09:01PM +0800, Chao Peng wrote:
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 9b31a7056009..7b43e274c9a2 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -903,6 +903,28 @@ static struct folio *shmem_get_partial_folio(struct 
> inode *inode, pgoff_t index)
>   return page ? page_folio(page) : NULL;
>  }
>  
> +static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
> +{
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> + struct shmem_inode_info *info = SHMEM_I(inode);
> +
> > + memfile_notifier_fallocate(&info->memfile_notifiers, start, end);
> +#endif

All these #ifdefs look ugly. Could you provide dummy memfile_* helpers for
the !MEMFILE_NOTIFIER case?
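I.e. in the memfile_notifier header, something along these lines (the
signature is guessed from the call site above):

#ifdef CONFIG_MEMFILE_NOTIFIER
void memfile_notifier_fallocate(struct memfile_notifier_list *list,
				pgoff_t start, pgoff_t end);
#else
static inline void memfile_notifier_fallocate(struct memfile_notifier_list *list,
					      pgoff_t start, pgoff_t end)
{
}
#endif

Then the callers in shmem.c would not need any #ifdef at all.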

-- 
 Kirill A. Shutemov



Re: [PATCH] x86: Implement Linear Address Masking support

2022-04-07 Thread Kirill A. Shutemov
On Thu, Apr 07, 2022 at 06:38:40PM +0200, Paolo Bonzini wrote:
> On 4/7/22 17:27, Kirill A. Shutemov wrote:
> > On Thu, Apr 07, 2022 at 07:28:54AM -0700, Richard Henderson wrote:
> > > On 4/7/22 06:18, Kirill A. Shutemov wrote:
> > > > > The new hook is incorrect, in that it doesn't apply to addresses along
> > > > > the tlb fast path.
> > > > 
> > > > I'm not sure what you mean by that. tlb_hit() mechanics works. We strip
> > > > the tag bits before tlb lookup.
> > > > 
> > > > Could you elaborate?
> > > 
> > > The fast path does not clear the bits, so you enter the slow path before 
> > > you
> > > get to clearing the bits.  You've lost most of the advantage of the tlb
> > > already.
> > 
> > Sorry for my ignorance, but what do you mean by fast path here?
> 
> The fast path is the TLB lookup code that is generated by the JIT compiler.
> If the TLB hits, the memory access doesn't go through any C code.  I think
> tagged addresses always fail the fast path in your patch.

Ah. Got it.

Could you point me to the key code area relevant to the topic? I'm not
familiar with the JIT side of QEMU.

-- 
 Kirill A. Shutemov



Re: [PATCH] x86: Implement Linear Address Masking support

2022-04-07 Thread Kirill A. Shutemov
On Thu, Apr 07, 2022 at 07:28:54AM -0700, Richard Henderson wrote:
> On 4/7/22 06:18, Kirill A. Shutemov wrote:
> > > The new hook is incorrect, in that it doesn't apply to addresses along
> > > the tlb fast path.
> > 
> > I'm not sure what you mean by that. tlb_hit() mechanics works. We strip
> > the tag bits before tlb lookup.
> > 
> > Could you elaborate?
> 
> The fast path does not clear the bits, so you enter the slow path before you
> get to clearing the bits.  You've lost most of the advantage of the tlb
> already.

Sorry for my ignorance, but what do you mean by fast path here?

My understanding is that it is the case when tlb_hit() is true and you
don't need to get into tlb_fill(). Are we talking about the same scheme?

For store_helper() I clear the bits before doing the TLB lookup and fill,
so the TLB will always deal with clean addresses.

Hm?

> > To be honest I don't fully understand how TBI emulation works.
> 
> In get_phys_addr_lpae:
> 
> addrsize = 64 - 8 * param.tbi;
> ...
> target_ulong top_bits = sextract64(address, inputsize,
>addrsize - inputsize);
> if (-top_bits != param.select) {
> /* The gap between the two regions is a Translation fault */
> fault_type = ARMFault_Translation;
> goto do_fault;
> }
> 
> which does not include TBI bits in the validation of the sign-extended 
> address.
> 
> > Consider store_helper(). I failed to find where tag bits get stripped
> > before getting there for !CONFIG_USER_ONLY. clean_data_tbi() only covers
> > user-only case.
> > 
> > And if we get there with tags, I don't see how we will ever get to fast
> > path: tlb_hit() should never return true there if any bit in top byte is
> > set as cached tlb_addr has them stripped.
> > 
> > tlb_fill() will get it handled correctly, but it is wasteful to go through
> > pagewalk on every tagged pointer dereference.
> 
> We won't do a pagewalk for every tagged pointer dereference.  It'll be
> pointer dereferences with differing tags past the limit of the victim cache
> (CPU_VTLB_SIZE).  And one tag will get to use the fast path, e.g. on the
> store following a load.
> 
> I've just now had a browse through the Intel docs, and I see that you're not
> performing the required modified canonicality check.

The modified canonicality check is effectively done by clearing (and
sign-extending) the address before the check.

> While a proper tagged address will have the tag removed in CR2 during a
> page fault, an improper tagged address (with bit 63 != {47,56}) should
> have the original address reported to CR2.

Hm. I don't see that in the spec. It rather points in the other direction:

Page faults report the faulting linear address in CR2. Because LAM
masking (by sign-extension) applies before paging, the faulting
linear address recorded in CR2 does not contain the masked
metadata.

Yes, it talks about CR2 in case of page fault, not #GP due to canonicality
checking, but still.

> I could imagine a hook that could aid the victim cache in ignoring the tag,
> so that we need go through tlb_fill fewer times.  But I wouldn't want to
> include that in the base version of this feature, and I'd want take more
> than a moment in the design so that it could be used by ARM and RISC-V as
> well.

But what other options do you see? Clearing the bits before the TLB lookup
matches the architectural spec and makes INVLPG match the described behaviour
without special handling.

-- 
 Kirill A. Shutemov



Re: [PATCH] x86: Implement Linear Address Masking support

2022-04-07 Thread Kirill A. Shutemov
On Wed, Apr 06, 2022 at 10:34:41PM -0500, Richard Henderson wrote:
> On 4/6/22 20:01, Kirill A. Shutemov wrote:
> > Linear Address Masking feature makes CPU ignore some bits of the virtual
> > address. These bits can be used to encode metadata.
> > 
> > The feature is enumerated with CPUID.(EAX=07H, ECX=01H):EAX.LAM[bit 26].
> > 
> > CR3.LAM_U57[bit 62] allows to encode 6 bits of metadata in bits 62:57 of
> > user pointers.
> > 
> > CR3.LAM_U48[bit 61] allows to encode 15 bits of metadata in bits 62:48
> > of user pointers.
> > 
> > CR4.LAM_SUP[bit 28] allows to encode metadata of supervisor pointers.
> > If 5-level paging is in use, 6 bits of metadata can be encoded in 62:57.
> > For 4-level paging, 15 bits of metadata can be encoded in bits 62:48.
> > 
> > QEMU strips address from the metadata bits and gets it to canonical
> > shape before handling memory access. It has to be done very early before
> > TLB lookup.
> 
> The new hook is incorrect, in that it doesn't apply to addresses along
> the tlb fast path.

I'm not sure what you mean by that. tlb_hit() mechanics works. We strip
the tag bits before tlb lookup.

Could you elaborate?

> But it isn't really needed.  You can do all of the work in the existing
> tlb_fill hook. AArch64 has a similar feature, and that works fine.

To be honest I don't fully understand how TBI emulation works.

Consider store_helper(). I failed to find where tag bits get stripped
before getting there for !CONFIG_USER_ONLY. clean_data_tbi() only covers
user-only case.

And if we get there with tags, I don't see how we will ever get to fast
path: tlb_hit() should never return true there if any bit in top byte is
set as cached tlb_addr has them stripped.

tlb_fill() will get it handled correctly, but it is wasteful to go through
pagewalk on every tagged pointer dereference.

Hm?

-- 
 Kirill A. Shutemov



[PATCH] x86: Implement Linear Address Masking support

2022-04-06 Thread Kirill A. Shutemov
The Linear Address Masking feature makes the CPU ignore some bits of the
virtual address. These bits can be used to encode metadata.

The feature is enumerated with CPUID.(EAX=07H, ECX=01H):EAX.LAM[bit 26].

CR3.LAM_U57[bit 62] allows encoding 6 bits of metadata in bits 62:57 of
user pointers.

CR3.LAM_U48[bit 61] allows encoding 15 bits of metadata in bits 62:48
of user pointers.

CR4.LAM_SUP[bit 28] allows encoding metadata in supervisor pointers.
If 5-level paging is in use, 6 bits of metadata can be encoded in bits 62:57.
For 4-level paging, 15 bits of metadata can be encoded in bits 62:48.

QEMU strips the metadata bits from the address and brings it to canonical
shape before handling the memory access. This has to be done very early,
before the TLB lookup.
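
For illustration, the per-access decision described above boils down to
something like this (a hedged sketch; the helper and the raw bit constants are
illustrative, not the patch's actual code):

    static int lam_top_bit(CPUX86State *env, bool user_ptr)
    {
        if (user_ptr) {
            if (env->cr[3] & (1ULL << 62)) {        /* CR3.LAM_U57 */
                return 56;                          /* bits 62:57 are metadata */
            }
            if (env->cr[3] & (1ULL << 61)) {        /* CR3.LAM_U48 */
                return 47;                          /* bits 62:48 are metadata */
            }
        } else if (env->cr[4] & (1 << 28)) {        /* CR4.LAM_SUP */
            return (env->cr[4] & CR4_LA57_MASK) ? 56 : 47;
        }
        return -1;                                  /* LAM not active */
    }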

Signed-off-by: Kirill A. Shutemov 
---
 accel/tcg/cputlb.c   | 20 +---
 include/hw/core/tcg-cpu-ops.h|  5 +
 target/i386/cpu.c|  4 ++--
 target/i386/cpu.h| 26 +-
 target/i386/helper.c |  2 +-
 target/i386/tcg/helper-tcg.h |  1 +
 target/i386/tcg/sysemu/excp_helper.c | 28 +++-
 target/i386/tcg/sysemu/misc_helper.c |  3 +--
 target/i386/tcg/sysemu/svm_helper.c  |  3 +--
 target/i386/tcg/tcg-cpu.c|  1 +
 10 files changed, 81 insertions(+), 12 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 2035b2ac0ac0..15eff0df39c1 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1295,6 +1295,17 @@ static inline ram_addr_t 
qemu_ram_addr_from_host_nofail(void *ptr)
 return ram_addr;
 }
 
+static vaddr clean_addr(CPUArchState *env, vaddr addr)
+{
+CPUClass *cc = CPU_GET_CLASS(env_cpu(env));
+
+if (cc->tcg_ops->do_clean_addr) {
+addr = cc->tcg_ops->do_clean_addr(env_cpu(env), addr);
+}
+
+return addr;
+}
+
 /*
  * Note: tlb_fill() can trigger a resize of the TLB. This means that all of the
  * caller's prior references to the TLB table (e.g. CPUTLBEntry pointers) must
@@ -1757,10 +1768,11 @@ bool tlb_plugin_lookup(CPUState *cpu, target_ulong 
addr, int mmu_idx,
  *
  * @prot may be PAGE_READ, PAGE_WRITE, or PAGE_READ|PAGE_WRITE.
  */
-static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
+static void *atomic_mmu_lookup(CPUArchState *env, target_ulong address,
MemOpIdx oi, int size, int prot,
uintptr_t retaddr)
 {
+target_ulong addr = clean_addr(env, address);
 size_t mmu_idx = get_mmuidx(oi);
 MemOp mop = get_memop(oi);
 int a_bits = get_alignment_bits(mop);
@@ -1904,10 +1916,11 @@ load_memop(const void *haddr, MemOp op)
 }
 
 static inline uint64_t QEMU_ALWAYS_INLINE
-load_helper(CPUArchState *env, target_ulong addr, MemOpIdx oi,
+load_helper(CPUArchState *env, target_ulong address, MemOpIdx oi,
 uintptr_t retaddr, MemOp op, bool code_read,
 FullLoadHelper *full_load)
 {
+target_ulong addr = clean_addr(env, address);
 uintptr_t mmu_idx = get_mmuidx(oi);
 uintptr_t index = tlb_index(env, mmu_idx, addr);
 CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
@@ -2307,9 +2320,10 @@ store_helper_unaligned(CPUArchState *env, target_ulong 
addr, uint64_t val,
 }
 
 static inline void QEMU_ALWAYS_INLINE
-store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
+store_helper(CPUArchState *env, target_ulong address, uint64_t val,
  MemOpIdx oi, uintptr_t retaddr, MemOp op)
 {
+target_ulong addr = clean_addr(env, address);
 uintptr_t mmu_idx = get_mmuidx(oi);
 uintptr_t index = tlb_index(env, mmu_idx, addr);
 CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
diff --git a/include/hw/core/tcg-cpu-ops.h b/include/hw/core/tcg-cpu-ops.h
index e13898553aff..8e81f45510bf 100644
--- a/include/hw/core/tcg-cpu-ops.h
+++ b/include/hw/core/tcg-cpu-ops.h
@@ -82,6 +82,11 @@ struct TCGCPUOps {
 MMUAccessType access_type,
 int mmu_idx, uintptr_t retaddr) QEMU_NORETURN;
 
+/**
+ * @do_clean_addr: Callback for clearing metadata/tags from the address.
+ */
+vaddr (*do_clean_addr)(CPUState *cpu, vaddr addr);
+
 /**
  * @adjust_watchpoint_address: hack for cpu_check_watchpoint used by ARM
  */
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index cb6b5467d067..6e3e8473bf04 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -662,7 +662,7 @@ void x86_cpu_vendor_words2str(char *dst, uint32_t vendor1,
   /* CPUID_7_0_ECX_OSPKE is dynamic */ \
   CPUID_7_0_ECX_LA57 | CPUID_7_0_ECX_PKS)
 #define TCG_7_0_EDX_FEATURES 0
-#define TCG_7_1_EAX_FEATURES 0
+#define TCG_7_1_EAX_FEATURES CPUID_7_1_EAX_LAM
 #define TCG_APM_FEATURES 0
 #define TCG_6_EAX_FEATURES CPUID_6_EAX_ARAT
 #define TCG_XSAVE_FEATURES (CPUID_XSAVE_XSAVEOPT | CPUID_XSAVE_XGETBV1)
@@ -876,7 +876,7 @@ FeatureWordInfo feature_word_info[F

Re: [RFC v2 PATCH 13/13] KVM: Enable memfd based page invalidation/fallocate

2021-11-22 Thread Kirill A. Shutemov
On Fri, Nov 19, 2021 at 09:47:39PM +0800, Chao Peng wrote:
> Since the memory backing store does not get notified when VM is
> destroyed so need check if VM is still live in these callbacks.
> 
> Signed-off-by: Yu Zhang 
> Signed-off-by: Chao Peng 
> ---
>  virt/kvm/memfd.c | 22 ++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/virt/kvm/memfd.c b/virt/kvm/memfd.c
> index bd930dcb455f..bcfdc685ce22 100644
> --- a/virt/kvm/memfd.c
> +++ b/virt/kvm/memfd.c
> @@ -12,16 +12,38 @@
>  #include 
>  const static struct guest_mem_ops *memfd_ops;
>  
> +static bool vm_is_dead(struct kvm *vm)
> +{
> + struct kvm *kvm;
> +
> + list_for_each_entry(kvm, &vm_list, vm_list) {
> + if (kvm == vm)
> + return false;
> + }

I don't think this is enough. The struct kvm can be freed and re-allocated
from the slab and this function will give a false negative.

Maybe the kvm has to be tagged with a sequential id that is incremented on
every allocation. This id can be checked here.

> +
> + return true;
> +}
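
Something along these lines (a rough sketch; seq_id and the global counter are
hypothetical, nothing like this exists in KVM today):

    static u64 kvm_next_seq_id;         /* bumped under kvm_lock at VM creation */

    struct kvm {
        /* ... */
        u64 seq_id;                     /* identifies this particular allocation */
    };

    static bool vm_is_dead(struct kvm *vm, u64 expected_seq_id)
    {
        struct kvm *kvm;

        list_for_each_entry(kvm, &vm_list, vm_list) {
            if (kvm == vm && kvm->seq_id == expected_seq_id)
                return false;           /* same allocation, VM is still live */
        }

        return true;                    /* freed, or the slab slot was reused */
    }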

-- 
 Kirill A. Shutemov



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-22 Thread Kirill A. Shutemov
On Fri, Nov 19, 2021 at 02:51:11PM +0100, David Hildenbrand wrote:
> On 19.11.21 14:47, Chao Peng wrote:
> > From: "Kirill A. Shutemov" 
> > 
> > The new seal type provides semantics required for KVM guest private
> > memory support. A file descriptor with the seal set is going to be used
> > as source of guest memory in confidential computing environments such as
> > Intel TDX and AMD SEV.
> > 
> > F_SEAL_GUEST can only be set on empty memfd. After the seal is set
> > userspace cannot read, write or mmap the memfd.
> > 
> > Userspace is in charge of guest memory lifecycle: it can allocate the
> > memory with falloc or punch hole to free memory from the guest.
> > 
> > The file descriptor passed down to KVM as guest memory backend. KVM
> > register itself as the owner of the memfd via memfd_register_guest().
> > 
> > KVM provides callback that needed to be called on fallocate and punch
> > hole.
> > 
> > memfd_register_guest() returns callbacks that need be used for
> > requesting a new page from memfd.
> > 
> 
> Repeating the feedback I already shared in a private mail thread:
> 
> 
> As long as page migration / swapping is not supported, these pages
> behave like any longterm pinned pages (e.g., VFIO) or secretmem pages.
> 
> 1. These pages are not MOVABLE. They must not end up on ZONE_MOVABLE or
> MIGRATE_CMA.
> 
> That should be easy to handle, you have to adjust the gfp_mask to
>   mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> just as mm/secretmem.c:secretmem_file_create() does.

Okay, fair enough. mapping_set_unevictable() also makes sense.

> 2. These pages behave like mlocked pages and should be accounted as such.
> 
> This is probably where the accounting "fun" starts, but maybe it's
> easier than I think to handle.
> 
> See mm/secretmem.c:secretmem_mmap(), where we account the pages as
> VM_LOCKED and will consequently check per-process mlock limits. As we
> don't mmap(), the same approach cannot be reused.
> 
> See drivers/vfio/vfio_iommu_type1.c:vfio_pin_map_dma() and
> vfio_pin_pages_remote() on how to manually account via mm->locked_vm .
> 
> But it's a bit hairy because these pages are not actually mapped into
> the page tables of the MM, so it might need some thought. Similarly,
> these pages actually behave like "pinned" (as in mm->pinned_vm), but we
> just don't increase the refcount AFAIR. Again, accounting really is a
> bit hairy ...

Accounting is fun indeed. Non-mapped mlocked memory is going to be
confusing. Hm...

I will look closer.
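
For reference, the manual accounting in the vfio path referenced above has
roughly this shape (heavily simplified sketch, not the actual vfio code):

    /* charge npages against the owning mm's RLIMIT_MEMLOCK */
    mmap_write_lock(mm);
    if (mm->locked_vm + npages > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT) {
        ret = -ENOMEM;
    } else {
        mm->locked_vm += npages;
    }
    mmap_write_unlock(mm);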

-- 
 Kirill A. Shutemov



Re: [RFC PATCH 1/6] mm: Add F_SEAL_GUEST to shmem/memfd

2021-11-12 Thread Kirill A. Shutemov
On Thu, Nov 11, 2021 at 10:13:40PM +0800, Chao Peng wrote:
> The new seal is only allowed if there's no pre-existing pages in the fd
> and there's no existing mapping of the file. After the seal is set, no
> read/write/mmap from userspace is allowed.
> 
> Signed-off-by: Kirill A. Shutemov 
> Signed-off-by: Yu Zhang 
> Signed-off-by: Chao Peng 

Below is replacement patch with fallocate callback support.

I also replaced page_level with the order of the page because PG_LEVEL_2M/4K
is x86-specific and cannot be used in the generic code.

There's also a bugfix in guest_invalidate_page().


>From 9419ccb4bc3c1df4cc88f6c8ba212f4b1699 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" 
Date: Fri, 12 Nov 2021 21:27:40 +0300
Subject: [PATCH] mm/shmem: Introduce F_SEAL_GUEST

The new seal type provides semantics required for KVM guest private
memory support. A file descriptor with the seal set is going to be used
as source of guest memory in confidential computing environments such as
Intel TDX and AMD SEV.

F_SEAL_GUEST can only be set on empty memfd. After the seal is set
userspace cannot read, write or mmap the memfd.

Userspace is in charge of the guest memory lifecycle: it can allocate the
memory with fallocate or punch a hole to free memory from the guest.
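
For illustration, the intended userspace flow is roughly the following (a
sketch against this RFC's proposed ABI, so F_SEAL_GUEST only exists with these
patches applied; guest_size/offset/len are placeholders):

    int fd = memfd_create("guest-ram", MFD_ALLOW_SEALING);

    /* must be sealed while still empty; no read/write/mmap afterwards */
    fcntl(fd, F_ADD_SEALS, F_SEAL_GUEST);

    /* hand fd to KVM as the memory backend, then drive the lifecycle: */
    fallocate(fd, 0, 0, guest_size);                      /* allocate      */
    fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
              offset, len);                               /* free a range  */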

The file descriptor is passed down to KVM as the guest memory backend. KVM
registers itself as the owner of the memfd via memfd_register_guest().

KVM provides callbacks that need to be called on fallocate and punch
hole.

memfd_register_guest() returns callbacks that need to be used for
requesting a new page from the memfd.
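
On the KVM side the registration would look roughly like this (illustrative
sketch; the two kvm_memfd_* callbacks are hypothetical names, only the ops
structs below come from this patch):

    static const struct guest_ops kvm_guest_ops = {
        .invalidate_page_range = kvm_memfd_invalidate_range,  /* hypothetical */
        .fallocate             = kvm_memfd_fallocate,         /* hypothetical */
    };

    const struct guest_mem_ops *mem_ops;
    unsigned long pfn;
    int order;

    memfd_register_guest(inode, kvm, &kvm_guest_ops, &mem_ops);

    /* later, on a guest fault backed by this memfd: */
    pfn = mem_ops->get_lock_pfn(inode, offset, &order);
    /* ... map pfn (and up to 1 << order pages) into the guest ... */
    mem_ops->put_unlock_pfn(pfn);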

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/memfd.h  |  24 
 include/linux/shmem_fs.h   |   9 +++
 include/uapi/linux/fcntl.h |   1 +
 mm/memfd.c |  32 +-
 mm/shmem.c | 117 -
 5 files changed, 179 insertions(+), 4 deletions(-)

diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..500dfe88043e 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -4,13 +4,37 @@
 
 #include 
 
+struct guest_ops {
+   void (*invalidate_page_range)(struct inode *inode, void *owner,
+ pgoff_t start, pgoff_t end);
+   void (*fallocate)(struct inode *inode, void *owner,
+ pgoff_t start, pgoff_t end);
+};
+
+struct guest_mem_ops {
+   unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset,
+ int *order);
+   void (*put_unlock_pfn)(unsigned long pfn);
+
+};
+
 #ifdef CONFIG_MEMFD_CREATE
 extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long 
arg);
+
+extern inline int memfd_register_guest(struct inode *inode, void *owner,
+  const struct guest_ops *guest_ops,
+  const struct guest_mem_ops 
**guest_mem_ops);
 #else
 static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
 {
return -EINVAL;
 }
+static inline int memfd_register_guest(struct inode *inode, void *owner,
+  const struct guest_ops *guest_ops,
+  const struct guest_mem_ops 
**guest_mem_ops)
+{
+   return -EINVAL;
+}
 #endif
 
 #endif /* __LINUX_MEMFD_H */
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 8e775ce517bb..265d0c13bc5e 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -12,6 +12,9 @@
 
 /* inode in-kernel data */
 
+struct guest_ops;
+struct guest_mem_ops;
+
 struct shmem_inode_info {
spinlock_t  lock;
unsigned intseals;  /* shmem seals */
@@ -24,6 +27,8 @@ struct shmem_inode_info {
struct simple_xattrsxattrs; /* list of xattrs */
atomic_tstop_eviction;  /* hold when working on inode */
struct inodevfs_inode;
+   void*guest_owner;
+   const struct guest_ops  *guest_ops;
 };
 
 struct shmem_sb_info {
@@ -90,6 +95,10 @@ extern unsigned long shmem_swap_usage(struct vm_area_struct 
*vma);
 extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
pgoff_t start, pgoff_t end);
 
+extern int shmem_register_guest(struct inode *inode, void *owner,
+   const struct guest_ops *guest_ops,
+   const struct guest_mem_ops **guest_mem_ops);
+
 /* Flag allocation requirements to shmem_getpage */
 enum sgp_type {
SGP_READ,   /* don't exceed i_size, don't allocate page */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 2f86b2ad6d7e..c79bc8572721 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -43,6 +43,7 @@
 #define F_SEAL_GROW0x0004  /* prevent file from growing */
 #define F_SEAL

Re: [Qemu-devel] [PATCH v2 0/2] mux chardev events regression fix

2017-12-20 Thread Kirill A. Shutemov
On Fri, Nov 03, 2017 at 03:28:22PM +, Marc-André Lureau wrote:
> Hi,
> 
> The following patches fix and test the behaviour of mux chardev
> events, after a regression introduced in qemu 2.8.0.
> 
> v1->v2:
> - fix incompatible pointer type warning spotted by patchew

Looks like this patchset got stuck and never reached upstream.

Any update on this?

-- 
 Kirill A. Shutemov



Re: [Qemu-devel] [PATCH v2 1/2] chardev: fix backend events regression with mux chardev

2017-11-03 Thread Kirill A. Shutemov
On Fri, Nov 03, 2017 at 03:28:23PM +, Marc-André Lureau wrote:
> Kirill noticed that on recent versions of QEMU he was not able to
> trigger SysRq to invoke the debug capabilities of the Linux kernel.  He
> tracked it down to qemu_chr_be_event() ignoring CHR_EVENT_BREAK due to
> s->be being NULL. The bug was introduced in 2.8, commit a4afa548fc6d ("char:
> move front end handlers in CharBackend"). Since that commit,
> qemu_chr_be_event() has failed to deliver CHR_EVENT_BREAK because
> qemu_chr_fe_init() does not set s->be in the mux case.
> 
> Let's fix this by teaching mux to send an event to the frontend with
> the focus.
> 
> Reported-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
> Signed-off-by: Marc-André Lureau <marcandre.lur...@redhat.com>
> Fixes: a4afa548fc6d ("char: move front end handlers in CharBackend")

Works for me.

Tested-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>

-- 
 Kirill A. Shutemov



[Qemu-devel] [PATCH] chardev: don't forget to set backend for mux

2017-11-03 Thread Kirill A. Shutemov
I noticed that on recent versions of QEMU I was not able to trigger
SysRq to invoke the debug capabilities of the Linux kernel.

I've tracked it down to qemu_chr_be_event() ignoring CHR_EVENT_BREAK due to
s->be being NULL.

Looks like the bug was introduced in a4afa548fc6d ("char: move front end
handlers in CharBackend"). Since that commit, qemu_chr_be_event() has failed
to deliver CHR_EVENT_BREAK because qemu_chr_fe_init() forgets to initialize
s->be in the mux case.

Let's fix this.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
Fixes: a4afa548fc6d ("char: move front end handlers in CharBackend")
---
 chardev/char-fe.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/chardev/char-fe.c b/chardev/char-fe.c
index ee6d59610031..d4a54947a567 100644
--- a/chardev/char-fe.c
+++ b/chardev/char-fe.c
@@ -209,9 +209,8 @@ bool qemu_chr_fe_init(CharBackend *b, Chardev *s, Error 
**errp)
 tag = d->mux_cnt++;
 } else if (s->be) {
 goto unavailable;
-} else {
-s->be = b;
 }
+s->be = b;
 
 b->fe_open = false;
 b->tag = tag;
-- 
2.14.2




Re: [Qemu-devel] [PATCH] x86: implement la57 paging mode

2016-12-22 Thread Kirill A. Shutemov
On Fri, Dec 16, 2016 at 01:59:36PM +0100, Paolo Bonzini wrote:
> 
> 
> On 15/12/2016 01:13, Kirill A. Shutemov wrote:
> > The new paging more is extension of IA32e mode with more additional page
> > table level.
> > 
> > It brings support of 57-bit vitrual address space (128PB) and 52-bit
> > physical address space (4PB).
> > 
> > The structure of new page table level is identical to pml4.
> > 
> > The feature is enumerated with CPUID.(EAX=07H, ECX=0):ECX[bit 16].
> > 
> > CR4.LA57[bit 12] need to be set when pageing enables to activate 5-level
> > paging mode.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
> 
> Looks good, thanks!  The target-i386/translate.c bits are not necessary,
> but I guess they can also be removed on commit.

Is there anything else I need to do to get it applied?

-- 
 Kirill A. Shutemov



Re: [Qemu-devel] [PATCH] x86: implement la57 paging mode

2016-12-16 Thread Kirill A. Shutemov
On Fri, Dec 16, 2016 at 01:59:36PM +0100, Paolo Bonzini wrote:
> 
> 
> On 15/12/2016 01:13, Kirill A. Shutemov wrote:
> > The new paging more is extension of IA32e mode with more additional page
> > table level.
> > 
> > It brings support of 57-bit vitrual address space (128PB) and 52-bit
> > physical address space (4PB).
> > 
> > The structure of new page table level is identical to pml4.
> > 
> > The feature is enumerated with CPUID.(EAX=07H, ECX=0):ECX[bit 16].
> > 
> > CR4.LA57[bit 12] need to be set when pageing enables to activate 5-level
> > paging mode.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
> 
> Looks good, thanks!  The target-i386/translate.c bits are not necessary,
> but I guess they can also be removed on commit.
> 
> Any chance you could also implement the MPX bits?

I don't have time for this right now. Maybe later.

-- 
 Kirill A. Shutemov



[Qemu-devel] [PATCH] x86: implement la57 paging mode

2016-12-14 Thread Kirill A. Shutemov
The new paging mode is an extension of IA32e mode with one additional page
table level.

It brings support for a 57-bit virtual address space (128PB) and a 52-bit
physical address space (4PB).

The structure of the new page table level is identical to pml4.

The feature is enumerated with CPUID.(EAX=07H, ECX=0):ECX[bit 16].

CR4.LA57[bit 12] needs to be set when paging is enabled to activate 5-level
paging mode.
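
For illustration, a canonical 57-bit linear address decomposes like this (just
a sketch mirroring the page walk below, not part of the patch):

    /* 5-level paging: 9 bits of index per level, 12-bit page offset */
    unsigned pml5_index = (addr >> 48) & 0x1ff;
    unsigned pml4_index = (addr >> 39) & 0x1ff;
    unsigned pdpt_index = (addr >> 30) & 0x1ff;
    unsigned pd_index   = (addr >> 21) & 0x1ff;
    unsigned pt_index   = (addr >> 12) & 0x1ff;
    unsigned offset     =  addr        & 0xfff;
    /* bits 63:57 must be a sign-extension of bit 56 for the address
     * to be canonical */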

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 target-i386/arch_memory_mapping.c |  42 ++-
 target-i386/cpu.c |  16 ++-
 target-i386/cpu.h |   2 +
 target-i386/helper.c  |  54 +++--
 target-i386/monitor.c | 234 +-
 target-i386/translate.c   |   2 +
 6 files changed, 276 insertions(+), 74 deletions(-)

diff --git a/target-i386/arch_memory_mapping.c 
b/target-i386/arch_memory_mapping.c
index 88f341e1bbd0..826aee597b13 100644
--- a/target-i386/arch_memory_mapping.c
+++ b/target-i386/arch_memory_mapping.c
@@ -220,7 +220,8 @@ static void walk_pdpe(MemoryMappingList *list, AddressSpace 
*as,
 
 /* IA-32e Paging */
 static void walk_pml4e(MemoryMappingList *list, AddressSpace *as,
-   hwaddr pml4e_start_addr, int32_t a20_mask)
+   hwaddr pml4e_start_addr, int32_t a20_mask,
+   target_ulong start_line_addr)
 {
 hwaddr pml4e_addr, pdpe_start_addr;
 uint64_t pml4e;
@@ -236,11 +237,34 @@ static void walk_pml4e(MemoryMappingList *list, 
AddressSpace *as,
 continue;
 }
 
-line_addr = ((i & 0x1ffULL) << 39) | (0xffffULL << 48);
+line_addr = start_line_addr | ((i & 0x1ffULL) << 39);
 pdpe_start_addr = (pml4e & PLM4_ADDR_MASK) & a20_mask;
 walk_pdpe(list, as, pdpe_start_addr, a20_mask, line_addr);
 }
 }
+
+static void walk_pml5e(MemoryMappingList *list, AddressSpace *as,
+   hwaddr pml5e_start_addr, int32_t a20_mask)
+{
+hwaddr pml5e_addr, pml4e_start_addr;
+uint64_t pml5e;
+target_ulong line_addr;
+int i;
+
+for (i = 0; i < 512; i++) {
+pml5e_addr = (pml5e_start_addr + i * 8) & a20_mask;
+pml5e = address_space_ldq(as, pml5e_addr, MEMTXATTRS_UNSPECIFIED,
+  NULL);
+if (!(pml5e & PG_PRESENT_MASK)) {
+/* not present */
+continue;
+}
+
+line_addr = (0x7fULL << 57) | ((i & 0x1ffULL) << 48);
+pml4e_start_addr = (pml5e & PLM4_ADDR_MASK) & a20_mask;
+walk_pml4e(list, as, pml4e_start_addr, a20_mask, line_addr);
+}
+}
 #endif
 
 void x86_cpu_get_memory_mapping(CPUState *cs, MemoryMappingList *list,
@@ -257,10 +281,18 @@ void x86_cpu_get_memory_mapping(CPUState *cs, 
MemoryMappingList *list,
 if (env->cr[4] & CR4_PAE_MASK) {
 #ifdef TARGET_X86_64
 if (env->hflags & HF_LMA_MASK) {
-hwaddr pml4e_addr;
+if (env->cr[4] & CR4_LA57_MASK) {
+hwaddr pml5e_addr;
+
+pml5e_addr = (env->cr[3] & PLM4_ADDR_MASK) & env->a20_mask;
+walk_pml5e(list, cs->as, pml5e_addr, env->a20_mask);
+} else {
+hwaddr pml4e_addr;
 
-pml4e_addr = (env->cr[3] & PLM4_ADDR_MASK) & env->a20_mask;
-walk_pml4e(list, cs->as, pml4e_addr, env->a20_mask);
+pml4e_addr = (env->cr[3] & PLM4_ADDR_MASK) & env->a20_mask;
+walk_pml4e(list, cs->as, pml4e_addr, env->a20_mask,
+0xffffULL << 48);
+}
 } else
 #endif
 {
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index de1f30eeda63..a4b9832b5916 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -238,7 +238,8 @@ static void x86_cpu_vendor_words2str(char *dst, uint32_t 
vendor1,
   CPUID_7_0_EBX_HLE, CPUID_7_0_EBX_AVX2,
   CPUID_7_0_EBX_INVPCID, CPUID_7_0_EBX_RTM,
   CPUID_7_0_EBX_RDSEED */
-#define TCG_7_0_ECX_FEATURES (CPUID_7_0_ECX_PKU | CPUID_7_0_ECX_OSPKE)
+#define TCG_7_0_ECX_FEATURES (CPUID_7_0_ECX_PKU | CPUID_7_0_ECX_OSPKE | \
+  CPUID_7_0_ECX_LA57)
 #define TCG_7_0_EDX_FEATURES 0
 #define TCG_APM_FEATURES 0
 #define TCG_6_EAX_FEATURES CPUID_6_EAX_ARAT
@@ -435,7 +436,7 @@ static FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
 "ospke", NULL, NULL, NULL,
 NULL, NULL, NULL, NULL,
 NULL, NULL, NULL, NULL,
-NULL, NULL, NULL, NULL,
+"la57", NULL, NULL, NULL,
 NULL, NULL, "rdpid", NULL,
 NULL, NULL, NULL, NULL,
 NULL, NULL, NULL, NULL,
@@ -2742,10 +2743,13 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 case 0x8008:
 /* virtual & 

[Qemu-devel] [QEMU, PATCH] x86: implement la57 paging mode

2016-12-08 Thread Kirill A. Shutemov
The new paging mode is an extension of IA32e mode with one additional page
table level.

It brings support for a 57-bit virtual address space (128PB) and a 52-bit
physical address space (4PB).

The structure of the new page table level is identical to pml4.

The feature is enumerated with CPUID.(EAX=07H, ECX=0):ECX[bit 16].

CR4.LA57[bit 12] needs to be set when paging is enabled to activate 5-level
paging mode.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
Cc: qemu-devel@nongnu.org
---
 target-i386/arch_memory_mapping.c |  42 --
 target-i386/cpu.c |  16 ++--
 target-i386/cpu.h |   2 +
 target-i386/helper.c  |  54 ++--
 target-i386/monitor.c | 167 --
 target-i386/translate.c   |   2 +
 6 files changed, 238 insertions(+), 45 deletions(-)

diff --git a/target-i386/arch_memory_mapping.c 
b/target-i386/arch_memory_mapping.c
index 88f341e1bbd0..826aee597b13 100644
--- a/target-i386/arch_memory_mapping.c
+++ b/target-i386/arch_memory_mapping.c
@@ -220,7 +220,8 @@ static void walk_pdpe(MemoryMappingList *list, AddressSpace 
*as,
 
 /* IA-32e Paging */
 static void walk_pml4e(MemoryMappingList *list, AddressSpace *as,
-   hwaddr pml4e_start_addr, int32_t a20_mask)
+   hwaddr pml4e_start_addr, int32_t a20_mask,
+   target_ulong start_line_addr)
 {
 hwaddr pml4e_addr, pdpe_start_addr;
 uint64_t pml4e;
@@ -236,11 +237,34 @@ static void walk_pml4e(MemoryMappingList *list, 
AddressSpace *as,
 continue;
 }
 
-line_addr = ((i & 0x1ffULL) << 39) | (0xffffULL << 48);
+line_addr = start_line_addr | ((i & 0x1ffULL) << 39);
 pdpe_start_addr = (pml4e & PLM4_ADDR_MASK) & a20_mask;
 walk_pdpe(list, as, pdpe_start_addr, a20_mask, line_addr);
 }
 }
+
+static void walk_pml5e(MemoryMappingList *list, AddressSpace *as,
+   hwaddr pml5e_start_addr, int32_t a20_mask)
+{
+hwaddr pml5e_addr, pml4e_start_addr;
+uint64_t pml5e;
+target_ulong line_addr;
+int i;
+
+for (i = 0; i < 512; i++) {
+pml5e_addr = (pml5e_start_addr + i * 8) & a20_mask;
+pml5e = address_space_ldq(as, pml5e_addr, MEMTXATTRS_UNSPECIFIED,
+  NULL);
+if (!(pml5e & PG_PRESENT_MASK)) {
+/* not present */
+continue;
+}
+
+line_addr = (0x7fULL << 57) | ((i & 0x1ffULL) << 48);
+pml4e_start_addr = (pml5e & PLM4_ADDR_MASK) & a20_mask;
+walk_pml4e(list, as, pml4e_start_addr, a20_mask, line_addr);
+}
+}
 #endif
 
 void x86_cpu_get_memory_mapping(CPUState *cs, MemoryMappingList *list,
@@ -257,10 +281,18 @@ void x86_cpu_get_memory_mapping(CPUState *cs, 
MemoryMappingList *list,
 if (env->cr[4] & CR4_PAE_MASK) {
 #ifdef TARGET_X86_64
 if (env->hflags & HF_LMA_MASK) {
-hwaddr pml4e_addr;
+if (env->cr[4] & CR4_LA57_MASK) {
+hwaddr pml5e_addr;
+
+pml5e_addr = (env->cr[3] & PLM4_ADDR_MASK) & env->a20_mask;
+walk_pml5e(list, cs->as, pml5e_addr, env->a20_mask);
+} else {
+hwaddr pml4e_addr;
 
-pml4e_addr = (env->cr[3] & PLM4_ADDR_MASK) & env->a20_mask;
-walk_pml4e(list, cs->as, pml4e_addr, env->a20_mask);
+pml4e_addr = (env->cr[3] & PLM4_ADDR_MASK) & env->a20_mask;
+walk_pml4e(list, cs->as, pml4e_addr, env->a20_mask,
+0xffffULL << 48);
+}
 } else
 #endif
 {
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index de1f30eeda63..a4b9832b5916 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -238,7 +238,8 @@ static void x86_cpu_vendor_words2str(char *dst, uint32_t 
vendor1,
   CPUID_7_0_EBX_HLE, CPUID_7_0_EBX_AVX2,
   CPUID_7_0_EBX_INVPCID, CPUID_7_0_EBX_RTM,
   CPUID_7_0_EBX_RDSEED */
-#define TCG_7_0_ECX_FEATURES (CPUID_7_0_ECX_PKU | CPUID_7_0_ECX_OSPKE)
+#define TCG_7_0_ECX_FEATURES (CPUID_7_0_ECX_PKU | CPUID_7_0_ECX_OSPKE | \
+  CPUID_7_0_ECX_LA57)
 #define TCG_7_0_EDX_FEATURES 0
 #define TCG_APM_FEATURES 0
 #define TCG_6_EAX_FEATURES CPUID_6_EAX_ARAT
@@ -435,7 +436,7 @@ static FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
 "ospke", NULL, NULL, NULL,
 NULL, NULL, NULL, NULL,
 NULL, NULL, NULL, NULL,
-NULL, NULL, NULL, NULL,
+"la57", NULL, NULL, NULL,
 NULL, NULL, "rdpid", NULL,
 NULL, NULL, NULL, NULL,
 NULL, NULL, NULL, NULL,
@@ -2742,10 +2743,13 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 case 0x80

Re: [Qemu-devel] [PATCH 1/1] mm: thp: kvm: fix memory corruption in KVM with THP enabled

2016-04-27 Thread Kirill A. Shutemov
On Wed, Apr 27, 2016 at 04:59:57PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 27, 2016 at 04:50:30PM +0300, Kirill A. Shutemov wrote:
> > I know nothing about kvm. How do you protect against pmd splitting between
> > get_user_pages() and the check?
> 
> get_user_pages_fast() runs fully lockless and unpins the page right
> away (we need a get_user_pages_fast without the FOLL_GET in fact to
> avoid a totally useless atomic_inc/dec!).
> 
> Then we take a lock that is also taken by
> mmu_notifier_invalidate_range_start. This way __split_huge_pmd will
> block in mmu_notifier_invalidate_range_start if it tries to run again
> (every other mmu notifier like mmu_notifier_invalidate_page will also
> block).
> 
> Then after we serialized against __split_huge_pmd through the MMU
> notifier KVM internal locking, we are able to tell if any mmu_notifier
> invalidate happened in the region just before get_user_pages_fast()
> was invoked, until we call PageCompoundTransMap and we actually map
> the shadow pagetable into the compound page with hugepage
> granularity (to allow real 2MB TLBs if guest also uses trans_huge_pmd
> in the guest pagetables).
> 
> After the shadow pagetable is mapped, we drop the internal MMU
> notifier lock and __split_huge_pmd mmu_notifier_invalidate_range_start
> can continue and drop the shadow pagetable that we just mapped in the
> above paragraph just before dropping the mmu notifier internal lock.
> 
> To be able to tell if any invalidate happened while
> get_user_pages_fast was running and until we grab the lock again and
> we start mapping the shadow pagtable we use:
> 
>   mmu_seq = vcpu->kvm->mmu_notifier_seq;
>   smp_rmb();
> 
>   if (try_async_pf(vcpu, prefault, gfn, v, &pfn, write, &map_writable))
>    this is get_user_pages and does put_page on the page
>and just returns the 
>this is why we need a get_user_pages_fast that won't
>attempt to touch the page->_count at all! we can avoid
>2 atomic ops for each secondary MMU fault that way
>   return 0;
> 
>   spin_lock(&vcpu->kvm->mmu_lock);
>   if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
>   goto out_unlock;
>   ... here we check PageTransCompoundMap(pfn_to_page(pfn)) and
>   map a 4k or 2MB shadow pagetable on "pfn" ...
> 
> 
> Note mmu_notifier_retry does the other side of the smp_rmb():
> 
>   smp_rmb();
>   if (kvm->mmu_notifier_seq != mmu_seq)
>   return 1;
>   return 0;

Okay, I see.

But do we really want to make PageTransCompoundMap() visible beyond KVM
code? It looks too KVM-specific.
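
For readers following along, the scheme Andrea describes is essentially a
sequence-count retry around the lockless lookup (a generic sketch, not KVM's
actual code; lockless_gup() and map_into_shadow() are made-up names):

    retry:
        seq = kvm->mmu_notifier_seq;        /* sample before the lookup */
        smp_rmb();
        pfn = lockless_gup(addr);           /* get_user_pages_fast, no lock */

        spin_lock(&kvm->mmu_lock);
        smp_rmb();
        if (seq != kvm->mmu_notifier_seq) { /* an invalidate ran in between */
            spin_unlock(&kvm->mmu_lock);
            goto retry;
        }
        map_into_shadow(pfn);               /* no invalidate can be active here */
        spin_unlock(&kvm->mmu_lock);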

-- 
 Kirill A. Shutemov



Re: [Qemu-devel] [PATCH 1/1] mm: thp: kvm: fix memory corruption in KVM with THP enabled

2016-04-27 Thread Kirill A. Shutemov
 each compound page and
> + * to immediately map the entire compound page with a single secondary
> + * MMU fault. If there will be a pmd split later, the secondary MMUs
> + * will get an update through the MMU notifier invalidation through
> + * split_huge_pmd().
> + *
> + * Unlike PageTransCompound, this is safe to be called only while
> + * split_huge_pmd() cannot run from under us, like if protected by the
> + * MMU notifier, otherwise it may result in page->_mapcount < 0 false
> + * positives.
> + */

I know nothing about kvm. How do you protect against pmd splitting between
get_user_pages() and the check?

And the helper looks highly kvm-specific, doesn't it?

> +static inline int PageTransCompoundMap(struct page *page)
> +{
> + return PageTransCompound(page) && atomic_read(&page->_mapcount) < 0;
> +}
> +
> +/*
>   * PageTransTail returns true for both transparent huge pages
>   * and hugetlbfs pages, so it should only be called when it's known
>   * that hugetlbfs pages aren't involved.
> @@ -559,6 +580,7 @@ static inline int TestClearPageDoubleMap(struct page 
> *page)
>  #else
>  TESTPAGEFLAG_FALSE(TransHuge)
>  TESTPAGEFLAG_FALSE(TransCompound)
> +TESTPAGEFLAG_FALSE(TransCompoundMap)
>  TESTPAGEFLAG_FALSE(TransTail)
>  TESTPAGEFLAG_FALSE(DoubleMap)
>   TESTSETFLAG_FALSE(DoubleMap)

-- 
 Kirill A. Shutemov



Re: [Qemu-devel] post-copy is broken?

2016-04-15 Thread Kirill A. Shutemov
On Fri, Apr 15, 2016 at 02:42:33PM +0100, Dr. David Alan Gilbert wrote:
> * Kirill A. Shutemov (kir...@shutemov.name) wrote:
> > On Thu, Apr 14, 2016 at 12:22:30PM -0400, Andrea Arcangeli wrote:
> > > Adding linux-mm too,
> > > 
> > > On Thu, Apr 14, 2016 at 01:34:41PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Andrea Arcangeli (aarca...@redhat.com) wrote:
> > > > 
> > > > > The next suspect is the massive THP refcounting change that went
> > > > > upstream recently:
> > > > 
> > > > > As further debug hint, can you try to disable THP and see if that
> > > > > makes the problem go away?
> > > > 
> > > > Yep, this seems to be the problem (cc'ing in Kirill).
> > > > 
> > > > 122afea9626ab3f717b250a8dd3d5ebf57cdb56c - works (just before Kirill 
> > > > disables THP)
> > > > 61f5d698cc97600e813ca5cf8e449b1ea1c11492 - breaks (when THP is 
> > > > reenabled)
> > > > 
> > > > It's pretty reliable; as you say disabling THP makes it work again
> > > > and putting it back to THP/madvise mode makes it break.  And you need
> > > > to test on a machine with some free ram to make sure THP has a chance
> > > > to have happened.
> > > > 
> > > > I'm not sure of all of the rework that happened in that series,
> > > > but my reading of it is that splitting of THP pages gets deferred;
> > > > so I wonder if when I do the madvise to turn THP off, if it's actually
> > > > still got THP pages and thus we end up with a whole THP mapped
> > > > when I'm expecting to be userfaulting those pages.
> > > 
> > > Good thing at least I didn't make UFFDIO_COPY THP aware yet so there's
> > > less variables (as no user was interested to handle userfaults at THP
> > > granularity yet, and from userland such an improvement would be
> > > completely invisible in terms of API, so if an user starts doing that
> > > we can just optimize the kernel for it, criu restore could do that as
> > > the faults will come from disk-I/O, when network is involved THP
> > > userfaults wouldn't have a great tradeoff with regard to the increased
> > > fault latency).
> > > 
> > > I suspect there is an handle_userfault missing somewhere in connection
> > > with trans_huge_pmd splits (not anymore THP splits) that you're doing
> > > with MADV_DONTNEED to zap those pages in the destination that got
> > > redirtied in source during the last precopy stage. Or more simply
> > > MADV_DONTNEED isn't zapping all the right ptes after the trans huge
> > > pmd got splitted.
> > > 
> > > The fact the page isn't splitted shouldn't matter too much, all we care
> > > about is the pte triggers handle_userfault after MADV_DONTNEED.
> > > 
> > > The userfaultfd testcase in the kernel isn't exercising this case
> > > unfortunately, that should probably be improved too, so there is a
> > > simpler way to reproduce than running precopy before postcopy in qemu.
> > 
> > I've tested current Linus' tree and v4.5 using qemu postcopy test case for
> > both x86-64 and i386 and it never failed for me:
> > 
> > /x86_64/postcopy: first_byte = 7e last_byte = 7d hit_edge = 1 OK
> > OK
> > /i386/postcopy: first_byte = f6 last_byte = f5 hit_edge = 1 OK
> > OK
> > 
> > I've run it directly, setting relevant QTEST_QEMU_BINARY.
> 
> Interesting; it's failing reliably for me - but only with a reasonably
> freshly booted machine (so that the pages get THPd).

The same here. Freshly booted machine with 64GiB of RAM. I've checked
/proc/vmstat: huge pages were allocated.

-- 
 Kirill A. Shutemov



Re: [Qemu-devel] post-copy is broken?

2016-04-15 Thread Kirill A. Shutemov
On Thu, Apr 14, 2016 at 12:22:30PM -0400, Andrea Arcangeli wrote:
> Adding linux-mm too,
> 
> On Thu, Apr 14, 2016 at 01:34:41PM +0100, Dr. David Alan Gilbert wrote:
> > * Andrea Arcangeli (aarca...@redhat.com) wrote:
> > 
> > > The next suspect is the massive THP refcounting change that went
> > > upstream recently:
> > 
> > > As further debug hint, can you try to disable THP and see if that
> > > makes the problem go away?
> > 
> > Yep, this seems to be the problem (cc'ing in Kirill).
> > 
> > 122afea9626ab3f717b250a8dd3d5ebf57cdb56c - works (just before Kirill 
> > disables THP)
> > 61f5d698cc97600e813ca5cf8e449b1ea1c11492 - breaks (when THP is reenabled)
> > 
> > It's pretty reliable; as you say disabling THP makes it work again
> > and putting it back to THP/madvise mode makes it break.  And you need
> > to test on a machine with some free ram to make sure THP has a chance
> > to have happened.
> > 
> > I'm not sure of all of the rework that happened in that series,
> > but my reading of it is that splitting of THP pages gets deferred;
> > so I wonder if when I do the madvise to turn THP off, if it's actually
> > still got THP pages and thus we end up with a whole THP mapped
> > when I'm expecting to be userfaulting those pages.
> 
> Good thing at least I didn't make UFFDIO_COPY THP aware yet so there's
> less variables (as no user was interested to handle userfaults at THP
> granularity yet, and from userland such an improvement would be
> completely invisible in terms of API, so if an user starts doing that
> we can just optimize the kernel for it, criu restore could do that as
> the faults will come from disk-I/O, when network is involved THP
> userfaults wouldn't have a great tradeoff with regard to the increased
> fault latency).
> 
> I suspect there is an handle_userfault missing somewhere in connection
> with trans_huge_pmd splits (not anymore THP splits) that you're doing
> with MADV_DONTNEED to zap those pages in the destination that got
> redirtied in source during the last precopy stage. Or more simply
> MADV_DONTNEED isn't zapping all the right ptes after the trans huge
> pmd got splitted.
> 
> The fact the page isn't splitted shouldn't matter too much, all we care
> about is the pte triggers handle_userfault after MADV_DONTNEED.
> 
> The userfaultfd testcase in the kernel isn't exercising this case
> unfortunately, that should probably be improved too, so there is a
> simpler way to reproduce than running precopy before postcopy in qemu.

I've tested current Linus' tree and v4.5 using qemu postcopy test case for
both x86-64 and i386 and it never failed for me:

/x86_64/postcopy: first_byte = 7e last_byte = 7d hit_edge = 1 OK
OK
/i386/postcopy: first_byte = f6 last_byte = f5 hit_edge = 1 OK
OK

I've run it directly, setting relevant QTEST_QEMU_BINARY.

-- 
 Kirill A. Shutemov



Re: [Qemu-devel] [PATCH 07/17] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits

2014-10-07 Thread Kirill A. Shutemov
On Fri, Oct 03, 2014 at 07:07:57PM +0200, Andrea Arcangeli wrote:
 We run out of 32bits in vm_flags, noop change for 64bit archs.
 
 Signed-off-by: Andrea Arcangeli aarca...@redhat.com
 ---
  fs/proc/task_mmu.c   | 4 ++--
  include/linux/huge_mm.h  | 4 ++--
  include/linux/ksm.h  | 4 ++--
  include/linux/mm_types.h | 2 +-
  mm/huge_memory.c | 2 +-
  mm/ksm.c | 2 +-
  mm/madvise.c | 2 +-
  mm/mremap.c  | 2 +-
  8 files changed, 11 insertions(+), 11 deletions(-)
 
 diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
 index c341568..ee1c3a2 100644
 --- a/fs/proc/task_mmu.c
 +++ b/fs/proc/task_mmu.c
 @@ -532,11 +532,11 @@ static void show_smap_vma_flags(struct seq_file *m, 
 struct vm_area_struct *vma)
   /*
* Don't forget to update Documentation/ on changes.
*/
 - static const char mnemonics[BITS_PER_LONG][2] = {
 + static const char mnemonics[BITS_PER_LONG+1][2] = {

I believe here and below it should be BITS_PER_LONG_LONG instead: it will
catch unknown vmflags. And the +1 is not needed on 64-bit systems.

   /*
* In case if we meet a flag we don't know about.
*/
 - [0 ... (BITS_PER_LONG-1)] = ??,
 + [0 ... (BITS_PER_LONG)] = ??,
  
   [ilog2(VM_READ)]= rd,
   [ilog2(VM_WRITE)]   = wr,
-- 
 Kirill A. Shutemov



Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT

2014-10-07 Thread Kirill A. Shutemov
On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
 MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
 vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
 userland touches a still unmapped virtual address, a sigbus signal is
 sent instead of allocating a new page. The sigbus signal handler will
 then resolve the page fault in userland by calling the
 remap_anon_pages syscall.

Hm. I wonder if this functionality really fits the madvise(2) interface: as
far as I understand it, it provides a way to give a *hint* to the kernel which
may or may not trigger an action from the kernel side. I don't think an
application will behave reasonably if the kernel ignores the *advice* and
allocates memory instead of sending SIGBUS.

I would suggest considering some other interface for the
functionality: a new syscall or, perhaps, mprotect().
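
For context, the proposed flow from userspace is roughly this (a sketch
against the series' proposed ABI; MADV_USERFAULT and remap_anon_pages() exist
only with these patches applied):

    static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
    {
        void *fault_addr = si->si_addr;
        /* resolve the fault, e.g. by fetching the page from the network,
         * then remap_anon_pages(fault_addr, staging_page, page_size); */
    }

    struct sigaction sa = { .sa_sigaction = sigbus_handler, .sa_flags = SA_SIGINFO };
    sigaction(SIGBUS, &sa, NULL);

    void *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    madvise(area, len, MADV_USERFAULT);   /* faults raise SIGBUS instead of
                                             allocating a new page */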

-- 
 Kirill A. Shutemov



Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT

2014-10-07 Thread Kirill A. Shutemov
On Tue, Oct 07, 2014 at 11:46:04AM +0100, Dr. David Alan Gilbert wrote:
 * Kirill A. Shutemov (kir...@shutemov.name) wrote:
  On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
   MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
   vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
   userland touches a still unmapped virtual address, a sigbus signal is
   sent instead of allocating a new page. The sigbus signal handler will
   then resolve the page fault in userland by calling the
   remap_anon_pages syscall.
  
  Hm. I wounder if this functionality really fits madvise(2) interface: as
  far as I understand it, it provides a way to give a *hint* to kernel which
  may or may not trigger an action from kernel side. I don't think an
  application will behaive reasonably if kernel ignore the *advise* and will
  not send SIGBUS, but allocate memory.
 
 Aren't DONTNEED and DONTDUMP  similar cases of madvise operations that are
 expected to do what they say ?

No. If the kernel ignored MADV_DONTNEED or MADV_DONTDUMP it would not
affect correctness; the behaviour would just be suboptimal: more memory used
than needed, or wasted space in a coredump.

-- 
 Kirill A. Shutemov



Re: [Qemu-devel] [PATCH 10/17] mm: rmap preparation for remap_anon_pages

2014-10-07 Thread Kirill A. Shutemov
On Fri, Oct 03, 2014 at 07:08:00PM +0200, Andrea Arcangeli wrote:
 There's one constraint enforced to allow this simplification: the
 source pages passed to remap_anon_pages must be mapped only in one
 vma, but this is not a limitation when used to handle userland page
 faults with MADV_USERFAULT. The source addresses passed to
 remap_anon_pages should be set as VM_DONTCOPY with MADV_DONTFORK to
 avoid any risk of the mapcount of the pages increasing, if fork runs
 in parallel in another thread, before or while remap_anon_pages runs.

Have you considered triggering COW instead of adding a limitation on the
pages' mapcount? The limitation looks artificial from the interface POV.

-- 
 Kirill A. Shutemov



Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT

2014-10-07 Thread Kirill A. Shutemov
On Tue, Oct 07, 2014 at 12:01:02PM +0100, Dr. David Alan Gilbert wrote:
 * Kirill A. Shutemov (kir...@shutemov.name) wrote:
  On Tue, Oct 07, 2014 at 11:46:04AM +0100, Dr. David Alan Gilbert wrote:
   * Kirill A. Shutemov (kir...@shutemov.name) wrote:
On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
 MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
 vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
 userland touches a still unmapped virtual address, a sigbus signal is
 sent instead of allocating a new page. The sigbus signal handler will
 then resolve the page fault in userland by calling the
 remap_anon_pages syscall.

Hm. I wounder if this functionality really fits madvise(2) interface: as
far as I understand it, it provides a way to give a *hint* to kernel 
which
may or may not trigger an action from kernel side. I don't think an
application will behaive reasonably if kernel ignore the *advise* and 
will
not send SIGBUS, but allocate memory.
   
   Aren't DONTNEED and DONTDUMP  similar cases of madvise operations that are
   expected to do what they say ?
  
  No. If kernel would ignore MADV_DONTNEED or MADV_DONTDUMP it will not
  affect correctness, just behaviour will be suboptimal: more than needed
  memory used or wasted space in coredump.
 
 That's not how the manpage reads for DONTNEED; it calls it out as a special
 case near the top, and explicitly says what will happen if you read the
 area marked as DONTNEED.

You are right. MADV_DONTNEED doesn't fit the interface either. That's bad
and we can't fix it. But it's not a reason to make the same mistake again.

Read the next sentence: "The kernel is free to ignore the advice."

Note, POSIX_MADV_DONTNEED has totally different semantics.

 It looks like there are openssl patches that use DONTDUMP to explicitly
 make sure keys etc don't land in cores.

That's nice to have. But openssl works on systems without the interface,
meaning it's not essential for functionality.

-- 
 Kirill A. Shutemov



Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT

2014-10-07 Thread Kirill A. Shutemov
On Tue, Oct 07, 2014 at 03:24:58PM +0200, Andrea Arcangeli wrote:
 Hi Kirill,
 
 On Tue, Oct 07, 2014 at 01:36:45PM +0300, Kirill A. Shutemov wrote:
  On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
   MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
   vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
   userland touches a still unmapped virtual address, a sigbus signal is
   sent instead of allocating a new page. The sigbus signal handler will
   then resolve the page fault in userland by calling the
   remap_anon_pages syscall.
  
  Hm. I wounder if this functionality really fits madvise(2) interface: as
  far as I understand it, it provides a way to give a *hint* to kernel which
  may or may not trigger an action from kernel side. I don't think an
  application will behaive reasonably if kernel ignore the *advise* and will
  not send SIGBUS, but allocate memory.
  
  I would suggest to consider to use some other interface for the
  functionality: a new syscall or, perhaps, mprotect().
 
 I didn't feel like adding PROT_USERFAULT to mprotect, which looks
 hardwired to just these flags:

PROT_NOALLOC, maybe?

 
PROT_NONE  The memory cannot be accessed at all.
 
PROT_READ  The memory can be read.
 
PROT_WRITE The memory can be modified.
 
PROT_EXEC  The memory can be executed.

To be complete: PROT_GROWSDOWN, PROT_GROWSUP and unused PROT_SEM.

 So here somebody should comment and choose between:
 
 1) set VM_USERFAULT with mprotect(PROT_USERFAULT) instead of
the current madvise(MADV_USERFAULT)
 
 2) drop MADV_USERFAULT and VM_USERFAULT and force the usage of the
userfaultfd protocol as the only way for userland to catch
userfaults (each userfaultfd must already register itself into its
own virtual memory ranges so it's a trivial change for userfaultfd
users that deletes just 1 or 2 lines of userland code, but it would
prevent to use the SIGBUS behavior with info-si_addr=faultaddr for
other users)
 
 3) keep things as they are now: use MADV_USERFAULT for SIGBUS
userfaults, with optional intersection between the
   vm_flags&VM_USERFAULT ranges and the userfaultfd registered ranges
   with vma->vm_userfaultfd_ctx!=NULL to know if to engage the
userfaultfd protocol instead of the plain SIGBUS

4) new syscall?
 
 I will update the code accordingly to feedback, so please comment.

I don't have a strong opinion on this. It just *feels* like it doesn't fit
advice semantics.

The only userspace interface I've designed has not been proven good by time.
I would listen to what the senior maintainers say. :)
 
-- 
 Kirill A. Shutemov



[Qemu-devel] [PATCH] [RESEND-try-3] hw/9pfs: fix P9_STATS_GEN handling

2014-01-28 Thread Kirill A. Shutemov
Currently we have few issues with P9_STATS_GEN:

 - We don't try to read st_gen for anything except files or directories, but
   still set the P9_STATS_GEN bit in st_result_mask. It may mislead the
   client: we present garbage as valid st_gen.

 - If we fail to get a valid st_gen with ENOTTY, we ignore the error, but
   still set the P9_STATS_GEN bit in st_result_mask.

 - If we fail to get a valid st_gen with any other errno, we fail the
   getattr altogether. It's excessive: we block valid client use-cases,
   like chdir(2) into a non-readable directory with the execution bit set.

The patch fixes these issues and cleans up the code a bit.

Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
Reviewed-by: Daniel P. Berrange berra...@redhat.com
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
---
 hw/9pfs/cofile.c   |  4 
 hw/9pfs/virtio-9p-handle.c |  8 +++-
 hw/9pfs/virtio-9p-local.c  | 10 ++
 hw/9pfs/virtio-9p-proxy.c  |  3 ++-
 hw/9pfs/virtio-9p.c| 12 ++--
 5 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/hw/9pfs/cofile.c b/hw/9pfs/cofile.c
index 194c1306c665..2efebf35710f 100644
--- a/hw/9pfs/cofile.c
+++ b/hw/9pfs/cofile.c
@@ -38,10 +38,6 @@ int v9fs_co_st_gen(V9fsPDU *pdu, V9fsPath *path, mode_t 
st_mode,
 });
 v9fs_path_unlock(s);
 }
-/* The ioctl may not be supported depending on the path */
-if (err == -ENOTTY) {
-err = 0;
-}
 return err;
 }
 
diff --git a/hw/9pfs/virtio-9p-handle.c b/hw/9pfs/virtio-9p-handle.c
index fe8e0ed19dcc..17002a3d2867 100644
--- a/hw/9pfs/virtio-9p-handle.c
+++ b/hw/9pfs/virtio-9p-handle.c
@@ -582,6 +582,7 @@ static int handle_unlinkat(FsContext *ctx, V9fsPath *dir,
 static int handle_ioc_getversion(FsContext *ctx, V9fsPath *path,
  mode_t st_mode, uint64_t *st_gen)
 {
+#ifdef FS_IOC_GETVERSION
 int err;
 V9fsFidOpenState fid_open;
 
@@ -590,7 +591,8 @@ static int handle_ioc_getversion(FsContext *ctx, V9fsPath 
*path,
  * We can get fd for regular files and directories only
  */
 if (!S_ISREG(st_mode) && !S_ISDIR(st_mode)) {
-return 0;
+errno = ENOTTY;
+return -1;
 }
 err = handle_open(ctx, path, O_RDONLY, &fid_open);
 if (err < 0) {
@@ -599,6 +601,10 @@ static int handle_ioc_getversion(FsContext *ctx, V9fsPath 
*path,
 err = ioctl(fid_open.fd, FS_IOC_GETVERSION, st_gen);
 handle_close(ctx, fid_open);
 return err;
+#else
+errno = ENOTTY;
+return -1;
+#endif
 }
 
 static int handle_init(FsContext *ctx)
diff --git a/hw/9pfs/virtio-9p-local.c b/hw/9pfs/virtio-9p-local.c
index fc93e9e6e8da..df0dbffa7ac4 100644
--- a/hw/9pfs/virtio-9p-local.c
+++ b/hw/9pfs/virtio-9p-local.c
@@ -1068,8 +1068,8 @@ err_out:
 static int local_ioc_getversion(FsContext *ctx, V9fsPath *path,
 mode_t st_mode, uint64_t *st_gen)
 {
-int err;
 #ifdef FS_IOC_GETVERSION
+int err;
 V9fsFidOpenState fid_open;
 
 /*
@@ -1077,7 +1077,8 @@ static int local_ioc_getversion(FsContext *ctx, V9fsPath 
*path,
  * We can get fd for regular files and directories only
  */
 if (!S_ISREG(st_mode) && !S_ISDIR(st_mode)) {
-return 0;
+errno = ENOTTY;
+return -1;
 }
 err = local_open(ctx, path, O_RDONLY, &fid_open);
 if (err < 0) {
@@ -1085,10 +1086,11 @@ static int local_ioc_getversion(FsContext *ctx, 
V9fsPath *path,
 }
 err = ioctl(fid_open.fd, FS_IOC_GETVERSION, st_gen);
 local_close(ctx, fid_open);
+return err;
 #else
-err = -ENOTTY;
+errno = ENOTTY;
+return -1;
 #endif
-return err;
 }
 
 static int local_init(FsContext *ctx)
diff --git a/hw/9pfs/virtio-9p-proxy.c b/hw/9pfs/virtio-9p-proxy.c
index 5f44bb758b35..b57966d9d883 100644
--- a/hw/9pfs/virtio-9p-proxy.c
+++ b/hw/9pfs/virtio-9p-proxy.c
@@ -1086,7 +1086,8 @@ static int proxy_ioc_getversion(FsContext *fs_ctx, 
V9fsPath *path,
  * we can get fd for regular files and directories only
  */
 if (!S_ISREG(st_mode) && !S_ISDIR(st_mode)) {
-return 0;
+errno = ENOTTY;
+return -1;
 }
 err = v9fs_request(fs_ctx->private, T_GETVERSION, st_gen, "s", path);
 if (err < 0) {
diff --git a/hw/9pfs/virtio-9p.c b/hw/9pfs/virtio-9p.c
index 8cbb8ae32a03..3e51fcd152f8 100644
--- a/hw/9pfs/virtio-9p.c
+++ b/hw/9pfs/virtio-9p.c
@@ -1080,10 +1080,18 @@ static void v9fs_getattr(void *opaque)
 /*  fill st_gen if requested and supported by underlying fs */
 if (request_mask & P9_STATS_GEN) {
 retval = v9fs_co_st_gen(pdu, &fidp->path, stbuf.st_mode, &v9stat_dotl);
-if (retval < 0) {
+switch (retval) {
+case 0:
+/* we have valid st_gen: update result mask */
+v9stat_dotl.st_result_mask |= P9_STATS_GEN;
+break;
+case -EINTR:
+/* request cancelled */
 goto out;
+default:
+/* failed to get st_gen: not fatal

Re: [Qemu-devel] [PATCH] [RESEND-try-3] hw/9pfs: fix P9_STATS_GEN handling

2014-01-28 Thread Kirill A. Shutemov
Michael S. Tsirkin wrote:
 On Tue, Jan 28, 2014 at 12:55:51PM +0200, Kirill A. Shutemov wrote:
  Currently we have few issues with P9_STATS_GEN:
  
   - We don't try to read st_gen anything except files or directories, but
 still set P9_STATS_GEN bit in st_result_mask. It may mislead client:
 we present garbage as valid st_gen.
  
   - If we failed to get valid st_gen with ENOTTY, we ignore error, but
 still set P9_STATS_GEN bit in st_result_mask.
  
   - If we failed to get valid st_gen with any other errno, we fail
 getattr altogether. It's excessive: we block valid client use-cases,
 like chdir(2) to non-readable directory with execution bit set.
  
  The patch fixes these issues and cleanup code a bit.
  
  Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
  Reviewed-by: Daniel P. Berrange berra...@redhat.com
  Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
 
 Would be better to split unrelated issues out to separate patches.

They are not totally unrelated: they all unbreak P9_STATS_GEN.

But yes, I can split it if needed.

  diff --git a/hw/9pfs/virtio-9p.c b/hw/9pfs/virtio-9p.c
  index 8cbb8ae32a03..3e51fcd152f8 100644
  --- a/hw/9pfs/virtio-9p.c
  +++ b/hw/9pfs/virtio-9p.c
  @@ -1080,10 +1080,18 @@ static void v9fs_getattr(void *opaque)
   /*  fill st_gen if requested and supported by underlying fs */
   if (request_mask  P9_STATS_GEN) {
   retval = v9fs_co_st_gen(pdu, fidp-path, stbuf.st_mode, 
  v9stat_dotl);
  -if (retval  0) {
  +switch (retval) {
  +case 0:
  +/* we have valid st_gen: update result mask */
  +v9stat_dotl.st_result_mask |= P9_STATS_GEN;
  +break;
  +case -EINTR:
  +/* request cancelled */
   goto out;
 
 Shouldn't EINTR be retried?

No. It could be cancelled by the client (with Tflush) on purpose, and the
client can retry if needed.

-- 
 Kirill A. Shutemov



[Qemu-devel] [PATCH 1/4] hw/9pfs: fix error handing in local_ioc_getversion()

2014-01-28 Thread Kirill A. Shutemov
v9fs_co_st_gen() expects to see the error code in errno, not in the return code.

Let's fix this.

Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
---
 hw/9pfs/virtio-9p-local.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/hw/9pfs/virtio-9p-local.c b/hw/9pfs/virtio-9p-local.c
index fc93e9e6e8da..9be8854e9148 100644
--- a/hw/9pfs/virtio-9p-local.c
+++ b/hw/9pfs/virtio-9p-local.c
@@ -1068,8 +1068,8 @@ err_out:
 static int local_ioc_getversion(FsContext *ctx, V9fsPath *path,
 mode_t st_mode, uint64_t *st_gen)
 {
-int err;
 #ifdef FS_IOC_GETVERSION
+int err;
 V9fsFidOpenState fid_open;
 
 /*
@@ -1085,10 +1085,11 @@ static int local_ioc_getversion(FsContext *ctx, 
V9fsPath *path,
 }
 err = ioctl(fid_open.fd, FS_IOC_GETVERSION, st_gen);
 local_close(ctx, fid_open);
+return err;
 #else
-err = -ENOTTY;
+errno = ENOTTY;
+return -1;
 #endif
-return err;
 }
 
 static int local_init(FsContext *ctx)
-- 
1.8.5.2




[Qemu-devel] [PATCH 2/4] hw/9pfs: handle undefined FS_IOC_GETVERSION case in handle_ioc_getversion()

2014-01-28 Thread Kirill A. Shutemov
All get_st_gen() implementations except handle_ioc_getversion() have a
guard for the undefined FS_IOC_GETVERSION case. Let's add it there too.

Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
---
 hw/9pfs/virtio-9p-handle.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/hw/9pfs/virtio-9p-handle.c b/hw/9pfs/virtio-9p-handle.c
index fe8e0ed19dcc..ed8c126e1d6c 100644
--- a/hw/9pfs/virtio-9p-handle.c
+++ b/hw/9pfs/virtio-9p-handle.c
@@ -582,6 +582,7 @@ static int handle_unlinkat(FsContext *ctx, V9fsPath *dir,
 static int handle_ioc_getversion(FsContext *ctx, V9fsPath *path,
  mode_t st_mode, uint64_t *st_gen)
 {
+#ifdef FS_IOC_GETVERSION
 int err;
 V9fsFidOpenState fid_open;
 
@@ -599,6 +600,10 @@ static int handle_ioc_getversion(FsContext *ctx, V9fsPath 
*path,
 err = ioctl(fid_open.fd, FS_IOC_GETVERSION, st_gen);
 handle_close(ctx, fid_open);
 return err;
+#else
+errno = ENOTTY;
+return -1;
+#endif
 }
 
 static int handle_init(FsContext *ctx)
-- 
1.8.5.2




[Qemu-devel] [PATCH 4/4] hw/9pfs: fix P9_STATS_GEN handling

2014-01-28 Thread Kirill A. Shutemov
Currently we fail the getattr request altogether if we can't read
P9_STATS_GEN for some reason. It breaks valid use cases:

E.g. let's assume we have a non-readable directory with the execution bit
set on the host and we export it to a client over 9p. On the host we can
chdir into the directory, but not open it for reading and list its content.

But if the client tries to call getattr (as part of chdir(2)) for the
directory, it will fail with -EACCES. It happens because we try to open
the directory for reading to call ioctl(FS_IOC_GETVERSION); it fails and we
return the error code to the client.

It's excessive. The solution is to make P9_STATS_GEN failure non-fatal
for the getattr request. Just don't set the P9_STATS_GEN flag in the result
mask on failure.

Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
---
 hw/9pfs/cofile.c|  4 
 hw/9pfs/virtio-9p.c | 12 ++--
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/hw/9pfs/cofile.c b/hw/9pfs/cofile.c
index 194c1306c665..2efebf35710f 100644
--- a/hw/9pfs/cofile.c
+++ b/hw/9pfs/cofile.c
@@ -38,10 +38,6 @@ int v9fs_co_st_gen(V9fsPDU *pdu, V9fsPath *path, mode_t 
st_mode,
 });
 v9fs_path_unlock(s);
 }
-/* The ioctl may not be supported depending on the path */
-if (err == -ENOTTY) {
-err = 0;
-}
 return err;
 }
 
diff --git a/hw/9pfs/virtio-9p.c b/hw/9pfs/virtio-9p.c
index 8cbb8ae32a03..83e4e9398347 100644
--- a/hw/9pfs/virtio-9p.c
+++ b/hw/9pfs/virtio-9p.c
@@ -1080,10 +1080,18 @@ static void v9fs_getattr(void *opaque)
 /*  fill st_gen if requested and supported by underlying fs */
 if (request_mask & P9_STATS_GEN) {
 retval = v9fs_co_st_gen(pdu, &fidp->path, stbuf.st_mode, &v9stat_dotl);
-if (retval < 0) {
+switch (retval) {
+case 0:
+/* we have valid st_gen: update result mask */
+v9stat_dotl.st_result_mask |= P9_STATS_GEN;
+break;
+case -EINTR:
+/* request cancelled, e.g. by Tflush */
 goto out;
+default:
+/* failed to get st_gen: not fatal, ignore */
+break;
 }
-v9stat_dotl.st_result_mask |= P9_STATS_GEN;
 }
 retval = pdu_marshal(pdu, offset, "A", &v9stat_dotl);
 if (retval < 0) {
-- 
1.8.5.2




[Qemu-devel] [PATCH 3/4] hw/9pfs: make get_st_gen() return ENOTTY error on special files

2014-01-28 Thread Kirill A. Shutemov
Currently we silently ignore getversion requests for anything except
files or directories. Let's instead return an ENOTTY error to indicate
that getversion is not supported. It makes the implementation consistent
across all not-supported cases.

Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
---
 hw/9pfs/virtio-9p-handle.c | 3 ++-
 hw/9pfs/virtio-9p-local.c  | 3 ++-
 hw/9pfs/virtio-9p-proxy.c  | 3 ++-
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/hw/9pfs/virtio-9p-handle.c b/hw/9pfs/virtio-9p-handle.c
index ed8c126e1d6c..17002a3d2867 100644
--- a/hw/9pfs/virtio-9p-handle.c
+++ b/hw/9pfs/virtio-9p-handle.c
@@ -591,7 +591,8 @@ static int handle_ioc_getversion(FsContext *ctx, V9fsPath 
*path,
  * We can get fd for regular files and directories only
  */
 if (!S_ISREG(st_mode) && !S_ISDIR(st_mode)) {
-return 0;
+errno = ENOTTY;
+return -1;
 }
 err = handle_open(ctx, path, O_RDONLY, &fid_open);
 if (err < 0) {
diff --git a/hw/9pfs/virtio-9p-local.c b/hw/9pfs/virtio-9p-local.c
index 9be8854e9148..df0dbffa7ac4 100644
--- a/hw/9pfs/virtio-9p-local.c
+++ b/hw/9pfs/virtio-9p-local.c
@@ -1077,7 +1077,8 @@ static int local_ioc_getversion(FsContext *ctx, V9fsPath 
*path,
  * We can get fd for regular files and directories only
  */
 if (!S_ISREG(st_mode) && !S_ISDIR(st_mode)) {
-return 0;
+errno = ENOTTY;
+return -1;
 }
 err = local_open(ctx, path, O_RDONLY, &fid_open);
 if (err < 0) {
diff --git a/hw/9pfs/virtio-9p-proxy.c b/hw/9pfs/virtio-9p-proxy.c
index 5f44bb758b35..b57966d9d883 100644
--- a/hw/9pfs/virtio-9p-proxy.c
+++ b/hw/9pfs/virtio-9p-proxy.c
@@ -1086,7 +1086,8 @@ static int proxy_ioc_getversion(FsContext *fs_ctx, 
V9fsPath *path,
  * we can get fd for regular files and directories only
  */
 if (!S_ISREG(st_mode) && !S_ISDIR(st_mode)) {
-return 0;
+errno = ENOTTY;
+return -1;
 }
 err = v9fs_request(fs_ctx->private, T_GETVERSION, st_gen, "s", path);
 if (err < 0) {
-- 
1.8.5.2




Re: [Qemu-devel] [RESEND-try-2][PATCH] hw/9pfs: fix P9_STATS_GEN handling

2013-12-23 Thread Kirill A. Shutemov
Aneesh Kumar K.V wrote:
 Kirill A. Shutemov kirill.shute...@linux.intel.com writes:
 
  Kirill A. Shutemov wrote:
  Currently we have a few issues with P9_STATS_GEN:
  
   - We don't try to read st_gen for anything except files or directories,
 but still set the P9_STATS_GEN bit in st_result_mask. It may mislead the
 client: we present garbage as valid st_gen.
  
   - If we fail to get a valid st_gen with ENOTTY, we ignore the error, but
 still set the P9_STATS_GEN bit in st_result_mask.
  
   - If we fail to get a valid st_gen with any other errno, we fail
 getattr altogether. It's excessive: we block valid client use-cases,
 like chdir(2) to a non-readable directory with the execution bit set.
  
  The patch fixes these issues and cleans up the code a bit.
  
  Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
  Reviewed-by: Daniel P. Berrange berra...@redhat.com
  Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
 
  Ping?
 
 
 I am hoping that this will go upstream directly without me doing a
 pull request? Anthony, let me know if this needs anything to be done
 from my side.

So? Nobody cares?

-- 
 Kirill A. Shutemov



Re: [Qemu-devel] [RESEND-try-2][PATCH] hw/9pfs: fix P9_STATS_GEN handling

2013-12-09 Thread Kirill A. Shutemov
Kirill A. Shutemov wrote:
 Currently we have a few issues with P9_STATS_GEN:
 
  - We don't try to read st_gen for anything except files or directories,
but still set the P9_STATS_GEN bit in st_result_mask. It may mislead the
client: we present garbage as valid st_gen.
 
  - If we fail to get a valid st_gen with ENOTTY, we ignore the error, but
still set the P9_STATS_GEN bit in st_result_mask.
 
  - If we fail to get a valid st_gen with any other errno, we fail
getattr altogether. It's excessive: we block valid client use-cases,
like chdir(2) to a non-readable directory with the execution bit set.
 
 The patch fixes these issues and cleans up the code a bit.
 
 Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
 Reviewed-by: Daniel P. Berrange berra...@redhat.com
 Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com

Ping?

-- 
 Kirill A. Shutemov



[Qemu-devel] [RESEND-try-2][PATCH] hw/9pfs: fix P9_STATS_GEN handling

2013-11-29 Thread Kirill A. Shutemov
Currently we have a few issues with P9_STATS_GEN:

 - We don't try to read st_gen for anything except files or directories,
   but still set the P9_STATS_GEN bit in st_result_mask. It may mislead
   the client: we present garbage as valid st_gen.

 - If we fail to get a valid st_gen with ENOTTY, we ignore the error, but
   still set the P9_STATS_GEN bit in st_result_mask.

 - If we fail to get a valid st_gen with any other errno, we fail
   getattr altogether. It's excessive: we block valid client use-cases,
   like chdir(2) to a non-readable directory with the execution bit set.

The patch fixes these issues and cleans up the code a bit.

Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
Reviewed-by: Daniel P. Berrange berra...@redhat.com
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
---
 hw/9pfs/cofile.c   |  4 
 hw/9pfs/virtio-9p-handle.c |  8 +++-
 hw/9pfs/virtio-9p-local.c  | 10 ++
 hw/9pfs/virtio-9p-proxy.c  |  3 ++-
 hw/9pfs/virtio-9p.c| 12 ++--
 5 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/hw/9pfs/cofile.c b/hw/9pfs/cofile.c
index 194c1306c665..2efebf35710f 100644
--- a/hw/9pfs/cofile.c
+++ b/hw/9pfs/cofile.c
@@ -38,10 +38,6 @@ int v9fs_co_st_gen(V9fsPDU *pdu, V9fsPath *path, mode_t 
st_mode,
 });
 v9fs_path_unlock(s);
 }
-/* The ioctl may not be supported depending on the path */
-if (err == -ENOTTY) {
-err = 0;
-}
 return err;
 }
 
diff --git a/hw/9pfs/virtio-9p-handle.c b/hw/9pfs/virtio-9p-handle.c
index fe8e0ed19dcc..17002a3d2867 100644
--- a/hw/9pfs/virtio-9p-handle.c
+++ b/hw/9pfs/virtio-9p-handle.c
@@ -582,6 +582,7 @@ static int handle_unlinkat(FsContext *ctx, V9fsPath *dir,
 static int handle_ioc_getversion(FsContext *ctx, V9fsPath *path,
  mode_t st_mode, uint64_t *st_gen)
 {
+#ifdef FS_IOC_GETVERSION
 int err;
 V9fsFidOpenState fid_open;
 
@@ -590,7 +591,8 @@ static int handle_ioc_getversion(FsContext *ctx, V9fsPath 
*path,
  * We can get fd for regular files and directories only
  */
 if (!S_ISREG(st_mode) && !S_ISDIR(st_mode)) {
-return 0;
+errno = ENOTTY;
+return -1;
 }
 err = handle_open(ctx, path, O_RDONLY, &fid_open);
 if (err < 0) {
@@ -599,6 +601,10 @@ static int handle_ioc_getversion(FsContext *ctx, V9fsPath 
*path,
 err = ioctl(fid_open.fd, FS_IOC_GETVERSION, st_gen);
 handle_close(ctx, &fid_open);
 return err;
+#else
+errno = ENOTTY;
+return -1;
+#endif
 }
 
 static int handle_init(FsContext *ctx)
diff --git a/hw/9pfs/virtio-9p-local.c b/hw/9pfs/virtio-9p-local.c
index fc93e9e6e8da..df0dbffa7ac4 100644
--- a/hw/9pfs/virtio-9p-local.c
+++ b/hw/9pfs/virtio-9p-local.c
@@ -1068,8 +1068,8 @@ err_out:
 static int local_ioc_getversion(FsContext *ctx, V9fsPath *path,
 mode_t st_mode, uint64_t *st_gen)
 {
-int err;
 #ifdef FS_IOC_GETVERSION
+int err;
 V9fsFidOpenState fid_open;
 
 /*
@@ -1077,7 +1077,8 @@ static int local_ioc_getversion(FsContext *ctx, V9fsPath 
*path,
  * We can get fd for regular files and directories only
  */
 if (!S_ISREG(st_mode) && !S_ISDIR(st_mode)) {
-return 0;
+errno = ENOTTY;
+return -1;
 }
 err = local_open(ctx, path, O_RDONLY, &fid_open);
 if (err < 0) {
@@ -1085,10 +1086,11 @@ static int local_ioc_getversion(FsContext *ctx, 
V9fsPath *path,
 }
 err = ioctl(fid_open.fd, FS_IOC_GETVERSION, st_gen);
 local_close(ctx, &fid_open);
+return err;
 #else
-err = -ENOTTY;
+errno = ENOTTY;
+return -1;
 #endif
-return err;
 }
 
 static int local_init(FsContext *ctx)
diff --git a/hw/9pfs/virtio-9p-proxy.c b/hw/9pfs/virtio-9p-proxy.c
index 5f44bb758b35..b57966d9d883 100644
--- a/hw/9pfs/virtio-9p-proxy.c
+++ b/hw/9pfs/virtio-9p-proxy.c
@@ -1086,7 +1086,8 @@ static int proxy_ioc_getversion(FsContext *fs_ctx, 
V9fsPath *path,
  * we can get fd for regular files and directories only
  */
 if (!S_ISREG(st_mode) && !S_ISDIR(st_mode)) {
-return 0;
+errno = ENOTTY;
+return -1;
 }
 err = v9fs_request(fs_ctx->private, T_GETVERSION, st_gen, "s", path);
 if (err < 0) {
diff --git a/hw/9pfs/virtio-9p.c b/hw/9pfs/virtio-9p.c
index 8cbb8ae32a03..3e51fcd152f8 100644
--- a/hw/9pfs/virtio-9p.c
+++ b/hw/9pfs/virtio-9p.c
@@ -1080,10 +1080,18 @@ static void v9fs_getattr(void *opaque)
 /*  fill st_gen if requested and supported by underlying fs */
 if (request_mask & P9_STATS_GEN) {
 retval = v9fs_co_st_gen(pdu, &fidp->path, stbuf.st_mode, &v9stat_dotl);
-if (retval < 0) {
+switch (retval) {
+case 0:
+/* we have valid st_gen: update result mask */
+v9stat_dotl.st_result_mask |= P9_STATS_GEN;
+break;
+case -EINTR:
+/* request cancelled */
 goto out;
+default:
+/* failed to get st_gen: not fatal

Re: [Qemu-devel] [RESEND] [PATCH] hw/9pfs: fix P9_STATS_GEN handling

2013-11-06 Thread Kirill A. Shutemov
On Wed, Nov 06, 2013 at 09:41:47PM +0530, Aneesh Kumar K.V wrote:
 Kirill A. Shutemov kir...@shutemov.name writes:
 
  From: Kirill A. Shutemov kirill.shute...@linux.intel.com
 
  Currently we have a few issues with P9_STATS_GEN:
 
   - We don't try to read st_gen for anything except files or directories,
 but still set the P9_STATS_GEN bit in st_result_mask. It may mislead the
 client: we present garbage as valid st_gen.
 
 We should return 0 right ? We do 
 
 memset(v9lstat, 0, sizeof(*v9lstat));
 
 in stat_to_v9stat_dotl

The right way is to not set P9_STATS_GEN in st_result_mask if we don't
know it.

   - If we fail to get a valid st_gen with ENOTTY, we ignore the error, but
 still set the P9_STATS_GEN bit in st_result_mask.
 
 and have v9lstat.st_gen set to zero 

The same as above.

And if ioctl(fd, FS_IOC_GETVERSION, st_gen) fails, nobody specifies what
will be stored into st_gen, if anything. We should not rely on the fs not
touching st_gen even if that sounds reasonable.

 
 
   - If we fail to get a valid st_gen with any other errno, we fail
 getattr altogether. It's excessive: we block valid client use-cases,
 like chdir(2) to a non-readable directory with the execution bit set.
 
 
 Can you explain this in detail ? 

If you have the following tree:
 
% mkdir testdir
% echo test > testdir/testfile
% chmod -r testdir
 
In a normal environment it's usable: you can chdir(2) into it and read files
inside if you know their names:
 
% cd testdir
% cat testfile
test
 
You only can't list directory content:
 
% ls
ls: cannot open directory .: Permission denied
 
With the current qemu 9p implementation you'll get -EACCES on the guest on
chdir(2) or read, since qemu will fail to provide basic stats to the guest.
It happens because qemu tries to open(2) the non-readable file to run
FS_IOC_GETVERSION and fails getattr altogether.
 
I believe it also breaks more trivial use-cases: ls -l on a non-readable
file or directory, for the same reason. But I haven't checked that.
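
A minimal sketch of the failing path described above, assuming a Linux host
with FS_IOC_GETVERSION (the helper name is illustrative, not the actual QEMU
code):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <unistd.h>

    /* FS_IOC_GETVERSION needs an open fd, but a directory with mode --x
     * cannot be opened O_RDONLY, so open() fails with EACCES. The old
     * getattr path returned that error for the whole request. */
    static int st_gen_sketch(const char *path, uint64_t *st_gen)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) {
            return -errno;              /* e.g. -EACCES for testdir above */
        }
        int err = ioctl(fd, FS_IOC_GETVERSION, st_gen);
        int saved_errno = errno;
        close(fd);
        return err < 0 ? -saved_errno : 0;  /* only 0 means *st_gen is valid */
    }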

-- 
 Kirill A. Shutemov



[Qemu-devel] [RESEND] [PATCH] hw/9pfs: fix P9_STATS_GEN handling

2013-11-04 Thread Kirill A. Shutemov
From: Kirill A. Shutemov kirill.shute...@linux.intel.com

Currently we have a few issues with P9_STATS_GEN:

 - We don't try to read st_gen for anything except files or directories,
   but still set the P9_STATS_GEN bit in st_result_mask. It may mislead
   the client: we present garbage as valid st_gen.

 - If we fail to get a valid st_gen with ENOTTY, we ignore the error, but
   still set the P9_STATS_GEN bit in st_result_mask.

 - If we fail to get a valid st_gen with any other errno, we fail
   getattr altogether. It's excessive: we block valid client use-cases,
   like chdir(2) to a non-readable directory with the execution bit set.

The patch fixes these issues and cleans up the code a bit.

Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
Reviewed-by: Daniel P. Berrange berra...@redhat.com
---
 hw/9pfs/cofile.c   |  4 
 hw/9pfs/virtio-9p-handle.c |  8 +++-
 hw/9pfs/virtio-9p-local.c  | 10 ++
 hw/9pfs/virtio-9p-proxy.c  |  3 ++-
 hw/9pfs/virtio-9p.c| 12 ++--
 5 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/hw/9pfs/cofile.c b/hw/9pfs/cofile.c
index 194c1306c665..2efebf35710f 100644
--- a/hw/9pfs/cofile.c
+++ b/hw/9pfs/cofile.c
@@ -38,10 +38,6 @@ int v9fs_co_st_gen(V9fsPDU *pdu, V9fsPath *path, mode_t 
st_mode,
 });
 v9fs_path_unlock(s);
 }
-/* The ioctl may not be supported depending on the path */
-if (err == -ENOTTY) {
-err = 0;
-}
 return err;
 }
 
diff --git a/hw/9pfs/virtio-9p-handle.c b/hw/9pfs/virtio-9p-handle.c
index fe8e0ed19dcc..17002a3d2867 100644
--- a/hw/9pfs/virtio-9p-handle.c
+++ b/hw/9pfs/virtio-9p-handle.c
@@ -582,6 +582,7 @@ static int handle_unlinkat(FsContext *ctx, V9fsPath *dir,
 static int handle_ioc_getversion(FsContext *ctx, V9fsPath *path,
  mode_t st_mode, uint64_t *st_gen)
 {
+#ifdef FS_IOC_GETVERSION
 int err;
 V9fsFidOpenState fid_open;
 
@@ -590,7 +591,8 @@ static int handle_ioc_getversion(FsContext *ctx, V9fsPath 
*path,
  * We can get fd for regular files and directories only
  */
 if (!S_ISREG(st_mode) && !S_ISDIR(st_mode)) {
-return 0;
+errno = ENOTTY;
+return -1;
 }
 err = handle_open(ctx, path, O_RDONLY, &fid_open);
 if (err < 0) {
@@ -599,6 +601,10 @@ static int handle_ioc_getversion(FsContext *ctx, V9fsPath 
*path,
 err = ioctl(fid_open.fd, FS_IOC_GETVERSION, st_gen);
 handle_close(ctx, &fid_open);
 return err;
+#else
+errno = ENOTTY;
+return -1;
+#endif
 }
 
 static int handle_init(FsContext *ctx)
diff --git a/hw/9pfs/virtio-9p-local.c b/hw/9pfs/virtio-9p-local.c
index fc93e9e6e8da..df0dbffa7ac4 100644
--- a/hw/9pfs/virtio-9p-local.c
+++ b/hw/9pfs/virtio-9p-local.c
@@ -1068,8 +1068,8 @@ err_out:
 static int local_ioc_getversion(FsContext *ctx, V9fsPath *path,
 mode_t st_mode, uint64_t *st_gen)
 {
-int err;
 #ifdef FS_IOC_GETVERSION
+int err;
 V9fsFidOpenState fid_open;
 
 /*
@@ -1077,7 +1077,8 @@ static int local_ioc_getversion(FsContext *ctx, V9fsPath 
*path,
  * We can get fd for regular files and directories only
  */
 if (!S_ISREG(st_mode) && !S_ISDIR(st_mode)) {
-return 0;
+errno = ENOTTY;
+return -1;
 }
 err = local_open(ctx, path, O_RDONLY, &fid_open);
 if (err < 0) {
@@ -1085,10 +1086,11 @@ static int local_ioc_getversion(FsContext *ctx, 
V9fsPath *path,
 }
 err = ioctl(fid_open.fd, FS_IOC_GETVERSION, st_gen);
 local_close(ctx, &fid_open);
+return err;
 #else
-err = -ENOTTY;
+errno = ENOTTY;
+return -1;
 #endif
-return err;
 }
 
 static int local_init(FsContext *ctx)
diff --git a/hw/9pfs/virtio-9p-proxy.c b/hw/9pfs/virtio-9p-proxy.c
index 5f44bb758b35..b57966d9d883 100644
--- a/hw/9pfs/virtio-9p-proxy.c
+++ b/hw/9pfs/virtio-9p-proxy.c
@@ -1086,7 +1086,8 @@ static int proxy_ioc_getversion(FsContext *fs_ctx, 
V9fsPath *path,
  * we can get fd for regular files and directories only
  */
 if (!S_ISREG(st_mode) && !S_ISDIR(st_mode)) {
-return 0;
+errno = ENOTTY;
+return -1;
 }
 err = v9fs_request(fs_ctx->private, T_GETVERSION, st_gen, "s", path);
 if (err < 0) {
diff --git a/hw/9pfs/virtio-9p.c b/hw/9pfs/virtio-9p.c
index 8cbb8ae32a03..3e51fcd152f8 100644
--- a/hw/9pfs/virtio-9p.c
+++ b/hw/9pfs/virtio-9p.c
@@ -1080,10 +1080,18 @@ static void v9fs_getattr(void *opaque)
 /*  fill st_gen if requested and supported by underlying fs */
 if (request_mask & P9_STATS_GEN) {
 retval = v9fs_co_st_gen(pdu, &fidp->path, stbuf.st_mode, &v9stat_dotl);
-if (retval < 0) {
+switch (retval) {
+case 0:
+/* we have valid st_gen: update result mask */
+v9stat_dotl.st_result_mask |= P9_STATS_GEN;
+break;
+case -EINTR:
+/* request cancelled */
 goto out;
+default:
+/* failed to get st_gen: not fatal

Re: [Qemu-devel] [PATCH] hw/9pfs: fix P9_STATS_GEN handling

2013-10-30 Thread Kirill A. Shutemov
On Sun, Oct 27, 2013 at 03:41:34AM +0300, Kirill A. Shutemov wrote:
 From: Kirill A. Shutemov kir...@shutemov.name
 
 Currently we have a few issues with P9_STATS_GEN:
 
  - We don't try to read st_gen for anything except files or directories,
but still set the P9_STATS_GEN bit in st_result_mask. It may mislead the
client: we present garbage as valid st_gen.
 
  - If we fail to get a valid st_gen with ENOTTY, we ignore the error, but
still set the P9_STATS_GEN bit in st_result_mask.
 
  - If we fail to get a valid st_gen with any other errno, we fail
getattr altogether. It's excessive: we block valid client use-cases,
like chdir(2) to a non-readable directory with the execution bit set.
 
 The patch fixes these issues and cleans up the code a bit.
 
 Signed-off-by: Kirill A. Shutemov kir...@shutemov.name

Ping?

-- 
 Kirill A. Shutemov



[Qemu-devel] [PATCH] hw/9pfs: fix P9_STATS_GEN handling

2013-10-27 Thread Kirill A. Shutemov
From: Kirill A. Shutemov kir...@shutemov.name

Currently we have a few issues with P9_STATS_GEN:

 - We don't try to read st_gen for anything except files or directories,
   but still set the P9_STATS_GEN bit in st_result_mask. It may mislead
   the client: we present garbage as valid st_gen.

 - If we fail to get a valid st_gen with ENOTTY, we ignore the error, but
   still set the P9_STATS_GEN bit in st_result_mask.

 - If we fail to get a valid st_gen with any other errno, we fail
   getattr altogether. It's excessive: we block valid client use-cases,
   like chdir(2) to a non-readable directory with the execution bit set.

The patch fixes these issues and cleans up the code a bit.

Signed-off-by: Kirill A. Shutemov kir...@shutemov.name
---
 hw/9pfs/cofile.c   |  4 
 hw/9pfs/virtio-9p-handle.c |  8 +++-
 hw/9pfs/virtio-9p-local.c  | 10 ++
 hw/9pfs/virtio-9p-proxy.c  |  3 ++-
 hw/9pfs/virtio-9p.c| 12 ++--
 5 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/hw/9pfs/cofile.c b/hw/9pfs/cofile.c
index 194c1306c6..2efebf3571 100644
--- a/hw/9pfs/cofile.c
+++ b/hw/9pfs/cofile.c
@@ -38,10 +38,6 @@ int v9fs_co_st_gen(V9fsPDU *pdu, V9fsPath *path, mode_t 
st_mode,
 });
 v9fs_path_unlock(s);
 }
-/* The ioctl may not be supported depending on the path */
-if (err == -ENOTTY) {
-err = 0;
-}
 return err;
 }
 
diff --git a/hw/9pfs/virtio-9p-handle.c b/hw/9pfs/virtio-9p-handle.c
index fe8e0ed19d..17002a3d28 100644
--- a/hw/9pfs/virtio-9p-handle.c
+++ b/hw/9pfs/virtio-9p-handle.c
@@ -582,6 +582,7 @@ static int handle_unlinkat(FsContext *ctx, V9fsPath *dir,
 static int handle_ioc_getversion(FsContext *ctx, V9fsPath *path,
  mode_t st_mode, uint64_t *st_gen)
 {
+#ifdef FS_IOC_GETVERSION
 int err;
 V9fsFidOpenState fid_open;
 
@@ -590,7 +591,8 @@ static int handle_ioc_getversion(FsContext *ctx, V9fsPath 
*path,
  * We can get fd for regular files and directories only
  */
 if (!S_ISREG(st_mode) && !S_ISDIR(st_mode)) {
-return 0;
+errno = ENOTTY;
+return -1;
 }
 err = handle_open(ctx, path, O_RDONLY, &fid_open);
 if (err < 0) {
@@ -599,6 +601,10 @@ static int handle_ioc_getversion(FsContext *ctx, V9fsPath 
*path,
 err = ioctl(fid_open.fd, FS_IOC_GETVERSION, st_gen);
 handle_close(ctx, &fid_open);
 return err;
+#else
+errno = ENOTTY;
+return -1;
+#endif
 }
 
 static int handle_init(FsContext *ctx)
diff --git a/hw/9pfs/virtio-9p-local.c b/hw/9pfs/virtio-9p-local.c
index fc93e9e6e8..df0dbffa7a 100644
--- a/hw/9pfs/virtio-9p-local.c
+++ b/hw/9pfs/virtio-9p-local.c
@@ -1068,8 +1068,8 @@ err_out:
 static int local_ioc_getversion(FsContext *ctx, V9fsPath *path,
 mode_t st_mode, uint64_t *st_gen)
 {
-int err;
 #ifdef FS_IOC_GETVERSION
+int err;
 V9fsFidOpenState fid_open;
 
 /*
@@ -1077,7 +1077,8 @@ static int local_ioc_getversion(FsContext *ctx, V9fsPath 
*path,
  * We can get fd for regular files and directories only
  */
 if (!S_ISREG(st_mode) && !S_ISDIR(st_mode)) {
-return 0;
+errno = ENOTTY;
+return -1;
 }
 err = local_open(ctx, path, O_RDONLY, &fid_open);
 if (err < 0) {
@@ -1085,10 +1086,11 @@ static int local_ioc_getversion(FsContext *ctx, 
V9fsPath *path,
 }
 err = ioctl(fid_open.fd, FS_IOC_GETVERSION, st_gen);
 local_close(ctx, &fid_open);
+return err;
 #else
-err = -ENOTTY;
+errno = ENOTTY;
+return -1;
 #endif
-return err;
 }
 
 static int local_init(FsContext *ctx)
diff --git a/hw/9pfs/virtio-9p-proxy.c b/hw/9pfs/virtio-9p-proxy.c
index 5f44bb758b..b57966d9d8 100644
--- a/hw/9pfs/virtio-9p-proxy.c
+++ b/hw/9pfs/virtio-9p-proxy.c
@@ -1086,7 +1086,8 @@ static int proxy_ioc_getversion(FsContext *fs_ctx, 
V9fsPath *path,
  * we can get fd for regular files and directories only
  */
 if (!S_ISREG(st_mode) && !S_ISDIR(st_mode)) {
-return 0;
+errno = ENOTTY;
+return -1;
 }
 err = v9fs_request(fs_ctx->private, T_GETVERSION, st_gen, "s", path);
 if (err < 0) {
diff --git a/hw/9pfs/virtio-9p.c b/hw/9pfs/virtio-9p.c
index 8cbb8ae32a..3e51fcd152 100644
--- a/hw/9pfs/virtio-9p.c
+++ b/hw/9pfs/virtio-9p.c
@@ -1080,10 +1080,18 @@ static void v9fs_getattr(void *opaque)
 /*  fill st_gen if requested and supported by underlying fs */
 if (request_mask & P9_STATS_GEN) {
 retval = v9fs_co_st_gen(pdu, &fidp->path, stbuf.st_mode, &v9stat_dotl);
-if (retval < 0) {
+switch (retval) {
+case 0:
+/* we have valid st_gen: update result mask */
+v9stat_dotl.st_result_mask |= P9_STATS_GEN;
+break;
+case -EINTR:
+/* request cancelled */
 goto out;
+default:
+/* failed to get st_gen: not fatal, ignore */
+break;
 }
-v9stat_dotl.st_result_mask |= P9_STATS_GEN

[Qemu-devel] [PATCH] hw/9pfs: fix P9_STATS_GEN handling

2013-10-27 Thread Kirill A. Shutemov
Currently we have a few issues with P9_STATS_GEN:

 - We don't try to read st_gen for anything except files or directories,
   but still set the P9_STATS_GEN bit in st_result_mask. It may mislead
   the client: we present garbage as valid st_gen.

 - If we fail to get a valid st_gen with ENOTTY, we ignore the error, but
   still set the P9_STATS_GEN bit in st_result_mask.

 - If we fail to get a valid st_gen with any other errno, we fail
   getattr altogether. It's excessive: we block valid client use-cases,
   like chdir(2) to a non-readable directory with the execution bit set.

The patch fixes these issues and cleans up the code a bit.

Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
---
 hw/9pfs/cofile.c   |  4 
 hw/9pfs/virtio-9p-handle.c |  8 +++-
 hw/9pfs/virtio-9p-local.c  | 10 ++
 hw/9pfs/virtio-9p-proxy.c  |  3 ++-
 hw/9pfs/virtio-9p.c| 12 ++--
 5 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/hw/9pfs/cofile.c b/hw/9pfs/cofile.c
index 194c1306c6..2efebf3571 100644
--- a/hw/9pfs/cofile.c
+++ b/hw/9pfs/cofile.c
@@ -38,10 +38,6 @@ int v9fs_co_st_gen(V9fsPDU *pdu, V9fsPath *path, mode_t 
st_mode,
 });
 v9fs_path_unlock(s);
 }
-/* The ioctl may not be supported depending on the path */
-if (err == -ENOTTY) {
-err = 0;
-}
 return err;
 }
 
diff --git a/hw/9pfs/virtio-9p-handle.c b/hw/9pfs/virtio-9p-handle.c
index fe8e0ed19d..17002a3d28 100644
--- a/hw/9pfs/virtio-9p-handle.c
+++ b/hw/9pfs/virtio-9p-handle.c
@@ -582,6 +582,7 @@ static int handle_unlinkat(FsContext *ctx, V9fsPath *dir,
 static int handle_ioc_getversion(FsContext *ctx, V9fsPath *path,
  mode_t st_mode, uint64_t *st_gen)
 {
+#ifdef FS_IOC_GETVERSION
 int err;
 V9fsFidOpenState fid_open;
 
@@ -590,7 +591,8 @@ static int handle_ioc_getversion(FsContext *ctx, V9fsPath 
*path,
  * We can get fd for regular files and directories only
  */
 if (!S_ISREG(st_mode) && !S_ISDIR(st_mode)) {
-return 0;
+errno = ENOTTY;
+return -1;
 }
 err = handle_open(ctx, path, O_RDONLY, &fid_open);
 if (err < 0) {
@@ -599,6 +601,10 @@ static int handle_ioc_getversion(FsContext *ctx, V9fsPath 
*path,
 err = ioctl(fid_open.fd, FS_IOC_GETVERSION, st_gen);
 handle_close(ctx, &fid_open);
 return err;
+#else
+errno = ENOTTY;
+return -1;
+#endif
 }
 
 static int handle_init(FsContext *ctx)
diff --git a/hw/9pfs/virtio-9p-local.c b/hw/9pfs/virtio-9p-local.c
index fc93e9e6e8..df0dbffa7a 100644
--- a/hw/9pfs/virtio-9p-local.c
+++ b/hw/9pfs/virtio-9p-local.c
@@ -1068,8 +1068,8 @@ err_out:
 static int local_ioc_getversion(FsContext *ctx, V9fsPath *path,
 mode_t st_mode, uint64_t *st_gen)
 {
-int err;
 #ifdef FS_IOC_GETVERSION
+int err;
 V9fsFidOpenState fid_open;
 
 /*
@@ -1077,7 +1077,8 @@ static int local_ioc_getversion(FsContext *ctx, V9fsPath 
*path,
  * We can get fd for regular files and directories only
  */
 if (!S_ISREG(st_mode) && !S_ISDIR(st_mode)) {
-return 0;
+errno = ENOTTY;
+return -1;
 }
 err = local_open(ctx, path, O_RDONLY, &fid_open);
 if (err < 0) {
@@ -1085,10 +1086,11 @@ static int local_ioc_getversion(FsContext *ctx, 
V9fsPath *path,
 }
 err = ioctl(fid_open.fd, FS_IOC_GETVERSION, st_gen);
 local_close(ctx, &fid_open);
+return err;
 #else
-err = -ENOTTY;
+errno = ENOTTY;
+return -1;
 #endif
-return err;
 }
 
 static int local_init(FsContext *ctx)
diff --git a/hw/9pfs/virtio-9p-proxy.c b/hw/9pfs/virtio-9p-proxy.c
index 5f44bb758b..b57966d9d8 100644
--- a/hw/9pfs/virtio-9p-proxy.c
+++ b/hw/9pfs/virtio-9p-proxy.c
@@ -1086,7 +1086,8 @@ static int proxy_ioc_getversion(FsContext *fs_ctx, 
V9fsPath *path,
  * we can get fd for regular files and directories only
  */
 if (!S_ISREG(st_mode) && !S_ISDIR(st_mode)) {
-return 0;
+errno = ENOTTY;
+return -1;
 }
 err = v9fs_request(fs_ctx->private, T_GETVERSION, st_gen, "s", path);
 if (err < 0) {
diff --git a/hw/9pfs/virtio-9p.c b/hw/9pfs/virtio-9p.c
index 8cbb8ae32a..3e51fcd152 100644
--- a/hw/9pfs/virtio-9p.c
+++ b/hw/9pfs/virtio-9p.c
@@ -1080,10 +1080,18 @@ static void v9fs_getattr(void *opaque)
 /*  fill st_gen if requested and supported by underlying fs */
 if (request_mask & P9_STATS_GEN) {
 retval = v9fs_co_st_gen(pdu, &fidp->path, stbuf.st_mode, &v9stat_dotl);
-if (retval < 0) {
+switch (retval) {
+case 0:
+/* we have valid st_gen: update result mask */
+v9stat_dotl.st_result_mask |= P9_STATS_GEN;
+break;
+case -EINTR:
+/* request cancelled */
 goto out;
+default:
+/* failed to get st_gen: not fatal, ignore */
+break;
 }
-v9stat_dotl.st_result_mask |= P9_STATS_GEN;
 }
 retval = pdu_marshal(pdu

Re: [Qemu-devel] Re: [PATCH 07/17] block/vvfat.c: fix warnings with _FORTIFY_SOURCE

2010-01-20 Thread Kirill A. Shutemov
On Wed, Jan 20, 2010 at 12:33 PM, Daniel P. Berrange
berra...@redhat.com wrote:
 On Wed, Jan 20, 2010 at 08:19:26AM +0200, Kirill A. Shutemov wrote:
 On Wed, Jan 20, 2010 at 1:56 AM, Juan Quintela quint...@redhat.com wrote:
  From: Kirill A. Shutemov kir...@shutemov.name
 
  CC    block/vvfat.o
  cc1: warnings being treated as errors
  block/vvfat.c: In function 'commit_one_file':
  block/vvfat.c:2259: error: ignoring return value of 'ftruncate', declared 
  with attribute warn_unused_result
  make: *** [block/vvfat.o] Error 1
   CC    block/vvfat.o
  In file included from /usr/include/stdio.h:912,
                  from ./qemu-common.h:19,
                  from block/vvfat.c:27:
  In function 'snprintf',
     inlined from 'init_directories' at block/vvfat.c:871,
     inlined from 'vvfat_open' at block/vvfat.c:1068:
  /usr/include/bits/stdio2.h:65: error: call to __builtin___snprintf_chk 
  will always overflow destination buffer
  make: *** [block/vvfat.o] Error 1
 
  Signed-off-by: Kirill A. Shutemov kir...@shutemov.name
  Signed-off-by: Juan Quintela quint...@redhat.com
  ---
   block/vvfat.c |    9 +++--
   1 files changed, 7 insertions(+), 2 deletions(-)
 
  diff --git a/block/vvfat.c b/block/vvfat.c
  index 063f731..df957e5 100644
  --- a/block/vvfat.c
  +++ b/block/vvfat.c
  @@ -868,7 +868,8 @@ static int init_directories(BDRVVVFATState* s,
      {
         direntry_t* entry=array_get_next(&(s->directory));
         entry->attributes=0x28; /* archive | volume label */
  -       snprintf((char*)entry->name,11,"QEMU VVFAT");
  +       memcpy(entry->name,"QEMU VVF",8);
  +       memcpy(entry->extension,"AT ",3);
      }

 Better to use

 memcpy(entry->name, "QEMU VVFAT", 11);

 memcpy() doesn't check bounds.

 It doesn't *currently* check bounds.

No. memcpy() will never check bounds. It's totally different from strcpy,
http://gcc.gnu.org/ml/gcc-patches/2009-06/msg00419.html

 If we want to explicitly
 fill 2 fields at once, then we should redeclare this to have a
 union with one part comprising the entire buffer, thus avoiding
 the need for deliberate buffer overruns.

 Regards,
 Daniel
 --
 |: Red Hat, Engineering, London   -o-   http://people.redhat.com/berrange/ :|
 |: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
 |: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
 |: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|





Re: [Qemu-devel] Re: [PATCH 07/17] block/vvfat.c: fix warnings with _FORTIFY_SOURCE

2010-01-20 Thread Kirill A. Shutemov
On Wed, Jan 20, 2010 at 2:15 PM, Markus Armbruster arm...@redhat.com wrote:
 Kevin Wolf kw...@redhat.com writes:

 Am 20.01.2010 12:09, schrieb Kirill A. Shutemov:
 On Wed, Jan 20, 2010 at 12:33 PM, Daniel P. Berrange
 berra...@redhat.com wrote:
 On Wed, Jan 20, 2010 at 08:19:26AM +0200, Kirill A. Shutemov wrote:
 On Wed, Jan 20, 2010 at 1:56 AM, Juan Quintela quint...@redhat.com 
 wrote:
 [...]
 diff --git a/block/vvfat.c b/block/vvfat.c
 index 063f731..df957e5 100644
 --- a/block/vvfat.c
 +++ b/block/vvfat.c
 @@ -868,7 +868,8 @@ static int init_directories(BDRVVVFATState* s,
     {
        direntry_t* entry=array_get_next(&(s->directory));
        entry->attributes=0x28; /* archive | volume label */
 -       snprintf((char*)entry->name,11,"QEMU VVFAT");
 +       memcpy(entry->name,"QEMU VVF",8);
 +       memcpy(entry->extension,"AT ",3);
     }

 Better to use

 memcpy(entry->name, "QEMU VVFAT", 11);

 memcpy() doesn't check bounds.

 No, this is evil, and may well be flagged by static analysis tools.

If so, the tool is stupid.

 It doesn't *currently* check bounds.

 No. memcpy() will never check bounds. It's totally different from strcpy,
 http://gcc.gnu.org/ml/gcc-patches/2009-06/msg00419.html

 Regardless of whether deliberately overflowing the buffer works or not,
 making it explicit is better. Someone might reorder the struct or add
 new fields in between (okay, unlikely in this case, but still) and
 you'll overflow into fields you never wanted to touch.

 Moreover, compilers are free to put padding between members name and
 extension.

No, compiler can't add anything between. 'char' is always byte-aligned.




Re: [Qemu-devel] Re: [PATCH 07/17] block/vvfat.c: fix warnings with _FORTIFY_SOURCE

2010-01-20 Thread Kirill A. Shutemov
On Wed, Jan 20, 2010 at 3:03 PM, Markus Armbruster arm...@redhat.com wrote:
 Kirill A. Shutemov kir...@shutemov.name writes:
 No, compiler can't add anything between. 'char' is always byte-aligned.

 You got some reading to do then.

Do you want to say that it's not true? Could you provide an example
when 'char' isn't byte-aligned?




Re: [Qemu-devel] Re: [PATCH 07/17] block/vvfat.c: fix warnings with _FORTIFY_SOURCE

2010-01-20 Thread Kirill A. Shutemov
2010/1/20 Gleb Natapov g...@redhat.com:
 On Wed, Jan 20, 2010 at 02:03:04PM +0100, Markus Armbruster wrote:
 Kirill A. Shutemov kir...@shutemov.name writes:
  No, compiler can't add anything between. 'char' is always byte-aligned.

 You got some reading to do then.

 To be fair this structure is packed, but this is not the reason to write
 stupid code of course.

In what way is it stupid? The compiler can't insert padding between two
arrays of 'char'.
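
To make both points concrete, a hedged sketch (the struct below is
illustrative, not the real QEMU direntry_t): char arrays are laid out back
to back, and the union approach lets a single memcpy fill both fields
without a deliberate overrun.

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Illustrative only, not the actual QEMU direntry_t. */
    typedef struct {
        union {
            struct {
                uint8_t name[8];
                uint8_t extension[3];
            };
            uint8_t full_name[11];   /* one buffer spanning both fields */
        };
        uint8_t attributes;
    } __attribute__((packed)) direntry_example_t;

    int main(void)
    {
        direntry_example_t entry;

        /* no padding between two adjacent char arrays */
        assert(offsetof(direntry_example_t, extension) == 8);

        /* with the union, one 11-byte copy stays inside a declared object */
        memcpy(entry.full_name, "QEMU VVFAT", 11);

        return (int)entry.full_name[0] - 'Q';   /* 0 on success */
    }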




[Qemu-devel] Re: [PATCH 01/14] Introduce qemu_write_full()

2010-01-19 Thread Kirill A. Shutemov
On Tue, Jan 19, 2010 at 2:11 PM, Juan Quintela quint...@redhat.com wrote:
 Kirill A. Shutemov kir...@shutemov.name wrote:
 A variant of write(2) which handles partial write.

 Signed-off-by: Kirill A. Shutemov kir...@shutemov.name

 Hi

 Have you updated this series?  Is there any reason that you know of why
 they haven't been picked up?

I don't know of any reason, but I'm going to review it once again.

I also plan to get rid of -fno-strict-aliasing where possible.




[Qemu-devel] Re: [PATCH 07/17] block/vvfat.c: fix warnings with _FORTIFY_SOURCE

2010-01-19 Thread Kirill A. Shutemov
On Wed, Jan 20, 2010 at 1:56 AM, Juan Quintela quint...@redhat.com wrote:
 From: Kirill A. Shutemov kir...@shutemov.name

 CC    block/vvfat.o
 cc1: warnings being treated as errors
 block/vvfat.c: In function 'commit_one_file':
 block/vvfat.c:2259: error: ignoring return value of 'ftruncate', declared 
 with attribute warn_unused_result
 make: *** [block/vvfat.o] Error 1
  CC    block/vvfat.o
 In file included from /usr/include/stdio.h:912,
                 from ./qemu-common.h:19,
                 from block/vvfat.c:27:
 In function 'snprintf',
    inlined from 'init_directories' at block/vvfat.c:871,
    inlined from 'vvfat_open' at block/vvfat.c:1068:
 /usr/include/bits/stdio2.h:65: error: call to __builtin___snprintf_chk will 
 always overflow destination buffer
 make: *** [block/vvfat.o] Error 1

 Signed-off-by: Kirill A. Shutemov kir...@shutemov.name
 Signed-off-by: Juan Quintela quint...@redhat.com
 ---
  block/vvfat.c |    9 +++--
  1 files changed, 7 insertions(+), 2 deletions(-)

 diff --git a/block/vvfat.c b/block/vvfat.c
 index 063f731..df957e5 100644
 --- a/block/vvfat.c
 +++ b/block/vvfat.c
 @@ -868,7 +868,8 @@ static int init_directories(BDRVVVFATState* s,
     {
        direntry_t* entry=array_get_next(&(s->directory));
        entry->attributes=0x28; /* archive | volume label */
 -       snprintf((char*)entry->name,11,"QEMU VVFAT");
 +       memcpy(entry->name,"QEMU VVF",8);
 +       memcpy(entry->extension,"AT ",3);
     }

Better to use

memcpy(entry->name, "QEMU VVFAT", 11);

memcpy() doesn't check bounds.

     /* Now build FAT, and write back information into directory */
 @@ -2256,7 +2257,11 @@ static int commit_one_file(BDRVVVFATState* s,
        c = c1;
     }

 -    ftruncate(fd, size);
 +    if (ftruncate(fd, size)) {
 +        perror("ftruncate()");
 +        close(fd);
 +        return -4;
 +    }
     close(fd);

     return commit_mappings(s, first_cluster, dir_index);
 --
 1.6.5.2






[Qemu-devel] Re: [PATCH 09/15] usb-linux.c: fix warning with _FORTIFY_SOURCE

2010-01-05 Thread Kirill A. Shutemov
On Tue, Jan 5, 2010 at 4:42 PM, Juan Quintela quint...@trasno.org wrote:
 Kirill A. Shutemov kir...@shutemov.name wrote:
   CC    usb-linux.o
 cc1: warnings being treated as errors
 usb-linux.c: In function 'usb_host_read_file':
 usb-linux.c:1204: error: ignoring return value of 'fgets', declared with 
 attribute warn_unused_result
 make: *** [usb-linux.o] Error 1

 Signed-off-by: Kirill A. Shutemov kir...@shutemov.name
 ---
  usb-linux.c |    8 ++--
  1 files changed, 6 insertions(+), 2 deletions(-)

 diff --git a/usb-linux.c b/usb-linux.c
 index 88728e9..8673474 100644
 --- a/usb-linux.c
 +++ b/usb-linux.c
 @@ -1201,9 +1201,13 @@ static int usb_host_read_file(char *line, size_t 
 line_size, const char *device_f
               device_file);
      f = fopen(filename, "r");
      if (f) {
 -        fgets(line, line_size, f);
 +        if (fgets(line, line_size, f)) {
 +            ret = 1;
 +        } else {
 +            ret = 0;
 +        }
 +
 This if is equivalent to:

 ret = !!fgets(line, line_size, f);

 No need for the if at all :)

It's not very readable.
Probably better to use something like:

ret = (fgets(line, line_size, f) != NULL);




Re: [Qemu-devel] [PATCH 14/14] Add -fstack-protector-all to CFLAGS

2010-01-01 Thread Kirill A. Shutemov
On Thu, Dec 31, 2009 at 12:58 PM, Arnaud Patard
arnaud.pat...@rtp-net.org wrote:
 Kirill A. Shutemov kir...@shutemov.name writes:
 Hi,

  -fstack-protector-all emits extra code to check for buffer overflows,
 such as stack smashing attacks.  This is done by adding a guard
 variable to functions with vulnerable objects.

 Signed-off-by: Kirill A. Shutemov kir...@shutemov.name
 ---
  configure |    1 +
  1 files changed, 1 insertions(+), 0 deletions(-)

 diff --git a/configure b/configure
 index 0cdcdb3..16b70d8 100755
 --- a/configure
 +++ b/configure
 @@ -98,6 +98,7 @@ QEMU_CFLAGS="-Wall -Wundef -Wendif-labels -Wwrite-strings 
 -Wmissing-prototypes $QEMU_CFLAGS"
  QEMU_CFLAGS="-Wstrict-prototypes -Wredundant-decls $QEMU_CFLAGS"
  QEMU_CFLAGS="-D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE 
 $QEMU_CFLAGS"
  QEMU_CFLAGS="-D_FORTIFY_SOURCE=2 $QEMU_CFLAGS"
 +QEMU_CFLAGS="-fstack-protector-all $QEMU_CFLAGS"

 afaik not all arches out there support
 -fstack-protector-all (to be more precise, some have no stack protector
 support at all). iirc, gcc will emit a warning and still compile,
 but it would be nice to avoid the warning.

Thanks. Will be fixed.




Re: [Qemu-devel] [PATCH 12/14] linux-user/mmap.c: fix warnings with _FORTIFY_SOURCE

2010-01-01 Thread Kirill A. Shutemov
On Thu, Dec 31, 2009 at 12:50 PM, Arnaud Patard
arnaud.pat...@rtp-net.org wrote:
 Kirill A. Shutemov kir...@shutemov.name writes:

 Hi,

   CC    i386-linux-user/mmap.o
 cc1: warnings being treated as errors
 /usr/src/RPM/BUILD/qemu-0.11.92/linux-user/mmap.c: In function 'mmap_frag':
 /usr/src/RPM/BUILD/qemu-0.11.92/linux-user/mmap.c:253: error: ignoring 
 return value of 'pread', declared with attribute warn_unused_result
 /usr/src/RPM/BUILD/qemu-0.11.92/linux-user/mmap.c: In function 'target_mmap':
 /usr/src/RPM/BUILD/qemu-0.11.92/linux-user/mmap.c:477: error: ignoring 
 return value of 'pread', declared with attribute warn_unused_result
 make[1]: *** [mmap.o] Error 1

 Signed-off-by: Kirill A. Shutemov kir...@shutemov.name
 ---
  linux-user/mmap.c |    6 --
  1 files changed, 4 insertions(+), 2 deletions(-)

 diff --git a/linux-user/mmap.c b/linux-user/mmap.c
 index 144fb7c..e496c64 100644
 --- a/linux-user/mmap.c
 +++ b/linux-user/mmap.c
 @@ -250,7 +250,8 @@ static int mmap_frag(abi_ulong real_start,
              mprotect(host_start, qemu_host_page_size, prot1 | PROT_WRITE);

          /* read the corresponding file data */
 -        pread(fd, g2h(start), end - start, offset);
 +        if (pread(fd, g2h(start), end - start, offset) == -1)
 +            return -errno;

 This needs to be checked but iirc, it's wrong. One should set errno and
 return -1. Please double check and fix if needed.

Thanks.
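
For what it's worth, a hedged sketch of the two error conventions being
compared here (neither function is part of the patch; the names are made up):

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Convention used by the hunk above: return a negative errno code. */
    static ssize_t pread_neg_errno(int fd, void *buf, size_t len, off_t off)
    {
        ssize_t n = pread(fd, buf, len, off);
        return n == -1 ? -errno : n;
    }

    /* Convention the review points at: return -1 and leave errno set,
     * matching pread() itself. */
    static ssize_t pread_minus_one(int fd, void *buf, size_t len, off_t off)
    {
        return pread(fd, buf, len, off);
    }

    int main(void)
    {
        char c;
        if (pread_minus_one(-1, &c, 1, 0) == -1) {
            perror("pread");            /* errno is still meaningful here */
        }
        if (pread_neg_errno(-1, &c, 1, 0) < 0) {
            /* the caller receives the code directly, e.g. -EBADF */
        }
        return 0;
    }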




[Qemu-devel] [PATCH 01/15] Introduce qemu_write_full()

2010-01-01 Thread Kirill A. Shutemov
A variant of write(2) which handles partial write.

Signed-off-by: Kirill A. Shutemov kir...@shutemov.name
---
 osdep.c   |   27 +++
 qemu-common.h |2 ++
 2 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/osdep.c b/osdep.c
index e4836e7..8ae48fe 100644
--- a/osdep.c
+++ b/osdep.c
@@ -243,6 +243,33 @@ int qemu_open(const char *name, int flags, ...)
 return ret;
 }
 
+/*
+ * A variant of write(2) which handles partial write.
+ *
+ * Return the number of bytes transferred.
+ * Set errno if fewer than `count' bytes are written.
+ */
+size_t qemu_write_full(int fd, const void *buf, size_t count)
+{
+ssize_t ret;
+size_t total = 0;
+
+while (count) {
+ret = write(fd, buf, count);
+if (ret < 0) {
+if (errno == EINTR)
+continue;
+break;
+}
+
+count -= ret;
+buf += ret;
+total += ret;
+}
+
+return total;
+}
+
 #ifndef _WIN32
 /*
  * Creates a pipe with FD_CLOEXEC set on both file descriptors
diff --git a/qemu-common.h b/qemu-common.h
index 8630f8c..7231348 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -160,6 +160,8 @@ void qemu_mutex_lock_iothread(void);
 void qemu_mutex_unlock_iothread(void);
 
 int qemu_open(const char *name, int flags, ...);
+size_t qemu_write_full(int fd, const void *buf, size_t count)
+__attribute__ ((warn_unused_result));
 void qemu_set_cloexec(int fd);
 
 #ifndef _WIN32
-- 
1.6.5.7
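
A hedged usage sketch, mirroring how the follow-up patches in this series
call the new helper (the function below is illustrative, not part of the
patch):

    #include <errno.h>
    #include "qemu-common.h"    /* declares qemu_write_full() */

    /* Treat anything shorter than `count' as a failure and convert the
     * errno left by qemu_write_full() into a negative return code. */
    static int write_all_or_fail(int fd, const void *buf, size_t count)
    {
        if (qemu_write_full(fd, buf, count) != count) {
            return -errno;
        }
        return 0;
    }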





[Qemu-devel] [PATCH 02/15] posix-aio-compat.c: fix warning with _FORTIFY_SOURCE

2010-01-01 Thread Kirill A. Shutemov
  CCposix-aio-compat.o
cc1: warnings being treated as errors
posix-aio-compat.c: In function 'aio_signal_handler':
posix-aio-compat.c:505: error: ignoring return value of 'write', declared with 
attribute warn_unused_result
make: *** [posix-aio-compat.o] Error 1

Signed-off-by: Kirill A. Shutemov kir...@shutemov.name
---
 posix-aio-compat.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/posix-aio-compat.c b/posix-aio-compat.c
index dc14f53..1272e84 100644
--- a/posix-aio-compat.c
+++ b/posix-aio-compat.c
@@ -501,8 +501,11 @@ static void aio_signal_handler(int signum)
 {
 if (posix_aio_state) {
 char byte = 0;
+int ret;
 
-write(posix_aio_state->wfd, &byte, sizeof(byte));
+ret = write(posix_aio_state->wfd, &byte, sizeof(byte));
+if (ret < 0 && (errno != EINTR && errno != EAGAIN))
+die("write()");
 }
 
 qemu_service_io();
-- 
1.6.5.7





[Qemu-devel] [PATCH 03/15] block/cow.c: fix warnings with _FORTIFY_SOURCE

2010-01-01 Thread Kirill A. Shutemov
  CCblock/cow.o
cc1: warnings being treated as errors
block/cow.c: In function 'cow_create':
block/cow.c:251: error: ignoring return value of 'write', declared with 
attribute warn_unused_result
block/cow.c:253: error: ignoring return value of 'ftruncate', declared with 
attribute warn_unused_result
make: *** [block/cow.o] Error 1

Signed-off-by: Kirill A. Shutemov kir...@shutemov.name
---
 block/cow.c |   19 ---
 1 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/block/cow.c b/block/cow.c
index a70854e..ba07b97 100644
--- a/block/cow.c
+++ b/block/cow.c
@@ -209,6 +209,7 @@ static int cow_create(const char *filename, 
QEMUOptionParameter *options)
 struct stat st;
 int64_t image_sectors = 0;
 const char *image_filename = NULL;
+int ret;
 
 /* Read out options */
 while (options && options->name) {
@@ -248,11 +249,23 @@ static int cow_create(const char *filename, 
QEMUOptionParameter *options)
 }
 cow_header.sectorsize = cpu_to_be32(512);
 cow_header.size = cpu_to_be64(image_sectors * 512);
-write(cow_fd, &cow_header, sizeof(cow_header));
+ret = qemu_write_full(cow_fd, &cow_header, sizeof(cow_header));
+if (ret != sizeof(cow_header)) {
+ret = -errno;
+goto exit;
+}
+
 /* resize to include at least all the bitmap */
-ftruncate(cow_fd, sizeof(cow_header) + ((image_sectors + 7) >> 3));
+ret = ftruncate(cow_fd, sizeof(cow_header) + ((image_sectors + 7) >> 3));
+if (ret) {
+ret = -errno;
+goto exit;
+}
+
+ret = 0;
+exit:
 close(cow_fd);
-return 0;
+return ret;
 }
 
 static void cow_flush(BlockDriverState *bs)
-- 
1.6.5.7





[Qemu-devel] [PATCH 04/15] block/qcow.c: fix warnings with _FORTIFY_SOURCE

2010-01-01 Thread Kirill A. Shutemov
  CCblock/qcow.o
cc1: warnings being treated as errors
block/qcow.c: In function 'qcow_create':
block/qcow.c:804: error: ignoring return value of 'write', declared with 
attribute warn_unused_result
block/qcow.c:806: error: ignoring return value of 'write', declared with 
attribute warn_unused_result
block/qcow.c:811: error: ignoring return value of 'write', declared with 
attribute warn_unused_result
make: *** [block/qcow.o] Error 1

Signed-off-by: Kirill A. Shutemov kir...@shutemov.name
---
 block/qcow.c |   25 +
 1 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/block/qcow.c b/block/qcow.c
index 7fc85ae..28cc092 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -750,6 +750,7 @@ static int qcow_create(const char *filename, 
QEMUOptionParameter *options)
 int64_t total_size = 0;
 const char *backing_file = NULL;
 int flags = 0;
+int ret;
 
 /* Read out options */
 while (options && options->name) {
@@ -801,17 +802,33 @@ static int qcow_create(const char *filename, 
QEMUOptionParameter *options)
 }
 
 /* write all the data */
-write(fd, &header, sizeof(header));
+ret = qemu_write_full(fd, &header, sizeof(header));
+if (ret != sizeof(header)) {
+ret = -errno;
+goto exit;
+}
+
 if (backing_file) {
-write(fd, backing_file, backing_filename_len);
+ret = qemu_write_full(fd, backing_file, backing_filename_len);
+if (ret != backing_filename_len) {
+ret = -errno;
+goto exit;
+}
 }
 lseek(fd, header_size, SEEK_SET);
 tmp = 0;
 for(i = 0;i < l1_size; i++) {
-write(fd, &tmp, sizeof(tmp));
+ret = qemu_write_full(fd, &tmp, sizeof(tmp));
+if (ret != sizeof(tmp)) {
+ret = -errno;
+goto exit;
+}
 }
+
+ret = 0;
+exit:
 close(fd);
-return 0;
+return ret;
 }
 
 static int qcow_make_empty(BlockDriverState *bs)
-- 
1.6.5.7





[Qemu-devel] [PATCH 05/15] block/vmdk.o: fix warnings with _FORTIFY_SOURCE

2010-01-01 Thread Kirill A. Shutemov
  CCblock/vmdk.o
cc1: warnings being treated as errors
block/vmdk.c: In function 'vmdk_snapshot_create':
block/vmdk.c:236: error: ignoring return value of 'ftruncate', declared with 
attribute warn_unused_result
block/vmdk.c: In function 'vmdk_create':
block/vmdk.c:775: error: ignoring return value of 'write', declared with 
attribute warn_unused_result
block/vmdk.c:776: error: ignoring return value of 'write', declared with 
attribute warn_unused_result
block/vmdk.c:778: error: ignoring return value of 'ftruncate', declared with 
attribute warn_unused_result
block/vmdk.c:784: error: ignoring return value of 'write', declared with 
attribute warn_unused_result
block/vmdk.c:790: error: ignoring return value of 'write', declared with 
attribute warn_unused_result
block/vmdk.c:807: error: ignoring return value of 'write', declared with 
attribute warn_unused_result
make: *** [block/vmdk.o] Error 1

Signed-off-by: Kirill A. Shutemov kir...@shutemov.name
---
 block/vmdk.c |   50 --
 1 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/block/vmdk.c b/block/vmdk.c
index 4e48622..58fc04b 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -233,7 +233,8 @@ static int vmdk_snapshot_create(const char *filename, const 
char *backing_file)
 memset(&header, 0, sizeof(header));
 memcpy(&header, &hdr[4], sizeof(header)); // skip the VMDK4_MAGIC
 
-ftruncate(snp_fd, header.grain_offset << 9);
+if (ftruncate(snp_fd, header.grain_offset << 9))
+goto fail;
 /* the descriptor offset = 0x200 */
 if (lseek(p_fd, 0x200, SEEK_SET) == -1)
 goto fail;
@@ -716,6 +717,7 @@ static int vmdk_create(const char *filename, 
QEMUOptionParameter *options)
 int64_t total_size = 0;
 const char *backing_file = NULL;
 int flags = 0;
+int ret;
 
 // Read out options
 while (options && options->name) {
@@ -772,22 +774,44 @@ static int vmdk_create(const char *filename, 
QEMUOptionParameter *options)
 header.check_bytes[3] = 0xa;
 
 /* write all the data */
-write(fd, &magic, sizeof(magic));
-write(fd, &header, sizeof(header));
+ret = qemu_write_full(fd, &magic, sizeof(magic));
+if (ret != sizeof(magic)) {
+ret = -errno;
+goto exit;
+}
+ret = qemu_write_full(fd, &header, sizeof(header));
+if (ret != sizeof(header)) {
+ret = -errno;
+goto exit;
+}
 
-ftruncate(fd, header.grain_offset << 9);
+ret = ftruncate(fd, header.grain_offset << 9);
+if (ret < 0) {
+ret = -errno;
+goto exit;
+}
 
 /* write grain directory */
 lseek(fd, le64_to_cpu(header.rgd_offset) << 9, SEEK_SET);
 for (i = 0, tmp = header.rgd_offset + gd_size;
- i < gt_count; i++, tmp += gt_size)
-write(fd, &tmp, sizeof(tmp));
+ i < gt_count; i++, tmp += gt_size) {
+ret = qemu_write_full(fd, &tmp, sizeof(tmp));
+if (ret != sizeof(tmp)) {
+ret = -errno;
+goto exit;
+}
+}
 
 /* write backup grain directory */
 lseek(fd, le64_to_cpu(header.gd_offset) << 9, SEEK_SET);
 for (i = 0, tmp = header.gd_offset + gd_size;
- i < gt_count; i++, tmp += gt_size)
-write(fd, &tmp, sizeof(tmp));
+ i < gt_count; i++, tmp += gt_size) {
+ret = qemu_write_full(fd, &tmp, sizeof(tmp));
+if (ret != sizeof(tmp)) {
+ret = -errno;
+goto exit;
+}
+}
 
 /* compose the descriptor */
 real_filename = filename;
@@ -804,10 +828,16 @@ static int vmdk_create(const char *filename, 
QEMUOptionParameter *options)
 
 /* write the descriptor */
 lseek(fd, le64_to_cpu(header.desc_offset) << 9, SEEK_SET);
-write(fd, desc, strlen(desc));
+ret = qemu_write_full(fd, desc, strlen(desc));
+if (ret != strlen(desc)) {
+ret = -errno;
+goto exit;
+}
 
+ret = 0;
+exit:
 close(fd);
-return 0;
+return ret;
 }
 
 static void vmdk_close(BlockDriverState *bs)
-- 
1.6.5.7





[Qemu-devel] [PATCH 06/15] block/vvfat.c: fix warnings with _FORTIFY_SOURCE

2010-01-01 Thread Kirill A. Shutemov
  CCblock/vvfat.o
cc1: warnings being treated as errors
block/vvfat.c: In function 'commit_one_file':
block/vvfat.c:2259: error: ignoring return value of 'ftruncate', declared with 
attribute warn_unused_result
make: *** [block/vvfat.o] Error 1
  CCblock/vvfat.o
In file included from /usr/include/stdio.h:912,
 from ./qemu-common.h:19,
 from block/vvfat.c:27:
In function 'snprintf',
inlined from 'init_directories' at block/vvfat.c:871,
inlined from 'vvfat_open' at block/vvfat.c:1068:
/usr/include/bits/stdio2.h:65: error: call to __builtin___snprintf_chk will 
always overflow destination buffer
make: *** [block/vvfat.o] Error 1

Signed-off-by: Kirill A. Shutemov kir...@shutemov.name
---
 block/vvfat.c |7 +--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/vvfat.c b/block/vvfat.c
index 063f731..8140dbc 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -868,7 +868,7 @@ static int init_directories(BDRVVVFATState* s,
 {
 direntry_t* entry=array_get_next(&(s->directory));
 entry->attributes=0x28; /* archive | volume label */
-   snprintf((char*)entry->name,11,"QEMU VVFAT");
+memcpy(entry->name, "QEMU VVFAT", 11);
 }
 
 /* Now build FAT, and write back information into directory */
@@ -2256,7 +2256,10 @@ static int commit_one_file(BDRVVVFATState* s,
c = c1;
 }
 
-ftruncate(fd, size);
+if (ftruncate(fd, size)) {
+perror("ftruncate()");
+abort();
+}
 close(fd);
 
 return commit_mappings(s, first_cluster, dir_index);
-- 
1.6.5.7





[Qemu-devel] [PATCH 07/15] block/qcow2.c: fix warnings with _FORTIFY_SOURCE

2010-01-01 Thread Kirill A. Shutemov
  CCblock/qcow2.o
cc1: warnings being treated as errors
block/qcow2.c: In function 'qcow_create2':
block/qcow2.c:829: error: ignoring return value of 'write', declared with 
attribute warn_unused_result
block/qcow2.c:838: error: ignoring return value of 'write', declared with 
attribute warn_unused_result
block/qcow2.c:839: error: ignoring return value of 'write', declared with 
attribute warn_unused_result
block/qcow2.c:841: error: ignoring return value of 'write', declared with 
attribute warn_unused_result
block/qcow2.c:844: error: ignoring return value of 'write', declared with 
attribute warn_unused_result
block/qcow2.c:849: error: ignoring return value of 'write', declared with 
attribute warn_unused_result
block/qcow2.c:852: error: ignoring return value of 'write', declared with 
attribute warn_unused_result
block/qcow2.c:855: error: ignoring return value of 'write', declared with 
attribute warn_unused_result
make: *** [block/qcow2.o] Error 1

Signed-off-by: Kirill A. Shutemov kir...@shutemov.name
---
 block/qcow2.c |   55 +--
 1 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 984264b..1874124 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -743,7 +743,7 @@ static int qcow_create2(const char *filename, int64_t 
total_size,
 uint64_t tmp, offset;
 QCowCreateState s1, *s = &s1;
 QCowExtension ext_bf = {0, 0};
-
+int ret;
 
 memset(s, 0, sizeof(*s));
 
@@ -826,7 +826,11 @@ static int qcow_create2(const char *filename, int64_t 
total_size,
 ref_clusters * s->cluster_size);
 
 /* write all the data */
-write(fd, &header, sizeof(header));
+ret = qemu_write_full(fd, &header, sizeof(header));
+if (ret != sizeof(header)) {
+ret = -errno;
+goto exit;
+}
 if (backing_file) {
 if (backing_format_len) {
 char zero[16];
@@ -835,25 +839,56 @@ static int qcow_create2(const char *filename, int64_t 
total_size,
 memset(zero, 0, sizeof(zero));
 cpu_to_be32s(&ext_bf.magic);
 cpu_to_be32s(&ext_bf.len);
-write(fd, &ext_bf, sizeof(ext_bf));
-write(fd, backing_format, backing_format_len);
+ret = qemu_write_full(fd, &ext_bf, sizeof(ext_bf));
+if (ret != sizeof(ext_bf)) {
+ret = -errno;
+goto exit;
+}
+ret = qemu_write_full(fd, backing_format, backing_format_len);
+if (ret != backing_format_len) {
+ret = -errno;
+goto exit;
+}
 if (padding > 0) {
-write(fd, zero, padding);
+ret = qemu_write_full(fd, zero, padding);
+if (ret != padding) {
+ret = -errno;
+goto exit;
+}
 }
 }
-write(fd, backing_file, backing_filename_len);
+ret = qemu_write_full(fd, backing_file, backing_filename_len);
+if (ret != backing_filename_len) {
+ret = -errno;
+goto exit;
+}
 }
 lseek(fd, s->l1_table_offset, SEEK_SET);
 tmp = 0;
 for(i = 0;i < l1_size; i++) {
-write(fd, &tmp, sizeof(tmp));
+ret = qemu_write_full(fd, &tmp, sizeof(tmp));
+if (ret != sizeof(tmp)) {
+ret = -errno;
+goto exit;
+}
 }
 lseek(fd, s->refcount_table_offset, SEEK_SET);
-write(fd, s->refcount_table, s->cluster_size);
+ret = qemu_write_full(fd, s->refcount_table, s->cluster_size);
+if (ret != s->cluster_size) {
+ret = -errno;
+goto exit;
+}
 
 lseek(fd, s->refcount_block_offset, SEEK_SET);
-write(fd, s->refcount_block, ref_clusters * s->cluster_size);
+ret = qemu_write_full(fd, s->refcount_block,
+ref_clusters * s->cluster_size);
+if (ret != s->cluster_size) {
+ret = -errno;
+goto exit;
+}
 
+ret = 0;
+exit:
 qemu_free(s->refcount_table);
 qemu_free(s->refcount_block);
 close(fd);
@@ -867,7 +902,7 @@ static int qcow_create2(const char *filename, int64_t 
total_size,
 bdrv_close(bs);
 }
 
-return 0;
+return ret;
 }
 
 static int qcow_create(const char *filename, QEMUOptionParameter *options)
-- 
1.6.5.7





[Qemu-devel] [PATCH 08/15] net/slirp.c: fix warning with _FORTIFY_SOURCE

2010-01-01 Thread Kirill A. Shutemov
  CCnet/slirp.o
cc1: warnings being treated as errors
net/slirp.c: In function 'slirp_smb_cleanup':
net/slirp.c:470: error: ignoring return value of 'system', declared with 
attribute warn_unused_result
make: *** [net/slirp.o] Error 1

Signed-off-by: Kirill A. Shutemov kir...@shutemov.name
---
 net/slirp.c |9 -
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/net/slirp.c b/net/slirp.c
index 3f91c4b..ef7c8e4 100644
--- a/net/slirp.c
+++ b/net/slirp.c
@@ -464,10 +464,17 @@ int net_slirp_redir(const char *redir_str)
 static void slirp_smb_cleanup(SlirpState *s)
 {
 char cmd[128];
+int ret;
 
 if (s->smb_dir[0] != '\0') {
 snprintf(cmd, sizeof(cmd), "rm -rf %s", s->smb_dir);
-system(cmd);
+ret = system(cmd);
+if (ret == -1) {
+perror("system()");
+} else if (WEXITSTATUS(ret)) {
+qemu_error("'%s' failed. Error code: %d\n",
+cmd, WEXITSTATUS(ret));
+}
 s->smb_dir[0] = '\0';
 }
 }
-- 
1.6.5.7




