Re: [PATCH] virt: acrn: Remove unused list 'acrn_irqfd_clients'

2024-05-17 Thread Dr. David Alan Gilbert
* li...@treblig.org (li...@treblig.org) wrote:
> From: "Dr. David Alan Gilbert" 
> 
> It doesn't look like this was ever used.
> 
> Build tested only.
> 
> Signed-off-by: Dr. David Alan Gilbert 

Ping

> ---
>  drivers/virt/acrn/irqfd.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/virt/acrn/irqfd.c b/drivers/virt/acrn/irqfd.c
> index d4ad211dce7a3..346cf0be4aac7 100644
> --- a/drivers/virt/acrn/irqfd.c
> +++ b/drivers/virt/acrn/irqfd.c
> @@ -16,8 +16,6 @@
>  
>  #include "acrn_drv.h"
>  
> -static LIST_HEAD(acrn_irqfd_clients);
> -
>  /**
>   * struct hsm_irqfd - Properties of HSM irqfd
>   * @vm:  Associated VM pointer
> -- 
> 2.45.0
> 
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/



Re: [PATCH] ftrace: Remove unused global 'ftrace_direct_func_count'

2024-05-06 Thread Dr. David Alan Gilbert
* li...@treblig.org (li...@treblig.org) wrote:
> From: "Dr. David Alan Gilbert" 
> 
> Commit 8788ca164eb4b ("ftrace: Remove the legacy _ftrace_direct API")
> stopped setting the 'ftrace_direct_func_count' variable, but left
> it around.  Clean it up.
> 
> Signed-off-by: Dr. David Alan Gilbert 

FYI this is on top of my earlier 'ftrace: Remove unused list 'ftrace_direct_funcs''

Dave

> ---
>  include/linux/ftrace.h |  2 --
>  kernel/trace/fgraph.c  | 11 ---
>  kernel/trace/ftrace.c  |  1 -
>  3 files changed, 14 deletions(-)
> 
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index b01cca36147ff..e3a83ebd1b333 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -413,7 +413,6 @@ struct ftrace_func_entry {
>  };
>  
>  #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
> -extern int ftrace_direct_func_count;
>  unsigned long ftrace_find_rec_direct(unsigned long ip);
>  int register_ftrace_direct(struct ftrace_ops *ops, unsigned long addr);
>  int unregister_ftrace_direct(struct ftrace_ops *ops, unsigned long addr,
> @@ -425,7 +424,6 @@ void ftrace_stub_direct_tramp(void);
>  
>  #else
>  struct ftrace_ops;
> -# define ftrace_direct_func_count 0
>  static inline unsigned long ftrace_find_rec_direct(unsigned long ip)
>  {
>   return 0;
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index c83c005e654e3..a130b2d898f7c 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -125,17 +125,6 @@ int function_graph_enter(unsigned long ret, unsigned 
> long func,
>  {
>   struct ftrace_graph_ent trace;
>  
> -#ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
> - /*
> -  * Skip graph tracing if the return location is served by direct 
> trampoline,
> -  * since call sequence and return addresses are unpredictable anyway.
> -  * Ex: BPF trampoline may call original function and may skip frame
> -  * depending on type of BPF programs attached.
> -  */
> - if (ftrace_direct_func_count &&
> - ftrace_find_rec_direct(ret - MCOUNT_INSN_SIZE))
> - return -EBUSY;
> -#endif
>   trace.func = func;
>   trace.depth = ++current->curr_ret_depth;
>  
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index b18b4ece3d7c9..adf34167c3418 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -2538,7 +2538,6 @@ ftrace_find_unique_ops(struct dyn_ftrace *rec)
>  /* Protected by rcu_tasks for reading, and direct_mutex for writing */
>  static struct ftrace_hash __rcu *direct_functions = EMPTY_HASH;
>  static DEFINE_MUTEX(direct_mutex);
> -int ftrace_direct_func_count;
>  
>  /*
>   * Search the direct_functions hash to see if the given instruction pointer
> -- 
> 2.45.0
> 
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/



Re: ftrace_direct_func_count ?

2024-05-06 Thread Dr. David Alan Gilbert
* Steven Rostedt (rost...@goodmis.org) wrote:
> On Sat, 4 May 2024 13:35:26 +
> "Dr. David Alan Gilbert"  wrote:
> 
> > Hi,
> >   I've just posted a patch 'ftrace: Remove unused list 
> > 'ftrace_direct_funcs''
> > that clears out some old code, but while at it I noticed the global
> > 'ftrace_direct_func_count'.
> > 
> > As far as I can tell, it's never assigned (or initialised) but it is tested:
> > 
> > kernel/trace/fgraph.c:
> > #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
> >   /*
> >* Skip graph tracing if the return location is served by direct 
> > trampoline,
> >* since call sequence and return addresses are unpredictable anyway.
> >* Ex: BPF trampoline may call original function and may skip frame
> >* depending on type of BPF programs attached.
> >*/
> >   if (ftrace_direct_func_count &&
> >   ftrace_find_rec_direct(ret - MCOUNT_INSN_SIZE))
> > return -EBUSY;
> > #endif
> > 
> > So I wasn't sure whether it was just safe to nuke that section
> > or whether it really needed fixing?
> 
> Yes, after commit 8788ca164eb4bad ("ftrace: Remove the legacy
> _ftrace_direct API") that variable is no longer used.

OK, thanks, I'll send a follow up patch to my other patch to nuke
this as well.

Dave

> -- Steve
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/



ftrace_direct_func_count ?

2024-05-04 Thread Dr. David Alan Gilbert
Hi,
  I've just posted a patch 'ftrace: Remove unused list 'ftrace_direct_funcs''
that clears out some old code, but while at it I noticed the global
'ftrace_direct_func_count'.

As far as I can tell, it's never assigned (or initialised) but it is tested:

kernel/trace/fgraph.c:
#ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
  /*
   * Skip graph tracing if the return location is served by direct trampoline,
   * since call sequence and return addresses are unpredictable anyway.
   * Ex: BPF trampoline may call original function and may skip frame
   * depending on type of BPF programs attached.
   */
  if (ftrace_direct_func_count &&
  ftrace_find_rec_direct(ret - MCOUNT_INSN_SIZE))
return -EBUSY;
#endif

So I wasn't sure whether it was just safe to nuke that section
or whether it really needed fixing?

Dave

-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/



Re: [PATCH] module: ban '.', '..' as module names, ban '/' in module names

2024-04-15 Thread Dr. David Alan Gilbert
* Alexey Dobriyan (adobri...@gmail.com) wrote:
> On Sun, Apr 14, 2024 at 01:58:55PM -0700, Luis Chamberlain wrote:
> > On Sun, Apr 14, 2024 at 10:05:05PM +0300, Alexey Dobriyan wrote:
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -3616,4 +3616,12 @@ extern int vfs_fadvise(struct file *file, loff_t 
> > > offset, loff_t len,
> > >  extern int generic_fadvise(struct file *file, loff_t offset, loff_t len,
> > >  int advice);
> > >  
> > > +/*
> > > + * Use this if data from userspace end up as directory/filename on
> > > + * some virtual filesystem.
> > > + */
> > > +static inline bool string_is_vfs_ready(const char *s)
> > > +{
> > > + return strcmp(s, ".") != 0 && strcmp(s, "..") != 0 && !strchr(s, '/');
> > > +}
> > >  #endif /* _LINUX_FS_H */
> > > --- a/kernel/module/main.c
> > > +++ b/kernel/module/main.c
> > > @@ -2893,6 +2893,11 @@ static int load_module(struct load_info *info, 
> > > const char __user *uargs,
> > >  
> > >   audit_log_kern_module(mod->name);
> > >  
> > > + if (!string_is_vfs_ready(mod->name)) {
> > > + err = -EINVAL;
> > > + goto free_module;
> > > + }
> > > +
> > 
> > Sensible change however to put string_is_vfs_ready() in include/linux/fs.h 
> > is a stretch if there really are no other users.
> 
> This is forward thinking patch :-)
> 
> Other subsystems may create files/directories in proc/sysfs, and should
> check for bad names as well:
> 
>   /proc/2821/net/dev_snmp6/eth0
> 
> This looks exactly like something coming from userspace and making it
> into /proc, so the filter function doesn't belong to kernel/module/internal.h

You mean like:

[24180.292204] tuxthe: renamed from tuxthe
root@dalek:/home/dg# ls /sys/class/net/
enp5s0  lo  tuxthe  tuxthe  tuxthe  virbr0  virbr1

?

Dave
> 
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/
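
For reference, a minimal self-contained illustration of the check being
discussed (it mirrors the string_is_vfs_ready() helper quoted above; the
sample names below are made up):

    #include <stdio.h>
    #include <string.h>
    #include <stdbool.h>

    /* Same logic as the proposed helper: reject ".", ".." and anything
     * containing '/'. */
    static bool string_is_vfs_ready(const char *s)
    {
        return strcmp(s, ".") != 0 && strcmp(s, "..") != 0 && !strchr(s, '/');
    }

    int main(void)
    {
        /* Hypothetical module/interface names, good and bad. */
        const char *names[] = { "e1000e", "..", "evil/module", "eth0" };

        for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++)
            printf("%-12s -> %s\n", names[i],
                   string_is_vfs_ready(names[i]) ? "ok" : "rejected");
        return 0;
    }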



Re: [PATCH v2 4/4] KVM: SVM: Support #GP handling for the case of nested on nested

2021-01-21 Thread Dr. David Alan Gilbert
* Wei Huang (wei.hua...@amd.com) wrote:
> Under the case of nested on nested (e.g. L0->L1->L2->L3), #GP triggered
> by SVM instructions can be hided from L1. Instead the hypervisor can
> inject the proper #VMEXIT to inform L1 of what is happening. Thus L1
> can avoid invoking the #GP workaround. For this reason we turns on
> guest VM's X86_FEATURE_SVME_ADDR_CHK bit for KVM running inside VM to
> receive the notification and change behavior.

Doesn't this mean a VM migrated between levels (hmm L2 to L1???) would
see different behaviour?
(I've never tried such a migration, but I thought in principle it should
work).

Dave


> Co-developed-by: Bandan Das 
> Signed-off-by: Bandan Das 
> Signed-off-by: Wei Huang 
> ---
>  arch/x86/kvm/svm/svm.c | 19 ++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 2a12870ac71a..89512c0e7663 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2196,6 +2196,11 @@ static int svm_instr_opcode(struct kvm_vcpu *vcpu)
>  
>  static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode)
>  {
> + const int guest_mode_exit_codes[] = {
> + [SVM_INSTR_VMRUN] = SVM_EXIT_VMRUN,
> + [SVM_INSTR_VMLOAD] = SVM_EXIT_VMLOAD,
> + [SVM_INSTR_VMSAVE] = SVM_EXIT_VMSAVE,
> + };
>   int (*const svm_instr_handlers[])(struct vcpu_svm *svm) = {
>   [SVM_INSTR_VMRUN] = vmrun_interception,
>   [SVM_INSTR_VMLOAD] = vmload_interception,
> @@ -2203,7 +2208,14 @@ static int emulate_svm_instr(struct kvm_vcpu *vcpu, 
> int opcode)
>   };
>   struct vcpu_svm *svm = to_svm(vcpu);
>  
> - return svm_instr_handlers[opcode](svm);
> + if (is_guest_mode(vcpu)) {
> + svm->vmcb->control.exit_code = guest_mode_exit_codes[opcode];
> + svm->vmcb->control.exit_info_1 = 0;
> + svm->vmcb->control.exit_info_2 = 0;
> +
> + return nested_svm_vmexit(svm);
> + } else
> + return svm_instr_handlers[opcode](svm);
>  }
>  
>  /*
> @@ -4034,6 +4046,11 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu 
> *vcpu)
>   /* Check again if INVPCID interception if required */
>   svm_check_invpcid(svm);
>  
> + if (nested && guest_cpuid_has(vcpu, X86_FEATURE_SVM)) {
> + best = kvm_find_cpuid_entry(vcpu, 0x8000000A, 0);
> + best->edx |= (1 << 28);
> + }
> +
>       /* For sev guests, the memory encryption bit is not reserved in CR3.  */
>   if (sev_guest(vcpu->kvm)) {
> + best = kvm_find_cpuid_entry(vcpu, 0x8000001F, 0);
> -- 
> 2.27.0
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [RFC PATCH 1/2] userfaultfd: add minor fault registration mode

2021-01-11 Thread Dr. David Alan Gilbert
* Axel Rasmussen (axelrasmus...@google.com) wrote:
> On Mon, Jan 11, 2021 at 3:58 AM Dr. David Alan Gilbert
>  wrote:
> >
> > * Axel Rasmussen (axelrasmus...@google.com) wrote:
> > > This feature allows userspace to intercept "minor" faults. By "minor"
> > > fault, I mean the following situation:
> > >
> > > Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
> > > memory). One of the mappings is registered with userfaultfd (in minor
> > > mode), and the other is not. Via the non-UFFD mapping, the underlying
> > > pages have already been allocated & filled with some contents. The UFFD
> > > mapping has not yet been faulted in; when it is touched for the first
> > > time, this results in what I'm calling a "minor" fault. As a concrete
> > > example, when working with hugetlbfs, we have huge_pte_none(), but
> > > find_lock_page() finds an existing page.
> > >
> > > This commit adds the new registration mode, and sets the relevant flag
> > > on the VMAs being registered. In the hugetlb fault path, if we find
> > > that we have huge_pte_none(), but find_lock_page() does indeed find an
> > > existing page, then we have a "minor" fault, and if the VMA has the
> > > userfaultfd registration flag, we call into userfaultfd to handle it.
> > >
> > > Why add a new registration mode, as opposed to adding a feature to
> > > MISSING registration, like UFFD_FEATURE_SIGBUS?
> > >
> > > - The semantics are significantly different. UFFDIO_COPY or
> > >   UFFDIO_ZEROPAGE do not make sense for these minor faults; userspace
> > >   would instead just memset() or memcpy() or whatever via the non-UFFD
> > >   mapping. Unlike MISSING registration, MINOR registration only makes
> > >   sense for shared memory (hugetlbfs or shmem [to be supported in future
> > >   commits]).
> >
> > Is there a reason that UFFDIO_COPY can't work with this and a non-shared
> > mapping? and/or the fallocate or hole punching we currently use?
> 
> It could work, with a small modification. Mainly I just didn't see it
> being useful since memcpy() or similar works in this case. (We don't
> have the atomicity problem UFFDIO_COPY is trying to solve, since we're
> modifying via a non-UFFD VMA.)

I guess I was thinking about how easily I could modify my existing postcopy code;

> Maybe another (weaker?) argument against it is, if it happily accepts
> VMAs both with and without backing pages, it may be easier to misuse.
> E.g., if you mistakenly give it the wrong address, and it overwrites
> some existing page you didn't mean to modify.
> 
> I don't feel particularly strongly, if it strikes you as cleaner to
> just support it, I'll do so in the next revision.

Not necessarily; I was just still thinking in terms of the existing
interface I'm used to.

Dave

> >
> > Dave
> >
> > > - Doing so would make handle_userfault()'s "reason" argument confusing.
> > >   We'd pass in "MISSING" even if the pages weren't really missing.
> > >
> > > Signed-off-by: Axel Rasmussen 
> > > ---
> > >  fs/proc/task_mmu.c   |  1 +
> > >  fs/userfaultfd.c | 80 +++-
> > >  include/linux/mm.h   |  1 +
> > >  include/linux/userfaultfd_k.h| 12 -
> > >  include/trace/events/mmflags.h   |  1 +
> > >  include/uapi/linux/userfaultfd.h | 15 +-
> > >  mm/hugetlb.c | 31 +
> > >  7 files changed, 107 insertions(+), 34 deletions(-)
> > >
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index ee5a235b3056..108faf719a83 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -651,6 +651,7 @@ static void show_smap_vma_flags(struct seq_file *m, 
> > > struct vm_area_struct *vma)
> > >   [ilog2(VM_MTE)] = "mt",
> > >   [ilog2(VM_MTE_ALLOWED)] = "",
> > >  #endif
> > > + [ilog2(VM_UFFD_MINOR)]  = "ui",
> > >  #ifdef CONFIG_ARCH_HAS_PKEYS
> > >   /* These come out via ProtectionKey: */
> > >   [ilog2(VM_PKEY_BIT0)]   = "",
> > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > index 894cc28142e7..0a661422eb19 100644
> > > --- a/fs/userfaultfd.c
> > > +++ b/fs/userfaultfd.c
> > > @@ -28,6 +28,8 @@
> > >  #include 
> > >  #include 

Re: [RFC PATCH 1/2] userfaultfd: add minor fault registration mode

2021-01-11 Thread Dr. David Alan Gilbert
 }
>  
> +static inline bool userfaultfd_minor(struct vm_area_struct *vma)
> +{
> + return false;
> +}
> +
>  static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
> pte_t pte)
>  {
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index 67018d367b9f..2d583ffd4100 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -151,6 +151,7 @@ IF_HAVE_PG_ARCH_2(PG_arch_2,  "arch_2"
> )
>   {VM_PFNMAP, "pfnmap"},  \
>   {VM_DENYWRITE,  "denywrite" },  \
>   {VM_UFFD_WP,"uffd_wp"   },  \
> + {VM_UFFD_MINOR, "uffd_minor"},  \
>   {VM_LOCKED, "locked"},  \
>   {VM_IO, "io"},  \
>   {VM_SEQ_READ,   "seqread"   },  \
> diff --git a/include/uapi/linux/userfaultfd.h 
> b/include/uapi/linux/userfaultfd.h
> index 5f2d88212f7c..1cc2cd8a5279 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -19,15 +19,19 @@
>   * means the userland is reading).
>   */
>  #define UFFD_API ((__u64)0xAA)
> +#define UFFD_API_REGISTER_MODES (UFFDIO_REGISTER_MODE_MISSING |  \
> +  UFFDIO_REGISTER_MODE_WP |  \
> +  UFFDIO_REGISTER_MODE_MINOR)
>  #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP |  \
>  UFFD_FEATURE_EVENT_FORK |\
>  UFFD_FEATURE_EVENT_REMAP |   \
> -UFFD_FEATURE_EVENT_REMOVE |  \
> +UFFD_FEATURE_EVENT_REMOVE |  \
>  UFFD_FEATURE_EVENT_UNMAP |   \
>  UFFD_FEATURE_MISSING_HUGETLBFS | \
>  UFFD_FEATURE_MISSING_SHMEM | \
>  UFFD_FEATURE_SIGBUS |\
> -UFFD_FEATURE_THREAD_ID)
> +UFFD_FEATURE_THREAD_ID | \
> +UFFD_FEATURE_MINOR_FAULT_HUGETLBFS)
>  #define UFFD_API_IOCTLS  \
>   ((__u64)1 << _UFFDIO_REGISTER | \
>(__u64)1 << _UFFDIO_UNREGISTER |   \
> @@ -127,6 +131,7 @@ struct uffd_msg {
>  /* flags for UFFD_EVENT_PAGEFAULT */
>  #define UFFD_PAGEFAULT_FLAG_WRITE(1<<0)  /* If this was a write fault */
>  #define UFFD_PAGEFAULT_FLAG_WP   (1<<1)  /* If reason is 
> VM_UFFD_WP */
> +#define UFFD_PAGEFAULT_FLAG_MINOR(1<<2)  /* If reason is VM_UFFD_MINOR */
>  
>  struct uffdio_api {
>   /* userland asks for an API number and the features to enable */
> @@ -171,6 +176,10 @@ struct uffdio_api {
>*
>* UFFD_FEATURE_THREAD_ID pid of the page faulted task_struct will
>* be returned, if feature is not requested 0 will be returned.
> +  *
> +  * UFFD_FEATURE_MINOR_FAULT_HUGETLBFS indicates that minor faults
> +  * can be intercepted (via REGISTER_MODE_MINOR) for
> +  * hugetlbfs-backed pages.
>*/
>  #define UFFD_FEATURE_PAGEFAULT_FLAG_WP   (1<<0)
>  #define UFFD_FEATURE_EVENT_FORK  (1<<1)
> @@ -181,6 +190,7 @@ struct uffdio_api {
>  #define UFFD_FEATURE_EVENT_UNMAP (1<<6)
>  #define UFFD_FEATURE_SIGBUS  (1<<7)
>  #define UFFD_FEATURE_THREAD_ID   (1<<8)
> +#define UFFD_FEATURE_MINOR_FAULT_HUGETLBFS   (1<<9)
>   __u64 features;
>  
>   __u64 ioctls;
> @@ -195,6 +205,7 @@ struct uffdio_register {
>   struct uffdio_range range;
>  #define UFFDIO_REGISTER_MODE_MISSING ((__u64)1<<0)
>  #define UFFDIO_REGISTER_MODE_WP  ((__u64)1<<1)
> +#define UFFDIO_REGISTER_MODE_MINOR   ((__u64)1<<2)
>   __u64 mode;
>  
>   /*
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a2602969873d..0ba8f2f5a4ae 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4377,6 +4377,37 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>   }
>   }
>  
> + /* Check for page in userfault range. */
> + if (!new_page && userfaultfd_minor(vma)) {
> + u32 hash;
> + struct vm_fault vmf = {
> + .vma = vma,
> + .address = haddr,
> + .flags = flags,
> + /*
> +  * Hard to debug if it ends up being used by a callee
> +  * that assumes something about the other uninitialized
> +  * fields... same as in memory.c
> +  */
> + };
> +
> + unlock_page(page);
> +
> + /*
> +  * hugetlb_fault_mutex and i_mmap_rwsem must be dropped before
> +  * handling userfault.  Reacquire after handling fault to make
> +  * calling code simpler.
> +  */
> +
> + hash = hugetlb_fault_mutex_hash(mapping, idx);
> + mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> + i_mmap_unlock_read(mapping);
> + ret = handle_userfault(&vmf, VM_UFFD_MINOR);
> + i_mmap_lock_read(mapping);
> + mutex_lock(&hugetlb_fault_mutex_table[hash]);
> + goto out;
> + }
> +
>   /*
>* If we are going to COW a private mapping later, we examine the
>* pending reservations for this page now. This will ensure that
> -- 
> 2.29.2.729.g45daf8777d-goog
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
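
As a rough sketch of the userspace flow being proposed in this series
(illustrative only; it assumes a uapi header carrying the
UFFDIO_REGISTER_MODE_MINOR and UFFD_FEATURE_MINOR_FAULT_HUGETLBFS constants
quoted above, leaves out all error handling, and only indicates the
UFFDIO_CONTINUE step from patch 2/2 in a comment):

    /* Sketch: register a hugetlbfs-backed region for minor-fault
     * interception. */
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/userfaultfd.h>

    static int uffd_register_minor(void *uffd_area, size_t len)
    {
        int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);

        struct uffdio_api api = {
            .api = UFFD_API,
            .features = UFFD_FEATURE_MINOR_FAULT_HUGETLBFS,
        };
        ioctl(uffd, UFFDIO_API, &api);

        struct uffdio_register reg = {
            .range = { .start = (unsigned long)uffd_area, .len = len },
            .mode = UFFDIO_REGISTER_MODE_MINOR,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        /*
         * On each UFFD_PAGEFAULT_FLAG_MINOR event the handler would fill
         * the page through the *other* (non-UFFD) mapping of the same file
         * -- a plain memcpy() is enough -- and then ask the kernel to
         * install the existing page and wake the faulting thread (the
         * UFFDIO_CONTINUE ioctl added in patch 2/2 of this series).
         */
        return uffd;
    }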



Re: [RFC PATCH 0/2] userfaultfd: handle minor faults, add UFFDIO_CONTINUE

2021-01-11 Thread Dr. David Alan Gilbert
hared memory, and returns
>   -ENOENT in that case (regardless of the kind of fault).
> 
> Remaining Work
> ==
> 
> This patchset doesn't include updates to userfaultfd's documentation or
> selftests. This will be added before I send a non-RFC version of this series
> (I want to find out if there are strong objections to the API surface before
> spending the time to document it.)
> 
> Currently the patchset only supports hugetlbfs. There is no reason it can't 
> work
> with shmem, but I expect hugetlbfs to be much more commonly used since we're
> talking about backing guest memory for VMs. I plan to implement shmem support 
> in
> a follow-up patch series.
> 
> Axel Rasmussen (2):
>   userfaultfd: add minor fault registration mode
>   userfaultfd: add UFFDIO_CONTINUE ioctl
> 
>  fs/proc/task_mmu.c   |   1 +
>  fs/userfaultfd.c | 143 ++++++++---
>  include/linux/mm.h   |   1 +
>  include/linux/userfaultfd_k.h|  14 ++-
>  include/trace/events/mmflags.h   |   1 +
>  include/uapi/linux/userfaultfd.h |  36 +++-
>  mm/hugetlb.c |  42 +++--
>  mm/userfaultfd.c |  86 ++-
>  8 files changed, 261 insertions(+), 63 deletions(-)
> 
> --
> 2.29.2.729.g45daf8777d-goog
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3

2020-12-18 Thread Dr. David Alan Gilbert
* Kalra, Ashish (ashish.ka...@amd.com) wrote:
> Hello Dave,
> 
> On Dec 18, 2020, at 1:40 PM, Dr. David Alan Gilbert  
> wrote:
> 
> * Ashish Kalra (ashish.ka...@amd.com) wrote:
> On Fri, Dec 11, 2020 at 10:55:42PM +, Ashish Kalra wrote:
> Hello All,
> 
> On Tue, Dec 08, 2020 at 10:29:05AM -0600, Brijesh Singh wrote:
> 
> On 12/7/20 9:09 PM, Steve Rutherford wrote:
> On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson  wrote:
> On Sun, Dec 06, 2020, Paolo Bonzini wrote:
> On 03/12/20 01:34, Sean Christopherson wrote:
> On Tue, Dec 01, 2020, Ashish Kalra wrote:
> From: Brijesh Singh 
> 
> KVM hypercall framework relies on alternative framework to patch the
> VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
> apply_alternative() is called then it defaults to VMCALL. The approach
> works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
> will be able to decode the instruction and do the right things. But
> when SEV is active, guest memory is encrypted with guest key and
> hypervisor will not be able to decode the instruction bytes.
> 
> Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
> will be used by the SEV guest to notify encrypted pages to the hypervisor.
> What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL
> and opt into VMCALL?  It's a synthetic feature flag either way, and I don't
> think there are any existing KVM hypercalls that happen before alternatives 
> are
> patched, i.e. it'll be a nop for sane kernel builds.
> 
> I'm also skeptical that a KVM specific hypercall is the right approach for the
> encryption behavior, but I'll take that up in the patches later in the series.
> Do you think that it's the guest that should "donate" memory for the bitmap
> instead?
> No.  Two things I'd like to explore:
> 
>  1. Making the hypercall to announce/request private vs. shared common across
> hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and 
> TDX).
> I'm concerned that we'll end up with multiple hypercalls that do more or
> less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc...  Maybe it's a
> pipe dream, but I'd like to at least explore options before shoving in 
> KVM-
> only hypercalls.
> 
> 
>  2. Tracking shared memory via a list of ranges instead of a using bitmap to
> track all of guest memory.  For most use cases, the vast majority of guest
> memory will be private, most ranges will be 2mb+, and conversions between
> private and shared will be uncommon events, i.e. the overhead to walk and
> split/merge list entries is hopefully not a big concern.  I suspect a list
> would consume far less memory, hopefully without impacting performance.
> For a fancier data structure, I'd suggest an interval tree. Linux
> already has an rbtree-based interval tree implementation, which would
> likely work, and would probably assuage any performance concerns.
> 
> Something like this would not be worth doing unless most of the shared
> pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had
> 60ish discontiguous shared regions. This is by no means a thorough
> search, but it's suggestive. If this is typical, then the bitmap would
> be far less efficient than most any interval-based data structure.
> 
> You'd have to allow userspace to upper bound the number of intervals
> (similar to the maximum bitmap size), to prevent host OOMs due to
> malicious guests. There's something nice about the guest donating
> memory for this, since that would eliminate the OOM risk.
> 
> 
> Tracking the list of ranges may not be bad idea, especially if we use
> the some kind of rbtree-based data structure to update the ranges. It
> will certainly be better than bitmap which grows based on the guest
> memory size and as you guys see in the practice most of the pages will
> be guest private. I am not sure if guest donating a memory will cover
> all the cases, e.g what if we do a memory hotplug (increase the guest
> ram from 2GB to 64GB), will donated memory range will be enough to store
> the metadata.
> 
> .
> 
> With reference to internal discussions regarding the above, i am going
> to look into specific items as listed below :
> 
> 1). "hypercall" related :
> a). Explore the SEV-SNP page change request structure (included in GHCB),
> see if there is something common there than can be re-used for SEV/SEV-ES
> page encryption status hypercalls.
> b). Explore if there is any common hypercall framework i can use in
> Linux/KVM.
> 
> 2). related to the "backing" data structure - explore using a range-based
> list or something like rbtree-based interval t

Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3

2020-12-18 Thread Dr. David Alan Gilbert
gt; > > search, but it's suggestive. If this is typical, then the bitmap would
> > > > be far less efficient than most any interval-based data structure.
> > > >
> > > > You'd have to allow userspace to upper bound the number of intervals
> > > > (similar to the maximum bitmap size), to prevent host OOMs due to
> > > > malicious guests. There's something nice about the guest donating
> > > > memory for this, since that would eliminate the OOM risk.
> > > 
> > > 
> > > Tracking the list of ranges may not be bad idea, especially if we use
> > > the some kind of rbtree-based data structure to update the ranges. It
> > > will certainly be better than bitmap which grows based on the guest
> > > memory size and as you guys see in the practice most of the pages will
> > > be guest private. I am not sure if guest donating a memory will cover
> > > all the cases, e.g what if we do a memory hotplug (increase the guest
> > > ram from 2GB to 64GB), will donated memory range will be enough to store
> > > the metadata.
> > > 
> > >. 
> > 
> > With reference to internal discussions regarding the above, i am going
> > to look into specific items as listed below :
> > 
> > 1). "hypercall" related :
> > a). Explore the SEV-SNP page change request structure (included in GHCB),
> > see if there is something common there than can be re-used for SEV/SEV-ES
> > page encryption status hypercalls.
> > b). Explore if there is any common hypercall framework i can use in 
> > Linux/KVM.
> > 
> > 2). related to the "backing" data structure - explore using a range-based
> > list or something like rbtree-based interval tree data structure
> > (as mentioned by Steve above) to replace the current bitmap based
> > implementation.
> > 
> > 
> 
> I do agree that a range-based list or an interval tree data structure is a
> really good "logical" fit for the guest page encryption status tracking.
> 
> We can only keep track of the guest unencrypted shared pages in the
> range(s) list (which will keep the data structure quite compact) and all
> the guest private/encrypted memory does not really need any tracking in
> the list, anything not in the list will be encrypted/private.
> 
> Also looking at a more "practical" use case, here is the current log of
> page encryption status hypercalls when booting a linux guest :
> 
> ...



> [   56.146336] page_enc_status_hc invoked, gpa = 1f018000, npages  = 1, enc = 
> 1
> [   56.146351] page_enc_status_hc invoked, gpa = 1f00e000, npages  = 1, enc = > 0
> [   56.147261] page_enc_status_hc invoked, gpa = 1f00e000, npages  = 1, enc = > 0
> [   56.147271] page_enc_status_hc invoked, gpa = 1f018000, npages  = 1, enc = > 0


> [   56.180730] page_enc_status_hc invoked, gpa = 1f008000, npages  = 1, enc = > 0
> [   56.180741] page_enc_status_hc invoked, gpa = 1f006000, npages  = 1, enc = > 0
> [   56.180768] page_enc_status_hc invoked, gpa = 1f008000, npages  = 1, enc = 
> 1
> [   56.180782] page_enc_status_hc invoked, gpa = 1f006000, npages  = 1, enc = 
> 1


> [   56.197110] page_enc_status_hc invoked, gpa = 1f007000, npages  = 1, enc = > 0
> [   56.197120] page_enc_status_hc invoked, gpa = 1f005000, npages  = 1, enc = > 0
> [   56.197136] page_enc_status_hc invoked, gpa = 1f007000, npages  = 1, enc = 
> 1
> [   56.197148] page_enc_status_hc invoked, gpa = 1f005000, npages  = 1, enc = 
> 1


> [   56.222679] page_enc_status_hc invoked, gpa = 1e83b000, npages  = 1, enc = > 0
> [   56.222691] page_enc_status_hc invoked, gpa = 1e839000, npages  = 1, enc = > 0
> [   56.222707] page_enc_status_hc invoked, gpa = 1e83b000, npages  = 1, enc = 
> 1
> [   56.222720] page_enc_status_hc invoked, gpa = 1e839000, npages  = 1, enc = 
> 1


> [   56.313747] page_enc_status_hc invoked, gpa = 1e5eb000, npages  = 1, enc = > 0
> [   56.313771] page_enc_status_hc invoked, gpa = 1e5e9000, npages  = 1, enc = > 0
> [   56.313789] page_enc_status_hc invoked, gpa = 1e5eb000, npages  = 1, enc = 
> 1
> [   56.313803] page_enc_status_hc invoked, gpa = 1e5e9000, npages  = 1, enc = 
> 1

> [   56.459276] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc 
> = 0
> [   56.459428] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 
> 1
> [   56.460037] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc 
> = 1
> [   56.460216] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = > 0
> [   56.460299] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc 
> = 0
> [   56.460448] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 
> 1


> As can be observed here, all guest MMIO ranges are initially setup as
> shared, and those are all contigious guest page ranges.
> 
> After that the encryption status hypercalls are invoked when DMA gets
> triggered during disk i/o while booting the guest ... here again the
> guest page ranges are contigious, though mostly single page is touched 
> and a lot of page re-use is observed. 
> 
> So a range-based list/structure will be a "good" fit for such usage
> scenarios.

It seems surprisingly common to flick the same pages back and forth between
encrypted and clear for quite a while;  why is this?

Dave


> Thanks,
> Ashish
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
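
To make the range-based alternative discussed above concrete, here is a toy,
self-contained sketch of tracking only the shared (unencrypted) guest ranges
as a sorted, merged interval list rather than a bitmap. It is purely
illustrative and not the KVM code; all names are invented:

    /* Toy model of the "track only shared ranges" idea: a sorted list of
     * [gfn, gfn+npages) intervals. */
    #include <stdio.h>
    #include <stdlib.h>

    struct shared_range {
        unsigned long start, end;           /* gfn range, end exclusive */
        struct shared_range *next;
    };

    static struct shared_range *head;

    /* Guest says: pages [gfn, gfn+npages) are now shared (unencrypted). */
    static void mark_shared(unsigned long gfn, unsigned long npages)
    {
        unsigned long start = gfn, end = gfn + npages;
        struct shared_range **pp = &head, *r;

        while ((r = *pp) && r->end < start)
            pp = &r->next;

        /* Absorb every existing range that touches [start, end). */
        while ((r = *pp) && r->start <= end) {
            if (r->start < start) start = r->start;
            if (r->end > end) end = r->end;
            *pp = r->next;
            free(r);
        }
        r = malloc(sizeof(*r));
        r->start = start; r->end = end; r->next = *pp;
        *pp = r;
    }

    /* Guest says: pages [gfn, gfn+npages) are private (encrypted) again. */
    static void mark_encrypted(unsigned long gfn, unsigned long npages)
    {
        unsigned long start = gfn, end = gfn + npages;
        struct shared_range **pp = &head, *r;

        while ((r = *pp)) {
            if (r->end <= start) { pp = &r->next; continue; }
            if (r->start >= end) break;
            if (r->start < start && r->end > end) {     /* split in two */
                struct shared_range *tail = malloc(sizeof(*tail));
                tail->start = end; tail->end = r->end; tail->next = r->next;
                r->end = start; r->next = tail;
                return;
            }
            if (r->start < start) { r->end = start; pp = &r->next; continue; }
            if (r->end > end) { r->start = end; break; }
            *pp = r->next;                              /* fully covered */
            free(r);
        }
    }

    int main(void)
    {
        mark_shared(0x1f006, 2);    /* values loosely modelled on the log */
        mark_shared(0x1f008, 1);
        mark_encrypted(0x1f007, 1);
        for (struct shared_range *r = head; r; r = r->next)
            printf("shared: 0x%lx - 0x%lx\n", r->start, r->end);
        return 0;
    }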



Re: [PATCH v5 0/2] MTE support for KVM guest

2020-12-07 Thread Dr. David Alan Gilbert
* Peter Maydell (peter.mayd...@linaro.org) wrote:
> On Mon, 7 Dec 2020 at 16:44, Dr. David Alan Gilbert  
> wrote:
> > * Steven Price (steven.pr...@arm.com) wrote:
> > > Sorry, I know I simplified it rather by saying it's similar to protected 
> > > VM.
> > > Basically as I see it there are three types of memory access:
> > >
> > > 1) Debug case - has to go via a special case for decryption or ignoring 
> > > the
> > > MTE tag value. Hopefully this can be abstracted in the same way.
> > >
> > > 2) Migration - for a protected VM there's likely to be a special method to
> > > allow the VMM access to the encrypted memory (AFAIK memory is usually kept
> > > inaccessible to the VMM). For MTE this again has to be special cased as we
> > > actually want both the data and the tag values.
> > >
> > > 3) Device DMA - for a protected VM it's usual to unencrypt a small area of
> > > memory (with the permission of the guest) and use that as a bounce buffer.
> > > This is possible with MTE: have an area the VMM purposefully maps with
> > > PROT_MTE. The issue is that this has a performance overhead and we can do
> > > better with MTE because it's trivial for the VMM to disable the protection
> > > for any memory.
> >
> > Those all sound very similar to the AMD SEV world;  there's the special
> > case for Debug that Peter mentioned; migration is ...complicated and
> > needs special case that's still being figured out, and as I understand
> > Device DMA also uses a bounce buffer (and swiotlb in the guest to make
> > that happen).
> 
> Mmm, but for encrypted VMs the VM has to jump through all these
> hoops because "don't let the VM directly access arbitrary guest RAM"
> is the whole point of the feature. For MTE, we don't want in general
> to be doing tag-checked accesses to guest RAM and there is nothing
> in the feature "allow guests to use MTE" that requires that the VMM's
> guest RAM accesses must do tag-checking. So we should avoid having
> a design that require us to jump through all the hoops.

Yes agreed, that's a fair distinction.

Dave


 Even if
> it happens that handling encrypted VMs means that QEMU has to grow
> some infrastructure for carefully positioning hoops in appropriate
> places, we shouldn't use it unnecessarily... All we actually need is
> a mechanism for migrating the tags: I don't think there's ever a
> situation where you want tag-checking enabled for the VMM's accesses
> to the guest RAM.
> 
> thanks
> -- PMM
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [PATCH v5 0/2] MTE support for KVM guest

2020-12-07 Thread Dr. David Alan Gilbert
* Steven Price (steven.pr...@arm.com) wrote:
> On 07/12/2020 15:27, Peter Maydell wrote:
> > On Mon, 7 Dec 2020 at 14:48, Steven Price  wrote:
> > > Sounds like you are making good progress - thanks for the update. Have
> > > you thought about how the PROT_MTE mappings might work if QEMU itself
> > > were to use MTE? My worry is that we end up with MTE in a guest
> > > preventing QEMU from using MTE itself (because of the PROT_MTE
> > > mappings). I'm hoping QEMU can wrap its use of guest memory in a
> > > sequence which disables tag checking (something similar will be needed
> > > for the "protected VM" use case anyway), but this isn't something I've
> > > looked into.
> > 
> > It's not entirely the same as the "protected VM" case. For that
> > the patches currently on list basically special case "this is a
> > debug access (eg from gdbstub/monitor)" which then either gets
> > to go via "decrypt guest RAM for debug" or gets failed depending
> > on whether the VM has a debug-is-ok flag enabled. For an MTE
> > guest the common case will be guests doing standard DMA operations
> > to or from guest memory. The ideal API for that from QEMU's
> > point of view would be "accesses to guest RAM don't do tag
> > checks, even if tag checks are enabled for accesses QEMU does to
> > memory it has allocated itself as a normal userspace program".
> 
> Sorry, I know I simplified it rather by saying it's similar to protected VM.
> Basically as I see it there are three types of memory access:
> 
> 1) Debug case - has to go via a special case for decryption or ignoring the
> MTE tag value. Hopefully this can be abstracted in the same way.
> 
> 2) Migration - for a protected VM there's likely to be a special method to
> allow the VMM access to the encrypted memory (AFAIK memory is usually kept
> inaccessible to the VMM). For MTE this again has to be special cased as we
> actually want both the data and the tag values.
> 
> 3) Device DMA - for a protected VM it's usual to unencrypt a small area of
> memory (with the permission of the guest) and use that as a bounce buffer.
> This is possible with MTE: have an area the VMM purposefully maps with
> PROT_MTE. The issue is that this has a performance overhead and we can do
> better with MTE because it's trivial for the VMM to disable the protection
> for any memory.

Those all sound very similar to the AMD SEV world;  there's the special
case for Debug that Peter mentioned; migration is ...complicated and
needs special case that's still being figured out, and as I understand
Device DMA also uses a bounce buffer (and swiotlb in the guest to make
that happen).


I'm not sure about the stories for the IBM hardware equivalents.

Dave

> The part I'm unsure on is how easy it is for QEMU to deal with (3) without
> the overhead of bounce buffers. Ideally there'd already be a wrapper for
> guest memory accesses and that could just be wrapped with setting TCO during
> the access. I suspect the actual situation is more complex though, and I'm
> hoping Haibo's investigations will help us understand this.
> 
> Thanks,
> 
> Steve
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [PATCH v2 2/9] KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall

2020-12-02 Thread Dr. David Alan Gilbert
exist in our memslot */
> + pfn_start = gfn_to_pfn(kvm, gfn_start);
> + pfn_end = gfn_to_pfn(kvm, gfn_end);
> +
> + if (is_error_noslot_pfn(pfn_start) && !is_noslot_pfn(pfn_start)) {
> + /*
> +  * Allow guest MMIO range(s) to be added
> +  * to the page encryption bitmap.
> +  */
> + return -EINVAL;
> + }
> +
> + if (is_error_noslot_pfn(pfn_end) && !is_noslot_pfn(pfn_end)) {
> + /*
> +  * Allow guest MMIO range(s) to be added
> +  * to the page encryption bitmap.
> +  */
> + return -EINVAL;
> + }
> +
> + mutex_lock(&kvm->lock);
> +
> + if (sev->page_enc_bmap_size < gfn_end)
> + goto unlock;
> +
> + if (enc)
> + __bitmap_set(sev->page_enc_bmap, gfn_start,
> + gfn_end - gfn_start);
> + else
> + __bitmap_clear(sev->page_enc_bmap, gfn_start,
> + gfn_end - gfn_start);
> +
> +unlock:
> + mutex_unlock(&kvm->lock);
> + return 0;
> +}
> +
>  int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
>  {
>   struct kvm_sev_cmd sev_cmd;
> @@ -1123,6 +1210,9 @@ void sev_vm_destroy(struct kvm *kvm)
>  
>   sev_unbind_asid(kvm, sev->handle);
>   sev_asid_free(sev->asid);
> +
> + kvfree(sev->page_enc_bmap);
> + sev->page_enc_bmap = NULL;
>  }
>  
>  int __init sev_hardware_setup(void)
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 6dc337b9c231..7122ea5f7c47 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4312,6 +4312,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>   .apic_init_signal_blocked = svm_apic_init_signal_blocked,
>  
>   .msr_filter_changed = svm_msr_filter_changed,
> +
> + .page_enc_status_hc = svm_page_enc_status_hc,
>  };
>  
>  static struct kvm_x86_init_ops svm_init_ops __initdata = {
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index fdff76eb6ceb..0103a23ca174 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -66,6 +66,8 @@ struct kvm_sev_info {
>   int fd; /* SEV device fd */
>   unsigned long pages_locked; /* Number of pages locked */
>   struct list_head regions_list;  /* List of registered regions */
> + unsigned long *page_enc_bmap;
> + unsigned long page_enc_bmap_size;
>  };
>  
>  struct kvm_svm {
> @@ -409,6 +411,8 @@ int nested_svm_check_exception(struct vcpu_svm *svm, 
> unsigned nr,
>  bool has_error_code, u32 error_code);
>  int nested_svm_exit_special(struct vcpu_svm *svm);
>  void sync_nested_vmcb_control(struct vcpu_svm *svm);
> +int svm_page_enc_status_hc(struct kvm *kvm, unsigned long gpa,
> +   unsigned long npages, unsigned long enc);
>  
>  extern struct kvm_x86_nested_ops svm_nested_ops;
>  
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c3441e7e5a87..5bc37a38e6be 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7722,6 +7722,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
>  
>   .msr_filter_changed = vmx_msr_filter_changed,
>   .cpu_dirty_log_size = vmx_cpu_dirty_log_size,
> + .page_enc_status_hc = NULL,
>  };
>  
>  static __init int hardware_setup(void)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a3fdc16cfd6f..3afc78f18f69 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8125,6 +8125,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>   kvm_sched_yield(vcpu->kvm, a0);
>   ret = 0;
>   break;
> +     case KVM_HC_PAGE_ENC_STATUS:
> + ret = -KVM_ENOSYS;
> + if (kvm_x86_ops.page_enc_status_hc)
> + ret = kvm_x86_ops.page_enc_status_hc(vcpu->kvm,
> + a0, a1, a2);
> + break;
>   default:
>   ret = -KVM_ENOSYS;
>   break;
> diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
> index 8b86609849b9..847b83b75dc8 100644
> --- a/include/uapi/linux/kvm_para.h
> +++ b/include/uapi/linux/kvm_para.h
> @@ -29,6 +29,7 @@
>  #define KVM_HC_CLOCK_PAIRING 9
>  #define KVM_HC_SEND_IPI  10
>  #define KVM_HC_SCHED_YIELD   11
> +#define KVM_HC_PAGE_ENC_STATUS   12
>  
>  /*
>   * hypercalls use architecture specific
> -- 
> 2.17.1
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [PATCH v5 0/2] MTE support for KVM guest

2020-11-23 Thread Dr. David Alan Gilbert
* Peter Maydell (peter.mayd...@linaro.org) wrote:
> On Thu, 19 Nov 2020 at 15:39, Steven Price  wrote:
> > This series adds support for Arm's Memory Tagging Extension (MTE) to
> > KVM, allowing KVM guests to make use of it. This builds on the existing
> > user space support already in v5.10-rc1, see [1] for an overview.
> 
> > The change to require the VMM to map all guest memory PROT_MTE is
> > significant as it means that the VMM has to deal with the MTE tags even
> > if it doesn't care about them (e.g. for virtual devices or if the VMM
> > doesn't support migration). Also unfortunately because the VMM can
> > change the memory layout at any time the check for PROT_MTE/VM_MTE has
> > to be done very late (at the point of faulting pages into stage 2).
> 
> I'm a bit dubious about requring the VMM to map the guest memory
> PROT_MTE unless somebody's done at least a sketch of the design
> for how this would work on the QEMU side. Currently QEMU just
> assumes the guest memory is guest memory and it can access it
> without special precautions...

Although that is also changing because of the encrypted/protected memory
in things like SEV.

Dave

> thanks
> -- PMM
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Virtio-fs] Unbreakable loop in fuse_fill_write_pages()

2020-10-14 Thread Dr. David Alan Gilbert
* Qian Cai (c...@redhat.com) wrote:
> On Tue, 2020-10-13 at 14:58 -0400, Vivek Goyal wrote:
> 
> > I am wondering if virtiofsd still alive and responding to requests? I
> > see another task which is blocked on getdents() for more than 120s.
> > 
> > [10580.142571][  T348] INFO: task trinity-c36:254165 blocked for more than 
> > 123
> > +seconds.
> > [10580.143924][  T348]   Tainted: G   O  5.9.0-next-20201013+ #2
> > [10580.145158][  T348] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > +disables this message.
> > [10580.146636][  T348] task:trinity-c36 state:D stack:26704 pid:254165
> > ppid:
> > +87180 flags:0x0004
> > [10580.148260][  T348] Call Trace:
> > [10580.148789][  T348]  __schedule+0x71d/0x1b50
> > [10580.149532][  T348]  ? __sched_text_start+0x8/0x8
> > [10580.150343][  T348]  schedule+0xbf/0x270
> > [10580.151044][  T348]  schedule_preempt_disabled+0xc/0x20
> > [10580.152006][  T348]  __mutex_lock+0x9f1/0x1360
> > [10580.152777][  T348]  ? __fdget_pos+0x9c/0xb0
> > [10580.153484][  T348]  ? mutex_lock_io_nested+0x1240/0x1240
> > [10580.154432][  T348]  ? find_held_lock+0x33/0x1c0
> > [10580.155220][  T348]  ? __fdget_pos+0x9c/0xb0
> > [10580.155934][  T348]  __fdget_pos+0x9c/0xb0
> > [10580.156660][  T348]  __x64_sys_getdents+0xff/0x230
> > 
> > May be virtiofsd crashed and hence no requests are completing leading
> > to a hard lockup?
> Virtiofsd is still working. Once this happened, I manually create a file on 
> the
> guest (in virtiofs) and then I can see the content of it from the host.

If the virtiofsd is still running, attach gdb to it and get a full bt;

gdb --pid  whatever

(gdb) t a a bt full

that should show if it's stuck in one particular place.

Dave


> ___
> Virtio-fs mailing list
> virtio...@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Virtio-fs] [PATCH v4] kvm, x86: Exit to user space in case page fault error

2020-10-06 Thread Dr. David Alan Gilbert
* Sean Christopherson (sean.j.christopher...@intel.com) wrote:
> On Tue, Oct 06, 2020 at 06:39:56PM +0200, Vitaly Kuznetsov wrote:
> > Sean Christopherson  writes:
> > 
> > > On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote:
> > >> Vivek Goyal  writes:
> > >> > So you will have to report token (along with -EFAULT) to user space. 
> > >> > So this
> > >> > is basically the 3rd proposal which is extension of kvm API and will
> > >> > report say HVA/GFN also to user space along with -EFAULT.
> > >> 
> > >> Right, I meant to say that guest kernel has full register state of the
> > >> userspace process which caused APF to get queued and instead of trying
> > >> to extract it in KVM and pass to userspace in case of a (later) failure
> > >> we limit KVM api change to contain token or GFN only and somehow keep
> > >> the rest in the guest. This should help with TDX/SEV-ES.
> > >
> > > Whatever gets reported to userspace should be identical with and without
> > > async page faults, i.e. it definitely shouldn't have token information.
> > >
> > 
> > Oh, right, when the error gets reported synchronously guest's kernel is
> > not yet aware of the issue so it won't be possible to find anything in
> > its kdump if userspace decides to crash it immediately. The register
> > state (if available) will be actual though.
> > 
> > > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back
> > > into the guest is not an option.  
> > 
> > Not even #MC? So sad :-)
> 
> Heh, #MC isn't allowed either, yet...
> 
> > > Nor do I think that's "correct" behavior (see everyone's objections to
> > > using #PF for APF fixed).  I.e. the event should probably be an IRQ.
> > 
> > I recall Paolo objected against making APF 'page not present' into in
> > interrupt as it will require some very special handling to make sure it
> > gets injected (and handled) immediately but I'm not really sure how big
> > the hack is going to be, maybe in the light of TDX/SEV-ES it's worth a
> > try.
> 
> This shouldn't have anything to do with APF.  Again, the event injection is
> needed even in the synchronous case as the file truncation in the host can
> affect existing mappings in the guest.
> 
> I don't know that the mechanism needs to be virtiofs specific or if there can
> be a more generic "these PFNs have disappeared", but it's most definitely
> orthogonal to APF.

There are other cases we get 'these PFNs have disappeared' other than
virtiofs;  the classic is when people back the guest using a tmpfs that
then runs out of room.

Dave

> ___
> Virtio-fs mailing list
> virtio...@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [CRM114spam]: drivers/hwmon/w83627ehf.c:2417 w83627ehf_probe() warn: 'res->start' not released on lines: 2412.

2020-09-20 Thread Dr. David Alan Gilbert
* Dan Carpenter (dan.carpen...@oracle.com) wrote:
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
> master
> head:   5925fa68fe8244651b3f78a88c4af99190a88f0d
> commit: 266cd5835947d08b7c963b6d9d9f15d9e481bd0a hwmon: (w83627ehf) convert 
> to with_info interface
> config: i386-randconfig-m021-20200916 (attached as .config)
> compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
> Reported-by: Dan Carpenter 
> 
> smatch warnings:
> drivers/hwmon/w83627ehf.c:2417 w83627ehf_probe() warn: 'res->start' not 
> released on lines: 2412.
> 
> # 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=266cd5835947d08b7c963b6d9d9f15d9e481bd0a
> git remote add linus 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> git fetch --no-tags linus master
> git checkout 266cd5835947d08b7c963b6d9d9f15d9e481bd0a
> vim +2417 drivers/hwmon/w83627ehf.c
> 
> 6c931ae1c09a96 drivers/hwmon/w83627ehf.c Bill Pemberton 
> 2012-11-19  2015  static int w83627ehf_probe(struct platform_device *pdev)
> 08e7e2789e0da4 drivers/i2c/chips/w83627ehf.c Jean Delvare   
> 2005-04-25  2016  {
> 1ea6dd3840e5a2 drivers/hwmon/w83627ehf.c David Hubbard  
> 2007-06-24  2017  struct device *dev = &pdev->dev;
> a8b3a3a53f9a81 drivers/hwmon/w83627ehf.c Jingoo Han 
> 2013-07-30  2018  struct w83627ehf_sio_data *sio_data = 
> dev_get_platdata(dev);
> 08e7e2789e0da4 drivers/i2c/chips/w83627ehf.c Jean Delvare   
> 2005-04-25  2019  struct w83627ehf_data *data;
> 1ea6dd3840e5a2 drivers/hwmon/w83627ehf.c David Hubbard  
> 2007-06-24  2020  struct resource *res;
> 03f5de2bb7125e drivers/hwmon/w83627ehf.c Jean Delvare   
> 2011-10-13  2021  u8 en_vrm10;
> 08e7e2789e0da4 drivers/i2c/chips/w83627ehf.c Jean Delvare   
> 2005-04-25  2022  int i, err = 0;
> 266cd5835947d0 drivers/hwmon/w83627ehf.c Dr. David Alan Gilbert 
> 2019-11-24  2023  struct device *hwmon_dev;
> 08e7e2789e0da4 drivers/i2c/chips/w83627ehf.c Jean Delvare   
> 2005-04-25  2024  
> 1ea6dd3840e5a2 drivers/hwmon/w83627ehf.c David Hubbard  
> 2007-06-24  2025  res = platform_get_resource(pdev, IORESOURCE_IO, 0);
> 1ea6dd3840e5a2 drivers/hwmon/w83627ehf.c David Hubbard  
> 2007-06-24  2026  if (!request_region(res->start, IOREGION_LENGTH, 
> DRVNAME)) {
> 08e7e2789e0da4 drivers/i2c/chips/w83627ehf.c Jean Delvare   
> 2005-04-25  2027  err = -EBUSY;
> 1ea6dd3840e5a2 drivers/hwmon/w83627ehf.c David Hubbard  
> 2007-06-24  2028  dev_err(dev, "Failed to request region 
> 0x%lx-0x%lx\n",
> 1ea6dd3840e5a2 drivers/hwmon/w83627ehf.c David Hubbard  
> 2007-06-24  2029  (unsigned long)res->start,
> 1ea6dd3840e5a2 drivers/hwmon/w83627ehf.c David Hubbard  
> 2007-06-24  2030  (unsigned long)res->start + 
> IOREGION_LENGTH - 1);
> 08e7e2789e0da4 drivers/i2c/chips/w83627ehf.c Jean Delvare   
> 2005-04-25  2031  goto exit;
> 08e7e2789e0da4 drivers/i2c/chips/w83627ehf.c Jean Delvare   
> 2005-04-25  2032  }
> 08e7e2789e0da4 drivers/i2c/chips/w83627ehf.c Jean Delvare   
> 2005-04-25  2033  
> 32260d94408c55 drivers/hwmon/w83627ehf.c Guenter Roeck  
> 2012-03-12  2034  data = devm_kzalloc(&pdev->dev, sizeof(struct 
> w83627ehf_data),
> 32260d94408c55 drivers/hwmon/w83627ehf.c Guenter Roeck  
> 2012-03-12  2035  GFP_KERNEL);
> e7e1ca6ef4f331 drivers/hwmon/w83627ehf.c Guenter Roeck  
> 2011-02-04  2036  if (!data) {
> 08e7e2789e0da4 drivers/i2c/chips/w83627ehf.c Jean Delvare   
> 2005-04-25  2037  err = -ENOMEM;
> 08e7e2789e0da4 drivers/i2c/chips/w83627ehf.c Jean Delvare   
> 2005-04-25  2038  goto exit_release;
> 08e7e2789e0da4 drivers/i2c/chips/w83627ehf.c Jean Delvare   
> 2005-04-25  2039  }
> 08e7e2789e0da4 drivers/i2c/chips/w83627ehf.c Jean Delvare   
> 2005-04-25  2040  
> 1ea6dd3840e5a2 drivers/hwmon/w83627ehf.c David Hubbard  
> 2007-06-24  2041  data->addr = res->start;
> 9a61bf6300533d drivers/hwmon/w83627ehf.c Ingo Molnar
> 2006-01-18  2042  mutex_init(&data->lock);
> 9a61bf6300533d drivers/hwmon/w83627ehf.c     Ingo Molnar
> 2006-01-18  2043  mutex_init(&data->update_lock);
> 1ea6dd3840e5a2 drivers/hw

Re: [PATCH RFC 0/2] KVM: x86: allow for more CPUID entries

2020-09-16 Thread Dr. David Alan Gilbert
* Wei Huang (wei.hua...@amd.com) wrote:
> On 09/15 05:51, Dr. David Alan Gilbert wrote:
> > * Vitaly Kuznetsov (vkuzn...@redhat.com) wrote:
> > > With QEMU and newer AMD CPUs (namely: Epyc 'Rome') the current limit for
> 
> Could you elaborate on this limit? On Rome, I counted ~35 CPUID functions 
> which
> include Fn_, Fn4000_ and Fn8000_.

On my 7302P the output of:
cpuid -1 -r | wc -l

is 61, there is one line of header in there.

However in a guest I see more; and I think that's because KVM  tends to
list the CPUID entries for a lot of disabled Intel features, even on
AMD, e.g. 0x11-0x1f which AMD doesn't have, are listed in a KVM guest.
Then you add the KVM CPUIDs at 4...0 and 41.

IMHO we should be filtering those out for at least two reasons:
  a) They're wrong
  b) We're probably not keeping the set of visible CPUID fields the same
when we move between host kernels, and that can't be good for
migration.

Still, those are separate problems.

Dave

> > > KVM_MAX_CPUID_ENTRIES(80) is reported to be hit. Last time it was raised
> > > from '40' in 2010. We can, of course, just bump it a little bit to fix
> > > the immediate issue but the report made me wonder why we need to pre-
> > > allocate vcpu->arch.cpuid_entries array instead of sizing it dynamically.
> > > This RFC is intended to feed my curiosity.
> > > 
> > > Very mildly tested with selftests/kvm-unit-tests and nothing seems to
> > > break. I also don't have access to the system where the original issue
> > > was reported but chances we're fixing it are very good IMO as just the
> > > second patch alone was reported to be sufficient.
> > > 
> > > Reported-by: Dr. David Alan Gilbert 
> > 
> > Oh nice, I was just going to bump the magic number :-)
> > 
> > Anyway, this seems to work for me, so:
> > 
> > Tested-by: Dr. David Alan Gilbert 
> > 
> 
> I tested on two platforms and the patches worked fine. So no objection on the
> design.
> 
> Tested-by: Wei Huang 
> 
> > > Vitaly Kuznetsov (2):
> > >   KVM: x86: allocate vcpu->arch.cpuid_entries dynamically
> > >   KVM: x86: bump KVM_MAX_CPUID_ENTRIES
> > > 
> > >  arch/x86/include/asm/kvm_host.h |  4 +--
> > >  arch/x86/kvm/cpuid.c| 55 ++++++++++++-
> > >  arch/x86/kvm/x86.c  |  1 +
> > >  3 files changed, 43 insertions(+), 17 deletions(-)
> > > 
> > > -- 
> > > 2.25.4
> > > 
> > -- 
> > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> > 
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [PATCH RFC 0/2] KVM: x86: allow for more CPUID entries

2020-09-15 Thread Dr. David Alan Gilbert
* Vitaly Kuznetsov (vkuzn...@redhat.com) wrote:
> With QEMU and newer AMD CPUs (namely: Epyc 'Rome') the current limit for
> KVM_MAX_CPUID_ENTRIES(80) is reported to be hit. Last time it was raised
> from '40' in 2010. We can, of course, just bump it a little bit to fix
> the immediate issue but the report made me wonder why we need to pre-
> allocate vcpu->arch.cpuid_entries array instead of sizing it dynamically.
> This RFC is intended to feed my curiosity.
> 
> Very mildly tested with selftests/kvm-unit-tests and nothing seems to
> break. I also don't have access to the system where the original issue
> was reported but chances we're fixing it are very good IMO as just the
> second patch alone was reported to be sufficient.
> 
> Reported-by: Dr. David Alan Gilbert 

Oh nice, I was just going to bump the magic number :-)

Anyway, this seems to work for me, so:

Tested-by: Dr. David Alan Gilbert 

> Vitaly Kuznetsov (2):
>   KVM: x86: allocate vcpu->arch.cpuid_entries dynamically
>   KVM: x86: bump KVM_MAX_CPUID_ENTRIES
> 
>  arch/x86/include/asm/kvm_host.h |  4 +--
>  arch/x86/kvm/cpuid.c| 55 -
>  arch/x86/kvm/x86.c  |  1 +
>  3 files changed, 43 insertions(+), 17 deletions(-)
> 
> -- 
> 2.25.4
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [PATCH v2 0/2] MTE support for KVM guest

2020-09-10 Thread Dr. David Alan Gilbert
* Andrew Jones (drjo...@redhat.com) wrote:
> On Wed, Sep 09, 2020 at 06:45:33PM -0700, Richard Henderson wrote:
> > On 9/9/20 8:25 AM, Andrew Jones wrote:
> > >>  * Provide a KVM-specific method to extract the tags from guest memory.
> > >>This might also have benefits in terms of providing an easy way to
> > >>read bulk tag data from guest memory (since the LDGM instruction
> > >>isn't available at EL0).
> > > 
> > > Maybe we need a new version of KVM_GET_DIRTY_LOG that also provides
> > > the tags for all addresses of each dirty page.
> > 
> > KVM_GET_DIRTY_LOG just provides one bit per dirty page, no?  Then VMM copies
> > the data out from its local address to guest memory.
> > 
> > There'd be no difference with or without tags, afaik.  It's just about how 
> > VMM
> > copies the data, with or without tags.
> 
> Right, as long as it's fast enough to do
> 
>   for_each_dirty_page(page, dirty_log)
> for (i = 0; i < host-page-size/16; i += 16)
>   append_tag(LDG(page + i))
> 
> to get all the tags for each dirty page. I understood it would be faster
> to use LDGM, but we'd need a new ioctl for that. So I was proposing we
> just piggyback on a new dirty-log ioctl instead.

That feels like a bad idea to me; there are a couple of different ways
dirty page checking can work, so let's keep extracting the tags separate.

Dave

> Thanks,
> drew 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
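
A rough C rendering of the per-dirty-page tag walk sketched earlier in
the thread (note the loop bound: one tag per 16-byte granule, so the
step is 16 over the whole page size); mte_load_tag() and page_hva() are
made-up helpers standing in for an LDG-based accessor and the page's
host address, not real kernel/QEMU functions:

  /* Illustrative only: gather one allocation tag per 16-byte granule
   * for every page marked dirty in the bitmap.
   */
  static size_t collect_dirty_page_tags(const unsigned long *dirty_bitmap,
                                        size_t npages, u8 *tags)
  {
          size_t gfn, off, ntags = 0;

          for (gfn = 0; gfn < npages; gfn++) {
                  if (!test_bit(gfn, dirty_bitmap))
                          continue;
                  for (off = 0; off < PAGE_SIZE; off += 16)
                          tags[ntags++] = mte_load_tag(page_hva(gfn) + off);
          }
          return ntags;
  }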



Re: [PATCH v2 0/2] MTE support for KVM guest

2020-09-07 Thread Dr. David Alan Gilbert
>  arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 14 ++
>  arch/arm64/kvm/mmu.c   | 15 +++
>  arch/arm64/kvm/reset.c |  8 
>  arch/arm64/kvm/sys_regs.c  | 20 +++-
>  8 files changed, 66 insertions(+), 7 deletions(-)
> 
> -- 
> 2.20.1
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [PATCH v3 11/18] fuse: implement FUSE_INIT map_alignment field

2020-08-26 Thread Dr. David Alan Gilbert
> > > negotiation.  I.e. send the largest
> > > alignment (FUSE_DAX_SHIFT in this implementation) that the kernel can
> > > provide in fuse_init_in.   In that case the only error would be if
> > > userspace ignored the given constraints.
> > 
> > We could make it two way negotiation if it helps. So if we support
> > multiple mapping sizes in future, say 4K, 64K, 2MB, 1GB. So idea is
> > to send alignment of largest mapping size to device/user_space (1GB)
> > in this case? And that will allow device to choose an alignment
> > which best fits its needs?
> > 
> > But problem here is that sending (log2(1GB)) does not mean we support
> > all the alignments in that range. For example, if device selects say
> > 256MB as minimum alignment, kernel might not support it.
> > 
> > So there seem to be two ways to handle this.
> > 
> > A.Let device be conservative and always specify the minimum aligment
> >   it can work with and let guest kernel automatically choose a mapping
> >   size which meets that min_alignment contraint.
> > 
> > B.Send all the mapping sizes supported by kernel to device and then
> >   device chooses an alignment as it sees fit. We could probably send
> >   a 64bit field and set a bit for every size we support as dax mapping.
> >   If we were to go down this path, I think in that case client should
> >   respond back with exact mapping size we should use (and not with
> >   minimum alignment).
> > 
> > I thought intent behind this patch was to implement A.
> > 
> > Stefan/David, this patch came from you folks. What do you think?
> 
> Yes, I agree with Vivek.
> 
> The FUSE server is telling the client the minimum alignment for
> foffset/moffset. The client can map any size it likes as long as
> foffset/moffset meet the alignment constraint. I can't think of a reason
> to do two-way negotiation.

Agreed, because there's not much that the server can do about it if the
client would like a smaller granularity - the server's granularity might
be dictated by its mmap/page size/filesystem.  If the client wants a
larger granularity, that's its choice when it sends the setupmapping
calls.

Dave

> Stefan


-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
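
To make the one-way constraint concrete, a sketch of the client-side
check, assuming the server advertises log2(alignment) in the FUSE_INIT
map_alignment field; the variable names are illustrative:

  /* foffset/moffset in a setupmapping request must be multiples of the
   * alignment the server advertised at FUSE_INIT time; the mapping
   * length itself is the client's choice.
   */
  u64 align = 1ULL << map_alignment;      /* map_alignment is a log2 value */

  if ((foffset | moffset) & (align - 1))
          return -EINVAL;                 /* constraint was ignored */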



Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-05 Thread Dr. David Alan Gilbert
* Alex Williamson (alex.william...@redhat.com) wrote:
> On Fri, 5 Jun 2020 11:22:24 +0100
> "Dr. David Alan Gilbert"  wrote:
> 
> > * Alex Williamson (alex.william...@redhat.com) wrote:
> > > On Wed, 3 Jun 2020 01:24:43 -0400
> > > Yan Zhao  wrote:
> > >   
> > > > On Tue, Jun 02, 2020 at 09:55:28PM -0600, Alex Williamson wrote:  
> > > > > On Tue, 2 Jun 2020 23:19:48 -0400
> > > > > Yan Zhao  wrote:
> > > > > 
> > > > > > On Tue, Jun 02, 2020 at 04:55:27PM -0600, Alex Williamson wrote:
> > > > > > > On Wed, 29 Apr 2020 20:39:50 -0400
> > > > > > > Yan Zhao  wrote:
> > > > > > >   
> > > > > > > > On Wed, Apr 29, 2020 at 05:48:44PM +0800, Dr. David Alan 
> > > > > > > > Gilbert wrote:
> > > > > > > >   
> > > > > > > > > > > > > > > > > > > > > An mdev type is meant to define a 
> > > > > > > > > > > > > > > > > > > > > software compatible interface, so in
> > > > > > > > > > > > > > > > > > > > > the case of mdev->mdev migration, 
> > > > > > > > > > > > > > > > > > > > > doesn't migrating to a different type
> > > > > > > > > > > > > > > > > > > > > fail the most basic of compatibility 
> > > > > > > > > > > > > > > > > > > > > tests that we expect userspace to
> > > > > > > > > > > > > > > > > > > > > perform?  IOW, if two mdev types are 
> > > > > > > > > > > > > > > > > > > > > migration compatible, it seems a
> > > > > > > > > > > > > > > > > > > > > prerequisite to that is that they 
> > > > > > > > > > > > > > > > > > > > > provide the same software interface,
> > > > > > > > > > > > > > > > > > > > > which means they should be the same 
> > > > > > > > > > > > > > > > > > > > > mdev type.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > In the hybrid cases of mdev->phys or 
> > > > > > > > > > > > > > > > > > > > > phys->mdev, how does a
> > > > > > > > > > > > > > > > > > > > management
> > > > > > > > > > > > > > > > > > > > > tool begin to even guess what might 
> > > > > > > > > > > > > > > > > > > > > be compatible?  Are we expecting
> > > > > > > > > > > > > > > > > > > > > libvirt to probe ever device with 
> > > > > > > > > > > > > > > > > > > > > this attribute in the system?  Is
> > > > > > > > > > > > > > > > > > > > > there going to be a new class 
> > > > > > > > > > > > > > > > > > > > > hierarchy created to enumerate all
> > > > > > > > > > > > > > > > > > > > > possible migrate-able devices?
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > yes, management tool needs to guess and 
> > > > > > > > > > > > > > > > > > > > test migration compatible
> > > > > > > > > > > > > > > > > > > > between two devices. But I think it's 
> > > > > > > > > > > > > > > > > > > > not the problem only for
> > > > > > > > > > > > > > > > > > > > mdev->phys or phys->mdev. even for 
> > > > > > >

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-05 Thread Dr. David Alan Gilbert
* Alex Williamson (alex.william...@redhat.com) wrote:
> On Wed, 3 Jun 2020 01:24:43 -0400
> Yan Zhao  wrote:
> 
> > On Tue, Jun 02, 2020 at 09:55:28PM -0600, Alex Williamson wrote:
> > > On Tue, 2 Jun 2020 23:19:48 -0400
> > > Yan Zhao  wrote:
> > >   
> > > > On Tue, Jun 02, 2020 at 04:55:27PM -0600, Alex Williamson wrote:  
> > > > > On Wed, 29 Apr 2020 20:39:50 -0400
> > > > > Yan Zhao  wrote:
> > > > > 
> > > > > > On Wed, Apr 29, 2020 at 05:48:44PM +0800, Dr. David Alan Gilbert 
> > > > > > wrote:
> > > > > > 
> > > > > > > > > > > > > > > > > > > An mdev type is meant to define a 
> > > > > > > > > > > > > > > > > > > software compatible interface, so in
> > > > > > > > > > > > > > > > > > > the case of mdev->mdev migration, doesn't 
> > > > > > > > > > > > > > > > > > > migrating to a different type
> > > > > > > > > > > > > > > > > > > fail the most basic of compatibility 
> > > > > > > > > > > > > > > > > > > tests that we expect userspace to
> > > > > > > > > > > > > > > > > > > perform?  IOW, if two mdev types are 
> > > > > > > > > > > > > > > > > > > migration compatible, it seems a
> > > > > > > > > > > > > > > > > > > prerequisite to that is that they provide 
> > > > > > > > > > > > > > > > > > > the same software interface,
> > > > > > > > > > > > > > > > > > > which means they should be the same mdev 
> > > > > > > > > > > > > > > > > > > type.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > In the hybrid cases of mdev->phys or 
> > > > > > > > > > > > > > > > > > > phys->mdev, how does a  
> > > > > > > > > > > > > > > > > > management  
> > > > > > > > > > > > > > > > > > > tool begin to even guess what might be 
> > > > > > > > > > > > > > > > > > > compatible?  Are we expecting
> > > > > > > > > > > > > > > > > > > libvirt to probe ever device with this 
> > > > > > > > > > > > > > > > > > > attribute in the system?  Is
> > > > > > > > > > > > > > > > > > > there going to be a new class hierarchy 
> > > > > > > > > > > > > > > > > > > created to enumerate all
> > > > > > > > > > > > > > > > > > > possible migrate-able devices?
> > > > > > > > > > > > > > > > > > >  
> > > > > > > > > > > > > > > > > > yes, management tool needs to guess and 
> > > > > > > > > > > > > > > > > > test migration compatible
> > > > > > > > > > > > > > > > > > between two devices. But I think it's not 
> > > > > > > > > > > > > > > > > > the problem only for
> > > > > > > > > > > > > > > > > > mdev->phys or phys->mdev. even for 
> > > > > > > > > > > > > > > > > > mdev->mdev, management tool needs
> > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > first assume that the two mdevs have the 
> > > > > > > > > > > > > > > > > > same type of parent devices
> > > > > > > > > > > > > > > > > > (e.g.their pciids are equal). otherwise, 
> > > > > > > > 

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-29 Thread Dr. David Alan Gilbert
* Yan Zhao (yan.y.z...@intel.com) wrote:
> On Wed, Apr 29, 2020 at 04:22:01PM +0800, Dr. David Alan Gilbert wrote:
> > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > On Tue, Apr 28, 2020 at 10:14:37PM +0800, Dr. David Alan Gilbert wrote:
> > > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > > On Mon, Apr 27, 2020 at 11:37:43PM +0800, Dr. David Alan Gilbert 
> > > > > wrote:
> > > > > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > > > > On Sat, Apr 25, 2020 at 03:10:49AM +0800, Dr. David Alan Gilbert 
> > > > > > > wrote:
> > > > > > > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > > > > > > On Tue, Apr 21, 2020 at 08:08:49PM +0800, Tian, Kevin wrote:
> > > > > > > > > > > From: Yan Zhao
> > > > > > > > > > > Sent: Tuesday, April 21, 2020 10:37 AM
> > > > > > > > > > > 
> > > > > > > > > > > On Tue, Apr 21, 2020 at 06:56:00AM +0800, Alex Williamson 
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On Sun, 19 Apr 2020 21:24:57 -0400
> > > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Apr 17, 2020 at 07:24:57PM +0800, Cornelia 
> > > > > > > > > > > > > Huck wrote:
> > > > > > > > > > > > > > On Fri, 17 Apr 2020 05:52:02 -0400
> > > > > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, Apr 17, 2020 at 04:44:50PM +0800, 
> > > > > > > > > > > > > > > Cornelia Huck wrote:
> > > > > > > > > > > > > > > > On Mon, 13 Apr 2020 01:52:01 -0400
> > > > > > > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This patchset introduces a migration_version 
> > > > > > > > > > > > > > > > > attribute under sysfs
> > > > > > > > > > > of VFIO
> > > > > > > > > > > > > > > > > Mediated devices.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This migration_version attribute is used to 
> > > > > > > > > > > > > > > > > check migration
> > > > > > > > > > > compatibility
> > > > > > > > > > > > > > > > > between two mdev devices.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Currently, it has two locations:
> > > > > > > > > > > > > > > > > (1) under mdev_type node,
> > > > > > > > > > > > > > > > > which can be used even before device 
> > > > > > > > > > > > > > > > > creation, but only for
> > > > > > > > > > > mdev
> > > > > > > > > > > > > > > > > devices of the same mdev type.
> > > > > > > > > > > > > > > > > (2) under mdev device node,
> > > > > > > > > > > > > > > > > which can only be used after the mdev 
> > > > > > > > > > > > > > > > > devices are created, but
> > > > > > > > > > > the src
> > > > > > > > > > > > > > > > > and target mdev devices are not 
> > > > > > > > > > > > > > > > > necessarily be of the same
> > > > > > > > > > > mdev type
> > > > > > > > > > > > > > > > > (The second location is newly added in v5, in 
> > > > > > > > > > > > > > > > > order to

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-29 Thread Dr. David Alan Gilbert
* Yan Zhao (yan.y.z...@intel.com) wrote:
> On Tue, Apr 28, 2020 at 10:14:37PM +0800, Dr. David Alan Gilbert wrote:
> > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > On Mon, Apr 27, 2020 at 11:37:43PM +0800, Dr. David Alan Gilbert wrote:
> > > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > > On Sat, Apr 25, 2020 at 03:10:49AM +0800, Dr. David Alan Gilbert 
> > > > > wrote:
> > > > > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > > > > On Tue, Apr 21, 2020 at 08:08:49PM +0800, Tian, Kevin wrote:
> > > > > > > > > From: Yan Zhao
> > > > > > > > > Sent: Tuesday, April 21, 2020 10:37 AM
> > > > > > > > > 
> > > > > > > > > On Tue, Apr 21, 2020 at 06:56:00AM +0800, Alex Williamson 
> > > > > > > > > wrote:
> > > > > > > > > > On Sun, 19 Apr 2020 21:24:57 -0400
> > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > >
> > > > > > > > > > > On Fri, Apr 17, 2020 at 07:24:57PM +0800, Cornelia Huck 
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On Fri, 17 Apr 2020 05:52:02 -0400
> > > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia 
> > > > > > > > > > > > > Huck wrote:
> > > > > > > > > > > > > > On Mon, 13 Apr 2020 01:52:01 -0400
> > > > > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This patchset introduces a migration_version 
> > > > > > > > > > > > > > > attribute under sysfs
> > > > > > > > > of VFIO
> > > > > > > > > > > > > > > Mediated devices.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This migration_version attribute is used to check 
> > > > > > > > > > > > > > > migration
> > > > > > > > > compatibility
> > > > > > > > > > > > > > > between two mdev devices.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Currently, it has two locations:
> > > > > > > > > > > > > > > (1) under mdev_type node,
> > > > > > > > > > > > > > > which can be used even before device 
> > > > > > > > > > > > > > > creation, but only for
> > > > > > > > > mdev
> > > > > > > > > > > > > > > devices of the same mdev type.
> > > > > > > > > > > > > > > (2) under mdev device node,
> > > > > > > > > > > > > > > which can only be used after the mdev devices 
> > > > > > > > > > > > > > > are created, but
> > > > > > > > > the src
> > > > > > > > > > > > > > > and target mdev devices are not necessarily 
> > > > > > > > > > > > > > > be of the same
> > > > > > > > > mdev type
> > > > > > > > > > > > > > > (The second location is newly added in v5, in 
> > > > > > > > > > > > > > > order to keep
> > > > > > > > > consistent
> > > > > > > > > > > > > > > with the migration_version node for migratable 
> > > > > > > > > > > > > > > pass-though
> > > > > > > > > devices)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > What is the relationship between those two 
> > > > > > > > > > > > > > attributes?
> > > > > > > > > > > > > >

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-28 Thread Dr. David Alan Gilbert
* Yan Zhao (yan.y.z...@intel.com) wrote:
> On Mon, Apr 27, 2020 at 11:37:43PM +0800, Dr. David Alan Gilbert wrote:
> > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > On Sat, Apr 25, 2020 at 03:10:49AM +0800, Dr. David Alan Gilbert wrote:
> > > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > > On Tue, Apr 21, 2020 at 08:08:49PM +0800, Tian, Kevin wrote:
> > > > > > > From: Yan Zhao
> > > > > > > Sent: Tuesday, April 21, 2020 10:37 AM
> > > > > > > 
> > > > > > > On Tue, Apr 21, 2020 at 06:56:00AM +0800, Alex Williamson wrote:
> > > > > > > > On Sun, 19 Apr 2020 21:24:57 -0400
> > > > > > > > Yan Zhao  wrote:
> > > > > > > >
> > > > > > > > > On Fri, Apr 17, 2020 at 07:24:57PM +0800, Cornelia Huck wrote:
> > > > > > > > > > On Fri, 17 Apr 2020 05:52:02 -0400
> > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > >
> > > > > > > > > > > On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia Huck 
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On Mon, 13 Apr 2020 01:52:01 -0400
> > > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > This patchset introduces a migration_version 
> > > > > > > > > > > > > attribute under sysfs
> > > > > > > of VFIO
> > > > > > > > > > > > > Mediated devices.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This migration_version attribute is used to check 
> > > > > > > > > > > > > migration
> > > > > > > compatibility
> > > > > > > > > > > > > between two mdev devices.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Currently, it has two locations:
> > > > > > > > > > > > > (1) under mdev_type node,
> > > > > > > > > > > > > which can be used even before device creation, 
> > > > > > > > > > > > > but only for
> > > > > > > mdev
> > > > > > > > > > > > > devices of the same mdev type.
> > > > > > > > > > > > > (2) under mdev device node,
> > > > > > > > > > > > > which can only be used after the mdev devices are 
> > > > > > > > > > > > > created, but
> > > > > > > the src
> > > > > > > > > > > > > and target mdev devices are not necessarily be of 
> > > > > > > > > > > > > the same
> > > > > > > mdev type
> > > > > > > > > > > > > (The second location is newly added in v5, in order 
> > > > > > > > > > > > > to keep
> > > > > > > consistent
> > > > > > > > > > > > > with the migration_version node for migratable 
> > > > > > > > > > > > > pass-though
> > > > > > > devices)
> > > > > > > > > > > >
> > > > > > > > > > > > What is the relationship between those two attributes?
> > > > > > > > > > > >
> > > > > > > > > > > (1) is for mdev devices specifically, and (2) is provided 
> > > > > > > > > > > to keep the
> > > > > > > same
> > > > > > > > > > > sysfs interface as with non-mdev cases. so (2) is for 
> > > > > > > > > > > both mdev
> > > > > > > devices and
> > > > > > > > > > > non-mdev devices.
> > > > > > > > > > >
> > > > > > > > > > > in future, if we enable vfio-pci vendor ops, (i.e. a 
> > > > > > > > > > > non-mdev device
> > > > > > > > > > > is binding t

Re: [virtio-dev] Re: [PATCH v9 0/8] stg mail -e --version=v9 \

2019-09-10 Thread Dr. David Alan Gilbert
* Alexander Duyck (alexander.du...@gmail.com) wrote:
> On Tue, Sep 10, 2019 at 7:47 AM Michal Hocko  wrote:
> >
> > On Tue 10-09-19 07:42:43, Alexander Duyck wrote:
> > > On Tue, Sep 10, 2019 at 5:42 AM Michal Hocko  wrote:
> > > >
> > > > I wanted to review "mm: Introduce Reported pages" just realize that I
> > > > have no clue on what is going on so returned to the cover and it didn't
> > > > really help much. I am completely unfamiliar with virtio so please bear
> > > > with me.
> > > >
> > > > On Sat 07-09-19 10:25:03, Alexander Duyck wrote:
> > > > [...]
> > > > > This series provides an asynchronous means of reporting to a 
> > > > > hypervisor
> > > > > that a guest page is no longer in use and can have the data associated
> > > > > with it dropped. To do this I have implemented functionality that 
> > > > > allows
> > > > > for what I am referring to as unused page reporting
> > > > >
> > > > > The functionality for this is fairly simple. When enabled it will 
> > > > > allocate
> > > > > statistics to track the number of reported pages in a given free area.
> > > > > When the number of free pages exceeds this value plus a high water 
> > > > > value,
> > > > > currently 32, it will begin performing page reporting which consists 
> > > > > of
> > > > > pulling pages off of free list and placing them into a scatter list. 
> > > > > The
> > > > > scatterlist is then given to the page reporting device and it will 
> > > > > perform
> > > > > the required action to make the pages "reported", in the case of
> > > > > virtio-balloon this results in the pages being madvised as 
> > > > > MADV_DONTNEED
> > > > > and as such they are forced out of the guest. After this they are 
> > > > > placed
> > > > > back on the free list,
> > > >
> > > > And here I am reallly lost because "forced out of the guest" makes me
> > > > feel that those pages are no longer usable by the guest. So how come you
> > > > can add them back to the free list. I suspect understanding this part
> > > > will allow me to understand why we have to mark those pages and prevent
> > > > merging.
> > >
> > > Basically as the paragraph above mentions "forced out of the guest"
> > > really is just the hypervisor calling MADV_DONTNEED on the page in
> > > question. So the behavior is the same as any userspace application
> > > that calls MADV_DONTNEED where the contents are no longer accessible
> > > from userspace and attempting to access them will result in a fault
> > > and the page being populated with a zero fill on-demand page, or a
> > > copy of the file contents if the memory is file backed.
> >
> > As I've said I have no idea about virt so this doesn't really tell me
> > much. Does that mean that if somebody allocates such a page and tries to
> > access it then virt will handle a fault and bring it back?
> 
> Actually I am probably describing too much as the MADV_DONTNEED is the
> hypervisor behavior in response to the virtio-balloon notification. A
> more thorough explanation of it can be found by just running "man
> madvise", probably best just to leave it at that since I am probably
> confusing things by describing hypervisor behavior in a kernel patch
> set.
> 
> For the most part all the page reporting really does is provide a way
> to incrementally identify unused regions of memory in the buddy
> allocator. That in turn is used by virtio-balloon in a polling thread
> to report to the hypervisor what pages are not in use so that it can
> make a decision on what to do with the pages now that it knows they
> are unused.
> 
> All this is providing is just a report and it is optional if the
> hypervisor will act on it or not. If the hypervisor takes some sort of
> action on the page, then the expectation is that the hypervisor will
> use some sort of mechanism such as a page fault to discover when the
> page is used again.

OK, that's interestingly different (but OK) from some other schemes that
have been described which *require* the guest to somehow indicate the
page is in use before starting to use the page again.

Dave

> -
> To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
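
For anyone unfamiliar with the madvise() behaviour being relied on here,
a small self-contained userspace demo (not guest or hypervisor code):
after MADV_DONTNEED the old contents of an anonymous private mapping are
gone and the next touch refaults a zero-filled page.

  #include <assert.h>
  #include <string.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t len = 4096;
          char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          assert(p != MAP_FAILED);
          memset(p, 0xaa, len);                   /* dirty the page */
          madvise(p, len, MADV_DONTNEED);         /* drop its contents */
          assert(p[0] == 0);                      /* refault gives a zero page */
          return 0;
  }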


Re: [PATCH] fuse: reserve byteswapped init opcodes

2019-09-04 Thread Dr. David Alan Gilbert
* Michael S. Tsirkin (m...@redhat.com) wrote:
> virtio fs tunnels fuse over a virtio channel.  One issue is two sides
> might be speaking different endian-ness. To detects this,
> host side looks at the opcode value in the FUSE_INIT command.
> Works fine at the moment but might fail if a future version
> of fuse will use such an opcode for initialization.
> Let's reserve this opcode so we remember and don't do this.

I think in theory that works even for normal fuse.

> Same for CUSE_INIT.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  include/uapi/linux/fuse.h | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 2971d29a42e4..f042e63f4aa0 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -425,6 +425,10 @@ enum fuse_opcode {
>  
>   /* CUSE specific operations */
>   CUSE_INIT   = 4096,
> +
> + /* Reserved opcodes: helpful to detect structure endian-ness */
> + FUSE_INIT_BSWAP_RESERVED= 26 << 24,

FUSE_INIT << 24 probably works?

> + CUSE_INIT_BSWAP_RESERVED    = 16 << 16,

Dave

>  };
>  
>  enum fuse_notify_code {
> -- 
> MST
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
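
To spell out why 26 << 24 is the value being reserved: FUSE_INIT is
opcode 26, and a 32-bit byte swap of 26 is exactly 26 << 24 (likewise
CUSE_INIT, 4096, swaps to 16 << 16).  A sketch of the device-side check,
where hdr stands for the fuse_in_header of the first request; this is
not taken from any actual implementation:

  /* The first request on the queue must be FUSE_INIT; reading its opcode
   * without conversion tells us whether the driver's byte order matches.
   */
  uint32_t opcode = hdr->opcode;          /* raw, no endian conversion */

  if (opcode == FUSE_INIT) {
          /* same byte order as us */
  } else if (opcode == __builtin_bswap32(FUSE_INIT)) {
          /* driver is the opposite endianness: byte-swap from here on */
  } else {
          /* not an init request at all: reject */
  }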


Re: [PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device

2019-07-22 Thread Dr. David Alan Gilbert
* Christian Borntraeger (borntrae...@de.ibm.com) wrote:
> 
> 
> On 18.07.19 16:30, Dan Williams wrote:
> > On Thu, Jul 18, 2019 at 6:15 AM Vivek Goyal  wrote:
> >>
> >> On Wed, Jul 17, 2019 at 07:27:25PM +0200, Halil Pasic wrote:
> >>> On Wed, 15 May 2019 15:27:03 -0400
> >>> Vivek Goyal  wrote:
> >>>
> >>>> From: Stefan Hajnoczi 
> >>>>
> >>>> Setup a dax device.
> >>>>
> >>>> Use the shm capability to find the cache entry and map it.
> >>>>
> >>>> The DAX window is accessed by the fs/dax.c infrastructure and must have
> >>>> struct pages (at least on x86).  Use devm_memremap_pages() to map the
> >>>> DAX window PCI BAR and allocate struct page.
> >>>>
> >>>
> >>> Sorry for being this late. I don't see any more recent version so I will
> >>> comment here.
> >>>
> >>> I'm trying to figure out how is this supposed to work on s390. My concern
> >>> is, that on s390 PCI memory needs to be accessed by special
> >>> instructions. This is taken care of by the stuff defined in
> >>> arch/s390/include/asm/io.h. E.g. we 'override' __raw_writew so it uses
> >>> the appropriate s390 instruction. However if the code does not use the
> >>> linux abstractions for accessing PCI memory, but assumes it can be
> >>> accessed like RAM, we have a problem.
> >>>
> >>> Looking at this patch, it seems to me, that we might end up with exactly
> >>> the case described. For example AFAICT copy_to_iter() (3) resolves to
> >>> the function in lib/iov_iter.c which does not seem to cater for s390
> >>> oddities.
> >>>
> >>> I didn't have the time to investigate this properly, and since virtio-fs
> >>> is virtual, we may be able to get around what is otherwise a
> >>> limitation on s390. My understanding of these areas is admittedly
> >>> shallow, and since I'm not sure I'll have much more time to
> >>> invest in the near future I decided to raise concern.
> >>>
> >>> Any opinions?
> >>
> >> Hi Halil,
> >>
> >> I don't understand s390 and how PCI works there as well. Is there any
> >> other transport we can use there to map IO memory directly and access
> >> using DAX?
> >>
> >> BTW, is DAX supported for s390.
> >>
> >> I am also hoping somebody who knows better can chip in. Till that time,
> >> we could still use virtio-fs on s390 without DAX.
> > 
> > s390 has so-called "limited" dax support, see CONFIG_FS_DAX_LIMITED.
> > In practice that means that support for PTE_DEVMAP is missing which
> > means no get_user_pages() support for dax mappings. Effectively it's
> > only useful for execute-in-place as operations like fork() and ptrace
> > of dax mappings will fail.
> 
> 
> This is only true for the dcssblk device driver (drivers/s390/block/dcssblk.c
> and arch/s390/mm/extmem.c). 
> 
> For what its worth, the dcssblk looks to Linux like normal memory (just above 
> the
> previously detected memory) that can be used like normal memory. In previous 
> time
> we even had struct pages for this memory - this was removed long ago (when it 
> was
> still xip) to reduce the memory footprint for large dcss blocks and small 
> memory
> guests.
> Can the CONFIG_FS_DAX_LIMITED go away if we have struct pages for that memory?
> 
> Now some observations: 
> - dcssblk is z/VM only (not KVM)
> - Setting CONFIG_FS_DAX_LIMITED globally as a Kconfig option depending on 
> wether
>   a device driver is compiled in or not seems not flexible enough in case if 
> you
>   have device driver that does have struct pages and another one that doesn't
> - I do not see a reason why we should not be able to map anything from QEMU
>   into the guest real memory via an additional KVM memory slot. 
>   We would need to handle that in the guest somehow (and not as a PCI bar),
>   register this with struct pages etc.
> - we must then look how we can create the link between the guest memory and 
> the
>   virtio-fs driver. For virtio-ccw we might be able to add a new ccw command 
> or
>   whatever. Maybe we could also piggy-back on some memory hotplug work from 
> David
>   Hildenbrand (add cc).
> 
> Regarding limitations on the platform:
> - while we do have PCI, the virtio devices are usually plugged via the ccw 
> bus.
>   That implies no PCI bars. I assume you use those PCI bars only to 
> implicitely 
>   have the location of the shared memory
>   Correct?

Right.

> - no real memory mapped I/O. Instead there are instructions that work on the 
> mmio.
>   As I understand things, this is of no concern regarding virtio-fs as you do 
> not
>   need mmio in the sense that a memory access of the guest to such an address 
>   triggers an exit. You just need the shared memory as a mean to have the data
>   inside the guest. Any notification is done via normal virtqueue mechanisms
>   Correct?

Yep.

> 
> Adding Heiko, maybe he remembers some details of the dcssblk/xip history.
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

2019-05-10 Thread Dr. David Alan Gilbert
* Cornelia Huck (coh...@redhat.com) wrote:
> On Thu, 9 May 2019 17:48:26 +0100
> "Dr. David Alan Gilbert"  wrote:
> 
> > * Cornelia Huck (coh...@redhat.com) wrote:
> > > On Thu, 9 May 2019 16:48:57 +0100
> > > "Dr. David Alan Gilbert"  wrote:
> > >   
> > > > * Cornelia Huck (coh...@redhat.com) wrote:  
> > > > > On Tue, 7 May 2019 15:18:26 -0600
> > > > > Alex Williamson  wrote:
> > > > > 
> > > > > > On Sun,  5 May 2019 21:49:04 -0400
> > > > > > Yan Zhao  wrote:
> > > > > 
> > > > > > > +  Errno:
> > > > > > > +  If vendor driver wants to claim a mdev device incompatible to 
> > > > > > > all other mdev
> > > > > > > +  devices, it should not register version attribute for this 
> > > > > > > mdev device. But if
> > > > > > > +  a vendor driver has already registered version attribute and 
> > > > > > > it wants to claim
> > > > > > > +  a mdev device incompatible to all other mdev devices, it needs 
> > > > > > > to return
> > > > > > > +  -ENODEV on access to this mdev device's version attribute.
> > > > > > > +  If a mdev device is only incompatible to certain mdev devices, 
> > > > > > > write of
> > > > > > > +  incompatible mdev devices's version strings to its version 
> > > > > > > attribute should
> > > > > > > +  return -EINVAL;  
> > > > > > 
> > > > > > I think it's best not to define the specific errno returned for a
> > > > > > specific situation, let the vendor driver decide, userspace simply
> > > > > > needs to know that an errno on read indicates the device does not
> > > > > > support migration version comparison and that an errno on write
> > > > > > indicates the devices are incompatible or the target doesn't support
> > > > > > migration versions.
> > > > > 
> > > > > I think I have to disagree here: It's probably valuable to have an
> > > > > agreed error for 'cannot migrate at all' vs 'cannot migrate between
> > > > > those two particular devices'. Userspace might want to do different
> > > > > things (e.g. trying with different device pairs).
> > > > 
> > > > Trying to stuff these things down an errno seems a bad idea; we can't
> > > > get much information that way.  
> > > 
> > > So, what would be a reasonable approach? Userspace should first read
> > > the version attributes on both devices (to find out whether migration
> > > is supported at all), and only then figure out via writing whether they
> > > are compatible?
> > > 
> > > (Or just go ahead and try, if it does not care about the reason.)  
> > 
> > Well, I'm OK with something like writing to test whether it's
> > compatible, it's just we need a better way of saying 'no'.
> > I'm not sure if that involves reading back from somewhere after
> > the write or what.
> 
> Hm, so I basically see two ways of doing that:
> - standardize on some error codes... problem: error codes can be hard
>   to fit to reasons
> - make the error available in some attribute that can be read
> 
> I'm not sure how we can serialize the readback with the last write,
> though (this looks inherently racy).
> 
> How important is detailed error reporting here?

I think we need something, otherwise we're just going to get vague
user reports of 'but my VM doesn't migrate'; I'd like the error to be
good enough to point most users to something they can understand
(e.g. wrong card family/too old a driver etc).

Dave

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

2019-05-09 Thread Dr. David Alan Gilbert
* Cornelia Huck (coh...@redhat.com) wrote:
> On Thu, 9 May 2019 16:48:57 +0100
> "Dr. David Alan Gilbert"  wrote:
> 
> > * Cornelia Huck (coh...@redhat.com) wrote:
> > > On Tue, 7 May 2019 15:18:26 -0600
> > > Alex Williamson  wrote:
> > >   
> > > > On Sun,  5 May 2019 21:49:04 -0400
> > > > Yan Zhao  wrote:  
> > >   
> > > > > +  Errno:
> > > > > +  If vendor driver wants to claim a mdev device incompatible to all 
> > > > > other mdev
> > > > > +  devices, it should not register version attribute for this mdev 
> > > > > device. But if
> > > > > +  a vendor driver has already registered version attribute and it 
> > > > > wants to claim
> > > > > +  a mdev device incompatible to all other mdev devices, it needs to 
> > > > > return
> > > > > +  -ENODEV on access to this mdev device's version attribute.
> > > > > +  If a mdev device is only incompatible to certain mdev devices, 
> > > > > write of
> > > > > +  incompatible mdev devices's version strings to its version 
> > > > > attribute should
> > > > > +  return -EINVAL;
> > > > 
> > > > I think it's best not to define the specific errno returned for a
> > > > specific situation, let the vendor driver decide, userspace simply
> > > > needs to know that an errno on read indicates the device does not
> > > > support migration version comparison and that an errno on write
> > > > indicates the devices are incompatible or the target doesn't support
> > > > migration versions.  
> > > 
> > > I think I have to disagree here: It's probably valuable to have an
> > > agreed error for 'cannot migrate at all' vs 'cannot migrate between
> > > those two particular devices'. Userspace might want to do different
> > > things (e.g. trying with different device pairs).  
> > 
> > Trying to stuff these things down an errno seems a bad idea; we can't
> > get much information that way.
> 
> So, what would be a reasonable approach? Userspace should first read
> the version attributes on both devices (to find out whether migration
> is supported at all), and only then figure out via writing whether they
> are compatible?
> 
> (Or just go ahead and try, if it does not care about the reason.)

Well, I'm OK with something like writing to test whether it's
compatible, it's just we need a better way of saying 'no'.
I'm not sure if that involves reading back from somewhere after
the write or what.

Dave

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
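
For concreteness, roughly what the userspace side of the read/write
probe being discussed would look like; the sysfs paths and error
handling are illustrative only:

  /* Read the source device's version string, then write it to the
   * target's version attribute.  A failed read means the device doesn't
   * support migration version comparison at all; a failed write means
   * the two devices are reported as incompatible.
   */
  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          char ver[256];
          int sfd = open("/sys/bus/mdev/devices/SRC_UUID/mdev_type/version",
                         O_RDONLY);
          int dfd = open("/sys/bus/mdev/devices/DST_UUID/mdev_type/version",
                         O_WRONLY);
          ssize_t n = (sfd >= 0) ? read(sfd, ver, sizeof(ver)) : -1;

          if (n < 0)
                  printf("source: no migration version support\n");
          else if (write(dfd, ver, n) < 0)
                  printf("incompatible (or unsupported): %s\n", strerror(errno));
          else
                  printf("compatible\n");
          return 0;
  }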


Re: [PATCH v2 2/2] drm/i915/gvt: export mdev device version to sysfs for Intel vGPU

2019-05-08 Thread Dr. David Alan Gilbert
* Yan Zhao (yan.y.z...@intel.com) wrote:
> This feature implements the version attribute for Intel's vGPU mdev
> devices.
> 
> version attribute is rw.
> It's used to check device compatibility for two mdev devices.
> version string format and length are private for vendor driver. vendor
> driver is able to define them freely.
> 
> For Intel vGPU of gen8 and gen9, the mdev device version
> consists of 3 fields: "vendor id" + "device id" + "mdev type".
> 
> Reading from a vGPU's version attribute, a string is returned in below
> format: --. e.g.
> 8086-193b-i915-GVTg_V5_2.
> 
> Writing a string to a vGPU's version attribute will trigger GVT to check
> whether a vGPU identified by the written string is compatible with
> current vGPU owning this version attribute. errno is returned if the two
> vGPUs are incompatible. The length of written string is returned in
> compatible case.
> 
> For other platforms, and for GVT not supporting vGPU live migration
> feature, errnos are returned when read/write of mdev devices' version
> attributes.
> 
> For old GVT versions where no version attributes exposed in sysfs, it is
> regarded as not supporting vGPU live migration.
> 
> For future platforms, besides the current 2 fields in vendor proprietary
> part, more fields may be added to identify Intel vGPU well for live
> migration purpose.
> 
> v2:
> 1. removed 32 common part of version string
> (Alex Williamson)
> 2. do not register version attribute for GVT not supporting live
> migration.(Cornelia Huck)
> 3. for platforms out of gen8, gen9, return -EINVAL --> -ENODEV for
> incompatible. (Cornelia Huck)
> 
> Cc: Alex Williamson 
> Cc: Erik Skultety 
> Cc: "Dr. David Alan Gilbert" 
> Cc: Cornelia Huck 
> Cc: "Tian, Kevin" 
> Cc: Zhenyu Wang 
> Cc: "Wang, Zhi A" 
> c: Neo Jia 
> Cc: Kirti Wankhede 
> 
> Signed-off-by: Yan Zhao 
> ---
>  drivers/gpu/drm/i915/gvt/Makefile |  2 +-
>  drivers/gpu/drm/i915/gvt/device_version.c | 87 +++
>  drivers/gpu/drm/i915/gvt/gvt.c| 51 +
>  drivers/gpu/drm/i915/gvt/gvt.h|  6 ++
>  4 files changed, 145 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/gpu/drm/i915/gvt/device_version.c
> 
> diff --git a/drivers/gpu/drm/i915/gvt/Makefile 
> b/drivers/gpu/drm/i915/gvt/Makefile
> index 271fb46d4dd0..54e209a23899 100644
> --- a/drivers/gpu/drm/i915/gvt/Makefile
> +++ b/drivers/gpu/drm/i915/gvt/Makefile
> @@ -3,7 +3,7 @@ GVT_DIR := gvt
>  GVT_SOURCE := gvt.o aperture_gm.o handlers.o vgpu.o trace_points.o 
> firmware.o \
>   interrupt.o gtt.o cfg_space.o opregion.o mmio.o display.o edid.o \
>   execlist.o scheduler.o sched_policy.o mmio_context.o cmd_parser.o 
> debugfs.o \
> - fb_decoder.o dmabuf.o page_track.o
> + fb_decoder.o dmabuf.o page_track.o device_version.o
>  
>  ccflags-y+= -I$(src) -I$(src)/$(GVT_DIR)
>  i915-y   += $(addprefix $(GVT_DIR)/, 
> $(GVT_SOURCE))
> diff --git a/drivers/gpu/drm/i915/gvt/device_version.c 
> b/drivers/gpu/drm/i915/gvt/device_version.c
> new file mode 100644
> index ..bd4cdcbdba95
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gvt/device_version.c
> @@ -0,0 +1,87 @@
> +/*
> + * Copyright(c) 2011-2017 Intel Corporation. All rights reserved.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
> FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 
> THE
> + * SOFTWARE.
> + *
> + * Authors:
> + *Yan Zhao 
> + */
> +#include 
> +#include "i915_drv.h"
>

Re: [PATCH 18/52] virtio-fs: Map cache using the values from the capabilities

2019-03-20 Thread Dr. David Alan Gilbert
* Liu Bo (obuil.li...@gmail.com) wrote:
> On Mon, Dec 10, 2018 at 9:57 AM Vivek Goyal  wrote:
> >
> > Instead of assuming we had the fixed bar for the cache, use the
> > value from the capabilities.
> >
> > Signed-off-by: Dr. David Alan Gilbert 
> > ---
> >  fs/fuse/virtio_fs.c | 32 +---
> >  1 file changed, 17 insertions(+), 15 deletions(-)
> >
> > diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> > index 60d496c16841..55bac1465536 100644
> > --- a/fs/fuse/virtio_fs.c
> > +++ b/fs/fuse/virtio_fs.c
> > @@ -14,11 +14,6 @@
> >  #include 
> >  #include "fuse_i.h"
> >
> > -enum {
> > -   /* PCI BAR number of the virtio-fs DAX window */
> > -   VIRTIO_FS_WINDOW_BAR = 2,
> > -};
> > -
> >  /* List of virtio-fs device instances and a lock for the list */
> >  static DEFINE_MUTEX(virtio_fs_mutex);
> >  static LIST_HEAD(virtio_fs_instances);
> > @@ -518,7 +513,7 @@ static int virtio_fs_setup_dax(struct virtio_device 
> > *vdev, struct virtio_fs *fs)
> > struct dev_pagemap *pgmap;
> > struct pci_dev *pci_dev;
> > phys_addr_t phys_addr;
> > -   size_t len;
> > +   size_t bar_len;
> > int ret;
> > u8 have_cache, cache_bar;
> > u64 cache_offset, cache_len;
> > @@ -551,17 +546,13 @@ static int virtio_fs_setup_dax(struct virtio_device 
> > *vdev, struct virtio_fs *fs)
> >  }
> >
> > /* TODO handle case where device doesn't expose BAR? */
> > -   ret = pci_request_region(pci_dev, VIRTIO_FS_WINDOW_BAR,
> > -"virtio-fs-window");
> > +   ret = pci_request_region(pci_dev, cache_bar, "virtio-fs-window");
> > if (ret < 0) {
> > dev_err(>dev, "%s: failed to request window BAR\n",
> > __func__);
> > return ret;
> > }
> >
> > -   phys_addr = pci_resource_start(pci_dev, VIRTIO_FS_WINDOW_BAR);
> > -   len = pci_resource_len(pci_dev, VIRTIO_FS_WINDOW_BAR);
> > -
> > mi = devm_kzalloc(_dev->dev, sizeof(*mi), GFP_KERNEL);
> > if (!mi)
> > return -ENOMEM;
> > @@ -586,6 +577,17 @@ static int virtio_fs_setup_dax(struct virtio_device 
> > *vdev, struct virtio_fs *fs)
> > pgmap->ref = >ref;
> > pgmap->type = MEMORY_DEVICE_FS_DAX;
> >
> > +   phys_addr = pci_resource_start(pci_dev, cache_bar);
> > +   bar_len = pci_resource_len(pci_dev, cache_bar);
> > +
> > +   if (cache_offset + cache_len > bar_len) {
> > +   dev_err(>dev,
> > +   "%s: cache bar shorter than cap offset+len\n",
> > +   __func__);
> > +   return -EINVAL;
> > +   }
> > +   phys_addr += cache_offset;
> > +
> > /* Ideally we would directly use the PCI BAR resource but
> >  * devm_memremap_pages() wants its own copy in pgmap.  So
> >  * initialize a struct resource from scratch (only the start
> > @@ -594,7 +596,7 @@ static int virtio_fs_setup_dax(struct virtio_device 
> > *vdev, struct virtio_fs *fs)
> > pgmap->res = (struct resource){
> > .name = "virtio-fs dax window",
> > .start = phys_addr,
> > -   .end = phys_addr + len,
> > +   .end = phys_addr + cache_len,
> 
> Just in case you haven't noticed/fixed this problem, it should be
> 
> + .end = phys_addr + cache_len - 1,
> 
> because resource_size() counts %size as "end - start + 1".
> The end result of the above is a "conflicting page map" warning when
> specifying a second virtio-fs pci device.

Thanks for spotting this! I think we'd seen that message once but not
noticed where from.

> I'll send a patch for this, and feel free to take it along with the
> patchset if needed.
> 

Dave

> thanks,
> liubo
> 
> > };
> >
> > fs->window_kaddr = devm_memremap_pages(_dev->dev, pgmap);
> > @@ -607,10 +609,10 @@ static int virtio_fs_setup_dax(struct virtio_device 
> > *vdev, struct virtio_fs *fs)
> > return ret;
> >
> > fs->window_phys_addr = phys_addr;
> > -   fs->window_len = len;
> > +   fs->window_len = cache_len;
> >
> > -   dev_dbg(>dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 
> > %zu\n",
> > -   __func__, fs->window_kaddr, phys_addr, len);
> > +   dev_dbg(>dev, "%s: cache kaddr 0x%px phys_addr 0x%llx len 
> > %llx\n",
> > +   __func__, fs->window_kaddr, phys_addr, cache_len);
> >
> > fs->dax_dev = alloc_dax(fs, NULL, _fs_dax_ops);
> > if (!fs->dax_dev)
> > --
> > 2.13.6
> >
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [PATCH v2 1/1] userfaultfd/sysctl: add vm.unprivileged_userfaultfd

2019-03-19 Thread Dr. David Alan Gilbert
* Andrew Morton (a...@linux-foundation.org) wrote:
> On Tue, 19 Mar 2019 11:07:22 +0800 Peter Xu  wrote:
> 
> > Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
> > whether userfaultfd is allowed by unprivileged users.  When this is
> > set to zero, only privileged users (root user, or users with the
> > CAP_SYS_PTRACE capability) will be able to use the userfaultfd
> > syscalls.
> 
> Please send along a full description of why you believe Linux needs
> this feature, for me to add to the changelog.  What is the benefit to
> our users?  How will it be used?
> 
> etcetera.  As it was presented I'm seeing no justification for adding
> the patch!

How about:

---
Userfaultfd can be misused to make it easier to exploit existing
use-after-free (and similar) bugs that might otherwise only make a short
window or race condition available.  By using userfaultfd to stall a
kernel thread, a malicious program can keep some state that it wrote
stable for an extended period, which it can then access using an
existing exploit.  While it doesn't cause the exploit itself, and while
it's not the only thing that can stall a kernel thread when accessing a
memory location, it's one of the few that never needs privilege.

Add a flag allowing userfaultfd to be restricted, so that in general
it won't be usable by arbitrary user programs, but in environments that
require userfaultfd it can be turned back on.

---

Dave

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
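
For reference, roughly the gate the knob implies, as a sketch rather
than the final merged code:

  /* Sketch of the check at the top of the userfaultfd() syscall: with
   * vm.unprivileged_userfaultfd == 0, only tasks with CAP_SYS_PTRACE
   * (which root has) may create a userfaultfd.
   */
  if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
          return -EPERM;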


Re: [PATCH 18/52] virtio-fs: Map cache using the values from the capabilities

2018-12-14 Thread Dr. David Alan Gilbert
* Cornelia Huck (coh...@redhat.com) wrote:
> On Fri, 14 Dec 2018 13:44:34 +
> Stefan Hajnoczi  wrote:
> 
> > On Thu, Dec 13, 2018 at 01:38:23PM +0100, Cornelia Huck wrote:
> > > On Thu, 13 Dec 2018 13:24:31 +0100
> > > David Hildenbrand  wrote:
> > >   
> > > > On 13.12.18 13:15, Dr. David Alan Gilbert wrote:  
> > > > > * David Hildenbrand (da...@redhat.com) wrote:
> > > > >> On 13.12.18 11:00, Dr. David Alan Gilbert wrote:
> > > > >>> * David Hildenbrand (da...@redhat.com) wrote:
> > > > >>>> On 13.12.18 10:13, Dr. David Alan Gilbert wrote:
> > > > >>>>> * David Hildenbrand (da...@redhat.com) wrote:
> > > > >>>>>> On 10.12.18 18:12, Vivek Goyal wrote:    
> > > > >>>>>>> Instead of assuming we had the fixed bar for the cache, use the
> > > > >>>>>>> value from the capabilities.
> > > > >>>>>>>
> > > > >>>>>>> Signed-off-by: Dr. David Alan Gilbert 
> > > > >>>>>>> ---
> > > > >>>>>>>  fs/fuse/virtio_fs.c | 32 +---
> > > > >>>>>>>  1 file changed, 17 insertions(+), 15 deletions(-)
> > > > >>>>>>>
> > > > >>>>>>> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> > > > >>>>>>> index 60d496c16841..55bac1465536 100644
> > > > >>>>>>> --- a/fs/fuse/virtio_fs.c
> > > > >>>>>>> +++ b/fs/fuse/virtio_fs.c
> > > > >>>>>>> @@ -14,11 +14,6 @@
> > > > >>>>>>>  #include 
> > > > >>>>>>>  #include "fuse_i.h"
> > > > >>>>>>>  
> > > > >>>>>>> -enum {
> > > > >>>>>>> -   /* PCI BAR number of the virtio-fs DAX window */
> > > > >>>>>>> -   VIRTIO_FS_WINDOW_BAR = 2,
> > > > >>>>>>> -};
> > > > >>>>>>> -
> > > > >>>>>>>  /* List of virtio-fs device instances and a lock for the list 
> > > > >>>>>>> */
> > > > >>>>>>>  static DEFINE_MUTEX(virtio_fs_mutex);
> > > > >>>>>>>  static LIST_HEAD(virtio_fs_instances);
> > > > >>>>>>> @@ -518,7 +513,7 @@ static int virtio_fs_setup_dax(struct 
> > > > >>>>>>> virtio_device *vdev, struct virtio_fs *fs)
> > > > >>>>>>> struct dev_pagemap *pgmap;
> > > > >>>>>>> struct pci_dev *pci_dev;
> > > > >>>>>>> phys_addr_t phys_addr;
> > > > >>>>>>> -   size_t len;
> > > > >>>>>>> +   size_t bar_len;
> > > > >>>>>>> int ret;
> > > > >>>>>>> u8 have_cache, cache_bar;
> > > > >>>>>>> u64 cache_offset, cache_len;
> > > > >>>>>>> @@ -551,17 +546,13 @@ static int virtio_fs_setup_dax(struct 
> > > > >>>>>>> virtio_device *vdev, struct virtio_fs *fs)
> > > > >>>>>>>  }
> > > > >>>>>>>  
> > > > >>>>>>> /* TODO handle case where device doesn't expose BAR? */ 
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>>> For virtio-pmem we decided to not go via BARs as this would 
> > > > >>>>>> effectively
> > > > >>>>>> make it only usable for virtio-pci implementers. Instead, we are 
> > > > >>>>>> going
> > > > >>>>>> to export the applicable physical device region directly (e.g.
> > > > >>>>>> phys_start, phys_size in virtio config), so it is decoupled from 
> > > > >>>>>> PCI
> > > > >>>>>> details. Doing the same for virtio-fs would allow e.g. also 
> > > > >>>>>> virtio-ccw
> > > > >>>>>> to make eventually use of 

Re: [PATCH 15/52] fuse: map virtio_fs DAX window BAR

2018-12-14 Thread Dr. David Alan Gilbert
* Vivek Goyal (vgo...@redhat.com) wrote:
> On Thu, Dec 13, 2018 at 03:40:52PM -0500, Vivek Goyal wrote:
> > On Thu, Dec 13, 2018 at 12:15:51PM -0800, Dan Williams wrote:
> > > On Thu, Dec 13, 2018 at 12:09 PM Dr. David Alan Gilbert
> > >  wrote:
> > > >
> > > > * Dan Williams (dan.j.willi...@intel.com) wrote:
> > > > > On Mon, Dec 10, 2018 at 9:22 AM Vivek Goyal  wrote:
> > > > > >
> > > > > > From: Stefan Hajnoczi 
> > > > > >
> > > > > > Experimental QEMU code introduces an MMIO BAR for mapping portions 
> > > > > > of
> > > > > > files in the virtio-fs device.  Map this BAR so that FUSE DAX can 
> > > > > > access
> > > > > > file contents from the host page cache.
> > > > >
> > > > > FUSE DAX sounds terrifying, can you explain a bit more about what 
> > > > > this is?
> > > >
> > > > We've got a guest running in QEMU, it sees an emulated PCI device;
> > > > that runs a FUSE protocol over virtio on that PCI device, but also has
> > > > a trick where via commands sent over the virtio queue associated with 
> > > > that device,
> > > > (fragments of) host files get mmap'd into the qemu virtual memory that 
> > > > corresponds
> > > > to the kvm slot exposed to the guest for that bar.
> > > >
> > > > The guest sees those chunks in that BAR, and thus you can read/write
> > > > to the host file by directly writing into that BAR.
> > > 
> > > Ok so it's all software emulated and there won't be hardware DMA
> > > initiated by the guest to that address?
> > 
> > That's my understanding.
> > 
> > > I.e. if the host file gets
> > > truncated / hole-punched the guest would just cause a refault and the
> > > filesystem could fill in the block,
> > 
> > Right
> > 
> > > or the guest is expected to die if
> > > the fault to the truncated file range results in SIGBUS.
> > 
> > Are you referring to the case where a file page is mapped in qemu and
> > another guest/process trucates that page and when qemu tries to access it it
> > will get SIGBUS. Have not tried it, will give it a try. Not sure what
> > happens when QEMU receives SIGBUS.
> > 
> > Having said that, this is not different from the case of one process
> > mapping a file and another process truncating the file and first process
> > getting SIGBUS, right?
> 
> Ok, tried this and guest process hangs.
> 
> Stefan, dgilbert, this reminds me that we have faced this issue during
> our testing and we decided that this will need some fixing in KVM. I
> even put this in as part of changelog of patch with subject "fuse: Take
> inode lock for dax inode truncation"
> 
> "Another problem is, if we setup a mapping in fuse_iomap_begin(), and
>  file gets truncated and dax read/write happens, KVM currently hangs.
>  It tries to fault in a page which does not exist on host (file got
>  truncated). It probably requries fixing in KVM."
> 
> Not sure what should happen though when qemu receives SIGBUS in this
> case.

Yes, and I noted it in the TODO in my qemu patch posting.

We need to figure out what we want the guest to see in this case and
figure out how to make QEMU/kvm fix it up so that the guest doesn't
see anything odd.

Dave

> Thanks
> Vivek
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [PATCH 15/52] fuse: map virtio_fs DAX window BAR

2018-12-13 Thread Dr. David Alan Gilbert
* Dan Williams (dan.j.willi...@intel.com) wrote:
> On Mon, Dec 10, 2018 at 9:22 AM Vivek Goyal  wrote:
> >
> > From: Stefan Hajnoczi 
> >
> > Experimental QEMU code introduces an MMIO BAR for mapping portions of
> > files in the virtio-fs device.  Map this BAR so that FUSE DAX can access
> > file contents from the host page cache.
> 
> FUSE DAX sounds terrifying, can you explain a bit more about what this is?

We've got a guest running in QEMU; it sees an emulated PCI device and
runs the FUSE protocol over virtio on that device.  There's also a trick
where, via commands sent over the virtio queue associated with that
device, (fragments of) host files get mmap'd into the QEMU virtual
memory that corresponds to the kvm slot exposed to the guest for that
BAR.

The guest sees those chunks in that BAR, and thus you can read/write
to the host file by directly writing into that BAR.

> > The DAX window is accessed by the fs/dax.c infrastructure and must have
> > struct pages (at least on x86).  Use devm_memremap_pages() to map the
> > DAX window PCI BAR and allocate struct page.
> 
> PCI BAR space is not cache coherent,

Note that no real PCI infrastructure is involved - this is all emulated
devices, backed by files mmap'd in the host QEMU process.

Dave

> what prevents these pages from
> being used in paths that would do:
> 
>    object = page_address(pfn_to_page(virtio_fs_pfn));
> 
> ...?
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
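
A conceptual sketch of the host side of such a mapping request, under
the assumption that the host keeps the DAX window as one contiguous
region of its address space and punches file fragments into it; names
like window_base, window_len and req are illustrative:

  /* Map a fragment of the backing file over the part of the host's
   * address space that the guest sees as the DAX window BAR.  MAP_FIXED
   * replaces whatever was mapped there before.
   */
  if (req->len > window_len || req->win_offset > window_len - req->len)
          return -EINVAL;

  if (mmap(window_base + req->win_offset, req->len,
           PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
           backing_fd, req->file_offset) == MAP_FAILED)
          return -errno;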


Re: [PATCH 18/52] virtio-fs: Map cache using the values from the capabilities

2018-12-13 Thread Dr. David Alan Gilbert
* David Hildenbrand (da...@redhat.com) wrote:
> On 13.12.18 11:00, Dr. David Alan Gilbert wrote:
> > * David Hildenbrand (da...@redhat.com) wrote:
> >> On 13.12.18 10:13, Dr. David Alan Gilbert wrote:
> >>> * David Hildenbrand (da...@redhat.com) wrote:
> >>>> On 10.12.18 18:12, Vivek Goyal wrote:
> >>>>> Instead of assuming we had the fixed bar for the cache, use the
> >>>>> value from the capabilities.
> >>>>>
> >>>>> Signed-off-by: Dr. David Alan Gilbert 
> >>>>> ---
> >>>>>  fs/fuse/virtio_fs.c | 32 +---
> >>>>>  1 file changed, 17 insertions(+), 15 deletions(-)
> >>>>>
> >>>>> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> >>>>> index 60d496c16841..55bac1465536 100644
> >>>>> --- a/fs/fuse/virtio_fs.c
> >>>>> +++ b/fs/fuse/virtio_fs.c
> >>>>> @@ -14,11 +14,6 @@
> >>>>>  #include 
> >>>>>  #include "fuse_i.h"
> >>>>>  
> >>>>> -enum {
> >>>>> -   /* PCI BAR number of the virtio-fs DAX window */
> >>>>> -   VIRTIO_FS_WINDOW_BAR = 2,
> >>>>> -};
> >>>>> -
> >>>>>  /* List of virtio-fs device instances and a lock for the list */
> >>>>>  static DEFINE_MUTEX(virtio_fs_mutex);
> >>>>>  static LIST_HEAD(virtio_fs_instances);
> >>>>> @@ -518,7 +513,7 @@ static int virtio_fs_setup_dax(struct virtio_device 
> >>>>> *vdev, struct virtio_fs *fs)
> >>>>> struct dev_pagemap *pgmap;
> >>>>> struct pci_dev *pci_dev;
> >>>>> phys_addr_t phys_addr;
> >>>>> -   size_t len;
> >>>>> +   size_t bar_len;
> >>>>> int ret;
> >>>>> u8 have_cache, cache_bar;
> >>>>> u64 cache_offset, cache_len;
> >>>>> @@ -551,17 +546,13 @@ static int virtio_fs_setup_dax(struct 
> >>>>> virtio_device *vdev, struct virtio_fs *fs)
> >>>>>  }
> >>>>>  
> >>>>> /* TODO handle case where device doesn't expose BAR? */
> >>>>
> >>>> For virtio-pmem we decided to not go via BARs as this would effectively
> >>>> make it only usable for virtio-pci implementers. Instead, we are going
> >>>> to export the applicable physical device region directly (e.g.
> >>>> phys_start, phys_size in virtio config), so it is decoupled from PCI
> >>>> details. Doing the same for virtio-fs would allow e.g. also virtio-ccw
> >>>> to make eventually use of this.
> >>>
> >>> That makes it a very odd looking PCI device;  I can see that with
> >>> virtio-pmem it makes some sense, given that it's job is to expose
> >>> arbitrary chunks of memory.
> >>>
> >>> Dave
> >>
> >> Well, the fact that your are
> >>
> >> - including 
> >> - adding pci related code
> >>
> >> in/to fs/fuse/virtio_fs.c
> >>
> >> tells me that these properties might be better communicated on the
> >> virtio layer, not on the PCI layer.
> >>
> >> Or do you really want to glue virtio-fs to virtio-pci for all eternity?
> > 
> > No, these need cleaning up; and the split within the bar
> > is probably going to change to be communicated via virtio layer
> > rather than pci capabilities.  However, I don't want to make our PCI
> > device look odd, just to make portability to non-PCI devices - so it's
> > right to make the split appropriately, but still to use PCI bars
> > for what they were designed for.
> > 
> > Dave
> 
> Let's discuss after the cleanup. In general I am not convinced this is
> the right thing to do. Using virtio-pci for anything else than pure
> transport smells like bad design to me (well, I am no virtio expert
> after all ;) ). No matter what PCI bars were designed for. If we can't
> get the same running with e.g. virtio-ccw or virtio-whatever, it is
> broken by design (or an addon that is tightly glued to virtio-pci, if
> that is the general idea).

I'm sure we can find alternatives for virtio-*, so I wouldn't expect
it to be glued to virtio-pci.

Dave

> -- 
> 
> Thanks,
> 
> David / dhildenb
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [PATCH 18/52] virtio-fs: Map cache using the values from the capabilities

2018-12-13 Thread Dr. David Alan Gilbert
* David Hildenbrand (da...@redhat.com) wrote:
> On 13.12.18 10:13, Dr. David Alan Gilbert wrote:
> > * David Hildenbrand (da...@redhat.com) wrote:
> >> On 10.12.18 18:12, Vivek Goyal wrote:
> >>> Instead of assuming we had the fixed bar for the cache, use the
> >>> value from the capabilities.
> >>>
> >>> Signed-off-by: Dr. David Alan Gilbert 
> >>> ---
> >>>  fs/fuse/virtio_fs.c | 32 +---
> >>>  1 file changed, 17 insertions(+), 15 deletions(-)
> >>>
> >>> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> >>> index 60d496c16841..55bac1465536 100644
> >>> --- a/fs/fuse/virtio_fs.c
> >>> +++ b/fs/fuse/virtio_fs.c
> >>> @@ -14,11 +14,6 @@
> >>>  #include 
> >>>  #include "fuse_i.h"
> >>>  
> >>> -enum {
> >>> - /* PCI BAR number of the virtio-fs DAX window */
> >>> - VIRTIO_FS_WINDOW_BAR = 2,
> >>> -};
> >>> -
> >>>  /* List of virtio-fs device instances and a lock for the list */
> >>>  static DEFINE_MUTEX(virtio_fs_mutex);
> >>>  static LIST_HEAD(virtio_fs_instances);
> >>> @@ -518,7 +513,7 @@ static int virtio_fs_setup_dax(struct virtio_device 
> >>> *vdev, struct virtio_fs *fs)
> >>>   struct dev_pagemap *pgmap;
> >>>   struct pci_dev *pci_dev;
> >>>   phys_addr_t phys_addr;
> >>> - size_t len;
> >>> + size_t bar_len;
> >>>   int ret;
> >>>   u8 have_cache, cache_bar;
> >>>   u64 cache_offset, cache_len;
> >>> @@ -551,17 +546,13 @@ static int virtio_fs_setup_dax(struct virtio_device 
> >>> *vdev, struct virtio_fs *fs)
> >>>  }
> >>>  
> >>>   /* TODO handle case where device doesn't expose BAR? */
> >>
> >> For virtio-pmem we decided to not go via BARs as this would effectively
> >> make it only usable for virtio-pci implementers. Instead, we are going
> >> to export the applicable physical device region directly (e.g.
> >> phys_start, phys_size in virtio config), so it is decoupled from PCI
> >> details. Doing the same for virtio-fs would allow e.g. also virtio-ccw
> >> to make eventually use of this.
> > 
> > That makes it a very odd looking PCI device;  I can see that with
> > virtio-pmem it makes some sense, given that its job is to expose
> > arbitrary chunks of memory.
> > 
> > Dave
> 
> Well, the fact that your are
> 
> - including 
> - adding pci related code
> 
> in/to fs/fuse/virtio_fs.c
> 
> tells me that these properties might be better communicated on the
> virtio layer, not on the PCI layer.
> 
> Or do you really want to glue virtio-fs to virtio-pci for all eternity?

No, these need cleaning up; and the split within the bar
is probably going to change to be communicated via virtio layer
rather than pci capabilities.  However, I don't want to make our PCI
device look odd, just to ease portability to non-PCI devices - so it's
right to make the split appropriately, but still to use PCI bars
for what they were designed for.

Dave

> 
> -- 
> 
> Thanks,
> 
> David / dhildenb
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [PATCH 18/52] virtio-fs: Map cache using the values from the capabilities

2018-12-13 Thread Dr. David Alan Gilbert
* David Hildenbrand (da...@redhat.com) wrote:
> On 10.12.18 18:12, Vivek Goyal wrote:
> > Instead of assuming we had the fixed bar for the cache, use the
> > value from the capabilities.
> > 
> > Signed-off-by: Dr. David Alan Gilbert 
> > ---
> >  fs/fuse/virtio_fs.c | 32 +---
> >  1 file changed, 17 insertions(+), 15 deletions(-)
> > 
> > diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> > index 60d496c16841..55bac1465536 100644
> > --- a/fs/fuse/virtio_fs.c
> > +++ b/fs/fuse/virtio_fs.c
> > @@ -14,11 +14,6 @@
> >  #include 
> >  #include "fuse_i.h"
> >  
> > -enum {
> > -   /* PCI BAR number of the virtio-fs DAX window */
> > -   VIRTIO_FS_WINDOW_BAR = 2,
> > -};
> > -
> >  /* List of virtio-fs device instances and a lock for the list */
> >  static DEFINE_MUTEX(virtio_fs_mutex);
> >  static LIST_HEAD(virtio_fs_instances);
> > @@ -518,7 +513,7 @@ static int virtio_fs_setup_dax(struct virtio_device 
> > *vdev, struct virtio_fs *fs)
> > struct dev_pagemap *pgmap;
> > struct pci_dev *pci_dev;
> > phys_addr_t phys_addr;
> > -   size_t len;
> > +   size_t bar_len;
> > int ret;
> > u8 have_cache, cache_bar;
> > u64 cache_offset, cache_len;
> > @@ -551,17 +546,13 @@ static int virtio_fs_setup_dax(struct virtio_device 
> > *vdev, struct virtio_fs *fs)
> >  }
> >  
> > /* TODO handle case where device doesn't expose BAR? */
> 
> For virtio-pmem we decided to not go via BARs as this would effectively
> make it only usable for virtio-pci implementers. Instead, we are going
> to export the applicable physical device region directly (e.g.
> phys_start, phys_size in virtio config), so it is decoupled from PCI
> details. Doing the same for virtio-fs would allow e.g. also virtio-ccw
> to make eventually use of this.

That makes it a very odd looking PCI device;  I can see that with
virtio-pmem it makes some sense, given that its job is to expose
arbitrary chunks of memory.

Dave


> > -   ret = pci_request_region(pci_dev, VIRTIO_FS_WINDOW_BAR,
> > -"virtio-fs-window");
> > +   ret = pci_request_region(pci_dev, cache_bar, "virtio-fs-window");
> > if (ret < 0) {
> > dev_err(&vdev->dev, "%s: failed to request window BAR\n",
> > __func__);
> > return ret;
> > }
> >  
> > -   phys_addr = pci_resource_start(pci_dev, VIRTIO_FS_WINDOW_BAR);
> > -   len = pci_resource_len(pci_dev, VIRTIO_FS_WINDOW_BAR);
> > -
> > mi = devm_kzalloc(&pci_dev->dev, sizeof(*mi), GFP_KERNEL);
> > if (!mi)
> > return -ENOMEM;
> > @@ -586,6 +577,17 @@ static int virtio_fs_setup_dax(struct virtio_device 
> > *vdev, struct virtio_fs *fs)
> > pgmap->ref = &fs->ref;
> > pgmap->type = MEMORY_DEVICE_FS_DAX;
> >  
> > +   phys_addr = pci_resource_start(pci_dev, cache_bar);
> > +   bar_len = pci_resource_len(pci_dev, cache_bar);
> > +
> > +   if (cache_offset + cache_len > bar_len) {
> > +   dev_err(&vdev->dev,
> > +   "%s: cache bar shorter than cap offset+len\n",
> > +   __func__);
> > +   return -EINVAL;
> > +   }
> > +   phys_addr += cache_offset;
> > +
> > /* Ideally we would directly use the PCI BAR resource but
> >  * devm_memremap_pages() wants its own copy in pgmap.  So
> >  * initialize a struct resource from scratch (only the start
> > @@ -594,7 +596,7 @@ static int virtio_fs_setup_dax(struct virtio_device 
> > *vdev, struct virtio_fs *fs)
> > pgmap->res = (struct resource){
> > .name = "virtio-fs dax window",
> > .start = phys_addr,
> > -   .end = phys_addr + len,
> > +   .end = phys_addr + cache_len,
> > };
> >  
> > fs->window_kaddr = devm_memremap_pages(&pci_dev->dev, pgmap);
> > @@ -607,10 +609,10 @@ static int virtio_fs_setup_dax(struct virtio_device 
> > *vdev, struct virtio_fs *fs)
> > return ret;
> >  
> > fs->window_phys_addr = phys_addr;
> > -   fs->window_len = len;
> > +   fs->window_len = cache_len;
> >  
> > -   dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len %zu\n",
> > -   __func__, fs->window_kaddr, phys_addr, len);
> > +   dev_dbg(&vdev->dev, "%s: cache kaddr 0x%px phys_addr 0x%llx len %llx\n",
> > +   __func__, fs->window_kaddr, phys_addr, cache_len);
> >  
> > fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops);
> > if (!fs->dax_dev)
> > 
> 
> 
> -- 
> 
> Thanks,
> 
> David / dhildenb
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
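
For illustration, the virtio-config alternative discussed in this thread (the
route virtio-pmem took) might look roughly like the sketch below; the config
struct layout and field names are assumptions made up for the example, not
taken from any virtio-fs specification:

    #include <linux/types.h>
    #include <linux/virtio.h>
    #include <linux/virtio_config.h>

    /* Hypothetical device config layout advertising the DAX window. */
    struct virtio_fs_dax_config {
            __u64 cache_start;      /* guest-physical start of the DAX window */
            __u64 cache_len;        /* window length in bytes */
    };

    /*
     * Read the window from virtio config space; no PCI-specific code is
     * needed, so the same path would work for virtio-ccw or virtio-mmio.
     * Endianness handling is glossed over for brevity.
     */
    static int virtio_fs_read_dax_window(struct virtio_device *vdev,
                                         u64 *start, u64 *len)
    {
            virtio_cread(vdev, struct virtio_fs_dax_config, cache_start, start);
            virtio_cread(vdev, struct virtio_fs_dax_config, cache_len, len);

            return *len ? 0 : -ENODEV;      /* no window advertised */
    }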


Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure

2018-01-31 Thread Dr. David Alan Gilbert
* Paolo Bonzini (pbonz...@redhat.com) wrote:
> On 29/01/2018 22:13, Andi Kleen wrote:
> >> What happens when someone introduces a
> >> workaround tied to some other model numbers?
> > There are already many of those in the tree for other issues and features. 
> > So far you managed to survive without. Likely that will be true
> > in the future too.
> 
> "Guests have to live with processor fuckups" is actually a much better
> answer than "Hypervisors may need to revisit their practice", since at
> least it's clear where the blame lies.
> 
> Because really it's just plain luck.  It just happens that most errata
> are for functionality that is not available to a virtual machine (e.g.
> perfmon and monitor workarounds or buggy TSC deadline timer that
> hypervisors emulate anyway), that only needs a chicken bit to be set in
> the host, or the bugs are there only for old hardware that doesn't have
> virtualization (X86_BUG_F00F, X86_BUGS_SWAPGS_FENCE).
> 
> CPUID flags are guaranteed to never change---never come, never go away.
> For anything that doesn't map nicely to a CPUID flag, you cannot really
> express it.  Also if something is not architectural, you can pretty much
> assume that you cannot know it under virtualization.  f/m/s is not
> architectural; family, model and stepping mean absolutely nothing when
> running in virtualization, because the host CPU model can change under
> your feet at any time.  We force guest vendor == host vendor just
> because otherwise too much stuff breaks, but that's it.

In some ways we've been luckiest on x86; my understanding is ARM have a
similar set of architecture-specific errata and aren't really sure
how to expose this to guests either.

Dave

> Paolo
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure

2018-01-31 Thread Dr. David Alan Gilbert
* Borislav Petkov (b...@suse.de) wrote:
> On Wed, Jan 31, 2018 at 12:30:36PM +0000, Dr. David Alan Gilbert wrote:
> > Indeed, it's only for this weird case where you suddenly need to change
> > it.
> 
> No, there's more:
> 
>   .name = "Broadwell-noTSX",
>   .name = "Haswell-noTSX",

Haswell came out and we made the CPU definition, and then got a
microcode update that removed the feature.

So the common feature of noTSX and IBRS is that they're the only two
cases where a CPU has been released and then the flags have changed later.

Dave

> -- 
> Regards/Gruss,
> Boris.
> 
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 
> 21284 (AG Nürnberg)
> -- 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure

2018-01-31 Thread Dr. David Alan Gilbert
* Borislav Petkov (b...@suse.de) wrote:
> On Wed, Jan 31, 2018 at 11:04:07AM +0000, Dr. David Alan Gilbert wrote:
> > That half is the easy bit, we've already got that (thanks to Eduardo),
> > QEMU has -IBRS variants of CPU types, so if you start a VM with
> > -cpu Broadwell-IBRS
> 
> Eww, a CPU model with a specific feature bit. I hope you guys don't add
> a model like that for every CPU feature.

Indeed, it's only for this weird case where you suddenly need to change
it.

Dave

> -- 
> Regards/Gruss,
> Boris.
> 
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 
> 21284 (AG Nürnberg)
> -- 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure

2018-01-31 Thread Dr. David Alan Gilbert
* Thomas Gleixner (t...@linutronix.de) wrote:
> On Wed, 31 Jan 2018, Christophe de Dinechin wrote:
> > > On 30 Jan 2018, at 21:46, Alan Cox <gno...@lxorguk.ukuu.org.uk> wrote:
> > > 
> > >> If you are ever going to migrate to Skylake, I think you should just
> > >> always tell the guests that you're running on Skylake. That way the
> > >> guests will always assume the worst case situation wrt Spectre.
> > > 
> > > Unfortunately if you do that then guest may also decide to use other
> > > Skylake hardware features and pop its clogs when it finds out its actually
> > > running on Westmere or SandyBridge.
> > > 
> > > So you need to be able to both lie to the OS and user space via cpuid and
> > > also have a second 'but do skylake protections' that only mitigation
> > > aware software knows about.
> > 
> > Yes. The most desirable lie is different depending on whether you want to
> > allow virtualization features such as migration (where you’d gravitate
> > towards a CPU with less features) or whether you want to allow mitigation
> > (where you’d rather present the most fragile CPUID, probably Skylake).
> > 
> > Looking at some recent patches, I’m concerned that the code being added
> > often assumes that the CPUID is the correct way to get that info.
> > I do not think this is correct. You really want specific information about
> > the host CPUID, not whatever KVM CPUID emulation makes up.
> 
> That wont cut it. If you have a heterogenous farm of systems, then you need:
> 
>   - All CPUs have to support IBRS/IBPB or at least the hypervisor has to
> pretend they do by providing fake MSRs for that
> 
>   - Have a 'force IBRS/IBPB' mechanism so the guests don't discard it due
> to missing CPU feature bits.

That half is the easy bit, we've already got that (thanks to Eduardo),
QEMU has -IBRS variants of CPU types, so if you start a VM with
-cpu Broadwell-IBRS  it'll get advertised to the guest as having IBRS;
and (with appropriate flags) the management layers will only allow that
to be started on hosts that support IBRS and won't allow migration
between hosts with and without it.

> Though this gets worse. You have to make sure that the guest keeps _ALL_
> sorts of mitigation mechanisms enabled and does not decide to disable
> retpolines because IBRS/IBPB are "available".

This is what's different with this set; it's all coming down to sets
of heuristics which include CPU model etc, rather than just a 'we've got
a feature, use it'.

Dave

> Good luck with making all that work.
> 
> Thanks,
> 
>   tglx

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure

2018-01-30 Thread Dr. David Alan Gilbert
* Linus Torvalds (torva...@linux-foundation.org) wrote:

> Why do you even _care_ about the guest, and how it acts wrt Skylake?
> What you should care about is not so much the guests (which do their
> own thing) but protect guests from each other, no?
> 
> So I'm a bit mystified by some of this discussion within the context
> of virtual machines. I think that is separate from any measures that
> the guest machine may then decide to partake in.

Because you'd never want to be the cause of the guest making the wrong
decision and thus being less secure than it was on real hardware.

> If you are ever going to migrate to Skylake, I think you should just
> always tell the guests that you're running on Skylake. That way the
> guests will always assume the worst case situation wrt Spectre.

Say you've got a pile of Ivybridge, all running lots of VMs,
the guests see that they're running on Ivybridge.
Now you need some more hosts, so you buy the latest Skylake boxes,
and add them into your cluster.  Previously it was fine to live
migrate a VM to the Skylake box and the VM still sees it's running
Ivybridge; and you can migrate that VM back and forward.
The rule was that as long as the CPU type you told the guest was
old enough then it could migrate to any newer box.

You can't tell the VMs running on Ivybridge they're running on Skylake
otherwise they'll start trying to use Skylake features
(OK, they should be checking flags, but that's a separate story).

Dave


> Maybe that mystification comes from me missing something.
> 
>    Linus
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [PATCH v6 00/10] Retpoline: Avoid speculative indirect calls in kernel

2018-01-10 Thread Dr. David Alan Gilbert
* Woodhouse, David (d...@amazon.co.uk) wrote:
> On Mon, 2018-01-08 at 02:42 -0800, Paul Turner wrote:
> > 
> > While the cases above involve the crafting and use of poisoned
> > entries.  Recall also that one of the initial conditions was that we
> > should avoid RSB underflow as some CPUs may try to use other indirect
> > predictors when this occurs.
> 
> I think we should start by deliberately ignoring the CPUs which use the
> other indirect predictors on RSB underflow. Those CPUs don't perform
> *quite* so badly with IBRS anyway.
> 
> Let's get the minimum amount of RSB handling in to cope with the pre-
> SKL CPUs, and then see if we really do want to extend it to make SKL
> 100% secure in retpoline mode or not.

How do you make decisions on which CPU you're running on?
I'm worried about the case of a VM that starts off on an older host
and then gets live migrated to a new Skylake.
For Intel CPUs we've historically been safe to live migrate
to any newer host based on having all the features that the old one had;
with the guest still seeing the flags etc for the old CPU.

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [tip:x86/pti] x86/cpu/AMD: Use LFENCE_RDTSC instead of MFENCE_RDTSC

2018-01-08 Thread Dr. David Alan Gilbert
* Andrew Cooper (andrew.coop...@citrix.com) wrote:
> On 08/01/18 14:47, Tom Lendacky wrote:
> > On 1/8/2018 5:10 AM, Thomas Gleixner wrote:
> >> On Mon, 8 Jan 2018, Andrew Cooper wrote:
> >>
> >>> On 08/01/18 10:08, Thomas Gleixner wrote:
> >>>> On Sat, 6 Jan 2018, tip-bot for Tom Lendacky wrote:
> >>>>
> >>>>> Commit-ID:  0bf17c102177d5da9363bf8b1e4704b9996d5079
> >>>>> Gitweb: 
> >>>>> https://git.kernel.org/tip/0bf17c102177d5da9363bf8b1e4704b9996d5079
> >>>>> Author: Tom Lendacky <thomas.lenda...@amd.com>
> >>>>> AuthorDate: Fri, 5 Jan 2018 10:07:56 -0600
> >>>>> Committer:  Thomas Gleixner <t...@linutronix.de>
> >>>>> CommitDate: Sat, 6 Jan 2018 21:57:40 +0100
> >>>>>
> >>>>> x86/cpu/AMD: Use LFENCE_RDTSC instead of MFENCE_RDTSC
> >>>>>
> >>>>> With LFENCE now a serializing instruction, set the LFENCE_RDTSC
> >>>>> feature since the LFENCE instruction has less overhead than the
> >>>>> MFENCE instruction.
> >>>> Second thoughts on that. As pointed out by someone in one of the insane
> >>>> long threads:
> >>>>
> >>>> What happens if the kernel runs as a guest and
> >>>>
> >>>>   - the hypervisor did not set the LFENCE to serializing on the host
> >>>>
> >>>>   - the hypervisor does not allow writing MSR_AMD64_DE_CFG
> >>>>
> >>>> That would bring the guest into a pretty bad state or am I missing
> >>>> something essential here?
> >>> What I did in Xen was to attempt to set it, then read it back and see. 
> >>> If LFENCE still isn't serialising, using repoline is the only available
> >>> mitigation.
> >>>
> >>> My understanding from the folk at AMD is that retpoline is safe to use,
> >>> but has higher overhead than the LFENCE approach.
> > Correct, the retpoline will work, it just takes more cycles.
> >
> >> That still does not help vs. rdtsc_ordered() and LFENCE_RDTSC ...
> > Ok, I can add the read-back check before setting the feature flag(s).
> >
> > But... what about the case where the guest is a different family than
> > hypervisor? If we're on, say, a Fam15h hypervisor but the guest is started
> > as a Fam0fh guest where the MSR doesn't exist and LFENCE is supposed to be
> > serialized?  I'll have to do a rdmsr_safe() and only set the flag(s) if I
> > can successfully read the MSR back and validate the bit.
> 
> If your hypervisor is lying to you about the primary family, then all
> bets are off.  I don't expect there will be any production systems doing
> this.

It's not that unusual a thing to do on qemu/kvm - to specify the lowest
common denominator of the set of CPUs in your data centre (for any one
vendor); it does tend to get some weird combinations.

Dave

> The user can get to keep both pieces if they've decided that this was a
> good thing to try.
> 
> ~Andrew
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
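
The set-then-read-back check Andrew describes above might look roughly like
this sketch (assuming the usual AMD DE_CFG MSR and bit constants, whose exact
names have varied across kernel versions; this is not the actual Xen or Linux
code):

    static bool lfence_made_serializing(void)
    {
            u64 val;

            /* The MSR may be missing or blocked when running as a guest. */
            if (rdmsrl_safe(MSR_AMD64_DE_CFG, &val))
                    return false;

            /* Ask for "LFENCE is serializing"... */
            wrmsrl_safe(MSR_AMD64_DE_CFG,
                        val | MSR_AMD64_DE_CFG_LFENCE_SERIALIZE);

            /* ...then read it back: a hypervisor may silently drop the write. */
            if (rdmsrl_safe(MSR_AMD64_DE_CFG, &val))
                    return false;

            return val & MSR_AMD64_DE_CFG_LFENCE_SERIALIZE;
    }

If the read-back fails or the bit is still clear, retpoline remains the only
available mitigation, as noted above.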


Re: [PATCH 3/7] x86/enter: Use IBRS on syscall and interrupts

2018-01-05 Thread Dr. David Alan Gilbert
> Dave Hansen <dave.han...@intel.com> wrote:
>> On 01/04/2018 08:51 PM, Andy Lutomirski wrote:
>> > Do we need an arch_prctl() to enable IBRS for user mode?
>> 
>> Eventually, once the dust settles.  I think there's a spectrum of
>> paranoia here, that is roughly (with increasing paranoia):
>> 
>> 1. do nothing
>> 2. do retpoline
>> 3. do IBRS in kernel
>> 4. do IBRS always
>> 
>> I think you're asking for ~3.5.
>> 
>> Patches for 1-3 are out there and 4 is pretty straightforward.  Doing a
>> arch_prctl() is still straightforward, but will be a much more niche
>> thing than any of the other choices.  Plus, with a user interface, we
>> have to argue over the ABI for at least a month or two. ;)

I was chatting to Andrea about this, and we came to the conclusion one
use might be for qemu;  I was worried about (theoretically) whether
userspace in a guest could read privileged data from the guest kernel by
attacking the qemu process rather than by attacking the kernels.

Dave

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [PATCH] userfaultfd_zeropage: return -ENOSPC in case mm has gone

2017-08-02 Thread Dr. David Alan Gilbert
* Mike Rapoport (r...@linux.vnet.ibm.com) wrote:
> On Mon, Jul 31, 2017 at 03:45:08PM +0200, Michal Hocko wrote:
> > On Mon 31-07-17 15:32:47, Andrea Arcangeli wrote:
> > > On Mon, Jul 31, 2017 at 02:22:04PM +0200, Michal Hocko wrote:
> > > > On Thu 27-07-17 09:26:59, Mike Rapoport wrote:
> > > > > In the non-cooperative userfaultfd case, the process exit may race 
> > > > > with
> > > > > outstanding mcopy_atomic called by the uffd monitor.  Returning 
> > > > > -ENOSPC
> > > > > instead of -EINVAL when mm is already gone will allow uffd monitor to
> > > > > distinguish this case from other error conditions.
> > > > 
> > > > Normally we tend to return ESRCH in such case. ENOSPC sounds rather
> > > > confusing...
> > > 
> > > This is in sync and consistent with the retval for UFFDIO_COPY upstream:
> > > 
> > >   if (mmget_not_zero(ctx->mm)) {
> > >   ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
> > >  uffdio_copy.len);
> > >   mmput(ctx->mm);
> > >   } else {
> > >   return -ENOSPC;
> > >   }
> > > 
> > > If you preferred ESRCH I certainly wouldn't have been against, but we
> > > should have discussed it before it was upstream. All it matters is
> > > it's documented in the great manpage that was written for it as quoted
> > > below.
> > 
> > OK, I wasn't aware of this.
> > 
> > > +.TP
> > > +.B ENOENT
> > > +(Since Linux 4.11)
> > > +The faulting process has changed
> > > +its virtual memory layout simultaneously with outstanding
> > > +.I UFFDIO_COPY
> > > +operation.
> > > +.TP
> > > +.B ENOSPC
> > > +(Since Linux 4.11)
> > > +The faulting process has exited at the time of
> > > +.I UFFDIO_COPY
> > > +operation.
> > > 
> > > To change it now, we would need to involve manpage and other code
> > > changes.
> > 
> > Well, ESRCH is more appropriate so I would rather change it sooner than
> > later. But if we are going to risk user space breakage then this is not
> > worth the risk. I expected there are very few users of this API
> > currently so maybe it won't be a big disaster?
> 
> I surely can take care of CRIU, but I don't know if QEMU or certain
> database application that uses userfaultfd rely on this API, not mentioning
> there maybe other unknown users.
> 
> Andrea, what do you think?

QEMU doesn't care about the errno value, it just reports it.

Dave

> > Anyway, at least this is documented so I will leave the decision to you.
> > -- 
> > Michal Hocko
> > SUSE Labs
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majord...@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: em...@kvack.org
> > 
> 
> -- 
> Sincerely yours,
> Mike.
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
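
For illustration, a monitor's handling of the UFFDIO_COPY error codes
documented above might look like the following userspace sketch; the
return-code policy here is an assumption, not taken from QEMU or CRIU:

    #include <errno.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/userfaultfd.h>

    /* Resolve one faulting page in the target address space. */
    static int uffd_copy_page(int uffd, unsigned long dst_addr,
                              void *src_buf, unsigned long page_size)
    {
            struct uffdio_copy copy = {
                    .dst  = dst_addr,
                    .src  = (unsigned long)src_buf,
                    .len  = page_size,
                    .mode = 0,
            };

            if (ioctl(uffd, UFFDIO_COPY, &copy) == 0)
                    return 0;

            switch (errno) {
            case ENOENT:    /* target changed its layout: skip/retry later */
                    return 1;
            case ENOSPC:    /* target mm has gone away, as documented above */
                    return 2;
            default:
                    perror("UFFDIO_COPY");
                    return -1;
            }
    }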


Re: [Qemu-devel] [RFH] qemu-2.6 memory corruption with OVMF and linux-4.9

2017-06-18 Thread Dr. David Alan Gilbert
* Philipp Hahn (h...@univention.de) wrote:
> Hello,
> 
> Am 17.06.2017 um 18:51 schrieb Laszlo Ersek:
> > (I also recommend using the "vbindiff" tool for such problems, it is
> > great for picking out patterns.)
> > 
> >   ** ** ** ** ** ** ** **   8  9 ** ** ** 13 14 15
> >   -- -- -- -- -- -- -- --  -- -- -- -- -- -- -- --
> >   01 e8 00 00 00 00 00 00  8c 5e 00 00 00 10 ff f1
> > 0010  5b 78 8a 3e 00 00 00 00  00 00 00 00 00 00 00 00
> > 0020  8c 77 00 00 00 12 00 02  18 f0 00 00 00 00 00 00
> > 0030  00 1e 00 00 00 00 00 00  8c 8c 00 00 00 12 00 02
> > 0040  07 70 00 00 00 00 00 00  00 14 00 00 00 00 00 00
> > 0050  8c 9c 00 00 00 12 00 02  22 00 00 00 00 00 00 00
> > 0060  00 40 00 00 00 00 00 00  8c ac 00 00 00 10 ff f1
> > 
> >   01 e8 00 00 00 00 00 00  00 3c 00 00 00 17 00 00
> > 0010  5b 78 8a 3e 00 00 00 00  00 3c 00 00 00 07 00 00
> > 0020  8c 77 00 00 00 12 00 02  00 3c 00 00 00 07 00 00
> > 0030  00 1e 00 00 00 00 00 00  00 3c 00 00 00 17 00 00
> > 0040  07 70 00 00 00 00 00 00  00 3c 00 00 00 07 00 00
> > 0050  8c 9c 00 00 00 12 00 02  00 3c 00 00 00 07 00 00
> > 0060  00 40 00 00 00 00 00 00  00 3c 00 00 00 17 00 00
> >   -- -- -- -- -- -- -- --  -- -- -- -- -- -- -- --
> >   ** ** ** ** ** ** ** **   8  9 ** ** ** 13 14 15
> > 
> > The columns that I marked with "**" are identical between "good" and
> > "bad". (These are columns 0-7, 10-12.)
> > 
> > Column 8 is overwritten by zeros (every 16th byte).
> > 
> > Column 9 is overwritten by 0x3c (every 16th byte).
> > 
> > Column 13 is super interesting. The most significant nibble in that
> > column is not disturbed. And, in the least significant nibble, the least
> > significant three bits are turned on. Basically, the corruption could be
> > described, for this column (i.e., every 16th byte), as
> > 
> >   bad = good | 0x7
> > 
> > Column 14 is overwritten by zeros (every 16th byte).
> > 
> > Column 15 is overwritten by zeros (every 16th byte).
> > 
> > My take is that your host machine has faulty RAM. Please run memtest86+
> > or something similar.
> 
> I will do so, but for me very unlikely:
> - it never happens with BIOS, only with OVMF
> - for each test I start a new QEMU process, which should use a different
> memory region
> - it repeatedly hits e1000 or libata.ko
> 
> After updating from OVMF to 0~20161202.7bbe0b3e-1 from
> (0~20160813.de74668f-2 it has not yet happened again.
> 
> Anyway, thank you for your help.

What host CPU are you using?

Dave

> 
> Philipp
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/


Re: [PATCH v6 kernel 3/5] virtio-balloon: speed up inflate/deflate process

2017-01-20 Thread Dr. David Alan Gilbert
* Liang Li (liang.z...@intel.com) wrote:



> +static void free_extended_page_bitmap(struct virtio_balloon *vb)
> +{
> + int i, bmap_count = vb->nr_page_bmap;
> +
> + for (i = 1; i < bmap_count; i++) {
> + kfree(vb->page_bitmap[i]);
> + vb->page_bitmap[i] = NULL;
> + vb->nr_page_bmap--;
> + }
> +}
> +
> +static void kfree_page_bitmap(struct virtio_balloon *vb)
> +{
> + int i;
> +
> + for (i = 0; i < vb->nr_page_bmap; i++)
> + kfree(vb->page_bitmap[i]);
> +}

It might be worth commenting that pair of functions to make it clear
why they are so different; I guess the kfree_page_bitmap
is used just before you free the structure above it so you
don't need to keep the count/pointers updated?

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
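
The kind of clarifying comments Dave is asking for might read something like
this (a sketch of the intent only, not wording from any later revision of the
patch):

    /*
     * Free only the dynamically added bitmaps and keep page_bitmap[0]
     * allocated; the balloon keeps running, so nr_page_bmap and the
     * pointers must stay consistent.
     */
    static void free_extended_page_bitmap(struct virtio_balloon *vb)
    {
            int i, bmap_count = vb->nr_page_bmap;

            for (i = 1; i < bmap_count; i++) {
                    kfree(vb->page_bitmap[i]);
                    vb->page_bitmap[i] = NULL;
                    vb->nr_page_bmap--;
            }
    }

    /*
     * Called only from device teardown, immediately before the containing
     * virtio_balloon structure is freed, so there is no need to clear the
     * pointers or adjust nr_page_bmap.
     */
    static void kfree_page_bitmap(struct virtio_balloon *vb)
    {
            int i;

            for (i = 0; i < vb->nr_page_bmap; i++)
                    kfree(vb->page_bitmap[i]);
    }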


Re: [patch] leds: add driver for Mellanox systems leds

2016-09-07 Thread Dr. David Alan Gilbert
dev.flags = LED_CORE_SUSPENDRESUME;
> + err = devm_led_classdev_register(dev, &cpld->pled[i].cdev);
> + if (err)
> + return err;
> +
> + cpld->pled[i].param.offset = mlxcpld_led->profile[i].offset;
> + cpld->pled[i].param.mask = mlxcpld_led->profile[i].mask;
> + cpld->pled[i].param.base_color =
> + mlxcpld_led->profile[i].base_color;
> +
> + if (mlxcpld_led->profile[i].brightness)
> + mlxcpld_led_brightness(&cpld->pled[i].cdev,
> + mlxcpld_led->profile[i].brightness);
> + }
> +
> + return 0;
> +}
> +
> +static int __init mlxcpld_led_probe(struct platform_device *pdev)
> +{
> + enum mlxcpld_led_platform_types mlxcpld_led_plat =
> + mlxcpld_led_platform_check_sys_type();
> +
> + mlxcpld_led = devm_kzalloc(&pdev->dev, sizeof(*mlxcpld_led),
> +GFP_KERNEL);
> + if (!mlxcpld_led)
> + return -ENOMEM;
> +
> + mlxcpld_led->pdev = pdev;
> +
> + switch (mlxcpld_led_plat) {
> + case MLXCPLD_LED_PLATFORM_MSN2100:
> + mlxcpld_led->profile = mlxcpld_led_msn2100_profile;
> + mlxcpld_led->num_led_instances =
> + ARRAY_SIZE(mlxcpld_led_msn2100_profile);
> + break;
> +
> + default:
> + mlxcpld_led->profile = mlxcpld_led_default_profile;
> + mlxcpld_led->num_led_instances =
> + ARRAY_SIZE(mlxcpld_led_default_profile);
> + break;
> + }
> +
> + spin_lock_init(&mlxcpld_led->lock);
> +
> + return mlxcpld_led_config(&pdev->dev, mlxcpld_led);
> +}
> +
> +static struct platform_driver mlxcpld_led_driver = {
> + .driver = {
> + .name   = KBUILD_MODNAME,
> + },
> +};
> +
> +static int __init mlxcpld_led_init(void)
> +{
> + struct platform_device *pdev;
> + int err;
> +
> + pdev = platform_device_register_simple(KBUILD_MODNAME, -1, NULL, 0);
> + if (!pdev) {
> + pr_err("Device allocation failed\n");
> + return -ENOMEM;
> + }
> +
> + err = platform_driver_probe(&mlxcpld_led_driver, mlxcpld_led_probe);
> + if (err) {
> + pr_err("Probe platform driver failed\n");
> + platform_device_unregister(pdev);
> + }
> +
> + return err;
> +}
> +
> +static void __exit mlxcpld_led_exit(void)
> +{
> + platform_device_unregister(mlxcpld_led->pdev);
> + platform_driver_unregister(&mlxcpld_led_driver);
> +}
> +
> +module_init(mlxcpld_led_init);
> +module_exit(mlxcpld_led_exit);
> +
> +MODULE_AUTHOR("Vadim Pasternak (vad...@mellanox.com)");
> +MODULE_DESCRIPTION("Mellanox board LED driver");
> +MODULE_LICENSE("GPL v2");
> +MODULE_ALIAS("platform:leds_mlxcpld");
> -- 
> 2.1.4
> 
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/


Re: [CRM114spam]: Re: Odd performance results

2016-07-12 Thread Dr. David Alan Gilbert
* Peter Zijlstra (pet...@infradead.org) wrote:
> On Tue, Jul 12, 2016 at 10:49:58AM -0700, H. Peter Anvin wrote:
> > On 07/12/16 08:05, Paul E. McKenney wrote:
> > The CPU in question (and /proc/cpuinfo should show this) has four cores
> > with a total of eight threads.  The "siblings" and "cpu cores" fields in
> > /proc/cpuinfo should show the same thing.  So I am utterly confused
> > about what is unexpected here?
> 
> Typically threads are enumerated differently on Intel parts. Namely:
> 
>   cpu_id = core_id + nr_cores * smt_id
> 
> which gives, for a 4 core, 2 thread part:
> 
>   0-3: core 0-3, smt0
>   4-7: core 0-3, smt1
> 
> My Core i7-2600k for example has:
> 
> $ cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
> 0,4
> 1,5
> 2,6
> 3,7
> 0,4
> 1,5
> 2,6
> 3,7
> 
> The ordering Paul has, namely 0,1 for core0,smt{0,1} is not something
> I've ever seen on an Intel part. AMD otoh does enumerate their CMT stuff
> like what Paul has.

Paul's isn't unique:
cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
0-1
0-1
2-3
2-3

i7-3520M CPU @ 2.90GHz   (Dual core with hyperthread, Thinkpad t530, fedora 24)

Dave
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/
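
For reference, the enumeration quoted above reduces to a tiny helper (purely
illustrative):

    /* Typical Intel enumeration: cores first, then SMT siblings. */
    static inline int cpu_id_from_topology(int core_id, int smt_id, int nr_cores)
    {
            return core_id + nr_cores * smt_id;
    }

    /*
     * 4 cores, 2 threads each:
     *   core 0 -> cpus 0 and 4,  core 3 -> cpus 3 and 7
     * whereas the AMD-style layout (and the laptop above) pairs siblings
     * adjacently:
     *   core 0 -> cpus 0 and 1,  core 1 -> cpus 2 and 3
     */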


[tip:x86/urgent] x86/msr: Use the proper trace point conditional for writes

2016-06-06 Thread tip-bot for Dr. David Alan Gilbert
Commit-ID:  08dd8cd06ed95625b9e2fac43c78fcb45b7eaf94
Gitweb: http://git.kernel.org/tip/08dd8cd06ed95625b9e2fac43c78fcb45b7eaf94
Author: Dr. David Alan Gilbert <dgilb...@redhat.com>
AuthorDate: Fri, 3 Jun 2016 19:00:59 +0100
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Mon, 6 Jun 2016 15:33:39 +0200

x86/msr: Use the proper trace point conditional for writes

The msr tracing for writes is incorrectly conditional on the read trace.

Fixes: 7f47d8cc039f "x86, tracing, perf: Add trace point for MSR accesses"
Signed-off-by: Dr. David Alan Gilbert <dgilb...@redhat.com>
Cc: sta...@vger.kernel.org
Cc: a...@linux.intel.com
Link: 
http://lkml.kernel.org/r/1464976859-21850-1-git-send-email-dgilb...@redhat.com
Signed-off-by: Thomas Gleixner <t...@linutronix.de>

---
 arch/x86/include/asm/msr.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 7dc1d8f..b5fee97 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -122,7 +122,7 @@ notrace static inline void native_write_msr(unsigned int 
msr,
 "2:\n"
 _ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_wrmsr_unsafe)
 : : "c" (msr), "a"(low), "d" (high) : "memory");
-   if (msr_tracepoint_active(__tracepoint_read_msr))
+   if (msr_tracepoint_active(__tracepoint_write_msr))
do_trace_write_msr(msr, ((u64)high << 32 | low), 0);
 }
 
@@ -141,7 +141,7 @@ notrace static inline int native_write_msr_safe(unsigned 
int msr,
 : "c" (msr), "0" (low), "d" (high),
   [fault] "i" (-EIO)
 : "memory");
-   if (msr_tracepoint_active(__tracepoint_read_msr))
+   if (msr_tracepoint_active(__tracepoint_write_msr))
do_trace_write_msr(msr, ((u64)high << 32 | low), err);
return err;
 }


[PATCH] msr tracing: Fix copy paste

2016-06-03 Thread Dr. David Alan Gilbert (git)
From: "Dr. David Alan Gilbert" <dgilb...@redhat.com>

The msr tracing for writes is incorrectly conditional on the read
trace.

Fixes: 7f47d8cc
Signed-off-by: Dr. David Alan Gilbert <dgilb...@redhat.com>
---
 arch/x86/include/asm/msr.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 7dc1d8f..b5fee97 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -122,7 +122,7 @@ notrace static inline void native_write_msr(unsigned int 
msr,
 "2:\n"
 _ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_wrmsr_unsafe)
 : : "c" (msr), "a"(low), "d" (high) : "memory");
-   if (msr_tracepoint_active(__tracepoint_read_msr))
+   if (msr_tracepoint_active(__tracepoint_write_msr))
do_trace_write_msr(msr, ((u64)high << 32 | low), 0);
 }
 
@@ -141,7 +141,7 @@ notrace static inline int native_write_msr_safe(unsigned 
int msr,
 : "c" (msr), "0" (low), "d" (high),
   [fault] "i" (-EIO)
 : "memory");
-   if (msr_tracepoint_active(__tracepoint_read_msr))
+   if (msr_tracepoint_active(__tracepoint_write_msr))
do_trace_write_msr(msr, ((u64)high << 32 | low), err);
return err;
 }
-- 
2.7.4



trace name copy/paste in native_write_msr ?

2016-06-03 Thread Dr. David Alan Gilbert
Hi Andi,
  In arch/x86/include/asm/msr.h native_write_msr(_safe)
you added a trace test in 7f47d8cc, should the
tracepoint_active's be testing for __tracepoint_write_msr ?

/* Can be uninlined because referenced by paravirt */
notrace static inline void native_write_msr(unsigned int msr,
unsigned low, unsigned high)
{
asm volatile("1: wrmsr\n"
 "2:\n"
 _ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_wrmsr_unsafe)
 : : "c" (msr), "a"(low), "d" (high) : "memory");
if (msr_tracepoint_active(__tracepoint_read_msr))   <---
do_trace_write_msr(msr, ((u64)high << 32 | low), 0);
}

/* Can be uninlined because referenced by paravirt */
notrace static inline int native_write_msr_safe(unsigned int msr,
unsigned low, unsigned high)
{
int err; 
asm volatile("2: wrmsr ; xor %[err],%[err]\n"
 "1:\n\t"
 ".section .fixup,\"ax\"\n\t"
 "3:  mov %[fault],%[err] ; jmp 1b\n\t"
 ".previous\n\t"
 _ASM_EXTABLE(2b, 3b)
 : [err] "=a" (err)
 : "c" (msr), "0" (low), "d" (high),
   [fault] "i" (-EIO)
 : "memory");
    if (msr_tracepoint_active(__tracepoint_read_msr))   <---
do_trace_write_msr(msr, ((u64)high << 32 | low), err);
return err;
}

Dave

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: transparent huge pages breaks KVM on AMD.

2016-05-13 Thread Dr. David Alan Gilbert
* Marc Haber (mh+linux-ker...@zugschlus.de) wrote:
> Hi David,
> 
> On Sat, Apr 23, 2016 at 07:52:46PM +0100, Dr. David Alan Gilbert wrote:
> > Hmm, your problem does sound like bad hardware, but
> > If you've got a nice reliable crash, can you try turning transparent huge 
> > pages
> > off on the host;
> >echo never > /sys/kernel/mm/transparent_hugepage/enabled
> 
> I must have missed this hint in the middle of the "your hardware is
> bad" avalance that came over me.
> 
> I spent two weeks bisecting "good" kernels since during the repeated
> reconfigurations, transparent huge pages got turned off in kernel
> configuration. After running each kernel for 24 hours, I eventually
> ended up with a working 4.5 kernel. The configuration diff was short,
> showing transparent huge pages, and - finally - upon re-reading the
> thread I found your hint.

OK, good.  When I sent that mail I'd hit a THP bug but in a
corner of migration and at the time we didn't know why and there was
no reason to think it would cause any other symptoms, but since it was
also between 4.4 and 4.5 it did seem worth mentioning as a long shot,
but it was no more than a long shot.

> I have now the result that 4.5, 4.5.1 and 4.5.4 corrupt KVM guest
> memory reliably in the first hour of running under disk load, causing
> the VM to either drop dead in the water, or to read randomness from
> disk. Rebooting fixes the VM. This happens as soon as transparent huge
> pages are turned on in the host.
> 
> Turning off transparent huge pages by echo never >
> /sys/kernel/mm/transparent_hugepage/enabled fixes the issue even
> without rebooting the host. Start up the VM again and it works just
> fine.
> 
> Is this an issue in (a) transparent huge pages, (b) KVM or (c) qemu?
> Where should this issue be forwarded? Or do we just accept it and turn
> transparent huge pages off?

Try Andrea's fix for (a).

Dave

> 
> Greetings
> Marc
> 
> -- 
> -
> Marc Haber | "I don't trust Computers. They | Mailadresse im Header
> Leimen, Germany|  lose things."Winona Ryder | Fon: *49 6224 1600402
> Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/


Re: [PATCH kernel 0/2] speed up live migration by skipping free pages

2016-04-25 Thread Dr. David Alan Gilbert
* Michael S. Tsirkin (m...@redhat.com) wrote:
> On Mon, Apr 25, 2016 at 05:38:30PM +0530, Amit Shah wrote:
> > On (Mon) 25 Apr 2016 [14:04:06], Michael S. Tsirkin wrote:
> > > On Mon, Apr 25, 2016 at 11:36:41AM +0530, Amit Shah wrote:
> > > > On (Tue) 19 Apr 2016 [22:34:32], Liang Li wrote:
> > > > > Current QEMU live migration implementation mark all guest's RAM pages
> > > > > as dirtied in the ram bulk stage, all these pages will be processed
> > > > > and it consumes quite a lot of CPU cycles and network bandwidth.
> > > > > 
> > > > > From guest's point of view, it doesn't care about the content in free
> > > > > page. We can make use of this fact and skip processing the free
> > > > > pages, this can save a lot CPU cycles and reduce the network traffic
> > > > > significantly while speed up the live migration process obviously.
> > > > > 
> > > > > This patch set is the kernel side implementation.
> > > > > 
> > > > > The virtio-balloon driver is extended to send the free page bitmap
> > > > > from guest to QEMU.
> > > > > 
> > > > > After getting the free page bitmap, QEMU can use it to filter out
> > > > > guest's free pages. This make the live migration process much more
> > > > > efficient.
> > > > > 
> > > > > In order to skip more free pages, we add an interface to let the user
> > > > > decide whether dropping the cache in guest during live migration.
> > > > 
> > > > So if virtio-balloon is the way to go (i.e. speed is acceptable), I
> > > > just have one point then.  My main concern with using (or not using)
> > > > virtio-balloon was that a guest admin is going to disable the
> > > > virtio-balloon driver entirely because the admin won't want the guest
> > > > to give away pages to the host, esp. when the guest is to be a
> > > > high-performant one.
> > > 
> > > The result will be the reverse of high-performance.
> > > 
> > > If you don't want to inflate a balloon, don't.
> > > 
> > > If you do but guest doesn't respond to inflate requests,
> > > it's quite reasonable for host to kill it -
> > > there is no way to distinguish between that and
> > > guest being malicious.
> > 
> > With the new command I'm suggesting, the guest will let the host know
> > that it has enabled this option, and it won't free up any RAM for the
> > host.
> > 
> > Also, just because a guest doesn't release some memory (which the
> > guest owns anyway) doesn't make it malicious, and killing such guests
> > is never going to end well for that hosting provider.
> > 
> > > I don't know of management tools doing that but
> > > it's rather reasonable. What does happen is
> > > some random guest memory is pushed it out to swap,
> > > which is likely much worse than dropping unused memory
> > > by moving it into the balloon.
> > 
> > Even if the host (admin) gave a guarantee that there won't be any
> > ballooning activity involved that will slow down the guest, a guest
> > admin can be paranoid enough to disable ballooning.  If, however, this
> > is made known to the host, it's likely a win-win situation because the
> > host knows the guest needs its RAM, and the guest can still use the
> > driver to send stats which the host can use during migration for
> > speedups.
> > 
> > 
> > Amit
> 
> We'd need to understand the usecase better to design a good interface
> for this. AFAIK the normal usecase for ballooning is for
> memory overcommit: asking guest to free up memory might work
> better than swap which makes host initiate a bunch of IO.
> How is not inflating in this case a good idea?
> I'm afraid I don't understand why was inflating balloon
> requested if we do not want the guest to inflate the balloon.  What does
> "paranoid" mean in this context?  This seems to imply some kind of
> security concern. Is guest likely to need all of its memory or a
> specific portion of it? Is it likely to be a static configuration or a
> dynamic one? If dynamic, does guest also want to avoid deflating the
> balloon or only inflating it?

The semantics we want here are subtly different from ballooning;

  a) In ballooning the host is asking for the guest to give up some ram;
 that's not quite the case here - although if the guest dropped some
 caches that *might* be helpful.
  b) In ballooning we're interested in just amounts of free memory; here
 we care about _which_ pages are free.
  c) We don't want to make it hard for the guest to start using some of the
 memory again; if it was hard then we suddenly get back into a balancing
 act of needing to quantitatively talk about how much free RAM we have,
 and worry about whether we shouldn't tell the host about a small pot
 we keep back etc.

Dave

> -- 
> MST
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: Major KVM issues with kernel 4.5 on the host

2016-04-23 Thread Dr. David Alan Gilbert
* Marc Haber (mh+linux-ker...@zugschlus.de) wrote:
> On Sat, Apr 23, 2016 at 06:04:29PM +0200, Borislav Petkov wrote:
> > On Thu, Apr 21, 2016 at 10:04:33PM +0200, Marc Haber wrote:
> > > Yes, but there are two symptoms. The VM either suffers file system
> > > issues (garbage read from files, or an aborted ext4 journal and
> > > following ro remount) or it stops dead in its tracks.
> > 
> > Stops dead? What does that mean exactly? Box is wedged solid and it
> > doesn't react to any key presses?
> 
> No ping, no reaction on serial console, no reaction on virtual
> console, no syslog entries.
> 
> > Because if so, this could really be a DRAM going bad and a correctable
> > error turning into an uncorrectable. How old is the DRAM in that box?
> > Judging by your CPU, it should be a couple of years...
> 
> Uncorrectable errors would still be identified by the ECC hardware,
> and the box wouldn't be perfectly fine with an "old" kernel.

Hmm, your problem does sound like bad hardware, but
if you've got a nice reliable crash, can you try turning transparent huge pages
off on the host;
   echo never > /sys/kernel/mm/transparent_hugepage/enabled

Dave

> > > The box reports about one correctable error per week, so I probably
> > > have a faulty DIMM, but since the issue only surfaces in VMs while the
> > > host system is in perfect working order...
> > 
> > So it could be that correctable error turns into an uncorrectable one at
> > some point. But then you should be getting an exception...
> 
> Yes, that would be in the logs.
> 
> > > And yes, I am pondering to simply replace the box with an Intel CPU.
> > 
> > Your CPU is fine, from what I've seen so far.
> 
> But we still postulate that the issue does only show on older AMD
> CPUs. Otherwise, I wouldn't be the only one making this experience.
> 
> > > I go the way of Debian packages since it is easier to handle the
> > > crypto file systems when the machine is booting up.
> > 
> > As long as you're testing the correct bisection kernels...
> 
> I am reasonably sure about that, yes.
> 
> > > And yes, I think about doing a test reinstall on unencrypted disk to
> > > find out whether encryption plays a role, but I currently need the
> > > machine to urgently to take it out of serice for half a month, and,
> > > again, the host system is in perfect working order, it is just VMs
> > > that barf.
> > 
> > Yeah, I can't reproduce it here and I have a very similar box to yours
> > which is otherwise idle, more or less.
> > 
> > Another fact which points to potentially DIMM going bad...
> 
> Do you want me to memtest for 24 hours?
> 
> Greetings
> Marc
> 
> -- 
> ---------
> Marc Haber | "I don't trust Computers. They | Mailadresse im Header
> Leimen, Germany|  lose things."Winona Ryder | Fon: *49 6224 1600402
> Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/


Re: [PATCH kernel 1/2] mm: add the related functions to build the free page bitmap

2016-04-22 Thread Dr. David Alan Gilbert
* Michael S. Tsirkin (m...@redhat.com) wrote:
> On Tue, Apr 19, 2016 at 03:02:09PM +, Li, Liang Z wrote:
> > > On Tue, 2016-04-19 at 22:34 +0800, Liang Li wrote:
> > > > The free page bitmap will be sent to QEMU through virtio interface and
> > > > used for live migration optimization.
> > > > Drop the cache before building the free page bitmap can get more free
> > > > pages. Whether dropping the cache is decided by user.
> > > >
> > > 
> > > How do you prevent the guest from using those recently-freed pages for
> > > something else, between when you build the bitmap and the live migration
> > > completes?
> > 
> > Because the dirty page logging is enabled before building the bitmap, there 
> > is no need
> > to prevent the guest from using the recently-freed pages ...
> > 
> > Liang
> 
> Well one point of telling host that page is free is so that
> it can mark it clean even if it was dirty previously.
> So I think you must pass the pages to guest under the lock.
> This will allow host optimizations such as marking these
> pages MADV_DONTNEED or MADV_FREE.
> Otherwise it's all too tied up to a specific usecase -
> you aren't telling host that a page is free, you are telling it
> that a page was free in the past.

But doing it under lock sounds pretty expensive, especially given
how long the userspace side is going to take to work through the bitmap
and decide what to do.
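
For concreteness, the host-side optimisation Michael mentions boils down to something like the following sketch (host_addr and len are hypothetical values describing where the guest range is mapped in the host process; this is not QEMU's actual balloon code):

#include <stddef.h>
#include <sys/mman.h>

/* Sketch only: once the host knows a guest-physical range is genuinely
 * free, it can drop the backing pages.  For anonymous memory the next
 * guest access reads zeros, which is why this is only safe for pages
 * the guest has really given up. */
static int drop_free_range(void *host_addr, size_t len)
{
    return madvise(host_addr, len, MADV_DONTNEED);
}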

Dave

> 
> -- 
> MST
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-15 Thread Dr. David Alan Gilbert
* Li, Liang Z (liang.z...@intel.com) wrote:
> > On Mon, Mar 14, 2016 at 05:03:34PM +0000, Dr. David Alan Gilbert wrote:
> > > * Li, Liang Z (liang.z...@intel.com) wrote:
> > > > >
> > > > > Hi,
> > > > >   I'm just catching back up on this thread; so without reference
> > > > > to any particular previous mail in the thread.
> > > > >
> > > > >   1) How many of the free pages do we tell the host about?
> > > > >  Your main change is telling the host about all the
> > > > >  free pages.
> > > >
> > > > Yes, all the guest's free pages.
> > > >
> > > > >  If we tell the host about all the free pages, then we might
> > > > >  end up needing to allocate more pages and update the host
> > > > >  with pages we now want to use; that would have to wait for the
> > > > >  host to acknowledge that use of these pages, since if we don't
> > > > >  wait for it then it might have skipped migrating a page we
> > > > >  just started using (I don't understand how your series solves 
> > > > > that).
> > > > >  So the guest probably needs to keep some free pages - how many?
> > > >
> > > > Actually, there is no need to care about whether the free pages will be
> > used by the host.
> > > > We only care about some of the free pages we get reused by the guest,
> > right?
> > > >
> > > > The dirty page logging can be used to solve this, starting the dirty
> > > > page logging before getting the free pages informant from guest.
> > > > Even some of the free pages are modified by the guest during the
> > > > process of getting the free pages information, these modified pages will
> > be traced by the dirty page logging mechanism. So in the following
> > migration_bitmap_sync() function.
> > > > The pages in the free pages bitmap, but latter was modified, will be
> > > > reset to dirty. We won't omit any dirtied pages.
> > > >
> > > > So, guest doesn't need to keep any free pages.
> > >
> > > OK, yes, that works; so we do:
> > >   * enable dirty logging
> > >   * ask guest for free pages
> > >   * initialise the migration bitmap as everything-free
> > >   * then later we do the normal sync-dirty bitmap stuff and it all just 
> > > works.
> > >
> > > That's nice and simple.
> > 
> > This works once, sure. But there's an issue is that you have to defer 
> > migration
> > until you get the free page list, and this only works once. So you end up 
> > with
> > heuristics about how long to wait.
> > 
> > Instead I propose:
> > 
> > - mark all pages dirty as we do now.
> > 
> > - at start of migration, start tracking dirty
> >   pages in kvm, and tell guest to start tracking free pages
> > 
> > we can now introduce any kind of delay, for example wait for ack from guest,
> > or do whatever else, or even just start migrating pages
> > 
> > - repeatedly:
> > - get list of free pages from guest
> > - clear them in migration bitmap
> > - get dirty list from kvm
> > 
> > - at end of migration, stop tracking writes in kvm,
> >   and tell guest to stop tracking free pages
> 
> I had thought of filtering out the free pages in each migration bitmap 
> synchronization. 
> The advantage is we can skip process as many free pages as possible. Not just 
> once.
> The disadvantage is that we should change the current memory management code 
> to track the free pages,
> instead of traversing the free page list to construct the free pages bitmap, 
> to reduce the overhead to get the free pages bitmap.
> I am not sure the if the Kernel people would like it.
> 
> If keeping the traversing mechanism, because of the overhead, maybe it's not 
> worth to filter out the free pages repeatedly.

Well, Michael's idea of not waiting for the dirty
bitmap to be filled does make that idea of constantly
using the free-bitmap better.
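
As a rough sketch of that "clear them in the migration bitmap" step (names such as migration_bitmap and free_bitmap are placeholders, not QEMU's real symbols):

/* Placeholder sketch: clear guest-reported free pages out of the
 * migration dirty bitmap.  Any page the guest dirties afterwards is
 * re-marked by the normal dirty-log sync, so nothing is lost. */
static void filter_out_free_pages(unsigned long *migration_bitmap,
                                  const unsigned long *free_bitmap,
                                  unsigned long nr_pages)
{
    const unsigned long bits_per_long = 8 * sizeof(unsigned long);
    unsigned long words = (nr_pages + bits_per_long - 1) / bits_per_long;

    for (unsigned long i = 0; i < words; i++)
        migration_bitmap[i] &= ~free_bitmap[i];
}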

In that case, is it easier if something (guest/host?)
allocates some memory in the guest's physical RAM space
and just points the host to it, rather than having an 
explicit 'send'.

Dave

> Liang
> 
> 
> 
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-14 Thread Dr. David Alan Gilbert
> > of using pagemap to spot zero pages that would help find other zero'd
> > pages, but perhaps ballooned is enough?
> > 
> Could you describe your ideal with more details?

At the moment the migration code spends a fair amount of time checking if a page
is zero; I was thinking perhaps the qemu could just open /proc/self/pagemap
and check if the page was mapped; that would seem cheap if we're checking big
ranges; and that would find all the balloon pages.
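
A minimal sketch of that pagemap check (error handling trimmed; pagemap_fd is assumed to be /proc/self/pagemap opened read-only, and the helper name is made up):

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch: each virtual page has a 64-bit pagemap entry; bit 63 set means
 * present in RAM, bit 62 means swapped.  A page with neither bit set has
 * no backing (never touched, or discarded), so it could be treated as zero. */
static int page_has_no_backing(int pagemap_fd, uintptr_t vaddr, long page_size)
{
    uint64_t entry;
    off_t off = (off_t)(vaddr / page_size) * sizeof(entry);

    if (pread(pagemap_fd, &entry, sizeof(entry), off) != sizeof(entry))
        return 0;                       /* be conservative on error */

    return !(entry & (1ULL << 63)) && !(entry & (1ULL << 62));
}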

> >   5) Second-migrate
> > Given a VM where you've done all those tricks on, what happens when
> > you migrate it a second time?   I guess you're aiming for the guest
> > to update it's bitmap;  HPe's solution is to migrate it's balloon
> > bitmap along with the migration data.
> 
> Nothing is special in the second migration, QEMU will request the guest for 
> free pages
> Information, and the guest will traverse it's current free page list to 
> construct a
> new free page bitmap and send it to QEMU. Just like in the first migration.

Right.

Dave

> Liang
> > 
> > Dave
> > 
> > --
> > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-10 Thread Dr. David Alan Gilbert
Hi,
  I'm just catching back up on this thread; so without reference to any
particular previous mail in the thread.

  1) How many of the free pages do we tell the host about?
 Your main change is telling the host about all the
 free pages.
 If we tell the host about all the free pages, then we might
 end up needing to allocate more pages and update the host
 with pages we now want to use; that would have to wait for the
 host to acknowledge that use of these pages, since if we don't
 wait for it then it might have skipped migrating a page we
 just started using (I don't understand how your series solves that).
 So the guest probably needs to keep some free pages - how many?

  2) Clearing out caches
 Does it make sense to clean caches?  They're apparently useful data
 so if we clean them it's likely to slow the guest down; I guess
 they're also likely to be fairly static data - so at least fairly
 easy to migrate.
 The answer here partially depends on what you want from your migration;
 if you're after the fastest possible migration time it might make
 sense to clean the caches and avoid migrating them; but that might
 be at the cost of more disruption to the guest - there's a trade off
 somewhere and it's not clear to me how you set that depending on your
 guest/network/reqirements.

  3) Why is ballooning slow?
 You've got a figure of 5s to balloon on an 8GB VM - but an 
 8GB VM isn't huge; so I worry about how long it would take
 on a big VM.   We need to understand why it's slow 
   * is it due to the guest shuffling pages around? 
   * is it due to the virtio-balloon protocol sending one page
 at a time?
 + Do balloon pages normally clump in physical memory
- i.e. would a 'large balloon' message help
- or do we need a bitmap because it tends not to clump?

   * is it due to the madvise on the host?
 If we were using the normal balloon messages, then we
 could, during migration, just route those to the migration
 code rather than bothering with the madvise.
 If they're clumping together we could just turn that into
 one big madvise; if they're not then would we benefit from
 a call that lets us madvise lots of areas?

  4) Speeding up the migration of those free pages
You're using the bitmap to avoid migrating those free pages; HPe's
patchset is reconstructing a bitmap from the balloon data;  OK, so
this all makes sense to avoid migrating them - I'd also been thinking
of using pagemap to spot zero pages that would help find other zero'd
pages, but perhaps ballooned is enough?

  5) Second-migrate
Given a VM where you've done all those tricks on, what happens when
you migrate it a second time?   I guess you're aiming for the guest
to update it's bitmap;  HPe's solution is to migrate it's balloon
bitmap along with the migration data.
 
Dave

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Dr. David Alan Gilbert
* Paolo Bonzini (pbonz...@redhat.com) wrote:
> 
> 
> On 04/03/2016 15:26, Li, Liang Z wrote:
> >> > 
> >> > The memory usage will keep increasing due to ever growing caches, etc, so
> >> > you'll be left with very little free memory fairly soon.
> >> > 
> > I don't think so.
> > 
> 
> Roman is right.  For example, here I am looking at a 64 GB (physical)
> machine which was booted about 30 minutes ago, and which is running
> disk-heavy workloads (installing VMs).
> 
> Since I have started writing this email (2 minutes?), the amount of free
> memory has already gone down from 37 GB to 33 GB.  I expect that by the
> time I have finished running the workload, in two hours, it will not
> have any free memory.

But what about a VM sitting idle, or that just has more RAM assigned to it
than it is currently using.
 I've got a host here that's been up for 46 days and has been doing some
heavy VM debugging a few days ago, but today:

# free -m
  totalusedfree  shared  buff/cache   available
Mem:  965361146   44834 184   50555   94735

I very rarely use all its RAM, so it's got a big chunk of free RAM, and yes
it's got a big chunk of cache as well.

Dave

> 
> Paolo
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Dr. David Alan Gilbert
* Roman Kagan (rka...@virtuozzo.com) wrote:
> On Fri, Mar 04, 2016 at 08:23:09AM +, Li, Liang Z wrote:
> > > On Thu, Mar 03, 2016 at 05:46:15PM +0000, Dr. David Alan Gilbert wrote:
> > > > * Liang Li (liang.z...@intel.com) wrote:
> > > > > The current QEMU live migration implementation mark the all the
> > > > > guest's RAM pages as dirtied in the ram bulk stage, all these pages
> > > > > will be processed and that takes quit a lot of CPU cycles.
> > > > >
> > > > > From guest's point of view, it doesn't care about the content in
> > > > > free pages. We can make use of this fact and skip processing the
> > > > > free pages in the ram bulk stage, it can save a lot CPU cycles and
> > > > > reduce the network traffic significantly while speed up the live
> > > > > migration process obviously.
> > > > >
> > > > > This patch set is the QEMU side implementation.
> > > > >
> > > > > The virtio-balloon is extended so that QEMU can get the free pages
> > > > > information from the guest through virtio.
> > > > >
> > > > > After getting the free pages information (a bitmap), QEMU can use it
> > > > > to filter out the guest's free pages in the ram bulk stage. This
> > > > > make the live migration process much more efficient.
> > > >
> > > > Hi,
> > > >   An interesting solution; I know a few different people have been
> > > > looking at how to speed up ballooned VM migration.
> > > >
> > > >   I wonder if it would be possible to avoid the kernel changes by
> > > > parsing /proc/self/pagemap - if that can be used to detect
> > > > unmapped/zero mapped pages in the guest ram, would it achieve the
> > > same result?
> > > 
> > > Yes I was about to suggest the same thing: it's simple and makes use of 
> > > the
> > > existing infrastructure.  And you wouldn't need to care if the pages were
> > > unmapped by ballooning or anything else (alternative balloon
> > > implementations, not yet touched by the guest, etc.).  Besides, you 
> > > wouldn't
> > > need to synchronize with the guest.
> > > 
> > > Roman.
> > 
> > The unmapped/zero mapped pages can be detected by parsing 
> > /proc/self/pagemap,
> > but the free pages can't be detected by this. Imaging an application 
> > allocates a large amount
> > of memory , after using, it frees the memory, then live migration happens. 
> > All these free pages
> > will be process and sent to the destination, it's not optimal.
> 
> First, the likelihood of such a situation is marginal, there's no point
> optimizing for it specifically.
> 
> And second, even if that happens, you inflate the balloon right before
> the migration and the free memory will get umapped very quickly, so this
> case is covered nicely by the same technique that works for more
> realistic cases, too.

Although I wonder which is cheaper; that would be fairly expensive for
the guest, wouldn't it? And you'd somehow have to kick the guest
before migration to do the ballooning - and how long would you wait
for it to finish?

Dave

> 
> Roman.
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-03 Thread Dr. David Alan Gilbert
* Liang Li (liang.z...@intel.com) wrote:
> The current QEMU live migration implementation marks all of the
> guest's RAM pages as dirty in the ram bulk stage; all these pages
> will be processed, and that takes quite a lot of CPU cycles.
> 
> From the guest's point of view, it doesn't care about the content of
> free pages. We can make use of this fact and skip processing the free
> pages in the ram bulk stage; this saves a lot of CPU cycles, reduces
> the network traffic significantly, and noticeably speeds up the live
> migration process.
> 
> This patch set is the QEMU side implementation.
> 
> The virtio-balloon is extended so that QEMU can get the free pages
> information from the guest through virtio.
> 
> After getting the free pages information (a bitmap), QEMU can use it
> to filter out the guest's free pages in the ram bulk stage. This makes
> the live migration process much more efficient.

Hi,
  An interesting solution; I know a few different people have been looking
at how to speed up ballooned VM migration.

  I wonder if it would be possible to avoid the kernel changes by
parsing /proc/self/pagemap - if that can be used to detect unmapped/zero
mapped pages in the guest ram, would it achieve the same result?
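
(Roughly the kind of check I have in mind - a minimal, untested sketch;
the helper name and the probe page are made up for illustration, and QEMU
would of course walk its guest RAM blocks rather than a local buffer.
Bit 63 of a pagemap entry is "page present", bit 62 is "page swapped".)

#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* True if the page containing 'vaddr' is backed by neither RAM nor swap,
 * i.e. it has never been touched (or was dropped by ballooning/madvise)
 * and could be skipped in the bulk stage. */
static bool page_is_unbacked(int pagemap_fd, uintptr_t vaddr)
{
    uint64_t entry;
    long page_size = sysconf(_SC_PAGESIZE);
    off_t offset = (off_t)(vaddr / page_size) * sizeof(entry);

    if (pread(pagemap_fd, &entry, sizeof(entry), offset) !=
        (ssize_t)sizeof(entry))
        return false;             /* play it safe: treat as backed */

    /* bit 63 = present in RAM, bit 62 = swapped; neither => nothing to send */
    return !(entry & (1ULL << 63)) && !(entry & (1ULL << 62));
}

int main(void)
{
    static char probe[4096];      /* stand-in for a page of guest RAM */
    int fd = open("/proc/self/pagemap", O_RDONLY);

    if (fd < 0)
        return 1;
    printf("probe page backed? %s\n",
           page_is_unbacked(fd, (uintptr_t)probe) ? "no" : "yes");
    close(fd);
    return 0;
}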

> This RFC version doesn't take post-copy and RDMA into
> consideration; maybe both of them can benefit from this PV solution
> with some extra modifications.

For postcopy to be safe, you would still need to send a message to the
destination telling it that there were zero pages; otherwise the destination
can't tell whether it's supposed to request the page from the source or
treat the page as zero.
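
(Purely to illustrate the distinction - the message type and helper below
are hypothetical, not QEMU's actual migration stream format.)

/*
 * Hypothetical destination-side handling during postcopy.  The point is
 * only that "this page is zero" has to arrive as an explicit notification;
 * a page the destination has heard nothing about must be requested from
 * the source, it can't silently be assumed to be zero.
 */
#include <stddef.h>
#include <string.h>

enum ram_page_msg { RAM_PAGE_DATA, RAM_PAGE_ZERO };

void incoming_ram_page(enum ram_page_msg type, void *host_addr,
                       const void *data, size_t len)
{
    switch (type) {
    case RAM_PAGE_DATA:
        memcpy(host_addr, data, len);   /* normal page contents */
        break;
    case RAM_PAGE_ZERO:
        memset(host_addr, 0, len);      /* known zero: mark it received so
                                         * postcopy never requests it */
        break;
    }
}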

Dave

> 
> Performance data
> 
> 
> Test environment:
> 
> CPU: Intel (R) Xeon(R) CPU ES-2699 v3 @ 2.30GHz
> Host RAM: 64GB
> Host Linux Kernel:  4.2.0   Host OS: CentOS 7.1
> Guest Linux Kernel:  4.5.rc6   Guest OS: CentOS 6.6
> Network:  X540-AT2 with 10 Gigabit connection
> Guest RAM: 8GB
> 
> Case 1: Idle guest just boots:
> 
>                     | original |     pv
> --------------------+----------+--------
> total time(ms)      |     1894 |    421
> transferred ram(KB) |   398017 | 353242
> 
> 
> Case 2: The guest has run some memory-consuming workload; the
> workload is terminated just before live migration.
> 
>                     | original |     pv
> --------------------+----------+--------
> total time(ms)      |     7436 |    552
> transferred ram(KB) |  8146291 | 361375
> 
> 
> Liang Li (4):
>   pc: Add code to get the lowmem form PCMachineState
>   virtio-balloon: Add a new feature to balloon device
>   migration: not set migration bitmap in setup stage
>   migration: filter out guest's free pages in ram bulk stage
> 
>  balloon.c                                       | 30 -
>  hw/i386/pc.c                                    |  5 ++
>  hw/i386/pc_piix.c                               |  1 +
>  hw/i386/pc_q35.c                                |  1 +
>  hw/virtio/virtio-balloon.c                      | 81 -
>  include/hw/i386/pc.h                            |  3 +-
>  include/hw/virtio/virtio-balloon.h              | 17 +-
>  include/standard-headers/linux/virtio_balloon.h |  1 +
>  include/sysemu/balloon.h                        | 10 ++-
>  migration/ram.c                                 | 64 +++
>  10 files changed, 195 insertions(+), 18 deletions(-)
> 
> -- 
> 1.8.3.1
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking

2016-01-05 Thread Dr. David Alan Gilbert
* Michael S. Tsirkin (m...@redhat.com) wrote:
> On Tue, Jan 05, 2016 at 10:45:25AM +0000, Dr. David Alan Gilbert wrote:
> > * Michael S. Tsirkin (m...@redhat.com) wrote:
> > > On Tue, Jan 05, 2016 at 10:01:04AM +0000, Dr. David Alan Gilbert wrote:
> > > > * Michael S. Tsirkin (m...@redhat.com) wrote:
> > > > > On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > > > > > >> The two mechanisms referenced above would likely require 
> > > > > > >> coordination with
> > > > > > >> QEMU and as such are open to discussion.  I haven't attempted to 
> > > > > > >> address
> > > > > > >> them as I am not sure there is a consensus as of yet.  My 
> > > > > > >> personal
> > > > > > >> preference would be to add a vendor-specific configuration block 
> > > > > > >> to the
> > > > > > >> emulated pci-bridge interfaces created by QEMU that would allow 
> > > > > > >> us to
> > > > > > >> essentially extend shpc to support guest live migration with 
> > > > > > >> pass-through
> > > > > > >> devices.
> > > > > > >
> > > > > > > shpc?
> > > > > > 
> > > > > > That is kind of what I was thinking.  We basically need some 
> > > > > > mechanism
> > > > > > to allow for the host to ask the device to quiesce.  It has been
> > > > > > proposed to possibly even look at something like an ACPI interface
> > > > > > since I know ACPI is used by QEMU to manage hot-plug in the standard
> > > > > > case.
> > > > > > 
> > > > > > - Alex
> > > > > 
> > > > > 
> > > > > Start by using hot-unplug for this!
> > > > > 
> > > > > Really use your patch guest side, and write host side
> > > > > to allow starting migration with the device, but
> > > > > defer completing it.
> > > > > 
> > > > > So
> > > > > 
> > > > > 1.- host tells guest to start tracking memory writes
> > > > > 2.- guest acks
> > > > > 3.- migration starts
> > > > > 4.- most memory is migrated
> > > > > 5.- host tells guest to eject device
> > > > > 6.- guest acks
> > > > > 7.- stop vm and migrate rest of state
> > > > > 
> > > > > 
> > > > > It will already be a win since hot unplug after migration starts and
> > > > > most memory has been migrated is better than hot unplug before 
> > > > > migration
> > > > > starts.
> > > > > 
> > > > > Then measure downtime and profile. Then we can look at ways
> > > > > to quiesce device faster which really means step 5 is replaced
> > > > > with "host tells guest to quiesce device and dirty (or just unmap!)
> > > > > all memory mapped for write by device".
> > > > 
> > > > 
> > > > Doing a hot-unplug is going to upset the guests network stacks view
> > > > of the world; that's something we don't want to change.
> > > > 
> > > > Dave
> > > 
> > > It might but if you store the IP and restore it quickly
> > > after migration e.g. using guest agent, as opposed to DHCP,
> > > then it won't.
> > 
> > I thought if you hot-unplug then it will lose any outstanding connections
> > on that device.
> 
> Which connections and which device?  TCP connections and an ethernet
> device?  These are on different layers so of course you don't lose them.
> Just do not change the IP address.
> 
> Some guests send a signal to applications to close connections
> when all links go down. One can work around this
> in a variety of ways.

So, OK, I was surprised that a simple connection didn't go down when
I tested and just removed the network card; I'd thought stuff was more
aggressive when there was no route.
But as you say, some stuff does close connections when the links go down/away,
so we do need to work around that; and any new outgoing connections get
a 'no route to host'.  So I'm still nervous about what will break.

Dave

> 
> > > It allows calming the device down in a generic way,
> > > specific drivers can then implement the fast quiesce.
> > 
> > Except that if it breaks the guest networking it's useless.
> > 
> > Dave
> > 
> > > 
> > > > > 
> > > > > -- 
> > > > > MST
> > > > --
> > > > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> > --
> > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking

2016-01-05 Thread Dr. David Alan Gilbert
* Michael S. Tsirkin (m...@redhat.com) wrote:
> On Tue, Jan 05, 2016 at 10:45:25AM +0000, Dr. David Alan Gilbert wrote:
> > * Michael S. Tsirkin (m...@redhat.com) wrote:
> > > On Tue, Jan 05, 2016 at 10:01:04AM +0000, Dr. David Alan Gilbert wrote:
> > > > * Michael S. Tsirkin (m...@redhat.com) wrote:
> > > > > On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > > > > > >> The two mechanisms referenced above would likely require 
> > > > > > >> coordination with
> > > > > > >> QEMU and as such are open to discussion.  I haven't attempted to 
> > > > > > >> address
> > > > > > >> them as I am not sure there is a consensus as of yet.  My 
> > > > > > >> personal
> > > > > > >> preference would be to add a vendor-specific configuration block 
> > > > > > >> to the
> > > > > > >> emulated pci-bridge interfaces created by QEMU that would allow 
> > > > > > >> us to
> > > > > > >> essentially extend shpc to support guest live migration with 
> > > > > > >> pass-through
> > > > > > >> devices.
> > > > > > >
> > > > > > > shpc?
> > > > > > 
> > > > > > That is kind of what I was thinking.  We basically need some 
> > > > > > mechanism
> > > > > > to allow for the host to ask the device to quiesce.  It has been
> > > > > > proposed to possibly even look at something like an ACPI interface
> > > > > > since I know ACPI is used by QEMU to manage hot-plug in the standard
> > > > > > case.
> > > > > > 
> > > > > > - Alex
> > > > > 
> > > > > 
> > > > > Start by using hot-unplug for this!
> > > > > 
> > > > > Really use your patch guest side, and write host side
> > > > > to allow starting migration with the device, but
> > > > > defer completing it.
> > > > > 
> > > > > So
> > > > > 
> > > > > 1.- host tells guest to start tracking memory writes
> > > > > 2.- guest acks
> > > > > 3.- migration starts
> > > > > 4.- most memory is migrated
> > > > > 5.- host tells guest to eject device
> > > > > 6.- guest acks
> > > > > 7.- stop vm and migrate rest of state
> > > > > 
> > > > > 
> > > > > It will already be a win since hot unplug after migration starts and
> > > > > most memory has been migrated is better than hot unplug before 
> > > > > migration
> > > > > starts.
> > > > > 
> > > > > Then measure downtime and profile. Then we can look at ways
> > > > > to quiesce device faster which really means step 5 is replaced
> > > > > with "host tells guest to quiesce device and dirty (or just unmap!)
> > > > > all memory mapped for write by device".
> > > > 
> > > > 
> > > > Doing a hot-unplug is going to upset the guests network stacks view
> > > > of the world; that's something we don't want to change.
> > > > 
> > > > Dave
> > > 
> > > It might but if you store the IP and restore it quickly
> > > after migration e.g. using guest agent, as opposed to DHCP,
> > > then it won't.
> > 
> > I thought if you hot-unplug then it will lose any outstanding connections
> > on that device.
> > 
> > > It allows calming the device down in a generic way,
> > > specific drivers can then implement the fast quiesce.
> > 
> > Except that if it breaks the guest networking it's useless.
> > 
> > Dave
> 
> Is hot unplug useless then?

As a migration hack, yes, unless it's paired with a second network device
as a redundant route.
To do what's being suggested here, it's got to be done at the device level
and not be visible to the networking stack.

Dave

> 
> > > 
> > > > > 
> > > > > -- 
> > > > > MST
> > > > --
> > > > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> > --
> > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking

2016-01-05 Thread Dr. David Alan Gilbert
* Michael S. Tsirkin (m...@redhat.com) wrote:
> On Tue, Jan 05, 2016 at 10:01:04AM +0000, Dr. David Alan Gilbert wrote:
> > * Michael S. Tsirkin (m...@redhat.com) wrote:
> > > On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > > > >> The two mechanisms referenced above would likely require 
> > > > >> coordination with
> > > > >> QEMU and as such are open to discussion.  I haven't attempted to 
> > > > >> address
> > > > >> them as I am not sure there is a consensus as of yet.  My personal
> > > > >> preference would be to add a vendor-specific configuration block to 
> > > > >> the
> > > > >> emulated pci-bridge interfaces created by QEMU that would allow us to
> > > > >> essentially extend shpc to support guest live migration with 
> > > > >> pass-through
> > > > >> devices.
> > > > >
> > > > > shpc?
> > > > 
> > > > That is kind of what I was thinking.  We basically need some mechanism
> > > > to allow for the host to ask the device to quiesce.  It has been
> > > > proposed to possibly even look at something like an ACPI interface
> > > > since I know ACPI is used by QEMU to manage hot-plug in the standard
> > > > case.
> > > > 
> > > > - Alex
> > > 
> > > 
> > > Start by using hot-unplug for this!
> > > 
> > > Really use your patch guest side, and write host side
> > > to allow starting migration with the device, but
> > > defer completing it.
> > > 
> > > So
> > > 
> > > 1.- host tells guest to start tracking memory writes
> > > 2.- guest acks
> > > 3.- migration starts
> > > 4.- most memory is migrated
> > > 5.- host tells guest to eject device
> > > 6.- guest acks
> > > 7.- stop vm and migrate rest of state
> > > 
> > > 
> > > It will already be a win since hot unplug after migration starts and
> > > most memory has been migrated is better than hot unplug before migration
> > > starts.
> > > 
> > > Then measure downtime and profile. Then we can look at ways
> > > to quiesce device faster which really means step 5 is replaced
> > > with "host tells guest to quiesce device and dirty (or just unmap!)
> > > all memory mapped for write by device".
> > 
> > 
> > Doing a hot-unplug is going to upset the guests network stacks view
> > of the world; that's something we don't want to change.
> > 
> > Dave
> 
> It might but if you store the IP and restore it quickly
> after migration e.g. using guest agent, as opposed to DHCP,
> then it won't.

I thought if you hot-unplug then it will lose any outstanding connections
on that device.

> It allows calming the device down in a generic way,
> specific drivers can then implement the fast quiesce.

Except that if it breaks the guest networking it's useless.

Dave

> 
> > > 
> > > -- 
> > > MST
> > --
> > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking

2016-01-05 Thread Dr. David Alan Gilbert
* Michael S. Tsirkin (m...@redhat.com) wrote:
> On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > >> The two mechanisms referenced above would likely require coordination 
> > >> with
> > >> QEMU and as such are open to discussion.  I haven't attempted to address
> > >> them as I am not sure there is a consensus as of yet.  My personal
> > >> preference would be to add a vendor-specific configuration block to the
> > >> emulated pci-bridge interfaces created by QEMU that would allow us to
> > >> essentially extend shpc to support guest live migration with pass-through
> > >> devices.
> > >
> > > shpc?
> > 
> > That is kind of what I was thinking.  We basically need some mechanism
> > to allow for the host to ask the device to quiesce.  It has been
> > proposed to possibly even look at something like an ACPI interface
> > since I know ACPI is used by QEMU to manage hot-plug in the standard
> > case.
> > 
> > - Alex
> 
> 
> Start by using hot-unplug for this!
> 
> Really use your patch guest side, and write host side
> to allow starting migration with the device, but
> defer completing it.
> 
> So
> 
> 1.- host tells guest to start tracking memory writes
> 2.- guest acks
> 3.- migration starts
> 4.- most memory is migrated
> 5.- host tells guest to eject device
> 6.- guest acks
> 7.- stop vm and migrate rest of state
> 
> 
> It will already be a win since hot unplug after migration starts and
> most memory has been migrated is better than hot unplug before migration
> starts.
> 
> Then measure downtime and profile. Then we can look at ways
> to quiesce device faster which really means step 5 is replaced
> with "host tells guest to quiesce device and dirty (or just unmap!)
> all memory mapped for write by device".


Doing a hot-unplug is going to upset the guest's network stack's view
of the world; that's something we don't want to change.

Dave

> 
> -- 
> MST
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

