Re: [PATCH kernel] vfio: Add explicit alignments in vfio_iommu_spapr_tce_create

2015-12-20 Thread David Gibson
On Fri, Dec 18, 2015 at 12:35:47PM +1100, Alexey Kardashevskiy wrote:
> The vfio_iommu_spapr_tce_create struct has four 32-bit and two 64-bit
> fields, which should have resulted in sizeof(struct
> vfio_iommu_spapr_tce_create) being 32 bytes. However, due to gcc's
> default alignment, the actual size of this struct is 40 bytes.
> 
> This fills the gaps with explicit __resv1/__resv2 fields.
> 
> This should not cause any change in behavior.
> 
> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>

Oops, that was a bit sloppy.  Oh well.

Acked-by: David Gibson <da...@gibson.dropbear.id.au>

> ---
>  include/uapi/linux/vfio.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9fd7b5d..d117233 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -568,8 +568,10 @@ struct vfio_iommu_spapr_tce_create {
>   __u32 flags;
>   /* in */
>   __u32 page_shift;
> + __u32 __resv1;
>   __u64 window_size;
>   __u32 levels;
> + __u32 __resv2;
>   /* out */
>   __u64 start_addr;
>  };
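For reference, a stand-alone sketch (not part of the patch; it assumes
the struct starts with the usual VFIO argsz/flags pair, giving the four
32-bit and two 64-bit fields the commit message counts) of why the
implicit padding appears and why the size stays at 40 bytes:

#include <stdint.h>
#include <stdio.h>

/* gcc pads each u64 member to 8-byte alignment, so even without
 * __resv1/__resv2 this struct is 40 bytes, not 32; the reserved
 * fields just make the padding explicit and reusable later. */
struct tce_create {
	uint32_t argsz;
	uint32_t flags;
	uint32_t page_shift;
	uint32_t __resv1;
	uint64_t window_size;
	uint32_t levels;
	uint32_t __resv2;
	uint64_t start_addr;
};

int main(void)
{
	printf("%zu\n", sizeof(struct tce_create)); /* prints 40 */
	return 0;
}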

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH v2] KVM: PPC: Exit guest upon fatal machine check exception

2015-12-16 Thread David Gibson
 {
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b98889e..f43c124 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -147,7 +147,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>   addi    r1, r1, 112
>   ld  r7, HSTATE_HOST_MSR(r13)
>

Seems like the comment a little above this should be updated to
reflect the fact that this path no longer handles machine checks.

Apart from that and the access width bug Thomas spotted, it looks ok
to me.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH kernel 5/9] KVM: PPC: Account TCE-containing pages in locked_vm

2015-12-08 Thread David Gibson
On Tue, Sep 15, 2015 at 08:49:35PM +1000, Alexey Kardashevskiy wrote:
> At the moment pages used for TCE tables (in addition to pages addressed
> by TCEs) are not counted in locked_vm counter so a malicious userspace
> tool can call ioctl(KVM_CREATE_SPAPR_TCE) as many times as RLIMIT_NOFILE allows and
> lock a lot of memory.
> 
> This adds counting for pages used for TCE tables.
> 
> This counts the number of pages required for a table plus pages for
> the kvmppc_spapr_tce_table struct (TCE table descriptor) itself.

Hmm.  Does it make sense to account for the descriptor struct itself?
I mean there are lots of little structures the kernel will allocate on
a process's behalf, and I don't think most of them get accounted
against locked vm.

> This does not change the amount of (de)allocated memory.
> 
> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
> ---
>  arch/powerpc/kvm/book3s_64_vio.c | 51 
> +++-
>  1 file changed, 50 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
> b/arch/powerpc/kvm/book3s_64_vio.c
> index 9526c34..b70787d 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -45,13 +45,56 @@ static long kvmppc_stt_npages(unsigned long window_size)
>  return ALIGN((window_size >> SPAPR_TCE_SHIFT)
>   * sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
>  }
>  
> +static long kvmppc_account_memlimit(long npages, bool inc)
> +{
> + long ret = 0;
> + const long bytes = sizeof(struct kvmppc_spapr_tce_table) +
> + (abs(npages) * sizeof(struct page *));
> + const long stt_pages = ALIGN(bytes, PAGE_SIZE) / PAGE_SIZE;

Overflow checks might be useful here, I'm not sure.
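Something along these lines (an untested sketch, reusing the names from
the hunk above) would catch the multiplication wrapping:

	/* sketch: reject page counts whose bookkeeping would overflow */
	if (abs(npages) > (LONG_MAX - (long)sizeof(struct kvmppc_spapr_tce_table)) /
			(long)sizeof(struct page *))
		return -EINVAL;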

> +
> + if (!current || !current->mm)
> + return ret; /* process exited */
> +
> + npages += stt_pages;
> +
> + down_write(&current->mm->mmap_sem);
> +
> + if (inc) {
> + long locked, lock_limit;
> +
> + locked = current->mm->locked_vm + npages;
> + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> + if (locked > lock_limit && !capable(CAP_IPC_LOCK))
> + ret = -ENOMEM;
> + else
> + current->mm->locked_vm += npages;
> + } else {
> + if (npages > current->mm->locked_vm)

Should this be a WARN_ON?  It means something has gone wrong
previously in the accounting, doesn't it?
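i.e. something like (sketch; WARN_ON() returns its condition, so the
clamp below it is preserved):

	if (WARN_ON(npages > current->mm->locked_vm))
		npages = current->mm->locked_vm;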

> + npages = current->mm->locked_vm;
> +
> + current->mm->locked_vm -= npages;
> + }
> +
> + pr_debug("[%d] RLIMIT_MEMLOCK KVM %c%ld %ld/%ld%s\n", current->pid,
> + inc ? '+' : '-',
> + npages << PAGE_SHIFT,
> + current->mm->locked_vm << PAGE_SHIFT,
> + rlimit(RLIMIT_MEMLOCK),
> + ret ? " - exceeded" : "");
> +
> + up_write(&current->mm->mmap_sem);
> +
> + return ret;
> +}
> +
>  static void release_spapr_tce_table(struct rcu_head *head)
>  {
>   struct kvmppc_spapr_tce_table *stt = container_of(head,
>   struct kvmppc_spapr_tce_table, rcu);
>   int i;
> + long npages = kvmppc_stt_npages(stt->window_size);
>  
> - for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
> + for (i = 0; i < npages; i++)
>   __free_page(stt->pages[i]);
>  
>   kfree(stt);
> @@ -89,6 +132,7 @@ static int kvm_spapr_tce_release(struct inode *inode, 
> struct file *filp)
>  
>   kvm_put_kvm(stt->kvm);
>  
> + kvmppc_account_memlimit(kvmppc_stt_npages(stt->window_size), false);
> + call_rcu(&stt->rcu, release_spapr_tce_table);
>  
>       return 0;
> @@ -114,6 +158,11 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>   }
>  
>   npages = kvmppc_stt_npages(args->window_size);
> + ret = kvmppc_account_memlimit(npages, true);
> + if (ret) {
> + stt = NULL;
> + goto fail;
> + }
>  
>   stt = kzalloc(sizeof(*stt) + npages * sizeof(struct page *),
> GFP_KERNEL);

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH kernel 7/9] KVM: PPC: Move reusable bits of H_PUT_TCE handler to helpers

2015-12-08 Thread David Gibson
> +	return (u64 *) page_address(page);
> +}
> +
> +/*
> + * Handles TCE requests for emulated devices.
> + * Puts guest TCE values to the table and expects user space to convert them.
> + * Called in both real and virtual modes.
> + * Cannot fail so kvmppc_tce_validate must be called before it.
> + *
> + * WARNING: This will be called in real-mode on HV KVM and virtual
> + *  mode on PR KVM
> + */
> +void kvmppc_tce_put(struct kvmppc_spapr_tce_table *stt,
> + unsigned long idx, unsigned long tce)
> +{
> + struct page *page;
> + u64 *tbl;
> +
> + page = stt->pages[idx / TCES_PER_PAGE];
> + tbl = kvmppc_page_address(page);
> +
> + tbl[idx % TCES_PER_PAGE] = tce;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_tce_put);
>  
>  /* WARNING: This will be called in real-mode on HV KVM and virtual
>   *  mode on PR KVM
> @@ -85,9 +159,6 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long 
> liobn,
>  {
>   struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
>   long ret = H_TOO_HARD;
> - unsigned long idx;
> - struct page *page;
> - u64 *tbl;
>  
>   /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>   /*  liobn, ioba, tce); */
> @@ -99,13 +170,11 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned 
> long liobn,
>   if (ret)
>   return ret;
>  
> - idx = ioba >> IOMMU_PAGE_SHIFT_4K;
> - page = stt->pages[idx / TCES_PER_PAGE];
> - tbl = (u64 *)page_address(page);
> + ret = kvmppc_tce_validate(stt, tce);
> + if (ret)
> + return ret;
>  
> - /* FIXME: Need to validate the TCE itself */
> - /* udbg_printf("tce @ %p\n", [idx % TCES_PER_PAGE]); */
> - tbl[idx % TCES_PER_PAGE] = tce;
> + kvmppc_tce_put(stt, ioba >> IOMMU_PAGE_SHIFT_4K, tce);
>  
>   return ret;
>  }

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH kernel 6/9] KVM: PPC: Replace SPAPR_TCE_SHIFT with IOMMU_PAGE_SHIFT_4K

2015-12-08 Thread David Gibson
On Tue, Sep 15, 2015 at 08:49:36PM +1000, Alexey Kardashevskiy wrote:
> SPAPR_TCE_SHIFT is used in only a few places and since IOMMU_PAGE_SHIFT_4K
> can be easily used instead, remove SPAPR_TCE_SHIFT.
> 
> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>

Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>

> ---
>  arch/powerpc/include/asm/kvm_book3s_64.h | 2 --
>  arch/powerpc/kvm/book3s_64_vio.c | 3 ++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c  | 4 ++--
>  3 files changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
> b/arch/powerpc/include/asm/kvm_book3s_64.h
> index 2aa79c8..7529aab 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> @@ -33,8 +33,6 @@ static inline void svcpu_put(struct 
> kvmppc_book3s_shadow_vcpu *svcpu)
>  }
>  #endif
>  
> -#define SPAPR_TCE_SHIFT  12
> -
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>  #define KVM_DEFAULT_HPT_ORDER  24  /* 16MB HPT by default */
>  #endif
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
> b/arch/powerpc/kvm/book3s_64_vio.c
> index b70787d..e347856 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -36,12 +36,13 @@
>  #include 
>  #include 
>  #include 
> +#include <asm/iommu.h>
>  
>  #define TCES_PER_PAGE  (PAGE_SIZE / sizeof(u64))
>  
>  static long kvmppc_stt_npages(unsigned long window_size)
>  {
> - return ALIGN((window_size >> SPAPR_TCE_SHIFT)
> + return ALIGN((window_size >> IOMMU_PAGE_SHIFT_4K)
>* sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
>  }
>  
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c 
> b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 8ae12ac..6cf1ab3 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -99,7 +99,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long 
> liobn,
>   if (ret)
>   return ret;
>  
> - idx = ioba >> SPAPR_TCE_SHIFT;
> + idx = ioba >> IOMMU_PAGE_SHIFT_4K;
>   page = stt->pages[idx / TCES_PER_PAGE];
>   tbl = (u64 *)page_address(page);
>  
> @@ -121,7 +121,7 @@ long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned 
> long liobn,
>   if (stt) {
>   ret = kvmppc_ioba_validate(stt, ioba, 1);
>   if (!ret) {
> - unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> + unsigned long idx = ioba >> IOMMU_PAGE_SHIFT_4K;
>   struct page *page = stt->pages[idx / TCES_PER_PAGE];
>   u64 *tbl = (u64 *)page_address(page);
>  

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH kernel 8/9] KVM: Fix KVM_SMI chapter number

2015-12-08 Thread David Gibson

On Tue, Sep 15, 2015 at 08:49:38PM +1000, Alexey Kardashevskiy wrote:
> The KVM_SMI capability follows the KVM_S390_SET_IRQ_STATE capability,
> which is "4.95", so this changes the number of the KVM_SMI chapter to 4.96.
> 
> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>

Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH kernel 9/9] KVM: PPC: Add support for multiple-TCE hcalls

2015-12-08 Thread David Gibson
On Tue, Sep 15, 2015 at 08:49:39PM +1000, Alexey Kardashevskiy wrote:
> This adds real and virtual mode handlers for the H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls for user space emulated devices such as IBMVIO
> devices or emulated PCI.  These calls allow adding multiple entries
> (up to 512) into the TCE table in one call which saves time on
> transition between kernel and user space.
> 
> This implements the KVM_CAP_PPC_MULTITCE capability. When present,
> the kernel will try handling H_PUT_TCE_INDIRECT and H_STUFF_TCE.
> If they can not be handled by the kernel, they are passed on to
> the user space. The user space still has to have an implementation
> for these.
> 
> Both HV and PR-style KVM are supported.
> 
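As a side note for VMM authors: probing this uses the standard
KVM_CHECK_EXTENSION mechanism. A minimal user-space sketch (error
handling elided; the capability number comes from the merged uapi
header):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	int kvm_has_multitce(void)
	{
		int kvm = open("/dev/kvm", O_RDWR);

		/* > 0: the kernel fast path handles H_PUT_TCE_INDIRECT and
		 * H_STUFF_TCE; user space still needs its own handlers for
		 * whatever the kernel punts back. */
		return ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_PPC_MULTITCE);
	}
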
> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
> ---
>  Documentation/virtual/kvm/api.txt   |  25 ++
>  arch/powerpc/include/asm/kvm_ppc.h  |  12 +++
>  arch/powerpc/kvm/book3s_64_vio.c| 111 +++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c | 145 
> ++--
>  arch/powerpc/kvm/book3s_hv.c|  26 +-
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +-
>  arch/powerpc/kvm/book3s_pr_papr.c   |  35 
>  arch/powerpc/kvm/powerpc.c  |   3 +
>  8 files changed, 350 insertions(+), 13 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index d86d831..593c62a 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -3019,6 +3019,31 @@ Returns: 0 on success, -1 on error
>  
>  Queues an SMI on the thread's vcpu.
>  
> +4.97 KVM_CAP_PPC_MULTITCE
> +
> +Capability: KVM_CAP_PPC_MULTITCE
> +Architectures: ppc
> +Type: vm
> +
> +This capability means the kernel is capable of handling hypercalls
> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
> +space. This significantly accelerates DMA operations for PPC KVM guests.
> +User space should expect that its handlers for these hypercalls
> +are not going to be called if user space previously registered LIOBN
> +in KVM (via KVM_CREATE_SPAPR_TCE or similar calls).
> +
> +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
> +user space might have to advertise it for the guest. For example,
> +IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is
> +present in the "ibm,hypertas-functions" device-tree property.
> +
> +The hypercalls mentioned above may or may not be processed successfully
> +in the kernel based fast path. If they can not be handled by the kernel,
> +they will get passed on to user space. So user space still has to have
> +an implementation for these despite the in kernel acceleration.
> +
> +This capability is always enabled.
> +
>  5. The kvm_run structure
>  
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
> b/arch/powerpc/include/asm/kvm_ppc.h
> index fcde896..e5b968e 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -166,12 +166,24 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu 
> *vcpu);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>   struct kvm_create_spapr_tce *args);
> +extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
> + struct kvm_vcpu *vcpu, unsigned long liobn);
>  extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
>   unsigned long ioba, unsigned long npages);
>  extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
>   unsigned long tce);
> +extern long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
> + unsigned long *ua, unsigned long **prmap);
> +extern void kvmppc_tce_put(struct kvmppc_spapr_tce_table *tt,
> + unsigned long idx, unsigned long tce);
>  extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>unsigned long ioba, unsigned long tce);
> +extern long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce_list, unsigned long npages);
> +extern long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce_value, unsigned long npages);
>  extern long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>unsigned long ioba);
>  extern struct page *kvm_alloc_hpt(unsigned long nr_pages);
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
> b/arch/powerpc/kvm/book3s_64_vio.c
> in

Re: [PATCH kernel 3/9] KVM: PPC: Rework H_PUT_TCE/H_GET_TCE handlers

2015-12-07 Thread David Gibson
This relies on the fact that kvmppc_ioba_validate() would have
returned H_SUCCESS some distance above.  This seems rather fragile if
you insert anything else which alters ret in between.  Since this is
the success path, I think it's clearer to explicitly return H_SUCCESS.
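i.e. a sketch of how the tail of kvmppc_h_put_tce() could read:

	tbl[idx % TCES_PER_PAGE] = tce;

	return H_SUCCESS;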

>  }
>  EXPORT_SYMBOL_GPL(kvmppc_h_put_tce);
>  
>  long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> unsigned long ioba)
>  {
> - struct kvm *kvm = vcpu->kvm;
> - struct kvmppc_spapr_tce_table *stt;
> + struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
> + long ret = H_TOO_HARD;
>  
> - list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
> - if (stt->liobn == liobn) {
> +
> + if (stt) {
> + ret = kvmppc_ioba_validate(stt, ioba, 1);
> + if (!ret) {

This relies on the fact that H_SUCCESS == 0, I'm not sure if that's
something we're already doing elsewhere or not.
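If we'd rather not rely on it, the test could name the constant
explicitly (sketch):

	ret = kvmppc_ioba_validate(stt, ioba, 1);
	if (ret == H_SUCCESS) {
		/* read the TCE as above */
	}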


>   unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> - struct page *page;
> - u64 *tbl;
> -
> - if (ioba >= stt->window_size)
> - return H_PARAMETER;
> -
> - page = stt->pages[idx / TCES_PER_PAGE];
> - tbl = (u64 *)page_address(page);
> + struct page *page = stt->pages[idx / TCES_PER_PAGE];
> + u64 *tbl = (u64 *)page_address(page);
>  
>   vcpu->arch.gpr[4] = tbl[idx % TCES_PER_PAGE];
> - return H_SUCCESS;
>   }
>   }
>  
> - /* Didn't find the liobn, punt it to userspace */
> - return H_TOO_HARD;
> +
> + return ret;
>  }
>  EXPORT_SYMBOL_GPL(kvmppc_h_get_tce);

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH kernel 1/9] rcu: Define notrace version of list_for_each_entry_rcu

2015-12-07 Thread David Gibson
On Tue, Sep 15, 2015 at 08:49:31PM +1000, Alexey Kardashevskiy wrote:
> This defines list_for_each_entry_rcu_notrace and list_entry_rcu_notrace
> which use rcu_dereference_raw_notrace instead of rcu_dereference_raw.
> This allows using list_for_each_entry_rcu_notrace in real mode (MMU is off).
> 
> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>

Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>

> ---
>  include/linux/rculist.h | 38 ++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index 17c6b1f..439c4d7 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -253,6 +253,25 @@ static inline void list_splice_init_rcu(struct list_head 
> *list,
>  })
>  
>  /**
> + * list_entry_rcu_notrace - get the struct for this entry
> + * @ptr:    the list_head pointer.
> + * @type:   the type of the struct this is embedded in.
> + * @member: the name of the list_struct within the struct.
> + *
> + * This primitive may safely run concurrently with the _rcu list-mutation
> + * primitives such as list_add_rcu() as long as it's guarded by 
> rcu_read_lock().
> + *
> + * This is the same as list_entry_rcu() except that it does
> + * not do any RCU debugging or tracing.
> + */
> +#define list_entry_rcu_notrace(ptr, type, member) \
> +({ \
> + typeof(*ptr) __rcu *__ptr = (typeof(*ptr) __rcu __force *)ptr; \
> + container_of((typeof(ptr))rcu_dereference_raw_notrace(__ptr), \
> + type, member); \
> +})
> +
> +/**
>   * Where are list_empty_rcu() and list_first_entry_rcu()?
>   *
>   * Implementing those functions following their counterparts list_empty() and
> @@ -308,6 +327,25 @@ static inline void list_splice_init_rcu(struct list_head 
> *list,
>   pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
>  
>  /**
> + * list_for_each_entry_rcu_notrace - iterate over rcu list of given type
> + * @pos: the type * to use as a loop cursor.
> + * @head:   the head for your list.
> + * @member:  the name of the list_struct within the struct.
> + *
> + * This list-traversal primitive may safely run concurrently with
> + * the _rcu list-mutation primitives such as list_add_rcu()
> + * as long as the traversal is guarded by rcu_read_lock().
> + *
> + * This is the same as list_for_each_entry_rcu() except that it does
> + * not do any RCU debugging or tracing.
> + */
> +#define list_for_each_entry_rcu_notrace(pos, head, member) \
> + for (pos = list_entry_rcu_notrace((head)->next, typeof(*pos), member); \
> +     &pos->member != (head); \
> + pos = list_entry_rcu_notrace(pos->member.next, typeof(*pos), \
> +     member))
> +
> +/**
>   * list_for_each_entry_continue_rcu - continue iteration over list of given 
> type
>   * @pos: the type * to use as a loop cursor.
>   * @head:   the head for your list.
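
For context, the new macro is used exactly like list_for_each_entry_rcu();
a sketch of the kind of real-mode lookup this series wants to do with it:

	struct kvmppc_spapr_tce_table *stt;

	list_for_each_entry_rcu_notrace(stt, &kvm->arch.spapr_tce_tables, list)
		if (stt->liobn == liobn)
			return stt;

	return NULL;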

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH kernel 4/9] KVM: PPC: Use RCU for arch.spapr_tce_tables

2015-12-07 Thread David Gibson
On Tue, Sep 15, 2015 at 08:49:34PM +1000, Alexey Kardashevskiy wrote:
> At the moment spapr_tce_tables is not protected against races. This makes
> use of RCU-variants of list helpers. As some bits are executed in real
> mode, this makes use of just introduced list_for_each_entry_rcu_notrace().
> 
> This converts release_spapr_tce_table() to a RCU scheduled handler.
> 
> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>

Looks correct, to the limits of my RCU knowledge

Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH kernel 2/9] KVM: PPC: Make real_vmalloc_addr() public

2015-12-07 Thread David Gibson
On Tue, Sep 15, 2015 at 08:49:32PM +1000, Alexey Kardashevskiy wrote:
> This helper translates vmalloc'd addresses to linear addresses.
> It is only used by the KVM MMU code now and resides in the HV KVM code.
> We will need it further in the TCE code and the DMA memory preregistration
> code called in real mode.
> 
> This makes real_vmalloc_addr() public and moves it to the powerpc code as
> it does not do anything special for KVM.
> 
> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>

Hmm, I have a couple of small concerns.

First, I'm not sure if the name is clear enough for a public function.

Second, I'm not sure if mmu-hash64.h is the right place for it.  This
is still a function with very specific and limited usage, I wonder if
we should have somewhere else for such special real mode helpers.

Paulus, thoughts?

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH] KVM: PPC: Fix emulation of H_SET_DABR/X on POWER8

2015-11-22 Thread David Gibson
On Fri, 20 Nov 2015 09:11:45 +0100
Thomas Huth <th...@redhat.com> wrote:

> In the old DABR register, the BT (Breakpoint Translation) bit
> is bit number 61. In the new DAWRX register, the WT (Watchpoint
> Translation) bit is bit number 59. So to move the DABR-BT bit
> into the position of the DAWRX-WT bit, it has to be shifted by
> two, not only by one. This fixes hardware watchpoints in gdb of
> older guests that only use the H_SET_DABR/X interface instead
> of the new H_SET_MODE interface.
> 
> Signed-off-by: Thomas Huth <th...@redhat.com>

Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>
Reviewed-by: David Gibson <dgib...@redhat.com>

> ---
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b98889e..3983b87 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -2143,7 +2143,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>  
>   /* Emulate H_SET_DABR/X on P8 for the sake of compat mode guests */
>  2:   rlwimi  r5, r4, 5, DAWRX_DR | DAWRX_DW
> - rlwimi  r5, r4, 1, DAWRX_WT
> + rlwimi  r5, r4, 2, DAWRX_WT
>   clrrdi  r4, r4, 3
>   std r4, VCPU_DAWR(r3)
>   std r5, VCPU_DAWRX(r3)
> -- 
> 1.8.3.1
> 
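For the record, the bit arithmetic (IBM numbering, bit 0 is the MSB of
the 64-bit register) checks out; a throwaway C sketch:

	/* IBM bit n of a 64-bit register has value 1ull << (63 - n) */
	unsigned long dabr_bt  = 1ull << (63 - 61);	/* BT = 0x04 */
	unsigned long dawrx_wt = 1ull << (63 - 59);	/* WT = 0x10 */
	/* dawrx_wt == dabr_bt << 2, hence a rotate by 2, not 1 */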


-- 
David Gibson <dgib...@redhat.com>
Senior Software Engineer, Virtualization, Red Hat


pgp8jxKFUmn0u.pgp
Description: OpenPGP digital signature


Re: [PATCH] KVM: PPC: Exit guest upon fatal machine check exception

2015-11-12 Thread David Gibson
On Thu, Nov 12, 2015 at 11:22:29PM +0530, Aravinda Prasad wrote:
> 
> 
> On Thursday 12 November 2015 10:13 AM, David Gibson wrote:
> > On Thu, Nov 12, 2015 at 10:02:10AM +0530, Aravinda Prasad wrote:
> >>
> >>
> >> On Thursday 12 November 2015 09:08 AM, David Gibson wrote:
> >>> On Thu, Nov 12, 2015 at 01:24:19PM +1100, Daniel Axtens wrote:
> >>>> Aravinda Prasad <aravi...@linux.vnet.ibm.com> writes:
> >>>>
> >>>>> This patch modifies KVM to cause a guest exit with
> >>>>> KVM_EXIT_NMI instead of immediately delivering a 0x200
> >>>>> interrupt to guest upon machine check exception in
> >>>>> guest address. Exiting the guest enables QEMU to build
> >>>>> error log and deliver machine check exception to guest
> >>>>> OS (either via guest OS registered machine check
> >>>>> handler or via 0x200 guest OS interrupt vector).
> >>>>>
> >>>>> This approach simplifies the delivering of machine
> >>>>> check exception to guest OS compared to the earlier approach
> >>>>> of KVM directly invoking 0x200 guest interrupt vector.
> >>>>> In the earlier approach QEMU patched the 0x200 interrupt
> >>>>> vector during boot. The patched code at 0x200 issued a
> >>>>> private hcall to pass the control to QEMU to build the
> >>>>> error log.
> >>>>>
> >>>>> This design/approach is based on the feedback for the
> >>>>> QEMU patches to handle machine check exception. Details
> >>>>> of earlier approach of handling machine check exception
> >>>>> in QEMU and related discussions can be found at:
> >>>>>
> >>>>> https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
> >>>>
> >>>> I've poked at the MCE code, but not the KVM MCE code, so I may be
> >>>> mistaken here, but I'm not clear on how this handles errors that the
> >>>> guest can recover without terminating.
> >>>>
> >>>> For example, a Linux guest can handle a UE in guest userspace by killing
> >>>> the guest process. A hypothetical non-linux guest with a microkernel
> >>>> could even survive UEs in drivers.
> >>>>
> >>>> It sounds from your patch like you're changing this behaviour. Is this
> >>>> right?
> >>>
> >>> So, IIUC.  Once the qemu pieces are in place as well it shouldn't
> >>> change this behaviour: KVM will exit to qemu, qemu will log the error
> >>> information (new), then reinject the MC to the guest which can still
> >>> handle it as you describe above.
> >>
> >> Yes. With KVM and QEMU both in place this will not change the behavior.
> >> QEMU will inject the UE to guest and the guest handles the UE based on
> >> where it occurred. For example if an UE happens in a guest process
> >> address space, that process will be killed.
> >>
> >>>
> >>> But, there could be a problem if you have a new kernel with an old
> >>> qemu, in that case qemu might not understand the new exit type and
> >>> treat it as a fatal error, even though the guest could actually cope
> >>> with it.
> >>
> >> In case of new kernel and old QEMU, the guest terminates as old QEMU
> >> does not understand the NMI exit reason. However, this is the case with
> >> old kernel and old QEMU as they do not handle UE belonging to guest. The
> >> difference is that the guest kernel terminates with different error
> >> code.
> > 
> > Ok.. assuming the guest has code to handle the UE in 0x200, why would
> > the guest terminate with old kernel and old qemu?  I haven't quite
> > followed the logic.
> 
> I overlooked it. I think I need to take into consideration whether guest
> issued "ibm,nmi-register". If the guest has issued "ibm,nmi-register"
> then we should not jump to 0x200 upon UE. With the old kernel and old
> QEMU this is broken as we always jump to 0x200.
> 
> However, if the guest has not issued "ibm,nmi-register" then upon UE we
> should jump to 0x200. If new kernel is used with old QEMU this
> functionality breaks as it causes guest to terminate with unhandled NMI
> exit.
> 
> So thinking whether qemu should explicitly enable the new NMI
> behavior.

So, I think the reasoning above tends towards having qemu control the
MC

Re: [PATCH] KVM: PPC: Exit guest upon fatal machine check exception

2015-11-11 Thread David Gibson
On Thu, Nov 12, 2015 at 10:02:10AM +0530, Aravinda Prasad wrote:
> 
> 
> On Thursday 12 November 2015 09:08 AM, David Gibson wrote:
> > On Thu, Nov 12, 2015 at 01:24:19PM +1100, Daniel Axtens wrote:
> >> Aravinda Prasad <aravi...@linux.vnet.ibm.com> writes:
> >>
> >>> This patch modifies KVM to cause a guest exit with
> >>> KVM_EXIT_NMI instead of immediately delivering a 0x200
> >>> interrupt to guest upon machine check exception in
> >>> guest address. Exiting the guest enables QEMU to build
> >>> error log and deliver machine check exception to guest
> >>> OS (either via guest OS registered machine check
> >>> handler or via 0x200 guest OS interrupt vector).
> >>>
> >>> This approach simplifies the delivering of machine
> >>> check exception to guest OS compared to the earlier approach
> >>> of KVM directly invoking 0x200 guest interrupt vector.
> >>> In the earlier approach QEMU patched the 0x200 interrupt
> >>> vector during boot. The patched code at 0x200 issued a
> >>> private hcall to pass the control to QEMU to build the
> >>> error log.
> >>>
> >>> This design/approach is based on the feedback for the
> >>> QEMU patches to handle machine check exception. Details
> >>> of earlier approach of handling machine check exception
> >>> in QEMU and related discussions can be found at:
> >>>
> >>> https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
> >>
> >> I've poked at the MCE code, but not the KVM MCE code, so I may be
> >> mistaken here, but I'm not clear on how this handles errors that the
> >> guest can recover without terminating.
> >>
> >> For example, a Linux guest can handle a UE in guest userspace by killing
> >> the guest process. A hypothetical non-linux guest with a microkernel
> >> could even survive UEs in drivers.
> >>
> >> It sounds from your patch like you're changing this behaviour. Is this
> >> right?
> > 
> > So, IIUC.  Once the qemu pieces are in place as well it shouldn't
> > change this behaviour: KVM will exit to qemu, qemu will log the error
> > information (new), then reinject the MC to the guest which can still
> > handle it as you describe above.
> 
> Yes. With KVM and QEMU both in place this will not change the behavior.
> QEMU will inject the UE to guest and the guest handles the UE based on
> where it occurred. For example if an UE happens in a guest process
> address space, that process will be killed.
> 
> > 
> > But, there could be a problem if you have a new kernel with an old
> > qemu, in that case qemu might not understand the new exit type and
> > treat it as a fatal error, even though the guest could actually cope
> > with it.
> 
> In case of new kernel and old QEMU, the guest terminates as old QEMU
> does not understand the NMI exit reason. However, this is the case with
> old kernel and old QEMU as they do not handle UE belonging to guest. The
> difference is that the guest kernel terminates with different error
> code.

Ok.. assuming the guest has code to handle the UE in 0x200, why would
the guest terminate with old kernel and old qemu?  I haven't quite
followed the logic.

> 
> old kernel and old QEMU -> guest panics [1] irrespective of where UE
>happened in guest address space.
> old kernel and new QEMU -> guest panics. same as above.
> new kernel and old QEMU -> guest terminates with unhandled NMI error
>irrespective of where UE happened in guest
> new kernel and new QEMU -> guest handles UEs in process address space
>by killing the process. guest terminates
>for UEs in guest kernel address space.
> 
> [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-June/118329.html
> 
> > 
> > Aravinda, do we need to change this so that qemu has to explicitly
> > enable the new NMI behaviour?  Or have I missed something that will
> > make that case work already.
> 
> I think we don't need to explicitly enable the new behavior. With new
> kernel and new QEMU this should just work. As mentioned above this is
> already broken for old kernel/QEMU. Any thoughts?
> 
> Regards,
> Aravinda
> 
> > 
> > 
> > 
> > ___
> > Linuxppc-dev mailing list
> > linuxppc-...@lists.ozlabs.org
> > https://lists.ozlabs.org/listinfo/linuxppc-dev
> > 
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH] KVM: PPC: Exit guest upon fatal machine check exception

2015-11-11 Thread David Gibson
et later based on trap). For handled errors
> +  * (non-fatal), go back to guest execution with current HSRR0
> +  * instead of exiting the guest. This approach will cause
> +  * the guest to exit in case of fatal machine check error.
>*/
> - ld  r10, VCPU_PC(r9)
> + bne 2f  /* Continue guest execution? */
> + /* If not, exit the guest. SRR0/1 are already set */
> + b   mc_cont
> +2:   ld  r10, VCPU_PC(r9)
>   ld  r11, VCPU_MSR(r9)
> - bne 2f  /* Continue guest execution. */
> - /* If not, deliver a machine check.  SRR0/1 are already set */
> - li  r10, BOOK3S_INTERRUPT_MACHINE_CHECK
> - ld  r11, VCPU_MSR(r9)
> - bl  kvmppc_msr_interrupt
> -2:   b   fast_interrupt_c_return
> + b   fast_interrupt_c_return
>  
>  /*
>   * Check the reason we woke from nap, and take appropriate action.
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH] KVM: PPC: Exit guest upon fatal machine check exception

2015-11-11 Thread David Gibson
On Thu, Nov 12, 2015 at 01:24:19PM +1100, Daniel Axtens wrote:
> Aravinda Prasad <aravi...@linux.vnet.ibm.com> writes:
> 
> > This patch modifies KVM to cause a guest exit with
> > KVM_EXIT_NMI instead of immediately delivering a 0x200
> > interrupt to guest upon machine check exception in
> > guest address. Exiting the guest enables QEMU to build
> > error log and deliver machine check exception to guest
> > OS (either via guest OS registered machine check
> > handler or via 0x200 guest OS interrupt vector).
> >
> > This approach simplifies the delivering of machine
> > check exception to guest OS compared to the earlier approach
> > of KVM directly invoking 0x200 guest interrupt vector.
> > In the earlier approach QEMU patched the 0x200 interrupt
> > vector during boot. The patched code at 0x200 issued a
> > private hcall to pass the control to QEMU to build the
> > error log.
> >
> > This design/approach is based on the feedback for the
> > QEMU patches to handle machine check exception. Details
> > of earlier approach of handling machine check exception
> > in QEMU and related discussions can be found at:
> >
> > https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
> 
> I've poked at the MCE code, but not the KVM MCE code, so I may be
> mistaken here, but I'm not clear on how this handles errors that the
> guest can recover without terminating.
> 
> For example, a Linux guest can handle a UE in guest userspace by killing
> the guest process. A hypothetical non-linux guest with a microkernel
> could even survive UEs in drivers.
> 
> It sounds from your patch like you're changing this behaviour. Is this
> right?

So, IIUC.  Once the qemu pieces are in place as well it shouldn't
change this behaviour: KVM will exit to qemu, qemu will log the error
information (new), then reinject the MC to the guest which can still
handle it as you describe above.
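
(For concreteness, a minimal sketch of the userspace side of that flow,
assuming the KVM_EXIT_NMI exit reason this patch adds.  The two helpers
below are hypothetical stand-ins for the QEMU pieces under discussion,
not real APIs:)

#include <linux/kvm.h>

/* Hypothetical QEMU-side helpers, declared only for this sketch. */
void log_machine_check(void *cpu);
void reinject_machine_check(void *cpu);

static int handle_exit(struct kvm_run *run, void *cpu)
{
	switch (run->exit_reason) {
	case KVM_EXIT_NMI:
		/* Build the error log in userspace ... */
		log_machine_check(cpu);
		/* ... then hand the MC back to the guest (0x200 or its
		 * registered handler), so a recoverable UE stays
		 * recoverable. */
		reinject_machine_check(cpu);
		return 0;
	default:
		return -1;
	}
}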

But, there could be a problem if you have a new kernel with an old
qemu, in that case qemu might not understand the new exit type and
treat it as a fatal error, even though the guest could actually cope
with it.

Aravinda, do we need to change this so that qemu has to explicitly
enable the new NMI behaviour?  Or have I missed something that will
make that case work already.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH] KVM: PPC: Book3S HV: Synthesize segment fault if SLB lookup fails

2015-11-03 Thread David Gibson
On Tue, Oct 27, 2015 at 04:13:56PM +1100, Paul Mackerras wrote:
> When handling a hypervisor data or instruction storage interrupt (HDSI
> or HISI), we look up the SLB entry for the address being accessed in
> order to translate the effective address to a virtual address which can
> be looked up in the guest HPT.  This lookup can occasionally fail due
> to the guest replacing an SLB entry without invalidating the evicted
> SLB entry.  In this situation an ERAT (effective to real address
> translation cache) entry can persist and be used by the hardware even
> though there is no longer a corresponding SLB entry.
> 
> Previously we would just deliver a data or instruction storage interrupt
> (DSI or ISI) to the guest in this case.  However, this is not correct
> and has been observed to cause guests to crash, typically with a
> data storage protection interrupt on a store to the vmemmap area.
> 
> Instead, what we do now is to synthesize a data or instruction segment
> interrupt.  That should cause the guest to reload an appropriate entry
> into the SLB and retry the faulting instruction.  If it still faults,
> we should find an appropriate SLB entry next time and be able to handle
> the fault.
> 
> Signed-off-by: Paul Mackerras <pau...@samba.org>

Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>
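
(As a rough C restatement of the change; every name below is an
illustrative stand-in rather than the real kvm/book3s internals, the
point is only the shape of the decision:)

struct vcpu;
struct slbe;

struct slbe *find_guest_slbe(struct vcpu *vcpu, unsigned long eaddr);
void queue_segment_interrupt(struct vcpu *vcpu, unsigned long eaddr);
int translate_and_resume(struct vcpu *vcpu, struct slbe *s,
			 unsigned long eaddr);

#define RESUME_GUEST	1

static int handle_storage_interrupt(struct vcpu *vcpu, unsigned long eaddr)
{
	struct slbe *s = find_guest_slbe(vcpu, eaddr);

	if (!s) {
		/* Stale ERAT and no SLB entry: synthesize a segment
		 * interrupt so the guest reloads its SLB and retries,
		 * instead of a DSI/ISI it has no way to act on. */
		queue_segment_interrupt(vcpu, eaddr);
		return RESUME_GUEST;
	}
	return translate_and_resume(vcpu, s, eaddr);
}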

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [Qemu-devel] [PATCH v4] ppc/spapr: Implement H_RANDOM hypercall in QEMU

2015-09-21 Thread David Gibson
On Mon, Sep 21, 2015 at 10:37:28AM +0200, Greg Kurz wrote:
> On Mon, 21 Sep 2015 10:26:52 +0200
> Thomas Huth <th...@redhat.com> wrote:
> 
> > On 21/09/15 10:01, Greg Kurz wrote:
> > > On Mon, 21 Sep 2015 12:10:00 +1000
> > > David Gibson <da...@gibson.dropbear.id.au> wrote:
> > > 
> > >> On Fri, Sep 18, 2015 at 11:05:52AM +0200, Greg Kurz wrote:
> > >>> On Thu, 17 Sep 2015 10:49:41 +0200
> > >>> Thomas Huth <th...@redhat.com> wrote:
> > >>>
> > >>>> The PAPR interface defines a hypercall to pass high-quality
> > >>>> hardware generated random numbers to guests. Recent kernels can
> > >>>> already provide this hypercall to the guest if the right hardware
> > >>>> random number generator is available. But in case the user wants
> > >>>> to use another source like EGD, or QEMU is running with an older
> > >>>> kernel, we should also have this call in QEMU, so that guests that
> > >>>> do not support virtio-rng yet can get good random numbers, too.
> > >>>>
> > >>>> This patch now adds a new pseudo-device to QEMU that either
> > >>>> directly provides this hypercall to the guest or is able to
> > >>>> enable the in-kernel hypercall if available.
> > ...
> > >>> It is a good thing that the user can choose between in-kernel and
> > >>> backend, and this patch does the work.
> > >>>
> > >>> This being said, I am not sure that the use case where a user has a
> > >>> hwrng-capable platform and wants to run guests without any hwrng
> > >>> support at all is an appropriate default behavior... I guess we will
> > >>> find more users that want in-kernel being the default if it is
> > >>> available.
> > >>>
> > >>> The patch below modifies yours to do just this: the pseudo-device is
> > >>> only created if hwrng is present and not already created.
> > >>
> > >> I have mixed feelings about this.  On the one hand, I agree that it
> > >> would be nice to allow H_RANDOM support by default.  On the other hand
> > >> the patch below leaves no way to turn it off for testing purposes.  It
> > >> also adds another place where the guest hardware depends on the host
> > >> configuration, which adds to the already substantial mess of ensuring
> > >> that source and destination hardware configuration matches for
> > >> migration.
> > > 
> > > Yeah, describing the guest hw is really essential for migration... this
> > > is best addressed at the libvirt level with a full XML description of
> > > the machine... but FWIW if we are talking about running pseries on a
> > > POWER8 or newer host, I am not aware about "hwrng-less" boards... but
> > > I am probably missing something :)
> > 
> > Maybe it would be at least ok to enable it by default as long as
> > "-nodefaults" has not been specified as command line option?

I like that in principle, but the -nodefaults option isn't exposed
outside vl.c

> It makes a lot of sense indeed. I guess David should take your patch
> as it is now and the default behavior could be a follow up.

That's the plan.  I've already  taken the base patch.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpfA9hFQZ4X9.pgp
Description: PGP signature


Re: [PATCH] KVM: PPC: Take the kvm->srcu lock in kvmppc_h_logical_ci_load/store()

2015-09-20 Thread David Gibson
On Fri, Sep 18, 2015 at 08:57:28AM +0200, Thomas Huth wrote:
> Access to the kvm->buses (like with the kvm_io_bus_read() and -write()
> functions) has to be protected via the kvm->srcu lock.
> The kvmppc_h_logical_ci_load() and -store() functions are missing
> this lock so far, so let's add it there, too.
> This fixes the problem that the kernel reports "suspicious RCU usage"
> when lock debugging is enabled.
> 
> Fixes: 99342cf8044420eebdf9297ca03a14cb6a7085a1
> Signed-off-by: Thomas Huth <th...@redhat.com>

Nice catch.  Looks like I missed this because the places
kvm_io_bus_{read,write}() are called on x86 are buried about 5 layers
below where the srcu lock is taken :/.

Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>

> ---
>  arch/powerpc/kvm/book3s.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index d75bf32..096e5eb 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -828,12 +828,15 @@ int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu)
>   unsigned long size = kvmppc_get_gpr(vcpu, 4);
>   unsigned long addr = kvmppc_get_gpr(vcpu, 5);
>   u64 buf;
> + int srcu_idx;
>   int ret;
>  
>   if (!is_power_of_2(size) || (size > sizeof(buf)))
>   return H_TOO_HARD;
>  
> + srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> + ret = kvm_io_bus_read(vcpu, KVM_MMIO_BUS, addr, size, &buf);
> + srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
>   if (ret != 0)
>   return H_TOO_HARD;
>  
> @@ -868,6 +871,7 @@ int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu)
>   unsigned long addr = kvmppc_get_gpr(vcpu, 5);
>   unsigned long val = kvmppc_get_gpr(vcpu, 6);
>   u64 buf;
> + int srcu_idx;
>   int ret;
>  
>   switch (size) {
> @@ -891,7 +895,9 @@ int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu)
>   return H_TOO_HARD;
>   }
>  
> + srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> + ret = kvm_io_bus_write(vcpu, KVM_MMIO_BUS, addr, size, &buf);
> + srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
>   if (ret != 0)
>   return H_TOO_HARD;
>  

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpllAOgqg51p.pgp
Description: PGP signature


Re: [PATCH 0/2] VFIO: Accept IOMMU group (PE) ID

2015-09-20 Thread David Gibson
On Sat, Sep 19, 2015 at 04:22:47PM +1000, David Gibson wrote:
> On Fri, Sep 18, 2015 at 09:47:32AM -0600, Alex Williamson wrote:
> > On Fri, 2015-09-18 at 16:24 +1000, Gavin Shan wrote:
> > > This allows an IOMMU group (PE) ID to be accepted in the parameter from
> > > userland when handling an EEH operation, so that the operation only
> > > affects the target IOMMU group (PE). If the IOMMU group (PE) ID in the
> > > parameter from userland is invalid, all IOMMU groups (PEs) attached to
> > > the specified container are affected as before.
> > > 
> > > Gavin Shan (2):
> > >   drivers/vfio: Support EEH API revision
> > >   drivers/vfio: Support IOMMU group for EEH operations
> > > 
> > >  drivers/vfio/vfio_iommu_spapr_tce.c | 50 ++---
> > >  drivers/vfio/vfio_spapr_eeh.c   | 46 ++
> > >  include/linux/vfio.h| 13 +++---
> > >  include/uapi/linux/vfio.h   |  6 +
> > >  4 files changed, 93 insertions(+), 22 deletions(-)
> > 
> > This interface is terrible.  A function named foo_enabled() should
> > return a bool, yes or no, don't try to overload it to also return a
> > version.
> 
> Sorry, that one's my fault.  I suggested that approach to Gavin
> without really thinking it through.
> 
> 
> > AFAICT, patch 2/2 breaks current users by changing the offset
> > of the union in struct vfio_eeh_pe_err.
> 
> Yeah, this one's ugly.  We have to preserve the offset, but that means
> putting the group in a very awkward place.  Especially since I'm not
> sure if there even are any existing users of the single extant union
> branch.
> 
> Sigh.
> 
> > Also, we generally pass group
> > file descriptors rather than a group ID because we can prove the
> > ownership of the group through the file descriptor and we don't need to
> > worry about races with the group because we can hold a reference to it.

Duh.  I finally realised the better, simpler, obvious solution.

Rather than changing the parameter structure, we should move the
ioctl()s so they're on the group fd instead of the container fd.

Obviously we need to keep it on the container fd for backwards compat,
but I think we should just error out if there is more than one group
in the container there.

We will need a new capability too, obviously.  VFIO_EEH_GROUPFD maybe?
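
(Roughly, userspace usage would then look like the sketch below.
VFIO_EEH_PE_OP, struct vfio_eeh_pe_op and VFIO_EEH_PE_RESET_HOT are the
existing uapi; issuing the ioctl on the group fd, and the
VFIO_EEH_GROUPFD extension check, are only the proposal above:)

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Sketch: hot-reset the PE backing one group, via the group fd. */
int eeh_pe_reset_hot(const char *group_path)	/* e.g. "/dev/vfio/26" */
{
	struct vfio_eeh_pe_op op = { .argsz = sizeof(op) };
	int fd = open(group_path, O_RDWR);

	if (fd < 0)
		return -1;
	/* Proposed: probe the new capability first (hypothetical name).
	 * if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_EEH_GROUPFD) <= 0)
	 *	return -1;
	 */
	op.op = VFIO_EEH_PE_RESET_HOT;
	return ioctl(fd, VFIO_EEH_PE_OP, &op);
}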

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpCxcPnyjMr0.pgp
Description: PGP signature


Re: [PATCH v4] ppc/spapr: Implement H_RANDOM hypercall in QEMU

2015-09-20 Thread David Gibson
On Thu, Sep 17, 2015 at 10:49:41AM +0200, Thomas Huth wrote:
> The PAPR interface defines a hypercall to pass high-quality
> hardware generated random numbers to guests. Recent kernels can
> already provide this hypercall to the guest if the right hardware
> random number generator is available. But in case the user wants
> to use another source like EGD, or QEMU is running with an older
> kernel, we should also have this call in QEMU, so that guests that
> do not support virtio-rng yet can get good random numbers, too.
> 
> This patch now adds a new pseudo-device to QEMU that either
> directly provides this hypercall to the guest or is able to
> enable the in-kernel hypercall if available. The in-kernel
> hypercall can be enabled with the use-kvm property, e.g.:
> 
>  qemu-system-ppc64 -device spapr-rng,use-kvm=true
> 
> For handling the hypercall in QEMU instead, a "RngBackend" is
> required since the hypercall should provide "good" random data
> instead of pseudo-random (like from a "simple" library function
> like rand() or g_random_int()). Since there are multiple RngBackends
> available, the user must select an appropriate back-end via the
> "rng" property of the device, e.g.:
> 
>  qemu-system-ppc64 -object rng-random,filename=/dev/hwrng,id=gid0 \
>-device spapr-rng,rng=gid0 ...
> 
> See http://wiki.qemu-project.org/Features-Done/VirtIORNG for
> other examples of specifying RngBackends.
> 
> Signed-off-by: Thomas Huth <th...@redhat.com>

Thanks, applied to spapr-next.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgprmHRL0m8Qe.pgp
Description: PGP signature


Re: [PATCH v4] ppc/spapr: Implement H_RANDOM hypercall in QEMU

2015-09-20 Thread David Gibson
On Fri, Sep 18, 2015 at 11:05:52AM +0200, Greg Kurz wrote:
> On Thu, 17 Sep 2015 10:49:41 +0200
> Thomas Huth <th...@redhat.com> wrote:
> 
> > The PAPR interface defines a hypercall to pass high-quality
> > hardware generated random numbers to guests. Recent kernels can
> > already provide this hypercall to the guest if the right hardware
> > random number generator is available. But in case the user wants
> > to use another source like EGD, or QEMU is running with an older
> > kernel, we should also have this call in QEMU, so that guests that
> > do not support virtio-rng yet can get good random numbers, too.
> > 
> > This patch now adds a new pseudo-device to QEMU that either
> > directly provides this hypercall to the guest or is able to
> > enable the in-kernel hypercall if available. The in-kernel
> > hypercall can be enabled with the use-kvm property, e.g.:
> > 
> >  qemu-system-ppc64 -device spapr-rng,use-kvm=true
> > 
> > For handling the hypercall in QEMU instead, a "RngBackend" is
> > required since the hypercall should provide "good" random data
> > instead of pseudo-random (like from a "simple" library function
> > like rand() or g_random_int()). Since there are multiple RngBackends
> > available, the user must select an appropriate back-end via the
> > "rng" property of the device, e.g.:
> > 
> >  qemu-system-ppc64 -object rng-random,filename=/dev/hwrng,id=gid0 \
> >-device spapr-rng,rng=gid0 ...
> > 
> > See http://wiki.qemu-project.org/Features-Done/VirtIORNG for
> > other examples of specifying RngBackends.
> > 
> > Signed-off-by: Thomas Huth <th...@redhat.com>
> > ---
> 
> It is a good thing that the user can choose between in-kernel and backend,
> and this patch does the work.
> 
> This being said, I am not sure that the use case where a user has a hwrng
> capable platform and wants to run guests without any hwrng support at all is
> an appropriate default behavior... I guess we will find more users that want
> in-kernel being the default if it is available.
> 
> The patch below modifies yours to do just this: the pseudo-device is only
> created if hwrng is present and not already created.

I have mixed feelings about this.  On the one hand, I agree that it
would be nice to allow H_RANDOM support by default.  On the other hand
the patch below leaves no way to turn it off for testing purposes.  It
also adds another place where the guest hardware depends on the host
configuration, which adds to the already substantial mess of ensuring
that source and destination hardware configuration matches for
migration.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpUiaicZ6w4V.pgp
Description: PGP signature


Re: [PATCH 0/2] VFIO: Accept IOMMU group (PE) ID

2015-09-19 Thread David Gibson
On Fri, Sep 18, 2015 at 09:47:32AM -0600, Alex Williamson wrote:
> On Fri, 2015-09-18 at 16:24 +1000, Gavin Shan wrote:
> > This allows an IOMMU group (PE) ID to be accepted in the parameter from
> > userland when handling an EEH operation, so that the operation only
> > affects the target IOMMU group (PE). If the IOMMU group (PE) ID in the
> > parameter from userland is invalid, all IOMMU groups (PEs) attached to
> > the specified container are affected as before.
> > 
> > Gavin Shan (2):
> >   drivers/vfio: Support EEH API revision
> >   drivers/vfio: Support IOMMU group for EEH operations
> > 
> >  drivers/vfio/vfio_iommu_spapr_tce.c | 50 ++---
> >  drivers/vfio/vfio_spapr_eeh.c   | 46 ++
> >  include/linux/vfio.h| 13 +++---
> >  include/uapi/linux/vfio.h   |  6 +
> >  4 files changed, 93 insertions(+), 22 deletions(-)
> 
> This interface is terrible.  A function named foo_enabled() should
> return a bool, yes or no, don't try to overload it to also return a
> version.

Sorry, that one's my fault.  I suggested that approach to Gavin
without really thinking it through.


> AFAICT, patch 2/2 breaks current users by changing the offset
> of the union in struct vfio_eeh_pe_err.

Yeah, this one's ugly.  We have to preserve the offset, but that means
putting the group in a very awkward place.  Especially since I'm not
sure if there even are any existing users of the single extant union
branch.

Sigh.

> Also, we generally pass group
> file descriptors rather than a group ID because we can prove the
> ownership of the group through the file descriptor and we don't need to
> worry about races with the group because we can hold a reference to it.
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgp3yfhiIyNQC.pgp
Description: PGP signature


Re: [PATCH v3] ppc/spapr: Implement H_RANDOM hypercall in QEMU

2015-09-14 Thread David Gibson
On Mon, Sep 14, 2015 at 08:32:36AM +0200, Thomas Huth wrote:
> On 14/09/15 04:15, David Gibson wrote:
> > On Fri, Sep 11, 2015 at 11:17:01AM +0200, Thomas Huth wrote:
> >> The PAPR interface defines a hypercall to pass high-quality
> >> hardware generated random numbers to guests. Recent kernels can
> >> already provide this hypercall to the guest if the right hardware
> >> random number generator is available. But in case the user wants
> >> to use another source like EGD, or QEMU is running with an older
> >> kernel, we should also have this call in QEMU, so that guests that
> >> do not support virtio-rng yet can get good random numbers, too.
> >>
> >> This patch now adds a new pseudo-device to QEMU that either
> >> directly provides this hypercall to the guest or is able to
> >> enable the in-kernel hypercall if available. The in-kernel
> >> hypercall can be enabled with the use-kvm property, e.g.:
> >>
> >>  qemu-system-ppc64 -device spapr-rng,use-kvm=true
> >>
> >> For handling the hypercall in QEMU instead, a RngBackend is required
> >> since the hypercall should provide "good" random data instead of
> >> pseudo-random (like from a "simple" library function like rand()
> >> or g_random_int()). Since there are multiple RngBackends available,
> >> the user must select an appropriate backend via the "backend"
> >> property of the device, e.g.:
> >>
> >>  qemu-system-ppc64 -object rng-random,filename=/dev/hwrng,id=rng0 \
> >>-device spapr-rng,backend=rng0 ...
> >>
> >> See http://wiki.qemu-project.org/Features-Done/VirtIORNG for
> >> other examples of specifying RngBackends.
> ...
> >> +
> >> +#include "qemu/error-report.h"
> >> +#include "sysemu/sysemu.h"
> >> +#include "sysemu/device_tree.h"
> >> +#include "sysemu/rng.h"
> >> +#include "hw/ppc/spapr.h"
> >> +#include "kvm_ppc.h"
> >> +
> >> +#define SPAPR_RNG(obj) \
> >> +OBJECT_CHECK(sPAPRRngState, (obj), TYPE_SPAPR_RNG)
> >> +
> >> +typedef struct sPAPRRngState {
> >> +/*< private >*/
> >> +DeviceState ds;
> >> +RngBackend *backend;
> >> +bool use_kvm;
> >> +} sPAPRRngState;
> >> +
> >> +typedef struct HRandomData {
> >> +QemuSemaphore sem;
> >> +union {
> >> +uint64_t v64;
> >> +uint8_t v8[8];
> >> +} val;
> >> +int received;
> >> +} HRandomData;
> >> +
> >> +/* Callback function for the RngBackend */
> >> +static void random_recv(void *dest, const void *src, size_t size)
> >> +{
> >> +HRandomData *hrdp = dest;
> >> +
> >> +if (src && size > 0) {
> >> +assert(size + hrdp->received <= sizeof(hrdp->val.v8));
> >> +memcpy(&hrdp->val.v8[hrdp->received], src, size);
> >> +hrdp->received += size;
> >> +}
> >> +
> >> +qemu_sem_post(&hrdp->sem);
> > 
> > I'm assuming qemu_sem_post() includes the necessary memory barrier to
> > make sure the requesting thread actually sees the data.
> 
> Not sure whether I fully got your point here... both the callback function
> and the main thread are calling an extern C-function, so the compiler should
> not assume that the memory stays the same in the main thread...?

I'm not talking about a compiler barrier: the callback will likely be
invoked on a different CPU from the vcpu thread that invoked H_RANDOM,
so on a weakly ordered arch like Power we need a real CPU memory barrier.

> Anyway, I've tested the hypercall by implementing it in SLOF and calling
> it a couple of times there to see that all bits in the result behave
> randomly, so for me this is working fine.

Right, I'd be almost certain anyway that qemu_sem_post() (actually
likely the pthreads functions it invokes) will include the necessary
barriers to stop accesses leaking outside the locked region.
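
(A self-contained illustration of the pattern, using plain POSIX
semaphores in place of qemu_sem_*: the post is the release and the wait
the matching acquire, which is what publishes val to the waiting
thread.  Build with -pthread:)

#include <pthread.h>
#include <semaphore.h>
#include <inttypes.h>
#include <stdio.h>

struct hrandom_data {
	sem_t sem;
	uint64_t val;
};

static void *producer(void *arg)
{
	struct hrandom_data *h = arg;

	h->val = 0x0123456789abcdefULL;	/* stands in for the rng bytes */
	sem_post(&h->sem);		/* release: publishes h->val */
	return NULL;
}

int main(void)
{
	struct hrandom_data h;
	pthread_t t;

	sem_init(&h.sem, 0, 0);
	pthread_create(&t, NULL, producer, &h);
	sem_wait(&h.sem);		/* acquire: pairs with the post above */
	printf("val = 0x%016" PRIx64 "\n", h.val);
	pthread_join(t, NULL);
	sem_destroy(&h.sem);
	return 0;
}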

> 
> >> +}
> >> +
> >> +/* Handler for the H_RANDOM hypercall */
> >> +static target_ulong h_random(PowerPCCPU *cpu, sPAPRMachineState *spapr,
> >> + target_ulong opcode, target_ulong *args)
> >> +{
> >> +sPAPRRngState *rngstate;
> >> +HRandomData hrdata;
> >> +
> >> +rngstate = SPAPR_RNG(object_resolve_path_type("", TYPE_SPAPR_RNG, NULL));
> >> +

Re: [PATCH v3] ppc/spapr: Implement H_RANDOM hypercall in QEMU

2015-09-13 Thread David Gibson
 false),
> +DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void spapr_rng_class_init(ObjectClass *oc, void *data)
> +{
> +DeviceClass *dc = DEVICE_CLASS(oc);
> +
> +dc->realize = spapr_rng_realize;
> +set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> +dc->props = spapr_rng_properties;
> +}
> +
> +static const TypeInfo spapr_rng_info = {
> +.name  = TYPE_SPAPR_RNG,
> +.parent= TYPE_DEVICE,
> +.instance_size = sizeof(sPAPRRngState),
> +.instance_init = spapr_rng_instance_init,
> +.class_init= spapr_rng_class_init,
> +};
> +
> +static void spapr_rng_register_type(void)
> +{
> +type_register_static(&spapr_rng_info);
> +}
> +type_init(spapr_rng_register_type)
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 91a61ab..4e8aa2d 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -331,6 +331,7 @@ struct sPAPRMachineState {
>  #define H_SET_MPP   0x2D0
>  #define H_GET_MPP   0x2D4
>  #define H_XIRR_X0x2FC
> +#define H_RANDOM0x300
>  #define H_SET_MODE  0x31C
>  #define MAX_HCALL_OPCODEH_SET_MODE
>  
> @@ -603,10 +604,13 @@ struct sPAPRConfigureConnectorState {
>  void spapr_ccs_reset_hook(void *opaque);
>  
>  #define TYPE_SPAPR_RTC "spapr-rtc"
> +#define TYPE_SPAPR_RNG "spapr-rng"
>  
>  void spapr_rtc_read(DeviceState *dev, struct tm *tm, uint32_t *ns);
>  int spapr_rtc_import_offset(DeviceState *dev, int64_t legacy_offset);
>  
> +int spapr_rng_populate_dt(void *fdt);
> +
>  #define SPAPR_MEMORY_BLOCK_SIZE (1 << 28) /* 256MB */
>  
>  #endif /* !defined (__HW_SPAPR_H__) */
> diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
> index 110436d..42f66fe 100644
> --- a/target-ppc/kvm.c
> +++ b/target-ppc/kvm.c
> @@ -2484,3 +2484,12 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
>  {
>  return data & 0x;
>  }
> +
> +int kvmppc_enable_hwrng(void)
> +{
> +if (!kvm_enabled() || !kvm_check_extension(kvm_state, KVM_CAP_PPC_HWRNG)) {
> +return -1;
> +}
> +
> +return kvmppc_enable_hcall(kvm_state, H_RANDOM);
> +}
> diff --git a/target-ppc/kvm_ppc.h b/target-ppc/kvm_ppc.h
> index 4d30e27..68836b4 100644
> --- a/target-ppc/kvm_ppc.h
> +++ b/target-ppc/kvm_ppc.h
> @@ -55,6 +55,7 @@ void kvmppc_hash64_free_pteg(uint64_t token);
>  void kvmppc_hash64_write_pte(CPUPPCState *env, target_ulong pte_index,
>   target_ulong pte0, target_ulong pte1);
>  bool kvmppc_has_cap_fixup_hcalls(void);
> +int kvmppc_enable_hwrng(void);
>  
>  #else
>  
> @@ -248,6 +249,10 @@ static inline bool kvmppc_has_cap_fixup_hcalls(void)
>  abort();
>  }
>  
> +static inline int kvmppc_enable_hwrng(void)
> +{
> +return -1;
> +}
>  #endif
>  
>  #ifndef CONFIG_KVM

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgp6_iH1g4gwt.pgp
Description: PGP signature


Re: [PATCH 2/2] KVM: PPC: Book3S HV: Exit on H_DOORBELL only if HOST_IPI is set

2015-09-02 Thread David Gibson
On Thu, Sep 03, 2015 at 03:21:23PM +1000, Paul Mackerras wrote:
> From: "Gautham R. Shenoy" <e...@linux.vnet.ibm.com>
> 
> The code that handles the case when we receive a H_DOORBELL interrupt
> has a comment which says "Hypervisor doorbell - exit only if host IPI
> flag set".  However, the current code does not actually check if the
> host IPI flag is set.  This is due to a comparison instruction that
> got missed.
> 
> As a result, the current code performs the exit to host only
> if some sibling thread or a sibling sub-core is exiting to the
> host.  This implies that, an IPI sent to a sibling core in
> (subcores-per-core != 1) mode will be missed by the host unless the
> sibling core is on the exit path to the host.
> 
> This patch adds the missing comparison operation which will ensure
> that when HOST_IPI flag is set, we unconditionally exit to the host.
> 
> Fixes: 66feed61cdf6
> Cc: sta...@vger.kernel.org # v4.1+
> Signed-off-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com>
> Signed-off-by: Paul Mackerras <pau...@samba.org>

Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>
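
(In plain C, the check the missing cmpwi restores is simply the sketch
below; host_ipi stands in for the byte loaded by
lbz r0, HSTATE_HOST_IPI(r13):)

#include <stdbool.h>

/* Sketch: on a hypervisor doorbell wakeup, exit to the host only when
 * the host-IPI flag is set.  Before the fix the flag was loaded but,
 * lacking the cmpwi, never actually tested. */
static bool doorbell_exit_to_host(unsigned char host_ipi)
{
	return host_ipi != 0;	/* the restored "cmpwi r0, 0" */
}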

> ---
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b07f045..2273dca 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1213,6 +1213,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
>   cmpwi   r12, BOOK3S_INTERRUPT_H_DOORBELL
>   bne     3f
>   lbz r0, HSTATE_HOST_IPI(r13)
> + cmpwi   r0, 0
>   beq 4f
>   b   guest_exit_cont
>  3:

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpC5z6OtPQK9.pgp
Description: PGP signature


Re: [PATCH 1/2] KVM: PPC: Book3S HV: Fix race in starting secondary threads

2015-09-02 Thread David Gibson
On Thu, Sep 03, 2015 at 03:20:50PM +1000, Paul Mackerras wrote:
> From: "Gautham R. Shenoy" <e...@linux.vnet.ibm.com>
> 
> The current dynamic micro-threading code has a race due to which a
> secondary thread naps when it is supposed to be running a vcpu. As a
> side effect of this, on a guest exit, the primary thread in
> kvmppc_wait_for_nap() finds that this secondary thread hasn't cleared
> its vcore pointer. This results in "CPU X seems to be stuck!"
> warnings.
> 
> The race is possible since the primary thread on exiting the guests
> only waits for all the secondaries to clear its vcore pointer. It
> subsequently expects the secondary threads to enter nap while it
> unsplits the core. A secondary thread which hasn't yet entered the nap
> will loop in kvm_no_guest until its vcore pointer and the do_nap flag
> are unset. Once the core has been unsplit, a new vcpu thread can grab
> the core and set the do_nap flag *before* setting the vcore pointers
> of the secondary. As a result, the secondary thread will now enter nap
> via kvm_unsplit_nap instead of running the guest vcpu.
> 
> Fix this by setting the do_nap flag after setting the vcore pointer in
> the PACA of the secondary in kvmppc_run_core. Also, ensure that a
> secondary thread doesn't nap in kvm_unsplit_nap when the vcore pointer
> in its PACA struct is set.
> 
> Fixes: b4deba5c41e9
> Signed-off-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com>
> Signed-off-by: Paul Mackerras <pau...@samba.org>

Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>
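
(The essence of the fix, restated as store ordering in C; the structure
and field names here are illustrative, not the exact kvmppc ones:)

struct vcore;
struct paca { struct vcore *kvm_vcore; };

static void wake_secondary_fixed(struct paca *secondary, struct vcore *vc,
				 int *do_nap)
{
	/* Publish the vcore pointer first ... */
	secondary->kvm_vcore = vc;
	/* ... then set do_nap.  A secondary looping in kvm_no_guest that
	 * now sees do_nap set is guaranteed to also see its vcore
	 * assignment, so it runs the vcpu instead of napping.  The racy
	 * code did these two stores in the opposite order. */
	*do_nap = 1;
}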
-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgp1Om83aXHqD.pgp
Description: PGP signature


[PATCH] vfio: Enable VFIO device for powerpc

2015-08-12 Thread David Gibson
ec53500f "kvm: Add VFIO device" added a special KVM pseudo-device which is
used to handle any necessary interactions between KVM and VFIO.

Currently that device is built on x86 and ARM, but not powerpc, although
powerpc does support both KVM and VFIO.  This makes things awkward in
userspace.

Currently qemu prints an alarming error message if you attempt to use VFIO
and it can't initialize the KVM VFIO device.  We don't want to remove the
warning, because lack of the KVM VFIO device could mean coherency problems
on x86.  On powerpc, however, the error is harmless but looks disturbing,
and a test based on host architecture in qemu would be ugly, and break if
we do need the KVM VFIO device for something important in future.

There's nothing preventing the KVM VFIO device from being built for
powerpc, so this patch turns it on.  It won't actually do anything, since
we don't define any of the arch_*() hooks, but it will make qemu happy and
we can extend it in future if we need to.

Signed-off-by: David Gibson da...@gibson.dropbear.id.au
Reviewed-by: Eric Auger eric.au...@linaro.org
---
 arch/powerpc/kvm/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Alex (Graf), sorry, forgot to send this to you on my first posting of
this patch, but I'm guessing it's your tree it should go in via.
Resending, with Eric's reviewed-by line folded in.  Please apply.

diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index 0570eef..7f7b6d8 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
 KVM := ../../../virt/kvm
 
 common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
-   $(KVM)/eventfd.o
+   $(KVM)/eventfd.o $(KVM)/vfio.o
 
 CFLAGS_e500_mmu.o := -I.
 CFLAGS_e500_mmu_host.o := -I.
-- 
2.4.3



Re: [kvm-unit-tests PATCH 11/14] powerpc/ppc64: add rtas_power_off

2015-08-04 Thread David Gibson
On Tue, Aug 04, 2015 at 09:54:44AM +0200, Andrew Jones wrote:
 On Tue, Aug 04, 2015 at 02:11:30PM +1000, David Gibson wrote:
  On Mon, Aug 03, 2015 at 04:41:28PM +0200, Andrew Jones wrote:
   Add enough RTAS support to support power-off, and apply it to
   exit().
   
   Signed-off-by: Andrew Jones drjo...@redhat.com
   ---
lib/powerpc/asm/rtas.h  | 27 
lib/powerpc/io.c|  3 ++
lib/powerpc/rtas.c  | 84 
   +
lib/ppc64/asm/rtas.h|  1 +
powerpc/Makefile.common |  1 +
powerpc/cstart64.S  |  9 ++
6 files changed, 125 insertions(+)
create mode 100644 lib/powerpc/asm/rtas.h
create mode 100644 lib/powerpc/rtas.c
create mode 100644 lib/ppc64/asm/rtas.h
   
   diff --git a/lib/powerpc/asm/rtas.h b/lib/powerpc/asm/rtas.h
   new file mode 100644
   index 0..53ce126c69a42
   --- /dev/null
   +++ b/lib/powerpc/asm/rtas.h
   @@ -0,0 +1,27 @@
   +#ifndef _ASMPOWERPC_RTAS_H_
   +#define _ASMPOWERPC_RTAS_H_
   +/*
   + * Copyright (C) 2015, Red Hat Inc, Andrew Jones drjo...@redhat.com
   + *
   + * This work is licensed under the terms of the GNU LGPL, version 2.
   + */
   +#include <libcflat.h>
   +
   +#define RTAS_UNKNOWN_SERVICE (-1)
   +
   +struct rtas_args {
   + u32 token;
   + u32 nargs;
   + u32 nret;
   + u32 args[16];
   + u32 *rets;
   +};
   +
   +extern void rtas_init(void);
   +extern void enter_rtas(unsigned long args);
   +extern int rtas_token(const char *service);
   +extern int rtas_call(int token, int nargs, int nret, int *outputs, ...);
   +
   +extern void rtas_power_off(void);
   +
   +#endif /* _ASMPOWERPC_RTAS_H_ */
   diff --git a/lib/powerpc/io.c b/lib/powerpc/io.c
   index a7eaafeca9205..25cdd0ad961ee 100644
   --- a/lib/powerpc/io.c
   +++ b/lib/powerpc/io.c
   @@ -7,6 +7,7 @@
 */
#include <libcflat.h>
#include <asm/spinlock.h>
   +#include <asm/rtas.h>

extern void halt(int code);
extern void putchar(int c);
   @@ -15,6 +16,7 @@ static struct spinlock uart_lock;

void io_init(void)
{
   + rtas_init();
}

void puts(const char *s)
   @@ -27,5 +29,6 @@ void puts(const char *s)

void exit(int code)
{
   + rtas_power_off();
 halt(code);
}
   diff --git a/lib/powerpc/rtas.c b/lib/powerpc/rtas.c
   new file mode 100644
   index 0..39f7d4428ee77
   --- /dev/null
   +++ b/lib/powerpc/rtas.c
   @@ -0,0 +1,84 @@
   +/*
   + * powerpc RTAS
   + *
   + * Copyright (C) 2015, Red Hat Inc, Andrew Jones drjo...@redhat.com
   + *
   + * This work is licensed under the terms of the GNU LGPL, version 2.
   + */
   +#include <libcflat.h>
   +#include <libfdt/libfdt.h>
   +#include <devicetree.h>
   +#include <asm/spinlock.h>
   +#include <asm/io.h>
   +#include <asm/rtas.h>
   +
   +static struct spinlock lock;
   +struct rtas_args rtas_args;
   +static int rtas_node;
   +
   +void rtas_init(void)
   +{
   + if (!dt_available()) {
   + printf("%s: No device tree!\n", __func__);
   + abort();
   + }
   +
   + rtas_node = fdt_path_offset(dt_fdt(), "/rtas");
   + if (rtas_node < 0) {
   + printf("%s: /rtas node: %s\n", __func__,
   + fdt_strerror(rtas_node));
   + abort();
   + }
   +}
   +
   +int rtas_token(const char *service)
   +{
   + const struct fdt_property *prop;
   + u32 *token;
   +
   + prop = fdt_get_property(dt_fdt(), rtas_node, service, NULL);
  
  Oh.. one other thing here.
  
  This is only safe if you never alter the device tree blob you're given
  (or at least, not after rtas_init()).  That may well be the case, but
  I don't know your code well enough to be sure.
  
  Otherwise, the rtas node could get moved by other dt changes, meaning
  the offset stored in rtas_node is no longer valid.
 
 I hadn't planned on modifying the DT, but if you can conceive of a test
 they may want to, then we should do this right.

That's probably ok then, as long as it's made clear that you can't
mess with the DT.  Or if tests might want to modify it, you could give
them their own copy of it.

I guess doing it right
 just means to hunt down rtas_node each time, that's easy.

That's right.

Not having persistent handles is a bit of a pain, but it's the
necessary tradeoff to work with the tree in flattened form without
having some complicated pile of context state to go with it.
(Attempting to remind people that these aren't persistent handles is,
incidentally, why so many of the libfdt functions explicitly mention
offset in their names).
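
To make "doing it right" concrete, a minimal sketch (an illustration
reusing the names from the patch above, not code from the series) of
rtas_token() re-resolving the node offset on every call:

int rtas_token(const char *service)
{
	const struct fdt_property *prop;
	/* Re-resolve /rtas each time; a cached offset can be
	 * invalidated by any edit to the flattened tree. */
	int node = fdt_path_offset(dt_fdt(), "/rtas");

	if (node < 0)
		return RTAS_UNKNOWN_SERVICE;

	prop = fdt_get_property(dt_fdt(), node, service, NULL);
	if (!prop)
		return RTAS_UNKNOWN_SERVICE;

	return fdt32_to_cpu(*(const fdt32_t *)prop->data);
}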

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


Re: [kvm-unit-tests PATCH 11/14] powerpc/ppc64: add rtas_power_off

2015-08-04 Thread David Gibson
On Tue, Aug 04, 2015 at 09:47:59AM +0200, Andrew Jones wrote:
 On Tue, Aug 04, 2015 at 02:09:52PM +1000, David Gibson wrote:
  On Mon, Aug 03, 2015 at 07:08:17PM +0200, Paolo Bonzini wrote:
   
   
   On 03/08/2015 16:41, Andrew Jones wrote:
Add enough RTAS support to support power-off, and apply it to
exit().

Signed-off-by: Andrew Jones drjo...@redhat.com
   
   Why not use virtio-mmio + testdev on ppc as well?  Similar to how we're
   not using PSCI on ARM or ACPI on x86.
  
  Strange as it seems, MMIO is actually a PITA for a simple pseries
  guest like this.  Basically, you have to enable the MMU - which
  requires a whole bunch of setup - in order to perform cache-inhibited
  loads and stores, which is what you need for IO.
  
  There are hypercalls to sidestep this (H_LOGICAL_CI_LOAD and
  H_LOGICAL_CI_STORE), but having a hypercall and KVM exit for every IO
  access may be hideously slow.
  
  In early development we did have a hypercall mediated virtio model,
  but it was abandoned once we got PCI working.
 
 So I think by yours and Alex's responses, if we want testdev support
 then we should target using pci to expose it.

Um.. maybe.  I'm not really familiar with these testdevs, so I can't
answer directly.

 I'm ok with that, but
 prefer not to be distracted with it while getting ppc kickstarted.
 So, question for Paolo, are you OK with the exitcode snooper cheat?

If you wanted to add a special hypercall channel for use by the tests
I'd be ok with that too.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


Re: [kvm-unit-tests PATCH 10/14] powerpc/ppc64: relocate linker VMAs

2015-08-03 Thread David Gibson
: .llong  stackptr
 +p_toc:   .llong  tocptr
 +p_dyn:   .llong  dynamic_start
 +
  .text
  .align 3
  
 diff --git a/powerpc/flat.lds b/powerpc/flat.lds
 index bd075efb2c51b..8a573d27346de 100644
 --- a/powerpc/flat.lds
 +++ b/powerpc/flat.lds
 @@ -6,11 +6,22 @@ SECTIONS
  etext = .;
  .opd : { *(.opd) }
  . = ALIGN(16);
 +.dynamic : {
 +dynamic_start = .;
 +*(.dynamic)
 +}
 +.dynsym : {
 +dynsym_start = .;
 +*(.dynsym)
 +}
 +.rela.dyn : { *(.rela*) }
 +. = ALIGN(16);
  .data : {
  *(.data)
 +*(.data.rel*)
  }
  . = ALIGN(16);
 -.rodata : { *(.rodata) }
 +.rodata : { *(.rodata) *(.rodata.*) }
  . = ALIGN(16);
  .bss : { *(.bss) }
  . = ALIGN(16);
 diff --git a/powerpc/reloc64.c b/powerpc/reloc64.c
 new file mode 100644
 index 0..2804823bdfee3
 --- /dev/null
 +++ b/powerpc/reloc64.c
 @@ -0,0 +1,55 @@
 +/*
 + * relocate R_PPC_RELATIVE RELA entries. Normally this is done in
 + * assembly code to avoid the risk of using absolute addresses before
 + * they're relocated. We use C, but cautiously (no global references).
 + *
 + * Copyright (C) 2015, Red Hat Inc, Andrew Jones drjo...@redhat.com
 + *
 + * This work is licensed under the terms of the GNU LGPL, version 2.
 + */
 +#define DT_NULL  0
 +#define DT_RELA  7
 +#define DT_RELACOUNT 0x6ffffff9
 +#define R_PPC_RELATIVE   22
 +
 +struct elf64_dyn {
 + signed long long tag;
 + unsigned long long val;
 +};
 +
 +#define RELA_GET_TYPE(rela_ptr) ((rela_ptr)->info & 0xffffffff)
 +struct elf64_rela {
 + unsigned long long offset;
 + unsigned long long info;
 + signed long long addend;
 +};
 +
 +void relocate(unsigned long load_addr, struct elf64_dyn *dyn_table)
 +{
 + unsigned long long rela_addr = 0, rela_count = 0, *addr;
 + struct elf64_dyn *d = dyn_table;
 + struct elf64_rela *r;
 +
 + while (d && d->tag != DT_NULL) {
 + if (d->tag == DT_RELA)
 + rela_addr = d->val;
 + else if (d->tag == DT_RELACOUNT)
 + rela_count = d->val;
 + if (rela_addr && rela_count)
 + break;
 + ++d;
 + }
 +
 + if (!rela_addr || !rela_count)
 + return;
 +
 + r = (void *)(rela_addr + load_addr);
 +
 + while (rela_count--) {
 + if (RELA_GET_TYPE(r) == R_PPC_RELATIVE) {
 + addr = (void *)(r->offset + load_addr);
 + *addr = r->addend + load_addr;
 + }
 + ++r;
 + }
 +}

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


Re: [kvm-unit-tests PATCH 11/14] powerpc/ppc64: add rtas_power_off

2015-08-03 Thread David Gibson
..f4eff7f8eccb4 100644
 --- a/powerpc/Makefile.common
 +++ b/powerpc/Makefile.common
 @@ -39,6 +39,7 @@ cflatobjs += lib/powerpc/io.o
  cflatobjs += lib/powerpc/setup.o
  cflatobjs += lib/powerpc/mmu.o
  cflatobjs += lib/powerpc/smp.o
 +cflatobjs += lib/powerpc/rtas.o
  
  libgcc := $(shell $(CC) $(machine) --print-libgcc-file-name)
  start_addr := $(shell printf "%x\n" $$(( $(phys_base) + $(kernel_offset) )))
 diff --git a/powerpc/cstart64.S b/powerpc/cstart64.S
 index 8edaaa6e251fc..1c48083efa33c 100644
 --- a/powerpc/cstart64.S
 +++ b/powerpc/cstart64.S
 @@ -8,6 +8,7 @@
  
  #define HVSC .long 0x44000022
  #define H_PUT_TERM_CHAR  0x58
 +#define KVMPPC_H_RTAS 0xf000
  
  #define LOAD_REG_IMMEDIATE(reg,expr) \
   lis reg,(expr)@highest; \
 @@ -85,3 +86,11 @@ putchar:
   li  r5, 1   /* sending just 1 byte */
   HVSC
   blr
 +
 +.globl enter_rtas
 +enter_rtas:
 + mr  r4, r3
 + lis r3, KVMPPC_H_RTAS@h
 + ori r3, r3, KVMPPC_H_RTAS@l
 + HVSC
 + blr

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


Re: [kvm-unit-tests PATCH 11/14] powerpc/ppc64: add rtas_power_off

2015-08-03 Thread David Gibson
 := $(shell printf "%x\n" $$(( $(phys_base) + $(kernel_offset) )))
 diff --git a/powerpc/cstart64.S b/powerpc/cstart64.S
 index 8edaaa6e251fc..1c48083efa33c 100644
 --- a/powerpc/cstart64.S
 +++ b/powerpc/cstart64.S
 @@ -8,6 +8,7 @@
  
  #define HVSC .long 0x44000022
  #define H_PUT_TERM_CHAR  0x58
 +#define KVMPPC_H_RTAS 0xf000
  
  #define LOAD_REG_IMMEDIATE(reg,expr) \
   lis reg,(expr)@highest; \
 @@ -85,3 +86,11 @@ putchar:
   li  r5, 1   /* sending just 1 byte */
   HVSC
   blr
 +
 +.globl enter_rtas
 +enter_rtas:
 + mr  r4, r3
 + lis r3, KVMPPC_H_RTAS@h
 + ori r3, r3, KVMPPC_H_RTAS@l
 + HVSC
 + blr

So, strictly speaking this isn't how a client program is supposed to
access RTAS.  Instead, you're supposed to call into SLOF to
instantiate RTAS, then jump into the blob it gives you.  Or, if you're
replacing rather than using SLOF, then you're supposed to get the
existing RTAS blob address from the device tree and jump into that (or
copy it somewhere else yourself).

In practice that RTAS blob will have the same 5 instructions as your
enter_rtas, so it's probably ok to do this.

What concerns me slightly, though, is that if we ever messed up the
RTAS blob handling in qemu or SLOF, then bypassing it in the kvm tests
like this means they wouldn't catch that breakage.
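
For reference, the DT-mediated path described above would look roughly
like the sketch below.  This is an illustration only: it assumes the
loader has already instantiated RTAS and published "linux,rtas-entry",
and the real entry sequence also passes the RTAS base and runs with
translation off:

int node = fdt_path_offset(dt_fdt(), "/rtas");
const fdt32_t *entry = fdt_getprop(dt_fdt(), node, "linux,rtas-entry", NULL);

if (node >= 0 && entry) {
	void (*rtas_entry)(unsigned long) = (void (*)(unsigned long))
		(unsigned long)fdt32_to_cpu(*entry);
	rtas_entry(__pa(&rtas_args));	/* jump into the RTAS blob */
}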

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


Re: [kvm-unit-tests PATCH 08/14] powerpc/ppc64: add HV putchar

2015-08-03 Thread David Gibson
On Mon, Aug 03, 2015 at 04:41:25PM +0200, Andrew Jones wrote:
 Add the hvcall for putchar and use it in puts. That, along with a
 couple more lines in start to prepare for C code, and a branch to
 main(), gets us hello world. Run with
 
 qemu-system-ppc64 -M pseries\
 -bios powerpc/boot_rom.bin  \
 -display none -serial stdio \
 -kernel powerpc/selftest.elf
 
 (We're still not relocating yet, that comes in a later patch. Thus,
  testing hello-world at this point requires a hacked QEMU and linking
  the unit test at QEMU's kernel load address.)
 
 Signed-off-by: Andrew Jones drjo...@redhat.com
 ---
  lib/powerpc/io.c| 15 +--
  powerpc/Makefile.common |  1 +
  powerpc/cstart64.S  | 27 +++
  3 files changed, 41 insertions(+), 2 deletions(-)
 
 diff --git a/lib/powerpc/io.c b/lib/powerpc/io.c
 index cf3b6347e1e46..a7eaafeca9205 100644
 --- a/lib/powerpc/io.c
 +++ b/lib/powerpc/io.c
 @@ -6,15 +6,26 @@
   * This work is licensed under the terms of the GNU LGPL, version 2.
   */
  #include libcflat.h
 +#include asm/spinlock.h
 +
 +extern void halt(int code);
 +extern void putchar(int c);
 +
 +static struct spinlock uart_lock;
  
  void io_init(void)
  {
  }
  
 -void puts(const char *s __unused)
 +void puts(const char *s)
  {
 + spin_lock(&uart_lock);
 + while (*s)
 + putchar(*s++);
 + spin_unlock(&uart_lock);
  }
  
 -void exit(int code __unused)
 +void exit(int code)
  {
 + halt(code);
  }
 diff --git a/powerpc/Makefile.common b/powerpc/Makefile.common
 index 5cd3ea8085038..d6356540918a5 100644
 --- a/powerpc/Makefile.common
 +++ b/powerpc/Makefile.common
 @@ -26,6 +26,7 @@ CFLAGS += -ffreestanding
  CFLAGS += -Wextra
  CFLAGS += -O2
  CFLAGS += -I lib -I lib/libfdt
 +CFLAGS += -Wa,-mregnames
  
  asm-offsets = lib/$(ARCH)/asm-offsets.h
  include scripts/asm-offsets.mak
 diff --git a/powerpc/cstart64.S b/powerpc/cstart64.S
 index d9b77f44f9e0e..d7c51cd352ee4 100644
 --- a/powerpc/cstart64.S
 +++ b/powerpc/cstart64.S
 @@ -6,14 +6,41 @@
   * This work is licensed under the terms of the GNU LGPL, version 2.
   */
  
 +#define HVSC .long 0x44000022

Unless your assembler is really old, you could probably use "sc 1"
inline, instead of this macro.  Not that it really matters.  (A C-level
sketch of the same idea follows below the quoted patch.)

 +#define H_PUT_TERM_CHAR  0x58
 +
 +#define LOAD_REG_IMMEDIATE(reg,expr) \
 + lis reg,(expr)@highest; \
 + ori reg,reg,(expr)@higher;  \
 + rldicr  reg,reg,32,31;  \
 + oris reg,reg,(expr)@h;   \
 + ori reg,reg,(expr)@l;
 +
 +#define LOAD_REG_ADDR(reg,name)  \
 + ld  reg,name@got(r2)
 +
  .section .init
  
  .globl start
  start:
 + LOAD_REG_IMMEDIATE(r1, stackptr)
 + LOAD_REG_IMMEDIATE(r2, tocptr)
 + bl  .main
 + bl  .exit

Is this built for ppc64 or ppc64le?  IIUC under the new ABI version
usually used on ppc64le, function descriptors are no longer used.  And
even under the old ABI, I think the assembler now has the smarts to
avoid explicitly referencing the .-symbols.

   b   halt
  
  .text
 +.align 3
  
  .globl halt
  halt:
  1:   b   1b
 +
 +.globl putchar
 +putchar:
 + sldi r6, r3, 56
 + li  r3, H_PUT_TERM_CHAR
 + li  r4, 0   /* vty-reg 0 means to use the default vty */
 + li  r5, 1   /* sending just 1 byte */
 + HVSC
 + blr

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


Re: [kvm-unit-tests PATCH 11/14] powerpc/ppc64: add rtas_power_off

2015-08-03 Thread David Gibson
On Mon, Aug 03, 2015 at 07:08:17PM +0200, Paolo Bonzini wrote:
 
 
 On 03/08/2015 16:41, Andrew Jones wrote:
  Add enough RTAS support to support power-off, and apply it to
  exit().
  
  Signed-off-by: Andrew Jones drjo...@redhat.com
 
 Why not use virtio-mmio + testdev on ppc as well?  Similar to how we're
 not using PSCI on ARM or ACPI on x86.

Strange as it seems, MMIO is actually a PITA for a simple pseries
guest like this.  Basically, you have to enable the MMU - which
requires a whole bunch of setup - in order to perform cache-inhibited
loads and stores, which is what you need for IO.

There are hypercalls to sidestep this (H_LOGICAL_CI_LOAD and
H_LOGICAL_CI_STORE), but having a hypercall and KVM exit for every IO
access may be hideously slow.

In early development we did have a hypercall mediated virtio model,
but it was abandoned once we got PCI working.
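
For the record, the hypercall-mediated access looks roughly like the
sketch below (H_LOGICAL_CI_LOAD is hcall 0x3c; status checking and the
store side are elided).  Every such access is a full guest exit, which
is the performance concern:

static unsigned long logical_ci_load(unsigned long size, unsigned long paddr)
{
	register unsigned long r3 asm("r3") = 0x3c;	/* H_LOGICAL_CI_LOAD */
	register unsigned long r4 asm("r4") = size;	/* 1, 2, 4 or 8 bytes */
	register unsigned long r5 asm("r5") = paddr;

	asm volatile("sc 1"
		     : "+r"(r3), "+r"(r4), "+r"(r5)
		     :
		     : "r0", "r6", "r7", "r8", "r9", "r10", "r11", "r12",
		       "memory", "cc");
	return r4;	/* loaded value when r3 == H_SUCCESS */
}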

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


Re: [PATCH 0/2] Two fixes for dynamic micro-threading

2015-07-20 Thread David Gibson
On Thu, Jul 16, 2015 at 05:11:12PM +1000, Paul Mackerras wrote:
 This series contains two fixes for the new dynamic micro-threading
 code that was added recently for HV-mode KVM on Power servers.
 The patches are against Alex Graf's kvm-ppc-queue branch.  Please
 apply.

agraf,

Any word on these?  These appear to fix a really nasty host crash in
current upstream.  I'd really like to see them merged ASAP.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


[PATCH] vfio: Enable VFIO device for powerpc

2015-07-20 Thread David Gibson
ec53500f "kvm: Add VFIO device" added a special KVM pseudo-device which is
used to handle any necessary interactions between KVM and VFIO.

Currently that device is built on x86 and ARM, but not powerpc, although
powerpc does support both KVM and VFIO.  This makes things awkward in
userspace.

Currently qemu prints an alarming error message if you attempt to use VFIO
and it can't initialize the KVM VFIO device.  We don't want to remove the
warning, because lack of the KVM VFIO device could mean coherency problems
on x86.  On powerpc, however, the error is harmless but looks disturbing,
and a test based on host architecture in qemu would be ugly, and break if
we do need the KVM VFIO device for something important in future.

There's nothing preventing the KVM VFIO device from being built for
powerpc, so this patch turns it on.  It won't actually do anything, since
we don't define any of the arch_*() hooks, but it will make qemu happy and
we can extend it in future if we need to.

Signed-off-by: David Gibson da...@gibson.dropbear.id.au
---
 arch/powerpc/kvm/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index 0570eef..7f7b6d8 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
 KVM := ../../../virt/kvm
 
 common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
-   $(KVM)/eventfd.o
+   $(KVM)/eventfd.o $(KVM)/vfio.o
 
 CFLAGS_e500_mmu.o := -I.
 CFLAGS_e500_mmu_host.o := -I.
-- 
2.4.3


Re: [PATCH kernel v11 33/34] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-06-03 Thread David Gibson
On Wed, Jun 03, 2015 at 09:40:49PM +1000, Alexey Kardashevskiy wrote:
 On 06/02/2015 02:17 PM, David Gibson wrote:
 On Fri, May 29, 2015 at 06:44:57PM +1000, Alexey Kardashevskiy wrote:
 The existing implementation accounts the whole DMA window in
 the locked_vm counter. This is going to be worse with multiple
 containers and huge DMA windows. Also, real-time accounting would require
 additional tracking of accounted pages due to the page size difference -
 IOMMU uses 4K pages and system uses 4K or 64K pages.
 
 Another issue is that actual pages pinning/unpinning happens on every
 DMA map/unmap request. This does not affect the performance much now as
 we spend way too much time now on switching context between
 guest/userspace/host but this will start to matter when we add in-kernel
 DMA map/unmap acceleration.
 
 This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
 New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
 2 new ioctls to register/unregister DMA memory -
 VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
 which receive user space address and size of a memory region which
 needs to be pinned/unpinned and counted in locked_vm.
 New IOMMU splits physical pages pinning and TCE table update
 into 2 different operations. It requires:
 1) guest pages to be registered first
 2) subsequent map/unmap requests to work only with pre-registered memory.
 For the default single window case this means that the entire guest
 (instead of 2GB) needs to be pinned before using VFIO.
 When a huge DMA window is added, no additional pinning will be
 required, otherwise it would be guest RAM + 2GB.
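
A userspace sketch of the flow just described.  The two ioctls are the
ones this series adds; container_fd, guest_ram and guest_ram_size are
assumed:

	struct vfio_iommu_spapr_register_memory reg = {
		.argsz = sizeof(reg),
		.flags = 0,
		.vaddr = (__u64)(uintptr_t)guest_ram,
		.size  = guest_ram_size,
	};

	/* Pin and account the memory once, up front... */
	ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
	/* ...map/unmap requests now only update TCEs, no pinning... */
	ioctl(container_fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);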
 
 The new memory registration ioctls are not supported by
 VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
 will require memory to be preregistered in order to work.
 
 The accounting is done per the user process.
 
 This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
 can do with v1 or v2 IOMMUs.
 
 In order to support memory pre-registration, we need a way to track
 the use of every registered memory region and only allow unregistration
 if a region is not in use anymore. So we need a way to tell from what
 region the just cleared TCE was from.
 
 This adds a userspace view of the TCE table into iommu_table struct.
 It contains userspace address, one per TCE entry. The table is only
 allocated when the ownership over an IOMMU group is taken which means
 it is only used from outside of the powernv code (such as VFIO).
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com
 ---
 Changes:
 v11:
 * mm_iommu_put() does not return a code so this does not check it
 * moved v2 in tce_container to pack the struct
 
 v10:
 * moved it_userspace allocation to vfio_iommu_spapr_tce as it VFIO
 specific thing
 * squashed powerpc/iommu: Add userspace view of TCE table into this as
 it is
 a part of IOMMU v2
 * s/tce_iommu_use_page_v2/tce_iommu_prereg_ua_to_hpa/
 * fixed some function names to have tce_iommu_ in the beginning rather
 just tce_
 * as mm_iommu_mapped_inc() can now fail, check for the return code
 
 v9:
 * s/tce_get_hva_cached/tce_iommu_use_page_v2/
 
 v7:
 * now memory is registered per mm (i.e. process)
 * moved memory registration code to powerpc/mmu
 * merged vfio: powerpc/spapr: Define v2 IOMMU into this
 * limited new ioctls to v2 IOMMU
 * updated doc
 * unsupported ioclts return -ENOTTY instead of -EPERM
 
 v6:
 * tce_get_hva_cached() returns hva via a pointer
 
 v4:
 * updated docs
 * s/kzmalloc/vzalloc/
 * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
 replaced offset with index
 * renamed vfio_iommu_type_register_memory to 
 vfio_iommu_spapr_register_memory
 and removed duplicating vfio_iommu_spapr_register_memory
 ---
   Documentation/vfio.txt  |  31 ++-
   arch/powerpc/include/asm/iommu.h|   6 +
   drivers/vfio/vfio_iommu_spapr_tce.c | 512 
  ++--
   include/uapi/linux/vfio.h   |  27 ++
   4 files changed, 487 insertions(+), 89 deletions(-)
 
 diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
 index 96978ec..7dcf2b5 100644
 --- a/Documentation/vfio.txt
 +++ b/Documentation/vfio.txt
 @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
 
   This implementation has some specifics:
 
 -1) Only one IOMMU group per container is supported as an IOMMU group
 -represents the minimal entity which isolation can be guaranteed for and
 -groups are allocated statically, one per a Partitionable Endpoint (PE)
 +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
 +container is supported as an IOMMU table is allocated at the boot time,
 +one table per a IOMMU group which is a Partitionable Endpoint (PE)
   (PE is often a PCI domain but not always).
 +Newer systems (POWER8 with IODA2) have improved hardware design which 
 allows

Re: [PATCH kernel v11 27/34] powerpc/powernv: Implement multilevel TCE tables

2015-06-03 Thread David Gibson
On Wed, Jun 03, 2015 at 09:27:10PM +1000, Alexey Kardashevskiy wrote:
 On 06/02/2015 09:50 AM, David Gibson wrote:
 On Fri, May 29, 2015 at 06:44:51PM +1000, Alexey Kardashevskiy wrote:
 TCE tables might get too big in case of 4K IOMMU pages and DDW enabled
 on huge guests (hundreds of GB of RAM) so the kernel might be unable to
 allocate contiguous chunk of physical memory to store the TCE table.
 
 To address this, POWER8 CPU (actually, IODA2) supports multi-level
 TCE tables, up to 5 levels which splits the table into a tree of
 smaller subtables.
 
 This adds multi-level TCE tables support to
 pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages()
 helpers.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
 Changes:
 v10:
 * fixed multiple comments received for v9
 
 v9:
 * moved from ioda2 to common powernv pci code
 * fixed cleanup if allocation fails in a middle
 * removed check for the size - all boundary checks happen in the calling 
 code
 anyway
 ---
   arch/powerpc/include/asm/iommu.h  |  2 +
   arch/powerpc/platforms/powernv/pci-ioda.c | 98 
  ---
   arch/powerpc/platforms/powernv/pci.c  | 13 
   3 files changed, 104 insertions(+), 9 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/iommu.h 
 b/arch/powerpc/include/asm/iommu.h
 index 4636734..706cfc0 100644
 --- a/arch/powerpc/include/asm/iommu.h
 +++ b/arch/powerpc/include/asm/iommu.h
 @@ -96,6 +96,8 @@ struct iommu_pool {
   struct iommu_table {
 unsigned long  it_busno; /* Bus number this table belongs to */
 unsigned long  it_size;  /* Size of iommu table in entries */
 +   unsigned long  it_indirect_levels;
 +   unsigned long  it_level_size;
 unsigned long  it_offset;/* Offset into global table */
 unsigned long  it_base;  /* mapped address of tce table */
 unsigned long  it_index; /* which iommu table this is */
 diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
 b/arch/powerpc/platforms/powernv/pci-ioda.c
 index fda01c1..68ffc7a 100644
 --- a/arch/powerpc/platforms/powernv/pci-ioda.c
 +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
 @@ -49,6 +49,9 @@
   /* 256M DMA window, 4K TCE pages, 8 bytes TCE */
   #define TCE32_TABLE_SIZE  ((0x10000000 / 0x1000) * 8)
 
 +#define POWERNV_IOMMU_DEFAULT_LEVELS   1
 +#define POWERNV_IOMMU_MAX_LEVELS   5
 +
   static void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl);
 
   static void pe_level_printk(const struct pnv_ioda_pe *pe, const char 
  *level,
 @@ -1975,6 +1978,8 @@ static long pnv_pci_ioda2_set_window(struct 
 iommu_table_group *table_group,
 table_group);
 struct pnv_phb *phb = pe->phb;
 int64_t rc;
 +   const unsigned long size = tbl->it_indirect_levels ?
 +   tbl->it_level_size : tbl->it_size;
 const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
 const __u64 win_size = tbl->it_size << tbl->it_page_shift;
 
 @@ -1989,9 +1994,9 @@ static long pnv_pci_ioda2_set_window(struct 
 iommu_table_group *table_group,
 rc = opal_pci_map_pe_dma_window(phb->opal_id,
 pe->pe_number,
 pe->pe_number << 1,
 -   1,
 +   tbl->it_indirect_levels + 1,
 __pa(tbl->it_base),
 -   tbl->it_size << 3,
 +   size << 3,
 IOMMU_PAGE_SIZE(tbl));
 if (rc) {
 pe_err(pe, "Failed to configure TCE table, err %ld\n", rc);
 @@ -2071,11 +2076,19 @@ static void pnv_pci_ioda_setup_opal_tce_kill(struct 
 pnv_phb *phb)
 phb->ioda.tce_inval_reg = ioremap(phb->ioda.tce_inval_reg_phys, 8);
   }
 
 -static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned shift)
 +static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned shift,
 +   unsigned levels, unsigned long limit,
 +   unsigned long *tce_table_allocated)
   {
 struct page *tce_mem = NULL;
 -   __be64 *addr;
 +   __be64 *addr, *tmp;
 unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT;
 +   unsigned long local_allocated = 1UL << (order + PAGE_SHIFT);
 +   unsigned entries = 1UL << (shift - 3);
 +   long i;
 +
 +   if (*tce_table_allocated >= limit)
 +   return NULL;
 
 I'm not quite clear what case this limit logic is trying to catch.
 
 
 The function is allocating some amount of entries which may be in one chunk
 of memory and spread between multiple chunks in multiple levels. limit is
 the amount of memory for actual TCEs (not intermediate levels). If I do not
 do this, and the user requests 5 levels, and I do not check this, more
 memory will be allocated that actually needed because size of the window is
 limited.

Ah, ok.  It's to handle the case where the requested window size
doesn't match a whole number of levels.

It seems a rather counter-intuitive way of handling it to me -
tracking the amount of memory allocated at the leaf level, rather than
tracking what window offset you're

Re: [PATCH kernel v11 09/34] vfio: powerpc/spapr: Move locked_vm accounting to helpers

2015-06-03 Thread David Gibson
On Wed, Jun 03, 2015 at 09:11:09PM +1000, Alexey Kardashevskiy wrote:
 On 06/01/2015 02:28 PM, David Gibson wrote:
 On Fri, May 29, 2015 at 06:44:33PM +1000, Alexey Kardashevskiy wrote:
 This moves locked pages accounting to helpers.
 Later they will be reused for Dynamic DMA windows (DDW).
 
 This reworks debug messages to show the current value and the limit.
 
 This stores the locked pages number in the container so when unlocking
 the iommu table pointer won't be needed. This does not have an effect
 now but it will with the multiple tables per container as then we will
 allow attaching/detaching groups on fly and we may end up having
 a container with no group attached but with the counter incremented.
 
 While we are here, update the comment explaining why RLIMIT_MEMLOCK
 might be required to be bigger than the guest RAM. This also prints
 pid of the current process in pr_warn/pr_debug.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com
 Reviewed-by: David Gibson da...@gibson.dropbear.id.au
 Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com
 ---
 Changes:
 v4:
 * new helpers do nothing if @npages == 0
 * tce_iommu_disable() now can decrement the counter if the group was
 detached (not possible now but will be in the future)
 ---
   drivers/vfio/vfio_iommu_spapr_tce.c | 82 
  -
   1 file changed, 63 insertions(+), 19 deletions(-)
 
 diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
 b/drivers/vfio/vfio_iommu_spapr_tce.c
 index 64300cc..40583f9 100644
 --- a/drivers/vfio/vfio_iommu_spapr_tce.c
 +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
 @@ -29,6 +29,51 @@
   static void tce_iommu_detach_group(void *iommu_data,
 struct iommu_group *iommu_group);
 
 +static long try_increment_locked_vm(long npages)
 +{
 +   long ret = 0, locked, lock_limit;
 +
 +   if (!current || !current->mm)
 +   return -ESRCH; /* process exited */
 +
 +   if (!npages)
 +   return 0;
 +
 +   down_write(&current->mm->mmap_sem);
 +   locked = current->mm->locked_vm + npages;
 
 Is there a possibility of userspace triggering an integer overflow
 here, if npages is really huge?
 
 
 I do not see how. I just do not accept npages bigger than the host RAM size
 in pages. And it is long. For (lets say) 128GB host, the number of 4KB
 pages is (128<<30)/4096=33554432.

Ah, yes, npages has already been shifted right so it should be safe. Ok.

 
 
 
 +   lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 +   if (locked > lock_limit && !capable(CAP_IPC_LOCK))
 +   ret = -ENOMEM;
 +   else
 +   current->mm->locked_vm += npages;
 +
 +   pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
 +   npages << PAGE_SHIFT,
 +   current->mm->locked_vm << PAGE_SHIFT,
 +   rlimit(RLIMIT_MEMLOCK),
 +   ret ? " - exceeded" : "");
 +
 +   up_write(&current->mm->mmap_sem);
 +
 +   return ret;
 +}
 +
 +static void decrement_locked_vm(long npages)
 +{
 +   if (!current || !current->mm || !npages)
 +   return; /* process exited */
 +
 +   down_write(&current->mm->mmap_sem);
 +   if (npages > current->mm->locked_vm)
 +   npages = current->mm->locked_vm;
 
 Can this case ever occur (without there being a leak bug somewhere
 else in the code)?
 
 
 It should not. Safety measure. Having a warning here might make sense but I
 believe if this happens, there will be many, many warnings in other places
 :)

Ok.  It would be nice to see a WARN_ON() as documentation that this
isn't a situation that should ever happen.  I wouldn't nack on that
basis alone though.
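
Concretely, the suggestion amounts to something like this sketch (not
the code as merged):

	down_write(&current->mm->mmap_sem);
	/* Document that underflow is a bug, then clamp anyway. */
	if (WARN_ON_ONCE(npages > current->mm->locked_vm))
		npages = current->mm->locked_vm;
	current->mm->locked_vm -= npages;
	up_write(&current->mm->mmap_sem);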

 +   current->mm->locked_vm -= npages;
 +   pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
 +   npages << PAGE_SHIFT,
 +   current->mm->locked_vm << PAGE_SHIFT,
 +   rlimit(RLIMIT_MEMLOCK));
 +   up_write(&current->mm->mmap_sem);
 +}
 +
   /*
* VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
*
 @@ -45,6 +90,7 @@ struct tce_container {
 struct mutex lock;
 struct iommu_table *tbl;
 bool enabled;
 +   unsigned long locked_pages;
   };
 
   static bool tce_page_is_contained(struct page *page, unsigned page_shift)
 @@ -60,7 +106,7 @@ static bool tce_page_is_contained(struct page *page, 
 unsigned page_shift)
   static int tce_iommu_enable(struct tce_container *container)
   {
 int ret = 0;
 -   unsigned long locked, lock_limit, npages;
 +   unsigned long locked;
 struct iommu_table *tbl = container->tbl;
 
 if (!container->tbl)
 @@ -89,21 +135,22 @@ static int tce_iommu_enable(struct tce_container 
 *container)
  * Also we don't have a nice way to fail on H_PUT_TCE due to ulimits,
  * that would effectively kill the guest at random points, much better
  * enforcing the limit based on the max that the guest can map.
 +*
 +* Unfortunately at the moment it counts whole tables, no matter how
 +* much

Re: [PATCH 2/2] KVM: PPC: Book3S HV: Implement dynamic micro-threading on POWER8

2015-06-02 Thread David Gibson
On Thu, May 28, 2015 at 03:17:20PM +1000, Paul Mackerras wrote:
 This builds on the ability to run more than one vcore on a physical
 core by using the micro-threading (split-core) modes of the POWER8
 chip.  Previously, only vcores from the same VM could be run together,
 and (on POWER8) only if they had just one thread per core.  With the
 ability to split the core on guest entry and unsplit it on guest exit,
 we can run up to 8 vcpu threads from up to 4 different VMs, and we can
 run multiple vcores with 2 or 4 vcpus per vcore.
 
 Dynamic micro-threading is only available if the static configuration
 of the cores is whole-core mode (unsplit), and only on POWER8.
 
 To manage this, we introduce a new kvm_split_mode struct which is
 shared across all of the subcores in the core, with a pointer in the
 paca on each thread.  In addition we extend the core_info struct to
 have information on each subcore.  When deciding whether to add a
 vcore to the set already on the core, we now have two possibilities:
 (a) piggyback the vcore onto an existing subcore, or (b) start a new
 subcore.
 
 Currently, when any vcpu needs to exit the guest and switch to host
 virtual mode, we interrupt all the threads in all subcores and switch
 the core back to whole-core mode.  It may be possible in future to
 allow some of the subcores to keep executing in the guest while
 subcore 0 switches to the host, but that is not implemented in this
 patch.
 
 This adds a module parameter called dynamic_mt_modes which controls
 which micro-threading (split-core) modes the code will consider, as a
 bitmap.  In other words, if it is 0, no micro-threading mode is
 considered; if it is 2, only 2-way micro-threading is considered; if
 it is 4, only 4-way, and if it is 6, both 2-way and 4-way
 micro-threading mode will be considered.  The default is 6.
 
 With this, we now have secondary threads which are the primary thread
 for their subcore and therefore need to do the MMU switch.  These
 threads will need to be started even if they have no vcpu to run, so
 we use the vcore pointer in the PACA rather than the vcpu pointer to
 trigger them.
 
 Signed-off-by: Paul Mackerras pau...@samba.org

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


Re: [PATCH 1/2] KVM: PPC: Book3S HV: Make use of unused threads when running guests

2015-06-02 Thread David Gibson
On Thu, May 28, 2015 at 03:17:19PM +1000, Paul Mackerras wrote:
 When running a virtual core of a guest that is configured with fewer
 threads per core than the physical cores have, the extra physical
 threads are currently unused.  This makes it possible to use them to
 run one or more other virtual cores from the same guest when certain
 conditions are met.  This applies on POWER7, and on POWER8 to guests
 with one thread per virtual core.  (It doesn't apply to POWER8 guests
 with multiple threads per vcore because they require a 1-1 virtual to
 physical thread mapping in order to be able to use msgsndp and the
 TIR.)
 
 The idea is that we maintain a list of preempted vcores for each
 physical cpu (i.e. each core, since the host runs single-threaded).
 Then, when a vcore is about to run, it checks to see if there are
 any vcores on the list for its physical cpu that could be
 piggybacked onto this vcore's execution.  If so, those additional
 vcores are put into state VCORE_PIGGYBACK and their runnable VCPU
 threads are started as well as the original vcore, which is called
 the master vcore.
 
 After the vcores have exited the guest, the extra ones are put back
 onto the preempted list if any of their VCPUs are still runnable and
 not idle.
 
 This means that vcpu->arch.ptid is no longer necessarily the same as
 the physical thread that the vcpu runs on.  In order to make it easier
 for code that wants to send an IPI to know which CPU to target, we
 now store that in a new field in struct vcpu_arch, called thread_cpu.
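
(For reference, a hedged sketch of how an IPI sender would use the new
field; kvmppc_kick_vcpu is an illustrative name, not necessarily the
patch's:)

static void kvmppc_kick_vcpu(struct kvm_vcpu *vcpu)
{
	int cpu = vcpu->arch.thread_cpu;	/* physical thread running the vcpu */

	if (cpu >= 0)
		smp_send_reschedule(cpu);	/* poke that hardware thread */
}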
 
 Signed-off-by: Paul Mackerras pau...@samba.org

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpGb5ERdypMG.pgp
Description: PGP signature



Re: [PATCH kernel v11 20/34] powerpc/powernv/ioda2: Move TCE kill register address to PE

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:44PM +1000, Alexey Kardashevskiy wrote:
 At the moment the DMA setup code looks for the ibm,opal-tce-kill
 property which contains the TCE kill register address. Writing to
 this register invalidates TCE cache on IODA/IODA2 hub.
 
 This moves the register address from iommu_table to pnv_phb as this
 register belongs to PHB and invalidates TCE cache for all tables of
 all attached PEs.
 
 This moves the property reading/remapping code to a helper which is
 called when DMA is being configured for PE and which does DMA setup
 for both IODA1 and IODA2.
 
 This adds a new pnv_pci_ioda2_tce_invalidate_entire() helper which
 invalidates cache for the entire table. It should be called after
 every call to opal_pci_map_pe_dma_window(). It was not required before
 because there was just a single TCE table and 64bit DMA was handled via
 bypass window (which has no table so no cache was used) but this is going
 to change with Dynamic DMA windows (DDW).
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpoCyC0dlyWS.pgp
Description: PGP signature


Re: [PATCH kernel v11 17/34] powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:41PM +1000, Alexey Kardashevskiy wrote:
 Modern IBM POWERPC systems support multiple (currently two) TCE tables
 per IOMMU group (a.k.a. PE). This adds an iommu_table_group container
 for TCE tables. Right now just one table is supported.
 
 For IODA, instead of embedding iommu_table, the new iommu_table_group
 keeps pointers to those. The iommu_table structs are allocated
 dynamically now by a pnv_pci_table_alloc() helper as PCI hotplug
 code (for EEH recovery) and SRIOV are supported there.
 
 For P5IOC2, both iommu_table_group and iommu_table are embedded into
 PE struct. As there is no EEH and SRIOV support for P5IOC2,
 iommu_free_table() should not be called on iommu_table struct pointers
 so we can keep it embedded in pnv_phb::p5ioc2.
 
 For pSeries, this replaces multiple calls of kzalloc_node() with a new
 iommu_pseries_group_alloc() helper and stores the table group struct
 pointer into the pci_dn struct. For release, a iommu_table_group_free()
 helper is added.
 
 This moves iommu_table struct allocation from SR-IOV code to
 the generic DMA initialization code in pnv_pci_ioda2_setup_dma_pe.
 
 This replaces a single pointer to iommu_group with a list of
 iommu_table_group structs. For now it is just a single iommu_table_group
 in this list but later with TCE table sharing enabled, the list will
 keep all the IOMMU groups which use the particular table. The list
 uses iommu_table_group_link structs rather than iommu_table_group::next
 as a VFIO container may have 2 IOMMU tables, each will have its own list
 head pointer as it is mainly for TCE invalidation code which should
 walk through all attached groups and invalidate TCE cache so
 the table has to keep the list head pointer. The other option would
 be storing list head in a VFIO container but it would not work as
 the platform code (which does TCE table update and invalidation) has
 no idea about VFIO.
 
 This should cause no behavioural change.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com
 Reviewed-by: David Gibson da...@gibson.dropbear.id.au
 Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com

It looks like this commit message doesn't match the code - it seems
like an older or newer version of the message from the previous patch.

This patch seems instead to be about changing the table_group <-> table
relationship from 1:1 to many:many.
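
For readers following along, a rough sketch of what the many:many link
lets invalidation code do (illustrative only; invalidate_pe_tce_cache is
a hypothetical helper, not the patch's code):

	struct iommu_table_group_link *tgl;

	/* walk every group this table is attached to and flush its TCE cache */
	list_for_each_entry_rcu(tgl, &tbl->it_group_list, next)
		invalidate_pe_tce_cache(tgl->table_group);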

 ---
 Changes:
 v10:
 * iommu_table is not embedded into iommu_table_group but allocated
 dynamically
 * iommu_table allocation is moved to a single place for IODA2's
 pnv_pci_ioda_setup_dma_pe where it belongs to
 * added list of groups into iommu_table; most of the code just looks at
 the first item to keep the patch simpler
 
 v9:
 * s/it_group/it_table_group/
 * added and used iommu_table_group_free(), from now iommu_free_table()
 is only used for VIO
 * added iommu_pseries_group_alloc()
 * squashed powerpc/iommu: Introduce iommu_table_alloc() helper into this
 ---
  arch/powerpc/include/asm/iommu.h|   8 +-
  arch/powerpc/kernel/iommu.c |   9 +-
  arch/powerpc/platforms/powernv/pci-ioda.c   |  45 ++
  arch/powerpc/platforms/powernv/pci-p5ioc2.c |   3 +
  arch/powerpc/platforms/powernv/pci.c|  76 +
  arch/powerpc/platforms/powernv/pci.h|   7 ++
  arch/powerpc/platforms/pseries/iommu.c  |  33 +++-
  drivers/vfio/vfio_iommu_spapr_tce.c | 122 
 
  8 files changed, 242 insertions(+), 61 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/iommu.h 
 b/arch/powerpc/include/asm/iommu.h
 index 5a7267f..44a20cc 100644
 --- a/arch/powerpc/include/asm/iommu.h
 +++ b/arch/powerpc/include/asm/iommu.h
 @@ -91,7 +91,7 @@ struct iommu_table {
   struct iommu_pool pools[IOMMU_NR_POOLS];
   unsigned long *it_map;   /* A simple allocation bitmap for now */
   unsigned long  it_page_shift;/* table iommu page size */
 - struct iommu_table_group *it_table_group;
 + struct list_head it_group_list;/* List of iommu_table_group_link */
   struct iommu_table_ops *it_ops;
   void (*set_bypass)(struct iommu_table *tbl, bool enable);
  };
 @@ -126,6 +126,12 @@ extern struct iommu_table *iommu_init_table(struct 
 iommu_table * tbl,
   int nid);
  #define IOMMU_TABLE_GROUP_MAX_TABLES 1
  
 +struct iommu_table_group_link {
 + struct list_head next;
 + struct rcu_head rcu;
 + struct iommu_table_group *table_group;
 +};
 +
  struct iommu_table_group {
   struct iommu_group *group;
   struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
 diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
 index 719f048..e305a8f 100644
 --- a/arch/powerpc/kernel/iommu.c
 +++ b/arch/powerpc/kernel/iommu.c
 @@ -1078,6 +1078,7 @@ EXPORT_SYMBOL_GPL(iommu_release_ownership);
  int iommu_add_device(struct device

Re: [PATCH kernel v11 18/34] vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:42PM +1000, Alexey Kardashevskiy wrote:
 This adds tce_iommu_take_ownership() and tce_iommu_release_ownership
 which call in a loop iommu_take_ownership()/iommu_release_ownership()
 for every table on the group. As there is just one now, no change in
 behaviour is expected.
 
 At the moment the iommu_table struct has a set_bypass() which enables/
 disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
 which calls this callback when external IOMMU users such as VFIO are
 about to get over a PHB.
 
 The set_bypass() callback is not really an iommu_table function but
 IOMMU/PE function. This introduces a iommu_table_group_ops struct and
 adds take_ownership()/release_ownership() callbacks to it which are
 called when an external user takes/releases control over the IOMMU.
 
 This replaces set_bypass() with ownership callbacks as it is not
 necessarily just bypass enabling, it can be something else or more,
 so let's give it a more generic name.
 
 The callbacks are implemented for IODA2 only. Other platforms (P5IOC2,
 IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
 The following patches will replace iommu_take_ownership/
 iommu_release_ownership calls in IODA2 with full IOMMU table release/
 create.
 
 While we are here touching bypass control, this removes
 pnv_pci_ioda2_setup_bypass_pe() as it does not do much
 more than pnv_pci_ioda2_set_bypass(). This moves tce_bypass_base
 initialization to pnv_pci_ioda2_setup_dma_pe.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com
 Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpSeNI0zax8E.pgp
Description: PGP signature


Re: [PATCH kernel v11 21/34] powerpc/powernv/ioda2: Add TCE invalidation for all attached groups

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:45PM +1000, Alexey Kardashevskiy wrote:
 The iommu_table struct keeps a list of IOMMU groups it is used for.
 At the moment there is just a single group attached but further
 patches will add TCE table sharing. When sharing is enabled, TCE cache
 in each PE needs to be invalidated, which is what this patch does.
 
 This does not change pnv_pci_ioda1_tce_invalidate() as there is no plan
 to enable TCE table sharing on PHBs older than IODA2.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgp3Qap2rncgo.pgp
Description: PGP signature


Re: [PATCH kernel v11 23/34] powerpc/iommu/powernv: Release replaced TCE

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:47PM +1000, Alexey Kardashevskiy wrote:
 At the moment writing new TCE value to the IOMMU table fails with EBUSY
 if there is a valid entry already. However PAPR specification allows
 the guest to write new TCE value without clearing it first.
 
 Another problem this patch is addressing is the use of pool locks for
 external IOMMU users such as VFIO. The pool locks are to protect
 DMA page allocator rather than entries and since the host kernel does
 not control what pages are in use, there is no point in pool locks and
 exchange()+put_page(oldtce) is sufficient to avoid possible races.
 
 This adds an exchange() callback to iommu_table_ops which does the same
 thing as set() plus it returns replaced TCE and DMA direction so
 the caller can release the pages afterwards. The exchange() receives
 a physical address unlike set() which receives linear mapping address;
 and returns a physical address as the clear() does.
 
 This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
 for a platform to have exchange() implemented in order to support VFIO.
 
 This replaces iommu_tce_build() and iommu_clear_tce() with
 a single iommu_tce_xchg().
 
 This makes sure that TCE permission bits are not set in TCE passed to
 IOMMU API as those are to be calculated by platform code from
 DMA direction.
 
 This moves SetPageDirty() to the IOMMU code to make it work for both
 the VFIO ioctl interface and in-kernel TCE acceleration (when it becomes
 available later).
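
A sketch of the caller-side pattern this enables (illustrative, not
quoted from the patch):

	unsigned long hpa = new_hpa;			/* physical address to install */
	enum dma_data_direction dir = DMA_BIDIRECTIONAL;

	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
	if (!ret && dir != DMA_NONE)
		put_page(pfn_to_page(hpa >> PAGE_SHIFT));	/* drop the replaced page */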
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpzSxt7XbJHf.pgp
Description: PGP signature


Re: [PATCH kernel v11 25/34] powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:49PM +1000, Alexey Kardashevskiy wrote:
 This is a part of moving TCE table allocation into an iommu_ops
 callback to support multiple IOMMU groups per one VFIO container.
 
 This moves the code which allocates the actual TCE tables to helpers:
 pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages().
 These do not allocate/free the iommu_table struct.
 
 This enforces window size to be a power of two.
 
 This should cause no behavioural change.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgplT2uFUf7Sj.pgp
Description: PGP signature


Re: [PATCH kernel v11 27/34] powerpc/powernv: Implement multilevel TCE tables

2015-06-01 Thread David Gibson
);
 + unsigned long mask = (tbl->it_level_size - 1) << (level * shift);
 +
 + while (level) {
 + int n = (idx & mask) >> (level * shift);
 + unsigned long tce = be64_to_cpu(tmp[n]);
 +
 + tmp = __va(tce & ~(TCE_PCI_READ | TCE_PCI_WRITE));
 + idx &= ~mask;
 + mask >>= shift;
 + --level;
 + }
 
  return tmp + idx;
  }

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpqF6QB1GbYY.pgp
Description: PGP signature


Re: [PATCH kernel v11 26/34] powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:50PM +1000, Alexey Kardashevskiy wrote:
 This is a part of moving DMA window programming to an iommu_ops
 callback. pnv_pci_ioda2_set_window() takes an iommu_table_group as
 a first parameter (not pnv_ioda_pe) as it is going to be used as
 a callback for VFIO DDW code.
 
 This adds pnv_pci_ioda2_tvt_invalidate() to invalidate TVT as it is

I'm assuming that's what's now called pnv_pci_ioda2_invalidate_entire()?

 a good thing to do. It does not have immediate effect now as the table
 is never recreated after reboot but it will in the following patches.
 
 This should cause no behavioural change.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 Reviewed-by: David Gibson da...@gibson.dropbear.id.au
 Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com
 ---
 Changes:
 v11:
 * replaced some 1 << it_page_shift with IOMMU_PAGE_SIZE() macro
 
 v9:
 * initialize pe->table_group.tables[0] at the very end when
 tbl is fully initialized
 * moved pnv_pci_ioda2_tvt_invalidate() from earlier patch
 ---
  arch/powerpc/platforms/powernv/pci-ioda.c | 47 
 +--
  1 file changed, 38 insertions(+), 9 deletions(-)
 
 diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
 b/arch/powerpc/platforms/powernv/pci-ioda.c
 index 3d29fe3..fda01c1 100644
 --- a/arch/powerpc/platforms/powernv/pci-ioda.c
 +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
 @@ -1968,6 +1968,43 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb 
 *phb,
   }
  }
  
 +static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
 + int num, struct iommu_table *tbl)
 +{
 + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
 + table_group);
 + struct pnv_phb *phb = pe->phb;
 + int64_t rc;
 + const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
 + const __u64 win_size = tbl->it_size << tbl->it_page_shift;
 +
 + pe_info(pe, "Setting up window %llx..%llx pg=%x\n",
 + start_addr, start_addr + win_size - 1,
 + IOMMU_PAGE_SIZE(tbl));
 +
 + /*
 +  * Map TCE table through TVT. The TVE index is the PE number
 +  * shifted by 1 bit for 32-bits DMA space.
 +  */
 + rc = opal_pci_map_pe_dma_window(phb->opal_id,
 + pe->pe_number,
 + pe->pe_number << 1,
 + 1,
 + __pa(tbl->it_base),
 + tbl->it_size << 3,
 + IOMMU_PAGE_SIZE(tbl));
 + if (rc) {
 + pe_err(pe, "Failed to configure TCE table, err %ld\n", rc);
 + return rc;
 + }
 +
 + pnv_pci_link_table_and_group(phb->hose->node, num,
 + tbl, &pe->table_group);
 + pnv_pci_ioda2_tce_invalidate_entire(pe);
 +
 + return 0;
 +}
 +
  static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
  {
   uint16_t window_id = (pe->pe_number << 1 ) + 1;
 @@ -2123,21 +2160,13 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
 *phb,
   pe->table_group.ops = &pnv_pci_ioda2_ops;
  #endif
  
 - /*
 -  * Map TCE table through TVT. The TVE index is the PE number
 -  * shifted by 1 bit for 32-bits DMA space.
 -  */
 - rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
 - pe->pe_number << 1, 1, __pa(tbl->it_base),
 - tbl->it_size << 3, 1ULL << tbl->it_page_shift);
 + rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
   if (rc) {
   pe_err(pe, "Failed to configure 32-bit TCE table,"
   " err %ld\n", rc);
   goto fail;
   }
  
 - pnv_pci_ioda2_tce_invalidate_entire(pe);
 -
   /* OPAL variant of PHB3 invalidated TCEs */
   if (phb->ioda.tce_inval_reg)
   tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgptxgs23iUmd.pgp
Description: PGP signature


Re: [PATCH kernel v11 33/34] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-06-01 Thread David Gibson
 space memory where DMA is allowed. It pins
 + * user pages and does the locked memory accounting so
 + * subsequent VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA calls
 + * get faster.
 + */
 +struct vfio_iommu_spapr_register_memory {
 + __u32   argsz;
 + __u32   flags;
 + __u64   vaddr;  /* Process virtual address */
 + __u64   size;   /* Size of mapping (bytes) */
 +};
 +#define VFIO_IOMMU_SPAPR_REGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 17)
 +
 +/**
 + * VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 18, 
 struct vfio_iommu_spapr_register_memory)
 + *
 + * Unregisters user space memory registered with
 + * VFIO_IOMMU_SPAPR_REGISTER_MEMORY.
 + * Uses vfio_iommu_spapr_register_memory for parameters.
 + */
 +#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY   _IO(VFIO_TYPE, VFIO_BASE + 18)
 +
  /* * */
  
  #endif /* _UAPIVFIO_H */
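
A hedged userspace usage sketch for the new ioctl above (assumes
container_fd is a VFIO container fd already set to the sPAPR TCE v2
IOMMU type; the function name is illustrative):

#include <sys/ioctl.h>
#include <linux/vfio.h>

static int spapr_register_mem(int container_fd, void *buf, unsigned long size)
{
	struct vfio_iommu_spapr_register_memory reg = {
		.argsz = sizeof(reg),
		.flags = 0,
		.vaddr = (__u64)(unsigned long)buf,	/* process virtual address */
		.size  = size,				/* bytes to pin and account */
	};

	return ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
}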

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpidMRAVBCcs.pgp
Description: PGP signature


Re: [PATCH kernel v11 31/34] vfio: powerpc/spapr: powerpc/powernv/ioda2: Use DMA windows API in ownership control

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:55PM +1000, Alexey Kardashevskiy wrote:
 Previously, the IOMMU user (VFIO) would take control of the IOMMU table
 belonging to a specific IOMMU group. This approach did not allow sharing
 tables between IOMMU groups attached to the same container.
 
 This introduces a new IOMMU ownership flavour when the user can not
 just control the existing IOMMU table but remove/create tables on demand.
 If an IOMMU implements take/release_ownership() callbacks, this lets
 the user have full control over the IOMMU group. When the ownership
 is taken, the platform code removes all the windows so the caller must
 create them.
 Before returning the ownership back to the platform code, VFIO
 unprograms and removes all the tables it created.
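
Sketched out, the new flavour looks roughly like this from VFIO's side
(an illustrative flow under the ops described here, not literal code):

	table_group->ops->take_ownership(table_group);	/* platform removes its windows */
	/* ... VFIO creates and programs its own tables via the group ops ... */
	table_group->ops->release_ownership(table_group); /* platform restores defaults */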
 
 This changes IODA2's ownership handler to remove the existing table
 rather than manipulate it in place. From now on,
 iommu_take_ownership() and iommu_release_ownership() are only called
 from the vfio_iommu_spapr_tce driver.
 
 Old-style ownership is still supported allowing VFIO to run on older
 P5IOC2 and IODA IO controllers.
 
 No change in userspace-visible behaviour is expected. Since it recreates
 TCE tables on each ownership change, related kernel traces will appear
 more often.
 
 This adds a pnv_pci_ioda2_setup_default_config() which is called
 when PE is being configured at boot time and when the ownership is
 passed from VFIO to the platform code.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpGI6H0c3HKJ.pgp
Description: PGP signature


Re: [PATCH kernel v11 29/34] powerpc/powernv/ioda2: Use new helpers to do proper cleanup on PE release

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:53PM +1000, Alexey Kardashevskiy wrote:
 The existing code programmed TVT#0 with some address and then
 immediately released that memory.
 
 This makes use of pnv_pci_ioda2_unset_window() and
 pnv_pci_ioda2_set_bypass() which do correct resource release and
 TVT update.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpHcf_huXaFN.pgp
Description: PGP signature


Re: [PATCH kernel v11 32/34] powerpc/mmu: Add userspace-to-physical addresses translation cache

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:56PM +1000, Alexey Kardashevskiy wrote:
 We are adding support for DMA memory pre-registration to be used in
 conjunction with VFIO. The idea is that the userspace which is going to
 run a guest may want to pre-register a user space memory region so
 it all gets pinned once and never goes away. Having this done,
 a hypervisor will not have to pin/unpin pages on every DMA map/unmap
 request. This is going to help with multiple pinning of the same memory.
 
 Another use of it is in-kernel real mode (mmu off) acceleration of
 DMA requests where real time translation of guest physical to host
 physical addresses is non-trivial and may fail as linux ptes may be
 temporarily invalid. Also, having cached host physical addresses
 (compared to just pinning at the start and then walking the page table
 again on every H_PUT_TCE), we can be sure that the addresses which we put
 into TCE table are the ones we already pinned.
 
 This adds a list of memory regions to mm_context_t. Each region consists
 of a header and a list of physical addresses. This adds API to:
 1. register/unregister memory regions;
 2. do final cleanup (which puts all pre-registered pages);
 3. do userspace to physical address translation;
 4. manage usage counters; multiple registration of the same memory
 is allowed (once per container).
 
 This implements 2 counters per registered memory region:
 - @mapped: incremented on every DMA mapping; decremented on unmapping;
 initialized to 1 when a region is just registered; once it becomes zero,
 no more mappings are allowed;
 - @used: incremented on every register ioctl; decremented on
 unregister; unregistration is allowed for DMA mapped regions unless
 it is the very last reference. For the very last reference this checks
 that the region is still mapped and returns -EBUSY so the userspace
 gets to know that memory is still pinned and unregistration needs to
 be retried; @used remains 1.
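
A compact sketch of the two-counter rule just described (illustrative
types and names, not the patch's code):

struct mem_region {
	atomic_t mapped;	/* starts at 1; +1 per DMA map, -1 per unmap */
	unsigned int used;	/* +1 per register ioctl, -1 per unregister */
};

static long region_unregister(struct mem_region *r)
{
	/* last reference: only allowed if no DMA mappings remain */
	if (r->used == 1 && atomic_cmpxchg(&r->mapped, 1, 0) != 1)
		return -EBUSY;	/* still mapped, userspace should retry */
	--r->used;
	return 0;
}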
 
 Host physical addresses are stored in a vmalloc'ed array. In order to
 access these in the real mode (mmu off), there is a real_vmalloc_addr()
 helper. In-kernel acceleration patchset will move it from KVM to MMU code.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

There's a bunch of ways this could be cleaned up, but those can be
done later.  So,

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpxPkLIdT5rV.pgp
Description: PGP signature


Re: [PATCH kernel v11 34/34] vfio: powerpc/spapr: Support Dynamic DMA windows

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:58PM +1000, Alexey Kardashevskiy wrote:
 This adds create/remove window ioctls to create and remove DMA windows.
 sPAPR defines a Dynamic DMA windows capability which allows
 para-virtualized guests to create additional DMA windows on a PCI bus.
 Existing Linux kernels use this new window to map the entire guest
 memory and switch to direct DMA operations, saving time on map/unmap
 requests which would otherwise happen in large numbers.
 
 This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and
 VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows.
 Up to 2 windows are supported now by the hardware and by this driver.
 
 This changes VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional
 information such as a number of supported windows and maximum number
 levels of TCE tables.
 
 DDW is added as a capability, not as a SPAPR TCE IOMMU v2 unique feature
 as we still want to support v2 on platforms which cannot do DDW for
 the sake of TCE acceleration in KVM (coming soon).
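
For illustration, how a userspace driver might create a second window
with the new ioctl (a sketch; the window parameters are examples only
and setup_dma is a hypothetical follow-up):

	struct vfio_iommu_spapr_tce_create create = {
		.argsz = sizeof(create),
		.page_shift = 16,		/* 64K IOMMU pages */
		.window_size = 1ULL << 32,	/* 4GB window, for example */
		.levels = 1,
	};

	if (ioctl(container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create) == 0)
		/* create.start_addr now holds the bus address of the new window */
		setup_dma(create.start_addr);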
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgptc8o5ntY8T.pgp
Description: PGP signature


Re: [PATCH kernel v11 28/34] vfio: powerpc/spapr: powerpc/powernv/ioda: Define and implement DMA windows API

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:52PM +1000, Alexey Kardashevskiy wrote:
 This extends iommu_table_group_ops by a set of callbacks to support
 dynamic DMA windows management.
 
 create_table() creates a TCE table with specific parameters.
 it receives iommu_table_group to know nodeid in order to allocate
 TCE table memory closer to the PHB. The exact format of allocated
 multi-level table might be also specific to the PHB model (not
 the case now though).
 This callback calculates the DMA window offset on a PCI bus from @num
 and stores it in the just created table.
 
 set_window() sets the window at specified TVT index + @num on PHB.
 
 unset_window() unsets the window from specified TVT.
 
 This adds a free() callback to iommu_table_ops to free the memory
 (potentially a tree of tables) allocated for the TCE table.
 
 create_table() and free() are supposed to be called once per
 VFIO container and set_window()/unset_window() are supposed to be
 called for every group in a container.
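
In other words, a hedged sketch of the intended calling pattern (the
container list and iterator names are illustrative):

	/* once per container: allocate the table */
	table_group->ops->create_table(table_group, num, page_shift,
				       window_size, levels, &tbl);
	/* then for every group attached to the container: program the TVT */
	list_for_each_entry(tcegrp, &container->group_list, next)
		tcegrp->table_group->ops->set_window(tcegrp->table_group, num, tbl);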
 
 This adds IOMMU capabilities to iommu_table_group such as default
 32bit window parameters and others. This makes use of new values in
 vfio_iommu_spapr_tce. IODA1/P5IOC2 do not support DDW so they do not
 advertise pagemasks to the userspace.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 Acked-by: Alex Williamson alex.william...@redhat.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpLhpGvd0Imy.pgp
Description: PGP signature


Re: [PATCH kernel v11 30/34] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:54PM +1000, Alexey Kardashevskiy wrote:
 This adds a way for the IOMMU user to know how much a new table will
 use so it can be accounted in the locked_vm limit before allocation
 happens.
 
 This stores the allocated table size in pnv_pci_ioda2_get_table_size()
 so the locked_vm counter can be updated correctly when a table is
 being disposed.
 
 This defines an iommu_table_group_ops callback to let VFIO know
 how much memory will be locked if a table is created.
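
Back-of-envelope, what such a size calculation accounts for: a
single-level window needs one 8-byte TCE per IOMMU page, so roughly
(an illustrative helper, not the patch's formula):

static unsigned long tce_table_bytes(unsigned long window_size,
				     unsigned int page_shift)
{
	return (window_size >> page_shift) * 8;	/* 8 bytes per TCE entry */
}

Multi-level tables add the intermediate level pages on top of this.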
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpBOujNE4LxQ.pgp
Description: PGP signature


Re: [PATCH kernel v11 04/34] powerpc/iommu: Put IOMMU group explicitly

2015-05-31 Thread David Gibson
On Fri, May 29, 2015 at 06:44:28PM +1000, Alexey Kardashevskiy wrote:
 So far an iommu_table's lifetime was the same as its PE's. Dynamic DMA windows
 will change this and iommu_free_table() will not always require
 the group to be released.
 
 This moves iommu_group_put() out of iommu_free_table().
 
 This adds a iommu_pseries_free_table() helper which does
 iommu_group_put() and iommu_free_table(). Later it will be
 changed to receive a table_group, so we will have to change fewer
 lines then.
 
 This should cause no behavioural change.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpllzwh44Hes.pgp
Description: PGP signature


Re: [PATCH kernel v11 01/34] powerpc/eeh/ioda2: Use device::iommu_group to check IOMMU group

2015-05-31 Thread David Gibson
On Fri, May 29, 2015 at 06:44:25PM +1000, Alexey Kardashevskiy wrote:
 This relies on the fact that a PCI device always has an IOMMU table
 which may not be the case when we get dynamic DMA windows, so
 let's use a more reliable check for the IOMMU group here.
 
 As we do not rely on the table presence here, remove the workaround
 from pnv_pci_ioda2_set_bypass(); also remove the @add_to_iommu_group
 parameter from pnv_ioda_setup_bus_dma().
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 Acked-by: Gavin Shan gws...@linux.vnet.ibm.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpMAeUXkkiNQ.pgp
Description: PGP signature


Re: [PATCH kernel v11 03/34] powerpc/powernv/ioda: Clean up IOMMU group registration

2015-05-31 Thread David Gibson
On Fri, May 29, 2015 at 06:44:27PM +1000, Alexey Kardashevskiy wrote:
 The existing code has 3 calls to iommu_register_group() and
 all 3 branches actually cover all possible cases.
 
 This replaces 3 calls with one and moves the registration earlier;
 the latter will make more sense when we add TCE table sharing.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpU6Y2PERFhD.pgp
Description: PGP signature


Re: [PATCH kernel v11 05/34] powerpc/iommu: Always release iommu_table in iommu_free_table()

2015-05-31 Thread David Gibson
On Fri, May 29, 2015 at 06:44:29PM +1000, Alexey Kardashevskiy wrote:
 At the moment iommu_free_table() only releases memory if
 the table was initialized for platform code use, i.e. it had
 it_map initialized (whose purpose is to track DMA memory space use).
 
 With dynamic DMA windows, we will need to be able to release
 an iommu_table even if it was used for VFIO, in which case it_map is
 NULL; that is what this patch does.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpIOpAYk9qeN.pgp
Description: PGP signature


Re: [PATCH kernel v11 02/34] powerpc/iommu/powernv: Get rid of set_iommu_table_base_and_group

2015-05-31 Thread David Gibson
On Fri, May 29, 2015 at 06:44:26PM +1000, Alexey Kardashevskiy wrote:
 The set_iommu_table_base_and_group() name suggests that the function
 sets the table base and adds a device to an IOMMU group.
 
 The actual purpose of setting the table base is to put a reference
 into a device so that later iommu_add_device() can get the IOMMU group
 reference and add the device to the group.
 
 At the moment a group cannot be explicitly passed to iommu_add_device()
 as we want it to work from the bus notifier, we can fix it later and
 remove confusing calls of set_iommu_table_base().
 
 This replaces set_iommu_table_base_and_group() with a couple of
 set_iommu_table_base() + iommu_add_device() which makes reading the code
 easier.
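
That is, the open-coded pair reads (a sketch of the pattern, with pdev
standing in for whichever device is being set up):

	set_iommu_table_base(&pdev->dev, tbl);	/* stash the table reference */
	iommu_add_device(&pdev->dev);		/* attach device to its IOMMU group */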
 
 This adds few comments why set_iommu_table_base() and iommu_add_device()
 are called where they are called.
 
 For IODA1/2, this essentially removes iommu_add_device() call from
 the pnv_pci_ioda_dma_dev_setup() as it will always fail at this particular
 place:
 - for physical PE, the device is already attached by iommu_add_device()
 in pnv_pci_ioda_setup_dma_pe();
 - for virtual PE, the sysfs entries are not ready to create all symlinks
 so actual adding is happening in tce_iommu_bus_notifier.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgp5SRDiXfhXF.pgp
Description: PGP signature


Re: [PATCH kernel v11 16/34] powerpc/spapr: vfio: Replace iommu_table with iommu_table_group

2015-05-31 Thread David Gibson
On Fri, May 29, 2015 at 06:44:40PM +1000, Alexey Kardashevskiy wrote:
 Modern IBM POWERPC systems support multiple (currently two) TCE tables
 per IOMMU group (a.k.a. PE). This adds an iommu_table_group container
 for TCE tables. Right now just one table is supported.
 
 This defines iommu_table_group struct which stores pointers to
 iommu_group and iommu_table(s). This replaces iommu_table with
 iommu_table_group where iommu_table was used to identify a group:
 - iommu_register_group();
 - iommudata of generic iommu_group;
 
 This removes @data from iommu_table as it_table_group provides
 same access to pnv_ioda_pe.
 
 For IODA, instead of embedding iommu_table, the new iommu_table_group
 keeps pointers to those. The iommu_table structs are allocated
 dynamically.
 
 For P5IOC2, both iommu_table_group and iommu_table are embedded into
 PE struct. As there is no EEH and SRIOV support for P5IOC2,
 iommu_free_table() should not be called on iommu_table struct pointers
 so we can keep it embedded in pnv_phb::p5ioc2.
 
 For pSeries, this replaces multiple calls of kzalloc_node() with a new
 iommu_pseries_alloc_group() helper and stores the table group struct
 pointer into the pci_dn struct. For release, a iommu_table_free_group()
 helper is added.
 
 This moves iommu_table struct allocation from SR-IOV code to
 the generic DMA initialization code in pnv_pci_ioda_setup_dma_pe and
 pnv_pci_ioda2_setup_dma_pe as this is where DMA is actually initialized.
 This change is here because those lines had to be changed anyway.
 
 This should cause no behavioural change.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpQIcO5eSGXN.pgp
Description: PGP signature


Re: [PATCH kernel v11 09/34] vfio: powerpc/spapr: Move locked_vm accounting to helpers

2015-05-31 Thread David Gibson
On Fri, May 29, 2015 at 06:44:33PM +1000, Alexey Kardashevskiy wrote:
 This moves locked pages accounting to helpers.
 Later they will be reused for Dynamic DMA windows (DDW).
 
 This reworks debug messages to show the current value and the limit.
 
 This stores the locked pages number in the container so when unlocking
 the iommu table pointer won't be needed. This does not have an effect
 now but it will with the multiple tables per container as then we will
 allow attaching/detaching groups on fly and we may end up having
 a container with no group attached but with the counter incremented.
 
 While we are here, update the comment explaining why RLIMIT_MEMLOCK
 might be required to be bigger than the guest RAM. This also prints
 pid of the current process in pr_warn/pr_debug.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com
 Reviewed-by: David Gibson da...@gibson.dropbear.id.au
 Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com
 ---
 Changes:
 v4:
 * new helpers do nothing if @npages == 0
 * tce_iommu_disable() now can decrement the counter if the group was
 detached (not possible now but will be in the future)
 ---
  drivers/vfio/vfio_iommu_spapr_tce.c | 82 
 -
  1 file changed, 63 insertions(+), 19 deletions(-)
 
 diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
 b/drivers/vfio/vfio_iommu_spapr_tce.c
 index 64300cc..40583f9 100644
 --- a/drivers/vfio/vfio_iommu_spapr_tce.c
 +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
 @@ -29,6 +29,51 @@
  static void tce_iommu_detach_group(void *iommu_data,
   struct iommu_group *iommu_group);
  
 +static long try_increment_locked_vm(long npages)
 +{
 + long ret = 0, locked, lock_limit;
 +
 + if (!current || !current->mm)
 + return -ESRCH; /* process exited */
 +
 + if (!npages)
 + return 0;
 +
 + down_write(&current->mm->mmap_sem);
 + locked = current->mm->locked_vm + npages;

Is there a possibility of userspace triggering an integer overflow
here, if npages is really huge?
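
(One illustrative guard, not in the posted patch, would be to reject a
request before the addition can wrap:

	if (npages < 0 || npages > LONG_MAX - current->mm->locked_vm)
		return -EINVAL;
)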

 + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 + if (locked > lock_limit && !capable(CAP_IPC_LOCK))
 + ret = -ENOMEM;
 + else
 + current->mm->locked_vm += npages;
 +
 + pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
 + npages << PAGE_SHIFT,
 + current->mm->locked_vm << PAGE_SHIFT,
 + rlimit(RLIMIT_MEMLOCK),
 + ret ? " - exceeded" : "");
 +
 + up_write(&current->mm->mmap_sem);
 +
 + return ret;
 +}
 +
 +static void decrement_locked_vm(long npages)
 +{
 + if (!current || !current->mm || !npages)
 + return; /* process exited */
 +
 + down_write(&current->mm->mmap_sem);
 + if (npages > current->mm->locked_vm)
 + npages = current->mm->locked_vm;

Can this case ever occur (without there being a leak bug somewhere
else in the code)?

 + current->mm->locked_vm -= npages;
 + pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
 + npages << PAGE_SHIFT,
 + current->mm->locked_vm << PAGE_SHIFT,
 + rlimit(RLIMIT_MEMLOCK));
 + up_write(&current->mm->mmap_sem);
 +}
 +
  /*
   * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
   *
 @@ -45,6 +90,7 @@ struct tce_container {
   struct mutex lock;
   struct iommu_table *tbl;
   bool enabled;
 + unsigned long locked_pages;
  };
  
  static bool tce_page_is_contained(struct page *page, unsigned page_shift)
 @@ -60,7 +106,7 @@ static bool tce_page_is_contained(struct page *page, 
 unsigned page_shift)
  static int tce_iommu_enable(struct tce_container *container)
  {
   int ret = 0;
 - unsigned long locked, lock_limit, npages;
 + unsigned long locked;
  struct iommu_table *tbl = container->tbl;
 
  if (!container->tbl)
 @@ -89,21 +135,22 @@ static int tce_iommu_enable(struct tce_container 
 *container)
* Also we don't have a nice way to fail on H_PUT_TCE due to ulimits,
* that would effectively kill the guest at random points, much better
* enforcing the limit based on the max that the guest can map.
 +  *
 +  * Unfortunately at the moment it counts whole tables, no matter how
 +  * much memory the guest has. I.e. for 4GB guest and 4 IOMMU groups
 +  * each with 2GB DMA window, 8GB will be counted here. The reason for
 +  * this is that we cannot tell here the amount of RAM used by the guest
 +  * as this information is only available from KVM and VFIO is
 +  * KVM agnostic.
*/
 - down_write(&current->mm->mmap_sem);
 - npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
 - locked = current->mm->locked_vm + npages;
 - lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 - if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
 - pr_warn("RLIMIT_MEMLOCK (%ld

Re: [PATCH v6 3/8] macvtap: introduce macvtap_is_little_endian() helper

2015-05-27 Thread David Gibson
On Fri, Apr 24, 2015 at 02:24:48PM +0200, Greg Kurz wrote:
 Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpQ75DSmzcZJ.pgp
Description: PGP signature


Re: [PATCH v6 8/8] macvtap/tun: cross-endian support for little-endian hosts

2015-05-27 Thread David Gibson
On Fri, Apr 24, 2015 at 02:50:36PM +0200, Greg Kurz wrote:
 The VNET_LE flag was introduced to fix accesses to virtio 1.0 headers
 that are always little-endian. It can also be used to handle the special
 case of a legacy little-endian device implemented by a big-endian host.
 
 Let's add a flag and ioctls for big-endian devices as well. If both flags
 are set, little-endian wins.
 
 Since this isn't a common use case, the feature is controlled by a kernel
 config option (not set by default).
 
 Both macvtap and tun are covered by this patch since they share the same
 API with userland.
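
The precedence rule can be sketched as follows (illustrative flag and
helper names, not the patch's exact symbols):

static bool vnet_is_little_endian(unsigned int flags)
{
	if (flags & VNET_LE)		/* little-endian flag wins ... */
		return true;
	if (flags & VNET_BE)		/* ... over the big-endian one */
		return false;
	return host_is_little_endian();	/* otherwise: native byte order */
}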
 
 Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgp9Pvo0akB9k.pgp
Description: PGP signature


Re: [PATCH v6 6/8] virtio: add explicit big-endian support to memory accessors

2015-05-27 Thread David Gibson
On Fri, Apr 24, 2015 at 02:26:24PM +0200, Greg Kurz wrote:
 The current memory accessors logic is:
 - little endian if little_endian
 - native endian (i.e. no byteswap) if !little_endian
 
 If we want to fully support cross-endian vhost, we also need to be
 able to convert to big endian.
 
 Instead of changing the little_endian argument to some 3-value enum, this
 patch changes the logic to:
 - little endian if little_endian
 - big endian if !little_endian
 
 The native endian case is handled by all users with a trivial helper. This
 patch doesn't change any functionality, nor does it add overhead.
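
So an accessor ends up shaped roughly like this (a sketch consistent
with the description above, not quoted from the patch):

static inline u16 __virtio16_to_cpu(bool little_endian, __virtio16 val)
{
	if (little_endian)
		return le16_to_cpu((__force __le16)val);
	else
		return be16_to_cpu((__force __be16)val);
}

and a native-endian user simply passes the trivial helper's result
(true on LE hosts, false on BE hosts) as the little_endian argument.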
 
 Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpiWIvRcbzD2.pgp
Description: PGP signature


Re: [PATCH v6 1/8] virtio: introduce virtio_is_little_endian() helper

2015-05-27 Thread David Gibson
On Fri, Apr 24, 2015 at 02:24:27PM +0200, Greg Kurz wrote:
 Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpb03l0z2tFO.pgp
Description: PGP signature


Re: [PATCH v6 2/8] tun: add tun_is_little_endian() helper

2015-05-27 Thread David Gibson
On Fri, Apr 24, 2015 at 02:24:38PM +0200, Greg Kurz wrote:
 Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgp6xoaowBbwy.pgp
Description: PGP signature


Re: [PATCH v6 5/8] vhost: introduce vhost_is_little_endian() helper

2015-05-27 Thread David Gibson
On Fri, Apr 24, 2015 at 02:25:12PM +0200, Greg Kurz wrote:
 Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpI3mX3BeZb0.pgp
Description: PGP signature

