Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update

2017-02-09 Thread Alexey Kardashevskiy
On 10/02/17 15:50, David Gibson wrote:
> On Fri, Feb 10, 2017 at 03:09:30PM +1100, Alexey Kardashevskiy wrote:
>> On 10/02/17 14:07, David Gibson wrote:
>>> On Thu, Feb 09, 2017 at 07:20:11PM +1100, Alexey Kardashevskiy wrote:
>>>> On 09/02/17 14:51, David Gibson wrote:
>>>>> On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote:
>>>>>> For the emulated devices it does not matter much if we get a broken TCE
>>>>>> half way handling a TCE list but for VFIO it will matter as it has
>>>>>> more chances to fail so we try to do our best and check as much as we
>>>>>> can before proceeding.
>>>>>>
>>>>>> This separates a guest view table update from validation. No change in
>>>>>> behavior is expected.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy 
>>>>>> ---
>>>>>>  arch/powerpc/kvm/book3s_64_vio.c| 8 
>>>>>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++--
>>>>>>  2 files changed, 14 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
>>>>>> b/arch/powerpc/kvm/book3s_64_vio.c
>>>>>> index 15df8ae627d9..9a7b7fca5e84 100644
>>>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>>>> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu 
>>>>>> *vcpu,
>>>>>>  ret = kvmppc_tce_validate(stt, tce);
>>>>>>  if (ret != H_SUCCESS)
>>>>>>  goto unlock_exit;
>>>>>> +}
>>>>>> +
>>>>>> +for (i = 0; i < npages; ++i) {
>>>>>> +if (get_user(tce, tces + i)) {
>>>>>> +ret = H_TOO_HARD;
>>>>>> +goto unlock_exit;
>>>>>> +}
>>>>>> +tce = be64_to_cpu(tce);
>>>>>
>>>>> This doesn't look safe.  The contents of user memory could change
>>>>> between the two get_user()s, meaning that you're no longer guaranteed
>>>>> a TCE loaded into kernel has been validated at all.
>>>>>
>>>>> I think you need to either:
>>>>>
>>>>> a) Make sure things safe against a bad TCE being loaded into a TCE
>>>>> table and move all validation to where the TCE is used, rather
>>>>> than loaded
>>>>>
>>>>> or
>>>>> b) Copy the whole set of indirect entries to a temporary in-kernel
>>>>>    buffer, then validate, then load into the actual TCE table.
>>>>
>>>>
>>>> Correct :( The problem is I do not know how far I want to go in reverting
>>>> the state as it was when I started handling H_PUT_TCE_INDIRECT.
>>>>
>>>> For example, 1 container, 2 IOMMU groups with disabled shared tables, so -
>>>> 2 tables, 512 TCEs request and TCE#100 does not translate to host physical
>>>> address.
>>>>
>>>>
>>>> To do a) I'll need to remember old content of each hardware table entry as
>>>> when I reach TCE#100, I'll need to revert to the initial state which means
>>>> I need to write back old TCEs to all affected hardware tables and update
>>>> reference counters of all affected preregistered areas. Well, the actual
>>>> tables must not have different addresses (BUG_ON? is it worth testing while
>>>> writing to hardware tables that values I am replacing are the same in all
>>>> tables?) so I can have just a single array of old TCEs from hardware tables
>>>> in vcpu.
>>>
>>> I thought you said shared tables were disabled, so the two tables
>>> would have different addresses?
>>
>> That would be 2 physically separated tables but the content would be the
>> same as long as they belong to the same VFIO container.
> 
> Ok.  I thought you were talking about the address of the TCE tables
> being the same above.

No, the example uses 2 separate TCE tables.

> Did you mean the address of an individual page
> mapped in the TCE table?

I meant the tables themselves are separate in the host memory but their
content is the same.


>>> Hmm.  Now I'm trying to remember, will the gpa->hpa translation fail
>>> only if the guest/qemu does something wrong, or can it fail for other
>>> reasons? 
>>
>> This should always just work.
> 
> Ok, given that, just replacing HPAs we can't translate with a clear
> entry seems fine to me.


Ok.


>>> What about in real mode vs. virtual mode?
>>
>> Real mode is no different in this matter.
>>
>> Real mode is different from virtual mode in 3 aspects:
>>
>> 1. iommu_table_ops::exchange() vs. exchange_rm() as real mode uses cache
>> inhibited writes to invalidate "TCE kill" cache;
>>
>> 2. list_for_each_entry_lockless() vs. list_for_each_entry_rcu() because
>> lockdep does not work properly in real mode;
>>
>> 3. real mode uses vmalloc_to_phys() while virtual mode can access vmalloc'd
>> addresses directly. Not expected to fail.
>>
>> This is a full list.
> 
> Ok.
> 
>>> I think the key to this approach will be to think carefully about what
>>> semantics you guarantee for mappings shadowed into the hardware
>>> tables.  For example, it might work to specify that the host mappings
>>> only match the GPA mappings if those GPA mappings are 
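
David's option (b) above — fetch the whole indirect list once into a private kernel-side buffer, validate that copy, and commit only from it — can be sketched in plain userspace C. This is an illustrative sketch of the pattern, not the kernel code: tce_validate(), the H_* values and memcpy() (standing in for copy_from_user()) are stand-ins.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define H_SUCCESS   0
#define H_TOO_HARD  (-1)
#define H_PARAMETER (-4)

/* Hypothetical validity check standing in for kvmppc_tce_validate(). */
static long tce_validate(uint64_t tce)
{
    return (tce & 0xfffULL) ? H_PARAMETER : H_SUCCESS; /* low bits must be clear */
}

/*
 * Option (b): fetch the whole indirect list ONCE into a private buffer,
 * validate the private copy, then commit from that same copy.  Because
 * the guest can no longer modify what was validated, there is no window
 * between validation and use (unlike the two-get_user() version).
 */
static long put_tce_indirect(const uint64_t *guest_tces, /* guest/user memory */
                             uint64_t *table, unsigned long npages)
{
    uint64_t buf[512];
    unsigned long i;

    if (npages > 512)
        return H_TOO_HARD;
    /* single fetch; memcpy stands in for copy_from_user() */
    memcpy(buf, guest_tces, npages * sizeof(uint64_t));

    for (i = 0; i < npages; ++i)
        if (tce_validate(buf[i]) != H_SUCCESS)
            return H_PARAMETER;

    for (i = 0; i < npages; ++i)
        table[i] = buf[i];      /* commit only validated values */
    return H_SUCCESS;
}
```

If validation fails part-way, nothing has been committed yet, which sidesteps the revert problem discussed above.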

Re: [PATCH 1/3] kprobes: introduce weak variant of kprobe_exceptions_notify

2017-02-09 Thread Ingo Molnar

* Michael Ellerman  wrote:

> "Naveen N. Rao"  writes:
> 
> > kprobe_exceptions_notify() is not used on some of the architectures such
> > as arm[64] and powerpc anymore. Introduce a weak variant for such
> > architectures.
> 
> I'll merge patch 1 & 3 via the powerpc tree for v4.11.

Acked-by: Ingo Molnar 

Thanks,

Ingo


Re: [PATCH] powerpc: Blacklist GCC 5.4 6.1 and 6.2

2017-02-09 Thread Christophe LEROY



On 10/02/2017 at 06:31, Cyril Bur wrote:

A bug in the -O2 optimisation of GCC 5.4 6.1 and 6.2 causes
setup_command_line() to not pass the correct first argument to strcpy
and therefore not actually copy the command_line.

A workaround patch was proposed: http://patchwork.ozlabs.org/patch/673130/
some discussion ensued.

A GCC bug was raised: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71709
The bug has been fixed in 7.0 and backported to GCC 5 and GCC 6.

At the time of writing GCC 5.4 is the most recent and is affected. GCC
6.3 contains the backported fix, has been tested and appears safe to
use.

Heavy-lifting-by: Akshay Adiga 
Signed-off-by: Cyril Bur 
---
 arch/powerpc/Makefile | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 31286fa7873c..a4b886694391 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -414,6 +414,15 @@ checkbin:
echo -n '*** Please use a different binutils version.' ; \
false ; \
fi
+   @if test "$(cc-version)" = "0504" \
+   || test "$(cc-version)" = "0601" \
+   || test "$(cc-version)" = "0602" ; then \
+   echo -n '*** GCC-5.4 6.1 6.2 have a bad -O2 optimisation ' ; \
+   echo 'which will cause lost commandline options (at least).' ; \
+   echo '*** Please use a different GCC version.' ; \
+   false ; \
+   fi
+


According to the GCC bug report, this bug applies to powerpc64le.
Why force all targets to not use those versions of GCC?

Christophe


Re: [kernel-hardening] [PATCH v2 1/1] powerpc: mm: support ARCH_MMAP_RND_BITS

2017-02-09 Thread Bhupesh Sharma
Hi Michael,

On Tue, Feb 7, 2017 at 7:57 AM, Michael Ellerman  wrote:
> Bhupesh Sharma  writes:
>
>> powerpc: arch_mmap_rnd() uses hard-coded values, (23-PAGE_SHIFT) for
>> 32-bit and (30-PAGE_SHIFT) for 64-bit, to generate the random offset
>> for the mmap base address.
>>
>> This value represents a compromise between increased
>> ASLR effectiveness and avoiding address-space fragmentation.
>> Replace it with a Kconfig option, which is sensibly bounded, so that
>> platform developers may choose where to place this compromise.
>> Keep default values as new minimums.
>>
>> This patch makes the powerpc arch_mmap_rnd() approach similar to that
>> of other architectures such as x86, arm64 and arm.
>>
>> Cc: Alexander Graf 
>> Cc: Benjamin Herrenschmidt 
>> Cc: Paul Mackerras 
>> Cc: Michael Ellerman 
>> Cc: Anatolij Gustschin 
>> Cc: Alistair Popple 
>> Cc: Matt Porter 
>> Cc: Vitaly Bordug 
>> Cc: Scott Wood 
>> Cc: Kumar Gala 
>> Cc: Daniel Cashman 
>> Signed-off-by: Bhupesh Sharma 
>> Reviewed-by: Kees Cook 
>> ---
>> Changes since v1:
>> v1 can be seen here 
>> (https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-February/153594.html)
>> - No functional change in this patch.
>> - Added R-B from Kees.
>> - Dropped PATCH 2/2 from v1 as recommended by Kees Cook.
>
> Thanks for v2.
>
> But I replied to your v1 with some comments, did you see them?
>

I have replied to your comments on the original thread.
Please share your views and if possible share your test results on the
PPC setups you might have at your end.

Thanks,
Bhupesh
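
The hard-coded values quoted above can be made concrete with a little arithmetic: the random offset is (random % 2^(23-PAGE_SHIFT)) << PAGE_SHIFT for 32-bit tasks and (random % 2^(30-PAGE_SHIFT)) << PAGE_SHIFT for 64-bit tasks, so the randomized span of the mmap base is a fixed number of bytes regardless of page size. A small sketch (not kernel code; the helper name is illustrative):

```c
#include <assert.h>

/*
 * Number of bytes the pre-patch powerpc arch_mmap_rnd() offset can
 * cover, using the (23-PAGE_SHIFT)/(30-PAGE_SHIFT) values from the
 * commit message.  The page-shift terms cancel: the span is always
 * 8MB for 32-bit tasks and 1GB for 64-bit tasks.
 */
static unsigned long mmap_rnd_span(int is_32bit, unsigned page_shift)
{
    unsigned rnd_bits = (is_32bit ? 23 : 30) - page_shift;

    return (1UL << rnd_bits) << page_shift; /* bytes covered by the offset */
}
```

Making the bit count a Kconfig option, as the patch does, lets platforms trade a larger span (more ASLR entropy) against address-space fragmentation.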


[PATCH] powerpc: Blacklist GCC 5.4 6.1 and 6.2

2017-02-09 Thread Cyril Bur
A bug in the -O2 optimisation of GCC 5.4 6.1 and 6.2 causes
setup_command_line() to not pass the correct first argument to strcpy
and therefore not actually copy the command_line.

A workaround patch was proposed: http://patchwork.ozlabs.org/patch/673130/
some discussion ensued.

A GCC bug was raised: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71709
The bug has been fixed in 7.0 and backported to GCC 5 and GCC 6.

At the time of writing GCC 5.4 is the most recent and is affected. GCC
6.3 contains the backported fix, has been tested and appears safe to
use.

Heavy-lifting-by: Akshay Adiga 
Signed-off-by: Cyril Bur 
---
 arch/powerpc/Makefile | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 31286fa7873c..a4b886694391 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -414,6 +414,15 @@ checkbin:
echo -n '*** Please use a different binutils version.' ; \
false ; \
fi
+   @if test "$(cc-version)" = "0504" \
+   || test "$(cc-version)" = "0601" \
+   || test "$(cc-version)" = "0602" ; then \
+   echo -n '*** GCC-5.4 6.1 6.2 have a bad -O2 optimisation ' ; \
+   echo 'which will cause lost commandline options (at least).' ; \
+   echo '*** Please use a different GCC version.' ; \
+   false ; \
+   fi
+
 
 
 CLEAN_FILES += $(TOUT)
-- 
2.11.1
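
The Makefile hunk compares $(cc-version), which is the GCC version packed into four digits (major and minor, each zero-padded, so GCC 5.4 becomes "0504"). The same check can be expressed in C for illustration; the function name is made up here:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Reproduce the patch's blacklist test: pack major.minor into the
 * four-digit form used by kbuild's cc-version and compare against
 * the known-bad versions.
 */
static int gcc_is_blacklisted(int major, int minor)
{
    static const char *bad[] = { "0504", "0601", "0602" };
    char ver[5];
    unsigned i;

    snprintf(ver, sizeof(ver), "%02d%02d", major, minor);
    for (i = 0; i < sizeof(bad) / sizeof(bad[0]); ++i)
        if (strcmp(ver, bad[i]) == 0)
            return 1;   /* affected by the -O2 strcpy bug */
    return 0;
}
```

GCC 6.3 and 7.0 carry the fix, so they pass the check, matching the commit message.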



[PATCH 2/2] powerpc/mm/radix: rename _PAGE_LARGE to R_PAGE_LARGE

2017-02-09 Thread Aneesh Kumar K.V
This bit is only used by radix and it is nice to follow the naming style of
having bit names start with H_/R_ depending on which translation mode they
are used in.

No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hugetlb.h | 2 +-
 arch/powerpc/include/asm/book3s/64/radix.h   | 4 ++--
 arch/powerpc/mm/tlb-radix.c  | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h 
b/arch/powerpc/include/asm/book3s/64/hugetlb.h
index c62f14d0bec1..cd366596 100644
--- a/arch/powerpc/include/asm/book3s/64/hugetlb.h
+++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h
@@ -46,7 +46,7 @@ static inline pte_t arch_make_huge_pte(pte_t entry, struct 
vm_area_struct *vma,
 */
VM_WARN_ON(page_shift == mmu_psize_defs[MMU_PAGE_1G].shift);
if (page_shift == mmu_psize_defs[MMU_PAGE_2M].shift)
-   return __pte(pte_val(entry) | _PAGE_LARGE);
+   return __pte(pte_val(entry) | R_PAGE_LARGE);
else
return entry;
 }
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
b/arch/powerpc/include/asm/book3s/64/radix.h
index b2971e624bb5..96b94d2b4432 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -14,7 +14,7 @@
 /*
  * For P9 DD1 only, we need to track whether the pte's huge.
  */
-#define _PAGE_LARGE_RPAGE_RSV1
+#define R_PAGE_LARGE   _RPAGE_RSV1
 
 
 #ifndef __ASSEMBLY__
@@ -258,7 +258,7 @@ static inline int radix__pmd_trans_huge(pmd_t pmd)
 static inline pmd_t radix__pmd_mkhuge(pmd_t pmd)
 {
if (cpu_has_feature(CPU_FTR_POWER9_DD1))
-   return __pmd(pmd_val(pmd) | _PAGE_PTE | _PAGE_LARGE);
+   return __pmd(pmd_val(pmd) | _PAGE_PTE | R_PAGE_LARGE);
return __pmd(pmd_val(pmd) | _PAGE_PTE);
 }
 static inline void radix__pmdp_huge_split_prepare(struct vm_area_struct *vma,
diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 61b79119065f..a22a76905437 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -441,7 +441,7 @@ void radix__flush_tlb_pte_p9_dd1(unsigned long old_pte, 
struct mm_struct *mm,
return;
}
 
-   if (old_pte & _PAGE_LARGE)
+   if (old_pte & R_PAGE_LARGE)
radix__flush_tlb_page_psize(mm, address, MMU_PAGE_2M);
else
radix__flush_tlb_page_psize(mm, address, mmu_virtual_psize);
-- 
2.7.4



[PATCH 1/2] powerpc/mm: Cleanup bits definition between hash and radix.

2017-02-09 Thread Aneesh Kumar K.V
Define everything based on bits present in pgtable.h. This will help in easily
identifying overlapping bits between hash/radix.

No functional change with this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hash-64k.h |  4 
 arch/powerpc/include/asm/book3s/64/hash.h |  9 +
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 10 ++
 arch/powerpc/include/asm/book3s/64/radix.h|  6 ++
 4 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index f3dd21efa2ea..b39f0b86405e 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -6,6 +6,10 @@
 #define H_PUD_INDEX_SIZE  5
 #define H_PGD_INDEX_SIZE  12
 
+/*
+ * A 64k aligned address frees up a few of the lower bits of the RPN for us.
+ * We steal those here. For more details look at pte_pfn/pfn_pte()
+ */
 #define H_PAGE_COMBO   0x1000 /* this is a combo 4k page */
 #define H_PAGE_4K_PFN  0x2000 /* PFN is for a single 4k page */
 /*
diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index f7b721bbf918..ec2828b1db07 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -13,12 +13,13 @@
  * We could create separate kernel read-only if we used the 3 PP bits
  * combinations that newer processors provide but we currently don't.
  */
-#define H_PAGE_BUSY0x00800 /* software: PTE & hash are busy */
+#define H_PAGE_BUSY_RPAGE_SW1 /* software: PTE & hash are busy */
 #define H_PTE_NONE_MASK_PAGE_HPTEFLAGS
 #define H_PAGE_F_GIX_SHIFT 57
-#define H_PAGE_F_GIX   (7ul << 57) /* HPTE index within HPTEG */
-#define H_PAGE_F_SECOND(1ul << 60) /* HPTE is in 2ndary 
HPTEG */
-#define H_PAGE_HASHPTE (1ul << 61) /* PTE has associated HPTE */
+/* (7ul << 57) HPTE index within HPTEG */
+#define H_PAGE_F_GIX   (_RPAGE_RSV2 | _RPAGE_RSV3 | _RPAGE_RSV4)
+#define H_PAGE_F_SECOND_RPAGE_RSV1 /* HPTE is in 2ndary 
HPTEG */
+#define H_PAGE_HASHPTE _RPAGE_SW0  /* PTE has associated HPTE */
 
 #ifdef CONFIG_PPC_64K_PAGES
 #include 
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index c684ef6cbd10..d95a2658b4d2 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -38,14 +38,8 @@
 #endif
 #define _PAGE_SPECIAL  _RPAGE_SW2 /* software: special page */
 
-/*
- * For P9 DD1 only, we need to track whether the pte's huge.
- */
-#define _PAGE_LARGE_RPAGE_RSV1
-
-
-#define _PAGE_PTE  (1ul << 62) /* distinguishes PTEs from 
pointers */
-#define _PAGE_PRESENT  (1ul << 63) /* pte contains a translation */
+#define _PAGE_PTE  0x4000UL/* distinguishes PTEs 
from pointers */
+#define _PAGE_PRESENT  0x8000UL/* pte contains a 
translation */
 /*
  * Drivers request for cache inhibited pte mapping using _PAGE_NO_CACHE
  * Instead of fixing all of them, add an alternate define which
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
b/arch/powerpc/include/asm/book3s/64/radix.h
index 77e590c77299..b2971e624bb5 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -11,6 +11,12 @@
 #include 
 #endif
 
+/*
+ * For P9 DD1 only, we need to track whether the pte's huge.
+ */
+#define _PAGE_LARGE_RPAGE_RSV1
+
+
 #ifndef __ASSEMBLY__
 #include 
 #include 
-- 
2.7.4
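
The hash.h hunk above replaces shifted constants with _RPAGE_* names; since the patch claims no functional change, the composites must be bit-for-bit identical. A quick check, using _RPAGE_* values inferred from the removed lines of this patch (H_PAGE_F_SECOND was 1ul << 60 and becomes _RPAGE_RSV1, and so on) rather than copied from the kernel headers:

```c
#include <assert.h>

/* _RPAGE_* values inferred from the removed definitions; illustrative only. */
#define _RPAGE_RSV1 (1UL << 60)
#define _RPAGE_RSV2 (1UL << 59)
#define _RPAGE_RSV3 (1UL << 58)
#define _RPAGE_RSV4 (1UL << 57)
#define _RPAGE_SW0  (1UL << 61)

/* New composite definitions as written in the patch */
#define H_PAGE_F_GIX    (_RPAGE_RSV2 | _RPAGE_RSV3 | _RPAGE_RSV4)
#define H_PAGE_F_SECOND _RPAGE_RSV1
#define H_PAGE_HASHPTE  _RPAGE_SW0
```

Each composite reduces to the old shifted constant, so the rename is purely cosmetic, as the commit message states.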



Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update

2017-02-09 Thread David Gibson
On Fri, Feb 10, 2017 at 03:09:30PM +1100, Alexey Kardashevskiy wrote:
> On 10/02/17 14:07, David Gibson wrote:
> > On Thu, Feb 09, 2017 at 07:20:11PM +1100, Alexey Kardashevskiy wrote:
> >> On 09/02/17 14:51, David Gibson wrote:
> >>> On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote:
> >>>> For the emulated devices it does not matter much if we get a broken TCE
> >>>> half way handling a TCE list but for VFIO it will matter as it has
> >>>> more chances to fail so we try to do our best and check as much as we
> >>>> can before proceeding.
> >>>>
> >>>> This separates a guest view table update from validation. No change in
> >>>> behavior is expected.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy 
> >>>> ---
> >>>>  arch/powerpc/kvm/book3s_64_vio.c| 8 
> >>>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++--
> >>>>  2 files changed, 14 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
> >>>> b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> index 15df8ae627d9..9a7b7fca5e84 100644
> >>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu 
> >>>> *vcpu,
> >>>>  ret = kvmppc_tce_validate(stt, tce);
> >>>>  if (ret != H_SUCCESS)
> >>>>  goto unlock_exit;
> >>>> +}
> >>>> +
> >>>> +for (i = 0; i < npages; ++i) {
> >>>> +if (get_user(tce, tces + i)) {
> >>>> +ret = H_TOO_HARD;
> >>>> +goto unlock_exit;
> >>>> +}
> >>>> +tce = be64_to_cpu(tce);
> >>>
> >>> This doesn't look safe.  The contents of user memory could change
> >>> between the two get_user()s, meaning that you're no longer guaranteed
> >>> a TCE loaded into kernel has been validated at all.
> >>>
> >>> I think you need to either:
> >>>
> >>> a) Make sure things safe against a bad TCE being loaded into a TCE
> >>> table and move all validation to where the TCE is used, rather
> >>> than loaded
> >>>
> >>> or
> >>> b) Copy the whole set of indirect entries to a temporary in-kernel
> >>>buffer, then validate, then load into the actual TCE table.
> >>
> >>
> >> Correct :( The problem is I do not know how far I want to go in reverting
> >> the state as it was when I started handling H_PUT_TCE_INDIRECT.
> >>
> >> For example, 1 container, 2 IOMMU groups with disabled shared tables, so -
> >> 2 tables, 512 TCEs request and TCE#100 does not translate to host physical
> >> address.
> >>
> >>
> >> To do a) I'll need to remember old content of each hardware table entry as
> >> when I reach TCE#100, I'll need to revert to the initial state which means
> >> I need to write back old TCEs to all affected hardware tables and update
> >> reference counters of all affected preregistered areas. Well, the actual
> >> tables must not have different addresses (BUG_ON? is it worth testing while
> >> writing to hardware tables that values I am replacing are the same in all
> >> tables?) so I can have just a single array of old TCEs from hardware tables
> >> in vcpu.
> > 
> > I thought you said shared tables were disabled, so the two tables
> > would have different addresses?
> 
> That would be 2 physically separated tables but the content would be the
> same as long as they belong to the same VFIO container.

Ok.  I thought you were talking about the address of the TCE tables
being the same above.  Did you mean the address of an individual page
mapped in the TCE table?

> > Hmm.  Now I'm trying to remember, will the gpa->hpa translation fail
> > only if the guest/qemu does something wrong, or can it fail for other
> > reasons? 
> 
> This should always just work.

Ok, given that, just replacing HPAs we can't translate with a clear
entry seems fine to me.

> > What about in real mode vs. virtual mode?
> 
> Real mode is no different in this matter.
> 
> Real mode is different from virtual mode in 3 aspects:
> 
> 1. iommu_table_ops::exchange() vs. exchange_rm() as real mode uses cache
> inhibited writes to invalidate "TCE kill" cache;
> 
> 2. list_for_each_entry_lockless() vs. list_for_each_entry_rcu() because
> lockdep does not work properly in real mode;
> 
> 3. real mode uses vmalloc_to_phys() while virtual mode can access vmalloc'd
> addresses directly. Not expected to fail.
> 
> This is a full list.

Ok.

> > I think the key to this approach will be to think carefully about what
> > semantics you guarantee for mappings shadowed into the hardware
> > tables.  For example, it might work to specify that the host mappings
> > only match the GPA mappings if those GPA mappings are valid in the
> > first place.  So, H_PUT_TCE etc. would succeed as long as they're able
> > to update the view of the table in terms of GPA.  But when you shadow
> > those into the HPA tables, any entries which can't be 

Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update

2017-02-09 Thread Alexey Kardashevskiy
On 10/02/17 14:07, David Gibson wrote:
> On Thu, Feb 09, 2017 at 07:20:11PM +1100, Alexey Kardashevskiy wrote:
>> On 09/02/17 14:51, David Gibson wrote:
>>> On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote:
>>>> For the emulated devices it does not matter much if we get a broken TCE
>>>> half way handling a TCE list but for VFIO it will matter as it has
>>>> more chances to fail so we try to do our best and check as much as we
>>>> can before proceeding.
>>>>
>>>> This separates a guest view table update from validation. No change in
>>>> behavior is expected.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy 
>>>> ---
>>>>  arch/powerpc/kvm/book3s_64_vio.c| 8 
>>>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++--
>>>>  2 files changed, 14 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
>>>> b/arch/powerpc/kvm/book3s_64_vio.c
>>>> index 15df8ae627d9..9a7b7fca5e84 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>> ret = kvmppc_tce_validate(stt, tce);
>>>> if (ret != H_SUCCESS)
>>>> goto unlock_exit;
>>>> +  }
>>>> +
>>>> +  for (i = 0; i < npages; ++i) {
>>>> +  if (get_user(tce, tces + i)) {
>>>> +  ret = H_TOO_HARD;
>>>> +  goto unlock_exit;
>>>> +  }
>>>> +  tce = be64_to_cpu(tce);
>>>
>>> This doesn't look safe.  The contents of user memory could change
>>> between the two get_user()s, meaning that you're no longer guaranteed
>>> a TCE loaded into kernel has been validated at all.
>>>
>>> I think you need to either:
>>>
>>> a) Make sure things safe against a bad TCE being loaded into a TCE
>>> table and move all validation to where the TCE is used, rather
>>> than loaded
>>>
>>> or
>>> b) Copy the whole set of indirect entries to a temporary in-kernel
>>>buffer, then validate, then load into the actual TCE table.
>>
>>
>> Correct :( The problem is I do not know how far I want to go in reverting
>> the state as it was when I started handling H_PUT_TCE_INDIRECT.
>>
>> For example, 1 container, 2 IOMMU groups with disabled shared tables, so -
>> 2 tables, 512 TCEs request and TCE#100 does not translate to host physical
>> address.
>>
>>
>> To do a) I'll need to remember old content of each hardware table entry as
>> when I reach TCE#100, I'll need to revert to the initial state which means
>> I need to write back old TCEs to all affected hardware tables and update
>> reference counters of all affected preregistered areas. Well, the actual
>> tables must not have different addresses (BUG_ON? is it worth testing while
>> writing to hardware tables that values I am replacing are the same in all
>> tables?) so I can have just a single array of old TCEs from hardware tables
>> in vcpu.
> 
> I thought you said shared tables were disabled, so the two tables
> would have different addresses?

That would be 2 physically separated tables but the content would be the
same as long as they belong to the same VFIO container.


> 
> Hmm.  Now I'm trying to remember, will the gpa->hpa translation fail
> only if the guest/qemu does something wrong, or can it fail for other
> reasons? 

This should always just work.

> What about in real mode vs. virtual mode?

Real mode is no different in this matter.

Real mode is different from virtual mode in 3 aspects:

1. iommu_table_ops::exchange() vs. exchange_rm() as real mode uses cache
inhibited writes to invalidate "TCE kill" cache;

2. list_for_each_entry_lockless() vs. list_for_each_entry_rcu() because
lockdep does not work properly in real mode;

3. real mode uses vmalloc_to_phys() while virtual mode can access vmalloc'd
addresses directly. Not expected to fail.

This is a full list.


> 
> I think the key to this approach will be to think carefully about what
> semantics you guarantee for mappings shadowed into the hardware
> tables.  For example, it might work to specify that the host mappings
> only match the GPA mappings if those GPA mappings are valid in the
> first place.  So, H_PUT_TCE etc. would succeed as long as they're able
> to update the view of the table in terms of GPA.  But when you shadow
> those into the HPA tables, any entries which can't be translated you
> just replace with a cleared entry.

Literally with zero? Silently? WARN_ON_ONCE?

> That should be enough to protect
> the host.  Obviously you can expect the device to fail when you
> actually attempt to DMA there, but that's the guest's (or qemu's) own
> fault for putting bad addresses in the TCE table.
> 
> Obviously that might not be great for debugging, since mappings will
> appear to succeed, but then not work later on.
> 
> This does have the nice property that it's reasonably obvious what to
> do if you have some GPA mappings for emulated devices, then hotplug a
> 
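
The semantics David proposes — the guest-view (GPA) table always takes the update, and when shadowing into the hardware (HPA) table any entry that cannot be translated is replaced with a cleared entry — can be sketched as follows. This is illustrative only: shadow_tces() and toy_translate() are made-up stand-ins, not the kernel's gpa->hpa machinery.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*xlate_fn)(uint64_t gpa);  /* returns 0 on failure */

/*
 * Shadow the guest-view table into the hardware table: every entry is
 * written, but an untranslatable GPA becomes a cleared (zero) entry.
 * The guest-view update itself always "succeeds"; DMA through a cleared
 * entry will later fault, which is the guest's own fault for mapping a
 * bad address.
 */
static void shadow_tces(const uint64_t *gpa_tbl, uint64_t *hpa_tbl,
                        size_t n, xlate_fn translate)
{
    for (size_t i = 0; i < n; ++i) {
        uint64_t hpa = gpa_tbl[i] ? translate(gpa_tbl[i]) : 0;

        hpa_tbl[i] = hpa;   /* untranslatable -> cleared entry */
    }
}

/* Toy translation: fail (return 0) for one specific "bad" page. */
static uint64_t toy_translate(uint64_t gpa)
{
    return gpa == 0x3000 ? 0 : gpa + 0x100000;
}
```

Whether the clearing should be silent or warn (the WARN_ON_ONCE question above) is a policy choice layered on top of this structure.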

Re: [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-02-09 Thread David Gibson
On Fri, Feb 10, 2017 at 01:50:31PM +1100, Alexey Kardashevskiy wrote:
> On 09/02/17 17:41, David Gibson wrote:
> > On Tue, Feb 07, 2017 at 06:17:11PM +1100, Alexey Kardashevskiy wrote:
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> >> without passing them to user space which saves time on switching
> >> to user space and back.
> >>
> >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >> KVM tries to handle a TCE request in real mode; if that fails,
> >> it passes the request to virtual mode to complete the operation.
> >> If the virtual mode handler also fails, the request is passed to
> >> user space; this is not expected to happen though.
> >>
> >> To avoid dealing with page use counters (which is tricky in real mode),
> >> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >> to pre-register the userspace memory. The very first TCE request will
> >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >> of the TCE table (iommu_table::it_userspace) is not allocated till
> >> the very first mapping happens and we cannot call vmalloc in real mode.
> >>
> >> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >> and associates a physical IOMMU table with the SPAPR TCE table (which
> >> is a guest view of the hardware IOMMU table). The iommu_table object
> >> is cached and referenced so we do not have to look up for it in real mode.
> >>
> >> This does not implement the UNSET counterpart as there is no use for it -
> >> once the acceleration is enabled, the existing userspace won't
> >> disable it unless a VFIO container is destroyed; this adds necessary
> >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>
> >> As this creates a descriptor per IOMMU table-LIOBN couple (called
> >> kvmppc_spapr_tce_iommu_table), it is possible to have several
> >> descriptors with the same iommu_table (hardware IOMMU table) attached
> >> to the same LIOBN; we do not remove duplicates though as
> >> iommu_table_ops::exchange does not just update a TCE entry (which is
> >> shared among IOMMU groups) but also invalidates the TCE cache
> >> (one per IOMMU group).
> >>
> >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >> space.
> >>
> >> This finally makes use of vfio_external_user_iommu_id() which was
> >> introduced quite some time ago and was considered for removal.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> >>
> >> Signed-off-by: Alexey Kardashevskiy 
> >> ---
> >> Changes:
> >> v4:
> >> * added note to the commit log about allowing multiple updates of
> >> the same IOMMU table;
> >> * instead of checking for if any memory was preregistered, this
> >> returns H_TOO_HARD if a specific page was not;
> >> * fixed comments from v3 about error handling in many places;
> >> * simplified TCE handlers and merged IOMMU parts inline - for example,
> >> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> >> kvmppc_h_put_tce(); this allows checking IOBA boundaries against
> >> the first attached table only (makes the code simpler);
> >>
> >> v3:
> >> * simplified not to use VFIO group notifiers
> >> * reworked cleanup, should be cleaner/simpler now
> >>
> >> v2:
> >> * reworked to use new VFIO notifiers
> >> * now same iommu_table may appear in the list several times, to be fixed 
> >> later
> >> ---
> >>
> >> This has separate copies of handlers for real and virtual modes as
> >> in fact H_PUT_TCE and H_STUFF_TCE could share a lot (common helpers
> >> would take a "realmode" flag) but H_PUT_TCE_INDIRECT uses get_user()
> >> in virtual mode and direct access in real mode and having a common
> >> helper for it would make things uglier imho.
> >>
> >>
> >> ---
> >>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
> >>  arch/powerpc/include/asm/kvm_host.h|   8 +
> >>  arch/powerpc/include/asm/kvm_ppc.h |   4 +
> >>  include/uapi/linux/kvm.h   |   8 +
> >>  arch/powerpc/kvm/book3s_64_vio.c   | 319 
> >> -
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c| 172 +++-
> >>  arch/powerpc/kvm/powerpc.c |   2 +
> >>  virt/kvm/vfio.c|  60 ++
> >>  8 files changed, 590 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/Documentation/virtual/kvm/devices/vfio.txt 
> >> b/Documentation/virtual/kvm/devices/vfio.txt
> >> index ef51740c67ca..f95d867168ea 100644
> >> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >> @@ -16,7 +16,25 @@ Groups:
> >>  
> >>  KVM_DEV_VFIO_GROUP attributes:
> >>KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device 

linux-next: manual merge of the kvm tree with the powerpc tree

2017-02-09 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the kvm tree got a conflict in:

  arch/powerpc/kernel/exceptions-64s.S

between commit:

  1a6822d194c3 ("powerpc/64s: Use (start, size) rather than (start, end) for 
exception handlers")

from the powerpc tree and commit:

  bc3551257af8 ("powerpc/64: Allow for relocation-on interrupts from guest to 
host")

from the kvm tree.

I fixed it up (I think - see below) and can carry the fix as necessary.
This is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc arch/powerpc/kernel/exceptions-64s.S
index a6205a4a3574,34a04a5fa468..
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@@ -720,14 -720,10 +720,10 @@@ hardware_interrupt_hv
FTR_SECTION_ELSE
_MASKABLE_EXCEPTION_PSERIES(0x500, hardware_interrupt_common,
EXC_STD, SOFTEN_TEST_PR)
- do_kvm_0x500:
-   KVM_HANDLER(PACA_EXGEN, EXC_STD, 0x500)
ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
 -EXC_REAL_END(hardware_interrupt, 0x500, 0x600)
 +EXC_REAL_END(hardware_interrupt, 0x500, 0x100)
  
 -EXC_VIRT_BEGIN(hardware_interrupt, 0x4500, 0x4600)
 +EXC_VIRT_BEGIN(hardware_interrupt, 0x4500, 0x100)
.globl hardware_interrupt_relon_hv;
  hardware_interrupt_relon_hv:
BEGIN_FTR_SECTION
@@@ -735,8 -731,10 +731,10 @@@
FTR_SECTION_ELSE
_MASKABLE_RELON_EXCEPTION_PSERIES(0x500, 
hardware_interrupt_common, EXC_STD, SOFTEN_TEST_PR)
ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
 -EXC_VIRT_END(hardware_interrupt, 0x4500, 0x4600)
 +EXC_VIRT_END(hardware_interrupt, 0x4500, 0x100)
  
+ TRAMP_KVM(PACA_EXGEN, 0x500)
+ TRAMP_KVM_HV(PACA_EXGEN, 0x500)
  EXC_COMMON_ASYNC(hardware_interrupt_common, 0x500, do_IRQ)
  
  
@@@ -884,35 -907,15 +907,15 @@@ END_FTR_SECTION_IFSET(CPU_FTR_REAL_LE
b   system_call_common ;
  #endif
  
 -EXC_REAL_BEGIN(system_call, 0xc00, 0xd00)
 +EXC_REAL_BEGIN(system_call, 0xc00, 0x100)
-/*
- * If CONFIG_KVM_BOOK3S_64_HANDLER is set, save the PPR (on systems
- * that support it) before changing to HMT_MEDIUM. That allows the KVM
- * code to save that value into the guest state (it is the guest's PPR
- * value). Otherwise just change to HMT_MEDIUM as userspace has
- * already saved the PPR.
- */
- #ifdef CONFIG_KVM_BOOK3S_64_HANDLER
-   SET_SCRATCH0(r13)
-   GET_PACA(r13)
-   std r9,PACA_EXGEN+EX_R9(r13)
-   OPT_GET_SPR(r9, SPRN_PPR, CPU_FTR_HAS_PPR);
-   HMT_MEDIUM;
-   std r10,PACA_EXGEN+EX_R10(r13)
-   OPT_SAVE_REG_TO_PACA(PACA_EXGEN+EX_PPR, r9, CPU_FTR_HAS_PPR);
-   mfcr    r9
-   KVMTEST_PR(0xc00)
-   GET_SCRATCH0(r13)
- #else
-   HMT_MEDIUM;
- #endif
+   SYSCALL_KVMTEST
SYSCALL_PSERIES_1
SYSCALL_PSERIES_2_RFID
SYSCALL_PSERIES_3
 -EXC_REAL_END(system_call, 0xc00, 0xd00)
 +EXC_REAL_END(system_call, 0xc00, 0x100)
  
 -EXC_VIRT_BEGIN(system_call, 0x4c00, 0x4d00)
 +EXC_VIRT_BEGIN(system_call, 0x4c00, 0x100)
-   HMT_MEDIUM
+   SYSCALL_KVMTEST
SYSCALL_PSERIES_1
SYSCALL_PSERIES_2_DIRECT
SYSCALL_PSERIES_3
@@@ -926,8 -929,8 +929,8 @@@ EXC_VIRT(single_step, 0x4d00, 0x100, 0x
  TRAMP_KVM(PACA_EXGEN, 0xd00)
  EXC_COMMON(single_step_common, 0xd00, single_step_exception)
  
 -EXC_REAL_OOL_HV(h_data_storage, 0xe00, 0xe20)
 -EXC_VIRT_OOL_HV(h_data_storage, 0x4e00, 0x4e20, 0xe00)
 +EXC_REAL_OOL_HV(h_data_storage, 0xe00, 0x20)
- EXC_VIRT_NONE(0x4e00, 0x20)
++EXC_VIRT_OOL_HV(h_data_storage, 0x4e00, 0x20, 0xe00)
  TRAMP_KVM_HV_SKIP(PACA_EXGEN, 0xe00)
  EXC_COMMON_BEGIN(h_data_storage_common)
mfspr   r10,SPRN_HDAR
@@@ -942,8 -945,8 +945,8 @@@
b   ret_from_except
  
  
 -EXC_REAL_OOL_HV(h_instr_storage, 0xe20, 0xe40)
 -EXC_VIRT_OOL_HV(h_instr_storage, 0x4e20, 0x4e40, 0xe20)
 +EXC_REAL_OOL_HV(h_instr_storage, 0xe20, 0x20)
- EXC_VIRT_NONE(0x4e20, 0x20)
++EXC_VIRT_OOL_HV(h_instr_storage, 0x4e20, 0x20, 0xe20)
  TRAMP_KVM_HV(PACA_EXGEN, 0xe20)
  EXC_COMMON(h_instr_storage_common, 0xe20, unknown_exception)
  


linux-next: manual merge of the kvm tree with the powerpc tree

2017-02-09 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the kvm tree got a conflict in:

  arch/powerpc/include/asm/head-64.h

between commit:

  852e5da99d15 ("powerpc/64s: Tidy up after exception handler rework")

from the powerpc tree and commit:

  7ede531773ea ("KVM: PPC: Book3S: Move 64-bit KVM interrupt handler out from 
alt section")

from the kvm tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc arch/powerpc/include/asm/head-64.h
index a475711cd9c3,9bd81619d090..
--- a/arch/powerpc/include/asm/head-64.h
+++ b/arch/powerpc/include/asm/head-64.h
@@@ -223,8 -217,8 +223,8 @@@ name
FIXED_SECTION_ENTRY_BEGIN(virt_trampolines, name)
  
  #ifdef CONFIG_KVM_BOOK3S_64_HANDLER
 -#define TRAMP_KVM_BEGIN(name) \
 +#define TRAMP_KVM_BEGIN(name) \
-   TRAMP_REAL_BEGIN(name)
+   TRAMP_VIRT_BEGIN(name)
  #else
  #define TRAMP_KVM_BEGIN(name)
  #endif


Re: [PATCH 1/3] kprobes: introduce weak variant of kprobe_exceptions_notify

2017-02-09 Thread Michael Ellerman
"Naveen N. Rao"  writes:

> kprobe_exceptions_notify() is not used on some of the architectures such
> as arm[64] and powerpc anymore. Introduce a weak variant for such
> architectures.

I'll merge patch 1 & 3 via the powerpc tree for v4.11.

You can then send patch 2 to the arm guys after -rc1, or for 4.12.

cheers


Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update

2017-02-09 Thread David Gibson
On Thu, Feb 09, 2017 at 07:20:11PM +1100, Alexey Kardashevskiy wrote:
> On 09/02/17 14:51, David Gibson wrote:
> > On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote:
> >> For the emulated devices it does not matter much if we get a broken TCE
> >> half way handling a TCE list but for VFIO it will matter as it has
> >> more chances to fail so we try to do our best and check as much as we
> >> can before proceeding.
> >>
> >> This separates a guest view table update from validation. No change in
> >> behavior is expected.
> >>
> >> Signed-off-by: Alexey Kardashevskiy 
> >> ---
> >>  arch/powerpc/kvm/book3s_64_vio.c| 8 
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++--
> >>  2 files changed, 14 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
> >> b/arch/powerpc/kvm/book3s_64_vio.c
> >> index 15df8ae627d9..9a7b7fca5e84 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>ret = kvmppc_tce_validate(stt, tce);
> >>if (ret != H_SUCCESS)
> >>goto unlock_exit;
> >> +  }
> >> +
> >> +  for (i = 0; i < npages; ++i) {
> >> +  if (get_user(tce, tces + i)) {
> >> +  ret = H_TOO_HARD;
> >> +  goto unlock_exit;
> >> +  }
> >> +  tce = be64_to_cpu(tce);
> > 
> > This doesn't look safe.  The contents of user memory could change
> > between the two get_user()s, meaning that you're no longer guaranteed
> > a TCE loaded into kernel has been validated at all.
> > 
> > I think you need to either:
> > 
> > a) Make sure things safe against a bad TCE being loaded into a TCE
> > table and move all validation to where the TCE is used, rather
> > than loaded
> > 
> > or
> > b) Copy the whole set of indirect entries to a temporary in-kernel
> >buffer, then validate, then load into the actual TCE table.
> 
> 
> Correct :( The problem is I do not know how far I want to go in reverting
> the state as it was when I started handling H_PUT_TCE_INDIRECT.
> 
> For example, 1 container, 2 IOMMU groups with disabled shared tables, so -
> 2 tables, 512 TCEs request and TCE#100 does not translate to host physical
> address.
> 
> 
> To do a) I'll need to remember old content of each hardware table entry as
> when I reach TCE#100, I'll need to revert to the initial state which means
> I need to write back old TCEs to all affected hardware tables and update
> reference counters of all affected preregistered areas. Well, the actual
> tables must not have different addresses (BUG_ON? is it worth testing while
> writing to hardware tables that values I am replacing are the same in all
> tables?) so I can have just a single array of old TCEs from hardware tables
> in vcpu.

I thought you said shared tables were disabled, so the two tables
would have different addresses?

Hmm.  Now I'm trying to remember, will the gpa->hpa translation fail
only if the guest/qemu does something wrong, or can it fail for other
reasons?  What about in real mode vs. virtual mode?

I think the key to this approach will be to think carefully about what
semantics you guarantee for mappings shadowed into the hardware
tables.  For example, it might work to specify that the host mappings
only match the GPA mappings if those GPA mapings are valid in the
first place.  So, H_PUT_TCE etc. would succeed as long as they're able
to update the view of the table in terms of GPA.  But when you shadow
those into the HPA tables, any entries which can't be translated you
just replace with a cleared entry. That should be enough to protect
the host.  Obviously you can expect the device to fail when you
actually attempt to DMA there, but that's the guest's (or qemu's) own
fault for putting bad addresses in the TCE table.

Obviously that might not be great for debugging, since mappings will
appear to succeed, but then not work later on.

This does have the nice property that it's reasonably obvious what to
do if you have some GPA mappings for emulated devices, then hotplug a
VFIO device and at that point hit a gpa->hpa translation error.
There's no hcall in this case, so there's no obvious way to return an
error to the guest.

> To do b) I'll need:
> 
> 1. to have a copy of TCEs from the guest in vcpu,

I don't quite understand this.  You need a temporary copy, yes, but I
don't see why it needs to be attached to the vcpu.

> I populate it via
> get_user() to make sure they won't change;
> 2. an array of userspace addresses translated from given TCEs; and in order
> to make sure these addresses won't go away, I'll need to reference each
> preregistered memory area via mm_iommu_mapped_inc().
> 
> When I reach TCE#100, I'll have to revert the change, i.e. call
> mm_iommu_mapped_dec().

Ugh.. yeah, I think to do this sanely, what you'd have to do is copy
the updated 

Re: [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-02-09 Thread Alexey Kardashevskiy
On 09/02/17 17:41, David Gibson wrote:
> On Tue, Feb 07, 2017 at 06:17:11PM +1100, Alexey Kardashevskiy wrote:
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
>> without passing them to user space which saves time on switching
>> to user space and back.
>>
>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>> KVM tries to handle a TCE request in real mode; if that fails,
>> it passes the request to virtual mode to complete the operation.
>> If the virtual mode handler also fails, the request is passed to
>> user space; this is not expected to happen though.
>>
>> To avoid dealing with page use counters (which is tricky in real mode),
>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>> to pre-register the userspace memory. The very first TCE request will
>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>> of the TCE table (iommu_table::it_userspace) is not allocated till
>> the very first mapping happens and we cannot call vmalloc in real mode.
>>
>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>> and associates a physical IOMMU table with the SPAPR TCE table (which
>> is a guest view of the hardware IOMMU table). The iommu_table object
>> is cached and referenced so we do not have to look up for it in real mode.
>>
>> This does not implement the UNSET counterpart as there is no use for it -
>> once the acceleration is enabled, the existing userspace won't
>> disable it unless a VFIO container is destroyed; this adds necessary
>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>
>> As this creates a descriptor per IOMMU table-LIOBN couple (called
>> kvmppc_spapr_tce_iommu_table), it is possible to have several
>> descriptors with the same iommu_table (hardware IOMMU table) attached
>> to the same LIOBN; we do not remove duplicates though as
>> iommu_table_ops::exchange does not just update a TCE entry (which is
>> shared among IOMMU groups) but also invalidates the TCE cache
>> (one per IOMMU group).
>>
>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>> space.
>>
>> This finally makes use of vfio_external_user_iommu_id() which was
>> introduced quite some time ago and was considered for removal.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>> Changes:
>> v4:
>> * added note to the commit log about allowing multiple updates of
>> the same IOMMU table;
>> * instead of checking for if any memory was preregistered, this
>> returns H_TOO_HARD if a specific page was not;
>> * fixed comments from v3 about error handling in many places;
>> * simplified TCE handlers and merged IOMMU parts inline - for example,
>> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
>> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
>> the first attached table only (makes the code simpler);
>>
>> v3:
>> * simplified not to use VFIO group notifiers
>> * reworked cleanup, should be cleaner/simpler now
>>
>> v2:
>> * reworked to use new VFIO notifiers
>> * now same iommu_table may appear in the list several times, to be fixed 
>> later
>> ---
>>
>> This has separate copies of handlers for real and virtual modes as
>> in fact H_PUT_TCE and H_STUFF_TCE could share a lot (common helpers
>> would take a "realmode" flag) but H_PUT_TCE_INDIRECT uses get_user()
>> in virtual mode and direct access in real mode and having a common
>> helper for it would make things uglier imho.
>>
>>
>> ---
>>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>>  arch/powerpc/include/asm/kvm_host.h|   8 +
>>  arch/powerpc/include/asm/kvm_ppc.h |   4 +
>>  include/uapi/linux/kvm.h   |   8 +
>>  arch/powerpc/kvm/book3s_64_vio.c   | 319 
>> -
>>  arch/powerpc/kvm/book3s_64_vio_hv.c| 172 +++-
>>  arch/powerpc/kvm/powerpc.c |   2 +
>>  virt/kvm/vfio.c|  60 ++
>>  8 files changed, 590 insertions(+), 5 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt 
>> b/Documentation/virtual/kvm/devices/vfio.txt
>> index ef51740c67ca..f95d867168ea 100644
>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>> @@ -16,7 +16,25 @@ Groups:
>>  
>>  KVM_DEV_VFIO_GROUP attributes:
>>KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>> +kvm_device_attr.addr points to an int32_t file descriptor
>> +for the VFIO group.
>>KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>> +kvm_device_attr.addr points to an int32_t file descriptor
>> +for the VFIO 

[PATCH] powerpc: Add a prototype for mcount() so it can be versioned

2017-02-09 Thread Michael Ellerman
Currently we get a warning that _mcount() can't be versioned:

  WARNING: EXPORT symbol "_mcount" [vmlinux] version generation failed, symbol 
will not be versioned.

Add a prototype to asm-prototypes.h to fix it.

The prototype is not really correct, mcount() is not a normal function,
it has a special ABI. But for the purpose of versioning it doesn't
matter.

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/asm-prototypes.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index ba47c70712f9..f6c5264287e5 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -120,4 +120,6 @@ extern s64 __ashrdi3(s64, int);
 extern int __cmpdi2(s64, s64);
 extern int __ucmpdi2(u64, u64);
 
+void _mcount(void);
+
 #endif /* _ASM_POWERPC_ASM_PROTOTYPES_H */
-- 
2.7.4



[PATCH] powerpc: Fix confusing help text for DISABLE_MPROFILE_KERNEL

2017-02-09 Thread Anton Blanchard
From: Anton Blanchard 

The final paragraph of the help text is reversed - we want to
enable this option by default, and disable it if the toolchain
has a working -mprofile-kernel.

Signed-off-by: Anton Blanchard 
---
 arch/powerpc/Kconfig | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index a46d1c0..d2916ff 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -390,8 +390,8 @@ config DISABLE_MPROFILE_KERNEL
  be disabled also.
 
  If you have a toolchain which supports mprofile-kernel, then you can
- enable this. Otherwise leave it disabled. If you're not sure, say
- "N".
+ disable this. Otherwise leave it enabled. If you're not sure, say
+ "Y".
 
 config MPROFILE_KERNEL
depends on PPC64 && CPU_LITTLE_ENDIAN
-- 
2.9.3



Re: [PATCH 2/2] soc/fsl/qe: get rid of immrbar_virt_to_phys()

2017-02-09 Thread Li Yang
On Tue, Feb 7, 2017 at 3:05 AM, Christophe Leroy
 wrote:
> immrbar_virt_to_phys() is not used anymore
>
> Signed-off-by: Christophe Leroy 

Acked-by: Li Yang 

Regards,
Leo


Re: [PATCH 1/2] net: ethernet: ucc_geth: fix MEM_PART_MURAM mode

2017-02-09 Thread Li Yang
On Tue, Feb 7, 2017 at 3:05 AM, Christophe Leroy
 wrote:
> Since commit 5093bb965a163 ("powerpc/QE: switch to the cpm_muram
> implementation"), muram area is not part of immrbar mapping anymore
> so immrbar_virt_to_phys() is not usable anymore.
>
> Fixes: 5093bb965a163 ("powerpc/QE: switch to the cpm_muram implementation)
> Signed-off-by: Christophe Leroy 

Acked-by: Li Yang 

Regards,
Leo


RE: [PATCH] powerpc: Fix inconsistent of_node_to_nid EXPORT_SYMBOL handling

2017-02-09 Thread Shailendra Singh
Hi Michael,

>>>Which was merged in May 2006. So the powerpc version has almost 10 years 
>>>precedence.

Agreed, but keeping the licensing consistent will help drivers make seamless
use of it on all architectures.

>>> But I guess it's a pretty boring API. So I'll merge this unless anyone else 
>>> objects.
Thanks. Appreciate your quick response and help.

Thanks,
Shailendra

-Original Message-
From: Michael Ellerman [mailto:m...@ellerman.id.au] 
Sent: Thursday, February 09, 2017 4:19 AM
To: Shailendra Singh ; linuxppc-dev@lists.ozlabs.org
Cc: j...@ozlabs.org; apop...@au1.ibm.com; balb...@au1.ibm.com; Andy Ritger 
; John Hubbard ; Sherry Cheung 
; Aruna Manjunatha ; John McKenna 
; david.da...@cavium.com; r...@kernel.org; 
will.dea...@arm.com; Arnd Bergmann 
Subject: Re: [PATCH] powerpc: Fix inconsistent of_node_to_nid EXPORT_SYMBOL 
handling

Shailendra Singh  writes:

> The generic implementation of of_node_to_nid is EXPORT_SYMBOL.

True. Added in 298535c00a2c, in April 2016.

> The powerpc implementation added by following commit is EXPORT_SYMBOL_GPL.
> commit 953039c8df7b ("[PATCH] powerpc: Allow devices to register with 
> numa
> topology")

Which was merged in May 2006.

So the powerpc version has almost 10 years precedence.
[Shailendra] - Agreed. Keeping the licensing consistent will help drivers
make use of it on all architectures.

> This creates an inconsistency for of_node_to_nid callers across 
> architectures.
>
> Update the powerpc implementation to be exported consistently with the 
> generic implementation.

But I guess it's a pretty boring API. So I'll merge this unless anyone else 
objects.
[Shailendra] - Thanks. Appreciate your quick response.

cheers


> Signed-off-by: Shailendra Singh 
> Reviewed-by: Andy Ritger 
> ---
>  arch/powerpc/mm/numa.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 
> b1099cb2f393..8aa4ca3c84c9 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -290,7 +290,7 @@ int of_node_to_nid(struct device_node *device)
>  
>   return nid;
>  }
> -EXPORT_SYMBOL_GPL(of_node_to_nid);
> +EXPORT_SYMBOL(of_node_to_nid);
>  
>  static int __init find_min_common_depth(void)  {
> --
> 2.4.11
---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [PATCH] block: sed-opal: reduce stack size of ioctl handler

2017-02-09 Thread Arnd Bergmann
On Wed, Feb 8, 2017 at 11:12 PM, Scott Bauer  wrote:
> On Wed, Feb 08, 2017 at 02:58:28PM -0700, Scott Bauer wrote:
>> Thank you for the report. We want to keep the function calls agnostic to 
>> userland.
>> In the future we will have in-kernel callers and I don't want to have to do 
>> any
>> get_fs(KERNEL_DS) wizardry.
>>
>> Instead I think we can use a union to lessen the stack burden. I tested this 
>> patch just now
>> with config_ksasan and was able to build.
>
> Nack on this patch, it only really masks the issue. Keith pointed out we have 
> a call chain
> up to this ioctl then deeper down into nvme then the block layer. If we use 
> 25% of the stack
> just for this function it's still too dangerous and we'll run into corruption 
> later on and not
> remember this fix. I'll come up with another solution.

I think there are two issues that cause the stack frame to explode
with KASAN, and
your patch addresses one but not the other:

1. checks for variables being accessed after they go out of scope.
This is solved by the
   union at the start of the function, as they never go out of scope now.

2. checks for array overflows when passing a local variable by
reference to another
   function that is not inlined.

To solve the second problem while keeping the in-kernel interface, the
approach that
David suggesteed would work, or you could have a wrapper around each function to
do the copy_from_user in a more obvious but verbose way as I had in my version.

With David's approach, you could actually replace the switch() with a
lookup table
as well.

Arnd


RE: [PATCH] block: sed-opal: reduce stack size of ioctl handler

2017-02-09 Thread David Laight
From: Arnd Bergmann
> Sent: 08 February 2017 21:15
>
> When CONFIG_KASAN is in use, the sed_ioctl function uses unusually large 
> stack,
> as each possible ioctl argument gets its own stack area plus redzone:

Why not do a single copy_from_user() at the top of sed_ioctl() based on
the _IOC_DIR() and _IOC_SIZE() values?

Something like:
int sed_ioctl(..., unsigned int cmd, void __user *arg)
{
u64 buf[??]; /* or a union */
unsigned int cmd_sz = _IOC_SIZE(cmd);

if (_IOC_DIR(cmd) & (_IOC_WRITE | _IOC_READ) && cmd_sz > sizeof buf)
return -EINVAL;

if (_IOC_DIR(cmd) & _IOC_WRITE) {
if (copy_from_user(buf, arg, cmd_sz))
return -EFAULT;
} else {
if (_IOC_DIR(cmd) & _IOC_READ)
memset(buf, 0, cmd_sz);
}

switch (cmd) {
...
rval = ...
...
}

if (rval >= 0 && (_IOC_DIR(cmd) & _IOC_READ)
&& copy_to_user(arg, buf, cmd_sz))
return -EFAULT;

return rval;
}

David



Re: "Unable to handle kernel paging request for instruction fetch" on P4080

2017-02-09 Thread Laurentiu Tudor
Hi Thomas,

On 02/01/2017 11:46 AM, Thomas De Schampheleire wrote:
> On Wed, Jan 25, 2017 at 10:46 AM, Thomas De Schampheleire
>  wrote:
>> Hi,
>>
>> We are experiencing kernel panics of the type "Unable to handle kernel paging
>> request for instruction fetch" but are stuck in our analysis. We would
>> appreciate any help you can give.
>>
>> The problem occurs from time to time on different instances of a particular
>> embedded systems. The kernel is very old, 2.6.36.4, and runs on an 8-core
>> Freescale P4080 QorIQ processor (e500mc-based). Upgrading the kernel to a 
>> newer
>> version is not an option at this point.

Any chance of at least trying a newer kernel just to check if the issue
was fixed in the meantime? Also, trying a newer u-boot might help.
Another thing: can you try booting on one core only?

---
Best Regards, Laurentiu

Re: [2/2] powerpc/powernv: Properly set "host-ipi" on IPIs

2017-02-09 Thread Michael Ellerman
On Tue, 2017-02-07 at 00:35:36 UTC, Benjamin Herrenschmidt wrote:
> Otherwise KVM will fail to pass them through to the host
> 
> Signed-off-by: Benjamin Herrenschmidt 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/f83e6862047e1e371bdc5d512dd6ca

cheers


Re: [1/2] powerpc/powernv: Fix CPU hotplug to handle waking on HVI

2017-02-09 Thread Michael Ellerman
On Tue, 2017-02-07 at 00:35:31 UTC, Benjamin Herrenschmidt wrote:
> The IPIs come in as HVI not EE, so we need to test the appropriate
> SRR1 bits. The encoding is such that it won't have false positives
> on P7 and P8 so we can just test it like that. We also need to handle
> the icp-opal variant of the flush.
> 
> Signed-off-by: Benjamin Herrenschmidt 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/9b256714979fad61ae11d90b53cf67

cheers


Re: powerpc/mm/radix: Update ERAT flushes when invalidating TLB

2017-02-09 Thread Michael Ellerman
On Mon, 2017-02-06 at 02:05:16 UTC, Benjamin Herrenschmidt wrote:
> Three tiny changes to the ERAT flushing logic: First don't make
> it depend on DD1. It hasn't been decided yet but we might run
> DD2 in a mode that also requires explicit flushes for performance
> reasons so make it unconditional. We also add a missing isync, and
> finally remove the flush from _tlbiel_va as it is only necessary
> for congruence-class invalidations (PID, LPID and full TLB), not
> targeted invalidations.
> 
> Signed-off-by: Benjamin Herrenschmidt 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/90c1e3c2fafec57fcb55b5d69bcf29

cheers


Re: powerpc/mm: Fix spurrious segfaults on radix with Autonuma

2017-02-09 Thread Michael Ellerman
On Fri, 2017-02-03 at 06:10:28 UTC, Benjamin Herrenschmidt wrote:
> When autonuma marks a PTE inaccessible it clears all the protection
> bits but leave the PTE valid.
> 
> With the Radix MMU, an attempt at executing from such a PTE will
> take a fault with bit 35 of SRR1 set "SRR1_ISI_N_OR_G".
> 
> It is thus incorrect to treat all such faults as errors. We should
> pass them to handle_mm_fault() for autonuma to deal with. The case
> of pages that are really not executable is handled by the existing
> test for VM_EXEC further down.
> 
> That leaves us with catching the kernel attempts at executing user
> pages. We can catch that earlier, even before we do find_vma.
> 
> It is never valid on powerpc for the kernel to take an exec fault
> to begin with. So fold that test with the existing test for the
> kernel faulting on kernel addresses to bail out early.
> 
> Signed-off-by: Benjamin Herrenschmidt 
> Fixes: 1d18ad0 ("powerpc/mm: Detect instruction fetch denied and report")
> Fixes: 0ab5171 ("powerpc/mm: Fix no execute fault handling on pre-POWER5")
> Reviewed-by: Aneesh Kumar K.V 
> Acked-by: Balbir Singh 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/d7df2443cd5f67fc6ee7c05a88e499

cheers


Re: [v2] powerpc/opal-irqchip: Use interrupt names if present

2017-02-09 Thread Michael Ellerman
On Wed, 2017-02-08 at 05:33:32 UTC, Michael Ellerman wrote:
> From: Benjamin Herrenschmidt 
> 
> Recent versions of OPAL can provide names for the various OPAL interrupts,
> so let's use them. This also modernises the code that fetches the
> interrupt array to use the helpers provided by the generic code instead
> of hand-parsing the property.
> 
> Signed-off-by: Benjamin Herrenschmidt 
> Signed-off-by: Michael Ellerman 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/2717a33d60745f2f72e521cdaedf79

cheers


Re: powerpc/opal-lpc: Remove unneeded include

2017-02-09 Thread Michael Ellerman
On Tue, 2017-02-07 at 00:37:28 UTC, Benjamin Herrenschmidt wrote:
> We don't need asm/xics.h
> 
> Signed-off-by: Benjamin Herrenschmidt 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/86f7ce4b4751aaf51ebcff6dc21e38

cheers


Re: powerpc/mm: Add MMU_FTR_KERNEL_RO to possible feature mask

2017-02-09 Thread Michael Ellerman
On Mon, 2017-02-06 at 18:39:27 UTC, "Aneesh Kumar K.V" wrote:
> Without this we will always find the feature disabled
> 
> Fixes: 984d7a1ec6 ("powerpc/mm: Fixup kernel read only mapping")
> Signed-off-by: Aneesh Kumar K.V 
> Acked-by: Balbir Singh 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/a5ecdad4847897007399d7a14c9109

cheers


Re: [1/2] powerpc/powernv: Add XHCI and USB storage to defconfig

2017-02-09 Thread Michael Ellerman
On Mon, 2017-02-06 at 02:55:43 UTC, Michael Neuling wrote:
> These are common on bare metal machines, so put them in the defconfig.
> 
> This adds 216KB to the vmlinux size
> 
> Signed-off-by: Michael Neuling 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/e623c54ec995899cf4179873070861

cheers


Re: Documentation: powerpc/fsl: Update compatible for l2cache binding

2017-02-09 Thread Michael Ellerman
On Fri, 2017-02-03 at 00:43:16 UTC, Chris Packham wrote:
> List all the current valid compatible strings for the l2cache binding.
> This should stop checkpatch.pl from complaining and will hopefully save
> someone from having to debug a typo in their dts.
> 
> Signed-off-by: Chris Packham 
> Acked-by: Rob Herring 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/de55ce0de94b5daa804f69aa6ede79

cheers


Re: [2/2] powerpc/powernv: Add more BPF in defconfig

2017-02-09 Thread Michael Ellerman
On Tue, 2017-02-07 at 18:06:41 UTC, "Naveen N. Rao" wrote:
> On 2017/02/07 10:00PM, Michael Ellerman wrote:
> > 
> > I'll mush the patches all together and turn them all on across
> > pseries/ppc64/powernv.
> 
> Here's a version that does that. I have merged all the patches
> (including UPROBE_EVENT since that is very useful with bcc anyway). I
> have also added CGROUP_BPF to get namespace coverage.
> 
> Thanks,
> Naveen
> ---
> powerpc: include bpf/bcc related config options in defconfigs
> 
> Specifically:
> - CONFIG_BPF_SYSCALL
> - CONFIG_NET_SCHED
> - CONFIG_NET_CLS_BPF
> - CONFIG_NET_CLS_ACT
> - CONFIG_NET_ACT_BPF
> - CONFIG_CGROUP_BPF
> - CONFIG_UPROBE_EVENT
> 
> ... in pseries, ppc64 and powernv defconfigs.
> 
> Signed-off-by: Naveen N. Rao 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/63b2547b155de2c41d40ea75015574

cheers


Re: [RESEND,01/10] via-cuda: Cleanup printk calls

2017-02-09 Thread Michael Ellerman
On Sun, 2017-01-01 at 00:56:26 UTC, Finn Thain wrote:
> Add missing log message severity, remove old debug messages and
> replace printk() loop with print_hex_dump() call.
> 
> Tested-by: Stan Johnson 
> Signed-off-by: Finn Thain 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/523717d1496c794e8380d0e3de5ca6

cheers


Re: [RFC] powerpc/64s: use start, size rather than start, end for exception handlers

2017-02-09 Thread Michael Ellerman
On Tue, 2016-12-06 at 01:41:12 UTC, Nicholas Piggin wrote:
> start,size has the benefit of being easier to search for (start,end
> usually gives you the preceding vector from the one you want, as first
> result).
> 
> Suggested-by: Benjamin Herrenschmidt 
> Signed-off-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/1a6822d194c3f627eeb6aaca6688a5

cheers


Re: powerpc/64s: tidy up after exception handler rework

2017-02-09 Thread Michael Ellerman
On Tue, 2016-12-06 at 01:40:15 UTC, Nicholas Piggin wrote:
> Somewhere along the line, search/replace left some naming garbled,
> and untidy alignment. Might as well fix them all up now while git
> blame history doesn't extend too far.
> 
> Signed-off-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/852e5da99d15d0631c09e718abaa4b

cheers


Re: powernv/hmi: Display the correct error info for CAPP errors.

2017-02-09 Thread Michael Ellerman
On Wed, 2016-11-16 at 05:28:02 UTC, Mahesh Salgaonkar wrote:
> From: Mahesh Salgaonkar 
> 
> On some CAPP errors we see console messages that prints unknown HMIs for
> which CAPI recovery is in progress. This patch fixes this by printing
> correct error info for HMI generated due to CAPP recovery.
> 
> Signed-off-by: Mahesh Salgaonkar 
> Tested-by: Andrew Donnellan 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/470a36a8c014e5cac7bb2df3829485

cheers


Re: [PATCH] powerpc: Fix inconsistent of_node_to_nid EXPORT_SYMBOL handling

2017-02-09 Thread Michael Ellerman
Shailendra Singh  writes:

> The generic implementation of of_node_to_nid is EXPORT_SYMBOL.

True. Added in 298535c00a2c, in April 2016.

> The powerpc implementation added by following commit is EXPORT_SYMBOL_GPL.
> commit 953039c8df7b ("[PATCH] powerpc: Allow devices to register with numa
> topology")

Which was merged in May 2006.

So the powerpc version has almost 10 years precedence.

> This creates an inconsistency for of_node_to_nid callers across
> architectures.
>
> Update the powerpc implementation to be exported consistently with the
> generic implementation.

But I guess it's a pretty boring API. So I'll merge this unless anyone
else objects.

cheers


> Signed-off-by: Shailendra Singh 
> Reviewed-by: Andy Ritger 
> ---
>  arch/powerpc/mm/numa.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index b1099cb2f393..8aa4ca3c84c9 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -290,7 +290,7 @@ int of_node_to_nid(struct device_node *device)
>  
>   return nid;
>  }
> -EXPORT_SYMBOL_GPL(of_node_to_nid);
> +EXPORT_SYMBOL(of_node_to_nid);
>  
>  static int __init find_min_common_depth(void)
>  {
> -- 
> 2.4.11


Re: linux-next: manual merge of the kvm-ppc tree with the powerpc tree

2017-02-09 Thread Michael Ellerman
Stephen Rothwell  writes:

> Hi Paul,
>
> Today's linux-next merge of the kvm-ppc tree got conflicts in:
>
>   arch/powerpc/include/asm/opal.h
>   arch/powerpc/platforms/powernv/opal-wrappers.S
>
> between commit:
>
>   1d0761d2557d ("powerpc/powernv: Initialise nest mmu")
>
> from the powerpc tree and commit:
>
>   ab9bad0ead9a ("powerpc/powernv: Remove separate entry for OPAL real mode 
> calls")
>
> from the kvm-ppc tree.

Thanks.

This will go away when I merge the ppc-kvm topic branch.

cheers


Re: [PATCH 1/2] net: ethernet: ucc_geth: fix MEM_PART_MURAM mode

2017-02-09 Thread Christophe LEROY

Acked-by: David S. Miller 

Le 07/02/2017 à 10:05, Christophe Leroy a écrit :

Since commit 5093bb965a163 ("powerpc/QE: switch to the cpm_muram
implementation"), the muram area is no longer part of the immrbar mapping,
so immrbar_virt_to_phys() is no longer usable.

Fixes: 5093bb965a163 ("powerpc/QE: switch to the cpm_muram implementation")
Signed-off-by: Christophe Leroy 
---
 drivers/net/ethernet/freescale/ucc_geth.c | 8 +++-
 include/soc/fsl/qe/qe.h   | 1 +
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/freescale/ucc_geth.c 
b/drivers/net/ethernet/freescale/ucc_geth.c
index 3f7ae9f..f77ba9f 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -2594,11 +2594,10 @@ static int ucc_geth_startup(struct ucc_geth_private 
*ugeth)
} else if (ugeth->ug_info->uf_info.bd_mem_part ==
   MEM_PART_MURAM) {
out_be32(&ugeth->p_send_q_mem_reg->sqqd[i].bd_ring_base,
-(u32) immrbar_virt_to_phys(ugeth->
-   p_tx_bd_ring[i]));
+(u32)qe_muram_dma(ugeth->p_tx_bd_ring[i]));
out_be32(&ugeth->p_send_q_mem_reg->sqqd[i].
 last_bd_completed_address,
-(u32) immrbar_virt_to_phys(endOfRing));
+(u32)qe_muram_dma(endOfRing));
}
}

@@ -2844,8 +2843,7 @@ static int ucc_geth_startup(struct ucc_geth_private 
*ugeth)
} else if (ugeth->ug_info->uf_info.bd_mem_part ==
   MEM_PART_MURAM) {
out_be32(&ugeth->p_rx_bd_qs_tbl[i].externalbdbaseptr,
-(u32) immrbar_virt_to_phys(ugeth->
-   p_rx_bd_ring[i]));
+(u32)qe_muram_dma(ugeth->p_rx_bd_ring[i]));
}
/* rest of fields handled by QE */
}
diff --git a/include/soc/fsl/qe/qe.h b/include/soc/fsl/qe/qe.h
index 70339d7..0cd4c1147 100644
--- a/include/soc/fsl/qe/qe.h
+++ b/include/soc/fsl/qe/qe.h
@@ -243,6 +243,7 @@ static inline int qe_alive_during_sleep(void)
 #define qe_muram_free cpm_muram_free
 #define qe_muram_addr cpm_muram_addr
 #define qe_muram_offset cpm_muram_offset
+#define qe_muram_dma cpm_muram_dma

 #define qe_setbits32(_addr, _v) iowrite32be(ioread32be(_addr) |  (_v), (_addr))
 #define qe_clrbits32(_addr, _v) iowrite32be(ioread32be(_addr) & ~(_v), (_addr))



Re: [PATCH v4 13/15] livepatch: change to a per-task consistency model

2017-02-09 Thread Petr Mladek
On Wed 2017-02-08 10:46:36, Josh Poimboeuf wrote:
> On Wed, Feb 08, 2017 at 04:47:50PM +0100, Petr Mladek wrote:
> > > Notice in this case that klp_target_state is KLP_PATCHED.  Which means
> > > that klp_complete_transition() would not call synchronize_rcu() at the
> > > right time, nor would it call module_put().  It can be fixed with:
> > >
> > > @@ -387,7 +389,7 @@ static int __klp_enable_patch(struct klp_patch *patch)
> > >   pr_warn("failed to enable patch '%s'\n",
> > >   patch->mod->name);
> > >  
> > > - klp_unpatch_objects(patch);
> > > + klp_target_state = KLP_UNPATCHED;
> > >   klp_complete_transition();
> > >  
> > >   return ret;
> > 
> > Great catch! Looks good to me.
> 
> As it turns out, klp_target_state isn't accessible from this file, so
> I'll make a klp_cancel_transition() helper function which reverses
> klp_target_state and calls klp_complete_transition().

Sound good to me.

> > > This assumes that the 'if (klp_target_state == KLP_UNPATCHED)' clause in
> > > klp_try_complete_transition() gets moved to klp_complete_transition() as
> > > you suggested.
> > > 
> > > > > > > diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
> > > > > > > index 5efa262..1a77f05 100644
> > > > > > > --- a/kernel/livepatch/patch.c
> > > > > > > +++ b/kernel/livepatch/patch.c
> > > > > > > @@ -29,6 +29,7 @@
> > > > > > >  #include 
> > > > > > >  #include 
> > > > > > >  #include "patch.h"
> > > > > > > +#include "transition.h"
> > > > > > >  
> > > > > > >  static LIST_HEAD(klp_ops);
> > > > > > >  
> > > > > > > @@ -54,15 +55,58 @@ static void notrace 
> > > > > > > klp_ftrace_handler(unsigned long ip,
> > > > > > >  {
> > > > > > >   struct klp_ops *ops;
> > > > > > >   struct klp_func *func;
> > > > > > > + int patch_state;
> > > > > > >  
> > > > > > >   ops = container_of(fops, struct klp_ops, fops);
> > > > > > >  
> > > > > > >   rcu_read_lock();
> > > > > > > +
> > > > > > > func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
> > > > > > > stack_node);
> > > > > > > +
> > > > > > > + /*
> > > > > > > +  * func should never be NULL because preemption should be 
> > > > > > > disabled here
> > > > > > > +  * and unregister_ftrace_function() does the equivalent of a
> > > > > > > +  * synchronize_sched() before the func_stack removal.
> > > > > > > +  */
> > > > > > > + if (WARN_ON_ONCE(!func))
> > > > > > > + goto unlock;
> > > > > > > +
> > > > > > > + /*
> > > > > > > +  * Enforce the order of the ops->func_stack and 
> > > > > > > func->transition reads.
> > > > > > > +  * The corresponding write barrier is in __klp_enable_patch().
> > > > > > > +  */
> > > > > > > + smp_rmb();
> > > > > > 
> > > > > > I was curious why the comment did not mention __klp_disable_patch().
> > > > > > It was related to the hours of thinking. I would like to avoid this
> > > > > > in the future and add a comment like.
> > > > > > 
> > > > > >  * This barrier probably is not needed when the patch is being
> > > > > >  * disabled. The patch is removed from the stack in
> > > > > >  * klp_try_complete_transition() and there we need to call
> > > > > >  * rcu_synchronize() to prevent seeing the patch on the stack
> > > > > >  * at all.
> > > > > >  *
> > > > > >  * Well, it still might be needed to see func->transition
> > > > > >  * when the patch is removed and the task is migrated. See
> > > > > >  * the write barrier in __klp_disable_patch().
> > > > > 
> > > > > Agreed, though as you mentioned earlier, there's also the implicit
> > > > > barrier in klp_update_patch_state(), which would execute first in 
> > > > > such a
> > > > > scenario.  So I think I'll update the barrier comments in
> > > > > klp_update_patch_state().
> > > > 
> > > > You inspired me to a scenario with 3 CPUs:
> > > > 
> > > > CPU0CPU1CPU2
> > > > 
> > > > __klp_disable_patch()
> > > > 
> > > >   klp_init_transition()
> > > > 
> > > > func->transition = true
> > > > 
> > > >   smp_wmb()
> > > > 
> > > >   klp_start_transition()
> > > > 
> > > > set TIF_PATCH_PATCHPENDING
> > > > 
> > > > klp_update_patch_state()
> > > > 
> > > >   task->patch_state
> > > >  = KLP_UNPATCHED
> > > > 
> > > >   smp_mb()
> > > > 
> > > > klp_ftrace_handler()
> > > >   func = list_...
> > > > 
> > > >   smp_rmb() /*needed?*/
> > > > 
> > > >   if (func->transition)
> > > > 
> > > 
> > > I think this isn't possible.  Remember the comment I added to
> > > klp_update_patch_state():
> > > 
> > >  * NOTE: If task is not 'current', the 

Re: [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-02-09 Thread David Gibson
On Tue, Feb 07, 2017 at 06:17:11PM +1100, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
> without passing them to user space which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in real mode; if that fails,
> it passes the request to virtual mode to complete the operation.
> If the virtual mode handler also fails, the request is passed to
> user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look up for it in real mode.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> As this creates a descriptor per IOMMU table-LIOBN couple (called
> kvmppc_spapr_tce_iommu_table), it is possible to have several
> descriptors with the same iommu_table (hardware IOMMU table) attached
> to the same LIOBN; we do not remove duplicates though as
> iommu_table_ops::exchange does not just update a TCE entry (which is
> shared among IOMMU groups) but also invalidates the TCE cache
> (one per IOMMU group).
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v4:
> * added note to the commit log about allowing multiple updates of
> the same IOMMU table;
> * instead of checking for if any memory was preregistered, this
> returns H_TOO_HARD if a specific page was not;
> * fixed comments from v3 about error handling in many places;
> * simplified TCE handlers and merged IOMMU parts inline - for example,
> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
> the first attached table only (makes the code simpler);
> 
> v3:
> * simplified not to use VFIO group notifiers
> * reworked cleanup, should be cleaner/simpler now
> 
> v2:
> * reworked to use new VFIO notifiers
> * now same iommu_table may appear in the list several times, to be fixed later
> ---
> 
> This has separate copies of handlers for real and virtual modes as
> in fact H_PUT_TCE and H_STUFF_TCE could share a lot (common helpers
> would take a "realmode" flag) but H_PUT_TCE_INDIRECT uses get_user()
> in virtual mode and direct access in real mode and having a common
> helper for it would make things uglier imho.
> 
> 
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>  arch/powerpc/include/asm/kvm_host.h|   8 +
>  arch/powerpc/include/asm/kvm_ppc.h |   4 +
>  include/uapi/linux/kvm.h   |   8 +
>  arch/powerpc/kvm/book3s_64_vio.c   | 319 
> -
>  arch/powerpc/kvm/book3s_64_vio_hv.c| 172 +++-
>  arch/powerpc/kvm/powerpc.c |   2 +
>  virt/kvm/vfio.c|  60 ++
>  8 files changed, 590 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt 
> b/Documentation/virtual/kvm/devices/vfio.txt
> index ef51740c67ca..f95d867168ea 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -16,7 +16,25 @@ Groups:
>  
>  KVM_DEV_VFIO_GROUP attributes:
>KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> + kvm_device_attr.addr points to an int32_t file descriptor
> + for the VFIO group.
>KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> + kvm_device_attr.addr points to an int32_t file descriptor
> + for the VFIO group.
> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> + allocated by sPAPR KVM.
> + 

[PATCH 5/5] selftests/powerpc: Fix remaining fallout from recent changes

2017-02-09 Thread Michael Ellerman
In benchmarks we need to use $(TEST_GEN_PROGS) after we include lib.mk,
because lib.mk does the substitution to add $(OUTPUT).

In math the vmx and fpu names were typoed so they no longer matched
correctly, put back the 'v' and 'f'.

In tm we need to substitute $(OUTPUT) into SIGNAL_CONTEXT_CHK_TESTS so
that the rule matches.

In pmu there is an extraneous ':' on the end of $$BUILD_TARGET for the
clean and install rules, which breaks the logic in the child Makefiles.

Fixes: a8ba798bc8ec ("selftests: enable O and KBUILD_OUTPUT")
Signed-off-by: Michael Ellerman 
---
 tools/testing/selftests/powerpc/benchmarks/Makefile |  4 ++--
 tools/testing/selftests/powerpc/math/Makefile   | 16 
 tools/testing/selftests/powerpc/pmu/Makefile|  4 ++--
 tools/testing/selftests/powerpc/tm/Makefile |  1 +
 4 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/powerpc/benchmarks/Makefile 
b/tools/testing/selftests/powerpc/benchmarks/Makefile
index 286c6ed2378c..fb96a89bd953 100644
--- a/tools/testing/selftests/powerpc/benchmarks/Makefile
+++ b/tools/testing/selftests/powerpc/benchmarks/Makefile
@@ -2,10 +2,10 @@ TEST_GEN_PROGS := gettimeofday context_switch mmap_bench 
futex_bench null_syscall
 
 CFLAGS += -O2
 
-$(TEST_GEN_PROGS): ../harness.c
-
 include ../../lib.mk
 
+$(TEST_GEN_PROGS): ../harness.c
+
 $(OUTPUT)/context_switch: ../utils.c
 $(OUTPUT)/context_switch: CFLAGS += -maltivec -mvsx -mabi=altivec
 $(OUTPUT)/context_switch: LDLIBS += -lpthread
diff --git a/tools/testing/selftests/powerpc/math/Makefile 
b/tools/testing/selftests/powerpc/math/Makefile
index da9f42feaaac..fa8bae920c91 100644
--- a/tools/testing/selftests/powerpc/math/Makefile
+++ b/tools/testing/selftests/powerpc/math/Makefile
@@ -5,13 +5,13 @@ include ../../lib.mk
 $(TEST_GEN_PROGS): ../harness.c
 $(TEST_GEN_PROGS): CFLAGS += -O2 -g -pthread -m64 -maltivec
 
-$(OUTPUT)/pu_syscall: fpu_asm.S
-$(OUTPUT)/pu_preempt: fpu_asm.S
-$(OUTPUT)/pu_signal:  fpu_asm.S
+$(OUTPUT)/fpu_syscall: fpu_asm.S
+$(OUTPUT)/fpu_preempt: fpu_asm.S
+$(OUTPUT)/fpu_signal:  fpu_asm.S
 
-$(OUTPUT)/mx_syscall: vmx_asm.S
-$(OUTPUT)/mx_preempt: vmx_asm.S
-$(OUTPUT)/mx_signal: vmx_asm.S
+$(OUTPUT)/vmx_syscall: vmx_asm.S
+$(OUTPUT)/vmx_preempt: vmx_asm.S
+$(OUTPUT)/vmx_signal: vmx_asm.S
 
-vsx_preempt: CFLAGS += -mvsx
-vsx_preempt: vsx_asm.S
+$(OUTPUT)/vsx_preempt: CFLAGS += -mvsx
+$(OUTPUT)/vsx_preempt: vsx_asm.S
diff --git a/tools/testing/selftests/powerpc/pmu/Makefile 
b/tools/testing/selftests/powerpc/pmu/Makefile
index 097b08acd867..e4e55d1d3e0f 100644
--- a/tools/testing/selftests/powerpc/pmu/Makefile
+++ b/tools/testing/selftests/powerpc/pmu/Makefile
@@ -31,12 +31,12 @@ endef
 DEFAULT_INSTALL_RULE := $(INSTALL_RULE)
 override define INSTALL_RULE
$(DEFAULT_INSTALL_RULE)
-   TARGET=ebb; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) 
OUTPUT=$$BUILD_TARGET: -C $$TARGET install
+   TARGET=ebb; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) 
OUTPUT=$$BUILD_TARGET -C $$TARGET install
 endef
 
 clean:
$(RM) $(TEST_GEN_PROGS) $(OUTPUT)/loop.o
-   TARGET=ebb; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) 
OUTPUT=$$BUILD_TARGET: -C $$TARGET clean
+   TARGET=ebb; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) 
OUTPUT=$$BUILD_TARGET -C $$TARGET clean
 
 ebb:
TARGET=$@; BUILD_TARGET=$$OUTPUT/$$TARGET; mkdir -p $$BUILD_TARGET; 
$(MAKE) OUTPUT=$$BUILD_TARGET -k -C $$TARGET all
diff --git a/tools/testing/selftests/powerpc/tm/Makefile 
b/tools/testing/selftests/powerpc/tm/Makefile
index 07da21769ff8..5576ee6a51f2 100644
--- a/tools/testing/selftests/powerpc/tm/Makefile
+++ b/tools/testing/selftests/powerpc/tm/Makefile
@@ -14,5 +14,6 @@ $(OUTPUT)/tm-syscall: tm-syscall-asm.S
 $(OUTPUT)/tm-syscall: CFLAGS += -I../../../../../usr/include
 $(OUTPUT)/tm-tmspr: CFLAGS += -pthread
 
+SIGNAL_CONTEXT_CHK_TESTS := $(patsubst 
%,$(OUTPUT)/%,$(SIGNAL_CONTEXT_CHK_TESTS))
 $(SIGNAL_CONTEXT_CHK_TESTS): tm-signal.S
 $(SIGNAL_CONTEXT_CHK_TESTS): CFLAGS += -mhtm -m64 -mvsx
-- 
2.7.4



[PATCH 4/5] selftests/powerpc: Fix the clean rule since recent changes

2017-02-09 Thread Michael Ellerman
The clean rule is broken for the powerpc tests:

  make[1]: Entering directory 'tools/testing/selftests/powerpc'
  Makefile:63: warning: overriding recipe for target 'clean'
  ../lib.mk:51: warning: ignoring old recipe for target 'clean'
  /bin/sh: 3: Syntax error: end of file unexpected (expecting "done")
  Makefile:63: recipe for target 'clean' failed

Fixes: a8ba798bc8ec ("selftests: enable O and KBUILD_OUTPUT")
Signed-off-by: Michael Ellerman 
---
 tools/testing/selftests/powerpc/Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/powerpc/Makefile 
b/tools/testing/selftests/powerpc/Makefile
index 1d48c0cab596..1c5d0575802e 100644
--- a/tools/testing/selftests/powerpc/Makefile
+++ b/tools/testing/selftests/powerpc/Makefile
@@ -62,7 +62,8 @@ endef
 clean:
@for TARGET in $(SUB_DIRS); do \
BUILD_TARGET=$$OUTPUT/$$TARGET; \
-   $(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET clean;\ done;
+   $(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET clean; \
+   done;
rm -f tags
 
 tags:
-- 
2.7.4



[PATCH 3/5] selftests: Fix the .S and .S -> .o rules

2017-02-09 Thread Michael Ellerman
Both these rules incorrectly use $< (first prerequisite) rather than
$^ (all prerequisites), meaning they don't work if we're using more than
one .S file as input. Switch them to using $^.

They also don't include $(CPPFLAGS) and other variables used in the
default rules, which breaks targets that require those. Fix that by
using the builtin $(COMPILE.S) and $(LINK.S) rules.

Fixes: a8ba798bc8ec ("selftests: enable O and KBUILD_OUTPUT")
Signed-off-by: Michael Ellerman 
---
 tools/testing/selftests/lib.mk | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
index 98841c54763a..ce96d80ad64f 100644
--- a/tools/testing/selftests/lib.mk
+++ b/tools/testing/selftests/lib.mk
@@ -54,9 +54,9 @@ $(OUTPUT)/%:%.c
$(LINK.c) $^ $(LDLIBS) -o $@
 
 $(OUTPUT)/%.o:%.S
-   $(CC) $(ASFLAGS) -c $< -o $@
+   $(COMPILE.S) $^ -o $@
 
 $(OUTPUT)/%:%.S
-   $(CC) $(ASFLAGS) $< -o $@
+   $(LINK.S) $^ $(LDLIBS) -o $@
 
 .PHONY: run_tests all clean install emit_tests
-- 
2.7.4



[PATCH 2/5] selftests: Fix the .c linking rule

2017-02-09 Thread Michael Ellerman
Currently we can't build some tests, for example:

  $ make -C tools/testing/selftests/ TARGETS=vm
  ...
  gcc -Wall -I ../../../../usr/include   -lrt -lpthread 
../../../../usr/include/linux/kernel.h userfaultfd.c -o 
tools/testing/selftests/vm/userfaultfd
  /tmp/ccmOkQSM.o: In function `stress':
  userfaultfd.c:(.text+0xc60): undefined reference to `pthread_create'
  userfaultfd.c:(.text+0xca5): undefined reference to `pthread_create'
  userfaultfd.c:(.text+0xcee): undefined reference to `pthread_create'
  userfaultfd.c:(.text+0xd30): undefined reference to `pthread_create'
  userfaultfd.c:(.text+0xd77): undefined reference to `pthread_join'
  userfaultfd.c:(.text+0xe7d): undefined reference to `pthread_join'
  userfaultfd.c:(.text+0xe9f): undefined reference to `pthread_cancel'
  userfaultfd.c:(.text+0xec6): undefined reference to `pthread_join'
  userfaultfd.c:(.text+0xf14): undefined reference to `pthread_join'
  /tmp/ccmOkQSM.o: In function `userfaultfd_stress':
  userfaultfd.c:(.text+0x13e2): undefined reference to 
`pthread_attr_setstacksize'
  collect2: error: ld returned 1 exit status

This is because the rule for linking .c files to binaries is incorrect.

The first bug is that it uses $< (first prerequisite) instead of $^ (all
prerequisites); fix it by using $^.

Secondly, the ordering of the prerequisites vs $(LDLIBS) is wrong,
meaning on toolchains that use --as-needed we fail to link (as above).
Fix that by placing $(LDLIBS) *after* $^.

Finally switch to using the default rule $(LINK.c), so that we get
$(CPPFLAGS) etc. included.

Fixes: a8ba798bc8ec ("selftests: enable O and KBUILD_OUTPUT")
Signed-off-by: Michael Ellerman 
---
 tools/testing/selftests/lib.mk | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
index 17ed4bbe3963..98841c54763a 100644
--- a/tools/testing/selftests/lib.mk
+++ b/tools/testing/selftests/lib.mk
@@ -51,7 +51,7 @@ clean:
$(RM) -r $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES) 
$(EXTRA_CLEAN)
 
 $(OUTPUT)/%:%.c
-   $(CC) $(CFLAGS) $(LDFLAGS) $(LDLIBS) $< -o $@
+   $(LINK.c) $^ $(LDLIBS) -o $@
 
 $(OUTPUT)/%.o:%.S
$(CC) $(ASFLAGS) -c $< -o $@
-- 
2.7.4



[PATCH 1/5] selftests: Fix selftests build to just build, not run tests

2017-02-09 Thread Michael Ellerman
In commit 88baa78d1f31 ("selftests: remove duplicated all and clean
target"), the "all" target was removed from individual Makefiles and
added to lib.mk.

However the "all" target was added to lib.mk *after* the existing
"runtests" target. This means "runtests" becomes the first (default)
target for most of our Makefiles.

This has the effect of causing a plain "make" to build *and run* the
tests. Which is at best rude, but depending on which tests are run could
oops someone's build machine.

  $ make -C tools/testing/selftests/
  ...
  make[1]: Entering directory 'tools/testing/selftests/bpf'
  gcc -Wall -O2 -I../../../../usr/include   test_verifier.c -o 
tools/testing/selftests/bpf/test_verifier
  gcc -Wall -O2 -I../../../../usr/include   test_maps.c -o 
tools/testing/selftests/bpf/test_maps
  gcc -Wall -O2 -I../../../../usr/include   test_lru_map.c -o 
tools/testing/selftests/bpf/test_lru_map
  #0 add+sub+mul FAIL
  Failed to load prog 'Function not implemented'!
  #1 unreachable FAIL
  Unexpected error message!
  #2 unreachable2 FAIL
  ...

Fix it by moving the "all" target to the start of lib.mk, making it the
default target.

Fixes: 88baa78d1f31 ("selftests: remove duplicated all and clean target")
Signed-off-by: Michael Ellerman 
---
 tools/testing/selftests/lib.mk | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
index 01bb7782a35e..17ed4bbe3963 100644
--- a/tools/testing/selftests/lib.mk
+++ b/tools/testing/selftests/lib.mk
@@ -2,6 +2,11 @@
 # Makefile can operate with or without the kbuild infrastructure.
 CC := $(CROSS_COMPILE)gcc
 
+TEST_GEN_PROGS := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS))
+TEST_GEN_FILES := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_FILES))
+
+all: $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES)
+
 define RUN_TESTS
@for TEST in $(TEST_GEN_PROGS) $(TEST_PROGS); do \
BASENAME_TEST=`basename $$TEST`;\
@@ -42,11 +47,6 @@ endef
 emit_tests:
$(EMIT_TESTS)
 
-TEST_GEN_PROGS := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS))
-TEST_GEN_FILES := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_FILES))
-
-all: $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES)
-
 clean:
$(RM) -r $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES) 
$(EXTRA_CLEAN)
 
-- 
2.7.4



Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update

2017-02-09 Thread Alexey Kardashevskiy
On 09/02/17 14:51, David Gibson wrote:
> On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote:
>> For emulated devices it does not matter much if we get a broken TCE
>> half way through handling a TCE list, but for VFIO it will matter as it
>> has more chances to fail, so we try to do our best and check as much as
>> we can before proceeding.
>>
>> This separates a guest view table update from validation. No change in
>> behavior is expected.
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>>  arch/powerpc/kvm/book3s_64_vio.c| 8 
>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++--
>>  2 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
>> b/arch/powerpc/kvm/book3s_64_vio.c
>> index 15df8ae627d9..9a7b7fca5e84 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  ret = kvmppc_tce_validate(stt, tce);
>>  if (ret != H_SUCCESS)
>>  goto unlock_exit;
>> +}
>> +
>> +for (i = 0; i < npages; ++i) {
>> +if (get_user(tce, tces + i)) {
>> +ret = H_TOO_HARD;
>> +goto unlock_exit;
>> +}
>> +tce = be64_to_cpu(tce);
> 
> This doesn't look safe.  The contents of user memory could change
> between the two get_user()s, meaning that you're no longer guaranteed
> a TCE loaded into kernel has been validated at all.
> 
> I think you need to either:
> 
> a) Make sure things safe against a bad TCE being loaded into a TCE
> table and move all validation to where the TCE is used, rather
> than loaded
> 
> or
> b) Copy the whole set of indirect entries to a temporary in-kernel
>buffer, then validate, then load into the actual TCE table.


Correct :( The problem is I do not know how far I want to go in reverting
to the state as it was when I started handling H_PUT_TCE_INDIRECT.

For example: 1 container, 2 IOMMU groups with shared tables disabled, so
2 tables; a 512-TCE request where TCE#100 does not translate to a host
physical address.


To do a) I'll need to remember the old content of each hardware table
entry: when I reach TCE#100, I'll need to revert to the initial state,
which means writing back old TCEs to all affected hardware tables and
updating reference counters of all affected preregistered areas. Well, the
actual tables must not have different addresses (BUG_ON? is it worth
testing, while writing to hardware tables, that the values I am replacing
are the same in all tables?) so I can keep just a single array of old TCEs
from hardware tables in vcpu.


To do b) I'll need:

1. to have a copy of TCEs from the guest in vcpu, I populate it via
get_user() to make sure they won't change;
2. an array of userspace addresses translated from given TCEs; and in order
to make sure these addresses won't go away, I'll need to reference each
preregistered memory area via mm_iommu_mapped_inc().

When I reach TCE#100, I'll have to revert the change, i.e. call
mm_iommu_mapped_dec().

So I will end up having 2 arrays in a vcpu and simpler reverting code.


Or I can do a simpler version of b) which would store guest TCEs in
kvm_vcpu_arch::tces[512] and use them after checking. If a malicious guest
does something bad and I return from H_PUT_TCE_INDIRECT in the middle of a
request, some preregistered regions will stay referenced till the guest is
killed or rebooted (and this will prevent the memory from unregistering) -
but this means no harm to the host; and with preregistered RAM, there is
no valid reason for H_PUT_TCE_INDIRECT to fail for a good guest.



Which approach to pick?


LoPAPR says:
===
If the TCE parameter represents the logical page address of a page that is
not valid for the calling partition, return
H_Parameter.
===



>>  
>>  kvmppc_tce_put(stt, entry + i, tce);
>>  }
>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c 
>> b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> index 918af76ab2b6..f8a54b7c788e 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> @@ -237,7 +237,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  {
>>  struct kvmppc_spapr_tce_table *stt;
>>  long i, ret = H_SUCCESS;
>> -unsigned long tces, entry, ua = 0;
>> +unsigned long tces, entry, tce, ua = 0;
>>  unsigned long *rmap = NULL;
>>  
>>  stt = kvmppc_find_table(vcpu->kvm, liobn);
>> @@ -279,11 +279,15 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu 
>> *vcpu,
>>  }
>>  
>>  for (i = 0; i < npages; ++i) {
>> -unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
>> +tce = be64_to_cpu(((u64 *)tces)[i]);
>>  
>>  ret = kvmppc_tce_validate(stt, tce);
>>  if (ret != H_SUCCESS)
>>  goto unlock_exit;
>> +}
>> +
>> +for (i = 0; i < npages;