Re: [PATCH v6 1/7] kvmppc: Driver to manage pages of secure guest

2019-08-19 Thread Bharata B Rao
On Tue, Aug 20, 2019 at 04:22:15PM +1000, Suraj Jitindar Singh wrote:
> On Fri, 2019-08-09 at 14:11 +0530, Bharata B Rao wrote:
> > KVMPPC driver to manage page transitions of secure guest
> > via H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.
> > 
> > H_SVM_PAGE_IN: Move the content of a normal page to secure page
> > H_SVM_PAGE_OUT: Move the content of a secure page to normal page
> > 
> > Private ZONE_DEVICE memory equal to the amount of secure memory
> > available in the platform for running secure guests is created
> > via a char device. Whenever a page belonging to the guest becomes
> > secure, a page from this private device memory is used to
> > represent and track that secure page on the HV side. The movement
> > of pages between normal and secure memory is done via
> > migrate_vma_pages() using UV_PAGE_IN and UV_PAGE_OUT ucalls.
> 
> Hi Bharata,
> 
> please see my patch where I define the bits which define the type of
> the rmap entry:
> https://patchwork.ozlabs.org/patch/1149791/
> 
> Please add an entry for the devm pfn type like:
> #define KVMPPC_RMAP_PFN_DEVM 0x0200 /* secure guest devm pfn */
> 
> And the following in the appropriate header file
> 
> static inline bool kvmppc_rmap_is_pfn_devm(unsigned long *rmapp)
> {
>   return !!((*rmapp & KVMPPC_RMAP_TYPE_MASK) == KVMPPC_RMAP_PFN_DEVM);
> }
> 

Sure, I have the equivalents defined locally, will move to appropriate
headers.

> Also see comment below.
> 
> > +static struct page *kvmppc_devm_get_page(unsigned long *rmap,
> > +   unsigned long gpa, unsigned int lpid)
> > +{
> > +   struct page *dpage = NULL;
> > +   unsigned long bit, devm_pfn;
> > +   unsigned long nr_pfns = kvmppc_devm.pfn_last -
> > +   kvmppc_devm.pfn_first;
> > +   unsigned long flags;
> > +   struct kvmppc_devm_page_pvt *pvt;
> > +
> > +   if (kvmppc_is_devm_pfn(*rmap))
> > +   return NULL;
> > +
> > +   spin_lock_irqsave(&kvmppc_devm_lock, flags);
> > +   bit = find_first_zero_bit(kvmppc_devm.pfn_bitmap, nr_pfns);
> > +   if (bit >= nr_pfns)
> > +   goto out;
> > +
> > +   bitmap_set(kvmppc_devm.pfn_bitmap, bit, 1);
> > +   devm_pfn = bit + kvmppc_devm.pfn_first;
> > +   dpage = pfn_to_page(devm_pfn);
> > +
> > +   if (!trylock_page(dpage))
> > +   goto out_clear;
> > +
> > +   *rmap = devm_pfn | KVMPPC_PFN_DEVM;
> > +   pvt = kzalloc(sizeof(*pvt), GFP_ATOMIC);
> > +   if (!pvt)
> > +   goto out_unlock;
> > +   pvt->rmap = rmap;
> 
> Am I missing something, why does the rmap need to be stored in pvt?
> Given the gpa is already stored and this is enough to get back to the
> rmap entry, right?

I use the rmap entry to note that this guest page is secure and is being
represented by a device memory page on the HV side. When the page becomes
normal again, I need to undo that from dev_pagemap_ops.page_free(),
where I don't have the gpa.
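
Roughly, the page_free() side then looks like this (only a sketch based
on the definitions in this patch, not the exact code):

/*
 * Sketch: dev_pagemap_ops.page_free() only receives the struct page,
 * so the saved pvt->rmap pointer is what lets us clear the marker
 * that says "this guest page lives in secure memory".
 */
static void kvmppc_devm_page_free(struct page *page)
{
	unsigned long pfn = page_to_pfn(page);
	struct kvmppc_devm_page_pvt *pvt = page->zone_device_data;
	unsigned long flags;

	spin_lock_irqsave(&kvmppc_devm_lock, flags);
	bitmap_clear(kvmppc_devm.pfn_bitmap,
		     pfn - kvmppc_devm.pfn_first, 1);
	*pvt->rmap = 0;		/* guest page is normal again */
	page->zone_device_data = NULL;
	spin_unlock_irqrestore(&kvmppc_devm_lock, flags);
	kfree(pvt);
}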

Regards,
Bharata.



Re: [PATCH v6 1/7] kvmppc: Driver to manage pages of secure guest

2019-08-19 Thread Suraj Jitindar Singh
On Fri, 2019-08-09 at 14:11 +0530, Bharata B Rao wrote:
> KVMPPC driver to manage page transitions of secure guest
> via H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.
> 
> H_SVM_PAGE_IN: Move the content of a normal page to secure page
> H_SVM_PAGE_OUT: Move the content of a secure page to normal page
> 
> Private ZONE_DEVICE memory equal to the amount of secure memory
> available in the platform for running secure guests is created
> via a char device. Whenever a page belonging to the guest becomes
> secure, a page from this private device memory is used to
> represent and track that secure page on the HV side. The movement
> of pages between normal and secure memory is done via
> migrate_vma_pages() using UV_PAGE_IN and UV_PAGE_OUT ucalls.

Hi Bharata,

please see my patch where I define the bits which define the type of
the rmap entry:
https://patchwork.ozlabs.org/patch/1149791/

Please add an entry for the devm pfn type like:
#define KVMPPC_RMAP_PFN_DEVM 0x0200 /* secure guest devm pfn */

And the following in the appropriate header file

static inline bool kvmppc_rmap_is_pfn_devm(unsigned long *rmapp)
{
	return !!((*rmapp & KVMPPC_RMAP_TYPE_MASK) == KVMPPC_RMAP_PFN_DEVM);
}

Also see comment below.

Thanks,
Suraj

> 
> Signed-off-by: Bharata B Rao 
> ---
>  arch/powerpc/include/asm/hvcall.h  |   4 +
>  arch/powerpc/include/asm/kvm_book3s_devm.h |  29 ++
>  arch/powerpc/include/asm/kvm_host.h|  12 +
>  arch/powerpc/include/asm/ultravisor-api.h  |   2 +
>  arch/powerpc/include/asm/ultravisor.h  |  14 +
>  arch/powerpc/kvm/Makefile  |   3 +
>  arch/powerpc/kvm/book3s_hv.c   |  19 +
>  arch/powerpc/kvm/book3s_hv_devm.c  | 492 +
>  8 files changed, 575 insertions(+)
>  create mode 100644 arch/powerpc/include/asm/kvm_book3s_devm.h
>  create mode 100644 arch/powerpc/kvm/book3s_hv_devm.c
> 
[snip]
> +
> +struct kvmppc_devm_page_pvt {
> + unsigned long *rmap;
> + unsigned int lpid;
> + unsigned long gpa;
> +};
> +
> +struct kvmppc_devm_copy_args {
> + unsigned long *rmap;
> + unsigned int lpid;
> + unsigned long gpa;
> + unsigned long page_shift;
> +};
> +
> +/*
> + * Bits 60:56 in the rmap entry will be used to identify the
> + * different uses/functions of rmap. This definition will move
> + * to a proper header when all other functions are defined.
> + */
> +#define KVMPPC_PFN_DEVM  (0x2ULL << 56)
> +
> +static inline bool kvmppc_is_devm_pfn(unsigned long pfn)
> +{
> + return !!(pfn & KVMPPC_PFN_DEVM);
> +}
> +
> +/*
> + * Get a free device PFN from the pool
> + *
> + * Called when a normal page is moved to secure memory (UV_PAGE_IN).
> Device
> + * PFN will be used to keep track of the secure page on HV side.
> + *
> + * @rmap here is the slot in the rmap array that corresponds to
> @gpa.
> + * Thus a non-zero rmap entry indicates that the corresonding guest
> + * page has become secure, and is not mapped on the HV side.
> + *
> + * NOTE: In this and subsequent functions, we pass around and access
> + * individual elements of kvm_memory_slot->arch.rmap[] without any
> + * protection. Should we use lock_rmap() here?
> + */
> +static struct page *kvmppc_devm_get_page(unsigned long *rmap,
> + unsigned long gpa, unsigned int lpid)
> +{
> + struct page *dpage = NULL;
> + unsigned long bit, devm_pfn;
> + unsigned long nr_pfns = kvmppc_devm.pfn_last -
> + kvmppc_devm.pfn_first;
> + unsigned long flags;
> + struct kvmppc_devm_page_pvt *pvt;
> +
> + if (kvmppc_is_devm_pfn(*rmap))
> + return NULL;
> +
> + spin_lock_irqsave(&kvmppc_devm_lock, flags);
> + bit = find_first_zero_bit(kvmppc_devm.pfn_bitmap, nr_pfns);
> + if (bit >= nr_pfns)
> + goto out;
> +
> + bitmap_set(kvmppc_devm.pfn_bitmap, bit, 1);
> + devm_pfn = bit + kvmppc_devm.pfn_first;
> + dpage = pfn_to_page(devm_pfn);
> +
> + if (!trylock_page(dpage))
> + goto out_clear;
> +
> + *rmap = devm_pfn | KVMPPC_PFN_DEVM;
> + pvt = kzalloc(sizeof(*pvt), GFP_ATOMIC);
> + if (!pvt)
> + goto out_unlock;
> + pvt->rmap = rmap;

Am I missing something, why does the rmap need to be stored in pvt?
Given the gpa is already stored and this is enough to get back to the
rmap entry, right?

> + pvt->gpa = gpa;
> + pvt->lpid = lpid;
> + dpage->zone_device_data = pvt;
> + spin_unlock_irqrestore(&kvmppc_devm_lock, flags);
> +
> + get_page(dpage);
> + return dpage;
> +
> +out_unlock:
> + unlock_page(dpage);
> +out_clear:
> + bitmap_clear(kvmppc_devm.pfn_bitmap,
> +  devm_pfn - kvmppc_devm.pfn_first, 1);
> +out:
> + spin_unlock_irqrestore(&kvmppc_devm_lock, flags);
> + return NULL;
> +}
> +
> 
[snip]


Re: [PATCH 1/2] powerpc/64s: remove support for kernel-mode syscalls

2019-08-19 Thread Nicholas Piggin
Nicholas Piggin's on August 20, 2019 3:11 pm:
> There is support for the kernel to execute the 'sc 0' instruction and
> make a system call to itself. This is a relic that is unused in the
> tree, therefore untested. It's also highly questionable for modules to
> be doing this.

Oh, I'm sorry, this is not just 64s, it's 64e as well; I just realised the
title is wrong. I actually haven't tested 64e either.

Thanks,
Nick


[PATCH 2/2] powerpc/64s: interrupt entry use isel to prevent untrusted speculative r1 values used by the kernel

2019-08-19 Thread Nicholas Piggin
Interrupts may come from user or kernel, so the stack pointer needs to
be set to either the base of the kernel stack, or a new frame on the
existing kernel stack pointer, respectively.

Using a branch for this can lead to r1-indexed memory operations being
speculatively executed using a value of r1 controlled by userspace.
This is the first step to possible speculative execution vulnerability.

This does not appear to be a problem on its own, because loads from the
stack with this rogue address should only come from memory the kernel
previously stored to during the same speculative path, so they should
always be satisfied by the store buffers rather than exposing the
underlying memory contents.

There are some obscure cases where an r1-indexed load may be used in
other ways (e.g., stack unwinding), however they are rare and difficult
to control, and they still need to contain a sequence that subsequently
changes microarchitectural state based on the result of such a rogue
load, in a way that can be observed.

However, it's safer to just close the concern at the first step, by
preventing an untrusted speculative r1 value from leaking into the kernel.
Do this by using isel to select the r1 value rather than a branch. isel
output is not predicted on POWER CPUs which support the instruction,
although this is not architected.
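
A rough C analogue of the branchless select, for illustration only (not
part of this patch):

	/*
	 * Instead of "if (from_user) r1 = kernel_stack_base;", compute
	 * both candidates and pick one through a data dependency, so
	 * there is no conditional branch to predict and hence no
	 * speculative use of the untrusted value.
	 */
	static unsigned long select_r1(unsigned long existing_r1,
				       unsigned long kernel_stack_base,
				       int from_user)
	{
		unsigned long mask = -(unsigned long)!!from_user;

		return (kernel_stack_base & mask) | (existing_r1 & ~mask);
	}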

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 768f133de4f1..8282c01db83e 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -393,15 +393,29 @@ END_FTR_SECTION_NESTED(CPU_FTR_CFAR, CPU_FTR_CFAR, 66);   
   \
  * On entry r13 points to the paca, r9-r13 are saved in the paca,
  * r9 contains the saved CR, r11 and r12 contain the saved SRR0 and
  * SRR1, and relocation is on.
+ *
+ * Using isel to select the r1 kernel stack depending on MSR_PR prevents
+ * speculative execution of memory ops with untrusted addresses (r1 from
+ * userspace) as a hardening measure, although there is no known vulnerability
+ * using a branch here instead. isel will not do value speculation on any POWER
+ * processor that implements it, although this is not currently documented.
  */
 #define EXCEPTION_COMMON(area, trap)  \
andi.   r10,r12,MSR_PR; /* See if coming from user  */ \
mr  r10,r1; /* Save r1  */ \
+BEGIN_FTR_SECTION \
+   ld  r1,PACAKSAVE(r13);  /* base stack if from user  */ \
+   addir1,r1,INT_FRAME_SIZE;   /* adjust for subi  */ \
+   iseleq  r1,r10,r1;  /* original r1 if from kernel   */ \
+   subir1,r1,INT_FRAME_SIZE;   /* alloc frame on kernel stack  */ \
+FTR_SECTION_ELSE  \
subir1,r1,INT_FRAME_SIZE;   /* alloc frame on kernel stack  */ \
beq-1f;\
ld  r1,PACAKSAVE(r13);  /* kernel stack to use  */ \
-1: tdgei   r1,-INT_FRAME_SIZE; /* trap if r1 is in userspace   */ \
-   EMIT_BUG_ENTRY 1b,__FILE__,__LINE__,0; \
+1:\
+ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_207S)  \
+2: tdgei   r1,-INT_FRAME_SIZE; /* trap if r1 is in userspace   */ \
+   EMIT_BUG_ENTRY 2b,__FILE__,__LINE__,0; \
 3: EXCEPTION_PROLOG_COMMON_1();   \
kuap_save_amr_and_lock r9, r10, cr1, cr0;  \
beq 4f; /* if from kernel mode  */ \
-- 
2.22.0



[PATCH 1/2] powerpc/64s: remove support for kernel-mode syscalls

2019-08-19 Thread Nicholas Piggin
There is support for the kernel to execute the 'sc 0' instruction and
make a system call to itself. This is a relic that is unused in the
tree, therefore untested. It's also highly questionable for modules to
be doing this.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/entry_64.S   | 21 ++---
 arch/powerpc/kernel/exceptions-64s.S |  2 --
 2 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 0a0b5310f54a..6467bdab8d40 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -69,24 +69,20 @@ BEGIN_FTR_SECTION
bne .Ltabort_syscall
 END_FTR_SECTION_IFSET(CPU_FTR_TM)
 #endif
-   andi.   r10,r12,MSR_PR
mr  r10,r1
-   addir1,r1,-INT_FRAME_SIZE
-   beq-1f
ld  r1,PACAKSAVE(r13)
-1: std r10,0(r1)
+   std r10,0(r1)
std r11,_NIP(r1)
std r12,_MSR(r1)
std r0,GPR0(r1)
std r10,GPR1(r1)
-   beq 2f  /* if from kernel mode */
 #ifdef CONFIG_PPC_FSL_BOOK3E
 START_BTB_FLUSH_SECTION
BTB_FLUSH(r10)
 END_BTB_FLUSH_SECTION
 #endif
ACCOUNT_CPU_USER_ENTRY(r13, r10, r11)
-2: std r2,GPR2(r1)
+   std r2,GPR2(r1)
std r3,GPR3(r1)
mfcrr2
std r4,GPR4(r1)
@@ -122,14 +118,13 @@ END_BTB_FLUSH_SECTION
 
 #if defined(CONFIG_VIRT_CPU_ACCOUNTING_NATIVE) && defined(CONFIG_PPC_SPLPAR)
 BEGIN_FW_FTR_SECTION
-   beq 33f
-   /* if from user, see if there are any DTL entries to process */
+   /* see if there are any DTL entries to process */
ld  r10,PACALPPACAPTR(r13)  /* get ptr to VPA */
ld  r11,PACA_DTL_RIDX(r13)  /* get log read index */
addir10,r10,LPPACA_DTLIDX
LDX_BE  r10,0,r10   /* get log write index */
-   cmpdcr1,r11,r10
-   beq+cr1,33f
+   cmpdr11,r10
+   beq+33f
bl  accumulate_stolen_time
REST_GPR(0,r1)
REST_4GPRS(3,r1)
@@ -203,6 +198,7 @@ system_call:/* label this so stack 
traces look sane */
mtctr   r12
bctrl   /* Call handler */
 
+   /* syscall_exit can exit to kernel mode, via ret_from_kernel_thread */
 .Lsyscall_exit:
std r3,RESULT(r1)
 
@@ -216,11 +212,6 @@ system_call:   /* label this so stack 
traces look sane */
ld  r12, PACA_THREAD_INFO(r13)
 
ld  r8,_MSR(r1)
-#ifdef CONFIG_PPC_BOOK3S
-   /* No MSR:RI on BookE */
-   andi.   r10,r8,MSR_RI
-   beq-.Lunrecov_restore
-#endif
 
 /*
  * This is a few instructions into the actual syscall exit path (which actually
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 6ba3cc2ef8ab..768f133de4f1 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1521,8 +1521,6 @@ EXC_COMMON(trap_0b_common, 0xb00, unknown_exception)
  * system call / hypercall (0xc00, 0x4c00)
  *
  * The system call exception is invoked with "sc 0" and does not alter HV bit.
- * There is support for kernel code to invoke system calls but there are no
- * in-tree users.
  *
  * The hypercall is invoked with "sc 1" and sets HV=1.
  *
-- 
2.22.0



Re: [PATCH v1 05/10] powerpc/mm: Do early ioremaps from top to bottom on PPC64 too.

2019-08-19 Thread Christophe Leroy
On 20/08/2019 at 02:20, Michael Ellerman wrote:

Nicholas Piggin  writes:

Christophe Leroy's on August 14, 2019 6:11 am:

Until the vmalloc system is up and running, ioremap basically
allocates addresses at the border of the IOREMAP area.

On PPC32, addresses are allocated down from the top of the area
while on PPC64, addresses are allocated up from the base of the
area.
  
This series looks pretty good to me, but I'm not sure about this patch.


It seems like quite a small divergence in terms of code, and it looks
like the final result still has some ifdefs in these functions. Maybe
you could just keep existing behaviour for this cleanup series so it
does not risk triggering some obscure regression?


Yeah that is also my feeling. Changing it *should* work, and I haven't
found anything that breaks yet, but it's one of those things that's
bound to break something for some obscure reason.

Christophe do you think you can rework it to retain the different
allocation directions at least for now?



Yes I have started addressing the comments I received, and I think for 
now I'll keep all the machinery aside from the merge. Not sure yet if 
I'll leave it in pgtables_32/64.c or if I'll add ioremap_32/64.c


Christophe


Re: [RFC PATCH] powerpc: Convert ____flush_dcache_icache_phys() to C

2019-08-19 Thread Alastair D'Silva
On Fri, 2019-08-16 at 15:52 +, Christophe Leroy wrote:
> Resulting code (8xx with 16 bytes per cacheline and 16k pages)
> 
> 016c <__flush_dcache_icache_phys>:
>  16c: 54 63 00 22 rlwinm  r3,r3,0,0,17
>  170: 7d 20 00 a6 mfmsr   r9
>  174: 39 40 04 00 li  r10,1024
>  178: 55 28 07 34 rlwinm  r8,r9,0,28,26
>  17c: 7c 67 1b 78 mr  r7,r3
>  180: 7d 49 03 a6 mtctr   r10
>  184: 7d 00 01 24 mtmsr   r8
>  188: 4c 00 01 2c isync
>  18c: 7c 00 18 6c dcbst   0,r3
>  190: 38 63 00 10 addir3,r3,16
>  194: 42 00 ff f8 bdnz18c <__flush_dcache_icache_phys+0x20>
>  198: 7c 00 04 ac hwsync
>  19c: 7d 49 03 a6 mtctr   r10
>  1a0: 7c 00 3f ac icbi0,r7
>  1a4: 38 e7 00 10 addir7,r7,16
>  1a8: 42 00 ff f8 bdnz1a0 <__flush_dcache_icache_phys+0x34>
>  1ac: 7c 00 04 ac hwsync
>  1b0: 7d 20 01 24 mtmsr   r9
>  1b4: 4c 00 01 2c isync
>  1b8: 4e 80 00 20 blr
> 
> Signed-off-by: Christophe Leroy 
> ---
>  This patch is on top of Alastair's series "powerpc: convert cache
> asm to C"
>  Patch 3 of that series should touch __flush_dcache_icache_phys and
> this
>  patch could come just after patch 3.
> 
>  arch/powerpc/include/asm/cacheflush.h |  8 +
>  arch/powerpc/mm/mem.c | 55
> ---
>  2 files changed, 53 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/cacheflush.h
> b/arch/powerpc/include/asm/cacheflush.h
> index 1826bf2cc137..bf4f2dc4eb76 100644
> --- a/arch/powerpc/include/asm/cacheflush.h
> +++ b/arch/powerpc/include/asm/cacheflush.h
> @@ -47,6 +47,14 @@ void flush_icache_user_range(struct vm_area_struct
> *vma,
>   struct page *page, unsigned long
> addr,
>   int len);
>  void flush_dcache_icache_page(struct page *page);
> +#if defined(CONFIG_PPC32) && !defined(CONFIG_BOOKE)
> +void __flush_dcache_icache_phys(unsigned long physaddr);
> +#else
> +static inline void __flush_dcache_icache_phys(unsigned long
> physaddr)
> +{
> + BUG();
> +}
> +#endif
>  
>  /**
>   * flush_dcache_range(): Write any modified data cache blocks out to
> memory and invalidate them.
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index 43be99de7c9a..43009f9227c4 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -402,6 +402,50 @@ void flush_dcache_page(struct page *page)
>  }
>  EXPORT_SYMBOL(flush_dcache_page);
>  
> +#if defined(CONFIG_PPC32) && !defined(CONFIG_BOOKE)
> +void __flush_dcache_icache_phys(unsigned long physaddr)
> +{
> + unsigned long bytes = l1_dcache_bytes();
> + unsigned long nb = PAGE_SIZE / bytes;
> + unsigned long addr = physaddr & PAGE_MASK;
> + unsigned long msr, msr0;
> + unsigned long loop1 = addr, loop2 = addr;
> +
> + if (cpu_has_feature(CPU_FTR_COHERENT_ICACHE)) {
> + /* For a snooping icache, we still need a dummy icbi to purge all the
> +  * prefetched instructions from the ifetch buffers. We also need a sync
> +  * before the icbi to order the actual stores to memory that might
> +  * have modified instructions with the icbi.
> +  */
> + mb(); /* sync */
> + icbi((void *)addr);
> + mb(); /* sync */
> + isync();
> + return;
> + }
> + msr0 = mfmsr();
> + msr = msr0 & ~MSR_DR;
> + asm volatile(
> + "   mtctr %2;"
> + "   mtmsr %3;"
> + "   isync;"
> + "0: dcbst   0, %0;"
> + "   addi%0, %0, %4;"
> + "   bdnz0b;"
> + "   sync;"
> + "   mtctr %2;"
> + "1: icbi0, %1;"
> + "   addi%1, %1, %4;"
> + "   bdnz1b;"
> + "   sync;"
> + "   mtmsr %5;"
> + "   isync;"
> + : "+r" (loop1), "+r" (loop2)
> + : "r" (nb), "r" (msr), "i" (bytes), "r" (msr0)
> + : "ctr", "memory");
> +}
> +#endif
> +
>  void flush_dcache_icache_page(struct page *page)
>  {
>  #ifdef CONFIG_HUGETLB_PAGE
> @@ -419,16 +463,7 @@ void flush_dcache_icache_page(struct page *page)
>   __flush_dcache_icache(start);
>   kunmap_atomic(start);
>   } else {
> - unsigned long msr = mfmsr();
> -
> - /* Clear the DR bit so that we operate on physical
> -  * rather than virtual addresses
> -  */
> - mtmsr(msr & ~(MSR_DR));
> -
> - __flush_dcache_icache((void *)physaddr);
> -
> - mtmsr(msr);
> + __flush_dcache_icache_phys(page_to_pfn(page) <<
> PAGE_SHIFT);
>   }
>  #endif
>  }


Thanks Christophe,

I'm trying a somewhat different approach that requires less knowledge
of assembler. Handling of CPU_FTR_COHERENT_ICACHE is outside this
function. The code below is not a patch as my tree is a bit messy,
sorry:

/**
 * flush_dcache_icache_phys() - F

Re: [PATCH] powerpc: Don't add -mabi= flags when building with Clang

2019-08-19 Thread Nathan Chancellor
On Mon, Aug 19, 2019 at 04:19:31AM -0500, Segher Boessenkool wrote:
> On Sun, Aug 18, 2019 at 12:13:21PM -0700, Nathan Chancellor wrote:
> > When building pseries_defconfig, building vdso32 errors out:
> > 
> >   error: unknown target ABI 'elfv1'
> > 
> > Commit 4dc831aa8813 ("powerpc: Fix compiling a BE kernel with a
> > powerpc64le toolchain") added these flags to fix building GCC but
> > clang is multitargeted and does not need these flags. The ABI is
> > properly set based on the target triple, which is derived from
> > CROSS_COMPILE.
> 
> You mean that LLVM does not *allow* you to select a different ABI, or
> different ABI options, you always have to use the default.  (Everything
> else you say is true for GCC as well).

I need to improve the wording of the commit message as it is really that
clang does not allow a different ABI to be selected for 32-bit PowerPC,
as the setABI function is not overridden and it defaults to false.

https://github.com/llvm/llvm-project/blob/llvmorg-9.0.0-rc2/clang/include/clang/Basic/TargetInfo.h#L1073-L1078

https://github.com/llvm/llvm-project/blob/llvmorg-9.0.0-rc2/clang/lib/Basic/Targets/PPC.h#L327-L365

GCC appears to just silently ignore this flag (I think it is the
SUBSUBTARGET_OVERRIDE_OPTIONS macro in gcc/config/rs6000/linux64.h).

It can be changed for 64-bit PowerPC, it seems, but it doesn't need to be
with clang because everything is set properly internally (I'll find a
better way to word that clearly, as I am sure I'm not quite getting that
subtlety right).

> (-mabi= does not set a "target ABI", fwiw, it is more subtle; please see
> the documentation.  Unless LLVM is incompatible in that respect as well?)

Are you referring to the error message? I suppose I could file an LLVM
bug report on that but that message applies to all of the '-mabi='
options, which may refer to a target ABI.

Cheers,
Nathan


Re: [PATCH v6 1/7] kvmppc: Driver to manage pages of secure guest

2019-08-19 Thread Thiago Jung Bauermann


Hello Bharata,

I have just a couple of small comments.

Bharata B Rao  writes:

> +/*
> + * Get a free device PFN from the pool
> + *
> + * Called when a normal page is moved to secure memory (UV_PAGE_IN). Device
> + * PFN will be used to keep track of the secure page on HV side.
> + *
> + * @rmap here is the slot in the rmap array that corresponds to @gpa.
> + * Thus a non-zero rmap entry indicates that the corresonding guest

Typo: corresponding

> +static u64 kvmppc_get_secmem_size(void)
> +{
> + struct device_node *np;
> + int i, len;
> + const __be32 *prop;
> + u64 size = 0;
> +
> + np = of_find_node_by_path("/ibm,ultravisor/ibm,uv-firmware");
> + if (!np)
> + goto out;

I believe that in general we try to avoid hard-coding the path when a
node is accessed and searched instead via its compatible property.
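
Something along these lines, for example (the compatible string here is
just a guess, not taken from the patch):

	np = of_find_compatible_node(NULL, NULL, "ibm,uv-firmware");
	if (!np)
		goto out;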

-- 
Thiago Jung Bauermann
IBM Linux Technology Center


Re: [PATCH v10 2/7] powerpc/mce: Fix MCE handling for huge pages

2019-08-19 Thread Nicholas Piggin
Santosh Sivaraj's on August 20, 2019 11:47 am:
> Hi Nick,
> 
> Nicholas Piggin  writes:
> 
>> Santosh Sivaraj's on August 15, 2019 10:39 am:
>>> From: Balbir Singh 
>>> 
>>> The current code would fail on huge pages addresses, since the shift would
>>> be incorrect. Use the correct page shift value returned by
>>> __find_linux_pte() to get the correct physical address. The code is more
>>> generic and can handle both regular and compound pages.
>>> 
>>> Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
>>> Signed-off-by: Balbir Singh 
>>> [ar...@linux.ibm.com: Fixup pseries_do_memory_failure()]
>>> Signed-off-by: Reza Arbab 
>>> Co-developed-by: Santosh Sivaraj 
>>> Signed-off-by: Santosh Sivaraj 
>>> Tested-by: Mahesh Salgaonkar 
>>> Cc: sta...@vger.kernel.org # v4.15+
>>> ---
>>>  arch/powerpc/include/asm/mce.h   |  2 +-
>>>  arch/powerpc/kernel/mce_power.c  | 55 ++--
>>>  arch/powerpc/platforms/pseries/ras.c |  9 ++---
>>>  3 files changed, 32 insertions(+), 34 deletions(-)
>>> 
>>> diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
>>> index a4c6a74ad2fb..f3a6036b6bc0 100644
>>> --- a/arch/powerpc/include/asm/mce.h
>>> +++ b/arch/powerpc/include/asm/mce.h
>>> @@ -209,7 +209,7 @@ extern void release_mce_event(void);
>>>  extern void machine_check_queue_event(void);
>>>  extern void machine_check_print_event_info(struct machine_check_event *evt,
>>>bool user_mode, bool in_guest);
>>> -unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
>>> +unsigned long addr_to_phys(struct pt_regs *regs, unsigned long addr);
>>>  #ifdef CONFIG_PPC_BOOK3S_64
>>>  void flush_and_reload_slb(void);
>>>  #endif /* CONFIG_PPC_BOOK3S_64 */
>>> diff --git a/arch/powerpc/kernel/mce_power.c 
>>> b/arch/powerpc/kernel/mce_power.c
>>> index a814d2dfb5b0..e74816f045f8 100644
>>> --- a/arch/powerpc/kernel/mce_power.c
>>> +++ b/arch/powerpc/kernel/mce_power.c
>>> @@ -20,13 +20,14 @@
>>>  #include 
>>>  
>>>  /*
>>> - * Convert an address related to an mm to a PFN. NOTE: we are in real
>>> - * mode, we could potentially race with page table updates.
>>> + * Convert an address related to an mm to a physical address.
>>> + * NOTE: we are in real mode, we could potentially race with page table 
>>> updates.
>>>   */
>>> -unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
>>> +unsigned long addr_to_phys(struct pt_regs *regs, unsigned long addr)
>>>  {
>>> -   pte_t *ptep;
>>> -   unsigned long flags;
>>> +   pte_t *ptep, pte;
>>> +   unsigned int shift;
>>> +   unsigned long flags, phys_addr;
>>> struct mm_struct *mm;
>>>  
>>> if (user_mode(regs))
>>> @@ -35,14 +36,21 @@ unsigned long addr_to_pfn(struct pt_regs *regs, 
>>> unsigned long addr)
>>> mm = &init_mm;
>>>  
>>> local_irq_save(flags);
>>> -   if (mm == current->mm)
>>> -   ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
>>> -   else
>>> -   ptep = find_init_mm_pte(addr, NULL);
>>> +   ptep = __find_linux_pte(mm->pgd, addr, NULL, &shift);
>>> local_irq_restore(flags);
>>> +
>>> if (!ptep || pte_special(*ptep))
>>> return ULONG_MAX;
>>> -   return pte_pfn(*ptep);
>>> +
>>> +   pte = *ptep;
>>> +   if (shift > PAGE_SHIFT) {
>>> +   unsigned long rpnmask = (1ul << shift) - PAGE_SIZE;
>>> +
>>> +   pte = __pte(pte_val(pte) | (addr & rpnmask));
>>> +   }
>>> +   phys_addr = pte_pfn(pte) << PAGE_SHIFT;
>>> +
>>> +   return phys_addr;
>>>  }
>>
>> This should remain addr_to_pfn I think. None of the callers care what
>> size page the EA was mapped with. 'pfn' is referring to the Linux pfn,
>> which is the small page number.
>>
>>   if (shift > PAGE_SHIFT)
>> return (pte_pfn(*ptep) | ((addr & ((1UL << shift) - 1)) >> PAGE_SHIFT));
>>   else
>> return pte_pfn(*ptep);
>>
>> Something roughly like that, then you don't have to change any callers
>> or am I missing something?
> 
> Here[1] you asked to return the real address rather than pfn, which all
> callers care about. So I made the changes accordingly.
> 
> [1] https://www.spinics.net/lists/kernel/msg3187658.html

Ah I did suggest it, but I meant _exact_ physical address :) The one
matching the effective address you gave it.

As it is now, the physical address is truncated at the small page size,
so if you do that you might as well just keep it as a pfn, with no change
to callers.
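
That is, the "exact" variant would have to add back the offset within
the small page, roughly (a sketch, not a proposed change):

	/* byte-exact physical address for effective address 'addr' */
	phys_addr = (pfn << PAGE_SHIFT) | (addr & (PAGE_SIZE - 1));

where pfn is the small-page frame number computed as above.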

I would also prefer getting the pfn as above rather than constructing a
new pte, which is a neat hack but is not a normal pattern.

Thanks,
Nick


Re: [PATCH 2/2] powerpc: support KASAN instrumentation of bitops

2019-08-19 Thread Daniel Axtens
Christophe Leroy  writes:

> On 19/08/2019 at 08:28, Daniel Axtens wrote:
>> In KASAN development I noticed that the powerpc-specific bitops
>> were not being picked up by the KASAN test suite.
>
> I'm not sure anybody cares about who noticed the problem. This sentence 
> could be rephrased as:
>
> The powerpc-specific bitops are not being picked up by the KASAN test suite.
>
>> 
>> Instrumentation is done via the bitops/instrumented-{atomic,lock}.h
>> headers. They require that arch-specific versions of bitop functions
>> are renamed to arch_*. Do this renaming.
>> 
>> For clear_bit_unlock_is_negative_byte, the current implementation
>> uses the PG_waiters constant. This works because it's a preprocessor
>> macro - so it's only actually evaluated in contexts where PG_waiters
>> is defined. With instrumentation however, it becomes a static inline
>> function, and all of a sudden we need the actual value of PG_waiters.
>> Because of the order of header includes, it's not available and we
>> fail to compile. Instead, manually specify that we care about bit 7.
>> This is still correct: bit 7 is the bit that would mark a negative
>> byte.
>> 
>> Cc: Nicholas Piggin  # clear_bit_unlock_negative_byte
>> Signed-off-by: Daniel Axtens 
>
> Reviewed-by: Christophe Leroy 
>
> Note that this patch might be an opportunity to replace all the 
> '__inline__' with the standard 'inline' keyword.

New patches sent with these things fixed, thanks. 
>
> Some () alignment to be fixed as well, see checkpatch warnings/checks at 
> https://openpower.xyz/job/snowpatch/job/snowpatch-linux-checkpatch/8601//artifact/linux/checkpatch.log
>
>> ---
>>   arch/powerpc/include/asm/bitops.h | 31 +++
>>   1 file changed, 19 insertions(+), 12 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/bitops.h 
>> b/arch/powerpc/include/asm/bitops.h
>> index 603aed229af7..8615b2bc35fe 100644
>> --- a/arch/powerpc/include/asm/bitops.h
>> +++ b/arch/powerpc/include/asm/bitops.h
>> @@ -86,22 +86,22 @@ DEFINE_BITOP(clear_bits, andc, "")
>>   DEFINE_BITOP(clear_bits_unlock, andc, PPC_RELEASE_BARRIER)
>>   DEFINE_BITOP(change_bits, xor, "")
>>   
>> -static __inline__ void set_bit(int nr, volatile unsigned long *addr)
>> +static __inline__ void arch_set_bit(int nr, volatile unsigned long *addr)
>>   {
>>  set_bits(BIT_MASK(nr), addr + BIT_WORD(nr));
>>   }
>>   
>> -static __inline__ void clear_bit(int nr, volatile unsigned long *addr)
>> +static __inline__ void arch_clear_bit(int nr, volatile unsigned long *addr)
>>   {
>>  clear_bits(BIT_MASK(nr), addr + BIT_WORD(nr));
>>   }
>>   
>> -static __inline__ void clear_bit_unlock(int nr, volatile unsigned long 
>> *addr)
>> +static __inline__ void arch_clear_bit_unlock(int nr, volatile unsigned long 
>> *addr)
>>   {
>>  clear_bits_unlock(BIT_MASK(nr), addr + BIT_WORD(nr));
>>   }
>>   
>> -static __inline__ void change_bit(int nr, volatile unsigned long *addr)
>> +static __inline__ void arch_change_bit(int nr, volatile unsigned long *addr)
>>   {
>>  change_bits(BIT_MASK(nr), addr + BIT_WORD(nr));
>>   }
>> @@ -138,26 +138,26 @@ DEFINE_TESTOP(test_and_clear_bits, andc, 
>> PPC_ATOMIC_ENTRY_BARRIER,
>>   DEFINE_TESTOP(test_and_change_bits, xor, PPC_ATOMIC_ENTRY_BARRIER,
>>PPC_ATOMIC_EXIT_BARRIER, 0)
>>   
>> -static __inline__ int test_and_set_bit(unsigned long nr,
>> +static __inline__ int arch_test_and_set_bit(unsigned long nr,
>> volatile unsigned long *addr)
>>   {
>>  return test_and_set_bits(BIT_MASK(nr), addr + BIT_WORD(nr)) != 0;
>>   }
>>   
>> -static __inline__ int test_and_set_bit_lock(unsigned long nr,
>> +static __inline__ int arch_test_and_set_bit_lock(unsigned long nr,
>> volatile unsigned long *addr)
>>   {
>>  return test_and_set_bits_lock(BIT_MASK(nr),
>>  addr + BIT_WORD(nr)) != 0;
>>   }
>>   
>> -static __inline__ int test_and_clear_bit(unsigned long nr,
>> +static __inline__ int arch_test_and_clear_bit(unsigned long nr,
>>   volatile unsigned long *addr)
>>   {
>>  return test_and_clear_bits(BIT_MASK(nr), addr + BIT_WORD(nr)) != 0;
>>   }
>>   
>> -static __inline__ int test_and_change_bit(unsigned long nr,
>> +static __inline__ int arch_test_and_change_bit(unsigned long nr,
>>volatile unsigned long *addr)
>>   {
>>  return test_and_change_bits(BIT_MASK(nr), addr + BIT_WORD(nr)) != 0;
>> @@ -185,15 +185,18 @@ static __inline__ unsigned long 
>> clear_bit_unlock_return_word(int nr,
>>  return old;
>>   }
>>   
>> -/* This is a special function for mm/filemap.c */
>> -#define clear_bit_unlock_is_negative_byte(nr, addr) \
>> -(clear_bit_unlock_return_word(nr, addr) & BIT_MASK(PG_waiters))
>> +/*
>> + * This is a special function for mm/filemap.c
>> + * Bit 7 corresponds to PG_waiters.
>> + */
>> +#define arch_clear_b

[PATCH v2 2/2] powerpc: support KASAN instrumentation of bitops

2019-08-19 Thread Daniel Axtens
The powerpc-specific bitops are not being picked up by the KASAN
test suite.

Instrumentation is done via the bitops/instrumented-{atomic,lock}.h
headers. They require that arch-specific versions of bitop functions
are renamed to arch_*. Do this renaming.

For clear_bit_unlock_is_negative_byte, the current implementation
uses the PG_waiters constant. This works because it's a preprocessor
macro - so it's only actually evaluated in contexts where PG_waiters
is defined. With instrumentation however, it becomes a static inline
function, and all of a sudden we need the actual value of PG_waiters.
Because of the order of header includes, it's not available and we
fail to compile. Instead, manually specify that we care about bit 7.
This is still correct: bit 7 is the bit that would mark a negative
byte.

While we're at it, replace __inline__ with inline across the file.
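
Concretely, the special case ends up along these lines (a sketch of the
intent described above, with bit 7 standing in for PG_waiters):

	#define arch_clear_bit_unlock_is_negative_byte(nr, addr)	\
		(clear_bit_unlock_return_word(nr, addr) & BIT_MASK(7))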

Cc: Nicholas Piggin  # clear_bit_unlock_negative_byte
Reviewed-by: Christophe Leroy 
Signed-off-by: Daniel Axtens 

--
v2: Address Christophe review
---
 arch/powerpc/include/asm/bitops.h | 51 ++-
 1 file changed, 29 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/bitops.h 
b/arch/powerpc/include/asm/bitops.h
index 603aed229af7..28dcf8222943 100644
--- a/arch/powerpc/include/asm/bitops.h
+++ b/arch/powerpc/include/asm/bitops.h
@@ -64,7 +64,7 @@
 
 /* Macro for generating the ***_bits() functions */
 #define DEFINE_BITOP(fn, op, prefix)   \
-static __inline__ void fn(unsigned long mask,  \
+static inline void fn(unsigned long mask,  \
volatile unsigned long *_p) \
 {  \
unsigned long old;  \
@@ -86,22 +86,22 @@ DEFINE_BITOP(clear_bits, andc, "")
 DEFINE_BITOP(clear_bits_unlock, andc, PPC_RELEASE_BARRIER)
 DEFINE_BITOP(change_bits, xor, "")
 
-static __inline__ void set_bit(int nr, volatile unsigned long *addr)
+static inline void arch_set_bit(int nr, volatile unsigned long *addr)
 {
set_bits(BIT_MASK(nr), addr + BIT_WORD(nr));
 }
 
-static __inline__ void clear_bit(int nr, volatile unsigned long *addr)
+static inline void arch_clear_bit(int nr, volatile unsigned long *addr)
 {
clear_bits(BIT_MASK(nr), addr + BIT_WORD(nr));
 }
 
-static __inline__ void clear_bit_unlock(int nr, volatile unsigned long *addr)
+static inline void arch_clear_bit_unlock(int nr, volatile unsigned long *addr)
 {
clear_bits_unlock(BIT_MASK(nr), addr + BIT_WORD(nr));
 }
 
-static __inline__ void change_bit(int nr, volatile unsigned long *addr)
+static inline void arch_change_bit(int nr, volatile unsigned long *addr)
 {
change_bits(BIT_MASK(nr), addr + BIT_WORD(nr));
 }
@@ -109,7 +109,7 @@ static __inline__ void change_bit(int nr, volatile unsigned 
long *addr)
 /* Like DEFINE_BITOP(), with changes to the arguments to 'op' and the output
  * operands. */
 #define DEFINE_TESTOP(fn, op, prefix, postfix, eh) \
-static __inline__ unsigned long fn(\
+static inline unsigned long fn(\
unsigned long mask, \
volatile unsigned long *_p) \
 {  \
@@ -138,34 +138,34 @@ DEFINE_TESTOP(test_and_clear_bits, andc, 
PPC_ATOMIC_ENTRY_BARRIER,
 DEFINE_TESTOP(test_and_change_bits, xor, PPC_ATOMIC_ENTRY_BARRIER,
  PPC_ATOMIC_EXIT_BARRIER, 0)
 
-static __inline__ int test_and_set_bit(unsigned long nr,
-  volatile unsigned long *addr)
+static inline int arch_test_and_set_bit(unsigned long nr,
+   volatile unsigned long *addr)
 {
return test_and_set_bits(BIT_MASK(nr), addr + BIT_WORD(nr)) != 0;
 }
 
-static __inline__ int test_and_set_bit_lock(unsigned long nr,
-  volatile unsigned long *addr)
+static inline int arch_test_and_set_bit_lock(unsigned long nr,
+volatile unsigned long *addr)
 {
return test_and_set_bits_lock(BIT_MASK(nr),
addr + BIT_WORD(nr)) != 0;
 }
 
-static __inline__ int test_and_clear_bit(unsigned long nr,
-volatile unsigned long *addr)
+static inline int arch_test_and_clear_bit(unsigned long nr,
+ volatile unsigned long *addr)
 {
return test_and_clear_bits(BIT_MASK(nr), addr + BIT_WORD(nr)) != 0;
 }
 
-static __inline__ int test_and_change_bit(unsigned long nr,
- volatile unsigned long *addr)
+static inline int arch_test_and_change_bit(unsigned long nr,
+  volatile unsigned long *addr)
 {
return test_and_change_bits(BIT_MASK(nr), addr + BIT_WORD(nr)) != 0;
 }
 
 #ifdef CONFIG_PPC64
-static __inline__ unsigned long clear_bit_unlock_return_word(int nr,
-   

[PATCH v2 1/2] kasan: support instrumented bitops combined with generic bitops

2019-08-19 Thread Daniel Axtens
Currently bitops-instrumented.h assumes that the architecture provides
atomic, non-atomic and locking bitops (e.g. both set_bit and __set_bit).
This is true on x86 and s390, but is not always true: there is a
generic bitops/non-atomic.h header that provides generic non-atomic
operations, and also a generic bitops/lock.h for locking operations.

powerpc uses the generic non-atomic version, so it does not have its
own __set_bit (for example) that could be renamed arch___set_bit.

Split up bitops-instrumented.h to mirror the atomic/non-atomic/lock
split. This allows arches to only include the headers where they
have arch-specific versions to rename. Update x86 and s390.

(The generic operations are automatically instrumented because they're
written in C, not asm.)
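
As a sketch of what this means for an arch like powerpc, which only has
arch-specific atomic and lock variants (illustrative, not part of this
patch):

	/* in the arch's asm/bitops.h, after defining arch_set_bit() etc. */
	#include <asm-generic/bitops/instrumented-atomic.h>
	#include <asm-generic/bitops/instrumented-lock.h>
	/* the generic non-atomic ops are plain C and instrumented already */
	#include <asm-generic/bitops/non-atomic.h>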

Suggested-by: Christophe Leroy 
Reviewed-by: Christophe Leroy 
Signed-off-by: Daniel Axtens 
---
 Documentation/core-api/kernel-api.rst |  17 +-
 arch/s390/include/asm/bitops.h|   4 +-
 arch/x86/include/asm/bitops.h |   4 +-
 include/asm-generic/bitops-instrumented.h | 263 --
 .../asm-generic/bitops/instrumented-atomic.h  | 100 +++
 .../asm-generic/bitops/instrumented-lock.h|  81 ++
 .../bitops/instrumented-non-atomic.h  | 114 
 7 files changed, 317 insertions(+), 266 deletions(-)
 delete mode 100644 include/asm-generic/bitops-instrumented.h
 create mode 100644 include/asm-generic/bitops/instrumented-atomic.h
 create mode 100644 include/asm-generic/bitops/instrumented-lock.h
 create mode 100644 include/asm-generic/bitops/instrumented-non-atomic.h

diff --git a/Documentation/core-api/kernel-api.rst 
b/Documentation/core-api/kernel-api.rst
index 08af5caf036d..2e21248277e3 100644
--- a/Documentation/core-api/kernel-api.rst
+++ b/Documentation/core-api/kernel-api.rst
@@ -54,7 +54,22 @@ The Linux kernel provides more basic utility functions.
 Bit Operations
 --
 
-.. kernel-doc:: include/asm-generic/bitops-instrumented.h
+Atomic Operations
+~
+
+.. kernel-doc:: include/asm-generic/bitops/instrumented-atomic.h
+   :internal:
+
+Non-atomic Operations
+~
+
+.. kernel-doc:: include/asm-generic/bitops/instrumented-non-atomic.h
+   :internal:
+
+Locking Operations
+~~
+
+.. kernel-doc:: include/asm-generic/bitops/instrumented-lock.h
:internal:
 
 Bitmap Operations
diff --git a/arch/s390/include/asm/bitops.h b/arch/s390/include/asm/bitops.h
index b8833ac983fa..0ceb12593a68 100644
--- a/arch/s390/include/asm/bitops.h
+++ b/arch/s390/include/asm/bitops.h
@@ -241,7 +241,9 @@ static inline void arch___clear_bit_unlock(unsigned long nr,
arch___clear_bit(nr, ptr);
 }
 
-#include 
+#include 
+#include 
+#include 
 
 /*
  * Functions which use MSB0 bit numbering.
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index ba15d53c1ca7..4a2e2432238f 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -389,7 +389,9 @@ static __always_inline int fls64(__u64 x)
 
 #include 
 
-#include 
+#include 
+#include 
+#include 
 
 #include 
 
diff --git a/include/asm-generic/bitops-instrumented.h 
b/include/asm-generic/bitops-instrumented.h
deleted file mode 100644
index ddd1c6d9d8db..
--- a/include/asm-generic/bitops-instrumented.h
+++ /dev/null
@@ -1,263 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-
-/*
- * This file provides wrappers with sanitizer instrumentation for bit
- * operations.
- *
- * To use this functionality, an arch's bitops.h file needs to define each of
- * the below bit operations with an arch_ prefix (e.g. arch_set_bit(),
- * arch___set_bit(), etc.).
- */
-#ifndef _ASM_GENERIC_BITOPS_INSTRUMENTED_H
-#define _ASM_GENERIC_BITOPS_INSTRUMENTED_H
-
-#include 
-
-/**
- * set_bit - Atomically set a bit in memory
- * @nr: the bit to set
- * @addr: the address to start counting from
- *
- * This is a relaxed atomic operation (no implied memory barriers).
- *
- * Note that @nr may be almost arbitrarily large; this function is not
- * restricted to acting on a single-word quantity.
- */
-static inline void set_bit(long nr, volatile unsigned long *addr)
-{
-   kasan_check_write(addr + BIT_WORD(nr), sizeof(long));
-   arch_set_bit(nr, addr);
-}
-
-/**
- * __set_bit - Set a bit in memory
- * @nr: the bit to set
- * @addr: the address to start counting from
- *
- * Unlike set_bit(), this function is non-atomic. If it is called on the same
- * region of memory concurrently, the effect may be that only one operation
- * succeeds.
- */
-static inline void __set_bit(long nr, volatile unsigned long *addr)
-{
-   kasan_check_write(addr + BIT_WORD(nr), sizeof(long));
-   arch___set_bit(nr, addr);
-}
-
-/**
- * clear_bit - Clears a bit in memory
- * @nr: Bit to clear
- * @addr: Address to start counting from
- *
- * This is a relaxed atomic operation (no implied memory barriers).
- */
-static inline void clear_bit(long nr, volatile unsigned long *a

Re: [PATCH 0/3] Add bad pmem bad blocks to bad range

2019-08-19 Thread Santosh Sivaraj
Santosh Sivaraj  writes:

> This series, which should be based on top of the still-unmerged
> "powerpc: implement machine check safe memcpy" series, adds support
> for adding the bad blocks which generated an MCE to the NVDIMM bad blocks.
> The next access of the same memory will be blocked by the NVDIMM layer
> itself.

This is the v2 series. I missed adding that in the subject.

>
> ---
> Santosh Sivaraj (3):
>   powerpc/mce: Add MCE notification chain
>   of_pmem: Add memory ranges which took a mce to bad range
>   papr/scm: Add bad memory ranges to nvdimm bad ranges
>
>  arch/powerpc/include/asm/mce.h|   3 +
>  arch/powerpc/kernel/mce.c |  15 +++
>  arch/powerpc/platforms/pseries/papr_scm.c |  86 +++-
>  drivers/nvdimm/of_pmem.c  | 151 +++---
>  4 files changed, 234 insertions(+), 21 deletions(-)
>
> -- 
> 2.21.0


Re: [PATCH] btrfs: fix allocation of bitmap pages.

2019-08-19 Thread Christoph Hellwig
On Mon, Aug 19, 2019 at 07:46:00PM +0200, David Sterba wrote:
> Another thing that is lost is the slub debugging support for all
> architectures, because get_zeroed_pages lacks the red zones and sanity
> checks.
> 
> I find working with raw pages in this code a bit inconsistent with the
> rest of btrfs code, but that's rather minor compared to the above.
> 
> Summing it up, I think that the proper fix should go to the copy_page
> implementation on architectures that require it, or make it clear what
> the copy_page constraints are.

The whole point of copy_page is to copy exactly one page and it makes
sense to assume that it is aligned.  A sane memcpy would use the same
underlying primitives as well after checking they fit.  So I think the
prime issue here is btrfs' use of copy_page instead of memcpy.  The
secondary issue is slub fucking up alignments for no good reason.  We
just got bitten by that crap again in XFS as well :(
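
To spell out the distinction with an illustration (not a proposed
change):

	/*
	 * copy_page() copies exactly PAGE_SIZE bytes and may assume both
	 * pointers are page aligned; memcpy() assumes nothing beyond the
	 * pointer type.  A kmalloc(PAGE_SIZE) buffer is not guaranteed to
	 * be page aligned (e.g. with slub debugging), so only the memcpy
	 * form is safe for it.
	 */
	copy_page(dst, src);		/* real, aligned full pages only */
	memcpy(dst, src, PAGE_SIZE);	/* fine for slab allocations */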


[PATCH 2/3] of_pmem: Add memory ranges which took a mce to bad range

2019-08-19 Thread Santosh Sivaraj
Subscribe to the MCE notification and add the physical address which
generated a memory error to the nvdimm bad range.

Signed-off-by: Santosh Sivaraj 
---
 drivers/nvdimm/of_pmem.c | 151 +--
 1 file changed, 131 insertions(+), 20 deletions(-)

diff --git a/drivers/nvdimm/of_pmem.c b/drivers/nvdimm/of_pmem.c
index a0c8dcfa0bf9..155e56862fdf 100644
--- a/drivers/nvdimm/of_pmem.c
+++ b/drivers/nvdimm/of_pmem.c
@@ -8,6 +8,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 static const struct attribute_group *region_attr_groups[] = {
&nd_region_attribute_group,
@@ -25,11 +28,77 @@ struct of_pmem_private {
struct nvdimm_bus *bus;
 };
 
+struct of_pmem_region {
+   struct of_pmem_private *priv;
+   struct nd_region_desc *region_desc;
+   struct nd_region *region;
+   struct list_head region_list;
+};
+
+LIST_HEAD(pmem_regions);
+DEFINE_MUTEX(pmem_region_lock);
+
+static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
+void *data)
+{
+   struct machine_check_event *evt = data;
+   struct of_pmem_region *pmem_region;
+   u64 aligned_addr, phys_addr;
+   bool found = false;
+
+   if (evt->error_type != MCE_ERROR_TYPE_UE)
+   return NOTIFY_DONE;
+
+   if (list_empty(&pmem_regions))
+   return NOTIFY_DONE;
+
+   phys_addr = evt->u.ue_error.physical_address +
+   (evt->u.ue_error.effective_address & ~PAGE_MASK);
+
+   if (!evt->u.ue_error.physical_address_provided ||
+   !is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
+   return NOTIFY_DONE;
+
+   mutex_lock(&pmem_region_lock);
+   list_for_each_entry(pmem_region, &pmem_regions, region_list) {
+   struct resource *res = pmem_region->region_desc->res;
+
+   if (phys_addr >= res->start && phys_addr <= res->end) {
+   found = true;
+   break;
+   }
+   }
+   mutex_unlock(&pmem_region_lock);
+
+   if (!found)
+   return NOTIFY_DONE;
+
+   aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
+
+   if (nvdimm_bus_add_badrange(pmem_region->priv->bus, aligned_addr,
+   L1_CACHE_BYTES))
+   return NOTIFY_DONE;
+
+   pr_debug("Add memory range (0x%llx -- 0x%llx) as bad range\n",
+aligned_addr, aligned_addr + L1_CACHE_BYTES);
+
+
+   nvdimm_region_notify(pmem_region->region, NVDIMM_REVALIDATE_POISON);
+
+   return NOTIFY_OK;
+}
+
+static struct notifier_block mce_ue_nb = {
+   .notifier_call = handle_mce_ue
+};
+
 static int of_pmem_region_probe(struct platform_device *pdev)
 {
struct of_pmem_private *priv;
struct device_node *np;
struct nvdimm_bus *bus;
+   struct of_pmem_region *pmem_region;
+   struct nd_region_desc *ndr_desc;
bool is_volatile;
int i;
 
@@ -58,32 +127,49 @@ static int of_pmem_region_probe(struct platform_device 
*pdev)
is_volatile ? "volatile" : "non-volatile",  np);
 
for (i = 0; i < pdev->num_resources; i++) {
-   struct nd_region_desc ndr_desc;
struct nd_region *region;
 
-   /*
-* NB: libnvdimm copies the data from ndr_desc into it's own
-* structures so passing a stack pointer is fine.
-*/
-   memset(&ndr_desc, 0, sizeof(ndr_desc));
-   ndr_desc.attr_groups = region_attr_groups;
-   ndr_desc.numa_node = dev_to_node(&pdev->dev);
-   ndr_desc.target_node = ndr_desc.numa_node;
-   ndr_desc.res = &pdev->resource[i];
-   ndr_desc.of_node = np;
-   set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
+   ndr_desc = kzalloc(sizeof(struct nd_region_desc), GFP_KERNEL);
+   if (!ndr_desc) {
+   nvdimm_bus_unregister(priv->bus);
+   kfree(priv);
+   return -ENOMEM;
+   }
+
+   ndr_desc->attr_groups = region_attr_groups;
+   ndr_desc->numa_node = dev_to_node(&pdev->dev);
+   ndr_desc->target_node = ndr_desc->numa_node;
+   ndr_desc->res = &pdev->resource[i];
+   ndr_desc->of_node = np;
+   set_bit(ND_REGION_PAGEMAP, &ndr_desc->flags);
 
if (is_volatile)
-   region = nvdimm_volatile_region_create(bus, &ndr_desc);
+   region = nvdimm_volatile_region_create(bus, ndr_desc);
else
-   region = nvdimm_pmem_region_create(bus, &ndr_desc);
+   region = nvdimm_pmem_region_create(bus, ndr_desc);
 
-   if (!region)
+   if (!region) {
dev_warn(&pdev->dev, "Unable to register region %pR 
from %pOF\n",
- 

[PATCH 3/3] papr/scm: Add bad memory ranges to nvdimm bad ranges

2019-08-19 Thread Santosh Sivaraj
Subscribe to the MCE notification and add the physical address which
generated a memory error to the nvdimm bad range.

Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 86 ++-
 1 file changed, 85 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index a5ac371a3f06..e38f7febc5d9 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -12,6 +12,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 
@@ -39,8 +41,12 @@ struct papr_scm_priv {
struct resource res;
struct nd_region *region;
struct nd_interleave_set nd_set;
+   struct list_head region_list;
 };
 
+LIST_HEAD(papr_nd_regions);
+DEFINE_MUTEX(papr_ndr_lock);
+
 static int drc_pmem_bind(struct papr_scm_priv *p)
 {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
@@ -364,6 +370,10 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
dev_info(dev, "Region registered with target node %d and online 
node %d",
 target_nid, online_nid);
 
+   mutex_lock(&papr_ndr_lock);
+   list_add_tail(&p->region_list, &papr_nd_regions);
+   mutex_unlock(&papr_ndr_lock);
+
return 0;
 
 err:   nvdimm_bus_unregister(p->bus);
@@ -371,6 +381,57 @@ err:   nvdimm_bus_unregister(p->bus);
return -ENXIO;
 }
 
+static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
+void *data)
+{
+   struct machine_check_event *evt = data;
+   struct papr_scm_priv *p;
+   u64 phys_addr, aligned_addr;
+   bool found = false;
+
+   if (evt->error_type != MCE_ERROR_TYPE_UE)
+   return NOTIFY_DONE;
+
+   if (list_empty(&papr_nd_regions))
+   return NOTIFY_DONE;
+
+   phys_addr = evt->u.ue_error.physical_address +
+   (evt->u.ue_error.effective_address & ~PAGE_MASK);
+
+   if (!evt->u.ue_error.physical_address_provided ||
+   !is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
+   return NOTIFY_DONE;
+
+   mutex_lock(&papr_ndr_lock);
+   list_for_each_entry(p, &papr_nd_regions, region_list) {
+   struct resource res = p->res;
+
+   if (phys_addr >= res.start && phys_addr <= res.end) {
+   found = true;
+   break;
+   }
+   }
+   mutex_unlock(&papr_ndr_lock);
+
+   if (!found)
+   return NOTIFY_DONE;
+
+   aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
+   if (nvdimm_bus_add_badrange(p->bus, aligned_addr, L1_CACHE_BYTES))
+   return NOTIFY_DONE;
+
+   pr_debug("Add memory range (0x%llx -- 0x%llx) as bad range\n",
+aligned_addr, aligned_addr + L1_CACHE_BYTES);
+
+   nvdimm_region_notify(p->region, NVDIMM_REVALIDATE_POISON);
+
+   return NOTIFY_OK;
+}
+
+static struct notifier_block mce_ue_nb = {
+   .notifier_call = handle_mce_ue
+};
+
 static int papr_scm_probe(struct platform_device *pdev)
 {
struct device_node *dn = pdev->dev.of_node;
@@ -456,6 +517,7 @@ static int papr_scm_probe(struct platform_device *pdev)
goto err2;
 
platform_set_drvdata(pdev, p);
+   mce_register_notifier(&mce_ue_nb);
 
return 0;
 
@@ -468,6 +530,10 @@ static int papr_scm_remove(struct platform_device *pdev)
 {
struct papr_scm_priv *p = platform_get_drvdata(pdev);
 
+   mutex_lock(&papr_ndr_lock);
+   list_del(&(p->region_list));
+   mutex_unlock(&papr_ndr_lock);
+
nvdimm_bus_unregister(p->bus);
drc_pmem_unbind(p);
kfree(p);
@@ -490,7 +556,25 @@ static struct platform_driver papr_scm_driver = {
},
 };
 
-module_platform_driver(papr_scm_driver);
+static int __init papr_scm_init(void)
+{
+   int ret;
+
+   ret = platform_driver_register(&papr_scm_driver);
+   if (!ret)
+   mce_register_notifier(&mce_ue_nb);
+
+   return ret;
+}
+module_init(papr_scm_init);
+
+static void __exit papr_scm_exit(void)
+{
+   mce_unregister_notifier(&mce_ue_nb);
+   platform_driver_unregister(&papr_scm_driver);
+}
+module_exit(papr_scm_exit);
+
 MODULE_DEVICE_TABLE(of, papr_scm_match);
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("IBM Corporation");
-- 
2.21.0



[PATCH 1/3] powerpc/mce: Add MCE notification chain

2019-08-19 Thread Santosh Sivaraj
This is needed to report bad blocks for persistent memory.

Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/include/asm/mce.h |  3 +++
 arch/powerpc/kernel/mce.c  | 15 +++
 2 files changed, 18 insertions(+)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index e1931c8c2743..b1c6363f924c 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -212,6 +212,9 @@ extern void machine_check_queue_event(void);
 extern void machine_check_print_event_info(struct machine_check_event *evt,
   bool user_mode, bool in_guest);
 unsigned long addr_to_phys(struct pt_regs *regs, unsigned long addr);
+int mce_register_notifier(struct notifier_block *nb);
+int mce_unregister_notifier(struct notifier_block *nb);
+
 #ifdef CONFIG_PPC_BOOK3S_64
 void flush_and_reload_slb(void);
 #endif /* CONFIG_PPC_BOOK3S_64 */
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index ec4b3e1087be..a78210ca6cd9 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -47,6 +47,20 @@ static struct irq_work mce_ue_event_irq_work = {
 
 DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
 
+static BLOCKING_NOTIFIER_HEAD(mce_notifier_list);
+
+int mce_register_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_register(&mce_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_register_notifier);
+
+int mce_unregister_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_unregister(&mce_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_unregister_notifier);
+
 static void mce_set_error_info(struct machine_check_event *mce,
   struct mce_error_info *mce_err)
 {
@@ -263,6 +277,7 @@ static void machine_process_ue_event(struct work_struct 
*work)
while (__this_cpu_read(mce_ue_count) > 0) {
index = __this_cpu_read(mce_ue_count) - 1;
evt = this_cpu_ptr(&mce_ue_event_queue[index]);
+   blocking_notifier_call_chain(&mce_notifier_list, 0, evt);
 #ifdef CONFIG_MEMORY_FAILURE
/*
 * This should probably queued elsewhere, but
-- 
2.21.0



[PATCH 0/3] Add bad pmem bad blocks to bad range

2019-08-19 Thread Santosh Sivaraj
This series, which should be based on top of the still-unmerged
"powerpc: implement machine check safe memcpy" series, adds support
for adding the bad blocks which generated an MCE to the NVDIMM bad blocks.
The next access of the same memory will be blocked by the NVDIMM layer
itself.

---
Santosh Sivaraj (3):
  powerpc/mce: Add MCE notification chain
  of_pmem: Add memory ranges which took a mce to bad range
  papr/scm: Add bad memory ranges to nvdimm bad ranges

 arch/powerpc/include/asm/mce.h|   3 +
 arch/powerpc/kernel/mce.c |  15 +++
 arch/powerpc/platforms/pseries/papr_scm.c |  86 +++-
 drivers/nvdimm/of_pmem.c  | 151 +++---
 4 files changed, 234 insertions(+), 21 deletions(-)

-- 
2.21.0



[PATCH v4 16/16] powerpc/configs: Enable secure guest support in pseries and ppc64 defconfigs

2019-08-19 Thread Thiago Jung Bauermann
From: Ryan Grimm 

Enables running as a secure guest in platforms with an Ultravisor.

Signed-off-by: Ryan Grimm 
Signed-off-by: Ram Pai 
Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/configs/ppc64_defconfig   | 1 +
 arch/powerpc/configs/pseries_defconfig | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/powerpc/configs/ppc64_defconfig 
b/arch/powerpc/configs/ppc64_defconfig
index dc83fefa04f7..b250e6f5a7ca 100644
--- a/arch/powerpc/configs/ppc64_defconfig
+++ b/arch/powerpc/configs/ppc64_defconfig
@@ -29,6 +29,7 @@ CONFIG_DTL=y
 CONFIG_SCANLOG=m
 CONFIG_PPC_SMLPAR=y
 CONFIG_IBMEBUS=y
+CONFIG_PPC_SVM=y
 CONFIG_PPC_MAPLE=y
 CONFIG_PPC_PASEMI=y
 CONFIG_PPC_PASEMI_IOMMU=y
diff --git a/arch/powerpc/configs/pseries_defconfig 
b/arch/powerpc/configs/pseries_defconfig
index 38abc9c1770a..26126b4d4de3 100644
--- a/arch/powerpc/configs/pseries_defconfig
+++ b/arch/powerpc/configs/pseries_defconfig
@@ -42,6 +42,7 @@ CONFIG_DTL=y
 CONFIG_SCANLOG=m
 CONFIG_PPC_SMLPAR=y
 CONFIG_IBMEBUS=y
+CONFIG_PPC_SVM=y
 # CONFIG_PPC_PMAC is not set
 CONFIG_RTAS_FLASH=m
 CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y


[PATCH v4 15/16] Documentation/powerpc: Ultravisor API

2019-08-19 Thread Thiago Jung Bauermann
From: Sukadev Bhattiprolu 

The POWER9 processor includes support for the Protected Execution Facility (PEF).
The attached documentation provides an overview of PEF and defines the API
for various interfaces that must be implemented in the Ultravisor
firmware as well as in the KVM Hypervisor.

Based on input from Mike Anderson, Thiago Bauermann, Claudio Carvalho,
Ben Herrenschmidt, Guerney Hunt, Paul Mackerras.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Ram Pai 
Signed-off-by: Guerney Hunt 
Reviewed-by: Claudio Carvalho 
Reviewed-by: Michael Anderson 
Reviewed-by: Thiago Bauermann 
Signed-off-by: Claudio Carvalho 
Signed-off-by: Thiago Jung Bauermann 
---
 Documentation/powerpc/ultravisor.rst | 1055 ++
 1 file changed, 1055 insertions(+)

diff --git a/Documentation/powerpc/ultravisor.rst 
b/Documentation/powerpc/ultravisor.rst
new file mode 100644
index ..8d5246585b66
--- /dev/null
+++ b/Documentation/powerpc/ultravisor.rst
@@ -0,0 +1,1055 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _ultravisor:
+
+
+Protected Execution Facility
+
+
+.. contents::
+:depth: 3
+
+.. sectnum::
+:depth: 3
+
+Protected Execution Facility
+
+
+Protected Execution Facility (PEF) is an architectural change for
+POWER 9 that enables Secure Virtual Machines (SVMs). When enabled,
+PEF adds a new higher privileged mode, called Ultravisor mode, to
+POWER architecture. Along with the new mode there is new firmware
+called the Protected Execution Ultravisor (or Ultravisor for short).
+Ultravisor mode is the highest privileged mode in POWER architecture.
+
+   +--+
+   | Privilege States |
+   +==+
+   |  Problem |
+   +--+
+   |  Supervisor  |
+   +--+
+   |  Hypervisor  |
+   +--+
+   |  Ultravisor  |
+   +--+
+
+PEF protects SVMs from the hypervisor, privileged users, and other
+VMs in the system. SVMs are protected while at rest and can only be
+executed by an authorized machine. All virtual machines utilize
+hypervisor services. The Ultravisor filters calls between the SVMs
+and the hypervisor to assure that information does not accidentally
+leak. All hypercalls except H_RANDOM are reflected to the hypervisor.
+H_RANDOM is not reflected to prevent the hypervisor from influencing
+random values in the SVM.
+
+To support this there is a refactoring of the ownership of resources
+in the CPU. Some of the resources which were previously hypervisor
+privileged are now ultravisor privileged.
+
+Hardware
+
+
+The hardware changes include the following:
+
+* There is a new bit in the MSR that determines whether the current
+  process is running in secure mode, MSR(S) bit 41. MSR(S)=1, process
+  is in secure mode, MSR(S)=0 process is in normal mode.
+
+* The MSR(S) bit can only be set by the Ultravisor.
+
+* HRFID cannot be used to set the MSR(S) bit. If the hypervisor needs
+  to return to a SVM it must use an ultracall. It can determine if
+  the VM it is returning to is secure.
+
+* There is a new Ultravisor privileged register, SMFCTRL, which has an
+  enable/disable bit SMFCTRL(E).
+
+* The privilege of a process is now determined by three MSR bits,
+  MSR(S, HV, PR). In each of the tables below the modes are listed
+  from least privilege to highest privilege. The higher privilege
+  modes can access all the resources of the lower privilege modes.
+
+  **Secure Mode MSR Settings**
+
+  +---+---+---+---+
+  | S | HV| PR|Privilege  |
+  +===+===+===+===+
+  | 1 | 0 | 1 | Problem   |
+  +---+---+---+---+
+  | 1 | 0 | 0 | Privileged(OS)|
+  +---+---+---+---+
+  | 1 | 1 | 0 | Ultravisor|
+  +---+---+---+---+
+  | 1 | 1 | 1 | Reserved  |
+  +---+---+---+---+
+
+  **Normal Mode MSR Settings**
+
+  +---+---+---+---+
+  | S | HV| PR|Privilege  |
+  +===+===+===+===+
+  | 0 | 0 | 1 | Problem   |
+  +---+---+---+---+
+  | 0 | 0 | 0 | Privileged(OS)|
+  +---+---+---+---+
+  | 0 | 1 | 0 | Hypervisor|
+  +---+---+---+---+
+  | 0 | 1 | 1 | Problem (HV)  |
+  +---+---+---+---+
+
+* Memory is partitioned into secure and normal memory. Only processes
+  that are running in secure mode can access secure memory.
+
+* The hardware does not allow anything that is not running secure to
+  access secure memory. This means that the Hypervisor cannot access
+  the memory of the SVM without using an ultracall (asking the
+  Ultravisor). The Ultravisor will only allo

[PATCH v4 14/16] powerpc/pseries/svm: Force SWIOTLB for secure guests

2019-08-19 Thread Thiago Jung Bauermann
From: Anshuman Khandual 

SWIOTLB checks the range of incoming CPU addresses to be bounced and, if
the device can access them through its DMA window, skips bouncing
altogether. But for cases like secure guests on the powerpc platform, all
addresses need to be bounced into the shared pool of memory because the
host cannot access them otherwise. Hence the need to bounce is not related
to the device's DMA window, and the use of bounce buffers is forced by
setting swiotlb_force.

Also, connect the shared memory conversion functions into the
ARCH_HAS_MEM_ENCRYPT hooks and call swiotlb_update_mem_attributes() to
convert SWIOTLB's memory pool to shared memory.

Signed-off-by: Anshuman Khandual 
[ bauerman: Use ARCH_HAS_MEM_ENCRYPT hooks to share swiotlb memory pool. ]
Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/include/asm/mem_encrypt.h | 26 +++
 arch/powerpc/platforms/pseries/Kconfig |  3 ++
 arch/powerpc/platforms/pseries/svm.c   | 45 ++
 3 files changed, 74 insertions(+)

diff --git a/arch/powerpc/include/asm/mem_encrypt.h 
b/arch/powerpc/include/asm/mem_encrypt.h
new file mode 100644
index ..ba9dab07c1be
--- /dev/null
+++ b/arch/powerpc/include/asm/mem_encrypt.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * SVM helper functions
+ *
+ * Copyright 2018 IBM Corporation
+ */
+
+#ifndef _ASM_POWERPC_MEM_ENCRYPT_H
+#define _ASM_POWERPC_MEM_ENCRYPT_H
+
+#include 
+
+static inline bool mem_encrypt_active(void)
+{
+   return is_secure_guest();
+}
+
+static inline bool force_dma_unencrypted(struct device *dev)
+{
+   return is_secure_guest();
+}
+
+int set_memory_encrypted(unsigned long addr, int numpages);
+int set_memory_decrypted(unsigned long addr, int numpages);
+
+#endif /* _ASM_POWERPC_MEM_ENCRYPT_H */
diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index d09deb05bb66..9e35cf73 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -149,6 +149,9 @@ config PAPR_SCM
 config PPC_SVM
bool "Secure virtual machine (SVM) support for POWER"
depends on PPC_PSERIES
+   select SWIOTLB
+   select ARCH_HAS_MEM_ENCRYPT
+   select ARCH_HAS_FORCE_DMA_UNENCRYPTED
help
 There are certain POWER platforms which support secure guests using
 the Protected Execution Facility, with the help of an Ultravisor
diff --git a/arch/powerpc/platforms/pseries/svm.c 
b/arch/powerpc/platforms/pseries/svm.c
index 2b2b1a77ca1e..40c0637203d5 100644
--- a/arch/powerpc/platforms/pseries/svm.c
+++ b/arch/powerpc/platforms/pseries/svm.c
@@ -7,8 +7,53 @@
  */
 
 #include 
+#include 
+#include 
+#include 
 #include 
 
+static int __init init_svm(void)
+{
+   if (!is_secure_guest())
+   return 0;
+
+   /* Don't release the SWIOTLB buffer. */
+   ppc_swiotlb_enable = 1;
+
+   /*
+* Since the guest memory is inaccessible to the host, devices always
+* need to use the SWIOTLB buffer for DMA even if dma_capable() says
+* otherwise.
+*/
+   swiotlb_force = SWIOTLB_FORCE;
+
+   /* Share the SWIOTLB buffer with the host. */
+   swiotlb_update_mem_attributes();
+
+   return 0;
+}
+machine_early_initcall(pseries, init_svm);
+
+int set_memory_encrypted(unsigned long addr, int numpages)
+{
+   if (!PAGE_ALIGNED(addr))
+   return -EINVAL;
+
+   uv_unshare_page(PHYS_PFN(__pa(addr)), numpages);
+
+   return 0;
+}
+
+int set_memory_decrypted(unsigned long addr, int numpages)
+{
+   if (!PAGE_ALIGNED(addr))
+   return -EINVAL;
+
+   uv_share_page(PHYS_PFN(__pa(addr)), numpages);
+
+   return 0;
+}
+
 /* There's one dispatch log per CPU. */
 #define NR_DTL_PAGE (DISPATCH_LOG_BYTES * CONFIG_NR_CPUS / PAGE_SIZE)
 


[PATCH v4 13/16] powerpc/pseries/iommu: Don't use dma_iommu_ops on secure guests

2019-08-19 Thread Thiago Jung Bauermann
Secure guest memory is inaccessible to devices, so regular DMA isn't
possible.

In that case, set the devices' dma_map_ops to NULL so that the generic
DMA code path will use SWIOTLB to bounce buffers for DMA.

Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/platforms/pseries/iommu.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 889dc2e44b89..8d9c2b17ad54 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "pseries.h"
 
@@ -1318,7 +1319,15 @@ void iommu_init_early_pSeries(void)
of_reconfig_notifier_register(&iommu_reconfig_nb);
register_memory_notifier(&iommu_mem_nb);
 
-   set_pci_dma_ops(&dma_iommu_ops);
+   /*
+* Secure guest memory is inaccessible to devices so regular DMA isn't
+* possible.
+*
+* In that case keep devices' dma_map_ops as NULL so that the generic
+* DMA code path will use SWIOTLB to bounce buffers for DMA.
+*/
+   if (!is_secure_guest())
+   set_pci_dma_ops(&dma_iommu_ops);
 }
 
 static int __init disable_multitce(char *str)


[PATCH v4 12/16] powerpc/pseries/svm: Disable doorbells in SVM guests

2019-08-19 Thread Thiago Jung Bauermann
From: Sukadev Bhattiprolu 

Normally, the HV emulates some instructions like MSGSNDP, MSGCLRP
from a KVM guest. To emulate the instructions, it must first read
the instruction from the guest's memory and decode its parameters.

However, for a secure guest (aka SVM), the page containing the
instruction is in secure memory and the HV cannot access it directly.
It would need the Ultravisor (UV) to facilitate access to the
instruction and parameters, but the UV does not currently support
such accesses.

Until the UV has such support, disable doorbells in SVMs. This might
incur a performance hit but that is yet to be quantified.

With this patch applied (needed only in SVMs, not in the HV) we
are able to launch SVM guests with multi-core support, e.g.:

qemu -smp sockets=2,cores=2,threads=2

Fix suggested by Benjamin Herrenschmidt. Thanks to input from
Paul Mackerras, Ram Pai and Michael Anderson.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/platforms/pseries/smp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/smp.c 
b/arch/powerpc/platforms/pseries/smp.c
index 4b3ef8d9c63f..ad61e90032da 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "pseries.h"
 #include "offline_states.h"
@@ -221,7 +222,7 @@ static __init void pSeries_smp_probe_xics(void)
 {
xics_smp_probe();
 
-   if (cpu_has_feature(CPU_FTR_DBELL))
+   if (cpu_has_feature(CPU_FTR_DBELL) && !is_secure_guest())
smp_ops->cause_ipi = smp_pseries_cause_ipi;
else
smp_ops->cause_ipi = icp_ops->cause_ipi;


[RFC PATCH v4 11/16] powerpc/pseries/svm: Export guest SVM status to user space via sysfs

2019-08-19 Thread Thiago Jung Bauermann
From: Ryan Grimm 

User space might want to know whether it's running in a secure VM. It can't
do an mfmsr because mfmsr is a privileged instruction.

The solution here is to create a cpu attribute:

/sys/devices/system/cpu/svm

which will read 0 or 1 based on the S bit of the current CPU.

Signed-off-by: Ryan Grimm 
Signed-off-by: Thiago Jung Bauermann 
---
 .../ABI/testing/sysfs-devices-system-cpu  | 10 ++
 arch/powerpc/kernel/sysfs.c   | 20 +++
 2 files changed, 30 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu 
b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 5f7d7b14fa44..06d0931119cc 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -562,3 +562,13 @@ Description:   Umwait control
  or C0.2 state. The time is an unsigned 32-bit number.
  Note that a value of zero means there is no limit.
  Low order two bits must be zero.
+
+What:  /sys/devices/system/cpu/svm
+Date:  August 2019
+Contact:   Linux kernel mailing list 
+   Linux for PowerPC mailing list 
+Description:   Secure Virtual Machine
+
+   If 1, it means the system is using the Protected Execution
+   Facility in POWER9 and newer processors. i.e., it is a Secure
+   Virtual Machine.
diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index e2147d7c9e72..80a676da11cb 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "cacheinfo.h"
 #include "setup.h"
@@ -715,6 +716,23 @@ static struct device_attribute pa6t_attrs[] = {
 #endif /* HAS_PPC_PMC_PA6T */
 #endif /* HAS_PPC_PMC_CLASSIC */
 
+#ifdef CONFIG_PPC_SVM
+static ssize_t show_svm(struct device *dev, struct device_attribute *attr, 
char *buf)
+{
+   return sprintf(buf, "%u\n", is_secure_guest());
+}
+static DEVICE_ATTR(svm, 0444, show_svm, NULL);
+
+static void create_svm_file(void)
+{
+   device_create_file(cpu_subsys.dev_root, &dev_attr_svm);
+}
+#else
+static void create_svm_file(void)
+{
+}
+#endif /* CONFIG_PPC_SVM */
+
 static int register_cpu_online(unsigned int cpu)
 {
struct cpu *c = &per_cpu(cpu_devices, cpu);
@@ -1058,6 +1076,8 @@ static int __init topology_init(void)
sysfs_create_dscr_default();
 #endif /* CONFIG_PPC64 */
 
+   create_svm_file();
+
return 0;
 }
 subsys_initcall(topology_init);
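
As a usage note, user space can simply read the new attribute. A minimal,
hypothetical userspace C example (not part of the patch):

#include <stdio.h>

int main(void)
{
    unsigned int svm = 0;
    FILE *f = fopen("/sys/devices/system/cpu/svm", "r");

    if (!f || fscanf(f, "%u", &svm) != 1) {
        /* Attribute absent (e.g. CONFIG_PPC_SVM=n) or unreadable. */
        puts("svm attribute not available");
    } else {
        printf("running as a %s guest\n", svm ? "secure" : "normal");
    }

    if (f)
        fclose(f);
    return 0;
}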


[PATCH v4 10/16] powerpc/pseries/svm: Unshare all pages before kexecing a new kernel

2019-08-19 Thread Thiago Jung Bauermann
From: Ram Pai 

A new kernel deserves a clean slate. Any pages shared with the hypervisor
are unshared before invoking the new kernel. However, there are exceptions:
if the new kernel is invoked to dump the current kernel, or if there is an
explicit request to preserve the state of the current kernel, unsharing
of pages is skipped.

NOTE: While testing crashkernel, make sure at least 256M is reserved for
the crash kernel. Otherwise the SWIOTLB allocation will fail and the crash
kernel will fail to boot.

Signed-off-by: Ram Pai 
Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/include/asm/ultravisor-api.h | 1 +
 arch/powerpc/include/asm/ultravisor.h | 5 +
 arch/powerpc/kernel/machine_kexec_64.c| 9 +
 3 files changed, 15 insertions(+)

diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index 142b0576b89f..7e69c364bde0 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -24,5 +24,6 @@
 #define UV_ESM 0xF110
 #define UV_SHARE_PAGE  0xF130
 #define UV_UNSHARE_PAGE0xF134
+#define UV_UNSHARE_ALL_PAGES   0xF140
 
 #endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
diff --git a/arch/powerpc/include/asm/ultravisor.h 
b/arch/powerpc/include/asm/ultravisor.h
index a930aec8c1e3..e6f8a2b96694 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -21,4 +21,9 @@ static inline int uv_unshare_page(u64 pfn, u64 npages)
return ucall_norets(UV_UNSHARE_PAGE, pfn, npages);
 }
 
+static inline int uv_unshare_all_pages(void)
+{
+   return ucall_norets(UV_UNSHARE_ALL_PAGES);
+}
+
 #endif /* _ASM_POWERPC_ULTRAVISOR_H */
diff --git a/arch/powerpc/kernel/machine_kexec_64.c 
b/arch/powerpc/kernel/machine_kexec_64.c
index 18481b0e2788..04a7cba58eff 100644
--- a/arch/powerpc/kernel/machine_kexec_64.c
+++ b/arch/powerpc/kernel/machine_kexec_64.c
@@ -29,6 +29,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 int default_machine_kexec_prepare(struct kimage *image)
 {
@@ -327,6 +329,13 @@ void default_machine_kexec(struct kimage *image)
 #ifdef CONFIG_PPC_PSERIES
kexec_paca.lppaca_ptr = NULL;
 #endif
+
+   if (is_secure_guest() && !(image->preserve_context ||
+  image->type == KEXEC_TYPE_CRASH)) {
+   uv_unshare_all_pages();
+   printk("kexec: Unshared all shared pages.\n");
+   }
+
paca_ptrs[kexec_paca.paca_index] = &kexec_paca;
 
setup_paca(&kexec_paca);



[PATCH v4 09/16] powerpc/pseries/svm: Use shared memory for Debug Trace Log (DTL)

2019-08-19 Thread Thiago Jung Bauermann
From: Anshuman Khandual 

Secure guests need to share the DTL buffers with the hypervisor. To that
end, use a kmem_cache constructor which converts the underlying buddy
allocated SLUB cache pages into shared memory.

Signed-off-by: Anshuman Khandual 
Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/include/asm/svm.h  |  5 
 arch/powerpc/platforms/pseries/Makefile |  1 +
 arch/powerpc/platforms/pseries/setup.c  |  5 +++-
 arch/powerpc/platforms/pseries/svm.c| 40 +
 4 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/svm.h b/arch/powerpc/include/asm/svm.h
index 2689d8d841f8..85580b30aba4 100644
--- a/arch/powerpc/include/asm/svm.h
+++ b/arch/powerpc/include/asm/svm.h
@@ -15,6 +15,9 @@ static inline bool is_secure_guest(void)
return mfmsr() & MSR_S;
 }
 
+void dtl_cache_ctor(void *addr);
+#define get_dtl_cache_ctor()   (is_secure_guest() ? dtl_cache_ctor : NULL)
+
 #else /* CONFIG_PPC_SVM */
 
 static inline bool is_secure_guest(void)
@@ -22,5 +25,7 @@ static inline bool is_secure_guest(void)
return false;
 }
 
+#define get_dtl_cache_ctor() NULL
+
 #endif /* CONFIG_PPC_SVM */
 #endif /* _ASM_POWERPC_SVM_H */
diff --git a/arch/powerpc/platforms/pseries/Makefile 
b/arch/powerpc/platforms/pseries/Makefile
index ab3d59aeacca..a420ef4c9d8e 100644
--- a/arch/powerpc/platforms/pseries/Makefile
+++ b/arch/powerpc/platforms/pseries/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_IBMVIO)  += vio.o
 obj-$(CONFIG_IBMEBUS)  += ibmebus.o
 obj-$(CONFIG_PAPR_SCM) += papr_scm.o
 obj-$(CONFIG_PPC_SPLPAR)   += vphn.o
+obj-$(CONFIG_PPC_SVM)  += svm.o
 
 ifdef CONFIG_PPC_PSERIES
 obj-$(CONFIG_SUSPEND)  += suspend.o
diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index f5940cc71c37..d8930c3a8a11 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -69,6 +69,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "pseries.h"
 #include "../../../../drivers/pci/pci.h"
@@ -297,8 +298,10 @@ static inline int alloc_dispatch_logs(void)
 
 static int alloc_dispatch_log_kmem_cache(void)
 {
+   void (*ctor)(void *) = get_dtl_cache_ctor();
+
dtl_cache = kmem_cache_create("dtl", DISPATCH_LOG_BYTES,
-   DISPATCH_LOG_BYTES, 0, NULL);
+   DISPATCH_LOG_BYTES, 0, ctor);
if (!dtl_cache) {
pr_warn("Failed to create dispatch trace log buffer cache\n");
pr_warn("Stolen time statistics will be unreliable\n");
diff --git a/arch/powerpc/platforms/pseries/svm.c 
b/arch/powerpc/platforms/pseries/svm.c
new file mode 100644
index ..2b2b1a77ca1e
--- /dev/null
+++ b/arch/powerpc/platforms/pseries/svm.c
@@ -0,0 +1,40 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Secure VM platform
+ *
+ * Copyright 2018 IBM Corporation
+ * Author: Anshuman Khandual 
+ */
+
+#include 
+#include 
+
+/* There's one dispatch log per CPU. */
+#define NR_DTL_PAGE (DISPATCH_LOG_BYTES * CONFIG_NR_CPUS / PAGE_SIZE)
+
+static struct page *dtl_page_store[NR_DTL_PAGE];
+static long dtl_nr_pages;
+
+static bool is_dtl_page_shared(struct page *page)
+{
+   long i;
+
+   for (i = 0; i < dtl_nr_pages; i++)
+   if (dtl_page_store[i] == page)
+   return true;
+
+   return false;
+}
+
+void dtl_cache_ctor(void *addr)
+{
+   unsigned long pfn = PHYS_PFN(__pa(addr));
+   struct page *page = pfn_to_page(pfn);
+
+   if (!is_dtl_page_shared(page)) {
+   dtl_page_store[dtl_nr_pages] = page;
+   dtl_nr_pages++;
+   WARN_ON(dtl_nr_pages >= NR_DTL_PAGE);
+   uv_share_page(pfn, 1);
+   }
+}


[PATCH v4 08/16] powerpc/pseries/svm: Use shared memory for LPPACA structures

2019-08-19 Thread Thiago Jung Bauermann
From: Anshuman Khandual 

LPPACA structures need to be shared with the host. Hence they need to be in
shared memory. Instead of allocating individual chunks of memory for a
given structure from memblock, a contiguous chunk of memory is allocated
and then converted into shared memory. Subsequent allocation requests will
come from the contiguous chunk, which will always be shared memory for all
structures.

While we are able to use a kmem_cache constructor for the Debug Trace Log,
LPPACAs are allocated very early in the boot process (before SLUB is
available) so we need to use a simpler scheme here.

Introduce the helper is_secure_guest(), which uses the S bit of the MSR to
tell whether we're running as a secure guest.

Signed-off-by: Anshuman Khandual 
Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/include/asm/svm.h | 26 
 arch/powerpc/kernel/paca.c | 43 +-
 2 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/svm.h b/arch/powerpc/include/asm/svm.h
new file mode 100644
index ..2689d8d841f8
--- /dev/null
+++ b/arch/powerpc/include/asm/svm.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * SVM helper functions
+ *
+ * Copyright 2018 Anshuman Khandual, IBM Corporation.
+ */
+
+#ifndef _ASM_POWERPC_SVM_H
+#define _ASM_POWERPC_SVM_H
+
+#ifdef CONFIG_PPC_SVM
+
+static inline bool is_secure_guest(void)
+{
+   return mfmsr() & MSR_S;
+}
+
+#else /* CONFIG_PPC_SVM */
+
+static inline bool is_secure_guest(void)
+{
+   return false;
+}
+
+#endif /* CONFIG_PPC_SVM */
+#endif /* _ASM_POWERPC_SVM_H */
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 612fc87ef785..949eceb254d8 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -14,6 +14,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "setup.h"
 
@@ -54,6 +56,41 @@ static void *__init alloc_paca_data(unsigned long size, 
unsigned long align,
 
 #define LPPACA_SIZE 0x400
 
+static void *__init alloc_shared_lppaca(unsigned long size, unsigned long 
align,
+   unsigned long limit, int cpu)
+{
+   size_t shared_lppaca_total_size = PAGE_ALIGN(nr_cpu_ids * LPPACA_SIZE);
+   static unsigned long shared_lppaca_size;
+   static void *shared_lppaca;
+   void *ptr;
+
+   if (!shared_lppaca) {
+   memblock_set_bottom_up(true);
+
+   shared_lppaca =
+   memblock_alloc_try_nid(shared_lppaca_total_size,
+  PAGE_SIZE, MEMBLOCK_LOW_LIMIT,
+  limit, NUMA_NO_NODE);
+   if (!shared_lppaca)
+   panic("cannot allocate shared data");
+
+   memblock_set_bottom_up(false);
+   uv_share_page(PHYS_PFN(__pa(shared_lppaca)),
+ shared_lppaca_total_size >> PAGE_SHIFT);
+   }
+
+   ptr = shared_lppaca + shared_lppaca_size;
+   shared_lppaca_size += size;
+
+   /*
+* This is very early in boot, so no harm done if the kernel crashes at
+* this point.
+*/
+   BUG_ON(shared_lppaca_size >= shared_lppaca_total_size);
+
+   return ptr;
+}
+
 /*
  * See asm/lppaca.h for more detail.
  *
@@ -83,7 +120,11 @@ static struct lppaca * __init new_lppaca(int cpu, unsigned 
long limit)
if (early_cpu_has_feature(CPU_FTR_HVMODE))
return NULL;
 
-   lp = alloc_paca_data(LPPACA_SIZE, 0x400, limit, cpu);
+   if (is_secure_guest())
+   lp = alloc_shared_lppaca(LPPACA_SIZE, 0x400, limit, cpu);
+   else
+   lp = alloc_paca_data(LPPACA_SIZE, 0x400, limit, cpu);
+
init_lppaca(lp);
 
return lp;



[PATCH v4 07/16] powerpc/pseries: Add and use LPPACA_SIZE constant

2019-08-19 Thread Thiago Jung Bauermann
Helps document what the hard-coded number means.

Also take the opportunity to fix an #endif comment.

Suggested-by: Alexey Kardashevskiy 
Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/kernel/paca.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index e3ad8aa4730d..612fc87ef785 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -52,6 +52,8 @@ static void *__init alloc_paca_data(unsigned long size, 
unsigned long align,
 
 #ifdef CONFIG_PPC_PSERIES
 
+#define LPPACA_SIZE 0x400
+
 /*
  * See asm/lppaca.h for more detail.
  *
@@ -65,7 +67,7 @@ static inline void init_lppaca(struct lppaca *lppaca)
 
*lppaca = (struct lppaca) {
.desc = cpu_to_be32(0xd397d781),/* "LpPa" */
-   .size = cpu_to_be16(0x400),
+   .size = cpu_to_be16(LPPACA_SIZE),
.fpregs_in_use = 1,
.slb_count = cpu_to_be16(64),
.vmxregs_in_use = 0,
@@ -75,19 +77,18 @@ static inline void init_lppaca(struct lppaca *lppaca)
 static struct lppaca * __init new_lppaca(int cpu, unsigned long limit)
 {
struct lppaca *lp;
-   size_t size = 0x400;
 
-   BUILD_BUG_ON(size < sizeof(struct lppaca));
+   BUILD_BUG_ON(sizeof(struct lppaca) > LPPACA_SIZE);
 
if (early_cpu_has_feature(CPU_FTR_HVMODE))
return NULL;
 
-   lp = alloc_paca_data(size, 0x400, limit, cpu);
+   lp = alloc_paca_data(LPPACA_SIZE, 0x400, limit, cpu);
init_lppaca(lp);
 
return lp;
 }
-#endif /* CONFIG_PPC_BOOK3S */
+#endif /* CONFIG_PPC_PSERIES */
 
 #ifdef CONFIG_PPC_BOOK3S_64
 



[PATCH v4 06/16] powerpc: Introduce the MSR_S bit

2019-08-19 Thread Thiago Jung Bauermann
From: Sukadev Bhattiprolu 

Protected Execution Facility (PEF) is an architectural change for
POWER 9 that enables Secure Virtual Machines (SVMs). When enabled,
PEF adds a new higher privileged mode, called Ultravisor mode, to
POWER architecture.

The hardware changes include the following:

  * There is a new bit in the MSR that determines whether the current
process is running in secure mode, MSR(S) bit 41. MSR(S)=1 means the
process is in secure mode, MSR(S)=0 means it is in normal mode.

  * The MSR(S) bit can only be set by the Ultravisor.

  * HRFID cannot be used to set the MSR(S) bit. If the hypervisor needs
to return to a SVM it must use an ultracall. It can determine if
the VM it is returning to is secure.

  * The privilege of a process is now determined by three MSR bits,
MSR(S, HV, PR). In each of the tables below the modes are listed
from least privilege to highest privilege. The higher privilege
modes can access all the resources of the lower privilege modes.

**Secure Mode MSR Settings**

   +---+---+---+---+
   | S | HV| PR|Privilege  |
   +===+===+===+===+
   | 1 | 0 | 1 | Problem   |
   +---+---+---+---+
   | 1 | 0 | 0 | Privileged(OS)|
   +---+---+---+---+
   | 1 | 1 | 0 | Ultravisor|
   +---+---+---+---+
   | 1 | 1 | 1 | Reserved  |
   +---+---+---+---+

**Normal Mode MSR Settings**

   +---+---+---+---+
   | S | HV| PR|Privilege  |
   +===+===+===+===+
   | 0 | 0 | 1 | Problem   |
   +---+---+---+---+
   | 0 | 0 | 0 | Privileged(OS)|
   +---+---+---+---+
   | 0 | 1 | 0 | Hypervisor|
   +---+---+---+---+
   | 0 | 1 | 1 | Problem (HV)  |
   +---+---+---+---+

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Ram Pai 
[ cclaudio: Update the commit message ]
Signed-off-by: Claudio Carvalho 
Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/include/asm/reg.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 10caa145f98b..ec3714cf0989 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -38,6 +38,7 @@
 #define MSR_TM_LG  32  /* Trans Mem Available */
 #define MSR_VEC_LG 25  /* Enable AltiVec */
 #define MSR_VSX_LG 23  /* Enable VSX */
+#define MSR_S_LG   22  /* Secure state */
 #define MSR_POW_LG 18  /* Enable Power Management */
 #define MSR_WE_LG  18  /* Wait State Enable */
 #define MSR_TGPR_LG17  /* TLB Update registers in use */
@@ -71,11 +72,13 @@
 #define MSR_SF __MASK(MSR_SF_LG)   /* Enable 64 bit mode */
 #define MSR_ISF__MASK(MSR_ISF_LG)  /* Interrupt 64b mode 
valid on 630 */
 #define MSR_HV __MASK(MSR_HV_LG)   /* Hypervisor state */
+#define MSR_S  __MASK(MSR_S_LG)/* Secure state */
 #else
 /* so tests for these bits fail on 32-bit */
 #define MSR_SF 0
 #define MSR_ISF0
 #define MSR_HV 0
+#define MSR_S  0
 #endif
 
 /*


[PATCH v4 05/16] powerpc/pseries/svm: Add helpers for UV_SHARE_PAGE and UV_UNSHARE_PAGE

2019-08-19 Thread Thiago Jung Bauermann
From: Ram Pai 

These functions are used when the guest wants to grant the hypervisor
access to certain pages.

Signed-off-by: Ram Pai 
Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/include/asm/ultravisor-api.h |  2 ++
 arch/powerpc/include/asm/ultravisor.h | 24 +++
 2 files changed, 26 insertions(+)

diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index d3503d1f447e..142b0576b89f 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -22,5 +22,7 @@
 
 /* opcodes */
 #define UV_ESM 0xF110
+#define UV_SHARE_PAGE  0xF130
+#define UV_UNSHARE_PAGE0xF134
 
 #endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
diff --git a/arch/powerpc/include/asm/ultravisor.h 
b/arch/powerpc/include/asm/ultravisor.h
new file mode 100644
index ..a930aec8c1e3
--- /dev/null
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Ultravisor definitions
+ *
+ * Copyright 2019, IBM Corporation.
+ *
+ */
+#ifndef _ASM_POWERPC_ULTRAVISOR_H
+#define _ASM_POWERPC_ULTRAVISOR_H
+
+#include 
+#include 
+
+static inline int uv_share_page(u64 pfn, u64 npages)
+{
+   return ucall_norets(UV_SHARE_PAGE, pfn, npages);
+}
+
+static inline int uv_unshare_page(u64 pfn, u64 npages)
+{
+   return ucall_norets(UV_UNSHARE_PAGE, pfn, npages);
+}
+
+#endif /* _ASM_POWERPC_ULTRAVISOR_H */
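
For context, a hedged sketch of how a guest can use these helpers to share
a page-aligned buffer with the hypervisor. The wrapper below is
illustrative only; later patches in this series use the same
PHYS_PFN(__pa(...)) pattern for the SWIOTLB pool and the LPPACAs:

#include <linux/mm.h>
#include <asm/ultravisor.h>

static int share_buffer_with_hv(void *addr, unsigned long npages)
{
    if (!PAGE_ALIGNED(addr))
        return -EINVAL;

    /* The ultravisor works on page frame numbers, not addresses. */
    return uv_share_page(PHYS_PFN(__pa(addr)), npages);
}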



[PATCH v4 04/16] powerpc/prom_init: Add the ESM call to prom_init

2019-08-19 Thread Thiago Jung Bauermann
From: Ram Pai 

Make the Enter-Secure-Mode (ESM) ultravisor call to switch the VM to secure
mode. Pass kernel base address and FDT address so that the Ultravisor is
able to verify the integrity of the VM using information from the ESM blob.

Add "svm=" command line option to turn on switching to secure mode.

Signed-off-by: Ram Pai 
[ andmike: Generate an RTAS os-term hcall when the ESM ucall fails. ]
Signed-off-by: Michael Anderson 
[ bauerman: Cleaned up the code a bit. ]
Signed-off-by: Thiago Jung Bauermann 
---
 .../admin-guide/kernel-parameters.txt |  5 +
 arch/powerpc/include/asm/ultravisor-api.h |  3 +
 arch/powerpc/kernel/prom_init.c   | 96 +++
 3 files changed, 104 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 7ccd158b3894..231a008b7961 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4620,6 +4620,11 @@
/sys/power/pm_test). Only available when CONFIG_PM_DEBUG
is set. Default value is 5.
 
+   svm=[PPC]
+   Format: { on | off | y | n | 1 | 0 }
+   This parameter controls use of the Protected
+   Execution Facility on pSeries.
+
swapaccount=[0|1]
[KNL] Enable accounting of swap in memory resource
controller if no parameter or 1 is given or disable
diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index 88ffa78f9d61..d3503d1f447e 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -20,4 +20,7 @@
 #define U_PARAMETERH_PARAMETER
 #define U_SUCCESS  H_SUCCESS
 
+/* opcodes */
+#define UV_ESM 0xF110
+
 #endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 514707ef6779..74f70f90eff0 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -171,6 +172,10 @@ static bool __prombss prom_radix_disable;
 static bool __prombss prom_xive_disable;
 #endif
 
+#ifdef CONFIG_PPC_SVM
+static bool __prombss prom_svm_enable;
+#endif
+
 struct platform_support {
bool hash_mmu;
bool radix_mmu;
@@ -812,6 +817,17 @@ static void __init early_cmdline_parse(void)
prom_debug("XIVE disabled from cmdline\n");
}
 #endif /* CONFIG_PPC_PSERIES */
+
+#ifdef CONFIG_PPC_SVM
+   opt = prom_strstr(prom_cmd_line, "svm=");
+   if (opt) {
+   bool val;
+
+   opt += sizeof("svm=") - 1;
+   if (!prom_strtobool(opt, &val))
+   prom_svm_enable = val;
+   }
+#endif /* CONFIG_PPC_SVM */
 }
 
 #ifdef CONFIG_PPC_PSERIES
@@ -1712,6 +1728,43 @@ static void __init prom_close_stdin(void)
}
 }
 
+#ifdef CONFIG_PPC_SVM
+static int prom_rtas_hcall(uint64_t args)
+{
+   register uint64_t arg1 asm("r3") = H_RTAS;
+   register uint64_t arg2 asm("r4") = args;
+
+   asm volatile("sc 1\n" : "=r" (arg1) :
+   "r" (arg1),
+   "r" (arg2) :);
+   return arg1;
+}
+
+static struct rtas_args __prombss os_term_args;
+
+static void __init prom_rtas_os_term(char *str)
+{
+   phandle rtas_node;
+   __be32 val;
+   u32 token;
+
+   prom_debug("%s: start...\n", __func__);
+   rtas_node = call_prom("finddevice", 1, 1, ADDR("/rtas"));
+   prom_debug("rtas_node: %x\n", rtas_node);
+   if (!PHANDLE_VALID(rtas_node))
+   return;
+
+   val = 0;
+   prom_getprop(rtas_node, "ibm,os-term", &val, sizeof(val));
+   token = be32_to_cpu(val);
+   prom_debug("ibm,os-term: %x\n", token);
+   if (token == 0)
+   prom_panic("Could not get token for ibm,os-term\n");
+   os_term_args.token = cpu_to_be32(token);
+   prom_rtas_hcall((uint64_t)&os_term_args);
+}
+#endif /* CONFIG_PPC_SVM */
+
 /*
  * Allocate room for and instantiate RTAS
  */
@@ -3168,6 +3221,46 @@ static void unreloc_toc(void)
 #endif
 #endif
 
+#ifdef CONFIG_PPC_SVM
+/*
+ * Perform the Enter Secure Mode ultracall.
+ */
+static int enter_secure_mode(unsigned long kbase, unsigned long fdt)
+{
+   register unsigned long r3 asm("r3") = UV_ESM;
+   register unsigned long r4 asm("r4") = kbase;
+   register unsigned long r5 asm("r5") = fdt;
+
+   asm volatile("sc 2" : "+r"(r3) : "r"(r4), "r"(r5));
+
+   return r3;
+}
+
+/*
+ * Call the Ultravisor to transfer us to secure memory if we have an ESM blob.
+ */
+static void setup_secure_guest(unsigned long kbase, unsigned long fdt)
+{
+   int ret;
+
+   if (!prom_svm_enable)
+   return;
+
+   /* Switch
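
A hedged sketch of how the switch to secure mode might continue from this
point, using only helpers defined earlier in this patch
(enter_secure_mode(), prom_rtas_os_term(), U_SUCCESS, prom_printf()); this
is not necessarily the exact code of the posted patch:

    ret = enter_secure_mode(kbase, fdt);
    if (ret != U_SUCCESS) {
        prom_printf("Returned %d from switching to secure mode.\n", ret);
        prom_rtas_os_term("Switch to secure mode failed.\n");
    }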

[PATCH v4 03/16] powerpc: Add support for adding an ESM blob to the zImage wrapper

2019-08-19 Thread Thiago Jung Bauermann
From: Benjamin Herrenschmidt 

For secure VMs, the signing tool will create a ticket called the "ESM blob"
for the Enter Secure Mode ultravisor call with the signatures of the kernel
and initrd among other things.

This adds support to the wrapper script for adding that blob via the "-e"
option to the zImage.pseries.

It also adds code to the zImage wrapper itself to retrieve and if necessary
relocate the blob, and pass its address to Linux via the device-tree, to be
later consumed by prom_init.

Signed-off-by: Benjamin Herrenschmidt 
[ bauerman: Minor adjustments to some comments. ]
Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/boot/main.c   | 41 ++
 arch/powerpc/boot/ops.h|  2 ++
 arch/powerpc/boot/wrapper  | 24 +---
 arch/powerpc/boot/zImage.lds.S |  8 +++
 4 files changed, 72 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/boot/main.c b/arch/powerpc/boot/main.c
index 102cc546444d..a9d209135975 100644
--- a/arch/powerpc/boot/main.c
+++ b/arch/powerpc/boot/main.c
@@ -146,6 +146,46 @@ static struct addr_range prep_initrd(struct addr_range 
vmlinux, void *chosen,
return (struct addr_range){(void *)initrd_addr, initrd_size};
 }
 
+#ifdef __powerpc64__
+static void prep_esm_blob(struct addr_range vmlinux, void *chosen)
+{
+   unsigned long esm_blob_addr, esm_blob_size;
+
+   /* Do we have an ESM (Enter Secure Mode) blob? */
+   if (_esm_blob_end <= _esm_blob_start)
+   return;
+
+   printf("Attached ESM blob at 0x%p-0x%p\n\r",
+  _esm_blob_start, _esm_blob_end);
+   esm_blob_addr = (unsigned long)_esm_blob_start;
+   esm_blob_size = _esm_blob_end - _esm_blob_start;
+
+   /*
+* If the ESM blob is too low it will be clobbered when the
+* kernel relocates to its final location.  In this case,
+* allocate a safer place and move it.
+*/
+   if (esm_blob_addr < vmlinux.size) {
+   void *old_addr = (void *)esm_blob_addr;
+
+   printf("Allocating 0x%lx bytes for esm_blob ...\n\r",
+  esm_blob_size);
+   esm_blob_addr = (unsigned long)malloc(esm_blob_size);
+   if (!esm_blob_addr)
+   fatal("Can't allocate memory for ESM blob !\n\r");
+   printf("Relocating ESM blob 0x%lx <- 0x%p (0x%lx bytes)\n\r",
+  esm_blob_addr, old_addr, esm_blob_size);
+   memmove((void *)esm_blob_addr, old_addr, esm_blob_size);
+   }
+
+   /* Tell the kernel ESM blob address via device tree. */
+   setprop_val(chosen, "linux,esm-blob-start", (u32)(esm_blob_addr));
+   setprop_val(chosen, "linux,esm-blob-end", (u32)(esm_blob_addr + 
esm_blob_size));
+}
+#else
+static inline void prep_esm_blob(struct addr_range vmlinux, void *chosen) { }
+#endif
+
 /* A buffer that may be edited by tools operating on a zImage binary so as to
  * edit the command line passed to vmlinux (by setting /chosen/bootargs).
  * The buffer is put in it's own section so that tools may locate it easier.
@@ -214,6 +254,7 @@ void start(void)
vmlinux = prep_kernel();
initrd = prep_initrd(vmlinux, chosen,
 loader_info.initrd_addr, loader_info.initrd_size);
+   prep_esm_blob(vmlinux, chosen);
prep_cmdline(chosen);
 
printf("Finalizing device tree...");
diff --git a/arch/powerpc/boot/ops.h b/arch/powerpc/boot/ops.h
index cd043726ed88..e0606766480f 100644
--- a/arch/powerpc/boot/ops.h
+++ b/arch/powerpc/boot/ops.h
@@ -251,6 +251,8 @@ extern char _initrd_start[];
 extern char _initrd_end[];
 extern char _dtb_start[];
 extern char _dtb_end[];
+extern char _esm_blob_start[];
+extern char _esm_blob_end[];
 
 static inline __attribute__((const))
 int __ilog2_u32(u32 n)
diff --git a/arch/powerpc/boot/wrapper b/arch/powerpc/boot/wrapper
index 5148ac271f28..ed6266367bc0 100755
--- a/arch/powerpc/boot/wrapper
+++ b/arch/powerpc/boot/wrapper
@@ -13,6 +13,7 @@
 # -i initrdspecify initrd file
 # -d devtree   specify device-tree blob
 # -s tree.dts  specify device-tree source file (needs dtc installed)
+# -e esm_blob   specify ESM blob for secure images
 # -c   cache $kernel.strip.gz (use if present & newer, else make)
 # -C prefixspecify command prefix for cross-building tools
 #  (strip, objcopy, ld)
@@ -37,6 +38,7 @@ platform=of
 initrd=
 dtb=
 dts=
+esm_blob=
 cacheit=
 binary=
 compression=.gz
@@ -60,9 +62,9 @@ tmpdir=.
 
 usage() {
 echo 'Usage: wrapper [-o output] [-p platform] [-i initrd]' >&2
-echo '   [-d devtree] [-s tree.dts] [-c] [-C cross-prefix]' >&2
-echo '   [-D datadir] [-W workingdir] [-Z (gz|xz|none)]' >&2
-echo '   [--no-compression] [vmlinux]' >&2
+echo '   [-d devtree] [-s tree.dts] [-e esm_blob]' >&2
+echo '   [-c] [-C cross-prefix] [-D datadir] [-W workingdir]' >&2
+echo '   [-Z (gz|xz|none)] [--no-com

[PATCH v4 02/16] powerpc/pseries: Introduce option to build secure virtual machines

2019-08-19 Thread Thiago Jung Bauermann
Introduce CONFIG_PPC_SVM to control support for secure guests and include
Ultravisor-related helpers when it is selected.

Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/include/asm/asm-prototypes.h |  2 +-
 arch/powerpc/kernel/Makefile  |  4 +++-
 arch/powerpc/platforms/pseries/Kconfig| 11 +++
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index e698f48cbc6d..49196d35e3bb 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -36,7 +36,7 @@ void __trace_hcall_entry(unsigned long opcode, unsigned long 
*args);
 void __trace_hcall_exit(long opcode, long retval, unsigned long *retbuf);
 
 /* Ultravisor */
-#ifdef CONFIG_PPC_POWERNV
+#if defined(CONFIG_PPC_POWERNV) || defined(CONFIG_PPC_SVM)
 long ucall_norets(unsigned long opcode, ...);
 #else
 static inline long ucall_norets(unsigned long opcode, ...)
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index c6c4ea240b2a..ba379dfb8b83 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -156,7 +156,9 @@ endif
 
 obj-$(CONFIG_EPAPR_PARAVIRT)   += epapr_paravirt.o epapr_hcalls.o
 obj-$(CONFIG_KVM_GUEST)+= kvm.o kvm_emul.o
-obj-$(CONFIG_PPC_POWERNV)  += ucall.o
+ifneq ($(CONFIG_PPC_POWERNV)$(CONFIG_PPC_SVM),)
+obj-y  += ucall.o
+endif
 
 # Disable GCOV, KCOV & sanitizers in odd or sensitive code
 GCOV_PROFILE_prom_init.o := n
diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index f7b484f3..d09deb05bb66 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -145,3 +145,14 @@ config PAPR_SCM
tristate "Support for the PAPR Storage Class Memory interface"
help
  Enable access to hypervisor provided storage class memory.
+
+config PPC_SVM
+   bool "Secure virtual machine (SVM) support for POWER"
+   depends on PPC_PSERIES
+   help
+There are certain POWER platforms which support secure guests using
+the Protected Execution Facility, with the help of an Ultravisor
+executing below the hypervisor layer. This enables support for
+those guests.
+
+If unsure, say "N".


[PATCH v4 00/16] Secure Virtual Machine Enablement

2019-08-19 Thread Thiago Jung Bauermann
Hello,

This is a minor update of this patch series. It addresses review comments
made to v3. Details are in the changelog. The sysfs patch is updated and
included here but as I mentioned earlier can be postponed. It is marked
RFC for that reason.

As with the previous version, the patch introducing ucall_norets() (patch 1)
and the one adding documentation on the Ultravisor (patch 15) are copied
from v5 of Claudio Carvalho's KVM on Ultravisor series and don't yet address
the review comments made there. They are included here so that this series
can stand on its own.

The patches apply on top of v4 of the  cleanup series:

https://lore.kernel.org/linuxppc-dev/20190806044919.10622-1-bauer...@linux.ibm.com/

Everything is available in branch ultravisor-secure-vm (applied on top of
today's powerpc/next) at this repo:

https://github.com/bauermann/linux.git

Original cover letter below, and changelog at the bottom:

This series enables Secure Virtual Machines (SVMs) on powerpc. SVMs use the
Protected Execution Facility (PEF) and request to be migrated to secure
memory during prom_init() so by default all of their memory is inaccessible
to the hypervisor. There is an Ultravisor call that the VM can use to
request certain pages to be made accessible to (or shared with) the
hypervisor.

The objective of these patches is to have the guest perform this request
for buffers that need to be accessed by the hypervisor such as the LPPACAs,
the SWIOTLB memory and the Debug Trace Log.

Patch 3 ("powerpc: Add support for adding an ESM blob to the zImage
wrapper") is posted as RFC because we are still finalizing the details on
how the ESM blob will be passed along with the kernel. All other patches are
(hopefully) in upstreamable shape and don't depend on this patch.

Unfortunately this series still doesn't enable the use of virtio devices in
the secure guest. This support depends on a discussion that is currently
ongoing with the virtio community:

https://lore.kernel.org/linuxppc-dev/87womn8inf.fsf@morokweng.localdomain/

I was able to test it using Claudio's patches in the host kernel, booting
normally using an initramfs for the root filesystem.

This is the command used to start up the guest with QEMU 4.0:

qemu-system-ppc64   \
-nodefaults \
-cpu host   \
-machine pseries,accel=kvm,kvm-type=HV,cap-htm=off,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken \
-display none   \
-serial mon:stdio   \
-smp 1  \
-m 4G   \
-kernel /root/bauermann/vmlinux \
-initrd /root/bauermann/fs_small.cpio   \
-append 'debug'

Changelog

Since v3:

- Patch "powerpc/kernel: Add ucall_norets() ultravisor call handler"
  - Use updated commit message from Claudio Carvalho's KVM series v5.

- Patch "powerpc: Introduce the MSR_S bit"
  - Use updated commit message from Claudio Carvalho.

- Patch "powerpc/pseries/svm: Use shared memory for LPPACA structures"
  - Changed copyright year in  to 2018. Suggested by Michael
Ellerman.

- Patch "powerpc/pseries/svm: Use shared memory for Debug Trace Log (DTL)"
  - Changed copyright year in svm.c to 2018. Suggested by Michael Ellerman.

- Patch "powerpc/pseries/svm: Export guest SVM status to user space via sysfs"
  - Changed to check MSR_S on the current CPU. Suggested by Michael Ellerman.
  - Added documentation for new sysfs file. Suggested by Michael Ellerman.

- Patch "powerpc/pseries/iommu: Don't use dma_iommu_ops on secure guests"
  - Changed to only call set_pci_dma_ops() on non-secure guests. Suggested
by Christoph Hellwig.

- Patch "powerpc/pseries/svm: Force SWIOTLB for secure guests"
  - Changed copyright year in  to 2018. Suggested by
Michael Ellerman.

- Patch "Documentation/powerpc: Ultravisor API"
  - Use updated patch from Claudio Carvalho's KVM series v5.

Since v2:

- Patch "powerpc/kernel: Add ucall_norets() ultravisor call handler"
  - Borrowed unchanged from Claudio's "kvmppc: Paravirtualize KVM to support
ultravisor" series.

- Patch "powerpc/prom_init: Add the ESM call to prom_init"
  - Briefly mention in the commit message why we pass the kernel base address
and FDT to the Enter Secure Mode ultracall. Suggested by Alexey
Kardashevskiy.
  - Use enter_secure_mode() version provided by Segher Boessenkool.

- Patch "powerpc/pseries/svm: Add helpers for UV_SHARE_PAGE and UV_UNSHARE_PAGE"
  - Use ucall_norets() which doesn't need to be passed a return buffer.
Suggested by Alexey Kardashevskiy.

- Patch "powerpc: Introduce the MSR_S bit"
  - Moved from Claudio's "kvmppc: Paravirtualize KVM to support ultravisor"
series to this series.

- Patch "Documentation/powerpc: Ultravisor API"
  - New patch from Sukadev Bhattiprolu. Will also appear on Claudio's
kvmppc series.

Since v1:

-

[PATCH v4 01/16] powerpc/kernel: Add ucall_norets() ultravisor call handler

2019-08-19 Thread Thiago Jung Bauermann
From: Claudio Carvalho 

The ultracalls (ucalls for short) allow the Secure Virtual Machines
(SVMs) and the hypervisor to request services from the ultravisor, such as
accessing a register or memory region that can only be accessed when
running in ultravisor-privileged mode.

This patch adds the ucall_norets() ultravisor call handler. Like
plpar_hcall_norets(), it also saves and restores the Condition
Register (CR).

The specific service needed from an ucall is specified in register
R3 (the first parameter to the ucall). Other parameters to the
ucall, if any, are specified in registers R4 through R12.

Return value of all ucalls is in register R3. Other output values
from the ucall, if any, are returned in registers R4 through R12.

Each ucall returns specific error codes, applicable in the context
of the ucall. However, like with the PowerPC Architecture Platform
Reference (PAPR), if no specific error code is defined for a particular
situation, then the ucall will fall back to an erroneous
parameter-position based code, i.e. U_PARAMETER, U_P2, U_P3, etc.,
depending on the ucall parameter that may have caused the error.

Every host kernel (powernv) needs to be able to do ucalls in case it
ends up being run in a machine with ultravisor enabled. Otherwise, the
kernel may crash early in boot trying to access ultravisor resources,
for instance, trying to set the partition table entry 0. Secure guests
also need to be able to do ucalls and their kernel may not have
CONFIG_PPC_POWERNV=y. For that reason, the ucall.S file is placed under
arch/powerpc/kernel.

If ultravisor is not enabled, the ucalls will be redirected to the
hypervisor which must handle/fail the call.

Thanks to inputs from Ram Pai and Michael Anderson.

Signed-off-by: Claudio Carvalho 
Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/include/asm/asm-prototypes.h | 11 +++
 arch/powerpc/include/asm/ultravisor-api.h | 23 +++
 arch/powerpc/kernel/Makefile  |  1 +
 arch/powerpc/kernel/ucall.S   | 20 
 4 files changed, 55 insertions(+)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index ec1c97a8e8cb..e698f48cbc6d 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -34,6 +35,16 @@ extern struct static_key hcall_tracepoint_key;
 void __trace_hcall_entry(unsigned long opcode, unsigned long *args);
 void __trace_hcall_exit(long opcode, long retval, unsigned long *retbuf);
 
+/* Ultravisor */
+#ifdef CONFIG_PPC_POWERNV
+long ucall_norets(unsigned long opcode, ...);
+#else
+static inline long ucall_norets(unsigned long opcode, ...)
+{
+   return U_NOT_AVAILABLE;
+}
+#endif
+
 /* OPAL */
 int64_t __opal_call(int64_t a0, int64_t a1, int64_t a2, int64_t a3,
int64_t a4, int64_t a5, int64_t a6, int64_t a7,
diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
new file mode 100644
index ..88ffa78f9d61
--- /dev/null
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Ultravisor API.
+ *
+ * Copyright 2019, IBM Corporation.
+ *
+ */
+#ifndef _ASM_POWERPC_ULTRAVISOR_API_H
+#define _ASM_POWERPC_ULTRAVISOR_API_H
+
+#include 
+
+/* Return codes */
+#define U_FUNCTION H_FUNCTION
+#define U_NOT_AVAILABLEH_NOT_AVAILABLE
+#define U_P2   H_P2
+#define U_P3   H_P3
+#define U_P4   H_P4
+#define U_P5   H_P5
+#define U_PARAMETERH_PARAMETER
+#define U_SUCCESS  H_SUCCESS
+
+#endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index ea0c69236789..c6c4ea240b2a 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -156,6 +156,7 @@ endif
 
 obj-$(CONFIG_EPAPR_PARAVIRT)   += epapr_paravirt.o epapr_hcalls.o
 obj-$(CONFIG_KVM_GUEST)+= kvm.o kvm_emul.o
+obj-$(CONFIG_PPC_POWERNV)  += ucall.o
 
 # Disable GCOV, KCOV & sanitizers in odd or sensitive code
 GCOV_PROFILE_prom_init.o := n
diff --git a/arch/powerpc/kernel/ucall.S b/arch/powerpc/kernel/ucall.S
new file mode 100644
index ..de9133e45d21
--- /dev/null
+++ b/arch/powerpc/kernel/ucall.S
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Generic code to perform an ultravisor call.
+ *
+ * Copyright 2019, IBM Corporation.
+ *
+ */
+#include 
+#include 
+
+_GLOBAL(ucall_norets)
+EXPORT_SYMBOL_GPL(ucall_norets)
+   mfcrr0
+   stw r0,8(r1)
+
+   sc  2   /* Invoke the ultravisor */
+
+   lwz r0,8(r1)
+   mtcrf   0xff,r0
+   blr /* Return r3 = status */
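
A hedged illustration of the calling and return convention described in the
commit message; UV_EXAMPLE and the wrapper below are hypothetical, only
ucall_norets() and the U_* return codes come from this patch:

#include <asm/asm-prototypes.h>
#include <asm/ultravisor-api.h>

#define UV_EXAMPLE    0xF1F0    /* hypothetical opcode */

static int uv_example(u64 arg1, u64 arg2)
{
    /* Opcode goes in r3, arguments in r4..r12, status comes back in r3. */
    long rc = ucall_norets(UV_EXAMPLE, arg1, arg2);

    switch (rc) {
    case U_SUCCESS:
        return 0;
    case U_PARAMETER:    /* first parameter rejected */
    case U_P2:           /* second parameter rejected */
        return -EINVAL;
    default:
        return -ENXIO;
    }
}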


Re: [PATCH v10 2/7] powerpc/mce: Fix MCE handling for huge pages

2019-08-19 Thread Santosh Sivaraj
Hi Nick,

Nicholas Piggin  writes:

> Santosh Sivaraj's on August 15, 2019 10:39 am:
>> From: Balbir Singh 
>> 
>> The current code would fail on huge pages addresses, since the shift would
>> be incorrect. Use the correct page shift value returned by
>> __find_linux_pte() to get the correct physical address. The code is more
>> generic and can handle both regular and compound pages.
>> 
>> Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
>> Signed-off-by: Balbir Singh 
>> [ar...@linux.ibm.com: Fixup pseries_do_memory_failure()]
>> Signed-off-by: Reza Arbab 
>> Co-developed-by: Santosh Sivaraj 
>> Signed-off-by: Santosh Sivaraj 
>> Tested-by: Mahesh Salgaonkar 
>> Cc: sta...@vger.kernel.org # v4.15+
>> ---
>>  arch/powerpc/include/asm/mce.h   |  2 +-
>>  arch/powerpc/kernel/mce_power.c  | 55 ++--
>>  arch/powerpc/platforms/pseries/ras.c |  9 ++---
>>  3 files changed, 32 insertions(+), 34 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
>> index a4c6a74ad2fb..f3a6036b6bc0 100644
>> --- a/arch/powerpc/include/asm/mce.h
>> +++ b/arch/powerpc/include/asm/mce.h
>> @@ -209,7 +209,7 @@ extern void release_mce_event(void);
>>  extern void machine_check_queue_event(void);
>>  extern void machine_check_print_event_info(struct machine_check_event *evt,
>> bool user_mode, bool in_guest);
>> -unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
>> +unsigned long addr_to_phys(struct pt_regs *regs, unsigned long addr);
>>  #ifdef CONFIG_PPC_BOOK3S_64
>>  void flush_and_reload_slb(void);
>>  #endif /* CONFIG_PPC_BOOK3S_64 */
>> diff --git a/arch/powerpc/kernel/mce_power.c 
>> b/arch/powerpc/kernel/mce_power.c
>> index a814d2dfb5b0..e74816f045f8 100644
>> --- a/arch/powerpc/kernel/mce_power.c
>> +++ b/arch/powerpc/kernel/mce_power.c
>> @@ -20,13 +20,14 @@
>>  #include 
>>  
>>  /*
>> - * Convert an address related to an mm to a PFN. NOTE: we are in real
>> - * mode, we could potentially race with page table updates.
>> + * Convert an address related to an mm to a physical address.
>> + * NOTE: we are in real mode, we could potentially race with page table 
>> updates.
>>   */
>> -unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
>> +unsigned long addr_to_phys(struct pt_regs *regs, unsigned long addr)
>>  {
>> -pte_t *ptep;
>> -unsigned long flags;
>> +pte_t *ptep, pte;
>> +unsigned int shift;
>> +unsigned long flags, phys_addr;
>>  struct mm_struct *mm;
>>  
>>  if (user_mode(regs))
>> @@ -35,14 +36,21 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned 
>> long addr)
>>  mm = &init_mm;
>>  
>>  local_irq_save(flags);
>> -if (mm == current->mm)
>> -ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
>> -else
>> -ptep = find_init_mm_pte(addr, NULL);
>> +ptep = __find_linux_pte(mm->pgd, addr, NULL, &shift);
>>  local_irq_restore(flags);
>> +
>>  if (!ptep || pte_special(*ptep))
>>  return ULONG_MAX;
>> -return pte_pfn(*ptep);
>> +
>> +pte = *ptep;
>> +if (shift > PAGE_SHIFT) {
>> +unsigned long rpnmask = (1ul << shift) - PAGE_SIZE;
>> +
>> +pte = __pte(pte_val(pte) | (addr & rpnmask));
>> +}
>> +phys_addr = pte_pfn(pte) << PAGE_SHIFT;
>> +
>> +return phys_addr;
>>  }
>
> This should remain addr_to_pfn I think. None of the callers care what
> size page the EA was mapped with. 'pfn' is referring to the Linux pfn,
> which is the small page number.
>
>   if (shift > PAGE_SHIFT)
> return (pte_pfn(*ptep) | ((addr & ((1UL << shift) - 1)) >> PAGE_SHIFT));
>   else
> return pte_pfn(*ptep);
>
> Something roughly like that, then you don't have to change any callers
> or am I missing something?

Here[1] you asked to return the real address rather than the pfn, which is
what all the callers care about. So I made the changes accordingly.

[1] https://www.spinics.net/lists/kernel/msg3187658.html

Thanks,
Santosh
>
> Thanks,
> Nick


Re: [PATCH v1 05/10] powerpc/mm: Do early ioremaps from top to bottom on PPC64 too.

2019-08-19 Thread Michael Ellerman
Nicholas Piggin  writes:
> Christophe Leroy's on August 14, 2019 6:11 am:
>> Until vmalloc system is up and running, ioremap basically
>> allocates addresses at the border of the IOREMAP area.
>> 
>> On PPC32, addresses are allocated down from the top of the area
>> while on PPC64, addresses are allocated up from the base of the
>> area.
>  
> This series looks pretty good to me, but I'm not sure about this patch.
>
> It seems like quite a small divergence in terms of code, and it looks
> like the final result still has some ifdefs in these functions. Maybe
> you could just keep existing behaviour for this cleanup series so it
> does not risk triggering some obscure regression?

Yeah that is also my feeling. Changing it *should* work, and I haven't
found anything that breaks yet, but it's one of those things that's
bound to break something for some obscure reason.

Christophe do you think you can rework it to retain the different
allocation directions at least for now?

cheers


Re: [PATCH v1 08/10] powerpc/mm: move __ioremap_at() and __iounmap_at() into ioremap.c

2019-08-19 Thread Michael Ellerman
Christophe Leroy  writes:

> diff --git a/arch/powerpc/mm/ioremap.c b/arch/powerpc/mm/ioremap.c
> index 57d742509cec..889ee656cf64 100644
> --- a/arch/powerpc/mm/ioremap.c
> +++ b/arch/powerpc/mm/ioremap.c
> @@ -103,3 +103,46 @@ void iounmap(volatile void __iomem *token)
>   vunmap(addr);
>  }
>  EXPORT_SYMBOL(iounmap);
> +
> +#ifdef CONFIG_PPC64
> +/**
> + * __ioremap_at - Low level function to establish the page tables
> + *for an IO mapping
> + */
> +void __iomem *__ioremap_at(phys_addr_t pa, void *ea, unsigned long size, 
> pgprot_t prot)
> +{
> + /* We don't support the 4K PFN hack with ioremap */
> + if (pgprot_val(prot) & H_PAGE_4K_PFN)
> + return NULL;
> +
> + if ((ea + size) >= (void *)IOREMAP_END) {
> + pr_warn("Outside the supported range\n");
> + return NULL;
> + }
> +
> + WARN_ON(pa & ~PAGE_MASK);
> + WARN_ON(((unsigned long)ea) & ~PAGE_MASK);
> + WARN_ON(size & ~PAGE_MASK);
> +
> + if (ioremap_range((unsigned long)ea, pa, size, prot, NUMA_NO_NODE))

This doesn't build.

Adding ...

extern int ioremap_range(unsigned long ea, phys_addr_t pa, unsigned long size, pgprot_t prot, int nid);

... above, until the next patch, fixes it.

cheers


Re: [PATCH] powerpc/vdso32: Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE

2019-08-19 Thread Nathan Lynch
Christophe Leroy  writes:

> Hi,
>
> Le 19/08/2019 à 18:37, Nathan Lynch a écrit :
>> Hi,
>> 
>> Christophe Leroy  writes:
>>> Benchmark from vdsotest:
>> 
>> I assume you also ran the verification/correctness parts of vdsotest...? :-)
>> 
>
I did run vdsotest-all. I guess it runs the verifications too?

It does, but at a quick glance it runs the validation for "only" 1
second per API. It may provide more confidence to allow the validation
to run across several second (tv_sec) transitions, e.g.

vdsotest -d 30 clock-gettime-monotonic-coarse verify

Regardless, I did not see any problem with your patch.
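
For illustration, a rough userspace stand-in for the kind of consistency check vdsotest's verify mode performs on the coarse clocks; this is only a sketch of the idea, not vdsotest itself, and it uses the Linux-specific CLOCK_MONOTONIC_COARSE:

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>

static long long ns(struct timespec t)
{
        return t.tv_sec * 1000000000LL + t.tv_nsec;
}

int main(void)
{
        struct timespec coarse, fine, start, prev = { 0, 0 };
        long calls = 0;

        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
                clock_gettime(CLOCK_MONOTONIC_COARSE, &coarse);
                clock_gettime(CLOCK_MONOTONIC, &fine);

                /* coarse is read first, so it must never be ahead of the
                 * fine-grained clock, and it must never go backwards */
                if (ns(coarse) > ns(fine) || ns(coarse) < ns(prev)) {
                        fprintf(stderr, "inconsistency after %ld calls\n", calls);
                        return 1;
                }
                prev = coarse;
                calls++;
        } while (fine.tv_sec < start.tv_sec + 3);  /* span a few tv_sec transitions */

        printf("no inconsistency in %ld calls\n", calls);
        return 0;
}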


Re: [PATCH v4 1/3] kasan: support backing vmalloc space with real shadow memory

2019-08-19 Thread Andy Lutomirski
> On Aug 18, 2019, at 8:58 PM, Daniel Axtens  wrote:
>

>>> Each page of shadow memory represents 8 pages of real memory. Could we use
>>> page_ref to count how many pieces of a shadow page are used so that we can
>>> free it when the ref count decreases to 0.
>
> I'm not sure how much of a difference it will make, but I'll have a look.
>

There are a grand total of eight possible pages that could require a
given shadow page. I would suggest that, instead of reference
counting, you just check all eight pages.

Or, better yet, look at the actual vm_area_struct and see where prev
and next point. That should tell you exactly which range can be freed.
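
A rough userspace sketch of that bookkeeping, assuming the usual KASAN scaling of one shadow byte per eight bytes of memory (so one shadow page backs an eight-page region); the helpers below are stand-ins, not kernel APIs:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE          4096UL
#define KASAN_SHADOW_SCALE 8UL          /* one shadow byte covers eight bytes */

/* start of the memory region backed by the shadow page containing shadow_addr */
static uintptr_t shadow_page_to_mem(uintptr_t shadow_addr, uintptr_t shadow_offset)
{
        uintptr_t shadow_page = shadow_addr & ~(PAGE_SIZE - 1);

        return (shadow_page - shadow_offset) * KASAN_SHADOW_SCALE;
}

/* stand-in for "is this vmalloc page still mapped?" */
static bool region_is_mapped(uintptr_t addr)
{
        (void)addr;
        return false;
}

/* walk the eight pages one shadow page covers instead of refcounting */
static bool shadow_page_still_needed(uintptr_t shadow_addr, uintptr_t shadow_offset)
{
        uintptr_t mem = shadow_page_to_mem(shadow_addr, shadow_offset);
        int i;

        for (i = 0; i < 8; i++)
                if (region_is_mapped(mem + i * PAGE_SIZE))
                        return true;
        return false;
}

int main(void)
{
        uintptr_t offset = 0x20000000UL;        /* illustrative shadow offset */
        uintptr_t shadow = 0x20200000UL;        /* some shadow address */

        printf("covers memory from %#lx, still needed: %d\n",
               (unsigned long)shadow_page_to_mem(shadow, offset),
               shadow_page_still_needed(shadow, offset));
        return 0;
}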


Re: [PATCH] powerpc/vdso32: Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE

2019-08-19 Thread Christophe Leroy

Hi,

On 19/08/2019 at 18:37, Nathan Lynch wrote:

Hi,

Christophe Leroy  writes:

Benchmark from vdsotest:


I assume you also ran the verification/correctness parts of vdsotest...? :-)



I did run vdsotest-all. I guess it runs the verifications too ?

Christophe


Re: [PATCH v5 3/4] mm/nvdimm: Use correct #defines instead of open coding

2019-08-19 Thread Dan Williams
On Mon, Aug 19, 2019 at 2:32 AM Aneesh Kumar K.V
 wrote:
>
> Aneesh Kumar K.V  writes:
>
> > Dan Williams  writes:
> >
> >> On Fri, Aug 9, 2019 at 12:45 AM Aneesh Kumar K.V
> >>  wrote:
> >>>
> >>
>
> ...
>
> >>> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> >>> index 37e96811c2fc..c1d9be609322 100644
> >>> --- a/drivers/nvdimm/pfn_devs.c
> >>> +++ b/drivers/nvdimm/pfn_devs.c
> >>> @@ -725,7 +725,8 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
> >>>  * when populating the vmemmap. This *should* be equal to
> >>>  * PMD_SIZE for most architectures.
> >>>  */
> >>> -   offset = ALIGN(start + SZ_8K + 64 * npfns, align) - start;
> >>> +   offset = ALIGN(start + SZ_8K + sizeof(struct page) * npfns, align) - start;
> >>
> >> I'd prefer if this was not dynamic and was instead set to the maximum
> >> size of 'struct page' across all archs just to enhance cross-arch
> >> compatibility. I think that answer is '64'.
> >
> >
> > That still doesn't take care of the case where we add new elements to
> > struct page later. If we have struct page size changing across
> > architectures, we should still be ok as long as new size is less than what 
> > is
> > stored in pfn superblock? I understand the desire to keep it
> > non-dynamic. But we also need to make sure we don't reserve less space
> > when creating a new namespace on a config that got struct page size >
> > 64?
>
>
> How about
>
> libnvdimm/pfn_dev: Add a build check to make sure we notice when struct page 
> size change
>
> When a namespace is created with the map device as a pmem device, struct page
> is stored in the reserve block area. We need to make sure we account for the
> right struct page
> size while doing this. Instead of directly depending on sizeof(struct page)
> which can change based on different kernel config option, use the max struct
> page size (64) while calculating the reserve block area. This makes sure pmem
> device can be used across kernels built with different configs.
>
> If the above assumption about the max struct page size changes, we need to
> update the reserve block allocation space for newly created namespaces.
>
> Signed-off-by: Aneesh Kumar K.V 
>
> 1 file changed, 7 insertions(+)
> drivers/nvdimm/pfn_devs.c | 7 +++
>
> modified   drivers/nvdimm/pfn_devs.c
> @@ -722,7 +722,14 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
>  * The altmap should be padded out to the block size used
>  * when populating the vmemmap. This *should* be equal to
>  * PMD_SIZE for most architectures.
> +*
> +* Also make sure size of struct page is less than 64. We
> +* want to make sure we use large enough size here so that
> +* we don't have a dynamic reserve space depending on
> +* struct page size. But we also want to make sure we notice
> +* if we end up adding new elements to struct page.
>  */
> +   BUILD_BUG_ON(64 < sizeof(struct page));

Looks ok to me. There are ongoing heroic efforts to make sure 'struct
page' does not grow beyond the size of a cacheline. The fact that
'struct page_ext' is allocated out of line makes it safe to assume
that 'struct page' will not be growing larger in the foreseeable
future.
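
For illustration, the reservation arithmetic under discussion with the per-pfn size pinned to 64 bytes, so the reserved area no longer depends on the kernel config; all constants and names below are illustrative, not the libnvdimm code:

#include <stdio.h>

#define SZ_8K                 (8UL * 1024)
#define MAX_STRUCT_PAGE_SIZE  64UL      /* the bound the BUILD_BUG_ON enforces */

static unsigned long align_up(unsigned long x, unsigned long a)
{
        return (x + a - 1) & ~(a - 1);
}

/* space reserved in front of the namespace for the memmap (the struct pages) */
static unsigned long memmap_reserve(unsigned long start, unsigned long npfns,
                                    unsigned long align, unsigned long per_page)
{
        return align_up(start + SZ_8K + per_page * npfns, align) - start;
}

int main(void)
{
        unsigned long npfns = 1UL << 20;        /* 4GB worth of 4K pages */
        unsigned long align = 1UL << 21;        /* PMD_SIZE on most architectures */

        /*
         * Reserving with the fixed 64-byte bound is never smaller than what a
         * particular config needs, as long as sizeof(struct page) stays <= 64,
         * which is exactly what the BUILD_BUG_ON above guarantees.
         */
        printf("fixed (64 bytes):      %lu\n",
               memmap_reserve(0, npfns, align, MAX_STRUCT_PAGE_SIZE));
        printf("one config (56 bytes): %lu\n",
               memmap_reserve(0, npfns, align, 56));
        return 0;
}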


Re: [PATCH] btrfs: fix allocation of bitmap pages.

2019-08-19 Thread Christophe Leroy




On 19/08/2019 at 19:46, David Sterba wrote:

On Sat, Aug 17, 2019 at 07:44:39AM +, Christophe Leroy wrote:

Various notifications of type "BUG kmalloc-4096 () : Redzone
overwritten" have been observed recently in various parts of
the kernel. After some time, the issue was traced back to
the use of the BTRFS filesystem.

[   22.809700] BUG kmalloc-4096 (Tainted: GW): Redzone 
overwritten
[   22.809971] 
-

[   22.810286] INFO: 0xbe1a5921-0xfbfc06cd. First byte 0x0 instead of 0xcc
[   22.810866] INFO: Allocated in __load_free_space_cache+0x588/0x780 [btrfs] 
age=22 cpu=0 pid=224
[   22.811193]  __slab_alloc.constprop.26+0x44/0x70
[   22.811345]  kmem_cache_alloc_trace+0xf0/0x2ec
[   22.811588]  __load_free_space_cache+0x588/0x780 [btrfs]
[   22.811848]  load_free_space_cache+0xf4/0x1b0 [btrfs]
[   22.812090]  cache_block_group+0x1d0/0x3d0 [btrfs]
[   22.812321]  find_free_extent+0x680/0x12a4 [btrfs]
[   22.812549]  btrfs_reserve_extent+0xec/0x220 [btrfs]
[   22.812785]  btrfs_alloc_tree_block+0x178/0x5f4 [btrfs]
[   22.813032]  __btrfs_cow_block+0x150/0x5d4 [btrfs]
[   22.813262]  btrfs_cow_block+0x194/0x298 [btrfs]
[   22.813484]  commit_cowonly_roots+0x44/0x294 [btrfs]
[   22.813718]  btrfs_commit_transaction+0x63c/0xc0c [btrfs]
[   22.813973]  close_ctree+0xf8/0x2a4 [btrfs]
[   22.814107]  generic_shutdown_super+0x80/0x110
[   22.814250]  kill_anon_super+0x18/0x30
[   22.814437]  btrfs_kill_super+0x18/0x90 [btrfs]
[   22.814590] INFO: Freed in proc_cgroup_show+0xc0/0x248 age=41 cpu=0 pid=83
[   22.814841]  proc_cgroup_show+0xc0/0x248
[   22.814967]  proc_single_show+0x54/0x98
[   22.815086]  seq_read+0x278/0x45c
[   22.815190]  __vfs_read+0x28/0x17c
[   22.815289]  vfs_read+0xa8/0x14c
[   22.815381]  ksys_read+0x50/0x94
[   22.815475]  ret_from_syscall+0x0/0x38

Commit 69d2480456d1 ("btrfs: use copy_page for copying pages instead
of memcpy") changed the way bitmap blocks are copied. But although
bitmaps have the size of a page, they were allocated with kzalloc().

Most of the time, kzalloc() allocates aligned blocks of memory, so
copy_page() can be used. But when some debug options like SLAB_DEBUG
are activated, kzalloc() may return an unaligned pointer.

On powerpc, memcpy(), copy_page() and other copying functions use
'dcbz' instruction which provides an entire zeroed cacheline to avoid
memory read when the intention is to overwrite a full line. Functions
like memcpy() are written to care about partial cachelines at the start
and end of the destination, but copy_page() assumes it gets pages.


This assumption is not documented, nor are any pitfalls mentioned in
include/asm-generic/page.h, which provides the generic implementation. As
an API user I cannot check each arch implementation for additional
constraints; I would expect it to deal with the boundary cases the same
way as the arch-specific memcpy implementations.


For me, copy_page() is there to ... copy pages. Not to copy any piece of 
RAM having the size of a page.


But it happened to others. See commit 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d72e9a7a93e4f8e9e52491921d99e0c8aa89eb4e




Another thing that is lost is the slub debugging support for all
architectures, because get_zeroed_pages lacks the red zones and sanity
checks.

I find working with raw pages in this code a bit inconsistent with the
rest of btrfs code, but that's rather minor compared to the above.


What about using kmem_cache instead ? I see kmem_cache is already widely 
used in BTRFS, so using it also for block of memory of size PAGE_SIZE 
should be ok ?


AFAICS, kmem_cache has the red zones and sanity checks.



Summing it up, I think that the proper fix should go to copy_page
implementation on architectures that require it or make it clear what
are the copy_page constraints.



I guess anybody using copy_page() to copy something else than a page is 
on his/her own.


But following that (bad) experience, I propose a patch to at least 
detect it early, see https://patchwork.ozlabs.org/patch/1148033/


Christophe
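
For illustration, the "detect it early" idea expressed as a userspace sketch: a copy_page()-style helper that insists on page-aligned, page-sized buffers, which is the assumption the dcbz-based powerpc implementation relies on. Names are made up and assert() stands in for a kernel warning:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096UL

static void copy_page_checked(void *to, const void *from)
{
        /* a kzalloc'd buffer with SLUB debug red zones can fail these checks */
        assert(((uintptr_t)to   & (PAGE_SIZE - 1)) == 0);
        assert(((uintptr_t)from & (PAGE_SIZE - 1)) == 0);
        memcpy(to, from, PAGE_SIZE);
}

int main(void)
{
        void *src, *dst;

        /* page-aligned allocations, the moral equivalent of alloc_page() */
        if (posix_memalign(&src, PAGE_SIZE, PAGE_SIZE) ||
            posix_memalign(&dst, PAGE_SIZE, PAGE_SIZE))
                return 1;

        memset(src, 0xab, PAGE_SIZE);
        copy_page_checked(dst, src);
        printf("copied ok: %d\n", ((unsigned char *)dst)[0] == 0xab);

        free(src);
        free(dst);
        return 0;
}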


Re: [PATCH] btrfs: fix allocation of bitmap pages.

2019-08-19 Thread David Sterba
On Sat, Aug 17, 2019 at 07:44:39AM +, Christophe Leroy wrote:
> Various notifications of type "BUG kmalloc-4096 () : Redzone
> overwritten" have been observed recently in various parts of
> the kernel. After some time, the issue was traced back to
> the use of the BTRFS filesystem.
> 
> [   22.809700] BUG kmalloc-4096 (Tainted: GW): Redzone 
> overwritten
> [   22.809971] 
> -
> 
> [   22.810286] INFO: 0xbe1a5921-0xfbfc06cd. First byte 0x0 instead of 0xcc
> [   22.810866] INFO: Allocated in __load_free_space_cache+0x588/0x780 [btrfs] 
> age=22 cpu=0 pid=224
> [   22.811193]__slab_alloc.constprop.26+0x44/0x70
> [   22.811345]kmem_cache_alloc_trace+0xf0/0x2ec
> [   22.811588]__load_free_space_cache+0x588/0x780 [btrfs]
> [   22.811848]load_free_space_cache+0xf4/0x1b0 [btrfs]
> [   22.812090]cache_block_group+0x1d0/0x3d0 [btrfs]
> [   22.812321]find_free_extent+0x680/0x12a4 [btrfs]
> [   22.812549]btrfs_reserve_extent+0xec/0x220 [btrfs]
> [   22.812785]btrfs_alloc_tree_block+0x178/0x5f4 [btrfs]
> [   22.813032]__btrfs_cow_block+0x150/0x5d4 [btrfs]
> [   22.813262]btrfs_cow_block+0x194/0x298 [btrfs]
> [   22.813484]commit_cowonly_roots+0x44/0x294 [btrfs]
> [   22.813718]btrfs_commit_transaction+0x63c/0xc0c [btrfs]
> [   22.813973]close_ctree+0xf8/0x2a4 [btrfs]
> [   22.814107]generic_shutdown_super+0x80/0x110
> [   22.814250]kill_anon_super+0x18/0x30
> [   22.814437]btrfs_kill_super+0x18/0x90 [btrfs]
> [   22.814590] INFO: Freed in proc_cgroup_show+0xc0/0x248 age=41 cpu=0 pid=83
> [   22.814841]proc_cgroup_show+0xc0/0x248
> [   22.814967]proc_single_show+0x54/0x98
> [   22.815086]seq_read+0x278/0x45c
> [   22.815190]__vfs_read+0x28/0x17c
> [   22.815289]vfs_read+0xa8/0x14c
> [   22.815381]ksys_read+0x50/0x94
> [   22.815475]ret_from_syscall+0x0/0x38
> 
> Commit 69d2480456d1 ("btrfs: use copy_page for copying pages instead
> of memcpy") changed the way bitmap blocks are copied. But although
> bitmaps have the size of a page, they were allocated with kzalloc().
> 
> Most of the time, kzalloc() allocates aligned blocks of memory, so
> copy_page() can be used. But when some debug options like SLAB_DEBUG
> are activated, kzalloc() may return an unaligned pointer.
> 
> On powerpc, memcpy(), copy_page() and other copying functions use
> 'dcbz' instruction which provides an entire zeroed cacheline to avoid
> memory read when the intention is to overwrite a full line. Functions
> like memcpy() are written to care about partial cachelines at the start
> and end of the destination, but copy_page() assumes it gets pages.

This assumption is not documented, nor are any pitfalls mentioned in
include/asm-generic/page.h, which provides the generic implementation. As
an API user I cannot check each arch implementation for additional
constraints; I would expect it to deal with the boundary cases the same
way as the arch-specific memcpy implementations.

Another thing that is lost is the slub debugging support for all
architectures, because get_zeroed_pages lacks the red zones and sanity
checks.

I find working with raw pages in this code a bit inconsistent with the
rest of btrfs code, but that's rather minor compared to the above.

Summing it up, I think that the proper fix should go to copy_page
implementation on architectures that require it or make it clear what
are the copy_page constraints.


Clean up cut-here even harder (was Re: [PATCH 1/3] powerpc: don't use __WARN() for WARN_ON())

2019-08-19 Thread Kees Cook
On Mon, Aug 19, 2019 at 09:28:03AM -0700, Kees Cook wrote:
> On Mon, Aug 19, 2019 at 01:06:28PM +, Christophe Leroy wrote:
> > __WARN() used to just call __WARN_TAINT(TAINT_WARN)
> > 
> > But a call to printk() has been added in the commit identified below
> > to print a " cut here " line.
> > 
> > This change only applies to warnings using __WARN(), which means
> > WARN_ON() where the condition is constant at compile time.
> > For WARN_ON() with a non constant condition, the additional line is
> > not printed.
> > 
> > In addition, adding a call to printk() forces GCC to add a stack frame
> > and save volatile registers. Powerpc has been using traps to implement
> > warnings in order to avoid that.
> > 
> > So, call __WARN_TAINT(TAINT_WARN) directly instead of using __WARN()
> > in order to restore the previous behaviour.
> > 
> > If one day powerpc wants the decorative " cut here " line, it
> > has to be done in the trap handler, not in the WARN_ON() macro.
> > 
> > Fixes: 6b15f678fb7d ("include/asm-generic/bug.h: fix "cut here" for WARN_ON 
> > for __WARN_TAINT architectures")
> > Signed-off-by: Christophe Leroy 
> 
> Ah! Hmpf. Yeah, that wasn't an intended side-effect of this fix.
> 
> It seems PPC is not alone in this situation of making this code much
> noisier. It looks like there needs to be a way to indicate to the trap
> handler that a message was delivered or not. Perhaps we can add another
> taint flag?

I meant "bug flag" here, not taint. Here's a stab at it. This tries to
remove redundant defines, and moves the "cut here" up into the slow path
explicitly (out of _warn()) and creates a flag so the trap handler can
actually detect if things were already reported...

Thoughts?


diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index aa6c093d9ce9..c2b79878f24c 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -10,6 +10,7 @@
 #define BUGFLAG_WARNING(1 << 0)
 #define BUGFLAG_ONCE   (1 << 1)
 #define BUGFLAG_DONE   (1 << 2)
+#define BUGFLAG_PRINTK (1 << 3)
 #define BUGFLAG_TAINT(taint)   ((taint) << 8)
 #define BUG_GET_TAINT(bug) ((bug)->flags >> 8)
 #endif
@@ -62,13 +63,11 @@ struct bug_entry {
 #endif
 
 #ifdef __WARN_FLAGS
-#define __WARN_TAINT(taint)__WARN_FLAGS(BUGFLAG_TAINT(taint))
-#define __WARN_ONCE_TAINT(taint)   
__WARN_FLAGS(BUGFLAG_ONCE|BUGFLAG_TAINT(taint))
-
 #define WARN_ON_ONCE(condition) ({ \
int __ret_warn_on = !!(condition);  \
if (unlikely(__ret_warn_on))\
-   __WARN_ONCE_TAINT(TAINT_WARN);  \
+   __WARN_FLAGS(BUGFLAG_ONCE | \
+BUGFLAG_TAINT(TAINT_WARN));\
unlikely(__ret_warn_on);\
 })
 #endif
@@ -89,7 +88,7 @@ struct bug_entry {
  *
  * Use the versions with printk format strings to provide better diagnostics.
  */
-#ifndef __WARN_TAINT
+#ifndef __WARN_FLAGS
 extern __printf(3, 4)
 void warn_slowpath_fmt(const char *file, const int line,
   const char *fmt, ...);
@@ -104,12 +103,12 @@ extern void warn_slowpath_null(const char *file, const 
int line);
warn_slowpath_fmt_taint(__FILE__, __LINE__, taint, arg)
 #else
 extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
-#define __WARN() do { \
-   printk(KERN_WARNING CUT_HERE); __WARN_TAINT(TAINT_WARN); \
-} while (0)
+#define __WARN()   __WARN_FLAGS(BUGFLAG_TAINT(TAINT_WARN))
 #define __WARN_printf(arg...)  __WARN_printf_taint(TAINT_WARN, arg)
-#define __WARN_printf_taint(taint, arg...) \
-   do { __warn_printk(arg); __WARN_TAINT(taint); } while (0)
+#define __WARN_printf_taint(taint, arg...) do {\
+   __warn_printk(arg); __WARN_FLAGS(BUGFLAG_PRINTK |   \
+BUGFLAG_TAINT(taint)); \
+   } while (0)
 #endif
 
 /* used internally by panic.c */
diff --git a/kernel/panic.c b/kernel/panic.c
index 057540b6eee9..03c98da6e3f7 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -551,9 +551,6 @@ void __warn(const char *file, int line, void *caller, 
unsigned taint,
 {
disable_trace_on_warning();
 
-   if (args)
-   pr_warn(CUT_HERE);
-
if (file)
pr_warn("WARNING: CPU: %d PID: %d at %s:%d %pS\n",
raw_smp_processor_id(), current->pid, file, line,
@@ -596,6 +593,7 @@ void warn_slowpath_fmt(const char *file, int line, const 
char *fmt, ...)
 {
struct warn_args args;
 
+   pr_warn(CUT_HERE);
args.fmt = fmt;
va_start(args.args, fmt);
__warn(file, line, __builtin_return_address(0), TAINT_WARN, NULL,
@@ -609,6 +607,7 @@ void warn_slowpath_fmt_taint(const char *file, int line,
 {
struct warn_args args;
 
+   pr_warn(CUT_HERE);
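
For reference, the flag layout this relies on can be exercised stand-alone. The sketch below mirrors the BUGFLAG_* encoding from the hunk above; TAINT_WARN's value is taken from the kernel, and BUG_GET_TAINT is redefined here to take the flags word directly instead of a bug_entry:

#include <stdio.h>

#define BUGFLAG_WARNING      (1 << 0)
#define BUGFLAG_ONCE         (1 << 1)
#define BUGFLAG_DONE         (1 << 2)
#define BUGFLAG_PRINTK       (1 << 3)
#define BUGFLAG_TAINT(taint) ((taint) << 8)
#define BUG_GET_TAINT(flags) ((flags) >> 8)    /* on a raw flags word here */

#define TAINT_WARN 9                           /* value used by the kernel */

int main(void)
{
        unsigned int flags = BUGFLAG_WARNING | BUGFLAG_PRINTK |
                             BUGFLAG_TAINT(TAINT_WARN);

        /* what a trap handler could now tell from the bug_entry flags */
        printf("message already printed: %s\n",
               (flags & BUGFLAG_PRINTK) ? "yes" : "no");
        printf("taint to apply: %u\n", BUG_GET_TAINT(flags));
        return 0;
}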

Re: [PATCH v5 1/4] nvdimm: Consider probe return -EOPNOTSUPP as success

2019-08-19 Thread Dan Williams
On Mon, Aug 19, 2019 at 12:07 AM Aneesh Kumar K.V
 wrote:
>
> Dan Williams  writes:
>
> > On Tue, Aug 13, 2019 at 9:22 PM Dan Williams  
> > wrote:
> >>
> >> Hi Aneesh, logic looks correct but there are some cleanups I'd like to
> >> see and a lead-in patch that I attached.
> >>
> >> I've started prefixing nvdimm patches with:
> >>
> >> libnvdimm/$component:
> >>
> >> ...since this patch mostly impacts the pmem driver lets prefix it
> >> "libnvdimm/pmem: "
> >>
> >> On Fri, Aug 9, 2019 at 12:45 AM Aneesh Kumar K.V
> >>  wrote:
> >> >
> >> > This patch add -EOPNOTSUPP as return from probe callback to
> >>
> >> s/This patch add/Add/
> >>
> >> No need to say "this patch" it's obviously a patch.
> >>
> >> > indicate we were not able to initialize a namespace due to pfn superblock
> >> > feature/version mismatch. We want to consider this a probe success so 
> >> > that
> >> > we can create a new namespace seed and thereby avoid marking the failed
> >> > namespace as the seed namespace.
> >>
> >> Please replace usage of "we" with the exact agent involved as which
> >> "we" is being referred to gets confusing for the reader.
> >>
> >> i.e. "indicate that the pmem driver was not..." "The nvdimm core wants
> >> to consider this...".
> >>
> >> >
> >> > Signed-off-by: Aneesh Kumar K.V 
> >> > ---
> >> >  drivers/nvdimm/bus.c  |  2 +-
> >> >  drivers/nvdimm/pmem.c | 26 ++
> >> >  2 files changed, 23 insertions(+), 5 deletions(-)
> >> >
> >> > diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
> >> > index 798c5c4aea9c..16c35e6446a7 100644
> >> > --- a/drivers/nvdimm/bus.c
> >> > +++ b/drivers/nvdimm/bus.c
> >> > @@ -95,7 +95,7 @@ static int nvdimm_bus_probe(struct device *dev)
> >> > rc = nd_drv->probe(dev);
> >> > debug_nvdimm_unlock(dev);
> >> >
> >> > -   if (rc == 0)
> >> > +   if (rc == 0 || rc == -EOPNOTSUPP)
> >> > nd_region_probe_success(nvdimm_bus, dev);
> >>
> >> This now makes the nd_region_probe_success() helper obviously misnamed
> >> since it now wants to take actions on non-probe success. I attached a
> >> lead-in cleanup that you can pull into your series that renames that
> >> routine to nd_region_advance_seeds().
> >>
> >> When you rebase this needs a comment about why EOPNOTSUPP has special 
> >> handling.
> >>
> >> > else
> >> > nd_region_disable(nvdimm_bus, dev);
> >> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> >> > index 4c121dd03dd9..3f498881dd28 100644
> >> > --- a/drivers/nvdimm/pmem.c
> >> > +++ b/drivers/nvdimm/pmem.c
> >> > @@ -490,6 +490,7 @@ static int pmem_attach_disk(struct device *dev,
> >> >
> >> >  static int nd_pmem_probe(struct device *dev)
> >> >  {
> >> > +   int ret;
> >> > struct nd_namespace_common *ndns;
> >> >
> >> > ndns = nvdimm_namespace_common_probe(dev);
> >> > @@ -505,12 +506,29 @@ static int nd_pmem_probe(struct device *dev)
> >> > if (is_nd_pfn(dev))
> >> > return pmem_attach_disk(dev, ndns);
> >> >
> >> > -   /* if we find a valid info-block we'll come back as that 
> >> > personality */
> >> > -   if (nd_btt_probe(dev, ndns) == 0 || nd_pfn_probe(dev, ndns) == 0
> >> > -   || nd_dax_probe(dev, ndns) == 0)
> >>
> >> Similar need for an updated comment here to explain the special
> >> translation of error codes.
> >>
> >> > +   ret = nd_btt_probe(dev, ndns);
> >> > +   if (ret == 0)
> >> > return -ENXIO;
> >> > +   else if (ret == -EOPNOTSUPP)
> >>
> >> Are there cases where the btt driver needs to return EOPNOTSUPP? I'd
> >> otherwise like to keep this special casing constrained to the pfn /
> >> dax info block cases.
> >
> > In fact I think EOPNOTSUPP is only something that the device-dax case
> > would be concerned with because that's the only interface that
> > attempts to guarantee a given mapping granularity.
>
> We need to do similar error handling w.r.t fsdax when the pfn superblock
> indicates different PAGE_SIZE and struct page size?

Only in the case where PAGE_SIZE is less than the pfn superblock page
size, the memmap is stored on pmem, and the reservation is too small.
Otherwise the PAGE_SIZE difference does not matter in practice for the
fsdax case... unless I'm overlooking another failure case?

> I don't think btt
> needs to support EOPNOTSUPP. But we can keep it for consistency?

That's not a sufficient argument in my mind. The comment about why
EOPNOTSUPP is treated specially should have a note about the known
usages, and since there is no BTT case for it lets leave it out.


Re: [PATCH] powerpc/vdso32: Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE

2019-08-19 Thread Nathan Lynch
Hi,

Christophe Leroy  writes:
> Benchmark from vdsotest:

I assume you also ran the verification/correctness parts of vdsotest...? :-)




Re: [PATCH 1/3] powerpc: don't use __WARN() for WARN_ON()

2019-08-19 Thread Kees Cook
On Mon, Aug 19, 2019 at 01:06:28PM +, Christophe Leroy wrote:
> __WARN() used to just call __WARN_TAINT(TAINT_WARN)
> 
> But a call to printk() has been added in the commit identified below
> to print a " cut here " line.
> 
> This change only applies to warnings using __WARN(), which means
> WARN_ON() where the condition is constant at compile time.
> For WARN_ON() with a non constant condition, the additional line is
> not printed.
> 
> In addition, adding a call to printk() forces GCC to add a stack frame
> and save volatile registers. Powerpc has been using traps to implement
> warnings in order to avoid that.
> 
> So, call __WARN_TAINT(TAINT_WARN) directly instead of using __WARN()
> in order to restore the previous behaviour.
> 
> If one day powerpc wants the decorative " cut here " line, it
> has to be done in the trap handler, not in the WARN_ON() macro.
> 
> Fixes: 6b15f678fb7d ("include/asm-generic/bug.h: fix "cut here" for WARN_ON 
> for __WARN_TAINT architectures")
> Signed-off-by: Christophe Leroy 

Ah! Hmpf. Yeah, that wasn't an intended side-effect of this fix.

It seems PPC is not alone in this situation of making this code much
noisier. It looks like there needs to be a way to indicate to the trap
handler that a message was delivered or not. Perhaps we can add another
taint flag?

-kees

> ---
>  arch/powerpc/include/asm/bug.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
> index fed7e6241349..3928fdaebb71 100644
> --- a/arch/powerpc/include/asm/bug.h
> +++ b/arch/powerpc/include/asm/bug.h
> @@ -99,7 +99,7 @@
>   int __ret_warn_on = !!(x);  \
>   if (__builtin_constant_p(__ret_warn_on)) {  \
>   if (__ret_warn_on)  \
> - __WARN();   \
> + __WARN_TAINT(TAINT_WARN);   \
>   } else {\
>   __asm__ __volatile__(   \
>   "1: "PPC_TLNEI" %4,0\n" \
> -- 
> 2.13.3
> 

-- 
Kees Cook


Re: [PATCH v4 11/25] powernv/fadump: register kernel metadata address with opal

2019-08-19 Thread Hari Bathini



On 14/08/19 3:51 PM, Mahesh Jagannath Salgaonkar wrote:
> On 8/14/19 12:36 PM, Hari Bathini wrote:
>>
>>
>> On 13/08/19 4:11 PM, Mahesh J Salgaonkar wrote:
>>> On 2019-07-16 17:03:15 Tue, Hari Bathini wrote:
 OPAL allows registering address with it in the first kernel and
 retrieving it after MPIPL. Setup kernel metadata and register its
 address with OPAL to use it for processing the crash dump.

 Signed-off-by: Hari Bathini 
 ---
[...]
>>
>>> What if kernel crashes before metadata area is initialized ?
>>
>> registered_regions would be '0'. So, it is treated as fadump is not 
>> registered case.
>> Let me
>> initialize metadata explicitly before registering the address with f/w to 
>> avoid any assumption...
> 
> Do you want to do that before memblock reservation ? Should we move this
> to setup_fadump() ?

Better here as failing early would mean we could fall back to KDump..



Re: [PATCH 3/3] powerpc: use __builtin_trap() in BUG/WARN macros.

2019-08-19 Thread Segher Boessenkool
On Mon, Aug 19, 2019 at 05:05:46PM +0200, Christophe Leroy wrote:
> On 19/08/2019 at 16:37, Segher Boessenkool wrote:
> >On Mon, Aug 19, 2019 at 04:08:43PM +0200, Christophe Leroy wrote:
> >>On 19/08/2019 at 15:23, Segher Boessenkool wrote:
> >>>On Mon, Aug 19, 2019 at 01:06:31PM +, Christophe Leroy wrote:
> Note that we keep using an assembly text using "twi 31, 0, 0" for
> unconditional traps because GCC drops all code after
> __builtin_trap() when the condition is always true at build time.
> >>>
> >>>As I said, it can also do this for conditional traps, if it can prove
> >>>the condition is always true.
> >>
> >>But we have another branch for 'always true' and 'always false' using
> >>__builtin_constant_p(), which don't use __builtin_trap(). Is there
> >>anything wrong with that ?:
> >
> >The compiler might not realise it is constant when it evaluates the
> >__builtin_constant_p, but only realises it later.  As the documentation
> >for the builtin says:
> >   A return of 0 does not indicate that the
> >   value is _not_ a constant, but merely that GCC cannot prove it is a
> >   constant with the specified value of the '-O' option.
> 
> So you mean GCC would not be able to prove that 
> __builtin_constant_p(cond) is always true but it would be able to prove 
> that if (cond)  is always true ?

Not sure what you mean, sorry.

> And isn't there a way to tell GCC that '__builtin_trap()' is
> recoverable in our case ?

No, GCC knows that a trap will never fall through.

> >I think it may work if you do
> >
> >#define BUG_ON(x) do {   \
> > if (__builtin_constant_p(x)) {  \
> > if (x)  \
> > BUG();  \
> > } else {\
> > BUG_ENTRY("", 0);   \
> > if (x)  \
> > __builtin_trap();   \
> > }   \
> >} while (0)
> 
> It doesn't work:

You need to make a BUG_ENTRY so that it refers to the *following* trap
instruction, if you go this way.

> >I don't know how BUG_ENTRY works exactly.
> 
> It's basic, maybe too basic: it adds an inline asm with a label, and 
> adds a .long in the __bug_table section with the address of that label.
> 
> When putting it after the __builtin_trap(), I changed it to use the
> address just before the label's address, which is always the twxx
> instruction as far as I can see.
> 
> #define BUG_ENTRY(insn, flags, ...)   \
>   __asm__ __volatile__(   \
>   "1: " insn "\n" \
>   ".section __bug_table,\"aw\"\n" \
>   "2:\t" PPC_LONG "1b, %0\n"  \
>   "\t.short %1, %2\n" \
>   ".org 2b+%3\n"  \
>   ".previous\n"   \
>   : : "i" (__FILE__), "i" (__LINE__), \
> "i" (flags),  \
> "i" (sizeof(struct bug_entry)),   \
> ##__VA_ARGS__)

#define MY_BUG_ENTRY(lab, flags)\
__asm__ __volatile__(   \
".section __bug_table,\"aw\"\n" \
"2:\t" PPC_LONG "%4, %0\n"  \
"\t.short %1, %2\n" \
".org 2b+%3\n"  \
".previous\n"   \
: : "i" (__FILE__), "i" (__LINE__), \
  "i" (flags),  \
  "i" (sizeof(struct bug_entry)),   \
  "i" (lab))

called as

#define BUG_ON(x) do {  \
MY_BUG_ENTRY(&&lab, 0); \
lab: if (x) \
__builtin_trap();   \
} while (0)

not sure how reliable that works -- *if* it works, I just typed that in
without testing or anything -- but hopefully you get the idea.


Segher


Re: [PATCH 3/3] powerpc: use __builtin_trap() in BUG/WARN macros.

2019-08-19 Thread Christophe Leroy




On 19/08/2019 at 16:37, Segher Boessenkool wrote:

On Mon, Aug 19, 2019 at 04:08:43PM +0200, Christophe Leroy wrote:

On 19/08/2019 at 15:23, Segher Boessenkool wrote:

On Mon, Aug 19, 2019 at 01:06:31PM +, Christophe Leroy wrote:

Note that we keep using an assembly text using "twi 31, 0, 0" for
unconditional traps because GCC drops all code after
__builtin_trap() when the condition is always true at build time.


As I said, it can also do this for conditional traps, if it can prove
the condition is always true.


But we have another branch for 'always true' and 'always false' using
__builtin_constant_p(), which don't use __builtin_trap(). Is there
anything wrong with that ?:


The compiler might not realise it is constant when it evaluates the
__builtin_constant_p, but only realises it later.  As the documentation
for the builtin says:
   A return of 0 does not indicate that the
   value is _not_ a constant, but merely that GCC cannot prove it is a
   constant with the specified value of the '-O' option.


So you mean GCC would not be able to prove that 
__builtin_constant_p(cond) is always true but it would be able to prove 
that if (cond)  is always true ?


And isn't there a way to tell GCC that '__builtin_trap()' is
recoverable in our case ?




(and there should be many more and more serious warnings here).


#define BUG_ON(x) do {  \
if (__builtin_constant_p(x)) {  \
if (x)  \
BUG();  \
} else {\
if (x)  \
__builtin_trap();   \
BUG_ENTRY("", 0); \
}   \
} while (0)


I think it may work if you do

#define BUG_ON(x) do {  \
if (__builtin_constant_p(x)) {  \
if (x)  \
BUG();  \
} else {\
BUG_ENTRY("", 0); \
if (x)  \
__builtin_trap();   \
}   \
} while (0)


It doesn't work:

void test_bug1(unsigned long long a)
{
BUG_ON(a);
}

0090 <test_bug1>:
  90:   7c 63 23 78 or  r3,r3,r4
  94:   0f 03 00 00 twnei   r3,0
  98:   4e 80 00 20 blr

RELOCATION RECORDS FOR [__bug_table]:
OFFSET   TYPE  VALUE
0084 R_PPC_ADDR32  .text+0x0090

As you see, the relocation in __bug_table points to the 'or' and not to 
the 'twnei'.




or even just

#define BUG_ON(x) do {  \
BUG_ENTRY("", 0); \
if (x)  \
__builtin_trap();   \
} while (0)

if BUG_ENTRY can work for the trap insn *after* it.


Can you put the bug table asm *before* the __builtin_trap maybe?  That
should make it all work fine...  If you somehow can tell what machine
instruction is that trap, anyway.


And how can I tell that ?


I don't know how BUG_ENTRY works exactly.


It's basic, maybe too basic: it adds an inline asm with a label, and 
adds a .long in the __bug_table section with the address of that label.


When putting it after the __builtin_trap(), I changed it to use the
address just before the label's address, which is always the twxx
instruction as far as I can see.


#define BUG_ENTRY(insn, flags, ...) \
__asm__ __volatile__(   \
"1:" insn "\n"  \
".section __bug_table,\"aw\"\n" \
"2:\t" PPC_LONG "1b, %0\n"  \
"\t.short %1, %2\n"   \
".org 2b+%3\n"\
".previous\n" \
: : "i" (__FILE__), "i" (__LINE__), \
  "i" (flags),\
  "i" (sizeof(struct bug_entry)), \
  ##__VA_ARGS__)

Christophe


Re: [PATCH v3 3/3] powerpc/64: optimise LOAD_REG_IMMEDIATE_SYM()

2019-08-19 Thread Nicholas Piggin
Segher Boessenkool's on August 20, 2019 12:24 am:
> On Mon, Aug 19, 2019 at 01:58:12PM +, Christophe Leroy wrote:
>> -#define LOAD_REG_IMMEDIATE_SYM(reg,expr)\
>> -lis reg,(expr)@highest; \
>> -ori reg,reg,(expr)@higher;  \
>> -rldicr  reg,reg,32,31;  \
>> -orisreg,reg,(expr)@__AS_ATHIGH; \
>> -ori reg,reg,(expr)@l;
>> +#define LOAD_REG_IMMEDIATE_SYM(reg, tmp, expr)  \
>> +lis reg, (expr)@highest;\
>> +lis tmp, (expr)@__AS_ATHIGH;\
>> +ori reg, reg, (expr)@higher;\
>> +ori tmp, reg, (expr)@l; \
>> +rldimi  reg, tmp, 32, 0
> 
> That should be
> 
> #define LOAD_REG_IMMEDIATE_SYM(reg, tmp, expr)\
>   lis tmp, (expr)@highest;\
>   ori tmp, tmp, (expr)@higher;\
>   lis reg, (expr)@__AS_ATHIGH;\
>   ori reg, reg, (expr)@l; \
>   rldimi  reg, tmp, 32, 0
> 
> (tmp is the high half, reg is the low half, as inputs to that rldimi).

I guess the intention was also to try to fit the independent ops into
the earliest fetch/issue cycle possible.

#define LOAD_REG_IMMEDIATE_SYM(reg, tmp, expr)  \
lis tmp, (expr)@highest;\
lis reg, (expr)@__AS_ATHIGH;\
ori tmp, tmp, (expr)@higher;\
ori reg, reg, (expr)@l; \
rldimi  reg, tmp, 32, 0

Very cool series though.

Thanks,
Nick
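
As a quick sanity check of what the three-instruction sequence computes, the lis/ori/rldimi construction can be modelled in userspace (purely illustrative, using Segher's operand ordering with tmp holding the high half and reg the low half):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* the @highest, @higher, @high (__AS_ATHIGH) and @l pieces of a 64-bit value */
#define HIGHEST(x) ((uint16_t)((x) >> 48))
#define HIGHER(x)  ((uint16_t)((x) >> 32))
#define HIGH(x)    ((uint16_t)((x) >> 16))
#define LO(x)      ((uint16_t)(x))

/* lis rD,imm: immediate shifted left 16 and sign-extended to 64 bits */
static uint64_t lis(uint16_t imm)
{
        return (uint64_t)(int64_t)(int32_t)((uint32_t)imm << 16);
}

/* ori rD,rD,imm: OR with the zero-extended 16-bit immediate */
static uint64_t ori(uint64_t reg, uint16_t imm)
{
        return reg | imm;
}

/* rldimi reg,tmp,32,0: rotate tmp left by 32 and insert into reg's upper half */
static uint64_t rldimi_32_0(uint64_t reg, uint64_t tmp)
{
        return (tmp << 32) | (reg & 0xffffffffULL);
}

int main(void)
{
        uint64_t sym = 0xc000000001a2b3c4ULL;   /* arbitrary kernel-looking address */
        uint64_t tmp, reg;

        tmp = ori(lis(HIGHEST(sym)), HIGHER(sym));      /* high 32 bits */
        reg = ori(lis(HIGH(sym)), LO(sym));             /* low 32 bits */
        reg = rldimi_32_0(reg, tmp);

        assert(reg == sym);
        printf("reconstructed %#llx\n", (unsigned long long)reg);
        return 0;
}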


Re: [PATCH 0/6] drm+dma: cache support for arm, etc

2019-08-19 Thread Rob Clark
On Sun, Aug 18, 2019 at 10:23 PM Christoph Hellwig  wrote:
>
> On Fri, Aug 16, 2019 at 02:04:35PM -0700, Rob Clark wrote:
> > I don't disagree about needing an API to get uncached memory (or
> > ideally just something outside of the linear map).  But I think this
> > is a separate problem.
> >
> > What I was hoping for, for v5.4, is a way to stop abusing dma_map/sync
> > for cache ops to get rid of the hack I had to make for v5.3.  And also
> > to fix vgem on non-x86.  (Unfortunately changing vgem to used cached
> > mappings breaks x86 CI, but fixes CI on arm/arm64..)  We can do that
> > without any changes in allocation.  There is still the possibility for
> > problems due to cached alias, but that has been a problem this whole
> > time, it isn't something new.
>
> But that just means we start exposing random low-level APIs that
> people will quickly abuse..  In fact even your simple plan to some
> extent already is an abuse of the intent of these functions, and
> it also requires a lot of knowledge in the driver that in the normal
> cases drivers can't know (e.g. is the device dma coherent or not).

I can agree that most drivers should use the higher level APIs.. but
not that we must prevent *all* drivers from using them.  Most of what
DMA API is trying to solve doesn't apply to a driver like drm/msm..
which is how we ended up with hacks to try and misuse the high level
API to accomplish what we need.

Perhaps we can protect the prototypes with #ifdef LOWLEVEL_DMA_API /
#endif type thing to make it more obvious to other drivers that it
probably isn't the API they should use?

BR,
-R


Re: [PATCH 3/3] powerpc: use __builtin_trap() in BUG/WARN macros.

2019-08-19 Thread Segher Boessenkool
On Mon, Aug 19, 2019 at 04:08:43PM +0200, Christophe Leroy wrote:
> On 19/08/2019 at 15:23, Segher Boessenkool wrote:
> >On Mon, Aug 19, 2019 at 01:06:31PM +, Christophe Leroy wrote:
> >>Note that we keep using an assembly text using "twi 31, 0, 0" for
> >>unconditional traps because GCC drops all code after
> >>__builtin_trap() when the condition is always true at build time.
> >
> >As I said, it can also do this for conditional traps, if it can prove
> >the condition is always true.
> 
> But we have another branch for 'always true' and 'always false' using 
> __builtin_constant_p(), which don't use __builtin_trap(). Is there 
> anything wrong with that ?:

The compiler might not realise it is constant when it evaluates the
__builtin_constant_p, but only realises it later.  As the documentation
for the builtin says:
  A return of 0 does not indicate that the
  value is _not_ a constant, but merely that GCC cannot prove it is a
  constant with the specified value of the '-O' option.

(and there should be many more and more serious warnings here).

> #define BUG_ON(x) do {\
>   if (__builtin_constant_p(x)) {  \
>   if (x)  \
>   BUG();  \
>   } else {\
>   if (x)  \
>   __builtin_trap();   \
>   BUG_ENTRY("", 0);   \
>   }   \
> } while (0)

I think it may work if you do

#define BUG_ON(x) do {  \
if (__builtin_constant_p(x)) {  \
if (x)  \
BUG();  \
} else {\
BUG_ENTRY("", 0);   \
if (x)  \
__builtin_trap();   \
}   \
} while (0)

or even just

#define BUG_ON(x) do {  \
BUG_ENTRY("", 0);   \
if (x)  \
__builtin_trap();   \
} while (0)

if BUG_ENTRY can work for the trap insn *after* it.

> >Can you put the bug table asm *before* the __builtin_trap maybe?  That
> >should make it all work fine...  If you somehow can tell what machine
> >instruction is that trap, anyway.
> 
> And how can I tell that ?

I don't know how BUG_ENTRY works exactly.


Segher
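
The __builtin_constant_p subtlety is easy to reproduce in userspace. The gcc/clang-specific demo below shows the builtin evaluating to 0 for an expression the optimiser may nevertheless fold to a constant; the output depends on the optimisation level (try -O0 versus -O2):

#include <stdio.h>

#define REPORT(x) \
        printf("%-16s __builtin_constant_p = %d\n", #x, __builtin_constant_p(x))

static inline int always_42(void)
{
        return 42;
}

int main(void)
{
        volatile int v = 1;

        REPORT(42);             /* a literal: always 1 */
        REPORT(always_42());    /* 0 at -O0, typically 1 once inlined at -O2 */
        REPORT(v);              /* volatile load: always 0 */
        return 0;
}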


Re: [PATCH v10 6/7] powerpc/mce: Handle UE event for memcpy_mcsafe

2019-08-19 Thread Nicholas Piggin
Santosh Sivaraj's on August 15, 2019 10:39 am:
> From: Balbir Singh 
> 
> If we take a UE on one of the instructions with a fixup entry, set nip
> to continue execution at the fixup entry. Stop processing the event
> further or print it.

The previous patch added these fixup entries and now you handle them
here. Which in theory seems to break bisecting. The patches should
either be merged, or this one moved ahead in the series.

I'm still not entirely happy with the ignore_event thing, but that's
probably more a symptom of the convoluted way machine check handling
and reporting is structured. For now it's probably fine.

Reviewed-by: Nicholas Piggin 

> 
> Co-developed-by: Reza Arbab 
> Signed-off-by: Reza Arbab 
> Signed-off-by: Balbir Singh 
> Signed-off-by: Santosh Sivaraj 
> Reviewed-by: Mahesh Salgaonkar 
> ---
>  arch/powerpc/include/asm/mce.h  |  4 +++-
>  arch/powerpc/kernel/mce.c   | 16 
>  arch/powerpc/kernel/mce_power.c | 15 +--
>  3 files changed, 32 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
> index f3a6036b6bc0..e1931c8c2743 100644
> --- a/arch/powerpc/include/asm/mce.h
> +++ b/arch/powerpc/include/asm/mce.h
> @@ -122,7 +122,8 @@ struct machine_check_event {
>   enum MCE_UeErrorType ue_error_type:8;
>   u8  effective_address_provided;
>   u8  physical_address_provided;
> - u8  reserved_1[5];
> + u8  ignore_event;
> + u8  reserved_1[4];
>   u64 effective_address;
>   u64 physical_address;
>   u8  reserved_2[8];
> @@ -193,6 +194,7 @@ struct mce_error_info {
>   enum MCE_Initiator  initiator:8;
>   enum MCE_ErrorClass error_class:8;
>   boolsync_error;
> + boolignore_event;
>  };
>  
>  #define MAX_MC_EVT   100
> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> index a3b122a685a5..ec4b3e1087be 100644
> --- a/arch/powerpc/kernel/mce.c
> +++ b/arch/powerpc/kernel/mce.c
> @@ -149,6 +149,7 @@ void save_mce_event(struct pt_regs *regs, long handled,
>   if (phys_addr != ULONG_MAX) {
>   mce->u.ue_error.physical_address_provided = true;
>   mce->u.ue_error.physical_address = phys_addr;
> + mce->u.ue_error.ignore_event = mce_err->ignore_event;
>   machine_check_ue_event(mce);
>   }
>   }
> @@ -266,8 +267,17 @@ static void machine_process_ue_event(struct work_struct 
> *work)
>   /*
>* This should probably queued elsewhere, but
>* oh! well
> +  *
> +  * Don't report this machine check because the caller has a
> +  * asked us to ignore the event, it has a fixup handler which
> +  * will do the appropriate error handling and reporting.
>*/
>   if (evt->error_type == MCE_ERROR_TYPE_UE) {
> + if (evt->u.ue_error.ignore_event) {
> + __this_cpu_dec(mce_ue_count);
> + continue;
> + }
> +
>   if (evt->u.ue_error.physical_address_provided) {
>   unsigned long pfn;
>  
> @@ -301,6 +311,12 @@ static void machine_check_process_queued_event(struct 
> irq_work *work)
>   while (__this_cpu_read(mce_queue_count) > 0) {
>   index = __this_cpu_read(mce_queue_count) - 1;
>   evt = this_cpu_ptr(&mce_event_queue[index]);
> +
> + if (evt->error_type == MCE_ERROR_TYPE_UE &&
> + evt->u.ue_error.ignore_event) {
> + __this_cpu_dec(mce_queue_count);
> + continue;
> + }
>   machine_check_print_event_info(evt, false, false);
>   __this_cpu_dec(mce_queue_count);
>   }
> diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
> index e74816f045f8..1dd87f6f5186 100644
> --- a/arch/powerpc/kernel/mce_power.c
> +++ b/arch/powerpc/kernel/mce_power.c
> @@ -11,6 +11,7 @@
>  
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -18,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /*
>   * Convert an address related to an mm to a physical address.
> @@ -559,9 +561,18 @@ static int mce_handle_derror(struct pt_regs *regs,
>   return 0;
>  }
>  
> -static long mce_handle_ue_error(struct pt_regs *regs)
> +static long mce_handle_ue_error(struct pt_regs *regs,
> + struct mce_error_info *mce_err)
>  {
>   long handled = 0;
> + const struct exception_table_entry *entry;
> +
> + entry 

Re: [PATCH 7/8] parisc: don't set ARCH_NO_COHERENT_DMA_MMAP

2019-08-19 Thread Christoph Hellwig
Does my explanation from Thursday make sense or is it completely
off?  Does the patch description need some update to be less
confusing to those used to different terminology?

On Thu, Aug 15, 2019 at 12:50:02PM +0200, Christoph Hellwig wrote:
> Except for the different naming scheme vs the code this matches my
> assumptions.
> 
> In the code we have three cases (and a fourth EISA case mentioned in
> comments, but not actually implemented as far as I can tell):
> 
> arch/parisc/kernel/pci-dma.c says in the top of file comments:
> 
> ** AFAIK, all PA7100LC and PA7300LC platforms can use this code.
> 
> and then handles two different cases.  For cpu_type == pcxl or pcxl2
> it maps the memory as uncached for dma_alloc_coherent, and for all
> other cpu types it fails the coherent allocations.
> 
> In addition to that there are the ccio and sba iommu drivers, of which
> according to your above comment one is always present for pa8xxx.
> 
> Which brings us back to this patch, which ensures that no cacheable
> memory is exported to userspace by removing ->mmap from ccio and sba.
> It then enables dma_mmap_coherent for the pcxl or pcxl2 case that
> allocates uncached memory. For the !pcxl && !pcxl2 case dma_mmap_coherent
> cannot work anyway, because dma_alloc_coherent already failed and thus
> there is no memory to mmap.
> 
> So if the description is too confusing please suggest a better
> one, I'm a little lost between all these code names and product
> names (arch/parisc/include/asm/dma-mapping.h uses yet another set).
---end quoted text---


Re: [PATCH v3 3/3] powerpc/64: optimise LOAD_REG_IMMEDIATE_SYM()

2019-08-19 Thread Segher Boessenkool
On Mon, Aug 19, 2019 at 01:58:12PM +, Christophe Leroy wrote:
> -#define LOAD_REG_IMMEDIATE_SYM(reg,expr) \
> - lis reg,(expr)@highest; \
> - ori reg,reg,(expr)@higher;  \
> - rldicr  reg,reg,32,31;  \
> - orisreg,reg,(expr)@__AS_ATHIGH; \
> - ori reg,reg,(expr)@l;
> +#define LOAD_REG_IMMEDIATE_SYM(reg, tmp, expr)   \
> + lis reg, (expr)@highest;\
> + lis tmp, (expr)@__AS_ATHIGH;\
> + ori reg, reg, (expr)@higher;\
> + ori tmp, reg, (expr)@l; \
> + rldimi  reg, tmp, 32, 0

That should be

#define LOAD_REG_IMMEDIATE_SYM(reg, tmp, expr)  \
lis tmp, (expr)@highest;\
ori tmp, tmp, (expr)@higher;\
lis reg, (expr)@__AS_ATHIGH;\
ori reg, reg, (expr)@l; \
rldimi  reg, tmp, 32, 0

(tmp is the high half, reg is the low half, as inputs to that rldimi).


Segher


Re: [PATCH v10 2/7] powerpc/mce: Fix MCE handling for huge pages

2019-08-19 Thread Nicholas Piggin
Santosh Sivaraj's on August 15, 2019 10:39 am:
> From: Balbir Singh 
> 
> The current code would fail on huge pages addresses, since the shift would
> be incorrect. Use the correct page shift value returned by
> __find_linux_pte() to get the correct physical address. The code is more
> generic and can handle both regular and compound pages.
> 
> Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
> Signed-off-by: Balbir Singh 
> [ar...@linux.ibm.com: Fixup pseries_do_memory_failure()]
> Signed-off-by: Reza Arbab 
> Co-developed-by: Santosh Sivaraj 
> Signed-off-by: Santosh Sivaraj 
> Tested-by: Mahesh Salgaonkar 
> Cc: sta...@vger.kernel.org # v4.15+
> ---
>  arch/powerpc/include/asm/mce.h   |  2 +-
>  arch/powerpc/kernel/mce_power.c  | 55 ++--
>  arch/powerpc/platforms/pseries/ras.c |  9 ++---
>  3 files changed, 32 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
> index a4c6a74ad2fb..f3a6036b6bc0 100644
> --- a/arch/powerpc/include/asm/mce.h
> +++ b/arch/powerpc/include/asm/mce.h
> @@ -209,7 +209,7 @@ extern void release_mce_event(void);
>  extern void machine_check_queue_event(void);
>  extern void machine_check_print_event_info(struct machine_check_event *evt,
>  bool user_mode, bool in_guest);
> -unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
> +unsigned long addr_to_phys(struct pt_regs *regs, unsigned long addr);
>  #ifdef CONFIG_PPC_BOOK3S_64
>  void flush_and_reload_slb(void);
>  #endif /* CONFIG_PPC_BOOK3S_64 */
> diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
> index a814d2dfb5b0..e74816f045f8 100644
> --- a/arch/powerpc/kernel/mce_power.c
> +++ b/arch/powerpc/kernel/mce_power.c
> @@ -20,13 +20,14 @@
>  #include 
>  
>  /*
> - * Convert an address related to an mm to a PFN. NOTE: we are in real
> - * mode, we could potentially race with page table updates.
> + * Convert an address related to an mm to a physical address.
> + * NOTE: we are in real mode, we could potentially race with page table 
> updates.
>   */
> -unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
> +unsigned long addr_to_phys(struct pt_regs *regs, unsigned long addr)
>  {
> - pte_t *ptep;
> - unsigned long flags;
> + pte_t *ptep, pte;
> + unsigned int shift;
> + unsigned long flags, phys_addr;
>   struct mm_struct *mm;
>  
>   if (user_mode(regs))
> @@ -35,14 +36,21 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned 
> long addr)
>   mm = &init_mm;
>  
>   local_irq_save(flags);
> - if (mm == current->mm)
> - ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
> - else
> - ptep = find_init_mm_pte(addr, NULL);
> + ptep = __find_linux_pte(mm->pgd, addr, NULL, &shift);
>   local_irq_restore(flags);
> +
>   if (!ptep || pte_special(*ptep))
>   return ULONG_MAX;
> - return pte_pfn(*ptep);
> +
> + pte = *ptep;
> + if (shift > PAGE_SHIFT) {
> + unsigned long rpnmask = (1ul << shift) - PAGE_SIZE;
> +
> + pte = __pte(pte_val(pte) | (addr & rpnmask));
> + }
> + phys_addr = pte_pfn(pte) << PAGE_SHIFT;
> +
> + return phys_addr;
>  }

This should remain addr_to_pfn I think. None of the callers care what
size page the EA was mapped with. 'pfn' is referring to the Linux pfn,
which is the small page number.

  if (shift > PAGE_SHIFT)
return pte_pfn(*ptep) | ((addr & ((1UL << shift) - 1)) >> PAGE_SHIFT);
  else
return pte_pfn(*ptep);

Something roughly like that, then you don't have to change any callers
or am I missing something?

Thanks,
Nick
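
For illustration, the pfn arithmetic suggested above modelled in userspace. PAGE_SHIFT and the sample values are made up; in the kernel the base pfn comes from pte_pfn() and is aligned to the huge-page boundary, which is what makes the OR below equivalent to an addition:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12

/* pfn of addr, given the pte's base pfn and the mapping's page-size shift */
static uint64_t addr_to_pfn(uint64_t pte_base_pfn, uint64_t addr, unsigned int shift)
{
        if (shift > PAGE_SHIFT)
                return pte_base_pfn | ((addr & ((1ULL << shift) - 1)) >> PAGE_SHIFT);

        return pte_base_pfn;
}

int main(void)
{
        /* a 16MB (shift 24) mapping whose first small page is pfn 0x4000 */
        uint64_t base_pfn = 0x4000, addr = 0xc000000000a34567ULL;
        unsigned int shift = 24;

        /* offset within the 16MB mapping is 0xa34567, i.e. small page 0xa34 */
        assert(addr_to_pfn(base_pfn, addr, shift) == (0x4000 | 0xa34));
        printf("pfn = %#llx\n",
               (unsigned long long)addr_to_pfn(base_pfn, addr, shift));
        return 0;
}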



Re: [PATCH v3 1/3] powerpc: rewrite LOAD_REG_IMMEDIATE() as an intelligent macro

2019-08-19 Thread Segher Boessenkool
Hi Christophe,

On Mon, Aug 19, 2019 at 01:58:10PM +, Christophe Leroy wrote:
> +.macro __LOAD_REG_IMMEDIATE r, x
> + .if (\x) >= 0x80000000 || (\x) < -0x80000000
> + __LOAD_REG_IMMEDIATE_32 \r, (\x) >> 32
> + sldi\r, \r, 32
> + .if (\x) & 0xffff0000 != 0
> + oris \r, \r, (\x)@__AS_ATHIGH
> + .endif
> + .if (\x) & 0xffff != 0
> + oris \r, \r, (\x)@l
> + .endif
> + .else
> + __LOAD_REG_IMMEDIATE_32 \r, \x
> + .endif
> +.endm

How did you test this?  That last "oris" should be "ori"?

Rest looks good :-)


Segher


Re: [PATCH 3/3] powerpc: use __builtin_trap() in BUG/WARN macros.

2019-08-19 Thread Christophe Leroy




On 19/08/2019 at 15:23, Segher Boessenkool wrote:

On Mon, Aug 19, 2019 at 01:06:31PM +, Christophe Leroy wrote:

Note that we keep using an assembly text using "twi 31, 0, 0" for
unconditional traps because GCC drops all code after
__builtin_trap() when the condition is always true at build time.


As I said, it can also do this for conditional traps, if it can prove
the condition is always true.


But we have another branch for 'always true' and 'always false' using 
__builtin_constant_p(), which don't use __builtin_trap(). Is there 
anything wrong with that ?:


#define BUG_ON(x) do {  \
if (__builtin_constant_p(x)) {  \
if (x)  \
BUG();  \
} else {\
if (x)  \
__builtin_trap();   \
BUG_ENTRY("", 0); \
}   \
} while (0)

#define WARN_ON(x) ({   \
int __ret_warn_on = !!(x);  \
if (__builtin_constant_p(__ret_warn_on)) {  \
if (__ret_warn_on)  \
__WARN_TAINT(TAINT_WARN);   \
} else {\
if (__ret_warn_on)  \
__builtin_trap();   \
BUG_ENTRY("", BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN));   \
}   \
unlikely(__ret_warn_on);\
})




Can you put the bug table asm *before* the __builtin_trap maybe?  That
should make it all work fine...  If you somehow can tell what machine
instruction is that trap, anyway.


And how can I tell that ?

When I put it *after*, it always points to the trap instruction. When I 
put it *before* it usually points on the first instruction used to 
prepare the registers for the trap condition.


Christophe


Re: [PATCH v10 1/7] powerpc/mce: Schedule work from irq_work

2019-08-19 Thread Nicholas Piggin
Santosh Sivaraj's on August 15, 2019 10:39 am:
> schedule_work() cannot be called from MCE exception context as MCE can
> interrupt even in interrupt disabled context.

The powernv code doesn't do this in general, rather defers kernel
MCEs. My patch series converts the pseries machine check exception
code over to the same.

However there remain special cases where that's not true for
powernv, e.g., the machine check stack overflow or unrecoverable
MCE paths try to force it through so something gets printed. We
probably shouldn't even try to do memory failure in these cases.

Still, shouldn't hurt to make this change, and it fixes the existing
"different" pseries code.

Thanks,
Nick

> fixes: 733e4a4c ("powerpc/mce: hookup memory_failure for UE errors")
> Suggested-by: Mahesh Salgaonkar 
> Signed-off-by: Santosh Sivaraj 
> Reviewed-by: Mahesh Salgaonkar 
> Acked-by: Balbir Singh 
> Cc: sta...@vger.kernel.org # v4.15+

Reviewed-by: Nicholas Piggin 



[PATCH v3 3/3] powerpc/64: optimise LOAD_REG_IMMEDIATE_SYM()

2019-08-19 Thread Christophe Leroy
Optimise LOAD_REG_IMMEDIATE_SYM() using a temporary register to
parallelise operations.

It reduces the path from 5 to 3 instructions.

Suggested-by: Segher Boessenkool 
Signed-off-by: Christophe Leroy 

---
v3: new
---
 arch/powerpc/include/asm/ppc_asm.h   | 12 ++--
 arch/powerpc/kernel/exceptions-64e.S | 22 +-
 arch/powerpc/kernel/head_64.S|  2 +-
 3 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc_asm.h 
b/arch/powerpc/include/asm/ppc_asm.h
index aa8717c1571a..9d55bff9a73c 100644
--- a/arch/powerpc/include/asm/ppc_asm.h
+++ b/arch/powerpc/include/asm/ppc_asm.h
@@ -347,12 +347,12 @@ GLUE(.,name):
 
 #define LOAD_REG_IMMEDIATE(reg, expr) __LOAD_REG_IMMEDIATE reg, expr
 
-#define LOAD_REG_IMMEDIATE_SYM(reg,expr)   \
-   lis reg,(expr)@highest; \
-   ori reg,reg,(expr)@higher;  \
-   rldicr  reg,reg,32,31;  \
-   orisreg,reg,(expr)@__AS_ATHIGH; \
-   ori reg,reg,(expr)@l;
+#define LOAD_REG_IMMEDIATE_SYM(reg, tmp, expr) \
+   lis reg, (expr)@highest;\
+   lis tmp, (expr)@__AS_ATHIGH;\
+   ori reg, reg, (expr)@higher;\
+   ori tmp, reg, (expr)@l; \
+   rldimi  reg, tmp, 32, 0
 
 #define LOAD_REG_ADDR(reg,name)\
ld  reg,name@got(r2)
diff --git a/arch/powerpc/kernel/exceptions-64e.S 
b/arch/powerpc/kernel/exceptions-64e.S
index 898aae6da167..829950b96d29 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -750,12 +750,14 @@ END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
ld  r15,PACATOC(r13)
ld  r14,interrupt_base_book3e@got(r15)
ld  r15,__end_interrupts@got(r15)
-#else
-   LOAD_REG_IMMEDIATE_SYM(r14,interrupt_base_book3e)
-   LOAD_REG_IMMEDIATE_SYM(r15,__end_interrupts)
-#endif
cmpld   cr0,r10,r14
cmpld   cr1,r10,r15
+#else
+   LOAD_REG_IMMEDIATE_SYM(r14, r15, interrupt_base_book3e)
+   cmpld   cr0, r10, r14
+   LOAD_REG_IMMEDIATE_SYM(r14, r15, __end_interrupts)
+   cmpld   cr1, r10, r14
+#endif
blt+cr0,1f
bge+cr1,1f
 
@@ -820,12 +822,14 @@ kernel_dbg_exc:
ld  r15,PACATOC(r13)
ld  r14,interrupt_base_book3e@got(r15)
ld  r15,__end_interrupts@got(r15)
-#else
-   LOAD_REG_IMMEDIATE_SYM(r14,interrupt_base_book3e)
-   LOAD_REG_IMMEDIATE_SYM(r15,__end_interrupts)
-#endif
cmpld   cr0,r10,r14
cmpld   cr1,r10,r15
+#else
+   LOAD_REG_IMMEDIATE_SYM(r14, r15, interrupt_base_book3e)
+   cmpld   cr0, r10, r14
+   LOAD_REG_IMMEDIATE_SYM(r14, r15,__end_interrupts)
+   cmpld   cr1, r10, r14
+#endif
blt+cr0,1f
bge+cr1,1f
 
@@ -1449,7 +1453,7 @@ a2_tlbinit_code_start:
 a2_tlbinit_after_linear_map:
 
/* Now we branch the new virtual address mapped by this entry */
-   LOAD_REG_IMMEDIATE_SYM(r3,1f)
+   LOAD_REG_IMMEDIATE_SYM(r3, r5, 1f)
mtctr   r3
bctr
 
diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
index 1fd44761e997..0f2d61af47cc 100644
--- a/arch/powerpc/kernel/head_64.S
+++ b/arch/powerpc/kernel/head_64.S
@@ -635,7 +635,7 @@ __after_prom_start:
sub r5,r5,r11
 #else
/* just copy interrupts */
-   LOAD_REG_IMMEDIATE_SYM(r5, FIXED_SYMBOL_ABS_ADDR(__end_interrupts))
+   LOAD_REG_IMMEDIATE_SYM(r5, r11, FIXED_SYMBOL_ABS_ADDR(__end_interrupts))
 #endif
b   5f
 3:
-- 
2.13.3



[PATCH v3 2/3] powerpc/32: replace LOAD_MSR_KERNEL() by LOAD_REG_IMMEDIATE()

2019-08-19 Thread Christophe Leroy
LOAD_MSR_KERNEL() and LOAD_REG_IMMEDIATE() are doing the same thing
in the same way. Drop LOAD_MSR_KERNEL()

Signed-off-by: Christophe Leroy 

---
v2: no change
v3: no change
---
 arch/powerpc/kernel/entry_32.S | 18 +-
 arch/powerpc/kernel/head_32.h  | 21 -
 2 files changed, 13 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 54fab22c9a43..972b05504a0a 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -230,7 +230,7 @@ transfer_to_handler_cont:
 */
lis r12,reenable_mmu@h
ori r12,r12,reenable_mmu@l
-   LOAD_MSR_KERNEL(r0, MSR_KERNEL)
+   LOAD_REG_IMMEDIATE(r0, MSR_KERNEL)
mtspr   SPRN_SRR0,r12
mtspr   SPRN_SRR1,r0
SYNC
@@ -304,7 +304,7 @@ stack_ovf:
addir1,r1,THREAD_SIZE-STACK_FRAME_OVERHEAD
lis r9,StackOverflow@ha
addir9,r9,StackOverflow@l
-   LOAD_MSR_KERNEL(r10,MSR_KERNEL)
+   LOAD_REG_IMMEDIATE(r10,MSR_KERNEL)
 #if defined(CONFIG_PPC_8xx) && defined(CONFIG_PERF_EVENTS)
mtspr   SPRN_NRI, r0
 #endif
@@ -324,7 +324,7 @@ trace_syscall_entry_irq_off:
bl  trace_hardirqs_on
 
/* Now enable for real */
-   LOAD_MSR_KERNEL(r10, MSR_KERNEL | MSR_EE)
+   LOAD_REG_IMMEDIATE(r10, MSR_KERNEL | MSR_EE)
mtmsr   r10
 
REST_GPR(0, r1)
@@ -394,7 +394,7 @@ ret_from_syscall:
 #endif
mr  r6,r3
/* disable interrupts so current_thread_info()->flags can't change */
-   LOAD_MSR_KERNEL(r10,MSR_KERNEL) /* doesn't include MSR_EE */
+   LOAD_REG_IMMEDIATE(r10,MSR_KERNEL)  /* doesn't include MSR_EE */
/* Note: We don't bother telling lockdep about it */
SYNC
MTMSRD(r10)
@@ -824,7 +824,7 @@ ret_from_except:
 * can't change between when we test it and when we return
 * from the interrupt. */
/* Note: We don't bother telling lockdep about it */
-   LOAD_MSR_KERNEL(r10,MSR_KERNEL)
+   LOAD_REG_IMMEDIATE(r10,MSR_KERNEL)
SYNC/* Some chip revs have problems here... */
MTMSRD(r10) /* disable interrupts */
 
@@ -991,7 +991,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_NEED_PAIRED_STWCX)
 * can restart the exception exit path at the label
 * exc_exit_restart below.  -- paulus
 */
-   LOAD_MSR_KERNEL(r10,MSR_KERNEL & ~MSR_RI)
+   LOAD_REG_IMMEDIATE(r10,MSR_KERNEL & ~MSR_RI)
SYNC
MTMSRD(r10) /* clear the RI bit */
.globl exc_exit_restart
@@ -1066,7 +1066,7 @@ exc_exit_restart_end:
REST_NVGPRS(r1);\
lwz r3,_MSR(r1);\
andi.   r3,r3,MSR_PR;   \
-   LOAD_MSR_KERNEL(r10,MSR_KERNEL);\
+   LOAD_REG_IMMEDIATE(r10,MSR_KERNEL); \
bne user_exc_return;\
lwz r0,GPR0(r1);\
lwz r2,GPR2(r1);\
@@ -1236,7 +1236,7 @@ recheck:
 * neither. Those disable/enable cycles used to peek at
 * TI_FLAGS aren't advertised.
 */
-   LOAD_MSR_KERNEL(r10,MSR_KERNEL)
+   LOAD_REG_IMMEDIATE(r10,MSR_KERNEL)
SYNC
MTMSRD(r10) /* disable interrupts */
lwz r9,TI_FLAGS(r2)
@@ -1329,7 +1329,7 @@ _GLOBAL(enter_rtas)
lwz r4,RTASBASE(r4)
mfmsr   r9
stw r9,8(r1)
-   LOAD_MSR_KERNEL(r0,MSR_KERNEL)
+   LOAD_REG_IMMEDIATE(r0,MSR_KERNEL)
SYNC/* disable interrupts so SRR0/1 */
MTMSRD(r0)  /* don't get trashed */
li  r9,MSR_KERNEL & ~(MSR_IR|MSR_DR)
diff --git a/arch/powerpc/kernel/head_32.h b/arch/powerpc/kernel/head_32.h
index 4a692553651f..8abc7783dbe5 100644
--- a/arch/powerpc/kernel/head_32.h
+++ b/arch/powerpc/kernel/head_32.h
@@ -5,19 +5,6 @@
 #include /* for STACK_FRAME_REGS_MARKER */
 
 /*
- * MSR_KERNEL is > 0x8000 on 4xx/Book-E since it include MSR_CE.
- */
-.macro __LOAD_MSR_KERNEL r, x
-.if \x >= 0x8000
-   lis \r, (\x)@h
-   ori \r, \r, (\x)@l
-.else
-   li \r, (\x)
-.endif
-.endm
-#define LOAD_MSR_KERNEL(r, x) __LOAD_MSR_KERNEL r, x
-
-/*
  * Exception entry code.  This code runs with address translation
  * turned off, i.e. using physical addresses.
  * We assume sprg3 has the physical address of the current
@@ -92,7 +79,7 @@
 #ifdef CONFIG_40x
rlwinm  r9,r9,0,14,12   /* clear MSR_WE (necessary?) */
 #else
-   LOAD_MSR_KERNEL(r10, MSR_KERNEL & ~(MSR_IR|MSR_DR)) /* can take 
exceptions */
+   LOAD_REG_IMMEDIATE(r10, MSR_KERNEL & ~(MSR_IR|MSR_DR)) /* can take 
exceptions */
MTMSRD(r10)

[PATCH v3 1/3] powerpc: rewrite LOAD_REG_IMMEDIATE() as an intelligent macro

2019-08-19 Thread Christophe Leroy
Today LOAD_REG_IMMEDIATE() is a basic #define which loads all
parts of a value into a register, including the parts that are NUL.

This means always 2 instructions on PPC32 and always 5 instructions
on PPC64. And those instructions cannot run in parallele as they are
updating the same register.

Ex: LOAD_REG_IMMEDIATE(r1,THREAD_SIZE) in head_64.S results in:

3c 20 00 00 lis r1,0
60 21 00 00 ori r1,r1,0
78 21 07 c6 rldicr  r1,r1,32,31
64 21 00 00 oris    r1,r1,0
60 21 40 00 ori r1,r1,16384

Rewrite LOAD_REG_IMMEDIATE() with a GAS macro in order to skip
the parts that are NUL.

Rename the existing LOAD_REG_IMMEDIATE() as LOAD_REG_IMMEDIATE_SYM()
and use it for loading the value of symbols which are not known
at compile time.

Now LOAD_REG_IMMEDIATE(r1,THREAD_SIZE) in head_64.S results in:

38 20 40 00 li  r1,16384
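
For a constant with both 32-bit halves non-zero, for example
LOAD_REG_IMMEDIATE(r3, 0x123450006000), the new macro emits only the
needed instructions (worked example added for illustration, derived from
the macro in the diff below; it is not part of the original message):

	li	r3, 0x1234	/* high 32 bits fit a signed 16-bit li */
	sldi	r3, r3, 32
	oris	r3, r3, 0x5000	/* upper halfword of the low 32 bits */
	ori	r3, r3, 0x6000	/* lower halfword of the low 32 bits */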

Signed-off-by: Christophe Leroy 

---
v2: Fixed the test from (\x) & 0x to (\x) >= 0x8000 || (\x) < 
-0x8000 in __LOAD_REG_IMMEDIATE()
v3: Replaced rldicr by sldi as suggested by Segher for readability
---
 arch/powerpc/include/asm/ppc_asm.h   | 42 +++-
 arch/powerpc/kernel/exceptions-64e.S | 10 -
 arch/powerpc/kernel/head_64.S|  2 +-
 3 files changed, 43 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc_asm.h 
b/arch/powerpc/include/asm/ppc_asm.h
index e0637730a8e7..aa8717c1571a 100644
--- a/arch/powerpc/include/asm/ppc_asm.h
+++ b/arch/powerpc/include/asm/ppc_asm.h
@@ -311,13 +311,43 @@ GLUE(.,name):
addis   reg,reg,(name - 0b)@ha; \
addireg,reg,(name - 0b)@l;
 
-#ifdef __powerpc64__
-#ifdef HAVE_AS_ATHIGH
+#if defined(__powerpc64__) && defined(HAVE_AS_ATHIGH)
 #define __AS_ATHIGH high
 #else
 #define __AS_ATHIGH h
 #endif
-#define LOAD_REG_IMMEDIATE(reg,expr)   \
+
+.macro __LOAD_REG_IMMEDIATE_32 r, x
+   .if (\x) >= 0x8000 || (\x) < -0x8000
+   lis \r, (\x)@__AS_ATHIGH
+   .if (\x) & 0xffff != 0
+   ori \r, \r, (\x)@l
+   .endif
+   .else
+   li \r, (\x)@l
+   .endif
+.endm
+
+.macro __LOAD_REG_IMMEDIATE r, x
+   .if (\x) >= 0x80000000 || (\x) < -0x80000000
+   __LOAD_REG_IMMEDIATE_32 \r, (\x) >> 32
+   sldi \r, \r, 32
+   .if (\x) & 0xffff0000 != 0
+   oris \r, \r, (\x)@__AS_ATHIGH
+   .endif
+   .if (\x) & 0xffff != 0
+   ori \r, \r, (\x)@l
+   .endif
+   .else
+   __LOAD_REG_IMMEDIATE_32 \r, \x
+   .endif
+.endm
+
+#ifdef __powerpc64__
+
+#define LOAD_REG_IMMEDIATE(reg, expr) __LOAD_REG_IMMEDIATE reg, expr
+
+#define LOAD_REG_IMMEDIATE_SYM(reg,expr)   \
lis reg,(expr)@highest; \
ori reg,reg,(expr)@higher;  \
rldicr  reg,reg,32,31;  \
@@ -335,11 +365,13 @@ GLUE(.,name):
 
 #else /* 32-bit */
 
-#define LOAD_REG_IMMEDIATE(reg,expr)   \
+#define LOAD_REG_IMMEDIATE(reg, expr) __LOAD_REG_IMMEDIATE_32 reg, expr
+
+#define LOAD_REG_IMMEDIATE_SYM(reg,expr)   \
lis reg,(expr)@ha;  \
addireg,reg,(expr)@l;
 
-#define LOAD_REG_ADDR(reg,name)LOAD_REG_IMMEDIATE(reg, name)
+#define LOAD_REG_ADDR(reg,name)LOAD_REG_IMMEDIATE_SYM(reg, name)
 
 #define LOAD_REG_ADDRBASE(reg, name)   lis reg,name@ha
 #define ADDROFF(name)  name@l
diff --git a/arch/powerpc/kernel/exceptions-64e.S 
b/arch/powerpc/kernel/exceptions-64e.S
index 1cfb3da4a84a..898aae6da167 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -751,8 +751,8 @@ END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
ld  r14,interrupt_base_book3e@got(r15)
ld  r15,__end_interrupts@got(r15)
 #else
-   LOAD_REG_IMMEDIATE(r14,interrupt_base_book3e)
-   LOAD_REG_IMMEDIATE(r15,__end_interrupts)
+   LOAD_REG_IMMEDIATE_SYM(r14,interrupt_base_book3e)
+   LOAD_REG_IMMEDIATE_SYM(r15,__end_interrupts)
 #endif
cmpld   cr0,r10,r14
cmpld   cr1,r10,r15
@@ -821,8 +821,8 @@ kernel_dbg_exc:
ld  r14,interrupt_base_book3e@got(r15)
ld  r15,__end_interrupts@got(r15)
 #else
-   LOAD_REG_IMMEDIATE(r14,interrupt_base_book3e)
-   LOAD_REG_IMMEDIATE(r15,__end_interrupts)
+   LOAD_REG_IMMEDIATE_SYM(r14,interrupt_base_book3e)
+   LOAD_REG_IMMEDIATE_SYM(r15,__end_interrupts)
 #endif
cmpld   cr0,r10,r14
cmpld   cr1,r10,r15
@@ -1449,7 +1449,7 @@ a2_tlbinit_code_start:
 a2_tlbinit_after_linear_map:
 
/* Now we branch the new virtual address mapped by this entry */
-   LOAD_REG_IMMEDIATE(r3,1f)
+   LOAD_REG_IMMEDIATE_SYM(r3,1f)
mtctr   r3
bctr
 
diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
index 91d297e696dd..1fd44761e997 100644
--- a/arch/powerpc/kernel/head_

Re: [PATCH v4 1/2] powerpc/time: Only set CONFIG_ARCH_HAS_SCALED_CPUTIME on PPC64

2019-08-19 Thread Nicholas Piggin
Christophe Leroy's on August 14, 2019 4:31 pm:
> Hi Nick,
> 
> 
> Le 07/06/2018 à 03:43, Nicholas Piggin a écrit :
>> On Wed,  6 Jun 2018 14:21:08 + (UTC)
>> Christophe Leroy  wrote:
>> 
>>> scaled cputime is only meaningfull when the processor has
>>> SPURR and/or PURR, which means only on PPC64.
>>>
> 
> [...]
> 
>> 
>> I wonder if we could make this depend on PPC_PSERIES or even
>> PPC_SPLPAR as well? (That would be for a later patch)
> 
> Can we go further on this ?
> 
> Do we know exactly which configuration support scaled cputime, in 
> extenso have SPRN_SPURR and/or SPRN_PURR ?
> 
> Ref https://github.com/linuxppc/issues/issues/171

Unfortunately I don't know enough about the timing stuff and who
uses it. SPURR is available on all configurations (guest, bare metal),
so it could account scaled time there too. I guess better just leave
it for now.

Thanks,
Nick


Re: [PATCH v1 05/10] powerpc/mm: Do early ioremaps from top to bottom on PPC64 too.

2019-08-19 Thread Nicholas Piggin
Christophe Leroy's on August 14, 2019 6:11 am:
> Until vmalloc system is up and running, ioremap basically
> allocates addresses at the border of the IOREMAP area.
> 
> On PPC32, addresses are allocated down from the top of the area
> while on PPC64, addresses are allocated up from the base of the
> area.
 
This series looks pretty good to me, but I'm not sure about this patch.

It seems like quite a small divergence in terms of code, and it looks
like the final result still has some ifdefs in these functions. Maybe
you could just keep existing behaviour for this cleanup series so it
does not risk triggering some obscure regression? Merging behaviour
could be proposed at the end.

Thanks,
Nick



Re: [PATCH 3/3] powerpc: use __builtin_trap() in BUG/WARN macros.

2019-08-19 Thread Segher Boessenkool
On Mon, Aug 19, 2019 at 01:06:31PM +, Christophe Leroy wrote:
> Note that we keep using inline assembly with "twi 31, 0, 0" for
> unconditional traps, because GCC drops all code after
> __builtin_trap() when the condition is always true at build time.

As I said, it can also do this for conditional traps, if it can prove
the condition is always true.

Can you put the bug table asm *before* the __builtin_trap maybe?  That
should make it all work fine...  If you somehow can tell what machine
instruction is that trap, anyway.


Segher


Re: [PATCH v2 09/44] powerpc/64s/pseries: machine check convert to use common event code

2019-08-19 Thread Nicholas Piggin
Michael Ellerman's on August 17, 2019 8:25 am:
> kbuild test robot  writes:
>> Hi Nicholas,
>>
>> I love your patch! Yet something to improve:
>>
>> [auto build test ERROR on linus/master]
>> [cannot apply to v5.3-rc3 next-20190807]
>> [if your patch is applied to the wrong git tree, please drop us a note to 
>> help improve the system]
>>
>> url:
>> https://github.com/0day-ci/linux/commits/Nicholas-Piggin/powerpc-64s-exception-cleanup-and-macrofiy/20190802-11
>> config: powerpc-defconfig (attached as .config)
>> compiler: powerpc64-linux-gcc (GCC) 7.4.0
>> reproduce:
>> wget 
>> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
>> ~/bin/make.cross
>> chmod +x ~/bin/make.cross
>> # save the attached .config to linux build tree
>> GCC_VERSION=7.4.0 make.cross ARCH=powerpc 
>>
>> If you fix the issue, kindly add following tag
>> Reported-by: kbuild test robot 
>>
>> All errors (new ones prefixed by >>):
>>
>>arch/powerpc/platforms/pseries/ras.c: In function 'mce_handle_error':
 arch/powerpc/platforms/pseries/ras.c:563:28: error: this statement may 
 fall through [-Werror=implicit-fallthrough=]
>>mce_err.u.ue_error_type = MCE_UE_ERROR_IFETCH;
>>^
>>arch/powerpc/platforms/pseries/ras.c:564:3: note: here
>>   case MC_ERROR_UE_PAGE_TABLE_WALK_IFETCH:
>>   ^~~~
>>arch/powerpc/platforms/pseries/ras.c:565:28: error: this statement may 
>> fall through [-Werror=implicit-fallthrough=]
>>mce_err.u.ue_error_type = MCE_UE_ERROR_PAGE_TABLE_WALK_IFETCH;
>>^
>>arch/powerpc/platforms/pseries/ras.c:566:3: note: here
>>   case MC_ERROR_UE_LOAD_STORE:
>>   ^~~~
>>arch/powerpc/platforms/pseries/ras.c:567:28: error: this statement may 
>> fall through [-Werror=implicit-fallthrough=]
>>mce_err.u.ue_error_type = MCE_UE_ERROR_LOAD_STORE;
>>^
>>arch/powerpc/platforms/pseries/ras.c:568:3: note: here
>>   case MC_ERROR_UE_PAGE_TABLE_WALK_LOAD_STORE:
>>   ^~~~
>>arch/powerpc/platforms/pseries/ras.c:569:28: error: this statement may 
>> fall through [-Werror=implicit-fallthrough=]
>>mce_err.u.ue_error_type = MCE_UE_ERROR_PAGE_TABLE_WALK_LOAD_STORE;
>>^
>>arch/powerpc/platforms/pseries/ras.c:570:3: note: here
>>   case MC_ERROR_UE_INDETERMINATE:
>>   ^~~~
>>cc1: all warnings being treated as errors
> 
> I think you meant to break in all these cases?

Yes I did. I might have had a couple of other minor fixes in the
series and have since retested guest mce injection so I'd perhaps
better resend the series.

Thanks,
Nick


Re: [PATCH 1/2] powerpc/64s: remplement power4_idle code in C

2019-08-19 Thread Nicholas Piggin
Michael Ellerman's on August 18, 2019 1:49 pm:
> Nicholas Piggin  writes:
>> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
>> b/arch/powerpc/kernel/exceptions-64s.S
>> index eee5bef736c8..64d5ffbb07d1 100644
>> --- a/arch/powerpc/kernel/exceptions-64s.S
>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>> @@ -2286,15 +2286,6 @@ USE_FIXED_SECTION(virt_trampolines)
>>  __end_interrupts:
>>  DEFINE_FIXED_SYMBOL(__end_interrupts)
>>  
>> -#ifdef CONFIG_PPC_970_NAP
>> -EXC_COMMON_BEGIN(power4_fixup_nap)
>> -andcr9,r9,r10
>> -std r9,TI_LOCAL_FLAGS(r11)
>> -ld  r10,_LINK(r1)   /* make idle task do the */
>> -std r10,_NIP(r1)/* equivalent of a blr */
>> -blr
>> -#endif
> 
> This breaks ppc64_defconfig build with:
> 
> ERROR: start_text address is c0008100, should be c0008000
> 
> Due to:
> 
> c0008000 <001a.long_branch.power4_fixup_nap>:
> c0008000:   48 03 5a b4 b   c003dab4 
> 
> 
> 
> Moving power4_fixup_nap back into exceptions-64s.S seems to fix it.

Okay that should be fine if you can update it.

Thanks,
Nick


Re: [PATCH 2/3] powerpc/64s/radix: all CPUs should flush local translation structure before turning MMU on

2019-08-19 Thread Nicholas Piggin
Michael Ellerman's on August 19, 2019 12:00 pm:
> Nicholas Piggin  writes:
>> Rather than sprinkle various translation structure invalidations
>> around different places in early boot, have each CPU flush everything
>> from its local translation structures before enabling its MMU.
>>
>> Radix guests can execute tlbie(l), so have them tlbiel_all in the same
>> place as radix host does.
>>
>> Signed-off-by: Nicholas Piggin 
>> ---
>>  arch/powerpc/mm/book3s64/radix_pgtable.c | 11 ++-
>>  1 file changed, 2 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
>> b/arch/powerpc/mm/book3s64/radix_pgtable.c
>> index d60cfa05447a..839e01795211 100644
>> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
>> @@ -382,11 +382,6 @@ static void __init radix_init_pgtable(void)
>>   */
>>  register_process_table(__pa(process_tb), 0, PRTB_SIZE_SHIFT - 12);
>>  pr_info("Process table %p and radix root for kernel: %p\n", process_tb, 
>> init_mm.pgd);
>> -asm volatile("ptesync" : : : "memory");
>> -asm volatile(PPC_TLBIE_5(%0,%1,2,1,1) : :
>> - "r" (TLBIEL_INVAL_SET_LPID), "r" (0));
>> -asm volatile("eieio; tlbsync; ptesync" : : : "memory");
>> -trace_tlbie(0, 0, TLBIEL_INVAL_SET_LPID, 0, 2, 1, 1);
>>  
>>  /*
>>   * The init_mm context is given the first available (non-zero) PID,
>> @@ -633,8 +628,7 @@ void __init radix__early_init_mmu(void)
>>  radix_init_pgtable();
>>  /* Switch to the guard PID before turning on MMU */
>>  radix__switch_mmu_context(NULL, &init_mm);
>> -if (cpu_has_feature(CPU_FTR_HVMODE))
>> -tlbiel_all();
>> +tlbiel_all();
>>  }
> 
> This is oopsing for me in a guest on Power9:
> 
>   [0.00] radix-mmu: Page sizes from device-tree:
>   [0.00] radix-mmu: Page size shift = 12 AP=0x0
>   [0.00] radix-mmu: Page size shift = 16 AP=0x5
>   [0.00] radix-mmu: Page size shift = 21 AP=0x1
>   [0.00] radix-mmu: Page size shift = 30 AP=0x2
>   [0.00]  -> fw_vec5_feature_init()
>   [0.00]  <- fw_vec5_feature_init()
>   [0.00]  -> fw_hypertas_feature_init()
>   [0.00]  <- fw_hypertas_feature_init()
>   [0.00] radix-mmu: Activating Kernel Userspace Execution Prevention
>   [0.00] radix-mmu: Activating Kernel Userspace Access Prevention
>   [0.00] lpar: Using radix MMU under hypervisor
>   [0.00] radix-mmu: Mapped 0x-0x4000 with 
> 1.00 GiB pages (exec)
>   [0.00] radix-mmu: Mapped 0x4000-0x0001 with 
> 1.00 GiB pages
>   [0.00] radix-mmu: Process table (ptrval) and radix root for 
> kernel: (ptrval)
>   [0.00] Oops: Exception in kernel mode, sig: 4 [#1]
>   [0.00] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA 
>   [0.00] Modules linked in:
>   [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 
> 5.3.0-rc2-gcc-8.2.0-00063-gef906dcf7b75 #633
>   [0.00] NIP:  c00838f8 LR: c1066864 CTR: 
> c00838c0
>   [0.00] REGS: c1647c40 TRAP: 0700   Not tainted  
> (5.3.0-rc2-gcc-8.2.0-00063-gef906dcf7b75)
>   [0.00] MSR:  80043003   CR: 48000222  XER: 
> 2004
>   [0.00] CFAR: c00839b4 IRQMASK: 1 
>   [0.00] GPR00: c1066864 c1647ed0 c1649700 
>  
>   [0.00] GPR04: c1608830  0010 
> 2000 
>   [0.00] GPR08: 0c00  0002 
> 726f6620746f6f72 
>   [0.00] GPR12: c00838c0 c193 0dc5bef0 
> 01309e10 
>   [0.00] GPR16: 01309c90 fffd 0dc5bef0 
> 01339800 
>   [0.00] GPR20: 0014 01ac 0dc5bf38 
> 0daf 
>   [0.00] GPR24: 01f4000c c000 0040 
> c1802858 
>   [0.00] GPR28: c007 c1803954 c1681cb0 
> c1608830 
>   [0.00] NIP [c00838f8] radix__tlbiel_all+0x48/0x110
>   [0.00] LR [c1066864] radix__early_init_mmu+0x494/0x4c8
>   [0.00] Call Trace:
>   [0.00] [c1647ed0] [c1066820] 
> radix__early_init_mmu+0x450/0x4c8 (unreliable)
>   [0.00] [c1647f60] [c105c628] early_setup+0x160/0x198
>   [0.00] [c1647f90] [b460] 0xb460
>   [0.00] Instruction dump:
>   [0.00] 2b830001 3902 409e00e8 3d220003 3929c318 e929 
> e9290010 75290002 
>   [0.00] 41820088 7c4004ac 3920 79085564 <7d294224> 3940007f 
> 39201000 38e0 
>   [0.00] random: get_random_bytes called from 
> print_oops_end_marker+0x40/0x80 with crng_init=0
>   [0.00] ---[ end trace 00

[PATCH 3/3] powerpc: use __builtin_trap() in BUG/WARN macros.

2019-08-19 Thread Christophe Leroy
The below examples of use of WARN_ON() show that the result
is sub-optimal with regard to the capabilities of powerpc.

void test_warn1(unsigned long long a)
{
WARN_ON(a);
}

void test_warn2(unsigned long a)
{
WARN_ON(a);
}

void test_warn3(unsigned long a, unsigned long b)
{
WARN_ON(a < b);
}

void test_warn4(unsigned long a, unsigned long b)
{
WARN_ON(!a);
}

void test_warn5(unsigned long a, unsigned long b)
{
WARN_ON(!a && b);
}

00000000 <test_warn1>:
   0:   7c 64 23 78 or  r4,r3,r4
   4:   31 24 ff ff addic   r9,r4,-1
   8:   7c 89 21 10 subfe   r4,r9,r4
   c:   0f 04 00 00 twnei   r4,0
  10:   4e 80 00 20 blr

00000014 <test_warn2>:
  14:   31 23 ff ff addic   r9,r3,-1
  18:   7c 69 19 10 subfe   r3,r9,r3
  1c:   0f 03 00 00 twnei   r3,0
  20:   4e 80 00 20 blr

00000024 <test_warn3>:
  24:   7c 84 18 10 subfc   r4,r4,r3
  28:   7d 29 49 10 subfe   r9,r9,r9
  2c:   7d 29 00 d0 neg r9,r9
  30:   0f 09 00 00 twnei   r9,0
  34:   4e 80 00 20 blr

00000038 <test_warn4>:
  38:   7c 63 00 34 cntlzw  r3,r3
  3c:   54 63 d9 7e rlwinm  r3,r3,27,5,31
  40:   0f 03 00 00 twnei   r3,0
  44:   4e 80 00 20 blr

00000048 <test_warn5>:
  48:   2f 83 00 00 cmpwi   cr7,r3,0
  4c:   39 20 00 00 li  r9,0
  50:   41 9e 00 0c beq cr7,5c <test_warn5+0x14>
  54:   7c 84 00 34 cntlzw  r4,r4
  58:   54 89 d9 7e rlwinm  r9,r4,27,5,31
  5c:   0f 09 00 00 twnei   r9,0
  60:   4e 80 00 20 blr

RELOCATION RECORDS FOR [__bug_table]:
OFFSET   TYPE  VALUE
 R_PPC_ADDR32  .text+0x000c
000c R_PPC_ADDR32  .text+0x001c
0018 R_PPC_ADDR32  .text+0x0030
0018 R_PPC_ADDR32  .text+0x0030
0024 R_PPC_ADDR32  .text+0x0040
0030 R_PPC_ADDR32  .text+0x005c

Using __builtin_trap() instead of inline assembly of twnei/tdnei
provides a far better result:

00000000 <test_warn1>:
   0:   7c 64 23 78 or  r4,r3,r4
   4:   0f 04 00 00 twnei   r4,0
   8:   4e 80 00 20 blr

0000000c <test_warn2>:
   c:   0f 03 00 00 twnei   r3,0
  10:   4e 80 00 20 blr

00000014 <test_warn3>:
  14:   7c 43 20 08 twllt   r3,r4
  18:   4e 80 00 20 blr

0000001c <test_warn4>:
  1c:   0c 83 00 00 tweqi   r3,0
  20:   4e 80 00 20 blr

00000024 <test_warn5>:
  24:   2f 83 00 00 cmpwi   cr7,r3,0
  28:   41 9e 00 08 beq cr7,30 <test_warn5+0xc>
  2c:   0c 84 00 00 tweqi   r4,0
  30:   4e 80 00 20 blr

RELOCATION RECORDS FOR [__bug_table]:
OFFSET   TYPE  VALUE
 R_PPC_ADDR32  .text+0x0004
000c R_PPC_ADDR32  .text+0x000c
0018 R_PPC_ADDR32  .text+0x0014
0024 R_PPC_ADDR32  .text+0x001c
0030 R_PPC_ADDR32  .text+0x002c

Note that we keep using inline assembly with "twi 31, 0, 0" for
unconditional traps, because GCC drops all code after
__builtin_trap() when the condition is always true at build time.

In addition, this patch also fixes bugs in the BUG_ON(x) macro,
which unlike WARN_ON(x) uses (x) directly as the condition after
forcing it to long instead of using !!(x). This leads to the upper
part of an unsigned long long being ignored on PPC32, and may
produce bugs on PPC64 if (x) is a smaller type like an int.
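
For illustration (hypothetical example, not from the original mail;
get_status() stands in for any source of a runtime u64):

	unsigned long long status = get_status();	/* e.g. 0x100000000ULL */

	BUG_ON(status);		/* old macro: (long)status == 0 on PPC32, no trap */
	BUG_ON(!!status);	/* !!status == 1, traps as expected */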

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/bug.h | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index dbf7da90f507..a229130ffcf9 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -44,14 +44,14 @@
 #ifdef CONFIG_DEBUG_BUGVERBOSE
 #define _EMIT_BUG_ENTRY\
".section __bug_table,\"aw\"\n" \
-   "2:\t" PPC_LONG "1b, %0\n"  \
+   "2:\t" PPC_LONG "1b - 4, %0\n"  \
"\t.short %1, %2\n" \
".org 2b+%3\n"  \
".previous\n"
 #else
 #define _EMIT_BUG_ENTRY\
".section __bug_table,\"aw\"\n" \
-   "2:\t" PPC_LONG "1b\n"  \
+   "2:\t" PPC_LONG "1b - 4\n"  \
"\t.short %2\n" \
".org 2b+%3\n"  \
".previous\n"
@@ -59,7 +59,8 @@
 
 #define BUG_ENTRY(insn, flags, ...)\
__asm__ __volatile__(   \
-   "1: " insn "\n" \
+   insn "\n"   \
+   "1:\n"  \
_EMIT_BUG_ENTRY \
: : "i" (__FILE__), "i" (__LINE__), \
  "i" (flags),  \
@@ -82,7 +83,9 @@
if (x)  \
BUG();  \
} else {\
-   BUG_ENTRY(PPC_TLNEI " %4, 0", 0

[PATCH 1/3] powerpc: don't use __WARN() for WARN_ON()

2019-08-19 Thread Christophe Leroy
__WARN() used to just call __WARN_TAINT(TAINT_WARN)

But a call to printk() has been added in the commit identified below
to print a " cut here " line.

This change only applies to warnings using __WARN(), which means
WARN_ON() where the condition is constant at compile time.
For WARN_ON() with a non constant condition, the additional line is
not printed.

In addition, adding a call to printk() forces GCC to add a stack frame
and save volatile registers. Powerpc has been using traps to implement
warnings in order to avoid that.

So, call __WARN_TAINT(TAINT_WARN) directly instead of using __WARN()
in order to restore the previous behaviour.

If one day powerpc wants the decorative " cut here " line, it
has to be done in the trap handler, not in the WARN_ON() macro.

Fixes: 6b15f678fb7d ("include/asm-generic/bug.h: fix "cut here" for WARN_ON for 
__WARN_TAINT architectures")
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/bug.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index fed7e6241349..3928fdaebb71 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -99,7 +99,7 @@
int __ret_warn_on = !!(x);  \
if (__builtin_constant_p(__ret_warn_on)) {  \
if (__ret_warn_on)  \
-   __WARN();   \
+   __WARN_TAINT(TAINT_WARN);   \
} else {\
__asm__ __volatile__(   \
"1: "PPC_TLNEI" %4,0\n" \
-- 
2.13.3



[PATCH 2/3] powerpc: refactoring BUG/WARN macros

2019-08-19 Thread Christophe Leroy
BUG(), WARN() and friends use similar inline
assembly to implement various traps with various flags.

Let's refactor them via a new BUG_ENTRY() macro.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/bug.h | 41 +++--
 1 file changed, 15 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index 3928fdaebb71..dbf7da90f507 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -57,6 +57,15 @@
".previous\n"
 #endif
 
+#define BUG_ENTRY(insn, flags, ...)\
+   __asm__ __volatile__(   \
+   "1: " insn "\n" \
+   _EMIT_BUG_ENTRY \
+   : : "i" (__FILE__), "i" (__LINE__), \
+ "i" (flags),  \
+ "i" (sizeof(struct bug_entry)),   \
+ ##__VA_ARGS__)
+
 /*
  * BUG_ON() and WARN_ON() do their best to cooperate with compile-time
  * optimisations. However depending on the complexity of the condition
@@ -64,11 +73,7 @@
  */
 
 #define BUG() do { \
-   __asm__ __volatile__(   \
-   "1: twi 31,0,0\n"   \
-   _EMIT_BUG_ENTRY \
-   : : "i" (__FILE__), "i" (__LINE__), \
-   "i" (0), "i"  (sizeof(struct bug_entry)));  \
+   BUG_ENTRY("twi 31, 0, 0", 0);   \
unreachable();  \
 } while (0)
 
@@ -77,23 +82,11 @@
if (x)  \
BUG();  \
} else {\
-   __asm__ __volatile__(   \
-   "1: "PPC_TLNEI" %4,0\n" \
-   _EMIT_BUG_ENTRY \
-   : : "i" (__FILE__), "i" (__LINE__), "i" (0),\
- "i" (sizeof(struct bug_entry)),   \
- "r" ((__force long)(x))); \
+   BUG_ENTRY(PPC_TLNEI " %4, 0", 0, "r" ((__force long)(x))); \
}   \
 } while (0)
 
-#define __WARN_FLAGS(flags) do {   \
-   __asm__ __volatile__(   \
-   "1: twi 31,0,0\n"   \
-   _EMIT_BUG_ENTRY \
-   : : "i" (__FILE__), "i" (__LINE__), \
- "i" (BUGFLAG_WARNING|(flags)),\
- "i" (sizeof(struct bug_entry)));  \
-} while (0)
+#define __WARN_FLAGS(flags) BUG_ENTRY("twi 31, 0, 0", BUGFLAG_WARNING | (flags))
 
 #define WARN_ON(x) ({  \
int __ret_warn_on = !!(x);  \
@@ -101,13 +94,9 @@
if (__ret_warn_on)  \
__WARN_TAINT(TAINT_WARN);   \
} else {\
-   __asm__ __volatile__(   \
-   "1: "PPC_TLNEI" %4,0\n" \
-   _EMIT_BUG_ENTRY \
-   : : "i" (__FILE__), "i" (__LINE__), \
- "i" (BUGFLAG_WARNING|BUGFLAG_TAINT(TAINT_WARN)),\
- "i" (sizeof(struct bug_entry)),   \
- "r" (__ret_warn_on)); \
+   BUG_ENTRY(PPC_TLNEI " %4, 0",   \
+ BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN),  \
+ "r" (__ret_warn_on)); \
}   \
unlikely(__ret_warn_on);\
 })
-- 
2.13.3



Re: [PATCH v1 07/10] powerpc/mm: move iounmap() into ioremap.c and drop __iounmap()

2019-08-19 Thread Michael Ellerman
Christophe Leroy  writes:
> diff --git a/arch/powerpc/mm/ioremap.c b/arch/powerpc/mm/ioremap.c
> index 0c23660522ca..57d742509cec 100644
> --- a/arch/powerpc/mm/ioremap.c
> +++ b/arch/powerpc/mm/ioremap.c
> @@ -72,3 +75,31 @@ void __iomem *ioremap_prot(phys_addr_t addr, unsigned long 
> size, unsigned long f
>   return __ioremap_caller(addr, size, pte_pgprot(pte), caller);
>  }
>  EXPORT_SYMBOL(ioremap_prot);
> +
> +/*
> + * Unmap an IO region and remove it from vmalloc'd list.
> + * Access to IO memory should be serialized by driver.
> + */
> +void iounmap(volatile void __iomem *token)
> +{
> + void *addr;
> +
> + /*
> +  * If mapped by BATs then there is nothing to do.
> +  */
> + if (v_block_mapped((unsigned long)token))
> + return;
> +
> + if (!slab_is_available())
> + return;
> +
> + addr = (void *)((unsigned long __force)PCI_FIX_ADDR(token) & PAGE_MASK);
> + if (WARN_ON((unsigned long)addr < IOREMAP_BASE))
> + return;

This pops a bunch, as we seem to have various places that want to call
iounmap(NULL) in error paths, much like kfree().

One example:

[   85.062269] WARNING: CPU: 6 PID: 3643 at arch/powerpc/mm/ioremap.c:97 
.iounmap+0x58/0xb0
[   85.062276] Modules linked in: snd_powermac(+) snd_pcm snd_timer snd 
soundcore
[   85.062314] CPU: 6 PID: 3643 Comm: modprobe Tainted: GW 
5.3.0-rc2-gcc-8.2.0-00051-ga8e8d67f314c #655
[   85.062325] NIP:  c0078e08 LR: c0078dd0 CTR: c0078db0
[   85.062335] REGS: c000f44f6e40 TRAP: 0700   Tainted: GW  
(5.3.0-rc2-gcc-8.2.0-00051-ga8e8d67f314c)
[   85.062342] MSR:  82029032   CR: 24228884  
XER: 
[   85.062377] CFAR: c0339650 IRQMASK: 0 
   GPR00: c0078dd0 c000f44f70d0 c1a2ff00 
 
   GPR04: c00800336518  c1a66b80 
0001 
   GPR08: c13296c8 c00a8000 0001 
c0080032ba08 
   GPR12: c0078db0 c0003fff8e80 0004 
c0080035 
   GPR16: c2637730 c0d69868  
c2637740 
   GPR20: c197ad08  0100 
0028 
   GPR24: c000f736eac0 c000f44f7370 c000f5065000 
c00800336510 
   GPR28: c26aafc8 ffed  
 
[   85.062554] NIP [c0078e08] .iounmap+0x58/0xb0
[   85.062564] LR [c0078dd0] .iounmap+0x20/0xb0
[   85.062572] Call Trace:
[   85.062591] [c000f44f70d0] [c0078dd0] .iounmap+0x20/0xb0 
(unreliable)
[   85.062623] [c000f44f7150] [c00800321e24] .snd_pmac_free+0x164/0x270 
[snd_powermac]
[   85.062709] [c000f44f71e0] [c00800322fa4] .snd_pmac_new+0x884/0xf30 
[snd_powermac]
[   85.062798] [c000f44f72f0] [c0080032015c] .snd_pmac_probe+0x7c/0x450 
[snd_powermac]
[   85.062849] [c000f44f73a0] [c07b0628] 
.platform_drv_probe+0x68/0x100
[   85.062863] [c000f44f7430] [c07aca94] .really_probe+0x144/0x3c0
[   85.062880] [c000f44f74d0] [c07acfe0] 
.driver_probe_device+0x80/0x170
[   85.062899] [c000f44f7560] [c07a9be4] 
.bus_for_each_drv+0xb4/0x130
[   85.062922] [c000f44f7610] [c07ac89c] 
.__device_attach+0x11c/0x1a0
[   85.062941] [c000f44f76c0] [c07ab3d8] 
.bus_probe_device+0xe8/0x100
[   85.062961] [c000f44f7750] [c07a6164] .device_add+0x504/0x7d0
[   85.062978] [c000f44f7820] [c07b02fc] 
.platform_device_add+0x14c/0x310
[   85.062994] [c000f44f78c0] [c07b14c0] 
.platform_device_register_full+0x130/0x210
[   85.063084] [c000f44f7940] [c0080032b850] 
.alsa_card_pmac_init+0x80/0xc4 [snd_powermac]
[   85.063106] [c000f44f7a10] [c0010e58] .do_one_initcall+0x88/0x448
[   85.063158] [c000f44f7b00] [c02416e4] .do_init_module+0x74/0x2e0
[   85.063203] [c000f44f7ba0] [c0243bd4] .load_module+0x20b4/0x26d0
[   85.063219] [c000f44f7cf0] [c0244470] 
.__se_sys_finit_module+0xe0/0x140
[   85.063237] [c000f44f7e20] [c000c46c] system_call+0x5c/0x70



I think we can just do:

void iounmap(volatile void __iomem *token)
{
void *addr;

if (!token)
return;

...

??

cheers


[PATCH v5 0/2] powerpc: Enabling IMA arch specific secure boot policies

2019-08-19 Thread Nayna Jain
The IMA subsystem supports custom, built-in, and arch-specific policies to
define the files to be measured and appraised. These policies are honored
by priority, where the arch-specific policy is the highest and the custom
policy is the lowest.

OpenPOWER systems rely on IMA for signature verification of the kernel.
This patchset adds support for powerpc-specific arch policies that are
defined based on the system's OS secureboot state. The OS secureboot state
of the system is determined via a device-tree entry.

Changelog:
v5:
* secureboot state is now read via device tree entry rather than OPAL
secure variables
* ima arch policies are updated to use policy based template for
measurement rules

v4:
* Fixed the build issue as reported by Satheesh Rajendran.

v3:
* OPAL APIs in Patch 1 are updated to provide generic interface based on
key/keylen. This patchset updates kernel OPAL APIs to be compatible with
generic interface.
* Patch 2 is cleaned up to use new OPAL APIs.
* Since OPAL can support different types of backend which can vary in the
variable interpretation, the Patch 2 is updated to add a check for the
backend version
* OPAL API now expects consumer to first check the supported backend version
before calling other secvar OPAL APIs. This check is now added in patch 2.
* IMA policies in Patch 3 is updated to specify appended signature and
per policy template.
* The patches now are free of any EFIisms.

v2:

* Removed Patch 1: powerpc/include: Override unneeded early ioremap
functions
* Updated Subject line and patch description of the Patch 1 of this series
* Removed dependency of OPAL_SECVAR on EFI, CPU_BIG_ENDIAN and UCS2_STRING
* Changed OPAL APIs from static to non-static. Added opal-secvar.h for the
same
* Removed EFI hooks from opal_secvar.c
* Removed opal_secvar_get_next(), opal_secvar_enqueue() and
opal_query_variable_info() function
* get_powerpc_sb_mode() in secboot.c now directly calls OPAL Runtime API
rather than via EFI hooks.
* Fixed log messages in get_powerpc_sb_mode() function.
* Added dependency for PPC_SECURE_BOOT on configs PPC64 and OPAL_SECVAR
* Replaced obj-$(CONFIG_IMA) with obj-$(CONFIG_PPC_SECURE_BOOT) in
arch/powerpc/kernel/Makefile

Nayna Jain (2):
  powerpc: detect the secure boot mode of the system
  powerpc: Add support to initialize ima policy rules

 arch/powerpc/Kconfig   | 13 ++
 arch/powerpc/include/asm/secboot.h | 27 
 arch/powerpc/kernel/Makefile   |  2 +
 arch/powerpc/kernel/ima_arch.c | 50 +
 arch/powerpc/kernel/secboot.c  | 71 ++
 include/linux/ima.h|  3 +-
 6 files changed, 165 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/include/asm/secboot.h
 create mode 100644 arch/powerpc/kernel/ima_arch.c
 create mode 100644 arch/powerpc/kernel/secboot.c

-- 
2.20.1



[PATCH v5 2/2] powerpc: Add support to initialize ima policy rules

2019-08-19 Thread Nayna Jain
POWER secure boot relies on the kernel IMA security subsystem to
perform the OS kernel image signature verification. Since each secure
boot mode has different IMA policy requirements, dynamic definition of
the policy rules based on the runtime secure boot mode of the system is
required. On systems that support secure boot, but have it disabled,
only measurement policy rules of the kernel image and modules are
defined.

This patch defines the arch-specific implementation to retrieve the
secure boot mode of the system and accordingly configures the IMA policy
rules.

This patch provides arch-specific IMA policies if PPC_SECURE_BOOT
config is enabled.

Signed-off-by: Nayna Jain 
---
 arch/powerpc/Kconfig   |  2 ++
 arch/powerpc/kernel/Makefile   |  2 +-
 arch/powerpc/kernel/ima_arch.c | 50 ++
 include/linux/ima.h|  3 +-
 4 files changed, 55 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/kernel/ima_arch.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c902a39124dc..42109682b727 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -917,6 +917,8 @@ config PPC_SECURE_BOOT
bool
default n
depends on PPC64
+   depends on IMA
+   depends on IMA_ARCH_POLICY
help
  Linux on POWER with firmware secure boot enabled needs to define
  security policies to extend secure boot to the OS. This config
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index d310ebb4e526..520b1c814197 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -157,7 +157,7 @@ endif
 obj-$(CONFIG_EPAPR_PARAVIRT)   += epapr_paravirt.o epapr_hcalls.o
 obj-$(CONFIG_KVM_GUEST)+= kvm.o kvm_emul.o
 
-obj-$(CONFIG_PPC_SECURE_BOOT)  += secboot.o
+obj-$(CONFIG_PPC_SECURE_BOOT)  += secboot.o ima_arch.o
 
 # Disable GCOV, KCOV & sanitizers in odd or sensitive code
 GCOV_PROFILE_prom_init.o := n
diff --git a/arch/powerpc/kernel/ima_arch.c b/arch/powerpc/kernel/ima_arch.c
new file mode 100644
index ..ac90fac83338
--- /dev/null
+++ b/arch/powerpc/kernel/ima_arch.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain 
+ *
+ * ima_arch.c
+ *  - initialize ima policies for PowerPC Secure Boot
+ */
+
+#include 
+#include 
+
+bool arch_ima_get_secureboot(void)
+{
+   return get_powerpc_secureboot();
+}
+
+/*
+ * File signature verification is not needed, include only measurements
+ */
+static const char *const default_arch_rules[] = {
+   "measure func=KEXEC_KERNEL_CHECK",
+   "measure func=MODULE_CHECK",
+   NULL
+};
+
+/* Both file signature verification and measurements are needed */
+static const char *const sb_arch_rules[] = {
+   "measure func=KEXEC_KERNEL_CHECK template=ima-modsig",
+   "appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
+#if IS_ENABLED(CONFIG_MODULE_SIG)
+   "measure func=MODULE_CHECK",
+#else
+   "measure func=MODULE_CHECK template=ima-modsig",
+   "appraise func=MODULE_CHECK appraise_type=imasig|modsig",
+#endif
+   NULL
+};
+
+/*
+ * On PowerPC, file measurements are to be added to the IMA measurement list
+ * irrespective of the secure boot state of the system. Signature verification
+ * is conditionally enabled based on the secure boot state.
+ */
+const char *const *arch_get_ima_policy(void)
+{
+   if (IS_ENABLED(CONFIG_IMA_ARCH_POLICY) && arch_ima_get_secureboot())
+   return sb_arch_rules;
+   return default_arch_rules;
+}
diff --git a/include/linux/ima.h b/include/linux/ima.h
index a20ad398d260..10af09b5b478 100644
--- a/include/linux/ima.h
+++ b/include/linux/ima.h
@@ -29,7 +29,8 @@ extern void ima_kexec_cmdline(const void *buf, int size);
 extern void ima_add_kexec_buffer(struct kimage *image);
 #endif
 
-#if (defined(CONFIG_X86) && defined(CONFIG_EFI)) || defined(CONFIG_S390)
+#if (defined(CONFIG_X86) && defined(CONFIG_EFI)) || defined(CONFIG_S390) \
+   || defined(CONFIG_PPC_SECURE_BOOT)
 extern bool arch_ima_get_secureboot(void);
 extern const char * const *arch_get_ima_policy(void);
 #else
-- 
2.20.1



[PATCH v5 1/2] powerpc: detect the secure boot mode of the system

2019-08-19 Thread Nayna Jain
Secure boot on POWER defines different IMA policies based on the secure
boot state of the system.

This patch defines a function to detect the secure boot state of the
system.

The PPC_SECURE_BOOT config represents the base enablement of secureboot
on POWER.

Signed-off-by: Nayna Jain 
---
 arch/powerpc/Kconfig   | 11 +
 arch/powerpc/include/asm/secboot.h | 27 
 arch/powerpc/kernel/Makefile   |  2 +
 arch/powerpc/kernel/secboot.c  | 71 ++
 4 files changed, 111 insertions(+)
 create mode 100644 arch/powerpc/include/asm/secboot.h
 create mode 100644 arch/powerpc/kernel/secboot.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 77f6ebf97113..c902a39124dc 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -912,6 +912,17 @@ config PPC_MEM_KEYS
 
  If unsure, say y.
 
+config PPC_SECURE_BOOT
+   prompt "Enable PowerPC Secure Boot"
+   bool
+   default n
+   depends on PPC64
+   help
+ Linux on POWER with firmware secure boot enabled needs to define
+ security policies to extend secure boot to the OS. This config
+ allows user to enable OS Secure Boot on PowerPC systems that
+ have firmware secure boot support.
+
 endmenu
 
 config ISA_DMA_API
diff --git a/arch/powerpc/include/asm/secboot.h 
b/arch/powerpc/include/asm/secboot.h
new file mode 100644
index ..e726261bb00b
--- /dev/null
+++ b/arch/powerpc/include/asm/secboot.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PowerPC secure boot definitions
+ *
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain 
+ *
+ */
+#ifndef POWERPC_SECBOOT_H
+#define POWERPC_SECBOOT_H
+
+#ifdef CONFIG_PPC_SECURE_BOOT
+extern struct device_node *is_powerpc_secvar_supported(void);
+extern bool get_powerpc_secureboot(void);
+#else
+static inline struct device_node *is_powerpc_secvar_supported(void)
+{
+   return NULL;
+}
+
+static inline bool get_powerpc_secureboot(void)
+{
+   return false;
+}
+
+#endif
+#endif
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index ea0c69236789..d310ebb4e526 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -157,6 +157,8 @@ endif
 obj-$(CONFIG_EPAPR_PARAVIRT)   += epapr_paravirt.o epapr_hcalls.o
 obj-$(CONFIG_KVM_GUEST)+= kvm.o kvm_emul.o
 
+obj-$(CONFIG_PPC_SECURE_BOOT)  += secboot.o
+
 # Disable GCOV, KCOV & sanitizers in odd or sensitive code
 GCOV_PROFILE_prom_init.o := n
 KCOV_INSTRUMENT_prom_init.o := n
diff --git a/arch/powerpc/kernel/secboot.c b/arch/powerpc/kernel/secboot.c
new file mode 100644
index ..5ea0d52d64ef
--- /dev/null
+++ b/arch/powerpc/kernel/secboot.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain 
+ *
+ * secboot.c
+ *  - util function to get powerpc secboot state
+ */
+#include 
+#include 
+#include 
+
+struct device_node *is_powerpc_secvar_supported(void)
+{
+   struct device_node *np;
+   int status;
+
+   np = of_find_node_by_name(NULL, "ibm,secureboot");
+   if (!np) {
+   pr_info("secureboot node is not found\n");
+   return NULL;
+   }
+
+   status = of_device_is_compatible(np, "ibm,secureboot-v3");
+   if (!status) {
+   pr_info("Secure variables are not supported by this 
firmware\n");
+   return NULL;
+   }
+
+   return np;
+}
+
+bool get_powerpc_secureboot(void)
+{
+   struct device_node *np;
+   struct device_node *secvar_np;
+   const u64 *psecboot;
+   u64 secboot = 0;
+
+   np = is_powerpc_secvar_supported();
+   if (!np)
+   goto disabled;
+
+   /* Fail-safe for any failure related to secvar */
+   secvar_np = of_get_child_by_name(np, "secvar");
+   if (!secvar_np) {
+   pr_err("Expected secure variables support, fail-safe\n");
+   goto enabled;
+   }
+
+   if (!of_device_is_available(secvar_np)) {
+   pr_err("Secure variables support is in error state, 
fail-safe\n");
+   goto enabled;
+   }
+
+   psecboot = of_get_property(secvar_np, "secure-mode", NULL);
+   if (!psecboot)
+   goto enabled;
+
+   secboot = be64_to_cpup((__be64 *)psecboot);
+   if (!(secboot & (~0x0)))
+   goto disabled;
+
+enabled:
+   pr_info("secureboot mode enabled\n");
+   return true;
+
+disabled:
+   pr_info("secureboot mode disabled\n");
+   return false;
+}
-- 
2.20.1



Re: [PATCH 2/2] powerpc: support KASAN instrumentation of bitops

2019-08-19 Thread Christophe Leroy




Le 19/08/2019 à 08:28, Daniel Axtens a écrit :

In KASAN development I noticed that the powerpc-specific bitops
were not being picked up by the KASAN test suite.


I'm not sure anybody cares about who noticed the problem. This sentence 
could be rephrased as:


The powerpc-specific bitops are not being picked up by the KASAN test suite.



Instrumentation is done via the bitops/instrumented-{atomic,lock}.h
headers. They require that arch-specific versions of bitop functions
are renamed to arch_*. Do this renaming.

For clear_bit_unlock_is_negative_byte, the current implementation
uses the PG_waiters constant. This works because it's a preprocessor
macro - so it's only actually evaluated in contexts where PG_waiters
is defined. With instrumentation however, it becomes a static inline
function, and all of a sudden we need the actual value of PG_waiters.
Because of the order of header includes, it's not available and we
fail to compile. Instead, manually specify that we care about bit 7.
This is still correct: bit 7 is the bit that would mark a negative
byte.
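
A possible follow-up (sketch only, not part of this patch) would be a
build-time assertion somewhere PG_waiters is visible, e.g. in mm/filemap.c,
so the hard-coded bit stays in sync:

	BUILD_BUG_ON(PG_waiters != 7);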

Cc: Nicholas Piggin  # clear_bit_unlock_negative_byte
Signed-off-by: Daniel Axtens 


Reviewed-by: Christophe Leroy 

Note that this patch might be an opportunity to replace all the 
'__inline__' by the standard 'inline' keyword.


Some () alignment to be fixed as well, see checkpatch warnings/checks at 
https://openpower.xyz/job/snowpatch/job/snowpatch-linux-checkpatch/8601//artifact/linux/checkpatch.log



---
  arch/powerpc/include/asm/bitops.h | 31 +++
  1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/bitops.h 
b/arch/powerpc/include/asm/bitops.h
index 603aed229af7..8615b2bc35fe 100644
--- a/arch/powerpc/include/asm/bitops.h
+++ b/arch/powerpc/include/asm/bitops.h
@@ -86,22 +86,22 @@ DEFINE_BITOP(clear_bits, andc, "")
  DEFINE_BITOP(clear_bits_unlock, andc, PPC_RELEASE_BARRIER)
  DEFINE_BITOP(change_bits, xor, "")
  
-static __inline__ void set_bit(int nr, volatile unsigned long *addr)

+static __inline__ void arch_set_bit(int nr, volatile unsigned long *addr)
  {
set_bits(BIT_MASK(nr), addr + BIT_WORD(nr));
  }
  
-static __inline__ void clear_bit(int nr, volatile unsigned long *addr)

+static __inline__ void arch_clear_bit(int nr, volatile unsigned long *addr)
  {
clear_bits(BIT_MASK(nr), addr + BIT_WORD(nr));
  }
  
-static __inline__ void clear_bit_unlock(int nr, volatile unsigned long *addr)

+static __inline__ void arch_clear_bit_unlock(int nr, volatile unsigned long 
*addr)
  {
clear_bits_unlock(BIT_MASK(nr), addr + BIT_WORD(nr));
  }
  
-static __inline__ void change_bit(int nr, volatile unsigned long *addr)

+static __inline__ void arch_change_bit(int nr, volatile unsigned long *addr)
  {
change_bits(BIT_MASK(nr), addr + BIT_WORD(nr));
  }
@@ -138,26 +138,26 @@ DEFINE_TESTOP(test_and_clear_bits, andc, 
PPC_ATOMIC_ENTRY_BARRIER,
  DEFINE_TESTOP(test_and_change_bits, xor, PPC_ATOMIC_ENTRY_BARRIER,
  PPC_ATOMIC_EXIT_BARRIER, 0)
  
-static __inline__ int test_and_set_bit(unsigned long nr,

+static __inline__ int arch_test_and_set_bit(unsigned long nr,
   volatile unsigned long *addr)
  {
return test_and_set_bits(BIT_MASK(nr), addr + BIT_WORD(nr)) != 0;
  }
  
-static __inline__ int test_and_set_bit_lock(unsigned long nr,

+static __inline__ int arch_test_and_set_bit_lock(unsigned long nr,
   volatile unsigned long *addr)
  {
return test_and_set_bits_lock(BIT_MASK(nr),
addr + BIT_WORD(nr)) != 0;
  }
  
-static __inline__ int test_and_clear_bit(unsigned long nr,

+static __inline__ int arch_test_and_clear_bit(unsigned long nr,
 volatile unsigned long *addr)
  {
return test_and_clear_bits(BIT_MASK(nr), addr + BIT_WORD(nr)) != 0;
  }
  
-static __inline__ int test_and_change_bit(unsigned long nr,

+static __inline__ int arch_test_and_change_bit(unsigned long nr,
  volatile unsigned long *addr)
  {
return test_and_change_bits(BIT_MASK(nr), addr + BIT_WORD(nr)) != 0;
@@ -185,15 +185,18 @@ static __inline__ unsigned long 
clear_bit_unlock_return_word(int nr,
return old;
  }
  
-/* This is a special function for mm/filemap.c */

-#define clear_bit_unlock_is_negative_byte(nr, addr)\
-   (clear_bit_unlock_return_word(nr, addr) & BIT_MASK(PG_waiters))
+/*
+ * This is a special function for mm/filemap.c
+ * Bit 7 corresponds to PG_waiters.
+ */
+#define arch_clear_bit_unlock_is_negative_byte(nr, addr)   \
+   (clear_bit_unlock_return_word(nr, addr) & BIT_MASK(7))
  
  #endif /* CONFIG_PPC64 */
  
  #include 
  
-static __inline__ void __clear_bit_unlock(int nr, volatile unsigned long *addr)

+static __inline__ void arch___clear_bit_unlock(int nr, volatile unsigned long 
*addr)
  {
 

Re: [PATCH 1/2] kasan: support instrumented bitops combined with generic bitops

2019-08-19 Thread Christophe Leroy




Le 19/08/2019 à 08:28, Daniel Axtens a écrit :

Currently bitops-instrumented.h assumes that the architecture provides
atomic, non-atomic and locking bitops (e.g. both set_bit and __set_bit).
This is true on x86 and s390, but is not always true: there is a
generic bitops/non-atomic.h header that provides generic non-atomic
operations, and also a generic bitops/lock.h for locking operations.

powerpc uses the generic non-atomic version, so it does not have its
own e.g. __set_bit that could be renamed arch___set_bit.

Split up bitops-instrumented.h to mirror the atomic/non-atomic/lock
split. This allows arches to only include the headers where they
have arch-specific versions to rename. Update x86 and s390.

(The generic operations are automatically instrumented because they're
written in C, not asm.)

Suggested-by: Christophe Leroy 
Signed-off-by: Daniel Axtens 


Reviewed-by: Christophe Leroy 


---
  Documentation/core-api/kernel-api.rst |  17 +-
  arch/s390/include/asm/bitops.h|   4 +-
  arch/x86/include/asm/bitops.h |   4 +-
  include/asm-generic/bitops-instrumented.h | 263 --
  .../asm-generic/bitops/instrumented-atomic.h  | 100 +++
  .../asm-generic/bitops/instrumented-lock.h|  81 ++
  .../bitops/instrumented-non-atomic.h  | 114 
  7 files changed, 317 insertions(+), 266 deletions(-)
  delete mode 100644 include/asm-generic/bitops-instrumented.h
  create mode 100644 include/asm-generic/bitops/instrumented-atomic.h
  create mode 100644 include/asm-generic/bitops/instrumented-lock.h
  create mode 100644 include/asm-generic/bitops/instrumented-non-atomic.h

diff --git a/Documentation/core-api/kernel-api.rst 
b/Documentation/core-api/kernel-api.rst
index 08af5caf036d..2e21248277e3 100644
--- a/Documentation/core-api/kernel-api.rst
+++ b/Documentation/core-api/kernel-api.rst
@@ -54,7 +54,22 @@ The Linux kernel provides more basic utility functions.
  Bit Operations
  --
  
-.. kernel-doc:: include/asm-generic/bitops-instrumented.h

+Atomic Operations
+~
+
+.. kernel-doc:: include/asm-generic/bitops/instrumented-atomic.h
+   :internal:
+
+Non-atomic Operations
+~
+
+.. kernel-doc:: include/asm-generic/bitops/instrumented-non-atomic.h
+   :internal:
+
+Locking Operations
+~~
+
+.. kernel-doc:: include/asm-generic/bitops/instrumented-lock.h
 :internal:
  
  Bitmap Operations

diff --git a/arch/s390/include/asm/bitops.h b/arch/s390/include/asm/bitops.h
index b8833ac983fa..0ceb12593a68 100644
--- a/arch/s390/include/asm/bitops.h
+++ b/arch/s390/include/asm/bitops.h
@@ -241,7 +241,9 @@ static inline void arch___clear_bit_unlock(unsigned long nr,
arch___clear_bit(nr, ptr);
  }
  
-#include 

+#include 
+#include 
+#include 
  
  /*

   * Functions which use MSB0 bit numbering.
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index ba15d53c1ca7..4a2e2432238f 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -389,7 +389,9 @@ static __always_inline int fls64(__u64 x)
  
  #include 
  
-#include 

+#include 
+#include 
+#include 
  
  #include 
  
diff --git a/include/asm-generic/bitops-instrumented.h b/include/asm-generic/bitops-instrumented.h

deleted file mode 100644
index ddd1c6d9d8db..
--- a/include/asm-generic/bitops-instrumented.h
+++ /dev/null
@@ -1,263 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-
-/*
- * This file provides wrappers with sanitizer instrumentation for bit
- * operations.
- *
- * To use this functionality, an arch's bitops.h file needs to define each of
- * the below bit operations with an arch_ prefix (e.g. arch_set_bit(),
- * arch___set_bit(), etc.).
- */
-#ifndef _ASM_GENERIC_BITOPS_INSTRUMENTED_H
-#define _ASM_GENERIC_BITOPS_INSTRUMENTED_H
-
-#include 
-
-/**
- * set_bit - Atomically set a bit in memory
- * @nr: the bit to set
- * @addr: the address to start counting from
- *
- * This is a relaxed atomic operation (no implied memory barriers).
- *
- * Note that @nr may be almost arbitrarily large; this function is not
- * restricted to acting on a single-word quantity.
- */
-static inline void set_bit(long nr, volatile unsigned long *addr)
-{
-   kasan_check_write(addr + BIT_WORD(nr), sizeof(long));
-   arch_set_bit(nr, addr);
-}
-
-/**
- * __set_bit - Set a bit in memory
- * @nr: the bit to set
- * @addr: the address to start counting from
- *
- * Unlike set_bit(), this function is non-atomic. If it is called on the same
- * region of memory concurrently, the effect may be that only one operation
- * succeeds.
- */
-static inline void __set_bit(long nr, volatile unsigned long *addr)
-{
-   kasan_check_write(addr + BIT_WORD(nr), sizeof(long));
-   arch___set_bit(nr, addr);
-}
-
-/**
- * clear_bit - Clears a bit in memory
- * @nr: Bit to clear
- * @addr: Address to start counting from
- *
- * This is a relaxed atomic operation (no impli

Re: [5.3.0-rc4-next][bisected 882632][qla2xxx] WARNING: CPU: 10 PID: 425 at drivers/scsi/qla2xxx/qla_isr.c:2784 qla2x00_status_entry.isra

2019-08-19 Thread Abdul Haleem
On Wed, 2019-08-14 at 20:42 -0700, Bart Van Assche wrote:
> On 8/14/19 10:18 AM, Abdul Haleem wrote:
> > On Wed, 2019-08-14 at 10:05 -0700, Bart Van Assche wrote:
> >> On 8/14/19 9:52 AM, Abdul Haleem wrote:
> >>> Greeting's
> >>>
> >>> Today's linux-next kernel (5.3.0-rc4-next-20190813)  booted with warning 
> >>> on my powerpc power 8 lpar
> >>>
> >>> The WARN_ON_ONCE() was introduced by commit 88263208 (scsi: qla2xxx: 
> >>> Complain if sp->done() is not...)
> >>>
> >>> boot logs:
> >>>
> >>> WARNING: CPU: 10 PID: 425 at drivers/scsi/qla2xxx/qla_isr.c:2784
> >>
> >> Hi Abdul,
> >>
> >> Thank you for having reported this. Is that the only warning reported on 
> >> your setup by the qla2xxx
> >> driver? If that warning is commented out, does the qla2xxx driver work as 
> >> expected?
> > 
> > boot warning did not show up when the commit is reverted.
> > 
> > should I comment out only the WARN_ON_ONCE() which is causing the issue,
> > and not the other one ?
> 
> Yes please. Commit 88263208 introduced five kernel warnings but I think 
> only one of these should be removed again, e.g. as follows:
> 
> diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
> index cd39ac18c5fd..d81b5ecce24b 100644
> --- a/drivers/scsi/qla2xxx/qla_isr.c
> +++ b/drivers/scsi/qla2xxx/qla_isr.c
> @@ -2780,8 +2780,6 @@ qla2x00_status_entry(scsi_qla_host_t *vha, struct 
> rsp_que *rsp, void *pkt)
> 
>   if (rsp->status_srb == NULL)
>   sp->done(sp, res);
> - else
> - WARN_ON_ONCE(true);
>   }
> 
>   /**
 
Applying the above patch, the system boots fine.

i.e. no warnings pop up when keeping all WARN_ON_ONCE() except the above one.

Reported-and-Tested-by: Abdul Haleem 

-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre





Re: [PATCH v4 1/3] kasan: support backing vmalloc space with real shadow memory

2019-08-19 Thread Mark Rutland
On Fri, Aug 16, 2019 at 10:41:00AM -0700, Andy Lutomirski wrote:
> On Fri, Aug 16, 2019 at 10:08 AM Mark Rutland  wrote:
> >
> > Hi Christophe,
> >
> > On Fri, Aug 16, 2019 at 09:47:00AM +0200, Christophe Leroy wrote:
> > > Le 15/08/2019 à 02:16, Daniel Axtens a écrit :
> > > > Hook into vmalloc and vmap, and dynamically allocate real shadow
> > > > memory to back the mappings.
> > > >
> > > > Most mappings in vmalloc space are small, requiring less than a full
> > > > page of shadow space. Allocating a full shadow page per mapping would
> > > > therefore be wasteful. Furthermore, to ensure that different mappings
> > > > use different shadow pages, mappings would have to be aligned to
> > > > KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE.
> > > >
> > > > Instead, share backing space across multiple mappings. Allocate
> > > > a backing page the first time a mapping in vmalloc space uses a
> > > > particular page of the shadow region. Keep this page around
> > > > regardless of whether the mapping is later freed - in the mean time
> > > > the page could have become shared by another vmalloc mapping.
> > > >
> > > > This can in theory lead to unbounded memory growth, but the vmalloc
> > > > allocator is pretty good at reusing addresses, so the practical memory
> > > > usage grows at first but then stays fairly stable.
> > >
> > > I guess people having gigabytes of memory don't mind, but I'm concerned
> > > about tiny targets with very little amount of memory. I have boards with 
> > > as
> > > little as 32Mbytes of RAM. The shadow region for the linear space already
> > > takes one eighth of the RAM. I'd rather avoid keeping unused shadow pages
> > > busy.
> >
> > I think this depends on how much shadow would be in constant use vs what
> > would get left unused. If the amount in constant use is sufficiently
> > large (or the residue is sufficiently small), then it may not be
> > worthwhile to support KASAN_VMALLOC on such small systems.
> >
> > > Each page of shadow memory represent 8 pages of real memory. Could we use
> > > page_ref to count how many pieces of a shadow page are used so that we can
> > > free it when the ref count decreases to 0.
> > >
> > > > This requires architecture support to actually use: arches must stop
> > > > mapping the read-only zero page over portion of the shadow region that
> > > > covers the vmalloc space and instead leave it unmapped.
> > >
> > > Why 'must' ? Couldn't we switch back and forth from the zero page to real
> > > page on demand ?
> > >
> > > If the zero page is not mapped for unused vmalloc space, bad memory 
> > > accesses
> > > will Oops on the shadow memory access instead of Oopsing on the real bad
> > > access, making it more difficult to locate and identify the issue.
> >
> > I agree this isn't nice, though FWIW this can already happen today for
> > bad addresses that fall outside of the usual kernel address space. We
> > could make the !KASAN_INLINE checks resilient to this by using
> > probe_kernel_read() to check the shadow, and treating unmapped shadow as
> > poison.
> 
> Could we instead modify the page fault handlers to detect this case
> and print a useful message?

In general we can't know if a bad access was a KASAN shadow lookup (e.g.
since the shadow of NULL falls outside of the shadow region), but we
could always print a message using kasan_shadow_to_mem() for any
unhandled fault to suggest what the "real" address might have been.
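
A minimal sketch of that idea (illustrative only; it assumes the fault path
can see KASAN_SHADOW_START/END and the internal kasan_shadow_to_mem()
helper):

	static void kasan_fault_hint(unsigned long addr)
	{
		if (addr < KASAN_SHADOW_START || addr >= KASAN_SHADOW_END)
			return;

		pr_alert("KASAN: %lx is in the shadow region; the bad access was probably to %px\n",
			 addr, kasan_shadow_to_mem((void *)addr));
	}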

Thanks,
Mark.


Re: [PATCH v5 3/4] mm/nvdimm: Use correct #defines instead of open coding

2019-08-19 Thread Aneesh Kumar K.V
Aneesh Kumar K.V  writes:

> Dan Williams  writes:
>
>> On Fri, Aug 9, 2019 at 12:45 AM Aneesh Kumar K.V
>>  wrote:
>>>
>>

...

>>> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
>>> index 37e96811c2fc..c1d9be609322 100644
>>> --- a/drivers/nvdimm/pfn_devs.c
>>> +++ b/drivers/nvdimm/pfn_devs.c
>>> @@ -725,7 +725,8 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
>>>  * when populating the vmemmap. This *should* be equal to
>>>  * PMD_SIZE for most architectures.
>>>  */
>>> -   offset = ALIGN(start + SZ_8K + 64 * npfns, align) - start;
>>> +   offset = ALIGN(start + SZ_8K + sizeof(struct page) * npfns,
>>
>> I'd prefer if this was not dynamic and was instead set to the maximum
>> size of 'struct page' across all archs just to enhance cross-arch
>> compatibility. I think that answer is '64'.
>
>
> That still doesn't take care of the case where we add new elements to
> struct page later. If we have struct page size changing across
> architectures, we should still be ok as long as new size is less than what is
> stored in pfn superblock? I understand the desire to keep it
> non-dynamic. But we also need to make sure we don't reserve less space
> when creating a new namespace on a config that got struct page size >
> 64? 


How about

libnvdimm/pfn_dev: Add a build check to make sure we notice when struct page size changes

When a namespace is created with the map device as a pmem device, struct page
is stored in the reserve block area. We need to make sure we account for the
right struct page size while doing this. Instead of directly depending on
sizeof(struct page), which can change based on kernel config options, use the
max struct page size (64) while calculating the reserve block area. This makes
sure a pmem device can be used across kernels built with different configs.

If the above assumption about the max struct page size changes, we need to
update the reserve block allocation space for newly created namespaces.

Signed-off-by: Aneesh Kumar K.V 

1 file changed, 7 insertions(+)
drivers/nvdimm/pfn_devs.c | 7 +++

modified   drivers/nvdimm/pfn_devs.c
@@ -722,7 +722,14 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
 * The altmap should be padded out to the block size used
 * when populating the vmemmap. This *should* be equal to
 * PMD_SIZE for most architectures.
+*
+* Also make sure the size of struct page does not exceed 64. We
+* want to make sure we use large enough size here so that
+* we don't have a dynamic reserve space depending on
+* struct page size. But we also want to make sure we notice
+* if we end up adding new elements to struct page.
 */
+   BUILD_BUG_ON(64 < sizeof(struct page));
offset = ALIGN(start + SZ_8K + 64 * npfns, align) - start;
} else if (nd_pfn->mode == PFN_MODE_RAM)
offset = ALIGN(start + SZ_8K, align) - start;


-aneesh



Re: [PATCH] powerpc: Don't add -mabi= flags when building with Clang

2019-08-19 Thread Segher Boessenkool
On Sun, Aug 18, 2019 at 12:13:21PM -0700, Nathan Chancellor wrote:
> When building pseries_defconfig, building vdso32 errors out:
> 
>   error: unknown target ABI 'elfv1'
> 
> Commit 4dc831aa8813 ("powerpc: Fix compiling a BE kernel with a
> powerpc64le toolchain") added these flags to fix building GCC but
> clang is multitargeted and does not need these flags. The ABI is
> properly set based on the target triple, which is derived from
> CROSS_COMPILE.

You mean that LLVM does not *allow* you to select a different ABI, or
different ABI options, you always have to use the default.  (Everything
else you say is true for GCC as well).

(-mabi= does not set a "target ABI", fwiw, it is more subtle; please see
the documentation.  Unless LLVM is incompatible in that respect as well?)


Segher


Re: [PATCH v2 3/3] arm: Add support for function error injection

2019-08-19 Thread Leo Yan
Hi Russell,

On Tue, Aug 06, 2019 at 06:00:15PM +0800, Leo Yan wrote:
> This patch implements arm specific functions regs_set_return_value() and
> override_function_with_return() to support function error injection.
> 
> In the exception flow, it updates pt_regs::ARM_pc with pt_regs::ARM_lr
> so can override the probed function return.

Gentle ping ...  Could you review this patch?

Thanks,
Leo.

> Signed-off-by: Leo Yan 
> ---
>  arch/arm/Kconfig  |  1 +
>  arch/arm/include/asm/ptrace.h |  5 +
>  arch/arm/lib/Makefile |  2 ++
>  arch/arm/lib/error-inject.c   | 19 +++
>  4 files changed, 27 insertions(+)
>  create mode 100644 arch/arm/lib/error-inject.c
> 
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 33b00579beff..2d3d44a037f6 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -77,6 +77,7 @@ config ARM
>   select HAVE_EXIT_THREAD
>   select HAVE_FAST_GUP if ARM_LPAE
>   select HAVE_FTRACE_MCOUNT_RECORD if !XIP_KERNEL
> + select HAVE_FUNCTION_ERROR_INJECTION if !THUMB2_KERNEL
>   select HAVE_FUNCTION_GRAPH_TRACER if !THUMB2_KERNEL && !CC_IS_CLANG
>   select HAVE_FUNCTION_TRACER if !XIP_KERNEL
>   select HAVE_GCC_PLUGINS
> diff --git a/arch/arm/include/asm/ptrace.h b/arch/arm/include/asm/ptrace.h
> index 91d6b7856be4..3b41f37b361a 100644
> --- a/arch/arm/include/asm/ptrace.h
> +++ b/arch/arm/include/asm/ptrace.h
> @@ -89,6 +89,11 @@ static inline long regs_return_value(struct pt_regs *regs)
>   return regs->ARM_r0;
>  }
>  
> +static inline void regs_set_return_value(struct pt_regs *regs, unsigned long 
> rc)
> +{
> + regs->ARM_r0 = rc;
> +}
> +
>  #define instruction_pointer(regs)(regs)->ARM_pc
>  
>  #ifdef CONFIG_THUMB2_KERNEL
> diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
> index b25c54585048..8f56484a7156 100644
> --- a/arch/arm/lib/Makefile
> +++ b/arch/arm/lib/Makefile
> @@ -42,3 +42,5 @@ ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
>CFLAGS_xor-neon.o  += $(NEON_FLAGS)
>obj-$(CONFIG_XOR_BLOCKS)   += xor-neon.o
>  endif
> +
> +obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
> diff --git a/arch/arm/lib/error-inject.c b/arch/arm/lib/error-inject.c
> new file mode 100644
> index ..2d696dc94893
> --- /dev/null
> +++ b/arch/arm/lib/error-inject.c
> @@ -0,0 +1,19 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include 
> +#include 
> +
> +void override_function_with_return(struct pt_regs *regs)
> +{
> + /*
> +  * 'regs' represents the state on entry of a predefined function in
> +  * the kernel/module and which is captured on a kprobe.
> +  *
> +  * 'regs->ARM_lr' contains the link register for the probed
> +  * function; when the kprobe returns from the exception it will
> +  * override the rest of the probed function and return directly
> +  * to the probed function's caller.
> +  */
> + instruction_pointer_set(regs, regs->ARM_lr);
> +}
> +NOKPROBE_SYMBOL(override_function_with_return);
> -- 
> 2.17.1
> 
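
For context when testing the series, a minimal consumer sketch (illustrative
only, not part of this patch; demo_open is a made-up function) of how a kernel
function gets whitelisted so its return value can later be overridden through
the error injection framework, e.g. via the bpf_override_return() helper:

/* Illustrative only, not part of the patch: whitelist a (made-up)
 * function for error injection; tooling can then force its return
 * value through the generic error-injection framework.
 */
#include <linux/error-injection.h>
#include <linux/fs.h>

static int demo_open(struct inode *inode, struct file *file)
{
	/* normal success path; an injected error replaces this return */
	return 0;
}
ALLOW_ERROR_INJECTION(demo_open, ERRNO);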


Re: [PATCH v5 1/3] PM: wakeup: Add routine to help fetch wakeup source object.

2019-08-19 Thread Rafael J. Wysocki
On Monday, August 19, 2019 10:33:25 AM CEST Ran Wang wrote:
> Hi Rafael,
> 
> On Monday, August 19, 2019 16:20, Rafael J. Wysocki wrote:
> > 
> > On Mon, Aug 19, 2019 at 10:15 AM Ran Wang  wrote:
> > >
> > > Hi Rafael,
> > >
> > > On Monday, August 05, 2019 17:59, Rafael J. Wysocki wrote:
> > > >
> > > > On Wednesday, July 24, 2019 9:47:20 AM CEST Ran Wang wrote:
> > > > > Some users might want to go through all registered wakeup sources
> > > > > and do things accordingly. For example, a SoC PM driver might
> > > > > need to do HW programming to prevent powering down a specific IP
> > > > > block that a wakeup source depends on. So add this API to help walk
> > > > > through all registered wakeup source objects on that list and
> > > > > return them one by one.
> > > > >
> > > > > Signed-off-by: Ran Wang 
> > > > > ---
> > > > > Change in v5:
> > > > > - Update commit message, add description of walking through all wakeup
> > > > > source objects.
> > > > > - Add SRCU protection in function wakeup_source_get_next().
> > > > > - Rename wakeup_source member 'attached_dev' to 'dev' and move
> > > > > it
> > > > up
> > > > > (before wakeirq).
> > > > >
> > > > > Change in v4:
> > > > > - None.
> > > > >
> > > > > Change in v3:
> > > > > - Adjust indentation of *attached_dev;.
> > > > >
> > > > > Change in v2:
> > > > > - None.
> > > > >
> > > > >  drivers/base/power/wakeup.c | 24 
> > > > >  include/linux/pm_wakeup.h   |  3 +++
> > > > >  2 files changed, 27 insertions(+)
> > > > >
> > > > > diff --git a/drivers/base/power/wakeup.c
> > > > > b/drivers/base/power/wakeup.c index ee31d4f..2fba891 100644
> > > > > --- a/drivers/base/power/wakeup.c
> > > > > +++ b/drivers/base/power/wakeup.c
> > > > > @@ -14,6 +14,7 @@
> > > > >  #include 
> > > > >  #include 
> > > > >  #include 
> > > > > +#include 
> > > > >  #include 
> > > > >  #include 
> > > > >
> > > > > @@ -226,6 +227,28 @@ void wakeup_source_unregister(struct
> > > > wakeup_source *ws)
> > > > > }
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(wakeup_source_unregister);
> > > > > +/**
> > > > > + * wakeup_source_get_next - Get next wakeup source from the list
> > > > > + * @ws: Previous wakeup source object, null means caller want first 
> > > > > one.
> > > > > + */
> > > > > +struct wakeup_source *wakeup_source_get_next(struct wakeup_source
> > > > > +*ws) {
> > > > > +   struct list_head *ws_head = &wakeup_sources;
> > > > > +   struct wakeup_source *next_ws = NULL;
> > > > > +   int idx;
> > > > > +
> > > > > +   idx = srcu_read_lock(&wakeup_srcu);
> > > > > +   if (ws)
> > > > > +   next_ws = list_next_or_null_rcu(ws_head, &ws->entry,
> > > > > +   struct wakeup_source, entry);
> > > > > +   else
> > > > > +   next_ws = list_entry_rcu(ws_head->next,
> > > > > +   struct wakeup_source, entry);
> > > > > +   srcu_read_unlock(&wakeup_srcu, idx);
> > > > > +
> > > >
> > > > This is incorrect.
> > > >
> > > > The SRCU cannot be unlocked until the caller of this is done with
> > > > the object returned by it, or that object can be freed while it is 
> > > > still being
> > accessed.
> > >
> > > Thanks for the comment. Looks like I was not fully understanding your
> > > point on
> > > v4 discussion. So I will implement 3 APIs by referring
> > > wakeup_sources_stats_seq_start/next/stop()
> > >
> > > > Besides, this patch conflicts with some general wakeup sources
> > > > changes in the works, so it needs to be deferred and rebased on top of 
> > > > those
> > changes.
> > >
> > > Could you please tell me which is the right code base I should be
> > > developing on?
> > > I just tried applying the v5 patch on the latest
> > > git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git branch master
> > > (d1abaeb Linux 5.3-rc5) and no conflict was encountered.
> > 
> > It is better to use the most recent -rc from Linus (5.3-rc5 as of
> > today) as the base unless your patches depend on some changes that are not 
> > in
> > there.
> 
> OK, so I need to implement on the latest
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git branch
> master, am I right?
> 
> However, I just checked the v5.3-rc5 code and found it has the same HEAD
> (d1abaeb Linux 5.3-rc5), on which I did not observe any conflict when
> applying the v5 patch. Did I miss something?
> Thanks.

The conflict I mentioned earlier was with another patch series in the works
which is not in 5.3-rc5.  However, there are problems with that series and it
is not even in linux-next now, so please just base your series on top of -rc5.






RE: [PATCH v5 1/3] PM: wakeup: Add routine to help fetch wakeup source object.

2019-08-19 Thread Ran Wang
Hi Rafael,

On Monday, August 19, 2019 16:20, Rafael J. Wysocki wrote:
> 
> On Mon, Aug 19, 2019 at 10:15 AM Ran Wang  wrote:
> >
> > Hi Rafael,
> >
> > On Monday, August 05, 2019 17:59, Rafael J. Wysocki wrote:
> > >
> > > On Wednesday, July 24, 2019 9:47:20 AM CEST Ran Wang wrote:
> > > > Some users might want to go through all registered wakeup sources
> > > > and do things accordingly. For example, a SoC PM driver might
> > > > need to do HW programming to prevent powering down a specific IP
> > > > block that a wakeup source depends on. So add this API to help walk
> > > > through all registered wakeup source objects on that list and
> > > > return them one by one.
> > > >
> > > > Signed-off-by: Ran Wang 
> > > > ---
> > > > Change in v5:
> > > > - Update commit message, add description of walking through all wakeup
> > > > source objects.
> > > > - Add SRCU protection in function wakeup_source_get_next().
> > > > - Rename wakeup_source member 'attached_dev' to 'dev' and move
> > > > it
> > > up
> > > > (before wakeirq).
> > > >
> > > > Change in v4:
> > > > - None.
> > > >
> > > > Change in v3:
> > > > - Adjust indentation of *attached_dev;.
> > > >
> > > > Change in v2:
> > > > - None.
> > > >
> > > >  drivers/base/power/wakeup.c | 24 
> > > >  include/linux/pm_wakeup.h   |  3 +++
> > > >  2 files changed, 27 insertions(+)
> > > >
> > > > diff --git a/drivers/base/power/wakeup.c
> > > > b/drivers/base/power/wakeup.c index ee31d4f..2fba891 100644
> > > > --- a/drivers/base/power/wakeup.c
> > > > +++ b/drivers/base/power/wakeup.c
> > > > @@ -14,6 +14,7 @@
> > > >  #include 
> > > >  #include 
> > > >  #include 
> > > > +#include 
> > > >  #include 
> > > >  #include 
> > > >
> > > > @@ -226,6 +227,28 @@ void wakeup_source_unregister(struct
> > > wakeup_source *ws)
> > > > }
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(wakeup_source_unregister);
> > > > +/**
> > > > + * wakeup_source_get_next - Get next wakeup source from the list
> > > > + * @ws: Previous wakeup source object, null means caller want first 
> > > > one.
> > > > + */
> > > > +struct wakeup_source *wakeup_source_get_next(struct wakeup_source
> > > > +*ws) {
> > > > +   struct list_head *ws_head = &wakeup_sources;
> > > > +   struct wakeup_source *next_ws = NULL;
> > > > +   int idx;
> > > > +
> > > > +   idx = srcu_read_lock(&wakeup_srcu);
> > > > +   if (ws)
> > > > +   next_ws = list_next_or_null_rcu(ws_head, &ws->entry,
> > > > +   struct wakeup_source, entry);
> > > > +   else
> > > > +   next_ws = list_entry_rcu(ws_head->next,
> > > > +   struct wakeup_source, entry);
> > > > +   srcu_read_unlock(&wakeup_srcu, idx);
> > > > +
> > >
> > > This is incorrect.
> > >
> > > The SRCU cannot be unlocked until the caller of this is done with
> > > the object returned by it, or that object can be freed while it is still 
> > > being
> accessed.
> >
> > Thanks for the comment. Looks like I was not fully understanding your
> > point on
> > v4 discussion. So I will implement 3 APIs by referring
> > wakeup_sources_stats_seq_start/next/stop()
> >
> > > Besides, this patch conflicts with some general wakeup sources
> > > changes in the works, so it needs to be deferred and rebased on top of 
> > > those
> changes.
> >
> > Could you please tell me which is the right code base I should be
> > developing on?
> > I just tried applying the v5 patch on the latest
> > git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git branch master
> > (d1abaeb Linux 5.3-rc5) and no conflict was encountered.
> 
> It is better to use the most recent -rc from Linus (5.3-rc5 as of
> today) as the base unless your patches depend on some changes that are not in
> there.

OK, so I need to implement on the latest
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git branch master,
am I right?

However, I just checked the v5.3-rc5 code and found it has the same HEAD
(d1abaeb Linux 5.3-rc5), on which I did not observe any conflict when applying
the v5 patch. Did I miss something?
Thanks.

Regards,
Ran


Re: [PATCH v5 1/3] PM: wakeup: Add routine to help fetch wakeup source object.

2019-08-19 Thread Rafael J. Wysocki
On Mon, Aug 19, 2019 at 10:15 AM Ran Wang  wrote:
>
> Hi Rafael,
>
> On Monday, August 05, 2019 17:59, Rafael J. Wysocki wrote:
> >
> > On Wednesday, July 24, 2019 9:47:20 AM CEST Ran Wang wrote:
> > > Some users might want to go through all registered wakeup sources and
> > > do things accordingly. For example, a SoC PM driver might need to do
> > > HW programming to prevent powering down a specific IP block that a
> > > wakeup source depends on. So add this API to help walk through all
> > > registered wakeup source objects on that list and return them one by one.
> > >
> > > Signed-off-by: Ran Wang 
> > > ---
> > > Change in v5:
> > > - Update commit message, add description of walking through all wakeup
> > > source objects.
> > > - Add SRCU protection in function wakeup_source_get_next().
> > > - Rename wakeup_source member 'attached_dev' to 'dev' and move it
> > up
> > > (before wakeirq).
> > >
> > > Change in v4:
> > > - None.
> > >
> > > Change in v3:
> > > - Adjust indentation of *attached_dev;.
> > >
> > > Change in v2:
> > > - None.
> > >
> > >  drivers/base/power/wakeup.c | 24 
> > >  include/linux/pm_wakeup.h   |  3 +++
> > >  2 files changed, 27 insertions(+)
> > >
> > > diff --git a/drivers/base/power/wakeup.c b/drivers/base/power/wakeup.c
> > > index ee31d4f..2fba891 100644
> > > --- a/drivers/base/power/wakeup.c
> > > +++ b/drivers/base/power/wakeup.c
> > > @@ -14,6 +14,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  #include 
> > >  #include 
> > >
> > > @@ -226,6 +227,28 @@ void wakeup_source_unregister(struct
> > wakeup_source *ws)
> > > }
> > >  }
> > >  EXPORT_SYMBOL_GPL(wakeup_source_unregister);
> > > +/**
> > > + * wakeup_source_get_next - Get next wakeup source from the list
> > > + * @ws: Previous wakeup source object, null means caller want first one.
> > > + */
> > > +struct wakeup_source *wakeup_source_get_next(struct wakeup_source
> > > +*ws) {
> > > +   struct list_head *ws_head = &wakeup_sources;
> > > +   struct wakeup_source *next_ws = NULL;
> > > +   int idx;
> > > +
> > > +   idx = srcu_read_lock(&wakeup_srcu);
> > > +   if (ws)
> > > +   next_ws = list_next_or_null_rcu(ws_head, &ws->entry,
> > > +   struct wakeup_source, entry);
> > > +   else
> > > +   next_ws = list_entry_rcu(ws_head->next,
> > > +   struct wakeup_source, entry);
> > > +   srcu_read_unlock(&wakeup_srcu, idx);
> > > +
> >
> > This is incorrect.
> >
> > The SRCU cannot be unlocked until the caller of this is done with the object
> > returned by it, or that object can be freed while it is still being 
> > accessed.
>
> Thanks for the comment. Looks like I was not fully understanding your point on
> v4 discussion. So I will implement 3 APIs by referring 
> wakeup_sources_stats_seq_start/next/stop()
>
> > Besides, this patch conflicts with some general wakeup sources changes in 
> > the
> > works, so it needs to be deferred and rebased on top of those changes.
>
> Could you please tell me which is the right code base I should be developing on?
> I just tried applying the v5 patch on the latest
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git branch master
> (d1abaeb Linux 5.3-rc5) and no conflict was encountered.

It is better to use the most recent -rc from Linus (5.3-rc5 as of
today) as the base unless your patches depend on some changes that are
not in there.


WARN_ON(1) generates ugly code since commit 6b15f678fb7d

2019-08-19 Thread Christophe Leroy

Hi Drew,

I recently noticed gcc suddenly generating ugly code for WARN_ON(1).

It looks like commit 6b15f678fb7d ("include/asm-generic/bug.h: fix "cut 
here" for WARN_ON for __WARN_TAINT architectures") is the culprit.


unsigned long test_mul1(unsigned long a, unsigned long b)
{
    unsigned long long r = (unsigned long long)a * (unsigned long long)b;

    if (r > 0xffffffffULL)
        WARN_ON(1);

    return r;
}

Before that patch, I was getting the following code:

0008 <test_mul1>:
   8:    7d 23 20 16     mulhwu  r9,r3,r4
   c:    7c 63 21 d6     mullw   r3,r3,r4
  10:    2f 89 00 00     cmpwi   cr7,r9,0
  14:    4d 9e 00 20     beqlr   cr7
  18:    0f e0 00 00     twui    r0,0
  1c:    4e 80 00 20     blr

Now I get:

002c <test_mul1>:
  2c:    7d 23 20 16     mulhwu  r9,r3,r4
  30:    94 21 ff f0     stwu    r1,-16(r1)
  34:    7c 08 02 a6     mflr    r0
  38:    93 e1 00 0c     stw r31,12(r1)
  3c:    90 01 00 14     stw r0,20(r1)
  40:    7f e3 21 d6     mullw   r31,r3,r4
  44:    2f 89 00 00     cmpwi   cr7,r9,0
  48:    40 9e 00 1c     bne cr7,64 
  4c:    80 01 00 14     lwz r0,20(r1)
  50:    7f e3 fb 78     mr  r3,r31
  54:    83 e1 00 0c     lwz r31,12(r1)
  58:    7c 08 03 a6     mtlr    r0
  5c:    38 21 00 10     addi    r1,r1,16
  60:    4e 80 00 20     blr
  64:    3c 60 00 00     lis r3,0
            66: R_PPC_ADDR16_HA    .rodata.str1.4
  68:    38 63 00 00     addi    r3,r3,0
            6a: R_PPC_ADDR16_LO    .rodata.str1.4
  6c:    48 00 00 01     bl  6c 
            6c: R_PPC_REL24    printk
  70:    0f e0 00 00     twui    r0,0
  74:    4b ff ff d8     b   4c 

As you can see, a call to printk() is added, which means setting up a 
stack frame, saving volatile registers, etc ...

That's all the things we want to avoid when using WARN_ON().

And digging a bit more, I see that you are only adding this 'cut here' 
to calls like WARN_ON(1), ie where the condition is a constant.
For calls where the condition is not a constant, there is no change and 
no 'cut here' line added:


unsigned long test_mul2(unsigned long a, unsigned long b)
{
    unsigned long long r = (unsigned long long)a * (unsigned long long)b;

    WARN_ON(r > 0xffffffffULL);

    return r;
}

Before and after your patch, the code is clean, with no call added to print
any 'cut here' line.

0078 <test_mul2>:
  78:    7d 43 20 16     mulhwu  r10,r3,r4
  7c:    7c 63 21 d6     mullw   r3,r3,r4
  80:    31 2a ff ff     addic   r9,r10,-1
  84:    7d 29 51 10     subfe   r9,r9,r10
  88:    0f 09 00 00     twnei   r9,0
  8c:    4e 80 00 20     blr


Was it your intention to modify the behaviour and kill the lightweight 
implementations of WARN_ON()?


Looking into arch/powerpc/include/bug.h, I see that when the condition 
is constant, WARN_ON() uses __WARN(), which itself calls __WARN_FLAGS() 
with relevant flags.
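
For reference, the lightweight arch hook this relies on boils down to
something like the sketch below (heavily simplified and illustrative: the real
__WARN_FLAGS() also records a bug-table entry and taint flags so the trap
handler can identify the warning and resume; that bookkeeping is omitted
here). The point is that the whole warning is a single conditional trap
instruction, with no out-of-line call on the hot path:

/* Heavily simplified sketch, not the real powerpc macro: one conditional
 * trap instruction, no printk() call on the hot path.  The bug-table
 * entry and taint flags recorded by the real __WARN_FLAGS() are omitted.
 */
#define LIGHT_WARN_ON(x)					\
({								\
	int __ret_warn_on = !!(x);				\
	asm volatile("twnei %0,0"	/* trap if nonzero */	\
		     : : "r" (__ret_warn_on) : "memory");	\
	__ret_warn_on;						\
})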


In the old days, __WARN() was implemented in arch/powerpc/include/bug.h.
Commit b2be05273a17 ("panic: Allow warnings to set different taint
flags") replaced __WARN() by __WARN_TAINT() and added a generic
definition of __WARN().
In the beginning I thought the __WARN() call in
arch/powerpc/include/bug.h was forgotten, but looking into the commit in
full, it looks like it was intentional to make __WARN() generic and have
arches use it.


Then commit 19d436268dde ("debug: Add _ONCE() logic to report_bug()") 
replaced __WARN_TAINT() by __WARN_FLAGS().


So by changing the generic __WARN() you are impacting all users, including
those using 'trap'-like instructions in order to avoid function calls.


What is to be done to get back clean code which doesn't call
printk() on the hot path?


Thanks,
Christophe





RE: [PATCH v5 1/3] PM: wakeup: Add routine to help fetch wakeup source object.

2019-08-19 Thread Ran Wang
Hi Rafael,

On Monday, August 05, 2019 17:59, Rafael J. Wysocki wrote:
> 
> On Wednesday, July 24, 2019 9:47:20 AM CEST Ran Wang wrote:
> > Some users might want to go through all registered wakeup sources and
> > do things accordingly. For example, a SoC PM driver might need to do
> > HW programming to prevent powering down a specific IP block that a
> > wakeup source depends on. So add this API to help walk through all
> > registered wakeup source objects on that list and return them one by one.
> >
> > Signed-off-by: Ran Wang 
> > ---
> > Change in v5:
> > - Update commit message, add description of walking through all wakeup
> > source objects.
> > - Add SRCU protection in function wakeup_source_get_next().
> > - Rename wakeup_source member 'attached_dev' to 'dev' and move it
> up
> > (before wakeirq).
> >
> > Change in v4:
> > - None.
> >
> > Change in v3:
> > - Adjust indentation of *attached_dev;.
> >
> > Change in v2:
> > - None.
> >
> >  drivers/base/power/wakeup.c | 24 
> >  include/linux/pm_wakeup.h   |  3 +++
> >  2 files changed, 27 insertions(+)
> >
> > diff --git a/drivers/base/power/wakeup.c b/drivers/base/power/wakeup.c
> > index ee31d4f..2fba891 100644
> > --- a/drivers/base/power/wakeup.c
> > +++ b/drivers/base/power/wakeup.c
> > @@ -14,6 +14,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >
> > @@ -226,6 +227,28 @@ void wakeup_source_unregister(struct
> wakeup_source *ws)
> > }
> >  }
> >  EXPORT_SYMBOL_GPL(wakeup_source_unregister);
> > +/**
> > + * wakeup_source_get_next - Get next wakeup source from the list
> > + * @ws: Previous wakeup source object, null means caller want first one.
> > + */
> > +struct wakeup_source *wakeup_source_get_next(struct wakeup_source
> > +*ws) {
> > +   struct list_head *ws_head = &wakeup_sources;
> > +   struct wakeup_source *next_ws = NULL;
> > +   int idx;
> > +
> > +   idx = srcu_read_lock(&wakeup_srcu);
> > +   if (ws)
> > +   next_ws = list_next_or_null_rcu(ws_head, &ws->entry,
> > +   struct wakeup_source, entry);
> > +   else
> > +   next_ws = list_entry_rcu(ws_head->next,
> > +   struct wakeup_source, entry);
> > +   srcu_read_unlock(&wakeup_srcu, idx);
> > +
> 
> This is incorrect.
> 
> The SRCU cannot be unlocked until the caller of this is done with the object
> returned by it, or that object can be freed while it is still being accessed.

Thanks for the comment. Looks like I was not fully understanding your point on
v4 discussion. So I will implement 3 APIs by referring 
wakeup_sources_stats_seq_start/next/stop()
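
A minimal sketch of that split (illustrative only, the names are not meant to
be the final API): the SRCU read side is held by the caller for the whole
walk, and the iterator itself does no locking, mirroring how
wakeup_sources_stats_seq_start/next/stop() work:

/*
 * Illustrative sketch, not the final API: the caller holds the SRCU
 * read-side critical section across the whole walk.
 */
int wakeup_sources_read_lock(void)
{
	return srcu_read_lock(&wakeup_srcu);
}

void wakeup_sources_read_unlock(int idx)
{
	srcu_read_unlock(&wakeup_srcu, idx);
}

struct wakeup_source *wakeup_source_get_next(struct wakeup_source *ws)
{
	struct list_head *ws_head = &wakeup_sources;

	if (!ws)	/* NULL means start from the first entry */
		return list_first_or_null_rcu(ws_head, struct wakeup_source,
					      entry);

	return list_next_or_null_rcu(ws_head, &ws->entry,
				     struct wakeup_source, entry);
}

A caller would then do idx = wakeup_sources_read_lock(), walk the list with
wakeup_source_get_next(), and only call wakeup_sources_read_unlock(idx) after
it has finished using the last returned object.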
 
> Besides, this patch conflicts with some general wakeup sources changes in the
> works, so it needs to be deferred and rebased on top of those changes.

Could you please tell me which is the right code base I should be developing on?
I just tried applying the v5 patch on the latest
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git branch master
(d1abaeb Linux 5.3-rc5)
and no conflict was encountered.

Thanks & Regards,
Ran




Re: [PATCH] powerpc: optimise WARN_ON()

2019-08-19 Thread Segher Boessenkool
On Mon, Aug 19, 2019 at 07:40:42AM +0200, Christophe Leroy wrote:
> Le 18/08/2019 à 14:01, Segher Boessenkool a écrit :
> >On Sat, Aug 17, 2019 at 09:04:42AM +, Christophe Leroy wrote:
> >>Unlike BUG_ON(x), WARN_ON(x) uses !!(x) as the trigger
> >>of the t(d/w)nei instruction instead of using directly the
> >>value of x.
> >>
> >>This leads to GCC adding unnecessary pair of addic/subfe.
> >
> >And it has to, it is passed as an "r" to an asm, GCC has to put the "!!"
> >value into a register.
> >
> >>By using (x) instead of !!(x) like BUG_ON() does, the additional
> >>instructions go away:
> >
> >But is it correct?  What happens if you pass an int to WARN_ON, on a
> >64-bit kernel?
> 
> On a 64-bit kernel, an int is still in a 64-bit register, so there would
> be no problem with tdnei, would it? An int 0 is the same as a long 0,
> right?

The top 32 bits of a 64-bit register holding an int are undefined.  Take
as example

  int x = 42;
  x = ~x;

which may put 0x00000000ffffffd5 into the reg, not 0xffffffffffffffd5
as you might expect or want.  For tw instructions this makes no difference
(they only look at the low 32 bits anyway); for td insns, it does.

> It is on 32-bit kernel that I see a problem, if one passes a long long 
> to WARN_ON(), the forced cast to long will just drop the upper size of 
> it. So as of today, BUG_ON() is buggy for that.

Sure, it isn't defined what types you can pass to that macro.  Another
thing that makes inline functions much saner to use.

> >(You might want to have 64-bit generate either tw or td.  But, with
> >your __builtin_trap patch, all that will be automatic).
> 
> Yes I'll discard this patch and focus on the __builtin_trap() one which 
> should solve most issues.

But see my comment there about the compiler knowing all code after an
unconditional trap is dead.


Segher


Re: [PATCH] powerpc: Don't add -mabi= flags when building with Clang

2019-08-19 Thread Daniel Axtens
Hi Nathan,

> When building pseries_defconfig, building vdso32 errors out:
>
>   error: unknown target ABI 'elfv1'
>
> Commit 4dc831aa8813 ("powerpc: Fix compiling a BE kernel with a
> powerpc64le toolchain") added these flags to fix building GCC but
> clang is multitargeted and does not need these flags. The ABI is
> properly set based on the target triple, which is derived from
> CROSS_COMPILE.
>
> https://github.com/llvm/llvm-project/blob/llvmorg-9.0.0-rc2/clang/lib/Driver/ToolChains/Clang.cpp#L1782-L1804
>
> -mcall-aixdesc is not an implemented flag in clang so it can be
> safely excluded as well, see commit 238abecde8ad ("powerpc: Don't
> use gcc specific options on clang").
>

This all looks good to me, thanks for picking it up, and sorry I hadn't
got around to it!

The makefile is a bit messy and there are a few ways it could probably
be reorganised to reduce ifdefs. But I don't think this is the right
place to do that. With that in mind,

Reviewed-by: Daniel Axtens 

Regards,
Daniel

> pseries_defconfig successfully builds after this patch and
> powernv_defconfig and ppc44x_defconfig don't regress.
>
> Link: https://github.com/ClangBuiltLinux/linux/issues/240
> Signed-off-by: Nathan Chancellor 
> ---
>  arch/powerpc/Makefile | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
> index c345b79414a9..971b04bc753d 100644
> --- a/arch/powerpc/Makefile
> +++ b/arch/powerpc/Makefile
> @@ -93,11 +93,13 @@ MULTIPLEWORD  := -mmultiple
>  endif
>  
>  ifdef CONFIG_PPC64
> +ifndef CONFIG_CC_IS_CLANG
>  cflags-$(CONFIG_CPU_BIG_ENDIAN)  += $(call cc-option,-mabi=elfv1)
>  cflags-$(CONFIG_CPU_BIG_ENDIAN)  += $(call 
> cc-option,-mcall-aixdesc)
>  aflags-$(CONFIG_CPU_BIG_ENDIAN)  += $(call cc-option,-mabi=elfv1)
>  aflags-$(CONFIG_CPU_LITTLE_ENDIAN)   += -mabi=elfv2
>  endif
> +endif
>  
>  ifndef CONFIG_CC_IS_CLANG
>cflags-$(CONFIG_CPU_LITTLE_ENDIAN) += -mno-strict-align
> @@ -144,6 +146,7 @@ endif
>  endif
>  
>  CFLAGS-$(CONFIG_PPC64)   := $(call cc-option,-mtraceback=no)
> +ifndef CONFIG_CC_IS_CLANG
>  ifdef CONFIG_CPU_LITTLE_ENDIAN
>  CFLAGS-$(CONFIG_PPC64)   += $(call cc-option,-mabi=elfv2,$(call 
> cc-option,-mcall-aixdesc))
>  AFLAGS-$(CONFIG_PPC64)   += $(call cc-option,-mabi=elfv2)
> @@ -152,6 +155,7 @@ CFLAGS-$(CONFIG_PPC64)+= $(call cc-option,-mabi=elfv1)
>  CFLAGS-$(CONFIG_PPC64)   += $(call cc-option,-mcall-aixdesc)
>  AFLAGS-$(CONFIG_PPC64)   += $(call cc-option,-mabi=elfv1)
>  endif
> +endif
>  CFLAGS-$(CONFIG_PPC64)   += $(call cc-option,-mcmodel=medium,$(call 
> cc-option,-mminimal-toc))
>  CFLAGS-$(CONFIG_PPC64)   += $(call 
> cc-option,-mno-pointers-to-nested-functions)
>  
> -- 
> 2.23.0


Re: [PATCH v5 3/4] mm/nvdimm: Use correct #defines instead of open coding

2019-08-19 Thread Aneesh Kumar K.V
Dan Williams  writes:

> On Fri, Aug 9, 2019 at 12:45 AM Aneesh Kumar K.V
>  wrote:
>>
>> Use PAGE_SIZE instead of SZ_4K and sizeof(struct page) instead of 64.
>> If we have a kernel built with different struct page size the previous
>> patch should handle marking the namespace disabled.
>
> Each of these changes carry independent non-overlapping regression
> risk, so lets split them into separate patches. Others might
>
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  drivers/nvdimm/label.c  | 2 +-
>>  drivers/nvdimm/namespace_devs.c | 6 +++---
>>  drivers/nvdimm/pfn_devs.c   | 3 ++-
>>  drivers/nvdimm/region_devs.c| 8 
>>  4 files changed, 10 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/nvdimm/label.c b/drivers/nvdimm/label.c
>> index 73e197babc2f..7ee037063be7 100644
>> --- a/drivers/nvdimm/label.c
>> +++ b/drivers/nvdimm/label.c
>> @@ -355,7 +355,7 @@ static bool slot_valid(struct nvdimm_drvdata *ndd,
>>
>> /* check that DPA allocations are page aligned */
>> if ((__le64_to_cpu(nd_label->dpa)
>> -   | __le64_to_cpu(nd_label->rawsize)) % SZ_4K)
>> +   | __le64_to_cpu(nd_label->rawsize)) % 
>> PAGE_SIZE)
>
> The UEFI label specification has no concept of PAGE_SIZE, so this
> check is a pure Linux-ism. There's no strict requirement why
> slot_valid() needs to check for page alignment and it would seem to
> actively hurt cross-page-size compatibility, so let's delete the check
> and rely on checksum validation.


Will do a separate patch to drop that check.

>
>> return false;
>>
>> /* check checksum */
>> diff --git a/drivers/nvdimm/namespace_devs.c 
>> b/drivers/nvdimm/namespace_devs.c
>> index a16e52251a30..a9c76df12cb9 100644
>> --- a/drivers/nvdimm/namespace_devs.c
>> +++ b/drivers/nvdimm/namespace_devs.c
>> @@ -1006,10 +1006,10 @@ static ssize_t __size_store(struct device *dev, 
>> unsigned long long val)
>> return -ENXIO;
>> }
>>
>> -   div_u64_rem(val, SZ_4K * nd_region->ndr_mappings, &remainder);
>> +   div_u64_rem(val, PAGE_SIZE * nd_region->ndr_mappings, &remainder);
>> if (remainder) {
>> -   dev_dbg(dev, "%llu is not %dK aligned\n", val,
>> -   (SZ_4K * nd_region->ndr_mappings) / SZ_1K);
>> +   dev_dbg(dev, "%llu is not %ldK aligned\n", val,
>> +   (PAGE_SIZE * nd_region->ndr_mappings) / 
>> SZ_1K);
>> return -EINVAL;
>
> Yes, looks good, but this deserves its own independent patch.
>
>> }
>>
>> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
>> index 37e96811c2fc..c1d9be609322 100644
>> --- a/drivers/nvdimm/pfn_devs.c
>> +++ b/drivers/nvdimm/pfn_devs.c
>> @@ -725,7 +725,8 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
>>  * when populating the vmemmap. This *should* be equal to
>>  * PMD_SIZE for most architectures.
>>  */
>> -   offset = ALIGN(start + SZ_8K + 64 * npfns, align) - start;
>> +   offset = ALIGN(start + SZ_8K + sizeof(struct page) * npfns,
>
> I'd prefer if this was not dynamic and was instead set to the maximum
> size of 'struct page' across all archs just to enhance cross-arch
> compatibility. I think that answer is '64'.


That still doesn't take care of the case where we add new elements to
struct page later. If we have struct page size changing across
architectures, we should still be ok as long as new size is less than what is
stored in pfn superblock? I understand the desire to keep it
non-dynamic. But we also need to make sure we don't reserve less space
when creating a new namespace on a config that got struct page size >
64? 


>> +  align) - start;
>> } else if (nd_pfn->mode == PFN_MODE_RAM)
>> offset = ALIGN(start + SZ_8K, align) - start;
>> else
>> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
>> index af30cbe7a8ea..20e265a534f8 100644
>> --- a/drivers/nvdimm/region_devs.c
>> +++ b/drivers/nvdimm/region_devs.c
>> @@ -992,10 +992,10 @@ static struct nd_region *nd_region_create(struct 
>> nvdimm_bus *nvdimm_bus,
>> struct nd_mapping_desc *mapping = &ndr_desc->mapping[i];
>> struct nvdimm *nvdimm = mapping->nvdimm;
>>
>> -   if ((mapping->start | mapping->size) % SZ_4K) {
>> -   dev_err(&nvdimm_bus->dev, "%s: %s mapping%d is not 
>> 4K aligned\n",
>> -   caller, dev_name(&nvdimm->dev), i);
>> -
>> +   if ((mapping->start | mapping->size) % PAGE_SIZE) {
>> +   dev_err(&nvdimm_bus->dev,
>> +   "%s: %s mapping%d is not %ld aligned\n",
>> +   caller, dev_name(&nvdimm->dev), i, 
>> PAGE_SIZE);
>> return NULL;
>> 

Re: [PATCH v5 1/4] nvdimm: Consider probe return -EOPNOTSUPP as success

2019-08-19 Thread Aneesh Kumar K.V
Dan Williams  writes:

> On Tue, Aug 13, 2019 at 9:22 PM Dan Williams  wrote:
>>
>> Hi Aneesh, logic looks correct but there are some cleanups I'd like to
>> see and a lead-in patch that I attached.
>>
>> I've started prefixing nvdimm patches with:
>>
>> libnvdimm/$component:
>>
>> ...since this patch mostly impacts the pmem driver lets prefix it
>> "libnvdimm/pmem: "
>>
>> On Fri, Aug 9, 2019 at 12:45 AM Aneesh Kumar K.V
>>  wrote:
>> >
>> > This patch add -EOPNOTSUPP as return from probe callback to
>>
>> s/This patch add/Add/
>>
>> No need to say "this patch" it's obviously a patch.
>>
>> > indicate we were not able to initialize a namespace due to pfn superblock
>> > feature/version mismatch. We want to consider this a probe success so that
>> > we can create a new namespace seed and thereby avoid marking the failed
>> > namespace as the seed namespace.
>>
>> Please replace usage of "we" with the exact agent involved as which
>> "we" is being referred to gets confusing for the reader.
>>
>> i.e. "indicate that the pmem driver was not..." "The nvdimm core wants
>> to consider this...".
>>
>> >
>> > Signed-off-by: Aneesh Kumar K.V 
>> > ---
>> >  drivers/nvdimm/bus.c  |  2 +-
>> >  drivers/nvdimm/pmem.c | 26 ++
>> >  2 files changed, 23 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
>> > index 798c5c4aea9c..16c35e6446a7 100644
>> > --- a/drivers/nvdimm/bus.c
>> > +++ b/drivers/nvdimm/bus.c
>> > @@ -95,7 +95,7 @@ static int nvdimm_bus_probe(struct device *dev)
>> > rc = nd_drv->probe(dev);
>> > debug_nvdimm_unlock(dev);
>> >
>> > -   if (rc == 0)
>> > +   if (rc == 0 || rc == -EOPNOTSUPP)
>> > nd_region_probe_success(nvdimm_bus, dev);
>>
>> This now makes the nd_region_probe_success() helper obviously misnamed
>> since it now wants to take actions on non-probe success. I attached a
>> lead-in cleanup that you can pull into your series that renames that
>> routine to nd_region_advance_seeds().
>>
>> When you rebase this needs a comment about why EOPNOTSUPP has special 
>> handling.
>>
>> > else
>> > nd_region_disable(nvdimm_bus, dev);
>> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
>> > index 4c121dd03dd9..3f498881dd28 100644
>> > --- a/drivers/nvdimm/pmem.c
>> > +++ b/drivers/nvdimm/pmem.c
>> > @@ -490,6 +490,7 @@ static int pmem_attach_disk(struct device *dev,
>> >
>> >  static int nd_pmem_probe(struct device *dev)
>> >  {
>> > +   int ret;
>> > struct nd_namespace_common *ndns;
>> >
>> > ndns = nvdimm_namespace_common_probe(dev);
>> > @@ -505,12 +506,29 @@ static int nd_pmem_probe(struct device *dev)
>> > if (is_nd_pfn(dev))
>> > return pmem_attach_disk(dev, ndns);
>> >
>> > -   /* if we find a valid info-block we'll come back as that 
>> > personality */
>> > -   if (nd_btt_probe(dev, ndns) == 0 || nd_pfn_probe(dev, ndns) == 0
>> > -   || nd_dax_probe(dev, ndns) == 0)
>>
>> Similar need for an updated comment here to explain the special
>> translation of error codes.
>>
>> > +   ret = nd_btt_probe(dev, ndns);
>> > +   if (ret == 0)
>> > return -ENXIO;
>> > +   else if (ret == -EOPNOTSUPP)
>>
>> Are there cases where the btt driver needs to return EOPNOTSUPP? I'd
>> otherwise like to keep this special casing constrained to the pfn /
>> dax info block cases.
>
> In fact I think EOPNOTSUPP is only something that the device-dax case
> would be concerned with because that's the only interface that
> attempts to guarantee a given mapping granularity.

Don't we need to do similar error handling w.r.t. fsdax, when the pfn
superblock indicates a different PAGE_SIZE or struct page size? I don't think
btt needs to support EOPNOTSUPP, but should we keep it for consistency?
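
For illustration, a rough sketch of the probe flow being discussed (not the
actual respin; the early is_nd_btt()/is_nd_pfn()/is_nd_dax() dispatch is
omitted for brevity): only the pfn and dax info-block probes pass -EOPNOTSUPP
up, while btt keeps its existing handling.

/*
 * Rough sketch only, not the posted respin: -EOPNOTSUPP is propagated
 * from the pfn/dax info-block probes so the core can still advance the
 * namespace seed; btt keeps returning -ENXIO as before.
 */
static int nd_pmem_probe(struct device *dev)
{
	struct nd_namespace_common *ndns;
	int ret;

	ndns = nvdimm_namespace_common_probe(dev);
	if (IS_ERR(ndns))
		return PTR_ERR(ndns);

	/* if we find a valid info-block we'll come back as that personality */
	ret = nd_btt_probe(dev, ndns);
	if (ret == 0)
		return -ENXIO;

	ret = nd_pfn_probe(dev, ndns);
	if (ret == 0)
		return -ENXIO;
	else if (ret == -EOPNOTSUPP)	/* e.g. fsdax info-block geometry mismatch */
		return ret;

	ret = nd_dax_probe(dev, ndns);
	if (ret == 0)
		return -ENXIO;
	else if (ret == -EOPNOTSUPP)	/* e.g. devdax alignment not supported */
		return ret;

	return pmem_attach_disk(dev, ndns);
}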

-aneesh