Re: [PATCH trivial] KVM: PPC: Remove superfluous check for non-zero return value

2019-09-18 Thread Thomas Huth
On 18/09/2019 22.54, Greg Kurz wrote:
> On Wed, 18 Sep 2019 18:44:36 +0200
> Greg Kurz  wrote:
> 
>> On Wed, 11 Sep 2019 21:52:35 +0200
>> Thomas Huth  wrote:
>>
After the kfree()s have been removed in the previous
commit 9798f4ea71ea ("fix rollback when kvmppc_xive_create fails"),
the code can be simplified further to always "return ret"
now.
>>>
>>> Signed-off-by: Thomas Huth 
>>> ---
>>
>> This looks like a good candidate for trivial, hence Cc'ing Jiri
>> and adding trivial keyword in subject.
>>
>> Reviewed-by: Greg Kurz 
>>
> 
> Oops, the patch is correct but there are some fixes that require
> the return 0 to stay around...
> 
> https://patchwork.ozlabs.org/project/kvm-ppc/list/?series=129957

:-)

Ok, then please simply ignore my patch.

 Thomas


Re: [PATCH] mm/pgtable/debug: Fix test validating architecture page table helpers

2019-09-18 Thread Christophe Leroy




Le 18/09/2019 à 09:32, Anshuman Khandual a écrit :



On 09/13/2019 11:53 AM, Christophe Leroy wrote:

Fix build failure on powerpc.

Fix preemption imbalance.

Signed-off-by: Christophe Leroy 
---
  mm/arch_pgtable_test.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/mm/arch_pgtable_test.c b/mm/arch_pgtable_test.c
index 8b4a92756ad8..f2b3c9ec35fa 100644
--- a/mm/arch_pgtable_test.c
+++ b/mm/arch_pgtable_test.c
@@ -24,6 +24,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  
@@ -400,6 +401,8 @@ static int __init arch_pgtable_tests_init(void)

p4d_clear_tests(p4dp);
pgd_clear_tests(mm, pgdp);
  
+	pte_unmap(ptep);

+
pmd_populate_tests(mm, pmdp, saved_ptep);
pud_populate_tests(mm, pudp, saved_pmdp);
p4d_populate_tests(mm, p4dp, saved_pudp);



Hello Christophe,

I am planning to fold this fix into the current patch and retain your
Signed-off-by. Are you okay with it ?



No problem, do whatever is convenient for you. You can keep the 
signed-off-by, or use tested-by: as I tested it on PPC32.


Christophe


Re: [PATCH V2 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-18 Thread Christophe Leroy




Le 19/09/2019 à 06:56, Anshuman Khandual a écrit :



On 09/18/2019 09:56 PM, Christophe Leroy wrote:



Le 18/09/2019 à 07:04, Anshuman Khandual a écrit :



On 09/13/2019 03:31 PM, Christophe Leroy wrote:



Le 13/09/2019 à 11:02, Anshuman Khandual a écrit :



+#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)


#ifdefs have to be avoided as much as possible, see below


Yeah but it has been a bit difficult to avoid all these $ifdef because of the
availability (or lack of it) of all these pgtable helpers in various config
combinations on all platforms.


As far as I can see these pgtable helpers should exist everywhere at least via 
asm-generic/ files.


But they might not actually do the right thing.



Can you spot a particular config which fails ?


Let's consider the following example (after removing the $ifdefs around it)
which builds successfully but fails to pass the intended test. This
is with an arm64 config with 4K page size and 39-bit VA space, which ends up
with a 3-level page table arrangement.

static void __init p4d_clear_tests(p4d_t *p4dp)
{
  p4d_t p4d = READ_ONCE(*p4dp);


My suggestion was not to completely drop the #ifdef but to do like you did in 
pgd_clear_tests() for instance, ie to add the following test on top of the 
function:

 if (mm_pud_folded(mm) || is_defined(__ARCH_HAS_5LEVEL_HACK))
     return;



Sometimes this does not really work. On some platforms, the combination of
__PAGETABLE_PUD_FOLDED and __ARCH_HAS_5LEVEL_HACK decides whether
helpers such as __pud() or __pgd() are even available for that platform.
Ideally it should have been through generic fallbacks in include/*/
but I guess there might be bugs on the platform or it has not been
changed to adopt the 5 level page table framework with the required folding
macros etc.


Yes. As I suggested below, it is most likely better to retain the 
#ifdef __ARCH_HAS_5LEVEL_HACK but replace the #ifdef 
__PAGETABLE_PUD_FOLDED with a runtime test of mm_pud_folded(mm)


As pointed out by Gerald, some arches don't have __PAGETABLE_PUD_FOLDED 
because they decide dynamically whether they fold the level or not, but 
they do have mm_pud_folded(mm)






  p4d = __p4d(p4d_val(p4d) | RANDOM_ORVALUE);
  WRITE_ONCE(*p4dp, p4d);
  p4d_clear(p4dp);
  p4d = READ_ONCE(*p4dp);
  WARN_ON(!p4d_none(p4d));
}

The following test hits an error at WARN_ON(!p4d_none(p4d))

[   16.757333] [ cut here ]
[   16.758019] WARNING: CPU: 11 PID: 1 at mm/arch_pgtable_test.c:187 
arch_pgtable_tests_init+0x24c/0x474


[...]


[   16.781282] ---[ end trace 042e6c40c0a3b038 ]---

On arm64 (4K page size|39 bits VA|3 level page table)

#elif CONFIG_PGTABLE_LEVELS == 3    /* Applicable here */
#define __ARCH_USE_5LEVEL_HACK
#include 

Which pulls in

#include 

which pulls in

#include 

which defines

static inline int p4d_none(p4d_t p4d)
{
  return 0;
}

which will invariably trigger WARN_ON(!p4d_none(p4d)).

Similarly for the next test, p4d_populate_tests(), which will always appear
successful because p4d_bad() invariably returns zero.

static inline int p4d_bad(p4d_t p4d)
{
  return 0;
}

static void __init p4d_populate_tests(struct mm_struct *mm, p4d_t *p4dp,
    pud_t *pudp)
{
  p4d_t p4d;

  /*
   * This entry points to next level page table page.
   * Hence this must not qualify as p4d_bad().
   */
  pud_clear(pudp);
  p4d_clear(p4dp);
  p4d_populate(mm, p4dp, pudp);
  p4d = READ_ONCE(*p4dp);
  WARN_ON(p4d_bad(p4d));
}

We should not run these tests for the above config because they are
not applicable and will invariably produce the same result.



[...]



So it shouldn't be an issue. Maybe if a couple of arches miss them, the best 
would be to fix the arches, since that's the purpose of your test suite, 
isn't it?


The runtime failures, as explained previously, are because of the folding, which
needs to be protected against as those tests are not even applicable. The compile
time failures are because the pxx_populate() signatures are platform specific,
depending on how many page table levels each platform really supports.



So IIUC, the compile-time problem is around __ARCH_HAS_5LEVEL_HACK. For all #if 
!defined(__PAGETABLE_PXX_FOLDED), something equivalent to the following should 
do the trick.

 if (mm_pxx_folded())
     return;


For the __ARCH_HAS_5LEVEL_HACK stuff, I think we should be able to regroup all 
impacted functions inside a single #ifdef __ARCH_HAS_5LEVEL_HACK


I was wondering if it would be better to

1) Minimize all #ifdefs in the code which might fail on some platforms
2) Restrict proposed test module to platforms where it builds and runs
3) Enable other platforms afterwards, once their build problems or other 
requirements are fixed


I understand that __ARCH_HAS_5LEVEL_HACK is a HACK as its name 
suggests, so you can't expect all platforms to go for an 

Re: [PATCH v3 2/2] powerpc/irq: inline call_do_irq() and call_do_softirq()

2019-09-18 Thread Christophe Leroy




Le 18/09/2019 à 18:39, Segher Boessenkool a écrit :

Hi Christophe,

On Wed, Sep 18, 2019 at 03:48:20PM +, Christophe Leroy wrote:

call_do_irq() and call_do_softirq() are quite similar on PPC32 and
PPC64 and are simple enough to be worth inlining.

Inlining them avoids an mflr/mtlr pair plus a save/reload on stack.


But you hardcode the calling sequence in inline asm, which for various
reasons is not a great idea.


+static inline void call_do_irq(struct pt_regs *regs, void *sp)
+{
+   register unsigned long r3 asm("r3") = (unsigned long)regs;
+
+   asm volatile(
+   "  "PPC_STLU"1, %2(%1);\n"
+   "  mr  1, %1;\n"
+   "  bl  %3;\n"
+   "  "PPC_LL"  1, 0(1);\n" : "+r"(r3) :
+   "b"(sp), "i"(THREAD_SIZE - STACK_FRAME_OVERHEAD), "i"(__do_irq) 
:
+   "lr", "xer", "ctr", "memory", "cr0", "cr1", "cr5", "cr6", "cr7",
+   "r0", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12");
+}


I realise the original code had this...  Loading the old stack pointer
value back from the stack creates a bottleneck (via the store->load
forwarding it requires).  It could just use
   addi 1,1,-(%2)
here, which can also be written as
   addi 1,1,%n2
(that is portable to all architectures btw).


No, we switched the stack before the bl call: we replaced r1 with r3 after 
saving the old r1 onto the new stack. Now we have to restore the original r1.





Please write the  "+r"(r3)  on the next line?  Not on the same line as
the multi-line template.  This makes things more readable.


I don't know if using functions as an "i" works properly...  It probably
does, it's just not something that you see often :-)


What about r2?  Various ABIs handle that differently.  This might make
it impossible to share implementation between 32-bit and 64-bit for this.
But we could add it to the clobber list worst case, that will always work.


Isn't r2 non-volatile on all ABIs ?




So anyway, it looks to me like it will work.  Nice cleanup.  Would be
better if you could do the call to __do_irq from C code, but maybe we
cannot have everything ;-)


sparc does it the following way; is there no risk that GCC adds unwanted 
code in between that is not aware that the stack pointer has changed?


void do_softirq_own_stack(void)
{
void *orig_sp, *sp = softirq_stack[smp_processor_id()];

sp += THREAD_SIZE - 192 - STACK_BIAS;

__asm__ __volatile__("mov %%sp, %0\n\t"
 "mov %1, %%sp"
 : "=&r" (orig_sp)
 : "r" (sp));
__do_softirq();
__asm__ __volatile__("mov %0, %%sp"
 : : "r" (orig_sp));
}

If the above is no risk, then can we do the same on powerpc ?



Reviewed-by: Segher Boessenkool 


Thanks for the review.

Christophe


Re: [PATCH V2 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-18 Thread Anshuman Khandual



On 09/18/2019 09:56 PM, Christophe Leroy wrote:
> 
> 
> Le 18/09/2019 à 07:04, Anshuman Khandual a écrit :
>>
>>
>> On 09/13/2019 03:31 PM, Christophe Leroy wrote:
>>>
>>>
>>> Le 13/09/2019 à 11:02, Anshuman Khandual a écrit :

>> +#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)
>
> #ifdefs have to be avoided as much as possible, see below

 Yeah but it has been a bit difficult to avoid all these $ifdef because of the
 availability (or lack of it) of all these pgtable helpers in various 
 config
 combinations on all platforms.
>>>
>>> As far as I can see these pgtable helpers should exist everywhere at least 
>>> via asm-generic/ files.
>>
>> But they might not actually do the right thing.
>>
>>>
>>> Can you spot a particular config which fails ?
>>
>> Let's consider the following example (after removing the $ifdefs around it)
>> which builds successfully but fails to pass the intended test. This
>> is with an arm64 config with 4K page size and 39-bit VA space, which ends up
>> with a 3-level page table arrangement.
>>
>> static void __init p4d_clear_tests(p4d_t *p4dp)
>> {
>>  p4d_t p4d = READ_ONCE(*p4dp);
> 
> My suggestion was not to completely drop the #ifdef but to do like you did in 
> pgd_clear_tests() for instance, ie to add the following test on top of the 
> function:
> 
> if (mm_pud_folded(mm) || is_defined(__ARCH_HAS_5LEVEL_HACK))
>     return;
> 

Sometimes this does not really work. On some platforms, the combination of
__PAGETABLE_PUD_FOLDED and __ARCH_HAS_5LEVEL_HACK decides whether
helpers such as __pud() or __pgd() are even available for that platform.
Ideally it should have been through generic fallbacks in include/*/
but I guess there might be bugs on the platform or it has not been
changed to adopt the 5 level page table framework with the required folding
macros etc.

>>
>>  p4d = __p4d(p4d_val(p4d) | RANDOM_ORVALUE);
>>  WRITE_ONCE(*p4dp, p4d);
>>  p4d_clear(p4dp);
>>  p4d = READ_ONCE(*p4dp);
>>  WARN_ON(!p4d_none(p4d));
>> }
>>
>> The following test hits an error at WARN_ON(!p4d_none(p4d))
>>
>> [   16.757333] [ cut here ]
>> [   16.758019] WARNING: CPU: 11 PID: 1 at mm/arch_pgtable_test.c:187 
>> arch_pgtable_tests_init+0x24c/0x474
>> [   16.759455] Modules linked in:
>> [   16.759952] CPU: 11 PID: 1 Comm: swapper/0 Not tainted 
>> 5.3.0-next-20190916-5-g61c218153bb8-dirty #222
>> [   16.761449] Hardware name: linux,dummy-virt (DT)
>> [   16.762185] pstate: 0045 (nzcv daif +PAN -UAO)
>> [   16.762964] pc : arch_pgtable_tests_init+0x24c/0x474
>> [   16.763750] lr : arch_pgtable_tests_init+0x174/0x474
>> [   16.764534] sp : ffc011d7bd50
>> [   16.765065] x29: ffc011d7bd50 x28: 1756bac0
>> [   16.765908] x27: ff85ddaf3000 x26: 02e8
>> [   16.766767] x25: ffc0111ce000 x24: ff85ddaf32e8
>> [   16.767606] x23: ff85ddaef278 x22: 0045cc844000
>> [   16.768445] x21: 00065daef003 x20: 1754
>> [   16.769283] x19: ff85ddb6 x18: 0014
>> [   16.770122] x17: 980426bb x16: 698594c6
>> [   16.770976] x15: 66e25a88 x14: 
>> [   16.771813] x13: 1754 x12: 000a
>> [   16.772651] x11: ff85fcfd0a40 x10: 0001
>> [   16.773488] x9 : 0008 x8 : ffc01143ab26
>> [   16.774336] x7 :  x6 : 
>> [   16.775180] x5 :  x4 : 
>> [   16.776018] x3 : 1756bbe8 x2 : 00065daeb003
>> [   16.776856] x1 : 0065daeb x0 : f000
>> [   16.777693] Call trace:
>> [   16.778092]  arch_pgtable_tests_init+0x24c/0x474
>> [   16.778843]  do_one_initcall+0x74/0x1b0
>> [   16.779458]  kernel_init_freeable+0x1cc/0x290
>> [   16.780151]  kernel_init+0x10/0x100
>> [   16.780710]  ret_from_fork+0x10/0x18
>> [   16.781282] ---[ end trace 042e6c40c0a3b038 ]---
>>
>> On arm64 (4K page size|39 bits VA|3 level page table)
>>
>> #elif CONFIG_PGTABLE_LEVELS == 3    /* Applicable here */
>> #define __ARCH_USE_5LEVEL_HACK
>> #include 
>>
>> Which pulls in
>>
>> #include 
>>
>> which pulls in
>>
>> #include 
>>
>> which defines
>>
>> static inline int p4d_none(p4d_t p4d)
>> {
>>  return 0;
>> }
>>
>> which will invariably trigger WARN_ON(!p4d_none(p4d)).
>>
>> Similarly for the next test, p4d_populate_tests(), which will always appear
>> successful because p4d_bad() invariably returns zero.
>>
>> static inline int p4d_bad(p4d_t p4d)
>> {
>>  return 0;
>> }
>>
>> static void __init p4d_populate_tests(struct mm_struct *mm, p4d_t *p4dp,
>>    pud_t *pudp)
>> {
>>  p4d_t p4d;
>>
>>  /*
>>   * This entry points to next level page table page.
>>   * Hence this must not qualify as p4d_bad().
>>   */
>>  pud_clear(pudp);
>>  p4d_clear(p4dp);
>> 

Re: [PATCH 3/5] ocxl: Tally up the LPC memory on a link & allow it to be mapped

2019-09-18 Thread Alastair D'Silva
On Wed, 2019-09-18 at 16:02 +0200, Frederic Barrat wrote:
> 
> Le 17/09/2019 à 03:42, Alastair D'Silva a écrit :
> > From: Alastair D'Silva 
> > 
> > Tally up the LPC memory on an OpenCAPI link & allow it to be mapped
> > 
> > Signed-off-by: Alastair D'Silva 
> > ---
> >   drivers/misc/ocxl/core.c  |  9 +
> >   drivers/misc/ocxl/link.c  | 61
> > +++
> >   drivers/misc/ocxl/ocxl_internal.h | 42 +
> >   3 files changed, 112 insertions(+)
> > 
> > diff --git a/drivers/misc/ocxl/core.c b/drivers/misc/ocxl/core.c
> > index b7a09b21ab36..fdfe4e0a34e1 100644
> > --- a/drivers/misc/ocxl/core.c
> > +++ b/drivers/misc/ocxl/core.c
> > @@ -230,8 +230,17 @@ static int configure_afu(struct ocxl_afu *afu,
> > u8 afu_idx, struct pci_dev *dev)
> > if (rc)
> > goto err_free_pasid;
> >   
> > +   if (afu->config.lpc_mem_size || afu-
> > >config.special_purpose_mem_size) {
> > +   rc = ocxl_link_add_lpc_mem(afu->fn->link,
> > +   afu->config.lpc_mem_size + afu-
> > >config.special_purpose_mem_size);
> 
> I don't think we should count the special purpose memory, as it's
> not 
> meant to be accessed through the GPU mem BAR, but I'll check.

At least for OpenCAPI 3.0, there is no other in-spec way to access the
memory if it is not mapped by the NPU.

> 
> What happens when unconfiguring the AFU? We should reduce the range
> (see 
> also below). Partial reconfig doesn't seem so far off, so we should
> take 
> it into account.
> 

The mapping is left until the last AFU on the link offlines its
memory, at which point we clear the mapping from the NPU.

> 
> > +   if (rc)
> > +   goto err_free_mmio;
> > +   }
> > +
> > return 0;
> >   
> > +err_free_mmio:
> > +   unmap_mmio_areas(afu);
> >   err_free_pasid:
> > reclaim_afu_pasid(afu);
> >   err_free_actag:
> > diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
> > index 58d111afd9f6..2874811a4398 100644
> > --- a/drivers/misc/ocxl/link.c
> > +++ b/drivers/misc/ocxl/link.c
> > @@ -84,6 +84,11 @@ struct ocxl_link {
> > int dev;
> > atomic_t irq_available;
> > struct spa *spa;
> > +   struct mutex lpc_mem_lock;
> > +   u64 lpc_mem_sz; /* Total amount of LPC memory presented on the
> > link */
> > +   u64 lpc_mem;
> > +   int lpc_consumers;
> > +
> > void *platform_data;
> >   };
> >   static struct list_head links_list = LIST_HEAD_INIT(links_list);
> > @@ -396,6 +401,8 @@ static int alloc_link(struct pci_dev *dev, int
> > PE_mask, struct ocxl_link **out_l
> > if (rc)
> > goto err_spa;
> >   
> > +   mutex_init(&link->lpc_mem_lock);
> > +
> > /* platform specific hook */
> > rc = pnv_ocxl_spa_setup(dev, link->spa->spa_mem, PE_mask,
> > &link->platform_data);
> > @@ -711,3 +718,57 @@ void ocxl_link_free_irq(void *link_handle, int
> > hw_irq)
> > atomic_inc(&link->irq_available);
> >   }
> >   EXPORT_SYMBOL_GPL(ocxl_link_free_irq);
> > +
> > +int ocxl_link_add_lpc_mem(void *link_handle, u64 size)
> > +{
> > +   struct ocxl_link *link = (struct ocxl_link *) link_handle;
> > +
> > +   u64 orig_size;
> > +   bool good = false;
> > +
> > +   mutex_lock(&link->lpc_mem_lock);
> > +   orig_size = link->lpc_mem_sz;
> > +   link->lpc_mem_sz += size;
> 
> 
> We have a choice to make here:
> 1. either we only support one LPC memory-carrying AFU (and the above
> is 
> overkill)
> 2. or we support multiple AFUs with LPC memory (on the same
> function), 
> but then I think the above is too simple.
> 
>  From the opencapi spec, each AFU can define a chunk of memory with
> a 
> starting address and a size. There's no rule which says they have to
> be 
> contiguous. There's no rule which says it must start at 0. So to
> support 
> multiple AFUs with LPC memory, we should record the current maximum 
> range instead of just the global size. Ultimately, we need to tell
> the 
> NPU the range of permissible addresses. It starts at 0, so we need
> to 
> take into account any initial offset and holes.
> 
> I would go for option 2, to at least be consistent within ocxl and 
> support multiple AFUs. Even though I don't think we'll see FPGA
> images 
> with multiple AFUs with LPC memory any time soon.
> 

I'll rework this to take an offset & size; the NPU will map from the
base address up to the largest offset + size provided across all AFUs
on the link.

> 
> > +   good = orig_size < link->lpc_mem_sz;
> > +   mutex_unlock(&link->lpc_mem_lock);
> > +
> > +   // Check for overflow
> > +   return (good) ? 0 : -EINVAL;
> > +}
> > +EXPORT_SYMBOL_GPL(ocxl_link_add_lpc_mem);
> 
> Do the symbol really need to be exported? IIUC, the next patch
> defines a 
> higher level ocxl_afu_map_lpc_mem() which is meant to be called by a 
> calling driver.
> 

No, I'll remove it.

> 
> > +
> > +u64 ocxl_link_lpc_map(void *link_handle, struct pci_dev *pdev)
> > +{
> > +   struct ocxl_link *link = (struct ocxl_link *) link_handle;
> > +
> > + 

Re: [PATCH v3 1/5] powerpc: Allow flush_icache_range to work across ranges >4GB

2019-09-18 Thread Alastair D'Silva
On Thu, 2019-09-19 at 13:43 +1000, Michael Ellerman wrote:
> "Alastair D'Silva"  writes:
> > From: Alastair D'Silva 
> > 
> > When calling flush_icache_range with a size >4GB, we were masking
> > off the upper 32 bits, so we would incorrectly flush a range
> > smaller
> > than intended.
> > 
> > __kernel_sync_dicache in the 64 bit VDSO has the same bug.
> 
> Please fix that in a separate patch.
> 
> Your subject doesn't mention __kernel_sync_dicache(), and also the
> two
> changes backport differently, so it's better if they're done as
> separate
> patches.
> 

Ok.

> cheers
> 
> > This patch replaces the 32 bit shifts with 64 bit ones, so that
> > the full size is accounted for.
> > 
> > Signed-off-by: Alastair D'Silva 
> > Cc: sta...@vger.kernel.org
> > ---
> >  arch/powerpc/kernel/misc_64.S   | 4 ++--
> >  arch/powerpc/kernel/vdso64/cacheflush.S | 4 ++--
> >  2 files changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/misc_64.S
> > b/arch/powerpc/kernel/misc_64.S
> > index b55a7b4cb543..9bc0aa9aeb65 100644
> > --- a/arch/powerpc/kernel/misc_64.S
> > +++ b/arch/powerpc/kernel/misc_64.S
> > @@ -82,7 +82,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_COHERENT_ICACHE)
> > subfr8,r6,r4/* compute length */
> > add r8,r8,r5/* ensure we get enough */
> > lwz r9,DCACHEL1LOGBLOCKSIZE(r10)/* Get log-2 of cache block
> > size */
> > -   srw.r8,r8,r9/* compute line count */
> > +   srd.r8,r8,r9/* compute line count */
> > beqlr   /* nothing to do? */
> > mtctr   r8
> >  1: dcbst   0,r6
> > @@ -98,7 +98,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_COHERENT_ICACHE)
> > subfr8,r6,r4/* compute length */
> > add r8,r8,r5
> > lwz r9,ICACHEL1LOGBLOCKSIZE(r10)/* Get log-2 of Icache
> > block size */
> > -   srw.r8,r8,r9/* compute line count */
> > +   srd.r8,r8,r9/* compute line count */
> > beqlr   /* nothing to do? */
> > mtctr   r8
> >  2: icbi0,r6
> > diff --git a/arch/powerpc/kernel/vdso64/cacheflush.S
> > b/arch/powerpc/kernel/vdso64/cacheflush.S
> > index 3f92561a64c4..526f5ba2593e 100644
> > --- a/arch/powerpc/kernel/vdso64/cacheflush.S
> > +++ b/arch/powerpc/kernel/vdso64/cacheflush.S
> > @@ -35,7 +35,7 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
> > subfr8,r6,r4/* compute length */
> > add r8,r8,r5/* ensure we get enough */
> > lwz r9,CFG_DCACHE_LOGBLOCKSZ(r10)
> > -   srw.r8,r8,r9/* compute line count */
> > +   srd.r8,r8,r9/* compute line count */
> > crclr   cr0*4+so
> > beqlr   /* nothing to do? */
> > mtctr   r8
> > @@ -52,7 +52,7 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
> > subfr8,r6,r4/* compute length */
> > add r8,r8,r5
> > lwz r9,CFG_ICACHE_LOGBLOCKSZ(r10)
> > -   srw.r8,r8,r9/* compute line count */
> > +   srd.r8,r8,r9/* compute line count */
> > crclr   cr0*4+so
> > beqlr   /* nothing to do? */
> > mtctr   r8
> > -- 
> > 2.21.0
-- 
Alastair D'Silva
Open Source Developer
Linux Technology Centre, IBM Australia
mob: 0423 762 819



Re: [PATCH v3 1/5] powerpc: Allow flush_icache_range to work across ranges >4GB

2019-09-18 Thread Michael Ellerman
"Alastair D'Silva"  writes:
> From: Alastair D'Silva 
>
> When calling flush_icache_range with a size >4GB, we were masking
> off the upper 32 bits, so we would incorrectly flush a range smaller
> than intended.
>
> __kernel_sync_dicache in the 64 bit VDSO has the same bug.

Please fix that in a separate patch.

Your subject doesn't mention __kernel_sync_dicache(), and also the two
changes backport differently, so it's better if they're done as separate
patches.

cheers

> This patch replaces the 32 bit shifts with 64 bit ones, so that
> the full size is accounted for.
>
> Signed-off-by: Alastair D'Silva 
> Cc: sta...@vger.kernel.org
> ---
>  arch/powerpc/kernel/misc_64.S   | 4 ++--
>  arch/powerpc/kernel/vdso64/cacheflush.S | 4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kernel/misc_64.S b/arch/powerpc/kernel/misc_64.S
> index b55a7b4cb543..9bc0aa9aeb65 100644
> --- a/arch/powerpc/kernel/misc_64.S
> +++ b/arch/powerpc/kernel/misc_64.S
> @@ -82,7 +82,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_COHERENT_ICACHE)
>   subfr8,r6,r4/* compute length */
>   add r8,r8,r5/* ensure we get enough */
>   lwz r9,DCACHEL1LOGBLOCKSIZE(r10)/* Get log-2 of cache block 
> size */
> - srw.r8,r8,r9/* compute line count */
> + srd.r8,r8,r9/* compute line count */
>   beqlr   /* nothing to do? */
>   mtctr   r8
>  1:   dcbst   0,r6
> @@ -98,7 +98,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_COHERENT_ICACHE)
>   subfr8,r6,r4/* compute length */
>   add r8,r8,r5
>   lwz r9,ICACHEL1LOGBLOCKSIZE(r10)/* Get log-2 of Icache block 
> size */
> - srw.r8,r8,r9/* compute line count */
> + srd.r8,r8,r9/* compute line count */
>   beqlr   /* nothing to do? */
>   mtctr   r8
>  2:   icbi0,r6
> diff --git a/arch/powerpc/kernel/vdso64/cacheflush.S 
> b/arch/powerpc/kernel/vdso64/cacheflush.S
> index 3f92561a64c4..526f5ba2593e 100644
> --- a/arch/powerpc/kernel/vdso64/cacheflush.S
> +++ b/arch/powerpc/kernel/vdso64/cacheflush.S
> @@ -35,7 +35,7 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
>   subfr8,r6,r4/* compute length */
>   add r8,r8,r5/* ensure we get enough */
>   lwz r9,CFG_DCACHE_LOGBLOCKSZ(r10)
> - srw.r8,r8,r9/* compute line count */
> + srd.r8,r8,r9/* compute line count */
>   crclr   cr0*4+so
>   beqlr   /* nothing to do? */
>   mtctr   r8
> @@ -52,7 +52,7 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
>   subfr8,r6,r4/* compute length */
>   add r8,r8,r5
>   lwz r9,CFG_ICACHE_LOGBLOCKSZ(r10)
> - srw.r8,r8,r9/* compute line count */
> + srd.r8,r8,r9/* compute line count */
>   crclr   cr0*4+so
>   beqlr   /* nothing to do? */
>   mtctr   r8
> -- 
> 2.21.0


RE: [PATCH V2 3/4] ASoC: pcm_dmaengine: Extract snd_dmaengine_pcm_set_runtime_hwparams

2019-09-18 Thread S.j. Wang
Hi

> 
> When setting the runtime hardware parameters, we may need to query the
> DMA capabilities to complete the parameters.
> 
> This patch extracts this operation from the
> dmaengine_pcm_set_runtime_hwparams function into a separate function,
> snd_dmaengine_pcm_set_runtime_hwparams, so that other components
> which need this feature can call it.
> 
> Signed-off-by: Shengjiu Wang 
> ---

The kbuild test robot reported a compile issue; I will resend v3 after fixing it.

Best regards
Wang shengjiu


Re: [PATCH 4/5] ocxl: Add functions to map/unmap LPC memory

2019-09-18 Thread Alastair D'Silva
On Tue, 2019-09-17 at 11:43 +1000, Alastair D'Silva wrote:
> From: Alastair D'Silva 
> 
> Add functions to map/unmap LPC memory
> 
> Signed-off-by: Alastair D'Silva 
> ---
>  drivers/misc/ocxl/config.c|  4 +++
>  drivers/misc/ocxl/core.c  | 50
> +++
>  drivers/misc/ocxl/link.c  |  4 +--
>  drivers/misc/ocxl/ocxl_internal.h | 10 +--
>  include/misc/ocxl.h   | 18 +++
>  5 files changed, 82 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
> index c8e19bfb5ef9..fb0c3b6f8312 100644
> --- a/drivers/misc/ocxl/config.c
> +++ b/drivers/misc/ocxl/config.c
> @@ -568,6 +568,10 @@ static int read_afu_lpc_memory_info(struct
> pci_dev *dev,
>   afu->special_purpose_mem_size =
>   total_mem_size - lpc_mem_size;
>   }
> +
> + dev_info(&dev->dev, "Probed LPC memory of %#llx bytes and
> special purpose memory of %#llx bytes\n",
> + afu->lpc_mem_size, afu->special_purpose_mem_size);
> +
>   return 0;
>  }
>  
> diff --git a/drivers/misc/ocxl/core.c b/drivers/misc/ocxl/core.c
> index fdfe4e0a34e1..eb24bb9d655f 100644
> --- a/drivers/misc/ocxl/core.c
> +++ b/drivers/misc/ocxl/core.c
> @@ -210,6 +210,55 @@ static void unmap_mmio_areas(struct ocxl_afu
> *afu)
>   release_fn_bar(afu->fn, afu->config.global_mmio_bar);
>  }
>  
> +int ocxl_map_lpc_mem(struct ocxl_afu *afu)
> +{
> + struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent);
> +
> + if ((afu->config.lpc_mem_size + afu-
> >config.special_purpose_mem_size) == 0)
> + return 0;
> +
> + afu->lpc_base_addr = ocxl_link_lpc_online(afu->fn->link, dev);
> + if (afu->lpc_base_addr == 0)
> + return -EINVAL;
> +
> + if (afu->config.lpc_mem_size) {
> + afu->lpc_res.start = afu->lpc_base_addr + afu-
> >config.lpc_mem_offset;
> + afu->lpc_res.end = afu->lpc_res.start + afu-
> >config.lpc_mem_size - 1;
> + }
> +
> + if (afu->config.special_purpose_mem_size) {
> + afu->special_purpose_res.start = afu->lpc_base_addr +
> +  afu-
> >config.special_purpose_mem_offset;
> + afu->special_purpose_res.end = afu-
> >special_purpose_res.start +
> +afu-
> >config.special_purpose_mem_size - 1;
> + }
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(ocxl_map_lpc_mem);
> +
> +struct resource *ocxl_afu_lpc_mem(struct ocxl_afu *afu)
> +{
> + return &afu->lpc_res;
> +}
> +EXPORT_SYMBOL(ocxl_afu_lpc_mem);
> +
> +static void unmap_lpc_mem(struct ocxl_afu *afu)
> +{
> + struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent);
> +
> + if (afu->lpc_res.start || afu->special_purpose_res.start) {
> + void *link = afu->fn->link;
> +
> + ocxl_link_lpc_offline(link, dev);
> +
> + afu->lpc_res.start = 0;
> + afu->lpc_res.end = 0;
> + afu->special_purpose_res.start = 0;
> + afu->special_purpose_res.end = 0;
> + }
> +}
> +
>  static int configure_afu(struct ocxl_afu *afu, u8 afu_idx, struct
> pci_dev *dev)
>  {
>   int rc;
> @@ -250,6 +299,7 @@ static int configure_afu(struct ocxl_afu *afu, u8
> afu_idx, struct pci_dev *dev)
>  
>  static void deconfigure_afu(struct ocxl_afu *afu)
>  {
> + unmap_lpc_mem(afu);
>   unmap_mmio_areas(afu);
>   reclaim_afu_pasid(afu);
>   reclaim_afu_actag(afu);
> diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
> index 2874811a4398..9e303a5f4d85 100644
> --- a/drivers/misc/ocxl/link.c
> +++ b/drivers/misc/ocxl/link.c
> @@ -738,7 +738,7 @@ int ocxl_link_add_lpc_mem(void *link_handle, u64
> size)
>  }
>  EXPORT_SYMBOL_GPL(ocxl_link_add_lpc_mem);
>  
> -u64 ocxl_link_lpc_map(void *link_handle, struct pci_dev *pdev)
> +u64 ocxl_link_lpc_online(void *link_handle, struct pci_dev *pdev)
>  {
>   struct ocxl_link *link = (struct ocxl_link *) link_handle;
>  
> @@ -759,7 +759,7 @@ u64 ocxl_link_lpc_map(void *link_handle, struct
> pci_dev *pdev)
>   return link->lpc_mem;
>  }
>  
> -void ocxl_link_lpc_release(void *link_handle, struct pci_dev *pdev)
> +void ocxl_link_lpc_offline(void *link_handle, struct pci_dev *pdev)
>  {
>   struct ocxl_link *link = (struct ocxl_link *) link_handle;
>  
> diff --git a/drivers/misc/ocxl/ocxl_internal.h
> b/drivers/misc/ocxl/ocxl_internal.h
> index db2647a90fc8..5656a4aab5b7 100644
> --- a/drivers/misc/ocxl/ocxl_internal.h
> +++ b/drivers/misc/ocxl/ocxl_internal.h
> @@ -52,6 +52,12 @@ struct ocxl_afu {
>   void __iomem *global_mmio_ptr;
>   u64 pp_mmio_start;
>   void *private;
> + u64 lpc_base_addr; /* Covers both LPC & special purpose memory
> */
> + struct bin_attribute attr_global_mmio;
> + struct bin_attribute attr_lpc_mem;

The bin_attributes are not needed here.

> + struct resource lpc_res;
> + struct bin_attribute 

Re: [PATCH 4/5] ocxl: Add functions to map/unmap LPC memory

2019-09-18 Thread Alastair D'Silva
On Wed, 2019-09-18 at 16:03 +0200, Frederic Barrat wrote:
> 
> Le 17/09/2019 à 03:43, Alastair D'Silva a écrit :
> > From: Alastair D'Silva 
> > 
> > Add functions to map/unmap LPC memory
> > 
> > Signed-off-by: Alastair D'Silva 
> > ---
> >   drivers/misc/ocxl/config.c|  4 +++
> >   drivers/misc/ocxl/core.c  | 50
> > +++
> >   drivers/misc/ocxl/link.c  |  4 +--
> >   drivers/misc/ocxl/ocxl_internal.h | 10 +--
> >   include/misc/ocxl.h   | 18 +++
> >   5 files changed, 82 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/misc/ocxl/config.c
> > b/drivers/misc/ocxl/config.c
> > index c8e19bfb5ef9..fb0c3b6f8312 100644
> > --- a/drivers/misc/ocxl/config.c
> > +++ b/drivers/misc/ocxl/config.c
> > @@ -568,6 +568,10 @@ static int read_afu_lpc_memory_info(struct
> > pci_dev *dev,
> > afu->special_purpose_mem_size =
> > total_mem_size - lpc_mem_size;
> > }
> > +
> > +   dev_info(&dev->dev, "Probed LPC memory of %#llx bytes and
> > special purpose memory of %#llx bytes\n",
> > +   afu->lpc_mem_size, afu->special_purpose_mem_size);
> > +
> > return 0;
> >   }
> >   
> > diff --git a/drivers/misc/ocxl/core.c b/drivers/misc/ocxl/core.c
> > index fdfe4e0a34e1..eb24bb9d655f 100644
> > --- a/drivers/misc/ocxl/core.c
> > +++ b/drivers/misc/ocxl/core.c
> > @@ -210,6 +210,55 @@ static void unmap_mmio_areas(struct ocxl_afu
> > *afu)
> > release_fn_bar(afu->fn, afu->config.global_mmio_bar);
> >   }
> >   
> > +int ocxl_map_lpc_mem(struct ocxl_afu *afu)
> > +{
> > +   struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent);
> > +
> > +   if ((afu->config.lpc_mem_size + afu-
> > >config.special_purpose_mem_size) == 0)
> > +   return 0;
> > +
> > +   afu->lpc_base_addr = ocxl_link_lpc_online(afu->fn->link, dev);
> > +   if (afu->lpc_base_addr == 0)
> > +   return -EINVAL;
> > +
> > +   if (afu->config.lpc_mem_size) {
> > +   afu->lpc_res.start = afu->lpc_base_addr + afu-
> > >config.lpc_mem_offset;
> > +   afu->lpc_res.end = afu->lpc_res.start + afu-
> > >config.lpc_mem_size - 1;
> > +   }
> > +
> > +   if (afu->config.special_purpose_mem_size) {
> > +   afu->special_purpose_res.start = afu->lpc_base_addr +
> > +afu-
> > >config.special_purpose_mem_offset;
> > +   afu->special_purpose_res.end = afu-
> > >special_purpose_res.start +
> > +  afu-
> > >config.special_purpose_mem_size - 1;
> > +   }
> > +
> > +   return 0;
> > +}
> > +EXPORT_SYMBOL(ocxl_map_lpc_mem);
> > +
> > +struct resource *ocxl_afu_lpc_mem(struct ocxl_afu *afu)
> > +{
> > +   return &afu->lpc_res;
> > +}
> > +EXPORT_SYMBOL(ocxl_afu_lpc_mem);
> > +
> > +static void unmap_lpc_mem(struct ocxl_afu *afu)
> > +{
> > +   struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent);
> > +
> > +   if (afu->lpc_res.start || afu->special_purpose_res.start) {
> > +   void *link = afu->fn->link;
> > +
> > +   ocxl_link_lpc_offline(link, dev);
> > +
> > +   afu->lpc_res.start = 0;
> > +   afu->lpc_res.end = 0;
> > +   afu->special_purpose_res.start = 0;
> > +   afu->special_purpose_res.end = 0;
> > +   }
> > +}
> > +
> >   static int configure_afu(struct ocxl_afu *afu, u8 afu_idx, struct
> > pci_dev *dev)
> >   {
> > int rc;
> > @@ -250,6 +299,7 @@ static int configure_afu(struct ocxl_afu *afu,
> > u8 afu_idx, struct pci_dev *dev)
> >   
> >   static void deconfigure_afu(struct ocxl_afu *afu)
> >   {
> > +   unmap_lpc_mem(afu);
> > unmap_mmio_areas(afu);
> > reclaim_afu_pasid(afu);
> > reclaim_afu_actag(afu);
> > diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
> > index 2874811a4398..9e303a5f4d85 100644
> > --- a/drivers/misc/ocxl/link.c
> > +++ b/drivers/misc/ocxl/link.c
> > @@ -738,7 +738,7 @@ int ocxl_link_add_lpc_mem(void *link_handle,
> > u64 size)
> >   }
> >   EXPORT_SYMBOL_GPL(ocxl_link_add_lpc_mem);
> >   
> > -u64 ocxl_link_lpc_map(void *link_handle, struct pci_dev *pdev)
> > +u64 ocxl_link_lpc_online(void *link_handle, struct pci_dev *pdev)
> >   {
> > struct ocxl_link *link = (struct ocxl_link *) link_handle;
> >   
> > @@ -759,7 +759,7 @@ u64 ocxl_link_lpc_map(void *link_handle, struct
> > pci_dev *pdev)
> > return link->lpc_mem;
> >   }
> >   
> > -void ocxl_link_lpc_release(void *link_handle, struct pci_dev
> > *pdev)
> > +void ocxl_link_lpc_offline(void *link_handle, struct pci_dev
> > *pdev)
> 
> Could we avoid the renaming by squashing it with the previous patch?
> 

Yup, good catch.

> 
> >   {
> > struct ocxl_link *link = (struct ocxl_link *) link_handle;
> >   
> > diff --git a/drivers/misc/ocxl/ocxl_internal.h
> > b/drivers/misc/ocxl/ocxl_internal.h
> > index db2647a90fc8..5656a4aab5b7 100644
> > --- a/drivers/misc/ocxl/ocxl_internal.h
> > +++ b/drivers/misc/ocxl/ocxl_internal.h
> > @@ -52,6 +52,12 @@ struct ocxl_afu 

Re: [PATCH 2/5] powerpc: Map & release OpenCAPI LPC memory

2019-09-18 Thread Alastair D'Silva
On Wed, 2019-09-18 at 16:03 +0200, Frederic Barrat wrote:
> 
> On 17/09/2019 at 03:42, Alastair D'Silva wrote:
> > From: Alastair D'Silva 
> > 
> > Map & release OpenCAPI LPC memory.
> > 
> > Signed-off-by: Alastair D'Silva 
> > ---
> >   arch/powerpc/include/asm/pnv-ocxl.h   |  2 ++
> >   arch/powerpc/platforms/powernv/ocxl.c | 42
> > +++
> >   2 files changed, 44 insertions(+)
> > 
> > diff --git a/arch/powerpc/include/asm/pnv-ocxl.h
> > b/arch/powerpc/include/asm/pnv-ocxl.h
> > index 7de82647e761..f8f8ffb48aa8 100644
> > --- a/arch/powerpc/include/asm/pnv-ocxl.h
> > +++ b/arch/powerpc/include/asm/pnv-ocxl.h
> > @@ -32,5 +32,7 @@ extern int pnv_ocxl_spa_remove_pe_from_cache(void
> > *platform_data, int pe_handle)
> >   
> >   extern int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr);
> >   extern void pnv_ocxl_free_xive_irq(u32 irq);
> > +extern u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64
> > size);
> > +extern void pnv_ocxl_platform_lpc_release(struct pci_dev *pdev);
> >   
> >   #endif /* _ASM_PNV_OCXL_H */
> > diff --git a/arch/powerpc/platforms/powernv/ocxl.c
> > b/arch/powerpc/platforms/powernv/ocxl.c
> > index 8c65aacda9c8..81393728d6a3 100644
> > --- a/arch/powerpc/platforms/powernv/ocxl.c
> > +++ b/arch/powerpc/platforms/powernv/ocxl.c
> > @@ -475,6 +475,48 @@ void pnv_ocxl_spa_release(void *platform_data)
> >   }
> >   EXPORT_SYMBOL_GPL(pnv_ocxl_spa_release);
> >   
> > +u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size)
> > +{
> > +   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> > +   struct pnv_phb *phb = hose->private_data;
> > +   struct pci_dn *pdn = pci_get_pdn(pdev);
> > +   u32 bdfn = (pdn->busno << 8) | pdn->devfn;
> 
> We can spare a call to pci_get_pdn() with
> bdfn = (pdev->bus->number << 8) | pdev->devfn;
> 

Ok.
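For reference, both forms compute the same bdfn packing handed to OPAL: the bus number in bits 15..8 and devfn (device << 3 | function) in bits 7..0. A standalone sketch (the helper name is illustrative, not a kernel API):

```c
#include <assert.h>
#include <stdint.h>

/* Pack a PCI bus/devfn pair the way the OCXL code builds bdfn for OPAL:
 * bits 15..8 hold the bus number, bits 7..0 hold devfn (device<<3 | fn).
 * Illustrative helper, not a kernel API. */
uint32_t make_bdfn(uint8_t bus, uint8_t devfn)
{
    return ((uint32_t)bus << 8) | devfn;
}
```

Frederic's variant just reads both fields straight from the pci_dev instead of going through the pci_dn, avoiding one lookup.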

> 
> > +   u64 base_addr = 0;
> > +
> > +   int rc = opal_npu_mem_alloc(phb->opal_id, bdfn, size,
> > &base_addr);
> > +
> > +   WARN_ON(rc);
> 
> Instead of a WARN, we should catch the error and return a null
> address 
> to the caller.
> 

base_addr will be 0 in the error case; are you suggesting we just
remove the WARN_ON()?

> > +
> > +   base_addr = be64_to_cpu(base_addr);
> > +
> > +   rc = check_hotplug_memory_addressable(base_addr, base_addr +
> > size);
> 
> That code is missing?
> 

That's added in the following patch on the mm list:
 [PATCH v3 1/2] memory_hotplug: Add a bounds check to
check_hotplug_memory_range()
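That mm patch is not reproduced here; as a rough sketch of what such a bounds check plausibly does — reject ranges that overflow or end beyond the maximum addressable physical memory — consider the following (the bit width and names are illustrative assumptions, not the actual patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bound: 46 bits is an assumption; real kernels derive
 * MAX_PHYSMEM_BITS from the architecture/MMU configuration. */
#define SKETCH_MAX_PHYSMEM_BITS 46

/* Return true when [start, start + size) is fully addressable. */
bool hotplug_range_addressable(uint64_t start, uint64_t size)
{
    uint64_t limit = 1ULL << SKETCH_MAX_PHYSMEM_BITS;

    if (start + size < start)   /* arithmetic overflow */
        return false;
    return start + size <= limit;
}
```

With such a check in place, the caller above only has to handle a plain error return, which is what the dev_warn() path does.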

> 
> > +   if (rc) {
> > +   dev_warn(&pdev->dev,
> > +"LPC memory range 0x%llx-0x%llx is not fully
> > addressable",
> > +base_addr, base_addr + size - 1);
> > +   return 0;
> > +   }
> > +
> > +
> > +   return base_addr;
> > +}
> > +EXPORT_SYMBOL_GPL(pnv_ocxl_platform_lpc_setup);
> > +
> > +void pnv_ocxl_platform_lpc_release(struct pci_dev *pdev)
> > +{
> > +   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> > +   struct pnv_phb *phb = hose->private_data;
> > +   struct pci_dn *pdn = pci_get_pdn(pdev);
> > +   u32 bdfn;
> > +   int rc;
> > +
> > +   bdfn = (pdn->busno << 8) | pdn->devfn;
> > +   rc = opal_npu_mem_release(phb->opal_id, bdfn);
> > +   WARN_ON(rc);
> 
> Same comments as above.
> 
>Fred
> 
> 
> 
> > +}
> > +EXPORT_SYMBOL_GPL(pnv_ocxl_platform_lpc_release);
> > +
> > +
> >   int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int
> > pe_handle)
> >   {
> > struct spa_data *data = (struct spa_data *) platform_data;
> > 
-- 
Alastair D'Silva
Open Source Developer
Linux Technology Centre, IBM Australia
mob: 0423 762 819



Re: [PATCH v2 1/3] tools/perf: Move kvm-stat header file from conditional inclusion to common include section

2019-09-18 Thread Arnaldo Carvalho de Melo
On Fri, Jul 19, 2019 at 10:58:37AM +0530, Ravi Bangoria wrote:
> 
> LGTM. For the series,
> 
> Reviewed-By: Ravi Bangoria 


Thanks, applied.

- Arnaldo


Re: [PATCH v3 2/2] net/ibmvnic: prevent more than one thread from running in reset

2019-09-18 Thread Juliet Kim


On 9/18/19 1:12 AM, Michael Ellerman wrote:
> Hi Juliet,
>
> Juliet Kim  writes:
>> Signed-off-by: Juliet Kim 
>> ---
>>  drivers/net/ethernet/ibm/ibmvnic.c | 23 ++-
>>  drivers/net/ethernet/ibm/ibmvnic.h |  3 +++
>>  2 files changed, 25 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
>> b/drivers/net/ethernet/ibm/ibmvnic.c
>> index ba340aaff1b3..f344ccd68ad9 100644
>> --- a/drivers/net/ethernet/ibm/ibmvnic.c
>> +++ b/drivers/net/ethernet/ibm/ibmvnic.c
>> @@ -2054,6 +2054,13 @@ static void __ibmvnic_reset(struct work_struct *work)
>>  
>>  adapter = container_of(work, struct ibmvnic_adapter, ibmvnic_reset);
>>  
>> +if (adapter->resetting) {
>> +schedule_delayed_work(&adapter->ibmvnic_delayed_reset,
>> +  IBMVNIC_RESET_DELAY);
>> +return;
>> +}
>> +
>> +adapter->resetting = true;
>>  reset_state = adapter->state;
> Is there some locking/serialisation around this?
>
> Otherwise that looks very racy. ie. two CPUs could both see
> adapter->resetting == false, then both set it to true, and then continue
> executing and stomp on each other.
>
> cheers

I agree there may be a race here. Thank you for reviewing.

I will address it in the next version.
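One common way to close such a race is an atomic test-and-set, so that exactly one caller wins the flag and any concurrent caller reschedules instead. A minimal user-space sketch of the idea using C11 atomics (in the kernel this would typically be test_and_set_bit() or a lock; the names below are illustrative, not the driver's API):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Models the ibmvnic "resetting" guard: atomic_flag_test_and_set()
 * atomically returns the old value while setting the flag, so two
 * racing callers can never both observe "not resetting". */
static atomic_flag resetting = ATOMIC_FLAG_INIT;

/* Returns true only for the single caller that wins the reset. */
bool try_enter_reset(void)
{
    return !atomic_flag_test_and_set(&resetting);
}

void leave_reset(void)
{
    atomic_flag_clear(&resetting);
}
```

The loser then reschedules the delayed work, exactly as the patch does, instead of running a second reset concurrently.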



Re: [PATCH trivial] KVM: PPC: Remove superfluous check for non-zero return value

2019-09-18 Thread Greg Kurz
On Wed, 11 Sep 2019 21:52:35 +0200
Thomas Huth  wrote:

> After the kfree()s have been removed in the previous
> commit 9798f4ea71ea ("fix rollback when kvmppc_xive_create fails"),
> the code can be simplified even more to simply always "return ret"
> now.
> 
> Signed-off-by: Thomas Huth 
> ---

This looks like a good candidate for trivial, hence Cc'ing Jiri
and adding trivial keyword in subject.

Reviewed-by: Greg Kurz 

>  arch/powerpc/kvm/book3s_xive.c| 5 +
>  arch/powerpc/kvm/book3s_xive_native.c | 5 +
>  2 files changed, 2 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index e3ba67095895..2f6f463fcdfb 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -1986,10 +1986,7 @@ static int kvmppc_xive_create(struct kvm_device *dev, 
> u32 type)
>  
>   xive->single_escalation = xive_native_has_single_escalation();
>  
> - if (ret)
> - return ret;
> -
> - return 0;
> + return ret;
>  }
>  
>  int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu *vcpu)
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
> b/arch/powerpc/kvm/book3s_xive_native.c
> index a998823f68a3..7a50772f26fe 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -1089,10 +1089,7 @@ static int kvmppc_xive_native_create(struct kvm_device 
> *dev, u32 type)
>   xive->single_escalation = xive_native_has_single_escalation();
>   xive->ops = _xive_native_ops;
>  
> - if (ret)
> - return ret;
> -
> - return 0;
> + return ret;
>  }
>  
>  /*



Re: [PATCH trivial] KVM: PPC: Remove superfluous check for non-zero return value

2019-09-18 Thread Greg Kurz
On Wed, 18 Sep 2019 18:44:36 +0200
Greg Kurz  wrote:

> On Wed, 11 Sep 2019 21:52:35 +0200
> Thomas Huth  wrote:
> 
> > After the kfree()s have been removed in the previous
> > commit 9798f4ea71ea ("fix rollback when kvmppc_xive_create fails"),
> > the code can be simplified even more to simply always "return ret"
> > now.
> > 
> > Signed-off-by: Thomas Huth 
> > ---
> 
> This looks like a good candidate for trivial, hence Cc'ing Jiri
> and adding trivial keyword in subject.
> 
> Reviewed-by: Greg Kurz 
> 

Oops, the patch is correct but there are some fixes that require
the return 0 to stay around...

https://patchwork.ozlabs.org/project/kvm-ppc/list/?series=129957

> >  arch/powerpc/kvm/book3s_xive.c| 5 +
> >  arch/powerpc/kvm/book3s_xive_native.c | 5 +
> >  2 files changed, 2 insertions(+), 8 deletions(-)
> > 
> > diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> > index e3ba67095895..2f6f463fcdfb 100644
> > --- a/arch/powerpc/kvm/book3s_xive.c
> > +++ b/arch/powerpc/kvm/book3s_xive.c
> > @@ -1986,10 +1986,7 @@ static int kvmppc_xive_create(struct kvm_device 
> > *dev, u32 type)
> >  
> > xive->single_escalation = xive_native_has_single_escalation();
> >  
> > -   if (ret)
> > -   return ret;
> > -
> > -   return 0;
> > +   return ret;
> >  }
> >  
> >  int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu 
> > *vcpu)
> > diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
> > b/arch/powerpc/kvm/book3s_xive_native.c
> > index a998823f68a3..7a50772f26fe 100644
> > --- a/arch/powerpc/kvm/book3s_xive_native.c
> > +++ b/arch/powerpc/kvm/book3s_xive_native.c
> > @@ -1089,10 +1089,7 @@ static int kvmppc_xive_native_create(struct 
> > kvm_device *dev, u32 type)
> > xive->single_escalation = xive_native_has_single_escalation();
> > xive->ops = _xive_native_ops;
> >  
> > -   if (ret)
> > -   return ret;
> > -
> > -   return 0;
> > +   return ret;
> >  }
> >  
> >  /*
> 



Re: [PATCH 2/2] MIPS: remove support for DMA_ATTR_WRITE_COMBINE

2019-09-18 Thread Maciej W. Rozycki
On Wed, 7 Aug 2019, Christoph Hellwig wrote:

> Mips uses the KSEG1 kernel memory segment to map dma coherent
> allocations for non-coherent devices as uncacheable, and does not have
> any kind of special support for DMA_ATTR_WRITE_COMBINE in the allocation
> path.  Thus supporting DMA_ATTR_WRITE_COMBINE in dma_mmap_attrs will
> lead to multiple mappings with different caching attributes.

 FYI, AFAIK _CACHE_UNCACHED_ACCELERATED (where supported) is effectively 
write-combine.  Though IIUC someone would have to wire it in first.

  Maciej


Re: [PATCH V2 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-18 Thread Gerald Schaefer
On Wed, 18 Sep 2019 18:26:03 +0200
Christophe Leroy  wrote:

[..] 
> My suggestion was not to completely drop the #ifdef but to do like you 
> did in pgd_clear_tests() for instance, ie to add the following test on 
> top of the function:
> 
>   if (mm_pud_folded(mm) || is_defined(__ARCH_HAS_5LEVEL_HACK))
>   return;
> 

Ah, very nice, this would also fix the remaining issues for s390. Since
we have dynamic page table folding, neither __PAGETABLE_PXX_FOLDED nor
__ARCH_HAS_XLEVEL_HACK is defined, but mm_pxx_folded() will work.

mm_alloc() returns with a 3-level page table by default on s390, so we
will run into issues in p4d_clear/populate_tests(), and also at the end
with p4d/pud_free() (double free).

So, adding the mm_pud_folded() check to p4d_clear/populate_tests(),
and also adding mm_p4d/pud_folded() checks at the end before calling
p4d/pud_free(), would make it all work on s390.

BTW, regarding p4d/pud_free(), I'm not sure if we should rather check
the folding inside our s390 functions, similar to how we do it for
p4d/pud_free_tlb(), instead of relying on not being called for folded
p4d/pud. So far, I see no problem with this behavior, all callers of
p4d/pud_free() should be fine because of our folding check within
p4d/pud_present/none(). But that doesn't mean that it is correct not
to check for the folding inside p4d/pud_free(). At least, with this
test module we do now have a caller of p4d/pud_free() on potentially
folded entries, so instead of adding pxx_folded() checks to this
test module, we could add them to our p4d/pud_free() functions.
Any thoughts on this?

Regards,
Gerald



Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline

2019-09-18 Thread Naveen N. Rao

Michael Ellerman wrote:

"Naveen N. Rao"  writes:

Michael Ellerman wrote:

"Gautham R. Shenoy"  writes:

From: "Gautham R. Shenoy" 

Currently on Pseries Linux Guests, the offlined CPU can be put to one
of the following two states:
   - Long term processor cede (also called extended cede)
   - Returned to the Hypervisor via RTAS "stop-self" call.

This is controlled by the kernel boot parameter "cede_offline=on/off".

By default the offlined CPUs enter extended cede.


Since commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an 
appropriate offline state") (Nov 2009)

Which you wrote :)

Why was that wrong?


The PHYP hypervisor considers CPUs in extended cede to be "active"
since the CPUs are still under the control of the Linux Guests. Hence, when we 
change the
SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs
will continue to count the values for offlined CPUs in extended cede
as if they are online.

One of the expectations with PURR is that, for an interval of time,
the sum of the PURR increments across the online CPUs of a core should
equal the number of timebase ticks for that interval.

This is currently not the case.
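The expectation can be sketched numerically: over an interval, sum the PURR deltas of a core's online threads and divide by the timebase delta — that is essentially the physc figure lparstat prints. An illustrative model (in hundredths of a core; not lparstat's actual code, which derives these from the sysfs [S]PURR counters):

```c
#include <assert.h>
#include <stdint.h>

/* Physical cores consumed over an interval, in hundredths of a core
 * (110 means 1.10 cores), from per-thread PURR deltas and the timebase
 * delta. With the invariant intact, summing over the online threads of
 * a fully busy core approaches tb_delta, i.e. a result of ~100. */
uint64_t physc_hundredths(const uint64_t *purr_delta, int nthreads,
                          uint64_t tb_delta)
{
    uint64_t sum = 0;

    for (int i = 0; i < nthreads; i++)
        sum += purr_delta[i];
    return sum * 100 / tb_delta;
}
```

When ceded-offline threads keep accruing PURR as if online, this sum no longer tracks the timebase, which is the accounting breakage described below.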


But why does that matter? It's just some accounting stuff, does it
actually break something meaningful?


Yes, this broke lparstat at the very least (though it's quite unfortunate 
we took so long to notice).


By "so long" you mean 10 years?

Also I've never heard of lparstat, but I assume it's one of these tools
that's meant to behave like the AIX equivalent?


Yes, and yes. lparstat is part of powerpc-utils.



If it's been "broken" for 10 years and no one noticed, I'd argue the
current behaviour is now "correct" and fixing it would actually be a
breakage :)


:)
More on this below...




With SMT disabled, and under load:
  $ sudo lparstat 1 10

  System Configuration
  type=Shared mode=Uncapped smt=Off lcpu=2 mem=7759616 kB cpus=6 ent=1.00 


  %user  %sys  %wait  %idle  physc  %entc  lbusy  vcsw  phint
  - - --- - - - -
  100.00  0.00  0.00 0.00 1.10 110.00 100.00 128784460 0
  100.00  0.00  0.00 0.00 1.07 107.00 100.00 128784860 0
  100.00  0.00  0.00 0.00 1.07 107.00 100.00 128785260 0
  100.00  0.00  0.00 0.00 1.07 107.00 100.00 128785662 0
  100.00  0.00  0.00 0.00 1.07 107.00 100.00 128786062 0
  100.00  0.00  0.00 0.00 1.07 107.00 100.00 128786462 0
  100.00  0.00  0.00 0.00 1.07 107.00 100.00 128786862 0
  100.00  0.00  0.00 0.00 1.07 107.00 100.00 128787262 0
  100.00  0.00  0.00 0.00 1.07 107.00 100.00 128787664 0
  100.00  0.00  0.00 0.00 1.07 107.00 100.00 128788064 0


What about that is wrong?


The 'physc' column represents cpu usage in units of physical cores.  
With 2 virtual cores ('lcpu=2') in uncapped, shared processor mode, we 
expect this to be closer to 2 when fully loaded (and spare capacity 
elsewhere in the system).





With cede_offline=off:
  $ sudo lparstat 1 10

  System Configuration
  type=Shared mode=Uncapped smt=Off lcpu=2 mem=7759616 kB cpus=6 ent=1.00 


  %user  %sys  %wait  %idle  physc  %entc  lbusy  vcsw  phint
  - - --- - - - -
  100.00  0.00  0.00 0.00 1.94 194.00 100.00 128961588 0
  100.00  0.00  0.00 0.00 1.91 191.00 100.00 128961988 0
  100.00  0.00  0.00 0.00  inf   inf 100.00 128962392 0
  100.00  0.00  0.00 0.00 1.91 191.00 100.00 128962792 0
  100.00  0.00  0.00 0.00 1.91 191.00 100.00 128963192 0
  100.00  0.00  0.00 0.00 1.91 191.00 100.00 128963592 0
  100.00  0.00  0.00 0.00 1.91 191.00 100.00 128963992 0
  100.00  0.00  0.00 0.00 1.91 191.00 100.00 128964392 0
  100.00  0.00  0.00 0.00 1.91 191.00 100.00 128964792 0
  100.00  0.00  0.00 0.00 1.91 191.00 100.00 128965194 0

[The 'inf' values there show a different bug]

Also, since we expose [S]PURR through sysfs, any tools that make use of 
that directly are also affected due to this.


But again if we've had the current behaviour for 10 years then arguably
that's now the correct behaviour.


That's a fair point, and probably again points to this area getting less 
tested. One of the main reasons for this being caught now though, is 
that there are workloads being tested under lower SMT levels now. So, I 
suspect no one has been relying on this behavior and we can consider 
this to be a bug.



Thanks,
Naveen



Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline

2019-09-18 Thread Nathan Lynch
Gautham R Shenoy  writes:
> The accounting problem in tools such as lparstat with
> "cede_offline=on" is affecting customers who are using these tools for
> capacity-planning. That problem needs a fix in the short-term, for
> which Patch 1 changes the default behaviour of cede_offline from "on"
> to "off". Since this patch would break the existing userspace tools
> that use the CPU-Offline infrastructure to fold CPUs for saving power,
> the sysfs interface allowing a runtime change of cede_offline_enabled
> was provided to enable these userspace tools to cope with minimal
> change.

So we would be putting users in the position of choosing between folding
and correct accounting. :-(

Actually how does changing the offline mechanism to stop-self break the
folding utility?


Re: [PATCH v3 2/2] powerpc/irq: inline call_do_irq() and call_do_softirq()

2019-09-18 Thread Segher Boessenkool
Hi Christophe,

On Wed, Sep 18, 2019 at 03:48:20PM +, Christophe Leroy wrote:
> call_do_irq() and call_do_softirq() are quite similar on PPC32 and
> PPC64 and are simple enough to be worth inlining.
> 
> Inlining them avoids an mflr/mtlr pair plus a save/reload on stack.

But you hardcode the calling sequence in inline asm, which for various
reasons is not a great idea.

> +static inline void call_do_irq(struct pt_regs *regs, void *sp)
> +{
> + register unsigned long r3 asm("r3") = (unsigned long)regs;
> +
> + asm volatile(
> + "   "PPC_STLU"  1, %2(%1);\n"
> + "   mr  1, %1;\n"
> + "   bl  %3;\n"
> + "   "PPC_LL"1, 0(1);\n" : "+r"(r3) :
> + "b"(sp), "i"(THREAD_SIZE - STACK_FRAME_OVERHEAD), "i"(__do_irq) 
> :
> + "lr", "xer", "ctr", "memory", "cr0", "cr1", "cr5", "cr6", "cr7",
> + "r0", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12");
> +}

I realise the original code had this...  Loading the old stack pointer
value back from the stack creates a bottleneck (via the store->load
forwarding it requires).  It could just use
  addi 1,1,-(%2)
here, which can also be written as
  addi 1,1,%n2
(that is portable to all architectures btw).


Please write the  "+r"(r3)  on the next line?  Not on the same line as
the multi-line template.  This make things more readable.


I don't know if using functions as an "i" works properly...  It probably
does, it's just not something that you see often :-)


What about r2?  Various ABIs handle that differently.  This might make
it impossible to share implementation between 32-bit and 64-bit for this.
But we could add it to the clobber list worst case, that will always work.


So anyway, it looks to me like it will work.  Nice cleanup.  Would be
better if you could do the call to __do_irq from C code, but maybe we
cannot have everything ;-)

Reviewed-by: Segher Boessenkool 


Segher


Re: [PATCH V2 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-18 Thread Christophe Leroy




On 18/09/2019 at 07:04, Anshuman Khandual wrote:



On 09/13/2019 03:31 PM, Christophe Leroy wrote:



On 13/09/2019 at 11:02, Anshuman Khandual wrote:



+#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)


#ifdefs have to be avoided as much as possible, see below


Yeah, but it has been a bit difficult to avoid all these #ifdefs because of the
availability (or lack of it) of all these pgtable helpers in various config
combinations on all platforms.


As far as I can see these pgtable helpers should exist everywhere at least via 
asm-generic/ files.


But they might not actually do the right thing.



Can you spot a particular config which fails ?


Let's consider the following example (after removing the #ifdefs around it)
which, though it builds successfully, fails to pass the intended test. This
is with an arm64 config with 4K page size and a 39-bit VA space, which ends up
with a 3-level page table arrangement.

static void __init p4d_clear_tests(p4d_t *p4dp)
{
 p4d_t p4d = READ_ONCE(*p4dp);


My suggestion was not to completely drop the #ifdef but to do like you 
did in pgd_clear_tests() for instance, ie to add the following test on 
top of the function:


if (mm_pud_folded(mm) || is_defined(__ARCH_HAS_5LEVEL_HACK))
return;



 p4d = __p4d(p4d_val(p4d) | RANDOM_ORVALUE);
 WRITE_ONCE(*p4dp, p4d);
 p4d_clear(p4dp);
 p4d = READ_ONCE(*p4dp);
 WARN_ON(!p4d_none(p4d));
}

The following test hits an error at WARN_ON(!p4d_none(p4d))

[   16.757333] [ cut here ]
[   16.758019] WARNING: CPU: 11 PID: 1 at mm/arch_pgtable_test.c:187 
arch_pgtable_tests_init+0x24c/0x474
[   16.759455] Modules linked in:
[   16.759952] CPU: 11 PID: 1 Comm: swapper/0 Not tainted 
5.3.0-next-20190916-5-g61c218153bb8-dirty #222
[   16.761449] Hardware name: linux,dummy-virt (DT)
[   16.762185] pstate: 0045 (nzcv daif +PAN -UAO)
[   16.762964] pc : arch_pgtable_tests_init+0x24c/0x474
[   16.763750] lr : arch_pgtable_tests_init+0x174/0x474
[   16.764534] sp : ffc011d7bd50
[   16.765065] x29: ffc011d7bd50 x28: 1756bac0
[   16.765908] x27: ff85ddaf3000 x26: 02e8
[   16.766767] x25: ffc0111ce000 x24: ff85ddaf32e8
[   16.767606] x23: ff85ddaef278 x22: 0045cc844000
[   16.768445] x21: 00065daef003 x20: 1754
[   16.769283] x19: ff85ddb6 x18: 0014
[   16.770122] x17: 980426bb x16: 698594c6
[   16.770976] x15: 66e25a88 x14: 
[   16.771813] x13: 1754 x12: 000a
[   16.772651] x11: ff85fcfd0a40 x10: 0001
[   16.773488] x9 : 0008 x8 : ffc01143ab26
[   16.774336] x7 :  x6 : 
[   16.775180] x5 :  x4 : 
[   16.776018] x3 : 1756bbe8 x2 : 00065daeb003
[   16.776856] x1 : 0065daeb x0 : f000
[   16.777693] Call trace:
[   16.778092]  arch_pgtable_tests_init+0x24c/0x474
[   16.778843]  do_one_initcall+0x74/0x1b0
[   16.779458]  kernel_init_freeable+0x1cc/0x290
[   16.780151]  kernel_init+0x10/0x100
[   16.780710]  ret_from_fork+0x10/0x18
[   16.781282] ---[ end trace 042e6c40c0a3b038 ]---

On arm64 (4K page size|39 bits VA|3 level page table)

#elif CONFIG_PGTABLE_LEVELS == 3/* Applicable here */
#define __ARCH_USE_5LEVEL_HACK
#include 

Which pulls in

#include 

which pulls in

#include 

which defines

static inline int p4d_none(p4d_t p4d)
{
 return 0;
}

which will invariably trigger WARN_ON(!p4d_none(p4d)).

Similarly for the next test, p4d_populate_tests(), which will always appear
successful because p4d_bad() invariably returns 0 (false).

static inline int p4d_bad(p4d_t p4d)
{
 return 0;
}

static void __init p4d_populate_tests(struct mm_struct *mm, p4d_t *p4dp,
   pud_t *pudp)
{
 p4d_t p4d;

 /*
  * This entry points to next level page table page.
  * Hence this must not qualify as p4d_bad().
  */
 pud_clear(pudp);
 p4d_clear(p4dp);
 p4d_populate(mm, p4dp, pudp);
 p4d = READ_ONCE(*p4dp);
 WARN_ON(p4d_bad(p4d));
}

We should not run these tests for the above config because they are
not applicable and will invariably produce same result.









[...]


+#if !defined(__PAGETABLE_PUD_FOLDED) && !defined(__ARCH_HAS_5LEVEL_HACK)


The same can be done here.


IIRC not only the page table helpers but there are data types (pxx_t) which
were not present on various configs and these wrappers help prevent build
failures. Any ways will try and see if this can be improved further. But
meanwhile if you have some suggestions, please do let me know.


pgd_t and pmd_t are everywhere I guess.
then pud_t and p4d_t have fallbacks in asm-generic files.


Let's take another example where it fails to compile. On arm64 with 16K
page size, 48 bits VA, 

Re: [PATCH V2 3/4] ASoC: pcm_dmaengine: Extract snd_dmaengine_pcm_set_runtime_hwparams

2019-09-18 Thread kbuild test robot
Hi Shengjiu,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[cannot apply to v5.3 next-20190917]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Shengjiu-Wang/ASoC-fsl_asrc-Use-in-out-put_format-instead-of-in-out-put_word_width/20190918-145058
config: arm-viper_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (GCC) 7.4.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.4.0 make.cross ARCH=arm 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot 

All errors (new ones prefixed by >>):

>> ERROR: "snd_soc_set_runtime_hwparams" [sound/core/snd-pcm-dmaengine.ko] 
>> undefined!

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




[PATCH v3 2/2] powerpc/irq: inline call_do_irq() and call_do_softirq()

2019-09-18 Thread Christophe Leroy
call_do_irq() and call_do_softirq() are quite similar on PPC32 and
PPC64 and are simple enough to be worth inlining.

Inlining them avoids an mflr/mtlr pair plus a save/reload on stack.

This is inspired from the S390 arch. Several other arches do more or
less the same. The way the sparc arch does it seems odd though.

Signed-off-by: Christophe Leroy 

---
v2: no change.
v3: no change.
---
 arch/powerpc/include/asm/irq.h |  2 --
 arch/powerpc/kernel/irq.c  | 26 ++
 arch/powerpc/kernel/misc_32.S  | 25 -
 arch/powerpc/kernel/misc_64.S  | 22 --
 4 files changed, 26 insertions(+), 49 deletions(-)

diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
index 0c6469983c66..10476d5283dc 100644
--- a/arch/powerpc/include/asm/irq.h
+++ b/arch/powerpc/include/asm/irq.h
@@ -57,8 +57,6 @@ extern void *mcheckirq_ctx[NR_CPUS];
 extern void *hardirq_ctx[NR_CPUS];
 extern void *softirq_ctx[NR_CPUS];
 
-void call_do_softirq(void *sp);
-void call_do_irq(struct pt_regs *regs, void *sp);
 extern void do_IRQ(struct pt_regs *regs);
 extern void __init init_IRQ(void);
 extern void __do_irq(struct pt_regs *regs);
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 04204be49577..b028c49f9635 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -642,6 +642,20 @@ void __do_irq(struct pt_regs *regs)
irq_exit();
 }
 
+static inline void call_do_irq(struct pt_regs *regs, void *sp)
+{
+   register unsigned long r3 asm("r3") = (unsigned long)regs;
+
+   asm volatile(
+   "   "PPC_STLU"  1, %2(%1);\n"
+   "   mr  1, %1;\n"
+   "   bl  %3;\n"
+   "   "PPC_LL"1, 0(1);\n" : "+r"(r3) :
+   "b"(sp), "i"(THREAD_SIZE - STACK_FRAME_OVERHEAD), "i"(__do_irq) 
:
+   "lr", "xer", "ctr", "memory", "cr0", "cr1", "cr5", "cr6", "cr7",
+   "r0", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12");
+}
+
 void do_IRQ(struct pt_regs *regs)
 {
struct pt_regs *old_regs = set_irq_regs(regs);
@@ -686,6 +700,18 @@ void *mcheckirq_ctx[NR_CPUS] __read_mostly;
 void *softirq_ctx[NR_CPUS] __read_mostly;
 void *hardirq_ctx[NR_CPUS] __read_mostly;
 
+static inline void call_do_softirq(const void *sp)
+{
+   asm volatile(
+   "   "PPC_STLU"  1, %1(%0);\n"
+   "   mr  1, %0;\n"
+   "   bl  %2;\n"
+   "   "PPC_LL"1, 0(1);\n" : :
+   "b"(sp), "i"(THREAD_SIZE - STACK_FRAME_OVERHEAD), 
"i"(__do_softirq) :
+   "lr", "xer", "ctr", "memory", "cr0", "cr1", "cr5", "cr6", "cr7",
+   "r0", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12");
+}
+
 void do_softirq_own_stack(void)
 {
void *irqsp = softirq_ctx[smp_processor_id()];
diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
index a5422f7782b3..307307b57743 100644
--- a/arch/powerpc/kernel/misc_32.S
+++ b/arch/powerpc/kernel/misc_32.S
@@ -33,31 +33,6 @@
 
.text
 
-_GLOBAL(call_do_softirq)
-   mflrr0
-   stw r0,4(r1)
-   stwur1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r3)
-   mr  r1,r3
-   bl  __do_softirq
-   lwz r1,0(r1)
-   lwz r0,4(r1)
-   mtlrr0
-   blr
-
-/*
- * void call_do_irq(struct pt_regs *regs, void *sp);
- */
-_GLOBAL(call_do_irq)
-   mflrr0
-   stw r0,4(r1)
-   stwur1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r4)
-   mr  r1,r4
-   bl  __do_irq
-   lwz r1,0(r1)
-   lwz r0,4(r1)
-   mtlrr0
-   blr
-
 /*
  * This returns the high 64 bits of the product of two 64-bit numbers.
  */
diff --git a/arch/powerpc/kernel/misc_64.S b/arch/powerpc/kernel/misc_64.S
index b55a7b4cb543..69fd714a5236 100644
--- a/arch/powerpc/kernel/misc_64.S
+++ b/arch/powerpc/kernel/misc_64.S
@@ -27,28 +27,6 @@
 
.text
 
-_GLOBAL(call_do_softirq)
-   mflrr0
-   std r0,16(r1)
-   stdur1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r3)
-   mr  r1,r3
-   bl  __do_softirq
-   ld  r1,0(r1)
-   ld  r0,16(r1)
-   mtlrr0
-   blr
-
-_GLOBAL(call_do_irq)
-   mflrr0
-   std r0,16(r1)
-   stdur1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r4)
-   mr  r1,r4
-   bl  __do_irq
-   ld  r1,0(r1)
-   ld  r0,16(r1)
-   mtlrr0
-   blr
-
.section".toc","aw"
 PPC64_CACHES:
.tc ppc64_caches[TC],ppc64_caches
-- 
2.13.3



[PATCH v3 1/2] powerpc/irq: bring back ksp_limit management in C functions.

2019-09-18 Thread Christophe Leroy
Commit cbc9565ee826 ("powerpc: Remove ksp_limit on ppc64") moved
PPC32 ksp_limit handling in assembly functions call_do_softirq()
and call_do_irq() as they are different for PPC32 and PPC64.

In preparation of replacing these functions by inline assembly,
partially revert that commit to bring back ksp_limit assignment
in the callers.

To get and set ksp_limit without a forest of CONFIG_PPC32 #ifdefs,
use helpers that are no-ops on PPC64.

Signed-off-by: Christophe Leroy 

---
v2: added forward declaration of struct task_struct to avoid build failure.
v3: included linux/sched.h, forward declaration is not enough.
---
 arch/powerpc/include/asm/irq.h | 22 ++
 arch/powerpc/kernel/irq.c  | 14 +-
 arch/powerpc/kernel/misc_32.S  | 14 --
 3 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
index 814dfab7e392..0c6469983c66 100644
--- a/arch/powerpc/include/asm/irq.h
+++ b/arch/powerpc/include/asm/irq.h
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -64,5 +65,26 @@ extern void __do_irq(struct pt_regs *regs);
 
 int irq_choose_cpu(const struct cpumask *mask);
 
+#ifdef CONFIG_PPC32
+static inline unsigned long get_ksp_limit(struct task_struct *tsk)
+{
+   return tsk->thread.ksp_limit;
+}
+
+static inline void set_ksp_limit(struct task_struct *tsk, unsigned long limit)
+{
+   tsk->thread.ksp_limit = limit;
+}
+#else
+static inline unsigned long get_ksp_limit(struct task_struct *tsk)
+{
+   return 0;
+}
+
+static inline void set_ksp_limit(struct task_struct *tsk, unsigned long limit)
+{
+}
+#endif
+
 #endif /* _ASM_IRQ_H */
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 5645bc9cbc09..04204be49577 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -646,6 +646,7 @@ void do_IRQ(struct pt_regs *regs)
 {
struct pt_regs *old_regs = set_irq_regs(regs);
void *cursp, *irqsp, *sirqsp;
+   unsigned long saved_ksp_limit = get_ksp_limit(current);
 
/* Switch to the irq stack to handle this */
cursp = (void *)(current_stack_pointer() & ~(THREAD_SIZE - 1));
@@ -658,9 +659,15 @@ void do_IRQ(struct pt_regs *regs)
set_irq_regs(old_regs);
return;
}
+   /* Adjust the stack limit */
+   set_ksp_limit(current, (unsigned long)irqsp);
+
/* Switch stack and call */
call_do_irq(regs, irqsp);
 
+   /* Restore stack limit */
+   set_ksp_limit(current, saved_ksp_limit);
+
set_irq_regs(old_regs);
 }
 
@@ -681,7 +688,12 @@ void *hardirq_ctx[NR_CPUS] __read_mostly;
 
 void do_softirq_own_stack(void)
 {
-   call_do_softirq(softirq_ctx[smp_processor_id()]);
+   void *irqsp = softirq_ctx[smp_processor_id()];
+   unsigned long saved_ksp_limit = get_ksp_limit(current);
+
+   set_ksp_limit(current, (unsigned long)irqsp);
+   call_do_softirq(irqsp);
+   set_ksp_limit(current, saved_ksp_limit);
 }
 
 irq_hw_number_t virq_to_hw(unsigned int virq)
diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
index 82df4b09e79f..a5422f7782b3 100644
--- a/arch/powerpc/kernel/misc_32.S
+++ b/arch/powerpc/kernel/misc_32.S
@@ -33,23 +33,14 @@
 
.text
 
-/*
- * We store the saved ksp_limit in the unused part
- * of the STACK_FRAME_OVERHEAD
- */
 _GLOBAL(call_do_softirq)
 	mflr	r0
 	stw	r0,4(r1)
-	lwz	r10,THREAD+KSP_LIMIT(r2)
-	stw	r3, THREAD+KSP_LIMIT(r2)
 	stwu	r1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r3)
 	mr	r1,r3
-	stw	r10,8(r1)
 	bl	__do_softirq
-	lwz	r10,8(r1)
 	lwz	r1,0(r1)
 	lwz	r0,4(r1)
-	stw	r10,THREAD+KSP_LIMIT(r2)
 	mtlr	r0
 	blr
 
@@ -59,16 +50,11 @@ _GLOBAL(call_do_softirq)
 _GLOBAL(call_do_irq)
 	mflr	r0
 	stw	r0,4(r1)
-	lwz	r10,THREAD+KSP_LIMIT(r2)
-	stw	r4, THREAD+KSP_LIMIT(r2)
 	stwu	r1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r4)
 	mr	r1,r4
-	stw	r10,8(r1)
 	bl	__do_irq
-	lwz	r10,8(r1)
 	lwz	r1,0(r1)
 	lwz	r0,4(r1)
-	stw	r10,THREAD+KSP_LIMIT(r2)
 	mtlr	r0
 	blr
 
-- 
2.13.3



Re: [PATCH v12 05/12] namei: obey trailing magic-link DAC permissions

2019-09-18 Thread Aleksa Sarai
On 2019-09-18, Aleksa Sarai  wrote:
> On 2019-09-17, Jann Horn  wrote:
> > On Wed, Sep 4, 2019 at 10:21 PM Aleksa Sarai  wrote:
> > > The ability for userspace to "re-open" file descriptors through
> > > /proc/self/fd has been a very useful tool for all sorts of usecases
> > > (container runtimes are one common example). However, the current
> > > interface for doing this has resulted in some pretty subtle security
> > > holes. Userspace can re-open a file descriptor with more permissions
> > > than the original, which can result in cases such as /proc/$pid/exe
> > > being re-opened O_RDWR at a later date even though (by definition)
> > > /proc/$pid/exe cannot be opened for writing. When combined with O_PATH
> > > the results can get even more confusing.
> > [...]
> > > Instead we have to restrict it in such a way that it doesn't break
> > > (good) users but does block potential attackers. The solution applied in
> > > this patch is to restrict *re-opening* (not resolution through)
> > > magic-links by requiring that mode of the link be obeyed. Normal
> > > symlinks have modes of a+rwx but magic-links have other modes. These
> > > magic-link modes were historically ignored during path resolution, but
> > > they've now been re-purposed for more useful ends.
> > 
> > Thanks for dealing with this issue!
> > 
> > [...]
> > > diff --git a/fs/namei.c b/fs/namei.c
> > > index 209c51a5226c..54d57dad0f91 100644
> > > --- a/fs/namei.c
> > > +++ b/fs/namei.c
> > > @@ -872,7 +872,7 @@ void nd_jump_link(struct path *path)
> > >
> > > nd->path = *path;
> > > nd->inode = nd->path.dentry->d_inode;
> > > -   nd->flags |= LOOKUP_JUMPED;
> > > +   nd->flags |= LOOKUP_JUMPED | LOOKUP_MAGICLINK_JUMPED;
> > >  }
> > [...]
> > > +static int trailing_magiclink(struct nameidata *nd, int acc_mode,
> > > + fmode_t *opath_mask)
> > > +{
> > > +   struct inode *inode = nd->link_inode;
> > > +   fmode_t upgrade_mask = 0;
> > > +
> > > +   /* Was the trailing_symlink() a magic-link? */
> > > +   if (!(nd->flags & LOOKUP_MAGICLINK_JUMPED))
> > > +   return 0;
> > > +
> > > +   /*
> > > +* Figure out the upgrade-mask of the link_inode. Since these aren't
> > > +* strictly POSIX semantics we don't do an acl_permission_check() here,
> > > +* so we only care that at least one bit is set for each upgrade-mode.
> > > +*/
> > > +   if (inode->i_mode & S_IRUGO)
> > > +   upgrade_mask |= FMODE_PATH_READ;
> > > +   if (inode->i_mode & S_IWUGO)
> > > +   upgrade_mask |= FMODE_PATH_WRITE;
> > > +   /* Restrict the O_PATH upgrade-mask of the caller. */
> > > +   if (opath_mask)
> > > +   *opath_mask &= upgrade_mask;
> > > +   return may_open_magiclink(upgrade_mask, acc_mode);
> > >  }
> > 
> > This looks racy because entries in the file descriptor table can be
> > switched out as long as task->files->file_lock isn't held. Unless I'm
> > missing something, something like the following (untested) would
> > bypass this restriction:
> 
> You're absolutely right -- good catch!
> 
> > Perhaps you could change nd_jump_link() to "void nd_jump_link(struct
> > path *path, umode_t link_mode)", and let proc_pid_get_link() pass the
> > link_mode through from an out-argument of .proc_get_link()? Then
> > proc_fd_link() could grab the proper mode in a race-free manner. And
> > nd_jump_link() could stash the mode in the nameidata.
> 
> This indeed does appear to be the simplest solution -- I'm currently
> testing a variation of the patch you proposed (with a few extra bits to
> deal with nd_jump_link and proc_get_link being used elsewhere).
> 
> I'll include this change (assuming it fixes the flaw you found) in the
> v13 series I'll send around next week. Thanks, Jann!

In case you're interested -- I've also included a selftest based on this
attack in my series (though it uses CLONE_FILES so that we could also
test O_EMPTYPATH, which wasn't affected because it didn't go through
procfs and thus couldn't hit the "outdated inode->i_mode" problem).

The attack script succeeds around 20% of the time on the original
patchset, and with the updated patchset it doesn't succeed in several
hundred thousand attempts (which I've repeated a few times).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH





Re: [PATCH v4] powerpc/setup_64: fix -Wempty-body warnings

2019-09-18 Thread Qian Cai
Michael, ping in case you might forget this one forever as well.

On Mon, 2019-07-15 at 14:32 -0400, Qian Cai wrote:
> At the beginning of setup_64.c, it has,
> 
>   #ifdef DEBUG
>   #define DBG(fmt...) udbg_printf(fmt)
>   #else
>   #define DBG(fmt...)
>   #endif
> 
> where DBG() could be compiled away, and generate warnings,
> 
> arch/powerpc/kernel/setup_64.c: In function 'initialize_cache_info':
> arch/powerpc/kernel/setup_64.c:579:49: warning: suggest braces around
> empty body in an 'if' statement [-Wempty-body]
> DBG("Argh, can't find dcache properties !\n");
>  ^
> arch/powerpc/kernel/setup_64.c:582:49: warning: suggest braces around
> empty body in an 'if' statement [-Wempty-body]
> DBG("Argh, can't find icache properties !\n");
> 
> Fix it by using the suggestions from Michael:
> 
> "Neither of those sites should use DBG(), that's not really early boot
> code, they should just use pr_warn().
> 
> And the other uses of DBG() in initialize_cache_info() should just be
> removed.
> 
> In smp_release_cpus() the entry/exit DBG's should just be removed, and
> the spinning_secondaries line should just be pr_debug().
> 
> That would just leave the two calls in early_setup(). If we taught
> udbg_printf() to return early when udbg_putc is NULL, then we could just
> call udbg_printf() unconditionally and get rid of the DBG macro
> entirely."
> 
> Suggested-by: Michael Ellerman 
> Signed-off-by: Qian Cai 
> ---
> 
> v4: Use the suggestions from Michael and __func__ per checkpatch.
> v3: Use no_printk() macro, and make sure that format and argument are always
> verified by the compiler using a more generic form ##__VA_ARGS__ per Joe.
> v2: Fix it by using a NOP while loop per Tyrel.
> 
>  arch/powerpc/kernel/setup_64.c | 26 ++
>  arch/powerpc/kernel/udbg.c | 14 --
>  2 files changed, 14 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> index 44b4c432a273..d2af4c228970 100644
> --- a/arch/powerpc/kernel/setup_64.c
> +++ b/arch/powerpc/kernel/setup_64.c
> @@ -68,12 +68,6 @@
>  
>  #include "setup.h"
>  
> -#ifdef DEBUG
> -#define DBG(fmt...) udbg_printf(fmt)
> -#else
> -#define DBG(fmt...)
> -#endif
> -
>  int spinning_secondaries;
>  u64 ppc64_pft_size;
>  
> @@ -305,7 +299,7 @@ void __init early_setup(unsigned long dt_ptr)
>   /* Enable early debugging if any specified (see udbg.h) */
>   udbg_early_init();
>  
> - DBG(" -> early_setup(), dt_ptr: 0x%lx\n", dt_ptr);
> + udbg_printf(" -> %s(), dt_ptr: 0x%lx\n", __func__, dt_ptr);
>  
>   /*
>* Do early initialization using the flattened device
> @@ -362,11 +356,11 @@ void __init early_setup(unsigned long dt_ptr)
>*/
>   this_cpu_enable_ftrace();
>  
> - DBG(" <- early_setup()\n");
> + udbg_printf(" <- %s()\n", __func__);
>  
>  #ifdef CONFIG_PPC_EARLY_DEBUG_BOOTX
>   /*
> -  * This needs to be done *last* (after the above DBG() even)
> +  * This needs to be done *last* (after the above udbg_printf() even)
>*
>* Right after we return from this function, we turn on the MMU
>* which means the real-mode access trick that btext does will
> @@ -436,8 +430,6 @@ void smp_release_cpus(void)
>   if (!use_spinloop())
>   return;
>  
> - DBG(" -> smp_release_cpus()\n");
> -
>   /* All secondary cpus are spinning on a common spinloop, release them
>* all now so they can start to spin on their individual paca
>* spinloops. For non SMP kernels, the secondary cpus never get out
> @@ -456,9 +448,7 @@ void smp_release_cpus(void)
>   break;
>   udelay(1);
>   }
> - DBG("spinning_secondaries = %d\n", spinning_secondaries);
> -
> - DBG(" <- smp_release_cpus()\n");
> + pr_debug("spinning_secondaries = %d\n", spinning_secondaries);
>  }
>  #endif /* CONFIG_SMP || CONFIG_KEXEC_CORE */
>  
> @@ -551,8 +541,6 @@ void __init initialize_cache_info(void)
>   struct device_node *cpu = NULL, *l2, *l3 = NULL;
>   u32 pvr;
>  
> - DBG(" -> initialize_cache_info()\n");
> -
>   /*
>* All shipping POWER8 machines have a firmware bug that
>* puts incorrect information in the device-tree. This will
> @@ -576,10 +564,10 @@ void __init initialize_cache_info(void)
>*/
>   if (cpu) {
>   if (!parse_cache_info(cpu, false, &ppc64_caches.l1d))
> - DBG("Argh, can't find dcache properties !\n");
> + pr_warn("Argh, can't find dcache properties !\n");
>  
>   if (!parse_cache_info(cpu, true, &ppc64_caches.l1i))
> - DBG("Argh, can't find icache properties !\n");
> + pr_warn("Argh, can't find icache properties !\n");
>  
>   /*
>* Try to find the L2 and L3 if any. Assume they are
> @@ -604,8 +592,6 @@ void __init 

Re: [Bug 204789] New: Boot failure with more than 256G of memory

2019-09-18 Thread Cameron Berkenpas

Hello,

Unfortunately, this patch set has made things quite a bit worse for me. 
Appending mem=256G doesn't fix it either. In all cases, the system at 
least gets past early boot, and then it usually panics and eventually 
reboots, or occasionally it just locks up entirely.


Here's my very first attempt at booting the kernel where I didn't even 
get a panic:

https://pastebin.com/a3TVZcVB

Here's another attempt where I get a panic:
https://pastebin.com/QsJjyC2v

Finally here's an attempt with mem=256G:
https://pastebin.com/swgLYie9

I don't know that these results are substantially different from each 
other, but perhaps there's something helpful.


Sometimes (but not in any of the above), the host gets to the point where 
systemd starts up, but ultimately it seems I get the same stacktrace.


At one point, I ended up with a CPU guarded out, but it was simple to 
recover.


-Cameron

On 9/17/19 8:15 PM, Aneesh Kumar K.V wrote:

On 9/13/19 10:58 PM, Cameron Berkenpas wrote:
Running against the kernel I built against 0034d395f89d and the 
problem is still there.


However, running against the kernel I built against the previous 
commit, a35a3c6f6065, and the system boots.


This being due to 0034d395f89d confirmed.



https://lore.kernel.org/linuxppc-dev/20190917145702.9214-1-aneesh.ku...@linux.ibm.com 



This series should help you.

-aneesh





[RFC PATCH] powerpc/mm/mce: Keep irq disabled during lockless page table walk

2019-09-18 Thread Aneesh Kumar K.V
__find_linux_pte() returns a page table entry pointer after walking the
page table without holding locks. To make it safe against a THP
split or collapse, we disable interrupts around the lockless
page table walk. We need to keep the interrupts disabled as long
as we use the page table entry pointer.

Cc: Balbir Singh 
Cc: Reza Arbab 
Cc: Santosh Sivaraj 
Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/kernel/mce_power.c | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index 356e7b99f661..585c37dc1b18 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -28,6 +28,7 @@
 unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
 {
pte_t *ptep;
+   unsigned long pfn;
unsigned int shift;
unsigned long flags;
struct mm_struct *mm;
@@ -39,18 +40,21 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
 
local_irq_save(flags);
 	ptep = __find_linux_pte(mm->pgd, addr, NULL, &shift);
-   local_irq_restore(flags);
 
-   if (!ptep || pte_special(*ptep))
-   return ULONG_MAX;
+   if (!ptep || pte_special(*ptep)) {
+   pfn = ULONG_MAX;
+   goto err_out;
+   }
 
if (shift > PAGE_SHIFT) {
unsigned long rpnmask = (1ul << shift) - PAGE_SIZE;
 
-   return pte_pfn(__pte(pte_val(*ptep) | (addr & rpnmask)));
-   }
-
-   return pte_pfn(*ptep);
+   pfn = pte_pfn(__pte(pte_val(*ptep) | (addr & rpnmask)));
+   } else
+   pfn = pte_pfn(*ptep);
+err_out:
+   local_irq_restore(flags);
+   return pfn;
 }
 
 /* flush SLBs and reload */
-- 
2.21.0



Re: [PATCH] powerpc: Add barrier_nospec to raw_copy_in_user()

2019-09-18 Thread Michael Ellerman
Michal Suchánek  writes:
> On Wed,  6 Mar 2019 12:10:38 +1100
> Suraj Jitindar Singh  wrote:
>
>> Commit ddf35cf3764b ("powerpc: Use barrier_nospec in copy_from_user()")
>> Added barrier_nospec before loading from user-controlled pointers.
>> The intention was to order the load from the potentially user-controlled
>> pointer vs a previous branch based on an access_ok() check or similar.
>> 
>> In order to achieve the same result, add a barrier_nospec to the
>> raw_copy_in_user() function before loading from such a user-controlled
>> pointer.
>> 
>> Signed-off-by: Suraj Jitindar Singh 
>> ---
>>  arch/powerpc/include/asm/uaccess.h | 1 +
>>  1 file changed, 1 insertion(+)
>> 
>> diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
>> index e3a731793ea2..bb615592d5bb 100644
>> --- a/arch/powerpc/include/asm/uaccess.h
>> +++ b/arch/powerpc/include/asm/uaccess.h
>> @@ -306,6 +306,7 @@ extern unsigned long __copy_tofrom_user(void __user *to,
>>  static inline unsigned long
>>  raw_copy_in_user(void __user *to, const void __user *from, unsigned long n)
>>  {
>> +barrier_nospec();
>>  return __copy_tofrom_user(to, from, n);
>>  }
>>  #endif /* __powerpc64__ */
>
> Hello,
>
> AFAICT this is not needed. The data cannot be used in the kernel without
> subsequent copy_from_user which already has the nospec barrier.

Suraj did talk to me about this before sending the patch. My memory is
that he came up with a scenario where it was at least theoretically
useful, but he didn't mention that in the change log. He was working on
nested KVM at the time.

I've now forgotten, and he's moved on to other things so probably won't
remember either, if he's even checking his mail :/

cheers


Re: [PATCH 4/5] ocxl: Add functions to map/unmap LPC memory

2019-09-18 Thread Frederic Barrat




Le 17/09/2019 à 03:43, Alastair D'Silva a écrit :

From: Alastair D'Silva 

Add functions to map/unmap LPC memory

Signed-off-by: Alastair D'Silva 
---
  drivers/misc/ocxl/config.c|  4 +++
  drivers/misc/ocxl/core.c  | 50 +++
  drivers/misc/ocxl/link.c  |  4 +--
  drivers/misc/ocxl/ocxl_internal.h | 10 +--
  include/misc/ocxl.h   | 18 +++
  5 files changed, 82 insertions(+), 4 deletions(-)

diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
index c8e19bfb5ef9..fb0c3b6f8312 100644
--- a/drivers/misc/ocxl/config.c
+++ b/drivers/misc/ocxl/config.c
@@ -568,6 +568,10 @@ static int read_afu_lpc_memory_info(struct pci_dev *dev,
afu->special_purpose_mem_size =
total_mem_size - lpc_mem_size;
}
+
+	dev_info(&dev->dev, "Probed LPC memory of %#llx bytes and special purpose memory of %#llx bytes\n",
+		afu->lpc_mem_size, afu->special_purpose_mem_size);
+
return 0;
  }
  
diff --git a/drivers/misc/ocxl/core.c b/drivers/misc/ocxl/core.c

index fdfe4e0a34e1..eb24bb9d655f 100644
--- a/drivers/misc/ocxl/core.c
+++ b/drivers/misc/ocxl/core.c
@@ -210,6 +210,55 @@ static void unmap_mmio_areas(struct ocxl_afu *afu)
release_fn_bar(afu->fn, afu->config.global_mmio_bar);
  }
  
+int ocxl_map_lpc_mem(struct ocxl_afu *afu)

+{
+   struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent);
+
+	if ((afu->config.lpc_mem_size + afu->config.special_purpose_mem_size) == 0)
+   return 0;
+
+   afu->lpc_base_addr = ocxl_link_lpc_online(afu->fn->link, dev);
+   if (afu->lpc_base_addr == 0)
+   return -EINVAL;
+
+   if (afu->config.lpc_mem_size) {
+		afu->lpc_res.start = afu->lpc_base_addr + afu->config.lpc_mem_offset;
+		afu->lpc_res.end = afu->lpc_res.start + afu->config.lpc_mem_size - 1;
+   }
+
+   if (afu->config.special_purpose_mem_size) {
+		afu->special_purpose_res.start = afu->lpc_base_addr +
+			afu->config.special_purpose_mem_offset;
+		afu->special_purpose_res.end = afu->special_purpose_res.start +
+			afu->config.special_purpose_mem_size - 1;
+   }
+
+   return 0;
+}
+EXPORT_SYMBOL(ocxl_map_lpc_mem);
+
+struct resource *ocxl_afu_lpc_mem(struct ocxl_afu *afu)
+{
+	return &afu->lpc_res;
+}
+EXPORT_SYMBOL(ocxl_afu_lpc_mem);
+
+static void unmap_lpc_mem(struct ocxl_afu *afu)
+{
+   struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent);
+
+   if (afu->lpc_res.start || afu->special_purpose_res.start) {
+   void *link = afu->fn->link;
+
+   ocxl_link_lpc_offline(link, dev);
+
+   afu->lpc_res.start = 0;
+   afu->lpc_res.end = 0;
+   afu->special_purpose_res.start = 0;
+   afu->special_purpose_res.end = 0;
+   }
+}
+
  static int configure_afu(struct ocxl_afu *afu, u8 afu_idx, struct pci_dev 
*dev)
  {
int rc;
@@ -250,6 +299,7 @@ static int configure_afu(struct ocxl_afu *afu, u8 afu_idx, struct pci_dev *dev)
  
  static void deconfigure_afu(struct ocxl_afu *afu)

  {
+   unmap_lpc_mem(afu);
unmap_mmio_areas(afu);
reclaim_afu_pasid(afu);
reclaim_afu_actag(afu);
diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index 2874811a4398..9e303a5f4d85 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -738,7 +738,7 @@ int ocxl_link_add_lpc_mem(void *link_handle, u64 size)
  }
  EXPORT_SYMBOL_GPL(ocxl_link_add_lpc_mem);
  
-u64 ocxl_link_lpc_map(void *link_handle, struct pci_dev *pdev)

+u64 ocxl_link_lpc_online(void *link_handle, struct pci_dev *pdev)
  {
struct ocxl_link *link = (struct ocxl_link *) link_handle;
  
@@ -759,7 +759,7 @@ u64 ocxl_link_lpc_map(void *link_handle, struct pci_dev *pdev)

return link->lpc_mem;
  }
  
-void ocxl_link_lpc_release(void *link_handle, struct pci_dev *pdev)

+void ocxl_link_lpc_offline(void *link_handle, struct pci_dev *pdev)



Could we avoid the renaming by squashing it with the previous patch?



  {
struct ocxl_link *link = (struct ocxl_link *) link_handle;
  
diff --git a/drivers/misc/ocxl/ocxl_internal.h b/drivers/misc/ocxl/ocxl_internal.h

index db2647a90fc8..5656a4aab5b7 100644
--- a/drivers/misc/ocxl/ocxl_internal.h
+++ b/drivers/misc/ocxl/ocxl_internal.h
@@ -52,6 +52,12 @@ struct ocxl_afu {
void __iomem *global_mmio_ptr;
u64 pp_mmio_start;
void *private;
+   u64 lpc_base_addr; /* Covers both LPC & special purpose memory */
+   struct bin_attribute attr_global_mmio;
+   struct bin_attribute attr_lpc_mem;
+   struct resource lpc_res;
+   struct bin_attribute attr_special_purpose_mem;
+   struct resource special_purpose_res;
  };
  
  enum ocxl_context_status {

@@ -170,7 +176,7 @@ extern u64 

Re: [PATCH 2/5] powerpc: Map & release OpenCAPI LPC memory

2019-09-18 Thread Frederic Barrat




Le 17/09/2019 à 03:42, Alastair D'Silva a écrit :

From: Alastair D'Silva 

Map & release OpenCAPI LPC memory.

Signed-off-by: Alastair D'Silva 
---
  arch/powerpc/include/asm/pnv-ocxl.h   |  2 ++
  arch/powerpc/platforms/powernv/ocxl.c | 42 +++
  2 files changed, 44 insertions(+)

diff --git a/arch/powerpc/include/asm/pnv-ocxl.h b/arch/powerpc/include/asm/pnv-ocxl.h
index 7de82647e761..f8f8ffb48aa8 100644
--- a/arch/powerpc/include/asm/pnv-ocxl.h
+++ b/arch/powerpc/include/asm/pnv-ocxl.h
@@ -32,5 +32,7 @@ extern int pnv_ocxl_spa_remove_pe_from_cache(void 
*platform_data, int pe_handle)
  
  extern int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr);

  extern void pnv_ocxl_free_xive_irq(u32 irq);
+extern u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size);
+extern void pnv_ocxl_platform_lpc_release(struct pci_dev *pdev);
  
  #endif /* _ASM_PNV_OCXL_H */

diff --git a/arch/powerpc/platforms/powernv/ocxl.c b/arch/powerpc/platforms/powernv/ocxl.c
index 8c65aacda9c8..81393728d6a3 100644
--- a/arch/powerpc/platforms/powernv/ocxl.c
+++ b/arch/powerpc/platforms/powernv/ocxl.c
@@ -475,6 +475,48 @@ void pnv_ocxl_spa_release(void *platform_data)
  }
  EXPORT_SYMBOL_GPL(pnv_ocxl_spa_release);
  
+u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size)

+{
+   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
+   struct pnv_phb *phb = hose->private_data;
+   struct pci_dn *pdn = pci_get_pdn(pdev);
+   u32 bdfn = (pdn->busno << 8) | pdn->devfn;



We can spare a call to pci_get_pdn() with
bdfn = (pdev->bus->number << 8) | pdev->devfn;



+   u64 base_addr = 0;
+
+	int rc = opal_npu_mem_alloc(phb->opal_id, bdfn, size, &base_addr);
+
+   WARN_ON(rc);


Instead of a WARN, we should catch the error and return a null address 
to the caller.



+
+   base_addr = be64_to_cpu(base_addr);
+
+   rc = check_hotplug_memory_addressable(base_addr, base_addr + size);



That code is missing?



+   if (rc) {
+		dev_warn(&pdev->dev,
+			 "LPC memory range 0x%llx-0x%llx is not fully addressable",
+base_addr, base_addr + size - 1);
+   return 0;
+   }
+
+
+   return base_addr;
+}
+EXPORT_SYMBOL_GPL(pnv_ocxl_platform_lpc_setup);
+
+void pnv_ocxl_platform_lpc_release(struct pci_dev *pdev)
+{
+   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
+   struct pnv_phb *phb = hose->private_data;
+   struct pci_dn *pdn = pci_get_pdn(pdev);
+   u32 bdfn;
+   int rc;
+
+   bdfn = (pdn->busno << 8) | pdn->devfn;
+   rc = opal_npu_mem_release(phb->opal_id, bdfn);
+   WARN_ON(rc);



Same comments as above.

  Fred




+}
+EXPORT_SYMBOL_GPL(pnv_ocxl_platform_lpc_release);
+
+
  int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int pe_handle)
  {
struct spa_data *data = (struct spa_data *) platform_data;





Re: [PATCH 3/5] ocxl: Tally up the LPC memory on a link & allow it to be mapped

2019-09-18 Thread Frederic Barrat




Le 17/09/2019 à 03:42, Alastair D'Silva a écrit :

From: Alastair D'Silva 

Tally up the LPC memory on an OpenCAPI link & allow it to be mapped

Signed-off-by: Alastair D'Silva 
---
  drivers/misc/ocxl/core.c  |  9 +
  drivers/misc/ocxl/link.c  | 61 +++
  drivers/misc/ocxl/ocxl_internal.h | 42 +
  3 files changed, 112 insertions(+)

diff --git a/drivers/misc/ocxl/core.c b/drivers/misc/ocxl/core.c
index b7a09b21ab36..fdfe4e0a34e1 100644
--- a/drivers/misc/ocxl/core.c
+++ b/drivers/misc/ocxl/core.c
@@ -230,8 +230,17 @@ static int configure_afu(struct ocxl_afu *afu, u8 afu_idx, struct pci_dev *dev)
if (rc)
goto err_free_pasid;
  
+	if (afu->config.lpc_mem_size || afu->config.special_purpose_mem_size) {

+		rc = ocxl_link_add_lpc_mem(afu->fn->link,
+			afu->config.lpc_mem_size + afu->config.special_purpose_mem_size);



I don't think we should count the special purpose memory, as it's not 
meant to be accessed through the GPU mem BAR, but I'll check.


What happens when unconfiguring the AFU? We should reduce the range (see 
also below). Partial reconfig doesn't seem so far off, so we should take 
it into account.




+   if (rc)
+   goto err_free_mmio;
+   }
+
return 0;
  
+err_free_mmio:

+   unmap_mmio_areas(afu);
  err_free_pasid:
reclaim_afu_pasid(afu);
  err_free_actag:
diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index 58d111afd9f6..2874811a4398 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -84,6 +84,11 @@ struct ocxl_link {
int dev;
atomic_t irq_available;
struct spa *spa;
+   struct mutex lpc_mem_lock;
+   u64 lpc_mem_sz; /* Total amount of LPC memory presented on the link */
+   u64 lpc_mem;
+   int lpc_consumers;
+
void *platform_data;
  };
  static struct list_head links_list = LIST_HEAD_INIT(links_list);
@@ -396,6 +401,8 @@ static int alloc_link(struct pci_dev *dev, int PE_mask, struct ocxl_link **out_l
if (rc)
goto err_spa;
  
+	mutex_init(&link->lpc_mem_lock);

+
/* platform specific hook */
 	rc = pnv_ocxl_spa_setup(dev, link->spa->spa_mem, PE_mask,
 				&link->platform_data);
@@ -711,3 +718,57 @@ void ocxl_link_free_irq(void *link_handle, int hw_irq)
 	atomic_inc(&link->irq_available);
  }
  EXPORT_SYMBOL_GPL(ocxl_link_free_irq);
+
+int ocxl_link_add_lpc_mem(void *link_handle, u64 size)
+{
+   struct ocxl_link *link = (struct ocxl_link *) link_handle;
+
+   u64 orig_size;
+   bool good = false;
+
+	mutex_lock(&link->lpc_mem_lock);
+   orig_size = link->lpc_mem_sz;
+   link->lpc_mem_sz += size;




We have a choice to make here:
1. either we only support one LPC memory-carrying AFU (and the above is 
overkill)
2. or we support multiple AFUs with LPC memory (on the same function), 
but then I think the above is too simple.


From the OpenCAPI spec, each AFU can define a chunk of memory with a 
starting address and a size. There's no rule which says they have to be 
contiguous. There's no rule which says it must start at 0. So to support 
multiple AFUs with LPC memory, we should record the current maximum 
range instead of just the global size. Ultimately, we need to tell the 
NPU the range of permissible addresses. It starts at 0, so we need to 
take into account any initial offset and holes.


I would go for option 2, to at least be consistent within ocxl and 
support multiple AFUs. Even though I don't think we'll see FPGA images 
with multiple AFUs with LPC memory any time soon.




+   good = orig_size < link->lpc_mem_sz;
+	mutex_unlock(&link->lpc_mem_lock);
+
+   // Check for overflow
+   return (good) ? 0 : -EINVAL;
+}
+EXPORT_SYMBOL_GPL(ocxl_link_add_lpc_mem);



Do the symbol really need to be exported? IIUC, the next patch defines a 
higher level ocxl_afu_map_lpc_mem() which is meant to be called by a 
calling driver.




+
+u64 ocxl_link_lpc_map(void *link_handle, struct pci_dev *pdev)
+{
+   struct ocxl_link *link = (struct ocxl_link *) link_handle;
+
+	mutex_lock(&link->lpc_mem_lock);
+   if (link->lpc_mem) {
+   u64 lpc_mem = link->lpc_mem;
+
+   link->lpc_consumers++;
+		mutex_unlock(&link->lpc_mem_lock);
+   return lpc_mem;
+   }
+
+   link->lpc_mem = pnv_ocxl_platform_lpc_setup(pdev, link->lpc_mem_sz);
+   if (link->lpc_mem)
+   link->lpc_consumers++;
+	mutex_unlock(&link->lpc_mem_lock);
+
+   return link->lpc_mem;



Should be cached in a temp variable, like on the fast path, otherwise 
it's accessed with no lock.




+}
+
+void ocxl_link_lpc_release(void *link_handle, struct pci_dev *pdev)
+{
+   struct ocxl_link *link = (struct ocxl_link *) link_handle;
+
+	mutex_lock(&link->lpc_mem_lock);
+   link->lpc_consumers--;
+   if 

[PATCH] powerpc/book3s64/radix: Avoid WARN_ON in destroy_context

2019-09-18 Thread Aneesh Kumar K.V
On failed task initialization due to memory allocation failures,
we can call into destroy_context() with the process_tb entry still set.
This patch forces the process_tb entry to zero in destroy_context(). With
this patch, we lose the ability to detect whether we are destroying a
context without flushing the process table entry.

 WARNING: CPU: 4 PID: 6368 at arch/powerpc/mm/mmu_context_book3s64.c:246 destroy_context+0x58/0x340
 NIP [c00875f8] destroy_context+0x58/0x340
 LR [c013da18] __mmdrop+0x78/0x270
 Call Trace:
 [c00f7db77c80] [c013da18] __mmdrop+0x78/0x270
 [c00f7db77cf0] [c04d6a34] __do_execve_file.isra.13+0xbd4/0x1000
 [c00f7db77e00] [c04d7428] sys_execve+0x58/0x70
 [c00f7db77e30] [c000b388] system_call+0x5c/0x70

Reported-by: Priya M.A 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/mmu_context.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
index 2d0cb5ba9a47..3a0f5ab190ec 100644
--- a/arch/powerpc/mm/book3s64/mmu_context.c
+++ b/arch/powerpc/mm/book3s64/mmu_context.c
@@ -256,8 +256,18 @@ void destroy_context(struct mm_struct *mm)
 #ifdef CONFIG_SPAPR_TCE_IOMMU
 	WARN_ON_ONCE(!list_empty(&mm->context.iommu_group_mem_list));
 #endif
+	/*
+	 * For tasks which were successfully initialized we would end up
+	 * calling arch_exit_mmap, which clears the process table entry.
+	 * arch_exit_mmap gets called before the required fullmm tlb flush,
+	 * which does a RIC=2 flush. Hence for an initialized task, we do
+	 * clear any cached process table entry. The condition below handles
+	 * the error case during task init. We do set the process table entry
+	 * early, and if we fail a task initialization, we need to ensure
+	 * the process table entry is zeroed. We need not worry about process
+	 * table entry caches because the task never ran with the PID value.
+	 */
if (radix_enabled())
-   WARN_ON(process_tb[mm->context.id].prtb0 != 0);
+   process_tb[mm->context.id].prtb0 = 0;
else
subpage_prot_free(mm);
 	destroy_contexts(&mm->context);
-- 
2.21.0



Re: [PATCH v12 05/12] namei: obey trailing magic-link DAC permissions

2019-09-18 Thread Aleksa Sarai
On 2019-09-17, Jann Horn  wrote:
> On Wed, Sep 4, 2019 at 10:21 PM Aleksa Sarai  wrote:
> > The ability for userspace to "re-open" file descriptors through
> > /proc/self/fd has been a very useful tool for all sorts of usecases
> > (container runtimes are one common example). However, the current
> > interface for doing this has resulted in some pretty subtle security
> > holes. Userspace can re-open a file descriptor with more permissions
> > than the original, which can result in cases such as /proc/$pid/exe
> > being re-opened O_RDWR at a later date even though (by definition)
> > /proc/$pid/exe cannot be opened for writing. When combined with O_PATH
> > the results can get even more confusing.
> [...]
> > Instead we have to restrict it in such a way that it doesn't break
> > (good) users but does block potential attackers. The solution applied in
> > this patch is to restrict *re-opening* (not resolution through)
> > magic-links by requiring that mode of the link be obeyed. Normal
> > symlinks have modes of a+rwx but magic-links have other modes. These
> > magic-link modes were historically ignored during path resolution, but
> > they've now been re-purposed for more useful ends.
> 
> Thanks for dealing with this issue!
> 
> [...]
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 209c51a5226c..54d57dad0f91 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -872,7 +872,7 @@ void nd_jump_link(struct path *path)
> >
> > nd->path = *path;
> > nd->inode = nd->path.dentry->d_inode;
> > -   nd->flags |= LOOKUP_JUMPED;
> > +   nd->flags |= LOOKUP_JUMPED | LOOKUP_MAGICLINK_JUMPED;
> >  }
> [...]
> > +static int trailing_magiclink(struct nameidata *nd, int acc_mode,
> > + fmode_t *opath_mask)
> > +{
> > +   struct inode *inode = nd->link_inode;
> > +   fmode_t upgrade_mask = 0;
> > +
> > +   /* Was the trailing_symlink() a magic-link? */
> > +   if (!(nd->flags & LOOKUP_MAGICLINK_JUMPED))
> > +   return 0;
> > +
> > +   /*
> > +* Figure out the upgrade-mask of the link_inode. Since these aren't
> > +* strictly POSIX semantics we don't do an acl_permission_check() 
> > here,
> > +* so we only care that at least one bit is set for each 
> > upgrade-mode.
> > +*/
> > +   if (inode->i_mode & S_IRUGO)
> > +   upgrade_mask |= FMODE_PATH_READ;
> > +   if (inode->i_mode & S_IWUGO)
> > +   upgrade_mask |= FMODE_PATH_WRITE;
> > +   /* Restrict the O_PATH upgrade-mask of the caller. */
> > +   if (opath_mask)
> > +   *opath_mask &= upgrade_mask;
> > +   return may_open_magiclink(upgrade_mask, acc_mode);
> >  }
> 
> This looks racy because entries in the file descriptor table can be
> switched out as long as task->files->file_lock isn't held. Unless I'm
> missing something, something like the following (untested) would
> bypass this restriction:

You're absolutely right -- good catch!

> Perhaps you could change nd_jump_link() to "void nd_jump_link(struct
> path *path, umode_t link_mode)", and let proc_pid_get_link() pass the
> link_mode through from an out-argument of .proc_get_link()? Then
> proc_fd_link() could grab the proper mode in a race-free manner. And
> nd_jump_link() could stash the mode in the nameidata.

This indeed does appear to be the simplest solution -- I'm currently
testing a variation of the patch you proposed (with a few extra bits to
deal with nd_jump_link and proc_get_link being used elsewhere).

I'll include this change (assuming it fixes the flaw you found) in the
v13 series I'll send around next week. Thanks, Jann!

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH





Re: [PATCH v2 2/2] powerpc/mm: call H_BLOCK_REMOVE when supported

2019-09-18 Thread Michael Ellerman
Hi Laurent,

Few comments ...

Laurent Dufour  writes:
> Now we do not call H_BLOCK_REMOVE all the time when the feature is
> exhibited.

This isn't true until after the patch is applied, ie. the tense is
wrong. The rest of the change log explains things fine, so just drop
that sentence I think.

Can you include the info about the oops in here.

> Depending on the hardware and the hypervisor, the hcall H_BLOCK_REMOVE may
> not be able to process all the page size for a segment base page size, as
  ^
  sizes
> reported by the TLB Invalidate Characteristics.o
 ^
 stray "o"
>
> For each couple base segment page size and actual page size, this
   ^
   "pair of"
> characteristic is telling the size of the block the hcall is supporting.
 ^  ^
 "tells us" supports
>
> Due to the involve complexity in do_block_remove() and call_block_remove(),
 ^
 "required" is better I think
> and the fact currently a 8 size block is returned by the hypervisor,  we
  ^  ^
  that   "block of size 8"
> are only supporting 8 size block to the H_BLOCK_REMOVE hcall.
>
> Furthermore a warning message is displayed at boot time in the case of an
> unsupported block size.

I'm not sure we should be doing that? It could be unnecessarily spammy.

> In order to identify this limitation easily in the code, a local define
> HBLKR_SUPPORTED_SIZE defining the currently supported block size, and a
> dedicated checking helper is_supported_hlbkr() are introduced.
>
> For regular pages and hugetlb, the assumption is made that the page size is
> equal to the base page size. For THP the page size is assumed to be 16M.
>
> Signed-off-by: Laurent Dufour 
> ---
>  arch/powerpc/platforms/pseries/lpar.c | 35 +--
>  1 file changed, 33 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/lpar.c 
> b/arch/powerpc/platforms/pseries/lpar.c
> index 98a5c2ff9a0b..e2ad9b3b1097 100644
> --- a/arch/powerpc/platforms/pseries/lpar.c
> +++ b/arch/powerpc/platforms/pseries/lpar.c
> @@ -65,6 +65,13 @@ EXPORT_SYMBOL(plpar_hcall_norets);
>   */
>  static int hblkr_size[MMU_PAGE_COUNT][MMU_PAGE_COUNT];
>  
> +/*
> + * Due to the involved complexity, and that the current hypervisor is only
> + * returning this value or 0, we are limiting the support of the 
> H_BLOCK_REMOVE
> + * buffer size to 8 size block.
> + */
> +#define HBLKR_SUPPORTED_BLOCK_SIZE 8
> +
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
>  static u8 dtl_mask = DTL_LOG_PREEMPT;
>  #else
> @@ -993,6 +1000,15 @@ static void pSeries_lpar_hpte_invalidate(unsigned long 
> slot, unsigned long vpn,
>  #define HBLKR_CTRL_ERRNOTFOUND   0x8800UL
>  #define HBLKR_CTRL_ERRBUSY   0xa000UL
>  
> +/*
> + * Returned true if we are supporting this block size for the specified 
> segment
> + * base page size and actual page size.
> + */
> +static inline bool is_supported_hlbkr(int bpsize, int psize)
> +{
> + return (hblkr_size[bpsize][psize] == HBLKR_SUPPORTED_BLOCK_SIZE);
> +}
> +
>  /**
>   * H_BLOCK_REMOVE caller.
>   * @idx should point to the latest @param entry set with a PTEX.
> @@ -1152,7 +1168,11 @@ static inline void 
> __pSeries_lpar_hugepage_invalidate(unsigned long *slot,
>   if (lock_tlbie)
>   spin_lock_irqsave(&pSeries_lpar_tlbie_lock, flags);
>  
> - if (firmware_has_feature(FW_FEATURE_BLOCK_REMOVE))
> + /*
> +  * Assuming THP size is 16M, and we only support 8 bytes size buffer
> +  * for the moment.
> +  */
> + if (is_supported_hlbkr(psize, MMU_PAGE_16M))

It's not very clear that this is correct in all cases. ie. how do we
know we're being called for THP and not regular huge page?

I think we're only called via:
  flush_hash_hugepage()
  -> mmu_hash_ops.hugepage_invalidate()
 pSeries_lpar_hugepage_invalidate()
 -> __pSeries_lpar_hugepage_invalidate()

And flush_hash_hugepage() is called via:
  __hash_page_thp()
  and
  hpte_do_hugepage_flush()

The first is presumably fine, the 2nd is called in a few places:
  __flush_hash_table_range() under if (is_thp)
  hash__pmd_hugepage_update()


But it's a little bit fragile if the code ever evolves. Not sure if
there's a better solution, other than just documenting it.

>   hugepage_block_invalidate(slot, vpn, count, psize, ssize);
>   else
>   hugepage_bulk_invalidate(slot, vpn, count, psize, ssize);
> @@ -1437,6 +1457,14 @@ void __init 
> pseries_lpar_read_hblkr_characteristics(void)
>  
>   block_size = 1 << block_size;
>  
> + /*
> +  * If the block size is not supported by the kernel, report it,
> +  * but continue reading the 

Re: [PATCH v2 1/2] powperc/mm: read TLB Block Invalidate Characteristics

2019-09-18 Thread Michael Ellerman
Hi Laurent,

Comments below ...

Laurent Dufour  writes:
> The PAPR document specifies the TLB Block Invalidate Characteristics which
> tells for each couple segment base page size, actual page size, the size of
 ^
 "pair of" again

> the block the hcall H_BLOCK_REMOVE is supporting.
 ^
 "supports" is fine.

> These characteristics are loaded at boot time in a new table hblkr_size.
> The table is appart the mmu_psize_def because this is specific to the
   ^
   "separate from"

> pseries architecture.
  ^
  platform
>
> A new init service, pseries_lpar_read_hblkr_characteristics() is added to
 ^
 function

> read the characteristics. In that function, the size of the buffer is set
> to twice the number of known page size, plus 10 bytes to ensure we have
> enough place. This new init function is called from pSeries_setup_arch().
 ^
 space
>
> Signed-off-by: Laurent Dufour 
> ---
>  .../include/asm/book3s/64/tlbflush-hash.h |   1 +
>  arch/powerpc/platforms/pseries/lpar.c | 138 ++
>  arch/powerpc/platforms/pseries/setup.c|   1 +
>  3 files changed, 140 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h 
> b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> index 64d02a704bcb..74155cc8cf89 100644
> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> @@ -117,4 +117,5 @@ extern void __flush_hash_table_range(struct mm_struct 
> *mm, unsigned long start,
>unsigned long end);
>  extern void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd,
>   unsigned long addr);
> +extern void pseries_lpar_read_hblkr_characteristics(void);

This doesn't need "extern", and also should go in
arch/powerpc/platforms/pseries/pseries.h as it's local to that directory.

You're using "hblkr" in a few places, can we instead make it "hblkrm" -
"rm" is a well known abbreviation for "remove".


> diff --git a/arch/powerpc/platforms/pseries/lpar.c 
> b/arch/powerpc/platforms/pseries/lpar.c
> index 36b846f6e74e..98a5c2ff9a0b 100644
> --- a/arch/powerpc/platforms/pseries/lpar.c
> +++ b/arch/powerpc/platforms/pseries/lpar.c
> @@ -56,6 +56,15 @@ EXPORT_SYMBOL(plpar_hcall);
>  EXPORT_SYMBOL(plpar_hcall9);
>  EXPORT_SYMBOL(plpar_hcall_norets);
>  
> +/*
> + * H_BLOCK_REMOVE supported block size for this page size in segment who's 
> base
> + * page size is that page size.
> + *
> + * The first index is the segment base page size, the second one is the 
> actual
> + * page size.
> + */
> +static int hblkr_size[MMU_PAGE_COUNT][MMU_PAGE_COUNT];

Can you make that __ro_after_init, it goes at the end, eg:

static int hblkr_size[MMU_PAGE_COUNT][MMU_PAGE_COUNT] __ro_after_init;

> @@ -1311,6 +1320,135 @@ static void do_block_remove(unsigned long number, 
> struct ppc64_tlb_batch *batch,
>   (void)call_block_remove(pix, param, true);
>  }
>  
> +/*
> + * TLB Block Invalidate Characteristics These characteristics define the 
> size of
   ^
   newline before here?

> + * the block the hcall H_BLOCK_REMOVE is able to process for each couple 
> segment
> + * base page size, actual page size.
> + *
> + * The ibm,get-system-parameter properties is returning a buffer with the
> + * following layout:
> + *
> + * [ 2 bytes size of the RTAS buffer (without these 2 bytes) ]
 ^
 "excluding"

> + * -
> + * TLB Block Invalidate Specifiers:
> + * [ 1 byte LOG base 2 of the TLB invalidate block size being specified ]
> + * [ 1 byte Number of page sizes (N) that are supported for the specified
> + *  TLB invalidate block size ]
> + * [ 1 byte Encoded segment base page size and actual page size
> + *  MSB=0 means 4k segment base page size and actual page size
> + *  MSB=1 the penc value in mmu_psize_def ]
> + * ...
> + * -
> + * Next TLB Block Invalidate Specifiers...
> + * -
> + * [ 0 ]
> + */
> +static inline void __init set_hblk_bloc_size(int bpsize, int psize,
> +  unsigned int block_size)

"static inline" and __init are sort of contradictory.

Just make it "static void __init" and the compiler will inline it
anyway, but if it decides not to the section will be correct.

This one uses "hblk"? Should it be set_hblkrm_block_size() ?

> +{
> + if (block_size > hblkr_size[bpsize][psize])
> + hblkr_size[bpsize][psize] = block_size;
> +}
> +
> +/*
> + * Decode the Encoded segment base page size and actual page size.
> + * PAPR specifies with bit 0 as MSB:
> + *   - bit 0 is the L bit
> + *   - bits 

Re: [PATCH v2 0/2] powerpc/mm: Conditionally call H_BLOCK_REMOVE

2019-09-18 Thread Michael Ellerman
Hi Laurent,

Thanks for fixing this, just a few comments.

Laurent Dufour  writes:
> Since the commit ba2dd8a26baa ("powerpc/pseries/mm: call H_BLOCK_REMOVE"),
> the call to H_BLOCK_REMOVE is always done if the feature is exhibited.
>
> However, the hypervisor may not support all the block size for the hcall
> H_BLOCK_REMOVE depending on the segment base page size and actual page
> size.
>
> When unsupported block size is used, the hcall H_BLOCK_REMOVE is returning
> H_PARAM, which is triggering a BUG_ON check leading to a panic like this:

Missing panic :)

Also can you put that detail in the 2nd commit, so that it's obvious
that it is a fix for an oops.

> The PAPR document specifies the TLB Block Invalidate Characteristics which

Can you give a section/page number for the PAPR reference.

> tells for each couple segment base page size, actual page size, the size of
 ^
 "pair of" is better than "couple" I think

> the block the hcall H_BLOCK_REMOVE is supporting.
>
> Supporting various block sizes doesn't seem needed at this time since all
> systems I was able to play with were supporting an 8-address block size,
> which is the maximum through the hcall, or none at all. Supporting various
> sizes would complicate the algorithm in call_block_remove(), so unless this
> is required, it is not done.
>
> In the case of a block size different from 8, a warning message is displayed
> at boot time and that block size will be ignored when checking for
> H_BLOCK_REMOVE support.
>
> Due to the minimal amount of hardware showing a limited set of
> H_BLOCK_REMOVE supported page size, I don't think there is a need to push
> this series to the stable mailing list.

But the hardware that is exhibiting it, by crashing and not booting, is
mostly (all?) older hardware. So I think we probably should send it back
to stable, otherwise those machines may never get an updated kernel.

Can you add Fixes: tags to the commits, they should point at the commit
that added H_BLOCK_REMOVE support.

cheers

> The first patch is reading the characteristic through the hcall
> ibm,get-system-parameter and record the supported block size for each page
> size.  The second patch is changing the check used to detect the
> H_BLOCK_REMOVE availability to take care of the base page size and page
> size couple.
>
> Changes since V1:
>
>  - Remove penc initialisation, this is already done in
>mmu_psize_set_default_penc()
>  - Add details on the TLB Block Invalidate Characteristics's buffer format
>  - Introduce #define instead of using direct numerical values
>  - Function reading the characteristics is now directly called from
>pSeries_setup_arch()
>  - The characteristics are now stored in a dedciated table static to lpar.c
>
> Laurent Dufour (2):
>   powperc/mm: read TLB Block Invalidate Characteristics
>   powerpc/mm: call H_BLOCK_REMOVE when supported
>
>  .../include/asm/book3s/64/tlbflush-hash.h |   1 +
>  arch/powerpc/platforms/pseries/lpar.c | 173 +-
>  arch/powerpc/platforms/pseries/setup.c|   1 +
>  3 files changed, 173 insertions(+), 2 deletions(-)
>
> -- 
> 2.23.0


Re: [PATCH v1 0/4] powerpc: Add support for GENERIC_EARLY_IOREMAP

2019-09-18 Thread Christoph Hellwig
On Thu, Sep 12, 2019 at 01:49:40PM +, Christophe Leroy wrote:
> This series adds support for GENERIC_EARLY_IOREMAP on powerpc.
> 
> It also adds a warning in the standard ioremap() when it's called
> before vmalloc is available in order to help locate those users.
> 
> Next step will be to incrementaly migrate all early users of
> ioremap() to using early_ioremap() or other method.
> 
> Once they are all converted we can drop all the logic
> behind ioremap_bot.

Thanks, this looks pretty nice!


Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline

2019-09-18 Thread Aneesh Kumar K.V

On 9/18/19 5:01 PM, Michael Ellerman wrote:

"Naveen N. Rao"  writes:

Michael Ellerman wrote:

"Gautham R. Shenoy"  writes:

From: "Gautham R. Shenoy" 







Also, since we expose [S]PURR through sysfs, any tools that make use of
that directly are also affected due to this.


But again if we've had the current behaviour for 10 years then arguably
that's now the correct behaviour.


Or nobody is looking at PURR-based accounting/usage metrics on Linux. We 
got some customers complaining about this recently once they started 
using the metric.


-aneesh



Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline

2019-09-18 Thread Gautham R Shenoy
On Wed, Sep 18, 2019 at 03:14:15PM +1000, Michael Ellerman wrote:
> "Gautham R. Shenoy"  writes:
> > From: "Gautham R. Shenoy" 
> >
> > Currently on Pseries Linux Guests, the offlined CPU can be put to one
> > of the following two states:
> >- Long term processor cede (also called extended cede)
> >- Returned to the Hypervisor via RTAS "stop-self" call.
> >
> > This is controlled by the kernel boot parameter "cede_offline=on/off".
> >
> > By default the offlined CPUs enter extended cede.
> 
> Since commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an 
> appropriate offline state") (Nov 2009)
> 
> Which you wrote :)

Mea Culpa! I forgot to include the "Fixes commit 3aa565f53c39" into
Patch 1 of the series.

> 
> Why was that wrong?

It was wrong by PHYP's definition of a "not-active" CPU. From the point
of view of that hypervisor, a CPU is not-active iff it is in RTAS
"stop-self". Thus if a CPU is offlined via extended cede, and not using
any cycles, it is still considered to be active by PHYP. This causes
PURR accounting to be broken. 

> 
> > The PHYP hypervisor considers CPUs in extended cede to be "active"
> > since the CPUs are still under the control of the Linux Guests. Hence, when 
> > we change the
> > SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs
> > will continue to count the values for offlined CPUs in extended cede
> > as if they are online.
> >
> > One of the expectations with PURR is that the for an interval of time,
> > the sum of the PURR increments across the online CPUs of a core should
> > equal the number of timebase ticks for that interval.
> >
> > This is currently not the case.
> 
> But why does that matter? It's just some accounting stuff, does it
> actually break something meaningful?

As Naveen mentioned, it breaks lparstat which the customers are using
for capacity planning. Unfortunately we discovered this 10 years after
the feature was written.

> 
> Also what does this do to the latency of CPU online/offline.

It will have a slightly higher latency compared to extended cede,
since it involves an additional RTAS call for both starting and
stopping the CPU. Will measure the exact difference and post it in the
next version.

> And what does this do on KVM?

KVM doesn't seem to depend on the state of the offline VCPU as it has
an explicit way of signalling whether a CPU is online or not, via
KVM_REG_PPC_ONLINE. In commit 7aa15842c15f ("KVM: PPC: Book3S HV: Set
RWMR on POWER8 so PURR/SPURR count correctly") we use this KVM reg to
update the count of online vCPUs in a core, and use this count to set
the RWMR correctly before dispatching the core.

So, this patchset doesn't affect KVM.

> 
> 
> > In the following data (Generated using
> > https://github.com/gautshen/misc/blob/master/purr_tb.py):
> >
> >
> > delta tb = tb ticks elapsed in 1 second.
> > delta purr = sum of PURR increments on online CPUs of that core in 1
> >  second
> >   
> > SMT=off
> > ===
> > Coredelta tb(apprx)  delta purr 
> > ===
> > core00 [  0]51200   69883784
> > core01 [  8]51200   88782536
> > core02 [ 16]51200   94296824
> > core03 [ 24]51200   80951968
> 
> Showing the expected value in another column would make this much
> clearer.

Thanks. Will update the testcase to call out the expected value.
> 
> cheers
> 


--
Thanks and Regards
gautham.


Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline

2019-09-18 Thread Gautham R Shenoy
Hello Nathan, Michael,

On Tue, Sep 17, 2019 at 12:36:35PM -0500, Nathan Lynch wrote:
> Gautham R Shenoy  writes:
> > On Thu, Sep 12, 2019 at 10:39:45AM -0500, Nathan Lynch wrote:
> >> "Gautham R. Shenoy"  writes:
> >> > The patchset also defines a new sysfs attribute
> >> > "/sys/device/system/cpu/cede_offline_enabled" on PSeries Linux guests
> >> > to allow userspace programs to change the state into which the
> >> > offlined CPU need to be put to at runtime.
> >> 
> >> A boolean sysfs interface will become awkward if we need to add another
> >> mode in the future.
> >> 
> >> What do you think about naming the attribute something like
> >> 'offline_mode', with the possible values 'extended-cede' and
> >> 'rtas-stopped'?
> >
> > We can do that. However, IMHO in the longer term, on PSeries guests,
> > we should have only one offline state - rtas-stopped.  The reason for
> > this being, that on Linux, SMT switch is brought into effect through
> > the CPU Hotplug interface. The only state in which the SMT switch will
> > recognized by the hypervisors such as PHYP is rtas-stopped.
> 
> OK. Why "longer term" though, instead of doing it now?

Because adding extended cede as a cpuidle state is non-trivial, since
a CPU in that state is not responsive to external interrupts. We will
need additional changes in the IPI, timer and interrupt code to ensure
that these get translated to an H_PROD in order to wake up the target
CPU in extended CEDE.

Timer: relatively easy, since the cpuidle infrastructure has the
   timer-offload framework (used for fastsleep on POWER8) where we
   can offload the timers of an idling CPU to another CPU, which
   can wake up the CPU via an IPI when the timer expires.

IPIs: We need to ensure that icp_hv_set_qirr() correctly sends H_IPI
  or H_PROD depending on whether or not the target CPU is in
  extended CEDE.

Interrupts: Either we migrate away the interrupts from the CPU that is
entering extended CEDE or we prevent a CPU that is the
sole target for an interrupt from entering extended CEDE.

The accounting problem in tools such as lparstat with
"cede_offline=on" is affecting customers who are using these tools for
capacity-planning. That problem needs a fix in the short-term, for
which Patch 1 changes the default behaviour of cede_offline from "on"
to "off". Since this patch would break the existing userspace tools
that use the CPU-Offline infrastructure to fold CPUs for saving power,
the sysfs interface allowing a runtime change of cede_offline_enabled
was provided to enable these userspace tools to cope with minimal
change.

> 
> 
> > All other states (such as extended-cede) should in the long-term be
> > exposed via the cpuidle interface.
> >
> > With this in mind, I made the sysfs interface boolean to mirror the
> > current "cede_offline" commandline parameter. Eventually when we have
> > only one offline-state, we can deprecate the commandline parameter as
> > well as the sysfs interface.
> 
> I don't care for adding a sysfs interface that is intended from the
> beginning to become vestigial...

Fair point. Come to think of it, in case the cpuidle menu governor
behaviour doesn't match the expectations provided by the current
userspace solutions for folding idle CPUs for power-savings, it would
be useful to have this option around so that existing users who prefer
the userspace solution can still have that option.

> 
> This strikes me as unnecessarily incremental if you're changing the
> default offline state. Any user space programs depending on the current
> behavior will have to change anyway (and why is it OK to break them?)
>

Yes, the current userspace program will need to be modified to check
for the sysfs interface and change the value to
cede_offline_enabled=1.

> Why isn't the plan:
> 
>   1. Add extended cede support to the pseries cpuidle driver
>   2. Make stop-self the only cpu offline state for pseries (no sysfs
>  interface necessary)

This is the plan, except that 1. requires some additional work and
this patchset is proposed as a short-term mitigation until we get
1. right.

> 
> ?

--
Thanks and Regards
gautham.


Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline

2019-09-18 Thread Michael Ellerman
"Naveen N. Rao"  writes:
> Michael Ellerman wrote:
>> "Gautham R. Shenoy"  writes:
>>> From: "Gautham R. Shenoy" 
>>>
>>> Currently on Pseries Linux Guests, the offlined CPU can be put to one
>>> of the following two states:
>>>- Long term processor cede (also called extended cede)
>>>- Returned to the Hypervisor via RTAS "stop-self" call.
>>>
>>> This is controlled by the kernel boot parameter "cede_offline=on/off".
>>>
>>> By default the offlined CPUs enter extended cede.
>> 
>> Since commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into 
>> an appropriate offline state") (Nov 2009)
>> 
>> Which you wrote :)
>> 
>> Why was that wrong?
>> 
>>> The PHYP hypervisor considers CPUs in extended cede to be "active"
>>> since the CPUs are still under the control of the Linux Guests. Hence, when 
>>> we change the
>>> SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs
>>> will continue to count the values for offlined CPUs in extended cede
>>> as if they are online.
>>>
>>> One of the expectations with PURR is that the for an interval of time,
>>> the sum of the PURR increments across the online CPUs of a core should
>>> equal the number of timebase ticks for that interval.
>>>
>>> This is currently not the case.
>> 
>> But why does that matter? It's just some accounting stuff, does it
>> actually break something meaningful?
>
> Yes, this broke lparstat at the very least (though it's quite unfortunate 
> we took so long to notice).

By "so long" you mean 10 years?

Also I've never heard of lparstat, but I assume it's one of these tools
that's meant to behave like the AIX equivalent?

If it's been "broken" for 10 years and no one noticed, I'd argue the
current behaviour is now "correct" and fixing it would actually be a
breakage :)

> With SMT disabled, and under load:
>   $ sudo lparstat 1 10
>
>   System Configuration
>   type=Shared mode=Uncapped smt=Off lcpu=2 mem=7759616 kB cpus=6 ent=1.00 
>
>   %user  %sys  %wait  %idle  physc  %entc  lbusy   vcsw  phint
>   -----  ----  -----  -----  -----  -----  -----  -----  -----
>   100.00  0.00  0.00 0.00 1.10 110.00 100.00 128784460 0
>   100.00  0.00  0.00 0.00 1.07 107.00 100.00 128784860 0
>   100.00  0.00  0.00 0.00 1.07 107.00 100.00 128785260 0
>   100.00  0.00  0.00 0.00 1.07 107.00 100.00 128785662 0
>   100.00  0.00  0.00 0.00 1.07 107.00 100.00 128786062 0
>   100.00  0.00  0.00 0.00 1.07 107.00 100.00 128786462 0
>   100.00  0.00  0.00 0.00 1.07 107.00 100.00 128786862 0
>   100.00  0.00  0.00 0.00 1.07 107.00 100.00 128787262 0
>   100.00  0.00  0.00 0.00 1.07 107.00 100.00 128787664 0
>   100.00  0.00  0.00 0.00 1.07 107.00 100.00 128788064 0

What about that is wrong?

> With cede_offline=off:
>   $ sudo lparstat 1 10
>
>   System Configuration
>   type=Shared mode=Uncapped smt=Off lcpu=2 mem=7759616 kB cpus=6 ent=1.00 
>
>   %user  %sys  %wait  %idle  physc  %entc  lbusy   vcsw  phint
>   -----  ----  -----  -----  -----  -----  -----  -----  -----
>   100.00  0.00  0.00 0.00 1.94 194.00 100.00 128961588 0
>   100.00  0.00  0.00 0.00 1.91 191.00 100.00 128961988 0
>   100.00  0.00  0.00 0.00  inf   inf 100.00 128962392 0
>   100.00  0.00  0.00 0.00 1.91 191.00 100.00 128962792 0
>   100.00  0.00  0.00 0.00 1.91 191.00 100.00 128963192 0
>   100.00  0.00  0.00 0.00 1.91 191.00 100.00 128963592 0
>   100.00  0.00  0.00 0.00 1.91 191.00 100.00 128963992 0
>   100.00  0.00  0.00 0.00 1.91 191.00 100.00 128964392 0
>   100.00  0.00  0.00 0.00 1.91 191.00 100.00 128964792 0
>   100.00  0.00  0.00 0.00 1.91 191.00 100.00 128965194 0
>
> [The 'inf' values there show a different bug]
>
> Also, since we expose [S]PURR through sysfs, any tools that make use of 
> that directly are also affected due to this.

But again if we've had the current behaviour for 10 years then arguably
that's now the correct behaviour.

cheers


Re: [PATCH] powerpc/crashkernel: take mem option into account

2019-09-18 Thread Michael Ellerman
Pingfan Liu  writes:
> Cc Kexec list. And keep the original content.
>
> On Thu, Sep 12, 2019 at 10:50 AM Pingfan Liu  wrote:
>>
>> 'mem=" option is an easy way to put high pressure on memory during some
>> test. Hence in stead of total mem, the effective usable memory size
   ^  ^
   instead"actual" would be clearer

I think adding: "after applying the memory limit" 

would help here.

>> should be considered when reserving mem for crashkernel. Otherwise
>> the boot up may experience oom issue.
  ^
  OOM
>>
>> E.g passing
>> crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G", and
>> mem=5G on a 256G machine.

Spelling out the behaviour before and after would help here, eg:

.. "would reserve 4G prior to the change and 512M afterward."


>> Signed-off-by: Pingfan Liu 
>> Cc: Hari Bathini 
>> Cc: Michael Ellerman 
>> To: linuxppc-dev@lists.ozlabs.org
>> ---
>> v1 -> v2: fix the printk info about the total mem
>>  arch/powerpc/kernel/machine_kexec.c | 7 ---
>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/machine_kexec.c 
>> b/arch/powerpc/kernel/machine_kexec.c
>> index c4ed328..eec96dc 100644
>> --- a/arch/powerpc/kernel/machine_kexec.c
>> +++ b/arch/powerpc/kernel/machine_kexec.c
>> @@ -114,11 +114,12 @@ void machine_kexec(struct kimage *image)
>>
>>  void __init reserve_crashkernel(void)
>>  {
>> -   unsigned long long crash_size, crash_base;
>> +   unsigned long long crash_size, crash_base, total_mem_sz;
>> int ret;
>>
>> +   total_mem_sz = memory_limit ? memory_limit : 
>> memblock_phys_mem_size();
>> /* use common parsing */
>> -   ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
>> +   ret = parse_crashkernel(boot_command_line, total_mem_sz,
>> 			&crash_size, &crash_base);

I think this change makes sense. But we have multiple arches that
implement similar logic, and I wonder if we should keep them all the
same.

eg:

  arch/arm/kernel/setup.c:ret = 
parse_crashkernel(boot_command_line, total_mem,
  arch/arm64/mm/init.c:   ret = 
parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
  arch/ia64/kernel/setup.c:   ret = 
parse_crashkernel(boot_command_line, total,
  arch/mips/kernel/setup.c:   ret = 
parse_crashkernel(boot_command_line, total_mem,
  arch/powerpc/kernel/fadump.c:   ret = 
parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
  arch/powerpc/kernel/machine_kexec.c:ret = 
parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
  arch/s390/kernel/setup.c:   rc = 
parse_crashkernel(boot_command_line, memory_end, _size,
  arch/sh/kernel/machine_kexec.c: ret = 
parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
  arch/x86/kernel/setup.c:ret = 
parse_crashkernel(boot_command_line, total_mem, _size, _base);


From a quick glance most of them don't seem to take the memory limit
into account.

So I guess the question is do we want all arches to implement the same
behaviour or do we think it doesn't matter if they differ in details
like this?

cheers


Re: [PATCH v8 2/8] kvmppc: Movement of pages between normal and secure memory

2019-09-18 Thread Bharata B Rao
On Wed, Sep 18, 2019 at 12:42:10PM +0530, Bharata B Rao wrote:
> On Tue, Sep 17, 2019 at 04:31:39PM -0700, Sukadev Bhattiprolu wrote:
> > 
> > Minor: Can this allocation be outside the lock? I guess it would change
> > the order of cleanup at the end of the function.
> 
> Cleanup has bitmap_clear which needs to be under the spinlock, so I felt
> this order of setup/alloc and cleanup keeps things simple.
> 
> > 
> > > + spin_unlock(&kvmppc_uvmem_pfn_lock);
> > > +
> > > + *rmap = uvmem_pfn | KVMPPC_RMAP_UVMEM_PFN;
> > > + pvt->rmap = rmap;
> > > + pvt->gpa = gpa;
> > > + pvt->lpid = lpid;
> > > + dpage->zone_device_data = pvt;
> > > +
> > > + get_page(dpage);
> > > + return dpage;
> > > +
> > > +out_unlock:
> > > + unlock_page(dpage);
> > > +out_clear:
> > > + bitmap_clear(kvmppc_uvmem_pfn_bitmap, uvmem_pfn - pfn_first, 1);
> > 
> > Reuse variable 'bit'  here?
> 
> Sure.
> 
> > 
> > > +out:
> > > + spin_unlock(&kvmppc_uvmem_pfn_lock);
> > > + return NULL;
> > > +}
> > > +
> > > +/*
> > > + * Alloc a PFN from private device memory pool and copy page from normal
> > > + * memory to secure memory using UV_PAGE_IN uvcall.
> > > + */
> > > +static int
> > > +kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
> > > +unsigned long end, unsigned long *rmap,
> > > +unsigned long gpa, unsigned int lpid,
> > > +unsigned long page_shift)
> > > +{
> > > + unsigned long src_pfn, dst_pfn = 0;
> > > + struct migrate_vma mig;
> > > + struct page *spage;
> > > + unsigned long pfn;
> > > + struct page *dpage;
> > > + int ret = 0;
> > > +
> > > + memset(&mig, 0, sizeof(mig));
> > > + mig.vma = vma;
> > > + mig.start = start;
> > > + mig.end = end;
> > > + mig.src = &src_pfn;
> > > + mig.dst = &dst_pfn;
> > > +
> > > + ret = migrate_vma_setup(&mig);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + spage = migrate_pfn_to_page(*mig.src);
> > > + pfn = *mig.src >> MIGRATE_PFN_SHIFT;
> > > + if (!spage || !(*mig.src & MIGRATE_PFN_MIGRATE)) {
> > > + ret = 0;
> > 
> > Do we want to return success here (and have caller return H_SUCCESS) if
> > we can't find the source page?
> 
> spage is NULL for zero page. In this case we return success but there is
> no UV_PAGE_IN involved.
> 
> Absence of MIGRATE_PFN_MIGRATE indicates that the requested page
> can't be migrated. I haven't hit this case till now. Similar check
> is also present in the nouveau driver. I am not sure if this is strictly
> needed here.
> 
> Christoph, Jason - do you know if !(*mig.src & MIGRATE_PFN_MIGRATE)
> check is required and if so in which cases will it be true?

Looks like the currently existing check

if (!spage || !(*mig.src & MIGRATE_PFN_MIGRATE)) {
ret = 0;
goto out_finalize;
}

will prevent both

1. Zero pages and
2. Pages for which no page table entries exist

from getting migrated to secure (device) side. In both the above cases
!spage is true (and MIGRATE_PFN_MIGRATE is set). In both cases
we needn't copy the page, but migration should complete.

Guess the following comment extract from migrate_vma_setup() is talking about
Case 2 above.

 * For empty entries inside CPU page table (pte_none() or pmd_none() is true) we
 * do set MIGRATE_PFN_MIGRATE flag inside the corresponding source array thus
 * allowing the caller to allocate device memory for those unback virtual
 * address.  For this the caller simply has to allocate device memory and
 * properly set the destination entry like for regular migration.  Note that

Is the above understanding correct? Will the same apply to nouveau driver too?
nouveau_dmem_migrate_copy_one() also seems to bail out after a similar
check.

Regards,
Bharata.




Re: [PATCH v3] powerpc/fadump: sysfs for fadump memory reservation

2019-09-18 Thread Sourabh Jain



On 8/27/19 11:32 AM, Hari Bathini wrote:
> 
> 
> On 27/08/19 8:49 AM, Michael Ellerman wrote:
>> Hari Bathini  writes:
>>> On 26/08/19 4:14 PM, Sourabh Jain wrote:
 On 8/26/19 3:46 PM, Sourabh Jain wrote:
> On 8/26/19 3:29 PM, Hari Bathini wrote:
>> On 10/08/19 11:29 PM, Sourabh Jain wrote:
>>> Add a sys interface to allow querying the memory reserved by
>>> fadump for saving the crash dump.
>>>
>>> Add an ABI doc entry for new sysfs interface.
>>>- /sys/kernel/fadump_mem_reserved
>>>
>>> Signed-off-by: Sourabh Jain 
>>> ---
>>> Changelog:
>>> v1 -> v2:
>>>   - Added ABI doc for new sysfs interface.
>>>
>>> v2 -> v3:
>>>   - Updated the ABI documentation.
>>> ---
>>>
>>>  Documentation/ABI/testing/sysfs-kernel-fadump|  6 ++
>>
>> Shouldn't this be 
>> Documentation/ABI/testing/sysfs-kernel-fadump_mem_reserved?

 How about documenting fadump_mem_reserved and other sysfs attributes 
 suggested
 by you in a single file Documentation/ABI/testing/sysfs-kernel-fadump?
>>>
>>> I wouldn't mind that but please do check if it is breaking a convention..
>>
>> AIUI a file named like that would hold the documentation for the files
>> inside a directory called /sys/kernel/fadump.
>>
>> And in fact that's probably where these files should live, rather than
>> just dropped directly into /sys/kernel.
> Michael, could that be corrected now by introducing new sysfs files for 
> FADump in
> /sys/kernel/fadump/.
> 
> Also, duplicating current /sys/kernel/fadump_* files as /sys/kernel/fadump/* 
> files
> & eventually dropping /sys/kernel/fadump_* files sometime later..


Sent a patch series that adds fadump_mem_reserved sysfs file along with 
reorganizing
the existing fadump sysfs files.

Patch series available here:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-September/197100.html

- Sourabh Jain



Re: [PATCH] mm/pgtable/debug: Fix test validating architecture page table helpers

2019-09-18 Thread Anshuman Khandual



On 09/13/2019 11:53 AM, Christophe Leroy wrote:
> Fix build failure on powerpc.
> 
> Fix preemption imbalance.
> 
> Signed-off-by: Christophe Leroy 
> ---
>  mm/arch_pgtable_test.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/arch_pgtable_test.c b/mm/arch_pgtable_test.c
> index 8b4a92756ad8..f2b3c9ec35fa 100644
> --- a/mm/arch_pgtable_test.c
> +++ b/mm/arch_pgtable_test.c
> @@ -24,6 +24,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -400,6 +401,8 @@ static int __init arch_pgtable_tests_init(void)
>   p4d_clear_tests(p4dp);
>   pgd_clear_tests(mm, pgdp);
>  
> + pte_unmap(ptep);
> +
>   pmd_populate_tests(mm, pmdp, saved_ptep);
>   pud_populate_tests(mm, pudp, saved_pmdp);
>   p4d_populate_tests(mm, p4dp, saved_pudp);
> 

Hello Christophe,

I am planning to fold this fix into the current patch and retain your
Signed-off-by. Are you okay with it ?

- Anshuman


Re: [PATCH v8 8/8] KVM: PPC: Ultravisor: Add PPC_UV config option

2019-09-18 Thread Bharata B Rao
On Tue, Sep 17, 2019 at 04:37:07PM -0700, Sukadev Bhattiprolu wrote:
> Bharata B Rao [bhar...@linux.ibm.com] wrote:
> > From: Anshuman Khandual 
> > 
> > CONFIG_PPC_UV adds support for ultravisor.
> > 
> > Signed-off-by: Anshuman Khandual 
> > Signed-off-by: Bharata B Rao 
> > Signed-off-by: Ram Pai 
> > [ Update config help and commit message ]
> > Signed-off-by: Claudio Carvalho 
> 
> Except for one question in Patch 2, the patch series looks good to me.
> 
> Reviewed-by: Sukadev Bhattiprolu 

Thanks!

Regards,
Bharata.



Re: [PATCH v8 7/8] kvmppc: Support reset of secure guest

2019-09-18 Thread Bharata B Rao
On Tue, Sep 17, 2019 at 04:27:36PM -0700, Sukadev Bhattiprolu wrote:
> > +
> > +   if (kvmppc_is_guest_secure(kvm)) {
> 
> Nit: Since this entire function only applies to secure guests we could
> return here for normal guests.

Yes, can be done.

> >  bool kvmppc_is_guest_secure(struct kvm *kvm)
> >  {
> > -   return !!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE);
> > +   return (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE);
> >  }
> 
> This change could be folded into PATCH 6?

That was the intention but looks like I 'pick'ed wrong commit during rebase.
Will fix this.

> > 
> >  unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
> > @@ -85,9 +86,68 @@ unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
> > return H_UNSUPPORTED;
> > 
> > kvm->arch.secure_guest |= KVMPPC_SECURE_INIT_DONE;
> > +   if (kvm_is_radix(kvm)) {
> > +   kvmppc_free_radix(kvm);
> > +   pr_info("LPID %d went secure, freed HV side radix pgtables\n",
> > +   kvm->arch.lpid);
> > +   }
> > return H_SUCCESS;
> >  }
> > 
> > +/*
> > + * Drop device pages that we maintain for the secure guest
> > + *
> > + * We first mark the pages to be skipped from UV_PAGE_OUT when there
> > + * is HV side fault on these pages. Next we *get* these pages, forcing
> 
> Is that get page implicit? there is no explicit "get" in this 
> function?

gfn_to_pfn does get_user_pages eventually.

> 
> > + * fault on them, do fault time migration to replace the device PTEs in
> > + * QEMU page table with normal PTEs from newly allocated pages.
> > + */
> > +static void kvmppc_uvmem_drop_pages(struct kvm_memory_slot *free,
> > +  struct kvm *kvm)
> > +{
> > +   int i;
> > +   struct kvmppc_uvmem_page_pvt *pvt;
> > +   unsigned long pfn;
> > +
> > +   for (i = 0; i < free->npages; i++) {
> > > +   unsigned long *rmap = &free->arch.rmap[i];
> > +   struct page *uvmem_page;
> > +
> > +   if (kvmppc_rmap_type(rmap) == KVMPPC_RMAP_UVMEM_PFN) {
> > +   uvmem_page = pfn_to_page(*rmap &
> > +~KVMPPC_RMAP_UVMEM_PFN);
> > +   pvt = (struct kvmppc_uvmem_page_pvt *)
> > +   uvmem_page->zone_device_data;
> > +   pvt->skip_page_out = true;
> > +
> > +   pfn = gfn_to_pfn(kvm, pvt->gpa >> PAGE_SHIFT);

Regards,
Bharata.



Re: [PATCH v8 4/8] kvmppc: H_SVM_INIT_START and H_SVM_INIT_DONE hcalls

2019-09-18 Thread Bharata B Rao
On Tue, Sep 17, 2019 at 04:35:02PM -0700, Sukadev Bhattiprolu wrote:
> 
> > +unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
> > +{
> > +   if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
> 
> Minor: Should we also check if KVMPPC_SECURE_INIT_DONE is set here (since
> both can be set)?

You imply that we can check if more H_SVM_INIT_DONE calls are issued
after first such call? Currently we just set the KVMPPC_SECURE_INIT_DONE
bit again.

Regards,
Bharata.



Re: [PATCH v8 2/8] kvmppc: Movement of pages between normal and secure memory

2019-09-18 Thread Bharata B Rao
On Tue, Sep 17, 2019 at 04:31:39PM -0700, Sukadev Bhattiprolu wrote:
> 
> Minor: Can this allocation be outside the lock? I guess it would change
> the order of cleanup at the end of the function.

Cleanup has bitmap_clear which needs to be under the spinlock, so I felt
this order of setup/alloc and cleanup keeps things simple.

> 
> > +   spin_unlock(&kvmppc_uvmem_pfn_lock);
> > +
> > +   *rmap = uvmem_pfn | KVMPPC_RMAP_UVMEM_PFN;
> > +   pvt->rmap = rmap;
> > +   pvt->gpa = gpa;
> > +   pvt->lpid = lpid;
> > +   dpage->zone_device_data = pvt;
> > +
> > +   get_page(dpage);
> > +   return dpage;
> > +
> > +out_unlock:
> > +   unlock_page(dpage);
> > +out_clear:
> > +   bitmap_clear(kvmppc_uvmem_pfn_bitmap, uvmem_pfn - pfn_first, 1);
> 
> Reuse variable 'bit'  here?

Sure.

> 
> > +out:
> > +   spin_unlock(&kvmppc_uvmem_pfn_lock);
> > +   return NULL;
> > +}
> > +
> > +/*
> > + * Alloc a PFN from private device memory pool and copy page from normal
> > + * memory to secure memory using UV_PAGE_IN uvcall.
> > + */
> > +static int
> > +kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
> > +  unsigned long end, unsigned long *rmap,
> > +  unsigned long gpa, unsigned int lpid,
> > +  unsigned long page_shift)
> > +{
> > +   unsigned long src_pfn, dst_pfn = 0;
> > +   struct migrate_vma mig;
> > +   struct page *spage;
> > +   unsigned long pfn;
> > +   struct page *dpage;
> > +   int ret = 0;
> > +
> > +   memset(&mig, 0, sizeof(mig));
> > +   mig.vma = vma;
> > +   mig.start = start;
> > +   mig.end = end;
> > +   mig.src = &src_pfn;
> > +   mig.dst = &dst_pfn;
> > +
> > +   ret = migrate_vma_setup(&mig);
> > +   if (ret)
> > +   return ret;
> > +
> > +   spage = migrate_pfn_to_page(*mig.src);
> > +   pfn = *mig.src >> MIGRATE_PFN_SHIFT;
> > +   if (!spage || !(*mig.src & MIGRATE_PFN_MIGRATE)) {
> > +   ret = 0;
> 
> Do we want to return success here (and have caller return H_SUCCESS) if
> we can't find the source page?

spage is NULL for zero page. In this case we return success but there is
no UV_PAGE_IN involved.

Absence of MIGRATE_PFN_MIGRATE indicates that the requested page
can't be migrated. I haven't hit this case till now. Similar check
is also present in the nouveau driver. I am not sure if this is strictly
needed here.

Christoph, Jason - do you know if !(*mig.src & MIGRATE_PFN_MIGRATE)
check is required and if so in which cases will it be true?

> > + * Fault handler callback when HV touches any page that has been
> 
>  Nit: s/callback/callback. Called /

Yeah will rephrase.

Regards,
Bharata.



Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline

2019-09-18 Thread Naveen N. Rao

Michael Ellerman wrote:

"Gautham R. Shenoy"  writes:

From: "Gautham R. Shenoy" 

Currently on Pseries Linux Guests, the offlined CPU can be put to one
of the following two states:
   - Long term processor cede (also called extended cede)
   - Returned to the Hypervisor via RTAS "stop-self" call.

This is controlled by the kernel boot parameter "cede_offline=on/off".

By default the offlined CPUs enter extended cede.


Since commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an 
appropriate offline state") (Nov 2009)

Which you wrote :)

Why was that wrong?


The PHYP hypervisor considers CPUs in extended cede to be "active"
since the CPUs are still under the control of the Linux Guests. Hence, when we
change the
SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs
will continue to count the values for offlined CPUs in extended cede
as if they are online.

One of the expectations with PURR is that the for an interval of time,
the sum of the PURR increments across the online CPUs of a core should
equal the number of timebase ticks for that interval.

This is currently not the case.


But why does that matter? It's just some accounting stuff, does it
actually break something meaningful?


Yes, this broke lparstat at the very least (though it's quite unfortunate
we took so long to notice).


With SMT disabled, and under load:
 $ sudo lparstat 1 10

 System Configuration
 type=Shared mode=Uncapped smt=Off lcpu=2 mem=7759616 kB cpus=6 ent=1.00 


 %user  %sys  %wait  %idle  physc  %entc  lbusy   vcsw  phint
 -----  ----  -----  -----  -----  -----  -----  -----  -----
 100.00  0.00  0.00 0.00 1.10 110.00 100.00 128784460 0
 100.00  0.00  0.00 0.00 1.07 107.00 100.00 128784860 0
 100.00  0.00  0.00 0.00 1.07 107.00 100.00 128785260 0
 100.00  0.00  0.00 0.00 1.07 107.00 100.00 128785662 0
 100.00  0.00  0.00 0.00 1.07 107.00 100.00 128786062 0
 100.00  0.00  0.00 0.00 1.07 107.00 100.00 128786462 0
 100.00  0.00  0.00 0.00 1.07 107.00 100.00 128786862 0
 100.00  0.00  0.00 0.00 1.07 107.00 100.00 128787262 0
 100.00  0.00  0.00 0.00 1.07 107.00 100.00 128787664 0
 100.00  0.00  0.00 0.00 1.07 107.00 100.00 128788064 0


With cede_offline=off:
 $ sudo lparstat 1 10

 System Configuration
 type=Shared mode=Uncapped smt=Off lcpu=2 mem=7759616 kB cpus=6 ent=1.00 


 %user  %sys  %wait  %idle  physc  %entc  lbusy   vcsw  phint
 -----  ----  -----  -----  -----  -----  -----  -----  -----
 100.00  0.00  0.00 0.00 1.94 194.00 100.00 128961588 0
 100.00  0.00  0.00 0.00 1.91 191.00 100.00 128961988 0
 100.00  0.00  0.00 0.00  inf   inf 100.00 128962392 0
 100.00  0.00  0.00 0.00 1.91 191.00 100.00 128962792 0
 100.00  0.00  0.00 0.00 1.91 191.00 100.00 128963192 0
 100.00  0.00  0.00 0.00 1.91 191.00 100.00 128963592 0
 100.00  0.00  0.00 0.00 1.91 191.00 100.00 128963992 0
 100.00  0.00  0.00 0.00 1.91 191.00 100.00 128964392 0
 100.00  0.00  0.00 0.00 1.91 191.00 100.00 128964792 0
 100.00  0.00  0.00 0.00 1.91 191.00 100.00 128965194 0

[The 'inf' values there show a different bug]

Also, since we expose [S]PURR through sysfs, any tools that make use of 
that directly are also affected due to this.



- Naveen



[PATCH V2 4/4] ASoC: fsl_asrc: Fix error with S24_3LE format bitstream in i.MX8

2019-09-18 Thread Shengjiu Wang
There is error "aplay: pcm_write:2023: write error: Input/output error"
on i.MX8QM/i.MX8QXP platform for S24_3LE format.

In i.MX8QM/i.MX8QXP, the DMA is EDMA, which doesn't support 24-bit
samples, but we didn't add any constraint, which causes issues.

So we need to query the caps of dma, then update the hw parameters
according to the caps.

Signed-off-by: Shengjiu Wang 
---
 sound/soc/fsl/fsl_asrc.c |  4 +--
 sound/soc/fsl/fsl_asrc.h |  3 +++
 sound/soc/fsl/fsl_asrc_dma.c | 48 ++--
 3 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/sound/soc/fsl/fsl_asrc.c b/sound/soc/fsl/fsl_asrc.c
index 584badf956d2..0bf91a6f54b9 100644
--- a/sound/soc/fsl/fsl_asrc.c
+++ b/sound/soc/fsl/fsl_asrc.c
@@ -115,7 +115,7 @@ static void fsl_asrc_sel_proc(int inrate, int outrate,
  * within range [ANCA, ANCA+ANCB-1], depends on the channels of pair A
  * while pair A and pair C are comparatively independent.
  */
-static int fsl_asrc_request_pair(int channels, struct fsl_asrc_pair *pair)
+int fsl_asrc_request_pair(int channels, struct fsl_asrc_pair *pair)
 {
enum asrc_pair_index index = ASRC_INVALID_PAIR;
struct fsl_asrc *asrc_priv = pair->asrc_priv;
@@ -158,7 +158,7 @@ static int fsl_asrc_request_pair(int channels, struct 
fsl_asrc_pair *pair)
  *
  * It clears the resource from asrc_priv and releases the occupied channels.
  */
-static void fsl_asrc_release_pair(struct fsl_asrc_pair *pair)
+void fsl_asrc_release_pair(struct fsl_asrc_pair *pair)
 {
struct fsl_asrc *asrc_priv = pair->asrc_priv;
enum asrc_pair_index index = pair->index;
diff --git a/sound/soc/fsl/fsl_asrc.h b/sound/soc/fsl/fsl_asrc.h
index 38af485bdd22..2b57e8c53728 100644
--- a/sound/soc/fsl/fsl_asrc.h
+++ b/sound/soc/fsl/fsl_asrc.h
@@ -462,4 +462,7 @@ struct fsl_asrc {
 #define DRV_NAME "fsl-asrc-dai"
 extern struct snd_soc_component_driver fsl_asrc_component;
 struct dma_chan *fsl_asrc_get_dma_channel(struct fsl_asrc_pair *pair, bool 
dir);
+int fsl_asrc_request_pair(int channels, struct fsl_asrc_pair *pair);
+void fsl_asrc_release_pair(struct fsl_asrc_pair *pair);
+
 #endif /* _FSL_ASRC_H */
diff --git a/sound/soc/fsl/fsl_asrc_dma.c b/sound/soc/fsl/fsl_asrc_dma.c
index 01052a0808b0..c1f39975cc32 100644
--- a/sound/soc/fsl/fsl_asrc_dma.c
+++ b/sound/soc/fsl/fsl_asrc_dma.c
@@ -16,13 +16,11 @@
 
 #define FSL_ASRC_DMABUF_SIZE   (256 * 1024)
 
-static const struct snd_pcm_hardware snd_imx_hardware = {
+static struct snd_pcm_hardware snd_imx_hardware = {
.info = SNDRV_PCM_INFO_INTERLEAVED |
SNDRV_PCM_INFO_BLOCK_TRANSFER |
SNDRV_PCM_INFO_MMAP |
-   SNDRV_PCM_INFO_MMAP_VALID |
-   SNDRV_PCM_INFO_PAUSE |
-   SNDRV_PCM_INFO_RESUME,
+   SNDRV_PCM_INFO_MMAP_VALID,
.buffer_bytes_max = FSL_ASRC_DMABUF_SIZE,
.period_bytes_min = 128,
.period_bytes_max = 65535, /* Limited by SDMA engine */
@@ -276,6 +274,11 @@ static int fsl_asrc_dma_startup(struct snd_pcm_substream 
*substream)
struct device *dev = component->dev;
struct fsl_asrc *asrc_priv = dev_get_drvdata(dev);
struct fsl_asrc_pair *pair;
+   bool tx = substream->stream == SNDRV_PCM_STREAM_PLAYBACK;
+   u8 dir = tx ? OUT : IN;
+   struct dma_chan *tmp_chan;
+   struct snd_dmaengine_dai_dma_data *dma_data;
+   int ret;
 
pair = kzalloc(sizeof(struct fsl_asrc_pair), GFP_KERNEL);
if (!pair)
@@ -285,9 +288,40 @@ static int fsl_asrc_dma_startup(struct snd_pcm_substream 
*substream)
 
runtime->private_data = pair;
 
-   snd_pcm_hw_constraint_integer(substream->runtime,
- SNDRV_PCM_HW_PARAM_PERIODS);
-   snd_soc_set_runtime_hwparams(substream, &snd_imx_hardware);
+   ret = snd_pcm_hw_constraint_integer(substream->runtime,
+   SNDRV_PCM_HW_PARAM_PERIODS);
+   if (ret < 0) {
+   dev_err(dev, "failed to set pcm hw params periods\n");
+   return ret;
+   }
+
+   /* Request a temp pair, which is release in the end */
+   ret = fsl_asrc_request_pair(1, pair);
+   if (ret < 0) {
+   dev_err(dev, "failed to request asrc pair\n");
+   return ret;
+   }
+
+   tmp_chan = fsl_asrc_get_dma_channel(pair, dir);
+   if (!tmp_chan) {
+   dev_err(dev, "can't get dma channel\n");
+   return -EINVAL;
+   }
+
+   dma_data = snd_soc_dai_get_dma_data(rtd->cpu_dai, substream);
+
+   ret = snd_dmaengine_pcm_set_runtime_hwparams(substream,
+dma_data,
+&snd_imx_hardware,
+tmp_chan);
+   if (ret < 0) {
+   dev_err(dev, "failed to set runtime hwparams\n");
+   return ret;
+   }
+
+   dma_release_channel(tmp_chan);
+ 

[PATCH V2 3/4] ASoC: pcm_dmaengine: Extract snd_dmaengine_pcm_set_runtime_hwparams

2019-09-18 Thread Shengjiu Wang
When set the runtime hardware parameters, we may need to query
the capability of DMA to complete the parameters.

This patch is to Extract this operation from
dmaengine_pcm_set_runtime_hwparams function to a separate function
snd_dmaengine_pcm_set_runtime_hwparams, that other components which
need this feature can call this function.

Signed-off-by: Shengjiu Wang 
---
 include/sound/dmaengine_pcm.h |  5 ++
 sound/core/pcm_dmaengine.c| 83 +++
 sound/soc/soc-generic-dmaengine-pcm.c | 62 ++--
 3 files changed, 92 insertions(+), 58 deletions(-)

diff --git a/include/sound/dmaengine_pcm.h b/include/sound/dmaengine_pcm.h
index c679f6116580..1bcbc4d2bc6f 100644
--- a/include/sound/dmaengine_pcm.h
+++ b/include/sound/dmaengine_pcm.h
@@ -83,6 +83,11 @@ void snd_dmaengine_pcm_set_config_from_dai_data(
const struct snd_dmaengine_dai_dma_data *dma_data,
struct dma_slave_config *config);
 
+int snd_dmaengine_pcm_set_runtime_hwparams(
+   struct snd_pcm_substream *substream,
+   struct snd_dmaengine_dai_dma_data *dma_data,
+   struct snd_pcm_hardware *hw,
+   struct dma_chan *chan);
 
 /*
  * Try to request the DMA channel using compat_request_channel or
diff --git a/sound/core/pcm_dmaengine.c b/sound/core/pcm_dmaengine.c
index 89a05926ac73..c9d56a12164d 100644
--- a/sound/core/pcm_dmaengine.c
+++ b/sound/core/pcm_dmaengine.c
@@ -369,4 +369,87 @@ int snd_dmaengine_pcm_close_release_chan(struct 
snd_pcm_substream *substream)
 }
 EXPORT_SYMBOL_GPL(snd_dmaengine_pcm_close_release_chan);
 
+/**
+ * snd_dmaengine_pcm_set_runtime_hwparams - Set runtime hw params
+ * @substream: PCM substream
+ * @dma_data: DAI DMA data
+ * @hw: PCM hw params
+ * @chan: DMA channel to use for data transfers
+ *
+ * Returns 0 on success, a negative error code otherwise.
+ *
+ * This function will query DMA capability, then refine the pcm hardware
+ * parameters, finally set substream runtime hardware parameters.
+ */
+int snd_dmaengine_pcm_set_runtime_hwparams(
+   struct snd_pcm_substream *substream,
+   struct snd_dmaengine_dai_dma_data *dma_data,
+   struct snd_pcm_hardware *hw,
+   struct dma_chan *chan)
+{
+   struct dma_slave_caps dma_caps;
+   u32 addr_widths = BIT(DMA_SLAVE_BUSWIDTH_1_BYTE) |
+ BIT(DMA_SLAVE_BUSWIDTH_2_BYTES) |
+ BIT(DMA_SLAVE_BUSWIDTH_4_BYTES);
+   snd_pcm_format_t i;
+   int ret;
+
+   if (!hw || !chan || !dma_data)
+   return -EINVAL;
+
+   ret = dma_get_slave_caps(chan, &dma_caps);
+   if (ret == 0) {
+   if (dma_caps.cmd_pause && dma_caps.cmd_resume)
+   hw->info |= SNDRV_PCM_INFO_PAUSE | 
SNDRV_PCM_INFO_RESUME;
+   if (dma_caps.residue_granularity <= 
DMA_RESIDUE_GRANULARITY_SEGMENT)
+   hw->info |= SNDRV_PCM_INFO_BATCH;
+
+   if (substream->stream == SNDRV_PCM_STREAM_PLAYBACK)
+   addr_widths = dma_caps.dst_addr_widths;
+   else
+   addr_widths = dma_caps.src_addr_widths;
+   }
+
+   /*
+* If SND_DMAENGINE_PCM_DAI_FLAG_PACK is set keep
+* hw.formats set to 0, meaning no restrictions are in place.
+* In this case it's the responsibility of the DAI driver to
+* provide the supported format information.
+*/
+   if (!(dma_data->flags & SND_DMAENGINE_PCM_DAI_FLAG_PACK))
+   /*
+* Prepare formats mask for valid/allowed sample types. If the
+* dma does not have support for the given physical word size,
+* it needs to be masked out so user space can not use the
+* format which produces corrupted audio.
+* In case the dma driver does not implement the slave_caps the
+* default assumption is that it supports 1, 2 and 4 bytes
+* widths.
+*/
+   for (i = SNDRV_PCM_FORMAT_FIRST; i <= SNDRV_PCM_FORMAT_LAST; 
i++) {
+   int bits = snd_pcm_format_physical_width(i);
+
+   /*
+* Enable only samples with DMA supported physical
+* widths
+*/
+   switch (bits) {
+   case 8:
+   case 16:
+   case 24:
+   case 32:
+   case 64:
+   if (addr_widths & (1 << (bits / 8)))
+   hw->formats |= pcm_format_to_bits(i);
+   break;
+   default:
+   /* Unsupported types */
+   break;
+   }
+   }
+
+   return snd_soc_set_runtime_hwparams(substream, hw);
+}

[PATCH V2 1/4] ASoC: fsl_asrc: Use in(out)put_format instead of in(out)put_word_width

2019-09-18 Thread Shengjiu Wang
snd_pcm_format_t is more formal than enum asrc_word_width: it carries
two properties, width and physical width, and is therefore more
accurate than enum asrc_word_width. So it is better to use
in(out)put_format instead of in(out)put_word_width.

Signed-off-by: Shengjiu Wang 
Acked-by: Nicolin Chen 
---
 sound/soc/fsl/fsl_asrc.c | 56 +++-
 sound/soc/fsl/fsl_asrc.h |  4 +--
 2 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/sound/soc/fsl/fsl_asrc.c b/sound/soc/fsl/fsl_asrc.c
index cfa40ef6b1ca..4d3804a1ea55 100644
--- a/sound/soc/fsl/fsl_asrc.c
+++ b/sound/soc/fsl/fsl_asrc.c
@@ -265,6 +265,8 @@ static int fsl_asrc_config_pair(struct fsl_asrc_pair *pair)
struct asrc_config *config = pair->config;
struct fsl_asrc *asrc_priv = pair->asrc_priv;
enum asrc_pair_index index = pair->index;
+   enum asrc_word_width input_word_width;
+   enum asrc_word_width output_word_width;
u32 inrate, outrate, indiv, outdiv;
u32 clk_index[2], div[2];
int in, out, channels;
@@ -283,9 +285,32 @@ static int fsl_asrc_config_pair(struct fsl_asrc_pair *pair)
return -EINVAL;
}
 
-   /* Validate output width */
-   if (config->output_word_width == ASRC_WIDTH_8_BIT) {
-   pair_err("does not support 8bit width output\n");
+   switch (snd_pcm_format_width(config->input_format)) {
+   case 8:
+   input_word_width = ASRC_WIDTH_8_BIT;
+   break;
+   case 16:
+   input_word_width = ASRC_WIDTH_16_BIT;
+   break;
+   case 24:
+   input_word_width = ASRC_WIDTH_24_BIT;
+   break;
+   default:
+   pair_err("does not support this input format, %d\n",
+config->input_format);
+   return -EINVAL;
+   }
+
+   switch (snd_pcm_format_width(config->output_format)) {
+   case 16:
+   output_word_width = ASRC_WIDTH_16_BIT;
+   break;
+   case 24:
+   output_word_width = ASRC_WIDTH_24_BIT;
+   break;
+   default:
+   pair_err("does not support this output format, %d\n",
+config->output_format);
return -EINVAL;
}
 
@@ -383,8 +408,8 @@ static int fsl_asrc_config_pair(struct fsl_asrc_pair *pair)
/* Implement word_width configurations */
regmap_update_bits(asrc_priv->regmap, REG_ASRMCR1(index),
   ASRMCR1i_OW16_MASK | ASRMCR1i_IWD_MASK,
-  ASRMCR1i_OW16(config->output_word_width) |
-  ASRMCR1i_IWD(config->input_word_width));
+  ASRMCR1i_OW16(output_word_width) |
+  ASRMCR1i_IWD(input_word_width));
 
/* Enable BUFFER STALL */
regmap_update_bits(asrc_priv->regmap, REG_ASRMCR(index),
@@ -497,13 +522,13 @@ static int fsl_asrc_dai_hw_params(struct 
snd_pcm_substream *substream,
  struct snd_soc_dai *dai)
 {
struct fsl_asrc *asrc_priv = snd_soc_dai_get_drvdata(dai);
-   int width = params_width(params);
struct snd_pcm_runtime *runtime = substream->runtime;
struct fsl_asrc_pair *pair = runtime->private_data;
unsigned int channels = params_channels(params);
unsigned int rate = params_rate(params);
struct asrc_config config;
-   int word_width, ret;
+   snd_pcm_format_t format;
+   int ret;
 
ret = fsl_asrc_request_pair(channels, pair);
if (ret) {
@@ -513,15 +538,10 @@ static int fsl_asrc_dai_hw_params(struct 
snd_pcm_substream *substream,
 
	pair->config = &config;
 
-   if (width == 16)
-   width = ASRC_WIDTH_16_BIT;
-   else
-   width = ASRC_WIDTH_24_BIT;
-
if (asrc_priv->asrc_width == 16)
-   word_width = ASRC_WIDTH_16_BIT;
+   format = SNDRV_PCM_FORMAT_S16_LE;
else
-   word_width = ASRC_WIDTH_24_BIT;
+   format = SNDRV_PCM_FORMAT_S24_LE;
 
config.pair = pair->index;
config.channel_num = channels;
@@ -529,13 +549,13 @@ static int fsl_asrc_dai_hw_params(struct 
snd_pcm_substream *substream,
config.outclk = OUTCLK_ASRCK1_CLK;
 
if (substream->stream == SNDRV_PCM_STREAM_PLAYBACK) {
-   config.input_word_width   = width;
-   config.output_word_width  = word_width;
+   config.input_format   = params_format(params);
+   config.output_format  = format;
config.input_sample_rate  = rate;
config.output_sample_rate = asrc_priv->asrc_rate;
} else {
-   config.input_word_width   = word_width;
-   config.output_word_width  = width;
+   config.input_format   = format;
+   config.output_format  = params_format(params);
config.input_sample_rate  = 

[PATCH V2 2/4] ASoC: fsl_asrc: update supported sample format

2019-09-18 Thread Shengjiu Wang
The ASRC supports 24-bit/16-bit/8-bit input width, which is the
data width, not the slot width.

For the S20_3LE format, the data width is 20 bits and the slot width
is 24 bits. If we set ASRMCR1n.IWD to 24 bits, the resulting volume
is lower than expected; it is as if the 24-bit data were shifted
right by 4 bits.

So replace S20_3LE with S24_3LE in the supported list and add the S8
format to the TX supported list.

Signed-off-by: Shengjiu Wang 
---
 sound/soc/fsl/fsl_asrc.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/sound/soc/fsl/fsl_asrc.c b/sound/soc/fsl/fsl_asrc.c
index 4d3804a1ea55..584badf956d2 100644
--- a/sound/soc/fsl/fsl_asrc.c
+++ b/sound/soc/fsl/fsl_asrc.c
@@ -624,7 +624,7 @@ static int fsl_asrc_dai_probe(struct snd_soc_dai *dai)
 
 #define FSL_ASRC_FORMATS   (SNDRV_PCM_FMTBIT_S24_LE | \
 SNDRV_PCM_FMTBIT_S16_LE | \
-SNDRV_PCM_FMTBIT_S20_3LE)
+SNDRV_PCM_FMTBIT_S24_3LE)
 
 static struct snd_soc_dai_driver fsl_asrc_dai = {
.probe = fsl_asrc_dai_probe,
@@ -635,7 +635,8 @@ static struct snd_soc_dai_driver fsl_asrc_dai = {
.rate_min = 5512,
.rate_max = 192000,
.rates = SNDRV_PCM_RATE_KNOT,
-   .formats = FSL_ASRC_FORMATS,
+   .formats = FSL_ASRC_FORMATS |
+  SNDRV_PCM_FMTBIT_S8,
},
.capture = {
.stream_name = "ASRC-Capture",
-- 
2.21.0



[PATCH V2 0/4] update supported sample format

2019-09-18 Thread Shengjiu Wang
This patch serial is to update the supported format for fsl_asrc
and fix some format issue.

Shengjiu Wang (4):
  ASoC: fsl_asrc: Use in(out)put_format instead of in(out)put_word_width
  ASoC: fsl_asrc: update supported sample format
  ASoC: pcm_dmaengine: Extract snd_dmaengine_pcm_set_runtime_hwparams
  ASoC: fsl_asrc: Fix error with S24_3LE format bitstream in i.MX8

changes in v2
- extract snd_dmaengine_pcm_set_runtime_hwparams in one
  separate path.
- 4th patch depends on 3rd patch


 include/sound/dmaengine_pcm.h |  5 ++
 sound/core/pcm_dmaengine.c| 83 +++
 sound/soc/fsl/fsl_asrc.c  | 65 ++---
 sound/soc/fsl/fsl_asrc.h  |  7 ++-
 sound/soc/fsl/fsl_asrc_dma.c  | 48 +---
 sound/soc/soc-generic-dmaengine-pcm.c | 62 ++--
 6 files changed, 181 insertions(+), 89 deletions(-)

-- 
2.21.0



Re: [PATCH v3 2/2] net/ibmvnic: prevent more than one thread from running in reset

2019-09-18 Thread Michael Ellerman
Hi Juliet,

Juliet Kim  writes:
> Signed-off-by: Juliet Kim 
> ---
>  drivers/net/ethernet/ibm/ibmvnic.c | 23 ++-
>  drivers/net/ethernet/ibm/ibmvnic.h |  3 +++
>  2 files changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
> b/drivers/net/ethernet/ibm/ibmvnic.c
> index ba340aaff1b3..f344ccd68ad9 100644
> --- a/drivers/net/ethernet/ibm/ibmvnic.c
> +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> @@ -2054,6 +2054,13 @@ static void __ibmvnic_reset(struct work_struct *work)
>  
>   adapter = container_of(work, struct ibmvnic_adapter, ibmvnic_reset);
>  
> + if (adapter->resetting) {
> + schedule_delayed_work(&adapter->ibmvnic_delayed_reset,
> +   IBMVNIC_RESET_DELAY);
> + return;
> + }
> +
> + adapter->resetting = true;
>   reset_state = adapter->state;

Is there some locking/serialisation around this?

Otherwise that looks very racy. ie. two CPUs could both see
adapter->resetting == false, then both set it to true, and then continue
executing and stomp on each other.

cheers