[PATCH v3] powerpc/mm: Remove kvm radix prefetch workaround for Power9 DD2.2

2019-12-05 Thread Jordan Niethe
Commit a25bd72badfa ("powerpc/mm/radix: Workaround prefetch issue with
KVM") introduced a number of workarounds because coming out of a guest
with the MMU enabled would leave the CPU running in hypervisor state
with the PID value from the guest. The CPU would then start prefetching
for the hypervisor with that PID value.

In Power9 DD2.2 the cpu behaviour was modified to fix this. When
accessing Quadrant 0 in hypervisor mode with LPID != 0 prefetching will
not be performed. This means that we can get rid of the workarounds for
Power9 DD2.2 and later revisions. Add a new cpu feature
CPU_FTR_P9_RADIX_PREFETCH_BUG to indicate if the workarounds are needed.
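As a rough sketch (not the full diff, parts of which are truncated
below), the C-side callers in radix_pgtable.c / radix_tlb.c can then
skip the workaround on fixed parts by testing the new feature bit:

	/* sketch only: skip the PID-flush workaround on DD2.2 and later */
	if (!cpu_has_feature(CPU_FTR_P9_RADIX_PREFETCH_BUG))
		return;
	/* ... existing workaround (flush with the guest PID) continues ... */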

Signed-off-by: Jordan Niethe 
---
v2: Use a cpu feature instead of open coding the PVR check
v3: Put parentheses around CPU_FTRS_POWER9_DD2_0 value
---
 arch/powerpc/include/asm/cputable.h  |  7 +--
 arch/powerpc/kernel/dt_cpu_ftrs.c| 13 -
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |  2 ++
 arch/powerpc/mm/book3s64/radix_pgtable.c |  6 +-
 arch/powerpc/mm/book3s64/radix_tlb.c |  3 +++
 5 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h 
b/arch/powerpc/include/asm/cputable.h
index cf00ff0d121d..40a4d3c6fd99 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -212,6 +212,7 @@ static inline void cpu_feature_keys_init(void) { }
 #define CPU_FTR_P9_TLBIE_STQ_BUG   LONG_ASM_CONST(0x0000400000000000)
 #define CPU_FTR_P9_TIDR            LONG_ASM_CONST(0x0000800000000000)
 #define CPU_FTR_P9_TLBIE_ERAT_BUG  LONG_ASM_CONST(0x0001000000000000)
+#define CPU_FTR_P9_RADIX_PREFETCH_BUG  LONG_ASM_CONST(0x0002000000000000)
 
 #ifndef __ASSEMBLY__
 
@@ -459,8 +460,10 @@ static inline void cpu_feature_keys_init(void) { }
CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_ARCH_207S | \
CPU_FTR_TM_COMP | CPU_FTR_ARCH_300 | CPU_FTR_PKEY | \
CPU_FTR_P9_TLBIE_STQ_BUG | CPU_FTR_P9_TLBIE_ERAT_BUG | \
CPU_FTR_P9_TIDR)
-#define CPU_FTRS_POWER9_DD2_0 CPU_FTRS_POWER9
-#define CPU_FTRS_POWER9_DD2_1 (CPU_FTRS_POWER9 | CPU_FTR_POWER9_DD2_1)
+#define CPU_FTRS_POWER9_DD2_0 (CPU_FTRS_POWER9 | CPU_FTR_P9_RADIX_PREFETCH_BUG)
+#define CPU_FTRS_POWER9_DD2_1 (CPU_FTRS_POWER9 | \
+  CPU_FTR_P9_RADIX_PREFETCH_BUG | \
+  CPU_FTR_POWER9_DD2_1)
 #define CPU_FTRS_POWER9_DD2_2 (CPU_FTRS_POWER9 | CPU_FTR_POWER9_DD2_1 | \
   CPU_FTR_P9_TM_HV_ASSIST | \
   CPU_FTR_P9_TM_XER_SO_BUG)
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 180b3a5d1001..182b4047c1ef 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -727,17 +727,20 @@ static __init void cpufeatures_cpu_quirks(void)
/*
 * Not all quirks can be derived from the cpufeatures device tree.
 */
-   if ((version & 0xefffffff) == 0x004e0200)
-   ; /* DD2.0 has no feature flag */
-   else if ((version & 0xefffffff) == 0x004e0201)
+   if ((version & 0xefffffff) == 0x004e0200) {
+   /* DD2.0 has no feature flag */
+   cur_cpu_spec->cpu_features |= CPU_FTR_P9_RADIX_PREFETCH_BUG;
+   } else if ((version & 0xefffffff) == 0x004e0201) {
    cur_cpu_spec->cpu_features |= CPU_FTR_POWER9_DD2_1;
-   else if ((version & 0xefffffff) == 0x004e0202) {
+   cur_cpu_spec->cpu_features |= CPU_FTR_P9_RADIX_PREFETCH_BUG;
+   } else if ((version & 0xefffffff) == 0x004e0202) {
    cur_cpu_spec->cpu_features |= CPU_FTR_P9_TM_HV_ASSIST;
    cur_cpu_spec->cpu_features |= CPU_FTR_P9_TM_XER_SO_BUG;
    cur_cpu_spec->cpu_features |= CPU_FTR_POWER9_DD2_1;
-   } else if ((version & 0xffff0000) == 0x004e0000)
+   } else if ((version & 0xffff0000) == 0x004e0000) {
    /* DD2.1 and up have DD2_1 */
    cur_cpu_spec->cpu_features |= CPU_FTR_POWER9_DD2_1;
+   }
 
    if ((version & 0xffff0000) == 0x004e0000) {
cur_cpu_spec->cpu_features &= ~(CPU_FTR_DAWR);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index faebcbb8c4db..72b08bb17200 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1793,6 +1793,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
tlbsync
ptesync
 
+BEGIN_FTR_SECTION
/* Radix: Handle the case where the guest used an illegal PID */
LOAD_REG_ADDR(r4, mmu_base_pid)
lwz r3, VCPU_GUEST_PID(r9)
@@ -1822,6 +1823,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
addi    r7,r7,0x1000
bdnz    1b
ptesync
+END_FTR_SECTION_IFSET(CPU_FTR_P9_RADIX_PREFETCH_BUG)
 
 2:
 #endif /* CONFIG_PPC_RADIX_MMU */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 

Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-05 Thread Frank Rowand
On 12/5/19 10:35 AM, Sebastian Andrzej Siewior wrote:
> On 2019-12-03 10:56:35 [-0600], Rob Herring wrote:
>>> Another possibility would be to make the cache be dependent
>>> upon not CONFIG_PPC.  It might be possible to disable the
>>> cache with a minimal code change.
>>
>> I'd rather not do that.
>>
>> And yes, as mentioned earlier I don't like the complexity. I didn't
>> like it from the start and I'm still of the opinion we should have a
>> fixed or 1 time sized true cache (i.e. smaller than total # of
>> phandles). That would solve the RT memory allocation and locking issue
>> too.
>>
>> For reference, the performance difference between the current
>> implementation (assuming fixes haven't regressed it) was ~400ms vs. a
>> ~340ms improvement with a 64 entry cache (using a mask, not a hash).
>> IMO, 340ms improvement was good enough.
> 
> Okay. So the 814 phandles would result in an array with 1024 slots. That
> would need 4KiB of memory.

Is this amount of memory an issue for this system?

If module support is not configured into the kernel then the cache is
removed and memory freed in a late initcall.  I don't know if that helps
your use case or not.


> What about going back to the fixed 64-slot array but with hash_32() for
> the lookup? Without the hash we would be ~60ms slower during boot
> (compared to now, based on ancient data), but the hash isn't expensive,
> so we end up with better use of the memory on systems which don't have a
> plain enumeration of the phandles.

That performance data is specific to one particular system.  It does not
generalize to all devicetree based systems.  So don't draw too many
conclusions from it.  If you want to understand the boot performance
impact for your system, you need to measure the alternatives on
your system.

Is there a memory usage issue for the systems that led to this thread?
Unless there is a documented memory issue, I do not want to expand an
issue about poor cache-bucket utilization into the separate issue of
cache size.

-Frank

> 
>> Rob
> 
> Sebastian
> 



Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-05 Thread Frank Rowand
On 12/3/19 10:56 AM, Rob Herring wrote:
> On Mon, Dec 2, 2019 at 10:28 PM Frank Rowand  wrote:
>>
>> On 12/2/19 10:12 PM, Michael Ellerman wrote:
>>> Frank Rowand  writes:
 On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote:
> I've been looking at phandle_cache and noticed the following: The raw
> phandle value as generated by dtc starts at zero and is incremented by
> one for each phandle entry. The qemu pSeries model is using Slof (which
> is probably the same thing as used on real hardware) and this looks like
> a pointer value for the phandle.
> With
> qemu-system-ppc64le -m 16G -machine pseries -smp 8
>
> I got the following output:
> | entries: 64
> | phandle 7e732468 slot 28 hash c
> | phandle 7e732ad0 slot 10 hash 27
> | phandle 7e732ee8 slot 28 hash 3a
> | phandle 7e734160 slot 20 hash 36
> | phandle 7e734318 slot 18 hash 3a
> | phandle 7e734428 slot 28 hash 33
> | phandle 7e734538 slot 38 hash 2c
> | phandle 7e734850 slot 10 hash e
> | phandle 7e735220 slot 20 hash 2d
> | phandle 7e735bf0 slot 30 hash d
> | phandle 7e7365c0 slot 0 hash 2d
> | phandle 7e736f90 slot 10 hash d
> | phandle 7e737960 slot 20 hash 2d
> | phandle 7e738330 slot 30 hash d
> | phandle 7e738d00 slot 0 hash 2d
> | phandle 7e739730 slot 30 hash 38
> | phandle 7e73bd08 slot 8 hash 17
> | phandle 7e73c2e0 slot 20 hash 32
> | phandle 7e73c7f8 slot 38 hash 37
> | phandle 7e782420 slot 20 hash 13
> | phandle 7e782ed8 slot 18 hash 1b
> | phandle 7e73ce28 slot 28 hash 39
> | phandle 7e73d390 slot 10 hash 22
> | phandle 7e73d9a8 slot 28 hash 1a
> | phandle 7e73dc28 slot 28 hash 37
> | phandle 7e73de00 slot 0 hash a
> | phandle 7e73e028 slot 28 hash 0
> | phandle 7e7621a8 slot 28 hash 36
> | phandle 7e73e458 slot 18 hash 1e
> | phandle 7e73e608 slot 8 hash 1e
> | phandle 7e740078 slot 38 hash 28
> | phandle 7e740180 slot 0 hash 1d
> | phandle 7e740240 slot 0 hash 33
> | phandle 7e740348 slot 8 hash 29
> | phandle 7e740410 slot 10 hash 2
> | phandle 7e740eb0 slot 30 hash 3e
> | phandle 7e745390 slot 10 hash 33
> | phandle 7e747b08 slot 8 hash c
> | phandle 7e748528 slot 28 hash f
> | phandle 7e74a6e0 slot 20 hash 18
> | phandle 7e74aab0 slot 30 hash b
> | phandle 7e74f788 slot 8 hash d
> | Used entries: 8, hashed: 29
>
> So the hash array has 64 entries, out of which only 8 are populated. Using
> hash_32() populates 29 entries.
> Could someone with real hardware verify this?
> I'm not sure how important this is performance-wise; it just looks like a
> waste to use only 1/8 of the array.

 The hash used is based on the assumptions you noted, and as stated in the
 code, that phandle property values are in a contiguous range of 1..n
 (not starting from zero), which is what dtc generates.
>>>
 We knew that for systems that do not match those assumptions, the hash
 will not be optimal.
>>>
>>> If we're going to have the phandle cache it should at least make some
>>> attempt to work across the systems that Linux supports.
>>>
 Unless there is a serious performance problem for
 such systems, I do not want to make the phandle hash code more complicated
 to optimize for these cases.  And pseries systems have been performing ok,
 without phandle-related performance issues that I remember hearing about,
 since before the cache was added, which can only have helped performance.
 Yes, if your observations are correct, some memory is being wasted, but
 a 64 entry cache is not very large on a pseries.
>>>
>>> A single line change to use an actual hash function is hardly
>>> complicating it, compared to the amount of code already there.
>>
>> With a dtc generated devicetree, the hash is perfect, with no
>> misses.  That single line change then makes the hash bad for
>> dtc generated devicetrees.
> 
> To quantify that, I did some tests with the hash algo and it's about a
> 10% collision rate using phandle values of 1-$cache_size. There's
> never more than 2 phandles colliding in a slot. The actual impact
> would be less given 100% thrashing seems unlikely.

Thank you for doing the tests.  Actual data helps a lot.

If there is only a 10% collision rate for this case, that does not
sound bad to me.  There is the possibility of current or future
code resulting in ping ponging between two phandle values which
collide in the cache, but that should not be the common case.

However, given a choice between two algorithms, one of which
guarantees no thrashing (the current cache algorithm) and one
which allows a pathologic use case which results in thrashing,
I prefer the first algorithm.  This may seem theoretical, but
I have seen enough pathological code paths in my years of
performance analysis and tuning to be sensitive to this issue.

> 
>> The cache was added to address a 

Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-05 Thread Frank Rowand
On 12/3/19 12:35 PM, Segher Boessenkool wrote:
> Hi!
> 
> On Tue, Dec 03, 2019 at 03:03:22PM +1100, Michael Ellerman wrote:
>> Sebastian Andrzej Siewior  writes:
>> I've certainly heard it said that on some OF's the phandle was just ==
>> the address of the internal representation, and I guess maybe for SLOF
>> that is true.
> 
> It is (or was).  In many OFs it is just the effective address of some
> node structure.  SLOF runs with translation off normally.
> 
>> They seem to vary wildly though, eg. on an Apple G5:
> 
> Apple OF runs with translation on usually.  IIRC these are effective
> addresses as well.
> 
> The OF they have on G5 machines is mostly 32-bit, for compatibility is my
> guess (for userland things dealing with addresses from OF, importantly).
> 
>>   $ find /proc/device-tree/ -name phandle | xargs lsprop | head -10
>>   /proc/device-tree/vsp@0,f900/veo@f918/phandle ff970848
>>   /proc/device-tree/vsp@0,f900/phandle ff970360
>>   /proc/device-tree/vsp@0,f900/veo@f908/phandle ff970730
>>   /proc/device-tree/nvram@0,fff04000/phandle ff967fb8
>>   /proc/device-tree/xmodem/phandle ff9655e8
>>   /proc/device-tree/multiboot/phandle ff9504f0
>>   /proc/device-tree/diagnostics/phandle ff965550
>>   /proc/device-tree/options/phandle ff893cf0
>>   /proc/device-tree/openprom/client-services/phandle ff8925b8
>>   /proc/device-tree/openprom/phandle ff892458
>>
>> That machine does not have enough RAM for those to be 32-bit real
>> addresses. I think Apple OF is running in virtual mode though (?), so
>> maybe they are pointers?
> 
> Yes, I think the default is to have 8MB ram at the top of 4GB (which is
> the physical address of the bootrom, btw) for OF.
> 
>> And on an IBM pseries machine they're a bit all over the place:
>>
>>   /proc/device-tree/cpus/PowerPC,POWER8@40/ibm,phandle 1040
>>   /proc/device-tree/cpus/l2-cache@2005/ibm,phandle 2005
>>   /proc/device-tree/cpus/PowerPC,POWER8@30/ibm,phandle 1030
>>   /proc/device-tree/cpus/PowerPC,POWER8@20/ibm,phandle 1020
>>   /proc/device-tree/cpus/PowerPC,POWER8@10/ibm,phandle 1010
>>   /proc/device-tree/cpus/l2-cache@2003/ibm,phandle 2003
>>   /proc/device-tree/cpus/l2-cache@200a/ibm,phandle 200a
>>   /proc/device-tree/cpus/l3-cache@3108/ibm,phandle 3108
>>   /proc/device-tree/cpus/l2-cache@2001/ibm,phandle 2001
>>   /proc/device-tree/cpus/l3-cache@3106/ibm,phandle 3106
>>   /proc/device-tree/cpus/ibm,phandle fff8
>>   /proc/device-tree/cpus/l3-cache@3104/ibm,phandle 3104
>>   /proc/device-tree/cpus/l2-cache@2008/ibm,phandle 2008
>>   /proc/device-tree/cpus/l3-cache@3102/ibm,phandle 3102
>>   /proc/device-tree/cpus/l2-cache@2006/ibm,phandle 2006
>>   /proc/device-tree/cpus/l3-cache@3100/ibm,phandle 3100
>>   /proc/device-tree/cpus/PowerPC,POWER8@8/ibm,phandle 1008
>>   /proc/device-tree/cpus/l2-cache@2004/ibm,phandle 2004
>>   /proc/device-tree/cpus/PowerPC,POWER8@48/ibm,phandle 1048
>>   /proc/device-tree/cpus/PowerPC,POWER8@38/ibm,phandle 1038
>>   /proc/device-tree/cpus/l2-cache@2002/ibm,phandle 2002
>>   /proc/device-tree/cpus/PowerPC,POWER8@28/ibm,phandle 1028
>>   /proc/device-tree/cpus/l3-cache@3107/ibm,phandle 3107
>>   /proc/device-tree/cpus/PowerPC,POWER8@18/ibm,phandle 1018
>>   /proc/device-tree/cpus/l2-cache@2000/ibm,phandle 2000
>>   /proc/device-tree/cpus/l3-cache@3105/ibm,phandle 3105
>>   /proc/device-tree/cpus/l3-cache@3103/ibm,phandle 3103
>>   /proc/device-tree/cpus/l3-cache@310a/ibm,phandle 310a
>>   /proc/device-tree/cpus/PowerPC,POWER8@0/ibm,phandle 1000
>>   /proc/device-tree/cpus/l2-cache@2007/ibm,phandle 2007
>>   /proc/device-tree/cpus/l3-cache@3101/ibm,phandle 3101
>>   /proc/device-tree/pci@800201b/ibm,phandle 201b
> 
> Some (the 1000) look like addresses as well.
> 
>>> So the hash array has 64 entries, out of which only 8 are populated. Using
>>> hash_32() populates 29 entries.
> 
>> On the G5 it's similarly inefficient:
>> [0.007379] OF: of_populate_phandle_cache(242) Used entries: 31, hashed: 111
> 
>> And some output from a "real" pseries machine (IBM OF), which is
>> slightly better:
>> [0.129467] OF: of_populate_phandle_cache(242) Used entries: 39, hashed: 81
> 
>> So yeah using hash_32() is quite a bit better in both cases.
> 
> Yup, no surprise there.  And hash_32 is very cheap to compute.
> 
>> And if I'm reading your patch right it would be a single line change to
>> switch, so that seems like it's worth doing to me.
> 
> Agreed!
> 
> Btw.  Some OFs mangle the phandles some way, to make it easier to catch
> people using it as an address (and similarly, mangle ihandles differently,
> so you catch confusion between ihandles and phandles as well).  Like a
> simple xor, with some odd number preferably.  You should assume *nothing*
> about phandles, they are opaque identifiers.

For arm32 machines that use dtc to generate the devicetree, which is 

Re: [PATCH v5 2/5] powerpc/kprobes: Mark newly allocated probes as RO

2019-12-05 Thread Michael Ellerman
Michael Ellerman  writes:
> Russell Currey  writes:
>> With CONFIG_STRICT_KERNEL_RWX=y and CONFIG_KPROBES=y, there will be one
>> W+X page at boot by default.  This can be tested with
>> CONFIG_PPC_PTDUMP=y and CONFIG_PPC_DEBUG_WX=y set, and checking the
>> kernel log during boot.
>>
>> powerpc doesn't implement its own alloc() for kprobes like other
>> architectures do, but we couldn't immediately mark RO anyway since we do
>> a memcpy to the page we allocate later.  After that, nothing should be
>> allowed to modify the page, and write permissions are removed well
>> before the kprobe is armed.
>>
>> Thus mark newly allocated probes as read-only once it's safe to do so.
>>
>> Signed-off-by: Russell Currey 
>> ---
>>  arch/powerpc/kernel/kprobes.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
>> index 2d27ec4feee4..2610496de7c7 100644
>> --- a/arch/powerpc/kernel/kprobes.c
>> +++ b/arch/powerpc/kernel/kprobes.c
>> @@ -24,6 +24,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;
>>  DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
>> @@ -131,6 +132,8 @@ int arch_prepare_kprobe(struct kprobe *p)
>>  (unsigned long)p->ainsn.insn + sizeof(kprobe_opcode_t));
>>  }
>>  
>> +set_memory_ro((unsigned long)p->ainsn.insn, 1);
>> +
>
> That comes from:
>   p->ainsn.insn = get_insn_slot();
>
>
> Which ends up in __get_insn_slot() I think. And that looks very much
> like it's going to hand out multiple slots per page, which isn't going
> to work because you've just marked the whole page RO.
>
> So I would expect this to crash on the 2nd kprobe that's installed. Have
> you tested it somehow?
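To illustrate the concern (a hedged sketch, not code from the patch):
__get_insn_slot() packs several slots into one executable page, so
marking the whole page read-only after the first kprobe breaks the
memcpy for the next kprobe sharing that page:

	/* kprobe 1: gets slot 0 of page P, then makes all of P read-only */
	p1->ainsn.insn = get_insn_slot();
	memcpy(p1->ainsn.insn, p1->addr, MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
	set_memory_ro((unsigned long)p1->ainsn.insn, 1);

	/* kprobe 2: gets slot 1 in the same page P */
	p2->ainsn.insn = get_insn_slot();
	memcpy(p2->ainsn.insn, p2->addr, MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
	/* ^ write to a now read-only page -> fault */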

I'm not sure if this is the issue I was talking about, but it doesn't
survive ftracetest:

  [ 1139.576047] [ cut here ]
  [ 1139.576322] kernel BUG at mm/memory.c:2036!
  cpu 0x1f: Vector: 700 (Program Check) at [c01fd6c675d0]
  pc: c035d018: apply_to_page_range+0x318/0x610
  lr: c00900bc: change_memory_attr+0x4c/0x70
  sp: c01fd6c67860
 msr: 90029033
current = 0xc01fa4a47880
paca= 0xc01e5c80   irqmask: 0x03   irq_happened: 0x01
  pid   = 7168, comm = ftracetest
  kernel BUG at mm/memory.c:2036!
  Linux version 5.4.0-gcc-8.2.0-11694-gf1f9aa266811 
(mich...@raptor-2.ozlabs.ibm.com) (gcc version 8.2.0 (crosstool-NG 
1.24.0-rc1.16-9627a04)) #1384 SMP Thu Dec 5 22:11:09 AEDT 2019
  enter ? for help
  [c01fd6c67940] c00900bc change_memory_attr+0x4c/0x70
  [c01fd6c67970] c0053c48 arch_prepare_kprobe+0xb8/0x120
  [c01fd6c679e0] c022f718 register_kprobe+0x608/0x790
  [c01fd6c67a40] c022fc50 register_kretprobe+0x230/0x350
  [c01fd6c67a80] c02849b4 __register_trace_kprobe+0xf4/0x1a0
  [c01fd6c67af0] c0285b18 trace_kprobe_create+0x738/0xf70
  [c01fd6c67c30] c0286378 create_or_delete_trace_kprobe+0x28/0x70
  [c01fd6c67c50] c025f024 trace_run_command+0xc4/0xe0
  [c01fd6c67ca0] c025f128 trace_parse_run_command+0xe8/0x230
  [c01fd6c67d40] c02845d0 probes_write+0x20/0x40
  [c01fd6c67d60] c03eef4c __vfs_write+0x3c/0x70
  [c01fd6c67d80] c03f26a0 vfs_write+0xd0/0x200
  [c01fd6c67dd0] c03f2a3c ksys_write+0x7c/0x140
  [c01fd6c67e20] c000b9e0 system_call+0x5c/0x68
  --- Exception: c01 (System Call) at 7fff8f06e420
  SP (793d6830) is in userspace
  1f:mon> client_loop: send disconnect: Broken pipe


Sorry I didn't get any more info on the crash, I lost the console and
then some CI bot stole the machine 8)

You should be able to reproduce just by running ftracetest.

cheers


Re: [PATCH 01/14] powerpc/vas: Describe vas-port and interrupts properties

2019-12-05 Thread Rob Herring
On Tue, Nov 26, 2019 at 05:02:03PM -0800, Haren Myneni wrote:
> [PATCH 01/14] powerpc/vas: Describe vas-port and interrupts properties

Something wrong here with the subject in the body.

> 
> Signed-off-by: Haren Myneni 
> ---
>  Documentation/devicetree/bindings/powerpc/ibm,vas.txt | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/powerpc/ibm,vas.txt 
> b/Documentation/devicetree/bindings/powerpc/ibm,vas.txt
> index bf11d2f..12de08b 100644
> --- a/Documentation/devicetree/bindings/powerpc/ibm,vas.txt
> +++ b/Documentation/devicetree/bindings/powerpc/ibm,vas.txt
> @@ -11,6 +11,8 @@ Required properties:
>window context start and length, OS/User window context start and length,
>"Paste address" start and length, "Paste window id" start bit and number
>of bits)
> +- ibm,vas-port : Port address for the interrupt.

What's the size of this property?

> +- interrupts: IRQ value for each VAS instance and level.
>  
>  Example:
>  
> @@ -18,5 +20,8 @@ Example:
>   compatible = "ibm,vas", "ibm,power9-vas";
>   reg = <0x60191 0x200 0x60190 0x1 
> 0x8 0x1 0x20 0x10>;
>   name = "vas";
> + interrupts = <0x1f 0>;
> + interrupt-parent = <>;
>   ibm,vas-id = <0x1>;
> + ibm,vas-port = <0x601000100>;
>   };
> -- 
> 1.8.3.1
> 
> 
> 


Re: [PATCH v3 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt

2019-12-05 Thread Phil Auld
On Thu, Dec 05, 2019 at 02:02:17PM +0530 Srikar Dronamraju wrote:
> With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted
> vCPUs"), scheduler avoids preempted vCPUs to schedule tasks on wakeup.
> This leads to wrong choice of CPU, which in-turn leads to larger wakeup
> latencies. Eventually, it leads to performance regression in latency
> sensitive benchmarks like soltp, schbench etc.
> 
> On Powerpc, vcpu_is_preempted only looks at yield_count. If the
> yield_count is odd, the vCPU is assumed to be preempted. However
> yield_count is increased whenever LPAR enters CEDE state. So any CPU
> that has entered CEDE state is assumed to be preempted.
> 
> Even if a vCPU of a dedicated LPAR is preempted/donated, it should have
> right of first use since the LPAR is supposed to own the vCPU.
> 
> On a Power9 System with 32 cores
>  # lscpu
> Architecture:ppc64le
> Byte Order:  Little Endian
> CPU(s):  128
> On-line CPU(s) list: 0-127
> Thread(s) per core:  8
> Core(s) per socket:  1
> Socket(s):   16
> NUMA node(s):2
> Model:   2.2 (pvr 004e 0202)
> Model name:  POWER9 (architected), altivec supported
> Hypervisor vendor:   pHyp
> Virtualization type: para
> L1d cache:   32K
> L1i cache:   32K
> L2 cache:512K
> L3 cache:10240K
> NUMA node0 CPU(s):   0-63
> NUMA node1 CPU(s):   64-127
> 
>   # perf stat -a -r 5 ./schbench
> v5.4 v5.4 + patch
> Latency percentiles (usec)   Latency percentiles (usec)
>   50.th: 45   50.th: 39
>   75.th: 62   75.th: 53
>   90.th: 71   90.th: 67
>   95.th: 77   95.th: 76
>   *99.th: 91  *99.th: 89
>   99.5000th: 707  99.5000th: 93
>   99.9000th: 6920 99.9000th: 118
>   min=0, max=10048min=0, max=211
> Latency percentiles (usec)   Latency percentiles (usec)
>   50.th: 45   50.th: 34
>   75.th: 61   75.th: 45
>   90.th: 72   90.th: 53
>   95.th: 79   95.th: 56
>   *99.th: 691 *99.th: 61
>   99.5000th: 3972 99.5000th: 63
>   99.9000th: 8368 99.9000th: 78
>   min=0, max=16606min=0, max=228
> Latency percentiles (usec)   Latency percentiles (usec)
>   50.th: 45   50.th: 34
>   75.th: 61   75.th: 45
>   90.th: 71   90.th: 53
>   95.th: 77   95.th: 57
>   *99.th: 106 *99.th: 63
>   99.5000th: 2364 99.5000th: 68
>   99.9000th: 7480 99.9000th: 100
>   min=0, max=10001min=0, max=134
> Latency percentiles (usec)   Latency percentiles (usec)
>   50.th: 45   50.th: 34
>   75.th: 62   75.th: 46
>   90.th: 72   90.th: 53
>   95.th: 78   95.th: 56
>   *99.th: 93  *99.th: 61
>   99.5000th: 108  99.5000th: 64
>   99.9000th: 6792 99.9000th: 85
>   min=0, max=17681min=0, max=121
> Latency percentiles (usec)   Latency percentiles (usec)
>   50.th: 46   50.th: 33
>   75.th: 62   75.th: 44
>   90.th: 73   90.th: 51
>   95.th: 79   95.th: 54
>   *99.th: 113 *99.th: 61
>   99.5000th: 2724 99.5000th: 64
>   99.9000th: 6184 99.9000th: 82
>   min=0, max=9887 min=0, max=121
> 
>  Performance counter stats for 'system wide' (5 runs):
> 
> context-switches43,373  ( +-  0.40% )   44,597 ( +-  0.55% )
> cpu-migrations   1,211  ( +-  5.04% )  220 ( +-  6.23% )
> page-faults 15,983  ( +-  5.21% )   15,360 ( +-  3.38% )
> 
> Waiman Long suggested using static_keys.
> 
> Reported-by: Parth Shah 
> Reported-by: Ihor Pasichnyk 
> Cc: Parth Shah 
> Cc: Ihor Pasichnyk 
> Cc: Juri Lelli 
> Cc: Phil Auld 
> Cc: Waiman Long 
> Cc: Gautham R. Shenoy 
> Tested-by: Juri Lelli 
> Ack-by: Waiman Long 
> Reviewed-by: Gautham R. Shenoy 
> Signed-off-by: Srikar Dronamraju 
> ---
> Changelog v1 

Re: [PATCH v3 2/2] powerpc/shared: Use static key to detect shared processor

2019-12-05 Thread Phil Auld
On Thu, Dec 05, 2019 at 02:02:18PM +0530 Srikar Dronamraju wrote:
> With the static key shared processor available, is_shared_processor()
> can return without having to query the lppaca structure.
> 
> Cc: Parth Shah 
> Cc: Ihor Pasichnyk 
> Cc: Juri Lelli 
> Cc: Phil Auld 
> Cc: Waiman Long 
> Signed-off-by: Srikar Dronamraju 
> ---
> Changelog v1 (https://patchwork.ozlabs.org/patch/1204192/) ->v2:
> Now that we no more refer to lppaca, remove the comment.
> 
> Changelog v2->v3:
> Code is now under CONFIG_PPC_SPLPAR as it depends on CONFIG_PPC_PSERIES.
> This was suggested by Waiman Long.
> 
>  arch/powerpc/include/asm/spinlock.h | 9 ++---
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/spinlock.h 
> b/arch/powerpc/include/asm/spinlock.h
> index de817c25deff..e83d57f27566 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -111,13 +111,8 @@ static inline void splpar_rw_yield(arch_rwlock_t *lock) 
> {};
>  
>  static inline bool is_shared_processor(void)
>  {
> -/*
> - * LPPACA is only available on Pseries so guard anything LPPACA related to
> - * allow other platforms (which include this common header) to compile.
> - */
> -#ifdef CONFIG_PPC_PSERIES
> - return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
> - lppaca_shared_proc(local_paca->lppaca_ptr));
> +#ifdef CONFIG_PPC_SPLPAR
> + return static_branch_unlikely(&shared_processor);
>  #else
>   return false;
>  #endif
> -- 
> 2.18.1
> 

Fwiw,

Acked-by: Phil Auld 
-- 



Re: [PATCH v2 16/27] nvdimm/ocxl: Implement the Read Error Log command

2019-12-05 Thread kbuild test robot
Hi Alastair,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v5.4-rc8]
[also build test ERROR on char-misc/char-misc-testing]
[cannot apply to linux-nvdimm/libnvdimm-for-next linus/master next-20191205]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:
https://github.com/0day-ci/linux/commits/Alastair-D-Silva/Add-support-for-OpenCAPI-SCM-devices/20191203-120511
base:af42d3466bdc8f39806b26f593604fdc54140bcb
config: x86_64-randconfig-b001-20191203 (attached as .config)
compiler: gcc-7 (Debian 7.5.0-1) 7.5.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot 

All errors (new ones prefixed by >>):

   In file included from :32:0:
>> ./usr/include/nvdimm/ocxl-scm.h:21:24: error: C++ style comments are not allowed in ISO C90
     __u32 log_identifier; // out
                           ^
>> ./usr/include/nvdimm/ocxl-scm.h:21:24: error: (this will be reported only once per input file)
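The flagged line lives in a UAPI header, which must compile as ISO C90;
the likely fix is simply switching to a C-style comment (a sketch, not
Alastair's actual follow-up):

	__u32 log_identifier; /* out */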

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH 0/3] pseries: Track and expose idle PURR and SPURR ticks

2019-12-05 Thread Naveen N. Rao

Hi Nathan,

Nathan Lynch wrote:

Hi Kamalesh,

Kamalesh Babulal  writes:

On 12/5/19 3:54 AM, Nathan Lynch wrote:

"Gautham R. Shenoy"  writes:


Tools such as lparstat which are used to compute the utilization need
to know [S]PURR ticks when the cpu was busy or idle. The [S]PURR
counters are already exposed through sysfs.  We already account for
PURR ticks when we go to idle so that we can update the VPA area. This
patchset extends support to account for SPURR ticks when idle, and
expose both via per-cpu sysfs files.


Does anything really want to use PURR instead of SPURR? Seems like we
should expose only SPURR idle values if possible.



lparstat is one of the consumers of PURR idle metric
(https://groups.google.com/forum/#!topic/powerpc-utils-devel/fYRo69xO9r4). 
Agreed on the argument that system utilization metrics based on SPURR
accounting are more accurate than those based on PURR, which isn't
proportional to CPU frequency.  PURR has traditionally been used to
understand system utilization, whereas SPURR is used for understanding
how much capacity is left/exceeded in the system based on the current
power saving mode.


I'll phrase my question differently: does SPURR complement or supersede
PURR? You seem to be saying they serve different purposes. If PURR is
actually useful rather than vestigial then I have no objection to
exposing idle_purr.


SPURR complements PURR, so we need both. SPURR/PURR ratio helps provide 
an indication of the available headroom in terms of core resources, at 
maximum frequency.
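For illustration only (all variable names below are hypothetical, and
this is not lparstat's actual code), the kind of arithmetic a
utilization tool can do with the two counters over one sampling
interval:

	u64 purr_delta  = purr_now  - purr_prev;   /* total PURR ticks  */
	u64 spurr_delta = spurr_now - spurr_prev;  /* total SPURR ticks */
	u64 busy_purr   = purr_delta - (idle_purr_now - idle_purr_prev);

	/* utilization of the entitled capacity, from PURR */
	u64 util_pct  = div64_u64(100 * busy_purr, purr_delta);

	/* SPURR/PURR below 100% suggests the core ran below nominal
	 * frequency, i.e. there is headroom left at maximum frequency */
	u64 ratio_pct = div64_u64(100 * spurr_delta, purr_delta);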



- Naveen



Re: [PATCH 2/3] powerpc/sysfs: Show idle_purr and idle_spurr for every CPU

2019-12-05 Thread Naveen N. Rao

Gautham R. Shenoy wrote:

From: "Gautham R. Shenoy" 

On Pseries LPARs, to calculate utilization, we need to know the
[S]PURR ticks when the CPUs were busy or idle.

The total PURR and SPURR ticks are already exposed via the per-cpu
sysfs files /sys/devices/system/cpu/cpuX/purr and
/sys/devices/system/cpu/cpuX/spurr.

This patch adds support for exposing the idle PURR and SPURR ticks via
/sys/devices/system/cpu/cpuX/idle_purr and
/sys/devices/system/cpu/cpuX/idle_spurr.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/sysfs.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 80a676d..42ade55 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -1044,6 +1044,36 @@ static ssize_t show_physical_id(struct device *dev,
 }
 static DEVICE_ATTR(physical_id, 0444, show_physical_id, NULL);

+static ssize_t idle_purr_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+   struct cpu *cpu = container_of(dev, struct cpu, dev);
+   unsigned int cpuid = cpu->dev.id;
+   struct lppaca *cpu_lppaca_ptr = paca_ptrs[cpuid]->lppaca_ptr;
+   u64 idle_purr_cycles = be64_to_cpu(cpu_lppaca_ptr->wait_state_cycles);
+
+   return sprintf(buf, "%llx\n", idle_purr_cycles);
+}
+static DEVICE_ATTR_RO(idle_purr);
+
+DECLARE_PER_CPU(u64, idle_spurr_cycles);
+static ssize_t idle_spurr_show(struct device *dev,
+  struct device_attribute *attr, char *buf)
+{
+   struct cpu *cpu = container_of(dev, struct cpu, dev);
+   unsigned int cpuid = cpu->dev.id;
+   u64 *idle_spurr_cycles_ptr = per_cpu_ptr(&idle_spurr_cycles, cpuid);


Is it possible for a user to read stale values if a particular cpu is in 
an extended cede? Is it possible to use smp_call_function_single() to 
force the cpu out of idle?
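A possible shape for that suggestion (an untested sketch with
hypothetical helper names, not part of the posted patch): run the read
on the target CPU itself, which also forces it out of idle:

	static void read_idle_spurr_on_cpu(void *arg)
	{
		*(u64 *)arg = *this_cpu_ptr(&idle_spurr_cycles);
	}

	static u64 fetch_idle_spurr(int cpuid)
	{
		u64 val = 0;

		smp_call_function_single(cpuid, read_idle_spurr_on_cpu, &val, 1);
		return val;
	}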


- Naveen



Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-05 Thread Sebastian Andrzej Siewior
On 2019-12-03 10:56:35 [-0600], Rob Herring wrote:
> > Another possibility would be to make the cache be dependent
> > upon not CONFIG_PPC.  It might be possible to disable the
> > cache with a minimal code change.
> 
> I'd rather not do that.
> 
> And yes, as mentioned earlier I don't like the complexity. I didn't
> like it from the start and I'm still of the opinion we should have a
> fixed or 1 time sized true cache (i.e. smaller than total # of
> phandles). That would solve the RT memory allocation and locking issue
> too.
> 
> For reference, the performance difference between the current
> implementation (assuming fixes haven't regressed it) was ~400ms vs. a
> ~340ms improvement with a 64 entry cache (using a mask, not a hash).
> IMO, 340ms improvement was good enough.

Okay. So the 814 phandles would result in an array with 1024 slots. That
would need 4KiB of memory.
What about going back to the fixed 64-slot array but with hash_32() for the
lookup? Without the hash we would be ~60ms slower during boot (compared
to now, based on ancient data), but the hash isn't expensive, so we
end up with better use of the memory on systems which don't have a
plain enumeration of the phandles.
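For reference, a sketch of the two lookup schemes being compared
(illustrative names, assuming a power-of-two cache size):

	#include <linux/hash.h>
	#include <linux/log2.h>

	#define PHANDLE_CACHE_SZ	64	/* must stay a power of two */

	/* current scheme: plain mask - perfect for dtc's contiguous 1..n
	 * phandles, poor for SLOF-style pointer-like phandles */
	static unsigned int slot_mask(u32 phandle)
	{
		return phandle & (PHANDLE_CACHE_SZ - 1);
	}

	/* proposed scheme: hash_32() spreads arbitrary values more evenly */
	static unsigned int slot_hash(u32 phandle)
	{
		return hash_32(phandle, ilog2(PHANDLE_CACHE_SZ));
	}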

> Rob

Sebastian


Re: [PATCH 0/3] pseries: Track and expose idle PURR and SPURR ticks

2019-12-05 Thread Nathan Lynch
Hi Kamalesh,

Kamalesh Babulal  writes:
> On 12/5/19 3:54 AM, Nathan Lynch wrote:
>> "Gautham R. Shenoy"  writes:
>>>
>>> Tools such as lparstat which are used to compute the utilization need
>>> to know [S]PURR ticks when the cpu was busy or idle. The [S]PURR
>>> counters are already exposed through sysfs.  We already account for
>>> PURR ticks when we go to idle so that we can update the VPA area. This
>>> patchset extends support to account for SPURR ticks when idle, and
>>> expose both via per-cpu sysfs files.
>> 
>> Does anything really want to use PURR instead of SPURR? Seems like we
>> should expose only SPURR idle values if possible.
>> 
>
> lparstat is one of the consumers of PURR idle metric
> (https://groups.google.com/forum/#!topic/powerpc-utils-devel/fYRo69xO9r4). 
> Agreed on the argument that system utilization metrics based on SPURR
> accounting are more accurate than those based on PURR, which isn't proportional to
> CPU frequency.  PURR has been traditionally used to understand the system
> utilization, whereas SPURR is used for understanding how much capacity is
> left/exceeding in the system based on the current power saving mode.

I'll phrase my question differently: does SPURR complement or supersede
PURR? You seem to be saying they serve different purposes. If PURR is
actually useful rather than vestigial then I have no objection to
exposing idle_purr.


Re: [PATCH 0/3] pseries: Track and expose idle PURR and SPURR ticks

2019-12-05 Thread Kamalesh Babulal
On 12/5/19 3:54 AM, Nathan Lynch wrote:
> "Gautham R. Shenoy"  writes:
>> From: "Gautham R. Shenoy" 
>>
>> On PSeries LPARs, data center planners desire a more accurate
>> view of system utilization per resource such as CPU to plan the system
>> capacity requirements better. Such accuracy can be obtained by reading
>> PURR/SPURR registers for CPU resource utilization.
>>
>> Tools such as lparstat which are used to compute the utilization need
>> to know [S]PURR ticks when the cpu was busy or idle. The [S]PURR
>> counters are already exposed through sysfs.  We already account for
>> PURR ticks when we go to idle so that we can update the VPA area. This
>> patchset extends support to account for SPURR ticks when idle, and
>> expose both via per-cpu sysfs files.
> 
> Does anything really want to use PURR instead of SPURR? Seems like we
> should expose only SPURR idle values if possible.
> 

lparstat is one of the consumers of PURR idle metric
(https://groups.google.com/forum/#!topic/powerpc-utils-devel/fYRo69xO9r4). 
Agreed on the argument that system utilization metrics based on SPURR
accounting are more accurate than those based on PURR, which isn't proportional to
CPU frequency.  PURR has been traditionally used to understand the system
utilization, whereas SPURR is used for understanding how much capacity is
left/exceeding in the system based on the current power saving mode.

-- 
Kamalesh



Re: [PATCH v3 2/2] powerpc/shared: Use static key to detect shared processor

2019-12-05 Thread Waiman Long
On 12/5/19 3:32 AM, Srikar Dronamraju wrote:
> With the static key shared processor available, is_shared_processor()
> can return without having to query the lppaca structure.
>
> Cc: Parth Shah 
> Cc: Ihor Pasichnyk 
> Cc: Juri Lelli 
> Cc: Phil Auld 
> Cc: Waiman Long 
> Signed-off-by: Srikar Dronamraju 
> ---
> Changelog v1 (https://patchwork.ozlabs.org/patch/1204192/) ->v2:
> Now that we no more refer to lppaca, remove the comment.
>
> Changelog v2->v3:
> Code is now under CONFIG_PPC_SPLPAR as it depends on CONFIG_PPC_PSERIES.
> This was suggested by Waiman Long.
>
>  arch/powerpc/include/asm/spinlock.h | 9 ++---
>  1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/spinlock.h 
> b/arch/powerpc/include/asm/spinlock.h
> index de817c25deff..e83d57f27566 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -111,13 +111,8 @@ static inline void splpar_rw_yield(arch_rwlock_t *lock) 
> {};
>  
>  static inline bool is_shared_processor(void)
>  {
> -/*
> - * LPPACA is only available on Pseries so guard anything LPPACA related to
> - * allow other platforms (which include this common header) to compile.
> - */
> -#ifdef CONFIG_PPC_PSERIES
> - return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
> - lppaca_shared_proc(local_paca->lppaca_ptr));
> +#ifdef CONFIG_PPC_SPLPAR
> + return static_branch_unlikely(&shared_processor);
>  #else
>   return false;
>  #endif

Acked-by: Waiman Long 



Re: [PATCH 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt

2019-12-05 Thread Juri Lelli
Hi,

On 04/12/19 19:14, Srikar Dronamraju wrote:
> With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted
> vCPUs"), scheduler avoids preempted vCPUs to schedule tasks on wakeup.
> This leads to wrong choice of CPU, which in-turn leads to larger wakeup
> latencies. Eventually, it leads to performance regression in latency
> sensitive benchmarks like soltp, schbench etc.
> 
> On Powerpc, vcpu_is_preempted only looks at yield_count. If the
> yield_count is odd, the vCPU is assumed to be preempted. However
> yield_count is increased whenever LPAR enters CEDE state. So any CPU
> that has entered CEDE state is assumed to be preempted.
> 
> Even if a vCPU of a dedicated LPAR is preempted/donated, it should have
> right of first use since the LPAR is supposed to own the vCPU.
> 
> On a Power9 System with 32 cores
>  # lscpu
> Architecture:ppc64le
> Byte Order:  Little Endian
> CPU(s):  128
> On-line CPU(s) list: 0-127
> Thread(s) per core:  8
> Core(s) per socket:  1
> Socket(s):   16
> NUMA node(s):2
> Model:   2.2 (pvr 004e 0202)
> Model name:  POWER9 (architected), altivec supported
> Hypervisor vendor:   pHyp
> Virtualization type: para
> L1d cache:   32K
> L1i cache:   32K
> L2 cache:512K
> L3 cache:10240K
> NUMA node0 CPU(s):   0-63
> NUMA node1 CPU(s):   64-127
>  
> 
>   # perf stat -a -r 5 ./schbench
> v5.4  v5.4 + patch
> Latency percentiles (usec)  Latency percentiles (usec)
>   49.th: 47   50.th: 33
>   74.th: 64   75.th: 44
>   89.th: 76   90.th: 50
>   94.th: 83   95.th: 53
>   *98.th: 103 *99.th: 57
>   98.5000th: 2124 99.5000th: 59
>   98.9000th: 7976 99.9000th: 83
>   min=-1, max=10519   min=0, max=117
> Latency percentiles (usec)  Latency percentiles (usec)
>   49.th: 45   50.th: 34
>   74.th: 61   75.th: 45
>   89.th: 70   90.th: 52
>   94.th: 77   95.th: 56
>   *98.th: 504 *99.th: 62
>   98.5000th: 4012 99.5000th: 64
>   98.9000th: 8168 99.9000th: 79
>   min=-1, max=14500   min=0, max=123
> Latency percentiles (usec)  Latency percentiles (usec)
>   49.th: 48   50.th: 35
>   74.th: 65   75.th: 47
>   89.th: 76   90.th: 55
>   94.th: 82   95.th: 59
>   *98.th: 1098*99.th: 67
>   98.5000th: 3988 99.5000th: 71
>   98.9000th: 9360 99.9000th: 98
>   min=-1, max=19283   min=0, max=137
> Latency percentiles (usec)  Latency percentiles (usec)
>   49.th: 46   50.th: 35
>   74.th: 63   75.th: 46
>   89.th: 73   90.th: 53
>   94.th: 78   95.th: 57
>   *98.th: 113 *99.th: 63
>   98.5000th: 2316 99.5000th: 65
>   98.9000th: 7704 99.9000th: 83
>   min=-1, max=17976   min=0, max=139
> Latency percentiles (usec)  Latency percentiles (usec)
>   49.th: 46   50.th: 34
>   74.th: 62   75.th: 46
>   89.th: 73   90.th: 53
>   94.th: 79   95.th: 57
>   *98.th: 97  *99.th: 64
>   98.5000th: 1398 99.5000th: 70
>   98.9000th: 8136 99.9000th: 100
>   min=-1, max=10008   min=0, max=142
> 
> Performance counter stats for 'system wide' (4 runs):
> 
> context-switches   42,604 ( +-  0.87% )   45,397 ( +-  0.25% )
> cpu-migrations  0,195 ( +-  2.70% )  230 ( +-  7.23% )
> page-faults16,783 ( +- 14.87% )   16,781 ( +-  9.77% )
> 
> Waiman Long 

Re: [PATCH] ASoC: fsl_sai: add IRQF_SHARED

2019-12-05 Thread Daniel Baluta
On Thu, Dec 5, 2019 at 11:18 AM Michael Walle  wrote:
>
> Hi Daniel,
>
> Am 2019-12-05 09:43, schrieb Daniel Baluta:
> > On Fri, Nov 29, 2019 at 12:40 AM Michael Walle 
> > wrote:
> >>
> >> The LS1028A SoC uses the same interrupt line for adjacent SAIs. Use
> >> IRQF_SHARED to be able to use these SAIs simultaneously.
> >
> > Hi Michael,
> >
> > Thanks for the patch. We have a similar change inside our internal tree
> > (it is on my long TODO list to upstream :D).
> >
> > We add the shared flag conditionally on a dts property.
> >
> > Do you think it is a good idea to always add shared flag? I'm thinking
> > on SAI IP integrations where the interrupt is edge triggered.
>
> Mhh, I don't really get the point of making the flag conditional. If
> there is only one user, the flag won't hurt, correct?
>
> If there are two users, we need the flag anyway.
>
> > AFAIK edge triggered interrupts do not get along very well
> > with sharing an interrupt line.
>
> So in that case you shouldn't use shared edge triggered interrupts in
> the
> SoC in the first place, I guess.

I think you make a good point. I was thinking that it would hurt the single
user case. But it is fine.

Thanks for the patch.

Acked-by: Daniel Baluta 


Re: [PATCH] ASoC: fsl_sai: add IRQF_SHARED

2019-12-05 Thread Michael Walle

Hi Daniel,

Am 2019-12-05 09:43, schrieb Daniel Baluta:
On Fri, Nov 29, 2019 at 12:40 AM Michael Walle  
wrote:


The LS1028A SoC uses the same interrupt line for adjacent SAIs. Use
IRQF_SHARED to be able to use these SAIs simultaneously.


Hi Michael,

Thanks for the patch. We have a similar change inside our internal tree
(it is on my long TODO list to upstream :D).

We add the shared flag conditionally on a dts property.

Do you think it is a good idea to always add shared flag? I'm thinking
on SAI IP integrations where the interrupt is edge triggered.


Mhh, I don't really get the point of making the flag conditional. If
there is only one user, the flag won't hurt, correct?

If there are two users, we need the flag anyway.


AFAIK edge triggered interrupts do not get along very well
with sharing an interrupt line.


So in that case you shouldn't use shared edge-triggered interrupts in the
SoC in the first place, I guess.

-michael


Re: [PATCH] ASoC: fsl_sai: add IRQF_SHARED

2019-12-05 Thread Daniel Baluta
On Fri, Nov 29, 2019 at 12:40 AM Michael Walle  wrote:
>
> The LS1028A SoC uses the same interrupt line for adjacent SAIs. Use
> IRQF_SHARED to be able to use these SAIs simultaneously.

Hi Michael,

Thanks for the patch. We have a similar change inside our internal tree
(it is on my long TODO list to upstream :D).

We add the shared flag conditionally on a dts property.

Do you think it is a good idea to always add shared flag? I'm thinking
on SAI IP integrations where the interrupt is edge triggered.

AFAIK edge triggered interrupts do not get along very well
with sharing an interrupt line.
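For reference, the change under discussion boils down to passing
IRQF_SHARED when the driver requests its interrupt; a hedged sketch
(the exact call site and arguments in fsl_sai.c may differ):

	ret = devm_request_irq(&pdev->dev, irq, fsl_sai_isr,
			       IRQF_SHARED, np->name, sai);
	if (ret)
		return ret;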


[PATCH v3 2/2] powerpc/shared: Use static key to detect shared processor

2019-12-05 Thread Srikar Dronamraju
With the static key shared processor available, is_shared_processor()
can return without having to query the lppaca structure.
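For context, a minimal sketch of the other half this relies on (the key
being defined and flipped on during pseries setup once the lppaca is
readable; the exact location may differ from the actual series):

	DEFINE_STATIC_KEY_FALSE(shared_processor);
	EXPORT_SYMBOL_GPL(shared_processor);

	/* somewhere in pseries platform setup */
	if (lppaca_shared_proc(get_lppaca()))
		static_branch_enable(&shared_processor);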

Cc: Parth Shah 
Cc: Ihor Pasichnyk 
Cc: Juri Lelli 
Cc: Phil Auld 
Cc: Waiman Long 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1 (https://patchwork.ozlabs.org/patch/1204192/) ->v2:
Now that we no more refer to lppaca, remove the comment.

Changelog v2->v3:
Code is now under CONFIG_PPC_SPLPAR as it depends on CONFIG_PPC_PSERIES.
This was suggested by Waiman Long.

 arch/powerpc/include/asm/spinlock.h | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index de817c25deff..e83d57f27566 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -111,13 +111,8 @@ static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
 
 static inline bool is_shared_processor(void)
 {
-/*
- * LPPACA is only available on Pseries so guard anything LPPACA related to
- * allow other platforms (which include this common header) to compile.
- */
-#ifdef CONFIG_PPC_PSERIES
-   return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
-   lppaca_shared_proc(local_paca->lppaca_ptr));
+#ifdef CONFIG_PPC_SPLPAR
+   return static_branch_unlikely(&shared_processor);
 #else
return false;
 #endif
-- 
2.18.1



[PATCH v3 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt

2019-12-05 Thread Srikar Dronamraju
With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted
vCPUs"), scheduler avoids preempted vCPUs to schedule tasks on wakeup.
This leads to wrong choice of CPU, which in-turn leads to larger wakeup
latencies. Eventually, it leads to performance regression in latency
sensitive benchmarks like soltp, schbench etc.

On Powerpc, vcpu_is_preempted only looks at yield_count. If the
yield_count is odd, the vCPU is assumed to be preempted. However
yield_count is increased whenever LPAR enters CEDE state. So any CPU
that has entered CEDE state is assumed to be preempted.

Even if a vCPU of a dedicated LPAR is preempted/donated, it should have
right of first use since the LPAR is supposed to own the vCPU.
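Roughly, the idea is to bail out early on dedicated-processor LPARs
before consulting yield_count; a sketch of the intended shape (not the
literal diff, which is truncated further below):

	static inline bool vcpu_is_preempted(int cpu)
	{
		/* dedicated LPAR: the vCPUs are owned, never report preempted */
		if (!static_branch_unlikely(&shared_processor))
			return false;

		/* shared LPAR: an odd yield_count means the vCPU is preempted */
		return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
	}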

On a Power9 System with 32 cores
 # lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):   16
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127

  # perf stat -a -r 5 ./schbench
v5.4 v5.4 + patch
Latency percentiles (usec)   Latency percentiles (usec)
50.th: 45   50.th: 39
75.th: 62   75.th: 53
90.th: 71   90.th: 67
95.th: 77   95.th: 76
*99.th: 91  *99.th: 89
99.5000th: 707  99.5000th: 93
99.9000th: 6920 99.9000th: 118
min=0, max=10048min=0, max=211
Latency percentiles (usec)   Latency percentiles (usec)
50.th: 45   50.th: 34
75.th: 61   75.th: 45
90.th: 72   90.th: 53
95.th: 79   95.th: 56
*99.th: 691 *99.th: 61
99.5000th: 3972 99.5000th: 63
99.9000th: 8368 99.9000th: 78
min=0, max=16606min=0, max=228
Latency percentiles (usec)   Latency percentiles (usec)
50.th: 45   50.th: 34
75.th: 61   75.th: 45
90.th: 71   90.th: 53
95.th: 77   95.th: 57
*99.th: 106 *99.th: 63
99.5000th: 2364 99.5000th: 68
99.9000th: 7480 99.9000th: 100
min=0, max=10001min=0, max=134
Latency percentiles (usec)   Latency percentiles (usec)
50.th: 45   50.th: 34
75.th: 62   75.th: 46
90.th: 72   90.th: 53
95.th: 78   95.th: 56
*99.th: 93  *99.th: 61
99.5000th: 108  99.5000th: 64
99.9000th: 6792 99.9000th: 85
min=0, max=17681min=0, max=121
Latency percentiles (usec)   Latency percentiles (usec)
50.th: 46   50.th: 33
75.th: 62   75.th: 44
90.th: 73   90.th: 51
95.th: 79   95.th: 54
*99.th: 113 *99.th: 61
99.5000th: 2724 99.5000th: 64
99.9000th: 6184 99.9000th: 82
min=0, max=9887 min=0, max=121

 Performance counter stats for 'system wide' (5 runs):

context-switches43,373  ( +-  0.40% )   44,597 ( +-  0.55% )
cpu-migrations   1,211  ( +-  5.04% )  220 ( +-  6.23% )
page-faults 15,983  ( +-  5.21% )   15,360 ( +-  3.38% )

Waiman Long suggested using static_keys.

Reported-by: Parth Shah 
Reported-by: Ihor Pasichnyk 
Cc: Parth Shah 
Cc: Ihor Pasichnyk 
Cc: Juri Lelli 
Cc: Phil Auld 
Cc: Waiman Long 
Cc: Gautham R. Shenoy 
Tested-by: Juri Lelli 
Ack-by: Waiman Long 
Reviewed-by: Gautham R. Shenoy 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1 (https://patchwork.ozlabs.org/patch/1204190/) ->v3:
Code is now under CONFIG_PPC_SPLPAR as it depends on CONFIG_PPC_PSERIES.
This was suggested by Waiman Long.

 arch/powerpc/include/asm/spinlock.h | 5 +++--
 

Re: [PATCH v4 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.

2019-12-05 Thread Christoph Hellwig
On Wed, Dec 04, 2019 at 12:42:32PM -0800, Ram Pai wrote:
> > The other approach we could use for that - which would still allow
> > H_PUT_TCE_INDIRECT, would be to allocate the TCE buffer page from the
> > same pool that we use for the bounce buffers.  I assume there must
> > already be some sort of allocator for that?
> 
> The allocator for swiotlb is buried deep in the swiotlb code. It is 
> not exposed to the outside-swiotlb world. Will have to do major surgery
> to expose it.

I don't think it would require all that many changes, but I'd really
hate the layering of calling into it directly.  Do we have a struct
device associated with the iommu that doesn't get iommu translations
itself?  If we do a dma_alloc_coherent() on that, you'll get the
memory pool for free.
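A hedged sketch of that suggestion, assuming such a non-translated
device exists (names are illustrative):

	void *tce_buf;
	dma_addr_t tce_dma;

	/* on a device that bypasses the IOMMU, dma_alloc_coherent() draws
	 * from the same bounce-buffer-capable pool, as suggested above */
	tce_buf = dma_alloc_coherent(untranslated_dev, PAGE_SIZE, &tce_dma,
				     GFP_KERNEL);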