[PATCH v3] powerpc/mm: Remove kvm radix prefetch workaround for Power9 DD2.2
Commit a25bd72badfa ("powerpc/mm/radix: Workaround prefetch issue with
KVM") introduced a number of workarounds, as coming out of a guest with
the MMU enabled would make the CPU start running in hypervisor state
with the PID value from the guest. The CPU will then start prefetching
for the hypervisor with that PID value.

In Power9 DD2.2 the CPU behaviour was modified to fix this: when
accessing Quadrant 0 in hypervisor mode with LPID != 0, prefetching will
not be performed. This means that we can get rid of the workarounds for
Power9 DD2.2 and later revisions. Add a new CPU feature
CPU_FTR_P9_RADIX_PREFETCH_BUG to indicate whether the workarounds are
needed.

Signed-off-by: Jordan Niethe
---
v2: Use a cpu feature instead of open coding the PVR check
v3: Put parentheses around CPU_FTRS_POWER9_DD2_0 value
---
 arch/powerpc/include/asm/cputable.h      |  7 +--
 arch/powerpc/kernel/dt_cpu_ftrs.c        | 13 -
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |  2 ++
 arch/powerpc/mm/book3s64/radix_pgtable.c |  6 +-
 arch/powerpc/mm/book3s64/radix_tlb.c     |  3 +++
 5 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index cf00ff0d121d..40a4d3c6fd99 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -212,6 +212,7 @@ static inline void cpu_feature_keys_init(void) { }
 #define CPU_FTR_P9_TLBIE_STQ_BUG	LONG_ASM_CONST(0x0000400000000000)
 #define CPU_FTR_P9_TIDR			LONG_ASM_CONST(0x0000800000000000)
 #define CPU_FTR_P9_TLBIE_ERAT_BUG	LONG_ASM_CONST(0x0001000000000000)
+#define CPU_FTR_P9_RADIX_PREFETCH_BUG	LONG_ASM_CONST(0x0002000000000000)
 
 #ifndef __ASSEMBLY__
@@ -459,8 +460,10 @@ static inline void cpu_feature_keys_init(void) { }
 	    CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_ARCH_207S | \
 	    CPU_FTR_TM_COMP | CPU_FTR_ARCH_300 | CPU_FTR_PKEY | \
 	    CPU_FTR_P9_TLBIE_STQ_BUG | CPU_FTR_P9_TLBIE_ERAT_BUG | CPU_FTR_P9_TIDR)
-#define CPU_FTRS_POWER9_DD2_0 CPU_FTRS_POWER9
-#define CPU_FTRS_POWER9_DD2_1 (CPU_FTRS_POWER9 | CPU_FTR_POWER9_DD2_1)
+#define CPU_FTRS_POWER9_DD2_0 (CPU_FTRS_POWER9 | CPU_FTR_P9_RADIX_PREFETCH_BUG)
+#define CPU_FTRS_POWER9_DD2_1 (CPU_FTRS_POWER9 | \
+			       CPU_FTR_P9_RADIX_PREFETCH_BUG | \
+			       CPU_FTR_POWER9_DD2_1)
 #define CPU_FTRS_POWER9_DD2_2 (CPU_FTRS_POWER9 | CPU_FTR_POWER9_DD2_1 | \
 			       CPU_FTR_P9_TM_HV_ASSIST | \
 			       CPU_FTR_P9_TM_XER_SO_BUG)
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 180b3a5d1001..182b4047c1ef 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -727,17 +727,20 @@ static __init void cpufeatures_cpu_quirks(void)
 	/*
 	 * Not all quirks can be derived from the cpufeatures device tree.
 	 */
-	if ((version & 0xefffffff) == 0x004e0200)
-		; /* DD2.0 has no feature flag */
-	else if ((version & 0xefffffff) == 0x004e0201)
+	if ((version & 0xefffffff) == 0x004e0200) {
+		/* DD2.0 has no feature flag */
+		cur_cpu_spec->cpu_features |= CPU_FTR_P9_RADIX_PREFETCH_BUG;
+	} else if ((version & 0xefffffff) == 0x004e0201) {
 		cur_cpu_spec->cpu_features |= CPU_FTR_POWER9_DD2_1;
-	else if ((version & 0xefffffff) == 0x004e0202) {
+		cur_cpu_spec->cpu_features |= CPU_FTR_P9_RADIX_PREFETCH_BUG;
+	} else if ((version & 0xefffffff) == 0x004e0202) {
 		cur_cpu_spec->cpu_features |= CPU_FTR_P9_TM_HV_ASSIST;
 		cur_cpu_spec->cpu_features |= CPU_FTR_P9_TM_XER_SO_BUG;
 		cur_cpu_spec->cpu_features |= CPU_FTR_POWER9_DD2_1;
-	} else if ((version & 0xffff0000) == 0x004e0000)
+	} else if ((version & 0xffff0000) == 0x004e0000) {
 		/* DD2.1 and up have DD2_1 */
 		cur_cpu_spec->cpu_features |= CPU_FTR_POWER9_DD2_1;
+	}
 
 	if ((version & 0xffff0000) == 0x004e0000) {
 		cur_cpu_spec->cpu_features &= ~(CPU_FTR_DAWR);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index faebcbb8c4db..72b08bb17200 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1793,6 +1793,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
 	tlbsync
 	ptesync
 
+BEGIN_FTR_SECTION
 	/* Radix: Handle the case where the guest used an illegal PID */
 	LOAD_REG_ADDR(r4, mmu_base_pid)
 	lwz	r3, VCPU_GUEST_PID(r9)
@@ -1822,6 +1823,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
 	addi	r7,r7,0x1000
 	bdnz	1b
 	ptesync
+END_FTR_SECTION_IFSET(CPU_FTR_P9_RADIX_PREFETCH_BUG)
 2:
 #endif /* CONFIG_PPC_RADIX_MMU */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
On 12/5/19 10:35 AM, Sebastian Andrzej Siewior wrote:
> On 2019-12-03 10:56:35 [-0600], Rob Herring wrote:
>>> Another possibility would be to make the cache be dependent
>>> upon not CONFIG_PPC. It might be possible to disable the
>>> cache with a minimal code change.
>>
>> I'd rather not do that.
>>
>> And yes, as mentioned earlier I don't like the complexity. I didn't
>> from the start and I'm still of the opinion we should have a
>> fixed or 1 time sized true cache (i.e. smaller than total # of
>> phandles). That would solve the RT memory allocation and locking issue
>> too.
>>
>> For reference, the performance difference between the current
>> implementation (assuming fixes haven't regressed it) was ~400ms vs. a
>> ~340ms improvement with a 64 entry cache (using a mask, not a hash).
>> IMO, 340ms improvement was good enough.
>
> Okay. So the 814 phandles would result in an array with 1024 slots. That
> would need 4KiB of memory.

Is this amount of memory an issue for this system?

If module support is not configured into the kernel then the cache is
removed and memory freed in a late initcall. I don't know if that helps
your use case or not.

> What about we go back to the fix 64 slots array but with hash32 for the
> lookup? Without the hash we would be 60ms slower during boot (compared
> to now, based on ancient data) but then the hash isn't expensive so we
> end up with better coverage of the memory on systems which don't have a
> plain enumeration of the phandle.

That performance data is specific to one particular system. It does not
generalize to all devicetree based systems. So don't draw too many
conclusions from it. If you want to understand the boot performance
impact for your system, you need to measure the alternatives on your
system.

Is there a memory usage issue for the systems that led to this thread?
Unless there is a documented memory issue, I do not want to expand an
issue with poor cache bucket percent utilization to the other issue of
cache size.

-Frank

>
>> Rob
>
> Sebastian
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
On 12/3/19 10:56 AM, Rob Herring wrote:
> On Mon, Dec 2, 2019 at 10:28 PM Frank Rowand wrote:
>>
>> On 12/2/19 10:12 PM, Michael Ellerman wrote:
>>> Frank Rowand writes:
>>>> On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote:
>>>>> I've been looking at phandle_cache and noticed the following: The raw
>>>>> phandle value as generated by dtc starts at zero and is incremented
>>>>> by one for each phandle entry. The qemu pSeries model is using SLOF
>>>>> (which is probably the same thing as used on real hardware) and this
>>>>> looks like a pointer value for the phandle.
>>>>> With
>>>>>   qemu-system-ppc64le -m 16G -machine pseries -smp 8
>>>>>
>>>>> I got the following output:
>>>>> | entries: 64
>>>>> | phandle 7e732468 slot 28 hash c
>>>>> | phandle 7e732ad0 slot 10 hash 27
>>>>> | phandle 7e732ee8 slot 28 hash 3a
>>>>> | phandle 7e734160 slot 20 hash 36
>>>>> | phandle 7e734318 slot 18 hash 3a
>>>>> | phandle 7e734428 slot 28 hash 33
>>>>> | phandle 7e734538 slot 38 hash 2c
>>>>> | phandle 7e734850 slot 10 hash e
>>>>> | phandle 7e735220 slot 20 hash 2d
>>>>> | phandle 7e735bf0 slot 30 hash d
>>>>> | phandle 7e7365c0 slot 0 hash 2d
>>>>> | phandle 7e736f90 slot 10 hash d
>>>>> | phandle 7e737960 slot 20 hash 2d
>>>>> | phandle 7e738330 slot 30 hash d
>>>>> | phandle 7e738d00 slot 0 hash 2d
>>>>> | phandle 7e739730 slot 30 hash 38
>>>>> | phandle 7e73bd08 slot 8 hash 17
>>>>> | phandle 7e73c2e0 slot 20 hash 32
>>>>> | phandle 7e73c7f8 slot 38 hash 37
>>>>> | phandle 7e782420 slot 20 hash 13
>>>>> | phandle 7e782ed8 slot 18 hash 1b
>>>>> | phandle 7e73ce28 slot 28 hash 39
>>>>> | phandle 7e73d390 slot 10 hash 22
>>>>> | phandle 7e73d9a8 slot 28 hash 1a
>>>>> | phandle 7e73dc28 slot 28 hash 37
>>>>> | phandle 7e73de00 slot 0 hash a
>>>>> | phandle 7e73e028 slot 28 hash 0
>>>>> | phandle 7e7621a8 slot 28 hash 36
>>>>> | phandle 7e73e458 slot 18 hash 1e
>>>>> | phandle 7e73e608 slot 8 hash 1e
>>>>> | phandle 7e740078 slot 38 hash 28
>>>>> | phandle 7e740180 slot 0 hash 1d
>>>>> | phandle 7e740240 slot 0 hash 33
>>>>> | phandle 7e740348 slot 8 hash 29
>>>>> | phandle 7e740410 slot 10 hash 2
>>>>> | phandle 7e740eb0 slot 30 hash 3e
>>>>> | phandle 7e745390 slot 10 hash 33
>>>>> | phandle 7e747b08 slot 8 hash c
>>>>> | phandle 7e748528 slot 28 hash f
>>>>> | phandle 7e74a6e0 slot 20 hash 18
>>>>> | phandle 7e74aab0 slot 30 hash b
>>>>> | phandle 7e74f788 slot 8 hash d
>>>>> | Used entries: 8, hashed: 29
>>>>>
>>>>> So the hash array has 64 entries out of which only 8 are populated.
>>>>> Using hash_32() populates 29 entries.
>>>>> Could someone with real hardware verify this?
>>>>> I'm not sure how important this is performance wise, it looks just
>>>>> like a waste using only 1/8 of the array.
>>>>
>>>> The hash used is based on the assumptions you noted, and as stated in
>>>> the code, that phandle property values are in a contiguous range of
>>>> 1..n (not starting from zero), which is what dtc generates.
>>>>
>>>> We knew that for systems that do not match the assumptions that the
>>>> hash will not be optimal.
>>>
>>> If we're going to have the phandle cache it should at least make some
>>> attempt to work across the systems that Linux supports.
>>>
>>>> Unless there is a serious performance problem for such systems, I do
>>>> not want to make the phandle hash code more complicated to optimize
>>>> for these cases. And the pseries have been performing ok without
>>>> phandle related performance issues that I remember hearing since
>>>> before the cache was added, which could have only helped the
>>>> performance. Yes, if your observations are correct, some memory is
>>>> being wasted, but a 64 entry cache is not very large on a pseries.
>>>
>>> A single line change to use an actual hash function is hardly
>>> complicating it, compared to the amount of code already there.
>>
>> With a dtc generated devicetree, the hash is perfect, with no
>> misses. That single line change then makes the hash bad for
>> dtc generated devicetrees.
>
> To quantify that, I did some tests with the hash algo and it's about a
> 10% collision rate using phandle values of 1-$cache_size. There's
> never more than 2 phandles colliding in a slot. The actual impact
> would be less given 100% thrashing seems unlikely.

Thank you for doing the tests. Actual data helps a lot.

If there is only a 10% collision rate for this case, that does not sound
bad to me. There is the possibility of current or future code resulting
in ping ponging between two phandle values which collide in the cache,
but that should not be the common case.

However, given a choice between two algorithms, one of which guarantees
no thrashing (the current cache algorithm) and one which allows a
pathologic use case which results in thrashing, I prefer the first
algorithm. This may seem theoretical, but I have seen enough
pathological code paths in my years of performance analysis and tuning
to be sensitive to this issue.

>
>> The cache was added to address a
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
On 12/3/19 12:35 PM, Segher Boessenkool wrote:
> Hi!
>
> On Tue, Dec 03, 2019 at 03:03:22PM +1100, Michael Ellerman wrote:
>> Sebastian Andrzej Siewior writes:
>> I've certainly heard it said that on some OF's the phandle was just ==
>> the address of the internal representation, and I guess maybe for SLOF
>> that is true.
>
> It is (or was). In many OFs it is just the effective address of some
> node structure. SLOF runs with translation off normally.
>
>> They seem to vary wildly though, eg. on an Apple G5:
>
> Apple OF runs with translation on usually. IIRC these are effective
> addresses as well.
>
> The OF they have on G5 machines is mostly 32-bit, for compatibility is my
> guess (for userland things dealing with addresses from OF, importantly).
>
>> $ find /proc/device-tree/ -name phandle | xargs lsprop | head -10
>> /proc/device-tree/vsp@0,f900/veo@f918/phandle ff970848
>> /proc/device-tree/vsp@0,f900/phandle ff970360
>> /proc/device-tree/vsp@0,f900/veo@f908/phandle ff970730
>> /proc/device-tree/nvram@0,fff04000/phandle ff967fb8
>> /proc/device-tree/xmodem/phandle ff9655e8
>> /proc/device-tree/multiboot/phandle ff9504f0
>> /proc/device-tree/diagnostics/phandle ff965550
>> /proc/device-tree/options/phandle ff893cf0
>> /proc/device-tree/openprom/client-services/phandle ff8925b8
>> /proc/device-tree/openprom/phandle ff892458
>>
>> That machine does not have enough RAM for those to be 32-bit real
>> addresses. I think Apple OF is running in virtual mode though (?), so
>> maybe they are pointers?
>
> Yes, I think the default is to have 8MB ram at the top of 4GB (which is
> the physical address of the bootrom, btw) for OF.
>
>> And on an IBM pseries machine they're a bit all over the place:
>>
>> /proc/device-tree/cpus/PowerPC,POWER8@40/ibm,phandle 1040
>> /proc/device-tree/cpus/l2-cache@2005/ibm,phandle 2005
>> /proc/device-tree/cpus/PowerPC,POWER8@30/ibm,phandle 1030
>> /proc/device-tree/cpus/PowerPC,POWER8@20/ibm,phandle 1020
>> /proc/device-tree/cpus/PowerPC,POWER8@10/ibm,phandle 1010
>> /proc/device-tree/cpus/l2-cache@2003/ibm,phandle 2003
>> /proc/device-tree/cpus/l2-cache@200a/ibm,phandle 200a
>> /proc/device-tree/cpus/l3-cache@3108/ibm,phandle 3108
>> /proc/device-tree/cpus/l2-cache@2001/ibm,phandle 2001
>> /proc/device-tree/cpus/l3-cache@3106/ibm,phandle 3106
>> /proc/device-tree/cpus/ibm,phandle fff8
>> /proc/device-tree/cpus/l3-cache@3104/ibm,phandle 3104
>> /proc/device-tree/cpus/l2-cache@2008/ibm,phandle 2008
>> /proc/device-tree/cpus/l3-cache@3102/ibm,phandle 3102
>> /proc/device-tree/cpus/l2-cache@2006/ibm,phandle 2006
>> /proc/device-tree/cpus/l3-cache@3100/ibm,phandle 3100
>> /proc/device-tree/cpus/PowerPC,POWER8@8/ibm,phandle 1008
>> /proc/device-tree/cpus/l2-cache@2004/ibm,phandle 2004
>> /proc/device-tree/cpus/PowerPC,POWER8@48/ibm,phandle 1048
>> /proc/device-tree/cpus/PowerPC,POWER8@38/ibm,phandle 1038
>> /proc/device-tree/cpus/l2-cache@2002/ibm,phandle 2002
>> /proc/device-tree/cpus/PowerPC,POWER8@28/ibm,phandle 1028
>> /proc/device-tree/cpus/l3-cache@3107/ibm,phandle 3107
>> /proc/device-tree/cpus/PowerPC,POWER8@18/ibm,phandle 1018
>> /proc/device-tree/cpus/l2-cache@2000/ibm,phandle 2000
>> /proc/device-tree/cpus/l3-cache@3105/ibm,phandle 3105
>> /proc/device-tree/cpus/l3-cache@3103/ibm,phandle 3103
>> /proc/device-tree/cpus/l3-cache@310a/ibm,phandle 310a
>> /proc/device-tree/cpus/PowerPC,POWER8@0/ibm,phandle 1000
>> /proc/device-tree/cpus/l2-cache@2007/ibm,phandle 2007
>> /proc/device-tree/cpus/l3-cache@3101/ibm,phandle 3101
>> /proc/device-tree/pci@800201b/ibm,phandle 201b
>
> Some (the 1000) look like addresses as well.
>
>>> So the hash array has 64 entries out which only 8 are populated. Using
>>> hash_32() populates 29 entries.
>
>> On the G5 it's similarly inefficient:
>> [0.007379] OF: of_populate_phandle_cache(242) Used entries: 31, hashed: 111
>
>> And some output from a "real" pseries machine (IBM OF), which is
>> slightly better:
>> [0.129467] OF: of_populate_phandle_cache(242) Used entries: 39, hashed: 81
>
>> So yeah using hash_32() is quite a bit better in both cases.
>
> Yup, no surprise there. And hash_32 is very cheap to compute.
>
>> And if I'm reading your patch right it would be a single line change to
>> switch, so that seems like it's worth doing to me.
>
> Agreed!
>
> Btw. Some OFs mangle the phandles some way, to make it easier to catch
> people using it as an address (and similarly, mangle ihandles differently,
> so you catch confusion between ihandles and phandles as well). Like a
> simple xor, with some odd number preferably. You should assume *nothing*
> about phandles, they are opaque identifiers.

For arm32 machines that use dtc to generate the devicetree, which is
Re: [PATCH v5 2/5] powerpc/kprobes: Mark newly allocated probes as RO
Michael Ellerman writes:
> Russell Currey writes:
>> With CONFIG_STRICT_KERNEL_RWX=y and CONFIG_KPROBES=y, there will be one
>> W+X page at boot by default. This can be tested with
>> CONFIG_PPC_PTDUMP=y and CONFIG_PPC_DEBUG_WX=y set, and checking the
>> kernel log during boot.
>>
>> powerpc doesn't implement its own alloc() for kprobes like other
>> architectures do, but we couldn't immediately mark RO anyway since we do
>> a memcpy to the page we allocate later. After that, nothing should be
>> allowed to modify the page, and write permissions are removed well
>> before the kprobe is armed.
>>
>> Thus mark newly allocated probes as read-only once it's safe to do so.
>>
>> Signed-off-by: Russell Currey
>> ---
>>  arch/powerpc/kernel/kprobes.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
>> index 2d27ec4feee4..2610496de7c7 100644
>> --- a/arch/powerpc/kernel/kprobes.c
>> +++ b/arch/powerpc/kernel/kprobes.c
>> @@ -24,6 +24,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>
>>  DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;
>>  DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
>> @@ -131,6 +132,8 @@ int arch_prepare_kprobe(struct kprobe *p)
>>  		(unsigned long)p->ainsn.insn + sizeof(kprobe_opcode_t));
>>  	}
>>
>> +	set_memory_ro((unsigned long)p->ainsn.insn, 1);
>> +
>
> That comes from:
> 	p->ainsn.insn = get_insn_slot();
>
> Which ends up in __get_insn_slot() I think. And that looks very much
> like it's going to hand out multiple slots per page, which isn't going
> to work because you've just marked the whole page RO.
>
> So I would expect this to crash on the 2nd kprobe that's installed. Have
> you tested it somehow?

I'm not sure if this is the issue I was talking about, but it doesn't
survive ftracetest:

[ 1139.576047] ------------[ cut here ]------------
[ 1139.576322] kernel BUG at mm/memory.c:2036!
cpu 0x1f: Vector: 700 (Program Check) at [c01fd6c675d0]
    pc: c035d018: apply_to_page_range+0x318/0x610
    lr: c00900bc: change_memory_attr+0x4c/0x70
    sp: c01fd6c67860
   msr: 90029033
  current = 0xc01fa4a47880
  paca    = 0xc01e5c80   irqmask: 0x03   irq_happened: 0x01
    pid   = 7168, comm = ftracetest
kernel BUG at mm/memory.c:2036!
Linux version 5.4.0-gcc-8.2.0-11694-gf1f9aa266811 (mich...@raptor-2.ozlabs.ibm.com) (gcc version 8.2.0 (crosstool-NG 1.24.0-rc1.16-9627a04)) #1384 SMP Thu Dec 5 22:11:09 AEDT 2019
enter ? for help
[c01fd6c67940] c00900bc change_memory_attr+0x4c/0x70
[c01fd6c67970] c0053c48 arch_prepare_kprobe+0xb8/0x120
[c01fd6c679e0] c022f718 register_kprobe+0x608/0x790
[c01fd6c67a40] c022fc50 register_kretprobe+0x230/0x350
[c01fd6c67a80] c02849b4 __register_trace_kprobe+0xf4/0x1a0
[c01fd6c67af0] c0285b18 trace_kprobe_create+0x738/0xf70
[c01fd6c67c30] c0286378 create_or_delete_trace_kprobe+0x28/0x70
[c01fd6c67c50] c025f024 trace_run_command+0xc4/0xe0
[c01fd6c67ca0] c025f128 trace_parse_run_command+0xe8/0x230
[c01fd6c67d40] c02845d0 probes_write+0x20/0x40
[c01fd6c67d60] c03eef4c __vfs_write+0x3c/0x70
[c01fd6c67d80] c03f26a0 vfs_write+0xd0/0x200
[c01fd6c67dd0] c03f2a3c ksys_write+0x7c/0x140
[c01fd6c67e20] c000b9e0 system_call+0x5c/0x68
--- Exception: c01 (System Call) at 7fff8f06e420
SP (793d6830) is in userspace
1f:mon> client_loop: send disconnect: Broken pipe

Sorry I didn't get any more info on the crash, I lost the console and
then some CI bot stole the machine 8)

You should be able to reproduce just by running ftracetest.

cheers
Re: [PATCH 01/14] powerpc/vas: Describe vas-port and interrupts properties
On Tue, Nov 26, 2019 at 05:02:03PM -0800, Haren Myneni wrote:
> [PATCH 01/14] powerpc/vas: Describe vas-port and interrupts properties

Something wrong here with the subject in the body.

>
> Signed-off-by: Haren Myneni
> ---
>  Documentation/devicetree/bindings/powerpc/ibm,vas.txt | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/Documentation/devicetree/bindings/powerpc/ibm,vas.txt b/Documentation/devicetree/bindings/powerpc/ibm,vas.txt
> index bf11d2f..12de08b 100644
> --- a/Documentation/devicetree/bindings/powerpc/ibm,vas.txt
> +++ b/Documentation/devicetree/bindings/powerpc/ibm,vas.txt
> @@ -11,6 +11,8 @@ Required properties:
>    window context start and length, OS/User window context start and length,
>    "Paste address" start and length, "Paste window id" start bit and number
>    of bits)
> +- ibm,vas-port : Port address for the interrupt.

What's the size of this property?

> +- interrupts: IRQ value for each VAS instance and level.
>
> Example:
>
> @@ -18,5 +20,8 @@ Example:
> 	compatible = "ibm,vas", "ibm,power9-vas";
> 	reg = <0x60191 0x200 0x60190 0x1 0x8 0x1 0x20 0x10>;
> 	name = "vas";
> +	interrupts = <0x1f 0>;
> +	interrupt-parent = <>;
> 	ibm,vas-id = <0x1>;
> +	ibm,vas-port = <0x601000100>;
> };
> --
> 1.8.3.1
Re: [PATCH v3 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt
On Thu, Dec 05, 2019 at 02:02:17PM +0530, Srikar Dronamraju wrote:
> With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on
> pre-empted vCPUs"), scheduler avoids preempted vCPUs to schedule tasks
> on wakeup. This leads to wrong choice of CPU, which in-turn leads to
> larger wakeup latencies. Eventually, it leads to performance regression
> in latency sensitive benchmarks like soltp, schbench etc.
>
> On Powerpc, vcpu_is_preempted only looks at yield_count. If the
> yield_count is odd, the vCPU is assumed to be preempted. However
> yield_count is increased whenever LPAR enters CEDE state. So any CPU
> that has entered CEDE state is assumed to be preempted.
>
> Even if vCPU of dedicated LPAR is preempted/donated, it should have
> right of first-use since they are supposed to own the vCPU.
>
> On a Power9 System with 32 cores
>
> # lscpu
> Architecture:        ppc64le
> Byte Order:          Little Endian
> CPU(s):              128
> On-line CPU(s) list: 0-127
> Thread(s) per core:  8
> Core(s) per socket:  1
> Socket(s):           16
> NUMA node(s):        2
> Model:               2.2 (pvr 004e 0202)
> Model name:          POWER9 (architected), altivec supported
> Hypervisor vendor:   pHyp
> Virtualization type: para
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            512K
> L3 cache:            10240K
> NUMA node0 CPU(s):   0-63
> NUMA node1 CPU(s):   64-127
>
> # perf stat -a -r 5 ./schbench
>
>          v5.4                          v5.4 + patch
> Latency percentiles (usec)    Latency percentiles (usec)
> 50.0000th: 45                 50.0000th: 39
> 75.0000th: 62                 75.0000th: 53
> 90.0000th: 71                 90.0000th: 67
> 95.0000th: 77                 95.0000th: 76
> *99.0000th: 91                *99.0000th: 89
> 99.5000th: 707                99.5000th: 93
> 99.9000th: 6920               99.9000th: 118
> min=0, max=10048              min=0, max=211
>
> Latency percentiles (usec)    Latency percentiles (usec)
> 50.0000th: 45                 50.0000th: 34
> 75.0000th: 61                 75.0000th: 45
> 90.0000th: 72                 90.0000th: 53
> 95.0000th: 79                 95.0000th: 56
> *99.0000th: 691               *99.0000th: 61
> 99.5000th: 3972               99.5000th: 63
> 99.9000th: 8368               99.9000th: 78
> min=0, max=16606              min=0, max=228
>
> Latency percentiles (usec)    Latency percentiles (usec)
> 50.0000th: 45                 50.0000th: 34
> 75.0000th: 61                 75.0000th: 45
> 90.0000th: 71                 90.0000th: 53
> 95.0000th: 77                 95.0000th: 57
> *99.0000th: 106               *99.0000th: 63
> 99.5000th: 2364               99.5000th: 68
> 99.9000th: 7480               99.9000th: 100
> min=0, max=10001              min=0, max=134
>
> Latency percentiles (usec)    Latency percentiles (usec)
> 50.0000th: 45                 50.0000th: 34
> 75.0000th: 62                 75.0000th: 46
> 90.0000th: 72                 90.0000th: 53
> 95.0000th: 78                 95.0000th: 56
> *99.0000th: 93                *99.0000th: 61
> 99.5000th: 108                99.5000th: 64
> 99.9000th: 6792               99.9000th: 85
> min=0, max=17681              min=0, max=121
>
> Latency percentiles (usec)    Latency percentiles (usec)
> 50.0000th: 46                 50.0000th: 33
> 75.0000th: 62                 75.0000th: 44
> 90.0000th: 73                 90.0000th: 51
> 95.0000th: 79                 95.0000th: 54
> *99.0000th: 113               *99.0000th: 61
> 99.5000th: 2724               99.5000th: 64
> 99.9000th: 6184               99.9000th: 82
> min=0, max=9887               min=0, max=121
>
> Performance counter stats for 'system wide' (5 runs):
>
> context-switches  43,373 ( +- 0.40% )    44,597 ( +- 0.55% )
> cpu-migrations     1,211 ( +- 5.04% )       220 ( +- 6.23% )
> page-faults       15,983 ( +- 5.21% )    15,360 ( +- 3.38% )
>
> Waiman Long suggested using static_keys.
>
> Reported-by: Parth Shah
> Reported-by: Ihor Pasichnyk
> Cc: Parth Shah
> Cc: Ihor Pasichnyk
> Cc: Juri Lelli
> Cc: Phil Auld
> Cc: Waiman Long
> Cc: Gautham R. Shenoy
> Tested-by: Juri Lelli
> Ack-by: Waiman Long
> Reviewed-by: Gautham R. Shenoy
> Signed-off-by: Srikar Dronamraju
> ---
> Changelog v1
Re: [PATCH v3 2/2] powerpc/shared: Use static key to detect shared processor
On Thu, Dec 05, 2019 at 02:02:18PM +0530, Srikar Dronamraju wrote:
> With the static key shared processor available, is_shared_processor()
> can return without having to query the lppaca structure.
>
> Cc: Parth Shah
> Cc: Ihor Pasichnyk
> Cc: Juri Lelli
> Cc: Phil Auld
> Cc: Waiman Long
> Signed-off-by: Srikar Dronamraju
> ---
> Changelog v1 (https://patchwork.ozlabs.org/patch/1204192/) ->v2:
> Now that we no more refer to lppaca, remove the comment.
>
> Changelog v2->v3:
> Code is now under CONFIG_PPC_SPLPAR as it depends on CONFIG_PPC_PSERIES.
> This was suggested by Waiman Long.
>
>  arch/powerpc/include/asm/spinlock.h | 9 ++---
>  1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
> index de817c25deff..e83d57f27566 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -111,13 +111,8 @@ static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
>
>  static inline bool is_shared_processor(void)
>  {
> -/*
> - * LPPACA is only available on Pseries so guard anything LPPACA related to
> - * allow other platforms (which include this common header) to compile.
> - */
> -#ifdef CONFIG_PPC_PSERIES
> -	return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
> -		lppaca_shared_proc(local_paca->lppaca_ptr));
> +#ifdef CONFIG_PPC_SPLPAR
> +	return static_branch_unlikely(&shared_processor);
>  #else
>  	return false;
>  #endif
> --
> 2.18.1
>

Fwiw,

Acked-by: Phil Auld

--
Re: [PATCH v2 16/27] nvdimm/ocxl: Implement the Read Error Log command
Hi Alastair,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v5.4-rc8]
[also build test ERROR on char-misc/char-misc-testing]
[cannot apply to linux-nvdimm/libnvdimm-for-next linus/master next-20191205]
[if your patch is applied to the wrong git tree, please drop us a note to
help improve the system. BTW, we also suggest to use '--base' option to
specify the base tree in git format-patch, please see
https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Alastair-D-Silva/Add-support-for-OpenCAPI-SCM-devices/20191203-120511
base:   af42d3466bdc8f39806b26f593604fdc54140bcb
config: x86_64-randconfig-b001-20191203 (attached as .config)
compiler: gcc-7 (Debian 7.5.0-1) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot

All errors (new ones prefixed by >>):

   In file included from <command-line>:32:0:
   >> ./usr/include/nvdimm/ocxl-scm.h:21:24: error: C++ style comments are not allowed in ISO C90
        __u32 log_identifier; // out
                              ^
   >> ./usr/include/nvdimm/ocxl-scm.h:21:24: error: (this will be reported only once per input file)

---
0-DAY kernel test infrastructure              Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org  Intel Corporation

.config.gz
Description: application/gzip
Re: [PATCH 0/3] pseries: Track and expose idle PURR and SPURR ticks
Hi Nathan,

Nathan Lynch wrote:
> Hi Kamalesh,
>
> Kamalesh Babulal writes:
>> On 12/5/19 3:54 AM, Nathan Lynch wrote:
>>> "Gautham R. Shenoy" writes:
>>>> Tools such as lparstat which are used to compute the utilization need
>>>> to know [S]PURR ticks when the cpu was busy or idle. The [S]PURR
>>>> counters are already exposed through sysfs. We already account for
>>>> PURR ticks when we go to idle so that we can update the VPA area. This
>>>> patchset extends support to account for SPURR ticks when idle, and
>>>> expose both via per-cpu sysfs files.
>>>
>>> Does anything really want to use PURR instead of SPURR? Seems like we
>>> should expose only SPURR idle values if possible.
>>
>> lparstat is one of the consumers of PURR idle metric
>> (https://groups.google.com/forum/#!topic/powerpc-utils-devel/fYRo69xO9r4).
>> Agree, on the argument that system utilization metrics based on SPURR
>> accounting is accurate in comparison to PURR, which isn't proportional
>> to CPU frequency. PURR has been traditionally used to understand the
>> system utilization, whereas SPURR is used for understanding how much
>> capacity is left/exceeding in the system based on the current power
>> saving mode.
>
> I'll phrase my question differently: does SPURR complement or supersede
> PURR? You seem to be saying they serve different purposes. If PURR is
> actually useful rather than vestigial then I have no objection to
> exposing idle_purr.

SPURR complements PURR, so we need both. SPURR/PURR ratio helps provide
an indication of the available headroom in terms of core resources, at
maximum frequency.

- Naveen
Re: [PATCH 2/3] powerpc/sysfs: Show idle_purr and idle_spurr for every CPU
Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy"
>
> On Pseries LPARs, to calculate utilization, we need to know the [S]PURR
> ticks when the CPUs were busy or idle.
>
> The total PURR and SPURR ticks are already exposed via the per-cpu
> sysfs files /sys/devices/system/cpu/cpuX/purr and
> /sys/devices/system/cpu/cpuX/spurr.
>
> This patch adds support for exposing the idle PURR and SPURR ticks via
> /sys/devices/system/cpu/cpuX/idle_purr and
> /sys/devices/system/cpu/cpuX/idle_spurr.
>
> Signed-off-by: Gautham R. Shenoy
> ---
>  arch/powerpc/kernel/sysfs.c | 32
>  1 file changed, 32 insertions(+)
>
> diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> index 80a676d..42ade55 100644
> --- a/arch/powerpc/kernel/sysfs.c
> +++ b/arch/powerpc/kernel/sysfs.c
> @@ -1044,6 +1044,36 @@ static ssize_t show_physical_id(struct device *dev,
>  }
>  static DEVICE_ATTR(physical_id, 0444, show_physical_id, NULL);
>
> +static ssize_t idle_purr_show(struct device *dev,
> +			      struct device_attribute *attr, char *buf)
> +{
> +	struct cpu *cpu = container_of(dev, struct cpu, dev);
> +	unsigned int cpuid = cpu->dev.id;
> +	struct lppaca *cpu_lppaca_ptr = paca_ptrs[cpuid]->lppaca_ptr;
> +	u64 idle_purr_cycles = be64_to_cpu(cpu_lppaca_ptr->wait_state_cycles);
> +
> +	return sprintf(buf, "%llx\n", idle_purr_cycles);
> +}
> +static DEVICE_ATTR_RO(idle_purr);
> +
> +DECLARE_PER_CPU(u64, idle_spurr_cycles);
> +static ssize_t idle_spurr_show(struct device *dev,
> +			       struct device_attribute *attr, char *buf)
> +{
> +	struct cpu *cpu = container_of(dev, struct cpu, dev);
> +	unsigned int cpuid = cpu->dev.id;
> +	u64 *idle_spurr_cycles_ptr = per_cpu_ptr(&idle_spurr_cycles, cpuid);

Is it possible for a user to read stale values if a particular cpu is in
an extended cede?

Is it possible to use smp_call_function_single() to force the cpu out of
idle?

- Naveen
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
On 2019-12-03 10:56:35 [-0600], Rob Herring wrote:
>> Another possibility would be to make the cache be dependent
>> upon not CONFIG_PPC. It might be possible to disable the
>> cache with a minimal code change.
>
> I'd rather not do that.
>
> And yes, as mentioned earlier I don't like the complexity. I didn't
> from the start and I'm still of the opinion we should have a
> fixed or 1 time sized true cache (i.e. smaller than total # of
> phandles). That would solve the RT memory allocation and locking issue
> too.
>
> For reference, the performance difference between the current
> implementation (assuming fixes haven't regressed it) was ~400ms vs. a
> ~340ms improvement with a 64 entry cache (using a mask, not a hash).
> IMO, 340ms improvement was good enough.

Okay. So the 814 phandles would result in an array with 1024 slots. That
would need 4KiB of memory.

What about we go back to the fix 64 slots array but with hash32 for the
lookup? Without the hash we would be 60ms slower during boot (compared
to now, based on ancient data) but then the hash isn't expensive so we
end up with better coverage of the memory on systems which don't have a
plain enumeration of the phandle.

> Rob

Sebastian
Re: [PATCH 0/3] pseries: Track and expose idle PURR and SPURR ticks
Hi Kamalesh,

Kamalesh Babulal writes:
> On 12/5/19 3:54 AM, Nathan Lynch wrote:
>> "Gautham R. Shenoy" writes:
>>>
>>> Tools such as lparstat which are used to compute the utilization need
>>> to know [S]PURR ticks when the cpu was busy or idle. The [S]PURR
>>> counters are already exposed through sysfs. We already account for
>>> PURR ticks when we go to idle so that we can update the VPA area. This
>>> patchset extends support to account for SPURR ticks when idle, and
>>> expose both via per-cpu sysfs files.
>>
>> Does anything really want to use PURR instead of SPURR? Seems like we
>> should expose only SPURR idle values if possible.
>>
> lparstat is one of the consumers of PURR idle metric
> (https://groups.google.com/forum/#!topic/powerpc-utils-devel/fYRo69xO9r4).
> Agree, on the argument that system utilization metrics based on SPURR
> accounting is accurate in comparison to PURR, which isn't proportional to
> CPU frequency. PURR has been traditionally used to understand the system
> utilization, whereas SPURR is used for understanding how much capacity is
> left/exceeding in the system based on the current power saving mode.

I'll phrase my question differently: does SPURR complement or supersede
PURR? You seem to be saying they serve different purposes. If PURR is
actually useful rather than vestigial then I have no objection to
exposing idle_purr.
Re: [PATCH 0/3] pseries: Track and expose idle PURR and SPURR ticks
On 12/5/19 3:54 AM, Nathan Lynch wrote:
> "Gautham R. Shenoy" writes:
>> From: "Gautham R. Shenoy"
>>
>> On PSeries LPARs, data center planners desire a more accurate
>> view of system utilization per resource such as CPU to plan the system
>> capacity requirements better. Such accuracy can be obtained by reading
>> PURR/SPURR registers for CPU resource utilization.
>>
>> Tools such as lparstat which are used to compute the utilization need
>> to know [S]PURR ticks when the cpu was busy or idle. The [S]PURR
>> counters are already exposed through sysfs. We already account for
>> PURR ticks when we go to idle so that we can update the VPA area. This
>> patchset extends support to account for SPURR ticks when idle, and
>> expose both via per-cpu sysfs files.
>
> Does anything really want to use PURR instead of SPURR? Seems like we
> should expose only SPURR idle values if possible.

lparstat is one of the consumers of the PURR idle metric
(https://groups.google.com/forum/#!topic/powerpc-utils-devel/fYRo69xO9r4).
Agreed on the argument that system utilization metrics based on SPURR
accounting are accurate in comparison to PURR, which isn't proportional to
CPU frequency. PURR has traditionally been used to understand system
utilization, whereas SPURR is used for understanding how much capacity is
left/exceeded in the system based on the current power saving mode.

-- 
Kamalesh
Re: [PATCH v3 2/2] powerpc/shared: Use static key to detect shared processor
On 12/5/19 3:32 AM, Srikar Dronamraju wrote:
> With the static key shared processor available, is_shared_processor()
> can return without having to query the lppaca structure.
>
> Cc: Parth Shah
> Cc: Ihor Pasichnyk
> Cc: Juri Lelli
> Cc: Phil Auld
> Cc: Waiman Long
> Signed-off-by: Srikar Dronamraju
> ---
> Changelog v1 (https://patchwork.ozlabs.org/patch/1204192/) -> v2:
> Now that we no more refer to lppaca, remove the comment.
>
> Changelog v2 -> v3:
> Code is now under CONFIG_PPC_SPLPAR as it depends on CONFIG_PPC_PSERIES.
> This was suggested by Waiman Long.
>
>  arch/powerpc/include/asm/spinlock.h | 9 ++---
>  1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
> index de817c25deff..e83d57f27566 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -111,13 +111,8 @@ static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
>
>  static inline bool is_shared_processor(void)
>  {
> -/*
> - * LPPACA is only available on Pseries so guard anything LPPACA related to
> - * allow other platforms (which include this common header) to compile.
> - */
> -#ifdef CONFIG_PPC_PSERIES
> -	return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
> -		lppaca_shared_proc(local_paca->lppaca_ptr));
> +#ifdef CONFIG_PPC_SPLPAR
> +	return static_branch_unlikely(&shared_processor);
>  #else
>  	return false;
>  #endif

Acked-by: Waiman Long
Re: [PATCH 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt
Hi,

On 04/12/19 19:14, Srikar Dronamraju wrote:
> With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted
> vCPUs"), the scheduler avoids preempted vCPUs to schedule tasks on wakeup.
> This leads to wrong choice of CPU, which in turn leads to larger wakeup
> latencies. Eventually, it leads to performance regression in latency
> sensitive benchmarks like soltp, schbench etc.
>
> On Powerpc, vcpu_is_preempted only looks at yield_count. If the
> yield_count is odd, the vCPU is assumed to be preempted. However
> yield_count is increased whenever the LPAR enters CEDE state. So any CPU
> that has entered CEDE state is assumed to be preempted.
>
> Even if the vCPU of a dedicated LPAR is preempted/donated, it should have
> right of first use since it is supposed to own the vCPU.
>
> On a Power9 System with 32 cores
> # lscpu
> Architecture:        ppc64le
> Byte Order:          Little Endian
> CPU(s):              128
> On-line CPU(s) list: 0-127
> Thread(s) per core:  8
> Core(s) per socket:  1
> Socket(s):           16
> NUMA node(s):        2
> Model:               2.2 (pvr 004e 0202)
> Model name:          POWER9 (architected), altivec supported
> Hypervisor vendor:   pHyp
> Virtualization type: para
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            512K
> L3 cache:            10240K
> NUMA node0 CPU(s):   0-63
> NUMA node1 CPU(s):   64-127
>
> # perf stat -a -r 5 ./schbench
> v5.4                             v5.4 + patch
> Latency percentiles (usec)       Latency percentiles (usec)
>       49.th: 47                        50.th: 33
>       74.th: 64                        75.th: 44
>       89.th: 76                        90.th: 50
>       94.th: 83                        95.th: 53
>       *98.th: 103                      *99.th: 57
>       98.5000th: 2124                  99.5000th: 59
>       98.9000th: 7976                  99.9000th: 83
>       min=-1, max=10519                min=0, max=117
> Latency percentiles (usec)       Latency percentiles (usec)
>       49.th: 45                        50.th: 34
>       74.th: 61                        75.th: 45
>       89.th: 70                        90.th: 52
>       94.th: 77                        95.th: 56
>       *98.th: 504                      *99.th: 62
>       98.5000th: 4012                  99.5000th: 64
>       98.9000th: 8168                  99.9000th: 79
>       min=-1, max=14500                min=0, max=123
> Latency percentiles (usec)       Latency percentiles (usec)
>       49.th: 48                        50.th: 35
>       74.th: 65                        75.th: 47
>       89.th: 76                        90.th: 55
>       94.th: 82                        95.th: 59
>       *98.th: 1098                     *99.th: 67
>       98.5000th: 3988                  99.5000th: 71
>       98.9000th: 9360                  99.9000th: 98
>       min=-1, max=19283                min=0, max=137
> Latency percentiles (usec)       Latency percentiles (usec)
>       49.th: 46                        50.th: 35
>       74.th: 63                        75.th: 46
>       89.th: 73                        90.th: 53
>       94.th: 78                        95.th: 57
>       *98.th: 113                      *99.th: 63
>       98.5000th: 2316                  99.5000th: 65
>       98.9000th: 7704                  99.9000th: 83
>       min=-1, max=17976                min=0, max=139
> Latency percentiles (usec)       Latency percentiles (usec)
>       49.th: 46                        50.th: 34
>       74.th: 62                        75.th: 46
>       89.th: 73                        90.th: 53
>       94.th: 79                        95.th: 57
>       *98.th: 97                       *99.th: 64
>       98.5000th: 1398                  99.5000th: 70
>       98.9000th: 8136                  99.9000th: 100
>       min=-1, max=10008                min=0, max=142
>
> Performance counter stats for 'system wide' (4 runs):
>
> context-switches    42,604 ( +- 0.87% )     45,397 ( +- 0.25% )
> cpu-migrations       0,195 ( +- 2.70% )        230 ( +- 7.23% )
> page-faults         16,783 ( +- 14.87% )    16,781 ( +- 9.77% )
>
> Waiman Long suggested using static_keys.
Re: [PATCH] ASoC: fsl_sai: add IRQF_SHARED
On Thu, Dec 5, 2019 at 11:18 AM Michael Walle wrote:
>
> Hi Daniel,
>
> Am 2019-12-05 09:43, schrieb Daniel Baluta:
> > On Fri, Nov 29, 2019 at 12:40 AM Michael Walle wrote:
> >>
> >> The LS1028A SoC uses the same interrupt line for adjacent SAIs. Use
> >> IRQF_SHARED to be able to use these SAIs simultaneously.
> >
> > Hi Michael,
> >
> > Thanks for the patch. We have a similar change inside our internal tree
> > (it is on my long TODO list to upstream :D).
> >
> > We add the shared flag conditionally on a dts property.
> >
> > Do you think it is a good idea to always add the shared flag? I'm
> > thinking of SAI IP integrations where the interrupt is edge triggered.
>
> Mhh, I don't really get the point of making the flag conditional. If
> there is only one user, the flag won't hurt, correct?
>
> If there are two users, we need the flag anyway.
>
> > AFAIK edge triggered interrupts do not get along very well
> > with sharing an interrupt line.
>
> So in that case you shouldn't use shared edge triggered interrupts in the
> SoC in the first place, I guess.

I think you make a good point. I was thinking that it would hurt the
single-user case. But it is fine.

Thanks for the patch.

Acked-by: Daniel Baluta
Re: [PATCH] ASoC: fsl_sai: add IRQF_SHARED
Hi Daniel,

Am 2019-12-05 09:43, schrieb Daniel Baluta:
> On Fri, Nov 29, 2019 at 12:40 AM Michael Walle wrote:
>>
>> The LS1028A SoC uses the same interrupt line for adjacent SAIs. Use
>> IRQF_SHARED to be able to use these SAIs simultaneously.
>
> Hi Michael,
>
> Thanks for the patch. We have a similar change inside our internal tree
> (it is on my long TODO list to upstream :D).
>
> We add the shared flag conditionally on a dts property.
>
> Do you think it is a good idea to always add the shared flag? I'm
> thinking of SAI IP integrations where the interrupt is edge triggered.

Mhh, I don't really get the point of making the flag conditional. If
there is only one user, the flag won't hurt, correct?

If there are two users, we need the flag anyway.

> AFAIK edge triggered interrupts do not get along very well
> with sharing an interrupt line.

So in that case you shouldn't use shared edge triggered interrupts in the
SoC in the first place, I guess.

-michael
Re: [PATCH] ASoC: fsl_sai: add IRQF_SHARED
On Fri, Nov 29, 2019 at 12:40 AM Michael Walle wrote:
>
> The LS1028A SoC uses the same interrupt line for adjacent SAIs. Use
> IRQF_SHARED to be able to use these SAIs simultaneously.

Hi Michael,

Thanks for the patch. We have a similar change inside our internal tree
(it is on my long TODO list to upstream :D).

We add the shared flag conditionally on a dts property.

Do you think it is a good idea to always add the shared flag? I'm thinking
of SAI IP integrations where the interrupt is edge triggered. AFAIK edge
triggered interrupts do not get along very well with sharing an interrupt
line.
[PATCH v3 2/2] powerpc/shared: Use static key to detect shared processor
With the static key shared processor available, is_shared_processor()
can return without having to query the lppaca structure.

Cc: Parth Shah
Cc: Ihor Pasichnyk
Cc: Juri Lelli
Cc: Phil Auld
Cc: Waiman Long
Signed-off-by: Srikar Dronamraju
---
Changelog v1 (https://patchwork.ozlabs.org/patch/1204192/) -> v2:
Now that we no more refer to lppaca, remove the comment.

Changelog v2 -> v3:
Code is now under CONFIG_PPC_SPLPAR as it depends on CONFIG_PPC_PSERIES.
This was suggested by Waiman Long.

 arch/powerpc/include/asm/spinlock.h | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index de817c25deff..e83d57f27566 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -111,13 +111,8 @@ static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
 
 static inline bool is_shared_processor(void)
 {
-/*
- * LPPACA is only available on Pseries so guard anything LPPACA related to
- * allow other platforms (which include this common header) to compile.
- */
-#ifdef CONFIG_PPC_PSERIES
-	return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
-		lppaca_shared_proc(local_paca->lppaca_ptr));
+#ifdef CONFIG_PPC_SPLPAR
+	return static_branch_unlikely(&shared_processor);
 #else
 	return false;
 #endif
-- 
2.18.1
[PATCH v3 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt
With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted
vCPUs"), the scheduler avoids preempted vCPUs to schedule tasks on wakeup.
This leads to wrong choice of CPU, which in turn leads to larger wakeup
latencies. Eventually, it leads to performance regression in latency
sensitive benchmarks like soltp, schbench etc.

On Powerpc, vcpu_is_preempted only looks at yield_count. If the
yield_count is odd, the vCPU is assumed to be preempted. However
yield_count is increased whenever the LPAR enters CEDE state. So any CPU
that has entered CEDE state is assumed to be preempted.

Even if the vCPU of a dedicated LPAR is preempted/donated, it should have
right of first use since it is supposed to own the vCPU.

On a Power9 System with 32 cores
 # lscpu
 Architecture:        ppc64le
 Byte Order:          Little Endian
 CPU(s):              128
 On-line CPU(s) list: 0-127
 Thread(s) per core:  8
 Core(s) per socket:  1
 Socket(s):           16
 NUMA node(s):        2
 Model:               2.2 (pvr 004e 0202)
 Model name:          POWER9 (architected), altivec supported
 Hypervisor vendor:   pHyp
 Virtualization type: para
 L1d cache:           32K
 L1i cache:           32K
 L2 cache:            512K
 L3 cache:            10240K
 NUMA node0 CPU(s):   0-63
 NUMA node1 CPU(s):   64-127

 # perf stat -a -r 5 ./schbench
 v5.4                             v5.4 + patch
 Latency percentiles (usec)       Latency percentiles (usec)
      50.0000th: 45                    50.0000th: 39
      75.0000th: 62                    75.0000th: 53
      90.0000th: 71                    90.0000th: 67
      95.0000th: 77                    95.0000th: 76
      *99.0000th: 91                   *99.0000th: 89
      99.5000th: 707                   99.5000th: 93
      99.9000th: 6920                  99.9000th: 118
      min=0, max=10048                 min=0, max=211
 Latency percentiles (usec)       Latency percentiles (usec)
      50.0000th: 45                    50.0000th: 34
      75.0000th: 61                    75.0000th: 45
      90.0000th: 72                    90.0000th: 53
      95.0000th: 79                    95.0000th: 56
      *99.0000th: 691                  *99.0000th: 61
      99.5000th: 3972                  99.5000th: 63
      99.9000th: 8368                  99.9000th: 78
      min=0, max=16606                 min=0, max=228
 Latency percentiles (usec)       Latency percentiles (usec)
      50.0000th: 45                    50.0000th: 34
      75.0000th: 61                    75.0000th: 45
      90.0000th: 71                    90.0000th: 53
      95.0000th: 77                    95.0000th: 57
      *99.0000th: 106                  *99.0000th: 63
      99.5000th: 2364                  99.5000th: 68
      99.9000th: 7480                  99.9000th: 100
      min=0, max=10001                 min=0, max=134
 Latency percentiles (usec)       Latency percentiles (usec)
      50.0000th: 45                    50.0000th: 34
      75.0000th: 62                    75.0000th: 46
      90.0000th: 72                    90.0000th: 53
      95.0000th: 78                    95.0000th: 56
      *99.0000th: 93                   *99.0000th: 61
      99.5000th: 108                   99.5000th: 64
      99.9000th: 6792                  99.9000th: 85
      min=0, max=17681                 min=0, max=121
 Latency percentiles (usec)       Latency percentiles (usec)
      50.0000th: 46                    50.0000th: 33
      75.0000th: 62                    75.0000th: 44
      90.0000th: 73                    90.0000th: 51
      95.0000th: 79                    95.0000th: 54
      *99.0000th: 113                  *99.0000th: 61
      99.5000th: 2724                  99.5000th: 64
      99.9000th: 6184                  99.9000th: 82
      min=0, max=9887                  min=0, max=121

 Performance counter stats for 'system wide' (5 runs):

 context-switches    43,373 ( +- 0.40% )    44,597 ( +- 0.55% )
 cpu-migrations       1,211 ( +- 5.04% )       220 ( +- 6.23% )
 page-faults         15,983 ( +- 5.21% )    15,360 ( +- 3.38% )

Waiman Long suggested using static_keys.

Reported-by: Parth Shah
Reported-by: Ihor Pasichnyk
Cc: Parth Shah
Cc: Ihor Pasichnyk
Cc: Juri Lelli
Cc: Phil Auld
Cc: Waiman Long
Cc: Gautham R. Shenoy
Tested-by: Juri Lelli
Acked-by: Waiman Long
Reviewed-by: Gautham R. Shenoy
Signed-off-by: Srikar Dronamraju
---
Changelog v1 (https://patchwork.ozlabs.org/patch/1204190/) -> v3:
Code is now under CONFIG_PPC_SPLPAR as it depends on CONFIG_PPC_PSERIES.
This was suggested by Waiman Long.

 arch/powerpc/include/asm/spinlock.h | 5 +++--
Re: [PATCH v4 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.
On Wed, Dec 04, 2019 at 12:42:32PM -0800, Ram Pai wrote:
> > The other approach we could use for that - which would still allow
> > H_PUT_TCE_INDIRECT, would be to allocate the TCE buffer page from the
> > same pool that we use for the bounce buffers. I assume there must
> > already be some sort of allocator for that?
>
> The allocator for swiotlb is buried deep in the swiotlb code. It is
> not exposed to the outside-swiotlb world. We will have to do major
> surgery to expose it.

I don't think it would require all that many changes, but I'd really hate
the layering of calling into it directly. Do we have a struct device
associated with the iommu that doesn't get iommu translations itself? If
we do a dma_alloc_coherent on that, we'll get the memory pool for free.