Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information
On 3/1/20 11:23 PM, Ravi Bangoria wrote:
> Most modern microprocessors employ complex instruction execution
> pipelines such that many instructions can be 'in flight' at any
> given point in time. Various factors affect this pipeline and
> hazards are the primary among them. Different types of hazards
> exist - Data hazards, Structural hazards and Control hazards.
> Data hazard is the case where data dependencies exist between
> instructions in different stages in the pipeline. Structural
> hazard is when the same processor hardware is needed by more
> than one instruction in flight at the same time. Control hazards
> are more the branch misprediction kinds.
>
> Information about these hazards are critical towards analyzing
> performance issues and also to tune software to overcome such
> issues. Modern processors export such hazard data in Performance
> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
> AMD[3] provides similar information.
>
> Implementation detail:
>
> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
> If it's set, kernel converts arch specific hazard information
> into generic format:
>
>   struct perf_pipeline_haz_data {
>          /* Instruction/Opcode type: Load, Store, Branch */
>          __u8    itype;

At the risk of bike-shedding (in an RFC, no less), "itype" doesn't convey
enough meaning to me. "inst_type"? I see in 03/11, you use "perf_inst_type".

>          /* Instruction Cache source */
>          __u8    icache;

Possibly same here, and you use "perf_inst_cache" in 03/11.

>          /* Instruction suffered hazard in pipeline stage */
>          __u8    hazard_stage;
>          /* Hazard reason */
>          __u8    hazard_reason;
>          /* Instruction suffered stall in pipeline stage */
>          __u8    stall_stage;
>          /* Stall reason */
>          __u8    stall_reason;
>          __u16   pad;
>   };
>
> ... which can be read by user from mmap() ring buffer.
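A minimal user-space sketch of how a consumer might decode one such record from the mmap() ring buffer. The struct layout is from the RFC cover letter; the enum values and function names here are hypothetical, since the real encodings live in the arch patches (03/11):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Layout as proposed in the RFC cover letter (8 bytes total). */
struct perf_pipeline_haz_data {
	uint8_t  itype;         /* instruction/opcode type: load, store, branch */
	uint8_t  icache;        /* instruction cache source */
	uint8_t  hazard_stage;  /* pipeline stage that suffered the hazard */
	uint8_t  hazard_reason;
	uint8_t  stall_stage;   /* pipeline stage that suffered the stall */
	uint8_t  stall_reason;
	uint16_t pad;
};

/* Hypothetical encodings -- the real ones come from the arch patches. */
enum { HAZ_ITYPE_OTHER = 0, HAZ_ITYPE_LOAD, HAZ_ITYPE_STORE, HAZ_ITYPE_BRANCH };

static const char *haz_itype_str(uint8_t itype)
{
	switch (itype) {
	case HAZ_ITYPE_LOAD:   return "Load";
	case HAZ_ITYPE_STORE:  return "Store";
	case HAZ_ITYPE_BRANCH: return "Branch";
	default:               return "-";
	}
}

/* Pull one record out of a raw sample buffer. */
static struct perf_pipeline_haz_data decode_haz(const void *raw)
{
	struct perf_pipeline_haz_data d;
	memcpy(&d, raw, sizeof(d));
	return d;
}
```

The __u16 pad keeps the record a power-of-two 8 bytes, which is convenient for ring-buffer parsing.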
> With this approach, sample perf report in hazard mode looks like (On IBM
> PowerPC):
>
>   # ./perf record --hazard ./ebizzy
>   # ./perf report --hazard
>   Overhead  Symbol          Shared  Instruction Type  Hazard Stage  Hazard Reason         Stall Stage  Stall Reason  ICache access
>   36.58%    [.] thread_run  ebizzy  Load              LSU           Mispredict            LSU          Load fin      L1 hit
>    9.46%    [.] thread_run  ebizzy  Load              LSU           Mispredict            LSU          Dcache_miss   L1 hit
>    1.76%    [.] thread_run  ebizzy  Fixed point       -             -                     -            -             L1 hit
>    1.31%    [.] thread_run  ebizzy  Load              LSU           ERAT Miss             LSU          Load fin      L1 hit
>    1.27%    [.] thread_run  ebizzy  Load              LSU           Mispredict            -            -             L1 hit
>    1.16%    [.] thread_run  ebizzy  Fixed point       -             -                     FXU          Fixed cycle   L1 hit
>    0.50%    [.] thread_run  ebizzy  Fixed point       ISU           Source Unavailable    FXU          Fixed cycle   L1 hit
>    0.30%    [.] thread_run  ebizzy  Load              LSU           LMQ Full, DERAT Miss  LSU          Load fin      L1 hit
>    0.24%    [.] thread_run  ebizzy  Load              LSU           ERAT Miss             -            -             L1 hit
>    0.08%    [.] thread_run  ebizzy  -                 -             -                     BRU          Fixed cycle   L1 hit
>    0.05%    [.] thread_run  ebizzy  Branch            -             -                     BRU          Fixed cycle   L1 hit
>    0.04%    [.] thread_run  ebizzy  Fixed point       ISU           Source Unavailable    -            -             L1 hit

How are these to be interpreted? This is great information, but is it
possible to make it more readable for non-experts?

If each of these map 1:1 with hardware events, should you emit the name of
the event here, so that can be used to look up further information? For
example, does the first line map to PM_CMPLU_STALL_LSU_FIN?

What was "Mispredict[ed]"? (Is it different from a branch misprediction?)
And how does this relate to "L1 hit"?

Can we emit "Load finish" instead of "Load fin" for easier reading? 03/11
also has "Marked fin before NTC".

Nit: why does "Dcache_miss" have an underscore and none of the others?
> Also perf annotate with hazard data:
>
>        │ static int
>        │ compare(const void *p1, const void *p2)
>        │ {
>  33.23 │   std     r31,-8(r1)
>        │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>        │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>        │       {haz_stage: LSU,
Re: [RFC] powerpc/pseries/hcall: remove the save/restore of CR
On 9/26/19 10:48 PM, Lijun Pan wrote: > According to the PAPR, hcalls should not modify the Condition > Register fields, hence save/restore the CR is not necessary. Just curious: could you provide a more specific reference? PC
sys_exit: NR 0 (was "sys_exit: NR -1")
I'm seeing another asymmetry, this one is on powerpc:
--
  perf 30179 [001] 10374622.667436: raw_syscalls:sys_enter: NR 11 (3fffe16ef55e, [...])
    ls 30179 [002] 10374622.667921: raw_syscalls:sys_exit: NR 0 = 0
--
Syscall "11" is "execve", but is it expected that the sys_exit will report
the ID as 0? Is that because of the nature of "execve", to overlay the
entire contents of the process? This does not seem to occur on x86.

PC
Re: sys_exit: NR -1
On 6/13/19 4:00 PM, Paul Clarke wrote:
> On 6/12/19 1:32 AM, Naveen N. Rao wrote:
>> Paul Clarke wrote:
>>> What are the circumstances in which raw_syscalls:sys_exit reports "-1"
>>> for the syscall ID?
>>>
>>>   perf  5375 [007] 59632.478528: raw_syscalls:sys_enter: NR 1 (3, 9fb888, 8, 2d83740, 1, 7)
>>>   perf  5375 [007] 59632.478532: raw_syscalls:sys_exit: NR 1 = 8
>>>   perf  5375 [007] 59632.478538: raw_syscalls:sys_enter: NR 15 (11, 7ca734b0, 7ca73380, 2d83740, 1, 7)
>>>   perf  5375 [007] 59632.478539: raw_syscalls:sys_exit: NR -1 = 8
>>>   perf  5375 [007] 59632.478543: raw_syscalls:sys_enter: NR 16 (4, 2401, 0, 2d83740, 1, 0)
>>>   perf  5375 [007] 59632.478551: raw_syscalls:sys_exit: NR 16 = 0
>>
>> Which architecture?
>> For powerpc, see:
>>
>> static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
>> {
>> 	/*
>> 	 * Note that we are returning an int here. That means 0xffffffff, ie.
>> 	 * 32-bit negative 1, will be interpreted as -1 on a 64-bit kernel.
>> 	 * This is important for seccomp so that compat tasks can set r0 = -1
>> 	 * to reject the syscall.
>> 	 */
>> 	return TRAP(regs) == 0xc00 ? regs->gpr[0] : -1;
>> }
>
> So, that's intentional? And has some special meaning? (I confess I don't
> understand what the comment is saying exactly.)
>
> Is this documented? Does something depend on this ABI?
>
> To me, it just makes parsing more difficult, both by humans and machines.

I should've noted that the instance I encountered was on x86.

PC
Re: sys_exit: NR -1
On 6/12/19 1:32 AM, Naveen N. Rao wrote:
> Paul Clarke wrote:
>> What are the circumstances in which raw_syscalls:sys_exit reports "-1"
>> for the syscall ID?
>>
>>   perf  5375 [007] 59632.478528: raw_syscalls:sys_enter: NR 1 (3, 9fb888, 8, 2d83740, 1, 7)
>>   perf  5375 [007] 59632.478532: raw_syscalls:sys_exit: NR 1 = 8
>>   perf  5375 [007] 59632.478538: raw_syscalls:sys_enter: NR 15 (11, 7ca734b0, 7ca73380, 2d83740, 1, 7)
>>   perf  5375 [007] 59632.478539: raw_syscalls:sys_exit: NR -1 = 8
>>   perf  5375 [007] 59632.478543: raw_syscalls:sys_enter: NR 16 (4, 2401, 0, 2d83740, 1, 0)
>>   perf  5375 [007] 59632.478551: raw_syscalls:sys_exit: NR 16 = 0
>
> Which architecture?
> For powerpc, see:
>
> static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
> {
> 	/*
> 	 * Note that we are returning an int here. That means 0xffffffff, ie.
> 	 * 32-bit negative 1, will be interpreted as -1 on a 64-bit kernel.
> 	 * This is important for seccomp so that compat tasks can set r0 = -1
> 	 * to reject the syscall.
> 	 */
> 	return TRAP(regs) == 0xc00 ? regs->gpr[0] : -1;
> }

So, that's intentional? And has some special meaning? (I confess I don't
understand what the comment is saying exactly.)

Is this documented? Does something depend on this ABI?

To me, it just makes parsing more difficult, both by humans and machines.

PC
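The powerpc helper quoted above narrows a 64-bit register to an int, so the syscall number goes through a 32-bit truncation before tracing sees it. A small user-space model of that behavior (this is a stand-in, not kernel code; the trap value mirrors the 0xc00 system-call vector in the quoted function):

```c
#include <assert.h>
#include <stdint.h>

#define TRAP_SYSCALL 0xc00  /* powerpc system-call exception vector */

/*
 * Model of powerpc's syscall_get_nr(): if the trap was not a system
 * call, report -1; otherwise narrow r0 to an int, so a compat (32-bit)
 * task that set r0 = 0xffffffff is also seen as -1 -- the seccomp case
 * the kernel comment describes.
 */
static int syscall_get_nr_model(unsigned int trap, uint64_t gpr0)
{
	if (trap != TRAP_SYSCALL)
		return -1;
	return (int)gpr0;  /* keep only the low 32 bits */
}
```

This also shows why a tracepoint fired outside a 0xc00 trap reports NR -1 regardless of what r0 holds.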
[PATCH 4/4] [powerpc] perf vendor events: Add JSON metrics for POWER9
Descriptions of metrics for POWER9 processors can be found in the "POWER9 Performance Monitor Unit User’s Guide", which is currently available on the "IBM Portal for OpenPOWER" (https://www-355.ibm.com/systems/power/openpower/welcome.xhtml) at https://www-355.ibm.com/systems/power/openpower/posting.xhtml?postingId=4948CDE1963C9BCA852582F800718190 This patch is for metric groups: - general and other metrics not in a metric group. Signed-off-by: Paul A. Clarke --- .../arch/powerpc/power9/metrics.json | 368 ++ 1 file changed, 368 insertions(+) diff --git a/tools/perf/pmu-events/arch/powerpc/power9/metrics.json b/tools/perf/pmu-events/arch/powerpc/power9/metrics.json index c39a922aaf84..811c2a8c1c9e 100644 --- a/tools/perf/pmu-events/arch/powerpc/power9/metrics.json +++ b/tools/perf/pmu-events/arch/powerpc/power9/metrics.json @@ -813,6 +813,114 @@ "MetricGroup": "estimated_dcache_miss_cpi", "MetricName": "rmem_cpi_percent" }, +{ +"BriefDescription": "Branch Mispredict flushes per instruction", +"MetricExpr": "PM_FLUSH_MPRED / PM_RUN_INST_CMPL * 100", +"MetricGroup": "general", +"MetricName": "br_mpred_flush_rate_percent" +}, +{ +"BriefDescription": "Cycles per instruction", +"MetricExpr": "PM_CYC / PM_INST_CMPL", +"MetricGroup": "general", +"MetricName": "cpi" +}, +{ +"BriefDescription": "GCT empty cycles", +"MetricExpr": "(PM_FLUSH_DISP / PM_RUN_INST_CMPL) * 100", +"MetricGroup": "general", +"MetricName": "disp_flush_rate_percent" +}, +{ +"BriefDescription": "% DTLB miss rate per inst", +"MetricExpr": "PM_DTLB_MISS / PM_RUN_INST_CMPL *100", +"MetricGroup": "general", +"MetricName": "dtlb_miss_rate_percent" +}, +{ +"BriefDescription": "Flush rate (%)", +"MetricExpr": "PM_FLUSH * 100 / PM_RUN_INST_CMPL", +"MetricGroup": "general", +"MetricName": "flush_rate_percent" +}, +{ +"BriefDescription": "Instructions per cycles", +"MetricExpr": "PM_INST_CMPL / PM_CYC", +"MetricGroup": "general", +"MetricName": "ipc" +}, +{ +"BriefDescription": "% ITLB miss rate per inst", 
+"MetricExpr": "PM_ITLB_MISS / PM_RUN_INST_CMPL *100", +"MetricGroup": "general", +"MetricName": "itlb_miss_rate_percent" +}, +{ +"BriefDescription": "Percentage of L1 load misses per L1 load ref", +"MetricExpr": "PM_LD_MISS_L1 / PM_LD_REF_L1 * 100", +"MetricGroup": "general", +"MetricName": "l1_ld_miss_ratio_percent" +}, +{ +"BriefDescription": "Percentage of L1 store misses per run instruction", +"MetricExpr": "PM_ST_MISS_L1 * 100 / PM_RUN_INST_CMPL", +"MetricGroup": "general", +"MetricName": "l1_st_miss_rate_percent" +}, +{ +"BriefDescription": "Percentage of L1 store misses per L1 store ref", +"MetricExpr": "PM_ST_MISS_L1 / PM_ST_FIN * 100", +"MetricGroup": "general", +"MetricName": "l1_st_miss_ratio_percent" +}, +{ +"BriefDescription": "L2 Instruction Miss Rate (per instruction)(%)", +"MetricExpr": "PM_INST_FROM_L2MISS * 100 / PM_RUN_INST_CMPL", +"MetricGroup": "general", +"MetricName": "l2_inst_miss_rate_percent" +}, +{ +"BriefDescription": "L2 dmand Load Miss Rate (per run instruction)(%)", +"MetricExpr": "PM_DATA_FROM_L2MISS * 100 / PM_RUN_INST_CMPL", +"MetricGroup": "general", +"MetricName": "l2_ld_miss_rate_percent" +}, +{ +"BriefDescription": "L2 PTEG Miss Rate (per run instruction)(%)", +"MetricExpr": "PM_DPTEG_FROM_L2MISS * 100 / PM_RUN_INST_CMPL", +"MetricGroup": "general", +"MetricName": "l2_pteg_miss_rate_percent" +}, +{ +"BriefDescription": "L3 Instruction Miss Rate (per instruction)(%)", +"MetricExpr": "PM_INST_FROM_L3MISS * 100 / PM_RUN_INST_CMPL", +"MetricGroup": "general", +"MetricName": "l3_inst_miss_rate_percent" +}, +{ +"BriefDescription": "L3 demand Load Miss Rate (per run instruction)(%)", +"MetricExpr": "PM_DATA_FROM_L3MISS * 100 / PM_RUN_INST_CMPL", +"MetricGroup": "general", +"MetricName": "l3_ld_miss_rate_percent" +}, +{ +"BriefDescription": "L3 PTEG Miss Rate (per run instruction)(%)", +"MetricExpr": "PM_DPTEG_FROM_L3MISS * 100 / PM_RUN_INST_CMPL", +"MetricGroup": "general", +"MetricName": "l3_pteg_miss_rate_percent" +}, +{ 
+"BriefDescription": "Run cycles per cycle", +"MetricExpr": "PM_RUN_CYC / PM_CYC*100", +"MetricGroup": "general", +"MetricName": "run_cycles_percent" +}, +{ +"BriefDescription": "Instruction dispatch-to-completion ratio", +"MetricExpr": "PM_INST_DISP /
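Each MetricExpr above is plain arithmetic over counter values that perf collects at the end of a run. A sketch of two of the "general" metrics evaluated by hand (the counter values in the test are invented for illustration; the formulas mirror the MetricExpr strings):

```c
#include <assert.h>
#include <stdint.h>

/* "cpi": PM_CYC / PM_INST_CMPL */
static double cpi(uint64_t pm_cyc, uint64_t pm_inst_cmpl)
{
	return (double)pm_cyc / (double)pm_inst_cmpl;
}

/* "dtlb_miss_rate_percent": PM_DTLB_MISS / PM_RUN_INST_CMPL * 100 */
static double dtlb_miss_rate_percent(uint64_t pm_dtlb_miss,
				     uint64_t pm_run_inst_cmpl)
{
	return (double)pm_dtlb_miss / (double)pm_run_inst_cmpl * 100.0;
}
```

perf's metric machinery does the same division after reading the named events, which is why every expression here is normalized by either PM_CYC or a completed-instruction count.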
[PATCH 2/4] [powerpc] perf vendor events: Add JSON metrics for POWER9
Descriptions of metrics for POWER9 processors can be found in the "POWER9 Performance Monitor Unit User’s Guide", which is currently available on the "IBM Portal for OpenPOWER" (https://www-355.ibm.com/systems/power/openpower/welcome.xhtml) at https://www-355.ibm.com/systems/power/openpower/posting.xhtml?postingId=4948CDE1963C9BCA852582F800718190 This patch is for metric groups: - dl1_reloads_percent_per_inst - dl1_reloads_percent_per_ref - instruction_misses_percent_per_inst - l2_stats - l3_stats - pteg_reloads_percent_per_inst - pteg_reloads_percent_per_ref Signed-off-by: Paul A. Clarke --- .../arch/powerpc/power9/metrics.json | 660 ++ 1 file changed, 660 insertions(+) diff --git a/tools/perf/pmu-events/arch/powerpc/power9/metrics.json b/tools/perf/pmu-events/arch/powerpc/power9/metrics.json index cd46ebb8da6a..166f95518c45 100644 --- a/tools/perf/pmu-events/arch/powerpc/power9/metrics.json +++ b/tools/perf/pmu-events/arch/powerpc/power9/metrics.json @@ -484,6 +484,210 @@ "MetricGroup": "cpi_breakdown", "MetricName": "vfxu_stall_cpi" }, +{ +"BriefDescription": "% of DL1 Reloads from Distant L2 or L3 (Modified) per Inst", +"MetricExpr": "PM_DATA_FROM_DL2L3_MOD * 100 / PM_RUN_INST_CMPL", +"MetricGroup": "dl1_reloads_percent_per_inst", +"MetricName": "dl1_reload_from_dl2l3_mod_rate_percent" +}, +{ +"BriefDescription": "% of DL1 Reloads from Distant L2 or L3 (Shared) per Inst", +"MetricExpr": "PM_DATA_FROM_DL2L3_SHR * 100 / PM_RUN_INST_CMPL", +"MetricGroup": "dl1_reloads_percent_per_inst", +"MetricName": "dl1_reload_from_dl2l3_shr_rate_percent" +}, +{ +"BriefDescription": "% of DL1 Reloads from Distant Memory per Inst", +"MetricExpr": "PM_DATA_FROM_DMEM * 100 / PM_RUN_INST_CMPL", +"MetricGroup": "dl1_reloads_percent_per_inst", +"MetricName": "dl1_reload_from_dmem_rate_percent" +}, +{ +"BriefDescription": "% of DL1 reloads from Private L2, other core per Inst", +"MetricExpr": "PM_DATA_FROM_L21_MOD * 100 / PM_RUN_INST_CMPL", +"MetricGroup": 
"dl1_reloads_percent_per_inst", +"MetricName": "dl1_reload_from_l21_mod_rate_percent" +}, +{ +"BriefDescription": "% of DL1 reloads from Private L2, other core per Inst", +"MetricExpr": "PM_DATA_FROM_L21_SHR * 100 / PM_RUN_INST_CMPL", +"MetricGroup": "dl1_reloads_percent_per_inst", +"MetricName": "dl1_reload_from_l21_shr_rate_percent" +}, +{ +"BriefDescription": "% of DL1 reloads from L2 per Inst", +"MetricExpr": "PM_DATA_FROM_L2MISS * 100 / PM_RUN_INST_CMPL", +"MetricGroup": "dl1_reloads_percent_per_inst", +"MetricName": "dl1_reload_from_l2_miss_rate_percent" +}, +{ +"BriefDescription": "% of DL1 reloads from L2 per Inst", +"MetricExpr": "PM_DATA_FROM_L2 * 100 / PM_RUN_INST_CMPL", +"MetricGroup": "dl1_reloads_percent_per_inst", +"MetricName": "dl1_reload_from_l2_rate_percent" +}, +{ +"BriefDescription": "% of DL1 reloads from Private L3 M state, other core per Inst", +"MetricExpr": "PM_DATA_FROM_L31_MOD * 100 / PM_RUN_INST_CMPL", +"MetricGroup": "dl1_reloads_percent_per_inst", +"MetricName": "dl1_reload_from_l31_mod_rate_percent" +}, +{ +"BriefDescription": "% of DL1 reloads from Private L3 S tate, other core per Inst", +"MetricExpr": "PM_DATA_FROM_L31_SHR * 100 / PM_RUN_INST_CMPL", +"MetricGroup": "dl1_reloads_percent_per_inst", +"MetricName": "dl1_reload_from_l31_shr_rate_percent" +}, +{ +"BriefDescription": "% of DL1 Reloads that came from the L3 and were brought into the L3 by a prefetch, per instruction completed", +"MetricExpr": "PM_DATA_FROM_L3_MEPF * 100 / PM_RUN_INST_CMPL", +"MetricGroup": "dl1_reloads_percent_per_inst", +"MetricName": "dl1_reload_from_l3_mepf_rate_percent" +}, +{ +"BriefDescription": "% of DL1 reloads from L3 per Inst", +"MetricExpr": "PM_DATA_FROM_L3MISS * 100 / PM_RUN_INST_CMPL", +"MetricGroup": "dl1_reloads_percent_per_inst", +"MetricName": "dl1_reload_from_l3_miss_rate_percent" +}, +{ +"BriefDescription": "% of DL1 Reloads from L3 per Inst", +"MetricExpr": "PM_DATA_FROM_L3 * 100 / PM_RUN_INST_CMPL", +"MetricGroup": 
"dl1_reloads_percent_per_inst", +"MetricName": "dl1_reload_from_l3_rate_percent" +}, +{ +"BriefDescription": "% of DL1 Reloads from Local Memory per Inst", +"MetricExpr": "PM_DATA_FROM_LMEM * 100 / PM_RUN_INST_CMPL", +"MetricGroup": "dl1_reloads_percent_per_inst", +"MetricName": "dl1_reload_from_lmem_rate_percent" +}, +{ +"BriefDescription": "% of DL1 reloads from Private L3, other core per Inst", +"MetricExpr": "PM_DATA_FROM_RL2L3_MOD *
[PATCH 3/4] [powerpc] perf vendor events: Add JSON metrics for POWER9
Descriptions of metrics for POWER9 processors can be found in the "POWER9 Performance Monitor Unit User’s Guide", which is currently available on the "IBM Portal for OpenPOWER" (https://www-355.ibm.com/systems/power/openpower/welcome.xhtml) at https://www-355.ibm.com/systems/power/openpower/posting.xhtml?postingId=4948CDE1963C9BCA852582F800718190 This patch is for metric groups: - branch_prediction - instruction_stats_percent_per_ref - latency - lsu_rejects - memory - prefetch - translation Plus, some whitespace changes. Signed-off-by: Paul A. Clarke --- .../arch/powerpc/power9/metrics.json | 403 +- 1 file changed, 390 insertions(+), 13 deletions(-) diff --git a/tools/perf/pmu-events/arch/powerpc/power9/metrics.json b/tools/perf/pmu-events/arch/powerpc/power9/metrics.json index 166f95518c45..c39a922aaf84 100644 --- a/tools/perf/pmu-events/arch/powerpc/power9/metrics.json +++ b/tools/perf/pmu-events/arch/powerpc/power9/metrics.json @@ -1,4 +1,39 @@ [ +{ +"MetricExpr": "PM_BR_MPRED_CMPL / PM_BR_PRED * 100", +"MetricGroup": "branch_prediction", +"MetricName": "br_misprediction_percent" +}, +{ +"BriefDescription": "Count cache branch misprediction per instruction", +"MetricExpr": "PM_BR_MPRED_CCACHE / PM_RUN_INST_CMPL * 100", +"MetricGroup": "branch_prediction", +"MetricName": "ccache_mispredict_rate_percent" +}, +{ +"BriefDescription": "Count cache branch misprediction", +"MetricExpr": "PM_BR_MPRED_CCACHE / PM_BR_PRED_CCACHE * 100", +"MetricGroup": "branch_prediction", +"MetricName": "ccache_misprediction_percent" +}, +{ +"BriefDescription": "Link stack branch misprediction", +"MetricExpr": "PM_BR_MPRED_LSTACK / PM_RUN_INST_CMPL * 100", +"MetricGroup": "branch_prediction", +"MetricName": "lstack_mispredict_rate_percent" +}, +{ +"BriefDescription": "Link stack branch misprediction", +"MetricExpr": "PM_BR_MPRED_LSTACK/ PM_BR_PRED_LSTACK * 100", +"MetricGroup": "branch_prediction", +"MetricName": "lstack_misprediction_percent" +}, +{ +"BriefDescription": "% Branches 
Taken", +"MetricExpr": "PM_BR_TAKEN_CMPL * 100 / PM_BRU_FIN", +"MetricGroup": "branch_prediction", +"MetricName": "taken_branches_percent" +}, { "BriefDescription": "Completion stall due to a Branch Unit", "MetricExpr": "PM_CMPLU_STALL_BRU/PM_RUN_INST_CMPL", @@ -881,13 +916,121 @@ "MetricName": "l1_inst_miss_rate_percent" }, { -"BriefDescription": "%L2 Modified CO Cache read Utilization (4 pclks per disp attempt)", +"BriefDescription": "Icache Fetchs per Icache Miss", +"MetricExpr": "(PM_L1_ICACHE_MISS - PM_IC_PREF_WRITE) / PM_L1_ICACHE_MISS", +"MetricGroup": "instruction_stats_percent_per_ref", +"MetricName": "icache_miss_reload" +}, +{ +"BriefDescription": "% of ICache reloads due to prefetch", +"MetricExpr": "PM_IC_PREF_WRITE * 100 / PM_L1_ICACHE_MISS", +"MetricGroup": "instruction_stats_percent_per_ref", +"MetricName": "icache_pref_percent" +}, +{ +"BriefDescription": "% of ICache reloads from Distant L2 or L3 (Modified)", +"MetricExpr": "PM_INST_FROM_DL2L3_MOD * 100 / PM_L1_ICACHE_MISS", +"MetricGroup": "instruction_stats_percent_per_ref", +"MetricName": "inst_from_dl2l3_mod_percent" +}, +{ +"BriefDescription": "% of ICache reloads from Distant L2 or L3 (Shared)", +"MetricExpr": "PM_INST_FROM_DL2L3_SHR * 100 / PM_L1_ICACHE_MISS", +"MetricGroup": "instruction_stats_percent_per_ref", +"MetricName": "inst_from_dl2l3_shr_percent" +}, +{ +"BriefDescription": "% of ICache reloads from Distant L4", +"MetricExpr": "PM_INST_FROM_DL4 * 100 / PM_L1_ICACHE_MISS", +"MetricGroup": "instruction_stats_percent_per_ref", +"MetricName": "inst_from_dl4_percent" +}, +{ +"BriefDescription": "% of ICache reloads from Distant Memory", +"MetricExpr": "PM_INST_FROM_DMEM * 100 / PM_L1_ICACHE_MISS", +"MetricGroup": "instruction_stats_percent_per_ref", +"MetricName": "inst_from_dmem_percent" +}, +{ +"BriefDescription": "% of ICache reloads from Private L2, other core", +"MetricExpr": "PM_INST_FROM_L21_MOD * 100 / PM_L1_ICACHE_MISS", +"MetricGroup": "instruction_stats_percent_per_ref", 
+"MetricName": "inst_from_l21_mod_percent" +}, +{ +"BriefDescription": "% of ICache reloads from Private L2, other core", +"MetricExpr": "PM_INST_FROM_L21_SHR * 100 / PM_L1_ICACHE_MISS", +"MetricGroup": "instruction_stats_percent_per_ref", +"MetricName": "inst_from_l21_shr_percent" +}, +{ +"BriefDescription": "% of ICache reloads from L2", +
[PATCH v2 0/4] [powerpc] perf vendor events: Add JSON metrics for POWER9
[Note this is for POWER*9* and is different content than a previous patchset for POWER*8*.] The patches define metrics and metric groups for computation by "perf" for POWER9 processors. Paul Clarke (4): [powerpc] perf vendor events: Add JSON metrics for POWER9 [powerpc] perf vendor events: Add JSON metrics for POWER9 [powerpc] perf vendor events: Add JSON metrics for POWER9 [powerpc] perf vendor events: Add JSON metrics for POWER9 v2: The content of these patches was sent previously, to LKML and linux-perf-users, but never showed up at linux-perf-users, so I split it into smaller patches and am resending. .../arch/powerpc/power9/metrics.json | 1982 + 1 file changed, 1982 insertions(+) create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/metrics.json -- 2.17.1
[PATCH 1/4] [powerpc] perf vendor events: Add JSON metrics for POWER9
Descriptions of metrics for POWER9 processors can be found in the "POWER9 Performance Monitor Unit User’s Guide", which is currently available on the "IBM Portal for OpenPOWER" (https://www-355.ibm.com/systems/power/openpower/welcome.xhtml) at https://www-355.ibm.com/systems/power/openpower/posting.xhtml?postingId=4948CDE1963C9BCA852582F800718190 This patch is for metric groups: - cpi_breakdown - estimated_dcache_miss_cpi Signed-off-by: Paul A. Clarke --- .../arch/powerpc/power9/metrics.json | 577 ++ 1 file changed, 577 insertions(+) create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/metrics.json diff --git a/tools/perf/pmu-events/arch/powerpc/power9/metrics.json b/tools/perf/pmu-events/arch/powerpc/power9/metrics.json new file mode 100644 index ..cd46ebb8da6a --- /dev/null +++ b/tools/perf/pmu-events/arch/powerpc/power9/metrics.json @@ -0,0 +1,577 @@ +[ +{ +"BriefDescription": "Completion stall due to a Branch Unit", +"MetricExpr": "PM_CMPLU_STALL_BRU/PM_RUN_INST_CMPL", +"MetricGroup": "cpi_breakdown", +"MetricName": "bru_stall_cpi" +}, +{ +"BriefDescription": "Finish stall because the NTF instruction was routed to the crypto execution pipe and was waiting to finish", +"MetricExpr": "PM_CMPLU_STALL_CRYPTO/PM_RUN_INST_CMPL", +"MetricGroup": "cpi_breakdown", +"MetricName": "crypto_stall_cpi" +}, +{ +"BriefDescription": "Finish stall because the NTF instruction was a load that missed the L1 and was waiting for the data to return from the nest", +"MetricExpr": "PM_CMPLU_STALL_DCACHE_MISS/PM_RUN_INST_CMPL", +"MetricGroup": "cpi_breakdown", +"MetricName": "dcache_miss_stall_cpi" +}, +{ +"BriefDescription": "Finish stall because the NTF instruction was a multi-cycle instruction issued to the Decimal Floating Point execution pipe and waiting to finish.", +"MetricExpr": "PM_CMPLU_STALL_DFLONG/PM_RUN_INST_CMPL", +"MetricGroup": "cpi_breakdown", +"MetricName": "dflong_stall_cpi" +}, +{ +"BriefDescription": "Stalls due to short latency decimal floating ops.", 
+"MetricExpr": "(PM_CMPLU_STALL_DFU - PM_CMPLU_STALL_DFLONG)/PM_RUN_INST_CMPL", +"MetricGroup": "cpi_breakdown", +"MetricName": "dfu_other_stall_cpi" +}, +{ +"BriefDescription": "Finish stall because the NTF instruction was issued to the Decimal Floating Point execution pipe and waiting to finish.", +"MetricExpr": "PM_CMPLU_STALL_DFU/PM_RUN_INST_CMPL", +"MetricGroup": "cpi_breakdown", +"MetricName": "dfu_stall_cpi" +}, +{ +"BriefDescription": "Completion stall by Dcache miss which resolved off node memory/cache", +"MetricExpr": "(PM_CMPLU_STALL_DMISS_L3MISS - PM_CMPLU_STALL_DMISS_L21_L31 - PM_CMPLU_STALL_DMISS_LMEM - PM_CMPLU_STALL_DMISS_REMOTE)/PM_RUN_INST_CMPL", +"MetricGroup": "cpi_breakdown", +"MetricName": "dmiss_distant_stall_cpi" +}, +{ +"BriefDescription": "Completion stall by Dcache miss which resolved on chip ( excluding local L2/L3)", +"MetricExpr": "PM_CMPLU_STALL_DMISS_L21_L31/PM_RUN_INST_CMPL", +"MetricGroup": "cpi_breakdown", +"MetricName": "dmiss_l21_l31_stall_cpi" +}, +{ +"BriefDescription": "Completion stall due to cache miss that resolves in the L2 or L3 with a conflict", +"MetricExpr": "PM_CMPLU_STALL_DMISS_L2L3_CONFLICT/PM_RUN_INST_CMPL", +"MetricGroup": "cpi_breakdown", +"MetricName": "dmiss_l2l3_conflict_stall_cpi" +}, +{ +"BriefDescription": "Completion stall due to cache miss that resolves in the L2 or L3 without conflict", +"MetricExpr": "(PM_CMPLU_STALL_DMISS_L2L3 - PM_CMPLU_STALL_DMISS_L2L3_CONFLICT)/PM_RUN_INST_CMPL", +"MetricGroup": "cpi_breakdown", +"MetricName": "dmiss_l2l3_noconflict_stall_cpi" +}, +{ +"BriefDescription": "Completion stall by Dcache miss which resolved in L2/L3", +"MetricExpr": "PM_CMPLU_STALL_DMISS_L2L3/PM_RUN_INST_CMPL", +"MetricGroup": "cpi_breakdown", +"MetricName": "dmiss_l2l3_stall_cpi" +}, +{ +"BriefDescription": "Completion stall due to cache miss resolving missed the L3", +"MetricExpr": "PM_CMPLU_STALL_DMISS_L3MISS/PM_RUN_INST_CMPL", +"MetricGroup": "cpi_breakdown", +"MetricName": "dmiss_l3miss_stall_cpi" 
+}, +{ +"BriefDescription": "Completion stall due to cache miss that resolves in local memory", +"MetricExpr": "PM_CMPLU_STALL_DMISS_LMEM/PM_RUN_INST_CMPL", +"MetricGroup": "cpi_breakdown", +"MetricName": "dmiss_lmem_stall_cpi" +}, +{ +"BriefDescription": "Completion stall by Dcache miss which resolved outside of local memory", +"MetricExpr": "(PM_CMPLU_STALL_DMISS_L3MISS -
Re: [PATCH] selftests/powerpc: Fix strncpy usage
On 06/22/2018 09:43 AM, Breno Leitao wrote:
> If you don't mind, I would solve this problem slightly different, as it
> seems to be more readable.
>
> -	strncpy(prog, argv[0], strlen(argv[0]));
> +	if (strlen(argv[0]) >= LEN_MAX) {
> +		fprintf(stderr, "Very big executable name: %s\n", argv[0]);

"Very big" is an observation. "Too big" indicates a problem better. Or,
more explicitly "Executable name is too long".

PC
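A self-contained sketch of the check-then-copy pattern being proposed, with the wording suggestion above. LEN_MAX and the function name are illustrative stand-ins for the selftest's buffer and code:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LEN_MAX 64  /* illustrative limit, stands in for the selftest's buffer size */

static char prog[LEN_MAX];

/*
 * Reject over-long names up front, then copy with a bound derived from
 * the destination (not the source, which is what the original
 * strncpy(prog, argv[0], strlen(argv[0])) got wrong).
 */
static int set_prog_name(const char *name)
{
	if (strlen(name) >= LEN_MAX) {
		fprintf(stderr, "Executable name is too long: %s\n", name);
		return -1;
	}
	strncpy(prog, name, LEN_MAX - 1);
	prog[LEN_MAX - 1] = '\0';  /* strncpy does not always terminate */
	return 0;
}
```

Sizing the copy by strlen(argv[0]) defeats the whole point of strncpy; the bound must come from the destination buffer.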
Re: [PATCH] perf/python: Add ppc64le to audit uname list
Coincidentally, I just saw a developer stumble upon this within the last
week. Could this be pushed upstream soon?

PC

On 08/02/2017 10:06 AM, Arnaldo Carvalho de Melo wrote:
> Em Wed, Aug 02, 2017 at 08:12:16PM +0530, Naveen N. Rao escreveu:
>> Before patch:
>> $ uname -m
>> ppc64le
>
> Thanks, applied,
>
> - Arnaldo
>
>> $ ./perf script -s ./scripts/python/syscall-counts.py
>> Install the audit-libs-python package to get syscall names.
>> For example:
>> # apt-get install python-audit (Ubuntu)
>> # yum install audit-libs-python (Fedora)
>> etc.
>>
>> Press control+C to stop and show the summary
>> ^CWarning:
>> 4 out of order events recorded.
>>
>> syscall events:
>>
>> event                 count
>> ---------------------------
>> 4                    504638
>> 54                     1206
>> 221                      42
>> 55                       21
>> 3                        12
>> 167                      10
>> 11                        8
>> 6                         7
>> 125                       6
>> 5                         6
>> 108                       5
>> 162                       4
>> 90                        4
>> 45                        3
>> 33                        3
>> 311                       1
>> 246                       1
>> 238                       1
>> 93                        1
>> 91                        1
>>
>> After patch:
>> ./perf script -s ./scripts/python/syscall-counts.py
>> Press control+C to stop and show the summary
>> ^CWarning:
>> 5 out of order events recorded.
>>
>> syscall events:
>>
>> event                 count
>> ---------------------------
>> write                643411
>> ioctl                  1206
>> futex                    54
>> fcntl                    27
>> poll                     14
>> read                     12
>> execve                    8
>> close                     7
>> mprotect                  6
>> open                      6
>> nanosleep                 5
>> fstat                     5
>> mmap                      4
>> inotify_add_watch         3
>> brk                       3
>> access                    3
>> timerfd_settime           1
>> clock_gettime             1
>> epoll_wait                1
>> ftruncate                 1
>> munmap                    1
>>
>> Signed-off-by: Naveen N. Rao
>> ---
>>  tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace/Util.py | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace/Util.py b/tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace/Util.py
>> index 1d95009592eb..f6c84966e4f8 100644
>> --- a/tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace/Util.py
>> +++ b/tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace/Util.py
>> @@ -57,6 +57,7 @@ try:
>>     'ia64'    : audit.MACH_IA64,
>>     'ppc'     : audit.MACH_PPC,
>>     'ppc64'   : audit.MACH_PPC64,
>> +   'ppc64le' : audit.MACH_PPC64LE,
>>     's390'    : audit.MACH_S390,
>>     's390x'   : audit.MACH_S390X,
>>     'i386'    : audit.MACH_X86,
>> --
>> 2.13.3
>
Re: [PATCH] powerpc/lib: Split xor_vmx file to guarantee instruction ordering
On 05/24/2017 11:56 PM, Matt Brown wrote: > On Wed, May 24, 2017 at 11:36 PM, Paul Clarke <p...@us.ibm.com> wrote: >> On 05/23/2017 06:45 PM, Matt Brown wrote: >>> The xor_vmx.c file is used for the RAID5 xor operations. In these functions >>> altivec is enabled to run the operation and then disabled. However due to >>> compiler instruction reordering, altivec instructions are being run before >>> enable_altivec() and after disable_altivec(). >> >> If altivec instructions can be reordered after disable_altivec(), then >> disable_altivec() is broken, I'd think. >> >> Could it be because the isync in mtmsr_isync() is after the mtmsr? >> >> disable_kernel_altivec >> - msr_check_and_clear >> - __msr_check_and_clear >> - mtmsr_isync > > So it turns out the enable / disable functions don't actually enable > or disable the use of vector instructions. > If we have marked the file to be compiled with altivec the compiler > has free reign to reorder the vector instructions wherever it likes. > Including reordering it before or after the enable/disable_altivec > commands. > > The enable_kernel_altivec and disable_kernel_altivec functions are > mainly there to empty and restore the vector registers which could > have been used in user-space. So those functions work as intended, > although not particularly intuitive. > > Splitting the files and only compiling the xor_vmx.c file with altivec > will guarantee that there are no vector instructions in the > xor_vmx_glue.c file, and that no vector instructions are outside of > the enable/disable block. Thanks for the explanation! That makes sense. PC
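The fix Matt describes keeps every vector instruction in a separately compiled file and reaches it through an out-of-line glue call, so the compiler cannot schedule vector code outside the enable/disable window. A scalar stand-in sketch of that glue pattern (the enable/disable functions here are stubs, and the xor body is plain C rather than altivec):

```c
#include <assert.h>
#include <stddef.h>

/* Stubs standing in for enable_kernel_altivec()/disable_kernel_altivec(). */
static int altivec_enabled;
static void enable_kernel_altivec(void)  { altivec_enabled = 1; }
static void disable_kernel_altivec(void) { altivec_enabled = 0; }

/*
 * In the real fix this body lives in xor_vmx.c, the only file built
 * with altivec enabled.  Because the kernel calls it across a
 * translation-unit boundary, the compiler cannot hoist its
 * instructions past the enable/disable calls in the glue file.
 */
static void xor_blocks_body(size_t bytes, unsigned long *dst,
			    const unsigned long *src)
{
	assert(altivec_enabled);  /* must only run inside the window */
	for (size_t i = 0; i < bytes / sizeof(unsigned long); i++)
		dst[i] ^= src[i];
}

/* Glue, as in xor_vmx_glue.c: open the window, call out of line, close it. */
static void xor_blocks_2(size_t bytes, unsigned long *dst,
			 const unsigned long *src)
{
	enable_kernel_altivec();
	xor_blocks_body(bytes, dst, src);
	disable_kernel_altivec();
}
```

The same separation would not help if the body were inlinable into the glue file, which is why the split into two compilation units, not just two functions, is the guarantee.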
Re: [PATCH] powerpc/lib: Split xor_vmx file to guarantee instruction ordering
On 05/23/2017 06:45 PM, Matt Brown wrote: > The xor_vmx.c file is used for the RAID5 xor operations. In these functions > altivec is enabled to run the operation and then disabled. However due to > compiler instruction reordering, altivec instructions are being run before > enable_altivec() and after disable_altivec(). If altivec instructions can be reordered after disable_altivec(), then disable_altivec() is broken, I'd think. Could it be because the isync in mtmsr_isync() is after the mtmsr? disable_kernel_altivec - msr_check_and_clear - __msr_check_and_clear - mtmsr_isync PC
Re: [PATCH] powerpc/64s: dt_cpu_ftrs boot time setup option
On 05/11/2017 10:46 PM, Michael Ellerman wrote: > Paul Clarke <p...@us.ibm.com> writes: >> On 05/11/2017 06:24 AM, Nicholas Piggin wrote: >>> Provide a dt_cpu_ftrs= cmdline option to disable the dt_cpu_ftrs CPU >>> feature discovery, and fall back to the "cputable" based version. >> >> This boat has already sailed, I think, but "ftrs"? > > What you think vowels grow on trees! :) At least you're using lower-case, which takes less space. ;-) >> Was it too difficult to type "features"? I see "ftrs" and think "footers". > For the command line option we could spell out features. > > But should we also expand "dt", and "cpu" ? > > device_tree_central_processing_unit_features=off I'm not the biggest fan of "dt" ("devtree", anyone?), but that's certainly established. So _that's_ what "cpu" stands for! ;-) > :P > >> This seems like something that should be easy and intuitive to find >> and understand. > > Maybe. Ideally no one will ever use it, certainly not end users, it's > primarily intended for developers doing bring-up. I'm a developer, and I like to be able to understand something at a glance. (Hmm, what are "CPU footers"?) :-P PC
Re: [PATCH] powerpc/64s: dt_cpu_ftrs boot time setup option
On 05/11/2017 06:24 AM, Nicholas Piggin wrote: > Provide a dt_cpu_ftrs= cmdline option to disable the dt_cpu_ftrs CPU > feature discovery, and fall back to the "cputable" based version. This boat has already sailed, I think, but "ftrs"? Was it too difficult to type "features"? This seems like something that should be easy and intuitive to find and understand. :-) PC
Re: [PATCH v2] powerpc/kprobes: refactor kprobe_lookup_name for safer string operations
On 05/04/2017 05:24 AM, Naveen N. Rao wrote: > Use safer string manipulation functions when dealing with a > user-provided string in kprobe_lookup_name(). > > Reported-by: David Laight> Signed-off-by: Naveen N. Rao > --- > Changed to ignore return value of 0 from strscpy(), as suggested by > Masami. > > - Naveen > > arch/powerpc/kernel/kprobes.c | 47 > ++- > 1 file changed, 20 insertions(+), 27 deletions(-) > > diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c > index 160ae0fa7d0d..255d28d31ca1 100644 > --- a/arch/powerpc/kernel/kprobes.c > +++ b/arch/powerpc/kernel/kprobes.c > @@ -53,7 +53,7 @@ bool arch_within_kprobe_blacklist(unsigned long addr) > > kprobe_opcode_t *kprobe_lookup_name(const char *name, unsigned int offset) > { > - kprobe_opcode_t *addr; > + kprobe_opcode_t *addr = NULL; > > #ifdef PPC64_ELF_ABI_v2 > /* PPC64 ABIv2 needs local entry point */ > @@ -85,36 +85,29 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, > unsigned int offset) >* Also handle format. >*/ > char dot_name[MODULE_NAME_LEN + 1 + KSYM_NAME_LEN]; > - const char *modsym; > bool dot_appended = false; > - if ((modsym = strchr(name, ':')) != NULL) { > - modsym++; > - if (*modsym != '\0' && *modsym != '.') { > - /* Convert to */ > - strncpy(dot_name, name, modsym - name); > - dot_name[modsym - name] = '.'; > - dot_name[modsym - name + 1] = '\0'; > - strncat(dot_name, modsym, > - sizeof(dot_name) - (modsym - name) - 2); > - dot_appended = true; > - } else { > - dot_name[0] = '\0'; > - strncat(dot_name, name, sizeof(dot_name) - 1); > - } > - } else if (name[0] != '.') { > - dot_name[0] = '.'; > - dot_name[1] = '\0'; > - strncat(dot_name, name, KSYM_NAME_LEN - 2); > + const char *c; > + ssize_t ret = 0; > + int len = 0; > + > + if ((c = strnchr(name, MODULE_NAME_LEN, ':')) != NULL) { Shouldn't this be MODULE_NAME_LEN + 1, since the ':' can come after a module name of length MODULE_NAME_LEN? 
> + c++; > + len = c - name; > + memcpy(dot_name, name, len); > + } else > + c = name; > + > + if (*c != '\0' && *c != '.') { > + dot_name[len++] = '.'; > dot_appended = true; > - } else { > - dot_name[0] = '\0'; > - strncat(dot_name, name, KSYM_NAME_LEN - 1); > } PC
Re: [PATCH v4 4/7] powerpc/kprobes: Use safer string functions in kprobe_lookup_name()
Sent too soon. The suggestions don't guarantee null termination. Refined, below. (Sorry for the noise.) On 04/21/2017 08:33 AM, Paul Clarke wrote: > On 04/21/2017 07:33 AM, Naveen N. Rao wrote: >> Convert usage of strchr()/strncpy()/strncat() to >> strnchr()/memcpy()/strlcat() for simpler and safer string manipulation. >> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c >> index 97b5eed1f76d..c73fb6e3b43f 100644 >> --- a/arch/powerpc/kernel/kprobes.c >> +++ b/arch/powerpc/kernel/kprobes.c >> @@ -65,28 +65,27 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, >> unsigned int offset) >> char dot_name[MODULE_NAME_LEN + 1 + KSYM_NAME_LEN]; >> const char *modsym; >> bool dot_appended = false; >> -if ((modsym = strchr(name, ':')) != NULL) { >> +if ((modsym = strnchr(name, ':', MODULE_NAME_LEN)) != NULL) { >> modsym++; >> if (*modsym != '\0' && *modsym != '.') { >> /* Convert to */ >> -strncpy(dot_name, name, modsym - name); >> +memcpy(dot_name, name, modsym - name); >> dot_name[modsym - name] = '.'; >> dot_name[modsym - name + 1] = '\0'; >> -strncat(dot_name, modsym, >> -sizeof(dot_name) - (modsym - name) - 2); >> +strlcat(dot_name, modsym, sizeof(dot_name)); > > Would it be more efficient here to replace this: > -- > dot_name[modsym - name + 1] = '\0'; > strlcat(dot_name, modsym, sizeof(dot_name)); > -- > > with this: > strncpy(&dot_name[modsym - name + 1], modsym, KSYM_NAME_LEN); keep the null termination, and just adjust the starting point for strlcat: -- dot_name[modsym - name + 1] = '\0'; strlcat(&dot_name[modsym - name + 1], modsym, KSYM_NAME_LEN); -- > (So you aren't rescanning dot_name to find the end, when you already know the > end position?) > >> dot_appended = true; >> } else { >> dot_name[0] = '\0'; >> -strncat(dot_name, name, sizeof(dot_name) - 1); >> +strlcat(dot_name, name, sizeof(dot_name)); > > and here do: > strncpy(dot_name, name, sizeof(dot_name)); > > (and remove the null termination immediately above) nevermind. 
:-) > >> } >> } else if (name[0] != '.') { >> dot_name[0] = '.'; >> dot_name[1] = '\0'; >> -strncat(dot_name, name, KSYM_NAME_LEN - 2); >> +strlcat(dot_name, name, sizeof(dot_name)); > > and here do: > strncpy(&dot_name[1], name, sizeof(dot_name)); > > (and remove the null termination immediately above) > nevermind. :-) >> dot_appended = true; >> } else { >> dot_name[0] = '\0'; >> -strncat(dot_name, name, KSYM_NAME_LEN - 1); >> +strlcat(dot_name, name, sizeof(dot_name)); > > and here do: > strncpy(dot_name, name, sizeof(dot_name)); > > (and remove the null termination immediately above) > >> } nevermind. :-) PC
Re: [PATCH v4 4/7] powerpc/kprobes: Use safer string functions in kprobe_lookup_name()
On 04/21/2017 08:33 AM, Paul Clarke wrote: > On 04/21/2017 07:33 AM, Naveen N. Rao wrote: >> } else if (name[0] != '.') { >> dot_name[0] = '.'; >> dot_name[1] = '\0'; >> -strncat(dot_name, name, KSYM_NAME_LEN - 2); >> +strlcat(dot_name, name, sizeof(dot_name)); > > and here do: > strncpy(&dot_name[1], name, sizeof(dot_name)); oops. s/sizeof(dot_name)/sizeof(dot_name) - 1/ PC
Re: [PATCH v4 4/7] powerpc/kprobes: Use safer string functions in kprobe_lookup_name()
On 04/21/2017 07:33 AM, Naveen N. Rao wrote: > Convert usage of strchr()/strncpy()/strncat() to > strnchr()/memcpy()/strlcat() for simpler and safer string manipulation. > > Reported-by: David Laight> Signed-off-by: Naveen N. Rao > --- > Changes: Additionally convert the strchr(). > > > arch/powerpc/kernel/kprobes.c | 13 ++--- > 1 file changed, 6 insertions(+), 7 deletions(-) > > diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c > index 97b5eed1f76d..c73fb6e3b43f 100644 > --- a/arch/powerpc/kernel/kprobes.c > +++ b/arch/powerpc/kernel/kprobes.c > @@ -65,28 +65,27 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, > unsigned int offset) > char dot_name[MODULE_NAME_LEN + 1 + KSYM_NAME_LEN]; > const char *modsym; > bool dot_appended = false; > - if ((modsym = strchr(name, ':')) != NULL) { > + if ((modsym = strnchr(name, ':', MODULE_NAME_LEN)) != NULL) { > modsym++; > if (*modsym != '\0' && *modsym != '.') { > /* Convert to */ > - strncpy(dot_name, name, modsym - name); > + memcpy(dot_name, name, modsym - name); > dot_name[modsym - name] = '.'; > dot_name[modsym - name + 1] = '\0'; > - strncat(dot_name, modsym, > - sizeof(dot_name) - (modsym - name) - 2); > + strlcat(dot_name, modsym, sizeof(dot_name)); Would it be more efficient here to replace this: -- dot_name[modsym - name + 1] = '\0'; strlcat(dot_name, modsym, sizeof(dot_name)); -- with this: strncpy(&dot_name[modsym - name + 1], modsym, KSYM_NAME_LEN); (So you aren't rescanning dot_name to find the end, when you already know the end position?) 
> dot_appended = true; > } else { > dot_name[0] = '\0'; > - strncat(dot_name, name, sizeof(dot_name) - 1); > + strlcat(dot_name, name, sizeof(dot_name)); and here do: strncpy(dot_name, name, sizeof(dot_name)); (and remove the null termination immediately above) > } > } else if (name[0] != '.') { > dot_name[0] = '.'; > dot_name[1] = '\0'; > - strncat(dot_name, name, KSYM_NAME_LEN - 2); > + strlcat(dot_name, name, sizeof(dot_name)); and here do: strncpy(&dot_name[1], name, sizeof(dot_name)); (and remove the null termination immediately above) > dot_appended = true; > } else { > dot_name[0] = '\0'; > - strncat(dot_name, name, KSYM_NAME_LEN - 1); > + strlcat(dot_name, name, sizeof(dot_name)); and here do: strncpy(dot_name, name, sizeof(dot_name)); (and remove the null termination immediately above) > } PC
Re: [PATCH v4 3/7] kprobes: validate the symbol name provided during probe registration
a nit or two, below... On 04/21/2017 07:32 AM, Naveen N. Rao wrote: > diff --git a/kernel/kprobes.c b/kernel/kprobes.c > index 6a128f3a7ed1..ff9b1ac72a38 100644 > --- a/kernel/kprobes.c > +++ b/kernel/kprobes.c > @@ -1383,6 +1383,34 @@ bool within_kprobe_blacklist(unsigned long addr) > } > > /* > + * We mainly want to ensure that the provided string is of a reasonable > length > + * and is of the form [:], so that this is safe to > process > + * further. > + * We don't worry about invalid characters as those will just prevent > + * matching existing kallsyms. > + */ > +bool is_valid_kprobe_symbol_name(const char *name) > +{ > + size_t sym_len; > + const char *s; > + > + s = strnchr(name, ':', MODULE_NAME_LEN + KSYM_NAME_LEN + 1); > + if (s) { > + sym_len = (size_t)(s - name); > + if (sym_len <= 0 || sym_len >= MODULE_NAME_LEN) "sym_len <= 0" looks odd here, since sym_len is likely unsigned and would never be less than zero, anyway. > + return false; > + s++; > + } else > + s = name; > + > + sym_len = strnlen(s, KSYM_NAME_LEN); > + if (sym_len <= 0 || sym_len >= KSYM_NAME_LEN) here, too. > + return false; > + > + return true; > +} PC
Re: [PATCH V4] powerpc/hugetlb: Add ABI defines for supported HugeTLB page sizes
nits... take 'em or leave 'em... On 04/07/2017 08:01 AM, Michael Ellerman wrote: Anshuman Khandual writes: And I reworded the comment to make it clearer (I think) that most users shouldn't need to use these, and should just use the default size: /* * When MAP_HUGETLB is set bits [26:31] encode the log2 of the huge page size. need a comma after "set". also, "bits [26:31]" of what? * A value of zero indicates that the default huge page size should be used. * To use a non-default huge page size, one of these defines can be used, or the * size can be encoded by hand. Note that on most systems only a subset, or * possibly none, of these sizes will be available. */ PC
Re: [PATCH v2] powerpc: A new cache geometry aux vectors
On 03/02/2017 10:01 PM, Michael Ellerman wrote: Paul Clarke <p...@us.ibm.com> writes: On 03/02/2017 12:33 AM, Michael Ellerman wrote: Paul Clarke <p...@us.ibm.com> writes: On 02/02/2017 12:22 AM, Benjamin Herrenschmidt wrote: This adds AUX vectors for the L1I,D, L2 and L3 cache levels providing for each cache level the size of the cache in bytes and the geometry (line size and number of ways). We chose to not use the existing alpha/sh definition which packs all the information in a single entry per cache level as it is too restricted to represent some of the geometries used on POWER. Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org> --- v2. Remove some leftovers and fix 32-bit build Paul, can you test & ack this please ? It replaces the last patch of my previous series (the rest is identical). Sorry for the delay on this, I totally missed this e-mail until this week. I tested this with the Power Functional Simulator (a.k.a. Mambo), with a fairly recent 4.9 kernel from the powerpc git tree. You mean 4.9 plus this patch? Using Mambo-P8, I see what I expect for cache line size, but associativity is 0: AT_L1I_CACHESIZE: 32768 (0x8000) AT_L1I_CACHEGEOMETRY: associativity 0; line size 128 AT_L1D_CACHESIZE: 32768 (0x8000) AT_L1D_CACHEGEOMETRY: associativity 0; line size 128 AT_L2_CACHESIZE: 524288 (0x8) AT_L2_CACHEGEOMETRY: associativity 0; line size 128 AT_L3_CACHESIZE: 8388608 (0x80) AT_L3_CACHEGEOMETRY: associativity 0; line size 128 ...shouldn't associativity be 8 for all levels? Yes. I fixed it in v3: http://patchwork.ozlabs.org/patch/723576/ Which is now merged in Linus tree, so please test that. It's not working for me, but it's also very likely that I could be doing something stupid. Just pulling from git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git, it built (without the "systemsim" patches"), but wouldn't boot in Mambo (not because it couldn't find root...something else not fully explored). 
Hmm OK, it's booting for me, but might depend on your config/toolchain etc. Linus' tree just before rc1 is a bit of a dangerous place to be. I checked out tag "v4.10", and that built (with the "systemsim" patches, Ben's 1-8 and your 9v3) and booted. I think you need my entire v3, ignore Ben's series. On Mambo-P8, the device tree looks correct (to me), but the associativity is still 0. OK, I think if you use the full v3 the associativity will be fixed. Though it might still be wrong, because the kernel code hard codes the values on P8 bare metal, because on real machines (running skiboot) the values provided in the device tree are wrong. Hopefully those hard coded values are also correct on p8 mambo, but we should probably check. See: https://git.kernel.org/torvalds/c/608b42140e96 OK, thanks for your help. All is working as expected now. mambo-p8 looks fine: -- AT_L1I_CACHESIZE: 32768 (0x8000) getauxval(AT_L1I_CACHEGEOMETRY) = 0x00080080 AT_L1I_CACHEGEOMETRY: associativity 8; line size 128 AT_L1D_CACHESIZE: 32768 (0x8000) getauxval(AT_L1D_CACHEGEOMETRY) = 0x00080080 AT_L1D_CACHEGEOMETRY: associativity 8; line size 128 AT_L2_CACHESIZE: 524288 (0x8) getauxval(AT_L2_CACHEGEOMETRY) = 0x00080080 AT_L2_CACHEGEOMETRY: associativity 8; line size 128 AT_L3_CACHESIZE: 8388608 (0x80) getauxval(AT_L2_CACHEGEOMETRY) = 0x00080080 AT_L3_CACHEGEOMETRY: associativity 8; line size 128 -- On P9, the device tree isn't correct, so I'm not sure what to expect. Junk :) On P9 it all comes from the device tree so if that's wrong the AUX vectors will definitely be wrong. 
mambo-p9 is still falling victim to the device-tree, but "working": -- getauxval(AT_L1I_CACHEGEOMETRY) = 0x0080 AT_L1I_CACHEGEOMETRY: associativity 0; line size 128 getauxval(AT_L1D_CACHEGEOMETRY) = 0x0080 AT_L1D_CACHEGEOMETRY: associativity 0; line size 128 getauxval(AT_L2_CACHEGEOMETRY) = 0x getauxval(AT_L2_CACHEGEOMETRY) = 0x -- Can you try the test program I wrote (which may also have bugs): http://patchwork.ozlabs.org/patch/724418/ I couldn't get this working, either, so I'm probably doing something stupid. Did you apply the first patch of the series also? I did now, and it works, with expected results: P8: -- # ./cache_shape test: cache_shape tags: git_version:v4.10-0-gc470abd-dirty L1I cache size: 0x8000 32768B 32K L1I line size:0x80 8-way associative L1D cache size: 0x8000 32768B 32K L1D line size:0x80 8-way associative L2 cache size:0x8 524288B512K L2 line size:0x80 8-way associative L3 cache size: 0x808388608B 8192K L3 line size:0x80 8-way associative success: cache_shape -- P9 (with bad device-tree): -- # ./cache_shape test: cache_shape tags: git_version:v4.10-0-gc470abd-
Re: [PATCH v2] powerpc: A new cache geometry aux vectors
On 03/02/2017 12:33 AM, Michael Ellerman wrote: Paul Clarke <p...@us.ibm.com> writes: On 02/02/2017 12:22 AM, Benjamin Herrenschmidt wrote: This adds AUX vectors for the L1I,D, L2 and L3 cache levels providing for each cache level the size of the cache in bytes and the geometry (line size and number of ways). We chose to not use the existing alpha/sh definition which packs all the information in a single entry per cache level as it is too restricted to represent some of the geometries used on POWER. Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org> --- v2. Remove some leftovers and fix 32-bit build Paul, can you test & ack this please ? It replaces the last patch of my previous series (the rest is identical). Sorry for the delay on this, I totally missed this e-mail until this week. I tested this with the Power Functional Simulator (a.k.a. Mambo), with a fairly recent 4.9 kernel from the powerpc git tree. You mean 4.9 plus this patch? Using Mambo-P8, I see what I expect for cache line size, but associativity is 0: AT_L1I_CACHESIZE: 32768 (0x8000) AT_L1I_CACHEGEOMETRY: associativity 0; line size 128 AT_L1D_CACHESIZE: 32768 (0x8000) AT_L1D_CACHEGEOMETRY: associativity 0; line size 128 AT_L2_CACHESIZE: 524288 (0x8) AT_L2_CACHEGEOMETRY: associativity 0; line size 128 AT_L3_CACHESIZE: 8388608 (0x80) AT_L3_CACHEGEOMETRY: associativity 0; line size 128 ...shouldn't associativity be 8 for all levels? Yes. I fixed it in v3: http://patchwork.ozlabs.org/patch/723576/ Which is now merged in Linus tree, so please test that. It's not working for me, but it's also very likely that I could be doing something stupid. Just pulling from git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git, it built (without the "systemsim" patches"), but wouldn't boot in Mambo (not because it couldn't find root...something else not fully explored). I checked out tag "v4.10", and that built (with the "systemsim" patches, Ben's 1-8 and your 9v3) and booted. 
On Mambo-P8, the device tree looks correct (to me), but the associativity is still 0. On P9, the device tree isn't correct, so I'm not sure what to expect. I added a bit more debug to my test: AT_L1I_CACHESIZE: 32768 (0x8000) getauxval(AT_L1I_CACHEGEOMETRY) = 0x0080 AT_L1I_CACHEGEOMETRY: associativity 0; line size 128 AT_L1D_CACHESIZE: 32768 (0x8000) getauxval(AT_L1D_CACHEGEOMETRY) = 0x0080 AT_L1D_CACHEGEOMETRY: associativity 0; line size 128 AT_L2_CACHESIZE: 524288 (0x8) getauxval(AT_L2_CACHEGEOMETRY) = 0x0080 AT_L2_CACHEGEOMETRY: associativity 0; line size 128 AT_L3_CACHESIZE: 8388608 (0x80) getauxval(AT_L2_CACHEGEOMETRY) = 0x0080 AT_L3_CACHEGEOMETRY: associativity 0; line size 128 When I use gdb, and break at main: (gdb) x/60gx ($r6) 0x3ca0: 0x0016 0x0016 0x3cb0: 0x0016 0x0016 0x3cc0: 0x0013 0x0080 0x3cd0: 0x0014 0x0080 0x3ce0: 0x0015 0x 0x3cf0: 0x0021 0x3fffb7fa 0x3d00: 0x0028 0x8000 0x3d10: 0x0029 0x0080 0x3d20: 0x002a 0x8000 0x3d30: 0x002b 0x0080 0x3d40: 0x002c 0x0008 0x3d50: 0x002d 0x0080 0x3d60: 0x002e 0x0080 0x3d70: 0x002f 0x0080 and, looking at the device-tree: # for i in /proc/device-tree/cpus/PowerPC@0/*cache*; do echo $i; od -tx1 $i; done /proc/device-tree/cpus/PowerPC@0/d-cache-block-size 000 00 00 00 80 /proc/device-tree/cpus/PowerPC@0/d-cache-line-size 000 00 00 00 80 /proc/device-tree/cpus/PowerPC@0/d-cache-sets 000 00 00 00 04 /proc/device-tree/cpus/PowerPC@0/d-cache-size 000 00 00 80 00 /proc/device-tree/cpus/PowerPC@0/i-cache-block-size 000 00 00 00 80 /proc/device-tree/cpus/PowerPC@0/i-cache-line-size 000 00 00 00 80 /proc/device-tree/cpus/PowerPC@0/i-cache-sets 000 00 00 00 02 /proc/device-tree/cpus/PowerPC@0/i-cache-size 000 00 00 80 00 /proc/device-tree/cpus/PowerPC@0/l2-cache-size 000 00 0a 00 00 Can you try the test program I wrote (which may also have bugs): http://patchwork.ozlabs.org/patch/724418/ I couldn't get this working, either, so I'm probably doing something stupid. PC
Re: [PATCH v2] powerpc: A new cache geometry aux vectors
On 02/02/2017 12:22 AM, Benjamin Herrenschmidt wrote: This adds AUX vectors for the L1I,D, L2 and L3 cache levels providing for each cache level the size of the cache in bytes and the geometry (line size and number of ways). We chose to not use the existing alpha/sh definition which packs all the information in a single entry per cache level as it is too restricted to represent some of the geometries used on POWER. Signed-off-by: Benjamin Herrenschmidt--- v2. Remove some leftovers and fix 32-bit build Paul, can you test & ack this please ? It replaces the last patch of my previous series (the rest is identical). Sorry for the delay on this, I totally missed this e-mail until this week. I tested this with the Power Functional Simulator (a.k.a. Mambo), with a fairly recent 4.9 kernel from the powerpc git tree. Using Mambo-P8, I see what I expect for cache line size, but associativity is 0: AT_L1I_CACHESIZE: 32768 (0x8000) AT_L1I_CACHEGEOMETRY: associativity 0; line size 128 AT_L1D_CACHESIZE: 32768 (0x8000) AT_L1D_CACHEGEOMETRY: associativity 0; line size 128 AT_L2_CACHESIZE: 524288 (0x80000) AT_L2_CACHEGEOMETRY: associativity 0; line size 128 AT_L3_CACHESIZE: 8388608 (0x800000) AT_L3_CACHEGEOMETRY: associativity 0; line size 128 ...shouldn't associativity be 8 for all levels? Using Mambo-P9, I don't get any cache sizes, and only L1 geometries, and no associativity: AT_L1I_CACHEGEOMETRY: associativity 0; line size 128 AT_L1D_CACHEGEOMETRY: associativity 0; line size 128 PC
Re: [PATCH 9/9] powerpc: A new cache geometry aux vectors
On 01/08/2017 05:31 PM, Benjamin Herrenschmidt wrote: This adds AUX vectors for the L1I,D, L2 and L3 cache levels providing for each cache level the size of the cache in bytes and the geometry (line size and number of ways). We chose to not use the existing alpha/sh definition which packs all the information in a single entry per cache level as it is too restricted to represent some of the geometries used on POWER. diff --git a/arch/powerpc/include/asm/cache.h b/arch/powerpc/include/asm/cache.h index 3987bd9..1557d26 100644 --- a/arch/powerpc/include/asm/cache.h +++ b/arch/powerpc/include/asm/cache.h @@ -35,6 +35,7 @@ struct ppc_cache_info { u32 log_block_size; u32 blocks_per_page; u32 sets; + u32 assoc; Associativity is defined as u32... }; struct ppc64_caches { diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h index 730c27e..a128836 100644 --- a/arch/powerpc/include/asm/elf.h +++ b/arch/powerpc/include/asm/elf.h +extern long il1cache_shape; +extern long dl1cache_shape; +extern long l2cache_shape; +extern long l3cache_shape; shapes are "long"... @@ -136,6 +140,9 @@ extern int arch_setup_additional_pages(struct linux_binprm *bprm, #endif /* CONFIG_SPU_BASE */ +#define get_cache_shape(level) \ + (ppc64_caches.level.assoc << 16 | ppc64_caches.level.line_size) Now we stuff u32 assoc into 16 bits. What's the reason behind combining associativity and cache line size into a single field? Likely the most requested values will be cache line size and (maybe) cache size. The latter is a simple query with these changes, but the former is now a query and some awkward mask/shift. I don't think we're currently providing accessor macros, although I guess those can be added to glibc. Would it not be simpler, though, to just add independent queries for each of cache line size and associativity? - AT_L1I_CACHESIZE - AT_L1I_CACHE_LINE_SIZE - AT_L1I_CACHE_ASSOC etc. Is there a big "con" to having a few more AUXV entries? 
Regards, PC @@ -156,6 +163,14 @@ do { \ NEW_AUX_ENT(AT_ICACHEBSIZE, icache_bsize); \ NEW_AUX_ENT(AT_UCACHEBSIZE, ucache_bsize); \ VDSO_AUX_ENT(AT_SYSINFO_EHDR, current->mm->context.vdso_base);\ + NEW_AUX_ENT(AT_L1I_CACHESIZE, ppc64_caches.l1i.size); \ + NEW_AUX_ENT(AT_L1I_CACHEGEOMETRY, get_cache_shape(l1i));\ + NEW_AUX_ENT(AT_L1D_CACHESIZE, ppc64_caches.l1i.size); \ + NEW_AUX_ENT(AT_L1D_CACHEGEOMETRY, get_cache_shape(l1i));\ + NEW_AUX_ENT(AT_L2_CACHESIZE, ppc64_caches.l2.size); \ + NEW_AUX_ENT(AT_L2_CACHEGEOMETRY, get_cache_shape(l2)); \ + NEW_AUX_ENT(AT_L3_CACHESIZE, ppc64_caches.l3.size); \ + NEW_AUX_ENT(AT_L3_CACHEGEOMETRY, get_cache_shape(l3)); \ } while (0) #endif /* _ASM_POWERPC_ELF_H */ diff --git a/arch/powerpc/include/uapi/asm/auxvec.h b/arch/powerpc/include/uapi/asm/auxvec.h index ce17d2c..be6e94e 100644 --- a/arch/powerpc/include/uapi/asm/auxvec.h +++ b/arch/powerpc/include/uapi/asm/auxvec.h @@ -16,6 +16,37 @@ */ #define AT_SYSINFO_EHDR33 -#define AT_VECTOR_SIZE_ARCH 6 /* entries in ARCH_DLINFO */ +/* + * AT_*CACHEBSIZE above represent the cache *block* size which is + * the size that is affected by the cache management instructions. + * + * It doesn't nececssarily matches the cache *line* size which is + * more of a performance tuning hint. Additionally the latter can + * be different for the different cache levels. + * + * The set of entries below represent more extensive information + * about the caches, in the form of two entry per cache type, + * one entry containing the cache size in bytes, and the other + * containing the cache line size in bytes in the bottom 16 bits + * and the cache associativity in the next 16 bits. + * + * The associativity is such that if N is the 16-bit value, the + * cache is N way set associative. A value if 0xffff means fully + * associative, a value of 1 means directly mapped. + * + * For all these fields, a value of 0 means that the information + * is not known. 
+ */ + +#define AT_L1I_CACHESIZE 40 +#define AT_L1I_CACHEGEOMETRY 41 +#define AT_L1D_CACHESIZE 42 +#define AT_L1D_CACHEGEOMETRY 43 +#define AT_L2_CACHESIZE44 +#define AT_L2_CACHEGEOMETRY45 +#define AT_L3_CACHESIZE46 +#define AT_L3_CACHEGEOMETRY47 + +#define AT_VECTOR_SIZE_ARCH14 /* entries in ARCH_DLINFO */ #endif diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c index e0eeed4..cfa2a06 100644 --- a/arch/powerpc/kernel/setup-common.c +++ b/arch/powerpc/kernel/setup-common.c @@ -94,7 +94,10 @@ EXPORT_SYMBOL_GPL(boot_cpuid); int dcache_bsize; int icache_bsize; int ucache_bsize; - +long il1cache_shape = -1; +long
Re: [PATCH 2/2] powerpc/pseries: Dynamically increase RMA size
Only nits from me...(see below) On 08/05/2016 01:30 PM, Sukadev Bhattiprolu wrote: Here is an updated patch to fix the build when CONFIG_PPC_PSERIES=n. --- From d4f77a6ca7b6ea83f6588e7d541cc70bf001ae85 Mon Sep 17 00:00:00 2001 From: rootDate: Thu, 4 Aug 2016 23:13:37 -0400 Subject: [PATCH 2/2] powerpc/pseries: Dynamically grow RMA size When booting a very large system with a larg initrd we run out of space for the flattened device tree (FDT). To fix this we must increase the space allocated for the RMA region. The RMA size is hard-coded in the 'ibm_architecture_vec[]' and increasing the size there will apply to all systems, small and large, so we want to increase the RMA region only when necessary. When we run out of room for the FDT, set a new OF property, 'ibm,new-rma-size' to the new RMA size (512MB) and issue a client-architecture-support (CAS) call to the firmware. This will initiate a system reboot. Upon reboot we notice the new property and update the RMA size accordingly. Fix suggested by Michael Ellerman. Signed-off-by: Sukadev Bhattiprolu --- [v2]: - Add a comment in code regarding 'fixup_nr_cores_done' - Fix build break when CONFIG_PPC_PSERIES=n --- arch/powerpc/kernel/prom_init.c | 96 - 1 file changed, 95 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c index f612a99..cbd5387 100644 --- a/arch/powerpc/kernel/prom_init.c +++ b/arch/powerpc/kernel/prom_init.c @@ -679,6 +679,7 @@ unsigned char ibm_architecture_vec[] = { W(0x), /* virt_base */ W(0x), /* virt_size */ W(0x), /* load_base */ +#define IBM_ARCH_VEC_MIN_RMA_OFFSET108 W(256), /* 256MB min RMA */ W(0x), /* full client load */ 0, /* min RMA percentage of total RAM */ @@ -867,6 +868,14 @@ static void fixup_nr_cores(void) { u32 cores; unsigned char *ptcores; + static bool fixup_nr_cores_done = false; + + /* +* If this is a second CAS call in the same boot sequence, (see +* increase_rma_size()), we don't need to do the fixup again. 
+*/ + if (fixup_nr_cores_done) + return; /* We need to tell the FW about the number of cores we support. * @@ -898,6 +907,41 @@ static void fixup_nr_cores(void) ptcores[1] = (cores >> 16) & 0xff; ptcores[2] = (cores >> 8) & 0xff; ptcores[3] = cores & 0xff; + fixup_nr_cores_done = true; + } +} + +static void __init fixup_rma_size(void) +{ + int rc; + u64 size; + unsigned char *min_rmap; + phandle optnode; + char str[64]; + + optnode = call_prom("finddevice", 1, 1, ADDR("/options")); + if (!PHANDLE_VALID(optnode)) + prom_panic("Cannot find /options"); + + /* +* If a prior boot specified a new RMA size, use that size in +* ibm_architecture_vec[]. See also increase_rma_size(). +*/ + size = 0ULL; + memset(str, 0, sizeof(str)); + rc = prom_getprop(optnode, "ibm,new-rma-size", , sizeof(str)); + if (rc <= 0) + return; + + size = prom_strtoul(str, NULL); + min_rmap = _architecture_vec[IBM_ARCH_VEC_MIN_RMA_OFFSET]; + + if (size) { + prom_printf("Using RMA size %lu from ibm,new-rma-size.\n", size); + min_rmap[0] = (size >> 24) & 0xff; + min_rmap[1] = (size >> 16) & 0xff; + min_rmap[2] = (size >> 8) & 0xff; + min_rmap[3] = size & 0xff; } } @@ -911,6 +955,8 @@ static void __init prom_send_capabilities(void) fixup_nr_cores(); + fixup_rma_size(); + /* try calling the ibm,client-architecture-support method */ prom_printf("Calling ibm,client-architecture-support..."); if (call_prom_ret("call-method", 3, 2, , @@ -946,6 +992,52 @@ static void __init prom_send_capabilities(void) } #endif /* __BIG_ENDIAN__ */ } + +static void __init increase_rma_size(void) +{ + int rc; + u64 size; + char str[64]; + phandle optnode; + + optnode = call_prom("finddevice", 1, 1, ADDR("/options")); + if (!PHANDLE_VALID(optnode)) + prom_panic("Cannot find /options"); + + /* +* If we already increased the RMA size, return. 
+*/ + size = 0ULL; + memset(str, 0, sizeof(str)); + rc = prom_getprop(optnode, "ibm,new-rma-size", , sizeof(str)); + + size = prom_strtoul(str, NULL); + if (size == 512ULL) { Is this preferred over strncmp? Using a string also helps with my suggestion below... + prom_printf("RMA
Re: [PATCH] cpufreq: powernv: Redesign the presentation of throttle notification
On 12/13/2015 12:17 PM, Shilpasri G Bhat wrote: Replace the throttling event console messages to perf trace event "power:powernv_throttle" and throttle counter stats which are exported in sysfs. The newly added sysfs files are as follows: 1)/sys/devices/system/node/node0/throttle_frequencies This gives the throttle stats for each of the available frequencies. The throttle stat of a frequency is the total number of times the max frequency was reduced to that frequency. # cat /sys/devices/system/node/node0/throttle_frequencies 4023000 0 399 0 3956000 1 3923000 0 389 0 3857000 2 3823000 0 379 0 3757000 2 3724000 1 369 1 ... Is this data useful? It seems like "elapsed time" at each frequency might be more useful, if any. 2)/sys/devices/system/node/node0/throttle_reasons This gives the stats for each of the supported throttle reasons. This gives the total number of times the frequency was throttled due to each of the reasons. # cat /sys/devices/system/node/node0/throttle_reasons No throttling 7 Power Cap 0 Processor Over Temperature 7 Power Supply Failure 0 Over Current 0 OCC Reset 0 3)/sys/devices/system/node/node0/throttle_stat This gives the total number of throttle events occurred in turbo range of frequencies and non-turbo(below nominal) range of frequencies. non-turbo should read "at or below nominal". Maybe "sub-turbo" is a better term(?) # cat /sys/devices/system/node/node0/throttle_stat Turbo 7 Nominal 0 Should this read "Non-turbo" or "Sub-turbo" instead of "Nominal", since the events could well occur when already operating below nominal. 
Signed-off-by: Shilpasri G Bhat--- drivers/cpufreq/powernv-cpufreq.c | 186 +- include/trace/events/power.h | 22 + 2 files changed, 166 insertions(+), 42 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index cb50138..bdde9d6 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -28,6 +28,9 @@ #include #include #include +#include +#include +#include #include #include @@ -43,12 +46,27 @@ static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1]; static bool rebooting, throttled, occ_reset; +static char throttle_reason[][30] = { + "No throttling", + "Power Cap", + "Processor Over Temperature", + "Power Supply Failure", + "Over Current", + "OCC Reset" +}; I'm curious if this would be slightly more efficiently implemented as: static const char *throttle_reason[] = { ... }; Do you need 30 characters per string for a reason? Regardless, it should be const. [...] -- PC ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2] powerpc: re-enable dynticks
It appears the discussion wrapped up on this, favorably. Is there
anything else I need to do to get this merged?

Regards,
PC

On 02/20/2015 11:18 PM, Paul E. McKenney wrote:
> On Fri, Feb 20, 2015 at 11:13:33AM -0600, Paul Clarke wrote:
>> implement arch_irq_work_has_interrupt() for powerpc
>>
>> (resending because I messed up the e-mail addresses)
>>
>> Commit 9b01f5bf3 introduced a dependency on IRQ work self-IPIs for
>> full dynamic ticks to be enabled, by expecting architectures to
>> implement a suitable arch_irq_work_has_interrupt() routine. Several
>> arches have implemented this routine, including x86 (3010279f) and
>> arm (09f6edd4), but powerpc was omitted. This patch implements this
>> routine for powerpc.
>>
>> The symptom: at boot (on powerpc arch systems) with nohz_full=<CPU list>,
>> this is displayed:
>>   NO_HZ: Can't run full dynticks because arch doesn't support irq work self-IPIs
>> after this patch:
>>   NO_HZ: Full dynticks CPUs: <CPU list>
>>
>> Tested against 3.19.
>>
>> v2: changed "return 1" to "return true", per Michael Ellerman
>>
>> CC: Frederic Weisbecker fweis...@gmail.com
>> CC: Paul E. McKenney paul...@linux.vnet.ibm.com
>> Signed-off-by: Paul A. Clarke p...@us.ibm.com
>
> Reviewed-by: Paul E. McKenney paul...@linux.vnet.ibm.com
>
>> diff --git a/arch/powerpc/include/asm/irq_work.h b/arch/powerpc/include/asm/irq_work.h
>> new file mode 100644
>> index 000..99cc0aa
>> --- /dev/null
>> +++ b/arch/powerpc/include/asm/irq_work.h
>> @@ -0,0 +1,11 @@
>> +#ifndef _ASM_IRQ_WORK_H
>> +#define _ASM_IRQ_WORK_H
>> +
>> +#include <asm/processor.h>
>> +
>> +static inline bool arch_irq_work_has_interrupt(void)
>> +{
>> +	return true;
>> +}
>> +
>> +#endif /* _ASM_IRQ_WORK_H */
[PATCH v2] powerpc: re-enable dynticks
implement arch_irq_work_has_interrupt() for powerpc

Commit 9b01f5bf3 introduced a dependency on IRQ work self-IPIs for full
dynamic ticks to be enabled, by expecting architectures to implement a
suitable arch_irq_work_has_interrupt() routine. Several arches have
implemented this routine, including x86 (3010279f) and arm (09f6edd4),
but powerpc was omitted. This patch implements this routine for powerpc.

The symptom: at boot (on powerpc arch systems) with nohz_full=<CPU list>,
this is displayed:
  NO_HZ: Can't run full dynticks because arch doesn't support irq work self-IPIs
after this patch:
  NO_HZ: Full dynticks CPUs: <CPU list>

Tested against 3.19.

v2: changed "return 1" to "return true", per Michael Ellerman

CC: Frederic Weisbecker fweis...@gmail.com
CC: Paul E. McKenney paul...@linux.vnet.ibm.com
Signed-off-by: Paul A. Clarke p...@us.ibm.com

diff --git a/arch/powerpc/include/asm/irq_work.h b/arch/powerpc/include/asm/irq_work.h
new file mode 100644
index 000..99cc0aa
--- /dev/null
+++ b/arch/powerpc/include/asm/irq_work.h
@@ -0,0 +1,11 @@
+#ifndef _ASM_IRQ_WORK_H
+#define _ASM_IRQ_WORK_H
+
+#include <asm/processor.h>
+
+static inline bool arch_irq_work_has_interrupt(void)
+{
+	return true;
+}
+
+#endif /* _ASM_IRQ_WORK_H */
[PATCH] powerpc: re-enable dynticks
implement arch_irq_work_has_interrupt() for powerpc

Commit 9b01f5bf3 introduced a dependency on IRQ work self-IPIs for full
dynamic ticks to be enabled, by expecting architectures to implement a
suitable arch_irq_work_has_interrupt() routine. Several arches have
implemented this routine, including x86 (3010279f) and arm (09f6edd4),
but powerpc was omitted. This patch implements this routine for powerpc.

The symptom: at boot (on powerpc arch systems) with nohz_full=<CPU list>,
this is displayed:
  NO_HZ: Can't run full dynticks because arch doesn't support irq work self-IPIs
after this patch:
  NO_HZ: Full dynticks CPUs: <CPU list>

Tested against 3.19.

CC: Frederic Weisbecker fweis...@gmail.com
CC: Paul E. McKenney paul...@linux.vnet.ibm.com
Signed-off-by: Paul A. Clarke p...@us.ibm.com

diff --git a/arch/powerpc/include/asm/irq_work.h b/arch/powerpc/include/asm/irq_work.h
new file mode 100644
index 000..18365ec
--- /dev/null
+++ b/arch/powerpc/include/asm/irq_work.h
@@ -0,0 +1,11 @@
+#ifndef _ASM_IRQ_WORK_H
+#define _ASM_IRQ_WORK_H
+
+#include <asm/processor.h>
+
+static inline bool arch_irq_work_has_interrupt(void)
+{
+	return 1;
+}
+
+#endif /* _ASM_IRQ_WORK_H */

--
Regards,
Paul Clarke, IBM
Re: [PATCH] powerpc: Reword the returning from prom_init message
On 01/12/2015 03:48 AM, Michael Ellerman wrote:
> We get way too many bug reports that say "the kernel is hung in
> prom_init", which stems from the fact that the last piece of output
> people see is "returning from prom_init". The kernel is almost never
> hung in prom_init(), it's just that it's crashed somewhere after
> prom_init() but prior to the console coming up.
>
> The existing message should give a clue to that, ie. "returning from"
> indicates that prom_init() has finished, but it doesn't seem to work.
> Let's try something different.
>
> This prints:
>   Calling quiesce...
>   Shutting down Open Firmware, booting Linux via __start() ...
>
> Which hopefully makes it clear that prom_init() is not the problem,
> and although __start() probably isn't either, it's at least the right
> place to begin looking.

I'm very much in favor of anything that increases usability or decreases
confusion.

I worry about confusion caused by any phrase that begins "Shutting
down...". Even if accurate, I wonder if something more positive,
something that indicates that, at least at this point, things are
continuing to proceed along, would be preferred. Perhaps something like:

  Transferring control from Open Firmware to kernel (via __start)

(Do we know for certain that what is about to be invoked is Linux per
se, or can it be something else that has a __start?)

PC
Re: [PATCH] powerpc/powernv: Support OPAL requested heartbeat
On 11/12/2014 12:03 AM, Benjamin Herrenschmidt wrote:
> If OPAL requests it, call it back via opal_poll_events() at a regular
> interval. Some versions of OPAL on some machines require this to
> operate some internal timeouts properly.

I don't see any reason why not, but wanted to ask to make sure: this
daemon can be bound to a core, correct? At least manually by taskset or
similar?

It sounds a lot like the RTAS thread, which I was happy to see go away
when running directly on OPAL, and I don't want something else to take
its place. It was not bindable, but ran periodically on EVERY core, err,
hardware thread. (Just looking to avoid latency blips.)

PC
Re: powerpc: mitigate impact of decrementer reset
On 11/12/2014 08:39 PM, Michael Ellerman wrote:
> On Wed, 2014-11-05 at 11:06 -0600, Paul Clarke wrote:
>> On 10/07/2014 09:52 PM, Michael Ellerman wrote:
>>> On Tue, 2014-07-10 at 19:13:24 UTC, Paul Clarke wrote:
>>>> This patch short-circuits the reset of the decrementer, exiting
>>>> after the decrementer reset, but before the housekeeping tasks if
>>>> the only need for the interrupt is simply to reset it. After this
>>>> patch, the latency spike was measured at about 150 nanoseconds.
>>>
>>> Thanks for the excellent changelog. But this patch makes me a bit
>>> nervous :) Do you know where the latency is coming from? Is it
>>> primarily the irq work?
>>
>> Yes, it is all under irq_enter (measured at ~10us) and irq_exit (~12us).
>
> Hmm, OK. I actually meant irq_work_run(). AIUI irq_enter/exit() are
> just state tracking, they shouldn't be actually running work. How are
> you measuring it?

ftrace function_graph tracer:

 127.425212 |             |  .irq_enter() {
 127.425213 |             |    .rcu_irq_enter() {
 127.425213 | + 12.206 us |      .rcu_eqs_exit_common.isra.41();
 127.425226 | + 12.750 us |    }

... RCU is a big hitter

 127.425226 |             |    .vtime_common_account_irq_enter() {
 127.425226 |             |      .vtime_account_user() {
 127.425226 |   0.032 us  |        ._raw_spin_lock();
 127.425227 |   0.034 us  |        .get_vtime_delta();
 127.425227 |             |        .account_user_time() {
 127.425228 |   0.030 us  |          .cpuacct_account_field();
 127.425228 |             |          .acct_account_cputime() {
 127.425228 |   0.082 us  |            .__acct_update_integrals();
 127.425229 |   0.562 us  |          }
 127.425229 |   1.500 us  |        }
 127.425229 |   2.954 us  |      }
 127.425230 |   3.434 us  |    }

... but even accounting is not insignificant

 127.425230 | + 17.218 us |  }
 127.425230 |             |  /* timer_interrupt_entry: [...] */

... nothing to see here, because there's nothing to do except reset the
decrementer

 127.425230 |             |  /* timer_interrupt_exit: [...] */

... (less than 1 us spent doing the required work)

 127.425231 |             |  .irq_exit() {
 127.425231 |             |    .vtime_gen_account_irq_exit() {
 127.425231 |   0.036 us  |      ._raw_spin_lock();
 127.425232 |             |      .__vtime_account_system() {
 127.425232 |   0.030 us  |        .get_vtime_delta();
 127.425232 |             |        .account_system_time() {
 127.425233 |   0.030 us  |          .cpuacct_account_field();
 127.425233 |             |          .acct_account_cputime() {
 127.425233 |   0.072 us  |            .__acct_update_integrals();
 127.425234 |   0.564 us  |          }
 127.425234 |   1.546 us  |        }
 127.425234 |   2.528 us  |      }
 127.425235 |   3.700 us  |    }

... significant accounting time

 127.425235 |   0.032 us  |    .idle_cpu();
 127.425235 |             |    .tick_nohz_irq_exit() {
 127.425236 |             |      .can_stop_full_tick() {
 127.425236 |   0.022 us  |        .sched_can_stop_tick();
 127.425236 |   0.020 us  |        .posix_cpu_timers_can_stop_tick()
 127.425237 |   0.970 us  |      }
 127.425237 |   0.082 us  |      .ktime_get();
 127.425238 |             |      .tick_nohz_stop_sched_tick() {
 127.425238 |   0.032 us  |        .timekeeping_max_deferment();
 127.425238 |             |        .get_next_timer_interrupt() {
 127.425239 |   0.038 us  |          ._raw_spin_lock();
 127.425239 |             |          .hrtimer_get_next_event() {
 127.425239 |   0.030 us  |            ._raw_spin_lock_irqsave();
 127.425240 |   0.028 us  |            ._raw_spin_unlock_irqrestore
 127.425240 |   0.984 us  |          }
 127.425241 |   1.936 us  |        }
 127.425241 |   0.032 us  |        .scheduler_tick_max_deferment();
 127.425241 |   3.438 us  |      }
 127.425242 |   5.880 us  |    }
 127.425242 |             |    .rcu_irq_exit() {
 127.425242 |   0.102 us  |      .rcu_eqs_enter_common.isra.40();
 127.425243 |   0.576 us  |    }
 127.425243 | + 12.156 us |  }

This one was almost 30 us total (17.218 + 12.156 = 29.374 us), just to
reset the decrementer.

>>> If so I'd prefer if we could move the short circuit into
>>> __timer_interrupt() itself. That way we'd still have the trace
>>> points usable, and it would hopefully result in less duplicated
>>> logic.
>>
>> But irq_enter and irq_exit are called in timer_interrupt, before
>> __timer_interrupt is called. I don't see how that helps. The time
>> spent in __timer_interrupt is minuscule by comparison.
>
> Right, it won't help if it's irq_enter() that is causing the delay.
> But I was assuming it was irq_work_run().

Are you suggesting that irq_enter/exit be moved into __timer_interrupt
as well? (I'm not sure how that would impact the existing call to
__timer_interrupt from tick_broadcast_ipi_handler
Re: [PATCH] powerpc: mitigate impact of decrementer reset
On 11/10/2014 04:08 AM, Benjamin Herrenschmidt wrote:
> On Tue, 2014-10-07 at 14:13 -0500, Paul Clarke wrote:
>> The POWER ISA defines an always-running decrementer which can be used
>> to schedule interrupts after a certain time interval has elapsed. The
>> decrementer counts down at the same frequency as the Time Base, which
>> is 512 MHz. The maximum value of the decrementer is 0x7fffffff. This
>> works out to a maximum interval of about 4.19 seconds. If a larger
>> interval is desired, the kernel will set the decrementer to its
>> maximum value and reset it after it expires (underflows) a sufficient
>> number of times until the desired interval has elapsed. The negative
>> effect of this is that an unwanted latency spike will impact normal
>> processing at most every 4.19 seconds. On an IBM POWER8-based system,
>> this spike was measured at about 25-30 microseconds, much of which
>> was basic, opportunistic housekeeping tasks that could otherwise have
>> waited.
>>
>> This patch short-circuits the reset of the decrementer, exiting after
>> the decrementer reset, but before the housekeeping tasks if the only
>> need for the interrupt is simply to reset it. After this patch, the
>> latency spike was measured at about 150 nanoseconds.
>
> Doesn't this break the irq_work stuff ? We trigger it with a
> set_dec(1); and your patch will probably cause it to be skipped...

You're right.

I'm confused by the division between timer_interrupt() and
__timer_interrupt(). The former is called with interrupts disabled (and
enables them), but also calls irq_enter()/irq_exit(). Why are those
calls not in __timer_interrupt()? (If they were, the short-circuit
logic might be a bit easier to put directly in __timer_interrupt(),
which would eliminate any duplicate code.)

It looks like __timer_interrupt is only called directly by the broadcast
timer IPI handler. (Why is __timer_interrupt not static?) Does this path
not need irq_enter/irq_exit?

>> Signed-off-by: Paul A. Clarke p...@us.ibm.com
>> ---
>>  arch/powerpc/kernel/time.c | 13 +
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
>> index 368ab37..962a06b 100644
>> --- a/arch/powerpc/kernel/time.c
>> +++ b/arch/powerpc/kernel/time.c
>> @@ -528,6 +528,7 @@ void timer_interrupt(struct pt_regs * regs)
>>  {
>>  	struct pt_regs *old_regs;
>>  	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
>> +	u64 now;
>>
>>  	/* Ensure a positive value is written to the decrementer, or else
>>  	 * some CPUs will continue to take decrementer exceptions.
>> @@ -550,6 +551,18 @@ void timer_interrupt(struct pt_regs * regs)
>>  	 */
>>  	may_hard_irq_enable();
>>
>> +	/* If this is simply the decrementer expiring (underflow) due to
>> +	 * the limited size of the decrementer, and not a set timer,
>> +	 * reset (if needed) and return
>> +	 */
>> +	now = get_tb_or_rtc();
>> +	if (now < *next_tb) {
>> +		now = *next_tb - now;
>> +		if (now <= DECREMENTER_MAX)
>> +			set_dec((int)now);
>> +		__get_cpu_var(irq_stat).timer_irqs_others++;
>> +		return;
>> +	}
>>
>>  #if defined(CONFIG_PPC32) && defined(CONFIG_PPC_PMAC)
>>  	if (atomic_read(&ppc_n_lost_interrupts) != 0)

Regards,
PC
Re: powerpc: mitigate impact of decrementer reset
Sorry it took me so long to get back to this... On 10/07/2014 09:52 PM, Michael Ellerman wrote: On Tue, 2014-07-10 at 19:13:24 UTC, Paul Clarke wrote: The POWER ISA defines an always-running decrementer which can be used to schedule interrupts after a certain time interval has elapsed. The decrementer counts down at the same frequency as the Time Base, which is 512 MHz. The maximum value of the decrementer is 0x7fff. This works out to a maximum interval of about 4.19 seconds. If a larger interval is desired, the kernel will set the decrementer to its maximum value and reset it after it expires (underflows) a sufficient number of times until the desired interval has elapsed. The negative effect of this is that an unwanted latency spike will impact normal processing at most every 4.19 seconds. On an IBM POWER8-based system, this spike was measured at about 25-30 microseconds, much of which was basic, opportunistic housekeeping tasks that could otherwise have waited. This patch short-circuits the reset of the decrementer, exiting after the decrementer reset, but before the housekeeping tasks if the only need for the interrupt is simply to reset it. After this patch, the latency spike was measured at about 150 nanoseconds. Thanks for the excellent changelog. But this patch makes me a bit nervous :) Do you know where the latency is coming from? Is it primarily the irq work? Yes, it is all under irq_enter (measured at ~10us) and irq_exit (~12us). If so I'd prefer if we could move the short circuit into __timer_interrupt() itself. That way we'd still have the trace points usable, and it would hopefully result in less duplicated logic. But irq_enter and irq_exit are called in timer_interrupt, before __timer_interrupt is called. I don't see how that helps. The time spent in __timer_interrupt is minuscule by comparison. Are you suggesting that irq_enter/exit be moved into __timer_interrupt as well? 
(I'm not sure how that would impact the existing call to __timer_interrupt from tick_broadcast_ipi_handler? And if there is no impact, what's the point of separating timer_interrupt and __timer_interrupt?) Regards, PC ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: mitigate impact of decrementer reset
On 10/08/2014 12:37 AM, Heinz Wrobel wrote:
> what if your tb wraps during the test?

Per the Power ISA, Time Base is 64 bits, monotonically increasing, and
is writable only in hypervisor state. To my understanding, it is set to
zero at boot (although this is not prescribed). Also, as noted by
others, the logic is roughly duplicated (with some differences) from
the analogous code in __timer_interrupt just above it.

I don't see wrapping as a concern.

PC
[PATCH] powerpc: mitigate impact of decrementer reset
The POWER ISA defines an always-running decrementer which can be used to
schedule interrupts after a certain time interval has elapsed. The
decrementer counts down at the same frequency as the Time Base, which is
512 MHz. The maximum value of the decrementer is 0x7fffffff. This works
out to a maximum interval of about 4.19 seconds.

If a larger interval is desired, the kernel will set the decrementer to
its maximum value and reset it after it expires (underflows) a
sufficient number of times until the desired interval has elapsed. The
negative effect of this is that an unwanted latency spike will impact
normal processing at most every 4.19 seconds. On an IBM POWER8-based
system, this spike was measured at about 25-30 microseconds, much of
which was basic, opportunistic housekeeping tasks that could otherwise
have waited.

This patch short-circuits the reset of the decrementer, exiting after
the decrementer reset, but before the housekeeping tasks if the only
need for the interrupt is simply to reset it. After this patch, the
latency spike was measured at about 150 nanoseconds.

Signed-off-by: Paul A. Clarke p...@us.ibm.com
---
 arch/powerpc/kernel/time.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 368ab37..962a06b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -528,6 +528,7 @@ void timer_interrupt(struct pt_regs * regs)
 {
 	struct pt_regs *old_regs;
 	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
+	u64 now;
 
 	/* Ensure a positive value is written to the decrementer, or else
 	 * some CPUs will continue to take decrementer exceptions.
@@ -550,6 +551,18 @@ void timer_interrupt(struct pt_regs * regs)
 	 */
 	may_hard_irq_enable();
 
+	/* If this is simply the decrementer expiring (underflow) due to
+	 * the limited size of the decrementer, and not a set timer,
+	 * reset (if needed) and return
+	 */
+	now = get_tb_or_rtc();
+	if (now < *next_tb) {
+		now = *next_tb - now;
+		if (now <= DECREMENTER_MAX)
+			set_dec((int)now);
+		__get_cpu_var(irq_stat).timer_irqs_others++;
+		return;
+	}
 
 #if defined(CONFIG_PPC32) && defined(CONFIG_PPC_PMAC)
 	if (atomic_read(&ppc_n_lost_interrupts) != 0)
-- 
2.1.2.330.g565301e