Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

2020-03-02 Thread Paul Clarke
On 3/1/20 11:23 PM, Ravi Bangoria wrote:
> Most modern microprocessors employ complex instruction execution
> pipelines such that many instructions can be 'in flight' at any
> given point in time. Various factors affect this pipeline, and
> hazards are primary among them. Different types of hazards
> exist - data hazards, structural hazards and control hazards.
> A data hazard arises when data dependencies exist between
> instructions in different stages of the pipeline. A structural
> hazard arises when the same processor hardware is needed by more
> than one instruction in flight at the same time. Control hazards
> arise from branches, e.g. branch mispredictions.
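
[Editorial aside, not part of the RFC: the data-hazard case above can be made concrete with a toy model. A back-to-back load and use of the same register forces a stall, which is exactly the kind of event the proposed sampling would attribute to a pipeline stage and reason. Opcode and register names here are hypothetical.]

```python
# Toy pipeline model: count load-use data-hazard stalls in a linear
# instruction stream, assuming a 1-cycle load-use penalty.
# Purely illustrative; not how any real PMU counts hazards.
def count_load_use_stalls(insts):
    """insts: list of (opcode, dest_reg, src_regs) tuples, in program order."""
    stalls = 0
    prev_op, prev_dest = None, None
    for op, dest, srcs in insts:
        if prev_op == "load" and prev_dest in srcs:
            stalls += 1  # consumer needs the loaded value before it is ready
        prev_op, prev_dest = op, dest
    return stalls
```

For example, a load into r1 immediately followed by an add reading r1 counts as one stall; independent instructions count none.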
> 
> Information about these hazards is critical for analyzing
> performance issues and for tuning software to overcome them.
> Modern processors export such hazard data in Performance
> Monitoring Unit (PMU) registers. For example, the 'Sampled
> Instruction Event Register' on IBM PowerPC[1][2] and
> 'Instruction-Based Sampling' on AMD[3] provide similar information.
> 
> Implementation detail:
> 
> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
> If it is set, the kernel converts arch-specific hazard information
> into a generic format:
> 
>   struct perf_pipeline_haz_data {
>  /* Instruction/Opcode type: Load, Store, Branch  */
>  __u8    itype;

At the risk of bike-shedding (in an RFC, no less), "itype" doesn't convey 
enough meaning to me.  "inst_type"?  I see in 03/11, you use "perf_inst_type".

>  /* Instruction Cache source */
>  __u8    icache;

Possibly same here, and you use "perf_inst_cache" in 03/11.

>  /* Instruction suffered hazard in pipeline stage */
>  __u8    hazard_stage;
>  /* Hazard reason */
>  __u8    hazard_reason;
>  /* Instruction suffered stall in pipeline stage */
>  __u8    stall_stage;
>  /* Stall reason */
>  __u8    stall_reason;
>  __u16   pad;
>   };
> 
> ... which can be read by the user from the mmap() ring buffer. With
> this approach, a sample perf report in hazard mode looks like this
> (on IBM PowerPC):
> 
>   # ./perf record --hazard ./ebizzy
>   # ./perf report --hazard
>   Overhead  Symbol          Shared  Instruction Type  Hazard Stage  Hazard Reason         Stall Stage  Stall Reason  ICache access
>     36.58%  [.] thread_run  ebizzy  Load              LSU           Mispredict            LSU          Load fin      L1 hit
>      9.46%  [.] thread_run  ebizzy  Load              LSU           Mispredict            LSU          Dcache_miss   L1 hit
>      1.76%  [.] thread_run  ebizzy  Fixed point       -             -                     -            -             L1 hit
>      1.31%  [.] thread_run  ebizzy  Load              LSU           ERAT Miss             LSU          Load fin      L1 hit
>      1.27%  [.] thread_run  ebizzy  Load              LSU           Mispredict            -            -             L1 hit
>      1.16%  [.] thread_run  ebizzy  Fixed point       -             -                     FXU          Fixed cycle   L1 hit
>      0.50%  [.] thread_run  ebizzy  Fixed point       ISU           Source Unavailable    FXU          Fixed cycle   L1 hit
>      0.30%  [.] thread_run  ebizzy  Load              LSU           LMQ Full, DERAT Miss  LSU          Load fin      L1 hit
>      0.24%  [.] thread_run  ebizzy  Load              LSU           ERAT Miss             -            -             L1 hit
>      0.08%  [.] thread_run  ebizzy  -                 -             -                     BRU          Fixed cycle   L1 hit
>      0.05%  [.] thread_run  ebizzy  Branch            -             -                     BRU          Fixed cycle   L1 hit
>      0.04%  [.] thread_run  ebizzy  Fixed point       ISU           Source Unavailable    -            -             L1 hit

How are these to be interpreted?  This is great information, but is it possible 
to make it more readable for non-experts?  If each of these map 1:1 with 
hardware events, should you emit the name of the event here, so that can be 
used to look up further information?  For example, does the first line map to 
PM_CMPLU_STALL_LSU_FIN?
What was "Mispredict[ed]"? (Is it different from a branch misprediction?) And 
how does this relate to "L1 hit"?
Can we emit "Load finish" instead of "Load fin" for easier reading?  03/11 also 
has "Marked fin before NTC".
Nit: why does "Dcache_miss" have an underscore and none of the others?

> Also perf annotate with hazard data:

>  │static int
>  │compare(const void *p1, const void *p2)
>  │{
>33.23 │  std     r31,-8(r1)
>  │   {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, 
> stall_reason: Store, icache: L1 hit}
>  │   {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, 
> stall_reason: Store, icache: L1 hit}
>  │   {haz_stage: LSU, 
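
[Editorial aside, not part of the RFC: assuming the proposed struct is emitted packed and little-endian (six __u8 fields plus a __u16 pad, 8 bytes), a consumer reading the mmap() ring buffer could decode each record along these lines. This is a sketch, not the RFC's actual tooling.]

```python
import struct

# Field order follows struct perf_pipeline_haz_data from the RFC:
# itype, icache, hazard_stage, hazard_reason, stall_stage,
# stall_reason, then a __u16 pad. Packed little-endian is assumed.
HAZ_FMT = "<6BH"
HAZ_SIZE = struct.calcsize(HAZ_FMT)  # 8 bytes

def decode_haz(buf):
    """Decode one 8-byte hazard record into a dict of raw field values."""
    fields = struct.unpack_from(HAZ_FMT, buf)
    keys = ("itype", "icache", "hazard_stage", "hazard_reason",
            "stall_stage", "stall_reason")
    return dict(zip(keys, fields[:6]))  # drop the pad
```

The raw values would still need the arch-specific enums from the RFC (e.g. perf_inst_type) to be rendered as names like "Load" or "ERAT Miss".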

Re: [RFC] powerpc/pseries/hcall: remove the save/restore of CR

2019-09-27 Thread Paul Clarke
On 9/26/19 10:48 PM, Lijun Pan wrote:
> According to the PAPR, hcalls should not modify the Condition
> Register fields, hence save/restore the CR is not necessary.

Just curious: could you provide a more specific reference?

PC


sys_exit: NR 0 (was "sys_exit: NR -1")

2019-06-13 Thread Paul Clarke
I'm seeing another asymmetry, this one is on powerpc:
--
  perf 30179 [001] 10374622.667436: raw_syscalls:sys_enter: NR 11 (3fffe16ef55e, [...])
    ls 30179 [002] 10374622.667921:  raw_syscalls:sys_exit: NR 0 = 0
--

Syscall "11" is "execve", but is it expected that the sys_exit will report the 
ID as 0?  Is that because of the nature of "execve", to overlay the entire 
contents of the process?

This does not seem to occur on x86.

PC



Re: sys_exit: NR -1

2019-06-13 Thread Paul Clarke
On 6/13/19 4:00 PM, Paul Clarke wrote:
> On 6/12/19 1:32 AM, Naveen N. Rao wrote:
>> Paul Clarke wrote:
>>> What are the circumstances in which raw_syscalls:sys_exit reports "-1" for 
>>> the syscall ID?
>>>
>>>     perf  5375 [007] 59632.478528:   raw_syscalls:sys_enter: NR 1 (3, 9fb888, 8, 2d83740, 1, 7)
>>>     perf  5375 [007] 59632.478532:    raw_syscalls:sys_exit: NR 1 = 8
>>>     perf  5375 [007] 59632.478538:   raw_syscalls:sys_enter: NR 15 (11, 7ca734b0, 7ca73380, 2d83740, 1, 7)
>>>     perf  5375 [007] 59632.478539:    raw_syscalls:sys_exit: NR -1 = 8
>>>     perf  5375 [007] 59632.478543:   raw_syscalls:sys_enter: NR 16 (4, 2401, 0, 2d83740, 1, 0)
>>>     perf  5375 [007] 59632.478551:    raw_syscalls:sys_exit: NR 16 = 0
>>
>> Which architecture?
>> For powerpc, see:
>>
>> static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
>> {
>> /*
>>  * Note that we are returning an int here. That means 0xffffffff, ie.
>>  * 32-bit negative 1, will be interpreted as -1 on a 64-bit kernel.
>>  * This is important for seccomp so that compat tasks can set r0 = -1
>>  * to reject the syscall.
>>  */
>> return TRAP(regs) == 0xc00 ? regs->gpr[0] : -1;
>> }
> 
> So, that's intentional?  And has some special meaning?  (I confess I don't 
> understand what the comment is saying exactly.)
> 
> Is this documented?  Does something depend on this ABI?
> 
> To me, it just makes parsing more difficult, both by humans and machines.

I should've noted that the instance I encountered was on x86.

PC



Re: sys_exit: NR -1

2019-06-13 Thread Paul Clarke
On 6/12/19 1:32 AM, Naveen N. Rao wrote:
> Paul Clarke wrote:
>> What are the circumstances in which raw_syscalls:sys_exit reports "-1" for 
>> the syscall ID?
>>
>>     perf  5375 [007] 59632.478528:   raw_syscalls:sys_enter: NR 1 (3, 9fb888, 8, 2d83740, 1, 7)
>>     perf  5375 [007] 59632.478532:    raw_syscalls:sys_exit: NR 1 = 8
>>     perf  5375 [007] 59632.478538:   raw_syscalls:sys_enter: NR 15 (11, 7ca734b0, 7ca73380, 2d83740, 1, 7)
>>     perf  5375 [007] 59632.478539:    raw_syscalls:sys_exit: NR -1 = 8
>>     perf  5375 [007] 59632.478543:   raw_syscalls:sys_enter: NR 16 (4, 2401, 0, 2d83740, 1, 0)
>>     perf  5375 [007] 59632.478551:    raw_syscalls:sys_exit: NR 16 = 0
> 
> Which architecture?
> For powerpc, see:
> 
> static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
> {
> /*
>  * Note that we are returning an int here. That means 0xffffffff, ie.
>  * 32-bit negative 1, will be interpreted as -1 on a 64-bit kernel.
>  * This is important for seccomp so that compat tasks can set r0 = -1
>  * to reject the syscall.
>  */
> return TRAP(regs) == 0xc00 ? regs->gpr[0] : -1;
> }

So, that's intentional?  And has some special meaning?  (I confess I don't 
understand what the comment is saying exactly.)

Is this documented?  Does something depend on this ABI?

To me, it just makes parsing more difficult, both by humans and machines.
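
[Editorial note: the 32-bit sign behavior the kernel comment describes is easy to reproduce directly; a minimal sketch:]

```python
import struct

# The kernel comment's point: a value of 0xffffffff stored in a 32-bit
# int reads back as -1 when interpreted as signed, even on a 64-bit
# kernel. Reinterpret the same 4 bytes as unsigned then signed.
raw = 0xFFFFFFFF
nr = struct.unpack("<i", struct.pack("<I", raw))[0]
```

This is why a compat task setting r0 = -1 (which is 0xffffffff in a 32-bit register) is still seen as -1, the "reject" value, by seccomp.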

PC



[PATCH 4/4] [powerpc] perf vendor events: Add JSON metrics for POWER9

2019-02-09 Thread Paul Clarke
Descriptions of metrics for POWER9 processors can be found in the
"POWER9 Performance Monitor Unit User’s Guide", which is currently
available on the "IBM Portal for OpenPOWER"
(https://www-355.ibm.com/systems/power/openpower/welcome.xhtml) at
https://www-355.ibm.com/systems/power/openpower/posting.xhtml?postingId=4948CDE1963C9BCA852582F800718190

This patch is for metric groups:
- general

and other metrics not in a metric group.

Signed-off-by: Paul A. Clarke 
---
 .../arch/powerpc/power9/metrics.json  | 368 ++
 1 file changed, 368 insertions(+)

diff --git a/tools/perf/pmu-events/arch/powerpc/power9/metrics.json 
b/tools/perf/pmu-events/arch/powerpc/power9/metrics.json
index c39a922aaf84..811c2a8c1c9e 100644
--- a/tools/perf/pmu-events/arch/powerpc/power9/metrics.json
+++ b/tools/perf/pmu-events/arch/powerpc/power9/metrics.json
@@ -813,6 +813,114 @@
 "MetricGroup": "estimated_dcache_miss_cpi",
 "MetricName": "rmem_cpi_percent"
 },
+{
+"BriefDescription": "Branch Mispredict flushes per instruction",
+"MetricExpr": "PM_FLUSH_MPRED / PM_RUN_INST_CMPL * 100",
+"MetricGroup": "general",
+"MetricName": "br_mpred_flush_rate_percent"
+},
+{
+"BriefDescription": "Cycles per instruction",
+"MetricExpr": "PM_CYC / PM_INST_CMPL",
+"MetricGroup": "general",
+"MetricName": "cpi"
+},
+{
+"BriefDescription": "GCT empty cycles",
+"MetricExpr": "(PM_FLUSH_DISP / PM_RUN_INST_CMPL) * 100",
+"MetricGroup": "general",
+"MetricName": "disp_flush_rate_percent"
+},
+{
+"BriefDescription": "% DTLB miss rate per inst",
+"MetricExpr": "PM_DTLB_MISS / PM_RUN_INST_CMPL *100",
+"MetricGroup": "general",
+"MetricName": "dtlb_miss_rate_percent"
+},
+{
+"BriefDescription": "Flush rate (%)",
+"MetricExpr": "PM_FLUSH * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "general",
+"MetricName": "flush_rate_percent"
+},
+{
+"BriefDescription": "Instructions per cycle",
+"MetricExpr": "PM_INST_CMPL / PM_CYC",
+"MetricGroup": "general",
+"MetricName": "ipc"
+},
+{
+"BriefDescription": "% ITLB miss rate per inst",
+"MetricExpr": "PM_ITLB_MISS / PM_RUN_INST_CMPL *100",
+"MetricGroup": "general",
+"MetricName": "itlb_miss_rate_percent"
+},
+{
+"BriefDescription": "Percentage of L1 load misses per L1 load ref",
+"MetricExpr": "PM_LD_MISS_L1 / PM_LD_REF_L1 * 100",
+"MetricGroup": "general",
+"MetricName": "l1_ld_miss_ratio_percent"
+},
+{
+"BriefDescription": "Percentage of L1 store misses per run 
instruction",
+"MetricExpr": "PM_ST_MISS_L1 * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "general",
+"MetricName": "l1_st_miss_rate_percent"
+},
+{
+"BriefDescription": "Percentage of L1 store misses per L1 store ref",
+"MetricExpr": "PM_ST_MISS_L1 / PM_ST_FIN * 100",
+"MetricGroup": "general",
+"MetricName": "l1_st_miss_ratio_percent"
+},
+{
+"BriefDescription": "L2 Instruction Miss Rate (per instruction)(%)",
+"MetricExpr": "PM_INST_FROM_L2MISS * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "general",
+"MetricName": "l2_inst_miss_rate_percent"
+},
+{
+"BriefDescription": "L2 demand Load Miss Rate (per run instruction)(%)",
+"MetricExpr": "PM_DATA_FROM_L2MISS * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "general",
+"MetricName": "l2_ld_miss_rate_percent"
+},
+{
+"BriefDescription": "L2 PTEG Miss Rate (per run instruction)(%)",
+"MetricExpr": "PM_DPTEG_FROM_L2MISS * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "general",
+"MetricName": "l2_pteg_miss_rate_percent"
+},
+{
+"BriefDescription": "L3 Instruction Miss Rate (per instruction)(%)",
+"MetricExpr": "PM_INST_FROM_L3MISS * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "general",
+"MetricName": "l3_inst_miss_rate_percent"
+},
+{
+"BriefDescription": "L3 demand Load Miss Rate (per run 
instruction)(%)",
+"MetricExpr": "PM_DATA_FROM_L3MISS * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "general",
+"MetricName": "l3_ld_miss_rate_percent"
+},
+{
+"BriefDescription": "L3 PTEG Miss Rate (per run instruction)(%)",
+"MetricExpr": "PM_DPTEG_FROM_L3MISS * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "general",
+"MetricName": "l3_pteg_miss_rate_percent"
+},
+{
+"BriefDescription": "Run cycles per cycle",
+"MetricExpr": "PM_RUN_CYC / PM_CYC*100",
+"MetricGroup": "general",
+"MetricName": "run_cycles_percent"
+},
+{
+"BriefDescription": "Instruction dispatch-to-completion ratio",
+"MetricExpr": "PM_INST_DISP / 
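
[Editorial aside: the "general" metrics above are plain ratios of raw event counts, so they are easy to sanity-check by hand. A sketch using invented counts (illustrative values only, not real measurements):]

```python
# Evaluate a few of the 'general' POWER9 metrics from raw event counts,
# mirroring the MetricExpr fields in the patch. Counts are made up.
counts = {
    "PM_CYC": 2_000_000,
    "PM_INST_CMPL": 1_000_000,
    "PM_LD_MISS_L1": 5_000,
    "PM_LD_REF_L1": 250_000,
}

# "cpi": PM_CYC / PM_INST_CMPL
cpi = counts["PM_CYC"] / counts["PM_INST_CMPL"]
# "ipc": PM_INST_CMPL / PM_CYC
ipc = counts["PM_INST_CMPL"] / counts["PM_CYC"]
# "l1_ld_miss_ratio_percent": PM_LD_MISS_L1 / PM_LD_REF_L1 * 100
l1_ld_miss_ratio_percent = counts["PM_LD_MISS_L1"] / counts["PM_LD_REF_L1"] * 100
```

With these sample counts, cpi is 2.0 and ipc is its reciprocal, 0.5, as expected from the two expressions being inverses.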

[PATCH 2/4] [powerpc] perf vendor events: Add JSON metrics for POWER9

2019-02-09 Thread Paul Clarke
Descriptions of metrics for POWER9 processors can be found in the
"POWER9 Performance Monitor Unit User’s Guide", which is currently
available on the "IBM Portal for OpenPOWER"
(https://www-355.ibm.com/systems/power/openpower/welcome.xhtml) at
https://www-355.ibm.com/systems/power/openpower/posting.xhtml?postingId=4948CDE1963C9BCA852582F800718190

This patch is for metric groups:
- dl1_reloads_percent_per_inst
- dl1_reloads_percent_per_ref
- instruction_misses_percent_per_inst
- l2_stats
- l3_stats
- pteg_reloads_percent_per_inst
- pteg_reloads_percent_per_ref

Signed-off-by: Paul A. Clarke 
---
 .../arch/powerpc/power9/metrics.json  | 660 ++
 1 file changed, 660 insertions(+)

diff --git a/tools/perf/pmu-events/arch/powerpc/power9/metrics.json 
b/tools/perf/pmu-events/arch/powerpc/power9/metrics.json
index cd46ebb8da6a..166f95518c45 100644
--- a/tools/perf/pmu-events/arch/powerpc/power9/metrics.json
+++ b/tools/perf/pmu-events/arch/powerpc/power9/metrics.json
@@ -484,6 +484,210 @@
 "MetricGroup": "cpi_breakdown",
 "MetricName": "vfxu_stall_cpi"
 },
+{
+"BriefDescription": "% of DL1 Reloads from Distant L2 or L3 (Modified) 
per Inst",
+"MetricExpr": "PM_DATA_FROM_DL2L3_MOD * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "dl1_reloads_percent_per_inst",
+"MetricName": "dl1_reload_from_dl2l3_mod_rate_percent"
+},
+{
+"BriefDescription": "% of DL1 Reloads from Distant L2 or L3 (Shared) 
per Inst",
+"MetricExpr": "PM_DATA_FROM_DL2L3_SHR * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "dl1_reloads_percent_per_inst",
+"MetricName": "dl1_reload_from_dl2l3_shr_rate_percent"
+},
+{
+"BriefDescription": "% of DL1 Reloads from Distant Memory per Inst",
+"MetricExpr": "PM_DATA_FROM_DMEM * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "dl1_reloads_percent_per_inst",
+"MetricName": "dl1_reload_from_dmem_rate_percent"
+},
+{
+"BriefDescription": "% of DL1 reloads from Private L2, other core per 
Inst",
+"MetricExpr": "PM_DATA_FROM_L21_MOD * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "dl1_reloads_percent_per_inst",
+"MetricName": "dl1_reload_from_l21_mod_rate_percent"
+},
+{
+"BriefDescription": "% of DL1 reloads from Private L2, other core per 
Inst",
+"MetricExpr": "PM_DATA_FROM_L21_SHR * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "dl1_reloads_percent_per_inst",
+"MetricName": "dl1_reload_from_l21_shr_rate_percent"
+},
+{
+"BriefDescription": "% of DL1 reloads from L2 per Inst",
+"MetricExpr": "PM_DATA_FROM_L2MISS * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "dl1_reloads_percent_per_inst",
+"MetricName": "dl1_reload_from_l2_miss_rate_percent"
+},
+{
+"BriefDescription": "% of DL1 reloads from L2 per Inst",
+"MetricExpr": "PM_DATA_FROM_L2 * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "dl1_reloads_percent_per_inst",
+"MetricName": "dl1_reload_from_l2_rate_percent"
+},
+{
+"BriefDescription": "% of DL1 reloads from Private L3 M state, other 
core per Inst",
+"MetricExpr": "PM_DATA_FROM_L31_MOD * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "dl1_reloads_percent_per_inst",
+"MetricName": "dl1_reload_from_l31_mod_rate_percent"
+},
+{
+"BriefDescription": "% of DL1 reloads from Private L3 S state, other 
core per Inst",
+"MetricExpr": "PM_DATA_FROM_L31_SHR * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "dl1_reloads_percent_per_inst",
+"MetricName": "dl1_reload_from_l31_shr_rate_percent"
+},
+{
+"BriefDescription": "% of DL1 Reloads that came from the L3 and were 
brought into the L3 by a prefetch, per instruction completed",
+"MetricExpr": "PM_DATA_FROM_L3_MEPF * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "dl1_reloads_percent_per_inst",
+"MetricName": "dl1_reload_from_l3_mepf_rate_percent"
+},
+{
+"BriefDescription": "% of DL1 reloads from L3 per Inst",
+"MetricExpr": "PM_DATA_FROM_L3MISS * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "dl1_reloads_percent_per_inst",
+"MetricName": "dl1_reload_from_l3_miss_rate_percent"
+},
+{
+"BriefDescription": "% of DL1 Reloads from L3 per Inst",
+"MetricExpr": "PM_DATA_FROM_L3 * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "dl1_reloads_percent_per_inst",
+"MetricName": "dl1_reload_from_l3_rate_percent"
+},
+{
+"BriefDescription": "% of DL1 Reloads from Local Memory per Inst",
+"MetricExpr": "PM_DATA_FROM_LMEM * 100 / PM_RUN_INST_CMPL",
+"MetricGroup": "dl1_reloads_percent_per_inst",
+"MetricName": "dl1_reload_from_lmem_rate_percent"
+},
+{
+"BriefDescription": "% of DL1 reloads from Private L3, other core per 
Inst",
+"MetricExpr": "PM_DATA_FROM_RL2L3_MOD * 

[PATCH 3/4] [powerpc] perf vendor events: Add JSON metrics for POWER9

2019-02-09 Thread Paul Clarke
Descriptions of metrics for POWER9 processors can be found in the
"POWER9 Performance Monitor Unit User’s Guide", which is currently
available on the "IBM Portal for OpenPOWER"
(https://www-355.ibm.com/systems/power/openpower/welcome.xhtml) at
https://www-355.ibm.com/systems/power/openpower/posting.xhtml?postingId=4948CDE1963C9BCA852582F800718190

This patch is for metric groups:
- branch_prediction
- instruction_stats_percent_per_ref
- latency
- lsu_rejects
- memory
- prefetch
- translation

Plus, some whitespace changes.

Signed-off-by: Paul A. Clarke 
---
 .../arch/powerpc/power9/metrics.json  | 403 +-
 1 file changed, 390 insertions(+), 13 deletions(-)

diff --git a/tools/perf/pmu-events/arch/powerpc/power9/metrics.json 
b/tools/perf/pmu-events/arch/powerpc/power9/metrics.json
index 166f95518c45..c39a922aaf84 100644
--- a/tools/perf/pmu-events/arch/powerpc/power9/metrics.json
+++ b/tools/perf/pmu-events/arch/powerpc/power9/metrics.json
@@ -1,4 +1,39 @@
 [
+{
+"MetricExpr": "PM_BR_MPRED_CMPL / PM_BR_PRED * 100",
+"MetricGroup": "branch_prediction",
+"MetricName": "br_misprediction_percent"
+},
+{
+"BriefDescription": "Count cache branch misprediction per instruction",
+"MetricExpr": "PM_BR_MPRED_CCACHE / PM_RUN_INST_CMPL * 100",
+"MetricGroup": "branch_prediction",
+"MetricName": "ccache_mispredict_rate_percent"
+},
+{
+"BriefDescription": "Count cache branch misprediction",
+"MetricExpr": "PM_BR_MPRED_CCACHE / PM_BR_PRED_CCACHE * 100",
+"MetricGroup": "branch_prediction",
+"MetricName": "ccache_misprediction_percent"
+},
+{
+"BriefDescription": "Link stack branch misprediction",
+"MetricExpr": "PM_BR_MPRED_LSTACK / PM_RUN_INST_CMPL * 100",
+"MetricGroup": "branch_prediction",
+"MetricName": "lstack_mispredict_rate_percent"
+},
+{
+"BriefDescription": "Link stack branch misprediction",
+"MetricExpr": "PM_BR_MPRED_LSTACK / PM_BR_PRED_LSTACK * 100",
+"MetricGroup": "branch_prediction",
+"MetricName": "lstack_misprediction_percent"
+},
+{
+"BriefDescription": "% Branches Taken",
+"MetricExpr": "PM_BR_TAKEN_CMPL * 100 / PM_BRU_FIN",
+"MetricGroup": "branch_prediction",
+"MetricName": "taken_branches_percent"
+},
 {
 "BriefDescription": "Completion stall due to a Branch Unit",
 "MetricExpr": "PM_CMPLU_STALL_BRU/PM_RUN_INST_CMPL",
@@ -881,13 +916,121 @@
 "MetricName": "l1_inst_miss_rate_percent"
 },
 {
-"BriefDescription": "%L2 Modified CO  Cache read Utilization (4 pclks 
per disp attempt)",
+"BriefDescription": "Icache Fetches per Icache Miss",
+"MetricExpr": "(PM_L1_ICACHE_MISS - PM_IC_PREF_WRITE) / 
PM_L1_ICACHE_MISS",
+"MetricGroup": "instruction_stats_percent_per_ref",
+"MetricName": "icache_miss_reload"
+},
+{
+"BriefDescription": "% of ICache reloads due to prefetch",
+"MetricExpr": "PM_IC_PREF_WRITE * 100 / PM_L1_ICACHE_MISS",
+"MetricGroup": "instruction_stats_percent_per_ref",
+"MetricName": "icache_pref_percent"
+},
+{
+"BriefDescription": "% of ICache reloads from Distant L2 or L3 
(Modified)",
+"MetricExpr": "PM_INST_FROM_DL2L3_MOD * 100 / PM_L1_ICACHE_MISS",
+"MetricGroup": "instruction_stats_percent_per_ref",
+"MetricName": "inst_from_dl2l3_mod_percent"
+},
+{
+"BriefDescription": "% of ICache reloads from Distant L2 or L3 
(Shared)",
+"MetricExpr": "PM_INST_FROM_DL2L3_SHR * 100 / PM_L1_ICACHE_MISS",
+"MetricGroup": "instruction_stats_percent_per_ref",
+"MetricName": "inst_from_dl2l3_shr_percent"
+},
+{
+"BriefDescription": "% of ICache reloads from Distant L4",
+"MetricExpr": "PM_INST_FROM_DL4 * 100 / PM_L1_ICACHE_MISS",
+"MetricGroup": "instruction_stats_percent_per_ref",
+"MetricName": "inst_from_dl4_percent"
+},
+{
+"BriefDescription": "% of ICache reloads from Distant Memory",
+"MetricExpr": "PM_INST_FROM_DMEM * 100 / PM_L1_ICACHE_MISS",
+"MetricGroup": "instruction_stats_percent_per_ref",
+"MetricName": "inst_from_dmem_percent"
+},
+{
+"BriefDescription": "% of ICache reloads from Private L2, other core",
+"MetricExpr": "PM_INST_FROM_L21_MOD * 100 / PM_L1_ICACHE_MISS",
+"MetricGroup": "instruction_stats_percent_per_ref",
+"MetricName": "inst_from_l21_mod_percent"
+},
+{
+"BriefDescription": "% of ICache reloads from Private L2, other core",
+"MetricExpr": "PM_INST_FROM_L21_SHR * 100 / PM_L1_ICACHE_MISS",
+"MetricGroup": "instruction_stats_percent_per_ref",
+"MetricName": "inst_from_l21_shr_percent"
+},
+{
+"BriefDescription": "% of ICache reloads from L2",
+
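
[Editorial aside: the branch_prediction group in this patch pairs a "rate" (mispredicts per completed instruction) with a "percent" (mispredicts per prediction). A quick sketch with invented counts shows why the two differ:]

```python
# Distinguish the two branch_prediction metric flavors from the patch,
# using invented event counts (not real measurements).
PM_BR_MPRED_CCACHE = 1_000    # count-cache mispredictions
PM_BR_PRED_CCACHE = 50_000    # count-cache predictions made
PM_RUN_INST_CMPL = 1_000_000  # completed instructions

# "ccache_mispredict_rate_percent": mispredicts per completed instruction
ccache_mispredict_rate_percent = PM_BR_MPRED_CCACHE / PM_RUN_INST_CMPL * 100
# "ccache_misprediction_percent": mispredicts per count-cache prediction
ccache_misprediction_percent = PM_BR_MPRED_CCACHE / PM_BR_PRED_CCACHE * 100
```

The rate answers "how often does this hurt the program?" while the percent answers "how accurate is the predictor itself?"; a rarely used predictor can have a low rate but a high misprediction percent.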

[PATCH v2 0/4] [powerpc] perf vendor events: Add JSON metrics for POWER9

2019-02-09 Thread Paul Clarke
[Note this is for POWER*9* and is different content than a
previous patchset for POWER*8*.]

The patches define metrics and metric groups for computation by "perf"
for POWER9 processors.

Paul Clarke (4):
  [powerpc] perf vendor events: Add JSON metrics for POWER9
  [powerpc] perf vendor events: Add JSON metrics for POWER9
  [powerpc] perf vendor events: Add JSON metrics for POWER9
  [powerpc] perf vendor events: Add JSON metrics for POWER9

v2:
The content of these patches was sent previously, to LKML and
linux-perf-users, but never showed up at linux-perf-users, so
I split it into smaller patches and am resending.

 .../arch/powerpc/power9/metrics.json  | 1982 +
 1 file changed, 1982 insertions(+)
 create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/metrics.json

-- 
2.17.1



[PATCH 1/4] [powerpc] perf vendor events: Add JSON metrics for POWER9

2019-02-09 Thread Paul Clarke
Descriptions of metrics for POWER9 processors can be found in the
"POWER9 Performance Monitor Unit User’s Guide", which is currently
available on the "IBM Portal for OpenPOWER"
(https://www-355.ibm.com/systems/power/openpower/welcome.xhtml) at
https://www-355.ibm.com/systems/power/openpower/posting.xhtml?postingId=4948CDE1963C9BCA852582F800718190

This patch is for metric groups:
- cpi_breakdown
- estimated_dcache_miss_cpi

Signed-off-by: Paul A. Clarke 
---
 .../arch/powerpc/power9/metrics.json  | 577 ++
 1 file changed, 577 insertions(+)
 create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/metrics.json

diff --git a/tools/perf/pmu-events/arch/powerpc/power9/metrics.json 
b/tools/perf/pmu-events/arch/powerpc/power9/metrics.json
new file mode 100644
index ..cd46ebb8da6a
--- /dev/null
+++ b/tools/perf/pmu-events/arch/powerpc/power9/metrics.json
@@ -0,0 +1,577 @@
+[
+{
+"BriefDescription": "Completion stall due to a Branch Unit",
+"MetricExpr": "PM_CMPLU_STALL_BRU/PM_RUN_INST_CMPL",
+"MetricGroup": "cpi_breakdown",
+"MetricName": "bru_stall_cpi"
+},
+{
+"BriefDescription": "Finish stall because the NTF instruction was 
routed to the crypto execution pipe and was waiting to finish",
+"MetricExpr": "PM_CMPLU_STALL_CRYPTO/PM_RUN_INST_CMPL",
+"MetricGroup": "cpi_breakdown",
+"MetricName": "crypto_stall_cpi"
+},
+{
+"BriefDescription": "Finish stall because the NTF instruction was a 
load that missed the L1 and was waiting for the data to return from the nest",
+"MetricExpr": "PM_CMPLU_STALL_DCACHE_MISS/PM_RUN_INST_CMPL",
+"MetricGroup": "cpi_breakdown",
+"MetricName": "dcache_miss_stall_cpi"
+},
+{
+"BriefDescription": "Finish stall because the NTF instruction was a 
multi-cycle instruction issued to the Decimal Floating Point execution pipe and 
waiting to finish.",
+"MetricExpr": "PM_CMPLU_STALL_DFLONG/PM_RUN_INST_CMPL",
+"MetricGroup": "cpi_breakdown",
+"MetricName": "dflong_stall_cpi"
+},
+{
+"BriefDescription": "Stalls due to short latency decimal floating 
ops.",
+"MetricExpr": "(PM_CMPLU_STALL_DFU - 
PM_CMPLU_STALL_DFLONG)/PM_RUN_INST_CMPL",
+"MetricGroup": "cpi_breakdown",
+"MetricName": "dfu_other_stall_cpi"
+},
+{
+"BriefDescription": "Finish stall because the NTF instruction was 
issued to the Decimal Floating Point execution pipe and waiting to finish.",
+"MetricExpr": "PM_CMPLU_STALL_DFU/PM_RUN_INST_CMPL",
+"MetricGroup": "cpi_breakdown",
+"MetricName": "dfu_stall_cpi"
+},
+{
+"BriefDescription": "Completion stall by Dcache miss which resolved 
off node memory/cache",
+"MetricExpr": "(PM_CMPLU_STALL_DMISS_L3MISS - 
PM_CMPLU_STALL_DMISS_L21_L31 - PM_CMPLU_STALL_DMISS_LMEM - 
PM_CMPLU_STALL_DMISS_REMOTE)/PM_RUN_INST_CMPL",
+"MetricGroup": "cpi_breakdown",
+"MetricName": "dmiss_distant_stall_cpi"
+},
+{
+"BriefDescription": "Completion stall by Dcache miss which resolved on 
chip ( excluding local L2/L3)",
+"MetricExpr": "PM_CMPLU_STALL_DMISS_L21_L31/PM_RUN_INST_CMPL",
+"MetricGroup": "cpi_breakdown",
+"MetricName": "dmiss_l21_l31_stall_cpi"
+},
+{
+"BriefDescription": "Completion stall due to cache miss that resolves 
in the L2 or L3 with a conflict",
+"MetricExpr": "PM_CMPLU_STALL_DMISS_L2L3_CONFLICT/PM_RUN_INST_CMPL",
+"MetricGroup": "cpi_breakdown",
+"MetricName": "dmiss_l2l3_conflict_stall_cpi"
+},
+{
+"BriefDescription": "Completion stall due to cache miss that resolves 
in the L2 or L3 without conflict",
+"MetricExpr": "(PM_CMPLU_STALL_DMISS_L2L3 - 
PM_CMPLU_STALL_DMISS_L2L3_CONFLICT)/PM_RUN_INST_CMPL",
+"MetricGroup": "cpi_breakdown",
+"MetricName": "dmiss_l2l3_noconflict_stall_cpi"
+},
+{
+"BriefDescription": "Completion stall by Dcache miss which resolved in 
L2/L3",
+"MetricExpr": "PM_CMPLU_STALL_DMISS_L2L3/PM_RUN_INST_CMPL",
+"MetricGroup": "cpi_breakdown",
+"MetricName": "dmiss_l2l3_stall_cpi"
+},
+{
+"BriefDescription": "Completion stall due to cache miss resolving 
missed the L3",
+"MetricExpr": "PM_CMPLU_STALL_DMISS_L3MISS/PM_RUN_INST_CMPL",
+"MetricGroup": "cpi_breakdown",
+"MetricName": "dmiss_l3miss_stall_cpi"
+},
+{
+"BriefDescription": "Completion stall due to cache miss that resolves 
in local memory",
+"MetricExpr": "PM_CMPLU_STALL_DMISS_LMEM/PM_RUN_INST_CMPL",
+"MetricGroup": "cpi_breakdown",
+"MetricName": "dmiss_lmem_stall_cpi"
+},
+{
+"BriefDescription": "Completion stall by Dcache miss which resolved 
outside of local memory",
+"MetricExpr": "(PM_CMPLU_STALL_DMISS_L3MISS - 
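
[Editorial aside: each cpi_breakdown metric divides a stall-cycle count by PM_RUN_INST_CMPL, so the components share a denominator and are additive toward total CPI. A sketch with invented counts:]

```python
# cpi_breakdown: every metric is stall_cycles / PM_RUN_INST_CMPL,
# so individual components sum toward overall CPI. Counts are made up.
PM_RUN_INST_CMPL = 1_000_000
stall_cycles = {
    "bru_stall_cpi": 100_000,          # from PM_CMPLU_STALL_BRU
    "dcache_miss_stall_cpi": 300_000,  # from PM_CMPLU_STALL_DCACHE_MISS
}
breakdown = {name: cyc / PM_RUN_INST_CMPL for name, cyc in stall_cycles.items()}
stall_cpi_total = sum(breakdown.values())
```

This shared-denominator structure is what lets perf present the group as a stacked breakdown rather than a set of unrelated ratios.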

Re: [PATCH] selftests/powerpc: Fix strncpy usage

2018-06-22 Thread Paul Clarke
On 06/22/2018 09:43 AM, Breno Leitao wrote:
> If you don't mind, I would solve this problem slightly different, as it seems
> to be more readable.
> 
> -   strncpy(prog, argv[0], strlen(argv[0]));
> +   if (strlen(argv[0]) >= LEN_MAX) {
> +   fprintf(stderr, "Very big executable name: %s\n", argv[0]);

"Very big" is an observation.  "Too big" indicates a problem better.  Or, more 
explicitly "Executable name is too long".

PC



Re: [PATCH] perf/python: Add ppc64le to audit uname list

2017-08-02 Thread Paul Clarke
Coincidentally, I just saw a developer stumble upon this within the last
week.  Could this be pushed upstream soon?

PC

On 08/02/2017 10:06 AM, Arnaldo Carvalho de Melo wrote:
> Em Wed, Aug 02, 2017 at 08:12:16PM +0530, Naveen N. Rao escreveu:
>> Before patch:
>>   $ uname -m
>>   ppc64le
> 
> Thanks, applied,
> 
> - Arnaldo
> 
>>   $ ./perf script -s ./scripts/python/syscall-counts.py
>>   Install the audit-libs-python package to get syscall names.
>>   For example:
>> # apt-get install python-audit (Ubuntu)
>> # yum install audit-libs-python (Fedora)
>> etc.
>>
>>   Press control+C to stop and show the summary
>>   ^CWarning:
>>   4 out of order events recorded.
>>
>>   syscall events:
>>
>>   event  count
>>     ---
>>   4 504638
>>   54  1206
>>   221   42
>>   55    21
>>   3     12
>>   167   10
>>   11     8
>>   6      7
>>   125    6
>>   5      6
>>   108    5
>>   162    4
>>   90     4
>>   45     3
>>   33     3
>>   311    1
>>   246    1
>>   238    1
>>   93     1
>>   91     1
>>
>> After patch:
>>   ./perf script -s ./scripts/python/syscall-counts.py
>>   Press control+C to stop and show the summary
>>   ^CWarning:
>>   5 out of order events recorded.
>>
>>   syscall events:
>>
>>   event  count
>>     ---
>>   write 643411
>>   ioctl   1206
>>   futex 54
>>   fcntl 27
>>   poll  14
>>   read  12
>>   execve 8
>>   close  7
>>   mprotect   6
>>   open   6
>>   nanosleep  5
>>   fstat  5
>>   mmap   4
>>   inotify_add_watch  3
>>   brk3
>>   access 3
>>   timerfd_settime1
>>   clock_gettime  1
>>   epoll_wait 1
>>   ftruncate  1
>>   munmap 1
>>
>> Signed-off-by: Naveen N. Rao 
>> ---
>>  tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace/Util.py | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git 
>> a/tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace/Util.py 
>> b/tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace/Util.py
>> index 1d95009592eb..f6c84966e4f8 100644
>> --- a/tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace/Util.py
>> +++ b/tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace/Util.py
>> @@ -57,6 +57,7 @@ try:
>>  'ia64'  : audit.MACH_IA64,
>>  'ppc'   : audit.MACH_PPC,
>>  'ppc64' : audit.MACH_PPC64,
>> +'ppc64le' : audit.MACH_PPC64LE,
>>  's390'  : audit.MACH_S390,
>>  's390x' : audit.MACH_S390X,
>>  'i386'  : audit.MACH_X86,
>> -- 
>> 2.13.3
> 



Re: [PATCH] powerpc/lib: Split xor_vmx file to guarantee instruction ordering

2017-05-25 Thread Paul Clarke
On 05/24/2017 11:56 PM, Matt Brown wrote:
> On Wed, May 24, 2017 at 11:36 PM, Paul Clarke <p...@us.ibm.com> wrote:
>> On 05/23/2017 06:45 PM, Matt Brown wrote:
>>> The xor_vmx.c file is used for the RAID5 xor operations. In these functions
>>> altivec is enabled to run the operation and then disabled. However due to
>>> compiler instruction reordering, altivec instructions are being run before
>>> enable_altivec() and after disable_altivec().
>>
>> If altivec instructions can be reordered after disable_altivec(), then 
>> disable_altivec() is broken, I'd think.
>>
>> Could it be because the isync in mtmsr_isync() is after the mtmsr?
>>
>> disable_kernel_altivec
>> - msr_check_and_clear
>>   - __msr_check_and_clear
>> - mtmsr_isync
> 
> So it turns out the enable / disable functions don't actually enable
> or disable the use of vector instructions.
> If we have marked the file to be compiled with altivec the compiler
> has free reign to reorder the vector instructions wherever it likes.
> Including reordering it before or after the enable/disable_altivec
> commands.
> 
> The enable_kernel_altivec and disable_kernel_altivec functions are
> mainly there to empty and restore the vector registers which could
> have been used in user-space. So those functions work as intended,
> although not particularly intuitive.
> 
> Splitting the files and only compiling the xor_vmx.c file with altivec
> will guarantee that there are no vector instructions in the
> xor_vmx_glue.c file, and that no vector instructions are outside of
> the enable/disable block.

Thanks for the explanation!  That makes sense.

PC



Re: [PATCH] powerpc/lib: Split xor_vmx file to guarantee instruction ordering

2017-05-24 Thread Paul Clarke
On 05/23/2017 06:45 PM, Matt Brown wrote:
> The xor_vmx.c file is used for the RAID5 xor operations. In these functions
> altivec is enabled to run the operation and then disabled. However due to
> compiler instruction reordering, altivec instructions are being run before
> enable_altivec() and after disable_altivec().

If altivec instructions can be reordered after disable_altivec(), then 
disable_altivec() is broken, I'd think.

Could it be because the isync in mtmsr_isync() is after the mtmsr?

disable_kernel_altivec
- msr_check_and_clear
  - __msr_check_and_clear
- mtmsr_isync

PC



Re: [PATCH] powerpc/64s: dt_cpu_ftrs boot time setup option

2017-05-12 Thread Paul Clarke
On 05/11/2017 10:46 PM, Michael Ellerman wrote:
> Paul Clarke <p...@us.ibm.com> writes:
>> On 05/11/2017 06:24 AM, Nicholas Piggin wrote:
>>> Provide a dt_cpu_ftrs= cmdline option to disable the dt_cpu_ftrs CPU
>>> feature discovery, and fall back to the "cputable" based version.
>>
>> This boat has already sailed, I think, but "ftrs"?
> 
> What you think vowels grow on trees! :)

At least you're using lower-case, which takes less space.  ;-)

>> Was it too difficult to type "features"?

I see "ftrs" and think "footers".

> For the command line option we could spell out features.
> 
> But should we also expand "dt", and "cpu" ?
> 
> device_tree_central_processing_unit_features=off

I'm not the biggest fan of "dt" ("devtree", anyone?), but that's certainly 
established.

So _that's_ what "cpu" stands for!  ;-)

> :P
> 
>> This seems like something that should be easy and intuitive to find
>> and understand.
> 
> Maybe. Ideally no one will ever use it, certainly not end users, it's
> primarily intended for developers doing bring-up.

I'm a developer, and I like to be able to understand something at a glance.  
(Hmm, what are "CPU footers"?)

:-P

PC



Re: [PATCH] powerpc/64s: dt_cpu_ftrs boot time setup option

2017-05-11 Thread Paul Clarke
On 05/11/2017 06:24 AM, Nicholas Piggin wrote:
> Provide a dt_cpu_ftrs= cmdline option to disable the dt_cpu_ftrs CPU
> feature discovery, and fall back to the "cputable" based version.

This boat has already sailed, I think, but "ftrs"?  Was it too difficult to 
type "features"?  This seems like something that should be easy and intuitive 
to find and understand.

:-)

PC



Re: [PATCH v2] powerpc/kprobes: refactor kprobe_lookup_name for safer string operations

2017-05-04 Thread Paul Clarke
On 05/04/2017 05:24 AM, Naveen N. Rao wrote:
> Use safer string manipulation functions when dealing with a
> user-provided string in kprobe_lookup_name().
> 
> Reported-by: David Laight 
> Signed-off-by: Naveen N. Rao 
> ---
> Changed to ignore return value of 0 from strscpy(), as suggested by
> Masami.
> 
> - Naveen
> 
>  arch/powerpc/kernel/kprobes.c | 47 
> ++-
>  1 file changed, 20 insertions(+), 27 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index 160ae0fa7d0d..255d28d31ca1 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -53,7 +53,7 @@ bool arch_within_kprobe_blacklist(unsigned long addr)
> 
>  kprobe_opcode_t *kprobe_lookup_name(const char *name, unsigned int offset)
>  {
> - kprobe_opcode_t *addr;
> + kprobe_opcode_t *addr = NULL;
> 
>  #ifdef PPC64_ELF_ABI_v2
>   /* PPC64 ABIv2 needs local entry point */
> @@ -85,36 +85,29 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, 
> unsigned int offset)
>* Also handle <module:symbol> format.
>*/
>   char dot_name[MODULE_NAME_LEN + 1 + KSYM_NAME_LEN];
> - const char *modsym;
>   bool dot_appended = false;
> - if ((modsym = strchr(name, ':')) != NULL) {
> - modsym++;
> - if (*modsym != '\0' && *modsym != '.') {
> - /* Convert to <module:.symbol> */
> - strncpy(dot_name, name, modsym - name);
> - dot_name[modsym - name] = '.';
> - dot_name[modsym - name + 1] = '\0';
> - strncat(dot_name, modsym,
> - sizeof(dot_name) - (modsym - name) - 2);
> - dot_appended = true;
> - } else {
> - dot_name[0] = '\0';
> - strncat(dot_name, name, sizeof(dot_name) - 1);
> - }
> - } else if (name[0] != '.') {
> - dot_name[0] = '.';
> - dot_name[1] = '\0';
> - strncat(dot_name, name, KSYM_NAME_LEN - 2);
> + const char *c;
> + ssize_t ret = 0;
> + int len = 0;
> +
> + if ((c = strnchr(name, MODULE_NAME_LEN, ':')) != NULL) {

Shouldn't this be MODULE_NAME_LEN + 1, since the ':' can come after a module 
name of length MODULE_NAME_LEN?

> + c++;
> + len = c - name;
> + memcpy(dot_name, name, len);
> + } else
> + c = name;
> +
> + if (*c != '\0' && *c != '.') {
> + dot_name[len++] = '.';
>   dot_appended = true;
> - } else {
> - dot_name[0] = '\0';
> - strncat(dot_name, name, KSYM_NAME_LEN - 1);
>   }

PC



Re: [PATCH v4 4/7] powerpc/kprobes: Use safer string functions in kprobe_lookup_name()

2017-04-21 Thread Paul Clarke
Sent too soon.  The suggestions don't guarantee null termination.  Refined, 
below. (Sorry for the noise.)

On 04/21/2017 08:33 AM, Paul Clarke wrote:
> On 04/21/2017 07:33 AM, Naveen N. Rao wrote:
>> Convert usage of strchr()/strncpy()/strncat() to
>> strnchr()/memcpy()/strlcat() for simpler and safer string manipulation.

>> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
>> index 97b5eed1f76d..c73fb6e3b43f 100644
>> --- a/arch/powerpc/kernel/kprobes.c
>> +++ b/arch/powerpc/kernel/kprobes.c
>> @@ -65,28 +65,27 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, 
>> unsigned int offset)
>>  char dot_name[MODULE_NAME_LEN + 1 + KSYM_NAME_LEN];
>>  const char *modsym;
>>  bool dot_appended = false;
>> -if ((modsym = strchr(name, ':')) != NULL) {
>> +if ((modsym = strnchr(name, ':', MODULE_NAME_LEN)) != NULL) {
>>  modsym++;
>>  if (*modsym != '\0' && *modsym != '.') {
>> /* Convert to <module:.symbol> */
>> -strncpy(dot_name, name, modsym - name);
>> +memcpy(dot_name, name, modsym - name);
>>  dot_name[modsym - name] = '.';
>>  dot_name[modsym - name + 1] = '\0';
>> -strncat(dot_name, modsym,
>> -sizeof(dot_name) - (modsym - name) - 2);
>> +strlcat(dot_name, modsym, sizeof(dot_name));
> 
> Would it be more efficient here to replace this:
> --
> dot_name[modsym - name + 1] = '\0';
> strlcat(dot_name, modsym, sizeof(dot_name));
> --
> 
> with this:
> strncpy(&dot_name[modsym - name + 1], modsym, KSYM_NAME_LEN);

keep the null termination, and just adjust the starting point for strlcat:
--
dot_name[modsym - name + 1] = '\0';
strlcat(&dot_name[modsym - name + 1], modsym, KSYM_NAME_LEN);
--

> (So you aren't rescanning dot_name to find the end, when you already know the 
> end position?)
> 
>>  dot_appended = true;
>>  } else {
>>  dot_name[0] = '\0';
>> -strncat(dot_name, name, sizeof(dot_name) - 1);
>> +strlcat(dot_name, name, sizeof(dot_name));
> 
> and here do:
> strncpy(dot_name, name, sizeof(dot_name));
> 
> (and remove the null termination immediately above)

nevermind.  :-)

> 
>>  }
>>  } else if (name[0] != '.') {
>>  dot_name[0] = '.';
>>  dot_name[1] = '\0';
>> -strncat(dot_name, name, KSYM_NAME_LEN - 2);
>> +strlcat(dot_name, name, sizeof(dot_name));
> 
> and here do:
> strncpy(&dot_name[1], name, sizeof(dot_name));
> 
> (and remove the null termination immediately above)
> 

nevermind.  :-)

>>  dot_appended = true;
>>  } else {
>>  dot_name[0] = '\0';
>> -strncat(dot_name, name, KSYM_NAME_LEN - 1);
>> +strlcat(dot_name, name, sizeof(dot_name));
> 
> and here do:
> strncpy(dot_name, name, sizeof(dot_name));
> 
> (and remove the null termination immediately above)
> 
>>  }

nevermind.  :-)

PC



Re: [PATCH v4 4/7] powerpc/kprobes: Use safer string functions in kprobe_lookup_name()

2017-04-21 Thread Paul Clarke
On 04/21/2017 08:33 AM, Paul Clarke wrote:
> On 04/21/2017 07:33 AM, Naveen N. Rao wrote:
>>  } else if (name[0] != '.') {
>>  dot_name[0] = '.';
>>  dot_name[1] = '\0';
>> -strncat(dot_name, name, KSYM_NAME_LEN - 2);
>> +strlcat(dot_name, name, sizeof(dot_name));
> 
> and here do:
> strncpy(&dot_name[1], name, sizeof(dot_name));

oops.  s/sizeof(dot_name)/sizeof(dot_name) - 1/

PC



Re: [PATCH v4 4/7] powerpc/kprobes: Use safer string functions in kprobe_lookup_name()

2017-04-21 Thread Paul Clarke
On 04/21/2017 07:33 AM, Naveen N. Rao wrote:
> Convert usage of strchr()/strncpy()/strncat() to
> strnchr()/memcpy()/strlcat() for simpler and safer string manipulation.
> 
> Reported-by: David Laight 
> Signed-off-by: Naveen N. Rao 
> ---
> Changes: Additionally convert the strchr().
> 
> 
>  arch/powerpc/kernel/kprobes.c | 13 ++---
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index 97b5eed1f76d..c73fb6e3b43f 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -65,28 +65,27 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, 
> unsigned int offset)
>   char dot_name[MODULE_NAME_LEN + 1 + KSYM_NAME_LEN];
>   const char *modsym;
>   bool dot_appended = false;
> - if ((modsym = strchr(name, ':')) != NULL) {
> + if ((modsym = strnchr(name, ':', MODULE_NAME_LEN)) != NULL) {
>   modsym++;
>   if (*modsym != '\0' && *modsym != '.') {
>   /* Convert to <module:.symbol> */
> - strncpy(dot_name, name, modsym - name);
> + memcpy(dot_name, name, modsym - name);
>   dot_name[modsym - name] = '.';
>   dot_name[modsym - name + 1] = '\0';
> - strncat(dot_name, modsym,
> - sizeof(dot_name) - (modsym - name) - 2);
> + strlcat(dot_name, modsym, sizeof(dot_name));

Would it be more efficient here to replace this:
--
dot_name[modsym - name + 1] = '\0';
strlcat(dot_name, modsym, sizeof(dot_name));
--

with this:
strncpy(&dot_name[modsym - name + 1], modsym, KSYM_NAME_LEN);

(So you aren't rescanning dot_name to find the end, when you already know the 
end position?)

>   dot_appended = true;
>   } else {
>   dot_name[0] = '\0';
> - strncat(dot_name, name, sizeof(dot_name) - 1);
> + strlcat(dot_name, name, sizeof(dot_name));

and here do:
strncpy(dot_name, name, sizeof(dot_name));

(and remove the null termination immediately above)

>   }
>   } else if (name[0] != '.') {
>   dot_name[0] = '.';
>   dot_name[1] = '\0';
> - strncat(dot_name, name, KSYM_NAME_LEN - 2);
> + strlcat(dot_name, name, sizeof(dot_name));

and here do:
strncpy(&dot_name[1], name, sizeof(dot_name));

(and remove the null termination immediately above)

>   dot_appended = true;
>   } else {
>   dot_name[0] = '\0';
> - strncat(dot_name, name, KSYM_NAME_LEN - 1);
> + strlcat(dot_name, name, sizeof(dot_name));

and here do:
strncpy(dot_name, name, sizeof(dot_name));

(and remove the null termination immediately above)

>   }

PC



Re: [PATCH v4 3/7] kprobes: validate the symbol name provided during probe registration

2017-04-21 Thread Paul Clarke
a nit or two, below...

On 04/21/2017 07:32 AM, Naveen N. Rao wrote:
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index 6a128f3a7ed1..ff9b1ac72a38 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -1383,6 +1383,34 @@ bool within_kprobe_blacklist(unsigned long addr)
>  }
> 
>  /*
> + * We mainly want to ensure that the provided string is of a reasonable 
> length
> + * and is of the form [<module>:]<symbol>, so that this is safe to 
> process
> + * further.
> + * We don't worry about invalid characters as those will just prevent
> + * matching existing kallsyms.
> + */
> +bool is_valid_kprobe_symbol_name(const char *name)
> +{
> + size_t sym_len;
> + const char *s;
> +
> + s = strnchr(name, ':', MODULE_NAME_LEN + KSYM_NAME_LEN + 1);
> + if (s) {
> + sym_len = (size_t)(s - name);
> + if (sym_len <= 0  || sym_len >= MODULE_NAME_LEN)

"sym_len <= 0" looks odd here, since sym_len is likely unsigned and would never 
be less than zero, anyway.

> + return false;
> + s++;
> + } else
> + s = name;
> +
> + sym_len = strnlen(s, KSYM_NAME_LEN);
> + if (sym_len <= 0 || sym_len >= KSYM_NAME_LEN)

here, too.

> + return false;
> +
> + return true;
> +}

PC



Re: [PATCH V4] powerpc/hugetlb: Add ABI defines for supported HugeTLB page sizes

2017-04-07 Thread Paul Clarke

nits... take 'em or leave 'em...

On 04/07/2017 08:01 AM, Michael Ellerman wrote:

Anshuman Khandual  writes:
And I reworded the comment the make it clearer (I think) that most users
shouldn't need to use these, and should just use the default size:

/*
 * When MAP_HUGETLB is set bits [26:31] encode the log2 of the huge page size.


need a comma after "set".

also, "bits [26:31]" of what?


 * A value of zero indicates that the default huge page size should be used.
 * To use a non-default huge page size, one of these defines can be used, or the
 * size can be encoded by hand. Note that on most systems only a subset, or
 * possibly none, of these sizes will be available.
 */


PC



Re: [PATCH v2] powerpc: A new cache geometry aux vectors

2017-03-03 Thread Paul Clarke

On 03/02/2017 10:01 PM, Michael Ellerman wrote:

Paul Clarke <p...@us.ibm.com> writes:

On 03/02/2017 12:33 AM, Michael Ellerman wrote:

Paul Clarke <p...@us.ibm.com> writes:

On 02/02/2017 12:22 AM, Benjamin Herrenschmidt wrote:

This adds AUX vectors for the L1I,D, L2 and L3 cache levels
providing for each cache level the size of the cache in bytes
and the geometry (line size and number of ways).

We chose to not use the existing alpha/sh definition which
packs all the information in a single entry per cache level as
it is too restricted to represent some of the geometries used
on POWER.

Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
---

v2. Remove some leftovers and fix 32-bit build

Paul, can you test & ack this please ? It replaces the last patch of
my previous series (the rest is identical).


Sorry for the delay on this, I totally missed this e-mail until this week.

I tested this with the Power Functional Simulator (a.k.a. Mambo), with
a fairly recent 4.9 kernel from the powerpc git tree.


You mean 4.9 plus this patch?


Using Mambo-P8, I see what I expect for cache line size, but associativity is 0:
AT_L1I_CACHESIZE: 32768 (0x8000)
AT_L1I_CACHEGEOMETRY: associativity 0; line size 128
AT_L1D_CACHESIZE: 32768 (0x8000)
AT_L1D_CACHEGEOMETRY: associativity 0; line size 128
AT_L2_CACHESIZE: 524288 (0x80000)
AT_L2_CACHEGEOMETRY: associativity 0; line size 128
AT_L3_CACHESIZE: 8388608 (0x800000)
AT_L3_CACHEGEOMETRY: associativity 0; line size 128

...shouldn't associativity be 8 for all levels?


Yes.

I fixed it in v3: http://patchwork.ozlabs.org/patch/723576/

Which is now merged in Linus tree, so please test that.


It's not working for me, but it's also very likely that I could be
doing something stupid. Just pulling from
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git, it
built (without the "systemsim" patches"), but wouldn't boot in Mambo
(not because it couldn't find root...something else not fully
explored).


Hmm OK, it's booting for me, but might depend on your config/toolchain
etc. Linus' tree just before rc1 is a bit of a dangerous place to be.


I checked out tag "v4.10", and that built (with the
"systemsim" patches, Ben's 1-8 and your 9v3) and booted.


I think you need my entire v3, ignore Ben's series.


On Mambo-P8,
the device tree looks correct (to me), but the associativity is still
0.


OK, I think if you use the full v3 the associativity will be fixed.

Though it might still be wrong, because the kernel code hard codes the
values on P8 bare metal, because on real machines (running skiboot) the
values provided in the device tree are wrong.

Hopefully those hard coded values are also correct on p8 mambo, but we
should probably check.

See: https://git.kernel.org/torvalds/c/608b42140e96


OK, thanks for your help.  All is working as expected now.  mambo-p8 looks fine:
--
AT_L1I_CACHESIZE: 32768 (0x8000)
getauxval(AT_L1I_CACHEGEOMETRY) = 0x00080080
AT_L1I_CACHEGEOMETRY: associativity 8; line size 128
AT_L1D_CACHESIZE: 32768 (0x8000)
getauxval(AT_L1D_CACHEGEOMETRY) = 0x00080080
AT_L1D_CACHEGEOMETRY: associativity 8; line size 128
AT_L2_CACHESIZE: 524288 (0x80000)
getauxval(AT_L2_CACHEGEOMETRY) = 0x00080080
AT_L2_CACHEGEOMETRY: associativity 8; line size 128
AT_L3_CACHESIZE: 8388608 (0x800000)
getauxval(AT_L2_CACHEGEOMETRY) = 0x00080080
AT_L3_CACHEGEOMETRY: associativity 8; line size 128
--


On P9, the device tree isn't correct, so I'm not sure what to expect.


Junk :)

On P9 it all comes from the device tree so if that's wrong the AUX
vectors will definitely be wrong.


mambo-p9 is still falling victim to the device-tree, but "working":
--
getauxval(AT_L1I_CACHEGEOMETRY) = 0x00000080
AT_L1I_CACHEGEOMETRY: associativity 0; line size 128
getauxval(AT_L1D_CACHEGEOMETRY) = 0x00000080
AT_L1D_CACHEGEOMETRY: associativity 0; line size 128
getauxval(AT_L2_CACHEGEOMETRY) = 0x00000000
getauxval(AT_L2_CACHEGEOMETRY) = 0x00000000
--


Can you try the test program I wrote (which may also have bugs):

  http://patchwork.ozlabs.org/patch/724418/


I couldn't get this working, either, so I'm probably doing something stupid.


Did you apply the first patch of the series also?


I did now, and it works, with expected results:
P8:
--
# ./cache_shape
test: cache_shape
tags: git_version:v4.10-0-gc470abd-dirty
L1I  cache size: 0x8000  32768B 32K
L1I  line size:0x80   8-way associative
L1D  cache size: 0x8000  32768B 32K
L1D  line size:0x80   8-way associative
L2   cache size:    0x80000  524288B    512K
L2   line size:0x80   8-way associative
L3   cache size:   0x800000  8388608B   8192K
L3   line size:0x80   8-way associative
success: cache_shape
--

P9 (with bad device-tree):
--
# ./cache_shape
test: cache_shape
tags: git_version:v4.10-0-gc470abd-

Re: [PATCH v2] powerpc: A new cache geometry aux vectors

2017-03-02 Thread Paul Clarke

On 03/02/2017 12:33 AM, Michael Ellerman wrote:

Paul Clarke <p...@us.ibm.com> writes:

On 02/02/2017 12:22 AM, Benjamin Herrenschmidt wrote:

This adds AUX vectors for the L1I,D, L2 and L3 cache levels
providing for each cache level the size of the cache in bytes
and the geometry (line size and number of ways).

We chose to not use the existing alpha/sh definition which
packs all the information in a single entry per cache level as
it is too restricted to represent some of the geometries used
on POWER.

Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
---

v2. Remove some leftovers and fix 32-bit build

Paul, can you test & ack this please ? It replaces the last patch of
my previous series (the rest is identical).


Sorry for the delay on this, I totally missed this e-mail until this week.

I tested this with the Power Functional Simulator (a.k.a. Mambo), with
a fairly recent 4.9 kernel from the powerpc git tree.


You mean 4.9 plus this patch?


Using Mambo-P8, I see what I expect for cache line size, but associativity is 0:
AT_L1I_CACHESIZE: 32768 (0x8000)
AT_L1I_CACHEGEOMETRY: associativity 0; line size 128
AT_L1D_CACHESIZE: 32768 (0x8000)
AT_L1D_CACHEGEOMETRY: associativity 0; line size 128
AT_L2_CACHESIZE: 524288 (0x80000)
AT_L2_CACHEGEOMETRY: associativity 0; line size 128
AT_L3_CACHESIZE: 8388608 (0x800000)
AT_L3_CACHEGEOMETRY: associativity 0; line size 128

...shouldn't associativity be 8 for all levels?


Yes.

I fixed it in v3: http://patchwork.ozlabs.org/patch/723576/

Which is now merged in Linus tree, so please test that.


It's not working for me, but it's also very likely that I could be doing something stupid.  Just pulling from 
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git, it built (without the "systemsim" 
patches"), but wouldn't boot in Mambo (not because it couldn't find root...something else not fully 
explored).  I checked out tag "v4.10", and that built (with the "systemsim" patches, Ben's 1-8 
and your 9v3) and booted.  On Mambo-P8, the device tree looks correct (to me), but the associativity is still 0.  
On P9, the device tree isn't correct, so I'm not sure what to expect.

I added a bit more debug to my test:
AT_L1I_CACHESIZE: 32768 (0x8000)
getauxval(AT_L1I_CACHEGEOMETRY) = 0x00000080
AT_L1I_CACHEGEOMETRY: associativity 0; line size 128
AT_L1D_CACHESIZE: 32768 (0x8000)
getauxval(AT_L1D_CACHEGEOMETRY) = 0x00000080
AT_L1D_CACHEGEOMETRY: associativity 0; line size 128
AT_L2_CACHESIZE: 524288 (0x80000)
getauxval(AT_L2_CACHEGEOMETRY) = 0x00000080
AT_L2_CACHEGEOMETRY: associativity 0; line size 128
AT_L3_CACHESIZE: 8388608 (0x800000)
getauxval(AT_L2_CACHEGEOMETRY) = 0x00000080
AT_L3_CACHEGEOMETRY: associativity 0; line size 128

When I use gdb, and break at main:
(gdb) x/60gx ($r6)
0x3ca0: 0x0016  0x0016
0x3cb0: 0x0016  0x0016
0x3cc0: 0x0013  0x0080
0x3cd0: 0x0014  0x0080
0x3ce0: 0x0015  0x
0x3cf0: 0x0021  0x3fffb7fa
0x3d00: 0x0028  0x8000
0x3d10: 0x0029  0x0080
0x3d20: 0x002a  0x8000
0x3d30: 0x002b  0x0080
0x3d40: 0x002c  0x0008
0x3d50: 0x002d  0x0080
0x3d60: 0x002e  0x0080
0x3d70: 0x002f  0x0080

and, looking at the device-tree:
# for i in /proc/device-tree/cpus/PowerPC@0/*cache*; do echo $i;
 od -tx1 $i; done
/proc/device-tree/cpus/PowerPC@0/d-cache-block-size
000 00 00 00 80
/proc/device-tree/cpus/PowerPC@0/d-cache-line-size
000 00 00 00 80
/proc/device-tree/cpus/PowerPC@0/d-cache-sets
000 00 00 00 04
/proc/device-tree/cpus/PowerPC@0/d-cache-size
000 00 00 80 00
/proc/device-tree/cpus/PowerPC@0/i-cache-block-size
000 00 00 00 80
/proc/device-tree/cpus/PowerPC@0/i-cache-line-size
000 00 00 00 80
/proc/device-tree/cpus/PowerPC@0/i-cache-sets
000 00 00 00 02
/proc/device-tree/cpus/PowerPC@0/i-cache-size
000 00 00 80 00
/proc/device-tree/cpus/PowerPC@0/l2-cache-size
000 00 0a 00 00



Can you try the test program I wrote (which may also have bugs):

  http://patchwork.ozlabs.org/patch/724418/


I couldn't get this working, either, so I'm probably doing something stupid.

PC



Re: [PATCH v2] powerpc: A new cache geometry aux vectors

2017-03-01 Thread Paul Clarke

On 02/02/2017 12:22 AM, Benjamin Herrenschmidt wrote:

This adds AUX vectors for the L1I,D, L2 and L3 cache levels
providing for each cache level the size of the cache in bytes
and the geometry (line size and number of ways).

We chose to not use the existing alpha/sh definition which
packs all the information in a single entry per cache level as
it is too restricted to represent some of the geometries used
on POWER.

Signed-off-by: Benjamin Herrenschmidt 
---

v2. Remove some leftovers and fix 32-bit build

Paul, can you test & ack this please ? It replaces the last patch of
my previous series (the rest is identical).


Sorry for the delay on this, I totally missed this e-mail until this week.

I tested this with the Power Functional Simulator (a.k.a. Mambo), with a fairly 
recent 4.9 kernel from the powerpc git tree.

Using Mambo-P8, I see what I expect for cache line size, but associativity is 0:
AT_L1I_CACHESIZE: 32768 (0x8000)
AT_L1I_CACHEGEOMETRY: associativity 0; line size 128
AT_L1D_CACHESIZE: 32768 (0x8000)
AT_L1D_CACHEGEOMETRY: associativity 0; line size 128
AT_L2_CACHESIZE: 524288 (0x80000)
AT_L2_CACHEGEOMETRY: associativity 0; line size 128
AT_L3_CACHESIZE: 8388608 (0x800000)
AT_L3_CACHEGEOMETRY: associativity 0; line size 128

...shouldn't associativity be 8 for all levels?

Using Mambo-P9, I don't get any cache sizes, and only L1 geometries, and no 
associativity:
AT_L1I_CACHEGEOMETRY: associativity 0; line size 128
AT_L1D_CACHEGEOMETRY: associativity 0; line size 128

PC



Re: [PATCH 9/9] powerpc: A new cache geometry aux vectors

2017-01-10 Thread Paul Clarke

On 01/08/2017 05:31 PM, Benjamin Herrenschmidt wrote:

This adds AUX vectors for the L1I,D, L2 and L3 cache levels
providing for each cache level the size of the cache in bytes
and the geometry (line size and number of ways).

We chose to not use the existing alpha/sh definition which
packs all the information in a single entry per cache level as
it is too restricted to represent some of the geometries used
on POWER.



diff --git a/arch/powerpc/include/asm/cache.h b/arch/powerpc/include/asm/cache.h
index 3987bd9..1557d26 100644
--- a/arch/powerpc/include/asm/cache.h
+++ b/arch/powerpc/include/asm/cache.h
@@ -35,6 +35,7 @@ struct ppc_cache_info {
u32 log_block_size;
u32 blocks_per_page;
u32 sets;
+   u32 assoc;


Associativity is defined as u32...


 };

 struct ppc64_caches {
diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
index 730c27e..a128836 100644
--- a/arch/powerpc/include/asm/elf.h
+++ b/arch/powerpc/include/asm/elf.h



+extern long il1cache_shape;
+extern long dl1cache_shape;
+extern long l2cache_shape;
+extern long l3cache_shape;


shapes are "long"...


@@ -136,6 +140,9 @@ extern int arch_setup_additional_pages(struct linux_binprm 
*bprm,

 #endif /* CONFIG_SPU_BASE */

+#define get_cache_shape(level) \
+   (ppc64_caches.level.assoc << 16 | ppc64_caches.level.line_size)


Now we stuff u32 assoc into 16 bits.

What's the reason behind combining associativity and cache line size into a 
single field?

Likely the most requested values will be cache line size and (maybe) cache 
size.  The latter is a simple query with these changes, but the former is now a 
query and some awkward mask/shift.  I don't think we're currently providing 
accessor macros, although I guess those can be added to glibc.  Would it not be 
simpler, though, to just add independent queries for each of cache line size 
and associativity?
- AT_L1I_CACHESIZE
- AT_L1I_CACHE_LINE_SIZE
- AT_L1I_CACHE_ASSOC
etc.

Is there a big "con" to having a few more AUXV entries?

Regards,
PC


@@ -156,6 +163,14 @@ do {   
\
NEW_AUX_ENT(AT_ICACHEBSIZE, icache_bsize);  \
NEW_AUX_ENT(AT_UCACHEBSIZE, ucache_bsize);  \
VDSO_AUX_ENT(AT_SYSINFO_EHDR, current->mm->context.vdso_base);\
+   NEW_AUX_ENT(AT_L1I_CACHESIZE, ppc64_caches.l1i.size);   \
+   NEW_AUX_ENT(AT_L1I_CACHEGEOMETRY, get_cache_shape(l1i));\
+   NEW_AUX_ENT(AT_L1D_CACHESIZE, ppc64_caches.l1i.size);   \
+   NEW_AUX_ENT(AT_L1D_CACHEGEOMETRY, get_cache_shape(l1i));\
+   NEW_AUX_ENT(AT_L2_CACHESIZE, ppc64_caches.l2.size); \
+   NEW_AUX_ENT(AT_L2_CACHEGEOMETRY, get_cache_shape(l2));  \
+   NEW_AUX_ENT(AT_L3_CACHESIZE, ppc64_caches.l3.size); \
+   NEW_AUX_ENT(AT_L3_CACHEGEOMETRY, get_cache_shape(l3));  \
 } while (0)

 #endif /* _ASM_POWERPC_ELF_H */
diff --git a/arch/powerpc/include/uapi/asm/auxvec.h 
b/arch/powerpc/include/uapi/asm/auxvec.h
index ce17d2c..be6e94e 100644
--- a/arch/powerpc/include/uapi/asm/auxvec.h
+++ b/arch/powerpc/include/uapi/asm/auxvec.h
@@ -16,6 +16,37 @@
  */
 #define AT_SYSINFO_EHDR 33

-#define AT_VECTOR_SIZE_ARCH 6 /* entries in ARCH_DLINFO */
+/*
+ * AT_*CACHEBSIZE above represent the cache *block* size which is
+ * the size that is affected by the cache management instructions.
+ *
+ * It doesn't necessarily match the cache *line* size which is
+ * more of a performance tuning hint. Additionally the latter can
+ * be different for the different cache levels.
+ *
+ * The set of entries below represent more extensive information
+ * about the caches, in the form of two entry per cache type,
+ * one entry containing the cache size in bytes, and the other
+ * containing the cache line size in bytes in the bottom 16 bits
+ * and the cache associativity in the next 16 bits.
+ *
+ * The associativity is such that if N is the 16-bit value, the
+ * cache is N way set associative. A value if 0x means fully
+ * associative, a value of 1 means directly mapped.
+ *
+ * For all these fields, a value of 0 means that the information
+ * is not known.
+ */
+
+#define AT_L1I_CACHESIZE   40
+#define AT_L1I_CACHEGEOMETRY   41
+#define AT_L1D_CACHESIZE   42
+#define AT_L1D_CACHEGEOMETRY   43
+#define AT_L2_CACHESIZE44
+#define AT_L2_CACHEGEOMETRY45
+#define AT_L3_CACHESIZE46
+#define AT_L3_CACHEGEOMETRY47
+
+#define AT_VECTOR_SIZE_ARCH 14 /* entries in ARCH_DLINFO */

 #endif
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index e0eeed4..cfa2a06 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -94,7 +94,10 @@ EXPORT_SYMBOL_GPL(boot_cpuid);
 int dcache_bsize;
 int icache_bsize;
 int ucache_bsize;
-
+long il1cache_shape = -1;
+long 

Re: [PATCH 2/2] powerpc/pseries: Dynamically increase RMA size

2016-08-05 Thread Paul Clarke

Only nits from me...(see below)

On 08/05/2016 01:30 PM, Sukadev Bhattiprolu wrote:

Here is an updated patch to fix the build when CONFIG_PPC_PSERIES=n.
---
From d4f77a6ca7b6ea83f6588e7d541cc70bf001ae85 Mon Sep 17 00:00:00 2001
From: root 
Date: Thu, 4 Aug 2016 23:13:37 -0400
Subject: [PATCH 2/2] powerpc/pseries: Dynamically grow RMA size

When booting a very large system with a larg initrd we run out of space
for the flattened device tree (FDT). To fix this we must increase the
space allocated for the RMA region.

The RMA size is hard-coded in the 'ibm_architecture_vec[]' and increasing
the size there will apply to all systems, small and large, so we want to
increase the RMA region only when necessary.

When we run out of room for the FDT, set a new OF property, 'ibm,new-rma-size'
to the new RMA size (512MB) and issue a client-architecture-support (CAS)
call to the firmware. This will initiate a system reboot. Upon reboot we
notice the new property and update the RMA size accordingly.

Fix suggested by Michael Ellerman.

Signed-off-by: Sukadev Bhattiprolu 
---

[v2]:   - Add a comment in code regarding 'fixup_nr_cores_done'
- Fix build break when CONFIG_PPC_PSERIES=n
---
 arch/powerpc/kernel/prom_init.c | 96 -
 1 file changed, 95 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index f612a99..cbd5387 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -679,6 +679,7 @@ unsigned char ibm_architecture_vec[] = {
W(0x),  /* virt_base */
W(0x),  /* virt_size */
W(0x),  /* load_base */
+#define IBM_ARCH_VEC_MIN_RMA_OFFSET 108
W(256), /* 256MB min RMA */
W(0x),  /* full client load */
0,  /* min RMA percentage of total RAM */
@@ -867,6 +868,14 @@ static void fixup_nr_cores(void)
 {
u32 cores;
unsigned char *ptcores;
+   static bool fixup_nr_cores_done = false;
+
+   /*
+* If this is a second CAS call in the same boot sequence, (see
+* increase_rma_size()), we don't need to do the fixup again.
+*/
+   if (fixup_nr_cores_done)
+   return;

/* We need to tell the FW about the number of cores we support.
 *
@@ -898,6 +907,41 @@ static void fixup_nr_cores(void)
ptcores[1] = (cores >> 16) & 0xff;
ptcores[2] = (cores >> 8) & 0xff;
ptcores[3] = cores & 0xff;
+   fixup_nr_cores_done = true;
+   }
+}
+
+static void __init fixup_rma_size(void)
+{
+   int rc;
+   u64 size;
+   unsigned char *min_rmap;
+   phandle optnode;
+   char str[64];
+
+   optnode = call_prom("finddevice", 1, 1, ADDR("/options"));
+   if (!PHANDLE_VALID(optnode))
+   prom_panic("Cannot find /options");
+
+   /*
+* If a prior boot specified a new RMA size, use that size in
+* ibm_architecture_vec[]. See also increase_rma_size().
+*/
+   size = 0ULL;
+   memset(str, 0, sizeof(str));
+   rc = prom_getprop(optnode, "ibm,new-rma-size", &str, sizeof(str));
+   if (rc <= 0)
+   return;
+
+   size = prom_strtoul(str, NULL);
+   min_rmap = &ibm_architecture_vec[IBM_ARCH_VEC_MIN_RMA_OFFSET];
+
+   if (size) {
+   prom_printf("Using RMA size %lu from ibm,new-rma-size.\n", 
size);
+   min_rmap[0] = (size >> 24) & 0xff;
+   min_rmap[1] = (size >> 16) & 0xff;
+   min_rmap[2] = (size >> 8) & 0xff;
+   min_rmap[3] = size & 0xff;
}
 }

@@ -911,6 +955,8 @@ static void __init prom_send_capabilities(void)

fixup_nr_cores();

+   fixup_rma_size();
+
/* try calling the ibm,client-architecture-support method */
prom_printf("Calling ibm,client-architecture-support...");
if (call_prom_ret("call-method", 3, 2, &ret,
@@ -946,6 +992,52 @@ static void __init prom_send_capabilities(void)
}
 #endif /* __BIG_ENDIAN__ */
 }
+
+static void __init increase_rma_size(void)
+{
+   int rc;
+   u64 size;
+   char str[64];
+   phandle optnode;
+
+   optnode = call_prom("finddevice", 1, 1, ADDR("/options"));
+   if (!PHANDLE_VALID(optnode))
+   prom_panic("Cannot find /options");
+
+   /*
+* If we already increased the RMA size, return.
+*/
+   size = 0ULL;
+   memset(str, 0, sizeof(str));
+   rc = prom_getprop(optnode, "ibm,new-rma-size", &str, sizeof(str));
+
+   size = prom_strtoul(str, NULL);
+   if (size == 512ULL) {


Is this preferred over strncmp?  Using a string also helps with my suggestion 
below...


+   prom_printf("RMA 

Re: [PATCH] cpufreq: powernv: Redesign the presentation of throttle notification

2015-12-14 Thread Paul Clarke

On 12/13/2015 12:17 PM, Shilpasri G Bhat wrote:

Replace the throttling event console messages to perf trace event
"power:powernv_throttle" and throttle counter stats which are
exported in sysfs. The newly added sysfs files are as follows:

1)/sys/devices/system/node/node0/throttle_frequencies
   This gives the throttle stats for each of the available frequencies.
   The throttle stat of a frequency is the total number of times the max
   frequency was reduced to that frequency.
   # cat /sys/devices/system/node/node0/throttle_frequencies
   4023000 0
   3990000 0
   3956000 1
   3923000 0
   3890000 0
   3857000 2
   3823000 0
   3790000 0
   3757000 2
   3724000 1
   3690000 1
   ...


Is this data useful?  It seems like "elapsed time" at each frequency might be
more useful, if anything.



2)/sys/devices/system/node/node0/throttle_reasons
   This gives the stats for each of the supported throttle reasons.
   This gives the total number of times the frequency was throttled due
   to each of the reasons.
   # cat /sys/devices/system/node/node0/throttle_reasons
   No throttling 7
   Power Cap 0
   Processor Over Temperature 7
   Power Supply Failure 0
   Over Current 0
   OCC Reset 0

3)/sys/devices/system/node/node0/throttle_stat
   This gives the total number of throttle events occurred in turbo
   range of frequencies and non-turbo(below nominal) range of
   frequencies.


non-turbo should read "at or below nominal".  Maybe "sub-turbo" is a better 
term(?)



   # cat /sys/devices/system/node/node0/throttle_stat
   Turbo 7
   Nominal 0


Should this read "Non-turbo" or "Sub-turbo" instead of "Nominal", since the
events could well occur when already operating below nominal?



Signed-off-by: Shilpasri G Bhat 
---
  drivers/cpufreq/powernv-cpufreq.c | 186 +-
  include/trace/events/power.h  |  22 +
  2 files changed, 166 insertions(+), 42 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index cb50138..bdde9d6 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -28,6 +28,9 @@
  #include 
  #include 
  #include 
+#include 
+#include 
+#include 

  #include 
  #include 
@@ -43,12 +46,27 @@
  static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
  static bool rebooting, throttled, occ_reset;

+static char throttle_reason[][30] = {
+   "No throttling",
+   "Power Cap",
+   "Processor Over Temperature",
+   "Power Supply Failure",
+   "Over Current",
+   "OCC Reset"
+};


I'm curious if this would be slightly more efficiently implemented as:
static const char *throttle_reason[] = { ... };

Do you need 30 characters per string for a reason?

Regardless, it should be const.

[...]
--
PC

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2] powerpc: re-enable dynticks

2015-02-27 Thread Paul Clarke
It appears the discussion wrapped up on this, favorably.  Is there 
anything else I need to do to get this merged?


Regards,
PC

On 02/20/2015 11:18 PM, Paul E. McKenney wrote:

On Fri, Feb 20, 2015 at 11:13:33AM -0600, Paul Clarke wrote:


implement arch_irq_work_has_interrupt() for powerpc

(resending because I messed up the e-mail addresses)

Commit 9b01f5bf3 introduced a dependency on IRQ work self-IPIs for
full dynamic ticks to be enabled, by expecting architectures to
implement a suitable arch_irq_work_has_interrupt() routine.

Several arches have implemented this routine, including x86
(3010279f) and arm (09f6edd4), but powerpc was omitted.

This patch implements this routine for powerpc.

The symptom, at boot (on powerpc arch systems) with nohz_full=<CPU
list>, is displayed:
 NO_HZ: Can't run full dynticks because arch doesn't support irq
work self-IPIs

after this patch:
 NO_HZ: Full dynticks CPUs: <CPU list>.

Tested against 3.19.

v2: changed return 1 to return true, per Michael Ellerman

CC: Frederic Weisbecker fweis...@gmail.com
CC: Paul E. McKenney paul...@linux.vnet.ibm.com
Signed-off-by: Paul A. Clarke p...@us.ibm.com


Reviewed-by: Paul E. McKenney paul...@linux.vnet.ibm.com


diff --git a/arch/powerpc/include/asm/irq_work.h
b/arch/powerpc/include/asm/irq_work.h
new file mode 100644
index 000..99cc0aa
--- /dev/null
+++ b/arch/powerpc/include/asm/irq_work.h
@@ -0,0 +1,11 @@
+#ifndef _ASM_IRQ_WORK_H
+#define _ASM_IRQ_WORK_H
+
+#include <asm/processor.h>
+
+static inline bool arch_irq_work_has_interrupt(void)
+{
+return true;
+}
+
+#endif /* _ASM_IRQ_WORK_H */



[PATCH v2] powerpc: re-enable dynticks

2015-02-20 Thread Paul Clarke

implement arch_irq_work_has_interrupt() for powerpc

Commit 9b01f5bf3 introduced a dependency on IRQ work self-IPIs for 
full dynamic ticks to be enabled, by expecting architectures to 
implement a suitable arch_irq_work_has_interrupt() routine.


Several arches have implemented this routine, including x86 (3010279f) 
and arm (09f6edd4), but powerpc was omitted.


This patch implements this routine for powerpc.

The symptom, at boot (on powerpc arch systems) with nohz_full=<CPU
list>, is displayed:
NO_HZ: Can't run full dynticks because arch doesn't support irq 
work self-IPIs


after this patch:
NO_HZ: Full dynticks CPUs: <CPU list>.

Tested against 3.19.

v2: changed return 1 to return true, per Michael Ellerman

CC: Frederic Weisbecker fweis...@gmail.com
CC: Paul E. McKenney paul...@linux.vnet.ibm.com
Signed-off-by: Paul A. Clarke p...@us.ibm.com

diff --git a/arch/powerpc/include/asm/irq_work.h 
b/arch/powerpc/include/asm/irq_work.h

new file mode 100644
index 000..99cc0aa
--- /dev/null
+++ b/arch/powerpc/include/asm/irq_work.h
@@ -0,0 +1,11 @@
+#ifndef _ASM_IRQ_WORK_H
+#define _ASM_IRQ_WORK_H
+
+#include <asm/processor.h>
+
+static inline bool arch_irq_work_has_interrupt(void)
+{
+   return true;
+}
+
+#endif /* _ASM_IRQ_WORK_H */


[PATCH v2] powerpc: re-enable dynticks

2015-02-20 Thread Paul Clarke


implement arch_irq_work_has_interrupt() for powerpc

(resending because I messed up the e-mail addresses)

Commit 9b01f5bf3 introduced a dependency on IRQ work self-IPIs for 
full dynamic ticks to be enabled, by expecting architectures to 
implement a suitable arch_irq_work_has_interrupt() routine.


Several arches have implemented this routine, including x86 (3010279f) 
and arm (09f6edd4), but powerpc was omitted.


This patch implements this routine for powerpc.

The symptom, at boot (on powerpc arch systems) with nohz_full=<CPU
list>, is displayed:
NO_HZ: Can't run full dynticks because arch doesn't support irq 
work self-IPIs


after this patch:
NO_HZ: Full dynticks CPUs: <CPU list>.

Tested against 3.19.

v2: changed return 1 to return true, per Michael Ellerman

CC: Frederic Weisbecker fweis...@gmail.com
CC: Paul E. McKenney paul...@linux.vnet.ibm.com
Signed-off-by: Paul A. Clarke p...@us.ibm.com

diff --git a/arch/powerpc/include/asm/irq_work.h 
b/arch/powerpc/include/asm/irq_work.h

new file mode 100644
index 000..99cc0aa
--- /dev/null
+++ b/arch/powerpc/include/asm/irq_work.h
@@ -0,0 +1,11 @@
+#ifndef _ASM_IRQ_WORK_H
+#define _ASM_IRQ_WORK_H
+
+#include <asm/processor.h>
+
+static inline bool arch_irq_work_has_interrupt(void)
+{
+return true;
+}
+
+#endif /* _ASM_IRQ_WORK_H */


[PATCH] powerpc: re-enable dynticks

2015-02-13 Thread Paul Clarke

implement arch_irq_work_has_interrupt() for powerpc

Commit 9b01f5bf3 introduced a dependency on IRQ work self-IPIs for 
full dynamic ticks to be enabled, by expecting architectures to 
implement a suitable arch_irq_work_has_interrupt() routine.


Several arches have implemented this routine, including x86 (3010279f) 
and arm (09f6edd4), but powerpc was omitted.


This patch implements this routine for powerpc.

The symptom, at boot (on powerpc arch systems) with nohz_full=<CPU
list>, is displayed:
NO_HZ: Can't run full dynticks because arch doesn't support irq 
work self-IPIs


after this patch:
NO_HZ: Full dynticks CPUs: <CPU list>.

Tested against 3.19.

CC: Frederic Weisbecker fweis...@gmail.com
CC: Paul E. McKenney paul...@linux.vnet.ibm.com
Signed-off-by: Paul A. Clarke p...@us.ibm.com

diff --git a/arch/powerpc/include/asm/irq_work.h 
b/arch/powerpc/include/asm/irq_work.h

new file mode 100644
index 000..18365ec
--- /dev/null
+++ b/arch/powerpc/include/asm/irq_work.h
@@ -0,0 +1,11 @@
+#ifndef _ASM_IRQ_WORK_H
+#define _ASM_IRQ_WORK_H
+
+#include <asm/processor.h>
+
+static inline bool arch_irq_work_has_interrupt(void)
+{
+   return 1;
+}
+
+#endif /* _ASM_IRQ_WORK_H */

--
Regards,
Paul Clarke, IBM


Re: [PATCH] powerpc: Reword the returning from prom_init message

2015-01-12 Thread Paul Clarke

On 01/12/2015 03:48 AM, Michael Ellerman wrote:

We get way too many bug reports that say the kernel is hung in
prom_init, which stems from the fact that the last piece of output
people see is "returning from prom_init".

The kernel is almost never hung in prom_init(), it's just that it's
crashed somewhere after prom_init() but prior to the console coming up.

The existing message should give a clue to that, ie. "returning from"
indicates that prom_init() has finished, but it doesn't seem to work.
Let's try something different.

This prints:

   Calling quiesce...
   Shutting down Open Firmware, booting Linux via __start() ...

Which hopefully makes it clear that prom_init() is not the problem, and
although __start() probably isn't either, it's at least the right place
to begin looking.


I'm very much in favor of anything that increases usability or decreases 
confusion.


I worry about confusion caused by any phrase that begins "Shutting 
down...".  Even if accurate, I wonder if something more positive that 
indicates that, at least at this point, things are continuing to proceed 
along would be preferred.


Perhaps something like "Transferring control from Open Firmware to 
kernel (via __start)"?


(Do we know for certain that what is about to be invoked is Linux 
per se, or can it be something else that has a __start?)


PC


Re: [PATCH] powerpc/powernv: Support OPAL requested heartbeat

2014-11-13 Thread Paul Clarke

On 11/12/2014 12:03 AM, Benjamin Herrenschmidt wrote:

If OPAL requests it, call it back via opal_poll_events() at a
regular interval. Some versions of OPAL on some machines require
this to operate some internal timeouts properly.


I don't see any reason why not, but wanted to ask to make sure: this 
daemon can be bound to a core, correct?  At least manually by taskset or 
similar?


It sounds a lot like the RTAS thread, which I was happy to see go away 
when running directly on OPAL, and don't want something else to take its 
place.  It was not bindable, but ran periodically on EVERY core, err 
hardware thread.


(Just looking to avoid latency blips.)

PC


Re: powerpc: mitigate impact of decrementer reset

2014-11-13 Thread Paul Clarke

On 11/12/2014 08:39 PM, Michael Ellerman wrote:

On Wed, 2014-11-05 at 11:06 -0600, Paul Clarke wrote:

On 10/07/2014 09:52 PM, Michael Ellerman wrote:

On Tue, 2014-07-10 at 19:13:24 UTC, Paul Clarke wrote:

This patch short-circuits the reset of the decrementer, exiting after
the decrementer reset, but before the housekeeping tasks if the only
need for the interrupt is simply to reset it.  After this patch,
the latency spike was measured at about 150 nanoseconds.



Thanks for the excellent changelog. But this patch makes me a bit nervous :)

Do you know where the latency is coming from? Is it primarily the irq work?


Yes, it is all under irq_enter (measured at ~10us) and irq_exit (~12us).


Hmm, OK. I actually meant irq_work_run().

AIUI irq_enter/exit() are just state tracking, they shouldn't be actually
running work.

How are you measuring it?


ftrace function_graph tracer:
--
  127.425212 ||  .irq_enter() {
  127.425213 ||.rcu_irq_enter() {
  127.425213 |  + 12.206 us   |  .rcu_eqs_exit_common.isra.41();
  127.425226 |  + 12.750 us   |}
... RCU is a big hitter
  127.425226 ||.vtime_common_account_irq_enter() {
  127.425226 ||  .vtime_account_user() {
  127.425226 |0.032 us|._raw_spin_lock();
  127.425227 |0.034 us|.get_vtime_delta();
  127.425227 ||.account_user_time() {
  127.425228 |0.030 us|  .cpuacct_account_field();
  127.425228 ||  .acct_account_cputime() {
  127.425228 |0.082 us|.__acct_update_integrals();
  127.425229 |0.562 us|  }
  127.425229 |1.500 us|}
  127.425229 |2.954 us|  }
  127.425230 |3.434 us|}
... but even accounting is not insignificant
  127.425230 |  + 17.218 us   |  }
  127.425230 ||  /* timer_interrupt_entry: [...] */
... nothing to see here, because there's nothing to do except reset the 
decrementer

  127.425230 ||  /* timer_interrupt_exit: [...] */
... (less than 1 us spent doing the required work)
  127.425231 ||  .irq_exit() {
  127.425231 ||.vtime_gen_account_irq_exit() {
  127.425231 |0.036 us|  ._raw_spin_lock();
  127.425232 ||  .__vtime_account_system() {
  127.425232 |0.030 us|.get_vtime_delta();
  127.425232 ||.account_system_time() {
  127.425233 |0.030 us|  .cpuacct_account_field();
  127.425233 ||  .acct_account_cputime() {
  127.425233 |0.072 us|.__acct_update_integrals();
  127.425234 |0.564 us|  }
  127.425234 |1.546 us|}
  127.425234 |2.528 us|  }
  127.425235 |3.700 us|}
... significant accounting time
  127.425235 |0.032 us|.idle_cpu();
  127.425235 ||.tick_nohz_irq_exit() {
  127.425236 ||  .can_stop_full_tick() {
  127.425236 |0.022 us|.sched_can_stop_tick();
  127.425236 |0.020 us|.posix_cpu_timers_can_stop_tick()
  127.425237 |0.970 us|  }
  127.425237 |0.082 us|  .ktime_get();
  127.425238 ||  .tick_nohz_stop_sched_tick() {
  127.425238 |0.032 us|.timekeeping_max_deferment();
  127.425238 ||.get_next_timer_interrupt() {
  127.425239 |0.038 us|  ._raw_spin_lock();
  127.425239 ||  .hrtimer_get_next_event() {
  127.425239 |0.030 us|._raw_spin_lock_irqsave();
  127.425240 |0.028 us|._raw_spin_unlock_irqrestore
  127.425240 |0.984 us|  }
  127.425241 |1.936 us|}
  127.425241 |0.032 us|.scheduler_tick_max_deferment();
  127.425241 |3.438 us|  }
  127.425242 |5.880 us|}
  127.425242 ||.rcu_irq_exit() {
  127.425242 |0.102 us|  .rcu_eqs_enter_common.isra.40();
  127.425243 |0.576 us|}
  127.425243 |  + 12.156 us   |  }

This one was almost 30 us total (17.218 + 12.156 = 29.374 us), just to 
reset the decrementer.



If so I'd prefer if we could move the short circuit into __timer_interrupt()
itself. That way we'd still have the trace points usable, and it would
hopefully result in less duplicated logic.


But irq_enter and irq_exit are called in timer_interrupt, before
__timer_interrupt is called.  I don't see how that helps.  The time
spent in __timer_interrupt is minuscule by comparison.


Right, it won't help if it's irq_enter() that is causing the delay. But I was
assuming it was irq_work_run().


Are you suggesting that irq_enter/exit be moved into __timer_interrupt
as well?  (I'm not sure how that would impact the existing call to
__timer_interrupt from tick_broadcast_ipi_handler?  And if there is no
impact, what's the point of separating timer_interrupt and
__timer_interrupt?)

Re: [PATCH] powerpc: mitigate impact of decrementer reset

2014-11-10 Thread Paul Clarke

On 11/10/2014 04:08 AM, Benjamin Herrenschmidt wrote:

On Tue, 2014-10-07 at 14:13 -0500, Paul Clarke wrote:

The POWER ISA defines an always-running decrementer which can be used
to schedule interrupts after a certain time interval has elapsed.
The decrementer counts down at the same frequency as the Time Base,
which is 512 MHz.  The maximum value of the decrementer is 0x7fffffff.
This works out to a maximum interval of about 4.19 seconds.

If a larger interval is desired, the kernel will set the decrementer
to its maximum value and reset it after it expires (underflows)
a sufficient number of times until the desired interval has elapsed.

The negative effect of this is that an unwanted latency spike will
impact normal processing at most every 4.19 seconds.  On an IBM
POWER8-based system, this spike was measured at about 25-30
microseconds, much of which was basic, opportunistic housekeeping
tasks that could otherwise have waited.

This patch short-circuits the reset of the decrementer, exiting after
the decrementer reset, but before the housekeeping tasks if the only
need for the interrupt is simply to reset it.  After this patch,
the latency spike was measured at about 150 nanoseconds.


Doesn't this break the irq_work stuff ? We trigger it with a set_dec(1);
and your patch will probably cause it to be skipped...


You're right.

I'm confused by the division between timer_interrupt() and 
__timer_interrupt().  The former is called with interrupts disabled (and 
enables them), but also calls irq_enter()/irq_exit().  Why are those 
calls not in __timer_interrupt()?  (If they were, the short-circuit 
logic might be a bit easier to put directly in __timer_interrupt(), 
which would eliminate any duplicate code.)


It looks like __timer_interrupt is only called directly by the broadcast 
timer IPI handler.  (Why is __timer_interrupt not static?)  Does this 
path not need irq_enter/irq_exit?



Signed-off-by: Paul A. Clarke p...@us.ibm.com
---
   arch/powerpc/kernel/time.c | 13 +
   1 file changed, 13 insertions(+)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 368ab37..962a06b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -528,6 +528,7 @@ void timer_interrupt(struct pt_regs * regs)
   {
struct pt_regs *old_regs;
	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
+   u64 now;

/* Ensure a positive value is written to the decrementer, or else
 * some CPUs will continue to take decrementer exceptions.
@@ -550,6 +551,18 @@ void timer_interrupt(struct pt_regs * regs)
 */
may_hard_irq_enable();

+   /* If this is simply the decrementer expiring (underflow) due to
+* the limited size of the decrementer, and not a set timer,
+* reset (if needed) and return
+*/
+   now = get_tb_or_rtc();
+   if (now < *next_tb) {
+   now = *next_tb - now;
+   if (now <= DECREMENTER_MAX)
+   set_dec((int)now);
+   __get_cpu_var(irq_stat).timer_irqs_others++;
+   return;
+   }

    #if defined(CONFIG_PPC32) && defined(CONFIG_PPC_PMAC)
	if (atomic_read(&ppc_n_lost_interrupts) != 0)


Regards,
PC


Re: powerpc: mitigate impact of decrementer reset

2014-11-05 Thread Paul Clarke

Sorry it took me so long to get back to this...

On 10/07/2014 09:52 PM, Michael Ellerman wrote:

On Tue, 2014-07-10 at 19:13:24 UTC, Paul Clarke wrote:

The POWER ISA defines an always-running decrementer which can be used
to schedule interrupts after a certain time interval has elapsed.
The decrementer counts down at the same frequency as the Time Base,
which is 512 MHz.  The maximum value of the decrementer is 0x7fffffff.
This works out to a maximum interval of about 4.19 seconds.

If a larger interval is desired, the kernel will set the decrementer
to its maximum value and reset it after it expires (underflows)
a sufficient number of times until the desired interval has elapsed.

The negative effect of this is that an unwanted latency spike will
impact normal processing at most every 4.19 seconds.  On an IBM
POWER8-based system, this spike was measured at about 25-30
microseconds, much of which was basic, opportunistic housekeeping
tasks that could otherwise have waited.

This patch short-circuits the reset of the decrementer, exiting after
the decrementer reset, but before the housekeeping tasks if the only
need for the interrupt is simply to reset it.  After this patch,
the latency spike was measured at about 150 nanoseconds.



Thanks for the excellent changelog. But this patch makes me a bit nervous :)

Do you know where the latency is coming from? Is it primarily the irq work?


Yes, it is all under irq_enter (measured at ~10us) and irq_exit (~12us).


If so I'd prefer if we could move the short circuit into __timer_interrupt()
itself. That way we'd still have the trace points usable, and it would
hopefully result in less duplicated logic.


But irq_enter and irq_exit are called in timer_interrupt, before 
__timer_interrupt is called.  I don't see how that helps.  The time 
spent in __timer_interrupt is minuscule by comparison.


Are you suggesting that irq_enter/exit be moved into __timer_interrupt 
as well?  (I'm not sure how that would impact the existing call to 
__timer_interrupt from tick_broadcast_ipi_handler?  And if there is no 
impact, what's the point of separating timer_interrupt and 
__timer_interrupt?)


Regards,
PC


Re: [PATCH] powerpc: mitigate impact of decrementer reset

2014-10-08 Thread Paul Clarke

On 10/08/2014 12:37 AM, Heinz Wrobel wrote:

What if your TB wraps during the test?


Per the Power ISA, Time Base is 64 bits, monotonically increasing, and 
is writable only in hypervisor state.  To my understanding, it is set to 
zero at boot (although this is not prescribed).


Also, as noted by others, the logic is roughly duplicated (with some 
differences) from the analogous code in __timer_interrupt just above it.


I don't see wrapping as a concern.

PC


[PATCH] powerpc: mitigate impact of decrementer reset

2014-10-07 Thread Paul Clarke

The POWER ISA defines an always-running decrementer which can be used
to schedule interrupts after a certain time interval has elapsed.
The decrementer counts down at the same frequency as the Time Base,
which is 512 MHz.  The maximum value of the decrementer is 0x7fffffff.
This works out to a maximum interval of about 4.19 seconds.

If a larger interval is desired, the kernel will set the decrementer
to its maximum value and reset it after it expires (underflows)
a sufficient number of times until the desired interval has elapsed.

The negative effect of this is that an unwanted latency spike will
impact normal processing at most every 4.19 seconds.  On an IBM
POWER8-based system, this spike was measured at about 25-30
microseconds, much of which was basic, opportunistic housekeeping
tasks that could otherwise have waited.

This patch short-circuits the reset of the decrementer, exiting after
the decrementer reset, but before the housekeeping tasks if the only
need for the interrupt is simply to reset it.  After this patch,
the latency spike was measured at about 150 nanoseconds.

Signed-off-by: Paul A. Clarke p...@us.ibm.com
---
 arch/powerpc/kernel/time.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 368ab37..962a06b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -528,6 +528,7 @@ void timer_interrupt(struct pt_regs * regs)
 {
struct pt_regs *old_regs;
	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
+   u64 now;

/* Ensure a positive value is written to the decrementer, or else
 * some CPUs will continue to take decrementer exceptions.
@@ -550,6 +551,18 @@ void timer_interrupt(struct pt_regs * regs)
 */
may_hard_irq_enable();

+   /* If this is simply the decrementer expiring (underflow) due to
+* the limited size of the decrementer, and not a set timer,
+* reset (if needed) and return
+*/
+   now = get_tb_or_rtc();
+   if (now < *next_tb) {
+   now = *next_tb - now;
+   if (now <= DECREMENTER_MAX)
+   set_dec((int)now);
+   __get_cpu_var(irq_stat).timer_irqs_others++;
+   return;
+   }

 #if defined(CONFIG_PPC32) && defined(CONFIG_PPC_PMAC)
	if (atomic_read(&ppc_n_lost_interrupts) != 0)
--
2.1.2.330.g565301e


