Re: [PATCH 1/8][v4] powerpc/perf: Rename Power8 macros to start with PME
On 09/14/2013 06:19 AM, Sukadev Bhattiprolu wrote:
> We use helpers like GENERIC_EVENT_ATTR() to list the generic events in
> sysfs. To avoid name collisions, GENERIC_EVENT_ATTR() requires the perf
> event macros to start with PME.

We got all the raw event codes covered for P7 with the help of the
power7-events-list.h enumeration.

	/*
	 * Power7 event codes.
	 */
	#define EVENT(_name, _code)	\
		PME_##_name = _code,

	enum {
	#include "power7-events-list.h"
	};
	#undef EVENT

Just wondering whether it is a good idea to rename these selected macros so
they can be consumed by GENERIC_EVENT_ATTR() right here for this purpose, or
whether we need to get the comprehensive list of raw events for P8 first.
Just an idea.

Regards
Anshuman

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 8/8][v4] powerpc/perf: Export Power7 memory hierarchy info to user space.
On 09/14/2013 06:19 AM, Sukadev Bhattiprolu wrote:
> On Power7, the DCACHE_SRC field in MMCRA register identifies the memory
> hierarchy level (eg: L2, L3 etc) from which a data-cache miss for a
> marked instruction was satisfied.
>
> Use the 'perf_mem_data_src' object to export this hierarchy level to user
> space. Some memory hierarchy levels in Power7 don't map into the
> arch-neutral levels. However, since newer generations of the processor
> (i.e. Power8) use fewer levels than Power7, we don't really need to define
> new hierarchy levels just for Power7.
>
> We instead map as many levels as possible and approximate the rest. See
> comments near dcache_src_map[] in the patch.
>
> Usage:
>
>	perf record -d -e 'cpu/PM_MRK_GRP_CMPL/'
>	perf report -n --mem-mode --sort=mem,sym,dso,symbol_daddr,dso_daddr
>
>	For samples involving load/store instructions, the memory
>	hierarchy level is shown as "L1 hit", "Remote RAM hit" etc.
>
> or
>
>	perf record --data
>	perf report -D
>
>	Sample records contain a 'data_src' field which encodes the
>	memory hierarchy level: Eg: data_src 0x442 indicates
>	MEM_OP_LOAD, MEM_LVL_HIT, MEM_LVL_L2 (i.e. load hit L2).

Successfully built and boot tested this entire patchset on both a P7 and a
P8 system. Running some sample tests with the ebizzy micro-benchmark. Till
now I have got only 0x142 and 0x0 values in the data_src object of the
sample records. Will experiment a bit more on the P7 and P8 systems and post
the results.

Regards
Anshuman
Re: [PATCH 8/8][v4] powerpc/perf: Export Power7 memory hierarchy info to user space.
On 09/14/2013 06:19 AM, Sukadev Bhattiprolu wrote:
> +static void power7_get_mem_data_src(union perf_mem_data_src *dsrc,
> +				struct pt_regs *regs)
> +{
> +	u64 idx;
> +	u64 mmcra = regs->dsisr;
> +	u64 addr;
> +	int ret;
> +	unsigned int instr;
> +
> +	if (mmcra & POWER7_MMCRA_DCACHE_MISS) {
> +		idx = mmcra & POWER7_MMCRA_DCACHE_SRC_MASK;
> +		idx >>= POWER7_MMCRA_DCACHE_SRC_SHIFT;
> +
> +		dsrc->val |= dcache_src_map[idx];
> +		return;
> +	}
> +
> +	instr = 0;
> +	addr = perf_instruction_pointer(regs);
> +
> +	if (is_kernel_addr(addr))
> +		instr = *(unsigned int *)addr;
> +	else {
> +		pagefault_disable();
> +		ret = __get_user_inatomic(instr, (unsigned int __user *)addr);
> +		pagefault_enable();
> +		if (ret)
> +			instr = 0;
> +	}
> +	if (instr && instr_is_load_store(&instr))

Wondering if there is any possibility of getting positive values for
"(mmcra & POWER7_MMCRA_DCACHE_SRC_MASK) >> POWER7_MMCRA_DCACHE_SRC_SHIFT"
when the marked instruction did not have the MMCRA[POWER7_MMCRA_DCACHE_MISS]
bit set. In that case we should actually compute dsrc->val as in the
previous case. I did a couple of experiments on a P7 box, but was not able
to find an instance of a marked instruction whose
MMCRA[POWER7_MMCRA_DCACHE_MISS] bit was not set while still having a
positive value in the POWER7_MMCRA_DCACHE_SRC field.
Re: [PATCH V2 0/6] perf: New conditional branch filter
On 09/10/2013 07:36 AM, Michael Ellerman wrote:
> On Fri, 2013-08-30 at 09:54 +0530, Anshuman Khandual wrote:
>> This patchset is the re-spin of the original branch stack sampling
>> patchset which introduced the new PERF_SAMPLE_BRANCH_COND filter. This
>> patchset also enables SW based branch filtering support for PPC64
>> platforms which have branch stack sampling support. With this new
>> enablement, the branch filter support for PPC64 platforms has been
>> extended to include all these combinations discussed below with a
>> sample test application program.
>
> ...
>
>> Mixed filters
>> -------------
>> (6) perf record -e branch-misses:u -j any_call,any_ret ./cprog
>> Error:
>> The perf.data file has no samples!
>>
>> NOTE: As expected. The HW filters all the branches which are calls and
>> SW tries to find return branches in that given set. Both the filters
>> are mutually exclusive, so obviously no samples are found in the end
>> profile.
>
> The semantics of multiple filters is not clear to me. It could be an OR,
> or an AND. You have implemented AND, does that match existing behaviour
> on x86 for example?

I believe it does match. The x86 code drops the branch records (originally
captured in the LBR) while applying the SW filters.

Regards
Anshuman
Re: [PATCH V2 0/6] perf: New conditional branch filter
On 09/26/2013 04:44 PM, Stephane Eranian wrote:
> So you are saying that the HW filter is exclusive. That seems odd. But
> I think it is because one of the choices is ANY. ANY covers all the
> types of branches. Therefore it does not make a difference whether you
> add COND or not. And vice-versa, if you set COND, you need to disable
> ANY. I bet if you add other filters such as CALL, RETURN, then you
> could OR them and say: I want RETURN or CALLS.
>
> But that's okay. The API operates in OR mode but if the HW does not
> support it, you can check the mask and reject if more than one type is
> set. That is arch-specific code. The alternative is to only capture ANY
> and emulate the filter in SW. This will work, of course. But the
> downside is that you lose the way to appreciate how many, for instance,
> COND branches you sampled out of the total number of COND branches
> retired. Unless you can count COND branches separately.

Hey Stephane,

Thanks for your reply. I am working on a solution where the PMU will process
all the requested branch filters in HW only if it can filter all of them in
an OR manner; otherwise it will leave the entire thing up to the SW to
process and do no filtering itself. This implies that branch filtering will
happen either completely in HW or completely in SW, and never in a mixed
manner. This way it will conform to the OR mode defined in the API. I will
post the revised patch set soon.

Regards
Anshuman
[PATCH V2 0/6] perf: New conditional branch filter
	sc			# PERF_SAMPLE_BRANCH_ANY_CALL

	bl hw_1_1		# PERF_SAMPLE_BRANCH_ANY_CALL
	bl hw_1_2		# PERF_SAMPLE_BRANCH_ANY_CALL
				# PERF_SAMPLE_BRANCH_COND

	bl hw_2_1		# PERF_SAMPLE_BRANCH_ANY_CALL
	bl hw_2_2		# PERF_SAMPLE_BRANCH_ANY_CALL
				# PERF_SAMPLE_BRANCH_ANY_RET

	bl sw_3_1		# PERF_SAMPLE_BRANCH_ANY_CALL
	bl sw_3_2		# PERF_SAMPLE_BRANCH_ANY_CALL
				# PERF_SAMPLE_BRANCH_IND_CALL

	bl sw_4_1		# PERF_SAMPLE_BRANCH_ANY_CALL
	bl sw_4_2		# PERF_SAMPLE_BRANCH_ANY_CALL

	# Restore LR
	mtlr 25
	blr			# PERF_SAMPLE_BRANCH_ANY_RET

Changes in V2
-------------
(1) Enabled PPC64 SW branch filtering support
(2) Incorporated changes required for all previous comments

Anshuman Khandual (6):
  perf: New conditional branch filter criteria in branch stack sampling
  powerpc, perf: Enable conditional branch filter for POWER8
  perf, tool: Conditional branch filter 'cond' added to perf record
  x86, perf: Add conditional branch filtering support
  perf, documentation: Description for conditional branch filter
  powerpc, perf: Enable SW filtering in branch stack sampling framework

 arch/powerpc/include/asm/perf_event_server.h |   2 +-
 arch/powerpc/perf/core-book3s.c              | 200 +--
 arch/powerpc/perf/power8-pmu.c               |  25 ++--
 arch/x86/kernel/cpu/perf_event_intel_lbr.c   |   5 +
 include/uapi/linux/perf_event.h              |   3 +-
 tools/perf/Documentation/perf-record.txt     |   3 +-
 tools/perf/builtin-record.c                  |   1 +
 7 files changed, 216 insertions(+), 23 deletions(-)

--
1.7.11.7
[PATCH V2 5/6] perf, documentation: Description for conditional branch filter
Adding documentation support for conditional branch filter.

Signed-off-by: Anshuman Khandual
Reviewed-by: Stephane Eranian
---
 tools/perf/Documentation/perf-record.txt | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index e297b74..59ca8d0 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -163,12 +163,13 @@ following filters are defined:
 	- any_call: any function call or system call
 	- any_ret: any function return or system call return
 	- ind_call: any indirect branch
+	- cond: conditional branches
 	- u:  only when the branch target is at the user level
 	- k: only when the branch target is in the kernel
 	- hv: only when the target is at the hypervisor level
 
-The option requires at least one branch type among any, any_call, any_ret, ind_call.
+The option requires at least one branch type among any, any_call, any_ret, ind_call, cond.
 The privilege levels may be omitted, in which case, the privilege levels of the associated
 event are applied to the branch filter. Both kernel (k) and hypervisor (hv) privilege
 levels are subject to permissions. When sampling on multiple events, branch stack sampling
--
1.7.11.7
[PATCH V2 4/6] x86, perf: Add conditional branch filtering support
This patch adds conditional branch filtering support, enabling it for
PERF_SAMPLE_BRANCH_COND in perf branch stack sampling framework by
utilizing an available software filter X86_BR_JCC.

Signed-off-by: Anshuman Khandual
Reviewed-by: Stephane Eranian
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index d5be06a..9723773 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -371,6 +371,9 @@ static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
 	if (br_type & PERF_SAMPLE_BRANCH_NO_TX)
 		mask |= X86_BR_NO_TX;
 
+	if (br_type & PERF_SAMPLE_BRANCH_COND)
+		mask |= X86_BR_JCC;
+
 	/*
 	 * stash actual user request into reg, it may
 	 * be used by fixup code for some CPU
@@ -665,6 +668,7 @@ static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
 	 * NHM/WSM erratum: must include IND_JMP to capture IND_CALL
 	 */
 	[PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP,
+	[PERF_SAMPLE_BRANCH_COND] = LBR_JCC,
 };
 
 static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
@@ -676,6 +680,7 @@ static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
 	[PERF_SAMPLE_BRANCH_ANY_CALL] = LBR_REL_CALL | LBR_IND_CALL
 					| LBR_FAR,
 	[PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL,
+	[PERF_SAMPLE_BRANCH_COND] = LBR_JCC,
 };
 
 /* core */
--
1.7.11.7
[PATCH V2 6/6] powerpc, perf: Enable SW filtering in branch stack sampling framework
This patch enables SW based post processing of BHRB captured branches to be
able to meet more user defined branch filtering criteria in the perf branch
stack sampling framework. These changes increase the number of filters and
their valid combinations on powerpc64 platforms with BHRB support. A summary
of the code changes is described below.

(1) struct cpu_hw_events

	Introduced two new variables and modified one to track various
	filters.

	(a) bhrb_hw_filter	Tracks PMU based HW branch filter flags.
				Computed from a PMU dependent callback.
	(b) bhrb_sw_filter	Tracks SW based instruction filter flags.
				Computed from the PPC64 generic SW filter.
	(c) filter_mask		Tracks overall filter flags for PPC64.

(2) Creating HW event with BHRB request

	The kernel would try to figure out supported HW filters through a
	PMU callback ppmu->bhrb_filter_map(). Here it would only invalidate
	unsupported HW filter combinations. In future we could process one
	element from the combination in HW and one in SW. Meanwhile
	cpuhw->filter_mask would be tracking the overall supported branch
	filter requests on the PMU.

	The kernel would also process the user request against the available
	SW filters for PPC64. Then we would process filter_mask to verify
	whether all the user requested branch filters have been taken care
	of, either in HW or in SW.

(3) BHRB SW filter processing

	During the BHRB data capture inside the PMU interrupt context, each
	of the captured "perf_branch_entry.from" addresses would be checked
	for compliance with the applicable SW branch filters. If the entry
	does not conform to the filter requirements, it would be discarded
	from the final perf branch stack buffer.

(4) Instruction classification for proposed SW filters

	Here is the list of categories of instructions which have been
	classified under the proposed SW filters.
	(a) PERF_SAMPLE_BRANCH_ANY_RETURN

		(i) [Un]conditional branch to LR without setting the LR

			(1) blr      (2) bclr     (3) btlr     (4) bflr
			(5) bdnzlr   (6) bdnztlr  (7) bdnzflr  (8) bdzlr
			(9) bdztlr   (10) bdzflr  (11) bltlr   (12) blelr
			(13) beqlr   (14) bgelr   (15) bgtlr   (16) bnllr
			(17) bnelr   (18) bnglr   (19) bsolr   (20) bnslr
			(21) biclr   (22) bnilr   (23) bunlr   (24) bnulr

	(b) PERF_SAMPLE_BRANCH_IND_CALL

		(i) [Un]conditional branch to CTR with setting the link

			(1) bctrl    (2) bcctrl   (3) btctrl   (4) bfctrl
			(5) bltctrl  (6) blectrl  (7) beqctrl  (8) bgectrl
			(9) bgtctrl  (10) bnlctrl (11) bnectrl (12) bngctrl
			(13) bsoctrl (14) bnsctrl (15) bicctrl (16) bnictrl
			(17) bunctrl (18) bnuctrl

		(ii) [Un]conditional branch to LR setting the link

			(0) bclrl    (1) blrl     (2) btlrl    (3) bflrl
			(4) bdnzlrl  (5) bdnztlrl (6) bdnzflrl (7) bdzlrl
			(8) bdztlrl  (9) bdzflrl  (10) bltlrl  (11) blelrl
			(12) beqlrl  (13) bgelrl  (14) bgtlrl  (15) bnllrl
			(16) bnelrl  (17) bnglrl  (18) bsolrl  (19) bnslrl
			(20) biclrl  (21) bnilrl  (22) bunlrl  (23) bnulrl

		(iii) [Un]conditional branch to TAR setting the link

			(1) btarl    (2) bctarl

Signed-off-by: Anshuman Khandual
---
 arch/powerpc/include/asm/perf_event_server.h |   2 +-
 arch/powerpc/perf/core-book3s.c              | 200 +
[PATCH V2 1/6] perf: New conditional branch filter criteria in branch stack sampling
POWER8 PMU based BHRB supports filtering for conditional branches. This
patch introduces a new branch filter PERF_SAMPLE_BRANCH_COND which will
extend the existing perf ABI. Other architectures can provide this
functionality with either HW filtering support (if present) or with SW
filtering of instructions.

Signed-off-by: Anshuman Khandual
Reviewed-by: Stephane Eranian
---
 include/uapi/linux/perf_event.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 0b1df41..5da52b6 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -160,8 +160,9 @@ enum perf_branch_sample_type {
 	PERF_SAMPLE_BRANCH_ABORT_TX	= 1U << 7, /* transaction aborts */
 	PERF_SAMPLE_BRANCH_IN_TX	= 1U << 8, /* in transaction */
 	PERF_SAMPLE_BRANCH_NO_TX	= 1U << 9, /* not in transaction */
+	PERF_SAMPLE_BRANCH_COND		= 1U << 10, /* conditional branches */
 
-	PERF_SAMPLE_BRANCH_MAX		= 1U << 10, /* non-ABI */
+	PERF_SAMPLE_BRANCH_MAX		= 1U << 11, /* non-ABI */
 };
 
 #define PERF_SAMPLE_BRANCH_PLM_ALL \
--
1.7.11.7
[PATCH V2 3/6] perf, tool: Conditional branch filter 'cond' added to perf record
Adding perf record support for the new branch stack filter criteria
PERF_SAMPLE_BRANCH_COND.

Signed-off-by: Anshuman Khandual
---
 tools/perf/builtin-record.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index ecca62e..802d11d 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -625,6 +625,7 @@ static const struct branch_mode branch_modes[] = {
 	BRANCH_OPT("any_call", PERF_SAMPLE_BRANCH_ANY_CALL),
 	BRANCH_OPT("any_ret", PERF_SAMPLE_BRANCH_ANY_RETURN),
 	BRANCH_OPT("ind_call", PERF_SAMPLE_BRANCH_IND_CALL),
+	BRANCH_OPT("cond", PERF_SAMPLE_BRANCH_COND),
 	BRANCH_END
 };
--
1.7.11.7
[PATCH V2 2/6] powerpc, perf: Enable conditional branch filter for POWER8
Enables conditional branch filter support for POWER8 utilizing the MMCRA
register based filter, and also invalidates a BHRB branch filter combination
involving conditional branches.

Signed-off-by: Anshuman Khandual
---
 arch/powerpc/perf/power8-pmu.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c
index 2ee4a70..6e28587 100644
--- a/arch/powerpc/perf/power8-pmu.c
+++ b/arch/powerpc/perf/power8-pmu.c
@@ -580,11 +580,21 @@ static u64 power8_bhrb_filter_map(u64 branch_sample_type)
 	if (branch_sample_type & PERF_SAMPLE_BRANCH_IND_CALL)
 		return -1;
 
+	/* Invalid branch filter combination - HW does not support */
+	if ((branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) &&
+			(branch_sample_type & PERF_SAMPLE_BRANCH_COND))
+		return -1;
+
 	if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
 		pmu_bhrb_filter |= POWER8_MMCRA_IFM1;
 		return pmu_bhrb_filter;
 	}
 
+	if (branch_sample_type & PERF_SAMPLE_BRANCH_COND) {
+		pmu_bhrb_filter |= POWER8_MMCRA_IFM3;
+		return pmu_bhrb_filter;
+	}
+
 	/* Every thing else is unsupported */
 	return -1;
 }
--
1.7.11.7
Re: [PATCH V2 0/6] perf: New conditional branch filter
On 08/30/2013 05:18 PM, Stephane Eranian wrote:
> 2013/8/30 Anshuman Khandual
>>
>> This patchset is the re-spin of the original branch stack sampling
>> patchset which introduced the new PERF_SAMPLE_BRANCH_COND filter. This
>> patchset also enables SW based branch filtering support for PPC64
>> platforms which have branch stack sampling support. With this new
>> enablement, the branch filter support for PPC64 platforms has been
>> extended to include all these combinations discussed below with a
>> sample test application program.
>>
>
> I am trying to understand which HW has support for capturing the
> branches: PPC7 or PPC8. Then it seems you're saying that only PPC8 has
> the filtering support. On PPC7 you use the SW filter. Did I get this
> right?
>
> I will look at the patch set.

Hey Stephane,

POWER7 does not have the BHRB support required to capture branches. Right
now only POWER8 (which has BHRB) can capture branches in HW. It has some PMU
level branch filters, and the rest we have implemented in SW. But these SW
filters cannot be applied on POWER7, as it does not support branch stack
sampling because of the lack of BHRB.

I mentioned PPC64 support in the sense that this SW filtering code could be
used in existing or future generation powerpc processors which have PMU
support for branch stack sampling. My apologies if the description of the
patchset was ambiguous.

Regards
Anshuman
Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems
On 09/26/2013 12:01 PM, Michael Ellerman wrote:
> +int powernv_hwrng_present(void)
> +{
> +	return __raw_get_cpu_var(powernv_rng) != NULL;
> +}
> +
>  static unsigned long rng_whiten(struct powernv_rng *rng, unsigned long val)
>  {
>  	unsigned long parity;
> @@ -42,6 +48,17 @@ static unsigned long rng_whiten(struct powernv_rng *rng, unsigned long val)
>  	return val;
>  }
> 
> +int powernv_get_random_real_mode(unsigned long *v)
> +{
> +	struct powernv_rng *rng;
> +
> +	rng = __raw_get_cpu_var(powernv_rng);
> +
> +	*v = rng_whiten(rng, in_rm64(rng->regs_real));
> +

Should it be in_be64() instead of in_rm64()? It is failing the build here.
Apart from this, all the individual patches build correctly.

Regards
Anshuman
Re: [PATCH V2 0/6] perf: New conditional branch filter
On 08/30/2013 05:18 PM, Stephane Eranian wrote:
> 2013/8/30 Anshuman Khandual
>>
>> This patchset is the re-spin of the original branch stack sampling
>> patchset which introduced new PERF_SAMPLE_BRANCH_COND filter. This
>> patchset also enables SW based branch filtering support for PPC64
>> platforms which have branch stack sampling support. With this new
>> enablement, the branch filter support for PPC64 platforms have been
>> extended to include all these combinations discussed below with a
>> sample test application program.
>>
>
> I am trying to understand which HW has support for capturing the
> branches: PPC7 or PPC8. Then it seems you're saying that only PPC8 has
> the filtering support. On PPC7 you use the SW filter. Did I get this
> right?
>
> I will look at the patch set.

Hey Stephane,

Just wondering if you got a chance to go though the patchset ?

Regards
Anshuman
Re: [PATCH V2 0/6] perf: New conditional branch filter
On 09/21/2013 12:11 PM, Anshuman Khandual wrote:
> On 08/30/2013 05:18 PM, Stephane Eranian wrote:
>> 2013/8/30 Anshuman Khandual
>>>
>>> This patchset is the re-spin of the original branch stack sampling
>>> patchset which introduced new PERF_SAMPLE_BRANCH_COND filter. This
>>> patchset also enables SW based branch filtering support for PPC64
>>> platforms which have branch stack sampling support. With this new
>>> enablement, the branch filter support for PPC64 platforms have been
>>> extended to include all these combinations discussed below with a
>>> sample test application program.
>>>
>>
>> I am trying to understand which HW has support for capturing the
>> branches: PPC7 or PPC8. Then it seems you're saying that only PPC8 has
>> the filtering support. On PPC7 you use the SW filter. Did I get this
>> right?
>>
>> I will look at the patch set.
>
> Hey Stephane,
>
> Just wondering if you got a chance to go though the patchset ?

s/though/through/
Re: [PATCH V2 0/6] perf: New conditional branch filter
On 09/21/2013 12:25 PM, Stephane Eranian wrote:
> On Tue, Sep 10, 2013 at 4:06 AM, Michael Ellerman wrote:
>>
>> On Fri, 2013-08-30 at 09:54 +0530, Anshuman Khandual wrote:
>>> This patchset is the re-spin of the original branch stack sampling
>>> patchset which introduced new PERF_SAMPLE_BRANCH_COND filter. This
>>> patchset also enables SW based branch filtering support for PPC64
>>> platforms which have branch stack sampling support. With this new
>>> enablement, the branch filter support for PPC64 platforms have been
>>> extended to include all these combinations discussed below with a
>>> sample test application program.
>>
>> ...
>>
>>> Mixed filters
>>> -------------
>>> (6) perf record -e branch-misses:u -j any_call,any_ret ./cprog
>>> Error:
>>> The perf.data file has no samples!
>>>
>>> NOTE: As expected. The HW filters all the branches which are calls
>>> and SW tries to find return branches in that given set. Both the
>>> filters are mutually exclussive, so obviously no samples found in
>>> the end profile.
>>
>> The semantics of multiple filters is not clear to me. It could be an
>> OR, or an AND. You have implemented AND, does that match existing
>> behaviour on x86 for example?
>
> The semantic on the API is OR. AND does not make sense: CALL & RETURN?
> On x86, the HW filter is an OR (default: ALL, set bit to disable a
> type). I suspect it is similar on PPC.

Hey Stephane,

In POWER8 BHRB, we have got three HW PMU filters, out of which we are
trying to use two: PERF_SAMPLE_BRANCH_ANY_CALL and PERF_SAMPLE_BRANCH_COND
respectively.

(1) These filters are exclusive of each other and cannot be OR-ed with
    each other.

(2) The SW filters are applied on the branch record set captured by BHRB,
    which already has the HW filters applied. So the working set is already
    reduced by the HW PMU filters. The SW filter goes through the working
    set, figures out which entries satisfy the SW filter criteria, and
    picks them up. The SW filter cannot find branch records matching the
    criteria outside of the BHRB captured set, so we cannot OR the filters.
    This makes the combination of a HW and a SW filter inherently an "AND",
    not an OR.

(3) But once we have captured the BHRB filtered data with the HW PMU
    filter, multiple SW filters (if requested) can be applied in either an
    OR or an AND manner. It should be either like

	(1) (HW_FILTER_1) && (SW_FILTER_1) && (SW_FILTER_2)
    or like
	(2) (HW_FILTER_1) && (SW_FILTER_1 || SW_FILTER_2)

    NOTE: I admit that the current validate_instruction() function does not
    do either of them correctly. Will fix it in the next iteration.

(4) These combinations of filters are not supported right now because
    (a) we are unable to process two HW PMU filters simultaneously, and
    (b) we have not worked on a replacement SW filter for either of the HW
        filters:

	(1) (HW_FILTER_1), (HW_FILTER_2)
	(2) (HW_FILTER_1), (HW_FILTER_2), (SW_FILTER_1)
	(3) (HW_FILTER_1), (HW_FILTER_2), (SW_FILTER_1), (SW_FILTER_2)

    However, these combinations of filters can be supported right now:

	(1) (HW_FILTER_1)
	(2) (HW_FILTER_2)
	(3) (SW_FILTER_1)
	(4) (SW_FILTER_2)
	(5) (SW_FILTER_1), (SW_FILTER_2)
	(6) (HW_FILTER_1), (SW_FILTER_1)
	(7) (HW_FILTER_1), (SW_FILTER_2)
	(8) (HW_FILTER_1), (SW_FILTER_1), (SW_FILTER_2)
	(9) (HW_FILTER_2), (SW_FILTER_1)
	(10) (HW_FILTER_2), (SW_FILTER_2)
	(11) (HW_FILTER_2), (SW_FILTER_1), (SW_FILTER_2)

Given the situation as explained here, which semantic would be better for a
single HW and multiple SW filters? Accordingly, the validate_instruction()
function will have to be re-implemented. But I believe OR-ing the SW filters
will be preferable:

	(1) (HW_FILTER_1) && (SW_FILTER_1) && (SW_FILTER_2)
    or
	(2) (HW_FILTER_1) && (SW_FILTER_1 || SW_FILTER_2)

Please let me know your inputs and suggestions on this. Thank you.

Regards
Anshuman
Re: [PATCH V2 0/6] perf: New conditional branch filter
On 09/25/2013 07:49 AM, Michael Ellerman wrote: > On Mon, 2013-09-23 at 14:45 +0530, Anshuman Khandual wrote: >> On 09/21/2013 12:25 PM, Stephane Eranian wrote: >>> On Tue, Sep 10, 2013 at 4:06 AM, Michael Ellerman >>> wrote: >>>>> >>>>> On Fri, 2013-08-30 at 09:54 +0530, Anshuman Khandual wrote: >>>>>>> This patchset is the re-spin of the original branch stack sampling >>>>>>> patchset which introduced new PERF_SAMPLE_BRANCH_COND filter. This >>>>>>> patchset >>>>>>> also enables SW based branch filtering support for PPC64 platforms >>>>>>> which have >>>>>>> branch stack sampling support. With this new enablement, the branch >>>>>>> filter support >>>>>>> for PPC64 platforms have been extended to include all these >>>>>>> combinations discussed >>>>>>> below with a sample test application program. >>>>> >>>>> ... >>>>> >>>>>>> Mixed filters >>>>>>> - >>>>>>> (6) perf record -e branch-misses:u -j any_call,any_ret ./cprog >>>>>>> Error: >>>>>>> The perf.data file has no samples! >>>>>>> >>>>>>> NOTE: As expected. The HW filters all the branches which are calls and >>>>>>> SW tries to find return >>>>>>> branches in that given set. Both the filters are mutually exclussive, >>>>>>> so obviously no samples >>>>>>> found in the end profile. >>>>> >>>>> The semantics of multiple filters is not clear to me. It could be an OR, >>>>> or an AND. You have implemented AND, does that match existing behaviour >>>>> on x86 for example? >>> >>> The semantic on the API is OR. AND does not make sense: CALL & RETURN? >>> On x86, the HW filter is an OR (default: ALL, set bit to disable a >>> type). I suspect >>> it is similar on PPC. >> >> Given the situation as explained here, which semantic would be better for >> single >> HW and multiple SW filters. Accordingly validate_instruction() function will >> have >> to be re-implemented. But I believe OR-ing the SW filters will be preferable. 
>> >> (1) (HW_FILTER_1) && (SW_FILTER_1) && (SW_FILTER_2) >> or >> (2) (HW_FILTER_1) && (SW_FILTER_1 || SW_FILTER_2) >> >> Please let me know your inputs and suggestions on this. Thank you. > > You need to implement the correct semantics, regardless of how the > hardware happens to work. > > That means if multiple filters are specified you need to do all the > filtering in software. Hello Stephane, I looked at the X86 code on branch filtering implementation. (1) During event creation intel_pmu_hw_config calls intel_pmu_setup_lbr_filter when LBR sampling is required, intel_pmu_setup_lbr_filter calls these two functions (a) intel_pmu_setup_sw_lbr_filter "event->hw.branch_reg.reg" contains all the SW filter masks which can be supported for the user requested filters event->attr.branch_sample_type (even if some of them could implemented in PMU HW) (b) intel_pmu_setup_hw_lbr_filter (when HW filtering is present) "event->hw.branch_reg.config" contains all the PMU HW filter masks corresponding to the requested filters in event->attr.branch_sample_type. One point to note here is that if the user has requested for some branch filter which is not supported in the HW LBR filter, the event creation request is rejected with EOPNOTSUPP. This not true for the filters which can be ignored in the PMU. (2) When the event is enabled in the PMU (a) cpuc->lbr_sel->config gets into the HW register to enable the filtering of branches which was determined in the function intel_pmu_setup_hw_lbr_filter. (3) After the IRQ happened, intel_pmu_lbr_read reads all the entries from the LBR HW and then applies the filter in the function intel_pmu_lbr_filter. (a) intel_pmu_lbr_filter functions take into account cpuc->br_sel (which is nothing but event->hw.branch_reg.reg as determined in the function intel_pmu_setup_sw_lbr_filter) which contains the entire branch filter request set in terms applicable SW filter. Here the semantic is OR when we look at from SW filter implementation point of view. 
BUT which branch record set are we working on right now? A set which was captu
[PATCH] powerpc, perf: Change PMU flag values representation from decimal to hex
Signed-off-by: Anshuman Khandual --- arch/powerpc/include/asm/perf_event_server.h | 21 - 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h index 9710be3..e3f10bb 100644 --- a/arch/powerpc/include/asm/perf_event_server.h +++ b/arch/powerpc/include/asm/perf_event_server.h @@ -11,6 +11,7 @@ #include #include +#include #define MAX_HWEVENTS 8 #define MAX_EVENT_ALTERNATIVES 8 @@ -45,11 +46,21 @@ struct power_pmu { /* * Values for power_pmu.flags */ -#define PPMU_LIMITED_PMC5_61 /* PMC5/6 have limited function */ -#define PPMU_ALT_SIPR 2 /* uses alternate posn for SIPR/HV */ -#define PPMU_NO_SIPR 4 /* no SIPR/HV in MMCRA at all */ -#define PPMU_NO_CONT_SAMPLING 8 /* no continuous sampling */ -#define PPMU_SIAR_VALID16 /* Processor has SIAR Valid bit */ + +#define PPMU_LIMITED_PMC5_6\ + LONG_ASM_CONST(0x0001) /* PMC5/6 have limited function */ + +#define PPMU_ALT_SIPR \ + LONG_ASM_CONST(0x0002) /* uses alternate posn for SIPR/HV */ + +#define PPMU_NO_SIPR \ + LONG_ASM_CONST(0x0004) /* no SIPR/HV in MMCRA at all */ + +#define PPMU_NO_CONT_SAMPLING \ + LONG_ASM_CONST(0x0008) /* no continuous sampling */ + +#define PPMU_SIAR_VALID\ + LONG_ASM_CONST(0x0010) /* Processor has SIAR Valid bit */ /* * Values for flags to get_alternatives() -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
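To illustrate why the patch moves the flag values from decimal to hex: each PPMU flag is a single bit, and the hex form makes the bit position obvious at a glance. A minimal user-space sketch (the flag names mirror the kernel's, but are redefined locally here for illustration):

```c
#include <assert.h>

/* Local stand-ins for the kernel's PPMU_* flags; each value is a
 * single bit, which the hex form makes obvious at a glance. */
#define PPMU_LIMITED_PMC5_6   0x0001UL /* bit 0 */
#define PPMU_ALT_SIPR         0x0002UL /* bit 1 */
#define PPMU_NO_SIPR          0x0004UL /* bit 2 */
#define PPMU_NO_CONT_SAMPLING 0x0008UL /* bit 3 */
#define PPMU_SIAR_VALID       0x0010UL /* bit 4 */

/* Returns nonzero when a given feature bit is set in a flags word. */
static unsigned long has_flag(unsigned long flags, unsigned long bit)
{
	return flags & bit;
}
```

The hex values are numerically identical to the old decimal ones (0x0010 == 16), so the change is purely cosmetic, which is also why Paul asks for a better justification below.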
Re: [PATCH] powerpc, perf: Change PMU flag values representation from decimal to hex
On 11/16/2012 05:12 PM, Paul Mackerras wrote: > On Fri, Nov 16, 2012 at 02:29:04PM +0530, Anshuman Khandual wrote: >> Signed-off-by: Anshuman Khandual > > That's not a sufficient description of why you are making this > change. In particular, what is the motivation for and impact of using > LONG_ASM_CONST? > > Paul. Hey Paul, I have just sent out a revised patch where I have updated the description and dropped the usage of LONG_ASM_CONST. Regards Anshuman
Re: [PATCH V2 2/5] powerpc, perf: Add basic assembly code to read BHRB entries on POWER8
On 04/16/2013 10:53 PM, Segher Boessenkool wrote: >> +/* r3 = n (where n = [0-1023]) >> + * The maximum number of BHRB entries supported with PPC_MFBHRBE >> instruction >> + * is 1024. We have limited number of table entries here as POWER8 >> implements >> + * 32 BHRB entries. >> + */ >> + >> +/* .global read_bhrb */ >> +_GLOBAL(read_bhrb) >> + cmpldi r3,1023 > > This should be 31, since that is the last entry in the table below. Hey Segher, Would fix this in the next version. Thanks for pointing it out. Regards Anshuman
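The off-by-way-more bug Segher points out can be seen in a C model of the dispatch: the jump table only has entries 0-31, so the bound check must reject anything past the last populated slot, not past the architectural maximum of 1023. A sketch (the table here is plain data standing in for the generated instruction slots):

```c
#include <assert.h>
#include <stddef.h>

#define NUM_TABLE_ENTRIES 32 /* POWER8 implements 32 BHRB entries */

/* Model of the fixed jump table: entry n holds the value for index n.
 * (In the real code each slot is an mfbhrbe/blr instruction pair.) */
static unsigned long table[NUM_TABLE_ENTRIES];

/* The guard the review asks for: reject any index past the last
 * populated entry instead of past the architectural maximum. */
static unsigned long read_entry(size_t n)
{
	if (n > NUM_TABLE_ENTRIES - 1) /* i.e. cmpldi r3,31 */
		return 0;
	return table[n];
}
```

With the posted "cmpldi r3,1023", an index of, say, 100 would pass the check and branch past the end of the emitted table.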
Re: [PATCH V2 5/5] powerpc, perf: Enable branch stack sampling framework support with BHRB
On 04/17/2013 12:38 PM, Michael Ellerman wrote: > On Tue, Apr 16, 2013 at 09:24:10PM +0530, Anshuman Khandual wrote: >> This patch provides basic enablement for perf branch stack sampling framework >> on POWER8 processor with a new PMU feature called BHRB (Branch History >> Rolling >> Buffer). >> >> Signed-off-by: Anshuman Khandual >> --- >> arch/powerpc/perf/core-book3s.c | 96 >> +++-- >> arch/powerpc/perf/perf_event_bhrb.c | 85 >> 2 files changed, 178 insertions(+), 3 deletions(-) >> create mode 100644 arch/powerpc/perf/perf_event_bhrb.c >> >> diff --git a/arch/powerpc/perf/core-book3s.c >> b/arch/powerpc/perf/core-book3s.c >> index 4ac6e64..f4d1347 100644 >> --- a/arch/powerpc/perf/core-book3s.c >> +++ b/arch/powerpc/perf/core-book3s.c >> @@ -19,6 +19,8 @@ >> #include >> #include >> >> +#define BHRB_MAX_ENTRIES32 >> + >> struct cpu_hw_events { >> int n_events; >> int n_percpu; >> @@ -38,11 +40,21 @@ struct cpu_hw_events { >> >> unsigned int group_flag; >> int n_txn_start; >> + >> +/* BHRB bits */ >> +u64 bhrb_filter;/* BHRB HW branch >> filter */ >> +int bhrb_users; >> +void*bhrb_context; >> +struct perf_branch_stack bhrb_stack; >> +struct perf_branch_entry bhrb_entries[BHRB_MAX_ENTRIES]; >> }; >> + >> DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events); >> >> struct power_pmu *ppmu; >> >> +#include "perf_event_bhrb.c" >> + > > Um, why are you doing that? > There was no specific reason for that. > If it's code that should be in a header, put it in a header. If it's C > code, add it to the Makefile. Sure I would get the new code in the Makefile. Thanks! Regards Anshuman > > cheers > ___ > Linuxppc-dev mailing list > linuxppc-...@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/linuxppc-dev > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2 5/5] powerpc, perf: Enable branch stack sampling framework support with BHRB
On 04/17/2013 05:37 PM, Anshuman Khandual wrote: > On 04/17/2013 12:38 PM, Michael Ellerman wrote: >> On Tue, Apr 16, 2013 at 09:24:10PM +0530, Anshuman Khandual wrote: >>> This patch provides basic enablement for perf branch stack sampling >>> framework >>> on POWER8 processor with a new PMU feature called BHRB (Branch History >>> Rolling >>> Buffer). >>> >>> Signed-off-by: Anshuman Khandual >>> --- >>> arch/powerpc/perf/core-book3s.c | 96 >>> +++-- >>> arch/powerpc/perf/perf_event_bhrb.c | 85 >>> 2 files changed, 178 insertions(+), 3 deletions(-) >>> create mode 100644 arch/powerpc/perf/perf_event_bhrb.c >>> >>> diff --git a/arch/powerpc/perf/core-book3s.c >>> b/arch/powerpc/perf/core-book3s.c >>> index 4ac6e64..f4d1347 100644 >>> --- a/arch/powerpc/perf/core-book3s.c >>> +++ b/arch/powerpc/perf/core-book3s.c >>> @@ -19,6 +19,8 @@ >>> #include >>> #include >>> >>> +#define BHRB_MAX_ENTRIES 32 >>> + >>> struct cpu_hw_events { >>> int n_events; >>> int n_percpu; >>> @@ -38,11 +40,21 @@ struct cpu_hw_events { >>> >>> unsigned int group_flag; >>> int n_txn_start; >>> + >>> + /* BHRB bits */ >>> + u64 bhrb_filter;/* BHRB HW branch >>> filter */ >>> + int bhrb_users; >>> + void*bhrb_context; >>> + struct perf_branch_stack bhrb_stack; >>> + struct perf_branch_entry bhrb_entries[BHRB_MAX_ENTRIES]; >>> }; >>> + >>> DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events); >>> >>> struct power_pmu *ppmu; >>> >>> +#include "perf_event_bhrb.c" >>> + >> >> Um, why are you doing that? >> > > There was no specific reason for that. > Ahh, I remember it now. The function in the new file uses "cpu_hw_events" structure which is passed during "record_and_restart" data capture phase. Right now cpu_hw_events is not defined in the header file but inside core-book3s.c itself. Solution to this problem could be any of these. 
(0) Move all the code from the new file perf_event_bhrb.c into core-book3s.c (1) Move cpu_hw_events structure to perf_event_server.h (2) Create additional BHRB processing function inside struct power_pmu and define it for P8 inside power8_pmu.c Regards Anshuman -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V3 3/5] powerpc, perf: Add new BHRB related generic functions, data and flags
This patch adds couple of generic functions to power_pmu structure which would configure the BHRB and it's filters. It also adds representation of the number of BHRB entries present on the PMU. A new PMU flag PPMU_BHRB would indicate presence of BHRB feature. Signed-off-by: Anshuman Khandual --- arch/powerpc/include/asm/perf_event_server.h | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h index 57b42da..3f0c15c 100644 --- a/arch/powerpc/include/asm/perf_event_server.h +++ b/arch/powerpc/include/asm/perf_event_server.h @@ -33,6 +33,8 @@ struct power_pmu { unsigned long *valp); int (*get_alternatives)(u64 event_id, unsigned int flags, u64 alt[]); + u64 (*bhrb_filter_map)(u64 branch_sample_type); + void(*config_bhrb)(u64 pmu_bhrb_filter); void(*disable_pmc)(unsigned int pmc, unsigned long mmcr[]); int (*limited_pmc_event)(u64 event_id); u32 flags; @@ -42,6 +44,9 @@ struct power_pmu { int (*cache_events)[PERF_COUNT_HW_CACHE_MAX] [PERF_COUNT_HW_CACHE_OP_MAX] [PERF_COUNT_HW_CACHE_RESULT_MAX]; + + /* BHRB entries in the PMU */ + int bhrb_nr; }; /* @@ -54,6 +59,7 @@ struct power_pmu { #define PPMU_SIAR_VALID0x0010 /* Processor has SIAR Valid bit */ #define PPMU_HAS_SSLOT 0x0020 /* Has sampled slot in MMCRA */ #define PPMU_HAS_SIER 0x0040 /* Has SIER */ +#define PPMU_BHRB 0x0080 /* has BHRB feature enabled */ /* * Values for flags to get_alternatives() -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
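The shape of this change — optional per-PMU callbacks plus a feature flag — can be sketched in plain C. This is an illustrative stand-in, not the kernel's struct power_pmu; the point is that core code gates all BHRB handling on the flag and the hook pointers, so PMUs without BHRB leave them unset:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long u64;

/* Trimmed-down stand-in for struct power_pmu: the BHRB hooks are
 * optional function pointers plus an entry count. */
struct power_pmu_sketch {
	u64 (*bhrb_filter_map)(u64 branch_sample_type);
	void (*config_bhrb)(u64 pmu_bhrb_filter);
	unsigned int flags;
	int bhrb_nr;
};

#define PPMU_BHRB 0x0080 /* has BHRB feature enabled */

/* Core code can gate all BHRB handling on the flag and the hook, so
 * PMUs without BHRB simply leave the pointers NULL. */
static u64 map_branch_filter(struct power_pmu_sketch *pmu, u64 type)
{
	if (!(pmu->flags & PPMU_BHRB) || !pmu->bhrb_filter_map)
		return (u64)-1; /* feature absent: reject */
	return pmu->bhrb_filter_map(type);
}

/* Trivial hypothetical PMU hook that accepts every filter request. */
static u64 accept_all(u64 type) { (void)type; return 0; }
```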
[PATCH V3 1/5] powerpc, perf: Add new BHRB related instructions for POWER8
This patch adds new POWER8 instruction encoding for reading the BHRB buffer entries and also clearing it. Encoding for "clrbhrb" instruction is straight forward. But "mfbhrbe" encoding involves reading a certain index of BHRB buffer into a particular GPR register. Signed-off-by: Anshuman Khandual --- arch/powerpc/include/asm/ppc-opcode.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h index 8752bc8..93ae5a1 100644 --- a/arch/powerpc/include/asm/ppc-opcode.h +++ b/arch/powerpc/include/asm/ppc-opcode.h @@ -82,6 +82,7 @@ #define__REGA0_R31 31 /* sorted alphabetically */ +#define PPC_INST_BHRBE 0x7c00025c #define PPC_INST_DCBA 0x7c0005ec #define PPC_INST_DCBA_MASK 0xfc0007fe #define PPC_INST_DCBAL 0x7c2005ec @@ -297,6 +298,12 @@ #define PPC_NAPstringify_in_c(.long PPC_INST_NAP) #define PPC_SLEEP stringify_in_c(.long PPC_INST_SLEEP) +/* BHRB instructions */ +#define PPC_CLRBHRBstringify_in_c(.long 0x7c00035c) +#define PPC_MFBHRBE(r, n) stringify_in_c(.long PPC_INST_BHRBE | \ + __PPC_RS(r) | \ + (((n) & 0x1f) << 11)) + /* Transactional memory instructions */ #define TRECHKPT stringify_in_c(.long PPC_INST_TRECHKPT) #define TRECLAIM(r)stringify_in_c(.long PPC_INST_TRECLAIM \ -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
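The PPC_MFBHRBE() macro composition can be checked numerically with a small sketch: base opcode, destination GPR in the RT field (bits 21-25 of the word as written here), and the entry number in the field shifted by 11. This mirrors the patch as posted with its 5-bit entry mask (a later review in this thread notes the field should actually be 10 bits wide):

```c
#include <assert.h>

typedef unsigned int u32;

/* Base opcode from the patch. */
#define PPC_INST_BHRBE 0x7c00025cU

/* C model of PPC_MFBHRBE(r, n) as posted: __PPC_RS(r) is r << 21,
 * and the entry number is masked to 5 bits and shifted by 11. */
static u32 ppc_mfbhrbe(u32 rt, u32 n)
{
	return PPC_INST_BHRBE | (rt << 21) | ((n & 0x1f) << 11);
}
```

For example, reading entry 0 into r3 encodes as 0x7c60025c.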
[PATCH V3 5/5] powerpc, perf: Enable branch stack sampling framework
Provides basic enablement for perf branch stack sampling framework on POWER8 processor based platforms. Adds new BHRB related elements into cpu_hw_event structure to represent current BHRB config, BHRB filter configuration, manage context and to hold output BHRB buffer during PMU interrupt before passing to the user space. This also enables processing of BHRB data and converts them into generic perf branch stack data format. Signed-off-by: Anshuman Khandual --- arch/powerpc/include/asm/perf_event_server.h | 1 + arch/powerpc/perf/core-book3s.c | 167 ++- 2 files changed, 165 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h index 3f0c15c..f265049 100644 --- a/arch/powerpc/include/asm/perf_event_server.h +++ b/arch/powerpc/include/asm/perf_event_server.h @@ -73,6 +73,7 @@ extern int register_power_pmu(struct power_pmu *); struct pt_regs; extern unsigned long perf_misc_flags(struct pt_regs *regs); extern unsigned long perf_instruction_pointer(struct pt_regs *regs); +extern unsigned long int read_bhrb(int n); /* * Only override the default definitions in include/linux/perf_event.h diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index 4ac6e64..c627843 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -19,6 +19,11 @@ #include #include +#define BHRB_MAX_ENTRIES 32 +#define BHRB_TARGET0x0002 +#define BHRB_PREDICTION0x0001 +#define BHRB_EA0xFFFC + struct cpu_hw_events { int n_events; int n_percpu; @@ -38,7 +43,15 @@ struct cpu_hw_events { unsigned int group_flag; int n_txn_start; + + /* BHRB bits */ + u64 bhrb_filter;/* BHRB HW branch filter */ + int bhrb_users; + void*bhrb_context; + struct perf_branch_stack bhrb_stack; + struct perf_branch_entry bhrb_entries[BHRB_MAX_ENTRIES]; }; + DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events); struct power_pmu *ppmu; @@ -858,6 +871,9 @@ static void power_pmu_enable(struct pmu *pmu) } 
out: + if (cpuhw->bhrb_users) + ppmu->config_bhrb(cpuhw->bhrb_filter); + local_irq_restore(flags); } @@ -888,6 +904,47 @@ static int collect_events(struct perf_event *group, int max_count, return n; } +/* Reset all possible BHRB entries */ +static void power_pmu_bhrb_reset(void) +{ + asm volatile(PPC_CLRBHRB); +} + +void power_pmu_bhrb_enable(struct perf_event *event) +{ + struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events); + + if (!ppmu->bhrb_nr) + return; + + /* Clear BHRB if we changed task context to avoid data leaks */ + if (event->ctx->task && cpuhw->bhrb_context != event->ctx) { + power_pmu_bhrb_reset(); + cpuhw->bhrb_context = event->ctx; + } + cpuhw->bhrb_users++; +} + +void power_pmu_bhrb_disable(struct perf_event *event) +{ + struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events); + + if (!ppmu->bhrb_nr) + return; + + cpuhw->bhrb_users--; + WARN_ON_ONCE(cpuhw->bhrb_users < 0); + + if (!cpuhw->disabled && !cpuhw->bhrb_users) { + /* BHRB cannot be turned off when other +* events are active on the PMU. +*/ + + /* avoid stale pointer */ + cpuhw->bhrb_context = NULL; + } +} + /* * Add a event to the PMU. * If all events are not already frozen, then we disable and @@ -947,6 +1004,9 @@ nocheck: ret = 0; out: + if (has_branch_stack(event)) + power_pmu_bhrb_enable(event); + perf_pmu_enable(event->pmu); local_irq_restore(flags); return ret; @@ -999,6 +1059,9 @@ static void power_pmu_del(struct perf_event *event, int ef_flags) cpuhw->mmcr[0] &= ~(MMCR0_PMXE | MMCR0_FCECE); } + if (has_branch_stack(event)) + power_pmu_bhrb_disable(event); + perf_pmu_enable(event->pmu); local_irq_restore(flags); } @@ -1117,6 +1180,15 @@ int power_pmu_commit_txn(struct pmu *pmu) return 0; } +/* Called from ctxsw to prevent one process's branch entries to + * mingle with the other process's entries during context switch. 
+ */ +void power_pmu_flush_branch_stack(void) +{ + if (ppmu->bhrb_nr) + power_pmu_bhrb_reset(); +} + /* * Return 1 if we might be able to put event on a limited PMC, * or 0 if not. @@ -1231,9 +1303,11 @@ static int power_pmu_event_init(struct perf_event *event) if (
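The BHRB_TARGET / BHRB_PREDICTION / BHRB_EA masks at the top of the patch describe how a raw BHRB word is split apart. A hedged decoding sketch — the struct and helper names are illustrative, and the EA mask here is written out as the full 64-bit value (it appears truncated to "0xFFFC" in this archive); the exact meaning of the flag bits is per the POWER8 BHRB layout the patch assumes:

```c
#include <assert.h>

typedef unsigned long long u64;

/* Masks as defined in the patch: the EA mask clears the two low
 * flag bits of a 64-bit entry. */
#define BHRB_TARGET     0x0000000000000002ULL
#define BHRB_PREDICTION 0x0000000000000001ULL
#define BHRB_EA         0xFFFFFFFFFFFFFFFCULL

struct bhrb_decode {
	u64 addr;      /* effective address, low 2 bits cleared */
	int predicted; /* prediction flag bit */
	int is_target; /* entry is a branch target, not a source */
};

/* Split one raw BHRB word into its address and flag components. */
static struct bhrb_decode decode_bhrb(u64 val)
{
	struct bhrb_decode d;

	d.addr = val & BHRB_EA;
	d.predicted = (val & BHRB_PREDICTION) != 0;
	d.is_target = (val & BHRB_TARGET) != 0;
	return d;
}
```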
[PATCH V3 0/5] powerpc, perf: BHRB based branch stack enablement on POWER8
Branch History Rolling Buffer (BHRB) is a new PMU feaure in IBM POWER8 processor which records the branch instructions inside the execution pipeline. This patchset enables the basic functionality of the feature through generic perf branch stack sampling framework. Sample output - $./perf record -b top $./perf report Overhead Command Source Shared Object Source Symbol Target Shared ObjectTarget Symbol # ... .. ... # 7.82% top libc-2.11.2.so[k] _IO_vfscanf libc-2.11.2.so[k] _IO_vfscanf 6.17% top libc-2.11.2.so[k] _IO_vfscanf [unknown] [k] 2.37% top [unknown] [k] 0xf7aafb30 [unknown] [k] 1.80% top [unknown] [k] 0x0fe07978 libc-2.11.2.so[k] _IO_vfscanf 1.60% top libc-2.11.2.so[k] _IO_vfscanf [kernel.kallsyms] [k] .do_task_stat 1.20% top [kernel.kallsyms] [k] .do_task_stat [kernel.kallsyms] [k] .do_task_stat 1.02% top libc-2.11.2.so[k] vfprintf libc-2.11.2.so[k] vfprintf 0.92% top top [k] _init [unknown] [k] 0x0fe037f4 Changes in V2 -- - Added copyright messages to the newly created files - Modified couple of commit messages Changes in V3 - - Incorporated review comments from Segher https://lkml.org/lkml/2013/4/16/350 - Worked on a solution for review comment from Michael Ellerman https://lkml.org/lkml/2013/4/17/548 - Could not move updated cpu_hw_events structure from core-book3s.c file into perf_event_server.h Because perf_event_server.h is pulled in first inside linux/perf_event.h before the definition of perf_branch_entry structure. Thats the reason why perf_branch_entry definition is not available inside perf_event_server.h where we define the array inside cpu_hw_events structure. 
- Finally have pulled in the code from perf_event_bhrb.c into core-book3s.c - Improved documentation for the patchset Anshuman Khandual (5): powerpc, perf: Add new BHRB related instructions for POWER8 powerpc, perf: Add basic assembly code to read BHRB entries on POWER8 powerpc, perf: Add new BHRB related generic functions, data and flags powerpc, perf: Define BHRB generic functions, data and flags for POWER8 powerpc, perf: Enable branch stack sampling framework arch/powerpc/include/asm/perf_event_server.h | 7 ++ arch/powerpc/include/asm/ppc-opcode.h| 7 ++ arch/powerpc/perf/Makefile | 2 +- arch/powerpc/perf/bhrb.S | 44 +++ arch/powerpc/perf/core-book3s.c | 167 ++- arch/powerpc/perf/power8-pmu.c | 57 - 6 files changed, 279 insertions(+), 5 deletions(-) create mode 100644 arch/powerpc/perf/bhrb.S -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V3 4/5] powerpc, perf: Define BHRB generic functions, data and flags for POWER8
This patch populates BHRB specific data for power_pmu structure. It also implements POWER8 specific BHRB filter and configuration functions. Signed-off-by: Anshuman Khandual --- arch/powerpc/perf/power8-pmu.c | 57 +- 1 file changed, 56 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c index 106ae0b..153408c 100644 --- a/arch/powerpc/perf/power8-pmu.c +++ b/arch/powerpc/perf/power8-pmu.c @@ -109,6 +109,16 @@ #define EVENT_IS_MARKED(EVENT_MARKED_MASK << EVENT_MARKED_SHIFT) #define EVENT_PSEL_MASK0xff/* PMCxSEL value */ +/* MMCRA IFM bits - POWER8 */ +#definePOWER8_MMCRA_IFM1 0x4000UL +#definePOWER8_MMCRA_IFM2 0x8000UL +#definePOWER8_MMCRA_IFM3 0xC000UL + +#define ONLY_PLM \ + (PERF_SAMPLE_BRANCH_USER|\ +PERF_SAMPLE_BRANCH_KERNEL |\ +PERF_SAMPLE_BRANCH_HV) + /* * Layout of constraint bits: * @@ -428,6 +438,48 @@ static int power8_generic_events[] = { [PERF_COUNT_HW_BRANCH_MISSES] = PM_BR_MPRED_CMPL, }; +static u64 power8_bhrb_filter_map(u64 branch_sample_type) +{ + u64 pmu_bhrb_filter = 0; + u64 br_privilege = branch_sample_type & ONLY_PLM; + + /* BHRB and regular PMU events share the same privilege state +* filter configuration. BHRB is always recorded along with a +* regular PMU event. So privilege state filter criteria for BHRB +* and the companion PMU events has to be the same. As a default +* "perf record" tool sets all privilege bits ON when no filter +* criteria is provided in the command line. So as long as all +* privilege bits are ON or they are OFF, we are good to go. 
+*/ + if ((br_privilege != 7) && (br_privilege != 0)) + return -1; + + /* No branch filter requested */ + if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY) + return pmu_bhrb_filter; + + /* Invalid branch filter options - HW does not support */ + if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_RETURN) + return -1; + + if (branch_sample_type & PERF_SAMPLE_BRANCH_IND_CALL) + return -1; + + if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) { + pmu_bhrb_filter |= POWER8_MMCRA_IFM1; + return pmu_bhrb_filter; + } + + /* Every thing else is unsupported */ + return -1; +} + +static void power8_config_bhrb(u64 pmu_bhrb_filter) +{ + /* Enable BHRB filter in PMU */ + mtspr(SPRN_MMCRA, (mfspr(SPRN_MMCRA) | pmu_bhrb_filter)); +} + static struct power_pmu power8_pmu = { .name = "POWER8", .n_counter = 6, @@ -435,12 +487,15 @@ static struct power_pmu power8_pmu = { .add_fields = POWER8_ADD_FIELDS, .test_adder = POWER8_TEST_ADDER, .compute_mmcr = power8_compute_mmcr, + .config_bhrb= power8_config_bhrb, + .bhrb_filter_map= power8_bhrb_filter_map, .get_constraint = power8_get_constraint, .disable_pmc= power8_disable_pmc, - .flags = PPMU_HAS_SSLOT | PPMU_HAS_SIER, + .flags = PPMU_HAS_SSLOT | PPMU_HAS_SIER | PPMU_BHRB, .n_generic = ARRAY_SIZE(power8_generic_events), .generic_events = power8_generic_events, .attr_groups= power8_pmu_attr_groups, + .bhrb_nr= 32, }; static int __init init_power8_pmu(void) -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
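The privilege-bit rule in power8_bhrb_filter_map() — accept only when all three privilege bits are set or none are — is easy to model in isolation. A sketch (the bit values match the perf_event ABI's PERF_SAMPLE_BRANCH_USER/KERNEL/HV as far as I can tell, but treat them as illustrative):

```c
#include <assert.h>

typedef unsigned long long u64;

/* Stand-ins for the perf_event.h privilege bits. */
#define SAMPLE_BRANCH_USER   0x1ULL
#define SAMPLE_BRANCH_KERNEL 0x2ULL
#define SAMPLE_BRANCH_HV     0x4ULL
#define ONLY_PLM (SAMPLE_BRANCH_USER | SAMPLE_BRANCH_KERNEL | SAMPLE_BRANCH_HV)

/* Mirrors the patch's rule: the request is accepted only when all
 * three privilege bits are set (7) or none are (0), because BHRB
 * shares the privilege filter with the companion PMU event. */
static int privilege_ok(u64 branch_sample_type)
{
	u64 br_privilege = branch_sample_type & ONLY_PLM;

	return br_privilege == 7 || br_privilege == 0;
}
```

This is why "perf record -b" works out of the box (all privilege bits default ON) while, say, a user-only branch filter is rejected on this hardware.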
[PATCH V3 2/5] powerpc, perf: Add basic assembly code to read BHRB entries on POWER8
This patch adds the basic assembly code to read BHRB buffer. BHRB entries are valid only after a PMU interrupt has happened (when MMCR0[PMAO]=1) and BHRB has been freezed. BHRB read should not be attempted when it is still enabled (MMCR0[PMAE]=1) and getting updated, as this can produce non-deterministic results. Signed-off-by: Anshuman Khandual --- arch/powerpc/perf/Makefile | 2 +- arch/powerpc/perf/bhrb.S | 44 2 files changed, 45 insertions(+), 1 deletion(-) create mode 100644 arch/powerpc/perf/bhrb.S diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile index 472db18..510fae1 100644 --- a/arch/powerpc/perf/Makefile +++ b/arch/powerpc/perf/Makefile @@ -2,7 +2,7 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror obj-$(CONFIG_PERF_EVENTS) += callchain.o -obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o +obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o bhrb.o obj64-$(CONFIG_PPC_PERF_CTRS) += power4-pmu.o ppc970-pmu.o power5-pmu.o \ power5+-pmu.o power6-pmu.o power7-pmu.o \ power8-pmu.o diff --git a/arch/powerpc/perf/bhrb.S b/arch/powerpc/perf/bhrb.S new file mode 100644 index 000..d85f9a5 --- /dev/null +++ b/arch/powerpc/perf/bhrb.S @@ -0,0 +1,44 @@ +/* + * Basic assembly code to read BHRB entries + * + * Copyright 2013 Anshuman Khandual, IBM Corporation. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#include +#include + + .text + +.balign 8 + +/* r3 = n (where n = [0-31]) + * The maximum number of BHRB entries supported with PPC_MFBHRBE instruction + * is 1024. We have limited number of table entries here as POWER8 implements + * 32 BHRB entries. 
+ */ + +/* .global read_bhrb */ +_GLOBAL(read_bhrb) + cmpldi r3,31 + bgt 1f + ld r4,bhrb_table@got(r2) + sldir3,r3,3 + add r3,r4,r3 + mtctr r3 + bctr +1: li r3,0 + blr + +#define MFBHRB_TABLE1(n) PPC_MFBHRBE(R3,n); blr +#define MFBHRB_TABLE2(n) MFBHRB_TABLE1(n); MFBHRB_TABLE1(n+1) +#define MFBHRB_TABLE4(n) MFBHRB_TABLE2(n); MFBHRB_TABLE2(n+2) +#define MFBHRB_TABLE8(n) MFBHRB_TABLE4(n); MFBHRB_TABLE4(n+4) +#define MFBHRB_TABLE16(n) MFBHRB_TABLE8(n); MFBHRB_TABLE8(n+8) +#define MFBHRB_TABLE32(n) MFBHRB_TABLE16(n); MFBHRB_TABLE16(n+16) + +bhrb_table: + MFBHRB_TABLE32(0) -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
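The arithmetic behind read_bhrb's computed branch is worth spelling out: each MFBHRB_TABLE1 slot emits two 4-byte instructions (mfbhrbe; blr), so slot n lives at byte offset n*8 from bhrb_table — which is exactly what "sldi r3,r3,3; add r3,r4,r3" computes — and MFBHRB_TABLE32 doubles TABLE16 -> TABLE8 -> ... -> TABLE1 to emit all 32 slots. A small C model of that offset math (names here are illustrative):

```c
#include <assert.h>

#define SLOT_BYTES 8 /* one mfbhrbe (4 bytes) + one blr (4 bytes) */
#define NUM_SLOTS  32

/* Byte offset of slot n from bhrb_table, as computed by
 * "sldi r3,r3,3; add r3,r4,r3" in read_bhrb. */
static unsigned long slot_offset(unsigned long n)
{
	return n << 3; /* same as n * SLOT_BYTES */
}

/* Total size of the emitted table: the doubling macros yield
 * 32 slots of 8 bytes each. */
static unsigned long table_size(void)
{
	return NUM_SLOTS * SLOT_BYTES;
}
```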
Re: [PATCH V3 1/5] powerpc, perf: Add new BHRB related instructions for POWER8
On 04/22/2013 05:11 AM, Michael Ellerman wrote: > On Thu, Apr 18, 2013 at 05:56:12PM +0530, Anshuman Khandual wrote: >> This patch adds new POWER8 instruction encoding for reading >> the BHRB buffer entries and also clearing it. Encoding for >> "clrbhrb" instruction is straight forward. > > Which is "clear branch history rolling buffer" ? > >> But "mfbhrbe" >> encoding involves reading a certain index of BHRB buffer >> into a particular GPR register. > > And "Move from branch history rolling buffer entry" ? > Sure, would add these descriptions in the change log. >> diff --git a/arch/powerpc/include/asm/ppc-opcode.h >> b/arch/powerpc/include/asm/ppc-opcode.h >> index 8752bc8..93ae5a1 100644 >> --- a/arch/powerpc/include/asm/ppc-opcode.h >> +++ b/arch/powerpc/include/asm/ppc-opcode.h >> @@ -82,6 +82,7 @@ >> #define __REGA0_R31 31 >> >> /* sorted alphabetically */ >> +#define PPC_INST_BHRBE 0x7c00025c > > I don't think you really need this, just use the literal value below. > >> @@ -297,6 +298,12 @@ >> #define PPC_NAP stringify_in_c(.long PPC_INST_NAP) >> #define PPC_SLEEP stringify_in_c(.long PPC_INST_SLEEP) >> >> +/* BHRB instructions */ >> +#define PPC_CLRBHRB stringify_in_c(.long 0x7c00035c) >> +#define PPC_MFBHRBE(r, n) stringify_in_c(.long PPC_INST_BHRBE | \ >> +__PPC_RS(r) | \ >> +(((n) & 0x1f) << 11)) > > Why are you not using ___PPC_RB(n) here ? > I would replace __PPC_RS(r) with __PPC_RT(r) which makes more sense from instruction encoding point of view. Regards Anshuman -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V3 1/5] powerpc, perf: Add new BHRB related instructions for POWER8
On 04/22/2013 08:20 AM, Michael Neuling wrote: > Michael Ellerman wrote: > >> On Mon, Apr 22, 2013 at 11:13:43AM +1000, Michael Neuling wrote: >>> Michael Ellerman wrote: >>> >>>> On Thu, Apr 18, 2013 at 05:56:12PM +0530, Anshuman Khandual wrote: >>>>> This patch adds new POWER8 instruction encoding for reading >>>>> the BHRB buffer entries and also clearing it. Encoding for >>>>> "clrbhrb" instruction is straight forward. >>>> >>>> Which is "clear branch history rolling buffer" ? >>>> >>>>> But "mfbhrbe" >>>>> encoding involves reading a certain index of BHRB buffer >>>>> into a particular GPR register. >>>> >>>> And "Move from branch history rolling buffer entry" ? >>>> >>>>> diff --git a/arch/powerpc/include/asm/ppc-opcode.h >>>>> b/arch/powerpc/include/asm/ppc-opcode.h >>>>> index 8752bc8..93ae5a1 100644 >>>>> --- a/arch/powerpc/include/asm/ppc-opcode.h >>>>> +++ b/arch/powerpc/include/asm/ppc-opcode.h >>>>> @@ -82,6 +82,7 @@ >>>>> #define __REGA0_R31 31 >>>>> >>>>> /* sorted alphabetically */ >>>>> +#define PPC_INST_BHRBE 0x7c00025c >>>> >>>> I don't think you really need this, just use the literal value below. >>> >>> The rest of the defines in this file do this, so Anshuman's right. >> >> I don't see the point, but sure let's be consistent. Though in that case >> he should do the same for PPC_CLRBHRB below. > > Agreed. > Sure, would define a new macro (PPC_INST_CLRBHRB) to encode 0x7c00035c before using it for PPC_CLRBHRB. > Mikey > >> >>>>> @@ -297,6 +298,12 @@ >>>>> #define PPC_NAP stringify_in_c(.long PPC_INST_NAP) >>>>> #define PPC_SLEEPstringify_in_c(.long PPC_INST_SLEEP) >>>>> >>>>> +/* BHRB instructions */ >>>>> +#define PPC_CLRBHRB stringify_in_c(.long 0x7c00035c) >>>>> +#define PPC_MFBHRBE(r, n)stringify_in_c(.long PPC_INST_BHRBE | \ >>>>> + __PPC_RS(r) | \ >>>>> + (((n) & 0x1f) << 11)) >>>> >>>> Why are you not using ___PPC_RB(n) here ? >>> >>> Actually, this is wrong. 
The number field should be 10 bits (0x3ff), >>> not 5 (0x1f) Anshuman please fix. >> >> ACK. I got it wrong, thought this as 32 instead of 1024. Would fix it. Regards Anshuman -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
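The consequence of the 5-bit mask Mikey flags is silent truncation: with "(n) & 0x1f", any entry number of 32 or more wraps around, so entry 32 would encode identically to entry 0. A sketch contrasting the posted encoding with the 10-bit fix (helper names are illustrative):

```c
#include <assert.h>

typedef unsigned int u32;

#define PPC_INST_BHRBE 0x7c00025cU

/* Encoding as posted: the 5-bit mask truncates the entry number. */
static u32 mfbhrbe_5bit(u32 rt, u32 n)
{
	return PPC_INST_BHRBE | (rt << 21) | ((n & 0x1f) << 11);
}

/* Encoding after the review fix: a 10-bit field covers entries 0-1023. */
static u32 mfbhrbe_10bit(u32 rt, u32 n)
{
	return PPC_INST_BHRBE | (rt << 21) | ((n & 0x3ff) << 11);
}
```

The current code only ever asks for entries 0-31, which is why the bug was latent, but the instruction itself architecturally supports 1024 entries.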
[PATCH V4 3/5] powerpc, perf: Add new BHRB related generic functions, data and flags
This patch adds couple of generic functions to power_pmu structure which would configure the BHRB and it's filters. It also adds representation of the number of BHRB entries present on the PMU. A new PMU flag PPMU_BHRB would indicate presence of BHRB feature. Signed-off-by: Anshuman Khandual --- arch/powerpc/include/asm/perf_event_server.h | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h index 57b42da..3f0c15c 100644 --- a/arch/powerpc/include/asm/perf_event_server.h +++ b/arch/powerpc/include/asm/perf_event_server.h @@ -33,6 +33,8 @@ struct power_pmu { unsigned long *valp); int (*get_alternatives)(u64 event_id, unsigned int flags, u64 alt[]); + u64 (*bhrb_filter_map)(u64 branch_sample_type); + void(*config_bhrb)(u64 pmu_bhrb_filter); void(*disable_pmc)(unsigned int pmc, unsigned long mmcr[]); int (*limited_pmc_event)(u64 event_id); u32 flags; @@ -42,6 +44,9 @@ struct power_pmu { int (*cache_events)[PERF_COUNT_HW_CACHE_MAX] [PERF_COUNT_HW_CACHE_OP_MAX] [PERF_COUNT_HW_CACHE_RESULT_MAX]; + + /* BHRB entries in the PMU */ + int bhrb_nr; }; /* @@ -54,6 +59,7 @@ struct power_pmu { #define PPMU_SIAR_VALID0x0010 /* Processor has SIAR Valid bit */ #define PPMU_HAS_SSLOT 0x0020 /* Has sampled slot in MMCRA */ #define PPMU_HAS_SIER 0x0040 /* Has SIER */ +#define PPMU_BHRB 0x0080 /* has BHRB feature enabled */ /* * Values for flags to get_alternatives() -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V4 4/5] powerpc, perf: Define BHRB generic functions, data and flags for POWER8
This patch populates the BHRB-specific fields of the power_pmu structure. It also implements the POWER8-specific BHRB filter and configuration functions. Signed-off-by: Anshuman Khandual --- arch/powerpc/perf/power8-pmu.c | 57 +- 1 file changed, 56 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c index 106ae0b..153408c 100644 --- a/arch/powerpc/perf/power8-pmu.c +++ b/arch/powerpc/perf/power8-pmu.c @@ -109,6 +109,16 @@ #define EVENT_IS_MARKED (EVENT_MARKED_MASK << EVENT_MARKED_SHIFT) #define EVENT_PSEL_MASK 0xff /* PMCxSEL value */ +/* MMCRA IFM bits - POWER8 */ +#define POWER8_MMCRA_IFM1 0x4000UL +#define POWER8_MMCRA_IFM2 0x8000UL +#define POWER8_MMCRA_IFM3 0xC000UL + +#define ONLY_PLM \ + (PERF_SAMPLE_BRANCH_USER |\ + PERF_SAMPLE_BRANCH_KERNEL |\ + PERF_SAMPLE_BRANCH_HV) + /* * Layout of constraint bits: * @@ -428,6 +438,48 @@ static int power8_generic_events[] = { [PERF_COUNT_HW_BRANCH_MISSES] = PM_BR_MPRED_CMPL, }; +static u64 power8_bhrb_filter_map(u64 branch_sample_type) +{ + u64 pmu_bhrb_filter = 0; + u64 br_privilege = branch_sample_type & ONLY_PLM; + + /* BHRB and regular PMU events share the same privilege state +* filter configuration. BHRB is always recorded along with a +* regular PMU event. So the privilege state filter criteria for BHRB +* and the companion PMU events have to be the same. By default the +* "perf record" tool sets all privilege bits ON when no filter +* criteria is provided on the command line. So as long as all +* privilege bits are ON or they are OFF, we are good to go.
+*/ + if ((br_privilege != 7) && (br_privilege != 0)) + return -1; + + /* No branch filter requested */ + if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY) + return pmu_bhrb_filter; + + /* Invalid branch filter options - HW does not support */ + if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_RETURN) + return -1; + + if (branch_sample_type & PERF_SAMPLE_BRANCH_IND_CALL) + return -1; + + if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) { + pmu_bhrb_filter |= POWER8_MMCRA_IFM1; + return pmu_bhrb_filter; + } + + /* Everything else is unsupported */ + return -1; +} + +static void power8_config_bhrb(u64 pmu_bhrb_filter) +{ + /* Enable BHRB filter in PMU */ + mtspr(SPRN_MMCRA, (mfspr(SPRN_MMCRA) | pmu_bhrb_filter)); +} + static struct power_pmu power8_pmu = { .name = "POWER8", .n_counter = 6, @@ -435,12 +487,15 @@ static struct power_pmu power8_pmu = { .add_fields = POWER8_ADD_FIELDS, .test_adder = POWER8_TEST_ADDER, .compute_mmcr = power8_compute_mmcr, + .config_bhrb = power8_config_bhrb, + .bhrb_filter_map = power8_bhrb_filter_map, .get_constraint = power8_get_constraint, .disable_pmc = power8_disable_pmc, - .flags = PPMU_HAS_SSLOT | PPMU_HAS_SIER, + .flags = PPMU_HAS_SSLOT | PPMU_HAS_SIER | PPMU_BHRB, .n_generic = ARRAY_SIZE(power8_generic_events), .generic_events = power8_generic_events, .attr_groups = power8_pmu_attr_groups, + .bhrb_nr = 32, }; static int __init init_power8_pmu(void) -- 1.7.11.7
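The filter-mapping logic in this patch can be modelled outside the kernel to see which branch-sample flags reach the hardware. The PERF_SAMPLE_BRANCH_* values below are the standard perf_event.h UAPI ones; the IFM1 constant is a stand-in taken symbolically from the patch (its exact MMCRA bit position is hardware-specific), so treat this as an illustrative sketch:

```python
# Standard perf_event.h branch-sample flags (UAPI values).
BRANCH_USER, BRANCH_KERNEL, BRANCH_HV = 1 << 0, 1 << 1, 1 << 2
BRANCH_ANY, BRANCH_ANY_CALL = 1 << 3, 1 << 4
BRANCH_ANY_RETURN, BRANCH_IND_CALL = 1 << 5, 1 << 6
ONLY_PLM = BRANCH_USER | BRANCH_KERNEL | BRANCH_HV
IFM1 = 0x40000000  # stand-in for POWER8_MMCRA_IFM1 from the patch

def bhrb_filter_map(branch_sample_type):
    # privilege bits must be all set or all clear (see comment in the patch)
    if (branch_sample_type & ONLY_PLM) not in (0, ONLY_PLM):
        return -1
    if branch_sample_type & BRANCH_ANY:
        return 0                      # no HW filter needed
    if branch_sample_type & (BRANCH_ANY_RETURN | BRANCH_IND_CALL):
        return -1                     # unsupported by the hardware
    if branch_sample_type & BRANCH_ANY_CALL:
        return IFM1                   # 'any call' maps to IFM1
    return -1                         # everything else is unsupported
```

Note that PERF_SAMPLE_BRANCH_ANY wins over the unsupported flags because it is tested first, exactly as in the C code.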
[PATCH V4 2/5] powerpc, perf: Add basic assembly code to read BHRB entries on POWER8
This patch adds the basic assembly code to read the BHRB buffer. BHRB entries are valid only after a PMU interrupt has happened (when MMCR0[PMAO]=1) and the BHRB has been frozen. A BHRB read should not be attempted while it is still enabled (MMCR0[PMAE]=1) and being updated, as this can produce non-deterministic results. Signed-off-by: Anshuman Khandual --- arch/powerpc/perf/Makefile | 2 +- arch/powerpc/perf/bhrb.S | 44 2 files changed, 45 insertions(+), 1 deletion(-) create mode 100644 arch/powerpc/perf/bhrb.S diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile index 472db18..510fae1 100644 --- a/arch/powerpc/perf/Makefile +++ b/arch/powerpc/perf/Makefile @@ -2,7 +2,7 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror obj-$(CONFIG_PERF_EVENTS) += callchain.o -obj-$(CONFIG_PPC_PERF_CTRS) += core-book3s.o +obj-$(CONFIG_PPC_PERF_CTRS) += core-book3s.o bhrb.o obj64-$(CONFIG_PPC_PERF_CTRS) += power4-pmu.o ppc970-pmu.o power5-pmu.o \ power5+-pmu.o power6-pmu.o power7-pmu.o \ power8-pmu.o diff --git a/arch/powerpc/perf/bhrb.S b/arch/powerpc/perf/bhrb.S new file mode 100644 index 000..d85f9a5 --- /dev/null +++ b/arch/powerpc/perf/bhrb.S @@ -0,0 +1,44 @@ +/* + * Basic assembly code to read BHRB entries + * + * Copyright 2013 Anshuman Khandual, IBM Corporation. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#include +#include + + .text + +.balign 8 + +/* r3 = n (where n = [0-31]) + * The maximum number of BHRB entries supported with the PPC_MFBHRBE instruction + * is 1024. We have a limited number of table entries here as POWER8 implements + * 32 BHRB entries.
+ */ + +/* .global read_bhrb */ +_GLOBAL(read_bhrb) + cmpldi r3,31 + bgt 1f + ld r4,bhrb_table@got(r2) + sldi r3,r3,3 + add r3,r4,r3 + mtctr r3 + bctr +1: li r3,0 + blr + +#define MFBHRB_TABLE1(n) PPC_MFBHRBE(R3,n); blr +#define MFBHRB_TABLE2(n) MFBHRB_TABLE1(n); MFBHRB_TABLE1(n+1) +#define MFBHRB_TABLE4(n) MFBHRB_TABLE2(n); MFBHRB_TABLE2(n+2) +#define MFBHRB_TABLE8(n) MFBHRB_TABLE4(n); MFBHRB_TABLE4(n+4) +#define MFBHRB_TABLE16(n) MFBHRB_TABLE8(n); MFBHRB_TABLE8(n+8) +#define MFBHRB_TABLE32(n) MFBHRB_TABLE16(n); MFBHRB_TABLE16(n+16) + +bhrb_table: + MFBHRB_TABLE32(0) -- 1.7.11.7
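The branch-table trick above works because each MFBHRB_TABLE1 slot expands to exactly two fixed-width 4-byte powerpc instructions (mfbhrbe; blr), so entry n lives at bhrb_table + n*8, which is what the `sldi r3,r3,3` computes. A sketch of that address arithmetic (illustrative only):

```python
SLOT_BYTES = 8  # two 4-byte instructions (mfbhrbe + blr) per table slot

def slot_offset(n, nr_entries=32):
    # read_bhrb() bails out (returns 0) for out-of-range indices;
    # model that as None here
    if n > nr_entries - 1:
        return None
    return n << 3   # same shift as the 'sldi r3,r3,3' in read_bhrb

print(slot_offset(31))  # offset of the last valid POWER8 slot
```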
[PATCH V4 5/5] powerpc, perf: Enable branch stack sampling framework
Provides basic enablement for perf branch stack sampling framework on POWER8 processor based platforms. Adds new BHRB related elements into cpu_hw_event structure to represent current BHRB config, BHRB filter configuration, manage context and to hold output BHRB buffer during PMU interrupt before passing to the user space. This also enables processing of BHRB data and converts them into generic perf branch stack data format. Signed-off-by: Anshuman Khandual --- arch/powerpc/include/asm/perf_event_server.h | 1 + arch/powerpc/perf/core-book3s.c | 167 ++- 2 files changed, 165 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h index 3f0c15c..f265049 100644 --- a/arch/powerpc/include/asm/perf_event_server.h +++ b/arch/powerpc/include/asm/perf_event_server.h @@ -73,6 +73,7 @@ extern int register_power_pmu(struct power_pmu *); struct pt_regs; extern unsigned long perf_misc_flags(struct pt_regs *regs); extern unsigned long perf_instruction_pointer(struct pt_regs *regs); +extern unsigned long int read_bhrb(int n); /* * Only override the default definitions in include/linux/perf_event.h diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index 4ac6e64..c627843 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -19,6 +19,11 @@ #include #include +#define BHRB_MAX_ENTRIES 32 +#define BHRB_TARGET0x0002 +#define BHRB_PREDICTION0x0001 +#define BHRB_EA0xFFFC + struct cpu_hw_events { int n_events; int n_percpu; @@ -38,7 +43,15 @@ struct cpu_hw_events { unsigned int group_flag; int n_txn_start; + + /* BHRB bits */ + u64 bhrb_filter;/* BHRB HW branch filter */ + int bhrb_users; + void*bhrb_context; + struct perf_branch_stack bhrb_stack; + struct perf_branch_entry bhrb_entries[BHRB_MAX_ENTRIES]; }; + DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events); struct power_pmu *ppmu; @@ -858,6 +871,9 @@ static void power_pmu_enable(struct pmu *pmu) } 
out: + if (cpuhw->bhrb_users) + ppmu->config_bhrb(cpuhw->bhrb_filter); + local_irq_restore(flags); } @@ -888,6 +904,47 @@ static int collect_events(struct perf_event *group, int max_count, return n; } +/* Reset all possible BHRB entries */ +static void power_pmu_bhrb_reset(void) +{ + asm volatile(PPC_CLRBHRB); +} + +void power_pmu_bhrb_enable(struct perf_event *event) +{ + struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events); + + if (!ppmu->bhrb_nr) + return; + + /* Clear BHRB if we changed task context to avoid data leaks */ + if (event->ctx->task && cpuhw->bhrb_context != event->ctx) { + power_pmu_bhrb_reset(); + cpuhw->bhrb_context = event->ctx; + } + cpuhw->bhrb_users++; +} + +void power_pmu_bhrb_disable(struct perf_event *event) +{ + struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events); + + if (!ppmu->bhrb_nr) + return; + + cpuhw->bhrb_users--; + WARN_ON_ONCE(cpuhw->bhrb_users < 0); + + if (!cpuhw->disabled && !cpuhw->bhrb_users) { + /* BHRB cannot be turned off when other +* events are active on the PMU. +*/ + + /* avoid stale pointer */ + cpuhw->bhrb_context = NULL; + } +} + /* * Add a event to the PMU. * If all events are not already frozen, then we disable and @@ -947,6 +1004,9 @@ nocheck: ret = 0; out: + if (has_branch_stack(event)) + power_pmu_bhrb_enable(event); + perf_pmu_enable(event->pmu); local_irq_restore(flags); return ret; @@ -999,6 +1059,9 @@ static void power_pmu_del(struct perf_event *event, int ef_flags) cpuhw->mmcr[0] &= ~(MMCR0_PMXE | MMCR0_FCECE); } + if (has_branch_stack(event)) + power_pmu_bhrb_disable(event); + perf_pmu_enable(event->pmu); local_irq_restore(flags); } @@ -1117,6 +1180,15 @@ int power_pmu_commit_txn(struct pmu *pmu) return 0; } +/* Called from ctxsw to prevent one process's branch entries to + * mingle with the other process's entries during context switch. 
+ */ +void power_pmu_flush_branch_stack(void) +{ + if (ppmu->bhrb_nr) + power_pmu_bhrb_reset(); +} + /* * Return 1 if we might be able to put event on a limited PMC, * or 0 if not. @@ -1231,9 +1303,11 @@ static int power_pmu_event_init(struct perf_event *event) if (
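The bhrb_users/bhrb_context bookkeeping added by this patch is small enough to model on its own. The toy sketch below (class and method names invented for illustration) shows the reference counting and the context-switch hygiene the code is after: the buffer is cleared whenever the task context changes, and the context pointer is dropped once the last user goes away:

```python
class CpuHwEvents:
    """Toy stand-in for the BHRB fields added to cpu_hw_events."""
    def __init__(self):
        self.bhrb_users = 0
        self.bhrb_context = None
        self.resets = 0   # counts stand-in clrbhrb invocations

    def bhrb_reset(self):
        self.resets += 1  # the real code executes the clrbhrb instruction

    def bhrb_enable(self, ctx):
        # clear BHRB when the task context changed, to avoid leaking
        # another process's branch entries
        if ctx is not None and self.bhrb_context is not ctx:
            self.bhrb_reset()
            self.bhrb_context = ctx
        self.bhrb_users += 1

    def bhrb_disable(self):
        self.bhrb_users -= 1
        assert self.bhrb_users >= 0
        if self.bhrb_users == 0:
            self.bhrb_context = None  # avoid a stale pointer
```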
[PATCH V4 1/5] powerpc, perf: Add new BHRB related instructions for POWER8
This patch adds new POWER8 instruction encodings for reading and clearing Branch History Rolling Buffer entries. The new instruction 'mfbhrbe' (move from branch history rolling buffer entry) is used to read BHRB buffer entries and the instruction 'clrbhrb' (clear branch history rolling buffer) is used to clear the entire buffer. The instruction 'clrbhrb' has a straightforward encoding. The encoding for reading BHRB entries is 'mfbhrbe RT, BHRBE', which takes two arguments: the index of the BHRB buffer entry to read and a general purpose register to receive the value read from that entry. Signed-off-by: Anshuman Khandual --- arch/powerpc/include/asm/ppc-opcode.h | 8 1 file changed, 8 insertions(+) diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h index 8752bc8..0c34e48 100644 --- a/arch/powerpc/include/asm/ppc-opcode.h +++ b/arch/powerpc/include/asm/ppc-opcode.h @@ -82,6 +82,8 @@ #define __REGA0_R31 31 /* sorted alphabetically */ +#define PPC_INST_BHRBE 0x7c00025c +#define PPC_INST_CLRBHRB 0x7c00035c #define PPC_INST_DCBA 0x7c0005ec #define PPC_INST_DCBA_MASK 0xfc0007fe #define PPC_INST_DCBAL 0x7c2005ec @@ -297,6 +299,12 @@ #define PPC_NAP stringify_in_c(.long PPC_INST_NAP) #define PPC_SLEEP stringify_in_c(.long PPC_INST_SLEEP) +/* BHRB instructions */ +#define PPC_CLRBHRB stringify_in_c(.long PPC_INST_CLRBHRB) +#define PPC_MFBHRBE(r, n) stringify_in_c(.long PPC_INST_BHRBE | \ + __PPC_RT(r) | \ + (((n) & 0x3ff) << 11)) + /* Transactional memory instructions */ #define TRECHKPT stringify_in_c(.long PPC_INST_TRECHKPT) #define TRECLAIM(r) stringify_in_c(.long PPC_INST_TRECLAIM \ -- 1.7.11.7
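As a rough cross-check of the PPC_MFBHRBE() macro above, here is the same bit assembly in Python. The RT shift of 21 follows the usual powerpc RT field placement (bits 21-25 of the 32-bit word); treat this as an illustrative sketch, not a reference encoder:

```python
PPC_INST_BHRBE = 0x7c00025c
PPC_INST_CLRBHRB = 0x7c00035c

def ppc_rt(r):
    # RT occupies bits 21-25 of the 32-bit instruction word
    return (r & 0x1f) << 21

def ppc_mfbhrbe(r, n):
    # mirror of the PPC_MFBHRBE() macro: 10-bit entry number at bit 11
    return PPC_INST_BHRBE | ppc_rt(r) | ((n & 0x3ff) << 11)

print(hex(ppc_mfbhrbe(3, 0)))  # encoding of 'mfbhrbe r3, 0'
```

Because the number field is masked with 0x3ff, any index of 1024 or more silently wraps, which is why read_bhrb() range-checks r3 before dispatching.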
[PATCH V4 0/5] powerpc, perf: BHRB based branch stack enablement on POWER8
Branch History Rolling Buffer (BHRB) is a new PMU feature in the IBM POWER8 processor which records the branch instructions inside the execution pipeline. This patchset enables the basic functionality of the feature through the generic perf branch stack sampling framework. Sample output - $./perf record -b top $./perf report Overhead Command Source Shared Object Source Symbol Target Shared Object Target Symbol # ... .. ... # 7.82% top libc-2.11.2.so [k] _IO_vfscanf libc-2.11.2.so [k] _IO_vfscanf 6.17% top libc-2.11.2.so [k] _IO_vfscanf [unknown] [k] 2.37% top [unknown] [k] 0xf7aafb30 [unknown] [k] 1.80% top [unknown] [k] 0x0fe07978 libc-2.11.2.so [k] _IO_vfscanf 1.60% top libc-2.11.2.so [k] _IO_vfscanf [kernel.kallsyms] [k] .do_task_stat 1.20% top [kernel.kallsyms] [k] .do_task_stat [kernel.kallsyms] [k] .do_task_stat 1.02% top libc-2.11.2.so [k] vfprintf libc-2.11.2.so [k] vfprintf 0.92% top top [k] _init [unknown] [k] 0x0fe037f4 Changes in V2 -- - Added copyright messages to the newly created files - Modified a couple of commit messages Changes in V3 - - Incorporated review comments from Segher https://lkml.org/lkml/2013/4/16/350 - Worked on a solution for a review comment from Michael Ellerman https://lkml.org/lkml/2013/4/17/548 - Could not move the updated cpu_hw_events structure from the core-book3s.c file into perf_event_server.h, because perf_event_server.h is pulled in first inside linux/perf_event.h, before the definition of the perf_branch_entry structure. That's the reason why the perf_branch_entry definition is not available inside perf_event_server.h, where we define the array inside the cpu_hw_events structure.
- Finally have pulled in the code from perf_event_bhrb.c into core-book3s.c - Improved documentation for the patchset Changes in V4 - - Incorporated review comments on V3 regarding new instruction encoding Anshuman Khandual (5): powerpc, perf: Add new BHRB related instructions for POWER8 powerpc, perf: Add basic assembly code to read BHRB entries on POWER8 powerpc, perf: Add new BHRB related generic functions, data and flags powerpc, perf: Define BHRB generic functions, data and flags for POWER8 powerpc, perf: Enable branch stack sampling framework arch/powerpc/include/asm/perf_event_server.h | 7 ++ arch/powerpc/include/asm/ppc-opcode.h | 8 ++ arch/powerpc/perf/Makefile | 2 +- arch/powerpc/perf/bhrb.S | 44 +++ arch/powerpc/perf/core-book3s.c | 167 ++- arch/powerpc/perf/power8-pmu.c | 57 - 6 files changed, 280 insertions(+), 5 deletions(-) create mode 100644 arch/powerpc/perf/bhrb.S -- 1.7.11.7
Re: [PATCH 1/2] perf/Power7: Save dcache_src fields in sample record.
> > AFAICT, Power7 supports one extra level in the cache-hierarchy, so we propose > to add a new cache level, REM_CCE3 shown above. > > To maintain consistency in terminology (i.e 2-hops = remote, 3-hops = > distant), > I propose leaving the REM_MEM1 unused and adding another level, REM_MEM3. > Agreed. > Further, in the above REM_CCE1 case, Power7 can also identify if the data came > from the L2 or L3 cache of another core on the same chip. To describe this to > user space, we propose to set ->mem_lvl to: > > PERF_MEM_LVL_REM_CCE1|PERF_MEM_LVL_L2 > > PERF_MEM_LVL_REM_CCE1|PERF_MEM_LVL_L3 > > Either that or we could leave REM_CCE1 unused in Power and add two more > levels: > > PERF_MEM_XLVL_REM_L2_CCE1 > PERF_MEM_XLVL_REM_L3_CCE1 > > The former approach seems less confusing and this patch uses that approach. > Yeah, the former approach is simpler and makes sense. > Signed-off-by: Sukadev Bhattiprolu > --- > arch/powerpc/include/asm/perf_event_server.h |2 + > arch/powerpc/perf/core-book3s.c |4 + > arch/powerpc/perf/power7-pmu.c | 81 > ++ > include/uapi/linux/perf_event.h | 12 +++- > 4 files changed, 97 insertions(+), 2 deletions(-) > > diff --git a/arch/powerpc/include/asm/perf_event_server.h > b/arch/powerpc/include/asm/perf_event_server.h > index f265049..f2d162b 100644 > --- a/arch/powerpc/include/asm/perf_event_server.h > +++ b/arch/powerpc/include/asm/perf_event_server.h > @@ -37,6 +37,8 @@ struct power_pmu { > void(*config_bhrb)(u64 pmu_bhrb_filter); > void(*disable_pmc)(unsigned int pmc, unsigned long mmcr[]); > int (*limited_pmc_event)(u64 event_id); > + void(*get_mem_data_src)(struct perf_sample_data *data, > + struct pt_regs *regs); > u32 flags; > const struct attribute_group**attr_groups; > int n_generic; > diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c > index 426180b..7778fa9 100644 > --- a/arch/powerpc/perf/core-book3s.c > +++ b/arch/powerpc/perf/core-book3s.c > @@ -1632,6 +1632,10 @@ static void record_and_restart(struct 
perf_event > *event, unsigned long val, > data.br_stack = &cpuhw->bhrb_stack; > } > > + if (event->attr.sample_type & PERF_SAMPLE_DATA_SRC && > + ppmu->get_mem_data_src) > + ppmu->get_mem_data_src(&data, regs); > + > if (perf_event_overflow(event, &data, regs)) > power_pmu_stop(event, 0); > } > diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c > index 3c475d6..af92bfe 100644 > --- a/arch/powerpc/perf/power7-pmu.c > +++ b/arch/powerpc/perf/power7-pmu.c > @@ -209,6 +209,85 @@ static int power7_get_alternatives(u64 event, unsigned > int flags, u64 alt[]) > return nalt; > } > > +#define POWER7_MMCRA_PEMPTY (0x1L << 63) > +#define POWER7_MMCRA_FIN_STALL (0x1L << 62) > +#define POWER7_MMCRA_CMPL_STALL (0x1L << 61) > +#define POWER7_MMCRA_STALL_REASON_MASK (0xFL << 60) > + > +#define POWER7_MMCRA_DCACHE_MISS(0x1L << 55) > + > +#define POWER7_MMCRA_DCACHE_SRC_SHIFT 51 > +#define POWER7_MMCRA_DCACHE_SRC_MASK(0xFL << > POWER7_MMCRA_DCACHE_SRC_SHIFT) > + > +#define POWER7_MMCRA_MDTLB_MISS (0x1L << 50) > + > +#define POWER7_MMCRA_MDTLB_SRC_SHIFT46 > +#define POWER7_MMCRA_MDTLB_SRC_MASK (0xFL << > POWER7_MMCRA_MDTLB_SRC_SHIFT) > + > +#define POWER7_MMCRA_MDERAT_MISS(0x1L<< 45) > +#define POWER7_MMCRA_MLSU_REJ (0x1L<< 44) > + > +/* and so on */ > + > +/* > + * Map DCACHE_SRC fields to the Linux memory hierarchy levels. > + * > + * Bits 9..12 in the MMCRA indicate the source of a data-cache entry, with > + * each of the 16 possible values referring to a specific source. Eg: if > + * the 4-bits have the value 1 (0b0001), the dcache entry was found local > + * L3 cache. > + * > + * We use the table, dcache_src_map, to map this value 1 to PERF_MEM_LVL_L3, > + * the arch-neutral representation of the L3 cache. > + * > + * Similarly, in case of marked data TLB miss, bits 14..17 of the MMCRA > + * indicate the load source of a marked DTLB entry. dtlb_src_map[] gives > + * the mapping to the arch-neutral values of the TLB source. 
Where did you define dtlb_src_map[]?
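The mem_lvl combinations proposed in this thread (e.g. PERF_MEM_LVL_REM_CCE1|PERF_MEM_LVL_L2) compose with the standard perf_mem_data_src packing. The helper below uses the stock perf_event.h UAPI values (op field at bit 0, lvl field at bit 5) to show how such samples decode; it is a sketch of the encoding, not of the Power7 dcache_src_map[] table itself:

```python
# Standard perf_event.h perf_mem_data_src values (UAPI).
PERF_MEM_OP_LOAD = 0x02          # op field, shift 0
PERF_MEM_LVL_SHIFT = 5
PERF_MEM_LVL_HIT = 0x02
PERF_MEM_LVL_L1 = 0x08
PERF_MEM_LVL_L2 = 0x20
PERF_MEM_LVL_REM_CCE1 = 0x400    # cache of another core, 1 hop away

def mk_data_src(op, lvl):
    # pack op and level fields the way a sample's data_src word does
    return op | (lvl << PERF_MEM_LVL_SHIFT)

# "load hit L2": the 0x442 example from the patch description
print(hex(mk_data_src(PERF_MEM_OP_LOAD, PERF_MEM_LVL_HIT | PERF_MEM_LVL_L2)))
```

The 0x142 value seen in testing decodes the same way as a load that hit the L1.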
[PATCH] powerpc, perf: Fix processing conditions for invalid BHRB entries
Fix the invalid-entry checks in BHRB entry processing: BHRB_TARGET is a multi-bit mask (0x2), so a comparison like 'target == 1' can never be true; test the extracted fields as booleans instead. Signed-off-by: Anshuman Khandual --- arch/powerpc/perf/core-book3s.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index 09db68d..1de2756 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -1481,25 +1481,25 @@ void power_pmu_bhrb_read(struct cpu_hw_events *cpuhw) target = val & BHRB_TARGET; /* Probable Missed entry: Not applicable for POWER8 */ - if ((addr == 0) && (target == 0) && (pred == 1)) { + if ((addr == 0) && (!target) && pred) { r_index++; continue; } /* Real Missed entry: Power8 based missed entry */ - if ((addr == 0) && (target == 1) && (pred == 1)) { + if ((addr == 0) && target && pred) { r_index++; continue; } /* Reserved condition: Not a valid entry */ - if ((addr == 0) && (target == 1) && (pred == 0)) { + if ((addr == 0) && target && (!pred)) { r_index++; continue; } /* Is a target address */ - if (val & BHRB_TARGET) { + if (target) { /* First address cannot be a target address */ if (r_index == 0) { r_index++; -- 1.7.11.7
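The invalid-entry cases being fixed above differ only in the (addr, target, pred) triple. Writing the decision table out makes the intent clearer; this is an illustrative sketch using the flag values from the BHRB_* defines in core-book3s.c (prediction in bit 0, target in bit 1, effective address in the remaining bits):

```python
def classify_bhrb_entry(val):
    # low two bits of a BHRB entry are flags, the rest is the EA
    pred = bool(val & 0x1)      # BHRB_PREDICTION
    target = bool(val & 0x2)    # BHRB_TARGET
    addr = val & ~0x3           # BHRB_EA strips the flag bits
    if addr == 0:
        if not target and pred:
            return "probable-miss"   # not applicable on POWER8
        if target and pred:
            return "real-miss"       # POWER8 missed entry
        if target and not pred:
            return "reserved"        # not a valid entry
        return "terminal"            # val == 0 ends the buffer
    return "target-address" if target else "from-address"
```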
[PATCH] powerpc, perf: Clear out branch entries to avoid any previous stale values
The 'to' field inside the branch entries might contain stale values from previous PMU interrupt instances that had indirect branches. So clear all the entries before reading a fresh set of BHRB entries after a PMU interrupt. Signed-off-by: Anshuman Khandual --- arch/powerpc/perf/core-book3s.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index c627843..09db68d 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -1590,6 +1590,8 @@ static void record_and_restart(struct perf_event *event, unsigned long val, if (event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK) { struct cpu_hw_events *cpuhw; cpuhw = &__get_cpu_var(cpu_hw_events); + memset(cpuhw->bhrb_entries, 0, + sizeof(struct perf_branch_entry) * BHRB_MAX_ENTRIES); power_pmu_bhrb_read(cpuhw); data.br_stack = &cpuhw->bhrb_stack; } -- 1.7.11.7
Re: [PATCH] powerpc, perf: Fix processing conditions for invalid BHRB entries
On 05/06/2013 04:41 PM, Michael Neuling wrote: > Anshuman Khandual wrote: > >> Fixing some conditions during BHRB entry processing. > > I think we can simplify this a lot more... something like the below. > I feel that the conditional handling of the invalid BHRB entries should be present which would help us during the debug and also could be used for more granular branch classification or error handling later on. > Also, this marks the "to" address as all 1s, which is better poison > value since it's possible to branch to/from 0x0. > Agreed. > diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c > index c627843..d410d65 100644 > --- a/arch/powerpc/perf/core-book3s.c > +++ b/arch/powerpc/perf/core-book3s.c > @@ -1463,65 +1463,45 @@ void power_pmu_bhrb_read(struct cpu_hw_events *cpuhw) > { > u64 val; > u64 addr; > - int r_index, u_index, target, pred; > + int r_index, u_index, pred; > > r_index = 0; > u_index = 0; > while (r_index < ppmu->bhrb_nr) { > /* Assembly read function */ > - val = read_bhrb(r_index); > + val = read_bhrb(r_index++); > > /* Terminal marker: End of valid BHRB entries */ > - if (val == 0) { > + if (!val) { > break; > } else { > /* BHRB field break up */ > addr = val & BHRB_EA; > pred = val & BHRB_PREDICTION; > - target = val & BHRB_TARGET; > > - /* Probable Missed entry: Not applicable for POWER8 */ > - if ((addr == 0) && (target == 0) && (pred == 1)) { > - r_index++; > + if (!addr) > + /* invalid entry */ > continue; > - } > - > - /* Real Missed entry: Power8 based missed entry */ > - if ((addr == 0) && (target == 1) && (pred == 1)) { > - r_index++; > - continue; > - } > > - /* Reserved condition: Not a valid entry */ > - if ((addr == 0) && (target == 1) && (pred == 0)) { > - r_index++; > - continue; > - } > - > - /* Is a target address */ > if (val & BHRB_TARGET) { > /* First address cannot be a target address */ > - if (r_index == 0) { > - r_index++; > + if (r_index == 0) > continue; > - } > > /* Update target 
address for the previous entry > */ > cpuhw->bhrb_entries[u_index - 1].to = addr; > cpuhw->bhrb_entries[u_index - 1].mispred = pred; > cpuhw->bhrb_entries[u_index - 1].predicted = > ~pred; > - > - /* Dont increment u_index */ > - r_index++; > } else { > /* Update address, flags for current entry */ > cpuhw->bhrb_entries[u_index].from = addr; > + cpuhw->bhrb_entries[u_index].to = > + 0x; > cpuhw->bhrb_entries[u_index].mispred = pred; > cpuhw->bhrb_entries[u_index].predicted = ~pred; > > /* Successfully popullated one entry */ > u_index++; > - r_index++; > } > } > }
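Michael's simplification collapses the four addr==0 cases into one skip and poisons the 'to' field with all 1s, so a branch whose target was never seen stands out. The whole read loop then reduces to something like the following Python model of the flow (not the kernel code; entry layout per the BHRB_* masks, with the target flag in bit 1 and the prediction flag in bit 0):

```python
POISON_TO = (1 << 64) - 1   # all 1s: a 'to' we never filled in

def parse_bhrb(raw_entries):
    out = []
    for val in raw_entries:
        if val == 0:
            break                    # terminal marker: end of valid entries
        addr = val & ~0x3            # BHRB_EA
        pred = bool(val & 0x1)       # BHRB_PREDICTION
        if addr == 0:
            continue                 # any flavour of invalid entry
        if val & 0x2:                # BHRB_TARGET
            if not out:
                continue             # first entry cannot be a target
            out[-1]["to"] = addr     # completes the previous branch
        else:
            out.append({"from": addr, "to": POISON_TO, "mispred": pred})
    return out
```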
[PATCH 3/5] powerpc, perf: Add new BHRB related generic functions, data and flags
Signed-off-by: Anshuman Khandual --- arch/powerpc/include/asm/perf_event_server.h | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h index 57b42da..3f0c15c 100644 --- a/arch/powerpc/include/asm/perf_event_server.h +++ b/arch/powerpc/include/asm/perf_event_server.h @@ -33,6 +33,8 @@ struct power_pmu { unsigned long *valp); int (*get_alternatives)(u64 event_id, unsigned int flags, u64 alt[]); + u64 (*bhrb_filter_map)(u64 branch_sample_type); + void (*config_bhrb)(u64 pmu_bhrb_filter); void (*disable_pmc)(unsigned int pmc, unsigned long mmcr[]); int (*limited_pmc_event)(u64 event_id); u32 flags; @@ -42,6 +44,9 @@ struct power_pmu { int (*cache_events)[PERF_COUNT_HW_CACHE_MAX] [PERF_COUNT_HW_CACHE_OP_MAX] [PERF_COUNT_HW_CACHE_RESULT_MAX]; + + /* BHRB entries in the PMU */ + int bhrb_nr; }; /* @@ -54,6 +59,7 @@ struct power_pmu { #define PPMU_SIAR_VALID 0x0010 /* Processor has SIAR Valid bit */ #define PPMU_HAS_SSLOT 0x0020 /* Has sampled slot in MMCRA */ #define PPMU_HAS_SIER 0x0040 /* Has SIER */ +#define PPMU_BHRB 0x0080 /* has BHRB feature enabled */ /* * Values for flags to get_alternatives() -- 1.7.11.7
[PATCH 4/5] powerpc, perf: Define BHRB generic functions, data and flags for POWER8
Signed-off-by: Anshuman Khandual --- arch/powerpc/perf/power8-pmu.c | 57 +- 1 file changed, 56 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c index 106ae0b..153408c 100644 --- a/arch/powerpc/perf/power8-pmu.c +++ b/arch/powerpc/perf/power8-pmu.c @@ -109,6 +109,16 @@ #define EVENT_IS_MARKED (EVENT_MARKED_MASK << EVENT_MARKED_SHIFT) #define EVENT_PSEL_MASK 0xff /* PMCxSEL value */ +/* MMCRA IFM bits - POWER8 */ +#define POWER8_MMCRA_IFM1 0x4000UL +#define POWER8_MMCRA_IFM2 0x8000UL +#define POWER8_MMCRA_IFM3 0xC000UL + +#define ONLY_PLM \ + (PERF_SAMPLE_BRANCH_USER |\ + PERF_SAMPLE_BRANCH_KERNEL |\ + PERF_SAMPLE_BRANCH_HV) + /* * Layout of constraint bits: * @@ -428,6 +438,48 @@ static int power8_generic_events[] = { [PERF_COUNT_HW_BRANCH_MISSES] = PM_BR_MPRED_CMPL, }; +static u64 power8_bhrb_filter_map(u64 branch_sample_type) +{ + u64 pmu_bhrb_filter = 0; + u64 br_privilege = branch_sample_type & ONLY_PLM; + + /* BHRB and regular PMU events share the same privilege state +* filter configuration. BHRB is always recorded along with a +* regular PMU event. So the privilege state filter criteria for BHRB +* and the companion PMU events have to be the same. By default the +* "perf record" tool sets all privilege bits ON when no filter +* criteria is provided on the command line. So as long as all +* privilege bits are ON or they are OFF, we are good to go.
+*/ + if ((br_privilege != 7) && (br_privilege != 0)) + return -1; + + /* No branch filter requested */ + if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY) + return pmu_bhrb_filter; + + /* Invalid branch filter options - HW does not support */ + if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_RETURN) + return -1; + + if (branch_sample_type & PERF_SAMPLE_BRANCH_IND_CALL) + return -1; + + if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) { + pmu_bhrb_filter |= POWER8_MMCRA_IFM1; + return pmu_bhrb_filter; + } + + /* Everything else is unsupported */ + return -1; +} + +static void power8_config_bhrb(u64 pmu_bhrb_filter) +{ + /* Enable BHRB filter in PMU */ + mtspr(SPRN_MMCRA, (mfspr(SPRN_MMCRA) | pmu_bhrb_filter)); +} + static struct power_pmu power8_pmu = { .name = "POWER8", .n_counter = 6, @@ -435,12 +487,15 @@ static struct power_pmu power8_pmu = { .add_fields = POWER8_ADD_FIELDS, .test_adder = POWER8_TEST_ADDER, .compute_mmcr = power8_compute_mmcr, + .config_bhrb = power8_config_bhrb, + .bhrb_filter_map = power8_bhrb_filter_map, .get_constraint = power8_get_constraint, .disable_pmc = power8_disable_pmc, - .flags = PPMU_HAS_SSLOT | PPMU_HAS_SIER, + .flags = PPMU_HAS_SSLOT | PPMU_HAS_SIER | PPMU_BHRB, .n_generic = ARRAY_SIZE(power8_generic_events), .generic_events = power8_generic_events, .attr_groups = power8_pmu_attr_groups, + .bhrb_nr = 32, }; static int __init init_power8_pmu(void) -- 1.7.11.7
[PATCH 5/5] powerpc, perf: Enable branch stack sampling framework support with BHRB
This patch provides basic enablement for perf branch stack sampling framework on POWER8 processor with a new PMU feature called BHRB (Branch History Rolling Buffer). Signed-off-by: Anshuman Khandual --- arch/powerpc/perf/core-book3s.c | 96 +++-- arch/powerpc/perf/perf_event_bhrb.c | 75 + 2 files changed, 168 insertions(+), 3 deletions(-) create mode 100644 arch/powerpc/perf/perf_event_bhrb.c diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index 4ac6e64..f4d1347 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -19,6 +19,8 @@ #include #include +#define BHRB_MAX_ENTRIES 32 + struct cpu_hw_events { int n_events; int n_percpu; @@ -38,11 +40,21 @@ struct cpu_hw_events { unsigned int group_flag; int n_txn_start; + + /* BHRB bits */ + u64 bhrb_filter;/* BHRB HW branch filter */ + int bhrb_users; + void*bhrb_context; + struct perf_branch_stack bhrb_stack; + struct perf_branch_entry bhrb_entries[BHRB_MAX_ENTRIES]; }; + DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events); struct power_pmu *ppmu; +#include "perf_event_bhrb.c" + /* * Normally, to ignore kernel events we set the FCS (freeze counters * in supervisor mode) bit in MMCR0, but if the kernel runs with the @@ -858,6 +870,9 @@ static void power_pmu_enable(struct pmu *pmu) } out: + if (cpuhw->bhrb_users) + ppmu->config_bhrb(cpuhw->bhrb_filter); + local_irq_restore(flags); } @@ -888,6 +903,47 @@ static int collect_events(struct perf_event *group, int max_count, return n; } +/* Reset all possible BHRB entries */ +static void power_pmu_bhrb_reset(void) +{ + asm volatile(PPC_CLRBHRB); +} + +void power_pmu_bhrb_enable(struct perf_event *event) +{ + struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events); + + if (!ppmu->bhrb_nr) + return; + + /* Clear BHRB if we changed task context to avoid data leaks */ + if (event->ctx->task && cpuhw->bhrb_context != event->ctx) { + power_pmu_bhrb_reset(); + cpuhw->bhrb_context = event->ctx; + } + cpuhw->bhrb_users++; 
+} + +void power_pmu_bhrb_disable(struct perf_event *event) +{ + struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events); + + if (!ppmu->bhrb_nr) + return; + + cpuhw->bhrb_users--; + WARN_ON_ONCE(cpuhw->bhrb_users < 0); + + if (!cpuhw->disabled && !cpuhw->bhrb_users) { + /* BHRB cannot be turned off when other +* events are active on the PMU. +*/ + + /* avoid stale pointer */ + cpuhw->bhrb_context = NULL; + } +} + /* * Add a event to the PMU. * If all events are not already frozen, then we disable and @@ -947,6 +1003,9 @@ nocheck: ret = 0; out: + if (has_branch_stack(event)) + power_pmu_bhrb_enable(event); + perf_pmu_enable(event->pmu); local_irq_restore(flags); return ret; @@ -999,6 +1058,9 @@ static void power_pmu_del(struct perf_event *event, int ef_flags) cpuhw->mmcr[0] &= ~(MMCR0_PMXE | MMCR0_FCECE); } + if (has_branch_stack(event)) + power_pmu_bhrb_disable(event); + perf_pmu_enable(event->pmu); local_irq_restore(flags); } @@ -1117,6 +1179,15 @@ int power_pmu_commit_txn(struct pmu *pmu) return 0; } +/* Called from ctxsw to prevent one process's branch entries to + * mingle with the other process's entries during context switch. + */ +void power_pmu_flush_branch_stack(void) +{ + if (ppmu->bhrb_nr) + power_pmu_bhrb_reset(); +} + /* * Return 1 if we might be able to put event on a limited PMC, * or 0 if not. @@ -1231,9 +1302,11 @@ static int power_pmu_event_init(struct perf_event *event) if (!ppmu) return -ENOENT; - /* does not support taken branch sampling */ - if (has_branch_stack(event)) - return -EOPNOTSUPP; + if (has_branch_stack(event)) { + /* PMU has BHRB enabled */ + if (!(ppmu->flags & PPMU_BHRB)) + return -EOPNOTSUPP; + } switch (event->attr.type) { case PERF_TYPE_HARDWARE: @@ -1314,6 +1387,15 @@ static int power_pmu_event_init(struct perf_event *event) cpuhw = &get_cpu_var(cpu_hw_events); err = power_check_constraints(cpuhw, events, cflags, n + 1); + + if (has_branch_stack(event)) { + cpuhw->bhrb_filter = ppmu->bhrb_filter_map( +
[PATCH 2/5] powerpc, perf: Add basic assembly code to read BHRB entries on POWER8
Signed-off-by: Anshuman Khandual --- arch/powerpc/perf/Makefile | 2 +- arch/powerpc/perf/bhrb.S | 34 ++ 2 files changed, 35 insertions(+), 1 deletion(-) create mode 100644 arch/powerpc/perf/bhrb.S diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile index 472db18..510fae1 100644 --- a/arch/powerpc/perf/Makefile +++ b/arch/powerpc/perf/Makefile @@ -2,7 +2,7 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror obj-$(CONFIG_PERF_EVENTS) += callchain.o -obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o +obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o bhrb.o obj64-$(CONFIG_PPC_PERF_CTRS) += power4-pmu.o ppc970-pmu.o power5-pmu.o \ power5+-pmu.o power6-pmu.o power7-pmu.o \ power8-pmu.o diff --git a/arch/powerpc/perf/bhrb.S b/arch/powerpc/perf/bhrb.S new file mode 100644 index 000..cffb815 --- /dev/null +++ b/arch/powerpc/perf/bhrb.S @@ -0,0 +1,34 @@ +#include +#include + + .text + +.balign 8 + +/* r3 = n (where n = [0-1023]) + * The maximum number of BHRB entries supported with PPC_MFBHRBE instruction + * is 1024. We have limited number of table entries here as POWER8 implements + * 32 BHRB entries. + */ + +/* .global read_bhrb */ +_GLOBAL(read_bhrb) + cmpldi r3,1023 + bgt 1f + ld r4,bhrb_table@got(r2) + sldir3,r3,3 + add r3,r4,r3 + mtctr r3 + bctr +1: li r3,0 + blr + +#define MFBHRB_TABLE1(n) PPC_MFBHRBE(R3,n); blr +#define MFBHRB_TABLE2(n) MFBHRB_TABLE1(n); MFBHRB_TABLE1(n+1) +#define MFBHRB_TABLE4(n) MFBHRB_TABLE2(n); MFBHRB_TABLE2(n+2) +#define MFBHRB_TABLE8(n) MFBHRB_TABLE4(n); MFBHRB_TABLE4(n+4) +#define MFBHRB_TABLE16(n) MFBHRB_TABLE8(n); MFBHRB_TABLE8(n+8) +#define MFBHRB_TABLE32(n) MFBHRB_TABLE16(n); MFBHRB_TABLE16(n+16) + +bhrb_table: + MFBHRB_TABLE32(0) -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/5] powerpc, perf: Add new BHRB related instructions on POWER8
This patch adds new instructions support for reading various BHRB entries and also clearing the BHRB buffer. Signed-off-by: Anshuman Khandual --- arch/powerpc/include/asm/ppc-opcode.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h index 8752bc8..93ae5a1 100644 --- a/arch/powerpc/include/asm/ppc-opcode.h +++ b/arch/powerpc/include/asm/ppc-opcode.h @@ -82,6 +82,7 @@ #define__REGA0_R31 31 /* sorted alphabetically */ +#define PPC_INST_BHRBE 0x7c00025c #define PPC_INST_DCBA 0x7c0005ec #define PPC_INST_DCBA_MASK 0xfc0007fe #define PPC_INST_DCBAL 0x7c2005ec @@ -297,6 +298,12 @@ #define PPC_NAPstringify_in_c(.long PPC_INST_NAP) #define PPC_SLEEP stringify_in_c(.long PPC_INST_SLEEP) +/* BHRB instructions */ +#define PPC_CLRBHRBstringify_in_c(.long 0x7c00035c) +#define PPC_MFBHRBE(r, n) stringify_in_c(.long PPC_INST_BHRBE | \ + __PPC_RS(r) | \ + (((n) & 0x1f) << 11)) + /* Transactional memory instructions */ #define TRECHKPT stringify_in_c(.long PPC_INST_TRECHKPT) #define TRECLAIM(r)stringify_in_c(.long PPC_INST_TRECLAIM \ -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/5] powerpc, perf: BHRB based branch stack enablement on POWER8
Branch History Rolling Buffer (BHRB) is a new PMU feature in the IBM POWER8 processor which records the branch instructions inside the execution pipeline. This patchset enables the basic functionality of the feature through the generic perf branch stack sampling framework.

Sample output
-------------
$ ./perf record -b top
$ ./perf report

 Overhead  Command  Source Shared Object  Source Symbol      Target Shared Object  Target Symbol
    7.82%  top      libc-2.11.2.so        [k] _IO_vfscanf    libc-2.11.2.so        [k] _IO_vfscanf
    6.17%  top      libc-2.11.2.so        [k] _IO_vfscanf    [unknown]             [k]
    2.37%  top      [unknown]             [k] 0xf7aafb30     [unknown]             [k]
    1.80%  top      [unknown]             [k] 0x0fe07978     libc-2.11.2.so        [k] _IO_vfscanf
    1.60%  top      libc-2.11.2.so        [k] _IO_vfscanf    [kernel.kallsyms]     [k] .do_task_stat
    1.20%  top      [kernel.kallsyms]     [k] .do_task_stat  [kernel.kallsyms]     [k] .do_task_stat
    1.02%  top      libc-2.11.2.so        [k] vfprintf       libc-2.11.2.so        [k] vfprintf
    0.92%  top      top                   [k] _init          [unknown]             [k] 0x0fe037f4

Anshuman Khandual (5):
  powerpc, perf: Add new BHRB related instructions on POWER8
  powerpc, perf: Add basic assembly code to read BHRB entries on POWER8
  powerpc, perf: Add new BHRB related generic functions, data and flags
  powerpc, perf: Define BHRB generic functions, data and flags for POWER8
  powerpc, perf: Enable branch stack sampling framework support with BHRB

 arch/powerpc/include/asm/perf_event_server.h |  6 ++
 arch/powerpc/include/asm/ppc-opcode.h        |  7 ++
 arch/powerpc/perf/Makefile                   |  2 +-
 arch/powerpc/perf/bhrb.S                     | 34 ++
 arch/powerpc/perf/core-book3s.c              | 96 +++-
 arch/powerpc/perf/perf_event_bhrb.c          | 75 ++
 arch/powerpc/perf/power8-pmu.c               | 57 -
 7 files changed, 272 insertions(+), 5 deletions(-)
 create mode 100644 arch/powerpc/perf/bhrb.S
 create mode 100644 arch/powerpc/perf/perf_event_bhrb.c

--
1.7.11.7

[PATCH V2 0/5] powerpc, perf: BHRB based branch stack enablement on POWER8
Branch History Rolling Buffer (BHRB) is a new PMU feature in the IBM POWER8 processor which records the branch instructions inside the execution pipeline. This patchset enables the basic functionality of the feature through the generic perf branch stack sampling framework.

Sample output
-------------
$ ./perf record -b top
$ ./perf report

 Overhead  Command  Source Shared Object  Source Symbol      Target Shared Object  Target Symbol
    7.82%  top      libc-2.11.2.so        [k] _IO_vfscanf    libc-2.11.2.so        [k] _IO_vfscanf
    6.17%  top      libc-2.11.2.so        [k] _IO_vfscanf    [unknown]             [k]
    2.37%  top      [unknown]             [k] 0xf7aafb30     [unknown]             [k]
    1.80%  top      [unknown]             [k] 0x0fe07978     libc-2.11.2.so        [k] _IO_vfscanf
    1.60%  top      libc-2.11.2.so        [k] _IO_vfscanf    [kernel.kallsyms]     [k] .do_task_stat
    1.20%  top      [kernel.kallsyms]     [k] .do_task_stat  [kernel.kallsyms]     [k] .do_task_stat
    1.02%  top      libc-2.11.2.so        [k] vfprintf       libc-2.11.2.so        [k] vfprintf
    0.92%  top      top                   [k] _init          [unknown]             [k] 0x0fe037f4

Changes in V2
-------------
- Added copyright messages to the newly created files
- Modified a couple of commit messages

Anshuman Khandual (5):
  powerpc, perf: Add new BHRB related instructions on POWER8
  powerpc, perf: Add basic assembly code to read BHRB entries on POWER8
  powerpc, perf: Add new BHRB related generic functions, data and flags
  powerpc, perf: Define BHRB generic functions, data and flags for POWER8
  powerpc, perf: Enable branch stack sampling framework support with BHRB

 arch/powerpc/include/asm/perf_event_server.h |  6 ++
 arch/powerpc/include/asm/ppc-opcode.h        |  7 ++
 arch/powerpc/perf/Makefile                   |  2 +-
 arch/powerpc/perf/bhrb.S                     | 44 +++++
 arch/powerpc/perf/core-book3s.c              | 96 +++-
 arch/powerpc/perf/perf_event_bhrb.c          | 85 ++++
 arch/powerpc/perf/power8-pmu.c               | 57 -
 7 files changed, 292 insertions(+), 5 deletions(-)
 create mode 100644 arch/powerpc/perf/bhrb.S
 create mode 100644 arch/powerpc/perf/perf_event_bhrb.c

--
1.7.11.7
[PATCH V2 5/5] powerpc, perf: Enable branch stack sampling framework support with BHRB
This patch provides basic enablement for perf branch stack sampling framework on POWER8 processor with a new PMU feature called BHRB (Branch History Rolling Buffer). Signed-off-by: Anshuman Khandual --- arch/powerpc/perf/core-book3s.c | 96 +++-- arch/powerpc/perf/perf_event_bhrb.c | 85 2 files changed, 178 insertions(+), 3 deletions(-) create mode 100644 arch/powerpc/perf/perf_event_bhrb.c diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index 4ac6e64..f4d1347 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -19,6 +19,8 @@ #include #include +#define BHRB_MAX_ENTRIES 32 + struct cpu_hw_events { int n_events; int n_percpu; @@ -38,11 +40,21 @@ struct cpu_hw_events { unsigned int group_flag; int n_txn_start; + + /* BHRB bits */ + u64 bhrb_filter;/* BHRB HW branch filter */ + int bhrb_users; + void*bhrb_context; + struct perf_branch_stack bhrb_stack; + struct perf_branch_entry bhrb_entries[BHRB_MAX_ENTRIES]; }; + DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events); struct power_pmu *ppmu; +#include "perf_event_bhrb.c" + /* * Normally, to ignore kernel events we set the FCS (freeze counters * in supervisor mode) bit in MMCR0, but if the kernel runs with the @@ -858,6 +870,9 @@ static void power_pmu_enable(struct pmu *pmu) } out: + if (cpuhw->bhrb_users) + ppmu->config_bhrb(cpuhw->bhrb_filter); + local_irq_restore(flags); } @@ -888,6 +903,47 @@ static int collect_events(struct perf_event *group, int max_count, return n; } +/* Reset all possible BHRB entries */ +static void power_pmu_bhrb_reset(void) +{ + asm volatile(PPC_CLRBHRB); +} + +void power_pmu_bhrb_enable(struct perf_event *event) +{ + struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events); + + if (!ppmu->bhrb_nr) + return; + + /* Clear BHRB if we changed task context to avoid data leaks */ + if (event->ctx->task && cpuhw->bhrb_context != event->ctx) { + power_pmu_bhrb_reset(); + cpuhw->bhrb_context = event->ctx; + } + cpuhw->bhrb_users++; +} 
+ +void power_pmu_bhrb_disable(struct perf_event *event) +{ + struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events); + + if (!ppmu->bhrb_nr) + return; + + cpuhw->bhrb_users--; + WARN_ON_ONCE(cpuhw->bhrb_users < 0); + + if (!cpuhw->disabled && !cpuhw->bhrb_users) { + /* BHRB cannot be turned off when other +* events are active on the PMU. +*/ + + /* avoid stale pointer */ + cpuhw->bhrb_context = NULL; + } +} + /* * Add a event to the PMU. * If all events are not already frozen, then we disable and @@ -947,6 +1003,9 @@ nocheck: ret = 0; out: + if (has_branch_stack(event)) + power_pmu_bhrb_enable(event); + perf_pmu_enable(event->pmu); local_irq_restore(flags); return ret; @@ -999,6 +1058,9 @@ static void power_pmu_del(struct perf_event *event, int ef_flags) cpuhw->mmcr[0] &= ~(MMCR0_PMXE | MMCR0_FCECE); } + if (has_branch_stack(event)) + power_pmu_bhrb_disable(event); + perf_pmu_enable(event->pmu); local_irq_restore(flags); } @@ -1117,6 +1179,15 @@ int power_pmu_commit_txn(struct pmu *pmu) return 0; } +/* Called from ctxsw to prevent one process's branch entries to + * mingle with the other process's entries during context switch. + */ +void power_pmu_flush_branch_stack(void) +{ + if (ppmu->bhrb_nr) + power_pmu_bhrb_reset(); +} + /* * Return 1 if we might be able to put event on a limited PMC, * or 0 if not. @@ -1231,9 +1302,11 @@ static int power_pmu_event_init(struct perf_event *event) if (!ppmu) return -ENOENT; - /* does not support taken branch sampling */ - if (has_branch_stack(event)) - return -EOPNOTSUPP; + if (has_branch_stack(event)) { + /* PMU has BHRB enabled */ + if (!(ppmu->flags & PPMU_BHRB)) + return -EOPNOTSUPP; + } switch (event->attr.type) { case PERF_TYPE_HARDWARE: @@ -1314,6 +1387,15 @@ static int power_pmu_event_init(struct perf_event *event) cpuhw = &get_cpu_var(cpu_hw_events); err = power_check_constraints(cpuhw, events, cflags, n + 1); + + if (has_branch_stack(event)) { + cpuhw->bhrb_filter = ppmu->bhrb_filter_map( +
[PATCH V2 1/5] powerpc, perf: Add new BHRB related instructions on POWER8
This patch adds new instructions support for reading various BHRB entries and also clearing the BHRB buffer. Signed-off-by: Anshuman Khandual --- arch/powerpc/include/asm/ppc-opcode.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h index 8752bc8..93ae5a1 100644 --- a/arch/powerpc/include/asm/ppc-opcode.h +++ b/arch/powerpc/include/asm/ppc-opcode.h @@ -82,6 +82,7 @@ #define__REGA0_R31 31 /* sorted alphabetically */ +#define PPC_INST_BHRBE 0x7c00025c #define PPC_INST_DCBA 0x7c0005ec #define PPC_INST_DCBA_MASK 0xfc0007fe #define PPC_INST_DCBAL 0x7c2005ec @@ -297,6 +298,12 @@ #define PPC_NAPstringify_in_c(.long PPC_INST_NAP) #define PPC_SLEEP stringify_in_c(.long PPC_INST_SLEEP) +/* BHRB instructions */ +#define PPC_CLRBHRBstringify_in_c(.long 0x7c00035c) +#define PPC_MFBHRBE(r, n) stringify_in_c(.long PPC_INST_BHRBE | \ + __PPC_RS(r) | \ + (((n) & 0x1f) << 11)) + /* Transactional memory instructions */ #define TRECHKPT stringify_in_c(.long PPC_INST_TRECHKPT) #define TRECLAIM(r)stringify_in_c(.long PPC_INST_TRECLAIM \ -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 4/5] powerpc, perf: Define BHRB generic functions, data and flags for POWER8
Defines BHRB functions, data and flags for POWER8 Signed-off-by: Anshuman Khandual --- arch/powerpc/perf/power8-pmu.c | 57 +- 1 file changed, 56 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c index 106ae0b..153408c 100644 --- a/arch/powerpc/perf/power8-pmu.c +++ b/arch/powerpc/perf/power8-pmu.c @@ -109,6 +109,16 @@ #define EVENT_IS_MARKED(EVENT_MARKED_MASK << EVENT_MARKED_SHIFT) #define EVENT_PSEL_MASK0xff/* PMCxSEL value */ +/* MMCRA IFM bits - POWER8 */ +#definePOWER8_MMCRA_IFM1 0x4000UL +#definePOWER8_MMCRA_IFM2 0x8000UL +#definePOWER8_MMCRA_IFM3 0xC000UL + +#define ONLY_PLM \ + (PERF_SAMPLE_BRANCH_USER|\ +PERF_SAMPLE_BRANCH_KERNEL |\ +PERF_SAMPLE_BRANCH_HV) + /* * Layout of constraint bits: * @@ -428,6 +438,48 @@ static int power8_generic_events[] = { [PERF_COUNT_HW_BRANCH_MISSES] = PM_BR_MPRED_CMPL, }; +static u64 power8_bhrb_filter_map(u64 branch_sample_type) +{ + u64 pmu_bhrb_filter = 0; + u64 br_privilege = branch_sample_type & ONLY_PLM; + + /* BHRB and regular PMU events share the same prvillege state +* filter configuration. BHRB is always recorded along with a +* regular PMU event. So privilege state filter criteria for BHRB +* and the companion PMU events has to be the same. As a default +* "perf record" tool sets all privillege bits ON when no filter +* criteria is provided in the command line. So as along as all +* privillege bits are ON or they are OFF, we are good to go. 
+*/ + if ((br_privilege != 7) && (br_privilege != 0)) + return -1; + + /* No branch filter requested */ + if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY) + return pmu_bhrb_filter; + + /* Invalid branch filter options - HW does not support */ + if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_RETURN) + return -1; + + if (branch_sample_type & PERF_SAMPLE_BRANCH_IND_CALL) + return -1; + + if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) { + pmu_bhrb_filter |= POWER8_MMCRA_IFM1; + return pmu_bhrb_filter; + } + + /* Every thing else is unsupported */ + return -1; +} + +static void power8_config_bhrb(u64 pmu_bhrb_filter) +{ + /* Enable BHRB filter in PMU */ + mtspr(SPRN_MMCRA, (mfspr(SPRN_MMCRA) | pmu_bhrb_filter)); +} + static struct power_pmu power8_pmu = { .name = "POWER8", .n_counter = 6, @@ -435,12 +487,15 @@ static struct power_pmu power8_pmu = { .add_fields = POWER8_ADD_FIELDS, .test_adder = POWER8_TEST_ADDER, .compute_mmcr = power8_compute_mmcr, + .config_bhrb= power8_config_bhrb, + .bhrb_filter_map= power8_bhrb_filter_map, .get_constraint = power8_get_constraint, .disable_pmc= power8_disable_pmc, - .flags = PPMU_HAS_SSLOT | PPMU_HAS_SIER, + .flags = PPMU_HAS_SSLOT | PPMU_HAS_SIER | PPMU_BHRB, .n_generic = ARRAY_SIZE(power8_generic_events), .generic_events = power8_generic_events, .attr_groups= power8_pmu_attr_groups, + .bhrb_nr= 32, }; static int __init init_power8_pmu(void) -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 2/5] powerpc, perf: Add basic assembly code to read BHRB entries on POWER8
This patch adds the basic assembly code to read BHRB entries Signed-off-by: Anshuman Khandual --- arch/powerpc/perf/Makefile | 2 +- arch/powerpc/perf/bhrb.S | 44 2 files changed, 45 insertions(+), 1 deletion(-) create mode 100644 arch/powerpc/perf/bhrb.S diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile index 472db18..510fae1 100644 --- a/arch/powerpc/perf/Makefile +++ b/arch/powerpc/perf/Makefile @@ -2,7 +2,7 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror obj-$(CONFIG_PERF_EVENTS) += callchain.o -obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o +obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o bhrb.o obj64-$(CONFIG_PPC_PERF_CTRS) += power4-pmu.o ppc970-pmu.o power5-pmu.o \ power5+-pmu.o power6-pmu.o power7-pmu.o \ power8-pmu.o diff --git a/arch/powerpc/perf/bhrb.S b/arch/powerpc/perf/bhrb.S new file mode 100644 index 000..0e8a018 --- /dev/null +++ b/arch/powerpc/perf/bhrb.S @@ -0,0 +1,44 @@ +/* + * Basic assembly code to read BHRB entries + * + * Copyright 2013 Anshuman Khandual, IBM Corporation. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#include +#include + + .text + +.balign 8 + +/* r3 = n (where n = [0-1023]) + * The maximum number of BHRB entries supported with PPC_MFBHRBE instruction + * is 1024. We have limited number of table entries here as POWER8 implements + * 32 BHRB entries. 
+ */ + +/* .global read_bhrb */ +_GLOBAL(read_bhrb) + cmpldi r3,1023 + bgt 1f + ld r4,bhrb_table@got(r2) + sldir3,r3,3 + add r3,r4,r3 + mtctr r3 + bctr +1: li r3,0 + blr + +#define MFBHRB_TABLE1(n) PPC_MFBHRBE(R3,n); blr +#define MFBHRB_TABLE2(n) MFBHRB_TABLE1(n); MFBHRB_TABLE1(n+1) +#define MFBHRB_TABLE4(n) MFBHRB_TABLE2(n); MFBHRB_TABLE2(n+2) +#define MFBHRB_TABLE8(n) MFBHRB_TABLE4(n); MFBHRB_TABLE4(n+4) +#define MFBHRB_TABLE16(n) MFBHRB_TABLE8(n); MFBHRB_TABLE8(n+8) +#define MFBHRB_TABLE32(n) MFBHRB_TABLE16(n); MFBHRB_TABLE16(n+16) + +bhrb_table: + MFBHRB_TABLE32(0) -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 3/5] powerpc, perf: Add new BHRB related generic functions, data and flags
This patch adds some new BHRB related generic functions, data and flags Signed-off-by: Anshuman Khandual --- arch/powerpc/include/asm/perf_event_server.h | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h index 57b42da..3f0c15c 100644 --- a/arch/powerpc/include/asm/perf_event_server.h +++ b/arch/powerpc/include/asm/perf_event_server.h @@ -33,6 +33,8 @@ struct power_pmu { unsigned long *valp); int (*get_alternatives)(u64 event_id, unsigned int flags, u64 alt[]); + u64 (*bhrb_filter_map)(u64 branch_sample_type); + void(*config_bhrb)(u64 pmu_bhrb_filter); void(*disable_pmc)(unsigned int pmc, unsigned long mmcr[]); int (*limited_pmc_event)(u64 event_id); u32 flags; @@ -42,6 +44,9 @@ struct power_pmu { int (*cache_events)[PERF_COUNT_HW_CACHE_MAX] [PERF_COUNT_HW_CACHE_OP_MAX] [PERF_COUNT_HW_CACHE_RESULT_MAX]; + + /* BHRB entries in the PMU */ + int bhrb_nr; }; /* @@ -54,6 +59,7 @@ struct power_pmu { #define PPMU_SIAR_VALID0x0010 /* Processor has SIAR Valid bit */ #define PPMU_HAS_SSLOT 0x0020 /* Has sampled slot in MMCRA */ #define PPMU_HAS_SIER 0x0040 /* Has SIER */ +#define PPMU_BHRB 0x0080 /* has BHRB feature enabled */ /* * Values for flags to get_alternatives() -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] powerpc, perf: Ignore separate BHRB privilege state filter request
Completely ignore BHRB privilege state filter request as we are already configuring MMCRA register with privilege state filtering attribute for the accompanying PMU event. This would help achieve cleaner user space interaction for BHRB. Signed-off-by: Anshuman Khandual --- arch/powerpc/perf/power8-pmu.c | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c index f7d1c4f..8ed323d 100644 --- a/arch/powerpc/perf/power8-pmu.c +++ b/arch/powerpc/perf/power8-pmu.c @@ -525,16 +525,17 @@ static u64 power8_bhrb_filter_map(u64 branch_sample_type) u64 pmu_bhrb_filter = 0; u64 br_privilege = branch_sample_type & ONLY_PLM; - /* BHRB and regular PMU events share the same prvillege state + /* BHRB and regular PMU events share the same prvilege state * filter configuration. BHRB is always recorded along with a -* regular PMU event. So privilege state filter criteria for BHRB -* and the companion PMU events has to be the same. As a default -* "perf record" tool sets all privillege bits ON when no filter -* criteria is provided in the command line. So as along as all -* privillege bits are ON or they are OFF, we are good to go. +* regular PMU event. So privilege state filter criteria for +* the BHRB and the companion PMU events has to be the same. +* Separate BHRB privillege state filter requests would be +* ignored. */ - if ((br_privilege != 7) && (br_privilege != 0)) - return -1; + + if (br_privilege) + pr_info("BHRB privilege state filter request %llx ignored\n", + br_privilege); /* No branch filter requested */ if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY) -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/2] Improvement and fixes for BHRB
(1) The first patch fixes a situation like this

Before patch:-

	./perf record -j any -e branch-misses:k ls
	Error:
	The sys_perf_event_open() syscall returned with 95 (Operation not supported) for event (branch-misses:k).
	/bin/dmesg may provide additional information.
	No CONFIG_PERF_EVENTS=y kernel support configured?

Here 'perf record' actually copies the ':k' filter request over into the BHRB privilege state filter config, and our previous check in the kernel would fail that.

After patch:-

	./perf record -j any -e branch-misses:k ls
	perf  perf.data  perf.data.old  test-mmap-ring
	[ perf record: Woken up 1 times to write data ]
	[ perf record: Captured and wrote 0.002 MB perf.data (~102 samples) ]

(2) The second patch fixes context migration for the BHRB filter configuration

Anshuman Khandual (2):
  powerpc, perf: Ignore separate BHRB privilege state filter request
  powerpc, perf: BHRB filter configuration should follow the task

 arch/powerpc/perf/core-book3s.c | 5 -
 arch/powerpc/perf/power8-pmu.c  | 17 +
 2 files changed, 13 insertions(+), 9 deletions(-)

--
1.7.11.7
[PATCH 2/2] powerpc, perf: BHRB filter configuration should follow the task
When the task moves around the system, the corresponding cpuhw per-cpu structure should be populated with the BHRB filter request value so that the PMU can be configured appropriately during the next call into power_pmu_enable().

Signed-off-by: Anshuman Khandual
---
 arch/powerpc/perf/core-book3s.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 426180b..48c68a8 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1122,8 +1122,11 @@
 nocheck:
 	ret = 0;
  out:
-	if (has_branch_stack(event))
+	if (has_branch_stack(event)) {
 		power_pmu_bhrb_enable(event);
+		cpuhw->bhrb_filter = ppmu->bhrb_filter_map(
+			event->attr.branch_sample_type);
+	}
 
 	perf_pmu_enable(event->pmu);
 	local_irq_restore(flags);
--
1.7.11.7
[PATCH 1/5] perf: New conditional branch filter criteria in branch stack sampling
POWER8 PMU based BHRB supports filtering for conditional branches. This patch introduces new branch filter PERF_SAMPLE_BRANCH_COND which will extend the existing perf ABI. Other architectures can provide this functionality with either HW filtering support (if present) or with SW filtering of instructions. Signed-off-by: Anshuman Khandual --- include/uapi/linux/perf_event.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index fb104e5..cb0de86 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -157,8 +157,9 @@ enum perf_branch_sample_type { PERF_SAMPLE_BRANCH_ANY_CALL = 1U << 4, /* any call branch */ PERF_SAMPLE_BRANCH_ANY_RETURN = 1U << 5, /* any return branch */ PERF_SAMPLE_BRANCH_IND_CALL = 1U << 6, /* indirect calls */ + PERF_SAMPLE_BRANCH_COND = 1U << 7, /* conditional branches */ - PERF_SAMPLE_BRANCH_MAX = 1U << 7, /* non-ABI */ + PERF_SAMPLE_BRANCH_MAX = 1U << 8, /* non-ABI */ }; #define PERF_SAMPLE_BRANCH_PLM_ALL \ -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 5/5] perf, documentation: Description for conditional branch filter
Adding documentation support for conditional branch filter. Signed-off-by: Anshuman Khandual --- tools/perf/Documentation/perf-record.txt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt index d4da111..8b5e1ed 100644 --- a/tools/perf/Documentation/perf-record.txt +++ b/tools/perf/Documentation/perf-record.txt @@ -169,12 +169,13 @@ following filters are defined: - any_call: any function call or system call - any_ret: any function return or system call return - ind_call: any indirect branch +- cond: conditional branches - u: only when the branch target is at the user level - k: only when the branch target is in the kernel - hv: only when the target is at the hypervisor level + -The option requires at least one branch type among any, any_call, any_ret, ind_call. +The option requires at least one branch type among any, any_call, any_ret, ind_call, cond. The privilege levels may be omitted, in which case, the privilege levels of the associated event are applied to the branch filter. Both kernel (k) and hypervisor (hv) privilege levels are subject to permissions. When sampling on multiple events, branch stack sampling -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/5] powerpc, perf: Enable conditional branch filter for POWER8
Enables conditional branch filter support for POWER8 utilizing MMCRA register based filter and also invalidates a BHRB branch filter combination involving conditional branches. Signed-off-by: Anshuman Khandual --- arch/powerpc/perf/power8-pmu.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c index 8ed323d..e60b38f 100644 --- a/arch/powerpc/perf/power8-pmu.c +++ b/arch/powerpc/perf/power8-pmu.c @@ -548,11 +548,21 @@ static u64 power8_bhrb_filter_map(u64 branch_sample_type) if (branch_sample_type & PERF_SAMPLE_BRANCH_IND_CALL) return -1; + /* Invalid branch filter combination - HW does not support */ + if ((branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) && + (branch_sample_type & PERF_SAMPLE_BRANCH_COND)) + return -1; + if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) { pmu_bhrb_filter |= POWER8_MMCRA_IFM1; return pmu_bhrb_filter; } + if (branch_sample_type & PERF_SAMPLE_BRANCH_COND) { + pmu_bhrb_filter |= POWER8_MMCRA_IFM3; + return pmu_bhrb_filter; + } + /* Every thing else is unsupported */ return -1; } -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/5] x86, perf: Add conditional branch filtering support
From: Peter Zijlstra This patch adds conditional branch filtering support, enabling it for PERF_SAMPLE_BRANCH_COND in perf branch stack sampling framework by utilizing an available software filter X86_BR_JCC. Signed-off-by: Peter Zijlstra Signed-off-by: Anshuman Khandual --- arch/x86/kernel/cpu/perf_event_intel_lbr.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c index d978353..a0d6387 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c @@ -337,6 +337,10 @@ static int intel_pmu_setup_sw_lbr_filter(struct perf_event *event) if (br_type & PERF_SAMPLE_BRANCH_IND_CALL) mask |= X86_BR_IND_CALL; + + if (br_type & PERF_SAMPLE_BRANCH_COND) + mask |= X86_BR_JCC; + /* * stash actual user request into reg, it may * be used by fixup code for some CPU @@ -626,6 +630,7 @@ static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = { * NHM/WSM erratum: must include IND_JMP to capture IND_CALL */ [PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP, + [PERF_SAMPLE_BRANCH_COND] = LBR_JCC, }; static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = { @@ -637,6 +642,7 @@ static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = { [PERF_SAMPLE_BRANCH_ANY_CALL] = LBR_REL_CALL | LBR_IND_CALL | LBR_FAR, [PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL, + [PERF_SAMPLE_BRANCH_COND] = LBR_JCC, }; /* core */ -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/5] perf, tool: Conditional branch filter 'cond' added to perf record
Adding perf record support for new branch stack filter criteria PERF_SAMPLE_BRANCH_COND. Signed-off-by: Anshuman Khandual --- tools/perf/builtin-record.c | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c index cdf58ec..833743a 100644 --- a/tools/perf/builtin-record.c +++ b/tools/perf/builtin-record.c @@ -676,6 +676,7 @@ static const struct branch_mode branch_modes[] = { BRANCH_OPT("any_call", PERF_SAMPLE_BRANCH_ANY_CALL), BRANCH_OPT("any_ret", PERF_SAMPLE_BRANCH_ANY_RETURN), BRANCH_OPT("ind_call", PERF_SAMPLE_BRANCH_IND_CALL), + BRANCH_OPT("cond", PERF_SAMPLE_BRANCH_CONDITIONAL), BRANCH_END }; -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/5] perf: Introducing new conditional branch filter
This patchset introduces a conditional branch filter in the perf branch stack sampling framework, incorporating review comments from Michael Neuling, Peter Zijlstra and Stephane Eranian. Anshuman Khandual (5): perf: New conditional branch filter criteria in branch stack sampling powerpc, perf: Enable conditional branch filter for POWER8 perf, tool: Conditional branch filter 'cond' added to perf record x86, perf: Add conditional branch filtering support perf, documentation: Description for conditional branch filter arch/powerpc/perf/power8-pmu.c | 10 ++ arch/x86/kernel/cpu/perf_event_intel_lbr.c | 6 ++ include/uapi/linux/perf_event.h| 3 ++- tools/perf/Documentation/perf-record.txt | 3 ++- tools/perf/builtin-record.c| 1 + 5 files changed, 21 insertions(+), 2 deletions(-) -- 1.7.11.7
Re: [PATCH 3/3] perf, x86, lbr: Demand proper privileges for PERF_SAMPLE_BRANCH_KERNEL
On 05/21/2013 07:25 PM, Stephane Eranian wrote: > On Thu, May 16, 2013 at 12:15 PM, Michael Neuling wrote: >> Peter Zijlstra wrote: >> >>> On Wed, May 15, 2013 at 03:37:22PM +0200, Stephane Eranian wrote: On Fri, May 3, 2013 at 2:11 PM, Peter Zijlstra wrote: > We should always have proper privileges when requesting kernel data. > > Cc: Andi Kleen > Cc: eran...@google.com > Signed-off-by: Peter Zijlstra > Link: http://lkml.kernel.org/n/tip-v0x9ky3ahzr6nm3c6ilwr...@git.kernel.org > --- > arch/x86/kernel/cpu/perf_event_intel_lbr.c |5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c > +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c > @@ -318,8 +318,11 @@ static void intel_pmu_setup_sw_lbr_filte > if (br_type & PERF_SAMPLE_BRANCH_USER) > mask |= X86_BR_USER; > > - if (br_type & PERF_SAMPLE_BRANCH_KERNEL) > + if (br_type & PERF_SAMPLE_BRANCH_KERNEL) { > + if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN)) > + return -EACCES; > mask |= X86_BR_KERNEL; > + } > This will prevent regular users from capturing kernel -> kernel branches. But it won't prevent users from getting kernel -> user branches. Thus some kernel address will still be captured. I guess they could be eliminated by the sw_filter. When using LBR priv level filtering, the filter applies to the branch target only. >>> >>> How about something like the below? It also adds the branch flags >>> Mikey wanted for PowerPC. >> >> Peter, >> >> BTW PowerPC also has the ability to filter on conditional branches. Any >> chance we could add something like the follow to perf also? 
>> >> Mikey >> >> diff --git a/include/uapi/linux/perf_event.h >> b/include/uapi/linux/perf_event.h >> index fb104e5..891c769 100644 >> --- a/include/uapi/linux/perf_event.h >> +++ b/include/uapi/linux/perf_event.h >> @@ -157,8 +157,9 @@ enum perf_branch_sample_type { >> PERF_SAMPLE_BRANCH_ANY_CALL = 1U << 4, /* any call branch */ >> PERF_SAMPLE_BRANCH_ANY_RETURN = 1U << 5, /* any return branch */ >> PERF_SAMPLE_BRANCH_IND_CALL = 1U << 6, /* indirect calls */ >> + PERF_SAMPLE_BRANCH_CONDITIONAL = 1U << 7, /* conditional branches */ >> > I would use PERF_SAMPLE_BRANCH_COND here. > >> - PERF_SAMPLE_BRANCH_MAX = 1U << 7, /* non-ABI */ >> + PERF_SAMPLE_BRANCH_MAX = 1U << 8, /* non-ABI */ >> }; >> >> #define PERF_SAMPLE_BRANCH_PLM_ALL \ >> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c >> index cdf58ec..5b0b89d 100644 >> --- a/tools/perf/builtin-record.c >> +++ b/tools/perf/builtin-record.c >> @@ -676,6 +676,7 @@ static const struct branch_mode branch_modes[] = { >> BRANCH_OPT("any_call", PERF_SAMPLE_BRANCH_ANY_CALL), >> BRANCH_OPT("any_ret", PERF_SAMPLE_BRANCH_ANY_RETURN), >> BRANCH_OPT("ind_call", PERF_SAMPLE_BRANCH_IND_CALL), >> + BRANCH_OPT("cnd", PERF_SAMPLE_BRANCH_CONDITIONAL), > > use "cond" > >> BRANCH_END >> }; >> > > And if you do this, you also need to update the x86 > perf_event_intel_lbr.c mapping > tables to fill out the entries for PERF_SAMPLE_BRANCH_COND: > > [PERF_SAMPLE_BRANCH_COND] = LBR_JCC, > > And you also need to update intel_pmu_setup_sw_lbr_filter() > to handle the conversion to x86 instructions: > >if (br_type & PERF_SAMPLE_BRANCH_COND) > mask |= X86_BR_JCC; > > > You also need to update the perf-record.txt documentation to list cond > as a possible > branch filter. Hey Stephane, I have incorporated all the review comments into the patch series https://lkml.org/lkml/2013/5/22/51. 
Regards Anshuman
Re: [PATCH 1/2] powerpc, perf: Ignore separate BHRB privilege state filter request
> > Your description from patch 0 should be here. > Sure, will bring it here. > >> diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c >> index f7d1c4f..8ed323d 100644 >> --- a/arch/powerpc/perf/power8-pmu.c >> +++ b/arch/powerpc/perf/power8-pmu.c >> @@ -525,16 +525,17 @@ static u64 power8_bhrb_filter_map(u64 >> branch_sample_type) >> u64 pmu_bhrb_filter = 0; >> u64 br_privilege = branch_sample_type & ONLY_PLM; >> >> -/* BHRB and regular PMU events share the same prvillege state >> +/* BHRB and regular PMU events share the same prvilege state > > Please spell "privilege" correctly. > My bad, will fix it. >> * filter configuration. BHRB is always recorded along with > It still says "privilege state filter criteria for the BHRB and the > companion PMU events has to be the same". > > But they don't, right? > Right > What it should say is "we ignore the privilege bits in the branch sample > type because they are handled by the underlying PMC configuration" - or > something like that. Here is the latest description for the code block: /* BHRB and regular PMU events share the same privilege state * filter configuration. BHRB is always recorded along with a * regular PMU event. As the privilege state filter is handled * in the basic PMC configuration of the accompanying regular * PMU event, we ignore any separate BHRB specific request. */ Does it sound better? > >> -if ((br_privilege != 7) && (br_privilege != 0)) >> -return -1; >> + >> +if (br_privilege) >> +pr_info("BHRB privilege state filter request %llx ignored\n", >> +br_privilege); > > Don't do that. Ignoring the br_privilege is either the right thing to do > in which case we do it and print nothing, or it doesn't make sense and > we reject it. I thought the informational print would at least make the user aware of the fact that the separate filter request for BHRB went ignored. Can we add this somewhere in the documentation?
> > cheers
Re: [PATCH 3/3] perf, x86, lbr: Demand proper privileges for PERF_SAMPLE_BRANCH_KERNEL
On 05/22/2013 05:53 PM, Stephane Eranian wrote: > Hi, > > > > On Wed, May 22, 2013 at 8:43 AM, Anshuman Khandual > wrote: >> On 05/21/2013 07:25 PM, Stephane Eranian wrote: >>> On Thu, May 16, 2013 at 12:15 PM, Michael Neuling wrote: >>>> Peter Zijlstra wrote: >>>> >>>>> On Wed, May 15, 2013 at 03:37:22PM +0200, Stephane Eranian wrote: >>>>>> On Fri, May 3, 2013 at 2:11 PM, Peter Zijlstra >>>>>> wrote: >>>>>>> We should always have proper privileges when requesting kernel data. >>>>>>> >>>>>>> Cc: Andi Kleen >>>>>>> Cc: eran...@google.com >>>>>>> Signed-off-by: Peter Zijlstra >>>>>>> Link: >>>>>>> http://lkml.kernel.org/n/tip-v0x9ky3ahzr6nm3c6ilwr...@git.kernel.org >>>>>>> --- >>>>>>> arch/x86/kernel/cpu/perf_event_intel_lbr.c |5 - >>>>>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>>>>> >>>>>>> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c >>>>>>> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c >>>>>>> @@ -318,8 +318,11 @@ static void intel_pmu_setup_sw_lbr_filte >>>>>>> if (br_type & PERF_SAMPLE_BRANCH_USER) >>>>>>> mask |= X86_BR_USER; >>>>>>> >>>>>>> - if (br_type & PERF_SAMPLE_BRANCH_KERNEL) >>>>>>> + if (br_type & PERF_SAMPLE_BRANCH_KERNEL) { >>>>>>> + if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN)) >>>>>>> + return -EACCES; >>>>>>> mask |= X86_BR_KERNEL; >>>>>>> + } >>>>>>> >>>>>> This will prevent regular users from capturing kernel -> kernel branches. >>>>>> But it won't prevent users from getting kernel -> user branches. Thus >>>>>> some kernel address will still be captured. I guess they could be >>>>>> eliminated >>>>>> by the sw_filter. >>>>>> >>>>>> When using LBR priv level filtering, the filter applies to the branch >>>>>> target >>>>>> only. >>>>> >>>>> How about something like the below? It also adds the branch flags >>>>> Mikey wanted for PowerPC. >>>> >>>> Peter, >>>> >>>> BTW PowerPC also has the ability to filter on conditional branches. Any >>>> chance we could add something like the follow to perf also? 
>>>> >>>> Mikey >>>> >>>> diff --git a/include/uapi/linux/perf_event.h >>>> b/include/uapi/linux/perf_event.h >>>> index fb104e5..891c769 100644 >>>> --- a/include/uapi/linux/perf_event.h >>>> +++ b/include/uapi/linux/perf_event.h >>>> @@ -157,8 +157,9 @@ enum perf_branch_sample_type { >>>> PERF_SAMPLE_BRANCH_ANY_CALL = 1U << 4, /* any call branch */ >>>> PERF_SAMPLE_BRANCH_ANY_RETURN = 1U << 5, /* any return branch */ >>>> PERF_SAMPLE_BRANCH_IND_CALL = 1U << 6, /* indirect calls */ >>>> + PERF_SAMPLE_BRANCH_CONDITIONAL = 1U << 7, /* conditional branches >>>> */ >>>> >>> I would use PERF_SAMPLE_BRANCH_COND here. >>> >>>> - PERF_SAMPLE_BRANCH_MAX = 1U << 7, /* non-ABI */ >>>> + PERF_SAMPLE_BRANCH_MAX = 1U << 8, /* non-ABI */ >>>> }; >>>> >>>> #define PERF_SAMPLE_BRANCH_PLM_ALL \ >>>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c >>>> index cdf58ec..5b0b89d 100644 >>>> --- a/tools/perf/builtin-record.c >>>> +++ b/tools/perf/builtin-record.c >>>> @@ -676,6 +676,7 @@ static const struct branch_mode branch_modes[] = { >>>> BRANCH_OPT("any_call", PERF_SAMPLE_BRANCH_ANY_CALL), >>>> BRANCH_OPT("any_ret", PERF_SAMPLE_BRANCH_ANY_RETURN), >>>> BRANCH_OPT("ind_call", PERF_SAMPLE_BRANCH_IND_CALL), >>>> + BRANCH_OPT("cnd", PERF_SAMPLE_BRANCH_CONDITIONAL), >>> >>> use "cond" >>> >>>> BRANCH_END >>>> }; >>>> >>> >>> And if you do this, you also need to update the x86 >>> perf_event_intel_lbr.c mapping >>> tables to fill out the entries for PERF_SAMPLE_BRANCH_COND: >>> >>> [PERF_SAMPLE_BRANCH_COND] = LBR_JCC, >>> >>> And you also need to update intel_pmu_setup_sw_lbr_filter() >>> to handle the conversion to x86 instructions: >>> >>>if (br_type & PERF_SAMPLE_BRANCH_COND) >>> mask |= X86_BR_JCC; >>> >>> >>> You also need to update the perf-record.txt documentation to list cond >>> as a possible >>> branch filter. >> >> Hey Stephane, >> >> I have incorporated all the review comments into the patch series >> https://lkml.org/lkml/2013/5/22/51. 
>> > I don't see how you can compile Patch 3/5: > > + BRANCH_OPT("cond", PERF_SAMPLE_BRANCH_CONDITIONAL), > > Needs to be PERF_SAMPLE_BRANCH_COND. > Ahh, sorry missed it, will fix it. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] powerpc, perf: Ignore separate BHRB privilege state filter request
On 05/22/2013 02:29 PM, Anshuman Khandual wrote: >> >> Your description from patch 0 should be here. > Does it sound better ? > >> >>> - if ((br_privilege != 7) && (br_privilege != 0)) >>> - return -1; >>> + >>> + if (br_privilege) >>> + pr_info("BHRB privilege state filter request %llx ignored\n", >>> + br_privilege); >> >> Don't do that. Ignoring the br_privilege is either the right thing to do >> in which case we do it and print nothing, > > > I thought the informational print would at least make the user aware > of the fact that the separate filter request for BHRB went ignored. > Can we add this somewhere in the documentation ? So, what do we decide here? Will we just ignore any separate BHRB privilege state filter request without printing any informational message or warning?
Re: how to implement platform specific per process parameter?
> There are 4 options: > 1. [not a kernel interface] use ptrace to execute the register changing > command inside the specified pid. The next context switch saves the new > value in the thread_struct. Dirty hack. > > 2. Add a new syscall which would receive pid + register value and do the > job. A bit too much. > > 3. Add some hook in /proc filesystem but so far there were no platform > specific bits, right? > > 4. Implement a static node "/sys/devices/system/cpu/dscr_control". > write() would parse the input stream, call scanf("%d %x", &pid, &dscr) > and do the job. > The /sys/ interface would be appropriate, I believe. This way we can take a new (pid, dscr) pair and update the thread_struct. But there should also be a way to list all the (pid, dscr) values which are explicitly set by the user and differ from /sys/devices/system/cpu/dscr_default, so that we can know which process is holding what value of DSCR at any point in time. > What is the correct approach? Thanks. > > Regards Anshuman
Re: [RFC][PATCH] perf: Add a few generic stalled-cycles events
On 10/12/2012 06:58 AM, Sukadev Bhattiprolu wrote: > > From 89cb6a25b9f714e55a379467a832ee015014ed11 Mon Sep 17 00:00:00 2001 > From: Sukadev Bhattiprolu > Date: Tue, 18 Sep 2012 10:59:01 -0700 > Subject: [PATCH] perf: Add a few generic stalled-cycles events > > The existing generic event 'stalled-cycles-backend' corresponds to > PM_CMPLU_STALL event in Power7. While this event is useful, detailed > performance analysis often requires us to find more specific reasons > for the stalled cycle. For instance, stalled cycles in Power7 can > occur due to, among others: > > - instruction fetch unit (IFU), > - Load-store-unit (LSU), > - Fixed point unit (FXU) > - Branch unit (BRU) > > While it is possible to use raw codes to monitor these events, it quickly > becomes cumbersome with performance analysis frequently requiring mapping > the raw event codes in reports to their symbolic names. > > This patch is a proposal to try and generalize such perf events. Since > the code changes are quite simple, I bunched all the 4 events together. > > I am not familiar with how readily these events would map to other > architectures. Here is some information on the events for Power7: > > stalled-cycles-fixed-point (PM_CMPLU_STALL_FXU) > > Following a completion stall, the last instruction to finish > before completion resumes was from the Fixed Point Unit. > > Completion stall is any period when no groups completed and > the completion table was not empty for that thread. > > stalled-cycles-load-store (PM_CMPLU_STALL_LSU) > > Following a completion stall, the last instruction to finish > before completion resumes was from the Load-Store Unit. > > stalled-cycles-instruction-fetch (PM_CMPLU_STALL_IFU) > > Following a completion stall, the last instruction to finish > before completion resumes was from the Instruction Fetch Unit. 
> > stalled-cycles-branch (PM_CMPLU_STALL_BRU) > > Following a completion stall, the last instruction to finish > before completion resumes was from the Branch Unit. > > Looking for feedback on this approach and if this can be further extended. > Power7 has 530 events[2] out of which a "CPI stack analysis"[1] uses about 26 > events. > > > [1] CPI Stack analysis > > https://www.power.org/documentation/commonly-used-metrics-for-performance-analysis > > [2] Power7 events: > > https://www.power.org/documentation/comprehensive-pmu-event-reference-power7/ Here we should try to come up with a generic list of places in the processor where the cycles can stall. PERF_COUNT_HW_STALLED_CYCLES_FIXED_POINT PERF_COUNT_HW_STALLED_CYCLES_LOAD_STORE PERF_COUNT_HW_STALLED_CYCLES_INSTRUCTION_FETCH PERF_COUNT_HW_STALLED_CYCLES_BRANCH PERF_COUNT_HW_STALLED_CYCLES_ PERF_COUNT_HW_STALLED_CYCLES_ PERF_COUNT_HW_STALLED_CYCLES_ --- This generic list can be a superset which can accommodate all the architecture giving the flexibility to implement selectively there after. Stall locations are very important from CPI analysis stand point with real world use cases. This will definitely help us in that direction. Regards Anshuman -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH] perf: Add a few generic stalled-cycles events
On 10/15/2012 10:53 PM, Arun Sharma wrote: > On 10/15/12 8:55 AM, Robert Richter wrote: > > [..] >> Perf tool works then out-of-the-box with: >> >> $ perf record -e cpu/stalled-cycles-fixed-point/ ... >> >> The event string can easily be reused by other architectures as a >> quasi standard. > > I like Robert's proposal better. It's hard to model all the stall events > (eg: instruction decoder related stalls on x86) in a hardware > independent way. > > Another area to think about: software engineers are generally busy and > have a limited amount of time to devote to hardware event based > optimizations. The most common question I hear is: what is the expected > perf gain if I fix this? It's hard to answer that with just the stall > events. > Hardware event based optimization is a very important aspect of real world application tuning. CPI stack analysis is a good reason why perf should have stall events as generic ones. But I am not clear on situations where we consider adding these new generic events into linux/perf_event.h and the situations where we should go with the sys fs interface. Could you please elaborate on this ? Regards Anshuman -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V3] mm: Add sysfs interface to dump each node's zonelist information
On 09/07/2016 08:38 AM, Kees Cook wrote: > On Tue, Sep 6, 2016 at 1:36 PM, Dave Hansen wrote: >> On 09/06/2016 01:31 AM, Anshuman Khandual wrote: >>> [NODE (0)] >>> ZONELIST_FALLBACK >>> (0) (node 0) (zone DMA c140c000) >>> (1) (node 1) (zone DMA c001) >>> (2) (node 2) (zone DMA c002) >>> (3) (node 3) (zone DMA c003) >>> ZONELIST_NOFALLBACK >>> (0) (node 0) (zone DMA c140c000) >> >> Don't we have some prohibition on dumping out kernel addresses like this >> so that attackers can't trivially defeat kernel layout randomization? > > Anything printing memory addresses should be using %pK (not %lx as done here). I learned about the significance of %pK coupled with the kptr_restrict interface. Will change this. Thanks for pointing it out.
Re: [PATCH 5/5] mm: cleanup pfn_t usage in track_pfn_insert()
On 09/06/2016 10:19 PM, Dan Williams wrote: > Now that track_pfn_insert() is no longer used in the DAX path, it no > longer needs to comprehend pfn_t values. > > Signed-off-by: Dan Williams > --- > arch/x86/mm/pat.c |4 ++-- > include/asm-generic/pgtable.h |4 ++-- > mm/memory.c |2 +- > 3 files changed, 5 insertions(+), 5 deletions(-) A small nit: shouldn't the arch/x86/mm/pat.c changes be separated out into a different patch? I faced some difficulty separating the generic core mm changes from the arch specific mm changes when going through the commits in retrospect.
Re: [PATCH V3] mm: Add sysfs interface to dump each node's zonelist information
On 09/06/2016 02:35 PM, kbuild test robot wrote: > Hi Anshuman, > > [auto build test ERROR on driver-core/driver-core-testing] > [also build test ERROR on v4.8-rc5 next-20160906] > [if your patch is applied to the wrong git tree, please drop us a note to > help improve the system] > [Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for > convenience) to record what (public, well-known) commit your patch series was > built on] > [Check https://git-scm.com/docs/git-format-patch for more information] > > url: > https://github.com/0day-ci/linux/commits/Anshuman-Khandual/mm-Add-sysfs-interface-to-dump-each-node-s-zonelist-information/20160906-163752 > config: x86_64-randconfig-x019-201636 (attached as .config) > compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705 > reproduce: > # save the attached .config to linux build tree > make ARCH=x86_64 I am not able to reproduce this build failure with Fedora 24 and gcc (GCC) 6.1.1 20160621 on an x86 laptop. Maybe explicitly including mmzone.h in page_alloc.c will be enough to take care of any issues.
[PATCH V4] mm: Add sysfs interface to dump each node's zonelist information
Each individual node in the system has a ZONELIST_FALLBACK zonelist and a ZONELIST_NOFALLBACK zonelist. These zonelists decide fallback order of zones during memory allocations. Sometimes it helps to dump these zonelists to see the priority order of various zones in them. Particularly platforms which support memory hotplug into previously non existing zones (at boot), this interface helps in visualizing which all zonelists of the system at what priority level, the new hot added memory ends up in. POWER is such a platform where all the memory detected during boot time remains with ZONE_DMA for good but then hot plug process can actually get new memory into ZONE_MOVABLE. So having a way to get the snapshot of the zonelists on the system after memory or node hot[un]plug is desirable. This change adds one new sysfs interface (/sys/devices/system/memory/system_zone_details) which will fetch and dump this information. Example zonelist information from a KVM guest. [NODE (0)] ZONELIST_FALLBACK (0) (node 0) (DMA 0xc0006300) (1) (node 1) (DMA 0xc0016300) (2) (node 2) (DMA 0xc0026300) (3) (node 3) (DMA 0xc003ffdba300) ZONELIST_NOFALLBACK (0) (node 0) (DMA 0xc0006300) [NODE (1)] ZONELIST_FALLBACK (0) (node 1) (DMA 0xc0016300) (1) (node 2) (DMA 0xc0026300) (2) (node 3) (DMA 0xc003ffdba300) (3) (node 0) (DMA 0xc0006300) ZONELIST_NOFALLBACK (0) (node 1) (DMA 0xc0016300) [NODE (2)] ZONELIST_FALLBACK (0) (node 2) (DMA 0xc0026300) (1) (node 3) (DMA 0xc003ffdba300) (2) (node 0) (DMA 0xc0006300) (3) (node 1) (DMA 0xc0016300) ZONELIST_NOFALLBACK (0) (node 2) (DMA 0xc0026300) [NODE (3)] ZONELIST_FALLBACK (0) (node 3) (DMA 0xc003ffdba300) (1) (node 0) (DMA 0xc0006300) (2) (node 1) (DMA 0xc0016300) (3) (node 2) (DMA 0xc0026300) ZONELIST_NOFALLBACK (0) (node 3) (DMA 0xc003ffdba300) Signed-off-by: Anshuman Khandual --- Changes in V4: - Explicitly included mmzone.h header inside page_alloc.c - Changed the kernel address printing from %lx to %pK Changes in V3: - Moved all these new sysfs 
code inside CONFIG_NUMA Changes in V2: - Added more details into the commit message - Added sysfs interface file details into the commit message - Added ../ABI/testing/sysfs-system-zone-details file .../ABI/testing/sysfs-system-zone-details | 9 drivers/base/memory.c | 52 ++ mm/page_alloc.c| 1 + 3 files changed, 62 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-system-zone-details diff --git a/Documentation/ABI/testing/sysfs-system-zone-details b/Documentation/ABI/testing/sysfs-system-zone-details new file mode 100644 index 000..9c13b2e --- /dev/null +++ b/Documentation/ABI/testing/sysfs-system-zone-details @@ -0,0 +1,9 @@ +What: /sys/devices/system/memory/system_zone_details +Date: Sep 2016 +KernelVersion: 4.8 +Contact: khand...@linux.vnet.ibm.com +Description: + This read only file dumps the zonelist and it's constituent + zones information for both ZONELIST_FALLBACK and ZONELIST_ + NOFALLBACK zonelists for each online node of the system at + any given point of time. diff --git a/drivers/base/memory.c b/drivers/base/memory.c index dc75de9..c7ab991 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -442,7 +442,56 @@ print_block_size(struct device *dev, struct device_attribute *attr, return sprintf(buf, "%lx\n", get_memory_block_size()); } +#ifdef CONFIG_NUMA +static ssize_t dump_zonelist(char *buf, struct zonelist *zonelist) +{ + unsigned int i; + ssize_t count = 0; + + for (i = 0; zonelist->_zonerefs[i].zone; i++) { + count += sprintf(buf + count, + "\t\t(%d) (node %d) (%-7s 0x%pK)\n", i, + zonelist->_zonerefs[i].zone->zone_pgdat->node_id, + zone_names[zonelist->_zonerefs[i].zone_idx], + (void *) zonelist->_zonerefs[i].zone); + } + return count; +} + +static ssize_t dump_zonelists(char *buf) +{ + struct zonelist *zonelist; + unsigned int node; + ssize_t count = 0; + + for_each_online_node(node) { + zonelist = &(NODE_DATA(node)-> + node_zonelists[ZONELIST_FALLBACK]); +
Re: [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64
On 09/07/2016 10:16 PM, Huang, Ying wrote: > From: Huang Ying > > In this patch, the size of the swap cluster is changed to that of the > THP (Transparent Huge Page) on x86_64 architecture (512). This is for > the THP swap support on x86_64. Where one swap cluster will be used to > hold the contents of each THP swapped out. And some information of the > swapped out THP (such as compound map count) will be recorded in the > swap_cluster_info data structure. > > For other architectures which want THP swap support, THP_SWAP_CLUSTER > need to be selected in the Kconfig file for the architecture. > > In effect, this will enlarge swap cluster size by 2 times on x86_64. > Which may make it harder to find a free cluster when the swap space > becomes fragmented. So that, this may reduce the continuous swap space > allocation and sequential write in theory. The performance test in 0day > shows no regressions caused by this. This patch needs to be split into two separate ones: (1) add the THP_SWAP_CLUSTER config option, and (2) enable CONFIG_THP_SWAP_CLUSTER for X86_64. The first patch should explain the proposal and the second patch should carry the x86_64 arch specific details, regressions etc. as already explained in the commit message.
Re: [PATCH -v3 03/10] mm, memcg: Support to charge/uncharge multiple swap entries
On 09/07/2016 10:16 PM, Huang, Ying wrote: > From: Huang Ying > > This patch make it possible to charge or uncharge a set of continuous > swap entries in the swap cgroup. The number of swap entries is > specified via an added parameter. > > This will be used for the THP (Transparent Huge Page) swap support. > Where a swap cluster backing a THP may be allocated and freed as a > whole. So a set of continuous swap entries (512 on x86_64) backing one Please use HPAGE_SIZE / PAGE_SIZE instead of a hard coded number like 512.
Re: [PATCH -v3 04/10] mm, THP, swap: Add swap cluster allocate/free functions
On 09/07/2016 10:16 PM, Huang, Ying wrote: > From: Huang Ying > > The swap cluster allocation/free functions are added based on the > existing swap cluster management mechanism for SSD. These functions > don't work for the rotating hard disks because the existing swap cluster > management mechanism doesn't work for them. The hard disks support may > be added if someone really need it. But that needn't be included in > this patchset. > > This will be used for the THP (Transparent Huge Page) swap support. > Where one swap cluster will hold the contents of each THP swapped out. Which tree is this series based against? This patch does not apply on the mainline kernel.
Re: [PATCH -v3 07/10] mm, THP, swap: Support to add/delete THP to/from swap cache
On 09/07/2016 10:16 PM, Huang, Ying wrote: > From: Huang Ying > > With this patch, a THP (Transparent Huge Page) can be added/deleted > to/from the swap cache as a set of sub-pages (512 on x86_64). > > This will be used for the THP (Transparent Huge Page) swap support. > Where one THP may be added/delted to/from the swap cache. This will > batch the swap cache operations to reduce the lock acquire/release times > for the THP swap too. > > Cc: Hugh Dickins > Cc: Shaohua Li > Cc: Minchan Kim > Cc: Rik van Riel > Cc: Andrea Arcangeli > Cc: Kirill A. Shutemov > Signed-off-by: "Huang, Ying" > --- > include/linux/page-flags.h | 2 +- > mm/swap_state.c| 57 > +++--- > 2 files changed, 40 insertions(+), 19 deletions(-) > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 74e4dda..f5bcbea 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -314,7 +314,7 @@ PAGEFLAG_FALSE(HighMem) > #endif > > #ifdef CONFIG_SWAP > -PAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND) > +PAGEFLAG(SwapCache, swapcache, PF_NO_TAIL) What is the reason for this change ? The commit message does not seem to explain.
[PATCH 1/2] mm: Move definition of 'zone_names' array into mmzone.h
zone_names[] is used to identify any zone given its index, and it can be used in many other places. So moving the definition into include/linux/mmzone.h for broader access. Signed-off-by: Anshuman Khandual --- include/linux/mmzone.h | 17 + mm/page_alloc.c | 17 - 2 files changed, 17 insertions(+), 17 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index d572b78..01ca3f4 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -341,6 +341,23 @@ enum zone_type { }; +static char * const zone_names[__MAX_NR_ZONES] = { +#ifdef CONFIG_ZONE_DMA +"DMA", +#endif +#ifdef CONFIG_ZONE_DMA32 +"DMA32", +#endif +"Normal", +#ifdef CONFIG_HIGHMEM +"HighMem", +#endif +"Movable", +#ifdef CONFIG_ZONE_DEVICE +"Device", +#endif +}; + #ifndef __GENERATING_BOUNDS_H struct zone { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3fbe73a..8e2261c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -207,23 +207,6 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = { EXPORT_SYMBOL(totalram_pages); -static char * const zone_names[MAX_NR_ZONES] = { -#ifdef CONFIG_ZONE_DMA -"DMA", -#endif -#ifdef CONFIG_ZONE_DMA32 -"DMA32", -#endif -"Normal", -#ifdef CONFIG_HIGHMEM -"HighMem", -#endif -"Movable", -#ifdef CONFIG_ZONE_DEVICE -"Device", -#endif -}; - char * const migratetype_names[MIGRATE_TYPES] = { "Unmovable", "Movable", -- 1.9.3
[PATCH 2/2] mm: Add sysfs interface to dump each node's zonelist information
Each individual node in the system has a ZONELIST_FALLBACK zonelist and a ZONELIST_NOFALLBACK zonelist. These zonelists decide fallback order of zones during memory allocations. Sometimes it helps to dump these zonelists to see the priority order of various zones in them. This change just adds a sysfs interface for doing the same. Example zonelist information from a KVM guest. [NODE (0)] ZONELIST_FALLBACK (0) (node 0) (zone DMA c140c000) (1) (node 1) (zone DMA c001) (2) (node 2) (zone DMA c002) (3) (node 3) (zone DMA c003) ZONELIST_NOFALLBACK (0) (node 0) (zone DMA c140c000) [NODE (1)] ZONELIST_FALLBACK (0) (node 1) (zone DMA c001) (1) (node 2) (zone DMA c002) (2) (node 3) (zone DMA c003) (3) (node 0) (zone DMA c140c000) ZONELIST_NOFALLBACK (0) (node 1) (zone DMA c001) [NODE (2)] ZONELIST_FALLBACK (0) (node 2) (zone DMA c002) (1) (node 3) (zone DMA c003) (2) (node 0) (zone DMA c140c000) (3) (node 1) (zone DMA c001) ZONELIST_NOFALLBACK (0) (node 2) (zone DMA c002) [NODE (3)] ZONELIST_FALLBACK (0) (node 3) (zone DMA c003) (1) (node 0) (zone DMA c140c000) (2) (node 1) (zone DMA c001) (3) (node 2) (zone DMA c002) ZONELIST_NOFALLBACK (0) (node 3) (zone DMA c003) Signed-off-by: Anshuman Khandual --- drivers/base/memory.c | 46 ++ 1 file changed, 46 insertions(+) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index dc75de9..8c9330a 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -442,7 +442,52 @@ print_block_size(struct device *dev, struct device_attribute *attr, return sprintf(buf, "%lx\n", get_memory_block_size()); } +static ssize_t dump_zonelist(char *buf, struct zonelist *zonelist) +{ + unsigned int i; + ssize_t count = 0; + + for (i = 0; zonelist->_zonerefs[i].zone; i++) { + count += sprintf(buf + count, + "\t\t(%d) (node %d) (%-10s %lx)\n", i, + zonelist->_zonerefs[i].zone->zone_pgdat->node_id, + zone_names[zonelist->_zonerefs[i].zone_idx], + (unsigned long) zonelist->_zonerefs[i].zone); + } + return count; +} + +static ssize_t 
dump_zonelists(char *buf) +{ + struct zonelist *zonelist; + unsigned int node; + ssize_t count = 0; + + for_each_online_node(node) { + zonelist = &(NODE_DATA(node)-> + node_zonelists[ZONELIST_FALLBACK]); + count += sprintf(buf + count, "[NODE (%d)]\n", node); + count += sprintf(buf + count, "\tZONELIST_FALLBACK\n"); + count += dump_zonelist(buf + count, zonelist); + + zonelist = &(NODE_DATA(node)-> + node_zonelists[ZONELIST_NOFALLBACK]); + count += sprintf(buf + count, "\tZONELIST_NOFALLBACK\n"); + count += dump_zonelist(buf + count, zonelist); + } + return count; +} + +static ssize_t +print_system_zone_details(struct device *dev, struct device_attribute *attr, +char *buf) +{ + return dump_zonelists(buf); +} + + static DEVICE_ATTR(block_size_bytes, 0444, print_block_size, NULL); +static DEVICE_ATTR(system_zone_details, 0444, print_system_zone_details, NULL); /* * Memory auto online policy. @@ -783,6 +828,7 @@ static struct attribute *memory_root_attrs[] = { #endif &dev_attr_block_size_bytes.attr, + &dev_attr_system_zone_details.attr, &dev_attr_auto_online_blocks.attr, NULL }; -- 1.9.3
Re: [PATCH 1/2] mm: Move definition of 'zone_names' array into mmzone.h
On 09/01/2016 02:40 AM, Andrew Morton wrote: > On Wed, 31 Aug 2016 08:55:49 +0530 Anshuman Khandual > wrote: > >> zone_names[] is used to identify any zone given it's index which >> can be used in many other places. So moving the definition into >> include/linux/mmzone.h for broader access. >> >> ... >> >> --- a/include/linux/mmzone.h >> +++ b/include/linux/mmzone.h >> @@ -341,6 +341,23 @@ enum zone_type { >> >> }; >> >> +static char * const zone_names[__MAX_NR_ZONES] = { >> +#ifdef CONFIG_ZONE_DMA >> + "DMA", >> +#endif >> +#ifdef CONFIG_ZONE_DMA32 >> + "DMA32", >> +#endif >> + "Normal", >> +#ifdef CONFIG_HIGHMEM >> + "HighMem", >> +#endif >> + "Movable", >> +#ifdef CONFIG_ZONE_DEVICE >> + "Device", >> +#endif >> +}; >> + >> #ifndef __GENERATING_BOUNDS_H >> >> struct zone { >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index 3fbe73a..8e2261c 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -207,23 +207,6 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = { >> >> EXPORT_SYMBOL(totalram_pages); >> >> -static char * const zone_names[MAX_NR_ZONES] = { >> -#ifdef CONFIG_ZONE_DMA >> - "DMA", >> -#endif >> -#ifdef CONFIG_ZONE_DMA32 >> - "DMA32", >> -#endif >> - "Normal", >> -#ifdef CONFIG_HIGHMEM >> - "HighMem", >> -#endif >> - "Movable", >> -#ifdef CONFIG_ZONE_DEVICE >> - "Device", >> -#endif >> -}; >> - >> char * const migratetype_names[MIGRATE_TYPES] = { >> "Unmovable", >> "Movable", > > This is worrisome. On some (ancient) compilers, this will produce a > copy of that array into each compilation unit which includes mmzone.h. > > On smarter compilers, it will produce a copy of the array in each > compilation unit which *uses* zone_names[]. > > On even smarter compilers (and linkers!), only one copy of zone_names[] > will exist in vmlinux. > > I don't know if gcc is an "even smarter compiler" and I didn't check, > and I didn't check which gcc versions are even smarter. I'd rather not > have to ;) It is risky. 
> > So, let's just make it non-static and add a declaration into mmzone.h, > please. > I understand your concern, will change it.
Re: [PATCH 2/2] mm: Add sysfs interface to dump each node's zonelist information
On 09/01/2016 02:42 AM, Andrew Morton wrote: > On Wed, 31 Aug 2016 08:55:50 +0530 Anshuman Khandual > wrote: > >> Each individual node in the system has a ZONELIST_FALLBACK zonelist >> and a ZONELIST_NOFALLBACK zonelist. These zonelists decide fallback >> order of zones during memory allocations. Sometimes it helps to dump >> these zonelists to see the priority order of various zones in them. >> This change just adds a sysfs interface for doing the same. >> >> Example zonelist information from a KVM guest. >> >> [NODE (0)] >> ZONELIST_FALLBACK >> (0) (node 0) (zone DMA c140c000) >> (1) (node 1) (zone DMA c001) >> (2) (node 2) (zone DMA c002) >> (3) (node 3) (zone DMA c003) >> ZONELIST_NOFALLBACK >> (0) (node 0) (zone DMA c140c000) >> [NODE (1)] >> ZONELIST_FALLBACK >> (0) (node 1) (zone DMA c001) >> (1) (node 2) (zone DMA c002) >> (2) (node 3) (zone DMA c003) >> (3) (node 0) (zone DMA c140c000) >> ZONELIST_NOFALLBACK >> (0) (node 1) (zone DMA c001) >> [NODE (2)] >> ZONELIST_FALLBACK >> (0) (node 2) (zone DMA c002) >> (1) (node 3) (zone DMA c003) >> (2) (node 0) (zone DMA c140c000) >> (3) (node 1) (zone DMA c001) >> ZONELIST_NOFALLBACK >> (0) (node 2) (zone DMA c002) >> [NODE (3)] >> ZONELIST_FALLBACK >> (0) (node 3) (zone DMA c003) >> (1) (node 0) (zone DMA c140c000) >> (2) (node 1) (zone DMA c001) >> (3) (node 2) (zone DMA c002) >> ZONELIST_NOFALLBACK >> (0) (node 3) (zone DMA c003) > > Can you please sell this a bit better? Why does it "sometimes help"? > Why does the benefit of this patch to our users justify the overhead > and cost? On platforms which support memory hotplug into previously non existing (at boot) zones, this interface helps in visualizing which zonelists of the system, the new hot added memory ends up in. POWER is such a platform where all the memory detected during boot time remains with ZONE_DMA for good but then hot plug process can actually get new memory into ZONE_MOVABLE. 
So having a way to get a snapshot of the zonelists on the system after memory or node hot[un]plug is a good thing, IMHO. > > Please document the full path to the sysfs file(s) within the changelog. Sure, will do. > > Please find somewhere in Documentation/ to document the new interface. > Sure, will create the following file describing the interface. Documentation/ABI/testing/sysfs-system-zone-details
Re: [RFC v2 00/12] powerpc: Memory Protection Keys
On 06/20/2017 10:40 AM, Balbir Singh wrote: > On Fri, 2017-06-16 at 20:52 -0700, Ram Pai wrote: >> Memory protection keys enable applications to protect its >> address space from inadvertent access or corruption from >> itself. > > I presume by itself you mean protection between threads? Between threads due to race conditions, or from the same thread because of a programming error. > >> >> The overall idea: >> >> A process allocates a key and associates it with >> a address range within its address space. > > OK, so this is per VMA? Yeah, but the same key can be given to multiple VMAs. Any change will affect every VMA tagged with it. > >> The process than can dynamically set read/write >> permissions on the key without involving the >> kernel. > > This bit is not clear, how can the key be set without > involving the kernel? I presume you mean the key is set With the pkey_mprotect() system call, all the affected PTEs get tagged once. Switching the permission afterwards happens just by writing into the register on the fly. > in the PTE's and the access protection values can be > set without involving the kernel? The PTE setting happens once; the access protection values can be changed on the fly through the register.
Re: [RFC v2 11/12]Documentation: Documentation updates.
On 06/17/2017 09:22 AM, Ram Pai wrote: > The Documentaton file is moved from x86 into the generic area, > since this feature is now supported by more than one archs. > > Signed-off-by: Ram Pai > --- > Documentation/vm/protection-keys.txt | 110 > ++ > Documentation/x86/protection-keys.txt | 85 -- I am not sure whether this is a good idea. There might be architecture-specific details which would then need to be spelled out again in this new generic file. > 2 files changed, 110 insertions(+), 85 deletions(-) > create mode 100644 Documentation/vm/protection-keys.txt > delete mode 100644 Documentation/x86/protection-keys.txt > > diff --git a/Documentation/vm/protection-keys.txt > b/Documentation/vm/protection-keys.txt > new file mode 100644 > index 000..b49e6bb > --- /dev/null > +++ b/Documentation/vm/protection-keys.txt > @@ -0,0 +1,110 @@ > +Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature > +found in new generation of intel CPUs on PowerPC CPUs. > + > +Memory Protection Keys provides a mechanism for enforcing page-based > +protections, but without requiring modification of the page tables > +when an application changes protection domains. Should the resultant access through protection keys be a subset of the protection bits enabled through the original PTE PROT format ? Are the semantics exactly the same on x86 and powerpc ? > + > + > +On Intel: > + > +It works by dedicating 4 previously ignored bits in each page table > +entry to a "protection key", giving 16 possible keys. > + > +There is also a new user-accessible register (PKRU) with two separate > +bits (Access Disable and Write Disable) for each key. Being a CPU > +register, PKRU is inherently thread-local, potentially giving each > +thread a different set of protections from every other thread. > + > +There are two new instructions (RDPKRU/WRPKRU) for reading and writing > +to the new register. The feature is only available in 64-bit mode, > +even though there is theoretically space in the PAE PTEs.
These > +permissions are enforced on data access only and have no effect on > +instruction fetches. > + > + > +On PowerPC: > + > +It works by dedicating 5 page table entry to a "protection key", > +giving 32 possible keys. > + > +There is a user-accessible register (AMR) with two separate bits > +(Access Disable and Write Disable) for each key. Being a CPU > +register, AMR is inherently thread-local, potentially giving each > +thread a different set of protections from every other thread. Small nit. Space needed here. > +NOTE: Disabling read permission does not disable > +write and vice-versa. > + > +The feature is available on 64-bit HPTE mode only. > + > +'mtspr 0xd, mem' reads the AMR register > +'mfspr mem, 0xd' writes into the AMR register. > + > +Permissions are enforced on data access only and have no effect on > +instruction fetches. > + > +=== Syscalls === > + > +There are 3 system calls which directly interact with pkeys: > + > + int pkey_alloc(unsigned long flags, unsigned long init_access_rights) > + int pkey_free(int pkey); > + int pkey_mprotect(unsigned long start, size_t len, > + unsigned long prot, int pkey); > + > +Before a pkey can be used, it must first be allocated with > +pkey_alloc(). An application calls the WRPKRU instruction > +directly in order to change access permissions to memory covered > +with a key. In this example WRPKRU is wrapped by a C function > +called pkey_set(). > + > + int real_prot = PROT_READ|PROT_WRITE; > + pkey = pkey_alloc(0, PKEY_DENY_WRITE); > + ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, > 0); > + ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey); > + ... 
application runs here > + > +Now, if the application needs to update the data at 'ptr', it can > +gain access, do the update, then remove its write access: > + > + pkey_set(pkey, 0); // clear PKEY_DENY_WRITE > + *ptr = foo; // assign something > + pkey_set(pkey, PKEY_DENY_WRITE); // set PKEY_DENY_WRITE again > + > +Now when it frees the memory, it will also free the pkey since it > +is no longer in use: > + > + munmap(ptr, PAGE_SIZE); > + pkey_free(pkey); > + > +(Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions. > + An example implementation can be found in > + tools/testing/selftests/x86/protection_keys.c) > + > +=== Behavior === > + > +The kernel attempts to make protection keys consistent with the > +behavior of a plain mprotect(). For instance if you do this: > + > + mprotect(ptr, size, PROT_NONE); > + something(ptr); > + > +you can expect the same effects with protection keys when doing this: > + > + pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_READ); > + pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey); > + somet
Re: [RFC v2 12/12]selftest: Updated protection key selftest
On 06/17/2017 09:22 AM, Ram Pai wrote: > Added test support for PowerPC implementation off protection keys. > > Signed-off-by: Ram Pai First of all, there are a lot of instances where we use *pkru*-named functions on power, even though the real implementations take care of doing the appropriate things. That looks pretty hacky. We need to change them to generic names first, before adding both the x86 and powerpc procedures inside them.
Re: [RFC v2 10/12] powerpc: Read AMR only if pkey-violation caused the exception.
On 06/19/2017 11:29 PM, Ram Pai wrote: > On Mon, Jun 19, 2017 at 09:06:13PM +1000, Michael Ellerman wrote: >> Ram Pai writes: >> >>> Signed-off-by: Ram Pai >>> --- >>> arch/powerpc/kernel/exceptions-64s.S | 16 ++-- >>> 1 file changed, 10 insertions(+), 6 deletions(-) >>> >>> diff --git a/arch/powerpc/kernel/exceptions-64s.S >>> b/arch/powerpc/kernel/exceptions-64s.S >>> index 8db9ef8..a4de1b4 100644 >>> --- a/arch/powerpc/kernel/exceptions-64s.S >>> +++ b/arch/powerpc/kernel/exceptions-64s.S >>> @@ -493,13 +493,15 @@ EXC_COMMON_BEGIN(data_access_common) >>> ld r12,_MSR(r1) >>> ld r3,PACA_EXGEN+EX_DAR(r13) >>> lwz r4,PACA_EXGEN+EX_DSISR(r13) >>> + std r3,_DAR(r1) >>> + std r4,_DSISR(r1) >>> #ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS >>> + andis. r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */ >>> + beq+1f >> >> This seems to be incremental on top of one of your other patches. >> >> But I don't see why, can you please just squash this into whatever patch >> adds this code in the first place. > > It was an optimization added later. But yes it can be squashed into an > earlier patch. Could you please explain what optimization this achieves ?
Re: [RFC v2 09/12] powerpc: Deliver SEGV signal on pkey violation.
On 06/17/2017 09:22 AM, Ram Pai wrote: > The value of the AMR register at the time of exception > is made available in gp_regs[PT_AMR] of the siginfo. > > This field can be used to reprogram the permission bits of > any valid pkey. > > Similarly the value of the pkey, whose protection got violated, > is made available at si_pkey field of the siginfo structure. > > Signed-off-by: Ram Pai > --- > arch/powerpc/include/asm/paca.h| 1 + > arch/powerpc/include/uapi/asm/ptrace.h | 3 ++- > arch/powerpc/kernel/asm-offsets.c | 5 > arch/powerpc/kernel/exceptions-64s.S | 8 ++ > arch/powerpc/kernel/signal_32.c| 14 ++ > arch/powerpc/kernel/signal_64.c| 14 ++ > arch/powerpc/kernel/traps.c| 49 > ++ > arch/powerpc/mm/fault.c| 4 +++ > 8 files changed, 97 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h > index 1c09f8f..a41afd3 100644 > --- a/arch/powerpc/include/asm/paca.h > +++ b/arch/powerpc/include/asm/paca.h > @@ -92,6 +92,7 @@ struct paca_struct { > struct dtl_entry *dispatch_log_end; > #endif /* CONFIG_PPC_STD_MMU_64 */ > u64 dscr_default; /* per-CPU default DSCR */ > + u64 paca_amr; /* value of amr at exception */ > > #ifdef CONFIG_PPC_STD_MMU_64 > /* > diff --git a/arch/powerpc/include/uapi/asm/ptrace.h > b/arch/powerpc/include/uapi/asm/ptrace.h > index 8036b38..7ec2428 100644 > --- a/arch/powerpc/include/uapi/asm/ptrace.h > +++ b/arch/powerpc/include/uapi/asm/ptrace.h > @@ -108,8 +108,9 @@ struct pt_regs { > #define PT_DAR 41 > #define PT_DSISR 42 > #define PT_RESULT 43 > -#define PT_DSCR 44 > #define PT_REGS_COUNT 44 > +#define PT_DSCR 44 > +#define PT_AMR 45 PT_REGS_COUNT is not getting incremented even after adding one more element into the pack ? 
> > #define PT_FPR0 48 /* each FP reg occupies 2 slots in this space */ > > diff --git a/arch/powerpc/kernel/asm-offsets.c > b/arch/powerpc/kernel/asm-offsets.c > index 709e234..17f5d8a 100644 > --- a/arch/powerpc/kernel/asm-offsets.c > +++ b/arch/powerpc/kernel/asm-offsets.c > @@ -241,6 +241,11 @@ int main(void) > OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id); > OFFSET(PACAKEXECSTATE, paca_struct, kexec_state); > OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default); > + > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS > + OFFSET(PACA_AMR, paca_struct, paca_amr); > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ > + So we now have a place in PACA for AMR. > OFFSET(ACCOUNT_STARTTIME, paca_struct, accounting.starttime); > OFFSET(ACCOUNT_STARTTIME_USER, paca_struct, accounting.starttime_user); > OFFSET(ACCOUNT_USER_TIME, paca_struct, accounting.utime); > diff --git a/arch/powerpc/kernel/exceptions-64s.S > b/arch/powerpc/kernel/exceptions-64s.S > index 3fd0528..8db9ef8 100644 > --- a/arch/powerpc/kernel/exceptions-64s.S > +++ b/arch/powerpc/kernel/exceptions-64s.S > @@ -493,6 +493,10 @@ EXC_COMMON_BEGIN(data_access_common) > ld r12,_MSR(r1) > ld r3,PACA_EXGEN+EX_DAR(r13) > lwz r4,PACA_EXGEN+EX_DSISR(r13) > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS > + mfspr r5,SPRN_AMR > + std r5,PACA_AMR(r13) > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ > li r5,0x300 > std r3,_DAR(r1) > std r4,_DSISR(r1) > @@ -561,6 +565,10 @@ EXC_COMMON_BEGIN(instruction_access_common) > ld r12,_MSR(r1) > ld r3,_NIP(r1) > andis. r4,r12,0x5820 > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS > + mfspr r5,SPRN_AMR > + std r5,PACA_AMR(r13) > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ Saving the AMR context on page faults, this seems to be changing in the next patch again based on whether any key was active at that point and fault happened for the key enforcement ? 
> li r5,0x400 > std r3,_DAR(r1) > std r4,_DSISR(r1) > diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c > index 97bb138..059766a 100644 > --- a/arch/powerpc/kernel/signal_32.c > +++ b/arch/powerpc/kernel/signal_32.c > @@ -500,6 +500,11 @@ static int save_user_regs(struct pt_regs *regs, struct > mcontext __user *frame, > (unsigned long) &frame->tramp[2]); > } > > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS > + if (__put_user(get_paca()->paca_amr, &frame->mc_gregs[PT_AMR])) > + return 1; > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ > + > return 0; > } > > @@ -661,6 +666,9 @@ static long restore_user_regs(struct pt_regs *regs, > long err; > unsigned int save_r2 = 0; > unsigned long msr; > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS > + unsigned long amr; > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ > #ifdef CONFIG_VSX > int i; >
Re: [RFC v2 08/12] powerpc: Handle exceptions caused by violation of pkey protection.
On 06/17/2017 09:22 AM, Ram Pai wrote: > Handle Data and Instruction exceptions caused by memory > protection-key. > > Signed-off-by: Ram Pai > (cherry picked from commit a5e5217619a0c475fe0cacc3b0cf1d3d33c79a09) To which tree this commit belongs to ? > > Conflicts: > arch/powerpc/include/asm/reg.h > arch/powerpc/kernel/exceptions-64s.S > --- > arch/powerpc/include/asm/mmu_context.h | 12 + > arch/powerpc/include/asm/pkeys.h | 9 > arch/powerpc/include/asm/reg.h | 7 +-- > arch/powerpc/mm/fault.c| 21 +++- > arch/powerpc/mm/pkeys.c| 90 > ++ > 5 files changed, 134 insertions(+), 5 deletions(-) > > diff --git a/arch/powerpc/include/asm/mmu_context.h > b/arch/powerpc/include/asm/mmu_context.h > index da7e943..71fffe0 100644 > --- a/arch/powerpc/include/asm/mmu_context.h > +++ b/arch/powerpc/include/asm/mmu_context.h > @@ -175,11 +175,23 @@ static inline void arch_bprm_mm_init(struct mm_struct > *mm, > { > } > > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS > +bool arch_pte_access_permitted(pte_t pte, bool write); > +bool arch_vma_access_permitted(struct vm_area_struct *vma, > + bool write, bool execute, bool foreign); > +#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ > +static inline bool arch_pte_access_permitted(pte_t pte, bool write) > +{ > + /* by default, allow everything */ > + return true; > +} > static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, > bool write, bool execute, bool foreign) > { > /* by default, allow everything */ > return true; > } Right, these are the two functions the core VM expects the arch to provide. 
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ > + > #endif /* __KERNEL__ */ > #endif /* __ASM_POWERPC_MMU_CONTEXT_H */ > diff --git a/arch/powerpc/include/asm/pkeys.h > b/arch/powerpc/include/asm/pkeys.h > index 9b6820d..405e7db 100644 > --- a/arch/powerpc/include/asm/pkeys.h > +++ b/arch/powerpc/include/asm/pkeys.h > @@ -14,6 +14,15 @@ > VM_PKEY_BIT3 | \ > VM_PKEY_BIT4) > > +static inline u16 pte_flags_to_pkey(unsigned long pte_flags) > +{ > + return ((pte_flags & H_PAGE_PKEY_BIT4) ? 0x1 : 0x0) | > + ((pte_flags & H_PAGE_PKEY_BIT3) ? 0x2 : 0x0) | > + ((pte_flags & H_PAGE_PKEY_BIT2) ? 0x4 : 0x0) | > + ((pte_flags & H_PAGE_PKEY_BIT1) ? 0x8 : 0x0) | > + ((pte_flags & H_PAGE_PKEY_BIT0) ? 0x10 : 0x0); > +} Add defines for the above 0x1, 0x2, 0x4, 0x8 etc ? > + > #define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \ > ((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) |\ > ((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) |\ > diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h > index 2dcb8a1..a11977f 100644 > --- a/arch/powerpc/include/asm/reg.h > +++ b/arch/powerpc/include/asm/reg.h > @@ -285,9 +285,10 @@ > #define DSISR_UNSUPP_MMU 0x0008 /* Unsupported MMU config */ > #define DSISR_SET_RC 0x0004 /* Failed setting of > R/C bits */ > #define DSISR_PGDIRFAULT 0x0002 /* Fault on page directory */ > -#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \ > - DSISR_PAGEATTR_CONFLT | \ > - DSISR_BADACCESS | \ > +#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \ > + DSISR_PAGEATTR_CONFLT | \ > + DSISR_BADACCESS | \ > + DSISR_KEYFAULT |\ > DSISR_BIT43) This should have been cleaned up before adding new DSISR_KEYFAULT reason code into it. But I guess its okay. 
> #define SPRN_TBRL0x10C /* Time Base Read Lower Register (user, R/O) */ > #define SPRN_TBRU0x10D /* Time Base Read Upper Register (user, R/O) */ > diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c > index 3a7d580..c31624f 100644 > --- a/arch/powerpc/mm/fault.c > +++ b/arch/powerpc/mm/fault.c > @@ -216,9 +216,10 @@ int do_page_fault(struct pt_regs *regs, unsigned long > address, >* bits we are interested in. But there are some bits which >* indicate errors in DSISR but can validly be set in SRR1. >*/ > - if (trap == 0x400) > + if (trap == 0x400) { > error_code &= 0x4820; > - else > + flags |= FAULT_FLAG_INSTRUCTION; > + } else > is_write = error_code & DSISR_ISSTORE; > #else Why adding the FAULT_FLAG_INSTRUCTION here ? > is_write = error_code & ESR_DST; > @@ -261,6 +262,13 @@ int do_page_fault(struct pt_regs *regs, unsigned long > address, > } > #endif > > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS > + if (error_code & DSISR_KEYFAULT) { > + code = SEGV_PKUERR; > +
Re: [RFC v2 07/12] powerpc: Macro the mask used for checking DSI exception
On 06/17/2017 09:22 AM, Ram Pai wrote: > Replace the magic number used to check for DSI exception > with a meaningful value. > > Signed-off-by: Ram Pai > --- > arch/powerpc/include/asm/reg.h | 9 - > arch/powerpc/kernel/exceptions-64s.S | 2 +- > 2 files changed, 9 insertions(+), 2 deletions(-) > > diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h > index 7e50e47..2dcb8a1 100644 > --- a/arch/powerpc/include/asm/reg.h > +++ b/arch/powerpc/include/asm/reg.h > @@ -272,16 +272,23 @@ > #define SPRN_DAR 0x013 /* Data Address Register */ > #define SPRN_DBCR0x136 /* e300 Data Breakpoint Control Reg */ > #define SPRN_DSISR 0x012 /* Data Storage Interrupt Status Register */ > +#define DSISR_BIT320x8000 /* not defined */ > #define DSISR_NOHPTE 0x4000 /* no translation found > */ > +#define DSISR_PAGEATTR_CONFLT 0x2000 /* page attribute > conflict */ > +#define DSISR_BIT350x1000 /* not defined */ > #define DSISR_PROTFAULT0x0800 /* protection fault */ > #define DSISR_BADACCESS0x0400 /* bad access to CI or G */ > #define DSISR_ISSTORE 0x0200 /* access was a store */ > #define DSISR_DABRMATCH0x0040 /* hit data breakpoint */ > -#define DSISR_NOSEGMENT0x0020 /* SLB miss */ > #define DSISR_KEYFAULT 0x0020 /* Key fault */ > +#define DSISR_BIT430x0010 /* not defined */ > #define DSISR_UNSUPP_MMU 0x0008 /* Unsupported MMU config */ > #define DSISR_SET_RC 0x0004 /* Failed setting of > R/C bits */ > #define DSISR_PGDIRFAULT 0x0002 /* Fault on page directory */ > +#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \ > + DSISR_PAGEATTR_CONFLT | \ > + DSISR_BADACCESS | \ > + DSISR_BIT43) Sorry missed this one. Seems like there are couple of unnecessary line additions in the subsequent patch which adds the new PKEY reason code. -#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \ - DSISR_PAGEATTR_CONFLT | \ - DSISR_BADACCESS | \ +#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \ + DSISR_PAGEATTR_CONFLT | \ + DSISR_BADACCESS | \ + DSISR_KEYFAULT |\ DSISR_BIT43)
Re: [RFC v2 06/12] powerpc: Program HPTE key protection bits.
On 06/17/2017 09:22 AM, Ram Pai wrote: > Map the PTE protection key bits to the HPTE key protection bits, > while creatiing HPTE entries. > > Signed-off-by: Ram Pai > --- > arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 + > arch/powerpc/include/asm/pkeys.h | 7 +++ > arch/powerpc/mm/hash_utils_64.c | 5 + > 3 files changed, 17 insertions(+) > > diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h > b/arch/powerpc/include/asm/book3s/64/mmu-hash.h > index cfb8169..3d7872c 100644 > --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h > +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h > @@ -90,6 +90,8 @@ > #define HPTE_R_PP0 ASM_CONST(0x8000) > #define HPTE_R_TSASM_CONST(0x4000) > #define HPTE_R_KEY_HIASM_CONST(0x3000) > +#define HPTE_R_KEY_BIT0 ASM_CONST(0x2000) > +#define HPTE_R_KEY_BIT1 ASM_CONST(0x1000) > #define HPTE_R_RPN_SHIFT 12 > #define HPTE_R_RPN ASM_CONST(0x0000) > #define HPTE_R_RPN_3_0 ASM_CONST(0x01fff000) > @@ -104,6 +106,9 @@ > #define HPTE_R_C ASM_CONST(0x0080) > #define HPTE_R_R ASM_CONST(0x0100) > #define HPTE_R_KEY_LOASM_CONST(0x0e00) > +#define HPTE_R_KEY_BIT2 ASM_CONST(0x0800) > +#define HPTE_R_KEY_BIT3 ASM_CONST(0x0400) > +#define HPTE_R_KEY_BIT4 ASM_CONST(0x0200) > Should we indicate/document how these 5 bits are not contiguous in the HPTE format for any given real page ? > #define HPTE_V_1TB_SEG ASM_CONST(0x4000) > #define HPTE_V_VRMA_MASK ASM_CONST(0x4001ff00) > diff --git a/arch/powerpc/include/asm/pkeys.h > b/arch/powerpc/include/asm/pkeys.h > index 0f3dca8..9b6820d 100644 > --- a/arch/powerpc/include/asm/pkeys.h > +++ b/arch/powerpc/include/asm/pkeys.h > @@ -27,6 +27,13 @@ > ((vm_flags & VM_PKEY_BIT3) ? H_PAGE_PKEY_BIT1 : 0x0UL) | \ > ((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL)) > > +#define calc_pte_to_hpte_pkey_bits(pteflags) \ > + (((pteflags & H_PAGE_PKEY_BIT0) ? HPTE_R_KEY_BIT0 : 0x0UL) |\ > + ((pteflags & H_PAGE_PKEY_BIT1) ? HPTE_R_KEY_BIT1 : 0x0UL) | \ > + ((pteflags & H_PAGE_PKEY_BIT2) ? 
HPTE_R_KEY_BIT2 : 0x0UL) | \ > + ((pteflags & H_PAGE_PKEY_BIT3) ? HPTE_R_KEY_BIT3 : 0x0UL) | \ > + ((pteflags & H_PAGE_PKEY_BIT4) ? HPTE_R_KEY_BIT4 : 0x0UL)) > + We can drop calc_ in here. pte_to_hpte_pkey_bits should be sufficient.
Re: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.
On 06/17/2017 09:22 AM, Ram Pai wrote: > Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6 > in the 4K backed hpte pages. These bits continue to be used > for 64K backed hpte pages in this patch, but will be freed > up in the next patch. The numbering 3, 4, 5 and 6 is in BE bit order, I believe; I was initially trying to read it from right to left, as we normally do in the kernel, and was getting confused. So basically these bits (which are only applicable for the 64K mapping, IIUC) are going to be freed up from the PTE format: #define _RPAGE_RSV1 0x1000UL #define _RPAGE_RSV2 0x0800UL #define _RPAGE_RSV3 0x0400UL #define _RPAGE_RSV4 0x0200UL As you have mentioned before, this feature is available for the 64K page size only and not for 4K mappings. So I assume we support both of these combinations: * 64K mapping on 64K * 64K mapping on 4K These are the current users of the above bits: #define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */ #define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */ #define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44) #define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */ > > The patch does the following change to the 64K PTE format > > H_PAGE_BUSY moves from bit 3 to bit 9 and what is on bit 9 now ? This ? #define _RPAGE_SW2 0x00400 which is used as #define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */ which will not be required any more ? > H_PAGE_F_SECOND which occupied bit 4 moves to the second part > of the pte. > H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the > second part of the pte. > > the four bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot > is initialized to 0xF indicating an invalid slot. If a hpte > gets cached in a 0xF slot(i.e 7th slot of secondary), it is > released immediately. In other words, even though 0xF is a Does "released immediately" mean we attempt again for a new hash slot ?
> valid slot we discard it and consider it an invalid > slot; i.e. hpte_soft_invalid(). This gives us an opportunity to not > depend on a bit in the primary PTE in order to determine the > validity of a slot. So we have to look at the slot number in the second half of each PTE to figure out whether it has got a valid slot in the hash page table. > > When we release a hpte in the 0xF slot we also release a > legitimate primary slot and unmap that entry. This is to > ensure that we do get a legitimate non-0xF slot the next time we > retry for a slot. Okay. > > Though treating the 0xF slot as invalid reduces the number of available > slots and may have an effect on the performance, the probability > of hitting a 0xF is extremely low. Why do you say that? I thought every slot number has the same probability of a hit from the hash function. > > Compared to the current scheme, the above described scheme reduces > the number of false hash table updates significantly and has the How does it reduce false hash table updates? > added advantage of releasing four valuable PTE bits for other > purposes. > > This idea was jointly developed by Paul Mackerras, Aneesh, Michael > Ellerman and myself. > > The 4K PTE format remains unchanged currently.
> > Signed-off-by: Ram Pai > --- > arch/powerpc/include/asm/book3s/64/hash-4k.h | 20 +++ > arch/powerpc/include/asm/book3s/64/hash-64k.h | 32 +++ > arch/powerpc/include/asm/book3s/64/hash.h | 15 +++-- > arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 ++ > arch/powerpc/mm/dump_linuxpagetables.c | 3 +- > arch/powerpc/mm/hash64_4k.c | 14 ++--- > arch/powerpc/mm/hash64_64k.c | 81 --- > arch/powerpc/mm/hash_utils_64.c | 30 +++--- > 8 files changed, 122 insertions(+), 78 deletions(-) > > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h > b/arch/powerpc/include/asm/book3s/64/hash-4k.h > index b4b5e6b..5ef1d81 100644 > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h > @@ -16,6 +16,18 @@ > #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE) > #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE) > > + > +/* > + * Only supported by 4k linux page size > + */ > +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */ > +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44) > +#define H_PAGE_F_GIX_SHIFT 56 > + > +#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */ > +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */ > + > + So we moved the common 64K definitions here. > /* PTE flags to conserve for HPTE identification */ > #define _PAGE_HPTEFLAGS (H
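The slot scheme under discussion can be sketched as standalone C (illustrative only -- the names mirror the patch description, not the actual kernel implementation):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * After the patch, the 4-bit value (H_PAGE_F_SECOND | H_PAGE_F_GIX)
 * lives in the second half of the 64K PTE.  0xF -- the 7th slot of
 * the secondary bucket -- doubles as the "no valid slot" marker.
 */
#define HPTE_SOFT_INVALID 0xf

/* slot value = (secondary ? 8 : 0) | group index (0..7) */
static unsigned int make_slot(bool secondary, unsigned int gix)
{
	return (secondary ? 0x8 : 0x0) | (gix & 0x7);
}

static bool hpte_soft_invalid(unsigned int slot)
{
	/* 0xF is a legal hardware slot, but the scheme discards it. */
	return (slot & 0xf) == HPTE_SOFT_INVALID;
}
```

With this encoding a PTE whose slot field reads all-ones needs no separate validity bit in the primary PTE, which is the point of the scheme.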
Re: [RFC v2 02/12] powerpc: Free up four 64K PTE bits in 64K backed hpte pages.
On 06/17/2017 09:22 AM, Ram Pai wrote: > Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6 > in the 64K backed hpte pages. This along with the earlier > patch will entirely free up the four bits from the 64K PTE. > > This patch does the following change to the 64K PTE that is > backed by a 64K hpte. > > H_PAGE_F_SECOND which occupied bit 4 moves to the second part > of the pte. > H_PAGE_F_GIX which occupied bits 5, 6 and 7 also moves to the > second part of the pte. > > Since bit 7 is now freed up, we move H_PAGE_BUSY from bit 9 > to bit 7. Trying to minimize gaps so that contiguous bits > can be allocated if needed in the future. > > The second part of the PTE will hold > (H_PAGE_F_SECOND|H_PAGE_F_GIX) at bits 60,61,62,63. I still don't understand how we freed up the 5th bit which is used in the 5th patch. Was that bit never used for anything on the 64K page size (64K and 4K mappings)? +#define _RPAGE_RSV5 0x00040UL +#define H_PAGE_PKEY_BIT0 _RPAGE_RSV1 +#define H_PAGE_PKEY_BIT1 _RPAGE_RSV2 +#define H_PAGE_PKEY_BIT2 _RPAGE_RSV3 +#define H_PAGE_PKEY_BIT3 _RPAGE_RSV4 +#define H_PAGE_PKEY_BIT4 _RPAGE_RSV5
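A quick standalone sanity check of the five pkey bits quoted above (RSV1-RSV4 written with their full 64-bit kernel values, which this listing truncated; RSV5 as shown -- illustrative only):

```c
#include <assert.h>
#include <stdint.h>

/* Reserved PTE bits claimed for protection keys by the patch. */
#define _RPAGE_RSV1 0x1000000000000000UL
#define _RPAGE_RSV2 0x0800000000000000UL
#define _RPAGE_RSV3 0x0400000000000000UL
#define _RPAGE_RSV4 0x0200000000000000UL
#define _RPAGE_RSV5 0x00040UL

#define H_PAGE_PKEY_BIT0 _RPAGE_RSV1
#define H_PAGE_PKEY_BIT1 _RPAGE_RSV2
#define H_PAGE_PKEY_BIT2 _RPAGE_RSV3
#define H_PAGE_PKEY_BIT3 _RPAGE_RSV4
#define H_PAGE_PKEY_BIT4 _RPAGE_RSV5

/* true iff exactly one bit is set */
static int is_single_bit(uint64_t x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

static int popcount64(uint64_t x)
{
	int n = 0;
	while (x) {
		n += (int)(x & 1);
		x >>= 1;
	}
	return n;
}

static uint64_t all_pkey_bits(void)
{
	return H_PAGE_PKEY_BIT0 | H_PAGE_PKEY_BIT1 | H_PAGE_PKEY_BIT2 |
	       H_PAGE_PKEY_BIT3 | H_PAGE_PKEY_BIT4;
}
```

Five distinct single bits are exactly what is needed to store a 5-bit pkey per PTE, which is what the question above is probing (whether RSV5/bit 7's old user is really gone).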
Re: [RFC v2 09/12] powerpc: Deliver SEGV signal on pkey violation.
On 06/21/2017 05:26 AM, Ram Pai wrote: > On Tue, Jun 20, 2017 at 12:24:53PM +0530, Anshuman Khandual wrote: >> On 06/17/2017 09:22 AM, Ram Pai wrote: >>> The value of the AMR register at the time of exception >>> is made available in gp_regs[PT_AMR] of the siginfo. >>> >>> This field can be used to reprogram the permission bits of >>> any valid pkey. >>> >>> Similarly the value of the pkey, whose protection got violated, >>> is made available at si_pkey field of the siginfo structure. >>> >>> Signed-off-by: Ram Pai >>> --- >>> arch/powerpc/include/asm/paca.h| 1 + >>> arch/powerpc/include/uapi/asm/ptrace.h | 3 ++- >>> arch/powerpc/kernel/asm-offsets.c | 5 >>> arch/powerpc/kernel/exceptions-64s.S | 8 ++ >>> arch/powerpc/kernel/signal_32.c| 14 ++ >>> arch/powerpc/kernel/signal_64.c| 14 ++ >>> arch/powerpc/kernel/traps.c| 49 >>> ++ >>> arch/powerpc/mm/fault.c| 4 +++ >>> 8 files changed, 97 insertions(+), 1 deletion(-) >>> >>> diff --git a/arch/powerpc/include/asm/paca.h >>> b/arch/powerpc/include/asm/paca.h >>> index 1c09f8f..a41afd3 100644 >>> --- a/arch/powerpc/include/asm/paca.h >>> +++ b/arch/powerpc/include/asm/paca.h >>> @@ -92,6 +92,7 @@ struct paca_struct { >>> struct dtl_entry *dispatch_log_end; >>> #endif /* CONFIG_PPC_STD_MMU_64 */ >>> u64 dscr_default; /* per-CPU default DSCR */ >>> + u64 paca_amr; /* value of amr at exception */ >>> >>> #ifdef CONFIG_PPC_STD_MMU_64 >>> /* >>> diff --git a/arch/powerpc/include/uapi/asm/ptrace.h >>> b/arch/powerpc/include/uapi/asm/ptrace.h >>> index 8036b38..7ec2428 100644 >>> --- a/arch/powerpc/include/uapi/asm/ptrace.h >>> +++ b/arch/powerpc/include/uapi/asm/ptrace.h >>> @@ -108,8 +108,9 @@ struct pt_regs { >>> #define PT_DAR 41 >>> #define PT_DSISR 42 >>> #define PT_RESULT 43 >>> -#define PT_DSCR 44 >>> #define PT_REGS_COUNT 44 >>> +#define PT_DSCR 44 >>> +#define PT_AMR 45 >> >> PT_REGS_COUNT is not getting incremented even after adding >> one more element into the pack ? > > Correct. 
> there are 48 entries in gp_regs table AFAICT, only the first 45 > are exposed through pt_regs and through gp_regs. the remaining > are exposed through gp_regs only. > >> >>> >>> #define PT_FPR0 48 /* each FP reg occupies 2 slots in this space */ >>> >>> diff --git a/arch/powerpc/kernel/asm-offsets.c >>> b/arch/powerpc/kernel/asm-offsets.c >>> index 709e234..17f5d8a 100644 >>> --- a/arch/powerpc/kernel/asm-offsets.c >>> +++ b/arch/powerpc/kernel/asm-offsets.c >>> @@ -241,6 +241,11 @@ int main(void) >>> OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id); >>> OFFSET(PACAKEXECSTATE, paca_struct, kexec_state); >>> OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default); >>> + >>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS >>> + OFFSET(PACA_AMR, paca_struct, paca_amr); >>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ >>> + >> >> So we now have a place in PACA for AMR. > > yes. > >> >>> OFFSET(ACCOUNT_STARTTIME, paca_struct, accounting.starttime); >>> OFFSET(ACCOUNT_STARTTIME_USER, paca_struct, accounting.starttime_user); >>> OFFSET(ACCOUNT_USER_TIME, paca_struct, accounting.utime); >>> diff --git a/arch/powerpc/kernel/exceptions-64s.S >>> b/arch/powerpc/kernel/exceptions-64s.S >>> index 3fd0528..8db9ef8 100644 >>> --- a/arch/powerpc/kernel/exceptions-64s.S >>> +++ b/arch/powerpc/kernel/exceptions-64s.S >>> @@ -493,6 +493,10 @@ EXC_COMMON_BEGIN(data_access_common) >>> ld r12,_MSR(r1) >>> ld r3,PACA_EXGEN+EX_DAR(r13) >>> lwz r4,PACA_EXGEN+EX_DSISR(r13) >>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS >>> + mfspr r5,SPRN_AMR >>> + std r5,PACA_AMR(r13) >>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ >>> li r5,0x300 >>> std r3,_DAR(r1) >>> std r4,_DSISR(r1) >>> @@ -561,6 +565,10 @@ EXC_COMMON_BEGIN(instruction_access_common) >>> ld r12,_MSR(r1) >>> ld r3,_NIP(r1) >>> andis. r4,r12,0x5820 >>> +#ifdef CONFIG_PPC
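The PT_AMR / PT_REGS_COUNT relationship being debated can be captured in a tiny standalone check. Constants come from the quoted uapi hunk; the 48-entry figure for gp_regs comes from the reply above -- treat the helper as an illustration of the argument, not kernel code:

```c
#include <assert.h>

/* From the quoted uapi/asm/ptrace.h hunk. */
#define PT_DSCR		44
#define PT_REGS_COUNT	44
#define PT_AMR		45
/* Per the reply above, gp_regs has 48 entries in total. */
#define GP_REGS_ENTRIES	48

/*
 * PT_AMR indexes past PT_REGS_COUNT, so it is reachable only through
 * gp_regs[] -- which is why PT_REGS_COUNT was deliberately not bumped
 * when the new index was added.
 */
static int gp_regs_only(int idx)
{
	return idx >= PT_REGS_COUNT && idx < GP_REGS_ENTRIES;
}
```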
Re: [RFC v2 08/12] powerpc: Handle exceptions caused by violation of pkey protection.
On 06/21/2017 05:13 AM, Ram Pai wrote: > On Tue, Jun 20, 2017 at 12:54:45PM +0530, Anshuman Khandual wrote: >> On 06/17/2017 09:22 AM, Ram Pai wrote: >>> Handle Data and Instruction exceptions caused by memory >>> protection-key. >>> >>> Signed-off-by: Ram Pai >>> (cherry picked from commit a5e5217619a0c475fe0cacc3b0cf1d3d33c79a09) > > Sorry, it was residue of a bad cleanup. It got cherry-picked from my own > internal branch, but then I forgot to delete that line. > >> >> Which tree does this commit belong to? >> >>> >>> Conflicts: >>> arch/powerpc/include/asm/reg.h >>> arch/powerpc/kernel/exceptions-64s.S > > same here. these two lines are residues of patching up my tree with > commits from other internal branches. > >>> --- >>> arch/powerpc/include/asm/mmu_context.h | 12 + >>> arch/powerpc/include/asm/pkeys.h | 9 >>> arch/powerpc/include/asm/reg.h | 7 +-- >>> arch/powerpc/mm/fault.c | 21 +++- >>> arch/powerpc/mm/pkeys.c | 90 >>> ++ >>> 5 files changed, 134 insertions(+), 5 deletions(-) >>> >>> diff --git a/arch/powerpc/include/asm/mmu_context.h >>> b/arch/powerpc/include/asm/mmu_context.h >>> index da7e943..71fffe0 100644 >>> --- a/arch/powerpc/include/asm/mmu_context.h >>> +++ b/arch/powerpc/include/asm/mmu_context.h >>> @@ -175,11 +175,23 @@ static inline void arch_bprm_mm_init(struct mm_struct >>> *mm, >>> { >>> } >>> >>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS >>> +bool arch_pte_access_permitted(pte_t pte, bool write); >>> +bool arch_vma_access_permitted(struct vm_area_struct *vma, >>> + bool write, bool execute, bool foreign); >>> +#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ >>> +static inline bool arch_pte_access_permitted(pte_t pte, bool write) >>> +{ >>> + /* by default, allow everything */ >>> + return true; >>> +} >>> static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, >>> bool write, bool execute, bool foreign) >>> { >>> /* by default, allow everything */ >>> return true; >>> } >> >> Right, these are the two functions the
core VM expects the >> arch to provide. >> >>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ >>> + >>> #endif /* __KERNEL__ */ >>> #endif /* __ASM_POWERPC_MMU_CONTEXT_H */ >>> diff --git a/arch/powerpc/include/asm/pkeys.h >>> b/arch/powerpc/include/asm/pkeys.h >>> index 9b6820d..405e7db 100644 >>> --- a/arch/powerpc/include/asm/pkeys.h >>> +++ b/arch/powerpc/include/asm/pkeys.h >>> @@ -14,6 +14,15 @@ >>> VM_PKEY_BIT3 | \ >>> VM_PKEY_BIT4) >>> >>> +static inline u16 pte_flags_to_pkey(unsigned long pte_flags) >>> +{ >>> + return ((pte_flags & H_PAGE_PKEY_BIT4) ? 0x1 : 0x0) | >>> + ((pte_flags & H_PAGE_PKEY_BIT3) ? 0x2 : 0x0) | >>> + ((pte_flags & H_PAGE_PKEY_BIT2) ? 0x4 : 0x0) | >>> + ((pte_flags & H_PAGE_PKEY_BIT1) ? 0x8 : 0x0) | >>> + ((pte_flags & H_PAGE_PKEY_BIT0) ? 0x10 : 0x0); >>> +} >> >> Add defines for the above 0x1, 0x2, 0x4, 0x8 etc ? > > hmm...not sure if it will make the code any better. > >> >>> + >>> #define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | >>> \ >>> ((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) |\ >>> ((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) |\ >>> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h >>> index 2dcb8a1..a11977f 100644 >>> --- a/arch/powerpc/include/asm/reg.h >>> +++ b/arch/powerpc/include/asm/reg.h >>> @@ -285,9 +285,10 @@ >>> #define DSISR_UNSUPP_MMU 0x0008 /* Unsupported MMU config */ >>> #define DSISR_SET_RC 0x0004 /* Failed setting of >>> R/C bits */ >>> #define DSISR_PGDIRFAULT 0x0002 /* Fault on page directory >>> */ >>> -#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \ >>> - DSISR_PAGEATTR_CONFLT | \ >>> - DSISR_BADACCESS | \ >>> +#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 |
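Following the review suggestion to name the bare 0x1..0x10 constants, a user-space sketch of the quoted helper might look like this. The H_PAGE_PKEY_BIT* values here are illustrative stand-ins (the real ones are the _RPAGE_RSV* bits); the bit-gathering logic matches the quoted patch, where BIT4 supplies the least significant pkey bit:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in PTE bit positions, not kernel values. */
#define H_PAGE_PKEY_BIT0 (1UL << 60)
#define H_PAGE_PKEY_BIT1 (1UL << 59)
#define H_PAGE_PKEY_BIT2 (1UL << 58)
#define H_PAGE_PKEY_BIT3 (1UL << 57)
#define H_PAGE_PKEY_BIT4 (1UL << 6)

/* Named replacements for the bare 0x1 .. 0x10 in the patch. */
#define PKEY_BIT0 0x1
#define PKEY_BIT1 0x2
#define PKEY_BIT2 0x4
#define PKEY_BIT3 0x8
#define PKEY_BIT4 0x10

/* Gather the five scattered PTE bits into a 5-bit pkey value. */
static uint16_t pte_flags_to_pkey(unsigned long pte_flags)
{
	return ((pte_flags & H_PAGE_PKEY_BIT4) ? PKEY_BIT0 : 0) |
	       ((pte_flags & H_PAGE_PKEY_BIT3) ? PKEY_BIT1 : 0) |
	       ((pte_flags & H_PAGE_PKEY_BIT2) ? PKEY_BIT2 : 0) |
	       ((pte_flags & H_PAGE_PKEY_BIT1) ? PKEY_BIT3 : 0) |
	       ((pte_flags & H_PAGE_PKEY_BIT0) ? PKEY_BIT4 : 0);
}
```

Whether the named masks make the kernel code clearer is the open question above; they at least make the bit ordering (BIT0 is the most significant pkey bit) explicit.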
Re: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.
On 06/21/2017 04:53 AM, Ram Pai wrote: > On Tue, Jun 20, 2017 at 03:50:25PM +0530, Anshuman Khandual wrote: >> On 06/17/2017 09:22 AM, Ram Pai wrote: >>> Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6 >>> in the 4K backed hpte pages. These bits continue to be used >>> for 64K backed hpte pages in this patch, but will be freed >>> up in the next patch. >> >> The bit numbers 3, 4, 5 and 6 are in BE format, I believe; I was >> initially trying to read them from right to left as we normally >> do in the kernel and was getting confused. So basically these >> bits (which are only applicable for the 64K mapping, IIUC) are going >> to be freed up from the PTE format. >> >> #define _RPAGE_RSV1 0x1000000000000000UL >> #define _RPAGE_RSV2 0x0800000000000000UL >> #define _RPAGE_RSV3 0x0400000000000000UL >> #define _RPAGE_RSV4 0x0200000000000000UL >> >> As you have mentioned before, this feature is available for the 64K >> page size only and not for 4K mappings. So I assume we support >> both the combinations. >> >> * 64K mapping on 64K >> * 64K mapping on 4K > > yes. > >> >> These are the current users of the above bits >> >> #define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */ >> #define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary >> HPTEG */ >> #define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44) >> #define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated >> HPTE */ >> >>> >>> The patch does the following change to the 64K PTE format >>> >>> H_PAGE_BUSY moves from bit 3 to bit 9 >> >> and what is on bit 9 now? This? >> >> #define _RPAGE_SW2 0x00400 >> >> which is used as >> >> #define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */ >> >> which will not be required any more? > > I think you are reading bit 9 from right to left. The bit 9 I refer to > is from left to right, using the same numbering convention ISA 3.0 uses. Right, my bad. Then it would be this one.
'#define _RPAGE_RPN42 0x0040000000000000UL' > I know it is confusing, will make a mention in the comment of this > patch, to read it the big-endian way. Right. > > BTW: Bit 9 is not used currently. so using it in this patch. But this is > a temporary move. the H_PAGE_BUSY will move to bit 7 in the next patch. > > Had to keep at bit 9, because bit 7 is not yet entirely freed up. it is > used by 64K PTE backed by 64k hpte. Got it. > >> >>> H_PAGE_F_SECOND which occupied bit 4 moves to the second part >>> of the pte. >>> H_PAGE_F_GIX which occupied bits 5, 6 and 7 also moves to the >>> second part of the pte. >>> >>> The four bits (H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot >>> are initialized to 0xF, indicating an invalid slot. If a hpte >>> gets cached in a 0xF slot (i.e. the 7th slot of the secondary), it is >>> released immediately. In other words, even though 0xF is a >> >> Released immediately means we attempt again for a new hash slot? > > yes. > >> >>> valid slot we discard it and consider it an invalid >>> slot; i.e. hpte_soft_invalid(). This gives us an opportunity to not >>> depend on a bit in the primary PTE in order to determine the >>> validity of a slot. >> >> So we have to look at the slot number in the second half of each PTE to >> figure out whether it has got a valid slot in the hash page table. > > yes. > >> >>> >>> When we release a hpte in the 0xF slot we also release a >>> legitimate primary slot and unmap that entry. This is to >>> ensure that we do get a legitimate non-0xF slot the next time we >>> retry for a slot. >> >> Okay. >> >>> >>> Though treating the 0xF slot as invalid reduces the number of available >>> slots and may have an effect on the performance, the probability >>> of hitting a 0xF is extremely low. >> >> Why do you say that? I thought every slot number has the same probability >> of a hit from the hash function. > > Every hash bucket has the same probability. But every slot within the > hash bucket is filled in sequentially.
> so it takes 15 hptes to hash to > the same bucket before we get to the 15th slot in the secondary. Okay, but would the last one be the 16th instead? > >> >>> >>> Compared
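The sequential-fill argument can be made concrete with a toy model: slot values encode as (secondary << 3) | gix, a bucket has 8 primary plus 8 secondary slots, so 0xF is the 16th and last slot tried and is only reached after 15 earlier hptes land in the same bucket (illustrative sketch, not kernel code):

```c
#include <assert.h>

#define SLOTS_PER_GROUP 8	/* 8 primary + 8 secondary per bucket */

/* Slot value handed out for the n-th hpte (n = 0..15) cached in a
 * bucket that is filled strictly sequentially. */
static unsigned int nth_slot_value(unsigned int n)
{
	unsigned int secondary = n / SLOTS_PER_GROUP;	/* 0 or 1 */
	unsigned int gix = n % SLOTS_PER_GROUP;		/* 0..7 */
	return (secondary << 3) | gix;
}
```

So 0xF only comes into play for the 16th colliding hpte, which is why the claimed probability of hitting it is so low.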
Re: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.
On 06/17/2017 09:22 AM, Ram Pai wrote: > Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6 > in the 4K backed hpte pages. These bits continue to be used > for 64K backed hpte pages in this patch, but will be freed > up in the next patch. > > The patch does the following change to the 64K PTE format > > H_PAGE_BUSY moves from bit 3 to bit 9 > H_PAGE_F_SECOND which occupied bit 4 moves to the second part > of the pte. > H_PAGE_F_GIX which occupied bits 5, 6 and 7 also moves to the > second part of the pte. > > The four bits (H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot > are initialized to 0xF, indicating an invalid slot. If a hpte > gets cached in a 0xF slot (i.e. the 7th slot of the secondary), it is > released immediately. In other words, even though 0xF is a > valid slot we discard it and consider it an invalid > slot; i.e. hpte_soft_invalid(). This gives us an opportunity to not > depend on a bit in the primary PTE in order to determine the > validity of a slot. > > When we release a hpte in the 0xF slot we also release a > legitimate primary slot and unmap that entry. This is to > ensure that we do get a legitimate non-0xF slot the next time we > retry for a slot. > > Though treating the 0xF slot as invalid reduces the number of available > slots and may have an effect on the performance, the probability > of hitting a 0xF is extremely low. > > Compared to the current scheme, the above described scheme reduces > the number of false hash table updates significantly and has the > added advantage of releasing four valuable PTE bits for other > purposes. > > This idea was jointly developed by Paul Mackerras, Aneesh, Michael > Ellerman and myself. > > The 4K PTE format remains unchanged currently. Scanned through the PTE format again for hash 64K and 4K. It seems to me that there might be 5 free bits already present in the PTE format. I might have seriously mistaken something here :) Please correct me if that is not the case.
_RPAGE_RPN* I think is applicable only for hash page table format and will not be available for radix later. +#define _PAGE_FREE_1 0x0040UL /* Not used */ +#define _RPAGE_SW0 0x2000UL /* Not used */ +#define _RPAGE_SW1 0x0800UL /* Not used */ +#define _RPAGE_RPN42 0x0040UL /* Not used */ +#define _RPAGE_RPN41 0x0020UL /* Not used */
Re: [PATCH 1/2] mm, thp: introduce dedicated transparent huge page allocation interfaces
On 10/16/2017 02:49 PM, changbin...@intel.com wrote: > From: Changbin Du > > This patch introduced 4 new interfaces to allocate a prepared > transparent huge page. > - alloc_transhuge_page_vma > - alloc_transhuge_page_nodemask > - alloc_transhuge_page_node > - alloc_transhuge_page > If we are trying to match HugeTLB helpers, then it should have format something like alloc_transhugepage_xxx instead of alloc_transhuge_page_XXX. But I think its okay. > The aim is to remove duplicated code and simplify transparent > huge page allocation. These are similar to alloc_hugepage_xxx > which are for hugetlbfs pages. This patch does below changes: > - define alloc_transhuge_page_xxx interfaces > - apply them to all existing code > - declare prep_transhuge_page as static since no others use it > - remove alloc_hugepage_vma definition since it no longer has users > > Signed-off-by: Changbin Du > --- > include/linux/gfp.h | 4 > include/linux/huge_mm.h | 13 - > include/linux/migrate.h | 14 +- > mm/huge_memory.c| 50 > ++--- > mm/khugepaged.c | 11 ++- > mm/mempolicy.c | 10 +++--- > mm/migrate.c| 12 > mm/shmem.c | 6 ++ > 8 files changed, 71 insertions(+), 49 deletions(-) > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index f780718..855c72e 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -507,15 +507,11 @@ alloc_pages(gfp_t gfp_mask, unsigned int order) > extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order, > struct vm_area_struct *vma, unsigned long addr, > int node, bool hugepage); > -#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \ > - alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true) > #else > #define alloc_pages(gfp_mask, order) \ > alloc_pages_node(numa_node_id(), gfp_mask, order) > #define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\ > alloc_pages(gfp_mask, order) > -#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \ > - alloc_pages(gfp_mask, order) > #endif > #define alloc_page(gfp_mask) 
alloc_pages(gfp_mask, 0) > #define alloc_page_vma(gfp_mask, vma, addr) \ > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index 14bc21c..1dd2c33 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -130,9 +130,20 @@ extern unsigned long thp_get_unmapped_area(struct file > *filp, > unsigned long addr, unsigned long len, unsigned long pgoff, > unsigned long flags); > > -extern void prep_transhuge_page(struct page *page); > extern void free_transhuge_page(struct page *page); > > +struct page *alloc_transhuge_page_vma(gfp_t gfp_mask, > + struct vm_area_struct *vma, unsigned long addr); > +struct page *alloc_transhuge_page_nodemask(gfp_t gfp_mask, > + int preferred_nid, nodemask_t *nmask); Would not they require 'extern' here ? > + > +static inline struct page *alloc_transhuge_page_node(int nid, gfp_t gfp_mask) > +{ > + return alloc_transhuge_page_nodemask(gfp_mask, nid, NULL); > +} > + > +struct page *alloc_transhuge_page(gfp_t gfp_mask); > + > bool can_split_huge_page(struct page *page, int *pextra_pins); > int split_huge_page_to_list(struct page *page, struct list_head *list); > static inline int split_huge_page(struct page *page) > diff --git a/include/linux/migrate.h b/include/linux/migrate.h > index 643c7ae..70a00f3 100644 > --- a/include/linux/migrate.h > +++ b/include/linux/migrate.h > @@ -42,19 +42,15 @@ static inline struct page *new_page_nodemask(struct page > *page, > return > alloc_huge_page_nodemask(page_hstate(compound_head(page)), > preferred_nid, nodemask); > > - if (thp_migration_supported() && PageTransHuge(page)) { > - order = HPAGE_PMD_ORDER; > - gfp_mask |= GFP_TRANSHUGE; > - } > - > if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE)) > gfp_mask |= __GFP_HIGHMEM; > > - new_page = __alloc_pages_nodemask(gfp_mask, order, > + if (thp_migration_supported() && PageTransHuge(page)) > + return alloc_transhuge_page_nodemask(gfp_mask | GFP_TRANSHUGE, > + preferred_nid, nodemask); > + else > + return 
__alloc_pages_nodemask(gfp_mask, order, > preferred_nid, nodemask); > - > - if (new_page && PageTransHuge(page)) > - prep_transhuge_page(new_page); This makes sense, calling prep_transhuge_page() inside the function alloc_transhuge_page_nodemask() is better I guess. > > return new_page; > } > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 269b5df..e267488 100644 > --- a/mm/huge_memory.c > +++ b/mm/hu
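The dispatch flow of the refactored new_page_nodemask() quoted above can be modeled with stubs -- a toy sketch with no real allocation, just the three-way choice the patch ends up with (hugetlb helper, the new alloc_transhuge_page_nodemask() which now does the prep_transhuge_page() step internally, or the plain page allocator):

```c
#include <assert.h>

/* Which allocator the refactored new_page_nodemask() would call. */
enum alloc_path { PATH_HUGETLB, PATH_THP, PATH_BASE };

/*
 * Toy model: flags stand in for PageHuge()/PageTransHuge() and the
 * thp_migration_supported() check from the quoted hunk.
 */
static enum alloc_path new_page_path(int is_hugetlb, int is_transhuge,
				     int thp_migration_supported)
{
	if (is_hugetlb)
		return PATH_HUGETLB;		/* alloc_huge_page_nodemask() */
	if (thp_migration_supported && is_transhuge)
		return PATH_THP;		/* alloc_transhuge_page_nodemask() */
	return PATH_BASE;			/* __alloc_pages_nodemask() */
}
```

The point of the refactor is that the THP branch no longer needs the separate post-allocation prep_transhuge_page() call at this level, since the new helper performs it.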