Re: [PATCH 1/8][v4] powerpc/perf: Rename Power8 macros to start with PME

2013-09-17 Thread Anshuman Khandual
On 09/14/2013 06:19 AM, Sukadev Bhattiprolu wrote:
> We use helpers like GENERIC_EVENT_ATTR() to list the generic events in
> sysfs. To avoid name collisions, GENERIC_EVENT_ATTR() requires the perf
> event macros to start with PME.

We got all the raw event codes covered for P7 with the help of the
power7-events-list.h enumeration.

/*
 * Power7 event codes.
 */
#define EVENT(_name, _code)	\
	PME_##_name = _code,

enum {
#include "power7-events-list.h"
};
#undef EVENT

Just wondering whether it is a good idea to rename these selected macros for
consumption by GENERIC_EVENT_ATTR() right here for this purpose, or whether
we need to get the comprehensive list of raw events for P8 first. Just an idea.
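
For reference, here is a minimal sketch of why the PME_ prefix matters,
modelled on the power7-pmu.c pattern (the exact macro bodies and token
pasting in the tree may differ slightly):

#define EVENT_ATTR(_name, _id, _suffix)					\
	PMU_EVENT_ATTR(_name, event_attr_##_suffix##_##_name,		\
		       PME_##_id, power_events_sysfs_show)

#define GENERIC_EVENT_ATTR(_name, _id)	EVENT_ATTR(_name, _id, _g)

/* GENERIC_EVENT_ATTR(cpu-cycles, PM_CYC) then resolves its event code
 * through the PME_PM_CYC enumerator generated from the events list,
 * which is why every raw event macro needs the PME_ prefix. */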

Regards
Anshuman



Re: [PATCH 8/8][v4] powerpc/perf: Export Power7 memory hierarchy info to user space.

2013-09-18 Thread Anshuman Khandual
On 09/14/2013 06:19 AM, Sukadev Bhattiprolu wrote:
> On Power7, the DCACHE_SRC field in MMCRA register identifies the memory
> hierarchy level (eg: L2, L3 etc) from which a data-cache miss for a
> marked instruction was satisfied.
> 
> Use the 'perf_mem_data_src' object to export this hierarchy level to user
> space. Some memory hierarchy levels in Power7 don't map into the arch-neutral
> levels. However, since newer generation of the processor (i.e. Power8) uses
> fewer levels than in Power7, we don't really need to define new hierarchy
> levels just for Power7.
> 
> We instead map as many levels as possible and approximate the rest. See
> comments near dcache_src_map[] in the patch.
> 
> Usage:
> 
>   perf record -d -e 'cpu/PM_MRK_GRP_CMPL/' 
>   perf report -n --mem-mode --sort=mem,sym,dso,symbol_daddr,dso_daddr
> 
>   For samples involving load/store instructions, the memory
>   hierarchy level is shown as "L1 hit", "Remote RAM hit" etc.
>   # or
> 
>   perf record --data 
>   perf report -D
> 
>   Sample records contain a 'data_src' field which encodes the
>   memory hierarchy level: Eg: data_src 0x442 indicates
>   MEM_OP_LOAD, MEM_LVL_HIT, MEM_LVL_L2 (i.e load hit L2).

Successfully built and boot tested this entire patchset on both a P7 and a P8
system. Running some sample tests with the ebizzy micro-benchmark. Till now I
got only 0x142 and 0x0 values for the data_src object in the sample records.
Will experiment a bit more on the P7 and P8 systems and post the results.
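
The data_src values above can be decoded with the perf_mem_data_src layout
from include/uapi/linux/perf_event.h (in the 2013-era layout, bits 0-4 are
mem_op and bits 5-18 are mem_lvl). A small userspace sketch:

#include <stdio.h>
#include <linux/perf_event.h>	/* union perf_mem_data_src */

static void decode(unsigned long long val)
{
	union perf_mem_data_src d = { .val = val };

	printf("0x%llx: mem_op=%#llx mem_lvl=%#llx\n", val,
	       (unsigned long long)d.mem_op, (unsigned long long)d.mem_lvl);
}

int main(void)
{
	decode(0x442);	/* op 0x2 (LOAD), lvl 0x22 (HIT | L2) - from the changelog */
	decode(0x142);	/* op 0x2 (LOAD), lvl 0x0a (HIT | L1) - seen in my runs   */
	return 0;
}

So 0x142 is a load that hit L1, which is consistent with a small working set.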

Regards
Anshuman 



Re: [PATCH 8/8][v4] powerpc/perf: Export Power7 memory hierarchy info to user space.

2013-09-19 Thread Anshuman Khandual
On 09/14/2013 06:19 AM, Sukadev Bhattiprolu wrote:
> +static void power7_get_mem_data_src(union perf_mem_data_src *dsrc,
> +				struct pt_regs *regs)
> +{
> +	u64 idx;
> +	u64 mmcra = regs->dsisr;
> +	u64 addr;
> +	int ret;
> +	unsigned int instr;
> +
> +	if (mmcra & POWER7_MMCRA_DCACHE_MISS) {
> +		idx = mmcra & POWER7_MMCRA_DCACHE_SRC_MASK;
> +		idx >>= POWER7_MMCRA_DCACHE_SRC_SHIFT;
> +
> +		dsrc->val |= dcache_src_map[idx];
> +		return;
> +	}
> +
> +	instr = 0;
> +	addr = perf_instruction_pointer(regs);
> +
> +	if (is_kernel_addr(addr))
> +		instr = *(unsigned int *)addr;
> +	else {
> +		pagefault_disable();
> +		ret = __get_user_inatomic(instr, (unsigned int __user *)addr);
> +		pagefault_enable();
> +		if (ret)
> +			instr = 0;
> +	}
> +	if (instr && instr_is_load_store(&instr))


Wondering if there is any possibility of getting a positive value for
"(mmcra & POWER7_MMCRA_DCACHE_SRC_MASK) >> POWER7_MMCRA_DCACHE_SRC_SHIFT"
when the marked instruction does not have the MMCRA[POWER7_MMCRA_DCACHE_MISS]
bit set. In that case we should actually compute dsrc->val as in the previous
case. I did a couple of experiments on a P7 box, but was not able to find an
instance of a marked instruction whose MMCRA[POWER7_MMCRA_DCACHE_MISS] bit was
not set while the POWER7_MMCRA_DCACHE_SRC field held a positive value.
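
If that case turned out to be possible, a hypothetical variant of the check
could trust any non-zero DCACHE_SRC field (untested sketch; the experiments
above suggest the case never actually occurs):

	idx = (mmcra & POWER7_MMCRA_DCACHE_SRC_MASK) >>
			POWER7_MMCRA_DCACHE_SRC_SHIFT;
	if ((mmcra & POWER7_MMCRA_DCACHE_MISS) || idx) {
		dsrc->val |= dcache_src_map[idx];
		return;
	}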



Re: [PATCH V2 0/6] perf: New conditional branch filter

2013-09-09 Thread Anshuman Khandual
On 09/10/2013 07:36 AM, Michael Ellerman wrote:
> On Fri, 2013-08-30 at 09:54 +0530, Anshuman Khandual wrote:
>> 	This patchset is the re-spin of the original branch stack sampling
>> patchset which introduced the new PERF_SAMPLE_BRANCH_COND filter. This
>> patchset also enables SW based branch filtering support for PPC64
>> platforms which have branch stack sampling support. With this new
>> enablement, the branch filter support for PPC64 platforms has been
>> extended to include all the combinations discussed below with a sample
>> test application program.
> 
> ...
> 
>> Mixed filters
>> -------------
>> (6) perf record -e branch-misses:u -j any_call,any_ret ./cprog
>> Error:
>> The perf.data file has no samples!
>>
>> NOTE: As expected. The HW filters all the branches which are calls, and
>> SW tries to find return branches in that given set. The two filters are
>> mutually exclusive, so obviously no samples are found in the end profile.
> 
> The semantics of multiple filters is not clear to me. It could be an OR,
> or an AND. You have implemented AND, does that match existing behaviour
> on x86 for example?

I believe it does match. The x86 code drops branch records (originally
captured in the LBR) that fail the SW filters.

Regards
Anshuman



Re: [PATCH V2 0/6] perf: New conditional branch filter

2013-10-09 Thread Anshuman Khandual
On 09/26/2013 04:44 PM, Stephane Eranian wrote:
> So you are saying that the HW filter is exclusive. That seems odd. But I
> think it is because one of the choices is ANY. ANY covers all the types
> of branches. Therefore it does not make a difference whether you add
> COND or not. And vice-versa, if you set COND, you need to disable ANY.
> I bet if you add other filters such as CALL, RETURN, then you could OR
> them and say: I want RETURNs or CALLs.
> 
> But that's okay. The API operates in OR mode, but if the HW does not
> support it, you can check the mask and reject if more than one type is
> set. That is arch-specific code. The alternative is to only capture ANY
> and emulate the filter in SW. This will work, of course. But the
> downside is that you lose the way to appreciate how many, for instance,
> COND branches you sampled out of the total number of COND branches
> retired. Unless you can count COND branches separately.

Hey Stephane,

Thanks for your reply. I am working on a solution where the PMU will process
all the requested branch filters in HW only if it can filter all of them in
an OR manner; otherwise it will leave the entire thing up to SW and do no
filtering itself. This implies that branch filtering will happen either
completely in HW or completely in SW, never in a mixed manner. This way it
conforms to the OR mode defined in the API, as sketched below. I will post
the revised patchset soon.
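
A rough sketch of that decision, in the spirit of the existing callbacks
(hw_covers_all() is an illustrative helper, not an existing function):

	u64 hw_mask = ppmu->bhrb_filter_map(branch_sample_type);

	if (hw_mask != -1 && hw_covers_all(branch_sample_type)) {
		cpuhw->bhrb_hw_filter = hw_mask;	/* filter entirely in HW */
		cpuhw->bhrb_sw_filter = 0;
	} else {
		cpuhw->bhrb_hw_filter = 0;		/* capture every branch ...  */
		cpuhw->bhrb_sw_filter = branch_sample_type; /* ... and filter in SW */
	}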

Regards
Anshuman



[PATCH V2 0/6] perf: New conditional branch filter

2013-08-29 Thread Anshuman Khandual
	sc

	# PERF_SAMPLE_BRANCH_ANY_CALL
	bl hw_1_1	# PERF_SAMPLE_BRANCH_ANY_CALL
	bl hw_1_2	# PERF_SAMPLE_BRANCH_ANY_CALL
	# PERF_SAMPLE_BRANCH_COND
	bl hw_2_1	# PERF_SAMPLE_BRANCH_ANY_CALL
	bl hw_2_2	# PERF_SAMPLE_BRANCH_ANY_CALL

	# PERF_SAMPLE_BRANCH_ANY_RET
	bl sw_3_1	# PERF_SAMPLE_BRANCH_ANY_CALL
	bl sw_3_2	# PERF_SAMPLE_BRANCH_ANY_CALL
	# PERF_SAMPLE_BRANCH_IND_CALL
	bl sw_4_1	# PERF_SAMPLE_BRANCH_ANY_CALL
	bl sw_4_2	# PERF_SAMPLE_BRANCH_ANY_CALL

	# Restore LR
	mtlr 25
	blr		# PERF_SAMPLE_BRANCH_ANY_RET

Changes in V2
-------------
(1) Enabled PPC64 SW branch filtering support
(2) Incorporated changes required for all previous comments

Anshuman Khandual (6):
  perf: New conditional branch filter criteria in branch stack sampling
  powerpc, perf: Enable conditional branch filter for POWER8
  perf, tool: Conditional branch filter 'cond' added to perf record
  x86, perf: Add conditional branch filtering support
  perf, documentation: Description for conditional branch filter
  powerpc, perf: Enable SW filtering in branch stack sampling framework

 arch/powerpc/include/asm/perf_event_server.h |   2 +-
 arch/powerpc/perf/core-book3s.c  | 200 +--
 arch/powerpc/perf/power8-pmu.c   |  25 ++--
 arch/x86/kernel/cpu/perf_event_intel_lbr.c   |   5 +
 include/uapi/linux/perf_event.h  |   3 +-
 tools/perf/Documentation/perf-record.txt |   3 +-
 tools/perf/builtin-record.c  |   1 +
 7 files changed, 216 insertions(+), 23 deletions(-)

-- 
1.7.11.7



[PATCH V2 5/6] perf, documentation: Description for conditional branch filter

2013-08-29 Thread Anshuman Khandual
Adding documentation for the conditional branch filter.

Signed-off-by: Anshuman Khandual 
Reviewed-by: Stephane Eranian 
---
 tools/perf/Documentation/perf-record.txt | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index e297b74..59ca8d0 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -163,12 +163,13 @@ following filters are defined:
 - any_call: any function call or system call
 - any_ret: any function return or system call return
 - ind_call: any indirect branch
+- cond: conditional branches
 - u:  only when the branch target is at the user level
 - k: only when the branch target is in the kernel
 - hv: only when the target is at the hypervisor level
 
 +
-The option requires at least one branch type among any, any_call, any_ret, ind_call.
+The option requires at least one branch type among any, any_call, any_ret, ind_call, cond.
 The privilege levels may be omitted, in which case, the privilege levels of the associated
 event are applied to the branch filter. Both kernel (k) and hypervisor (hv) privilege
 levels are subject to permissions.  When sampling on multiple events, branch stack sampling
-- 
1.7.11.7



[PATCH V2 4/6] x86, perf: Add conditional branch filtering support

2013-08-29 Thread Anshuman Khandual
This patch adds conditional branch filtering support, enabling
PERF_SAMPLE_BRANCH_COND in the perf branch stack sampling framework
by utilizing the available software filter X86_BR_JCC.

Signed-off-by: Anshuman Khandual 
Reviewed-by: Stephane Eranian 
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index d5be06a..9723773 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -371,6 +371,9 @@ static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
if (br_type & PERF_SAMPLE_BRANCH_NO_TX)
mask |= X86_BR_NO_TX;
 
+   if (br_type & PERF_SAMPLE_BRANCH_COND)
+   mask |= X86_BR_JCC;
+
/*
 * stash actual user request into reg, it may
 * be used by fixup code for some CPU
@@ -665,6 +668,7 @@ static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
 * NHM/WSM erratum: must include IND_JMP to capture IND_CALL
 */
[PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP,
+   [PERF_SAMPLE_BRANCH_COND] = LBR_JCC,
 };
 
 static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
@@ -676,6 +680,7 @@ static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
[PERF_SAMPLE_BRANCH_ANY_CALL]   = LBR_REL_CALL | LBR_IND_CALL
| LBR_FAR,
[PERF_SAMPLE_BRANCH_IND_CALL]   = LBR_IND_CALL,
+   [PERF_SAMPLE_BRANCH_COND]   = LBR_JCC,
 };
 
 /* core */
-- 
1.7.11.7



[PATCH V2 6/6] powerpc, perf: Enable SW filtering in branch stack sampling framework

2013-08-29 Thread Anshuman Khandual
This patch enables SW based post processing of BHRB captured branches
to meet more user defined branch filtering criteria in the perf branch
stack sampling framework. This change increases the number of filters
and their valid combinations on powerpc64 platforms with BHRB support.
A summary of the code changes is described below.

(1) struct cpu_hw_events

Introduced two new variables and modified one to track the various filters,
as sketched below.

	a) bhrb_hw_filter	Tracks PMU based HW branch filter flags.
				Computed from a PMU dependent callback.
	b) bhrb_sw_filter	Tracks SW based instruction filter flags.
				Computed from the PPC64 generic SW filter.
	c) filter_mask		Tracks the overall filter flags for PPC64.
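
A sketch of how these additions might sit in cpu_hw_events (field names
from this changelog, types assumed):

	struct cpu_hw_events {
		/* ... existing fields ... */
		u64	bhrb_hw_filter;	/* HW filter bits from ppmu->bhrb_filter_map() */
		u64	bhrb_sw_filter;	/* generic PPC64 SW instruction filter bits */
		u64	filter_mask;	/* union of filters handled in HW or SW */
	};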

(2) Creating HW event with BHRB request

The kernel would try to figure out the supported HW filters through a PMU
callback, ppmu->bhrb_filter_map(). Here it would only invalidate unsupported
HW filter combinations. In the future we could process one element of a
combination in HW and one in SW. Meanwhile cpuhw->filter_mask would track
the overall supported branch filter requests on the PMU.

The kernel would also process the user request against the available SW
filters for PPC64. Then we would process filter_mask to verify whether all
the user requested branch filters have been taken care of either in HW or
in SW.

(3) BHRB SW filter processing

During the BHRB data capture inside the PMU interrupt context, each of the
captured "perf_branch_entry.from" addresses would be checked for compliance
with the applicable SW branch filters. If an entry does not conform to the
filter requirements, it is discarded from the final perf branch stack
buffer, as sketched below.
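
An illustrative sketch of that discard loop (read_instr() is a hypothetical
helper that fetches the instruction at the 'from' address; validate_instruction()
is the filter check discussed later in this thread):

	for (i = 0, j = 0; i < cpuhw->bhrb_stack.nr; i++) {
		unsigned int instr = read_instr(cpuhw->bhrb_entries[i].from);

		if (validate_instruction(instr, cpuhw->bhrb_sw_filter))
			cpuhw->bhrb_entries[j++] = cpuhw->bhrb_entries[i];
	}
	cpuhw->bhrb_stack.nr = j;	/* keep only compliant entries */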

(4) Instruction classification for proposed SW filters

Here is the list of instruction categories which have been classified
under the proposed SW filters.

(a) PERF_SAMPLE_BRANCH_ANY_RETURN

(i) [Un]conditional branch to LR without setting the LR
(1) blr
(2) bclr
(3) btlr
(4) bflr
(5) bdnzlr
(6) bdnztlr
(7) bdnzflr
(8) bdzlr
(9) bdztlr
(10) bdzflr
(11) bltlr
(12) blelr
(13) beqlr
(14) bgelr
(15) bgtlr
(16) bnllr
(17) bnelr
(18) bnglr
(19) bsolr
(20) bnslr
(21) biclr
(22) bnilr
(23) bunlr
(24) bnulr

(b) PERF_SAMPLE_BRANCH_IND_CALL

(i) [Un]conditional branch to CTR with setting the link
(1) bctrl
(2) bcctrl
(3) btctrl
(4) bfctrl
(5) bltctrl
(6) blectrl
(7) beqctrl
(8) bgectrl
(9) bgtctrl
(10) bnlctrl
(11) bnectrl
(12) bngctrl
(13) bsoctrl
(14) bnsctrl
(15) bicctrl
(16) bnictrl
(17) bunctrl
(18) bnuctrl

(ii) [Un]conditional branch to LR setting the link
(1) bclrl
(2) blrl
(3) btlrl
(4) bflrl
(5) bdnzlrl
(6) bdnztlrl
(7) bdnzflrl
(8) bdzlrl
(9) bdztlrl
(10) bdzflrl
(11) bltlrl
(12) blelrl
(13) beqlrl
(14) bgelrl
(15) bgtlrl
(16) bnllrl
(17) bnelrl
(18) bnglrl
(19) bsolrl
(20) bnslrl
(21) biclrl
(22) bnilrl
(23) bunlrl
(24) bnulrl

(iii) [Un]conditional branch to TAR setting the link
(1) btarl
(2) bctarl

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/include/asm/perf_event_server.h |   2 +-
 arch/powerpc/perf/core-book3s.c  | 200 +

[PATCH V2 1/6] perf: New conditional branch filter criteria in branch stack sampling

2013-08-29 Thread Anshuman Khandual
POWER8 PMU based BHRB supports filtering for conditional branches.
This patch introduces a new branch filter, PERF_SAMPLE_BRANCH_COND,
which extends the existing perf ABI. Other architectures can provide
this functionality either with HW filtering support (if present) or
with SW filtering of instructions.

Signed-off-by: Anshuman Khandual 
Reviewed-by: Stephane Eranian 
---
 include/uapi/linux/perf_event.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 0b1df41..5da52b6 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -160,8 +160,9 @@ enum perf_branch_sample_type {
 	PERF_SAMPLE_BRANCH_ABORT_TX	= 1U << 7, /* transaction aborts */
 	PERF_SAMPLE_BRANCH_IN_TX	= 1U << 8, /* in transaction */
 	PERF_SAMPLE_BRANCH_NO_TX	= 1U << 9, /* not in transaction */
+	PERF_SAMPLE_BRANCH_COND		= 1U << 10, /* conditional branches */
 
-	PERF_SAMPLE_BRANCH_MAX		= 1U << 10, /* non-ABI */
+	PERF_SAMPLE_BRANCH_MAX		= 1U << 11, /* non-ABI */
 };
 
 #define PERF_SAMPLE_BRANCH_PLM_ALL \
-- 
1.7.11.7



[PATCH V2 3/6] perf, tool: Conditional branch filter 'cond' added to perf record

2013-08-29 Thread Anshuman Khandual
Adding perf record support for the new branch stack filter criterion
PERF_SAMPLE_BRANCH_COND.

Signed-off-by: Anshuman Khandual 
---
 tools/perf/builtin-record.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index ecca62e..802d11d 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -625,6 +625,7 @@ static const struct branch_mode branch_modes[] = {
BRANCH_OPT("any_call", PERF_SAMPLE_BRANCH_ANY_CALL),
BRANCH_OPT("any_ret", PERF_SAMPLE_BRANCH_ANY_RETURN),
BRANCH_OPT("ind_call", PERF_SAMPLE_BRANCH_IND_CALL),
+   BRANCH_OPT("cond", PERF_SAMPLE_BRANCH_COND),
BRANCH_END
 };
 
-- 
1.7.11.7



[PATCH V2 2/6] powerpc, perf: Enable conditional branch filter for POWER8

2013-08-29 Thread Anshuman Khandual
Enables conditional branch filter support for POWER8, utilizing the
MMCRA register based filter, and also invalidates a BHRB branch
filter combination involving conditional branches.

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/perf/power8-pmu.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c
index 2ee4a70..6e28587 100644
--- a/arch/powerpc/perf/power8-pmu.c
+++ b/arch/powerpc/perf/power8-pmu.c
@@ -580,11 +580,21 @@ static u64 power8_bhrb_filter_map(u64 branch_sample_type)
if (branch_sample_type & PERF_SAMPLE_BRANCH_IND_CALL)
return -1;
 
+   /* Invalid branch filter combination - HW does not support */
+   if ((branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) &&
+   (branch_sample_type & PERF_SAMPLE_BRANCH_COND))
+   return -1;
+
if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
pmu_bhrb_filter |= POWER8_MMCRA_IFM1;
return pmu_bhrb_filter;
}
 
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_COND) {
+   pmu_bhrb_filter |= POWER8_MMCRA_IFM3;
+   return pmu_bhrb_filter;
+   }
+
/* Every thing else is unsupported */
return -1;
 }
-- 
1.7.11.7



Re: [PATCH V2 0/6] perf: New conditional branch filter

2013-09-01 Thread Anshuman Khandual
On 08/30/2013 05:18 PM, Stephane Eranian wrote:
> 2013/8/30 Anshuman Khandual 
>> >
>> > This patchset is the re-spin of the original branch stack sampling
>> > patchset which introduced the new PERF_SAMPLE_BRANCH_COND filter. This
>> > patchset also enables SW based branch filtering support for PPC64
>> > platforms which have branch stack sampling support. With this new
>> > enablement, the branch filter support for PPC64 platforms has been
>> > extended to include all the combinations discussed below with a sample
>> > test application program.
>> >
>> >
> I am trying to understand which HW has support for capturing the
> branches: PPC7 or PPC8. Then it seems you're saying that only PPC8 has
> the filtering support. On PPC7 you use the SW filter. Did I get this
> right?
> 
> I will look at the patch set.
> 

Hey Stephane,

POWER7 does not have the BHRB support required to capture branches. Right
now only POWER8 (which has BHRB) can capture branches in HW. It has some
PMU level branch filters and the rest we have implemented in SW. But these
SW filters cannot be applied on POWER7, as it does not support branch stack
sampling due to the lack of BHRB. I mentioned PPC64 support in the sense
that this SW filtering code could be used on existing or future generation
powerpc processors which have PMU support for branch stack sampling. My
apologies if the description of the patchset was ambiguous.

Regards
Anshuman



Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-09-27 Thread Anshuman Khandual
On 09/26/2013 12:01 PM, Michael Ellerman wrote:
> +int powernv_hwrng_present(void)
> +{
> + return __raw_get_cpu_var(powernv_rng) != NULL;
> +}
> +
>  static unsigned long rng_whiten(struct powernv_rng *rng, unsigned long val)
>  {
>   unsigned long parity;
> @@ -42,6 +48,17 @@ static unsigned long rng_whiten(struct powernv_rng *rng, unsigned long val)
>   return val;
>  }
> 
> +int powernv_get_random_real_mode(unsigned long *v)
> +{
> + struct powernv_rng *rng;
> +
> + rng = __raw_get_cpu_var(powernv_rng);
> +
> + *v = rng_whiten(rng, in_rm64(rng->regs_real));
> +

Should it be in_be64() instead of in_rm64()? It's failing the build here.
Except for this, all the individual patches build correctly.

Regards
Anshuman



Re: [PATCH V2 0/6] perf: New conditional branch filter

2013-09-20 Thread Anshuman Khandual
On 08/30/2013 05:18 PM, Stephane Eranian wrote:
> 2013/8/30 Anshuman Khandual 
>> >
>> > This patchset is the re-spin of the original branch stack sampling
>> > patchset which introduced the new PERF_SAMPLE_BRANCH_COND filter. This
>> > patchset also enables SW based branch filtering support for PPC64
>> > platforms which have branch stack sampling support. With this new
>> > enablement, the branch filter support for PPC64 platforms has been
>> > extended to include all the combinations discussed below with a sample
>> > test application program.
>> >
>> >
> I am trying to understand which HW has support for capturing the
> branches: PPC7 or PPC8. Then it seems you're saying that only PPC8 has
> the filtering support. On PPC7 you use the SW filter. Did I get this
> right?
> 
> I will look at the patch set.
> 

Hey Stephane,

Just wondering if you got a chance to go though the patchset ?

Regards
Anshuman



Re: [PATCH V2 0/6] perf: New conditional branch filter

2013-09-20 Thread Anshuman Khandual
On 09/21/2013 12:11 PM, Anshuman Khandual wrote:
> On 08/30/2013 05:18 PM, Stephane Eranian wrote:
>> 2013/8/30 Anshuman Khandual 
>>>>
>>>> This patchset is the re-spin of the original branch stack sampling
>>>> patchset which introduced the new PERF_SAMPLE_BRANCH_COND filter. This
>>>> patchset also enables SW based branch filtering support for PPC64
>>>> platforms which have branch stack sampling support. With this new
>>>> enablement, the branch filter support for PPC64 platforms has been
>>>> extended to include all the combinations discussed below with a sample
>>>> test application program.
>>>>
>>>>
>> I am trying to understand which HW has support for capturing the
>> branches: PPC7 or PPC8. Then it seems you're saying that only PPC8 has
>> the filtering support. On PPC7 you use the SW filter. Did I get this
>> right?
>>
>> I will look at the patch set.
>>
> 
> Hey Stephane,
> 
> Just wondering if you got a chance to go though the patchset ?


s/though/through/



Re: [PATCH V2 0/6] perf: New conditional branch filter

2013-09-23 Thread Anshuman Khandual
On 09/21/2013 12:25 PM, Stephane Eranian wrote:
> On Tue, Sep 10, 2013 at 4:06 AM, Michael Ellerman wrote:
>> >
>> > On Fri, 2013-08-30 at 09:54 +0530, Anshuman Khandual wrote:
>>> > > 	This patchset is the re-spin of the original branch stack sampling
>>> > > patchset which introduced the new PERF_SAMPLE_BRANCH_COND filter.
>>> > > This patchset also enables SW based branch filtering support for
>>> > > PPC64 platforms which have branch stack sampling support. With this
>>> > > new enablement, the branch filter support for PPC64 platforms has
>>> > > been extended to include all the combinations discussed below with
>>> > > a sample test application program.
>> >
>> > ...
>> >
>>> > > Mixed filters
>>> > > -------------
>>> > > (6) perf record -e branch-misses:u -j any_call,any_ret ./cprog
>>> > > Error:
>>> > > The perf.data file has no samples!
>>> > >
>>> > > NOTE: As expected. The HW filters all the branches which are calls,
>>> > > and SW tries to find return branches in that given set. The two
>>> > > filters are mutually exclusive, so obviously no samples are found
>>> > > in the end profile.
>> >
>> > The semantics of multiple filters is not clear to me. It could be an OR,
>> > or an AND. You have implemented AND, does that match existing behaviour
>> > on x86 for example?
>> >
> The semantic on the API is OR. AND does not make sense: CALL & RETURN?
> On x86, the HW filter is an OR (default: ALL, set bit to disable a type).
> I suspect it is similar on PPC.

Hey Stephane,

In POWER8 BHRB, we have got three HW PMU filters, out of which we are trying
to use two: PERF_SAMPLE_BRANCH_ANY_CALL and PERF_SAMPLE_BRANCH_COND.

(1) These filters are exclusive of each other and cannot be OR-ed with each
other.

(2) The SW filters are applied on the branch record set captured in BHRB,
which already has the HW filters applied. So the working set is already
reduced by the HW PMU filters. The SW filter goes through the working set
and picks up whichever entries satisfy the SW filter criteria. The SW filter
cannot find branch records matching the criteria outside of the BHRB
captured set. So we cannot OR the filters.

This makes the combination of a HW and a SW filter inherently an "AND",
not an OR.

(3) But once we have captured the BHRB filtered data with the HW PMU filter,
multiple SW filters (if requested) can be applied in either an OR or an AND
manner.

It should be either like
	(1) (HW_FILTER_1) && (SW_FILTER_1) && (SW_FILTER_2)
or like
	(2) (HW_FILTER_1) && (SW_FILTER_1 || SW_FILTER_2)

NOTE: I admit that the current validate_instruction() function does not do
either of them correctly. Will fix it in the next iteration.

(4) These combinations of filters are not supported right now because

	(a) We are unable to process two HW PMU filters simultaneously
	(b) We have not worked on a replacement SW filter for either of
	    the HW filters

	(1) (HW_FILTER_1), (HW_FILTER_2)
	(2) (HW_FILTER_1), (HW_FILTER_2), (SW_FILTER_1)
	(3) (HW_FILTER_1), (HW_FILTER_2), (SW_FILTER_1), (SW_FILTER_2)

However, these combinations of filters can be supported right now:

	(1) (HW_FILTER_1)
	(2) (HW_FILTER_2)

	(3) (SW_FILTER_1)
	(4) (SW_FILTER_2)
	(5) (SW_FILTER_1), (SW_FILTER_2)

	(6)  (HW_FILTER_1), (SW_FILTER_1)
	(7)  (HW_FILTER_1), (SW_FILTER_2)
	(8)  (HW_FILTER_1), (SW_FILTER_1), (SW_FILTER_2)
	(9)  (HW_FILTER_2), (SW_FILTER_1)
	(10) (HW_FILTER_2), (SW_FILTER_2)
	(11) (HW_FILTER_2), (SW_FILTER_1), (SW_FILTER_2)

Given the situation as explained here, which semantic would be better for a
single HW and multiple SW filters? Accordingly, the validate_instruction()
function will have to be re-implemented. But I believe OR-ing the SW filters
is preferable.

(1) (HW_FILTER_1) && (SW_FILTER_1) && (SW_FILTER_2)
or
(2) (HW_FILTER_1) && (SW_FILTER_1 || SW_FILTER_2)
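
Written as predicates over a single captured branch record 'br' (pure
illustration, not existing code), the two candidates are:

	keep = hw_filter_1(br) && (sw_filter_1(br) && sw_filter_2(br));	/* (1) AND */
	keep = hw_filter_1(br) && (sw_filter_1(br) || sw_filter_2(br));	/* (2) OR  */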

Please let me know your inputs and suggestions on this. Thank you.

Regards
Anshuman



Re: [PATCH V2 0/6] perf: New conditional branch filter

2013-09-24 Thread Anshuman Khandual
On 09/25/2013 07:49 AM, Michael Ellerman wrote:
> On Mon, 2013-09-23 at 14:45 +0530, Anshuman Khandual wrote:
>> On 09/21/2013 12:25 PM, Stephane Eranian wrote:
>>> On Tue, Sep 10, 2013 at 4:06 AM, Michael Ellerman wrote:
>>>>>
>>>>> On Fri, 2013-08-30 at 09:54 +0530, Anshuman Khandual wrote:
>>>>>>> 	This patchset is the re-spin of the original branch stack sampling
>>>>>>> patchset which introduced the new PERF_SAMPLE_BRANCH_COND filter.
>>>>>>> This patchset also enables SW based branch filtering support for
>>>>>>> PPC64 platforms which have branch stack sampling support. With this
>>>>>>> new enablement, the branch filter support for PPC64 platforms has
>>>>>>> been extended to include all the combinations discussed below with
>>>>>>> a sample test application program.
>>>>>
>>>>> ...
>>>>>
>>>>>>> Mixed filters
>>>>>>> -------------
>>>>>>> (6) perf record -e branch-misses:u -j any_call,any_ret ./cprog
>>>>>>> Error:
>>>>>>> The perf.data file has no samples!
>>>>>>>
>>>>>>> NOTE: As expected. The HW filters all the branches which are calls,
>>>>>>> and SW tries to find return branches in that given set. The two
>>>>>>> filters are mutually exclusive, so obviously no samples are found
>>>>>>> in the end profile.
>>>>>
>>>>> The semantics of multiple filters is not clear to me. It could be an OR,
>>>>> or an AND. You have implemented AND, does that match existing behaviour
>>>>> on x86 for example?
>>>
>>> The semantic on the API is OR. AND does not make sense: CALL & RETURN?
>>> On x86, the HW filter is an OR (default: ALL, set bit to disable a
>>> type). I suspect
>>> it is similar on PPC.
>>
>> Given the situation as explained here, which semantic would be better for
>> a single HW and multiple SW filters? Accordingly, the validate_instruction()
>> function will have to be re-implemented. But I believe OR-ing the SW
>> filters is preferable.
>>
>>	(1) (HW_FILTER_1) && (SW_FILTER_1) && (SW_FILTER_2)
>>	or
>>	(2) (HW_FILTER_1) && (SW_FILTER_1 || SW_FILTER_2)
>>
>> Please let me know your inputs and suggestions on this. Thank you.
> 
> You need to implement the correct semantics, regardless of how the
> hardware happens to work.
> 
> That means if multiple filters are specified you need to do all the
> filtering in software.

Hello Stephane,

I looked at the x86 code for the branch filtering implementation.

(1) During event creation, intel_pmu_hw_config() calls
intel_pmu_setup_lbr_filter() when LBR sampling is required.
intel_pmu_setup_lbr_filter() calls these two functions:

	(a) intel_pmu_setup_sw_lbr_filter()

	"event->hw.branch_reg.reg" contains all the SW filter masks which
	can be supported for the user requested filters in
	event->attr.branch_sample_type (even if some of them could be
	implemented in the PMU HW).

	(b) intel_pmu_setup_hw_lbr_filter() (when HW filtering is present)

	"event->hw.branch_reg.config" contains all the PMU HW filter masks
	corresponding to the requested filters in
	event->attr.branch_sample_type. One point to note here is that if
	the user has requested a branch filter which is not supported by
	the HW LBR filter, the event creation request is rejected with
	EOPNOTSUPP. This is not true for the filters which can be ignored
	in the PMU.

(2) When the event is enabled in the PMU

	(a) cpuc->lbr_sel->config gets written into the HW register to
	enable the filtering of branches, as determined in
	intel_pmu_setup_hw_lbr_filter().

(3) After the IRQ has happened, intel_pmu_lbr_read() reads all the entries
from the LBR HW and then applies the filter in intel_pmu_lbr_filter().

	(a) intel_pmu_lbr_filter() takes into account cpuc->br_sel (which
	is nothing but event->hw.branch_reg.reg as determined in
	intel_pmu_setup_sw_lbr_filter()), which contains the entire branch
	filter request set in terms of applicable SW filters. Here the
	semantic is OR when we look at it from the SW filter implementation
	point of view.

	BUT what branch record set are we working on right now? A set which
	was captured with the HW LBR filter already applied.

[PATCH] powerpc, perf: Change PMU flag values representation from decimal to hex

2012-11-16 Thread Anshuman Khandual
Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/include/asm/perf_event_server.h | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
index 9710be3..e3f10bb 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -11,6 +11,7 @@
 
 #include 
 #include 
+#include 
 
 #define MAX_HWEVENTS   8
 #define MAX_EVENT_ALTERNATIVES 8
@@ -45,11 +46,21 @@ struct power_pmu {
 /*
  * Values for power_pmu.flags
  */
-#define PPMU_LIMITED_PMC5_6	1	/* PMC5/6 have limited function */
-#define PPMU_ALT_SIPR		2	/* uses alternate posn for SIPR/HV */
-#define PPMU_NO_SIPR		4	/* no SIPR/HV in MMCRA at all */
-#define PPMU_NO_CONT_SAMPLING	8	/* no continuous sampling */
-#define PPMU_SIAR_VALID		16	/* Processor has SIAR Valid bit */
+
+#define PPMU_LIMITED_PMC5_6\
+   LONG_ASM_CONST(0x0001) /* PMC5/6 have limited function */
+
+#define PPMU_ALT_SIPR  \
+   LONG_ASM_CONST(0x0002) /* uses alternate posn for SIPR/HV */
+
+#define PPMU_NO_SIPR   \
+   LONG_ASM_CONST(0x0004) /* no SIPR/HV in MMCRA at all */
+
+#define PPMU_NO_CONT_SAMPLING  \
+   LONG_ASM_CONST(0x0008) /* no continuous sampling */
+
+#define PPMU_SIAR_VALID\
+   LONG_ASM_CONST(0x0010) /* Processor has SIAR Valid bit */
 
 /*
  * Values for flags to get_alternatives()
-- 
1.7.11.7



Re: [PATCH] powerpc, perf: Change PMU flag values representation from decimal to hex

2012-11-18 Thread Anshuman Khandual
On 11/16/2012 05:12 PM, Paul Mackerras wrote:
> On Fri, Nov 16, 2012 at 02:29:04PM +0530, Anshuman Khandual wrote:
>> Signed-off-by: Anshuman Khandual 
> 
> That's not a sufficient description of why you are making this
> change.  In particular, what is the motivation for and impact of using
> LONG_ASM_CONST?
> 
> Paul.

Hey Paul,

I have just sent out a revised patch where I have updated the description
and dropped the usage of LONG_ASM_CONST.

Regards
Anshuman



Re: [PATCH V2 2/5] powerpc, perf: Add basic assembly code to read BHRB entries on POWER8

2013-04-16 Thread Anshuman Khandual
On 04/16/2013 10:53 PM, Segher Boessenkool wrote:
>> +/* r3 = n  (where n = [0-1023])
>> + * The maximum number of BHRB entries supported with PPC_MFBHRBE instruction
>> + * is 1024. We have limited number of table entries here as POWER8 implements
>> + * 32 BHRB entries.
>> + */
>> +
>> +/* .global read_bhrb */
>> +_GLOBAL(read_bhrb)
>> +	cmpldi	r3,1023
> 
> This should be 31, since that is the last entry in the table below.

Hey Segher,

Will fix this in the next version. Thanks for pointing it out.

Regards
Anshuman



Re: [PATCH V2 5/5] powerpc, perf: Enable branch stack sampling framework support with BHRB

2013-04-17 Thread Anshuman Khandual
On 04/17/2013 12:38 PM, Michael Ellerman wrote:
> On Tue, Apr 16, 2013 at 09:24:10PM +0530, Anshuman Khandual wrote:
>> This patch provides basic enablement for perf branch stack sampling framework
>> on POWER8 processor with a new PMU feature called BHRB (Branch History 
>> Rolling
>> Buffer).
>>
>> Signed-off-by: Anshuman Khandual 
>> ---
>>  arch/powerpc/perf/core-book3s.c     | 96 +++--
>>  arch/powerpc/perf/perf_event_bhrb.c | 85 
>>  2 files changed, 178 insertions(+), 3 deletions(-)
>>  create mode 100644 arch/powerpc/perf/perf_event_bhrb.c
>>
>> diff --git a/arch/powerpc/perf/core-book3s.c 
>> b/arch/powerpc/perf/core-book3s.c
>> index 4ac6e64..f4d1347 100644
>> --- a/arch/powerpc/perf/core-book3s.c
>> +++ b/arch/powerpc/perf/core-book3s.c
>> @@ -19,6 +19,8 @@
>>  #include 
>>  #include 
>>  
>> +#define BHRB_MAX_ENTRIES32
>> +
>>  struct cpu_hw_events {
>>  int n_events;
>>  int n_percpu;
>> @@ -38,11 +40,21 @@ struct cpu_hw_events {
>>  
>>  unsigned int group_flag;
>>  int n_txn_start;
>> +
>> +	/* BHRB bits */
>> +	u64	bhrb_filter;	/* BHRB HW branch filter */
>> +	int	bhrb_users;
>> +	void	*bhrb_context;
>> +	struct	perf_branch_stack	bhrb_stack;
>> +	struct	perf_branch_entry	bhrb_entries[BHRB_MAX_ENTRIES];
>>  };
>> +
>>  DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
>>  
>>  struct power_pmu *ppmu;
>>  
>> +#include "perf_event_bhrb.c"
>> +
> 
> Um, why are you doing that?
> 

There was no specific reason for that.

> If it's code that should be in a header, put it in a header. If it's C
> code, add it to the Makefile.

Sure, I will move the new code into the Makefile. Thanks!

Regards
Anshuman

> 
> cheers
> ___
> Linuxppc-dev mailing list
> linuxppc-...@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 



Re: [PATCH V2 5/5] powerpc, perf: Enable branch stack sampling framework support with BHRB

2013-04-17 Thread Anshuman Khandual
On 04/17/2013 05:37 PM, Anshuman Khandual wrote:
> On 04/17/2013 12:38 PM, Michael Ellerman wrote:
>> On Tue, Apr 16, 2013 at 09:24:10PM +0530, Anshuman Khandual wrote:
>>> This patch provides basic enablement for perf branch stack sampling
>>> framework on POWER8 processor with a new PMU feature called BHRB
>>> (Branch History Rolling Buffer).
>>>
>>> Signed-off-by: Anshuman Khandual 
>>> ---
>>>  arch/powerpc/perf/core-book3s.c     | 96 +++--
>>>  arch/powerpc/perf/perf_event_bhrb.c | 85 
>>>  2 files changed, 178 insertions(+), 3 deletions(-)
>>>  create mode 100644 arch/powerpc/perf/perf_event_bhrb.c
>>>
>>> diff --git a/arch/powerpc/perf/core-book3s.c 
>>> b/arch/powerpc/perf/core-book3s.c
>>> index 4ac6e64..f4d1347 100644
>>> --- a/arch/powerpc/perf/core-book3s.c
>>> +++ b/arch/powerpc/perf/core-book3s.c
>>> @@ -19,6 +19,8 @@
>>>  #include 
>>>  #include 
>>>  
>>> +#define BHRB_MAX_ENTRIES   32
>>> +
>>>  struct cpu_hw_events {
>>> int n_events;
>>> int n_percpu;
>>> @@ -38,11 +40,21 @@ struct cpu_hw_events {
>>>  
>>> unsigned int group_flag;
>>> int n_txn_start;
>>> +
>>> +	/* BHRB bits */
>>> +	u64	bhrb_filter;	/* BHRB HW branch filter */
>>> +	int	bhrb_users;
>>> +	void	*bhrb_context;
>>> +	struct	perf_branch_stack	bhrb_stack;
>>> +	struct	perf_branch_entry	bhrb_entries[BHRB_MAX_ENTRIES];
>>>  };
>>> +
>>>  DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
>>>  
>>>  struct power_pmu *ppmu;
>>>  
>>> +#include "perf_event_bhrb.c"
>>> +
>>
>> Um, why are you doing that?
>>
> 
> There was no specific reason for that.
> 

Ahh, I remember it now. The function in the new file uses the "cpu_hw_events"
structure, which is passed in during the "record_and_restart" data capture
phase. Right now cpu_hw_events is not defined in a header file but inside
core-book3s.c itself. A solution to this problem could be any of these:

(0) Move all the code from the new file perf_event_bhrb.c into core-book3s.c
(1) Move the cpu_hw_events structure to perf_event_server.h
(2) Create an additional BHRB processing function inside struct power_pmu and
    define it for P8 inside power8_pmu.c

Regards
Anshuman



[PATCH V3 3/5] powerpc, perf: Add new BHRB related generic functions, data and flags

2013-04-18 Thread Anshuman Khandual
This patch adds a couple of generic functions to the power_pmu structure
which configure the BHRB and its filters. It also adds a representation of
the number of BHRB entries present on the PMU. A new PMU flag, PPMU_BHRB,
indicates the presence of the BHRB feature.

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/include/asm/perf_event_server.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
index 57b42da..3f0c15c 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -33,6 +33,8 @@ struct power_pmu {
unsigned long *valp);
int (*get_alternatives)(u64 event_id, unsigned int flags,
u64 alt[]);
+   u64 (*bhrb_filter_map)(u64 branch_sample_type);
+   void(*config_bhrb)(u64 pmu_bhrb_filter);
void(*disable_pmc)(unsigned int pmc, unsigned long mmcr[]);
int (*limited_pmc_event)(u64 event_id);
u32 flags;
@@ -42,6 +44,9 @@ struct power_pmu {
int (*cache_events)[PERF_COUNT_HW_CACHE_MAX]
   [PERF_COUNT_HW_CACHE_OP_MAX]
   [PERF_COUNT_HW_CACHE_RESULT_MAX];
+
+   /* BHRB entries in the PMU */
+   int bhrb_nr;
 };
 
 /*
@@ -54,6 +59,7 @@ struct power_pmu {
 #define PPMU_SIAR_VALID		0x0010 /* Processor has SIAR Valid bit */
 #define PPMU_HAS_SSLOT 0x0020 /* Has sampled slot in MMCRA */
 #define PPMU_HAS_SIER  0x0040 /* Has SIER */
+#define PPMU_BHRB  0x0080 /* has BHRB feature enabled */
 
 /*
  * Values for flags to get_alternatives()
-- 
1.7.11.7



[PATCH V3 1/5] powerpc, perf: Add new BHRB related instructions for POWER8

2013-04-18 Thread Anshuman Khandual
This patch adds new POWER8 instruction encodings for reading the BHRB
buffer entries and also for clearing it. The encoding for the "clrbhrb"
instruction is straightforward, but the "mfbhrbe" encoding involves
reading a certain index of the BHRB buffer into a particular GPR register.

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/include/asm/ppc-opcode.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index 8752bc8..93ae5a1 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -82,6 +82,7 @@
 #define __REGA0_R31	31
 
 /* sorted alphabetically */
+#define PPC_INST_BHRBE 0x7c00025c
 #define PPC_INST_DCBA  0x7c0005ec
 #define PPC_INST_DCBA_MASK 0xfc0007fe
 #define PPC_INST_DCBAL 0x7c2005ec
@@ -297,6 +298,12 @@
 #define PPC_NAP		stringify_in_c(.long PPC_INST_NAP)
 #define PPC_SLEEP	stringify_in_c(.long PPC_INST_SLEEP)
 
+/* BHRB instructions */
+#define PPC_CLRBHRB	stringify_in_c(.long 0x7c00035c)
+#define PPC_MFBHRBE(r, n)	stringify_in_c(.long PPC_INST_BHRBE | \
+				__PPC_RS(r) | \
+				(((n) & 0x1f) << 11))
+
 /* Transactional memory instructions */
 #define TRECHKPT   stringify_in_c(.long PPC_INST_TRECHKPT)
 #define TRECLAIM(r)stringify_in_c(.long PPC_INST_TRECLAIM \
-- 
1.7.11.7
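
As a worked example of the mfbhrbe encoding above (assuming __PPC_RS()
places the register field at bit 21, as elsewhere in ppc-opcode.h):

	/* PPC_MFBHRBE(R3, 5): read BHRB entry 5 into r3 */
	  0x7c00025c	/* PPC_INST_BHRBE base opcode */
	| (3 << 21)	/* __PPC_RS(R3)               */
	| (5 << 11)	/* entry number field         */
	= 0x7c602a5c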



[PATCH V3 5/5] powerpc, perf: Enable branch stack sampling framework

2013-04-18 Thread Anshuman Khandual
Provides basic enablement for the perf branch stack sampling framework on
POWER8 processor based platforms. Adds new BHRB related elements into the
cpu_hw_events structure to represent the current BHRB config and BHRB
filter configuration, to manage context, and to hold the output BHRB
buffer during the PMU interrupt before passing it to user space. This also
enables processing of BHRB data and converts it into the generic perf
branch stack data format.

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/include/asm/perf_event_server.h |   1 +
 arch/powerpc/perf/core-book3s.c  | 167 ++-
 2 files changed, 165 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
index 3f0c15c..f265049 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -73,6 +73,7 @@ extern int register_power_pmu(struct power_pmu *);
 struct pt_regs;
 extern unsigned long perf_misc_flags(struct pt_regs *regs);
 extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
+extern unsigned long int read_bhrb(int n);
 
 /*
  * Only override the default definitions in include/linux/perf_event.h
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 4ac6e64..c627843 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -19,6 +19,11 @@
 #include 
 #include 
 
+#define BHRB_MAX_ENTRIES	32
+#define BHRB_TARGET		0x0000000000000002
+#define BHRB_PREDICTION		0x0000000000000001
+#define BHRB_EA			0xFFFFFFFFFFFFFFFC
+
 struct cpu_hw_events {
int n_events;
int n_percpu;
@@ -38,7 +43,15 @@ struct cpu_hw_events {
 
unsigned int group_flag;
int n_txn_start;
+
+	/* BHRB bits */
+	u64	bhrb_filter;	/* BHRB HW branch filter */
+	int	bhrb_users;
+	void	*bhrb_context;
+	struct	perf_branch_stack	bhrb_stack;
+	struct	perf_branch_entry	bhrb_entries[BHRB_MAX_ENTRIES];
 };
+
 DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
 
 struct power_pmu *ppmu;
@@ -858,6 +871,9 @@ static void power_pmu_enable(struct pmu *pmu)
}
 
  out:
+   if (cpuhw->bhrb_users)
+   ppmu->config_bhrb(cpuhw->bhrb_filter);
+
local_irq_restore(flags);
 }
 
@@ -888,6 +904,47 @@ static int collect_events(struct perf_event *group, int max_count,
return n;
 }
 
+/* Reset all possible BHRB entries */
+static void power_pmu_bhrb_reset(void)
+{
+   asm volatile(PPC_CLRBHRB);
+}
+
+void power_pmu_bhrb_enable(struct perf_event *event)
+{
+   struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events);
+
+   if (!ppmu->bhrb_nr)
+   return;
+
+   /* Clear BHRB if we changed task context to avoid data leaks */
+   if (event->ctx->task && cpuhw->bhrb_context != event->ctx) {
+   power_pmu_bhrb_reset();
+   cpuhw->bhrb_context = event->ctx;
+   }
+   cpuhw->bhrb_users++;
+}
+
+void power_pmu_bhrb_disable(struct perf_event *event)
+{
+   struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events);
+
+   if (!ppmu->bhrb_nr)
+   return;
+
+   cpuhw->bhrb_users--;
+   WARN_ON_ONCE(cpuhw->bhrb_users < 0);
+
+	if (!cpuhw->disabled && !cpuhw->bhrb_users) {
+		/* BHRB cannot be turned off when other
+		 * events are active on the PMU.
+		 */
+
+		/* avoid stale pointer */
+		cpuhw->bhrb_context = NULL;
+	}
+}
+
 /*
  * Add a event to the PMU.
  * If all events are not already frozen, then we disable and
@@ -947,6 +1004,9 @@ nocheck:
 
ret = 0;
  out:
+   if (has_branch_stack(event))
+   power_pmu_bhrb_enable(event);
+
perf_pmu_enable(event->pmu);
local_irq_restore(flags);
return ret;
@@ -999,6 +1059,9 @@ static void power_pmu_del(struct perf_event *event, int ef_flags)
cpuhw->mmcr[0] &= ~(MMCR0_PMXE | MMCR0_FCECE);
}
 
+   if (has_branch_stack(event))
+   power_pmu_bhrb_disable(event);
+
perf_pmu_enable(event->pmu);
local_irq_restore(flags);
 }
@@ -1117,6 +1180,15 @@ int power_pmu_commit_txn(struct pmu *pmu)
return 0;
 }
 
+/* Called from ctxsw to prevent one process's branch entries to
+ * mingle with the other process's entries during context switch.
+ */
+void power_pmu_flush_branch_stack(void)
+{
+   if (ppmu->bhrb_nr)
+   power_pmu_bhrb_reset();
+}
+
 /*
  * Return 1 if we might be able to put event on a limited PMC,
  * or 0 if not.
@@ -1231,9 +1303,11 @@ static int power_pmu_event_init(struct perf_event *event)
if (

[PATCH V3 0/5] powerpc, perf: BHRB based branch stack enablement on POWER8

2013-04-18 Thread Anshuman Khandual
Branch History Rolling Buffer (BHRB) is a new PMU feature in the IBM
POWER8 processor which records branch instructions inside the execution
pipeline. This patchset enables the basic functionality of the feature
through the generic perf branch stack sampling framework.

Sample output
-
$./perf record -b top
$./perf report

# Overhead  Command  Source Shared Object  Source Symbol      Target Shared Object  Target Symbol
# ........  .......  ....................  .................  ....................  .................
#
     7.82%  top      libc-2.11.2.so        [k] _IO_vfscanf    libc-2.11.2.so        [k] _IO_vfscanf
     6.17%  top      libc-2.11.2.so        [k] _IO_vfscanf    [unknown]             [k] 
     2.37%  top      [unknown]             [k] 0xf7aafb30     [unknown]             [k] 
     1.80%  top      [unknown]             [k] 0x0fe07978     libc-2.11.2.so        [k] _IO_vfscanf
     1.60%  top      libc-2.11.2.so        [k] _IO_vfscanf    [kernel.kallsyms]     [k] .do_task_stat
     1.20%  top      [kernel.kallsyms]     [k] .do_task_stat  [kernel.kallsyms]     [k] .do_task_stat
     1.02%  top      libc-2.11.2.so        [k] vfprintf       libc-2.11.2.so        [k] vfprintf
     0.92%  top      top                   [k] _init          [unknown]             [k] 0x0fe037f4

Changes in V2
-------------
- Added copyright messages to the newly created files
- Modified a couple of commit messages

Changes in V3
-------------
- Incorporated review comments from Segher https://lkml.org/lkml/2013/4/16/350
- Worked on a solution for the review comment from Michael Ellerman
  https://lkml.org/lkml/2013/4/17/548
- Could not move the updated cpu_hw_events structure from core-book3s.c into
  perf_event_server.h, because perf_event_server.h is pulled in first inside
  linux/perf_event.h, before the definition of the perf_branch_entry
  structure. That is why the perf_branch_entry definition is not available
  inside perf_event_server.h, where we would define the array inside the
  cpu_hw_events structure.
- Finally have pulled in the code from perf_event_bhrb.c into core-book3s.c
- Improved documentation for the patchset


Anshuman Khandual (5):
  powerpc, perf: Add new BHRB related instructions for POWER8
  powerpc, perf: Add basic assembly code to read BHRB entries on POWER8
  powerpc, perf: Add new BHRB related generic functions, data and flags
  powerpc, perf: Define BHRB generic functions, data and flags for POWER8
  powerpc, perf: Enable branch stack sampling framework

 arch/powerpc/include/asm/perf_event_server.h |   7 ++
 arch/powerpc/include/asm/ppc-opcode.h|   7 ++
 arch/powerpc/perf/Makefile   |   2 +-
 arch/powerpc/perf/bhrb.S |  44 +++
 arch/powerpc/perf/core-book3s.c  | 167 ++-
 arch/powerpc/perf/power8-pmu.c   |  57 -
 6 files changed, 279 insertions(+), 5 deletions(-)
 create mode 100644 arch/powerpc/perf/bhrb.S

-- 
1.7.11.7



[PATCH V3 4/5] powerpc, perf: Define BHRB generic functions, data and flags for POWER8

2013-04-18 Thread Anshuman Khandual
This patch populates the BHRB specific data for the power_pmu structure.
It also implements the POWER8 specific BHRB filter and configuration
functions.

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/perf/power8-pmu.c | 57 +-
 1 file changed, 56 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c
index 106ae0b..153408c 100644
--- a/arch/powerpc/perf/power8-pmu.c
+++ b/arch/powerpc/perf/power8-pmu.c
@@ -109,6 +109,16 @@
 #define EVENT_IS_MARKED		(EVENT_MARKED_MASK << EVENT_MARKED_SHIFT)
 #define EVENT_PSEL_MASK		0xff	/* PMCxSEL value */
 
+/* MMCRA IFM bits - POWER8 */
+#define	POWER8_MMCRA_IFM1	0x0000000040000000UL
+#define	POWER8_MMCRA_IFM2	0x0000000080000000UL
+#define	POWER8_MMCRA_IFM3	0x00000000C0000000UL
+
+#define ONLY_PLM \
+   (PERF_SAMPLE_BRANCH_USER|\
+PERF_SAMPLE_BRANCH_KERNEL  |\
+PERF_SAMPLE_BRANCH_HV)
+
 /*
  * Layout of constraint bits:
  *
@@ -428,6 +438,48 @@ static int power8_generic_events[] = {
[PERF_COUNT_HW_BRANCH_MISSES] = PM_BR_MPRED_CMPL,
 };
 
+static u64 power8_bhrb_filter_map(u64 branch_sample_type)
+{
+   u64 pmu_bhrb_filter = 0;
+   u64 br_privilege = branch_sample_type & ONLY_PLM;
+
+	/* BHRB and regular PMU events share the same privilege state
+	 * filter configuration. BHRB is always recorded along with a
+	 * regular PMU event. So the privilege state filter criteria for
+	 * BHRB and the companion PMU events have to be the same. As a
+	 * default, the "perf record" tool sets all privilege bits ON when
+	 * no filter criteria is provided on the command line. So as long
+	 * as all privilege bits are ON, or they are all OFF, we are good
+	 * to go.
+	 */
+   if ((br_privilege != 7) && (br_privilege != 0))
+   return -1;
+
+   /* No branch filter requested */
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY)
+   return pmu_bhrb_filter;
+
+   /* Invalid branch filter options - HW does not support */
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
+   return -1;
+
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_IND_CALL)
+   return -1;
+
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
+   pmu_bhrb_filter |= POWER8_MMCRA_IFM1;
+   return pmu_bhrb_filter;
+   }
+
+   /* Every thing else is unsupported */
+   return -1;
+}
+
+static void power8_config_bhrb(u64 pmu_bhrb_filter)
+{
+   /* Enable BHRB filter in PMU */
+   mtspr(SPRN_MMCRA, (mfspr(SPRN_MMCRA) | pmu_bhrb_filter));
+}
+
 static struct power_pmu power8_pmu = {
.name   = "POWER8",
.n_counter  = 6,
@@ -435,12 +487,15 @@ static struct power_pmu power8_pmu = {
.add_fields = POWER8_ADD_FIELDS,
.test_adder = POWER8_TEST_ADDER,
.compute_mmcr   = power8_compute_mmcr,
+   .config_bhrb= power8_config_bhrb,
+   .bhrb_filter_map= power8_bhrb_filter_map,
.get_constraint = power8_get_constraint,
.disable_pmc= power8_disable_pmc,
-   .flags  = PPMU_HAS_SSLOT | PPMU_HAS_SIER,
+   .flags  = PPMU_HAS_SSLOT | PPMU_HAS_SIER | PPMU_BHRB,
.n_generic  = ARRAY_SIZE(power8_generic_events),
.generic_events = power8_generic_events,
.attr_groups= power8_pmu_attr_groups,
+   .bhrb_nr= 32,
 };
 
 static int __init init_power8_pmu(void)
-- 
1.7.11.7



[PATCH V3 2/5] powerpc, perf: Add basic assembly code to read BHRB entries on POWER8

2013-04-18 Thread Anshuman Khandual
This patch adds the basic assembly code to read the BHRB buffer. BHRB entries
are valid only after a PMU interrupt has happened (when MMCR0[PMAO]=1)
and the BHRB has been frozen. A BHRB read should not be attempted while it
is still enabled (MMCR0[PMAE]=1) and getting updated, as this can produce
non-deterministic results.
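
As an illustration of when a read is legal, the guard boils down to
something like the following sketch (the helper name and the
MMCR0_PMAO/MMCR0_PMAE macro names are assumptions here, not part of
this patch):

	/* Sketch only: BHRB entries may be read once the PMU interrupt
	 * has fired (MMCR0[PMAO] set) and BHRB updates are frozen
	 * (MMCR0[PMAE] clear). Reading while PMAE is still set can
	 * return non-deterministic values.
	 */
	static inline bool bhrb_read_allowed(void)
	{
		unsigned long mmcr0 = mfspr(SPRN_MMCR0);

		return (mmcr0 & MMCR0_PMAO) && !(mmcr0 & MMCR0_PMAE);
	}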

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/perf/Makefile |  2 +-
 arch/powerpc/perf/bhrb.S   | 44 
 2 files changed, 45 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/perf/bhrb.S

diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
index 472db18..510fae1 100644
--- a/arch/powerpc/perf/Makefile
+++ b/arch/powerpc/perf/Makefile
@@ -2,7 +2,7 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror
 
 obj-$(CONFIG_PERF_EVENTS)  += callchain.o
 
-obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o
+obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o bhrb.o
 obj64-$(CONFIG_PPC_PERF_CTRS)  += power4-pmu.o ppc970-pmu.o power5-pmu.o \
   power5+-pmu.o power6-pmu.o power7-pmu.o \
   power8-pmu.o
diff --git a/arch/powerpc/perf/bhrb.S b/arch/powerpc/perf/bhrb.S
new file mode 100644
index 000..d85f9a5
--- /dev/null
+++ b/arch/powerpc/perf/bhrb.S
@@ -0,0 +1,44 @@
+/*
+ * Basic assembly code to read BHRB entries
+ *
+ * Copyright 2013 Anshuman Khandual, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+#include 
+#include 
+
+   .text
+
+.balign 8
+
+/* r3 = n  (where n = [0-31])
+ * The maximum number of BHRB entries supported with PPC_MFBHRBE instruction
+ * is 1024. We have limited the number of table entries here as POWER8 implements
+ * 32 BHRB entries.
+ */
+
+/* .global read_bhrb */
+_GLOBAL(read_bhrb)
+   cmpldi  r3,31
+   bgt 1f
+   ld  r4,bhrb_table@got(r2)
+   sldir3,r3,3
+   add r3,r4,r3
+   mtctr   r3
+   bctr
+1: li  r3,0
+   blr
+
+#define MFBHRB_TABLE1(n) PPC_MFBHRBE(R3,n); blr
+#define MFBHRB_TABLE2(n) MFBHRB_TABLE1(n); MFBHRB_TABLE1(n+1)
+#define MFBHRB_TABLE4(n) MFBHRB_TABLE2(n); MFBHRB_TABLE2(n+2)
+#define MFBHRB_TABLE8(n) MFBHRB_TABLE4(n); MFBHRB_TABLE4(n+4)
+#define MFBHRB_TABLE16(n) MFBHRB_TABLE8(n); MFBHRB_TABLE8(n+8)
+#define MFBHRB_TABLE32(n) MFBHRB_TABLE16(n); MFBHRB_TABLE16(n+16)
+
+bhrb_table:
+   MFBHRB_TABLE32(0)
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V3 1/5] powerpc, perf: Add new BHRB related instructions for POWER8

2013-04-21 Thread Anshuman Khandual
On 04/22/2013 05:11 AM, Michael Ellerman wrote:
> On Thu, Apr 18, 2013 at 05:56:12PM +0530, Anshuman Khandual wrote:
>> This patch adds new POWER8 instruction encoding for reading
>> the BHRB buffer entries and also clearing it. Encoding for
>> "clrbhrb" instruction is straight forward.
> 
> Which is "clear branch history rolling buffer" ?
> 
>> But "mfbhrbe"
>> encoding involves reading a certain index of BHRB buffer
>> into a particular GPR register.
> 
> And "Move from branch history rolling buffer entry" ?
> 

Sure, would add these descriptions in the change log.

>> diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
>> b/arch/powerpc/include/asm/ppc-opcode.h
>> index 8752bc8..93ae5a1 100644
>> --- a/arch/powerpc/include/asm/ppc-opcode.h
>> +++ b/arch/powerpc/include/asm/ppc-opcode.h
>> @@ -82,6 +82,7 @@
>>  #define __REGA0_R31 31
>>  
>>  /* sorted alphabetically */
>> +#define PPC_INST_BHRBE  0x7c00025c
> 
> I don't think you really need this, just use the literal value below.
> 
>> @@ -297,6 +298,12 @@
>>  #define PPC_NAP stringify_in_c(.long PPC_INST_NAP)
>>  #define PPC_SLEEP   stringify_in_c(.long PPC_INST_SLEEP)
>>  
>> +/* BHRB instructions */
>> +#define PPC_CLRBHRB stringify_in_c(.long 0x7c00035c)
>> +#define PPC_MFBHRBE(r, n)   stringify_in_c(.long PPC_INST_BHRBE | \
>> +__PPC_RS(r) | \
>> +(((n) & 0x1f) << 11))
> 
> Why are you not using ___PPC_RB(n) here ?
> 

I would replace __PPC_RS(r) with __PPC_RT(r), which makes more sense from
an instruction encoding point of view.

Regards
Anshuman

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V3 1/5] powerpc, perf: Add new BHRB related instructions for POWER8

2013-04-22 Thread Anshuman Khandual
On 04/22/2013 08:20 AM, Michael Neuling wrote:
> Michael Ellerman  wrote:
> 
>> On Mon, Apr 22, 2013 at 11:13:43AM +1000, Michael Neuling wrote:
>>> Michael Ellerman  wrote:
>>>
>>>> On Thu, Apr 18, 2013 at 05:56:12PM +0530, Anshuman Khandual wrote:
>>>>> This patch adds new POWER8 instruction encoding for reading
>>>>> the BHRB buffer entries and also clearing it. Encoding for
>>>>> "clrbhrb" instruction is straight forward.
>>>>
>>>> Which is "clear branch history rolling buffer" ?
>>>>
>>>>> But "mfbhrbe"
>>>>> encoding involves reading a certain index of BHRB buffer
>>>>> into a particular GPR register.
>>>>
>>>> And "Move from branch history rolling buffer entry" ?
>>>>
>>>>> diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
>>>>> b/arch/powerpc/include/asm/ppc-opcode.h
>>>>> index 8752bc8..93ae5a1 100644
>>>>> --- a/arch/powerpc/include/asm/ppc-opcode.h
>>>>> +++ b/arch/powerpc/include/asm/ppc-opcode.h
>>>>> @@ -82,6 +82,7 @@
>>>>>  #define  __REGA0_R31 31
>>>>>  
>>>>>  /* sorted alphabetically */
>>>>> +#define PPC_INST_BHRBE   0x7c00025c
>>>>
>>>> I don't think you really need this, just use the literal value below.
>>>
>>> The rest of the defines in this file do this, so Anshuman's right. 
>>
>> I don't see the point, but sure let's be consistent. Though in that case
>> he should do the same for PPC_CLRBHRB below.
> 
> Agreed.
> 

Sure, would define a new macro (PPC_INST_CLRBHRB) to encode 0x7c00035c
before using it for PPC_CLRBHRB.

> Mikey
> 
>>
>>>>> @@ -297,6 +298,12 @@
>>>>>  #define PPC_NAP  stringify_in_c(.long PPC_INST_NAP)
>>>>>  #define PPC_SLEEPstringify_in_c(.long PPC_INST_SLEEP)
>>>>>  
>>>>> +/* BHRB instructions */
>>>>> +#define PPC_CLRBHRB  stringify_in_c(.long 0x7c00035c)
>>>>> +#define PPC_MFBHRBE(r, n)stringify_in_c(.long PPC_INST_BHRBE | \
>>>>> + __PPC_RS(r) | \
>>>>> + (((n) & 0x1f) << 11))
>>>>
>>>> Why are you not using ___PPC_RB(n) here ?
>>>
>>> Actually, this is wrong.  The number field should be 10 bits (0x3ff),
>>> not 5 (0x1f)  Anshuman please fix.
>>
>> ACK.

I got it wrong, I thought this was 32 instead of 1024. Would fix it.

Regards
Anshuman

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH V4 3/5] powerpc, perf: Add new BHRB related generic functions, data and flags

2013-04-22 Thread Anshuman Khandual
This patch adds a couple of generic function pointers to the power_pmu
structure which configure the BHRB and its filters. It also adds a field
representing the number of BHRB entries present on the PMU. A new PMU
flag, PPMU_BHRB, indicates the presence of the BHRB feature.
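
As a rough sketch of the intended calling convention (the helper name
setup_bhrb_filter() is made up here; the real consumer lands with the
branch stack enablement patch later in this series):

	/* Sketch only: translate the generic branch_sample_type into a
	 * HW filter value through the new callback, bailing out when the
	 * requested combination is unsupported (-1). The resulting value
	 * is programmed into the PMU later via ppmu->config_bhrb().
	 */
	static int setup_bhrb_filter(struct cpu_hw_events *cpuhw,
				     struct perf_event *event)
	{
		u64 filter;

		filter = ppmu->bhrb_filter_map(event->attr.branch_sample_type);
		if (filter == (u64)-1)
			return -EOPNOTSUPP;

		cpuhw->bhrb_filter = filter;
		return 0;
	}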

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/include/asm/perf_event_server.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/include/asm/perf_event_server.h 
b/arch/powerpc/include/asm/perf_event_server.h
index 57b42da..3f0c15c 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -33,6 +33,8 @@ struct power_pmu {
unsigned long *valp);
int (*get_alternatives)(u64 event_id, unsigned int flags,
u64 alt[]);
+   u64 (*bhrb_filter_map)(u64 branch_sample_type);
+   void(*config_bhrb)(u64 pmu_bhrb_filter);
void(*disable_pmc)(unsigned int pmc, unsigned long mmcr[]);
int (*limited_pmc_event)(u64 event_id);
u32 flags;
@@ -42,6 +44,9 @@ struct power_pmu {
int (*cache_events)[PERF_COUNT_HW_CACHE_MAX]
   [PERF_COUNT_HW_CACHE_OP_MAX]
   [PERF_COUNT_HW_CACHE_RESULT_MAX];
+
+   /* BHRB entries in the PMU */
+   int bhrb_nr;
 };
 
 /*
@@ -54,6 +59,7 @@ struct power_pmu {
 #define PPMU_SIAR_VALID		0x0010 /* Processor has SIAR Valid bit */
 #define PPMU_HAS_SSLOT 0x0020 /* Has sampled slot in MMCRA */
 #define PPMU_HAS_SIER  0x0040 /* Has SIER */
+#define PPMU_BHRB  0x0080 /* has BHRB feature enabled */
 
 /*
  * Values for flags to get_alternatives()
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH V4 4/5] powerpc, perf: Define BHRB generic functions, data and flags for POWER8

2013-04-22 Thread Anshuman Khandual
This patch populates the BHRB specific data in the power_pmu structure. It
also implements the POWER8 specific BHRB filter and configuration functions.

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/perf/power8-pmu.c | 57 +-
 1 file changed, 56 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c
index 106ae0b..153408c 100644
--- a/arch/powerpc/perf/power8-pmu.c
+++ b/arch/powerpc/perf/power8-pmu.c
@@ -109,6 +109,16 @@
 #define EVENT_IS_MARKED		(EVENT_MARKED_MASK << EVENT_MARKED_SHIFT)
 #define EVENT_PSEL_MASK		0xff	/* PMCxSEL value */
 
+/* MMCRA IFM bits - POWER8 */
+#define POWER8_MMCRA_IFM1	0x4000UL
+#define POWER8_MMCRA_IFM2	0x8000UL
+#define POWER8_MMCRA_IFM3	0xC000UL
+
+#define ONLY_PLM \
+   (PERF_SAMPLE_BRANCH_USER|\
+PERF_SAMPLE_BRANCH_KERNEL  |\
+PERF_SAMPLE_BRANCH_HV)
+
 /*
  * Layout of constraint bits:
  *
@@ -428,6 +438,48 @@ static int power8_generic_events[] = {
[PERF_COUNT_HW_BRANCH_MISSES] = PM_BR_MPRED_CMPL,
 };
 
+static u64 power8_bhrb_filter_map(u64 branch_sample_type)
+{
+   u64 pmu_bhrb_filter = 0;
+   u64 br_privilege = branch_sample_type & ONLY_PLM;
+
+   /* BHRB and regular PMU events share the same privilege state
+    * filter configuration. BHRB is always recorded along with a
+    * regular PMU event. So the privilege state filter criteria for
+    * BHRB and the companion PMU events have to be the same. By
+    * default the "perf record" tool sets all privilege bits ON when
+    * no filter criteria is provided on the command line. So as long
+    * as all privilege bits are ON or they are all OFF, we are good
+    * to go.
+    */
+   if ((br_privilege != 7) && (br_privilege != 0))
+   return -1;
+
+   /* No branch filter requested */
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY)
+   return pmu_bhrb_filter;
+
+   /* Invalid branch filter options - HW does not support */
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
+   return -1;
+
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_IND_CALL)
+   return -1;
+
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
+   pmu_bhrb_filter |= POWER8_MMCRA_IFM1;
+   return pmu_bhrb_filter;
+   }
+
+   /* Everything else is unsupported */
+   return -1;
+}
+
+static void power8_config_bhrb(u64 pmu_bhrb_filter)
+{
+   /* Enable BHRB filter in PMU */
+   mtspr(SPRN_MMCRA, (mfspr(SPRN_MMCRA) | pmu_bhrb_filter));
+}
+
 static struct power_pmu power8_pmu = {
.name   = "POWER8",
.n_counter  = 6,
@@ -435,12 +487,15 @@ static struct power_pmu power8_pmu = {
.add_fields = POWER8_ADD_FIELDS,
.test_adder = POWER8_TEST_ADDER,
.compute_mmcr   = power8_compute_mmcr,
+   .config_bhrb= power8_config_bhrb,
+   .bhrb_filter_map= power8_bhrb_filter_map,
.get_constraint = power8_get_constraint,
.disable_pmc= power8_disable_pmc,
-   .flags  = PPMU_HAS_SSLOT | PPMU_HAS_SIER,
+   .flags  = PPMU_HAS_SSLOT | PPMU_HAS_SIER | PPMU_BHRB,
.n_generic  = ARRAY_SIZE(power8_generic_events),
.generic_events = power8_generic_events,
.attr_groups= power8_pmu_attr_groups,
+   .bhrb_nr= 32,
 };
 
 static int __init init_power8_pmu(void)
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH V4 2/5] powerpc, perf: Add basic assembly code to read BHRB entries on POWER8

2013-04-22 Thread Anshuman Khandual
This patch adds the basic assembly code to read the BHRB buffer. BHRB entries
are valid only after a PMU interrupt has happened (when MMCR0[PMAO]=1)
and the BHRB has been frozen. A BHRB read should not be attempted while it
is still enabled (MMCR0[PMAE]=1) and getting updated, as this can produce
non-deterministic results.

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/perf/Makefile |  2 +-
 arch/powerpc/perf/bhrb.S   | 44 
 2 files changed, 45 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/perf/bhrb.S

diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
index 472db18..510fae1 100644
--- a/arch/powerpc/perf/Makefile
+++ b/arch/powerpc/perf/Makefile
@@ -2,7 +2,7 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror
 
 obj-$(CONFIG_PERF_EVENTS)  += callchain.o
 
-obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o
+obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o bhrb.o
 obj64-$(CONFIG_PPC_PERF_CTRS)  += power4-pmu.o ppc970-pmu.o power5-pmu.o \
   power5+-pmu.o power6-pmu.o power7-pmu.o \
   power8-pmu.o
diff --git a/arch/powerpc/perf/bhrb.S b/arch/powerpc/perf/bhrb.S
new file mode 100644
index 000..d85f9a5
--- /dev/null
+++ b/arch/powerpc/perf/bhrb.S
@@ -0,0 +1,44 @@
+/*
+ * Basic assembly code to read BHRB entries
+ *
+ * Copyright 2013 Anshuman Khandual, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+#include 
+#include 
+
+   .text
+
+.balign 8
+
+/* r3 = n  (where n = [0-31])
+ * The maximum number of BHRB entries supported with PPC_MFBHRBE instruction
+ * is 1024. We have limited the number of table entries here as POWER8 implements
+ * 32 BHRB entries.
+ */
+
+/* .global read_bhrb */
+_GLOBAL(read_bhrb)
+   cmpldi  r3,31
+   bgt 1f
+   ld  r4,bhrb_table@got(r2)
+   sldir3,r3,3
+   add r3,r4,r3
+   mtctr   r3
+   bctr
+1: li  r3,0
+   blr
+
+#define MFBHRB_TABLE1(n) PPC_MFBHRBE(R3,n); blr
+#define MFBHRB_TABLE2(n) MFBHRB_TABLE1(n); MFBHRB_TABLE1(n+1)
+#define MFBHRB_TABLE4(n) MFBHRB_TABLE2(n); MFBHRB_TABLE2(n+2)
+#define MFBHRB_TABLE8(n) MFBHRB_TABLE4(n); MFBHRB_TABLE4(n+4)
+#define MFBHRB_TABLE16(n) MFBHRB_TABLE8(n); MFBHRB_TABLE8(n+8)
+#define MFBHRB_TABLE32(n) MFBHRB_TABLE16(n); MFBHRB_TABLE16(n+16)
+
+bhrb_table:
+   MFBHRB_TABLE32(0)
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH V4 5/5] powerpc, perf: Enable branch stack sampling framework

2013-04-22 Thread Anshuman Khandual
Provides basic enablement for the perf branch stack sampling framework on
POWER8 processor based platforms. Adds new BHRB related elements to the
cpu_hw_events structure to represent the current BHRB configuration and
filter, manage BHRB context, and hold the output BHRB buffer during a PMU
interrupt before it is passed to user space. This also enables processing
of the BHRB data and converts it into the generic perf branch stack data
format.
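
As an illustration of that conversion, a single raw BHRB value maps onto
the generic perf_branch_entry layout roughly as below, using the
BHRB_EA/BHRB_PREDICTION/BHRB_TARGET masks this patch defines
(decode_bhrb_entry() is an illustrative helper, not patch code):

	static void decode_bhrb_entry(u64 val, struct perf_branch_entry *entry)
	{
		u64 addr = val & BHRB_EA;		/* branch effective address */
		int pred = val & BHRB_PREDICTION;	/* mispredict bit */

		if (val & BHRB_TARGET) {
			/* entry carries the target of the previous branch */
			entry->to = addr;
		} else {
			entry->from = addr;
			entry->mispred = pred;
			entry->predicted = !pred;
		}
	}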

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/include/asm/perf_event_server.h |   1 +
 arch/powerpc/perf/core-book3s.c  | 167 ++-
 2 files changed, 165 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h 
b/arch/powerpc/include/asm/perf_event_server.h
index 3f0c15c..f265049 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -73,6 +73,7 @@ extern int register_power_pmu(struct power_pmu *);
 struct pt_regs;
 extern unsigned long perf_misc_flags(struct pt_regs *regs);
 extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
+extern unsigned long int read_bhrb(int n);
 
 /*
  * Only override the default definitions in include/linux/perf_event.h
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 4ac6e64..c627843 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -19,6 +19,11 @@
 #include 
 #include 
 
+#define BHRB_MAX_ENTRIES   32
+#define BHRB_TARGET		0x0002
+#define BHRB_PREDICTION		0x0001
+#define BHRB_EA			0xFFFC
+
 struct cpu_hw_events {
int n_events;
int n_percpu;
@@ -38,7 +43,15 @@ struct cpu_hw_events {
 
unsigned int group_flag;
int n_txn_start;
+
+   /* BHRB bits */
+	u64	bhrb_filter;	/* BHRB HW branch filter */
+	int	bhrb_users;
+	void	*bhrb_context;
+   struct  perf_branch_stack   bhrb_stack;
+   struct  perf_branch_entry   bhrb_entries[BHRB_MAX_ENTRIES];
 };
+
 DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
 
 struct power_pmu *ppmu;
@@ -858,6 +871,9 @@ static void power_pmu_enable(struct pmu *pmu)
}
 
  out:
+   if (cpuhw->bhrb_users)
+   ppmu->config_bhrb(cpuhw->bhrb_filter);
+
local_irq_restore(flags);
 }
 
@@ -888,6 +904,47 @@ static int collect_events(struct perf_event *group, int 
max_count,
return n;
 }
 
+/* Reset all possible BHRB entries */
+static void power_pmu_bhrb_reset(void)
+{
+   asm volatile(PPC_CLRBHRB);
+}
+
+void power_pmu_bhrb_enable(struct perf_event *event)
+{
+   struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events);
+
+   if (!ppmu->bhrb_nr)
+   return;
+
+   /* Clear BHRB if we changed task context to avoid data leaks */
+   if (event->ctx->task && cpuhw->bhrb_context != event->ctx) {
+   power_pmu_bhrb_reset();
+   cpuhw->bhrb_context = event->ctx;
+   }
+   cpuhw->bhrb_users++;
+}
+
+void power_pmu_bhrb_disable(struct perf_event *event)
+{
+   struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events);
+
+   if (!ppmu->bhrb_nr)
+   return;
+
+   cpuhw->bhrb_users--;
+   WARN_ON_ONCE(cpuhw->bhrb_users < 0);
+
+   if (!cpuhw->disabled && !cpuhw->bhrb_users) {
+   /* BHRB cannot be turned off when other
+* events are active on the PMU.
+*/
+
+   /* avoid stale pointer */
+   cpuhw->bhrb_context = NULL;
+   }
+}
+
 /*
  * Add a event to the PMU.
  * If all events are not already frozen, then we disable and
@@ -947,6 +1004,9 @@ nocheck:
 
ret = 0;
  out:
+   if (has_branch_stack(event))
+   power_pmu_bhrb_enable(event);
+
perf_pmu_enable(event->pmu);
local_irq_restore(flags);
return ret;
@@ -999,6 +1059,9 @@ static void power_pmu_del(struct perf_event *event, int 
ef_flags)
cpuhw->mmcr[0] &= ~(MMCR0_PMXE | MMCR0_FCECE);
}
 
+   if (has_branch_stack(event))
+   power_pmu_bhrb_disable(event);
+
perf_pmu_enable(event->pmu);
local_irq_restore(flags);
 }
@@ -1117,6 +1180,15 @@ int power_pmu_commit_txn(struct pmu *pmu)
return 0;
 }
 
+/* Called from ctxsw to prevent one process's branch entries from
+ * mingling with another process's entries during a context switch.
+ */
+void power_pmu_flush_branch_stack(void)
+{
+   if (ppmu->bhrb_nr)
+   power_pmu_bhrb_reset();
+}
+
 /*
  * Return 1 if we might be able to put event on a limited PMC,
  * or 0 if not.
@@ -1231,9 +1303,11 @@ static int power_pmu_event_init(struct perf_event *event)
if (

[PATCH V4 1/5] powerpc, perf: Add new BHRB related instructions for POWER8

2013-04-22 Thread Anshuman Khandual
This patch adds the new POWER8 instruction encodings for reading
and clearing Branch History Rolling Buffer entries. The new
instruction 'mfbhrbe' (move from branch history rolling buffer
entry) reads BHRB buffer entries, and the instruction 'clrbhrb'
(clear branch history rolling buffer) clears the entire buffer.
The 'clrbhrb' instruction has a straightforward encoding. The
encoding format for reading BHRB entries is 'mfbhrbe RT, BHRBE',
which takes two arguments: the index of the BHRB buffer entry to
read and a general purpose register to receive the value read
from that entry.
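
As a worked example, 'mfbhrbe r3, 5' assembles to
0x7c00025c | (3 << 21) | (5 << 11) = 0x7c602a5c. The small userspace
program below re-computes what the PPC_MFBHRBE() macro does at build
time (illustration only):

	#include <stdint.h>
	#include <stdio.h>

	#define PPC_INST_BHRBE	0x7c00025c

	static uint32_t mfbhrbe(unsigned int rt, unsigned int n)
	{
		return PPC_INST_BHRBE |
		       ((rt & 0x1f) << 21) |	/* RT: destination GPR */
		       ((n & 0x3ff) << 11);	/* BHRBE: entry index */
	}

	int main(void)
	{
		printf("0x%08x\n", mfbhrbe(3, 5));	/* prints 0x7c602a5c */
		return 0;
	}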

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/include/asm/ppc-opcode.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index 8752bc8..0c34e48 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -82,6 +82,8 @@
 #define__REGA0_R31 31
 
 /* sorted alphabetically */
+#define PPC_INST_BHRBE 0x7c00025c
+#define PPC_INST_CLRBHRB   0x7c00035c
 #define PPC_INST_DCBA  0x7c0005ec
 #define PPC_INST_DCBA_MASK 0xfc0007fe
 #define PPC_INST_DCBAL 0x7c2005ec
@@ -297,6 +299,12 @@
 #define PPC_NAP			stringify_in_c(.long PPC_INST_NAP)
 #define PPC_SLEEP		stringify_in_c(.long PPC_INST_SLEEP)
 
+/* BHRB instructions */
+#define PPC_CLRBHRB		stringify_in_c(.long PPC_INST_CLRBHRB)
+#define PPC_MFBHRBE(r, n)  stringify_in_c(.long PPC_INST_BHRBE | \
+   __PPC_RT(r) | \
+   (((n) & 0x3ff) << 11))
+
 /* Transactional memory instructions */
 #define TRECHKPT   stringify_in_c(.long PPC_INST_TRECHKPT)
 #define TRECLAIM(r)stringify_in_c(.long PPC_INST_TRECLAIM \
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH V4 0/5] powerpc, perf: BHRB based branch stack enablement on POWER8

2013-04-22 Thread Anshuman Khandual
Branch History Rolling Buffer (BHRB) is a new PMU feature in the IBM
POWER8 processor which records the branch instructions inside the execution
pipeline. This patchset enables the basic functionality of the feature through
the generic perf branch stack sampling framework.

Sample output
-
$./perf record -b top
$./perf report

Overhead  Command  Source Shared Object   Source Symbol 
 Target Shared ObjectTarget Symbol
#   ...    
..    
...
#

 7.82%  top  libc-2.11.2.so[k] _IO_vfscanf  
   libc-2.11.2.so[k] _IO_vfscanf
 6.17%  top  libc-2.11.2.so[k] _IO_vfscanf  
   [unknown] [k] 
 2.37%  top  [unknown] [k] 0xf7aafb30   
   [unknown] [k] 
 1.80%  top  [unknown] [k] 0x0fe07978   
   libc-2.11.2.so[k] _IO_vfscanf
 1.60%  top  libc-2.11.2.so[k] _IO_vfscanf  
   [kernel.kallsyms] [k] .do_task_stat
 1.20%  top  [kernel.kallsyms] [k] .do_task_stat
   [kernel.kallsyms] [k] .do_task_stat
 1.02%  top  libc-2.11.2.so[k] vfprintf 
   libc-2.11.2.so[k] vfprintf
 0.92%  top  top   [k] _init
   [unknown] [k] 0x0fe037f4

Changes in V2
--
- Added copyright messages to the newly created files
- Modified couple of commit messages

Changes in V3
-
- Incorporated review comments from Segher https://lkml.org/lkml/2013/4/16/350
- Worked on a solution for the review comment from Michael Ellerman
  https://lkml.org/lkml/2013/4/17/548
- Could not move the updated cpu_hw_events structure from core-book3s.c
  into perf_event_server.h, because perf_event_server.h is pulled in first
  inside linux/perf_event.h, before the definition of the perf_branch_entry
  structure. That is the reason why the perf_branch_entry definition is not
  available inside perf_event_server.h, where we define the array inside
  the cpu_hw_events structure.

- Finally pulled the code from perf_event_bhrb.c into core-book3s.c

- Improved documentation for the patchset

Changes in V4
-
- Incorporated review comments on V3 regarding new instruction encoding

Anshuman Khandual (5):
  powerpc, perf: Add new BHRB related instructions for POWER8
  powerpc, perf: Add basic assembly code to read BHRB entries on POWER8
  powerpc, perf: Add new BHRB related generic functions, data and flags
  powerpc, perf: Define BHRB generic functions, data and flags for POWER8
  powerpc, perf: Enable branch stack sampling framework

 arch/powerpc/include/asm/perf_event_server.h |   7 ++
 arch/powerpc/include/asm/ppc-opcode.h|   8 ++
 arch/powerpc/perf/Makefile   |   2 +-
 arch/powerpc/perf/bhrb.S |  44 +++
 arch/powerpc/perf/core-book3s.c  | 167 ++-
 arch/powerpc/perf/power8-pmu.c   |  57 -
 6 files changed, 280 insertions(+), 5 deletions(-)
 create mode 100644 arch/powerpc/perf/bhrb.S

-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] perf/Power7: Save dcache_src fields in sample record.

2013-06-10 Thread Anshuman Khandual
> 
> AFAICT, Power7 supports one extra level in the cache-hierarchy, so we propose
> to add a new cache level, REM_CCE3 shown above.
> 
> To maintain consistency in terminology (i.e 2-hops = remote, 3-hops = 
> distant),
> I propose leaving the REM_MEM1 unused and adding another level, REM_MEM3.
> 

Agreed.

> Further, in the above REM_CCE1 case, Power7 can also identify if the data came
> from the L2 or L3 cache of another core on the same chip. To describe this to
> user space, we propose to set ->mem_lvl to:
> 
>   PERF_MEM_LVL_REM_CCE1|PERF_MEM_LVL_L2
> 
>   PERF_MEM_LVL_REM_CCE1|PERF_MEM_LVL_L3
> 
> Either that or we could leave REM_CCE1 unused in Power and add two more 
> levels:
> 
>   PERF_MEM_XLVL_REM_L2_CCE1
>   PERF_MEM_XLVL_REM_L3_CCE1
> 
> The former approach seems less confusing and this patch uses that approach.
> 

Yeah, the former approach is simpler and makes sense.
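
For instance, a marked load that hit in another core's L2 one hop away
would then be composed roughly like this (sketch only, using the
existing PERF_MEM_S() helper from the perf uapi; the function name is
made up):

	static u64 p7_remote_l2_hit(void)
	{
		union perf_mem_data_src dsrc = { .val = 0 };

		dsrc.val |= PERF_MEM_S(OP, LOAD);		/* it was a load */
		dsrc.val |= PERF_MEM_S(LVL, HIT) |		/* ... that hit */
			    PERF_MEM_S(LVL, REM_CCE1) |		/* one hop away */
			    PERF_MEM_S(LVL, L2);		/* in a remote L2 */

		return dsrc.val;
	}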


> Signed-off-by: Sukadev Bhattiprolu 
> ---
>  arch/powerpc/include/asm/perf_event_server.h |2 +
>  arch/powerpc/perf/core-book3s.c  |4 +
>  arch/powerpc/perf/power7-pmu.c   |   81 
> ++
>  include/uapi/linux/perf_event.h  |   12 +++-
>  4 files changed, 97 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/perf_event_server.h 
> b/arch/powerpc/include/asm/perf_event_server.h
> index f265049..f2d162b 100644
> --- a/arch/powerpc/include/asm/perf_event_server.h
> +++ b/arch/powerpc/include/asm/perf_event_server.h
> @@ -37,6 +37,8 @@ struct power_pmu {
>   void(*config_bhrb)(u64 pmu_bhrb_filter);
>   void(*disable_pmc)(unsigned int pmc, unsigned long mmcr[]);
>   int (*limited_pmc_event)(u64 event_id);
> + void(*get_mem_data_src)(struct perf_sample_data *data,
> + struct pt_regs *regs);
>   u32 flags;
>   const struct attribute_group**attr_groups;
>   int n_generic;
> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index 426180b..7778fa9 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -1632,6 +1632,10 @@ static void record_and_restart(struct perf_event 
> *event, unsigned long val,
>   data.br_stack = &cpuhw->bhrb_stack;
>   }
> 
> + if (event->attr.sample_type & PERF_SAMPLE_DATA_SRC &&
> + ppmu->get_mem_data_src)
> + ppmu->get_mem_data_src(&data, regs);
> +
>   if (perf_event_overflow(event, &data, regs))
>   power_pmu_stop(event, 0);
>   }
> diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
> index 3c475d6..af92bfe 100644
> --- a/arch/powerpc/perf/power7-pmu.c
> +++ b/arch/powerpc/perf/power7-pmu.c
> @@ -209,6 +209,85 @@ static int power7_get_alternatives(u64 event, unsigned 
> int flags, u64 alt[])
>   return nalt;
>  }
> 
> +#define  POWER7_MMCRA_PEMPTY (0x1L << 63)
> +#define  POWER7_MMCRA_FIN_STALL  (0x1L << 62)
> +#define  POWER7_MMCRA_CMPL_STALL (0x1L << 61)
> +#define  POWER7_MMCRA_STALL_REASON_MASK  (0xFL << 60)
> +
> +#define  POWER7_MMCRA_DCACHE_MISS(0x1L << 55)
> +
> +#define  POWER7_MMCRA_DCACHE_SRC_SHIFT   51
> +#define  POWER7_MMCRA_DCACHE_SRC_MASK(0xFL << 
> POWER7_MMCRA_DCACHE_SRC_SHIFT)
> +
> +#define  POWER7_MMCRA_MDTLB_MISS (0x1L << 50)
> +
> +#define  POWER7_MMCRA_MDTLB_SRC_SHIFT46
> +#define  POWER7_MMCRA_MDTLB_SRC_MASK (0xFL << 
> POWER7_MMCRA_MDTLB_SRC_SHIFT)
> +
> +#define  POWER7_MMCRA_MDERAT_MISS(0x1L<< 45)
> +#define  POWER7_MMCRA_MLSU_REJ   (0x1L<< 44)
> +
> +/* and so on */
> +
> +/*
> + * Map DCACHE_SRC fields to the Linux memory hierarchy levels.
> + *
> + * Bits 9..12 in the MMCRA indicate the source of a data-cache entry, with
> + * each of the 16 possible values referring to a specific source. Eg: if
> + * the 4-bits have the value 1 (0b0001), the dcache entry was found local
> + * L3 cache.
> + *
> + * We use the table, dcache_src_map, to map this value 1 to PERF_MEM_LVL_L3,
> + * the arch-neutral representation of the L3 cache.
> + *
> + * Similarly, in case of marked data TLB miss, bits 14..17 of the MMCRA
> + * indicate the load source of a marked DTLB  entry. dtlb_src_map[] gives
> + * the mapping to the arch-neutral values of the TLB source.


Where did you define dtlb_src_map[] ?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] powerpc, perf: Fix processing conditions for invalid BHRB entries

2013-05-06 Thread Anshuman Khandual
Fix the conditions used to detect and skip invalid BHRB entries. The
'target' variable holds the raw masked BHRB_TARGET bit (0x2), not a 0/1
value, so a comparison like '(target == 1)' can never be true. Use plain
boolean tests on 'target' and 'pred' instead.
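
A small standalone illustration of why the old test could never fire
(sketch, not patch code; BHRB_TARGET is bit 1 of the raw BHRB value):

	#include <assert.h>

	#define BHRB_TARGET	0x2UL

	int main(void)
	{
		unsigned long val = 0x1002;	/* some raw BHRB word */
		unsigned long target = val & BHRB_TARGET;

		assert(target == 2);	/* never 1, so '(target == 1)' was dead */
		assert(target);		/* the plain boolean test is what we want */
		return 0;
	}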

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/perf/core-book3s.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 09db68d..1de2756 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1481,25 +1481,25 @@ void power_pmu_bhrb_read(struct cpu_hw_events *cpuhw)
target = val & BHRB_TARGET;
 
/* Probable Missed entry: Not applicable for POWER8 */
-   if ((addr == 0) && (target == 0) && (pred == 1)) {
+   if ((addr == 0) && (!target) && pred) {
r_index++;
continue;
}
 
/* Real Missed entry: Power8 based missed entry */
-   if ((addr == 0) && (target == 1) && (pred == 1)) {
+   if ((addr == 0) && target && pred) {
r_index++;
continue;
}
 
/* Reserved condition: Not a valid entry  */
-   if ((addr == 0) && (target == 1) && (pred == 0)) {
+   if ((addr == 0) && target && (!pred)) {
r_index++;
continue;
}
 
/* Is a target address */
-   if (val & BHRB_TARGET) {
+   if (target) {
/* First address cannot be a target address */
if (r_index == 0) {
r_index++;
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] powerpc, perf: Clear out branch entries to avoid any previous stale values

2013-05-06 Thread Anshuman Khandual
The 'to' field inside branch entries might contain stale values from previous
PMU interrupt instances which had indirect branches. So clear all the values
before reading a fresh set of BHRB entries after a PMU interrupt.

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/perf/core-book3s.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index c627843..09db68d 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1590,6 +1590,8 @@ static void record_and_restart(struct perf_event *event, 
unsigned long val,
if (event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK) {
struct cpu_hw_events *cpuhw;
cpuhw = &__get_cpu_var(cpu_hw_events);
+   memset(cpuhw->bhrb_entries, 0,
+   sizeof(struct perf_branch_entry) * 
BHRB_MAX_ENTRIES);
power_pmu_bhrb_read(cpuhw);
data.br_stack = &cpuhw->bhrb_stack;
}
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] powerpc, perf: Fix processing conditions for invalid BHRB entries

2013-05-06 Thread Anshuman Khandual
On 05/06/2013 04:41 PM, Michael Neuling wrote:
> Anshuman Khandual  wrote:
> 
>> Fixing some conditions during BHRB entry processing.
> 
> I think we can simplify this a lot more... something like the below.
> 

I feel that the conditional handling of the invalid BHRB entries should be
kept; it would help us during debug and could also be used for more
granular branch classification or error handling later on.

> Also, this marks the "to" address as all 1s, which is better poison
> value since it's possible to branch to/from 0x0.
>

Agreed.


> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index c627843..d410d65 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -1463,65 +1463,45 @@ void power_pmu_bhrb_read(struct cpu_hw_events *cpuhw)
>  {
>   u64 val;
>   u64 addr;
> - int r_index, u_index, target, pred;
> + int r_index, u_index, pred;
> 
>   r_index = 0;
>   u_index = 0;
>   while (r_index < ppmu->bhrb_nr) {
>   /* Assembly read function */
> - val = read_bhrb(r_index);
> + val = read_bhrb(r_index++);
> 
>   /* Terminal marker: End of valid BHRB entries */
> - if (val == 0) {
> + if (!val) {
>   break;
>   } else {
>   /* BHRB field break up */
>   addr = val & BHRB_EA;
>   pred = val & BHRB_PREDICTION;
> - target = val & BHRB_TARGET;
> 
> - /* Probable Missed entry: Not applicable for POWER8 */
> - if ((addr == 0) && (target == 0) && (pred == 1)) {
> - r_index++;
> + if (!addr)
> + /* invalid entry */
>   continue;
> - }
> -
> - /* Real Missed entry: Power8 based missed entry */
> - if ((addr == 0) && (target == 1) && (pred == 1)) {
> - r_index++;
> - continue;
> - }
> 
> - /* Reserved condition: Not a valid entry  */
> - if ((addr == 0) && (target == 1) && (pred == 0)) {
> - r_index++;
> - continue;
> - }
> -
> - /* Is a target address */
>   if (val & BHRB_TARGET) {
>   /* First address cannot be a target address */
> - if (r_index == 0) {
> - r_index++;
> + if (r_index == 0)
>   continue;
> - }
> 
>   /* Update target address for the previous entry 
> */
>   cpuhw->bhrb_entries[u_index - 1].to = addr;
>   cpuhw->bhrb_entries[u_index - 1].mispred = pred;
>   cpuhw->bhrb_entries[u_index - 1].predicted = 
> ~pred;
> -
> - /* Dont increment u_index */
> - r_index++;
>   } else {
>   /* Update address, flags for current entry */
>   cpuhw->bhrb_entries[u_index].from = addr;
> + cpuhw->bhrb_entries[u_index].to =
> + 0x;
>   cpuhw->bhrb_entries[u_index].mispred = pred;
>   cpuhw->bhrb_entries[u_index].predicted = ~pred;
> 
>   /* Successfully popullated one entry */
>   u_index++;
> - r_index++;
>   }
>   }
>   }
> ___
> Linuxppc-dev mailing list
> linuxppc-...@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 3/5] powerpc, perf: Add new BHRB related generic functions, data and flags

2013-04-16 Thread Anshuman Khandual
Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/include/asm/perf_event_server.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/include/asm/perf_event_server.h 
b/arch/powerpc/include/asm/perf_event_server.h
index 57b42da..3f0c15c 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -33,6 +33,8 @@ struct power_pmu {
unsigned long *valp);
int (*get_alternatives)(u64 event_id, unsigned int flags,
u64 alt[]);
+   u64 (*bhrb_filter_map)(u64 branch_sample_type);
+   void(*config_bhrb)(u64 pmu_bhrb_filter);
void(*disable_pmc)(unsigned int pmc, unsigned long mmcr[]);
int (*limited_pmc_event)(u64 event_id);
u32 flags;
@@ -42,6 +44,9 @@ struct power_pmu {
int (*cache_events)[PERF_COUNT_HW_CACHE_MAX]
   [PERF_COUNT_HW_CACHE_OP_MAX]
   [PERF_COUNT_HW_CACHE_RESULT_MAX];
+
+   /* BHRB entries in the PMU */
+   int bhrb_nr;
 };
 
 /*
@@ -54,6 +59,7 @@ struct power_pmu {
 #define PPMU_SIAR_VALID		0x0010 /* Processor has SIAR Valid bit */
 #define PPMU_HAS_SSLOT 0x0020 /* Has sampled slot in MMCRA */
 #define PPMU_HAS_SIER  0x0040 /* Has SIER */
+#define PPMU_BHRB  0x0080 /* has BHRB feature enabled */
 
 /*
  * Values for flags to get_alternatives()
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 4/5] powerpc, perf: Define BHRB generic functions, data and flags for POWER8

2013-04-16 Thread Anshuman Khandual
Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/perf/power8-pmu.c | 57 +-
 1 file changed, 56 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c
index 106ae0b..153408c 100644
--- a/arch/powerpc/perf/power8-pmu.c
+++ b/arch/powerpc/perf/power8-pmu.c
@@ -109,6 +109,16 @@
 #define EVENT_IS_MARKED		(EVENT_MARKED_MASK << EVENT_MARKED_SHIFT)
 #define EVENT_PSEL_MASK		0xff	/* PMCxSEL value */
 
+/* MMCRA IFM bits - POWER8 */
+#define POWER8_MMCRA_IFM1	0x4000UL
+#define POWER8_MMCRA_IFM2	0x8000UL
+#define POWER8_MMCRA_IFM3	0xC000UL
+
+#define ONLY_PLM \
+   (PERF_SAMPLE_BRANCH_USER|\
+PERF_SAMPLE_BRANCH_KERNEL  |\
+PERF_SAMPLE_BRANCH_HV)
+
 /*
  * Layout of constraint bits:
  *
@@ -428,6 +438,48 @@ static int power8_generic_events[] = {
[PERF_COUNT_HW_BRANCH_MISSES] = PM_BR_MPRED_CMPL,
 };
 
+static u64 power8_bhrb_filter_map(u64 branch_sample_type)
+{
+   u64 pmu_bhrb_filter = 0;
+   u64 br_privilege = branch_sample_type & ONLY_PLM;
+
+   /* BHRB and regular PMU events share the same privilege state
+    * filter configuration. BHRB is always recorded along with a
+    * regular PMU event. So the privilege state filter criteria for
+    * BHRB and the companion PMU events have to be the same. By
+    * default the "perf record" tool sets all privilege bits ON when
+    * no filter criteria is provided on the command line. So as long
+    * as all privilege bits are ON or they are all OFF, we are good
+    * to go.
+    */
+   if ((br_privilege != 7) && (br_privilege != 0))
+   return -1;
+
+   /* No branch filter requested */
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY)
+   return pmu_bhrb_filter;
+
+   /* Invalid branch filter options - HW does not support */
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
+   return -1;
+
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_IND_CALL)
+   return -1;
+
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
+   pmu_bhrb_filter |= POWER8_MMCRA_IFM1;
+   return pmu_bhrb_filter;
+   }
+
+   /* Everything else is unsupported */
+   return -1;
+}
+
+static void power8_config_bhrb(u64 pmu_bhrb_filter)
+{
+   /* Enable BHRB filter in PMU */
+   mtspr(SPRN_MMCRA, (mfspr(SPRN_MMCRA) | pmu_bhrb_filter));
+}
+
 static struct power_pmu power8_pmu = {
.name   = "POWER8",
.n_counter  = 6,
@@ -435,12 +487,15 @@ static struct power_pmu power8_pmu = {
.add_fields = POWER8_ADD_FIELDS,
.test_adder = POWER8_TEST_ADDER,
.compute_mmcr   = power8_compute_mmcr,
+   .config_bhrb= power8_config_bhrb,
+   .bhrb_filter_map= power8_bhrb_filter_map,
.get_constraint = power8_get_constraint,
.disable_pmc= power8_disable_pmc,
-   .flags  = PPMU_HAS_SSLOT | PPMU_HAS_SIER,
+   .flags  = PPMU_HAS_SSLOT | PPMU_HAS_SIER | PPMU_BHRB,
.n_generic  = ARRAY_SIZE(power8_generic_events),
.generic_events = power8_generic_events,
.attr_groups= power8_pmu_attr_groups,
+   .bhrb_nr= 32,
 };
 
 static int __init init_power8_pmu(void)
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 5/5] powerpc, perf: Enable branch stack sampling framework support with BHRB

2013-04-16 Thread Anshuman Khandual
This patch provides basic enablement for the perf branch stack sampling
framework on the POWER8 processor with a new PMU feature called BHRB
(Branch History Rolling Buffer).

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/perf/core-book3s.c | 96 +++--
 arch/powerpc/perf/perf_event_bhrb.c | 75 +
 2 files changed, 168 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/perf/perf_event_bhrb.c

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 4ac6e64..f4d1347 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -19,6 +19,8 @@
 #include 
 #include 
 
+#define BHRB_MAX_ENTRIES   32
+
 struct cpu_hw_events {
int n_events;
int n_percpu;
@@ -38,11 +40,21 @@ struct cpu_hw_events {
 
unsigned int group_flag;
int n_txn_start;
+
+   /* BHRB bits */
+	u64	bhrb_filter;	/* BHRB HW branch filter */
+	int	bhrb_users;
+	void	*bhrb_context;
+   struct  perf_branch_stack   bhrb_stack;
+   struct  perf_branch_entry   bhrb_entries[BHRB_MAX_ENTRIES];
 };
+
 DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
 
 struct power_pmu *ppmu;
 
+#include "perf_event_bhrb.c"
+
 /*
  * Normally, to ignore kernel events we set the FCS (freeze counters
  * in supervisor mode) bit in MMCR0, but if the kernel runs with the
@@ -858,6 +870,9 @@ static void power_pmu_enable(struct pmu *pmu)
}
 
  out:
+   if (cpuhw->bhrb_users)
+   ppmu->config_bhrb(cpuhw->bhrb_filter);
+
local_irq_restore(flags);
 }
 
@@ -888,6 +903,47 @@ static int collect_events(struct perf_event *group, int 
max_count,
return n;
 }
 
+/* Reset all possible BHRB entries */
+static void power_pmu_bhrb_reset(void)
+{
+   asm volatile(PPC_CLRBHRB);
+}
+
+void power_pmu_bhrb_enable(struct perf_event *event)
+{
+   struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events);
+
+   if (!ppmu->bhrb_nr)
+   return;
+
+   /* Clear BHRB if we changed task context to avoid data leaks */
+   if (event->ctx->task && cpuhw->bhrb_context != event->ctx) {
+   power_pmu_bhrb_reset();
+   cpuhw->bhrb_context = event->ctx;
+   }
+   cpuhw->bhrb_users++;
+}
+
+void power_pmu_bhrb_disable(struct perf_event *event)
+{
+   struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events);
+
+   if (!ppmu->bhrb_nr)
+   return;
+
+   cpuhw->bhrb_users--;
+   WARN_ON_ONCE(cpuhw->bhrb_users < 0);
+
+   if (!cpuhw->disabled && !cpuhw->bhrb_users) {
+   /* BHRB cannot be turned off when other
+* events are active on the PMU.
+*/
+
+   /* avoid stale pointer */
+   cpuhw->bhrb_context = NULL;
+   }
+}
+
 /*
  * Add a event to the PMU.
  * If all events are not already frozen, then we disable and
@@ -947,6 +1003,9 @@ nocheck:
 
ret = 0;
  out:
+   if (has_branch_stack(event))
+   power_pmu_bhrb_enable(event);
+
perf_pmu_enable(event->pmu);
local_irq_restore(flags);
return ret;
@@ -999,6 +1058,9 @@ static void power_pmu_del(struct perf_event *event, int 
ef_flags)
cpuhw->mmcr[0] &= ~(MMCR0_PMXE | MMCR0_FCECE);
}
 
+   if (has_branch_stack(event))
+   power_pmu_bhrb_disable(event);
+
perf_pmu_enable(event->pmu);
local_irq_restore(flags);
 }
@@ -1117,6 +1179,15 @@ int power_pmu_commit_txn(struct pmu *pmu)
return 0;
 }
 
+/* Called from ctxsw to prevent one process's branch entries from
+ * mingling with another process's entries during a context switch.
+ */
+void power_pmu_flush_branch_stack(void)
+{
+   if (ppmu->bhrb_nr)
+   power_pmu_bhrb_reset();
+}
+
 /*
  * Return 1 if we might be able to put event on a limited PMC,
  * or 0 if not.
@@ -1231,9 +1302,11 @@ static int power_pmu_event_init(struct perf_event *event)
if (!ppmu)
return -ENOENT;
 
-   /* does not support taken branch sampling */
-   if (has_branch_stack(event))
-   return -EOPNOTSUPP;
+   if (has_branch_stack(event)) {
+   /* PMU has BHRB enabled */
+   if (!(ppmu->flags & PPMU_BHRB))
+   return -EOPNOTSUPP;
+   }
 
switch (event->attr.type) {
case PERF_TYPE_HARDWARE:
@@ -1314,6 +1387,15 @@ static int power_pmu_event_init(struct perf_event *event)
 
cpuhw = &get_cpu_var(cpu_hw_events);
err = power_check_constraints(cpuhw, events, cflags, n + 1);
+
+   if (has_branch_stack(event)) {
+   cpuhw->bhrb_filter = ppmu->bhrb_filter_map(
+   

[PATCH 2/5] powerpc, perf: Add basic assembly code to read BHRB entries on POWER8

2013-04-16 Thread Anshuman Khandual
Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/perf/Makefile |  2 +-
 arch/powerpc/perf/bhrb.S   | 34 ++
 2 files changed, 35 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/perf/bhrb.S

diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
index 472db18..510fae1 100644
--- a/arch/powerpc/perf/Makefile
+++ b/arch/powerpc/perf/Makefile
@@ -2,7 +2,7 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror
 
 obj-$(CONFIG_PERF_EVENTS)  += callchain.o
 
-obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o
+obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o bhrb.o
 obj64-$(CONFIG_PPC_PERF_CTRS)  += power4-pmu.o ppc970-pmu.o power5-pmu.o \
   power5+-pmu.o power6-pmu.o power7-pmu.o \
   power8-pmu.o
diff --git a/arch/powerpc/perf/bhrb.S b/arch/powerpc/perf/bhrb.S
new file mode 100644
index 000..cffb815
--- /dev/null
+++ b/arch/powerpc/perf/bhrb.S
@@ -0,0 +1,34 @@
+#include 
+#include 
+
+   .text
+
+.balign 8
+
+/* r3 = n  (where n = [0-1023])
+ * The maximum number of BHRB entries supported with PPC_MFBHRBE instruction
+ * is 1024. We have limited the number of table entries here as POWER8 implements
+ * 32 BHRB entries.
+ */
+
+/* .global read_bhrb */
+_GLOBAL(read_bhrb)
+   cmpldi  r3,1023
+   bgt 1f
+   ld  r4,bhrb_table@got(r2)
+   sldir3,r3,3
+   add r3,r4,r3
+   mtctr   r3
+   bctr
+1: li  r3,0
+   blr
+
+#define MFBHRB_TABLE1(n) PPC_MFBHRBE(R3,n); blr
+#define MFBHRB_TABLE2(n) MFBHRB_TABLE1(n); MFBHRB_TABLE1(n+1)
+#define MFBHRB_TABLE4(n) MFBHRB_TABLE2(n); MFBHRB_TABLE2(n+2)
+#define MFBHRB_TABLE8(n) MFBHRB_TABLE4(n); MFBHRB_TABLE4(n+4)
+#define MFBHRB_TABLE16(n) MFBHRB_TABLE8(n); MFBHRB_TABLE8(n+8)
+#define MFBHRB_TABLE32(n) MFBHRB_TABLE16(n); MFBHRB_TABLE16(n+16)
+
+bhrb_table:
+   MFBHRB_TABLE32(0)
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/5] powerpc, perf: Add new BHRB related instructions on POWER8

2013-04-16 Thread Anshuman Khandual
This patch adds support for the new instructions that read individual
BHRB entries and also clear the BHRB buffer.

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/include/asm/ppc-opcode.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index 8752bc8..93ae5a1 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -82,6 +82,7 @@
 #define__REGA0_R31 31
 
 /* sorted alphabetically */
+#define PPC_INST_BHRBE 0x7c00025c
 #define PPC_INST_DCBA  0x7c0005ec
 #define PPC_INST_DCBA_MASK 0xfc0007fe
 #define PPC_INST_DCBAL 0x7c2005ec
@@ -297,6 +298,12 @@
 #define PPC_NAP			stringify_in_c(.long PPC_INST_NAP)
 #define PPC_SLEEP		stringify_in_c(.long PPC_INST_SLEEP)
 
+/* BHRB instructions */
+#define PPC_CLRBHRB		stringify_in_c(.long 0x7c00035c)
+#define PPC_MFBHRBE(r, n)  stringify_in_c(.long PPC_INST_BHRBE | \
+   __PPC_RS(r) | \
+   (((n) & 0x1f) << 11))
+
 /* Transactional memory instructions */
 #define TRECHKPT   stringify_in_c(.long PPC_INST_TRECHKPT)
 #define TRECLAIM(r)stringify_in_c(.long PPC_INST_TRECLAIM \
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 0/5] powerpc, perf: BHRB based branch stack enablement on POWER8

2013-04-16 Thread Anshuman Khandual
Branch History Rolling Buffer (BHRB) is a new PMU feature in the IBM
POWER8 processor which records the branch instructions inside the execution
pipeline. This patchset enables the basic functionality of the feature through
the generic perf branch stack sampling framework.

Sample output
-
$./perf record -b top
$./perf report

Overhead  Command  Source Shared Object   Source Symbol 
 Target Shared ObjectTarget Symbol
#   ...    
..    
...
#

 7.82%  top  libc-2.11.2.so[k] _IO_vfscanf  
   libc-2.11.2.so[k] _IO_vfscanf
 6.17%  top  libc-2.11.2.so[k] _IO_vfscanf  
   [unknown] [k]    
 2.37%  top  [unknown] [k] 0xf7aafb30   
   [unknown] [k]    
 1.80%  top  [unknown] [k] 0x0fe07978   
   libc-2.11.2.so[k] _IO_vfscanf
 1.60%  top  libc-2.11.2.so[k] _IO_vfscanf  
   [kernel.kallsyms] [k] .do_task_stat  
 1.20%  top  [kernel.kallsyms] [k] .do_task_stat
   [kernel.kallsyms] [k] .do_task_stat  
 1.02%  top  libc-2.11.2.so[k] vfprintf 
   libc-2.11.2.so[k] vfprintf   
 0.92%  top  top   [k] _init
   [unknown] [k] 0x0fe037f4 

Anshuman Khandual (5):
  powerpc, perf: Add new BHRB related instructions on POWER8
  powerpc, perf: Add basic assembly code to read BHRB entries on POWER8
  powerpc, perf: Add new BHRB related generic functions, data and flags
  powerpc, perf: Define BHRB generic functions, data and flags for POWER8
  powerpc, perf: Enable branch stack sampling framework support with BHRB

 arch/powerpc/include/asm/perf_event_server.h |  6 ++
 arch/powerpc/include/asm/ppc-opcode.h|  7 ++
 arch/powerpc/perf/Makefile   |  2 +-
 arch/powerpc/perf/bhrb.S | 34 ++
 arch/powerpc/perf/core-book3s.c  | 96 +++-
 arch/powerpc/perf/perf_event_bhrb.c  | 75 ++
 arch/powerpc/perf/power8-pmu.c   | 57 -
 7 files changed, 272 insertions(+), 5 deletions(-)
 create mode 100644 arch/powerpc/perf/bhrb.S
 create mode 100644 arch/powerpc/perf/perf_event_bhrb.c

-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH V2 0/5] powerpc, perf: BHRB based branch stack enablement on POWER8

2013-04-16 Thread Anshuman Khandual
Branch History Rolling Buffer (BHRB) is a new PMU feature in the IBM
POWER8 processor which records the branch instructions inside the execution
pipeline. This patchset enables the basic functionality of the feature through
the generic perf branch stack sampling framework.

Sample output
-
$./perf record -b top
$./perf report

Overhead  Command  Source Shared Object   Source Symbol 
 Target Shared ObjectTarget Symbol
#   ...    
..    
...
#

 7.82%  top  libc-2.11.2.so[k] _IO_vfscanf  
   libc-2.11.2.so[k] _IO_vfscanf
 6.17%  top  libc-2.11.2.so[k] _IO_vfscanf  
   [unknown] [k] 
 2.37%  top  [unknown] [k] 0xf7aafb30   
   [unknown] [k] 
 1.80%  top  [unknown] [k] 0x0fe07978   
   libc-2.11.2.so[k] _IO_vfscanf
 1.60%  top  libc-2.11.2.so[k] _IO_vfscanf  
   [kernel.kallsyms] [k] .do_task_stat
 1.20%  top  [kernel.kallsyms] [k] .do_task_stat
   [kernel.kallsyms] [k] .do_task_stat
 1.02%  top  libc-2.11.2.so[k] vfprintf 
   libc-2.11.2.so[k] vfprintf
 0.92%  top  top   [k] _init
   [unknown] [k] 0x0fe037f4

Changes in V2
--
- Added copyright messages to the newly created files
- Modified couple of commit messages

Anshuman Khandual (5):
  powerpc, perf: Add new BHRB related instructions on POWER8
  powerpc, perf: Add basic assembly code to read BHRB entries on POWER8
  powerpc, perf: Add new BHRB related generic functions, data and flags
  powerpc, perf: Define BHRB generic functions, data and flags for POWER8
  powerpc, perf: Enable branch stack sampling framework support with BHRB

 arch/powerpc/include/asm/perf_event_server.h |  6 ++
 arch/powerpc/include/asm/ppc-opcode.h|  7 ++
 arch/powerpc/perf/Makefile   |  2 +-
 arch/powerpc/perf/bhrb.S | 44 +
 arch/powerpc/perf/core-book3s.c  | 96 +++-
 arch/powerpc/perf/perf_event_bhrb.c  | 85 
 arch/powerpc/perf/power8-pmu.c   | 57 -
 7 files changed, 292 insertions(+), 5 deletions(-)
 create mode 100644 arch/powerpc/perf/bhrb.S
 create mode 100644 arch/powerpc/perf/perf_event_bhrb.c

-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH V2 5/5] powerpc, perf: Enable branch stack sampling framework support with BHRB

2013-04-16 Thread Anshuman Khandual
This patch provides basic enablement for the perf branch stack sampling
framework on the POWER8 processor with a new PMU feature called BHRB
(Branch History Rolling Buffer).

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/perf/core-book3s.c | 96 +++--
 arch/powerpc/perf/perf_event_bhrb.c | 85 
 2 files changed, 178 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/perf/perf_event_bhrb.c

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 4ac6e64..f4d1347 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -19,6 +19,8 @@
 #include 
 #include 
 
+#define BHRB_MAX_ENTRIES   32
+
 struct cpu_hw_events {
int n_events;
int n_percpu;
@@ -38,11 +40,21 @@ struct cpu_hw_events {
 
unsigned int group_flag;
int n_txn_start;
+
+   /* BHRB bits */
+	u64	bhrb_filter;	/* BHRB HW branch filter */
+	int	bhrb_users;
+	void	*bhrb_context;
+   struct  perf_branch_stack   bhrb_stack;
+   struct  perf_branch_entry   bhrb_entries[BHRB_MAX_ENTRIES];
 };
+
 DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
 
 struct power_pmu *ppmu;
 
+#include "perf_event_bhrb.c"
+
 /*
  * Normally, to ignore kernel events we set the FCS (freeze counters
  * in supervisor mode) bit in MMCR0, but if the kernel runs with the
@@ -858,6 +870,9 @@ static void power_pmu_enable(struct pmu *pmu)
}
 
  out:
+   if (cpuhw->bhrb_users)
+   ppmu->config_bhrb(cpuhw->bhrb_filter);
+
local_irq_restore(flags);
 }
 
@@ -888,6 +903,47 @@ static int collect_events(struct perf_event *group, int 
max_count,
return n;
 }
 
+/* Reset all possible BHRB entries */
+static void power_pmu_bhrb_reset(void)
+{
+   asm volatile(PPC_CLRBHRB);
+}
+
+void power_pmu_bhrb_enable(struct perf_event *event)
+{
+   struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events);
+
+   if (!ppmu->bhrb_nr)
+   return;
+
+   /* Clear BHRB if we changed task context to avoid data leaks */
+   if (event->ctx->task && cpuhw->bhrb_context != event->ctx) {
+   power_pmu_bhrb_reset();
+   cpuhw->bhrb_context = event->ctx;
+   }
+   cpuhw->bhrb_users++;
+}
+
+void power_pmu_bhrb_disable(struct perf_event *event)
+{
+   struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events);
+
+   if (!ppmu->bhrb_nr)
+   return;
+
+   cpuhw->bhrb_users--;
+   WARN_ON_ONCE(cpuhw->bhrb_users < 0);
+
+   if (!cpuhw->disabled && !cpuhw->bhrb_users) {
+   /* BHRB cannot be turned off when other
+* events are active on the PMU.
+*/
+
+   /* avoid stale pointer */
+   cpuhw->bhrb_context = NULL;
+   }
+}
+
 /*
  * Add a event to the PMU.
  * If all events are not already frozen, then we disable and
@@ -947,6 +1003,9 @@ nocheck:
 
ret = 0;
  out:
+   if (has_branch_stack(event))
+   power_pmu_bhrb_enable(event);
+
perf_pmu_enable(event->pmu);
local_irq_restore(flags);
return ret;
@@ -999,6 +1058,9 @@ static void power_pmu_del(struct perf_event *event, int ef_flags)
cpuhw->mmcr[0] &= ~(MMCR0_PMXE | MMCR0_FCECE);
}
 
+   if (has_branch_stack(event))
+   power_pmu_bhrb_disable(event);
+
perf_pmu_enable(event->pmu);
local_irq_restore(flags);
 }
@@ -1117,6 +1179,15 @@ int power_pmu_commit_txn(struct pmu *pmu)
return 0;
 }
 
+/* Called from ctxsw to prevent one process's branch entries to
+ * mingle with the other process's entries during context switch.
+ */
+void power_pmu_flush_branch_stack(void)
+{
+   if (ppmu->bhrb_nr)
+   power_pmu_bhrb_reset();
+}
+
 /*
  * Return 1 if we might be able to put event on a limited PMC,
  * or 0 if not.
@@ -1231,9 +1302,11 @@ static int power_pmu_event_init(struct perf_event *event)
if (!ppmu)
return -ENOENT;
 
-   /* does not support taken branch sampling */
-   if (has_branch_stack(event))
-   return -EOPNOTSUPP;
+   if (has_branch_stack(event)) {
+   /* PMU has BHRB enabled */
+   if (!(ppmu->flags & PPMU_BHRB))
+   return -EOPNOTSUPP;
+   }
 
switch (event->attr.type) {
case PERF_TYPE_HARDWARE:
@@ -1314,6 +1387,15 @@ static int power_pmu_event_init(struct perf_event *event)
 
cpuhw = &get_cpu_var(cpu_hw_events);
err = power_check_constraints(cpuhw, events, cflags, n + 1);
+
+   if (has_branch_stack(event)) {
+   cpuhw->bhrb_filter = ppmu->bhrb_filter_map(
+   

[PATCH V2 1/5] powerpc, perf: Add new BHRB related instructions on POWER8

2013-04-16 Thread Anshuman Khandual
This patch adds support for the new instructions that read the various
BHRB entries and also clear the BHRB buffer.

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/include/asm/ppc-opcode.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index 8752bc8..93ae5a1 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -82,6 +82,7 @@
 #define__REGA0_R31 31
 
 /* sorted alphabetically */
+#define PPC_INST_BHRBE 0x7c00025c
 #define PPC_INST_DCBA  0x7c0005ec
 #define PPC_INST_DCBA_MASK 0xfc0007fe
 #define PPC_INST_DCBAL 0x7c2005ec
@@ -297,6 +298,12 @@
 #define PPC_NAPstringify_in_c(.long PPC_INST_NAP)
 #define PPC_SLEEP  stringify_in_c(.long PPC_INST_SLEEP)
 
+/* BHRB instructions */
+#define PPC_CLRBHRBstringify_in_c(.long 0x7c00035c)
+#define PPC_MFBHRBE(r, n)  stringify_in_c(.long PPC_INST_BHRBE | \
+   __PPC_RS(r) | \
+   (((n) & 0x1f) << 11))
+
 /* Transactional memory instructions */
 #define TRECHKPT   stringify_in_c(.long PPC_INST_TRECHKPT)
 #define TRECLAIM(r)stringify_in_c(.long PPC_INST_TRECLAIM \
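
As a worked example (illustration only, not part of the patch): given
that __PPC_RS(r) elsewhere in this header is (((r) & 0x1f) << 21), and
R3 is register number 3, PPC_MFBHRBE(R3, 5) expands to

  .long 0x7c00025c | (3 << 21) | (5 << 11)

which is .long 0x7c602a5c, i.e. a mfbhrbe that reads BHRB entry 5 into r3.
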
-- 
1.7.11.7



[PATCH V2 4/5] powerpc, perf: Define BHRB generic functions, data and flags for POWER8

2013-04-16 Thread Anshuman Khandual
Defines BHRB functions, data and flags for POWER8

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/perf/power8-pmu.c | 57 +-
 1 file changed, 56 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c
index 106ae0b..153408c 100644
--- a/arch/powerpc/perf/power8-pmu.c
+++ b/arch/powerpc/perf/power8-pmu.c
@@ -109,6 +109,16 @@
 #define EVENT_IS_MARKED    (EVENT_MARKED_MASK << EVENT_MARKED_SHIFT)
 #define EVENT_PSEL_MASK0xff/* PMCxSEL value */
 
+/* MMCRA IFM bits - POWER8 */
+#definePOWER8_MMCRA_IFM1   0x4000UL
+#definePOWER8_MMCRA_IFM2   0x8000UL
+#definePOWER8_MMCRA_IFM3   0xC000UL
+
+#define ONLY_PLM \
+   (PERF_SAMPLE_BRANCH_USER|\
+PERF_SAMPLE_BRANCH_KERNEL  |\
+PERF_SAMPLE_BRANCH_HV)
+
 /*
  * Layout of constraint bits:
  *
@@ -428,6 +438,48 @@ static int power8_generic_events[] = {
[PERF_COUNT_HW_BRANCH_MISSES] = PM_BR_MPRED_CMPL,
 };
 
+static u64 power8_bhrb_filter_map(u64 branch_sample_type)
+{
+   u64 pmu_bhrb_filter = 0;
+   u64 br_privilege = branch_sample_type & ONLY_PLM;
+
+   /* BHRB and regular PMU events share the same prvillege state
+* filter configuration. BHRB is always recorded along with a
+* regular PMU event. So privilege state filter criteria for BHRB
+* and the companion PMU events has to be the same. As a default
+* "perf record" tool sets all privillege bits ON when no filter
+* criteria is provided in the command line. So as along as all
+* privillege bits are ON or they are OFF, we are good to go.
+*/
+   if ((br_privilege != 7) && (br_privilege != 0))
+   return -1;
+
+   /* No branch filter requested */
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY)
+   return pmu_bhrb_filter;
+
+   /* Invalid branch filter options - HW does not support */
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
+   return -1;
+
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_IND_CALL)
+   return -1;
+
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
+   pmu_bhrb_filter |= POWER8_MMCRA_IFM1;
+   return pmu_bhrb_filter;
+   }
+
+   /* Every thing else is unsupported */
+   return -1;
+}
+
+static void power8_config_bhrb(u64 pmu_bhrb_filter)
+{
+   /* Enable BHRB filter in PMU */
+   mtspr(SPRN_MMCRA, (mfspr(SPRN_MMCRA) | pmu_bhrb_filter));
+}
+
 static struct power_pmu power8_pmu = {
.name   = "POWER8",
.n_counter  = 6,
@@ -435,12 +487,15 @@ static struct power_pmu power8_pmu = {
.add_fields = POWER8_ADD_FIELDS,
.test_adder = POWER8_TEST_ADDER,
.compute_mmcr   = power8_compute_mmcr,
+   .config_bhrb= power8_config_bhrb,
+   .bhrb_filter_map= power8_bhrb_filter_map,
.get_constraint = power8_get_constraint,
.disable_pmc= power8_disable_pmc,
-   .flags  = PPMU_HAS_SSLOT | PPMU_HAS_SIER,
+   .flags  = PPMU_HAS_SSLOT | PPMU_HAS_SIER | PPMU_BHRB,
.n_generic  = ARRAY_SIZE(power8_generic_events),
.generic_events = power8_generic_events,
.attr_groups= power8_pmu_attr_groups,
+   .bhrb_nr= 32,
 };
 
 static int __init init_power8_pmu(void)
-- 
1.7.11.7



[PATCH V2 2/5] powerpc, perf: Add basic assembly code to read BHRB entries on POWER8

2013-04-16 Thread Anshuman Khandual
This patch adds the basic assembly code to read BHRB entries

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/perf/Makefile |  2 +-
 arch/powerpc/perf/bhrb.S   | 44 
 2 files changed, 45 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/perf/bhrb.S

diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
index 472db18..510fae1 100644
--- a/arch/powerpc/perf/Makefile
+++ b/arch/powerpc/perf/Makefile
@@ -2,7 +2,7 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror
 
 obj-$(CONFIG_PERF_EVENTS)  += callchain.o
 
-obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o
+obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o bhrb.o
 obj64-$(CONFIG_PPC_PERF_CTRS)  += power4-pmu.o ppc970-pmu.o power5-pmu.o \
   power5+-pmu.o power6-pmu.o power7-pmu.o \
   power8-pmu.o
diff --git a/arch/powerpc/perf/bhrb.S b/arch/powerpc/perf/bhrb.S
new file mode 100644
index 000..0e8a018
--- /dev/null
+++ b/arch/powerpc/perf/bhrb.S
@@ -0,0 +1,44 @@
+/*
+ * Basic assembly code to read BHRB entries
+ *
+ * Copyright 2013 Anshuman Khandual, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+#include 
+#include 
+
+   .text
+
+.balign 8
+
+/* r3 = n  (where n = [0-1023])
+ * The maximum number of BHRB entries supported with PPC_MFBHRBE instruction
+ * is 1024. We have limited number of table entries here as POWER8 implements
+ * 32 BHRB entries.
+ */
+
+/* .global read_bhrb */
+_GLOBAL(read_bhrb)
+   cmpldi  r3,1023
+   bgt 1f
+   ld  r4,bhrb_table@got(r2)
+   sldir3,r3,3
+   add r3,r4,r3
+   mtctr   r3
+   bctr
+1: li  r3,0
+   blr
+
+#define MFBHRB_TABLE1(n) PPC_MFBHRBE(R3,n); blr
+#define MFBHRB_TABLE2(n) MFBHRB_TABLE1(n); MFBHRB_TABLE1(n+1)
+#define MFBHRB_TABLE4(n) MFBHRB_TABLE2(n); MFBHRB_TABLE2(n+2)
+#define MFBHRB_TABLE8(n) MFBHRB_TABLE4(n); MFBHRB_TABLE4(n+4)
+#define MFBHRB_TABLE16(n) MFBHRB_TABLE8(n); MFBHRB_TABLE8(n+8)
+#define MFBHRB_TABLE32(n) MFBHRB_TABLE16(n); MFBHRB_TABLE16(n+16)
+
+bhrb_table:
+   MFBHRB_TABLE32(0)
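
As a usage sketch (not part of this patch): assuming the prototype
u64 read_bhrb(int n) and that an all-zero return value marks an empty
or invalid entry, a C caller could walk the buffer like this:

  u64 val;
  int i;

  for (i = 0; i < 32; i++) {        /* POWER8 implements 32 entries */
          val = read_bhrb(i);
          if (!val)                 /* assumed empty-entry sentinel */
                  break;
          /* decode the branch address and flags from val here */
  }
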
-- 
1.7.11.7



[PATCH V2 3/5] powerpc, perf: Add new BHRB related generic functions, data and flags

2013-04-16 Thread Anshuman Khandual
This patch adds some new BHRB related generic functions, data and flags

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/include/asm/perf_event_server.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
index 57b42da..3f0c15c 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -33,6 +33,8 @@ struct power_pmu {
unsigned long *valp);
int (*get_alternatives)(u64 event_id, unsigned int flags,
u64 alt[]);
+   u64 (*bhrb_filter_map)(u64 branch_sample_type);
+   void(*config_bhrb)(u64 pmu_bhrb_filter);
void(*disable_pmc)(unsigned int pmc, unsigned long mmcr[]);
int (*limited_pmc_event)(u64 event_id);
u32 flags;
@@ -42,6 +44,9 @@ struct power_pmu {
int (*cache_events)[PERF_COUNT_HW_CACHE_MAX]
   [PERF_COUNT_HW_CACHE_OP_MAX]
   [PERF_COUNT_HW_CACHE_RESULT_MAX];
+
+   /* BHRB entries in the PMU */
+   int bhrb_nr;
 };
 
 /*
@@ -54,6 +59,7 @@ struct power_pmu {
 #define PPMU_SIAR_VALID    0x0010 /* Processor has SIAR Valid bit */
 #define PPMU_HAS_SSLOT 0x0020 /* Has sampled slot in MMCRA */
 #define PPMU_HAS_SIER  0x0040 /* Has SIER */
+#define PPMU_BHRB  0x0080 /* has BHRB feature enabled */
 
 /*
  * Values for flags to get_alternatives()
-- 
1.7.11.7



[PATCH 1/2] powerpc, perf: Ignore separate BHRB privilege state filter request

2013-05-21 Thread Anshuman Khandual
Completely ignore BHRB privilege state filter request as we are
already configuring MMCRA register with privilege state filtering
attribute for the accompanying PMU event. This would help achieve
cleaner user space interaction for BHRB.

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/perf/power8-pmu.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c
index f7d1c4f..8ed323d 100644
--- a/arch/powerpc/perf/power8-pmu.c
+++ b/arch/powerpc/perf/power8-pmu.c
@@ -525,16 +525,17 @@ static u64 power8_bhrb_filter_map(u64 branch_sample_type)
u64 pmu_bhrb_filter = 0;
u64 br_privilege = branch_sample_type & ONLY_PLM;
 
-   /* BHRB and regular PMU events share the same prvillege state
+   /* BHRB and regular PMU events share the same prvilege state
 * filter configuration. BHRB is always recorded along with a
-* regular PMU event. So privilege state filter criteria for BHRB
-* and the companion PMU events has to be the same. As a default
-* "perf record" tool sets all privillege bits ON when no filter
-* criteria is provided in the command line. So as along as all
-* privillege bits are ON or they are OFF, we are good to go.
+* regular PMU event. So privilege state filter criteria for
+* the BHRB and the companion PMU events has to be the same.
+* Separate BHRB privillege state filter requests would be
+* ignored.
 */
-   if ((br_privilege != 7) && (br_privilege != 0))
-   return -1;
+
+   if (br_privilege)
+   pr_info("BHRB privilege state filter request %llx ignored\n",
+   br_privilege);
 
/* No branch filter requested */
if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY)
-- 
1.7.11.7



[PATCH 0/2] Improvement and fixes for BHRB

2013-05-21 Thread Anshuman Khandual
(1) The first patch fixes a situation like this

Before patch:-


./perf record -j any -e branch-misses:k ls
Error:
The sys_perf_event_open() syscall returned with 95 (Operation not supported) 
for event (branch-misses:k).
/bin/dmesg may provide additional information.
No CONFIG_PERF_EVENTS=y kernel support configured?

Here 'perf record' actually copies the ':k' filter request over into
the BHRB privilege state filter config, and our previous check in the
kernel would reject that.

After patch:-
-

./perf record -j any -e branch-misses:k ls
perf  perf.data  perf.data.old  test-mmap-ring
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.002 MB perf.data (~102 samples) ]

(2) The second patch fixes context migration for BHRB filter configuration

Anshuman Khandual (2):
  powerpc, perf: Ignore separate BHRB privilege state filter request
  powerpc, perf: BHRB filter configuration should follow the task

 arch/powerpc/perf/core-book3s.c |  5 -
 arch/powerpc/perf/power8-pmu.c  | 17 +
 2 files changed, 13 insertions(+), 9 deletions(-)

-- 
1.7.11.7



[PATCH 2/2] powerpc, perf: BHRB filter configuration should follow the task

2013-05-21 Thread Anshuman Khandual
When the task moves around the system, the corresponding cpuhw
per-cpu structure should be populated with the BHRB filter
request value so that the PMU can be configured appropriately
with it during the next call into power_pmu_enable().

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/perf/core-book3s.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 426180b..48c68a8 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1122,8 +1122,11 @@ nocheck:
 
ret = 0;
  out:
-   if (has_branch_stack(event))
+   if (has_branch_stack(event)) {
power_pmu_bhrb_enable(event);
+   cpuhw->bhrb_filter = ppmu->bhrb_filter_map(
+   event->attr.branch_sample_type);
+   }
 
perf_pmu_enable(event->pmu);
local_irq_restore(flags);
-- 
1.7.11.7



[PATCH 1/5] perf: New conditional branch filter criteria in branch stack sampling

2013-05-21 Thread Anshuman Khandual
POWER8 PMU based BHRB supports filtering for conditional branches.
This patch introduces new branch filter PERF_SAMPLE_BRANCH_COND which
will extend the existing perf ABI. Other architectures can provide
this functionality with either HW filtering support (if present) or
with SW filtering of instructions.
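
Once the whole series is in place, usage would look like this (sketch
only; "cond" is the filter string added by the perf-record patch later
in this series, and the workload is just an example):

  perf record -j cond -e cycles -- ls
  perf report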

Signed-off-by: Anshuman Khandual 
---
 include/uapi/linux/perf_event.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index fb104e5..cb0de86 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -157,8 +157,9 @@ enum perf_branch_sample_type {
PERF_SAMPLE_BRANCH_ANY_CALL = 1U << 4, /* any call branch */
PERF_SAMPLE_BRANCH_ANY_RETURN   = 1U << 5, /* any return branch */
PERF_SAMPLE_BRANCH_IND_CALL = 1U << 6, /* indirect calls */
+   PERF_SAMPLE_BRANCH_COND = 1U << 7, /* conditional branches */
 
-   PERF_SAMPLE_BRANCH_MAX  = 1U << 7, /* non-ABI */
+   PERF_SAMPLE_BRANCH_MAX  = 1U << 8, /* non-ABI */
 };
 
 #define PERF_SAMPLE_BRANCH_PLM_ALL \
-- 
1.7.11.7



[PATCH 5/5] perf, documentation: Description for conditional branch filter

2013-05-21 Thread Anshuman Khandual
Adding documentation support for conditional branch filter.

Signed-off-by: Anshuman Khandual 
---
 tools/perf/Documentation/perf-record.txt | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index d4da111..8b5e1ed 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -169,12 +169,13 @@ following filters are defined:
 - any_call: any function call or system call
 - any_ret: any function return or system call return
 - ind_call: any indirect branch
+- cond: conditional branches
 - u:  only when the branch target is at the user level
 - k: only when the branch target is in the kernel
 - hv: only when the target is at the hypervisor level
 
 +
-The option requires at least one branch type among any, any_call, any_ret, ind_call.
+The option requires at least one branch type among any, any_call, any_ret, ind_call, cond.
 The privilege levels may be omitted, in which case, the privilege levels of the associated
 event are applied to the branch filter. Both kernel (k) and hypervisor (hv) privilege
 levels are subject to permissions.  When sampling on multiple events, branch stack sampling
-- 
1.7.11.7



[PATCH 2/5] powerpc, perf: Enable conditional branch filter for POWER8

2013-05-21 Thread Anshuman Khandual
Enables conditional branch filter support for POWER8
utilizing MMCRA register based filter and also invalidates
a BHRB branch filter combination involving conditional
branches.

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/perf/power8-pmu.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c
index 8ed323d..e60b38f 100644
--- a/arch/powerpc/perf/power8-pmu.c
+++ b/arch/powerpc/perf/power8-pmu.c
@@ -548,11 +548,21 @@ static u64 power8_bhrb_filter_map(u64 branch_sample_type)
if (branch_sample_type & PERF_SAMPLE_BRANCH_IND_CALL)
return -1;
 
+   /* Invalid branch filter combination - HW does not support */
+   if ((branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) &&
+   (branch_sample_type & PERF_SAMPLE_BRANCH_COND))
+   return -1;
+
if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
pmu_bhrb_filter |= POWER8_MMCRA_IFM1;
return pmu_bhrb_filter;
}
 
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_COND) {
+   pmu_bhrb_filter |= POWER8_MMCRA_IFM3;
+   return pmu_bhrb_filter;
+   }
+
/* Every thing else is unsupported */
return -1;
 }
-- 
1.7.11.7



[PATCH 4/5] x86, perf: Add conditional branch filtering support

2013-05-21 Thread Anshuman Khandual
From: Peter Zijlstra 

This patch adds conditional branch filtering support,
enabling it for PERF_SAMPLE_BRANCH_COND in perf branch
stack sampling framework by utilizing an available
software filter X86_BR_JCC.

Signed-off-by: Peter Zijlstra 
Signed-off-by: Anshuman Khandual 
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index d978353..a0d6387 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -337,6 +337,10 @@ static int intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
 
if (br_type & PERF_SAMPLE_BRANCH_IND_CALL)
mask |= X86_BR_IND_CALL;
+
+   if (br_type & PERF_SAMPLE_BRANCH_COND)
+   mask |= X86_BR_JCC;
+
/*
 * stash actual user request into reg, it may
 * be used by fixup code for some CPU
@@ -626,6 +630,7 @@ static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
 * NHM/WSM erratum: must include IND_JMP to capture IND_CALL
 */
[PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP,
+   [PERF_SAMPLE_BRANCH_COND] = LBR_JCC,
 };
 
 static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
@@ -637,6 +642,7 @@ static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
[PERF_SAMPLE_BRANCH_ANY_CALL]   = LBR_REL_CALL | LBR_IND_CALL
| LBR_FAR,
[PERF_SAMPLE_BRANCH_IND_CALL]   = LBR_IND_CALL,
+   [PERF_SAMPLE_BRANCH_COND]   = LBR_JCC,
 };
 
 /* core */
-- 
1.7.11.7



[PATCH 3/5] perf, tool: Conditional branch filter 'cond' added to perf record

2013-05-21 Thread Anshuman Khandual
Adding perf record support for new branch stack filter criteria
PERF_SAMPLE_BRANCH_COND.

Signed-off-by: Anshuman Khandual 
---
 tools/perf/builtin-record.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index cdf58ec..833743a 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -676,6 +676,7 @@ static const struct branch_mode branch_modes[] = {
BRANCH_OPT("any_call", PERF_SAMPLE_BRANCH_ANY_CALL),
BRANCH_OPT("any_ret", PERF_SAMPLE_BRANCH_ANY_RETURN),
BRANCH_OPT("ind_call", PERF_SAMPLE_BRANCH_IND_CALL),
+   BRANCH_OPT("cond", PERF_SAMPLE_BRANCH_CONDITIONAL),
BRANCH_END
 };
 
-- 
1.7.11.7



[PATCH 0/5] perf: Introducing new conditional branch filter

2013-05-21 Thread Anshuman Khandual
This patchset introduces conditional branch filter in perf branch stack
sampling framework incorporating review comments from Michael Neuling,
Peter Zijlstra and Stephane Eranian.

Anshuman Khandual (5):
  perf: New conditional branch filter criteria in branch stack sampling
  powerpc, perf: Enable conditional branch filter for POWER8
  perf, tool: Conditional branch filter 'cond' added to perf record
  x86, perf: Add conditional branch filtering support
  perf, documentation: Description for conditional branch filter

 arch/powerpc/perf/power8-pmu.c | 10 ++
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |  6 ++
 include/uapi/linux/perf_event.h|  3 ++-
 tools/perf/Documentation/perf-record.txt   |  3 ++-
 tools/perf/builtin-record.c|  1 +
 5 files changed, 21 insertions(+), 2 deletions(-)

-- 
1.7.11.7



Re: [PATCH 3/3] perf, x86, lbr: Demand proper privileges for PERF_SAMPLE_BRANCH_KERNEL

2013-05-21 Thread Anshuman Khandual
On 05/21/2013 07:25 PM, Stephane Eranian wrote:
> On Thu, May 16, 2013 at 12:15 PM, Michael Neuling  wrote:
>> Peter Zijlstra  wrote:
>>
>>> On Wed, May 15, 2013 at 03:37:22PM +0200, Stephane Eranian wrote:
 On Fri, May 3, 2013 at 2:11 PM, Peter Zijlstra  wrote:
> We should always have proper privileges when requesting kernel data.
>
> Cc: Andi Kleen 
> Cc: eran...@google.com
> Signed-off-by: Peter Zijlstra 
> Link: http://lkml.kernel.org/n/tip-v0x9ky3ahzr6nm3c6ilwr...@git.kernel.org
> ---
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c |5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> @@ -318,8 +318,11 @@ static void intel_pmu_setup_sw_lbr_filte
> if (br_type & PERF_SAMPLE_BRANCH_USER)
> mask |= X86_BR_USER;
>
> -   if (br_type & PERF_SAMPLE_BRANCH_KERNEL)
> +   if (br_type & PERF_SAMPLE_BRANCH_KERNEL) {
> +   if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
> +   return -EACCES;
> mask |= X86_BR_KERNEL;
> +   }
>
 This will prevent regular users from capturing kernel -> kernel branches.
 But it won't prevent users from getting kernel -> user branches. Thus
 some kernel address will still be captured. I guess they could be eliminated by the sw_filter.

 When using LBR priv level filtering, the filter applies to the branch target only.
>>>
>>> How about something like the below? It also adds the branch flags
>>> Mikey wanted for PowerPC.
>>
>> Peter,
>>
>> BTW PowerPC also has the ability to filter on conditional branches.  Any
>> chance we could add something like the follow to perf also?
>>
>> Mikey
>>
>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
>> index fb104e5..891c769 100644
>> --- a/include/uapi/linux/perf_event.h
>> +++ b/include/uapi/linux/perf_event.h
>> @@ -157,8 +157,9 @@ enum perf_branch_sample_type {
>> PERF_SAMPLE_BRANCH_ANY_CALL = 1U << 4, /* any call branch */
>> PERF_SAMPLE_BRANCH_ANY_RETURN   = 1U << 5, /* any return branch */
>> PERF_SAMPLE_BRANCH_IND_CALL = 1U << 6, /* indirect calls */
>> +   PERF_SAMPLE_BRANCH_CONDITIONAL  = 1U << 7, /* conditional branches */
>>
> I would use PERF_SAMPLE_BRANCH_COND here.
> 
>> -   PERF_SAMPLE_BRANCH_MAX  = 1U << 7, /* non-ABI */
>> +   PERF_SAMPLE_BRANCH_MAX  = 1U << 8, /* non-ABI */
>>  };
>>
>>  #define PERF_SAMPLE_BRANCH_PLM_ALL \
>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>> index cdf58ec..5b0b89d 100644
>> --- a/tools/perf/builtin-record.c
>> +++ b/tools/perf/builtin-record.c
>> @@ -676,6 +676,7 @@ static const struct branch_mode branch_modes[] = {
>> BRANCH_OPT("any_call", PERF_SAMPLE_BRANCH_ANY_CALL),
>> BRANCH_OPT("any_ret", PERF_SAMPLE_BRANCH_ANY_RETURN),
>> BRANCH_OPT("ind_call", PERF_SAMPLE_BRANCH_IND_CALL),
>> +   BRANCH_OPT("cnd", PERF_SAMPLE_BRANCH_CONDITIONAL),
> 
> use "cond"
> 
>> BRANCH_END
>>  };
>>
> 
> And if you do this, you also need to update the x86
> perf_event_intel_lbr.c mapping
> tables to fill out the entries for PERF_SAMPLE_BRANCH_COND:
> 
> [PERF_SAMPLE_BRANCH_COND]   = LBR_JCC,
> 
> And you also need to update intel_pmu_setup_sw_lbr_filter()
> to handle the conversion to x86 instructions:
> 
>if (br_type & PERF_SAMPLE_BRANCH_COND)
> mask |= X86_BR_JCC;
> 
> 
> You also need to update the perf-record.txt documentation to list cond
> as a possible
> branch filter.

Hey Stephane,

I have incorporated all the review comments into the patch series
https://lkml.org/lkml/2013/5/22/51.

Regards
Anshuman



Re: [PATCH 1/2] powerpc, perf: Ignore separate BHRB privilege state filter request

2013-05-22 Thread Anshuman Khandual
> 
> Your description from patch 0 should be here.
>

Sure, will bring it here.

> 
>> diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c
>> index f7d1c4f..8ed323d 100644
>> --- a/arch/powerpc/perf/power8-pmu.c
>> +++ b/arch/powerpc/perf/power8-pmu.c
>> @@ -525,16 +525,17 @@ static u64 power8_bhrb_filter_map(u64 branch_sample_type)
>>  u64 pmu_bhrb_filter = 0;
>>  u64 br_privilege = branch_sample_type & ONLY_PLM;
>>  
>> -/* BHRB and regular PMU events share the same prvillege state
>> +/* BHRB and regular PMU events share the same prvilege state
> 
> Please spell "privilege" correctly.
> 

My bad, will fix it.


>>   * filter configuration. BHRB is always recorded along with 
> It still says "privilege state filter criteria for the BHRB and the
> companion PMU events has to be the same".
> 
> But they don't, right?
> 

Right


> What it should say is "we ignore the privilege bits in the branch sample
> type because they are handled by the underlying PMC configuration" - or
> something like that.

Here is the latest description for the code block

/* BHRB and regular PMU events share the same privilege state
 * filter configuration. BHRB is always recorded along with a
 * regular PMU event. As the privilege state filter is handled
 * in the basic PMC configuration of the accompanying regular
 * PMU event, we ignore any separate BHRB specific request.
 */

Does it sound better?

> 
>> -if ((br_privilege != 7) && (br_privilege != 0))
>> -return -1;
>> +
>> +if (br_privilege)
>> +pr_info("BHRB privilege state filter request %llx ignored\n",
>> +br_privilege);
> 
> Don't do that. Ignoring the br_privilege is either the right thing to do
> in which case we do it and print nothing,


I thought the informational print would at least make the user aware
of the fact that the separate filter request for BHRB was ignored.
Can we add this somewhere in the documentation?

  
> or it doesn't make sense and
> we reject it.
>

 
> cheers
> 



Re: [PATCH 3/3] perf, x86, lbr: Demand proper privileges for PERF_SAMPLE_BRANCH_KERNEL

2013-05-22 Thread Anshuman Khandual
On 05/22/2013 05:53 PM, Stephane Eranian wrote:
> Hi,
> 
> 
> 
> On Wed, May 22, 2013 at 8:43 AM, Anshuman Khandual  wrote:
>> On 05/21/2013 07:25 PM, Stephane Eranian wrote:
>>> On Thu, May 16, 2013 at 12:15 PM, Michael Neuling  wrote:
>>>> Peter Zijlstra  wrote:
>>>>
>>>>> On Wed, May 15, 2013 at 03:37:22PM +0200, Stephane Eranian wrote:
>>>>>> On Fri, May 3, 2013 at 2:11 PM, Peter Zijlstra  wrote:
>>>>>>> We should always have proper privileges when requesting kernel data.
>>>>>>>
>>>>>>> Cc: Andi Kleen 
>>>>>>> Cc: eran...@google.com
>>>>>>> Signed-off-by: Peter Zijlstra 
>>>>>>> Link: http://lkml.kernel.org/n/tip-v0x9ky3ahzr6nm3c6ilwr...@git.kernel.org
>>>>>>> ---
>>>>>>>  arch/x86/kernel/cpu/perf_event_intel_lbr.c |5 -
>>>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
>>>>>>> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
>>>>>>> @@ -318,8 +318,11 @@ static void intel_pmu_setup_sw_lbr_filte
>>>>>>> if (br_type & PERF_SAMPLE_BRANCH_USER)
>>>>>>> mask |= X86_BR_USER;
>>>>>>>
>>>>>>> -   if (br_type & PERF_SAMPLE_BRANCH_KERNEL)
>>>>>>> +   if (br_type & PERF_SAMPLE_BRANCH_KERNEL) {
>>>>>>> +   if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
>>>>>>> +   return -EACCES;
>>>>>>> mask |= X86_BR_KERNEL;
>>>>>>> +   }
>>>>>>>
>>>>>> This will prevent regular users from capturing kernel -> kernel branches.
>>>>>> But it won't prevent users from getting kernel -> user branches. Thus
>>>>>> some kernel address will still be captured. I guess they could be eliminated by the sw_filter.
>>>>>>
>>>>>> When using LBR priv level filtering, the filter applies to the branch target only.
>>>>>
>>>>> How about something like the below? It also adds the branch flags
>>>>> Mikey wanted for PowerPC.
>>>>
>>>> Peter,
>>>>
>>>> BTW PowerPC also has the ability to filter on conditional branches.  Any
>>>> chance we could add something like the follow to perf also?
>>>>
>>>> Mikey
>>>>
>>>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
>>>> index fb104e5..891c769 100644
>>>> --- a/include/uapi/linux/perf_event.h
>>>> +++ b/include/uapi/linux/perf_event.h
>>>> @@ -157,8 +157,9 @@ enum perf_branch_sample_type {
>>>> PERF_SAMPLE_BRANCH_ANY_CALL = 1U << 4, /* any call branch */
>>>> PERF_SAMPLE_BRANCH_ANY_RETURN   = 1U << 5, /* any return branch */
>>>> PERF_SAMPLE_BRANCH_IND_CALL = 1U << 6, /* indirect calls */
>>>> +   PERF_SAMPLE_BRANCH_CONDITIONAL  = 1U << 7, /* conditional branches */
>>>>
>>> I would use PERF_SAMPLE_BRANCH_COND here.
>>>
>>>> -   PERF_SAMPLE_BRANCH_MAX  = 1U << 7, /* non-ABI */
>>>> +   PERF_SAMPLE_BRANCH_MAX  = 1U << 8, /* non-ABI */
>>>>  };
>>>>
>>>>  #define PERF_SAMPLE_BRANCH_PLM_ALL \
>>>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>>>> index cdf58ec..5b0b89d 100644
>>>> --- a/tools/perf/builtin-record.c
>>>> +++ b/tools/perf/builtin-record.c
>>>> @@ -676,6 +676,7 @@ static const struct branch_mode branch_modes[] = {
>>>> BRANCH_OPT("any_call", PERF_SAMPLE_BRANCH_ANY_CALL),
>>>> BRANCH_OPT("any_ret", PERF_SAMPLE_BRANCH_ANY_RETURN),
>>>> BRANCH_OPT("ind_call", PERF_SAMPLE_BRANCH_IND_CALL),
>>>> +   BRANCH_OPT("cnd", PERF_SAMPLE_BRANCH_CONDITIONAL),
>>>
>>> use "cond"
>>>
>>>> BRANCH_END
>>>>  };
>>>>
>>>
>>> And if you do this, you also need to update the x86
>>> perf_event_intel_lbr.c mapping
>>> tables to fill out the entries for PERF_SAMPLE_BRANCH_COND:
>>>
>>> [PERF_SAMPLE_BRANCH_COND]   = LBR_JCC,
>>>
>>> And you also need to update intel_pmu_setup_sw_lbr_filter()
>>> to handle the conversion to x86 instructions:
>>>
>>>if (br_type & PERF_SAMPLE_BRANCH_COND)
>>> mask |= X86_BR_JCC;
>>>
>>>
>>> You also need to update the perf-record.txt documentation to list cond
>>> as a possible
>>> branch filter.
>>
>> Hey Stephane,
>>
>> I have incorporated all the review comments into the patch series
>> https://lkml.org/lkml/2013/5/22/51.
>>
> I don't see how you can compile Patch 3/5:
> 
> + BRANCH_OPT("cond", PERF_SAMPLE_BRANCH_CONDITIONAL),
> 
> Needs to be PERF_SAMPLE_BRANCH_COND.
> 

Ahh, sorry missed it, will fix it.



Re: [PATCH 1/2] powerpc, perf: Ignore separate BHRB privilege state filter request

2013-05-27 Thread Anshuman Khandual
On 05/22/2013 02:29 PM, Anshuman Khandual wrote:
>>
>> Your description from patch 0 should be here.
> Does it sound better ?
> 
>>
>>> -   if ((br_privilege != 7) && (br_privilege != 0))
>>> -   return -1;
>>> +
>>> +   if (br_privilege)
>>> +   pr_info("BHRB privilege state filter request %llx ignored\n",
>>> +   br_privilege);
>>
>> Don't do that. Ignoring the br_privilege is either the right thing to do
>> in which case we do it and print nothing,
> 
> 
> I thought the informational print would at least make the user aware
> of the fact that the separate filter request for BHRB was ignored.
> Can we add this somewhere in the documentation?

So, what do we decide here? Will we just ignore any separate BHRB
privilege state filter request without printing any informational
message or warning?




Re: how to implement platform specific per process parameter?

2012-08-06 Thread Anshuman Khandual
> There are 4 options:

> 1. [not a kernel interface] use ptrace to execute the register changing
> command inside the specified pid. The next context switch saves the new
> value in the thread_struct. Dirty hack.
> 
> 2. Add a new syscall which would receive pid + register value and do the
> job. A bit too much.
> 
> 3. Add some hook in /proc filesystem but so far there were no platform
> specific bits, right?
> 
> 4. Implement a static node "/sys/devices/system/cpu/dscr_control".
> write() would parse the input stream, call scanf("%d %x", &pid, &dscr)
> and do the job.
> 


A /sys/ interface would be appropriate, I believe. That way we can take a
new (pid, dscr) pair and update thread_struct. But there should also be a
way to list all (pid, dscr) values which are explicitly set by the user and
differ from /sys/devices/system/cpu/dscr_default, so that we can know which
process is holding what value of DSCR at any point of time.
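
To make option 4 concrete, something like the following minimal store()
handler sketch is what I have in mind (the attribute name and the
thread_struct field here are assumptions, not existing kernel code):

  static ssize_t dscr_control_store(struct device *dev,
                                    struct device_attribute *attr,
                                    const char *buf, size_t count)
  {
          struct task_struct *task;
          unsigned long dscr;
          pid_t pid;

          /* parse "<pid> <dscr>" from the user's write */
          if (sscanf(buf, "%d %lx", &pid, &dscr) != 2)
                  return -EINVAL;

          rcu_read_lock();
          task = find_task_by_vpid(pid);
          if (task)
                  task->thread.dscr = dscr;   /* assumed field */
          rcu_read_unlock();

          return task ? count : -ESRCH;
  }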

 

> What is the correct approach? Thanks.
> 
> 


Regards
Anshuman



Re: [RFC][PATCH] perf: Add a few generic stalled-cycles events

2012-10-14 Thread Anshuman Khandual
On 10/12/2012 06:58 AM, Sukadev Bhattiprolu wrote:
> 
> From 89cb6a25b9f714e55a379467a832ee015014ed11 Mon Sep 17 00:00:00 2001
> From: Sukadev Bhattiprolu 
> Date: Tue, 18 Sep 2012 10:59:01 -0700
> Subject: [PATCH] perf: Add a few generic stalled-cycles events
> 
> The existing generic event 'stalled-cycles-backend' corresponds to
> PM_CMPLU_STALL event in Power7. While this event is useful, detailed
> performance analysis often requires us to find more specific reasons
> for the stalled cycle. For instance, stalled cycles in Power7 can
> occur due to, among others:
> 
>   - instruction fetch unit (IFU),
>   - Load-store-unit (LSU),
>   - Fixed point unit (FXU)
>   - Branch unit (BRU)
> 
> While it is possible to use raw codes to monitor these events, it quickly
> becomes cumbersome with performance analysis frequently requiring mapping
> the raw event codes in reports to their symbolic names.
> 
> This patch is a proposal to try and generalize such perf events. Since
> the code changes are quite simple, I bunched all the 4 events together.
> 
> I am not familiar with how readily these events would map to other
> architectures. Here is some information on the events for Power7:
> 
>   stalled-cycles-fixed-point (PM_CMPLU_STALL_FXU)
> 
>   Following a completion stall, the last instruction to finish
>   before completion resumes was from the Fixed Point Unit.
> 
>   Completion stall is any period when no groups completed and
>   the completion table was not empty for that thread.
> 
>   stalled-cycles-load-store (PM_CMPLU_STALL_LSU)
> 
>   Following a completion stall, the last instruction to finish
>   before completion resumes was from the Load-Store Unit.
> 
>   stalled-cycles-instruction-fetch (PM_CMPLU_STALL_IFU)
> 
>   Following a completion stall, the last instruction to finish
>   before completion resumes was from the Instruction Fetch Unit.
> 
>   stalled-cycles-branch (PM_CMPLU_STALL_BRU)
> 
>   Following a completion stall, the last instruction to finish
>   before completion resumes was from the Branch Unit.
> 
> Looking for feedback on this approach and if this can be further extended.
> Power7 has 530 events[2] out of which a "CPI stack analysis"[1] uses about 26
> events.
> 
> 
> [1] CPI Stack analysis
>   
> https://www.power.org/documentation/commonly-used-metrics-for-performance-analysis
> 
> [2] Power7 events:
>   
> https://www.power.org/documentation/comprehensive-pmu-event-reference-power7/

Here we should try to come up with a generic list of places in the
processor where the cycles can stall.

PERF_COUNT_HW_STALLED_CYCLES_FIXED_POINT
PERF_COUNT_HW_STALLED_CYCLES_LOAD_STORE
PERF_COUNT_HW_STALLED_CYCLES_INSTRUCTION_FETCH
PERF_COUNT_HW_STALLED_CYCLES_BRANCH
PERF_COUNT_HW_STALLED_CYCLES_
PERF_COUNT_HW_STALLED_CYCLES_
PERF_COUNT_HW_STALLED_CYCLES_
---

This generic list can be a superset which can accommodate all the
architectures, giving the flexibility to implement selectively
thereafter. Stall locations are very important from a CPI analysis
standpoint with real world use cases. This will definitely help us in
that direction.
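
To make the intent concrete, opening one of these events from user
space would look no different from today's stalled-cycles-backend
(minimal sketch; the finer-grained PERF_COUNT_HW_STALLED_CYCLES_*
names are only the proposal above, not current ABI):

  #include <linux/perf_event.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <stdio.h>

  int main(void)
  {
          struct perf_event_attr attr = {
                  .type     = PERF_TYPE_HARDWARE,
                  .size     = sizeof(attr),
                  /* today: backend stalls; the proposal adds siblings
                   * such as ..._LOAD_STORE or ..._FIXED_POINT */
                  .config   = PERF_COUNT_HW_STALLED_CYCLES_BACKEND,
                  .disabled = 1,
          };
          int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

          if (fd < 0)
                  perror("perf_event_open");
          else
                  close(fd);
          return 0;
  }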

Regards
Anshuman



Re: [RFC][PATCH] perf: Add a few generic stalled-cycles events

2012-10-15 Thread Anshuman Khandual
On 10/15/2012 10:53 PM, Arun Sharma wrote:
> On 10/15/12 8:55 AM, Robert Richter wrote:
> 
> [..]
>> Perf tool works then out-of-the-box with:
>>
>>   $ perf record -e cpu/stalled-cycles-fixed-point/ ...
>>
>> The event string can easily be reused by other architectures as a
>> quasi standard.
> 
> I like Robert's proposal better. It's hard to model all the stall events
> (eg: instruction decoder related stalls on x86) in a hardware
> independent way.
> 
> Another area to think about: software engineers are generally busy and
> have a limited amount of time to devote to hardware event based
> optimizations. The most common question I hear is: what is the expected
> perf gain if I fix this? It's hard to answer that with just the stall
> events.
> 

Hardware event based optimization is a very important aspect of real
world application tuning. CPI stack analysis is a good reason why perf
should have stall events as generic ones. But I am not clear on the
situations where we consider adding these new generic events into
linux/perf_event.h versus the situations where we should go with the
sysfs interface. Could you please elaborate on this?

Regards
Anshuman



Re: [PATCH V3] mm: Add sysfs interface to dump each node's zonelist information

2016-09-06 Thread Anshuman Khandual
On 09/07/2016 08:38 AM, Kees Cook wrote:
> On Tue, Sep 6, 2016 at 1:36 PM, Dave Hansen  wrote:
>> On 09/06/2016 01:31 AM, Anshuman Khandual wrote:
>>> [NODE (0)]
>>> ZONELIST_FALLBACK
>>> (0) (node 0) (zone DMA c140c000)
>>> (1) (node 1) (zone DMA c001)
>>> (2) (node 2) (zone DMA c002)
>>> (3) (node 3) (zone DMA c003)
>>> ZONELIST_NOFALLBACK
>>> (0) (node 0) (zone DMA c140c000)
>>
>> Don't we have some prohibition on dumping out kernel addresses like this
>> so that attackers can't trivially defeat kernel layout randomization?
> 
> Anything printing memory addresses should be using %pK (not %lx as done here).

Learned about the significance of %pK coupled with the kptr_restrict
interface. Will change this. Thanks for pointing out.



Re: [PATCH 5/5] mm: cleanup pfn_t usage in track_pfn_insert()

2016-09-06 Thread Anshuman Khandual
On 09/06/2016 10:19 PM, Dan Williams wrote:
> Now that track_pfn_insert() is no longer used in the DAX path, it no
> longer needs to comprehend pfn_t values.
> 
> Signed-off-by: Dan Williams 
> ---
>  arch/x86/mm/pat.c |4 ++--
>  include/asm-generic/pgtable.h |4 ++--
>  mm/memory.c   |2 +-
>  3 files changed, 5 insertions(+), 5 deletions(-)

A small nit: shouldn't the arch/x86/mm/pat.c changes be separated out
into a different patch? I faced a bit of a problem separating out the
generic core mm changes from the arch specific mm changes when going
through the commits in retrospect.



Re: [PATCH V3] mm: Add sysfs interface to dump each node's zonelist information

2016-09-07 Thread Anshuman Khandual
On 09/06/2016 02:35 PM, kbuild test robot wrote:
> Hi Anshuman,
> 
> [auto build test ERROR on driver-core/driver-core-testing]
> [also build test ERROR on v4.8-rc5 next-20160906]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> [Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for 
> convenience) to record what (public, well-known) commit your patch series was 
> built on]
> [Check https://git-scm.com/docs/git-format-patch for more information]
> 
> url:
> https://github.com/0day-ci/linux/commits/Anshuman-Khandual/mm-Add-sysfs-interface-to-dump-each-node-s-zonelist-information/20160906-163752
> config: x86_64-randconfig-x019-201636 (attached as .config)
> compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=x86_64 

I am not able to reproduce this build failure with Fedora 24
and gcc (GCC) 6.1.1 20160621 on an x86 laptop. Maybe including
mmzone.h in page_alloc.c will be enough to take care of any
issues.



[PATCH V4] mm: Add sysfs interface to dump each node's zonelist information

2016-09-07 Thread Anshuman Khandual
Each individual node in the system has a ZONELIST_FALLBACK zonelist
and a ZONELIST_NOFALLBACK zonelist. These zonelists decide fallback
order of zones during memory allocations. Sometimes it helps to dump
these zonelists to see the priority order of various zones in them.

Particularly on platforms which support memory hotplug into zones that
did not exist at boot, this interface helps in visualizing which
zonelists of the system, and at what priority level, the newly hot
added memory ends up in. POWER is such a platform, where all the
memory detected during boot remains with ZONE_DMA for good, but the
hotplug process can actually get new memory into ZONE_MOVABLE.
So having a way to get the snapshot of the zonelists on the system
after memory or node hot[un]plug is desirable. This change adds one
new sysfs interface (/sys/devices/system/memory/system_zone_details)
which will fetch and dump this information.
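
Usage is a plain read of the new file, for instance:

  # cat /sys/devices/system/memory/system_zone_details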

Example zonelist information from a KVM guest.

[NODE (0)]
ZONELIST_FALLBACK
(0) (node 0) (DMA 0xc0006300)
(1) (node 1) (DMA 0xc0016300)
(2) (node 2) (DMA 0xc0026300)
(3) (node 3) (DMA 0xc003ffdba300)
ZONELIST_NOFALLBACK
(0) (node 0) (DMA 0xc0006300)
[NODE (1)]
ZONELIST_FALLBACK
(0) (node 1) (DMA 0xc0016300)
(1) (node 2) (DMA 0xc0026300)
(2) (node 3) (DMA 0xc003ffdba300)
(3) (node 0) (DMA 0xc0006300)
ZONELIST_NOFALLBACK
(0) (node 1) (DMA 0xc0016300)
[NODE (2)]
ZONELIST_FALLBACK
(0) (node 2) (DMA 0xc0026300)
(1) (node 3) (DMA 0xc003ffdba300)
(2) (node 0) (DMA 0xc0006300)
(3) (node 1) (DMA 0xc0016300)
ZONELIST_NOFALLBACK
(0) (node 2) (DMA 0xc0026300)
[NODE (3)]
ZONELIST_FALLBACK
(0) (node 3) (DMA 0xc003ffdba300)
(1) (node 0) (DMA 0xc0006300)
(2) (node 1) (DMA 0xc0016300)
(3) (node 2) (DMA 0xc0026300)
ZONELIST_NOFALLBACK
(0) (node 3) (DMA 0xc003ffdba300)

Signed-off-by: Anshuman Khandual 
---
Changes in V4:
- Explicitly included mmzone.h header inside page_alloc.c
- Changed the kernel address printing from %lx to %pK

Changes in V3:
- Moved all these new sysfs code inside CONFIG_NUMA

Changes in V2:
- Added more details into the commit message
- Added sysfs interface file details into the commit message
- Added ../ABI/testing/sysfs-system-zone-details file

 .../ABI/testing/sysfs-system-zone-details  |  9 
 drivers/base/memory.c  | 52 ++
 mm/page_alloc.c|  1 +
 3 files changed, 62 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-system-zone-details

diff --git a/Documentation/ABI/testing/sysfs-system-zone-details b/Documentation/ABI/testing/sysfs-system-zone-details
new file mode 100644
index 000..9c13b2e
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-system-zone-details
@@ -0,0 +1,9 @@
+What:  /sys/devices/system/memory/system_zone_details
+Date:  Sep 2016
+KernelVersion: 4.8
+Contact:   khand...@linux.vnet.ibm.com
+Description:
+   This read only file dumps the zonelist and its constituent
+   zones information for both ZONELIST_FALLBACK and ZONELIST_
+   NOFALLBACK zonelists for each online node of the system at
+   any given point of time.
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index dc75de9..c7ab991 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -442,7 +442,56 @@ print_block_size(struct device *dev, struct device_attribute *attr,
return sprintf(buf, "%lx\n", get_memory_block_size());
 }
 
+#ifdef CONFIG_NUMA
+static ssize_t dump_zonelist(char *buf, struct zonelist *zonelist)
+{
+   unsigned int i;
+   ssize_t count = 0;
+
+   for (i = 0; zonelist->_zonerefs[i].zone; i++) {
+   count += sprintf(buf + count,
+   "\t\t(%d) (node %d) (%-7s 0x%pK)\n", i,
+   zonelist->_zonerefs[i].zone->zone_pgdat->node_id,
+   zone_names[zonelist->_zonerefs[i].zone_idx],
+   (void *) zonelist->_zonerefs[i].zone);
+   }
+   return count;
+}
+
+static ssize_t dump_zonelists(char *buf)
+{
+   struct zonelist *zonelist;
+   unsigned int node;
+   ssize_t count = 0;
+
+   for_each_online_node(node) {
+   zonelist = &(NODE_DATA(node)->
+   node_zonelists[ZONELIST_FALLBACK]);
+ 

Re: [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64

2016-09-08 Thread Anshuman Khandual
On 09/07/2016 10:16 PM, Huang, Ying wrote:
> From: Huang Ying 
> 
> In this patch, the size of the swap cluster is changed to that of the
> THP (Transparent Huge Page) on x86_64 architecture (512).  This is for
> the THP swap support on x86_64.  Where one swap cluster will be used to
> hold the contents of each THP swapped out.  And some information of the
> swapped out THP (such as compound map count) will be recorded in the
> swap_cluster_info data structure.
> 
> For other architectures which want THP swap support, THP_SWAP_CLUSTER
> need to be selected in the Kconfig file for the architecture.
> 
> In effect, this will enlarge swap cluster size by 2 times on x86_64.
> Which may make it harder to find a free cluster when the swap space
> becomes fragmented.  So that, this may reduce the continuous swap space
> allocation and sequential write in theory.  The performance test in 0day
> shows no regressions caused by this.

This patch needs to be split into two separate ones

(1) Add THP_SWAP_CLUSTER config option
(2) Enable CONFIG_THP_SWAP_CLUSTER for X86_64

The first patch should explain the proposal and the second patch
should have the x86_64 arch specific details, regressions etc. as
already explained in the commit message.



Re: [PATCH -v3 03/10] mm, memcg: Support to charge/uncharge multiple swap entries

2016-09-08 Thread Anshuman Khandual
On 09/07/2016 10:16 PM, Huang, Ying wrote:
> From: Huang Ying 
> 
> This patch make it possible to charge or uncharge a set of continuous
> swap entries in the swap cgroup.  The number of swap entries is
> specified via an added parameter.
> 
> This will be used for the THP (Transparent Huge Page) swap support.
> Where a swap cluster backing a THP may be allocated and freed as a
> whole.  So a set of continuous swap entries (512 on x86_64) backing one

Please use HPAGE_SIZE / PAGE_SIZE instead of a hard-coded number like 512.



Re: [PATCH -v3 04/10] mm, THP, swap: Add swap cluster allocate/free functions

2016-09-08 Thread Anshuman Khandual
On 09/07/2016 10:16 PM, Huang, Ying wrote:
> From: Huang Ying 
> 
> The swap cluster allocation/free functions are added based on the
> existing swap cluster management mechanism for SSD.  These functions
> don't work for the rotating hard disks because the existing swap cluster
> management mechanism doesn't work for them.  The hard disks support may
> be added if someone really need it.  But that needn't be included in
> this patchset.
> 
> This will be used for the THP (Transparent Huge Page) swap support.
> Where one swap cluster will hold the contents of each THP swapped out.

Which tree is this series based against? This patch does not apply
on the mainline kernel.



Re: [PATCH -v3 07/10] mm, THP, swap: Support to add/delete THP to/from swap cache

2016-09-08 Thread Anshuman Khandual
On 09/07/2016 10:16 PM, Huang, Ying wrote:
> From: Huang Ying 
> 
> With this patch, a THP (Transparent Huge Page) can be added/deleted
> to/from the swap cache as a set of sub-pages (512 on x86_64).
> 
> This will be used for the THP (Transparent Huge Page) swap support.
> Where one THP may be added/delted to/from the swap cache.  This will
> batch the swap cache operations to reduce the lock acquire/release times
> for the THP swap too.
> 
> Cc: Hugh Dickins 
> Cc: Shaohua Li 
> Cc: Minchan Kim 
> Cc: Rik van Riel 
> Cc: Andrea Arcangeli 
> Cc: Kirill A. Shutemov 
> Signed-off-by: "Huang, Ying" 
> ---
>  include/linux/page-flags.h |  2 +-
>  mm/swap_state.c| 57 +++---
>  2 files changed, 40 insertions(+), 19 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 74e4dda..f5bcbea 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -314,7 +314,7 @@ PAGEFLAG_FALSE(HighMem)
>  #endif
>  
>  #ifdef CONFIG_SWAP
> -PAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
> +PAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)

What is the reason for this change? The commit message does not seem
to explain it.



[PATCH 1/2] mm: Move definition of 'zone_names' array into mmzone.h

2016-08-30 Thread Anshuman Khandual
zone_names[] is used to identify any zone given its index, which can
be useful in many other places. So move the definition into
include/linux/mmzone.h for broader access.

Signed-off-by: Anshuman Khandual 
---
 include/linux/mmzone.h | 17 +++++++++++++++++
 mm/page_alloc.c        | 17 -----------------
 2 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d572b78..01ca3f4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -341,6 +341,23 @@ enum zone_type {
 
 };
 
+static char * const zone_names[__MAX_NR_ZONES] = {
+#ifdef CONFIG_ZONE_DMA
+"DMA",
+#endif
+#ifdef CONFIG_ZONE_DMA32
+"DMA32",
+#endif
+"Normal",
+#ifdef CONFIG_HIGHMEM
+"HighMem",
+#endif
+"Movable",
+#ifdef CONFIG_ZONE_DEVICE
+"Device",
+#endif
+};
+
 #ifndef __GENERATING_BOUNDS_H
 
 struct zone {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3fbe73a..8e2261c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -207,23 +207,6 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
 
 EXPORT_SYMBOL(totalram_pages);
 
-static char * const zone_names[MAX_NR_ZONES] = {
-#ifdef CONFIG_ZONE_DMA
-"DMA",
-#endif
-#ifdef CONFIG_ZONE_DMA32
-"DMA32",
-#endif
-"Normal",
-#ifdef CONFIG_HIGHMEM
-"HighMem",
-#endif
-"Movable",
-#ifdef CONFIG_ZONE_DEVICE
-"Device",
-#endif
-};
-
 char * const migratetype_names[MIGRATE_TYPES] = {
"Unmovable",
"Movable",
-- 
1.9.3



[PATCH 2/2] mm: Add sysfs interface to dump each node's zonelist information

2016-08-30 Thread Anshuman Khandual
Each individual node in the system has a ZONELIST_FALLBACK zonelist
and a ZONELIST_NOFALLBACK zonelist. These zonelists decide fallback
order of zones during memory allocations. Sometimes it helps to dump
these zonelists to see the priority order of various zones in them.
This change just adds a sysfs interface for doing the same.

Example zonelist information from a KVM guest.

[NODE (0)]
ZONELIST_FALLBACK
(0) (node 0) (zone DMA c140c000)
(1) (node 1) (zone DMA c001)
(2) (node 2) (zone DMA c002)
(3) (node 3) (zone DMA c003)
ZONELIST_NOFALLBACK
(0) (node 0) (zone DMA c140c000)
[NODE (1)]
ZONELIST_FALLBACK
(0) (node 1) (zone DMA c001)
(1) (node 2) (zone DMA c002)
(2) (node 3) (zone DMA c003)
(3) (node 0) (zone DMA c140c000)
ZONELIST_NOFALLBACK
(0) (node 1) (zone DMA c001)
[NODE (2)]
ZONELIST_FALLBACK
(0) (node 2) (zone DMA c002)
(1) (node 3) (zone DMA c003)
(2) (node 0) (zone DMA c140c000)
(3) (node 1) (zone DMA c001)
ZONELIST_NOFALLBACK
(0) (node 2) (zone DMA c002)
[NODE (3)]
ZONELIST_FALLBACK
(0) (node 3) (zone DMA c003)
(1) (node 0) (zone DMA c140c000)
(2) (node 1) (zone DMA c001)
(3) (node 2) (zone DMA c002)
ZONELIST_NOFALLBACK
(0) (node 3) (zone DMA c003)

Signed-off-by: Anshuman Khandual 
---
 drivers/base/memory.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index dc75de9..8c9330a 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -442,7 +442,52 @@ print_block_size(struct device *dev, struct device_attribute *attr,
return sprintf(buf, "%lx\n", get_memory_block_size());
 }
 
+static ssize_t dump_zonelist(char *buf, struct zonelist *zonelist)
+{
+   unsigned int i;
+   ssize_t count = 0;
+
+   for (i = 0; zonelist->_zonerefs[i].zone; i++) {
+   count += sprintf(buf + count,
+   "\t\t(%d) (node %d) (%-10s %lx)\n", i,
+   zonelist->_zonerefs[i].zone->zone_pgdat->node_id,
+   zone_names[zonelist->_zonerefs[i].zone_idx],
+   (unsigned long) zonelist->_zonerefs[i].zone);
+   }
+   return count;
+}
+
+static ssize_t dump_zonelists(char *buf)
+{
+   struct zonelist *zonelist;
+   unsigned int node;
+   ssize_t count = 0;
+
+   for_each_online_node(node) {
+   zonelist = &(NODE_DATA(node)->
+   node_zonelists[ZONELIST_FALLBACK]);
+   count += sprintf(buf + count, "[NODE (%d)]\n", node);
+   count += sprintf(buf + count, "\tZONELIST_FALLBACK\n");
+   count += dump_zonelist(buf + count, zonelist);
+
+   zonelist = &(NODE_DATA(node)->
+   node_zonelists[ZONELIST_NOFALLBACK]);
+   count += sprintf(buf + count, "\tZONELIST_NOFALLBACK\n");
+   count += dump_zonelist(buf + count, zonelist);
+   }
+   return count;
+}
+
+static ssize_t
+print_system_zone_details(struct device *dev, struct device_attribute *attr,
+char *buf)
+{
+   return dump_zonelists(buf);
+}
+
+
 static DEVICE_ATTR(block_size_bytes, 0444, print_block_size, NULL);
+static DEVICE_ATTR(system_zone_details, 0444, print_system_zone_details, NULL);
 
 /*
  * Memory auto online policy.
@@ -783,6 +828,7 @@ static struct attribute *memory_root_attrs[] = {
 #endif
 
&dev_attr_block_size_bytes.attr,
+   &dev_attr_system_zone_details.attr,
&dev_attr_auto_online_blocks.attr,
NULL
 };
-- 
1.9.3



Re: [PATCH 1/2] mm: Move definition of 'zone_names' array into mmzone.h

2016-09-01 Thread Anshuman Khandual
On 09/01/2016 02:40 AM, Andrew Morton wrote:
> On Wed, 31 Aug 2016 08:55:49 +0530 Anshuman Khandual 
>  wrote:
> 
>> zone_names[] is used to identify any zone given its index, which can
>> be useful in many other places. So move the definition into
>> include/linux/mmzone.h for broader access.
>>
>> ...
>>
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -341,6 +341,23 @@ enum zone_type {
>>  
>>  };
>>  
>> +static char * const zone_names[__MAX_NR_ZONES] = {
>> +#ifdef CONFIG_ZONE_DMA
>> + "DMA",
>> +#endif
>> +#ifdef CONFIG_ZONE_DMA32
>> + "DMA32",
>> +#endif
>> + "Normal",
>> +#ifdef CONFIG_HIGHMEM
>> + "HighMem",
>> +#endif
>> + "Movable",
>> +#ifdef CONFIG_ZONE_DEVICE
>> + "Device",
>> +#endif
>> +};
>> +
>>  #ifndef __GENERATING_BOUNDS_H
>>  
>>  struct zone {
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3fbe73a..8e2261c 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -207,23 +207,6 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
>>  
>>  EXPORT_SYMBOL(totalram_pages);
>>  
>> -static char * const zone_names[MAX_NR_ZONES] = {
>> -#ifdef CONFIG_ZONE_DMA
>> - "DMA",
>> -#endif
>> -#ifdef CONFIG_ZONE_DMA32
>> - "DMA32",
>> -#endif
>> - "Normal",
>> -#ifdef CONFIG_HIGHMEM
>> - "HighMem",
>> -#endif
>> - "Movable",
>> -#ifdef CONFIG_ZONE_DEVICE
>> - "Device",
>> -#endif
>> -};
>> -
>>  char * const migratetype_names[MIGRATE_TYPES] = {
>>  "Unmovable",
>>  "Movable",
> 
> This is worrisome.  On some (ancient) compilers, this will produce a
> copy of that array into each compilation unit which includes mmzone.h.
> 
> On smarter compilers, it will produce a copy of the array in each
> compilation unit which *uses* zone_names[].
> 
> On even smarter compilers (and linkers!), only one copy of zone_names[]
> will exist in vmlinux.
> 
> I don't know if gcc is an "even smarter compiler" and I didn't check,
> and I didn't check which gcc versions are even smarter.  I'd rather not
> have to ;) It is risky.
> 
> So, let's just make it non-static and add a declaration into mmzone.h,
> please.
> 

I understand your concern, will change it.
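
Something like the following, I suppose (a sketch of the suggested fix;
the existing initializer stays in page_alloc.c unchanged):

	/* include/linux/mmzone.h: declaration only */
	extern char * const zone_names[MAX_NR_ZONES];

	/* mm/page_alloc.c: keep the single definition, minus 'static' */
	char * const zone_names[MAX_NR_ZONES] = {
		/* ... existing initializer as above ... */
	};

That guarantees exactly one copy of the array in vmlinux, independent of
how smart the compiler and linker are.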



Re: [PATCH 2/2] mm: Add sysfs interface to dump each node's zonelist information

2016-09-01 Thread Anshuman Khandual
On 09/01/2016 02:42 AM, Andrew Morton wrote:
> On Wed, 31 Aug 2016 08:55:50 +0530 Anshuman Khandual 
>  wrote:
> 
>> Each individual node in the system has a ZONELIST_FALLBACK zonelist
>> and a ZONELIST_NOFALLBACK zonelist. These zonelists decide fallback
>> order of zones during memory allocations. Sometimes it helps to dump
>> these zonelists to see the priority order of various zones in them.
>> This change just adds a sysfs interface for doing the same.
>>
>> Example zonelist information from a KVM guest.
>>
>> [NODE (0)]
>> ZONELIST_FALLBACK
>> (0) (node 0) (zone DMA c140c000)
>> (1) (node 1) (zone DMA c001)
>> (2) (node 2) (zone DMA c002)
>> (3) (node 3) (zone DMA c003)
>> ZONELIST_NOFALLBACK
>> (0) (node 0) (zone DMA c140c000)
>> [NODE (1)]
>> ZONELIST_FALLBACK
>> (0) (node 1) (zone DMA c001)
>> (1) (node 2) (zone DMA c002)
>> (2) (node 3) (zone DMA c003)
>> (3) (node 0) (zone DMA c140c000)
>> ZONELIST_NOFALLBACK
>> (0) (node 1) (zone DMA c001)
>> [NODE (2)]
>> ZONELIST_FALLBACK
>> (0) (node 2) (zone DMA c002)
>> (1) (node 3) (zone DMA c003)
>> (2) (node 0) (zone DMA c140c000)
>> (3) (node 1) (zone DMA c001)
>> ZONELIST_NOFALLBACK
>> (0) (node 2) (zone DMA c002)
>> [NODE (3)]
>> ZONELIST_FALLBACK
>> (0) (node 3) (zone DMA c003)
>> (1) (node 0) (zone DMA c140c000)
>> (2) (node 1) (zone DMA c001)
>> (3) (node 2) (zone DMA c002)
>> ZONELIST_NOFALLBACK
>> (0) (node 3) (zone DMA c003)
> 
> Can you please sell this a bit better?  Why does it "sometimes help"?
> Why does the benefit of this patch to our users justify the overhead
> and cost?

On platforms which support memory hotplug into zones that did not
exist at boot, this interface helps in visualizing which of the
system's zonelists the newly hot-added memory ends up in. POWER is
such a platform: all the memory detected during boot remains with
ZONE_DMA for good, but the hotplug process can bring new memory into
ZONE_MOVABLE. So having a way to get a snapshot of the zonelists on
the system after a memory or node hot[un]plug is a good thing, IMHO.

> 
> Please document the full path to the sysfs file(s) within the changelog.

Sure, will do.
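
(For reference: with the attribute hooked into memory_root_attrs, the
file should appear as /sys/devices/system/memory/system_zone_details,
assuming the default sysfs layout for the memory subsystem root.)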

> 
> Please find somewhere in Documentation/ to document the new interface.
> 

Sure, will create the following file describing the interface.

Documentation/ABI/testing/sysfs-system-zone-details



Re: [RFC v2 00/12] powerpc: Memory Protection Keys

2017-06-19 Thread Anshuman Khandual
On 06/20/2017 10:40 AM, Balbir Singh wrote:
> On Fri, 2017-06-16 at 20:52 -0700, Ram Pai wrote:
>> Memory protection keys enable applications to protect its
>> address space from inadvertent access or corruption from
>> itself.
> 
> I presume by itself you mean protection between threads?

Between threads due to race conditions, or from the same thread
because of a programming error.

> 
>>
>> The overall idea:
>>
>>  A process allocates a key and associates it with
>>  an address range within its address space.
> 
> OK, so this is per VMA?

Yeah, but the same key can be given to multiple VMAs. Any
change will affect every VMA that got tagged by it.

> 
>>  The process then can dynamically set read/write
>>  permissions on the key without involving the
>>  kernel.
> 
> This bit is not clear, how can the key be set without
> involving the kernel? I presume you mean the key is set

With the pkey_mprotect() system call, all the affected PTEs get
tagged once. Switching the permission then happens just by
writing into the register on the fly.

> in the PTE's and the access protection values can be
> set without involving the kernel?

PTE setting happens once; access protection values can be
changed on the fly through the register.



Re: [RFC v2 11/12]Documentation: Documentation updates.

2017-06-19 Thread Anshuman Khandual
On 06/17/2017 09:22 AM, Ram Pai wrote:
> The Documentation file is moved from x86 into the generic area,
> since this feature is now supported by more than one arch.
> 
> Signed-off-by: Ram Pai 
> ---
>  Documentation/vm/protection-keys.txt  | 110 
> ++
>  Documentation/x86/protection-keys.txt |  85 --

I am not sure whether this is a good idea. There might be specifics
for each architecture which would need to be detailed again in this
new generic file.

>  2 files changed, 110 insertions(+), 85 deletions(-)
>  create mode 100644 Documentation/vm/protection-keys.txt
>  delete mode 100644 Documentation/x86/protection-keys.txt
> 
> diff --git a/Documentation/vm/protection-keys.txt 
> b/Documentation/vm/protection-keys.txt
> new file mode 100644
> index 000..b49e6bb
> --- /dev/null
> +++ b/Documentation/vm/protection-keys.txt
> @@ -0,0 +1,110 @@
> +Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
> +found in new generations of Intel CPUs and on PowerPC CPUs.
> +
> +Memory Protection Keys provides a mechanism for enforcing page-based
> +protections, but without requiring modification of the page tables
> +when an application changes protection domains.

Should the resultant access through protection keys be a
subset of the protection bits enabled through the original PTE
PROT format? And are the semantics exactly the same on x86
and powerpc?

> +
> +
> +On Intel:
> +
> +It works by dedicating 4 previously ignored bits in each page table
> +entry to a "protection key", giving 16 possible keys.
> +
> +There is also a new user-accessible register (PKRU) with two separate
> +bits (Access Disable and Write Disable) for each key.  Being a CPU
> +register, PKRU is inherently thread-local, potentially giving each
> +thread a different set of protections from every other thread.
> +
> +There are two new instructions (RDPKRU/WRPKRU) for reading and writing
> +to the new register.  The feature is only available in 64-bit mode,
> +even though there is theoretically space in the PAE PTEs.  These
> +permissions are enforced on data access only and have no effect on
> +instruction fetches.
> +
> +
> +On PowerPC:
> +
> +It works by dedicating 5 bits in each page table entry to a
> +"protection key", giving 32 possible keys.
> +
> +There is a user-accessible register (AMR) with two separate bits
> +(Access Disable and Write Disable) for each key.  Being a CPU
> +register, AMR is inherently thread-local, potentially giving each
> +thread a different set of protections from every other thread.

Small nit. Space needed here.

> +NOTE: Disabling read permission does not disable
> +write and vice-versa.
> +
> +The feature is available on 64-bit HPTE mode only.
> +
> +'mtspr 0xd, mem' writes into the AMR register.
> +'mfspr mem, 0xd' reads the AMR register.
> +
> +Permissions are enforced on data access only and have no effect on
> +instruction fetches.
> +
> +=== Syscalls ===
> +
> +There are 3 system calls which directly interact with pkeys:
> +
> + int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
> + int pkey_free(int pkey);
> + int pkey_mprotect(unsigned long start, size_t len,
> +   unsigned long prot, int pkey);
> +
> +Before a pkey can be used, it must first be allocated with
> +pkey_alloc().  An application calls the WRPKRU instruction
> +directly in order to change access permissions to memory covered
> +with a key.  In this example WRPKRU is wrapped by a C function
> +called pkey_set().
> +
> + int real_prot = PROT_READ|PROT_WRITE;
> + pkey = pkey_alloc(0, PKEY_DENY_WRITE);
> + ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 
> 0);
> + ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
> + ... application runs here
> +
> +Now, if the application needs to update the data at 'ptr', it can
> +gain access, do the update, then remove its write access:
> +
> + pkey_set(pkey, 0); // clear PKEY_DENY_WRITE
> + *ptr = foo; // assign something
> + pkey_set(pkey, PKEY_DENY_WRITE); // set PKEY_DENY_WRITE again
> +
> +Now when it frees the memory, it will also free the pkey since it
> +is no longer in use:
> +
> + munmap(ptr, PAGE_SIZE);
> + pkey_free(pkey);
> +
> +(Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions.
> + An example implementation can be found in
> + tools/testing/selftests/x86/protection_keys.c)
> +
> +=== Behavior ===
> +
> +The kernel attempts to make protection keys consistent with the
> +behavior of a plain mprotect().  For instance if you do this:
> +
> + mprotect(ptr, size, PROT_NONE);
> + something(ptr);
> +
> +you can expect the same effects with protection keys when doing this:
> +
> + pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_READ);
> + pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey);
> + somet

Re: [RFC v2 12/12]selftest: Updated protection key selftest

2017-06-19 Thread Anshuman Khandual
On 06/17/2017 09:22 AM, Ram Pai wrote:
> Added test support for the PowerPC implementation of protection keys.
> 
> Signed-off-by: Ram Pai 

First of all, there are a lot of instances where we use *pkru*-named
functions on power, even though the real implementations take care
of doing the appropriate things. That looks pretty hacky. We need to
change them to generic names first, before adding both the x86 and
powerpc procedures inside them.
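
For example, a minimal sketch of an arch-neutral wrapper the selftest
could use instead (the powerpc side assumes AMR is SPR 0xd, as per the
documentation patch in this series; the x86 bytes are the RDPKRU
encoding):

	static inline unsigned long pkey_reg_read(void)
	{
	#if defined(__i386__) || defined(__x86_64__)
		unsigned int eax, edx;

		/* RDPKRU: reads PKRU into EAX, ECX must be zero */
		asm volatile(".byte 0x0f,0x01,0xee"
			     : "=a" (eax), "=d" (edx)
			     : "c" (0));
		return eax;
	#elif defined(__powerpc64__)
		unsigned long amr;

		/* AMR is SPR 0xd */
		asm volatile("mfspr %0, 0xd" : "=r" (amr));
		return amr;
	#endif
	}

A pkey_reg_write() counterpart would hide WRPKRU/mtspr the same way.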



Re: [RFC v2 10/12] powerpc: Read AMR only if pkey-violation caused the exception.

2017-06-19 Thread Anshuman Khandual
On 06/19/2017 11:29 PM, Ram Pai wrote:
> On Mon, Jun 19, 2017 at 09:06:13PM +1000, Michael Ellerman wrote:
>> Ram Pai  writes:
>>
>>> Signed-off-by: Ram Pai 
>>> ---
>>>  arch/powerpc/kernel/exceptions-64s.S | 16 ++--
>>>  1 file changed, 10 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
>>> b/arch/powerpc/kernel/exceptions-64s.S
>>> index 8db9ef8..a4de1b4 100644
>>> --- a/arch/powerpc/kernel/exceptions-64s.S
>>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>>> @@ -493,13 +493,15 @@ EXC_COMMON_BEGIN(data_access_common)
>>> ld  r12,_MSR(r1)
>>> ld  r3,PACA_EXGEN+EX_DAR(r13)
>>> lwz r4,PACA_EXGEN+EX_DSISR(r13)
>>> +   std r3,_DAR(r1)
>>> +   std r4,_DSISR(r1)
>>>  #ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>>> +   andis.  r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */
>>> +   beq+1f
>>
>> This seems to be incremental on top of one of your other patches.
>>
>> But I don't see why, can you please just squash this into whatever patch
>> adds this code in the first place.
> 
> It was an optimization added later. But yes it can be squashed into an
> earlier patch.

Could you please explain what optimization this achieves?



Re: [RFC v2 09/12] powerpc: Deliver SEGV signal on pkey violation.

2017-06-19 Thread Anshuman Khandual
On 06/17/2017 09:22 AM, Ram Pai wrote:
> The value of the AMR register at the time of exception
> is made available in gp_regs[PT_AMR] of the siginfo.
> 
> This field can be used to reprogram the permission bits of
> any valid pkey.
> 
> Similarly the value of the pkey, whose protection got violated,
> is made available at si_pkey field of the siginfo structure.
> 
> Signed-off-by: Ram Pai 
> ---
>  arch/powerpc/include/asm/paca.h|  1 +
>  arch/powerpc/include/uapi/asm/ptrace.h |  3 ++-
>  arch/powerpc/kernel/asm-offsets.c  |  5 
>  arch/powerpc/kernel/exceptions-64s.S   |  8 ++
>  arch/powerpc/kernel/signal_32.c| 14 ++
>  arch/powerpc/kernel/signal_64.c| 14 ++
>  arch/powerpc/kernel/traps.c| 49 
> ++
>  arch/powerpc/mm/fault.c|  4 +++
>  8 files changed, 97 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> index 1c09f8f..a41afd3 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -92,6 +92,7 @@ struct paca_struct {
>   struct dtl_entry *dispatch_log_end;
>  #endif /* CONFIG_PPC_STD_MMU_64 */
>   u64 dscr_default;   /* per-CPU default DSCR */
> + u64 paca_amr;   /* value of amr at exception */
> 
>  #ifdef CONFIG_PPC_STD_MMU_64
>   /*
> diff --git a/arch/powerpc/include/uapi/asm/ptrace.h 
> b/arch/powerpc/include/uapi/asm/ptrace.h
> index 8036b38..7ec2428 100644
> --- a/arch/powerpc/include/uapi/asm/ptrace.h
> +++ b/arch/powerpc/include/uapi/asm/ptrace.h
> @@ -108,8 +108,9 @@ struct pt_regs {
>  #define PT_DAR   41
>  #define PT_DSISR 42
>  #define PT_RESULT 43
> -#define PT_DSCR 44
>  #define PT_REGS_COUNT 44
> +#define PT_DSCR 44
> +#define PT_AMR   45

PT_REGS_COUNT is not getting incremented even after adding
one more element into the pack ?

> 
>  #define PT_FPR0  48  /* each FP reg occupies 2 slots in this space */
> 
> diff --git a/arch/powerpc/kernel/asm-offsets.c 
> b/arch/powerpc/kernel/asm-offsets.c
> index 709e234..17f5d8a 100644
> --- a/arch/powerpc/kernel/asm-offsets.c
> +++ b/arch/powerpc/kernel/asm-offsets.c
> @@ -241,6 +241,11 @@ int main(void)
>   OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id);
>   OFFSET(PACAKEXECSTATE, paca_struct, kexec_state);
>   OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default);
> +
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + OFFSET(PACA_AMR, paca_struct, paca_amr);
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +

So we now have a place in PACA for AMR.

>   OFFSET(ACCOUNT_STARTTIME, paca_struct, accounting.starttime);
>   OFFSET(ACCOUNT_STARTTIME_USER, paca_struct, accounting.starttime_user);
>   OFFSET(ACCOUNT_USER_TIME, paca_struct, accounting.utime);
> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> b/arch/powerpc/kernel/exceptions-64s.S
> index 3fd0528..8db9ef8 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -493,6 +493,10 @@ EXC_COMMON_BEGIN(data_access_common)
>   ld  r12,_MSR(r1)
>   ld  r3,PACA_EXGEN+EX_DAR(r13)
>   lwz r4,PACA_EXGEN+EX_DSISR(r13)
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + mfspr   r5,SPRN_AMR
> + std r5,PACA_AMR(r13)
> +#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>   li  r5,0x300
>   std r3,_DAR(r1)
>   std r4,_DSISR(r1)
> @@ -561,6 +565,10 @@ EXC_COMMON_BEGIN(instruction_access_common)
>   ld  r12,_MSR(r1)
>   ld  r3,_NIP(r1)
>   andis.  r4,r12,0x5820
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + mfspr   r5,SPRN_AMR
> + std r5,PACA_AMR(r13)
> +#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */

We save the AMR context on page faults here; this seems to change
again in the next patch, based on whether any key was active at
that point and whether the fault happened because of key
enforcement?

>   li  r5,0x400
>   std r3,_DAR(r1)
>   std r4,_DSISR(r1)
> diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
> index 97bb138..059766a 100644
> --- a/arch/powerpc/kernel/signal_32.c
> +++ b/arch/powerpc/kernel/signal_32.c
> @@ -500,6 +500,11 @@ static int save_user_regs(struct pt_regs *regs, struct 
> mcontext __user *frame,
>  (unsigned long) &frame->tramp[2]);
>   }
> 
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + if (__put_user(get_paca()->paca_amr, &frame->mc_gregs[PT_AMR]))
> + return 1;
> +#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +
>   return 0;
>  }
> 
> @@ -661,6 +666,9 @@ static long restore_user_regs(struct pt_regs *regs,
>   long err;
>   unsigned int save_r2 = 0;
>   unsigned long msr;
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + unsigned long amr;
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>  #ifdef CONFIG_VSX
>   int i;
>

Re: [RFC v2 08/12] powerpc: Handle exceptions caused by violation of pkey protection.

2017-06-20 Thread Anshuman Khandual
On 06/17/2017 09:22 AM, Ram Pai wrote:
> Handle Data and Instruction exceptions caused by memory
> protection-key.
> 
> Signed-off-by: Ram Pai 
> (cherry picked from commit a5e5217619a0c475fe0cacc3b0cf1d3d33c79a09)

Which tree does this commit belong to?

> 
> Conflicts:
>   arch/powerpc/include/asm/reg.h
>   arch/powerpc/kernel/exceptions-64s.S
> ---
>  arch/powerpc/include/asm/mmu_context.h | 12 +
>  arch/powerpc/include/asm/pkeys.h   |  9 
>  arch/powerpc/include/asm/reg.h |  7 +--
>  arch/powerpc/mm/fault.c| 21 +++-
>  arch/powerpc/mm/pkeys.c| 90 
> ++
>  5 files changed, 134 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mmu_context.h 
> b/arch/powerpc/include/asm/mmu_context.h
> index da7e943..71fffe0 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -175,11 +175,23 @@ static inline void arch_bprm_mm_init(struct mm_struct 
> *mm,
>  {
>  }
> 
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> +bool arch_pte_access_permitted(pte_t pte, bool write);
> +bool arch_vma_access_permitted(struct vm_area_struct *vma,
> + bool write, bool execute, bool foreign);
> +#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +static inline bool arch_pte_access_permitted(pte_t pte, bool write)
> +{
> + /* by default, allow everything */
> + return true;
> +}
>  static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
>   bool write, bool execute, bool foreign)
>  {
>   /* by default, allow everything */
>   return true;
>  }

Right, these are the two functions the core VM expects the
arch to provide.

> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +
>  #endif /* __KERNEL__ */
>  #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
> diff --git a/arch/powerpc/include/asm/pkeys.h 
> b/arch/powerpc/include/asm/pkeys.h
> index 9b6820d..405e7db 100644
> --- a/arch/powerpc/include/asm/pkeys.h
> +++ b/arch/powerpc/include/asm/pkeys.h
> @@ -14,6 +14,15 @@
>   VM_PKEY_BIT3 | \
>   VM_PKEY_BIT4)
> 
> +static inline u16 pte_flags_to_pkey(unsigned long pte_flags)
> +{
> + return ((pte_flags & H_PAGE_PKEY_BIT4) ? 0x1 : 0x0) |
> + ((pte_flags & H_PAGE_PKEY_BIT3) ? 0x2 : 0x0) |
> + ((pte_flags & H_PAGE_PKEY_BIT2) ? 0x4 : 0x0) |
> + ((pte_flags & H_PAGE_PKEY_BIT1) ? 0x8 : 0x0) |
> + ((pte_flags & H_PAGE_PKEY_BIT0) ? 0x10 : 0x0);
> +}

Add defines for the above 0x1, 0x2, 0x4, 0x8 etc ?
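
Something like this, for instance (the names are only a suggestion):

	#define PKEY_BIT0	0x1UL
	#define PKEY_BIT1	0x2UL
	#define PKEY_BIT2	0x4UL
	#define PKEY_BIT3	0x8UL
	#define PKEY_BIT4	0x10UL

	static inline u16 pte_flags_to_pkey(unsigned long pte_flags)
	{
		return ((pte_flags & H_PAGE_PKEY_BIT4) ? PKEY_BIT0 : 0x0UL) |
			((pte_flags & H_PAGE_PKEY_BIT3) ? PKEY_BIT1 : 0x0UL) |
			((pte_flags & H_PAGE_PKEY_BIT2) ? PKEY_BIT2 : 0x0UL) |
			((pte_flags & H_PAGE_PKEY_BIT1) ? PKEY_BIT3 : 0x0UL) |
			((pte_flags & H_PAGE_PKEY_BIT0) ? PKEY_BIT4 : 0x0UL);
	}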

> +
>  #define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \
>   ((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) |\
>   ((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) |\
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index 2dcb8a1..a11977f 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -285,9 +285,10 @@
>  #define   DSISR_UNSUPP_MMU   0x0008  /* Unsupported MMU config */
>  #define   DSISR_SET_RC   0x0004  /* Failed setting of 
> R/C bits */
>  #define   DSISR_PGDIRFAULT  0x0002  /* Fault on page directory */
> -#define   DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
> - DSISR_PAGEATTR_CONFLT | \
> - DSISR_BADACCESS |   \
> +#define   DSISR_PAGE_FAULT_MASK (DSISR_BIT32 |   \
> + DSISR_PAGEATTR_CONFLT | \
> + DSISR_BADACCESS |   \
> + DSISR_KEYFAULT |\
>   DSISR_BIT43)

This should have been cleaned up before adding the new
DSISR_KEYFAULT reason code into it. But I guess it's
okay.

>  #define SPRN_TBRL0x10C   /* Time Base Read Lower Register (user, R/O) */
>  #define SPRN_TBRU0x10D   /* Time Base Read Upper Register (user, R/O) */
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 3a7d580..c31624f 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -216,9 +216,10 @@ int do_page_fault(struct pt_regs *regs, unsigned long 
> address,
>* bits we are interested in.  But there are some bits which
>* indicate errors in DSISR but can validly be set in SRR1.
>*/
> - if (trap == 0x400)
> + if (trap == 0x400) {
>   error_code &= 0x4820;
> - else
> + flags |= FAULT_FLAG_INSTRUCTION;
> + } else
>   is_write = error_code & DSISR_ISSTORE;
>  #else

Why add FAULT_FLAG_INSTRUCTION here?

>   is_write = error_code & ESR_DST;
> @@ -261,6 +262,13 @@ int do_page_fault(struct pt_regs *regs, unsigned long 
> address,
>   }
>  #endif
> 
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + if (error_code & DSISR_KEYFAULT) {
> + code = SEGV_PKUERR;
> +  

Re: [RFC v2 07/12] powerpc: Macro the mask used for checking DSI exception

2017-06-20 Thread Anshuman Khandual
On 06/17/2017 09:22 AM, Ram Pai wrote:
> Replace the magic number used to check for DSI exception
> with a meaningful value.
> 
> Signed-off-by: Ram Pai 
> ---
>  arch/powerpc/include/asm/reg.h   | 9 -
>  arch/powerpc/kernel/exceptions-64s.S | 2 +-
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index 7e50e47..2dcb8a1 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -272,16 +272,23 @@
>  #define SPRN_DAR 0x013   /* Data Address Register */
>  #define SPRN_DBCR0x136   /* e300 Data Breakpoint Control Reg */
>  #define SPRN_DSISR   0x012   /* Data Storage Interrupt Status Register */
> +#define   DSISR_BIT320x8000  /* not defined */
>  #define   DSISR_NOHPTE   0x4000  /* no translation found 
> */
> +#define   DSISR_PAGEATTR_CONFLT  0x2000  /* page attribute 
> conflict */
> +#define   DSISR_BIT350x1000  /* not defined */
>  #define   DSISR_PROTFAULT0x0800  /* protection fault */
>  #define   DSISR_BADACCESS0x0400  /* bad access to CI or G */
>  #define   DSISR_ISSTORE  0x0200  /* access was a store */
>  #define   DSISR_DABRMATCH0x0040  /* hit data breakpoint */
> -#define   DSISR_NOSEGMENT0x0020  /* SLB miss */
>  #define   DSISR_KEYFAULT 0x0020  /* Key fault */
> +#define   DSISR_BIT430x0010  /* not defined */
>  #define   DSISR_UNSUPP_MMU   0x0008  /* Unsupported MMU config */
>  #define   DSISR_SET_RC   0x0004  /* Failed setting of 
> R/C bits */
>  #define   DSISR_PGDIRFAULT  0x0002  /* Fault on page directory */
> +#define   DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
> + DSISR_PAGEATTR_CONFLT | \
> + DSISR_BADACCESS |   \
> + DSISR_BIT43)

Sorry, missed this one. Seems like there are a couple of unnecessary
line additions in the subsequent patch, which adds the new PKEY
reason code.

-#define   DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
-   DSISR_PAGEATTR_CONFLT | \
-   DSISR_BADACCESS |   \
+#define   DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
+   DSISR_PAGEATTR_CONFLT | \
+   DSISR_BADACCESS |   \
+   DSISR_KEYFAULT |\
DSISR_BIT43)





Re: [RFC v2 06/12] powerpc: Program HPTE key protection bits.

2017-06-20 Thread Anshuman Khandual
On 06/17/2017 09:22 AM, Ram Pai wrote:
> Map the PTE protection key bits to the HPTE key protection bits,
> while creatiing HPTE  entries.
> 
> Signed-off-by: Ram Pai 
> ---
>  arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 +
>  arch/powerpc/include/asm/pkeys.h  | 7 +++
>  arch/powerpc/mm/hash_utils_64.c   | 5 +
>  3 files changed, 17 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
> b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> index cfb8169..3d7872c 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> @@ -90,6 +90,8 @@
>  #define HPTE_R_PP0   ASM_CONST(0x8000)
>  #define HPTE_R_TSASM_CONST(0x4000)
>  #define HPTE_R_KEY_HIASM_CONST(0x3000)
> +#define HPTE_R_KEY_BIT0  ASM_CONST(0x2000)
> +#define HPTE_R_KEY_BIT1  ASM_CONST(0x1000)
>  #define HPTE_R_RPN_SHIFT 12
>  #define HPTE_R_RPN   ASM_CONST(0x0000)
>  #define HPTE_R_RPN_3_0   ASM_CONST(0x01fff000)
> @@ -104,6 +106,9 @@
>  #define HPTE_R_C ASM_CONST(0x0080)
>  #define HPTE_R_R ASM_CONST(0x0100)
>  #define HPTE_R_KEY_LOASM_CONST(0x0e00)
> +#define HPTE_R_KEY_BIT2  ASM_CONST(0x0800)
> +#define HPTE_R_KEY_BIT3  ASM_CONST(0x0400)
> +#define HPTE_R_KEY_BIT4  ASM_CONST(0x0200)
> 

Should we indicate/document how these 5 bits are not contiguous
in the HPTE format for any given real page ?

>  #define HPTE_V_1TB_SEG   ASM_CONST(0x4000)
>  #define HPTE_V_VRMA_MASK ASM_CONST(0x4001ff00)
> diff --git a/arch/powerpc/include/asm/pkeys.h 
> b/arch/powerpc/include/asm/pkeys.h
> index 0f3dca8..9b6820d 100644
> --- a/arch/powerpc/include/asm/pkeys.h
> +++ b/arch/powerpc/include/asm/pkeys.h
> @@ -27,6 +27,13 @@
>   ((vm_flags & VM_PKEY_BIT3) ? H_PAGE_PKEY_BIT1 : 0x0UL) | \
>   ((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL))
> 
> +#define calc_pte_to_hpte_pkey_bits(pteflags) \
> + (((pteflags & H_PAGE_PKEY_BIT0) ? HPTE_R_KEY_BIT0 : 0x0UL) |\
> + ((pteflags & H_PAGE_PKEY_BIT1) ? HPTE_R_KEY_BIT1 : 0x0UL) | \
> + ((pteflags & H_PAGE_PKEY_BIT2) ? HPTE_R_KEY_BIT2 : 0x0UL) | \
> + ((pteflags & H_PAGE_PKEY_BIT3) ? HPTE_R_KEY_BIT3 : 0x0UL) | \
> + ((pteflags & H_PAGE_PKEY_BIT4) ? HPTE_R_KEY_BIT4 : 0x0UL))
> +

We can drop calc_ in here. pte_to_hpte_pkey_bits should be
sufficient.



Re: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.

2017-06-20 Thread Anshuman Khandual
On 06/17/2017 09:22 AM, Ram Pai wrote:
> Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
> in the 4K backed hpte pages. These bits continue to be used
> for 64K backed hpte pages in this patch, but will be freed
> up in the next patch.

The numbering 3, 4, 5 and 6 is in BE format, I believe; I was
initially trying to read it from right to left, as we normally
do in the kernel, and was getting confused. So basically these
bits (which are only applicable to the 64K mapping, IIUC) are going
to be freed up from the PTE format.

#define _RPAGE_RSV1 0x1000000000000000UL
#define _RPAGE_RSV2 0x0800000000000000UL
#define _RPAGE_RSV3 0x0400000000000000UL
#define _RPAGE_RSV4 0x0200000000000000UL

As you have mentioned before, this feature is available for the 64K
page size only and not for 4K mappings. So I assume we support
both combinations.

* 64K mapping on 64K
* 64K mapping on 4K

These are the current users of the above bits

#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
#define H_PAGE_F_GIX(_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
#define H_PAGE_HASHPTE  _RPAGE_RPN43	/* PTE has associated HPTE */

> 
> The patch does the following change to the 64K PTE format
> 
> H_PAGE_BUSY moves from bit 3 to bit 9

And what is there on bit 9 now? This?

#define _RPAGE_SW2  0x00400

which is used as 

#define _PAGE_SPECIAL   _RPAGE_SW2 /* software: special page */

which will not be required anymore?

> H_PAGE_F_SECOND which occupied bit 4 moves to the second part
>   of the pte.
> H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
>   second part of the pte.
> 
> the four bits (H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
> are initialized to 0xF, indicating an invalid slot. If a hpte
> gets cached in a 0xF slot (i.e. 7th slot of secondary), it is
> released immediately. In other words, even though 0xF is a

Does "released immediately" mean we attempt again for a new hash slot?

> valid slot, we discard it and consider it as an invalid
> slot; i.e. hpte_soft_invalid(). This gives us an opportunity to not
> depend on a bit in the primary PTE in order to determine the
> validity of a slot.

So we have to look at the slot number in the second half of each PTE
to figure out whether it has got a valid slot in the hash page table.
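
Presumably via the helper the changelog names, i.e. something like this
(a sketch; the actual hunk is not quoted above):

	static inline bool hpte_soft_invalid(unsigned long hidx)
	{
		/* 0xF, though architecturally valid, is treated as "no slot" */
		return (hidx & 0xfUL) == 0xfUL;
	}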

> 
> When we release a hpte in the 0xF slot, we also release a
> legitimate primary slot and unmap that entry. This is to
> ensure that we do get a legitimate non-0xF slot the next time we
> retry for a slot.

Okay.

> 
> Though treating the 0xF slot as invalid reduces the number of available
> slots and may have an effect on the performance, the probability
> of hitting a 0xF is extremely low.

Why do you say that? I thought every slot number has the same
probability of being hit from the hash function.

> 
> Compared to the current scheme, the above described scheme reduces
> the number of false hash table updates significantly and has the

How does it reduce false hash table updates?

> added advantage of releasing four valuable PTE bits for other
> purpose.
> 
> This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> Ellermen and myself.
> 
> 4K PTE format remain unchanged currently.
> 
> Signed-off-by: Ram Pai 
> ---
>  arch/powerpc/include/asm/book3s/64/hash-4k.h  | 20 +++
>  arch/powerpc/include/asm/book3s/64/hash-64k.h | 32 +++
>  arch/powerpc/include/asm/book3s/64/hash.h | 15 +++--
>  arch/powerpc/include/asm/book3s/64/mmu-hash.h |  5 ++
>  arch/powerpc/mm/dump_linuxpagetables.c|  3 +-
>  arch/powerpc/mm/hash64_4k.c   | 14 ++---
>  arch/powerpc/mm/hash64_64k.c  | 81 
> ---
>  arch/powerpc/mm/hash_utils_64.c   | 30 +++---
>  8 files changed, 122 insertions(+), 78 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
> b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> index b4b5e6b..5ef1d81 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> @@ -16,6 +16,18 @@
>  #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE)
>  #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
> 
> +
> +/*
> + * Only supported by 4k linux page size
> + */
> +#define H_PAGE_F_SECOND	_RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> +#define H_PAGE_F_GIX   (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> +#define H_PAGE_F_GIX_SHIFT 56
> +
> +#define H_PAGE_BUSY  _RPAGE_RSV1 /* software: PTE & hash are busy */
> +#define H_PAGE_HASHPTE	_RPAGE_RPN43	/* PTE has associated HPTE */
> +
> +

So we moved the common 64K definitions here.


>  /* PTE flags to conserve for HPTE identification */
>  #define _PAGE_HPTEFLAGS (H

Re: [RFC v2 02/12] powerpc: Free up four 64K PTE bits in 64K backed hpte pages.

2017-06-20 Thread Anshuman Khandual
On 06/17/2017 09:22 AM, Ram Pai wrote:
> Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
> in the 64K backed hpte pages. This along with the earlier
> patch will entirely free up the four bits from 64K PTE.
> 
> This patch does the following change to 64K PTE that is
> backed by 64K hpte.
> 
> H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> of the pte.
> H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
> second part of the pte.
> 
> since bit 7 is now freed up, we move H_PAGE_BUSY from bit 9
> to bit 7. Trying to minimize gaps so that contiguous bits
> can be allocated if needed in the future.
> 
> The second part of the PTE will hold
> (H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63.

I still don't understand how we freed up the 5th bit, which is
used in the 5th patch. Was that bit never used for anything
with the 64K page size (64K and 4K mappings)?

+#define _RPAGE_RSV5	0x00040UL

+#define H_PAGE_PKEY_BIT0   _RPAGE_RSV1
+#define H_PAGE_PKEY_BIT1   _RPAGE_RSV2
+#define H_PAGE_PKEY_BIT2   _RPAGE_RSV3
+#define H_PAGE_PKEY_BIT3   _RPAGE_RSV4
+#define H_PAGE_PKEY_BIT4   _RPAGE_RSV5



Re: [RFC v2 09/12] powerpc: Deliver SEGV signal on pkey violation.

2017-06-20 Thread Anshuman Khandual
On 06/21/2017 05:26 AM, Ram Pai wrote:
> On Tue, Jun 20, 2017 at 12:24:53PM +0530, Anshuman Khandual wrote:
>> On 06/17/2017 09:22 AM, Ram Pai wrote:
>>> The value of the AMR register at the time of exception
>>> is made available in gp_regs[PT_AMR] of the siginfo.
>>>
>>> This field can be used to reprogram the permission bits of
>>> any valid pkey.
>>>
>>> Similarly the value of the pkey, whose protection got violated,
>>> is made available at si_pkey field of the siginfo structure.
>>>
>>> Signed-off-by: Ram Pai 
>>> ---
>>>  arch/powerpc/include/asm/paca.h|  1 +
>>>  arch/powerpc/include/uapi/asm/ptrace.h |  3 ++-
>>>  arch/powerpc/kernel/asm-offsets.c  |  5 
>>>  arch/powerpc/kernel/exceptions-64s.S   |  8 ++
>>>  arch/powerpc/kernel/signal_32.c| 14 ++
>>>  arch/powerpc/kernel/signal_64.c| 14 ++
>>>  arch/powerpc/kernel/traps.c| 49 
>>> ++
>>>  arch/powerpc/mm/fault.c|  4 +++
>>>  8 files changed, 97 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/paca.h 
>>> b/arch/powerpc/include/asm/paca.h
>>> index 1c09f8f..a41afd3 100644
>>> --- a/arch/powerpc/include/asm/paca.h
>>> +++ b/arch/powerpc/include/asm/paca.h
>>> @@ -92,6 +92,7 @@ struct paca_struct {
>>> struct dtl_entry *dispatch_log_end;
>>>  #endif /* CONFIG_PPC_STD_MMU_64 */
>>> u64 dscr_default;   /* per-CPU default DSCR */
>>> +   u64 paca_amr;   /* value of amr at exception */
>>>
>>>  #ifdef CONFIG_PPC_STD_MMU_64
>>> /*
>>> diff --git a/arch/powerpc/include/uapi/asm/ptrace.h 
>>> b/arch/powerpc/include/uapi/asm/ptrace.h
>>> index 8036b38..7ec2428 100644
>>> --- a/arch/powerpc/include/uapi/asm/ptrace.h
>>> +++ b/arch/powerpc/include/uapi/asm/ptrace.h
>>> @@ -108,8 +108,9 @@ struct pt_regs {
>>>  #define PT_DAR 41
>>>  #define PT_DSISR 42
>>>  #define PT_RESULT 43
>>> -#define PT_DSCR 44
>>>  #define PT_REGS_COUNT 44
>>> +#define PT_DSCR 44
>>> +#define PT_AMR 45
>>
>> PT_REGS_COUNT is not getting incremented even after adding
>> one more element into the pack ?
> 
> Correct. There are 48 entries in the gp_regs table AFAICT; only the
> first 45 are exposed through pt_regs and through gp_regs. The
> remaining are exposed through gp_regs only.
> 
>>
>>>
>>>  #define PT_FPR048  /* each FP reg occupies 2 slots in this space */
>>>
>>> diff --git a/arch/powerpc/kernel/asm-offsets.c 
>>> b/arch/powerpc/kernel/asm-offsets.c
>>> index 709e234..17f5d8a 100644
>>> --- a/arch/powerpc/kernel/asm-offsets.c
>>> +++ b/arch/powerpc/kernel/asm-offsets.c
>>> @@ -241,6 +241,11 @@ int main(void)
>>> OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id);
>>> OFFSET(PACAKEXECSTATE, paca_struct, kexec_state);
>>> OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default);
>>> +
>>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>>> +   OFFSET(PACA_AMR, paca_struct, paca_amr);
>>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>>> +
>>
>> So we now have a place in PACA for AMR.
> 
> yes.
> 
>>
>>> OFFSET(ACCOUNT_STARTTIME, paca_struct, accounting.starttime);
>>> OFFSET(ACCOUNT_STARTTIME_USER, paca_struct, accounting.starttime_user);
>>> OFFSET(ACCOUNT_USER_TIME, paca_struct, accounting.utime);
>>> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
>>> b/arch/powerpc/kernel/exceptions-64s.S
>>> index 3fd0528..8db9ef8 100644
>>> --- a/arch/powerpc/kernel/exceptions-64s.S
>>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>>> @@ -493,6 +493,10 @@ EXC_COMMON_BEGIN(data_access_common)
>>> ld  r12,_MSR(r1)
>>> ld  r3,PACA_EXGEN+EX_DAR(r13)
>>> lwz r4,PACA_EXGEN+EX_DSISR(r13)
>>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>>> +   mfspr   r5,SPRN_AMR
>>> +   std r5,PACA_AMR(r13)
>>> +#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>>> li  r5,0x300
>>> std r3,_DAR(r1)
>>> std r4,_DSISR(r1)
>>> @@ -561,6 +565,10 @@ EXC_COMMON_BEGIN(instruction_access_common)
>>> ld  r12,_MSR(r1)
>>> ld  r3,_NIP(r1)
>>> andis.  r4,r12,0x5820
>>> +#ifdef CONFIG_PPC

Re: [RFC v2 08/12] powerpc: Handle exceptions caused by violation of pkey protection.

2017-06-20 Thread Anshuman Khandual
On 06/21/2017 05:13 AM, Ram Pai wrote:
> On Tue, Jun 20, 2017 at 12:54:45PM +0530, Anshuman Khandual wrote:
>> On 06/17/2017 09:22 AM, Ram Pai wrote:
>>> Handle Data and Instruction exceptions caused by memory
>>> protection-key.
>>>
>>> Signed-off-by: Ram Pai 
>>> (cherry picked from commit a5e5217619a0c475fe0cacc3b0cf1d3d33c79a09)
> 
> Sorry. it was residue of a bad cleanup. It got cherry-picked from my own
> internal branch, but than i forgot to delete that line.
> 
>>
>> Which tree does this commit belong to?
>>
>>>
>>> Conflicts:
>>> arch/powerpc/include/asm/reg.h
>>> arch/powerpc/kernel/exceptions-64s.S
> 
> Same here. These two lines are residues of patching up my tree with
> commits from other internal branches.
> 
>>> ---
>>>  arch/powerpc/include/asm/mmu_context.h | 12 +
>>>  arch/powerpc/include/asm/pkeys.h   |  9 
>>>  arch/powerpc/include/asm/reg.h |  7 +--
>>>  arch/powerpc/mm/fault.c| 21 +++-
>>>  arch/powerpc/mm/pkeys.c| 90 
>>> ++
>>>  5 files changed, 134 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/mmu_context.h 
>>> b/arch/powerpc/include/asm/mmu_context.h
>>> index da7e943..71fffe0 100644
>>> --- a/arch/powerpc/include/asm/mmu_context.h
>>> +++ b/arch/powerpc/include/asm/mmu_context.h
>>> @@ -175,11 +175,23 @@ static inline void arch_bprm_mm_init(struct mm_struct 
>>> *mm,
>>>  {
>>>  }
>>>
>>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>>> +bool arch_pte_access_permitted(pte_t pte, bool write);
>>> +bool arch_vma_access_permitted(struct vm_area_struct *vma,
>>> +   bool write, bool execute, bool foreign);
>>> +#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>>> +static inline bool arch_pte_access_permitted(pte_t pte, bool write)
>>> +{
>>> +   /* by default, allow everything */
>>> +   return true;
>>> +}
>>>  static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
>>> bool write, bool execute, bool foreign)
>>>  {
>>> /* by default, allow everything */
>>> return true;
>>>  }
>>
>> Right, these are the two functions the core VM expects the
>> arch to provide.
>>
>>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>>> +
>>>  #endif /* __KERNEL__ */
>>>  #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
>>> diff --git a/arch/powerpc/include/asm/pkeys.h 
>>> b/arch/powerpc/include/asm/pkeys.h
>>> index 9b6820d..405e7db 100644
>>> --- a/arch/powerpc/include/asm/pkeys.h
>>> +++ b/arch/powerpc/include/asm/pkeys.h
>>> @@ -14,6 +14,15 @@
>>> VM_PKEY_BIT3 | \
>>> VM_PKEY_BIT4)
>>>
>>> +static inline u16 pte_flags_to_pkey(unsigned long pte_flags)
>>> +{
>>> +   return ((pte_flags & H_PAGE_PKEY_BIT4) ? 0x1 : 0x0) |
>>> +   ((pte_flags & H_PAGE_PKEY_BIT3) ? 0x2 : 0x0) |
>>> +   ((pte_flags & H_PAGE_PKEY_BIT2) ? 0x4 : 0x0) |
>>> +   ((pte_flags & H_PAGE_PKEY_BIT1) ? 0x8 : 0x0) |
>>> +   ((pte_flags & H_PAGE_PKEY_BIT0) ? 0x10 : 0x0);
>>> +}
>>
>> Add defines for the above 0x1, 0x2, 0x4, 0x8 etc ?
> 
> hmm...not sure if it will make the code any better.
> 
>>
>>> +
>>>  #define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | 
>>> \
>>> ((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) |\
>>> ((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) |\
>>> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
>>> index 2dcb8a1..a11977f 100644
>>> --- a/arch/powerpc/include/asm/reg.h
>>> +++ b/arch/powerpc/include/asm/reg.h
>>> @@ -285,9 +285,10 @@
>>>  #define   DSISR_UNSUPP_MMU 0x0008  /* Unsupported MMU config */
>>>  #define   DSISR_SET_RC 0x0004  /* Failed setting of 
>>> R/C bits */
>>>  #define   DSISR_PGDIRFAULT  0x0002  /* Fault on page directory 
>>> */
>>> -#define   DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
>>> -   DSISR_PAGEATTR_CONFLT | \
>>> -   DSISR_BADACCESS |   \
>>> +#define   DSISR_PAGE_FAULT_MASK (DSISR_BIT32 |  

Re: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.

2017-06-20 Thread Anshuman Khandual
On 06/21/2017 04:53 AM, Ram Pai wrote:
> On Tue, Jun 20, 2017 at 03:50:25PM +0530, Anshuman Khandual wrote:
>> On 06/17/2017 09:22 AM, Ram Pai wrote:
>>> Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
>>> in the 4K backed hpte pages. These bits continue to be used
>>> for 64K backed hpte pages in this patch, but will be freed
>>> up in the next patch.
>>
>> The numbering 3, 4, 5 and 6 is in BE format, I believe; I was
>> initially trying to read it from right to left, as we normally
>> do in the kernel, and was getting confused. So basically these
>> bits (which are only applicable to the 64K mapping, IIUC) are going
>> to be freed up from the PTE format.
>>
>> #define _RPAGE_RSV1  0x1000000000000000UL
>> #define _RPAGE_RSV2  0x0800000000000000UL
>> #define _RPAGE_RSV3  0x0400000000000000UL
>> #define _RPAGE_RSV4  0x0200000000000000UL
>>
>> As you have mentioned before this feature is available for 64K
>> page size only and not for 4K mappings. So I assume we support
>> both the combinations.
>>
>> * 64K mapping on 64K
>> * 64K mapping on 4K
> 
> yes.
> 
>>
>> These are the current users of the above bits
>>
>> #define H_PAGE_BUSY  _RPAGE_RSV1 /* software: PTE & hash are busy */
>> #define H_PAGE_F_SECOND  _RPAGE_RSV2 /* HPTE is in 2ndary 
>> HPTEG */
>> #define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
>> #define H_PAGE_HASHPTE   _RPAGE_RPN43/* PTE has associated 
>> HPTE */
>>
>>>
>>> The patch does the following change to the 64K PTE format
>>>
>>> H_PAGE_BUSY moves from bit 3 to bit 9
>>
>> and what is in there on bit 9 now ? This ?
>>
>> #define _RPAGE_SW2   0x00400
>>
>> which is used as 
>>
>> #define _PAGE_SPECIAL_RPAGE_SW2 /* software: special page */
>>
>> which will not be required any more ?
> 
> I think you are reading bit 9 from right to left. The bit 9 I refer to
> is from left to right, using the same numbering convention ISA 3.0 uses.

Right, my bad. Then it would be this one.

'#define _RPAGE_RPN42   0x0040000000000000UL'

> I know it is confusing, will make a mention in the comment of this
> patch, to read it the big-endian way.

Right.

> 
> BTW: bit 9 is not used currently, so this patch uses it. But this is
> a temporary move; H_PAGE_BUSY will move to bit 7 in the next patch.
> 
> Had to keep it at bit 9, because bit 7 is not yet entirely freed up; it
> is used by the 64K PTE backed by a 64K hpte.

Got it.

> 
>>
>>> H_PAGE_F_SECOND which occupied bit 4 moves to the second part
>>> of the pte.
>>> H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
>>> second part of the pte.
>>>
>>> the four bits (H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
>>> are initialized to 0xF, indicating an invalid slot. If a hpte
>>> gets cached in a 0xF slot (i.e. 7th slot of secondary), it is
>>> released immediately. In other words, even though 0xF is a
>>
>> Does "released immediately" mean we attempt again for a new hash slot?
> 
> yes.
> 
>>
>>> valid slot, we discard it and consider it as an invalid
>>> slot; i.e. hpte_soft_invalid(). This gives us an opportunity to not
>>> depend on a bit in the primary PTE in order to determine the
>>> validity of a slot.
>>
>> So we have to look at the slot number in the second half of each PTE
>> to figure out whether it has got a valid slot in the hash page table.
> 
> yes.
> 
>>
>>>
>>> When we release a hpte in the 0xF slot, we also release a
>>> legitimate primary slot and unmap that entry. This is to
>>> ensure that we do get a legitimate non-0xF slot the next time we
>>> retry for a slot.
>>
>> Okay.
>>
>>>
>>> Though treating the 0xF slot as invalid reduces the number of available
>>> slots and may have an effect on the performance, the probability
>>> of hitting a 0xF is extremely low.
>>
>> Why do you say that? I thought every slot number has the same
>> probability of being hit from the hash function.
> 
> Every hash bucket has the same probability. But every slot within the
> hash bucket is filled in sequentially, so it takes 15 hptes to hash to
> the same bucket before we get to the 15th slot in the secondary.

Okay, wouldn't the last one be the 16th instead?
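
(With 8 slots in the primary bucket plus 8 in the secondary, the
combined slot index runs 0x0 through 0xF, so assuming a strictly
sequential fill, the 0xF slot is only reached by the 16th hpte hashing
into the same bucket.)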

> 
>>
>>>
>>> Compared

Re: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.

2017-06-22 Thread Anshuman Khandual
On 06/17/2017 09:22 AM, Ram Pai wrote:
> Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
> in the 4K backed hpte pages. These bits continue to be used
> for 64K backed hpte pages in this patch, but will be freed
> up in the next patch.
> 
> The patch does the following change to the 64K PTE format
> 
> H_PAGE_BUSY moves from bit 3 to bit 9
> H_PAGE_F_SECOND which occupied bit 4 moves to the second part
>   of the pte.
> H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
>   second part of the pte.
> 
> the four bits (H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
> are initialized to 0xF, indicating an invalid slot. If a hpte
> gets cached in a 0xF slot (i.e. 7th slot of secondary), it is
> released immediately. In other words, even though 0xF is a
> valid slot, we discard it and consider it as an invalid
> slot; i.e. hpte_soft_invalid(). This gives us an opportunity to not
> depend on a bit in the primary PTE in order to determine the
> validity of a slot.
>
> When we release a hpte in the 0xF slot, we also release a
> legitimate primary slot and unmap that entry. This is to
> ensure that we do get a legitimate non-0xF slot the next time we
> retry for a slot.
>
> Though treating the 0xF slot as invalid reduces the number of available
> slots and may have an effect on the performance, the probability
> of hitting a 0xF is extremely low.
>
> Compared to the current scheme, the above described scheme reduces
> the number of false hash table updates significantly and has the
> added advantage of releasing four valuable PTE bits for other
> purpose.
> 
> This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> Ellermen and myself.
> 
> 4K PTE format remain unchanged currently.

Scanned through the PTE format again for hash 64K and 4K. It seems
to me that there might be 5 free bits already present in the PTE
format. I might be seriously mistaken here :) Please correct me if
that is not the case. The _RPAGE_RPN* bits, I think, are applicable
only to the hash page table format and will not be available for
radix later.

+#define _PAGE_FREE_1   0x0040UL /* Not used */
+#define _RPAGE_SW0 0x2000000000000000UL /* Not used */
+#define _RPAGE_SW1 0x00800UL /* Not used */
+#define _RPAGE_RPN42   0x0040000000000000UL /* Not used */
+#define _RPAGE_RPN41   0x0020000000000000UL /* Not used */




Re: [PATCH 1/2] mm, thp: introduce dedicated transparent huge page allocation interfaces

2017-10-17 Thread Anshuman Khandual
On 10/16/2017 02:49 PM, changbin...@intel.com wrote:
> From: Changbin Du 
> 
> This patch introduced 4 new interfaces to allocate a prepared
> transparent huge page.
>   - alloc_transhuge_page_vma
>   - alloc_transhuge_page_nodemask
>   - alloc_transhuge_page_node
>   - alloc_transhuge_page
> 

If we are trying to match the HugeTLB helpers, then these should be
named something like alloc_transhugepage_xxx instead of
alloc_transhuge_page_xxx. But I think it's okay.

> The aim is to remove duplicated code and simplify transparent
> huge page allocation. These are similar to alloc_hugepage_xxx
> which are for hugetlbfs pages. This patch does below changes:
>   - define alloc_transhuge_page_xxx interfaces
>   - apply them to all existing code
>   - declare prep_transhuge_page as static since no others use it
>   - remove alloc_hugepage_vma definition since it no longer has users
> 
> Signed-off-by: Changbin Du 
> ---
>  include/linux/gfp.h |  4 
>  include/linux/huge_mm.h | 13 -
>  include/linux/migrate.h | 14 +-
>  mm/huge_memory.c| 50 
> ++---
>  mm/khugepaged.c | 11 ++-
>  mm/mempolicy.c  | 10 +++---
>  mm/migrate.c| 12 
>  mm/shmem.c  |  6 ++
>  8 files changed, 71 insertions(+), 49 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index f780718..855c72e 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -507,15 +507,11 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
>  extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
>   struct vm_area_struct *vma, unsigned long addr,
>   int node, bool hugepage);
> -#define alloc_hugepage_vma(gfp_mask, vma, addr, order)   \
> - alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
>  #else
>  #define alloc_pages(gfp_mask, order) \
>   alloc_pages_node(numa_node_id(), gfp_mask, order)
>  #define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
>   alloc_pages(gfp_mask, order)
> -#define alloc_hugepage_vma(gfp_mask, vma, addr, order)   \
> - alloc_pages(gfp_mask, order)
>  #endif
>  #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
>  #define alloc_page_vma(gfp_mask, vma, addr)  \
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 14bc21c..1dd2c33 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -130,9 +130,20 @@ extern unsigned long thp_get_unmapped_area(struct file 
> *filp,
>   unsigned long addr, unsigned long len, unsigned long pgoff,
>   unsigned long flags);
>  
> -extern void prep_transhuge_page(struct page *page);
>  extern void free_transhuge_page(struct page *page);
>  
> +struct page *alloc_transhuge_page_vma(gfp_t gfp_mask,
> + struct vm_area_struct *vma, unsigned long addr);
> +struct page *alloc_transhuge_page_nodemask(gfp_t gfp_mask,
> + int preferred_nid, nodemask_t *nmask);

Wouldn't they require 'extern' here?

> +
> +static inline struct page *alloc_transhuge_page_node(int nid, gfp_t gfp_mask)
> +{
> + return alloc_transhuge_page_nodemask(gfp_mask, nid, NULL);
> +}
> +
> +struct page *alloc_transhuge_page(gfp_t gfp_mask);
> +
>  bool can_split_huge_page(struct page *page, int *pextra_pins);
>  int split_huge_page_to_list(struct page *page, struct list_head *list);
>  static inline int split_huge_page(struct page *page)
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 643c7ae..70a00f3 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -42,19 +42,15 @@ static inline struct page *new_page_nodemask(struct page 
> *page,
>   return 
> alloc_huge_page_nodemask(page_hstate(compound_head(page)),
>   preferred_nid, nodemask);
>  
> - if (thp_migration_supported() && PageTransHuge(page)) {
> - order = HPAGE_PMD_ORDER;
> - gfp_mask |= GFP_TRANSHUGE;
> - }
> -
>   if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
>   gfp_mask |= __GFP_HIGHMEM;
>  
> - new_page = __alloc_pages_nodemask(gfp_mask, order,
> + if (thp_migration_supported() && PageTransHuge(page))
> + return alloc_transhuge_page_nodemask(gfp_mask | GFP_TRANSHUGE,
> + preferred_nid, nodemask);
> + else
> + return __alloc_pages_nodemask(gfp_mask, order,
>   preferred_nid, nodemask);
> -
> - if (new_page && PageTransHuge(page))
> - prep_transhuge_page(new_page);

This makes sense; calling prep_transhuge_page() inside
alloc_transhuge_page_nodemask() is better, I guess.
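
Presumably it ends up looking something like this (a sketch; the
mm/huge_memory.c hunk is not quoted here):

	struct page *alloc_transhuge_page_nodemask(gfp_t gfp_mask,
			int preferred_nid, nodemask_t *nmask)
	{
		struct page *page;

		page = __alloc_pages_nodemask(gfp_mask, HPAGE_PMD_ORDER,
					      preferred_nid, nmask);
		if (page)
			prep_transhuge_page(page);
		return page;
	}

so every caller gets a properly prepared compound page back.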

>  
>   return new_page;
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 269b5df..e267488 100644
> --- a/mm/huge_memory.c
> +++ b/mm/hu
