Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-06 Thread Andi Kleen
> > What's the point of this indirection other than another way of avoiding
> > empty node 0?
> 
> Honestly, I do not have any idea. I've traced it down to
> Author: Andi Kleen 
> Date:   Tue Jan 11 15:35:48 2005 -0800

I don't remember all the details, and I can't even find the commit
(is it in linux-historic?).

But AFAIK there's no guarantee PXMs are small and contiguous, so it
seemed better to have a clean zero-based space.

Back then we had a lot of problems with buggy SRAT tables in BIOS,
so we really tried to avoid trusting the BIOS as much as possible.

-Andi



Re: [PATCH v8 6/7] tools/perf: Enable Hz/hz printing for --metric-only option

2020-04-02 Thread Andi Kleen
> > diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
> > index 9e757d18d713..679aaa655824 100644
> > --- a/tools/perf/util/stat-display.c
> > +++ b/tools/perf/util/stat-display.c
> > @@ -237,8 +237,6 @@ static bool valid_only_metric(const char *unit)
> > if (!unit)
> > return false;
> > if (strstr(unit, "/sec") ||
> > -   strstr(unit, "hz") ||
> > -   strstr(unit, "Hz") ||
> 
> will this change output of --metric-only for some setups then?
> 
> Andi, are you ok with this?

Yes should be ok

$ grep -i ScaleUnit.*hz arch/x86/*/* 
$ 

Probably was for some metric we later dropped.

-Andi


Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

2020-03-02 Thread Andi Kleen
On Mon, Mar 02, 2020 at 11:13:32AM +0100, Peter Zijlstra wrote:
> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
> > Modern processors export such hazard data in Performance
> > Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
> > Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
> > AMD[3] provides similar information.
> > 
> > Implementation detail:
> > 
> > A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
> > If it's set, kernel converts arch specific hazard information
> > into generic format:
> > 
> >   struct perf_pipeline_haz_data {
> >  /* Instruction/Opcode type: Load, Store, Branch  */
> >  __u8itype;
> >  /* Instruction Cache source */
> >  __u8icache;
> >  /* Instruction suffered hazard in pipeline stage */
> >  __u8hazard_stage;
> >  /* Hazard reason */
> >  __u8hazard_reason;
> >  /* Instruction suffered stall in pipeline stage */
> >  __u8stall_stage;
> >  /* Stall reason */
> >  __u8stall_reason;
> >  __u16   pad;
> >   };
> 
> Kim, does this format indeed work for AMD IBS?

Intel PEBS has a similar concept for annotation of memory accesses,
which is already exported through perf_mem_data_src. This is essentially
an extension. It would be better to have something unified here. 
Right now it seems to duplicate at least part of the PEBS facility.

-Andi



Re: [RFC 5/6] perf/tools: Enhance JSON/metric infrastructure to handle "?"

2020-01-21 Thread Andi Kleen
> Here, '?' will be replaced with a runtime value and metric expression will
> be replicated.

Okay seems reasonable to me.

Thanks,

-Andi


Re: [RFC 5/6] perf/tools: Enhance JSON/metric infrastructure to handle "?"

2020-01-17 Thread Andi Kleen
On Fri, Jan 17, 2020 at 06:16:19PM +0530, Kajol Jain wrote:
> Patch enhances current metric infrastructure to handle "?" in the metric
> expression. The "?" can be used for parameters whose value is not known while
> creating metric events and which can be replaced later at runtime with
> the proper value. It also adds flexibility to create multiple events out
> of a single metric event added in the json file.

Please add a proper specification of how this "?" is supposed to work,
what the exact semantics are, how it differs from the existing
"#" mechanism, etc.

The standard way to do similar things before was to define an explicit
# name and let the expr code take care of it. 
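
As I understand the proposal, the replication could be sketched like this
(a hypothetical illustration, not the actual tools/perf code; the event
string is a made-up 24x7 example):

```python
def expand_metric(expr, runtime_values):
    """Expand a metric expression containing the '?' placeholder into
    one concrete expression per runtime-discovered value."""
    return [expr.replace("?", str(v)) for v in runtime_values]

# A made-up per-chip metric template, replicated for chips 0 and 1:
exprs = expand_metric(r"hv_24x7@PM_PB_CYC\,chip\=?@", [0, 1])
print(exprs[0])  # hv_24x7@PM_PB_CYC\,chip\=0@
```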

-Andi 


Re: [PATCH v9 21/24] perf tools: Add support for the SPF perf event

2018-03-26 Thread Andi Kleen
On Mon, Mar 26, 2018 at 02:44:48PM -0700, David Rientjes wrote:
> On Tue, 13 Mar 2018, Laurent Dufour wrote:
> 
> > Add support for the new speculative faults event.
> > 
> > Signed-off-by: Laurent Dufour 
> 
> Acked-by: David Rientjes 
> 
> Aside: should there be a new spec_flt field for struct task_struct that 
> complements maj_flt and min_flt and be exported through /proc/pid/stat?

No. task_struct is already too bloated. If you need per process tracking 
you can always get it through trace points.

-Andi


Re: [PATCH v6 03/24] mm: Dont assume page-table invariance during faults

2018-01-16 Thread Andi Kleen
Laurent Dufour  writes:

> From: Peter Zijlstra 
>
> One of the side effects of speculating on faults (without holding
> mmap_sem) is that we can race with free_pgtables() and therefore we
> cannot assume the page-tables will stick around.
>
> Remove the reliance on the pte pointer.

This needs a lot more explanation. So why is this code not needed with
SPF only?

-Andi


Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure

2017-08-28 Thread Andi Kleen
> That makes me extremely nervous... there could be all sort of
> assumptions esp. in arch code about the fact that we never populate the
> tree without the mm sem.
> 
> We'd have to audit archs closely. Things like the page walk cache
> flushing on power etc...

Yes the whole thing is quite risky. Probably will need some
kind of per architecture opt-in scheme?

-Andi


Re: [PATCH v2 2/5] perf/x86/intel: Record branch type

2017-04-07 Thread Andi Kleen
> > It's a somewhat common situation with partially JITed code, if you
> > don't have an agent. You can still do a lot of useful things.
> 
> Like what? How can you say anything about code you don't have?

For example if you combine the PMU topdown measurement, and see if it's
frontend bound, and then you see it has lots of forward conditionals,
then dynamic basic block reordering will help. If you have lots
of cross page jumps then function reordering will help. etc.

> > We found it useful to have this extra information during workload
> > analysis. Forward conditionals and page crossing jumps
> > are indications of frontend problems.
> 
> But you already have the exact same information in {to,from}, why would
> you need to repackage information already contained?

Without this patch, we don't know if it's conditional or something else.
And the kernel already knows this for its filtering, so it can as well
report it.

Right, the CROSS_* and forward/backward information could be computed
later.

-Andi




Re: [PATCH v2 2/5] perf/x86/intel: Record branch type

2017-04-07 Thread Andi Kleen
On Fri, Apr 07, 2017 at 05:20:31PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 07, 2017 at 06:47:43PM +0800, Jin Yao wrote:
> > Perf already has support for disassembling the branch instruction
> > and using the branch type for filtering. The patch just records
> > the branch type in perf_branch_entry.
> > 
> > Before recording, the patch converts the x86 branch classification
> > to common branch classification and compute for checking if the
> > branches cross 4K or 2MB areas. It's an approximate computing for
> > crossing 4K page or 2MB page.
> 
> The changelog is completely empty of rationale. Why do we care?
> 
> Not having the binary is a very bad reason; you can't do much of
> anything if that's missing.

It's a somewhat common situation with partially JITed code, if you
don't have an agent. You can still do a lot of useful things.

We found it useful to have this extra information during workload
analysis. Forward conditionals and page crossing jumps
are indications of frontend problems.

-Andi


Re: [PATCH v21 00/20] perf, tools: Add support for PMU events in JSON format

2016-09-26 Thread Andi Kleen
On Mon, Sep 26, 2016 at 12:03:43PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Mon, Sep 26, 2016 at 10:35:33AM +0200, Jiri Olsa escreveu:
> > ping.. is that working for you? IMO we can include this
> > as additional patch to the set..
> 
> No, it fails to build on the first cross env I tried, fixing it
> now, resulting patch:

Yes it shouldn't be difficult to fix cross building. I don't think
there are any fundamental problems.

-Andi


Re: [PATCH v20 00/20] perf, tools: Add support for PMU events in JSON format

2016-08-31 Thread Andi Kleen
> > 
> > > 
> > > I've already made some changes in pmu-events/* to support
> > > this hierarchy to see how bad the change would be.. and
> > > it's not that bad ;-)
> > 
> > Everything has to be automated, please no manual changes.
> 
> sure
> 
> so, if you're ok with the layout, how do you want to proceed further?

If the split version is acceptable it's fine for me to merge it.

I'll add split-json to my scripting, so the next update would
be split too.

-Andi



Re: [PATCH v20 00/20] perf, tools: Add support for PMU events in JSON format

2016-08-31 Thread Andi Kleen
> hi,
> I had discussion with Ingo about the state of this patchset
> and there's one more requirement from his side - to split
> event files into per topic files

Thanks Jiri.
> 
> I made some initial changes over latest Sukadev's branch
> and came up with something like this:

Did you just split it by the "Topic" fields?

> 
>   $ find pmu-events/arch/x86/
>   pmu-events/arch/x86/
>   pmu-events/arch/x86/NehalemEX_core
>   pmu-events/arch/x86/NehalemEX_core/Memory.json
>   pmu-events/arch/x86/NehalemEX_core/Virtual-Memory.json
>   pmu-events/arch/x86/NehalemEX_core/Cache.json
>   pmu-events/arch/x86/NehalemEX_core/Pipeline.json
>   pmu-events/arch/x86/NehalemEX_core/Floating-point.json
>   pmu-events/arch/x86/NehalemEX_core/Other.json
>   pmu-events/arch/x86/mapfile.csv
>   pmu-events/arch/x86/Broadwell_core
>   pmu-events/arch/x86/Broadwell_core/Memory.json
>   pmu-events/arch/x86/Broadwell_core/Virtual-Memory.json
>   pmu-events/arch/x86/Broadwell_core/Cache.json
>   pmu-events/arch/x86/Broadwell_core/Pipeline.json
>   pmu-events/arch/x86/Broadwell_core/Floating-point.json
>   pmu-events/arch/x86/Broadwell_core/Other.json
>   pmu-events/arch/x86/Broadwell_core/Frontend.json
> 
> so let's have a discussion if this is acceptable for you guys

Splitting is fine for me, as long as it's scriptable.

I already have some scripts to generate the perf json files,
can update them to split.

> 
> I've already made some changes in pmu-events/* to support
> this hierarchy to see how bad the change would be.. and
> it's not that bad ;-)

Everything has to be automated, please no manual changes.
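
A split script along those lines could look roughly like this (a
hypothetical sketch of the automation, assuming each event carries a
"Topic" field; the actual generator scripts are not shown in this thread):

```python
import json
import os
from collections import defaultdict

def split_by_topic(event_file, out_dir):
    """Split a flat per-CPU JSON event list into one file per "Topic"
    field, mirroring the pmu-events/arch/x86/<model>/<Topic>.json layout."""
    with open(event_file) as f:
        events = json.load(f)
    by_topic = defaultdict(list)
    for event in events:
        # Events without a Topic land in Other.json
        by_topic[event.get("Topic", "Other")].append(event)
    os.makedirs(out_dir, exist_ok=True)
    for topic, evs in sorted(by_topic.items()):
        name = topic.replace(" ", "-") + ".json"
        with open(os.path.join(out_dir, name), "w") as f:
            json.dump(evs, f, indent=4)
```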

-Andi


Re: [PATCH 2/5] kbuild: allow archs to select build for link dead code/data elimination

2016-08-09 Thread Andi Kleen
On Wed, Aug 10, 2016 at 12:29:29AM +0200, Arnd Bergmann wrote:
> On Monday, August 8, 2016 8:16:05 PM CEST Andi Kleen wrote:
> > > I don't understand what led Andi Kleen to also move .text.hot and
> > > .text.unlikely together with .text [2], but this may have
> > > been a related issue.
> > >
> > > [2] https://lkml.org/lkml/2015/7/19/377
> >
> > The goal was just to move .hot and .unlikely all together, so that
> > they are clustered and use the minimum amount of cache. On x86 it doesn't
> > matter where they are exactly, as long as each is together.
> > If they are not explicitly listed then the linker interleaves
> > them with the normal text, which defeats the purpose.
> 
> I still don't see it, my reading of your patch is that you did
> the opposite, by changing the description that puts all .text.hot
> in front of .text, and all .text.unlikely after exit.text into
> one that mixes them with .text. What am I missing here?

No, it doesn't mix .unlikely with .text; .unlikely is all in one place.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH 2/5] kbuild: allow archs to select build for link dead code/data elimination

2016-08-09 Thread Andi Kleen
On Wed, Aug 10, 2016 at 12:29:29AM +0200, Arnd Bergmann wrote:
> On Monday, August 8, 2016 8:16:05 PM CEST Andi Kleen wrote:
> > > I don't understand what led Andi Kleen to also move .text.hot and
> > > .text.unlikely together with .text [2], but this may have
> > > been a related issue.
> > >
> > > [2] https://lkml.org/lkml/2015/7/19/377
> >
> > The goal was just to move .hot and .unlikely all together, so that
> > they are clustered and use the minimum amount of cache. On x86 it doesn't
> > matter where they are exactly, as long as each is together.
> > If they are not explicitly listed then the linker interleaves
> > them with the normal text, which defeats the purpose.
> 
> I still don't see it, my reading of your patch is that you did
> the opposite, by changing the description that puts all .text.hot
> in front of .text, and all .text.unlikely after exit.text into
> one that mixes them with .text. What am I missing here?

.text.hot is actually not used, the critical part is .text.unlikely
which was not listed and was interleaved before the patch.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH 2/5] kbuild: allow archs to select build for link dead code/data elimination

2016-08-08 Thread Andi Kleen
> I don't understand what led Andi Kleen to also move .text.hot and
> .text.unlikely together with .text [2], but this may have
> been a related issue.

The goal was just to move .hot and .unlikely all together, so that
they are clustered and use the minimum amount of cache. On x86 it doesn't
matter where they are exactly, as long as each is together.
If they are not explicitly listed then the linker interleaves
them with the normal text, which defeats the purpose.
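
In linker-script terms the intent is roughly the following (an illustrative
fragment, not the actual include/asm-generic/vmlinux.lds.h):

```ld
.text : {
	*(.text.hot .text.hot.*)		/* hot text clustered together */
	*(.text)				/* normal text */
	*(.text.unlikely .text.unlikely.*)	/* cold text clustered together */
}
```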

-Andi


Re: [PATCH v19 00/19] perf, tools: Add support for PMU events in JSON format

2015-12-03 Thread Andi Kleen
On Mon, Nov 30, 2015 at 06:56:55PM -0800, Sukadev Bhattiprolu wrote:
> CPUs support a large number of performance monitoring events (PMU events)
> and often these events are very specific to an architecture/model of the
> CPU. To use most of these PMU events with perf, we currently have to identify
> them by their raw codes:

Could this patch be merged please? I'm not aware of any open objection to it,
and it has already been extensively reviewed in the past.

Full event lists are very useful for people. Also there are additional
changes (uncore event lists) which are blocked on these patches getting
merged.

Thanks,
-Andi
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v14 15/19] perf, tools: Support long descriptions with perf list

2015-06-05 Thread Andi Kleen
On Fri, Jun 05, 2015 at 12:21:38PM +0200, Jiri Olsa wrote:
> On Thu, Jun 04, 2015 at 11:27:23PM -0700, Sukadev Bhattiprolu wrote:
> 
> SNIP
> 
> > ---
> >  tools/perf/builtin-list.c |   11 ++++++++---
> >  1 file changed, 8 insertions(+), 3 deletions(-)
> > 
> > diff --git a/tools/perf/builtin-list.c b/tools/perf/builtin-list.c
> > index 3f058f7..d0f7a18 100644
> > --- a/tools/perf/builtin-list.c
> > +++ b/tools/perf/builtin-list.c
> > @@ -22,10 +22,13 @@ int cmd_list(int argc, const char **argv, const char *prefix __maybe_unused)
> >  {
> > 	int i;
> > 	bool raw_dump = false;
> > +	bool long_desc_flag = false;
> > 	struct option list_options[] = {
> > 		OPT_BOOLEAN(0, "raw-dump", &raw_dump, "Dump raw events"),
> > 		OPT_BOOLEAN('d', "desc", &desc_flag,
> > 			    "Print extra event descriptions. --no-desc to not print."),
> > +		OPT_BOOLEAN('d', "long-desc", &long_desc_flag,
> > 			    "Print longer event descriptions."),
> 
> hum, it should be 'v', right?

Yes, that's right.

Also BTW need to add the new option to the usage line a few lines below.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only

Re: [PATCH v13 12/14] perf, tools: Add support for event list topics

2015-06-03 Thread Andi Kleen
> please split at least the jevents Topic parsing from the rest
> ideally also the alias update and the display change

What's the point of all these splits? It's already one logical unit,
not too large, and is bisectable.

-andi

-- 
a...@linux.intel.com -- Speaking for myself only

Re: [PATCH 2/4] perf: jevents: Program to convert JSON file to C style file

2015-05-31 Thread Andi Kleen

Ok I did some scripting to add these topics you requested to the Intel
JSON files, and changed perf list to group events by them.

I'll redirect any questions on their value to you.
And I certainly hope this is the last of your improvements for now.

The updated event lists are available in

git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc 
perf/intel-json-files-3

The updated patches are available in 

git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/builtin-json-6

Also posted separately.

The output looks like this

% perf list
...
Cache:
  l1d.replacement   
   [L1D data line replacements]
  l1d_pend_miss.pending 
   [L1D miss oustandings duration in cycles]
  l1d_pend_miss.pending_cycles  
   [Cycles with L1D load Misses outstanding]
...
Floating point:
  fp_assist.any 
   [Cycles with any input/output SSE or FP assist]
  fp_assist.simd_input  
   [Number of SIMD FP assists due to input values]
  fp_assist.simd_output 
   [Number of SIMD FP assists due to Output values]
...
Memory:
  machine_clears.memory_ordering
   [Counts the number of machine clears due to memory order conflicts]
  mem_trans_retired.load_latency_gt_128 
   [Loads with latency value being above 128 (Must be precise)]
  mem_trans_retired.load_latency_gt_16  
   [Loads with latency value being above 16 (Must be precise)]
...
Pipeline:
  arith.fpu_div 
   [Divide operations executed]
  arith.fpu_div_active  
   [Cycles when divider is busy executing divide operations]
  baclears.any  
   [Counts the total number when the front end is resteered, mainly when 
the BPU cannot provide a correct
prediction and this is corrected by other branch handling mechanisms at 
the front end]


-Andi

P.S.: You may want to look up the definition of logical fallacy on Wikipedia.

Re: [PATCH 04/10] perf, tools: Handle header line in mapfile

2015-05-29 Thread Andi Kleen
On Fri, May 29, 2015 at 11:13:15AM +0200, Jiri Olsa wrote:
> On Thu, May 28, 2015 at 10:45:06PM -0700, Sukadev Bhattiprolu wrote:
> > Jiri Olsa [jo...@redhat.com] wrote:
> > | 		if (line[0] == '#' || line[0] == '\n')
> > | 			continue;
> > | +		if (!strncmp(line, "Family", 6))
> > | +			continue;
> > | 
> > | I think we should fix mapfiles to put the 'Family' starting
> > | line as a comment.. the way powerpc mapfile is done
> > 
> > You mean add something like this to the Intel mapfile:
> > 
> > 	# Power8 entries
> > 	004d0100,1,power8.json,core
> > 
> > and drop this patch?
> 
> right

But it's a CSV file. CSV files are supposed to have column headers.
There are lots of tools that work better with them if they have headers.
Please keep the header.
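
For instance, keeping the header lets any generic CSV tool address columns
by name (a small illustration; the header names follow the Intel mapfile.csv
convention and are shown here only as an example):

```python
import csv
import io

# A two-line stand-in for pmu-events/arch/x86/mapfile.csv.
mapfile = io.StringIO(
    "Family-model,Version,Filename,EventType\n"
    "GenuineIntel-6-3E,V16,IvyTown_core.json,core\n"
)
# With the header row present, columns are addressable by name,
# not just by position:
rows = list(csv.DictReader(mapfile))
print(rows[0]["Filename"])
```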

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only

Re: [PATCH 09/10] perf, tools: Add a --no-desc flag to perf list

2015-05-28 Thread Andi Kleen
On Thu, May 28, 2015 at 02:39:14PM +0200, Jiri Olsa wrote:
> On Wed, May 27, 2015 at 02:23:28PM -0700, Sukadev Bhattiprolu wrote:
> > From: Andi Kleen a...@linux.intel.com
> > 
> > Add a --no-desc flag to perf list to not print the event descriptions
> > that were earlier added for JSON events. This may be useful to
> > get a less crowded listing.
> > 
> > It's still default to print descriptions as that is the more useful
> > default for most users.
> 
> I might not be typical user, but the first thing I tried to
> explore was 'perf list -v' ;-)
> 
> would it be better to have just the list with event names for:
>    $ perf list
> 
> and with descriptions for:
>    $ perf list -v
> 
> not sure we already discussed this..

It was discussed last time. I think it's better to have descriptions
by default. Far more user friendly.

One thing we could in theory do with -v is to switch between Brief and
Public Description (the latter is often more verbose). Would need some
changes to the alias code to have two descriptions though.

-Andi


-- 
a...@linux.intel.com -- Speaking for myself only

Re: [PATCH 2/4] perf: jevents: Program to convert JSON file to C style file

2015-05-28 Thread Andi Kleen
> So instead of this flat structure, there should at minimum be broad
> categorization of the various parts of the hardware they relate to:
> whether they relate to the branch predictor, memory caches, TLB caches,
> memory ops, offcore, decoders, execution units, FPU ops, etc., etc. -
> so that they can be queried via 'perf list'.

The categorization is generally on the stem name, which already works fine
with the existing perf list wildcard support. Say, for example, you only
want branches:

perf list br*
...
  br_inst_exec.all_branches 
   [Speculative and retired branches]
  br_inst_exec.all_conditional  
   [Speculative and retired macro-conditional branches]
  br_inst_exec.all_direct_jmp   
   [Speculative and retired macro-unconditional branches excluding calls 
and indirects]
  br_inst_exec.all_direct_near_call 
   [Speculative and retired direct near calls]
  br_inst_exec.all_indirect_jump_non_call_ret   
   [Speculative and retired indirect branches excluding calls and returns]
  br_inst_exec.all_indirect_near_return 
   [Speculative and retired indirect return branches]
...

Or mid level cache events:

perf list l2*
...
  l2_l1d_wb_rqsts.all   
   [Not rejected writebacks from L1D to L2 cache lines in any state]
  l2_l1d_wb_rqsts.hit_e 
   [Not rejected writebacks from L1D to L2 cache lines in E state]
  l2_l1d_wb_rqsts.hit_m 
   [Not rejected writebacks from L1D to L2 cache lines in M state]
  l2_l1d_wb_rqsts.miss  
   [Count the number of modified Lines evicted from L1 and missed L2. 
(Non-rejected WBs from the DCU.)]
  l2_lines_in.all   
   [L2 cache lines filling L2]
...

There are some exceptions, but generally it works this way.

The stem could be put into a separate header, but it would seem redundant to 
me. 

> We don't just want to import the unstructured mess that these event
> files are - we want to turn them into real structure. We can still keep
> the messy vendor names as well, like IDQ.DSB_CYCLES, but we want to
> impose structure as well.

The vendor names directly map to the micro architecture, which is the whole
point of the events. IDQ is a part of the CPU, and is described in the
CPU manuals. One of the main motivations for adding event lists is to make
perf match that documentation.

 
> 3)
> 
> There should be good 'perf list' visualization for these events:
> grouping, individual names, with a good interface to query details if
> needed. I.e. it should be possible to browse and discover events
> relevant to the CPU the tool is executing on.

I suppose we could change perf list to give the stem names as section headers
to make the long list a bit more readable.

Generally you need to have some knowledge of the micro architecture to use
these events. There is no way around that.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only

Re: [PATCH 4/4] perf: Add power8 PMU events in JSON format

2015-05-27 Thread Andi Kleen
On Thu, May 28, 2015 at 12:01:31AM +0900, Namhyung Kim wrote:
> On Wed, May 27, 2015 at 11:41 PM, Andi Kleen a...@linux.intel.com wrote:
> > > > +  {
> > > > +    "EventCode": "0x2505e",
> > > > +    "EventName": "PM_BACK_BR_CMPL",
> > > > +    "BriefDescription": "Branch instruction completed with a target address less than current instruction address,",
> > > > +    "PublicDescription": "Branch instruction completed with a target address less than current instruction address.",
> > >
> > > Can't we remove PublicDescription field if it's identical to
> > > BriefDescription?  It seems just wasting spaces..
> >
> > It's not always identical. There are events where PublicDescription is
> > much longer (several paragraphs)
> 
> I know.  What I said is make it optional so that we can drop if it's
> identical.

Should be easy enough. It's already optional in the jevents parser.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only

Re: [PATCH 4/4] perf: Add power8 PMU events in JSON format

2015-05-27 Thread Andi Kleen
> > +  {
> > +    "EventCode": "0x2505e",
> > +    "EventName": "PM_BACK_BR_CMPL",
> > +    "BriefDescription": "Branch instruction completed with a target address less than current instruction address,",
> > +    "PublicDescription": "Branch instruction completed with a target address less than current instruction address.",
> 
> Can't we remove PublicDescription field if it's identical to
> BriefDescription?  It seems just wasting spaces..

It's not always identical. There are events where PublicDescription is
much longer (several paragraphs)

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only

Re: [PATCH 2/4] perf: jevents: Program to convert JSON file to C style file

2015-05-27 Thread Andi Kleen
> So we build tables of all models in the architecture, and choose
> matching one when compiling perf, right?  Can't we do that when
> building the tables?  IOW, why don't we check the VFM and discard
> non-matching tables?  Those non-matching tables are also needed?

We build it for all CPUs in an architecture, not all architectures.
So e.g. for an x86 binary power is not included, and vice versa.
It always includes all CPUs for a given architecture, so it's possible
to use the perf binary on other systems than just the one it was
built on.

-andi

-- 
a...@linux.intel.com -- Speaking for myself only

Re: [PATCH 2/4] perf: jevents: Program to convert JSON file to C style file

2015-05-22 Thread Andi Kleen
> Sure, but shouldn't we allow JSON files to be in subdirs
> 
>   pmu-events/arch/x86/HSX/Haswell_core.json
> 
> and this could go to arbitrary levels?

I used a flat hierarchy. Should be good enough.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only

Re: [PATCH 2/4] perf: jevents: Program to convert JSON file to C style file

2015-05-22 Thread Andi Kleen
> pmu-events.c depends only on JSON files relevant to the arch perf is
> being built on and there could be several JSON files per arch. So it
> would complicate the Makefiles.

Could just use a wildcard dependency on */$(ARCH)/*.json 

Also it would be good to move the generated file into the object
directory. I tried it but it needs some more changes to the Makefiles.

-Andi


Re: [PATCH 3/4] perf: Use pmu_events_map table to create event aliases

2015-05-21 Thread Andi Kleen
On Wed, May 20, 2015 at 10:02:04PM -0700, Sukadev Bhattiprolu wrote:
> Andi Kleen [a...@linux.intel.com] wrote:
> | If you need something else in vfm to identify the CPU
> | can't you just add it there? I wouldn't really call it vfm, it's
> | really an abstract cpu identifier per architecture. So if you
> | need pvr just add it there.
> 
> Ok. I will change vfm to cpuid_str and include pvr in it.

Thanks.

With that change it would also be cleaner to provide a get_cpuid_str()
function in the architecture code, and then strcmp in the matching
code, instead of having architecture-specific compare code.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only

Re: [PATCH 3/4] perf: Use pmu_events_map table to create event aliases

2015-05-20 Thread Andi Kleen
> +/*
> + * Return TRUE if the CPU identified by @vfm, @version, and @type
> + * matches the current CPU.  vfm refers to [Vendor, Family, Model],
> + *
> + * Return FALSE otherwise.
> + *
> + * For Powerpc, we only compare @version to the processor PVR.
> + */
> +bool arch_pmu_events_match_cpu(const char *vfm __maybe_unused,
> +			       const char *version,
> +			       const char *type __maybe_unused)
> +{
> +	char *cpustr;
> +	bool rc;
> +
> +	cpustr = get_cpu_str();
> +	rc = !strcmp(version, cpustr);

Surely against vfm, not version.
I think your mapfile is wrong if that works?

That's the Intel format:

	.vfm = "GenuineIntel-6-3E",
	.version = "V16",
	.type = "core",
	.table = pme_IvyTown_core


-Andi


Re: [PATCH 3/4] perf: Use pmu_events_map table to create event aliases

2015-05-20 Thread Andi Kleen
> Obviously, that does not fit into the VFM field. We could either
> add a new PVR field to the mapfile:
> 
>   [vfm, version, type, pvr]
> 
> or, as the patch currently does, let architectures interpret the
> version field as they see fit?
> 
> IOW, leave it to architectures to keep arch_pmu_events_match_cpu()
> consistent with _their_ mapfile?

version is the version number of the event file. This way 
you can't signify the version number if you ever change something.

If you need something else in vfm to identify the CPU
can't you just add it there? I wouldn't really call it vfm, it's
really an abstract cpu identifier per architecture. So if you
need pvr just add it there.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only

Re: [RFC][PATCH 4/4] perf: Create aliases for PMU events

2015-05-04 Thread Andi Kleen
> I personally like having the set of event files in JSON notation
> rather than having them directly in a C structure

Yes, strings are better and JSON input is also better. 

I prototyped translating JSON into the proposed structures. I already had to
add three new fields, and it wouldn't work for uncore. The string 
format is much more extensible.

BTW as expected the binary sizes are gigantic (for 14 CPU types):

% size all.o
   text    data     bss     dec     hex filename
 662698       0       0  662698   a1caa all.o

% gcc -E all.c | wc -l
55475

-Andi

Re: 'perf upgrade' (was: Re: [PATCH v9 00/11] Add support for JSON event files.)

2015-04-17 Thread Andi Kleen
On Fri, Apr 17, 2015 at 05:31:26PM +0200, Jiri Olsa wrote:
> On Wed, Apr 15, 2015 at 01:50:42PM -0700, Sukadev Bhattiprolu wrote:
> 
> SNIP
> 
> > | 
> > |  - to blindly follow some poorly constructed vendor format with no 
> > |    high level structure, that IMHO didn't work very well when OProfile 
> > |    was written, and misrepresenting it as 'symbolic event names'.
> > | 
> > |    Take a look at:
> > | 
> > |      https://download.01.org/perfmon/HSW/Haswell_core_V17.json
> > | 
> > |    and weep.
> > 
> > Evil vendor formats, but to be fair, here is what _we_ have today:
> > 
> >   perf stat -e r10068,r20036,r40060,r40ac sleep 1
> 
> hum, you could also use the 'cpu/event=.../' syntax right?

That's even worse -- same hex numbers, just more redundancy.

All the other profilers support symbolic names, which is what
users want.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only

Re: 'perf upgrade' (was: Re: [PATCH v9 00/11] Add support for JSON event files.)

2015-04-15 Thread Andi Kleen
> My suggestion to resolve the technical objections and lift the NAK
> would be:
> 
>  - to add the tables to the source code, in a more human readable
>    format and (optionally) structure the event names better into a
>    higher level hierarchy, than the humungous linear dumps with no
>    explanations that you propose - while still supporting the 'raw'
>    vendor event names you want to use, for those people who are used
>    to them.

Trying to understand what you mean with high level hierarchy:

Do you mean something like the oprofile event / unit mask split,
with events having a default unit mask?

This one actually works poorly for modern Intel systems, as
unit masks can completely change the behavior, so there is 
not necessarily a direct relation between the name before the dot and the one
after, or a concept of a default unit mask.

Or do you mean someone creating a tree hierarchy of events to systematically
debug some particular problem?

I implemented this in my toplev tool here:
https://github.com/andikleen/pmu-tools/wiki/toplev-manual
http://github.com/andikleen/pmu-tools

But it's quite complicated and cannot really be done portably in
a full way without completely turning perf into an architecture-specific
tool (which I assume you're opposed to).

Or just some sections grouping events into different topics?
branches, caches, execution or somesuch.

I think this could be done with the JSON format. Just add a new header.
Print them as sections in perf list.

Or do you mean something like the perf cache events tables?
We already have those. But they cannot replace direct micro architectural
specific events, as the cache event tables do not necessarily express
everything a given micro architecture supports. And also there is a lot
of documentation using the vendor event names, which the users
then want to use.

If it's something else please explain.

 
  - to pre-parse the event descriptions at build time - beyond the 
speedup 

FWIW I just measured it and the parse-time overhead for the Haswell JSON file on
my laptop is about 33ms.

# with event map
% time ./obj-perf/perf list > /dev/null
real0m0.045s
user0m0.041s
sys 0m0.003s

% export EVENTMAP=/dev/zero
% time perf list > /dev/null

real0m0.011s
user0m0.004s
sys 0m0.006s


  - to upgrade perf as a whole unit: this helps not just your usecase
but many other usecases as well.

With the downloader it actually automatically downloads any missing
files, so the upgrading just works.

-Andi

Re: 'perf upgrade' (was: Re: [PATCH v9 00/11] Add support for JSON event files.)

2015-04-14 Thread Andi Kleen
On Tue, Apr 14, 2015 at 10:55:41AM +0200, Ingo Molnar wrote:
 
 * Sukadev Bhattiprolu suka...@linux.vnet.ibm.com wrote:
 
  This is another attempt to resurrect Andi Kleen's patchset so users
  can specify perf events by their event names rather than raw codes.
  
  This is a rebase of Andi Kleen's patchset from Jul 30, 2014[1] to 4.0.
  (I fixed minor and not so minor conflicts).
 
 So this series shows some progress, but instead of this limited 
 checkout ability I'd still prefer it if 'perf download' downloaded the 
 latest perf code itself and built it - it shouldn't be limited to just 
 a small subset of the perf source code!

I'm not against that idea (and can see it being useful in some cases[1]), 
but this would need some authentication. 

Otherwise it would be a '$bigbrother/NSA/anyone controlling your network
can run everything on your computer' style hole.

The original JSON files don't have this problem because they didn't contain
any code.

-Andi

[1] especially to work around the broken perf setup on d./u. distros.

Re: perf report broken for branch stack samples

2015-04-06 Thread Andi Kleen
On Wed, Apr 01, 2015 at 02:32:54PM +0530, Anshuman Khandual wrote:
 Hello,
 
 perf report is not showing up the branch stack sample results in the
 from_symbol --- to_symbol format even if the perf.data file has got
 the samples (through 'perf record -b workload' session). Perf report

Sorry for the late answer. This was already fixed in tip by 

commit fefd2d9619de3bf0bf02a8622e9f445c3d19cc3f
Author: He Kuang heku...@huawei.com
Date:   Sun Feb 15 10:33:37 2015 +0800

perf report: Fix branch stack mode cannot be set

-Andi

Re: [PATCH 1/3] perf/e6500: Make event translations available in sysfs

2015-03-27 Thread Andi Kleen

Thanks for supporting the JSON format too.

 (c) If not, given we don't know how to get us out of the current
 status quo, can this patchseries still be applied, given the
 original complaint was the size of our events-list.h (whereas

The Intel core event lists are far larger even
(and will grow even more when uncore gets added)

 power7-events-list.h is almost twice the size)?  If not, patch 3/3
 in this series is still valid, no matter what, and it should still
 be applied (let us know if we need to resubmit).

Could also just leave out the downloader for now, so that you have
to get your own event file and set it up with export EVENTMAP=...

That's basically the patchkit, minus one patch.

-Andi

Re: [PATCH 1/3] perf/e6500: Make event translations available in sysfs

2015-02-18 Thread Andi Kleen

Well I'm tired of discussing this. I don't think what you
proposed makes sense, putting 3.4MB[1] of changing blob into perf.

I'll resubmit the JSON parser without the downloader. Then users
have the option to get their own events and use that.

If you don't like that, standard perf just has to stay with limited 
events and raw codes as before, with users having to use external
tools or libraries for names for more events[2][3].

-Andi

[1] Current size of https://download.01.org/perfmon/
[2] ocperf in https://github.com/andikleen/pmu-tools
[3] http://perfmon2.sourceforge.net/
-- 
a...@linux.intel.com -- Speaking for myself only.

Re: [PATCH 1/3] perf/e6500: Make event translations available in sysfs

2015-02-09 Thread Andi Kleen
 I'll NAK any external 'download area' (and I told that Andi 
 before): tools/perf/event-tables/ or so is a good enough 
 'download area' with fast enough update cycles.

The proposal was to put it on kernel.org, similar to how
external firmware blobs are distributed. CPU event lists
are data sheets, so are like firmware. They do not
follow the normal kernel code licenses. They are not 
source code. They cannot be reviewed in the normal way.

 If any 'update' of event descriptions is needed it can 
 happen through the distro package mechanism, or via a 
 simple 'git pull' if it's compiled directly.
 
 Lets not overengineer this with any dependence on an 
 external site and with a separate update mechanism - lets 
 just get the tables into tools/ and see it from there...

That experiment has already been done for oprofile; it
didn't work very well.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.

Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-01-06 Thread Andi Kleen
Anton Blanchard an...@samba.org writes:

 Thoughts? It seems like we could hit a similar situation if a machine
 is balanced but we run out of memory on a single node.

Yes I agree, but your patch doesn't seem to attempt to handle this?

-Andi

 Index: b/mm/slub.c
 ===
 --- a/mm/slub.c
 +++ b/mm/slub.c
 @@ -2278,10 +2278,17 @@ redo:
  
   if (unlikely(!node_match(page, node))) {
   stat(s, ALLOC_NODE_MISMATCH);
 - deactivate_slab(s, page, c->freelist);
 - c->page = NULL;
 - c->freelist = NULL;
 - goto new_slab;
 +
 + /*
 +  * If the node contains no memory there is no point in trying
 +  * to allocate a new node local slab
 +  */
 + if (node_spanned_pages(node)) {
 + deactivate_slab(s, page, c->freelist);
 + c->page = NULL;
 + c->freelist = NULL;
 + goto new_slab;
 + }
   }
  
   /*
-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH V4 04/10] x86, perf: Add conditional branch filtering support

2013-12-06 Thread Andi Kleen
On Wed, Dec 04, 2013 at 04:02:36PM +0530, Anshuman Khandual wrote:
 This patch adds conditional branch filtering support,
 enabling it for PERF_SAMPLE_BRANCH_COND in perf branch
 stack sampling framework by utilizing an available
 software filter X86_BR_JCC.

Newer Intel CPUs have a hardware filter for conditional branches too.
I can look at implementing that.

The software option seems fine for now.

-Andi


 
 Signed-off-by: Anshuman Khandual khand...@linux.vnet.ibm.com
 Reviewed-by: Stephane Eranian eran...@google.com
 ---
  arch/x86/kernel/cpu/perf_event_intel_lbr.c | 5 +
  1 file changed, 5 insertions(+)
 
 diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c 
 b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
 index d82d155..9dd2459 100644
 --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
 +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
 @@ -384,6 +384,9 @@ static void intel_pmu_setup_sw_lbr_filter(struct 
 perf_event *event)
   if (br_type & PERF_SAMPLE_BRANCH_NO_TX)
   mask |= X86_BR_NO_TX;
  
 + if (br_type & PERF_SAMPLE_BRANCH_COND)
 + mask |= X86_BR_JCC;
 +
   /*
* stash actual user request into reg, it may
* be used by fixup code for some CPU
 @@ -678,6 +681,7 @@ static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] 
 = {
* NHM/WSM erratum: must include IND_JMP to capture IND_CALL
*/
   [PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP,
 + [PERF_SAMPLE_BRANCH_COND] = LBR_JCC,
  };
  
  static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
 @@ -689,6 +693,7 @@ static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] 
 = {
   [PERF_SAMPLE_BRANCH_ANY_CALL]   = LBR_REL_CALL | LBR_IND_CALL
   | LBR_FAR,
   [PERF_SAMPLE_BRANCH_IND_CALL]   = LBR_IND_CALL,
 + [PERF_SAMPLE_BRANCH_COND]   = LBR_JCC,
  };
  
  /* core */
 -- 
 1.7.11.7
 

-- 
a...@linux.intel.com -- Speaking for myself only


Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory

2013-10-02 Thread Andi Kleen
Linus Torvalds torva...@linux-foundation.org writes:

 On Mon, Sep 30, 2013 at 1:01 PM, Waiman Long waiman.l...@hp.com wrote:

 I think this patch is worth a trial if relevant hardware is more widely
 available. The TSX code certainly need to be moved to an architecture
 specific area and should be runtime enabled using a static key. We also need
 more TSX support infrastructure in place first.

 I think we can pick that up from Andi's patches, he should have that.
 Although that did have very x86-specific naming (ie xbegin). And I
 don't think he used asm goto to quite the same advantage as this -
 and I think we do want to make sure that the overhead is minimal.

FWIW my version #0 used asm goto directly, but I later switched to not
using it to support more compilers and higher level abstractions (locks
etc.) and use the same intrinsics as the user level guys are using.

The two extra instructions from not using asm goto for xbegin
don't matter all that much in the end.

That's the old asm goto stuff I wrote originally
(user level version):

https://github.com/andikleen/tsx-tools/blob/master/include/rtm-goto.h

There was also a kernel version of it that patched the code; right 
now this is done in the main TSX patchkit like this:

https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/commit/?h=hle312/rtm-baseid=9190346d57a9bc89e746aee774d07e54cd1e6e75

Essentially without RTM it just becomes an unconditional jump
to the abort handler, xabort is a nop, and xtest always returns 0.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH 8/10] memory-hotplug : remove page table of x86_64 architecture

2012-10-07 Thread Andi Kleen
Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com writes:
 + }
 +
 + /*
 +  * We use 2M page, but we need to remove part of them,
 +  * so split 2M page to 4K page.
 +  */
 + pte = alloc_low_page(&pte_phys);

What happens when the allocation fails?

alloc_low_page seems to be buggy there too, it would __pa a NULL 
pointer.

 + if (pud_large(*pud)) {
 + if ((addr & ~PUD_MASK) == 0 && next <= end) {
 + set_pud(pud, __pud(0));
 + pages++;
 + continue;
 + }
 +
 + /*
 +  * We use 1G page, but we need to remove part of them,
 +  * so split 1G page to 2M page.
 +  */
 + pmd = alloc_low_page(&pmd_phys);

Same here

 + __split_large_page((pte_t *)pud, addr, (pte_t *)pmd);
 +
 + spin_lock(&init_mm.page_table_lock);
 + pud_populate(&init_mm, pud, __va(pmd_phys));
 + spin_unlock(&init_mm.page_table_lock);
 + }
 +
 + pmd = map_low_page(pmd_offset(pud, 0));
 + phys_pmd_remove(pmd, addr, end);
 + unmap_low_page(pmd);
 + __flush_tlb_all();
 + }
 + __flush_tlb_all();

This doesn't flush the other CPUs, does it?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH v2 4/6] x86: Add clear_page_nocache

2012-08-13 Thread Andi Kleen
 Moving 64 bytes per cycle is faster on Sandy Bridge, but slower on
 Westmere. Any preference? ;)

You have to be careful with these benchmarks.

- You need to make sure the data is cache cold, cache hot is misleading.
- The numbers can change if you have multiple CPUs doing this in parallel.

-Andi


[PATCH] [6/99] seqlock: Don't smp_rmb in seqlock reader spin loop

2011-07-27 Thread Andi Kleen
2.6.35-longterm review patch.  If anyone has any objections, please let me know.

--
From: Milton Miller milt...@bga.com

commit 5db1256a5131d3b133946fa02ac9770a784e6eb2 upstream.

Move the smp_rmb after cpu_relax loop in read_seqlock and add
ACCESS_ONCE to make sure the test and return are consistent.

A multi-threaded core in the lab didn't like the update
from 2.6.35 to 2.6.36, to the point it would hang during
boot when multiple threads were active.  Bisection showed
af5ab277ded04bd9bc6b048c5a2f0e7d70ef0867 (clockevents:
Remove the per cpu tick skew) as the culprit and it is
supported with stack traces showing xtime_lock waits including
tick_do_update_jiffies64 and/or update_vsyscall.

Experimentation showed the combination of cpu_relax and smp_rmb
was significantly slowing the progress of other threads sharing
the core, and this patch is effective in avoiding the hang.

A theory is the rmb is affecting the whole core while the
cpu_relax is causing a resource rebalance flush; together they
cause an interference cadence that is unbroken when the seqlock
reader has interrupts disabled.

At first I was confused why the refactor in
3c22cd5709e8143444a6d08682a87f4c57902df3 (kernel: optimise
seqlock) didn't affect this patch application, but after some
study that affected seqcount, not seqlock. The new seqcount was
not factored back into the seqlock. I defer that to the future.

While the removal of the timer interrupt offset created
contention for the xtime lock while a cpu does the
additional work to update the system clock, the seqlock
implementation with the tight rmb spin loop goes back much
further, and is just waiting for the right trigger.

Signed-off-by: Milton Miller milt...@bga.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Linus Torvalds torva...@linux-foundation.org
Cc: Andi Kleen a...@firstfloor.org
Cc: Nick Piggin npig...@kernel.dk
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Anton Blanchard an...@samba.org
Cc: Paul McKenney paul...@linux.vnet.ibm.com
Acked-by: Eric Dumazet eric.duma...@gmail.com
Signed-off-by: Andi Kleen a...@linux.intel.com
Link: http://lkml.kernel.org/r/%3Cseqlock-rmb%40mdm.bga.com%3E
Signed-off-by: Thomas Gleixner t...@linutronix.de
Signed-off-by: Greg Kroah-Hartman gre...@suse.de

---
 include/linux/seqlock.h |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.35.y/include/linux/seqlock.h
===
--- linux-2.6.35.y.orig/include/linux/seqlock.h
+++ linux-2.6.35.y/include/linux/seqlock.h
@@ -88,12 +88,12 @@ static __always_inline unsigned read_seq
unsigned ret;
 
 repeat:
-   ret = sl->sequence;
-   smp_rmb();
+   ret = ACCESS_ONCE(sl->sequence);
    if (unlikely(ret & 1)) {
cpu_relax();
goto repeat;
}
+   smp_rmb();
 
return ret;
 }


Re: [PATCH] seqlock: don't smp_rmb in seqlock reader spin loop

2011-05-12 Thread Andi Kleen
On Thu, May 12, 2011 at 04:13:54AM -0500, Milton Miller wrote:
 
 Move the smp_rmb after cpu_relax loop in read_seqlock and add
 ACCESS_ONCE to make sure the test and return are consistent.
 
 A multi-threaded core in the lab didn't like the update

Which core was that?

-Andi
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 0/5] make *_gate_vma accept mm_struct instead of task_struct

2011-03-10 Thread Andi Kleen
On Tue, Mar 08, 2011 at 07:31:56PM -0500, Stephen Wilson wrote:
 
 Morally, the question of whether an address lies in a gate vma should be asked
 with respect to an mm, not a particular task.
 
 Practically, dropping the dependency on task_struct will help make current and
 future operations on mm's more flexible and convenient.  In particular, it
 allows some code paths to avoid the need to hold task_lock.
 
 The only architecture this change impacts in any significant way is x86_64.
 The principle change on that architecture is to mirror TIF_IA32 via
 a new flag in mm_context_t. 

The problem is -- you're adding a likely cache miss on mm_struct for
every 32bit compat syscall now, even if they don't need mm_struct
currently (and a lot of them do not). Unless there's a very good
justification to make up for this performance issue elsewhere
(including numbers), this seems like a bad idea.

 This is the first of a two part series that implements safe writes to
 /proc/pid/mem.  I will be posting the second series to lkml shortly.  These

Making every syscall slower for /proc/pid/mem doesn't seem like a good
tradeoff to me. Please solve this in some other way.

-Andi


Re: [PATCH 0/5] make *_gate_vma accept mm_struct instead of task_struct II

2011-03-10 Thread Andi Kleen
On Thu, Mar 10, 2011 at 08:00:32AM -0800, Andi Kleen wrote:
 On Tue, Mar 08, 2011 at 07:31:56PM -0500, Stephen Wilson wrote:
  
  Morally, the question of whether an address lies in a gate vma should be 
  asked
  with respect to an mm, not a particular task.
  
  Practically, dropping the dependency on task_struct will help make current 
  and
  future operations on mm's more flexible and convenient.  In particular, it
  allows some code paths to avoid the need to hold task_lock.
  
  The only architecture this change impacts in any significant way is x86_64.
  The principle change on that architecture is to mirror TIF_IA32 via
  a new flag in mm_context_t. 
 
 The problem is -- you're adding a likely cache miss on mm_struct for
 every 32bit compat syscall now, even if they don't need mm_struct
 currently (and a lot of them do not) Unless there's a very good
 justification to make up for this performance issue elsewhere
 (including numbers) this seems like a bad idea.

Hmm I see you're only setting it on exec time actually on rereading
the patches. I thought you were changing TS_COMPAT which is in
the syscall path.

Never mind.  I have no problems with doing such a change on exec
time.

-Andi


Re: [PATCH 0/5] make *_gate_vma accept mm_struct instead of task_struct II

2011-03-10 Thread Andi Kleen
On Thu, Mar 10, 2011 at 11:54:14AM -0500, Stephen Wilson wrote:
 
 On Thu, Mar 10, 2011 at 08:38:09AM -0800, Andi Kleen wrote:
  On Thu, Mar 10, 2011 at 08:00:32AM -0800, Andi Kleen wrote:
   On Tue, Mar 08, 2011 at 07:31:56PM -0500, Stephen Wilson wrote:
The only architecture this change impacts in any significant way is 
x86_64.
The principle change on that architecture is to mirror TIF_IA32 via
a new flag in mm_context_t. 
   
   The problem is -- you're adding a likely cache miss on mm_struct for
   every 32bit compat syscall now, even if they don't need mm_struct
   currently (and a lot of them do not) Unless there's a very good
   justification to make up for this performance issue elsewhere
   (including numbers) this seems like a bad idea.
  
  Hmm I see you're only setting it on exec time actually on rereading
  the patches. I thought you were changing TS_COMPAT which is in
  the syscall path.
  
  Never mind.  I have no problems with doing such a change on exec
  time.
 
 OK.  Great!  Does this mean I have your ACK'ed by or reviewed by?

I didn't read it all, but the first two patches looked ok.

-Andi


Re: [RFC] [PATCH] allow low HZ values?

2010-10-12 Thread Andi Kleen
Thomas Gleixner t...@linutronix.de writes:

 On Mon, 11 Oct 2010, Tim Pepper wrote:

 I'm not necessarily wanting to open up the age old question of what is
 a good HZ, but we were doing some testing on timer tick overheads for
 HPC applications and this came up...

 Yeah. This comes always up when the timer tick overhead on HPC is
 tested. And this patch is again the fundamentally wrong answer.

That's an unfair description of the proposal.

 We have told HPC folks for years that we need a kind of NOHZ mode
 for HPC where we can transparently switch off the tick when only one
 user space bound thread is active and switch back to normal once this
 thing terminates or goes into the kernel via a syscall. Sigh, nothing
 happened ever except for repeating the same crap patches over and
 over.

Jan Blunck posted a patch for exactly this a few months ago.
Unfortunately it didn't get the accounting right, but other than
that it seemed like a reasonable starting point.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 31/40] trace syscalls: Convert various generic compat syscalls

2010-06-23 Thread Andi Kleen

Ian Munsie wrote:

From: Ian Munsieimun...@au1.ibm.com

This patch converts numerous trivial compat syscalls through the generic
kernel code to use the COMPAT_SYSCALL_DEFINE family of macros.


Why? This just makes the code look uglier and the functions harder
to grep for.

-Andi


Re: [PATCH 31/40] trace syscalls: Convert various generic compat syscalls

2010-06-23 Thread Andi Kleen

Frederic Weisbecker wrote:

On Wed, Jun 23, 2010 at 12:19:38PM +0200, Andi Kleen wrote:

Ian Munsie wrote:

From: Ian Munsieimun...@au1.ibm.com

This patch converts numerous trivial compat syscalls through the generic
kernel code to use the COMPAT_SYSCALL_DEFINE family of macros.


Why? This just makes the code look uglier and the functions harder
to grep for.



Because it makes them usable with syscall tracing.


Ok that information is missing in the changelog then.

Also I hope the uglification-usefulness factor is really worth it.
The patch is certainly no slouch on the uglification side.

It also has maintenance costs, e.g. I doubt ctags and cscope
will be able to deal with these kinds of macros, so it has a
high cost for everyone using these tools. For those
it would actually be better if you used a separate annotation
that does not confuse standard C parsers.

-Andi


Re: [PATCH 31/40] trace syscalls: Convert various generic compat syscalls

2010-06-23 Thread Andi Kleen




I haven't heard any complaints about existing syscall wrappers.


At least for me they always interrupt my grepping.



What kind of annotations could solve that?


If you put the annotation in a separate macro and leave the original
prototype alone, then C parsers could still parse it.
-Andi


Re: [PATCH 2/4] panic: Allow taint flag for warnings to be changed from TAINT_WARN

2010-03-21 Thread Andi Kleen
Ben Hutchings b...@decadent.org.uk writes:

 WARN() is used in some places to report firmware or hardware bugs that
 are then worked-around.  These bugs do not affect the stability of the
 kernel and should not set the usual TAINT_WARN flag.  To allow for
 this, add WARN_TAINT() and WARN_TAINT_ONCE() macros that take a taint
 flag as argument.

 Architectures that implement warnings using trap instructions instead
 of calls to warn_slowpath_*() must now implement __WARN_TAINT(taint)
 instead of __WARN().

I guess this should enforce that at least some taint flag is set?
(e.g. with a BUILD_BUG_ON)

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [v8 PATCH 2/8]: cpuidle: implement a list based approach to register a set of idle routines.

2009-10-14 Thread Andi Kleen
 How about something like this..
 If the arch does not enable CONFIG_CPU_IDLE, the cpuidle_idle_call
 which is called from cpu_idle() should call default_idle without
 involving the registering cpuidle steps. This should prevent bloating
 up of the kernel for archs which don't want to use cpuidle.

On x86 some people want a small kernel too, so selecting it at architecture
granularity is not good. Also you can make it the default, you just need
to slim it down first.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [v8 PATCH 2/8]: cpuidle: implement a list based approach to register a set of idle routines.

2009-10-12 Thread Andi Kleen
Peter Zijlstra a.p.zijls...@chello.nl writes:

 So does it make sense to have a set of sets?

 Why not integrate them all into one set to be ruled by this governor
 thing?

cpuidle is currently optional, that is why the two level hierarchy
is there so that you can still have simple idle selection without it.

% size drivers/cpuidle/*.o
   text    data     bss     dec     hex filename
   5514    1416      44    6974    1b3e drivers/cpuidle/built-in.o

Adding it unconditionally would add ~7k to everyone who wants idle functions.

I think making it unconditional would require putting it on a serious
diet first.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 1/3] Support for PCI Express reset type

2009-07-31 Thread Andi Kleen
Mike Mason mm...@us.ibm.com writes:

 These patches supersede the previously submitted patch that
 implemented a fundamental reset bit field. 

 Please review and let me know of any concerns.

Any plans to implement that for x86 too? Right now it seems to be a PPC
specific hack. And where is the driver that is using it?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: question about softirqs

2009-05-13 Thread Andi Kleen
Chris Friesen cfrie...@nortel.com writes:

 One of the reasons I brought up this issue is that there is a lot of
 documentation out there that says softirqs will be processed on return
 from a syscall.  The fact that it actually depends on the scheduler
 parameters of the task issuing the syscall isn't ever mentioned.

It's not mentioned because it is not currently done.

However some network TCP RX processing can happen in process context,
which gives you most of the benefit anyways.

 In fact, Documentation/DocBook/kernel-hacking.tmpl in the kernel
 source still has the following:

 Whenever a system call is about to return to userspace, or a
 hardware interrupt handler exits, any 'software interrupts'
 which are marked pending (usually by hardware interrupts) are
 run (filenamekernel/softirq.c/filename).

 If anyone is looking at changing this code, it might be good to ensure
 that at least the kernel docs are updated.

So far the code is not changed in mainline. There have been some
proposals only.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: question about softirqs

2009-05-13 Thread Andi Kleen
 If a soft irq is raised in process context, raise_softirq() in
 kernel/softirq.c calls wakeup_softirqd() to make sure that ksoftirqd

softirqd is only used when the softirq runs for too long or when
there are no suitable irq exits for a long time.

In normal situations (not excessive time in softirq) they don't
do anything. 

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: question about softirqs

2009-05-13 Thread Andi Kleen
Thomas Gleixner t...@linutronix.de writes:


 Err, no. Chris is completely correct:

 if (!in_interrupt())
   wakeup_softirqd();

Yes you have to wake it up just in case, but it doesn't normally
process the data because a normal softirq comes in faster. It's
just a safety policy. 

You can check this by checking the accumulated CPU time on your
ksoftirqs.  Mine are all 0 even on long running systems.

The reason Andrea originally added the softirqds was just that
if you have very softirq intensive workloads they would tie
up too much CPU time or not make enough progress with the default
"don't loop too often" heuristics.

 We can not rely on irqs coming in when the softirq is raised from

You can't rely on it, but it happens in nearly all cases.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: question about softirqs

2009-05-13 Thread Andi Kleen
 I have one machine SMP flooded by network frames, CPU0 handling all

Yes that's the case softirqd is supposed to handle. When you 
spend a significant part of your CPU time in softirq context it kicks
in to provide somewhat fair additional CPU time.

But most systems (like mine) don't do that.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: question about softirqs

2009-05-13 Thread Andi Kleen
On Wed, May 13, 2009 at 09:05:01AM -0600, Chris Friesen wrote:
 Andi Kleen wrote:
  Thomas Gleixner t...@linutronix.de writes:
 
 Err, no. Chris is completely correct:
 
 if (!in_interrupt())
 wakeup_softirqd();
  
  Yes you have to wake it up just in case, but it doesn't normally
  process the data because a normal softirq comes in faster. It's
  just a safety policy. 
 
 What about the scenario I raised earlier, where we have incoming network
 packets,

network packets are normally processed by the network packet interrupt's
softirq or alternatively in the NAPI poll loop.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: question about softirqs

2009-05-13 Thread Andi Kleen
On Wed, May 13, 2009 at 01:04:09PM -0600, Chris Friesen wrote:
> Andi Kleen wrote:
> 
> > network packets are normally processed by the network packet interrupt's
> > softirq or alternatively in the NAPI poll loop.
> 
> If we have a high priority task, ksoftirqd may not get a chance to run.

In this case the next interrupt will also process them. It will just
go more slowly because interrupts limit the work compared to ksoftirqd.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: question about softirqs

2009-05-13 Thread Andi Kleen
On Wed, May 13, 2009 at 01:44:59PM -0600, Chris Friesen wrote:
> Andi Kleen wrote:
> > On Wed, May 13, 2009 at 01:04:09PM -0600, Chris Friesen wrote:
> > > Andi Kleen wrote:
> > >
> > > > network packets are normally processed by the network packet interrupt's
> > > > softirq or alternatively in the NAPI poll loop.
> > > If we have a high priority task, ksoftirqd may not get a chance to run.
> > 
> > In this case the next interrupt will also process them. It will just
> > go more slowly because interrupts limit the work compared to ksoftirqd.
> 
> I realize that they will eventually get processed.  My point is that the
> documentation (in-kernel, online, and in various books) says that
> softirqs will be processed _on the return from a syscall_. 

They are. The documentation is correct. 

What might not all be processed are the packets sitting in the per-CPU
backlog queue when the network softirq runs (for non-NAPI; for NAPI that's
obsolete anyway). That's because there are limits.

Or when new work comes in in parallel it doesn't process it all.

But that's always the case -- no queue is infinite, so you have
always situations where it can drop or delay items.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole

2009-05-08 Thread Andi Kleen
Markus Gutschke mar...@google.com writes:

> There are a large number of system calls that normal C/C++ code uses
> quite frequently, and that are not security sensitive. A typical
> example would be gettimeofday().

At least on x86-64 gettimeofday() (and time(2)) work inside seccomp because 
they're vsyscalls that run in ring 3 only.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [RFC/PATCH] Block device for the ISS simulator

2008-10-02 Thread Andi Kleen
Benjamin Herrenschmidt [EMAIL PROTECTED] writes:

> The ISS simulator is a simple powerpc simulator used among other things
> for hardware bringup. It implements a simple memory mapped block device
> interface.

...

> --- /dev/null 1970-01-01 00:00:00.0 +
> +++ linux-work/drivers/block/iss_blk.c 2008-09-23 11:12:03.0 +1000
> @@ -0,0 +1,365 @@
> +/*
> + * Simple block device for the ISS simulator
> + */

The first paragraph in your description above should be in this comment.

-Andi


Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks

2008-08-06 Thread Andi Kleen
Andrew Morton [EMAIL PROTECTED] writes:

> Do we expect that this change will be replicated in other
> memory-intensive apps?  (I do).

The catch with 2MB pages on x86 is that x86 CPUs generally have
far fewer 2MB TLB entries than 4K entries. So if you're unlucky
and access a lot of mappings you might actually thrash more with
them. That is why they are not necessarily a universal win.

-Andi


Re: [PATCH 0/4] 16G huge page support for powerpc

2008-03-26 Thread Andi Kleen

FWIW I turned over the hugepages patchkit to Nick Piggin. So send
all future patches to him please.

-Andi


Re: [PATCH 1/4] allow arch specific function for allocating gigantic pages

2008-03-26 Thread Andi Kleen

Haven't reviewed it in detail, just noticed something.

> @@ -614,6 +610,7 @@ static int __init hugetlb_init(void)
>  {
>  	if (HPAGE_SHIFT == 0)
>  		return 0;
> +	INIT_LIST_HEAD(&huge_boot_pages);
>  	return hugetlb_init_hstate(&global_hstate);

I don't think adding the INIT_LIST_HEAD here is correct. There can
be huge pages added by the __setup handlers before hugetlb_init runs.
-Andi


Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-19 Thread Andi Kleen

On Tue, Feb 19, 2008 at 08:46:46PM +1100, Nick Piggin wrote:
> On Tuesday 19 February 2008 20:25, Andi Kleen wrote:
> > On Tue, Feb 19, 2008 at 01:33:53PM +1100, Nick Piggin wrote:
> >
> > > I actually once measured context switching performance in the scheduler,
> > > and removing the unlikely hint for testing RT tasks IIRC gave about 5%
> > > performance drop.
> >
> > OT: what benchmarks did you use for that? I had a change some time
> > ago to the CFS scheduler to avoid unpredicted indirect calls for
> > the common case, but I wasn't able to benchmark a difference with the usual
> > suspect benchmark (lmbench). Since it increased code size by
> > a few bytes it was rejected then.
>
> I think it was just a simple context switch benchmark, but not lmbench
> (which I found to be a bit too variable). But it was a long time ago...

Do you still have it?

I thought about writing my own but ended up being too lazy for that :)

 
> > > However, the P4's branch predictor is pretty good, and it should easily
> >
> > I think it depends on the generation. Prescott class branch
> > prediction should be much better than the earlier ones.
>
> I was using a Nocona Xeon, which I think is a Prescott class?

Yes.

> And don't they have much higher mispredict penalty (than older P4s)?

They do have a longer pipeline, so yes more penalty (5 or 6 stages more iirc),
but also a lot better branch predictor which makes up for that.

 
 
> > > Actually one thing I don't like about gcc is that I think it still emits
> > > cmovs for likely/unlikely branches,
> >
> > That's -Os.
>
> And -O2 and -O3, on the gccs that I'm using, AFAIKS.

Well if it still happens on gcc 4.2 with P4 tuning you should
perhaps open a gcc PR. They tend to ignore these bugs mostly in
my experience, but sometimes they act on them. 

 
 
> > > which is silly (the gcc developers
> >
> > It depends on the CPU. e.g. on K8 and P6 using CMOV if possible
> > makes sense. P4 doesn't like it though.
>
> If the branch is completely predictable (eg. annotated), then I
> think branches should be used anyway. Even on well predicted
> branches, cmov is similar speed on microbenchmarks, but it will
> increase data hazards I think, so it will probably be worse for
> some real world situations.

At least the respective optimization manuals say they should be used.
I presume they only made this recommendation after some extensive
benchmarking.

 
 
> > > the quite good numbers that cold CPU predictors can attain. However
> > > for really performance critical code (or really never executed
> > > code), then I think it is OK to have the hints and not have to rely
> > > on gcc heuristics.
> >
> > But only when the explicit hints are different from what the implicit
> > branch predictors would predict anyways. And if you look at the
> > heuristics that is not often the case...
>
> But a likely branch will be _strongly_ predicted to be taken,
> whereas a lot of the gcc heuristics simply have slightly more or
> slightly less probability. So it's not just a question of which
> way is more likely, but also _how_ likely it is to go that way.

Yes, but a lot of the heuristics are pretty strong (80%) and gcc will
act on them unless it has a very strong contra cue. And that should
normally not be the case.

-Andi


Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-18 Thread Andi Kleen
Arjan van de Ven [EMAIL PROTECTED] writes:
> you have more faith in the author's knowledge of how his code actually
> behaves than I think is warranted  :)

IIRC there was an mm patch some time ago to keep track of the actual
unlikely values at runtime, and it showed indeed some wrong ones. But
the vast majority of them are probably correct.

> Or faith in that he knows what unlikely means.
> I should write docs about this; but unlikely() means:
> 1) It happens less than 0.01% of the cases.
> 2) The compiler couldn't have figured this out by itself
>    (NULL pointer checks are compiler done already, same for some other
>    conditions)
> 3) It's a hot codepath where shaving 0.5 cycles (less even on x86) matters
>    (and the author is ok with taking a 500 cycles hit if he's wrong)

One more thing unlikely() does is to move the unlikely code out of line.
So it should conserve some icache in critical functions, which might
well be worth some more cycles (don't have numbers though). 

But overall I agree with you that unlikely is in most cases a bad
idea (and I submitted the original patch introducing it originally :)).
That is because it is often used in situations where gcc's default
branch prediction heuristics would make exactly the same decision

   if (unlikely(x == NULL)) 

is simply totally useless because gcc already assumes all x == NULL
tests are unlikely. I appended some of the builtin heuristics from
a recent gcc source so people can see them.

Note in particular the last predictors: a branch ending with a goto,
containing a call, causing an early function return, or returning a
negative constant is assumed not taken. Just these alone likely cover
95+% of the unlikelies in the kernel.

-Andi

/* Use number of loop iterations determined by # of iterations
   analysis to set probability.  We don't want to use Dempster-Shaffer
   theory here, as the predictions is exact.  */
DEF_PREDICTOR (PRED_LOOP_ITERATIONS, "loop iterations", PROB_ALWAYS,
	       PRED_FLAG_FIRST_MATCH)

/* Hints dropped by user via __builtin_expect feature.  */
DEF_PREDICTOR (PRED_BUILTIN_EXPECT, "__builtin_expect", PROB_VERY_LIKELY,
	       PRED_FLAG_FIRST_MATCH)

/* Use number of loop iterations guessed by the contents of the loop.  */
DEF_PREDICTOR (PRED_LOOP_ITERATIONS_GUESSED, "guessed loop iterations",
	       PROB_ALWAYS, PRED_FLAG_FIRST_MATCH)

/* Branch containing goto is probably not taken.  */
DEF_PREDICTOR (PRED_CONTINUE, "continue", HITRATE (56), 0)

/* Branch to basic block containing call marked by noreturn attribute.  */
DEF_PREDICTOR (PRED_NORETURN, "noreturn call", HITRATE (99),
	       PRED_FLAG_FIRST_MATCH)

/* Branch to basic block containing call marked by cold function attribute.  */
DEF_PREDICTOR (PRED_COLD_FUNCTION, "cold function call", HITRATE (99),
	       PRED_FLAG_FIRST_MATCH)

/* Loopback edge is taken.  */
DEF_PREDICTOR (PRED_LOOP_BRANCH, "loop branch", HITRATE (86),
	       PRED_FLAG_FIRST_MATCH)

/* Edge causing loop to terminate is probably not taken.  */
DEF_PREDICTOR (PRED_LOOP_EXIT, "loop exit", HITRATE (91),
	       PRED_FLAG_FIRST_MATCH)

/* Pointers are usually not NULL.  */
DEF_PREDICTOR (PRED_POINTER, "pointer", HITRATE (85), 0)
DEF_PREDICTOR (PRED_TREE_POINTER, "pointer (on trees)", HITRATE (85), 0)

/* NE is probable, EQ not etc...  */
DEF_PREDICTOR (PRED_OPCODE_POSITIVE, "opcode values positive", HITRATE (79), 0)
DEF_PREDICTOR (PRED_OPCODE_NONEQUAL, "opcode values nonequal", HITRATE (71), 0)
DEF_PREDICTOR (PRED_FPOPCODE, "fp_opcode", HITRATE (90), 0)
DEF_PREDICTOR (PRED_TREE_OPCODE_POSITIVE, "opcode values positive (on trees)",
	       HITRATE (70), 0)
DEF_PREDICTOR (PRED_TREE_OPCODE_NONEQUAL, "opcode values nonequal (on trees)",
	       HITRATE (69), 0)
DEF_PREDICTOR (PRED_TREE_FPOPCODE, "fp_opcode (on trees)", HITRATE (90), 0)

/* Branch guarding call is probably taken.  */
DEF_PREDICTOR (PRED_CALL, "call", HITRATE (69), 0)

/* Branch causing function to terminate is probably not taken.  */
DEF_PREDICTOR (PRED_TREE_EARLY_RETURN, "early return (on trees)",
	       HITRATE (54), 0)

/* Branch containing goto is probably not taken.  */
DEF_PREDICTOR (PRED_GOTO, "goto", HITRATE (70), 0)

/* Branch ending with return constant is probably not taken.  */
DEF_PREDICTOR (PRED_CONST_RETURN, "const return", HITRATE (67), 0)

/* Branch ending with return negative constant is probably not taken.  */
DEF_PREDICTOR (PRED_NEGATIVE_RETURN, "negative return", HITRATE (96), 0)

/* Branch ending with return; is probably not taken */
DEF_PREDICTOR (PRED_NULL_RETURN, "null return", HITRATE (96), 0)

/* Branches to a mudflap bounds check are extremely unlikely.  */
DEF_PREDICTOR 

Re: [PATCH 00/10] x86: Reduce Memory Usage and Inter-Node message traffic (v3)

2007-09-13 Thread Andi Kleen
On Wednesday 12 September 2007 03:56, [EMAIL PROTECTED] wrote:
> Note:
>
> This patch consolidates all the previous patches regarding
> the conversion of static arrays sized by NR_CPUS into per_cpu
> data arrays and is referenced against 2.6.23-rc6.


Looks good to me from the x86 side. I'll leave it to Andrew to
handle for now though because it touches too many files
outside x86.

-Andi


Re: Too verbose compat_ioctl messages?

2007-07-06 Thread Andi Kleen
H. Peter Anvin [EMAIL PROTECTED] writes:
 
> For one thing, it looks like we're returning the wrong thing (EINVAL
> rather than ENOTTY) across the board.  This was unfortunately a common
> misunderstanding with non-tty-related ioctls in the early days of Linux.

ENOTTY is so excessively misnamed that it is actually surprising
anybody ever got that right (for very small values of "right", I guess).

-Andi