Re: [PATCHv4 00/57] perf c2c: Add new tool to analyze cacheline contention on NUMA systems

2016-10-01 Thread Joe Mario

On 09/29/2016 05:19 AM, Peter Zijlstra wrote:
 


> What I want is a tool that maps memop events (any PEBS memops) back to a
> 'type::member' form and sorts on that. That doesn't rely on the PEBS
> 'Data Linear Address' field, as that is useless for dynamically
> allocated bits. Instead it would use the IP and Dwarf information to
> deduce the 'type::member' of the memop.
>
> I want pahole like output, showing me where the hits (green) and misses
> (red) are in a structure.


I agree that would give valuable insight, but it needs to be
in addition to what c2c provides today, not a replacement for it.

Ten years ago Robert Hundt created that pahole-style output as a developer
option to the HP-UX compiler.  It used compiler feedback to compute every
struct accessed by the application, with exact counts for all reads and
writes to every struct member.  It even had affinity information to show how
often field members were accessed together in time.

He and I ran it on numerous large applications.  It was awesome, but it
did fall short in a few places that Jiri's c2c patches provide, such as
being able to:

- distinguish where the concurrent cacheline accesses came from (e.g., which
  cores, and which nodes).

- see where the loads got resolved from (local cache, local memory, remote
  cache, remote memory).

- see if the hot structs were cacheline aligned or not.

- see if more than one hot struct shares a cacheline.

- see how costly, via load latencies, the contention is.

- see, among all the accesses to a cacheline, which thread or process is
  causing the most harm.

- gain insight into how many other threads/processes are contending for a
  cacheline (and who they are).

The above info has been critical to understanding how best to tackle the
contention uncovered for all those who have used the "perf c2c" prototype.
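
Two of the checks in the list above (is a hot struct cacheline aligned, and
do several hot structs share a cacheline) come down to simple address
arithmetic. The following is only a minimal sketch of that arithmetic, not
how "perf c2c" implements it. It assumes 64-byte cachelines, and the struct
names, addresses and sizes are hypothetical.

```python
CACHELINE_SIZE = 64  # assumption: 64-byte cachelines, typical on x86


def cacheline_of(addr, line_size=CACHELINE_SIZE):
    """Return the base address of the cacheline containing addr."""
    return addr & ~(line_size - 1)


def is_cacheline_aligned(addr, line_size=CACHELINE_SIZE):
    """True if addr sits exactly on a cacheline boundary."""
    return addr % line_size == 0


def lines_spanned(base, size, line_size=CACHELINE_SIZE):
    """List the base addresses of every cacheline an object touches."""
    first = cacheline_of(base, line_size)
    last = cacheline_of(base + size - 1, line_size)
    return list(range(first, last + line_size, line_size))


def shared_lines(structs, line_size=CACHELINE_SIZE):
    """structs maps a name to (base address, size in bytes).

    Returns {cacheline base: sorted names} for every cacheline that is
    touched by more than one struct.
    """
    owners = {}
    for name, (base, size) in structs.items():
        for line in lines_spanned(base, size, line_size):
            owners.setdefault(line, set()).add(name)
    return {line: sorted(names)
            for line, names in owners.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical hot structs: name -> (address, size)
    hot = {
        "lock_a": (0x1000, 48),
        "counter_b": (0x1030, 16),
        "queue_c": (0x1040, 64),
    }
    for line, names in shared_lines(hot).items():
        print(f"cacheline {line:#x} is shared by: {', '.join(names)}")
```

The arithmetic alone only says that two structs can collide; the sampled
load latencies and HITM counts are what tell you whether they actually do.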

So yes, the pahole-style addition would be a plus and it would make it easier
to map it back to the struct, but make sure to preserve what the current
"perf c2c" provides that the pahole-style output will not.

Joe






Re: [PATCHv4 00/57] perf c2c: Add new tool to analyze cacheline contention on NUMA systems

2016-09-29 Thread Arnaldo Carvalho de Melo
Em Thu, Sep 29, 2016 at 11:19:12AM +0200, Peter Zijlstra escreveu:
> On Thu, Sep 22, 2016 at 05:36:28PM +0200, Jiri Olsa wrote:
> > sending new version of c2c patches (v3) originally posted in here:
> >   http://lwn.net/Articles/588866/

> I'll just keep repeating; this is not the tool I want :-( I'll not block
> this tool, but I also think it's far less usable than it should've been.

Well, I think it's an experiment in using that info, one that
people have been using and seemingly finding and fixing problems with.

Requires more work than the way you describe(d) various times, tho,
indeed. :-\
 
> https://lkml.kernel.org/r/20151209093402.gm6...@twins.programming.kicks-ass.net
 
> What I want is a tool that maps memop events (any PEBS memops) back to a
> 'type::member' form and sorts on that. That doesn't rely on the PEBS
> 'Data Linear Address' field, as that is useless for dynamically
> allocated bits. Instead it would use the IP and Dwarf information to
> deduce the 'type::member' of the memop.
 
> I want pahole like output, showing me where the hits (green) and misses
> (red) are in a structure.
 
> I want to be able to 'perf memops report -EC task_struct' and see the
> expanded task_struct (as per 'pahole -EC task_struct') annotated, not a
> data address for each task in my workload (which could be 100+ and
> entirely useless).
 
> Currently this is somewhat involved, since Dwarf doesn't include type
> information for all memops, so we'd have to disassemble and interpret,
> which while tedious is possible.
 
> However, afaik, Stephane has been working with their tools team to get
> additional DWARF info to make this easier. Stephane, any updates on
> that?

Yeah, that would be interesting to know. I, for one, due to the c2c
effort + this other work Stephane mentioned some time ago, moved work
on such a pahole based tool to the back burner; there are always lots of
other patches to review, test, even proofread and then process :-\

- Arnaldo


Re: [PATCHv4 00/57] perf c2c: Add new tool to analyze cacheline contention on NUMA systems

2016-09-29 Thread Peter Zijlstra
On Thu, Sep 22, 2016 at 05:36:28PM +0200, Jiri Olsa wrote:
> hi,
> sending new version of c2c patches (v3) originally posted in here:
>   http://lwn.net/Articles/588866/
> 
> I took the old set and reworked it to fit into current upstream code.
> It follows the same logic as the original patch and provides (almost) the
> same stdio interface. In addition, a new TUI interface was added.
> 
> The perf c2c tool provides means for Shared Data C2C/HITM analysis.
> It allows you to track down the cacheline contentions. The tool is
> based on x86's load latency and precise store facility events provided
> by Intel CPUs.
> 
> The tool was tested by Joe Mario, has proven to be useful and found
> some cacheline contention. Joe also wrote a blog about the c2c tool, with
> examples, located here:
> 
>   https://joemario.github.io/blog/2016/09/01/c2c-blog/
> 
> v4 changes:
>   - 4 patches already queued
>   - used u32 for c2c_stats instead of int [Stanislav]
>   - fixed NO_SLANG=1 compilation [Kim]
>   - add __hist_entry__snprintf helper [Arnaldo]
> 
> Code is also available in:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
>   perf/c2c_v4
> 
> Testing:
>   $ perf c2c record -a [workload]
>   $ perf c2c report [--stdio]
>   $ man perf-c2c
> 
>   It's most likely you won't generate any remote HITMs on common
>   laptops, so to get results for local HITMs please use:
> 
>   $ perf c2c report -d lcl [--stdio]

I'll just keep repeating; this is not the tool I want :-( I'll not block
this tool, but I also think it's far less usable than it should've been.

  
https://lkml.kernel.org/r/20151209093402.gm6...@twins.programming.kicks-ass.net

What I want is a tool that maps memop events (any PEBS memops) back to a
'type::member' form and sorts on that. That doesn't rely on the PEBS
'Data Linear Address' field, as that is useless for dynamically
allocated bits. Instead it would use the IP and Dwarf information to
deduce the 'type::member' of the memop.

I want pahole like output, showing me where the hits (green) and misses
(red) are in a structure.

I want to be able to 'perf memops report -EC task_struct' and see the
expanded task_struct (as per 'pahole -EC task_struct') annotated, not a
data address for each task in my workload (which could be 100+ and
entirely useless).
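
A minimal sketch of the reporting half of that idea, assuming the hard part
(resolving each sample's IP to a 'type::member' via DWARF) has already been
done elsewhere. The sample tuples, the struct layout and the function names
below are all hypothetical; this is not an existing perf interface.

```python
from collections import Counter


def aggregate(samples):
    """Count hits and misses per (type, member).

    samples is an iterable of (type_name, member, is_miss) tuples, i.e.
    memop samples that were already resolved to a struct member.
    """
    stats = {}
    for type_name, member, is_miss in samples:
        counts = stats.setdefault((type_name, member), Counter())
        counts["misses" if is_miss else "hits"] += 1
    return stats


def report(layout, stats, type_name):
    """Render a pahole-like listing annotated with hit/miss counts.

    layout is a list of (member, offset, size) in declaration order.
    """
    lines = [f"struct {type_name} {{"]
    for member, offset, size in layout:
        counts = stats.get((type_name, member), Counter())
        lines.append(
            f"    {member:<12} /* {offset:5d} {size:5d} */"
            f"  hits={counts['hits']} misses={counts['misses']}"
        )
    lines.append("};")
    return lines


if __name__ == "__main__":
    # Hypothetical struct and samples, for illustration only.
    layout = [("lock", 0, 4), ("refcount", 4, 4), ("name", 8, 16)]
    samples = [
        ("foo", "lock", True),
        ("foo", "lock", True),
        ("foo", "refcount", False),
        ("foo", "lock", False),
    ]
    print("\n".join(report(layout, aggregate(samples), "foo")))
```

Because the key is the type and member rather than a data address, samples
from every instance of the struct fold into one line per member.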

Currently this is somewhat involved, since Dwarf doesn't include type
information for all memops, so we'd have to disassemble and interpret,
which while tedious is possible.

However, afaik, Stephane has been working with their tools team to get
additional DWARF info to make this easier. Stephane, any updates on
that?




[PATCHv4 00/57] perf c2c: Add new tool to analyze cacheline contention on NUMA systems

2016-09-22 Thread Jiri Olsa
hi,
sending new version of c2c patches (v3) originally posted in here:
  http://lwn.net/Articles/588866/

I took the old set and reworked it to fit into current upstream code.
It follows the same logic as the original patch and provides (almost) the
same stdio interface. In addition, a new TUI interface was added.

The perf c2c tool provides means for Shared Data C2C/HITM analysis.
It allows you to track down the cacheline contentions. The tool is
based on x86's load latency and precise store facility events provided
by Intel CPUs.
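
To illustrate the basic idea behind this kind of analysis (not the actual
builtin-c2c.c code), here is a minimal sketch that buckets sampled data
addresses by cacheline and ranks the cachelines by HITM count. It assumes
64-byte cachelines and a simplified, hypothetical sample format of
(data address, cpu, is_hitm).

```python
from collections import defaultdict


def hottest_cachelines(samples, line_size=64):
    """Rank cachelines by the number of HITM samples that hit them.

    samples is an iterable of (data_addr, cpu, is_hitm) tuples.
    Returns a list of (cacheline base, hitm count, sorted cpus), with the
    most contended cacheline first.
    """
    hitms = defaultdict(int)
    cpus = defaultdict(set)
    for addr, cpu, is_hitm in samples:
        line = addr & ~(line_size - 1)  # cacheline base address
        cpus[line].add(cpu)
        if is_hitm:
            hitms[line] += 1
    # Sort by HITM count (descending), then by address for stable output.
    ranked = sorted(cpus, key=lambda line: (-hitms[line], line))
    return [(line, hitms[line], sorted(cpus[line])) for line in ranked]


if __name__ == "__main__":
    # Hypothetical samples, for illustration only.
    samples = [
        (0x2008, 0, True),
        (0x2010, 1, True),
        (0x2040, 0, False),
        (0x2030, 2, False),
    ]
    for line, count, who in hottest_cachelines(samples):
        print(f"{line:#x}: {count} HITMs, accessed from cpus {who}")
```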

The tool was tested by Joe Mario, has proven to be useful and found
some cacheline contention. Joe also wrote a blog about the c2c tool, with
examples, located here:

  https://joemario.github.io/blog/2016/09/01/c2c-blog/

v4 changes:
  - 4 patches already queued
  - used u32 for c2c_stats instead of int [Stanislav]
  - fixed NO_SLANG=1 compilation [Kim]
  - add __hist_entry__snprintf helper [Arnaldo]

Code is also available in:
  git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
  perf/c2c_v4

Testing:
  $ perf c2c record -a [workload]
  $ perf c2c report [--stdio]
  $ man perf-c2c

  It's most likely you won't generate any remote HITMs on common
  laptops, so to get results for local HITMs please use:

  $ perf c2c report -d lcl [--stdio]

thanks,
jirka


Cc: "Michael Trapp" 
Cc: "Long, Wai Man" 
Cc: Stanislav Ievlev 
Cc: Kim Phillips 
---
Jiri Olsa (57):
  perf tools: Add __hist_entry__snprintf function
  perf tools: Introduce c2c_decode_stats function
  perf tools: Introduce c2c_add_stats function
  perf tools: Make reset_dimensions global
  perf tools: Make output_field_add and sort_dimension__add global
  perf tools: Make several sorting functions global
  perf tools: Make several display functions global
  perf tools: Make __hist_entry__snprintf function global
  perf tools: Make hists__fprintf_headers function global
  perf c2c: Add c2c command
  perf c2c: Add record subcommand
  perf c2c: Add report subcommand
  perf c2c report: Add dimension support
  perf c2c report: Add sort_entry dimension support
  perf c2c report: Fallback to standard dimensions
  perf c2c report: Add sample processing
  perf c2c report: Add cacheline hists processing
  perf c2c report: Decode c2c_stats for hist entries
  perf c2c report: Add header macros
  perf c2c report: Add dcacheline dimension key
  perf c2c report: Add offset dimension key
  perf c2c report: Add iaddr dimension key
  perf c2c report: Add hitm related dimension keys
  perf c2c report: Add stores related dimension keys
  perf c2c report: Add loads related dimension keys
  perf c2c report: Add llc and remote loads related dimension keys
  perf c2c report: Add llc load miss dimension key
  perf c2c report: Add total record sort key
  perf c2c report: Add total loads sort key
  perf c2c report: Add hitm percent sort key
  perf c2c report: Add hitm/store percent related sort keys
  perf c2c report: Add dram related sort keys
  perf c2c report: Add pid sort key
  perf c2c report: Add tid sort key
  perf c2c report: Add symbol and dso sort keys
  perf c2c report: Add node sort key
  perf c2c report: Add stats related sort keys
  perf c2c report: Add cpu cnt sort key
  perf c2c report: Add src line sort key
  perf c2c report: Setup number of header lines for hists
  perf c2c report: Set final resort fields
  perf c2c report: Add stdio output support
  perf c2c report: Add main browser
  perf c2c report: Add cacheline browser
  perf c2c report: Add global stats stdio output
  perf c2c report: Add shared cachelines stats stdio output
  perf c2c report: Add c2c related stats stdio output
  perf c2c report: Allow to report callchains
  perf c2c report: Limit the cachelines table entries
  perf c2c report: Add support to choose local HITMs
  perf c2c report: Allow to set cacheline sort fields
  perf c2c report: Recalc width of global sort entries
  perf c2c report: Add cacheline index entry
  perf c2c report: Add support to manage symbol name length
  perf c2c report: Iterate node display in browser
  perf c2c report: Add help windows
  perf c2c: Add man page and credits

 tools/perf/Build                      |    1 +
 tools/perf/Documentation/perf-c2c.txt |  276
 tools/perf/builtin-c2c.c              | 2742
 tools/perf/builtin.h                  |    1 +
 tools/perf/perf.c                     |    1 +
 tools/perf/ui/browsers/hists.c        |    4 +-
 tools/perf/ui/browsers/hists.h        |    1 +
 tools/perf/ui/hist.c                  |    2 +-
 tools/perf/ui/stdio/hist.c            |   12 +-
 tools/perf/util/hist.c                |    1 +
 tools/perf/util/hist.h                |    6 +
 tools/perf/util/mem-events.c          |  128 ++
 tools/perf/util/mem-events.h