Re: [PATCHv4 00/57] perf c2c: Add new tool to analyze cacheline contention on NUMA systems
On 09/29/2016 05:19 AM, Peter Zijlstra wrote:
> What I want is a tool that maps memop events (any PEBS memops) back to a
> 'type::member' form and sorts on that. That doesn't rely on the PEBS
> 'Data Linear Address' field, as that is useless for dynamically
> allocated bits. Instead it would use the IP and Dwarf information to
> deduce the 'type::member' of the memop.
>
> I want pahole like output, showing me where the hits (green) and misses
> (red) are in a structure.

I agree that would give valuable insight, but it needs to be in addition to what this c2c provides today, and not a replacement for it.

Ten years ago Robert Hundt created that pahole-style output as a developer option to the HP-UX compiler. It used compiler feedback to compute every struct accessed by the application, with exact counts for all reads and writes to every struct member. It even had affinity information to show how often field members were accessed together in time. He and I ran it on numerous large applications. It was awesome, but it did fall short in a few places that Jiri's c2c patches cover, such as being able to:

- distinguish where the concurrent cacheline accesses came from (e.g., which cores and which nodes).
- see where the loads got resolved from (local cache, local memory, remote cache, remote memory).
- see if the hot structs were cacheline aligned or not.
- see if more than one hot struct shares a cacheline.
- see how costly, via load latencies, the contention is.
- see, among all the accesses to a cacheline, which thread or process is causing the most harm.
- gain insight into how many other threads/processes are contending for a cacheline (and who they are).

The above info has been critical to understanding how best to tackle the contention uncovered for all those who have used the "perf c2c" prototype.

So yes, the pahole-style addition would be a plus and it would make it easier to map it back to the struct, but make sure to preserve what the current "perf c2c" provides that the pahole-style output will not.

Joe
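To make two of the items in the list above concrete (whether a hot struct is cacheline aligned, and whether more than one hot struct shares a cacheline), here is a minimal Python sketch. It is not how perf c2c is implemented; the 64-byte line size, the object names, and the addresses are all assumptions chosen for illustration.

```python
from collections import defaultdict

CACHELINE_SIZE = 64  # assumption: 64-byte cachelines, typical on x86


def cacheline_of(addr, line_size=CACHELINE_SIZE):
    """Return (cacheline base address, offset within the line) for an address."""
    return addr & ~(line_size - 1), addr & (line_size - 1)


def is_cacheline_aligned(addr, line_size=CACHELINE_SIZE):
    """True if the object starts exactly on a cacheline boundary."""
    return addr % line_size == 0


def lines_spanned(addr, size, line_size=CACHELINE_SIZE):
    """Base addresses of every cacheline touched by [addr, addr + size)."""
    first = addr // line_size
    last = (addr + size - 1) // line_size
    return [n * line_size for n in range(first, last + 1)]


def shared_lines(objects, line_size=CACHELINE_SIZE):
    """Map cacheline base -> object names, keeping only lines with >1 object."""
    by_line = defaultdict(list)
    for name, (addr, size) in objects.items():
        for base in lines_spanned(addr, size, line_size):
            by_line[base].append(name)
    return {base: names for base, names in by_line.items() if len(names) > 1}


# Hypothetical objects: name -> (address, size in bytes)
objects = {
    "lock_a": (0x1000, 8),
    "counter_b": (0x1008, 8),   # lands in the same line as lock_a
    "buf_c": (0x1040, 64),      # aligned, fills exactly one line
    "stats_d": (0x10B0, 32),    # unaligned, straddles two lines
}

for base, names in shared_lines(objects).items():
    print(f"cacheline {base:#x} is shared by: {', '.join(names)}")
```

With these made-up addresses, `lock_a` and `counter_b` land in the same line, which is the layout where writes to one can slow down readers of the other.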
Re: [PATCHv4 00/57] perf c2c: Add new tool to analyze cacheline contention on NUMA systems
On Thu, Sep 29, 2016 at 11:19:12AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 22, 2016 at 05:36:28PM +0200, Jiri Olsa wrote:
> > sending new version of c2c patches (v3) originally posted in here:
> >   http://lwn.net/Articles/588866/

> I'll just keep repeating; this is not the tool I want :-( I'll not block
> this tool, but I also think it's far less usable than it should've been.

Well, I think it's an experiment in using that info, one that people have been using and seemingly finding and fixing problems with. It requires more work than the way you have described various times, though, indeed. :-\

> https://lkml.kernel.org/r/20151209093402.gm6...@twins.programming.kicks-ass.net
>
> What I want is a tool that maps memop events (any PEBS memops) back to a
> 'type::member' form and sorts on that. That doesn't rely on the PEBS
> 'Data Linear Address' field, as that is useless for dynamically
> allocated bits. Instead it would use the IP and Dwarf information to
> deduce the 'type::member' of the memop.
>
> I want pahole like output, showing me where the hits (green) and misses
> (red) are in a structure.
>
> I want to be able to 'perf memops report -EC task_struct' and see the
> expanded task_struct (as per 'pahole -EC task_struct') annotated, not a
> data address for each task in my workload (which could be 100+ and
> entirely useless).
>
> Currently this is somewhat involved, since Dwarf doesn't include type
> information for all memops, so we'd have to disassemble and interpret,
> which while tedious is possible.
>
> However, afaik, Stephane has been working with their tools team to get
> additional DWARF info to make this easier. Stephane, any updates on
> that?

Yeah, that would be interesting to know. I, for one, due to the c2c effort plus this other work Stephane mentioned some time ago, moved work on such a pahole-based tool to the back burner; there are lots of other patches to review, test, and even proofread before processing, all the time :-\

- Arnaldo
Re: [PATCHv4 00/57] perf c2c: Add new tool to analyze cacheline contention on NUMA systems
On Thu, Sep 22, 2016 at 05:36:28PM +0200, Jiri Olsa wrote:
> hi,
> sending new version of c2c patches (v3) originally posted in here:
>   http://lwn.net/Articles/588866/
>
> I took the old set and reworked it to fit into current upstream code.
> It follows the same logic as original patch and provides (almost) the
> same stdio interface. In addition new TUI interface was added.
>
> The perf c2c tool provides means for Shared Data C2C/HITM analysis.
> It allows you to track down the cacheline contentions. The tool is
> based on x86's load latency and precise store facility events provided
> by Intel CPUs.
>
> The tool was tested by Joe Mario and has proven to be useful and found
> some cachelines contentions. Joe also wrote a blog about c2c tool with
> examples located in here:
>
>   https://joemario.github.io/blog/2016/09/01/c2c-blog/
>
> v4 changes:
>   - 4 patches already queued
>   - used u32 for c2c_stats instead of int [Stanislav]
>   - fixed NO_SLANG=1 compilation [Kim]
>   - add __hist_entry__snprintf helper [Arnaldo]
>
> Code is also available in:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
>   perf/c2c_v4
>
> Testing:
>   $ perf c2c record -a [workload]
>   $ perf c2c report [--stdio]
>   $ man perf-c2c
>
> It's most likely you won't generate any remote HITMs on common
> laptops, so to get results for local HITMs please use:
>
>   $ perf c2c report -d lcl [--stdio]

I'll just keep repeating; this is not the tool I want :-( I'll not block this tool, but I also think it's far less usable than it should've been.

https://lkml.kernel.org/r/20151209093402.gm6...@twins.programming.kicks-ass.net

What I want is a tool that maps memop events (any PEBS memops) back to a 'type::member' form and sorts on that. That doesn't rely on the PEBS 'Data Linear Address' field, as that is useless for dynamically allocated bits. Instead it would use the IP and Dwarf information to deduce the 'type::member' of the memop.

I want pahole-like output, showing me where the hits (green) and misses (red) are in a structure.

I want to be able to 'perf memops report -EC task_struct' and see the expanded task_struct (as per 'pahole -EC task_struct') annotated, not a data address for each task in my workload (which could be 100+ and entirely useless).

Currently this is somewhat involved, since Dwarf doesn't include type information for all memops, so we'd have to disassemble and interpret, which, while tedious, is possible.

However, afaik, Stephane has been working with their tools team to get additional DWARF info to make this easier. Stephane, any updates on that?
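As a rough illustration of the 'type::member' aggregation described above, the Python sketch below folds samples into per-member hit and miss counts and prints them in a pahole-like layout. It deliberately skips the hard part (deriving the type and offset from the IP plus DWARF) and assumes each sample already arrives as an offset into the struct. The `Task` struct, its fields, and the sample data are hypothetical, and 'perf memops' remains a proposed command, not an existing one.

```python
import ctypes
from collections import Counter


class Task(ctypes.Structure):
    """Hypothetical stand-in for a real kernel struct; fields are made up."""
    _fields_ = [
        ("state", ctypes.c_uint64),
        ("flags", ctypes.c_uint32),
        ("prio", ctypes.c_int32),
        ("usage", ctypes.c_uint64),
    ]


def member_at(struct_type, offset):
    """Return the name of the member covering this byte offset, or None."""
    for name, _ctype in struct_type._fields_:
        desc = getattr(struct_type, name)  # field descriptor: .offset, .size
        if desc.offset <= offset < desc.offset + desc.size:
            return name
    return None  # padding or out of range


def report(struct_type, samples):
    """Aggregate (offset, is_hit) samples into rows, one per struct member.

    Each row is (offset, size, 'type::member', hits, misses).
    """
    hits, misses = Counter(), Counter()
    for offset, is_hit in samples:
        name = member_at(struct_type, offset)
        if name is None:
            continue  # sample does not land on a named member
        key = f"{struct_type.__name__}::{name}"
        (hits if is_hit else misses)[key] += 1

    rows = []
    for name, _ctype in struct_type._fields_:
        desc = getattr(struct_type, name)
        key = f"{struct_type.__name__}::{name}"
        rows.append((desc.offset, desc.size, key, hits[key], misses[key]))
    return rows


# Hypothetical samples: (byte offset into the struct, cache hit?)
samples = [(0, True), (4, False), (8, True), (12, False), (12, False), (16, True)]

for offset, size, key, n_hits, n_misses in report(Task, samples):
    print(f"/* {offset:4d} {size:4d} */  {key:<14} hits={n_hits} misses={n_misses}")
```

Because every instance of the struct maps onto the same member names, the report stays the same size no matter how many tasks the workload creates, which is the point of sorting on 'type::member' rather than on data addresses.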
[PATCHv4 00/57] perf c2c: Add new tool to analyze cacheline contention on NUMA systems
hi,
sending new version of c2c patches (v3) originally posted in here:
  http://lwn.net/Articles/588866/

I took the old set and reworked it to fit into current upstream code.
It follows the same logic as the original patch and provides (almost) the
same stdio interface. In addition, a new TUI interface was added.

The perf c2c tool provides means for Shared Data C2C/HITM analysis.
It allows you to track down cacheline contention. The tool is
based on x86's load latency and precise store facility events provided
by Intel CPUs.

The tool was tested by Joe Mario; it has proven to be useful and found
some cacheline contention. Joe also wrote a blog post about the c2c tool,
with examples, located here:

  https://joemario.github.io/blog/2016/09/01/c2c-blog/

v4 changes:
  - 4 patches already queued
  - used u32 for c2c_stats instead of int [Stanislav]
  - fixed NO_SLANG=1 compilation [Kim]
  - add __hist_entry__snprintf helper [Arnaldo]

Code is also available in:
  git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
  perf/c2c_v4

Testing:
  $ perf c2c record -a [workload]
  $ perf c2c report [--stdio]
  $ man perf-c2c

It's most likely you won't generate any remote HITMs on common
laptops, so to get results for local HITMs please use:

  $ perf c2c report -d lcl [--stdio]

thanks,
jirka

Cc: "Michael Trapp"
Cc: "Long, Wai Man"
Cc: Stanislav Ievlev
Cc: Kim Phillips
---
Jiri Olsa (57):
  perf tools: Add __hist_entry__snprintf function
  perf tools: Introduce c2c_decode_stats function
  perf tools: Introduce c2c_add_stats function
  perf tools: Make reset_dimensions global
  perf tools: Make output_field_add and sort_dimension__add global
  perf tools: Make several sorting functions global
  perf tools: Make several display functions global
  perf tools: Make __hist_entry__snprintf function global
  perf tools: Make hists__fprintf_headers function global
  perf c2c: Add c2c command
  perf c2c: Add record subcommand
  perf c2c: Add report subcommand
  perf c2c report: Add dimension support
  perf c2c report: Add sort_entry dimension support
  perf c2c report: Fallback to standard dimensions
  perf c2c report: Add sample processing
  perf c2c report: Add cacheline hists processing
  perf c2c report: Decode c2c_stats for hist entries
  perf c2c report: Add header macros
  perf c2c report: Add dcacheline dimension key
  perf c2c report: Add offset dimension key
  perf c2c report: Add iaddr dimension key
  perf c2c report: Add hitm related dimension keys
  perf c2c report: Add stores related dimension keys
  perf c2c report: Add loads related dimension keys
  perf c2c report: Add llc and remote loads related dimension keys
  perf c2c report: Add llc load miss dimension key
  perf c2c report: Add total record sort key
  perf c2c report: Add total loads sort key
  perf c2c report: Add hitm percent sort key
  perf c2c report: Add hitm/store percent related sort keys
  perf c2c report: Add dram related sort keys
  perf c2c report: Add pid sort key
  perf c2c report: Add tid sort key
  perf c2c report: Add symbol and dso sort keys
  perf c2c report: Add node sort key
  perf c2c report: Add stats related sort keys
  perf c2c report: Add cpu cnt sort key
  perf c2c report: Add src line sort key
  perf c2c report: Setup number of header lines for hists
  perf c2c report: Set final resort fields
  perf c2c report: Add stdio output support
  perf c2c report: Add main browser
  perf c2c report: Add cacheline browser
  perf c2c report: Add global stats stdio output
  perf c2c report: Add shared cachelines stats stdio output
  perf c2c report: Add c2c related stats stdio output
  perf c2c report: Allow to report callchains
  perf c2c report: Limit the cachelines table entries
  perf c2c report: Add support to choose local HITMs
  perf c2c report: Allow to set cacheline sort fields
  perf c2c report: Recalc width of global sort entries
  perf c2c report: Add cacheline index entry
  perf c2c report: Add support to manage symbol name length
  perf c2c report: Iterate node display in browser
  perf c2c report: Add help windows
  perf c2c: Add man page and credits

 tools/perf/Build                      |    1 +
 tools/perf/Documentation/perf-c2c.txt |  276
 tools/perf/builtin-c2c.c              | 2742 +
 tools/perf/builtin.h                  |    1 +
 tools/perf/perf.c                     |    1 +
 tools/perf/ui/browsers/hists.c        |    4 +-
 tools/perf/ui/browsers/hists.h        |    1 +
 tools/perf/ui/hist.c                  |    2 +-
 tools/perf/ui/stdio/hist.c            |   12 +-
 tools/perf/util/hist.c                |    1 +
 tools/perf/util/hist.h                |    6 +
 tools/perf/util/mem-events.c          |  128 ++
 tools/perf/util/mem-events.h