Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry

2014-01-17 Thread Frederic Weisbecker
On Fri, Jan 17, 2014 at 04:56:18PM +0900, Namhyung Kim wrote:
> Hi Arnaldo and Frederic,
> 
> On Thu, 16 Jan 2014 17:47:34 -0200, Arnaldo Carvalho de Melo wrote:
> > Em Thu, Jan 16, 2014 at 06:34:58PM +0100, Frederic Weisbecker escreveu:
> >> On Thu, Jan 16, 2014 at 10:17:53AM +0900, Namhyung Kim wrote: 
> >> > I think if the sort key doesn't contain "symbol", unmatch case would be
> >> > increased as more various callchains would go into a same entry.
> >> 
> >> You mean -g fractal,0.5,callee,address ?
> >> 
> >> Hmm, actually I haven't seen much difference there.
> >
> > I guess he will, but will wait for Namhyung's final ack here, ok?
> 
> I meant sorting of hist entry like "-s cpu".   So as your number said
> it's not an issue IMHO.  Arnaldo, I'm good with this change please merge
> it. :)

Thanks a lot guys! :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry

2014-01-17 Thread Frederic Weisbecker
On Fri, Jan 17, 2014 at 04:56:18PM +0900, Namhyung Kim wrote:
 Hi Arnaldo and Frederic,
 
 On Thu, 16 Jan 2014 17:47:34 -0200, Arnaldo Carvalho de Melo wrote:
  Em Thu, Jan 16, 2014 at 06:34:58PM +0100, Frederic Weisbecker escreveu:
  On Thu, Jan 16, 2014 at 10:17:53AM +0900, Namhyung Kim wrote: 
   I think if the sort key doesn't contain symbol, unmatch case would be
   increased as more various callchains would go into a same entry.
  
  You mean -g fractal,0.5,callee,address ?
  
  Hmm, actually I haven't seen much difference there.
 
  I guess he will, but will wait for Namhyung's final ack here, ok?
 
 I meant sorting of hist entry like -s cpu.   So as your number said
 it's not an issue IMHO.  Arnaldo, I'm good with this change please merge
 it. :)

Thanks a lot guys! :)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry

2014-01-16 Thread Namhyung Kim
Hi Arnaldo and Frederic,

On Thu, 16 Jan 2014 17:47:34 -0200, Arnaldo Carvalho de Melo wrote:
> Em Thu, Jan 16, 2014 at 06:34:58PM +0100, Frederic Weisbecker escreveu:
>> On Thu, Jan 16, 2014 at 10:17:53AM +0900, Namhyung Kim wrote: 
>> > I think if the sort key doesn't contain "symbol", unmatch case would be
>> > increased as more various callchains would go into a same entry.
>> 
>> You mean -g fractal,0.5,callee,address ?
>> 
>> Hmm, actually I haven't seen much difference there.
>
> I guess he will, but will wait for Namhyung's final ack here, ok?

I meant sorting of hist entry like "-s cpu".   So as your number said
it's not an issue IMHO.  Arnaldo, I'm good with this change please merge
it. :)

Thanks,
Namhyung
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry

2014-01-16 Thread Arnaldo Carvalho de Melo
Em Thu, Jan 16, 2014 at 06:34:58PM +0100, Frederic Weisbecker escreveu:
> On Thu, Jan 16, 2014 at 10:17:53AM +0900, Namhyung Kim wrote: 
> > I think if the sort key doesn't contain "symbol", unmatch case would be
> > increased as more various callchains would go into a same entry.
> 
> You mean -g fractal,0.5,callee,address ?
> 
> Hmm, actually I haven't seen much difference there.

I guess he will, but will wait for Namhyung's final ack here, ok?

- Arnaldo
 
> > >
> > >> 
> > >> >
> > >> > This results in less comparisons performed by the CPU.
> > >> 
> > >> Do you have any numbers?  I suspect it'd not be a big change, but just
> > >> curious.
> > >
> > > So I compared before/after the patchset (which include the cursor restore 
> > > removal)
> > > with:
> > >
> > >   1) Some big hackbench-like load that generates > 200 MB perf.data
> > >
> > >   perf record -g -- perf bench sched messaging -l $SOME_BIG_NUMBER
> > >
> > >   2) Compare before/after with the following reports:
> > >
> > >   perf stat perf report --stdio > /dev/null
> > >   perf stat perf report --stdio -s sym > /dev/null
> > >   perf stat perf report --stdio -G > /dev/null
> > >   perf stat perf report --stdio -g fractal,0.5,caller,address > /dev/null 
> > >
> > > And most of the time I had < 0.01% difference on time completion in 
> > > favour of the patchset
> > > (which may be due to the removed cursor restore patch eventually).
> > >
> > > So, all in one, there was no real interesting difference. If you want the 
> > > true results I can definetly relaunch the tests.
> > 
> > So as an extreme case, could you please also test "-s cpu" case and
> > share the numbers?
> 
> There is indeed a tiny difference here.
> 
> Before the patchset:
> 
> fweisbec@Aivars:~/linux-2.6-tip/tools/perf$ sudo ./perf stat -r 20 ./perf 
> report --stdio -s cpu > /dev/null
> 
>  Performance counter stats for './perf report --stdio -s cpu' (20 runs):
> 
>3343,047232  task-clock (msec) #0,999 CPUs utilized
> ( +-  0,12% )
>  6  context-switches  #0,002 K/sec
> ( +-  3,82% )
>  0  cpu-migrations#0,000 K/sec
>   
>128 076  page-faults   #0,038 M/sec
> ( +-  0,00% )
> 13 044 840 323  cycles#3,902 GHz  
> ( +-  0,12% )
>  stalled-cycles-frontend  
>  stalled-cycles-backend   
> 16 341 506 514  instructions  #1,25  insns per cycle  
> ( +-  0,00% )
>  4 042 448 707  branches  # 1209,211 M/sec
> ( +-  0,00% )
> 26 819 441  branch-misses #0,66% of all branches  
> ( +-  0,09% )
> 
>3,345286450 seconds time elapsed   
>( +-  0,12% )
> 
> After the patchset:
> 
> fweisbec@Aivars:~/linux-2.6-tip/tools/perf$ sudo ./perf stat -r 20 ./perf 
> report --stdio -s cpu > /dev/null
> 
>  Performance counter stats for './perf report --stdio -s cpu' (20 runs):
> 
>3365,739972  task-clock (msec) #0,999 CPUs utilized
> ( +-  0,12% )
>  6  context-switches  #0,002 K/sec
> ( +-  2,99% )
>  0  cpu-migrations#0,000 K/sec
>   
>128 076  page-faults   #0,038 M/sec
> ( +-  0,00% )
> 13 133 593 870  cycles#3,902 GHz  
> ( +-  0,12% )
>  stalled-cycles-frontend  
>  stalled-cycles-backend   
> 16 626 286 378  instructions  #1,27  insns per cycle  
> ( +-  0,00% )
>  4 119 555 502  branches  # 1223,967 M/sec
> ( +-  0,00% )
> 28 687 283  branch-misses #0,70% of all branches  
> ( +-  0,09% )
> 
>3,367984867 seconds time elapsed   
>( +-  0,12% )
> 
> 
> Which makes about 0.6% difference on the overhead.
> Now it had less overhead in common cases (default sorting, -s sym, -G, 
> etc...).
> I guess it's not really worrysome, it's mostly unvisible at this scale.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry

2014-01-16 Thread Frederic Weisbecker
On Thu, Jan 16, 2014 at 10:17:53AM +0900, Namhyung Kim wrote: 
> I think if the sort key doesn't contain "symbol", unmatch case would be
> increased as more various callchains would go into a same entry.

You mean -g fractal,0.5,callee,address ?

Hmm, actually I haven't seen much difference there.

> >
> >> 
> >> >
> >> > This results in less comparisons performed by the CPU.
> >> 
> >> Do you have any numbers?  I suspect it'd not be a big change, but just
> >> curious.
> >
> > So I compared before/after the patchset (which include the cursor restore 
> > removal)
> > with:
> >
> > 1) Some big hackbench-like load that generates > 200 MB perf.data
> >
> > perf record -g -- perf bench sched messaging -l $SOME_BIG_NUMBER
> >
> > 2) Compare before/after with the following reports:
> >
> > perf stat perf report --stdio > /dev/null
> > perf stat perf report --stdio -s sym > /dev/null
> > perf stat perf report --stdio -G > /dev/null
> > perf stat perf report --stdio -g fractal,0.5,caller,address > /dev/null 
> >
> > And most of the time I had < 0.01% difference on time completion in favour 
> > of the patchset
> > (which may be due to the removed cursor restore patch eventually).
> >
> > So, all in one, there was no real interesting difference. If you want the 
> > true results I can definetly relaunch the tests.
> 
> So as an extreme case, could you please also test "-s cpu" case and
> share the numbers?

There is indeed a tiny difference here.

Before the patchset:

fweisbec@Aivars:~/linux-2.6-tip/tools/perf$ sudo ./perf stat -r 20 ./perf 
report --stdio -s cpu > /dev/null

 Performance counter stats for './perf report --stdio -s cpu' (20 runs):

   3343,047232  task-clock (msec) #0,999 CPUs utilized  
  ( +-  0,12% )
 6  context-switches  #0,002 K/sec  
  ( +-  3,82% )
 0  cpu-migrations#0,000 K/sec  

   128 076  page-faults   #0,038 M/sec  
  ( +-  0,00% )
13 044 840 323  cycles#3,902 GHz
  ( +-  0,12% )
 stalled-cycles-frontend  
 stalled-cycles-backend   
16 341 506 514  instructions  #1,25  insns per cycle
  ( +-  0,00% )
 4 042 448 707  branches  # 1209,211 M/sec  
  ( +-  0,00% )
26 819 441  branch-misses #0,66% of all branches
  ( +-  0,09% )

   3,345286450 seconds time elapsed 
 ( +-  0,12% )

After the patchset:

fweisbec@Aivars:~/linux-2.6-tip/tools/perf$ sudo ./perf stat -r 20 ./perf 
report --stdio -s cpu > /dev/null

 Performance counter stats for './perf report --stdio -s cpu' (20 runs):

   3365,739972  task-clock (msec) #0,999 CPUs utilized  
  ( +-  0,12% )
 6  context-switches  #0,002 K/sec  
  ( +-  2,99% )
 0  cpu-migrations#0,000 K/sec  

   128 076  page-faults   #0,038 M/sec  
  ( +-  0,00% )
13 133 593 870  cycles#3,902 GHz
  ( +-  0,12% )
 stalled-cycles-frontend  
 stalled-cycles-backend   
16 626 286 378  instructions  #1,27  insns per cycle
  ( +-  0,00% )
 4 119 555 502  branches  # 1223,967 M/sec  
  ( +-  0,00% )
28 687 283  branch-misses #0,70% of all branches
  ( +-  0,09% )

   3,367984867 seconds time elapsed 
 ( +-  0,12% )


Which makes about 0.6% difference on the overhead.
Now it had less overhead in common cases (default sorting, -s sym, -G, etc...).
I guess it's not really worrysome, it's mostly unvisible at this scale.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry

2014-01-16 Thread Frederic Weisbecker
On Thu, Jan 16, 2014 at 10:17:53AM +0900, Namhyung Kim wrote: 
 I think if the sort key doesn't contain symbol, unmatch case would be
 increased as more various callchains would go into a same entry.

You mean -g fractal,0.5,callee,address ?

Hmm, actually I haven't seen much difference there.

 
  
  
   This results in less comparisons performed by the CPU.
  
  Do you have any numbers?  I suspect it'd not be a big change, but just
  curious.
 
  So I compared before/after the patchset (which include the cursor restore 
  removal)
  with:
 
  1) Some big hackbench-like load that generates  200 MB perf.data
 
  perf record -g -- perf bench sched messaging -l $SOME_BIG_NUMBER
 
  2) Compare before/after with the following reports:
 
  perf stat perf report --stdio  /dev/null
  perf stat perf report --stdio -s sym  /dev/null
  perf stat perf report --stdio -G  /dev/null
  perf stat perf report --stdio -g fractal,0.5,caller,address  /dev/null 
 
  And most of the time I had  0.01% difference on time completion in favour 
  of the patchset
  (which may be due to the removed cursor restore patch eventually).
 
  So, all in one, there was no real interesting difference. If you want the 
  true results I can definetly relaunch the tests.
 
 So as an extreme case, could you please also test -s cpu case and
 share the numbers?

There is indeed a tiny difference here.

Before the patchset:

fweisbec@Aivars:~/linux-2.6-tip/tools/perf$ sudo ./perf stat -r 20 ./perf 
report --stdio -s cpu  /dev/null

 Performance counter stats for './perf report --stdio -s cpu' (20 runs):

   3343,047232  task-clock (msec) #0,999 CPUs utilized  
  ( +-  0,12% )
 6  context-switches  #0,002 K/sec  
  ( +-  3,82% )
 0  cpu-migrations#0,000 K/sec  

   128 076  page-faults   #0,038 M/sec  
  ( +-  0,00% )
13 044 840 323  cycles#3,902 GHz
  ( +-  0,12% )
   not supported  stalled-cycles-frontend  
   not supported  stalled-cycles-backend   
16 341 506 514  instructions  #1,25  insns per cycle
  ( +-  0,00% )
 4 042 448 707  branches  # 1209,211 M/sec  
  ( +-  0,00% )
26 819 441  branch-misses #0,66% of all branches
  ( +-  0,09% )

   3,345286450 seconds time elapsed 
 ( +-  0,12% )

After the patchset:

fweisbec@Aivars:~/linux-2.6-tip/tools/perf$ sudo ./perf stat -r 20 ./perf 
report --stdio -s cpu  /dev/null

 Performance counter stats for './perf report --stdio -s cpu' (20 runs):

   3365,739972  task-clock (msec) #0,999 CPUs utilized  
  ( +-  0,12% )
 6  context-switches  #0,002 K/sec  
  ( +-  2,99% )
 0  cpu-migrations#0,000 K/sec  

   128 076  page-faults   #0,038 M/sec  
  ( +-  0,00% )
13 133 593 870  cycles#3,902 GHz
  ( +-  0,12% )
   not supported  stalled-cycles-frontend  
   not supported  stalled-cycles-backend   
16 626 286 378  instructions  #1,27  insns per cycle
  ( +-  0,00% )
 4 119 555 502  branches  # 1223,967 M/sec  
  ( +-  0,00% )
28 687 283  branch-misses #0,70% of all branches
  ( +-  0,09% )

   3,367984867 seconds time elapsed 
 ( +-  0,12% )


Which makes about 0.6% difference on the overhead.
Now it had less overhead in common cases (default sorting, -s sym, -G, etc...).
I guess it's not really worrysome, it's mostly unvisible at this scale.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry

2014-01-16 Thread Arnaldo Carvalho de Melo
Em Thu, Jan 16, 2014 at 06:34:58PM +0100, Frederic Weisbecker escreveu:
 On Thu, Jan 16, 2014 at 10:17:53AM +0900, Namhyung Kim wrote: 
  I think if the sort key doesn't contain symbol, unmatch case would be
  increased as more various callchains would go into a same entry.
 
 You mean -g fractal,0.5,callee,address ?
 
 Hmm, actually I haven't seen much difference there.

I guess he will, but will wait for Namhyung's final ack here, ok?

- Arnaldo
 
  
   
   
This results in less comparisons performed by the CPU.
   
   Do you have any numbers?  I suspect it'd not be a big change, but just
   curious.
  
   So I compared before/after the patchset (which include the cursor restore 
   removal)
   with:
  
 1) Some big hackbench-like load that generates  200 MB perf.data
  
 perf record -g -- perf bench sched messaging -l $SOME_BIG_NUMBER
  
 2) Compare before/after with the following reports:
  
 perf stat perf report --stdio  /dev/null
 perf stat perf report --stdio -s sym  /dev/null
 perf stat perf report --stdio -G  /dev/null
 perf stat perf report --stdio -g fractal,0.5,caller,address  /dev/null 
  
   And most of the time I had  0.01% difference on time completion in 
   favour of the patchset
   (which may be due to the removed cursor restore patch eventually).
  
   So, all in one, there was no real interesting difference. If you want the 
   true results I can definetly relaunch the tests.
  
  So as an extreme case, could you please also test -s cpu case and
  share the numbers?
 
 There is indeed a tiny difference here.
 
 Before the patchset:
 
 fweisbec@Aivars:~/linux-2.6-tip/tools/perf$ sudo ./perf stat -r 20 ./perf 
 report --stdio -s cpu  /dev/null
 
  Performance counter stats for './perf report --stdio -s cpu' (20 runs):
 
3343,047232  task-clock (msec) #0,999 CPUs utilized
 ( +-  0,12% )
  6  context-switches  #0,002 K/sec
 ( +-  3,82% )
  0  cpu-migrations#0,000 K/sec
   
128 076  page-faults   #0,038 M/sec
 ( +-  0,00% )
 13 044 840 323  cycles#3,902 GHz  
 ( +-  0,12% )
not supported  stalled-cycles-frontend  
not supported  stalled-cycles-backend   
 16 341 506 514  instructions  #1,25  insns per cycle  
 ( +-  0,00% )
  4 042 448 707  branches  # 1209,211 M/sec
 ( +-  0,00% )
 26 819 441  branch-misses #0,66% of all branches  
 ( +-  0,09% )
 
3,345286450 seconds time elapsed   
( +-  0,12% )
 
 After the patchset:
 
 fweisbec@Aivars:~/linux-2.6-tip/tools/perf$ sudo ./perf stat -r 20 ./perf 
 report --stdio -s cpu  /dev/null
 
  Performance counter stats for './perf report --stdio -s cpu' (20 runs):
 
3365,739972  task-clock (msec) #0,999 CPUs utilized
 ( +-  0,12% )
  6  context-switches  #0,002 K/sec
 ( +-  2,99% )
  0  cpu-migrations#0,000 K/sec
   
128 076  page-faults   #0,038 M/sec
 ( +-  0,00% )
 13 133 593 870  cycles#3,902 GHz  
 ( +-  0,12% )
not supported  stalled-cycles-frontend  
not supported  stalled-cycles-backend   
 16 626 286 378  instructions  #1,27  insns per cycle  
 ( +-  0,00% )
  4 119 555 502  branches  # 1223,967 M/sec
 ( +-  0,00% )
 28 687 283  branch-misses #0,70% of all branches  
 ( +-  0,09% )
 
3,367984867 seconds time elapsed   
( +-  0,12% )
 
 
 Which makes about 0.6% difference on the overhead.
 Now it had less overhead in common cases (default sorting, -s sym, -G, 
 etc...).
 I guess it's not really worrysome, it's mostly unvisible at this scale.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry

2014-01-16 Thread Namhyung Kim
Hi Arnaldo and Frederic,

On Thu, 16 Jan 2014 17:47:34 -0200, Arnaldo Carvalho de Melo wrote:
 Em Thu, Jan 16, 2014 at 06:34:58PM +0100, Frederic Weisbecker escreveu:
 On Thu, Jan 16, 2014 at 10:17:53AM +0900, Namhyung Kim wrote: 
  I think if the sort key doesn't contain symbol, unmatch case would be
  increased as more various callchains would go into a same entry.
 
 You mean -g fractal,0.5,callee,address ?
 
 Hmm, actually I haven't seen much difference there.

 I guess he will, but will wait for Namhyung's final ack here, ok?

I meant sorting of hist entry like -s cpu.   So as your number said
it's not an issue IMHO.  Arnaldo, I'm good with this change please merge
it. :)

Thanks,
Namhyung
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry

2014-01-15 Thread Namhyung Kim
Hi Frederic,

On Wed, 15 Jan 2014 17:59:30 +0100, Frederic Weisbecker wrote:
> On Wed, Jan 15, 2014 at 03:23:46PM +0900, Namhyung Kim wrote:
>> On Tue, 14 Jan 2014 16:37:15 +0100, Frederic Weisbecker wrote:
>> > When a new callchain child branch matches an existing one in the rbtree,
>> > the comparison of its first entry is performed twice:
>> >
>> > 1) From append_chain_children() on branch lookup
>> >
>> > 2) If 1) reports a match, append_chain() then compares all entries of
>> > the new branch against the matching node in the rbtree, and this
>> > comparison includes the first entry of the new branch again.
>> 
>> Right.
>> 
>> >
>> > Lets shortcut this by performing the whole comparison only from
>> > append_chain() which then returns the result of the comparison between
>> > the first entry of the new branch and the iterating node in the rbtree.
>> > If the first entry matches, the lookup on the current level of siblings
>> > stops and propagates to the children of the matching nodes.
>> 
>> Hmm..  it looks like that I thought directly calling append_chain() has
>> some overhead - but it's not.
>
> No that's a right concern. I worried as well because I wasn't sure if there
> is more match than unmatch on the first entry. I'd tend to think that the 
> first
> entry endures unmatches most often, in which case calling match_chain() first
> may be more efficient as a fast path (ie: calling append_chain() involves
> one more function call and a few other details).
>
> But eventually measurement hasn't shown significant difference before and
> after the patch.

I think if the sort key doesn't contain "symbol", unmatch case would be
increased as more various callchains would go into a same entry.

>
>> 
>> >
>> > This results in less comparisons performed by the CPU.
>> 
>> Do you have any numbers?  I suspect it'd not be a big change, but just
>> curious.
>
> So I compared before/after the patchset (which include the cursor restore 
> removal)
> with:
>
>   1) Some big hackbench-like load that generates > 200 MB perf.data
>
>   perf record -g -- perf bench sched messaging -l $SOME_BIG_NUMBER
>
>   2) Compare before/after with the following reports:
>
>   perf stat perf report --stdio > /dev/null
>   perf stat perf report --stdio -s sym > /dev/null
>   perf stat perf report --stdio -G > /dev/null
>   perf stat perf report --stdio -g fractal,0.5,caller,address > /dev/null 
>
> And most of the time I had < 0.01% difference on time completion in favour of 
> the patchset
> (which may be due to the removed cursor restore patch eventually).
>
> So, all in one, there was no real interesting difference. If you want the 
> true results I can definetly relaunch the tests.

So as an extreme case, could you please also test "-s cpu" case and
share the numbers?

Thanks,
Namhyung
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry

2014-01-15 Thread Frederic Weisbecker
On Wed, Jan 15, 2014 at 03:23:46PM +0900, Namhyung Kim wrote:
> On Tue, 14 Jan 2014 16:37:15 +0100, Frederic Weisbecker wrote:
> > When a new callchain child branch matches an existing one in the rbtree,
> > the comparison of its first entry is performed twice:
> >
> > 1) From append_chain_children() on branch lookup
> >
> > 2) If 1) reports a match, append_chain() then compares all entries of
> > the new branch against the matching node in the rbtree, and this
> > comparison includes the first entry of the new branch again.
> 
> Right.
> 
> >
> > Lets shortcut this by performing the whole comparison only from
> > append_chain() which then returns the result of the comparison between
> > the first entry of the new branch and the iterating node in the rbtree.
> > If the first entry matches, the lookup on the current level of siblings
> > stops and propagates to the children of the matching nodes.
> 
> Hmm..  it looks like that I thought directly calling append_chain() has
> some overhead - but it's not.

No that's a right concern. I worried as well because I wasn't sure if there
is more match than unmatch on the first entry. I'd tend to think that the first
entry endures unmatches most often, in which case calling match_chain() first
may be more efficient as a fast path (ie: calling append_chain() involves
one more function call and a few other details).

But eventually measurement hasn't shown significant difference before and
after the patch.

> 
> >
> > This results in less comparisons performed by the CPU.
> 
> Do you have any numbers?  I suspect it'd not be a big change, but just
> curious.

So I compared before/after the patchset (which include the cursor restore 
removal)
with:

1) Some big hackbench-like load that generates > 200 MB perf.data

perf record -g -- perf bench sched messaging -l $SOME_BIG_NUMBER

2) Compare before/after with the following reports:

perf stat perf report --stdio > /dev/null
perf stat perf report --stdio -s sym > /dev/null
perf stat perf report --stdio -G > /dev/null
perf stat perf report --stdio -g fractal,0.5,caller,address > /dev/null 

And most of the time I had < 0.01% difference on time completion in favour of 
the patchset
(which may be due to the removed cursor restore patch eventually).

So, all in one, there was no real interesting difference. If you want the true 
results I can definetly relaunch the tests.

> >
> > Signed-off-by: Frederic Weisbecker 
> 
> Reviewed-by: Namhyung Kim 

Thanks!

> 
> Thanks,
> Namhyung
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry

2014-01-15 Thread Frederic Weisbecker
On Wed, Jan 15, 2014 at 03:23:46PM +0900, Namhyung Kim wrote:
 On Tue, 14 Jan 2014 16:37:15 +0100, Frederic Weisbecker wrote:
  When a new callchain child branch matches an existing one in the rbtree,
  the comparison of its first entry is performed twice:
 
  1) From append_chain_children() on branch lookup
 
  2) If 1) reports a match, append_chain() then compares all entries of
  the new branch against the matching node in the rbtree, and this
  comparison includes the first entry of the new branch again.
 
 Right.
 
 
  Lets shortcut this by performing the whole comparison only from
  append_chain() which then returns the result of the comparison between
  the first entry of the new branch and the iterating node in the rbtree.
  If the first entry matches, the lookup on the current level of siblings
  stops and propagates to the children of the matching nodes.
 
 Hmm..  it looks like that I thought directly calling append_chain() has
 some overhead - but it's not.

No that's a right concern. I worried as well because I wasn't sure if there
is more match than unmatch on the first entry. I'd tend to think that the first
entry endures unmatches most often, in which case calling match_chain() first
may be more efficient as a fast path (ie: calling append_chain() involves
one more function call and a few other details).

But eventually measurement hasn't shown significant difference before and
after the patch.

 
 
  This results in less comparisons performed by the CPU.
 
 Do you have any numbers?  I suspect it'd not be a big change, but just
 curious.

So I compared before/after the patchset (which include the cursor restore 
removal)
with:

1) Some big hackbench-like load that generates  200 MB perf.data

perf record -g -- perf bench sched messaging -l $SOME_BIG_NUMBER

2) Compare before/after with the following reports:

perf stat perf report --stdio  /dev/null
perf stat perf report --stdio -s sym  /dev/null
perf stat perf report --stdio -G  /dev/null
perf stat perf report --stdio -g fractal,0.5,caller,address  /dev/null 

And most of the time I had  0.01% difference on time completion in favour of 
the patchset
(which may be due to the removed cursor restore patch eventually).

So, all in one, there was no real interesting difference. If you want the true 
results I can definetly relaunch the tests.

 
  Signed-off-by: Frederic Weisbecker fweis...@gmail.com
 
 Reviewed-by: Namhyung Kim namhy...@kernel.org

Thanks!

 
 Thanks,
 Namhyung
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry

2014-01-15 Thread Namhyung Kim
Hi Frederic,

On Wed, 15 Jan 2014 17:59:30 +0100, Frederic Weisbecker wrote:
 On Wed, Jan 15, 2014 at 03:23:46PM +0900, Namhyung Kim wrote:
 On Tue, 14 Jan 2014 16:37:15 +0100, Frederic Weisbecker wrote:
  When a new callchain child branch matches an existing one in the rbtree,
  the comparison of its first entry is performed twice:
 
  1) From append_chain_children() on branch lookup
 
  2) If 1) reports a match, append_chain() then compares all entries of
  the new branch against the matching node in the rbtree, and this
  comparison includes the first entry of the new branch again.
 
 Right.
 
 
  Lets shortcut this by performing the whole comparison only from
  append_chain() which then returns the result of the comparison between
  the first entry of the new branch and the iterating node in the rbtree.
  If the first entry matches, the lookup on the current level of siblings
  stops and propagates to the children of the matching nodes.
 
 Hmm..  it looks like that I thought directly calling append_chain() has
 some overhead - but it's not.

 No that's a right concern. I worried as well because I wasn't sure if there
 is more match than unmatch on the first entry. I'd tend to think that the 
 first
 entry endures unmatches most often, in which case calling match_chain() first
 may be more efficient as a fast path (ie: calling append_chain() involves
 one more function call and a few other details).

 But eventually measurement hasn't shown significant difference before and
 after the patch.

I think if the sort key doesn't contain symbol, unmatch case would be
increased as more various callchains would go into a same entry.


 
 
  This results in less comparisons performed by the CPU.
 
 Do you have any numbers?  I suspect it'd not be a big change, but just
 curious.

 So I compared before/after the patchset (which include the cursor restore 
 removal)
 with:

   1) Some big hackbench-like load that generates  200 MB perf.data

   perf record -g -- perf bench sched messaging -l $SOME_BIG_NUMBER

   2) Compare before/after with the following reports:

   perf stat perf report --stdio  /dev/null
   perf stat perf report --stdio -s sym  /dev/null
   perf stat perf report --stdio -G  /dev/null
   perf stat perf report --stdio -g fractal,0.5,caller,address  /dev/null 

 And most of the time I had  0.01% difference on time completion in favour of 
 the patchset
 (which may be due to the removed cursor restore patch eventually).

 So, all in one, there was no real interesting difference. If you want the 
 true results I can definetly relaunch the tests.

So as an extreme case, could you please also test -s cpu case and
share the numbers?

Thanks,
Namhyung
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry

2014-01-14 Thread Namhyung Kim
On Tue, 14 Jan 2014 16:37:15 +0100, Frederic Weisbecker wrote:
> When a new callchain child branch matches an existing one in the rbtree,
> the comparison of its first entry is performed twice:
>
> 1) From append_chain_children() on branch lookup
>
> 2) If 1) reports a match, append_chain() then compares all entries of
> the new branch against the matching node in the rbtree, and this
> comparison includes the first entry of the new branch again.

Right.

>
> Lets shortcut this by performing the whole comparison only from
> append_chain() which then returns the result of the comparison between
> the first entry of the new branch and the iterating node in the rbtree.
> If the first entry matches, the lookup on the current level of siblings
> stops and propagates to the children of the matching nodes.

Hmm..  it looks like that I thought directly calling append_chain() has
some overhead - but it's not.

>
> This results in less comparisons performed by the CPU.

Do you have any numbers?  I suspect it'd not be a big change, but just
curious.

>
> Signed-off-by: Frederic Weisbecker 

Reviewed-by: Namhyung Kim 

Thanks,
Namhyung
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry

2014-01-14 Thread Namhyung Kim
On Tue, 14 Jan 2014 16:37:15 +0100, Frederic Weisbecker wrote:
 When a new callchain child branch matches an existing one in the rbtree,
 the comparison of its first entry is performed twice:

 1) From append_chain_children() on branch lookup

 2) If 1) reports a match, append_chain() then compares all entries of
 the new branch against the matching node in the rbtree, and this
 comparison includes the first entry of the new branch again.

Right.


 Lets shortcut this by performing the whole comparison only from
 append_chain() which then returns the result of the comparison between
 the first entry of the new branch and the iterating node in the rbtree.
 If the first entry matches, the lookup on the current level of siblings
 stops and propagates to the children of the matching nodes.

Hmm..  it looks like that I thought directly calling append_chain() has
some overhead - but it's not.


 This results in less comparisons performed by the CPU.

Do you have any numbers?  I suspect it'd not be a big change, but just
curious.


 Signed-off-by: Frederic Weisbecker fweis...@gmail.com

Reviewed-by: Namhyung Kim namhy...@kernel.org

Thanks,
Namhyung
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/