Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry
On Fri, Jan 17, 2014 at 04:56:18PM +0900, Namhyung Kim wrote: > Hi Arnaldo and Frederic, > > On Thu, 16 Jan 2014 17:47:34 -0200, Arnaldo Carvalho de Melo wrote: > > Em Thu, Jan 16, 2014 at 06:34:58PM +0100, Frederic Weisbecker escreveu: > >> On Thu, Jan 16, 2014 at 10:17:53AM +0900, Namhyung Kim wrote: > >> > I think if the sort key doesn't contain "symbol", unmatch case would be > >> > increased as more various callchains would go into a same entry. > >> > >> You mean -g fractal,0.5,callee,address ? > >> > >> Hmm, actually I haven't seen much difference there. > > > > I guess he will, but will wait for Namhyung's final ack here, ok? > > I meant sorting of hist entry like "-s cpu". So as your number said > it's not an issue IMHO. Arnaldo, I'm good with this change please merge > it. :) Thanks a lot guys! :) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry
On Fri, Jan 17, 2014 at 04:56:18PM +0900, Namhyung Kim wrote: Hi Arnaldo and Frederic, On Thu, 16 Jan 2014 17:47:34 -0200, Arnaldo Carvalho de Melo wrote: Em Thu, Jan 16, 2014 at 06:34:58PM +0100, Frederic Weisbecker escreveu: On Thu, Jan 16, 2014 at 10:17:53AM +0900, Namhyung Kim wrote: I think if the sort key doesn't contain symbol, unmatch case would be increased as more various callchains would go into a same entry. You mean -g fractal,0.5,callee,address ? Hmm, actually I haven't seen much difference there. I guess he will, but will wait for Namhyung's final ack here, ok? I meant sorting of hist entry like -s cpu. So as your number said it's not an issue IMHO. Arnaldo, I'm good with this change please merge it. :) Thanks a lot guys! :) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry
Hi Arnaldo and Frederic, On Thu, 16 Jan 2014 17:47:34 -0200, Arnaldo Carvalho de Melo wrote: > Em Thu, Jan 16, 2014 at 06:34:58PM +0100, Frederic Weisbecker escreveu: >> On Thu, Jan 16, 2014 at 10:17:53AM +0900, Namhyung Kim wrote: >> > I think if the sort key doesn't contain "symbol", unmatch case would be >> > increased as more various callchains would go into a same entry. >> >> You mean -g fractal,0.5,callee,address ? >> >> Hmm, actually I haven't seen much difference there. > > I guess he will, but will wait for Namhyung's final ack here, ok? I meant sorting of hist entry like "-s cpu". So as your number said it's not an issue IMHO. Arnaldo, I'm good with this change please merge it. :) Thanks, Namhyung -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry
Em Thu, Jan 16, 2014 at 06:34:58PM +0100, Frederic Weisbecker escreveu: > On Thu, Jan 16, 2014 at 10:17:53AM +0900, Namhyung Kim wrote: > > I think if the sort key doesn't contain "symbol", unmatch case would be > > increased as more various callchains would go into a same entry. > > You mean -g fractal,0.5,callee,address ? > > Hmm, actually I haven't seen much difference there. I guess he will, but will wait for Namhyung's final ack here, ok? - Arnaldo > > > > > >> > > >> > > > >> > This results in less comparisons performed by the CPU. > > >> > > >> Do you have any numbers? I suspect it'd not be a big change, but just > > >> curious. > > > > > > So I compared before/after the patchset (which include the cursor restore > > > removal) > > > with: > > > > > > 1) Some big hackbench-like load that generates > 200 MB perf.data > > > > > > perf record -g -- perf bench sched messaging -l $SOME_BIG_NUMBER > > > > > > 2) Compare before/after with the following reports: > > > > > > perf stat perf report --stdio > /dev/null > > > perf stat perf report --stdio -s sym > /dev/null > > > perf stat perf report --stdio -G > /dev/null > > > perf stat perf report --stdio -g fractal,0.5,caller,address > /dev/null > > > > > > And most of the time I had < 0.01% difference on time completion in > > > favour of the patchset > > > (which may be due to the removed cursor restore patch eventually). > > > > > > So, all in one, there was no real interesting difference. If you want the > > > true results I can definetly relaunch the tests. > > > > So as an extreme case, could you please also test "-s cpu" case and > > share the numbers? > > There is indeed a tiny difference here. > > Before the patchset: > > fweisbec@Aivars:~/linux-2.6-tip/tools/perf$ sudo ./perf stat -r 20 ./perf > report --stdio -s cpu > /dev/null > > Performance counter stats for './perf report --stdio -s cpu' (20 runs): > >3343,047232 task-clock (msec) #0,999 CPUs utilized > ( +- 0,12% ) > 6 context-switches #0,002 K/sec > ( +- 3,82% ) > 0 cpu-migrations#0,000 K/sec > >128 076 page-faults #0,038 M/sec > ( +- 0,00% ) > 13 044 840 323 cycles#3,902 GHz > ( +- 0,12% ) > stalled-cycles-frontend > stalled-cycles-backend > 16 341 506 514 instructions #1,25 insns per cycle > ( +- 0,00% ) > 4 042 448 707 branches # 1209,211 M/sec > ( +- 0,00% ) > 26 819 441 branch-misses #0,66% of all branches > ( +- 0,09% ) > >3,345286450 seconds time elapsed >( +- 0,12% ) > > After the patchset: > > fweisbec@Aivars:~/linux-2.6-tip/tools/perf$ sudo ./perf stat -r 20 ./perf > report --stdio -s cpu > /dev/null > > Performance counter stats for './perf report --stdio -s cpu' (20 runs): > >3365,739972 task-clock (msec) #0,999 CPUs utilized > ( +- 0,12% ) > 6 context-switches #0,002 K/sec > ( +- 2,99% ) > 0 cpu-migrations#0,000 K/sec > >128 076 page-faults #0,038 M/sec > ( +- 0,00% ) > 13 133 593 870 cycles#3,902 GHz > ( +- 0,12% ) > stalled-cycles-frontend > stalled-cycles-backend > 16 626 286 378 instructions #1,27 insns per cycle > ( +- 0,00% ) > 4 119 555 502 branches # 1223,967 M/sec > ( +- 0,00% ) > 28 687 283 branch-misses #0,70% of all branches > ( +- 0,09% ) > >3,367984867 seconds time elapsed >( +- 0,12% ) > > > Which makes about 0.6% difference on the overhead. > Now it had less overhead in common cases (default sorting, -s sym, -G, > etc...). > I guess it's not really worrysome, it's mostly unvisible at this scale. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry
On Thu, Jan 16, 2014 at 10:17:53AM +0900, Namhyung Kim wrote: > I think if the sort key doesn't contain "symbol", unmatch case would be > increased as more various callchains would go into a same entry. You mean -g fractal,0.5,callee,address ? Hmm, actually I haven't seen much difference there. > > > >> > >> > > >> > This results in less comparisons performed by the CPU. > >> > >> Do you have any numbers? I suspect it'd not be a big change, but just > >> curious. > > > > So I compared before/after the patchset (which include the cursor restore > > removal) > > with: > > > > 1) Some big hackbench-like load that generates > 200 MB perf.data > > > > perf record -g -- perf bench sched messaging -l $SOME_BIG_NUMBER > > > > 2) Compare before/after with the following reports: > > > > perf stat perf report --stdio > /dev/null > > perf stat perf report --stdio -s sym > /dev/null > > perf stat perf report --stdio -G > /dev/null > > perf stat perf report --stdio -g fractal,0.5,caller,address > /dev/null > > > > And most of the time I had < 0.01% difference on time completion in favour > > of the patchset > > (which may be due to the removed cursor restore patch eventually). > > > > So, all in one, there was no real interesting difference. If you want the > > true results I can definetly relaunch the tests. > > So as an extreme case, could you please also test "-s cpu" case and > share the numbers? There is indeed a tiny difference here. Before the patchset: fweisbec@Aivars:~/linux-2.6-tip/tools/perf$ sudo ./perf stat -r 20 ./perf report --stdio -s cpu > /dev/null Performance counter stats for './perf report --stdio -s cpu' (20 runs): 3343,047232 task-clock (msec) #0,999 CPUs utilized ( +- 0,12% ) 6 context-switches #0,002 K/sec ( +- 3,82% ) 0 cpu-migrations#0,000 K/sec 128 076 page-faults #0,038 M/sec ( +- 0,00% ) 13 044 840 323 cycles#3,902 GHz ( +- 0,12% ) stalled-cycles-frontend stalled-cycles-backend 16 341 506 514 instructions #1,25 insns per cycle ( +- 0,00% ) 4 042 448 707 branches # 1209,211 M/sec ( +- 0,00% ) 26 819 441 branch-misses #0,66% of all branches ( +- 0,09% ) 3,345286450 seconds time elapsed ( +- 0,12% ) After the patchset: fweisbec@Aivars:~/linux-2.6-tip/tools/perf$ sudo ./perf stat -r 20 ./perf report --stdio -s cpu > /dev/null Performance counter stats for './perf report --stdio -s cpu' (20 runs): 3365,739972 task-clock (msec) #0,999 CPUs utilized ( +- 0,12% ) 6 context-switches #0,002 K/sec ( +- 2,99% ) 0 cpu-migrations#0,000 K/sec 128 076 page-faults #0,038 M/sec ( +- 0,00% ) 13 133 593 870 cycles#3,902 GHz ( +- 0,12% ) stalled-cycles-frontend stalled-cycles-backend 16 626 286 378 instructions #1,27 insns per cycle ( +- 0,00% ) 4 119 555 502 branches # 1223,967 M/sec ( +- 0,00% ) 28 687 283 branch-misses #0,70% of all branches ( +- 0,09% ) 3,367984867 seconds time elapsed ( +- 0,12% ) Which makes about 0.6% difference on the overhead. Now it had less overhead in common cases (default sorting, -s sym, -G, etc...). I guess it's not really worrysome, it's mostly unvisible at this scale. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry
On Thu, Jan 16, 2014 at 10:17:53AM +0900, Namhyung Kim wrote: I think if the sort key doesn't contain symbol, unmatch case would be increased as more various callchains would go into a same entry. You mean -g fractal,0.5,callee,address ? Hmm, actually I haven't seen much difference there. This results in less comparisons performed by the CPU. Do you have any numbers? I suspect it'd not be a big change, but just curious. So I compared before/after the patchset (which include the cursor restore removal) with: 1) Some big hackbench-like load that generates 200 MB perf.data perf record -g -- perf bench sched messaging -l $SOME_BIG_NUMBER 2) Compare before/after with the following reports: perf stat perf report --stdio /dev/null perf stat perf report --stdio -s sym /dev/null perf stat perf report --stdio -G /dev/null perf stat perf report --stdio -g fractal,0.5,caller,address /dev/null And most of the time I had 0.01% difference on time completion in favour of the patchset (which may be due to the removed cursor restore patch eventually). So, all in one, there was no real interesting difference. If you want the true results I can definetly relaunch the tests. So as an extreme case, could you please also test -s cpu case and share the numbers? There is indeed a tiny difference here. Before the patchset: fweisbec@Aivars:~/linux-2.6-tip/tools/perf$ sudo ./perf stat -r 20 ./perf report --stdio -s cpu /dev/null Performance counter stats for './perf report --stdio -s cpu' (20 runs): 3343,047232 task-clock (msec) #0,999 CPUs utilized ( +- 0,12% ) 6 context-switches #0,002 K/sec ( +- 3,82% ) 0 cpu-migrations#0,000 K/sec 128 076 page-faults #0,038 M/sec ( +- 0,00% ) 13 044 840 323 cycles#3,902 GHz ( +- 0,12% ) not supported stalled-cycles-frontend not supported stalled-cycles-backend 16 341 506 514 instructions #1,25 insns per cycle ( +- 0,00% ) 4 042 448 707 branches # 1209,211 M/sec ( +- 0,00% ) 26 819 441 branch-misses #0,66% of all branches ( +- 0,09% ) 3,345286450 seconds time elapsed ( +- 0,12% ) After the patchset: fweisbec@Aivars:~/linux-2.6-tip/tools/perf$ sudo ./perf stat -r 20 ./perf report --stdio -s cpu /dev/null Performance counter stats for './perf report --stdio -s cpu' (20 runs): 3365,739972 task-clock (msec) #0,999 CPUs utilized ( +- 0,12% ) 6 context-switches #0,002 K/sec ( +- 2,99% ) 0 cpu-migrations#0,000 K/sec 128 076 page-faults #0,038 M/sec ( +- 0,00% ) 13 133 593 870 cycles#3,902 GHz ( +- 0,12% ) not supported stalled-cycles-frontend not supported stalled-cycles-backend 16 626 286 378 instructions #1,27 insns per cycle ( +- 0,00% ) 4 119 555 502 branches # 1223,967 M/sec ( +- 0,00% ) 28 687 283 branch-misses #0,70% of all branches ( +- 0,09% ) 3,367984867 seconds time elapsed ( +- 0,12% ) Which makes about 0.6% difference on the overhead. Now it had less overhead in common cases (default sorting, -s sym, -G, etc...). I guess it's not really worrysome, it's mostly unvisible at this scale. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry
Em Thu, Jan 16, 2014 at 06:34:58PM +0100, Frederic Weisbecker escreveu: On Thu, Jan 16, 2014 at 10:17:53AM +0900, Namhyung Kim wrote: I think if the sort key doesn't contain symbol, unmatch case would be increased as more various callchains would go into a same entry. You mean -g fractal,0.5,callee,address ? Hmm, actually I haven't seen much difference there. I guess he will, but will wait for Namhyung's final ack here, ok? - Arnaldo This results in less comparisons performed by the CPU. Do you have any numbers? I suspect it'd not be a big change, but just curious. So I compared before/after the patchset (which include the cursor restore removal) with: 1) Some big hackbench-like load that generates 200 MB perf.data perf record -g -- perf bench sched messaging -l $SOME_BIG_NUMBER 2) Compare before/after with the following reports: perf stat perf report --stdio /dev/null perf stat perf report --stdio -s sym /dev/null perf stat perf report --stdio -G /dev/null perf stat perf report --stdio -g fractal,0.5,caller,address /dev/null And most of the time I had 0.01% difference on time completion in favour of the patchset (which may be due to the removed cursor restore patch eventually). So, all in one, there was no real interesting difference. If you want the true results I can definetly relaunch the tests. So as an extreme case, could you please also test -s cpu case and share the numbers? There is indeed a tiny difference here. Before the patchset: fweisbec@Aivars:~/linux-2.6-tip/tools/perf$ sudo ./perf stat -r 20 ./perf report --stdio -s cpu /dev/null Performance counter stats for './perf report --stdio -s cpu' (20 runs): 3343,047232 task-clock (msec) #0,999 CPUs utilized ( +- 0,12% ) 6 context-switches #0,002 K/sec ( +- 3,82% ) 0 cpu-migrations#0,000 K/sec 128 076 page-faults #0,038 M/sec ( +- 0,00% ) 13 044 840 323 cycles#3,902 GHz ( +- 0,12% ) not supported stalled-cycles-frontend not supported stalled-cycles-backend 16 341 506 514 instructions #1,25 insns per cycle ( +- 0,00% ) 4 042 448 707 branches # 1209,211 M/sec ( +- 0,00% ) 26 819 441 branch-misses #0,66% of all branches ( +- 0,09% ) 3,345286450 seconds time elapsed ( +- 0,12% ) After the patchset: fweisbec@Aivars:~/linux-2.6-tip/tools/perf$ sudo ./perf stat -r 20 ./perf report --stdio -s cpu /dev/null Performance counter stats for './perf report --stdio -s cpu' (20 runs): 3365,739972 task-clock (msec) #0,999 CPUs utilized ( +- 0,12% ) 6 context-switches #0,002 K/sec ( +- 2,99% ) 0 cpu-migrations#0,000 K/sec 128 076 page-faults #0,038 M/sec ( +- 0,00% ) 13 133 593 870 cycles#3,902 GHz ( +- 0,12% ) not supported stalled-cycles-frontend not supported stalled-cycles-backend 16 626 286 378 instructions #1,27 insns per cycle ( +- 0,00% ) 4 119 555 502 branches # 1223,967 M/sec ( +- 0,00% ) 28 687 283 branch-misses #0,70% of all branches ( +- 0,09% ) 3,367984867 seconds time elapsed ( +- 0,12% ) Which makes about 0.6% difference on the overhead. Now it had less overhead in common cases (default sorting, -s sym, -G, etc...). I guess it's not really worrysome, it's mostly unvisible at this scale. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry
Hi Arnaldo and Frederic, On Thu, 16 Jan 2014 17:47:34 -0200, Arnaldo Carvalho de Melo wrote: Em Thu, Jan 16, 2014 at 06:34:58PM +0100, Frederic Weisbecker escreveu: On Thu, Jan 16, 2014 at 10:17:53AM +0900, Namhyung Kim wrote: I think if the sort key doesn't contain symbol, unmatch case would be increased as more various callchains would go into a same entry. You mean -g fractal,0.5,callee,address ? Hmm, actually I haven't seen much difference there. I guess he will, but will wait for Namhyung's final ack here, ok? I meant sorting of hist entry like -s cpu. So as your number said it's not an issue IMHO. Arnaldo, I'm good with this change please merge it. :) Thanks, Namhyung -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry
Hi Frederic, On Wed, 15 Jan 2014 17:59:30 +0100, Frederic Weisbecker wrote: > On Wed, Jan 15, 2014 at 03:23:46PM +0900, Namhyung Kim wrote: >> On Tue, 14 Jan 2014 16:37:15 +0100, Frederic Weisbecker wrote: >> > When a new callchain child branch matches an existing one in the rbtree, >> > the comparison of its first entry is performed twice: >> > >> > 1) From append_chain_children() on branch lookup >> > >> > 2) If 1) reports a match, append_chain() then compares all entries of >> > the new branch against the matching node in the rbtree, and this >> > comparison includes the first entry of the new branch again. >> >> Right. >> >> > >> > Lets shortcut this by performing the whole comparison only from >> > append_chain() which then returns the result of the comparison between >> > the first entry of the new branch and the iterating node in the rbtree. >> > If the first entry matches, the lookup on the current level of siblings >> > stops and propagates to the children of the matching nodes. >> >> Hmm.. it looks like that I thought directly calling append_chain() has >> some overhead - but it's not. > > No that's a right concern. I worried as well because I wasn't sure if there > is more match than unmatch on the first entry. I'd tend to think that the > first > entry endures unmatches most often, in which case calling match_chain() first > may be more efficient as a fast path (ie: calling append_chain() involves > one more function call and a few other details). > > But eventually measurement hasn't shown significant difference before and > after the patch. I think if the sort key doesn't contain "symbol", unmatch case would be increased as more various callchains would go into a same entry. > >> >> > >> > This results in less comparisons performed by the CPU. >> >> Do you have any numbers? I suspect it'd not be a big change, but just >> curious. > > So I compared before/after the patchset (which include the cursor restore > removal) > with: > > 1) Some big hackbench-like load that generates > 200 MB perf.data > > perf record -g -- perf bench sched messaging -l $SOME_BIG_NUMBER > > 2) Compare before/after with the following reports: > > perf stat perf report --stdio > /dev/null > perf stat perf report --stdio -s sym > /dev/null > perf stat perf report --stdio -G > /dev/null > perf stat perf report --stdio -g fractal,0.5,caller,address > /dev/null > > And most of the time I had < 0.01% difference on time completion in favour of > the patchset > (which may be due to the removed cursor restore patch eventually). > > So, all in one, there was no real interesting difference. If you want the > true results I can definetly relaunch the tests. So as an extreme case, could you please also test "-s cpu" case and share the numbers? Thanks, Namhyung -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry
On Wed, Jan 15, 2014 at 03:23:46PM +0900, Namhyung Kim wrote: > On Tue, 14 Jan 2014 16:37:15 +0100, Frederic Weisbecker wrote: > > When a new callchain child branch matches an existing one in the rbtree, > > the comparison of its first entry is performed twice: > > > > 1) From append_chain_children() on branch lookup > > > > 2) If 1) reports a match, append_chain() then compares all entries of > > the new branch against the matching node in the rbtree, and this > > comparison includes the first entry of the new branch again. > > Right. > > > > > Lets shortcut this by performing the whole comparison only from > > append_chain() which then returns the result of the comparison between > > the first entry of the new branch and the iterating node in the rbtree. > > If the first entry matches, the lookup on the current level of siblings > > stops and propagates to the children of the matching nodes. > > Hmm.. it looks like that I thought directly calling append_chain() has > some overhead - but it's not. No that's a right concern. I worried as well because I wasn't sure if there is more match than unmatch on the first entry. I'd tend to think that the first entry endures unmatches most often, in which case calling match_chain() first may be more efficient as a fast path (ie: calling append_chain() involves one more function call and a few other details). But eventually measurement hasn't shown significant difference before and after the patch. > > > > > This results in less comparisons performed by the CPU. > > Do you have any numbers? I suspect it'd not be a big change, but just > curious. So I compared before/after the patchset (which include the cursor restore removal) with: 1) Some big hackbench-like load that generates > 200 MB perf.data perf record -g -- perf bench sched messaging -l $SOME_BIG_NUMBER 2) Compare before/after with the following reports: perf stat perf report --stdio > /dev/null perf stat perf report --stdio -s sym > /dev/null perf stat perf report --stdio -G > /dev/null perf stat perf report --stdio -g fractal,0.5,caller,address > /dev/null And most of the time I had < 0.01% difference on time completion in favour of the patchset (which may be due to the removed cursor restore patch eventually). So, all in one, there was no real interesting difference. If you want the true results I can definetly relaunch the tests. > > > > Signed-off-by: Frederic Weisbecker > > Reviewed-by: Namhyung Kim Thanks! > > Thanks, > Namhyung -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry
On Wed, Jan 15, 2014 at 03:23:46PM +0900, Namhyung Kim wrote: On Tue, 14 Jan 2014 16:37:15 +0100, Frederic Weisbecker wrote: When a new callchain child branch matches an existing one in the rbtree, the comparison of its first entry is performed twice: 1) From append_chain_children() on branch lookup 2) If 1) reports a match, append_chain() then compares all entries of the new branch against the matching node in the rbtree, and this comparison includes the first entry of the new branch again. Right. Lets shortcut this by performing the whole comparison only from append_chain() which then returns the result of the comparison between the first entry of the new branch and the iterating node in the rbtree. If the first entry matches, the lookup on the current level of siblings stops and propagates to the children of the matching nodes. Hmm.. it looks like that I thought directly calling append_chain() has some overhead - but it's not. No that's a right concern. I worried as well because I wasn't sure if there is more match than unmatch on the first entry. I'd tend to think that the first entry endures unmatches most often, in which case calling match_chain() first may be more efficient as a fast path (ie: calling append_chain() involves one more function call and a few other details). But eventually measurement hasn't shown significant difference before and after the patch. This results in less comparisons performed by the CPU. Do you have any numbers? I suspect it'd not be a big change, but just curious. So I compared before/after the patchset (which include the cursor restore removal) with: 1) Some big hackbench-like load that generates 200 MB perf.data perf record -g -- perf bench sched messaging -l $SOME_BIG_NUMBER 2) Compare before/after with the following reports: perf stat perf report --stdio /dev/null perf stat perf report --stdio -s sym /dev/null perf stat perf report --stdio -G /dev/null perf stat perf report --stdio -g fractal,0.5,caller,address /dev/null And most of the time I had 0.01% difference on time completion in favour of the patchset (which may be due to the removed cursor restore patch eventually). So, all in one, there was no real interesting difference. If you want the true results I can definetly relaunch the tests. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Reviewed-by: Namhyung Kim namhy...@kernel.org Thanks! Thanks, Namhyung -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry
Hi Frederic, On Wed, 15 Jan 2014 17:59:30 +0100, Frederic Weisbecker wrote: On Wed, Jan 15, 2014 at 03:23:46PM +0900, Namhyung Kim wrote: On Tue, 14 Jan 2014 16:37:15 +0100, Frederic Weisbecker wrote: When a new callchain child branch matches an existing one in the rbtree, the comparison of its first entry is performed twice: 1) From append_chain_children() on branch lookup 2) If 1) reports a match, append_chain() then compares all entries of the new branch against the matching node in the rbtree, and this comparison includes the first entry of the new branch again. Right. Lets shortcut this by performing the whole comparison only from append_chain() which then returns the result of the comparison between the first entry of the new branch and the iterating node in the rbtree. If the first entry matches, the lookup on the current level of siblings stops and propagates to the children of the matching nodes. Hmm.. it looks like that I thought directly calling append_chain() has some overhead - but it's not. No that's a right concern. I worried as well because I wasn't sure if there is more match than unmatch on the first entry. I'd tend to think that the first entry endures unmatches most often, in which case calling match_chain() first may be more efficient as a fast path (ie: calling append_chain() involves one more function call and a few other details). But eventually measurement hasn't shown significant difference before and after the patch. I think if the sort key doesn't contain symbol, unmatch case would be increased as more various callchains would go into a same entry. This results in less comparisons performed by the CPU. Do you have any numbers? I suspect it'd not be a big change, but just curious. So I compared before/after the patchset (which include the cursor restore removal) with: 1) Some big hackbench-like load that generates 200 MB perf.data perf record -g -- perf bench sched messaging -l $SOME_BIG_NUMBER 2) Compare before/after with the following reports: perf stat perf report --stdio /dev/null perf stat perf report --stdio -s sym /dev/null perf stat perf report --stdio -G /dev/null perf stat perf report --stdio -g fractal,0.5,caller,address /dev/null And most of the time I had 0.01% difference on time completion in favour of the patchset (which may be due to the removed cursor restore patch eventually). So, all in one, there was no real interesting difference. If you want the true results I can definetly relaunch the tests. So as an extreme case, could you please also test -s cpu case and share the numbers? Thanks, Namhyung -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry
On Tue, 14 Jan 2014 16:37:15 +0100, Frederic Weisbecker wrote: > When a new callchain child branch matches an existing one in the rbtree, > the comparison of its first entry is performed twice: > > 1) From append_chain_children() on branch lookup > > 2) If 1) reports a match, append_chain() then compares all entries of > the new branch against the matching node in the rbtree, and this > comparison includes the first entry of the new branch again. Right. > > Lets shortcut this by performing the whole comparison only from > append_chain() which then returns the result of the comparison between > the first entry of the new branch and the iterating node in the rbtree. > If the first entry matches, the lookup on the current level of siblings > stops and propagates to the children of the matching nodes. Hmm.. it looks like that I thought directly calling append_chain() has some overhead - but it's not. > > This results in less comparisons performed by the CPU. Do you have any numbers? I suspect it'd not be a big change, but just curious. > > Signed-off-by: Frederic Weisbecker Reviewed-by: Namhyung Kim Thanks, Namhyung -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry
On Tue, 14 Jan 2014 16:37:15 +0100, Frederic Weisbecker wrote: When a new callchain child branch matches an existing one in the rbtree, the comparison of its first entry is performed twice: 1) From append_chain_children() on branch lookup 2) If 1) reports a match, append_chain() then compares all entries of the new branch against the matching node in the rbtree, and this comparison includes the first entry of the new branch again. Right. Lets shortcut this by performing the whole comparison only from append_chain() which then returns the result of the comparison between the first entry of the new branch and the iterating node in the rbtree. If the first entry matches, the lookup on the current level of siblings stops and propagates to the children of the matching nodes. Hmm.. it looks like that I thought directly calling append_chain() has some overhead - but it's not. This results in less comparisons performed by the CPU. Do you have any numbers? I suspect it'd not be a big change, but just curious. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Reviewed-by: Namhyung Kim namhy...@kernel.org Thanks, Namhyung -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/