Re: [HACKERS] Timing overhead and Linux clock sources
On Tue, Aug 28, 2012 at 6:13 AM, Greg Smith g...@2ndquadrant.com wrote: After staring at all the examples I generated again, I think Bruce is right that the newer format he's suggesting is better. I know I never thought about whether reordering for easier interpretation made sense before, and I'd also guess it was less coding for the existing order was the only reason Ants did it that way. That is correct, I didn't really have a preference for the order so I went for simpler is better. I agree that the order suggested by Bruce looks better. Having seen more measurements I'd have to say that the histogram is mostly useless. It seems to me that it mostly shows OS scheduling noise. I would even say that the histogram output should be hidden behind an command line option to avoid unnecessary confusion. Ants Aasma -- Cybertec Schönig Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Timing overhead and Linux clock sources
On Mon, Aug 27, 2012 at 11:13 PM, Greg Smith g...@2ndquadrant.com wrote: After staring at all the examples I generated again, I think Bruce is right that the newer format he's suggesting is better. OK by me, then. If you're happy, I'm happy. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Timing overhead and Linux clock sources
On Tue, Aug 28, 2012 at 10:43:07AM -0400, Robert Haas wrote: On Mon, Aug 27, 2012 at 11:13 PM, Greg Smith g...@2ndquadrant.com wrote: After staring at all the examples I generated again, I think Bruce is right that the newer format he's suggesting is better. OK by me, then. If you're happy, I'm happy. OK, applied. The general rule I followed here is that if a report has a fixed start value and a variable stop value, the start value should be reported first. When comparing runs, any missing (ending) rows are seen visually as empty (at the end of the output but blank). This isn't a common case, so this issue doesn't come up often. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Timing overhead and Linux clock sources
On Sat, Aug 25, 2012 at 10:48 PM, Bruce Momjian br...@momjian.us wrote: On Mon, Aug 20, 2012 at 03:11:51PM -0400, Robert Haas wrote: On Thu, Aug 16, 2012 at 10:28 PM, Bruce Momjian br...@momjian.us wrote: FYI, I am planning to go ahead and package this tool in /contrib for PG 9.3. Isn't this exactly what we already did, in 9.2, in the form of contrib/pg_test_timing? Sorry, not sure how I missed that commit. Anyway, I am attaching a patch for 9.3 that I think improves the output of the tool, plus adds some C comments. The new output has the lowest duration times first: Testing timing overhead for 3 seconds. Per loop time including overhead: 41.31 nsec Histogram of timing durations: usec % of total count 1 95.87135 69627856 2 4.127592997719 4 0.00086628 8 0.00018133 16 0.1 5 32 0.0 1 This should make the output clearer to eyeball for problems --- a good timing has a high percentage on the first line, rather than on the last line. I guess I'm not sure the output format is an improvement. I wouldn't care much one way or the other if we had made this change at the time in AS92, but I'm not sure it's really worth breaking compatibility for a format that may or may not be any better. The person who wrote the original code presumably preferred it way it already is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Timing overhead and Linux clock sources
On Mon, Aug 27, 2012 at 12:39:02PM -0400, Robert Haas wrote: On Sat, Aug 25, 2012 at 10:48 PM, Bruce Momjian br...@momjian.us wrote: On Mon, Aug 20, 2012 at 03:11:51PM -0400, Robert Haas wrote: On Thu, Aug 16, 2012 at 10:28 PM, Bruce Momjian br...@momjian.us wrote: FYI, I am planning to go ahead and package this tool in /contrib for PG 9.3. Isn't this exactly what we already did, in 9.2, in the form of contrib/pg_test_timing? Sorry, not sure how I missed that commit. Anyway, I am attaching a patch for 9.3 that I think improves the output of the tool, plus adds some C comments. The new output has the lowest duration times first: Testing timing overhead for 3 seconds. Per loop time including overhead: 41.31 nsec Histogram of timing durations: usec % of total count 1 95.87135 69627856 2 4.127592997719 4 0.00086628 8 0.00018133 16 0.1 5 32 0.0 1 This should make the output clearer to eyeball for problems --- a good timing has a high percentage on the first line, rather than on the last line. I guess I'm not sure the output format is an improvement. I wouldn't care much one way or the other if we had made this change at the time in AS92, but I'm not sure it's really worth breaking compatibility for a format that may or may not be any better. The person who wrote the original code presumably preferred it way it already is. He wrote it that way to allow for simpler C code --- he could just start from 31 and keeping skipping entries until he hit a non-zero. My format makes it easy to see which line should have the majority of the entries, e.g. first line should be 90%. I doubt there are enough people running this cross-version that consistency in output makes any difference between major PG versions. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Timing overhead and Linux clock sources
On Mon, Aug 27, 2012 at 01:18:51PM -0400, Bruce Momjian wrote: This should make the output clearer to eyeball for problems --- a good timing has a high percentage on the first line, rather than on the last line. I guess I'm not sure the output format is an improvement. I wouldn't care much one way or the other if we had made this change at the time in AS92, but I'm not sure it's really worth breaking compatibility for a format that may or may not be any better. The person who wrote the original code presumably preferred it way it already is. He wrote it that way to allow for simpler C code --- he could just start from 31 and keeping skipping entries until he hit a non-zero. My format makes it easy to see which line should have the majority of the entries, e.g. first line should be 90%. I doubt there are enough people running this cross-version that consistency in output makes any difference between major PG versions. The real weird part is that this tool outputs a variable number of rows/buckets, depending on the hightest non-zero bucket, so I had trouble understanding it when the last line was the one to look at, especially for multiple runs. Also, we heavily adjusted the output of pg_test_fsync for several major releases and that wasn't a problem for anyone. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Timing overhead and Linux clock sources
On Mon, Aug 27, 2012 at 1:18 PM, Bruce Momjian br...@momjian.us wrote: He wrote it that way to allow for simpler C code --- he could just start from 31 and keeping skipping entries until he hit a non-zero. My format makes it easy to see which line should have the majority of the entries, e.g. first line should be 90%. I doubt there are enough people running this cross-version that consistency in output makes any difference between major PG versions. I don't see why it's better for the first line to have a big number than the last line. What difference does it make? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Timing overhead and Linux clock sources
On Mon, Aug 27, 2012 at 04:42:34PM -0400, Robert Haas wrote: On Mon, Aug 27, 2012 at 1:18 PM, Bruce Momjian br...@momjian.us wrote: He wrote it that way to allow for simpler C code --- he could just start from 31 and keeping skipping entries until he hit a non-zero. My format makes it easy to see which line should have the majority of the entries, e.g. first line should be 90%. I doubt there are enough people running this cross-version that consistency in output makes any difference between major PG versions. I don't see why it's better for the first line to have a big number than the last line. What difference does it make? When you are looking at that output, the 1 usec is where you want to see most of the percentage, and it trails off after that. Here is an example from the current output format: Histogram of timing durations: usec: count percent 16: 3 0.7% 8:563 0.01357% 4: 3241 0.07810% 2:2990371 72.05956% 1:1155682 27.84870% That first line is pretty meaningless. You have to look at the last line, see that only 27% of 1, then look up to see that 72% is 12, which isn't good. My format shows: usec % of total count 1 27.848701155682 2 72.059562990371 4 0.07810 3241 8 0.01357563 16 0.7 3 First line, 27%, that's a problem, look down for more details. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Timing overhead and Linux clock sources
On 08/27/2012 06:20 PM, Bruce Momjian wrote: On Mon, Aug 27, 2012 at 04:42:34PM -0400, Robert Haas wrote: I don't see why it's better for the first line to have a big number than the last line. What difference does it make? When you are looking at that output, the1 usec is where you want to see most of the percentage, and it trails off after that. After staring at all the examples I generated again, I think Bruce is right that the newer format he's suggesting is better. I know I never thought about whether reordering for easier interpretation made sense before, and I'd also guess it was less coding for the existing order was the only reason Ants did it that way. Where I think this is a most useful improvement is when comparing two systems with different results that don't end at the same boundary. The proposed formatting would show the good vs. bad examples I put in the docs like this: usec: count percent 1: 27694571 90.23734% 2:2993204 9.75277% 4: 3010 0.00981% 8: 22 0.7% 16: 1 0.0% 32: 1 0.0% usec: count percent 1:1155682 27.84870% 2:2990371 72.05956% 4: 3241 0.07810% 8:563 0.01357% 16: 3 0.7% And I think it's easier to compare those two in that order. The docs do specifically suggest staring at the 1 usec numbers first, and having just mocked up both orders I do prefer this one for that job. The way this was originally written, it's easier to come to an initially misleading conclusion. The fact that the first system sometimes spikes to the 32 usec range is the first thing that jumps out at you in the originally committed ordering, and that's not where people should focus first. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Timing overhead and Linux clock sources
On Mon, Aug 27, 2012 at 11:13:00PM -0400, Greg Smith wrote: On 08/27/2012 06:20 PM, Bruce Momjian wrote: On Mon, Aug 27, 2012 at 04:42:34PM -0400, Robert Haas wrote: I don't see why it's better for the first line to have a big number than the last line. What difference does it make? When you are looking at that output, the1 usec is where you want to see most of the percentage, and it trails off after that. After staring at all the examples I generated again, I think Bruce is right that the newer format he's suggesting is better. I know I never thought about whether reordering for easier interpretation made sense before, and I'd also guess it was less coding for the existing order was the only reason Ants did it that way. Where I think this is a most useful improvement is when comparing two systems with different results that don't end at the same boundary. The proposed formatting would show the good vs. bad examples I put in the docs like this: usec: count percent 1: 27694571 90.23734% 2:2993204 9.75277% 4: 3010 0.00981% 8: 22 0.7% 16: 1 0.0% 32: 1 0.0% usec: count percent 1:1155682 27.84870% 2:2990371 72.05956% 4: 3241 0.07810% 8:563 0.01357% 16: 3 0.7% And I think it's easier to compare those two in that order. The docs do specifically suggest staring at the 1 usec numbers first, and having just mocked up both orders I do prefer this one for that job. The way this was originally written, it's easier to come to an initially misleading conclusion. The fact that the first system sometimes spikes to the 32 usec range is the first thing that jumps out at you in the originally committed ordering, and that's not where people should focus first. Yes, I was totally confused how you would explain how to analyze this. Once the patch is applied you might find it easier to clearly state that in the docs. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Timing overhead and Linux clock sources
On Mon, Aug 20, 2012 at 03:11:51PM -0400, Robert Haas wrote: On Thu, Aug 16, 2012 at 10:28 PM, Bruce Momjian br...@momjian.us wrote: FYI, I am planning to go ahead and package this tool in /contrib for PG 9.3. Isn't this exactly what we already did, in 9.2, in the form of contrib/pg_test_timing? Sorry, not sure how I missed that commit. Anyway, I am attaching a patch for 9.3 that I think improves the output of the tool, plus adds some C comments. The new output has the lowest duration times first: Testing timing overhead for 3 seconds. Per loop time including overhead: 41.31 nsec Histogram of timing durations: usec % of total count 1 95.87135 69627856 2 4.127592997719 4 0.00086628 8 0.00018133 16 0.1 5 32 0.0 1 This should make the output clearer to eyeball for problems --- a good timing has a high percentage on the first line, rather than on the last line. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + diff --git a/contrib/pg_test_timing/pg_test_timing.c b/contrib/pg_test_timing/pg_test_timing.c new file mode 100644 index b3f98ab..8d79c7b *** a/contrib/pg_test_timing/pg_test_timing.c --- b/contrib/pg_test_timing/pg_test_timing.c *** static const char *progname; *** 14,29 static int32 test_duration = 3; static void handle_args(int argc, char *argv[]); ! static void test_timing(int32); int main(int argc, char *argv[]) { progname = get_progname(argv[0]); handle_args(argc, argv); ! test_timing(test_duration); return 0; } --- 14,37 static int32 test_duration = 3; static void handle_args(int argc, char *argv[]); ! static uint64 test_timing(int32); ! static void output(uint64 loop_count); ! ! /* record duration in powers of 2 microseconds */ ! int64 histogram[32]; int main(int argc, char *argv[]) { + uint64 loop_count; + progname = get_progname(argv[0]); handle_args(argc, argv); ! loop_count = test_timing(test_duration); ! ! output(loop_count); return 0; } *** handle_args(int argc, char *argv[]) *** 95,119 } } ! static void test_timing(int32 duration) { uint64 total_time; int64 time_elapsed = 0; uint64 loop_count = 0; ! uint64 prev, ! cur; ! int32 diff, ! i, ! bits, ! found; ! ! instr_time start_time, ! end_time, ! temp; ! ! static int64 histogram[32]; ! char buf[100]; total_time = duration 0 ? duration * 100 : 0; --- 103,116 } } ! static uint64 test_timing(int32 duration) { uint64 total_time; int64 time_elapsed = 0; uint64 loop_count = 0; ! uint64 prev, cur; ! instr_time start_time, end_time, temp; total_time = duration 0 ? duration * 100 : 0; *** test_timing(int32 duration) *** 122,132 --- 119,132 while (time_elapsed total_time) { + int32 diff, bits = 0; + prev = cur; INSTR_TIME_SET_CURRENT(temp); cur = INSTR_TIME_GET_MICROSEC(temp); diff = cur - prev; + /* Did time go backwards? */ if (diff 0) { printf(Detected clock going backwards in time.\n); *** test_timing(int32 duration) *** 134,145 exit(1); } ! bits = 0; while (diff) { diff = 1; bits++; } histogram[bits]++; loop_count++; --- 134,147 exit(1); } ! /* What is the highest bit in the time diff? */ while (diff) { diff = 1; bits++; } + + /* Update appropriate duration bucket */ histogram[bits]++; loop_count++; *** test_timing(int32 duration) *** 153,171 printf(Per loop time including overhead: %0.2f nsec\n, INSTR_TIME_GET_DOUBLE(end_time) * 1e9 / loop_count); printf(Histogram of timing durations:\n); ! printf(%9s: %10s %9s\n, usec, count, percent); ! found = 0; ! for (i = 31; i = 0; i--) { ! if (found || histogram[i]) ! { ! found = 1; ! /* lame hack to work around INT64_FORMAT deficiencies */ ! snprintf(buf, sizeof(buf), INT64_FORMAT, histogram[i]); ! printf(%9ld: %10s %8.5f%%\n, 1l i, buf, ! (double) histogram[i] * 100 / loop_count); ! } } } --- 155,183 printf(Per loop time including overhead: %0.2f nsec\n, INSTR_TIME_GET_DOUBLE(end_time) * 1e9 / loop_count); + + return loop_count; + } + + static void + output(uint64 loop_count) + { + int64 max_bit = 31, i; + + /* find highest bit value */ + while (max_bit 0 histogram[max_bit] == 0) + max_bit--; + printf(Histogram of timing durations:\n); ! printf(%6s %10s %10s\n, usec, % of total, count); ! for (i = 0; i = max_bit; i++) { ! char
Re: [HACKERS] Timing overhead and Linux clock sources
On Thu, Aug 16, 2012 at 10:28 PM, Bruce Momjian br...@momjian.us wrote: FYI, I am planning to go ahead and package this tool in /contrib for PG 9.3. Isn't this exactly what we already did, in 9.2, in the form of contrib/pg_test_timing? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Timing overhead and Linux clock sources
FYI, I am planning to go ahead and package this tool in /contrib for PG 9.3. --- On Fri, Dec 9, 2011 at 08:26:12PM -0500, Greg Smith wrote: On 12/09/2011 06:48 PM, Ants Aasma wrote: The attached test program (test_gettimeofday_monotonic) shows that one test loop iteration takes 34ns with tsc and 1270ns with hpet. This test program is great, I've wanted this exact sort of visibility into this problem for years. I've toyed with writing something like this and then seeing what results it returns on all of the build farm animals. For now I can just run it on all the hardware I have access to, which is not a small list. I think we should bundle this up similarly to test_fsync, document some best practices on time sources, and then point the vague warning about EXPLAIN overhead toward that. Then new sources of timing overhead can point there too. Much like low-level fsync timing, there's nothing PostgreSQL can do about it, the best we can do is provide a program to help users classify a system as likely or unlikely to run to have high-overhead timing. I can make the needed docs changes, this is resolving a long standing issue impacting code I wanted to add. Rough guideline I'm seeing shape up is that 50ns is unlikely to cause clock timing to be a significant problem, 500ns certainly is, and values in the middle should concern people but not necessarily invalidate the timing data collected. I just confirmed that switching the clock source by echoing a new value into /sys/devices/system/clocksource/clocksource0/current_clocksource works (on the 2.6.32 kernel at least). What we want to see on a good server is a result that looks like this, from my Intel i7-860 system: $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource tsc $ ./test_gettimeofday_monotonic 5 Per loop: 39.30 ns usec: count percent 4: 6 0.0% 2:104 0.8% 1:4999760 3.92983% 0: 16109 96.07009% Here's how badly that degrades if I use one of the alternate time sources: # echo acpi_pm /sys/devices/system/clocksource/clocksource0/current_clocksource $ ./test_gettimeofday_monotonic 5 Per loop: 727.65 ns usec: count percent 16: 1 0.1% 8: 0 0.0% 4: 1233 0.01794% 2:699 0.01017% 1:4992622 72.65764% 0:1876879 27.31423% echo hpet /sys/devices/system/clocksource/clocksource0/current_clocksource $ ./test_gettimeofday_monotonic 5 Per loop: 576.96 ns usec: count percent 8: 2 0.2% 4: 1273 0.01469% 2:767 0.00885% 1:4993028 57.61598% 0:3670977 42.36046% Switching to the Intel T7200 CPU in my T60 laptop only provides the poor quality time sources, not TSC, and results show timing is really slow there: $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource hpet $ ./test_gettimeofday_monotonic 5 Per loop: 1019.60 ns usec: count percent 256: 2 0.4% 128: 3 0.6% 64: 90 0.00184% 32: 23 0.00047% 16: 92 0.00188% 8: 1246 0.02541% 4: 34 0.00069% 2: 136154 2.77645% 1:4700772 95.85818% 0: 65466 1.33498% # echo acpi_pm /sys/devices/system/clocksource/clocksource0/current_clocksource $ ./test_gettimeofday_monotonic 5 Per loop: 1864.66 ns usec: count percent 256: 2 0.7% 128: 0 0.0% 64: 3 0.00011% 32: 6 0.00022% 16: 90 0.00336% 8: 1741 0.06493% 4: 2062 0.07690% 2:2260601 84.30489% 1: 416954 15.54952% 0: 0 0.0% This seems to be measuring exactly the problem I only had a hand-wave it's bad on this hardware explanation of before. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Timing overhead and Linux clock sources
On Wed, Dec 7, 2011 at 9:40 AM, Greg Smith g...@2ndquadrant.com wrote: He estimated 22ns per gettimeofday on the system with fast timing calls--presumably using TSC, and possibly faster than I saw because his system had less cores than mine to worry about. He got 990 ns on his slower system, and a worst case there of 3% overhead. Roberts comment about sequential scan performing lots of reads in a tight loop made me think of worse worst case. A count(*) with wide rows and/or lots of bloat. I created a test table with one tuple per page like this: CREATE TABLE io_test WITH (fillfactor=10) AS SELECT repeat('x', 1000) FROM generate_series(1,3); I then timed SELECT COUNT(*) FROM io_test; with track_iotiming on and off. Averages of 1000 executions, differences significant according to t-test: timer | iotiming=off | iotiming=on | diff hpet | 86.10 ms |147.80 ms | 71.67% tsc | 85.86 ms | 87.66 ms | 2.10% The attached test program (test_gettimeofday_monotonic) shows that one test loop iteration takes 34ns with tsc and 1270ns with hpet. I also managed to run the test program a couple of two socket Solaris 10 machines. The one with Xeon X5570 had iteration time of 220ns and Xeon E5620 had 270ns iterations. I'm not sure yet whether this is due to Solaris gettimeofday just being slower or some hardware issue. I also tested a more reasonable case of count(*) on pgbench_accounts with scale factor 50 (61 tuples per page). With tsc timing was actually 1% faster, but not statistically significant, with hpet the overhead was 5.6%. Scaling the overhead for the Solaris machines, it seems that the worst case for timing all buffer reads would be about 20% slower. Count(*) on medium length tuples for an out of shared buffers, in OS cache would have overhead between 1 and 2%. One random thought: I wonder if there's a way for us to just time every N'th event or something like that, to keep the overhead low. This would work only for cases where there's a reasonably uniform distribution of times or really long sampling periods, otherwise the variability will make the result pretty much useless. For example in the I/O case, a pretty typical load can have 1% of timings be 3 orders of magnitude longer than median. -- Ants Aasma #include stdio.h #include sys/time.h typedef long long int64; typedef unsigned long long uint64; typedef long int32; typedef unsigned long uint32; #define TIMEDIFF(a, b) (((a).tv_sec - (b).tv_sec)*100 + ((a).tv_usec - (b).tv_usec)) int main(int argc, char **argv, char **arge) { struct timeval start, end, prev, cur; int64 i; int64 time_elapsed = 0; int32 duration; uint64 loop_count = 0; static int64 timings[32]; if (argc != 2 || !sscanf(argv[1], %ld, duration) || duration 0) { printf(Usage: test_gettimeofday_monotonic seconds\n); return 1; } uint64 end_time = 100ll*duration; gettimeofday(start, 0); cur = start; while (time_elapsed end_time) { prev = cur; gettimeofday(cur, 0); long diff = TIMEDIFF(cur, prev); if (diff 0) { printf(Time warp: %ld 0 = %ld.%ld %ld.%ld\n, diff, cur.tv_sec, cur.tv_usec, prev.tv_sec, prev.tv_usec); return -1; } int bits = 0; while (diff) { diff = 1; bits++; } timings[bits]++; loop_count++; time_elapsed = TIMEDIFF(cur, start); } gettimeofday(end, 0); printf(Per loop: %0.2f ns\n, ((double) TIMEDIFF(end, start))*1000/loop_count); printf(%9s: %10s %9s\n, usec, count, percent); int found = 0; for (i = 31; i = 0; i--) { if (found || timings[i]) { found = 1; printf(%9ld: %10lld %8.5f%%\n, 1li1, timings[i], ((double)timings[i])*100/loop_count); } } return 0; } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Timing overhead and Linux clock sources
On 12/09/2011 06:48 PM, Ants Aasma wrote: The attached test program (test_gettimeofday_monotonic) shows that one test loop iteration takes 34ns with tsc and 1270ns with hpet. This test program is great, I've wanted this exact sort of visibility into this problem for years. I've toyed with writing something like this and then seeing what results it returns on all of the build farm animals. For now I can just run it on all the hardware I have access to, which is not a small list. I think we should bundle this up similarly to test_fsync, document some best practices on time sources, and then point the vague warning about EXPLAIN overhead toward that. Then new sources of timing overhead can point there too. Much like low-level fsync timing, there's nothing PostgreSQL can do about it, the best we can do is provide a program to help users classify a system as likely or unlikely to run to have high-overhead timing. I can make the needed docs changes, this is resolving a long standing issue impacting code I wanted to add. Rough guideline I'm seeing shape up is that 50ns is unlikely to cause clock timing to be a significant problem, 500ns certainly is, and values in the middle should concern people but not necessarily invalidate the timing data collected. I just confirmed that switching the clock source by echoing a new value into /sys/devices/system/clocksource/clocksource0/current_clocksource works (on the 2.6.32 kernel at least). What we want to see on a good server is a result that looks like this, from my Intel i7-860 system: $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource tsc $ ./test_gettimeofday_monotonic 5 Per loop: 39.30 ns usec: count percent 4: 6 0.0% 2:104 0.8% 1:4999760 3.92983% 0: 16109 96.07009% Here's how badly that degrades if I use one of the alternate time sources: # echo acpi_pm /sys/devices/system/clocksource/clocksource0/current_clocksource $ ./test_gettimeofday_monotonic 5 Per loop: 727.65 ns usec: count percent 16: 1 0.1% 8: 0 0.0% 4: 1233 0.01794% 2:699 0.01017% 1:4992622 72.65764% 0:1876879 27.31423% echo hpet /sys/devices/system/clocksource/clocksource0/current_clocksource $ ./test_gettimeofday_monotonic 5 Per loop: 576.96 ns usec: count percent 8: 2 0.2% 4: 1273 0.01469% 2:767 0.00885% 1:4993028 57.61598% 0:3670977 42.36046% Switching to the Intel T7200 CPU in my T60 laptop only provides the poor quality time sources, not TSC, and results show timing is really slow there: $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource hpet $ ./test_gettimeofday_monotonic 5 Per loop: 1019.60 ns usec: count percent 256: 2 0.4% 128: 3 0.6% 64: 90 0.00184% 32: 23 0.00047% 16: 92 0.00188% 8: 1246 0.02541% 4: 34 0.00069% 2: 136154 2.77645% 1:4700772 95.85818% 0: 65466 1.33498% # echo acpi_pm /sys/devices/system/clocksource/clocksource0/current_clocksource $ ./test_gettimeofday_monotonic 5 Per loop: 1864.66 ns usec: count percent 256: 2 0.7% 128: 0 0.0% 64: 3 0.00011% 32: 6 0.00022% 16: 90 0.00336% 8: 1741 0.06493% 4: 2062 0.07690% 2:2260601 84.30489% 1: 416954 15.54952% 0: 0 0.0% This seems to be measuring exactly the problem I only had a hand-wave it's bad on this hardware explanation of before. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Timing overhead and Linux clock sources
On Tue, Dec 6, 2011 at 9:58 PM, Greg Smith g...@2ndquadrant.com wrote: -If you have a system with a working TSC clock source (timing data is pulled right from the CPU), timing overhead is reasonable enough that you might turn it on even for things that happen frequently, such as the buffer I/O timing patch enables. Even the TSC stuff looks expensive enough that you wouldn't to pay the full overhead all the time on a busy system, but of course we probably wouldn't want to do that anyway. EXPLAIN ANALYZE is extremely expensive mostly because it's timing entry and exit into every plan node, and the way our executor works, those are very frequent operations. But you could probably gather more coarse-grained statistics, like separating parse, plan, and execute time for each query, without breaking a sweat. I'm not sure about buffer I/Os - on a big sequential scan, you might do quite a lot of those in a pretty tight loop. That's not an argument against adding the option, though, assuming that the default setting is off. And, certainly, I agree with you that it's worth trying to document some of this stuff so that people don't have to try to figure it out themselves (uggh!). One random thought: I wonder if there's a way for us to just time every N'th event or something like that, to keep the overhead low. The problem is that you might not get accurate results if, say, every 2N'th event takes much longer than normal - you'll either hit all the long ones, or miss them all. You could fix that by using a pseudorandom number generator to decide whether to time each event, but that's got it's own overhead... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Timing overhead and Linux clock sources
On 12/06/2011 10:20 PM, Robert Haas wrote: EXPLAIN ANALYZE is extremely expensive mostly because it's timing entry and exit into every plan node, and the way our executor works, those are very frequent operations. The plan for the query I was timing looks like this: Aggregate (cost=738.00..738.01 rows=1 width=0) (actual time=3.045..3.045 rows=1 loops=1) - Seq Scan on customers (cost=0.00..688.00 rows=2 width=0) (actual time=0.002..1.700 rows=2 loops=1) That's then 2 * 2 timing calls for the Seq Scan dominating the runtime. On the system with fast TSC, the fastest execution was 1.478ms, the slowest with timing 2.945ms. That's 1.467ms of total timing overhead, worst-case, so approximately 37ns per timing call. If you're executing something that is only ever hitting data in shared_buffers, you can measure that; any other case, probably not. Picking apart the one with slow timing class on my laptop, fastest is 5.52ms, and the fastest with timing is 57.959ms. That makes for a minimum of 1311ns per timing call, best-case. I'm not sure about buffer I/Os - on a big sequential scan, you might do quite a lot of those in a pretty tight loop. To put this into perspective relative to the number of explain time calls, there are 488 pages in the relation my test is executing again. I think we need to be careful to keep timing calls from happening at every buffer allocation. I wouldn't expect sprinkling one around every buffer miss would be a problem on a system with a fast clocksource. And that is what was shown by the testing Ants Aasma did before submitting the add timing of buffer I/O requests patch; his results make more sense to me now. He estimated 22ns per gettimeofday on the system with fast timing calls--presumably using TSC, and possibly faster than I saw because his system had less cores than mine to worry about. He got 990 ns on his slower system, and a worst case there of 3% overhead. Whether people who are on one of these slower timing call systems would be willing to pay 3% overhead is questionable. But I now believe Ants's claim that it's below the noise level on ones with a good TSC driven timer. I got a 35:1 ratio between fast and slow clock sources, he got 45:1. If we try to estimate the timing overhead that is too small to measure, I'd guess it's ~3% / 30 = 0.1%. I'd just leave that on all the time on a good TSC-driven system. Buffer hits and tuple-level operations you couldn't afford to time, just about anything else would be fine. One random thought: I wonder if there's a way for us to just time every N'th event or something like that, to keep the overhead low. I'm predicting we got a lot of future demand for instrumentation features like this, where we want to make them available but would like to keep them from happening too often when the system is busy. Tossing a percentage of them might work. Caching them in queue somewhere for processing by a background process, and not collecting the data, if that queue fills is another idea I've been thinking about recently. I'm working on some ideas for making is the server busy? something you can ask the background writer usefully too. There's a number of things that become practical for that process to do, when it's decoupled from doing the checkpoint sync job so its worst-case response time is expected to tighten up. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers