Re: [HACKERS] Instrument checkpoint sync calls

2010-12-16 Thread Greg Smith

Robert Haas wrote:

On Wed, Dec 15, 2010 at 9:22 AM, Greg Smith g...@2ndquadrant.com wrote:
  

patch I submit.  Doesn't seem worth going through the trouble of committing
that minor rework on its own, I'll slip it into the next useful thing that
touches this area I do.  Thanks for the hint, this would work better than
what I did.



Well, if I'm the one committing it, I'll pull that part out again and
commit it separately.  Not sure if that affects your calculus, but I
much prefer patches that don't try to do ancillary things along the
way.
  


I meant that I'd bundle it into the block of time I spend on that, and 
likely submit with something else that touches the same area.  Obviously 
the correction patch would be better on its own when being handed over 
to a committer.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books



Re: [HACKERS] Instrument checkpoint sync calls

2010-12-16 Thread Robert Haas
On Thu, Dec 16, 2010 at 6:48 AM, Greg Smith g...@2ndquadrant.com wrote:
 I meant that I'd bundle it into the block of time I spend on that, and
 likely submit with something else that touches the same area.  Obviously the
 correction patch would be better on its own when being handed over to a
 committer.

Oh, that's very different.  Never mind!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-12-15 Thread Greg Smith

Alvaro Herrera wrote:

I gave this patch a look and it seems pretty good to me, except that I'm
uncomfortable with the idea of mdsync filling in the details for
CheckpointStats fields directly.  Would it work to pass a struct (say
SmgrSyncStats) from CheckPointBuffers to smgrsync and from there to
mdsync, have this function fill it, and return it back so that
CheckPointBuffers copies the data from this struct into CheckpointStats?
  


That was originally how I planned to write this bit of code.  When I 
realized that the CheckpointStats structure was already visible there 
and stuffed with details that ultimately go into the same output line at 
the end, it just didn't seem worth the extra code complexity.  The 
abstraction layer around md.c was not exactly airtight before I poked 
that extra little hole in there, and I was aiming for the principle that a 
smaller diff usually makes for a better patch.

If you feel strongly that the result led to a bad abstraction violation, 
I'll submit a patch to refactor it to pass a structure instead before 
the next CF.  I appreciate your concern, I'm just not sure it's worth 
spending time on.  What I'd really like to do is refactor out major 
parts of the leaky md/smgr layers altogether instead, but that's 
obviously a bigger project.



Another minor nitpick: inside the block when you call FileSync, why
check for log_checkpoints at all?  Seems to me that just checking for
zero of sync_start should be enough.  Alternatively, seems simpler to
just have a local var with the value of log_checkpoints at the start of
mdsync and use that throughout the function.  (Surely if someone turns
off log_checkpoints in the middle of a checkpoint, it's not really a
problem that we collect and report stats during that checkpoint.)
  


And now you're just getting picky!  This is a useful observation though, 
and I'll try to include that fix along with the next general checkpoint 
overhaul patch I submit.  Doesn't seem worth going through the trouble 
of committing that minor rework on its own, I'll slip it into the next 
useful thing that touches this area I do.  Thanks for the hint, this 
would work better than what I did.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-12-15 Thread Robert Haas
On Wed, Dec 15, 2010 at 9:22 AM, Greg Smith g...@2ndquadrant.com wrote:
 patch I submit.  Doesn't seem worth going through the trouble of committing
 that minor rework on its own, I'll slip it into the next useful thing that
 touches this area I do.  Thanks for the hint, this would work better than
 what I did.

Well, if I'm the one committing it, I'll pull that part out again and
commit it separately.  Not sure if that affects your calculus, but I
much prefer patches that don't try to do ancillary things along the
way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-12-14 Thread Greg Smith

Robert Haas wrote:

I took a look at this and it looks generally good, but I'm wondering
why md.c is converting the results from an exact value to a floating
point, only to have xlog.c turn around and convert back to an integer.
 I think it could just return milliseconds directly, or if you're
worried about a checkpoint that takes more than 24 days to complete,
seconds and microseconds.  


Attached patch now does something like this, except without the 
roll-over concern.  INSTR_TIME_GET_MICROSEC returns a uint64 value.  I 
just made that the storage format for all these values until they're 
converted for display.  Test output:


LOG:  checkpoint starting: xlog
DEBUG:  checkpoint sync: number=1 file=base/16385/16480 time=10422.859 msec
DEBUG:  checkpoint sync: number=2 file=base/16385/16475_vm time=2896.614 msec
DEBUG:  checkpoint sync: number=3 file=base/16385/16475.1 time=57.836 msec
DEBUG:  checkpoint sync: number=4 file=base/16385/16466 time=20.080 msec
DEBUG:  checkpoint sync: number=5 file=base/16385/16463 time=74.926 msec
DEBUG:  checkpoint sync: number=6 file=base/16385/16482 time=74.263 msec
DEBUG:  checkpoint sync: number=7 file=base/16385/16475_fsm time=7.062 msec
DEBUG:  checkpoint sync: number=8 file=base/16385/16475 time=35.164 msec
LOG:  checkpoint complete: wrote 2143 buffers (52.3%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=1.213 s, sync=13.589 s, total=24.744 s; sync files=8, longest=10.422 s, average=1.698 s


This shows the hard truncation used, so 10422.859 msec becomes 10.422 s.  
I don't think allowing up to 0.999 ms of error there is a problem given 
the expected scale.  But since none of the precision is lost until the 
end, that could be changed by only touching the final display formatting 
conversion of the value.  Following your general idea further: why throw 
away any resolution inside of md.c at all; let xlog.c decide how to show it.


Note that I also fixed the DEBUG level lines to only show their actual 
precision.  Before, they were printing six digits to the right of the 
decimal point each time, the last three of which were always 0.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1ed9687..c9778df 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*** LogCheckpointEnd(bool restartpoint)
*** 6955,6964 
  {
  	long		write_secs,
  sync_secs,
! total_secs;
  	int			write_usecs,
  sync_usecs,
! total_usecs;
  
  	CheckpointStats.ckpt_end_t = GetCurrentTimestamp();
  
--- 6955,6969 
  {
  	long		write_secs,
  sync_secs,
! total_secs,
! longest_secs,
! average_secs;
  	int			write_usecs,
  sync_usecs,
! total_usecs,
! longest_usecs,
! average_usecs;
! 	uint64		average_sync_time;
  
  	CheckpointStats.ckpt_end_t = GetCurrentTimestamp();
  
*** LogCheckpointEnd(bool restartpoint)
*** 6974,6991 
  		CheckpointStats.ckpt_sync_end_t,
  		sync_secs, sync_usecs);
  
  	if (restartpoint)
  		elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
  			 CheckpointStats.ckpt_bufs_written,
  			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
  			 write_secs, write_usecs / 1000,
  			 sync_secs, sync_usecs / 1000,
! 			 total_secs, total_usecs / 1000);
  	else
  		elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
  			 "%d transaction log file(s) added, %d removed, %d recycled; "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
  			 CheckpointStats.ckpt_bufs_written,
  			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
  			 CheckpointStats.ckpt_segs_added,
--- 6979,7017 
  		CheckpointStats.ckpt_sync_end_t,
  		sync_secs, sync_usecs);
  
+ 	/*
+ 	 * Timing values returned from CheckpointStats are in microseconds.
+ 	 * Convert to the second plus microsecond form that TimestampDifference
+ 	 * returns for homogeneous printing.
+ 	 */
+ 	longest_secs = (long) (CheckpointStats.ckpt_longest_sync / 1000000);
+ 	longest_usecs = CheckpointStats.ckpt_longest_sync -
+ 		(uint64) longest_secs * 1000000;
+ 
+ 	average_sync_time = 0;
+ 	if (CheckpointStats.ckpt_sync_rels > 0)
+ 		average_sync_time = CheckpointStats.ckpt_agg_sync_time /
+ 			CheckpointStats.ckpt_sync_rels;
+ 	average_secs = (long) (average_sync_time / 1000000);
+ 	average_usecs = average_sync_time - (uint64) average_secs * 1000000;
+ 
  	if (restartpoint)
  		elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
! 			 "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
  			 CheckpointStats.ckpt_bufs_written,
  			 (double) 

Re: [HACKERS] Instrument checkpoint sync calls

2010-12-14 Thread Alvaro Herrera

I gave this patch a look and it seems pretty good to me, except that I'm
uncomfortable with the idea of mdsync filling in the details for
CheckpointStats fields directly.  Would it work to pass a struct (say
SmgrSyncStats) from CheckPointBuffers to smgrsync and from there to
mdsync, have this function fill it, and return it back so that
CheckPointBuffers copies the data from this struct into CheckpointStats?
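
A rough sketch of the shape that struct could take (SmgrSyncStats and the 
member names here are illustrative only, not code from the patch):

#include <stdint.h>

/*
 * Hypothetical container for the sync statistics, filled in by mdsync()
 * and copied into CheckpointStats by CheckPointBuffers(), so that md.c
 * never references CheckpointStats directly.
 */
typedef struct SmgrSyncStats
{
	int			sync_rels;		/* number of files synced */
	uint64_t	longest_sync;	/* longest single sync, in microseconds */
	uint64_t	agg_sync_time;	/* sum of all sync times, in microseconds */
} SmgrSyncStats;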

Another minor nitpick: inside the block when you call FileSync, why
check for log_checkpoints at all?  Seems to me that just checking for
zero of sync_start should be enough.  Alternatively, seems simpler to
just have a local var with the value of log_checkpoints at the start of
mdsync and use that throughout the function.  (Surely if someone turns
off log_checkpoints in the middle of a checkpoint, it's not really a
problem that we collect and report stats during that checkpoint.)
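
The "local var" alternative is just latching the GUC once at the top of 
the function, along these lines (illustrative sketch only; log_checkpoints 
here is a stand-in for the real GUC variable):

#include <stdbool.h>
#include <stdio.h>

static bool log_checkpoints = true;	/* stand-in for the real GUC */

static void
mdsync_sketch(void)
{
	bool		collect_timing = log_checkpoints;	/* read the setting once */

	/*
	 * Every timing decision below would test collect_timing, so a
	 * mid-checkpoint change of log_checkpoints has no effect here.
	 */
	if (collect_timing)
		printf("timing enabled for this checkpoint\n");
}

int
main(void)
{
	mdsync_sketch();
	return 0;
}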

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-12-14 Thread Robert Haas
On Tue, Dec 14, 2010 at 9:29 AM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 I gave this patch a look and it seems pretty good to me, except

Err, woops.  I just committed this as-is.  Sorry.

 that I'm
 uncomfortable with the idea of mdsync filling in the details for
 CheckpointStats fields directly.  Would it work to pass a struct (say
 SmgrSyncStats) from CheckPointBuffers to smgrsync and from there to
 mdsync, have this function fill it, and return it back so that
 CheckPointBuffers copies the data from this struct into CheckpointStats?

 Another minor nitpick: inside the block when you call FileSync, why
 check for log_checkpoints at all?  Seems to me that just checking for
 zero of sync_start should be enough.  Alternatively, seems simpler to
 just have a local var with the value of log_checkpoints at the start of
 mdsync and use that throughout the function.  (Surely if someone turns
 off log_checkpoints in the middle of a checkpoint, it's not really a
 problem that we collect and report stats during that checkpoint.)

Neither of these things bothers me, but we can certainly discuss...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-12-14 Thread Alvaro Herrera
Excerpts from Robert Haas's message of Tue Dec 14 11:34:55 -0300 2010:
 On Tue, Dec 14, 2010 at 9:29 AM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
  I gave this patch a look and it seems pretty good to me, except
 
 Err, woops.  I just committed this as-is.  Sorry.

I noticed :-)

  that I'm
  uncomfortable with the idea of mdsync filling in the details for
  CheckpointStats fields directly.  Would it work to pass a struct (say
  SmgrSyncStats) from CheckPointBuffers to smgrsync and from there to
  mdsync, have this function fill it, and return it back so that
  CheckPointBuffers copies the data from this struct into CheckpointStats?
 
  Another minor nitpick: inside the block when you call FileSync, why
  check for log_checkpoints at all?  Seems to me that just checking for
  zero of sync_start should be enough.  Alternatively, seems simpler to
  just have a local var with the value of log_checkpoints at the start of
  mdsync and use that throughout the function.  (Surely if someone turns
  off log_checkpoints in the middle of a checkpoint, it's not really a
  problem that we collect and report stats during that checkpoint.)
 
 Neither of these things bothers me, but we can certainly discuss...

Well, the second one was just about simplifying it, so never mind that.
But referring to CheckpointStats in md.c seems to me to be a violation
of modularity that ought to be fixed.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-12-14 Thread Robert Haas
On Tue, Dec 14, 2010 at 11:47 AM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
  that I'm
  uncomfortable with the idea of mdsync filling in the details for
  CheckpointStats fields directly.  Would it work to pass a struct (say
  SmgrSyncStats) from CheckPointBuffers to smgrsync and from there to
  mdsync, have this function fill it, and return it back so that
  CheckPointBuffers copies the data from this struct into CheckpointStats?

 But referring to CheckpointStats in md.c seems to me to be a violation
 of modularity that ought to be fixed.

Hmm.  I guess I can't get worked up about it.  We could do what you
propose, but I'm not sure what purpose it would serve.  It's not as if
mdsync() can possibly serve any purpose other than to be the
guts of a checkpoint.  It seems to me that if we really wanted to get
this disentangled from the checkpoint stats stuff we'd also need to
think about moving that elog(DEBUG1) I added out of the function
somehow, but I think that would just be a lot of notional complexity
for no particular benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-12-13 Thread Greg Smith

Robert Haas wrote:

I'm wondering why md.c is converting the results from an exact value to a 
floating
point, only to have xlog.c turn around and convert back to an integer.
 I think it could just return milliseconds directly, or if you're
worried about a checkpoint that takes more than 24 days to complete,
seconds and microseconds.


md.c is printing the value as a float, so I converted early to a double 
and then percolated it upward from there.  More an artifact of how the 
code grew from its initial form than an intentional decision. I see your 
point that making elapsed, total_elapsed, ckpt_agg_sync_time, and 
ckpt_longest_sync all the same type of integer that 
INSTR_TIME_GET_MICROSEC returns would reduce the potential for rounding 
aberrations.  I could do another rev of the patch tomorrow with that 
change if you'd prefer it that way.  I don't have a strong opinion about 
that implementation detail; given that xlog.c is printing a less 
fine-grained time anyway (seconds with 3 digits vs. msec with 3 digits) 
it seems unlikely to run into a real problem here.
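
For reference, that change amounts to carrying the three new counters as 
plain integers until display time, roughly like this trimmed-down sketch 
(only the sync-related members are shown; the field names come from the 
patch, the struct name here does not):

#include <stdint.h>

typedef struct CheckpointSyncStatsSketch
{
	int			ckpt_sync_rels;		/* number of files synced */
	uint64_t	ckpt_longest_sync;	/* longest single sync, in microseconds */
	uint64_t	ckpt_agg_sync_time;	/* sum of all sync times, in microseconds */
} CheckpointSyncStatsSketch;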


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-12-13 Thread Robert Haas
On Mon, Dec 13, 2010 at 3:19 AM, Greg Smith g...@2ndquadrant.com wrote:
 Robert Haas wrote:

 I'm wondering why md.c is converting the results from an exact value to a
 floating
 point, only to have xlog.c turn around and convert back to an integer.
  I think it could just return milliseconds directly, or if you're
 worried about a checkpoint that takes more than 24 days to complete,
 seconds and microseconds.

 md.c is printing the value as a float, so I converted early to a double and
 then percolated it upward from there.  More an artifact of how the code grew
 from its initial form than an intentional decision. I see your point that
 making elapsed, total_elapsed, ckpt_agg_sync_time, and ckpt_longest_sync all
 the same type of integer that INSTR_TIME_GET_MICROSEC returns would reduce
 the potential for rounding aberrations.  I could do another rev of the patch
 tomorrow with that change if you'd prefer it that way.  I don't have a
 strong opinion about that implementation detail; given that xlog.c is
 printing a less fine-grained time anyway (seconds with 3 digits vs. msec
 with 3 digits) it seems unlikely to run into a real problem here.

Yeah, it'd probably be OK anyway, but I think it doesn't really cost
us anything to avoid the unnecessary conversion steps, so count me as
a vote for doing it that way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-12-12 Thread Robert Haas
On Sun, Dec 5, 2010 at 4:23 PM, Greg Smith g...@2ndquadrant.com wrote:
 Jeff Janes wrote:

 I've attached a tiny patch to apply over yours, to deal with this and
 with the case where no files are synced.


 Thanks for that.  That obvious error eluded me because in most of the patch
 update testing I was doing (on ext3), the longest sync was always about the
 same length as the total sync time.

 Attached patch (in correct diff form this time!) collects up all changes.
  That includes elimination of a potential race condition if someone changes
 log_checkpoints while a long sync phase is executing.  I don't know whether
 that can happen, and it obviously won't give accurate stats going back to
 the beginning of the checkpoint in that case, but it tries to defend against
 producing complete garbage if that value changes out from under it.

 This is the first version of this patch I feel fairly good about; no open
 concerns left on my side.  Jeff, since you're now the de-facto credited
 reviewer of this one by virtue of suggesting bug fixes, could you take this
 update out for a spin too?

I took a look at this and it looks generally good, but I'm wondering
why md.c is converting the results from an exact value to a floating
point, only to have xlog.c turn around and convert back to an integer.
 I think it could just return milliseconds directly, or if you're
worried about a checkpoint that takes more than 24 days to complete,
seconds and microseconds.  Or am I missing something?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-12-09 Thread Greg Smith

Jeff Janes wrote:

In my test cases, the syncs that the backends were doing were almost
always to the same file that the checkpoint writer was already choking
on (so they are entangled simply by virtue of that).  So very quickly
all the backends hit the same wall and thunked to a halt.  This is
probably a feature of trying to use pgbench as the basis to get a very
artificial model.


Yes--pgbench has some problems like you describe, ones that are a bit 
different than the way I've seen fsync writes get in each other's way in 
the production systems I've looked at.  That's good if you really want 
to provoke this behavior, which is one reason why I've used it as an 
example for my patches so far (the other being that it's already 
available in everyone's installation).  But it's tough to get it to act 
more like a real-world system, which doesn't have quite so many localized 
updates, without cranking the scale way up.  And that then tends to 
aggravate other problems too.


The 8.3 checkpoint spreading work also got some useful results using the 
dbt-2 benchmark.  I'm at the point where I think I need to return to 
that test program for what I'm doing now.  I'd encourage you to try that 
out too if you get a chance.


Thanks for the feedback and the review.  I hope you appreciate now why I 
suggested you wait for the stuff I was submitting before getting back 
into the sorted checkpoint topic again.  That should be a lot easier to 
make sense of with this instrumentation in place.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-12-07 Thread Jeff Janes
On Sun, Dec 5, 2010 at 1:23 PM, Greg Smith g...@2ndquadrant.com wrote:
 Jeff Janes wrote:

 I've attached a tiny patch to apply over yours, to deal with this and
 with the case where no files are synced.


 Thanks for that.  That obvious error eluded me because in most of the patch
 update testing I was doing (on ext3), the longest sync was always about the
 same length as the total sync time.

 Attached patch (in correct diff form this time!) collects up all changes.
  That includes elimination of a potential race condition if someone changes
 log_checkpoints while a long sync phase is executing.  I don't know whether
 that can happen, and it obviously won't give accurate stats going back to
 the beginning of the checkpoint in that case, but it tries to defend against
 producing complete garbage if that value changes out from under it.

It looks to me that the checkpoint writer only reloads settings
between checkpoints, so log_checkpoints cannot change under it.
But if it can, the behavior of the patch is about as reasonable as one
can expect, I think.


 This is the first version of this patch I feel fairly good about; no open
 concerns left on my side.  Jeff, since you're now the de-facto credited
 reviewer of this one by virtue of suggesting bug fixes, could you take this
 update out for a spin too?

It applies and builds cleanly, and passes make check with --enable-cassert.
The details of log_checkpoints output are not documented, so no
documentation patches are needed.
I cannot think of any reasonable way to add a check into the make check
framework, so I don't think one is needed.
It does what it says and what we want, and I don't see any dangers.

I haven't noticed any slow-downs.  It is listed under performance, but
it is intended to diagnose, not fix, performance problems.

It seems to meet the coding style.  It uses existing macros, so should
be as portable as the existing PG code is.

Comments seem sufficient.  The use of INSTR_TIME_SET_ZERO as a flag
for detecting changing log_checkpoints maybe could use a comment, but
it doesn't take much thought to figure out that that is what is going
on.
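
The idiom being described can be illustrated with a toy standalone example 
(plain gettimeofday() and struct timeval here stand in for the instr_time 
machinery; this is not the patch's code):

#include <stdio.h>
#include <string.h>
#include <sys/time.h>

int
main(void)
{
	int			enabled = 0;	/* stand-in for log_checkpoints at sync start */
	struct timeval start;

	memset(&start, 0, sizeof(start));	/* the INSTR_TIME_SET_ZERO step */
	if (enabled)
		gettimeofday(&start, NULL);		/* the INSTR_TIME_SET_CURRENT step */

	/* ... the sync happens; suppose the setting was flipped on meanwhile ... */
	enabled = 1;

	if (enabled && (start.tv_sec != 0 || start.tv_usec != 0))
		puts("timing was collected for this file");
	else
		puts("start is still zero: skip reporting rather than emit garbage");
	return 0;
}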

I could not make it crash.

I'm marking it as ready for committer.



 Combining this instrumentation patch with the backend sync one already
 committed, the net result under log_min_messages=debug1 is somewhat
 undesirable in that I can now see the individual sync times for the
 syncs done by the checkpoint writer, but I do not get times for the
 syncs done by the backend (I only get informed of their existence).


 On a properly working system, backend syncs shouldn't be happening.  So if
 you see them, I think the question shouldn't be "how long are they taking?",
 it's "how do I get rid of them?"  From that perspective, knowing of their
 existence is sufficient to suggest the necessary tuning changes, such as
 dropping bgwriter_delay.

OK.  I was trying to figure out if the syncs were entangled (all ended
at about the same time, regardless of when they started).  But I can
see that that is basically a hacker thing to do, not production, so I
can add in my own logging at the end of the backend sync for my
purposes.


 When you get into a situation where they are showing up, a whole lot of them
 can happen in a very brief period;

In my test cases, the syncs that the backends were doing were almost
always to the same file that the checkpoint writer was already choking
on (so they are entangled simply by virtue of that).  So very quickly
all the backends hit the same wall and thunked to a halt.  This is
probably a feature of trying to use pgbench as the basis to get a very
artificial model.  I need to find a way to get dozens or hundreds of
files going.  Maybe pgbench could be hacked so that each connection
used a different schema.


Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-12-05 Thread Greg Smith

Jeff Janes wrote:

I've attached a tiny patch to apply over yours, to deal with this and
with the case where no files are synced.
  


Thanks for that.  That obvious error eluded me because in most of the 
patch update testing I was doing (on ext3), the longest sync was always 
about the same length as the total sync time.


Attached patch (in correct diff form this time!) collects up all 
changes.  That includes elimination of a potential race condition if 
someone changes log_checkpoints while a long sync phase is executing.  I 
don't know whether that can happen, and it obviously won't give accurate 
stats going back to the beginning of the checkpoint in that case, but 
it  tries to defend aginst producing complete garbage if that value 
changes out from under it.


This is the first version of this patch I feel fairly good about; no 
open concerns left on my side.  Jeff, since you're now the de-facto 
credited reviewer of this one by virtue of suggesting bug fixes, could 
you take this update out for a spin too?



Combining this instrumentation patch with the backend sync one already
committed, the net result under log_min_messages=debug1 is somewhat
undesirable in that I can now see the individual sync times for the
syncs done by the checkpoint writer, but I do not get times for the
syncs done by the backend (I only get informed of their existence).
  


On a properly working system, backend syncs shouldn't be happening.  So 
if you see them, I think the question shouldn't be "how long are they 
taking?", it's "how do I get rid of them?"  From that perspective, 
knowing of their existence is sufficient to suggest the necessary tuning 
changes, such as dropping bgwriter_delay.


When you get into a situation where they are showing up, a whole lot of 
them can happen in a very brief period; enough that I'd actually be 
concerned about the added timing overhead, something I normally don't 
care about very much.  That's the main reason I didn't bother 
instrumenting those too--the situations where they happen are ones 
already running low on resources.


Big writes for things that can only be written out at checkpoint time, 
on the other hand, are unavoidable in the current design.  Providing 
detail on them is going to be relevant unless there's a major 
refactoring of how those happen.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us


diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ae9ed8f..6adc139 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*** LogCheckpointEnd(bool restartpoint)
*** 6955,6964 
  {
  	long		write_secs,
  sync_secs,
! total_secs;
  	int			write_usecs,
  sync_usecs,
! total_usecs;
  
  	CheckpointStats.ckpt_end_t = GetCurrentTimestamp();
  
--- 6955,6969 
  {
  	long		write_secs,
  sync_secs,
! total_secs,
! longest_secs,
! average_secs;
  	int			write_usecs,
  sync_usecs,
! total_usecs,
! longest_usecs,
! average_usecs;
! 	double		average_sync_time;
  
  	CheckpointStats.ckpt_end_t = GetCurrentTimestamp();
  
*** LogCheckpointEnd(bool restartpoint)
*** 6974,6991 
  		CheckpointStats.ckpt_sync_end_t,
  		sync_secs, sync_usecs);
  
  	if (restartpoint)
  		elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
  			 CheckpointStats.ckpt_bufs_written,
  			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
  			 write_secs, write_usecs / 1000,
  			 sync_secs, sync_usecs / 1000,
! 			 total_secs, total_usecs / 1000);
  	else
  		elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
  			 "%d transaction log file(s) added, %d removed, %d recycled; "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
  			 CheckpointStats.ckpt_bufs_written,
  			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
  			 CheckpointStats.ckpt_segs_added,
--- 6979,7017 
  		CheckpointStats.ckpt_sync_end_t,
  		sync_secs, sync_usecs);
  
+ 	/*
+ 	 * Timing values returned from CheckpointStats are in milliseconds.
+ 	 * Convert to the second plus microsecond form that TimestampDifference
+ 	 * returns for homogeneous printing.
+ 	 */
+ 	longest_secs = (long) (CheckpointStats.ckpt_longest_sync / 1000);
+ 	longest_usecs = 1000 * (CheckpointStats.ckpt_longest_sync -
+ 		longest_secs * 1000);
+ 
+ 	average_sync_time = 0;
+ 	if (CheckpointStats.ckpt_sync_rels > 0)
+ 		average_sync_time = CheckpointStats.ckpt_agg_sync_time /
+ 			CheckpointStats.ckpt_sync_rels;
+ 	average_secs = (long) (average_sync_time / 1000);
+ 	average_usecs = 1000 * (average_sync_time - average_secs * 1000);
+ 
  	if (restartpoint)
  		elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
! 			 sync files=%d, 

Re: [HACKERS] Instrument checkpoint sync calls

2010-12-02 Thread Jeff Janes
On Tue, Nov 30, 2010 at 12:15 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Tue, Nov 30, 2010 at 8:38 AM, Greg Smith g...@2ndquadrant.com wrote:


 Hi Greg,

 Thanks for the update.



 This might be ready for some proper review now.  I know there's at least one
 blatant bug still in here I haven't found yet, related to how the averages
 are computed.

 Yes, the blatant bug:

 average_sync_time = CheckpointStats.ckpt_longest_sync /
 CheckpointStats.ckpt_sync_rels;

 That should clearly be ckpt_agg_sync_time, not ckpt_longest_sync.

I've attached a tiny patch to apply over yours, to deal with this and
with the case where no files are synced.

Combining this instrumentation patch with the backend sync one already
committed, the net result under log_min_messages=debug1 is somewhat
undesirable in that I can now see the individual sync times for the
syncs done by the checkpoint writer, but I do not get times for the
syncs done by the backend (I only get informed of their existence).

I don't know what I would propose to fix this.  Having the reporting
of sync times for backend syncs be controlled by log_checkpoints seems
somewhat perverse, but the only alternative I see is to have
log_min_messages=debug1 always report times for the backend syncs.  Or
to have them go unreported altogether.

Cheers,

Jeff


patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-11-30 Thread Greg Smith

Jeff Janes wrote:

For the individual file sync times emitted under debug1, it would be
very handy if the file being synced was identified, for example
relation base/16384/16523. Rather than being numbered sequentially
within a given checkpoint.
  


I was numbering them sequentially so that it's straightforward to graph 
the sync times in an external analysis tool, but the relation data is 
helpful too.  New patch reflecting all upthread suggestions is 
attached.  The output looks like this now at DEBUG1:


LOG:  checkpoint starting: xlog
DEBUG:  checkpoint sync: number=1 file=base/16424/11645 time=11589.549000 msec
DEBUG:  checkpoint sync: number=2 file=base/16424/16438 time=16.148000 msec
DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=53.53 msec
DEBUG:  checkpoint sync: number=4 file=base/16424/16447 time=10.214000 msec
DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=1.499000 msec
DEBUG:  checkpoint sync: number=6 file=base/16424/16425_fsm time=2.921000 msec
DEBUG:  checkpoint sync: number=7 file=base/16424/16437.1 time=4.237000 msec
DEBUG:  checkpoint sync: number=8 file=base/16424/16428_fsm time=1.654000 msec
DEBUG:  checkpoint sync: number=9 file=base/16424/16442 time=7.92 msec
DEBUG:  checkpoint sync: number=10 file=base/16424/16428_vm time=2.613000 msec
DEBUG:  checkpoint sync: number=11 file=base/16424/11618 time=1.468000 msec
DEBUG:  checkpoint sync: number=12 file=base/16424/16437_fsm time=2.638000 msec
DEBUG:  checkpoint sync: number=13 file=base/16424/16428 time=2.883000 msec
DEBUG:  checkpoint sync: number=14 file=base/16424/16425 time=3.369000 msec
DEBUG:  checkpoint sync: number=15 file=base/16424/16437_vm time=8.686000 msec
DEBUG:  checkpoint sync: number=16 file=base/16424/16425_vm time=5.984000 msec
LOG:  checkpoint complete: wrote 2074 buffers (50.6%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=0.617 s, sync=11.715 s, total=22.167 s; sync files=16, longest=11.589 s, average=0.724 s


I kept the units for the DEBUG level ones in msec because that's a 
better scale for the common really short syncs during good behavior.  
But the summary info in seconds now appears at the end of the existing 
"checkpoint complete" message, so there is only one line to parse for 
those looking to analyze the gross checkpoint data.  That looks to work 
well enough for finding situations like the big ext3 spikes.  You can 
easily see one in this example from the fact that "longest=11.589 s" is 
almost the entirety of "sync=11.715 s".  That's the key thing there's 
currently no visibility into, and it's made obvious with this patch.


This might be ready for some proper review now.  I know there's at least 
one blatant bug still in here I haven't found yet, related to how the 
averages are computed.  I saw this once:


LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=-9223372036854775808.-2147483 s


After an immediate checkpoint, so at least one path not quite right yet.

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD


diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ede6ceb..6f6eb3b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7063,10 +7063,15 @@ LogCheckpointEnd(bool restartpoint)
 {
 	long		write_secs,
 sync_secs,
-total_secs;
+total_secs,
+longest_secs,
+average_secs;
 	int			write_usecs,
 sync_usecs,
-total_usecs;
+total_usecs,
+longest_usecs,
+average_usecs;
+	double		average_sync_time;
 
 	CheckpointStats.ckpt_end_t = GetCurrentTimestamp();
 
@@ -7082,18 +7087,35 @@ LogCheckpointEnd(bool restartpoint)
 		CheckpointStats.ckpt_sync_end_t,
 		sync_secs, sync_usecs);
 
+	/*
+	 * Timing values returned from CheckpointStats are in milliseconds.
+	 * Convert to the second plus microsecond form that TimestampDifference
+	 * returns for homogeneous printing.
+	 */
+	longest_secs = (long) (CheckpointStats.ckpt_longest_sync / 1000);
+	longest_usecs = 1000 * (CheckpointStats.ckpt_longest_sync - longest_secs * 1000);
+
+	average_sync_time = CheckpointStats.ckpt_longest_sync / CheckpointStats.ckpt_sync_rels;
+	average_secs = (long) (average_sync_time / 1000);
+	average_usecs = 1000 * (average_sync_time - average_secs * 1000);
+
 	if (restartpoint)
 		elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
-			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
+			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
+			 "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
 			 CheckpointStats.ckpt_bufs_written,
 			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 			 write_secs, write_usecs / 1000,
 			 sync_secs, sync_usecs / 1000,
-			 total_secs, total_usecs / 1000);
+			 total_secs, total_usecs / 1000,
+			 

Re: [HACKERS] Instrument checkpoint sync calls

2010-11-30 Thread Cédric Villemain
2010/11/30 Greg Smith g...@2ndquadrant.com:
 Jeff Janes wrote:

 For the individual file sync times emitted under debug1, it would be
 very handy if the file being synced was identified, for example
 relation base/16384/16523. Rather than being numbered sequentially
 within a given checkpoint.


 I was numbering them sequentially so that it's straightforward to graph the
 sync times in an external analysis tool, but the relation data is helpful
 too.  New patch reflecting all upthread suggestions is attached.  The output
 looks like this now at DEBUG1:

 LOG:  checkpoint starting: xlog
 DEBUG:  checkpoint sync: number=1 file=base/16424/11645 time=11589.549000
 msec
 DEBUG:  checkpoint sync: number=2 file=base/16424/16438 time=16.148000 msec
 DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=53.53 msec
 DEBUG:  checkpoint sync: number=4 file=base/16424/16447 time=10.214000 msec
 DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=1.499000 msec
 DEBUG:  checkpoint sync: number=6 file=base/16424/16425_fsm time=2.921000
 msec
 DEBUG:  checkpoint sync: number=7 file=base/16424/16437.1 time=4.237000 msec
 DEBUG:  checkpoint sync: number=8 file=base/16424/16428_fsm time=1.654000
 msec
 DEBUG:  checkpoint sync: number=9 file=base/16424/16442 time=7.92 msec
 DEBUG:  checkpoint sync: number=10 file=base/16424/16428_vm time=2.613000
 msec
 DEBUG:  checkpoint sync: number=11 file=base/16424/11618 time=1.468000 msec
 DEBUG:  checkpoint sync: number=12 file=base/16424/16437_fsm time=2.638000
 msec
 DEBUG:  checkpoint sync: number=13 file=base/16424/16428 time=2.883000 msec
 DEBUG:  checkpoint sync: number=14 file=base/16424/16425 time=3.369000 msec
 DEBUG:  checkpoint sync: number=15 file=base/16424/16437_vm time=8.686000
 msec
 DEBUG:  checkpoint sync: number=16 file=base/16424/16425_vm time=5.984000
 msec
 LOG:  checkpoint complete: wrote 2074 buffers (50.6%); 0 transaction log
 file(s) added, 0 removed, 3 recycled; write=0.617 s, sync=11.715 s,
 total=22.167 s; sync files=16, longest=11.589 s, average=0.724 s

 I kept the units for the DEBUG level ones in msec because that's a better
 scale for the common really short syncs during good behavior.  But the
 summary info in seconds now appears at the end of the existing checkpoint
 complete message, so only one line to parse for those looking to analyze
 the gross checkpoint data.  That looks to work well enough for finding
 situations like the big ext3 spikes.  You can easily see one in this example
 by the fact that longest=11.589 s is almost the entirety of sync=11.715
 s.  That's the really key thing there's currently no visibility into,
 that's made obvious with this patch.

wonderful.


 This might be ready for some proper review now.  I know there's at least one
 blatant bug still in here I haven't found yet, related to how the averages
 are computed.  I saw this once:

 LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s)
 added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.001 s;
 sync files=0, longest=0.000 s, average=-9223372036854775808.-2147483 s

 After an immediate checkpoint, so at least one path not quite right yet.

 --
 Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD




 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers





-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-11-30 Thread Jeff Janes
On Tue, Nov 30, 2010 at 8:38 AM, Greg Smith g...@2ndquadrant.com wrote:


Hi Greg,

Thanks for the update.



 This might be ready for some proper review now.  I know there's at least one
 blatant bug still in here I haven't found yet, related to how the averages
 are computed.

Yes, the blatant bug:

average_sync_time = CheckpointStats.ckpt_longest_sync /
CheckpointStats.ckpt_sync_rels;

That should clearly be ckpt_agg_sync_time, not ckpt_longest_sync.


 I saw this once:

 LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s)
 added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.001 s;
 sync files=0, longest=0.000 s, average=-9223372036854775808.-2147483 s

 After an immediate checkpoint, so at least one path not quite right yet.

Not clear what the right thing to do here is.  I guess we should
special case the div by zero to yield zero for the average?
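
In code form, that special case would look something like this minimal 
standalone sketch (an illustration only, not code from the patch):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/* hypothetical "immediate checkpoint, nothing synced" case */
	uint64_t	agg_sync_time = 0;	/* microseconds */
	int			sync_rels = 0;		/* no files synced */
	uint64_t	average = 0;

	if (sync_rels > 0)				/* special-case the div by zero to yield zero */
		average = agg_sync_time / sync_rels;

	printf("sync files=%d, average=%llu usec\n",
		   sync_rels, (unsigned long long) average);
	return 0;
}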

The patch is in unified diff rather than context diff.  Doesn't bother
me, but the wiki on doing a review says...

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-11-26 Thread Greg Smith

Jeff Janes wrote:

For the individual file sync times emitted under debug1, it would be
very handy if the file being synced was identified, for example
relation base/16384/16523.


I was in the middle of looking into adding that already, so consider 
that part of the target for the next update I'm working on.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-11-24 Thread Robert Haas
On Wed, Nov 24, 2010 at 1:02 AM, Greg Smith g...@2ndquadrant.com wrote:
 Robert Haas wrote:
 Did this get eaten by the email goblin, or you're still working on it?

 Fell behind due to an unfortunately timed bit of pneumonia.  Hurray for the
 health benefits of cross country flights.  Will fix this up, rebase my other
 patch, and head toward some more review/'Fest cleanup now that I'm feeling
 better.

Ouch.  Fringe benefits of consulting.  Thanks for the update.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-11-23 Thread Greg Smith

Robert Haas wrote:

Did this get eaten by the email goblin, or you're still working on it?
  


Fell behind due to an unfortunately timed bit of pneumonia.  Hurray for 
the health benefits of cross country flights.  Will fix this up, rebase 
my other patch, and head toward some more review/'Fest cleanup now that 
I'm feeling better.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-11-22 Thread Jeff Janes
On Mon, Nov 15, 2010 at 12:09 PM, Greg Smith g...@2ndquadrant.com wrote:
 So my task list is:

 0) Rebase against the HEAD that just touched code related to this today

 1) Assume that log_checkpoints is sufficient control over whether the timing
 overhead added is worth collecting, and therefore remove the half-baked idea
 of also wrapping with a compile-time option.

 2) Have the sync summary returned upwards, so it can be put onto the same
 line as the rest of the log_checkpoint info.

 All seems reasonable to me.  Will rev a new patch by tomorrow.

For the individual file sync times emitted under debug1, it would be
very handy if the file being synced was identified, for example
relation base/16384/16523. Rather than being numbered sequentially
within a given checkpoint.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-11-21 Thread Robert Haas
On Mon, Nov 15, 2010 at 3:09 PM, Greg Smith g...@2ndquadrant.com wrote:
 So my task list is:

 0) Rebase against the HEAD that just touched code related to this today

 1) Assume that log_checkpoints is sufficient control over whether the timing
 overhead added is worth collecting, and therefore remove the half-baked idea
 of also wrapping with a compile-time option.

 2) Have the sync summary returned upwards, so it can be put onto the same
 line as the rest of the log_checkpoint info.

 All seems reasonable to me.  Will rev a new patch by tomorrow.

Did this get eaten by the email goblin, or you're still working on it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-11-15 Thread Robert Haas
On Sun, Nov 14, 2010 at 7:04 PM, Greg Smith g...@2ndquadrant.com wrote:
 It might.  One trade-off is that if you're looking at the sync write detail,
 the summary comes out in a similar form.  And it was easy to put in
 here--I'd have to return some new data out of the sync phase call in order
 for that to show up in the main log.  If there's general buy-in on the idea,
 I could do all of that.

+1 for that approach.

 What I was alluding to is that I know gettimeofday executes fast on my Linux
 system here, so even if I did measure the overhead and showed it's near zero
 that doesn't mean it will be so on every platform.  The "how long does it
 take to find out the current time on every supported PostgreSQL platform?"
 question is one I'd like to have an answer to, but it's hard to collect
 properly.  All I know is that I don't have any system where it's slow to
 properly test again here.

I would be very surprised if we can find a system where gettimeofday()
takes a significant amount of time compared with fsync().  It might be
(probably is) too expensive to stick into code paths that are heavily
CPU-bounded, but surely the cost here is going to be dwarfed by the
fsync(), no?  Unless maybe there's no I/O to be done anyway, but that
case doesn't seem important to optimize for.  I think making this code
dependent on ifdef DEBUG_FSYNC is going to make it an order of magnitude
less useful, so I'd really rather not do that unless we have some hard
evidence that there is a real problem to worry about here.  Making it
conditional on log_checkpoints seems entirely sufficient to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-11-15 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I would be very surprised if we can find a system where gettimeofday()
 takes a significant amount of time compared with fsync().  It might be
 (probably is) too expensive to stick into code paths that are heavily
 CPU-bounded, but surely the cost here is going to be dwarfed by the
 fsync(), no?  Unless maybe there's no I/O to be done anyway, but that
 case doesn't seem important to optimize for.

I'm not sure I buy that --- the whole point of spread checkpoints is
that we hope the I/O happens before we actually call fsync.

 Making it
 conditional on log_checkpoints seems entirely sufficient to me.

But I'll agree with that.  If you're turning on log_checkpoints then
you've given the system permission to indulge in extra overhead for
monitoring.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-11-15 Thread Greg Smith

So my task list is:

0) Rebase against the HEAD that just touched code related to this today

1) Assume that log_checkpoints is sufficient control over whether the 
timing overhead added is worth collecting, and therefore remove the 
half-baked idea of also wrapping with a compile-time option.


2) Have the sync summary returned upwards, so it can be put onto the 
same line as the rest of the log_checkpoint info.


All seems reasonable to me.  Will rev a new patch by tomorrow.

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-11-15 Thread Robert Haas
On Mon, Nov 15, 2010 at 2:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 I would be very surprised if we can find a system where gettimeofday()
 takes a significant amount of time compared with fsync().  It might be
 (probably is) too expensive to stick into code paths that are heavily
 CPU-bounded, but surely the cost here is going to be dwarfed by the
 fsync(), no?  Unless maybe there's no I/O to be done anyway, but that
 case doesn't seem important to optimize for.

 I'm not sure I buy that --- the whole point of spread checkpoints is
 that we hope the I/O happens before we actually call fsync.

Possibly.  Even then, it's not guaranteed, because it might well be
that fsync() has to do more grovelling through system data structures
to determine that there is in fact no work to be done than what
gettimeofday() has to do to read off a counter somewhere...

 Making it
 conditional on log_checkpoints seems entirely sufficient to me.

 But I'll agree with that.  If you're turning on log_checkpoints then
 you've given the system permission to indulge in extra overhead for
 monitoring.

...but if we agree on the way forward then the details don't matter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Instrument checkpoint sync calls

2010-11-14 Thread Greg Smith
Attached patch adds some logging for each individual fsync call made 
during a checkpoint, along with a summary at the end.  You need to have 
the following to see all of the detail:


log_checkpoints=on
log_min_messages=debug1

And here's a sample:

LOG:  checkpoint starting: immediate force wait
DEBUG:  checkpoint sync: file=1 time=1.946000 msec
DEBUG:  checkpoint sync: file=2 time=0.666000 msec
DEBUG:  checkpoint sync: file=3 time=0.004000 msec
LOG:  checkpoint sync: files=3 longest=1.946000 msec average=0.872000 msec
LOG:  checkpoint complete: wrote 3 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.002 s, total=0.003 s


I think that it's reasonable for the sort of people who turn 
log_checkpoints on to also get the sync summary line, thus it being 
logged at LOG level.  The detail on individual syncs might go to DEBUG2 
or lower instead of DEBUG1 where I put it, that part I don't have a 
strong opinion on.  It's at DEBUG1 to make testing the patch without a 
gigantic amount of log data also coming in easier.


Right now the code is written such that all the calls that grab timing 
information are wrapped around ifdef DEBUG_FSYNC, which is a variable 
set to 1 that could be a compile-time option like DEBUG_DEADLOCK, to 
allow turning this code path off at build time.  I personally think that 
if you're already making an fsync call and have log_checkpoints on, the 
additional overhead of also timing that fsync is minimal even on 
platforms where timing is slow (I don't have such a system to test that 
assumption however).  And I've seen enough value in troubleshooting 
nasty checkpoint sync problems using this patch to feel it's worth 
having even if it does add some overhead.


I'm a little concerned about log_checkpoints changing on me in the 
middle of the execution of a checkpoint, which would cause some problems 
here.  Not sure if that's something I should explicitly code for, given 
that all I think it will do is screw up one of the timing results.  It 
does seem a risk from the last minute self-review I just did of the code.


I'll give a sample program that stresses the system, generating slow 
timing results and other types of bad behavior, along with the next 
patch I submit here shortly.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us


diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1219fcf..c72da72 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -20,6 +20,7 @@
 
 #include "catalog/catalog.h"
 #include "miscadmin.h"
+#include "portability/instr_time.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
 #include "storage/bufmgr.h"
@@ -29,6 +30,11 @@
 #include "utils/memutils.h"
 #include "pg_trace.h"
 
+/*
+ * For now just leave this on all the time, eventually this will
+ * need to be a full compiler flag like DEBUG_DEADLOCK
+ */
+#define DEBUG_FSYNC	1
 
 /* interval for calling AbsorbFsyncRequests in mdsync */
 #define FSYNCS_PER_ABSORB		10
@@ -927,6 +933,15 @@ mdsync(void)
 	PendingOperationEntry *entry;
 	int			absorb_counter;
 
+#ifdef DEBUG_FSYNC
+	/* Statistics on sync times */
+	int processed = 0;
+	instr_time sync_start,sync_end,diff;
+	double elapsed;
+	double longest = 0;
+	double total_elapsed = 0;
+#endif
+
 	/*
 	 * This is only called during checkpoints, and checkpoints should only
 	 * occur in processes that have created a pendingOpsTable.
@@ -1069,9 +1084,32 @@ mdsync(void)
 seg = _mdfd_getseg(reln, entry->tag.forknum,
 			  entry->tag.segno * ((BlockNumber) RELSEG_SIZE),
    false, EXTENSION_RETURN_NULL);
+
+#ifdef DEBUG_FSYNC
+if (log_checkpoints)
+	INSTR_TIME_SET_CURRENT(sync_start);
+#endif
 if (seg != NULL &&
 	FileSync(seg->mdfd_vfd) >= 0)
+{
+
+#ifdef DEBUG_FSYNC
+	if (log_checkpoints)
+	{
+		INSTR_TIME_SET_CURRENT(sync_end);
+		diff=sync_end;
+		INSTR_TIME_SUBTRACT(diff, sync_start);
+		elapsed = (double) INSTR_TIME_GET_MICROSEC(diff) / 1000.0;
+		if (elapsed > longest)
+			longest = elapsed;
+		total_elapsed+=elapsed;
+		processed++;
+		/* Reduce the log level here to DEBUG2 in final patch? */
+		elog(DEBUG1, "checkpoint sync: file=%d time=%f msec", processed, elapsed);
+	}
+#endif
 	break;		/* success; break out of retry loop */
+	}
 
 /*
  * XXX is there any point in allowing more than one retry?
@@ -1113,6 +1151,13 @@ mdsync(void)
 			elog(ERROR, pendingOpsTable corrupted);
 	}			/* end loop over hashtable entries */
 
+#ifdef DEBUG_FSYNC
+	/* Report on sync performance metrics */
+	if (log_checkpoints && (processed > 0))
+		elog(LOG, "checkpoint sync: files=%d longest=%f msec average=%f msec",
+			processed, longest, total_elapsed / processed);
+#endif
+
 	/* Flag successful completion of mdsync */
 	mdsync_in_progress = false;
 }

-- 
Sent via pgsql-hackers mailing list 

Re: [HACKERS] Instrument checkpoint sync calls

2010-11-14 Thread Magnus Hagander
On Sun, Nov 14, 2010 at 22:37, Greg Smith g...@2ndquadrant.com wrote:
 Attached patch adds some logging for each individual fsync call made during
 a checkpoint, along with a summary at the end.  You need to have the
 following to see all of the detail:

 log_checkpoints=on
 log_min_messages=debug1

 And here's a sample:

 LOG:  checkpoint starting: immediate force wait
 DEBUG:  checkpoint sync: file=1 time=1.946000 msec
 DEBUG:  checkpoint sync: file=2 time=0.666000 msec
 DEBUG:  checkpoint sync: file=3 time=0.004000 msec
 LOG:  checkpoint sync: files=3 longest=1.946000 msec average=0.872000 msec
 LOG:  checkpoint complete: wrote 3 buffers (0.1%); 0 transaction log file(s)
 added, 0 removed, 0 recycled; write=0.000 s, sync=0.002 s, total=0.003 s

 I think that it's reasonable for the sort of people who turn log_checkpoints
 on to also get the sync summary line, thus it being logged at LOG level.

In that case, wouldn't it make more sense to add a couple of more
fields to the existing line? Easier to post-process the logfile that
way...


  The detail on individual syncs might go to DEBUG2 or lower instead of
 DEBUG1 where I put it, that part I don't have a strong opinion on.  It's at
 DEBUG1 to make testing the patch without a gigantic amount of log data also
 coming in easier.

 Right now the code is written such that all the calls that grab timing
 information are wrapped around ifdef DEBUG_FSYNC, which is a variable set
 to 1 that could be a compile-time option like DEBUG_DEADLOCK, to allow
 turning this code path off at build time.  I personally think that if you're
 already making an fsync call and have log_checkpoints on, the additional
 overhead of also timing that fsync is minimal even on platforms where timing
 is slow (I don't have such a system to test that assumption however).  And
 I've seen enough value in troubleshooting nasty checkpoint sync problems
 using this patch to feel it's worth having even if it does add some
 overhead.

It sounds like it should be - but that should be possible to measure, no?


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-11-14 Thread Greg Smith

Magnus Hagander wrote:

I think that it's reasonable for the sort of people who turn log_checkpoints
on to also get the sync summary line, thus it being logged at LOG level.



In that case, wouldn't it make more sense to add a couple of more
fields to the existing line? Easier to post-process the logfile that
way...
  


It might.  One trade-off is that if you're looking at the sync write 
detail, the summary comes out in a similar form.  And it was easy to put 
in here--I'd have to return some new data out of the sync phase call in 
order for that to show up in the main log.  If there's general buy-in on 
the idea, I could do all of that.




I personally think that if you're
already making an fsync call and have log_checkpoints on, the additional
overhead of also timing that fsync is minimal even on platforms where timing
is slow (I don't have such a system to test that assumption however)...

It sounds like it should be - but that should be possible to measure, no?
  


What I was alluding to is that I know gettimeofday executes fast on my 
Linux system here, so even if I did measure the overhead and showed it's 
near zero that doesn't mean it will be so on every platform.  The "how 
long does it take to find out the current time on every supported 
PostgreSQL platform?" question is one I'd like to have an answer to, but 
it's hard to collect properly.  All I know is that I don't have any 
system where it's slow to properly test again here.
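
The overhead is at least easy to ballpark with a trivial standalone loop 
like this (a rough sketch, not part of the patch; it only measures 
gettimeofday() itself, the call being discussed here):

#include <stdio.h>
#include <sys/time.h>

int
main(void)
{
	const long	loops = 1000000;
	struct timeval before, after, scratch;
	long		i;
	double		total_usec;

	gettimeofday(&before, NULL);
	for (i = 0; i < loops; i++)
		gettimeofday(&scratch, NULL);
	gettimeofday(&after, NULL);

	total_usec = (after.tv_sec - before.tv_sec) * 1000000.0 +
		(after.tv_usec - before.tv_usec);
	printf("%.0f usec for %ld calls = %.3f usec/call\n",
		   total_usec, loops, total_usec / loops);
	return 0;
}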


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us