Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-19 Thread Robert Haas
On Wed, Jul 18, 2012 at 5:17 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I've been chewing on this issue some more, and no longer like my
 previous proposal, which was

 ... What I'm thinking about
 is reducing the hash key to just RelFileNodeBackend + ForkNumber,
 so that there's one hashtable entry per fork, and then storing a
 bitmap to indicate which segment numbers need to be sync'd.  At
 one gigabyte to the bit, I think we could expect the bitmap would
 not get terribly large.  We'd still have a cancel flag in each
 hash entry, but it'd apply to the whole relation fork not each
 segment.

 The reason that's not so attractive is the later observation that what
 we really care about optimizing is FORGET_RELATION_FSYNC for all the
 forks of a relation at once, which we could produce just one request
 for with trivial refactoring of smgrunlink/mdunlink.  The above
 representation doesn't help for that.  So what I'm now thinking is that
 we should create a second hash table, with key RelFileNode only,
 carrying two booleans: a cancel-previous-fsyncs bool and a
 please-unlink-after-checkpoint bool.  (The latter field would allow us
 to drop the separate pending-unlinks data structure.)  Entries would
 be made in this table when we got a FORGET_RELATION_FSYNC or
 UNLINK_RELATION_REQUEST message -- note that in 99% of cases we'd get
 both message types for each relation, since they're both created during
 DROP.  (Maybe we could even combine these request types.)  To use the
 table, as we scan the existing per-fork-and-segment hash table, we'd
 have to do a lookup in the per-relation table to see if there was a
 later cancel message for that relation.  Now this does add a few cycles
 to the processing of each pendingOpsTable entry in mdsync ... but
 considering that the major work in that loop is an fsync call, it is
 tough to believe that anybody would notice an extra hashtable lookup.

Seems a bit complex, but it might be worth it.  Keep in mind that I
eventually want to be able to make an unlogged table logged or a visca
versa, which will probably entail unlinking just the init fork (for
the logged - unlogged direction).

 However, I also came up with an entirely different line of thought,
 which unfortunately seems incompatible with either of the improved
 table designs above.  It is this: instead of having a request queue
 that feeds into a hash table hidden within the checkpointer process,
 what about storing the pending-fsyncs table as a shared hash table
 in shared memory?  That is, ForwardFsyncRequest would not simply
 try to add the request to a linear array, but would do a HASH_ENTER
 call on a shared hash table.  This means the de-duplication occurs
 for free and we no longer need CompactCheckpointerRequestQueue at all.
 Basically, this would amount to saying that the original design was
 wrong to try to micro-optimize the time spent in ForwardFsyncRequest,
 and that we'd rather pay a little more per ForwardFsyncRequest call
 to avoid the enormous response-time spike that will occur when
 CompactCheckpointerRequestQueue has to run.  (Not to mention that
 the checkpointer would eventually have to do HASH_ENTER anyway.)
 I think this would address your observation above that the request
 queue tends to contain an awful lot of duplicates.

I'm not concerned about the queue *containing* a large number of
duplicates; I'm concerned about the large number of duplicate
*requests*.  Under either the current system or this proposal, every
time we write a block, we must take and release CheckpointerCommLock.
Now, I have no evidence that there's actually a bottleneck there, but
if there is, this proposal won't fix it.  In fact, I suspect on the
whole it would make things worse, because while it's true that
CompactCheckpointerRequestQueue is expensive, it shouldn't normally be
happening at all, because the checkpointer should be draining the
queue regularly enough to prevent it from filling.  So except when the
system is in the pathological state where the checkpointer becomes
unresponsive because it's blocked in-kernel on a very long fsync and
there is a large amount of simultaneous write activity, each process
that acquires CheckpointerCommLock holds it for just long enough to
slam a few bytes of data into the queue, which is very cheap.  I
suspect that updating a hash table would be significantly more
expensive, and we'd pay whatever that extra overhead is on every fsync
request, not just in the unusual case where we manage to fill the
queue.  So I don't think this is likely to be a win.

If you think about the case of an UPDATE statement that hits a large
number of blocks in the same relation, it sends an fsync request for
every single block.  Really, it's only necessary to send a new fsync
request if the checkpointer has begun a new checkpoint cycle in the
meantime; otherwise, the old request is still pending and will cover
the new write as well.  But there's no way for the backend doing the
writes to know 

Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-19 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Seems a bit complex, but it might be worth it.  Keep in mind that I
 eventually want to be able to make an unlogged table logged or a visca
 versa, which will probably entail unlinking just the init fork (for
 the logged - unlogged direction).

Well, as far as that goes, I don't see a reason why you couldn't unlink
the init fork immediately on commit.  The checkpointer should not have
to be involved at all --- there's no reason to send it a FORGET FSYNC
request either, because there shouldn't be any outstanding writes
against an init fork, no?

But having said that, this does serve as an example that we might
someday want the flexibility to kill individual forks.  I was
intending to kill smgrdounlinkfork altogether, but I'll refrain.

 I think this is just over-engineered.  The originally complained-of
 problem was all about the inefficiency of manipulating the
 checkpointer's backend-private data structures, right?  I don't see
 any particular need to mess with the shared memory data structures at
 all.  If you wanted to add some de-duping logic to retail fsync
 requests, you could probably accomplish that more cheaply by having
 each such request look at the last half-dozen or so items in the queue
 and skip inserting the new request if any of them match the new
 request.  But I think that'd probably be a net loss, because it would
 mean holding the lock for longer.

What about checking just the immediately previous entry?  This would
at least fix the problem for bulk-load situations, and the cost ought
to be about negligible compared to acquiring the LWLock.

I have also been wondering about de-duping on the backend side, but
the problem is that if a backend remembers its last few requests,
it doesn't know when that cache has to be cleared because of a new
checkpoint cycle starting.  We could advertise the current cycle
number in shared memory, but you'd still need to take a lock to
read it.  (If we had memory fence primitives it could be a bit
cheaper, but I dunno how much.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-19 Thread Robert Haas
On Thu, Jul 19, 2012 at 10:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Seems a bit complex, but it might be worth it.  Keep in mind that I
 eventually want to be able to make an unlogged table logged or a visca
 versa, which will probably entail unlinking just the init fork (for
 the logged - unlogged direction).

 Well, as far as that goes, I don't see a reason why you couldn't unlink
 the init fork immediately on commit.  The checkpointer should not have
 to be involved at all --- there's no reason to send it a FORGET FSYNC
 request either, because there shouldn't be any outstanding writes
 against an init fork, no?

Well, it gets written when it gets created.  Some of those writes go
through shared_buffers.

 But having said that, this does serve as an example that we might
 someday want the flexibility to kill individual forks.  I was
 intending to kill smgrdounlinkfork altogether, but I'll refrain.

If you want to remove it, it's OK with me.  We can always put it back
later if it's needed.  We have an SCM that allows us to revert
patches.  :-)

 What about checking just the immediately previous entry?  This would
 at least fix the problem for bulk-load situations, and the cost ought
 to be about negligible compared to acquiring the LWLock.

Well, two things:

1. If a single bulk load is the ONLY activity on the system, or more
generally if only one segment in the system is being heavily written,
then that would reduce the number of entries that get added to the
queue, but if you're doing two bulk loads on different tables at the
same time, then it might not do much.  From Greg Smith's previous
comments on this topic, I understand that having two or three entries
alternating in the queue is a fairly common pattern.

2. You say fix the problem but I'm not exactly clear what problem
you think this fixes.  It's true that the compaction code is a lot
slower than an ordinary queue insertion, but I think it generally
doesn't happen enough to matter, and when it does happen the system is
generally I/O bound anyway, so who cares?  One possible argument in
favor of doing something along these lines is that it would reduce the
amount of data that the checkpointer would have to copy while holding
the lock, thus causing less disruption for other processes trying to
insert into the request queue.  But I don't know whether that effect
is significant enough to matter.

 I have also been wondering about de-duping on the backend side, but
 the problem is that if a backend remembers its last few requests,
 it doesn't know when that cache has to be cleared because of a new
 checkpoint cycle starting.  We could advertise the current cycle
 number in shared memory, but you'd still need to take a lock to
 read it.  (If we had memory fence primitives it could be a bit
 cheaper, but I dunno how much.)

Well, we do have those, as of 9.2.  There not being used for anything
yet, but I've been looking for an opportunity to put them into use.
sinvaladt.c's msgnumLock is an obvious candidate, but the 9.2 changes
to reduce the impact of sinval synchronization work sufficiently well
that I haven't been motivated to tinker with it any further.  Maybe it
would be worth doing just to exercise that code, though.

Or, maybe we can use them here.  But after some thought I can't see
exactly how we'd do it.  Memory barriers prevent a value from being
prefetched too early or written back to main memory too late, relative
to other memory operations by the same process, but the definition of
too early and too late is not quite clear to me here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-19 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Jul 19, 2012 at 10:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 What about checking just the immediately previous entry?  This would
 at least fix the problem for bulk-load situations, and the cost ought
 to be about negligible compared to acquiring the LWLock.

 2. You say fix the problem but I'm not exactly clear what problem
 you think this fixes.

What I'm concerned about is that there is going to be a great deal more
fsync request queue traffic in 9.2 than there ever was before, as a
consequence of the bgwriter/checkpointer split.  The design expectation
for this mechanism was that most fsync requests would be generated
locally inside the bgwriter and thus go straight into the hash table
without having to go through the shared-memory queue.  I admit that
we have seen no benchmarks showing that there's a problem, but that's
because up till yesterday the bgwriter was failing to transmit such
messages at all.  So I'm looking for ways to cut the overhead.

But having said that, maybe we should not panic until we actually see
some benchmarks showing the problem.

Meanwhile, we do know there's a problem with FORGET_RELATION_FSYNC.
I have been looking at the two-hash-tables design I suggested before,
and realized that there's a timing issue: if we just stuff forget
requests into a separate table, there is no method for determining
whether a given fsync request arrived before or after a given forget
request.  This is problematic if the relfilenode gets recycled: we
need to be able to guarantee that a previously-posted forget request
won't cancel a valid fsync for the new relation.  I believe this is
soluble though, if we merge the forget requests with unlink requests,
because a relfilenode can't be recycled until we do the unlink.
So as far as the code goes:

1. Convert the PendingUnlinkEntry linked list to a hash table keyed by
RelFileNode.  It acts the same as before, and shouldn't be materially
slower to process, but now we can determine in O(1) time whether there
is a pending unlink for a relfilenode.

2. Treat the existence of a pending unlink request as a relation-wide
fsync cancel; so the loop in mdsync needs one extra hashtable lookup
to determine validity of a PendingOperationEntry.  As before, this
should not matter much considering that we're about to do an fsync().

3. Tweak mdunlink so that it does not send a FORGET_RELATION_FSYNC
message if it is sending an UNLINK_RELATION_REQUEST.  (A side benefit
is that this gives us another 2X reduction in fsync queue traffic,
and not just any queue traffic but the type of traffic that we must
not fail to queue.)

The FORGET_RELATION_FSYNC code path will still exist, and will still
require a full hashtable scan, but we don't care because it isn't
being used in common situations.  It would only be needed for stuff
like killing an init fork.

The argument that this is safe involves these points:

* mdunlink cannot send UNLINK_RELATION_REQUEST until it's done
ftruncate on the main fork's first segment, because otherwise that
segment could theoretically get unlinked from under it before it can do
the truncate.  But this is okay since the ftruncate won't cause any
fsync the checkpointer might concurrently be doing to fail.  The
request *will* be sent before we unlink any other files, so mdsync
will be able to recover if it gets an fsync failure due to concurrent
unlink.

* Because a relfilenode cannot be recycled until we process and delete
the PendingUnlinkEntry during mdpostckpt, it is not possible for valid
new fsync requests to arrive while the PendingUnlinkEntry still exists
to cause them to be considered canceled.

* Because we only process and delete PendingUnlinkEntrys that have been
there since before the checkpoint started, we can be sure that any
PendingOperationEntrys referring to the relfilenode will have been
scanned and deleted by mdsync before we remove the PendingUnlinkEntry.

Unless somebody sees a hole in this logic, I'll go make this happen.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-19 Thread Robert Haas
On Thu, Jul 19, 2012 at 2:57 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Thu, Jul 19, 2012 at 10:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 What about checking just the immediately previous entry?  This would
 at least fix the problem for bulk-load situations, and the cost ought
 to be about negligible compared to acquiring the LWLock.

 2. You say fix the problem but I'm not exactly clear what problem
 you think this fixes.

 What I'm concerned about is that there is going to be a great deal more
 fsync request queue traffic in 9.2 than there ever was before, as a
 consequence of the bgwriter/checkpointer split.  The design expectation
 for this mechanism was that most fsync requests would be generated
 locally inside the bgwriter and thus go straight into the hash table
 without having to go through the shared-memory queue.  I admit that
 we have seen no benchmarks showing that there's a problem, but that's
 because up till yesterday the bgwriter was failing to transmit such
 messages at all.  So I'm looking for ways to cut the overhead.

 But having said that, maybe we should not panic until we actually see
 some benchmarks showing the problem.

+1 for not panicking.  I'm prepared to believe that there could be a
problem here, but I'm not prepared to believe that we've characterized
it well enough to be certain that any changes we choose to make will
make things better not worse.

 Meanwhile, we do know there's a problem with FORGET_RELATION_FSYNC.
 I have been looking at the two-hash-tables design I suggested before,
 and realized that there's a timing issue: if we just stuff forget
 requests into a separate table, there is no method for determining
 whether a given fsync request arrived before or after a given forget
 request.  This is problematic if the relfilenode gets recycled: we
 need to be able to guarantee that a previously-posted forget request
 won't cancel a valid fsync for the new relation.  I believe this is
 soluble though, if we merge the forget requests with unlink requests,
 because a relfilenode can't be recycled until we do the unlink.
 So as far as the code goes:

 1. Convert the PendingUnlinkEntry linked list to a hash table keyed by
 RelFileNode.  It acts the same as before, and shouldn't be materially
 slower to process, but now we can determine in O(1) time whether there
 is a pending unlink for a relfilenode.

 2. Treat the existence of a pending unlink request as a relation-wide
 fsync cancel; so the loop in mdsync needs one extra hashtable lookup
 to determine validity of a PendingOperationEntry.  As before, this
 should not matter much considering that we're about to do an fsync().

 3. Tweak mdunlink so that it does not send a FORGET_RELATION_FSYNC
 message if it is sending an UNLINK_RELATION_REQUEST.  (A side benefit
 is that this gives us another 2X reduction in fsync queue traffic,
 and not just any queue traffic but the type of traffic that we must
 not fail to queue.)

 The FORGET_RELATION_FSYNC code path will still exist, and will still
 require a full hashtable scan, but we don't care because it isn't
 being used in common situations.  It would only be needed for stuff
 like killing an init fork.

 The argument that this is safe involves these points:

 * mdunlink cannot send UNLINK_RELATION_REQUEST until it's done
 ftruncate on the main fork's first segment, because otherwise that
 segment could theoretically get unlinked from under it before it can do
 the truncate.  But this is okay since the ftruncate won't cause any
 fsync the checkpointer might concurrently be doing to fail.  The
 request *will* be sent before we unlink any other files, so mdsync
 will be able to recover if it gets an fsync failure due to concurrent
 unlink.

 * Because a relfilenode cannot be recycled until we process and delete
 the PendingUnlinkEntry during mdpostckpt, it is not possible for valid
 new fsync requests to arrive while the PendingUnlinkEntry still exists
 to cause them to be considered canceled.

 * Because we only process and delete PendingUnlinkEntrys that have been
 there since before the checkpoint started, we can be sure that any
 PendingOperationEntrys referring to the relfilenode will have been
 scanned and deleted by mdsync before we remove the PendingUnlinkEntry.

 Unless somebody sees a hole in this logic, I'll go make this happen.

What if we change the hash table to have RelFileNode as the key and an
array of MAX_FORKNUM bitmapsets as the value?  Then when you get a
forget request, you can just zap all the sets to empty.  That seems
a whole lot simpler than your proposal and I don't see any real
downside.  I can't actually poke a whole in your logic at the moment
but a simpler system that requires no assumptions about filesystem
behavior seems preferable to me.

You can still make an unlink request imply a corresponding
forget-request if you want, but now that's a separate optimization.

-- 
Robert Haas
EnterpriseDB: 

Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-19 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 What if we change the hash table to have RelFileNode as the key and an
 array of MAX_FORKNUM bitmapsets as the value?  Then when you get a
 forget request, you can just zap all the sets to empty.

Hm ... the only argument I can really make against that is that there'll
be no way to move such a table into shared memory; but there's probably
little hope of that anyway, given points made upthread.  The bitmapset
manipulations are a bit tricky but solvable, and I agree there's
something to be said for not tying this stuff so closely to the
mechanism for relfilenode recycling.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-18 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Jul 16, 2012 at 12:36 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Well, that argument is exactly why the code is designed the way it is...
 but we are now finding out that sending useless fsync requests isn't as
 cheap as all that.

 I agree, but I think the problem can be solved for a pretty modest
 amount of effort without needing to make fsync PGC_POSTMASTER.  Your
 proposal to refactor the pendingOpsTable representation seems like it
 will help a lot.  Perhaps you should do that first and then we can
 reassess.
 ...
 In my view, the elephant in the room here is that it's dramatically
 inefficient for every backend to send an fsync request on every block
 write.  For many users, in many workloads, all of those requests will
 be for just a tiny handful of relation segments.  The fsync queue
 compaction code works as well as it does for precisely that reason -
 when it triggers, we typically can compact a list of thousands or
 millions of entries down to less than two dozen.  In other words, as I
 see it, the issue here is not so much that 100% of the fsync requests
 are useless when fsync=off, but rather that 99.9% of them are useless
 even when fsync=on.

 In any case, I'm still of the opinion that we ought to try making one
 fix (your proposed refactoring of the pendingOpsTable) and then see
 where we're at.

I've been chewing on this issue some more, and no longer like my
previous proposal, which was

 ... What I'm thinking about
 is reducing the hash key to just RelFileNodeBackend + ForkNumber,
 so that there's one hashtable entry per fork, and then storing a
 bitmap to indicate which segment numbers need to be sync'd.  At
 one gigabyte to the bit, I think we could expect the bitmap would
 not get terribly large.  We'd still have a cancel flag in each
 hash entry, but it'd apply to the whole relation fork not each
 segment.

The reason that's not so attractive is the later observation that what
we really care about optimizing is FORGET_RELATION_FSYNC for all the
forks of a relation at once, which we could produce just one request
for with trivial refactoring of smgrunlink/mdunlink.  The above
representation doesn't help for that.  So what I'm now thinking is that
we should create a second hash table, with key RelFileNode only,
carrying two booleans: a cancel-previous-fsyncs bool and a
please-unlink-after-checkpoint bool.  (The latter field would allow us
to drop the separate pending-unlinks data structure.)  Entries would
be made in this table when we got a FORGET_RELATION_FSYNC or
UNLINK_RELATION_REQUEST message -- note that in 99% of cases we'd get
both message types for each relation, since they're both created during
DROP.  (Maybe we could even combine these request types.)  To use the
table, as we scan the existing per-fork-and-segment hash table, we'd
have to do a lookup in the per-relation table to see if there was a
later cancel message for that relation.  Now this does add a few cycles
to the processing of each pendingOpsTable entry in mdsync ... but
considering that the major work in that loop is an fsync call, it is
tough to believe that anybody would notice an extra hashtable lookup.

However, I also came up with an entirely different line of thought,
which unfortunately seems incompatible with either of the improved
table designs above.  It is this: instead of having a request queue
that feeds into a hash table hidden within the checkpointer process,
what about storing the pending-fsyncs table as a shared hash table
in shared memory?  That is, ForwardFsyncRequest would not simply
try to add the request to a linear array, but would do a HASH_ENTER
call on a shared hash table.  This means the de-duplication occurs
for free and we no longer need CompactCheckpointerRequestQueue at all.
Basically, this would amount to saying that the original design was
wrong to try to micro-optimize the time spent in ForwardFsyncRequest,
and that we'd rather pay a little more per ForwardFsyncRequest call
to avoid the enormous response-time spike that will occur when
CompactCheckpointerRequestQueue has to run.  (Not to mention that
the checkpointer would eventually have to do HASH_ENTER anyway.)
I think this would address your observation above that the request
queue tends to contain an awful lot of duplicates.

But I only see how to make that work with the existing hash table
structure, because with either of the other table designs, it's
difficult to set a predetermined limit on the amount of shared
memory needed.  The segment-number bitmaps could grow uncomfortably
large in the first design, while in the second there's no good way
to know how large the per-relation table has to be to cover a given
size for the per-fork-and-segment table.  (The sore spot here is that
once we've accepted a per-fork entry, failing to record a relation-level
cancel for it is not an option, so we can't just return failure.)

So if we go that way it seems like we still have the 

Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-17 Thread Greg Smith

On 07/16/2012 02:39 PM, Robert Haas wrote:

Unfortunately, there are lots of important operations (like bulk
loading, SELECT * FROM bigtable, and VACUUM notverybigtable) that
inevitably end up writing out their own dirty buffers.  And even when
the background writer does write something, it's not always clear that
this is a positive thing.  Here's Greg Smith commenting on the
more-is-worse phenonmenon:

http://archives.postgresql.org/pgsql-hackers/2012-02/msg00564.php


You can add crash recovery to the list of things where the interaction 
with the OS write cache matters a lot too, something I just took a 
beating and learned from recently.  Since the recovery process is 
essentially one giant unified backend, how effectively the background 
writer and/or checkpointer move writes from recovery to themselves is 
really important.  It's a bit easier to characterize than a complicated 
mixed set of clients, which has given me a couple of ideas to chase down.


What I've been doing for much of the last month (instead of my original 
plan of reviewing patches) is moving toward the bottom of characterizing 
that under high pressure.  It provides an even easier way to compare 
multiple write strategies at the OS level than regular pgbench-like 
benchmarks.  Recovery playback with a different tuning becomes as simple 
as rolling back to a simple base backup and replaying all the WAL, 
possibly including some number of bulk operations that showed up.  You 
can measure that speed instead of transaction-level throughput.  I'm 
seeing the same ~100% difference in performance between various Linux 
tunings on recovery as I was getting on VACUUM tests, and it's a whole 
lot easier to setup and (ahem) replicate the results.  I'm putting 
together a playback time benchmark based on this observation.


The fact that I have servers all over the place now with 64GB worth of 
RAM has turned the topic of how much dirty memory should be used for 
write caching into a hot item for me again in general too.  If I live 
through 9.3 development, I expect to have a lot more ideas about how to 
deal with this whole area play out in the upcoming months.  I could 
really use a cool day to sit outside thinking about it right now.



Jeff Janes and I came up with what I believe to be a plausible
explanation for the problem:

http://archives.postgresql.org/pgsql-hackers/2012-03/msg00356.php

I kinda think we ought to be looking at fixing that for 9.2, and
perhaps even back-patching further, but nobody else seemed terribly
excited about it.


FYI, I never rejected any of that thinking, I just haven't chewed on 
what you two were proposing.  If that's still something you think should 
be revisited for 9.2, I'll take a longer look at it.  My feeling on this 
so far has really been that the write blocking issues are much larger 
than the exact logic used by the background writer during the code you 
were highlighting, which I always saw as more active/important during 
idle periods.  This whole area needs to get a complete overhaul during 
9.3 though, especially since there are plenty of people who want to fit 
checksum writes into that path too.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-16 Thread Robert Haas
On Sun, Jul 15, 2012 at 2:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I think what we ought to do is bite the bullet and refactor the
 representation of the pendingOps table.  What I'm thinking about
 is reducing the hash key to just RelFileNodeBackend + ForkNumber,
 so that there's one hashtable entry per fork, and then storing a
 bitmap to indicate which segment numbers need to be sync'd.  At
 one gigabyte to the bit, I think we could expect the bitmap would
 not get terribly large.  We'd still have a cancel flag in each
 hash entry, but it'd apply to the whole relation fork not each
 segment.

I think this is a good idea.

 Also, I still wonder if it is worth memorizing fsyncs (under
 fsync=off) that may or may not ever take place.  Is there any
 guarantee that we can make by doing so, that couldn't be made
 otherwise?

 Yeah, you have a point there.  It's not real clear that switching fsync
 from off to on is an operation that we can make any guarantees about,
 short of executing something like the code recently added to initdb
 to force-sync the entire PGDATA tree.  Perhaps we should change fsync
 to be PGC_POSTMASTER (ie frozen at postmaster start), and then we could
 skip forwarding fsync requests when it's off?

I am emphatically opposed to making fsync PGC_POSTMASTER.  Being able
to change parameters on the fly without having to shut down the system
is important, and we should be looking for ways to make it possible to
change more things on-the-fly, not arbitrarily restricting GUCs that
already exist.  This is certainly one I've changed on the fly, and I'm
willing to bet there are real-world users out there who have done the
same (e.g. to survive an unexpected load spike).

I would argue that such a change adds no measure of safety, anyway.
Suppose we have the following sequence of events, starting with
fsync=off:

T0: write
T1: checkpoint (fsync of T0 skipped since fsync=off)
T2: write
T3: fsync=on
T4: checkpoint (fsync of T2 performed)

Why is it OK to fsync the write at T2 but not the one at T0?  In order
for the system to become crash-safe, the user will need to guarantee,
at some point following T3, that the entire OS buffer cache has been
flushed to disk.  Whether or not the fsync of T2 happened is
irrelevant.  Had we chosen not to send an fsync request at all at time
T2, the user's obligations following T3 would be entirely unchanged.
Thus, I see no reason why we need to restrict the fsync setting in
order to implement the proposed optimization.

But, at a broader level, I am not very excited about this
optimization.  It seems to me that if this is hurting enough to be
noticeable, then it's hurting us when fsync=on as well, and we had
maybe think a little harder about how to cut down on the IPC overhead.
 If the bgwriter comm lock is contended, we could partition it - e.g.
by giving each backend a small queue protected by the backendLock,
which is flushed into the main queue when it fills and harvested by
the bgwriter once per checkpoint cycle.  (This is the same principle
as the fast-path locking stuff that we used to eliminate lmgr
contention on short read-only queries in 9.2.)  If we only fix it for
the fsync=off case, then what about people who are running with
fsync=on but have extremely fast fsyncs?  Most of us probably don't
have the hardware to test that today but it's certainly out there and
will probably become more common in the future.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-16 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sun, Jul 15, 2012 at 2:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yeah, you have a point there.  It's not real clear that switching fsync
 from off to on is an operation that we can make any guarantees about,
 short of executing something like the code recently added to initdb
 to force-sync the entire PGDATA tree.  Perhaps we should change fsync
 to be PGC_POSTMASTER (ie frozen at postmaster start), and then we could
 skip forwarding fsync requests when it's off?

 I would argue that such a change adds no measure of safety, anyway.

Well, yes it does, and the reason was explained further down in the
thread: since we have no particular guarantees as to how quickly
postmaster children will absorb postgresql.conf updates, there could be
individual processes still running with fsync = off long after the user
thinks he's turned it on.  A forced restart solves that.  I believe the
reason for the current coding in the fsync queuing stuff is so that you
only have to worry about how long it takes the checkpointer to notice
the GUC change, and not any random backend that's running a forty-hour
query.

 But, at a broader level, I am not very excited about this
 optimization.  It seems to me that if this is hurting enough to be
 noticeable, then it's hurting us when fsync=on as well, and we had
 maybe think a little harder about how to cut down on the IPC overhead.

Uh, that's exactly what's under discussion.  Not sending useless fsync
requests when fsync is off is just one part of it; a part that happens
to be quite useful for some test scenarios, even if not so much for
production.  (IIRC, the original complainant in this thread was running
fsync off.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-16 Thread Robert Haas
On Mon, Jul 16, 2012 at 12:08 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Sun, Jul 15, 2012 at 2:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yeah, you have a point there.  It's not real clear that switching fsync
 from off to on is an operation that we can make any guarantees about,
 short of executing something like the code recently added to initdb
 to force-sync the entire PGDATA tree.  Perhaps we should change fsync
 to be PGC_POSTMASTER (ie frozen at postmaster start), and then we could
 skip forwarding fsync requests when it's off?

 I would argue that such a change adds no measure of safety, anyway.

 Well, yes it does, and the reason was explained further down in the
 thread: since we have no particular guarantees as to how quickly
 postmaster children will absorb postgresql.conf updates, there could be
 individual processes still running with fsync = off long after the user
 thinks he's turned it on.  A forced restart solves that.  I believe the
 reason for the current coding in the fsync queuing stuff is so that you
 only have to worry about how long it takes the checkpointer to notice
 the GUC change, and not any random backend that's running a forty-hour
 query.

Hrmf, I guess that's a fair point.  But if we believe that reasoning
then I think it's an argument for sending fsync requests even when
fsync=off, not for making fsync PGC_POSTMASTER.  Or maybe we could
store the current value of the fsync flag in shared memory somewhere
and have backends check it before deciding whether to enqueue a
request.  With proper use of memory barriers it should be possible to
make this work without requiring a lock.

 But, at a broader level, I am not very excited about this
 optimization.  It seems to me that if this is hurting enough to be
 noticeable, then it's hurting us when fsync=on as well, and we had
 maybe think a little harder about how to cut down on the IPC overhead.

 Uh, that's exactly what's under discussion.  Not sending useless fsync
 requests when fsync is off is just one part of it; a part that happens
 to be quite useful for some test scenarios, even if not so much for
 production.  (IIRC, the original complainant in this thread was running
 fsync off.)

My point is that if sending fsync requests is cheap enough, then not
sending them won't save anything meaningful.  And I don't see why it
can't be made just that cheap, thereby benefiting people with fsync=on
as well.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-16 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Jul 16, 2012 at 12:08 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Uh, that's exactly what's under discussion.  Not sending useless fsync
 requests when fsync is off is just one part of it; a part that happens
 to be quite useful for some test scenarios, even if not so much for
 production.  (IIRC, the original complainant in this thread was running
 fsync off.)

 My point is that if sending fsync requests is cheap enough, then not
 sending them won't save anything meaningful.

Well, that argument is exactly why the code is designed the way it is...
but we are now finding out that sending useless fsync requests isn't as
cheap as all that.

The larger point here, in any case, is that I don't believe anyone wants
to expend a good deal of skull sweat and possibly performance on
ensuring that transitioning from fsync off to fsync on in an active
database is a reliable operation.  It does not seem like something we
are ever going to recommend, and we have surely got nine hundred ninety
nine other things that are more useful to spend development time on.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-16 Thread Robert Haas
On Mon, Jul 16, 2012 at 12:36 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Mon, Jul 16, 2012 at 12:08 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Uh, that's exactly what's under discussion.  Not sending useless fsync
 requests when fsync is off is just one part of it; a part that happens
 to be quite useful for some test scenarios, even if not so much for
 production.  (IIRC, the original complainant in this thread was running
 fsync off.)

 My point is that if sending fsync requests is cheap enough, then not
 sending them won't save anything meaningful.

 Well, that argument is exactly why the code is designed the way it is...
 but we are now finding out that sending useless fsync requests isn't as
 cheap as all that.

I agree, but I think the problem can be solved for a pretty modest
amount of effort without needing to make fsync PGC_POSTMASTER.  Your
proposal to refactor the pendingOpsTable representation seems like it
will help a lot.  Perhaps you should do that first and then we can
reassess.

 The larger point here, in any case, is that I don't believe anyone wants
 to expend a good deal of skull sweat and possibly performance on
 ensuring that transitioning from fsync off to fsync on in an active
 database is a reliable operation.  It does not seem like something we
 are ever going to recommend, and we have surely got nine hundred ninety
 nine other things that are more useful to spend development time on.

We may not recommend it, but I am sure that people will do it anyway,
and requiring them to bounce the server in that situation seems
unfortunate, especially since it will also require them to bounce the
server in order to go the other direction.

In my view, the elephant in the room here is that it's dramatically
inefficient for every backend to send an fsync request on every block
write.  For many users, in many workloads, all of those requests will
be for just a tiny handful of relation segments.  The fsync queue
compaction code works as well as it does for precisely that reason -
when it triggers, we typically can compact a list of thousands or
millions of entries down to less than two dozen.  In other words, as I
see it, the issue here is not so much that 100% of the fsync requests
are useless when fsync=off, but rather that 99.9% of them are useless
even when fsync=on.

In any case, I'm still of the opinion that we ought to try making one
fix (your proposed refactoring of the pendingOpsTable) and then see
where we're at.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-16 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 In my view, the elephant in the room here is that it's dramatically
 inefficient for every backend to send an fsync request on every block
 write.

Yeah.  This was better before the decision was taken to separate
bgwriter from checkpointer; before that, only local communication was
involved for the bulk of write operations (or at least so we hope).
I remain less than convinced that that split was really a great idea.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-16 Thread Robert Haas
On Mon, Jul 16, 2012 at 12:57 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 In my view, the elephant in the room here is that it's dramatically
 inefficient for every backend to send an fsync request on every block
 write.

 Yeah.  This was better before the decision was taken to separate
 bgwriter from checkpointer; before that, only local communication was
 involved for the bulk of write operations (or at least so we hope).
 I remain less than convinced that that split was really a great idea.

Unfortunately, there are lots of important operations (like bulk
loading, SELECT * FROM bigtable, and VACUUM notverybigtable) that
inevitably end up writing out their own dirty buffers.  And even when
the background writer does write something, it's not always clear that
this is a positive thing.  Here's Greg Smith commenting on the
more-is-worse phenonmenon:

http://archives.postgresql.org/pgsql-hackers/2012-02/msg00564.php

Jeff Janes and I came up with what I believe to be a plausible
explanation for the problem:

http://archives.postgresql.org/pgsql-hackers/2012-03/msg00356.php

I kinda think we ought to be looking at fixing that for 9.2, and
perhaps even back-patching further, but nobody else seemed terribly
excited about it.

At any rate, I'm somewhat less convinced that the split was a good
idea than I was when we did it, mostly because we haven't really gone
anywhere with it subsequently.  But I do think there's a good argument
that any process which is responsible for running a system call that
can take 30 seconds to return had better not be responsible for
anything else that matters very much.  If background writing is one of
the things we do that doesn't matter very much, then we need to figure
out what's wrong with it (see above) and make it matter more.  If it
already matters, then it needs to happen continuously and not get
suppressed while other tasks (like long fsyncs) are happening, at
least not without some evidence that such suppression is the right
choice from a performance standpoint.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-16 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Unfortunately, there are lots of important operations (like bulk
 loading, SELECT * FROM bigtable, and VACUUM notverybigtable) that
 inevitably end up writing out their own dirty buffers.  And even when
 the background writer does write something, it's not always clear that
 this is a positive thing.  Here's Greg Smith commenting on the
 more-is-worse phenonmenon:

 http://archives.postgresql.org/pgsql-hackers/2012-02/msg00564.php

 Jeff Janes and I came up with what I believe to be a plausible
 explanation for the problem:

 http://archives.postgresql.org/pgsql-hackers/2012-03/msg00356.php

 I kinda think we ought to be looking at fixing that for 9.2, and
 perhaps even back-patching further, but nobody else seemed terribly
 excited about it.

I'd be fine with back-patching something like that into 9.2 if we had
(a) a patch and (b) experimental evidence that it made things better.
Unless I missed something, we have neither.  Also, I read the above
two messages to say that you, Greg, and Jeff have three different ideas
about exactly what should be done, which is less than comforting for
a last-minute patch...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-16 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 At any rate, I'm somewhat less convinced that the split was a good
 idea than I was when we did it, mostly because we haven't really gone
 anywhere with it subsequently.

BTW, while we are on the subject: hasn't this split completely broken
the statistics about backend-initiated writes?  I don't see anything
in ForwardFsyncRequest that distinguishes whether it's being called in
the bgwriter or a regular backend.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-16 Thread Robert Haas
On Mon, Jul 16, 2012 at 3:18 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 At any rate, I'm somewhat less convinced that the split was a good
 idea than I was when we did it, mostly because we haven't really gone
 anywhere with it subsequently.

 BTW, while we are on the subject: hasn't this split completely broken
 the statistics about backend-initiated writes?

Yes, it seems to have done just that.  The comment for
ForwardFsyncRequest is a few bricks short of a load too:

 * Whenever a backend is compelled to write directly to a relation
 * (which should be seldom, if the checkpointer is getting its job done),
 * the backend calls this routine to pass over knowledge that the relation
 * is dirty and must be fsync'd before next checkpoint.  We also use this
 * opportunity to count such writes for statistical purposes.

Line 2 seems to have been mechanically changed from background
writer to checkpointer, but of course it should still say
background writer in this case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-16 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Yes, it seems to have done just that.  The comment for
 ForwardFsyncRequest is a few bricks short of a load too:
 ...
 Line 2 seems to have been mechanically changed from background
 writer to checkpointer, but of course it should still say
 background writer in this case.

Yeah, found that one already (it's probably my fault).

Will see about fixing the stats in a separate patch.  I just wanted to
know if the issue had been dealt with in some non-obvious fashion.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-15 Thread Tom Lane
Jeff Janes jeff.ja...@gmail.com writes:
 On Thu, Jul 12, 2012 at 9:55 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 The topic was poor performance when truncating lots of small tables
 repeatedly on test environments with fsync=off.
 
 On Thu, Jul 12, 2012 at 6:00 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 I think the problem is in the Fsync Absorption queue.  Every truncate
 adds a FORGET_RELATION_FSYNC to the queue, and processing each one of
 those leads to sequential scanning the checkpointer's pending ops hash
 table, which is quite large.  It is almost entirely full of other
 requests which have already been canceled, but it still has to dig
 through them all.   So this is essentially an N^2 operation.

 The attached patch addresses this problem by deleting the entry when
 it is safe to do so, and flagging it as canceled otherwise.

I don't like this patch at all.  It seems ugly and not terribly safe,
and it won't help at all when the checkpointer is in the midst of an
mdsync scan, which is a nontrivial part of its cycle.

I think what we ought to do is bite the bullet and refactor the
representation of the pendingOps table.  What I'm thinking about
is reducing the hash key to just RelFileNodeBackend + ForkNumber,
so that there's one hashtable entry per fork, and then storing a
bitmap to indicate which segment numbers need to be sync'd.  At
one gigabyte to the bit, I think we could expect the bitmap would
not get terribly large.  We'd still have a cancel flag in each
hash entry, but it'd apply to the whole relation fork not each
segment.

If we did this then the FORGET_RELATION_FSYNC code path could use
a hashtable lookup instead of having to traverse the table
linearly; and that would get rid of the O(N^2) performance issue.
The performance of FORGET_DATABASE_FSYNC might still suck, but
DROP DATABASE is a pretty heavyweight operation anyhow.

I'm willing to have a go at coding this design if it sounds sane.
Comments?

 Also, I still wonder if it is worth memorizing fsyncs (under
 fsync=off) that may or may not ever take place.  Is there any
 guarantee that we can make by doing so, that couldn't be made
 otherwise?

Yeah, you have a point there.  It's not real clear that switching fsync
from off to on is an operation that we can make any guarantees about,
short of executing something like the code recently added to initdb
to force-sync the entire PGDATA tree.  Perhaps we should change fsync
to be PGC_POSTMASTER (ie frozen at postmaster start), and then we could
skip forwarding fsync requests when it's off?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-15 Thread Tom Lane
... btw, in the penny wise and pound foolish department, I observe that
smgrdounlink calls mdunlink separately for each possibly existing fork
of a relation to be dropped.  That means we are queuing a separate fsync
queue entry for each fork, and could immediately save a factor of four
in FORGET_RELATION_FSYNC traffic if we were to redefine those queue
entries as applying to all forks.  The only reason to have a per-fork
variant, AFAICS, is for smgrdounlinkfork(), which is used nowhere and
exists only because I was too chicken to remove the functionality
outright in commit ece01aae479227d9836294b287d872c5a6146a11.  But given
that we know the fsync queue can be a bottleneck, my vote is to refactor
mdunlink to apply to all forks and send only one message.

I am also wondering whether it's really necessary to send fsync request
messages for backend-local relations.  If rnode.backend says it's local,
can't we skip sending the fsync request?  All local relations are
flush-on-crash anyway, no?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-15 Thread Craig Ringer

On 07/16/2012 02:29 AM, Tom Lane wrote:

Yeah, you have a point there.  It's not real clear that switching fsync
from off to on is an operation that we can make any guarantees about,
short of executing something like the code recently added to initdb
to force-sync the entire PGDATA tree.


There's one way that doesn't have any housekeeping cost to Pg. It's 
pretty bad manners if there's anybody other than Pg on the system though:


  sync()

Let the OS do the housekeeping.

It's possible to do something similar on Windows, in that there are 
utilities for the purpose:


  http://technet.microsoft.com/en-us/sysinternals/bb897438.aspx

This probably uses:

  http://msdn.microsoft.com/en-us/library/s9xk9ehd%28VS.71%29.aspx

from COMMODE.OBJ (unfortunate name), which has existed since win98.



Perhaps we should change fsync
to be PGC_POSTMASTER (ie frozen at postmaster start), and then we could
skip forwarding fsync requests when it's off?


Personally, I didn't even know it was runtime switchable.

fsync=off is much less necessary with async commits, group commit via 
commit delay, WAL improvements, etc. To me it's mostly of utility when 
testing, particularly on SSDs. I don't see a DB restart requirement as a 
big issue. It'd be interesting to see what -general has to say, if there 
are people depending on this.


If it's necessary to retain the ability to runtime switch it, making it 
a somewhat rude sync() in exchange for boosted performance the rest of 
the time may well be worthwhile anyway. It'd be interesting to see.


All this talk of synchronisation is making me really frustrated that 
there seems to be very poor support in OSes for syncing a set of files 
in a single pass, potentially saving a lot of time and thrashing. A way 
to relax the ordering guarantee from Files are synced in the order 
fsync() is called on each to files are all synced when this call 
completes would be great. I've been running into this issue in some 
non-Pg-related work and it's been bugging me.


--
Craig Ringer


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-15 Thread Tom Lane
Craig Ringer ring...@ringerc.id.au writes:
 On 07/16/2012 02:29 AM, Tom Lane wrote:
 Yeah, you have a point there.  It's not real clear that switching fsync
 from off to on is an operation that we can make any guarantees about,
 short of executing something like the code recently added to initdb
 to force-sync the entire PGDATA tree.

 There's one way that doesn't have any housekeeping cost to Pg. It's 
 pretty bad manners if there's anybody other than Pg on the system though:
sync()

Yeah, I thought about that: if we could document that issuing a manual
sync after turning fsync on leaves you in a guaranteed-good state once
the sync is complete, it'd probably be fine.  However, I'm not convinced
that we could promise that with a straight face.  In the first place,
PG has only very weak guarantees about how quickly all processes in the
system will absorb a GUC update.  In the second place, I'm not entirely
sure that there aren't race conditions around checkpoints and the fsync
request queue (particularly if we do what Jeff is suggesting and
suppress queuing requests at the upstream end).  It might be all right,
or it might be all right after expending some work, but the whole thing
is not an area where I think anyone wants to spend time.  I think it'd
be much safer to document that the correct procedure is stop the
database, do a manual sync, enable fsync in postgresql.conf, restart the
database.  And if that's what we're documenting, we lose little or
nothing by marking fsync as PGC_POSTMASTER.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-15 Thread Craig Ringer

On 07/16/2012 09:37 AM, Tom Lane wrote:

There's one way that doesn't have any housekeeping cost to Pg. It's
pretty bad manners if there's anybody other than Pg on the system though:
sync()

Yeah, I thought about that: if we could document that issuing a manual
sync after turning fsync on leaves you in a guaranteed-good state once
the sync is complete, it'd probably be fine.  However, I'm not convinced
that we could promise that with a straight face.  In the first place,
PG has only very weak guarantees about how quickly all processes in the
system will absorb a GUC update.  In the second place, I'm not entirely
sure that there aren't race conditions around checkpoints and the fsync
request queue (particularly if we do what Jeff is suggesting and
suppress queuing requests at the upstream end).  It might be all right,
or it might be all right after expending some work, but the whole thing
is not an area where I think anyone wants to spend time.  I think it'd
be much safer to document that the correct procedure is stop the
database, do a manual sync, enable fsync in postgresql.conf, restart the
database.  And if that's what we're documenting, we lose little or
nothing by marking fsync as PGC_POSTMASTER.
Sounds reasonable to me; I tend to view fsync=off as a testing feature 
anyway. Will clone onto -general and see if anyone yells.


--
Craig Ringer



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-14 Thread Jeff Janes
On Thu, Jul 12, 2012 at 9:55 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 I've moved this thread from performance to hackers.

 The topic was poor performance when truncating lots of small tables
 repeatedly on test environments with fsync=off.

 On Thu, Jul 12, 2012 at 6:00 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 I think the problem is in the Fsync Absorption queue.  Every truncate
 adds a FORGET_RELATION_FSYNC to the queue, and processing each one of
 those leads to sequential scanning the checkpointer's pending ops hash
 table, which is quite large.  It is almost entirely full of other
 requests which have already been canceled, but it still has to dig
 through them all.   So this is essentially an N^2 operation.
...

 I'm not sure why we don't just delete the entry instead of marking it
 as cancelled.  It looks like the only problem is that you can't delete
 an entry other than the one just returned by hash_seq_search.  Which
 would be fine, as that is the entry that we would want to delete;
 except that mdsync might have a different hash_seq_search open, and so
 it wouldn't be safe to delete.

The attached patch addresses this problem by deleting the entry when
it is safe to do so, and flagging it as canceled otherwise.

I thought of using has_seq_scans to determine when it is safe, but
dynahash.c does not make that function public, and I was afraid it
might be too slow, anyway.

So instead I used a static variable, plus the knowledge that the only
time there are two scans on the table is when mdsync starts one and
then calls RememberFsyncRequest indirectly.  There is one other place
that does a seq scan, but there is no way for control to pass from
that loop to reach RememberFsyncRequest.

I've added code to disclaim the scan if mdsync errors out.  I don't
think that this should a problem because at that point the scan object
is never going to be used again, so if its internal state gets screwed
up it shouldn't matter.  However, I wonder if it should also call
hash_seq_term, otherwise the pending ops table will be permanently
prevented from expanding (this is a pre-existing condition, not to do
with my patch).  Since I don't know what can make mdsync error out
without being catastrophic, I don't know how to test this out.

One concern is that if the ops table ever does become bloated, it can
never recover while under load.  The bloated table will cause mdsync
to take a long time to run, and as long as mdsync is in the call stack
the antibloat feature is defeated--so we have crossed a tipping point
and cannot get back.  I don't see that occurring in the current use
case, however.  With my current benchmark, the anti-bloat is effective
enough that mdsync never takes very long to execute, so a virtuous
circle exists.

As an aside, the comments in dynahash.c seem to suggest that one can
always delete the entry returned by hash_seq_search, regardless of the
existence of other sequential searches.  I'm pretty sure that this is
not true.  Also, shouldn't this contract about when one is allowed to
delete entries be in the hsearch.h file, rather than the dynahash.c
file?

Also, I still wonder if it is worth memorizing fsyncs (under
fsync=off) that may or may not ever take place.  Is there any
guarantee that we can make by doing so, that couldn't be made
otherwise?


Cheers,

Jeff


FsyncRequest_delete_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-12 Thread Jeff Janes
I've moved this thread from performance to hackers.

The topic was poor performance when truncating lots of small tables
repeatedly on test environments with fsync=off.

On Thu, Jul 12, 2012 at 6:00 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 I think the problem is in the Fsync Absorption queue.  Every truncate
 adds a FORGET_RELATION_FSYNC to the queue, and processing each one of
 those leads to sequential scanning the checkpointer's pending ops hash
 table, which is quite large.  It is almost entirely full of other
 requests which have already been canceled, but it still has to dig
 through them all.   So this is essentially an N^2 operation.

My attached Proof of Concept patch reduces the run time of the
benchmark at the end of this message from 650sec to 84sec,
demonstrating that this is in fact the problem.  Which doesn't mean
that my patch is the right answer to it, of course.

(The delete option is still faster than truncate, coming in at around 55sec)


 I'm not sure why we don't just delete the entry instead of marking it
 as cancelled.  It looks like the only problem is that you can't delete
 an entry other than the one just returned by hash_seq_search.  Which
 would be fine, as that is the entry that we would want to delete;
 except that mdsync might have a different hash_seq_search open, and so
 it wouldn't be safe to delete.

 If the segno was taken out of the hash key and handled some other way,
 then the forgetting could be done with a simple hash look up rather
 than a full scan.

The above two ideas might be the better solution, as they would work
even when fsync=on.  Since BBU are becoming so popular I think the
fsync queue could be a problem even with fsync on if the fsync is fast
enough.  But I don't immediately know how to implement them.

 Maybe we could just turn off the pending ops table altogether when
 fsync=off, but since fsync is PGC_SIGHUP it is not clear how you could
 safely turn it back on.

Now that I think about it, I don't see how turning fsync from off to
on can ever be known to be safe, until a system wide sync has
intervened.  After all a segment that was dirtied and added to the
pending ops table while fsync=off might also be removed from the
pending ops table the microsecond before fsync is turned on, so how is
that different from never adding it in the first place?

The attached Proof Of Concept patch implements this in two ways, one
of which is commented out.  The commented out way omits the overhead
of sending the request to the checkpointer in the first place, but
breaks modularity a bit.

The benchmark used on 9.3devel head is:

fsync=off, all other defaults.

## one time initialization
perl -le 'print create schema foo$_; create table foo$_.foo$_ (k
integer, v integer); $ARGV[0]..$ARGV[0]+$ARGV[1]-1' 0 10 |psql

## actual benchmark.
perl -le 'print set client_min_messages=warning;;
foreach (1..1) {
print BEGIN;\n;
print insert into foo$_.foo$_ select * from
generate_series(1,10);  foreach $ARGV[0]..$ARGV[0]+$ARGV[1]-1;
print COMMIT;\nBEGIN;\n;
print truncate table foo$_.foo$_;  foreach
$ARGV[0]..$ARGV[0]+$ARGV[1]-1;
#print delete from foo$_.foo$_;  foreach
$ARGV[0]..$ARGV[0]+$ARGV[1]-1;
print COMMIT;\n
   }  ' 0 10 | time psql  /dev/null

Cheers,

Jeff


fsync_queue_POC.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers