Re: [HACKERS] Group commit, revised

2012-02-07 Thread Heikki Linnakangas

On 04.02.2012 07:24, Jeff Janes wrote:

Is it safe to assume that, under #ifdef LWLOCK_STATS, a call to
LWLockAcquire will always precede any calls to LWLockWaitUntilFree
when a new process is started, to calloc the stats arrays?



I guess it is right now, because the only user is WALWriteLock, which
would never be acquired before WALInsertLock has been acquired at least
once.  But this doesn't seem very future-proof.


Agreed, we can't count on that. There's no clear single point after a 
process startup where the first lwlock is acquired. Out of curiosity, I 
added an elog(LOG, ...) to that initialization code, to log which lwlock 
is acquired first in a process. It depends on the process and 
circumstances - here's the list I got:


BufFreeListLock
ShmemIndexLock
XidGenLock
ProcArrayLock
BgWriterCommLock
AutoVacuumLock

And that's probably not all, I bet you would acquire different locks 
first with recovery, streaming replication etc.. I didn't test those.


Anyway, I added the initialization to LWLockWaitUntilFree now. Thanks!
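
For readers of the archive, the shape of that fix is roughly the following (a sketch only, with illustrative variable names rather than the committed diff): each entry point that updates the counters has to lazily create the per-backend arrays itself, since no single lwlock call is guaranteed to come first in a new process.

#ifdef LWLOCK_STATS
	/* sketch: make sure this backend's counter arrays exist before use */
	if (counts_for_pid != MyProcPid)
	{
		int			numLocks = NumLWLocks();

		counts_for_pid = MyProcPid;
		sh_acquire_counts = calloc(numLocks, sizeof(int));
		ex_acquire_counts = calloc(numLocks, sizeof(int));
		block_counts = calloc(numLocks, sizeof(int));
	}
	block_counts[lockid]++;		/* e.g. count a sleep on this lock */
#endif   /* LWLOCK_STATS */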


I guess the same complaint could be lodged against LWLockConditionalAcquire.


LWLockConditionalAcquire doesn't update the stats, so it's OK.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Group commit, revised

2012-02-03 Thread Jeff Janes
On Mon, Jan 30, 2012 at 6:57 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 27.01.2012 15:38, Robert Haas wrote:

 On Fri, Jan 27, 2012 at 8:35 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 Yeah, we have to be careful with any overhead in there, it can be a hot
 spot. I wouldn't expect any measurable difference from the above, though.
 Could I ask you to rerun the pgbench tests you did recently with this
 patch?
 Or can you think of a better test for this?


 I can't do so immediately, because I'm waiting for Nate Boley to tell
 me I can go ahead and start testing on that machine again.  But I will
 do it once I get the word.


 I committed this. I ran a pgbench test on an 8-core box and didn't see any
 slowdown. It would still be good if you get a chance to rerun the bigger
 test, but I feel confident that there's no measurable slowdown.

Is it safe to assume that, under #ifdef LWLOCK_STATS, a call to
LWLockAcquire will always precede any calls to LWLockWaitUntilFree
when a new process is started, to calloc the stats arrays?

I guess it is right now, because the only user is WALWriteLock, which
would never be acquired before WALInsertLock has been acquired at least
once.  But this doesn't seem very future-proof.

I guess the same complaint could be lodged against LWLockConditionalAcquire.

Since people wouldn't be expected to define LWLOCK_STATS on production
builds, perhaps this issue is ignorable.

Cheers,

Jeff



Re: [HACKERS] Group commit, revised

2012-01-31 Thread Simon Riggs
On Tue, Jan 31, 2012 at 7:43 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 That seems like a pretty marginal gain. If you're bound by the speed of
 fsyncs, this will reduce the latency by the time it takes to mark the clog,
 which is tiny in comparison to all the other stuff that needs to happen,
 like, flushing the WAL. And that's ignoring any additional overhead caused
 by the signaling between processes. If you're bound by CPU capacity, this
 doesn't help at all because it just moves the work around.

We're not bound by CPU capacity. Latency is an issue, especially when
contention drives it higher with occasional spikes.

I expect this to have a good measurable impact, as well as a
stabilising effect on the latency.

 Anyway, this is quite different from the original goal and patch for group
 commit, so can we please leave this for 9.3, and move on with the review of
 pending 9.2 patches.

Actually, there is very little change here from the original patch.

But I would note that your own changes were also quite different, and
had no noticeable gain. They were also based on a brand new and
radical set of thoughts.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Group commit, revised

2012-01-31 Thread Robert Haas
On Tue, Jan 31, 2012 at 3:02 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Jan 31, 2012 at 7:43 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:

 That seems like a pretty marginal gain. If you're bound by the speed of
 fsyncs, this will reduce the latency by the time it takes to mark the clog,
 which is tiny in comparison to all the other stuff that needs to happen,
 like, flushing the WAL. And that's ignoring any additional overhead caused
 by the signaling between processes. If you're bound by CPU capacity, this
 doesn't help at all because it just moves the work around.

 We're not bound by CPU capacity. Latency is an issue, especially when
 contention drives it higher with occasional spikes.

 I expect this to have a good measurable impact, as well as a
 stabilising effect on the latency.

 Anyway, this is quite different from the original goal and patch for group
 commit, so can we please leave this for 9.3, and move on with the review of
 pending 9.2 patches.

 Actually, there is very little change here from the original patch.

 But I would note that your own changes were also quite different, and
 had no noticeable gain. They were also based on a brand new and
 radical set of thoughts.

I think you're trying to muddy the waters.  Heikki's implementation
was different than yours, and there are some things about it I'm not
100% thrilled with, but it's fundamentally the same concept.  The new
idea you're describing here is something else entirely.  Instead of
focusing on a technical critique of one implementation vs. another
(out of the three we have to choose from), you're looking at cramming
more optimizations into the implementation you prefer.  I'm pretty
sure that Heikki's implementation could support that optimization,
too, if we actually want to do it that way.  But there might be good
reasons not to do it that way: for example, every transaction commit
will have to bump the CLOG page LSN, which will delay setting hint
bits on other transactions on the page in cases where they can now be
set immediately.  In any event, trying to slip it into the group
commit patch will serve only to prevent it from getting the separate
scrutiny which it doubtless deserves.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Group commit, revised

2012-01-31 Thread Peter Geoghegan
On 31 January 2012 14:24, Robert Haas robertmh...@gmail.com wrote:
 I think you're trying to muddy the waters.  Heikki's implementation
 was different than yours, and there are some things about it I'm not
 100% thrilled with, but it's fundamentally the same concept.  The new
 idea you're describing here is something else entirely.  Instead of
 focusing on a technical critique of one implementation vs. another
 (out of the three we have to choose from), you're looking at cramming
 more optimizations into the implementation you prefer.  I'm pretty
 sure that Heikki's implementation could support that optimization,
 too, if we actually want to do it that way.  But there might be good
 reasons not to do it that way: for example, every transaction commit
 will have to bump the CLOG page LSN, which will delay setting hint
 bits on other transactions on the page in cases where they can now be
 set immediately.  In any event, trying to slip it into the group
 commit patch will serve only to prevent it from getting the separate
 scrutiny which it doubtless deserves.

Well, I also think it deserves separate scrutiny, but it's not as if
it can reasonably be argued that it can be isolated from one of those
three implementations. Our immediate goal is to produce a benchmark of a
new patch that operates on the same fundamental principle as the original
patch, though with a much reduced code footprint. We then have a
reasonable basis for comparison: the original benchmark (or possibly a
new benchmark on the original patch, which has seemingly identical
performance characteristics to Heikki's anyway), and the new patch.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Group commit, revised

2012-01-30 Thread Heikki Linnakangas

On 27.01.2012 15:38, Robert Haas wrote:

On Fri, Jan 27, 2012 at 8:35 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

Yeah, we have to be careful with any overhead in there, it can be a hot
spot. I wouldn't expect any measurable difference from the above, though.
Could I ask you to rerun the pgbench tests you did recently with this patch?
Or can you think of a better test for this?


I can't do so immediately, because I'm waiting for Nate Boley to tell
me I can go ahead and start testing on that machine again.  But I will
do it once I get the word.


I committed this. I ran a pgbench test on an 8-core box and didn't see any 
slowdown. It would still be good if you get a chance to rerun the bigger 
test, but I feel confident that there's no measurable slowdown.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Group commit, revised

2012-01-30 Thread Simon Riggs
On Mon, Jan 30, 2012 at 2:57 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 I committed this. I ran a pgbench test on an 8-core box and didn't see any
 slowdown. It would still be good if you get a chance to rerun the bigger
 test, but I feel confident that there's no measurable slowdown.

I asked clearly and specifically for you to hold back committing
anything. Not sure why you would ignore that and commit without
actually asking myself or Peter. On a point of principle alone, I
think you should revert. Working together is difficult if
communication channels are openly ignored and disregarded.

Peter and I have been working on a new version that seems likely to
improve performance over your suggestions. We should be showing
something soon.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Group commit, revised

2012-01-30 Thread Heikki Linnakangas

On 30.01.2012 17:18, Simon Riggs wrote:

On Mon, Jan 30, 2012 at 2:57 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:


I committed this. I ran a pgbench test on an 8-core box and didn't see any
slowdown. It would still be good if you get a chance to rerun the bigger
test, but I feel confident that there's no measurable slowdown.


I asked clearly and specifically for you to hold back committing
anything. Not sure why you would ignore that and commit without
actually asking myself or Peter. On a point of principle alone, I
think you should revert. Working together is difficult if
communication channels are openly ignored and disregarded.


You must be referring to this:

http://archives.postgresql.org/pgsql-hackers/2012-01/msg01406.php

What I committed in the end was quite different from the version that 
message was in reply to, too. If you have a specific objection to the 
patch as committed, please let me know.



Peter and I have been working on a new version that seems likely to
improve performance over your suggestions. We should be showing
something soon.


Please post those ideas, and let's discuss them. If it's something 
simple, maybe we can still sneak them into this release. Otherwise, 
let's focus on the existing patches that are pending review or commit.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Group commit, revised

2012-01-30 Thread Jeff Janes
On Sun, Jan 29, 2012 at 1:20 PM, Greg Smith g...@2ndquadrant.com wrote:
 On 01/28/2012 07:48 PM, Jeff Janes wrote:


 I haven't inspected that deep fall off at 30 clients for the patch.
 By way of reference, if I turn off synchronous commit, I get
 tps=1245.8 which is 100% CPU limited.  This sets a theoretical upper
 bound on what could be achieved by the best possible group committing
 method.


 This sort of thing is why I suspect that to completely isolate some results,
 we're going to need a moderately high end server--with lots of
 cores--combined with an intentionally mismatched slow drive.  It's too easy
 to get pgbench and/or PostgreSQL to choke on something other than I/O when
 using smaller core counts.  I don't think I have anything where the floor is
 24 TPS per client though.  Hmmm...I think I can connect an IDE drive to my
 MythTV box and format it with ext4.  Thanks for the test idea.

 One thing you could try on this system is using the -N "Do not update
 pgbench_tellers and pgbench_branches" option.  That eliminates a lot of the
 contention that might be pulling down your higher core count tests, while
 still giving a completely valid test of whether the group commit mechanism
 works.  Not sure whether that will push up the top-end usefully for you,
 worth a try if you have time to test again.

Adding the -N did eliminate the fall-off at 30 clients for the
group_commit patch.  But I still want to explore why the fall-off
occurs when I get a chance.  I know why the curve would stop going up
without using -N (with -s of 40 and -c of 30, many connections will be
waiting on row locks for updates to pgbench_branches), but that should
cause a leveling off, not a collapse.

Other than the lack of drop-off at 30 clients, -N didn't meaningfully
change anything.  Everyone got slightly faster except at -c1.


 If the group_commit patch goes in, would we then rip out commit_delay
 and commit_siblings?


 The main reason those are still hanging around at all are to allow pushing
 on the latency vs. throughput trade-off on really busy systems.  The use
 case is that you expect, say, 10 clients to constantly be committing at a
 high rate.  So if there's just one committing so far, assume it's the
 leading edge of a wave and pause a bit for the rest to come in.  I don't
 think the cases where this is useful behavior--people both want it and the
 current mechanism provides it--are very common in the real world.

The tests I did are exactly that environment where commit_delay might
be expected to help.  And it did help, but just not all that much.  One
of the problems is that while it does wait for those others to come in,
and it does flush them in one fsync, the others often never get woken up
successfully to realize that they have already been flushed.  They
continue to block.

The group_commit patch, on the other hand, accomplishes exactly what
commit_delay was intended to accomplish but doesn't do a very good job
of.

With the -N option, I also used commit_delay on top of group_commit,
and the difference between the two looked to be within the margin
of error.  So commit_delay did not obviously cause further
improvement.


 It can be
 useful for throughput-oriented benchmarks though, which is why I'd say it
 hasn't been killed off yet.

 We'll have to see whether the final form this makes sense in will usefully
 replace that sort of thing.  I'd certainly be in favor of nuking
 commit_delay and commit_siblings with a better option; it would be nice if
 we don't eliminate this tuning option in the process though.

But I'm pretty sure that group_commit has stolen that thunder.
Obviously a few benchmarks on one system isn't enough to prove that,
though.

The only use case I see left for commit_delay is where it is set on a
per-connection basis rather than system-wide.  Once you start an fsync,
everyone who missed the bus is locked out until the next one.  So
low-priority connections can set commit_delay so as not to trigger the
bus to leave before the high-priority process gets on.  But that seems
like a pretty tenuous use case, and there are better ways to do it.

Cheers,

Jeff



Re: [HACKERS] Group commit, revised

2012-01-30 Thread Simon Riggs
On Mon, Jan 30, 2012 at 3:32 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 30.01.2012 17:18, Simon Riggs wrote:

 I asked clearly and specifically for you to hold back committing
 anything. Not sure why you would ignore that and commit without
 actually asking myself or Peter. On a point of principle alone, I
 think you should revert. Working together is difficult if
 communication channels are openly ignored and disregarded.


 You must be referring to this:

 http://archives.postgresql.org/pgsql-hackers/2012-01/msg01406.php

 What I committed in the end was quite different from the version that was in
 reply to, too. If you have a specific objection to the patch as committed,
 please let me know.

I said, "There is much yet to discuss so please don't think about
committing anything yet."

There's not really any way you could have misinterpreted that.


 Peter and I have been working on a new version that seems likely to
 improve performance over your suggestions. We should be showing
 something soon.


 Please post those ideas, and let's discuss them. If it's something simple,
 maybe we can still sneak them into this release. Otherwise, let's focus on
 the existing patches that are pending review or commit.

If you really did want to discuss it, it would have taken you 5
minutes to check whether there was consensus on the patch before
committing it. Your actions betray the opposite of a willingness to
discuss anything.

Yes, I'd like to discuss ideas, not just ram home a half-discussed and
half-finished patch that happens to do things the way you personally
prefer, overriding all inputs.

Especially when you know we're working on another version.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Group commit, revised

2012-01-30 Thread Heikki Linnakangas

On 30.01.2012 21:55, Simon Riggs wrote:

On Mon, Jan 30, 2012 at 3:32 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

On 30.01.2012 17:18, Simon Riggs wrote:



Peter and I have been working on a new version that seems likely to
improve performance over your suggestions. We should be showing
something soon.


Please post those ideas, and let's discuss them. If it's something simple,
maybe we can still sneak them into this release. Otherwise, let's focus on
the existing patches that are pending review or commit.


If you really did want to discuss it, it would have taken you 5
minutes to check whether there was consensus on the patch before
committing it. Your actions betray the opposite of a willingness to
discuss anything.

Yes, I'd like to discuss ideas, not just ram home a half-discussed and
half-finished patch that happens to do things the way you personally
prefer, overriding all inputs.

Especially when you know we're working on another version.


Sorry. So, what's the approach you're working on?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Group commit, revised

2012-01-30 Thread Simon Riggs
On Mon, Jan 30, 2012 at 8:04 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 So, what's the approach you're working on?

I've had a few days' leave at the end of last week, so there has been no
time to fully discuss the next steps with the patch. That's why you were
requested not to commit anything.

You've suggested there was no reason to have the WALwriter be
involved, which isn't the case, and made other comments about latches
that weren't correct either.

The plan here is to allow the WAL flush and clog updates to occur
concurrently, which allows the clog contention and update time to be
completely hidden behind the wait for the WAL flush. That is only
possible if we have the WALwriter involved, since we need two processes
to be actively involved.

It's a relatively minor change and uses code that is already committed
and working, not some just-invented low-level stuff that might not
work right. You might then ask: why the delay? Simply because my
absence has prevented moving forward. We'll have a patch tomorrow.

The theory behind this is clear, but needs some explanation.

There are 5 actions that need to occur at commit
1) insert WAL record
2) optionally flush WAL record
3) mark the clog AND set LSN from (1) if we skipped (2)
4) optionally wait for sync rep
5) remove the proc from the procarray

Dependencies between those actions are these
Step (3) must always happen before (5) otherwise we get race
conditions in visibility checking.
Step (4) must always happen before (5) otherwise we also get race
conditions in disaster cases.
Step (1) must always happen before (2) if it happens
Step (1) must always happen before (3) if we skipped (2)

Notice that step (2) and step (3) are actually independent of each other.

So an improved design for commit is to
2) request flush up to LSN, but don't wait
3) mark the clog and set LSN
4) wait for LSN once, either for walwriter or walsender to release us

This is free of race conditions as long as step (3) marks each clog
page with a valid LSN, just as we would do for asynchronous commit.

Marking the clog with an LSN ensures that we issue XLogFlush(LSN) on
the clog page before it is written, so we always get WAL flushed to
the desired LSN before the clog mark appears on disk.

Does this cause any other behaviour? No, because the LSN marked on the
clog is always flushed by the time we hit step (5), so there is no
delay in any hint bit setting, or any other effect.

So step (2) requests the flush, which is then performed by the WALwriter.
The backend then performs (3) while the flush takes place, and then waits
at step (4) to be woken.

We only wait once, at step (4), rather than waiting for the flush at step
(2) and then waiting again at step (4).

So we use the existing code path for TransactionIdAsyncCommitTree()
yet we wait at step (4) using the SyncRep code.

Step 5 happens last, as always.

There are two benefits to this approach:
* The clog update happens for free since it is hidden behind a
longer-running task
* The implementation uses already tested and robust code for SyncRep
and AsyncCommit
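
In rough pseudocode, the sequence described above looks something like this
(a sketch only; every function name here is an illustrative placeholder,
not the patch itself):

/* Sketch of the proposed commit sequence; names are placeholders. */
void
CommitSequenceSketch(void)
{
	XLogRecPtr	commitLSN;

	commitLSN = InsertCommitWALRecord();	/* (1) insert the commit record */

	RequestXLogFlushUpTo(commitLSN);		/* (2) ask for the flush, don't wait */

	/*
	 * (3) mark the clog, stamping the clog page with commitLSN exactly as
	 * asynchronous commit does; the LSN mark guarantees the clog page
	 * cannot reach disk before WAL has been flushed that far.
	 */
	MarkCommittedInClog(commitLSN);

	/* (4) wait once, released by the walwriter (or walsender for sync rep) */
	WaitForLSNFlushed(commitLSN);

	/* (5) remove ourselves from the procarray, last as always */
	RemoveSelfFromProcArray();
}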

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Group commit, revised

2012-01-30 Thread Heikki Linnakangas

On 31.01.2012 01:35, Simon Riggs wrote:

The plan here is to allow the WAL flush and clog updates to occur
concurrently, which allows the clog contention and update time to be
completely hidden behind the wait for the WAL flush. That is only
possible if we have the WALwriter involved, since we need two processes
to be actively involved.

...

The theory behind this is clear, but needs some explanation.

There are 5 actions that need to occur at commit
1) insert WAL record
2) optionally flush WAL record
3) mark the clog AND set LSN from (1) if we skipped (2)
4) optionally wait for sync rep
5) remove the proc from the procarray


 ...


Notice that step (2) and step (3) are actually independent of each other.

So an improved design for commit is to
2) request flush up to LSN, but don't wait
3) mark the clog and set LSN
4) wait for LSN once, either for walwriter or walsender to release us


That seems like a pretty marginal gain. If you're bound by the speed of 
fsyncs, this will reduce the latency by the time it takes to mark the 
clog, which is tiny in comparison to all the other stuff that needs to 
happen, like, flushing the WAL. And that's ignoring any additional 
overhead caused by the signaling between processes. If you're bound by 
CPU capacity, this doesn't help at all because it just moves the work 
around.


Anyway, this is quite different from the original goal and patch for 
group commit, so can we please leave this for 9.3, and move on with the 
review of pending 9.2 patches.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Group commit, revised

2012-01-29 Thread Jesper Krogh

On 2012-01-29 01:48, Jeff Janes wrote:

I ran three modes: head, head with commit_delay, and the group_commit patch.

shared_buffers = 600MB
wal_sync_method=fsync

optionally with:
commit_delay=5
commit_siblings=1

pgbench -i -s40

for clients in 1 5 10 15 20 25 30
pgbench -T 30 -M prepared -c $clients -j $clients

ran 5 times each, taking maximum tps from the repeat runs.

The results are impressive.

clients   head   head_commit_delay   group_commit
1         23.9        23.0               23.9
5         31.0        51.3               59.9
10        35.0        56.5               95.7
15        35.6        64.9              136.4
20        34.3        68.7              159.3
25        36.5        64.1              168.8
30        37.2        83.8               71.5

I haven't inspected that deep fall off at 30 clients for the patch.

By way of reference, if I turn off synchronous commit, I get
tps=1245.8 which is 100% CPU limited.  This sets a theoretical upper
bound on what could be achieved by the best possible group committing
method.

If the group_commit patch goes in, would we then rip out commit_delay
and commit_siblings?


Adding to the list of tests that aren't exactly real-world systems, I
decided to repeat Jeff's tests on an Intel(R) Core(TM)2 Duo CPU E7500 @
2.93GHz with 4GB of memory and an Intel X25-M 160GB SSD drive underneath.


clients  Baseline  Commitdelay  Group commit
1   1168.67 1233.33 1212.67
5   2611.33 3022.00 2647.67
10  3044.67 .33 3296.33
15  3153.33 3177.00 3456.67
20  3087.33 3126.33 3618.67
25  2715.00 2359.00 3309.33
30  2736.33 2831.67 2737.67


Numbers are averages over 3 runs.

I have set checkpoint_segments to 30; otherwise the same configuration
as Jeff.

Attached is a graph.

The nice conclusion is: group commit outperforms baseline in all runs
(on this system).


My purpose was actually more to try to quantify the difference between a
single SSD and a single rotating disk.

--
Jesper
attachment: pgbench.png


Re: [HACKERS] Group commit, revised

2012-01-29 Thread Greg Smith

On 01/28/2012 07:48 PM, Jeff Janes wrote:

Others are going to test this out on high-end systems. I wanted to
try it out on the other end of the scale.  I've used a Pentium 4,
3.2GHz,
with 2GB of RAM and with a single IDE drive running ext4.  ext4 is
amazingly bad on IDE, giving about 25 fsync's per second (and it lies
about fdatasync, but apparently not about fsync)


Fantastic - I had to stop and check the date on your message for a second 
there, to make sure it hadn't come from some mail server that's been backed 
up on delivery for the last five years.  I'm cleaning house toward testing 
this out here, and I was going to test on the same system using both fast 
and horribly slow drives.  Both ends of the scale are important, and they 
benefit in very different ways from these changes.



I haven't inspected that deep fall off at 30 clients for the patch.
By way of reference, if I turn off synchronous commit, I get
tps=1245.8 which is 100% CPU limited.  This sets a theoretical upper
bound on what could be achieved by the best possible group committing
method.


This sort of thing is why I suspect that to completely isolate some 
results, we're going to need a moderately high end server--with lots of 
cores--combined with an intentionally mismatched slow drive.  It's too 
easy to get pgbench and/or PostgreSQL to choke on something other than 
I/O when using smaller core counts.  I don't think I have anything where 
the floor is 24 TPS per client though.  Hmmm...I think I can connect an 
IDE drive to my MythTV box and format it with ext4.  Thanks for the test 
idea.


One thing you could try on this system is using the -N "Do not update 
pgbench_tellers and pgbench_branches" option.  That eliminates a lot of the 
contention that might be pulling down your higher core count tests, 
while still giving a completely valid test of whether the group commit 
mechanism works.  Not sure whether that will push up the top-end 
usefully for you, worth a try if you have time to test again.



If the group_commit patch goes in, would we then rip out commit_delay
and commit_siblings?


The main reason those are still hanging around at all is to allow 
pushing on the latency vs. throughput trade-off on really busy systems.  
The use case is that you expect, say, 10 clients to constantly be 
committing at a high rate.  So if there's just one committing so far, 
assume it's the leading edge of a wave and pause a bit for the rest to 
come in.  I don't think the cases where this is useful behavior--people 
both want it and the current mechanism provides it--are very common in 
the real world.  It can be useful for throughput-oriented benchmarks 
though, which is why I'd say it hasn't been killed off yet.


We'll have to see whether the final form this makes sense in will 
usefully replace that sort of thing.  I'd certainly be in favor of 
nuking commit_delay and commit_siblings with a better option; it would 
be nice if we don't eliminate this tuning option in the process though.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Group commit, revised

2012-01-28 Thread Jeff Janes
On Fri, Jan 27, 2012 at 5:35 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 26.01.2012 04:10, Robert Haas wrote:


 I think you should break this off into a new function,
 LWLockWaitUntilFree(), rather than treating it as a new LWLockMode.
 Also, instead of adding lwWaitOnly, I would suggest that we generalize
 lwWaiting and lwExclusive into a field lwWaitRequest, which can be set
 to 1 for exclusive, 2 for shared, 3 for wait-for-free, etc.  That way
 we don't have to add another boolean every time someone invents a new
 type of wait - not that that should hopefully happen very often.  A
 side benefit of this is that you can simplify the test in
 LWLockRelease(): keep releasing waiters until you come to an exclusive
 waiter, then stop.  This possibly saves one shared memory fetch and
 subsequent test instruction, which might not be trivial - all of this
 code is extremely hot.


 Makes sense. Attached is a new version, doing exactly that.

Others are going to test this out on high-end systems.  I wanted to
try it out on the other end of the scale.  I've used a Pentium 4,
3.2GHz,
with 2GB of RAM and with a single IDE drive running ext4.  ext4 is
amazingly bad on IDE, giving about 25 fsync's per second (and it lies
about fdatasync, but apparently not about fsync)

I ran three modes: head, head with commit_delay, and the group_commit patch.

shared_buffers = 600MB
wal_sync_method=fsync

optionally with:
commit_delay=5
commit_siblings=1

pgbench -i -s40

for clients in 1 5 10 15 20 25 30
pgbench -T 30 -M prepared -c $clients -j $clients

ran 5 times each, taking maximum tps from the repeat runs.

The results are impressive.

clients   head   head_commit_delay   group_commit
1         23.9        23.0               23.9
5         31.0        51.3               59.9
10        35.0        56.5               95.7
15        35.6        64.9              136.4
20        34.3        68.7              159.3
25        36.5        64.1              168.8
30        37.2        83.8               71.5

I haven't inspected that deep fall off at 30 clients for the patch.

By way of reference, if I turn off synchronous commit, I get
tps=1245.8 which is 100% CPU limited.  This sets a theoretical upper
bound on what could be achieved by the best possible group committing
method.

If the group_commit patch goes in, would we then rip out commit_delay
and commit_siblings?



Cheers,

Jeff
attachment: graph.png


Re: [HACKERS] Group commit, revised

2012-01-27 Thread Heikki Linnakangas

On 26.01.2012 04:10, Robert Haas wrote:

On Wed, Jan 25, 2012 at 3:11 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

Attached is a patch to do that. It adds a new mode to
LWLockConditionalAcquire(), LW_EXCLUSIVE_BUT_WAIT. If the lock is free, it
is acquired and the function returns immediately. However, unlike normal
LWLockConditionalAcquire(), if the lock is not free it goes to sleep until
it is released. But unlike normal LWLockAcquire(), when the lock is
released, the function returns *without* acquiring the lock.
 ...


I think you should break this off into a new function,
LWLockWaitUntilFree(), rather than treating it as a new LWLockMode.
Also, instead of adding lwWaitOnly, I would suggest that we generalize
lwWaiting and lwExclusive into a field lwWaitRequest, which can be set
to 1 for exclusive, 2 for shared, 3 for wait-for-free, etc.  That way
we don't have to add another boolean every time someone invents a new
type of wait - not that that should hopefully happen very often.  A
side benefit of this is that you can simplify the test in
LWLockRelease(): keep releasing waiters until you come to an exclusive
waiter, then stop.  This possibly saves one shared memory fetch and
subsequent test instruction, which might not be trivial - all of this
code is extremely hot.


Makes sense. Attached is a new version, doing exactly that.


 We probably want to benchmark this carefully
to make sure that it doesn't cause a performance regression even just
from this:

+   if (!proc->lwWaitOnly)
+   lock->releaseOK = false;

I know it sounds crazy, but I'm not 100% sure that that additional
test there is cheap enough not to matter.  Assuming it is, though, I
kind of like the concept: I think there must be other places where
these semantics are useful.


Yeah, we have to be careful with any overhead in there, it can be a hot 
spot. I wouldn't expect any measurable difference from the above, 
though. Could I ask you to rerun the pgbench tests you did recently with 
this patch? Or can you think of a better test for this?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***
*** 327,333  MarkAsPreparing(TransactionId xid, const char *gid,
  	proc->databaseId = databaseid;
  	proc->roleId = owner;
  	proc->lwWaiting = false;
! 	proc->lwExclusive = false;
  	proc->lwWaitLink = NULL;
  	proc->waitLock = NULL;
  	proc->waitProcLock = NULL;
--- 327,333 
  	proc->databaseId = databaseid;
  	proc->roleId = owner;
  	proc->lwWaiting = false;
! 	proc->lwWaitMode = 0;
  	proc->lwWaitLink = NULL;
  	proc->waitLock = NULL;
  	proc->waitProcLock = NULL;
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 2118,2140  XLogFlush(XLogRecPtr record)
  	/* initialize to given target; may increase below */
  	WriteRqstPtr = record;
  
! 	/* read LogwrtResult and update local state */
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		SpinLockAcquire(xlogctl->info_lck);
  		if (XLByteLT(WriteRqstPtr, xlogctl->LogwrtRqst.Write))
  			WriteRqstPtr = xlogctl->LogwrtRqst.Write;
  		LogwrtResult = xlogctl->LogwrtResult;
  		SpinLockRelease(xlogctl->info_lck);
- 	}
  
! 	/* done already? */
! 	if (!XLByteLE(record, LogwrtResult.Flush))
! 	{
! 		/* now wait for the write lock */
! 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  		LogwrtResult = XLogCtl->Write.LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
--- 2118,2160 
  	/* initialize to given target; may increase below */
  	WriteRqstPtr = record;
  
! 	/*
! 	 * Now wait until we get the write lock, or someone else does the
! 	 * flush for us.
! 	 */
! 	for (;;)
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
+ 		/* read LogwrtResult and update local state */
 		SpinLockAcquire(xlogctl->info_lck);
  		if (XLByteLT(WriteRqstPtr, xlogctl->LogwrtRqst.Write))
  			WriteRqstPtr = xlogctl->LogwrtRqst.Write;
  		LogwrtResult = xlogctl->LogwrtResult;
  		SpinLockRelease(xlogctl->info_lck);
  
! 		/* done already? */
! 		if (XLByteLE(record, LogwrtResult.Flush))
! 			break;
! 
! 		/*
! 		 * Try to get the write lock. If we can't get it immediately, wait
! 		 * until it's released, and recheck if we still need to do the flush
! 		 * or if the backend that held the lock did it for us already. This
! 		 * helps to maintain a good rate of group committing when the system
! 		 * is bottlenecked by the speed of fsyncing.
! 		 */
! 		if (!LWLockWaitUntilFree(WALWriteLock, LW_EXCLUSIVE))
! 		{
! 			/*
! 			 * The lock is now free, but we didn't acquire it yet. Before we
! 			 * do, loop back to check if someone else flushed the record for
! 			 * us already.
! 			 */
! 			continue;
! 		}
! 		/* Got the 

Re: [HACKERS] Group commit, revised

2012-01-27 Thread Robert Haas
On Fri, Jan 27, 2012 at 8:35 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Yeah, we have to be careful with any overhead in there, it can be a hot
 spot. I wouldn't expect any measurable difference from the above, though.
 Could I ask you to rerun the pgbench tests you did recently with this patch?
 Or can you think of a better test for this?

I can't do so immediately, because I'm waiting for Nate Boley to tell
me I can go ahead and start testing on that machine again.  But I will
do it once I get the word.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Group commit, revised

2012-01-25 Thread Heikki Linnakangas
I've been thinking, what exactly is the important part of this group 
commit patch that gives the benefit? Keeping the queue sorted isn't all 
that important - XLogFlush() requests for commits will come in almost 
the correct order anyway.


I also don't much like the division of labour between groupcommit.c and 
xlog.c. XLogFlush() calls GroupCommitWaitForLSN(), which calls back 
DoXLogFlush(), which is a bit hard to follow. (that's in my latest 
version; the original patch had similar division but it also cut across 
processes, with the wal writer actually doing the flush)


It occurs to me that the WALWriteLock already provides much of the same 
machinery we're trying to build here. It provides FIFO-style queuing, 
with the capability to wake up the next process or processes in the 
queue. Perhaps all we need is some new primitive to LWLock, to make it 
do exactly what we need.


Attached is a patch to do that. It adds a new mode to 
LWLockConditionalAcquire(), LW_EXCLUSIVE_BUT_WAIT. If the lock is free, 
it is acquired and the function returns immediately. However, unlike 
normal LWLockConditionalAcquire(), if the lock is not free it goes to 
sleep until it is released. But unlike normal LWLockAcquire(), when the 
lock is released, the function returns *without* acquiring the lock.


I modified XLogFlush() to use that new function for WALWriteLock. It 
tries to get WALWriteLock, but if it's not immediately free, it waits 
until it is released. Then, before trying to acquire the lock again, it 
rechecks LogwrtResult to see if the record was already flushed by the 
process that released the lock.


This is a much smaller patch than the group commit patch, and in my 
pgbench-tools tests on my laptop, it has the same effect on performance. 
The downside is that it touches lwlock.c, which is a change at a lower 
level. Robert's flexlocks patch probably would've been useful here.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ce659ec..d726a98 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2066,23 +2066,42 @@ XLogFlush(XLogRecPtr record)
 	/* initialize to given target; may increase below */
 	WriteRqstPtr = record;
 
-	/* read LogwrtResult and update local state */
+	/*
+	 * Now wait until we get the write lock, or someone else does the
+	 * flush for us.
+	 */
+	for (;;)
 	{
 		/* use volatile pointer to prevent code rearrangement */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
+		/* read LogwrtResult and update local state */
 		SpinLockAcquire(xlogctl->info_lck);
 		if (XLByteLT(WriteRqstPtr, xlogctl->LogwrtRqst.Write))
 			WriteRqstPtr = xlogctl->LogwrtRqst.Write;
 		LogwrtResult = xlogctl->LogwrtResult;
 		SpinLockRelease(xlogctl->info_lck);
-	}
 
-	/* done already? */
-	if (!XLByteLE(record, LogwrtResult.Flush))
-	{
-		/* now wait for the write lock */
-		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+		/* done already? */
+		if (XLByteLE(record, LogwrtResult.Flush))
+			break;
+
+		/*
+		 * Try to get the write lock. If we can't get it immediately, wait
+		 * until it's released, but before actually acquiring it, loop back
+		 * to check if the backend holding the lock did the flush for us
+		 * already. This helps to maintain a good rate of group committing
+		 * when the system is bottlenecked by the speed of fsyncing.
+		 */
+		if (!LWLockConditionalAcquire(WALWriteLock, LW_EXCLUSIVE_BUT_WAIT))
+		{
+			/*
+			 * Didn't get the lock straight away, but we might be done
+			 * already
+			 */
+			continue;
+		}
+		/* Got the lock */
 		LogwrtResult = XLogCtl->Write.LogwrtResult;
 		if (!XLByteLE(record, LogwrtResult.Flush))
 		{
@@ -2111,6 +2130,8 @@ XLogFlush(XLogRecPtr record)
 			XLogWrite(WriteRqst, false, false);
 		}
 		LWLockRelease(WALWriteLock);
+		/* done */
+		break;
 	}
 
 	END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index cc41568..488f573 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -430,6 +430,7 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 			elog(PANIC, "cannot wait without a PGPROC structure");
 
 		proc->lwWaiting = true;
+		proc->lwWaitOnly = false;
 		proc->lwExclusive = (mode == LW_EXCLUSIVE);
 		proc->lwWaitLink = NULL;
 		if (lock->head == NULL)
@@ -496,7 +497,9 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 /*
  * LWLockConditionalAcquire - acquire a lightweight lock in the specified mode
  *
- * If the lock is not available, return FALSE with no side-effects.
+ * If the lock is not available, return FALSE with no side-effects. In
+ * LW_EXCLUSIVE_BUT_WAIT mode, if the lock is not available, waits until it
+ * is available, but then returns false without acquiring it.
  *
  * If successful, cancel/die interrupts are held off until lock release.
  */
@@ -504,7 +507,9 @@ bool
 LWLockConditionalAcquire(LWLockId lockid, 

Re: [HACKERS] Group commit, revised

2012-01-25 Thread Robert Haas
On Wed, Jan 25, 2012 at 3:11 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 I've been thinking, what exactly is the important part of this group commit
 patch that gives the benefit? Keeping the queue sorted isn't all that
 important - XLogFlush() requests for commits will come in almost the correct
 order anyway.

 I also don't much like the division of labour between groupcommit.c and
 xlog.c. XLogFlush() calls GroupCommitWaitForLSN(), which calls back
 DoXLogFlush(), which is a bit hard to follow. (that's in my latest version;
 the original patch had similar division but it also cut across processes,
 with the wal writer actually doing the flush)

 It occurs to me that the WALWriteLock already provides much of the same
 machinery we're trying to build here. It provides FIFO-style queuing, with
 the capability to wake up the next process or processes in the queue.
 Perhaps all we need is some new primitive to LWLock, to make it do exactly
 what we need.

 Attached is a patch to do that. It adds a new mode to
 LWLockConditionalAcquire(), LW_EXCLUSIVE_BUT_WAIT. If the lock is free, it
 is acquired and the function returns immediately. However, unlike normal
 LWLockConditionalAcquire(), if the lock is not free it goes to sleep until
 it is released. But unlike normal LWLockAcquire(), when the lock is
 released, the function returns *without* acquiring the lock.

 I modified XLogFlush() to use that new function for WALWriteLock. It tries
 to get WALWriteLock, but if it's not immediately free, it waits until it is
 released. Then, before trying to acquire the lock again, it rechecks
 LogwrtResult to see if the record was already flushed by the process that
 released the lock.

 This is a much smaller patch than the group commit patch, and in my
 pgbench-tools tests on my laptop, it has the same effect on performance. The
 downside is that it touches lwlock.c, which is a change at a lower level.
 Robert's flexlocks patch probably would've been useful here.

I think you should break this off into a new function,
LWLockWaitUntilFree(), rather than treating it as a new LWLockMode.
Also, instead of adding lwWaitOnly, I would suggest that we generalize
lwWaiting and lwExclusive into a field lwWaitRequest, which can be set
to 1 for exclusive, 2 for shared, 3 for wait-for-free, etc.  That way
we don't have to add another boolean every time someone invents a new
type of wait - not that that should hopefully happen very often.  A
side benefit of this is that you can simplify the test in
LWLockRelease(): keep releasing waiters until you come to an exclusive
waiter, then stop.  This possibly saves one shared memory fetch and
subsequent test instruction, which might not be trivial - all of this
code is extremely hot.  We probably want to benchmark this carefully
to make sure that it doesn't cause a performance regression even just
from this:

+   if (!proc->lwWaitOnly)
+   lock->releaseOK = false;

I know it sounds crazy, but I'm not 100% sure that that additional
test there is cheap enough not to matter.  Assuming it is, though, I
kind of like the concept: I think there must be other places where
these semantics are useful.
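
A rough illustration of that generalization (a sketch only; the enum values
and the release loop below are illustrative, not the committed code, and
details such as releaseOK handling and waking a leading exclusive waiter
are omitted):

/* one wait-mode field on the PGPROC instead of the lwWaiting/lwExclusive pair */
typedef enum LWWaitMode
{
	LW_WAIT_NONE = 0,
	LW_WAIT_EXCLUSIVE = 1,		/* waiting to acquire exclusively */
	LW_WAIT_SHARED = 2,			/* waiting to acquire shared */
	LW_WAIT_UNTIL_FREE = 3		/* wait for release, don't acquire */
} LWWaitMode;

/* in LWLockRelease(), the wakeup scan then collapses to a single loop: */
proc = lock->head;
while (proc != NULL && proc->lwWaitMode != LW_WAIT_EXCLUSIVE)
{
	PGPROC	   *next = proc->lwWaitLink;

	proc->lwWaitMode = LW_WAIT_NONE;
	PGSemaphoreUnlock(&proc->sem);	/* wake shared / wait-until-free waiters,
									 * stop at the first exclusive waiter */
	proc = next;
}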

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Group commit, revised

2012-01-24 Thread Simon Riggs
On Fri, Jan 20, 2012 at 10:30 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 I spent some time cleaning this up. Details below, but here are the
 highlights:

 * Reverted the removal of wal_writer_delay
 * Removed heuristic on big flushes

No contested viewpoints on anything there.

 * Doesn't rely on WAL writer. Any process can act as the leader now.
 * Uses PGSemaphoreLock/Unlock instead of latches

Thanks for producing an alternate version; it will allow us to comment
on various approaches.

There is much yet to discuss so please don't think about committing
anything yet.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Group commit, revised

2012-01-20 Thread Robert Haas
On Thu, Jan 19, 2012 at 10:46 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 On 19 January 2012 17:40, Robert Haas robertmh...@gmail.com wrote:
 I don't know what you mean by this.  I think removing wal_writer_delay
 is premature, because I think it still may have some utility, and the
 patch removes it.  That's a separate change that should be factored
 out of this patch and discussed separately.

 FWIW, I don't really care too much if we keep wal_writer_delay,
 provided it is only used in place of the patch's
 WALWRITER_NORMAL_TIMEOUT constant. I will note in passing that I doubt
 that the effect with asynchronous commits and hint bits is pronounced
 enough to have ever transferred through to someone making a practical
 recommendation to reduce wal_writer_delay to ameliorate clog
 contention.

It was very visible in some benchmarking Heikki did, and I was able to
reproduce it.

 Yeah, I guess the shutdown sequence could get a bit complex if we try
 to make everyone go through the WAL writer all the time.  But I wonder
 if we could rejigger things somehow so that everything goes through the
 WAL writer if it's dead.

 I'm not sure what you mean by this last bit. It sounds a bit hazardous.

That last "if" was supposed to say "unless", which may contribute to
the confusion.

 My problem with nominating a backend to the status of an auxiliary is
 that no matter what way you cut it, it increases the failure surface
 area, so to speak.

I think that's the wrong way of thinking about it.  Imagine this: we
maintain a queue of people who are waiting on the current WAL flush,
the current-flush-to LSN, and a queue of people who are waiting on the
next WAL flush, and a leader.  All this data is protected by a
spinlock.  When you want to flush WAL, you grab the spinlock.  If the
current-flush-to LSN is greater than the LSN you need, you add
yourself to the waiting-on-current-flush queue, release the spinlock,
and go to sleep.  Otherwise, if there's no leader, you become the
leader, enter your flush LSN as the current-flush-to-LSN, and release
the spinlock.  If there is a leader, you add yourself to the
waiting-on-next-flush queue, release the spinlock, and sleep.

If you become the leader, you perform the flush.  Then you retake the
spinlock, dequeue anyone waiting on the current flush, move all of the
next flush waiters (or those within a certain LSN distance) to the
current flush list, remember who is at the head of that new queue, and
release the spinlock.  You then set a flag in the PGPROC of the
backend now at the head of the next-flush queue and wake him up; he
checks that flag on waking to know whether he is the leader or whether
he's just been woken because his flush is done.  After waking him so
the next flush can get started, you wake all the people who were
waiting on the flush you already did.
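
In pseudocode, that scheme reads roughly as follows (a sketch with invented
names, not a patch; details such as rechecking whether the completed flush
already covers the new leader are omitted):

void
GroupFlushSketch(XLogRecPtr myLSN)
{
	PGPROC	   *newLeader;
	WaiterList	done;

	SpinLockAcquire(&gc->mutex);
	if (gc->leader != NULL && XLByteLE(myLSN, gc->currentFlushTo))
	{
		/* the flush already in progress covers us, so just wait for it */
		pushTail(&gc->currentWaiters, MyProc);
		SpinLockRelease(&gc->mutex);
		sleepUntilWoken();
		return;
	}
	if (gc->leader == NULL)
	{
		/* no leader: become it and flush on behalf of everyone queued */
		gc->leader = MyProc;
		gc->currentFlushTo = myLSN;
		SpinLockRelease(&gc->mutex);

		performFlush(myLSN);					/* the actual write + fsync */

		SpinLockAcquire(&gc->mutex);
		done = gc->currentWaiters;				/* these are now flushed */
		gc->currentWaiters = gc->nextWaiters;	/* promote next-flush waiters */
		gc->nextWaiters = emptyList();
		newLeader = headOf(gc->currentWaiters);
		gc->leader = newLeader;
		SpinLockRelease(&gc->mutex);

		if (newLeader != NULL)
			wakeAsLeader(newLeader);	/* start the next flush first... */
		wakeAll(done);					/* ...then wake those we flushed */
	}
	else
	{
		/* a flush is running but doesn't reach our LSN; queue for the next one */
		pushTail(&gc->nextWaiters, MyProc);
		SpinLockRelease(&gc->mutex);
		sleepUntilWoken();				/* a flag in our PGPROC says whether we
										 * woke as the new leader or as done */
	}
}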

This may or may not be a good design, but I don't think it has any
more failure surface area than what you have here.  In particular,
whether or not the WAL writer is running doesn't matter; the system
can run just fine without it, and can even still do group commit.  To
look at it another way, it's not a question of whether we're treating
a regular backend as an auxiliary process; there's no rule anywhere
that backends can't be dependent on the proper operation of other
backends for proper functioning - there are MANY places that have that
property, including LWLockAcquire() and LWLockRelease().  Nor is there
any rule that background processes are more reliable than foreground
processes, nor do I believe they are.  Much of the existing patch's
failure surface seems to me to come from the coordination it requires
between ordinary backends and the background writer, and possible race
conditions appertaining thereto: WAL writer dies, backend dies,
postmaster dies, postmaster and WAL writer die together, etc.

 +       if (delta > XLOG_SEG_SIZE * CheckPointSegments ||
 +                       !ProcGlobal->groupCommitAvailable)

 That seems like a gigantic value.  I would think that we'd want to
 forget about group commit any time we're behind by more than one
 segment (maybe less).

 I'm sure that you're right - I myself described it as horrible in my
 original mail. I think that whatever value we set this to ought to be
 well reasoned. Right now, your magic number doesn't seem much better
 than my bogus heuristic (which only ever served as a placeholder
 implementation that hinted at a well-principled formula). What's your
 basis for suggesting that that limit would always be close to optimal?

It's probably not - I suspect even a single WAL segment is still too
large.  I'm pretty sure that it would be safe to always flush up to
the next block boundary, because we always write the whole block
anyway even if it's only partially filled.  So there's no possible
regression there.  Anything larger than the end of the current 8kB
block is going to be a trade-off between latency and throughput, and
it seems to me that there's 

Re: [HACKERS] Group commit, revised

2012-01-20 Thread Heikki Linnakangas
I spent some time cleaning this up. Details below, but here are the 
highlights:


* Reverted the removal of wal_writer_delay
* Doesn't rely on WAL writer. Any process can act as the leader now.
* Removed heuristic on big flushes
* Uses PGSemaphoreLock/Unlock instead of latches

On 20.01.2012 17:30, Robert Haas wrote:

On Thu, Jan 19, 2012 at 10:46 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:

+   if (delta > XLOG_SEG_SIZE * CheckPointSegments ||
+   !ProcGlobal->groupCommitAvailable)

That seems like a gigantic value.  I would think that we'd want to
forget about group commit any time we're behind by more than one
segment (maybe less).


I'm sure that you're right - I myself described it as horrible in my
original mail. I think that whatever value we set this to ought to be
well reasoned. Right now, your magic number doesn't seem much better
than my bogus heuristic (which only ever served as a placeholder
implementation that hinted at a well-principled formula). What's your
basis for suggesting that that limit would always be close to optimal?


It's probably not - I suspect even a single WAL segment is still too
large.  I'm pretty sure that it would be safe to always flush up to
the next block boundary, because we always write the whole block
anyway even if it's only partially filled.  So there's no possible
regression there.  Anything larger than the end of the current 8kB
block is going to be a trade-off between latency and throughput, and
it seems to me that there's probably only one way to figure out what's
reasonable: test a bunch of different values and see which ones
perform well in practice.  Maybe we could answer part of the question
by writing a test program which does repeated sequential writes of
increasingly-large chunks of data.  We might find out for example that
64kB is basically the same as 8kB because most of the overhead is in
the system call anyway, but beyond that the curve kinks.  Or it may be
that 16kB is already twice as slow as 8kB, or that you can go up to
1MB without a problem.  I don't see a way to know that without testing
it on a couple different pieces of hardware and seeing what happens.
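
For concreteness, the block-boundary idea amounts to rounding the flush
request up, roughly like this (a sketch assuming, for simplicity, a flat
64-bit LSN rather than 9.1's xlogid/xrecoff pair; XLOG_BLCKSZ is the usual
8kB WAL block size):

/* Round a requested flush position up to the end of its WAL block. */
static inline uint64
RoundFlushRequestUpToBlock(uint64 lsn)
{
	return ((lsn + XLOG_BLCKSZ - 1) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
}

/*
 * A commit record ending 100 bytes into a block thus asks for the whole
 * 8192-byte block, which the write path emits in full anyway, so there is
 * no extra I/O, only the chance to cover later commits in the same block.
 */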


I ripped away that heuristic for a flush that's too large. After 
pondering it for a while, I came to the conclusion that as implemented 
in the patch, it was pointless. The thing is, if the big flush doesn't 
go through the group commit machinery, it's going to grab the 
WALWriteLock straight away. Before any smaller commits can proceed, they 
will need to grab that lock anyway, so the effect is the same as if the 
big flush had just joined the queue.


Maybe we should have a heuristic to split a large flush into smaller 
chunks. The WAL segment boundary would be a quite natural split point, 
for example, because when crossing the file boundary you have to issue 
separate fsync()s for the files anyway. But I'm happy with leaving that 
out for now, it's not any worse than it used to be without group commit 
anyway.
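
The kind of split described above could look roughly like this sketch
(all names here are made up for illustration - WalSegSize, DoWriteAndFsync()
and the 64-bit LSN type are not from the patch):

#include <stdint.h>

#define WalSegSize  ((uint64_t) 16 * 1024 * 1024)   /* default 16MB segments */

static void
DoWriteAndFsync(uint64_t upto)
{
    /* the real thing would write() buffered WAL up to 'upto' and fsync() */
    (void) upto;
}

/*
 * Flush in chunks that never cross a WAL segment boundary, so each segment
 * file gets its own fsync and commits waiting for an earlier chunk could be
 * woken as soon as that chunk is durable.
 */
static void
FlushInSegmentChunks(uint64_t flushedUpTo, uint64_t target)
{
    while (flushedUpTo < target)
    {
        /* end of the segment that flushedUpTo currently falls in */
        uint64_t    segEnd = (flushedUpTo / WalSegSize + 1) * WalSegSize;
        uint64_t    chunkEnd = (target < segEnd) ? target : segEnd;

        DoWriteAndFsync(chunkEnd);
        flushedUpTo = chunkEnd;
    }
}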



Ugh.  Our current system doesn't even require this code to be
interruptible, I think: you can't receive cancel or die interrupts
either while waiting for an LWLock or while holding it.


Right. Or within HOLD/RESUME_INTERRUPTS blocks.

The patch added some CHECK_FOR_INTERRUPTS() calls into various places in 
the commit/abort codepaths, but that's all dead code; they're all within 
a HOLD/RESUME_INTERRUPTS blocks.


I replaced the usage of latches with the more traditional 
PGSemaphoreLock/Unlock. The semaphore model works just fine in this case, 
where we have a lwlock to guard the wait queue, and when a process is 
waiting we know it will be woken up or something messed up at a pretty 
low level. We don't need a timeout or to wake up at other signals while 
waiting. Furthermore, the WAL writer didn't have a latch before this 
patch; it's not many lines of code to initialize the latch and set up 
the signal handler for it, but it already has a semaphore that's ready 
to use.
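
What makes the bare semaphore sufficient is essentially the protocol in
the sketch below, where POSIX sem_t and pthread_mutex_t stand in for the
per-process PGPROC semaphore and for the lwlock guarding the wait queue
(the Waiter struct and function names are invented, and the real code
uses PGSemaphoreLock/PGSemaphoreUnlock rather than sem_wait/sem_post):

#include <pthread.h>
#include <semaphore.h>
#include <stdbool.h>

typedef struct Waiter
{
    sem_t       sem;        /* this process's own semaphore, sem_init'd once */
    bool        done;       /* protected by queue_lock */
    struct Waiter *next;
} Waiter;

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* called after 'me' has been linked into the wait queue under queue_lock */
static void
wait_for_flush(Waiter *me)
{
    pthread_mutex_lock(&queue_lock);
    while (!me->done)
    {
        pthread_mutex_unlock(&queue_lock);
        sem_wait(&me->sem);     /* no timeout: the flusher is bound to wake us */
        pthread_mutex_lock(&queue_lock);
    }
    pthread_mutex_unlock(&queue_lock);
}

/* called by the flusher with queue_lock held, after unlinking 'w' */
static void
wake_waiter(Waiter *w)
{
    w->done = true;             /* set before posting, still under the lock */
    sem_post(&w->sem);
}

Because done only changes under the queue lock and the semaphore is posted
after it is set, a wakeup cannot be lost, which is why neither a timeout
nor any other signal handling is needed while waiting.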


I wonder if we should rename the file into xlogflush.c or something 
like that, to make it clear that this works with any XLOG flushes, not 
just commits? Group commit is the usual term for this feature, so we 
should definitely mention that in the comments, but it might be better 
to rename the files/functions to emphasize that this is about WAL 
flushing in general.


This probably needs some further cleanup to fix things I've broken, and 
I haven't done any performance testing, but it's progress. Do you have a 
shell script or something that you used for the performance tests that I 
could run? Or would you like to re-run the tests you did earlier with 
this patch?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 1760,1809  SET ENABLE_SEQSCAN TO OFF;
/listitem
   /varlistentry
  
-  varlistentry id=guc-commit-delay xreflabel=commit_delay
-   

Re: [HACKERS] Group commit, revised

2012-01-20 Thread Peter Geoghegan
On 20 January 2012 22:30, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Maybe we should have a heuristic to split a large flush into smaller chunks.
 The WAL segment boundary would be a quite natural split point, for example,
 because when crossing the file boundary you have to issue separate fsync()s
 for the files anyway. But I'm happy with leaving that out for now, it's not
 any worse than it used to be without group commit anyway.

Let's quantify how much of a problem that is first.

 The patch added some CHECK_FOR_INTERRUPTS() calls into various places in the
 commit/abort codepaths, but that's all dead code; they're all within a
 HOLD/RESUME_INTERRUPTS blocks.

Fair enough, but do you think it's acceptable to say "well, we can't
have errors within critical blocks anyway, and nor can the driver, so
just assume that the driver will successfully service the request"?

 Furthermore, the WAL writer didn't have a latch before this patch; it's not
 many lines of code to initialize the latch and set up the signal handler for
 it, but it already has a semaphore that's ready to use.

Uh, yes it did - WalWriterDelay is passed to WaitLatch(). It didn't
use the process latch as I did (which is initialised anyway), though I
believe it should have (on general principle, to avoid invalidation
issues when generic handlers are registered, plus because the process
latch is already initialised and available), which is why I changed
it. Whatever you do with group commit, you're going to want to look at
the changes made to the WAL Writer in my original patch outside of the
main loop, because there are one or two fixes for it included
(registering a usr1 signal handler and saving errno in existing
handlers), and because we need an alternative way of power saving if
you're not going to include the mechanism originally proposed - maybe
something similar to what has been done for the BGWriter in my patch
for that. At 5 wake-ups per second by default, the process is by a
wide margin the biggest culprit (except BGWriter, which is also 5 by
default, but that is addressed by my other patch that you're
reviewing). I want to fix that problem too, and possibly investigate
if there's something to be done about the checkpointer (though that
only has a 5 second timeout, so it's not a major concern). In any
case, we should encourage the idea that auxiliary processes will use
the proc latch, unless perhaps they only use a local latch like the
avlauncher does, imho.

Why did you remove the new assertions in unix_latch.c/win32_latch.c? I
think you should keep them, as well as my additional comments on latch
timeout invalidation issues in latch.h which are untouched in your
revision (though this looks to be a rough revision, so I shouldn't
read anything into that either way I suppose). In general, we should
try and use the process latch whenever we can.

 I wonder if we should rename the file into xlogflush.c or something like
 that, to make it clear that this works with any XLOG flushes, not just
 commits? Group commit is the usual term for this feature, so we should
 definitely mention that in the comments, but it might be better to rename
 the files/functions to emphasize that this is about WAL flushing in general.

Okay.

 This probably needs some further cleanup to fix things I've broken, and I
 haven't done any performance testing, but it's progress. Do you have a shell
 script or something that you used for the performance tests that I could
 run? Or would you like to re-run the tests you did earlier with this patch?

No, I'm using pgbench-tools, and there's no reason to think that you
couldn't get similar results on ordinary hardware, which is all I used
- obviously you'll want to make sure that you're using a file system
that supports granular fsyncs, like ext4. All of the details,
including the config for pgbench-tools, are in my original e-mail. I
have taken the time to re-run the benchmark and update the wiki with
that new information - I'd call it a draw.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group commit, revised

2012-01-20 Thread Peter Geoghegan
On 21 January 2012 03:13, Peter Geoghegan pe...@2ndquadrant.com wrote:
 I have taken the time to re-run the benchmark and update the wiki with
 that new information - I'd call it a draw.

On second thought, the occasional latency spikes that we see with my
patch (which uses the poll() based latch in the run that is
benchmarked) might be significant - difficult to say. The averages are
about the same though.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group commit, revised

2012-01-19 Thread Robert Haas
On Wed, Jan 18, 2012 at 5:38 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, Jan 18, 2012 at 1:23 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jan 17, 2012 at 12:37 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 I found it very helpful to reduce wal_writer_delay in pgbench tests, when
 running with synchronous_commit=off. The reason is that hint bits don't get
 set until the commit record is flushed to disk, so making the flushes more
 frequent reduces the contention on the clog. However, Simon made async
 commits nudge WAL writer if the WAL page fills up, so I'm not sure how
 relevant that experience is anymore.

 Still completely relevant and orthogonal to this discussion. The patch
 retains multi-modal behaviour.

I don't know what you mean by this.  I think removing wal_writer_delay
is premature, because I think it still may have some utility, and the
patch removes it.  That's a separate change that should be factored
out of this patch and discussed separately.

 There's still a small but measurable effect there in master.  I think
 we might be able to make it fully auto-tuning, but I don't think we're
 fully there yet (not sure how much this patch changes that equation).

 I suggested a design similar to the one you just proposed to Simon
 when he originally suggested this feature.  It seems that if the WAL
 writer is the only one doing WAL flushes, then there must be some IPC
 overhead - and context switching - involved whenever WAL is flushed.
 But clearly we're saving something somewhere else, on the basis of
 Peter's results, so maybe it's not worth worrying about.  It does seem
 pretty odd to have all the regular backends going through the WAL
 writer and the auxiliary processes using a different mechanism,
 though.  If we got rid of that, maybe WAL writer wouldn't even require
 a lock, if there's only one process that can be doing it at a time.

 When we did sync rep it made sense to have the WALSender do the work
 and for others to just wait. It would be quite strange to require a
 different design for essentially the same thing for normal commits and
 WAL flushes to local disk. I should mention the original proposal for
 streaming replication had each backend send data to standby
 independently and that was recognised as a bad idea after some time.
 Same for sync rep also.

I don't think those cases are directly comparable.  SR is talking to
another machine, and I can't imagine that there is a terribly
convenient or portable way for every backend that needs one to get a
hold of the file descriptor for the socket.  Even if it could, the
data is sent as a stream, so if multiple backends sent to the same
file descriptor you'd have to have some locking to prevent messages
from getting interleaved.  Or else you could have multiple
connections, or use UDP, but that gets rather complex.  Anyway, none
of this is an issue for file I/O: anybody can open the file, and if
two backends write data at different offsets at the same time - or the
same data at the same offset at the same time - there's no problem.
So the fact that it wasn't a good idea for SR doesn't convince me that
it's also a bad idea here.

On the other hand, I'm not saying we *should* do it that way, either -
i.e. I am not trying to require a different design just because it's
fun to make people change things.  Rather, I am trying to figure out
whether the design you've chosen is in fact the best one, and part of
that involves reasoning about why it might or might not be.  There are
obvious reasons to think that having process A kick process B and go
to sleep, then have process B do some work and wake up process A might
be less efficient than having process A just do the work itself, in
the uncontended case.  The fact that that isn't happening seems
awfully strange to me; it's hard to understand why 3 system calls are
faster than one.  That suggests that either the current system is
badly broken in some way that this patch fixes (in which case, it
would be nice to know what the broken thing is) or that there's an
opportunity for further optimization of the new patch (either now or
later, depending on how much work we're talking about).  Just to be
clear, it makes perfect sense that the new system is faster in the
contended case, and the benchmark numbers are very impressive.  It's a
lot harder to understand why it's not slower in the uncontended case.

 Not sure why its odd to have backends do one thing and auxiliaries do
 another. The whole point of auxiliary processes is that they do a
 specific thing different to normal backends. Anyway, the important
 thing is to have auxiliary processes be independent of each other as
 much as possible, which simplifies error handling and state logic in
 the postmaster.

Yeah, I guess the shutdown sequence could get a bit complex if we try
to make everyone go through the WAL writer all the time.  But I wonder
if we could rejigger things somehow so that 

Re: [HACKERS] Group commit, revised

2012-01-19 Thread Peter Geoghegan
On 19 January 2012 17:40, Robert Haas robertmh...@gmail.com wrote:
 I don't know what you mean by this.  I think removing wal_writer_delay
 is premature, because I think it still may have some utility, and the
 patch removes it.  That's a separate change that should be factored
 out of this patch and discussed separately.

FWIW, I don't really care too much if we keep wal_writer_delay,
provided it is only used in place of the patch's
WALWRITER_NORMAL_TIMEOUT constant. I will note in passing that I doubt
that the effect with asynchronous commits and hint bits is pronounced
enough to have ever transferred through to someone making a practical
recommendation to reduce wal_writer_delay to ameliorate clog
contention.

 When we did sync rep it made sense to have the WALSender do the work
 and for others to just wait. It would be quite strange to require a
 different design for essentially the same thing for normal commits and
 WAL flushes to local disk. I should mention the original proposal for
 streaming replication had each backend send data to standby
 independently and that was recognised as a bad idea after some time.
 Same for sync rep also.

 I don't think those cases are directly comparable.  SR is talking to
 another machine, and I can't imagine that there is a terribly
 convenient or portable way for every backend that needs one to get a
 hold of the file descriptor for the socket.  Even if it could, the
 data is sent as a stream, so if multiple backends sent to the same
 file descriptor you'd have to have some locking to prevent messages
 from getting interleaved.  Or else you could have multiple
 connections, or use UDP, but that gets rather complex.  Anyway, none
 of this is an issue for file I/O: anybody can open the file, and if
 two backends write data at different offsets at the same time - or the
 same data at the same offset at the same time - there's no problem.
 So the fact that it wasn't a good idea for SR doesn't convince me that
 it's also a bad idea here.

 On the other hand, I'm not saying we *should* do it that way, either -
 i.e. I am not trying to require a different design just because it's
 fun to make people change things.  Rather, I am trying to figure out
 whether the design you've chosen is in fact the best one, and part of
 that involves reasoning about why it might or might not be.  There are
 obvious reasons to think that having process A kick process B and go
 to sleep, then have process B do some work and wake up process A might
 be less efficient than having process A just do the work itself, in
 the uncontended case.  The fact that that isn't happening seems
 awfully strange to me; it's hard to understand why 3 system calls are
 faster than one.  That suggests that either the current system is
 badly broken in some way that this patch fixes (in which case, it
 would be nice to know what the broken thing is) or that there's an
 opportunity for further optimization of the new patch (either now or
 later, depending on how much work we're talking about).  Just to be
 clear, it makes perfect sense that the new system is faster in the
 contended case, and the benchmark numbers are very impressive.  It's a
 lot harder to understand why it's not slower in the uncontended case.

 Not sure why its odd to have backends do one thing and auxiliaries do
 another. The whole point of auxiliary processes is that they do a
 specific thing different to normal backends. Anyway, the important
 thing is to have auxiliary processes be independent of each other as
 much as possible, which simplifies error handling and state logic in
 the postmaster.

 Yeah, I guess the shutdown sequence could get a bit complex if we try
 to make everyone go through the WAL writer all the time.  But I wonder
 if we could rejigger things somehow so that everything goes through
 WAL writer if it's dead.

I'm not sure what you mean by this last bit. It sounds a bit hazardous.

My problem with nominating a backend to the status of an auxiliary is
that no matter what way you cut it, it increases the failure surface
area, so to speak.

I'm not sure why Heikki thinks that it follows that having a
dependency on some backend is simpler than having one on an auxiliary
process. As to the question of IPC and context switch overhead, I'd
speculate that protecting access to a data structure with bookkeeping
information regarding which backend is currently the driver and so on
might imply considerably more overhead than IPC and context switching.
It might also be that having WAL Writer almost solely responsible for
this might facilitate more effective use of CPU cache.

On most modern architectures, system calls don't actually cause a full
context switch. The kernel can just enter a mode switch (go from
user mode to kernel mode, and then back to user mode). You can observe
this effect with vmstat. That's how 3 system calls might not look much
more expensive than 1.

 +       if (delta > XLOG_SEG_SIZE * CheckPointSegments 

Re: [HACKERS] Group commit, revised

2012-01-18 Thread Simon Riggs
On Wed, Jan 18, 2012 at 1:23 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jan 17, 2012 at 12:37 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 I found it very helpful to reduce wal_writer_delay in pgbench tests, when
 running with synchronous_commit=off. The reason is that hint bits don't get
 set until the commit record is flushed to disk, so making the flushes more
 frequent reduces the contention on the clog. However, Simon made async
 commits nudge WAL writer if the WAL page fills up, so I'm not sure how
 relevant that experience is anymore.

Still completely relevant and orthogonal to this discussion. The patch
retains multi-modal behaviour.

 There's still a small but measurable effect there in master.  I think
 we might be able to make it fully auto-tuning, but I don't think we're
 fully there yet (not sure how much this patch changes that equation).

 I suggested a design similar to the one you just proposed to Simon
 when he originally suggested this feature.  It seems that if the WAL
 writer is the only one doing WAL flushes, then there must be some IPC
 overhead - and context switching - involved whenever WAL is flushed.
 But clearly we're saving something somewhere else, on the basis of
 Peter's results, so maybe it's not worth worrying about.  It does seem
 pretty odd to have all the regular backends going through the WAL
 writer and the auxiliary processes using a different mechanism,
 though.  If we got rid of that, maybe WAL writer wouldn't even require
 a lock, if there's only one process that can be doing it at a time.

When we did sync rep it made sense to have the WALSender do the work
and for others to just wait. It would be quite strange to require a
different design for essentially the same thing for normal commits and
WAL flushes to local disk. I should mention the original proposal for
streaming replication had each backend send data to standby
independently and that was recognised as a bad idea after some time.
Same for sync rep also.

The gain is that previously there was measurable contention for the
WALWriteLock, now there is none. Plus the gang effect continues to
work even when the database gets busy, which isn't true of piggyback
commits as we use now.

Not sure why its odd to have backends do one thing and auxiliaries do
another. The whole point of auxiliary processes is that they do a
specific thing different to normal backends. Anyway, the important
thing is to have auxiliary processes be independent of each other as
much as possible, which simplifies error handling and state logic in
the postmaster.

With regard to context switching, we're making a kernel call to fsync,
so we'll get a context switch anyway. The whole process is similar to
the way lwlock wake up works.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group commit, revised

2012-01-17 Thread Peter Geoghegan
On 16 January 2012 08:11, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Impressive results. How about uploading the PDF to the community wiki?

Sure. http://wiki.postgresql.org/wiki/Group_commit .

 I think it might be simpler if it wasn't the background writer that's
 responsible for driving the group commit queue, but the backends
 themselves. When a flush request comes in, you join the queue, and if
 someone else is already doing the flush, sleep until the driver wakes you
 up. If no-one is doing the flush yet (ie. the queue is empty), start doing
 it yourself. You'll need a state variable to keep track who's driving the
 queue, but otherwise I think it would be simpler as there would be no
 dependency on WAL writer.

I think this replaces one problem with another. You've now effectively
elevated a nominated backend to the status of an auxiliary process -
do you intend to have the postmaster look after it, as with any other
auxiliary process? I'm not sure that that is a more difficult problem
to solve, but I suspect so. At least my proposal can have any one of
the backends, both currently participating in group commit and yet to,
wake up the WAL Writer.

 I tend think of the group commit facility as a bus. Passengers can hop on
 board at any time, and they take turns on who drives the bus. When the first
 passengers hops in, there is no driver so he takes the driver seat. When the
 next passenger hops in, he sees that someone is driving the bus already, so
 he sits down, and places a big sign on his forehead stating the bus stop
 where he wants to get off, and goes to sleep. When the driver has reached
 his own bus stop, he wakes up all the passengers who wanted to get off at
 the same stop or any of the earlier stops [1]. He also wakes up the
 passenger whose bus stop is the farthest from the current stop, and gets off
 the bus. The woken-up passengers who have already reached their stops can
 immediately get off the bus, and the one who has not notices that no-one is
 driving the bus anymore, so he takes the driver seat.

 [1] in a real bus, a passenger would not be happy if he's woken up too late
 and finds himself at the next stop instead of the one where he wanted to go,
 but for group commit, that is fine.

 In this arrangement, you could use the per-process semaphore for the
 sleep/wakeups, instead of latches. I'm not sure if there's any difference,
 but semaphores are more tried and tested, at least.

Yes, and I expect that this won't be the last time someone uses a bus
analogy in relation to this!

The proposed patch is heavily based on sync rep, which I'd have
imagined was more tried and tested than any proposed completely
alternative implementation, as it is basically a generalisation of
exactly the same principle, WAL Writer changes notwithstanding. I
would have imagined that that aspect would be particularly approved
of.

 wal_writer_delay is still needed for controlling how often asynchronous
 commits are flushed to disk.

That had occurred to me of course, but has anyone ever actually
tweaked wal_writer_delay with adjusting the behaviour of asynchronous
commits in mind? I'm pretty sure that the answer is no. I have a
slight preference for obsoleting it as a consequence of introducing
group commit, but I don't think that it matters that much.

 Auxiliary processes cannot use group commit. The changes made prevent
 them from availing of commit_siblings/commit_delay parallelism,
 because it doesn't exist anymore.

 Auxiliary processes have PGPROC entries too. Why can't they participate?

It was deemed to be a poor design decision to effectively create a
dependency on the WAL Writer among other auxiliary processes, as to do
so would perhaps compromise the way in which the postmaster notices
and corrects isolated failures. Maybe I'll revisit that assessment,
but I am not convinced that it's worth the very careful analysis of
the implications of such an unprecedented dependency, without there
being some obvious advantage. If it's a question of their being
deprived of commit_siblings group commit, well, we know from
experience that people didn't tend to touch it a whole lot anyway.

 Group commit is sometimes throttled, which seems appropriate - if a
 backend requests that the WAL Writer flush an LSN deemed too far from
 the known flushed point, that request is rejected and the backend goes
 through another path, where XLogWrite() is called.

 Hmm, if the backend doing the big flush gets the WALWriteLock before a bunch
 of group committers, the group committers will have to wait until the big
 flush is finished, anyway. I presume the idea of the throttling is to avoid
 the situation where a bunch of small commits need to wait for a huge flush
 to finish.

Exactly. Of course, you're never going to see that situation with
pgbench. I don't have much data to inform exactly what the right
trade-off is here, or some generic approximation of it across
platforms and hardware - other 

Re: [HACKERS] Group commit, revised

2012-01-17 Thread Heikki Linnakangas

On 17.01.2012 16:35, Peter Geoghegan wrote:

On 16 January 2012 08:11, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

I think it might be simpler if it wasn't the background writer that's
responsible for driving the group commit queue, but the backends
themselves. When a flush request comes in, you join the queue, and if
someone else is already doing the flush, sleep until the driver wakes you
up. If no-one is doing the flush yet (ie. the queue is empty), start doing
it yourself. You'll need a state variable to keep track who's driving the
queue, but otherwise I think it would be simpler as there would be no
dependency on WAL writer.


I think this replaces one problem with another. You've now effectively
elevated a nominated backend to the status of an auxiliary process -
do you intend to have the postmaster look after it, as with any other
auxiliary process?


The GroupCommitWaitForLSN() call happens within a critical section. If 
the process dies, you're screwed no matter what. It's not very different 
from the current situation where if one backend flushes the WAL, another 
backend will notice that once it gets the WALWriteLock, and returns quickly.



wal_writer_delay is still needed for controlling how often asynchronous
commits are flushed to disk.


That had occurred to me of course, but has anyone ever actually
tweaked wal_writer_delay with adjusting the behaviour of asynchronous
commits in mind?


I found it very helpful to reduce wal_writer_delay in pgbench tests, 
when running with synchronous_commit=off. The reason is that hint bits 
don't get set until the commit record is flushed to disk, so making the 
flushes more frequent reduces the contention on the clog. However, Simon 
made async commits nudge WAL writer if the WAL page fills up, so I'm not 
sure how relevant that experience is anymore.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group commit, revised

2012-01-17 Thread Peter Geoghegan
On 17 January 2012 17:37, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 I found it very helpful to reduce wal_writer_delay in pgbench tests, when
 running with synchronous_commit=off. The reason is that hint bits don't get
 set until the commit record is flushed to disk, so making the flushes more
 frequent reduces the contention on the clog. However, Simon made async
 commits nudge WAL writer if the WAL page fills up, so I'm not sure how
 relevant that experience is anymore.

It's quite possible that the WAL Writer will spin continuously, if
given enough work to do, with this patch.

Although it's hard to tell from the graph I sent, there is a modest
improvement in TPS for even a single client. See the tables in the
PDF.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group commit, revised

2012-01-17 Thread Jim Nasby
On Jan 15, 2012, at 4:42 PM, Peter Geoghegan wrote:
 Attached is a patch that myself and Simon Riggs collaborated on. I
 took the group commit patch that Simon posted to the list back in
 November, and partially rewrote it.

Forgive me if this is a dumb question, but I noticed a few places doing things 
like:

if (...)
  Blah();
else
{
  ...
}

Is that allowed PG formatting? I thought that if braces were required on one 
branch of an if they were supposed to go on both sides.

Also, I didn't see any README changes in the patch... perhaps this is big 
enough to warrant them?
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group commit, revised

2012-01-17 Thread Alvaro Herrera

Excerpts from Jim Nasby's message of Tue Jan 17 21:21:57 -0300 2012:
 On Jan 15, 2012, at 4:42 PM, Peter Geoghegan wrote:
  Attached is a patch that myself and Simon Riggs collaborated on. I
  took the group commit patch that Simon posted to the list back in
  November, and partially rewrote it.
 
 Forgive me if this is a dumb question, but I noticed a few places doing 
 things like:
 
 if (...)
   Blah();
 else
 {
   ...
 }
 
 Is that allowed PG formatting? I thought that if braces were required on one 
 branch of an if they were supposed to go on both sides.

Yeah, we even used to have pg_indent remove braces around single
statements, so if you had one statement in the if branch and more
around the other one, pg_indent would have left things like that anyway.

(This particular pg_indent behavior got removed because it messed up
PG_TRY blocks.)

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group commit, revised

2012-01-17 Thread Robert Haas
On Tue, Jan 17, 2012 at 12:37 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 I found it very helpful to reduce wal_writer_delay in pgbench tests, when
 running with synchronous_commit=off. The reason is that hint bits don't get
 set until the commit record is flushed to disk, so making the flushes more
 frequent reduces the contention on the clog. However, Simon made async
 commits nudge WAL writer if the WAL page fills up, so I'm not sure how
 relevant that experience is anymore.

There's still a small but measurable effect there in master.  I think
we might be able to make it fully auto-tuning, but I don't think we're
fully there yet (not sure how much this patch changes that equation).

I suggested a design similar to the one you just proposed to Simon
when he originally suggested this feature.  It seems that if the WAL
writer is the only one doing WAL flushes, then there must be some IPC
overhead - and context switching - involved whenever WAL is flushed.
But clearly we're saving something somewhere else, on the basis of
Peter's results, so maybe it's not worth worrying about.  It does seem
pretty odd to have all the regular backends going through the WAL
writer and the auxiliary processes using a different mechanism,
though.  If we got rid of that, maybe WAL writer wouldn't even require
a lock, if there's only one process that can be doing it at a time.

What happens in standalone mode?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group commit, revised

2012-01-16 Thread Heikki Linnakangas

On 16.01.2012 00:42, Peter Geoghegan wrote:

I've also attached the results of a pgbench-tools driven benchmark,
which are quite striking (Just the most relevant image - e-mail me
privately if you'd like a copy of the full report, as I don't want to
send a large PDF file to the list as a courtesy to others). Apart from
the obvious improvement in throughput, there is also a considerable
improvement in average and worst latency at all client counts.


Impressive results. How about uploading the PDF to the community wiki?


The main way that I've added value here is by refactoring and fixing
bugs. There were some tricky race conditions that caused the
regression tests to fail for that early draft patch, but there were a
few other small bugs too. There is an unprecedented latch pattern
introduced by this patch: Two processes (the WAL Writer and any given
connection backend) have a mutual dependency. Only one may sleep, in
which case the other is responsible for waking it. Much of the
difficulty in debugging this patch, and much of the complexity that
I've added, came from preventing both from simultaneously sleeping,
even in the face of various known failure modes like postmaster death.
If this happens, it does of course represent a denial-of-service, so
that's something that reviewers are going to want to heavily
scrutinise. I suppose that sync rep should be fine here, as it waits
on walsenders, but I haven't actually comprehensively tested the
patch, so there may be undiscovered unpleasant interactions with other
xlog related features. I can report that it passes the regression
tests successfully, and on an apparently consistent basis - I
battled with intermittent failures for a time.


I think it might be simpler if it wasn't the background writer that's 
responsible for driving the group commit queue, but the backends 
themselves. When a flush request comes in, you join the queue, and if 
someone else is already doing the flush, sleep until the driver wakes 
you up. If no-one is doing the flush yet (ie. the queue is empty), start 
doing it yourself. You'll need a state variable to keep track who's 
driving the queue, but otherwise I think it would be simpler as there 
would be no dependency on WAL writer.


I tend think of the group commit facility as a bus. Passengers can hop 
on board at any time, and they take turns on who drives the bus. When 
the first passengers hops in, there is no driver so he takes the driver 
seat. When the next passenger hops in, he sees that someone is driving 
the bus already, so he sits down, and places a big sign on his forehead 
stating the bus stop where he wants to get off, and goes to sleep. When 
the driver has reached his own bus stop, he wakes up all the passengers 
who wanted to get off at the same stop or any of the earlier stops [1]. 
He also wakes up the passenger whose bus stop is the farthest from the 
current stop, and gets off the bus. The woken-up passengers who have 
already reached their stops can immediately get off the bus, and the one 
who has not notices that no-one is driving the bus anymore, so he takes 
the driver seat.


[1] in a real bus, a passenger would not be happy if he's woken up too 
late and finds himself at the next stop instead of the one where he 
wanted to go, but for group commit, that is fine.


In this arrangement, you could use the per-process semaphore for the 
sleep/wakeups, instead of latches. I'm not sure if there's any 
difference, but semaphores are more tried and tested, at least.
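
A stand-alone sketch of the bus pattern can be run as an ordinary C
program: pthread primitives replace the lwlock and per-process
semaphores, usleep() stands in for the WAL write and fsync, and a
broadcast wakes every passenger instead of only those whose stops have
been reached, but the state variable tracking whether anyone is driving
and the take-over by the next unsatisfied waiter are the same idea (all
names invented):

/* build with: cc -pthread busdemo.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wakeup = PTHREAD_COND_INITIALIZER;
static bool     driving = false;        /* is someone in the driver seat? */
static uint64_t flushed_upto = 0;       /* everything <= this is durable */
static uint64_t max_requested = 0;      /* farthest stop requested so far */

static void
group_flush(uint64_t my_lsn)
{
    pthread_mutex_lock(&lock);
    if (my_lsn > max_requested)
        max_requested = my_lsn;
    while (flushed_upto < my_lsn)
    {
        if (!driving)
        {
            /* empty driver seat: drive to the farthest requested stop */
            uint64_t    target = max_requested;

            driving = true;
            pthread_mutex_unlock(&lock);
            usleep(10000);              /* pretend this is the write + fsync */
            pthread_mutex_lock(&lock);
            flushed_upto = target;
            driving = false;
            pthread_cond_broadcast(&wakeup);
        }
        else
            pthread_cond_wait(&wakeup, &lock);  /* sit down and sleep */
    }
    pthread_mutex_unlock(&lock);
}

static void *
backend(void *arg)
{
    uint64_t    lsn = (uint64_t) (uintptr_t) arg;

    group_flush(lsn);
    printf("commit at LSN %llu is durable\n", (unsigned long long) lsn);
    return NULL;
}

int
main(void)
{
    pthread_t   th[8];

    for (int i = 0; i < 8; i++)
        pthread_create(&th[i], NULL, backend, (void *) (uintptr_t) (100 * (i + 1)));
    for (int i = 0; i < 8; i++)
        pthread_join(th[i], NULL);
    return 0;
}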



The benefits (and, admittedly, the controversies) of this patch go
beyond mere batching of commits: it also largely, though not entirely,
obviates the need for user backends to directly write/flush WAL, and
the WAL Writer may never sleep if it continually finds work to do -
wal_writer_delay is obsoleted, as are commit_siblings and
commit_delay.


wal_writer_delay is still needed for controlling how often asynchronous 
commits are flushed to disk.



Auxiliary processes cannot use group commit. The changes made prevent
them from availing of commit_siblings/commit_delay parallelism,
because it doesn't exist anymore.


Auxiliary processes have PGPROC entries too. Why can't they participate?


Group commit is sometimes throttled, which seems appropriate - if a
backend requests that the WAL Writer flush an LSN deemed too far from
the known flushed point, that request is rejected and the backend goes
through another path, where XLogWrite() is called.


Hmm, if the backend doing the big flush gets the WALWriteLock before a 
bunch of group committers, the group committers will have to wait until 
the big flush is finished, anyway. I presume the idea of the throttling 
is to avoid the situation where a bunch of small commits need to wait 
for a huge flush to finish.


Perhaps the big flusher should always join the queue, but use some 
heuristic to first flush up to the previous commit request, to wake up 
others quickly, and do another flush to flush 

[HACKERS] Group commit, revised

2012-01-15 Thread Peter Geoghegan
Attached is a patch that myself and Simon Riggs collaborated on. I
took the group commit patch that Simon posted to the list back in
November, and partially rewrote it. Here is that original thread:

http://archives.postgresql.org/pgsql-hackers/2011-11/msg00802.php

I've also attached the results of a pgbench-tools driven benchmark,
which are quite striking (Just the most relevant image - e-mail me
privately if you'd like a copy of the full report, as I don't want to
send a large PDF file to the list as a courtesy to others). Apart from
the obvious improvement in throughput, there is also a considerable
improvement in average and worst latency at all client counts.

To recap, the initial impetus to pursue this idea came from the
observation that with sync rep, we could get massive improvements in
the transactions-per-second throughput by simply adding clients. Greg
Smith performed a benchmark while in Amsterdam for the PgConf.EU
conference, which was discussed in a talk there. Over an
inter-continental connection from Amsterdam to his office in Baltimore
on the U.S. east coast, he saw TPS as reported by pgbench on what I
suppose was either an insert or update workload grow from a mere 10
TPS for a single connection to over 2000 TPS for about 300
connections. That was with the large, inherent latency imposed by
those sorts of distances (3822 miles/ 6150 km, about a 100ms ping time
on a decent connection). Quite obviously, as clients were added, the
server was able to batch increasing numbers of commits in each
confirmation message, resulting in this effect.

The main way that I've added value here is by refactoring and fixing
bugs. There were some tricky race conditions that caused the
regression tests to fail for that early draft patch, but there were a
few other small bugs too. There is an unprecedented latch pattern
introduced by this patch: Two processes (the WAL Writer and any given
connection backend) have a mutual dependency. Only one may sleep, in
which case the other is responsible for waking it. Much of the
difficulty in debugging this patch, and much of the complexity that
I've added, came from preventing both from simultaneously sleeping,
even in the face of various known failure modes like postmaster death.
If this happens, it does of course represent a denial-of-service, so
that's something that reviewers are going to want to heavily
scrutinise. I suppose that sync rep should be fine here, as it waits
on walsenders, but I haven't actually comprehensively tested the
patch, so there may be undiscovered unpleasant interactions with other
xlog related features. I can report that it passes the regression
tests successfully, and on an apparently consistent basis - I
battled with intermittent failures for a time.

Simon's original patch largely copied the syncrep.c code as an
expedient to prove the concept. Obviously this design was never
intended to get committed, and I've done some commonality and
variability analysis, refactoring to considerably slim down the new
groupcommit.c file by exposing some previously module-private
functions from syncrep.c .

I encourage others to reproduce my benchmark here. I attach a
pgbench-tools config. You can get the latest version of the tool at:

https://github.com/gregs1104/pgbench-tools

I also attach hdparm information for the disk that was used during
these benchmarks. Note that I have not disabled the write cache. It's
a Linux box, with ext4, running a recent kernel.

The benefits (and, admittedly, the controversies) of this patch go
beyond mere batching of commits: it also largely, though not entirely,
obviates the need for user backends to directly write/flush WAL, and
the WAL Writer may never sleep if it continually finds work to do -
wal_writer_delay is obsoleted, as are commit_siblings and
commit_delay. I suspect that they will not be missed. Of course, it
does all this to facilitate group commit itself. The group commit
feature does not have a GUC to control it, as this seems like
something that would be fairly pointless to turn off. FWIW, this is
currently the case for the recently introduced Maria DB group commit
implementation.

Auxiliary processes cannot use group commit. The changes made prevent
them from availing of commit_siblings/commit_delay parallelism,
because it doesn't exist anymore.

Group commit is sometimes throttled, which seems appropriate - if a
backend requests that the WAL Writer flush an LSN deemed too far from
the known flushed point, that request is rejected and the backend goes
through another path, where XLogWrite() is called. Currently the group
commit infrastructure decides that on the sole basis of there being a
volume of WAL that is equivalent in size to checkpoint_segments
between the two points. This is probably a fairly horrible heuristic,
not least since it overloads checkpoint_segments, but is of course
only a first-pass effort. Bright ideas are always welcome.
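
In outline the check is just the following sketch (with 64-bit LSN
arithmetic for readability; the LSNs of this era were two-field structs,
and the real test also consults a flag saying whether group commit is
available at all):

#include <stdbool.h>
#include <stdint.h>

#define XLOG_SEG_SIZE   ((uint64_t) 16 * 1024 * 1024)   /* default 16MB */

/*
 * Sketch of the throttle: if the requested flush point is more than
 * checkpoint_segments worth of WAL beyond what is already known to be
 * flushed, skip the group commit queue and call XLogWrite() directly.
 */
static bool
use_group_commit(uint64_t request_lsn, uint64_t flushed_lsn,
                 int checkpoint_segments)
{
    uint64_t    delta = request_lsn - flushed_lsn;

    return delta <= XLOG_SEG_SIZE * (uint64_t) checkpoint_segments;
}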

Thoughts?

-- 
Peter Geoghegan       

Re: [HACKERS] Group Commit

2011-11-16 Thread Simon Riggs
On Wed, Nov 16, 2011 at 1:17 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Nov 15, 2011 at 7:18 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 When I apply this to HEAD and run make check, it freezes at:
 test tablespace               ...
 [...]
  Does anyone else see this?

 It hangs for me, too, just slightly later:


OK, thanks for checking. I'll investigate.


-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group Commit

2011-11-15 Thread Jeff Janes
On Mon, Nov 14, 2011 at 2:08 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Enclosed patch implements Group Commit and also powersave mode for WALWriter.

 XLogFlush() waits for WALWriter to run XLogBackgroundFlush(), which
 flushes WAL and then wakes waiters. Uses same concepts and similar
 code to sync rep.

 Purpose is to provide consistent WAL writes, even when WALInsertLock
 contended. Currently no off option, thinking is that the overhead of
 doing this is relatively low and so it can be always on - exactly as
 it is for sync rep.

 WALWriter now has variable wakeups, so wal_writer_delay is removed.
 Commit_delay and Commit_siblings are now superfluous and are also removed.

 Works, but needs discussion in some areas, docs and possibly tuning
 first, so this is more of a quicky than a slow, comfortable patch.

When I apply this to HEAD and run make check, it freezes at:
test tablespace   ...

The wal writer process is running at 100% CPU.  The only thing it is
doing that is visible through strace is checking that the postmaster is
alive.

I see the same behavior on two different Linux boxes, 32 bit and 64
bit.  I've tried to do some debugging, but haven't made much head-way.
 Does anyone else see this?

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group Commit

2011-11-15 Thread Robert Haas
On Tue, Nov 15, 2011 at 7:18 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 When I apply this to HEAD and run make check, it freezes at:
 test tablespace               ...
[...]
  Does anyone else see this?

It hangs for me, too, just slightly later:

== running regression test queries==
test tablespace   ... ok
parallel group (18 tests):  name txid oid float4 text int2 int4 int8
char varchar uuid float8 money boolean bit enum numeric

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Group Commit

2011-11-14 Thread Simon Riggs
Enclosed patch implements Group Commit and also powersave mode for WALWriter.

XLogFlush() waits for WALWriter to run XLogBackgroundFlush(), which
flushes WAL and then wakes waiters. Uses same concepts and similar
code to sync rep.

Purpose is to provide consistent WAL writes, even when WALInsertLock
contended. Currently no off option, thinking is that the overhead of
doing this is relatively low and so it can be always on - exactly as
it is for sync rep.

WALWriter now has variable wakeups, so wal_writer_delay is removed.
Commit_delay and Commit_siblings are now superfluous and are also removed.

Works, but needs discussion in some areas, docs and possibly tuning
first, so this is more of a quicky than a slow, comfortable patch.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


group_commit.v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group Commit

2011-11-14 Thread Josh Berkus

 Purpose is to provide consistent WAL writes, even when WALInsertLock
 contended. Currently no off option, thinking is that the overhead of
 doing this is relatively low and so it can be always on - exactly as
 it is for sync rep.

Hmmm, have you had a chance to do any performance tests?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group Commit

2011-11-14 Thread Greg Smith

On 11/14/2011 03:43 PM, Josh Berkus wrote:
   

Purpose is to provide consistent WAL writes, even when WALInsertLock
contended. Currently no off option, thinking is that the overhead of
doing this is relatively low and so it can be always on - exactly as
it is for sync rep.
 

Hmmm, have you had a chance to do any performance tests?
   


I was planning to run some later this week, but someone else is welcome 
to take a shot at it.  The inspiration for this change was the 
performance scaling tests I did for sync rep last month.  Don't recall 
if I shared those with this list yet; I've attached the fun graph.  Over 
a slow international link with 100ms ping times, I was only getting the 
expected 10 TPS doing sync rep with a single client.  But as more 
clients were added, so that a chunk of them were acknowledged in each 
commit reply, the total throughput among all of them scaled near 
linearly.  With 300 clients, that managed to hit a crazy 2000 TPS.


The best scenario to show this patch working would be a laptop drive 
spinning at a slow speed (5400 or 4200 RPM) so that individual local 
commits are slow.  That won't be 100ms slow, but close to 10ms is easy 
to see.  When adding clients to a system with a slow local commit, what 
I've observed is that the scaling levels off between 750 and 1000 TPS, 
no matter how many clients are involved.  The hope is that this 
alternate implementation will give the higher scaling in the face of 
slow commits that is seen on sync rep.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us

attachment: clients-3.png
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group Commit

2008-03-06 Thread Bruce Momjian

Should we remove these now that we have async commit?

#commit_delay = 0   # range 0-100000, in microseconds
#commit_siblings = 5   # range 1-1000

They seem unfixable.

---

Simon Riggs wrote:
 On Tue, 2007-04-10 at 11:40 +0100, Heikki Linnakangas wrote:
  Tom Lane wrote:
   Heikki Linnakangas [EMAIL PROTECTED] writes:
   I've been working on the patch to enhance our group commit behavior. The 
   patch is a dirty hack at the moment, but I'm settled on the algorithm 
   I'm going to use and I know the issues involved.
   
   One question that just came to mind is whether Simon's no-commit-wait
   patch doesn't fundamentally alter the context of discussion for this.
 
 I was certainly intending that it would.
 
   Aside from the prospect that people won't really care about group commit
   if they can just use the periodic-WAL-sync approach, ISTM that one way
   to get group commit is to just make everybody wait for the dedicated
   WAL writer to write their commit record.  With a sufficiently short
   delay between write/fsync attempts in the background process, won't
   that net out at about the same place as a complicated group-commit
   patch?
  
  Possibly. To get efficient group commit there would need to be some kind 
  of signaling between the WAL writer and normal backends. I think there 
  is some in the patch, but I'm not sure if it gives efficient group 
  commit. A constant delay will just give us something similar to 
  commit_delay.
 
 Agreed.
 
  I've refrained from spending time on group commit until the 
  commit-no-wait patch lands, because it's going to conflict anyway. I'm 
  starting to feel we should not try to rush group commit into 8.3, unless 
  it somehow falls out of the commit-no-wait patch by accident, given that 
  we're past feature freeze and coming up with a proper group commit 
  algorithm would need a lot of research and testing. Better do it for 8.4 
  with more time, we've got enough features on plate for 8.3 anyway.
 
 My feeling was that I couldn't get both done for 8.3, and that including
 the WAL Writer in 8.3 would make the dev path clearer for a later
 attempt upon group commit.
 
 I think it was worth exploring whether it would be easy, but I think we
 can see it'll take a lot of work to make it fly right.
 
 -- 
   Simon Riggs 
   EnterpriseDB   http://www.enterprisedb.com
 
 
 
 ---(end of broadcast)---
 TIP 5: don't forget to increase your free space map settings

-- 
  Bruce Momjian  [EMAIL PROTECTED]http://momjian.us
  EnterpriseDB http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://mail.postgresql.org/mj/mj_wwwusr?domain=postgresql.org&extra=pgsql-hackers


Re: [HACKERS] Group Commit

2008-03-06 Thread Greg Smith

On Thu, 6 Mar 2008, Bruce Momjian wrote:


Should we remove these now that we have async commit?
#commit_delay = 0   # range 0-100000, in microseconds
#commit_siblings = 5   # range 1-1000
They seem unfixable.


commit_delay offers a small but not insignificant improvement for some 
people using the feature under bursty, high client loads.  The useful 
tuning seems to be siblings[10-20] and a small setting for the delay; I 
usually just set it to 1 which gives the minimum the OS is capable of 
resolving.
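
In postgresql.conf terms that tuning would look something like this
(illustrative values within the ranges mentioned above, not a
recommendation):

    commit_siblings = 10    # delay only if at least this many transactions are open
    commit_delay = 1        # in microseconds; rounds up to whatever the OS can resolve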


That wasn't the feature's original intention I think, but that's what it's 
useful for regardless.  As async commit is only applicable in cases where 
it's OK to expand the window for transaction loss, removing commit_delay 
will cause a small performance regression for users who have tuned it 
usefully right now.


I actually have a paper design for something that builds a little model 
for how likely it is another commit will be coming soon that essentially 
turns this into something that can be tuned automatically, better than any 
person can do it.  No idea if I'll actually build that thing, but I hope 
it's obvious that there's some possibility to improve this area for 
applications that can't use async commit.  If you're going to dump the 
feature, I'd suggest at least waiting until later in the 8.4 cycle to see 
if something better comes along first.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group Commit

2008-03-06 Thread Tom Lane
Greg Smith [EMAIL PROTECTED] writes:
 I actually have a paper design for something that builds a little model 
 for how likely it is another commit will be coming soon that essentially 
 turns this into something that can be tuned automatically, better than any 
 person can do it.  No idea if I'll actually build that thing, but I hope 
 it's obvious that there's some possibility to improve this area for 
 applications that can't use async commit.  If you're going to dump the 
 feature, I'd suggest at least waiting until later in the 8.4 cycle to see 
 if something better comes along first.

What about the other idea of just having committers wait for the next
walwriter-cycle flush before reporting commit?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group Commit

2008-03-06 Thread Greg Smith

On Thu, 6 Mar 2008, Tom Lane wrote:


What about the other idea of just having committers wait for the next
walwriter-cycle flush before reporting commit?


I haven't considered that too much yet; it may very well be superior to 
anything I was thinking of.  The only point I was trying to make today is 
that I'd prefer not to see commit_delay excised until there's a superior 
replacement for it that's suitable even for synchronous write situations.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group Commit

2007-05-17 Thread Bruce Momjian

This is not ready for 8.3.

This has been saved for the 8.4 release:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---

Heikki Linnakangas wrote:
 It's been known for years that commit_delay isn't very good at giving us 
 group commit behavior. I did some experiments with this simple test 
 case: BEGIN; INSERT INTO test VALUES (1); COMMIT;, with different 
 numbers of concurrent clients and with and without commit_delay.
 
 Summary for the impatient:
 1. Current behavior sucks.
 2. commit_delay doesn't help with # of clients < ~10. It does help with 
 higher numbers, but it still sucks.
 3. I'm working on a patch.
 
 
 I added logging to show how many commit records are flushed on each 
 fsync. The output with otherwise unpatched PG head looks like this, with 
 5 clients:
 
 LOG:  Flushed 4 out of 5 commits
 LOG:  Flushed 1 out of 5 commits
 LOG:  Flushed 4 out of 5 commits
 LOG:  Flushed 1 out of 5 commits
 LOG:  Flushed 4 out of 5 commits
 LOG:  Flushed 1 out of 5 commits
 LOG:  Flushed 4 out of 5 commits
 LOG:  Flushed 1 out of 5 commits
 LOG:  Flushed 3 out of 5 commits
 LOG:  Flushed 2 out of 5 commits
 LOG:  Flushed 3 out of 5 commits
 LOG:  Flushed 2 out of 5 commits
 LOG:  Flushed 3 out of 5 commits
 LOG:  Flushed 2 out of 5 commits
 LOG:  Flushed 3 out of 5 commits
 ...
 
 Here's what's happening:
 
 1. Client 1 issues fsync (A)
 2. Clients 2-5 write their commit record, and try to fsync, but they 
 have to wait for fsync (A) to finish.
 3. fsync (A) finishes, freeing client 1.
 4. One of clients 2-5 starts the next fsync (B), which will flush 
 commits of clients 2-5 to disk
 5. Client 1 begins new transaction, inserts commit record and tries to 
 fsync. Needs to wait for previous fsync (B) to finish
 6. fsync B finishes, freeing clients 2-5
 7. Client 1 issues fsync (C)
 8. ...
 
 The 2-3-2-3 pattern can be explained with similar unfortunate 
 resonance, but with two clients instead of client 1 in the above 
 possibly running in separate cores (test was run on a dual-core laptop).
 
 I also draw a diagram illustrating the above, attached.
 
 I wrote a quick & dirty patch for this that I'm going to refine further, 
 but wanted to get the results out for others to look at first. I'm not 
 posting the patch yet, but it basically adds some synchronization to the 
 WAL flushes. It introduces a counter of inserted but not yet flushed 
 commit records. Instead of the commit_delay, the counter is checked. If 
 it's smaller than NBackends, the process waits until count reaches 
 NBackends, or a timeout expires. There's two significant differences to 
 commit_delay here:
 1. Instead of waiting for commit_delay to expire, processes are woken 
 and fsync is started immediately when we know there's no more commit 
 records coming that we should wait for. Even though commit_delay is 
 given in microseconds, the real granularity of the wait can be as high 
 as 10 ms, which is in the same ball park as the fsync itself.
 2. commit_delay is not used when there's less than commit_siblings 
 non-idle backends in the system. With very short transactions, it's 
 worthwhile to wait even if that's the case, because a client can begin 
 and finish a transaction in much shorter time than it takes to fsync. 
 This is what makes the commit_delay not work at all in my test case 
 with 2 clients.
 
 Here's a spreadsheet with the results of the tests I ran:
 http://community.enterprisedb.com/groupcommit-comparison.ods
 
 It contains a graph that shows that the patch works very well for this 
 test case. It's not very good for real life as it is, though. An obvious 
 flaw is that if you have a longer-running transaction, effect 1. goes 
 away. Instead of waiting for NBackends commit records, we should try to 
 guess the number of transactions that are likely to finish in a 
 reasonably short time. I'm thinking of keeping a running average of 
 commits per second, or # of transactions that finish while an fsync is 
 taking place.
 
 Any thoughts?
 
 -- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


 
 ---(end of broadcast)---
 TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Group Commit

2007-04-13 Thread Simon Riggs
On Tue, 2007-04-10 at 11:40 +0100, Heikki Linnakangas wrote:
 Tom Lane wrote:
  Heikki Linnakangas [EMAIL PROTECTED] writes:
  I've been working on the patch to enhance our group commit behavior. The 
  patch is a dirty hack at the moment, but I'm settled on the algorithm 
  I'm going to use and I know the issues involved.
  
  One question that just came to mind is whether Simon's no-commit-wait
  patch doesn't fundamentally alter the context of discussion for this.

I was certainly intending that it would.

  Aside from the prospect that people won't really care about group commit
  if they can just use the periodic-WAL-sync approach, ISTM that one way
  to get group commit is to just make everybody wait for the dedicated
  WAL writer to write their commit record.  With a sufficiently short
  delay between write/fsync attempts in the background process, won't
  that net out at about the same place as a complicated group-commit
  patch?
 
 Possibly. To get efficient group commit there would need to be some kind 
 of signaling between the WAL writer and normal backends. I think there 
 is some in the patch, but I'm not sure if it gives efficient group 
 commit. A constant delay will just give us something similar to 
 commit_delay.

Agreed.

 I've refrained from spending time on group commit until the 
 commit-no-wait patch lands, because it's going to conflict anyway. I'm 
 starting to feel we should not try to rush group commit into 8.3, unless 
 it somehow falls out of the commit-no-wait patch by accident, given that 
 we're past feature freeze and coming up with a proper group commit 
 algorithm would need a lot of research and testing. Better do it for 8.4 
 with more time, we've got enough features on plate for 8.3 anyway.

My feeling was that I couldn't get both done for 8.3, and that including
the WAL Writer in 8.3 would make the dev path clearer for a later
attempt upon group commit.

I think it was worth exploring whether it would be easy, but I think we
can see it'll take a lot of work to make it fly right.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Group Commit

2007-04-10 Thread Zeugswetter Andreas ADI SD

   I've been working on the patch to enhance our group commit
behavior. 
   The patch is a dirty hack at the moment, but I'm settled on the 
   algorithm I'm going to use and I know the issues involved.
  
  One question that just came to mind is whether Simon's
no-commit-wait 
  patch doesn't fundamentally alter the context of discussion for
this.
  Aside from the prospect that people won't really care about group 
  commit if they can just use the periodic-WAL-sync approach, ISTM
that 
  one way to get group commit is to just make everybody wait for the 
  dedicated WAL writer to write their commit record.

Yes good catch, I think we will want to merge the two.
But, you won't want to wait indefinitely, since imho the dedicated WAL
writer will primarily only want to write/flush full WAL pages. Maybe
flush half full WAL pages only after some longer timeout. But basically
this timeout should be longer than an individual backend is willing to
delay their commit.

   With a 
  sufficiently short delay between write/fsync attempts in the 
  background process, won't that net out at about the same place as a 
  complicated group-commit patch?

I don't think we want the delay so short, or we won't get any grouped
writes.

I think what we could do is wait up to commit_delay for the
dedicated WAL writer to do its work. If it didn't do it by the timeout,
let the backend do the flushing itself.
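
To make that concrete, here is a minimal standalone sketch of the idea,
using pthreads and made-up names -- none of this is the actual PostgreSQL
code or API, just an illustration of "wait for the WAL writer up to a
timeout, then flush yourself":

#include <pthread.h>
#include <stdint.h>
#include <time.h>

typedef struct
{
    pthread_mutex_t lock;
    pthread_cond_t  flushed;      /* broadcast by the WAL writer after each flush */
    uint64_t        flushedUpTo;  /* how far WAL is known to be on disk */
} WalFlushState;

/* Returns 1 if the backend had to flush itself, 0 if the WAL writer got there. */
int
commit_wait_or_flush(WalFlushState *st, uint64_t myRecEnd,
                     long timeout_us, void (*flush_self)(uint64_t))
{
    struct timespec deadline;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_nsec += (timeout_us % 1000000) * 1000;
    deadline.tv_sec  += timeout_us / 1000000 + deadline.tv_nsec / 1000000000;
    deadline.tv_nsec %= 1000000000;

    pthread_mutex_lock(&st->lock);
    while (st->flushedUpTo < myRecEnd)
    {
        /* stop waiting for the WAL writer once the deadline passes */
        if (pthread_cond_timedwait(&st->flushed, &st->lock, &deadline) != 0)
            break;
    }
    if (st->flushedUpTo >= myRecEnd)
    {
        pthread_mutex_unlock(&st->lock);
        return 0;                 /* the WAL writer flushed our commit record */
    }
    pthread_mutex_unlock(&st->lock);

    flush_self(myRecEnd);         /* timed out: do the flush in the backend */
    return 1;
}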

 I think the big question is whether commit_delay is ever going to be
generally useful.

It is designed to allow a higher transaction/second rate on a constantly
WAL bottlenecked system, so I think it still has a use case. I think you
should not compare it to no-commit-wait from the feature side (only
implementation).

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Group Commit

2007-04-10 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
I've been working on the patch to enhance our group commit behavior. The 
patch is a dirty hack at the moment, but I'm settled on the algorithm 
I'm going to use and I know the issues involved.


One question that just came to mind is whether Simon's no-commit-wait
patch doesn't fundamentally alter the context of discussion for this.
Aside from the prospect that people won't really care about group commit
if they can just use the periodic-WAL-sync approach, ISTM that one way
to get group commit is to just make everybody wait for the dedicated
WAL writer to write their commit record.  With a sufficiently short
delay between write/fsync attempts in the background process, won't
that net out at about the same place as a complicated group-commit
patch?


Possibly. To get efficient group commit there would need to be some kind 
of signaling between the WAL writer and normal backends. I think there 
is some in the patch, but I'm not sure if it gives efficient group 
commit. A constant delay will just give us something similar to 
commit_delay.


I've refrained from spending time on group commit until the 
commit-no-wait patch lands, because it's going to conflict anyway. I'm 
starting to feel we should not try to rush group commit into 8.3, unless 
it somehow falls out of the commit-no-wait patch by accident, given that 
we're past feature freeze and coming up with a proper group commit 
algorithm would need a lot of research and testing. Better do it for 8.4 
with more time, we've got enough features on plate for 8.3 anyway.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Group Commit

2007-04-10 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 I've refrained from spending time on group commit until the 
 commit-no-wait patch lands, because it's going to conflict anyway. I'm 
 starting to feel we should not try to rush group commit into 8.3, unless 
 it somehow falls out of the commit-no-wait patch by accident, given that 
 we're past feature freeze and coming up with a proper group commit 
 algorithm would need a lot of research and testing. Better do it for 8.4 
 with more time, we've got enough features on plate for 8.3 anyway.

It's possible that it *would* fall out of commit-no-wait, if we are
alert to the possibility of shaking the tree in the right direction ;-)
Otherwise I agree with waiting till 8.4 to deal with it.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Group Commit

2007-04-09 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 I've been working on the patch to enhance our group commit behavior. The 
 patch is a dirty hack at the moment, but I'm settled on the algorithm 
 I'm going to use and I know the issues involved.

One question that just came to mind is whether Simon's no-commit-wait
patch doesn't fundamentally alter the context of discussion for this.
Aside from the prospect that people won't really care about group commit
if they can just use the periodic-WAL-sync approach, ISTM that one way
to get group commit is to just make everybody wait for the dedicated
WAL writer to write their commit record.  With a sufficiently short
delay between write/fsync attempts in the background process, won't
that net out at about the same place as a complicated group-commit
patch?

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Group Commit

2007-04-09 Thread Bruce Momjian
Tom Lane wrote:
 Heikki Linnakangas [EMAIL PROTECTED] writes:
  I've been working on the patch to enhance our group commit behavior. The 
  patch is a dirty hack at the moment, but I'm settled on the algorithm 
  I'm going to use and I know the issues involved.
 
 One question that just came to mind is whether Simon's no-commit-wait
 patch doesn't fundamentally alter the context of discussion for this.
 Aside from the prospect that people won't really care about group commit
 if they can just use the periodic-WAL-sync approach, ISTM that one way
 to get group commit is to just make everybody wait for the dedicated
 WAL writer to write their commit record.  With a sufficiently short
 delay between write/fsync attempts in the background process, won't
 that net out at about the same place as a complicated group-commit
 patch?

This is a good point.  commit_delay was designed to allow multiple
transactions to fsync with a single fsync.  no-commit-wait is going to
do this much more effectively (the client doesn't have to wait for the
other transactions).  The one thing commit_delay gives us that
no-commit-wait does not is the guarantee that a commit returned to the
client is on disk, without any milliseconds delay.

The big question is who is going to care about the milliseconds delay
and is using a configuration that is going to benefit from commit_delay.
Basically, commit_delay always had a very limited use-case, but now
with no-commit-wait, commit_delay has an even smaller use-case.

I think the big question is whether commit_delay is ever going to be
generally useful.

I tried to find out what release commit_delay was added, and remembered
that the feature was so questionable we did not mention its addition in
the 7.1 release notes.  After six years, we are still unsure about the
feature.  Another big question is whether commit_delay is _ever_ going
to be useful, and with no-commit-wait being added, commit_delay looks
even more questionable and perhaps it should just be removed in 8.3.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Group Commit

2007-04-09 Thread Greg Smith

On Mon, 9 Apr 2007, Bruce Momjian wrote:


The big question is who is going to care about the milliseconds delay
and is using a configuration that is going to benefit from commit_delay.


I care.  WAL writes are a major bottleneck when many clients are 
committing near the same time.  Both times I've played with the 
commit_delay settings I found it improved the peak throughput under load 
at an acceptable low cost in latency.  I'll try to present some numbers on 
that when I get time, before you make me cry by taking it away.


An alternate mechanism that tells the client the commit is done when it 
hasn't hit disk is of no use for the applications I work with, so I 
haven't even been paying attention to no-commit-wait.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Group Commit

2007-04-09 Thread Tom Lane
Greg Smith [EMAIL PROTECTED] writes:
 An alternate mechanism that tells the client the commit is done when it 
 hasn't hit disk is of no use for the applications I work with, so I 
 haven't even been paying attention to no-commit-wait.

Agreed, if you need committed to mean committed then no-wait isn't
going to float your boat.  But the point I was making is that the
infrastructure Simon proposes (ie, a separate wal-writer process)
might be useful for this case too, with a lot less extra code than
Heikki is thinking about.  Now maybe that won't work, but we should
certainly not consider these as entirely-independent patches.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Group Commit

2007-04-09 Thread Tatsuo Ishii
 On Mon, 9 Apr 2007, Bruce Momjian wrote:
 
  The big question is who is going to care about the milliseconds delay
  and is using a configuration that is going to benefit from commit_delay.
 
 I care.  WAL writes are a major bottleneck when many clients are 
 committing near the same time.  Both times I've played with the 
 commit_delay settings I found it improved the peak throughput under load 
 at an acceptable low cost in latency.  I'll try to present some numbers on 
 that when I get time, before you make me cry by taking it away.

Totally agreed here. I experienced throughput improvement by using
commit_delay too. 

 An alternate mechanism that tells the client the commit is done when it 
 hasn't hit disk is of no use for the applications I work with, so I 
 haven't even been paying attention to no-commit-wait.

Agreed too.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


[HACKERS] Group Commit

2007-03-29 Thread Heikki Linnakangas
I've been working on the patch to enhance our group commit behavior. The 
patch is a dirty hack at the moment, but I'm settled on the algorithm 
I'm going to use and I know the issues involved.


Here's the patch as it is if you want to try it out:
http://community.enterprisedb.com/groupcommit-pghead-2.patch

but it needs a rewrite before being accepted. It'll only work on systems 
that use sysv semaphores; I needed to add a function to acquire a 
semaphore with timeout, and I only did it for sysv_sema.c for now.
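
(For what it's worth, here is one hedged sketch of what "acquire a SysV
semaphore with a timeout" could look like; it relies on Linux's
semtimedop(), and the actual sysv_sema.c change may well look different:)

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <time.h>
#include <errno.h>
#include <stdbool.h>

/* Try to decrement semaphore semNum in set semId; give up after timeout_ms. */
static bool
sysv_sema_lock_timed(int semId, int semNum, long timeout_ms)
{
    struct sembuf   sops;
    struct timespec timeout;

    sops.sem_num = semNum;
    sops.sem_op  = -1;                      /* acquire */
    sops.sem_flg = 0;

    timeout.tv_sec  = timeout_ms / 1000;
    timeout.tv_nsec = (timeout_ms % 1000) * 1000000L;

    for (;;)
    {
        if (semtimedop(semId, &sops, 1, &timeout) == 0)
            return true;                    /* acquired */
        if (errno != EINTR)
            return false;                   /* EAGAIN means we timed out */
        /* interrupted by a signal: retry (a real version would recompute
         * the remaining timeout here) */
    }
}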



What are the chances of getting this in 8.3, assuming that I rewrite and 
submit a patch within the next week or two?



Algorithm
---------

Instead of starting a WAL flush immediately after a commit record is 
inserted, we wait a while to give other backends a chance to finish 
their transactions and have them flushed by the same fsync call. There's 
two things we can control: how many commits to wait for (commit group 
size), and for how long (timeout).


We try to estimate the optimal commit group size. The estimate is

commit group size = (# of commit records flushed + # of commit records 
arrived while fsyncing).


This is a relatively simple estimate that works reasonably well with 
very short transactions, and the timeout limits the damage when the 
estimate is not working.


There's a lot more factors we could take into account in the estimate, 
for example:
- # of backends and their states (affects how many are likely to commit 
soon)

- amount of WAL written since last XLogFlush (affects the duration of fsync)
- when exactly the commit records arrive (we don't want to wait 10 ms to 
get one more commit record in, when an fsync takes 11 ms)


but I wanted to keep this simple for now.

The timeout is currently hard-coded at 1 ms. I wanted to keep it short 
compared to the time it takes to fsync (somewhere in the 5-15 ms 
depending on hardware), to limit the damage when the algorithm isn't 
getting the estimate right. We could also vary the timeout, but I'm not 
sure how to calculate the optimal value and the real granularity will 
depend on the system anyhow.


Implementation
--------------

To count the # of commits since last XLogFlush, I added a new 
XLogCtlCommit struct in shared memory:


typedef struct XLogCtlCommit
{
    slock_t     commit_lock;    /* protects the struct */
    int         commitCount;    /* # of commit records inserted since XLogFlush */
    int         groupSize;      /* current commit group size */
    XLogRecPtr  lastCommitPtr;  /* location of the latest commit record */
    PGPROC     *waiter;         /* process to signal when groupSize is reached */
} XLogCtlCommit;

Whenever a commit record is inserted in XLogInsert, commitCount is 
incremented and lastCommitPtr is updated.

When it reaches groupSize, the waiter process is woken up.

In XLogFlush, after acquiring WALWriteLock, we wait until groupSize is 
reached (or timeout expires) before doing the flush.


Instead of the current logic to flush as much WAL as possible, we flush 
up to the last commit record. Flushing any more wouldn't save us an 
fsync later on, but might make the current fsync take longer. By doing 
that, we avoid the conditional acquire of the WALInsertLock that's in 
there currently. We make note of commitCount before starting the fsync; 
that's the # of commit records that arrived in time so that the fsync 
will flush them. Let's call that value intime.


After the fsync is finished, we update the groupSize for the next round. 
The new groupSize is the current commitCount after the fsync, IOW the 
number of commit records that arrived since the previous XLogFlush, 
including those inserted while the fsync was in progress. We then update 
commitCount by decrementing it by intime.


Now we're ready for the next round, and we can release WALWriteLock.
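
(In code form, the bookkeeping around the fsync could look roughly like
the sketch below. It is only an illustration: a pthread mutex stands in
for the commit_lock spinlock, and the wait for groupSize is left as a
comment, so this is not the actual patch.)

#include <pthread.h>

typedef struct
{
    pthread_mutex_t lock;         /* stands in for commit_lock */
    int             commitCount;  /* commit records inserted since last flush */
    int             groupSize;    /* how many commits to wait for */
} CommitGroupSketch;

/* Conceptually what XLogFlush does after acquiring WALWriteLock. */
static void
flush_with_group_estimate(CommitGroupSketch *st, void (*do_fsync)(void))
{
    int intime;

    /* ... wait here until commitCount >= groupSize, or the timeout expires ... */

    pthread_mutex_lock(&st->lock);
    intime = st->commitCount;     /* commit records this fsync will cover */
    pthread_mutex_unlock(&st->lock);

    do_fsync();                   /* flush WAL up to the last commit record */

    pthread_mutex_lock(&st->lock);
    /* next group size: everything that arrived since the previous flush,
     * including records inserted while the fsync was in progress */
    st->groupSize = st->commitCount;
    st->commitCount -= intime;    /* those are now on disk */
    pthread_mutex_unlock(&st->lock);
}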

WALWriteLock


The above would work nicely, except that a normal lwlock doesn't play 
nicely. You can release and reacquire a lightweight lock in the same time 
slice even when there are other backends queuing for the lock, effectively 
cutting the queue.


Here's what sometimes happens, with 2 clients:

Client 1   Client 2
do workdo work
insert commit record   insert commit record
acquire WALWriteLock
   try to acquire WALWriteLock, blocks
fsync
release WALWriteLock
begin new transaction
do work
insert commit record
reacquire WALWriteLock
wait for 2nd commit to arrive

Client 1 will eventually time out and commit just its own commit record. 
Client 2 should be released immediately after client 1 releases the 
WALWriteLock. It only needs to observe that its commit record has 
already been flushed and doesn't need to do anything.


To fix the above, and other race conditions like that, we need a 
specialized WALWriteLock that orders the waiters by the commit record 
XLogRecPtrs. WALWriteLockRelease wakes up all waiters that have their 
commit record already flushed. They will just fall through without 

Re: [HACKERS] Group Commit

2007-03-29 Thread Heikki Linnakangas

I wrote:
What are the chances of getting this in 8.3, assuming that I rewrite and 
submit a patch within the next week or two?


I also intend to do performance testing with different workloads to 
ensure the patch doesn't introduce a performance regression under some 
conditions.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Group Commit

2007-03-29 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 I've been working on the patch to enhance our group commit behavior. The 
 patch is a dirty hack at the moment, but I'm settled on the algorithm 
 I'm going to use and I know the issues involved.
 ...
 The timeout is currently hard-coded at 1 ms.

This is where my bogometer triggered.  There's way too many platforms
where 1 msec timeout is a sheer fantasy.  If you cannot make it perform
well with a 10-msec timeout then I don't think it's going to be at all
portable.

Now I know that newer Linux kernels tend to ship with 1KHz scheduler
tick rate, so there's a useful set of platforms where you could make it
work even so, but I'm not really satisfied with saying this facility is
only usable if you have a fast kernel tick rate ...

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Group Commit

2007-03-29 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
I've been working on the patch to enhance our group commit behavior. The 
patch is a dirty hack at the moment, but I'm settled on the algorithm 
I'm going to use and I know the issues involved.

...
The timeout is currently hard-coded at 1 ms.


This is where my bogometer triggered.  There's way too many platforms
where 1 msec timeout is a sheer fantasy.  If you cannot make it perform
well with a 10-msec timeout then I don't think it's going to be at all
portable.

Now I know that newer Linux kernels tend to ship with 1KHz scheduler
tick rate, so there's a useful set of platforms where you could make it
work even so, but I'm not really satisfied with saying this facility is
only usable if you have a fast kernel tick rate ...


The 1 ms timeout isn't essential for the algorithm. In fact, I chose it 
arbitrarily; in the quick tests I did the length of the timeout didn't 
seem to matter much. I'm running with a CONFIG_HZ=250 kernel myself, which 
means that the timeout is really 4 ms on my laptop.


I suspect the tick rate largely explains why the current commit_delay 
isn't very good: even though you specify it in microseconds, it 
really waits a lot longer. With the proposed algorithm, the fsync is 
started immediately when enough commit records have been inserted, so 
the timeout only comes into play when the estimate for the group size is 
too high.


With a higher-precision timer, we could vary not only the commit group 
size but also the timeout.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] Group Commit

2007-03-29 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 This is where my bogometer triggered.  There's way too many platforms
 where 1 msec timeout is a sheer fantasy.  If you cannot make it perform
 well with a 10-msec timeout then I don't think it's going to be at all
 portable.

 The 1 ms timeout isn't essential for the algorithm.

OK, but when you get to performance testing, please see how well it
works at CONFIG_HZ=100.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Group Commit

2007-03-27 Thread Richard Huxton

Heikki Linnakangas wrote:

Here's what's happening:

1. Client 1 issues fsync (A)
2. Clients 2-5 write their commit record, and try to fsync, but they 
have to wait for fsync (A) to finish.


It contains a graph that shows that the patch works very well for this 
test case. It's not very good for real life as it is, though. An obvious 
flaw is that if you have a longer-running transaction, effect 1. goes 
away. Instead of waiting for NBackends commit records, we should try to 
guess the number of transactions that are likely to finish in a 
reasonably short time. I'm thinking of keeping a running average of 
commits per second, or # of transactions that finish while an fsync is 
taking place.


Any thoughts?


Well, you did say *any* thoughts, so I guess mine count :-)

Do you not want to minimise the cost of #2 in your sequence? Some 
measure of total backend time spent waiting to commit.


I don't know how simple it is to measure/estimate the time spent for # 
of transactions that finish while an fsync is taking place.


--
  Richard Huxton
  Archonet Ltd

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


[HACKERS] Group Commit

2007-03-26 Thread Heikki Linnakangas
It's been known for years that commit_delay isn't very good at giving us 
group commit behavior. I did some experiments with this simple test 
case: BEGIN; INSERT INTO test VALUES (1); COMMIT;, with different 
numbers of concurrent clients and with and without commit_delay.


Summary for the impatient:
1. Current behavior sucks.
2. commit_delay doesn't help with # of clients < ~10. It does help with 
higher numbers, but it still sucks.

3. I'm working on a patch.


I added logging to show how many commit records are flushed on each 
fsync. The output with otherwise unpatched PG head looks like this, with 
5 clients:


LOG:  Flushed 4 out of 5 commits
LOG:  Flushed 1 out of 5 commits
LOG:  Flushed 4 out of 5 commits
LOG:  Flushed 1 out of 5 commits
LOG:  Flushed 4 out of 5 commits
LOG:  Flushed 1 out of 5 commits
LOG:  Flushed 4 out of 5 commits
LOG:  Flushed 1 out of 5 commits
LOG:  Flushed 3 out of 5 commits
LOG:  Flushed 2 out of 5 commits
LOG:  Flushed 3 out of 5 commits
LOG:  Flushed 2 out of 5 commits
LOG:  Flushed 3 out of 5 commits
LOG:  Flushed 2 out of 5 commits
LOG:  Flushed 3 out of 5 commits
...

Here's what's happening:

1. Client 1 issues fsync (A)
2. Clients 2-5 write their commit record, and try to fsync, but they 
have to wait for fsync (A) to finish.

3. fsync (A) finishes, freeing client 1.
4. One of clients 2-5 starts the next fsync (B), which will flush 
commits of clients 2-5 to disk
5. Client 1 begins new transaction, inserts commit record and tries to 
fsync. Needs to wait for previous fsync (B) to finish

6. fsync B finishes, freeing clients 2-5
7. Client 1 issues fsync (C)
8. ...

The 2-3-2-3 pattern can be explained with similar unfortunate 
resonance, but with two clients instead of client 1 in the above 
possibly running in separate cores (test was run on a dual-core laptop).


I also draw a diagram illustrating the above, attached.

I wrote a quick & dirty patch for this that I'm going to refine further, 
but wanted to get the results out for others to look at first. I'm not 
posting the patch yet, but it basically adds some synchronization to the 
WAL flushes. It introduces a counter of inserted but not yet flushed 
commit records. Instead of the commit_delay, the counter is checked. If 
it's smaller than NBackends, the process waits until count reaches 
NBackends, or a timeout expires. There's two significant differences to 
commit_delay here:
1. Instead of waiting for commit_delay to expire, processes are woken 
and fsync is started immediately when we know there's no more commit 
records coming that we should wait for. Even though commit_delay is 
given in microseconds, the real granularity of the wait can be as high 
as 10 ms, which is in the same ball park as the fsync itself.
2. commit_delay is not used when there's less than commit_siblings 
non-idle backends in the system. With very short transactions, it's 
worthwhile to wait even if that's the case, because a client can begin 
and finish a transaction in much shorter time than it takes to fsync. 
This is what makes the commit_delay not work at all in my test case 
with 2 clients.


Here's a spreadsheet with the results of the tests I ran:
http://community.enterprisedb.com/groupcommit-comparison.ods

It contains a graph that shows that the patch works very well for this 
test case. It's not very good for real life as it is, though. An obvious 
flaw is that if you have a longer-running transaction, effect 1. goes 
away. Instead of waiting for NBackends commit records, we should try to 
guess the number of transactions that are likely to finish in a 
reasonably short time. I'm thinking of keeping a running average of 
commits per second, or # of transactions that finish while an fsync is 
taking place.
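
(To make the running-average idea a bit more concrete, a tiny hedged
sketch -- the smoothing factor is arbitrary and none of this is in the
patch:)

static double commit_rate_avg = 1.0;    /* smoothed commits observed per fsync */

static int
next_group_size(int commits_during_fsync)
{
    const double alpha = 0.2;           /* arbitrary smoothing factor */

    commit_rate_avg = alpha * commits_during_fsync
                    + (1.0 - alpha) * commit_rate_avg;

    /* never wait for fewer commits than our own */
    return (commit_rate_avg < 1.0) ? 1 : (int) (commit_rate_avg + 0.5);
}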


Any thoughts?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match