Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-21 Thread Fujii Masao
On Thu, Mar 15, 2012 at 5:52 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 When all those changes are put together, the patched version now beats or
 matches the current code in the RAM drive tests, except that the
 single-client case is still about 10% slower. I added the new test results
 at http://community.enterprisedb.com/xloginsert-scale-tests/, and a new
 version of the patch is attached.

When I ran pgbench with v18 patch, I encountered the following PANIC error:

PANIC:  space reserved for WAL record does not match what was written

To investigate the cause, I applied the following changes and ran pgbench again,


diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index bfc7421..2cef0bd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1294,7 +1294,7 @@ CopyXLogRecordToWAL(int write_len, bool
isLogSwitch, XLogRecord *rechdr,
}

if (!XLByteEQ(CurrPos, EndPos))
-   elog(PANIC, space reserved for WAL record
does not match what was written);
+ elog(PANIC, space reserved for WAL record does not
match what was written, CurrPos: %X/%X, EndPos: %X/%X,
CurrPos.xlogid, CurrPos.xrecoff, EndPos.xlogid, EndPos.xrecoff);
}
else
{


then I got the following:

PANIC:  space reserved for WAL record does not match what was
written, CurrPos: C/0, EndPos: B/FF00

So I think that the patch would have a bug which handles WAL boundary wrongly.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-21 Thread Heikki Linnakangas

On 21.03.2012 13:14, Fujii Masao wrote:

 PANIC:  space reserved for WAL record does not match what was
written, CurrPos: C/0, EndPos: B/FF00

So I think that the patch would have a bug which handles WAL boundary wrongly.


Thanks for the testing! These WAL boundary issues are really tricky, you 
found bugs in that area before, and I found and fixed one before posting 
the last version, and apparently there's still at least one left.


Overall, what do you (and others) think about the state of this patch? 
I'm starting to feel that this needs to be pushed to 9.3. That bug might 
not be very hard to fix, but the fact that these bugs are still cropping 
up at this late stage makes me uneasy. That boundary calculation in 
particular is surprisingly tricky, and I think it could be made less 
tricky with some refactoring of the WAL-logging code, replacing 
XLogRecPtr with uint64s, like Peter (IIRC) suggested a while ago. And 
that seems like 9.3 material. Also, there's still these two known issues:


1. It slows down the WAL insertion in a single backend by about 10%
2. With lots of backends inserting tiny records concurrently, you get 
spinlock contention, which consumes a lot of CPU. Without the patch, you 
get lwlock contention and bandwidth isn't any better, but you sleep 
rather than spin.


I'm afraid those issues aren't easily fixable. I haven't been able to 
identify the source of slowdown in the single-backend case, it seems to 
be simply the distributed cost of the extra bookkeeping. That might be 
acceptable, 10% slowdown of raw WAL insertion speed is not good, but WAL 
insertion only accounts for a fraction of the total CPU usage for any 
real workload, so I believe the slowdown of a real application would be 
more like 1-3%, at worst. But I would feel more comfortable if we had 
more time to test that.


The spinlock contention issue might be acceptable too. I think it would 
be hard to run into it in a real application, and even then, the 
benchmarks show that although you spend a lot of CPU time spinning, you 
get at least the same overall bandwidth with the patch, which is what 
really matters. And I think it could be alleviated by reducing the time 
the spinlock is held, and I think that could be done by making the space 
reservation calculations simpler. If we got rid of the limitation that 
the WAL record header is never split across WAL pages, and always stored 
the continuation record header on all WAL pages, the space reservation 
calculation could be reduced to essentially currentpos += size. But 
that again seems 9.3 material.


So, although none of the issues alone is a show-stopper, but considering 
all these things together, I'm starting to feel that this needs to be 
pushed to 9.3. Thoughts?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-21 Thread Robert Haas
On Wed, Mar 21, 2012 at 7:52 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 So, although none of the issues alone is a show-stopper, but considering all
 these things together, I'm starting to feel that this needs to be pushed to
 9.3. Thoughts?

I think I agree.  I like the refactoring ideas that you're proposing,
but I don't really think we should be starting on that in mid-March.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-21 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 ... although none of the issues alone is a show-stopper, but considering 
 all these things together, I'm starting to feel that this needs to be 
 pushed to 9.3. Thoughts?

Agreed.  In particular, I think you are right that it'd be prudent to
simplify the WAL-location arithmetic and then rebase this code onto
that.  And since no code at all has been written for the arithmetic
change, I think we have to consider that it's not 9.2 material.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-21 Thread Fujii Masao
On Wed, Mar 21, 2012 at 10:59 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 ... although none of the issues alone is a show-stopper, but considering
 all these things together, I'm starting to feel that this needs to be
 pushed to 9.3. Thoughts?

 Agreed.  In particular, I think you are right that it'd be prudent to
 simplify the WAL-location arithmetic and then rebase this code onto
 that.  And since no code at all has been written for the arithmetic
 change, I think we have to consider that it's not 9.2 material.

Agreed.

BTW, the patch changes some functions so that they use volatile pointer,
as follows:

@@ -8448,7 +9232,7 @@ XLogReportParameters(void)
 void
 UpdateFullPageWrites(void)
 {
-   XLogCtlInsert *Insert = XLogCtl-Insert;
+   volatile XLogCtlInsert *Insert = XLogCtl-Insert;

These changes should be applied?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-12 Thread Jeff Janes
On Fri, Mar 9, 2012 at 2:45 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:


 Thanks!

 BTW, I haven't forgotten about the recovery bugs Jeff found earlier. I'm
 planning to do a longer run with his test script - I only run it for about
 1000 iterations - to see if I can reproduce the PANIC with both the earlier
 patch version he tested, and this new one.

Hi Heikki,

I've run the v12 patch for 17,489 rounds of crash and recovery, and
detected no assertion failures or other problems.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-12 Thread Heikki Linnakangas

On 09.03.2012 12:04, Heikki Linnakangas wrote:

I've been doing some performance testing with this, using a simple C
function that just inserts a dummy WAL record of given size. I'm not
totally satisfied. Although the patch helps with scalability at 3-4
concurrent backends doing WAL insertions, it seems to slow down the
single-client case with small WAL records by about 5-10%. This is what
Robert also saw with an earlier version of the patch
(http://archives.postgresql.org/pgsql-hackers/2011-12/msg01223.php). I
tested this with the data directory on a RAM drive, unfortunately I
don't have a server with a hard drive that can sustain the high
insertion rate. I'll post more detailed results, once I've refined the
tests a bit.


So, here's more detailed test results, using Greg Smith's excellent 
pgbench-tools test suite:


http://community.enterprisedb.com/xloginsert-scale-tests/

The workload in all of these tests was a simple C function that writes a 
lot of very small WAL records, with 16 bytes of payload each. I ran the 
tests with the data directory on a regular hard drive, on an SDD, and on 
a ram drive (/dev/shm). With HDD, I also tried fsync=off and 
synchronous_commit_off. For each of those, I ran the tests with 1-16 
concurrent backends.


Summary: The patch hurts single-backend performance by about 10%, except 
for the synchronous_commit=off test. Between 2-6 clients, it either 
helps, doesn't make any difference, or hurts. With  6 clients, it hurts.


So, that's quite disappointing. The patch has two problems: the 10% 
slowdown in single-client case, and the slowdown with  6 clients. I 
don't know where exactly the single-client slowdown comes from, although 
I'm not surprised that the bookkeeping with slots etc. has some 
overhead. Hopefully that overhead can be made smaller, if not eliminated 
completely..


The slowdown with  6 clients seems to be spinlock contention. I ran 
perf record for a short duration during one of the ramdrive tests, and 
saw the spinlock acquisition in ReserveXLogInsertLocation() consuming 
about 80% of all CPU time.


I then hacked the patch a little bit, removing the check in XLogInsert 
for fullPageWrites and forcePageWrites, as well as the check for did a 
checkpoint just happen (see 
http://community.enterprisedb.com/xloginsert-scale-tests/disable-fpwcheck.patch). 
My hunch was that accessing those fields causes cache line stealing, 
making the cache line containing the spinlock even more busy. That hunch 
seems to be correct; when I reran the tests with that patch, the 
performance with high # of clients became much better. See the results 
with xloginsert-scale-13.patch. With that change, the single-client 
case is still about 10% slower than current code, but the performance 
with  8 clients is almost as good as with current code. Between 2-6 
clients, the patch is a win.


The hack that restored the  6 clients performance to current level is 
not safe, of course, so I'll have to figure out a safe way to get that 
effect. Also, even when the performance is as good as current code, it's 
not good to spend all the CPU time spinning on the spinlock. I didn't 
measure the CPU usage with current code, but I would expect it to be 
sleeping, not spinning, when not doing useful work.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-12 Thread Fujii Masao
On Tue, Mar 13, 2012 at 1:59 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Fri, Mar 9, 2012 at 2:45 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:


 Thanks!

 BTW, I haven't forgotten about the recovery bugs Jeff found earlier. I'm
 planning to do a longer run with his test script - I only run it for about
 1000 iterations - to see if I can reproduce the PANIC with both the earlier
 patch version he tested, and this new one.

 Hi Heikki,

 I've run the v12 patch for 17,489 rounds of crash and recovery, and
 detected no assertion failures or other problems.

In v12 patch, I could no longer reproduce the assertion failure, too.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-09 Thread Heikki Linnakangas

On 07.03.2012 17:28, Tom Lane wrote:

Simon Riggssi...@2ndquadrant.com  writes:

On Wed, Mar 7, 2012 at 3:04 PM, Tom Lanet...@sss.pgh.pa.us  wrote:

Alvaro Herreraalvhe...@commandprompt.com  writes:

So they are undoubtely rare.  Not sure if as rare as Higgs bosons.



Even if they're rare, having a major performance hiccup when one happens
is not a side-effect I want to see from a patch whose only reason to
exist is better performance.



I agree the effect you point out can exist, I just don't want to slow
down the main case as a result.


I don't see any reason to think that what I suggested would slow things
down, especially not if the code were set up to fall through quickly in
the typical case where no page boundary is crossed.  Integer division is
not slow on any machine made in the last 15 years or so.


Agreed. I wasn't worried about the looping with extra-large records, but 
might as well not do it.


Here's an updated patch. It now only loops once per segment that a 
record crosses. Plus a lot of other small cleanup.


I've been doing some performance testing with this, using a simple C 
function that just inserts a dummy WAL record of given size. I'm not 
totally satisfied. Although the patch helps with scalability at 3-4 
concurrent backends doing WAL insertions, it seems to slow down the 
single-client case with small WAL records by about 5-10%. This is what 
Robert also saw with an earlier version of the patch 
(http://archives.postgresql.org/pgsql-hackers/2011-12/msg01223.php). I 
tested this with the data directory on a RAM drive, unfortunately I 
don't have a server with a hard drive that can sustain the high 
insertion rate. I'll post more detailed results, once I've refined the 
tests a bit.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-09 Thread Fujii Masao
On Fri, Mar 9, 2012 at 7:04 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Here's an updated patch. It now only loops once per segment that a record
 crosses. Plus a lot of other small cleanup.

Thanks! But you forgot to attach the patch.

 I've been doing some performance testing with this, using a simple C
 function that just inserts a dummy WAL record of given size. I'm not totally
 satisfied. Although the patch helps with scalability at 3-4 concurrent
 backends doing WAL insertions, it seems to slow down the single-client case
 with small WAL records by about 5-10%. This is what Robert also saw with an
 earlier version of the patch
 (http://archives.postgresql.org/pgsql-hackers/2011-12/msg01223.php). I
 tested this with the data directory on a RAM drive, unfortunately I don't
 have a server with a hard drive that can sustain the high insertion rate.
 I'll post more detailed results, once I've refined the tests a bit.

I'm also doing performance test. If I get interesting result, I'll post it.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-07 Thread Simon Riggs
On Tue, Mar 6, 2012 at 8:32 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 06.03.2012 17:12, Tom Lane wrote:
 How long is the current locked code exactly --- does it contain a loop?

 Perhaps best if you take a look for yourself, the function is called
 ReserveXLogInsertLocation() in patch. It calls a helper function called
   AdvanceXLogRecPtrToNextPage(ptr), which is small and could be inlined.
 It does contain one loop, which iterates once for every WAL page the
 record crosses.

 Hm.  The loop makes me a tad uncomfortable, because it is possible for
 WAL records to be very long (many pages).  I see the point that
 replacing the spinlock with an LWLock would likely negate any
 performance win from this patch, but having other processes arrive and
 spin while somebody is busy calculating the size of a multi-megabyte
 commit record would be bad too.

I would have thought the existence of a multi-megabyte commit record
would already imply a huge performance effect elsewhere and we
wouldn't care too much about a few spinlock cycles. But I think
they're as rare as Higgs bosons.

If/when such records do exist it isn't likely to be on a high
transaction rate server. Even allocating ~1 million xids takes long
enough that we wouldn't be expecting a very high commit rate even with
200 concurrent sessions doing this. If such records are rare, then the
minor blip they cause is just a drop in the ocean.

So I think Tom's concern is valid, but the frequency of problems
resulting from it will be so low as to not even be measurable. And
before we fix a perceived performance issue, we really should prove
its existence first, then confirm that this area is the bottleneck
that is slowing such workloads.

So +1 to Heikki keeping the spinlock, as is, and not redesign anything.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-07 Thread Alvaro Herrera

Excerpts from Simon Riggs's message of mié mar 07 05:35:44 -0300 2012:
 On Tue, Mar 6, 2012 at 8:32 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
  On 06.03.2012 17:12, Tom Lane wrote:
  How long is the current locked code exactly --- does it contain a loop?
 
  Perhaps best if you take a look for yourself, the function is called
  ReserveXLogInsertLocation() in patch. It calls a helper function called
    AdvanceXLogRecPtrToNextPage(ptr), which is small and could be inlined.
  It does contain one loop, which iterates once for every WAL page the
  record crosses.
 
  Hm.  The loop makes me a tad uncomfortable, because it is possible for
  WAL records to be very long (many pages).  I see the point that
  replacing the spinlock with an LWLock would likely negate any
  performance win from this patch, but having other processes arrive and
  spin while somebody is busy calculating the size of a multi-megabyte
  commit record would be bad too.
 
 I would have thought the existence of a multi-megabyte commit record
 would already imply a huge performance effect elsewhere and we
 wouldn't care too much about a few spinlock cycles. But I think
 they're as rare as Higgs bosons.

Just to keep things in perspective -- For a commit record to reach one
megabyte, it would have to be a transaction that drops over 43k tables.
Or have 64k smgr inval messages (for example, a TRUNCATE might send half
a dozen of these messages). Or have 262k subtransactions.  Or
combinations thereof.

Now admittedly, a page is only 8 kB, so for a commit record to be many
pages long (that is, =3) it would require about 1500 smgr inval
messages, or, say, about 250 TRUNCATEs (of permanent tables with at
least one toastable field and at least one index).

So they are undoubtely rare.  Not sure if as rare as Higgs bosons.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-07 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Just to keep things in perspective -- For a commit record to reach one
 megabyte, it would have to be a transaction that drops over 43k tables.
 Or have 64k smgr inval messages (for example, a TRUNCATE might send half
 a dozen of these messages). Or have 262k subtransactions.  Or
 combinations thereof.

 Now admittedly, a page is only 8 kB, so for a commit record to be many
 pages long (that is, =3) it would require about 1500 smgr inval
 messages, or, say, about 250 TRUNCATEs (of permanent tables with at
 least one toastable field and at least one index).

What about the locks (if running hot-standby)?

 So they are undoubtely rare.  Not sure if as rare as Higgs bosons.

Even if they're rare, having a major performance hiccup when one happens
is not a side-effect I want to see from a patch whose only reason to
exist is better performance.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-07 Thread Simon Riggs
On Wed, Mar 7, 2012 at 3:04 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Alvaro Herrera alvhe...@commandprompt.com writes:
 Just to keep things in perspective -- For a commit record to reach one
 megabyte, it would have to be a transaction that drops over 43k tables.
 Or have 64k smgr inval messages (for example, a TRUNCATE might send half
 a dozen of these messages). Or have 262k subtransactions.  Or
 combinations thereof.

 Now admittedly, a page is only 8 kB, so for a commit record to be many
 pages long (that is, =3) it would require about 1500 smgr inval
 messages, or, say, about 250 TRUNCATEs (of permanent tables with at
 least one toastable field and at least one index).

 What about the locks (if running hot-standby)?

It's a list of active AccessExclusiveLocks. If that list is long you
can be sure not much else is happening on the server.


 So they are undoubtely rare.  Not sure if as rare as Higgs bosons.

 Even if they're rare, having a major performance hiccup when one happens
 is not a side-effect I want to see from a patch whose only reason to
 exist is better performance.

I agree the effect you point out can exist, I just don't want to slow
down the main case as a result.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-07 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Wed, Mar 7, 2012 at 3:04 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Alvaro Herrera alvhe...@commandprompt.com writes:
 So they are undoubtely rare.  Not sure if as rare as Higgs bosons.

 Even if they're rare, having a major performance hiccup when one happens
 is not a side-effect I want to see from a patch whose only reason to
 exist is better performance.

 I agree the effect you point out can exist, I just don't want to slow
 down the main case as a result.

I don't see any reason to think that what I suggested would slow things
down, especially not if the code were set up to fall through quickly in
the typical case where no page boundary is crossed.  Integer division is
not slow on any machine made in the last 15 years or so.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-07 Thread Jeff Janes
On Mon, Mar 5, 2012 at 8:50 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 That particular issue would be very hard to hit in practice, so I don't know
 if this could explain the recovery failures that Jeff saw. I got the test
 script running (thanks for that Jeff!), but unfortunately have not seen any
 failures yet (aside from the issue with crossing xlogid boundary), with
 either this or the older version of the patch.

 Attached is a new version of the patch.

I've run patch v10 for 14109 cycles of crash and recovery, and there
were 8 assertion failures at xlog.c, Line: 2106 during the
end-of-recovery checkpoint.

How many cycles have you run?  Assuming the crashes follow a simple
binomial distribution with the frequency I see, you would have to run
for ~1230 cycles for a 50% chance of experiencing at least one, or for
~8120 cycles for a 99% chance of experiencing at least one.

I think Fujii's method if provoking this problem is more efficient
than mine, although I haven't tried it myself.

Dual Core AMD Opteron(tm) Processor 275
2.6.32.36-0.5-default #1 SMP 2011-04-14 10:12:31 +0200 x86_64 x86_64
x86_64 GNU/Linux

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-06 Thread Fujii Masao
On Tue, Mar 6, 2012 at 2:17 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 21.02.2012 13:19, Fujii Masao wrote:
 In some places, the spinlock insertpos_lck is taken while another
 spinlock info_lck is being held. Is this OK? What if unfortunately
 inner spinlock takes long to be taken?

 Hmm, that's only done at a checkpoint (and a restartpoint), so I doubt
 that's a big issue in practice. We had the same pattern before the
 patch, just with WALInsertLock instead of insertpos_lck. Holding a
 spinlock longer is much worse than holding a lwlock longer, but
 nevertheless I don't think that's a problem.

 No, that's NOT okay.  A spinlock is only supposed to be held across a
 short straight-line sequence of instructions.

This also strikes me that the usage of the spinlock insertpos_lck might
not be OK in ReserveXLogInsertLocation() because a few dozen instructions
can be performed while holding the spinlock

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-06 Thread Fujii Masao
On Tue, Mar 6, 2012 at 1:50 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 +        * An xlog-switch record consumes all the remaining space on the
 +        * WAL segment. We have already reserved it for us, but we still
 need
 +        * to make sure it's been allocated and zeroed in the WAL buffers
 so
 +        * that when the caller (or someone else) does XLogWrite(), it can
 +        * really write out all the zeros.

 Why do we need to write out all the remaining space with zeros? In
 current master, we don't do that. A recovery code ignores the data
 following XLOG_SWITCH record, so I don't think that's required.


 It's to keep the logic simpler. Before the patch, an xlog-switch just
 initialized the next page in the WAL buffers to insert to, to be the first
 page in the next segment. With this patch, we rely on a simple linear
 mapping from an XLogRecPtr to the WAL buffer that should contain that page
 (see XLogRecPtrToBufIdx()). Such a mapping is not possible if you sometimes
 skip over pages in the WAL buffers, so we allocate the buffers for those
 empty pages, too. Note that this means that an xlog-switch can blow through
 all your WAL buffers.

 We could probably optimize that so that you don't need to actually write()
 and fsync() all the zeros, perhaps by setting a flag on the WAL buffer to
 indicate that it only contains padding for an xlog-switch. However, I don't
 see any easy way to avoid blowing the cache.

 I'm thinking that xlog-switching happens so seldom, and typically on a
 fairly idle system, so we don't need to optimize it much. I guess we should
 measure the impact, though..

Online backup which forces an xlog-switch twice might be performed under
a certain amount of load. So, to avoid the performance spike during online
backup, I think that the optimization which skips write() and fsync() of
the padding bytes is helpful.

 On 22.02.2012 03:34, Fujii Masao wrote:
 When I ran the long-running performance test, I encountered the following
 panic error.

      PANIC:  could not find WAL buffer for 0/FF00

 0/FF00 is the xlog file boundary, so the patch seems to handle
 the xlog file boundary incorrectly. In the patch, current insertion lsn
 is advanced by directly incrementing XLogRecPtr.xrecoff as follows.
 But to handle the xlog file boundary correctly, we should use
 XLByteAdvance() for that, instead?

 Thanks, fixed this, too.

 I made the locking a bit more strict in WaitXLogInsertionsToFinish(), so
 that it grabs the insertpos_lck to read nextslot. I previously thought that
 was not necessary, assuming that reading/writing an int32 is atomic, but I'm
 afraid there might be memory-ordering issues where the CurrPos of the most
 recent slot has not become visible to other backends yet, while the
 advancing of nextslot has.

 That particular issue would be very hard to hit in practice, so I don't know
 if this could explain the recovery failures that Jeff saw. I got the test
 script running (thanks for that Jeff!), but unfortunately have not seen any
 failures yet (aside from the issue with crossing xlogid boundary), with
 either this or the older version of the patch.

 Attached is a new version of the patch.

Thanks for the new patch!

In this new patch, I again was able to reproduce the assertion failure which
I described on the upthread.
http://archives.postgresql.org/message-id/CAHGQGwGRuNJ%3D_ctXwteNkFkdvMDNFYxFdn0D1cd-CqL0OgNCLg%40mail.gmail.com

$ uname -a
Linux hermes 3.0.0-16-generic #28-Ubuntu SMP Fri Jan 27 17:50:54 UTC
2012 i686 i686 i386 GNU/Linux

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-06 Thread Heikki Linnakangas

On 06.03.2012 14:52, Fujii Masao wrote:

On Tue, Mar 6, 2012 at 2:17 AM, Tom Lanet...@sss.pgh.pa.us  wrote:

Heikki Linnakangasheikki.linnakan...@enterprisedb.com  writes:

On 21.02.2012 13:19, Fujii Masao wrote:

In some places, the spinlock insertpos_lck is taken while another
spinlock info_lck is being held. Is this OK? What if unfortunately
inner spinlock takes long to be taken?



Hmm, that's only done at a checkpoint (and a restartpoint), so I doubt
that's a big issue in practice. We had the same pattern before the
patch, just with WALInsertLock instead of insertpos_lck. Holding a
spinlock longer is much worse than holding a lwlock longer, but
nevertheless I don't think that's a problem.


No, that's NOT okay.  A spinlock is only supposed to be held across a
short straight-line sequence of instructions.


Ok, that's easy enough to fix.


This also strikes me that the usage of the spinlock insertpos_lck might
not be OK in ReserveXLogInsertLocation() because a few dozen instructions
can be performed while holding the spinlock


I admit that block is longer than any of our existing spinlock blocks. 
However, it's important for performance. I tried using a lwlock earlier, 
and that negated the gains. So if that's a serious objection, then let's 
resolve that now before I spend any more time on other aspects of the 
patch. Any ideas how to make that block shorter?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-06 Thread Robert Haas
On Tue, Mar 6, 2012 at 10:07 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 I admit that block is longer than any of our existing spinlock blocks.
 However, it's important for performance. I tried using a lwlock earlier, and
 that negated the gains. So if that's a serious objection, then let's resolve
 that now before I spend any more time on other aspects of the patch. Any
 ideas how to make that block shorter?

We shouldn't put the cart in front of the horse.  The point of keeping
spinlock acquisitions short is to improve performance by preventing
excess spinning.  If the performance is better with a spinlock than
with an lwlock, then clearly the spinning isn't excessive, or at least
not in the case you tested.

That having been said, shorter critical sections are always good, of course...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-06 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 06.03.2012 14:52, Fujii Masao wrote:
 This also strikes me that the usage of the spinlock insertpos_lck might
 not be OK in ReserveXLogInsertLocation() because a few dozen instructions
 can be performed while holding the spinlock

 I admit that block is longer than any of our existing spinlock blocks. 
 However, it's important for performance. I tried using a lwlock earlier, 
 and that negated the gains. So if that's a serious objection, then let's 
 resolve that now before I spend any more time on other aspects of the 
 patch. Any ideas how to make that block shorter?

How long is the current locked code exactly --- does it contain a loop?

I'm not sure where the threshold of pain is for length of time holding a
spinlock.  I wouldn't go out of the way to avoid using a spinlock for
say a hundred instructions, at least not unless it was a very
high-contention lock.  But sleeping while holding a spinlock is right out.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-06 Thread Heikki Linnakangas

On 06.03.2012 17:12, Tom Lane wrote:

Heikki Linnakangasheikki.linnakan...@enterprisedb.com  writes:

On 06.03.2012 14:52, Fujii Masao wrote:

This also strikes me that the usage of the spinlock insertpos_lck might
not be OK in ReserveXLogInsertLocation() because a few dozen instructions
can be performed while holding the spinlock



I admit that block is longer than any of our existing spinlock blocks.
However, it's important for performance. I tried using a lwlock earlier,
and that negated the gains. So if that's a serious objection, then let's
resolve that now before I spend any more time on other aspects of the
patch. Any ideas how to make that block shorter?


How long is the current locked code exactly --- does it contain a loop?


Perhaps best if you take a look for yourself, the function is called 
ReserveXLogInsertLocation() in patch. It calls a helper function called 
 AdvanceXLogRecPtrToNextPage(ptr), which is small and could be inlined. 
It does contain one loop, which iterates once for every WAL page the 
record crosses.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-06 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 06.03.2012 17:12, Tom Lane wrote:
 How long is the current locked code exactly --- does it contain a loop?

 Perhaps best if you take a look for yourself, the function is called 
 ReserveXLogInsertLocation() in patch. It calls a helper function called 
   AdvanceXLogRecPtrToNextPage(ptr), which is small and could be inlined. 
 It does contain one loop, which iterates once for every WAL page the 
 record crosses.

Hm.  The loop makes me a tad uncomfortable, because it is possible for
WAL records to be very long (many pages).  I see the point that
replacing the spinlock with an LWLock would likely negate any
performance win from this patch, but having other processes arrive and
spin while somebody is busy calculating the size of a multi-megabyte
commit record would be bad too.

What I suggest is that it should not be necessary to crawl forward one
page at a time to figure out how many pages will be needed to store N
bytes worth of WAL data.  You're basically implementing a division
problem as repeated subtraction.  Getting the extra WAL-segment-start
overhead right would be slightly tricky; but even if you didn't want to
try to make it pure straight-line code, at the very least it seems like
you could set it up so that the loop iterates only once per segment not
page.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-06 Thread Fujii Masao
On Wed, Mar 7, 2012 at 5:32 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 What I suggest is that it should not be necessary to crawl forward one
 page at a time to figure out how many pages will be needed to store N
 bytes worth of WAL data.  You're basically implementing a division
 problem as repeated subtraction.  Getting the extra WAL-segment-start
 overhead right would be slightly tricky; but even if you didn't want to
 try to make it pure straight-line code, at the very least it seems like
 you could set it up so that the loop iterates only once per segment not
 page.

+1

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-05 Thread Heikki Linnakangas

On 20.02.2012 08:00, Amit Kapila wrote:

I was trying to understand this patch and had few doubts:

1. In PerformXLogInsert(), why there is need to check freespace when
already during ReserveXLogInsertLocation(), the space is reserved. Is
it possible that the record size is more than actually calculted in
ReserveXLogInsertLocation(), if so in that case what I understand is
it is moving to next page to write, however isn't it possible that
some other backend had already reserved that space.


The calculations between PerformXLogInsert (called CopyXLogRecordToWAL() 
in the latest patch version) and ReserveXLogInsertLocation() must always 
match, otherwise we have reserved incorrect amount of WAL and you get 
corrupt WAL. They both need to do the same calculations of how the WAL 
record is split across pages, which depends on how much free space there 
is on the first page. There is an assertion in CopyXLogRecordToWAL() to 
check that once it's finished writing the WAL record, the last byte 
landed on the position that ReserveXLogInsertLocation() calculated it 
would.


Another way to do that would be to remember the calculations done in 
ReserveXLogInsertLocation(), in an extra array or something. But we want 
to keep ReserveXLogInsertLocation() as simple as possible, as that runs 
while holding the spinlock. Any extra CPU cycles there will hurt 
scalability.



2. In function WaitForXLogInsertionSlotToBecomeFree(), chances are
there such that when nextslot equals lastslot, all new backends try
to  reserve a slot will start waiting on same last slot which can
lead to serialization for those backends and can impact latency.


True. That warrants some performance testing to see if that effect is 
significant. (it's surely better than the current situation, anyway, 
where all WAL insertions block on the single lock)



3. GetXlogBuffer - This will get called twice, once for normal
buffer, second time for when there is not enough space in current
page, and both times it can lead to I/O whereas in earlier algorithm,
the chances of I/O is only once.


I don't see any difference to the previous situation. In both cases, if 
you need a new page to copy the WAL record to, you need to first flush 
out some old pages from the WAL buffers if they're all dirty. The patch 
doesn't change the number of WAL buffers consumed. Note that 
GetXLogBuffer() is very cheap when it doesn't need to do I/O, extra 
calls to it don't matter if the page is already initialized.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-05 Thread Heikki Linnakangas

On 21.02.2012 13:19, Fujii Masao wrote:

On Sat, Feb 18, 2012 at 12:36 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

Attached is a new version, fixing that, and off-by-one bug you pointed out
in the slot wraparound handling. I also moved code around a bit, I think
this new division of labor between the XLogInsert subroutines is more
readable.


This patch includes not only xlog scaling improvement but also other ones.
I think it's better to extract them as separate patches and commit them first.
If we do so, the main patch would become more readable.


Good point.


For example, I think that the followings can be extracted as a separate patch.

   (1) Make walwriter try to initialize as many of the no-longer-needed
WAL buffers
   for future use as we can.


This is pretty hard to extract from the larger patch. The current code 
in master assumes that there's only one page that is currently inserted 
to, and relies on WALInsertLock being held in AdvanceXLInsertBuffer(). 
The logic with the scaling patch is quite different.



   (2) Refactor the update full_page_writes code.
   (3) Get rid of XLogCtl-Write.LogwrtResult and XLogCtl-Insert.LogwrtResult.


Attached are patches for these two items. Barring objections, I'll 
commit these.



   (4) Call TRACE_POSTGRESQL_XLOG_SWITCH() even if the xlog switch has no
work to do.


Actually, I think I'll just move it in the patch to keep the existing 
behavior.



I'm not sure if (3) makes sense. In current master, those two shared variables
are used to reduce the contention of XLogCtl-info_lck and WALWriteLock.
You think they have no effect on reducing the lock contention?


XLogCtl-Write.LogwrtResult certainly seems redundant with 
XLogCtl-LogwrtResult. There might be some value in 
XLogCtl-Insert.LogwrtResult, it's used in AdvanceXLInsertBuffer() to 
before acquiring info_lck. But I doubt that makes any difference in 
practice either. At best it's saving one spinlock acquisition per WAL 
buffer, which isn't all much compared to all the other work involved. 
(once the scaling patch is committed, this point is moot anyway)


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 266c0de..eb7932e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -731,8 +731,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 	unsigned	i;
 	bool		updrqst;
 	bool		doPageWrites;
-	bool		isLogSwitch = false;
-	bool		fpwChange = false;
+	bool		isLogSwitch = (rmid == RM_XLOG_ID  info == XLOG_SWITCH);
 	uint8		info_orig = info;
 
 	/* cross-check on whether we should be here or not */
@@ -746,30 +745,11 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
 
 	/*
-	 * Handle special cases/records.
+	 * In bootstrap mode, we don't actually log anything but XLOG resources;
+	 * return a phony record pointer.
 	 */
-	if (rmid == RM_XLOG_ID)
+	if (IsBootstrapProcessingMode()  rmid != RM_XLOG_ID)
 	{
-		switch (info)
-		{
-			case XLOG_SWITCH:
-isLogSwitch = true;
-break;
-
-			case XLOG_FPW_CHANGE:
-fpwChange = true;
-break;
-
-			default:
-break;
-		}
-	}
-	else if (IsBootstrapProcessingMode())
-	{
-		/*
-		 * In bootstrap mode, we don't actually log anything but XLOG resources;
-		 * return a phony record pointer.
-		 */
 		RecPtr.xlogid = 0;
 		RecPtr.xrecoff = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
 		return RecPtr;
@@ -1232,15 +1212,6 @@ begin:;
 		WriteRqst = XLogCtl-xlblocks[curridx];
 	}
 
-	/*
-	 * If the record is an XLOG_FPW_CHANGE, we update full_page_writes
-	 * in shared memory before releasing WALInsertLock. This ensures that
-	 * an XLOG_FPW_CHANGE record precedes any WAL record affected
-	 * by this change of full_page_writes.
-	 */
-	if (fpwChange)
-		Insert-fullPageWrites = fullPageWrites;
-
 	LWLockRelease(WALInsertLock);
 
 	if (updrqst)
@@ -8517,6 +8488,22 @@ UpdateFullPageWrites(void)
 	if (fullPageWrites == Insert-fullPageWrites)
 		return;
 
+	START_CRIT_SECTION();
+
+	/*
+	 * It's always safe to take full page images, even when not strictly
+	 * required, but not the other round. So if we're setting full_page_writes
+	 * to true, first set it true and then write the WAL record. If we're
+	 * setting it to false, first write the WAL record and then set the
+	 * global flag.
+	 */
+	if (fullPageWrites)
+	{
+		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+		Insert-fullPageWrites = true;
+		LWLockRelease(WALInsertLock);
+	}
+
 	/*
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep
 	 * track of full_page_writes during archive recovery, if required.
@@ -8532,12 +8519,15 @@ UpdateFullPageWrites(void)
 
 		XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE, rdata);
 	}
-	else
+
+
+	if (!fullPageWrites)
 	{
 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
-		Insert-fullPageWrites = fullPageWrites;
+		

Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-03-05 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 21.02.2012 13:19, Fujii Masao wrote:
 In some places, the spinlock insertpos_lck is taken while another
 spinlock info_lck is being held. Is this OK? What if unfortunately
 inner spinlock takes long to be taken?

 Hmm, that's only done at a checkpoint (and a restartpoint), so I doubt 
 that's a big issue in practice. We had the same pattern before the 
 patch, just with WALInsertLock instead of insertpos_lck. Holding a 
 spinlock longer is much worse than holding a lwlock longer, but 
 nevertheless I don't think that's a problem.

No, that's NOT okay.  A spinlock is only supposed to be held across a
short straight-line sequence of instructions.  Something that could
involve a spin loop, or worse a sleep() kernel call, is right out.
Please change this.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-25 Thread Jeff Janes
On Tue, Feb 21, 2012 at 5:34 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Feb 21, 2012 at 8:19 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sat, Feb 18, 2012 at 12:36 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 Attached is a new version, fixing that, and off-by-one bug you pointed out
 in the slot wraparound handling. I also moved code around a bit, I think
 this new division of labor between the XLogInsert subroutines is more
 readable.

 When I ran the long-running performance test, I encountered the following
 panic error.

    PANIC:  could not find WAL buffer for 0/FF00

I too see this panic when the system survives long enough to get to
that log switch.

But I'm also still seeing (with version 9) the assert failure at
xlog.c, Line: 2154 during the end-of-recovery checkpoint.

Here is a set up for repeating my tests.  I used this test simply
because I had it sitting around after having written it for other
purposes.  Indeed I'm not all that sure I should publish it.
Hopefully other people will write other tests which exercise other
corner cases, rather than exercising the same ones I am.

The patch creates a guc which causes the md writer routine to panic
and bring down the database, triggering recovery, after a given number
for writes.  In this context probably any other method of forcing a
crash and recovery would be just as good as this specific method of
crashing.

The choice of 400 for the cutoff for crashing is based on:

1) If the number is too low, you re-crash within recovery so you never
get a chance to inspect the database.  In my hands, recovery doesn't
need to do more than 400 writes. (I don't know how to make the
database use different guc setting during recovery than it did before
the crash).

2) If the number is too high, it takes too long for a crash to happen
and I'm not all that patient.

Some of the changes to postgresql.conf.sample are purely my
preferences and have nothing in particular to do with this set up.
But archive_timeout = 30 is necessary in order to get checkpoints, and
thus mdwrites, to happen often enough to trigger crashes often enough
to satisfy my impatience.

The Perl script exercises the integrity of the database by launching
multiple processes (4 by default) to run updates and memorize what
updates they have run.  After a crash, the Perl processes all
communicate their data up to the parent, which consolidates that
information and then queries the post-recovery database to make sure
it agrees.  Transactions that are in-flight at the time of a crash are
indeterminate.  Maybe the crash happened before the commit, and maybe
it happened after the commit but before we received notification of
the commit.  So whichever way those turn out, it is not proof of
corruption.

With the xloginsert-scale-9.patch, the above features are not needed
because the problem is not that the database is incorrect after
recovery, but that the database doesn't recover in the first place. So
just running pgbench would be good enough to detect that.  But in
earlier versions this feature did detect incorrect recovery.

This logs an awful lot of stuff, most of which merely indicates normal
operation.  The problem is that corruption is rare, so if you wait
until you see corruption before turning on logging, then you have to
wait l long time to get another instance of corruption so you can
dissect the log information.  So, I just log everything all of the
time.
A warning from 'line 63' which is not marked as in-flight indicates
database corruption.  A warning from 'line 66' indicates even worse
corruption. A failure of the entire outer script to execute for the
expected number of iterations (i.e. failure of the warning issued on
'line 18' to show up 100 times) indicates the database failed to
restart.

Also attached is a bash script that exercises the whole thing.  Note
that it has various directories hard coded that really ought not be,
and that it has no compunctions about calling rm -r /tmp/data.  I run
it is as ./do.sh  log and then inspect the log file for unusual
lines.

To run this, you first have to apply your own xlog patch, and apply my
crash-inducing patch, and build and install the resulting pgsql.  And
edit the shell script to point to it, etc..  The whole thing is a bit
of a idiosyncratic mess.

Cheers,

Jeff


crash_REL9_2CF4.patch
Description: Binary data


count.pl
Description: Binary data


do.sh
Description: Bourne shell script

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-21 Thread Fujii Masao
On Sat, Feb 18, 2012 at 12:36 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Attached is a new version, fixing that, and off-by-one bug you pointed out
 in the slot wraparound handling. I also moved code around a bit, I think
 this new division of labor between the XLogInsert subroutines is more
 readable.

This patch includes not only xlog scaling improvement but also other ones.
I think it's better to extract them as separate patches and commit them first.
If we do so, the main patch would become more readable.
For example, I think that the followings can be extracted as a separate patch.

  (1) Make walwriter try to initialize as many of the no-longer-needed
WAL buffers
  for future use as we can.

  (2) Refactor the update full_page_writes code.

  (3) Get rid of XLogCtl-Write.LogwrtResult and XLogCtl-Insert.LogwrtResult.

  (4) Call TRACE_POSTGRESQL_XLOG_SWITCH() even if the xlog switch has no
   work to do.

  Others?

I'm not sure if (3) makes sense. In current master, those two shared variables
are used to reduce the contention of XLogCtl-info_lck and WALWriteLock.
You think they have no effect on reducing the lock contention?

In some places, the spinlock insertpos_lck is taken while another
spinlock info_lck
is being held. Is this OK? What if unfortunately inner spinlock takes
long to be taken?

+* An xlog-switch record consumes all the remaining space on the
+* WAL segment. We have already reserved it for us, but we still need
+* to make sure it's been allocated and zeroed in the WAL buffers so
+* that when the caller (or someone else) does XLogWrite(), it can
+* really write out all the zeros.

Why do we need to write out all the remaining space with zeros? In
current master,
we don't do that. A recovery code ignores the data following XLOG_SWITCH record,
so I don't think that's required.

+   /* XXX: before this patch, TRACE_POSTGRESQL_XLOG_SWITCH was not called
+* if the xlog switch had no work to do, ie. if we were already at the
+* beginning of a new XLOG segment. You can check if RecPtr points to
+* beginning of a segment if you want to keep the distinction.
+*/
+   TRACE_POSTGRESQL_XLOG_SWITCH();

If so, RecPtr (or the flag indicating whether the xlog switch has no
work to do) should
be given to TRACE_POSTGRESQL_XLOG_SWITCH() as an argument?


The followings are trivial comments:

+* insertion, but ẃhile we're at it, advance lastslot as much as we

Typo: 'ẃ' should be 'w'

In XLogRecPtrToBufIdx() and XLogRecEndPtrToBufIdx(), why don't you use
XLogFileSize
instead of XLogSegsPerFile * XLogSegSize?

There are some source code comments which still refer to WALInsertLock.

It's cleaner if pg_start_backup_callback() uses Insert instead of
XLogCtl-Insert,
like do_pg_start_backup(), do_pg_stop_backup() and do_pg_abort_backup() do.

+   freespace = XLOG_BLCKSZ - EndRecPtr.xrecoff % XLOG_BLCKSZ;

Though this is not incorrect, it's better to use EndOfLog instead of EndRecPtr,
like near code does.

while (extraWaits--  0)
PGSemaphoreUnlock(MyProc-sem);

This should be executed also in WaitForXLogInsertionSlotToBecomeFree()?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-21 Thread Fujii Masao
On Tue, Feb 21, 2012 at 8:19 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sat, Feb 18, 2012 at 12:36 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 Attached is a new version, fixing that, and off-by-one bug you pointed out
 in the slot wraparound handling. I also moved code around a bit, I think
 this new division of labor between the XLogInsert subroutines is more
 readable.

When I ran the long-running performance test, I encountered the following
panic error.

PANIC:  could not find WAL buffer for 0/FF00

0/FF00 is the xlog file boundary, so the patch seems to handle
the xlog file boundary incorrectly. In the patch, current insertion lsn
is advanced by directly incrementing XLogRecPtr.xrecoff as follows.
But to handle the xlog file boundary correctly, we should use
XLByteAdvance() for that, instead?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-20 Thread Fujii Masao
On Sun, Feb 19, 2012 at 3:01 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 I've tested your v9 patch.  I no longer see any inconsistencies or
 lost transactions in the recovered database.  But occasionally I get
 databases that fail to recover at all.
 It has always been with the exact same failed assertion, at xlog.c line 2154.

 I've only seen this 4 times out of 2202 cycles of crash and recovery,
 so it must be some rather obscure situation.

 LOG:  database system was not properly shut down; automatic recovery in 
 progress
 LOG:  redo starts at 0/180001B0
 LOG:  unexpected pageaddr 0/15084000 in log file 0, segment 25, offset 540672
 LOG:  redo done at 0/19083FD0
 LOG:  last completed transaction was at log time 2012-02-17 11:13:50.369488-08
 LOG:  checkpoint starting: end-of-recovery immediate
 TRAP: FailedAssertion(!(((uint64) (NewPageEndPtr).xlogid *
 (uint64) (((uint32) 0x) / ((uint32) (16 * 1024 * 1024))) *
 ((uint32) (16 * 1024 * 1024))) + (NewPageEndPtr).xrecoff - 1)) / 8192)
 % (XLogCtl-XLogCacheBlck + 1)) == nextidx), File: xlog.c, Line:
 2154)
 LOG:  startup process (PID 5390) was terminated by signal 6: Aborted
 LOG:  aborting startup due to startup process failure

I could reproduce this when I made the server crash just after executing
select pg_switch_xlog().

$ initdb -D data
$ pg_ctl -D data start
$ psql -c select pg_switch_xlog()
$ pg_ctl -D data stop -m i
$ pg_ctl -D data start
...
LOG:  redo done at 0/16E3B0C
TRAP: FailedAssertion(!(((uint64) (NewPageEndPtr).xlogid *
(uint64) (((uint32) 0x) / ((uint32) (16 * 1024 * 1024))) *
((uint32) (16 * 1024 * 1024))) + (NewPageEndPtr).xrecoff - 1)) / 8192)
% (XLogCtl-XLogCacheBlck + 1)) == nextidx), File: xlog.c, Line:
2154)
LOG:  startup process (PID 16361) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure

Though I've not read new patch yet, I doubt that xlog switch code would
still have a bug.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-19 Thread Amit Kapila
I was trying to understand this patch and had few doubts:

1. In PerformXLogInsert(), why there is need to check freespace when already
during ReserveXLogInsertLocation(), 
   the space is reserved. 
   Is it possible that the record size is more than actually calculted in
ReserveXLogInsertLocation(), if so in that 
   case what I understand is it is moving to next page to write, however
isn't it possible that some other backend had 
   already reserved that space.

2. In function WaitForXLogInsertionSlotToBecomeFree(), chances are there
such that when nextslot equals lastslot, all new 
   backends try to  reserve a slot will start waiting on same last slot
which can lead to serialization for those 
   backends and can impact latency. 

3. GetXlogBuffer - This will get called twice, once for normal buffer,
second time for when there is not enough space in 
   current page, and both times it can lead to I/O whereas in earlier
algorithm, the chances of I/O is only once.



-Original Message-
From: pgsql-hackers-ow...@postgresql.org
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Heikki Linnakangas
Sent: Friday, February 17, 2012 9:07 PM
To: Fujii Masao
Cc: Jeff Janes; Robert Haas; PostgreSQL-development
Subject: Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work
outside WALInsertLock)

On 17.02.2012 07:27, Fujii Masao wrote:
 Got another problem: when I ran pg_stop_backup to take an online 
 backup, it got stuck until I had generated new WAL record. This 
 happens because, in the patch, when pg_stop_backup forces a switch to 
 new WAL file, old WAL file is not marked as archivable until next new 
 WAL record has been inserted, but pg_stop_backup keeps waiting for that
WAL file to be archived.
 OTOH, without the patch, WAL file is marked as archivable as soon as 
 WAL file switch occurs.

 So, in short, the patch seems to handle the WAL file switch logic
incorrectly.

Yep. For a WAL-switch record, XLogInsert returns the location of the end of
the record, not the end of the empty padding space. So when the caller
flushed up to that point, it didn't flush the empty space and therefore
didn't notify the archiver.

Attached is a new version, fixing that, and off-by-one bug you pointed out
in the slot wraparound handling. I also moved code around a bit, I think
this new division of labor between the XLogInsert subroutines is more
readable.

Thanks for the testing!

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-18 Thread Jeff Janes
On Fri, Feb 17, 2012 at 7:36 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 17.02.2012 07:27, Fujii Masao wrote:

 Got another problem: when I ran pg_stop_backup to take an online backup,
 it got stuck until I had generated new WAL record. This happens because,
 in the patch, when pg_stop_backup forces a switch to new WAL file, old
 WAL file is not marked as archivable until next new WAL record has been
 inserted, but pg_stop_backup keeps waiting for that WAL file to be
 archived.
 OTOH, without the patch, WAL file is marked as archivable as soon as WAL
 file switch occurs.

 So, in short, the patch seems to handle the WAL file switch logic
 incorrectly.


 Yep. For a WAL-switch record, XLogInsert returns the location of the end of
 the record, not the end of the empty padding space. So when the caller
 flushed up to that point, it didn't flush the empty space and therefore
 didn't notify the archiver.

 Attached is a new version, fixing that, and off-by-one bug you pointed out
 in the slot wraparound handling. I also moved code around a bit, I think
 this new division of labor between the XLogInsert subroutines is more
 readable.

 Thanks for the testing!

Hi Heikki,

Sorry for the week long radio silence, I haven't been able to find
much time during the week.  I'll try to extract my test case from it's
quite messy testing harness and get a self-contained version, but it
will probably take a week or two to do it.  I can probably refactor it
to rely just on Perl and the modules DBI, DBD::Pg, IO::Pipe and
Storable.  Some of those are not core Perl modules, but they are all
common ones.  Would that be a good option?

I've tested your v9 patch.  I no longer see any inconsistencies or
lost transactions in the recovered database.  But occasionally I get
databases that fail to recover at all.
It has always been with the exact same failed assertion, at xlog.c line 2154.

I've only seen this 4 times out of 2202 cycles of crash and recovery,
so it must be some rather obscure situation.

LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/180001B0
LOG:  unexpected pageaddr 0/15084000 in log file 0, segment 25, offset 540672
LOG:  redo done at 0/19083FD0
LOG:  last completed transaction was at log time 2012-02-17 11:13:50.369488-08
LOG:  checkpoint starting: end-of-recovery immediate
TRAP: FailedAssertion(!(((uint64) (NewPageEndPtr).xlogid *
(uint64) (((uint32) 0x) / ((uint32) (16 * 1024 * 1024))) *
((uint32) (16 * 1024 * 1024))) + (NewPageEndPtr).xrecoff - 1)) / 8192)
% (XLogCtl-XLogCacheBlck + 1)) == nextidx), File: xlog.c, Line:
2154)
LOG:  startup process (PID 5390) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-17 Thread Heikki Linnakangas

On 16.02.2012 13:31, Fujii Masao wrote:

On Thu, Feb 16, 2012 at 6:15 PM, Fujii Masaomasao.fu...@gmail.com  wrote:

BTW, when I ran the test on my Ubuntu, I could not reproduce the problem.
I could reproduce the problem only in MacOS.


+   nextslot = Insert-nextslot;
+   if (NextSlotNo(nextslot) == lastslot)
+   {
+   /*
+* Oops, we've caught our tail and the oldest slot is still 
in use.
+* Have to wait for it to become vacant.
+*/
+   SpinLockRelease(Insert-insertpos_lck);
+   WaitForXLogInsertionSlotToBecomeFree();
+   goto retry;
+   }
+   myslot =XLogCtl-XLogInsertSlots[nextslot];
+   nextslot = NextSlotNo(nextslot);

nextslot can reach NumXLogInsertSlots, which would be a bug, I guess.
When I did the quick-fix and ran the test, I could not reproduce the problem
any more. I'm not sure if this is really the cause of the problem, though.


Ah, I see. That explains why you only see it on some platforms - 
depending on ALIGNOF_XLOG_BUFFER, there is often enough padding after 
the last valid slot to accommodate the extra bogus slot. Thanks for the 
debugging!


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-16 Thread Fujii Masao
On Thu, Feb 16, 2012 at 5:02 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 15.02.2012 18:52, Fujii Masao wrote:

 On Thu, Feb 16, 2012 at 1:01 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 Are you still seeing this failure with the latest patch I posted

 (http://archives.postgresql.org/message-id/4f38f5e5.8050...@enterprisedb.com)?


 Yes. Just to be safe, I again applied the latest patch to HEAD,
 compiled that and tried
 the same test. Then unfortunately I got the same failure again.


 Ok.

 I ran the configure with '--enable-debug' '--enable-cassert'
 'CPPFLAGS=-DWAL_DEBUG',
 and make with -j 2 option.

 When I ran the test with wal_debug = on, I got the following assertion
 failure.

 LOG:  INSERT @ 0/17B3F90: prev 0/17B3F10; xid 998; len 31: Heap -
 insert: rel 1663/12277/16384; tid 0/197
 STATEMENT:  create table t (i int); insert into t
 values(generate_series(1,1)); delete from t
 LOG:  INSERT @ 0/17B3FD0: prev 0/17B3F50; xid 998; len 31: Heap -
 insert: rel 1663/12277/16384; tid 0/198
 STATEMENT:  create table t (i int); insert into t
 values(generate_series(1,1)); delete from t
 TRAP: FailedAssertion(!(((bool) (((void*)((target-tid)) != ((void
 *)0))  (((target-tid))-ip_posid != 0, File: heapam.c,

 Line: 5578)
 LOG:  xlog bg flush request 0/17B4000; write 0/17A6000; flush 0/179D5C0
 LOG:  xlog bg flush request 0/17B4000; write 0/17B; flush 0/17B
 LOG:  server process (PID 16806) was terminated by signal 6: Abort trap

 This might be related to the original problem which Jeff and I saw.


 That's strange. I made a fresh checkout, too, and applied the patch, but
 still can't reproduce. I used the attached script to test it.

 It's surprising that the crash happens when the records are inserted, not at
 recovery. I don't see anything obviously wrong there, so could you please
 take a look around in gdb and see if you can get a clue what's going on?
 What's the stack trace?

According to the above log messages, one strange thing is that the location
of the WAL record (i.e., 0/17B3F90) is not the same as the previous location
of the following WAL record (i.e., 0/17B3F50). Is this intentional?

BTW, when I ran the test on my Ubuntu, I could not reproduce the problem.
I could reproduce the problem only in MacOS.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-16 Thread Fujii Masao
On Thu, Feb 16, 2012 at 6:15 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Feb 16, 2012 at 5:02 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 On 15.02.2012 18:52, Fujii Masao wrote:

 On Thu, Feb 16, 2012 at 1:01 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 Are you still seeing this failure with the latest patch I posted

 (http://archives.postgresql.org/message-id/4f38f5e5.8050...@enterprisedb.com)?


 Yes. Just to be safe, I again applied the latest patch to HEAD,
 compiled that and tried
 the same test. Then unfortunately I got the same failure again.


 Ok.

 I ran the configure with '--enable-debug' '--enable-cassert'
 'CPPFLAGS=-DWAL_DEBUG',
 and make with -j 2 option.

 When I ran the test with wal_debug = on, I got the following assertion
 failure.

 LOG:  INSERT @ 0/17B3F90: prev 0/17B3F10; xid 998; len 31: Heap -
 insert: rel 1663/12277/16384; tid 0/197
 STATEMENT:  create table t (i int); insert into t
 values(generate_series(1,1)); delete from t
 LOG:  INSERT @ 0/17B3FD0: prev 0/17B3F50; xid 998; len 31: Heap -
 insert: rel 1663/12277/16384; tid 0/198
 STATEMENT:  create table t (i int); insert into t
 values(generate_series(1,1)); delete from t
 TRAP: FailedAssertion(!(((bool) (((void*)((target-tid)) != ((void
 *)0))  (((target-tid))-ip_posid != 0, File: heapam.c,

 Line: 5578)
 LOG:  xlog bg flush request 0/17B4000; write 0/17A6000; flush 0/179D5C0
 LOG:  xlog bg flush request 0/17B4000; write 0/17B; flush 0/17B
 LOG:  server process (PID 16806) was terminated by signal 6: Abort trap

 This might be related to the original problem which Jeff and I saw.


 That's strange. I made a fresh checkout, too, and applied the patch, but
 still can't reproduce. I used the attached script to test it.

 It's surprising that the crash happens when the records are inserted, not at
 recovery. I don't see anything obviously wrong there, so could you please
 take a look around in gdb and see if you can get a clue what's going on?
 What's the stack trace?

 According to the above log messages, one strange thing is that the location
 of the WAL record (i.e., 0/17B3F90) is not the same as the previous location
 of the following WAL record (i.e., 0/17B3F50). Is this intentional?

 BTW, when I ran the test on my Ubuntu, I could not reproduce the problem.
 I could reproduce the problem only in MacOS.

+   nextslot = Insert-nextslot;
+   if (NextSlotNo(nextslot) == lastslot)
+   {
+   /*
+* Oops, we've caught our tail and the oldest slot is still 
in use.
+* Have to wait for it to become vacant.
+*/
+   SpinLockRelease(Insert-insertpos_lck);
+   WaitForXLogInsertionSlotToBecomeFree();
+   goto retry;
+   }
+   myslot = XLogCtl-XLogInsertSlots[nextslot];
+   nextslot = NextSlotNo(nextslot);

nextslot can reach NumXLogInsertSlots, which would be a bug, I guess.
When I did the quick-fix and ran the test, I could not reproduce the problem
any more. I'm not sure if this is really the cause of the problem, though.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-16 Thread Fujii Masao
On Mon, Feb 13, 2012 at 8:37 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 13.02.2012 01:04, Jeff Janes wrote:

 Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD.
  I have no idea if I did it correctly, in particular if calling
 GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side
 effects that make that a bad thing to do.  I'm not proposing it as the
 real fix, I just wanted to get around this problem in order to do more
 testing.


 Thanks. That's basically the right approach. Attached patch contains a
 cleaned up version of that.

Got another problem: when I ran pg_stop_backup to take an online backup,
it got stuck until I had generated new WAL record. This happens because,
in the patch, when pg_stop_backup forces a switch to new WAL file, old
WAL file is not marked as archivable until next new WAL record has been
inserted, but pg_stop_backup keeps waiting for that WAL file to be archived.
OTOH, without the patch, WAL file is marked as archivable as soon as WAL
file switch occurs.

So, in short, the patch seems to handle the WAL file switch logic incorrectly.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-15 Thread Heikki Linnakangas

On 13.02.2012 19:13, Fujii Masao wrote:

On Mon, Feb 13, 2012 at 8:37 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

On 13.02.2012 01:04, Jeff Janes wrote:


Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD.
  I have no idea if I did it correctly, in particular if calling
GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side
effects that make that a bad thing to do.  I'm not proposing it as the
real fix, I just wanted to get around this problem in order to do more
testing.



Thanks. That's basically the right approach. Attached patch contains a
cleaned up version of that.



It does get rid of the there is no contrecord flag errors, but
recover still does not work.

Now the count of tuples in the table is always correct (I never
provoke a crash during the initial table load), but sometimes updates
to those tuples that were reported to have been committed are lost.

This is more subtle, it does not happen on every crash.

It seems that when recovery ends on record with zero length at...,
that recovery is correct.

But when it ends on invalid magic number  in log file.. then the
recovery is screwed up.



Can you write a self-contained test case for that? I've been trying to
reproduce that by running the regression tests and pgbench with a streaming
replication standby, which should be pretty much the same as crash recovery.
No luck this far.


Probably I could reproduce the same problem as Jeff got. Here is the test case:

$ initdb -D data
$ pg_ctl -D data start
$ psql -c create table t (i int); insert into t
values(generate_series(1,1)); delete from t
$ pg_ctl -D data stop -m i
$ pg_ctl -D data start

The crash recovery emitted the following server logs:

LOG:  database system was interrupted; last known up at 2012-02-14 02:07:01 JST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/179CC90
LOG:  invalid magic number  in log file 0, segment 1, offset 8060928
LOG:  redo done at 0/17AD858
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

After recovery, I could not see the table t which I created before:

$ psql -c select count(*) from t
ERROR:  relation t does not exist


Are you still seeing this failure with the latest patch I posted 
(http://archives.postgresql.org/message-id/4f38f5e5.8050...@enterprisedb.com)? 
That includes Jeff's fix for the original crash you and Jeff saw. With 
that version, I can't get a crash anymore. I also can't reproduce the 
inconsistency that Jeff still saw with his fix 
(http://archives.postgresql.org/message-id/CAMkU=1zGWp2QnTjiyFe0VMu4gc+MoEexXYaVC2u=+orfiyj...@mail.gmail.com). 
Jeff, can you clarify if you're still seeing an issue with the latest 
version of the patch? If so, can you give a self-contained test case for 
that?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-15 Thread Fujii Masao
On Thu, Feb 16, 2012 at 1:01 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 13.02.2012 19:13, Fujii Masao wrote:

 On Mon, Feb 13, 2012 at 8:37 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 On 13.02.2012 01:04, Jeff Janes wrote:


 Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD.
  I have no idea if I did it correctly, in particular if calling
 GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side
 effects that make that a bad thing to do.  I'm not proposing it as the
 real fix, I just wanted to get around this problem in order to do more
 testing.



 Thanks. That's basically the right approach. Attached patch contains a
 cleaned up version of that.


 It does get rid of the there is no contrecord flag errors, but
 recover still does not work.

 Now the count of tuples in the table is always correct (I never
 provoke a crash during the initial table load), but sometimes updates
 to those tuples that were reported to have been committed are lost.

 This is more subtle, it does not happen on every crash.

 It seems that when recovery ends on record with zero length at...,
 that recovery is correct.

 But when it ends on invalid magic number  in log file.. then the
 recovery is screwed up.



 Can you write a self-contained test case for that? I've been trying to
 reproduce that by running the regression tests and pgbench with a
 streaming
 replication standby, which should be pretty much the same as crash
 recovery.
 No luck this far.


 Probably I could reproduce the same problem as Jeff got. Here is the test
 case:

 $ initdb -D data
 $ pg_ctl -D data start
 $ psql -c create table t (i int); insert into t
 values(generate_series(1,1)); delete from t
 $ pg_ctl -D data stop -m i
 $ pg_ctl -D data start

 The crash recovery emitted the following server logs:

 LOG:  database system was interrupted; last known up at 2012-02-14
 02:07:01 JST
 LOG:  database system was not properly shut down; automatic recovery in
 progress
 LOG:  redo starts at 0/179CC90
 LOG:  invalid magic number  in log file 0, segment 1, offset 8060928
 LOG:  redo done at 0/17AD858
 LOG:  database system is ready to accept connections
 LOG:  autovacuum launcher started

 After recovery, I could not see the table t which I created before:

 $ psql -c select count(*) from t
 ERROR:  relation t does not exist


 Are you still seeing this failure with the latest patch I posted
 (http://archives.postgresql.org/message-id/4f38f5e5.8050...@enterprisedb.com)?

Yes. Just to be safe, I again applied the latest patch to HEAD,
compiled that and tried
the same test. Then unfortunately I got the same failure again.

I ran the configure with '--enable-debug' '--enable-cassert'
'CPPFLAGS=-DWAL_DEBUG',
and make with -j 2 option.

When I ran the test with wal_debug = on, I got the following assertion failure.

LOG:  INSERT @ 0/17B3F90: prev 0/17B3F10; xid 998; len 31: Heap -
insert: rel 1663/12277/16384; tid 0/197
STATEMENT:  create table t (i int); insert into t
values(generate_series(1,1)); delete from t
LOG:  INSERT @ 0/17B3FD0: prev 0/17B3F50; xid 998; len 31: Heap -
insert: rel 1663/12277/16384; tid 0/198
STATEMENT:  create table t (i int); insert into t
values(generate_series(1,1)); delete from t
TRAP: FailedAssertion(!(((bool) (((void*)((target-tid)) != ((void
*)0))  (((target-tid))-ip_posid != 0, File: heapam.c,
Line: 5578)
LOG:  xlog bg flush request 0/17B4000; write 0/17A6000; flush 0/179D5C0
LOG:  xlog bg flush request 0/17B4000; write 0/17B; flush 0/17B
LOG:  server process (PID 16806) was terminated by signal 6: Abort trap

This might be related to the original problem which Jeff and I saw.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-15 Thread Heikki Linnakangas

On 15.02.2012 18:52, Fujii Masao wrote:

On Thu, Feb 16, 2012 at 1:01 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

Are you still seeing this failure with the latest patch I posted
(http://archives.postgresql.org/message-id/4f38f5e5.8050...@enterprisedb.com)?


Yes. Just to be safe, I again applied the latest patch to HEAD,
compiled that and tried
the same test. Then unfortunately I got the same failure again.


Ok.


I ran the configure with '--enable-debug' '--enable-cassert'
'CPPFLAGS=-DWAL_DEBUG',
and make with -j 2 option.

When I ran the test with wal_debug = on, I got the following assertion failure.

LOG:  INSERT @ 0/17B3F90: prev 0/17B3F10; xid 998; len 31: Heap -
insert: rel 1663/12277/16384; tid 0/197
STATEMENT:  create table t (i int); insert into t
values(generate_series(1,1)); delete from t
LOG:  INSERT @ 0/17B3FD0: prev 0/17B3F50; xid 998; len 31: Heap -
insert: rel 1663/12277/16384; tid 0/198
STATEMENT:  create table t (i int); insert into t
values(generate_series(1,1)); delete from t
TRAP: FailedAssertion(!(((bool) (((void*)((target-tid)) != ((void
*)0))  (((target-tid))-ip_posid != 0, File: heapam.c,
Line: 5578)
LOG:  xlog bg flush request 0/17B4000; write 0/17A6000; flush 0/179D5C0
LOG:  xlog bg flush request 0/17B4000; write 0/17B; flush 0/17B
LOG:  server process (PID 16806) was terminated by signal 6: Abort trap

This might be related to the original problem which Jeff and I saw.


That's strange. I made a fresh checkout, too, and applied the patch, but 
still can't reproduce. I used the attached script to test it.


It's surprising that the crash happens when the records are inserted, 
not at recovery. I don't see anything obviously wrong there, so could 
you please take a look around in gdb and see if you can get a clue 
what's going on? What's the stack trace?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


fujii-crash.sh
Description: Bourne shell script

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-13 Thread Fujii Masao
On Mon, Feb 13, 2012 at 8:37 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 13.02.2012 01:04, Jeff Janes wrote:

 Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD.
  I have no idea if I did it correctly, in particular if calling
 GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side
 effects that make that a bad thing to do.  I'm not proposing it as the
 real fix, I just wanted to get around this problem in order to do more
 testing.


 Thanks. That's basically the right approach. Attached patch contains a
 cleaned up version of that.


 It does get rid of the there is no contrecord flag errors, but
 recover still does not work.

 Now the count of tuples in the table is always correct (I never
 provoke a crash during the initial table load), but sometimes updates
 to those tuples that were reported to have been committed are lost.

 This is more subtle, it does not happen on every crash.

 It seems that when recovery ends on record with zero length at...,
 that recovery is correct.

 But when it ends on invalid magic number  in log file.. then the
 recovery is screwed up.


 Can you write a self-contained test case for that? I've been trying to
 reproduce that by running the regression tests and pgbench with a streaming
 replication standby, which should be pretty much the same as crash recovery.
 No luck this far.

Probably I could reproduce the same problem as Jeff got. Here is the test case:

$ initdb -D data
$ pg_ctl -D data start
$ psql -c create table t (i int); insert into t
values(generate_series(1,1)); delete from t
$ pg_ctl -D data stop -m i
$ pg_ctl -D data start

The crash recovery emitted the following server logs:

LOG:  database system was interrupted; last known up at 2012-02-14 02:07:01 JST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/179CC90
LOG:  invalid magic number  in log file 0, segment 1, offset 8060928
LOG:  redo done at 0/17AD858
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

After recovery, I could not see the table t which I created before:

$ psql -c select count(*) from t
ERROR:  relation t does not exist

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-12 Thread Jeff Janes
On Thu, Feb 9, 2012 at 3:02 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Feb 9, 2012 at 7:25 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Feb 9, 2012 at 3:32 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 After applying this patch and then forcing crashes, upon recovery the
 database is not correct.

 If I make a table with 10,000 rows and then after that intensively
 update it using a unique key:

 update foo set count=count+1 where foobar=?

 Then after the crash there are less than 10,000 visible rows:

 select count(*) from foo

 This not a subtle thing, it happens every time.  I get counts of
 between 1973 and 8827.  Without this patch I always get exactly
 10,000.

 I don't really know where to start on tracking this down.

 Similar problem happened on my test. When I executed CREATE TABLE and
 shut down the server with immediate mode, after recovery I could not see the
 created table. Here are the server log of recovery with wal_debug = on:

 LOG:  database system was interrupted; last known up at 2012-02-09 19:18:50 
 JST
 LOG:  database system was not properly shut down; automatic recovery in 
 progress
 LOG:  redo starts at 0/179CC90
 LOG:  REDO @ 0/179CC90; LSN 0/179CCB8: prev 0/179CC30; xid 0; len 4:
 XLOG - nextOid: 24576
 LOG:  REDO @ 0/179CCB8; LSN 0/179CCE8: prev 0/179CC90; xid 0; len 16:
 Storage - file create: base/12277/16384
 LOG:  REDO @ 0/179CCE8; LSN 0/179DDE0: prev 0/179CCB8; xid 998; len
 21; bkpb1: Heap - insert: rel 1663/12277/12014; tid 7/22
 LOG:  there is no contrecord flag in log file 0, segment 1, offset 7987200
 LOG:  redo done at 0/179CCE8

 According to the log there is no contrecord flag, ISTM the path treats the
 contrecord of backup block incorrectly, and which causes the problem.

 Yep, as far as I read the patch, it seems to have forgotten to set
 XLP_FIRST_IS_CONTRECORD flag.


Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD.
 I have no idea if I did it correctly, in particular if calling
GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side
effects that make that a bad thing to do.  I'm not proposing it as the
real fix, I just wanted to get around this problem in order to do more
testing.

It does get rid of the there is no contrecord flag errors, but
recover still does not work.

Now the count of tuples in the table is always correct (I never
provoke a crash during the initial table load), but sometimes updates
to those tuples that were reported to have been committed are lost.

This is more subtle, it does not happen on every crash.

It seems that when recovery ends on record with zero length at...,
that recovery is correct.

But when it ends on invalid magic number  in log file.. then the
recovery is screwed up.

Cheers,

Jeff


xloginsert_fix.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-09 Thread Fujii Masao
On Thu, Feb 9, 2012 at 3:32 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Wed, Feb 1, 2012 at 11:46 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 On 31.01.2012 17:35, Fujii Masao wrote:

 On Fri, Jan 20, 2012 at 11:11 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 On 20.01.2012 15:32, Robert Haas wrote:


 On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com    wrote:


 Here's another version of the patch to make XLogInsert less of a
 bottleneck
 on multi-CPU systems. The basic idea is the same as before, but several
 bugs
 have been fixed, and lots of misc. clean up has been done.



 This seems to need a rebase.



 Here you go.


 The patch seems to need a rebase again.


 Here you go again. It conflicted with the group commit patch, and the patch
 to WAL-log and track changes to full_page_writes setting.


 After applying this patch and then forcing crashes, upon recovery the
 database is not correct.

 If I make a table with 10,000 rows and then after that intensively
 update it using a unique key:

 update foo set count=count+1 where foobar=?

 Then after the crash there are less than 10,000 visible rows:

 select count(*) from foo

 This not a subtle thing, it happens every time.  I get counts of
 between 1973 and 8827.  Without this patch I always get exactly
 10,000.

 I don't really know where to start on tracking this down.

Similar problem happened on my test. When I executed CREATE TABLE and
shut down the server with immediate mode, after recovery I could not see the
created table. Here are the server log of recovery with wal_debug = on:

LOG:  database system was interrupted; last known up at 2012-02-09 19:18:50 JST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/179CC90
LOG:  REDO @ 0/179CC90; LSN 0/179CCB8: prev 0/179CC30; xid 0; len 4:
XLOG - nextOid: 24576
LOG:  REDO @ 0/179CCB8; LSN 0/179CCE8: prev 0/179CC90; xid 0; len 16:
Storage - file create: base/12277/16384
LOG:  REDO @ 0/179CCE8; LSN 0/179DDE0: prev 0/179CCB8; xid 998; len
21; bkpb1: Heap - insert: rel 1663/12277/12014; tid 7/22
LOG:  there is no contrecord flag in log file 0, segment 1, offset 7987200
LOG:  redo done at 0/179CCE8

According to the log there is no contrecord flag, ISTM the path treats the
contrecord of backup block incorrectly, and which causes the problem.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-09 Thread Fujii Masao
On Thu, Feb 9, 2012 at 7:25 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Feb 9, 2012 at 3:32 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Wed, Feb 1, 2012 at 11:46 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 On 31.01.2012 17:35, Fujii Masao wrote:

 On Fri, Jan 20, 2012 at 11:11 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 On 20.01.2012 15:32, Robert Haas wrote:


 On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com    wrote:


 Here's another version of the patch to make XLogInsert less of a
 bottleneck
 on multi-CPU systems. The basic idea is the same as before, but several
 bugs
 have been fixed, and lots of misc. clean up has been done.



 This seems to need a rebase.



 Here you go.


 The patch seems to need a rebase again.


 Here you go again. It conflicted with the group commit patch, and the patch
 to WAL-log and track changes to full_page_writes setting.


 After applying this patch and then forcing crashes, upon recovery the
 database is not correct.

 If I make a table with 10,000 rows and then after that intensively
 update it using a unique key:

 update foo set count=count+1 where foobar=?

 Then after the crash there are less than 10,000 visible rows:

 select count(*) from foo

 This not a subtle thing, it happens every time.  I get counts of
 between 1973 and 8827.  Without this patch I always get exactly
 10,000.

 I don't really know where to start on tracking this down.

 Similar problem happened on my test. When I executed CREATE TABLE and
 shut down the server with immediate mode, after recovery I could not see the
 created table. Here are the server log of recovery with wal_debug = on:

 LOG:  database system was interrupted; last known up at 2012-02-09 19:18:50 
 JST
 LOG:  database system was not properly shut down; automatic recovery in 
 progress
 LOG:  redo starts at 0/179CC90
 LOG:  REDO @ 0/179CC90; LSN 0/179CCB8: prev 0/179CC30; xid 0; len 4:
 XLOG - nextOid: 24576
 LOG:  REDO @ 0/179CCB8; LSN 0/179CCE8: prev 0/179CC90; xid 0; len 16:
 Storage - file create: base/12277/16384
 LOG:  REDO @ 0/179CCE8; LSN 0/179DDE0: prev 0/179CCB8; xid 998; len
 21; bkpb1: Heap - insert: rel 1663/12277/12014; tid 7/22
 LOG:  there is no contrecord flag in log file 0, segment 1, offset 7987200
 LOG:  redo done at 0/179CCE8

 According to the log there is no contrecord flag, ISTM the path treats the
 contrecord of backup block incorrectly, and which causes the problem.

Yep, as far as I read the patch, it seems to have forgotten to set
XLP_FIRST_IS_CONTRECORD flag.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-08 Thread Jeff Janes
On Wed, Feb 1, 2012 at 11:46 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 31.01.2012 17:35, Fujii Masao wrote:

 On Fri, Jan 20, 2012 at 11:11 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 On 20.01.2012 15:32, Robert Haas wrote:


 On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com    wrote:


 Here's another version of the patch to make XLogInsert less of a
 bottleneck
 on multi-CPU systems. The basic idea is the same as before, but several
 bugs
 have been fixed, and lots of misc. clean up has been done.



 This seems to need a rebase.



 Here you go.


 The patch seems to need a rebase again.


 Here you go again. It conflicted with the group commit patch, and the patch
 to WAL-log and track changes to full_page_writes setting.


After applying this patch and then forcing crashes, upon recovery the
database is not correct.

If I make a table with 10,000 rows and then after that intensively
update it using a unique key:

update foo set count=count+1 where foobar=?

Then after the crash there are less than 10,000 visible rows:

select count(*) from foo

This not a subtle thing, it happens every time.  I get counts of
between 1973 and 8827.  Without this patch I always get exactly
10,000.

I don't really know where to start on tracking this down.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-01-31 Thread Fujii Masao
On Fri, Jan 20, 2012 at 11:11 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 20.01.2012 15:32, Robert Haas wrote:

 On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 Here's another version of the patch to make XLogInsert less of a
 bottleneck
 on multi-CPU systems. The basic idea is the same as before, but several
 bugs
 have been fixed, and lots of misc. clean up has been done.


 This seems to need a rebase.


 Here you go.

The patch seems to need a rebase again.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-01-20 Thread Robert Haas
On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Here's another version of the patch to make XLogInsert less of a bottleneck
 on multi-CPU systems. The basic idea is the same as before, but several bugs
 have been fixed, and lots of misc. clean up has been done.

This seems to need a rebase.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-01-20 Thread Simon Riggs
On Fri, Jan 20, 2012 at 2:11 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 20.01.2012 15:32, Robert Haas wrote:

 On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 Here's another version of the patch to make XLogInsert less of a
 bottleneck
 on multi-CPU systems. The basic idea is the same as before, but several
 bugs
 have been fixed, and lots of misc. clean up has been done.


 This seems to need a rebase.


 Here you go.

I put myself down as reviewer for this. I'm planning to review this
early next week, once I've finished Fujii-san's patches.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2012-01-09 Thread Simon Riggs
On Sat, Jan 7, 2012 at 9:31 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 Anyway, here's a new version of the patch. It no longer busy-waits for
 in-progress insertions to finish, and handles xlog-switches. This is now
 feature-complete. It's a pretty complicated patch, so I would appreciate
 more eyeballs on it. And benchmarking again.

Took me awhile to understand why the data structure for the insertion
slots is so complex. Why not have slots per buffer? That would be
easier to understand and slots are very small. Not sure if its a good
idea, but we should explain the design options around that choice.

Can we avoid having spinlocks on the slots altogether? If we have a
page number (int) and an LSN, inserters would set LSN and then set
page number. Anybody waiting for slots would stop if page number is
zero since that means its not complete yet. So readers look at page
number first and aren't allowed to look at LSN without valid page
number.

Page number would be useful in working out where to stop when doing
background flushes, which we need for Group Commit, which is arriving
soon for this release.

Can we also try aligning the actual insertions onto cache lines rather
than just MAXALIGNing them? The WAL header fills half a cache line as
it is, so many other records will fit nicely. I'd like to see what
that does to space consumption, but it might be a useful option at
least.

I'd like to see test results with FPWs turned off and CACHEALIGNed
inserts. Again, we're planning on avoiding FPWs in future, so it would
be sensible to check the tuning in that configuration also.

GetInsertRecPtr() should return the XLogRecPtr of the latest
allocation. IMHO that is what we need for checkpoints and the
walsender doesn't really matter.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2012-01-09 Thread Heikki Linnakangas

On 09.01.2012 15:44, Simon Riggs wrote:

On Sat, Jan 7, 2012 at 9:31 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:


Anyway, here's a new version of the patch. It no longer busy-waits for
in-progress insertions to finish, and handles xlog-switches. This is now
feature-complete. It's a pretty complicated patch, so I would appreciate
more eyeballs on it. And benchmarking again.


Took me awhile to understand why the data structure for the insertion
slots is so complex. Why not have slots per buffer? That would be
easier to understand and slots are very small.


Hmm, how would that work?


Can we avoid having spinlocks on the slots altogether? If we have a
page number (int) and an LSN, inserters would set LSN and then set
page number. Anybody waiting for slots would stop if page number is
zero since that means its not complete yet. So readers look at page
number first and aren't allowed to look at LSN without valid page
number.


The LSN on a slot is set in ReserveXLogInsertLocation(), while holding 
the insertpos_lck spinlock. The inserter doesn't acquire the per-slot 
spinlock at that point, it relies on the fact that no-one will look at 
the slot until the shared nextslot variable has been incremented. The 
spinlock is only acquired when updating the pointer, which only happens 
when crossing a WAL page, which isn't that performance-criticial, and 
when the insertion is finished. It would be nice to get rid of the 
spinlock acquisition when the insertion is finished, but I don't see any 
easy way around that. The spinlock is needed to make sure that when the 
inserter clears its slot, it can atomically check the waiter field.


The theory is that contention on those per-slot spinlocks is very rare. 
Profiling this patch with perf, it looks like the bottleneck is the 
insertpos_lck spinlock.



Page number would be useful in working out where to stop when doing
background flushes, which we need for Group Commit, which is arriving
soon for this release.


Ok, I'll have to just take your word on that :-). I don't see why Group 
Commit needs to care about page boundaries, but the slot data structure 
I used already allows you to check fairly how far you write out the WAL 
without having to wait for any in-progress insertions to complete.



Can we also try aligning the actual insertions onto cache lines rather
than just MAXALIGNing them? The WAL header fills half a cache line as
it is, so many other records will fit nicely. I'd like to see what
that does to space consumption, but it might be a useful option at
least.


Hmm, that's an interesting thought. That would mean having gaps in the 
in-memory WAL cache, so that when it's written out, you'd need to stitch 
together the pieces to form the WAL that's actually written to disk. Or 
just leave the gaps in the on-disk format, if we're willing to change 
the WAL format for this, but I don't think we want make our WAL any 
larger than it already is.


I've written this patch avoiding WAL format changes, but if we're 
willing to do that, there's a few things we could do that would help. 
For one, the logic in ReserveXLogInsertLocation() that figures out where 
in the WAL stream the record begins and where it ends could be made a 
lot simpler. At the moment, we refuse to split a WAL record header 
across WAL pages, and because of that, the number of bytes occupied by a 
WAL record depends on where in the WAL it's written. If we changed that, 
reserving space from the WAL for a record that's N bytes long could be 
done essentially as CurrPos += N. There's some complications, like 
having to keep track of the prev-link too, but I believe it would be 
possible to get rid of the spinlock and implement 
ReserveXLogInsertLocation() as a single atomic fetch-and-add instruction.



GetInsertRecPtr() should return the XLogRecPtr of the latest
allocation. IMHO that is what we need for checkpoints and the
walsender doesn't really matter.


Ok. Thanks looking at the patch!

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2012-01-09 Thread Simon Riggs
On Mon, Jan 9, 2012 at 2:29 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 Can we also try aligning the actual insertions onto cache lines rather
 than just MAXALIGNing them? The WAL header fills half a cache line as
 it is, so many other records will fit nicely. I'd like to see what
 that does to space consumption, but it might be a useful option at
 least.


 Hmm, that's an interesting thought. That would mean having gaps in the
 in-memory WAL cache, so that when it's written out, you'd need to stitch
 together the pieces to form the WAL that's actually written to disk. Or just
 leave the gaps in the on-disk format, if we're willing to change the WAL
 format for this, but I don't think we want make our WAL any larger than it
 already is.

I don't think that would require any format changes at all. We never
check that the total length of the WAL matches the size of the
contents do we? So if the record is just a little too big for its
contents, it will still all work fine. Recovery just calls rmgr
functions, so it doesn't know how big things should be. _redo routines
don't do local validation anywhere that I'm aware of. Sure, they cast
the record to a specific type, but that doesn't prevent the record
from being longer that _redo wants it to be. Try it.

You could probably do it for individual record types with an
additional rdata item.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2012-01-08 Thread Simon Riggs
On Sun, Dec 25, 2011 at 7:48 PM, Robert Haas robertmh...@gmail.com wrote:

 m01 tps = 631.875547 (including connections establishing)
 x01 tps = 611.443724 (including connections establishing)
 m08 tps = 4573.701237 (including connections establishing)
 x08 tps = 4576.242333 (including connections establishing)
 m16 tps = 7697.783265 (including connections establishing)
 x16 tps = 7837.028713 (including connections establishing)
 m24 tps = 11613.690878 (including connections establishing)
 x24 tps = 12924.027954 (including connections establishing)
 m32 tps = 10684.931858 (including connections establishing)
 x32 tps = 14168.419730 (including connections establishing)
 m80 tps = 10259.628774 (including connections establishing)
 x80 tps = 13864.651340 (including connections establishing)

I think a 5% loss on 1 session is worth a 40% gain on a full loaded system.

Well done Heikki.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-25 Thread Robert Haas
On Fri, Dec 23, 2011 at 2:54 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Sorry. Last minute changes, didn't retest properly.. Here's another attempt.

I tried this one out on Nate Boley's system.  Looks pretty good.

m = master, x = with xloginsert-scale-2 patch.  shared_buffers = 8GB,
maintenance_work_mem = 1GB, synchronous_commit = off,
checkpoint_segments = 300, checkpoint_timeout = 15min,
checkpoint_completion_target = 0.9, wal_writer_delay = 20ms.  pgbench,
scale factor 100, median of five five-minute runs.

Permanent tables:

m01 tps = 631.875547 (including connections establishing)
x01 tps = 611.443724 (including connections establishing)
m08 tps = 4573.701237 (including connections establishing)
x08 tps = 4576.242333 (including connections establishing)
m16 tps = 7697.783265 (including connections establishing)
x16 tps = 7837.028713 (including connections establishing)
m24 tps = 11613.690878 (including connections establishing)
x24 tps = 12924.027954 (including connections establishing)
m32 tps = 10684.931858 (including connections establishing)
x32 tps = 14168.419730 (including connections establishing)
m80 tps = 10259.628774 (including connections establishing)
x80 tps = 13864.651340 (including connections establishing)

And, on unlogged tables:

m01 tps = 681.805851 (including connections establishing)
x01 tps = 665.120212 (including connections establishing)
m08 tps = 4753.823067 (including connections establishing)
x08 tps = 4638.690397 (including connections establishing)
m16 tps = 8150.519673 (including connections establishing)
x16 tps = 8082.504658 (including connections establishing)
m24 tps = 14069.077657 (including connections establishing)
x24 tps = 13934.955205 (including connections establishing)
m32 tps = 18736.317650 (including connections establishing)
x32 tps = 1.585420 (including connections establishing)
m80 tps = 17709.683344 (including connections establishing)
x80 tps = 18330.488958 (including connections establishing)

Unfortunately, it does look like there is some raw loss of performance
when WALInsertLock is NOT badly contended; hence the drop-off at a
single client on permanent tables, and up through 24 clients on
unlogged tables.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-24 Thread Fujii Masao
On Sat, Dec 24, 2011 at 4:54 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Sorry. Last minute changes, didn't retest properly.. Here's another attempt.

When I tested the patch, initdb failed:

$ initdb -D data

initializing dependencies ... PANIC:  could not locate a valid checkpoint record

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-24 Thread Greg Stark
On Fri, Dec 16, 2011 at 3:27 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 On its own that sounds dangerous, but its not. When we need to confirm
 the prev link we already know what we expect it to be, so CRC-ing it
 is overkill. That isn't true of any other part of the WAL record, so
 the prev link is the only thing we can relax, but thats OK because we
 can CRC check everything else outside of the locked section.

 That isn't my idea, but I'm happy to put it on the table since I'm not shy.

 I'm glad it's not your idea, because it's a bad one.

I'll take the blame or credit here.

 A large part of
 the point of CRC'ing WAL records is to guard against torn-page problems
 in the WAL files, and doing things like that would give up a significant
 part of that protection, because there would no longer be any assurance
 that the body of a WAL record had anything to do with its prev_link.

Hm, I hadn't considered the possibility of a prev_link being the only
thing left over from a torn page. As Heikki pointed out having the CRC
and the rest of the record on opposite sides of the prev_link does
seem like convincing protection but it's a lot more fiddly and hard to
explain the dependencies this way.

Another thought that was discussed in the same dinner was separating
the CRC into a separate record that would cover all the WAL since the
last CRC. These would only need to be emitted when there's a WAL sync,
not on every record. I think someone showed some benchmarks claiming
that a significant overhead with the CRC was the startup and finishing
time for doing lots of small chunks. If it processes larger blocks it
might be able to make more efficient use of the memory bandwidth. I'm
not entirely convinced of that myself but it bears some
experimentation.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-23 Thread Heikki Linnakangas

On 16.12.2011 15:42, Heikki Linnakangas wrote:

On 16.12.2011 15:03, Simon Riggs wrote:

On Fri, Dec 16, 2011 at 12:50 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

On 16.12.2011 14:37, Simon Riggs wrote:


I already proposed a design for that using page-level share locks any
reason not to go with that?


Sorry, I must've missed that. Got a link?


From nearly 4 years ago.

http://grokbase.com/t/postgresql.org/pgsql-hackers/2008/02/reworking-wal-locking/145qrhllcqeqlfzntvn7kjefijey



Ah, thanks. That is similar to what I'm experimenting, but a second
lwlock is still fairly heavy-weight. I think with many backends, you
will be beaten badly by contention on the spinlocks alone.

I'll polish up and post what I've been experimenting with, so we can
discuss that.


So, here's a WIP patch of what I've been working on. The WAL insertions 
is split into two stages:


1. Reserve the space from the WAL stream. This is done while holding a 
spinlock. The page holding the reserved space doesn't necessary need to 
be in cache yet, the reservation can run ahead of the WAL buffer cache. 
(quick testing suggests that a lwlock is too heavy-weight for this)


2. Ensure the page is in the WAL buffer cache. If not, initialize it, 
evicting old pages if needed. Then finish the CRC calculation of the 
header and memcpy the record in place. (if the record spans multiple 
pages, it operates on one page at a time, to avoid problems with running 
out of WAL buffers)


As long as wal_buffers is high enough, and the I/O can keep up, stage 2 
can happen in parallel in many backends. The WAL writer process 
pre-initializes new pages ahead of the insertions, so regular backends 
rarely need to do that.


When a page is written out, with XLogWrite(), you need to wait for any 
in-progress insertions to the pages you're about to write out to finish. 
For that, every backend has slot with an XLogRecPtr in shared memory. 
Iẗ́'s set to the position where that backend is currently inserting to. 
If there's no insertion in-progress, it's invalid, but when it's valid 
it acts like a barrier, so that no-one is allowed to XLogWrite() beyond 
that position. That's very lightweight to the backends, but I'm using 
busy-waiting to wait on an insertion to finish ATM. That should be 
replaced with something smarter, that's the biggest missing part of the 
patch.


One simple way to test the performance impact of this is:

psql -c DROP TABLE IF EXISTS foo; CREATE TABLE foo (id int4); 
CHECKPOINT postgres
echo BEGIN; INSERT INTO foo SELECT i FROM generate_series(1, 1) i; 
ROLLBACK  parallel-insert-test.sql

pgbench -n -T 10 -c4 -f parallel-insert-test.sql postgres

On my dual-core laptop, this patch increases the tps on that from about 
60 to 110.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-23 Thread Robert Haas
On Fri, Dec 23, 2011 at 3:15 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 23.12.2011 10:13, Heikki Linnakangas wrote:
 So, here's a WIP patch of what I've been working on.

 And here's the patch I forgot to attach..

Fails regression tests for me.  I found this in postmaster.log:

PANIC:  could not find WAL buffer for 0/2ECA438STATEMENT:  ANALYZE onek2;
LOG:  stuck waiting upto 0/300
LOG:  server process (PID 34529) was terminated by signal 6: Aborted
DETAIL:  Failed process was running: ANALYZE onek2;

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-16 Thread Heikki Linnakangas

On 16.12.2011 05:27, Tom Lane wrote:

* We write a WAL record that starts 8 bytes before a sector boundary,
so that the prev_link is in one sector and the rest of the record in
the next one(s).


prev-link is not the first field in the header. The CRC is.


* Time passes, and we recycle that WAL file.

* We write another WAL record that starts 8 bytes before the same sector
boundary, so that the prev_link is in one sector and the rest of the
record in the next one(s).

* System crashes, after having written out the earlier sector but not
the later one(s).

On restart, the replay code will see a prev_link that matches what it
expects.  If the CRC for the remainder of the record is not dependent
on the prev_link, then the remainder of the old record will look good
too, and we'll attempt to replay it, n*16MB too late.


The CRC would be in the previous sector with the prev-link, so the CRC 
of the old record would have to match the CRC of the new record. I guess 
that's not totally impossible, though - there could be some WAL-logged 
operations where the payload of the WAL record is often exactly the 
same. Like a heap clean record, when the same page is repeatedly pruned.



Including the prev_link in the CRC adds a significant amount of
protection against such problems.  We should not remove this protection
in the name of shaving a few cycles.


Yeah. I did some quick testing with a patch to leave prev-link out of 
the calculation, and move the record CRC calculation outside the lock, 
too. I don't remember the numbers, but while it did make some 
difference, it didn't seem worthwhile.


Anyway, I'm looking at ways to make the memcpy() of the payload happen 
without the lock, in parallel, and once you do that the record header 
CRC calculation can be done in parallel, too. That makes it irrelevant 
from a performance point of view whether the prev-link is included in 
the CRC or not.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-16 Thread Simon Riggs
On Fri, Dec 16, 2011 at 12:07 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 Anyway, I'm looking at ways to make the memcpy() of the payload happen
 without the lock, in parallel, and once you do that the record header CRC
 calculation can be done in parallel, too. That makes it irrelevant from a
 performance point of view whether the prev-link is included in the CRC or
 not.

Better plan. So we keep the prev link in the CRC.

I already proposed a design for that using page-level share locks any
reason not to go with that?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-16 Thread Heikki Linnakangas

On 16.12.2011 14:37, Simon Riggs wrote:

On Fri, Dec 16, 2011 at 12:07 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:


Anyway, I'm looking at ways to make the memcpy() of the payload happen
without the lock, in parallel, and once you do that the record header CRC
calculation can be done in parallel, too. That makes it irrelevant from a
performance point of view whether the prev-link is included in the CRC or
not.


Better plan. So we keep the prev link in the CRC.

I already proposed a design for that using page-level share locks any
reason not to go with that?


Sorry, I must've missed that. Got a link?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-16 Thread Simon Riggs
On Fri, Dec 16, 2011 at 12:50 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 16.12.2011 14:37, Simon Riggs wrote:

 On Fri, Dec 16, 2011 at 12:07 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 Anyway, I'm looking at ways to make the memcpy() of the payload happen
 without the lock, in parallel, and once you do that the record header CRC
 calculation can be done in parallel, too. That makes it irrelevant from a
 performance point of view whether the prev-link is included in the CRC or
 not.


 Better plan. So we keep the prev link in the CRC.

 I already proposed a design for that using page-level share locks any
 reason not to go with that?


 Sorry, I must've missed that. Got a link?

From nearly 4 years ago.

http://grokbase.com/t/postgresql.org/pgsql-hackers/2008/02/reworking-wal-locking/145qrhllcqeqlfzntvn7kjefijey

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-16 Thread Heikki Linnakangas

On 16.12.2011 15:03, Simon Riggs wrote:

On Fri, Dec 16, 2011 at 12:50 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

On 16.12.2011 14:37, Simon Riggs wrote:


I already proposed a design for that using page-level share locks any
reason not to go with that?


Sorry, I must've missed that. Got a link?


 From nearly 4 years ago.

http://grokbase.com/t/postgresql.org/pgsql-hackers/2008/02/reworking-wal-locking/145qrhllcqeqlfzntvn7kjefijey


Ah, thanks. That is similar to what I'm experimenting, but a second 
lwlock is still fairly heavy-weight. I think with many backends, you 
will be beaten badly by contention on the spinlocks alone.


I'll polish up and post what I've been experimenting with, so we can 
discuss that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Moving more work outside WALInsertLock

2011-12-15 Thread Heikki Linnakangas
I've been looking at various ways to make WALInsertLock less of a 
bottleneck on multi-CPU servers. The key is going to be to separate the 
two things that are done while holding the WALInsertLock: a) allocating 
the required space in the WAL, and b) calculating the CRC of the record 
header and copying the data to the WAL page. a) needs to be serialized, 
but b) could be done in parallel.


I've been experimenting with different approaches to do that, but one 
thing is common among all of them: you need to know the total amount of 
WAL space needed for the record, including backup blocks, before you 
take the lock. So, here's a patch to move things around in XLogInsert() 
a bit, to accomplish that.


This patch doesn't seem to have any performance or scalability impact. I 
must admit I expected it to give a tiny gain in scalability by 
shortening the time WALInsertLock is held by a few instructions, but I 
can't measure any. But IMO it makes the code more readable, so this is 
worthwhile for that reason alone.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 690,695  XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
--- 690,698 
  	uint32		freespace;
  	int			curridx;
  	XLogRecData *rdt;
+ 	XLogRecData *rdt_cur;
+ 	XLogRecData	*rdt_bkp_first;
+ 	XLogRecData *rdt_bkp_last;
  	Buffer		dtbuf[XLR_MAX_BKP_BLOCKS];
  	bool		dtbuf_bkp[XLR_MAX_BKP_BLOCKS];
  	BkpBlock	dtbuf_xlg[XLR_MAX_BKP_BLOCKS];
***
*** 704,709  XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
--- 707,713 
  	bool		updrqst;
  	bool		doPageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID  info == XLOG_SWITCH);
+ 	uint8		info_final;
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
***
*** 727,738  XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
  	}
  
  	/*
! 	 * Here we scan the rdata chain, determine which buffers must be backed
! 	 * up, and compute the CRC values for the data.  Note that the record
! 	 * header isn't added into the CRC initially since we don't know the final
! 	 * length or info bits quite yet.  Thus, the CRC will represent the CRC of
! 	 * the whole record in the order rdata, then backup blocks, then record
! 	 * header.
  	 *
  	 * We may have to loop back to here if a race condition is detected below.
  	 * We could prevent the race by doing all this work while holding the
--- 731,738 
  	}
  
  	/*
! 	 * Here we scan the rdata chain to determine which buffers must be backed
! 	 * up.
  	 *
  	 * We may have to loop back to here if a race condition is detected below.
  	 * We could prevent the race by doing all this work while holding the
***
*** 746,751  XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
--- 746,752 
  	 * over the chain later.
  	 */
  begin:;
+ 	info_final = info;
  	for (i = 0; i  XLR_MAX_BKP_BLOCKS; i++)
  	{
  		dtbuf[i] = InvalidBuffer;
***
*** 760,766  begin:;
  	 */
  	doPageWrites = fullPageWrites || Insert-forcePageWrites;
  
- 	INIT_CRC32(rdata_crc);
  	len = 0;
  	for (rdt = rdata;;)
  	{
--- 761,766 
***
*** 768,774  begin:;
  		{
  			/* Simple data, just include it */
  			len += rdt-len;
- 			COMP_CRC32(rdata_crc, rdt-data, rdt-len);
  		}
  		else
  		{
--- 768,773 
***
*** 779,790  begin:;
  {
  	/* Buffer already referenced by earlier chain item */
  	if (dtbuf_bkp[i])
  		rdt-data = NULL;
  	else if (rdt-data)
- 	{
  		len += rdt-len;
- 		COMP_CRC32(rdata_crc, rdt-data, rdt-len);
- 	}
  	break;
  }
  if (dtbuf[i] == InvalidBuffer)
--- 778,789 
  {
  	/* Buffer already referenced by earlier chain item */
  	if (dtbuf_bkp[i])
+ 	{
  		rdt-data = NULL;
+ 		rdt-len = 0;
+ 	}
  	else if (rdt-data)
  		len += rdt-len;
  	break;
  }
  if (dtbuf[i] == InvalidBuffer)
***
*** 796,807  begin:;
  	{
  		dtbuf_bkp[i] = true;
  		rdt-data = NULL;
  	}
  	else if (rdt-data)
- 	{
  		len += rdt-len;
- 		COMP_CRC32(rdata_crc, rdt-data, rdt-len);
- 	}
  	break;
  }
  			}
--- 795,804 
  	{
  		dtbuf_bkp[i] = true;
  		rdt-data = NULL;
+ 		rdt-len = 0;
  	}
  	else if (rdt-data)
  		len += rdt-len;
  	break;
  }
  			}
***
*** 816,854  begin:;
  	}
  
  	/*
! 	 * Now add the backup block headers and data into the CRC
  	 */
  	for (i = 0; i  XLR_MAX_BKP_BLOCKS; i++)
  	{
! 		if (dtbuf_bkp[i])
  		{
! 			BkpBlock   *bkpb = (dtbuf_xlg[i]);
! 			char	   *page;
! 
! 			COMP_CRC32(rdata_crc,
! 	   (char *) bkpb,
! 	   sizeof(BkpBlock));
! 			page = (char *) BufferGetBlock(dtbuf[i]);
! 			if (bkpb-hole_length == 0)
! 			{
! COMP_CRC32(rdata_crc,
! 		   page,
! 		   

Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-15 Thread Andres Freund
On Thursday, December 15, 2011 02:51:33 PM Heikki Linnakangas wrote:
 I've been looking at various ways to make WALInsertLock less of a
 bottleneck on multi-CPU servers. The key is going to be to separate the
 two things that are done while holding the WALInsertLock: a) allocating
 the required space in the WAL, and b) calculating the CRC of the record
 header and copying the data to the WAL page. a) needs to be serialized,
 but b) could be done in parallel.
 
 I've been experimenting with different approaches to do that, but one
 thing is common among all of them: you need to know the total amount of
 WAL space needed for the record, including backup blocks, before you
 take the lock. So, here's a patch to move things around in XLogInsert()
 a bit, to accomplish that.
 
 This patch doesn't seem to have any performance or scalability impact. I
 must admit I expected it to give a tiny gain in scalability by
 shortening the time WALInsertLock is held by a few instructions, but I
 can't measure any. But IMO it makes the code more readable, so this is
 worthwhile for that reason alone.
Thats great! I did (or at least tried) something similar when I was playing 
around with another crc32 implementation (which I plan to finish sometime). My 
changes where totally whacky but I got rather big improvements when changing 
the crc computation from incremental to one big swoop.
I started to hack up an api which buffered xlog data in statically sized buffer 
in each backend and only submitted that every now and then. Never got that to 
actually work correctly in more than the simplest cases though ;). In many 
cases were taking the wal insert lock way to often during a single insert... 
(you obviously know that...)

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-15 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 I've been experimenting with different approaches to do that, but one 
 thing is common among all of them: you need to know the total amount of 
 WAL space needed for the record, including backup blocks, before you 
 take the lock. So, here's a patch to move things around in XLogInsert() 
 a bit, to accomplish that.

This patch may or may not be useful, but this description of it is utter
nonsense, because we already do compute that before taking the lock.
Please try again to explain what you're doing?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-15 Thread Jeff Janes
On Thu, Dec 15, 2011 at 7:34 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 I've been experimenting with different approaches to do that, but one
 thing is common among all of them: you need to know the total amount of
 WAL space needed for the record, including backup blocks, before you
 take the lock. So, here's a patch to move things around in XLogInsert()
 a bit, to accomplish that.

 This patch may or may not be useful, but this description of it is utter
 nonsense, because we already do compute that before taking the lock.
 Please try again to explain what you're doing?

Currently the CRC of all the data minus the header is computed outside the lock,
but then the header's computation is added and the CRC is finalized
inside the lock.


Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-15 Thread Tom Lane
Jeff Janes jeff.ja...@gmail.com writes:
 On Thu, Dec 15, 2011 at 7:34 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 This patch may or may not be useful, but this description of it is utter
 nonsense, because we already do compute that before taking the lock.
 Please try again to explain what you're doing?

 Currently the CRC of all the data minus the header is computed outside the 
 lock,
 but then the header's computation is added and the CRC is finalized
 inside the lock.

Quite.  AFAICS that is not optional, unless you are proposing to remove
the prev_link from the scope of the CRC, which is not exactly a
penalty-free change.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-15 Thread Heikki Linnakangas

On 15.12.2011 18:48, Tom Lane wrote:

Jeff Janesjeff.ja...@gmail.com  writes:

On Thu, Dec 15, 2011 at 7:34 AM, Tom Lanet...@sss.pgh.pa.us  wrote:

This patch may or may not be useful, but this description of it is utter
nonsense, because we already do compute that before taking the lock.
Please try again to explain what you're doing?



Currently the CRC of all the data minus the header is computed outside the lock,
but then the header's computation is added and the CRC is finalized
inside the lock.


Quite.  AFAICS that is not optional,


Right, my patch did not change that.


unless you are proposing to remove
the prev_link from the scope of the CRC, which is not exactly a
penalty-free change.


We could CRC the rest of the record header before getting the lock, 
though, and only include the prev-link while holding the lock. I 
micro-benchmarked that a little bit, but didn't see much benefit from 
doing just that. Once you do more drastic changes so that the lock 
doesn't need to be held while copying the data and calculating the CRC 
of the record header, so that those things can be done in parallel, it 
matters even less.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-15 Thread Heikki Linnakangas

On 15.12.2011 17:34, Tom Lane wrote:

Heikki Linnakangasheikki.linnakan...@enterprisedb.com  writes:

I've been experimenting with different approaches to do that, but one
thing is common among all of them: you need to know the total amount of
WAL space needed for the record, including backup blocks, before you
take the lock. So, here's a patch to move things around in XLogInsert()
a bit, to accomplish that.


This patch may or may not be useful, but this description of it is utter
nonsense, because we already do compute that before taking the lock.


Nope. Without this patch, the total length including the backup blocks, 
write_len, is added up in the loop that creates the rdata entries for 
backup blocks. That is done while holding the lock (see code beginning 
with comment Make additional rdata chain entries for the backup blocks).


Admittedly you could sum up the total length quite easily in the earlier 
loop, before we grab the lock, where we calculate the CRC of the backup 
blocks (Now add the backup block headers and data into the CRC). That 
would be a smaller patch.



Please try again to explain what you're doing?


Ok: I'm moving the creation of rdata entries for backup blocks outside 
the critical section, so that it's done before grabbing the lock. I'm 
also moving the CRC calculation so that it's done after all the rdata 
entries have been created, including the ones for backup blocks. It's 
more readable to do it that way, as a separate step, instead of 
sprinkling the COMP_CRC macros in many places.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-15 Thread Simon Riggs
On Thu, Dec 15, 2011 at 7:06 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 Please try again to explain what you're doing?


 Ok: I'm moving the creation of rdata entries for backup blocks outside the
 critical section, so that it's done before grabbing the lock. I'm also
 moving the CRC calculation so that it's done after all the rdata entries
 have been created, including the ones for backup blocks. It's more readable
 to do it that way, as a separate step, instead of sprinkling the COMP_CRC
 macros in many places.

There's a comment that says we can't undo the linking of the rdata
chains, but it looks like a reversible process to me.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-15 Thread Simon Riggs
On Thu, Dec 15, 2011 at 6:50 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 unless you are proposing to remove
 the prev_link from the scope of the CRC, which is not exactly a
 penalty-free change.


 We could CRC the rest of the record header before getting the lock, though,
 and only include the prev-link while holding the lock. I micro-benchmarked
 that a little bit, but didn't see much benefit from doing just that. Once
 you do more drastic changes so that the lock doesn't need to be held while
 copying the data and calculating the CRC of the record header, so that those
 things can be done in parallel, it matters even less.

You missed your cue to discuss leaving the prev link out of the CRC altogether.

On its own that sounds dangerous, but its not. When we need to confirm
the prev link we already know what we expect it to be, so CRC-ing it
is overkill. That isn't true of any other part of the WAL record, so
the prev link is the only thing we can relax, but thats OK because we
can CRC check everything else outside of the locked section.

That isn't my idea, but I'm happy to put it on the table since I'm not shy.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-15 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 You missed your cue to discuss leaving the prev link out of the CRC 
 altogether.

 On its own that sounds dangerous, but its not. When we need to confirm
 the prev link we already know what we expect it to be, so CRC-ing it
 is overkill. That isn't true of any other part of the WAL record, so
 the prev link is the only thing we can relax, but thats OK because we
 can CRC check everything else outside of the locked section.

 That isn't my idea, but I'm happy to put it on the table since I'm not shy.

I'm glad it's not your idea, because it's a bad one.  A large part of
the point of CRC'ing WAL records is to guard against torn-page problems
in the WAL files, and doing things like that would give up a significant
part of that protection, because there would no longer be any assurance
that the body of a WAL record had anything to do with its prev_link.

Consider a scenario like this:

* We write a WAL record that starts 8 bytes before a sector boundary,
so that the prev_link is in one sector and the rest of the record in
the next one(s).

* Time passes, and we recycle that WAL file.

* We write another WAL record that starts 8 bytes before the same sector
boundary, so that the prev_link is in one sector and the rest of the
record in the next one(s).

* System crashes, after having written out the earlier sector but not
the later one(s).

On restart, the replay code will see a prev_link that matches what it
expects.  If the CRC for the remainder of the record is not dependent
on the prev_link, then the remainder of the old record will look good
too, and we'll attempt to replay it, n*16MB too late.

Including the prev_link in the CRC adds a significant amount of
protection against such problems.  We should not remove this protection
in the name of shaving a few cycles.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers