Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Wed, Mar 21, 2012 at 10:59 PM, Tom Lane wrote: > Heikki Linnakangas writes: >> ... although none of the issues alone is a show-stopper, but considering >> all these things together, I'm starting to feel that this needs to be >> pushed to 9.3. Thoughts? > > Agreed. In particular, I think you are right that it'd be prudent to > simplify the WAL-location arithmetic and then rebase this code onto > that. And since no code at all has been written for the arithmetic > change, I think we have to consider that it's not 9.2 material. Agreed. BTW, the patch changes some functions so that they use volatile pointer, as follows: @@ -8448,7 +9232,7 @@ XLogReportParameters(void) void UpdateFullPageWrites(void) { - XLogCtlInsert *Insert = &XLogCtl->Insert; + volatile XLogCtlInsert *Insert = &XLogCtl->Insert; These changes should be applied? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
Heikki Linnakangas writes: > ... although none of the issues alone is a show-stopper, but considering > all these things together, I'm starting to feel that this needs to be > pushed to 9.3. Thoughts? Agreed. In particular, I think you are right that it'd be prudent to simplify the WAL-location arithmetic and then rebase this code onto that. And since no code at all has been written for the arithmetic change, I think we have to consider that it's not 9.2 material. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Wed, Mar 21, 2012 at 7:52 AM, Heikki Linnakangas wrote: > So, although none of the issues alone is a show-stopper, but considering all > these things together, I'm starting to feel that this needs to be pushed to > 9.3. Thoughts? I think I agree. I like the refactoring ideas that you're proposing, but I don't really think we should be starting on that in mid-March. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On 21.03.2012 13:14, Fujii Masao wrote: PANIC: space reserved for WAL record does not match what was written, CurrPos: C/0, EndPos: B/FF00 So I think that the patch would have a bug which handles WAL boundary wrongly. Thanks for the testing! These WAL boundary issues are really tricky, you found bugs in that area before, and I found and fixed one before posting the last version, and apparently there's still at least one left. Overall, what do you (and others) think about the state of this patch? I'm starting to feel that this needs to be pushed to 9.3. That bug might not be very hard to fix, but the fact that these bugs are still cropping up at this late stage makes me uneasy. That boundary calculation in particular is surprisingly tricky, and I think it could be made less tricky with some refactoring of the WAL-logging code, replacing XLogRecPtr with uint64s, like Peter (IIRC) suggested a while ago. And that seems like 9.3 material. Also, there's still these two known issues: 1. It slows down the WAL insertion in a single backend by about 10% 2. With lots of backends inserting tiny records concurrently, you get spinlock contention, which consumes a lot of CPU. Without the patch, you get lwlock contention and bandwidth isn't any better, but you sleep rather than spin. I'm afraid those issues aren't easily fixable. I haven't been able to identify the source of slowdown in the single-backend case, it seems to be simply the distributed cost of the extra bookkeeping. That might be acceptable, 10% slowdown of raw WAL insertion speed is not good, but WAL insertion only accounts for a fraction of the total CPU usage for any real workload, so I believe the slowdown of a real application would be more like 1-3%, at worst. But I would feel more comfortable if we had more time to test that. The spinlock contention issue might be acceptable too. I think it would be hard to run into it in a real application, and even then, the benchmarks show that although you spend a lot of CPU time spinning, you get at least the same overall bandwidth with the patch, which is what really matters. And I think it could be alleviated by reducing the time the spinlock is held, and I think that could be done by making the space reservation calculations simpler. If we got rid of the limitation that the WAL record header is never split across WAL pages, and always stored the continuation record header on all WAL pages, the space reservation calculation could be reduced to essentially "currentpos += size". But that again seems 9.3 material. So, although none of the issues alone is a show-stopper, but considering all these things together, I'm starting to feel that this needs to be pushed to 9.3. Thoughts? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Thu, Mar 15, 2012 at 5:52 AM, Heikki Linnakangas wrote: > When all those changes are put together, the patched version now beats or > matches the current code in the RAM drive tests, except that the > single-client case is still about 10% slower. I added the new test results > at http://community.enterprisedb.com/xloginsert-scale-tests/, and a new > version of the patch is attached. When I ran pgbench with v18 patch, I encountered the following PANIC error: PANIC: space reserved for WAL record does not match what was written To investigate the cause, I applied the following changes and ran pgbench again, diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index bfc7421..2cef0bd 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -1294,7 +1294,7 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecord *rechdr, } if (!XLByteEQ(CurrPos, EndPos)) - elog(PANIC, "space reserved for WAL record does not match what was written"); + elog(PANIC, "space reserved for WAL record does not match what was written, CurrPos: %X/%X, EndPos: %X/%X", CurrPos.xlogid, CurrPos.xrecoff, EndPos.xlogid, EndPos.xrecoff); } else { then I got the following: PANIC: space reserved for WAL record does not match what was written, CurrPos: C/0, EndPos: B/FF00 So I think that the patch would have a bug which handles WAL boundary wrongly. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Tue, Mar 13, 2012 at 1:59 AM, Jeff Janes wrote: > On Fri, Mar 9, 2012 at 2:45 AM, Heikki Linnakangas > wrote: >> >> >> Thanks! >> >> BTW, I haven't forgotten about the recovery bugs Jeff found earlier. I'm >> planning to do a longer run with his test script - I only run it for about >> 1000 iterations - to see if I can reproduce the PANIC with both the earlier >> patch version he tested, and this new one. > > Hi Heikki, > > I've run the v12 patch for 17,489 rounds of crash and recovery, and > detected no assertion failures or other problems. In v12 patch, I could no longer reproduce the assertion failure, too. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On 09.03.2012 12:04, Heikki Linnakangas wrote: I've been doing some performance testing with this, using a simple C function that just inserts a dummy WAL record of given size. I'm not totally satisfied. Although the patch helps with scalability at 3-4 concurrent backends doing WAL insertions, it seems to slow down the single-client case with small WAL records by about 5-10%. This is what Robert also saw with an earlier version of the patch (http://archives.postgresql.org/pgsql-hackers/2011-12/msg01223.php). I tested this with the data directory on a RAM drive, unfortunately I don't have a server with a hard drive that can sustain the high insertion rate. I'll post more detailed results, once I've refined the tests a bit. So, here's more detailed test results, using Greg Smith's excellent pgbench-tools test suite: http://community.enterprisedb.com/xloginsert-scale-tests/ The workload in all of these tests was a simple C function that writes a lot of very small WAL records, with 16 bytes of payload each. I ran the tests with the data directory on a regular hard drive, on an SDD, and on a ram drive (/dev/shm). With HDD, I also tried fsync=off and synchronous_commit_off. For each of those, I ran the tests with 1-16 concurrent backends. Summary: The patch hurts single-backend performance by about 10%, except for the synchronous_commit=off test. Between 2-6 clients, it either helps, doesn't make any difference, or hurts. With > 6 clients, it hurts. So, that's quite disappointing. The patch has two problems: the 10% slowdown in single-client case, and the slowdown with > 6 clients. I don't know where exactly the single-client slowdown comes from, although I'm not surprised that the bookkeeping with slots etc. has some overhead. Hopefully that overhead can be made smaller, if not eliminated completely.. The slowdown with > 6 clients seems to be spinlock contention. I ran "perf record" for a short duration during one of the ramdrive tests, and saw the spinlock acquisition in ReserveXLogInsertLocation() consuming about 80% of all CPU time. I then hacked the patch a little bit, removing the check in XLogInsert for fullPageWrites and forcePageWrites, as well as the check for "did a checkpoint just happen" (see http://community.enterprisedb.com/xloginsert-scale-tests/disable-fpwcheck.patch). My hunch was that accessing those fields causes cache line stealing, making the cache line containing the spinlock even more busy. That hunch seems to be correct; when I reran the tests with that patch, the performance with high # of clients became much better. See the results with "xloginsert-scale-13.patch". With that change, the single-client case is still about 10% slower than current code, but the performance with > 8 clients is almost as good as with current code. Between 2-6 clients, the patch is a win. The hack that restored the > 6 clients performance to current level is not safe, of course, so I'll have to figure out a safe way to get that effect. Also, even when the performance is as good as current code, it's not good to spend all the CPU time spinning on the spinlock. I didn't measure the CPU usage with current code, but I would expect it to be sleeping, not spinning, when not doing useful work. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Fri, Mar 9, 2012 at 2:45 AM, Heikki Linnakangas wrote: > > > Thanks! > > BTW, I haven't forgotten about the recovery bugs Jeff found earlier. I'm > planning to do a longer run with his test script - I only run it for about > 1000 iterations - to see if I can reproduce the PANIC with both the earlier > patch version he tested, and this new one. Hi Heikki, I've run the v12 patch for 17,489 rounds of crash and recovery, and detected no assertion failures or other problems. Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Fri, Mar 9, 2012 at 7:04 PM, Heikki Linnakangas wrote: > Here's an updated patch. It now only loops once per segment that a record > crosses. Plus a lot of other small cleanup. Thanks! But you forgot to attach the patch. > I've been doing some performance testing with this, using a simple C > function that just inserts a dummy WAL record of given size. I'm not totally > satisfied. Although the patch helps with scalability at 3-4 concurrent > backends doing WAL insertions, it seems to slow down the single-client case > with small WAL records by about 5-10%. This is what Robert also saw with an > earlier version of the patch > (http://archives.postgresql.org/pgsql-hackers/2011-12/msg01223.php). I > tested this with the data directory on a RAM drive, unfortunately I don't > have a server with a hard drive that can sustain the high insertion rate. > I'll post more detailed results, once I've refined the tests a bit. I'm also doing performance test. If I get interesting result, I'll post it. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On 07.03.2012 17:28, Tom Lane wrote: Simon Riggs writes: On Wed, Mar 7, 2012 at 3:04 PM, Tom Lane wrote: Alvaro Herrera writes: So they are undoubtely rare. Not sure if as rare as Higgs bosons. Even if they're rare, having a major performance hiccup when one happens is not a side-effect I want to see from a patch whose only reason to exist is better performance. I agree the effect you point out can exist, I just don't want to slow down the main case as a result. I don't see any reason to think that what I suggested would slow things down, especially not if the code were set up to fall through quickly in the typical case where no page boundary is crossed. Integer division is not slow on any machine made in the last 15 years or so. Agreed. I wasn't worried about the looping with extra-large records, but might as well not do it. Here's an updated patch. It now only loops once per segment that a record crosses. Plus a lot of other small cleanup. I've been doing some performance testing with this, using a simple C function that just inserts a dummy WAL record of given size. I'm not totally satisfied. Although the patch helps with scalability at 3-4 concurrent backends doing WAL insertions, it seems to slow down the single-client case with small WAL records by about 5-10%. This is what Robert also saw with an earlier version of the patch (http://archives.postgresql.org/pgsql-hackers/2011-12/msg01223.php). I tested this with the data directory on a RAM drive, unfortunately I don't have a server with a hard drive that can sustain the high insertion rate. I'll post more detailed results, once I've refined the tests a bit. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Mon, Mar 5, 2012 at 8:50 AM, Heikki Linnakangas wrote: > > That particular issue would be very hard to hit in practice, so I don't know > if this could explain the recovery failures that Jeff saw. I got the test > script running (thanks for that Jeff!), but unfortunately have not seen any > failures yet (aside from the issue with crossing xlogid boundary), with > either this or the older version of the patch. > > Attached is a new version of the patch. I've run patch v10 for 14109 cycles of crash and recovery, and there were 8 assertion failures at "xlog.c", Line: 2106 during the end-of-recovery checkpoint. How many cycles have you run? Assuming the crashes follow a simple binomial distribution with the frequency I see, you would have to run for ~1230 cycles for a 50% chance of experiencing at least one, or for ~8120 cycles for a 99% chance of experiencing at least one. I think Fujii's method if provoking this problem is more efficient than mine, although I haven't tried it myself. Dual Core AMD Opteron(tm) Processor 275 2.6.32.36-0.5-default #1 SMP 2011-04-14 10:12:31 +0200 x86_64 x86_64 x86_64 GNU/Linux Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
Simon Riggs writes: > On Wed, Mar 7, 2012 at 3:04 PM, Tom Lane wrote: >> Alvaro Herrera writes: >>> So they are undoubtely rare. Not sure if as rare as Higgs bosons. >> Even if they're rare, having a major performance hiccup when one happens >> is not a side-effect I want to see from a patch whose only reason to >> exist is better performance. > I agree the effect you point out can exist, I just don't want to slow > down the main case as a result. I don't see any reason to think that what I suggested would slow things down, especially not if the code were set up to fall through quickly in the typical case where no page boundary is crossed. Integer division is not slow on any machine made in the last 15 years or so. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Wed, Mar 7, 2012 at 3:04 PM, Tom Lane wrote: > Alvaro Herrera writes: >> Just to keep things in perspective -- For a commit record to reach one >> megabyte, it would have to be a transaction that drops over 43k tables. >> Or have 64k smgr inval messages (for example, a TRUNCATE might send half >> a dozen of these messages). Or have 262k subtransactions. Or >> combinations thereof. > >> Now admittedly, a page is only 8 kB, so for a commit record to be "many >> pages long" (that is, >=3) it would require about 1500 smgr inval >> messages, or, say, about 250 TRUNCATEs (of permanent tables with at >> least one toastable field and at least one index). > > What about the locks (if running hot-standby)? It's a list of active AccessExclusiveLocks. If that list is long you can be sure not much else is happening on the server. >> So they are undoubtely rare. Not sure if as rare as Higgs bosons. > > Even if they're rare, having a major performance hiccup when one happens > is not a side-effect I want to see from a patch whose only reason to > exist is better performance. I agree the effect you point out can exist, I just don't want to slow down the main case as a result. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
Alvaro Herrera writes: > Just to keep things in perspective -- For a commit record to reach one > megabyte, it would have to be a transaction that drops over 43k tables. > Or have 64k smgr inval messages (for example, a TRUNCATE might send half > a dozen of these messages). Or have 262k subtransactions. Or > combinations thereof. > Now admittedly, a page is only 8 kB, so for a commit record to be "many > pages long" (that is, >=3) it would require about 1500 smgr inval > messages, or, say, about 250 TRUNCATEs (of permanent tables with at > least one toastable field and at least one index). What about the locks (if running hot-standby)? > So they are undoubtely rare. Not sure if as rare as Higgs bosons. Even if they're rare, having a major performance hiccup when one happens is not a side-effect I want to see from a patch whose only reason to exist is better performance. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
Excerpts from Simon Riggs's message of mié mar 07 05:35:44 -0300 2012: > On Tue, Mar 6, 2012 at 8:32 PM, Tom Lane wrote: > > Heikki Linnakangas writes: > >> On 06.03.2012 17:12, Tom Lane wrote: > >>> How long is the current locked code exactly --- does it contain a loop? > > > >> Perhaps best if you take a look for yourself, the function is called > >> ReserveXLogInsertLocation() in patch. It calls a helper function called > >> AdvanceXLogRecPtrToNextPage(ptr), which is small and could be inlined. > >> It does contain one loop, which iterates once for every WAL page the > >> record crosses. > > > > Hm. The loop makes me a tad uncomfortable, because it is possible for > > WAL records to be very long (many pages). I see the point that > > replacing the spinlock with an LWLock would likely negate any > > performance win from this patch, but having other processes arrive and > > spin while somebody is busy calculating the size of a multi-megabyte > > commit record would be bad too. > > I would have thought the existence of a multi-megabyte commit record > would already imply a huge performance effect elsewhere and we > wouldn't care too much about a few spinlock cycles. But I think > they're as rare as Higgs bosons. Just to keep things in perspective -- For a commit record to reach one megabyte, it would have to be a transaction that drops over 43k tables. Or have 64k smgr inval messages (for example, a TRUNCATE might send half a dozen of these messages). Or have 262k subtransactions. Or combinations thereof. Now admittedly, a page is only 8 kB, so for a commit record to be "many pages long" (that is, >=3) it would require about 1500 smgr inval messages, or, say, about 250 TRUNCATEs (of permanent tables with at least one toastable field and at least one index). So they are undoubtely rare. Not sure if as rare as Higgs bosons. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Tue, Mar 6, 2012 at 8:32 PM, Tom Lane wrote: > Heikki Linnakangas writes: >> On 06.03.2012 17:12, Tom Lane wrote: >>> How long is the current locked code exactly --- does it contain a loop? > >> Perhaps best if you take a look for yourself, the function is called >> ReserveXLogInsertLocation() in patch. It calls a helper function called >> AdvanceXLogRecPtrToNextPage(ptr), which is small and could be inlined. >> It does contain one loop, which iterates once for every WAL page the >> record crosses. > > Hm. The loop makes me a tad uncomfortable, because it is possible for > WAL records to be very long (many pages). I see the point that > replacing the spinlock with an LWLock would likely negate any > performance win from this patch, but having other processes arrive and > spin while somebody is busy calculating the size of a multi-megabyte > commit record would be bad too. I would have thought the existence of a multi-megabyte commit record would already imply a huge performance effect elsewhere and we wouldn't care too much about a few spinlock cycles. But I think they're as rare as Higgs bosons. If/when such records do exist it isn't likely to be on a high transaction rate server. Even allocating ~1 million xids takes long enough that we wouldn't be expecting a very high commit rate even with 200 concurrent sessions doing this. If such records are rare, then the minor blip they cause is just a drop in the ocean. So I think Tom's concern is valid, but the frequency of problems resulting from it will be so low as to not even be measurable. And before we fix a perceived performance issue, we really should prove its existence first, then confirm that this area is the bottleneck that is slowing such workloads. So +1 to Heikki keeping the spinlock, as is, and not redesign anything. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Wed, Mar 7, 2012 at 5:32 AM, Tom Lane wrote: > What I suggest is that it should not be necessary to crawl forward one > page at a time to figure out how many pages will be needed to store N > bytes worth of WAL data. You're basically implementing a division > problem as repeated subtraction. Getting the extra WAL-segment-start > overhead right would be slightly tricky; but even if you didn't want to > try to make it pure straight-line code, at the very least it seems like > you could set it up so that the loop iterates only once per segment not > page. +1 Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
Heikki Linnakangas writes: > On 06.03.2012 17:12, Tom Lane wrote: >> How long is the current locked code exactly --- does it contain a loop? > Perhaps best if you take a look for yourself, the function is called > ReserveXLogInsertLocation() in patch. It calls a helper function called > AdvanceXLogRecPtrToNextPage(ptr), which is small and could be inlined. > It does contain one loop, which iterates once for every WAL page the > record crosses. Hm. The loop makes me a tad uncomfortable, because it is possible for WAL records to be very long (many pages). I see the point that replacing the spinlock with an LWLock would likely negate any performance win from this patch, but having other processes arrive and spin while somebody is busy calculating the size of a multi-megabyte commit record would be bad too. What I suggest is that it should not be necessary to crawl forward one page at a time to figure out how many pages will be needed to store N bytes worth of WAL data. You're basically implementing a division problem as repeated subtraction. Getting the extra WAL-segment-start overhead right would be slightly tricky; but even if you didn't want to try to make it pure straight-line code, at the very least it seems like you could set it up so that the loop iterates only once per segment not page. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On 06.03.2012 17:12, Tom Lane wrote: Heikki Linnakangas writes: On 06.03.2012 14:52, Fujii Masao wrote: This also strikes me that the usage of the spinlock insertpos_lck might not be OK in ReserveXLogInsertLocation() because a few dozen instructions can be performed while holding the spinlock I admit that block is longer than any of our existing spinlock blocks. However, it's important for performance. I tried using a lwlock earlier, and that negated the gains. So if that's a serious objection, then let's resolve that now before I spend any more time on other aspects of the patch. Any ideas how to make that block shorter? How long is the current locked code exactly --- does it contain a loop? Perhaps best if you take a look for yourself, the function is called ReserveXLogInsertLocation() in patch. It calls a helper function called AdvanceXLogRecPtrToNextPage(ptr), which is small and could be inlined. It does contain one loop, which iterates once for every WAL page the record crosses. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
Heikki Linnakangas writes: > On 06.03.2012 14:52, Fujii Masao wrote: >> This also strikes me that the usage of the spinlock insertpos_lck might >> not be OK in ReserveXLogInsertLocation() because a few dozen instructions >> can be performed while holding the spinlock > I admit that block is longer than any of our existing spinlock blocks. > However, it's important for performance. I tried using a lwlock earlier, > and that negated the gains. So if that's a serious objection, then let's > resolve that now before I spend any more time on other aspects of the > patch. Any ideas how to make that block shorter? How long is the current locked code exactly --- does it contain a loop? I'm not sure where the threshold of pain is for length of time holding a spinlock. I wouldn't go out of the way to avoid using a spinlock for say a hundred instructions, at least not unless it was a very high-contention lock. But sleeping while holding a spinlock is right out. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Tue, Mar 6, 2012 at 10:07 AM, Heikki Linnakangas wrote: > I admit that block is longer than any of our existing spinlock blocks. > However, it's important for performance. I tried using a lwlock earlier, and > that negated the gains. So if that's a serious objection, then let's resolve > that now before I spend any more time on other aspects of the patch. Any > ideas how to make that block shorter? We shouldn't put the cart in front of the horse. The point of keeping spinlock acquisitions short is to improve performance by preventing excess spinning. If the performance is better with a spinlock than with an lwlock, then clearly the spinning isn't excessive, or at least not in the case you tested. That having been said, shorter critical sections are always good, of course... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On 06.03.2012 14:52, Fujii Masao wrote: On Tue, Mar 6, 2012 at 2:17 AM, Tom Lane wrote: Heikki Linnakangas writes: On 21.02.2012 13:19, Fujii Masao wrote: In some places, the spinlock "insertpos_lck" is taken while another spinlock "info_lck" is being held. Is this OK? What if unfortunately inner spinlock takes long to be taken? Hmm, that's only done at a checkpoint (and a restartpoint), so I doubt that's a big issue in practice. We had the same pattern before the patch, just with WALInsertLock instead of insertpos_lck. Holding a spinlock longer is much worse than holding a lwlock longer, but nevertheless I don't think that's a problem. No, that's NOT okay. A spinlock is only supposed to be held across a short straight-line sequence of instructions. Ok, that's easy enough to fix. This also strikes me that the usage of the spinlock insertpos_lck might not be OK in ReserveXLogInsertLocation() because a few dozen instructions can be performed while holding the spinlock I admit that block is longer than any of our existing spinlock blocks. However, it's important for performance. I tried using a lwlock earlier, and that negated the gains. So if that's a serious objection, then let's resolve that now before I spend any more time on other aspects of the patch. Any ideas how to make that block shorter? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Tue, Mar 6, 2012 at 1:50 AM, Heikki Linnakangas wrote: >> + * An xlog-switch record consumes all the remaining space on the >> + * WAL segment. We have already reserved it for us, but we still >> need >> + * to make sure it's been allocated and zeroed in the WAL buffers >> so >> + * that when the caller (or someone else) does XLogWrite(), it can >> + * really write out all the zeros. >> >> Why do we need to write out all the remaining space with zeros? In >> current master, we don't do that. A recovery code ignores the data >> following XLOG_SWITCH record, so I don't think that's required. > > > It's to keep the logic simpler. Before the patch, an xlog-switch just > initialized the next page in the WAL buffers to insert to, to be the first > page in the next segment. With this patch, we rely on a simple linear > mapping from an XLogRecPtr to the WAL buffer that should contain that page > (see XLogRecPtrToBufIdx()). Such a mapping is not possible if you sometimes > skip over pages in the WAL buffers, so we allocate the buffers for those > empty pages, too. Note that this means that an xlog-switch can blow through > all your WAL buffers. > > We could probably optimize that so that you don't need to actually write() > and fsync() all the zeros, perhaps by setting a flag on the WAL buffer to > indicate that it only contains padding for an xlog-switch. However, I don't > see any easy way to avoid blowing the cache. > > I'm thinking that xlog-switching happens so seldom, and typically on a > fairly idle system, so we don't need to optimize it much. I guess we should > measure the impact, though.. Online backup which forces an xlog-switch twice might be performed under a certain amount of load. So, to avoid the performance spike during online backup, I think that the optimization which skips write() and fsync() of the padding bytes is helpful. > On 22.02.2012 03:34, Fujii Masao wrote: >> When I ran the long-running performance test, I encountered the following >> panic error. >> >> PANIC: could not find WAL buffer for 0/FF00 >> >> 0/FF00 is the xlog file boundary, so the patch seems to handle >> the xlog file boundary incorrectly. In the patch, current insertion lsn >> is advanced by directly incrementing XLogRecPtr.xrecoff as follows. >> But to handle the xlog file boundary correctly, we should use >> XLByteAdvance() for that, instead? > > Thanks, fixed this, too. > > I made the locking a bit more strict in WaitXLogInsertionsToFinish(), so > that it grabs the insertpos_lck to read nextslot. I previously thought that > was not necessary, assuming that reading/writing an int32 is atomic, but I'm > afraid there might be memory-ordering issues where the CurrPos of the most > recent slot has not become visible to other backends yet, while the > advancing of nextslot has. > > That particular issue would be very hard to hit in practice, so I don't know > if this could explain the recovery failures that Jeff saw. I got the test > script running (thanks for that Jeff!), but unfortunately have not seen any > failures yet (aside from the issue with crossing xlogid boundary), with > either this or the older version of the patch. > > Attached is a new version of the patch. Thanks for the new patch! In this new patch, I again was able to reproduce the assertion failure which I described on the upthread. http://archives.postgresql.org/message-id/CAHGQGwGRuNJ%3D_ctXwteNkFkdvMDNFYxFdn0D1cd-CqL0OgNCLg%40mail.gmail.com $ uname -a Linux hermes 3.0.0-16-generic #28-Ubuntu SMP Fri Jan 27 17:50:54 UTC 2012 i686 i686 i386 GNU/Linux Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Tue, Mar 6, 2012 at 2:17 AM, Tom Lane wrote: > Heikki Linnakangas writes: >> On 21.02.2012 13:19, Fujii Masao wrote: >>> In some places, the spinlock "insertpos_lck" is taken while another >>> spinlock "info_lck" is being held. Is this OK? What if unfortunately >>> inner spinlock takes long to be taken? > >> Hmm, that's only done at a checkpoint (and a restartpoint), so I doubt >> that's a big issue in practice. We had the same pattern before the >> patch, just with WALInsertLock instead of insertpos_lck. Holding a >> spinlock longer is much worse than holding a lwlock longer, but >> nevertheless I don't think that's a problem. > > No, that's NOT okay. A spinlock is only supposed to be held across a > short straight-line sequence of instructions. This also strikes me that the usage of the spinlock insertpos_lck might not be OK in ReserveXLogInsertLocation() because a few dozen instructions can be performed while holding the spinlock Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
Heikki Linnakangas writes: > On 21.02.2012 13:19, Fujii Masao wrote: >> In some places, the spinlock "insertpos_lck" is taken while another >> spinlock "info_lck" is being held. Is this OK? What if unfortunately >> inner spinlock takes long to be taken? > Hmm, that's only done at a checkpoint (and a restartpoint), so I doubt > that's a big issue in practice. We had the same pattern before the > patch, just with WALInsertLock instead of insertpos_lck. Holding a > spinlock longer is much worse than holding a lwlock longer, but > nevertheless I don't think that's a problem. No, that's NOT okay. A spinlock is only supposed to be held across a short straight-line sequence of instructions. Something that could involve a spin loop, or worse a sleep() kernel call, is right out. Please change this. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On 21.02.2012 13:19, Fujii Masao wrote: On Sat, Feb 18, 2012 at 12:36 AM, Heikki Linnakangas wrote: Attached is a new version, fixing that, and off-by-one bug you pointed out in the slot wraparound handling. I also moved code around a bit, I think this new division of labor between the XLogInsert subroutines is more readable. This patch includes not only xlog scaling improvement but also other ones. I think it's better to extract them as separate patches and commit them first. If we do so, the main patch would become more readable. Good point. For example, I think that the followings can be extracted as a separate patch. (1) Make walwriter try to initialize as many of the no-longer-needed WAL buffers for future use as we can. This is pretty hard to extract from the larger patch. The current code in master assumes that there's only one page that is currently inserted to, and relies on WALInsertLock being held in AdvanceXLInsertBuffer(). The logic with the scaling patch is quite different. (2) Refactor the "update full_page_writes code". (3) Get rid of XLogCtl->Write.LogwrtResult and XLogCtl->Insert.LogwrtResult. Attached are patches for these two items. Barring objections, I'll commit these. (4) Call TRACE_POSTGRESQL_XLOG_SWITCH() even if the xlog switch has no work to do. Actually, I think I'll just move it in the patch to keep the existing behavior. I'm not sure if (3) makes sense. In current master, those two shared variables are used to reduce the contention of XLogCtl->info_lck and WALWriteLock. You think they have no effect on reducing the lock contention? XLogCtl->Write.LogwrtResult certainly seems redundant with XLogCtl->LogwrtResult. There might be some value in XLogCtl->Insert.LogwrtResult, it's used in AdvanceXLInsertBuffer() to before acquiring info_lck. But I doubt that makes any difference in practice either. At best it's saving one spinlock acquisition per WAL buffer, which isn't all much compared to all the other work involved. (once the scaling patch is committed, this point is moot anyway) -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 266c0de..eb7932e 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -731,8 +731,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata) unsigned i; bool updrqst; bool doPageWrites; - bool isLogSwitch = false; - bool fpwChange = false; + bool isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH); uint8 info_orig = info; /* cross-check on whether we should be here or not */ @@ -746,30 +745,11 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata) TRACE_POSTGRESQL_XLOG_INSERT(rmid, info); /* - * Handle special cases/records. + * In bootstrap mode, we don't actually log anything but XLOG resources; + * return a phony record pointer. */ - if (rmid == RM_XLOG_ID) + if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID) { - switch (info) - { - case XLOG_SWITCH: -isLogSwitch = true; -break; - - case XLOG_FPW_CHANGE: -fpwChange = true; -break; - - default: -break; - } - } - else if (IsBootstrapProcessingMode()) - { - /* - * In bootstrap mode, we don't actually log anything but XLOG resources; - * return a phony record pointer. - */ RecPtr.xlogid = 0; RecPtr.xrecoff = SizeOfXLogLongPHD; /* start of 1st chkpt record */ return RecPtr; @@ -1232,15 +1212,6 @@ begin:; WriteRqst = XLogCtl->xlblocks[curridx]; } - /* - * If the record is an XLOG_FPW_CHANGE, we update full_page_writes - * in shared memory before releasing WALInsertLock. This ensures that - * an XLOG_FPW_CHANGE record precedes any WAL record affected - * by this change of full_page_writes. - */ - if (fpwChange) - Insert->fullPageWrites = fullPageWrites; - LWLockRelease(WALInsertLock); if (updrqst) @@ -8517,6 +8488,22 @@ UpdateFullPageWrites(void) if (fullPageWrites == Insert->fullPageWrites) return; + START_CRIT_SECTION(); + + /* + * It's always safe to take full page images, even when not strictly + * required, but not the other round. So if we're setting full_page_writes + * to true, first set it true and then write the WAL record. If we're + * setting it to false, first write the WAL record and then set the + * global flag. + */ + if (fullPageWrites) + { + LWLockAcquire(WALInsertLock, LW_EXCLUSIVE); + Insert->fullPageWrites = true; + LWLockRelease(WALInsertLock); + } + /* * Write an XLOG_FPW_CHANGE record. This allows us to keep * track of full_page_writes during archive recovery, if required. @@ -8532,12 +8519,15 @@ UpdateFullPageWrites(void) XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE, &rdata); } - else + + + if (!fullPageWrites) { LWLockAcquire(WALInsertLock, LW_EXCLUSIVE); - Insert->fullPageWrites = fullPageWrites; + Insert->fullPageWrites = false;
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On 20.02.2012 08:00, Amit Kapila wrote: I was trying to understand this patch and had few doubts: 1. In PerformXLogInsert(), why there is need to check freespace when already during ReserveXLogInsertLocation(), the space is reserved. Is it possible that the record size is more than actually calculted in ReserveXLogInsertLocation(), if so in that case what I understand is it is moving to next page to write, however isn't it possible that some other backend had already reserved that space. The calculations between PerformXLogInsert (called CopyXLogRecordToWAL() in the latest patch version) and ReserveXLogInsertLocation() must always match, otherwise we have reserved incorrect amount of WAL and you get corrupt WAL. They both need to do the same calculations of how the WAL record is split across pages, which depends on how much free space there is on the first page. There is an assertion in CopyXLogRecordToWAL() to check that once it's finished writing the WAL record, the last byte landed on the position that ReserveXLogInsertLocation() calculated it would. Another way to do that would be to remember the calculations done in ReserveXLogInsertLocation(), in an extra array or something. But we want to keep ReserveXLogInsertLocation() as simple as possible, as that runs while holding the spinlock. Any extra CPU cycles there will hurt scalability. 2. In function WaitForXLogInsertionSlotToBecomeFree(), chances are there such that when nextslot equals lastslot, all new backends try to reserve a slot will start waiting on same last slot which can lead to serialization for those backends and can impact latency. True. That warrants some performance testing to see if that effect is significant. (it's surely better than the current situation, anyway, where all WAL insertions block on the single lock) 3. GetXlogBuffer - This will get called twice, once for normal buffer, second time for when there is not enough space in current page, and both times it can lead to I/O whereas in earlier algorithm, the chances of I/O is only once. I don't see any difference to the previous situation. In both cases, if you need a new page to copy the WAL record to, you need to first flush out some old pages from the WAL buffers if they're all dirty. The patch doesn't change the number of WAL buffers consumed. Note that GetXLogBuffer() is very cheap when it doesn't need to do I/O, extra calls to it don't matter if the page is already initialized. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Tue, Feb 21, 2012 at 5:34 PM, Fujii Masao wrote: > On Tue, Feb 21, 2012 at 8:19 PM, Fujii Masao wrote: >> On Sat, Feb 18, 2012 at 12:36 AM, Heikki Linnakangas >> wrote: >>> Attached is a new version, fixing that, and off-by-one bug you pointed out >>> in the slot wraparound handling. I also moved code around a bit, I think >>> this new division of labor between the XLogInsert subroutines is more >>> readable. > > When I ran the long-running performance test, I encountered the following > panic error. > > PANIC: could not find WAL buffer for 0/FF00 I too see this panic when the system survives long enough to get to that log switch. But I'm also still seeing (with version 9) the assert failure at "xlog.c", Line: 2154 during the end-of-recovery checkpoint. Here is a set up for repeating my tests. I used this test simply because I had it sitting around after having written it for other purposes. Indeed I'm not all that sure I should publish it. Hopefully other people will write other tests which exercise other corner cases, rather than exercising the same ones I am. The patch creates a guc which causes the md writer routine to panic and bring down the database, triggering recovery, after a given number for writes. In this context probably any other method of forcing a crash and recovery would be just as good as this specific method of crashing. The choice of 400 for the cutoff for crashing is based on: 1) If the number is too low, you re-crash within recovery so you never get a chance to inspect the database. In my hands, recovery doesn't need to do more than 400 writes. (I don't know how to make the database use different guc setting during recovery than it did before the crash). 2) If the number is too high, it takes too long for a crash to happen and I'm not all that patient. Some of the changes to postgresql.conf.sample are purely my preferences and have nothing in particular to do with this set up. But archive_timeout = 30 is necessary in order to get checkpoints, and thus mdwrites, to happen often enough to trigger crashes often enough to satisfy my impatience. The Perl script exercises the integrity of the database by launching multiple processes (4 by default) to run updates and memorize what updates they have run. After a crash, the Perl processes all communicate their data up to the parent, which consolidates that information and then queries the post-recovery database to make sure it agrees. Transactions that are in-flight at the time of a crash are indeterminate. Maybe the crash happened before the commit, and maybe it happened after the commit but before we received notification of the commit. So whichever way those turn out, it is not proof of corruption. With the xloginsert-scale-9.patch, the above features are not needed because the problem is not that the database is incorrect after recovery, but that the database doesn't recover in the first place. So just running pgbench would be good enough to detect that. But in earlier versions this feature did detect incorrect recovery. This logs an awful lot of stuff, most of which merely indicates normal operation. The problem is that corruption is rare, so if you wait until you see corruption before turning on logging, then you have to wait l long time to get another instance of corruption so you can dissect the log information. So, I just log everything all of the time. A warning from 'line 63' which is not marked as in-flight indicates database corruption. A warning from 'line 66' indicates even worse corruption. A failure of the entire outer script to execute for the expected number of iterations (i.e. failure of the warning issued on 'line 18' to show up 100 times) indicates the database failed to restart. Also attached is a bash script that exercises the whole thing. Note that it has various directories hard coded that really ought not be, and that it has no compunctions about calling rm -r /tmp/data. I run it is as "./do.sh >& log" and then inspect the log file for unusual lines. To run this, you first have to apply your own xlog patch, and apply my crash-inducing patch, and build and install the resulting pgsql. And edit the shell script to point to it, etc.. The whole thing is a bit of a idiosyncratic mess. Cheers, Jeff crash_REL9_2CF4.patch Description: Binary data count.pl Description: Binary data do.sh Description: Bourne shell script -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Tue, Feb 21, 2012 at 8:19 PM, Fujii Masao wrote: > On Sat, Feb 18, 2012 at 12:36 AM, Heikki Linnakangas > wrote: >> Attached is a new version, fixing that, and off-by-one bug you pointed out >> in the slot wraparound handling. I also moved code around a bit, I think >> this new division of labor between the XLogInsert subroutines is more >> readable. When I ran the long-running performance test, I encountered the following panic error. PANIC: could not find WAL buffer for 0/FF00 0/FF00 is the xlog file boundary, so the patch seems to handle the xlog file boundary incorrectly. In the patch, current insertion lsn is advanced by directly incrementing XLogRecPtr.xrecoff as follows. But to handle the xlog file boundary correctly, we should use XLByteAdvance() for that, instead? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Sat, Feb 18, 2012 at 12:36 AM, Heikki Linnakangas wrote: > Attached is a new version, fixing that, and off-by-one bug you pointed out > in the slot wraparound handling. I also moved code around a bit, I think > this new division of labor between the XLogInsert subroutines is more > readable. This patch includes not only xlog scaling improvement but also other ones. I think it's better to extract them as separate patches and commit them first. If we do so, the main patch would become more readable. For example, I think that the followings can be extracted as a separate patch. (1) Make walwriter try to initialize as many of the no-longer-needed WAL buffers for future use as we can. (2) Refactor the "update full_page_writes code". (3) Get rid of XLogCtl->Write.LogwrtResult and XLogCtl->Insert.LogwrtResult. (4) Call TRACE_POSTGRESQL_XLOG_SWITCH() even if the xlog switch has no work to do. Others? I'm not sure if (3) makes sense. In current master, those two shared variables are used to reduce the contention of XLogCtl->info_lck and WALWriteLock. You think they have no effect on reducing the lock contention? In some places, the spinlock "insertpos_lck" is taken while another spinlock "info_lck" is being held. Is this OK? What if unfortunately inner spinlock takes long to be taken? +* An xlog-switch record consumes all the remaining space on the +* WAL segment. We have already reserved it for us, but we still need +* to make sure it's been allocated and zeroed in the WAL buffers so +* that when the caller (or someone else) does XLogWrite(), it can +* really write out all the zeros. Why do we need to write out all the remaining space with zeros? In current master, we don't do that. A recovery code ignores the data following XLOG_SWITCH record, so I don't think that's required. + /* XXX: before this patch, TRACE_POSTGRESQL_XLOG_SWITCH was not called +* if the xlog switch had no work to do, ie. if we were already at the +* beginning of a new XLOG segment. You can check if RecPtr points to +* beginning of a segment if you want to keep the distinction. +*/ + TRACE_POSTGRESQL_XLOG_SWITCH(); If so, RecPtr (or the flag indicating whether the xlog switch has no work to do) should be given to TRACE_POSTGRESQL_XLOG_SWITCH() as an argument? The followings are trivial comments: +* insertion, but ẃhile we're at it, advance lastslot as much as we Typo: 'ẃ' should be 'w' In XLogRecPtrToBufIdx() and XLogRecEndPtrToBufIdx(), why don't you use XLogFileSize instead of XLogSegsPerFile * XLogSegSize? There are some source code comments which still refer to WALInsertLock. It's cleaner if pg_start_backup_callback() uses Insert instead of XLogCtl->Insert, like do_pg_start_backup(), do_pg_stop_backup() and do_pg_abort_backup() do. + freespace = XLOG_BLCKSZ - EndRecPtr.xrecoff % XLOG_BLCKSZ; Though this is not incorrect, it's better to use EndOfLog instead of EndRecPtr, like near code does. while (extraWaits-- > 0) PGSemaphoreUnlock(&MyProc->sem); This should be executed also in WaitForXLogInsertionSlotToBecomeFree()? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Sun, Feb 19, 2012 at 3:01 AM, Jeff Janes wrote: > I've tested your v9 patch. I no longer see any inconsistencies or > lost transactions in the recovered database. But occasionally I get > databases that fail to recover at all. > It has always been with the exact same failed assertion, at xlog.c line 2154. > > I've only seen this 4 times out of 2202 cycles of crash and recovery, > so it must be some rather obscure situation. > > LOG: database system was not properly shut down; automatic recovery in > progress > LOG: redo starts at 0/180001B0 > LOG: unexpected pageaddr 0/15084000 in log file 0, segment 25, offset 540672 > LOG: redo done at 0/19083FD0 > LOG: last completed transaction was at log time 2012-02-17 11:13:50.369488-08 > LOG: checkpoint starting: end-of-recovery immediate > TRAP: FailedAssertion("!(((uint64) (NewPageEndPtr).xlogid * > (uint64) (((uint32) 0x) / ((uint32) (16 * 1024 * 1024))) * > ((uint32) (16 * 1024 * 1024))) + (NewPageEndPtr).xrecoff - 1)) / 8192) > % (XLogCtl->XLogCacheBlck + 1)) == nextidx)", File: "xlog.c", Line: > 2154) > LOG: startup process (PID 5390) was terminated by signal 6: Aborted > LOG: aborting startup due to startup process failure I could reproduce this when I made the server crash just after executing "select pg_switch_xlog()". $ initdb -D data $ pg_ctl -D data start $ psql -c "select pg_switch_xlog()" $ pg_ctl -D data stop -m i $ pg_ctl -D data start ... LOG: redo done at 0/16E3B0C TRAP: FailedAssertion("!(((uint64) (NewPageEndPtr).xlogid * (uint64) (((uint32) 0x) / ((uint32) (16 * 1024 * 1024))) * ((uint32) (16 * 1024 * 1024))) + (NewPageEndPtr).xrecoff - 1)) / 8192) % (XLogCtl->XLogCacheBlck + 1)) == nextidx)", File: "xlog.c", Line: 2154) LOG: startup process (PID 16361) was terminated by signal 6: Aborted LOG: aborting startup due to startup process failure Though I've not read new patch yet, I doubt that xlog switch code would still have a bug. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
I was trying to understand this patch and had few doubts: 1. In PerformXLogInsert(), why there is need to check freespace when already during ReserveXLogInsertLocation(), the space is reserved. Is it possible that the record size is more than actually calculted in ReserveXLogInsertLocation(), if so in that case what I understand is it is moving to next page to write, however isn't it possible that some other backend had already reserved that space. 2. In function WaitForXLogInsertionSlotToBecomeFree(), chances are there such that when nextslot equals lastslot, all new backends try to reserve a slot will start waiting on same last slot which can lead to serialization for those backends and can impact latency. 3. GetXlogBuffer - This will get called twice, once for normal buffer, second time for when there is not enough space in current page, and both times it can lead to I/O whereas in earlier algorithm, the chances of I/O is only once. -Original Message- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Heikki Linnakangas Sent: Friday, February 17, 2012 9:07 PM To: Fujii Masao Cc: Jeff Janes; Robert Haas; PostgreSQL-development Subject: Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock) On 17.02.2012 07:27, Fujii Masao wrote: > Got another problem: when I ran pg_stop_backup to take an online > backup, it got stuck until I had generated new WAL record. This > happens because, in the patch, when pg_stop_backup forces a switch to > new WAL file, old WAL file is not marked as archivable until next new > WAL record has been inserted, but pg_stop_backup keeps waiting for that WAL file to be archived. > OTOH, without the patch, WAL file is marked as archivable as soon as > WAL file switch occurs. > > So, in short, the patch seems to handle the WAL file switch logic incorrectly. Yep. For a WAL-switch record, XLogInsert returns the location of the end of the record, not the end of the empty padding space. So when the caller flushed up to that point, it didn't flush the empty space and therefore didn't notify the archiver. Attached is a new version, fixing that, and off-by-one bug you pointed out in the slot wraparound handling. I also moved code around a bit, I think this new division of labor between the XLogInsert subroutines is more readable. Thanks for the testing! -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Fri, Feb 17, 2012 at 7:36 AM, Heikki Linnakangas wrote: > On 17.02.2012 07:27, Fujii Masao wrote: >> >> Got another problem: when I ran pg_stop_backup to take an online backup, >> it got stuck until I had generated new WAL record. This happens because, >> in the patch, when pg_stop_backup forces a switch to new WAL file, old >> WAL file is not marked as archivable until next new WAL record has been >> inserted, but pg_stop_backup keeps waiting for that WAL file to be >> archived. >> OTOH, without the patch, WAL file is marked as archivable as soon as WAL >> file switch occurs. >> >> So, in short, the patch seems to handle the WAL file switch logic >> incorrectly. > > > Yep. For a WAL-switch record, XLogInsert returns the location of the end of > the record, not the end of the empty padding space. So when the caller > flushed up to that point, it didn't flush the empty space and therefore > didn't notify the archiver. > > Attached is a new version, fixing that, and off-by-one bug you pointed out > in the slot wraparound handling. I also moved code around a bit, I think > this new division of labor between the XLogInsert subroutines is more > readable. > > Thanks for the testing! Hi Heikki, Sorry for the week long radio silence, I haven't been able to find much time during the week. I'll try to extract my test case from it's quite messy testing harness and get a self-contained version, but it will probably take a week or two to do it. I can probably refactor it to rely just on Perl and the modules DBI, DBD::Pg, IO::Pipe and Storable. Some of those are not core Perl modules, but they are all common ones. Would that be a good option? I've tested your v9 patch. I no longer see any inconsistencies or lost transactions in the recovered database. But occasionally I get databases that fail to recover at all. It has always been with the exact same failed assertion, at xlog.c line 2154. I've only seen this 4 times out of 2202 cycles of crash and recovery, so it must be some rather obscure situation. LOG: database system was not properly shut down; automatic recovery in progress LOG: redo starts at 0/180001B0 LOG: unexpected pageaddr 0/15084000 in log file 0, segment 25, offset 540672 LOG: redo done at 0/19083FD0 LOG: last completed transaction was at log time 2012-02-17 11:13:50.369488-08 LOG: checkpoint starting: end-of-recovery immediate TRAP: FailedAssertion("!(((uint64) (NewPageEndPtr).xlogid * (uint64) (((uint32) 0x) / ((uint32) (16 * 1024 * 1024))) * ((uint32) (16 * 1024 * 1024))) + (NewPageEndPtr).xrecoff - 1)) / 8192) % (XLogCtl->XLogCacheBlck + 1)) == nextidx)", File: "xlog.c", Line: 2154) LOG: startup process (PID 5390) was terminated by signal 6: Aborted LOG: aborting startup due to startup process failure Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On 16.02.2012 13:31, Fujii Masao wrote: On Thu, Feb 16, 2012 at 6:15 PM, Fujii Masao wrote: BTW, when I ran the test on my Ubuntu, I could not reproduce the problem. I could reproduce the problem only in MacOS. + nextslot = Insert->nextslot; + if (NextSlotNo(nextslot) == lastslot) + { + /* +* Oops, we've "caught our tail" and the oldest slot is still in use. +* Have to wait for it to become vacant. +*/ + SpinLockRelease(&Insert->insertpos_lck); + WaitForXLogInsertionSlotToBecomeFree(); + goto retry; + } + myslot =&XLogCtl->XLogInsertSlots[nextslot]; + nextslot = NextSlotNo(nextslot); nextslot can reach NumXLogInsertSlots, which would be a bug, I guess. When I did the quick-fix and ran the test, I could not reproduce the problem any more. I'm not sure if this is really the cause of the problem, though. Ah, I see. That explains why you only see it on some platforms - depending on ALIGNOF_XLOG_BUFFER, there is often enough padding after the last valid slot to accommodate the extra bogus slot. Thanks for the debugging! -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Mon, Feb 13, 2012 at 8:37 PM, Heikki Linnakangas wrote: > On 13.02.2012 01:04, Jeff Janes wrote: >> >> Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD. >> I have no idea if I did it correctly, in particular if calling >> GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side >> effects that make that a bad thing to do. I'm not proposing it as the >> real fix, I just wanted to get around this problem in order to do more >> testing. > > > Thanks. That's basically the right approach. Attached patch contains a > cleaned up version of that. Got another problem: when I ran pg_stop_backup to take an online backup, it got stuck until I had generated new WAL record. This happens because, in the patch, when pg_stop_backup forces a switch to new WAL file, old WAL file is not marked as archivable until next new WAL record has been inserted, but pg_stop_backup keeps waiting for that WAL file to be archived. OTOH, without the patch, WAL file is marked as archivable as soon as WAL file switch occurs. So, in short, the patch seems to handle the WAL file switch logic incorrectly. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Thu, Feb 16, 2012 at 6:15 PM, Fujii Masao wrote: > On Thu, Feb 16, 2012 at 5:02 AM, Heikki Linnakangas > wrote: >> On 15.02.2012 18:52, Fujii Masao wrote: >>> >>> On Thu, Feb 16, 2012 at 1:01 AM, Heikki Linnakangas >>> wrote: Are you still seeing this failure with the latest patch I posted (http://archives.postgresql.org/message-id/4f38f5e5.8050...@enterprisedb.com)? >>> >>> >>> Yes. Just to be safe, I again applied the latest patch to HEAD, >>> compiled that and tried >>> the same test. Then unfortunately I got the same failure again. >> >> >> Ok. >> >>> I ran the configure with '--enable-debug' '--enable-cassert' >>> 'CPPFLAGS=-DWAL_DEBUG', >>> and make with -j 2 option. >>> >>> When I ran the test with wal_debug = on, I got the following assertion >>> failure. >>> >>> LOG: INSERT @ 0/17B3F90: prev 0/17B3F10; xid 998; len 31: Heap - >>> insert: rel 1663/12277/16384; tid 0/197 >>> STATEMENT: create table t (i int); insert into t >>> values(generate_series(1,1)); delete from t >>> LOG: INSERT @ 0/17B3FD0: prev 0/17B3F50; xid 998; len 31: Heap - >>> insert: rel 1663/12277/16384; tid 0/198 >>> STATEMENT: create table t (i int); insert into t >>> values(generate_series(1,1)); delete from t >>> TRAP: FailedAssertion("!(((bool) (((void*)(&(target->tid)) != ((void >>> *)0))&& ((&(target->tid))->ip_posid != 0", File: "heapam.c", >>> >>> Line: 5578) >>> LOG: xlog bg flush request 0/17B4000; write 0/17A6000; flush 0/179D5C0 >>> LOG: xlog bg flush request 0/17B4000; write 0/17B; flush 0/17B >>> LOG: server process (PID 16806) was terminated by signal 6: Abort trap >>> >>> This might be related to the original problem which Jeff and I saw. >> >> >> That's strange. I made a fresh checkout, too, and applied the patch, but >> still can't reproduce. I used the attached script to test it. >> >> It's surprising that the crash happens when the records are inserted, not at >> recovery. I don't see anything obviously wrong there, so could you please >> take a look around in gdb and see if you can get a clue what's going on? >> What's the stack trace? > > According to the above log messages, one strange thing is that the location > of the WAL record (i.e., 0/17B3F90) is not the same as the previous location > of the following WAL record (i.e., 0/17B3F50). Is this intentional? > > BTW, when I ran the test on my Ubuntu, I could not reproduce the problem. > I could reproduce the problem only in MacOS. + nextslot = Insert->nextslot; + if (NextSlotNo(nextslot) == lastslot) + { + /* +* Oops, we've "caught our tail" and the oldest slot is still in use. +* Have to wait for it to become vacant. +*/ + SpinLockRelease(&Insert->insertpos_lck); + WaitForXLogInsertionSlotToBecomeFree(); + goto retry; + } + myslot = &XLogCtl->XLogInsertSlots[nextslot]; + nextslot = NextSlotNo(nextslot); nextslot can reach NumXLogInsertSlots, which would be a bug, I guess. When I did the quick-fix and ran the test, I could not reproduce the problem any more. I'm not sure if this is really the cause of the problem, though. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Thu, Feb 16, 2012 at 5:02 AM, Heikki Linnakangas wrote: > On 15.02.2012 18:52, Fujii Masao wrote: >> >> On Thu, Feb 16, 2012 at 1:01 AM, Heikki Linnakangas >> wrote: >>> >>> Are you still seeing this failure with the latest patch I posted >>> >>> (http://archives.postgresql.org/message-id/4f38f5e5.8050...@enterprisedb.com)? >> >> >> Yes. Just to be safe, I again applied the latest patch to HEAD, >> compiled that and tried >> the same test. Then unfortunately I got the same failure again. > > > Ok. > >> I ran the configure with '--enable-debug' '--enable-cassert' >> 'CPPFLAGS=-DWAL_DEBUG', >> and make with -j 2 option. >> >> When I ran the test with wal_debug = on, I got the following assertion >> failure. >> >> LOG: INSERT @ 0/17B3F90: prev 0/17B3F10; xid 998; len 31: Heap - >> insert: rel 1663/12277/16384; tid 0/197 >> STATEMENT: create table t (i int); insert into t >> values(generate_series(1,1)); delete from t >> LOG: INSERT @ 0/17B3FD0: prev 0/17B3F50; xid 998; len 31: Heap - >> insert: rel 1663/12277/16384; tid 0/198 >> STATEMENT: create table t (i int); insert into t >> values(generate_series(1,1)); delete from t >> TRAP: FailedAssertion("!(((bool) (((void*)(&(target->tid)) != ((void >> *)0))&& ((&(target->tid))->ip_posid != 0", File: "heapam.c", >> >> Line: 5578) >> LOG: xlog bg flush request 0/17B4000; write 0/17A6000; flush 0/179D5C0 >> LOG: xlog bg flush request 0/17B4000; write 0/17B; flush 0/17B >> LOG: server process (PID 16806) was terminated by signal 6: Abort trap >> >> This might be related to the original problem which Jeff and I saw. > > > That's strange. I made a fresh checkout, too, and applied the patch, but > still can't reproduce. I used the attached script to test it. > > It's surprising that the crash happens when the records are inserted, not at > recovery. I don't see anything obviously wrong there, so could you please > take a look around in gdb and see if you can get a clue what's going on? > What's the stack trace? According to the above log messages, one strange thing is that the location of the WAL record (i.e., 0/17B3F90) is not the same as the previous location of the following WAL record (i.e., 0/17B3F50). Is this intentional? BTW, when I ran the test on my Ubuntu, I could not reproduce the problem. I could reproduce the problem only in MacOS. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On 15.02.2012 18:52, Fujii Masao wrote: On Thu, Feb 16, 2012 at 1:01 AM, Heikki Linnakangas wrote: Are you still seeing this failure with the latest patch I posted (http://archives.postgresql.org/message-id/4f38f5e5.8050...@enterprisedb.com)? Yes. Just to be safe, I again applied the latest patch to HEAD, compiled that and tried the same test. Then unfortunately I got the same failure again. Ok. I ran the configure with '--enable-debug' '--enable-cassert' 'CPPFLAGS=-DWAL_DEBUG', and make with -j 2 option. When I ran the test with wal_debug = on, I got the following assertion failure. LOG: INSERT @ 0/17B3F90: prev 0/17B3F10; xid 998; len 31: Heap - insert: rel 1663/12277/16384; tid 0/197 STATEMENT: create table t (i int); insert into t values(generate_series(1,1)); delete from t LOG: INSERT @ 0/17B3FD0: prev 0/17B3F50; xid 998; len 31: Heap - insert: rel 1663/12277/16384; tid 0/198 STATEMENT: create table t (i int); insert into t values(generate_series(1,1)); delete from t TRAP: FailedAssertion("!(((bool) (((void*)(&(target->tid)) != ((void *)0))&& ((&(target->tid))->ip_posid != 0", File: "heapam.c", Line: 5578) LOG: xlog bg flush request 0/17B4000; write 0/17A6000; flush 0/179D5C0 LOG: xlog bg flush request 0/17B4000; write 0/17B; flush 0/17B LOG: server process (PID 16806) was terminated by signal 6: Abort trap This might be related to the original problem which Jeff and I saw. That's strange. I made a fresh checkout, too, and applied the patch, but still can't reproduce. I used the attached script to test it. It's surprising that the crash happens when the records are inserted, not at recovery. I don't see anything obviously wrong there, so could you please take a look around in gdb and see if you can get a clue what's going on? What's the stack trace? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com fujii-crash.sh Description: Bourne shell script -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Thu, Feb 16, 2012 at 1:01 AM, Heikki Linnakangas wrote: > On 13.02.2012 19:13, Fujii Masao wrote: >> >> On Mon, Feb 13, 2012 at 8:37 PM, Heikki Linnakangas >> wrote: >>> >>> On 13.02.2012 01:04, Jeff Janes wrote: Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD. I have no idea if I did it correctly, in particular if calling GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side effects that make that a bad thing to do. I'm not proposing it as the real fix, I just wanted to get around this problem in order to do more testing. >>> >>> >>> >>> Thanks. That's basically the right approach. Attached patch contains a >>> cleaned up version of that. >>> >>> It does get rid of the "there is no contrecord flag" errors, but recover still does not work. Now the count of tuples in the table is always correct (I never provoke a crash during the initial table load), but sometimes updates to those tuples that were reported to have been committed are lost. This is more subtle, it does not happen on every crash. It seems that when recovery ends on "record with zero length at...", that recovery is correct. But when it ends on "invalid magic number in log file.." then the recovery is screwed up. >>> >>> >>> >>> Can you write a self-contained test case for that? I've been trying to >>> reproduce that by running the regression tests and pgbench with a >>> streaming >>> replication standby, which should be pretty much the same as crash >>> recovery. >>> No luck this far. >> >> >> Probably I could reproduce the same problem as Jeff got. Here is the test >> case: >> >> $ initdb -D data >> $ pg_ctl -D data start >> $ psql -c "create table t (i int); insert into t >> values(generate_series(1,1)); delete from t" >> $ pg_ctl -D data stop -m i >> $ pg_ctl -D data start >> >> The crash recovery emitted the following server logs: >> >> LOG: database system was interrupted; last known up at 2012-02-14 >> 02:07:01 JST >> LOG: database system was not properly shut down; automatic recovery in >> progress >> LOG: redo starts at 0/179CC90 >> LOG: invalid magic number in log file 0, segment 1, offset 8060928 >> LOG: redo done at 0/17AD858 >> LOG: database system is ready to accept connections >> LOG: autovacuum launcher started >> >> After recovery, I could not see the table "t" which I created before: >> >> $ psql -c "select count(*) from t" >> ERROR: relation "t" does not exist > > > Are you still seeing this failure with the latest patch I posted > (http://archives.postgresql.org/message-id/4f38f5e5.8050...@enterprisedb.com)? Yes. Just to be safe, I again applied the latest patch to HEAD, compiled that and tried the same test. Then unfortunately I got the same failure again. I ran the configure with '--enable-debug' '--enable-cassert' 'CPPFLAGS=-DWAL_DEBUG', and make with -j 2 option. When I ran the test with wal_debug = on, I got the following assertion failure. LOG: INSERT @ 0/17B3F90: prev 0/17B3F10; xid 998; len 31: Heap - insert: rel 1663/12277/16384; tid 0/197 STATEMENT: create table t (i int); insert into t values(generate_series(1,1)); delete from t LOG: INSERT @ 0/17B3FD0: prev 0/17B3F50; xid 998; len 31: Heap - insert: rel 1663/12277/16384; tid 0/198 STATEMENT: create table t (i int); insert into t values(generate_series(1,1)); delete from t TRAP: FailedAssertion("!(((bool) (((void*)(&(target->tid)) != ((void *)0)) && ((&(target->tid))->ip_posid != 0", File: "heapam.c", Line: 5578) LOG: xlog bg flush request 0/17B4000; write 0/17A6000; flush 0/179D5C0 LOG: xlog bg flush request 0/17B4000; write 0/17B; flush 0/17B LOG: server process (PID 16806) was terminated by signal 6: Abort trap This might be related to the original problem which Jeff and I saw. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On 13.02.2012 19:13, Fujii Masao wrote: On Mon, Feb 13, 2012 at 8:37 PM, Heikki Linnakangas wrote: On 13.02.2012 01:04, Jeff Janes wrote: Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD. I have no idea if I did it correctly, in particular if calling GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side effects that make that a bad thing to do. I'm not proposing it as the real fix, I just wanted to get around this problem in order to do more testing. Thanks. That's basically the right approach. Attached patch contains a cleaned up version of that. It does get rid of the "there is no contrecord flag" errors, but recover still does not work. Now the count of tuples in the table is always correct (I never provoke a crash during the initial table load), but sometimes updates to those tuples that were reported to have been committed are lost. This is more subtle, it does not happen on every crash. It seems that when recovery ends on "record with zero length at...", that recovery is correct. But when it ends on "invalid magic number in log file.." then the recovery is screwed up. Can you write a self-contained test case for that? I've been trying to reproduce that by running the regression tests and pgbench with a streaming replication standby, which should be pretty much the same as crash recovery. No luck this far. Probably I could reproduce the same problem as Jeff got. Here is the test case: $ initdb -D data $ pg_ctl -D data start $ psql -c "create table t (i int); insert into t values(generate_series(1,1)); delete from t" $ pg_ctl -D data stop -m i $ pg_ctl -D data start The crash recovery emitted the following server logs: LOG: database system was interrupted; last known up at 2012-02-14 02:07:01 JST LOG: database system was not properly shut down; automatic recovery in progress LOG: redo starts at 0/179CC90 LOG: invalid magic number in log file 0, segment 1, offset 8060928 LOG: redo done at 0/17AD858 LOG: database system is ready to accept connections LOG: autovacuum launcher started After recovery, I could not see the table "t" which I created before: $ psql -c "select count(*) from t" ERROR: relation "t" does not exist Are you still seeing this failure with the latest patch I posted (http://archives.postgresql.org/message-id/4f38f5e5.8050...@enterprisedb.com)? That includes Jeff's fix for the original crash you and Jeff saw. With that version, I can't get a crash anymore. I also can't reproduce the inconsistency that Jeff still saw with his fix (http://archives.postgresql.org/message-id/CAMkU=1zGWp2QnTjiyFe0VMu4gc+MoEexXYaVC2u=+orfiyj...@mail.gmail.com). Jeff, can you clarify if you're still seeing an issue with the latest version of the patch? If so, can you give a self-contained test case for that? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Mon, Feb 13, 2012 at 8:37 PM, Heikki Linnakangas wrote: > On 13.02.2012 01:04, Jeff Janes wrote: >> >> Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD. >> I have no idea if I did it correctly, in particular if calling >> GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side >> effects that make that a bad thing to do. I'm not proposing it as the >> real fix, I just wanted to get around this problem in order to do more >> testing. > > > Thanks. That's basically the right approach. Attached patch contains a > cleaned up version of that. > > >> It does get rid of the "there is no contrecord flag" errors, but >> recover still does not work. >> >> Now the count of tuples in the table is always correct (I never >> provoke a crash during the initial table load), but sometimes updates >> to those tuples that were reported to have been committed are lost. >> >> This is more subtle, it does not happen on every crash. >> >> It seems that when recovery ends on "record with zero length at...", >> that recovery is correct. >> >> But when it ends on "invalid magic number in log file.." then the >> recovery is screwed up. > > > Can you write a self-contained test case for that? I've been trying to > reproduce that by running the regression tests and pgbench with a streaming > replication standby, which should be pretty much the same as crash recovery. > No luck this far. Probably I could reproduce the same problem as Jeff got. Here is the test case: $ initdb -D data $ pg_ctl -D data start $ psql -c "create table t (i int); insert into t values(generate_series(1,1)); delete from t" $ pg_ctl -D data stop -m i $ pg_ctl -D data start The crash recovery emitted the following server logs: LOG: database system was interrupted; last known up at 2012-02-14 02:07:01 JST LOG: database system was not properly shut down; automatic recovery in progress LOG: redo starts at 0/179CC90 LOG: invalid magic number in log file 0, segment 1, offset 8060928 LOG: redo done at 0/17AD858 LOG: database system is ready to accept connections LOG: autovacuum launcher started After recovery, I could not see the table "t" which I created before: $ psql -c "select count(*) from t" ERROR: relation "t" does not exist Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Thu, Feb 9, 2012 at 3:02 AM, Fujii Masao wrote: > On Thu, Feb 9, 2012 at 7:25 PM, Fujii Masao wrote: >> On Thu, Feb 9, 2012 at 3:32 AM, Jeff Janes wrote: >>> >>> After applying this patch and then forcing crashes, upon recovery the >>> database is not correct. >>> >>> If I make a table with 10,000 rows and then after that intensively >>> update it using a unique key: >>> >>> update foo set count=count+1 where foobar=? >>> >>> Then after the crash there are less than 10,000 visible rows: >>> >>> select count(*) from foo >>> >>> This not a subtle thing, it happens every time. I get counts of >>> between 1973 and 8827. Without this patch I always get exactly >>> 10,000. >>> >>> I don't really know where to start on tracking this down. >> >> Similar problem happened on my test. When I executed CREATE TABLE and >> shut down the server with immediate mode, after recovery I could not see the >> created table. Here are the server log of recovery with wal_debug = on: >> >> LOG: database system was interrupted; last known up at 2012-02-09 19:18:50 >> JST >> LOG: database system was not properly shut down; automatic recovery in >> progress >> LOG: redo starts at 0/179CC90 >> LOG: REDO @ 0/179CC90; LSN 0/179CCB8: prev 0/179CC30; xid 0; len 4: >> XLOG - nextOid: 24576 >> LOG: REDO @ 0/179CCB8; LSN 0/179CCE8: prev 0/179CC90; xid 0; len 16: >> Storage - file create: base/12277/16384 >> LOG: REDO @ 0/179CCE8; LSN 0/179DDE0: prev 0/179CCB8; xid 998; len >> 21; bkpb1: Heap - insert: rel 1663/12277/12014; tid 7/22 >> LOG: there is no contrecord flag in log file 0, segment 1, offset 7987200 >> LOG: redo done at 0/179CCE8 >> >> According to the log "there is no contrecord flag", ISTM the path treats the >> contrecord of backup block incorrectly, and which causes the problem. > > Yep, as far as I read the patch, it seems to have forgotten to set > XLP_FIRST_IS_CONTRECORD flag. Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD. I have no idea if I did it correctly, in particular if calling GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side effects that make that a bad thing to do. I'm not proposing it as the real fix, I just wanted to get around this problem in order to do more testing. It does get rid of the "there is no contrecord flag" errors, but recover still does not work. Now the count of tuples in the table is always correct (I never provoke a crash during the initial table load), but sometimes updates to those tuples that were reported to have been committed are lost. This is more subtle, it does not happen on every crash. It seems that when recovery ends on "record with zero length at...", that recovery is correct. But when it ends on "invalid magic number in log file.." then the recovery is screwed up. Cheers, Jeff xloginsert_fix.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Thu, Feb 9, 2012 at 7:25 PM, Fujii Masao wrote: > On Thu, Feb 9, 2012 at 3:32 AM, Jeff Janes wrote: >> On Wed, Feb 1, 2012 at 11:46 PM, Heikki Linnakangas >> wrote: >>> On 31.01.2012 17:35, Fujii Masao wrote: On Fri, Jan 20, 2012 at 11:11 PM, Heikki Linnakangas wrote: > > On 20.01.2012 15:32, Robert Haas wrote: >> >> >> On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas >> wrote: >>> >>> >>> Here's another version of the patch to make XLogInsert less of a >>> bottleneck >>> on multi-CPU systems. The basic idea is the same as before, but several >>> bugs >>> have been fixed, and lots of misc. clean up has been done. >> >> >> >> This seems to need a rebase. > > > > Here you go. The patch seems to need a rebase again. >>> >>> >>> Here you go again. It conflicted with the group commit patch, and the patch >>> to WAL-log and track changes to full_page_writes setting. >> >> >> After applying this patch and then forcing crashes, upon recovery the >> database is not correct. >> >> If I make a table with 10,000 rows and then after that intensively >> update it using a unique key: >> >> update foo set count=count+1 where foobar=? >> >> Then after the crash there are less than 10,000 visible rows: >> >> select count(*) from foo >> >> This not a subtle thing, it happens every time. I get counts of >> between 1973 and 8827. Without this patch I always get exactly >> 10,000. >> >> I don't really know where to start on tracking this down. > > Similar problem happened on my test. When I executed CREATE TABLE and > shut down the server with immediate mode, after recovery I could not see the > created table. Here are the server log of recovery with wal_debug = on: > > LOG: database system was interrupted; last known up at 2012-02-09 19:18:50 > JST > LOG: database system was not properly shut down; automatic recovery in > progress > LOG: redo starts at 0/179CC90 > LOG: REDO @ 0/179CC90; LSN 0/179CCB8: prev 0/179CC30; xid 0; len 4: > XLOG - nextOid: 24576 > LOG: REDO @ 0/179CCB8; LSN 0/179CCE8: prev 0/179CC90; xid 0; len 16: > Storage - file create: base/12277/16384 > LOG: REDO @ 0/179CCE8; LSN 0/179DDE0: prev 0/179CCB8; xid 998; len > 21; bkpb1: Heap - insert: rel 1663/12277/12014; tid 7/22 > LOG: there is no contrecord flag in log file 0, segment 1, offset 7987200 > LOG: redo done at 0/179CCE8 > > According to the log "there is no contrecord flag", ISTM the path treats the > contrecord of backup block incorrectly, and which causes the problem. Yep, as far as I read the patch, it seems to have forgotten to set XLP_FIRST_IS_CONTRECORD flag. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Thu, Feb 9, 2012 at 3:32 AM, Jeff Janes wrote: > On Wed, Feb 1, 2012 at 11:46 PM, Heikki Linnakangas > wrote: >> On 31.01.2012 17:35, Fujii Masao wrote: >>> >>> On Fri, Jan 20, 2012 at 11:11 PM, Heikki Linnakangas >>> wrote: On 20.01.2012 15:32, Robert Haas wrote: > > > On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas > wrote: >> >> >> Here's another version of the patch to make XLogInsert less of a >> bottleneck >> on multi-CPU systems. The basic idea is the same as before, but several >> bugs >> have been fixed, and lots of misc. clean up has been done. > > > > This seems to need a rebase. Here you go. >>> >>> >>> The patch seems to need a rebase again. >> >> >> Here you go again. It conflicted with the group commit patch, and the patch >> to WAL-log and track changes to full_page_writes setting. > > > After applying this patch and then forcing crashes, upon recovery the > database is not correct. > > If I make a table with 10,000 rows and then after that intensively > update it using a unique key: > > update foo set count=count+1 where foobar=? > > Then after the crash there are less than 10,000 visible rows: > > select count(*) from foo > > This not a subtle thing, it happens every time. I get counts of > between 1973 and 8827. Without this patch I always get exactly > 10,000. > > I don't really know where to start on tracking this down. Similar problem happened on my test. When I executed CREATE TABLE and shut down the server with immediate mode, after recovery I could not see the created table. Here are the server log of recovery with wal_debug = on: LOG: database system was interrupted; last known up at 2012-02-09 19:18:50 JST LOG: database system was not properly shut down; automatic recovery in progress LOG: redo starts at 0/179CC90 LOG: REDO @ 0/179CC90; LSN 0/179CCB8: prev 0/179CC30; xid 0; len 4: XLOG - nextOid: 24576 LOG: REDO @ 0/179CCB8; LSN 0/179CCE8: prev 0/179CC90; xid 0; len 16: Storage - file create: base/12277/16384 LOG: REDO @ 0/179CCE8; LSN 0/179DDE0: prev 0/179CCB8; xid 998; len 21; bkpb1: Heap - insert: rel 1663/12277/12014; tid 7/22 LOG: there is no contrecord flag in log file 0, segment 1, offset 7987200 LOG: redo done at 0/179CCE8 According to the log "there is no contrecord flag", ISTM the path treats the contrecord of backup block incorrectly, and which causes the problem. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Wed, Feb 1, 2012 at 11:46 PM, Heikki Linnakangas wrote: > On 31.01.2012 17:35, Fujii Masao wrote: >> >> On Fri, Jan 20, 2012 at 11:11 PM, Heikki Linnakangas >> wrote: >>> >>> On 20.01.2012 15:32, Robert Haas wrote: On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas wrote: > > > Here's another version of the patch to make XLogInsert less of a > bottleneck > on multi-CPU systems. The basic idea is the same as before, but several > bugs > have been fixed, and lots of misc. clean up has been done. This seems to need a rebase. >>> >>> >>> >>> Here you go. >> >> >> The patch seems to need a rebase again. > > > Here you go again. It conflicted with the group commit patch, and the patch > to WAL-log and track changes to full_page_writes setting. After applying this patch and then forcing crashes, upon recovery the database is not correct. If I make a table with 10,000 rows and then after that intensively update it using a unique key: update foo set count=count+1 where foobar=? Then after the crash there are less than 10,000 visible rows: select count(*) from foo This not a subtle thing, it happens every time. I get counts of between 1973 and 8827. Without this patch I always get exactly 10,000. I don't really know where to start on tracking this down. Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Fri, Jan 20, 2012 at 11:11 PM, Heikki Linnakangas wrote: > On 20.01.2012 15:32, Robert Haas wrote: >> >> On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas >> wrote: >>> >>> Here's another version of the patch to make XLogInsert less of a >>> bottleneck >>> on multi-CPU systems. The basic idea is the same as before, but several >>> bugs >>> have been fixed, and lots of misc. clean up has been done. >> >> >> This seems to need a rebase. > > > Here you go. The patch seems to need a rebase again. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Fri, Jan 20, 2012 at 2:11 PM, Heikki Linnakangas wrote: > On 20.01.2012 15:32, Robert Haas wrote: >> >> On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas >> wrote: >>> >>> Here's another version of the patch to make XLogInsert less of a >>> bottleneck >>> on multi-CPU systems. The basic idea is the same as before, but several >>> bugs >>> have been fixed, and lots of misc. clean up has been done. >> >> >> This seems to need a rebase. > > > Here you go. I put myself down as reviewer for this. I'm planning to review this early next week, once I've finished Fujii-san's patches. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas wrote: > Here's another version of the patch to make XLogInsert less of a bottleneck > on multi-CPU systems. The basic idea is the same as before, but several bugs > have been fixed, and lots of misc. clean up has been done. This seems to need a rebase. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers