Re: [HACKERS] WAL format and API changes (9.5)
On 11/21/2014 05:58 AM, Amit Kapila wrote: On Thu, Nov 20, 2014 at 10:36 PM, Heikki Linnakangas < hlinnakan...@vmware.com> wrote: As you may have noticed, I committed this (after some more cleanup). Of course, feel free to still review it, and please point out any issues you may find. Few minor observations: 1. Readme void XLogResetInsertion(void) Clear any currently registered data and buffers from the WAL record construction workspace. This is only needed if you have already called XLogBeginInsert(), but decide to not insert the record after all. I think above sentence could be slightly rephrased as the this function is also getting called at end of XLogInsert(). Well, yeah, but that's internal to the xloginsert machinery, not relevant for someone learning about how to use it. 2. shiftList() { .. XLogEnsureRecordSpace(data.ndeleted + 1, 0); .. } Shouldn't above function call to XLogEnsureRecordSpace() be done under if (RelationNeedsWAL(rel))? True, fixed. (It doesn't hurt to call it, but it's clearly unnecessary) 3. XLogInsert(RmgrId rmid, uint8 info) { XLogRecPtr EndPos; /* XLogBeginInsert() must have been called. */ if (!begininsert_called) elog(ERROR, "XLogBeginInsert was not called"); As we are in critical section at this moment, so is it okay to have elog(ERROR,). I think this can only happen due to some coding mistake, but still not sure if elog(ERROR) is okay. In a critical section, the ERROR will be promoted automatically to a PANIC. There isn't much else we can do if that happens; it is a coding mistake as you say. BTW, not all XLogInsert() calls are inside a critical section. All the ones that modify data pages are, but there are others. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 11/21/2014 09:19 AM, Michael Paquier wrote: On Fri, Nov 21, 2014 at 2:06 AM, Heikki Linnakangas wrote: As you may have noticed, I committed this (after some more cleanup). Of course, feel free to still review it, and please point out any issues you may find. This comment on top of XLogRecordAssemble is not adapted as page_writes_omitted does not exist. Perhaps this is a remnant of some older version of the patch? * If there are any registered buffers, and a full-page image was not taken * of all them, *page_writes_omitted is set to true. This signals that the * assembled record is only good for insertion on the assumption that the * RedoRecPtr and doPageWrites values were up-to-date. */ Yep, fixed thanks. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Fri, Nov 21, 2014 at 2:06 AM, Heikki Linnakangas wrote: > As you may have noticed, I committed this (after some more cleanup). Of > course, feel free to still review it, and please point out any issues you > may find. > This comment on top of XLogRecordAssemble is not adapted as page_writes_omitted does not exist. Perhaps this is a remnant of some older version of the patch? * If there are any registered buffers, and a full-page image was not taken * of all them, *page_writes_omitted is set to true. This signals that the * assembled record is only good for insertion on the assumption that the * RedoRecPtr and doPageWrites values were up-to-date. */ -- Michael
Re: [HACKERS] WAL format and API changes (9.5)
On Thu, Nov 20, 2014 at 10:36 PM, Heikki Linnakangas < hlinnakan...@vmware.com> wrote: > > As you may have noticed, I committed this (after some more cleanup). Of course, feel free to still review it, and please point out any issues you may find. > Few minor observations: 1. Readme void XLogResetInsertion(void) Clear any currently registered data and buffers from the WAL record construction workspace. This is only needed if you have already called XLogBeginInsert(), but decide to not insert the record after all. I think above sentence could be slightly rephrased as the this function is also getting called at end of XLogInsert(). 2. shiftList() { .. XLogEnsureRecordSpace(data.ndeleted + 1, 0); .. } Shouldn't above function call to XLogEnsureRecordSpace() be done under if (RelationNeedsWAL(rel))? 3. XLogInsert(RmgrId rmid, uint8 info) { XLogRecPtr EndPos; /* XLogBeginInsert() must have been called. */ if (!begininsert_called) elog(ERROR, "XLogBeginInsert was not called"); As we are in critical section at this moment, so is it okay to have elog(ERROR,). I think this can only happen due to some coding mistake, but still not sure if elog(ERROR) is okay. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WAL format and API changes (9.5)
As you may have noticed, I committed this (after some more cleanup). Of course, feel free to still review it, and please point out any issues you may find. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 11/18/2014 10:28 AM, Michael Paquier wrote: Btw, did you do a run with the buffer capture facility and checked for page differences? Yes. That's how I bumped into the bug fixed in c73669c0. That tool has been tremendously helpful. Except that, after going through the code once again, ISTM that the patch is in a nice state. It may be better to wait for some input from Andres, he may catch some issues I haven't spotted. Yeah, I'm sure there are tons of small things. I'll have to take a small break and read through the patch myself after a few days. But I'd like to get this committed pretty soon. I believe everyone agrees with the big picture, even if there's some little tweaking left to do. It's a pain to maintain as a patch, and I believe the FPW compression patch is waiting for this to land. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Tue, Nov 18, 2014 at 4:31 AM, Heikki Linnakangas wrote: > WAL insertion performance > = > To measure the performance of generating WAL, I ran the > wal-update-testsuite.sh that Amit also ran earlier. The cluster was > initialized with: > > shared_buffers=512MB > checkpoint_segments=30 > fsync=off > autovacuum=off > full_page_writes=off > > [results] > Summary: No meaningful difference in runtime. If I am seeing that correctly, WAL generated is reduced for all the tests, except for the case of "hundred tiny fields" where more WAL is generated. Now the duration time seems to be generally reduced, some noise (?) making it sometimes higher. > WAL replay performance > == > > To test WAL replay performance, I ran pgbench with WAL archiving enabled, > and timed the replay of the generated WAL. I used the attached script, > replay-perf-test.sh for that. full_page_writes were disabled, because > replaying full page images is quite different from replaying other records. > (Performance of full-page images is interesting too, but it's not expected > that these WAL format changes make much difference to that). > > In summary, there is no significant difference in replay performance. The > amount of WAL generated is much smaller with the patch. > This concludes my performance testing, until someone wants to see some other > scenario being tested. I'm happy with the results. I think you can, that's a great study, and this proves to be a gain on many fields. If this goes in, it is going to be one of the largest patches committed ever. $ git diff --stat | tail -n1 91 files changed, 3895 insertions(+), 4305 deletions(-) There are still some XXX blocks here and there in the code.. But nothing really huge, like here: -* checks could go wrong too. +* checks could go wrong too. XXX does this comment still make sense? */ - Assert(xldata->blkno != xldata->blknoNew); + Assert(blkno != blknoNew); Btw, did you do a run with the buffer capture facility and checked for page differences? Except that, after going through the code once again, ISTM that the patch is in a nice state. It may be better to wait for some input from Andres, he may catch some issues I haven't spotted. Regards, -- Michael
Re: [HACKERS] WAL format and API changes (9.5)
On 11/17/2014 03:22 PM, Heikki Linnakangas wrote: Here is an updated version of the patch. It's rebased over current master, and fixes a bunch of issues you and Michael pointed out, and a ton of other cleanup. That patch had two extra, unrelated patches applied to it. One to use the Intel CRC instruction for the CRC calculation, and another to speed up PageRepairFragmentation, by replacing the pq_qsort() call with a custom sort algorithm. Sorry about that. (the latter is something that showed up very high in a perf profile of replaying the WAL generated by pgbench - I'll start a new thread about that) Here is the same patch without those extra things mixed in. - Heikki wal-format-and-api-changes-11.patch.gz Description: application/gzip -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 11/14/2014 10:31 AM, Michael Paquier wrote: 5) Here why not using the 2nd block instead of the 3rd (@_bt_getroot)? + XLogBeginInsert(); + XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT); + XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT); See the comment of the xl_btree_newroot struct. It explains the record format of the BTREE_NEWROOT record type: * Backup Blk 0: new root page (2 tuples as payload, if splitting old root) * Backup Blk 1: left child (if splitting an old root) * Backup Blk 2: metapage When _bt_getroot creates a new root, there is no old root, but the same record type is used in _bt_newroot, which uses block ID 1 to refer to the old root page. (Thanks for the comments, again! I'll post a new version soon) - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Fri, Nov 14, 2014 at 5:31 PM, Michael Paquier wrote: > 3) pg_xlogdump does not seem to work: > $ pg_xlogdump 0001000D > pg_xlogdump: FATAL: could not find a valid record after 0/D00 This one is a bad manipulation from my side. Please forget this comment. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Thu, Nov 13, 2014 at 10:33 PM, Heikki Linnakangas wrote: > In quick testing, this new WAL format is somewhat more compact than the 9.4 > format. That also seems to have more than bought back the performance > regression I saw earlier. Here are results from my laptop, using the > wal-update-testsuite.sh script: > master: > [results] > (27 rows) > wal-format-and-api-changes-9.patch: > [results] > (27 rows) So based on your series of tests, you are saving 6% up to 10%. That's... Really cool! > Aside from the WAL record format changes, this patch adds the "decoded WAL > record" infrastructure that we talked about with Andres. XLogReader now has > a new function, DecodeXLogRecord, which parses the block headers etc. from > the WAL record, and copies the data chunks to aligned buffers. The redo > routines are passed a pointer to the XLogReaderState, instead of the plain > XLogRecord, and the redo routines can use macros and functions defined > xlogreader.h to access the already-decoded WAL record. The new WAL record > format is difficult to parse in a piece-meal fashion, so it really needs > this separate decoding pass to be efficient. > > Thoughts on this new WAL record format? I've attached the xlogrecord.h file > here separately for easy reading, if you want to take a quick look at just > that without applying the whole patch. The new format is neat, and that's a lot of code.. Grouping together the blocks of data is a good thing for the FPW compression patch as well. Note that this patch conflicts with the recent commits 81c4508 and 34402ae. installcheck-world is passing, and standbys are able to replay records, at least without crashes. Here are some more comments: 1) with assertions enabled this does not compile because of a small typo in xlogreader.h here: +#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block id >= 0) 2) In xlogreader.c, XLogRecGetBlockData should return char * but a boolean is returned here: +XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len) +{ + DecodedBkpBlock *bkpb; + + if (!record->blocks[block_id].in_use) + return false; As no blocks are in use in this case this should be NULL. 3) pg_xlogdump does not seem to work: $ pg_xlogdump 0001000D pg_xlogdump: FATAL: could not find a valid record after 0/D00 4) A couple of NOT_USED blocks could be removed, no? +#ifdef NOT_USED BlockNumber leftChildBlkno = InvalidBlockNumber; +#endif 5) Here why not using the 2nd block instead of the 3rd (@_bt_getroot)? + XLogBeginInsert(); + XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT); + XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT); 6) There are FIXME blocks: + // FIXME + //if ((bkpb->fork_flags & BKPBLOCK_WILL_INIT) != 0 && mode != RBM_ZERO) + // elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");] And that: + /* FIXME: translation? Although this shouldn't happen.. */ + ereport(ERROR, + (errmsg("error decoding WAL record"), +errdetail("%s", errormsg))); Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Thu, Nov 13, 2014 at 7:03 PM, Heikki Linnakangas wrote: > > On 11/11/2014 04:42 PM, Amit Kapila wrote: >> >> I have done some performance testing of this patch using attached >> script and data is as below: >> >> ... >> >> It seems to me that there is a regression of (4 ~ 8%) for small records, >> refer two short fields tests. > > > Thanks for the testing! > > > Thoughts on this new WAL record format? I've attached the xlogrecord.h file here separately for easy reading, if you want to take a quick look at just that without applying the whole patch. > Apart from changes in XLogRecord to remove xl_len and padding, the new format for block headers seems to be quite succinct and then by making data length as variable for actual record the overall WAL record size becomes smaller. However the increase in unaligned size (as mentined by you up-thread as 2 bytes) seems to be still remain same. Is the difference in size is due to additional block reference id this patch requires or something else? In new record format the Record header always start at aligned boundary, however the actual record data doesn't necessarily be at aligned boundary, can this make difference as currently we copy both of these separately? Overall, I think this format is a net improvement. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-11-13 15:33:44 +0200, Heikki Linnakangas wrote: > Here's a new version, with big changes again to the record format. Have a > look at xlogrecord.h for the details, but in a nutshell: > > 1. The overall format is now: XLogRecord, per-block headers, header for main > data portion, per-block data, main data. > > 2. I removed xl_len field from XLogRecord and rearranged the fields, to > shrink the XLogRecord struct from 32 to 24 bytes. (instead, there's a new 2- > or 5-byte header for the "main data", after the block headers). > > 3. No alignment padding. (the data chunks are copied to aligned buffers at > replay, so redo functions can still assume aligned access) Whoa. That's a large amount of changes... Not bad at all. > In quick testing, this new WAL format is somewhat more compact than the 9.4 > format. Well, that's mixing different kinds of changes together to some degree. Most of this could have been done independently as well. I'm not generally objecting to doing that, but the space/performance comparison isn't entirely fair. One could (I'm not!) argue that we should just do a refactoring like you suggest above, without the additional block information. > That also seems to have more than bought back the performance > regression I saw earlier. I think we really need to do the performance test with a different CRC implementation. So far all the tests in this thread seem to be correlating pretty linearly to the size of the record. > Here are results from my laptop, using the wal-update-testsuite.sh script: Would be nice if that'd also compute the differences in wal_generated and duration between two runs... :) > Aside from the WAL record format changes, this patch adds the "decoded WAL > record" infrastructure that we talked about with Andres. Ah, cool. > XLogReader now has > a new function, DecodeXLogRecord, which parses the block headers etc. from > the WAL record, and copies the data chunks to aligned buffers. The redo > routines are passed a pointer to the XLogReaderState, instead of the plain > XLogRecord, and the redo routines can use macros and functions defined > xlogreader.h to access the already-decoded WAL record. The new WAL record > format is difficult to parse in a piece-meal fashion, so it really needs > this separate decoding pass to be efficient. Hm. I'm not sure if there's really a point in exposing DecodeXLogRecord() outside of xlogreader.c. In which case would a xlogreader user not want to decode the record? I.e. couldn't we just always do that and not bother exposing the function? Sure, you could skip decoding if the record is of a "uninteresting" rmgr, but since all validation but CRC now happens only during decoding I'm not sure that's a good idea. You e.g. have a block + if (!DecodeXLogRecord(xlogreader_state, &errormsg)) + { + fprintf(stderr, "error in WAL record at %X/%X: %s\n", + (uint32) (xlogreader_state->ReadRecPtr >> 32), + (uint32) xlogreader_state->ReadRecPtr, + errormsg); + /* +* Parsing the record failed, but it passed CRC check, so in +* theory we can continue reading. +*/ + continue; + } + I think that's quite the bad idea. The CRC is only a part of the error detection, not its entirety. > Thoughts on this new WAL record format? I've attached the xlogrecord.h file > here separately for easy reading, if you want to take a quick look at just > that without applying the whole patch. After a quick skim I like it in general. There's some details I'd rather change, but I think it's quite the step forward. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Tue, Nov 11, 2014 at 2:15 PM, Heikki Linnakangas wrote:z > > On 11/11/2014 09:39 AM, Michael Paquier wrote: >> >> On Tue, Nov 11, 2014 at 4:29 AM, Heikki Linnakangas < hlinnakan...@vmware.com >>> >>> wrote: >> >> >>> Attached is a new version. It's rebased on current git master, including >>> BRIN. I've also fixed the laundry list of small things you reported, as >>> well as a bunch of bugs I uncovered during my own testing. >> >> >> This patch needs a small rebase, it has been broken by a590f266 that fixed >> WAL replay for brin indexes: >> patching file src/backend/access/brin/brin_xlog.c >> Hunk #2 FAILED at 42. >> Hunk #3 FAILED at 91. >> This will facilitate testing as well. > > > Here's a rebased patch. No other changes. > I have done some performance testing of this patch using attached script and data is as below: Performance Data IBM POWER-8 24 cores, 192 hardware threads RAM = 492GB checkpoint_segments=300 checkpoint_timeout=30min full_page_writes = off HEAD (commit - ae667f7) - testname | wal_generated | duration -+---+-- two short fields, no change | 398510104 | 11.993057012558 two short fields, no change | 396984168 | 10.3508098125458 two short fields, no change | 396983624 | 10.4196150302887 two short fields, one changed | 437106016 | 10.650365114212 two short fields, one changed | 437102456 | 10.756795167923 two short fields, one changed | 437103000 | 10.6824090480804 two short fields, both changed | 438131520 | 11.1027519702911 two short fields, both changed | 437112248 | 11.0681071281433 two short fields, both changed | 437107168 | 11.0370399951935 one short and one long field, no change | 76044176 | 2.90529608726501 one short and one long field, no change | 76047432 | 2.86844515800476 one short and one long field, no change | 76042664 | 2.84641098976135 ten tiny fields, all changed| 478210904 | 13.1221458911896 ten tiny fields, all changed| 47728 | 12.8017349243164 ten tiny fields, all changed| 477217832 | 12.9907081127167 hundred tiny fields, all changed| 180353400 | 6.03354096412659 hundred tiny fields, all changed| 180349536 | 5.99893379211426 hundred tiny fields, all changed| 180350064 | 6.00634717941284 hundred tiny fields, half changed | 180348480 | 6.03162407875061 hundred tiny fields, half changed | 180348440 | 6.08004903793335 hundred tiny fields, half changed | 180349056 | 6.03126788139343 hundred tiny fields, half nulled| 101369936 | 5.43424606323242 hundred tiny fields, half nulled| 101660160 | 5.06207084655762 hundred tiny fields, half nulled| 100119184 | 5.51814889907837 9 short and 1 long, short changed | 108142168 | 3.26260495185852 9 short and 1 long, short changed | 108138144 | 3.20250797271729 9 short and 1 long, short changed | 109761240 | 3.21728992462158 (27 rows) Patch- testname | wal_generated | duration -+---+-- two short fields, no change | 398692848 | 11.2538840770721 two short fields, no change | 396988648 | 11.4972231388092 two short fields, no change | 396988416 | 11.0026741027832 two short fields, one changed | 437104840 | 11.1911199092865 two short fields, one changed | 437103832 | 11.1138079166412 two short fields, one changed | 437101872 | 11.3141660690308 two short fields, both changed | 437133648 | 11.5836050510406 two short fields, both changed | 437106640 | 11.5675358772278 two short fields, both changed | 437104944 | 11.6147150993347 one short and one long field, no change | 77226464 | 3.01720094680786 one short and one long field, no change | 77645248 | 2.9660210609436 one short and one long field, no change | 76045256 | 3.03326010704041 ten tiny fields, all changed| 477396616 | 13.3963651657104 ten tiny fields, all changed| 477219840 | 13.3838939666748 ten tiny fields, all changed| 477223960 | 13.5736708641052 hundred tiny fields, all changed| 180742136 | 5.87386584281921 hundred tiny fields, all changed| 180398320 | 6.12072706222534 hundred tiny fields, all changed| 180354080 | 6.16246104240417 hundred tiny fields, half changed | 180351456 | 6.17317008972168 hundred tiny fields, half changed | 180348096 | 6.13790798187256 hundred tiny fields, half changed | 183369688 | 6.19
Re: [HACKERS] WAL format and API changes (9.5)
On 11/11/2014 09:39 AM, Michael Paquier wrote: On Tue, Nov 11, 2014 at 4:29 AM, Heikki Linnakangas wrote: Attached is a new version. It's rebased on current git master, including BRIN. I've also fixed the laundry list of small things you reported, as well as a bunch of bugs I uncovered during my own testing. This patch needs a small rebase, it has been broken by a590f266 that fixed WAL replay for brin indexes: patching file src/backend/access/brin/brin_xlog.c Hunk #2 FAILED at 42. Hunk #3 FAILED at 91. This will facilitate testing as well. Here's a rebased patch. No other changes. - Heikki wal-format-and-api-changes-8.patch.gz Description: application/gzip -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Tue, Nov 11, 2014 at 4:29 AM, Heikki Linnakangas wrote: > Attached is a new version. It's rebased on current git master, including > BRIN. I've also fixed the laundry list of small things you reported, as > well as a bunch of bugs I uncovered during my own testing. This patch needs a small rebase, it has been broken by a590f266 that fixed WAL replay for brin indexes: patching file src/backend/access/brin/brin_xlog.c Hunk #2 FAILED at 42. Hunk #3 FAILED at 91. This will facilitate testing as well. Regards -- Michael
Re: [HACKERS] WAL format and API changes (9.5)
Attached is a new version. It's rebased on current git master, including BRIN. I've also fixed the laundry list of small things you reported, as well as a bunch of bugs I uncovered during my own testing. Alvaro: you still have the BRIN WAL-logging code fresh in your memory, so could you take a look at that part of this patch to check that I didn't break it? And also, how do you like the new way of writing that, over what you had to in HEAD ? More below. On 11/09/2014 11:47 PM, Andres Freund wrote: On 2014-11-06 17:32:33 +0200, Heikki Linnakangas wrote: Replying to some of your comments below. The rest were trivial issues that I'll just fix. On 10/30/2014 09:19 PM, Andres Freund wrote: * Is it really a good idea to separate XLogRegisterBufData() from XLogRegisterBuffer() the way you've done it ? If we ever actually get a record with a large numbers of blocks touched this issentially is O(touched_buffers*data_entries). Are you worried that the linear search in XLogRegisterBufData(), to find the right registered_buffer struct, might be expensive? If that ever becomes a problem, a simple fix would be to start the linear search from the end, and make sure that when you touch a large number of blocks, you do all the XLogRegisterBufData() calls right after the corresponding XLogRegisterBuffer() call. Yes, that was what I was (mildly) worried about. Since you specified a high limit of buffers I'm sure someone will come up with a use case for it ;) I've also though about having XLogRegisterBuffer() return the pointer to the struct, and passing it as argument to XLogRegisterBufData. So the pattern in WAL generating code would be like: registered_buffer *buf0; buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD); XLogRegisterBufData(buf0, data, length); registered_buffer would be opaque to the callers. That would have potential to turn XLogRegisterBufData into a macro or inline function, to eliminate the function call overhead. I played with that a little bit, but the difference in performance was so small that it didn't seem worth it. But passing the registered_buffer pointer, like above, might not be a bad thing anyway. Yes, that was roughly what I was thinking of as well. It's not all that pretty, but it generally does seem like a good idea to me anyay. I ended up doing something different: The block_id is now used as the index into the registered_buffers array. So looking up the right struct is now just ®istered_buffers[block_id]. That obviously gets rid of the linear search. It doesn't seem to make any difference in the quick performance testing I've been doing. * There's lots of functions like XLogRecHasBlockRef() that dig through the wal record. A common pattern is something like: if (XLogRecHasBlockRef(record, 1)) XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk); else oldblk = newblk; I think doing that repeatedly is quite a bad idea. We should parse the record once and then use it in a sensible format. Not do it in pieces, over and over again. It's not like we ignore backup blocks - so doing this lazily doesn't make sense to me. Especially as ValidXLogRecord() *already* has parsed the whole damn thing once. Hmm. Adding some kind of a parsed XLogRecord representation would need a fair amount of new infrastructure. True. Vast majority of WAL records contain one, maybe two, block references, so it's not that expensive to find the right one, even if you do it several times. I'm not convinced. It's not an infrequent thing these days to hear people being bottlenecked by replay. And grovelling repeatedly through larger records isn't *that* cheap. I haven't done anything about this. We might want to do that, but I'd like to get this committed now, with all the changes to the AM's, and then spend some time trying to further optimize the WAL format. I might add the "deparsed WAL record" infrastructure as part of that optimization work. I ran a quick performance test on WAL replay performance yesterday. I ran pgbench for 100 transactions with WAL archiving enabled, and measured the time it took to replay the generated WAL. I did that with and without the patch, and I didn't see any big difference in replay times. I also ran "perf" on the startup process, and the profiles looked identical. I'll do more comprehensive testing later, with different index types, but I'm convinced that there is no performance issue in replay that we'd need to worry about. Interesting. What checkpoint_segments/timeout and what scale did you use? Since that heavily influences the average size of the record that's quite relevant... Scale factor 5, checkpoint_segments=10, checkpoint_timeout=30s. pg_xlogdump --stats says: Type N (%) Record size (%) FPI size (%)Combined size (%) - --- ---
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-11-06 17:32:33 +0200, Heikki Linnakangas wrote: > Replying to some of your comments below. The rest were trivial issues that > I'll just fix. > > On 10/30/2014 09:19 PM, Andres Freund wrote: > >* Is it really a good idea to separate XLogRegisterBufData() from > > XLogRegisterBuffer() the way you've done it ? If we ever actually get > > a record with a large numbers of blocks touched this issentially is > > O(touched_buffers*data_entries). > > Are you worried that the linear search in XLogRegisterBufData(), to find the > right registered_buffer struct, might be expensive? If that ever becomes a > problem, a simple fix would be to start the linear search from the end, and > make sure that when you touch a large number of blocks, you do all the > XLogRegisterBufData() calls right after the corresponding > XLogRegisterBuffer() call. Yes, that was what I was (mildly) worried about. Since you specified a high limit of buffers I'm sure someone will come up with a use case for it ;) > I've also though about having XLogRegisterBuffer() return the pointer to the > struct, and passing it as argument to XLogRegisterBufData. So the pattern in > WAL generating code would be like: > > registered_buffer *buf0; > > buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD); > XLogRegisterBufData(buf0, data, length); > > registered_buffer would be opaque to the callers. That would have potential > to turn XLogRegisterBufData into a macro or inline function, to eliminate > the function call overhead. I played with that a little bit, but the > difference in performance was so small that it didn't seem worth it. But > passing the registered_buffer pointer, like above, might not be a bad thing > anyway. Yes, that was roughly what I was thinking of as well. It's not all that pretty, but it generally does seem like a good idea to me anyay. > >* There's lots of functions like XLogRecHasBlockRef() that dig through > > the wal record. A common pattern is something like: > >if (XLogRecHasBlockRef(record, 1)) > > XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk); > >else > > oldblk = newblk; > > > > I think doing that repeatedly is quite a bad idea. We should parse the > > record once and then use it in a sensible format. Not do it in pieces, > > over and over again. It's not like we ignore backup blocks - so doing > > this lazily doesn't make sense to me. > > Especially as ValidXLogRecord() *already* has parsed the whole damn > > thing once. > > Hmm. Adding some kind of a parsed XLogRecord representation would need a > fair amount of new infrastructure. True. > Vast majority of WAL records contain one, > maybe two, block references, so it's not that expensive to find the right > one, even if you do it several times. I'm not convinced. It's not an infrequent thing these days to hear people being bottlenecked by replay. And grovelling repeatedly through larger records isn't *that* cheap. > I ran a quick performance test on WAL replay performance yesterday. I ran > pgbench for 100 transactions with WAL archiving enabled, and measured > the time it took to replay the generated WAL. I did that with and without > the patch, and I didn't see any big difference in replay times. I also ran > "perf" on the startup process, and the profiles looked identical. I'll do > more comprehensive testing later, with different index types, but I'm > convinced that there is no performance issue in replay that we'd need to > worry about. Interesting. What checkpoint_segments/timeout and what scale did you use? Since that heavily influences the average size of the record that's quite relevant... > If it mattered, a simple optimization to the above pattern would be to have > XLogRecGetBlockTag return true/false, indicating if the block reference > existed at all. So you'd do: > > if (!XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk)) > oldblk != newblk; That sounds like a good idea anyway? > On the other hand, decomposing the WAL record into parts, and passing the > decomposed representation to the redo routines would allow us to pack the > WAL record format more tightly, as accessing the different parts of the > on-disk format wouldn't then need to be particularly fast. For example, I've > been thinking that it would be nice to get rid of the alignment padding in > XLogRecord, and between the per-buffer data portions. We could copy the data > to aligned addresses as part of the decomposition or parsing of the WAL > record, so that the redo routines could still assume aligned access. Right. I think it'd generally give us a bit more flexibility. > >* I wonder if it wouldn't be worthwile, for the benefit of the FPI > > compression patch, to keep the bkp block data after *all* the > > "headers". That'd make it easier to just compress the data. > > Maybe. If we do that, I'd also be inclined to move all the bkp block headers > to the beginning of the WAL record, just after the XLogInsert struct. > Someho
Re: [HACKERS] WAL format and API changes (9.5)
Replying to some of your comments below. The rest were trivial issues that I'll just fix. On 10/30/2014 09:19 PM, Andres Freund wrote: * Is it really a good idea to separate XLogRegisterBufData() from XLogRegisterBuffer() the way you've done it ? If we ever actually get a record with a large numbers of blocks touched this issentially is O(touched_buffers*data_entries). Are you worried that the linear search in XLogRegisterBufData(), to find the right registered_buffer struct, might be expensive? If that ever becomes a problem, a simple fix would be to start the linear search from the end, and make sure that when you touch a large number of blocks, you do all the XLogRegisterBufData() calls right after the corresponding XLogRegisterBuffer() call. I've also though about having XLogRegisterBuffer() return the pointer to the struct, and passing it as argument to XLogRegisterBufData. So the pattern in WAL generating code would be like: registered_buffer *buf0; buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD); XLogRegisterBufData(buf0, data, length); registered_buffer would be opaque to the callers. That would have potential to turn XLogRegisterBufData into a macro or inline function, to eliminate the function call overhead. I played with that a little bit, but the difference in performance was so small that it didn't seem worth it. But passing the registered_buffer pointer, like above, might not be a bad thing anyway. * There's lots of functions like XLogRecHasBlockRef() that dig through the wal record. A common pattern is something like: if (XLogRecHasBlockRef(record, 1)) XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk); else oldblk = newblk; I think doing that repeatedly is quite a bad idea. We should parse the record once and then use it in a sensible format. Not do it in pieces, over and over again. It's not like we ignore backup blocks - so doing this lazily doesn't make sense to me. Especially as ValidXLogRecord() *already* has parsed the whole damn thing once. Hmm. Adding some kind of a parsed XLogRecord representation would need a fair amount of new infrastructure. Vast majority of WAL records contain one, maybe two, block references, so it's not that expensive to find the right one, even if you do it several times. I ran a quick performance test on WAL replay performance yesterday. I ran pgbench for 100 transactions with WAL archiving enabled, and measured the time it took to replay the generated WAL. I did that with and without the patch, and I didn't see any big difference in replay times. I also ran "perf" on the startup process, and the profiles looked identical. I'll do more comprehensive testing later, with different index types, but I'm convinced that there is no performance issue in replay that we'd need to worry about. If it mattered, a simple optimization to the above pattern would be to have XLogRecGetBlockTag return true/false, indicating if the block reference existed at all. So you'd do: if (!XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk)) oldblk != newblk; On the other hand, decomposing the WAL record into parts, and passing the decomposed representation to the redo routines would allow us to pack the WAL record format more tightly, as accessing the different parts of the on-disk format wouldn't then need to be particularly fast. For example, I've been thinking that it would be nice to get rid of the alignment padding in XLogRecord, and between the per-buffer data portions. We could copy the data to aligned addresses as part of the decomposition or parsing of the WAL record, so that the redo routines could still assume aligned access. * I wonder if it wouldn't be worthwile, for the benefit of the FPI compression patch, to keep the bkp block data after *all* the "headers". That'd make it easier to just compress the data. Maybe. If we do that, I'd also be inclined to move all the bkp block headers to the beginning of the WAL record, just after the XLogInsert struct. Somehow it feels weird to have a bunch of header structs sandwiched between the rmgr-data and per-buffer data. Also, 4-byte alignment is enough for the XLogRecordBlockData struct, so that would be an easy way to make use of the space currently wasted for alignment padding in XLogRecord. * I think heap_xlog_update is buggy for wal_level=logical because it computes the length of the tuple using tuplen = recdataend - recdata; But the old primary key/old tuple value might be stored there as well. Afaics that code has to continue to use xl_heap_header_len. No, the old primary key or tuple is stored in the main data portion. That tuplen computation is done on backup block 0's data. * It looks to me like insert wal logging could just use REGBUF_KEEP_DATA to get rid of: + /* +* The new tuple is normally stored as buffer 0's data. But if +* XLOG_HEAP_CONTAINS_NEW_TUPL
Re: [HACKERS] WAL format and API changes (9.5)
On 11/05/2014 11:30 AM, Andres Freund wrote: On 2014-11-04 18:33:34 +0200, Heikki Linnakangas wrote: On 10/30/2014 06:02 PM, Andres Freund wrote: On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote: On 10/06/2014 02:29 PM, Andres Freund wrote: I've not yet really looked, but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't make me happy... Ok.. Can you elaborate? To me the split between xloginsert.c doing some of the record assembly, and xlog.c doing the lower level part of the assembly is just wrong. I moved the checks for bootstrap mode and xl_len == 0, from the current XLogInsert to the new XLogInsert() function. I don't imagine that to be enough address your feelings about the split between xlog.c and xloginsert.c, but makes it a tiny bit clearer IMHO. I don't know what else to do about it, as it feels very sensible to me as it is now. So unless you object more loudly, I'm going to just go ahead with this and commit, probably tomorrow. I'm still not particularly happy with the split. But I think this patch is enough of a win anyhow. So go ahead. Committed. Now, to the main patch... - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 11/06/2014 07:57 AM, Amit Kapila wrote: On Wed, Nov 5, 2014 at 2:56 PM, Heikki Linnakangas wrote: On 11/05/2014 09:06 AM, Amit Kapila wrote: 2. XLogRecPtr XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn) So the scenario is that: * XLogRecordAssemble decides that a page doesn't need to be backed up * both RedoRecPtr and doPageWrites change while building the record. doPageWrites goes from true to false. Without the patch, we would retry, because first check RedoRecPtr has changed. With the patch, we notice that even though RedoRecPtr has changed, doPageWrites is now off, so no FPWs are required regardless of RedoRecPtr, and not retry. Yeah, you're right, the behavior changes in that case. However, the new behavior is correct; the retry is unnecessary in that scenario. How does it interact with backup, basically in stop backup we first change forcePageWrite to false and then get the stop wal location by inserting XLOG_BACKUP_END, so it seems to me that it is quite possible that the record which backend is inserting using XLogInsert() will be considered in backup. Now shouldn't this record contain FPW if forcePageWrite was true when XLogInsert() started to avoid any torn page taken in backup? It doesn't matter what doPageWrites and RedoRecPtr were when XLogInsert started. It's the state when the insertion actually happens that matters. It's a bit weird that pg_stop_backup() first turns off forcePagesWrites, and only then writes the WAL record. But it's OK because when pg_stop_backup() is called, the backup process has already finished copying all the blocks. Any pages that are written to disk between turning forcePageWrites off and writing the WAL record cannot be torn. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Wed, Nov 5, 2014 at 2:56 PM, Heikki Linnakangas wrote: > On 11/05/2014 09:06 AM, Amit Kapila wrote: >> >> 2. >> XLogRecPtr >> XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn) > > So the scenario is that: > > * XLogRecordAssemble decides that a page doesn't need to be backed up > * both RedoRecPtr and doPageWrites change while building the record. doPageWrites goes from true to false. > > Without the patch, we would retry, because first check RedoRecPtr has changed. With the patch, we notice that even though RedoRecPtr has changed, doPageWrites is now off, so no FPWs are required regardless of RedoRecPtr, and not retry. Yeah, you're right, the behavior changes in that case. However, the new behavior is correct; the retry is unnecessary in that scenario. > How does it interact with backup, basically in stop backup we first change forcePageWrite to false and then get the stop wal location by inserting XLOG_BACKUP_END, so it seems to me that it is quite possible that the record which backend is inserting using XLogInsert() will be considered in backup. Now shouldn't this record contain FPW if forcePageWrite was true when XLogInsert() started to avoid any torn page taken in backup? >> 3. There are couple of places where *XLogInsert* is used in wal.sgml >> and it seems to me some of those needs change w.r.t this patch. > > > Hmm. It's a bit strange to mention XLogInsert and XLogFlush functions by name in that text. It's otherwise admin-level stuff, on how to set WAL-related settings. Up to 9.2 that manual page actually called them "LogInsert" and "LogFlush". But I guess you're right; if we are to mention the function by name, it should say XLogInsertRecord. > Agreed. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-11-05 23:08:31 +0200, Heikki Linnakangas wrote: > On 10/30/2014 09:19 PM, Andres Freund wrote: > >Some things I noticed while reading the patch: > > A lot of good comments, but let me pick up just two that are related: > > >* There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only > > refer to the relation, but not to the block number. These still log > > their rnode manually. Shouldn't we somehow deal with those in a > > similar way explicit block references are dealt with? > > > >* Hm. At least WriteMZeroPageXlogRec (and probably the same for all the > > other slru stuff) doesn't include a reference to the page. Isn't that > > bad? Shouldn't we make XLogRegisterBlock() usable for that case? > > Otherwise I fail to see how pg_rewind like tools can sanely deal with > > this? > > Yeah, there are still operations that modify relation pages, but don't store > the information about the modified pages in the standard format. That > includes XLOG_SMGR_TRUNCATE that you spotted, and XLOG_SMGR_CREATE, and also > XLOG_DBASE_CREATE/DROP. And then there are updates to non-relation files, > like all the slru stuff, relcache init files, etc. And updates to the FSM > and VM bypass the full-page write mechanism too. That's a awful number of special cases. I see little reason not to invent something that can reference at least the relation in those cases. Then we can at least only resync the relations if there were changes. E.g. for vm's that not necessarily all that likely in an insert mostly case. I'm not that worried about things like create/drop database, but fsm, vm, and the various slru's are really somewhat essential. > To play it safe, pg_rewind copies all non-relation files as is. That > includes all SLRUs, FSM and VM files, and everything else whose filename > doesn't match the (main fork of) a relation file. Of course, that's a fair > amount of copying to do, so we might want to optimize that in the future, > but I want to nail the relation files first. They are usually an order of > magnitude larger than the other files, after all. That's fair enough. > Unfortunately pg_rewind still needs to recognize and parse the special WAL > records like XLOG_SMGR_CREATE/TRUNCATE, that modify relation files outside > the normal block registration system. I've been thinking that we should add > another flag to the WAL record format to mark such records. So everytime pg_rewind comes across a record with that flag set which it doesn't have special case code it'd balk? Sounds sensible. Personally I think we should at least have a generic format to refer to entire relations without a specific block number. And one to refer to SLRUs. I don't think we necessarily need to implement them now, but we should make sure there's bit space left to denote them. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 10/30/2014 09:19 PM, Andres Freund wrote: Some things I noticed while reading the patch: A lot of good comments, but let me pick up just two that are related: * There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only refer to the relation, but not to the block number. These still log their rnode manually. Shouldn't we somehow deal with those in a similar way explicit block references are dealt with? * Hm. At least WriteMZeroPageXlogRec (and probably the same for all the other slru stuff) doesn't include a reference to the page. Isn't that bad? Shouldn't we make XLogRegisterBlock() usable for that case? Otherwise I fail to see how pg_rewind like tools can sanely deal with this? Yeah, there are still operations that modify relation pages, but don't store the information about the modified pages in the standard format. That includes XLOG_SMGR_TRUNCATE that you spotted, and XLOG_SMGR_CREATE, and also XLOG_DBASE_CREATE/DROP. And then there are updates to non-relation files, like all the slru stuff, relcache init files, etc. And updates to the FSM and VM bypass the full-page write mechanism too. To play it safe, pg_rewind copies all non-relation files as is. That includes all SLRUs, FSM and VM files, and everything else whose filename doesn't match the (main fork of) a relation file. Of course, that's a fair amount of copying to do, so we might want to optimize that in the future, but I want to nail the relation files first. They are usually an order of magnitude larger than the other files, after all. Unfortunately pg_rewind still needs to recognize and parse the special WAL records like XLOG_SMGR_CREATE/TRUNCATE, that modify relation files outside the normal block registration system. I've been thinking that we should add another flag to the WAL record format to mark such records. pg_rewind will still need to understand the record format of such records, but the flag will help to catch bugs of omission. If pg_rewind or another such tool sees a record that's flagged as "special", but doesn't recognize the record type, it can throw an error. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-11-04 18:33:34 +0200, Heikki Linnakangas wrote: > On 10/30/2014 06:02 PM, Andres Freund wrote: > >On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote: > >>On 10/06/2014 02:29 PM, Andres Freund wrote: > >>>I've not yet really looked, > >>>but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't > >>>make me happy... > >> > >>Ok.. Can you elaborate? > > > >To me the split between xloginsert.c doing some of the record assembly, > >and xlog.c doing the lower level part of the assembly is just wrong. > > I moved the checks for bootstrap mode and xl_len == 0, from the current > XLogInsert to the new XLogInsert() function. I don't imagine that to be > enough address your feelings about the split between xlog.c and > xloginsert.c, but makes it a tiny bit clearer IMHO. I don't know what else > to do about it, as it feels very sensible to me as it is now. So unless you > object more loudly, I'm going to just go ahead with this and commit, > probably tomorrow. I'm still not particularly happy with the split. But I think this patch is enough of a win anyhow. So go ahead. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 11/05/2014 09:06 AM, Amit Kapila wrote: 1. +XLogRecPtr +XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata) { .. + /* info's high bits are reserved for use by me */ + if (info & XLR_INFO_MASK) + elog(PANIC, "invalid xlog info mask %02X", info); .. } Earlier before this check, we use to check XLogInsertAllowed() which is moved to XLogInsertRecord(), isn't it better to keep the check in beginning of the function XLogInsert()? Doesn't matter much, but I think it's cleanest the way I had it in the latest patch. XLogInsert() is responsible for building a valid WAL record to insert, and XLogInsertRecord() handles all the locking, buffer management etc. to actually insert it. The high-bit check quoted above is more related to building a valid record, so it belongs in XLogInsert(), while the XLogInsertAllowed() check is more related to the actual insertion of the record, so it belongs in XLogInsertRecord(). 2. XLogRecPtr XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn) { .. if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites) { /* * Oops, some buffer now needs to be backed up that the caller * didn't back up. Start over. */ WALInsertLockRelease(); END_CRIT_SECTION(); return InvalidXLogRecPtr; } .. } IIUC, there can be 4 states for doPageWrites w.r.t when record is getting assembled (XLogRecordAssemble) and just before actual insert ( XLogInsertRecord) I think the behaviour for the state when doPageWrites is true during XLogRecordAssemble and false during XLogInsertRecord (just before actual insert) is different as compare to HEAD. In the patch, it will not retry if doPageWrites is false when we are about to insert even though fpw_lsn <= RedoRecPtr whereas in HEAD it will detect this during assembly of record and retry, isn't this a problem? So the scenario is that: * XLogRecordAssemble decides that a page doesn't need to be backed up * both RedoRecPtr and doPageWrites change while building the record. doPageWrites goes from true to false. Without the patch, we would retry, because first check RedoRecPtr has changed. With the patch, we notice that even though RedoRecPtr has changed, doPageWrites is now off, so no FPWs are required regardless of RedoRecPtr, and not retry. Yeah, you're right, the behavior changes in that case. However, the new behavior is correct; the retry is unnecessary in that scenario. 3. There are couple of places where *XLogInsert* is used in wal.sgml and it seems to me some of those needs change w.r.t this patch. Hmm. It's a bit strange to mention XLogInsert and XLogFlush functions by name in that text. It's otherwise admin-level stuff, on how to set WAL-related settings. Up to 9.2 that manual page actually called them "LogInsert" and "LogFlush". But I guess you're right; if we are to mention the function by name, it should say XLogInsertRecord. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Tue, Nov 4, 2014 at 10:03 PM, Heikki Linnakangas wrote: > > On 10/30/2014 06:02 PM, Andres Freund wrote: >> >> On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote: >>> >>> On 10/06/2014 02:29 PM, Andres Freund wrote: On 2014-10-06 14:19:39 +0300, Heikki Linnakangas wrote: > > Barring objections, I'll commit this, and then continue benchmarking the > second patch with the WAL format and API changes. I'd like to have a look at it beforehand. >>> >>> >>> Ping? Here's an rebased patch. I'd like to proceed with this. >> >> >> Doing so. > > > Here's a new version of this refactoring patch. It fixes all the concrete issues you pointed out. > I've not yet really looked, but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't make me happy... >>> >>> >>> Ok.. Can you elaborate? >> >> >> To me the split between xloginsert.c doing some of the record assembly, >> and xlog.c doing the lower level part of the assembly is just wrong. > > > I moved the checks for bootstrap mode and xl_len == 0, from the current XLogInsert to the new XLogInsert() function. I don't imagine that to be enough address your feelings about the split between xlog.c and xloginsert.c, but makes it a tiny bit clearer IMHO. I don't know what else to do about it, as it feels very sensible to me as it is now. So unless you object more loudly, I'm going to just go ahead with this and commit, probably tomorrow. > Few observations while reading the latest patch: 1. +XLogRecPtr +XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata) { .. + /* info's high bits are reserved for use by me */ + if (info & XLR_INFO_MASK) + elog(PANIC, "invalid xlog info mask %02X", info); .. } Earlier before this check, we use to check XLogInsertAllowed() which is moved to XLogInsertRecord(), isn't it better to keep the check in beginning of the function XLogInsert()? 2. XLogRecPtr XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn) { .. if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites) { /* * Oops, some buffer now needs to be backed up that the caller * didn't back up. Start over. */ WALInsertLockRelease(); END_CRIT_SECTION(); return InvalidXLogRecPtr; } .. } IIUC, there can be 4 states for doPageWrites w.r.t when record is getting assembled (XLogRecordAssemble) and just before actual insert ( XLogInsertRecord) I think the behaviour for the state when doPageWrites is true during XLogRecordAssemble and false during XLogInsertRecord (just before actual insert) is different as compare to HEAD. In the patch, it will not retry if doPageWrites is false when we are about to insert even though fpw_lsn <= RedoRecPtr whereas in HEAD it will detect this during assembly of record and retry, isn't this a problem? 3. There are couple of places where *XLogInsert* is used in wal.sgml and it seems to me some of those needs change w.r.t this patch. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WAL format and API changes (9.5)
On Fri, Oct 31, 2014 at 2:52 AM, Heikki Linnakangas wrote: > On 10/30/2014 06:02 PM, Andres Freund wrote: > >> On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote: >> >>> On 10/06/2014 02:29 PM, Andres Freund wrote: >>> I've not yet really looked, but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't make me happy... >>> >>> Ok.. Can you elaborate? >>> >> >> To me the split between xloginsert.c doing some of the record assembly, >> and xlog.c doing the lower level part of the assembly is just wrong. >> > > Really? To me, that feels like a natural split. xloginsert.c is > responsible for constructing the final WAL record, with the backup blocks > and all. It doesn't access any shared memory (related to WAL; it does look > at Buffers, to determine what to back up). The function in xlog.c inserts > the assembled record to the WAL, as is. It handles the locking and WAL > buffer management related to that. > FWIW, I tend to the same opinion here. What would you suggest? I don't think putting everything in one XLogInsert > function, like we have today, is better. Note that the second patch makes > xloginsert.c a lot more complicated. > I recall some time ago seeing complaints that xlog.c is too complicated and should be refactored. Any effort in this area is a good thing IMO, and this split made sense when I went through the code. > I'm not a big fan of the naming for the new split. We have >> * XLogInsert() which is the external interface >> * XLogRecordAssemble() which builds the rdata chain that will be >>inserted >> * XLogInsertRecData() which actually inserts the data into the xl buffers. >> >> To me that's pretty inconsistent. >> > > Got a better suggestion? I wanted to keep the name XLogInsert() unchanged, > to avoid code churn, and because it's still a good name for that. > XLogRecordAssemble is pretty descriptive for what it does, although > "XLogRecord" implies that it construct an XLogRecord struct. It does fill > in that too, but the main purpose is to build the XLogRecData chain. > Perhaps XLogAssembleRecord() would be better. > > I'm not very happy with XLogInsertRecData myself. XLogInsertRecord? > +1 for XLogInsertRecord. -- Michael
Re: [HACKERS] WAL format and API changes (9.5)
Hi, On 2014-09-15 15:41:22 +0300, Heikki Linnakangas wrote: > The second patch contains the interesting changes. Heikki's pushed the newest version of this to the git tree. Some things I noticed while reading the patch: * potential mismerge: +++ b/src/bin/pg_basebackup/pg_receivexlog.c @@ -408,7 +408,7 @@ main(int argc, char **argv) } } - while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:S:nF:wWv", + while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:nF:wWv", long_options, &option_index)) != -1) { switch (c) * Can't you just have REGBUF_WILL_INIT include REGBUF_NO_IMAGE? Then you don't need to test both. * Is it really a good idea to separate XLogRegisterBufData() from XLogRegisterBuffer() the way you've done it ? If we ever actually get a record with a large numbers of blocks touched this issentially is O(touched_buffers*data_entries). * At least in Assert mode XLogRegisterBuffer() should ensure we're not registering the same buffer twice. It's a pretty easy to make mistake to do it twice for some kind of actions (think UPDATE), and the consequences will be far from obvious afaics. * There's lots of functions like XLogRecHasBlockRef() that dig through the wal record. A common pattern is something like: if (XLogRecHasBlockRef(record, 1)) XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk); else oldblk = newblk; I think doing that repeatedly is quite a bad idea. We should parse the record once and then use it in a sensible format. Not do it in pieces, over and over again. It's not like we ignore backup blocks - so doing this lazily doesn't make sense to me. Especially as ValidXLogRecord() *already* has parsed the whole damn thing once. * Maybe XLogRegisterData() and XLogRegisterBufData() should accept void* instead of char*? Right now there's lots of pointless casts in callers needed that don't add any safety. The one additional line that's then needed in these functions seems well worth it. * There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only refer to the relation, but not to the block number. These still log their rnode manually. Shouldn't we somehow deal with those in a similar way explicit block references are dealt with? * You've added an assert in RestoreBackupBlockContents() that ensures the page isn't new. That wasn't there before, instead there was a special case for it. I can't immediately think how that previously could happen, but I do wonder why it was there. * The following comment in +RestoreBackupBlock probably is two lines to early + /* Found it, apply the update */ + if (!(bkpb->fork_flags & BKPBLOCK_HAS_IMAGE)) + return InvalidBuffer; This new InvalidBuffer case probably could use a comment in either XLogReadBufferForRedoExtended or here. * Most of the commentary above RestoreBackupBlock() probably doesn't belong there anymore. * Maybe XLogRecordBlockData.fork_flags should be renamed? Maybe just into flags_fork? Right now it makes at least me think that it's fork specific flags, which really isn't the actual meaning. * I wonder if it wouldn't be worthwile, for the benefit of the FPI compression patch, to keep the bkp block data after *all* the "headers". That'd make it easier to just compress the data. * +InitXLogRecordConstruct() seems like a remainder of the xlogconstruct.c naming. I'd just go for InitXLogInsert() * Hm. At least WriteMZeroPageXlogRec (and probably the same for all the other slru stuff) doesn't include a reference to the page. Isn't that bad? Shouldn't we make XLogRegisterBlock() usable for that case? Otherwise I fail to see how pg_rewind like tools can sanely deal with this? * Why did you duplicate the identical cases in btree_desc()? I guess due to rebasing ontop the xlogdump stats series? * I haven't actually looked at the xlogdump output so I might be completely wrong, but won't the output of e.g. updates be harder to read now because it's not clear which of the references old/new buffers are? Not that it matters that much. * s/recdataend/recdata_end/? The former is somewhat hard to parse for me. * I think heap_xlog_update is buggy for wal_level=logical because it computes the length of the tuple using tuplen = recdataend - recdata; But the old primary key/old tuple value might be stored there as well. Afaics that code has to continue to use xl_heap_header_len. * It looks to me like insert wal logging could just use REGBUF_KEEP_DATA to get rid of: + /* +* The new tuple is normally stored as buffer 0's data. But if +* XLOG_HEAP_CONTAINS_NEW_TUPLE flag is set, it's part of the main +* data, after the xl_heap_insert struct. +*/ + if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE) + { + data = XLogRecGetData(record) + SizeOfHeapInsert; + datalen = record->xl_len - SizeOfHeapInsert; + } + else +
Re: [HACKERS] WAL format and API changes (9.5)
On 10/30/2014 06:02 PM, Andres Freund wrote: On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote: On 10/06/2014 02:29 PM, Andres Freund wrote: I've not yet really looked, but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't make me happy... Ok.. Can you elaborate? To me the split between xloginsert.c doing some of the record assembly, and xlog.c doing the lower level part of the assembly is just wrong. Really? To me, that feels like a natural split. xloginsert.c is responsible for constructing the final WAL record, with the backup blocks and all. It doesn't access any shared memory (related to WAL; it does look at Buffers, to determine what to back up). The function in xlog.c inserts the assembled record to the WAL, as is. It handles the locking and WAL buffer management related to that. What would you suggest? I don't think putting everything in one XLogInsert function, like we have today, is better. Note that the second patch makes xloginsert.c a lot more complicated. And it leads to things like the already complicated logic to retry after detecting missing fpws is now split across two files seems to confirm that. What happens right now is that XLogInsert() (with some helper routines) assembles the record. Then hands that off to XLogInsertRecData(). Which sometimes returns InvalidXLogRecPtr, and returns back to XLogInsert(), which reverses *some* of its work and then retries. Brr. Modifying the chain that was already constructed is not pretty, but it's not this patch's fault. I don't think splitting the functions makes that any worse. (BTW, the follow-up patch to change the WAL format fixes that; the per-buffer data is kept separately from the main data chain). Similary the 'page_writes_omitted' logic doesn't make me particularly happy. Previously we retried when there actually was a page affected by the different RedoRecPtr. Now we do it as soon as our copy of RedoRecPtr is out of date? Won't that quite often spuriously trigger retries? Am I missing something? Arguably this doesn't happen often enough to matter, but it's still something that we should explicitly discuss. The situation where it leads to a spurious retry is when the constructed WAL record skipped the FPW of a buffer, and RedoRecPtr was updated, but the buffer stilll doesn't need to be FPW according to the updated RedoRecPtr. That only happens if the same buffer was updated (by another backend) after the new checkpoint began. I believe that's extremely rare. We could make that more fine-grained, though. Instead of passing a boolean 'fpw_sensitive' flag to XLogInsertRecData, pass the lowest LSN among the pages whose FPW was skipped. If that's less than the new RedoRecPtr, then retry is needed. That would eliminate the spurious retries. The implementation of the split seems to change the meaning of TRACE_POSTGRESQL_XLOG_INSERT - it's now counting retries. I don't particularly care about those tracepoints, but I see no simplification due to changing its meaning. I wasn't sure what to do about that. I don't use tracepoints, and I don't know what others use them for. I'm not a big fan of the naming for the new split. We have * XLogInsert() which is the external interface * XLogRecordAssemble() which builds the rdata chain that will be inserted * XLogInsertRecData() which actually inserts the data into the xl buffers. To me that's pretty inconsistent. Got a better suggestion? I wanted to keep the name XLogInsert() unchanged, to avoid code churn, and because it's still a good name for that. XLogRecordAssemble is pretty descriptive for what it does, although "XLogRecord" implies that it construct an XLogRecord struct. It does fill in that too, but the main purpose is to build the XLogRecData chain. Perhaps XLogAssembleRecord() would be better. I'm not very happy with XLogInsertRecData myself. XLogInsertRecord? I'm also somewhat confused about the new split of the headers. I'm not adverse to splitting them, but it's getting a bit hard to understand: * xlog.h - generic mishmash * xlogdefs.h - Three typedefs and some #defines * xlog_fn.h - SQL functions * xlog_internal.h - Pretty random collection of stuff * xlogutils.h - "Utilities for replaying WAL records" * xlogreader.h - "generic XLog reading facility" * xloginsert.h - "Functions for generating WAL records" * xlogrecord.h - "Definitions for the WAL record format." Isn't that a bit too much? Pretty much the only files that have a recognizable separation to me are xlog_fn.h and xlogreader.h. This is a matter of taste, I guess. I find xlog.h, xlogdefs.h and xlog_internal.h fairly random, while the rest are have a clear function. xlogutils.h is needed by redo functions, and xloginsert.h is needed for generating WAL. Note that the second patch adds more stuff to xloginsert.h and xlogrecord.h. You've removed the - * - * NB: this routine feels free to scribble on the XLogRecData structs, - * though not on
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote: > On 10/06/2014 02:29 PM, Andres Freund wrote: > >On 2014-10-06 14:19:39 +0300, Heikki Linnakangas wrote: > >>Barring objections, I'll commit this, and then continue benchmarking the > >>second patch with the WAL format and API changes. > > > >I'd like to have a look at it beforehand. > > Ping? Here's an rebased patch. I'd like to proceed with this. Doing so. > >I've not yet really looked, > >but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't > >make me happy... > > Ok.. Can you elaborate? To me the split between xloginsert.c doing some of the record assembly, and xlog.c doing the lower level part of the assembly is just wrong. And it leads to things like the already complicated logic to retry after detecting missing fpws is now split across two files seems to confirm that. What happens right now is that XLogInsert() (with some helper routines) assembles the record. Then hands that off to XLogInsertRecData(). Which sometimes returns InvalidXLogRecPtr, and returns back to XLogInsert(), which reverses *some* of its work and then retries. Brr. Similary the 'page_writes_omitted' logic doesn't make me particularly happy. Previously we retried when there actually was a page affected by the different RedoRecPtr. Now we do it as soon as our copy of RedoRecPtr is out of date? Won't that quite often spuriously trigger retries? Am I missing something? Arguably this doesn't happen often enough to matter, but it's still something that we should explicitly discuss. The implementation of the split seems to change the meaning of TRACE_POSTGRESQL_XLOG_INSERT - it's now counting retries. I don't particularly care about those tracepoints, but I see no simplification due to changing its meaning. Aside: In C11 function local statics become a bit more expensive - they essentially require atomics and/or spinlocks on the first few calls. Next thing: The patch doesn't actually compile. Misses an #include storage/proc.h for MyPgXact. I'm not a big fan of the naming for the new split. We have * XLogInsert() which is the external interface * XLogRecordAssemble() which builds the rdata chain that will be inserted * XLogInsertRecData() which actually inserts the data into the xl buffers. To me that's pretty inconsistent. I'm also somewhat confused about the new split of the headers. I'm not adverse to splitting them, but it's getting a bit hard to understand: * xlog.h - generic mishmash * xlogdefs.h - Three typedefs and some #defines * xlog_fn.h - SQL functions * xlog_internal.h - Pretty random collection of stuff * xlogutils.h - "Utilities for replaying WAL records" * xlogreader.h - "generic XLog reading facility" * xloginsert.h - "Functions for generating WAL records" * xlogrecord.h - "Definitions for the WAL record format." Isn't that a bit too much? Pretty much the only files that have a recognizable separation to me are xlog_fn.h and xlogreader.h. You've removed the - * - * NB: this routine feels free to scribble on the XLogRecData structs, - * though not on the data they reference. This is OK since the XLogRecData - * structs are always just temporaries in the calling code. comment, but we still do, no? While not this patches blame, I see currently we finish the critical section in XLogInsert/XLogInsertRecData pretty early: END_CRIT_SECTION(); /* * Update shared LogwrtRqst.Write, if we crossed page boundary. */ if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ) { SpinLockAcquire(&XLogCtl->info_lck); /* advance global request to include new block(s) */ if (XLogCtl->LogwrtRqst.Write < EndPos) XLogCtl->LogwrtRqst.Write = EndPos; /* update local result copy while I have the chance */ LogwrtResult = XLogCtl->LogwrtResult; SpinLockRelease(&XLogCtl->info_lck); } I don't think this is particularly critical, but it still looks wrong to me. Hm. I think that's what I see for now. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Mon, Oct 6, 2014 at 8:19 PM, Heikki Linnakangas wrote: > On 10/06/2014 04:42 AM, Michael Paquier wrote: > >> On Fri, Oct 3, 2014 at 9:51 PM, Heikki Linnakangas < >> hlinnakan...@vmware.com> >> wrote: >> >>> So I now have a refactoring patch ready that I'd like to commit (the >>> >> attached two patches together), but to be honest, I have no idea why the >> second patch is so essential to performance. >> Thanks. I did some more tests with master, master+patch1, >> master+patch1+CRC >> refactoring, but I am not able to see any performance difference with >> pgbench (--no-vacuum, -t) and the test suite you provided, just some noise >> that barely changed performance. >> > > Thanks for the confirmation. I'm really going crazy with benchmarking > this. Sometimes I see a big difference, the next day it's gone. > The benchmark paradigms. > * Fixed XLogSaveBufferForHint. It didn't initialize BkpBlock struct, > rendering it completely broken. > Note for other reviewers: that's represented by this addition in XLogSaveBufferForHint: + /* Make a BkpBlock struct representing the buffer */ + XLogFillBkpBlock(buffer, buffer_std, &bkpb) Regards, -- Michael
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-10-06 14:19:39 +0300, Heikki Linnakangas wrote: > On 10/06/2014 04:42 AM, Michael Paquier wrote: > >On Fri, Oct 3, 2014 at 9:51 PM, Heikki Linnakangas > >wrote: > >>So I now have a refactoring patch ready that I'd like to commit (the > >attached two patches together), but to be honest, I have no idea why the > >second patch is so essential to performance. > >Thanks. I did some more tests with master, master+patch1, master+patch1+CRC > >refactoring, but I am not able to see any performance difference with > >pgbench (--no-vacuum, -t) and the test suite you provided, just some noise > >that barely changed performance. > > Thanks for the confirmation. I'm really going crazy with benchmarking this. > Sometimes I see a big difference, the next day it's gone. A usual suspect for this is turbo mode and power control. It often already helps to disable the latter to get much more reproducible benchmark results. > Barring objections, I'll commit this, and then continue benchmarking the > second patch with the WAL format and API changes. I'd like to have a look at it beforehand. I've not yet really looked, but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't make me happy... Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Fri, Oct 3, 2014 at 9:51 PM, Heikki Linnakangas wrote: > So I now have a refactoring patch ready that I'd like to commit (the attached two patches together), but to be honest, I have no idea why the second patch is so essential to performance. Thanks. I did some more tests with master, master+patch1, master+patch1+CRC refactoring, but I am not able to see any performance difference with pgbench (--no-vacuum, -t) and the test suite you provided, just some noise that barely changed performance. Note that fd.c uses SYNC_METHOD_FSYNC_WRITETHROUGH, so it is necessary to include xlog.h in it. -- Michael
Re: [HACKERS] WAL format and API changes (9.5)
On 10/03/2014 04:10 PM, Andres Freund wrote: On 2014-10-03 15:51:37 +0300, Heikki Linnakangas wrote: After a lot of experimentation, I figured out that the slowdown is already apparent with the *first* patch, the one that just refactors existing XLogInsert, without any changes to the WAL format or to the callers. I fiddled with that for a long time, trying to tease apart the change that makes the difference, and was finally able to narrow it down. Attached are two patches. These are both just refactoring, with no changes to the WAL format or APIs. The first is the same as the refactoring patch I posted earlier, with only minor changes to #includes, comments and such (per Alvaro's and Michael's suggestions - thanks!). With the first patch, the test case I've been using to performance test this becomes somewhere between 5% - 10% slower. Applying the second patch on top of that restores the performance back to what you get without these patches. Strange. The second patch moves the CRC calculation from a separate loop over the XLogRecDatas to the earlier loop, that also iterates through the XLogRecDatas. The strange thing about this is that when I tried to make the same change to current git master, without applying the first patch, it didn't make any difference. The CRC calculation used to integrated to the earlier loop in 9.1 and before, but in 9.2 it was moved to a separate loop for simplicity, because it didn't make any difference to performance. So I now have a refactoring patch ready that I'd like to commit (the attached two patches together), but to be honest, I have no idea why the second patch is so essential to performance. Maybe a stupid question, but did you verify it's not caused by splitting xloginsert over two translation units and/or not inlining the separate function? Hmph, now that I try this again, I don't see any difference with or without the "second" patch - they both perform just as well as unpatched master. So I'm totally baffled again, but I guess the patch is good to go. I need a more stable test environment.. Yeah, I did try merging xlog.c and xloginsert.c into the same .c file at some point, and marking the XLogInsertRecData function as static, allowing the compiler to inline it, but I didn't see any difference. But you have to take that with a grain of salt - I'm apparently not capable of constructing a properly repeatable test case for this. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-10-03 15:51:37 +0300, Heikki Linnakangas wrote: > After a lot of experimentation, I figured out that the slowdown is already > apparent with the *first* patch, the one that just refactors existing > XLogInsert, without any changes to the WAL format or to the callers. I > fiddled with that for a long time, trying to tease apart the change that > makes the difference, and was finally able to narrow it down. > > Attached are two patches. These are both just refactoring, with no changes > to the WAL format or APIs. The first is the same as the refactoring patch I > posted earlier, with only minor changes to #includes, comments and such (per > Alvaro's and Michael's suggestions - thanks!). With the first patch, the > test case I've been using to performance test this becomes somewhere between > 5% - 10% slower. Applying the second patch on top of that restores the > performance back to what you get without these patches. > > Strange. The second patch moves the CRC calculation from a separate loop > over the XLogRecDatas to the earlier loop, that also iterates through the > XLogRecDatas. The strange thing about this is that when I tried to make the > same change to current git master, without applying the first patch, it > didn't make any difference. The CRC calculation used to integrated to the > earlier loop in 9.1 and before, but in 9.2 it was moved to a separate loop > for simplicity, because it didn't make any difference to performance. > > So I now have a refactoring patch ready that I'd like to commit (the > attached two patches together), but to be honest, I have no idea why the > second patch is so essential to performance. Maybe a stupid question, but did you verify it's not caused by splitting xloginsert over two translation units and/or not inlining the separate function? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Mon, Sep 15, 2014 at 9:16 PM, Michael Paquier wrote: > 2) XLogCheckBufferNeedsBackup is not used. It can be removed. Please ignore this comment, XLogCheckBufferNeedsBackup is used in heapam.c. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-09-15 15:41:22 +0300, Heikki Linnakangas wrote: > Here we go. I've split this again into two patches. The first patch is just > refactoring the current code. It moves XLogInsert into a new file, > xloginsert.c, and the definition of XLogRecord to new xlogrecord.h header > file. As a result, there is a a lot of churn in the #includes in C files > that generate WAL records, or contain redo routines. The number of files > that pull in xlog.h - directly or indirectly through other headers - is > greatly reduced. > > The second patch contains the interesting changes. > > I wrote a little benchmark kit to performance test this. I'm trying to find > out two things: > > 1) How much CPU overhead do the new XLogBeginInsert and XLogRegister* > functions add, compared to the current approach with XLogRecDatas. > > 2) How much extra WAL is generated with the patch. This affects the CPU time > spent in the tests, but it's also interesting to measure directly, because > WAL size affects many things like WAL archiving, streaming replication etc. > > Attached is the test kit I'm using. To run the battery of tests, use "psql > -f run.sql". To answer the question of WAL volume, it runs a bunch of tests > that exercise heap insert, update and delete, as well as b-tree and GIN > insertions. To answer the second test, it runs a heap insertion test, with a > tiny record size that's chosen so that it generates exactly the same amount > of WAL after alignment with and without the patch. The test is repeated many > times, and the median of runtimes is printed out. > > Here are the results, comparing unpatched and patched versions. First, the > WAL sizes: > > A heap insertion records are 2 bytes larger with the patch. Due to > alignment, that makes for a 0 or 8 byte difference in the record sizes. > Other WAL records have a similar store; a few extra bytes but no big > regressions. There are a few outliers above where it appears that the > patched version takes less space. Not sure why that would be, probably just > a glitch in the test, autovacuum kicked in or something. I've to admit, that's already not a painless amount of overhead. > Now, for the CPU overhead: > > description | dur_us (orig) | dur_us (patched) | % > +---+--+ > heap insert 30 | 0.7752835 | 0.831883 | 107.30 > (1 row) > > So, the patched version runs 7.3 % slower. That's disappointing :-(. > > This are the result I got on my laptop today. Previously, the typical result > I've gotten has been about 5%, so that's a bit high. Nevertheless, even a 5% > slowdown is probably not acceptable. Yes, I definitely think it's not. > While I've trying to nail down where that difference comes from, I've seen a > lot of strange phenomenon. At one point, the patched version was 10% slower, > but I was able to bring the difference down to 5% if I added a certain > function in xloginsert.c, but never called it. It was very repeatable at the > time, I tried adding and removing it many times and always got the same > result, but I don't see it with the current HEAD and patch version anymore. > So I think 5% is pretty close to the margin of error that arises from > different compiler optimizations, data/instruction cache effects etc. > > Looking at the 'perf' profile, The new function calls only amount to about > 2% of overhead, so I'm not sure where the slowdown is coming from. Here are > explanations I've considered, but I haven't been able to prove any of them: I'd suggest doing: a) perf stat -vvv of both workloads. That will often tell you stuff already b) Look at other events. Particularly stalled-cycles-frontend, stalled-cycles-backend, cache-misses Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Mon, Sep 15, 2014 at 8:00 PM, Michael Paquier wrote: > Alvaro got faster than me... I was just looking at the first patch and > +1 on those comments. It is worth noting that the first patch, as it > does only a large refactoring, does not impact performance or size of > WAL records. Some comments after a second and closer read of the first patch: 1) Wouldn't it be better to call GetFullPageWriteInfo directly in XLogRecordAssemble, removing RedoRecPtr and doPageWrites from its arguments? 2) XLogCheckBufferNeedsBackup is not used. It can be removed. 3) If I am following correctly, there are two code paths where XLogInsertRecData can return InvalidXLogRecPtr, and then there is this process in XLogInsert + { + /* +* Undo the changes we made to the rdata chain. +* +* XXX: This doesn't undo *all* the changes; the XLogRecData +* entries for buffers that we had already decided to back up have +* had their data-pointers cleared. That's OK, as long as we +* decide to back them up on the next iteration as well. Hence, +* don't allow "doPageWrites" value to go from true to false after +* we've modified the rdata chain. +*/ + boolnewDoPageWrites; + + GetFullPageWriteInfo(&RedoRecPtr, &newDoPageWrites); + if (!doPageWrites && newDoPageWrites) + doPageWrites = true; + rdt_lastnormal->next = NULL; + } Wouldn't it be better to keep that in XLogInsertRecData? This way GetFullPageWriteInfo could be removed (I actually see little point in keeping it as part of this patch, but does this change make its sense in patch 2?). 4) This patch is in DOS encoding (?!) That's all I have for now... Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Mon, Sep 15, 2014 at 7:03 PM, Alvaro Herrera wrote: > Heikki Linnakangas wrote: > >> Here we go. I've split this again into two patches. The first patch >> is just refactoring the current code. It moves XLogInsert into a new >> file, xloginsert.c, and the definition of XLogRecord to new >> xlogrecord.h header file. As a result, there is a a lot of churn in >> the #includes in C files that generate WAL records, or contain redo >> routines. The number of files that pull in xlog.h - directly or >> indirectly through other headers - is greatly reduced. > > I think you should push the first patch for now and continue to > investigate the issues in the second patch. Some minor suggestions I > have for the first patch are that since xlog.h is no longer used by > other headers (but only by .c files), you should just #include > xloginsert.h instead of doing the forward declaration of struct > XLogRecData; and in xlog_internal.h, instead of > > - * XLogRecord is defined in xlog.h, but we avoid #including that to keep > + * XLogRecord is defined in xlogrecord.h, but we avoid #including that to > keep > > I would just suggest to #include xlogrecord.h, since it should just > work to include that file from frontend programs. It didn't work for > xlog.h because that one #includes fmgr.h IIRC and that one causes Datum > to appear, which makes compilation blow up. Shouldn't be the case here. Alvaro got faster than me... I was just looking at the first patch and +1 on those comments. It is worth noting that the first patch, as it does only a large refactoring, does not impact performance or size of WAL records. Btw, a declaration of access/xlog.h in fd.c is forgotten. Also, if somebody is interested in running the test suite, there are comments on how to do it at the top of compare.sql. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
Heikki Linnakangas wrote: > Here we go. I've split this again into two patches. The first patch > is just refactoring the current code. It moves XLogInsert into a new > file, xloginsert.c, and the definition of XLogRecord to new > xlogrecord.h header file. As a result, there is a a lot of churn in > the #includes in C files that generate WAL records, or contain redo > routines. The number of files that pull in xlog.h - directly or > indirectly through other headers - is greatly reduced. I think you should push the first patch for now and continue to investigate the issues in the second patch. Some minor suggestions I have for the first patch are that since xlog.h is no longer used by other headers (but only by .c files), you should just #include xloginsert.h instead of doing the forward declaration of struct XLogRecData; and in xlog_internal.h, instead of - * XLogRecord is defined in xlog.h, but we avoid #including that to keep + * XLogRecord is defined in xlogrecord.h, but we avoid #including that to keep I would just suggest to #include xlogrecord.h, since it should just work to include that file from frontend programs. It didn't work for xlog.h because that one #includes fmgr.h IIRC and that one causes Datum to appear, which makes compilation blow up. Shouldn't be the case here. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Tue, Sep 2, 2014 at 9:23 PM, Heikki Linnakangas wrote: > I committed the redo-routine refactoring patch. I kept the XLog prefix in > the XLogReadBufferForRedo name; it's redundant, but all the other similar > functions in xlogutils.c use the XLog prefix so it would seem inconsistent > to not have it here. Thanks! Even that will be helpful for a potential patch doing consistency comparisons of FPW with current pages having WAL of a record applied. > I'll post a new version of the main patch shortly... Looking forward to seeing it. Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Tue, Sep 2, 2014 at 9:23 PM, Heikki Linnakangas wrote: > I committed the redo-routine refactoring patch. I kept the XLog prefix in > the XLogReadBufferForRedo name; it's redundant, but all the other similar > functions in xlogutils.c use the XLog prefix so it would seem inconsistent > to not have it here. Thanks! Even that will be helpful for a potential patch doing consistency comparisons of FPW with current pages having WAL of a record applied. > I'll post a new version of the main patch shortly... Looking forward to seeing it. Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 08/19/2014 05:38 PM, Andres Freund wrote: On 2014-08-19 10:33:29 -0400, Alvaro Herrera wrote: Heikki Linnakangas wrote: Barring objections or better ideas, I'm leaning towards XLogReadBufferForRedo. WFM for me too. Although we could imo strip the 'XLog' in the beginning if we want to make it shorter. The ForRedo is saying that pretty much. I committed the redo-routine refactoring patch. I kept the XLog prefix in the XLogReadBufferForRedo name; it's redundant, but all the other similar functions in xlogutils.c use the XLog prefix so it would seem inconsistent to not have it here. I'll post a new version of the main patch shortly... - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-08-19 10:33:29 -0400, Alvaro Herrera wrote: > Heikki Linnakangas wrote: > > > Barring objections or better ideas, I'm leaning towards > > XLogReadBufferForRedo. > > WFM for me too. Although we could imo strip the 'XLog' in the beginning if we want to make it shorter. The ForRedo is saying that pretty much. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
Heikki Linnakangas wrote: > Barring objections or better ideas, I'm leaning towards > XLogReadBufferForRedo. WFM -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 08/19/2014 11:07 AM, Michael Paquier wrote: Patch 1 looks good. The main difference with the v1 submitted a couple of days back is that the global variable approach is replaced with additional arguments for the LSN position and record pointer in XLogReplayBuffer. I have as well run a couple of tests with the page comparison tool, done some tests based on installcheck-world with a slave replaying WAL behind, and found no issues with it. Thanks! Perhaps we could consider pushing it to facilitate the next work? Even if the second patch is dropped it is still a win IMO to have backup block replay managed within a single function (being it named XLogReplayBuffer in latest patch), and having it return a status flag. Yeah, that was my plan. Regarding the name, the following have been suggested so far: XLogReplayBuffer XLogRestoreBuffer XLogRecoverBuffer XLogReadBufferForReplay ReadBufferForXLogReplay One more idea: XLogRedoBuffer (would be like three first options above, but would match that we call the functions that call this "redo" functions) I think XLogReadBufferForReplay is the most descriptive. Andres and Alvaro both suggested it - independently I believe - so that seems to come out naturally. But maybe make it XLogReadBufferForRedo, since we call the redo functions redo functions and not replay functions. Yet another option is to just call it XLogReadBuffer, and rename the existing XLogReadBuffer to something else. With the 2nd patch, there are only a few callers of XLogReadBuffer left. But is it too deceitful if XLogReadBuffer doesn't merely read the page, but also sometimes replaces it with a full-page image? Maybe it's OK.. Barring objections or better ideas, I'm leaning towards XLogReadBufferForRedo. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Mon, Aug 18, 2014 at 10:55 PM, Heikki Linnakangas < hlinnakan...@vmware.com> wrote: > All rightey. > > Here's the next version of this work. It now comes as two patches. The first > one refactors adds the XLogReplayBuffer() function and refactors all the > redo functions to use it. It doesn't change the WAL record format in any > way. The second patch applies over the first one, and changes the WAL > format, and all the WAL-creating functions to use the new API for > constructing WAL records. The second patch removes the relfilenode and block > number arguments from XLogReplayBuffer, because they're no longer needed > when that information is included in the record format. > > Todo: > > * Performance testing. Do the new WAL construction functions add overhead to > WAL insertion? > * Compare WAL record sizes before and after. I've tried to keep it as > compact as possible, but I know that some records have gotten bigger. Need > to do a more thorough analysis. > * Rename XLogReplayBuffer. I don't particularly like any of the suggestions > so far, XLogReplayBuffer included. Discussion continues.. > * Anything else? Patch 1 looks good. The main difference with the v1 submitted a couple of days back is that the global variable approach is replaced with additional arguments for the LSN position and record pointer in XLogReplayBuffer. I have as well run a couple of tests with the page comparison tool, done some tests based on installcheck-world with a slave replaying WAL behind, and found no issues with it. Perhaps we could consider pushing it to facilitate the next work? Even if the second patch is dropped it is still a win IMO to have backup block replay managed within a single function (being it named XLogReplayBuffer in latest patch), and having it return a status flag. Regarding patch 2: - The main differences with the latest version are the modifications for XLogReplayBuffer having new arguments (LSN position and record pointer). XLogRecGetBlockRefIds has been changed to return a palloc'd array of block IDs. xloginsert.h, containing all the functions for xlog record construction is introduced as well. - Tiny thing, be aware of tab padding. Here is heapam.c: page = BufferGetPage(buffer); PageSetAllVisible(page); MarkBufferDirty(buffer); - XLogRecGetBlockRefIds is not described in src/backend/access/transam/README. Btw, pg_xlogdump drops a core dump when using it: --Output: Assertion failed: (n == *num_refs), function XLogRecGetBlockRefIds, file xlogreader.c, line 1157. rmgr: Heaplen (rec/tot): 14/ 12912, tx: 3, lsn: 0/01000148, prev 0/01000120, Abort trap: 6 (core dumped) -- Backtrace: frame #4: 0x000103870363 pg_xlogdump`XLogRecGetBlockRefIds(record=0x7ff38a003200, num_refs=0x7fff5c394464) + 435 at xlogreader.c:1157 frame #5: 0x00010386d610 pg_xlogdump`XLogDumpDisplayRecord(config=0x7fff5c3945c8, ReadRecPtr=16777544, record=0x7ff38a003200) + 272 at pg_xlogdump.c:357 frame #6: 0x00010386cad8 pg_xlogdump`main(argc=2, argv=0x7fff5c394658) + 3160 at pg_xlogdump.c:749 In order to reproduce that, simply run regression tests, followed by pg_xlogdump on one of the WAL files generated. - This patch includes some diffs from pg_receivexlog.c taken from 52bffe3. - I have run installcheck-world and compared the size of the WAL generated on HEAD and the patch (any hints to improve such analysis are of course welcome) name | start |stop| diff +---++--- HEAD (8605bc7) | 0/16C6808 | 0/11A2C670 | 271998568 Patch 1+2 | 0/16D45D8 | 0/1267A4B0 | 284843736 (2 rows) So that's a diff of more or less 13MB for this test set. Looking forward for some performance numbers as well as more precise comparison of WAL record length. -- Michael
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-08-14 12:41:35 +0300, Heikki Linnakangas wrote: > On 08/14/2014 10:29 AM, Michael Paquier wrote: > >On Thu, Aug 14, 2014 at 3:04 AM, Alvaro Herrera > > wrote: > >>Heikki Linnakangas wrote: > >>What's with XLogReplayLSN and XLogReplayRecord? IMO the xlog code has > >>more than enough global variables already, it'd be good to avoid two > >>more if possible. Is there no other good way to get this info down to > >>whatever it is that needs them? > >Yep, they do not look necessary. Looking at the patch, you could get > >rid of XLogReplayLSN and XLogReplayRecord by adding two extra argument > >to XLogReplayBuffer: one for the LSN of the current record, and a > >second for the record pointer. The code just saves those two variables > >in the redo loop of StartupXLOG to only reuse them in > >XLogReplayBufferExtended, and I saw no code paths in the redo routines > >where XLogReplayBuffer is called at places without the LSN position > >and the record pointer. > > > >However I think that Heikki introduced those two variables to make the > >move to the next patch easier. > > The next patch doesn't necessary require them either, you could always pass > the LSN and record as an argument. I wanted to avoid that, because every > redo function would just pass the current record being replayed, so it seems > nicer to pass that information "out-of-band". I guess if we do that, though, > we should remove those arguments from rm_redo interface altogether, and > always rely on the global variables to get the "current" record or its LSN. > I'm not wedded on this, I could be persuaded to go either way... I personally find the out of band transport really ugly. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 08/14/2014 11:22 AM, Michael Paquier wrote: 1) Why changing that from ERROR to PANIC? /* Caller specified a bogus block_index */ - elog(ERROR, "failed to restore block_index %d", block_index); + elog(PANIC, "failed to restore block_index %d", block_index); No reason, just a leftover from debugging. Please ignore. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 08/14/2014 10:29 AM, Michael Paquier wrote: On Thu, Aug 14, 2014 at 3:04 AM, Alvaro Herrera wrote: Heikki Linnakangas wrote: What's with XLogReplayLSN and XLogReplayRecord? IMO the xlog code has more than enough global variables already, it'd be good to avoid two more if possible. Is there no other good way to get this info down to whatever it is that needs them? Yep, they do not look necessary. Looking at the patch, you could get rid of XLogReplayLSN and XLogReplayRecord by adding two extra argument to XLogReplayBuffer: one for the LSN of the current record, and a second for the record pointer. The code just saves those two variables in the redo loop of StartupXLOG to only reuse them in XLogReplayBufferExtended, and I saw no code paths in the redo routines where XLogReplayBuffer is called at places without the LSN position and the record pointer. However I think that Heikki introduced those two variables to make the move to the next patch easier. The next patch doesn't necessary require them either, you could always pass the LSN and record as an argument. I wanted to avoid that, because every redo function would just pass the current record being replayed, so it seems nicer to pass that information "out-of-band". I guess if we do that, though, we should remove those arguments from rm_redo interface altogether, and always rely on the global variables to get the "current" record or its LSN. I'm not wedded on this, I could be persuaded to go either way... - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Thu, Aug 14, 2014 at 2:05 AM, Heikki Linnakangas wrote: > Here's a full version of this refactoring patch, all the rmgr's have now > been updated to use XLogReplayBuffer(). I think this is a worthwhile change > on its own, even if we drop the ball on the rest of the WAL format patch, > because it makes the redo-routines more readable. I propose to commit this > as soon as someone has reviewed it, and we agree on a for what's currently > called XLogReplayBuffer(). I have tested this with my page-image comparison > tool. I have as well passed this patch through the page comparison tool and found no problems, with both regression and isolation tests. I also had a look at the redo routines that are changed and actually found nothing. Now, what this patch does is not much complicated, it adds a couple of status flags returned by XLogReplayBuffer. Then, roughly, when BLK_NEEDS_REDO is returned the previous restore actions are done. This has the merit to put the LSN check on current page to determine if a page needs to be replayed inside XLogReplayBuffer. A couple of minor comments though: 1) Why changing that from ERROR to PANIC? /* Caller specified a bogus block_index */ - elog(ERROR, "failed to restore block_index %d", block_index); + elog(PANIC, "failed to restore block_index %d", block_index); 2) There are some whitespaces, like here: + { + destPage = NULL;/* don't do any page updates */ } Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Thu, Aug 14, 2014 at 3:04 AM, Alvaro Herrera wrote: > Heikki Linnakangas wrote: > What's with XLogReplayLSN and XLogReplayRecord? IMO the xlog code has > more than enough global variables already, it'd be good to avoid two > more if possible. Is there no other good way to get this info down to > whatever it is that needs them? Yep, they do not look necessary. Looking at the patch, you could get rid of XLogReplayLSN and XLogReplayRecord by adding two extra argument to XLogReplayBuffer: one for the LSN of the current record, and a second for the record pointer. The code just saves those two variables in the redo loop of StartupXLOG to only reuse them in XLogReplayBufferExtended, and I saw no code paths in the redo routines where XLogReplayBuffer is called at places without the LSN position and the record pointer. However I think that Heikki introduced those two variables to make the move to the next patch easier. Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
Heikki Linnakangas wrote: > Here's a full version of this refactoring patch, all the rmgr's have > now been updated to use XLogReplayBuffer(). I think this is a > worthwhile change on its own, even if we drop the ball on the rest > of the WAL format patch, because it makes the redo-routines more > readable. I propose to commit this as soon as someone has reviewed > it, and we agree on a for what's currently called > XLogReplayBuffer(). I have tested this with my page-image comparison > tool. What's with XLogReplayLSN and XLogReplayRecord? IMO the xlog code has more than enough global variables already, it'd be good to avoid two more if possible. Is there no other good way to get this info down to whatever it is that needs them? -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
Heikki Linnakangas wrote: > Here's a full version of this refactoring patch, all the rmgr's have > now been updated to use XLogReplayBuffer(). I think this is a > worthwhile change on its own, even if we drop the ball on the rest > of the WAL format patch, because it makes the redo-routines more > readable. I propose to commit this as soon as someone has reviewed > it, and we agree on a for what's currently called > XLogReplayBuffer(). I have tested this with my page-image comparison > tool. XLogReadBufferForReplay ? ReadBufferForXLogReplay ? -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 08/13/2014 04:15 PM, Heikki Linnakangas wrote: Hmm, thinking about this some more, there is one sensible way to split this patch: We can add the XLogReplayBuffer() function and rewrite all the redo routines to use it, without changing any WAL record formats or anything in the way the WAL records are constructed. In the patch, XLogReplayBuffer() takes one input arument, the block reference ID, and it fetches the RelFileNode and BlockNumber of the block based on that. Without the WAL format changes, the information isn't there in the record, but we can require the callers to pass the RelFileNode and BlockNumber. The final patch will remove those arguments from every caller, but that's a very mechanical change. As in the attached patch. I only modified the heapam redo routines to use the new XLogReplayBuffer() idiom; the idea is to do that for every redo routine. After applying such a patch, the main WAL format changing patch becomes much smaller, and makes it easier to see from the redo routines where significant changes to the WAL record formats have been made. This also allows us to split the bikeshedding; we can discuss the name of XLogReplayBuffer() first :-). Here's a full version of this refactoring patch, all the rmgr's have now been updated to use XLogReplayBuffer(). I think this is a worthwhile change on its own, even if we drop the ball on the rest of the WAL format patch, because it makes the redo-routines more readable. I propose to commit this as soon as someone has reviewed it, and we agree on a for what's currently called XLogReplayBuffer(). I have tested this with my page-image comparison tool. - Heikki commit 4c6c7571e91f34919aff2c2981fe474537dc557b Author: Heikki Linnakangas Date: Wed Aug 13 15:39:08 2014 +0300 Refactor redo routines to use XLogReplayBuffer diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c index ffabf4e..acbb840 100644 --- a/src/backend/access/gin/ginxlog.c +++ b/src/backend/access/gin/ginxlog.c @@ -20,25 +20,23 @@ static MemoryContext opCtx; /* working memory for operations */ static void -ginRedoClearIncompleteSplit(XLogRecPtr lsn, RelFileNode node, BlockNumber blkno) +ginRedoClearIncompleteSplit(XLogRecPtr lsn, int block_index, + RelFileNode node, BlockNumber blkno) { Buffer buffer; Page page; - buffer = XLogReadBuffer(node, blkno, false); - if (!BufferIsValid(buffer)) - return; /* page was deleted, nothing to do */ - page = (Page) BufferGetPage(buffer); - - if (lsn > PageGetLSN(page)) + if (XLogReplayBuffer(block_index, node, blkno, &buffer) == BLK_NEEDS_REDO) { + page = (Page) BufferGetPage(buffer); + GinPageGetOpaque(page)->flags &= ~GIN_INCOMPLETE_SPLIT; PageSetLSN(page, lsn); MarkBufferDirty(buffer); } - - UnlockReleaseBuffer(buffer); + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); } static void @@ -351,26 +349,14 @@ ginRedoInsert(XLogRecPtr lsn, XLogRecord *record) rightChildBlkno = BlockIdGetBlockNumber((BlockId) payload); payload += sizeof(BlockIdData); - if (record->xl_info & XLR_BKP_BLOCK(0)) - (void) RestoreBackupBlock(lsn, record, 0, false, false); - else - ginRedoClearIncompleteSplit(lsn, data->node, leftChildBlkno); + ginRedoClearIncompleteSplit(lsn, 0, data->node, leftChildBlkno); } - /* If we have a full-page image, restore it and we're done */ - if (record->xl_info & XLR_BKP_BLOCK(isLeaf ? 0 : 1)) + if (XLogReplayBuffer(isLeaf ? 0 : 1, + data->node, data->blkno, &buffer) == BLK_NEEDS_REDO) { - (void) RestoreBackupBlock(lsn, record, isLeaf ? 0 : 1, false, false); - return; - } - - buffer = XLogReadBuffer(data->node, data->blkno, false); - if (!BufferIsValid(buffer)) - return; /* page was deleted, nothing to do */ - page = (Page) BufferGetPage(buffer); + page = BufferGetPage(buffer); - if (lsn > PageGetLSN(page)) - { /* How to insert the payload is tree-type specific */ if (data->flags & GIN_INSERT_ISDATA) { @@ -386,8 +372,8 @@ ginRedoInsert(XLogRecPtr lsn, XLogRecord *record) PageSetLSN(page, lsn); MarkBufferDirty(buffer); } - - UnlockReleaseBuffer(buffer); + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); } static void @@ -476,12 +462,7 @@ ginRedoSplit(XLogRecPtr lsn, XLogRecord *record) * split */ if (!isLeaf) - { - if (record->xl_info & XLR_BKP_BLOCK(0)) - (void) RestoreBackupBlock(lsn, record, 0, false, false); - else - ginRedoClearIncompleteSplit(lsn, data->node, data->leftChildBlkno); - } + ginRedoClearIncompleteSplit(lsn, 0, data->node, data->leftChildBlkno); flags = 0; if (isLeaf) @@ -607,29 +588,19 @@ ginRedoVacuumDataLeafPage(XLogRecPtr lsn, XLogRecord *record) Buffer buffer; Page page; - /* If we have a full-page image, restore it and we're done */ - if (record->xl_info & XLR_BKP_BLOCK(0)) + if (XLogReplayBuffer(0, xlrec->node, xlrec->blkno, &buffer) == BLK_NEEDS_REDO) { - (void) RestoreBackupBlock(lsn, record, 0, false, false); -
Re: [HACKERS] WAL format and API changes (9.5)
On 08/13/2014 02:07 PM, Andres Freund wrote: On 2014-08-13 02:36:59 +0300, Heikki Linnakangas wrote: On 08/13/2014 01:04 AM, Andres Freund wrote: * The patch mixes the API changes around WAL records with changes of how individual actions are logged. That makes it rather hard to review - and it's a 500kb patch already. I realize it's hard to avoid because the new API changes which information about blocks is available, but it does make it really hard to see whether the old/new code is doing something equivalent. It's hard to find more critical code than this :/ Yeah, I hear you. I considered doing this piecemeal, just adding the new functions first so that you could still use the old XLogRecData API, until all the functions have been converted. But in the end, I figured it's not worth it, as sooner or later we'd want to convert all the functions anyway. I think it might be worthwile anyway. I'd be very surprised if there aren't several significant bugs in the conversion. Your full page checking tool surely helps to reduce the number, but it's not foolproof. I can understand not wanting to do it though, it's a significant amount of work. Would you ask somebody else to do it in two steps? Hmm, thinking about this some more, there is one sensible way to split this patch: We can add the XLogReplayBuffer() function and rewrite all the redo routines to use it, without changing any WAL record formats or anything in the way the WAL records are constructed. In the patch, XLogReplayBuffer() takes one input arument, the block reference ID, and it fetches the RelFileNode and BlockNumber of the block based on that. Without the WAL format changes, the information isn't there in the record, but we can require the callers to pass the RelFileNode and BlockNumber. The final patch will remove those arguments from every caller, but that's a very mechanical change. As in the attached patch. I only modified the heapam redo routines to use the new XLogReplayBuffer() idiom; the idea is to do that for every redo routine. After applying such a patch, the main WAL format changing patch becomes much smaller, and makes it easier to see from the redo routines where significant changes to the WAL record formats have been made. This also allows us to split the bikeshedding; we can discuss the name of XLogReplayBuffer() first :-). - Heikki commit 1a770baa3a3f293e8c592f0419d279b7b8bf7b66 Author: Heikki Linnakangas Date: Wed Aug 13 15:39:08 2014 +0300 Refactor heapam.c redo routines to use XLogReplayBuffer diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index d731f98..bf863af 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -7137,15 +7137,13 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record) { xl_heap_clean *xlrec = (xl_heap_clean *) XLogRecGetData(record); Buffer buffer; - Page page; - OffsetNumber *end; - OffsetNumber *redirected; - OffsetNumber *nowdead; - OffsetNumber *nowunused; - int nredirected; - int ndead; - int nunused; - Size freespace; + Size freespace = 0; + RelFileNode rnode; + BlockNumber blkno; + XLogReplayResult rc; + + rnode = xlrec->node; + blkno = xlrec->block; /* * We're about to remove tuples. In Hot Standby mode, ensure that there's @@ -7156,65 +7154,62 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record) * latestRemovedXid is invalid, skip conflict processing. */ if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid)) - ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, - xlrec->node); + ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode); /* * If we have a full-page image, restore it (using a cleanup lock) and * we're done. */ - if (record->xl_info & XLR_BKP_BLOCK(0)) - { - (void) RestoreBackupBlock(lsn, record, 0, true, false); - return; - } + rc = XLogReplayBufferExtended(0, rnode, MAIN_FORKNUM, blkno, + RBM_NORMAL, true, &buffer); + if (rc == BLK_NEEDS_REDO) + { + Page page = (Page) BufferGetPage(buffer); + OffsetNumber *end; + OffsetNumber *redirected; + OffsetNumber *nowdead; + OffsetNumber *nowunused; + int nredirected; + int ndead; + int nunused; + + nredirected = xlrec->nredirected; + ndead = xlrec->ndead; + end = (OffsetNumber *) ((char *) xlrec + record->xl_len); + redirected = (OffsetNumber *) ((char *) xlrec + SizeOfHeapClean); + nowdead = redirected + (nredirected * 2); + nowunused = nowdead + ndead; + nunused = (end - nowunused); + Assert(nunused >= 0); + + /* Update all item pointers per the record, and repair fragmentation */ + heap_page_prune_execute(buffer, +redirected, nredirected, +nowdead, ndead, +nowunused, nunused); + + freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */ - buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL); - if (!BufferIsValid(buffer
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-08-13 02:36:59 +0300, Heikki Linnakangas wrote: > On 08/13/2014 01:04 AM, Andres Freund wrote: > >* I'm distinctly not a fan of passing around both the LSN and the struct > > XLogRecord to functions. We should bit the bullet and use something > > encapsulating both. > > You mean, the redo functions? Seems fine to me, and always it's been like > that... Meh. By that argument we could just not do this patch :) > >* XLogReplayBuffer imo isn't a very good name - the buffer isn't > > replayed. If anything it's sometimes replaced by the backup block, but > > the real replay still happens outside of that function. > > I agree, but haven't come up with a better name. XLogRestoreBuffer(), XLogRecoverBuffer(), XLogReadBufferForReplay(), ...? > >* Why do we need XLogBeginInsert(). Right now this leads to uglyness > > like duplicated if (RelationNeedsWAL()) wal checks and > > XLogCancelInsert() of inserts that haven't been started. > > I like clearly marking the beginning of the insert, with XLogBeginInsert(). > We certainly could design it so that it's not needed, and you could just > start registering stuff without calling XLogBeginInsert() first. But I > believe it's more robust this way. For example, you will get an error at the > next XLogBeginInsert(), if a previous operation had already registered some > data without calling XLogInsert(). I'm coming around to that view - especially as accidentally nesting xlog insertions is a real concern with the new API. > >And if we have Begin, why do we also need Cancel? > > We could just automatically "cancel" any previous insertion when > XLogBeginInsert() is called, but again it seems like bad form to leave > references to buffers and data just lying there, if an operation bails out > after registering some stuff and doesn't finish the insertion. > > >* "XXX: do we need to do this for every page?" - yes, and it's every > > tuple, not every page... And It needs to be done *after* the tuple has > > been put on the page, not before. Otherwise the location will be wrong. > > That comment is in heap_multi_insert(). All the inserted tuples have the > same command id, seems redundant to log multiple NEW_CID records for them. Yea, I know it's in heap_multi_insert(). They tuples have the same cid, but they don't have the same tid. Or well, wouldn't if log_heap_new_cid() were called after the PutHeapTuple(). Note that this won't trigger for any normal user defined relations. > >* The patch mixes the API changes around WAL records with changes of how > > individual actions are logged. That makes it rather hard to review - > > and it's a 500kb patch already. > > > > I realize it's hard to avoid because the new API changes which > > information about blocks is available, but it does make it really hard > > to see whether the old/new code is doing something > > equivalent. It's hard to find more critical code than this :/ > > Yeah, I hear you. I considered doing this piecemeal, just adding the new > functions first so that you could still use the old XLogRecData API, until > all the functions have been converted. But in the end, I figured it's not > worth it, as sooner or later we'd want to convert all the functions anyway. I think it might be worthwile anyway. I'd be very surprised if there aren't several significant bugs in the conversion. Your full page checking tool surely helps to reduce the number, but it's not foolproof. I can understand not wanting to do it though, it's a significant amount of work. Would you ask somebody else to do it in two steps? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Wed, Aug 13, 2014 at 4:49 PM, Michael Paquier wrote: > Yes indeed. As XLogBeginInsert() is already part of xlogconstruct.c, > it is not weird. This approach has the merit to make XLogRecData > completely bundled with the insertion and construction of the WAL > records. Then for the name xloginsert.c is fine for me too. At the same time, renaming XLogInsert to XLogCompleteInsert or XLogFinishInsert would be nice to make the difference with XLogBeginInsert. This could include XLogCancel renamed tos XLogCancelInsert. Appending the prefix XLogInsert* to those functions would make things more consistent as well. But feel free to discard those ideas if you do not like that. Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Tue, Aug 12, 2014 at 6:44 PM, Heikki Linnakangas wrote: > On 08/05/2014 03:50 PM, Michael Paquier wrote: >> >> - When testing pg_xlogdump I found an assertion failure for record >> XLOG_GIN_INSERT: >> Assertion failed: (((bool) (((const void*)(&insertData->newitem.key) != >> ((void*)0)) && ((&insertData->newitem.key)->ip_posid != 0, function >> gin_desc, file gindesc.c, line 127. > > > I could not reproduce this. Do you have a test case? Indeed. I am not seeing anything now for this record. Perhaps I was using an incorrect version of pg_xlogdump. >> - Don't we need a call to XLogBeginInsert() in XLogSaveBufferForHint(): >> + int flags = REGBUF_FORCE_IMAGE; >> + if (buffer_std) >> + flags |= REGBUF_STANDARD; >> + XLogRegisterBuffer(0, buffer, flags); >> [...] >> - recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI, rdata); >> + recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI); > > > Indeed, fixed. With that, "initdb -k" works, I had not tested the patch with > page checksums before. Thanks for the fix. Yes now this works. >> - XLogRecGetBlockRefIds needing an already-allocated array for *out is not >> user-friendly. Cannot we just do all the work inside this function? > > > I agree it's not a nice API. Returning a palloc'd array would be nicer, but > this needs to work outside the backend too (e.g. pg_xlogdump). It could > return a malloc'd array in frontend code, but it's not as nice. On the > whole, do you think that be better than the current approach? A palloc'd array returned is nicer for the user. And I would rather rewrite the API like that, having it return the array instead: int *XLogRecGetBlockRefIds(XLogRecord *record, int *num_array) Alvaro has also outlined that in the FRONTEND context pfree and palloc would work as pg_free and pg_malloc (didn't recall it before he mentioned it btw). >> - All the functions in xlogconstruct.c could be defined in a separate >> header xlogconstruct.h. What they do is rather independent from xlog.h. >> The >> same applies to all the structures and functions of xlogconstruct.c in >> xlog_internal.h: XLogRecordAssemble, XLogRecordAssemble, >> InitXLogRecordConstruct. (worth noticing as well that the only reason >> XLogRecData is defined externally of xlogconstruct.c is that it is being >> used in XLogInsert()). > > > Hmm. I left the defines for xlogconstruct.c functions that are used to build > a WAL record in xlog.h, so that it's not necessary to include both xlog.h > and xlogconstruct.h in all files that write a WAL record (XLogInsert() is > defined in xlog.h). > > But perhaps it would be better to move the prototype for XLogInsert to > xlogconstruct.h too? That might be a better division; everything needed to > insert a WAL record would be in xlogconstruct.h, and xlog.h would left for > more internal functions related to WAL. And perhaps rename the files to > xloginsert.c and xloginsert.h. Yes indeed. As XLogBeginInsert() is already part of xlogconstruct.c, it is not weird. This approach has the merit to make XLogRecData completely bundled with the insertion and construction of the WAL records. Then for the name xloginsert.c is fine for me too. Among the things changed since v5, v6 is introducing a safety limit to the maximum number of backup blocks within a single WAL record: #define XLR_MAX_BKP_BLOCKS 256 XLogRecordBlockData is updated to use uint8 instead of char to identify the block id, and XLogEnsureRecordSpace fails if more than XLR_MAX_BKP_BLOCKS are wanted by the caller. I am fine with this change, just mentioning it. Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 08/13/2014 01:04 AM, Andres Freund wrote: On 2014-08-12 12:44:22 +0300, Heikki Linnakangas wrote: Here's a new patch with those little things fixed. I've so far ignored this thread... I'm now taking a look. Unfortunately I have to admit I'm not unequivocally happy. * XLogRegisterData/XLogRegisterRecData imo don't really form a consistent API namewise. What does 'Rec' mean here? The latter function is actually called XLogRegisterBufData. The README was wrong, will fix. * I'm distinctly not a fan of passing around both the LSN and the struct XLogRecord to functions. We should bit the bullet and use something encapsulating both. You mean, the redo functions? Seems fine to me, and always it's been like that... * XLogReplayBuffer imo isn't a very good name - the buffer isn't replayed. If anything it's sometimes replaced by the backup block, but the real replay still happens outside of that function. I agree, but haven't come up with a better name. * Why do we need XLogBeginInsert(). Right now this leads to uglyness like duplicated if (RelationNeedsWAL()) wal checks and XLogCancelInsert() of inserts that haven't been started. I like clearly marking the beginning of the insert, with XLogBeginInsert(). We certainly could design it so that it's not needed, and you could just start registering stuff without calling XLogBeginInsert() first. But I believe it's more robust this way. For example, you will get an error at the next XLogBeginInsert(), if a previous operation had already registered some data without calling XLogInsert(). And if we have Begin, why do we also need Cancel? We could just automatically "cancel" any previous insertion when XLogBeginInsert() is called, but again it seems like bad form to leave references to buffers and data just lying there, if an operation bails out after registering some stuff and doesn't finish the insertion. * "XXX: do we need to do this for every page?" - yes, and it's every tuple, not every page... And It needs to be done *after* the tuple has been put on the page, not before. Otherwise the location will be wrong. That comment is in heap_multi_insert(). All the inserted tuples have the same command id, seems redundant to log multiple NEW_CID records for them. * The patch mixes the API changes around WAL records with changes of how individual actions are logged. That makes it rather hard to review - and it's a 500kb patch already. I realize it's hard to avoid because the new API changes which information about blocks is available, but it does make it really hard to see whether the old/new code is doing something equivalent. It's hard to find more critical code than this :/ Yeah, I hear you. I considered doing this piecemeal, just adding the new functions first so that you could still use the old XLogRecData API, until all the functions have been converted. But in the end, I figured it's not worth it, as sooner or later we'd want to convert all the functions anyway. Have you done any performance evaluation? No, not yet. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-08-12 12:44:22 +0300, Heikki Linnakangas wrote: > Here's a new patch with those little things fixed. I've so far ignored this thread... I'm now taking a look. Unfortunately I have to admit I'm not unequivocally happy. * XLogRegisterData/XLogRegisterRecData imo don't really form a consistent API namewise. What does 'Rec' mean here? * I'm distinctly not a fan of passing around both the LSN and the struct XLogRecord to functions. We should bit the bullet and use something encapsulating both. * XLogReplayBuffer imo isn't a very good name - the buffer isn't replayed. If anything it's sometimes replaced by the backup block, but the real replay still happens outside of that function. * Why do we need XLogBeginInsert(). Right now this leads to uglyness like duplicated if (RelationNeedsWAL()) wal checks and XLogCancelInsert() of inserts that haven't been started. And if we have Begin, why do we also need Cancel? * "XXX: do we need to do this for every page?" - yes, and it's every tuple, not every page... And It needs to be done *after* the tuple has been put on the page, not before. Otherwise the location will be wrong. * The patch mixes the API changes around WAL records with changes of how individual actions are logged. That makes it rather hard to review - and it's a 500kb patch already. I realize it's hard to avoid because the new API changes which information about blocks is available, but it does make it really hard to see whether the old/new code is doing something equivalent. It's hard to find more critical code than this :/ Have you done any performance evaluation? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
Heikki Linnakangas wrote: > On 08/05/2014 03:50 PM, Michael Paquier wrote: > >- XLogRecGetBlockRefIds needing an already-allocated array for *out is not > >user-friendly. Cannot we just do all the work inside this function? > > I agree it's not a nice API. Returning a palloc'd array would be > nicer, but this needs to work outside the backend too (e.g. > pg_xlogdump). It could return a malloc'd array in frontend code, but > it's not as nice. On the whole, do you think that be better than the > current approach? I don't see the issue. Right now, in frontend code you can call palloc() and pfree(), and they behave like malloc() and free() (see fe_memutils.c). Your module doesn't need to do anything special. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
Heikki Linnakangas wrote: > On 08/05/2014 03:50 PM, Michael Paquier wrote: > >- All the functions in xlogconstruct.c could be defined in a separate > >header xlogconstruct.h. What they do is rather independent from xlog.h. The > >same applies to all the structures and functions of xlogconstruct.c in > >xlog_internal.h: XLogRecordAssemble, XLogRecordAssemble, > >InitXLogRecordConstruct. (worth noticing as well that the only reason > >XLogRecData is defined externally of xlogconstruct.c is that it is being > >used in XLogInsert()). > > Hmm. I left the defines for xlogconstruct.c functions that are used > to build a WAL record in xlog.h, so that it's not necessary to > include both xlog.h and xlogconstruct.h in all files that write a > WAL record (XLogInsert() is defined in xlog.h). > > But perhaps it would be better to move the prototype for XLogInsert > to xlogconstruct.h too? That might be a better division; everything > needed to insert a WAL record would be in xlogconstruct.h, and > xlog.h would left for more internal functions related to WAL. And > perhaps rename the files to xloginsert.c and xloginsert.h. Yes, IMO ideally modules that only construct WAL records to insert, but are not directly involved with other XLog stuff, should only be using a lean header file, not the whole xlog.h. I imagine xlogconstruct.h would be such a file. (The patch you just posted doesn't have such a file -- AFAICS that stuff is all in xlog.h still). No opinion on xlogconstruct vs. xloginsert as file names. Both seem like good enough names to me. Unless others have stronger opinions, I would left that decision to you. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 08/05/2014 03:50 PM, Michael Paquier wrote: - When testing pg_xlogdump I found an assertion failure for record XLOG_GIN_INSERT: Assertion failed: (((bool) (((const void*)(&insertData->newitem.key) != ((void*)0)) && ((&insertData->newitem.key)->ip_posid != 0, function gin_desc, file gindesc.c, line 127. I could not reproduce this. Do you have a test case? - Would it be a win to add an assertion check for (CritSectionCount == 0) in XLogEnsureRecordSpace? Maybe; added. - There is no mention of REGBUF_NO_IMAGE in transam/README. Fixed - This patch adds a lot of blocks like that in the redo code: + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); What do you think about adding a new macro like UnlockBufferIfValid(buffer)? I don't think such a macro would be an improvement in readability, TBH. - Don't we need a call to XLogBeginInsert() in XLogSaveBufferForHint(): + int flags = REGBUF_FORCE_IMAGE; + if (buffer_std) + flags |= REGBUF_STANDARD; + XLogRegisterBuffer(0, buffer, flags); [...] - recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI, rdata); + recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI); Indeed, fixed. With that, "initdb -k" works, I had not tested the patch with page checksums before. - There is still a TODO item in ValidXLogRecordHeader to check if block data header is valid or not. Just mentioning. Removed. Previously, we ValidXLogRecordHeader checked that the xl_tot_len of a record doesn't exceed the size of header + xl_len + (space needed for max number of bkp blocks). But the new record format has no limit on the amount of data registered with a buffer, so such a test is not possible anymore. I added the TODO item there to remind me that I need to check if we need to do something else there instead, but I think it's fine as it is. We still sanity-check the block data in ValidXLogRecord; the ValidXLogRecordHeader() check happens before we have read the whole record header in memory. - XLogRecGetBlockRefIds needing an already-allocated array for *out is not user-friendly. Cannot we just do all the work inside this function? I agree it's not a nice API. Returning a palloc'd array would be nicer, but this needs to work outside the backend too (e.g. pg_xlogdump). It could return a malloc'd array in frontend code, but it's not as nice. On the whole, do you think that be better than the current approach? - All the functions in xlogconstruct.c could be defined in a separate header xlogconstruct.h. What they do is rather independent from xlog.h. The same applies to all the structures and functions of xlogconstruct.c in xlog_internal.h: XLogRecordAssemble, XLogRecordAssemble, InitXLogRecordConstruct. (worth noticing as well that the only reason XLogRecData is defined externally of xlogconstruct.c is that it is being used in XLogInsert()). Hmm. I left the defines for xlogconstruct.c functions that are used to build a WAL record in xlog.h, so that it's not necessary to include both xlog.h and xlogconstruct.h in all files that write a WAL record (XLogInsert() is defined in xlog.h). But perhaps it would be better to move the prototype for XLogInsert to xlogconstruct.h too? That might be a better division; everything needed to insert a WAL record would be in xlogconstruct.h, and xlog.h would left for more internal functions related to WAL. And perhaps rename the files to xloginsert.c and xloginsert.h. Here's a new patch with those little things fixed. - Heikki wal-format-and-api-changes-6.patch.gz Description: application/gzip -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Sat, Aug 2, 2014 at 1:40 AM, Heikki Linnakangas wrote: > I ran this through my WAL page comparison tool to verify that all the WAL > record types are replayed correctly (although I did some small cleanup after > that, so it's not impossible that I broke it again; will re-test before > committing). I have run as well some tests on it with a master a a standby to check WAL construction and replay: - regression tests as well as tests from contrib, isolation - WAL page comparison tool - Some tests with pgbench - When testing pg_xlogdump I found an assertion failure for record XLOG_GIN_INSERT: Assertion failed: (((bool) (((const void*)(&insertData->newitem.key) != ((void*)0)) && ((&insertData->newitem.key)->ip_posid != 0, function gin_desc, file gindesc.c, line 127. Then, I had a look at the code, and here are some comments: - This comment has been introduced with 96ef3b8 that has added page checksums. Perhaps the authors can tell what was the original purpose of this comment (Jeff, Simon). For now, it may be better to remove this FIXME and to let this comment as is. +* FIXME: I didn't understand below comment. Is it still relevant? +* * This also means there is no corresponding API call for this, so an * smgr implementation has no need to implement anything. Which means * nothing is needed in md.c etc - Would it be a win to add an assertion check for (CritSectionCount == 0) in XLogEnsureRecordSpace? - There is no mention of REGBUF_NO_IMAGE in transam/README. - This patch adds a lot of blocks like that in the redo code: + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); What do you think about adding a new macro like UnlockBufferIfValid(buffer)? - Don't we need a call to XLogBeginInsert() in XLogSaveBufferForHint(): + int flags = REGBUF_FORCE_IMAGE; + if (buffer_std) + flags |= REGBUF_STANDARD; + XLogRegisterBuffer(0, buffer, flags); [...] - recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI, rdata); + recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI); - There is a typo in the header of xlogrecord.h, the "to" is not needed: + * Definitions for to the WAL record format. - There is still a TODO item in ValidXLogRecordHeader to check if block data header is valid or not. Just mentioning. - Just came across that: s/reference refes to/reference refers to - XLogRecGetBlockRefIds needing an already-allocated array for *out is not user-friendly. Cannot we just do all the work inside this function? - All the functions in xlogconstruct.c could be defined in a separate header xlogconstruct.h. What they do is rather independent from xlog.h. The same applies to all the structures and functions of xlogconstruct.c in xlog_internal.h: XLogRecordAssemble, XLogRecordAssemble, InitXLogRecordConstruct. (worth noticing as well that the only reason XLogRecData is defined externally of xlogconstruct.c is that it is being used in XLogInsert()). The patch is more solid and better structured than the previous versions. It is in good shape. Regards, -- Michael
Re: [HACKERS] WAL format and API changes (9.5)
Here's a new version of this patch, please review. I've cleaned up a lot of stuff, fixed all the bugs reported so far, and a bunch of others I found myself while testing. I'm not going to explain again what the patch does; the README and comments should now be complete enough to explain how it works. If not, please speak up. In the previous version of this patch, I made the XLogRegisterData function to copy the WAL data to a temporary buffer, instead of constructing the XLogRecData chain. I decided to revert back to the old way after all; I still think that copying would probably be OK from a performance point of view, but that'd need more testing. We can still switch to doing it that way later; the XLogRecData struct is no longer exposed to the functions that generate the WAL records, so it would be a very isolated change. I moved the new functions for constructing the WAL record into a new file, xlogconstruct.c. XLogInsert() now just calls a function called XLogRecordAssemble(), which returns the full XLogRecData chain that includes all the data and backup blocks, ready to be written to the WAL buffer. All the code to construct the XLogRecData chain is now in xlogrecord.c; this makes XLogInsert() simpler and more readable. One change resulting from that worth mentioning is that XLogInsert() now always re-constructs the XLogRecordData chain, if after acquiring the WALInsertLock it sees that RedoRecPtr changed (i.e. a checkpoint just started). Before, it would recheck the LSNs on the individual buffers to see if it's necessary. This is simpler, and I don't think it makes any difference to performance in practice. I ran this through my WAL page comparison tool to verify that all the WAL record types are replayed correctly (although I did some small cleanup after that, so it's not impossible that I broke it again; will re-test before committing). - Heikki wal-format-and-api-changes-5.patch.gz Description: application/gzip -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Wed, Jul 2, 2014 at 4:23 PM, Michael Paquier wrote: > > > > On Wed, Jul 2, 2014 at 4:09 PM, Michael Paquier > wrote: >> >> 3) I noticed a bug in gin redo code path when trying to use the WAL replay >> facility. This patch has been on status "Waiting on author" for a little bit of time, marking it now as "Returned with feedback" as there are still issues with gin record construction and redo code paths. Feel free to disagree if you don't think this is correct. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Wed, Jul 2, 2014 at 4:09 PM, Michael Paquier wrote: > > 3) I noticed a bug in gin redo code path when trying to use the WAL replay > facility. > Looking at the backtrace, it seems that a page for gin is corrupted. The buffer capture patch does some sanity checks on the page format when performing masking and it is failing in one of them on a standby when kicking ginRedoUpdateMetapage: if (pd_lower > pd_upper || pd_special < pd_upper || pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ) { elog(ERROR, "invalid page at %X/%08X\n", ((PageHeader) page)->pd_lsn.xlogid, ((PageHeader) page)->pd_lsn.xrecoff); } frame #4: 0x00010437dec5 postgres`CheckForBufferLeaks + 165 at bufmgr.c:1778 frame #5: 0x00010437df1e postgres`AtProcExit_Buffers(code=1, arg=0) + 30 at bufmgr.c:1750 frame #6: 0x00010438fded postgres`shmem_exit(code=1) + 301 at ipc.c:263 frame #7: 0x00010438fc1c postgres`proc_exit_prepare(code=1) + 124 at ipc.c:187 frame #8: 0x00010438fb63 postgres`proc_exit(code=1) + 19 at ipc.c:102 frame #9: 0x000104555b3c postgres`errfinish(dummy=0) + 1180 at elog.c:555 frame #10: 0x0001045590de postgres`elog_finish(elevel=20, fmt=0x000104633d4f) + 830 at elog.c:1362 frame #11: 0x00010437c1af postgres`mask_unused_space(page=0x7fff5bc20a70) + 159 at bufcapt.c:78 frame #12: 0x00010437b53d postgres`mask_heap_page(page=0x7fff5bc20a70) + 29 at bufcapt.c:95 frame #13: 0x00010437b1cd postgres`buffer_capture_write(newcontent=0x000104ab3980, blkno=0) + 205 at bufcapt.c:329 frame #14: 0x00010437bc7d postgres`buffer_capture_forget(buffer=82) + 349 at bufcapt.c:433 frame #15: 0x0001043801c9 postgres`LockBuffer(buffer=82, mode=0) + 233 at bufmgr.c:2773 frame #16: 0x0001043800c8 postgres`UnlockReleaseBuffer(buffer=82) + 24 at bufmgr.c:2554 frame #17: 0x000103fefb03 postgres`ginRedoUpdateMetapage(lsn=335350144, record=0x7fb1740382f0) + 1843 at ginxlog.c:580 frame #18: 0x000103fede96 postgres`gin_redo(lsn=335350144, record=0x7fb1740382f0) + 534 at ginxlog.c:724 frame #19: 0x0001040ad692 postgres`StartupXLOG + 8482 at xlog.c:6810 frame #20: 0x000104330e0e postgres`StartupProcessMain + 430 at startup.c:224 frame #21: 0x0001040c64d9 postgres`AuxiliaryProcessMain(argc=2, argv=0x7fff5bc231b0) + 1897 at bootstrap.c:416 frame #22: 0x00010432b1a8 postgres`StartChildProcess(type=StartupProcess) + 328 at postmaster.c:5090 frame #23: 0x0001043292b9 postgres`PostmasterMain(argc=3, argv=0x7fb173c044e0) + 5401 at postmaster.c:1212 frame #24: 0x00010426f995 postgres`main(argc=3, argv=0x7fb173c044e0) + 773 at main.c:219 Note that I have never seen that with vanilla, only with this patch. Regards, -- Michael
Re: [HACKERS] WAL format and API changes (9.5)
On Tue, Jul 1, 2014 at 5:51 AM, Heikki Linnakangas wrote: > On 06/27/2014 09:12 AM, Michael Paquier wrote: > >> 2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and >> XLogGetBlockRefIds are missing in src/backend/access/transam/README. >> > > Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure > what to do with the latter two. They're not really intended for use by redo > routines, but for things like pg_xlogdump that want to extract information > about any record type. That's definitely not part of the redo section or WAL construction section. It could be a new section in the README describing how to get the list of blocks touched by a WAL record. I'd rather have those APIs documented somewhere than nowhere, and the README would be well-suited for that (as well as in-code comments) as those APIs are aimed at developers. > 3) There are a couple of code blocks marked as FIXME and BROKEN. There are >> as well some NOT_USED blocks. >> > > Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They > mostly stand for code that used to print block numbers or relfilenodes, > which doesn't make much sense to do in an ad hoc way in rmgr-specific desc > routines anymore. Will need to go through them all an decide what's the > important information to print for each record type. > > The few NOT_USED blocks are around code in redo routines that extract some > information from the WAL record, that isn't needed anymore. We could remove > the fields from the WAL records altogether, but might be useful to keep for > debugging purposes. They could be removed and be kept as a part of the git history... That's not really a vital point btw. > > 6) XLogRegisterData creates a XLogRecData entry and appends it in one of >> the static XLogRecData entries if buffer_id >= 0 or to >> rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the >> WAL record). XLogRegisterXLogRecData does something similar, but with an >> entry of XLogRecData already built. What do you think about removing >> XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the >> interface simpler, and XLogRecData could be kept private in xlog.c. This >> would need more work especially on gin side where I am seeing for example >> constructLeafRecompressWALData directly building a XLogRecData entry. >> > > Another big change in the attached patch version is that > XLogRegisterData() now *copies* the data to a working area, instead of > merely adding a XLogRecData entry to the chain. This simplifies things in > xlog.c, now that the XLogRecData concept is not visible to the rest of the > system. This is a subtle semantic change for the callers: you can no longer > register a struct first, and fill the fields afterwards. > > That adds a lot of memcpys to the critical path, which could be bad for > performance. In practice, I think it's OK, or even a win, because a typical > WAL record payload is very small. But I haven't measured that. If it > becomes an issue, we could try to optimize, and link larger data blocks to > the rdata chain, but memcpy small small chunks of data. There are a few WAL > record types that don't fit in a small statically allocated buffer, so I > provided a new function, XLogEnsureSpace(), that can be called before > XLogBegin() to allocate a large-enough working area. > I just ran the following test on my laptop with your latest patch: - Patched version: =# CREATE TABLE foo AS SELECT generate_series(1,1000) as a; SELECT 1000 Time: 32913.419 ms =# CREATE TABLE foo2 AS SELECT generate_series(1,1000) as a; SELECT 1000 Time: 32261.045 ms - master (d97e98e): =# CREATE TABLE foo AS SELECT generate_series(1,1000) as a; SELECT 1000 Time: 29651.394 ms =# CREATE TABLE foo2 AS SELECT generate_series(1,1000) as a; SELECT 1000 Time: 29734.887 ms 10% difference... That was just a quick test though. These changes required changing a couple of places where WAL records are > created: > > * ginPlaceToPage. This function has a strange split of responsibility > between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage. > ginPlaceToPage calls dataPlaceToPage or entryPlaceToPage, depending on what > kind of a tree it is, and dataPlaceToPage or entryPlaceToPage contributes > some data to the WAL record. Then ginPlaceToPage adds some more, and > finally calls XLogInsert(). I simplified the SPLIT case so that we now just > create full page images of both page halves. We were already logging all > the data on both page halves, but we were doing it in a "smarter" way, > knowing what kind of pages they were. For example, for an entry tree page, > we included an IndexTuple struct for every tuple on the page, but didn't > include the item pointers. A straight full page image takes more space, but > not much, so I think in any real application it's going to be lost in > noise. (again, haven't measured it though) Thanks, the code looks cleaner t
Re: [HACKERS] WAL format and API changes (9.5)
On Tue, Jul 1, 2014 at 7:18 AM, Fujii Masao wrote: > On Tue, Jul 1, 2014 at 5:51 AM, Heikki Linnakangas > wrote: > > Thanks for the review! > > > > > > On 06/27/2014 09:12 AM, Michael Paquier wrote: > >> > >> Looking at this patch, it does not compile when assertions are enabled > >> because of this assertion in heapam.c: > >> Assert(recdata == data + datalen); > >> It seems like a thinko as removing it makes the code compile. > > > > > > Fixed. > > > > > >> Some comments about the code: > >> 1) In src/backend/access/transam/README: > >> 's/no further action is no required/no further action is required/g' > > > > > > Fixed. > > > > > >> 2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and > >> XLogGetBlockRefIds are missing in src/backend/access/transam/README. > > > > > > Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure > > what to do with the latter two. They're not really intended for use by > redo > > routines, but for things like pg_xlogdump that want to extract > information > > about any record type. > > > > > >> 3) There are a couple of code blocks marked as FIXME and BROKEN. There > are > >> as well some NOT_USED blocks. > > > > > > Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They > mostly > > stand for code that used to print block numbers or relfilenodes, which > > doesn't make much sense to do in an ad hoc way in rmgr-specific desc > > routines anymore. Will need to go through them all an decide what's the > > important information to print for each record type. > > > > The few NOT_USED blocks are around code in redo routines that extract > some > > information from the WAL record, that isn't needed anymore. We could > remove > > the fields from the WAL records altogether, but might be useful to keep > for > > debugging purposes. > > > > > >> 4) Current patch crashes when running make check in > >> contrib/test_decoding. > >> Looking closer, wal_level = logical is broken: > >> =# show wal_level; > >> wal_level > >> --- > >> logical > >> (1 row) > >> =# create database foo; > >> The connection to the server was lost. Attempting reset: Failed. > > > > > > Fixed. > > > > > >> Looking even closer, log_heap_new_cid is broken: > > > > > > Fixed. > > > > > >> 6) XLogRegisterData creates a XLogRecData entry and appends it in one of > >> the static XLogRecData entries if buffer_id >= 0 or to > >> rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to > the > >> WAL record). XLogRegisterXLogRecData does something similar, but with an > >> entry of XLogRecData already built. What do you think about removing > >> XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes > the > >> interface simpler, and XLogRecData could be kept private in xlog.c. This > >> would need more work especially on gin side where I am seeing for > example > >> constructLeafRecompressWALData directly building a XLogRecData entry. > > > > > > Hmm. I considered that, but punted because I didn't want to rewrite all > the > > places that currently use XLogRecDatas in less straightforward ways. I > kept > > XLogRegisterXLogRecData as an escape hatch for those. > > > > But I bit the bullet now, and got rid of all the > XLogRegisterXLogRecData() > > calls. XLogRecData struct is now completely private to xlog.c. > > > > Another big change in the attached patch version is that > XLogRegisterData() > > now *copies* the data to a working area, instead of merely adding a > > XLogRecData entry to the chain. This simplifies things in xlog.c, now > that > > the XLogRecData concept is not visible to the rest of the system. This > is a > > subtle semantic change for the callers: you can no longer register a > struct > > first, and fill the fields afterwards. > > > > That adds a lot of memcpys to the critical path, which could be bad for > > performance. In practice, I think it's OK, or even a win, because a > typical > > WAL record payload is very small. But I haven't measured that. If it > becomes > > an issue, we could try to optimize, and link larger data blocks to the > rdata > > chain, but memcpy small small chunks of data. There are a few WAL record > > types that don't fit in a small statically allocated buffer, so I > provided a > > new function, XLogEnsureSpace(), that can be called before XLogBegin() to > > allocate a large-enough working area. > > > > These changes required changing a couple of places where WAL records are > > created: > > > > * ginPlaceToPage. This function has a strange split of responsibility > > between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage. > ginPlaceToPage > > calls dataPlaceToPage or entryPlaceToPage, depending on what kind of a > tree > > it is, and dataPlaceToPage or entryPlaceToPage contributes some data to > the > > WAL record. Then ginPlaceToPage adds some more, and finally calls > > XLogInsert(). I simplified the SPLIT case so that we now just create full > > page images of both page hal
Re: [HACKERS] WAL format and API changes (9.5)
On Mon, Jun 30, 2014 at 4:51 PM, Heikki Linnakangas wrote: > Another big change in the attached patch version is that XLogRegisterData() > now *copies* the data to a working area, instead of merely adding a > XLogRecData entry to the chain. This simplifies things in xlog.c, now that > the XLogRecData concept is not visible to the rest of the system. This is a > subtle semantic change for the callers: you can no longer register a struct > first, and fill the fields afterwards. > > That adds a lot of memcpys to the critical path, which could be bad for > performance. In practice, I think it's OK, or even a win, because a typical > WAL record payload is very small. But I haven't measured that. If it becomes > an issue, we could try to optimize, and link larger data blocks to the rdata > chain, but memcpy small small chunks of data. There are a few WAL record > types that don't fit in a small statically allocated buffer, so I provided a > new function, XLogEnsureSpace(), that can be called before XLogBegin() to > allocate a large-enough working area. On the other hand, the patch I just committed (9f03ca915196dfc871804a1f8aad26207f601fd6) demonstrates that even copying relatively small amounts of data can be expensive if you do it frequently, and certainly WAL insertions are frequent. Of course, some of the overhead there is from palloc() and friends rather than from the copying itself, but the copying itself isn't free, either. And some WAL records can be really big. Of course, it could still work out to a win if the simplifications in xlog.c save enough. I don't know whether that's the case or not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Tue, Jul 1, 2014 at 5:51 AM, Heikki Linnakangas wrote: > Thanks for the review! > > > On 06/27/2014 09:12 AM, Michael Paquier wrote: >> >> Looking at this patch, it does not compile when assertions are enabled >> because of this assertion in heapam.c: >> Assert(recdata == data + datalen); >> It seems like a thinko as removing it makes the code compile. > > > Fixed. > > >> Some comments about the code: >> 1) In src/backend/access/transam/README: >> 's/no further action is no required/no further action is required/g' > > > Fixed. > > >> 2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and >> XLogGetBlockRefIds are missing in src/backend/access/transam/README. > > > Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure > what to do with the latter two. They're not really intended for use by redo > routines, but for things like pg_xlogdump that want to extract information > about any record type. > > >> 3) There are a couple of code blocks marked as FIXME and BROKEN. There are >> as well some NOT_USED blocks. > > > Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They mostly > stand for code that used to print block numbers or relfilenodes, which > doesn't make much sense to do in an ad hoc way in rmgr-specific desc > routines anymore. Will need to go through them all an decide what's the > important information to print for each record type. > > The few NOT_USED blocks are around code in redo routines that extract some > information from the WAL record, that isn't needed anymore. We could remove > the fields from the WAL records altogether, but might be useful to keep for > debugging purposes. > > >> 4) Current patch crashes when running make check in >> contrib/test_decoding. >> Looking closer, wal_level = logical is broken: >> =# show wal_level; >> wal_level >> --- >> logical >> (1 row) >> =# create database foo; >> The connection to the server was lost. Attempting reset: Failed. > > > Fixed. > > >> Looking even closer, log_heap_new_cid is broken: > > > Fixed. > > >> 6) XLogRegisterData creates a XLogRecData entry and appends it in one of >> the static XLogRecData entries if buffer_id >= 0 or to >> rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the >> WAL record). XLogRegisterXLogRecData does something similar, but with an >> entry of XLogRecData already built. What do you think about removing >> XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the >> interface simpler, and XLogRecData could be kept private in xlog.c. This >> would need more work especially on gin side where I am seeing for example >> constructLeafRecompressWALData directly building a XLogRecData entry. > > > Hmm. I considered that, but punted because I didn't want to rewrite all the > places that currently use XLogRecDatas in less straightforward ways. I kept > XLogRegisterXLogRecData as an escape hatch for those. > > But I bit the bullet now, and got rid of all the XLogRegisterXLogRecData() > calls. XLogRecData struct is now completely private to xlog.c. > > Another big change in the attached patch version is that XLogRegisterData() > now *copies* the data to a working area, instead of merely adding a > XLogRecData entry to the chain. This simplifies things in xlog.c, now that > the XLogRecData concept is not visible to the rest of the system. This is a > subtle semantic change for the callers: you can no longer register a struct > first, and fill the fields afterwards. > > That adds a lot of memcpys to the critical path, which could be bad for > performance. In practice, I think it's OK, or even a win, because a typical > WAL record payload is very small. But I haven't measured that. If it becomes > an issue, we could try to optimize, and link larger data blocks to the rdata > chain, but memcpy small small chunks of data. There are a few WAL record > types that don't fit in a small statically allocated buffer, so I provided a > new function, XLogEnsureSpace(), that can be called before XLogBegin() to > allocate a large-enough working area. > > These changes required changing a couple of places where WAL records are > created: > > * ginPlaceToPage. This function has a strange split of responsibility > between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage. ginPlaceToPage > calls dataPlaceToPage or entryPlaceToPage, depending on what kind of a tree > it is, and dataPlaceToPage or entryPlaceToPage contributes some data to the > WAL record. Then ginPlaceToPage adds some more, and finally calls > XLogInsert(). I simplified the SPLIT case so that we now just create full > page images of both page halves. We were already logging all the data on > both page halves, but we were doing it in a "smarter" way, knowing what kind > of pages they were. For example, for an entry tree page, we included an > IndexTuple struct for every tuple on the page, but didn't include the item > pointers. A straight full page image takes more space,
Re: [HACKERS] WAL format and API changes (9.5)
On Sun, Jun 15, 2014 at 6:11 AM, Heikki Linnakangas wrote: > On 06/14/2014 09:43 PM, Alvaro Herrera wrote: > >> Heikki Linnakangas wrote: >> >> Here's a new version, rebased against master. No other changes. >>> >> >> This is missing xlogrecord.h ... >> > > Oops, here you are. > Looking at this patch, it does not compile when assertions are enabled because of this assertion in heapam.c: Assert(recdata == data + datalen); It seems like a thinko as removing it makes the code compile. Here is a review of this patch: - Compilation without warnings, regression tests pass - That's not a small patch, but the changes mainly done are xlog record insertion in accordance to the new API format. $ git diff master --stat | tail -n1 52 files changed, 3601 insertions(+), 4371 deletions(-) Some comments about the code: 1) In src/backend/access/transam/README: 's/no further action is no required/no further action is required/g' 2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and XLogGetBlockRefIds are missing in src/backend/access/transam/README. 3) There are a couple of code blocks marked as FIXME and BROKEN. There are as well some NOT_USED blocks. 4) Current patch crashes when running make check in contrib/test_decoding. Looking closer, wal_level = logical is broken: =# show wal_level; wal_level --- logical (1 row) =# create database foo; The connection to the server was lost. Attempting reset: Failed. Looking even closer, log_heap_new_cid is broken: (lldb) bt * thread #1: tid = 0x, 0x7fff8a3c2866 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP * frame #0: 0x7fff8a3c2866 libsystem_kernel.dylib`__pthread_kill + 10 frame #1: 0x7fff8d37235c libsystem_pthread.dylib`pthread_kill + 92 frame #2: 0x7fff9369db1a libsystem_c.dylib`abort + 125 frame #3: 0x000104428819 postgres`ExceptionalCondition(conditionName=0x0001044a711c, errorType=0x00010448b19f, fileName=0x0001044a2cfd, lineNumber=6879) + 137 at assert.c:54 frame #4: 0x000103f08dbe postgres`log_heap_new_cid(relation=0x7fae4482afb8, tup=0x7fae438062e8) + 126 at heapam.c:6879 frame #5: 0x000103f080cf postgres`heap_insert(relation=0x7fae4482afb8, tup=0x7fae438062e8, cid=0, options=0, bistate=0x) + 383 at heapam.c:2095 frame #6: 0x000103f09b6a postgres`simple_heap_insert(relation=0x7fae4482afb8, tup=0x7fae438062e8) + 74 at heapam.c:2533 frame #7: 0x00010406aacf postgres`createdb(stmt=0x7fae44843690) + 6335 at dbcommands.c:511 frame #8: 0x00010429def8 postgres`standard_ProcessUtility(parsetree=0x7fae44843690, queryString=0x7fae44842c38, context=PROCESS_UTILITY_TOPLEVEL, params=0x, dest=0x7fae44843a18, completionTag=0x7fff5bd4ee30) + 1720 at utility.c:56 5) Yes,there are some WAL records that have only data related to buffers, XLOG_GIN_VACUUM_DATA_LEAF_PAGE being one, with nothing registered with buffer_id = -d, but using a dummy entry seems counter-productive: + rdata = rdata_head; + if (rdata == NULL) + { + /* +* the rest of this function assumes that there is at least some +* rdata entries, so fake an empty rdata entry +*/ + dummy_rdata.data = NULL; + dummy_rdata.len = 0; + dummy_rdata.next = NULL; + rdata = &dummy_rdata; + } Could this be removed and replaced by a couple of checks on rdata being NULL in some cases? XLogInsert should be updated in consequence. 6) XLogRegisterData creates a XLogRecData entry and appends it in one of the static XLogRecData entries if buffer_id >= 0 or to rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the WAL record). XLogRegisterXLogRecData does something similar, but with an entry of XLogRecData already built. What do you think about removing XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the interface simpler, and XLogRecData could be kept private in xlog.c. This would need more work especially on gin side where I am seeing for example constructLeafRecompressWALData directly building a XLogRecData entry. By doing so, we could as remove the while loop at the bottom of XLogRegisterXLogRecData as we would insert in the tail only the latest record created, aka replacing that: while ((*tail)->next) *tail = (*tail)->next; By simply that: *tail = (*tail)->next; 7) New structures at the top of xlog.c should have more comments about their role, particularly rdata_head and rdata_tail that store main WAL data (id of -1) 8) What do you think about adding in the README a description about how to retrieve the block list modified by a WAL record, for a flow similar to that: record = XLogReadRecord(...); nblocks = XLogGetBlockRefIds(record, blockids); for (i = 0; i < nblocks; i++) { bkpb = XLogGetBlockRef(record, id, NULL); // stuff }
Re: [HACKERS] WAL format and API changes (9.5)
Heikki Linnakangas wrote: > Here's a new version, rebased against master. No other changes. This is missing xlogrecord.h ... -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Thu, Apr 3, 2014 at 7:44 PM, Heikki Linnakangas wrote: > I'd like to do some changes to the WAL format in 9.5. I want to annotate > each WAL record with the blocks that they modify. Every WAL record already > includes that information, but it's done in an ad hoc way, differently in > every rmgr. The RelFileNode and block number are currently part of the WAL > payload, and it's the REDO routine's responsibility to extract it. I want to > include that information in a common format for every WAL record type. > > That makes life a lot easier for tools that are interested in knowing which > blocks a WAL record modifies. One such tool is pg_rewind; it currently has > to understand every WAL record the backend writes. There's also a tool out > there called pg_readahead, which does prefetching of blocks accessed by WAL > records, to speed up PITR. I don't think that tool has been actively > maintained, but at least part of the reason for that is probably that it's a > pain to maintain when it has to understand the details of every WAL record > type. > > It'd also be nice for contrib/pg_xlogdump and backend code itself. The > boilerplate code in all WAL redo routines, and writing WAL records, could be > simplified. I think it will also be useful, if we want to implement table/tablespace PITR. > > That's for the normal cases. We'll need a couple of variants for also > registering buffers that don't need full-page images, and perhaps also a > function for registering a page that *always* needs a full-page image, > regardless of the LSN. A few existing WAL record types just WAL-log the > whole page, so those ad-hoc full-page images could be replaced with this. > > With these changes, a typical WAL insertion would look like this: > > /* register the buffer with the WAL record, with ID 0 */ > XLogRegisterBuffer(0, buf, true); > > rdata[0].data = (char *) &xlrec; > rdata[0].len = sizeof(BlahRecord); > rdata[0].buffer_id = -1; /* -1 means the data is always included */ > rdata[0].next = &(rdata[1]); > > rdata[1].data = (char *) mydata; > rdata[1].len = mydatalen; > rdata[1].buffer_id = 0; /* 0 here refers to the buffer registered > above */ > rdata[1].next = NULL > > ... > recptr = XLogInsert(RM_BLAH_ID, xlinfo, rdata); > > PageSetLSN(buf, recptr); If we do register buffer's (that require or don't require FPI) before calling XLogInsert(), then will there be any impact to handle case where we come to know that we need to backup the buffer after taking WALInsertLock.. or will that part of code remains same as it is today. > Redo > > > There are four different states a block referenced by a typical WAL record > can be in: > > 1. The old page does not exist at all (because the relation was truncated > later) > 2. The old page exists, but has an LSN higher than current WAL record, so it > doesn't need replaying. > 3. The LSN is < current WAL record, so it needs to be replayed. > 4. The WAL record contains a full-page image, which needs to be restored. I think there might be a need to have separate handling for some special cases like Init Page which is used in few ops (Insert/Update/multi-insert). Is there any criteria to decide if it needs to be a separate state or a special handling for operations? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Allocations in critical section (was Re: [HACKERS] WAL format and API changes (9.5))
On 04/04/2014 04:56 PM, Tom Lane wrote: Heikki Linnakangas writes: Ok, I fixed the issues that the assertion fixed. I also committed a patch to add the assertion itself; let's see if the buildfarm finds more cases that violate the rule. It ignores the checkpointer, because it's known to violate the rule, ... uh, isn't that a bug to be fixed? Yes. One step a time ;-). and allocations in ErrorContext, which is used during error recovery, e.g if you indeed PANIC while in a critical section for some other reason. Yeah, I realized we'd have to do something about elog's own allocations. Not sure if a blanket exemption for ErrorContext is the best way. I'd been thinking of having a way to turn off the complaint once processing of an elog(PANIC) has started. Hmm. PANIC processing should avoid allocations too, except in ErrorContext, because otherwise you might get an out-of-memory during PANIC processing. ErrorContext also covers elog(DEBUG2, ...). I presume we'll want to ignore that too. Although I also tried without the exemption for ErrorContext at first, and didn't get any failures from the regression tests, so I guess we never do that in a critical section. I was a bit surprised by that. BTW, I'm pretty sure you added some redundant assertions in mcxt.c. eg, palloc does not need its own copy. palloc is copy-pasted from MemoryContextAlloc - it does need its own copy. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Allocations in critical section (was Re: [HACKERS] WAL format and API changes (9.5))
Heikki Linnakangas writes: > Ok, I fixed the issues that the assertion fixed. I also committed a > patch to add the assertion itself; let's see if the buildfarm finds more > cases that violate the rule. > It ignores the checkpointer, because it's known to violate the rule, ... uh, isn't that a bug to be fixed? > and > allocations in ErrorContext, which is used during error recovery, e.g if > you indeed PANIC while in a critical section for some other reason. Yeah, I realized we'd have to do something about elog's own allocations. Not sure if a blanket exemption for ErrorContext is the best way. I'd been thinking of having a way to turn off the complaint once processing of an elog(PANIC) has started. > I didn't backpatch this. Agreed. BTW, I'm pretty sure you added some redundant assertions in mcxt.c. eg, palloc does not need its own copy. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Allocations in critical section (was Re: [HACKERS] WAL format and API changes (9.5))
Ok, I fixed the issues that the assertion fixed. I also committed a patch to add the assertion itself; let's see if the buildfarm finds more cases that violate the rule. It ignores the checkpointer, because it's known to violate the rule, and allocations in ErrorContext, which is used during error recovery, e.g if you indeed PANIC while in a critical section for some other reason. I didn't backpatch this. Although you shouldn't be running with assertions enabled in production, it nevertheless seems too risky. There might be some obscure cases where there is no real risk, e.g because the current memory context always has enough free space because of a previous pfree, and it doesn't seem worth tracking down and fixing such issues in backbranches. You have to be pretty unlucky to run out of memory in a critical section to begin with. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-04-04 12:50:25 +0300, Heikki Linnakangas wrote: > On 04/04/2014 11:41 AM, Andres Freund wrote: > >On 2014-04-04 10:48:32 +0300, Heikki Linnakangas wrote: > >>@@ -484,10 +483,11 @@ PageRepairFragmentation(Page page) > >>((PageHeader) page)->pd_upper = pd_special; > >>} > >>else > >>- { /* nstorage != > >>0 */ > >>+ { > >>/* Need to compact the page the hard way */ > >>- itemidbase = (itemIdSort) palloc(sizeof(itemIdSortData) * > >>nstorage); > >>- itemidptr = itemidbase; > >>+ itemIdSortData itemidbase[MaxHeapTuplesPerPage]; > >>+ itemIdSort itemidptr = itemidbase; > >>+ > > > >That's a fair bit of stack, and it can be called somewhat deep on the > >stack via heap_page_prune_opt(). I wonder if we ought to add a > >check_stack_depth() somewhere. > > Hmm, on my 64-bit laptop, that array is 24*291=6984 bytes with 8k block > size. That's fairly large, but not unheard of; there are a few places where > we allocate a BLCKSZ-sized buffer from stack. Yea, I am not complaing about using so much stack. Seems sensible here. > But overall I wouldn't worry about it. check_stack_depth() leaves a fair > amount of headroom: STACK_DEPTH_SLOP is 512kB. As long as we don't recurse, > that's plenty. Well, there's no checks at nearby afair. That's why I was wondering... But I don't have a big problem with not checking, I just wanted to bring it up. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 04/04/2014 11:41 AM, Andres Freund wrote: On 2014-04-04 10:48:32 +0300, Heikki Linnakangas wrote: @@ -484,10 +483,11 @@ PageRepairFragmentation(Page page) ((PageHeader) page)->pd_upper = pd_special; } else - { /* nstorage != 0 */ + { /* Need to compact the page the hard way */ - itemidbase = (itemIdSort) palloc(sizeof(itemIdSortData) * nstorage); - itemidptr = itemidbase; + itemIdSortData itemidbase[MaxHeapTuplesPerPage]; + itemIdSort itemidptr = itemidbase; + That's a fair bit of stack, and it can be called somewhat deep on the stack via heap_page_prune_opt(). I wonder if we ought to add a check_stack_depth() somewhere. Hmm, on my 64-bit laptop, that array is 24*291=6984 bytes with 8k block size. That's fairly large, but not unheard of; there are a few places where we allocate a BLCKSZ-sized buffer from stack. We could easily reduce the stack consumption here by changing itemIdSort to use 16-bit ints; all the lengths and offsets that PageRepairFragmentation deals with fit in 16-bits. But overall I wouldn't worry about it. check_stack_depth() leaves a fair amount of headroom: STACK_DEPTH_SLOP is 512kB. As long as we don't recurse, that's plenty. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
Hi, On 2014-04-04 10:48:32 +0300, Heikki Linnakangas wrote: > But if we give the checkpointer process a free pass, running the regression > tests with an assertion in AllocSetAlloc catches five genuine bugs: > > 1. _bt_newroot > 2. XLogFileInit > 3. spgPageIndexMultiDelete > 4. PageRepairFragmentation > 5. PageIndexMultiDelete Some of those, like PageRepairFragmentation, are somewhat bad... > @@ -484,10 +483,11 @@ PageRepairFragmentation(Page page) > ((PageHeader) page)->pd_upper = pd_special; > } > else > - { /* nstorage != > 0 */ > + { > /* Need to compact the page the hard way */ > - itemidbase = (itemIdSort) palloc(sizeof(itemIdSortData) * > nstorage); > - itemidptr = itemidbase; > + itemIdSortData itemidbase[MaxHeapTuplesPerPage]; > + itemIdSort itemidptr = itemidbase; > + That's a fair bit of stack, and it can be called somewhat deep on the stack via heap_page_prune_opt(). I wonder if we ought to add a check_stack_depth() somewhere. Thanks, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 04/04/2014 02:40 AM, Andres Freund wrote: On 2014-04-03 19:33:12 -0400, Tom Lane wrote: Andres Freund writes: On 2014-04-03 19:08:27 -0400, Tom Lane wrote: A somewhat more relevant concern is where are we going to keep the state data involved in all this. Since this code is, by definition, going to be called in critical sections, any solution involving internal pallocs is right out. We actually already allocate memory in XLogInsert() :(, although only in the first XLogInsert() a backend does... Ouch. I wonder if we should put an Assert(not-in-critical-section) into MemoryContextAlloc. Good idea! XLogInsert() is using malloc() directly, so that wouldn't detect this case... It's not a bad idea tho. I wonder how far the regression tests get... Not even through initdb. GetVirtualXIDsDelayingChkpt() is to blame. Hmm, the checkpointer process seems to ignore the rule, and just hopes for the best. The allocation in GetVirtualXIDsDelayingChkpt() was probably just an oversight, and that call could be moved outside the critical section, but we also have this in AbsortFsyncRequests(): /* * We have to PANIC if we fail to absorb all the pending requests (eg, * because our hashtable runs out of memory). This is because the system * cannot run safely if we are unable to fsync what we have been told to * fsync. Fortunately, the hashtable is so small that the problem is * quite unlikely to arise in practice. */ START_CRIT_SECTION(); Perhaps we could fix that by just calling fsync() directly if the allocation fails, like we do if ForwardFsyncRequest() fails. But if we give the checkpointer process a free pass, running the regression tests with an assertion in AllocSetAlloc catches five genuine bugs: 1. _bt_newroot 2. XLogFileInit 3. spgPageIndexMultiDelete 4. PageRepairFragmentation 5. PageIndexMultiDelete I'll commit the attached patch to fix those shortly. - Heikki diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index d2ca8d9..922412e 100644 --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -1995,8 +1995,10 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf) BTPageOpaque lopaque; ItemId itemid; IndexTuple item; - Size itemsz; - IndexTuple new_item; + IndexTuple left_item; + Size left_item_sz; + IndexTuple right_item; + Size right_item_sz; Buffer metabuf; Page metapg; BTMetaPageData *metad; @@ -2016,6 +2018,26 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf) metapg = BufferGetPage(metabuf); metad = BTPageGetMeta(metapg); + /* + * Create downlink item for left page (old root). Since this will be the + * first item in a non-leaf page, it implicitly has minus-infinity key + * value, so we need not store any actual key in it. + */ + left_item_sz = sizeof(IndexTupleData); + left_item = (IndexTuple) palloc(left_item_sz); + left_item->t_info = left_item_sz; + ItemPointerSet(&(left_item->t_tid), lbkno, P_HIKEY); + + /* + * Create downlink item for right page. The key for it is obtained from + * the "high key" position in the left page. + */ + itemid = PageGetItemId(lpage, P_HIKEY); + right_item_sz = ItemIdGetLength(itemid); + item = (IndexTuple) PageGetItem(lpage, itemid); + right_item = CopyIndexTuple(item); + ItemPointerSet(&(right_item->t_tid), rbkno, P_HIKEY); + /* NO EREPORT(ERROR) from here till newroot op is logged */ START_CRIT_SECTION(); @@ -2034,16 +2056,6 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf) metad->btm_fastlevel = rootopaque->btpo.level; /* - * Create downlink item for left page (old root). Since this will be the - * first item in a non-leaf page, it implicitly has minus-infinity key - * value, so we need not store any actual key in it. - */ - itemsz = sizeof(IndexTupleData); - new_item = (IndexTuple) palloc(itemsz); - new_item->t_info = itemsz; - ItemPointerSet(&(new_item->t_tid), lbkno, P_HIKEY); - - /* * Insert the left page pointer into the new root page. The root page is * the rightmost page on its level so there is no "high key" in it; the * two items will go into positions P_HIKEY and P_FIRSTKEY. @@ -2051,32 +2063,20 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf) * Note: we *must* insert the two items in item-number order, for the * benefit of _bt_restore_page(). */ - if (PageAddItem(rootpage, (Item) new_item, itemsz, P_HIKEY, + if (PageAddItem(rootpage, (Item) left_item, left_item_sz, P_HIKEY, false, false) == InvalidOffsetNumber) elog(PANIC, "failed to add leftkey to new root page" " while splitting block %u of index \"%s\"", BufferGetBlockNumber(lbuf), RelationGetRelationName(rel)); - pfree(new_item); - - /* - * Create downlink item for right page. The key for it is obtained from - * the "high key" position in the left page. - */ - itemid = PageGetItemId(lpage, P_HIKEY); - itemsz = ItemIdGe
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-04-03 19:33:12 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2014-04-03 19:08:27 -0400, Tom Lane wrote: > >> A somewhat more relevant concern is where are we going to keep the state > >> data involved in all this. Since this code is, by definition, going to be > >> called in critical sections, any solution involving internal pallocs is > >> right out. > > > We actually already allocate memory in XLogInsert() :(, although only in > > the first XLogInsert() a backend does... > > Ouch. I wonder if we should put an Assert(not-in-critical-section) > into MemoryContextAlloc. XLogInsert() is using malloc() directly, so that wouldn't detect this case... It's not a bad idea tho. I wonder how far the regression tests get... Not even through initdb. GetVirtualXIDsDelayingChkpt() is to blame. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
Andres Freund writes: > On 2014-04-03 19:08:27 -0400, Tom Lane wrote: >> A somewhat more relevant concern is where are we going to keep the state >> data involved in all this. Since this code is, by definition, going to be >> called in critical sections, any solution involving internal pallocs is >> right out. > We actually already allocate memory in XLogInsert() :(, although only in > the first XLogInsert() a backend does... Ouch. I wonder if we should put an Assert(not-in-critical-section) into MemoryContextAlloc. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-04-03 19:08:27 -0400, Tom Lane wrote: > Robert Haas writes: > > On Thu, Apr 3, 2014 at 10:14 AM, Heikki Linnakangas > > wrote: > >> (Instead of using a new XLogRegisterBuffer() function to register the > >> buffers, perhaps they should be passed to XLogInsert as a separate list or > >> array. I'm not wedded on the details...) > > > That would have the advantage of avoiding the function-call overhead, > > which seems appealing. > > You're kidding, right? One function call per buffer touched by an update > is going to be lost in the noise. > > A somewhat more relevant concern is where are we going to keep the state > data involved in all this. Since this code is, by definition, going to be > called in critical sections, any solution involving internal pallocs is > right out. We actually already allocate memory in XLogInsert() :(, although only in the first XLogInsert() a backend does... I didn't remember it, but I complained about it before: http://archives.postgresql.org/message-id/4FEB223A.3060707%40enterprisedb.com I didn't realize the implications for it back then and thus didn't press hard for a change, but I think it should be fixed. > The existing arrangement keeps all its data in local variables > of the function doing the update, which is probably a nice property to > preserve if we can manage it. That might force us to do it as above. We could simply allocate enough scratch space for the state at process startup. I don't think there'll ever be a need for XLogInsert() to be reentrant, so that should be fine. > I'd still like something that *looks* more like a list of function calls, > though, even if they're macros under the hood. The existing approach to > setting up the XLogRecData chains is awfully prone to bugs of > omission. Agreed. There's some pretty ugly code around this. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
Robert Haas writes: > On Thu, Apr 3, 2014 at 10:14 AM, Heikki Linnakangas > wrote: >> (Instead of using a new XLogRegisterBuffer() function to register the >> buffers, perhaps they should be passed to XLogInsert as a separate list or >> array. I'm not wedded on the details...) > That would have the advantage of avoiding the function-call overhead, > which seems appealing. You're kidding, right? One function call per buffer touched by an update is going to be lost in the noise. A somewhat more relevant concern is where are we going to keep the state data involved in all this. Since this code is, by definition, going to be called in critical sections, any solution involving internal pallocs is right out. The existing arrangement keeps all its data in local variables of the function doing the update, which is probably a nice property to preserve if we can manage it. That might force us to do it as above. I'd still like something that *looks* more like a list of function calls, though, even if they're macros under the hood. The existing approach to setting up the XLogRecData chains is awfully prone to bugs of omission. There are a few places that went so far as to set up custom macros to help with that. I'm not a fan of doing the same thing in randomly different ways in different places; but if we did it that way uniformly, it might be better/safer. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Thu, Apr 3, 2014 at 10:14 AM, Heikki Linnakangas wrote: > (Instead of using a new XLogRegisterBuffer() function to register the > buffers, perhaps they should be passed to XLogInsert as a separate list or > array. I'm not wedded on the details...) That would have the advantage of avoiding the function-call overhead, which seems appealing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 04/03/2014 07:11 PM, Tom Lane wrote: More generally, I'm pretty sure that your proposal is already going to involve some small growth of WAL records compared to today, Quite possible. but I think that's probably all right; the benefits seem significant. Yep. OTOH, once we store the relfilenode+block in a common format, we can then try to optimize that format more heavily. Just as an example, omit the tablespace oid in the RelFileNode, when it's the default tablespace (with a flag bit indicating we did that). Or use a variable-length endoding for the block number, on the assumption that smaller numbers are more common. Probably not be worth the extra complexity, but we can easily experiment with that kind of stuff once we have the infrastructure in place. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
Heikki Linnakangas writes: > On 04/03/2014 06:37 PM, Tom Lane wrote: >> +1, but one important step here is finding the data to be replayed. >> That is, a large part of the complexity of replay routines has to do >> with figuring out which parts of the WAL record were elided due to >> full-page-images, and locating the remaining parts. What can we do >> to make that simpler? > We can certainly add more structure to the WAL records, but any extra > information you add will make the records larger. It might be worth it, > and would be lost in the noise for more complex records like page > splits, but we should keep frequently-used records like heap insertions > as lean as possible. Sure, but in simple WAL records there would just be a single data chunk that would be present in the normal case and not present in the FPI case. Seems like we ought to be able to handle that degenerate case with no significant wastage (probably just a flag bit someplace). More generally, I'm pretty sure that your proposal is already going to involve some small growth of WAL records compared to today, but I think that's probably all right; the benefits seem significant. > Hmm. You could register a separate XLogRecData chain for each buffer. > Plus one more chain for the data not associated with a buffer. That would probably work. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 04/03/2014 06:37 PM, Tom Lane wrote: Also, IIRC there are places that WAL-log full pages that aren't in a shared buffer at all (btree build does this I think). How will that fit into this model? Hmm. We could provide a function for registering a block with given content, without a Buffer. Something like: XLogRegisterPage(int id, RelFileNode, BlockNumber, Page) Let's simplify that, and have one new function, XLogOpenBuffer, which returns a return code that indicates which of the four cases we're dealing with. A typical redo function looks like this: if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY) { /* Modify the page */ ... PageSetLSN(page, lsn); MarkBufferDirty(buffer); } if (BufferIsValid(buffer)) UnlockReleaseBuffer(buffer); The '0' in the XLogOpenBuffer call is the ID of the block reference specified in the XLogRegisterBuffer call, when the WAL record was created. +1, but one important step here is finding the data to be replayed. That is, a large part of the complexity of replay routines has to do with figuring out which parts of the WAL record were elided due to full-page-images, and locating the remaining parts. What can we do to make that simpler? We can certainly add more structure to the WAL records, but any extra information you add will make the records larger. It might be worth it, and would be lost in the noise for more complex records like page splits, but we should keep frequently-used records like heap insertions as lean as possible. Ideally, if XLogOpenBuffer (bad name BTW) returns BLK_REPLAY, it would also calculate and hand back the address/size of the logged data that had been pointed to by the associated XLogRecData chain item. The trouble here is that there might've been multiple XLogRecData items pointing to the same buffer. Perhaps the magic ID number you give to XLogOpenBuffer should be thought of as identifying an XLogRecData chain item, not so much a buffer? It's fairly easy to see what to do when there's just one chain item per buffer, but I'm not sure what to do if there's more than one. Hmm. You could register a separate XLogRecData chain for each buffer. Along the lines of: rdata[0].data = data for buffer rdata[0].len = ... rdata[0].next = &rdata[1]; rdata[1].data = more data for same buffer rdata[1].len = ... rdata[2].next = NULL; XLogRegisterBuffer(0, buffer, &data[0]); At replay: if (XLogOpenBuffer(0, &buffer, &xldata, &len) == BLK_REPLAY) { /* xldata points to the data registered for this buffer */ } Plus one more chain for the data not associated with a buffer. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
Heikki Linnakangas writes: > The big change in creating WAL records is that the buffers involved in > the WAL-logged operation are explicitly registered, by calling a new > XLogRegisterBuffer function. Seems reasonable, especially if we can make the buffer numbering business less error-prone. > void XLogRegisterBuffer(int blockref_id, Buffer buffer, bool buffer_std) > blockref_id: An arbitrary ID given to this block reference. It is used > in the redo routine to open/restore the same block. > buffer: the buffer involved > buffer_std: is the page in "standard" page layout? > That's for the normal cases. We'll need a couple of variants for also > registering buffers that don't need full-page images, and perhaps also a > function for registering a page that *always* needs a full-page image, > regardless of the LSN. A few existing WAL record types just WAL-log the > whole page, so those ad-hoc full-page images could be replaced with this. Why not just one function with an additional flags argument? Also, IIRC there are places that WAL-log full pages that aren't in a shared buffer at all (btree build does this I think). How will that fit into this model? > (While we're at it, perhaps we should let XLogInsert set the LSN of all > the registered buffers, to reduce the amount of boilerplate code). Yeah, maybe so. I'm not sure why that was separate to begin with. > Let's simplify that, and have one new function, XLogOpenBuffer, which > returns a return code that indicates which of the four cases we're > dealing with. A typical redo function looks like this: > if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY) > { > /* Modify the page */ > ... > PageSetLSN(page, lsn); > MarkBufferDirty(buffer); > } > if (BufferIsValid(buffer)) > UnlockReleaseBuffer(buffer); > The '0' in the XLogOpenBuffer call is the ID of the block reference > specified in the XLogRegisterBuffer call, when the WAL record was created. +1, but one important step here is finding the data to be replayed. That is, a large part of the complexity of replay routines has to do with figuring out which parts of the WAL record were elided due to full-page-images, and locating the remaining parts. What can we do to make that simpler? Ideally, if XLogOpenBuffer (bad name BTW) returns BLK_REPLAY, it would also calculate and hand back the address/size of the logged data that had been pointed to by the associated XLogRecData chain item. The trouble here is that there might've been multiple XLogRecData items pointing to the same buffer. Perhaps the magic ID number you give to XLogOpenBuffer should be thought of as identifying an XLogRecData chain item, not so much a buffer? It's fairly easy to see what to do when there's just one chain item per buffer, but I'm not sure what to do if there's more than one. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] WAL format and API changes (9.5)
I'd like to do some changes to the WAL format in 9.5. I want to annotate each WAL record with the blocks that they modify. Every WAL record already includes that information, but it's done in an ad hoc way, differently in every rmgr. The RelFileNode and block number are currently part of the WAL payload, and it's the REDO routine's responsibility to extract it. I want to include that information in a common format for every WAL record type. That makes life a lot easier for tools that are interested in knowing which blocks a WAL record modifies. One such tool is pg_rewind; it currently has to understand every WAL record the backend writes. There's also a tool out there called pg_readahead, which does prefetching of blocks accessed by WAL records, to speed up PITR. I don't think that tool has been actively maintained, but at least part of the reason for that is probably that it's a pain to maintain when it has to understand the details of every WAL record type. It'd also be nice for contrib/pg_xlogdump and backend code itself. The boilerplate code in all WAL redo routines, and writing WAL records, could be simplified. So, here's my proposal: Insertion - The big change in creating WAL records is that the buffers involved in the WAL-logged operation are explicitly registered, by calling a new XLogRegisterBuffer function. Currently, buffers that need full-page images are registered by including them in the XLogRecData chain, but with the new system, you call the XLogRegisterBuffer() function instead. And you call that function for every buffer involved, even if no full-page image needs to be taken, e.g because the page is going to be recreated from scratch at replay. It is no longer necessary to include the RelFileNode and BlockNumber of the modified pages in the WAL payload. That information is automatically included in the WAL record, when XLogRegisterBuffer is called. Currently, the backup blocks are implicitly numbered, in the order the buffers appear in XLogRecData entries. With the new API, the blocks are numbered explicitly. This is more convenient when a WAL record sometimes modifies a buffer and sometimes not. For example, a B-tree split needs to modify four pages: the original page, the new page, the right sibling (unless it's the rightmost page) and if it's an internal page, the page at the lower level whose split the insertion completes. So there are two pages that are sometimes missing from the record. With the new API, you can nevertheless always register e.g. original page as buffer 0, new page as 1, right sibling as 2, even if some of them are actually missing. SP-GiST contains even more complicated examples of that. The new XLogRegisterBuffer would look like this: void XLogRegisterBuffer(int blockref_id, Buffer buffer, bool buffer_std) blockref_id: An arbitrary ID given to this block reference. It is used in the redo routine to open/restore the same block. buffer: the buffer involved buffer_std: is the page in "standard" page layout? That's for the normal cases. We'll need a couple of variants for also registering buffers that don't need full-page images, and perhaps also a function for registering a page that *always* needs a full-page image, regardless of the LSN. A few existing WAL record types just WAL-log the whole page, so those ad-hoc full-page images could be replaced with this. With these changes, a typical WAL insertion would look like this: /* register the buffer with the WAL record, with ID 0 */ XLogRegisterBuffer(0, buf, true); rdata[0].data = (char *) &xlrec; rdata[0].len = sizeof(BlahRecord); rdata[0].buffer_id = -1; /* -1 means the data is always included */ rdata[0].next = &(rdata[1]); rdata[1].data = (char *) mydata; rdata[1].len = mydatalen; rdata[1].buffer_id = 0; /* 0 here refers to the buffer registered above */ rdata[1].next = NULL ... recptr = XLogInsert(RM_BLAH_ID, xlinfo, rdata); PageSetLSN(buf, recptr); (While we're at it, perhaps we should let XLogInsert set the LSN of all the registered buffers, to reduce the amount of boilerplate code). (Instead of using a new XLogRegisterBuffer() function to register the buffers, perhaps they should be passed to XLogInsert as a separate list or array. I'm not wedded on the details...) Redo There are four different states a block referenced by a typical WAL record can be in: 1. The old page does not exist at all (because the relation was truncated later) 2. The old page exists, but has an LSN higher than current WAL record, so it doesn't need replaying. 3. The LSN is < current WAL record, so it needs to be replayed. 4. The WAL record contains a full-page image, which needs to be restored. With the current API, that leads to a long boilerplate: /* If we have a full-page image, restore it and we're done */ if (HasBackupBl