subject:"\[HACKERS\] WAL format and API changes \(9.5\)"

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-21 Thread Heikki Linnakangas


On 11/21/2014 05:58 AM, Amit Kapila wrote:

On Thu, Nov 20, 2014 at 10:36 PM, Heikki Linnakangas <
hlinnakan...@vmware.com> wrote:


As you may have noticed, I committed this (after some more cleanup). Of

course, feel free to still review it, and please point out any issues you
may find.




Few minor observations:
1. Readme

void XLogResetInsertion(void)

Clear any currently registered data and buffers from the WAL record
construction workspace.  This is only needed if you have already
called XLogBeginInsert(), but decide to not insert the record after all.

I think above sentence could be slightly rephrased as the this function is
also getting called at end of XLogInsert().


Well, yeah, but that's internal to the xloginsert machinery, not 
relevant for someone learning about how to use it.



2.
shiftList()
{
..
XLogEnsureRecordSpace(data.ndeleted + 1, 0);
..
}

Shouldn't above function call to XLogEnsureRecordSpace() be done
under if (RelationNeedsWAL(rel))?


True, fixed. (It doesn't hurt to call it, but it's clearly unnecessary)


3.
XLogInsert(RmgrId rmid, uint8 info)
{
XLogRecPtr EndPos;

/* XLogBeginInsert() must have been called.
*/
if (!begininsert_called)
elog(ERROR, "XLogBeginInsert was not called");

As we are in critical section at this moment, so is it okay to have
elog(ERROR,).  I think this can only happen due to some coding
mistake, but still not sure if elog(ERROR) is okay.


In a critical section, the ERROR will be promoted automatically to a 
PANIC. There isn't much else we can do if that happens; it is a coding 
mistake as you say.


BTW, not all XLogInsert() calls are inside a critical section. All the 
ones that modify data pages are, but there are others.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-21 Thread Heikki Linnakangas


On 11/21/2014 09:19 AM, Michael Paquier wrote:

On Fri, Nov 21, 2014 at 2:06 AM, Heikki Linnakangas 
wrote:



As you may have noticed, I committed this (after some more cleanup). Of
course, feel free to still review it, and please point out any issues you
may find.



This comment on top of XLogRecordAssemble is not adapted as
page_writes_omitted does not exist. Perhaps this is a remnant of some older
version of the patch?
  * If there are any registered buffers, and a full-page image was not taken
  * of all them, *page_writes_omitted is set to true. This signals that the
  * assembled record is only good for insertion on the assumption that the
  * RedoRecPtr and doPageWrites values were up-to-date.
  */


Yep, fixed thanks.
- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-20 Thread Michael Paquier

On Fri, Nov 21, 2014 at 2:06 AM, Heikki Linnakangas  wrote:

> As you may have noticed, I committed this (after some more cleanup). Of
> course, feel free to still review it, and please point out any issues you
> may find.
>

This comment on top of XLogRecordAssemble is not adapted as
page_writes_omitted does not exist. Perhaps this is a remnant of some older
version of the patch?
 * If there are any registered buffers, and a full-page image was not taken
 * of all them, *page_writes_omitted is set to true. This signals that the
 * assembled record is only good for insertion on the assumption that the
 * RedoRecPtr and doPageWrites values were up-to-date.
 */
-- 
Michael

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-20 Thread Amit Kapila

On Thu, Nov 20, 2014 at 10:36 PM, Heikki Linnakangas <
hlinnakan...@vmware.com> wrote:
>
> As you may have noticed, I committed this (after some more cleanup). Of
course, feel free to still review it, and please point out any issues you
may find.
>

Few minor observations:
1. Readme

void XLogResetInsertion(void)

Clear any currently registered data and buffers from the WAL record
construction workspace.  This is only needed if you have already
called XLogBeginInsert(), but decide to not insert the record after all.

I think above sentence could be slightly rephrased as the this function is
also getting called at end of XLogInsert().

2.
shiftList()
{
..
XLogEnsureRecordSpace(data.ndeleted + 1, 0);
..
}

Shouldn't above function call to XLogEnsureRecordSpace() be done
under if (RelationNeedsWAL(rel))?

3.
XLogInsert(RmgrId rmid, uint8 info)
{
XLogRecPtr EndPos;

/* XLogBeginInsert() must have been called.
*/
if (!begininsert_called)
elog(ERROR, "XLogBeginInsert was not called");

As we are in critical section at this moment, so is it okay to have
elog(ERROR,).  I think this can only happen due to some coding
mistake, but still not sure if elog(ERROR) is okay.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-20 Thread Heikki Linnakangas

As you may have noticed, I committed this (after some more cleanup). Of 
course, feel free to still review it, and please point out any issues 
you may find.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-18 Thread Heikki Linnakangas


On 11/18/2014 10:28 AM, Michael Paquier wrote:

Btw, did you do a run with the buffer capture facility and checked for page
differences?


Yes. That's how I bumped into the bug fixed in c73669c0. That tool has 
been tremendously helpful.



Except that, after going through the code once again, ISTM that the patch
is in a nice state. It may be better to wait for some input from Andres, he
may catch some issues I haven't spotted.


Yeah, I'm sure there are tons of small things. I'll have to take a small 
break and read through the patch myself after a few days. But I'd like 
to get this committed pretty soon. I believe everyone agrees with the 
big picture, even if there's some little tweaking left to do. It's a 
pain to maintain as a patch, and I believe the FPW compression patch is 
waiting for this to land.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-18 Thread Michael Paquier

On Tue, Nov 18, 2014 at 4:31 AM, Heikki Linnakangas 
wrote:
> WAL insertion performance
> =
> To measure the performance of generating WAL, I ran the
> wal-update-testsuite.sh that Amit also ran earlier. The cluster was
> initialized with:
>
> shared_buffers=512MB
> checkpoint_segments=30
> fsync=off
> autovacuum=off
> full_page_writes=off
>
> [results]
> Summary: No meaningful difference in runtime.
If I am seeing that correctly, WAL generated is reduced for all the tests,
except for the case of "hundred tiny fields" where more WAL is generated.
Now the duration time seems to be generally reduced, some noise (?) making
it sometimes higher.

> WAL replay performance
> ==
>
> To test WAL replay performance, I ran pgbench with WAL archiving enabled,
> and timed the replay of the generated WAL. I used the attached script,
> replay-perf-test.sh for that. full_page_writes were disabled, because
> replaying full page images is quite different from replaying other
records.
> (Performance of full-page images is interesting too, but it's not expected
> that these WAL format changes make much difference to that).
>
> In summary, there is no significant difference in replay performance. The
> amount of WAL generated is much smaller with the patch.

> This concludes my performance testing, until someone wants to see some
other
> scenario being tested. I'm happy with the results.
I think you can, that's a great study, and this proves to be a gain on many
fields.

If this goes in, it is going to be one of the largest patches committed
ever.
$ git diff --stat | tail -n1
91 files changed, 3895 insertions(+), 4305 deletions(-)

There are still some XXX blocks here and there in the code.. But nothing
really huge, like here:
-* checks could go wrong too.
+* checks could go wrong too. XXX does this comment still
make sense?
 */
-   Assert(xldata->blkno != xldata->blknoNew);
+   Assert(blkno != blknoNew);

Btw, did you do a run with the buffer capture facility and checked for page
differences?

Except that, after going through the code once again, ISTM that the patch
is in a nice state. It may be better to wait for some input from Andres, he
may catch some issues I haven't spotted.
Regards,
-- 
Michael

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-17 Thread Heikki Linnakangas


On 11/17/2014 03:22 PM, Heikki Linnakangas wrote:

Here is an updated version of the patch. It's rebased over current
master, and fixes a bunch of issues you and Michael pointed out, and a
ton of other cleanup.


That patch had two extra, unrelated patches applied to it. One to use 
the Intel CRC instruction for the CRC calculation, and another to speed 
up PageRepairFragmentation, by replacing the pq_qsort() call with a 
custom sort algorithm. Sorry about that. (the latter is something that 
showed up very high in a perf profile of replaying the WAL generated by 
pgbench - I'll start a new thread about that)


Here is the same patch without those extra things mixed in.

- Heikki



wal-format-and-api-changes-11.patch.gz
Description: application/gzip

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-17 Thread Heikki Linnakangas


On 11/14/2014 10:31 AM, Michael Paquier wrote:

5) Here why not using the 2nd block instead of the 3rd (@_bt_getroot)?
+   XLogBeginInsert();
+   XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
+   XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT);


See the comment of the xl_btree_newroot struct. It explains the record 
format of the BTREE_NEWROOT record type:



 * Backup Blk 0: new root page (2 tuples as payload, if splitting old root)
 * Backup Blk 1: left child (if splitting an old root)
 * Backup Blk 2: metapage


When _bt_getroot creates a new root, there is no old root, but the same 
record type is used in _bt_newroot, which uses block ID 1 to refer to 
the old root page.


(Thanks for the comments, again! I'll post a new version soon)

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-14 Thread Michael Paquier

On Fri, Nov 14, 2014 at 5:31 PM, Michael Paquier
 wrote:
> 3) pg_xlogdump does not seem to work:
> $ pg_xlogdump 0001000D
> pg_xlogdump: FATAL:  could not find a valid record after 0/D00
This one is a bad manipulation from my side. Please forget this comment.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-14 Thread Michael Paquier

On Thu, Nov 13, 2014 at 10:33 PM, Heikki Linnakangas
 wrote:
> In quick testing, this new WAL format is somewhat more compact than the 9.4
> format. That also seems to have more than bought back the performance
> regression I saw earlier. Here are results from my laptop, using the
> wal-update-testsuite.sh script:
> master:
> [results]
> (27 rows)
> wal-format-and-api-changes-9.patch:
> [results]
> (27 rows)
So based on your series of tests, you are saving 6% up to 10%.
That's... Really cool!

> Aside from the WAL record format changes, this patch adds the "decoded WAL
> record" infrastructure that we talked about with Andres. XLogReader now has
> a new function, DecodeXLogRecord, which parses the block headers etc. from
> the WAL record, and copies the data chunks to aligned buffers. The redo
> routines are passed a pointer to the XLogReaderState, instead of the plain
> XLogRecord, and the redo routines can use macros and functions defined
> xlogreader.h to access the already-decoded WAL record. The new WAL record
> format is difficult to parse in a piece-meal fashion, so it really needs
> this separate decoding pass to be efficient.
>
> Thoughts on this new WAL record format? I've attached the xlogrecord.h file
> here separately for easy reading, if you want to take a quick look at just
> that without applying the whole patch.
The new format is neat, and that's a lot of code.. Grouping together
the blocks of data is a good thing for the FPW compression patch as
well.

Note that this patch conflicts with the recent commits 81c4508 and
34402ae. installcheck-world is passing, and standbys are able to
replay records, at least without crashes.

Here are some more comments:
1) with assertions enabled this does not compile because of a small
typo in xlogreader.h here:
+#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block id >= 0)
2) In xlogreader.c, XLogRecGetBlockData should return char * but a
boolean is returned here:
+XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)
+{
+   DecodedBkpBlock *bkpb;
+
+   if (!record->blocks[block_id].in_use)
+   return false;
As no blocks are in use in this case this should be NULL.
3) pg_xlogdump does not seem to work:
$ pg_xlogdump 0001000D
pg_xlogdump: FATAL:  could not find a valid record after 0/D00
4) A couple of NOT_USED blocks could be removed, no?
+#ifdef NOT_USED
BlockNumber leftChildBlkno = InvalidBlockNumber;
+#endif
5) Here why not using the 2nd block instead of the 3rd (@_bt_getroot)?
+   XLogBeginInsert();
+   XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
+   XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT);
6) There are FIXME blocks:
+   // FIXME
+   //if ((bkpb->fork_flags & BKPBLOCK_WILL_INIT) != 0 &&
mode != RBM_ZERO)
+   //  elog(PANIC, "block with WILL_INIT flag in WAL
record must be zeroed by redo routine");]
And that:
+   /* FIXME: translation? Although this shouldn't happen.. */
+   ereport(ERROR,
+   (errmsg("error decoding WAL record"),
+errdetail("%s", errormsg)));
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-13 Thread Amit Kapila

On Thu, Nov 13, 2014 at 7:03 PM, Heikki Linnakangas 
wrote:
>
> On 11/11/2014 04:42 PM, Amit Kapila wrote:
>>
>> I have done some performance testing of this patch using attached
>> script and data is as below:
>>
>> ...
>>
>> It seems to me that there is a regression of (4 ~ 8%) for small records,
>> refer two short fields tests.
>
>
> Thanks for the testing!
>
>
> Thoughts on this new WAL record format? I've attached the xlogrecord.h
file here separately for easy reading, if you want to take a quick look at
just that without applying the whole patch.
>

Apart from changes in XLogRecord to remove xl_len and padding, the new
format for block headers seems to be quite succinct and then by making
data length as variable for actual record the overall WAL record size
becomes smaller.  However the increase in unaligned size (as mentined by
you up-thread as 2 bytes) seems to be still remain same.
Is the difference in size is due to additional block reference id this patch
requires or something else?

In new record format the Record header always start at aligned
boundary, however the actual record data doesn't necessarily be
at aligned boundary, can this make difference as currently we copy
both of these separately?

Overall, I think this format is a net improvement.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-13 Thread Andres Freund

On 2014-11-13 15:33:44 +0200, Heikki Linnakangas wrote:
> Here's a new version, with big changes again to the record format. Have a
> look at xlogrecord.h for the details, but in a nutshell:
> 
> 1. The overall format is now: XLogRecord, per-block headers, header for main
> data portion, per-block data, main data.
> 
> 2. I removed xl_len field from XLogRecord and rearranged the fields, to
> shrink the XLogRecord struct from 32 to 24 bytes. (instead, there's a new 2-
> or 5-byte header for the "main data", after the block headers).
> 
> 3. No alignment padding. (the data chunks are copied to aligned buffers at
> replay, so redo functions can still assume aligned access)

Whoa. That's a large amount of changes... Not bad at all.

> In quick testing, this new WAL format is somewhat more compact than the 9.4
> format.

Well, that's mixing different kinds of changes together to some
degree. Most of this could have been done independently as well. I'm not
generally objecting to doing that, but the space/performance comparison
isn't entirely fair. One could (I'm not!) argue that we should just do a
refactoring like you suggest above, without the additional block
information.

> That also seems to have more than bought back the performance
> regression I saw earlier.

I think we really need to do the performance test with a different CRC
implementation. So far all the tests in this thread seem to be
correlating pretty linearly to the size of the record.

> Here are results from my laptop, using the wal-update-testsuite.sh script:

Would be nice if that'd also compute the differences in wal_generated
and duration between two runs... :)

> Aside from the WAL record format changes, this patch adds the "decoded WAL
> record" infrastructure that we talked about with Andres.

Ah, cool.

> XLogReader now has
> a new function, DecodeXLogRecord, which parses the block headers etc. from
> the WAL record, and copies the data chunks to aligned buffers. The redo
> routines are passed a pointer to the XLogReaderState, instead of the plain
> XLogRecord, and the redo routines can use macros and functions defined
> xlogreader.h to access the already-decoded WAL record. The new WAL record
> format is difficult to parse in a piece-meal fashion, so it really needs
> this separate decoding pass to be efficient.

Hm. I'm not sure if there's really a point in exposing
DecodeXLogRecord() outside of xlogreader.c. In which case would a
xlogreader user not want to decode the record? I.e. couldn't we just
always do that and not bother exposing the function?

Sure, you could skip decoding if the record is of a "uninteresting"
rmgr, but since all validation but CRC now happens only during decoding
I'm not sure that's a good idea. You e.g. have a block

+   if (!DecodeXLogRecord(xlogreader_state, &errormsg))
+   {
+   fprintf(stderr, "error in WAL record at %X/%X: %s\n",
+   (uint32) 
(xlogreader_state->ReadRecPtr >> 32),
+   (uint32) 
xlogreader_state->ReadRecPtr,
+   errormsg);
+   /*
+* Parsing the record failed, but it passed CRC check, 
so in
+* theory we can continue reading.
+*/
+   continue;
+   }
+

I think that's quite the bad idea. The CRC is only a part of the error
detection, not its entirety.

> Thoughts on this new WAL record format? I've attached the xlogrecord.h file
> here separately for easy reading, if you want to take a quick look at just
> that without applying the whole patch.

After a quick skim I like it in general. There's some details I'd rather
change, but I think it's quite the step forward.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-11 Thread Amit Kapila

On Tue, Nov 11, 2014 at 2:15 PM, Heikki Linnakangas 
wrote:z
>
> On 11/11/2014 09:39 AM, Michael Paquier wrote:
>>
>> On Tue, Nov 11, 2014 at 4:29 AM, Heikki Linnakangas <
hlinnakan...@vmware.com
>>>
>>> wrote:
>>
>>
>>> Attached is a new version. It's rebased on current git master, including
>>> BRIN. I've also fixed the laundry list of small things you reported, as
>>> well as a bunch of bugs I uncovered during my own testing.
>>
>>
>> This patch needs a small rebase, it has been broken by a590f266 that
fixed
>> WAL replay for brin indexes:
>> patching file src/backend/access/brin/brin_xlog.c
>> Hunk #2 FAILED at 42.
>> Hunk #3 FAILED at 91.
>> This will facilitate testing as well.
>
>
> Here's a rebased patch. No other changes.
>

I have done some performance testing of this patch using attached
script and data is as below:


Performance Data

IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB
checkpoint_segments=300
checkpoint_timeout=30min
full_page_writes = off

HEAD (commit - ae667f7) -

testname | wal_generated | duration
-+---+--
 two short fields, no change | 398510104 |  11.993057012558
 two short fields, no change | 396984168 | 10.3508098125458
 two short fields, no change | 396983624 | 10.4196150302887
 two short fields, one changed   | 437106016 |  10.650365114212
 two short fields, one changed   | 437102456 |  10.756795167923
 two short fields, one changed   | 437103000 | 10.6824090480804
 two short fields, both changed  | 438131520 | 11.1027519702911
 two short fields, both changed  | 437112248 | 11.0681071281433
 two short fields, both changed  | 437107168 | 11.0370399951935
 one short and one long field, no change |  76044176 | 2.90529608726501
 one short and one long field, no change |  76047432 | 2.86844515800476
 one short and one long field, no change |  76042664 | 2.84641098976135
 ten tiny fields, all changed| 478210904 | 13.1221458911896
 ten tiny fields, all changed| 47728 | 12.8017349243164
 ten tiny fields, all changed| 477217832 | 12.9907081127167
 hundred tiny fields, all changed| 180353400 | 6.03354096412659
 hundred tiny fields, all changed| 180349536 | 5.99893379211426
 hundred tiny fields, all changed| 180350064 | 6.00634717941284
 hundred tiny fields, half changed   | 180348480 | 6.03162407875061
 hundred tiny fields, half changed   | 180348440 | 6.08004903793335
 hundred tiny fields, half changed   | 180349056 | 6.03126788139343
 hundred tiny fields, half nulled| 101369936 | 5.43424606323242
 hundred tiny fields, half nulled| 101660160 | 5.06207084655762
 hundred tiny fields, half nulled| 100119184 | 5.51814889907837
 9 short and 1 long, short changed   | 108142168 | 3.26260495185852
 9 short and 1 long, short changed   | 108138144 | 3.20250797271729
 9 short and 1 long, short changed   | 109761240 | 3.21728992462158
(27 rows)


Patch-

testname | wal_generated | duration
-+---+--
 two short fields, no change | 398692848 | 11.2538840770721
 two short fields, no change | 396988648 | 11.4972231388092
 two short fields, no change | 396988416 | 11.0026741027832
 two short fields, one changed   | 437104840 | 11.1911199092865
 two short fields, one changed   | 437103832 | 11.1138079166412
 two short fields, one changed   | 437101872 | 11.3141660690308
 two short fields, both changed  | 437133648 | 11.5836050510406
 two short fields, both changed  | 437106640 | 11.5675358772278
 two short fields, both changed  | 437104944 | 11.6147150993347
 one short and one long field, no change |  77226464 | 3.01720094680786
 one short and one long field, no change |  77645248 |  2.9660210609436
 one short and one long field, no change |  76045256 | 3.03326010704041
 ten tiny fields, all changed| 477396616 | 13.3963651657104
 ten tiny fields, all changed| 477219840 | 13.3838939666748
 ten tiny fields, all changed| 477223960 | 13.5736708641052
 hundred tiny fields, all changed| 180742136 | 5.87386584281921
 hundred tiny fields, all changed| 180398320 | 6.12072706222534
 hundred tiny fields, all changed| 180354080 | 6.16246104240417
 hundred tiny fields, half changed   | 180351456 | 6.17317008972168
 hundred tiny fields, half changed   | 180348096 | 6.13790798187256
 hundred tiny fields, half changed   | 183369688 | 6.19

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-11 Thread Heikki Linnakangas


On 11/11/2014 09:39 AM, Michael Paquier wrote:

On Tue, Nov 11, 2014 at 4:29 AM, Heikki Linnakangas 
wrote:



Attached is a new version. It's rebased on current git master, including
BRIN. I've also fixed the laundry list of small things you reported, as
well as a bunch of bugs I uncovered during my own testing.


This patch needs a small rebase, it has been broken by a590f266 that fixed
WAL replay for brin indexes:
patching file src/backend/access/brin/brin_xlog.c
Hunk #2 FAILED at 42.
Hunk #3 FAILED at 91.
This will facilitate testing as well.


Here's a rebased patch. No other changes.

- Heikki



wal-format-and-api-changes-8.patch.gz
Description: application/gzip

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-10 Thread Michael Paquier

On Tue, Nov 11, 2014 at 4:29 AM, Heikki Linnakangas  wrote:

> Attached is a new version. It's rebased on current git master, including
> BRIN. I've also fixed the laundry list of small things you reported, as
> well as a bunch of bugs I uncovered during my own testing.

This patch needs a small rebase, it has been broken by a590f266 that fixed
WAL replay for brin indexes:
patching file src/backend/access/brin/brin_xlog.c
Hunk #2 FAILED at 42.
Hunk #3 FAILED at 91.
This will facilitate testing as well.
Regards
-- 
Michael

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-10 Thread Heikki Linnakangas

Attached is a new version. It's rebased on current git master, including 
BRIN. I've also fixed the laundry list of small things you reported, as 
well as a bunch of bugs I uncovered during my own testing.


Alvaro: you still have the BRIN WAL-logging code fresh in your memory, 
so could you take a look at that part of this patch to check that I 
didn't break it? And also, how do you like the new way of writing that, 
over what you had to in HEAD ?


More below.

On 11/09/2014 11:47 PM, Andres Freund wrote:

On 2014-11-06 17:32:33 +0200, Heikki Linnakangas wrote:

Replying to some of your comments below. The rest were trivial issues that
I'll just fix.

On 10/30/2014 09:19 PM, Andres Freund wrote:

* Is it really a good idea to separate XLogRegisterBufData() from
   XLogRegisterBuffer() the way you've done it ? If we ever actually get
   a record with a large numbers of blocks touched this issentially is
   O(touched_buffers*data_entries).


Are you worried that the linear search in XLogRegisterBufData(), to find the
right registered_buffer struct, might be expensive? If that ever becomes a
problem, a simple fix would be to start the linear search from the end, and
make sure that when you touch a large number of blocks, you do all the
XLogRegisterBufData() calls right after the corresponding
XLogRegisterBuffer() call.


Yes, that was what I was (mildly) worried about. Since you specified a
high limit of buffers I'm sure someone will come up with a use case for
it ;)


I've also though about having XLogRegisterBuffer() return the pointer to the
struct, and passing it as argument to XLogRegisterBufData. So the pattern in
WAL generating code would be like:

registered_buffer *buf0;

buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD);
XLogRegisterBufData(buf0, data, length);

registered_buffer would be opaque to the callers. That would have potential
to turn XLogRegisterBufData into a macro or inline function, to eliminate
the function call overhead. I played with that a little bit, but the
difference in performance was so small that it didn't seem worth it. But
passing the registered_buffer pointer, like above, might not be a bad thing
anyway.


Yes, that was roughly what I was thinking of as well. It's not all that
pretty, but it generally does seem like a good idea to me anyay.


I ended up doing something different: The block_id is now used as the 
index into the registered_buffers array. So looking up the right struct 
is now just ®istered_buffers[block_id]. That obviously gets rid of 
the linear search. It doesn't seem to make any difference in the quick 
performance testing I've been doing.



* There's lots of functions like XLogRecHasBlockRef() that dig through
   the wal record. A common pattern is something like:
if (XLogRecHasBlockRef(record, 1))
 XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
else
 oldblk = newblk;

   I think doing that repeatedly is quite a bad idea. We should parse the
   record once and then use it in a sensible format. Not do it in pieces,
   over and over again. It's not like we ignore backup blocks - so doing
   this lazily doesn't make sense to me.
   Especially as ValidXLogRecord() *already* has parsed the whole damn
   thing once.


Hmm. Adding some kind of a parsed XLogRecord representation would need a
fair amount of new infrastructure.


True.


Vast majority of WAL records contain one,
maybe two, block references, so it's not that expensive to find the right
one, even if you do it several times.


I'm not convinced. It's not an infrequent thing these days to hear
people being bottlenecked by replay. And grovelling repeatedly through
larger records isn't *that* cheap.


I haven't done anything about this. We might want to do that, but I'd 
like to get this committed now, with all the changes to the AM's, and 
then spend some time trying to further optimize the WAL format. I might 
add the "deparsed WAL record" infrastructure as part of that 
optimization work.



I ran a quick performance test on WAL replay performance yesterday. I ran
pgbench for 100 transactions with WAL archiving enabled, and measured
the time it took to replay the generated WAL. I did that with and without
the patch, and I didn't see any big difference in replay times. I also ran
"perf" on the startup process, and the profiles looked identical. I'll do
more comprehensive testing later, with different index types, but I'm
convinced that there is no performance issue in replay that we'd need to
worry about.


Interesting. What checkpoint_segments/timeout and what scale did you
use? Since that heavily influences the average size of the record that's
quite relevant...


Scale factor 5, checkpoint_segments=10, checkpoint_timeout=30s. 
pg_xlogdump --stats says:



Type   N  (%)  Record size  
(%) FPI size  (%)Combined size  (%)
   -  ---  ---

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-09 Thread Andres Freund

On 2014-11-06 17:32:33 +0200, Heikki Linnakangas wrote:
> Replying to some of your comments below. The rest were trivial issues that
> I'll just fix.
> 
> On 10/30/2014 09:19 PM, Andres Freund wrote:
> >* Is it really a good idea to separate XLogRegisterBufData() from
> >   XLogRegisterBuffer() the way you've done it ? If we ever actually get
> >   a record with a large numbers of blocks touched this issentially is
> >   O(touched_buffers*data_entries).
> 
> Are you worried that the linear search in XLogRegisterBufData(), to find the
> right registered_buffer struct, might be expensive? If that ever becomes a
> problem, a simple fix would be to start the linear search from the end, and
> make sure that when you touch a large number of blocks, you do all the
> XLogRegisterBufData() calls right after the corresponding
> XLogRegisterBuffer() call.

Yes, that was what I was (mildly) worried about. Since you specified a
high limit of buffers I'm sure someone will come up with a use case for
it ;)

> I've also though about having XLogRegisterBuffer() return the pointer to the
> struct, and passing it as argument to XLogRegisterBufData. So the pattern in
> WAL generating code would be like:
> 
> registered_buffer *buf0;
> 
> buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD);
> XLogRegisterBufData(buf0, data, length);
> 
> registered_buffer would be opaque to the callers. That would have potential
> to turn XLogRegisterBufData into a macro or inline function, to eliminate
> the function call overhead. I played with that a little bit, but the
> difference in performance was so small that it didn't seem worth it. But
> passing the registered_buffer pointer, like above, might not be a bad thing
> anyway.

Yes, that was roughly what I was thinking of as well. It's not all that
pretty, but it generally does seem like a good idea to me anyay.

> >* There's lots of functions like XLogRecHasBlockRef() that dig through
> >   the wal record. A common pattern is something like:
> >if (XLogRecHasBlockRef(record, 1))
> > XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
> >else
> > oldblk = newblk;
> >
> >   I think doing that repeatedly is quite a bad idea. We should parse the
> >   record once and then use it in a sensible format. Not do it in pieces,
> >   over and over again. It's not like we ignore backup blocks - so doing
> >   this lazily doesn't make sense to me.
> >   Especially as ValidXLogRecord() *already* has parsed the whole damn
> >   thing once.
> 
> Hmm. Adding some kind of a parsed XLogRecord representation would need a
> fair amount of new infrastructure.

True.

> Vast majority of WAL records contain one,
> maybe two, block references, so it's not that expensive to find the right
> one, even if you do it several times.

I'm not convinced. It's not an infrequent thing these days to hear
people being bottlenecked by replay. And grovelling repeatedly through
larger records isn't *that* cheap.

> I ran a quick performance test on WAL replay performance yesterday. I ran
> pgbench for 100 transactions with WAL archiving enabled, and measured
> the time it took to replay the generated WAL. I did that with and without
> the patch, and I didn't see any big difference in replay times. I also ran
> "perf" on the startup process, and the profiles looked identical. I'll do
> more comprehensive testing later, with different index types, but I'm
> convinced that there is no performance issue in replay that we'd need to
> worry about.

Interesting. What checkpoint_segments/timeout and what scale did you
use? Since that heavily influences the average size of the record that's
quite relevant...

> If it mattered, a simple optimization to the above pattern would be to have
> XLogRecGetBlockTag return true/false, indicating if the block reference
> existed at all. So you'd do:
> 
> if (!XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
> oldblk != newblk;

That sounds like a good idea anyway?

> On the other hand, decomposing the WAL record into parts, and passing the
> decomposed representation to the redo routines would allow us to pack the
> WAL record format more tightly, as accessing the different parts of the
> on-disk format wouldn't then need to be particularly fast. For example, I've
> been thinking that it would be nice to get rid of the alignment padding in
> XLogRecord, and between the per-buffer data portions. We could copy the data
> to aligned addresses as part of the decomposition or parsing of the WAL
> record, so that the redo routines could still assume aligned access.

Right. I think it'd generally give us a bit more flexibility.

> >* I wonder if it wouldn't be worthwile, for the benefit of the FPI
> >   compression patch, to keep the bkp block data after *all* the
> >   "headers". That'd make it easier to just compress the data.
> 
> Maybe. If we do that, I'd also be inclined to move all the bkp block headers
> to the beginning of the WAL record, just after the XLogInsert struct.

> Someho

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-06 Thread Heikki Linnakangas

Replying to some of your comments below. The rest were trivial issues 
that I'll just fix.


On 10/30/2014 09:19 PM, Andres Freund wrote:

* Is it really a good idea to separate XLogRegisterBufData() from
   XLogRegisterBuffer() the way you've done it ? If we ever actually get
   a record with a large numbers of blocks touched this issentially is
   O(touched_buffers*data_entries).


Are you worried that the linear search in XLogRegisterBufData(), to find 
the right registered_buffer struct, might be expensive? If that ever 
becomes a problem, a simple fix would be to start the linear search from 
the end, and make sure that when you touch a large number of blocks, you 
do all the XLogRegisterBufData() calls right after the corresponding 
XLogRegisterBuffer() call.


I've also though about having XLogRegisterBuffer() return the pointer to 
the struct, and passing it as argument to XLogRegisterBufData. So the 
pattern in WAL generating code would be like:


registered_buffer *buf0;

buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD);
XLogRegisterBufData(buf0, data, length);

registered_buffer would be opaque to the callers. That would have 
potential to turn XLogRegisterBufData into a macro or inline function, 
to eliminate the function call overhead. I played with that a little 
bit, but the difference in performance was so small that it didn't seem 
worth it. But passing the registered_buffer pointer, like above, might 
not be a bad thing anyway.



* There's lots of functions like XLogRecHasBlockRef() that dig through
   the wal record. A common pattern is something like:
if (XLogRecHasBlockRef(record, 1))
 XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
else
 oldblk = newblk;

   I think doing that repeatedly is quite a bad idea. We should parse the
   record once and then use it in a sensible format. Not do it in pieces,
   over and over again. It's not like we ignore backup blocks - so doing
   this lazily doesn't make sense to me.
   Especially as ValidXLogRecord() *already* has parsed the whole damn
   thing once.


Hmm. Adding some kind of a parsed XLogRecord representation would need a 
fair amount of new infrastructure. Vast majority of WAL records contain 
one, maybe two, block references, so it's not that expensive to find the 
right one, even if you do it several times.


I ran a quick performance test on WAL replay performance yesterday. I 
ran pgbench for 100 transactions with WAL archiving enabled, and 
measured the time it took to replay the generated WAL. I did that with 
and without the patch, and I didn't see any big difference in replay 
times. I also ran "perf" on the startup process, and the profiles looked 
identical. I'll do more comprehensive testing later, with different 
index types, but I'm convinced that there is no performance issue in 
replay that we'd need to worry about.


If it mattered, a simple optimization to the above pattern would be to 
have XLogRecGetBlockTag return true/false, indicating if the block 
reference existed at all. So you'd do:


if (!XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
oldblk != newblk;


On the other hand, decomposing the WAL record into parts, and passing 
the decomposed representation to the redo routines would allow us to 
pack the WAL record format more tightly, as accessing the different 
parts of the on-disk format wouldn't then need to be particularly fast. 
For example, I've been thinking that it would be nice to get rid of the 
alignment padding in XLogRecord, and between the per-buffer data 
portions. We could copy the data to aligned addresses as part of the 
decomposition or parsing of the WAL record, so that the redo routines 
could still assume aligned access.



* I wonder if it wouldn't be worthwile, for the benefit of the FPI
   compression patch, to keep the bkp block data after *all* the
   "headers". That'd make it easier to just compress the data.


Maybe. If we do that, I'd also be inclined to move all the bkp block 
headers to the beginning of the WAL record, just after the XLogInsert 
struct. Somehow it feels weird to have a bunch of header structs 
sandwiched between the rmgr-data and per-buffer data. Also, 4-byte 
alignment is enough for the XLogRecordBlockData struct, so that would be 
an easy way to make use of the space currently wasted for alignment 
padding in XLogRecord.



* I think heap_xlog_update is buggy for wal_level=logical because it
   computes the length of the tuple using
   tuplen = recdataend - recdata;
   But the old primary key/old tuple value might be stored there as
   well. Afaics that code has to continue to use xl_heap_header_len.


No, the old primary key or tuple is stored in the main data portion. 
That tuplen computation is done on backup block 0's data.



* It looks to me like insert wal logging could just use REGBUF_KEEP_DATA
   to get rid of:
+   /*
+* The new tuple is normally stored as buffer 0's data. But if
+* XLOG_HEAP_CONTAINS_NEW_TUPL

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-06 Thread Heikki Linnakangas


On 11/05/2014 11:30 AM, Andres Freund wrote:

On 2014-11-04 18:33:34 +0200, Heikki Linnakangas wrote:

On 10/30/2014 06:02 PM, Andres Freund wrote:

On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:

On 10/06/2014 02:29 PM, Andres Freund wrote:

I've not yet really looked,
but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
make me happy...


Ok.. Can you elaborate?


To me the split between xloginsert.c doing some of the record assembly,
and xlog.c doing the lower level part of the assembly is just wrong.


I moved the checks for bootstrap mode and xl_len == 0, from the current
XLogInsert to the new XLogInsert() function. I don't imagine that to be
enough address your feelings about the split between xlog.c and
xloginsert.c, but makes it a tiny bit clearer IMHO. I don't know what else
to do about it, as it feels very sensible to me as it is now. So unless you
object more loudly, I'm going to just go ahead with this and commit,
probably tomorrow.


I'm still not particularly happy with the split. But I think this patch
is enough of a win anyhow. So go ahead.


Committed. Now, to the main patch...

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-06 Thread Heikki Linnakangas


On 11/06/2014 07:57 AM, Amit Kapila wrote:

On Wed, Nov 5, 2014 at 2:56 PM, Heikki Linnakangas 
wrote:

On 11/05/2014 09:06 AM, Amit Kapila wrote:


2.
XLogRecPtr
XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)


So the scenario is that:

* XLogRecordAssemble decides that a page doesn't need to be backed up
* both RedoRecPtr and doPageWrites change while building the record.
doPageWrites goes from true to false.

Without the patch, we would retry, because first check RedoRecPtr has
changed. With the patch, we notice that even though RedoRecPtr has changed,
doPageWrites is now off, so no FPWs are required regardless of RedoRecPtr,
and not retry. Yeah, you're right, the behavior changes in that case.
However, the new behavior is correct; the retry is unnecessary in that
scenario.


How does it interact with backup, basically in stop backup we first
change forcePageWrite to false and then get the stop wal location
by inserting XLOG_BACKUP_END, so it seems to me that it is quite
possible that the record which backend is inserting using XLogInsert()
will be considered in backup.  Now shouldn't this record contain FPW
if forcePageWrite was true when XLogInsert() started to avoid any torn
page taken in backup?


It doesn't matter what doPageWrites and RedoRecPtr were when XLogInsert 
started. It's the state when the insertion actually happens that matters.


It's a bit weird that pg_stop_backup() first turns off forcePagesWrites, 
and only then writes the WAL record. But it's OK because when 
pg_stop_backup() is called, the backup process has already finished 
copying all the blocks. Any pages that are written to disk between 
turning forcePageWrites off and writing the WAL record cannot be torn.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-05 Thread Amit Kapila

On Wed, Nov 5, 2014 at 2:56 PM, Heikki Linnakangas 
wrote:
> On 11/05/2014 09:06 AM, Amit Kapila wrote:
>>
>> 2.
>> XLogRecPtr
>> XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
>
> So the scenario is that:
>
> * XLogRecordAssemble decides that a page doesn't need to be backed up
> * both RedoRecPtr and doPageWrites change while building the record.
doPageWrites goes from true to false.
>
> Without the patch, we would retry, because first check RedoRecPtr has
changed. With the patch, we notice that even though RedoRecPtr has changed,
doPageWrites is now off, so no FPWs are required regardless of RedoRecPtr,
and not retry. Yeah, you're right, the behavior changes in that case.
However, the new behavior is correct; the retry is unnecessary in that
scenario.
>

How does it interact with backup, basically in stop backup we first
change forcePageWrite to false and then get the stop wal location
by inserting XLOG_BACKUP_END, so it seems to me that it is quite
possible that the record which backend is inserting using XLogInsert()
will be considered in backup.  Now shouldn't this record contain FPW
if forcePageWrite was true when XLogInsert() started to avoid any torn
page taken in backup?


>> 3. There are couple of places where *XLogInsert* is used in wal.sgml
>> and it seems to me some of those needs change w.r.t this patch.
>
>
> Hmm. It's a bit strange to mention XLogInsert and XLogFlush functions by
name in that text. It's otherwise admin-level stuff, on how to set
WAL-related settings. Up to 9.2 that manual page actually called them
"LogInsert" and "LogFlush". But I guess you're right; if we are to mention
the function by name, it should say XLogInsertRecord.
>

Agreed.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-05 Thread Andres Freund

On 2014-11-05 23:08:31 +0200, Heikki Linnakangas wrote:
> On 10/30/2014 09:19 PM, Andres Freund wrote:
> >Some things I noticed while reading the patch:
> 
> A lot of good comments, but let me pick up just two that are related:
> 
> >* There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only
> >   refer to the relation, but not to the block number. These still log
> >   their rnode manually. Shouldn't we somehow deal with those in a
> >   similar way explicit block references are dealt with?
> >
> >* Hm. At least WriteMZeroPageXlogRec (and probably the same for all the
> >   other slru stuff) doesn't include a reference to the page. Isn't that
> >   bad? Shouldn't we make XLogRegisterBlock() usable for that case?
> >   Otherwise I fail to see how pg_rewind like tools can sanely deal with 
> > this?
> 
> Yeah, there are still operations that modify relation pages, but don't store
> the information about the modified pages in the standard format. That
> includes XLOG_SMGR_TRUNCATE that you spotted, and XLOG_SMGR_CREATE, and also
> XLOG_DBASE_CREATE/DROP. And then there are updates to non-relation files,
> like all the slru stuff, relcache init files, etc. And updates to the FSM
> and VM bypass the full-page write mechanism too.

That's a awful number of special cases. I see little reason not to
invent something that can reference at least the relation in those
cases. Then we can at least only resync the relations if there were
changes. E.g. for vm's that not necessarily all that likely in an insert
mostly case.
I'm not that worried about things like create/drop database, but fsm,
vm, and the various slru's are really somewhat essential.

> To play it safe, pg_rewind copies all non-relation files as is. That
> includes all SLRUs, FSM and VM files, and everything else whose filename
> doesn't match the (main fork of) a relation file. Of course, that's a fair
> amount of copying to do, so we might want to optimize that in the future,
> but I want to nail the relation files first. They are usually an order of
> magnitude larger than the other files, after all.

That's fair enough.

> Unfortunately pg_rewind still needs to recognize and parse the special WAL
> records like XLOG_SMGR_CREATE/TRUNCATE, that modify relation files outside
> the normal block registration system. I've been thinking that we should add
> another flag to the WAL record format to mark such records.

So everytime pg_rewind comes across a record with that flag set which it
doesn't have special case code it'd balk? Sounds sensible.

Personally I think we should at least have a generic format to refer to
entire relations without a specific block number. And one to refer to
SLRUs. I don't think we necessarily need to implement them now, but we
should make sure there's bit space left to denote them.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-05 Thread Heikki Linnakangas


On 10/30/2014 09:19 PM, Andres Freund wrote:

Some things I noticed while reading the patch:


A lot of good comments, but let me pick up just two that are related:


* There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only
   refer to the relation, but not to the block number. These still log
   their rnode manually. Shouldn't we somehow deal with those in a
   similar way explicit block references are dealt with?

* Hm. At least WriteMZeroPageXlogRec (and probably the same for all the
   other slru stuff) doesn't include a reference to the page. Isn't that
   bad? Shouldn't we make XLogRegisterBlock() usable for that case?
   Otherwise I fail to see how pg_rewind like tools can sanely deal with this?


Yeah, there are still operations that modify relation pages, but don't 
store the information about the modified pages in the standard format. 
That includes XLOG_SMGR_TRUNCATE that you spotted, and XLOG_SMGR_CREATE, 
and also XLOG_DBASE_CREATE/DROP. And then there are updates to 
non-relation files, like all the slru stuff, relcache init files, etc. 
And updates to the FSM and VM bypass the full-page write mechanism too.


To play it safe, pg_rewind copies all non-relation files as is. That 
includes all SLRUs, FSM and VM files, and everything else whose filename 
doesn't match the (main fork of) a relation file. Of course, that's a 
fair amount of copying to do, so we might want to optimize that in the 
future, but I want to nail the relation files first. They are usually an 
order of magnitude larger than the other files, after all.


Unfortunately pg_rewind still needs to recognize and parse the special 
WAL records like XLOG_SMGR_CREATE/TRUNCATE, that modify relation files 
outside the normal block registration system. I've been thinking that we 
should add another flag to the WAL record format to mark such records. 
pg_rewind will still need to understand the record format of such 
records, but the flag will help to catch bugs of omission. If pg_rewind 
or another such tool sees a record that's flagged as "special", but 
doesn't recognize the record type, it can throw an error.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-05 Thread Andres Freund

On 2014-11-04 18:33:34 +0200, Heikki Linnakangas wrote:
> On 10/30/2014 06:02 PM, Andres Freund wrote:
> >On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:
> >>On 10/06/2014 02:29 PM, Andres Freund wrote:
> >>>I've not yet really looked,
> >>>but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
> >>>make me happy...
> >>
> >>Ok.. Can you elaborate?
> >
> >To me the split between xloginsert.c doing some of the record assembly,
> >and xlog.c doing the lower level part of the assembly is just wrong.
> 
> I moved the checks for bootstrap mode and xl_len == 0, from the current
> XLogInsert to the new XLogInsert() function. I don't imagine that to be
> enough address your feelings about the split between xlog.c and
> xloginsert.c, but makes it a tiny bit clearer IMHO. I don't know what else
> to do about it, as it feels very sensible to me as it is now. So unless you
> object more loudly, I'm going to just go ahead with this and commit,
> probably tomorrow.

I'm still not particularly happy with the split. But I think this patch
is enough of a win anyhow. So go ahead.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-05 Thread Heikki Linnakangas


On 11/05/2014 09:06 AM, Amit Kapila wrote:

1.
+XLogRecPtr
+XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
{
..
+ /* info's high bits are reserved for use by me */
+ if (info & XLR_INFO_MASK)
+ elog(PANIC, "invalid xlog info mask %02X", info);
..
}

Earlier before this check, we use to check XLogInsertAllowed()
which is moved to XLogInsertRecord(), isn't it better to keep
the check in beginning of the function XLogInsert()?


Doesn't matter much, but I think it's cleanest the way I had it in the 
latest patch. XLogInsert() is responsible for building a valid WAL 
record to insert, and XLogInsertRecord() handles all the locking, buffer 
management etc. to actually insert it. The high-bit check quoted above 
is more related to building a valid record, so it belongs in 
XLogInsert(), while the XLogInsertAllowed() check is more related to the 
actual insertion of the record, so it belongs in XLogInsertRecord().



2.
XLogRecPtr
XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
{
..
if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)
{
/*
  * Oops, some buffer now needs to be backed up that the caller
  * didn't back up.  Start over.
  */
WALInsertLockRelease();
END_CRIT_SECTION();
return InvalidXLogRecPtr;
}
..
}

IIUC, there can be 4 states for doPageWrites w.r.t when record is getting
assembled (XLogRecordAssemble) and just before actual insert (
XLogInsertRecord)

I think the behaviour for the state when doPageWrites is true during
XLogRecordAssemble and false during XLogInsertRecord (just before
actual insert) is different as compare to HEAD.

In the patch, it will not retry if doPageWrites is false when we are
about to insert even though fpw_lsn <= RedoRecPtr whereas in HEAD it
will detect this during assembly of record and retry, isn't this a
problem?


So the scenario is that:

* XLogRecordAssemble decides that a page doesn't need to be backed up
* both RedoRecPtr and doPageWrites change while building the record. 
doPageWrites goes from true to false.


Without the patch, we would retry, because first check RedoRecPtr has 
changed. With the patch, we notice that even though RedoRecPtr has 
changed, doPageWrites is now off, so no FPWs are required regardless of 
RedoRecPtr, and not retry. Yeah, you're right, the behavior changes in 
that case. However, the new behavior is correct; the retry is 
unnecessary in that scenario.



3. There are couple of places where *XLogInsert* is used in wal.sgml
and it seems to me some of those needs change w.r.t this patch.


Hmm. It's a bit strange to mention XLogInsert and XLogFlush functions by 
name in that text. It's otherwise admin-level stuff, on how to set 
WAL-related settings. Up to 9.2 that manual page actually called them 
"LogInsert" and "LogFlush". But I guess you're right; if we are to 
mention the function by name, it should say XLogInsertRecord.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-11-04 Thread Amit Kapila

On Tue, Nov 4, 2014 at 10:03 PM, Heikki Linnakangas 
wrote:
>
> On 10/30/2014 06:02 PM, Andres Freund wrote:
>>
>> On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:
>>>
>>> On 10/06/2014 02:29 PM, Andres Freund wrote:

 On 2014-10-06 14:19:39 +0300, Heikki Linnakangas wrote:
>
> Barring objections, I'll commit this, and then continue benchmarking
the
> second patch with the WAL format and API changes.


 I'd like to have a look at it beforehand.
>>>
>>>
>>> Ping? Here's an rebased patch. I'd like to proceed with this.
>>
>>
>> Doing so.
>
>
> Here's a new version of this refactoring patch. It fixes all the concrete
issues you pointed out.
>
 I've not yet really looked,
 but on a quick readthrough XLogInsertRecData() staying in xlog.c
doesn't
 make me happy...
>>>
>>>
>>> Ok.. Can you elaborate?
>>
>>
>> To me the split between xloginsert.c doing some of the record assembly,
>> and xlog.c doing the lower level part of the assembly is just wrong.
>
>
> I moved the checks for bootstrap mode and xl_len == 0, from the current
XLogInsert to the new XLogInsert() function. I don't imagine that to be
enough address your feelings about the split between xlog.c and
xloginsert.c, but makes it a tiny bit clearer IMHO. I don't know what else
to do about it, as it feels very sensible to me as it is now. So unless you
object more loudly, I'm going to just go ahead with this and commit,
probably tomorrow.
>

Few observations while reading the latest patch:

1.
+XLogRecPtr
+XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
{
..
+ /* info's high bits are reserved for use by me */
+ if (info & XLR_INFO_MASK)
+ elog(PANIC, "invalid xlog info mask %02X", info);
..
}

Earlier before this check, we use to check XLogInsertAllowed()
which is moved to XLogInsertRecord(), isn't it better to keep
the check in beginning of the function XLogInsert()?

2.
XLogRecPtr
XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
{
..
if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)
{
/*
 * Oops, some buffer now needs to be backed up that the caller
 * didn't back up.  Start over.
 */
WALInsertLockRelease();
END_CRIT_SECTION();
return InvalidXLogRecPtr;
}
..
}

IIUC, there can be 4 states for doPageWrites w.r.t when record is getting
assembled (XLogRecordAssemble) and just before actual insert (
XLogInsertRecord)

I think the behaviour for the state when doPageWrites is true during
XLogRecordAssemble and false during XLogInsertRecord (just before
actual insert) is different as compare to HEAD.

In the patch, it will not retry if doPageWrites is false when we are
about to insert even though fpw_lsn <= RedoRecPtr whereas in HEAD it
will detect this during assembly of record and retry, isn't this a
problem?


3. There are couple of places where *XLogInsert* is used in wal.sgml
and it seems to me some of those needs change w.r.t this patch.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] WAL format and API changes (9.5)

2014-10-30 Thread Michael Paquier

On Fri, Oct 31, 2014 at 2:52 AM, Heikki Linnakangas  wrote:

> On 10/30/2014 06:02 PM, Andres Freund wrote:
>
>> On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:
>>
>>> On 10/06/2014 02:29 PM, Andres Freund wrote:
>>>
 I've not yet really looked,
 but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
 make me happy...

>>>
>>> Ok.. Can you elaborate?
>>>
>>
>> To me the split between xloginsert.c doing some of the record assembly,
>> and xlog.c doing the lower level part of the assembly is just wrong.
>>
>
> Really? To me, that feels like a natural split. xloginsert.c is
> responsible for constructing the final WAL record, with the backup blocks
> and all. It doesn't access any shared memory (related to WAL; it does look
> at Buffers, to determine what to back up). The function in xlog.c inserts
> the assembled record to the WAL, as is. It handles the locking and WAL
> buffer management related to that.
>
FWIW, I tend to the same opinion here.

What would you suggest? I don't think putting everything in one XLogInsert
> function, like we have today, is better. Note that the second patch makes
> xloginsert.c a lot more complicated.
>
I recall some time ago seeing complaints that xlog.c is too complicated and
should be refactored. Any effort in this area is a good thing IMO, and this
split made sense when I went through the code.


>  I'm not a big fan of the naming for the new split. We have
>> * XLogInsert() which is the external interface
>> * XLogRecordAssemble() which builds the rdata chain that will be
>>inserted
>> * XLogInsertRecData() which actually inserts the data into the xl buffers.
>>
>> To me that's pretty inconsistent.
>>
>
> Got a better suggestion? I wanted to keep the name XLogInsert() unchanged,
> to avoid code churn, and because it's still a good name for that.
> XLogRecordAssemble is pretty descriptive for what it does, although
> "XLogRecord" implies that it construct an XLogRecord struct. It does fill
> in that too, but the main purpose is to build the XLogRecData chain.
> Perhaps XLogAssembleRecord() would be better.
>
> I'm not very happy with XLogInsertRecData myself. XLogInsertRecord?
>
+1 for XLogInsertRecord.
-- 
Michael

Re: [HACKERS] WAL format and API changes (9.5)

2014-10-30 Thread Andres Freund

Hi,

On 2014-09-15 15:41:22 +0300, Heikki Linnakangas wrote:
> The second patch contains the interesting changes.

Heikki's pushed the newest version of this to the git tree.

Some things I noticed while reading the patch:
* potential mismerge:
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -408,7 +408,7 @@ main(int argc, char **argv)
}
}
 
-   while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:S:nF:wWv",
+   while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:nF:wWv",
long_options, &option_index)) != -1)
{
switch (c)

* Can't you just have REGBUF_WILL_INIT include REGBUF_NO_IMAGE? Then you
  don't need to test both.

* Is it really a good idea to separate XLogRegisterBufData() from
  XLogRegisterBuffer() the way you've done it ? If we ever actually get
  a record with a large numbers of blocks touched this issentially is
  O(touched_buffers*data_entries).

* At least in Assert mode XLogRegisterBuffer() should ensure we're not
  registering the same buffer twice. It's a pretty easy to make mistake
  to do it twice for some kind of actions (think UPDATE), and the
  consequences will be far from obvious afaics.

* There's lots of functions like XLogRecHasBlockRef() that dig through
  the wal record. A common pattern is something like:
if (XLogRecHasBlockRef(record, 1))
XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
else
oldblk = newblk;

  I think doing that repeatedly is quite a bad idea. We should parse the
  record once and then use it in a sensible format. Not do it in pieces,
  over and over again. It's not like we ignore backup blocks - so doing
  this lazily doesn't make sense to me.
  Especially as ValidXLogRecord() *already* has parsed the whole damn
  thing once.

* Maybe XLogRegisterData() and XLogRegisterBufData() should accept void*
  instead of char*? Right now there's lots of pointless casts in callers
  needed that don't add any safety. The one additional line that's then
  needed in these functions seems well worth it.

* There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only
  refer to the relation, but not to the block number. These still log
  their rnode manually. Shouldn't we somehow deal with those in a
  similar way explicit block references are dealt with?

* You've added an assert in RestoreBackupBlockContents() that ensures
  the page isn't new. That wasn't there before, instead there was a
  special case for it. I can't immediately think how that previously
  could happen, but I do wonder why it was there.

* The following comment in +RestoreBackupBlock probably is two lines to
  early
+   /* Found it, apply the update */
+   if (!(bkpb->fork_flags & BKPBLOCK_HAS_IMAGE))
+   return InvalidBuffer;
  This new InvalidBuffer case probably could use a comment in either
  XLogReadBufferForRedoExtended or here.

* Most of the commentary above RestoreBackupBlock() probably doesn't
  belong there anymore.

* Maybe XLogRecordBlockData.fork_flags should be renamed? Maybe just
  into flags_fork? Right now it makes at least me think that it's fork
  specific flags, which really isn't the actual meaning.

* I wonder if it wouldn't be worthwile, for the benefit of the FPI
  compression patch, to keep the bkp block data after *all* the
  "headers". That'd make it easier to just compress the data.

* +InitXLogRecordConstruct() seems like a remainder of the
  xlogconstruct.c naming. I'd just go for InitXLogInsert()

* Hm. At least WriteMZeroPageXlogRec (and probably the same for all the
  other slru stuff) doesn't include a reference to the page. Isn't that
  bad? Shouldn't we make XLogRegisterBlock() usable for that case?
  Otherwise I fail to see how pg_rewind like tools can sanely deal with this?

* Why did you duplicate the identical cases in btree_desc()? I guess due
  to rebasing ontop the xlogdump stats series?

* I haven't actually looked at the xlogdump output so I might be
  completely wrong, but won't the output of e.g. updates be harder to
  read now because it's not clear which of the references old/new
  buffers are? Not that it matters that much.

* s/recdataend/recdata_end/? The former is somewhat hard to parse for
  me.

* I think heap_xlog_update is buggy for wal_level=logical because it
  computes the length of the tuple using
  tuplen = recdataend - recdata;
  But the old primary key/old tuple value might be stored there as
  well. Afaics that code has to continue to use xl_heap_header_len.

* It looks to me like insert wal logging could just use REGBUF_KEEP_DATA
  to get rid of:
+   /*
+* The new tuple is normally stored as buffer 0's data. But if
+* XLOG_HEAP_CONTAINS_NEW_TUPLE flag is set, it's part of the main
+* data, after the xl_heap_insert struct.
+*/
+   if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
+   {
+   data = XLogRecGetData(record) + SizeOfHeapInsert;
+   datalen = record->xl_len - SizeOfHeapInsert;
+   }
+   else
+

Re: [HACKERS] WAL format and API changes (9.5)

2014-10-30 Thread Heikki Linnakangas


On 10/30/2014 06:02 PM, Andres Freund wrote:

On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:

On 10/06/2014 02:29 PM, Andres Freund wrote:

I've not yet really looked,
but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
make me happy...


Ok.. Can you elaborate?


To me the split between xloginsert.c doing some of the record assembly,
and xlog.c doing the lower level part of the assembly is just wrong.


Really? To me, that feels like a natural split. xloginsert.c is 
responsible for constructing the final WAL record, with the backup 
blocks and all. It doesn't access any shared memory (related to WAL; it 
does look at Buffers, to determine what to back up). The function in 
xlog.c inserts the assembled record to the WAL, as is. It handles the 
locking and WAL buffer management related to that.


What would you suggest? I don't think putting everything in one 
XLogInsert function, like we have today, is better. Note that the second 
patch makes xloginsert.c a lot more complicated.



And it leads to things like the already complicated logic to retry after
detecting missing fpws is now split across two files seems to confirm
that. What happens right now is that XLogInsert() (with some helper
routines) assembles the record. Then hands that off to
XLogInsertRecData(). Which sometimes returns InvalidXLogRecPtr, and
returns back to XLogInsert(), which reverses *some* of its work and then
retries. Brr.


Modifying the chain that was already constructed is not pretty, but it's 
not this patch's fault. I don't think splitting the functions makes that 
any worse. (BTW, the follow-up patch to change the WAL format fixes 
that; the per-buffer data is kept separately from the main data chain).



Similary the 'page_writes_omitted' logic doesn't make me particularly
happy. Previously we retried when there actually was a page affected by
the different RedoRecPtr. Now we do it as soon as our copy of RedoRecPtr
is out of date? Won't that quite often spuriously trigger retries? Am I
missing something?
Arguably this doesn't happen often enough to matter, but it's still
something that we should explicitly discuss.


The situation where it leads to a spurious retry is when the constructed 
WAL record skipped the FPW of a buffer, and RedoRecPtr was updated, but 
the buffer stilll doesn't need to be FPW according to the updated 
RedoRecPtr. That only happens if the same buffer was updated (by another 
backend) after the new checkpoint began. I believe that's extremely rare.


We could make that more fine-grained, though. Instead of passing a 
boolean 'fpw_sensitive' flag to XLogInsertRecData, pass the lowest LSN 
among the pages whose FPW was skipped. If that's less than the new 
RedoRecPtr, then retry is needed. That would eliminate the spurious retries.



The implementation of the split seems to change the meaning of
TRACE_POSTGRESQL_XLOG_INSERT - it's now counting retries. I don't
particularly care about those tracepoints, but I see no simplification
due to changing its meaning.


I wasn't sure what to do about that. I don't use tracepoints, and I 
don't know what others use them for.



I'm not a big fan of the naming for the new split. We have
* XLogInsert() which is the external interface
* XLogRecordAssemble() which builds the rdata chain that will be
   inserted
* XLogInsertRecData() which actually inserts the data into the xl buffers.

To me that's pretty inconsistent.


Got a better suggestion? I wanted to keep the name XLogInsert() 
unchanged, to avoid code churn, and because it's still a good name for 
that. XLogRecordAssemble is pretty descriptive for what it does, 
although "XLogRecord" implies that it construct an XLogRecord struct. It 
does fill in that too, but the main purpose is to build the XLogRecData 
chain. Perhaps XLogAssembleRecord() would be better.


I'm not very happy with XLogInsertRecData myself. XLogInsertRecord?


I'm also somewhat confused about the new split of the headers. I'm not
adverse to splitting them, but it's getting a bit hard to understand:
* xlog.h - generic mishmash
* xlogdefs.h - Three typedefs and some #defines
* xlog_fn.h - SQL functions
* xlog_internal.h - Pretty random collection of stuff
* xlogutils.h - "Utilities for replaying WAL records"
* xlogreader.h - "generic XLog reading facility"
* xloginsert.h - "Functions for generating WAL records"
* xlogrecord.h - "Definitions for the WAL record format."

Isn't that a bit too much? Pretty much the only files that have a
recognizable separation to me are xlog_fn.h and xlogreader.h.


This is a matter of taste, I guess. I find xlog.h, xlogdefs.h and 
xlog_internal.h fairly random, while the rest are have a clear function. 
xlogutils.h is needed by redo functions, and xloginsert.h is needed for 
generating WAL.


Note that the second patch adds more stuff to xloginsert.h and xlogrecord.h.


You've removed the
- *
- * NB: this routine feels free to scribble on the XLogRecData structs,
- * though not on

Re: [HACKERS] WAL format and API changes (9.5)

2014-10-30 Thread Andres Freund

On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:
> On 10/06/2014 02:29 PM, Andres Freund wrote:
> >On 2014-10-06 14:19:39 +0300, Heikki Linnakangas wrote:
> >>Barring objections, I'll commit this, and then continue benchmarking the
> >>second patch with the WAL format and API changes.
> >
> >I'd like to have a look at it beforehand.
> 
> Ping? Here's an rebased patch. I'd like to proceed with this.

Doing so.

> >I've not yet really looked,
> >but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
> >make me happy...
> 
> Ok.. Can you elaborate?

To me the split between xloginsert.c doing some of the record assembly,
and xlog.c doing the lower level part of the assembly is just wrong.

And it leads to things like the already complicated logic to retry after
detecting missing fpws is now split across two files seems to confirm
that. What happens right now is that XLogInsert() (with some helper
routines) assembles the record. Then hands that off to
XLogInsertRecData(). Which sometimes returns InvalidXLogRecPtr, and
returns back to XLogInsert(), which reverses *some* of its work and then
retries. Brr.

Similary the 'page_writes_omitted' logic doesn't make me particularly
happy. Previously we retried when there actually was a page affected by
the different RedoRecPtr. Now we do it as soon as our copy of RedoRecPtr
is out of date? Won't that quite often spuriously trigger retries? Am I
missing something?
Arguably this doesn't happen often enough to matter, but it's still
something that we should explicitly discuss.

The implementation of the split seems to change the meaning of
TRACE_POSTGRESQL_XLOG_INSERT - it's now counting retries. I don't
particularly care about those tracepoints, but I see no simplification
due to changing its meaning.

Aside: In C11 function local statics become a bit more expensive - they
essentially require atomics and/or spinlocks on the first few calls.

Next thing: The patch doesn't actually compile. Misses an #include
storage/proc.h for MyPgXact.

I'm not a big fan of the naming for the new split. We have
* XLogInsert() which is the external interface
* XLogRecordAssemble() which builds the rdata chain that will be
  inserted
* XLogInsertRecData() which actually inserts the data into the xl buffers.

To me that's pretty inconsistent.

I'm also somewhat confused about the new split of the headers. I'm not
adverse to splitting them, but it's getting a bit hard to understand:
* xlog.h - generic mishmash
* xlogdefs.h - Three typedefs and some #defines
* xlog_fn.h - SQL functions
* xlog_internal.h - Pretty random collection of stuff
* xlogutils.h - "Utilities for replaying WAL records"
* xlogreader.h - "generic XLog reading facility"
* xloginsert.h - "Functions for generating WAL records"
* xlogrecord.h - "Definitions for the WAL record format."

Isn't that a bit too much? Pretty much the only files that have a
recognizable separation to me are xlog_fn.h and xlogreader.h.

You've removed the
- *
- * NB: this routine feels free to scribble on the XLogRecData structs,
- * though not on the data they reference.  This is OK since the
XLogRecData
- * structs are always just temporaries in the calling code.
comment, but we still do, no?

While not this patches blame, I see currently we finish the critical
section in XLogInsert/XLogInsertRecData pretty early:
END_CRIT_SECTION();

/*
 * Update shared LogwrtRqst.Write, if we crossed page boundary.
 */
if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
{
SpinLockAcquire(&XLogCtl->info_lck);
/* advance global request to include new block(s) */
if (XLogCtl->LogwrtRqst.Write < EndPos)
XLogCtl->LogwrtRqst.Write = EndPos;
/* update local result copy while I have the chance */
LogwrtResult = XLogCtl->LogwrtResult;
SpinLockRelease(&XLogCtl->info_lck);
}
I don't think this is particularly critical, but it still looks wrong to me.

Hm. I think that's what I see for now.

Greetings,

Andres Freund

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-10-06 Thread Michael Paquier

On Mon, Oct 6, 2014 at 8:19 PM, Heikki Linnakangas 
wrote:

> On 10/06/2014 04:42 AM, Michael Paquier wrote:
>
>> On Fri, Oct 3, 2014 at 9:51 PM, Heikki Linnakangas <
>> hlinnakan...@vmware.com>
>> wrote:
>>
>>> So I now have a refactoring patch ready that I'd like to commit (the
>>>
>> attached two patches together), but to be honest, I have no idea why the
>> second patch is so essential to performance.
>> Thanks. I did some more tests with master, master+patch1,
>> master+patch1+CRC
>> refactoring, but I am not able to see any performance difference with
>> pgbench (--no-vacuum, -t) and the test suite you provided, just some noise
>> that barely changed performance.
>>
>
> Thanks for the confirmation. I'm really going crazy with benchmarking
> this. Sometimes I see a big difference, the next day it's gone.
>
The benchmark paradigms.


> * Fixed XLogSaveBufferForHint. It didn't initialize BkpBlock struct,
> rendering it completely broken.
>
Note for other reviewers: that's represented by this addition in
XLogSaveBufferForHint:
+   /* Make a BkpBlock struct representing the buffer */
+   XLogFillBkpBlock(buffer, buffer_std, &bkpb)

Regards,
-- 
Michael

Re: [HACKERS] WAL format and API changes (9.5)

2014-10-06 Thread Andres Freund

On 2014-10-06 14:19:39 +0300, Heikki Linnakangas wrote:
> On 10/06/2014 04:42 AM, Michael Paquier wrote:
> >On Fri, Oct 3, 2014 at 9:51 PM, Heikki Linnakangas 
> >wrote:
> >>So I now have a refactoring patch ready that I'd like to commit (the
> >attached two patches together), but to be honest, I have no idea why the
> >second patch is so essential to performance.
> >Thanks. I did some more tests with master, master+patch1, master+patch1+CRC
> >refactoring, but I am not able to see any performance difference with
> >pgbench (--no-vacuum, -t) and the test suite you provided, just some noise
> >that barely changed performance.
> 
> Thanks for the confirmation. I'm really going crazy with benchmarking this.
> Sometimes I see a big difference, the next day it's gone.

A usual suspect for this is turbo mode and power control. It often
already helps to disable the latter to get much more reproducible
benchmark results.

> Barring objections, I'll commit this, and then continue benchmarking the
> second patch with the WAL format and API changes.

I'd like to have a look at it beforehand. I've not yet really looked,
but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
make me happy...

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-10-05 Thread Michael Paquier

On Fri, Oct 3, 2014 at 9:51 PM, Heikki Linnakangas 
wrote:
> So I now have a refactoring patch ready that I'd like to commit (the
attached two patches together), but to be honest, I have no idea why the
second patch is so essential to performance.
Thanks. I did some more tests with master, master+patch1, master+patch1+CRC
refactoring, but I am not able to see any performance difference with
pgbench (--no-vacuum, -t) and the test suite you provided, just some noise
that barely changed performance. Note that fd.c uses
SYNC_METHOD_FSYNC_WRITETHROUGH, so it is necessary to include xlog.h in it.
-- 
Michael

Re: [HACKERS] WAL format and API changes (9.5)

2014-10-03 Thread Heikki Linnakangas


On 10/03/2014 04:10 PM, Andres Freund wrote:

On 2014-10-03 15:51:37 +0300, Heikki Linnakangas wrote:

After a lot of experimentation, I figured out that the slowdown is already
apparent with the *first* patch, the one that just refactors existing
XLogInsert, without any changes to the WAL format or to the callers. I
fiddled with that for a long time, trying to tease apart the change that
makes the difference, and was finally able to narrow it down.

Attached are two patches. These are both just refactoring, with no changes
to the WAL format or APIs. The first is the same as the refactoring patch I
posted earlier, with only minor changes to #includes, comments and such (per
Alvaro's and Michael's suggestions - thanks!). With the first patch, the
test case I've been using to performance test this becomes somewhere between
5% - 10% slower. Applying the second patch on top of that restores the
performance back to what you get without these patches.

Strange. The second patch moves the CRC calculation from a separate loop
over the XLogRecDatas to the earlier loop, that also iterates through the
XLogRecDatas. The strange thing about this is that when I tried to make the
same change to current git master, without applying the first patch, it
didn't make any difference. The CRC calculation used to integrated to the
earlier loop in 9.1 and before, but in 9.2 it was moved to a separate loop
for simplicity, because it didn't make any difference to performance.

So I now have a refactoring patch ready that I'd like to commit (the
attached two patches together), but to be honest, I have no idea why the
second patch is so essential to performance.


Maybe a stupid question, but did you verify it's not caused by splitting
xloginsert over two translation units and/or not inlining the separate
function?


Hmph, now that I try this again, I don't see any difference with or 
without the "second" patch - they both perform just as well as unpatched 
master. So I'm totally baffled again, but I guess the patch is good to 
go. I need a more stable test environment..


Yeah, I did try merging xlog.c and xloginsert.c into the same .c file at 
some point, and marking the XLogInsertRecData function as static, 
allowing the compiler to inline it, but I didn't see any difference. But 
you have to take that with a grain of salt - I'm apparently not capable 
of constructing a properly repeatable test case for this.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-10-03 Thread Andres Freund

On 2014-10-03 15:51:37 +0300, Heikki Linnakangas wrote:
> After a lot of experimentation, I figured out that the slowdown is already
> apparent with the *first* patch, the one that just refactors existing
> XLogInsert, without any changes to the WAL format or to the callers. I
> fiddled with that for a long time, trying to tease apart the change that
> makes the difference, and was finally able to narrow it down.
> 
> Attached are two patches. These are both just refactoring, with no changes
> to the WAL format or APIs. The first is the same as the refactoring patch I
> posted earlier, with only minor changes to #includes, comments and such (per
> Alvaro's and Michael's suggestions - thanks!). With the first patch, the
> test case I've been using to performance test this becomes somewhere between
> 5% - 10% slower. Applying the second patch on top of that restores the
> performance back to what you get without these patches.
> 
> Strange. The second patch moves the CRC calculation from a separate loop
> over the XLogRecDatas to the earlier loop, that also iterates through the
> XLogRecDatas. The strange thing about this is that when I tried to make the
> same change to current git master, without applying the first patch, it
> didn't make any difference. The CRC calculation used to integrated to the
> earlier loop in 9.1 and before, but in 9.2 it was moved to a separate loop
> for simplicity, because it didn't make any difference to performance.
> 
> So I now have a refactoring patch ready that I'd like to commit (the
> attached two patches together), but to be honest, I have no idea why the
> second patch is so essential to performance.

Maybe a stupid question, but did you verify it's not caused by splitting
xloginsert over two translation units and/or not inlining the separate
function?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-09-16 Thread Michael Paquier

On Mon, Sep 15, 2014 at 9:16 PM, Michael Paquier
 wrote:
> 2) XLogCheckBufferNeedsBackup is not used. It can be removed.

Please ignore this comment, XLogCheckBufferNeedsBackup is used in heapam.c.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-09-16 Thread Andres Freund

On 2014-09-15 15:41:22 +0300, Heikki Linnakangas wrote:
> Here we go. I've split this again into two patches. The first patch is just
> refactoring the current code. It moves XLogInsert into a new file,
> xloginsert.c, and the definition of XLogRecord to new xlogrecord.h header
> file. As a result, there is a a lot of churn in the #includes in C files
> that generate WAL records, or contain redo routines.  The number of files
> that pull in xlog.h - directly or indirectly through other headers - is
> greatly reduced.
> 
> The second patch contains the interesting changes.
> 
> I wrote a little benchmark kit to performance test this. I'm trying to find
> out two things:
> 
> 1) How much CPU overhead do the new XLogBeginInsert and XLogRegister*
> functions add, compared to the current approach with XLogRecDatas.
> 
> 2) How much extra WAL is generated with the patch. This affects the CPU time
> spent in the tests, but it's also interesting to measure directly, because
> WAL size affects many things like WAL archiving, streaming replication etc.
> 
> Attached is the test kit I'm using. To run the battery of tests, use "psql
> -f run.sql". To answer the question of WAL volume, it runs a bunch of tests
> that exercise heap insert, update and delete, as well as b-tree and GIN
> insertions. To answer the second test, it runs a heap insertion test, with a
> tiny record size that's chosen so that it generates exactly the same amount
> of WAL after alignment with and without the patch. The test is repeated many
> times, and the median of runtimes is printed out.
> 
> Here are the results, comparing unpatched and patched versions. First, the
> WAL sizes:
> 
> A heap insertion records are 2 bytes larger with the patch. Due to
> alignment, that makes for a 0 or 8 byte difference in the record sizes.
> Other WAL records have a similar store; a few extra bytes but no big
> regressions. There are a few outliers above where it appears that the
> patched version takes less space. Not sure why that would be, probably just
> a glitch in the test, autovacuum kicked in or something.

I've to admit, that's already not a painless amount of overhead.

> Now, for the CPU overhead:
> 
>   description   | dur_us (orig) | dur_us (patched) |   %
> +---+--+
>  heap insert 30 | 0.7752835 | 0.831883 | 107.30
> (1 row)
> 
> So, the patched version runs 7.3 % slower. That's disappointing :-(.
> 
> This are the result I got on my laptop today. Previously, the typical result
> I've gotten has been about 5%, so that's a bit high. Nevertheless, even a 5%
> slowdown is probably not acceptable.

Yes, I definitely think it's not.

> While I've trying to nail down where that difference comes from, I've seen a
> lot of strange phenomenon. At one point, the patched version was 10% slower,
> but I was able to bring the difference down to 5% if I added a certain
> function in xloginsert.c, but never called it. It was very repeatable at the
> time, I tried adding and removing it many times and always got the same
> result, but I don't see it with the current HEAD and patch version anymore.
> So I think 5% is pretty close to the margin of error that arises from
> different compiler optimizations, data/instruction cache effects etc.
> 
> Looking at the 'perf' profile, The new function calls only amount to about
> 2% of overhead, so I'm not sure where the slowdown is coming from. Here are
> explanations I've considered, but I haven't been able to prove any of them:

I'd suggest doing:
a) perf stat -vvv of both workloads. That will often tell you stuff already
b) Look at other events. Particularly stalled-cycles-frontend,
   stalled-cycles-backend, cache-misses


Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-09-15 Thread Michael Paquier

On Mon, Sep 15, 2014 at 8:00 PM, Michael Paquier
 wrote:
> Alvaro got faster than me... I was just looking at the first patch and
> +1 on those comments. It is worth noting that the first patch, as it
> does only a large refactoring, does not impact performance or size of
> WAL records.

Some comments after a second and closer read of the first patch:
1) Wouldn't it be better to call GetFullPageWriteInfo directly in
XLogRecordAssemble, removing RedoRecPtr and doPageWrites from its
arguments?
2) XLogCheckBufferNeedsBackup is not used. It can be removed.
3) If I am following correctly, there are two code paths where
XLogInsertRecData can return InvalidXLogRecPtr, and then there is this
process in XLogInsert
+   {
+   /*
+* Undo the changes we made to the rdata chain.
+*
+* XXX: This doesn't undo *all* the changes;
the XLogRecData
+* entries for buffers that we had already
decided to back up have
+* had their data-pointers cleared. That's OK,
as long as we
+* decide to back them up on the next
iteration as well. Hence,
+* don't allow "doPageWrites" value to go from
true to false after
+* we've modified the rdata chain.
+*/
+   boolnewDoPageWrites;
+
+   GetFullPageWriteInfo(&RedoRecPtr, &newDoPageWrites);
+   if (!doPageWrites && newDoPageWrites)
+   doPageWrites = true;
+   rdt_lastnormal->next = NULL;
+   }
Wouldn't it be better to keep that in XLogInsertRecData? This way
GetFullPageWriteInfo could be removed (I actually see little point in
keeping it as part of this patch, but does this change make its sense
in patch 2?).
4) This patch is in DOS encoding (?!)

That's all I have for now...
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-09-15 Thread Michael Paquier

On Mon, Sep 15, 2014 at 7:03 PM, Alvaro Herrera
 wrote:
> Heikki Linnakangas wrote:
>
>> Here we go. I've split this again into two patches. The first patch
>> is just refactoring the current code. It moves XLogInsert into a new
>> file, xloginsert.c, and the definition of XLogRecord to new
>> xlogrecord.h header file. As a result, there is a a lot of churn in
>> the #includes in C files that generate WAL records, or contain redo
>> routines.  The number of files that pull in xlog.h - directly or
>> indirectly through other headers - is greatly reduced.
>
> I think you should push the first patch for now and continue to
> investigate the issues in the second patch. Some minor suggestions I
> have for the first patch are that since xlog.h is no longer used by
> other headers (but only by .c files), you should just #include
> xloginsert.h instead of doing the forward declaration of struct
> XLogRecData; and in xlog_internal.h, instead of
>
> - * XLogRecord is defined in xlog.h, but we avoid #including that to keep
> + * XLogRecord is defined in xlogrecord.h, but we avoid #including that to 
> keep
>
> I would just suggest to #include xlogrecord.h, since it should just
> work to include that file from frontend programs.  It didn't work for
> xlog.h because that one #includes fmgr.h IIRC and that one causes Datum
> to appear, which makes compilation blow up.  Shouldn't be the case here.

Alvaro got faster than me... I was just looking at the first patch and
+1 on those comments. It is worth noting that the first patch, as it
does only a large refactoring, does not impact performance or size of
WAL records. Btw, a declaration of access/xlog.h in fd.c is forgotten.
Also, if somebody is interested in running the test suite, there are
comments on how to do it at the top of compare.sql.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-09-15 Thread Alvaro Herrera

Heikki Linnakangas wrote:

> Here we go. I've split this again into two patches. The first patch
> is just refactoring the current code. It moves XLogInsert into a new
> file, xloginsert.c, and the definition of XLogRecord to new
> xlogrecord.h header file. As a result, there is a a lot of churn in
> the #includes in C files that generate WAL records, or contain redo
> routines.  The number of files that pull in xlog.h - directly or
> indirectly through other headers - is greatly reduced.

I think you should push the first patch for now and continue to
investigate the issues in the second patch.  Some minor suggestions I
have for the first patch are that since xlog.h is no longer used by
other headers (but only by .c files), you should just #include
xloginsert.h instead of doing the forward declaration of struct
XLogRecData; and in xlog_internal.h, instead of 

- * XLogRecord is defined in xlog.h, but we avoid #including that to keep
+ * XLogRecord is defined in xlogrecord.h, but we avoid #including that to keep

I would just suggest to #include xlogrecord.h, since it should just
work to include that file from frontend programs.  It didn't work for
xlog.h because that one #includes fmgr.h IIRC and that one causes Datum
to appear, which makes compilation blow up.  Shouldn't be the case here.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-09-03 Thread Michael Paquier

On Tue, Sep 2, 2014 at 9:23 PM, Heikki Linnakangas
 wrote:
> I committed the redo-routine refactoring patch. I kept the XLog prefix in
> the XLogReadBufferForRedo name; it's redundant, but all the other similar
> functions in xlogutils.c use the XLog prefix so it would seem inconsistent
> to not have it here.
Thanks! Even that will be helpful for a potential patch doing
consistency comparisons of FPW with current pages having WAL of a
record applied.

> I'll post a new version of the main patch shortly...
Looking forward to seeing it.

Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-09-03 Thread Michael Paquier

On Tue, Sep 2, 2014 at 9:23 PM, Heikki Linnakangas
 wrote:
> I committed the redo-routine refactoring patch. I kept the XLog prefix in
> the XLogReadBufferForRedo name; it's redundant, but all the other similar
> functions in xlogutils.c use the XLog prefix so it would seem inconsistent
> to not have it here.
Thanks! Even that will be helpful for a potential patch doing
consistency comparisons of FPW with current pages having WAL of a
record applied.

> I'll post a new version of the main patch shortly...
Looking forward to seeing it.

Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-09-02 Thread Heikki Linnakangas


On 08/19/2014 05:38 PM, Andres Freund wrote:

On 2014-08-19 10:33:29 -0400, Alvaro Herrera wrote:

Heikki Linnakangas wrote:


Barring objections or better ideas, I'm leaning towards
XLogReadBufferForRedo.


WFM


for me too. Although we could imo strip the 'XLog' in the beginning if
we want to make it shorter. The ForRedo is saying that pretty much.


I committed the redo-routine refactoring patch. I kept the XLog prefix 
in the XLogReadBufferForRedo name; it's redundant, but all the other 
similar functions in xlogutils.c use the XLog prefix so it would seem 
inconsistent to not have it here.


I'll post a new version of the main patch shortly...

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-19 Thread Andres Freund

On 2014-08-19 10:33:29 -0400, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
> 
> > Barring objections or better ideas, I'm leaning towards
> > XLogReadBufferForRedo.
> 
> WFM

for me too. Although we could imo strip the 'XLog' in the beginning if
we want to make it shorter. The ForRedo is saying that pretty much.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-19 Thread Alvaro Herrera

Heikki Linnakangas wrote:

> Barring objections or better ideas, I'm leaning towards
> XLogReadBufferForRedo.

WFM

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-19 Thread Heikki Linnakangas


On 08/19/2014 11:07 AM, Michael Paquier wrote:

Patch 1 looks good. The main difference with the v1 submitted a couple of
days back is that the global variable approach is replaced with additional
arguments for the LSN position and record pointer in XLogReplayBuffer. I
have as well run a couple of tests with the page comparison tool, done some
tests based on installcheck-world with a slave replaying WAL behind, and
found no issues with it.


Thanks!


Perhaps we could consider pushing it to facilitate the next work? Even if
the second patch is dropped it is still a win IMO to have backup block
replay managed within a single function (being it named XLogReplayBuffer in
latest patch), and having it return a status flag.


Yeah, that was my plan.

Regarding the name, the following have been suggested so far:

XLogReplayBuffer
XLogRestoreBuffer
XLogRecoverBuffer
XLogReadBufferForReplay
ReadBufferForXLogReplay

One more idea:

XLogRedoBuffer (would be like three first options above, but would match 
that we call the functions that call this "redo" functions)


I think XLogReadBufferForReplay is the most descriptive. Andres and 
Alvaro both suggested it - independently I believe - so that seems to 
come out naturally. But maybe make it XLogReadBufferForRedo, since we 
call the redo functions redo functions and not replay functions.


Yet another option is to just call it XLogReadBuffer, and rename the 
existing XLogReadBuffer to something else. With the 2nd patch, there are 
only a few callers of XLogReadBuffer left. But is it too deceitful if 
XLogReadBuffer doesn't merely read the page, but also sometimes replaces 
it with a full-page image? Maybe it's OK..



Barring objections or better ideas, I'm leaning towards 
XLogReadBufferForRedo.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-19 Thread Michael Paquier

On Mon, Aug 18, 2014 at 10:55 PM, Heikki Linnakangas <
hlinnakan...@vmware.com> wrote:
> All rightey.
>
> Here's the next version of this work. It now comes as two patches. The
first
> one refactors adds the XLogReplayBuffer() function and refactors all the
> redo functions to use it. It doesn't change the WAL record format in any
> way. The second patch applies over the first one, and changes the WAL
> format, and all the WAL-creating functions to use the new API for
> constructing WAL records. The second patch removes the relfilenode and
block
> number arguments from XLogReplayBuffer, because they're no longer needed
> when that information is included in the record format.
>
> Todo:
>
> * Performance testing. Do the new WAL construction functions add overhead
to
> WAL insertion?
> * Compare WAL record sizes before and after. I've tried to keep it as
> compact as possible, but I know that some records have gotten bigger. Need
> to do a more thorough analysis.
> * Rename XLogReplayBuffer. I don't particularly like any of the
suggestions
> so far, XLogReplayBuffer included. Discussion continues..
> * Anything else?

Patch 1 looks good. The main difference with the v1 submitted a couple of
days back is that the global variable approach is replaced with additional
arguments for the LSN position and record pointer in XLogReplayBuffer. I
have as well run a couple of tests with the page comparison tool, done some
tests based on installcheck-world with a slave replaying WAL behind, and
found no issues with it.
Perhaps we could consider pushing it to facilitate the next work? Even if
the second patch is dropped it is still a win IMO to have backup block
replay managed within a single function (being it named XLogReplayBuffer in
latest patch), and having it return a status flag.

Regarding patch 2:
- The main differences with the latest version are the modifications for
XLogReplayBuffer having new arguments (LSN position and record pointer).
XLogRecGetBlockRefIds has been changed to return a palloc'd array of block
IDs. xloginsert.h, containing all the functions for xlog record
construction is introduced as well.
- Tiny thing, be aware of tab padding. Here is heapam.c:
page = BufferGetPage(buffer);
PageSetAllVisible(page);
MarkBufferDirty(buffer);
- XLogRecGetBlockRefIds is not described in
src/backend/access/transam/README. Btw, pg_xlogdump drops a core dump when
using it:
--Output:
Assertion failed: (n == *num_refs), function XLogRecGetBlockRefIds, file
xlogreader.c, line 1157.
rmgr: Heaplen (rec/tot): 14/ 12912, tx:  3, lsn:
0/01000148, prev 0/01000120, Abort trap: 6 (core dumped)
-- Backtrace:
frame #4: 0x000103870363
pg_xlogdump`XLogRecGetBlockRefIds(record=0x7ff38a003200,
num_refs=0x7fff5c394464) + 435 at xlogreader.c:1157
frame #5: 0x00010386d610
pg_xlogdump`XLogDumpDisplayRecord(config=0x7fff5c3945c8,
ReadRecPtr=16777544, record=0x7ff38a003200) + 272 at pg_xlogdump.c:357
frame #6: 0x00010386cad8 pg_xlogdump`main(argc=2,
argv=0x7fff5c394658) + 3160 at pg_xlogdump.c:749
In order to reproduce that, simply run regression tests, followed by
pg_xlogdump on one of the WAL files generated.
- This patch includes some diffs from pg_receivexlog.c taken from 52bffe3.
- I have run installcheck-world and compared the size of the WAL generated
on HEAD and the patch (any hints to improve such analysis are of course
welcome)
  name  |   start   |stop|   diff
+---++---
 HEAD (8605bc7) | 0/16C6808 | 0/11A2C670 | 271998568
 Patch 1+2  | 0/16D45D8 | 0/1267A4B0 | 284843736
(2 rows)
So that's a diff of more or less 13MB for this test set.

Looking forward for some performance numbers as well as more precise
comparison of WAL record length.
-- 
Michael

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-14 Thread Andres Freund

On 2014-08-14 12:41:35 +0300, Heikki Linnakangas wrote:
> On 08/14/2014 10:29 AM, Michael Paquier wrote:
> >On Thu, Aug 14, 2014 at 3:04 AM, Alvaro Herrera
> > wrote:
> >>Heikki Linnakangas wrote:
> >>What's with XLogReplayLSN and XLogReplayRecord?  IMO the xlog code has
> >>more than enough global variables already, it'd be good to avoid two
> >>more if possible.  Is there no other good way to get this info down to
> >>whatever it is that needs them?
> >Yep, they do not look necessary. Looking at the patch, you could get
> >rid of XLogReplayLSN and XLogReplayRecord by adding two extra argument
> >to XLogReplayBuffer: one for the LSN of the current record, and a
> >second for the record pointer. The code just saves those two variables
> >in the redo loop of StartupXLOG to only reuse them in
> >XLogReplayBufferExtended, and I saw no code paths in the redo routines
> >where XLogReplayBuffer is called at places without the LSN position
> >and the record pointer.
> >
> >However I think that Heikki introduced those two variables to make the
> >move to the next patch easier.
> 
> The next patch doesn't necessary require them either, you could always pass
> the LSN and record as an argument. I wanted to avoid that, because every
> redo function would just pass the current record being replayed, so it seems
> nicer to pass that information "out-of-band". I guess if we do that, though,
> we should remove those arguments from rm_redo interface altogether, and
> always rely on the global variables to get the "current" record or its LSN.
> I'm not wedded on this, I could be persuaded to go either way...

I personally find the out of band transport really ugly.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-14 Thread Heikki Linnakangas


On 08/14/2014 11:22 AM, Michael Paquier wrote:

1) Why changing that from ERROR to PANIC?
 /* Caller specified a bogus block_index */
-   elog(ERROR, "failed to restore block_index %d", block_index);
+   elog(PANIC, "failed to restore block_index %d", block_index);


No reason, just a leftover from debugging. Please ignore.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-14 Thread Heikki Linnakangas


On 08/14/2014 10:29 AM, Michael Paquier wrote:

On Thu, Aug 14, 2014 at 3:04 AM, Alvaro Herrera
 wrote:

Heikki Linnakangas wrote:
What's with XLogReplayLSN and XLogReplayRecord?  IMO the xlog code has
more than enough global variables already, it'd be good to avoid two
more if possible.  Is there no other good way to get this info down to
whatever it is that needs them?

Yep, they do not look necessary. Looking at the patch, you could get
rid of XLogReplayLSN and XLogReplayRecord by adding two extra argument
to XLogReplayBuffer: one for the LSN of the current record, and a
second for the record pointer. The code just saves those two variables
in the redo loop of StartupXLOG to only reuse them in
XLogReplayBufferExtended, and I saw no code paths in the redo routines
where XLogReplayBuffer is called at places without the LSN position
and the record pointer.

However I think that Heikki introduced those two variables to make the
move to the next patch easier.


The next patch doesn't necessary require them either, you could always 
pass the LSN and record as an argument. I wanted to avoid that, because 
every redo function would just pass the current record being replayed, 
so it seems nicer to pass that information "out-of-band". I guess if we 
do that, though, we should remove those arguments from rm_redo interface 
altogether, and always rely on the global variables to get the "current" 
record or its LSN. I'm not wedded on this, I could be persuaded to go 
either way...


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-14 Thread Michael Paquier

On Thu, Aug 14, 2014 at 2:05 AM, Heikki Linnakangas
 wrote:
> Here's a full version of this refactoring patch, all the rmgr's have now
> been updated to use XLogReplayBuffer(). I think this is a worthwhile change
> on its own, even if we drop the ball on the rest of the WAL format patch,
> because it makes the redo-routines more readable. I propose to commit this
> as soon as someone has reviewed it, and we agree on a for what's currently
> called XLogReplayBuffer(). I have tested this with my page-image comparison
> tool.

I have as well passed this patch through the page comparison tool and
found no problems, with both regression and isolation tests. I also
had a look at the redo routines that are changed and actually found
nothing. Now, what this patch does is not much complicated, it adds a
couple of status flags returned by XLogReplayBuffer. Then, roughly,
when BLK_NEEDS_REDO is returned the previous restore actions are done.
This has the merit to put the LSN check on current page to determine
if a page needs to be replayed inside XLogReplayBuffer.

A couple of minor comments though:
1) Why changing that from ERROR to PANIC?
/* Caller specified a bogus block_index */
-   elog(ERROR, "failed to restore block_index %d", block_index);
+   elog(PANIC, "failed to restore block_index %d", block_index);
2) There are some whitespaces, like here:
+   {
+   destPage = NULL;/* don't do any page updates */
}

Regards,
-- 
Michael

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-14 Thread Michael Paquier

On Thu, Aug 14, 2014 at 3:04 AM, Alvaro Herrera
 wrote:
> Heikki Linnakangas wrote:
> What's with XLogReplayLSN and XLogReplayRecord?  IMO the xlog code has
> more than enough global variables already, it'd be good to avoid two
> more if possible.  Is there no other good way to get this info down to
> whatever it is that needs them?
Yep, they do not look necessary. Looking at the patch, you could get
rid of XLogReplayLSN and XLogReplayRecord by adding two extra argument
to XLogReplayBuffer: one for the LSN of the current record, and a
second for the record pointer. The code just saves those two variables
in the redo loop of StartupXLOG to only reuse them in
XLogReplayBufferExtended, and I saw no code paths in the redo routines
where XLogReplayBuffer is called at places without the LSN position
and the record pointer.

However I think that Heikki introduced those two variables to make the
move to the next patch easier.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-13 Thread Alvaro Herrera

Heikki Linnakangas wrote:

> Here's a full version of this refactoring patch, all the rmgr's have
> now been updated to use XLogReplayBuffer(). I think this is a
> worthwhile change on its own, even if we drop the ball on the rest
> of the WAL format patch, because it makes the redo-routines more
> readable. I propose to commit this as soon as someone has reviewed
> it, and we agree on a for what's currently called
> XLogReplayBuffer(). I have tested this with my page-image comparison
> tool.

What's with XLogReplayLSN and XLogReplayRecord?  IMO the xlog code has
more than enough global variables already, it'd be good to avoid two
more if possible.  Is there no other good way to get this info down to
whatever it is that needs them?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-13 Thread Alvaro Herrera

Heikki Linnakangas wrote:

> Here's a full version of this refactoring patch, all the rmgr's have
> now been updated to use XLogReplayBuffer(). I think this is a
> worthwhile change on its own, even if we drop the ball on the rest
> of the WAL format patch, because it makes the redo-routines more
> readable. I propose to commit this as soon as someone has reviewed
> it, and we agree on a for what's currently called
> XLogReplayBuffer(). I have tested this with my page-image comparison
> tool.

XLogReadBufferForReplay ?
ReadBufferForXLogReplay ?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-13 Thread Heikki Linnakangas


On 08/13/2014 04:15 PM, Heikki Linnakangas wrote:

Hmm, thinking about this some more, there is one sensible way to split
this patch: We can add the XLogReplayBuffer() function and rewrite all
the redo routines to use it, without changing any WAL record formats or
anything in the way the WAL records are constructed. In the patch,
XLogReplayBuffer() takes one input arument, the block reference ID, and
it fetches the RelFileNode and BlockNumber of the block based on that.
Without the WAL format changes, the information isn't there in the
record, but we can require the callers to pass the RelFileNode and
BlockNumber. The final patch will remove those arguments from every
caller, but that's a very mechanical change.

As in the attached patch. I only modified the heapam redo routines to
use the new XLogReplayBuffer() idiom; the idea is to do that for every
redo routine.

After applying such a patch, the main WAL format changing patch becomes
much smaller, and makes it easier to see from the redo routines where
significant changes to the WAL record formats have been made. This also
allows us to split the bikeshedding; we can discuss the name of
XLogReplayBuffer() first :-).


Here's a full version of this refactoring patch, all the rmgr's have now 
been updated to use XLogReplayBuffer(). I think this is a worthwhile 
change on its own, even if we drop the ball on the rest of the WAL 
format patch, because it makes the redo-routines more readable. I 
propose to commit this as soon as someone has reviewed it, and we agree 
on a for what's currently called XLogReplayBuffer(). I have tested this 
with my page-image comparison tool.


- Heikki

commit 4c6c7571e91f34919aff2c2981fe474537dc557b
Author: Heikki Linnakangas 
Date:   Wed Aug 13 15:39:08 2014 +0300

Refactor redo routines to use XLogReplayBuffer

diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index ffabf4e..acbb840 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -20,25 +20,23 @@
 static MemoryContext opCtx;		/* working memory for operations */
 
 static void
-ginRedoClearIncompleteSplit(XLogRecPtr lsn, RelFileNode node, BlockNumber blkno)
+ginRedoClearIncompleteSplit(XLogRecPtr lsn, int block_index,
+			RelFileNode node, BlockNumber blkno)
 {
 	Buffer		buffer;
 	Page		page;
 
-	buffer = XLogReadBuffer(node, blkno, false);
-	if (!BufferIsValid(buffer))
-		return;	/* page was deleted, nothing to do */
-	page = (Page) BufferGetPage(buffer);
-
-	if (lsn > PageGetLSN(page))
+	if (XLogReplayBuffer(block_index, node, blkno, &buffer) == BLK_NEEDS_REDO)
 	{
+		page = (Page) BufferGetPage(buffer);
+
 		GinPageGetOpaque(page)->flags &= ~GIN_INCOMPLETE_SPLIT;
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
 	}
-
-	UnlockReleaseBuffer(buffer);
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
 }
 
 static void
@@ -351,26 +349,14 @@ ginRedoInsert(XLogRecPtr lsn, XLogRecord *record)
 		rightChildBlkno = BlockIdGetBlockNumber((BlockId) payload);
 		payload += sizeof(BlockIdData);
 
-		if (record->xl_info & XLR_BKP_BLOCK(0))
-			(void) RestoreBackupBlock(lsn, record, 0, false, false);
-		else
-			ginRedoClearIncompleteSplit(lsn, data->node, leftChildBlkno);
+		ginRedoClearIncompleteSplit(lsn, 0, data->node, leftChildBlkno);
 	}
 
-	/* If we have a full-page image, restore it and we're done */
-	if (record->xl_info & XLR_BKP_BLOCK(isLeaf ? 0 : 1))
+	if (XLogReplayBuffer(isLeaf ? 0 : 1,
+		 data->node, data->blkno, &buffer) == BLK_NEEDS_REDO)
 	{
-		(void) RestoreBackupBlock(lsn, record, isLeaf ? 0 : 1, false, false);
-		return;
-	}
-
-	buffer = XLogReadBuffer(data->node, data->blkno, false);
-	if (!BufferIsValid(buffer))
-		return;	/* page was deleted, nothing to do */
-	page = (Page) BufferGetPage(buffer);
+		page = BufferGetPage(buffer);
 
-	if (lsn > PageGetLSN(page))
-	{
 		/* How to insert the payload is tree-type specific */
 		if (data->flags & GIN_INSERT_ISDATA)
 		{
@@ -386,8 +372,8 @@ ginRedoInsert(XLogRecPtr lsn, XLogRecord *record)
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
 	}
-
-	UnlockReleaseBuffer(buffer);
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
 }
 
 static void
@@ -476,12 +462,7 @@ ginRedoSplit(XLogRecPtr lsn, XLogRecord *record)
 	 * split
 	 */
 	if (!isLeaf)
-	{
-		if (record->xl_info & XLR_BKP_BLOCK(0))
-			(void) RestoreBackupBlock(lsn, record, 0, false, false);
-		else
-			ginRedoClearIncompleteSplit(lsn, data->node, data->leftChildBlkno);
-	}
+		ginRedoClearIncompleteSplit(lsn, 0, data->node, data->leftChildBlkno);
 
 	flags = 0;
 	if (isLeaf)
@@ -607,29 +588,19 @@ ginRedoVacuumDataLeafPage(XLogRecPtr lsn, XLogRecord *record)
 	Buffer		buffer;
 	Page		page;
 
-	/* If we have a full-page image, restore it and we're done */
-	if (record->xl_info & XLR_BKP_BLOCK(0))
+	if (XLogReplayBuffer(0, xlrec->node, xlrec->blkno, &buffer) == BLK_NEEDS_REDO)
 	{
-		(void) RestoreBackupBlock(lsn, record, 0, false, false);
-

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-13 Thread Heikki Linnakangas


On 08/13/2014 02:07 PM, Andres Freund wrote:

On 2014-08-13 02:36:59 +0300, Heikki Linnakangas wrote:

On 08/13/2014 01:04 AM, Andres Freund wrote:

* The patch mixes the API changes around WAL records with changes of how
   individual actions are logged. That makes it rather hard to review -
   and it's a 500kb patch already.

   I realize it's hard to avoid because the new API changes which
   information about blocks is available, but it does make it really hard
   to see whether the old/new code is doing something
   equivalent. It's hard to find more critical code than this :/


Yeah, I hear you. I considered doing this piecemeal, just adding the new
functions first so that you could still use the old XLogRecData API, until
all the functions have been converted. But in the end, I figured it's not
worth it, as sooner or later we'd want to convert all the functions anyway.


I think it might be worthwile anyway. I'd be very surprised if there
aren't several significant bugs in the conversion. Your full page
checking tool surely helps to reduce the number, but it's not
foolproof. I can understand not wanting to do it though, it's a
significant amount of work.

Would you ask somebody else to do it in two steps?


Hmm, thinking about this some more, there is one sensible way to split 
this patch: We can add the XLogReplayBuffer() function and rewrite all 
the redo routines to use it, without changing any WAL record formats or 
anything in the way the WAL records are constructed. In the patch, 
XLogReplayBuffer() takes one input arument, the block reference ID, and 
it fetches the RelFileNode and BlockNumber of the block based on that. 
Without the WAL format changes, the information isn't there in the 
record, but we can require the callers to pass the RelFileNode and 
BlockNumber. The final patch will remove those arguments from every 
caller, but that's a very mechanical change.


As in the attached patch. I only modified the heapam redo routines to 
use the new XLogReplayBuffer() idiom; the idea is to do that for every 
redo routine.


After applying such a patch, the main WAL format changing patch becomes 
much smaller, and makes it easier to see from the redo routines where 
significant changes to the WAL record formats have been made. This also 
allows us to split the bikeshedding; we can discuss the name of 
XLogReplayBuffer() first :-).


- Heikki

commit 1a770baa3a3f293e8c592f0419d279b7b8bf7b66
Author: Heikki Linnakangas 
Date:   Wed Aug 13 15:39:08 2014 +0300

Refactor heapam.c redo routines to use XLogReplayBuffer

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d731f98..bf863af 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7137,15 +7137,13 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record)
 {
 	xl_heap_clean *xlrec = (xl_heap_clean *) XLogRecGetData(record);
 	Buffer		buffer;
-	Page		page;
-	OffsetNumber *end;
-	OffsetNumber *redirected;
-	OffsetNumber *nowdead;
-	OffsetNumber *nowunused;
-	int			nredirected;
-	int			ndead;
-	int			nunused;
-	Size		freespace;
+	Size		freespace = 0;
+	RelFileNode	rnode;
+	BlockNumber	blkno;
+	XLogReplayResult rc;
+
+	rnode = xlrec->node;
+	blkno = xlrec->block;
 
 	/*
 	 * We're about to remove tuples. In Hot Standby mode, ensure that there's
@@ -7156,65 +7154,62 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record)
 	 * latestRemovedXid is invalid, skip conflict processing.
 	 */
 	if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
-		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
-			xlrec->node);
+		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
 
 	/*
 	 * If we have a full-page image, restore it (using a cleanup lock) and
 	 * we're done.
 	 */
-	if (record->xl_info & XLR_BKP_BLOCK(0))
-	{
-		(void) RestoreBackupBlock(lsn, record, 0, true, false);
-		return;
-	}
+	rc = XLogReplayBufferExtended(0, rnode, MAIN_FORKNUM, blkno,
+  RBM_NORMAL, true, &buffer);
+	if (rc == BLK_NEEDS_REDO)
+	{
+		Page		page = (Page) BufferGetPage(buffer);
+		OffsetNumber *end;
+		OffsetNumber *redirected;
+		OffsetNumber *nowdead;
+		OffsetNumber *nowunused;
+		int			nredirected;
+		int			ndead;
+		int			nunused;
+
+		nredirected = xlrec->nredirected;
+		ndead = xlrec->ndead;
+		end = (OffsetNumber *) ((char *) xlrec + record->xl_len);
+		redirected = (OffsetNumber *) ((char *) xlrec + SizeOfHeapClean);
+		nowdead = redirected + (nredirected * 2);
+		nowunused = nowdead + ndead;
+		nunused = (end - nowunused);
+		Assert(nunused >= 0);
+
+		/* Update all item pointers per the record, and repair fragmentation */
+		heap_page_prune_execute(buffer,
+redirected, nredirected,
+nowdead, ndead,
+nowunused, nunused);
+
+		freespace = PageGetHeapFreeSpace(page);		/* needed to update FSM below */
 
-	buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
-	if (!BufferIsValid(buffer

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-13 Thread Andres Freund

On 2014-08-13 02:36:59 +0300, Heikki Linnakangas wrote:
> On 08/13/2014 01:04 AM, Andres Freund wrote:
> >* I'm distinctly not a fan of passing around both the LSN and the struct
> >   XLogRecord to functions. We should bit the bullet and use something
> >   encapsulating both.
> 
> You mean, the redo functions? Seems fine to me, and always it's been like
> that...

Meh. By that argument we could just not do this patch :)

> >* XLogReplayBuffer imo isn't a very good name - the buffer isn't
> >   replayed. If anything it's sometimes replaced by the backup block, but
> >   the real replay still happens outside of that function.
> 
> I agree, but haven't come up with a better name.

XLogRestoreBuffer(), XLogRecoverBuffer(), XLogReadBufferForReplay(), ...?

> >* Why do we need XLogBeginInsert(). Right now this leads to uglyness
> >   like duplicated if (RelationNeedsWAL()) wal checks and
> >   XLogCancelInsert() of inserts that haven't been started.
> 
> I like clearly marking the beginning of the insert, with XLogBeginInsert().
> We certainly could design it so that it's not needed, and you could just
> start registering stuff without calling XLogBeginInsert() first. But I
> believe it's more robust this way. For example, you will get an error at the
> next XLogBeginInsert(), if a previous operation had already registered some
> data without calling XLogInsert().

I'm coming around to that view - especially as accidentally nesting xlog
insertions is a real concern with the new API.

> >And if we have Begin, why do we also need Cancel?
> 
> We could just automatically "cancel" any previous insertion when
> XLogBeginInsert() is called, but again it seems like bad form to leave
> references to buffers and data just lying there, if an operation bails out
> after registering some stuff and doesn't finish the insertion.
> 
> >* "XXX: do we need to do this for every page?" - yes, and it's every
> >   tuple, not every page... And It needs to be done *after* the tuple has
> >   been put on the page, not before. Otherwise the location will be wrong.
> 
> That comment is in heap_multi_insert(). All the inserted tuples have the
> same command id, seems redundant to log multiple NEW_CID records for them.

Yea, I know it's in heap_multi_insert(). They tuples have the same cid,
but they don't have the same tid. Or well, wouldn't if
log_heap_new_cid() were called after the PutHeapTuple(). Note that this
won't trigger for any normal user defined relations.

> >* The patch mixes the API changes around WAL records with changes of how
> >   individual actions are logged. That makes it rather hard to review -
> >   and it's a 500kb patch already.
> >
> >   I realize it's hard to avoid because the new API changes which
> >   information about blocks is available, but it does make it really hard
> >   to see whether the old/new code is doing something
> >   equivalent. It's hard to find more critical code than this :/
> 
> Yeah, I hear you. I considered doing this piecemeal, just adding the new
> functions first so that you could still use the old XLogRecData API, until
> all the functions have been converted. But in the end, I figured it's not
> worth it, as sooner or later we'd want to convert all the functions anyway.

I think it might be worthwile anyway. I'd be very surprised if there
aren't several significant bugs in the conversion. Your full page
checking tool surely helps to reduce the number, but it's not
foolproof. I can understand not wanting to do it though, it's a
significant amount of work.

Would you ask somebody else to do it in two steps?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-13 Thread Michael Paquier

On Wed, Aug 13, 2014 at 4:49 PM, Michael Paquier
 wrote:
> Yes indeed. As XLogBeginInsert() is already part of xlogconstruct.c,
> it is not weird. This approach has the merit to make XLogRecData
> completely bundled with the insertion and construction of the WAL
> records. Then for the name xloginsert.c is fine for me too.
At the same time, renaming XLogInsert to XLogCompleteInsert or
XLogFinishInsert would be nice to make the difference with
XLogBeginInsert. This could include XLogCancel renamed tos
XLogCancelInsert.

Appending the prefix XLogInsert* to those functions would make things
more consistent as well.

But feel free to discard those ideas if you do not like that.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-13 Thread Michael Paquier

On Tue, Aug 12, 2014 at 6:44 PM, Heikki Linnakangas
 wrote:
> On 08/05/2014 03:50 PM, Michael Paquier wrote:
>>
>> - When testing pg_xlogdump I found an assertion failure for record
>> XLOG_GIN_INSERT:
>> Assertion failed: (((bool) (((const void*)(&insertData->newitem.key) !=
>> ((void*)0)) && ((&insertData->newitem.key)->ip_posid != 0, function
>> gin_desc, file gindesc.c, line 127.
>
>
> I could not reproduce this. Do you have a test case?
Indeed. I am not seeing anything now for this record. Perhaps I was
using an incorrect version of pg_xlogdump.

>> - Don't we need a call to XLogBeginInsert() in XLogSaveBufferForHint():
>> +   int flags = REGBUF_FORCE_IMAGE;
>> +   if (buffer_std)
>> +   flags |= REGBUF_STANDARD;
>> +   XLogRegisterBuffer(0, buffer, flags);
>> [...]
>> -   recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI, rdata);
>> +   recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);
>
>
> Indeed, fixed. With that, "initdb -k" works, I had not tested the patch with
> page checksums before.
Thanks for the fix. Yes now this works.

>> - XLogRecGetBlockRefIds needing an already-allocated array for *out is not
>> user-friendly. Cannot we just do all the work inside this function?
>
>
> I agree it's not a nice API. Returning a palloc'd array would be nicer, but
> this needs to work outside the backend too (e.g. pg_xlogdump). It could
> return a malloc'd array in frontend code, but it's not as nice. On the
> whole, do you think that be better than the current approach?

A palloc'd array returned is nicer for the user. And I would rather
rewrite the API like that, having it return the array instead:
int *XLogRecGetBlockRefIds(XLogRecord *record, int *num_array)
Alvaro has also outlined that in the FRONTEND context pfree and palloc
would work as pg_free and pg_malloc (didn't recall it before he
mentioned it btw).

>> - All the functions in xlogconstruct.c could be defined in a separate
>> header xlogconstruct.h. What they do is rather independent from xlog.h.
>> The
>> same applies to all the structures and functions of xlogconstruct.c in
>> xlog_internal.h: XLogRecordAssemble, XLogRecordAssemble,
>> InitXLogRecordConstruct. (worth noticing as well that the only reason
>> XLogRecData is defined externally of xlogconstruct.c is that it is being
>> used in XLogInsert()).
>
>
> Hmm. I left the defines for xlogconstruct.c functions that are used to build
> a WAL record in xlog.h, so that it's not necessary to include both xlog.h
> and xlogconstruct.h in all files that write a WAL record (XLogInsert() is
> defined in xlog.h).
>
> But perhaps it would be better to move the prototype for XLogInsert to
> xlogconstruct.h too? That might be a better division; everything needed to
> insert a WAL record would be in xlogconstruct.h, and xlog.h would left for
> more internal functions related to WAL. And perhaps rename the files to
> xloginsert.c and xloginsert.h.

Yes indeed. As XLogBeginInsert() is already part of xlogconstruct.c,
it is not weird. This approach has the merit to make XLogRecData
completely bundled with the insertion and construction of the WAL
records. Then for the name xloginsert.c is fine for me too.

Among the things changed since v5, v6 is introducing a safety limit to
the maximum number of backup blocks within a single WAL record:
#define XLR_MAX_BKP_BLOCKS 256
XLogRecordBlockData is updated to use uint8 instead of char to
identify the block id, and XLogEnsureRecordSpace fails if more than
XLR_MAX_BKP_BLOCKS are wanted by the caller. I am fine with this
change, just mentioning it.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-12 Thread Heikki Linnakangas


On 08/13/2014 01:04 AM, Andres Freund wrote:

On 2014-08-12 12:44:22 +0300, Heikki Linnakangas wrote:

Here's a new patch with those little things fixed.


I've so far ignored this thread... I'm now taking a look. Unfortunately
I have to admit I'm not unequivocally happy.

* XLogRegisterData/XLogRegisterRecData imo don't really form a
   consistent API namewise. What does 'Rec' mean here?


The latter function is actually called XLogRegisterBufData. The README 
was wrong, will fix.



* I'm distinctly not a fan of passing around both the LSN and the struct
   XLogRecord to functions. We should bit the bullet and use something
   encapsulating both.


You mean, the redo functions? Seems fine to me, and always it's been 
like that...



* XLogReplayBuffer imo isn't a very good name - the buffer isn't
   replayed. If anything it's sometimes replaced by the backup block, but
   the real replay still happens outside of that function.


I agree, but haven't come up with a better name.


* Why do we need XLogBeginInsert(). Right now this leads to uglyness
   like duplicated if (RelationNeedsWAL()) wal checks and
   XLogCancelInsert() of inserts that haven't been started.


I like clearly marking the beginning of the insert, with 
XLogBeginInsert(). We certainly could design it so that it's not needed, 
and you could just start registering stuff without calling 
XLogBeginInsert() first. But I believe it's more robust this way. For 
example, you will get an error at the next XLogBeginInsert(), if a 
previous operation had already registered some data without calling 
XLogInsert().



And if we have Begin, why do we also need Cancel?


We could just automatically "cancel" any previous insertion when 
XLogBeginInsert() is called, but again it seems like bad form to leave 
references to buffers and data just lying there, if an operation bails 
out after registering some stuff and doesn't finish the insertion.



* "XXX: do we need to do this for every page?" - yes, and it's every
   tuple, not every page... And It needs to be done *after* the tuple has
   been put on the page, not before. Otherwise the location will be wrong.


That comment is in heap_multi_insert(). All the inserted tuples have the 
same command id, seems redundant to log multiple NEW_CID records for them.



* The patch mixes the API changes around WAL records with changes of how
   individual actions are logged. That makes it rather hard to review -
   and it's a 500kb patch already.

   I realize it's hard to avoid because the new API changes which
   information about blocks is available, but it does make it really hard
   to see whether the old/new code is doing something
   equivalent. It's hard to find more critical code than this :/


Yeah, I hear you. I considered doing this piecemeal, just adding the new 
functions first so that you could still use the old XLogRecData API, 
until all the functions have been converted. But in the end, I figured 
it's not worth it, as sooner or later we'd want to convert all the 
functions anyway.



Have you done any performance evaluation?


No, not yet.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-12 Thread Andres Freund

On 2014-08-12 12:44:22 +0300, Heikki Linnakangas wrote:
> Here's a new patch with those little things fixed.

I've so far ignored this thread... I'm now taking a look. Unfortunately
I have to admit I'm not unequivocally happy.

* XLogRegisterData/XLogRegisterRecData imo don't really form a
  consistent API namewise. What does 'Rec' mean here?

* I'm distinctly not a fan of passing around both the LSN and the struct
  XLogRecord to functions. We should bit the bullet and use something
  encapsulating both.

* XLogReplayBuffer imo isn't a very good name - the buffer isn't
  replayed. If anything it's sometimes replaced by the backup block, but
  the real replay still happens outside of that function.

* Why do we need XLogBeginInsert(). Right now this leads to uglyness
  like duplicated if (RelationNeedsWAL()) wal checks and
  XLogCancelInsert() of inserts that haven't been started. And if we
  have Begin, why do we also need Cancel?

* "XXX: do we need to do this for every page?" - yes, and it's every
  tuple, not every page... And It needs to be done *after* the tuple has
  been put on the page, not before. Otherwise the location will be wrong.

* The patch mixes the API changes around WAL records with changes of how
  individual actions are logged. That makes it rather hard to review -
  and it's a 500kb patch already.

  I realize it's hard to avoid because the new API changes which
  information about blocks is available, but it does make it really hard
  to see whether the old/new code is doing something
  equivalent. It's hard to find more critical code than this :/

Have you done any performance evaluation?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-12 Thread Alvaro Herrera

Heikki Linnakangas wrote:
> On 08/05/2014 03:50 PM, Michael Paquier wrote:

> >- XLogRecGetBlockRefIds needing an already-allocated array for *out is not
> >user-friendly. Cannot we just do all the work inside this function?
> 
> I agree it's not a nice API. Returning a palloc'd array would be
> nicer, but this needs to work outside the backend too (e.g.
> pg_xlogdump). It could return a malloc'd array in frontend code, but
> it's not as nice. On the whole, do you think that be better than the
> current approach?

I don't see the issue.  Right now, in frontend code you can call
palloc() and pfree(), and they behave like malloc() and free() (see
fe_memutils.c).  Your module doesn't need to do anything special.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-12 Thread Alvaro Herrera

Heikki Linnakangas wrote:
> On 08/05/2014 03:50 PM, Michael Paquier wrote:

> >- All the functions in xlogconstruct.c could be defined in a separate
> >header xlogconstruct.h. What they do is rather independent from xlog.h. The
> >same applies to all the structures and functions of xlogconstruct.c in
> >xlog_internal.h: XLogRecordAssemble, XLogRecordAssemble,
> >InitXLogRecordConstruct. (worth noticing as well that the only reason
> >XLogRecData is defined externally of xlogconstruct.c is that it is being
> >used in XLogInsert()).
> 
> Hmm. I left the defines for xlogconstruct.c functions that are used
> to build a WAL record in xlog.h, so that it's not necessary to
> include both xlog.h and xlogconstruct.h in all files that write a
> WAL record (XLogInsert() is defined in xlog.h).
> 
> But perhaps it would be better to move the prototype for XLogInsert
> to xlogconstruct.h too? That might be a better division; everything
> needed to insert a WAL record would be in xlogconstruct.h, and
> xlog.h would left for more internal functions related to WAL. And
> perhaps rename the files to xloginsert.c and xloginsert.h.

Yes, IMO ideally modules that only construct WAL records to insert, but
are not directly involved with other XLog stuff, should only be using a
lean header file, not the whole xlog.h.  I imagine xlogconstruct.h would
be such a file.  (The patch you just posted doesn't have such a file --
AFAICS that stuff is all in xlog.h still).

No opinion on xlogconstruct vs. xloginsert as file names.  Both seem
like good enough names to me.  Unless others have stronger opinions, I
would left that decision to you.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-12 Thread Heikki Linnakangas


On 08/05/2014 03:50 PM, Michael Paquier wrote:

- When testing pg_xlogdump I found an assertion failure for record
XLOG_GIN_INSERT:
Assertion failed: (((bool) (((const void*)(&insertData->newitem.key) !=
((void*)0)) && ((&insertData->newitem.key)->ip_posid != 0, function
gin_desc, file gindesc.c, line 127.


I could not reproduce this. Do you have a test case?


- Would it be a win to add an assertion check for (CritSectionCount == 0)
in XLogEnsureRecordSpace?


Maybe; added.


- There is no mention of REGBUF_NO_IMAGE in transam/README.


Fixed


- This patch adds a lot of blocks like that in the redo code:
+ if (BufferIsValid(buffer))
+   UnlockReleaseBuffer(buffer);
What do you think about adding a new macro like UnlockBufferIfValid(buffer)?


I don't think such a macro would be an improvement in readability, TBH.


- Don't we need a call to XLogBeginInsert() in XLogSaveBufferForHint():
+   int flags = REGBUF_FORCE_IMAGE;
+   if (buffer_std)
+   flags |= REGBUF_STANDARD;
+   XLogRegisterBuffer(0, buffer, flags);
[...]
-   recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI, rdata);
+   recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);


Indeed, fixed. With that, "initdb -k" works, I had not tested the patch 
with page checksums before.



- There is still a TODO item in ValidXLogRecordHeader to check if block
data header is valid or not. Just mentioning.


Removed.

Previously, we ValidXLogRecordHeader checked that the xl_tot_len of a 
record doesn't exceed the size of header + xl_len + (space needed for 
max number of bkp blocks). But the new record format has no limit on the 
amount of data registered with a buffer, so such a test is not possible 
anymore. I added the TODO item there to remind me that I need to check 
if we need to do something else there instead, but I think it's fine as 
it is. We still sanity-check the block data in ValidXLogRecord; the 
ValidXLogRecordHeader() check happens before we have read the whole 
record header in memory.



- XLogRecGetBlockRefIds needing an already-allocated array for *out is not
user-friendly. Cannot we just do all the work inside this function?


I agree it's not a nice API. Returning a palloc'd array would be nicer, 
but this needs to work outside the backend too (e.g. pg_xlogdump). It 
could return a malloc'd array in frontend code, but it's not as nice. On 
the whole, do you think that be better than the current approach?



- All the functions in xlogconstruct.c could be defined in a separate
header xlogconstruct.h. What they do is rather independent from xlog.h. The
same applies to all the structures and functions of xlogconstruct.c in
xlog_internal.h: XLogRecordAssemble, XLogRecordAssemble,
InitXLogRecordConstruct. (worth noticing as well that the only reason
XLogRecData is defined externally of xlogconstruct.c is that it is being
used in XLogInsert()).


Hmm. I left the defines for xlogconstruct.c functions that are used to 
build a WAL record in xlog.h, so that it's not necessary to include both 
xlog.h and xlogconstruct.h in all files that write a WAL record 
(XLogInsert() is defined in xlog.h).


But perhaps it would be better to move the prototype for XLogInsert to 
xlogconstruct.h too? That might be a better division; everything needed 
to insert a WAL record would be in xlogconstruct.h, and xlog.h would 
left for more internal functions related to WAL. And perhaps rename the 
files to xloginsert.c and xloginsert.h.


Here's a new patch with those little things fixed.

- Heikki


wal-format-and-api-changes-6.patch.gz
Description: application/gzip

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-05 Thread Michael Paquier

On Sat, Aug 2, 2014 at 1:40 AM, Heikki Linnakangas 
wrote:
> I ran this through my WAL page comparison tool to verify that all the WAL
> record types are replayed correctly (although I did some small cleanup
after
> that, so it's not impossible that I broke it again; will re-test before
> committing).
I have run as well some tests on it with a master a a standby to check WAL
construction and replay:
- regression tests as well as tests from contrib, isolation
- WAL page comparison tool
- Some tests with pgbench
- When testing pg_xlogdump I found an assertion failure for record
XLOG_GIN_INSERT:
Assertion failed: (((bool) (((const void*)(&insertData->newitem.key) !=
((void*)0)) && ((&insertData->newitem.key)->ip_posid != 0, function
gin_desc, file gindesc.c, line 127.

Then, I had a look at the code, and here are some comments:
- This comment has been introduced with 96ef3b8 that has added page
checksums. Perhaps the authors can tell what was the original purpose of
this comment (Jeff, Simon). For now, it may be better to remove this FIXME
and to let this comment as is.
+* FIXME: I didn't understand below comment. Is it still
relevant?
+*
 * This also means there is no corresponding API call for
this, so an
 * smgr implementation has no need to implement anything.
Which means
 * nothing is needed in md.c etc
- Would it be a win to add an assertion check for (CritSectionCount == 0)
in XLogEnsureRecordSpace?
- There is no mention of REGBUF_NO_IMAGE in transam/README.
- This patch adds a lot of blocks like that in the redo code:
+ if (BufferIsValid(buffer))
+   UnlockReleaseBuffer(buffer);
What do you think about adding a new macro like UnlockBufferIfValid(buffer)?
- Don't we need a call to XLogBeginInsert() in XLogSaveBufferForHint():
+   int flags = REGBUF_FORCE_IMAGE;
+   if (buffer_std)
+   flags |= REGBUF_STANDARD;
+   XLogRegisterBuffer(0, buffer, flags);
[...]
-   recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI, rdata);
+   recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);
- There is a typo in the header of xlogrecord.h, the "to" is not needed:
+ * Definitions for to the WAL record
format.

- There is still a TODO item in ValidXLogRecordHeader to check if block
data header is valid or not. Just mentioning.
- Just came across that: s/reference refes to/reference refers to
- XLogRecGetBlockRefIds needing an already-allocated array for *out is not
user-friendly. Cannot we just do all the work inside this function?
- All the functions in xlogconstruct.c could be defined in a separate
header xlogconstruct.h. What they do is rather independent from xlog.h. The
same applies to all the structures and functions of xlogconstruct.c in
xlog_internal.h: XLogRecordAssemble, XLogRecordAssemble,
InitXLogRecordConstruct. (worth noticing as well that the only reason
XLogRecData is defined externally of xlogconstruct.c is that it is being
used in XLogInsert()).

The patch is more solid and better structured than the previous versions.
It is in good shape.
Regards,
-- 
Michael

Re: [HACKERS] WAL format and API changes (9.5)

2014-08-01 Thread Heikki Linnakangas


Here's a new version of this patch, please review.

I've cleaned up a lot of stuff, fixed all the bugs reported so far, and 
a bunch of others I found myself while testing.


I'm not going to explain again what the patch does; the README and 
comments should now be complete enough to explain how it works. If not, 
please speak up.


In the previous version of this patch, I made the XLogRegisterData 
function to copy the WAL data to a temporary buffer, instead of 
constructing the XLogRecData chain. I decided to revert back to the old 
way after all; I still think that copying would probably be OK from a 
performance point of view, but that'd need more testing. We can still 
switch to doing it that way later; the XLogRecData struct is no longer 
exposed to the functions that generate the WAL records, so it would be a 
very isolated change.


I moved the new functions for constructing the WAL record into a new 
file, xlogconstruct.c. XLogInsert() now just calls a function called 
XLogRecordAssemble(), which returns the full XLogRecData chain that 
includes all the data and backup blocks, ready to be written to the WAL 
buffer. All the code to construct the XLogRecData chain is now in 
xlogrecord.c; this makes XLogInsert() simpler and more readable.


One change resulting from that worth mentioning is that XLogInsert() now 
always re-constructs the XLogRecordData chain, if after acquiring the 
WALInsertLock it sees that RedoRecPtr changed (i.e. a checkpoint just 
started). Before, it would recheck the LSNs on the individual buffers to 
see if it's necessary. This is simpler, and I don't think it makes any 
difference to performance in practice.


I ran this through my WAL page comparison tool to verify that all the 
WAL record types are replayed correctly (although I did some small 
cleanup after that, so it's not impossible that I broke it again; will 
re-test before committing).


- Heikki



wal-format-and-api-changes-5.patch.gz
Description: application/gzip

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-07-18 Thread Michael Paquier

On Wed, Jul 2, 2014 at 4:23 PM, Michael Paquier
 wrote:
>
>
>
> On Wed, Jul 2, 2014 at 4:09 PM, Michael Paquier 
> wrote:
>>
>> 3) I noticed a bug in gin redo code path when trying to use the WAL replay
>> facility.
This patch has been on status "Waiting on author" for a little bit of
time, marking it now as "Returned with feedback" as there are still
issues with gin record construction and redo code paths.

Feel free to disagree if you don't think this is correct.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-07-02 Thread Michael Paquier

On Wed, Jul 2, 2014 at 4:09 PM, Michael Paquier 
wrote:
>
> 3) I noticed a bug in gin redo code path when trying to use the WAL replay
> facility.
>

Looking at the backtrace, it seems that a page for gin is corrupted. The
buffer capture patch does some sanity checks on the page format when
performing masking and it is failing in one of them on a standby when
kicking ginRedoUpdateMetapage:
if (pd_lower > pd_upper || pd_special < pd_upper ||
pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
{
elog(ERROR, "invalid page at %X/%08X\n",
((PageHeader) page)->pd_lsn.xlogid,
((PageHeader) page)->pd_lsn.xrecoff);
}

frame #4: 0x00010437dec5 postgres`CheckForBufferLeaks + 165 at
bufmgr.c:1778
frame #5: 0x00010437df1e postgres`AtProcExit_Buffers(code=1, arg=0)
+ 30 at bufmgr.c:1750
frame #6: 0x00010438fded postgres`shmem_exit(code=1) + 301 at
ipc.c:263
frame #7: 0x00010438fc1c postgres`proc_exit_prepare(code=1) + 124
at ipc.c:187
frame #8: 0x00010438fb63 postgres`proc_exit(code=1) + 19 at
ipc.c:102
frame #9: 0x000104555b3c postgres`errfinish(dummy=0) + 1180 at
elog.c:555
frame #10: 0x0001045590de postgres`elog_finish(elevel=20,
fmt=0x000104633d4f) + 830 at elog.c:1362
frame #11: 0x00010437c1af
postgres`mask_unused_space(page=0x7fff5bc20a70) + 159 at bufcapt.c:78
frame #12: 0x00010437b53d
postgres`mask_heap_page(page=0x7fff5bc20a70) + 29 at bufcapt.c:95
frame #13: 0x00010437b1cd
postgres`buffer_capture_write(newcontent=0x000104ab3980, blkno=0) + 205
at bufcapt.c:329
frame #14: 0x00010437bc7d postgres`buffer_capture_forget(buffer=82)
+ 349 at bufcapt.c:433
frame #15: 0x0001043801c9 postgres`LockBuffer(buffer=82, mode=0) +
233 at bufmgr.c:2773
frame #16: 0x0001043800c8 postgres`UnlockReleaseBuffer(buffer=82) +
24 at bufmgr.c:2554
frame #17: 0x000103fefb03
postgres`ginRedoUpdateMetapage(lsn=335350144, record=0x7fb1740382f0) +
1843 at ginxlog.c:580
frame #18: 0x000103fede96 postgres`gin_redo(lsn=335350144,
record=0x7fb1740382f0) + 534 at ginxlog.c:724
frame #19: 0x0001040ad692 postgres`StartupXLOG + 8482 at xlog.c:6810
frame #20: 0x000104330e0e postgres`StartupProcessMain + 430 at
startup.c:224
frame #21: 0x0001040c64d9 postgres`AuxiliaryProcessMain(argc=2,
argv=0x7fff5bc231b0) + 1897 at bootstrap.c:416
frame #22: 0x00010432b1a8
postgres`StartChildProcess(type=StartupProcess) + 328 at postmaster.c:5090
frame #23: 0x0001043292b9 postgres`PostmasterMain(argc=3,
argv=0x7fb173c044e0) + 5401 at postmaster.c:1212
frame #24: 0x00010426f995 postgres`main(argc=3,
argv=0x7fb173c044e0) + 773 at main.c:219
Note that I have never seen that with vanilla, only with this patch.
Regards,
-- 
Michael

Re: [HACKERS] WAL format and API changes (9.5)

2014-07-02 Thread Michael Paquier

On Tue, Jul 1, 2014 at 5:51 AM, Heikki Linnakangas 
wrote:

> On 06/27/2014 09:12 AM, Michael Paquier wrote:
>
>> 2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and
>> XLogGetBlockRefIds are missing in src/backend/access/transam/README.
>>
>
> Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure
> what to do with the latter two. They're not really intended for use by redo
> routines, but for things like pg_xlogdump that want to extract information
> about any record type.


That's definitely not part of the redo section or WAL construction section.
It could be a new section in the README describing how to get the list of
blocks touched by a WAL record. I'd rather have those APIs documented
somewhere than nowhere, and the README would be well-suited for that (as
well as in-code comments) as those APIs are aimed at developers.


>  3) There are a couple of code blocks marked as FIXME and BROKEN. There are
>> as well some NOT_USED blocks.
>>
>
> Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They
> mostly stand for code that used to print block numbers or relfilenodes,
> which doesn't make much sense to do in an ad hoc way in rmgr-specific desc
> routines anymore. Will need to go through them all an decide what's the
> important information to print for each record type.
>
> The few NOT_USED blocks are around code in redo routines that extract some
> information from the WAL record, that isn't needed anymore. We could remove
> the fields from the WAL records altogether, but might be useful to keep for
> debugging purposes.

They could be removed and be kept as a part of the git history... That's
not really a vital point btw.


>
>  6) XLogRegisterData creates a XLogRecData entry and appends it in one of
>> the static XLogRecData entries if buffer_id >= 0 or to
>> rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the
>> WAL record). XLogRegisterXLogRecData does something similar, but with an
>> entry of XLogRecData already built. What do you think about removing
>> XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the
>> interface simpler, and XLogRecData could be kept private in xlog.c. This
>> would need more work especially on gin side where I am seeing for example
>> constructLeafRecompressWALData directly building a XLogRecData entry.
>>
>
> Another big change in the attached patch version is that
> XLogRegisterData() now *copies* the data to a working area, instead of
> merely adding a XLogRecData entry to the chain. This simplifies things in
> xlog.c, now that the XLogRecData concept is not visible to the rest of the
> system. This is a subtle semantic change for the callers: you can no longer
> register a struct first, and fill the fields afterwards.
>
> That adds a lot of memcpys to the critical path, which could be bad for
> performance. In practice, I think it's OK, or even a win, because a typical
> WAL record payload is very small. But I haven't measured that. If it
> becomes an issue, we could try to optimize, and link larger data blocks to
> the rdata chain, but memcpy small small chunks of data. There are a few WAL
> record types that don't fit in a small statically allocated buffer, so I
> provided a new function, XLogEnsureSpace(), that can be called before
> XLogBegin() to allocate a large-enough working area.
>

I just ran the following test on my laptop with your latest patch:
- Patched version:
=# CREATE TABLE foo AS SELECT generate_series(1,1000) as a;
SELECT 1000
Time: 32913.419 ms
=# CREATE TABLE foo2 AS SELECT generate_series(1,1000) as a;
SELECT 1000
Time: 32261.045 ms
- master (d97e98e):
=#  CREATE TABLE foo AS SELECT generate_series(1,1000) as a;
SELECT 1000
Time: 29651.394 ms
=#  CREATE TABLE foo2 AS SELECT generate_series(1,1000) as a;
SELECT 1000
Time: 29734.887 ms
10% difference... That was just a quick test though.

These changes required changing a couple of places where WAL records are
> created:
>
> * ginPlaceToPage. This function has a strange split of responsibility
> between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage.
> ginPlaceToPage calls dataPlaceToPage or entryPlaceToPage, depending on what
> kind of a tree it is, and dataPlaceToPage or entryPlaceToPage contributes
> some data to the WAL record. Then ginPlaceToPage adds some more, and
> finally calls XLogInsert(). I simplified the SPLIT case so that we now just
> create full page images of both page halves. We were already logging all
> the data on both page halves, but we were doing it in a "smarter" way,
> knowing what kind of pages they were. For example, for an entry tree page,
> we included an IndexTuple struct for every tuple on the page, but didn't
> include the item pointers. A straight full page image takes more space, but
> not much, so I think in any real application it's going to be lost in
> noise. (again, haven't measured it though)

Thanks, the code looks cleaner t

Re: [HACKERS] WAL format and API changes (9.5)

2014-07-01 Thread Michael Paquier

On Tue, Jul 1, 2014 at 7:18 AM, Fujii Masao  wrote:

> On Tue, Jul 1, 2014 at 5:51 AM, Heikki Linnakangas
>  wrote:
> > Thanks for the review!
> >
> >
> > On 06/27/2014 09:12 AM, Michael Paquier wrote:
> >>
> >> Looking at this patch, it does not compile when assertions are enabled
> >> because of this assertion in heapam.c:
> >> Assert(recdata == data + datalen);
> >> It seems like a thinko as removing it makes the code compile.
> >
> >
> > Fixed.
> >
> >
> >> Some comments about the code:
> >> 1) In src/backend/access/transam/README:
> >> 's/no further action is no required/no further action is required/g'
> >
> >
> > Fixed.
> >
> >
> >> 2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and
> >> XLogGetBlockRefIds are missing in src/backend/access/transam/README.
> >
> >
> > Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure
> > what to do with the latter two. They're not really intended for use by
> redo
> > routines, but for things like pg_xlogdump that want to extract
> information
> > about any record type.
> >
> >
> >> 3) There are a couple of code blocks marked as FIXME and BROKEN. There
> are
> >> as well some NOT_USED blocks.
> >
> >
> > Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They
> mostly
> > stand for code that used to print block numbers or relfilenodes, which
> > doesn't make much sense to do in an ad hoc way in rmgr-specific desc
> > routines anymore. Will need to go through them all an decide what's the
> > important information to print for each record type.
> >
> > The few NOT_USED blocks are around code in redo routines that extract
> some
> > information from the WAL record, that isn't needed anymore. We could
> remove
> > the fields from the WAL records altogether, but might be useful to keep
> for
> > debugging purposes.
> >
> >
> >>   4) Current patch crashes when running make check in
> >> contrib/test_decoding.
> >> Looking closer, wal_level = logical is broken:
> >> =# show wal_level;
> >>   wal_level
> >> ---
> >>   logical
> >> (1 row)
> >> =# create database foo;
> >> The connection to the server was lost. Attempting reset: Failed.
> >
> >
> > Fixed.
> >
> >
> >> Looking even closer, log_heap_new_cid is broken:
> >
> >
> > Fixed.
> >
> >
> >> 6) XLogRegisterData creates a XLogRecData entry and appends it in one of
> >> the static XLogRecData entries if buffer_id >= 0 or to
> >> rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to
> the
> >> WAL record). XLogRegisterXLogRecData does something similar, but with an
> >> entry of XLogRecData already built. What do you think about removing
> >> XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes
> the
> >> interface simpler, and XLogRecData could be kept private in xlog.c. This
> >> would need more work especially on gin side where I am seeing for
> example
> >> constructLeafRecompressWALData directly building a XLogRecData entry.
> >
> >
> > Hmm. I considered that, but punted because I didn't want to rewrite all
> the
> > places that currently use XLogRecDatas in less straightforward ways. I
> kept
> > XLogRegisterXLogRecData as an escape hatch for those.
> >
> > But I bit the bullet now, and got rid of all the
> XLogRegisterXLogRecData()
> > calls. XLogRecData struct is now completely private to xlog.c.
> >
> > Another big change in the attached patch version is that
> XLogRegisterData()
> > now *copies* the data to a working area, instead of merely adding a
> > XLogRecData entry to the chain. This simplifies things in xlog.c, now
> that
> > the XLogRecData concept is not visible to the rest of the system. This
> is a
> > subtle semantic change for the callers: you can no longer register a
> struct
> > first, and fill the fields afterwards.
> >
> > That adds a lot of memcpys to the critical path, which could be bad for
> > performance. In practice, I think it's OK, or even a win, because a
> typical
> > WAL record payload is very small. But I haven't measured that. If it
> becomes
> > an issue, we could try to optimize, and link larger data blocks to the
> rdata
> > chain, but memcpy small small chunks of data. There are a few WAL record
> > types that don't fit in a small statically allocated buffer, so I
> provided a
> > new function, XLogEnsureSpace(), that can be called before XLogBegin() to
> > allocate a large-enough working area.
> >
> > These changes required changing a couple of places where WAL records are
> > created:
> >
> > * ginPlaceToPage. This function has a strange split of responsibility
> > between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage.
> ginPlaceToPage
> > calls dataPlaceToPage or entryPlaceToPage, depending on what kind of a
> tree
> > it is, and dataPlaceToPage or entryPlaceToPage contributes some data to
> the
> > WAL record. Then ginPlaceToPage adds some more, and finally calls
> > XLogInsert(). I simplified the SPLIT case so that we now just create full
> > page images of both page hal

Re: [HACKERS] WAL format and API changes (9.5)

2014-07-01 Thread Robert Haas

On Mon, Jun 30, 2014 at 4:51 PM, Heikki Linnakangas
 wrote:
> Another big change in the attached patch version is that XLogRegisterData()
> now *copies* the data to a working area, instead of merely adding a
> XLogRecData entry to the chain. This simplifies things in xlog.c, now that
> the XLogRecData concept is not visible to the rest of the system. This is a
> subtle semantic change for the callers: you can no longer register a struct
> first, and fill the fields afterwards.
>
> That adds a lot of memcpys to the critical path, which could be bad for
> performance. In practice, I think it's OK, or even a win, because a typical
> WAL record payload is very small. But I haven't measured that. If it becomes
> an issue, we could try to optimize, and link larger data blocks to the rdata
> chain, but memcpy small small chunks of data. There are a few WAL record
> types that don't fit in a small statically allocated buffer, so I provided a
> new function, XLogEnsureSpace(), that can be called before XLogBegin() to
> allocate a large-enough working area.

On the other hand, the patch I just committed
(9f03ca915196dfc871804a1f8aad26207f601fd6) demonstrates that even
copying relatively small amounts of data can be expensive if you do it
frequently, and certainly WAL insertions are frequent.  Of course,
some of the overhead there is from palloc() and friends rather than
from the copying itself, but the copying itself isn't free, either.
And some WAL records can be really big.

Of course, it could still work out to a win if the simplifications in
xlog.c save enough.  I don't know whether that's the case or not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-06-30 Thread Fujii Masao

On Tue, Jul 1, 2014 at 5:51 AM, Heikki Linnakangas
 wrote:
> Thanks for the review!
>
>
> On 06/27/2014 09:12 AM, Michael Paquier wrote:
>>
>> Looking at this patch, it does not compile when assertions are enabled
>> because of this assertion in heapam.c:
>> Assert(recdata == data + datalen);
>> It seems like a thinko as removing it makes the code compile.
>
>
> Fixed.
>
>
>> Some comments about the code:
>> 1) In src/backend/access/transam/README:
>> 's/no further action is no required/no further action is required/g'
>
>
> Fixed.
>
>
>> 2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and
>> XLogGetBlockRefIds are missing in src/backend/access/transam/README.
>
>
> Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure
> what to do with the latter two. They're not really intended for use by redo
> routines, but for things like pg_xlogdump that want to extract information
> about any record type.
>
>
>> 3) There are a couple of code blocks marked as FIXME and BROKEN. There are
>> as well some NOT_USED blocks.
>
>
> Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They mostly
> stand for code that used to print block numbers or relfilenodes, which
> doesn't make much sense to do in an ad hoc way in rmgr-specific desc
> routines anymore. Will need to go through them all an decide what's the
> important information to print for each record type.
>
> The few NOT_USED blocks are around code in redo routines that extract some
> information from the WAL record, that isn't needed anymore. We could remove
> the fields from the WAL records altogether, but might be useful to keep for
> debugging purposes.
>
>
>>   4) Current patch crashes when running make check in
>> contrib/test_decoding.
>> Looking closer, wal_level = logical is broken:
>> =# show wal_level;
>>   wal_level
>> ---
>>   logical
>> (1 row)
>> =# create database foo;
>> The connection to the server was lost. Attempting reset: Failed.
>
>
> Fixed.
>
>
>> Looking even closer, log_heap_new_cid is broken:
>
>
> Fixed.
>
>
>> 6) XLogRegisterData creates a XLogRecData entry and appends it in one of
>> the static XLogRecData entries if buffer_id >= 0 or to
>> rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the
>> WAL record). XLogRegisterXLogRecData does something similar, but with an
>> entry of XLogRecData already built. What do you think about removing
>> XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the
>> interface simpler, and XLogRecData could be kept private in xlog.c. This
>> would need more work especially on gin side where I am seeing for example
>> constructLeafRecompressWALData directly building a XLogRecData entry.
>
>
> Hmm. I considered that, but punted because I didn't want to rewrite all the
> places that currently use XLogRecDatas in less straightforward ways. I kept
> XLogRegisterXLogRecData as an escape hatch for those.
>
> But I bit the bullet now, and got rid of all the XLogRegisterXLogRecData()
> calls. XLogRecData struct is now completely private to xlog.c.
>
> Another big change in the attached patch version is that XLogRegisterData()
> now *copies* the data to a working area, instead of merely adding a
> XLogRecData entry to the chain. This simplifies things in xlog.c, now that
> the XLogRecData concept is not visible to the rest of the system. This is a
> subtle semantic change for the callers: you can no longer register a struct
> first, and fill the fields afterwards.
>
> That adds a lot of memcpys to the critical path, which could be bad for
> performance. In practice, I think it's OK, or even a win, because a typical
> WAL record payload is very small. But I haven't measured that. If it becomes
> an issue, we could try to optimize, and link larger data blocks to the rdata
> chain, but memcpy small small chunks of data. There are a few WAL record
> types that don't fit in a small statically allocated buffer, so I provided a
> new function, XLogEnsureSpace(), that can be called before XLogBegin() to
> allocate a large-enough working area.
>
> These changes required changing a couple of places where WAL records are
> created:
>
> * ginPlaceToPage. This function has a strange split of responsibility
> between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage. ginPlaceToPage
> calls dataPlaceToPage or entryPlaceToPage, depending on what kind of a tree
> it is, and dataPlaceToPage or entryPlaceToPage contributes some data to the
> WAL record. Then ginPlaceToPage adds some more, and finally calls
> XLogInsert(). I simplified the SPLIT case so that we now just create full
> page images of both page halves. We were already logging all the data on
> both page halves, but we were doing it in a "smarter" way, knowing what kind
> of pages they were. For example, for an entry tree page, we included an
> IndexTuple struct for every tuple on the page, but didn't include the item
> pointers. A straight full page image takes more space,

Re: [HACKERS] WAL format and API changes (9.5)

2014-06-26 Thread Michael Paquier

On Sun, Jun 15, 2014 at 6:11 AM, Heikki Linnakangas  wrote:

> On 06/14/2014 09:43 PM, Alvaro Herrera wrote:
>
>> Heikki Linnakangas wrote:
>>
>>  Here's a new version, rebased against master. No other changes.
>>>
>>
>> This is missing xlogrecord.h ...
>>
>
> Oops, here you are.
>
Looking at this patch, it does not compile when assertions are enabled
because of this assertion in heapam.c:
Assert(recdata == data + datalen);
It seems like a thinko as removing it makes the code compile.

Here is a review of this patch:
- Compilation without warnings, regression tests pass
- That's not a small patch, but the changes mainly done are xlog record
insertion in accordance to the new API format.
$ git diff master --stat | tail -n1
52 files changed, 3601 insertions(+), 4371 deletions(-)

Some comments about the code:
1) In src/backend/access/transam/README:
's/no further action is no required/no further action is required/g'
2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and
XLogGetBlockRefIds are missing in src/backend/access/transam/README.
3) There are a couple of code blocks marked as FIXME and BROKEN. There are
as well some NOT_USED blocks.
 4) Current patch crashes when running make check in contrib/test_decoding.
Looking closer, wal_level = logical is broken:
=# show wal_level;
 wal_level
---
 logical
(1 row)
=# create database foo;
The connection to the server was lost. Attempting reset: Failed.

Looking even closer, log_heap_new_cid is broken:
(lldb) bt
* thread #1: tid = 0x, 0x7fff8a3c2866
libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
  * frame #0: 0x7fff8a3c2866 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x7fff8d37235c libsystem_pthread.dylib`pthread_kill + 92
frame #2: 0x7fff9369db1a libsystem_c.dylib`abort + 125
frame #3: 0x000104428819
postgres`ExceptionalCondition(conditionName=0x0001044a711c,
errorType=0x00010448b19f, fileName=0x0001044a2cfd, lineNumber=6879)
+ 137 at assert.c:54
frame #4: 0x000103f08dbe
postgres`log_heap_new_cid(relation=0x7fae4482afb8,
tup=0x7fae438062e8) + 126 at heapam.c:6879
frame #5: 0x000103f080cf
postgres`heap_insert(relation=0x7fae4482afb8, tup=0x7fae438062e8,
cid=0, options=0, bistate=0x) + 383 at heapam.c:2095
frame #6: 0x000103f09b6a
postgres`simple_heap_insert(relation=0x7fae4482afb8,
tup=0x7fae438062e8) + 74 at heapam.c:2533
frame #7: 0x00010406aacf postgres`createdb(stmt=0x7fae44843690)
+ 6335 at dbcommands.c:511
frame #8: 0x00010429def8
postgres`standard_ProcessUtility(parsetree=0x7fae44843690,
queryString=0x7fae44842c38, context=PROCESS_UTILITY_TOPLEVEL,
params=0x, dest=0x7fae44843a18,
completionTag=0x7fff5bd4ee30) + 1720 at utility.c:56

5) Yes,there are some WAL records that have only data related to buffers,
XLOG_GIN_VACUUM_DATA_LEAF_PAGE being one, with nothing registered with
buffer_id = -d, but using a dummy entry seems counter-productive:
+   rdata = rdata_head;
+   if (rdata == NULL)
+   {
+   /*
+* the rest of this function assumes that there is at least
some
+* rdata entries, so fake an empty rdata entry
+*/
+   dummy_rdata.data = NULL;
+   dummy_rdata.len = 0;
+   dummy_rdata.next = NULL;
+   rdata = &dummy_rdata;
+   }
Could this be removed and replaced by a couple of checks on rdata being
NULL in some cases? XLogInsert should be updated in consequence.
6) XLogRegisterData creates a XLogRecData entry and appends it in one of
the static XLogRecData entries if buffer_id >= 0 or to
rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the
WAL record). XLogRegisterXLogRecData does something similar, but with an
entry of XLogRecData already built. What do you think about removing
XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the
interface simpler, and XLogRecData could be kept private in xlog.c. This
would need more work especially on gin side where I am seeing for example
constructLeafRecompressWALData directly building a XLogRecData entry.
By doing so, we could as remove the while loop at the bottom of
XLogRegisterXLogRecData as we would insert in the tail only the latest
record created, aka replacing that:
while ((*tail)->next)
*tail = (*tail)->next;
By simply that:
*tail = (*tail)->next;
7) New structures at the top of xlog.c should have more comments about
their role, particularly rdata_head and rdata_tail that store main WAL data
(id of -1)
8) What do you think about adding in the README a description about how to
retrieve the block list modified by a WAL record, for a flow similar to
that:
record = XLogReadRecord(...);
nblocks = XLogGetBlockRefIds(record, blockids);
for (i = 0; i < nblocks; i++)
{
bkpb = XLogGetBlockRef(record, id, NULL);
// stuff
}

Re: [HACKERS] WAL format and API changes (9.5)

2014-06-14 Thread Alvaro Herrera

Heikki Linnakangas wrote:

> Here's a new version, rebased against master. No other changes.

This is missing xlogrecord.h ...

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-04-08 Thread Amit Kapila

On Thu, Apr 3, 2014 at 7:44 PM, Heikki Linnakangas
 wrote:
> I'd like to do some changes to the WAL format in 9.5. I want to annotate
> each WAL record with the blocks that they modify. Every WAL record already
> includes that information, but it's done in an ad hoc way, differently in
> every rmgr. The RelFileNode and block number are currently part of the WAL
> payload, and it's the REDO routine's responsibility to extract it. I want to
> include that information in a common format for every WAL record type.
>
> That makes life a lot easier for tools that are interested in knowing which
> blocks a WAL record modifies. One such tool is pg_rewind; it currently has
> to understand every WAL record the backend writes. There's also a tool out
> there called pg_readahead, which does prefetching of blocks accessed by WAL
> records, to speed up PITR. I don't think that tool has been actively
> maintained, but at least part of the reason for that is probably that it's a
> pain to maintain when it has to understand the details of every WAL record
> type.
>
> It'd also be nice for contrib/pg_xlogdump and backend code itself. The
> boilerplate code in all WAL redo routines, and writing WAL records, could be
> simplified.

I think it will also be useful, if we want to implement table/tablespace
PITR.

>
> That's for the normal cases. We'll need a couple of variants for also
> registering buffers that don't need full-page images, and perhaps also a
> function for registering a page that *always* needs a full-page image,
> regardless of the LSN. A few existing WAL record types just WAL-log the
> whole page, so those ad-hoc full-page images could be replaced with this.
>
> With these changes, a typical WAL insertion would look like this:
>
> /* register the buffer with the WAL record, with ID 0 */
> XLogRegisterBuffer(0, buf, true);
>
> rdata[0].data = (char *) &xlrec;
> rdata[0].len = sizeof(BlahRecord);
> rdata[0].buffer_id = -1; /* -1 means the data is always included */
> rdata[0].next = &(rdata[1]);
>
> rdata[1].data = (char *) mydata;
> rdata[1].len = mydatalen;
> rdata[1].buffer_id = 0; /* 0 here refers to the buffer registered
> above */
> rdata[1].next = NULL
>
> ...
> recptr = XLogInsert(RM_BLAH_ID, xlinfo, rdata);
>
> PageSetLSN(buf, recptr);

If we do register buffer's (that require or don't require FPI) before
calling XLogInsert(), then will there be any impact to handle case
where we come to know that we need to backup the buffer after
taking WALInsertLock.. or will that part of code remains same as it is
today.

> Redo
> 
>
> There are four different states a block referenced by a typical WAL record
> can be in:
>
> 1. The old page does not exist at all (because the relation was truncated
> later)
> 2. The old page exists, but has an LSN higher than current WAL record, so it
> doesn't need replaying.
> 3. The LSN is < current WAL record, so it needs to be replayed.
> 4. The WAL record contains a full-page image, which needs to be restored.

I think there might be a need to have separate handling for some special
cases like Init Page which is used in few ops (Insert/Update/multi-insert).
Is there any criteria to decide if it needs to be a separate state or a special
handling for operations?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Allocations in critical section (was Re: [HACKERS] WAL format and API changes (9.5))

2014-04-04 Thread Heikki Linnakangas


On 04/04/2014 04:56 PM, Tom Lane wrote:

Heikki Linnakangas  writes:

Ok, I fixed the issues that the assertion fixed. I also committed a
patch to add the assertion itself; let's see if the buildfarm finds more
cases that violate the rule.



It ignores the checkpointer, because it's known to violate the rule,


... uh, isn't that a bug to be fixed?


Yes. One step a time ;-).


and
allocations in ErrorContext, which is used during error recovery, e.g if
you indeed PANIC while in a critical section for some other reason.


Yeah, I realized we'd have to do something about elog's own allocations.
Not sure if a blanket exemption for ErrorContext is the best way.  I'd
been thinking of having a way to turn off the complaint once processing
of an elog(PANIC) has started.


Hmm. PANIC processing should avoid allocations too, except in 
ErrorContext, because otherwise you might get an out-of-memory during 
PANIC processing.


ErrorContext also covers elog(DEBUG2, ...). I presume we'll want to 
ignore that too. Although I also tried without the exemption for 
ErrorContext at first, and didn't get any failures from the regression 
tests, so I guess we never do that in a critical section. I was a bit 
surprised by that.




BTW, I'm pretty sure you added some redundant assertions in mcxt.c.
eg, palloc does not need its own copy.


palloc is copy-pasted from MemoryContextAlloc - it does need its own copy.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Allocations in critical section (was Re: [HACKERS] WAL format and API changes (9.5))

2014-04-04 Thread Tom Lane

Heikki Linnakangas  writes:
> Ok, I fixed the issues that the assertion fixed. I also committed a 
> patch to add the assertion itself; let's see if the buildfarm finds more 
> cases that violate the rule.

> It ignores the checkpointer, because it's known to violate the rule,

... uh, isn't that a bug to be fixed?

> and 
> allocations in ErrorContext, which is used during error recovery, e.g if 
> you indeed PANIC while in a critical section for some other reason.

Yeah, I realized we'd have to do something about elog's own allocations.
Not sure if a blanket exemption for ErrorContext is the best way.  I'd
been thinking of having a way to turn off the complaint once processing
of an elog(PANIC) has started.

> I didn't backpatch this.

Agreed.

BTW, I'm pretty sure you added some redundant assertions in mcxt.c.
eg, palloc does not need its own copy.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Allocations in critical section (was Re: [HACKERS] WAL format and API changes (9.5))

2014-04-04 Thread Heikki Linnakangas

Ok, I fixed the issues that the assertion fixed. I also committed a 
patch to add the assertion itself; let's see if the buildfarm finds more 
cases that violate the rule.


It ignores the checkpointer, because it's known to violate the rule, and 
allocations in ErrorContext, which is used during error recovery, e.g if 
you indeed PANIC while in a critical section for some other reason.


I didn't backpatch this. Although you shouldn't be running with 
assertions enabled in production, it nevertheless seems too risky. There 
might be some obscure cases where there is no real risk, e.g because the 
current memory context always has enough free space because of a 
previous pfree, and it doesn't seem worth tracking down and fixing such 
issues in backbranches. You have to be pretty unlucky to run out of 
memory in a critical section to begin with.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-04-04 Thread Andres Freund

On 2014-04-04 12:50:25 +0300, Heikki Linnakangas wrote:
> On 04/04/2014 11:41 AM, Andres Freund wrote:
> >On 2014-04-04 10:48:32 +0300, Heikki Linnakangas wrote:
> >>@@ -484,10 +483,11 @@ PageRepairFragmentation(Page page)
> >>((PageHeader) page)->pd_upper = pd_special;
> >>}
> >>else
> >>-   {   /* nstorage != 
> >>0 */
> >>+   {
> >>/* Need to compact the page the hard way */
> >>-   itemidbase = (itemIdSort) palloc(sizeof(itemIdSortData) * 
> >>nstorage);
> >>-   itemidptr = itemidbase;
> >>+   itemIdSortData itemidbase[MaxHeapTuplesPerPage];
> >>+   itemIdSort  itemidptr = itemidbase;
> >>+
> >
> >That's a fair bit of stack, and it can be called somewhat deep on the
> >stack via heap_page_prune_opt(). I wonder if we ought to add a
> >check_stack_depth() somewhere.
> 
> Hmm, on my 64-bit laptop, that array is 24*291=6984 bytes with 8k block
> size. That's fairly large, but not unheard of; there are a few places where
> we allocate a BLCKSZ-sized buffer from stack.

Yea, I am not complaing about using so much stack. Seems sensible here.

> But overall I wouldn't worry about it. check_stack_depth() leaves a fair
> amount of headroom: STACK_DEPTH_SLOP is 512kB. As long as we don't recurse,
> that's plenty.

Well, there's no checks at nearby afair. That's why I was
wondering... But I don't have a big problem with not checking, I just
wanted to bring it up.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-04-04 Thread Heikki Linnakangas


On 04/04/2014 11:41 AM, Andres Freund wrote:

On 2014-04-04 10:48:32 +0300, Heikki Linnakangas wrote:

@@ -484,10 +483,11 @@ PageRepairFragmentation(Page page)
((PageHeader) page)->pd_upper = pd_special;
}
else
-   {   /* nstorage != 
0 */
+   {
/* Need to compact the page the hard way */
-   itemidbase = (itemIdSort) palloc(sizeof(itemIdSortData) * 
nstorage);
-   itemidptr = itemidbase;
+   itemIdSortData itemidbase[MaxHeapTuplesPerPage];
+   itemIdSort  itemidptr = itemidbase;
+


That's a fair bit of stack, and it can be called somewhat deep on the
stack via heap_page_prune_opt(). I wonder if we ought to add a
check_stack_depth() somewhere.


Hmm, on my 64-bit laptop, that array is 24*291=6984 bytes with 8k block 
size. That's fairly large, but not unheard of; there are a few places 
where we allocate a BLCKSZ-sized buffer from stack.


We could easily reduce the stack consumption here by changing itemIdSort 
to use 16-bit ints; all the lengths and offsets that 
PageRepairFragmentation deals with fit in 16-bits.


But overall I wouldn't worry about it. check_stack_depth() leaves a fair 
amount of headroom: STACK_DEPTH_SLOP is 512kB. As long as we don't 
recurse, that's plenty.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-04-04 Thread Andres Freund

Hi,

On 2014-04-04 10:48:32 +0300, Heikki Linnakangas wrote:
> But if we give the checkpointer process a free pass, running the regression
> tests with an assertion in AllocSetAlloc catches five genuine bugs:
> 
> 1. _bt_newroot
> 2. XLogFileInit
> 3. spgPageIndexMultiDelete
> 4. PageRepairFragmentation
> 5. PageIndexMultiDelete

Some of those, like PageRepairFragmentation, are somewhat bad...

> @@ -484,10 +483,11 @@ PageRepairFragmentation(Page page)
>   ((PageHeader) page)->pd_upper = pd_special;
>   }
>   else
> - {   /* nstorage != 
> 0 */
> + {
>   /* Need to compact the page the hard way */
> - itemidbase = (itemIdSort) palloc(sizeof(itemIdSortData) * 
> nstorage);
> - itemidptr = itemidbase;
> + itemIdSortData itemidbase[MaxHeapTuplesPerPage];
> + itemIdSort  itemidptr = itemidbase;
> +

That's a fair bit of stack, and it can be called somewhat deep on the
stack via heap_page_prune_opt(). I wonder if we ought to add a
check_stack_depth() somewhere.

Thanks,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-04-04 Thread Heikki Linnakangas


On 04/04/2014 02:40 AM, Andres Freund wrote:

On 2014-04-03 19:33:12 -0400, Tom Lane wrote:

Andres Freund  writes:

On 2014-04-03 19:08:27 -0400, Tom Lane wrote:

A somewhat more relevant concern is where are we going to keep the state
data involved in all this.  Since this code is, by definition, going to be
called in critical sections, any solution involving internal pallocs is
right out.



We actually already allocate memory in XLogInsert() :(, although only in
the first XLogInsert() a backend does...


Ouch.  I wonder if we should put an Assert(not-in-critical-section)
into MemoryContextAlloc.


Good idea!


XLogInsert() is using malloc() directly, so that wouldn't detect this
case...

It's not a bad idea tho. I wonder how far the regression tests
get...

Not even through initdb. GetVirtualXIDsDelayingChkpt() is to blame.


Hmm, the checkpointer process seems to ignore the rule, and just hopes 
for the best. The allocation in GetVirtualXIDsDelayingChkpt() was 
probably just an oversight, and that call could be moved outside the 
critical section, but we also have this in AbsortFsyncRequests():



/*
 * We have to PANIC if we fail to absorb all the pending requests (eg,
 * because our hashtable runs out of memory).  This is because the 
system
 * cannot run safely if we are unable to fsync what we have been told to
 * fsync.  Fortunately, the hashtable is so small that the problem is
 * quite unlikely to arise in practice.
 */
START_CRIT_SECTION();


Perhaps we could fix that by just calling fsync() directly if the 
allocation fails, like we do if ForwardFsyncRequest() fails.



But if we give the checkpointer process a free pass, running the 
regression tests with an assertion in AllocSetAlloc catches five genuine 
bugs:


1. _bt_newroot
2. XLogFileInit
3. spgPageIndexMultiDelete
4. PageRepairFragmentation
5. PageIndexMultiDelete

I'll commit the attached patch to fix those shortly.

- Heikki
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index d2ca8d9..922412e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1995,8 +1995,10 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	BTPageOpaque lopaque;
 	ItemId		itemid;
 	IndexTuple	item;
-	Size		itemsz;
-	IndexTuple	new_item;
+	IndexTuple	left_item;
+	Size		left_item_sz;
+	IndexTuple	right_item;
+	Size		right_item_sz;
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
@@ -2016,6 +2018,26 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
 
+	/*
+	 * Create downlink item for left page (old root).  Since this will be the
+	 * first item in a non-leaf page, it implicitly has minus-infinity key
+	 * value, so we need not store any actual key in it.
+	 */
+	left_item_sz = sizeof(IndexTupleData);
+	left_item = (IndexTuple) palloc(left_item_sz);
+	left_item->t_info = left_item_sz;
+	ItemPointerSet(&(left_item->t_tid), lbkno, P_HIKEY);
+
+	/*
+	 * Create downlink item for right page.  The key for it is obtained from
+	 * the "high key" position in the left page.
+	 */
+	itemid = PageGetItemId(lpage, P_HIKEY);
+	right_item_sz = ItemIdGetLength(itemid);
+	item = (IndexTuple) PageGetItem(lpage, itemid);
+	right_item = CopyIndexTuple(item);
+	ItemPointerSet(&(right_item->t_tid), rbkno, P_HIKEY);
+
 	/* NO EREPORT(ERROR) from here till newroot op is logged */
 	START_CRIT_SECTION();
 
@@ -2034,16 +2056,6 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	metad->btm_fastlevel = rootopaque->btpo.level;
 
 	/*
-	 * Create downlink item for left page (old root).  Since this will be the
-	 * first item in a non-leaf page, it implicitly has minus-infinity key
-	 * value, so we need not store any actual key in it.
-	 */
-	itemsz = sizeof(IndexTupleData);
-	new_item = (IndexTuple) palloc(itemsz);
-	new_item->t_info = itemsz;
-	ItemPointerSet(&(new_item->t_tid), lbkno, P_HIKEY);
-
-	/*
 	 * Insert the left page pointer into the new root page.  The root page is
 	 * the rightmost page on its level so there is no "high key" in it; the
 	 * two items will go into positions P_HIKEY and P_FIRSTKEY.
@@ -2051,32 +2063,20 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	 * Note: we *must* insert the two items in item-number order, for the
 	 * benefit of _bt_restore_page().
 	 */
-	if (PageAddItem(rootpage, (Item) new_item, itemsz, P_HIKEY,
+	if (PageAddItem(rootpage, (Item) left_item, left_item_sz, P_HIKEY,
 	false, false) == InvalidOffsetNumber)
 		elog(PANIC, "failed to add leftkey to new root page"
 			 " while splitting block %u of index \"%s\"",
 			 BufferGetBlockNumber(lbuf), RelationGetRelationName(rel));
-	pfree(new_item);
-
-	/*
-	 * Create downlink item for right page.  The key for it is obtained from
-	 * the "high key" position in the left page.
-	 */
-	itemid = PageGetItemId(lpage, P_HIKEY);
-	itemsz = ItemIdGe

Re: [HACKERS] WAL format and API changes (9.5)

2014-04-03 Thread Andres Freund

On 2014-04-03 19:33:12 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2014-04-03 19:08:27 -0400, Tom Lane wrote:
> >> A somewhat more relevant concern is where are we going to keep the state
> >> data involved in all this.  Since this code is, by definition, going to be
> >> called in critical sections, any solution involving internal pallocs is
> >> right out.
> 
> > We actually already allocate memory in XLogInsert() :(, although only in
> > the first XLogInsert() a backend does...
> 
> Ouch.  I wonder if we should put an Assert(not-in-critical-section)
> into MemoryContextAlloc.

XLogInsert() is using malloc() directly, so that wouldn't detect this
case...

It's not a bad idea tho. I wonder how far the regression tests
get...

Not even through initdb. GetVirtualXIDsDelayingChkpt() is to blame.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-04-03 Thread Tom Lane

Andres Freund  writes:
> On 2014-04-03 19:08:27 -0400, Tom Lane wrote:
>> A somewhat more relevant concern is where are we going to keep the state
>> data involved in all this.  Since this code is, by definition, going to be
>> called in critical sections, any solution involving internal pallocs is
>> right out.

> We actually already allocate memory in XLogInsert() :(, although only in
> the first XLogInsert() a backend does...

Ouch.  I wonder if we should put an Assert(not-in-critical-section)
into MemoryContextAlloc.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-04-03 Thread Andres Freund

On 2014-04-03 19:08:27 -0400, Tom Lane wrote:
> Robert Haas  writes:
> > On Thu, Apr 3, 2014 at 10:14 AM, Heikki Linnakangas
> >  wrote:
> >> (Instead of using a new XLogRegisterBuffer() function to register the
> >> buffers, perhaps they should be passed to XLogInsert as a separate list or
> >> array. I'm not wedded on the details...)
> 
> > That would have the advantage of avoiding the function-call overhead,
> > which seems appealing.
> 
> You're kidding, right?  One function call per buffer touched by an update
> is going to be lost in the noise.
> 
> A somewhat more relevant concern is where are we going to keep the state
> data involved in all this.  Since this code is, by definition, going to be
> called in critical sections, any solution involving internal pallocs is
> right out.

We actually already allocate memory in XLogInsert() :(, although only in
the first XLogInsert() a backend does... I didn't remember it, but I
complained about it before:
http://archives.postgresql.org/message-id/4FEB223A.3060707%40enterprisedb.com

I didn't realize the implications for it back then and thus didn't press
hard for a change, but I think it should be fixed.

>  The existing arrangement keeps all its data in local variables
> of the function doing the update, which is probably a nice property to
> preserve if we can manage it.  That might force us to do it as above.

We could simply allocate enough scratch space for the state at process
startup. I don't think there'll ever be a need for XLogInsert() to be
reentrant, so that should be fine.

> I'd still like something that *looks* more like a list of function calls,
> though, even if they're macros under the hood.  The existing approach to
> setting up the XLogRecData chains is awfully prone to bugs of
> omission.

Agreed. There's some pretty ugly code around this.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-04-03 Thread Tom Lane

Robert Haas  writes:
> On Thu, Apr 3, 2014 at 10:14 AM, Heikki Linnakangas
>  wrote:
>> (Instead of using a new XLogRegisterBuffer() function to register the
>> buffers, perhaps they should be passed to XLogInsert as a separate list or
>> array. I'm not wedded on the details...)

> That would have the advantage of avoiding the function-call overhead,
> which seems appealing.

You're kidding, right?  One function call per buffer touched by an update
is going to be lost in the noise.

A somewhat more relevant concern is where are we going to keep the state
data involved in all this.  Since this code is, by definition, going to be
called in critical sections, any solution involving internal pallocs is
right out.  The existing arrangement keeps all its data in local variables
of the function doing the update, which is probably a nice property to
preserve if we can manage it.  That might force us to do it as above.

I'd still like something that *looks* more like a list of function calls,
though, even if they're macros under the hood.  The existing approach to
setting up the XLogRecData chains is awfully prone to bugs of omission.
There are a few places that went so far as to set up custom macros to help
with that.  I'm not a fan of doing the same thing in randomly different
ways in different places; but if we did it that way uniformly, it might be
better/safer.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-04-03 Thread Robert Haas

On Thu, Apr 3, 2014 at 10:14 AM, Heikki Linnakangas
 wrote:
> (Instead of using a new XLogRegisterBuffer() function to register the
> buffers, perhaps they should be passed to XLogInsert as a separate list or
> array. I'm not wedded on the details...)

That would have the advantage of avoiding the function-call overhead,
which seems appealing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-04-03 Thread Heikki Linnakangas


On 04/03/2014 07:11 PM, Tom Lane wrote:


More generally, I'm pretty sure that your proposal is already going to
involve some small growth of WAL records compared to today,


Quite possible.


but I think
that's probably all right; the benefits seem significant.


Yep.

OTOH, once we store the relfilenode+block in a common format, we can 
then try to optimize that format more heavily. Just as an example, omit 
the tablespace oid in the RelFileNode, when it's the default tablespace 
(with a flag bit indicating we did that). Or use a variable-length 
endoding for the block number, on the assumption that smaller numbers 
are more common. Probably not be worth the extra complexity, but we can 
easily experiment with that kind of stuff once we have the 
infrastructure in place.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-04-03 Thread Tom Lane

Heikki Linnakangas  writes:
> On 04/03/2014 06:37 PM, Tom Lane wrote:
>> +1, but one important step here is finding the data to be replayed.
>> That is, a large part of the complexity of replay routines has to do
>> with figuring out which parts of the WAL record were elided due to
>> full-page-images, and locating the remaining parts.  What can we do
>> to make that simpler?

> We can certainly add more structure to the WAL records, but any extra 
> information you add will make the records larger. It might be worth it, 
> and would be lost in the noise for more complex records like page 
> splits, but we should keep frequently-used records like heap insertions 
> as lean as possible.

Sure, but in simple WAL records there would just be a single data chunk
that would be present in the normal case and not present in the FPI case.
Seems like we ought to be able to handle that degenerate case with no
significant wastage (probably just a flag bit someplace).

More generally, I'm pretty sure that your proposal is already going to
involve some small growth of WAL records compared to today, but I think
that's probably all right; the benefits seem significant.

> Hmm. You could register a separate XLogRecData chain for each buffer. 
> Plus one more chain for the data not associated with a buffer.

That would probably work.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-04-03 Thread Heikki Linnakangas


On 04/03/2014 06:37 PM, Tom Lane wrote:

Also, IIRC there are places that WAL-log full pages that aren't in a
shared buffer at all (btree build does this I think).  How will that fit
into this model?


Hmm. We could provide a function for registering a block with given 
content, without a Buffer. Something like:


XLogRegisterPage(int id, RelFileNode, BlockNumber, Page)


Let's simplify that, and have one new function, XLogOpenBuffer, which
returns a return code that indicates which of the four cases we're
dealing with. A typical redo function looks like this:



if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY)
{
/* Modify the page */
...



PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
if (BufferIsValid(buffer))
UnlockReleaseBuffer(buffer);



The '0' in the XLogOpenBuffer call is the ID of the block reference
specified in the XLogRegisterBuffer call, when the WAL record was created.


+1, but one important step here is finding the data to be replayed.
That is, a large part of the complexity of replay routines has to do
with figuring out which parts of the WAL record were elided due to
full-page-images, and locating the remaining parts.  What can we do
to make that simpler?


We can certainly add more structure to the WAL records, but any extra 
information you add will make the records larger. It might be worth it, 
and would be lost in the noise for more complex records like page 
splits, but we should keep frequently-used records like heap insertions 
as lean as possible.



Ideally, if XLogOpenBuffer (bad name BTW) returns BLK_REPLAY, it would
also calculate and hand back the address/size of the logged data that
had been pointed to by the associated XLogRecData chain item.  The
trouble here is that there might've been multiple XLogRecData items
pointing to the same buffer.  Perhaps the magic ID number you give to
XLogOpenBuffer should be thought of as identifying an XLogRecData chain
item, not so much a buffer?  It's fairly easy to see what to do when
there's just one chain item per buffer, but I'm not sure what to do
if there's more than one.


Hmm. You could register a separate XLogRecData chain for each buffer. 
Along the lines of:


rdata[0].data = data for buffer
rdata[0].len = ...
rdata[0].next = &rdata[1];
rdata[1].data = more data for same buffer
rdata[1].len = ...
rdata[2].next = NULL;

XLogRegisterBuffer(0, buffer, &data[0]);

At replay:

if (XLogOpenBuffer(0, &buffer, &xldata, &len) == BLK_REPLAY)
{
/* xldata points to the data registered for this buffer */
}

Plus one more chain for the data not associated with a buffer.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL format and API changes (9.5)

2014-04-03 Thread Tom Lane

Heikki Linnakangas  writes:
> The big change in creating WAL records is that the buffers involved in 
> the WAL-logged operation are explicitly registered, by calling a new 
> XLogRegisterBuffer function.

Seems reasonable, especially if we can make the buffer numbering business
less error-prone.

> void XLogRegisterBuffer(int blockref_id, Buffer buffer, bool buffer_std)

> blockref_id: An arbitrary ID given to this block reference. It is used 
> in the redo routine to open/restore the same block.
> buffer: the buffer involved
> buffer_std: is the page in "standard" page layout?

> That's for the normal cases. We'll need a couple of variants for also 
> registering buffers that don't need full-page images, and perhaps also a 
> function for registering a page that *always* needs a full-page image, 
> regardless of the LSN. A few existing WAL record types just WAL-log the 
> whole page, so those ad-hoc full-page images could be replaced with this.

Why not just one function with an additional flags argument?

Also, IIRC there are places that WAL-log full pages that aren't in a
shared buffer at all (btree build does this I think).  How will that fit
into this model?

> (While we're at it, perhaps we should let XLogInsert set the LSN of all 
> the registered buffers, to reduce the amount of boilerplate code).

Yeah, maybe so.  I'm not sure why that was separate to begin with.

> Let's simplify that, and have one new function, XLogOpenBuffer, which 
> returns a return code that indicates which of the four cases we're 
> dealing with. A typical redo function looks like this:

>   if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY)
>   {
>   /* Modify the page */
>   ...

>   PageSetLSN(page, lsn);
>   MarkBufferDirty(buffer);
>   }
>   if (BufferIsValid(buffer))
>   UnlockReleaseBuffer(buffer);

> The '0' in the XLogOpenBuffer call is the ID of the block reference 
> specified in the XLogRegisterBuffer call, when the WAL record was created.

+1, but one important step here is finding the data to be replayed.
That is, a large part of the complexity of replay routines has to do
with figuring out which parts of the WAL record were elided due to
full-page-images, and locating the remaining parts.  What can we do
to make that simpler?

Ideally, if XLogOpenBuffer (bad name BTW) returns BLK_REPLAY, it would
also calculate and hand back the address/size of the logged data that
had been pointed to by the associated XLogRecData chain item.  The
trouble here is that there might've been multiple XLogRecData items
pointing to the same buffer.  Perhaps the magic ID number you give to
XLogOpenBuffer should be thought of as identifying an XLogRecData chain
item, not so much a buffer?  It's fairly easy to see what to do when
there's just one chain item per buffer, but I'm not sure what to do
if there's more than one.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] WAL format and API changes (9.5)

2014-04-03 Thread Heikki Linnakangas

I'd like to do some changes to the WAL format in 9.5. I want to annotate 
each WAL record with the blocks that they modify. Every WAL record 
already includes that information, but it's done in an ad hoc way, 
differently in every rmgr. The RelFileNode and block number are 
currently part of the WAL payload, and it's the REDO routine's 
responsibility to extract it. I want to include that information in a 
common format for every WAL record type.


That makes life a lot easier for tools that are interested in knowing 
which blocks a WAL record modifies. One such tool is pg_rewind; it 
currently has to understand every WAL record the backend writes. There's 
also a tool out there called pg_readahead, which does prefetching of 
blocks accessed by WAL records, to speed up PITR. I don't think that 
tool has been actively maintained, but at least part of the reason for 
that is probably that it's a pain to maintain when it has to understand 
the details of every WAL record type.


It'd also be nice for contrib/pg_xlogdump and backend code itself. The 
boilerplate code in all WAL redo routines, and writing WAL records, 
could be simplified.


So, here's my proposal:

Insertion
-

The big change in creating WAL records is that the buffers involved in 
the WAL-logged operation are explicitly registered, by calling a new 
XLogRegisterBuffer function. Currently, buffers that need full-page 
images are registered by including them in the XLogRecData chain, but 
with the new system, you call the XLogRegisterBuffer() function instead. 
And you call that function for every buffer involved, even if no 
full-page image needs to be taken, e.g because the page is going to be 
recreated from scratch at replay.


It is no longer necessary to include the RelFileNode and BlockNumber of 
the modified pages in the WAL payload. That information is automatically 
included in the WAL record, when XLogRegisterBuffer is called.


Currently, the backup blocks are implicitly numbered, in the order the 
buffers appear in XLogRecData entries. With the new API, the blocks are 
numbered explicitly. This is more convenient when a WAL record sometimes 
modifies a buffer and sometimes not. For example, a B-tree split needs 
to modify four pages: the original page, the new page, the right sibling 
(unless it's the rightmost page) and if it's an internal page, the page 
at the lower level whose split the insertion completes. So there are two 
pages that are sometimes missing from the record. With the new API, you 
can nevertheless always register e.g. original page as buffer 0, new 
page as 1, right sibling as 2, even if some of them are actually 
missing. SP-GiST contains even more complicated examples of that.


The new XLogRegisterBuffer would look like this:

void XLogRegisterBuffer(int blockref_id, Buffer buffer, bool buffer_std)

blockref_id: An arbitrary ID given to this block reference. It is used 
in the redo routine to open/restore the same block.

buffer: the buffer involved
buffer_std: is the page in "standard" page layout?

That's for the normal cases. We'll need a couple of variants for also 
registering buffers that don't need full-page images, and perhaps also a 
function for registering a page that *always* needs a full-page image, 
regardless of the LSN. A few existing WAL record types just WAL-log the 
whole page, so those ad-hoc full-page images could be replaced with this.


With these changes, a typical WAL insertion would look like this:

/* register the buffer with the WAL record, with ID 0 */
XLogRegisterBuffer(0, buf, true);

rdata[0].data = (char *) &xlrec;
rdata[0].len = sizeof(BlahRecord);
rdata[0].buffer_id = -1; /* -1 means the data is always included */
rdata[0].next = &(rdata[1]);

rdata[1].data = (char *) mydata;
rdata[1].len = mydatalen;
rdata[1].buffer_id = 0; /* 0 here refers to the buffer registered above 
*/
rdata[1].next = NULL

...
recptr = XLogInsert(RM_BLAH_ID, xlinfo, rdata);

PageSetLSN(buf, recptr);


(While we're at it, perhaps we should let XLogInsert set the LSN of all 
the registered buffers, to reduce the amount of boilerplate code).


(Instead of using a new XLogRegisterBuffer() function to register the 
buffers, perhaps they should be passed to XLogInsert as a separate list 
or array. I'm not wedded on the details...)


Redo


There are four different states a block referenced by a typical WAL 
record can be in:


1. The old page does not exist at all (because the relation was 
truncated later)
2. The old page exists, but has an LSN higher than current WAL record, 
so it doesn't need replaying.

3. The LSN is < current WAL record, so it needs to be replayed.
4. The WAL record contains a full-page image, which needs to be restored.

With the current API, that leads to a long boilerplate:

/* If we have a full-page image, restore it and we're done */
if (HasBackupBl

93 matches

Mail list logo