Re: [HACKERS] 16-bit page checksums for 9.2
On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs wrote: > Checksums merely detect a problem, whereas FPWs correct a problem if > it happens, but only in crash situations. > > So this does nothing to remove the need for FPWs, though checksum > detection could be used for double write buffers also. This is missing the point. If you have a torn page on a page that is only dirty due to hint bits then the checksum will show a spurious checksum failure. It will "detect" a problem that isn't there. The problem is that there is no WAL indicating the hint bit change. And if the torn page includes the new checksum but not the new hint bit or vice versa it will be a checksum mismatch. The strategy discussed in the past was moving all the hint bits to a common area and skipping them in the checksum. No amount of double writing or buffering or locking will avoid this problem. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Moving more work outside WALInsertLock
On Fri, Dec 16, 2011 at 3:27 AM, Tom Lane wrote: >> On its own that sounds dangerous, but its not. When we need to confirm >> the prev link we already know what we expect it to be, so CRC-ing it >> is overkill. That isn't true of any other part of the WAL record, so >> the prev link is the only thing we can relax, but thats OK because we >> can CRC check everything else outside of the locked section. > >> That isn't my idea, but I'm happy to put it on the table since I'm not shy. > > I'm glad it's not your idea, because it's a bad one. I'll take the blame or credit here. > A large part of > the point of CRC'ing WAL records is to guard against torn-page problems > in the WAL files, and doing things like that would give up a significant > part of that protection, because there would no longer be any assurance > that the body of a WAL record had anything to do with its prev_link. Hm, I hadn't considered the possibility of a prev_link being the only thing left over from a torn page. As Heikki pointed out having the CRC and the rest of the record on opposite sides of the prev_link does seem like convincing protection but it's a lot more fiddly and hard to explain the dependencies this way. Another thought that was discussed in the same dinner was separating the CRC into a separate record that would cover all the WAL since the last CRC. These would only need to be emitted when there's a WAL sync, not on every record. I think someone showed some benchmarks claiming that a significant overhead with the CRC was the startup and finishing time for doing lots of small chunks. If it processes larger blocks it might be able to make more efficient use of the memory bandwidth. I'm not entirely convinced of that myself but it bears some experimentation. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] reprise: pretty print viewdefs
On Thu, Dec 22, 2011 at 5:52 PM, Andrew Dunstan wrote: > I've looked at that, and it was discussed a bit previously. It's more > complex because it requires that we keep track of (or calculate) where we > are on the line, You might try a compromise, just spit out all the columns on one line *unless* either the previous or next column is longer than something like 30 columns. So if you have a long list of short columns it just gets wrapped by your terminal but if you have complex expressions like CASE expressions or casts or so on they go on a line by themselves. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Saturday, December 24, 2011 05:01:02 PM Simon Riggs wrote: > On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund wrote: > > On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote: > >> Simon Riggs writes: > >> > After the various recent discussions on list, I present what I believe > >> > to be a working patch implementing 16-but checksums on all buffer > >> > pages. > >> > >> I think locking around hint-bit-setting is likely to be unworkable from > >> a performance standpoint. I also wonder whether it might not result in > >> deadlocks. > > > > Why don't you use the same tricks as the former patch and copy the > > buffer, compute the checksum on that, and then write out that copy (you > > can even do both at the same time). I have a hard time believing that > > the additional copy is more expensive than the locking. > > We would copy every time we write, yet lock only every time we set hint > bits. Isn't setting hint bits also a rather frequent operation? At least in a well- cached workload where most writeout happens due to checkpoints. > If that option is favoured, I'll write another version after Christmas. Seems less complicated (wrt deadlocking et al) to me. But I havent read your patch, so I will shut up now ;) Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Sat, Dec 24, 2011 at 3:51 PM, Aidan Van Dyk wrote: > Not an expert here, but after reading through the patch quickly, I > don't see anything that changes the torn-page problem though, right? > > Hint bits aren't wal-logged, and FPW isn't forced on the hint-bit-only > dirty, right? Checksums merely detect a problem, whereas FPWs correct a problem if it happens, but only in crash situations. So this does nothing to remove the need for FPWs, though checksum detection could be used for double write buffers also. Checksums work even when there is no crash, so if your disk goes bad and corrupts data then you'll know about it as soon as it happens. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund wrote: > On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote: >> Simon Riggs writes: >> > After the various recent discussions on list, I present what I believe >> > to be a working patch implementing 16-but checksums on all buffer >> > pages. >> >> I think locking around hint-bit-setting is likely to be unworkable from >> a performance standpoint. I also wonder whether it might not result in >> deadlocks. > Why don't you use the same tricks as the former patch and copy the buffer, > compute the checksum on that, and then write out that copy (you can even do > both at the same time). I have a hard time believing that the additional copy > is more expensive than the locking. We would copy every time we write, yet lock only every time we set hint bits. If that option is favoured, I'll write another version after Christmas. ISTM we can't write and copy at the same time because the cheksum is not a trailer field. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Sat, Dec 24, 2011 at 2:46 PM, Tom Lane wrote: > Simon Riggs writes: >> After the various recent discussions on list, I present what I believe >> to be a working patch implementing 16-but checksums on all buffer >> pages. > > I think locking around hint-bit-setting is likely to be unworkable from > a performance standpoint. Anyone choosing page_checksums = on has already made a performance reducing decision in favour of reliability. So they understand and accept the impact. There is no locking when the parameter is off. A safe alternative is to use LockBuffer, which has a much greater performance impact. I did think about optimistically checking after the write, but if we crash at that point we will then see a block that has an invalid checksum. It's faster but you may get a checksum failure if you crash - but then one important aspect of this is to spot problems in case of a crash, so that seems unacceptable. > I also wonder whether it might not result in > deadlocks. If you can see how, please say. I can't see any ways for that myself. > Also, as far as I can see this patch usurps the page version field, > which I find unacceptably short-sighted. Do you really think this is > the last page layout change we'll ever make? No, I don't. I hope and expect the next page layout change to reintroduce such a field. But since we're agreed now that upgrading is important, changing page format isn't likely to be happening until we get an online upgrade process. So future changes are much less likely. If they do happen, we have some flag bits spare that can be used to indicate later versions. It's not the prettiest thing in the world, but it's a small ugliness in return for an important feature. If there was a way without that, I would have chosen it. pg_filedump will need to be changed more than normal, but the version isn't used anywhere else in the server code. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote: > Simon Riggs writes: > > After the various recent discussions on list, I present what I believe > > to be a working patch implementing 16-but checksums on all buffer > > pages. > > I think locking around hint-bit-setting is likely to be unworkable from > a performance standpoint. I also wonder whether it might not result in > deadlocks. Why don't you use the same tricks as the former patch and copy the buffer, compute the checksum on that, and then write out that copy (you can even do both at the same time). I have a hard time believing that the additional copy is more expensive than the locking. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
Simon Riggs writes: > After the various recent discussions on list, I present what I believe > to be a working patch implementing 16-but checksums on all buffer > pages. I think locking around hint-bit-setting is likely to be unworkable from a performance standpoint. I also wonder whether it might not result in deadlocks. Also, as far as I can see this patch usurps the page version field, which I find unacceptably short-sighted. Do you really think this is the last page layout change we'll ever make? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: bytea_agg
On Fri, Dec 23, 2011 at 12:30 PM, Robert Haas wrote: > Well, because it doesn't operate on strings. > > I argued when we added string_agg that it ought to be called > concat_agg, or something like that, but I got shouted down. So now > here we are. +1. Using the input type names to name the function is a mistake and should be stopped...enough. It's verbose and unnecessary...everything else in sql is heavily overloaded (we don't have int_max() and float_max()). merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Page Checksums + Double Writes
On Thu, Dec 22, 2011 at 9:58 PM, Simon Riggs wrote: > On Thu, Dec 22, 2011 at 9:50 AM, Kevin Grittner > wrote: > >> Simon, does it sound like I understand your proposal? > > Yes, thanks for restating. I've implemented that proposal, posting patch on a separate thread. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] 16-bit page checksums for 9.2
After the various recent discussions on list, I present what I believe to be a working patch implementing 16-but checksums on all buffer pages. page_checksums = on | off (default) There are no required block changes; checksums are optional and some blocks may have a checksum, others not. This means that the patch will allow pg_upgrade. That capability also limits us to 16-bit checksums. Fletcher's 16 is used in this patch and seems rather quick, though that is easily replaceable/tuneable if desired, perhaps even as a parameter enum. This patch is a step on the way to 32-bit checksums in a future redesign of the page layout, though that is not a required future change, nor does this prevent that. Checksum is set whenever the buffer is flushed to disk, and checked when the page is read in from disk. It is not set at other times, and for much of the time may not be accurate. This follows earlier discussions from 2010-12-22, and is discussed in detail in patch comments. Note it works with buffer manager pages, which includes shared and local data buffers, but not SLRU pages (yet? an easy addition but needs other discussion around contention). Note that all this does is detect bit errors on the page, it doesn't identify where the error is, how bad and definitely not what caused it or when it happened. The main body of the patch involves changes to bufpage.c/.h so this differs completely from the VMware patch, for technical reasons. Also included are facilities to LockBufferForHints() with usage in various AMs, to avoid the case where hints are set during calculation of the checksum. In my view this is a fully working, committable patch but I'm not in a hurry to do so given the holiday season. Hopefully its a gift not a turkey, and therefore a challenge for some to prove that wrong. Enjoy either way, Merry Christmas, -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services *** a/src/backend/access/hash/hash.c --- b/src/backend/access/hash/hash.c *** *** 282,288 hashgettuple(PG_FUNCTION_ARGS) --- 282,290 /* * Yes, so mark it by setting the LP_DEAD state in the item flags. */ + LockBufferForHints(buf); ItemIdMarkDead(PageGetItemId(page, offnum)); + UnlockBufferForHints(buf); /* * Since this can be redone later if needed, it's treated the same *** a/src/backend/access/heap/pruneheap.c --- b/src/backend/access/heap/pruneheap.c *** *** 259,266 heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin, --- 259,268 if (((PageHeader) page)->pd_prune_xid != prstate.new_prune_xid || PageIsFull(page)) { + LockBufferForHints(buffer); ((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid; PageClearFull(page); + UnlockBufferForHints(buffer); SetBufferCommitInfoNeedsSave(buffer); } } *** a/src/backend/access/nbtree/nbtinsert.c --- b/src/backend/access/nbtree/nbtinsert.c *** *** 403,415 _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel, --- 403,427 * everyone, so we may as well mark the index entry * killed. */ + if (nbuf != InvalidBuffer) + LockBufferForHints(nbuf); + else + LockBufferForHints(buf); + ItemIdMarkDead(curitemid); opaque->btpo_flags |= BTP_HAS_GARBAGE; + /* be sure to mark the proper buffer dirty... */ if (nbuf != InvalidBuffer) + { + UnlockBufferForHints(nbuf); SetBufferCommitInfoNeedsSave(nbuf); + } else + { + UnlockBufferForHints(buf); SetBufferCommitInfoNeedsSave(buf); + } } } } *** a/src/backend/access/nbtree/nbtree.c --- b/src/backend/access/nbtree/nbtree.c *** *** 1040,1046 restart: --- 1040,1048 if (vstate->cycleid != 0 && opaque->btpo_cycleid == vstate->cycleid) { + LockBufferForHints(buf); opaque->btpo_cycleid = 0; + UnlockBufferForHints(buf); SetBufferCommitInfoNeedsSave(buf); } } *** a/src/backend/commands/sequence.c --- b/src/backend/commands/sequence.c *** *** 1092,1101 read_info(SeqTable elm, Relation rel, Buffer *buf) --- 1092,1103 */ if (HeapTupleHeaderGetXmax(tuple.t_data) != InvalidTransactionId) { + LockBufferForHints(*buf); HeapTupleHeaderSetXmax(tuple.t_data, InvalidTransactionId); tuple.t_data->t_infomask &= ~HEAP_XMAX_COMMITTED; tuple.t_data->t_infomask |= HEAP_XMAX_INVALID; SetBufferCommitInfoNeedsSave(*buf); + UnlockBufferForHints(*buf); } seq = (Form_pg_sequence) GETSTRUCT(&tuple); *** a/src/backend/storage/buffer/bufmgr.c --- b/src/backend/storage/buffer/bufmgr.c *** *** 440,446 ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, smgrread(smgr, forkNum, blockNum, (char *) bufBlock); /* check for garbage da
Re: [HACKERS] CLOG contention
On Thu, Dec 22, 2011 at 4:20 PM, Robert Haas wrote: > You mentioned "latency" so this morning I ran pgbench with -l and > graphed the output. There are latency spikes every few seconds. I'm > attaching the overall graph as well as the graph of the last 100 > seconds, where the spikes are easier to see clearly. Now, here's the > problem: it seems reasonable to hypothesize that the spikes are due to > CLOG page replacement since the frequency is at least plausibly right, > but this is obviously not enough to prove that conclusively. Ideas? Thanks. That illustrates the effect I explained earlier very clearly, so now we all know I wasn't speculating. > Also, if it is that, what do we do about it? I don't think any of the > ideas proposed so far are going to help much. If you don't like guessing, don't guess, don't think. Just measure. Does increasing the number of buffers solve the problems you see? That must be the first port of call - is that enough, or not? If not, we can discuss the various ideas, write patches and measure them. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Moving more work outside WALInsertLock
On Sat, Dec 24, 2011 at 4:54 AM, Heikki Linnakangas wrote: > Sorry. Last minute changes, didn't retest properly.. Here's another attempt. When I tested the patch, initdb failed: $ initdb -D data initializing dependencies ... PANIC: could not locate a valid checkpoint record Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers