Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-24 Thread Fujii Masao
On Sat, Dec 24, 2011 at 4:54 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Sorry. Last-minute changes, didn't retest properly. Here's another attempt.

When I tested the patch, initdb failed:

$ initdb -D data

initializing dependencies ... PANIC:  could not locate a valid checkpoint record

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] CLOG contention

2011-12-24 Thread Simon Riggs
On Thu, Dec 22, 2011 at 4:20 PM, Robert Haas robertmh...@gmail.com wrote:

 You mentioned latency so this morning I ran pgbench with -l and
 graphed the output.  There are latency spikes every few seconds.  I'm
 attaching the overall graph as well as the graph of the last 100
 seconds, where the spikes are easier to see clearly.  Now, here's the
 problem: it seems reasonable to hypothesize that the spikes are due to
 CLOG page replacement since the frequency is at least plausibly right,
 but this is obviously not enough to prove that conclusively.  Ideas?

Thanks. That illustrates the effect I explained earlier very clearly,
so now we all know I wasn't speculating.

 Also, if it is that, what do we do about it?  I don't think any of the
 ideas proposed so far are going to help much.

If you don't like guessing, don't guess, don't think. Just measure.

Does increasing the number of buffers solve the problems you see? That
must be the first port of call: is that enough, or not? If not, we
can discuss the various ideas, write patches and measure them.
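
For concreteness, the buffer count in question here is the CLOG buffer
pool, which at this point is a compile-time constant rather than a GUC,
so measuring means rebuilding with a larger value and re-running the
pgbench latency test. A rough sketch, with the value quoted from memory
rather than checked against the tree:

/*
 * src/include/access/clog.h (9.1-era; the exact definition should be
 * checked against the source tree).
 */
#define NUM_CLOG_BUFFERS	8		/* try e.g. 16, 32 or 64 and re-measure */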

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



[HACKERS] 16-bit page checksums for 9.2

2011-12-24 Thread Simon Riggs
After the various recent discussions on list, I present what I believe
to be a working patch implementing 16-bit checksums on all buffer
pages.

page_checksums = on | off (default)

There are no required block changes; checksums are optional, so some
blocks may have a checksum and others not. This means that the patch
allows pg_upgrade.

That capability also limits us to 16-bit checksums. Fletcher-16 is
used in this patch and seems rather quick, though it is easily
replaceable/tunable if desired, perhaps even via a parameter enum.
This patch is a step on the way to 32-bit checksums in a future
redesign of the page layout, though that is neither a required future
change nor prevented by this one.
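
For reference, a minimal sketch of a Fletcher-16-style sum over a page
image is shown below; the function name is illustrative and this is not
the patch's actual code:

#include <stddef.h>
#include <stdint.h>

/* Minimal Fletcher-16 sketch: two running 8-bit sums modulo 255,
 * combined into one 16-bit value.  Illustrative only. */
static uint16_t
page_checksum_fletcher16(const uint8_t *data, size_t len)
{
	uint32_t	sum1 = 0;
	uint32_t	sum2 = 0;
	size_t		i;

	for (i = 0; i < len; i++)
	{
		sum1 = (sum1 + data[i]) % 255;
		sum2 = (sum2 + sum1) % 255;
	}
	return (uint16_t) ((sum2 << 8) | sum1);
}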

The checksum is set whenever the buffer is flushed to disk, and checked
when the page is read back in from disk. It is not set at other times,
so for much of the time it may not be accurate. This follows the
earlier discussion from 2010-12-22 and is covered in detail in the
patch comments.

Note that this works with buffer manager pages, which includes shared
and local data buffers, but not SLRU pages (yet; an easy addition, but
one that needs separate discussion around contention).

Note that all this does is detect bit errors on the page; it doesn't
identify where the error is or how bad it is, and certainly not what
caused it or when it happened.

The main body of the patch involves changes to bufpage.c/.h, so for
technical reasons it differs completely from the VMware patch. Also
included is a LockBufferForHints() facility, with usage in various
AMs, to avoid hint bits being set while the checksum is being
calculated.

In my view this is a fully working, committable patch, but I'm not in
a hurry to commit it given the holiday season.

Hopefully it's a gift and not a turkey, and therefore a challenge for
some to prove that wrong. Enjoy it either way,

Merry Christmas,

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
*** a/src/backend/access/hash/hash.c
--- b/src/backend/access/hash/hash.c
***
*** 282,288  hashgettuple(PG_FUNCTION_ARGS)
--- 282,290 
  			/*
  			 * Yes, so mark it by setting the LP_DEAD state in the item flags.
  			 */
+ 			LockBufferForHints(buf);
  			ItemIdMarkDead(PageGetItemId(page, offnum));
+ 			UnlockBufferForHints(buf);
  
  			/*
  			 * Since this can be redone later if needed, it's treated the same
*** a/src/backend/access/heap/pruneheap.c
--- b/src/backend/access/heap/pruneheap.c
***
*** 259,266  heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
--- 259,268 
  		if (((PageHeader) page)->pd_prune_xid != prstate.new_prune_xid ||
  			PageIsFull(page))
  		{
+ 			LockBufferForHints(buffer);
  			((PageHeader) page)-pd_prune_xid = prstate.new_prune_xid;
  			PageClearFull(page);
+ 			UnlockBufferForHints(buffer);
  			SetBufferCommitInfoNeedsSave(buffer);
  		}
  	}
*** a/src/backend/access/nbtree/nbtinsert.c
--- b/src/backend/access/nbtree/nbtinsert.c
***
*** 403,415  _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
--- 403,427 
  	 * everyone, so we may as well mark the index entry
  	 * killed.
  	 */
+ 	if (nbuf != InvalidBuffer)
+ 		LockBufferForHints(nbuf);
+ 	else
+ 		LockBufferForHints(buf);
+ 
  	ItemIdMarkDead(curitemid);
  	opaque->btpo_flags |= BTP_HAS_GARBAGE;
+ 
  	/* be sure to mark the proper buffer dirty... */
  	if (nbuf != InvalidBuffer)
+ 	{
+ 		UnlockBufferForHints(nbuf);
  		SetBufferCommitInfoNeedsSave(nbuf);
+ 	}
  	else
+ 	{
+ 		UnlockBufferForHints(buf);
  		SetBufferCommitInfoNeedsSave(buf);
+ 	}
  }
  			}
  		}
*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
***
*** 1040,1046  restart:
--- 1040,1048 
  			if (vstate->cycleid != 0 &&
				opaque->btpo_cycleid == vstate->cycleid)
  			{
+ LockBufferForHints(buf);
  opaque-btpo_cycleid = 0;
+ UnlockBufferForHints(buf);
  SetBufferCommitInfoNeedsSave(buf);
  			}
  		}
*** a/src/backend/commands/sequence.c
--- b/src/backend/commands/sequence.c
***
*** 1092,1101  read_info(SeqTable elm, Relation rel, Buffer *buf)
--- 1092,1103 
  	 */
  	if (HeapTupleHeaderGetXmax(tuple.t_data) != InvalidTransactionId)
  	{
+ 		LockBufferForHints(*buf);
  		HeapTupleHeaderSetXmax(tuple.t_data, InvalidTransactionId);
  		tuple.t_data->t_infomask &= ~HEAP_XMAX_COMMITTED;
  		tuple.t_data->t_infomask |= HEAP_XMAX_INVALID;
  		SetBufferCommitInfoNeedsSave(*buf);
+ 		UnlockBufferForHints(*buf);
  	}
  
  	seq = (Form_pg_sequence) GETSTRUCT(tuple);
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***
*** 440,446  ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  			smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
  
  			/* check for garbage data */
! 			if 

Re: [HACKERS] Page Checksums + Double Writes

2011-12-24 Thread Simon Riggs
On Thu, Dec 22, 2011 at 9:58 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Thu, Dec 22, 2011 at 9:50 AM, Kevin Grittner
 kevin.gritt...@wicourts.gov wrote:

 Simon, does it sound like I understand your proposal?

 Yes, thanks for restating.

I've implemented that proposal, posting patch on a separate thread.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] patch: bytea_agg

2011-12-24 Thread Merlin Moncure
On Fri, Dec 23, 2011 at 12:30 PM, Robert Haas robertmh...@gmail.com wrote:
 Well, because it doesn't operate on strings.

 I argued when we added string_agg that it ought to be called
 concat_agg, or something like that, but I got shouted down.  So now
 here we are.

+1.  Using the input type names to name the function is a mistake and
should be stopped... enough.  It's verbose and unnecessary; everything
else in SQL is heavily overloaded (we don't have int_max() and
float_max()).

merlin



Re: [HACKERS] 16-bit page checksums for 9.2

2011-12-24 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 After the various recent discussions on list, I present what I believe
 to be a working patch implementing 16-bit checksums on all buffer
 pages.

I think locking around hint-bit-setting is likely to be unworkable from
a performance standpoint.  I also wonder whether it might not result in
deadlocks.

Also, as far as I can see this patch usurps the page version field,
which I find unacceptably short-sighted.  Do you really think this is
the last page layout change we'll ever make?

regards, tom lane



Re: [HACKERS] 16-bit page checksums for 9.2

2011-12-24 Thread Andres Freund
On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  After the various recent discussions on list, I present what I believe
  to be a working patch implementing 16-bit checksums on all buffer
  pages.
 
 I think locking around hint-bit-setting is likely to be unworkable from
 a performance standpoint.  I also wonder whether it might not result in
 deadlocks.
Why don't you use the same tricks as the former patch and copy the buffer, 
compute the checksum on that, and then write out that copy (you can even do 
both at the same time). I have a hard time believing that the additional copy 
is more expensive than the locking.
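
As a rough illustration of that suggestion, a sketch follows; the write
routine, the checksum routine (the Fletcher-16 sketch from the earlier
mail) and the checksum's offset within the page are all stand-ins, not
the patch's actual API:

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE		8192
#define CHECKSUM_OFFSET	8		/* assumed location of the 16-bit field */

/* Hypothetical helpers: the Fletcher-16 sketch above and a stand-in
 * for the storage-layer write. */
extern uint16_t page_checksum_fletcher16(const uint8_t *data, size_t len);
extern void write_block_to_disk(uint32_t blkno, const uint8_t *image);

/*
 * Sketch of the "copy first" approach: snapshot the shared buffer,
 * checksum the private copy and write that out, so hint-bit setters
 * never need to be locked out.  A concurrent hint-bit update may be
 * missed by this particular write, but the image on disk and the
 * checksum it carries always agree with each other.
 */
static void
flush_page_copy(uint32_t blkno, const uint8_t *shared_page)
{
	uint8_t		copy[PAGE_SIZE];
	uint16_t	crc;

	memcpy(copy, shared_page, PAGE_SIZE);	/* may race with hint-bit setters */
	memset(copy + CHECKSUM_OFFSET, 0, sizeof(crc));	/* zero the field first */
	crc = page_checksum_fletcher16(copy, PAGE_SIZE);
	memcpy(copy + CHECKSUM_OFFSET, &crc, sizeof(crc));
	write_block_to_disk(blkno, copy);
}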


Andres



Re: [HACKERS] 16-bit page checksums for 9.2

2011-12-24 Thread Simon Riggs
On Sat, Dec 24, 2011 at 2:46 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 After the various recent discussions on list, I present what I believe
 to be a working patch implementing 16-bit checksums on all buffer
 pages.

 I think locking around hint-bit-setting is likely to be unworkable from
 a performance standpoint.

Anyone choosing page_checksums = on has already made a
performance-reducing decision in favour of reliability, so they
understand and accept the impact. There is no locking when the
parameter is off.

A safe alternative is to use LockBuffer, which has a much greater
performance impact.

I did think about optimistically checking after the write, but if we
crash at that point we will then see a block with an invalid checksum.
It's faster, but you may get a checksum failure if you crash; and since
one important aspect of this feature is to spot problems in exactly the
crash case, that seems unacceptable.

 I also wonder whether it might not result in
 deadlocks.

If you can see how, please say. I can't see any way that could happen myself.

 Also, as far as I can see this patch usurps the page version field,
 which I find unacceptably short-sighted.  Do you really think this is
 the last page layout change we'll ever make?

No, I don't. I hope and expect the next page layout change to
reintroduce such a field.

But since we're now agreed that upgrading is important, a page format
change isn't likely to happen until we get an online upgrade process,
so future changes are much less likely. If they do happen, we have
some spare flag bits that can be used to indicate later versions.
It's not the prettiest thing in the world, but it's a small ugliness
in return for an important feature. If there were a way to do this
without it, I would have chosen that.

pg_filedump will need more changes than usual, but the version field
isn't used anywhere else in the server code.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] 16-bit page checksums for 9.2

2011-12-24 Thread Simon Riggs
On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund and...@anarazel.de wrote:
 On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  After the various recent discussions on list, I present what I believe
  to be a working patch implementing 16-bit checksums on all buffer
  pages.

 I think locking around hint-bit-setting is likely to be unworkable from
 a performance standpoint.  I also wonder whether it might not result in
 deadlocks.

 Why don't you use the same tricks as the former patch and copy the buffer,
 compute the checksum on that, and then write out that copy (you can even do
 both at the same time). I have a hard time believing that the additional copy
 is more expensive than the locking.

We would copy every time we write, yet lock only every time we set hint bits.

If that option is favoured, I'll write another version after Christmas.

ISTM we can't write and copy at the same time because the checksum is
not a trailer field.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] 16-bit page checksums for 9.2

2011-12-24 Thread Simon Riggs
On Sat, Dec 24, 2011 at 3:51 PM, Aidan Van Dyk ai...@highrise.ca wrote:

 Not an expert here, but after reading through the patch quickly, I
 don't see anything that changes the torn-page problem though, right?

 Hint bits aren't wal-logged, and FPW isn't forced on the hint-bit-only
 dirty, right?

Checksums merely detect a problem, whereas FPWs correct a problem if
it happens, but only in crash situations.

So this does nothing to remove the need for FPWs, though checksum
detection could be used with double-write buffers as well.

Checksums work even when there is no crash, so if your disk goes bad
and corrupts data then you'll know about it as soon as it happens.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] 16-bit page checksums for 9.2

2011-12-24 Thread Andres Freund
On Saturday, December 24, 2011 05:01:02 PM Simon Riggs wrote:
 On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund and...@anarazel.de wrote:
  On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote:
  Simon Riggs si...@2ndquadrant.com writes:
   After the various recent discussions on list, I present what I believe
   to be a working patch implementing 16-bit checksums on all buffer
   pages.
  
  I think locking around hint-bit-setting is likely to be unworkable from
  a performance standpoint.  I also wonder whether it might not result in
  deadlocks.
  
  Why don't you use the same tricks as the former patch and copy the
  buffer, compute the checksum on that, and then write out that copy (you
  can even do both at the same time). I have a hard time believing that
  the additional copy is more expensive than the locking.
 
 We would copy every time we write, yet lock only every time we set hint
 bits.
Isn't setting hint bits also a rather frequent operation? At least in a well-
cached workload where most writeout happens due to checkpoints.

 If that option is favoured, I'll write another version after Christmas.
Seems less complicated (wrt deadlocking et al) to me. But I haven't read your 
patch, so I will shut up now ;)

Andres



Re: [HACKERS] reprise: pretty print viewdefs

2011-12-24 Thread Greg Stark
On Thu, Dec 22, 2011 at 5:52 PM, Andrew Dunstan and...@dunslane.net wrote:
 I've looked at that, and it was discussed a bit previously. It's more
 complex because it requires that we keep track of (or calculate) where we
 are on the line,

You might try a compromise: just spit out all the columns on one line
*unless* either the previous or next column is longer than something
like 30 characters. So if you have a long list of short columns it just
gets wrapped by your terminal, but if you have complex expressions like
CASE expressions, casts and so on, they go on a line by themselves.
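
A sketch of that heuristic in plain C, quite apart from how ruleutils.c
would actually do it (the names and the 30-character threshold are
illustrative):

#include <stdio.h>
#include <string.h>

#define LONG_COLUMN	30			/* "something like 30" from above */

/* Print a SELECT target list on one running line, but give an entry
 * its own line whenever it, or the entry before or after it, is long. */
static void
print_target_list(const char **cols, int ncols)
{
	int			i;

	for (i = 0; i < ncols; i++)
	{
		int		prev_long = (i > 0) && strlen(cols[i - 1]) > LONG_COLUMN;
		int		this_long = strlen(cols[i]) > LONG_COLUMN;
		int		next_long = (i + 1 < ncols) && strlen(cols[i + 1]) > LONG_COLUMN;

		if (i > 0)
			fputs(",", stdout);
		if (i > 0 && (prev_long || this_long || next_long))
			fputs("\n       ", stdout);	/* long neighbourhood: break the line */
		else if (i > 0)
			fputs(" ", stdout);
		fputs(cols[i], stdout);
	}
	fputs("\n", stdout);
}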

-- 
greg



Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-24 Thread Greg Stark
On Fri, Dec 16, 2011 at 3:27 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 On its own that sounds dangerous, but its not. When we need to confirm
 the prev link we already know what we expect it to be, so CRC-ing it
 is overkill. That isn't true of any other part of the WAL record, so
 the prev link is the only thing we can relax, but thats OK because we
 can CRC check everything else outside of the locked section.

 That isn't my idea, but I'm happy to put it on the table since I'm not shy.

 I'm glad it's not your idea, because it's a bad one.

I'll take the blame or credit here.

 A large part of
 the point of CRC'ing WAL records is to guard against torn-page problems
 in the WAL files, and doing things like that would give up a significant
 part of that protection, because there would no longer be any assurance
 that the body of a WAL record had anything to do with its prev_link.

Hm, I hadn't considered the possibility of a prev_link being the only
thing left over from a torn page. As Heikki pointed out, having the CRC
and the rest of the record on opposite sides of the prev_link does
seem like convincing protection, but it's a lot more fiddly and the
dependencies are harder to explain that way.

Another thought that was discussed at the same dinner was separating
the CRC into a separate record that would cover all the WAL since the
last CRC. These would only need to be emitted when there's a WAL sync,
not on every record. I think someone showed some benchmarks claiming
that a significant part of the CRC overhead was the startup and
finishing time for doing lots of small chunks. If it processed larger
blocks it might be able to make more efficient use of the memory
bandwidth. I'm not entirely convinced of that myself, but it bears
some experimentation.
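
A conceptual sketch of that idea, with the incremental CRC primitives
and the record-emitting call as hypothetical stand-ins for whatever the
real WAL code would use:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical incremental CRC primitives and a stand-in for writing
 * the separate CRC record; none of these are existing WAL APIs. */
extern void crc32_init(uint32_t *crc);
extern void crc32_update(uint32_t *crc, const void *data, size_t len);
extern void crc32_finish(uint32_t *crc);
extern void emit_crc_record(uint32_t crc);

static uint32_t wal_running_crc;	/* crc32_init() assumed at startup */

/* Fold each inserted record into one running sum rather than starting
 * and finishing a CRC for every record... */
static void
wal_insert_record(const void *rec, size_t len)
{
	crc32_update(&wal_running_crc, rec, len);
	/* ...and append rec to the WAL buffers as usual (not shown). */
}

/* Only materialise a CRC record when the WAL is synced; it covers
 * everything written since the previous CRC record. */
static void
wal_sync(void)
{
	uint32_t	crc = wal_running_crc;

	crc32_finish(&crc);
	emit_crc_record(crc);
	crc32_init(&wal_running_crc);	/* start the next stretch */
}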

-- 
greg



Re: [HACKERS] 16-bit page checksums for 9.2

2011-12-24 Thread Greg Stark
On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Checksums merely detect a problem, whereas FPWs correct a problem if
 it happens, but only in crash situations.

 So this does nothing to remove the need for FPWs, though checksum
 detection could be used for double write buffers also.

This is missing the point. If you have a torn page on a page that is
only dirty due to hint bits, then the checksum will show a spurious
failure: it will detect a problem that isn't there.

The problem is that there is no WAL record indicating the hint bit
change. And if the torn page includes the new checksum but not the new
hint bit, or vice versa, it will be a checksum mismatch.

The strategy discussed in the past was moving all the hint bits to a
common area and skipping them in the checksum. No amount of double
writing or buffering or locking will avoid this problem.
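
To make that concrete, here is a sketch of a checksum that ignores a
single consolidated hint-bit region; the offsets are invented for
illustration, since today's hint bits are scattered through the page
rather than collected in one place:

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE			8192
#define HINT_AREA_OFFSET	32		/* invented layout: one hint-bit region */
#define HINT_AREA_LEN		64

/* Fletcher-16 over the page with the hint area treated as zeroes, so a
 * torn write that changes only hint bits cannot fail the check. */
static uint16_t
checksum_skipping_hints(const uint8_t *page)
{
	uint32_t	sum1 = 0;
	uint32_t	sum2 = 0;
	size_t		i;

	for (i = 0; i < PAGE_SIZE; i++)
	{
		uint8_t		byte = page[i];

		if (i >= HINT_AREA_OFFSET && i < HINT_AREA_OFFSET + HINT_AREA_LEN)
			byte = 0;			/* hint-only changes are ignored */
		sum1 = (sum1 + byte) % 255;
		sum2 = (sum2 + sum1) % 255;
	}
	return (uint16_t) ((sum2 << 8) | sum1);
}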

-- 
greg
