Re: [HACKERS] 16-bit page checksums for 9.2

2012-08-10 Thread Jeff Davis
On Sun, 2012-02-19 at 21:49 +, Simon Riggs wrote:
 On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs si...@2ndquadrant.com wrote:
 
  v8 attached
 
 v10 attached.
 
 This patch covers all the valid concerns discussed and has been
 extensively tested.

Is there something I can do to help get this ready for the next
commitfest? I am willing to rebase it, but I was worried that might
cause confusion. I am also willing to review it after the rebase, of
course.

There are still a couple open issues, including:

* Store the checksum in the page version field or the TLI field?

* What mechanism to guarantee to the user that all pages are properly
protected by checksums (rather than just new pages)? In other words,
there are more than the two states represented by the GUC.

* What specific data is included in the checksum? I suggested that we
include the block number, and maybe the OID.

Regards,
Jeff Davis


-
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-06-12 Thread Jeff Davis
On Sun, 2012-02-19 at 21:49 +, Simon Riggs wrote:
 On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs si...@2ndquadrant.com wrote:
 
  v8 attached
 
 v10 attached.
 
 This patch covers all the valid concerns discussed and has been
 extensively tested.
 
 I expect to commit this in about 24 hours, so last call for comments
 please or say if you need more time to review.

I have made an attempt to merge this with master. This patch is not
meant to be the next version, and I have not made a real effort to be
sure this was a safe merge. I am just hoping this saves Simon some of
the tedious portions of the merge, and offers a reference point to see
where his merge differs from mine.

In the meantime, I'm looking at the original v10 patch against master as
it existed when Simon posted the patch.

I'd like to see checksums take part in the June commitfest.

Regards,
Jeff Davis


checksums-v10-merge.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-03-02 Thread Peter Geoghegan
On 23 February 2012 01:39, Robert Haas robertmh...@gmail.com wrote:
 Thanks for testing this.  The graph obscures a bit how much percentage
 change we're talking about here - could you post the raw tps numbers?

Sorry for not following up on this until now. The report is available from:

http://pagechecksum.staticloud.com/

Note that detailed latency figures for each test are not available in
this report, so only the main page is available, but that gets you the
raw tps numbers.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-03-02 Thread Simon Riggs
On Fri, Mar 2, 2012 at 2:11 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 One thing I'm not too sure about is how to extend the page format to
 handle optional features.  For example, suppose we want to add 2 bytes
 to the page header for a checksum (or 4 bytes, or any other number).
 Ideally, I'd like to not use up those 2 bytes on pages that aren't
 checksummed.  What do we do about that?

 I'm having a hard time getting excited about adding that requirement on
 top of all the others that we have for this feature.  In particular, if
 we insist on that, there is exactly zero chance of turning checksums on
 on-the-fly, because you won't be able to do it if the page is full.

 The scheme that seems to me to make the most sense for checksums,
 if we grant Simon's postulate that a 2-byte checksum is enough, is
 (1) repurpose the top N bits of pd_flags as a page version number,
for some N;
 (2) repurpose pd_pagesize_version as the checksum, with the
understanding that you can't have a checksum in version-0 pages.

I'd say N = 1, for now. N  1 in future, as needed. We can document the
intention to reserve high end bits for that purpose. We can defer the
decision about what bits are used for until they are needed.


 (Simon's current patch seems to be an overengineered version of that.

Happy to make changes. Having 3 bits instead of 1 is just for robustness,
but would accept the opinion that this is not worth having.

One suggestion from Heikki is that we don't change the bits at all;
we just have a database level setting that says whether checksums are
enabled. We use VACUUM FREEZE to enable checksumming, rewrite the blocks
and then enable at db level. Once fully enabled we check the checksums
on read. If we do that, we don't need to record that the page format
has changed and we could dispense with page-level flags entirely, but
of course that then means you can't inspect a block and know whether its
actually got a checksum on it or maybe its just a version field.

If we want to know the difference between page formats we need to use
flag bits to indicate that. Which in some ways could be called fragile.

Personally, don't mind what we do there.


 Possibly we should also ask ourselves how much we really need pd_tli
 and whether that couldn't be commandeered as the checksum field.)

pd_tli has slightly more utility than pd_pagesize_version, but we could
easily live without either.

The question is whether a 32-bit value, split into 2 pieces, is a
faster and better way than the current proposal. Splitting the value in two
doesn't sound like it would lead to good performance or sensible code,
but someone may have a reason why 32-bit checksums are important.

IMHO we have a viable mechanism for checksums, its all down to what user
interface we would prefer and related details.

I'm happy to implement whatever we decide.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-03-02 Thread Jeff Janes
On Sun, Feb 19, 2012 at 1:49 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs si...@2ndquadrant.com wrote:

 v8 attached

 v10 attached.

 This patch covers all the valid concerns discussed and has been
 extensively tested.

If I turn wal_level to anything other than minimal, this fails to pass
make installcheck with multiple instances of

ERROR:  invalid page header in block 0 of relation

I've looked around in the patch docs and this thread, and this doesn't
seem to be either expected or already known.

This occurs even when page_checksums is off.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-03-01 Thread Robert Haas
On Wed, Feb 29, 2012 at 5:52 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Wed, Feb 29, 2012 at 4:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Easier for who?  I don't care for the idea of code that has to cope with
 two page formats, or before long N page formats, because if we don't
 have some mechanism like this then we will never be able to decide that
 an old data format is safely dead.

 Huh?  You can drop support for a new page format any time you like.
 You just decree that version X+1 no longer supports it, and you can't
 pg_upgrade to it until all traces of the old page format are gone.

 And how would a DBA know that?

We'd add a column to pg_class that tracks which page version is in use
for each relation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-03-01 Thread Josh Berkus

 And how would a DBA know that?
 
 We'd add a column to pg_class that tracks which page version is in use
 for each relation.

So a relation can't have some pages in Version 9.2, and other pages in
version 9.3?  How will this work for 2TB tables?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-03-01 Thread Robert Haas
On Thu, Mar 1, 2012 at 12:42 PM, Josh Berkus j...@agliodbs.com wrote:
 And how would a DBA know that?

 We'd add a column to pg_class that tracks which page version is in use
 for each relation.

 So a relation can't have some pages in Version 9.2, and other pages in
 version 9.3?  How will this work for 2TB tables?

Not very well, but better than Tom's proposal to require upgrading the
entire cluster in a single off-line operation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-03-01 Thread Josh Berkus

 So a relation can't have some pages in Version 9.2, and other pages in
 version 9.3?  How will this work for 2TB tables?
 
 Not very well, but better than Tom's proposal to require upgrading the
 entire cluster in a single off-line operation.

Yes, but the result will be that anyone with a 2TB table will *never*
convert it to the new format.  Which means we can never deprecate that
format, because lots of people will still be using it.

I continue to assert that all of this sounds like 9.3 work to me.  I'm
really not keen on pushing through a hack which will make pushing in a
long-term solution harder.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-03-01 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 So a relation can't have some pages in Version 9.2, and other pages in
 version 9.3?  How will this work for 2TB tables?

 Not very well, but better than Tom's proposal to require upgrading the
 entire cluster in a single off-line operation.

WTF?  That was most certainly not what *I* was proposing; it's obviously
unworkable.  We need a process that can incrementally up-version a live
database and keep track of the minimum version present, at some
granularity smaller than whole database.

All of this was discussed and hashed out about two years ago, IIRC.
We just haven't made any progress towards actually implementing those
concepts.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-03-01 Thread Robert Haas
On Thu, Mar 1, 2012 at 4:08 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 So a relation can't have some pages in Version 9.2, and other pages in
 version 9.3?  How will this work for 2TB tables?

 Not very well, but better than Tom's proposal to require upgrading the
 entire cluster in a single off-line operation.

 WTF?  That was most certainly not what *I* was proposing; it's obviously
 unworkable.  We need a process that can incrementally up-version a live
 database and keep track of the minimum version present, at some
 granularity smaller than whole database.

 All of this was discussed and hashed out about two years ago, IIRC.
 We just haven't made any progress towards actually implementing those
 concepts.

I am now officially confused.  I thought you and Heikki were arguing
about 24 hours ago that we needed to shut down the old database, run a
pre-upgrade utility to convert all the pages, run pg_upgrade, and then
bring the new database on-line; and that, further, you were arguing
that we should not support multiple page versions.  Now you seem to be
arguing the exact opposite - namely, that we should support multiple
page versions, and that the conversion should be done on-line.

So, to reiterate, I'm lost.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-03-01 Thread Alvaro Herrera

Excerpts from Robert Haas's message of jue mar 01 21:23:06 -0300 2012:
 On Thu, Mar 1, 2012 at 4:08 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Robert Haas robertmh...@gmail.com writes:
  So a relation can't have some pages in Version 9.2, and other pages in
  version 9.3?  How will this work for 2TB tables?
 
  Not very well, but better than Tom's proposal to require upgrading the
  entire cluster in a single off-line operation.
 
  WTF?  That was most certainly not what *I* was proposing; it's obviously
  unworkable.  We need a process that can incrementally up-version a live
  database and keep track of the minimum version present, at some
  granularity smaller than whole database.
 
  All of this was discussed and hashed out about two years ago, IIRC.
  We just haven't made any progress towards actually implementing those
  concepts.
 
 I am now officially confused.  I thought you and Heikki were arguing
 about 24 hours ago that we needed to shut down the old database, run a
 pre-upgrade utility to convert all the pages, run pg_upgrade, and then
 bring the new database on-line;

Actually, nobody mentioned shutting the database down.  This utility
does not necessarily have to be something that runs independently from a
running postmaster; in fact what I envisioned was that we would have it
be run in a backend -- I've always imagined it's just a function to
which you give a table OID, and it converts pages of that table into the
new format.  Maybe a user-friendly utility would connect to each
database and call this function for each table.

 and that, further, you were arguing that we should not support
 multiple page versions.

I don't think we need to support multiple page versions, if multiple
means  2.  Obviously we need the server to support reading *two* page
versions; though we can probably get away with having support for
*writing* only the newest page version.

(The pg_class column would mean the oldest page version to be found in
this table, so the table can actually contain a mixture of old pages
and new pages.)

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-03-01 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Robert Haas's message of jue mar 01 21:23:06 -0300 2012:
 and that, further, you were arguing that we should not support
 multiple page versions.

 I don't think we need to support multiple page versions, if multiple
 means  2.

That's exactly the point here.  We clearly cannot support on-line
upgrade unless, somewhere along the line, we are willing to cope with
two page formats more or less concurrently.  What I don't want is for
that requirement to balloon to supporting N formats forever.  If we
do not have a mechanism that allows certifying that you have no
remaining pages of format N-1 before you upgrade to a server that
supports (only) versions N and N+1, then we're going to be in the
business of indefinite backwards compatibility instead.

I'm not entirely sure, but I think we may all be in violent agreement
about where this needs to end up.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-03-01 Thread Robert Haas
On Thu, Mar 1, 2012 at 8:32 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Robert Haas's message of jue mar 01 21:23:06 -0300 2012:
 and that, further, you were arguing that we should not support
 multiple page versions.

 I don't think we need to support multiple page versions, if multiple
 means  2.

 That's exactly the point here.  We clearly cannot support on-line
 upgrade unless, somewhere along the line, we are willing to cope with
 two page formats more or less concurrently.  What I don't want is for
 that requirement to balloon to supporting N formats forever.  If we
 do not have a mechanism that allows certifying that you have no
 remaining pages of format N-1 before you upgrade to a server that
 supports (only) versions N and N+1, then we're going to be in the
 business of indefinite backwards compatibility instead.

 I'm not entirely sure, but I think we may all be in violent agreement
 about where this needs to end up.

I was using multiple to mean N1, so yeah, it's sounding like we're
more or less on the same page here.

One thing I'm not too sure about is how to extend the page format to
handle optional features.  For example, suppose we want to add 2 bytes
to the page header for a checksum (or 4 bytes, or any other number).
Ideally, I'd like to not use up those 2 bytes on pages that aren't
checksummed.  What do we do about that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-03-01 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 One thing I'm not too sure about is how to extend the page format to
 handle optional features.  For example, suppose we want to add 2 bytes
 to the page header for a checksum (or 4 bytes, or any other number).
 Ideally, I'd like to not use up those 2 bytes on pages that aren't
 checksummed.  What do we do about that?

I'm having a hard time getting excited about adding that requirement on
top of all the others that we have for this feature.  In particular, if
we insist on that, there is exactly zero chance of turning checksums on
on-the-fly, because you won't be able to do it if the page is full.

The scheme that seems to me to make the most sense for checksums,
if we grant Simon's postulate that a 2-byte checksum is enough, is
(1) repurpose the top N bits of pd_flags as a page version number,
for some N;
(2) repurpose pd_pagesize_version as the checksum, with the
understanding that you can't have a checksum in version-0 pages.

(Simon's current patch seems to be an overengineered version of that.
Possibly we should also ask ourselves how much we really need pd_tli
and whether that couldn't be commandeered as the checksum field.)

I see the page versioning stuff as mainly being of interest for changes
that are more invasive than this, for instance tuple-header or data
changes.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-03-01 Thread Robert Haas
On Thu, Mar 1, 2012 at 9:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 One thing I'm not too sure about is how to extend the page format to
 handle optional features.  For example, suppose we want to add 2 bytes
 to the page header for a checksum (or 4 bytes, or any other number).
 Ideally, I'd like to not use up those 2 bytes on pages that aren't
 checksummed.  What do we do about that?

 I'm having a hard time getting excited about adding that requirement on
 top of all the others that we have for this feature.  In particular, if
 we insist on that, there is exactly zero chance of turning checksums on
 on-the-fly, because you won't be able to do it if the page is full.

 The scheme that seems to me to make the most sense for checksums,
 if we grant Simon's postulate that a 2-byte checksum is enough, is
 (1) repurpose the top N bits of pd_flags as a page version number,
    for some N;
 (2) repurpose pd_pagesize_version as the checksum, with the
    understanding that you can't have a checksum in version-0 pages.

 (Simon's current patch seems to be an overengineered version of that.
 Possibly we should also ask ourselves how much we really need pd_tli
 and whether that couldn't be commandeered as the checksum field.)

 I see the page versioning stuff as mainly being of interest for changes
 that are more invasive than this, for instance tuple-header or data
 changes.

Well, OK, so...  it wouldn't bother me a bit to steal pd_tli for this,
although Simon previously objected to steeling even half of it, when I
suggested that upthread.

But I don't see the point of your first proposal: keeping the page
version right where it is is a good idea, and we should do it.  We
could steel some *high* order bits from that field without breaking
anything, but moving it around seems like it will complicate life to
no good purpose.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Heikki Linnakangas

On 22.02.2012 14:30, Simon Riggs wrote:

On Wed, Feb 22, 2012 at 7:06 AM, Noah Mischn...@leadboat.com  wrote:

On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:

Another disadvantage of the current scheme is that there's no
particularly easy way to know that your whole cluster has checksums.
No matter how we implement checksums, you'll have to rewrite every
table in the cluster in order to get them fully turned on.  But with
the current design, there's no easy way to know how much of the
cluster is actually checksummed.  If you shut checksums off, they'll
linger until those pages are rewritten, and there's no easy way to
find the relations from which they need to be removed, either.


I'm not seeing value in rewriting pages to remove checksums, as opposed to
just ignoring those checksums going forward.  Did you have a particular
scenario in mind?


Agreed. No reason to change a checksum unless we rewrite the block, no
matter whether page_checksums is on or off.


This can happen:

1. checksums are initially enabled. A page is written, with a correct 
checksum.

2. checksums turned off.
3. A hint bit is set on the page.
4. While the page is being written out, someone pulls the power cord, 
and you get a torn write. The hint bit change made it to disk, but the 
clearing of the checksum in the page header did not.

5. Sometime after restart, checksums are turned back on.

The page now has an incorrect checksum on it. The next time it's read, 
you get a checksum error.


I'm pretty uncomfortable with this idea of having a flag on the page 
itself to indicate whether it has a checksum or not. No matter how many 
bits we use for that flag. You can never be quite sure that all your 
data is covered by the checksum, and there's a lot of room for subtle 
bugs like the above, where a page is reported as corrupt when it isn't, 
or vice versa.


This thing needs to be reliable and robust. The purpose of a checksum is 
to have an extra sanity check, to detect faulty hardware. If it's 
complicated, whenever you get a checksum mismatch, you'll be wondering 
if you have broken hardware or if you just bumped on a PostgreSQL bug. I 
think you need a flag in pg_control or somewhere to indicate whether 
checksums are currently enabled or disabled, and a mechanism to scan and 
rewrite all the pages with checksums, before they are verified.


I've said this before, but I still don't like the hacks with the version 
number in the page header. Even if it works, I would much prefer the 
straightforward option of extending the page header for the new field. 
Yes, it means you have to deal with pg_upgrade, but it's a hurdle we'll 
have to jump at some point anyway.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Simon Riggs
On Wed, Feb 29, 2012 at 2:40 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 22.02.2012 14:30, Simon Riggs wrote:

 On Wed, Feb 22, 2012 at 7:06 AM, Noah Mischn...@leadboat.com  wrote:

 On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:

 Another disadvantage of the current scheme is that there's no
 particularly easy way to know that your whole cluster has checksums.
 No matter how we implement checksums, you'll have to rewrite every
 table in the cluster in order to get them fully turned on.  But with
 the current design, there's no easy way to know how much of the
 cluster is actually checksummed.  If you shut checksums off, they'll
 linger until those pages are rewritten, and there's no easy way to
 find the relations from which they need to be removed, either.


 I'm not seeing value in rewriting pages to remove checksums, as opposed
 to
 just ignoring those checksums going forward.  Did you have a particular
 scenario in mind?


 Agreed. No reason to change a checksum unless we rewrite the block, no
 matter whether page_checksums is on or off.


 This can happen:

 1. checksums are initially enabled. A page is written, with a correct
 checksum.
 2. checksums turned off.
 3. A hint bit is set on the page.
 4. While the page is being written out, someone pulls the power cord, and
 you get a torn write. The hint bit change made it to disk, but the clearing
 of the checksum in the page header did not.
 5. Sometime after restart, checksums are turned back on.

 The page now has an incorrect checksum on it. The next time it's read, you
 get a checksum error.

Yes, you will. And you'll get a checksum error because the block no
longer passes. So an error should be reported.

We can and should document that turning this on/off/on can cause
problems. Hopefully crashing isn't that common a situation.

The production default would be off. The default in the patch is
on only for testing.

 I'm pretty uncomfortable with this idea of having a flag on the page itself
 to indicate whether it has a checksum or not. No matter how many bits we use
 for that flag. You can never be quite sure that all your data is covered by
 the checksum, and there's a lot of room for subtle bugs like the above,
 where a page is reported as corrupt when it isn't, or vice versa.

That is necessary to allow upgrade. It's not their for any other reason.

 This thing needs to be reliable and robust. The purpose of a checksum is to
 have an extra sanity check, to detect faulty hardware. If it's complicated,
 whenever you get a checksum mismatch, you'll be wondering if you have broken
 hardware or if you just bumped on a PostgreSQL bug. I think you need a flag
 in pg_control or somewhere to indicate whether checksums are currently
 enabled or disabled, and a mechanism to scan and rewrite all the pages with
 checksums, before they are verified.

That would require massive downtime, so again, it has been ruled out
for practicality.

 I've said this before, but I still don't like the hacks with the version
 number in the page header. Even if it works, I would much prefer the
 straightforward option of extending the page header for the new field. Yes,
 it means you have to deal with pg_upgrade, but it's a hurdle we'll have to
 jump at some point anyway.

What you suggest might happen in the next release, or maybe longer.
There may be things that block it completely, so it might never
happen. My personal opinion is that it is not possible to make further
block format changes until we have a fully online upgrade process,
otherwise we block people from upgrading - not everybody can take
their site down to run pg_upgrade. I plan to work on that, but it may
not happen for 9.3; perhaps you will object to that also when it
comes.

So we simply cannot rely on this jam tomorrow vision.

This patch is very specifically something that makes the best of the
situation, now, for those that want and need it. If you don't want it,
you don't have to use it. But that shouldn't stop us giving it to the
people that do want it.

I'm hearing general interest and support for this feature from people
that run their business on PostgreSQL.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Heikki Linnakangas

On 29.02.2012 17:01, Simon Riggs wrote:

On Wed, Feb 29, 2012 at 2:40 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

On 22.02.2012 14:30, Simon Riggs wrote:


Agreed. No reason to change a checksum unless we rewrite the block, no
matter whether page_checksums is on or off.


This can happen:

1. checksums are initially enabled. A page is written, with a correct
checksum.
2. checksums turned off.
3. A hint bit is set on the page.
4. While the page is being written out, someone pulls the power cord, and
you get a torn write. The hint bit change made it to disk, but the clearing
of the checksum in the page header did not.
5. Sometime after restart, checksums are turned back on.

The page now has an incorrect checksum on it. The next time it's read, you
get a checksum error.


Yes, you will. And you'll get a checksum error because the block no
longer passes. So an error should be reported.

We can and should document that turning this on/off/on can cause
problems. Hopefully crashing isn't that common a situation.


Hopefully not, but then again, you get interested in fiddling with this 
setting, when you do experience crashes. This feature needs to be 
trustworthy in the face of crashes.



I'm pretty uncomfortable with this idea of having a flag on the page itself
to indicate whether it has a checksum or not. No matter how many bits we use
for that flag. You can never be quite sure that all your data is covered by
the checksum, and there's a lot of room for subtle bugs like the above,
where a page is reported as corrupt when it isn't, or vice versa.


That is necessary to allow upgrade. It's not their for any other reason.


Understood. I'm uncomfortable with it regardless of why it's there.


This thing needs to be reliable and robust. The purpose of a checksum is to
have an extra sanity check, to detect faulty hardware. If it's complicated,
whenever you get a checksum mismatch, you'll be wondering if you have broken
hardware or if you just bumped on a PostgreSQL bug. I think you need a flag
in pg_control or somewhere to indicate whether checksums are currently
enabled or disabled, and a mechanism to scan and rewrite all the pages with
checksums, before they are verified.


That would require massive downtime, so again, it has been ruled out
for practicality.


Surely it can be done online. You'll just need a third state between off 
and on, where checksums are written but not verified, while the cluster 
is scanned.



I've said this before, but I still don't like the hacks with the version
number in the page header. Even if it works, I would much prefer the
straightforward option of extending the page header for the new field. Yes,
it means you have to deal with pg_upgrade, but it's a hurdle we'll have to
jump at some point anyway.


What you suggest might happen in the next release, or maybe longer.
There may be things that block it completely, so it might never
happen. My personal opinion is that it is not possible to make further
block format changes until we have a fully online upgrade process,
otherwise we block people from upgrading - not everybody can take
their site down to run pg_upgrade. I plan to work on that, but it may
not happen for 9.3;


Yep, we should bite the bullet and work on that.


perhaps you will object to that also when it comes.


Heh, quite possible :-). But only if there's a reason. I do want to see 
a solution to this, I have a few page-format changing ideas myself that 
I'd like to implement at some point.



This patch is very specifically something that makes the best of the
situation, now, for those that want and need it. If you don't want it,
you don't have to use it.


Whether I use it or not, I'll have to live with it in the source tree.


But that shouldn't stop us giving it to the people that do want it.

I'm hearing general interest and support for this feature from people
that run their business on PostgreSQL.


If you ask someone would you like to have checksums in PostgreSQL?, 
he'll say sure. If you ask him would you like to keep the PostgreSQL 
source tree as simple as possible, to make it easy for new developers to 
join the effort?, he'll say yes. It's all about how you frame the 
question. Even if you want to have checksums on data pages, it doesn't 
necessarily mean you want them so badly you can't wait another release 
or two for a cleaner solution, or that you'd be satisfied with the 
implementation proposed here.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Simon Riggs
On Wed, Feb 29, 2012 at 3:30 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 Surely it can be done online. You'll just need a third state between off and
 on, where checksums are written but not verified, while the cluster is
 scanned.

Are you saying you would accept the patch if we had this?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Heikki Linnakangas

On 29.02.2012 17:42, Simon Riggs wrote:

On Wed, Feb 29, 2012 at 3:30 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:


Surely it can be done online. You'll just need a third state between off and
on, where checksums are written but not verified, while the cluster is
scanned.


Are you saying you would accept the patch if we had this?


I think I would still be uncomfortable with the hacks in the page 
header. Less so than in the current form - you wouldn't need a flag to 
indicate whether the page has a valid checksum or not, which would clean 
it up quite a bit - but still.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Simon Riggs
On Wed, Feb 29, 2012 at 3:46 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 Are you saying you would accept the patch if we had this?

 I think I would still be uncomfortable with the hacks in the page header.

There are no hacks. There are some carefully designed changes with
input from multiple people, including yourself, and it copes as
gracefully as it can with backwards compatibility requirements.

 Less so than in the current form - you wouldn't need a flag to indicate
 whether the page has a valid checksum or not, which would clean it up quite
 a bit - but still.

I agree this would remove the reliance on a bit in the page header and
so make it even more robust.

I'll add the 2 phase enabling feature, making it happen at database level.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Robert Haas
On Wed, Feb 29, 2012 at 11:24 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, Feb 29, 2012 at 3:46 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:

 Are you saying you would accept the patch if we had this?

 I think I would still be uncomfortable with the hacks in the page header.

 There are no hacks. There are some carefully designed changes with
 input from multiple people, including yourself, and it copes as
 gracefully as it can with backwards compatibility requirements.

You have comments from three different people, all experienced
hackers, disagreeing with this position; Heikki and I have both
proposed alternate approaches.  I'm not sure that we're at a point
where we can say that we know what the best solution is, but I think
it is clear that there's enough concern about this that you ought not
to be denying that there is a problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Robert Haas
On Wed, Feb 29, 2012 at 12:26 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, Feb 29, 2012 at 4:41 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Feb 29, 2012 at 11:24 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, Feb 29, 2012 at 3:46 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:

 Are you saying you would accept the patch if we had this?

 I think I would still be uncomfortable with the hacks in the page header.

 There are no hacks. There are some carefully designed changes with
 input from multiple people, including yourself, and it copes as
 gracefully as it can with backwards compatibility requirements.

 You have comments from three different people, all experienced
 hackers, disagreeing with this position;

 Who is the third person you speak of? Perhaps they will speak again if
 they wish to be heard.

Tom Lane.  It was the very first email posted in response to the very
first version of this patch you ever posted.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Simon Riggs
On Wed, Feb 29, 2012 at 5:44 PM, Robert Haas robertmh...@gmail.com wrote:

 You have comments from three different people, all experienced
 hackers, disagreeing with this position;

 Who is the third person you speak of? Perhaps they will speak again if
 they wish to be heard.

 Tom Lane.  It was the very first email posted in response to the very
 first version of this patch you ever posted.

Tom objected to not being able to tell which version a data block was.
At length, we have discussed this on list and there is no issue. It is
clear what page format a block has.

Please ping him if you believe he has other rational objections to
committing something and he isn't listening.

I'm beginning to lose faith that objections are being raised at a
rational level. It's not a panel game with points for clever answers,
its an engineering debate about how to add features real users want.
And they do want, so let me solve the problems by agreeing something
early enough to allow it to be implemented, rather than just
discussing it until we run out of time.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Simon Riggs
On Wed, Feb 29, 2012 at 4:41 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Feb 29, 2012 at 11:24 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, Feb 29, 2012 at 3:46 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:

 Are you saying you would accept the patch if we had this?

 I think I would still be uncomfortable with the hacks in the page header.

 There are no hacks. There are some carefully designed changes with
 input from multiple people, including yourself, and it copes as
 gracefully as it can with backwards compatibility requirements.

 You have comments from three different people, all experienced
 hackers, disagreeing with this position;

Who is the third person you speak of? Perhaps they will speak again if
they wish to be heard.

 Heikki and I have both
 proposed alternate approaches.

No, you've proposed rejecting the patch in favour of some future
dream. We all agree on the future dream but it clearly happens in the
future and that could easily be more than 1 release ahead.

If you have comments that would allow a patch in this release, I am
happy to hear them. I hear only two people seeking to reject a patch
that adds value for users. Quite frankly, the comments about flag bits
are ridiculous and bogus.


 I'm not sure that we're at a point
 where we can say that we know what the best solution is, but I think
 it is clear that there's enough concern about this that you ought not
 to be denying that there is a problem.

There are some weird cases that can cause problems. My wish to is to
resolve those so everyone is happy. If those weird cases are simply
used as an excuse to reject, then I don't accept that and nor will our
users. Of course, if there was a significant issue, it would be
immediate rejection but there isn't.

I'm trying to commit a useful patch in this release. If you'd like to
work with me to find a solution acceptable to all, I'd be happy to
include suggestions, just as I have already included comments from
Heikki, Bruce, Noah and others. I do accept that Heikki now says that
the code I added at his request is just a hack.

Assuming I'm going to commit something in this release, what should it
look like?

At Heikki's request/idea I will be working on a 2-phase database level
parameter that will give us checksumming on the whole database after a
scan to enable it. That sounds like it will resolve the corner cases
relatively cleanly.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Heikki Linnakangas

On 29.02.2012 19:54, Simon Riggs wrote:

I'm beginning to lose faith that objections are being raised at a
rational level. It's not a panel game with points for clever answers,
its an engineering debate about how to add features real users want.
And they do want, so let me solve the problems by agreeing something
early enough to allow it to be implemented, rather than just
discussing it until we run out of time.


I thought my view on how this should be done was already clear, but just 
in case it isn't, let me restate: Enlarge the page header to make room 
for the checksum. To handle upgrades, put code in the backend to change 
the page format from old version to new one on-the-fly, as pages are 
read in. Because we're making the header larger, we need to ensure that 
there's room on every page. To do that, write a utility that you run on 
the cluster before running pg_upgrade, which moves tuples to ensure 
that. To ensure that the space doesn't get used again before upgrading, 
change the old version so that it reserves those N bytes in all new 
insertions and updates (I believe that approach has been discussed 
before and everyone is comfortable with backpatching such a change). All 
of this in 9.3.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Alvaro Herrera

Excerpts from Heikki Linnakangas's message of mié feb 29 16:09:02 -0300 2012:
 On 29.02.2012 19:54, Simon Riggs wrote:
  I'm beginning to lose faith that objections are being raised at a
  rational level. It's not a panel game with points for clever answers,
  its an engineering debate about how to add features real users want.
  And they do want, so let me solve the problems by agreeing something
  early enough to allow it to be implemented, rather than just
  discussing it until we run out of time.
 
 I thought my view on how this should be done was already clear, but just 
 in case it isn't, let me restate: Enlarge the page header to make room 
 for the checksum. To handle upgrades, put code in the backend to change 
 the page format from old version to new one on-the-fly, as pages are 
 read in. Because we're making the header larger, we need to ensure that 
 there's room on every page. To do that, write a utility that you run on 
 the cluster before running pg_upgrade, which moves tuples to ensure 
 that. To ensure that the space doesn't get used again before upgrading, 
 change the old version so that it reserves those N bytes in all new 
 insertions and updates (I believe that approach has been discussed 
 before and everyone is comfortable with backpatching such a change). All 
 of this in 9.3.

Note that if we want such an utility to walk and transform pages, we
probably need a marker in the catalogs somewhere so that pg_upgrade can
make sure that it was done in all candidate tables -- which is something
that we should get in 9.2 so that it can be used in 9.3.  Such a marker
would also allow us get rid of HEAP_MOVED_IN and HEAP_MOVED_OUT.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Robert Haas
On Wed, Feb 29, 2012 at 2:09 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 29.02.2012 19:54, Simon Riggs wrote:
 I'm beginning to lose faith that objections are being raised at a
 rational level. It's not a panel game with points for clever answers,
 its an engineering debate about how to add features real users want.
 And they do want, so let me solve the problems by agreeing something
 early enough to allow it to be implemented, rather than just
 discussing it until we run out of time.

 I thought my view on how this should be done was already clear, but just in
 case it isn't, let me restate: Enlarge the page header to make room for the
 checksum. To handle upgrades, put code in the backend to change the page
 format from old version to new one on-the-fly, as pages are read in. Because
 we're making the header larger, we need to ensure that there's room on every
 page. To do that, write a utility that you run on the cluster before running
 pg_upgrade, which moves tuples to ensure that. To ensure that the space
 doesn't get used again before upgrading, change the old version so that it
 reserves those N bytes in all new insertions and updates (I believe that
 approach has been discussed before and everyone is comfortable with
 backpatching such a change). All of this in 9.3.

This could all be done on-line if we had a per-table flag indicating
which page version is in use.  You can use some variant of CLUSTER or
VACUUM or ALTER TABLE to rewrite the table using the page version of
your choice.  This is also more granular, in that it allows checksums
(or other features) to be turned on and off on a per-table basis.
When you read the table, the same flag tells you which page version to
expect, and you can error out if you haven't got that (or if the CRCs
don't match).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Heikki Linnakangas

On 29.02.2012 21:18, Alvaro Herrera wrote:

Excerpts from Heikki Linnakangas's message of mié feb 29 16:09:02 -0300 2012:

I thought my view on how this should be done was already clear, but just
in case it isn't, let me restate: Enlarge the page header to make room
for the checksum. To handle upgrades, put code in the backend to change
the page format from old version to new one on-the-fly, as pages are
read in. Because we're making the header larger, we need to ensure that
there's room on every page. To do that, write a utility that you run on
the cluster before running pg_upgrade, which moves tuples to ensure
that. To ensure that the space doesn't get used again before upgrading,
change the old version so that it reserves those N bytes in all new
insertions and updates (I believe that approach has been discussed
before and everyone is comfortable with backpatching such a change). All
of this in 9.3.


Note that if we want such an utility to walk and transform pages, we
probably need a marker in the catalogs somewhere so that pg_upgrade can
make sure that it was done in all candidate tables -- which is something
that we should get in 9.2 so that it can be used in 9.3.


In the simplest form, the utility could just create a magic file in the 
data directory to indicate that it has run. All we need is a boolean 
flag, unless you want to be fancy and make the utility restartable. 
Implemented that way, there's no need to have anything in the catalogs.



Such a marker would also allow us get rid of HEAP_MOVED_IN and
HEAP_MOVED_OUT.


True.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Robert Haas
On Wed, Feb 29, 2012 at 2:18 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Note that if we want such an utility to walk and transform pages, we
 probably need a marker in the catalogs somewhere so that pg_upgrade can
 make sure that it was done in all candidate tables -- which is something
 that we should get in 9.2 so that it can be used in 9.3.  Such a marker
 would also allow us get rid of HEAP_MOVED_IN and HEAP_MOVED_OUT.

Getting rid of HEAP_MOVED_IN and HEAP_MOVED_OUT would be really nice,
but I don't see why we need to squeeze anything into 9.2 for any of
this.  pg_upgrade can certainly handle the addition of a new pg_class
column, and can arrange for in-place upgrades from pre-9.3 versions to
9.3 to set the flag to the appropriate value.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Heikki Linnakangas

On 29.02.2012 21:30, Robert Haas wrote:

On Wed, Feb 29, 2012 at 2:18 PM, Alvaro Herrera
alvhe...@commandprompt.com  wrote:

Note that if we want such an utility to walk and transform pages, we
probably need a marker in the catalogs somewhere so that pg_upgrade can
make sure that it was done in all candidate tables -- which is something
that we should get in 9.2 so that it can be used in 9.3.  Such a marker
would also allow us get rid of HEAP_MOVED_IN and HEAP_MOVED_OUT.


Getting rid of HEAP_MOVED_IN and HEAP_MOVED_OUT would be really nice,
but I don't see why we need to squeeze anything into 9.2 for any of
this.  pg_upgrade can certainly handle the addition of a new pg_class
column, and can arrange for in-place upgrades from pre-9.3 versions to
9.3 to set the flag to the appropriate value.


The utility would run in the old cluster before upgrading, so the the 
flag would have to be present in the old version. pg_upgrade would check 
that the flag is set, refusing to upgrade if it isn't, with an error 
like please run pre-upgrade utility first.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Alvaro Herrera

Excerpts from Heikki Linnakangas's message of mié feb 29 16:29:26 -0300 2012:
 On 29.02.2012 21:18, Alvaro Herrera wrote:
  Excerpts from Heikki Linnakangas's message of mié feb 29 16:09:02 -0300 
  2012:
  I thought my view on how this should be done was already clear, but just
  in case it isn't, let me restate: Enlarge the page header to make room
  for the checksum. To handle upgrades, put code in the backend to change
  the page format from old version to new one on-the-fly, as pages are
  read in. Because we're making the header larger, we need to ensure that
  there's room on every page. To do that, write a utility that you run on
  the cluster before running pg_upgrade, which moves tuples to ensure
  that. To ensure that the space doesn't get used again before upgrading,
  change the old version so that it reserves those N bytes in all new
  insertions and updates (I believe that approach has been discussed
  before and everyone is comfortable with backpatching such a change). All
  of this in 9.3.
 
  Note that if we want such an utility to walk and transform pages, we
  probably need a marker in the catalogs somewhere so that pg_upgrade can
  make sure that it was done in all candidate tables -- which is something
  that we should get in 9.2 so that it can be used in 9.3.
 
 In the simplest form, the utility could just create a magic file in the 
 data directory to indicate that it has run. All we need is a boolean 
 flag, unless you want to be fancy and make the utility restartable. 
 Implemented that way, there's no need to have anything in the catalogs.

Well, I find it likely that people with huge and/or high velocity
databases would want to get fancy about it, so that they can schedule it
as they see fit.

What I wouldn't like is an utility that is optional, so that we end up
with situations after upgrade that do have the new page format, others
that don't.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Robert Haas
On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 29.02.2012 21:30, Robert Haas wrote:

 On Wed, Feb 29, 2012 at 2:18 PM, Alvaro Herrera
 alvhe...@commandprompt.com  wrote:

 Note that if we want such an utility to walk and transform pages, we
 probably need a marker in the catalogs somewhere so that pg_upgrade can
 make sure that it was done in all candidate tables -- which is something
 that we should get in 9.2 so that it can be used in 9.3.  Such a marker
 would also allow us get rid of HEAP_MOVED_IN and HEAP_MOVED_OUT.


 Getting rid of HEAP_MOVED_IN and HEAP_MOVED_OUT would be really nice,
 but I don't see why we need to squeeze anything into 9.2 for any of
 this.  pg_upgrade can certainly handle the addition of a new pg_class
 column, and can arrange for in-place upgrades from pre-9.3 versions to
 9.3 to set the flag to the appropriate value.

 The utility would run in the old cluster before upgrading, so the the flag
 would have to be present in the old version. pg_upgrade would check that the
 flag is set, refusing to upgrade if it isn't, with an error like please run
 pre-upgrade utility first.

I find that a pretty unappealing design; it seems to me it'd be much
easier to make the new cluster cope with everything.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 The utility would run in the old cluster before upgrading, so the the flag
 would have to be present in the old version. pg_upgrade would check that the
 flag is set, refusing to upgrade if it isn't, with an error like please run
 pre-upgrade utility first.

 I find that a pretty unappealing design; it seems to me it'd be much
 easier to make the new cluster cope with everything.

Easier for who?  I don't care for the idea of code that has to cope with
two page formats, or before long N page formats, because if we don't
have some mechanism like this then we will never be able to decide that
an old data format is safely dead.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Alvaro Herrera

Excerpts from Tom Lane's message of mié feb 29 18:34:27 -0300 2012:
 
 Robert Haas robertmh...@gmail.com writes:
  On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas
  heikki.linnakan...@enterprisedb.com wrote:
  The utility would run in the old cluster before upgrading, so the the flag
  would have to be present in the old version. pg_upgrade would check that 
  the
  flag is set, refusing to upgrade if it isn't, with an error like please 
  run
  pre-upgrade utility first.
 
  I find that a pretty unappealing design; it seems to me it'd be much
  easier to make the new cluster cope with everything.
 
 Easier for who?  I don't care for the idea of code that has to cope with
 two page formats, or before long N page formats, because if we don't
 have some mechanism like this then we will never be able to decide that
 an old data format is safely dead.

.. in fact this is precisely what killed Zdenek Kotala's idea of
upgrading.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Robert Haas
On Wed, Feb 29, 2012 at 4:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 The utility would run in the old cluster before upgrading, so the the flag
 would have to be present in the old version. pg_upgrade would check that the
 flag is set, refusing to upgrade if it isn't, with an error like please run
 pre-upgrade utility first.

 I find that a pretty unappealing design; it seems to me it'd be much
 easier to make the new cluster cope with everything.

 Easier for who?  I don't care for the idea of code that has to cope with
 two page formats, or before long N page formats, because if we don't
 have some mechanism like this then we will never be able to decide that
 an old data format is safely dead.

Huh?  You can drop support for a new page format any time you like.
You just decree that version X+1 no longer supports it, and you can't
pg_upgrade to it until all traces of the old page format are gone.

If you're going to require an offline rewrite of the whole old cluster
before doing the upgrade, how much better is it than just breaking the
page format and telling pg_upgrade users they're out of luck?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Feb 29, 2012 at 4:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Easier for who?  I don't care for the idea of code that has to cope with
 two page formats, or before long N page formats, because if we don't
 have some mechanism like this then we will never be able to decide that
 an old data format is safely dead.

 Huh?  You can drop support for a new page format any time you like.
 You just decree that version X+1 no longer supports it, and you can't
 pg_upgrade to it until all traces of the old page format are gone.

And how would a DBA know that?

 If you're going to require an offline rewrite of the whole old cluster
 before doing the upgrade, how much better is it than just breaking the
 page format and telling pg_upgrade users they're out of luck?

The difference is that people aren't playing russian roulette with their
data when they upgrade.  The point of the mechanisms being discussed
here is to know, for certain, that a large database no longer contains
pages of the old format.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Jim Nasby

On 2/29/12 3:53 PM, Alvaro Herrera wrote:


Excerpts from Tom Lane's message of mié feb 29 18:34:27 -0300 2012:


Robert Haasrobertmh...@gmail.com  writes:

On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

The utility would run in the old cluster before upgrading, so the the flag
would have to be present in the old version. pg_upgrade would check that the
flag is set, refusing to upgrade if it isn't, with an error like please run
pre-upgrade utility first.



I find that a pretty unappealing design; it seems to me it'd be much
easier to make the new cluster cope with everything.


Easier for who?  I don't care for the idea of code that has to cope with
two page formats, or before long N page formats, because if we don't
have some mechanism like this then we will never be able to decide that
an old data format is safely dead.


.. in fact this is precisely what killed Zdenek Kotala's idea of
upgrading.


This is also why Simon has avoided the whole upgrade thing with his 16 bit 
checksum idea (otherwise presumably we'd be looking at bigger checksums anyway).

I get that fussing around with the version field is ugly. If there was another 
way to do this without breaking pg_upgrade then it would be silly to mess with 
the version field. Unfortunately, there is no other way.

Page checksuming is something a lot of people (myself included) want to see. 
Being able to get it in 9.2 would be a big win over crossing our fingers that 
at some point in the future (who knows when) we'll maybe figure out the page 
format upgrade issue. While we should definitely be looking into that issue 
it's definitely not going to get fixed in 9.2. ISTM that checksums are actually 
ready to go if people can just swallow the bitter pill of a screwed-up page 
version field until we can actually get an upgrade utility in place (and until 
we get that utility in place I don't see that the version field would do us any 
good anyway...)
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-29 Thread Tom Lane
Jim Nasby j...@nasby.net writes:
 On 2/29/12 3:53 PM, Alvaro Herrera wrote:
 .. in fact this is precisely what killed Zdenek Kotala's idea of
 upgrading.

 This is also why Simon has avoided the whole upgrade thing with his 16 bit 
 checksum idea (otherwise presumably we'd be looking at bigger checksums 
 anyway).

 I get that fussing around with the version field is ugly. If there was 
 another way to do this without breaking pg_upgrade then it would be silly to 
 mess with the version field. Unfortunately, there is no other way.

Fundamentally, what is going on here is that several of us think that we
need a page format upgrade infrastructure first, and then we can think
about adding checksums.  Simon would like to cram checksums in without
building such infrastructure, regardless of the consequences in code
ugliness or future maintainability.  Personally, I think that is a bad
tradeoff.  Eventually we are going to have to build that infrastructure,
and the more we've kluged the page header format in the meanwhile, the
more unpleasant it's going to be.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-22 Thread Simon Riggs
On Wed, Feb 22, 2012 at 7:06 AM, Noah Misch n...@leadboat.com wrote:
 On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
 On Sun, Feb 19, 2012 at 11:35 AM, Simon Riggs si...@2ndquadrant.com wrote:
  So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
  doesn't have a checksum, PD_CHECKSUM2 is not set? ?What good does that
  do?
 
  As explained in detailed comments, the purpose of this is to implement
  Heikki's suggestion that we have a bit set to zero so we can detect
  failures that cause a run of 1s.

 I think it's nonsensical to pretend that there's anything special
 about that particular bit.  If we want to validate the page header
 before trusting the lack of a checksum bit, we can do that far more
 thoroughly than just checking that one bit.  There are a whole bunch
 of bits that ought to always be zero, and there are other things we
 can validate as well (e.g. LSN not in future).  If we're concerned
 about the checksum-enabled bit getting flipped (and I agree that we
 should be), we can check some combination of that stuff in the hope of
 catching it, and that'll be a far better guard than just checking one
 arbitrarily selected bit.

 PageHeaderIsValid() (being renamed to PageIsVerified()) already checks
 (page-pd_flags  ~PD_VALID_FLAG_BITS) == 0.  Explicitly naming another bit
 and keeping it unset is redundant with that existing check.  It would cease to
 be redundant if we ever allocate all the flag bits, but then we also wouldn't
 have a bit to spare as PD_CHECKSUM2.  I agree with you on this point.

No problem to remove that. It was there at Heikki's request.

 That having been said, I don't feel very good about the idea of
 relying on the contents of the page to tell us whether or not the page
 has a checksum.  There's no guarantee that an error that flips the
 has-checksum bit will flip any other bit on the page, or that it won't
 flip everything else we're relying on as a backstop in exactly the
 manner that foils whatever algorithm we put in place.  Random
 corruption is, perhaps, unlikely to do that, but somehow I feel like
 it'll happen more often than random chance suggests.  Things never
 fail the way you want them to.

 Another disadvantage of the current scheme is that there's no
 particularly easy way to know that your whole cluster has checksums.
 No matter how we implement checksums, you'll have to rewrite every
 table in the cluster in order to get them fully turned on.  But with
 the current design, there's no easy way to know how much of the
 cluster is actually checksummed.  If you shut checksums off, they'll
 linger until those pages are rewritten, and there's no easy way to
 find the relations from which they need to be removed, either.

 I'm not seeing value in rewriting pages to remove checksums, as opposed to
 just ignoring those checksums going forward.  Did you have a particular
 scenario in mind?

Agreed. No reason to change a checksum unless we rewrite the block, no
matter whether page_checksums is on or off.

 I'm tempted to suggest a relation-level switch: when you want
 checksums, you use ALTER TABLE to turn them on, and when you don't
 want them any more you use ALTER TABLE to shut them off again, in each
 case rewriting the table.  That way, there's never any ambiguity about
 what's in the data pages in a given relation: either they're either
 all checksummed, or none of them are.  This moves the decision about
 whether checksums are enabled or disabled quite a but further away
 from the data itself, and also allows you to determine (by catalog
 inspection) which parts of the cluster do in fact have checksums.  It
 might be kind of a pain to implement, though: you'd have to pass the
 information about how any given relation was configured down to the
 place where we validate page sanity.  I'm not sure whether that's
 practical.

 This patch implies future opportunities to flesh out its UI, and I don't see
 it locking us out of implementing the above.

Agreed

 We'll want a weak-lock command
 that adds checksums to pages lacking them.

VACUUM FREEZE will do this.

 We'll want to expose whether a
 given relation has full checksum coverage.

Not sure I understand why we'd want that. If you really want that, its
a simple function to do it.

 With that, we could produce an
 error when an apparently-non-checksummed page appears in a relation previously
 known to have full checksum coverage.

Additional checking at relation level is just going to slow things
down. Don't see the value of having relation level checksum option.

It would be an easy thing to skip checksums on UNLOGGED relations, and
possibly desirable, but I don't see the utility of a per relation
checksum setting in other cases. There's just no need for fine tuning
like that.

 Even supposing an ALTER TABLE t SET {WITH | WITHOUT} CHECKSUMS as the only
 tool for enabling or disabling them, it's helpful to have each page declare
 whether it has a checksum.  That way, the 

Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-22 Thread Robert Haas
On Tue, Feb 21, 2012 at 5:07 AM, Noah Misch n...@leadboat.com wrote:
 We do, in numerous places, drop a shared buffer content lock and reacquire it
 in exclusive mode.  However, the existing users of that pattern tend to trade
 the lock, complete subsequent work, and unlock the buffer all within the same
 function.  So they must, because several of them recheck facts that can change
 during the period of holding no lock.  SetBufferCommitInfoNeedsSave() callers
 do not adhere to that pattern; they can be quite distant from the original
 lock acquisition and eventual lock release.  Take the prototypical case of
 SetHintBits().  Our shared lock originated in a place like heapgettup(), which
 expects to hold it continually until finished.  We would need to somehow push
 up through HeapTupleSatisfiesVisibility() the fact that we traded the buffer
 lock, then have heapgettup() reorient itself as needed.  Every caller of code
 that can reach SetBufferCommitInfoNeedsSave() would need to do the same.
 Perhaps there's a different way to apply the lock-trade technique that avoids
 this mess, but I'm not seeing it.  Consequently, I find the idea of requiring
 a spinlock acquisition to read or write pd_lsn/pd_tli under BUFFER_LOCK_SHARE
 to be a sensible one.  Within that umbrella, some details need attention:

Fair analysis.

 - Not all BUFFER_LOCK_SHARE callers of PageGetLSN() live in bufmgr.c.  I note
  gistScanPage() and XLogCheckBuffer()[1].  (Perhaps we'll only require the
  spinlock for heap buffers, letting gistScanPage() off the hook.)  We need a
  public API, perhaps LockBufferForLSN().

That seems like a good idea.  I am a little worried about Simon's plan
to do the additional locking only for AMs that have no unlogged-type
updates.  It's a reasonable performance optimization, and skipping the
locking when checksums are shut off also seems reasonable, but it
seems a bit fragile: suppose that, in the future, someone fixes GiST
to do something intelligent with kill_prior_tuple.  Now suddenly it
needs LSN locking that it didn't need before, but this fact might not
be very obvious.  It might be a good idea to design LockBufferForLSN
to take an AM OID as an argument, and use the AM OID to decide whether
the locking is required.  That way, we can call it from everywhere
that reads an LSN, and it can simply skip the locking in places where
it isn't presently needed.

Or maybe there's a better design, but I agree with you that some kind
of public API is essential.

 - The use of some spinlock need not imply using the buffer header spinlock.
  We could add a dedicated pd_lsn_tli_lock to BufferDesc.  That has the usual
  trade-off of splitting a lock: less contention at the cost of more
  acquisitions.  I have no intuition on which approach would perform better.

I think I like the idea of a separate lock, but I agree it could stand
to be tested both ways.  Another thought is that we might add
SpinLockConditionalAcquire(), that just tries to TAS the lock and
returns false if the lock is already held.  Then, we could grab the
spinlock while writing the page, in lieu of copying it, and anyone who
wants to set a hint bit can conditionally acquire the spinlock long
enough to set the bits.  If the spinlock isn't immediately free then
instead of spinning they just don't set the hint bits after all.  That
way you can be sure that no hint bits are set during the write, but
any attempt to set a hint bit only costs you a single TAS - no loop,
as with a regular spinlock acquisition - and of course the hint
itself.

 A possible compromise is to leave the page clean after setting a hint bit,
 much like the patch already has us do under hot standby.  Then there's no new
 WAL and no new rules around pd_lsn.  Wasn't that one of the things Merlin
 benchmarked when he was looking at hint bits?  Does anyone recall the result?

It slows things down substantially in the case of repeated scans of
the same unchaning table. In fact, I tried a much more nuanced
approach: set BM_UNTIDY on the page when
SetBufferCommitInfoNeedsSave() is called.  BM_UNTIDY pages are written
by the background writer and during checkpoints, and 5% of the time by
backends.  All of that has the salutary effect of making the first
sequential scan less ungodly slow, but it also has the less-desirable
effect of making the next 19 of them still kinda-sorta slow.  It was
unclear to me (and perhaps to others) whether it really worked out to
a win.

However, we might need/want to revisit some of that logic in
connection with this patch, because it seems to me that a large
sequential scan of an unhinted table could be ferociously slow, with
the backend writing out its own pages and WAL-logging them as it goes.
 It would be good to test that to figure out whether some mitigation
measures are needed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make 

Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-22 Thread Peter Geoghegan
I decided that it would be worth benchmarking this patch.
Specifically, I tested:

master, as a basis of comparison

checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'on'

checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'off'

This test was performed using pgbench-tools. At different client
counts and scaling factors 1,10,100, performance of an update.sql
workload was tested.

This benchmark used Greg Smith's toy server. As he put it recently:

The main change to the 8 hyperthreaded core test server (Intel
i7-870) for this year is bumping it from 8GB to 16GB of RAM, which
effectively doubles the scale I can reach before things slow
dramatically. However, while Greg used scientific Linux for his
recent batch of performance numbers, the box was booted into Debian
for this, which used Kernel 2.6.32, FWIW. Didn't bother with a
separate disk for WAL.

I put shared_buffers at 1GB, and checkpoint_segments at 50. I took the
additional precaution of initdb'ing for each set, lest there be some
kind of contamination between sets, which necessitated doing some
additional work since I couldn't very well expect the results
database to persist. Different sets of figures from different runs
where dumped and restored, before finally being combined so that
pgbench-tools could produce it's regular report.

I have attached a config for pgbench-tools, so that others may
recreate my work here. I also attach the most relevant image,
comparing each test set across scaling levels. I'll make a pdf of the
full report available if that would be useful.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


config
Description: Binary data
attachment: scaling-sets-page-checksums.png
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-22 Thread David Fetter
On Wed, Feb 22, 2012 at 11:17:53PM +, Peter Geoghegan wrote:
 I decided that it would be worth benchmarking this patch.
 Specifically, I tested:
 
 master, as a basis of comparison
 
 checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'on'
 
 checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'off'
 
 This test was performed using pgbench-tools. At different client
 counts and scaling factors 1,10,100, performance of an update.sql
 workload was tested.

Looks interesting.  Could you get some error bars around the numbers
plotted, and possibly some scaling factors between 10 and 100?

For the former, I'm looking for whether those changes are within
ordinary variation, and in the latter, some better idea of what the
curve looks like.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-22 Thread Robert Haas
On Wed, Feb 22, 2012 at 6:17 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 I decided that it would be worth benchmarking this patch.
 Specifically, I tested:

 master, as a basis of comparison

 checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'on'

 checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'off'

 This test was performed using pgbench-tools. At different client
 counts and scaling factors 1,10,100, performance of an update.sql
 workload was tested.

 This benchmark used Greg Smith's toy server. As he put it recently:

 The main change to the 8 hyperthreaded core test server (Intel
 i7-870) for this year is bumping it from 8GB to 16GB of RAM, which
 effectively doubles the scale I can reach before things slow
 dramatically. However, while Greg used scientific Linux for his
 recent batch of performance numbers, the box was booted into Debian
 for this, which used Kernel 2.6.32, FWIW. Didn't bother with a
 separate disk for WAL.

 I put shared_buffers at 1GB, and checkpoint_segments at 50. I took the
 additional precaution of initdb'ing for each set, lest there be some
 kind of contamination between sets, which necessitated doing some
 additional work since I couldn't very well expect the results
 database to persist. Different sets of figures from different runs
 where dumped and restored, before finally being combined so that
 pgbench-tools could produce it's regular report.

 I have attached a config for pgbench-tools, so that others may
 recreate my work here. I also attach the most relevant image,
 comparing each test set across scaling levels. I'll make a pdf of the
 full report available if that would be useful.

Thanks for testing this.  The graph obscures a bit how much percentage
change we're talking about here - could you post the raw tps numbers?

I think we also need to test the case of seq-scanning a large, unhinted table.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-21 Thread Noah Misch
On Mon, Feb 20, 2012 at 08:57:08AM -0500, Robert Haas wrote:
 On Mon, Feb 20, 2012 at 4:18 AM, Simon Riggs si...@2ndquadrant.com wrote:
  What straightforward implementation is that?? This *is* the straightforward 
  one.
 
  God knows what else we'd break if we drop the lock, reacquire as an
  exclusive, then drop it again and reacquire in shared mode. Code tends
  to assume that when you take a lock you hold it until you release;
  doing otherwise would allow all manner of race conditions to emerge.
 
  And do you really think that would be faster?
 
 I don't know, but neither do you, because you apparently haven't tried
 it.   Games where we drop the shared lock and get an exclusive lock
 are used in numerous places in the source code; see, for example,
 heap_update(), so I don't believe that there is any reason to reject
 that a priori.  Indeed, I can think of at least one pretty good reason
 to do it exactly that way: it's the way we've handled all previous
 problems of this type, and in general it's better to make new code
 look like existing code than to invent something new, especially when
 you haven't made any effort to quantify the problem or the extent to
 which the proposed solution mitigates it.

We do, in numerous places, drop a shared buffer content lock and reacquire it
in exclusive mode.  However, the existing users of that pattern tend to trade
the lock, complete subsequent work, and unlock the buffer all within the same
function.  So they must, because several of them recheck facts that can change
during the period of holding no lock.  SetBufferCommitInfoNeedsSave() callers
do not adhere to that pattern; they can be quite distant from the original
lock acquisition and eventual lock release.  Take the prototypical case of
SetHintBits().  Our shared lock originated in a place like heapgettup(), which
expects to hold it continually until finished.  We would need to somehow push
up through HeapTupleSatisfiesVisibility() the fact that we traded the buffer
lock, then have heapgettup() reorient itself as needed.  Every caller of code
that can reach SetBufferCommitInfoNeedsSave() would need to do the same.
Perhaps there's a different way to apply the lock-trade technique that avoids
this mess, but I'm not seeing it.  Consequently, I find the idea of requiring
a spinlock acquisition to read or write pd_lsn/pd_tli under BUFFER_LOCK_SHARE
to be a sensible one.  Within that umbrella, some details need attention:

- Not all BUFFER_LOCK_SHARE callers of PageGetLSN() live in bufmgr.c.  I note
  gistScanPage() and XLogCheckBuffer()[1].  (Perhaps we'll only require the
  spinlock for heap buffers, letting gistScanPage() off the hook.)  We need a
  public API, perhaps LockBufferForLSN().

- The use of some spinlock need not imply using the buffer header spinlock.
  We could add a dedicated pd_lsn_tli_lock to BufferDesc.  That has the usual
  trade-off of splitting a lock: less contention at the cost of more
  acquisitions.  I have no intuition on which approach would perform better.

I agree that src/backend/storage/buffer/README is the place to document the
new locking rules.

I do share your general unease about adding new twists to the buffer access
rules.  Some of our hairiest code is also the code manipulating buffer locks
most extensively, and I would not wish to see that code get even more
difficult to understand.  However, I'm not seeing a credible alternative that
retains the same high-level behaviors.

A possible compromise is to leave the page clean after setting a hint bit,
much like the patch already has us do under hot standby.  Then there's no new
WAL and no new rules around pd_lsn.  Wasn't that one of the things Merlin
benchmarked when he was looking at hint bits?  Does anyone recall the result?

Thanks,
nm

[1] Patch v10 adds a comment to XLogCheckBuffer() saying otherwise.  The
comment is true today, but the same patch makes it false by having
XLogSaveBufferForHint() call XLogInsert() under a share lock.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-21 Thread Simon Riggs
On Tue, Feb 21, 2012 at 10:07 AM, Noah Misch n...@leadboat.com wrote:

 Consequently, I find the idea of requiring
 a spinlock acquisition to read or write pd_lsn/pd_tli under BUFFER_LOCK_SHARE
 to be a sensible one.

OK

 Within that umbrella, some details need attention:

 - Not all BUFFER_LOCK_SHARE callers of PageGetLSN() live in bufmgr.c.  I note
  gistScanPage() and XLogCheckBuffer()[1].  (Perhaps we'll only require the
  spinlock for heap buffers, letting gistScanPage() off the hook.)  We need a
  public API, perhaps LockBufferForLSN().

Yep, I checked all the call points previously. gistScanPage() and also
gistbulkdelete() would be need changing, but since GIST doesn't use
hints, there is no need for locking. I'll document that better.

 - The use of some spinlock need not imply using the buffer header spinlock.
  We could add a dedicated pd_lsn_tli_lock to BufferDesc.  That has the usual
  trade-off of splitting a lock: less contention at the cost of more
  acquisitions.  I have no intuition on which approach would perform better.

Will think about that and try a few ideas.

 I agree that src/backend/storage/buffer/README is the place to document the
 new locking rules.

OK, I'll move the comments.

 I do share your general unease about adding new twists to the buffer access
 rules.  Some of our hairiest code is also the code manipulating buffer locks
 most extensively, and I would not wish to see that code get even more
 difficult to understand.  However, I'm not seeing a credible alternative that
 retains the same high-level behaviors.

Yes, those changes not made lightly and with much checking.

 A possible compromise is to leave the page clean after setting a hint bit,
 much like the patch already has us do under hot standby.  Then there's no new
 WAL and no new rules around pd_lsn.  Wasn't that one of the things Merlin
 benchmarked when he was looking at hint bits?  Does anyone recall the result?

I was thinking exactly that myself. I'll add a GUC to test.

 [1] Patch v10 adds a comment to XLogCheckBuffer() saying otherwise.  The
 comment is true today, but the same patch makes it false by having
 XLogSaveBufferForHint() call XLogInsert() under a share lock.

OK, thats an issue, well spotted. Will fix.

Thanks for thorough review.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-21 Thread Noah Misch
On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
 On Sun, Feb 19, 2012 at 11:35 AM, Simon Riggs si...@2ndquadrant.com wrote:
  So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
  doesn't have a checksum, PD_CHECKSUM2 is not set? ?What good does that
  do?
 
  As explained in detailed comments, the purpose of this is to implement
  Heikki's suggestion that we have a bit set to zero so we can detect
  failures that cause a run of 1s.
 
 I think it's nonsensical to pretend that there's anything special
 about that particular bit.  If we want to validate the page header
 before trusting the lack of a checksum bit, we can do that far more
 thoroughly than just checking that one bit.  There are a whole bunch
 of bits that ought to always be zero, and there are other things we
 can validate as well (e.g. LSN not in future).  If we're concerned
 about the checksum-enabled bit getting flipped (and I agree that we
 should be), we can check some combination of that stuff in the hope of
 catching it, and that'll be a far better guard than just checking one
 arbitrarily selected bit.

PageHeaderIsValid() (being renamed to PageIsVerified()) already checks
(page-pd_flags  ~PD_VALID_FLAG_BITS) == 0.  Explicitly naming another bit
and keeping it unset is redundant with that existing check.  It would cease to
be redundant if we ever allocate all the flag bits, but then we also wouldn't
have a bit to spare as PD_CHECKSUM2.  I agree with you on this point.

 That having been said, I don't feel very good about the idea of
 relying on the contents of the page to tell us whether or not the page
 has a checksum.  There's no guarantee that an error that flips the
 has-checksum bit will flip any other bit on the page, or that it won't
 flip everything else we're relying on as a backstop in exactly the
 manner that foils whatever algorithm we put in place.  Random
 corruption is, perhaps, unlikely to do that, but somehow I feel like
 it'll happen more often than random chance suggests.  Things never
 fail the way you want them to.
 
 Another disadvantage of the current scheme is that there's no
 particularly easy way to know that your whole cluster has checksums.
 No matter how we implement checksums, you'll have to rewrite every
 table in the cluster in order to get them fully turned on.  But with
 the current design, there's no easy way to know how much of the
 cluster is actually checksummed.  If you shut checksums off, they'll
 linger until those pages are rewritten, and there's no easy way to
 find the relations from which they need to be removed, either.

I'm not seeing value in rewriting pages to remove checksums, as opposed to
just ignoring those checksums going forward.  Did you have a particular
scenario in mind?

 I'm tempted to suggest a relation-level switch: when you want
 checksums, you use ALTER TABLE to turn them on, and when you don't
 want them any more you use ALTER TABLE to shut them off again, in each
 case rewriting the table.  That way, there's never any ambiguity about
 what's in the data pages in a given relation: either they're either
 all checksummed, or none of them are.  This moves the decision about
 whether checksums are enabled or disabled quite a but further away
 from the data itself, and also allows you to determine (by catalog
 inspection) which parts of the cluster do in fact have checksums.  It
 might be kind of a pain to implement, though: you'd have to pass the
 information about how any given relation was configured down to the
 place where we validate page sanity.  I'm not sure whether that's
 practical.

This patch implies future opportunities to flesh out its UI, and I don't see
it locking us out of implementing the above.  We'll want a weak-lock command
that adds checksums to pages lacking them.  We'll want to expose whether a
given relation has full checksum coverage.  With that, we could produce an
error when an apparently-non-checksummed page appears in a relation previously
known to have full checksum coverage.

Even supposing an ALTER TABLE t SET {WITH | WITHOUT} CHECKSUMS as the only
tool for enabling or disabling them, it's helpful to have each page declare
whether it has a checksum.  That way, the rewrite can take as little as an
AccessShareLock, and any failure need not lose work already done.  If you rely
on anticipating the presence of a checksum based on a pg_class column, that
ALTER needs to perform an atomic rewrite.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-20 Thread Simon Riggs
On Mon, Feb 20, 2012 at 12:42 AM, Robert Haas robertmh...@gmail.com wrote:

 I think we could do worse than this patch, but I don't really believe
 it's ready for commit.  We don't have a single performance number
 showing how much of a performance regression this causes, either on
 the master or on the standby, on any workload, much less those where a
 problem is reasonably forseeable from the design you've chosen.  Many
 controversial aspects of the patch, such as the way you're using the
 buffer header spinlocks as a surrogate for x-locking the buffer, or
 the right way of obtaining the bit-space the patch requires, haven't
 really been discussed, and to the extent that they have been
 discussed, they have not been agreed.

These points are being discussed here. If you have rational
objections, say what they are.

 On the former point, you haven't updated
 src/backend/storage/buffer/README at all;

I've updated the appropriate README file which relates to page LSNs to explain.

 but updating is not by
 itself sufficient.  Before we change the buffer-locking rules, we
 ought to have some discussion of whether it's OK to do that, and why
 it's necessary.  Why it's necessary would presumably include
 demonstrating that the performance of the straightforward
 implementation stinks, and that with this change is better.

What straightforward implementation is that?? This *is* the straightforward one.

God knows what else we'd break if we drop the lock, reacquire as an
exclusive, then drop it again and reacquire in shared mode. Code tends
to assume that when you take a lock you hold it until you release;
doing otherwise would allow all manner of race conditions to emerge.

And do you really think that would be faster?

 You
 haven't made any effort to do that whatsoever, or if you have, you
 haven't posted the results here.  I'm pretty un-excited by that
 change, first because I think it's a modularity violation and possibly
 unsafe, and second because I believe we've already got a problem with
 buffer header spinlock contention which this might exacerbate.

 Another disadvantage of the current scheme is that there's no
 particularly easy way to know that your whole cluster has checksums.
 No matter how we implement checksums, you'll have to rewrite every
 table in the cluster in order to get them fully turned on.  But with
 the current design, there's no easy way to know how much of the
 cluster is actually checksummed.

 You can read every block and check.

 Using what tool?

Pageinspect?

 If you shut checksums off, they'll
 linger until those pages are rewritten, and there's no easy way to
 find the relations from which they need to be removed, either.

 We can't have it both ways. Either we have an easy upgrade, or we
 don't. I'm told that an easy upgrade is essential.

 Easy upgrade and removal of checksums are unrelated issues AFAICS.

 I'm tempted to suggest a relation-level switch: when you want
 checksums, you use ALTER TABLE to turn them on, and when you don't
 want them any more you use ALTER TABLE to shut them off again, in each
 case rewriting the table.  That way, there's never any ambiguity about
 what's in the data pages in a given relation: either they're either
 all checksummed, or none of them are.  This moves the decision about
 whether checksums are enabled or disabled quite a but further away
 from the data itself, and also allows you to determine (by catalog
 inspection) which parts of the cluster do in fact have checksums.  It
 might be kind of a pain to implement, though: you'd have to pass the
 information about how any given relation was configured down to the
 place where we validate page sanity.  I'm not sure whether that's
 practical.

 It's not practical as the only mechanism, given the requirement for upgrade.

 Why not?

 The version stuff is catered for by the current patch.

 Your patch reuses the version number field for an unrelated purpose
 and includes some vague hints of how we might do versioning using some
 of the page-level flag bits.  Since bit-space is fungible, I think it
 makes more sense to keep the version number where it is and carve the
 checksum out of whatever bit space you would have moved the version
 number into.  Whether that's pd_tli or pd_flags is somewhat secondary;
 the point is that you can certainly arrange to leave the 8 bits that
 currently contain the version number as they are.

If you've got some rational objections, please raise them.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-20 Thread Robert Haas
On Mon, Feb 20, 2012 at 4:18 AM, Simon Riggs si...@2ndquadrant.com wrote:
 What straightforward implementation is that?? This *is* the straightforward 
 one.

 God knows what else we'd break if we drop the lock, reacquire as an
 exclusive, then drop it again and reacquire in shared mode. Code tends
 to assume that when you take a lock you hold it until you release;
 doing otherwise would allow all manner of race conditions to emerge.

 And do you really think that would be faster?

I don't know, but neither do you, because you apparently haven't tried
it.   Games where we drop the shared lock and get an exclusive lock
are used in numerous places in the source code; see, for example,
heap_update(), so I don't believe that there is any reason to reject
that a priori.  Indeed, I can think of at least one pretty good reason
to do it exactly that way: it's the way we've handled all previous
problems of this type, and in general it's better to make new code
look like existing code than to invent something new, especially when
you haven't made any effort to quantify the problem or the extent to
which the proposed solution mitigates it.

 If you've got some rational objections, please raise them.

Look, I think that the objections I have been raising are rational.
YMMV, and obviously does.  The bottom line here is that I can't stop
you from committing this patch, and we both know that.  And, really,
at the end of the day, I have no desire to keep you from committing
this patch, even if I had the power to do so, because I agree that the
feature has merit.  But I can object to it and, as the patch stands
now, I do object to it, for the following specific reasons:

1. You haven't posted any meaningful performance test results, of
either normal cases or worst cases.
2. You're fiddling with the buffer locking in a way that I'm very
doubtful about without having tried any of the alternatives.
3. You're rearranging the page header in a way that I find ugly and baroque.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-20 Thread Josh Berkus
On 2/20/12 5:57 AM, Robert Haas wrote:
 3. You're rearranging the page header in a way that I find ugly and baroque.

Guys, are we talking about an on-disk format change?  If so, this may
need to be kicked out of 9.2 ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-20 Thread Robert Haas
On Mon, Feb 20, 2012 at 12:53 PM, Josh Berkus j...@agliodbs.com wrote:
 On 2/20/12 5:57 AM, Robert Haas wrote:
 3. You're rearranging the page header in a way that I find ugly and baroque.

 Guys, are we talking about an on-disk format change?  If so, this may
 need to be kicked out of 9.2 ...

Yes, we are.  Simon's gone to some pains to make it backward
compatible, but IMHO it's a lot messier and less future-proof than it
could be with some more work, and if we commit this patch than we WILL
be stuck with this for a very long time.

The fact is that, thus far, so real effort has been made by anyone to
evict anything from the current CommitFest.  I did that for the last
two cycles, but as the minutes for last year's development meeting
succinctly observed Haas ... doesn't like being the schedule jerk.
My resolve to be less of a schedule jerk is being sorely tested,
though: aside from this patch, which has its share of issues, there's
also Alvaro's foreign key stuff, which probably needs a lot more
review than it has so far gotten, and several large patches from
Dimitri, which do cool stuff but need a lot more work, and Peter
Geoghegan's pg_stat_statements patch, which is awesome functionality
but sounds like it's still a little green, and parallel pg_dump, which
is waiting on a rebase after some refactoring I did to simplify
things, and pgsql_fdw, and Heikki's xlog insert scaling patch, which
fortunately seems to be in the capable hands of Fujii Masao and Jeff
Janes, but certainly isn't ready to go today.  Any of these
individually could probably eat up a solid week of my time, and then
there are the other 40 as-yet-undealt-with patches.

I said five weeks ago that it probably wouldn't hurt anything to let
the CommitFest run six weeks, and Tom opined that it would probably
require two months.  But the sad reality is that, after five weeks,
we're not even half done with this CommitFest.  That's partly because
most people did not review as much code as they submitted, partly
because a bunch of people played fast and loose with the submission
deadline, and partly because we didn't return any patches that still
needed major rehashing after the first round of review, leading to a
situation where supposedly code-complete patches are showing up for
the first time four or five weeks after the submission deadline.  Now
none of that is the fault of this patch specifically: honestly, if I
had to pick just two more patches to get committed before beta, this
would probably be one of them.  But that doesn't make me happy with
where it's at today, not to mention the fact that there are people who
submitted their code a lot earlier who still haven't gotten the
attention this patch has, which is still less than the attention a
patch of this magnitude probably needs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-20 Thread Josh Berkus

 Guys, are we talking about an on-disk format change?  If so, this may
 need to be kicked out of 9.2 ...
 
 Yes, we are.  Simon's gone to some pains to make it backward
 compatible, but IMHO it's a lot messier and less future-proof than it
 could be with some more work, and if we commit this patch than we WILL
 be stuck with this for a very long time.

Yeah.  I'd personally prefer to boot this to 9.3, then.  It's not like
there's not enough features for 9.2, and I really don't want this
feature to cause 5 others to be delayed.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-20 Thread Bruce Momjian
On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
 Another disadvantage of the current scheme is that there's no
 particularly easy way to know that your whole cluster has checksums.
 No matter how we implement checksums, you'll have to rewrite every
 table in the cluster in order to get them fully turned on.  But with
 the current design, there's no easy way to know how much of the
 cluster is actually checksummed.  If you shut checksums off, they'll
 linger until those pages are rewritten, and there's no easy way to
 find the relations from which they need to be removed, either.

Yes, pg_upgrade makes this hard.  If you are using pg_dump to restore,
and set the checksum GUC before you do the restore, and never turn it
off, then you will have a fully checksum'ed database.

If you use pg_upgrade, and enable the checksum GUC, your database will
become progressively checksummed, and as Simon pointed out, the only
clean way is VACUUM FULL.  It is quite hard to estimate the checksum
coverage of a database with mixed checksumming --- one cool idea would
be for ANALYZE to report how many of the pages it saw were checksummed. 
Yeah, crazy, but it might be enough.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-20 Thread Simon Riggs
On Mon, Feb 20, 2012 at 6:22 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Feb 20, 2012 at 12:53 PM, Josh Berkus j...@agliodbs.com wrote:
 On 2/20/12 5:57 AM, Robert Haas wrote:
 3. You're rearranging the page header in a way that I find ugly and baroque.

 Guys, are we talking about an on-disk format change?  If so, this may
 need to be kicked out of 9.2 ...

 Yes, we are.  Simon's gone to some pains to make it backward
 compatible, but IMHO it's a lot messier and less future-proof than it
 could be with some more work, and if we commit this patch than we WILL
 be stuck with this for a very long time.

No, we aren't. The format is not changed, though there are 3 new bits added.

The format is designed to be extensible and we have room for 10 more
bits, which allows 1024 new page formats, which at current usage rate
should last us for approximately 5000 years of further development.
Previously, we were limited to 251 more page formats, so this new
mechanism is more future proof than the last. Alternatively you might
say that we have dropped from 1271 page formats to only 1024, which is
hardly limiting.

This has already been discussed and includes points made by Bruce and
Heikki in the final design. The more work required is not described
clearly, nor has anyone but RH requested it. In terms of messier, RH's
only alternate suggestion is to split the checksum into multiple
pieces, which I don't think yer average reader would call less ugly,
less messy or more future proof.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-20 Thread Simon Riggs
On Mon, Feb 20, 2012 at 11:09 PM, Bruce Momjian br...@momjian.us wrote:
 On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
 Another disadvantage of the current scheme is that there's no
 particularly easy way to know that your whole cluster has checksums.
 No matter how we implement checksums, you'll have to rewrite every
 table in the cluster in order to get them fully turned on.  But with
 the current design, there's no easy way to know how much of the
 cluster is actually checksummed.  If you shut checksums off, they'll
 linger until those pages are rewritten, and there's no easy way to
 find the relations from which they need to be removed, either.

 Yes, pg_upgrade makes this hard.  If you are using pg_dump to restore,
 and set the checksum GUC before you do the restore, and never turn it
 off, then you will have a fully checksum'ed database.

 If you use pg_upgrade, and enable the checksum GUC, your database will
 become progressively checksummed, and as Simon pointed out, the only
 clean way is VACUUM FULL.  It is quite hard to estimate the checksum
 coverage of a database with mixed checksumming --- one cool idea would
 be for ANALYZE to report how many of the pages it saw were checksummed.
 Yeah, crazy, but it might be enough.

Well, I didn't say VACUUM FULL was the only clean way of knowing
whether every block is checksummed, its a very intrusive way.

If you want a fast upgrade with pg_upgrade, rewriting every block is
not really a grand plan, but if you want it...

If we did that, I think I would prefer to do it with these commands

  VACUUM ENABLE CHECKSUM;   //whole database only
  VACUUM DISABLE CHECKSUM;

rather than use a GUC. We can then add an option to pg_upgrade.

That way, we scan whole database, adding checksums and then record it
in pg_database

When we remove it, we do same thing in reverse then record it.

So there's no worries about turning on/off GUCs and we know for
certain where our towel is.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-20 Thread Bruce Momjian
On Mon, Feb 20, 2012 at 11:23:42PM +, Simon Riggs wrote:
  If you use pg_upgrade, and enable the checksum GUC, your database will
  become progressively checksummed, and as Simon pointed out, the only
  clean way is VACUUM FULL.  It is quite hard to estimate the checksum
  coverage of a database with mixed checksumming --- one cool idea would
  be for ANALYZE to report how many of the pages it saw were checksummed.
  Yeah, crazy, but it might be enough.
 
 Well, I didn't say VACUUM FULL was the only clean way of knowing
 whether every block is checksummed, its a very intrusive way.
 
 If you want a fast upgrade with pg_upgrade, rewriting every block is
 not really a grand plan, but if you want it...
 
 If we did that, I think I would prefer to do it with these commands
 
   VACUUM ENABLE CHECKSUM;   //whole database only
   VACUUM DISABLE CHECKSUM;
 
 rather than use a GUC. We can then add an option to pg_upgrade.
 
 That way, we scan whole database, adding checksums and then record it
 in pg_database
 
 When we remove it, we do same thing in reverse then record it.
 
 So there's no worries about turning on/off GUCs and we know for
 certain where our towel is.

Yes, that would work, but I suspect most users are going to want to
enable checksums when they are seeing some odd behavior and want to test
the hardware.  Requiring them to rewrite the database to do that is
making the feature less useful, and once they have the problem fixed,
they might want to turn it off again.

Now, one argument is that they could enable checksums, see no checksum
errors, but still be getting checksum failures from pages that were
written before checksum was enabled.  I think we just need to document
that checksum only works for writes performed _after_ the feature is
enabled.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-19 Thread Simon Riggs
On Thu, Feb 16, 2012 at 1:53 PM, Robert Haas robertmh...@gmail.com wrote:

 +                       /*
 +                        * If we're in recovery we cannot dirty a page 
 because of a hint.
 +                        * We can set the hint, just not dirty the page as a 
 result so
 +                        * the hint is lost when we evict the page or 
 shutdown.
 +                        *
 +                        * See long discussion in bufpage.c
 +                        */
 +                       if (RecoveryInProgress())
 +                               return;

 Doesn't this seem awfully bad for performance on Hot Standby servers?
 I agree that it fixes the problem with un-WAL-logged pages there, but
 I seem to recall some recent complaining about performance features
 that work on the master but not the standby.  Durable hint bits are
 one such feature.

It's impossible for it to work, in this case, since we cannot write
new WAL to prevent torn pages.

Note that hint bit setting on a dirty block is allowed, so many hints
will still be set in Hot Standby.

 +                        * Basically, we simply prevent the checkpoint WAL 
 record from
 +                        * being written until we have marked the buffer 
 dirty. We don't
 +                        * start the checkpoint flush until we have marked 
 dirty, so our
 +                        * checkpoint must flush the change to disk 
 successfully or the
 +                        * checkpoint never gets written, so crash recovery 
 will fix.
 +                        *
 +                        * It's possible we may enter here without an xid, so 
 it is
 +                        * essential that CreateCheckpoint waits for virtual 
 transactions
 +                        * rather than full transactionids.
 +                        */
 +                       MyPgXact-delayChkpt = delayChkpt = true;

 I am slightly worried that this expansion in the use of this mechanism
 (formerly called inCommit, for those following along at home) could
 lead to checkpoint starvation.  Suppose we've got one or two large
 table scans wandering along, setting hint bits, and now suddenly it's
 time to checkpoint.  How long will it take the checkpoint process to
 find a time when nobody's got delayChkpt set?

We don't need to wait until nobody has it set, we just need to wait
for the people that had it set when we first checked to be out of that
state momentarily.

 + #define PageSetChecksum(page) \
 + do \
 + { \
 +       PageHeader      p = (PageHeader) page; \
 +       p-pd_flags |= PD_PAGE_VERSION_PLUS1; \
 +       p-pd_flags |= PD_CHECKSUM1; \
 +       p-pd_flags = ~PD_CHECKSUM2; \
 +       p-pd_verify.pd_checksum16 = PageCalcChecksum16(page); \
 + } while (0);
 +
 + /* ensure any older checksum info is overwritten with watermark */
 + #define PageResetVersion(page) \
 + do \
 + { \
 +       PageHeader      p = (PageHeader) page; \
 +       if (!PageHasNoChecksum(p)) \
 +       { \
 +               p-pd_flags = ~PD_PAGE_VERSION_PLUS1; \
 +               p-pd_flags = ~PD_CHECKSUM1; \
 +               p-pd_flags = ~PD_CHECKSUM2; \
 +               PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION); 
 \
 +       } \
 + } while (0);

 So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
 doesn't have a checksum, PD_CHECKSUM2 is not set?  What good does that
 do?

As explained in detailed comments, the purpose of this is to implement
Heikki's suggestion that we have a bit set to zero so we can detect
failures that cause a run of 1s.

   * PageGetPageSize
   *            Returns the page size of a page.
   *
 !  * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ.
 !  * This can be called on any page, initialised or not, in or out of buffers.
 !  * You might think this can vary at runtime but you'd be wrong, since pages
 !  * frequently need to occupy buffers and pages are copied from one to 
 another
 !  * so there are many hidden assumptions that this simple definition is true.
   */
 ! #define PageGetPageSize(page) (BLCKSZ)

 I think I agree that the current definition of PageGetPageSize seems
 unlikely to come anywhere close to allowing us to cope with multiple
 page sizes, but I think this method of disabling it is a hack.  The
 callers that want to know how big the page really is should just use
 BLCKSZ instead of this macro, and those that want to know how big the
 page THINKS it is (notably contrib/pageinspect) need a way to get that
 information.

Fair comment.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-19 Thread Simon Riggs
On Sun, Feb 19, 2012 at 4:35 PM, Simon Riggs si...@2ndquadrant.com wrote:

 We don't need to wait until nobody has it set, we just need to wait
 for the people that had it set when we first checked to be out of that
 state momentarily.

I've just finished doing some performance analysis on various aspects
of hint bit setting for this patch.

I've seen as high as 14% of transactions writing full pages during a
pgbench run. That sounds quite bad, but on pgbench at least all of
those are associated with UPDATEs, which would dirty the page anyway,
so there aren't any more full page writes overall.

Checkpoints would be delayed only until a virtual transaction ends or
a virtual transaction comes out of DelayCkpt state. If a virtual
transaction was long running it wouldn't spend much time in the
delaying state, especially if we take into account I/O requirements.

So although I'm not exactly happy with the overheads, they don't seem
to be as big a problem as they sound. Plus this is all optional and
avoidable.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-19 Thread Robert Haas
On Sun, Feb 19, 2012 at 11:35 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Doesn't this seem awfully bad for performance on Hot Standby servers?
 I agree that it fixes the problem with un-WAL-logged pages there, but
 I seem to recall some recent complaining about performance features
 that work on the master but not the standby.  Durable hint bits are
 one such feature.

 It's impossible for it to work, in this case, since we cannot write
 new WAL to prevent torn pages.

 Note that hint bit setting on a dirty block is allowed, so many hints
 will still be set in Hot Standby.

To me, it seems that you are applying a double standard.  You have
twice attempted to insist that I do extra work to make major features
that I worked on - unlogged tables and index-only scans - work in Hot
Standby mode, despite the existence of significant technological
obstacles.  But when it comes to your own feature, you simply state
that it cannot be done, and therefore we need not do it.   Of course,
this feature, like those, CAN be made to work.  It just involves
solving difficult problems that have little to do with the primary
purpose of the patch.  To be honest, I don't use Hot Standby enough to
care very much about this particular issue, and I'm not saying we
should reject it on these grounds.  But I do think it's a mistake to
dismiss it entirely, since every limitation of Hot Standby narrows the
set of cases in which it can be deployed.  And at any rate, I want the
same standard applied to my work as to yours.

 I am slightly worried that this expansion in the use of this mechanism
 (formerly called inCommit, for those following along at home) could
 lead to checkpoint starvation.  Suppose we've got one or two large
 table scans wandering along, setting hint bits, and now suddenly it's
 time to checkpoint.  How long will it take the checkpoint process to
 find a time when nobody's got delayChkpt set?

 We don't need to wait until nobody has it set, we just need to wait
 for the people that had it set when we first checked to be out of that
 state momentarily.

Ah... good point.

 So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
 doesn't have a checksum, PD_CHECKSUM2 is not set?  What good does that
 do?

 As explained in detailed comments, the purpose of this is to implement
 Heikki's suggestion that we have a bit set to zero so we can detect
 failures that cause a run of 1s.

I think it's nonsensical to pretend that there's anything special
about that particular bit.  If we want to validate the page header
before trusting the lack of a checksum bit, we can do that far more
thoroughly than just checking that one bit.  There are a whole bunch
of bits that ought to always be zero, and there are other things we
can validate as well (e.g. LSN not in future).  If we're concerned
about the checksum-enabled bit getting flipped (and I agree that we
should be), we can check some combination of that stuff in the hope of
catching it, and that'll be a far better guard than just checking one
arbitrarily selected bit.

That having been said, I don't feel very good about the idea of
relying on the contents of the page to tell us whether or not the page
has a checksum.  There's no guarantee that an error that flips the
has-checksum bit will flip any other bit on the page, or that it won't
flip everything else we're relying on as a backstop in exactly the
manner that foils whatever algorithm we put in place.  Random
corruption is, perhaps, unlikely to do that, but somehow I feel like
it'll happen more often than random chance suggests.  Things never
fail the way you want them to.

Another disadvantage of the current scheme is that there's no
particularly easy way to know that your whole cluster has checksums.
No matter how we implement checksums, you'll have to rewrite every
table in the cluster in order to get them fully turned on.  But with
the current design, there's no easy way to know how much of the
cluster is actually checksummed.  If you shut checksums off, they'll
linger until those pages are rewritten, and there's no easy way to
find the relations from which they need to be removed, either.

I'm tempted to suggest a relation-level switch: when you want
checksums, you use ALTER TABLE to turn them on, and when you don't
want them any more you use ALTER TABLE to shut them off again, in each
case rewriting the table.  That way, there's never any ambiguity about
what's in the data pages in a given relation: either they're either
all checksummed, or none of them are.  This moves the decision about
whether checksums are enabled or disabled quite a but further away
from the data itself, and also allows you to determine (by catalog
inspection) which parts of the cluster do in fact have checksums.  It
might be kind of a pain to implement, though: you'd have to pass the
information about how any given relation was configured down to the
place where we validate page sanity.  I'm not sure whether that's
practical.

I 

Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-19 Thread Simon Riggs
On Sun, Feb 19, 2012 at 10:04 PM, Robert Haas robertmh...@gmail.com wrote:

 To me, it seems that you are applying a double standard.  You have
 twice attempted to insist that I do extra work to make major features
 that I worked on - unlogged tables and index-only scans - work in Hot
 Standby mode, despite the existence of significant technological
 obstacles.  But when it comes to your own feature, you simply state
 that it cannot be done, and therefore we need not do it.   Of course,
 this feature, like those, CAN be made to work.

Vitriol aside, If you would be so kind as to explain how it is
possible, as you claim, I'll look into making it work.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-19 Thread Simon Riggs
On Sun, Feb 19, 2012 at 10:04 PM, Robert Haas robertmh...@gmail.com wrote:

 As explained in detailed comments, the purpose of this is to implement
 Heikki's suggestion that we have a bit set to zero so we can detect
 failures that cause a run of 1s.

 I think it's nonsensical to pretend that there's anything special
 about that particular bit.  If we want to validate the page header
 before trusting the lack of a checksum bit, we can do that far more
 thoroughly than just checking that one bit.  There are a whole bunch
 of bits that ought to always be zero, and there are other things we
 can validate as well (e.g. LSN not in future).  If we're concerned
 about the checksum-enabled bit getting flipped (and I agree that we
 should be), we can check some combination of that stuff in the hope of
 catching it, and that'll be a far better guard than just checking one
 arbitrarily selected bit.

I thought it was a reasonable and practical idea from Heikki. The bit
is not selected arbitrarily, it is by design adjacent to one of the
other bits. So overall, 3 bits need to be set to a precise value and a
run of 1s or 0s will throw and error.

 That having been said, I don't feel very good about the idea of
 relying on the contents of the page to tell us whether or not the page
 has a checksum.  There's no guarantee that an error that flips the
 has-checksum bit will flip any other bit on the page, or that it won't
 flip everything else we're relying on as a backstop in exactly the
 manner that foils whatever algorithm we put in place.  Random
 corruption is, perhaps, unlikely to do that, but somehow I feel like
 it'll happen more often than random chance suggests.  Things never
 fail the way you want them to.

You're right. This patch is not the best possible world, given a clean
slate. But we don't have a clean slate.

The fact is this patch will detect corruptions pretty well and that's
what Postgres users want.

While developing this, many obstacles could have been blockers. I
think we're fairly lucky that I managed to find a way through the
minefield of obstacles.

 Another disadvantage of the current scheme is that there's no
 particularly easy way to know that your whole cluster has checksums.
 No matter how we implement checksums, you'll have to rewrite every
 table in the cluster in order to get them fully turned on.  But with
 the current design, there's no easy way to know how much of the
 cluster is actually checksummed.

You can read every block and check.

 If you shut checksums off, they'll
 linger until those pages are rewritten, and there's no easy way to
 find the relations from which they need to be removed, either.

We can't have it both ways. Either we have an easy upgrade, or we
don't. I'm told that an easy upgrade is essential.

 I'm tempted to suggest a relation-level switch: when you want
 checksums, you use ALTER TABLE to turn them on, and when you don't
 want them any more you use ALTER TABLE to shut them off again, in each
 case rewriting the table.  That way, there's never any ambiguity about
 what's in the data pages in a given relation: either they're either
 all checksummed, or none of them are.  This moves the decision about
 whether checksums are enabled or disabled quite a but further away
 from the data itself, and also allows you to determine (by catalog
 inspection) which parts of the cluster do in fact have checksums.  It
 might be kind of a pain to implement, though: you'd have to pass the
 information about how any given relation was configured down to the
 place where we validate page sanity.  I'm not sure whether that's
 practical.

It's not practical as the only mechanism, given the requirement for upgrade.

As I mention in the docs, if you want that, use VACUUM FULL. So there
is a mechanism if you want it.

 I also think that the question of where exactly the checksum ought to
 be put might bear some more thought.  Tom expressed some concern about
 stealing the page version field, and it seems to me we could work
 around that by instead stealing something less valuable.  The top
 eight bits of the pd_pagesize_version field probably meet that
 criteria, since in practice basically everybody uses 8K blocks, and
 the number of errors that will be caught by checksums is probably much
 larger than the number of errors that will be caught by the page-size
 cross-check.  But maybe the other 8 bits should come from somewhere
 else, maybe pd_tli or pd_flags.  For almost all practical purposes,
 storing only the low-order 8 bits of the TLI ought to provide just as
 much of a safety check as storing the low-order 16 bits, so I think
 that might be the way to go.  We could set it up so that we always
 check only the low 8 bits against the TLI, regardless of whether
 checksums are enabled; then we basically free up that bit space at no
 cost in code complexity.

The version stuff is catered for by the current patch.

TLI isn't something I want to mess with. It does actually mean

Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-19 Thread Robert Haas
On Sun, Feb 19, 2012 at 6:33 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Sun, Feb 19, 2012 at 10:04 PM, Robert Haas robertmh...@gmail.com wrote:
 To me, it seems that you are applying a double standard.  You have
 twice attempted to insist that I do extra work to make major features
 that I worked on - unlogged tables and index-only scans - work in Hot
 Standby mode, despite the existence of significant technological
 obstacles.  But when it comes to your own feature, you simply state
 that it cannot be done, and therefore we need not do it.   Of course,
 this feature, like those, CAN be made to work.

 Vitriol aside, If you would be so kind as to explain how it is
 possible, as you claim, I'll look into making it work.

It would require a double-write buffer of some kind.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-19 Thread Robert Haas
On Sun, Feb 19, 2012 at 6:57 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I thought it was a reasonable and practical idea from Heikki. The bit
 is not selected arbitrarily, it is by design adjacent to one of the
 other bits. So overall, 3 bits need to be set to a precise value and a
 run of 1s or 0s will throw and error.

Sure, but who is to say that a typical error will look anything like
that?  Anyway, you could check even more bits just as easily; we know
exactly which ones have a plausible reason for being non-zero.

 That having been said, I don't feel very good about the idea of
 relying on the contents of the page to tell us whether or not the page
 has a checksum.  There's no guarantee that an error that flips the
 has-checksum bit will flip any other bit on the page, or that it won't
 flip everything else we're relying on as a backstop in exactly the
 manner that foils whatever algorithm we put in place.  Random
 corruption is, perhaps, unlikely to do that, but somehow I feel like
 it'll happen more often than random chance suggests.  Things never
 fail the way you want them to.

 You're right. This patch is not the best possible world, given a clean
 slate. But we don't have a clean slate.

 The fact is this patch will detect corruptions pretty well and that's
 what Postgres users want.

 While developing this, many obstacles could have been blockers. I
 think we're fairly lucky that I managed to find a way through the
 minefield of obstacles.

I think we could do worse than this patch, but I don't really believe
it's ready for commit.  We don't have a single performance number
showing how much of a performance regression this causes, either on
the master or on the standby, on any workload, much less those where a
problem is reasonably forseeable from the design you've chosen.  Many
controversial aspects of the patch, such as the way you're using the
buffer header spinlocks as a surrogate for x-locking the buffer, or
the right way of obtaining the bit-space the patch requires, haven't
really been discussed, and to the extent that they have been
discussed, they have not been agreed.

On the former point, you haven't updated
src/backend/storage/buffer/README at all; but updating is not by
itself sufficient.  Before we change the buffer-locking rules, we
ought to have some discussion of whether it's OK to do that, and why
it's necessary.  Why it's necessary would presumably include
demonstrating that the performance of the straightforward
implementation stinks, and that with this change is better.  You
haven't made any effort to do that whatsoever, or if you have, you
haven't posted the results here.  I'm pretty un-excited by that
change, first because I think it's a modularity violation and possibly
unsafe, and second because I believe we've already got a problem with
buffer header spinlock contention which this might exacerbate.

 Another disadvantage of the current scheme is that there's no
 particularly easy way to know that your whole cluster has checksums.
 No matter how we implement checksums, you'll have to rewrite every
 table in the cluster in order to get them fully turned on.  But with
 the current design, there's no easy way to know how much of the
 cluster is actually checksummed.

 You can read every block and check.

Using what tool?

 If you shut checksums off, they'll
 linger until those pages are rewritten, and there's no easy way to
 find the relations from which they need to be removed, either.

 We can't have it both ways. Either we have an easy upgrade, or we
 don't. I'm told that an easy upgrade is essential.

Easy upgrade and removal of checksums are unrelated issues AFAICS.

 I'm tempted to suggest a relation-level switch: when you want
 checksums, you use ALTER TABLE to turn them on, and when you don't
 want them any more you use ALTER TABLE to shut them off again, in each
 case rewriting the table.  That way, there's never any ambiguity about
 what's in the data pages in a given relation: either they're either
 all checksummed, or none of them are.  This moves the decision about
 whether checksums are enabled or disabled quite a but further away
 from the data itself, and also allows you to determine (by catalog
 inspection) which parts of the cluster do in fact have checksums.  It
 might be kind of a pain to implement, though: you'd have to pass the
 information about how any given relation was configured down to the
 place where we validate page sanity.  I'm not sure whether that's
 practical.

 It's not practical as the only mechanism, given the requirement for upgrade.

Why not?

 The version stuff is catered for by the current patch.

Your patch reuses the version number field for an unrelated purpose
and includes some vague hints of how we might do versioning using some
of the page-level flag bits.  Since bit-space is fungible, I think it
makes more sense to keep the version number where it is and carve the
checksum out of whatever bit space you would have moved 

Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-17 Thread Simon Riggs
On Thu, Feb 16, 2012 at 1:53 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Feb 16, 2012 at 6:16 AM, Simon Riggs si...@2ndquadrant.com wrote:
 v8 attached

 It's hard to believe that this version has been tested terribly
 thoroughly, because it doesn't compile.

I'm just back home from a few days off grid.

It's possible it doesn't compile against current HEAD, though it
certainly does compile and work against my last git pull.

I will look into your comments in detail tomorrow morning. Thank you
for looking at the patch.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-16 Thread Albert Cervera i Areny
A Dijous, 16 de febrer de 2012 12:16:31, Simon Riggs va escriure:
 v8 attached

Maybe the docs should include what will happen if the checksum is not correct?

-- 
Albert Cervera i Areny
http://www.NaN-tic.com
Tel: +34 93 553 18 03

http://twitter.com/albertnan 
http://www.nan-tic.com/blog


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-16 Thread Robert Haas
On Thu, Feb 16, 2012 at 6:16 AM, Simon Riggs si...@2ndquadrant.com wrote:
 v8 attached

It's hard to believe that this version has been tested terribly
thoroughly, because it doesn't compile.

+   LockBufHdr(buf);
+
+   /*
+* Run PageGetLSN while holding header lock.
+*/
+   recptr = BufferGetLSN(buf);
+
+   /* To check if block content changes while flushing. - vadim 01/17/97 */
+   buf-flags = ~BM_JUST_DIRTIED;
+   UnlockBufHdr(buf);
+

This doesn't seem very helpful.  It's obvious even without the comment
that we're running PageGetLSN while holding the header lock.  What's
not obvious, and what the comment should be explaining, is why we're
doing that.

+   /*
+* If we're in recovery we cannot dirty a page because 
of a hint.
+* We can set the hint, just not dirty the page as a 
result so
+* the hint is lost when we evict the page or shutdown.
+*
+* See long discussion in bufpage.c
+*/
+   if (RecoveryInProgress())
+   return;

Doesn't this seem awfully bad for performance on Hot Standby servers?
I agree that it fixes the problem with un-WAL-logged pages there, but
I seem to recall some recent complaining about performance features
that work on the master but not the standby.  Durable hint bits are
one such feature.

+* Basically, we simply prevent the checkpoint WAL 
record from
+* being written until we have marked the buffer dirty. 
We don't
+* start the checkpoint flush until we have marked 
dirty, so our
+* checkpoint must flush the change to disk 
successfully or the
+* checkpoint never gets written, so crash recovery 
will fix.
+*
+* It's possible we may enter here without an xid, so 
it is
+* essential that CreateCheckpoint waits for virtual 
transactions
+* rather than full transactionids.
+*/
+   MyPgXact-delayChkpt = delayChkpt = true;

I am slightly worried that this expansion in the use of this mechanism
(formerly called inCommit, for those following along at home) could
lead to checkpoint starvation.  Suppose we've got one or two large
table scans wandering along, setting hint bits, and now suddenly it's
time to checkpoint.  How long will it take the checkpoint process to
find a time when nobody's got delayChkpt set?

+ #define PageSetChecksum(page) \
+ do \
+ { \
+   PageHeader  p = (PageHeader) page; \
+   p-pd_flags |= PD_PAGE_VERSION_PLUS1; \
+   p-pd_flags |= PD_CHECKSUM1; \
+   p-pd_flags = ~PD_CHECKSUM2; \
+   p-pd_verify.pd_checksum16 = PageCalcChecksum16(page); \
+ } while (0);
+
+ /* ensure any older checksum info is overwritten with watermark */
+ #define PageResetVersion(page) \
+ do \
+ { \
+   PageHeader  p = (PageHeader) page; \
+   if (!PageHasNoChecksum(p)) \
+   { \
+   p-pd_flags = ~PD_PAGE_VERSION_PLUS1; \
+   p-pd_flags = ~PD_CHECKSUM1; \
+   p-pd_flags = ~PD_CHECKSUM2; \
+   PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION); \
+   } \
+ } while (0);

So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
doesn't have a checksum, PD_CHECKSUM2 is not set?  What good does that
do?

   * PageGetPageSize
   *Returns the page size of a page.
   *
!  * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ.
!  * This can be called on any page, initialised or not, in or out of buffers.
!  * You might think this can vary at runtime but you'd be wrong, since pages
!  * frequently need to occupy buffers and pages are copied from one to another
!  * so there are many hidden assumptions that this simple definition is true.
   */
! #define PageGetPageSize(page) (BLCKSZ)

I think I agree that the current definition of PageGetPageSize seems
unlikely to come anywhere close to allowing us to cope with multiple
page sizes, but I think this method of disabling it is a hack.  The
callers that want to know how big the page really is should just use
BLCKSZ instead of this macro, and those that want to know how big the
page THINKS it is (notably contrib/pageinspect) need a way to get that
information.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-09 Thread Simon Riggs
On Wed, Feb 8, 2012 at 2:05 PM, Noah Misch n...@leadboat.com wrote:
 On Wed, Feb 08, 2012 at 11:40:34AM +, Simon Riggs wrote:
 On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch n...@leadboat.com wrote:
  On Tue, Feb 07, 2012 at 08:58:59PM +, Simon Riggs wrote:
  On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch n...@leadboat.com wrote:
   This patch uses FPIs to guard against torn hint writes, even when the
   checksums are disabled. ?One could not simply condition them on the
   page_checksums setting, though. ?Suppose page_checksums=off and we're 
   hinting
   a page previously written with page_checksums=on. ?If that write tears,
   leaving the checksum intact, that block will now fail verification. ?A 
   couple
   of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only 
   if the
   old page had a checksum. ?(b) Add a checksumEnableLSN field to 
   pg_control.
   Whenever the cluster starts with checksums disabled, set the field to
   InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled 
   and the
   field is InvalidXLogRecPtr, set the field to the next LSN. ?When a 
   checksum
   failure occurs in a page with LSN older than the stored one, emit 
   either a
   softer warning or no message at all.
 
  We can only change page_checksums at restart (now) so the above seems 
  moot.
 
  If we crash with FPWs enabled we repair any torn pages.
 
  There's no live bug, but that comes at a high cost: the patch has us emit
  full-page images for hint bit writes regardless of the page_checksums 
  setting.

 Sorry, I don't understand what you mean. I don't see any failure cases
 that require that.

 page_checksums can only change at a shutdown checkpoint,

 The statement If that write tears,
   leaving the checksum intact, that block will now fail verification.
 cannot happen, ISTM.

 If we write out a block we update the checksum if page_checksums is
 set, or we clear it. If we experience a torn page at crash, the FPI
 corrects that, so the checksum never does fail verification. We only
 need to write a FPI when we write with checksums.

 If that's wrong, please explain a failure case in detail.

 In the following description, I will treat a heap page as having two facts.
 The first is either CHKSUM or !CHKSUM, and the second is either HINT or !HINT.
 A torn page write lacking the protection of a full-page image can update one
 or both facts despite the transaction having logically updated both.  (This is
 just the normal torn-page scenario.)  Given that, consider the following
 sequence of events:

 1. startup with page_checksums = on
 2. update page P with facts CHKSUM, !HINT
 3. clean write of P
 4. clean shutdown

 5. startup with page_checksums = off
 6. update P with facts !CHKSUM, HINT.  no WAL since page_checksums=off
 7. begin write of P
 8. crash.  torn write of P only updates HINT.  disk state now CHKSUM, HINT

 9. startup with page_checksums = off
 10. crash recovery.  does not touch P, because step 6 generated no WAL
 11. clean shutdown

 12. startup with page_checksums = on
 13. read P.  checksum failure

 Again, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch,
 because step 6 _does_ write WAL.  That ought to change for the sake of
 page_checksums=off performance, at which point we must protect the above
 scenario by other means.

Thanks for explaining.

Logic fixed.

   PageSetLSN() is not atomic, so the shared buffer content lock we'll be 
   holding
   is insufficient.
 
  Am serialising this by only writing PageLSN while holding buf hdr lock.
 
  That means also taking the buffer header spinlock in every PageGetLSN() 
  caller
  holding only a shared buffer content lock. ?Do you think that will pay off,
  versus the settled pattern of trading here your shared buffer content lock 
  for
  an exclusive one?

 Yes, I think it will pay off. This is the only code location where we
 do that, and we are already taking the buffer header lock, so there is
 effectively no overhead.

 The sites I had in the mind are the calls to PageGetLSN() in FlushBuffer()
 (via BufferGetLSN()) and XLogCheckBuffer().  We don't take the buffer header
 spinlock at either of those locations.  If they remain safe under the new
 rules, we'll need comments explaining why.  I think they may indeed be safe,
 but it's far from obvious.

OK, some slight rework required to do that, no problems.

I checked all other call sites and have updated README to explain.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-08 Thread Simon Riggs
On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch n...@leadboat.com wrote:
 On Tue, Feb 07, 2012 at 08:58:59PM +, Simon Riggs wrote:
 On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch n...@leadboat.com wrote:
  On Wed, Jan 11, 2012 at 10:12:31PM +, Simon Riggs wrote:

  This patch uses FPIs to guard against torn hint writes, even when the
  checksums are disabled. ?One could not simply condition them on the
  page_checksums setting, though. ?Suppose page_checksums=off and we're 
  hinting
  a page previously written with page_checksums=on. ?If that write tears,
  leaving the checksum intact, that block will now fail verification. ?A 
  couple
  of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only if 
  the
  old page had a checksum. ?(b) Add a checksumEnableLSN field to pg_control.
  Whenever the cluster starts with checksums disabled, set the field to
  InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled and 
  the
  field is InvalidXLogRecPtr, set the field to the next LSN. ?When a checksum
  failure occurs in a page with LSN older than the stored one, emit either a
  softer warning or no message at all.

 We can only change page_checksums at restart (now) so the above seems moot.

 If we crash with FPWs enabled we repair any torn pages.

 There's no live bug, but that comes at a high cost: the patch has us emit
 full-page images for hint bit writes regardless of the page_checksums setting.

Sorry, I don't understand what you mean. I don't see any failure cases
that require that.

page_checksums can only change at a shutdown checkpoint,

The statement If that write tears,
  leaving the checksum intact, that block will now fail verification.
cannot happen, ISTM.

If we write out a block we update the checksum if page_checksums is
set, or we clear it. If we experience a torn page at crash, the FPI
corrects that, so the checksum never does fail verification. We only
need to write a FPI when we write with checksums.

If that's wrong, please explain a failure case in detail.


  PageSetLSN() is not atomic, so the shared buffer content lock we'll be 
  holding
  is insufficient.

 Am serialising this by only writing PageLSN while holding buf hdr lock.

 That means also taking the buffer header spinlock in every PageGetLSN() caller
 holding only a shared buffer content lock.  Do you think that will pay off,
 versus the settled pattern of trading here your shared buffer content lock for
 an exclusive one?

Yes, I think it will pay off. This is the only code location where we
do that, and we are already taking the buffer header lock, so there is
effectively no overhead.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-08 Thread Noah Misch
On Wed, Feb 08, 2012 at 11:40:34AM +, Simon Riggs wrote:
 On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch n...@leadboat.com wrote:
  On Tue, Feb 07, 2012 at 08:58:59PM +, Simon Riggs wrote:
  On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch n...@leadboat.com wrote:
   This patch uses FPIs to guard against torn hint writes, even when the
   checksums are disabled. ?One could not simply condition them on the
   page_checksums setting, though. ?Suppose page_checksums=off and we're 
   hinting
   a page previously written with page_checksums=on. ?If that write tears,
   leaving the checksum intact, that block will now fail verification. ?A 
   couple
   of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only 
   if the
   old page had a checksum. ?(b) Add a checksumEnableLSN field to 
   pg_control.
   Whenever the cluster starts with checksums disabled, set the field to
   InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled 
   and the
   field is InvalidXLogRecPtr, set the field to the next LSN. ?When a 
   checksum
   failure occurs in a page with LSN older than the stored one, emit either 
   a
   softer warning or no message at all.
 
  We can only change page_checksums at restart (now) so the above seems moot.
 
  If we crash with FPWs enabled we repair any torn pages.
 
  There's no live bug, but that comes at a high cost: the patch has us emit
  full-page images for hint bit writes regardless of the page_checksums 
  setting.
 
 Sorry, I don't understand what you mean. I don't see any failure cases
 that require that.
 
 page_checksums can only change at a shutdown checkpoint,
 
 The statement If that write tears,
   leaving the checksum intact, that block will now fail verification.
 cannot happen, ISTM.
 
 If we write out a block we update the checksum if page_checksums is
 set, or we clear it. If we experience a torn page at crash, the FPI
 corrects that, so the checksum never does fail verification. We only
 need to write a FPI when we write with checksums.
 
 If that's wrong, please explain a failure case in detail.

In the following description, I will treat a heap page as having two facts.
The first is either CHKSUM or !CHKSUM, and the second is either HINT or !HINT.
A torn page write lacking the protection of a full-page image can update one
or both facts despite the transaction having logically updated both.  (This is
just the normal torn-page scenario.)  Given that, consider the following
sequence of events:

1. startup with page_checksums = on
2. update page P with facts CHKSUM, !HINT
3. clean write of P
4. clean shutdown

5. startup with page_checksums = off
6. update P with facts !CHKSUM, HINT.  no WAL since page_checksums=off
7. begin write of P
8. crash.  torn write of P only updates HINT.  disk state now CHKSUM, HINT

9. startup with page_checksums = off
10. crash recovery.  does not touch P, because step 6 generated no WAL
11. clean shutdown

12. startup with page_checksums = on
13. read P.  checksum failure

Again, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch,
because step 6 _does_ write WAL.  That ought to change for the sake of
page_checksums=off performance, at which point we must protect the above
scenario by other means.

   PageSetLSN() is not atomic, so the shared buffer content lock we'll be 
   holding
   is insufficient.
 
  Am serialising this by only writing PageLSN while holding buf hdr lock.
 
  That means also taking the buffer header spinlock in every PageGetLSN() 
  caller
  holding only a shared buffer content lock. ?Do you think that will pay off,
  versus the settled pattern of trading here your shared buffer content lock 
  for
  an exclusive one?
 
 Yes, I think it will pay off. This is the only code location where we
 do that, and we are already taking the buffer header lock, so there is
 effectively no overhead.

The sites I had in the mind are the calls to PageGetLSN() in FlushBuffer()
(via BufferGetLSN()) and XLogCheckBuffer().  We don't take the buffer header
spinlock at either of those locations.  If they remain safe under the new
rules, we'll need comments explaining why.  I think they may indeed be safe,
but it's far from obvious.

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-07 Thread Simon Riggs
On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch n...@leadboat.com wrote:
 On Wed, Jan 11, 2012 at 10:12:31PM +, Simon Riggs wrote:
 v7

Thanks very much for the review. Just realised I hadn't actually replied...

 This patch uses FPIs to guard against torn hint writes, even when the
 checksums are disabled.  One could not simply condition them on the
 page_checksums setting, though.  Suppose page_checksums=off and we're hinting
 a page previously written with page_checksums=on.  If that write tears,
 leaving the checksum intact, that block will now fail verification.  A couple
 of ideas come to mind.  (a) Read the on-disk page and emit an FPI only if the
 old page had a checksum.  (b) Add a checksumEnableLSN field to pg_control.
 Whenever the cluster starts with checksums disabled, set the field to
 InvalidXLogRecPtr.  Whenever the cluster starts with checksums enabled and the
 field is InvalidXLogRecPtr, set the field to the next LSN.  When a checksum
 failure occurs in a page with LSN older than the stored one, emit either a
 softer warning or no message at all.

We can only change page_checksums at restart (now) so the above seems moot.

If we crash with FPWs enabled we repair any torn pages.

 Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty
 the buffer.  Here are other sites I noticed that do MarkBufferDirty() without
 bumping the page LSN:
        heap_xlog_visible()
        visibilitymap_clear()
        visibilitymap_truncate()
        lazy_scan_heap()
        XLogRecordPageWithFreeSpace()
        FreeSpaceMapTruncateRel()
        fsm_set_and_search()
        fsm_vacuum_page()
        fsm_search_avail()
 Though I have not investigated each of these in detail, I suspect most or all
 represent continued torn-page hazards.  Since the FSM is just a hint, we do
 not WAL-log it at all; it would be nicest to always skip checksums for it,
 too.  The visibility map shortcuts are more nuanced.

Still checking all, but not as bad as the list looks.

I'm removing chksum code from smgr and putting back into bufmgr and
other locations.

Patch footprint now looks like this.

 doc/src/sgml/config.sgml  |   40 +++
 src/backend/access/hash/hashpage.c|1
 src/backend/access/heap/rewriteheap.c |4
 src/backend/access/heap/visibilitymap.c   |1
 src/backend/access/nbtree/nbtree.c|1
 src/backend/access/nbtree/nbtsort.c   |3
 src/backend/access/spgist/spginsert.c |2
 src/backend/access/transam/twophase.c |   16 -
 src/backend/access/transam/xact.c |8
 src/backend/access/transam/xlog.c |  114 
 src/backend/commands/tablecmds.c  |2
 src/backend/storage/buffer/bufmgr.c   |   75 +
 src/backend/storage/buffer/localbuf.c |2
 src/backend/storage/ipc/procarray.c   |   63 +++-
 src/backend/storage/lmgr/proc.c   |4
 src/backend/storage/page/bufpage.c|  331 +-
 src/backend/utils/misc/guc.c  |   14 +
 src/backend/utils/misc/postgresql.conf.sample |   15 -
 src/include/access/xlog.h |1
 src/include/catalog/pg_control.h  |3
 src/include/catalog/storage.h |1
 src/include/storage/bufpage.h |  107 ++--
 src/include/storage/proc.h|3
 src/include/storage/procarray.h   |4
 24 files changed, 743 insertions(+), 72 deletions(-)

Patch is *not* attached here, yet. Still working on it.

 When page_checksums=off and we read a page last written by page_checksums=on,
 we still verify its checksum.  If a mostly-insert/read site uses checksums for
 some time and eventually wishes to shed the overhead, disabling the feature
 will not stop the costs for reads of old data.  They would need a dump/reload
 to get the performance of a never-checksummed database.  Let's either
 unconditionally skip checksum validation under page_checksums=off or add a new
 state like page_checksums=ignore for that even-weaker condition.

Agreed, changed.

 --- a/doc/src/sgml/config.sgml
 +++ b/doc/src/sgml/config.sgml

 +       para
 +        Turning this parameter off speeds normal operation, but
 +        might allow data corruption to go unnoticed. The checksum uses
 +        16-bit checksums, using the fast Fletcher 16 algorithm. With this
 +        parameter enabled there is still a non-zero probability that an 
 error
 +        could go undetected, as well as a non-zero probability of false
 +        positives.
 +       /para

 What sources of false positives do you intend to retain?

I see none. Will reword.

 --- a/src/backend/catalog/storage.c
 +++ b/src/backend/catalog/storage.c
 @@ -20,6 +20,7 @@
  #include postgres.h

  #include access/visibilitymap.h
 +#include access/transam.h
  #include access/xact.h
  #include access/xlogutils.h
  #include catalog/catalog.h
 @@ 

Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-07 Thread Noah Misch
On Tue, Feb 07, 2012 at 08:58:59PM +, Simon Riggs wrote:
 On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch n...@leadboat.com wrote:
  On Wed, Jan 11, 2012 at 10:12:31PM +, Simon Riggs wrote:

  This patch uses FPIs to guard against torn hint writes, even when the
  checksums are disabled. ?One could not simply condition them on the
  page_checksums setting, though. ?Suppose page_checksums=off and we're 
  hinting
  a page previously written with page_checksums=on. ?If that write tears,
  leaving the checksum intact, that block will now fail verification. ?A 
  couple
  of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only if 
  the
  old page had a checksum. ?(b) Add a checksumEnableLSN field to pg_control.
  Whenever the cluster starts with checksums disabled, set the field to
  InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled and 
  the
  field is InvalidXLogRecPtr, set the field to the next LSN. ?When a checksum
  failure occurs in a page with LSN older than the stored one, emit either a
  softer warning or no message at all.
 
 We can only change page_checksums at restart (now) so the above seems moot.
 
 If we crash with FPWs enabled we repair any torn pages.

There's no live bug, but that comes at a high cost: the patch has us emit
full-page images for hint bit writes regardless of the page_checksums setting.

  PageSetLSN() is not atomic, so the shared buffer content lock we'll be 
  holding
  is insufficient.
 
 Am serialising this by only writing PageLSN while holding buf hdr lock.

That means also taking the buffer header spinlock in every PageGetLSN() caller
holding only a shared buffer content lock.  Do you think that will pay off,
versus the settled pattern of trading here your shared buffer content lock for
an exclusive one?

  I can see value in an option to exclude local buffers, since corruption 
  there
  may be less exciting. ?It doesn't seem important for an initial patch, 
  though.
 
 I'm continuing to exclude local buffers. Let me know if that should change.

Seems reasonable.

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Simon Riggs
On Mon, Feb 6, 2012 at 4:48 AM, Bruce Momjian br...@momjian.us wrote:
 On Sun, Feb 05, 2012 at 08:40:09PM +, Simon Riggs wrote:
 On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian br...@momjian.us wrote:
  On Sat, Dec 24, 2011 at 03:56:58PM +, Simon Riggs wrote:
   Also, as far as I can see this patch usurps the page version field,
   which I find unacceptably short-sighted.  Do you really think this is
   the last page layout change we'll ever make?
 
  No, I don't. I hope and expect the next page layout change to
  reintroduce such a field.
 
  But since we're agreed now that upgrading is important, changing page
  format isn't likely to be happening until we get an online upgrade
  process. So future changes are much less likely. If they do happen, we
  have some flag bits spare that can be used to indicate later versions.
  It's not the prettiest thing in the world, but it's a small ugliness
  in return for an important feature. If there was a way without that, I
  would have chosen it.
 
  Have you considered the CRC might match a valuid page version number?
  Is that safe?

 In the proposed scheme there are two flag bits set on the page to
 indicate whether the field is used as a checksum rather than a version
 number. So its possible the checksum could look like a valid page
 version number, but we'd still be able to tell the difference.

 Thanks.  Clearly we don't need 16 bits to represent our page version
 number because we rarely change it.  The other good thing is I don't
 remember anyone wanting additional per-page storage in the past few
 years except for a checksum.

 I wonder if we should just dedicate 3 page header bits, call that the
 page version number, and set this new version number to 1, and assume
 all previous versions were zero, and have them look in the old page
 version location if the new version number is zero.  I am basically
 thinking of how we can plan ahead to move the version number to a new
 location and have a defined way of finding the page version number using
 old and new schemes.

 I don't want to get to a point where we need a bit per page number
 version.

Agreed, though I think we need at least 2 bits set if we are using the
checksum, so we should start at version 2 to ensure that.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Simon Riggs
On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 06.02.2012 05:48, Bruce Momjian wrote:

 On Sun, Feb 05, 2012 at 08:40:09PM +, Simon Riggs wrote:

 In the proposed scheme there are two flag bits set on the page to
 indicate whether the field is used as a checksum rather than a version
 number. So its possible the checksum could look like a valid page
 version number, but we'd still be able to tell the difference.


 Thanks.  Clearly we don't need 16 bits to represent our page version
 number because we rarely change it. The other good thing is I don't
 remember anyone wanting additional per-page storage in the past few
 years except for a checksum.


 There's this idea that I'd like to see implemented:
 http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php

As you'll note, adding that field will change the page format and is
therefore ruled out for 9.2.

ISTM there is a different way to handle that anyway. All we need to do
is to record the LSN of the last wraparound in shared memory/control
file. Any block with page LSN older than that has all-frozen rows.
That doesn't use any space nor does it require another field to be
set.

That leaves the same problem that if someone writes to the page you
need to freeze the rows first.

 In any case, I think it's a very bad idea to remove the page version field.
 How are we supposed to ever be able to change the page format again if we
 don't even have a version number on the page? I strongly oppose removing it.

Nobody is removing the version field, nor is anybody suggesting not
being able to tell which page version we are looking at.

 I'm also not very comfortable with the idea of having flags on the page
 indicating whether it has a checksum or not. It's not hard to imagine a
 software of firmware bug or hardware failure that would cause pd_flags field
 to be zeroed out altogether. It would be more robust if the expected
 bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that
 at upgrade somehow. And it still feels a bit whacky anyway.

Good idea. Lets use

0-0-0 to represent upgraded from previous version, needs a bit set
0-0-1 to represent new version number of page, no checksum
1-1-1 to represent new version number of page, with checksum

So we have 1 bit dedicated to the page version, 2 bits to the checksum indicator

 I wonder if we should just dedicate 3 page header bits, call that the
 page version number, and set this new version number to 1, and assume
 all previous versions were zero, and have them look in the old page
 version location if the new version number is zero.  I am basically
 thinking of how we can plan ahead to move the version number to a new
 location and have a defined way of finding the page version number using
 old and new schemes.


 Three bits seems short-sighted, but yeah, something like 6-8 bits should be
 enough. On the whole, though. I think we should bite the bullet and invent a
 way to extend the page header at upgrade.

There are currently many spare bits. I don't see any need to allocate
them to this specific use ahead of time - especially since that is the
exact decision we took last time when we reserved 16 bits for the
version.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Heikki Linnakangas

On 06.02.2012 10:05, Simon Riggs wrote:

On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

On 06.02.2012 05:48, Bruce Momjian wrote:


Thanks.  Clearly we don't need 16 bits to represent our page version
number because we rarely change it. The other good thing is I don't
remember anyone wanting additional per-page storage in the past few
years except for a checksum.


There's this idea that I'd like to see implemented:
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php


As you'll note, adding that field will change the page format and is
therefore ruled out for 9.2.

ISTM there is a different way to handle that anyway. All we need to do
is to record the LSN of the last wraparound in shared memory/control
file. Any block with page LSN older than that has all-frozen rows.
That doesn't use any space nor does it require another field to be
set.


Good idea. However, the followup idea to that discussion was to not only 
avoid the I/O needed to mark tuples as frozen, but to avoid xid 
wraparound altogether, by allowing clog to grow indefinitely. You do 
want to freeze at some point of course, to truncate the clog, but it 
would be nice to not have a hard limit. The way to do that is to store 
an xid epoch in the page header, so that Xids are effectively 64-bits 
wide, even though the xid fields on the tuple header are only 32-bits 
wide. That does require a new field in the page header.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Simon Riggs
On Mon, Feb 6, 2012 at 10:02 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 06.02.2012 10:05, Simon Riggs wrote:

 On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 On 06.02.2012 05:48, Bruce Momjian wrote:


 Thanks.  Clearly we don't need 16 bits to represent our page version
 number because we rarely change it. The other good thing is I don't
 remember anyone wanting additional per-page storage in the past few
 years except for a checksum.


 There's this idea that I'd like to see implemented:
 http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php


 As you'll note, adding that field will change the page format and is
 therefore ruled out for 9.2.

 ISTM there is a different way to handle that anyway. All we need to do
 is to record the LSN of the last wraparound in shared memory/control
 file. Any block with page LSN older than that has all-frozen rows.
 That doesn't use any space nor does it require another field to be
 set.


 Good idea. However, the followup idea to that discussion was to not only
 avoid the I/O needed to mark tuples as frozen, but to avoid xid wraparound
 altogether, by allowing clog to grow indefinitely. You do want to freeze at
 some point of course, to truncate the clog, but it would be nice to not have
 a hard limit. The way to do that is to store an xid epoch in the page
 header, so that Xids are effectively 64-bits wide, even though the xid
 fields on the tuple header are only 32-bits wide. That does require a new
 field in the page header.

We wouldn't need to do that would we?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Heikki Linnakangas

On 06.02.2012 11:25, Simon Riggs wrote:

On Mon, Feb 6, 2012 at 10:02 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

Good idea. However, the followup idea to that discussion was to not only
avoid the I/O needed to mark tuples as frozen, but to avoid xid wraparound
altogether, by allowing clog to grow indefinitely. You do want to freeze at
some point of course, to truncate the clog, but it would be nice to not have
a hard limit. The way to do that is to store an xid epoch in the page
header, so that Xids are effectively 64-bits wide, even though the xid
fields on the tuple header are only 32-bits wide. That does require a new
field in the page header.


We wouldn't need to do that would we?


Huh? Do you mean that we wouldn't need to implement that feature?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Bruce Momjian
On Mon, Feb 06, 2012 at 09:05:19AM +, Simon Riggs wrote:
  In any case, I think it's a very bad idea to remove the page version field.
  How are we supposed to ever be able to change the page format again if we
  don't even have a version number on the page? I strongly oppose removing it.
 
 Nobody is removing the version field, nor is anybody suggesting not
 being able to tell which page version we are looking at.

Agreed.  I thought the idea was that we have a 16-bit page _version_
number and 8+ free page _flag_ bits, which are all currently zero.  The
idea was to move the version number from 16-bit field into the unused
flag bits, and use the 16-bit field for the checksum.  I would like to
have some logic that would allow tools inspecting the page to tell if
they should look for the version number in the bits at the beginning of
the page or at the end.

Specifically, this becomes the checksum:

uint16  pd_pagesize_version;

and this holds the page version, if we have updated the page to the new
format:

uint16  pd_flags;   /* flag bits, see below */

Of the 16 bits of pd_flags, these are the only ones used:

#define PD_HAS_FREE_LINES   0x0001  /* are there any unused line 
pointers? */
#define PD_PAGE_FULL0x0002  /* not enough free space for new

 * tuple? */
#define PD_ALL_VISIBLE  0x0004  /* all tuples on page are 
visible to

 * everyone */

#define PD_VALID_FLAG_BITS  0x0007  /* OR of all valid pd_flags 
bits */


  I'm also not very comfortable with the idea of having flags on the page
  indicating whether it has a checksum or not. It's not hard to imagine a
  software of firmware bug or hardware failure that would cause pd_flags field
  to be zeroed out altogether. It would be more robust if the expected
  bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that
  at upgrade somehow. And it still feels a bit whacky anyway.

 Good idea. Lets use
 
 0-0-0 to represent upgraded from previous version, needs a bit set
 0-0-1 to represent new version number of page, no checksum
 1-1-1 to represent new version number of page, with checksum
 
 So we have 1 bit dedicated to the page version, 2 bits to the checksum 
 indicator

Interesting point that we would not be guarding against a bit flip from
1 to 0 for the checksum bit; I agree using two bits is the way to go.  I
don't see how upgrade figures into this.

However, I am unclear how Simon's idea above actually works.  We need
two bits for redundancy, both 1, to mark a page as having a checksum.  I
don't think mixing the idea of a new page version and checksum enabled
really makes sense, especially since we have to plan for future page
version changes.

I think we dedicate 2 bits to say we have computed a checksum, and 3
bits to mark up to 8 possible page versions, so the logic is, in
pd_flags, we use bits 0x8 and 0x16 to indicate that a checksum is stored
on the page, and we use 0x32 and later for the page version number.  We
can assume all the remaining bits are for the page version number until
we need to define new bits, and we can start storing them at the end
first, and work forward.  If all the version bits are zero, it means the
page version number is still stored in pd_pagesize_version.

  I wonder if we should just dedicate 3 page header bits, call that the
  page version number, and set this new version number to 1, and assume
  all previous versions were zero, and have them look in the old page
  version location if the new version number is zero.  I am basically
  thinking of how we can plan ahead to move the version number to a new
  location and have a defined way of finding the page version number using
  old and new schemes.
 
 
  Three bits seems short-sighted, but yeah, something like 6-8 bits should be
  enough. On the whole, though. I think we should bite the bullet and invent a
  way to extend the page header at upgrade.
 
 There are currently many spare bits. I don't see any need to allocate
 them to this specific use ahead of time - especially since that is the
 exact decision we took last time when we reserved 16 bits for the
 version.

Right, but I am thinking we should set things up so we can grow the page
version number into the unused bit, rather than box it between bits we
are already using.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Bruce Momjian
On Mon, Feb 06, 2012 at 08:51:34AM +0100, Heikki Linnakangas wrote:
 I wonder if we should just dedicate 3 page header bits, call that the
 page version number, and set this new version number to 1, and assume
 all previous versions were zero, and have them look in the old page
 version location if the new version number is zero.  I am basically
 thinking of how we can plan ahead to move the version number to a new
 location and have a defined way of finding the page version number using
 old and new schemes.
 
 Three bits seems short-sighted, but yeah, something like 6-8 bits
 should be enough. On the whole, though. I think we should bite the
 bullet and invent a way to extend the page header at upgrade.

I just emailed a possible layout that allows for future page version
number expansion.  

I don't think there is any magic bullet that will allow for page header
extension by pg_upgrade.  If it is going to be done, it would have to
happen in the backend while the system is running.  Anything that
requires pg_upgrade to do lots of reads or writes basically makes
pg_upgrade useless.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Bruce Momjian
On Mon, Feb 06, 2012 at 12:59:59PM -0500, Bruce Momjian wrote:
   I'm also not very comfortable with the idea of having flags on the page
   indicating whether it has a checksum or not. It's not hard to imagine a
   software of firmware bug or hardware failure that would cause pd_flags 
   field
   to be zeroed out altogether. It would be more robust if the expected
   bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with 
   that
   at upgrade somehow. And it still feels a bit whacky anyway.
 
  Good idea. Lets use
  
  0-0-0 to represent upgraded from previous version, needs a bit set
  0-0-1 to represent new version number of page, no checksum
  1-1-1 to represent new version number of page, with checksum
  
  So we have 1 bit dedicated to the page version, 2 bits to the checksum 
  indicator
 
 Interesting point that we would not be guarding against a bit flip from
 1 to 0 for the checksum bit; I agree using two bits is the way to go.  I
 don't see how upgrade figures into this.
 
 However, I am unclear how Simon's idea above actually works.  We need
 two bits for redundancy, both 1, to mark a page as having a checksum.  I
 don't think mixing the idea of a new page version and checksum enabled
 really makes sense, especially since we have to plan for future page
 version changes.
 
 I think we dedicate 2 bits to say we have computed a checksum, and 3
 bits to mark up to 8 possible page versions, so the logic is, in
 pd_flags, we use bits 0x8 and 0x16 to indicate that a checksum is stored
 on the page, and we use 0x32 and later for the page version number.  We
 can assume all the remaining bits are for the page version number until
 we need to define new bits, and we can start storing them at the end
 first, and work forward.  If all the version bits are zero, it means the
 page version number is still stored in pd_pagesize_version.

A simpler solution would be to place two bits for checksum after the
existing page bit usage, and place the page version on the last four
bits of the 16-bit field --- that still leaves us with 7 unused bits.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-05 Thread Simon Riggs
On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian br...@momjian.us wrote:
 On Sat, Dec 24, 2011 at 03:56:58PM +, Simon Riggs wrote:
  Also, as far as I can see this patch usurps the page version field,
  which I find unacceptably short-sighted.  Do you really think this is
  the last page layout change we'll ever make?

 No, I don't. I hope and expect the next page layout change to
 reintroduce such a field.

 But since we're agreed now that upgrading is important, changing page
 format isn't likely to be happening until we get an online upgrade
 process. So future changes are much less likely. If they do happen, we
 have some flag bits spare that can be used to indicate later versions.
 It's not the prettiest thing in the world, but it's a small ugliness
 in return for an important feature. If there was a way without that, I
 would have chosen it.

 Have you considered the CRC might match a valuid page version number?
 Is that safe?

In the proposed scheme there are two flag bits set on the page to
indicate whether the field is used as a checksum rather than a version
number. So its possible the checksum could look like a valid page
version number, but we'd still be able to tell the difference.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-05 Thread Bruce Momjian
On Sun, Feb 05, 2012 at 08:40:09PM +, Simon Riggs wrote:
 On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian br...@momjian.us wrote:
  On Sat, Dec 24, 2011 at 03:56:58PM +, Simon Riggs wrote:
   Also, as far as I can see this patch usurps the page version field,
   which I find unacceptably short-sighted.  Do you really think this is
   the last page layout change we'll ever make?
 
  No, I don't. I hope and expect the next page layout change to
  reintroduce such a field.
 
  But since we're agreed now that upgrading is important, changing page
  format isn't likely to be happening until we get an online upgrade
  process. So future changes are much less likely. If they do happen, we
  have some flag bits spare that can be used to indicate later versions.
  It's not the prettiest thing in the world, but it's a small ugliness
  in return for an important feature. If there was a way without that, I
  would have chosen it.
 
  Have you considered the CRC might match a valuid page version number?
  Is that safe?
 
 In the proposed scheme there are two flag bits set on the page to
 indicate whether the field is used as a checksum rather than a version
 number. So its possible the checksum could look like a valid page
 version number, but we'd still be able to tell the difference.

Thanks.  Clearly we don't need 16 bits to represent our page version
number because we rarely change it.  The other good thing is I don't
remember anyone wanting additional per-page storage in the past few
years except for a checksum.  

I wonder if we should just dedicate 3 page header bits, call that the
page version number, and set this new version number to 1, and assume
all previous versions were zero, and have them look in the old page
version location if the new version number is zero.  I am basically
thinking of how we can plan ahead to move the version number to a new
location and have a defined way of finding the page version number using
old and new schemes.

I don't want to get to a point where we need a bit per page number
version.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-05 Thread Heikki Linnakangas

On 06.02.2012 05:48, Bruce Momjian wrote:

On Sun, Feb 05, 2012 at 08:40:09PM +, Simon Riggs wrote:

In the proposed scheme there are two flag bits set on the page to
indicate whether the field is used as a checksum rather than a version
number. So its possible the checksum could look like a valid page
version number, but we'd still be able to tell the difference.


Thanks.  Clearly we don't need 16 bits to represent our page version
number because we rarely change it. The other good thing is I don't
remember anyone wanting additional per-page storage in the past few
years except for a checksum.


There's this idea that I'd like to see implemented: 
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php


In any case, I think it's a very bad idea to remove the page version 
field. How are we supposed to ever be able to change the page format 
again if we don't even have a version number on the page? I strongly 
oppose removing it.


I'm also not very comfortable with the idea of having flags on the page 
indicating whether it has a checksum or not. It's not hard to imagine a 
software of firmware bug or hardware failure that would cause pd_flags 
field to be zeroed out altogether. It would be more robust if the 
expected bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to 
deal with that at upgrade somehow. And it still feels a bit whacky anyway.



I wonder if we should just dedicate 3 page header bits, call that the
page version number, and set this new version number to 1, and assume
all previous versions were zero, and have them look in the old page
version location if the new version number is zero.  I am basically
thinking of how we can plan ahead to move the version number to a new
location and have a defined way of finding the page version number using
old and new schemes.


Three bits seems short-sighted, but yeah, something like 6-8 bits should 
be enough. On the whole, though. I think we should bite the bullet and 
invent a way to extend the page header at upgrade.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-04 Thread Bruce Momjian
On Sun, Dec 25, 2011 at 04:25:19PM +0100, Martijn van Oosterhout wrote:
 On Sat, Dec 24, 2011 at 04:01:02PM +, Simon Riggs wrote:
  On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund and...@anarazel.de wrote:
   Why don't you use the same tricks as the former patch and copy the buffer,
   compute the checksum on that, and then write out that copy (you can even 
   do
   both at the same time). I have a hard time believing that the additional 
   copy
   is more expensive than the locking.
  
  ISTM we can't write and copy at the same time because the cheksum is
  not a trailer field.
 
 Ofcourse you can. If the checksum is in the trailer field you get the
 nice property that the whole block has a constant checksum. However, if
 you store the checksum elsewhere you just need to change the checking
 algorithm to copy the checksum out, zero those bytes and run the
 checksum and compare with the extracted checksum.
 
 Not pretty, but I don't think it makes a difference in performence.

Sorry to be late replying to this, but an interesting idea would be to
zero the _hint_ bits before doing the CRC checksum.  That would avoid
the problem of WAL-logging the hint bits.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-04 Thread Bruce Momjian
On Sat, Dec 24, 2011 at 03:56:58PM +, Simon Riggs wrote:
  Also, as far as I can see this patch usurps the page version field,
  which I find unacceptably short-sighted.  Do you really think this is
  the last page layout change we'll ever make?
 
 No, I don't. I hope and expect the next page layout change to
 reintroduce such a field.
 
 But since we're agreed now that upgrading is important, changing page
 format isn't likely to be happening until we get an online upgrade
 process. So future changes are much less likely. If they do happen, we
 have some flag bits spare that can be used to indicate later versions.
 It's not the prettiest thing in the world, but it's a small ugliness
 in return for an important feature. If there was a way without that, I
 would have chosen it.

Have you considered the CRC might match a valuid page version number? 
Is that safe?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-30 Thread Robert Haas
On Fri, Jan 27, 2012 at 4:07 PM, Dan Scales sca...@vmware.com wrote:
 The advantage of putting the checksum calculation in smgrwrite() (or 
 mdwrite()) is that it catches a bunch of page writes that don't go through 
 the buffer pool (see calls to smgrwrite() in nbtree.c, nbtsort.c, spginsert.c)

Maybe we should have some sort of wrapper function, then.  I really
dislike the idea that the smgr layer knows anything about the page
format, and if md has to know that's even worse.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-28 Thread Simon Riggs
On Fri, Jan 27, 2012 at 9:07 PM, Dan Scales sca...@vmware.com wrote:

 The advantage of putting the checksum calculation in smgrwrite() (or 
 mdwrite()) is that it catches a bunch of page writes that don't go through 
 the buffer pool (see calls to smgrwrite() in nbtree.c, nbtsort.c, spginsert.c)

I'll have another look at that. Seems like we can make it various
ways, we just need to decide the code placement.

 Also, I missed this before:  don't you want to add the checksum calculation 
 (PageSetVerificationInfo) to mdextend() (or preferably smgrextend()) as well? 
  Otherwise, you won't be checksumming a bunch of the new pages.

You don't need to checksum the extend because no data is written at
that point. It create a new block that will become dirty at some point
and then be written out, which will trigger the checksum.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-28 Thread Heikki Linnakangas

On 28.01.2012 15:49, Simon Riggs wrote:

On Fri, Jan 27, 2012 at 9:07 PM, Dan Scalessca...@vmware.com  wrote:


Also, I missed this before:  don't you want to add the checksum calculation 
(PageSetVerificationInfo) to mdextend() (or preferably smgrextend()) as well?  
Otherwise, you won't be checksumming a bunch of the new pages.


You don't need to checksum the extend because no data is written at
that point.


That's not correct. smgrextend writes a block of data just like 
smgrwrite does. When a relation is extended by the buffer manager, it 
calls smgrextend with an all-zeros block, but not all callers do that. 
See rewriteheap.c and nbtsort.c for counter-examples.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-27 Thread Robert Haas
On Thu, Jan 26, 2012 at 7:01 PM, Dan Scales sca...@vmware.com wrote:
 I'm not sure why you moved the checksum calculation (PageSetVerificationInfo) 
 to mdwrite() rather than smgrwrite().  If there were every another storage 
 backend, it would have to duplicate the checksum check, right?  Is there a 
 disadvantage to putting it in smgrwrite()?

The smgr and md layers don't currently know anything about the page
format, and I imagine we want to keep it that way.  It seems like the
right place for this is in some higher layer, like bufmgr.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-27 Thread Dan Scales
The advantage of putting the checksum calculation in smgrwrite() (or mdwrite()) 
is that it catches a bunch of page writes that don't go through the buffer pool 
(see calls to smgrwrite() in nbtree.c, nbtsort.c, spginsert.c)

Also, I missed this before:  don't you want to add the checksum calculation 
(PageSetVerificationInfo) to mdextend() (or preferably smgrextend()) as well?  
Otherwise, you won't be checksumming a bunch of the new pages.

Dan

- Original Message -
From: Robert Haas robertmh...@gmail.com
To: Dan Scales sca...@vmware.com
Cc: Noah Misch n...@leadboat.com, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com, Andres Freund and...@anarazel.de, 
Kevin Grittner kevin.gritt...@wicourts.gov, da...@fetter.org, 
ai...@highrise.ca, st...@mit.edu, pgsql-hackers@postgresql.org, Simon Riggs 
si...@2ndquadrant.com
Sent: Friday, January 27, 2012 5:19:32 AM
Subject: Re: [HACKERS] 16-bit page checksums for 9.2

On Thu, Jan 26, 2012 at 7:01 PM, Dan Scales sca...@vmware.com wrote:
 I'm not sure why you moved the checksum calculation (PageSetVerificationInfo) 
 to mdwrite() rather than smgrwrite().  If there were every another storage 
 backend, it would have to duplicate the checksum check, right?  Is there a 
 disadvantage to putting it in smgrwrite()?

The smgr and md layers don't currently know anything about the page
format, and I imagine we want to keep it that way.  It seems like the
right place for this is in some higher layer, like bufmgr.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-26 Thread Noah Misch
On Wed, Jan 11, 2012 at 10:12:31PM +, Simon Riggs wrote:
 v7

This patch uses FPIs to guard against torn hint writes, even when the
checksums are disabled.  One could not simply condition them on the
page_checksums setting, though.  Suppose page_checksums=off and we're hinting
a page previously written with page_checksums=on.  If that write tears,
leaving the checksum intact, that block will now fail verification.  A couple
of ideas come to mind.  (a) Read the on-disk page and emit an FPI only if the
old page had a checksum.  (b) Add a checksumEnableLSN field to pg_control.
Whenever the cluster starts with checksums disabled, set the field to
InvalidXLogRecPtr.  Whenever the cluster starts with checksums enabled and the
field is InvalidXLogRecPtr, set the field to the next LSN.  When a checksum
failure occurs in a page with LSN older than the stored one, emit either a
softer warning or no message at all.

Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty
the buffer.  Here are other sites I noticed that do MarkBufferDirty() without
bumping the page LSN:
heap_xlog_visible()
visibilitymap_clear()
visibilitymap_truncate()
lazy_scan_heap()
XLogRecordPageWithFreeSpace()
FreeSpaceMapTruncateRel()
fsm_set_and_search()
fsm_vacuum_page()
fsm_search_avail()
Though I have not investigated each of these in detail, I suspect most or all
represent continued torn-page hazards.  Since the FSM is just a hint, we do
not WAL-log it at all; it would be nicest to always skip checksums for it,
too.  The visibility map shortcuts are more nuanced.

When page_checksums=off and we read a page last written by page_checksums=on,
we still verify its checksum.  If a mostly-insert/read site uses checksums for
some time and eventually wishes to shed the overhead, disabling the feature
will not stop the costs for reads of old data.  They would need a dump/reload
to get the performance of a never-checksummed database.  Let's either
unconditionally skip checksum validation under page_checksums=off or add a new
state like page_checksums=ignore for that even-weaker condition.

 --- a/doc/src/sgml/config.sgml
 +++ b/doc/src/sgml/config.sgml

 +   para
 +Turning this parameter off speeds normal operation, but
 +might allow data corruption to go unnoticed. The checksum uses
 +16-bit checksums, using the fast Fletcher 16 algorithm. With this
 +parameter enabled there is still a non-zero probability that an error
 +could go undetected, as well as a non-zero probability of false
 +positives.
 +   /para

What sources of false positives do you intend to retain?

 --- a/src/backend/catalog/storage.c
 +++ b/src/backend/catalog/storage.c
 @@ -20,6 +20,7 @@
  #include postgres.h
  
  #include access/visibilitymap.h
 +#include access/transam.h
  #include access/xact.h
  #include access/xlogutils.h
  #include catalog/catalog.h
 @@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of 
 linked list */
  /* XLOG gives us high 4 bits */
  #define XLOG_SMGR_CREATE 0x10
  #define XLOG_SMGR_TRUNCATE   0x20
 +#define XLOG_SMGR_HINT   0x40
  
  typedef struct xl_smgr_create
  {
 @@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
   smgrDoPendingDeletes(false);
  }
  
 +/*
 + * Write a backup block if needed when we are setting a hint.
 + *
 + * Deciding the if needed bit is delicate and requires us to either
 + * grab WALInsertLock or check the info_lck spinlock. If we check the
 + * spinlock and it says Yes then we will need to get WALInsertLock as well,
 + * so the design choice here is to just go straight for the WALInsertLock
 + * and trust that calls to this function are minimised elsewhere.
 + *
 + * Callable while holding share lock on the buffer content.
 + *
 + * Possible that multiple concurrent backends could attempt to write
 + * WAL records. In that case, more than one backup block may be recorded
 + * though that isn't important to the outcome and the backup blocks are
 + * likely to be identical anyway.
 + */
 +#define  SMGR_HINT_WATERMARK 13579
 +void
 +smgr_buffer_hint(Buffer buffer)
 +{
 + /*
 +  * Make an XLOG entry reporting the hint
 +  */
 + XLogRecPtr  lsn;
 + XLogRecData rdata[2];
 + int watermark = SMGR_HINT_WATERMARK;
 +
 + /*
 +  * Not allowed to have zero-length records, so use a small watermark
 +  */
 + rdata[0].data = (char *) (watermark);
 + rdata[0].len = sizeof(int);
 + rdata[0].buffer = InvalidBuffer;
 + rdata[0].buffer_std = false;
 + rdata[0].next = (rdata[1]);
 +
 + rdata[1].data = NULL;
 + rdata[1].len = 0;
 + rdata[1].buffer = buffer;
 + rdata[1].buffer_std = true;
 + rdata[1].next = NULL;
 +
 + lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata);
 +
 + /*
 +  * Set the page LSN if we wrote a backup block.
 +   

Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-26 Thread Dan Scales
Some other comments on the checksum patch:

I'm not sure why you moved the checksum calculation (PageSetVerificationInfo) 
to mdwrite() rather than smgrwrite().  If there were every another storage 
backend, it would have to duplicate the checksum check, right?  Is there a 
disadvantage to putting it in smgrwrite()?

You may have already noticed this, but a bunch of the comments are incorrect, 
now that you have moved the checksum calculation to mdwrite().

  config.sgml - says writes via temp_buffers (which I think means local 
buffers) are not checksummed -- that's no longer true, right?
  bufmgr.c, line 1914 - bufToWrite no longer exists.  You could revert changes 
from 1912-1920
  localbuf.c, line 203 - as mentioned below, this comment is no longer 
relevant, you are checksumming local buffers now


Dan

- Original Message -
From: Noah Misch n...@leadboat.com
To: Simon Riggs si...@2ndquadrant.com
Cc: Heikki Linnakangas heikki.linnakan...@enterprisedb.com, Robert Haas 
robertmh...@gmail.com, Andres Freund and...@anarazel.de, Kevin Grittner 
kevin.gritt...@wicourts.gov, da...@fetter.org, ai...@highrise.ca, 
st...@mit.edu, pgsql-hackers@postgresql.org
Sent: Thursday, January 26, 2012 12:20:39 PM
Subject: Re: [HACKERS] 16-bit page checksums for 9.2

On Wed, Jan 11, 2012 at 10:12:31PM +, Simon Riggs wrote:
 v7

This patch uses FPIs to guard against torn hint writes, even when the
checksums are disabled.  One could not simply condition them on the
page_checksums setting, though.  Suppose page_checksums=off and we're hinting
a page previously written with page_checksums=on.  If that write tears,
leaving the checksum intact, that block will now fail verification.  A couple
of ideas come to mind.  (a) Read the on-disk page and emit an FPI only if the
old page had a checksum.  (b) Add a checksumEnableLSN field to pg_control.
Whenever the cluster starts with checksums disabled, set the field to
InvalidXLogRecPtr.  Whenever the cluster starts with checksums enabled and the
field is InvalidXLogRecPtr, set the field to the next LSN.  When a checksum
failure occurs in a page with LSN older than the stored one, emit either a
softer warning or no message at all.

Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty
the buffer.  Here are other sites I noticed that do MarkBufferDirty() without
bumping the page LSN:
heap_xlog_visible()
visibilitymap_clear()
visibilitymap_truncate()
lazy_scan_heap()
XLogRecordPageWithFreeSpace()
FreeSpaceMapTruncateRel()
fsm_set_and_search()
fsm_vacuum_page()
fsm_search_avail()
Though I have not investigated each of these in detail, I suspect most or all
represent continued torn-page hazards.  Since the FSM is just a hint, we do
not WAL-log it at all; it would be nicest to always skip checksums for it,
too.  The visibility map shortcuts are more nuanced.

When page_checksums=off and we read a page last written by page_checksums=on,
we still verify its checksum.  If a mostly-insert/read site uses checksums for
some time and eventually wishes to shed the overhead, disabling the feature
will not stop the costs for reads of old data.  They would need a dump/reload
to get the performance of a never-checksummed database.  Let's either
unconditionally skip checksum validation under page_checksums=off or add a new
state like page_checksums=ignore for that even-weaker condition.

 --- a/doc/src/sgml/config.sgml
 +++ b/doc/src/sgml/config.sgml

 +   para
 +Turning this parameter off speeds normal operation, but
 +might allow data corruption to go unnoticed. The checksum uses
 +16-bit checksums, using the fast Fletcher 16 algorithm. With this
 +parameter enabled there is still a non-zero probability that an error
 +could go undetected, as well as a non-zero probability of false
 +positives.
 +   /para

What sources of false positives do you intend to retain?

 --- a/src/backend/catalog/storage.c
 +++ b/src/backend/catalog/storage.c
 @@ -20,6 +20,7 @@
  #include postgres.h
  
  #include access/visibilitymap.h
 +#include access/transam.h
  #include access/xact.h
  #include access/xlogutils.h
  #include catalog/catalog.h
 @@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of 
 linked list */
  /* XLOG gives us high 4 bits */
  #define XLOG_SMGR_CREATE 0x10
  #define XLOG_SMGR_TRUNCATE   0x20
 +#define XLOG_SMGR_HINT   0x40
  
  typedef struct xl_smgr_create
  {
 @@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
   smgrDoPendingDeletes(false);
  }
  
 +/*
 + * Write a backup block if needed when we are setting a hint.
 + *
 + * Deciding the if needed bit is delicate and requires us to either
 + * grab WALInsertLock or check the info_lck spinlock. If we check the
 + * spinlock and it says Yes then we will need to get WALInsertLock as well,
 + * so the design choice here is to just go straight

Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-11 Thread Simon Riggs
On Sun, Jan 8, 2012 at 2:03 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Sat, Jan 7, 2012 at 11:09 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Sat, Jan 7, 2012 at 10:55 AM, Simon Riggs si...@2ndquadrant.com wrote:

 So there isn't any problem with there being incorrect checksums on
 blocks and you can turn the parameter on and off as often and as
 easily as you want. I think it can be USERSET but I wouldn't want to
 encourage users to see turning it off as a performance tuning feature.
 If the admin turns it on for the server, its on, so its SIGHUP.

 Any holes in that I haven't noticed?

 And of course, as soon as I wrote that I thought of the problem. We
 mustn't make a write that hasn't been covered by a FPW, so we must
 know ahead of time whether to WAL log hints or not. We can't simply
 turn it on/off any longer, now that we have to WAL log hint bits also.
 So thanks for making me think of that.

 We *could* make it turn on/off at each checkpoint, but its easier just
 to say that it can be turned on/off at server start.

 Attached patch v6 now handles hint bits and checksums correctly,
 following Heikki's comments.

 In recovery, setting a hint doesn't dirty a block if it wasn't already
 dirty. So we can write some hints, and we can set others but not write
 them.

 Lots of comments in the code.


v7

* Fixes merge conflict
* Minor patch cleanups
* Adds checksum of complete page including hole
* Calcs checksum in mdwrite() so we pickup all non-shared buffer writes also

Robert mentioned to me there were outstanding concerns on this patch.
I know of none, and have double checked the thread to confirm all
concerns are fully addressed. Adding to CF.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0cc3296..3cb8d2a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1701,6 +1701,47 @@ SET ENABLE_SEQSCAN TO OFF;
   /listitem
  /varlistentry
 
+ varlistentry id=guc-page-checksums xreflabel=page_checksums
+  indexterm
+   primaryvarnamepage_checksums/ configuration parameter/primary
+  /indexterm
+  termvarnamepage_checksums/varname (typeboolean/type)/term
+  listitem
+   para
+When this parameter is on, the productnamePostgreSQL/ server
+calculates checksums when it writes main database pages to disk,
+flagging the page as checksum protected.  When this parameter is off,
+no checksum is written, only a standard watermark in the page header.
+The database may thus contain a mix of pages with checksums and pages
+without checksums.
+   /para
+
+   para
+When pages are read into shared buffers any page flagged with a
+checksum has the checksum re-calculated and compared against the
+stored value to provide greatly improved validation of page contents.
+   /para
+
+   para
+Writes via temp_buffers are not checksummed.
+   /para
+
+   para
+Turning this parameter off speeds normal operation, but
+might allow data corruption to go unnoticed. The checksum uses
+16-bit checksums, using the fast Fletcher 16 algorithm. With this
+parameter enabled there is still a non-zero probability that an error
+could go undetected, as well as a non-zero probability of false
+positives.
+   /para
+
+   para
+This parameter can only be set at server start.
+The default is literaloff/.
+   /para
+  /listitem
+ /varlistentry
+
  varlistentry id=guc-wal-buffers xreflabel=wal_buffers
   termvarnamewal_buffers/varname (typeinteger/type)/term
   indexterm
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 19ef66b..7c7b20e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -709,6 +709,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 	bool		updrqst;
 	bool		doPageWrites;
 	bool		isLogSwitch = (rmid == RM_XLOG_ID  info == XLOG_SWITCH);
+	bool		IsHint = (rmid == RM_SMGR_ID  info == XLOG_SMGR_HINT);
 	uint8		info_orig = info;
 
 	/* cross-check on whether we should be here or not */
@@ -955,6 +956,18 @@ begin:;
 	}
 
 	/*
+	 * If this is a hint record and we don't need a backup block then
+	 * we have no more work to do and can exit quickly without inserting
+	 * a WAL record at all. In that case return InvalidXLogRecPtr.
+	 */
+	if (IsHint  !(info  XLR_BKP_BLOCK_MASK))
+	{
+		LWLockRelease(WALInsertLock);
+		END_CRIT_SECTION();
+		return InvalidXLogRecPtr;
+	}
+
+	/*
 	 * If there isn't enough space on the current XLOG page for a record
 	 * header, advance to the next page (leaving the unused space as zeroes).
 	 */
@@ -3650,6 +3663,13 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
    BLCKSZ - (bkpb.hole_offset + 

Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-09 Thread Jim Nasby
On Jan 6, 2012, at 4:36 AM, Andres Freund wrote:
 On Friday, January 06, 2012 11:30:53 AM Simon Riggs wrote:
 On Fri, Jan 6, 2012 at 1:10 AM, Stephen Frost sfr...@snowman.net wrote:
 * Simon Riggs (si...@2ndquadrant.com) wrote:
 I discover that non-all-zeroes holes are fairly common, just not very
 frequent.
 
 Curious, might be interesting to find out why.
 
 That may or may not be a problem, but not something to be dealt with
 here and now.
 
 But I agree that it's not the job of this patch/effort.  It sounds like
 we have clear indication, however, that those areas, as they are not
 necessairly all zeros, should be included in the checksum.
 
 Disagree. Full page writes ignore the hole, so its appropriate to do
 so here also.
 Well, ignoriging them in fpw has clear space benefits. Ignoring them while 
 checksumming doesn't have that much of a benefit.

I agree with Andres... we should checksum zero bytes, because if they're 
screwed up then something is wrong with your system, even if you got lucky with 
what data got trashed.

As I mentioned before, 2 separate checksums would be nice, but if we can't have 
that I think we need to fail on any checksum error.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-09 Thread Greg Smith

On 1/7/12 5:26 AM, Heikki Linnakangas wrote:

Perhaps there
needs to be a third setting, calculate-but-dont-verify, where CRCs are
updated but existing CRCs are not expected to be correct. And a utility
to scan through your database and fix any incorrect CRCs, so that after
that's run in calculate-but-dont-verify mode, you can safely turn
checksums on.


The users of any feature like this are eventually going to demand a 
scrubbing utility that validates the checksums in a background task. 
Doing something like pg_dump just to validate your copy of the data is 
still clean is not a great approach.  If you accept that sort of utility 
is inevitable, you might as well presume it could handle this sort of 
task eventually too.



--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-09 Thread Greg Smith

On 12/30/11 9:44 AM, Aidan Van Dyk wrote:


So moving to this new double-write-area bandwagon, we move from a WAL
FPW synced at the commit, collect as many other writes, then final
sync type system to a system where *EVERY* write requires syncs of 2
separate 8K writes at buffer write-out time.


It's not quite that bad.  The double-write area is going to be a small 
chunk of re-used sequential I/O, like the current WAL.  And if this 
approach shifts some of the full-page writes out of the WAL and toward 
the new area instead, that's not a real doubling either.  Could probably 
put both on the same disk, and in situations where you don't have a 
battery-backed write cache it's possible to get a write to both per 
rotation.


This idea has been tested pretty extensively as part of MySQL's Innodb 
engine.  Results there suggest the overhead is in the 5% to 30% range; 
some examples mentioning both extremes of that:


http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/
http://www.bigdbahead.com/?p=392

Makes me wish I knew off the top of my head how expensive WAL logging 
hint bits would be, for comparison sake.



--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-08 Thread Simon Riggs
On Sat, Jan 7, 2012 at 11:09 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Sat, Jan 7, 2012 at 10:55 AM, Simon Riggs si...@2ndquadrant.com wrote:

 So there isn't any problem with there being incorrect checksums on
 blocks and you can turn the parameter on and off as often and as
 easily as you want. I think it can be USERSET but I wouldn't want to
 encourage users to see turning it off as a performance tuning feature.
 If the admin turns it on for the server, its on, so its SIGHUP.

 Any holes in that I haven't noticed?

 And of course, as soon as I wrote that I thought of the problem. We
 mustn't make a write that hasn't been covered by a FPW, so we must
 know ahead of time whether to WAL log hints or not. We can't simply
 turn it on/off any longer, now that we have to WAL log hint bits also.
 So thanks for making me think of that.

 We *could* make it turn on/off at each checkpoint, but its easier just
 to say that it can be turned on/off at server start.

Attached patch v6 now handles hint bits and checksums correctly,
following Heikki's comments.

In recovery, setting a hint doesn't dirty a block if it wasn't already
dirty. So we can write some hints, and we can set others but not write
them.

Lots of comments in the code.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0cc3296..3cb8d2a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1701,6 +1701,47 @@ SET ENABLE_SEQSCAN TO OFF;
   /listitem
  /varlistentry
 
+ varlistentry id=guc-page-checksums xreflabel=page_checksums
+  indexterm
+   primaryvarnamepage_checksums/ configuration parameter/primary
+  /indexterm
+  termvarnamepage_checksums/varname (typeboolean/type)/term
+  listitem
+   para
+When this parameter is on, the productnamePostgreSQL/ server
+calculates checksums when it writes main database pages to disk,
+flagging the page as checksum protected.  When this parameter is off,
+no checksum is written, only a standard watermark in the page header.
+The database may thus contain a mix of pages with checksums and pages
+without checksums.
+   /para
+
+   para
+When pages are read into shared buffers any page flagged with a
+checksum has the checksum re-calculated and compared against the
+stored value to provide greatly improved validation of page contents.
+   /para
+
+   para
+Writes via temp_buffers are not checksummed.
+   /para
+
+   para
+Turning this parameter off speeds normal operation, but
+might allow data corruption to go unnoticed. The checksum uses
+16-bit checksums, using the fast Fletcher 16 algorithm. With this
+parameter enabled there is still a non-zero probability that an error
+could go undetected, as well as a non-zero probability of false
+positives.
+   /para
+
+   para
+This parameter can only be set at server start.
+The default is literaloff/.
+   /para
+  /listitem
+ /varlistentry
+
  varlistentry id=guc-wal-buffers xreflabel=wal_buffers
   termvarnamewal_buffers/varname (typeinteger/type)/term
   indexterm
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8e65962..c9538d3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -708,6 +708,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 	bool		updrqst;
 	bool		doPageWrites;
 	bool		isLogSwitch = (rmid == RM_XLOG_ID  info == XLOG_SWITCH);
+	bool		IsHint = (rmid == RM_SMGR_ID  info == XLOG_SMGR_HINT);
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
@@ -975,6 +976,18 @@ begin:;
 	}
 
 	/*
+	 * If this is a hint record and we don't need a backup block then
+	 * we have no more work to do and can exit quickly without inserting
+	 * a WAL record at all. In that case return InvalidXLogRecPtr.
+	 */
+	if (IsHint  !(info  XLR_BKP_BLOCK_MASK))
+	{
+		LWLockRelease(WALInsertLock);
+		END_CRIT_SECTION();
+		return InvalidXLogRecPtr;
+	}
+
+	/*
 	 * If there isn't enough space on the current XLOG page for a record
 	 * header, advance to the next page (leaving the unused space as zeroes).
 	 */
@@ -3670,6 +3683,13 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
    BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
 		}
 
+		/*
+		 * Any checksum set on this page will be invalid. We don't need
+		 * to reset it here since it will be reset before being written
+		 * but it seems worth doing this for general sanity and hygiene.
+		 */
+		PageSetPageSizeAndVersion(page, BLCKSZ, PG_PAGE_LAYOUT_VERSION);
+
 		PageSetLSN(page, lsn);
 		PageSetTLI(page, ThisTimeLineID);
 		MarkBufferDirty(buffer);
diff --git 

Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-07 Thread Simon Riggs
On Fri, Jan 6, 2012 at 9:44 PM, Robert Haas robertmh...@gmail.com wrote:

 Writing pages during recovery doesn't need WAL. If we crash, we replay
 using the already generated WAL.

 Which is all fine, except when you start making changes that are not
 WAL-logged.  Then, you have the same torn page problem that exists
 when you it in normal running.

Yes, of course. My point is that this is not a blocker to using
checksums, only that some actions cannot occur on the standby but the
replay of changes is not in danger.

page_checksums is an optional parameter, so you can turn it on or off
on the standby as you wish. People frequently have a standby dedicated
to HA and other standbys for queries. So this is all normal and
natural.

page_checksums will default to 'off' in the final patch anyway, in my
understanding.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-07 Thread Heikki Linnakangas

On 07.01.2012 12:14, Simon Riggs wrote:

page_checksums is an optional parameter, so you can turn it on or off
on the standby as you wish. People frequently have a standby dedicated
to HA and other standbys for queries. So this is all normal and
natural.


There needs to be a well-documented way of turning it on/off. In 
particular, from off to on. If you ever flip it off even for a minute, 
pages with invalid checksums start to appear in the data directory, and 
if you then turn it on again, you start getting errors. Perhaps there 
needs to be a third setting, calculate-but-dont-verify, where CRCs are 
updated but existing CRCs are not expected to be correct. And a utility 
to scan through your database and fix any incorrect CRCs, so that after 
that's run in calculate-but-dont-verify mode, you can safely turn 
checksums on.


Even better would be a way to make that process robust so that you can't 
do a pilot error and turn page_checksums 'on' when it's not safe to do 
so. Otherwise when you get a checksum error, your first thought is going 
to be hmm, I wonder if anyone ever turned page_checksums 'off' by 
accident anytime in the past, or if this is a genuine error.



page_checksums will default to 'off' in the final patch anyway, in my
understanding.


That makes the need for such a facility even more important. Otherwise 
there's no safe way to ever switch it from off to on.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-07 Thread Simon Riggs
On Sat, Jan 7, 2012 at 10:26 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 There needs to be a well-documented way of turning it on/off. In particular,
 from off to on.

There is in the patch.

The checksum field is optional, as is the parameter.

If page_checksums is on, we write a checksum and it is correct. We
also validate any checksums we see.

If page_checksums is off we clear the checksum on write, so an
incorrect checksum is never written.

So there isn't any problem with there being incorrect checksums on
blocks and you can turn the parameter on and off as often and as
easily as you want. I think it can be USERSET but I wouldn't want to
encourage users to see turning it off as a performance tuning feature.
If the admin turns it on for the server, its on, so its SIGHUP.

Any holes in that I haven't noticed?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   >