Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Greg Stark
On Mon, Nov 8, 2010 at 5:59 PM, Aidan Van Dyk ai...@highrise.ca wrote:
 The problem that putting checksums in a different place solves is the
 page layout (binary upgrade) problem.  You're still going to need to
 buffer the page as you calculate the checksum and write it out.
 Buffering that page is absolutely necessary no matter where you put the
 checksum, unless you've got an exclusive lock that blocks even hint
 updates on the page.

But buffering the page only means you've got some consistent view of
the page. It doesn't mean the checksum will actually match the data in
the page that gets written out. So when you read it back in the
checksum may be invalid.

I wonder if we could get by with a global counter on the page which
you increment when you set a hint bit. That way, when you read the
page back in, you could compare the counter on the page with the
counter recorded for the checksum, and if the checksum counter is
behind, ignore the checksum? It would be nice to do better but I'm not
sure we can.
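(To make the idea concrete, here's a rough sketch with made-up field
names; this is not the actual PostgreSQL page header, just an
illustration of the check on read-in:)

#include <stdbool.h>
#include <stdint.h>

/* Illustrative page-header fields only -- not the real PageHeaderData. */
typedef struct PageHdrSketch
{
    uint32_t hint_counter;      /* bumped every time a hint bit is set */
    uint32_t checksum;          /* CRC computed over the page contents */
    uint32_t checksum_counter;  /* hint_counter value the CRC was computed at */
} PageHdrSketch;

/* On read-in: if hint bits were set after the checksum was taken, the
 * checksum is stale, so we ignore it rather than report corruption. */
static bool
page_checksum_ok(const PageHdrSketch *hdr, uint32_t actual_crc)
{
    if (hdr->checksum_counter < hdr->hint_counter)
        return true;            /* stale checksum: ignore it */
    return hdr->checksum == actual_crc;
}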



 But if we can start using forks to put other data, that means that
 keeping the page layouts is easier, and thus binary upgrades are much
 more feasible.


The difficulty with the page layout didn't come from the checksum
itself. We can add 4 or 8 bytes to the page header easily enough. The
difficulty came from trying to move the hint bits for all the tuples
to a dedicated area. That means three resizable areas, so either one of
them would have to be relocatable or we'd need some other solution (like
not checksumming the line pointers and putting the hint bits in the line
pointers). If we're willing to have invalid checksums whenever the
hint bits get set then this wouldn't be necessary.

-- 
greg



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Aidan Van Dyk
On Tue, Nov 9, 2010 at 8:45 AM, Greg Stark gsst...@mit.edu wrote:

 But buffering the page only means you've got some consistent view of
 the page. It doesn't mean the checksum will actually match the data in
 the page that gets written out. So when you read it back in the
 checksum may be invalid.

I was assuming that if the code went through the trouble to buffer the
shared page to get a stable, non-changing copy to use for
checksumming/writing it, it would write() the buffered copy it just
made, not the original in shared memory...  I'm not sure how that
write could be inconsistent.

a.

-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Greg Stark
On Tue, Nov 9, 2010 at 2:28 PM, Aidan Van Dyk ai...@highrise.ca wrote:
 On Tue, Nov 9, 2010 at 8:45 AM, Greg Stark gsst...@mit.edu wrote:

 But buffering the page only means you've got some consistent view of
 the page. It doesn't mean the checksum will actually match the data in
 the page that gets written out. So when you read it back in the
 checksum may be invalid.

 I was assuming that if the code went through the trouble to buffer the
 shared page to get a stable, non-changing copy to use for
 checksumming/writing it, it would write() the buffered copy it just
 made, not the original in shared memory...  I'm not sure how that
 write could be inconsistent.

Oh, I'm mistaken. The problem was that buffering the writes was
insufficient to deal with torn pages. Even if you buffer the writes,
if the machine crashes after writing out only half the buffer, the
checksum won't match. If the only changes on the page were hint-bit
updates, then there will be no full-page write in the WAL to repair
the block.

It's possible that *that* situation is rare enough to let the checksum
raise a warning but not an error.

But personally I'm pretty loath to buffer every page write. The state
of the art is zero-copy processing, and we should be looking to reduce
copies rather than increase them. Though I suppose if we did a
zero-copy CRC that might actually get us this buffered write for free.
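(For illustration, a combined copy-and-checksum pass could fold the CRC
into the copy loop, so the bytes checksummed are exactly the bytes
handed to write(); this is just a sketch using a plain bytewise CRC-32,
not the backend's actual CRC code:)

#include <stddef.h>
#include <stdint.h>

/* Standard reflected CRC-32 (polynomial 0xEDB88320), one byte at a time. */
static uint32_t
crc32_fold_byte(uint32_t crc, uint8_t b)
{
    crc ^= b;
    for (int k = 0; k < 8; k++)
        crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t) -(int32_t) (crc & 1));
    return crc;
}

/* Copy the shared buffer into a private write buffer while folding each
 * byte into the checksum, so checksum and write() see identical bytes. */
static uint32_t
copy_and_checksum(uint8_t *dst, const uint8_t *src, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++)
    {
        dst[i] = src[i];
        crc = crc32_fold_byte(crc, dst[i]);
    }
    return crc ^ 0xFFFFFFFFu;
}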



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Greg Stark
On Tue, Nov 9, 2010 at 3:25 PM, Greg Stark gsst...@mit.edu wrote:
 Oh, I'm mistaken. The problem was that buffering the writes was
 insufficient to deal with torn pages. Even if you buffer the writes if
 the machine crashes while only having written half the buffer out then
 the checksum won't match. If the only changes on the page were hint
 bit updates then there will be no full page write in the WAL log to
 repair the block.

Huh, this implies that if we did go through all the work of
segregating the hint bits, could arrange that they all appear in the
same 512-byte sector, and buffered them so that we were writing the
same bits we checksummed, then we actually *could* include them in the
CRC after all, since even a torn page will almost certainly not tear
an individual sector.

-- 
greg



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Jim Nasby
On Nov 9, 2010, at 9:27 AM, Greg Stark wrote:
 On Tue, Nov 9, 2010 at 3:25 PM, Greg Stark gsst...@mit.edu wrote:
 Oh, I'm mistaken. The problem was that buffering the writes was
 insufficient to deal with torn pages. Even if you buffer the writes if
 the machine crashes while only having written half the buffer out then
 the checksum won't match. If the only changes on the page were hint
 bit updates then there will be no full page write in the WAL log to
 repair the block.
 
 Huh, this implies that if we did go through all the work of
 segregating the hint bits and could arrange that they all appear on
 the same 512-byte sector and if we buffered them so that we were
 writing the same bits we checksummed then we actually *could* include
 them in the CRC after all since even a torn page will almost certainly
 not tear an individual sector.

If there's a torn page then we've crashed, which means we go through crash 
recovery, which puts a valid page (with valid CRC) back in place from the WAL. 
What am I missing?

BTW, I agree that at minimum we need to leave the option of only raising a 
warning when we hit a checksum failure. Some people might want Postgres to 
treat it as an error by default, but most folks will at least want the option 
to look at their (corrupt) data.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Gurjeet Singh
On Tue, Nov 9, 2010 at 12:32 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 There are also crosschecks that you can apply: if it's a heap page, are
 there any index pages with pointers to it?  If it's an index page, are
 there downlink or sibling links to it from elsewhere in the index?
 A page that Postgres left as zeroes would not have any references to it.

 IMO there are a lot of methods that can separate filesystem misfeasance
 from Postgres errors, probably with greater reliability than this hack.
 I would also suggest that you don't really need to prove conclusively
 that any particular instance is one or the other --- a pattern across
 multiple instances will tell you what you want to know.


Doing this postmortem on a regular deployment and fixing the problem would
not be too difficult. But this platform, which Postgres is a part of,  would
be mostly left unattended once deployed (pardon me for not sharing the
details, as I am not sure if I can).

An external HA component is supposed to detect any problems (by querying
Postgres or by external means) and take an evasive action. It is this
automation of problem detection that we are seeking.

As Greg pointed out, even with this hack in place, we might still get zero
pages from the FS (say, when ext3 does metadata journaling but not block
journaling). In that case we'd rely on recovery's WAL replay of relation
extension to reintroduce the magic number in pages.


 What's more, if I did believe that this was a safe and
 reliable technique, I'd be unhappy about the opportunity cost of
 reserving it for zero-page testing rather than other purposes.


This is one of those times where you are a bit too terse for me. What does
zero-page imply that this hack wouldn't?

Regards,
-- 
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh.gurj...@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Greg Stark
On Tue, Nov 9, 2010 at 4:26 PM, Jim Nasby j...@nasby.net wrote:
 On Tue, Nov 9, 2010 at 3:25 PM, Greg Stark gsst...@mit.edu wrote:
 Oh, I'm mistaken. The problem was that buffering the writes was
 insufficient to deal with torn pages. Even if you buffer the writes if
 the machine crashes while only having written half the buffer out then
 the checksum won't match. If the only changes on the page were hint
 bit updates then there will be no full page write in the WAL log to
 repair the block.

 If there's a torn page then we've crashed, which means we go through crash 
 recovery, which puts a valid page (with valid CRC) back in place from the 
 WAL. What am I missing?

If the only changes on the page were hint bit updates then there will
be no full-page write in the WAL to repair the block.



-- 
greg



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Aidan Van Dyk
On Tue, Nov 9, 2010 at 11:26 AM, Jim Nasby j...@nasby.net wrote:

 Huh, this implies that if we did go through all the work of
 segregating the hint bits and could arrange that they all appear on
 the same 512-byte sector and if we buffered them so that we were
 writing the same bits we checksummed then we actually *could* include
 them in the CRC after all since even a torn page will almost certainly
 not tear an individual sector.

 If there's a torn page then we've crashed, which means we go through crash 
 recovery, which puts a valid page (with valid CRC) back in place from the 
 WAL. What am I missing?

The problem case is where hint bits have been set.  Hint bits have
always been "we don't really care, but we write them anyway".

A torn page on a hint-bit-only write is OK, because with a torn page
(assuming you don't get zeroed pages) you get some mix of old and new
chunks of the complete 8K buffer, but they are identical except for
the hint bits, for which either the old or the new state is sufficient.

But with a checksum, a torn page with only hint-bit updates now gets
noticed.  Before, it might have happened, but we wouldn't have noticed
or cared.

So, to get checksums, we have to give up a few things:
1) zero-copy writes: we need to buffer the write to get a consistent
checksum (or lock the buffer tightly)
2) saving hint bits on an otherwise unchanged page: we either need to
just not write that page, and lose the work the hint bits did, or do a
full-page WAL write of it, so the torn-page checksum is repaired

Both of these are theoretical performance tradeoffs.  How badly do we
want to verify on read that it is *exactly* what we thought we wrote?

a.


-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Tom Lane
Gurjeet Singh singh.gurj...@gmail.com writes:
 On Tue, Nov 9, 2010 at 12:32 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 IMO there are a lot of methods that can separate filesystem misfeasance
 from Postgres errors, probably with greater reliability than this hack.

 Doing this postmortem on a regular deployment and fixing the problem would
 not be too difficult. But this platform, which Postgres is a part of,  would
 be mostly left unattended once deployed (pardon me for not sharing the
 details, as I am not sure if I can).

 An external HA component is supposed to detect any problems (by querying
 Postgres or by external means) and take an evasive action. It is this
 automation of problem detection that we are seeking.

To be blunt, this argument is utter nonsense.  The changes you propose
would still require manual analysis of any detected issues in order to
do anything useful about them.  Once you know that there is, or isn't,
a filesystem-level error involved, what are you going to do next?
You're going to go try to debug the component you know is at fault,
that's what.  And that problem is still AI-complete.

regards, tom lane



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Greg Stark
On Tue, Nov 9, 2010 at 5:06 PM, Aidan Van Dyk ai...@highrise.ca wrote:
 So, for getting checksums, we have to offer up a few things:
 1) zero-copy writes, we need to buffer the write to get a consistent
 checksum (or lock the buffer tight)
 2) saving hint-bits on an otherwise unchanged page.  We either need to
 just not write that page, and lose the work the hint-bits did, or do
 a full-page WAL of it, so the torn-page checksum is fixed

Actually the consensus the last go-around on this topic was to
segregate the hint bits into a single area of the page and skip them
in the checksum. That way we don't have to do any of the above. It's
just that that's a lot of work.
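(Roughly, with the hint bits gathered into one region, the checksum
pass just skips that byte range; offsets here are made up for
illustration, not a proposed page layout:)

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE      8192
#define HINT_AREA_OFF  64       /* assumed offset of the hint-bit area */
#define HINT_AREA_LEN  256      /* assumed size of the hint-bit area */

/* Checksum everything except the hint-bit area, which may legitimately
 * change between checksum calculation and the write hitting disk.
 * 'fold' is whatever per-byte CRC step the checksum uses. */
static uint32_t
checksum_skipping_hints(const uint8_t *page,
                        uint32_t (*fold)(uint32_t crc, uint8_t b))
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < PAGE_SIZE; i++)
    {
        if (i >= HINT_AREA_OFF && i < HINT_AREA_OFF + HINT_AREA_LEN)
            continue;           /* skip the segregated hint bits */
        crc = fold(crc, page[i]);
    }
    return crc ^ 0xFFFFFFFFu;
}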

-- 
greg



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Robert Haas
On Tue, Nov 9, 2010 at 12:31 PM, Greg Stark gsst...@mit.edu wrote:
 On Tue, Nov 9, 2010 at 5:06 PM, Aidan Van Dyk ai...@highrise.ca wrote:
 So, for getting checksums, we have to offer up a few things:
 1) zero-copy writes, we need to buffer the write to get a consistent
 checksum (or lock the buffer tight)
 2) saving hint-bits on an otherwise unchanged page.  We either need to
 just not write that page, and lose the work the hint-bits did, or do
 a full-page WAL of it, so the torn-page checksum is fixed

 Actually the consensus the last go-around on this topic was to
 segregate the hint bits into a single area of the page and skip them
 in the checksum. That way we don't have to do any of the above. It's
 just that that's a lot of work.

And it still allows silent data corruption, because bogusly clearing a
hint bit is, at the moment, harmless, but bogusly setting one is not.
I really have to wonder how other products handle this.  PostgreSQL
isn't the only database product that uses MVCC - not by a long shot -
and the problem of detecting whether an XID is visible to the current
snapshot can't be ours alone.  So what do other people do about this?
They either don't cache the information about whether the XID is
committed in-page (in which case, are they just slower or do they have
some other means of avoiding the performance hit?) or they cache it in
the page (in which case, they either WAL log it or they don't checksum
it).  I mean, there aren't any other options, are there?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Kenneth Marshall
On Tue, Nov 09, 2010 at 02:05:57PM -0500, Robert Haas wrote:
 On Tue, Nov 9, 2010 at 12:31 PM, Greg Stark gsst...@mit.edu wrote:
  On Tue, Nov 9, 2010 at 5:06 PM, Aidan Van Dyk ai...@highrise.ca wrote:
  So, for getting checksums, we have to offer up a few things:
  1) zero-copy writes, we need to buffer the write to get a consistent
  checksum (or lock the buffer tight)
  2) saving hint-bits on an otherwise unchanged page.  We either need to
  just not write that page, and lose the work the hint-bits did, or do
  a full-page WAL of it, so the torn-page checksum is fixed
 
  Actually the consensus the last go-around on this topic was to
  segregate the hint bits into a single area of the page and skip them
  in the checksum. That way we don't have to do any of the above. It's
  just that that's a lot of work.
 
 And it still allows silent data corruption, because bogusly clearing a
 hint bit is, at the moment, harmless, but bogusly setting one is not.
 I really have to wonder how other products handle this.  PostgreSQL
 isn't the only database product that uses MVCC - not by a long shot -
 and the problem of detecting whether an XID is visible to the current
 snapshot can't be ours alone.  So what do other people do about this?
 They either don't cache the information about whether the XID is
 committed in-page (in which case, are they just slower or do they have
 some other means of avoiding the performance hit?) or they cache it in
 the page (in which case, they either WAL log it or they don't checksum
 it).  I mean, there aren't any other options, are there?
 
 -- 
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company
 

That would imply that we need to have a CRC for just the hint bit
section or some type of ECC calculation that can detect bad hint
bits independent of the CRC for the rest of the page.

Regards,
Ken



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Alvaro Herrera
Excerpts from Robert Haas's message of mar nov 09 16:05:57 -0300 2010:

 And it still allows silent data corruption, because bogusly clearing a
 hint bit is, at the moment, harmless, but bogusly setting one is not.
 I really have to wonder how other products handle this.  PostgreSQL
 isn't the only database product that uses MVCC - not by a long shot -
 and the problem of detecting whether an XID is visible to the current
 snapshot can't be ours alone.  So what do other people do about this?
 They either don't cache the information about whether the XID is
 committed in-page (in which case, are they just slower or do they have
 some other means of avoiding the performance hit?) or they cache it in
 the page (in which case, they either WAL log it or they don't checksum
 it).  I mean, there aren't any other options, are there?

Maybe allocate enough shared memory for pg_clog buffers back to the
freeze horizon, and just don't use hint bits?  Maybe some intermediate
solution, i.e. allocate a large bunch of pg_clog buffers, and do
WAL-logged setting of hint bits only for tuples that go further back.

I remember someone had a patch to set all the bits in a page that passed
a threshold of some kind.  Ah, no, that was for freezing tuples.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Josh Berkus

 PostgreSQL
 isn't the only database product that uses MVCC - not by a long shot -
 and the problem of detecting whether an XID is visible to the current
 snapshot can't be ours alone.  So what do other people do about this?
 They either don't cache the information about whether the XID is
 committed in-page (in which case, are they just slower or do they have
 some other means of avoiding the performance hit?) or they cache it in
 the page (in which case, they either WAL log it or they don't checksum
 it).

Well, most of the other MVCC-in-table DBMSes simply don't deal with
large, on-disk databases.  In fact, I can't think of one which does,
currently; while MVCC has been popular for the New Databases, they're
all focused on in-memory databases.  Oracle and InnoDB use rollback
segments.

Might be worth asking the BDB folks.

Personally, I think we're headed inevitably towards having a set of
metadata bitmaps for each table, like we do currently for the FSM.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Greg Stark
On Tue, Nov 9, 2010 at 7:37 PM, Josh Berkus j...@agliodbs.com wrote:
 Well, most of the other MVCC-in-table DBMSes simply don't deal with
 large, on-disk databases.  In fact, I can't think of one which does,
 currently; while MVCC has been popular for the New Databases, they're
 all focused on in-memory databases.  Oracle and InnoDB use rollback
 segments.

Well rollback segments are still MVCC. However Oracle's MVCC is
block-based. So they only have to do the visibility check once per
block, not once per row. Once they find the right block version they
can process all the rows on it.

Also Oracle's snapshots are just the log position. Instead of having
to check whether every transaction committed or not, they just find
the block version which was last modified before the log position for
when their transaction started.
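(In other words, the visibility test collapses to a single comparison
per block version; a sketch with made-up types, nothing to do with
Oracle's or Postgres's real structures:)

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t LogPos;        /* a snapshot is just a position in the log */

typedef struct BlockVersionSketch
{
    LogPos last_modified;       /* log position of the change that made it */
} BlockVersionSketch;

/* A block version is usable by a snapshot if it was last modified at or
 * before the log position at which the snapshot was taken; no per-row
 * commit lookups are needed. */
static bool
block_version_visible(const BlockVersionSketch *bv, LogPos snapshot_pos)
{
    return bv->last_modified <= snapshot_pos;
}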

 Might be worth asking the BDB folks.

 Personally, I think we're headed inevitably towards having a set of
 metadata bitmaps for each table, like we do currently for the FSM.

Well, we already have a metadata bitmap for transaction visibility.
It's called the clog. There's no point in having another one,
structured differently, alongside the table.

The whole point of the hint bits is that they're in the same place as the data.


-- 
greg



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Josh Berkus

 The whole point of the hint bits is that it's in the same place as the data.

Yes, but the hint bits are currently causing us trouble on several
features or potential features:

* page-level CRC checks
* eliminating vacuum freeze for cold data
* index-only access
* replication
* this patch
* etc.

At a certain point, it's worth the trouble to handle them differently
because of the other features that doing so enables or makes much easier.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Greg Stark
On Tue, Nov 9, 2010 at 8:12 PM, Josh Berkus j...@agliodbs.com wrote:
 The whole point of the hint bits is that it's in the same place as the data.

 Yes, but the hint bits are currently causing us trouble on several
 features or potential features:

Then we might have to get rid of hint bits. But they're hint bits for
a metadata file that already exists; creating another metadata file
doesn't solve anything.

Though incidentally, all of the other items you mentioned are generic
problems caused by MVCC, not hint bits.


-- 
greg



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Aidan Van Dyk
On Tue, Nov 9, 2010 at 3:25 PM, Greg Stark gsst...@mit.edu wrote:

 Then we might have to get rid of hint bits. But they're hint bits for
 a metadata file that already exists, creating another metadata file
 doesn't solve anything.

Is there any way to instrument the writes of dirty buffers from
shared memory, and see how many of the pages normally being written
are not backed by WAL (hint-only updates)?  Just dropping those
buffers without writing them would allow at least *checksums* to go
through without losing all the benefits of the hint bits.

I've got a hunch (with no proof) that the penalty of not writing them
will be borne largely by small database installs.  Large OLTP
databases probably won't have pages without a WAL'ed change and
hint bits set, and large data warehouse ones will probably vacuum
freeze big tables on load to avoid the huge write penalty the first
time they scan the tables...

/waving hands

-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Josh Berkus

 Though incidentally all of the other items you mentioned are generic
 problems caused by MVCC, not hint bits.

Yes, but the hint bits prevent us from implementing workarounds.


-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Robert Haas
On Tue, Nov 9, 2010 at 2:05 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Nov 9, 2010 at 12:31 PM, Greg Stark gsst...@mit.edu wrote:
 On Tue, Nov 9, 2010 at 5:06 PM, Aidan Van Dyk ai...@highrise.ca wrote:
 So, for getting checksums, we have to offer up a few things:
 1) zero-copy writes, we need to buffer the write to get a consistent
 checksum (or lock the buffer tight)
 2) saving hint-bits on an otherwise unchanged page.  We either need to
 just not write that page, and lose the work the hint-bits did, or do
 a full-page WAL of it, so the torn-page checksum is fixed

 Actually the consensus the last go-around on this topic was to
 segregate the hint bits into a single area of the page and skip them
 in the checksum. That way we don't have to do any of the above. It's
 just that that's a lot of work.

 And it still allows silent data corruption, because bogusly clearing a
 hint bit is, at the moment, harmless, but bogusly setting one is not.
 I really have to wonder how other products handle this.  PostgreSQL
 isn't the only database product that uses MVCC - not by a long shot -
 and the problem of detecting whether an XID is visible to the current
 snapshot can't be ours alone.  So what do other people do about this?
 They either don't cache the information about whether the XID is
 committed in-page (in which case, are they just slower or do they have
 some other means of avoiding the performance hit?) or they cache it in
 the page (in which case, they either WAL log it or they don't checksum
 it).  I mean, there aren't any other options, are there?

An examination of the MySQL source code reveals their answer.  In
row_vers_build_for_semi_consistent_read(), which I can't swear is the
right place but seems to be, there is this comment:

/* We assume that a rolled-back transaction stays in
TRX_ACTIVE state until all the changes have been
rolled back and the transaction is removed from
the global list of transactions. */

Which makes sense.  If you never leave rows from aborted transactions
in the heap forever, then the list of aborted transactions that you
need to remember for MVCC purposes will remain relatively small and
you can just include those XIDs in your MVCC snapshot.  Our problem is
that we have no particular bound on the number of aborted transactions
whose XIDs may still be floating around, so we can't do it that way.

dons asbestos underpants

To impose a similar bound in PostgreSQL, you'd need to maintain the
set of aborted XIDs and the relations that need to be vacuumed for
each one.  As you vacuum, you prune any tuples with aborted xmins
(which is WAL-logged already anyway) and additionally WAL-log clearing
the xmax for each tuple with an aborted xmax.  Thus, when you
finish vacuuming the relation, the aborted XID is no longer present
anywhere in it.  When you vacuum the last relation for a particular
XID, that XID no longer exists in the relation files anywhere and you
can remove it from the list of aborted XIDs.  I think that WAL logging
the list of XIDs and list of unvacuumed relations for each at each
checkpoint would be sufficient for crash safety.  If you did this, you
could then assume that any XID which precedes your snapshot's xmin is
committed.
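(A sketch of the bookkeeping that implies, with illustrative types and
an arbitrary per-entry limit; not a worked-out design:)

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;
typedef uint32_t RelId;

#define MAX_RELS_PER_ABORT 64   /* arbitrary illustrative limit */

/* One not-yet-cleaned-up aborted transaction: the XID plus the relations
 * that still need to be vacuumed before the XID can be forgotten. */
typedef struct AbortedXidEntry
{
    Xid   xid;
    int   nrels;
    RelId rels[MAX_RELS_PER_ABORT];
} AbortedXidEntry;

/* Vacuum finished one relation for this aborted XID; returns true when
 * no relations remain and the entry can be dropped, after which any XID
 * older than the snapshot's xmin can safely be assumed committed. */
static bool
aborted_xid_rel_vacuumed(AbortedXidEntry *entry, RelId rel)
{
    for (int i = 0; i < entry->nrels; i++)
    {
        if (entry->rels[i] == rel)
        {
            entry->rels[i] = entry->rels[--entry->nrels];
            break;
        }
    }
    return entry->nrels == 0;
}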

1. When a big abort happens, you may have to carry that XID around in
every snapshot - and avoid advancing RecentGlobalXmin - for quite a
long time.
2. You have to WAL log marking the XMAX of an aborted transaction invalid.
3. You have to WAL log the not-yet-cleaned-up XIDs and the relations
each one needs vacuumed at each checkpoint.
4. There would presumably be some finite limit on the size of the
shared memory structure for aborted transactions.  I don't think
there'd be any reason to make it particularly small, but if you sat
there and aborted transactions at top speed you might eventually run
out of room, at which point any transactions you started wouldn't be
able to abort until vacuum made enough progress to free up an entry.
5. It would be pretty much impossible to run with autovacuum turned
off, and in fact you would likely need to make it a good deal more
aggressive in the specific case of aborted transactions, to mitigate
problems #1, #3, and #4.

I'm not sure how bad those things would be, or if there are more that
I'm missing (besides the obvious it would be a lot of work).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Josh Berkus
On 11/9/10 1:50 PM, Robert Haas wrote:
 5. It would be pretty much impossible to run with autovacuum turned
 off, and in fact you would likely need to make it a good deal more
 aggressive in the specific case of aborted transactions, to mitigate
 problems #1, #3, and #4.

6. This would require us to be more aggressive about VACUUMing old, cold
relations/pages, e.g. VACUUM FREEZE.  It would make one of our worst
issues for data warehousing even worse.

What about having this map (and other hint bits) be per-relation?
Hmmm.  That wouldn't work for DDL, I suppose ...

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Kevin Grittner
Josh Berkus j...@agliodbs.com wrote:
 
 6. This would require us to be more aggressive about VACUUMing
 old-cold relations/page, e.g. VACUUM FREEZE.  This it would make
 one of our worst issues for data warehousing even worse.
 
I continue to feel that it is insane that, when a table is populated
within the same database transaction which created it (e.g., a bulk
load of a table or partition), we don't write the tuples with
hint bits set for commit and xmin frozen.  By the time any but the
creating transaction can see the tuples, *if* any other transaction
is ever able to see the tuples, these will be the correct values;
we really should be able to deal with it within the creating
transaction somehow.
 
If we ever handle that, would #6 be a moot point, or do you think
it's still a significant issue?
 
-Kevin



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Robert Haas
On Tue, Nov 9, 2010 at 5:03 PM, Josh Berkus j...@agliodbs.com wrote:
 On 11/9/10 1:50 PM, Robert Haas wrote:
 5. It would be pretty much impossible to run with autovacuum turned
 off, and in fact you would likely need to make it a good deal more
 aggressive in the specific case of aborted transactions, to mitigate
 problems #1, #3, and #4.

 6. This would require us to be more aggressive about VACUUMing old-cold
 relations/page, e.g. VACUUM FREEZE.  This it would make one of our worst
 issues for data warehousing even worse.

Uh, no it doesn't.  It only requires you to be more aggressive about
vacuuming the transactions that are in the aborted-XIDs array.  It
doesn't affect transaction wraparound vacuuming at all, either
positively or negatively.  You still have to freeze xmins before they
flip from being in the past to being in the future, but that's it.

 What about having this map (and other hintbits) be per-relation?  Hmmm.
  That wouldn't work for DDL, I suppose ...

This map?  I suppose you could track aborted XIDs per relation
instead of globally, but I don't see why that would affect DDL any
differently than anything else.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Robert Haas
On Tue, Nov 9, 2010 at 5:15 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Josh Berkus j...@agliodbs.com wrote:

 6. This would require us to be more aggressive about VACUUMing
 old-cold relations/page, e.g. VACUUM FREEZE.  This it would make
 one of our worst issues for data warehousing even worse.

 I continue to feel that it is insane that when a table is populated
 within the same database transaction which created it (e.g., a bulk
 load of a table or partition), that we don't write the tuples with
 hint bits set for commit and xmin frozen.  By the time any but the
 creating transaction can see the tuples, *if* any other transaction
 is ever able to see the tuples, these will be the correct values;
 we really should be able to deal with it within the creating
 transaction somehow.

I agree.

 If we ever handle that, would #6 be a moot point, or do you think
 it's still a significant issue?

I think it's a moot point anyway, per previous email.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Robert Haas
On Tue, Nov 9, 2010 at 3:05 PM, Greg Stark gsst...@mit.edu wrote:
 On Tue, Nov 9, 2010 at 7:37 PM, Josh Berkus j...@agliodbs.com wrote:
 Well, most of the other MVCC-in-table DBMSes simply don't deal with
 large, on-disk databases.  In fact, I can't think of one which does,
 currently; while MVCC has been popular for the New Databases, they're
 all focused on in-memory databases.  Oracle and InnoDB use rollback
 segments.

 Well rollback segments are still MVCC. However Oracle's MVCC is
 block-based. So they only have to do the visibility check once per
 block, not once per row. Once they find the right block version they
 can process all the rows on it.

 Also Oracle's snapshots are just the log position. Instead of having
 to check whether every transaction committed or not, they just find
 the block version which was last modified before the log position for
 when their transaction started.

That is cool.  One problem is that it might sometimes result in
additional I/O.  A transaction begins and writes a tuple.  We must
write a preimage of the page (or at least, sufficient information to
reconstruct a preimage of the page) to the undo segment.  If the
transaction commits relatively quickly, and all transactions which
took their snapshots before the commit end either by committing or by
aborting, we can discard that information from the undo segment
without ever writing it to disk.  However, if that doesn't happen, the
undo log page may get evicted, and we're now doing three writes (WAL,
page, undo) rather than just two (WAL, page).  That's no worse than an
update where the old and new tuples land on different pages, but it IS
worse than an update where the old and new tuples are on the same
page, or at least I think it is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Josh Berkus
Robert,

 Uh, no it doesn't.  It only requires you to be more aggressive about
 vacuuming the transactions that are in the aborted-XIDs array.  It
 doesn't affect transaction wraparound vacuuming at all, either
 positively or negatively.  You still have to freeze xmins before they
 flip from being in the past to being in the future, but that's it.

Sorry, I was trying to say that it's similar to the freeze issue, not
that it affects freeze.  Sorry for the lack of clarity.

What I was getting at is that this could cause us to vacuum
relations/pages which would otherwise never be vacuumed (or at least,
not until freeze).  Imagine a very large DW table which is normally
insert-only and seldom queried, but once a month or so the insert aborts
and rolls back.

I'm not saying that your proposal isn't worth testing.  I'm just saying
that it may prove to be a net loss to overall system efficiency.

 If we ever handle that, would #6 be a moot point, or do you think
  it's still a significant issue?

Kevin, the case which your solution doesn't fix is the common one of
log tables which keep adding records continuously, with  5% inserts
or updates.  That may seem like a corner case but such a table,
partitioned or unpartitioned, exists in around 1/3 of the commercial
applications I've worked on, so it's a common pattern.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 Though incidentally all of the other items you mentioned are generic
 problems caused by MVCC, not hint bits.

 Yes, but the hint bits prevent us from implementing workarounds.

If we got rid of hint bits, we'd need workarounds for the ensuing
massive performance loss.  There is no reason whatsoever to imagine
that we'd come out ahead in the end.

regards, tom lane



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 dons asbestos underpants
 4. There would presumably be some finite limit on the size of the
 shared memory structure for aborted transactions.  I don't think
 there'd be any reason to make it particularly small, but if you sat
 there and aborted transactions at top speed you might eventually run
 out of room, at which point any transactions you started wouldn't be
 able to abort until vacuum made enough progress to free up an entry.

Um, that bit is a *complete* nonstarter.  The possibility of a failed
transaction always has to be allowed.  What if vacuum itself gets an
error for example?  Or, what if the system crashes?

I thought for a bit about inverting the idea, such that there were a
limit on the number of unvacuumed *successful* transactions rather than
the number of failed ones.  But that seems just as unforgiving: what if
you really need to commit a transaction to effect some system state
change?  An example might be dropping some enormous table that you no
longer need, but vacuum is going to insist on plowing through before
it'll let you have any more transactions.

I'm of the opinion that any design that presumes it can always fit all
the required transaction-status data in memory is probably not even
worth discussing.  There always has to be a way for status data to spill
to disk.  What's interesting is how you can achieve enough locality of
access so that most of what you need to look at is usually in memory.

regards, tom lane



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Robert Haas
On Tue, Nov 9, 2010 at 5:45 PM, Josh Berkus j...@agliodbs.com wrote:
 Robert,

 Uh, no it doesn't.  It only requires you to be more aggressive about
 vacuuming the transactions that are in the aborted-XIDs array.  It
 doesn't affect transaction wraparound vacuuming at all, either
 positively or negatively.  You still have to freeze xmins before they
 flip from being in the past to being in the future, but that's it.

 Sorry, I was trying to say that it's similar to the freeze issue, not
 that it affects freeze.  Sorry for the lack of clarity.

 What I was getting at is that this could cause us to vacuum
 relations/pages which would otherwise never be vacuumed (or at least,
 not until freeze).  Imagine a very large DW table which is normally
 insert-only and seldom queried, but once a month or so the insert aborts
 and rolls back.

Oh, I see.  In that case, under the proposed scheme, you'd get an
immediate vacuum of everything inserted into the table since the last
failed insert.  Everything prior to the last failed insert would be
OK, since the visibility map bits would already be set for those
pages.  Yeah, that would be annoying.

There's a related problem with index-only scans.  A large DW table
which is normally insert-only, but which IS queried regularly, won't
be able to use index-only scans effectively, because without regular
vacuuming the visibility map bits won't be set.  We've
previously discussed the possibility of having the background writer
set hint bits before writing the pages, and maybe it could even set
the all-visible bit and update the visibility map, too.  But that
won't help if the transaction inserts a large enough quantity of data
that it starts spilling buffers to disk before it commits.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Gurjeet Singh
On Wed, Nov 10, 2010 at 1:15 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Once you know that there is, or isn't,
 a filesystem-level error involved, what are you going to do next?
 You're going to go try to debug the component you know is at fault,
 that's what.  And that problem is still AI-complete.


If we know for sure that Postgres was not at fault, then we have a standby node
to fail over to, where a Postgres warm standby is being maintained by streaming
replication.

Regards
-- 
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh.gurj...@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Robert Haas
On Tue, Nov 9, 2010 at 6:42 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 dons asbestos underpants
 4. There would presumably be some finite limit on the size of the
 shared memory structure for aborted transactions.  I don't think
 there'd be any reason to make it particularly small, but if you sat
 there and aborted transactions at top speed you might eventually run
 out of room, at which point any transactions you started wouldn't be
 able to abort until vacuum made enough progress to free up an entry.

 Um, that bit is a *complete* nonstarter.  The possibility of a failed
 transaction always has to be allowed.  What if vacuum itself gets an
 error for example?  Or, what if the system crashes?

I wasn't proposing that it was impossible to abort, only that aborts
might have to block.  I admit I don't know what to do about VACUUM
itself failing.  A transient failure mightn't be so bad, but if you
find yourself permanently unable to eradicate the XIDs left behind by
an aborted transaction, you'll eventually have to shut down the
database, lest the XID space wrap around.

Actually, come to think of it, there's no reason you COULDN'T spill
the list of aborted-but-not-yet-cleaned-up XIDs to disk.  It's just
that XidInMVCCSnapshot() would get reeally expensive after a
while.

 I thought for a bit about inverting the idea, such that there were a
 limit on the number of unvacuumed *successful* transactions rather than
 the number of failed ones.  But that seems just as unforgiving: what if
 you really need to commit a transaction to effect some system state
 change?  An example might be dropping some enormous table that you no
 longer need, but vacuum is going to insist on plowing through before
 it'll let you have any more transactions.

The number of relevant aborted XIDs tends naturally to decline to zero
as vacuum does its thing, while the number of relevant committed XIDs
tends to grow very, very large (it starts to decline only when we
start freezing things), so remembering the not-yet-cleaned-up aborted
XIDs seems likely to be cheaper.  In fact, in many cases, the set of
not-yet-cleaned-up aborted XIDs will be completely empty.

 I'm of the opinion that any design that presumes it can always fit all
 the required transaction-status data in memory is probably not even
 worth discussing.

Well, InnoDB does it.

 There always has to be a way for status data to spill
 to disk.  What's interesting is how you can achieve enough locality of
 access so that most of what you need to look at is usually in memory.

We're not going to get any more locality of reference than we're
already getting from hint bits, are we?  The advantage of trying to do
timely cleanup of aborted transactions is that you can assume that any
XID before RecentGlobalXmin is committed, without checking CLOG and
without having to update hint bits and write out the ensuing dirty
pages.  If we could make CLOG access cheap enough that we didn't need
hint bits, that would also solve that problem, but nobody (including
me) seems to think that's feasible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Robert Haas
On Tue, Nov 9, 2010 at 7:04 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Nov 9, 2010 at 5:45 PM, Josh Berkus j...@agliodbs.com wrote:
 Robert,

 Uh, no it doesn't.  It only requires you to be more aggressive about
 vacuuming the transactions that are in the aborted-XIDs array.  It
 doesn't affect transaction wraparound vacuuming at all, either
 positively or negatively.  You still have to freeze xmins before they
 flip from being in the past to being in the future, but that's it.

 Sorry, I was trying to say that it's similar to the freeze issue, not
 that it affects freeze.  Sorry for the lack of clarity.

 What I was getting at is that this could cause us to vacuum
 relations/pages which would otherwise never be vacuumed (or at least,
 not until freeze).  Imagine a very large DW table which is normally
 insert-only and seldom queried, but once a month or so the insert aborts
 and rolls back.

 Oh, I see.  In that case, under the proposed scheme, you'd get an
 immediate vacuum of everything inserted into the table since the last
 failed insert.  Everything prior to the last failed insert would be
 OK, since the visibility map bits would already be set for those
 pages.  Yeah, that would be annoying.

Ah, but it might be fixable.  You wouldn't really need to do a
full-fledged vacuum.  It would be sufficient to scan the heap pages
that might contain the XID we're trying to clean up after, without
touching the indexes.  Instead of actually removing tuples with an
aborted XMIN, you could just mark the line pointers LP_DEAD.  Tuples
with an aborted XMAX don't require touching the indexes anyway.  So as
long as you have some idea which segment of the relation was
potentially dirtied by that transaction, you could just scan those
blocks and update the item pointers and/or XMAX values for the
offending tuples without doing anything else (although you'd probably
want to opportunistically grab the buffer cleanup lock and defragment
if possible).
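(Sketched with a simplified tuple representation, not the real heap
page code, that per-page pass would be roughly:)

#include <stdint.h>

typedef uint32_t Xid;

#define INVALID_XID 0

/* Simplified stand-in for a heap tuple's line pointer state and XIDs. */
typedef struct TupleSketch
{
    int  lp_dead;               /* 1 if the line pointer is marked dead */
    Xid  xmin;
    Xid  xmax;
} TupleSketch;

/* Erase all traces of one aborted XID from a page: tuples it inserted
 * become dead line pointers; tuples it tried to delete get their xmax
 * cleared.  Neither case requires visiting the indexes here. */
static void
zap_aborted_xid(TupleSketch *tuples, int ntuples, Xid aborted)
{
    for (int i = 0; i < ntuples; i++)
    {
        if (tuples[i].xmin == aborted)
            tuples[i].lp_dead = 1;
        if (tuples[i].xmax == aborted)
            tuples[i].xmax = INVALID_XID;
    }
}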

Unfortunately, I'm now realizing another problem.  During recovery,
you have to assume that any XIDs that didn't commit are aborted; under
the scheme I proposed upthread, if a transaction that was in-flight at
crash time had begun prior to the last checkpoint, you wouldn't know
which relations it had potentially dirtied.  Ouch.  But I think this
is fixable, too.  Let's invent a new on-disk structure called the
content-modified log.  Transactions that want to insert, update, or
delete tuples allocate pages from this structure.  The header of each
page stores the XID of the transaction that owns that page and the ID
of the database to which that transaction is bound.  Following the
header, there are a series of records of the form: tablespace OID,
table OID, starting page number, ending page number.  Each such record
indicates that the given XID may have put its XID on disk within the
given page range of the specified relation.  Each checkpoint flushes
the dirty pages of the modified-content log to disk along with
everything else.  Thus, on redo, we can reconstruct the additional
entries that need to be added to the log from the contents of WAL
subsequent to the redo pointer.
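(A sketch of what one page of that structure might hold; field names
and widths are illustrative only:)

#include <stdint.h>

typedef uint32_t Xid;
typedef uint32_t Oid;
typedef uint32_t BlockNum;

/* One "this XID may appear in these blocks" record. */
typedef struct ModifiedRangeSketch
{
    Oid      tablespace;
    Oid      relation;
    BlockNum start_page;
    BlockNum end_page;
} ModifiedRangeSketch;

/* One page of the proposed content-modified log: owned by a single
 * transaction in a single database, followed by its range records. */
typedef struct ModifiedLogPageSketch
{
    Xid      owner_xid;
    Oid      database;
    uint16_t nranges;
    ModifiedRangeSketch ranges[];   /* nranges entries follow the header */
} ModifiedLogPageSketch;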

If a transaction commits, we can remove all of its pages from the
modified-content log; in fact, if a transaction begins and commits
without an intervening checkpoint, the pages never need to hit the
disk at all.  If a transaction aborts, its modified-content log pages
must stick around until we've eradicated any copies of its XID in the
relation data files.  We maintain a global value for the oldest
aborted XID which is not yet fully cleaned up (let's call this the
OldestNotQuiteDeadYetXID).  When we see an XID which precedes
OldestNotQuiteDeadYetXID, we know it's committed.  Otherwise, we check
whether the XID precedes the xmin of our snapshot.  If it does, we
have to check whether the XID is committed or aborted (it must be one
or the other).  If it does not, we use our snapshot, as now.  Checking
XIDs between OldestNotQuiteDeadYetXID and our snapshot's xmin is
potentially expensive, but (1) if there aren't many aborted
transactions, this case shouldn't arise very often; (2) if the XID
turns out to be aborted and we can get an exclusive buffer content
lock, we can nuke that copy of the XID to save the next guy the
trouble of examining it; and (3) we can maintain a size-limited
per-backend cache of this information, which should help in the normal
cases where there either aren't that many XIDs that fall into this
category or our transaction doesn't see all that many of them.
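(Putting the lookup order together; the helper functions declared here
are assumptions for the sketch, not existing backend routines:)

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;

/* Assumed helpers for the sketch (not actual backend functions). */
extern bool xid_precedes(Xid a, Xid b);        /* wraparound-aware "a < b" */
extern bool xid_committed_in_clog(Xid xid);    /* definitive commit lookup */
extern bool xid_visible_in_snapshot(Xid xid);  /* ordinary snapshot test */

/* Visibility decision under the proposed scheme. */
static bool
xid_is_visible(Xid xid, Xid oldest_not_quite_dead_yet, Xid snapshot_xmin)
{
    if (xid_precedes(xid, oldest_not_quite_dead_yet))
        return true;                         /* older than any uncleaned abort:
                                              * must be committed */
    if (xid_precedes(xid, snapshot_xmin))
        return xid_committed_in_clog(xid);   /* committed or aborted: look it up */
    return xid_visible_in_snapshot(xid);     /* otherwise use the snapshot */
}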

This also addresses Tom's concern about needing to store all the
information in memory, and the need to WAL-log not-yet-cleaned-up XIDs
at each checkpoint.  You still need to aggressively clean up after
aborted transactions, either using our current vacuum mechanism or the
"just zap the XIDs" shortcut described above.

(An additional interesting point about this design is that you could
potentially also use it to 

Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-08 Thread Aidan Van Dyk
On Sun, Nov 7, 2010 at 1:04 AM, Greg Stark gsst...@mit.edu wrote:
 It does seem like this is kind of part and parcel of adding checksums
 to blocks. It's arguably kind of silly to add checksums to blocks but
 have an commonly produced bitpattern in corruption cases go
 undetected.

Getting back to the checksum debate (and this seems like a
semi-version of the checksum debate), now that we have forks, could we
easily add block checksumming to a fork?  It would mean writing to two
files, but that shouldn't be a problem, because until the checkpoint is
done (and thus both writes), the full-page-write in WAL is going to
take precedence on recovery.

a.


-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-08 Thread Tom Lane
Aidan Van Dyk ai...@highrise.ca writes:
 Getting back to the checksum debate (and this seems like a
 semi-version of the checksum debate), now that we have forks, could we
 easily add block checksumming to a fork?  IT would mean writing to 2
 files but that shouldn't be a problem, because until the checkpoint is
 done (and thus both writes), the full-page-write in WAL is going to
 take precedence on recovery.

Doesn't seem like a terribly good design: damage to a checksum page
would mean that O(1000) data pages are now thought to be bad.

More generally, this re-opens the question of whether data in secondary
forks is authoritative or just hints.  Currently, we treat it as just
hints, for both FSM and VM, and thus sidestep the problem of
guaranteeing its correctness.  To use a secondary fork for checksums,
you'd need to guarantee correctness of writes to it.  This is the same
problem that index-only scans are hung up on, ie making the VM reliable.
I forget whether Heikki had a credible design sketch for making that
happen, but in any case it didn't look easy.

regards, tom lane



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-08 Thread Tom Lane
Gurjeet Singh singh.gurj...@gmail.com writes:
 On Sat, Nov 6, 2010 at 11:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Um ... and exactly how does that differ from the existing behavior?

 Right now a zero-filled page is considered valid and is treated as a new page;
 see PageHeaderIsValid() (the /* Check all-zeroes case */ branch) and PageIsNew().
 This means that looking at a zero-filled page on disk (say, after a crash) does
 not give us any clue whether it was indeed left zeroed by Postgres or whether
 the FS/storage failed to do its job.

I think this is really a non-problem.  You said earlier that the
underlying filesystem uses 4K blocks.  Filesystem misfeasance would
therefore presumably affect 4K at a time.  If you see that both halves
of an 8K block are zero, it's far more likely that Postgres left it that
way than that the filesystem messed up.  Of course, if only one half of
an 8K page went to zeroes, you know the filesystem or disk did it.
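
That check is cheap to spell out; a minimal sketch, assuming the usual 8K
Postgres page on top of 4K filesystem blocks (diagnostic code only, not
anything in the server):

#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ  8192            /* Postgres page size */
#define FSBLKSZ 4096            /* filesystem block size */

static bool
range_is_zero(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] != 0)
            return false;
    return true;
}

/*
 * 0 = page is not zeroed at all
 * 1 = both 4K halves zero: most likely a page Postgres itself left new
 * 2 = exactly one 4K half zero: the filesystem or disk is the suspect
 */
static int
classify_zero_page(const unsigned char *page)
{
    bool lo = range_is_zero(page, FSBLKSZ);
    bool hi = range_is_zero(page + FSBLKSZ, BLCKSZ - FSBLKSZ);

    if (lo && hi)
        return 1;
    if (lo || hi)
        return 2;
    return 0;
}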

There are also crosschecks that you can apply: if it's a heap page, are
there any index pages with pointers to it?  If it's an index page, are
there downlink or sibling links to it from elsewhere in the index?
A page that Postgres left as zeroes would not have any references to it.

IMO there are a lot of methods that can separate filesystem misfeasance
from Postgres errors, probably with greater reliability than this hack.
I would also suggest that you don't really need to prove conclusively
that any particular instance is one or the other --- a pattern across
multiple instances will tell you what you want to know.

 This change would increase the diagnosability of zero-page issues, and help
 the users point fingers at the right places.

[ shrug... ] If there were substantial user clamor for diagnosing
zero-page issues, I might be for this.  As is, I think it's a non-problem.
What's more, if I did believe that this was a safe and
reliable technique, I'd be unhappy about the opportunity cost of
reserving it for zero-page testing rather than other purposes.

regards, tom lane



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-08 Thread Tom Lane
I wrote:
 Aidan Van Dyk ai...@highrise.ca writes:
 Getting back to the checksum debate (and this seems like a
 semi-version of the checksum debate), now that we have forks, could we
 easily add block checksumming to a fork?

 More generally, this re-opens the question of whether data in secondary
 forks is authoritative or just hints.  Currently, we treat it as just
 hints, for both FSM and VM, and thus sidestep the problem of
 guaranteeing its correctness.  To use a secondary fork for checksums,
 you'd need to guarantee correctness of writes to it.

... but wait a minute.  What if we treated the checksum as a hint ---
namely, on checksum failure, we just log a warning rather than doing
anything drastic?  A warning is probably all you want to happen anyway.

A corrupted page of checksums would then show up as warnings for most or
all of a range of data pages, and it'd be pretty obvious (if the data
seemed OK) where the failure had been.
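
A minimal sketch of the read-side check under that scheme (the trivial
checksum and the function names are placeholders, not the real server API):

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <inttypes.h>

#define BLCKSZ 8192

/* Trivial stand-in for whatever checksum algorithm gets chosen. */
static uint32_t
page_checksum(const unsigned char *page)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < BLCKSZ; i++)
        sum = sum * 31 + page[i];
    return sum;
}

/*
 * Returns 1 if the page matches its stored checksum.  On mismatch only a
 * warning is emitted and the page is still used, so a stale checksum
 * (e.g. one computed before a later hint-bit-only change) never makes
 * good data unreadable.
 */
static int
verify_page_checksum(const unsigned char *page, uint32_t blkno,
                     uint32_t stored)
{
    uint32_t actual = page_checksum(page);

    if (actual != stored)
    {
        fprintf(stderr,
                "WARNING: checksum mismatch on block %" PRIu32
                " (stored %08" PRIX32 ", computed %08" PRIX32 ")\n",
                blkno, stored, actual);
        return 0;
    }
    return 1;
}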

So maybe Aidan's got a good idea here.  It would sure be a lot easier
to shoehorn checksum checking in as an optional feature if the checksums
were kept someplace else.

regards, tom lane



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-08 Thread Greg Stark
On Mon, Nov 8, 2010 at 5:00 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 So maybe Aidan's got a good idea here.  It would sure be a lot easier
 to shoehorn checksum checking in as an optional feature if the checksums
 were kept someplace else.

Would it? I thought the only problem was the hint bits being set
behind the checksummer's back. That'll still happen even if it's
written to a different place.



-- 
greg



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-08 Thread Aidan Van Dyk
On Mon, Nov 8, 2010 at 12:53 PM, Greg Stark gsst...@mit.edu wrote:
 On Mon, Nov 8, 2010 at 5:00 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 So maybe Aidan's got a good idea here.  It would sure be a lot easier
 to shoehorn checksum checking in as an optional feature if the checksums
 were kept someplace else.

 Would it? I thought the only problem was the hint bits being set
 behind the checksummer's back. That'll still happen even if it's
 written to a different place.

The problem that putting checksums in a different place solves is the
page layout (binary upgrade) problem.  You're still going to need to
buffer the page as you calculate the checksum and write it out.
Buffering that page is absolutely necessary no matter where you put the
checksum, unless you've got an exclusive lock that blocks even hint-bit
updates on the page.
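
In code terms, the buffered write being described amounts to something like
the sketch below, assuming the caller holds at least a shared content lock on
the buffer and that checksums go to a separate fork (the checksum function and
file handling are placeholders):

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

#define BLCKSZ 8192

/* Trivial stand-in for whatever checksum algorithm gets chosen. */
static uint32_t
page_checksum(const unsigned char *page)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < BLCKSZ; i++)
        sum = sum * 31 + page[i];
    return sum;
}

/*
 * Copy the shared page into a backend-local buffer, checksum that copy,
 * and write out the same copy, so the checksum always matches the bytes
 * actually handed to the kernel.
 */
static int
checksum_and_write(const unsigned char *shared_page,
                   int datafd, int cksfd, uint32_t blkno)
{
    unsigned char copy[BLCKSZ];
    uint32_t      crc;

    /*
     * Caller is assumed to hold at least a shared content lock, so the
     * copy below is internally consistent; hint bits set after the
     * memcpy simply miss this write and get flushed next time.
     */
    memcpy(copy, shared_page, BLCKSZ);
    crc = page_checksum(copy);

    if (pwrite(datafd, copy, BLCKSZ, (off_t) blkno * BLCKSZ) != BLCKSZ)
        return -1;
    if (pwrite(cksfd, &crc, sizeof(crc),
               (off_t) blkno * sizeof(crc)) != (ssize_t) sizeof(crc))
        return -1;
    return 0;
}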

But if we can start using forks to hold other data, keeping the page
layout stable becomes easier, and thus binary upgrades are much more
feasible.

At least, that was my thought WRT checksums being out-of-page.

a.

-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-06 Thread Tom Lane
Gurjeet Singh singh.gurj...@gmail.com writes:
 .) The basic idea is to have a magic number in every PageHeader before it is
 written to disk, and check for this magic number when performing page
 validity checks.

Um ... and exactly how does that differ from the existing behavior?

 .) To avoid adding a new field to PageHeader, and any code breakage, we
 reuse an existing member of the structure.

The amount of fragility introduced by the assumptions you have to make
for this seems to me to be vastly riskier than the risk you are trying
to respond to.

regards, tom lane



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-06 Thread Gurjeet Singh
On Sat, Nov 6, 2010 at 11:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Gurjeet Singh singh.gurj...@gmail.com writes:
  .) The basic idea is to have a magic number in every PageHeader before it
  is written to disk, and check for this magic number when performing page
  validity checks.

 Um ... and exactly how does that differ from the existing behavior?


Right now a zero-filled page is considered valid and is treated as a new page;
see PageHeaderIsValid() (the /* Check all-zeroes case */ branch) and PageIsNew().
This means that looking at a zero-filled page on disk (say, after a crash) does
not give us any clue whether it was indeed left zeroed by Postgres or whether
the FS/storage failed to do its job.

With the proposed change, if it is a valid page (a page actually written by
Postgres) it will either have a sensible LSN or the magic LSN; the LSN will
never be zero. OTOH, if we encounter a zero-filled page (i.e. LSN = {0,0}), it
would clearly implicate elements outside Postgres in making that page zero.


 The amount of fragility introduced by the assumptions you have to make
 for this seems to me to be vastly riskier than the risk you are trying
 to respond to.


I understand that it is a pretty low-level change, but IMHO the change is
minimal and is being applied in well-understood places. All the assumptions
listed have held for quite a while, and I don't see them being affected in the
near future. The most crucial assumptions we have to work with are that
XLogPtr{n, 0x} will never be used, and that mdextend() is the only place that
extends a relation (until we implement an md.c sibling, say flash.c or tape.c;
the last change to md.c regarding mdextend() was in January 2007).

Only mdextend() and PageHeaderIsValid() need to know about this change in
behaviour; all the other APIs work and behave the same as they do now.
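
To make the shape of the change concrete, a rough sketch of the two touch
points, with the magic LSN value left as a purely hypothetical placeholder
(the struct mirrors the {xlogid, xrecoff} layout; this is not the actual
mdextend()/PageHeaderIsValid() code):

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define BLCKSZ 8192

typedef struct
{
    uint32_t xlogid;            /* high half of the LSN */
    uint32_t xrecoff;           /* low half of the LSN */
} XLogRecPtr;

/* Hypothetical magic value; the exact constant is not decided here. */
static const XLogRecPtr MagicUninitLSN = { 0, 0xFFFFFFFFu };

static bool
lsn_equal(XLogRecPtr a, XLogRecPtr b)
{
    return a.xlogid == b.xlogid && a.xrecoff == b.xrecoff;
}

/* mdextend() side: stamp the otherwise all-zero page before writing it. */
static void
stamp_new_page(unsigned char *page)
{
    memset(page, 0, BLCKSZ);
    memcpy(page, &MagicUninitLSN, sizeof(MagicUninitLSN)); /* pd_lsn is at offset 0 */
}

typedef enum
{
    PAGE_HAS_DATA,              /* ordinary LSN: written normally */
    PAGE_NEW_BY_POSTGRES,       /* magic LSN: extended but never written to */
    PAGE_ZEROED_EXTERNALLY      /* LSN = {0,0}: someone else zeroed it */
} PageLsnVerdict;

/* PageHeaderIsValid() side: distinguish "new" pages from damaged ones. */
static PageLsnVerdict
classify_page_lsn(const unsigned char *page)
{
    XLogRecPtr       lsn;
    const XLogRecPtr zero = { 0, 0 };

    memcpy(&lsn, page, sizeof(lsn));

    if (lsn_equal(lsn, MagicUninitLSN))
        return PAGE_NEW_BY_POSTGRES;
    if (lsn_equal(lsn, zero))
        return PAGE_ZEROED_EXTERNALLY;
    return PAGE_HAS_DATA;
}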

This change would increase the diagnosability of zero-page issues, and help
the users point fingers at the right places.

Regards,
-- 
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh.gurj...@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-06 Thread Greg Stark
On Sun, Nov 7, 2010 at 4:23 AM, Gurjeet Singh singh.gurj...@gmail.com wrote:
 I understand that it is a pretty low-level change, but IMHO the change is
 minimal and is being applied in well-understood places. All the assumptions
 listed have held for quite a while, and I don't see them being affected in the
 near future. The most crucial assumptions we have to work with are that
 XLogPtr{n, 0x} will never be used, and that mdextend() is the only place that
 extends a relation (until we implement an md.c sibling, say flash.c or tape.c;
 the last change to md.c regarding mdextend() was in January 2007).

I think the assumption that isn't tested here is what happens if the
server crashes. The logic may work fine as long as nothing goes wrong,
but if something does go wrong it has to be fool-proof.

I think having zero-filled blocks at the end of the file if it has
been extended but hasn't been fsynced is an expected failure mode of a
number of filesystems. The log replay can't assume seeing such a block
is a problem since that may be precisely the result of the crash that
caused the replay. And if you disable checking for this during WAL
replay then you've lost your main chance to actually detect the
problem.

Another issue -- though I think a manageable one -- is that I expect
we'll want to be using posix_fallocate() sometime soon. That will
allow efficient, guaranteed pre-allocated space with better contiguous
layout than we get currently. But ext4 can only pretend to give
zero-filled blocks, not any arbitrary bitpattern we request. I can see
this being an optional feature that is just not compatible with using
posix_fallocate(), though.
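
For reference, the posix_fallocate() interaction, as a standalone illustration
with a made-up file name; the point is that the preallocated range reads back
as zeroes, so it cannot carry a pre-stamped bitpattern:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(void)
{
    int fd = open("relation_segment.tmp", O_RDWR | O_CREAT, 0600);
    int err;

    if (fd < 0)
    {
        perror("open");
        return EXIT_FAILURE;
    }

    /*
     * Reserve space for 128 blocks.  On ext4 this allocates (ideally
     * contiguous) extents marked unwritten: reads return zeroes, and no
     * bitpattern of our choosing can be placed there without writing it.
     */
    err = posix_fallocate(fd, 0, (off_t) 128 * BLCKSZ);
    if (err != 0)
        fprintf(stderr, "posix_fallocate failed: %d\n", err);

    close(fd);
    return (err == 0) ? EXIT_SUCCESS : EXIT_FAILURE;
}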

It does seem like this is kind of part and parcel of adding checksums
to blocks. It's arguably kind of silly to add checksums to blocks but
have a commonly produced bitpattern in corruption cases go
undetected.

-- 
greg
