Re: [HACKERS] strange nbtree corruption report

2011-11-29 Thread Robert Haas
On Mon, Nov 28, 2011 at 4:44 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 I wonder if it's related, because it seems pretty
 much the same mechanism: sometimes, a btree index insert would be
 randomly forgotten (its page write lost in vacuum, so to speak), ...

Groan.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strange nbtree corruption report

2011-11-29 Thread Robert Haas
On Mon, Nov 28, 2011 at 8:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Alvaro Herrera alvhe...@commandprompt.com writes:
 I wonder if it would be worthwhile to build some sort of tool to scan
 the heap and ensure that there are index entries for all heap items,
 just to test the hypothesis.  Not that this would enlighten on the
 source of the actual problem.

 Seems like the hypothesis could be proven or disproven just by counting
 the heap and index entries while the DB is otherwise idle.  It used to
 be that VACUUM VERBOSE was sufficient for that sort of cross-check ...
 but I'm not totally sure what push-ups are required nowadays to prevent
 it from deciding that it's smarter than you are so it needn't scan the
 whole table.  Is VACUUM FREEZE VERBOSE still trustworthy for this?

Yes.  It will scan the whole table if relfrozenxid is too far back,
and FREEZE defines too far back to 0 transaction IDs, which is
always satisfied.

It might be useful to add an option to VACUUM to scan the whole table
without otherwise altering the behavior, in case you distrust the
visibility map or whatever but don't care about freezing anything.
The obvious name for the option would be full, but since that's
already taken we'd need to invent something different.  PARANOID?
ALL?  SCAN_ALL?  INEFFICIENTLY?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strange nbtree corruption report

2011-11-28 Thread Alvaro Herrera

Excerpts from Tom Lane's message of mar nov 22 01:14:33 -0300 2011:
 Alvaro Herrera alvhe...@alvh.no-ip.org writes:
  We got a very strange nbtree corruption report some time ago.  This was
  a btree index on a vey high churn table -- entries are updated and
  deleted very quickly, so the index grows very large and also shrinks
  quickly (AFAICT this is a work queue of sorts).
 
  The most strange thing of all is that there was this error:
 
  ERROR:  left link changed unexpectedly in block 3378 of index index_name 
  CONTEXT:  automatic vacuum of table table_name
 
  This was reported not once, but several dozens of times, by each new
  autovacuum worker that tried to vacuum the table.
 
  As far as I can see, there is just no way for this to happen ... much
  less happen repeatedly.
 
 It's not hard to believe that that would happen repeatedly given a
 corrupted set of sibling links, eg deletable page A links left to page
 B, which links right to C, which links right to A.  The question is how
 the index got into such a state.  A dropped update during a page split
 would explain it (ie, B used to be A's left sibling, then at some point
 B got split into B and C, but A's left-link never got updated on disk).
 I wonder how reliable their disk+filesystem is ...

While summarizing this, a (relatively) frequent problem with unique
indexes came to my mind: there would be a UNIQUE index but when the
admin tries to rebuild it for whatever reason, duplicate values are
found.  We've seen dozens of reports of this kind of problem (in the
pgsql lists I mean -- I don't think we've seen this problem in this
customer's servers).  I wonder if it's related, because it seems pretty
much the same mechanism: sometimes, a btree index insert would be
randomly forgotten (its page write lost in vacuum, so to speak), and
thus when the second heap item comes along, there's no entry in the
index and the insert completes, and there you have your duplicate value.

I wonder if it would be worthwhile to build some sort of tool to scan
the heap and ensure that there are index entries for all heap items,
just to test the hypothesis.  Not that this would enlighten on the
source of the actual problem.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strange nbtree corruption report

2011-11-28 Thread Jim Nasby
On Nov 28, 2011, at 3:44 PM, Alvaro Herrera wrote:
 I wonder if it would be worthwhile to build some sort of tool to scan
 the heap and ensure that there are index entries for all heap items,
 just to test the hypothesis.  Not that this would enlighten on the
 source of the actual problem.

There was a project to create a sanity-checker for Postgres databases... did 
that ever go anywhere? It seems like this would be a good addition for that 
tool, if it exists.

In either case, I am all for better capabilities to detect data problems (I'm 
looking at you, block checksum project! ;)
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strange nbtree corruption report

2011-11-28 Thread Bruce Momjian
Jim Nasby wrote:
 On Nov 28, 2011, at 3:44 PM, Alvaro Herrera wrote:
  I wonder if it would be worthwhile to build some sort of tool to scan
  the heap and ensure that there are index entries for all heap items,
  just to test the hypothesis.  Not that this would enlighten on the
  source of the actual problem.
 
 There was a project to create a sanity-checker for Postgres databases...
 did that ever go anywhere? It seems like this would be a good addition
 for that tool, if it exists.

Not that I know of.

--
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strange nbtree corruption report

2011-11-28 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 I wonder if it would be worthwhile to build some sort of tool to scan
 the heap and ensure that there are index entries for all heap items,
 just to test the hypothesis.  Not that this would enlighten on the
 source of the actual problem.

Seems like the hypothesis could be proven or disproven just by counting
the heap and index entries while the DB is otherwise idle.  It used to
be that VACUUM VERBOSE was sufficient for that sort of cross-check ...
but I'm not totally sure what push-ups are required nowadays to prevent
it from deciding that it's smarter than you are so it needn't scan the
whole table.  Is VACUUM FREEZE VERBOSE still trustworthy for this?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strange nbtree corruption report

2011-11-22 Thread Alvaro Herrera

Excerpts from Tom Lane's message of mar nov 22 01:14:33 -0300 2011:
 Alvaro Herrera alvhe...@alvh.no-ip.org writes:

  ERROR:  left link changed unexpectedly in block 3378 of index index_name 
  CONTEXT:  automatic vacuum of table table_name
 
  This was reported not once, but several dozens of times, by each new
  autovacuum worker that tried to vacuum the table.
 
  As far as I can see, there is just no way for this to happen ... much
  less happen repeatedly.
 
 It's not hard to believe that that would happen repeatedly given a
 corrupted set of sibling links, eg deletable page A links left to page
 B, which links right to C, which links right to A.  The question is how
 the index got into such a state.  A dropped update during a page split
 would explain it (ie, B used to be A's left sibling, then at some point
 B got split into B and C, but A's left-link never got updated on disk).
 I wonder how reliable their disk+filesystem is ...

Well, there are no other signs of random data corruption, such as toast
pointers getting corrupted which is the number one symptom showing up
when underlying storage is flaky.  However, it may be possible that
there was a transient storage problem which only affected this one page;
if this persisted in the way you describe, it might well explain these
symptoms.

Another thing I noticed is that there was corruption in heap pages (not
the same server, though; it was a different Londiste slave).  This was
even more strange; the pages would be completely fine, except the first
six words corresponding to the page header; they would be all zeros.
When filled with valid-looking data (mostly I copied the bytes from
neighbor pages), the rest of the page would decode fine.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strange nbtree corruption report

2011-11-21 Thread Noah Misch
On Mon, Nov 21, 2011 at 08:00:21PM -0300, Alvaro Herrera wrote:
 We got a very strange nbtree corruption report some time ago.  This was
 a btree index on a vey high churn table -- entries are updated and
 deleted very quickly, so the index grows very large and also shrinks
 quickly (AFAICT this is a work queue of sorts).
 
 The most strange thing of all is that there was this error:
 
 ERROR:  left link changed unexpectedly in block 3378 of index index_name 
 CONTEXT:  automatic vacuum of table table_name

Just a suspicion ... when looking at the B-tree page reclamation algorithm, I
had a thought that the logic in _bt_page_recyclable() was obsolete as of the
introduction (in 8.3) of xid-free read-only transactions.  A transaction
without a persistent xid does not hold back RecentXmin, so how could waiting
for a RecentXmin window to pass prove that no scan still holds a link to the
page?  Similarly, running VACUUMs do not hold back RecentXmin.  I had made a
note to try to reproduce a concrete problem, but I haven't tried yet.

 This was reported not once, but several dozens of times, by each new
 autovacuum worker that tried to vacuum the table.
 
 As far as I can see, there is just no way for this to happen ... much
 less happen repeatedly.  I thought it might be related to concurrent
 insertions somehow managing to split the page under deletion very
 quickly (given the load these systems are under, this is plausible).
 But I can't find how.

Yes, nothing comes to mind explaining that duration of persistence.

nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strange nbtree corruption report

2011-11-21 Thread Tom Lane
Noah Misch n...@leadboat.com writes:
 Just a suspicion ... when looking at the B-tree page reclamation algorithm, I
 had a thought that the logic in _bt_page_recyclable() was obsolete as of the
 introduction (in 8.3) of xid-free read-only transactions.  A transaction
 without a persistent xid does not hold back RecentXmin, so how could waiting
 for a RecentXmin window to pass prove that no scan still holds a link to the
 page?  Similarly, running VACUUMs do not hold back RecentXmin.

Uh, sure they do.  It's their advertised snapshot xmin that counts, not
their own xid (if any).

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strange nbtree corruption report

2011-11-21 Thread Tom Lane
Alvaro Herrera alvhe...@alvh.no-ip.org writes:
 We got a very strange nbtree corruption report some time ago.  This was
 a btree index on a vey high churn table -- entries are updated and
 deleted very quickly, so the index grows very large and also shrinks
 quickly (AFAICT this is a work queue of sorts).

 The most strange thing of all is that there was this error:

 ERROR:  left link changed unexpectedly in block 3378 of index index_name 
 CONTEXT:  automatic vacuum of table table_name

 This was reported not once, but several dozens of times, by each new
 autovacuum worker that tried to vacuum the table.

 As far as I can see, there is just no way for this to happen ... much
 less happen repeatedly.

It's not hard to believe that that would happen repeatedly given a
corrupted set of sibling links, eg deletable page A links left to page
B, which links right to C, which links right to A.  The question is how
the index got into such a state.  A dropped update during a page split
would explain it (ie, B used to be A's left sibling, then at some point
B got split into B and C, but A's left-link never got updated on disk).
I wonder how reliable their disk+filesystem is ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strange nbtree corruption report

2011-11-21 Thread Tom Lane
I wrote:
 Noah Misch n...@leadboat.com writes:
 Just a suspicion ... when looking at the B-tree page reclamation algorithm, I
 had a thought that the logic in _bt_page_recyclable() was obsolete as of the
 introduction (in 8.3) of xid-free read-only transactions.  A transaction
 without a persistent xid does not hold back RecentXmin, so how could waiting
 for a RecentXmin window to pass prove that no scan still holds a link to the
 page?  Similarly, running VACUUMs do not hold back RecentXmin.

 Uh, sure they do.  It's their advertised snapshot xmin that counts, not
 their own xid (if any).

No, wait a second, I think you're right.  The existing mechanism should
protect against transactions that might be updating the btree, so the
worst possible consequences can't happen; but it seems possible that a
read-only transaction in flight to the page could get confused and give
wrong answers.  That would only explain transient failures not persistent
ones, though.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers