Re: [HACKERS] strange nbtree corruption report
On Mon, Nov 28, 2011 at 4:44 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: I wonder if it's related, because it seems pretty much the same mechanism: sometimes, a btree index insert would be randomly forgotten (its page write lost in vacuum, so to speak), ... Groan. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] strange nbtree corruption report
On Mon, Nov 28, 2011 at 8:40 PM, Tom Lane t...@sss.pgh.pa.us wrote: Alvaro Herrera alvhe...@commandprompt.com writes: I wonder if it would be worthwhile to build some sort of tool to scan the heap and ensure that there are index entries for all heap items, just to test the hypothesis. Not that this would enlighten on the source of the actual problem. Seems like the hypothesis could be proven or disproven just by counting the heap and index entries while the DB is otherwise idle. It used to be that VACUUM VERBOSE was sufficient for that sort of cross-check ... but I'm not totally sure what push-ups are required nowadays to prevent it from deciding that it's smarter than you are so it needn't scan the whole table. Is VACUUM FREEZE VERBOSE still trustworthy for this? Yes. It will scan the whole table if relfrozenxid is too far back, and FREEZE defines too far back to 0 transaction IDs, which is always satisfied. It might be useful to add an option to VACUUM to scan the whole table without otherwise altering the behavior, in case you distrust the visibility map or whatever but don't care about freezing anything. The obvious name for the option would be full, but since that's already taken we'd need to invent something different. PARANOID? ALL? SCAN_ALL? INEFFICIENTLY? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] strange nbtree corruption report
Excerpts from Tom Lane's message of mar nov 22 01:14:33 -0300 2011: Alvaro Herrera alvhe...@alvh.no-ip.org writes: We got a very strange nbtree corruption report some time ago. This was a btree index on a vey high churn table -- entries are updated and deleted very quickly, so the index grows very large and also shrinks quickly (AFAICT this is a work queue of sorts). The most strange thing of all is that there was this error: ERROR: left link changed unexpectedly in block 3378 of index index_name CONTEXT: automatic vacuum of table table_name This was reported not once, but several dozens of times, by each new autovacuum worker that tried to vacuum the table. As far as I can see, there is just no way for this to happen ... much less happen repeatedly. It's not hard to believe that that would happen repeatedly given a corrupted set of sibling links, eg deletable page A links left to page B, which links right to C, which links right to A. The question is how the index got into such a state. A dropped update during a page split would explain it (ie, B used to be A's left sibling, then at some point B got split into B and C, but A's left-link never got updated on disk). I wonder how reliable their disk+filesystem is ... While summarizing this, a (relatively) frequent problem with unique indexes came to my mind: there would be a UNIQUE index but when the admin tries to rebuild it for whatever reason, duplicate values are found. We've seen dozens of reports of this kind of problem (in the pgsql lists I mean -- I don't think we've seen this problem in this customer's servers). I wonder if it's related, because it seems pretty much the same mechanism: sometimes, a btree index insert would be randomly forgotten (its page write lost in vacuum, so to speak), and thus when the second heap item comes along, there's no entry in the index and the insert completes, and there you have your duplicate value. I wonder if it would be worthwhile to build some sort of tool to scan the heap and ensure that there are index entries for all heap items, just to test the hypothesis. Not that this would enlighten on the source of the actual problem. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] strange nbtree corruption report
On Nov 28, 2011, at 3:44 PM, Alvaro Herrera wrote: I wonder if it would be worthwhile to build some sort of tool to scan the heap and ensure that there are index entries for all heap items, just to test the hypothesis. Not that this would enlighten on the source of the actual problem. There was a project to create a sanity-checker for Postgres databases... did that ever go anywhere? It seems like this would be a good addition for that tool, if it exists. In either case, I am all for better capabilities to detect data problems (I'm looking at you, block checksum project! ;) -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] strange nbtree corruption report
Jim Nasby wrote: On Nov 28, 2011, at 3:44 PM, Alvaro Herrera wrote: I wonder if it would be worthwhile to build some sort of tool to scan the heap and ensure that there are index entries for all heap items, just to test the hypothesis. Not that this would enlighten on the source of the actual problem. There was a project to create a sanity-checker for Postgres databases... did that ever go anywhere? It seems like this would be a good addition for that tool, if it exists. Not that I know of. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] strange nbtree corruption report
Alvaro Herrera alvhe...@commandprompt.com writes: I wonder if it would be worthwhile to build some sort of tool to scan the heap and ensure that there are index entries for all heap items, just to test the hypothesis. Not that this would enlighten on the source of the actual problem. Seems like the hypothesis could be proven or disproven just by counting the heap and index entries while the DB is otherwise idle. It used to be that VACUUM VERBOSE was sufficient for that sort of cross-check ... but I'm not totally sure what push-ups are required nowadays to prevent it from deciding that it's smarter than you are so it needn't scan the whole table. Is VACUUM FREEZE VERBOSE still trustworthy for this? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] strange nbtree corruption report
Excerpts from Tom Lane's message of mar nov 22 01:14:33 -0300 2011: Alvaro Herrera alvhe...@alvh.no-ip.org writes: ERROR: left link changed unexpectedly in block 3378 of index index_name CONTEXT: automatic vacuum of table table_name This was reported not once, but several dozens of times, by each new autovacuum worker that tried to vacuum the table. As far as I can see, there is just no way for this to happen ... much less happen repeatedly. It's not hard to believe that that would happen repeatedly given a corrupted set of sibling links, eg deletable page A links left to page B, which links right to C, which links right to A. The question is how the index got into such a state. A dropped update during a page split would explain it (ie, B used to be A's left sibling, then at some point B got split into B and C, but A's left-link never got updated on disk). I wonder how reliable their disk+filesystem is ... Well, there are no other signs of random data corruption, such as toast pointers getting corrupted which is the number one symptom showing up when underlying storage is flaky. However, it may be possible that there was a transient storage problem which only affected this one page; if this persisted in the way you describe, it might well explain these symptoms. Another thing I noticed is that there was corruption in heap pages (not the same server, though; it was a different Londiste slave). This was even more strange; the pages would be completely fine, except the first six words corresponding to the page header; they would be all zeros. When filled with valid-looking data (mostly I copied the bytes from neighbor pages), the rest of the page would decode fine. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] strange nbtree corruption report
On Mon, Nov 21, 2011 at 08:00:21PM -0300, Alvaro Herrera wrote: We got a very strange nbtree corruption report some time ago. This was a btree index on a vey high churn table -- entries are updated and deleted very quickly, so the index grows very large and also shrinks quickly (AFAICT this is a work queue of sorts). The most strange thing of all is that there was this error: ERROR: left link changed unexpectedly in block 3378 of index index_name CONTEXT: automatic vacuum of table table_name Just a suspicion ... when looking at the B-tree page reclamation algorithm, I had a thought that the logic in _bt_page_recyclable() was obsolete as of the introduction (in 8.3) of xid-free read-only transactions. A transaction without a persistent xid does not hold back RecentXmin, so how could waiting for a RecentXmin window to pass prove that no scan still holds a link to the page? Similarly, running VACUUMs do not hold back RecentXmin. I had made a note to try to reproduce a concrete problem, but I haven't tried yet. This was reported not once, but several dozens of times, by each new autovacuum worker that tried to vacuum the table. As far as I can see, there is just no way for this to happen ... much less happen repeatedly. I thought it might be related to concurrent insertions somehow managing to split the page under deletion very quickly (given the load these systems are under, this is plausible). But I can't find how. Yes, nothing comes to mind explaining that duration of persistence. nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] strange nbtree corruption report
Noah Misch n...@leadboat.com writes: Just a suspicion ... when looking at the B-tree page reclamation algorithm, I had a thought that the logic in _bt_page_recyclable() was obsolete as of the introduction (in 8.3) of xid-free read-only transactions. A transaction without a persistent xid does not hold back RecentXmin, so how could waiting for a RecentXmin window to pass prove that no scan still holds a link to the page? Similarly, running VACUUMs do not hold back RecentXmin. Uh, sure they do. It's their advertised snapshot xmin that counts, not their own xid (if any). regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] strange nbtree corruption report
Alvaro Herrera alvhe...@alvh.no-ip.org writes: We got a very strange nbtree corruption report some time ago. This was a btree index on a vey high churn table -- entries are updated and deleted very quickly, so the index grows very large and also shrinks quickly (AFAICT this is a work queue of sorts). The most strange thing of all is that there was this error: ERROR: left link changed unexpectedly in block 3378 of index index_name CONTEXT: automatic vacuum of table table_name This was reported not once, but several dozens of times, by each new autovacuum worker that tried to vacuum the table. As far as I can see, there is just no way for this to happen ... much less happen repeatedly. It's not hard to believe that that would happen repeatedly given a corrupted set of sibling links, eg deletable page A links left to page B, which links right to C, which links right to A. The question is how the index got into such a state. A dropped update during a page split would explain it (ie, B used to be A's left sibling, then at some point B got split into B and C, but A's left-link never got updated on disk). I wonder how reliable their disk+filesystem is ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] strange nbtree corruption report
I wrote: Noah Misch n...@leadboat.com writes: Just a suspicion ... when looking at the B-tree page reclamation algorithm, I had a thought that the logic in _bt_page_recyclable() was obsolete as of the introduction (in 8.3) of xid-free read-only transactions. A transaction without a persistent xid does not hold back RecentXmin, so how could waiting for a RecentXmin window to pass prove that no scan still holds a link to the page? Similarly, running VACUUMs do not hold back RecentXmin. Uh, sure they do. It's their advertised snapshot xmin that counts, not their own xid (if any). No, wait a second, I think you're right. The existing mechanism should protect against transactions that might be updating the btree, so the worst possible consequences can't happen; but it seems possible that a read-only transaction in flight to the page could get confused and give wrong answers. That would only explain transient failures not persistent ones, though. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers