Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-16 Thread Nikhil Sontakke
Hi,

 To summarize, as I see it - the zeroed-out block 523 should have been
 the second left-most leaf and should have pointed to 522, thus
 re-establishing the index chain

 524 -> 523 -> 522 -> 277 -> ...

 Was there a machine restart in the picture as well?


It seems there might have been a machine restart involved too. So I
guess even WAL writing could have been impacted.

But even if VF was ongoing at the time of restart, the WAL replay on
restart should not do anything since this will be a non-committed
transaction?

Also, I was looking at ReadRecord and saw that it logs a message for
blocks that fail the CRC check, but WAL replay just stops at that
point since it returns a NULL. Is there a way to find out if more
blocks follow in the wake of this failed block (it should be a matter
of calling ReadRecord with NULL as the first argument, I think)? If
so, maybe we can additionally warn that the error was encountered in
the middle of WAL replay. However, the last block too could be a CRC
check-fail candidate...

BTW, is it possible to encounter trailing blocks with CRC failures on
a regular basis, for transactions that were ongoing at the time of
shutdown and did not get a chance to commit or be WAL-logged properly?

Regards,
Nikhils



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-16 Thread Robert Haas
On Wed, Mar 16, 2011 at 7:51 AM, Nikhil Sontakke
nikhil.sonta...@enterprisedb.com wrote:
 Hi,

 To summarize, as I see it - the zeroed-out block 523 should have been
 the second left-most leaf and should have pointed to 522, thus
 re-establishing the index chain

 524 -> 523 -> 522 -> 277 -> ...

 Was there a machine restart in the picture as well?


 It seems there might have been a machine restart involved too. So I
 guess even WAL writing could have been impacted.

 But even if VF was ongoing at the time of restart, the WAL replay on
 restart should not do anything since this will be a non-committed
 transaction?

That's not how it works.  Replaying an uncommitted transaction
shouldn't result in any user-visible changes, but it still does stuff.

 Also, I was looking at ReadRecord and saw that it logs a message for
 blocks that fail the CRC check, but WAL replay just stops at that
 point since it returns a NULL. Is there a way to find out if more
 blocks follow in the wake of this failed block (it should be a matter
 of calling ReadRecord with NULL as the first argument, I think)? If
 so, maybe we can additionally warn that the error was encountered in
 the middle of WAL replay. However, the last block too could be a CRC
 check-fail candidate...

In general, when we WAL-log, we're writing over a previous WAL segment
that's been recycled.  A failed CRC is indistinguishable from
end-of-WAL, because we expect there to be arbitrary garbage bytes in
the file after the end of WAL position.
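
Conceptually, the replay loop's stopping rule looks something like
this (a simplified Python sketch, not the actual ReadRecord code; the
fixed-width record layout here is invented purely for illustration):

    import zlib

    def next_record(wal, pos):
        """Return (payload, next_pos), or None to stop replay."""
        header = wal[pos:pos + 8]
        if len(header) < 8:
            return None                  # ran off the end of the segment
        length = int.from_bytes(header[:4], "little")
        stored_crc = int.from_bytes(header[4:8], "little")
        payload = wal[pos + 8:pos + 8 + length]
        if len(payload) < length:
            return None                  # truncated: treat as end of WAL
        if zlib.crc32(payload) != stored_crc:
            # Stale bytes left over in the recycled segment fail the CRC
            # the same way real corruption does, so replay must stop here.
            return None
        return payload, pos + 8 + length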

 BTW, is it possible to encounter trailing blocks with CRC failures on
 a regular basis, for transactions that were ongoing at the time of
 shutdown and did not get a chance to commit or be WAL-logged properly?

Well, you might have a torn page if there was a *system* crash in the
middle of recovery, but in theory even that shouldn't break anything,
because the system shouldn't rely on the fsync being complete until it
actually is.  Of course, as you mentioned earlier, it's not impossible
there's a bug in the recovery code.  But if an OS crash is involved,
another possibility is that something went wrong with the fsync -
maybe there's a lying writeback cache between PG and the platter, for
example.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-16 Thread Nikhil Sontakke
Hi,

 Of course, as you mentioned earlier, it's not impossible
 there's a bug in the recovery code.

Yeah, I was looking at the repair_frag function in 8.3.13 (yup, it's
ugly!) and found that the normal ExecInsertIndexTuples call is used
to insert the index entries. That is standard index code used
everywhere, so btree WAL bugs in this code path should be pretty
rare, I would think...

 But if an OS crash is involved,
 another possibility is that something went wrong with the fsync -
 maybe there's a lying writeback cache between PG and the platter, for
 example.


Yup, I plan to re-confirm this too.

Thanks Robert!

Regards,
Nikhils



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-16 Thread Alvaro Herrera
Excerpts from Nikhil Sontakke's message of Wed Mar 16 08:51:00 -0300 2011:
 Hi,
 
  To summarize, as I see it - the zeroed-out block 523 should have been
  the second left-most leaf and should have pointed to 522, thus
  re-establishing the index chain

  524 -> 523 -> 522 -> 277 -> ...
 
  Was there a machine restart in the picture as well?
 
 
 It seems there might have been a machine restart involved too. So I
 guess even WAL writing could have been impacted.

Maybe the filesystem decided to fill a couple of FS pages (one PG page)
with zeroes on recovery due to believing that it wasn't fsynced at the
time of the crash.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-14 Thread Nikhil Sontakke
Hi Daniel,


 I have also, coincidentally, encountered corruption of a system
 catalog index -- 8.3.11 -- I have saved the file for forensics.  Is it
 possible that I also receive a copy of this program?


Would it be possible for you to share the file/logs off-list with me?
I can also try to do some analysis to see whether the situation is
similar to the one we are facing right now, and report back here if
so.

TIA,
Nikhils



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-12 Thread Daniel Farina
On Wed, Mar 9, 2011 at 6:02 AM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 I'll send you a perl program we wrote for a customer to check for
 strange issues in btrees.  Please give it a spin; it may give you more
 clues.  If you find additional checks to add, please let me know!

I have also, coincidentally, encountered corruption of a system
catalog index -- 8.3.11 -- I have saved the file for forensics.  Is it
possible that I also receive a copy of this program?

-- 
fdr



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-12 Thread Greg Stark
On Sat, Mar 12, 2011 at 3:06 AM, Nikhil Sontakke
nikhil.sonta...@enterprisedb.com wrote:
 Live 522's (LSN: logid 29, recoff 0xd1fade3c) previous points to the
 zeroed out 523 block. Note that this seems to be the latest LSN in
 the data file.


So do you have logs from the server when it was restarted? It should
say how far it recovered before it started up.

-- 
greg



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-11 Thread Nikhil Sontakke
Hi,

 1. Somebody inserts a bunch of new tuples into the relation, causing
 growth in the index.

 In case it's not obvious, VACUUM FULL would do precisely that.

 Oh, I didn't even think about that.  Yeah, that could be it, too.

Thanks a lot, Greg and Robert. This theory seems very plausible. VF
must have carried out a rearrangement of heap tuples for compaction,
and that might have caused new index entries, which might explain the
extension by that many blocks.

 maybe VACUUM FULL -> crash before checkpoint -> problem with recovery?

Did I mention that an immediate shutdown was thrown into the mix just
after the VF? That is effectively a crash, and it means that the
shared buffers were not written back to the index data file. That
should also account for all these pages still being zeroed out. So
change the above to:

 VACUUM FULL -> immediate shutdown -> problem with recovery?

But WAL replay should still have handled this. I would presume even an
immediate shutdown ensures that WAL is flushed to disk properly?

So that means that either there is a corner-case bug in VF which
emits incorrect WAL logging in some specific btree layout scenarios,
or there was indeed some bit flipping in the WAL, which caused
recovery to end prematurely during WAL replay. What scenarios would
you think can cause WAL bit flipping?

I was trying to repro this on the setup by repeatedly creating a table
with large inserts, doing lotta deletes, running VF and then issuing
an immediate shutdown. However, inspecting the index data file at
this point in the test case is not very informative, as the file is
largely out of sync since its dirty shared buffers have not been
flushed. That leaves me with the option to restart and check the index
data file again for problems. If we see problems after the restart, it
should generally mean WAL logging errors (but we still cannot discount
the bit-flipping case, I guess).
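
(A rough sketch of that repro loop, in case anyone wants to replay it;
the database name, data directory and table shape are placeholders,
not the exact setup used:)

    import subprocess

    def sql(cmd, db="testdb"):                 # assumed scratch database
        subprocess.run(["psql", "-d", db, "-c", cmd], check=True)

    # Build a table with an index, churn it, VACUUM FULL, then "crash".
    sql("DROP TABLE IF EXISTS t")
    sql("CREATE TABLE t (id int PRIMARY KEY, pad text)")
    sql("INSERT INTO t SELECT g, repeat('x', 100)"
        "  FROM generate_series(1, 500000) g")
    sql("DELETE FROM t WHERE id % 3 <> 0")     # lotta deletes
    sql("VACUUM FULL t")

    # Immediate shutdown == intentional crash: dirty shared buffers are
    # lost, so the index file on disk is stale until WAL replay.
    subprocess.run(["pg_ctl", "-D", "/path/to/data", "stop",
                    "-m", "immediate"], check=True)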

I guess a perusal of the WAL activity generated by VF's btree index
operations is the need of the hour...

Regards,
Nikhils



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-11 Thread Robert Haas
On Fri, Mar 11, 2011 at 6:17 AM, Nikhil Sontakke
nikhil.sonta...@enterprisedb.com wrote:
 VACUUM FULL -> immediate shutdown -> problem with recovery?

An immediate shutdown == an intentional crash.  OK, so you have the
VACUUM FULL and the immediate shutdown just afterward.  So we just
need to figure out what happened during recovery.

 But WAL replay should still have handled this. I would presume even an
 immediate shutdown ensures that WAL is flushed to disk properly?

I'm not sure, but I doubt it.  If the VACUUM FULL committed, then the
WAL records should be on disk, but if the immediate shutdown happened
while it was still running, then the WAL records might still be in
wal_buffers, in which case I don't think they'll get written out and
thus zero pages in the index are to be expected.  Now that doesn't
explain any other corruption in the file, but I believe all-zeroes
pages in a relation are an expected consequence of an unclean
shutdown.  But assuming the VF actually committed before the immediate
shutdown, there must be something else going on, since by that point
XLOG should have been flushed.

 So that means that either there is a corner-case bug in VF which
 emits incorrect WAL logging in some specific btree layout scenarios,
 or there was indeed some bit flipping in the WAL, which caused
 recovery to end prematurely during WAL replay. What scenarios would
 you think can cause WAL bit flipping?

Some kind of fluke hard drive malfunction, maybe?  I know that the
incidence of a hard drive being told to write A and actually writing B
is very low, but it's probably not exactly zero.  Do you have the logs
from the recovery following the immediate shutdown?  Anything
interesting there?

Or, as you say, there could be a corner-case VF bug.

 I was trying to repro this on the setup by repeatedly creating a table
 with large inserts, doing lotta deletes, running VF and then issuing
 an immediate shutdown. However, inspecting the index data file at
 this point in the test case is not very informative, as the file is
 largely out of sync since its dirty shared buffers have not been
 flushed. That leaves me with the option to restart and check the index
 data file again for problems. If we see problems after the restart, it
 should generally mean WAL logging errors (but we still cannot discount
 the bit-flipping case, I guess).

contrib/pageinspect might help.
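
For instance, something along these lines will dump the LSN of every
page in the index without taking the server down (a sketch using
psycopg2; 'testdb' and 't_pkey' are placeholder names):

    import psycopg2

    conn = psycopg2.connect(dbname="testdb")
    cur = conn.cursor()

    cur.execute("SELECT relpages FROM pg_class WHERE relname = 't_pkey'")
    nblocks, = cur.fetchone()

    # An all-zero page should show up with an LSN of 0/0.
    for blk in range(nblocks):
        cur.execute(
            "SELECT lsn FROM page_header(get_raw_page('t_pkey', %s))",
            (blk,))
        print(blk, cur.fetchone()[0])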

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-11 Thread Nikhil Sontakke
 VACUUM FULL -> immediate shutdown -> problem with recovery?

 An immediate shutdown == an intentional crash.  OK, so you have the
 VACUUM FULL and the immediate shutdown just afterward.  So we just
 need to figure out what happened during recovery.


Right.

 But WAL replay should still have handled this. I would presume even an
 immediate shutdown ensures that WAL is flushed to disk properly?

 I'm not sure, but I doubt it.  If the VACUUM FULL committed, then the
 WAL records should be on disk, but if the immediate shutdown happened
 while it was still running, then the WAL records might still be in
 wal_buffers, in which case I don't think they'll get written out and
 thus zero pages in the index are to be expected.  Now that doesn't
 explain any other corruption in the file, but I believe all-zeroes
 pages in a relation are an expected consequence of an unclean
 shutdown.  But assuming the VF actually committed before the immediate
 shutdown, there must be something else going on, since by that point
 XLOG should have been flushed.


Oh yeah, so if VF committed, the xlog should have been ok too, but
can't say the same about the shared buffers.

 So that means that either there is a corner-case bug in VF which
 emits incorrect WAL logging in some specific btree layout scenarios,
 or there was indeed some bit flipping in the WAL, which caused
 recovery to end prematurely during WAL replay. What scenarios would
 you think can cause WAL bit flipping?

 Some kind of fluke hard drive malfunction, maybe?  I know that the
 incidence of a hard drive being told to write A and actually writing B
 is very low, but it's probably not exactly zero.  Do you have the logs
 from the recovery following the immediate shutdown?  Anything
 interesting there?


Unfortunately, we do not have the recovery logs. I would have loved to
see some signs of issues in the WAL replay to confirm the theory
about bit flipping...

 Or, as you say, there could be a corner-case VF bug.


Yeah, much harder to find by just eyeballing the code I guess :)

Regards,
Nikhils



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-11 Thread Greg Stark
On Fri, Mar 11, 2011 at 2:28 PM, Nikhil Sontakke
nikhil.sonta...@enterprisedb.com wrote:
 I'm not sure, but I doubt it.  If the VACUUM FULL committed, then the
 WAL records should be on disk, but if the immediate shutdown happened
 while it was still running, then the WAL records might still be in
 wal_buffers, in which case I don't think they'll get written out and
 thus zero pages in the index are to be expected.
...

 Oh yeah, so if VF committed, the xlog should have been ok too, but
 can't say the same about the shared buffers.

But there was a later block that *was* written out. What was the LSN
on that block? Everything in the WAL log should have been fsynced up
to that point when that buffer was flushed.
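
(That is, the buffer manager enforces WAL-before-data; a toy Python
sketch of the invariant, with made-up LSN numbers:)

    # A dirty buffer may hit the data file only after WAL has been
    # fsynced up to that buffer's LSN (cf. XLogFlush in the backend).
    class WAL:
        def __init__(self):
            self.flushed_upto = 0
        def flush(self, upto_lsn):
            self.flushed_upto = max(self.flushed_upto, upto_lsn)

    def write_buffer(wal, page_lsn):
        if wal.flushed_upto < page_lsn:
            wal.flush(page_lsn)      # force the WAL out first
        # ... only now may the 8 KB page be written to the data file ...
        return wal.flushed_upto      # always >= page_lsn afterward

    wal = WAL()
    assert write_buffer(wal, page_lsn=1000) >= 1000
    # Hence the point above: block 522 on disk with LSN (29, 0xd1fade3c)
    # implies WAL was durable at least up to that LSN.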

Was there a machine restart in the picture as well?



-- 
greg



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-11 Thread Nikhil Sontakke
 Oh yeah, so if VF committed, the xlog should have been ok too, but
 can't say the same about the shared buffers.

 But there was a later block that *was* written out. What was the LSN
 on that block? Everything in the WAL log should have been fsynced up
 to that point when that buffer was flushed.


Which block are you referring to?
After the holes from 279 to 518, we have:

Deleted 519's (LSN: logid 29, recoff 0xd1fab5bc) previous points to Deleted 520.

Deleted 520's (LSN: logid 29, recoff 0xd1fac918) previous points to Deleted 521.

Deleted 521's (LSN: logid 29, recoff 0xd1fadc60) previous points to 522

Note that all the above deleted blocks have next xid set to FrozenXid,
meaning VF did this.

Live 522's (LSN: logid 29, recoff 0xd1fade3c) previous points to the
zeroed out 523 block. Note that this seems to be the latest LSN in
the data file.

The next pointers of all these 4 blocks (519 to 522) point to live
block 277, which has (LSN: logid 29, recoff 0xd1fadc60). The ROOT
block also has this same LSN. So blocks 521, 277 and 3 (the root)
have this same LSN. The rest of the live blocks appear to have lower
LSNs with lower logids.

The last block in the index data file, which also seems to be the
current leftmost block, number 524, has an LSN of (logid 29, recoff
0xd1fa97b8) with its next sibling set to 523. One interesting
observation is that there is another deleted block, 278 (again done by
VF), with this same LSN and next pointing to 524. So 278 must have
been the original leftmost block before 524 was promoted to that
status.

To summarize, as I see it - the zeroed-out block 523 should have been
the second left-most leaf and should have pointed to 522, thus
re-establishing the index chain

524 -> 523 -> 522 -> 277 -> ...

Phew, I hope the above made some sense and was decipherable. Looking
at my png alongside might help a bit more too. Also maybe some bug is
indeed hidden in the guts of VF.
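
(To make the break easy to see, the recovered sibling pointers can be
walked mechanically; a little Python sketch over just the blocks named
above:)

    # Leaf sibling chain recovered from pg_filedump: block -> next.
    # Block 523 is all zeroes, so its own next pointer is simply gone.
    next_sib = {524: 523, 522: 277}

    blk, chain = 524, [524]        # 524 is the current leftmost leaf
    while blk in next_sib:
        blk = next_sib[blk]
        chain.append(blk)
    print(" -> ".join(map(str, chain)))
    # Prints "524 -> 523": the walk dies at the zeroed page, exactly
    # the break described above (it should continue 522 -> 277 -> ...).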

 Was there a machine restart in the picture as well?

I don't think so.

Regards,
Nikhils



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-10 Thread Nikhil Sontakke
Hi,

 Another peculiarity in the index file is that we found a lot of zeroed
 out pages. Blocks from #279 to #518 are all completely zeroed out
 without any signs of even a page header. Any ideas on how we can get
 so many zeroed out blocks? Apart from the extend code path, I fail to
 see any other. And this is an unusually large number of zero pages

 Well, if you see the root block child pointers, they go to pages
 2, 4-277, 522-524.  So pages 278 to 521 seem unused for some reason,
 which is rather strange.  But apparently only page 523 is zeroed and
 shouldn't be.


Yeah, all of this is the very definition of strange.

 It seems hard to believe that there would be 245 unsuccessful attempts
 at extending the file.

 Page 522 is suspect too ... why does it have a single item?  Note its
 LSN is very close to the one on page 521.


If you look at my png, you will notice that there is a deleted block
chain at play around 522.

Deleted 519's previous points to Deleted 520.

Deleted 520's previous points to Deleted 521.

Deleted 521's previous points to 522

Want to know where the next pointers of all these 4 blocks point? Block 277.

Another interesting thing is that all these deleted blocks have their
next XID set to FrozenXid. And who sets it to that - only VACUUM
FULL. That's why my suspicions center around VF...

Regards,
Nikhils



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-10 Thread Robert Haas
On Wed, Mar 9, 2011 at 7:14 PM, Greg Stark gsst...@mit.edu wrote:
 On Wed, Mar 9, 2011 at 11:28 PM, Nikhil Sontakke
 nikhil.sonta...@enterprisedb.com wrote:
 Another peculiarity in the index file is that we found a lot of zeroed
 out pages. Blocks from #279 to #518 are all completely zeroed out
 without any signs of even a page header. Any ideas on how we can get
 so many zeroed out blocks? Apart from the extend code path, I fail to
 see any other. And this is an unusually large number of zero pages


 What does stat say for the index data file? Are the Size and Blocks
 values the same (modulo block size)? Or are these blocks actually not
 allocated?

 Postgres always forces blocks to be allocated but if they were lost
 due to filesystem corruption maybe they're not allocated any more.

Hmm, that gives me an idea.  What if the index relation were extended
repeatedly, but the blocks are still all zero on disk because they've
never been written out of shared_buffers?  Then, before the next
checkpoint, the system crashes.  All those blocks are now zero on
disk.  In theory, WAL replay should fix most of it, but there could be
portions of the WAL that haven't been flushed; or there could be some
other problem in WAL replay.  For example, it strikes me that this
situation would be consistent with:

1. Somebody inserts a bunch of new tuples into the relation, causing
growth in the index.
2. Before the blocks are written to disk, system crash.
3. pg_resetxlog

But it wouldn't have to be exactly that.  Like what if during replay,
you hit a WAL record that had a bit flipped so it failed the checksum?
I think the system would just treat that as end-of-WAL and start up.
Now you could have some corruption, and a big chunk of zero blocks in
the index.  Now you go along in normal running, maybe not realizing
that those zero blocks are there, and start extending the relation as
you need to update the index...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-10 Thread Greg Stark
On Thu, Mar 10, 2011 at 1:45 PM, Robert Haas robertmh...@gmail.com wrote:
 1. Somebody inserts a bunch of new tuples into the relation, causing
 growth in the index.

In case it's not obvious, VACUUM FULL would do precisely that.

 2. Before the blocks are written to disk, system crash.
 3. pg_resetxlog

Do you have any record of the VACUUM FULL committing? Do you know what
LSN its commit record is at?

-- 
greg



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-10 Thread Robert Haas
On Thu, Mar 10, 2011 at 12:52 PM, Greg Stark gsst...@mit.edu wrote:
 On Thu, Mar 10, 2011 at 1:45 PM, Robert Haas robertmh...@gmail.com wrote:
 1. Somebody inserts a bunch of new tuples into the relation, causing
 growth in the index.

 In case it's not obvious, VACUUM FULL would do precisely that.

Oh, I didn't even think about that.  Yeah, that could be it, too.  So
maybe VACUUM FULL -> crash before checkpoint -> problem with recovery?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-09 Thread Nikhil Sontakke
Re-sending without the attachments. Can someone please allow my
attachments through from the previous email?

TIA
Nikhils

-- Forwarded message --
From: Nikhil Sontakke nikhil.sonta...@enterprisedb.com
Date: Wed, Mar 9, 2011 at 8:42 PM
Subject: index corruption in PG 8.3.13
To: pgsql-hackers@postgresql.org


Hi,

I am currently looking at a physical file belonging to a corrupt
index. This is with PG 8.3.13 on an ext3 filesystem on SUSE Linux
with a RAID1 setup. Nothing about the sequence of operations seems
suspicious.

Create a table, create a bunch of indexes on it, and do insert/update
activity. The one difference here is that VACUUM FULL might be run in
between, along with the usual autovacuum mechanism, while the
inserts/updates are going on.

So one of the vacuum calls errored out with a corrupt (completely
zeroed out in this case) block with the usual hint to reindex.
This is a single machine setup with no slaves. Some analysis using
pg_filedump comes up with the following:

There are a total of 525 blocks (numbered 0 to 524). The root page is
block #3. The corrupt or zeroed out page is block #523.

The basic issue is that item2 in the root block points to block 523,
which is completely zeroed out, and this has broken the index chain
completely!

Another peculiarity in the index file is that we found a lot of zeroed
out pages. Blocks from #279 to #518 are all completely zeroed out
without any signs of even a page header. Any ideas on how we can get
so many zeroed out blocks? Apart from the extend code path, I fail to
see any other. And this is an unusually large number of zero pages...

Blocks 519 to 521 are DELETED. They do not have the LEAF flag set,
meaning they could be internal pages, but that is strange since the
ROOT page is at level 1. Also, importantly, their next XID is set to
FrozenXid, meaning VACUUM FULL was at play. Maybe due to deletes, we
reduced the hierarchy level or something?

I am attaching the data file, the binary and the 8.3-compliant
pg_filedump sources with this mail. Just copy the pg_filedump
directory into contrib and use Makefile.contrib to compile. I cannot
attach the output of running pg_filedump since my email would cross
the attachment limits. I am also attaching my dotty attempts via a
png file to visually show the scenario; hopefully that should make
things clearer.

Feedback, comments, and insights appreciated. This appears to be a
tough nut to crack :( :), and although blaming the hardware for the
zero pages might come to mind, the deletion activity done by
VACUUM/VACUUM FULL still seems a bit suspicious.

Regards,
Nikhils



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-09 Thread Alvaro Herrera
Excerpts from Nikhil Sontakke's message of Wed Mar 09 10:51:50 -0300 2011:
 Re-sending without the attachments. Can someone please allow my
 attachments through from the previous email?

They are not in the moderation queue, so presumably they got eaten by
the antispam grue.


 Blocks 519 to 521 are DELETED. They do not have the LEAF flag set,
 meaning they could be internal pages, but that is strange since the
 ROOT page is at level 1. Also, importantly, their next XID is set to
 FrozenXid, meaning VACUUM FULL was at play. Maybe due to deletes, we
 reduced the hierarchy level or something?

Hierarchy level is never reduced.

I'll send you a perl program we wrote for a customer to check for
strange issues in btrees.  Please give it a spin; it may give you more
clues.  If you find additional checks to add, please let me know!

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-09 Thread Alvaro Herrera
Excerpts from Nikhil Sontakke's message of Wed Mar 09 11:16:22 -0300 2011:
  Re-sending without the attachments. Can someone please allow my
  attachments through from the previous email?
 
  They are not in the moderation queue, so presumably they got eaten by
  the antispam grue.
 
 Ouch. Attempting to attach the dotty image again..

I don't understand this graph.  What are the arrows?  Downlinks or
sibling pointers?

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-09 Thread Nikhil Sontakke
 Ouch. Attempting to attach the dotty image again..

 I don't understand this graph.  What are the arrows?  Downlinks or
 sibling pointers?


Sorry, they are sibling previous and next pointers. The ROOT is at
level 1; all the other live blocks are at level 0. #524 is the
leftmost page.

Regards,
Nikhils



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-09 Thread Nikhil Sontakke
Hi,

 Blocks 519 to 521 are DELETED. They do not have the LEAF flag set,
 meaning they could be internal pages, but that is strange since the
 ROOT page is at level 1. Also, importantly, their next XID is set to
 FrozenXid, meaning VACUUM FULL was at play. Maybe due to deletes, we
 reduced the hierarchy level or something?

 Hierarchy level is never reduced.

 I'll send you a perl program we wrote for a customer to check for
 strange issues in btrees.  Please give it a spin; it may give you more
 clues.  If you find additional checks to add, please let me know!



I tried to run Alvaro's perl script, but since the index chain is
broken pretty early on due to the zeroed-out page, it could not
traverse the index very well.

While I rummage around the code more, does anyone have any theories on
the below?

Another peculiarity in the index file is that we found a lot of zeroed
out pages. Blocks from #279 to #518 are all completely zeroed out
without any signs of even a page header. Any ideas on how we can get
so many zeroed out blocks? Apart from the extend code path, I fail to
see any other. And this is an unusually large number of zero pages

Comments appreciated.

Regards,
Nikhils



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-09 Thread Greg Stark
On Wed, Mar 9, 2011 at 11:28 PM, Nikhil Sontakke
nikhil.sonta...@enterprisedb.com wrote:
 Another peculiarity in the index file is that we found a lot of zeroed
 out pages. Blocks from #279 to #518 are all completely zeroed out
 without any signs of even a page header. Any ideas on how we can get
 so many zeroed out blocks? Apart from the extend code path, I fail to
 see any other. And this is an unusually large number of zero pages


What does stat say for the index data file? Are the Size and Blocks
values the same (modulo block size)? Or are these blocks actually not
allocated?

Postgres always forces blocks to be allocated but if they were lost
due to filesystem corruption maybe they're not allocated any more.

-- 
greg



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-09 Thread Nikhil Sontakke

 What does stat say for the index data file? Are the Size and Blocks
 values the same (modulo block size)? Or are these blocks actually not
 allocated?

stat 58401
  File: `58401'
  Size: 4300800 Blocks: 8400   IO Block: 4096   regular file
Device: 801h/2049d  Inode: 13901264  Links: 1

Dividing by 8192 gives 525, which is the number of blocks that we see.

 Postgres always forces blocks to be allocated but if they were lost
 due to filesystem corruption maybe they're not allocated any more.


Actually, these blocks are like a big hole in between: blocks 279 to
518 are zeroed out, and then 519 to 524 are index data. Number 523 is
the one which has been zeroed out too, causing all the mayhem.
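
(For anyone else poking at a file like this, a quick Python sketch
that cross-checks allocation against size and maps the zeroed pages;
the file name is the relfilenode from the stat output above:)

    import os

    PATH = "58401"                 # the index data file
    BLCKSZ = 8192

    st = os.stat(PATH)
    # st_blocks is in 512-byte units; if allocation falls short of the
    # size, the filesystem is holding the "holes" as a sparse region.
    print("size:", st.st_size, "allocated:", st.st_blocks * 512)

    with open(PATH, "rb") as f:
        for blkno in range(st.st_size // BLCKSZ):
            if f.read(BLCKSZ) == b"\x00" * BLCKSZ:
                print("block", blkno, "is all zeroes")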

Regards,
Nikhils



Re: [HACKERS] Fwd: index corruption in PG 8.3.13

2011-03-09 Thread Alvaro Herrera
Excerpts from Nikhil Sontakke's message of Wed Mar 09 20:28:19 -0300 2011:

 While I rummage around the code more, does anyone have any theories on
 the below?
 
 Another peculiarity in the index file is that we found a lot of zeroed
 out pages. Blocks from #279 to #518 are all completely zeroed out
 without any signs of even a page header. Any ideas on how we can get
 so many zeroed out blocks? Apart from the extend code path, I fail to
 see any other. And this is an unusually large number of zero pages

Well, if you see the root block child pointers, they go to pages
2, 4-277, 522-524.  So pages 278 to 521 seem unused for some reason,
which is rather strange.  But apparently only page 523 is zeroed and
shouldn't be.

It seems hard to believe that there would be 245 unsuccessful attempts
at extending the file.

Page 522 is suspect too ... why does it have a single item?  Note its
LSN is very close to the one on page 521.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
