Re: [HACKERS] Disaster!

2004-01-31 Thread Tom Lane
Randolf Richardson [EMAIL PROTECTED] writes:
 [EMAIL PROTECTED] (Greg Stark) stated in
 comp.databases.postgresql.hackers: 
 The traditional Unix filesystems certainly don't return errors at close.

   Why shouldn't the close() function return an error?  If an invalid 
 file handle was passed to it, it most certainly should indicate this 

Of course.  We're discussing the situation where no errors were reported
in prior syscalls --- in particular, open() succeeded.

regards, tom lane



Re: [HACKERS] Disaster!

2004-01-31 Thread Greg Stark

Manfred Spraul [EMAIL PROTECTED] writes:

 The checkpoint code uses sync() right now. Actually sync();sleep(2);sync().
 Win32 has no sync() call, therefore it will use fsyncs. Perhaps platforms with
 deferred errors on close must use fsync, too. Hopefully parallel fsyncs -
 sequential fsyncs could be slow due to more seeking.

That code is known to be totally bogus in theory. However, in practice it seems
to be the best of the possible bad choices.

Even on filesystems where errors won't be deferred until after the write(), the
data is still not guaranteed to be on disk, even after the sync() call. There's
no guarantee that any particular sleep time is enough.

This was brought up a few months ago. The only safe implementation would be to
fsync every file descriptor that had received writes. The problem is keeping
track of which file descriptors those are. Also, people were uncertain whether
a backend opening a file and calling fsync would guarantee that writes made
to the same file by other processes through other file descriptors would be
flushed. I'm fairly convinced they would be on all sane vfs implementations,
but others were less convinced.

-- 
greg




Re: [HACKERS] Disaster!

2004-01-30 Thread Greg Stark

Manfred Spraul [EMAIL PROTECTED] writes:

 That means
 open();
 write();
 sync();
 
 could succeed, but the data is not stored on disk, correct?

That would be true on any filesystem, unless you throw an fsync() call in.

With sync() replaced by fsync(), any filesystem ought to guarantee that the data
has reached disk by the time fsync() returns. I think this is even true of NFS
or AFS, though I wouldn't depend on it for my own data.
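
For illustration, a single durable write with every step checked might look
like this (a sketch only; the path is made up):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    const char  buf[] = "some data\n";
    int         fd = open("/tmp/durable.demo", O_WRONLY | O_CREAT | O_APPEND, 0600);

    if (fd < 0 ||
        write(fd, buf, sizeof(buf) - 1) != (ssize_t) (sizeof(buf) - 1) ||
        fsync(fd) != 0 ||       /* data must reach disk before we trust it */
        close(fd) != 0)         /* and close() can still report an error */
    {
        fprintf(stderr, "durable write failed: %s\n", strerror(errno));
        return 1;
    }
    return 0;
}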

-- 
greg




Re: [HACKERS] Disaster!

2004-01-30 Thread Randolf Richardson
[EMAIL PROTECTED] (Greg Stark) stated in
comp.databases.postgresql.hackers: 

 Tom Lane [EMAIL PROTECTED] writes:
 
 Christopher Kings-Lynne [EMAIL PROTECTED] writes:
  FreeBSD 4.7/4.9 and the UFS filesystem
 
 Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC
 at close().  We need to fix the code to check close's return value,
 probably, but it seems we still lack a clear explanation of what
 happened to your database.
 
 The traditional Unix filesystems certainly don't return errors at close.
 Even NFS doesn't traditionally do so. I think NFSv3 can if the server
 disappears after the client obtains a lease on a piece of the file, but
 I'm not sure if ENOSPC is a possible failure mode.
[sNip]

Why shouldn't the close() function return an error?  If an invalid 
file handle was passed to it, it most certainly should indicate this, since 
it's always possible for a separate thread to have closed it first (or for 
other reasons as well).

-- 
Randolf Richardson - [EMAIL PROTECTED]
Vancouver, British Columbia, Canada

We are anti-spammers.  You will confirm
subscriptions.  Resistance is futile.

Please do not eMail me directly when responding
to my postings in the newsgroups.



Re: [HACKERS] Disaster!

2004-01-29 Thread Christoph Haller
 
 Tom Lane wrote:
  I said:
   If there wasn't disk space enough to hold the clog page, the checkpoint
   attempt should have failed.  So it may be that allowing a short read in
   slru.c would be patching the symptom of a bug that is really elsewhere.
  
  After more staring at the code, I have a theory.  SlruPhysicalWritePage
  and SlruPhysicalReadPage are coded on the assumption that close() can
  never return any interesting failure.  However, it now occurs to me that
  there are some filesystem implementations wherein ENOSPC could be
  returned at close() rather than the preceding write().  (For instance,
  the HPUX man page for close() states that this never happens on local
  filesystems but can happen on NFS.)  So it'd be possible for
  SlruPhysicalWritePage to think it had successfully written a page when
  it hadn't.  This would allow a checkpoint to complete :-(
  
  Chris, what's your platform exactly, and what kind of filesystem are
  you storing pg_clog on?
 
 We already have a TODO on fclose():
 
   * Add checks for fclose() failure
 
Tom was referring to close(), not fclose(). 
I once had an awful time searching for a memory leak caused 
by a typo that used close() instead of fclose(). 
So adding checks for both is probably a good idea. 
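
A sketch of what such checks could look like as small wrappers (the wrapper
names are illustrative, not the actual TODO patch):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Report, rather than ignore, errors from close(). */
int
close_checked(int fd, const char *what)
{
    if (close(fd) != 0)
    {
        fprintf(stderr, "could not close %s: %s\n", what, strerror(errno));
        return -1;
    }
    return 0;
}

/* fclose() flushes stdio buffers, so ENOSPC can surface here as well. */
int
fclose_checked(FILE *fp, const char *what)
{
    if (fclose(fp) != 0)
    {
        fprintf(stderr, "could not close %s: %s\n", what, strerror(errno));
        return -1;
    }
    return 0;
}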

Regards, Christoph 




Re: [HACKERS] Disaster!

2004-01-29 Thread Tom Lane
Christoph Haller [EMAIL PROTECTED] writes:
 Tom was referring to close(), not fclose(). 
 I once had an awful time searching for a memory leak caused 
 by a typo using close instead of fclose. 
 So adding checks for both is probably a good idea. 

Already done.

regards, tom lane



Re: [HACKERS] Disaster!

2004-01-29 Thread Jeroen Ruigrok/asmodai
-On [20040125 03:52], Tom Lane ([EMAIL PROTECTED]) wrote:
Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC
at close().

From Tru64's write(2):

[ENOSPC]
  [XSH4.2]  No free space is left on the file system containing the
  file.
  [Tru64 UNIX]  An attempt was made to write past the early
  warning EOT while this indicator was enabled.
  [Tru64 UNIX]  An attempt was made to write at or beyond the end of
  a partition.

From close(2):

[Tru64 UNIX]   A close() function on an NFS file system waits for all
outstanding I/O to complete. If any operation completes with an error,
the error will be returned by close(). The possible errors depend on the
NFS server implementation, but the most likely errors are:

[snip...]

[ENOSPC] Attempted to write on a full file system.

We need to fix the code to check close's return value, probably, but it
seems we still lack a clear explanation of what happened to your
database.

You always need to check the return codes of calls like that; what if
you received EBADF or EINTR for whatever reason?

-- 
Jeroen Ruigrok van der Werven asmodai(at)wxs.nl / asmodai / kita no mono
PGP fingerprint: 2D92 980E 45FE 2C28 9DB7  9D88 97E6 839B 2EAC 625B
http://www.tendra.org/   | http://diary.in-nomine.org/
From the pine tree, learn of the pine tree.  And from the bamboo, of the
bamboo...



Re: [HACKERS] Disaster!

2004-01-27 Thread Gaetano Mendola
Tom Lane wrote:

Okay ... Chris was kind enough to let me examine the WAL logs and
postmaster stderr log for his recent problem, and I believe that
I have now achieved a full understanding of what happened.  The true
bug is indeed somewhere else than slru.c, and we would not have found
it if slru.c had had less-paranoid error checking.


[SNIP]

Clap. Clap.

Regards
Gaetano Mendola




Re: [HACKERS] Disaster!

2004-01-26 Thread Christopher Kings-Lynne
Awesome Tom :)

I'm glad I happened to have all the data required on hand to fully analyze
the problem.  Let's hope this makes this failure condition go away for all
future PostgreSQL users :)

Chris

On Mon, 26 Jan 2004, Tom Lane wrote:

 Okay ... Chris was kind enough to let me examine the WAL logs and
 postmaster stderr log for his recent problem, and I believe that
 I have now achieved a full understanding of what happened.  The true
 bug is indeed somewhere else than slru.c, and we would not have found
 it if slru.c had had less-paranoid error checking.

snip




Re: [HACKERS] Disaster!

2004-01-26 Thread Bruce Momjian
Tom Lane wrote:
 I said:
  If there wasn't disk space enough to hold the clog page, the checkpoint
  attempt should have failed.  So it may be that allowing a short read in
  slru.c would be patching the symptom of a bug that is really elsewhere.
 
 After more staring at the code, I have a theory.  SlruPhysicalWritePage
 and SlruPhysicalReadPage are coded on the assumption that close() can
 never return any interesting failure.  However, it now occurs to me that
 there are some filesystem implementations wherein ENOSPC could be
 returned at close() rather than the preceding write().  (For instance,
 the HPUX man page for close() states that this never happens on local
 filesystems but can happen on NFS.)  So it'd be possible for
 SlruPhysicalWritePage to think it had successfully written a page when
 it hadn't.  This would allow a checkpoint to complete :-(
 
 Chris, what's your platform exactly, and what kind of filesystem are
 you storing pg_clog on?

We already have a TODO on fclose():

* Add checks for fclose() failure

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] Disaster!

2004-01-26 Thread Bruce Momjian

Excellent analysis.  Thanks.  Are there any other cases like this?

---

Tom Lane wrote:
 Okay ... Chris was kind enough to let me examine the WAL logs and
 postmaster stderr log for his recent problem, and I believe that
 I have now achieved a full understanding of what happened.  The true
 bug is indeed somewhere else than slru.c, and we would not have found
 it if slru.c had had less-paranoid error checking.
 
 [snip]

Re: [HACKERS] Disaster!

2004-01-26 Thread Alvaro Herrera
On Mon, Jan 26, 2004 at 02:52:58PM +0900, Michael Glaesemann wrote:

 I don't know if the 'canaveral' prompt had anything to do with it 
 (maybe it was just the subject line), but I kept thinking of shuttle 
 disasters, o-rings, and plane crashes reading through this. I won't 
 claim to understand everything in huge detail, but from this newbie's 
 point of view, well explained! I enjoyed reading it.

Just for the record, the Canaveral you are thinking about is derived
from the spanish word Cañaveral, which is a place where cañas grow
(canes or stems, according to my dictionary -- some sort of vegetal
living form anyway).  I suppose Cape Kennedy was filled with those
plants and that's what the name comes from.

I dunno if Chris' machine's name derives from that or not; Merriam
Webster does not list any other meaning for that word.

-- 
Alvaro Herrera (alvherre[a]dcc.uchile.cl)
The Gord often wonders why people threaten never to come back after they've
been told never to return (www.actsofgord.com)



Re: [HACKERS] Disaster!

2004-01-26 Thread Christopher Kings-Lynne
Just for the record, the Canaveral you are thinking about is derived
from the spanish word Cañaveral, which is a place where cañas grow
(canes or stems, according to my dictionary -- some sort of vegetal
living form anyway).  I suppose Cape Kennedy was filled with those
plants and that's what the name comes from.
I dunno if Chris' machine's name derives from that or not; Merriam
Webster does not list any other meaning for that word.
All our server machines are named after launch sites/space centres.  It 
might have been a bit of a mistake, since we're starting to run out of 
names now, and the Japanese names are just too much of a mouthful :)

Chris



Re: [HACKERS] Disaster!

2004-01-25 Thread Manfred Spraul
Greg Stark wrote:

I do know that AFS returns quota failures on close. This was unusual enough
that when AFS was deployed at school unix tools failed left and right over
precisely this issue. Though it mostly just meant they returned the wrong exit
status.
That means
   open();
   write();
   sync();
could succeed, but the data is not stored on disk, correct?

--
   Manfred




Re: [HACKERS] Disaster!

2004-01-25 Thread Tom Lane
Okay ... Chris was kind enough to let me examine the WAL logs and
postmaster stderr log for his recent problem, and I believe that
I have now achieved a full understanding of what happened.  The true
bug is indeed somewhere else than slru.c, and we would not have found
it if slru.c had had less-paranoid error checking.

The WAL log shows that checkpoints were happening every five minutes
up to 2004-01-23 10:13:10, but no checkpoint completion record appears
after that.  However, the system remained up, with plenty of activity,
until 10:45:24, when it was finally taken down by a panic.  The last
transaction commit records in the WAL log are

commit: 14286807 at 2004-01-23 10:45:23 
commit: 14286811 at 2004-01-23 10:45:24 
commit: 14286814 at 2004-01-23 10:45:24 
commit: 14286824 at 2004-01-23 10:45:24 
commit: 14286825 at 2004-01-23 10:45:24 
commit: 14286836 at 2004-01-23 10:45:24 
commit: 14286838 at 2004-01-23 10:45:24 
commit: 14286850 at 2004-01-23 10:45:24 
commit: 14286851 at 2004-01-23 10:45:24 

Over in the postmaster log, the first sign of trouble is

Jan 23 10:18:07 canaveral postgres[20039]: [879-1] LOG:  could not close temporary 
statistics file /usr/local/pgsql/data/global/pgstat.tmp.20035: No space left on 
device

and there is a steady stream of transactions failing with out-of-space
errors over the next half hour, but none of the failures are worse than
a transaction abort.  Finally we see

Jan 23 10:45:24 canaveral postgres[57237]: [17-1] ERROR:  could not access status of 
transaction 0
Jan 23 10:45:24 canaveral postgres[57237]: [17-2] DETAIL:  could not write to file 
/usr/local/pgsql/data/pg_clog/000D at offset 147456: No space left on device
Jan 23 10:45:24 canaveral postgres[57237]: [18-1] WARNING:  AbortTransaction and not 
in in-progress state
Jan 23 10:45:24 canaveral postgres[57237]: [19-1] PANIC:  could not access status of 
transaction 0
Jan 23 10:45:24 canaveral postgres[57237]: [19-2] DETAIL:  could not write to file 
/usr/local/pgsql/data/pg_clog/000D at offset 147456: No space left on device
Jan 23 10:45:24 canaveral postgres[20035]: [5-1] LOG:  server process (PID 57237) was 
terminated by signal 6
Jan 23 10:45:24 canaveral postgres[20035]: [6-1] LOG:  terminating any other active 
server processes

after which the postmaster's recovery attempts fail, as Chris already
detailed.  (Note: the reference to transaction 0 is not significant;
that just happens because SimpleLruWritePage doesn't have a specific
transaction number to blame its write failures on.)

Those are the observed facts; what's the interpretation?  I think it
shows that Postgres is pretty darn robust, actually.  We were able to
stay up and do useful work for quite a long time with zero free space;
what's more, we lost no transactions that were successfully committed.
The data was successfully stored in preallocated WAL space.  (If things
had gone on this way for a while longer, we would have panicked for lack
of WAL space, but Chris was actually not anywhere near there; he'd only
filled about two WAL segments in the half hour of operations.)  Note
also that checkpoints were attempted several times during that interval,
and they all failed gracefully --- no panic, no incorrect WAL update.

But why did this panic finally happen?  The key observation is that
the first nonexistent page of pg_clog was the page beginning with
transaction 14286848.  Neither this xact nor the following one have any
commit or abort record in WAL, but we do see entries for 14286850 and
14286851.  It is also notable that there is no WAL entry for extension
of pg_clog to include this page --- normally a WAL entry is made each
time a page of zeroes is added to pg_clog.  My interpretation of the
sequence of events is:

Transaction 14286848 started, and since it was the first for its pg_clog
page, it tried to do ZeroCLOGPage() for that page (see ExtendCLOG).  This
required making room in the in-memory clog buffers, which required
dumping one of the previously-buffered clog pages, which failed for lack
of disk space, leading to this log entry:

Jan 23 10:45:24 canaveral postgres[57237]: [17-1] ERROR:  could not access status of 
transaction 0
Jan 23 10:45:24 canaveral postgres[57237]: [17-2] DETAIL:  could not write to file 
/usr/local/pgsql/data/pg_clog/000D at offset 147456: No space left on device
Jan 23 10:45:24 canaveral postgres[57237]: [18-1] WARNING:  AbortTransaction and not 
in in-progress state

(Note: page offset 147456 is the page two before the one containing xid
14286848.  This page had been allocated in clog buffers but never yet
successfully written to disk.  Ditto for the page in between.)  The next
thing that happened was that transaction xids 14286849 and 14286850 were
assigned (ie, those xacts started), and then 14286850 tried to commit.
This again led to a failed attempt to write out a clog page, but this
time the error was promoted to a panic because it happened inside the
transaction commit critical section:

Jan 23 

Re: [HACKERS] Disaster!

2004-01-25 Thread Michael Glaesemann
Tom,

I don't know if the 'canaveral' prompt had anything to do with it 
(maybe it was just the subject line), but I kept thinking of shuttle 
disasters, o-rings, and plane crashes reading through this. I won't 
claim to understand everything in huge detail, but from this newbie's 
point of view, well explained! I enjoyed reading it.

Ever thought of working for the NTSB? :)

Michael Glaesemann
grzm myrealbox com


Re: [HACKERS] Disaster!

2004-01-24 Thread Tom Lane
Gavin Sherry [EMAIL PROTECTED] writes:
 It seems that by adding the following to SlruPhysicalReadPage() we can
 recover in a reasonable way here. Instead of:
 [ add non-error check to lseek() ]

But it's not the lseek() that's gonna fail.  What we'll actually see,
and did see in Chris' report, is a read() that returns zero bytes, or
possibly an incomplete page.  So actually this change is needed in the
next step, not the lseek.
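
A self-contained sketch of that idea --- treat a zero-length or partial read()
as a page of zeroes, but only during recovery; the names and structure here are
illustrative rather than the actual slru.c code:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

/*
 * Read one 8 KB page.  If the file is shorter than expected (a read past
 * EOF returns 0 bytes, or a partial page), hand back zeroes when we are
 * allowed to, instead of failing outright.
 */
bool
read_page_or_zeroes(int fd, char *buf, bool in_recovery)
{
    ssize_t     n = read(fd, buf, BLCKSZ);

    if (n == BLCKSZ)
        return true;
    if (!in_recovery)
    {
        fprintf(stderr, "short read: got %zd bytes: %s\n",
                n, n < 0 ? strerror(errno) : "unexpected EOF");
        return false;
    }
    memset(buf, 0, BLCKSZ);     /* zeroes == all transactions in progress */
    return true;
}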

BUT: after looking at the code more, I'm confused again about exactly
how Chris' failure happened.  The backtrace he sent this morning shows
that the panic occurs while replaying a transaction-commit WAL entry ---
it's trying to set the commit status bit for that transaction number,
and finding that the clog page containing that bit doesn't exist.  But
there should have been a previous WAL entry recording the ZeroCLOGPage()
action for that clog page.  The only way that wouldn't have got replayed
too is if there was a checkpoint in between ... but a checkpoint should
not have been able to complete without flushing the clog buffer to disk.
If there wasn't disk space enough to hold the clog page, the checkpoint
attempt should have failed.  So it may be that allowing a short read in
slru.c would be patching the symptom of a bug that is really elsewhere.
We need to understand the sequence of events in more detail.

Chris, you said you'd saved a copy of the data directory at the time of
the failure, right?  Could you send me the pg_control file and the
active segments of pg_xlog?  (It should be sufficient to send the ones
with file mod times within five minutes of the crash.)

regards, tom lane



Re: [HACKERS] Disaster!

2004-01-24 Thread Tom Lane
Christopher Kings-Lynne [EMAIL PROTECTED] writes:
 FreeBSD 4.7/4.9 and the UFS filesystem

Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC
at close().  We need to fix the code to check close's return value,
probably, but it seems we still lack a clear explanation of what
happened to your database.

That request to look at your WAL files is still open ...

regards, tom lane



Re: [HACKERS] Disaster!

2004-01-24 Thread Greg Stark
Tom Lane [EMAIL PROTECTED] writes:

 Christopher Kings-Lynne [EMAIL PROTECTED] writes:
  FreeBSD 4.7/4.9 and the UFS filesystem
 
 Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC
 at close().  We need to fix the code to check close's return value,
 probably, but it seems we still lack a clear explanation of what
 happened to your database.

The traditional Unix filesystems certainly don't return errors at close. Even
NFS doesn't traditionally do so. I think NFSv3 can if the server disappears
after the client obtains a lease on a piece of the file, but I'm not sure if
ENOSPC is a possible failure mode.

I do know that AFS returns quota failures on close. This was unusual enough
that when AFS was deployed at school unix tools failed left and right over
precisely this issue. Though it mostly just meant they returned the wrong exit
status.

-- 
greg




Re: [HACKERS] Disaster!

2004-01-24 Thread Christopher Kings-Lynne
 That request to look at your WAL files is still open ...

I've sent you it privately - let me know how it goes.

Chris





[HACKERS] Disaster!

2004-01-23 Thread Christopher Kings-Lynne
We ran out of disk space on our main server, and now that I've freed up 
space, we cannot start postgres!

Jan 23 12:18:50 canaveral postgres[563]: [2-1] LOG:  checkpoint record 
is at 2/96500B94
Jan 23 12:18:50 canaveral postgres[563]: [3-1] LOG:  redo record is at 
2/964BD23C; undo record is at 0/0; shutdown FALSE
Jan 23 12:18:50 canaveral postgres[563]: [4-1] LOG:  next transaction 
ID: 14216463; next OID: 4732327
Jan 23 12:18:50 canaveral postgres[563]: [5-1] LOG:  database system was 
not properly shut down; automatic recovery in progress
Jan 23 12:18:50 canaveral postgres[563]: [6-1] LOG:  redo starts at 
2/964BD23C
Jan 23 12:18:51 canaveral postgres[563]: [7-1] PANIC:  could not access 
status of transaction 14286850
Jan 23 12:18:51 canaveral postgres[563]: [7-2] DETAIL:  could not read 
from file /usr/local/pgsql/data/pg_clog/000D at offset 163840: 
Undefined error: 0
Jan 23 12:18:51 canaveral postgres[567]: [1-1] FATAL:  the database 
system is starting up
Jan 23 12:18:52 canaveral postgres[558]: [1-1] LOG:  startup process 
(PID 563) was terminated by signal 6
Jan 23 12:18:52 canaveral postgres[558]: [2-1] LOG:  aborting startup 
due to startup process failure

What can I do?

Chris



Re: [HACKERS] Disaster!

2004-01-23 Thread Christopher Kings-Lynne
I'd suggest extending that file with 8K of zeroes (might need more than
that, but probably not).
How do I do that?  Sorry - I'm not sure of the quickest way, and I'm 
reading man pages as we speak!

Thanks Tom,

Chris



Re: [HACKERS] Disaster!

2004-01-23 Thread Tom Lane
Christopher Kings-Lynne [EMAIL PROTECTED] writes:
 I'd suggest extending that file with 8K of zeroes (might need more than
 that, but probably not).

 How do I do that?  Sorry - I'm not sure of the quickest way, and I'm 
 reading man pages as we speak!

Something like dd if=/dev/zero bs=8k count=1 >>clogfile, but check the
dd man page (and make sure you have a /dev/zero).

regards, tom lane



Re: [HACKERS] Disaster!

2004-01-23 Thread Christopher Kings-Lynne

I'd suggest extending that file with 8K of zeroes (might need more than
that, but probably not).
OK, I've done

dd if=/dev/zero of=zeros count=16

Then cat zeros >> 000D

Now I can start it up!  Thanks!

What should I do now?

Chris



Re: [HACKERS] Disaster!

2004-01-23 Thread Martín Marqués
Quoting Christopher Kings-Lynne [EMAIL PROTECTED]:

  I'd suggest extending that file with 8K of zeroes (might need more than
  that, but probably not).
 
 How do I do that?  Sorry - I'm not sure of the quickest way, and I'm 
 reading man pages as we speak!

# dd if=/dev/zero of=somefile bs=8k count=1
# cat file1 somefile > newfile
# mv newfile file1

file1 is /usr/local/pgsql/data/pg_clog/000D

-- 
select 'mmarques' || '@' || 'unl.edu.ar' AS email;
---
Martín Marqués  |   Programador, DBA
Centro de Telemática| Administrador
   Universidad Nacional
del Litoral
---



Re: [HACKERS] Disaster!

2004-01-23 Thread Tom Lane
Christopher Kings-Lynne [EMAIL PROTECTED] writes:
 Now I can start it up!  Thanks!

 What should I do now?

Go home and get some sleep ;-).  If the WAL replay succeeded, you're up
and running, nothing else to do.

regards, tom lane



Re: [HACKERS] Disaster!

2004-01-23 Thread Martín Marqués
Quoting Tom Lane [EMAIL PROTECTED]:

 Christopher Kings-Lynne [EMAIL PROTECTED] writes:
  Now I can start it up!  Thanks!
 
  What should I do now?
 
 Go home and get some sleep ;-).  If the WAL replay succeeded, you're up
 and running, nothing else to do.

Tom, could you give a small insight into what occurred here, why those 8k of zeros
fixed it, and what a WAL replay is?

I am very curious about it.

-- 
select 'mmarques' || '@' || 'unl.edu.ar' AS email;
---
Martín Marqués  |   Programador, DBA
Centro de Telemática| Administrador
   Universidad Nacional
del Litoral
---



Re: [HACKERS] Disaster!

2004-01-23 Thread Dann Corbit
 -Original Message-
 From: Tom Lane [mailto:[EMAIL PROTECTED] 
 Sent: Friday, January 23, 2004 1:01 PM
 To: Christopher Kings-Lynne
 Cc: PostgreSQL-development
 Subject: Re: [HACKERS] Disaster! 
 
 
 Christopher Kings-Lynne [EMAIL PROTECTED] writes:
  Now I can start it up!  Thanks!
 
  What should I do now?
 
 Go home and get some sleep ;-).  If the WAL replay succeeded, 
 you're up and running, nothing else to do.

This seems like a very serious problem, if a database can be broken [into a
non-startable condition] by running out of space.

Is it certain that no data was lost?

If it is totally safe to extend the WAL file with zeros and restart, why
not build it into PostgreSQL to do so automatically?

Can I get a 15 sentence speech on what happened, what the repair did,
and why we know that the result can be trusted?
I think it would reassure more than just myself.



Re: [HACKERS] Disaster!

2004-01-23 Thread Rod Taylor
On Fri, 2004-01-23 at 16:00, Tom Lane wrote:
 Christopher Kings-Lynne [EMAIL PROTECTED] writes:
  Now I can start it up!  Thanks!
 
  What should I do now?
 
 Go home and get some sleep ;-).  If the WAL replay succeeded, you're up
 and running, nothing else to do.

Granted, running out of disk space is a bad idea, but can (has?)
something be put into place to prevent manual intervention from being
required to restart the database?




Re: [HACKERS] Disaster!

2004-01-23 Thread Tom Lane
Christopher Kings-Lynne [EMAIL PROTECTED] writes:
 Are you interested in real backtraces, any of the old data directory, 
 etc. to debug the problem?

If you could recompile with debug support and get a backtrace from the
panic, it would be helpful.  I suspect what we need to do is make the
clog code more willing to interpret a zero-length read as 8K of zeroes
instead of an error, at least during recovery.  But I kinda thought
there was such an escape hatch already, so I want to see exactly how
it got to the point of the failure.

Also, which PG version are you running exactly?

 Obviously it ran out of disk space, but surely postgres should be able 
 to start up somehow?

See the response I'm about to write to Martín.

regards, tom lane



Re: [HACKERS] Disaster!

2004-01-23 Thread Tom Lane
Martín Marqués [EMAIL PROTECTED] writes:
 Tom, could you give a small insight on what occurred here, why those
 8k of zeros fixed it, and what is a WAL replay?

I think what happened is that there was insufficient space to write out
a new page of the clog (transaction commit) file.  This would result in
a database panic, which is fine --- you're not gonna get much done
anyway if you are down to zero free disk space.  However, after Chris
freed up space, the system needed to replay the WAL from the last
checkpoint to ensure consistency.  The WAL entries evidently included
references to transactions whose commit bits were in the unwritten page.
Now there would also be WAL entries recording those commits, so once the
replay was complete everything would be cool.  But the clog access code
evidently got confused by being asked to read a page that didn't exist
in the file.  I'm not sure yet how that sequence of events occurred,
which is why I asked Chris for a stack trace.

Adding a page of zeroes fixed it by eliminating the read error
condition.  It was okay to do so because zeroes is the correct initial
state for a clog page (all transactions in it still in progress).
After WAL replay, any completed transactions would be updated in the page.
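
To make that concrete: with 2 status bits per transaction, an all-zero byte
decodes to "in progress" for each of the four transactions it covers. A small
illustration, with the constants and status value stated as assumptions rather
than quoted from slru.c:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CLOG_PAGE_SIZE      8192
#define XACTS_PER_BYTE      4           /* 2 status bits per transaction */
#define STATUS_IN_PROGRESS  0           /* all-zero page = all in progress */

static int
xact_status_on_page(const uint8_t *page, uint32_t xid_offset_in_page)
{
    uint32_t    byteno = xid_offset_in_page / XACTS_PER_BYTE;
    uint32_t    bshift = (xid_offset_in_page % XACTS_PER_BYTE) * 2;

    return (page[byteno] >> bshift) & 0x03;
}

int
main(void)
{
    uint8_t     page[CLOG_PAGE_SIZE];

    memset(page, 0, sizeof(page));      /* a freshly zero-filled clog-style page */
    printf("status = %d (matches STATUS_IN_PROGRESS = %d)\n",
           xact_status_on_page(page, 12345), STATUS_IN_PROGRESS);
    return 0;
}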

regards, tom lane



Re: [HACKERS] Disaster!

2004-01-23 Thread Tom Lane
Rod Taylor [EMAIL PROTECTED] writes:
 Granted, running out of diskspace is a bad idea, but can (has?)
 something be put into place to prevent manual intervention from being
 required in restarting the database?

See subsequent discussion.  I do want to modify the code to avoid this
problem in future, but we do not need to make Chris work on it at 5 AM
his time.  It can wait till tomorrow.

regards, tom lane



Re: [HACKERS] Disaster!

2004-01-23 Thread Alvaro Herrera
On Fri, Jan 23, 2004 at 04:21:04PM -0500, Tom Lane wrote:

 But the clog access code evidently got confused by being asked to read
 a page that didn't exist in the file.  I'm not sure yet how that
 sequence of events occurred, which is why I asked Chris for a stack
 trace.

There was a very similar episode some time ago, and Elein summed it up
nicely in General Bits:

http://www.varlena.com/GeneralBits/45.php

-- 
Alvaro Herrera (alvherre[a]dcc.uchile.cl)
The meaning of things does not come from the things themselves, but from
the minds that apply them to their daily problems in pursuit of
progress. (Ernesto Hernández-Novich)



Re: [HACKERS] Disaster!

2004-01-23 Thread Bruce Momjian
Tom Lane wrote:
 Christopher Kings-Lynne [EMAIL PROTECTED] writes:
  Are you interested in real backtraces, any of the old data directory, 
  etc. to debug the problem?
 
 If you could recompile with debug support and get a backtrace from the
 panic, it would be helpful.  I suspect what we need to do is make the
 clog code more willing to interpret a zero-length read as 8K of zeroes
 instead of an error, at least during recovery.  But I kinda thought
 there was such an escape hatch already, so I want to see exactly how
 it got to the point of the failure.
 
 Also, which PG version are you running exactly?

I asked him, 7.4.1.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] Disaster!

2004-01-23 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Tom's answer will be undoubtedly better ...

Nope, I think you got all the relevant points.

The only thing I'd add after having had more time to think about it is
that this seems very much like the problem we noticed recently with
recovery-from-WAL being broken by the new code in bufmgr.c that tries to
validate the header fields of any page it reads in.  We had to add an
escape hatch to disable that check while InRecovery, and I expect what
we will end up with here is a few lines added to slru.c to make it treat
read-past-EOF as a non-error condition when InRecovery.  Now the clog
code has always had all that paranoid error checking, but because it
deals in such tiny volumes of data (only 2 bits per transaction), it's
unlikely to suffer an out-of-disk-space condition.  That's why we hadn't
seen this failure mode before.
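
As a sanity check of the arithmetic against Chris's logs: assuming 2 bits per
transaction, 8 KB clog pages, and 32 pages per clog segment file (that last
figure is my assumption, not something stated in the thread), xid 14286850
maps to pg_clog/000D at page offset 163840, and offset 147456 is indeed two
pages earlier:

#include <stdint.h>
#include <stdio.h>

#define XACTS_PER_BYTE      4                                    /* 2 bits each */
#define CLOG_PAGE_SIZE      8192
#define XACTS_PER_PAGE      (CLOG_PAGE_SIZE * XACTS_PER_BYTE)    /* 32768 */
#define PAGES_PER_SEGMENT   32                                   /* assumption */

int
main(void)
{
    uint32_t    xid = 14286850;
    uint32_t    page = xid / XACTS_PER_PAGE;              /* 436 */
    uint32_t    segment = page / PAGES_PER_SEGMENT;       /* 13 -> "000D" */
    uint32_t    offset = (page % PAGES_PER_SEGMENT) * CLOG_PAGE_SIZE;

    printf("xid %u -> pg_clog/%04X at page offset %u\n",
           (unsigned) xid, (unsigned) segment, (unsigned) offset);
    return 0;
}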

regards, tom lane



Re: [HACKERS] Disaster!

2004-01-23 Thread Gavin Sherry
On Fri, 23 Jan 2004, Tom Lane wrote:

 Alvaro Herrera [EMAIL PROTECTED] writes:
  Tom's answer will be undoubtedly better ...

 Nope, I think you got all the relevant points.

 The only thing I'd add after having had more time to think about it is
 that this seems very much like the problem we noticed recently with
 recovery-from-WAL being broken by the new code in bufmgr.c that tries to
 validate the header fields of any page it reads in.  We had to add an
 escape hatch to disable that check while InRecovery, and I expect what
 we will end up with here is a few lines added to slru.c to make it treat
 read-past-EOF as a non-error condition when InRecovery.  Now the clog
 code has always had all that paranoid error checking, but because it
 deals in such tiny volumes of data (only 2 bits per transaction), it's
 unlikely to suffer an out-of-disk-space condition.  That's why we hadn't
 seen this failure mode before.

It seems that by adding the following to SlruPhysicalReadPage() we can
recover in a reasonable way here. Instead of:

if (lseek(fd, (off_t) offset, SEEK_SET) < 0)
{
    slru_errcause = SLRU_SEEK_FAILED;
    slru_errno = errno;
    return false;
}

We have:


if (lseek(fd, (off_t) offset, SEEK_SET) < 0)
{
    if (!InRecovery)
    {
        slru_errcause = SLRU_SEEK_FAILED;
        slru_errno = errno;
        return false;
    }
    ereport(LOG,
            (errmsg("Short read from file \"%s\", reading as zeroes",
                    path)));
    MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
    return true;
}

Which is exactly how we recover from a missing pg_clog file.


   regards, tom lane

Gavin
