Re: [HACKERS] production server down

2004-12-27 Thread Joe Conway
Tom Lane wrote: Are you using one of the scripts that does an auto initdb if it doesn't see a valid PGDATA? 11 seconds might be about right for that. One problem with this theory is how come you didn't get screwed during *that* boot cycle. It seems to require assuming that the NFS mount came

Re: [HACKERS] production server down

2004-12-18 Thread Joe Conway
Michael Fuhr wrote: On Wed, Dec 15, 2004 at 11:41:02AM -0800, Joe Conway wrote: Just wanted to close the loop for the sake of the list archives. With Tom's xlog dump tool I was able (with a bunch of his help off-list) to identify the needed parameters for pg_resetxlog. Running pg_resetxlog got

Re: [HACKERS] production server down

2004-12-18 Thread Bruce Momjian
Joe Conway wrote: We then spent most of the next 24 hours reviewing the recovered database. The bulk data loading process was well instrumented, so we knew exactly which data should have been committed prior to the server hang, and which files were inprocess (we had been doing 10 loads in

Re: [HACKERS] production server down

2004-12-18 Thread Alvaro Herrera
On Sat, Dec 18, 2004 at 02:28:51PM -0800, Joe Conway wrote: Hi, Apparently, either because of the server hang, or because of the flakey eth0 interface on reboot, pg_control had become corrupt. However, it was not corrupt in the sense that it contained impossibly invalid data. In fact, as

Re: [HACKERS] production server down

2004-12-18 Thread Joe Conway
Alvaro Herrera wrote: I can't help remembering the fact that the init script executes an initdb automatically if it finds an empty data directory (the ones I know of at least -- does the one you are running?). Maybe what happened was that it found the empty mount point, executed an initdb, and

Re: [HACKERS] production server down

2004-12-18 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes: These values (from the corrupt pg_control file) are strange: pg_control last modified: Tue Dec 14 15:39:26 2004 Time of latest checkpoint:Tue Nov 2 17:05:32 2004 The last modified date doesn't prove a lot because it would have

Re: [HACKERS] production server down

2004-12-18 Thread Tom Lane
Joe Conway [EMAIL PROTECTED] writes: The Tue Nov 2 17:05:32 2004 seems to be related to the *previous* restart; from /var/log/messages: Nov 2 17:04:20 csdfds1 syslogd 1.4.1: restart. ... Nov 2 17:05:22 csdfds1 su: pam_unix2: session started for user postgres, service su ... Nov 2

pg_resetxlog for 8.0 (was Re: [HACKERS] production server down)

2004-12-18 Thread Tom Lane
Joe Conway [EMAIL PROTECTED] writes: The manpage for pg_resetxlog gives some general idea how it is used, and a way to estimate the next transaction id and wal segment. I had forgotten that that text was in there. It needs to be updated for 8.0 because WAL segment file names are now

Re: [HACKERS] production server down

2004-12-18 Thread Joe Conway
Tom Lane wrote: I think Alvaro's idea that this copy of pg_control got created when the NFS mount was offline is a real good theory. However, it would seem that that was quite some time ago (Nov 2 if not earlier), which would suggest that the mount instability problem has been around longer than

Re: [HACKERS] production server down

2004-12-18 Thread Andrew Dunstan
Joe Conway wrote: So one thing I'd strongly suggest is stopping Postgres and dismounting the NFS server to see what's under there. If there is a valid-looking PGDATA directory under there, you definitely want to get rid of it to reduce the risk of this happening again. Perhaps we should

Re: [HACKERS] production server down

2004-12-18 Thread Joe Conway
Andrew Dunstan wrote: In the absence of that, in your case, certainly the root-owned placeholder is a good idea - it seems nicer than disabling on-boot startup altogether if you can avoid that. I'm pretty well convinced at this point that a start on boot init script is inappropriate when

Re: [HACKERS] production server down

2004-12-15 Thread Joe Conway
Tom Lane wrote: My advice is to backup the $PGDATA tree (which you said was in progress), then pg_resetxlog, then cross-check the hell out of the data you see. Only if you can detect some data problems can we guess at something else to do ... Before running pg_resetxlog, a couple of questions: 1.

Re: [HACKERS] production server down

2004-12-15 Thread Tom Lane
Joe Conway [EMAIL PROTECTED] writes: I don't trust it at all. So does that imply that I should override next transaction id and WAL starting address per the manpage? Yes, override everything there's a switch for. Also check that the other values shown by pg_controldata look reasonable (the

Re: [HACKERS] production server down

2004-12-15 Thread Greg Stark
Tom Lane [EMAIL PROTECTED] writes: The server experienced a hang (as yet unexplained) yesterday and was restarted at 2004-12-13 16:38:49 according to syslog. I'm told by the network admin that there was a problem with the network card on restart, so the nfs mount most probably

Re: [HACKERS] production server down

2004-12-15 Thread Joe Conway
Tom Lane wrote: Joe Conway [EMAIL PROTECTED] writes: I don't trust it at all. So does that imply that I should override next transaction id and WAL starting address per the manpage? Yes, override everything there's a switch for. Also check that the other values shown by pg_controldata look

Re: [HACKERS] production server down

2004-12-15 Thread Joe Conway
Greg Stark wrote: Tom Lane [EMAIL PROTECTED] writes: I've always felt that running a database across NFS was a Bad Idea ;-) Well not that I disagree with that sentiment, but NFS was specifically designed to handle this particular scenario. *UNLESS* you use the soft option. As popular as it is,

Re: [HACKERS] production server down

2004-12-15 Thread Alvaro Herrera
On Tue, Dec 14, 2004 at 09:22:42PM -0800, Joe Conway wrote: # pg_controldata /replica/pgdata Current log file ID: 0 Next log file segment:1 Latest checkpoint location: 0/9B0B8C Prior checkpoint location:0/9AA1B4 Latest checkpoint's

Re: [HACKERS] production server down

2004-12-15 Thread Tom Lane
Joe Conway [EMAIL PROTECTED] writes: Before running pg_resetxlog, a couple of questions: 1. Since it appears that pg_control is suspect, should I force it to be rebuilt, and if so, how? pg_resetxlog will rebuild it in any case. However it will re-use the existing contents as much as it

Re: [HACKERS] production server down

2004-12-15 Thread Joe Conway
Tom Lane wrote: pg_resetxlog will rebuild it in any case. However it will re-use the existing contents as much as it can (if you don't use any of the command line options to override values). Given Alvaro's observation that the existing file looks suspiciously close to a freshly-initdb'd one, I

Re: [HACKERS] production server down

2004-12-15 Thread Joe Conway
Tom Lane wrote: Yes, override everything there's a switch for. Also check that the other values shown by pg_controldata look reasonable (the locale settings are probably the only ones you might get burned on). What if anything have you got in $PGDATA/pg_xlog? -rw--- 1 postgres postgres

Re: [HACKERS] production server down

2004-12-15 Thread Michael Fuhr
On Wed, Dec 15, 2004 at 11:41:02AM -0800, Joe Conway wrote: Just wanted to close the loop for the sake of the list archives. With Tom's xlog dump tool I was able (with a bunch of his help off-list) to identify the needed parameters for pg_resetxlog. Running pg_resetxlog got us back a

[HACKERS] production server down

2004-12-14 Thread Joe Conway
I've got a down production server (will not restart) with the following tail to its log file: 2004-12-13 15:05:52 LOG: recycled transaction log file 0165004C 2004-12-13 15:26:01 LOG: recycled transaction log file 0165004D 2004-12-13 16:39:55 LOG: database system was shut down

Re: [HACKERS] production server down

2004-12-14 Thread Bruce Momjian
Joe Conway wrote: This is a SuSE 9, 8-way Xeon IBM x445, with nfs mounted Network Appliance for database storage, postgresql-7.4.5-36.4. The server experienced a hang (as yet unexplained) yesterday and was restarted at 2004-12-13 16:38:49 according to syslog. I'm told by the network

Re: [HACKERS] production server down

2004-12-14 Thread Joe Conway
Tom Lane wrote: Joe Conway [EMAIL PROTECTED] writes: I've got a down production server (will not restart) with the following tail to its log file: Please show the output of pg_controldata, or a hex dump of pg_control if pg_controldata fails. OK, here it is: # pg_controldata /replica/pgdata

Re: [HACKERS] production server down

2004-12-14 Thread Joe Conway
Tom Lane wrote: ... pg_control last modified: Tue Dec 14 15:39:26 2004 ... Time of latest checkpoint:Tue Nov 2 17:05:32 2004 [ blink... ] That seems like an unreasonable gap between checkpoints, especially for a production server. Can you see an explanation? Hmmm, this

Re: [HACKERS] production server down

2004-12-14 Thread Tom Lane
Joe Conway [EMAIL PROTECTED] writes: Any theories on how we screwed up? I hesitate to suggest this, but maybe a cron job blindly copying data from point A to point B? I'm not sure that that could entirely explain the facts. My recollection of the xlog.c logic is that the pg_control file is

Re: [HACKERS] production server down

2004-12-14 Thread Joe Conway
Tom Lane wrote: Joe Conway [EMAIL PROTECTED] writes: Any theories on how we screwed up? I hesitate to suggest this, but maybe a cron job blindly copying data from point A to point B? Not likely, but I'll check. Offhand my bets would revolve around (a) multiple postmasters trying to run the same

Re: [HACKERS] production server down

2004-12-14 Thread Joe Conway
Tom Lane wrote: My advice is to backup the $PGDATA tree (which you said was in progress), then pg_resetxlog, then cross-check the hell out of the data you see. Only if you can detect some data problems can we guess at something else to do ... OK. I plan to gather the usual suspects and try to get

Re: [HACKERS] production server down

2004-12-14 Thread Tom Lane
Joe Conway [EMAIL PROTECTED] writes: Assuming the only real problem here is the control data (long shot, I know), and the actual database files and transaction logs are OK, is there any reasonable way to reconstruct the correct contol data? Or is that the point at which you use

Re: [HACKERS] production server down

2004-12-14 Thread Tom Lane
Joe Conway [EMAIL PROTECTED] writes: I've got a down production server (will not restart) with the following tail to its log file: Please show the output of pg_controldata, or a hex dump of pg_control if pg_controldata fails. The server experienced a hang (as yet unexplained) yesterday and

Re: [HACKERS] production server down

2004-12-14 Thread Joe Conway
Tom Lane wrote: Joe Conway [EMAIL PROTECTED] writes: I've got a down production server (will not restart) with the following tail to its log file: Please show the output of pg_controldata, or a hex dump of pg_control if pg_controldata fails. OK, will do shortly. The server experienced a hang (as

Re: [HACKERS] production server down

2004-12-14 Thread Tom Lane
Joe Conway [EMAIL PROTECTED] writes: Tom Lane wrote: Please show the output of pg_controldata, or a hex dump of pg_control if pg_controldata fails. OK, here it is: ... pg_control last modified: Tue Dec 14 15:39:26 2004 ... Time of latest checkpoint:Tue Nov 2