Tom Lane wrote:
Are you using one of the scripts that
does an auto initdb if it doesn't see a valid PGDATA? 11 seconds might
be about right for that.
One problem with this theory is how come you didn't get screwed during
*that* boot cycle. It seems to require assuming that the NFS mount came
Michael Fuhr wrote:
On Wed, Dec 15, 2004 at 11:41:02AM -0800, Joe Conway wrote:
Just wanted to close the loop for the sake of the list archives. With
Tom's xlog dump tool I was able (with a bunch of his help off-list) to
identify the needed parameters for pg_resetxlog. Running pg_resetxlog
got
Joe Conway wrote:
We then spent most of the next 24 hours reviewing the recovered
database. The bulk data loading process was well instrumented, so we
knew exactly which data should have been committed prior to the server
hang, and which files were inprocess (we had been doing 10 loads in
On Sat, Dec 18, 2004 at 02:28:51PM -0800, Joe Conway wrote:
Hi,
Apparently, either because of the server hang, or because of the flakey
eth0 interface on reboot, pg_control had become corrupt. However, it
was not corrupt in the sense that it contained impossibly invalid data.
In fact, as
Alvaro Herrera wrote:
I can't help remembering the fact that the init script executes an
initdb automatically if it finds an empty data directory (the ones I
know of at least -- does the one you are running?). Maybe what happened
was that it found the empty mount point, executed an initdb, and
Alvaro Herrera [EMAIL PROTECTED] writes:
These values (from the corrupt pg_control file) are strange:
pg_control last modified: Tue Dec 14 15:39:26 2004
Time of latest checkpoint:Tue Nov 2 17:05:32 2004
The last modified date doesn't prove a lot because it would have
Joe Conway [EMAIL PROTECTED] writes:
The Tue Nov 2 17:05:32 2004 seems to be related to the *previous*
restart; from /var/log/messages:
Nov 2 17:04:20 csdfds1 syslogd 1.4.1: restart.
...
Nov 2 17:05:22 csdfds1 su: pam_unix2: session started for user
postgres, service su
...
Nov 2
Joe Conway [EMAIL PROTECTED] writes:
The manpage for pg_resetxlog gives some general idea how it is used, and
a way to estimate the next transaction id and wal segment.
I had forgotten that that text was in there. It needs to be updated for
8.0 because WAL segment file names are now
Tom Lane wrote:
I think Alvaro's idea that this copy of pg_control got created when the
NFS mount was offline is a real good theory. However, it would seem
that that was quite some time ago (Nov 2 if not earlier), which would
suggest that the mount instability problem has been around longer than
Joe Conway wrote:
So one thing I'd strongly suggest is stopping Postgres and dismounting
the NFS server to see what's under there. If there is a valid-looking
PGDATA directory under there, you definitely want to get rid of it to
reduce the risk of this happening again.
Perhaps we should
Andrew Dunstan wrote:
In the absence of that, in your case, certainly the root-owned
placeholder is a good idea - it seems nicer than disabling on-boot
startup altogether if you can avoid that.
I'm pretty well convinced at this point that a start on boot init script
is inappropriate when
Tom Lane wrote:
My advice is to backup the $PGDATA tree (which you said was in
progress), then pg_resetxlog, then cross-check the hell out of the data
you see. Only if you can detect some data problems can we guess at
something else to do ...
Before running pg_resetxlog, a couple of questions:
1.
Joe Conway [EMAIL PROTECTED] writes:
I don't trust it at all. So does that imply that I should override next
transaction id and WAL starting address per the manpage?
Yes, override everything there's a switch for. Also check that the
other values shown by pg_controldata look reasonable (the
Tom Lane [EMAIL PROTECTED] writes:
The server experienced a hang (as yet unexplained) yesterday and was
restarted at 2004-12-13 16:38:49 according to syslog. I'm told by the
network admin that there was a problem with the network card on restart,
so the nfs mount most probably
Tom Lane wrote:
Joe Conway [EMAIL PROTECTED] writes:
I don't trust it at all. So does that imply that I should override
next transaction id and WAL starting address per the manpage?
Yes, override everything there's a switch for. Also check that the
other values shown by pg_controldata look
Greg Stark wrote:
Tom Lane [EMAIL PROTECTED] writes:
I've always felt that running a database across NFS was a Bad Idea ;-)
Well not that I disagree with that sentiment, but NFS was specifically
designed to handle this particular scenario. *UNLESS* you use the soft
option. As popular as it is,
On Tue, Dec 14, 2004 at 09:22:42PM -0800, Joe Conway wrote:
# pg_controldata /replica/pgdata
Current log file ID: 0
Next log file segment:1
Latest checkpoint location: 0/9B0B8C
Prior checkpoint location:0/9AA1B4
Latest checkpoint's
Joe Conway [EMAIL PROTECTED] writes:
Before running pg_resetxlog, a couple of questions:
1. Since it appears that pg_control is suspect, should I force it to be
rebuilt, and if so, how?
pg_resetxlog will rebuild it in any case. However it will re-use the
existing contents as much as it
Tom Lane wrote:
pg_resetxlog will rebuild it in any case. However it will re-use the
existing contents as much as it can (if you don't use any of the command
line options to override values). Given Alvaro's observation that the
existing file looks suspiciously close to a freshly-initdb'd one, I
Tom Lane wrote:
Yes, override everything there's a switch for. Also check that the
other values shown by pg_controldata look reasonable (the locale
settings are probably the only ones you might get burned on).
What if anything have you got in $PGDATA/pg_xlog?
-rw--- 1 postgres postgres
On Wed, Dec 15, 2004 at 11:41:02AM -0800, Joe Conway wrote:
Just wanted to close the loop for the sake of the list archives. With
Tom's xlog dump tool I was able (with a bunch of his help off-list) to
identify the needed parameters for pg_resetxlog. Running pg_resetxlog
got us back a
I've got a down production server (will not restart) with the following
tail to its log file:
2004-12-13 15:05:52 LOG: recycled transaction log file 0165004C
2004-12-13 15:26:01 LOG: recycled transaction log file 0165004D
2004-12-13 16:39:55 LOG: database system was shut down
Joe Conway wrote:
This is a SuSE 9, 8-way Xeon IBM x445, with nfs mounted Network
Appliance for database storage, postgresql-7.4.5-36.4.
The server experienced a hang (as yet unexplained) yesterday and was
restarted at 2004-12-13 16:38:49 according to syslog. I'm told by the
network
Tom Lane wrote:
Joe Conway [EMAIL PROTECTED] writes:
I've got a down production server (will not restart) with the following
tail to its log file:
Please show the output of pg_controldata, or a hex dump of pg_control
if pg_controldata fails.
OK, here it is:
# pg_controldata /replica/pgdata
Tom Lane wrote:
...
pg_control last modified: Tue Dec 14 15:39:26 2004
...
Time of latest checkpoint:Tue Nov 2 17:05:32 2004
[ blink... ] That seems like an unreasonable gap between checkpoints,
especially for a production server. Can you see an explanation?
Hmmm, this
Joe Conway [EMAIL PROTECTED] writes:
Any theories on how we screwed up?
I hesitate to suggest this, but maybe a cron job blindly copying data
from point A to point B?
I'm not sure that that could entirely explain the facts. My
recollection of the xlog.c logic is that the pg_control file is
Tom Lane wrote:
Joe Conway [EMAIL PROTECTED] writes:
Any theories on how we screwed up?
I hesitate to suggest this, but maybe a cron job blindly copying data
from point A to point B?
Not likely, but I'll check.
Offhand my bets would revolve around (a) multiple postmasters trying
to run the same
Tom Lane wrote:
My advice is to backup the $PGDATA tree (which you said was in
progress), then pg_resetxlog, then cross-check the hell out of the data
you see. Only if you can detect some data problems can we guess at
something else to do ...
OK. I plan to gather the usual suspects and try to get
Joe Conway [EMAIL PROTECTED] writes:
Assuming the only real problem here is the control data (long shot, I
know), and the actual database files and transaction logs are OK, is
there any reasonable way to reconstruct the correct contol data? Or is
that the point at which you use
Joe Conway [EMAIL PROTECTED] writes:
I've got a down production server (will not restart) with the following
tail to its log file:
Please show the output of pg_controldata, or a hex dump of pg_control
if pg_controldata fails.
The server experienced a hang (as yet unexplained) yesterday and
Tom Lane wrote:
Joe Conway [EMAIL PROTECTED] writes:
I've got a down production server (will not restart) with the following
tail to its log file:
Please show the output of pg_controldata, or a hex dump of pg_control
if pg_controldata fails.
OK, will do shortly.
The server experienced a hang (as
Joe Conway [EMAIL PROTECTED] writes:
Tom Lane wrote:
Please show the output of pg_controldata, or a hex dump of pg_control
if pg_controldata fails.
OK, here it is:
...
pg_control last modified: Tue Dec 14 15:39:26 2004
...
Time of latest checkpoint:Tue Nov 2
32 matches
Mail list logo