Re: [HACKERS] 8.1.3 and unused files

2006-05-05 Thread Tom Lane
Rod Taylor [EMAIL PROTECTED] writes:
 Am I correct in the thought that the various files listed below are not
 used by the database and can be safely removed? There were no other
 active db connections when I issued this command.

 I think truncate (Slony) left them behind.

I don't particularly like the casual assumption that truncate is broken.
If I were you I'd be looking harder for a plausible explanation about
where these files came from, especially seeing how large they are.
Have you tried dumping the file contents to see if the data looks
recognizable at all?

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] 8.1.3 and unused files

2006-05-05 Thread Rod Taylor
On Fri, 2006-05-05 at 14:09 -0400, Tom Lane wrote:
 Rod Taylor [EMAIL PROTECTED] writes:
  Am I correct in the thought that the various files listed below are not
  used by the database and can be safely removed? There were no other
  active db connections when I issued this command.
 
  I think truncate (Slony) left them behind.
 
 I don't particularly like the casual assumption that truncate is broken.

 If I were you I'd be looking harder for a plausible explanation about
 where these files came from, especially seeing how large they are.
 Have you tried dumping the file contents to see if the data looks
 recognizable at all?

Hardware is perfectly functional and has been for about 18 months in
production with 8.0.x.

It is a completely new 8.1 database and Slony is the only entity that
has been working in it. There are not very many possibilities.


I'm fairly confident I know exactly which table they are/were a part of.
1434984 is the table data, 1434986 is the primary key of the table (only
index), and 1434985 is probably the toast structure.

The structure have different relfilenode values and valid data at this
time.

At some point it must have failed in copying the data across, aborted,
and restarted.


So it would have been something like this:

BEGIN; 
TRUNCATE; 
decouple indexes -- ask Jan; 
COPY; 
recouple indexes; 
REINDEX crash, abort, something else to cause a Slony to restart;

reconnect
BEGIN; 
TRUNCATE; 
decouple indexes -- ask Jan; 
COPY; recouple indexes; 
REINDEX;
COMMIT;

-- 


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] 8.1.3 and unused files

2006-05-05 Thread Tom Lane
Rod Taylor [EMAIL PROTECTED] writes:
 At some point it must have failed in copying the data across, aborted,
 and restarted.

Unless you had an actual backend crash, that's not an adequate
explanation.  Transaction abort does clean up created files.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] 8.1.3 and unused files

2006-05-05 Thread Rod Taylor
On Fri, 2006-05-05 at 14:31 -0400, Tom Lane wrote:
 Rod Taylor [EMAIL PROTECTED] writes:
  At some point it must have failed in copying the data across, aborted,
  and restarted.
 
 Unless you had an actual backend crash, that's not an adequate
 explanation.  Transaction abort does clean up created files.

The only reason I noticed is because pg_database_size didn't match
sum(pg_total_relation_size()) and was investigating what I thought was a
bug in one of those functions.


I'm afraid we don't have all of the monitoring, logging, and change
control bits hooked up to the non-production DBs, so that is pretty much
all I have other than conjecture.

The only thing I can come up with is that perhaps someone forcefully
gave it a kick. SIGTERM is a necessary action once in a while to unwedge
a stuck db connection (killing the client script doesn't always get it
immediately).

Slony holds open a transaction on the master while reindexing the slave,
so perhaps someone thought the slave needed help. Making a copy of the
master takes several weeks. They may have killed slony, found the
statements still working away, SIGTERM'd them both, then restarted
slony. It wouldn't be an unusual pattern of events, particularly since
they've not been taught about pg_cancel_backend() yet (no 8.1 training).

How about this?

BEGIN;
TRUNCATE;
COPY;
REINDEX SIGTERM during REINDEX;


pg_class references old files. New files in their aborted state are left
behind?


-- 


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] 8.1.3 and unused files

2006-05-05 Thread Tom Lane
Rod Taylor [EMAIL PROTECTED] writes:
 On Fri, 2006-05-05 at 14:31 -0400, Tom Lane wrote:
 Unless you had an actual backend crash, that's not an adequate
 explanation.  Transaction abort does clean up created files.

 The only thing I can come up with is that perhaps someone forcefully
 gave it a kick. SIGTERM is a necessary action once in a while to unwedge
 a stuck db connection (killing the client script doesn't always get it
 immediately).

SIGTERM wouldn't cause that either.  I hope your people are not in the
habit of using kill -9?

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] 8.1.3 and unused files

2006-05-05 Thread Rod Taylor
On Fri, 2006-05-05 at 15:10 -0400, Tom Lane wrote:
 Rod Taylor [EMAIL PROTECTED] writes:
  On Fri, 2006-05-05 at 14:31 -0400, Tom Lane wrote:
  Unless you had an actual backend crash, that's not an adequate
  explanation.  Transaction abort does clean up created files.
 
  The only thing I can come up with is that perhaps someone forcefully
  gave it a kick. SIGTERM is a necessary action once in a while to unwedge
  a stuck db connection (killing the client script doesn't always get it
  immediately).
 
 SIGTERM wouldn't cause that either.  I hope your people are not in the
 habit of using kill -9?

Command line records, etc. are not available, but I did track down a a
snippet of logs from the backups (daily log rotation). Sorry, I didn't
realize there were backups initially -- it's unusual. Appears it did
crash or get killed in some way exited with exit code 1.

It's a temp DB to try a different database encoding (prep for 8.1
upgrade) with production data.

Is there something you would like me to try doing in an attempt to
reproduce? Preferably with a smaller structure.

The truncate would have happened as part of the prepareTableForCopy()
call.

slony%ssdb 10171 4621947 2006-05-02 05:09:40 EDTLOG:  0: duration: 
30526368.316 ms  statement: select _ssrep.prepareTableForCopy(1010); copy 
SNIP from stdin;
slony%ssdb 10171 4621947 2006-05-02 05:09:40 EDTLOCATION:  exec_simple_query, 
postgres.c:1103
slony%ssdb 10171 4621947 2006-05-02 05:09:40 EDTLOCATION:  exec_simple_query, 
postgres.c:1103
slony%ssdb 10181 0 2006-05-02 15:32:06 EDTLOG:  08P01: unexpected EOF on client 
connection
slony%ssdb 10181 0 2006-05-02 15:32:06 EDTLOCATION:  SocketBackend, 
postgres.c:295
slony%ssdb 10154 0 2006-05-02 15:32:06 EDTLOG:  08P01: unexpected EOF on client 
connection
slony%ssdb 10154 0 2006-05-02 15:32:06 EDTLOCATION:  SocketBackend, 
postgres.c:295
slony%ssdb 10173 0 2006-05-02 16:30:53 EDTLOG:  08P01: unexpected EOF on client 
connection
slony%ssdb 10173 0 2006-05-02 16:30:53 EDTLOCATION:  SocketBackend, 
postgres.c:295
slony%ssdb 10755 0 2006-05-02 16:30:53 EDTLOG:  08P01: unexpected EOF on client 
connection
slony%ssdb 10755 0 2006-05-02 16:30:53 EDTLOCATION:  SocketBackend, 
postgres.c:295
slony%ssdb 300 0 2006-05-02 16:55:18 EDTLOG:  08P01: unexpected EOF on client 
connection
slony%ssdb 300 0 2006-05-02 16:55:18 EDTLOCATION:  SocketBackend, postgres.c:295
slony%ssdb 301 0 2006-05-02 16:55:18 EDTLOG:  08P01: unexpected EOF on client 
connection
slony%ssdb 301 0 2006-05-02 16:55:18 EDTLOCATION:  SocketBackend, postgres.c:295
% 1960  2006-05-02 17:03:19 EDTLOG:  0: server process (PID 10171) exited 
with exit code 1
% 1960  2006-05-02 17:03:19 EDTLOCATION:  LogChildExit, postmaster.c:2416
% 1960  2006-05-02 17:03:19 EDTLOG:  0: terminating any other active server 
processes
% 1960  2006-05-02 17:03:19 EDTLOCATION:  HandleChildCrash, postmaster.c:2306
% 1960  2006-05-02 17:03:19 EDTLOCATION:  HandleChildCrash, postmaster.c:2306
% 1960  2006-05-02 17:03:19 EDTLOG:  0: all server processes terminated; 
reinitializing
% 1960  2006-05-02 17:03:19 EDTLOCATION:  reaper, postmaster.c:2206
 snip connection attempts 
% 5826  2006-05-02 17:03:22 EDTLOG:  0: database system was interrupted at 
2006-05-02 16:06:20 EDT
% 5826  2006-05-02 17:03:22 EDTLOCATION:  StartupXLOG, xlog.c:4374
% 5826  2006-05-02 17:03:22 EDTLOG:  0: checkpoint record is at 59/E0B56920
% 5826  2006-05-02 17:03:22 EDTLOCATION:  StartupXLOG, xlog.c:4442
% 5826  2006-05-02 17:03:22 EDTLOG:  0: redo record is at 59/E0B56920; undo 
record is at 0/0; shutdown FALSE
% 5826  2006-05-02 17:03:22 EDTLOCATION:  StartupXLOG, xlog.c:4469
% 5826  2006-05-02 17:03:22 EDTLOG:  0: next transaction ID: 4863932; next 
OID: 1441853
% 5826  2006-05-02 17:03:22 EDTLOCATION:  StartupXLOG, xlog.c:4472
% 5826  2006-05-02 17:03:22 EDTLOG:  0: next MultiXactId: 1; next 
MultiXactOffset: 0
% 5826  2006-05-02 17:03:22 EDTLOCATION:  StartupXLOG, xlog.c:4475
% 5826  2006-05-02 17:03:22 EDTLOG:  0: database system was not properly 
shut down; automatic recovery in progress
% 5826  2006-05-02 17:03:22 EDTLOCATION:  StartupXLOG, xlog.c:4532
% 5826  2006-05-02 17:03:22 EDTLOG:  0: redo starts at 59/E0B56970
% 5826  2006-05-02 17:03:22 EDTLOCATION:  StartupXLOG, xlog.c:4569
% 5826  2006-05-02 17:03:22 EDTLOG:  0: record with zero length at 
59/E0E429B8
% 5826  2006-05-02 17:03:22 EDTLOCATION:  ReadRecord, xlog.c:2764
% 5826  2006-05-02 17:03:22 EDTLOG:  0: redo done at 59/E0E42988
% 5826  2006-05-02 17:03:22 EDTLOCATION:  StartupXLOG, xlog.c:4627
% 5826  2006-05-02 17:03:23 EDTLOG:  0: database system is ready
% 5826  2006-05-02 17:03:23 EDTLOCATION:  StartupXLOG, xlog.c:4821
% 5826  2006-05-02 17:03:23 EDTLOG:  0: transaction ID wrap limit is 
1073749769, limited by database ssdb
% 5826  2006-05-02 17:03:23 EDTLOCATION:  SetTransactionIdLimit, varsup.c:234
-- 


---(end of 

Re: [HACKERS] 8.1.3 and unused files

2006-05-05 Thread Tom Lane
Rod Taylor [EMAIL PROTECTED] writes:
 % 1960  2006-05-02 17:03:19 EDTLOG:  0: server process (PID 10171) exited 
 with exit code 1

Hm.  I wonder if there are any uses of exit(1) in the Slony triggers.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] 8.1.3 and unused files

2006-05-05 Thread Darcy Buskermolen
On Friday 05 May 2006 13:11, Tom Lane wrote:
 Rod Taylor [EMAIL PROTECTED] writes:
  % 1960  2006-05-02 17:03:19 EDTLOG:  0: server process (PID 10171)
  exited with exit code 1

 Hm.  I wonder if there are any uses of exit(1) in the Slony triggers.

No, there are no calls to exit() in the database loaded funcs.


   regards, tom lane

 ---(end of broadcast)---
 TIP 5: don't forget to increase your free space map settings

-- 
Darcy Buskermolen
Wavefire Technologies Corp.

http://www.wavefire.com
ph: 250.717.0200
fx: 250.763.1759

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] 8.1.3 and unused files

2006-05-05 Thread Rod Taylor
On Fri, 2006-05-05 at 16:11 -0400, Tom Lane wrote:
 Rod Taylor [EMAIL PROTECTED] writes:
  % 1960  2006-05-02 17:03:19 EDTLOG:  0: server process (PID 10171) 
  exited with exit code 1
 
 Hm.  I wonder if there are any uses of exit(1) in the Slony triggers.

It doesn't appear so. It does have this though:

Datum
_Slony_I_killBackend(PG_FUNCTION_ARGS)
{
int32   pid;
int32   signo;
text   *signame;

if (!superuser())
elog(ERROR, Slony-I: insufficient privilege for killBackend);

pid = PG_GETARG_INT32(0);
signame = PG_GETARG_TEXT_P(1);

if (VARSIZE(signame) == VARHDRSZ + 4 
memcmp(VARDATA(signame), NULL, 0) == 0)
{
signo = 0;
}
else if (VARSIZE(signame) == VARHDRSZ + 4 
memcmp(VARDATA(signame), TERM, 0) == 0)
{
signo = SIGTERM;
}
else
{
elog(ERROR, Slony-I: unsupported signal);
}

if (kill(pid, signo)  0)
PG_RETURN_INT32(-1);

PG_RETURN_INT32(0);
}

-- 


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] 8.1.3 and unused files

2006-05-05 Thread Tom Lane
Rod Taylor [EMAIL PROTECTED] writes:
 On Fri, 2006-05-05 at 16:11 -0400, Tom Lane wrote:
 Hm.  I wonder if there are any uses of exit(1) in the Slony triggers.

 It doesn't appear so. It does have this though:

Well, a SIGTERM would have resulted in a bleat in the postmaster log.
The striking thing about your log is that the backend went down without
saying a word; which would be understandable if it had crashed (eg SEGV
or kill -9) but then the postmaster would have seen some other exit
status.  I'm fairly certain there are no paths in the standard backend
code that will exit(1) without any attempt to print a message.  That's
why I'm wondering about add-ons.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] 8.1.3 and unused files

2006-05-05 Thread Rod Taylor
On Fri, 2006-05-05 at 18:53 -0400, Tom Lane wrote:
 Rod Taylor [EMAIL PROTECTED] writes:
  On Fri, 2006-05-05 at 16:11 -0400, Tom Lane wrote:
  Hm.  I wonder if there are any uses of exit(1) in the Slony triggers.
 
  It doesn't appear so. It does have this though:
 
 Well, a SIGTERM would have resulted in a bleat in the postmaster log.
 The striking thing about your log is that the backend went down without
 saying a word; which would be understandable if it had crashed (eg SEGV
 or kill -9) but then the postmaster would have seen some other exit
 status.  I'm fairly certain there are no paths in the standard backend
 code that will exit(1) without any attempt to print a message.  That's
 why I'm wondering about add-ons.

Add-ons are slim. Slony. We don't have any C based functions and only a
few plpgsql functions in that DB.

I did trim out a ton of autovacuum log entries (it likes to log once a
minute) but I don't see anything interesting in that area nor autovac
the pid that exited.

My knowledge of signal handling is pretty basic. Any chance that
multiple SIGTERMs could have caused it to avoid the logging?

-- 


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings