Re: [HACKERS] strange buildfarm failures

2007-05-02 Thread Alvaro Herrera
Tom Lane wrote:
 I wrote:
  Hmm ... I was about to say that the postmaster never sets
  PG_exception_stack, but maybe an error out of a PG_TRY/PG_RE_THROW
  could do it?  Does the postmaster ever execute PG_TRY?
 
 Doh, I bet that's it, and it's not the postmaster that's at issue
 but PG_TRY blocks executed during subprocess startup.  Inheritance
 of a PG_exception_stack setting from the postmaster could only happen if
 the postmaster were to fork() within a PG_TRY block, which I think we
 can safely say it doesn't.  But suppose we get an elog(ERROR) inside
 a PG_TRY block when there is no outermost longjmp catcher.   elog.c
 will think it should longjmp, and that will eventually lead to
 executing
 
 #define PG_RE_THROW()  \
   siglongjmp(*PG_exception_stack, 1)
 
 with PG_exception_stack = NULL; which seems entirely likely to cause
 a stack smash of gruesome dimensions.  What's more, nothing would have
 been printed to the postmaster log beforehand, agreeing with observation.

I agree that that would be a bug and we should fix it, but I don't think
it explains the problem we're seeing because there is no PG_TRY block
in the autovac startup code that I can see :-(

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] strange buildfarm failures

2007-05-02 Thread Tom Lane
Alvaro Herrera <[EMAIL PROTECTED]> writes:
 I agree that that would be a bug and we should fix it, but I don't think
 it explains the problem we're seeing because there is no PG_TRY block
 in the autovac startup code that I can see :-(

I'm wondering if there is some code path that invokes a PG_TRY deep in
the bowels of the system.  Anyway, I'll go fix this, and we should know
soon enough if it changes the buildfarm behavior.

regards, tom lane



Re: [HACKERS] strange buildfarm failures

2007-05-02 Thread Alvaro Herrera
Tom Lane wrote:
 Alvaro Herrera <[EMAIL PROTECTED]> writes:
  I agree that that would be a bug and we should fix it, but I don't think
  it explains the problem we're seeing because there is no PG_TRY block
  in the autovac startup code that I can see :-(
 
 I'm wondering if there is some code path that invokes a PG_TRY deep in
 the bowels of the system.

Well, I checked all the bowels involved in autovacuum startup.

 Anyway, I'll go fix this, and we should know soon enough if it changes
 the buildfarm behavior.

Agreed.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] strange buildfarm failures

2007-05-02 Thread Alvaro Herrera
Alvaro Herrera wrote:
 Tom Lane wrote:
  Alvaro Herrera <[EMAIL PROTECTED]> writes:
   I agree that that would be a bug and we should fix it, but I don't think
   it explains the problem we're seeing because there is no PG_TRY block
   in the autovac startup code that I can see :-(
  
  I'm wondering if there is some code path that invokes a PG_TRY deep in
  the bowels of the system.
 
 Well, I checked all the bowels involved in autovacuum startup.

Huh, hang on ... there is one caller, which is to set client_encoding
(call_string_assign_hook uses a PG_TRY block), but it is called *after*
the sigsetjmp block -- in InitPostgres.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] strange buildfarm failures

2007-05-02 Thread Tom Lane
Alvaro Herrera <[EMAIL PROTECTED]> writes:
 Tom Lane wrote:
 I'm wondering if there is some code path that invokes a PG_TRY deep in
 the bowels of the system.

 Huh, hang on ... there is one caller, which is to set client_encoding
 (call_string_assign_hook uses a PG_TRY block), but it is called *after*
 the sigsetjmp block -- in InitPostgres.

While testing the PG_RE_THROW problem I noted that what I get here is
a SIGSEGV crash, rather than SIGABRT as seen on Stefan's machines, so
that's another hint that this may be unrelated.  Still, it's clearly
at risk of causing a problem as more PG_TRY's get added to the code,
so I'm going to fix it anyway.

regards, tom lane



Re: [HACKERS] strange buildfarm failures

2007-05-02 Thread Alvaro Herrera
Alvaro Herrera wrote:
 Stefan Kaltenbrunner wrote:
 
  well - i now have a core file but it does not seem to be much worth
  except to prove that autovacuum seems to be the culprit:
  
  Core was generated by `postgres: autovacuum worker process                  '.
  Program terminated with signal 6, Aborted.
  
  [...]
  
  #0  0x0ed9 in ?? ()
  warning: GDB can't find the start of the function at 0xed9.

I just noticed an ugly bug in the worker code which I'm fixing.  I think
this one would also throw SIGSEGV, not SIGABRT.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] strange buildfarm failures

2007-05-02 Thread Alvaro Herrera
Alvaro Herrera wrote:
 Alvaro Herrera wrote:
  Stefan Kaltenbrunner wrote:
  
   well - i now have a core file but it does not seem to be much worth
   except to prove that autovacuum seems to be the culprit:
   
   Core was generated by `postgres: autovacuum worker process                  '.
   Program terminated with signal 6, Aborted.
   
   [...]
   
   #0  0x0ed9 in ?? ()
   warning: GDB can't find the start of the function at 0xed9.
 
 I just noticed an ugly bug in the worker code which I'm fixing.  I think
 this one would also throw SIGSEGV, not SIGABRT.

Nailed it -- this is the actual bug that causes the abort.  But I am
surprised that it doesn't print the error message on Stefan's machine;
here it outputs


TRAP: FailedAssertion("!(((unsigned long) (elem)) >= ((unsigned long) ShmemBase))", File: "/pgsql/source/00head/src/backend/storage/ipc/shmqueue.c", Line: 107)
16496 2007-05-02 11:30:31 CLT DEBUG:  server process (PID 16540) was terminated by signal 6: Aborted
16496 2007-05-02 11:30:31 CLT LOG:  server process (PID 16540) was terminated by signal 6: Aborted
16496 2007-05-02 11:30:31 CLT LOG:  terminating any other active server processes
16496 2007-05-02 11:30:31 CLT DEBUG:  sending SIGQUIT to process 16541
16496 2007-05-02 11:30:31 CLT DEBUG:  sending SIGQUIT to process 16498
16496 2007-05-02 11:30:31 CLT DEBUG:  sending SIGQUIT to process 16500
16496 2007-05-02 11:30:31 CLT DEBUG:  sending SIGQUIT to process 16499
16541 2007-05-02 11:30:33 CLT WARNING:  terminating connection because of crash of another server process


Maybe stderr is going somewhere else?  That would be strange, I think.

I'll commit the fix shortly; attached.

-- 
Alvaro Herrera http://www.flickr.com/photos/alvherre/
"The first law of live demonstrations is: don't try to use the system.
Write a script that touches nothing, so it can't cause damage."  (Jakob Nielsen)
Index: src/backend/postmaster/autovacuum.c
===================================================================
RCS file: /home/alvherre/Code/cvs/pgsql/src/backend/postmaster/autovacuum.c,v
retrieving revision 1.42
diff -c -p -r1.42 autovacuum.c
*** src/backend/postmaster/autovacuum.c	18 Apr 2007 16:44:18 -	1.42
--- src/backend/postmaster/autovacuum.c	2 May 2007 15:25:27 -
*** AutoVacWorkerMain(int argc, char *argv[]
*** 1407,1431 ****
  	 * Get the info about the database we're going to work on.
  	 */
  	LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
! 	MyWorkerInfo = (WorkerInfo) MAKE_PTR(AutoVacuumShmem->av_startingWorker);
! 	dbid = MyWorkerInfo->wi_dboid;
! 	MyWorkerInfo->wi_workerpid = MyProcPid;
! 
! 	/* insert into the running list */
! 	SHMQueueInsertBefore(&AutoVacuumShmem->av_runningWorkers, 
! 						 &MyWorkerInfo->wi_links);
  	/*
! 	 * remove from the starting pointer, so that the launcher can start a new
! 	 * worker if required
  	 */
! 	AutoVacuumShmem->av_startingWorker = INVALID_OFFSET;
! 	LWLockRelease(AutovacuumLock);
  
! 	on_shmem_exit(FreeWorkerInfo, 0);
  
! 	/* wake up the launcher */
! 	if (AutoVacuumShmem->av_launcherpid != 0)
! 		kill(AutoVacuumShmem->av_launcherpid, SIGUSR1);
  
  	if (OidIsValid(dbid))
  	{
--- 1407,1442 ----
  	 * Get the info about the database we're going to work on.
  	 */
  	LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
! 
  	/*
! 	 * beware of startingWorker being INVALID; this could happen if the
! 	 * launcher thinks we're taking too long to start.
  	 */
! 	if (AutoVacuumShmem->av_startingWorker != INVALID_OFFSET)
! 	{
! 		MyWorkerInfo = (WorkerInfo) MAKE_PTR(AutoVacuumShmem->av_startingWorker);
! 		dbid = MyWorkerInfo->wi_dboid;
! 		MyWorkerInfo->wi_workerpid = MyProcPid;
! 
! 		/* insert into the running list */
! 		SHMQueueInsertBefore(&AutoVacuumShmem->av_runningWorkers, 
! 			 &MyWorkerInfo->wi_links);
! 		/*
! 		 * remove from the starting pointer, so that the launcher can start a new
! 		 * worker if required
! 		 */
! 		AutoVacuumShmem->av_startingWorker = INVALID_OFFSET;
! 		LWLockRelease(AutovacuumLock);
  
! 		on_shmem_exit(FreeWorkerInfo, 0);
  
! 		/* wake up the launcher */
! 		if (AutoVacuumShmem->av_launcherpid != 0)
! 			kill(AutoVacuumShmem->av_launcherpid, SIGUSR1);
! 	}
! 	else
! 		/* no worker entry for me, go away */
! 		LWLockRelease(AutovacuumLock);
  
  	if (OidIsValid(dbid))
  	{
*** AutoVacWorkerMain(int argc, char *argv[]
*** 1466,1473 
  	}
  
  	/*
! 	 * FIXME -- we need to notify the launcher when we are gone.  But this
! 	 * should be done after our PGPROC is released, in ProcKill.
  	 */
  
  	/* All done, go away */
--- 1477,1484 ----
  	}
  
  	/*
! 	 * The launcher will be notified of my death in ProcKill, *if* we managed
! 	 * to get a worker slot at all
  	 */
  
  	/* All done, go away */



Re: [HACKERS] strange buildfarm failures

2007-05-02 Thread Tom Lane
Alvaro Herrera <[EMAIL PROTECTED]> writes:
 Nailed it -- this is the actual bug that causes the abort.  But I am
 surprised that it doesn't print the error message on Stefan's machine;

Hm, maybe we need an fflush(stderr) in ExceptionalCondition?

regards, tom lane



Re: [HACKERS] strange buildfarm failures

2007-05-02 Thread Gregory Stark
Tom Lane <[EMAIL PROTECTED]> writes:

 Alvaro Herrera <[EMAIL PROTECTED]> writes:
 Nailed it -- this is the actual bug that causes the abort.  But I am
 surprised that it doesn't print the error message on Stefan's machine;

 Hm, maybe we need an fflush(stderr) in ExceptionalCondition?

stderr is supposed to be line-buffered by default. Couldn't hurt I suppose.



-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] strange buildfarm failures

2007-05-01 Thread Alvaro Herrera
Tom Lane wrote:
 Alvaro Herrera <[EMAIL PROTECTED]> writes:
  Oh, another thing that I think may be happening is that the stack is
  restored in longjmp, so it is trying to report an error elsewhere but
  it crashes because something got overwritten or something; i.e. a
  bug in the error recovery code.
 
 Hm, something trying to elog before the setjmp's been executed?
 Although I thought it was coded so that elog.c would just proc_exit
 if there was noplace to longjmp to.  A mistake here might explain
 the lack of any message in the postmaster log: if elog.c thinks it
 should longjmp then it doesn't print the message first.

Well, there seems to be plenty of code which is happy to elog(ERROR)
before the longjmp target block has been set; for example
InitFileAccess(), which is called on BaseInit(), which comes before
sigsetjmp() both on postgres.c and autovacuum.c.  (This particular case
is elog(FATAL) not ERROR however).  mdinit() also does some memory
allocation which could certainly fail.

I'm wondering if it wouldn't be more robust to define a longjmp target
block before calling BaseInit(), and have it exit cleanly in case of
failure (which is what you say elog.c should be doing if there is no
target block).

errstart() checks whether PG_exception_stack is NULL.  Now, this
symbol is defined in elog.c and initialized to NULL, but I wonder whether a
child process inherits the value that the postmaster set, or whether it comes
back to NULL.  The backend would not inherit any of the values the postmaster
set if the latter were the case, so I'm assuming that PG_exception_stack
stays as the postmaster left it.  I wonder what happens if the child
process finds that this is an invalid point to jump to?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] strange buildfarm failures

2007-05-01 Thread Tom Lane
Alvaro Herrera <[EMAIL PROTECTED]> writes:
 I'm wondering if it wouldn't be more robust to define a longjmp target
 block before calling BaseInit(), and have it exit cleanly in case of
 failure (which is what you say elog.c should be doing if there is no
 target block).

No, because elog is already supposed to deal with that case; and does,
every time you connect to a bad database name for example.  If it's
failing, the question to answer is why.  

 errstart() checks whether PG_exception_stack is NULL.  Now, this
 symbol is defined in elog.c and initialized to NULL, but I wonder whether a
 child process inherits the value that the postmaster set, or whether it comes
 back to NULL.

Hmm ... I was about to say that the postmaster never sets
PG_exception_stack, but maybe an error out of a PG_TRY/PG_RE_THROW
could do it?  Does the postmaster ever execute PG_TRY?  (And if so,
should it?  The postmaster really ought not be dealing in anything
very hairy --- it should be passing such work off to children.)

regards, tom lane



Re: [HACKERS] strange buildfarm failures

2007-05-01 Thread Tom Lane
I wrote:
 Hmm ... I was about to say that the postmaster never sets
 PG_exception_stack, but maybe an error out of a PG_TRY/PG_RE_THROW
 could do it?  Does the postmaster ever execute PG_TRY?

Doh, I bet that's it, and it's not the postmaster that's at issue
but PG_TRY blocks executed during subprocess startup.  Inheritance
of a PG_exception_stack setting from the postmaster could only happen if
the postmaster were to fork() within a PG_TRY block, which I think we
can safely say it doesn't.  But suppose we get an elog(ERROR) inside
a PG_TRY block when there is no outermost longjmp catcher.   elog.c
will think it should longjmp, and that will eventually lead to
executing

#define PG_RE_THROW()  \
siglongjmp(*PG_exception_stack, 1)

with PG_exception_stack = NULL; which seems entirely likely to cause
a stack smash of gruesome dimensions.  What's more, nothing would have
been printed to the postmaster log beforehand, agreeing with observation.

Personally I think the correct fix is to make PG_RE_THROW deal sanely
with the case of PG_exception_stack = NULL, that is, turn it into an
elog(FATAL) with the original error text.  If you try to fix it by
making a setjmp occur earlier, there's still the problem of what
about PG_TRY earlier than that?

This might be more code than we want in a macro, though, especially
since this is presumably not a performance-critical path.  I'm tempted
to change the macro to just call a pg_re_throw() subroutine.  Thoughts?

regards, tom lane



Re: [HACKERS] strange buildfarm failures

2007-04-29 Thread Alvaro Herrera
Stefan Kaltenbrunner wrote:

 well - i now have a core file but it does not seem to be much worth
 except to prove that autovacuum seems to be the culprit:
 
 Core was generated by `postgres: autovacuum worker process                  '.
 Program terminated with signal 6, Aborted.
 
 [...]
 
 #0  0x0ed9 in ?? ()
 warning: GDB can't find the start of the function at 0xed9.

Interesting.  Notice how it doesn't have the database name in the ps
display.  This means it must have crashed between the initial
init_ps_display and the set_ps_display call just before starting to
vacuum.  So the bug is probably in the startup code; probably the code
dealing with the PGPROC, which is the newest and weirdest stuff.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] strange buildfarm failures

2007-04-29 Thread Alvaro Herrera
Alvaro Herrera wrote:
 Stefan Kaltenbrunner wrote:
 
  well - i now have a core file but it does not seem to be much worth
  except to prove that autovacuum seems to be the culprit:
  
  Core was generated by `postgres: autovacuum worker process                  '.
  Program terminated with signal 6, Aborted.
  
  [...]
  
  #0  0x0ed9 in ?? ()
  warning: GDB can't find the start of the function at 0xed9.
 
 Interesting.  Notice how it doesn't have the database name in the ps
 display.  This means it must have crashed between the initial
 init_ps_display and the set_ps_display call just before starting to
 vacuum.  So the bug is probably in the startup code; probably the code
 dealing with the PGPROC, which is the newest and weirdest stuff.

Oh, another thing that I think may be happening is that the stack is
restored in longjmp, so it is trying to report an error elsewhere but
it crashes because something got overwritten or something; i.e. a
bug in the error recovery code.  I don't know how feasible this is or
even if it makes sense (would longjmp() restore the ps display?), but we
had similar, very hard to debug errors in Mammoth Replicator, which is
why I'm mentioning it in case it rings a bell.

-- 
Alvaro Herrera  Developer, http://www.PostgreSQL.org/
The only difference is that Saddam would kill you on private, where the
Americans will kill you in public (Mohammad Saleh, 39, a building contractor)



Re: [HACKERS] strange buildfarm failures

2007-04-29 Thread Tom Lane
Alvaro Herrera <[EMAIL PROTECTED]> writes:
 Oh, another thing that I think may be happening is that the stack is
 restored in longjmp, so it is trying to report an error elsewhere but
 it crashes because something got overwritten or something; i.e. a
 bug in the error recovery code.

Hm, something trying to elog before the setjmp's been executed?
Although I thought it was coded so that elog.c would just proc_exit
if there was noplace to longjmp to.  A mistake here might explain
the lack of any message in the postmaster log: if elog.c thinks it
should longjmp then it doesn't print the message first.

regards, tom lane



Re: [HACKERS] strange buildfarm failures

2007-04-28 Thread Stefan Kaltenbrunner
Alvaro Herrera wrote:
 Tom Lane wrote:
 Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes:
 Stefan Kaltenbrunner wrote:
 two of my buildfarm members had different but pretty weird looking
 failures lately:
 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03
 and

 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02

 any ideas on what might causing those ?
 
 Just for the record, quagga and emu failures don't seem related to the
 report below.  They don't crash; the regression.diffs contains data that
 suggests that there may be data corruption of some sort.
 
 INSERT INTO INET_TBL (c, i) VALUES ('192.168.1.2/30', '192.168.1.226');
 ERROR:  invalid cidr value: %{
 
 This doesn't seem to make much sense.

no idea - but quagga and emu seem to have similar failures (in the sense
that they don't make any sense) and I have no reason to believe that the
hardware is at fault.

 
 
 lionfish just failed too:
 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-25%2005:30:09
 And had a similar failure a few days ago.  The curious thing is that
 what we get in the postmaster log is

 LOG:  server process (PID 23405) was terminated by signal 6: Aborted
 LOG:  terminating any other active server processes

 You would think SIGABRT would come from an assertion failure, but
 there's no preceding assertion message in the log.  The other
 characteristic of these crashes is that *all* of the failing regression
 instances report "terminating connection because of crash of another
 server process", which suggests strongly that the crash was in an
 autovacuum process (if it were bgwriter or stats collector the
 postmaster would've said so).  So I think the recent autovac patches
 are at fault.  I spent a bit of time trolling for a spot where the code
 might abort() without having printed anything, but didn't find one.
 
 Hmm.  I kept an eye on the buildfarm for a few days, but saw nothing
 that could be connected to autovacuum so I neglected it.
 
 This is the other failure:
 
 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-20%2005:30:14
 
 It shows the same pattern.  I am baffled -- I don't understand how it
 can die without reporting the error.
 
 Apparently it crashes rather frequently, so it shouldn't be too
 difficult to reproduce on manual runs.  If we could get it to run with a
 higher debug level, it might prove helpful to further pinpoint the
 problem.
 
 The core file would be much better obviously (first and foremost to
 confirm that it's autovacuum that's crashing ... )


well - i now have a core file but it does not seem to be much worth
except to prove that autovacuum seems to be the culprit:

Core was generated by `postgres: autovacuum worker process                  '.
Program terminated with signal 6, Aborted.

[...]

#0  0x0ed9 in ?? ()
warning: GDB can't find the start of the function at 0xed9.

GDB is unable to find the start of the function at 0xed9
and thus can't determine the size of that function's stack frame.
This means that GDB may be unable to access that stack frame, or
the frames below it.
This problem is most likely caused by an invalid program counter or
stack pointer.
However, if you think GDB should simply search farther back
from 0xed9 for code which looks like the beginning of a
function, you can increase the range of the search using the `set
heuristic-fence-post' command.


Stefan



Re: [HACKERS] strange buildfarm failures

2007-04-25 Thread Stefan Kaltenbrunner
Stefan Kaltenbrunner wrote:
 two of my buildfarm members had different but pretty weird looking
 failures lately:
 
 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03
 
 and
 
 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02
 
 
 any ideas on what might causing those ?

lionfish just failed too:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-25%2005:30:09


Stefan



Re: [HACKERS] strange buildfarm failures

2007-04-25 Thread Tom Lane
Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes:
 Stefan Kaltenbrunner wrote:
 two of my buildfarm members had different but pretty weird looking
 failures lately:
 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03
 and
 
 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02
 
 any ideas on what might causing those ?

 lionfish just failed too:

 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-25%2005:30:09

And had a similar failure a few days ago.  The curious thing is that
what we get in the postmaster log is

LOG:  server process (PID 23405) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes

You would think SIGABRT would come from an assertion failure, but
there's no preceding assertion message in the log.  The other
characteristic of these crashes is that *all* of the failing regression
instances report "terminating connection because of crash of another
server process", which suggests strongly that the crash was in an
autovacuum process (if it were bgwriter or stats collector the
postmaster would've said so).  So I think the recent autovac patches
are at fault.  I spent a bit of time trolling for a spot where the code
might abort() without having printed anything, but didn't find one.

If any of the buildfarm owners can get a stack trace from the core dump
of one of these events, it'd be mighty helpful.

regards, tom lane



Re: [HACKERS] strange buildfarm failures

2007-04-25 Thread Alvaro Herrera
Tom Lane wrote:
 Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes:
  Stefan Kaltenbrunner wrote:
  two of my buildfarm members had different but pretty weird looking
  failures lately:
  http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03
  and
  
  http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02
  
  any ideas on what might causing those ?

Just for the record, quagga and emu failures don't seem related to the
report below.  They don't crash; the regression.diffs contains data that
suggests that there may be data corruption of some sort.

INSERT INTO INET_TBL (c, i) VALUES ('192.168.1.2/30', '192.168.1.226');
ERROR:  invalid cidr value: %{

This doesn't seem to make much sense.


  lionfish just failed too:
 
  http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-25%2005:30:09
 
 And had a similar failure a few days ago.  The curious thing is that
 what we get in the postmaster log is
 
 LOG:  server process (PID 23405) was terminated by signal 6: Aborted
 LOG:  terminating any other active server processes
 
 You would think SIGABRT would come from an assertion failure, but
 there's no preceding assertion message in the log.  The other
 characteristic of these crashes is that *all* of the failing regression
 instances report "terminating connection because of crash of another
 server process", which suggests strongly that the crash was in an
 autovacuum process (if it were bgwriter or stats collector the
 postmaster would've said so).  So I think the recent autovac patches
 are at fault.  I spent a bit of time trolling for a spot where the code
 might abort() without having printed anything, but didn't find one.

Hmm.  I kept an eye on the buildfarm for a few days, but saw nothing
that could be connected to autovacuum so I neglected it.

This is the other failure:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-20%2005:30:14

It shows the same pattern.  I am baffled -- I don't understand how it
can die without reporting the error.

Apparently it crashes rather frequently, so it shouldn't be too
difficult to reproduce on manual runs.  If we could get it to run with a
higher debug level, it might prove helpful to further pinpoint the
problem.

The core file would be much better obviously (first and foremost to
confirm that it's autovacuum that's crashing ... )

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] strange buildfarm failures

2007-04-25 Thread Stefan Kaltenbrunner
Alvaro Herrera wrote:
 Tom Lane wrote:
 Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes:
 Stefan Kaltenbrunner wrote:
 two of my buildfarm members had different but pretty weird looking
 failures lately:
 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03
 and

 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02

 any ideas on what might causing those ?
 
 Just for the record, quagga and emu failures don't seem related to the
 report below.  They don't crash; the regression.diffs contains data that
 suggests that there may be data corruption of some sort.
 
 INSERT INTO INET_TBL (c, i) VALUES ('192.168.1.2/30', '192.168.1.226');
 ERROR:  invalid cidr value: %{
 
 This doesn't seem to make much sense.

Yeah - on further reflection it looks like the failures from emu and
quagga are unrelated to the issue lionfish is experiencing.

 
 
 lionfish just failed too:
 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-25%2005:30:09
 And had a similar failure a few days ago.  The curious thing is that
 what we get in the postmaster log is

 LOG:  server process (PID 23405) was terminated by signal 6: Aborted
 LOG:  terminating any other active server processes

 You would think SIGABRT would come from an assertion failure, but
 there's no preceding assertion message in the log.  The other
 characteristic of these crashes is that *all* of the failing regression
 instances report "terminating connection because of crash of another
 server process", which suggests strongly that the crash was in an
 autovacuum process (if it were bgwriter or stats collector the
 postmaster would've said so).  So I think the recent autovac patches
 are at fault.  I spent a bit of time trolling for a spot where the code
 might abort() without having printed anything, but didn't find one.
 
 Hmm.  I kept an eye on the buildfarm for a few days, but saw nothing
 that could be connected to autovacuum so I neglected it.
 
 This is the other failure:
 
  http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-20%2005:30:14
 
 It shows the same pattern.  I am baffled -- I don't understand how it
 can die without reporting the error.

I should have mentioned that initially - but I think the failure from
2007-04-20 is not related at all.
The failure from 2007-04-20 was very likely caused by the kernel
running totally out of memory (lionfish is a very resource-starved box
with only 48MB of RAM and, at that time, 128MB of swap - do we have a recent
patch that increases memory usage quite a lot?).
I immediately added another 128MB of swap after that, and I don't think
the failure from yesterday is the same (at least there are no kernel
logs that indicate a similar issue).
 
 Apparently it crashes rather frequently, so it shouldn't be too
 difficult to reproduce on manual runs.  If we could get it to run with a
 higher debug level, it might prove helpful to further pinpoint the
 problem.

a manual run of the buildfarm script takes ~4.5 hours on lionfish ;-)

 
 The core file would be much better obviously (first and foremost to
 confirm that it's autovacuum that's crashing ... )

I will see what I can come up with ...


Stefan



[HACKERS] strange buildfarm failures

2007-04-24 Thread Stefan Kaltenbrunner
two of my buildfarm members had different but pretty weird looking
failures lately:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03

and

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02


any ideas on what might causing those ?


Stefan
