Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl

2007-01-29 Thread Stefan Kaltenbrunner
Tom Lane wrote:
> Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes:
>> this patch definitely changes behaviour but not actually for the better :-(
> 
> Oh well, it was worth a try.  At this point I think we have to suppose
> this is a Tcl bug and not our fault.  Can you reproduce the problem in
> bare "tclsh"?  Try
> 
>   $ tclsh
>   % interp create
>   interp0
>   % interp create -safe
>   interp1
>   % exit
>   $
> 
> If tclsh doesn't quit when told then it's easy to file (but you might
> want to try the latest tcl version first --- they're up to 8.4.14)

hmm the above sequence just works fine on that box - will experiment
with a more recent version as time permits (there is no other version
available in the debian repo right now which makes this a bit more
difficult).

Stefan

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl

2007-01-29 Thread Stefan Kaltenbrunner
Alvaro Herrera wrote:
> Stefan Kaltenbrunner wrote:
> 
>> backtrace for 7235:
>>
>> (gdb) bt
> 
> Please do this in GDB:
> 
> thread apply all bt
> 
> (or maybe it is
> threads apply all bt)
> 
> This'll give you backtraces for all threads in the process.

sorry forgot to mention that - the backtrace for the other one is an
endless loop of:

Thread 2 (Thread 32769 (LWP 7267)):
#0  0x4018062c in poll () from /lib/libc.so.6
#1  0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#2  0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#3  0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#4  0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#5  0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#6  0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#7  0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#8  0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#9  0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#10 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#11 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#12 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0


Stefan



Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl

2007-01-29 Thread Tom Lane
Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes:
> this patch definitely changes behaviour but not actually for the better :-(

Oh well, it was worth a try.  At this point I think we have to suppose
this is a Tcl bug and not our fault.  Can you reproduce the problem in
bare "tclsh"?  Try

$ tclsh
% interp create
interp0
% interp create -safe
interp1
% exit
$

If tclsh doesn't quit when told then it's easy to file (but you might
want to try the latest tcl version first --- they're up to 8.4.14)

regards, tom lane



Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl

2007-01-29 Thread Alvaro Herrera
Stefan Kaltenbrunner wrote:

> backtrace for 7235:
> 
> (gdb) bt

Please do this in GDB:

thread apply all bt

(or maybe it is
threads apply all bt)

This'll give you backtraces for all threads in the process.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl

2007-01-29 Thread Stefan Kaltenbrunner
Tom Lane wrote:
> I wrote:
>> One possibility for fixing it is that maybe we should be making an
>> effort to execute Tcl_Finalize() before exiting the backend.  If so,
>> having pltcl set up an on_proc_exit callback to do it would be the
>> appropriate thing.  This is all speculation though.
> 
> Just for grins I tried this, and I can see by strace'ing that it changes
> the process-shutdown-time behavior quite a lot: the secondary thread now
> exits first, apparently after being told to via a message from the
> primary.  So I think this might indeed be something good to do.  Would
> you try the attached patch and see if it changes the behavior on your
> systems?  (This patch is very ugly and will draw compiler warnings, but
> don't worry about that yet.)

this patch definitely changes behaviour, but not actually for the better :-(

after running the tcl regression tests and exiting psql I'm left with
two(!) backends (and I'm unable to stop the postmaster short of using -m
immediate):


UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
1000      7191     1  7191  0    1 19:02 pts/0    00:00:03 /home/mastermind/pginst/bin/postgres -D /home/mastermind/data
1000      7202  7191  7202  0    1 19:02 ?        00:00:00 postgres: writer process
1000      7203  7191  7203  0    1 19:02 ?        00:00:00 postgres: stats collector process
1000      7235  7191  7235  0    1 19:06 ?        00:00:01 postgres: mastermind pl_regression [local] idle
1000      7267  7235  7267  0    1 19:08 ?        00:00:00 postgres: mastermind pl_regression [local] idle

tracefile for 7235:

http://www.kaltenbrunner.cc/files/strace2.out

backtrace for 7235:

(gdb) bt
#0  0x41f04d80 in __pthread_sigsuspend () from /lib/libpthread.so.0
#1  0x41f03a7c in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0
#2  0x41f05d18 in pthread_key_delete () from /lib/libpthread.so.0
#3  0x41edbe1c in TclpFinalizeThreadDataKey () from /usr/lib/libtcl8.4.so.0
#4  0x41ec96dc in TclFinalizeSynchronization () from /usr/lib/libtcl8.4.so.0
#5  0x41e92040 in Tcl_Finalize () from /usr/lib/libtcl8.4.so.0
#6  0x001ab204 in proc_exit (code=0) at ipc.c:109
#7  0x001be3a8 in PostgresMain (argc=1074783216, argv=, username=0x0) at postgres.c:3638
#8  0x0018fb98 in ServerLoop () at postmaster.c:2953
#9  0x00190898 in PostmasterMain (argc=3, argv=0x36e5c0) at postmaster.c:963
#10 0x001468b0 in main (argc=3, argv=) at main.c:188


Stefan



Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl

2007-01-28 Thread Tom Lane
I wrote:
> One possibility for fixing it is that maybe we should be making an
> effort to execute Tcl_Finalize() before exiting the backend.  If so,
> having pltcl set up an on_proc_exit callback to do it would be the
> appropriate thing.  This is all speculation though.

Just for grins I tried this, and I can see by strace'ing that it changes
the process-shutdown-time behavior quite a lot: the secondary thread now
exits first, apparently after being told to via a message from the
primary.  So I think this might indeed be something good to do.  Would
you try the attached patch and see if it changes the behavior on your
systems?  (This patch is very ugly and will draw compiler warnings, but
don't worry about that yet.)

regards, tom lane

Index: pltcl.c
===================================================================
RCS file: /cvsroot/pgsql/src/pl/tcl/pltcl.c,v
retrieving revision 1.108
diff -c -r1.108 pltcl.c
*** pltcl.c	4 Oct 2006 00:30:14 -0000	1.108
--- pltcl.c	28 Jan 2007 21:45:40 -0000
***************
*** 26,31 ****
--- 26,32 ----
  #include "fmgr.h"
  #include "nodes/makefuncs.h"
  #include "parser/parse_type.h"
+ #include "storage/ipc.h"
  #include "tcop/tcopprot.h"
  #include "utils/builtins.h"
  #include "utils/lsyscache.h"
***************
*** 245,250 ****
--- 246,253 ----
  	 ************************************************************/
  	if (!pltcl_be_init_done)
  	{
+ 		on_proc_exit(Tcl_Finalize, 0);
+ 
  		if (SPI_connect() != SPI_OK_CONNECT)
  			elog(ERROR, "SPI_connect failed");
  		pltcl_init_load_unknown(pltcl_norm_interp);



Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl

2007-01-28 Thread Tom Lane
Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes:
> FWIW - I removed --with-tcl from quagga's configuration about two weeks
> ago and it has not failed (for that reason) again. So the issue most
> definitely looks pltcl-related ...

I poked around a bit in the Tcl sources, and as far as I can tell it
will never call atexit() on a Linux platform.  So that theory isn't
right.  However, what I found out via strace'ing is that on my own
Linux machine, Tcl launches a second thread as soon as it's loaded.
This thread just does a select() and sits there.  I am not certain
but I think it must be the "notifier" facility seen in the Tcl sources.

What seems likely to me is that the "child process" we see in your ps
output is actually this notifier thread (can you confirm that it's a
thread via additional ps arguments?), and that for some reason the
shutdown of the notifier isn't working right, or maybe the notifier
is dying for some reason and then the main thread can't exit because
it expects the notifier to respond.

Can you "strace -f" the Tcl test and see what happens?  The
way I did it was

createdb pl_regression
createlang pltcl pl_regression
psql pl_regression

-- in another window, identify PID of connected backend, then
strace -f -p PID 2>strace.out

psql> \i .../pltcl_setup.sql
psql> \i .../pltcl_queries.sql
psql> \q

One possibility for fixing it is that maybe we should be making an
effort to execute Tcl_Finalize() before exiting the backend.  If so,
having pltcl set up an on_proc_exit callback to do it would be the
appropriate thing.  This is all speculation though.

regards, tom lane



Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl

2007-01-24 Thread Stefan Kaltenbrunner
Tom Lane wrote:
> Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes:
>> FWIW - I removed --with-tcl from quagga's configuration about two weeks
>> ago and it has not failed (for that reason) again. So the issue most
>> definitely looks pltcl-related ...
> 
> It sorta looks like Tcl might be installing an atexit() callback that is
> doing the Wrong Thing somehow.  What Tcl version do you have installed
> exactly, and with what configure options?  (The contents of tclConfig.sh
> should be reasonably complete info.)

version is 8.4.12-1.1 (current Debian etch package) and tclConfig.sh is at:

http://www.kaltenbrunner.cc/files/tclConfig.sh.txt


Stefan



Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl

2007-01-24 Thread Tom Lane
Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes:
> FWIW - I removed --with-tcl from quagga's configuration about two weeks
> ago and it has not failed (for that reason) again. So the issue most
> definitely looks pltcl-related ...

It sorta looks like Tcl might be installing an atexit() callback that is
doing the Wrong Thing somehow.  What Tcl version do you have installed
exactly, and with what configure options?  (The contents of tclConfig.sh
should be reasonably complete info.)

regards, tom lane



Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl

2007-01-24 Thread Stefan Kaltenbrunner
Stefan Kaltenbrunner wrote:
> Tom Lane wrote:
>> Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes:
>>> one of my new buildfarm boxes (a Debian/Etch based ARM box) is
>>> sometimes failing to stop the database during the regression tests:
>>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-01-08%2003:03:03
>>> this only seems to happen sometimes and only if --with-tcl is enabled on
>>> quagga.
>>> lionfish (my mipsel box) is able to trigger that on every build if I
>>> enable --with-tcl but it is nearly impossible to debug it there because
>>> of the low amount of memory and diskspace it has.
>> Hm, could pl/tcl somehow be preventing the backend from exiting once
>> it's run any pl/tcl stuff?  I have no idea why though, and even less
>> why it wouldn't be repeatable. 
>>
>>> After the stopdb failure we still have those processes running:
>>> pgbuild   3488  0.0  2.4  43640  6300 ?        Ss   06:15   0:01
>>> postgres: pgbuild pl_regression [local] idle
>> Can you get a stack trace from this process?
> 
> (gdb) bt
> #0  0x406b9d80 in __pthread_sigsuspend () from /lib/libpthread.so.0
> #1  0x406b8a7c in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0
> #2  0x406b91f8 in pthread_onexit_process () from /lib/libpthread.so.0
> #3  0x40438658 in exit () from /lib/libc.so.6
> #4  0x40438658 in exit () from /lib/libc.so.6
> Previous frame identical to this frame (corrupt stack?)
> 
> 
> 
>>> pgbuild   3489  0.0  0.0      0     0 ?        Z    06:15   0:00
>>> [postgres] 
>> This is a bit odd ... if that process is a direct child of the
>> postmaster it should have been reaped promptly.  Could it be a child
>> of the other backend?  If so, why was it started?  Please try the
>> ps again with whatever switch it needs to list parent process ID.
> 
> looks like you are right - the defunct 3489 seems to be a child of 3488:
> 
>  PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
>     1  3389 18341 18341 ?           -1 S     1001   0:03 /home/pgbuild/pgbuildfarm/HEAD/inst/bin/postgres -D data
>  3389  3391  3391  3391 ?           -1 Ss    1001   0:00 postgres: writer process
>  3389  3392  3392  3392 ?           -1 Ss    1001   0:00 postgres: stats collector process
>  3389  3488  3488  3488 ?           -1 Ss    1001   0:01 postgres: pgbuild pl_regression [local] idle
>  3488  3489  3488  3488 ?           -1 Z     1001   0:00 [postgres]
> 

FWIW - I removed --with-tcl from quagga's configuration about two weeks
ago and it has not failed (for that reason) again. So the issue most
definitely looks pltcl-related ...


Stefan



Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl

2007-01-08 Thread Stefan Kaltenbrunner
Tom Lane wrote:
> Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes:
>> one of my new buildfarm boxes (a Debian/Etch based ARM box) is
>> sometimes failing to stop the database during the regression tests:
> 
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-01-08%2003:03:03
> 
>> this only seems to happen sometimes and only if --with-tcl is enabled on
>> quagga.
> 
>> lionfish (my mipsel box) is able to trigger that on every build if I
>> enable --with-tcl but it is nearly impossible to debug it there because
>> of the low amount of memory and diskspace it has.
> 
> Hm, could pl/tcl somehow be preventing the backend from exiting once
> it's run any pl/tcl stuff?  I have no idea why though, and even less
> why it wouldn't be repeatable. 
> 
>> After the stopdb failure we still have those processes running:
>> pgbuild   3488  0.0  2.4  43640  6300 ?        Ss   06:15   0:01
>> postgres: pgbuild pl_regression [local] idle
> 
> Can you get a stack trace from this process?

(gdb) bt
#0  0x406b9d80 in __pthread_sigsuspend () from /lib/libpthread.so.0
#1  0x406b8a7c in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0
#2  0x406b91f8 in pthread_onexit_process () from /lib/libpthread.so.0
#3  0x40438658 in exit () from /lib/libc.so.6
#4  0x40438658 in exit () from /lib/libc.so.6
Previous frame identical to this frame (corrupt stack?)



> 
>> pgbuild   3489  0.0  0.0      0     0 ?        Z    06:15   0:00
>> [postgres] 
> 
> This is a bit odd ... if that process is a direct child of the
> postmaster it should have been reaped promptly.  Could it be a child
> of the other backend?  If so, why was it started?  Please try the
> ps again with whatever switch it needs to list parent process ID.

looks like you are right - the defunct 3489 seems to be a child of 3488:

 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    1  3389 18341 18341 ?           -1 S     1001   0:03 /home/pgbuild/pgbuildfarm/HEAD/inst/bin/postgres -D data
 3389  3391  3391  3391 ?           -1 Ss    1001   0:00 postgres: writer process
 3389  3392  3392  3392 ?           -1 Ss    1001   0:00 postgres: stats collector process
 3389  3488  3488  3488 ?           -1 Ss    1001   0:01 postgres: pgbuild pl_regression [local] idle
 3488  3489  3488  3488 ?           -1 Z     1001   0:00 [postgres]



Stefan




Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl

2007-01-08 Thread Tom Lane
Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes:
> one of my new buildfarm boxes (a Debian/Etch based ARM box) is
> sometimes failing to stop the database during the regression tests:

> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-01-08%2003:03:03

> this only seems to happen sometimes and only if --with-tcl is enabled on
> quagga.

> lionfish (my mipsel box) is able to trigger that on every build if I
> enable --with-tcl but it is nearly impossible to debug it there because
> of the low amount of memory and diskspace it has.

Hm, could pl/tcl somehow be preventing the backend from exiting once
it's run any pl/tcl stuff?  I have no idea why though, and even less
why it wouldn't be repeatable. 

> After the stopdb failure we still have those processes running:
> pgbuild   3488  0.0  2.4  43640  6300 ?        Ss   06:15   0:01
> postgres: pgbuild pl_regression [local] idle

Can you get a stack trace from this process?

> pgbuild   3489  0.0  0.0      0     0 ?        Z    06:15   0:00
> [postgres] 

This is a bit odd ... if that process is a direct child of the
postmaster it should have been reaped promptly.  Could it be a child
of the other backend?  If so, why was it started?  Please try the
ps again with whatever switch it needs to list parent process ID.

regards, tom lane
