Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl
Tom Lane wrote: > Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes: >> this patch definitly changes behaviour but not actually for the better :-( > > Oh well, it was worth a try. At this point I think we have to suppose > this is a Tcl bug and not our fault. Can you reproduce the problem in > bare "tclsh"? Try > > $ tclsh > % interp create > interp0 > % interp create -safe > interp1 > % exit > $ > > If tclsh doesn't quit when told then it's easy to file (but you might > want to try the latest tcl version first --- they're up to 8.4.14) hmm the above sequence just works fine on that box - will experiment with a more recent version as time permits (there is no other version available in the debian repo right now which makes this a bit more difficult). Stefan ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl
Alvaro Herrera wrote: > Stefan Kaltenbrunner wrote: > >> backtrace for 7235: >> >> (gdb) bt > > Please do this in GDB: > > thread apply all bt > > (or maybe it is > threads apply all bt) > > This'll give you backtraces for all threads in the process. sorry forgot to mention that - the backtrace for the other one is an endless loop of: Thread 2 (Thread 32769 (LWP 7267)): #0 0x4018062c in poll () from /lib/libc.so.6 #1 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #2 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #3 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #4 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #5 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #6 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #7 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #8 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #9 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #10 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #11 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #12 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 Stefan ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl
Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes: > this patch definitly changes behaviour but not actually for the better :-( Oh well, it was worth a try. At this point I think we have to suppose this is a Tcl bug and not our fault. Can you reproduce the problem in bare "tclsh"? Try $ tclsh % interp create interp0 % interp create -safe interp1 % exit $ If tclsh doesn't quit when told then it's easy to file (but you might want to try the latest tcl version first --- they're up to 8.4.14) regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl
Stefan Kaltenbrunner wrote: > backtrace for 7235: > > (gdb) bt Please do this in GDB: thread apply all bt (or maybe it is threads apply all bt) This'll give you backtraces for all threads in the process. -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl
Tom Lane wrote: > I wrote: >> One possibility for fixing it is that maybe we should be making an >> effort to execute Tcl_Finalize() before exiting the backend. If so, >> having pltcl set up an on_proc_exit callback to do it would be the >> appropriate thing. This is all speculation though. > > Just for grins I tried this, and I can see by strace'ing that it changes > the process-shutdown-time behavior quite a lot: the secondary thread now > exits first, apparently after being told to via a message from the > primary. So I think this might indeed be something good to do. Would > you try the attached patch and see if it changes the behavior on your > systems? (This patch is very ugly and will draw compiler warnings, but > don't worry about that yet.) this patch definitly changes behaviour but not actually for the better :-( after running the tcl regression tests and exiting psql I'm left with two(!) backends(and I'm unable to stop the postmaster short of using -m immedidate): UIDPID PPID LWP C NLWP STIME TTY TIME CMD 1000 7191 1 7191 01 19:02 pts/000:00:03 /home/mastermind/pginst/bin/postgres -D /home/mastermind/data 1000 7202 7191 7202 01 19:02 ?00:00:00 postgres: writer process 1000 7203 7191 7203 01 19:02 ?00:00:00 postgres: stats collector process 1000 7235 7191 7235 01 19:06 ?00:00:01 postgres: mastermind pl_regression [local] idle 1000 7267 7235 7267 01 19:08 ?00:00:00 postgres: mastermind pl_regression [local] idle tracefile for 7235: http://www.kaltenbrunner.cc/files/strace2.out backtrace for 7235: (gdb) bt #0 0x41f04d80 in __pthread_sigsuspend () from /lib/libpthread.so.0 #1 0x41f03a7c in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0 #2 0x41f05d18 in pthread_key_delete () from /lib/libpthread.so.0 #3 0x41edbe1c in TclpFinalizeThreadDataKey () from /usr/lib/libtcl8.4.so.0 #4 0x41ec96dc in TclFinalizeSynchronization () from /usr/lib/libtcl8.4.so.0 #5 0x41e92040 in Tcl_Finalize () from /usr/lib/libtcl8.4.so.0 #6 0x001ab204 in proc_exit (code=0) at ipc.c:109 #7 0x001be3a8 in PostgresMain (argc=1074783216, argv=, username=0x0) at postgres.c:3638 #8 0x0018fb98 in ServerLoop () at postmaster.c:2953 #9 0x00190898 in PostmasterMain (argc=3, argv=0x36e5c0) at postmaster.c:963 #10 0x001468b0 in main (argc=3, argv=) at main.c:188 Stefan ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl
I wrote: > One possibility for fixing it is that maybe we should be making an > effort to execute Tcl_Finalize() before exiting the backend. If so, > having pltcl set up an on_proc_exit callback to do it would be the > appropriate thing. This is all speculation though. Just for grins I tried this, and I can see by strace'ing that it changes the process-shutdown-time behavior quite a lot: the secondary thread now exits first, apparently after being told to via a message from the primary. So I think this might indeed be something good to do. Would you try the attached patch and see if it changes the behavior on your systems? (This patch is very ugly and will draw compiler warnings, but don't worry about that yet.) regards, tom lane Index: pltcl.c === RCS file: /cvsroot/pgsql/src/pl/tcl/pltcl.c,v retrieving revision 1.108 diff -c -r1.108 pltcl.c *** pltcl.c 4 Oct 2006 00:30:14 - 1.108 --- pltcl.c 28 Jan 2007 21:45:40 - *** *** 26,31 --- 26,32 #include "fmgr.h" #include "nodes/makefuncs.h" #include "parser/parse_type.h" + #include "storage/ipc.h" #include "tcop/tcopprot.h" #include "utils/builtins.h" #include "utils/lsyscache.h" *** *** 245,250 --- 246,253 / if (!pltcl_be_init_done) { + on_proc_exit(Tcl_Finalize, 0); + if (SPI_connect() != SPI_OK_CONNECT) elog(ERROR, "SPI_connect failed"); pltcl_init_load_unknown(pltcl_norm_interp); ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl
Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes: > FWIW - I removed --with-tcl from quagga's configuration about two weeks > ago and it has not failed(for that reason) again. So the issue most > definitly looks like plptcl related ... I poked around a bit in the Tcl sources, and as far as I can tell it will never call atexit() on a Linux platform. So that theory isn't right. However, what I found out via strace'ing is that on my own Linux machine, Tcl launches a second thread as soon as it's loaded. This thread just does a select() and sits there. I am not certain but I think it must be the "notifier" facility seen in the Tcl sources. What seems likely to me is that the "child process" we see in your ps output is actually this notifier thread (can you confirm that it's a thread via additional ps arguments?), and that for some reason the shutdown of the notifier isn't working right, or maybe the notifier is dying for some reason and then the main thread can't exit because it expects the notifier to respond. Can you "strace -f" the Tcl test and see what happens? The way I did it was createdb pl_regression createlang pltcl pl_regression psql pl_regression -- in another window, identify PID of connected backend, then strace -f -p PID 2>strace.out psql> \i .../pltcl_setup.sql psql> \i .../pltcl_queries.sql psql> \q One possibility for fixing it is that maybe we should be making an effort to execute Tcl_Finalize() before exiting the backend. If so, having pltcl set up an on_proc_exit callback to do it would be the appropriate thing. This is all speculation though. regards, tom lane ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl
Tom Lane wrote: > Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes: >> FWIW - I removed --with-tcl from quagga's configuration about two weeks >> ago and it has not failed(for that reason) again. So the issue most >> definitly looks like plptcl related ... > > It sorta looks like Tcl might be installing an atexit() callback that is > doing the Wrong Thing somehow. What Tcl version do you have installed > exactly, and with what configure options? (The contents of tclConfig.sh > should be reasonably complete info.) version is 8.4.12-1.1(current debian etch package) and tclConfig.sh is at: http://www.kaltenbrunner.cc/files/tclConfig.sh.txt Stefan ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl
Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes: > FWIW - I removed --with-tcl from quagga's configuration about two weeks > ago and it has not failed(for that reason) again. So the issue most > definitly looks like plptcl related ... It sorta looks like Tcl might be installing an atexit() callback that is doing the Wrong Thing somehow. What Tcl version do you have installed exactly, and with what configure options? (The contents of tclConfig.sh should be reasonably complete info.) regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl
Stefan Kaltenbrunner wrote: > Tom Lane wrote: >> Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes: >>> one of my new buildfarm boxes (an Debian/Etch based ARM box) is >>> sometimes failing to stop the database during the regression tests: >>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-01-08%2003:03:03 >>> this only seems to happen sometimes and only if --with-tcl is enabled on >>> quagga. >>> lionfish (my mipsel box) is able to trigger that on every build if I >>> enable --with-tcl but it is nearly impossible to debug it there because >>> of the low amount of memory and diskspace it has. >> Hm, could pl/tcl somehow be preventing the backend from exiting once >> it's run any pl/tcl stuff? I have no idea why though, and even less >> why it wouldn't be repeatable. >> >>> After the stopdb failure we still have those processes running: >>> pgbuild 3488 0.0 2.4 43640 6300 ?Ss 06:15 0:01 >>> postgres: pgbuild pl_regression [local] idle >> Can you get a stack trace from this process? > > (gdb) bt > #0 0x406b9d80 in __pthread_sigsuspend () from /lib/libpthread.so.0 > #1 0x406b8a7c in __pthread_wait_for_restart_signal () from > /lib/libpthread.so.0 > #2 0x406b91f8 in pthread_onexit_process () from /lib/libpthread.so.0 > #3 0x40438658 in exit () from /lib/libc.so.6 > #4 0x40438658 in exit () from /lib/libc.so.6 > Previous frame identical to this frame (corrupt stack?) > > > >>> pgbuild 3489 0.0 0.0 0 0 ?Z06:15 0:00 >>> [postgres] >> This is a bit odd ... if that process is a direct child of the >> postmaster it should have been reaped promptly. Could it be a child >> of the other backend? If so, why was it started? Please try the >> ps again with whatever switch it needs to list parent process ID. > > looks you are right - the defunct 3489 seems to be a child of 3488: > > PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND > 1 3389 18341 18341 ? -1 S 1001 0:03 > /home/pgbuild/pgbuildfarm/HEAD/inst/bin/postgres -D data > 3389 3391 3391 3391 ? -1 Ss1001 0:00 postgres: > writer process > 3389 3392 3392 3392 ? -1 Ss1001 0:00 postgres: stats > collector process > 3389 3488 3488 3488 ? -1 Ss1001 0:01 postgres: > pgbuild pl_regression [local] idle > 3488 3489 3488 3488 ? -1 Z 1001 0:00 [postgres] > FWIW - I removed --with-tcl from quagga's configuration about two weeks ago and it has not failed(for that reason) again. So the issue most definitly looks like plptcl related ... Stefan ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl
Tom Lane wrote: > Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes: >> one of my new buildfarm boxes (an Debian/Etch based ARM box) is >> sometimes failing to stop the database during the regression tests: > >> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-01-08%2003:03:03 > >> this only seems to happen sometimes and only if --with-tcl is enabled on >> quagga. > >> lionfish (my mipsel box) is able to trigger that on every build if I >> enable --with-tcl but it is nearly impossible to debug it there because >> of the low amount of memory and diskspace it has. > > Hm, could pl/tcl somehow be preventing the backend from exiting once > it's run any pl/tcl stuff? I have no idea why though, and even less > why it wouldn't be repeatable. > >> After the stopdb failure we still have those processes running: >> pgbuild 3488 0.0 2.4 43640 6300 ?Ss 06:15 0:01 >> postgres: pgbuild pl_regression [local] idle > > Can you get a stack trace from this process? (gdb) bt #0 0x406b9d80 in __pthread_sigsuspend () from /lib/libpthread.so.0 #1 0x406b8a7c in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0 #2 0x406b91f8 in pthread_onexit_process () from /lib/libpthread.so.0 #3 0x40438658 in exit () from /lib/libc.so.6 #4 0x40438658 in exit () from /lib/libc.so.6 Previous frame identical to this frame (corrupt stack?) > >> pgbuild 3489 0.0 0.0 0 0 ?Z06:15 0:00 >> [postgres] > > This is a bit odd ... if that process is a direct child of the > postmaster it should have been reaped promptly. Could it be a child > of the other backend? If so, why was it started? Please try the > ps again with whatever switch it needs to list parent process ID. looks you are right - the defunct 3489 seems to be a child of 3488: PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND 1 3389 18341 18341 ? -1 S 1001 0:03 /home/pgbuild/pgbuildfarm/HEAD/inst/bin/postgres -D data 3389 3391 3391 3391 ? -1 Ss1001 0:00 postgres: writer process 3389 3392 3392 3392 ? -1 Ss1001 0:00 postgres: stats collector process 3389 3488 3488 3488 ? -1 Ss1001 0:01 postgres: pgbuild pl_regression [local] idle 3488 3489 3488 3488 ? -1 Z 1001 0:00 [postgres] Stefan ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] weird buildfarm failures on arm/mipsel and --with-tcl
Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes: > one of my new buildfarm boxes (an Debian/Etch based ARM box) is > sometimes failing to stop the database during the regression tests: > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-01-08%2003:03:03 > this only seems to happen sometimes and only if --with-tcl is enabled on > quagga. > lionfish (my mipsel box) is able to trigger that on every build if I > enable --with-tcl but it is nearly impossible to debug it there because > of the low amount of memory and diskspace it has. Hm, could pl/tcl somehow be preventing the backend from exiting once it's run any pl/tcl stuff? I have no idea why though, and even less why it wouldn't be repeatable. > After the stopdb failure we still have those processes running: > pgbuild 3488 0.0 2.4 43640 6300 ?Ss 06:15 0:01 > postgres: pgbuild pl_regression [local] idle Can you get a stack trace from this process? > pgbuild 3489 0.0 0.0 0 0 ?Z06:15 0:00 > [postgres] This is a bit odd ... if that process is a direct child of the postmaster it should have been reaped promptly. Could it be a child of the other backend? If so, why was it started? Please try the ps again with whatever switch it needs to list parent process ID. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend