Re: [HACKERS] strange buildfarm failures
Tom Lane wrote:
> I wrote:
>> Hmm ... I was about to say that the postmaster never sets PG_exception_stack, but maybe an error out of a PG_TRY/PG_RE_THROW could do it? Does the postmaster ever execute PG_TRY?
>
> Doh, I bet that's it, and it's not the postmaster that's at issue but PG_TRY blocks executed during subprocess startup. Inheritance of a PG_exception_stack setting from the postmaster could only happen if the postmaster were to fork() within a PG_TRY block, which I think we can safely say it doesn't. But suppose we get an elog(ERROR) inside a PG_TRY block when there is no outermost longjmp catcher. elog.c will think it should longjmp, and that will eventually lead to executing
>
>     #define PG_RE_THROW()  \
>         siglongjmp(*PG_exception_stack, 1)
>
> with PG_exception_stack = NULL; which seems entirely likely to cause a stack smash of gruesome dimensions. What's more, nothing would have been printed to the postmaster log beforehand, agreeing with observation.

I agree that that would be a bug and we should fix it, but I don't think it explains the problem we're seeing because there is no PG_TRY block in the autovac startup code that I can see :-(

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?
       http://www.postgresql.org/docs/faq
Re: [HACKERS] strange buildfarm failures
Alvaro Herrera [EMAIL PROTECTED] writes:
> I agree that that would be a bug and we should fix it, but I don't think it explains the problem we're seeing because there is no PG_TRY block in the autovac startup code that I can see :-(

I'm wondering if there is some code path that invokes a PG_TRY deep in the bowels of the system.

Anyway, I'll go fix this, and we should know soon enough if it changes the buildfarm behavior.

            regards, tom lane
Re: [HACKERS] strange buildfarm failures
Tom Lane wrote:
> Alvaro Herrera [EMAIL PROTECTED] writes:
>> I agree that that would be a bug and we should fix it, but I don't think it explains the problem we're seeing because there is no PG_TRY block in the autovac startup code that I can see :-(
>
> I'm wondering if there is some code path that invokes a PG_TRY deep in the bowels of the system.

Well, I checked all the bowels involved in autovacuum startup.

> Anyway, I'll go fix this, and we should know soon enough if it changes the buildfarm behavior.

Agreed.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Re: [HACKERS] strange buildfarm failures
Alvaro Herrera wrote:
> Tom Lane wrote:
>> Alvaro Herrera [EMAIL PROTECTED] writes:
>>> I agree that that would be a bug and we should fix it, but I don't think it explains the problem we're seeing because there is no PG_TRY block in the autovac startup code that I can see :-(
>>
>> I'm wondering if there is some code path that invokes a PG_TRY deep in the bowels of the system.
>
> Well, I checked all the bowels involved in autovacuum startup.

Huh, hang on ... there is one caller, which is to set client_encoding (call_string_assign_hook uses a PG_TRY block), but it is called *after* the sigsetjmp block -- in InitPostgres.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] strange buildfarm failures
Alvaro Herrera [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> I'm wondering if there is some code path that invokes a PG_TRY deep in the bowels of the system.
>
> Huh, hang on ... there is one caller, which is to set client_encoding (call_string_assign_hook uses a PG_TRY block), but it is called *after* the sigsetjmp block -- in InitPostgres.

While testing the PG_RE_THROW problem I noted that what I get here is a SIGSEGV crash, rather than SIGABRT as seen on Stefan's machines, so that's another hint that this may be unrelated. Still, it's clearly at risk of causing a problem as more PG_TRY's get added to the code, so I'm going to fix it anyway.

            regards, tom lane
Re: [HACKERS] strange buildfarm failures
Alvaro Herrera wrote:
> Stefan Kaltenbrunner wrote:
>> well - i now have a core file but it does not seem to be much worth except to prove that autovacuum seems to be the culprit:
>>
>> Core was generated by `postgres: autovacuum worker process '.
>> Program terminated with signal 6, Aborted.
>> [...]
>> #0  0x0ed9 in ?? ()
>> warning: GDB can't find the start of the function at 0xed9.

I just noticed an ugly bug in the worker code which I'm fixing. I think this one would also throw SIGSEGV, not SIGABRT.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] strange buildfarm failures
Alvaro Herrera wrote:
> Alvaro Herrera wrote:
>> Stefan Kaltenbrunner wrote:
>>> well - i now have a core file but it does not seem to be much worth except to prove that autovacuum seems to be the culprit:
>>>
>>> Core was generated by `postgres: autovacuum worker process '.
>>> Program terminated with signal 6, Aborted.
>>> [...]
>>> #0  0x0ed9 in ?? ()
>>> warning: GDB can't find the start of the function at 0xed9.
>
> I just noticed an ugly bug in the worker code which I'm fixing. I think this one would also throw SIGSEGV, not SIGABRT.

Nailed it -- this is the actual bug that causes the abort. But I am surprised that it doesn't print the error message on Stefan's machine; here it outputs

TRAP: FailedAssertion("!((((unsigned long)(elem)) > ShmemBase))", File: "/pgsql/source/00head/src/backend/storage/ipc/shmqueue.c", Line: 107)
16496 2007-05-02 11:30:31 CLT DEBUG:  server process (PID 16540) was terminated by signal 6: Aborted
16496 2007-05-02 11:30:31 CLT LOG:  server process (PID 16540) was terminated by signal 6: Aborted
16496 2007-05-02 11:30:31 CLT LOG:  terminating any other active server processes
16496 2007-05-02 11:30:31 CLT DEBUG:  sending SIGQUIT to process 16541
16496 2007-05-02 11:30:31 CLT DEBUG:  sending SIGQUIT to process 16498
16496 2007-05-02 11:30:31 CLT DEBUG:  sending SIGQUIT to process 16500
16496 2007-05-02 11:30:31 CLT DEBUG:  sending SIGQUIT to process 16499
16541 2007-05-02 11:30:33 CLT WARNING:  terminating connection because of crash of another server process

Maybe stderr is going somewhere else? That would be strange, I think.

I'll commit the fix shortly; attached.

-- 
Alvaro Herrera                http://www.flickr.com/photos/alvherre/
"The first law of live demos is: don't try to use the system. Write a script that touches nothing, so as not to cause any damage." (Jakob Nielsen)

Index: src/backend/postmaster/autovacuum.c
===================================================================
RCS file: /home/alvherre/Code/cvs/pgsql/src/backend/postmaster/autovacuum.c,v
retrieving revision 1.42
diff -c -p -r1.42 autovacuum.c
*** src/backend/postmaster/autovacuum.c    18 Apr 2007 16:44:18 -0000    1.42
--- src/backend/postmaster/autovacuum.c    2 May 2007 15:25:27 -0000
*************** AutoVacWorkerMain(int argc, char *argv[]
*** 1407,1431 ****
       * Get the info about the database we're going to work on.
       */
      LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
!     MyWorkerInfo = (WorkerInfo) MAKE_PTR(AutoVacuumShmem->av_startingWorker);
!     dbid = MyWorkerInfo->wi_dboid;
!     MyWorkerInfo->wi_workerpid = MyProcPid;
! 
!     /* insert into the running list */
!     SHMQueueInsertBefore(&AutoVacuumShmem->av_runningWorkers,
!                          &MyWorkerInfo->wi_links);
      /*
!      * remove from the starting pointer, so that the launcher can start a new
!      * worker if required
       */
!     AutoVacuumShmem->av_startingWorker = INVALID_OFFSET;
!     LWLockRelease(AutovacuumLock);
  
!     on_shmem_exit(FreeWorkerInfo, 0);
  
!     /* wake up the launcher */
!     if (AutoVacuumShmem->av_launcherpid != 0)
!         kill(AutoVacuumShmem->av_launcherpid, SIGUSR1);
  
      if (OidIsValid(dbid))
      {
--- 1407,1442 ----
       * Get the info about the database we're going to work on.
       */
      LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
! 
      /*
!      * beware of startingWorker being INVALID; this could happen if the
!      * launcher thinks we've taking too long to start.
       */
!     if (AutoVacuumShmem->av_startingWorker != INVALID_OFFSET)
!     {
!         MyWorkerInfo = (WorkerInfo) MAKE_PTR(AutoVacuumShmem->av_startingWorker);
!         dbid = MyWorkerInfo->wi_dboid;
!         MyWorkerInfo->wi_workerpid = MyProcPid;
! 
!         /* insert into the running list */
!         SHMQueueInsertBefore(&AutoVacuumShmem->av_runningWorkers,
!                              &MyWorkerInfo->wi_links);
!         /*
!          * remove from the starting pointer, so that the launcher can start a new
!          * worker if required
!          */
!         AutoVacuumShmem->av_startingWorker = INVALID_OFFSET;
!         LWLockRelease(AutovacuumLock);
  
!         on_shmem_exit(FreeWorkerInfo, 0);
  
!         /* wake up the launcher */
!         if (AutoVacuumShmem->av_launcherpid != 0)
!             kill(AutoVacuumShmem->av_launcherpid, SIGUSR1);
!     }
!     else
!         /* no worker entry for me, go away */
!         LWLockRelease(AutovacuumLock);
  
      if (OidIsValid(dbid))
      {
*************** AutoVacWorkerMain(int argc, char *argv[]
*** 1466,1473 ****
      }
  
      /*
!      * FIXME -- we need to notify the launcher when we are gone.  But this
!      * should be done after our PGPROC is released, in ProcKill.
       */
  
      /* All done, go away */
--- 1477,1484 ----
      }
  
      /*
!      * The launcher will be notified of my death in ProcKill, *if* we managed
!      * to get a worker slot at all
       */
  
      /* All done, go away */
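The guard added by the patch can be illustrated in isolation. The following is only a minimal sketch of the pattern (check the shared "starting worker" offset against a sentinel before turning it into a pointer); the types, the `INVALID_OFFSET_SKETCH` sentinel, and `claim_starting_slot` are hypothetical stand-ins, not PostgreSQL's definitions, and the locking that the real code does around this step is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sentinel meaning "no worker is currently being started". */
#define INVALID_OFFSET_SKETCH ((size_t) -1)
#define NUM_SLOTS 4

typedef struct
{
    unsigned    db_oid;         /* database the worker should process */
    int         worker_pid;     /* filled in by the worker itself */
} WorkerSlot;

typedef struct
{
    size_t      starting_worker;    /* index into slots[], or the sentinel */
    WorkerSlot  slots[NUM_SLOTS];
} SharedArea;

/*
 * Claim the slot the launcher prepared for us.  Returns NULL when the
 * launcher has already withdrawn the slot (for instance because it decided
 * the worker took too long to start), instead of dereferencing garbage.
 */
WorkerSlot *
claim_starting_slot(SharedArea *shm, int my_pid)
{
    WorkerSlot *slot;

    if (shm->starting_worker == INVALID_OFFSET_SKETCH)
        return NULL;            /* no worker entry for me, go away */

    assert(shm->starting_worker < NUM_SLOTS);
    slot = &shm->slots[shm->starting_worker];
    slot->worker_pid = my_pid;

    /* clear the pointer so a new worker can be started if required */
    shm->starting_worker = INVALID_OFFSET_SKETCH;
    return slot;
}
```

A caller would treat a NULL return as "nothing to do" and exit quietly, which corresponds to the new `else` branch in the patch.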
Re: [HACKERS] strange buildfarm failures
Alvaro Herrera [EMAIL PROTECTED] writes:
> Nailed it -- this is the actual bug that causes the abort. But I am surprised that it doesn't print the error message on Stefan's machine;

Hm, maybe we need an fflush(stderr) in ExceptionalCondition?

            regards, tom lane
Re: [HACKERS] strange buildfarm failures
Tom Lane [EMAIL PROTECTED] writes:
> Alvaro Herrera [EMAIL PROTECTED] writes:
>> Nailed it -- this is the actual bug that causes the abort. But I am surprised that it doesn't print the error message on Stefan's machine;
>
> Hm, maybe we need an fflush(stderr) in ExceptionalCondition?

stderr is supposed to be line-buffered by default. Couldn't hurt I suppose.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
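To make the suggestion concrete, here is a minimal sketch of an assertion reporter that flushes its stream before the process aborts. `report_trap` and `trap_and_abort` are hypothetical names for illustration, not the actual ExceptionalCondition code. The C standard only guarantees that stderr is not fully buffered at startup, so the explicit flush matters mainly when stderr has been redirected or rebuffered.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Write the trap message to 'out' and flush it immediately.
 * Returns 1 if both the write and the flush succeeded, 0 otherwise.
 */
int
report_trap(FILE *out, const char *condition, const char *file, int line)
{
    int         written;

    assert(out != NULL);
    written = fprintf(out, "TRAP: FailedAssertion(\"%s\", File: \"%s\", Line: %d)\n",
                      condition, file, line);

    /* abort() is not required to flush stdio buffers, so do it ourselves */
    return (fflush(out) == 0) && (written > 0);
}

/* Report on stderr, then terminate with SIGABRT. */
void
trap_and_abort(const char *condition, const char *file, int line)
{
    (void) report_trap(stderr, condition, file, line);
    abort();
}
```

Keeping the reporting step separate from the abort makes it possible to exercise the message path against an ordinary file without killing the process.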
Re: [HACKERS] strange buildfarm failures
Tom Lane wrote:
> Alvaro Herrera [EMAIL PROTECTED] writes:
>> Oh, another thing that I think may be happening is that the stack is restored in longjmp, so it is trying to report an error elsewhere but it crashes because something got overwritten or something; i.e. a bug in the error recovery code.
>
> Hm, something trying to elog before the setjmp's been executed? Although I thought it was coded so that elog.c would just proc_exit if there was noplace to longjmp to. A mistake here might explain the lack of any message in the postmaster log: if elog.c thinks it should longjmp then it doesn't print the message first.

Well, there seems to be plenty of code which is happy to elog(ERROR) before the longjmp target block has been set; for example InitFileAccess(), which is called on BaseInit(), which comes before sigsetjmp() both on postgres.c and autovacuum.c. (This particular case is elog(FATAL) not ERROR however). mdinit() also does some memory allocation which could certainly fail.

I'm wondering if it wouldn't be more robust to define a longjmp target block before calling BaseInit(), and have it exit cleanly in case of failure (which is what you say elog.c should be doing if there is no target block).

In errstart(), it is checked if PG_exception_stack is NULL. Now, this symbol is defined in elog.c and initialized to NULL, but I wonder if a child process inherits the value that postmaster set, or it comes back to NULL. The backend would not inherit any of the values the postmaster set if the latter were the case, so I'm assuming that PG_exception_stack stays as the postmaster left it. I wonder what happens if the child process finds that this is an invalid point to jump to?

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
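The inheritance question can be checked with a small experiment. This is a sketch assuming a POSIX system with fork(); `handler_slot` merely stands in for a global such as PG_exception_stack, and `child_sees_parent_pointer` is a hypothetical helper. A forked child starts with a copy of the parent's memory, so a global pointer keeps whatever value the parent had at the moment of the fork.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int  marker;
static int *handler_slot = NULL;    /* stand-in for a global handler pointer */

/*
 * Fork a child and report what it observed: 1 if the global pointer was
 * non-NULL in the child, 0 if it was NULL, -1 if fork/wait failed.
 */
int
child_sees_parent_pointer(int set_before_fork)
{
    pid_t       pid;
    int         status;

    handler_slot = set_before_fork ? &marker : NULL;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(handler_slot != NULL ? 1 : 0);    /* child: report and leave */

    assert(pid > 0);
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

So a child would only see a non-NULL handler pointer if the parent had one set at fork time; a parent that never sets it passes NULL on to every child.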
Re: [HACKERS] strange buildfarm failures
Alvaro Herrera [EMAIL PROTECTED] writes:
> I'm wondering if it wouldn't be more robust to define a longjmp target block before calling BaseInit(), and have it exit cleanly in case of failure (which is what you say elog.c should be doing if there is no target block).

No, because elog is already supposed to deal with that case; and does, every time you connect to a bad database name for example. If it's failing, the question to answer is why.

> In errstart(), it is checked if PG_exception_stack is NULL. Now, this symbol is defined in elog.c and initialized to NULL, but I wonder if a child process inherits the value that postmaster set, or it comes back to NULL.

Hmm ... I was about to say that the postmaster never sets PG_exception_stack, but maybe an error out of a PG_TRY/PG_RE_THROW could do it? Does the postmaster ever execute PG_TRY? (And if so, should it? The postmaster really ought not be dealing in anything very hairy --- it should be passing such work off to children.)

            regards, tom lane
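The check discussed here amounts to promoting an ERROR to FATAL when there is nowhere to longjmp to. A simplified sketch of that decision follows; the level constants and `effective_level` are illustrative assumptions, not the real errstart() code, which considers additional conditions.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative severity values; only their ordering matters here. */
enum
{
    LEVEL_WARNING = 19,
    LEVEL_ERROR = 20,
    LEVEL_FATAL = 21
};

/*
 * Decide the severity actually used for a report.  An ERROR normally
 * longjmps to the innermost handler; with no handler installed, the only
 * safe option is to treat it as FATAL and exit the process.
 */
int
effective_level(int requested, const void *handler_stack)
{
    assert(requested >= 0);

    if (requested == LEVEL_ERROR && handler_stack == NULL)
        return LEVEL_FATAL;
    return requested;
}
```

This covers errors raised before any handler exists. It does not cover a re-throw out of a PG_TRY block, where the innermost handler is found but the saved outer handler is NULL.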
Re: [HACKERS] strange buildfarm failures
I wrote:
> Hmm ... I was about to say that the postmaster never sets PG_exception_stack, but maybe an error out of a PG_TRY/PG_RE_THROW could do it? Does the postmaster ever execute PG_TRY?

Doh, I bet that's it, and it's not the postmaster that's at issue but PG_TRY blocks executed during subprocess startup. Inheritance of a PG_exception_stack setting from the postmaster could only happen if the postmaster were to fork() within a PG_TRY block, which I think we can safely say it doesn't. But suppose we get an elog(ERROR) inside a PG_TRY block when there is no outermost longjmp catcher. elog.c will think it should longjmp, and that will eventually lead to executing

    #define PG_RE_THROW()  \
        siglongjmp(*PG_exception_stack, 1)

with PG_exception_stack = NULL; which seems entirely likely to cause a stack smash of gruesome dimensions. What's more, nothing would have been printed to the postmaster log beforehand, agreeing with observation.

Personally I think the correct fix is to make PG_RE_THROW deal sanely with the case of PG_exception_stack = NULL, that is, turn it into an elog(FATAL) with the original error text. If you try to fix it by making a setjmp occur earlier, there's still the problem of what about PG_TRY earlier than that?

This might be more code than we want in a macro, though, especially since this is presumably not a performance-critical path. I'm tempted to change the macro to just call a pg_re_throw() subroutine. Thoughts?

            regards, tom lane
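The shape of the proposed subroutine can be sketched as below. This is not the code that was eventually committed: it uses standard setjmp/longjmp in place of sigsetjmp/siglongjmp, and `exception_stack`, `re_throw_checked`, and `demo_rethrow` are hypothetical names. The point is only that the re-throw checks for a NULL handler and hands back an "escalate to FATAL" result instead of jumping through a null pointer.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

typedef enum
{
    RETHROW_JUMPED,     /* control reached an enclosing handler */
    RETHROW_FATAL       /* no handler: caller must report and exit */
} rethrow_outcome;

/* Stand-in for PG_exception_stack: innermost enclosing handler, or NULL. */
static jmp_buf *exception_stack = NULL;

/*
 * Re-throw helper.  With a handler installed this never returns; without
 * one it returns RETHROW_FATAL rather than dereferencing NULL.
 */
rethrow_outcome
re_throw_checked(void)
{
    if (exception_stack != NULL)
        longjmp(*exception_stack, 1);
    return RETHROW_FATAL;
}

/* Exercise the helper with or without an outer handler in place. */
rethrow_outcome
demo_rethrow(int install_handler)
{
    jmp_buf         local_buf;
    rethrow_outcome outcome;

    exception_stack = NULL;
    if (install_handler)
    {
        if (setjmp(local_buf) != 0)
        {
            /* arrived here via longjmp from re_throw_checked() */
            exception_stack = NULL;
            return RETHROW_JUMPED;
        }
        exception_stack = &local_buf;
    }

    outcome = re_throw_checked();
    assert(!install_handler);   /* only reachable when no handler existed */
    return outcome;
}
```

Moving the logic out of the macro and into a function keeps the call sites small, which matches the concern above about macro size on a path that is not performance-critical.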
Re: [HACKERS] strange buildfarm failures
Stefan Kaltenbrunner wrote:
> well - i now have a core file but it does not seem to be much worth except to prove that autovacuum seems to be the culprit:
>
> Core was generated by `postgres: autovacuum worker process '.
> Program terminated with signal 6, Aborted.
> [...]
> #0  0x0ed9 in ?? ()
> warning: GDB can't find the start of the function at 0xed9.

Interesting. Notice how it doesn't have the database name in the ps display. This means it must have crashed between the initial init_ps_display and the set_ps_display call just before starting to vacuum. So the bug is probably in the startup code; probably the code dealing with the PGPROC which is the newest and weirder stuff.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Re: [HACKERS] strange buildfarm failures
Alvaro Herrera wrote:
> Stefan Kaltenbrunner wrote:
>> well - i now have a core file but it does not seem to be much worth except to prove that autovacuum seems to be the culprit:
>>
>> Core was generated by `postgres: autovacuum worker process '.
>> Program terminated with signal 6, Aborted.
>> [...]
>> #0  0x0ed9 in ?? ()
>> warning: GDB can't find the start of the function at 0xed9.
>
> Interesting. Notice how it doesn't have the database name in the ps display. This means it must have crashed between the initial init_ps_display and the set_ps_display call just before starting to vacuum. So the bug is probably in the startup code; probably the code dealing with the PGPROC which is the newest and weirder stuff.

Oh, another thing that I think may be happening is that the stack is restored in longjmp, so it is trying to report an error elsewhere but it crashes because something got overwritten or something; i.e. a bug in the error recovery code.

I don't know how feasible this is or even if it makes sense (would longjmp() restore the ps display?), but we had similar, very hard to debug errors in Mammoth Replicator, which is why I'm mentioning it in case it rings a bell.

-- 
Alvaro Herrera                Developer, http://www.PostgreSQL.org/
The only difference is that Saddam would kill you on private, where the Americans will kill you in public (Mohammad Saleh, 39, a building contractor)
Re: [HACKERS] strange buildfarm failures
Alvaro Herrera [EMAIL PROTECTED] writes:
> Oh, another thing that I think may be happening is that the stack is restored in longjmp, so it is trying to report an error elsewhere but it crashes because something got overwritten or something; i.e. a bug in the error recovery code.

Hm, something trying to elog before the setjmp's been executed? Although I thought it was coded so that elog.c would just proc_exit if there was noplace to longjmp to. A mistake here might explain the lack of any message in the postmaster log: if elog.c thinks it should longjmp then it doesn't print the message first.

            regards, tom lane
Re: [HACKERS] strange buildfarm failures
Alvaro Herrera wrote:
> Tom Lane wrote:
>> Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
>>> Stefan Kaltenbrunner wrote:
>>>> two of my buildfarm members had different but pretty weird looking failures lately:
>>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03
>>>> and
>>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02
>>>> any ideas on what might be causing those?
>
> Just for the record, quagga and emu failures don't seem related to the report below. They don't crash; the regression.diffs contains data that suggests that there may be data corruption of some sort.
>
>   INSERT INTO INET_TBL (c, i) VALUES ('192.168.1.2/30', '192.168.1.226');
>   ERROR:  invalid cidr value: %{
>
> This doesn't seem to make much sense.

no idea - but quagga and emu seem to have similar failures (in the sense that they don't make any sense) and i have no reason to believe that the hardware is at fault.

>>> lionfish just failed too:
>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-25%2005:30:09
>>
>> And had a similar failure a few days ago. The curious thing is that what we get in the postmaster log is
>>
>> LOG:  server process (PID 23405) was terminated by signal 6: Aborted
>> LOG:  terminating any other active server processes
>>
>> You would think SIGABRT would come from an assertion failure, but there's no preceding assertion message in the log. The other characteristic of these crashes is that *all* of the failing regression instances report terminating connection because of crash of another server process, which suggests strongly that the crash was in an autovacuum process (if it were bgwriter or stats collector the postmaster would've said so). So I think the recent autovac patches are at fault. I spent a bit of time trolling for a spot where the code might abort() without having printed anything, but didn't find one.
>
> Hmm. I kept an eye on the buildfarm for a few days, but saw nothing that could be connected to autovacuum so I neglected it.
>
> This is the other failure:
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-20%2005:30:14
>
> It shows the same pattern. I am baffled -- I don't understand how it can die without reporting the error.
>
> Apparently it crashes rather frequently, so it shouldn't be too difficult to reproduce on manual runs. If we could get it to run with a higher debug level, it might prove helpful to further pinpoint the problem.
>
> The core file would be much better obviously (first and foremost to confirm that it's autovacuum that's crashing ... )

well - i now have a core file but it does not seem to be much worth except to prove that autovacuum seems to be the culprit:

Core was generated by `postgres: autovacuum worker process '.
Program terminated with signal 6, Aborted.
[...]
#0  0x0ed9 in ?? ()
warning: GDB can't find the start of the function at 0xed9.

    GDB is unable to find the start of the function at 0xed9 and thus can't determine the size of that function's stack frame. This means that GDB may be unable to access that stack frame, or the frames below it.
    This problem is most likely caused by an invalid program counter or stack pointer.
    However, if you think GDB should simply search farther back from 0xed9 for code which looks like the beginning of a function, you can increase the range of the search using the `set heuristic-fence-post' command.

Stefan
Re: [HACKERS] strange buildfarm failures
Stefan Kaltenbrunner wrote:
> two of my buildfarm members had different but pretty weird looking failures lately:
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03
> and
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02
> any ideas on what might be causing those?

lionfish just failed too:
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-25%2005:30:09

Stefan
Re: [HACKERS] strange buildfarm failures
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
> Stefan Kaltenbrunner wrote:
>> two of my buildfarm members had different but pretty weird looking failures lately:
>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03
>> and
>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02
>> any ideas on what might be causing those?
>
> lionfish just failed too:
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-25%2005:30:09

And had a similar failure a few days ago. The curious thing is that what we get in the postmaster log is

LOG:  server process (PID 23405) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes

You would think SIGABRT would come from an assertion failure, but there's no preceding assertion message in the log. The other characteristic of these crashes is that *all* of the failing regression instances report terminating connection because of crash of another server process, which suggests strongly that the crash was in an autovacuum process (if it were bgwriter or stats collector the postmaster would've said so). So I think the recent autovac patches are at fault. I spent a bit of time trolling for a spot where the code might abort() without having printed anything, but didn't find one.

If any of the buildfarm owners can get a stack trace from the core dump of one of these events, it'd be mighty helpful.

            regards, tom lane
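For readers unfamiliar with how a parent process learns that a child "was terminated by signal 6", here is a minimal sketch assuming a POSIX system. `child_termination_signal` is a hypothetical helper, not postmaster code: it forks a child that either calls abort() or exits cleanly, and decodes the wait status the way a supervising process would.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns the number of the signal that killed the child, 0 if the child
 * exited normally, or -1 if fork/wait failed.
 */
int
child_termination_signal(int call_abort)
{
    pid_t       pid;
    int         status;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        if (call_abort)
            abort();            /* raises SIGABRT, as a failed Assert does */
        _exit(0);
    }

    assert(pid > 0);
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    return 0;
}
```

On most systems SIGABRT is signal 6. The wait status carries only the signal number, so whatever the child wanted to say about why it aborted has to reach the log through its own output.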
Re: [HACKERS] strange buildfarm failures
Tom Lane wrote:
> Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
>> Stefan Kaltenbrunner wrote:
>>> two of my buildfarm members had different but pretty weird looking failures lately:
>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03
>>> and
>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02
>>> any ideas on what might be causing those?

Just for the record, quagga and emu failures don't seem related to the report below. They don't crash; the regression.diffs contains data that suggests that there may be data corruption of some sort.

  INSERT INTO INET_TBL (c, i) VALUES ('192.168.1.2/30', '192.168.1.226');
  ERROR:  invalid cidr value: %{

This doesn't seem to make much sense.

>> lionfish just failed too:
>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-25%2005:30:09
>
> And had a similar failure a few days ago. The curious thing is that what we get in the postmaster log is
>
> LOG:  server process (PID 23405) was terminated by signal 6: Aborted
> LOG:  terminating any other active server processes
>
> You would think SIGABRT would come from an assertion failure, but there's no preceding assertion message in the log. The other characteristic of these crashes is that *all* of the failing regression instances report terminating connection because of crash of another server process, which suggests strongly that the crash was in an autovacuum process (if it were bgwriter or stats collector the postmaster would've said so). So I think the recent autovac patches are at fault. I spent a bit of time trolling for a spot where the code might abort() without having printed anything, but didn't find one.

Hmm. I kept an eye on the buildfarm for a few days, but saw nothing that could be connected to autovacuum so I neglected it.

This is the other failure:
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-20%2005:30:14

It shows the same pattern. I am baffled -- I don't understand how it can die without reporting the error.

Apparently it crashes rather frequently, so it shouldn't be too difficult to reproduce on manual runs. If we could get it to run with a higher debug level, it might prove helpful to further pinpoint the problem.

The core file would be much better obviously (first and foremost to confirm that it's autovacuum that's crashing ... )

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] strange buildfarm failures
Alvaro Herrera wrote:
> Tom Lane wrote:
>> Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
>>> Stefan Kaltenbrunner wrote:
>>>> two of my buildfarm members had different but pretty weird looking failures lately:
>>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03
>>>> and
>>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02
>>>> any ideas on what might be causing those?
>
> Just for the record, quagga and emu failures don't seem related to the report below. They don't crash; the regression.diffs contains data that suggests that there may be data corruption of some sort.
>
>   INSERT INTO INET_TBL (c, i) VALUES ('192.168.1.2/30', '192.168.1.226');
>   ERROR:  invalid cidr value: %{
>
> This doesn't seem to make much sense.

yeah, on further reflection it looks like the failures from emu and quagga are unrelated to the issue lionfish is experiencing

>>> lionfish just failed too:
>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-25%2005:30:09
>>
>> And had a similar failure a few days ago. The curious thing is that what we get in the postmaster log is
>>
>> LOG:  server process (PID 23405) was terminated by signal 6: Aborted
>> LOG:  terminating any other active server processes
>>
>> You would think SIGABRT would come from an assertion failure, but there's no preceding assertion message in the log. The other characteristic of these crashes is that *all* of the failing regression instances report terminating connection because of crash of another server process, which suggests strongly that the crash was in an autovacuum process (if it were bgwriter or stats collector the postmaster would've said so). So I think the recent autovac patches are at fault. I spent a bit of time trolling for a spot where the code might abort() without having printed anything, but didn't find one.
>
> Hmm. I kept an eye on the buildfarm for a few days, but saw nothing that could be connected to autovacuum so I neglected it.
>
> This is the other failure:
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-20%2005:30:14
>
> It shows the same pattern. I am baffled -- I don't understand how it can die without reporting the error.

I should have mentioned that initially - but I think the failure from 2007-04-20 is not related at all. The failure from 2007-04-20 was very likely caused by the kernel running totally out of memory (lionfish is a very resource-starved box with only 48MB of RAM and 128MB of swap at that time - do we have a recent patch that is increasing memory usage quite a lot?). I immediately added another 128MB of swap after that and I don't think the failure from yesterday is the same (at least there are no kernel logs that indicate a similar issue)

> Apparently it crashes rather frequently, so it shouldn't be too difficult to reproduce on manual runs. If we could get it to run with a higher debug level, it might prove helpful to further pinpoint the problem.

a manual run of the buildfarm script takes ~4.5 hours on lionfish ;-)

> The core file would be much better obviously (first and foremost to confirm that it's autovacuum that's crashing ... )

I will see what I can come up with ...

Stefan
[HACKERS] strange buildfarm failures
two of my buildfarm members had different but pretty weird looking failures lately:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03

and

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02

any ideas on what might be causing those?

Stefan