Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD
Tom Lane wrote:
> Seneca Cunningham [EMAIL PROTECTED] writes:
> > I don't have a core, but here's the CrashReporter output for both of
> > jackal's failed runs:
>
> Wow, some actual data, rather than just noodling about how to get it ...
> thanks!
>
> > ...
> > 11 postgres 0x0022b2e3 RelationIdGetRelation + 110 (relcache.c:1496)
> > 12 postgres 0x00020868 relation_open + 84 (heapam.c:697)
> > 13 postgres 0x0002aab9 index_open + 32 (indexam.c:140)
> > 14 postgres 0x0002a9d4 systable_beginscan + 289 (genam.c:184)
> > 15 postgres 0x002279e4 RelationInitIndexAccessInfo + 1645 (relcache.c:1200)
> > 16 postgres 0x0022926a RelationBuildDesc + 3527 (relcache.c:866)
> > 17 postgres 0x0022b2e3 RelationIdGetRelation + 110 (relcache.c:1496)
> > 18 postgres 0x00020868 relation_open + 84 (heapam.c:697)
> > 19 postgres 0x0002aab9 index_open + 32 (indexam.c:140)
> > 20 postgres 0x0002a9d4 systable_beginscan + 289 (genam.c:184)
> > 21 postgres 0x002279e4 RelationInitIndexAccessInfo + 1645 (relcache.c:1200)
> > 22 postgres 0x0022926a RelationBuildDesc + 3527 (relcache.c:866)
> > 23 postgres 0x0022b2e3 RelationIdGetRelation + 110 (relcache.c:1496)
> > ...
>
> What you seem to have here is infinite recursion during relcache
> initialization. That's surely not hard to believe, considering I just
> whacked that code around, and indeed changed some of the tests that are
> intended to prevent such recursion. But what I don't understand is why
> it'd be platform-specific, much less not perfectly repeatable on the
> platforms where it does manifest. Anyone have a clue?

fwiw - I can trigger that issue now pretty reliably on a fast Opteron box
(running Debian Sarge/AMD64) with make regress in a loop - I seem to be
able to trigger it in about 20-25% of the runs.
the resulting core however looks totally stack corrupted and not really
usable :-(

Stefan

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at
http://www.postgresql.org/about/donate
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
> fwiw - I can trigger that issue now pretty reliably on a fast Opteron box
> (running Debian Sarge/AMD64) with make regress in a loop - I seem to be
> able to trigger it in about 20-25% of the runs.
> the resulting core however looks totally stack corrupted and not really
> usable :-(

Hmm, probably the stack overrun leaves the call stack too corrupt for gdb
to make sense of. Try inserting check_stack_depth(); into one of the
functions that're part of the infinite recursion, and then make
check_stack_depth() do an abort() instead of just elog(ERROR). That might
give you a core that gdb can work with.

I'm still having absolutely 0 success reproducing it on a dual Xeon ... so
it's not just the architecture that's the issue. Some kind of timing
problem? That's hard to believe too.

regards, tom lane
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD
On Sun, Dec 31, 2006 at 05:43:45PM +0100, Stefan Kaltenbrunner wrote:
> Tom Lane wrote:
> > What you seem to have here is infinite recursion during relcache
> > initialization. That's surely not hard to believe, considering I just
> > whacked that code around, and indeed changed some of the tests that are
> > intended to prevent such recursion. But what I don't understand is why
> > it'd be platform-specific, much less not perfectly repeatable on the
> > platforms where it does manifest. Anyone have a clue?
>
> fwiw - I can trigger that issue now pretty reliably on a fast Opteron box
> (running Debian Sarge/AMD64) with make regress in a loop - I seem to be
> able to trigger it in about 20-25% of the runs.
> the resulting core however looks totally stack corrupted and not really
> usable :-(

By reducing the stack size on jackal from the default of 8MB to 3MB, I can
get this to trigger in roughly 30% of the runs while preserving the passed
tests in the other parallel groups.

--
Seneca [EMAIL PROTECTED]
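The two reproduction tricks described above (running the regression tests in a loop, and shrinking the stack limit so runaway recursion hits the limit sooner) can be combined in a small shell helper. This is only a sketch: the function name `count_failures` and its argument order are made up for illustration and are not part of the buildfarm client, and it assumes a shell whose `ulimit` accepts `-S -s` with a value in kilobytes (bash and dash do).

```shell
#!/bin/sh
# Hypothetical helper: run a command repeatedly under a reduced soft stack
# limit and print how many of the runs failed.
# usage: count_failures RUNS STACK_KB COMMAND [ARGS...]
count_failures() {
    runs=$1
    stack_kb=$2
    shift 2
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # The subshell keeps the lowered limit from leaking into later runs.
        # A run also counts as failed if the limit itself cannot be set.
        if ! ( ulimit -S -s "$stack_kb" && "$@" ) >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}
```

For example, `count_failures 20 3072 make check`, run from the top of a built source tree, would report how many of twenty runs failed with a 3MB stack.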
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD
Seneca Cunningham [EMAIL PROTECTED] writes:
> I don't have a core, but here's the CrashReporter output for both of
> jackal's failed runs:

Wow, some actual data, rather than just noodling about how to get it ...
thanks!

> ...
> 11 postgres 0x0022b2e3 RelationIdGetRelation + 110 (relcache.c:1496)
> 12 postgres 0x00020868 relation_open + 84 (heapam.c:697)
> 13 postgres 0x0002aab9 index_open + 32 (indexam.c:140)
> 14 postgres 0x0002a9d4 systable_beginscan + 289 (genam.c:184)
> 15 postgres 0x002279e4 RelationInitIndexAccessInfo + 1645 (relcache.c:1200)
> 16 postgres 0x0022926a RelationBuildDesc + 3527 (relcache.c:866)
> 17 postgres 0x0022b2e3 RelationIdGetRelation + 110 (relcache.c:1496)
> 18 postgres 0x00020868 relation_open + 84 (heapam.c:697)
> 19 postgres 0x0002aab9 index_open + 32 (indexam.c:140)
> 20 postgres 0x0002a9d4 systable_beginscan + 289 (genam.c:184)
> 21 postgres 0x002279e4 RelationInitIndexAccessInfo + 1645 (relcache.c:1200)
> 22 postgres 0x0022926a RelationBuildDesc + 3527 (relcache.c:866)
> 23 postgres 0x0022b2e3 RelationIdGetRelation + 110 (relcache.c:1496)
> ...

What you seem to have here is infinite recursion during relcache
initialization. That's surely not hard to believe, considering I just
whacked that code around, and indeed changed some of the tests that are
intended to prevent such recursion. But what I don't understand is why
it'd be platform-specific, much less not perfectly repeatable on the
platforms where it does manifest. Anyone have a clue?

regards, tom lane
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD
Tom Lane wrote:
> Alvaro Herrera [EMAIL PROTECTED] writes:
> > Andrew Dunstan wrote:
> > > here's a quick untested patch for buildfarm that Stefan might like
> > > to try.
> >
> > Note that not all core files are named core. On some Linux distros,
> > it's configured to be core.PID by default.
>
> And on some platforms, cores don't drop in the current working directory
> ... but until we have a problem that *only* manifests on such a
> platform, I wouldn't worry about that. We do need to look for 'core*'
> not just 'core', though.

That part is easy enough. And if people mangle their core location I am
certainly not going to go looking for it.

> Don't forget the ulimit point either ... on most Linuxen there won't be
> any core at all without twiddling ulimit.

Yeah. Perl actually doesn't have a core call for this. I have built some
code (see attached revised patch) to try to do it using a widespread but
non-standard module called BSD::Resource, but if the module is missing it
won't fail.

I'm actually wondering if unlimiting core might not be a useful switch to
provide on pg_ctl, as long as the platform has setrlimit().

cheers

andrew

--- run_build.pl.orig 2006-12-28 17:32:14.0 -0500
+++ run_build.pl.new 2006-12-29 10:59:39.0 -0500
@@ -299,6 +299,20 @@
     unlink $forcefile;
 }
 
+# try to allow core files to be produced.
+# another way would be for the calling environment
+# to call ulimit. We do this in an eval so failure is
+# not fatal.
+eval
+{
+    require BSD::Resource;
+    BSD::Resource->import();
+    # explicit sub calls here using & keeps compiler happy
+    my $coreok = setrlimit(&RLIMIT_CORE,&RLIM_INFINITY,&RLIM_INFINITY);
+    die "setrlimit" unless $coreok;
+};
+warn "failed to unlimit core size: $@" if $@;
+
 # the time we take the snapshot
 my $now=time;
 my $installdir = "$buildroot/$branch/inst";
@@ -795,6 +809,34 @@
     $dbstarted=undef;
 }
 
+
+sub get_stack_trace
+{
+    my $bindir = shift;
+    my $pgdata = shift;
+
+    # no core = no result
+    my @cores = glob("$pgdata/core*");
+    return () unless @cores;
+
+    # no gdb = no result
+    system "gdb --version > /dev/null 2>&1";
+    my $status = $? >> 8;
+    return () if $status;
+
+    my @trace;
+
+    foreach my $core (@cores)
+    {
+        my @onetrace = `gdb -ex bt --batch $bindir/postgres $core 2>&1`;
+        push(@trace,
+            "\n\n== stack trace: $core ==\n",
+            @onetrace);
+    }
+
+    return @trace;
+}
+
 sub make_install_check
 {
     my @checkout = `cd $pgsql/src/test/regress && $make installcheck 2>&1`;
@@ -814,6 +856,11 @@
         }
         close($handle);
     }
+    if ($status)
+    {
+        my @trace = get_stack_trace("$installdir/bin","$installdir/data");
+        push(@checkout,@trace);
+    }
     writelog('install-check',\@checkout);
     print "make installcheck log ===\n",@checkout
         if ($verbose > 1);
@@ -839,6 +886,11 @@
         }
         close($handle);
     }
+    if ($status)
+    {
+        my @trace = get_stack_trace("$installdir/bin","$installdir/data");
+        push(@checkout,@trace);
+    }
     writelog('contrib-install-check',\@checkout);
     print "make contrib installcheck log ===\n",@checkout
         if ($verbose > 1);
@@ -864,6 +916,11 @@
         }
         close($handle);
     }
+    if ($status)
+    {
+        my @trace = get_stack_trace("$installdir/bin","$installdir/data");
+        push(@checkout,@trace);
+    }
     writelog('pl-install-check',\@checkout);
     print "make pl installcheck log ===\n",@checkout
         if ($verbose > 1);
@@ -892,6 +949,13 @@
         }
         close($handle);
     }
+    if ($status)
+    {
+        my @trace =
+            get_stack_trace("$pgsql/src/test/regress/install$installdir/bin",
+                            "$pgsql/src/test/regress/tmp_check/data");
+        push(@makeout,@trace);
+    }
     writelog('check',\@makeout);
     print "make check logs ===\n",@makeout
         if ($verbose > 1);
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD
Andrew Dunstan [EMAIL PROTECTED] writes:
> I'm actually wondering if unlimiting core might not be a useful switch to
> provide on pg_ctl, as long as the platform has setrlimit().

Not a bad thought; that's actually one of the reasons that I still usually
use a handmade script rather than pg_ctl for launching postmasters ...

regards, tom lane
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD
Tom Lane wrote:
> Andrew Dunstan [EMAIL PROTECTED] writes:
> > I'm actually wondering if unlimiting core might not be a useful switch
> > to provide on pg_ctl, as long as the platform has setrlimit().
>
> Not a bad thought; that's actually one of the reasons that I still
> usually use a handmade script rather than pg_ctl for launching
> postmasters ...

this sounds like a good idea to me too - it seems like a cleaner and more
generally useful thing than just doing it in the buildfarm code ...

Stefan
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD
Tom Lane wrote:
> Several of the buildfarm machines are exhibiting repeatable signal 11
> crashes in what seem perfectly ordinary queries. This started about four
> days ago so I suppose it's got something to do with my operator-families
> patch :-( ... but I dunno what, and none of my own machines show the
> failure. Can someone provide a stack trace?

no stack trace yet however impala at least seems to be running out of
memory (!) with 380MB of RAM and some 800MB of swap (and no other tasks)
during the regression run. Maybe something is causing a dramatic increase
in memory usage that is causing the random failures (in impala's case the
OOM-killer actually decides to terminate the postmaster) ?

Stefan
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
> Tom Lane wrote:
> > Several of the buildfarm machines are exhibiting repeatable signal 11
> > crashes in what seem perfectly ordinary queries.
>
> no stack trace yet however impala at least seems to be running out of
> memory (!) with 380MB of RAM and some 800MB of swap(and no other tasks)
> during the regression run. Maybe something is causing a dramatic increase
> in memory usage that is causing the random failures (in impalas case the
> OOM-killer actually decides to terminate the postmaster) ?

No, most all the failures I've looked at are sig11 not sig9.

It is interesting that the failures are not as consistent as I first
thought --- the machines that are showing failures actually fail maybe one
time in two.

regards, tom lane
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD
Tom Lane wrote:
> Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
> > Tom Lane wrote:
> > > Several of the buildfarm machines are exhibiting repeatable signal 11
> > > crashes in what seem perfectly ordinary queries.
> >
> > no stack trace yet however impala at least seems to be running out of
> > memory (!) with 380MB of RAM and some 800MB of swap(and no other tasks)
> > during the regression run. Maybe something is causing a dramatic
> > increase in memory usage that is causing the random failures (in
> > impalas case the OOM-killer actually decides to terminate the
> > postmaster) ?
>
> No, most all the failures I've looked at are sig11 not sig9.

hmm - still weird and I would not actually consider impala a resource
starved box (especially when compared to other buildfarm-members) so there
seems to be something strange going on.
I have changed the overcommit settings on that box for now - let's see what
the result of that will be.

> It is interesting that the failures are not as consistent as I first
> thought --- the machines that are showing failures actually fail maybe
> one time in two.

or some even less - dove seems to be one of the affected boxes too - I
increased the build frequency since yesterday but it has not yet failed
again ...

Stefan
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
> Tom Lane wrote:
> > Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
> > > ... Maybe something is causing a dramatic increase in memory usage
> > > that is causing the random failures (in impalas case the OOM-killer
> > > actually decides to terminate the postmaster) ?
> >
> > No, most all the failures I've looked at are sig11 not sig9.
>
> hmm - still weird and I would not actually consider impala a resource
> starved box (especially when compared to other buildfarm-members) so
> there seems to be something strange going on.

Actually ... one way that a memory overconsumption bug could manifest as
sig11 would be if it's a runaway-recursion issue: usually you get sig11
when the machine's stack size limit is exceeded. This doesn't put us any
closer to localizing the problem, but at least it's a guess about the
cause?

I wonder whether there's any way to get the buildfarm script to report a
stack trace automatically if it finds a core file left behind in the
$PGDATA directory after running the tests. Would something like this be
adequately portable?

    if [ -f $PGDATA/core* ]
    then
        echo bt | gdb $installdir/bin/postgres $PGDATA/core*
    fi

Obviously it'd fail if no gdb available, but that seems pretty harmless.
The other thing that we'd likely need is an explicit ulimit -c unlimited
for machines where core dumps are off by default.

regards, tom lane
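The "explicit ulimit -c unlimited" point above can be sketched as a shell wrapper that raises the core-size limit only for the command it launches. The name `with_cores_enabled` is hypothetical, and the fallback to the hard limit is an assumption about what is wanted when "unlimited" is refused; the thread itself only mentions `ulimit -c unlimited`.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with the soft core-size limit raised
# as far as the hard limit allows, without changing the calling shell.
# usage: with_cores_enabled COMMAND [ARGS...]
with_cores_enabled() {
    (
        # Try "unlimited" first; if that is refused, settle for the hard
        # limit; if even that fails, run the command anyway.
        ulimit -S -c unlimited 2>/dev/null ||
            ulimit -S -c "$(ulimit -H -c)" 2>/dev/null ||
            true
        "$@"
    )
}
```

Resource limits are inherited by child processes, so something like `with_cores_enabled make check` would let a postmaster started by the test run dump core, provided the platform writes core files somewhere the caller can find them.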
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD
Tom Lane wrote:
> Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
> > Tom Lane wrote:
> > > Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
> > > > ... Maybe something is causing a dramatic increase in memory usage
> > > > that is causing the random failures (in impalas case the OOM-killer
> > > > actually decides to terminate the postmaster) ?
> > >
> > > No, most all the failures I've looked at are sig11 not sig9.
> >
> > hmm - still weird and I would not actually consider impala a resource
> > starved box (especially when compared to other buildfarm-members) so
> > there seems to be something strange going on.
>
> Actually ... one way that a memory overconsumption bug could manifest as
> sig11 would be if it's a runaway-recursion issue: usually you get sig11
> when the machine's stack size limit is exceeded. This doesn't put us any
> closer to localizing the problem, but at least it's a guess about the
> cause?

that sounds like a possibility though I'm not too optimistic this is indeed
the cause of the problem we see.

> I wonder whether there's any way to get the buildfarm script to report a
> stack trace automatically if it finds a core file left behind in the
> $PGDATA directory after running the tests. Would something like this be
> adequately portable?
>
>     if [ -f $PGDATA/core* ]
>     then
>         echo bt | gdb $installdir/bin/postgres $PGDATA/core*
>     fi

hmmm - not sure I like that that much

> Obviously it'd fail if no gdb available, but that seems pretty harmless.
> The other thing that we'd likely need is an explicit ulimit -c unlimited
> for machines where core dumps are off by default.

there are other issues with that - gdb might be available but not actually
producing reliable results on certain platforms (some commercial unixes,
windows).
The thing we might want to do is have the buildfarm script override
keep_error_builds=0 conditionally in some cases (like detecting a core).
That way we will at least have a useful buildtree for later examination
(which would be removed even if we get a one-time stacktrace and
keep_error_builds is disabled)

Stefan
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD
Tom Lane wrote:
> Actually ... one way that a memory overconsumption bug could manifest as
> sig11 would be if it's a runaway-recursion issue: usually you get sig11
> when the machine's stack size limit is exceeded. This doesn't put us any
> closer to localizing the problem, but at least it's a guess about the
> cause?
>
> I wonder whether there's any way to get the buildfarm script to report a
> stack trace automatically if it finds a core file left behind in the
> $PGDATA directory after running the tests. Would something like this be
> adequately portable?
>
>     if [ -f $PGDATA/core* ]
>     then
>         echo bt | gdb $installdir/bin/postgres $PGDATA/core*
>     fi

gdb has a batch mode which can be useful:

    if [ -f $PGDATA/core* ]
    then
        gdb -ex bt --batch $installdir/bin/postgres $PGDATA/core*
    fi

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD
Alvaro Herrera wrote:
> Tom Lane wrote:
> > I wonder whether there's any way to get the buildfarm script to report
> > a stack trace automatically if it finds a core file left behind in the
> > $PGDATA directory after running the tests. Would something like this
> > be adequately portable?
> >
> >     if [ -f $PGDATA/core* ]
> >     then
> >         echo bt | gdb $installdir/bin/postgres $PGDATA/core*
> >     fi
>
> gdb has a batch mode which can be useful:
>
>     if [ -f $PGDATA/core* ]
>     then
>         gdb -ex bt --batch $installdir/bin/postgres $PGDATA/core*
>     fi

here's a quick untested patch for buildfarm that Stefan might like to try.

cheers

andrew

--- run_build.pl.orig 2006-12-28 17:32:14.0 -0500
+++ run_build.pl.new 2006-12-28 17:58:51.0 -0500
@@ -795,6 +795,29 @@
     $dbstarted=undef;
 }
 
+
+sub get_stack_trace
+{
+    my $bindir = shift;
+    my $pgdata = shift;
+
+    # no core = no result
+    return () unless -f "$pgdata/core";
+
+    # no gdb = no result
+    system "gdb --version > /dev/null 2>&1";
+    my $status = $? >> 8;
+    return () if $status;
+
+    my @trace = `gdb -ex bt --batch $bindir/postgres $pgdata/core 2>&1`;
+
+    unshift(@trace,
+        "\n\n== stack trace ==\n");
+
+    return @trace;
+
+}
+
 sub make_install_check
 {
     my @checkout = `cd $pgsql/src/test/regress && $make installcheck 2>&1`;
@@ -814,6 +837,11 @@
         }
         close($handle);
     }
+    if ($status)
+    {
+        my @trace = get_stack_trace("$installdir/bin","$installdir/data");
+        push(@checkout,@trace);
+    }
     writelog('install-check',\@checkout);
     print "make installcheck log ===\n",@checkout
         if ($verbose > 1);
@@ -839,6 +867,11 @@
         }
         close($handle);
     }
+    if ($status)
+    {
+        my @trace = get_stack_trace("$installdir/bin","$installdir/data");
+        push(@checkout,@trace);
+    }
     writelog('contrib-install-check',\@checkout);
     print "make contrib installcheck log ===\n",@checkout
         if ($verbose > 1);
@@ -864,6 +897,11 @@
         }
         close($handle);
     }
+    if ($status)
+    {
+        my @trace = get_stack_trace("$installdir/bin","$installdir/data");
+        push(@checkout,@trace);
+    }
     writelog('pl-install-check',\@checkout);
     print "make pl installcheck log ===\n",@checkout
         if ($verbose > 1);
@@ -892,6 +930,13 @@
         }
         close($handle);
     }
+    if ($status)
+    {
+        my @trace =
+            get_stack_trace("$pgsql/src/test/regress/install$installdir/bin",
+                            "$pgsql/src/test/regress/tmp_check/data");
+        push(@makeout,@trace);
+    }
     writelog('check',\@makeout);
     print "make check logs ===\n",@makeout
         if ($verbose > 1);
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD
Andrew Dunstan wrote:
> here's a quick untested patch for buildfarm that Stefan might like to
> try.

Note that not all core files are named core. On some Linux distros, it's
configured to be core.PID by default. And you can even change it to weirder
names, but I haven't seen those anywhere by default, so I guess supporting
just the common ones is appropriate.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD
Alvaro Herrera [EMAIL PROTECTED] writes:
> Andrew Dunstan wrote:
> > here's a quick untested patch for buildfarm that Stefan might like to
> > try.
>
> Note that not all core files are named core. On some Linux distros,
> it's configured to be core.PID by default.

And on some platforms, cores don't drop in the current working directory
... but until we have a problem that *only* manifests on such a platform, I
wouldn't worry about that. We do need to look for 'core*' not just 'core',
though.

Don't forget the ulimit point either ... on most Linuxen there won't be any
core at all without twiddling ulimit.

regards, tom lane
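Looking for 'core*' rather than just 'core' needs a little care in shell: a test such as `[ -f $PGDATA/core* ]` reports an error as soon as the pattern matches more than one file, because `[` then receives too many arguments. A minimal sketch of a safer check follows. The function names `find_cores` and `report_stack_traces` are hypothetical; the gdb invocation is the batch form quoted earlier in the thread, and the sketch assumes the core files land directly in the data directory.

```shell
#!/bin/sh
# Hypothetical helper: print each core file found directly under the given
# directory, one per line. Matches "core" as well as names like "core.1234".
find_cores() {
    for f in "$1"/core*; do
        # An unmatched glob stays a literal pattern, so test every name.
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Hypothetical helper: print a gdb backtrace for every core file found.
# usage: report_stack_traces POSTGRES_BINARY DATA_DIR
report_stack_traces() {
    command -v gdb >/dev/null 2>&1 || return 0   # no gdb = no result
    find_cores "$2" | while IFS= read -r core; do
        printf '\n== stack trace: %s ==\n' "$core"
        gdb -ex bt --batch "$1" "$core" 2>&1
    done
}
```

A call such as `report_stack_traces "$installdir/bin/postgres" "$PGDATA"` after a failed test run would append one backtrace per core file, and print nothing when there are no cores or gdb is not installed.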
[HACKERS] Recent SIGSEGV failures in buildfarm HEAD
Several of the buildfarm machines are exhibiting repeatable signal 11
crashes in what seem perfectly ordinary queries. This started about four
days ago so I suppose it's got something to do with my operator-families
patch :-( ... but I dunno what, and none of my own machines show the
failure. Can someone provide a stack trace?

regards, tom lane