Re: [HACKERS] Unhelpful debug tools on OS X :-(
Tom Lane wrote: Heikki Linnakangas <[EMAIL PROTECTED]> writes: But I did notice that we're not fsyncing the newly written relation like we should. Good point, but doesn't the analogy to copy_relation_data say that we should sync if not rd_istemp? Right. I don't know why I didn't copy that if-statement. It seems to be bedtime for me... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Unhelpful debug tools on OS X :-(
Heikki Linnakangas <[EMAIL PROTECTED]> writes: > But I did notice that we're not fsyncing the newly written relation like > we should. Good point, but doesn't the analogy to copy_relation_data say that we should sync if not rd_istemp? (This is my fault BTW; your original patch kept the data in shared buffers, so it wasn't subject to the problem.) > BTW: In tablecmds.c the new relation is fsynced with smgrimmedsync, not > heap_sync. That's okay since that routine is just copying the one table. TOAST is handled via recursion of ATExecSetTableSpace. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Unhelpful debug tools on OS X :-(
I wrote: > Thread 0 Crashed: > 0 postmaster 0x001af4ef smgrextend + 12 (smgr.c:485) > 1 postmaster 0x00029044 end_heap_rewrite + 208 (rewriteheap.c:278) > 2 postmaster 0x000bdc22 cluster_rel + 850 (cluster.c:806) > 3 postmaster 0x000be119 cluster + 160 (cluster.c:220) AFAICS, a crash inside smgrextend must indicate it was passed a junk SMgrRelation, which implies that the problem has to trace back to the new heap's rd_smgr getting closed since the rewriteheap operation started. I think the only possible path for that is an sinval queue overflow and consequent SI reset during the test. That would explain why it's such an intermittent failure, but it is curious that Darwin seems much more prone to this than other platforms. I wonder if Apple has tweaked the scheduler in a way that allows backends to not be given any cycles for long periods ... Anyway, patch committed; we'll see if things don't get more stable. Memo to self: run some tests with CLOBBER_CACHE_ALWAYS after the remaining major patches have landed. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Unhelpful debug tools on OS X :-(
Tom Lane wrote: Heikki Linnakangas <[EMAIL PROTECTED]> writes: Tom Lane wrote: Any suggestions how to extract some info out of this? Does OS X have the catchsegv tool? No, but I suddenly remembered about CrashReporter, and sure enough it's catching these crashes: Exception: EXC_BAD_ACCESS (0x0001) Codes: KERN_PROTECTION_FAILURE (0x0002) at 0x0010 Thread 0 Crashed: 0 postmaster 0x001af4ef smgrextend + 12 (smgr.c:485) 1 postmaster 0x00029044 end_heap_rewrite + 208 (rewriteheap.c:278) 2 postmaster 0x000bdc22 cluster_rel + 850 (cluster.c:806) 3 postmaster 0x000be119 cluster + 160 (cluster.c:220) 4 postmaster 0x001b74a8 PortalRunUtility + 233 (palloc.h:84) 5 postmaster 0x001b7784 PortalRunMulti + 237 (pquery.c:1271) 6 postmaster 0x001b80ae PortalRun + 918 (pquery.c:813) 7 postmaster 0x001b2afd exec_simple_query + 656 (postgres.c:965) 8 postmaster 0x001b4b0c PostgresMain + 5628 (postgres.c:3507) 9 postmaster 0x00183973 ServerLoop + 2828 (postmaster.c:2614) 10 postmaster 0x00184b1e PostmasterMain + 2794 (postmaster.c:972) 11 postmaster 0x00130f8e main + 1236 (main.c:188) 12 postmaster 0x1e86 _start + 216 13 postmaster 0x1dad start + 41 So it looks like this has got something to do with the MVCC-safe cluster changes, which is not too surprising considering it started happening around about then. Off to have a look ... I've been looking at the code for a few minutes as well, but haven't found an explanation for that yet. But I did notice that we're not fsyncing the newly written relation like we should. There's a comment raw_heap_insert: /* * Now write the page. We say isTemp = true even if it's not a * temp table, because there's no need for smgr to schedule an * fsync for this write; we'll do it ourselves before committing. */ smgrextend(state->rs_new_rel->rd_smgr, state->rs_blockno, (char *) page, true); That's copy-pasted from tablecmds.c. But unlike in tablecmds.c, end_heap_rewrite only fsyncs the new file if we're not WAL-logging. Proposed fix: Index: src/backend/access/heap/rewriteheap.c === RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/rewriteheap.c,v retrieving revision 1.1 diff -c -r1.1 rewriteheap.c *** src/backend/access/heap/rewriteheap.c 8 Apr 2007 01:26:27 - 1.1 --- src/backend/access/heap/rewriteheap.c 17 Apr 2007 20:50:05 - *** *** 272,282 } /* !* If not WAL-logging, must fsync before commit. We use heap_sync !* to ensure that the toast table gets fsync'd too. */ ! if (!state->rs_use_wal) ! heap_sync(state->rs_new_rel); /* Deleting the context frees everything */ MemoryContextDelete(state->rs_cxt); --- 272,284 } /* !* Must fsync before commit, even if we've WAL-logged the changes, !* because we've written pages outside the buffer manager. See comments!* in copy_relation_data in commands/tablecmds.c for more information. !* !* We use heap_sync to ensure that the toast table gets fsync'd too. */ ! heap_sync(state->rs_new_rel); /* Deleting the context frees everything */ MemoryContextDelete(state->rs_cxt); BTW: In tablecmds.c the new relation is fsynced with smgrimmedsync, not heap_sync. How about the toast table, it goes through shared buffers as usual, right? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Unhelpful debug tools on OS X :-(
Heikki Linnakangas <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> Any suggestions how to extract some info out of this? > Does OS X have the catchsegv tool? No, but I suddenly remembered about CrashReporter, and sure enough it's catching these crashes: Exception: EXC_BAD_ACCESS (0x0001) Codes: KERN_PROTECTION_FAILURE (0x0002) at 0x0010 Thread 0 Crashed: 0 postmaster 0x001af4ef smgrextend + 12 (smgr.c:485) 1 postmaster 0x00029044 end_heap_rewrite + 208 (rewriteheap.c:278) 2 postmaster 0x000bdc22 cluster_rel + 850 (cluster.c:806) 3 postmaster 0x000be119 cluster + 160 (cluster.c:220) 4 postmaster 0x001b74a8 PortalRunUtility + 233 (palloc.h:84) 5 postmaster 0x001b7784 PortalRunMulti + 237 (pquery.c:1271) 6 postmaster 0x001b80ae PortalRun + 918 (pquery.c:813) 7 postmaster 0x001b2afd exec_simple_query + 656 (postgres.c:965) 8 postmaster 0x001b4b0c PostgresMain + 5628 (postgres.c:3507) 9 postmaster 0x00183973 ServerLoop + 2828 (postmaster.c:2614) 10 postmaster 0x00184b1e PostmasterMain + 2794 (postmaster.c:972) 11 postmaster 0x00130f8e main + 1236 (main.c:188) 12 postmaster 0x1e86 _start + 216 13 postmaster 0x1dad start + 41 So it looks like this has got something to do with the MVCC-safe cluster changes, which is not too surprising considering it started happening around about then. Off to have a look ... regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Unhelpful debug tools on OS X :-(
Tom Lane wrote: Any suggestions how to extract some info out of this? Does OS X have the catchsegv tool? If you can run postmaster with that, you should get a backtrace when it crashes. Unless it has the same problem as gdb, of course.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Unhelpful debug tools on OS X :-(
Tom Lane wrote: > Dunno if anyone's noticed, but buildfarm member jackal has been crashing > intermittently in the "cluster" regression test. I found out that I can > reproduce that here on a Mac Mini (apparently the same or nearly same > hardware/software as jackal) --- just repeat the parallel tests often > enough, and it'll fail once in awhile. I've got a core file but gdb > is being no help at all: > > mini:~/pgsql/src/test/regress tgl$ gdb /Users/tgl/testversion/bin/postmaster > /cores/core.23609 > GNU gdb 6.3.50-20050815 (Apple version gdb-563) (Wed Jul 19 05:10:58 GMT 2006) > Copyright 2004 Free Software Foundation, Inc. > GDB is free software, covered by the GNU General Public License, and you are > welcome to change it and/or distribute copies of it under certain conditions. > Type "show copying" to see the conditions. > There is absolutely no warranty for GDB. Type "show warranty" for details. > This GDB was configured as "i386-apple-darwin"...Reading symbols for shared > libraries .. done > > Core was generated by `/Users/tgl/testversion/bin/postmaster'. > Core file contained no thread-specific data > > (gdb) bt > #0 0x in ?? () > (gdb) > > Any suggestions how to extract some info out of this? no idea on that one - but clownfish managed to fail the cluster test once too lately: http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=clownfish&dt=2007-04-09%2023:03:03 Stefan ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
[HACKERS] Unhelpful debug tools on OS X :-(
Dunno if anyone's noticed, but buildfarm member jackal has been crashing intermittently in the "cluster" regression test. I found out that I can reproduce that here on a Mac Mini (apparently the same or nearly same hardware/software as jackal) --- just repeat the parallel tests often enough, and it'll fail once in awhile. I've got a core file but gdb is being no help at all: mini:~/pgsql/src/test/regress tgl$ gdb /Users/tgl/testversion/bin/postmaster /cores/core.23609 GNU gdb 6.3.50-20050815 (Apple version gdb-563) (Wed Jul 19 05:10:58 GMT 2006) Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i386-apple-darwin"...Reading symbols for shared libraries .. done Core was generated by `/Users/tgl/testversion/bin/postmaster'. Core file contained no thread-specific data (gdb) bt #0 0x in ?? () (gdb) Any suggestions how to extract some info out of this? regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings