subject:"\[HACKERS\] Unhelpful debug tools on OS X \:\-\("

Re: [HACKERS] Unhelpful debug tools on OS X :-(

2007-04-17 Thread Heikki Linnakangas


Tom Lane wrote:

Heikki Linnakangas <[EMAIL PROTECTED]> writes:
But I did notice that we're not fsyncing the newly written relation like 
we should.


Good point, but doesn't the analogy to copy_relation_data say that we
should sync if not rd_istemp? 


Right. I don't know why I didn't copy that if-statement. It seems to be 
bedtime for me...


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq

Re: [HACKERS] Unhelpful debug tools on OS X :-(

2007-04-17 Thread Tom Lane

Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> But I did notice that we're not fsyncing the newly written relation like 
> we should.

Good point, but doesn't the analogy to copy_relation_data say that we
should sync if not rd_istemp?  (This is my fault BTW; your original
patch kept the data in shared buffers, so it wasn't subject to the
problem.)

> BTW: In tablecmds.c the new relation is fsynced with smgrimmedsync, not 
> heap_sync.

That's okay since that routine is just copying the one table.  TOAST is
handled via recursion of ATExecSetTableSpace.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] Unhelpful debug tools on OS X :-(

2007-04-17 Thread Tom Lane

I wrote:
> Thread 0 Crashed:
> 0   postmaster  0x001af4ef smgrextend + 12 (smgr.c:485)
> 1   postmaster  0x00029044 end_heap_rewrite + 208 (rewriteheap.c:278)
> 2   postmaster  0x000bdc22 cluster_rel + 850 (cluster.c:806)
> 3   postmaster  0x000be119 cluster + 160 (cluster.c:220)

AFAICS, a crash inside smgrextend must indicate it was passed a junk
SMgrRelation, which implies that the problem has to trace back to
the new heap's rd_smgr getting closed since the rewriteheap operation
started.  I think the only possible path for that is an sinval queue
overflow and consequent SI reset during the test.  That would explain
why it's such an intermittent failure, but it is curious that Darwin
seems much more prone to this than other platforms.  I wonder if Apple
has tweaked the scheduler in a way that allows backends to not be given
any cycles for long periods ...

Anyway, patch committed; we'll see if things don't get more stable.

Memo to self: run some tests with CLOBBER_CACHE_ALWAYS after the
remaining major patches have landed.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] Unhelpful debug tools on OS X :-(

2007-04-17 Thread Heikki Linnakangas


Tom Lane wrote:

Heikki Linnakangas <[EMAIL PROTECTED]> writes:

Tom Lane wrote:

Any suggestions how to extract some info out of this?



Does OS X have the catchsegv tool?


No, but I suddenly remembered about CrashReporter, and sure enough it's
catching these crashes:

Exception:  EXC_BAD_ACCESS (0x0001)
Codes:  KERN_PROTECTION_FAILURE (0x0002) at 0x0010

Thread 0 Crashed:
0   postmaster  0x001af4ef smgrextend + 12 (smgr.c:485)
1   postmaster  0x00029044 end_heap_rewrite + 208 (rewriteheap.c:278)
2   postmaster  0x000bdc22 cluster_rel + 850 (cluster.c:806)
3   postmaster  0x000be119 cluster + 160 (cluster.c:220)
4   postmaster  0x001b74a8 PortalRunUtility + 233 (palloc.h:84)
5   postmaster  0x001b7784 PortalRunMulti + 237 (pquery.c:1271)
6   postmaster  0x001b80ae PortalRun + 918 (pquery.c:813)
7   postmaster  0x001b2afd exec_simple_query + 656 (postgres.c:965)
8   postmaster  0x001b4b0c PostgresMain + 5628 (postgres.c:3507)
9   postmaster  0x00183973 ServerLoop + 2828 (postmaster.c:2614)
10  postmaster  0x00184b1e PostmasterMain + 2794 (postmaster.c:972)
11  postmaster  0x00130f8e main + 1236 (main.c:188)
12  postmaster  0x1e86 _start + 216
13  postmaster  0x1dad start + 41

So it looks like this has got something to do with the MVCC-safe cluster
changes, which is not too surprising considering it started happening
around about then.  Off to have a look ...


I've been looking at the code for a few minutes as well, but haven't 
found an explanation for that yet.


But I did notice that we're not fsyncing the newly written relation like 
we should. There's a comment raw_heap_insert:

/*
 * Now write the page. We say isTemp = true even if it's not a
 * temp table, because there's no need for smgr to schedule an
 * fsync for this write; we'll do it ourselves before committing.
 */
smgrextend(state->rs_new_rel->rd_smgr, state->rs_blockno,
   (char *) page, true);

That's copy-pasted from tablecmds.c. But unlike in tablecmds.c, 
end_heap_rewrite only fsyncs the new file if we're not WAL-logging. 
Proposed fix:


Index: src/backend/access/heap/rewriteheap.c
===
RCS file: 
/home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/rewriteheap.c,v

retrieving revision 1.1
diff -c -r1.1 rewriteheap.c
*** src/backend/access/heap/rewriteheap.c   8 Apr 2007 01:26:27 
-  1.1

--- src/backend/access/heap/rewriteheap.c   17 Apr 2007 20:50:05 -
***
*** 272,282 
}

/*
!* If not WAL-logging, must fsync before commit.  We use heap_sync
!* to ensure that the toast table gets fsync'd too.
 */
!   if (!state->rs_use_wal)
!   heap_sync(state->rs_new_rel);

/* Deleting the context frees everything */
MemoryContextDelete(state->rs_cxt);
--- 272,284 
}

/*
!* Must fsync before commit, even if we've WAL-logged the changes,
!* because we've written pages outside the buffer manager.  See 
comments!* in copy_relation_data in commands/tablecmds.c for 
more information.

!*
!* We use heap_sync to ensure that the toast table gets fsync'd too.
 */
!   heap_sync(state->rs_new_rel);

/* Deleting the context frees everything */
MemoryContextDelete(state->rs_cxt);


BTW: In tablecmds.c the new relation is fsynced with smgrimmedsync, not 
heap_sync. How about the toast table, it goes through shared buffers as 
usual, right?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate

Re: [HACKERS] Unhelpful debug tools on OS X :-(

2007-04-17 Thread Tom Lane

Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> Any suggestions how to extract some info out of this?

> Does OS X have the catchsegv tool?

No, but I suddenly remembered about CrashReporter, and sure enough it's
catching these crashes:

Exception:  EXC_BAD_ACCESS (0x0001)
Codes:  KERN_PROTECTION_FAILURE (0x0002) at 0x0010

Thread 0 Crashed:
0   postmaster  0x001af4ef smgrextend + 12 (smgr.c:485)
1   postmaster  0x00029044 end_heap_rewrite + 208 (rewriteheap.c:278)
2   postmaster  0x000bdc22 cluster_rel + 850 (cluster.c:806)
3   postmaster  0x000be119 cluster + 160 (cluster.c:220)
4   postmaster  0x001b74a8 PortalRunUtility + 233 (palloc.h:84)
5   postmaster  0x001b7784 PortalRunMulti + 237 (pquery.c:1271)
6   postmaster  0x001b80ae PortalRun + 918 (pquery.c:813)
7   postmaster  0x001b2afd exec_simple_query + 656 (postgres.c:965)
8   postmaster  0x001b4b0c PostgresMain + 5628 (postgres.c:3507)
9   postmaster  0x00183973 ServerLoop + 2828 (postmaster.c:2614)
10  postmaster  0x00184b1e PostmasterMain + 2794 (postmaster.c:972)
11  postmaster  0x00130f8e main + 1236 (main.c:188)
12  postmaster  0x1e86 _start + 216
13  postmaster  0x1dad start + 41

So it looks like this has got something to do with the MVCC-safe cluster
changes, which is not too surprising considering it started happening
around about then.  Off to have a look ...

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly

Re: [HACKERS] Unhelpful debug tools on OS X :-(

2007-04-17 Thread Heikki Linnakangas


Tom Lane wrote:

Any suggestions how to extract some info out of this?


Does OS X have the catchsegv tool? If you can run postmaster with that, 
you should get a backtrace when it crashes. Unless it has the same 
problem as gdb, of course..


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] Unhelpful debug tools on OS X :-(

2007-04-17 Thread Stefan Kaltenbrunner

Tom Lane wrote:
> Dunno if anyone's noticed, but buildfarm member jackal has been crashing
> intermittently in the "cluster" regression test.  I found out that I can
> reproduce that here on a Mac Mini (apparently the same or nearly same
> hardware/software as jackal) --- just repeat the parallel tests often
> enough, and it'll fail once in awhile.  I've got a core file but gdb
> is being no help at all:
> 
> mini:~/pgsql/src/test/regress tgl$ gdb /Users/tgl/testversion/bin/postmaster 
> /cores/core.23609
> GNU gdb 6.3.50-20050815 (Apple version gdb-563) (Wed Jul 19 05:10:58 GMT 2006)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "i386-apple-darwin"...Reading symbols for shared 
> libraries .. done
> 
> Core was generated by `/Users/tgl/testversion/bin/postmaster'.
> Core file contained no thread-specific data
> 
> (gdb) bt
> #0  0x in ?? ()
> (gdb)
> 
> Any suggestions how to extract some info out of this?

no idea on that one - but clownfish managed to fail the cluster test
once too lately:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=clownfish&dt=2007-04-09%2023:03:03


Stefan

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

[HACKERS] Unhelpful debug tools on OS X :-(

2007-04-17 Thread Tom Lane

Dunno if anyone's noticed, but buildfarm member jackal has been crashing
intermittently in the "cluster" regression test.  I found out that I can
reproduce that here on a Mac Mini (apparently the same or nearly same
hardware/software as jackal) --- just repeat the parallel tests often
enough, and it'll fail once in awhile.  I've got a core file but gdb
is being no help at all:

mini:~/pgsql/src/test/regress tgl$ gdb /Users/tgl/testversion/bin/postmaster 
/cores/core.23609
GNU gdb 6.3.50-20050815 (Apple version gdb-563) (Wed Jul 19 05:10:58 GMT 2006)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-apple-darwin"...Reading symbols for shared 
libraries .. done

Core was generated by `/Users/tgl/testversion/bin/postmaster'.
Core file contained no thread-specific data

(gdb) bt
#0  0x in ?? ()
(gdb)

Any suggestions how to extract some info out of this?

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] Unhelpful debug tools on OS X :-(

Re: [HACKERS] Unhelpful debug tools on OS X :-(

Re: [HACKERS] Unhelpful debug tools on OS X :-(

Re: [HACKERS] Unhelpful debug tools on OS X :-(

Re: [HACKERS] Unhelpful debug tools on OS X :-(

Re: [HACKERS] Unhelpful debug tools on OS X :-(

Re: [HACKERS] Unhelpful debug tools on OS X :-(

[HACKERS] Unhelpful debug tools on OS X :-(

8 matches

Site Navigation

Mail list logo

Footer information