Re: [HACKERS] catalog corruption bug

2006-01-09 Thread Jeremy Drake
On Sun, 8 Jan 2006, Tom Lane wrote:

 Yeah, that's not very surprising.  Running the forced-cache-resets
 function will definitely expose that catcache bug pretty quickly.
 You'd need to apply the patches I put in yesterday to have a system
 that has any chance of withstanding that treatment for any length of time.

  I think I am going to just run without the function running this time and
  see if it does the duplicate type error and if it will generate two cores.

I ran without that function you made, and it got the error, but not a
crash.  I stuck an Assert(false) right before the ereport for that
particular error, and I did end up with a core there, but I don't see
anything out of the ordinary (what little I know of the ordinary ;)

#0  0x2b8a0cf9 in kill () from /usr/lib64/libc.so.6
#1  0x2b8a0a3d in raise () from /usr/lib64/libc.so.6
#2  0x2b8a1c82 in abort () from /usr/lib64/libc.so.6
#3  0x005f9878 in ExceptionalCondition (
conditionName=0x2c53 <Address 0x2c53 out of bounds>,
errorType=0x6 <Address 0x6 out of bounds>, fileName=0x0,
lineNumber=-1)
at assert.c:51
#4  0x00460967 in _bt_doinsert (rel=0x2ab05568,
btitem=0xbec2c0,
index_is_unique=1 '\001', heapRel=0x8bf0f0) at nbtinsert.c:247
#5  0x00463773 in btinsert (fcinfo=0x2c53) at nbtree.c:228
#6  0x005fe869 in FunctionCall6 (flinfo=0x8, arg1=6, arg2=0,
arg3=18446744073709551615, arg4=0, arg5=0, arg6=0) at fmgr.c:1267
#7  0x0045bf4f in index_insert (indexRelation=0x2ab05568,
values=0x7fdfde20, isnull=0x7fdfde00 , heap_t_ctid=0xbebeac,
heapRelation=0x8bf0f0, check_uniqueness=1 '\001') at indexam.c:215
#8  0x0048f8fa in CatalogIndexInsert (indstate=0x2c53,
heapTuple=0xbebb88) at indexing.c:124
#9  0x0048f994 in CatalogUpdateIndexes (heapRel=0x2c53,
heapTuple=0xbebea8) at indexing.c:149
#10 0x0049bc67 in TypeCreate (typeName=0x7fdfe3e0 "push_tmp",
typeNamespace=11057063, relationOid=12171371, relationKind=114 'r',
internalSize=-16728, typeType=99 'c', typDelim=44 ',',
inputProcedure=2290, outputProcedure=2291, receiveProcedure=2402,
sendProcedure=2403, analyzeProcedure=0, elementType=0, baseType=0,
defaultTypeValue=0x0, defaultTypeBin=0x0, passedByValue=-16 '',
alignment=100 'd', storage=120 'x', typeMod=-1, typNDims=0,
typeNotNull=0 '\0') at pg_type.c:316
#11 0x0048c361 in heap_create_with_catalog (
relname=0x7fdfe3e0 "push_tmp", relnamespace=11057063,
reltablespace=0, relid=12171371, ownerid=16384, tupdesc=0xbeb8e8,
relkind=114 'r', shared_relation=0 '\0', oidislocal=0 '\0',
oidinhcount=0,
oncommit=ONCOMMIT_DROP, allow_system_table_mods=0 '\0') at heap.c:634
#12 0x004de220 in DefineRelation (stmt=0x93fc30, relkind=114 'r')
at tablecmds.c:423
#13 0x0058bfd0 in ProcessUtility (parsetree=0x93fc30, params=0x0,
dest=0x814b40, completionTag=0x0) at utility.c:497
#14 0x00515cb5 in _SPI_execute_plan (plan=0x93f9a8,
Values=0x9c5798,
Nulls=0x9c57b8 ~, '\177' repeats 199 times..., snapshot=0x0,
crosscheck_snapshot=0x0, read_only=0 '\0', tcount=0) at spi.c:1449
#15 0x005165fc in SPI_execute_plan (plan=0x93f9a8,
Values=0x9c5798,
Nulls=0x9c57b8 ~, '\177' repeats 199 times..., read_only=0 '\0',
tcount=0) at spi.c:336
#16 0x2c95d8a4 in exec_stmts (estate=0x7fdfe950, stmts=0x6)
at pl_exec.c:2280
#17 0x2c95ebc2 in exec_stmt_block (estate=0x7fdfe950,
block=0x8f2c70) at pl_exec.c:936
#18 0x2c95f5ab in plpgsql_exec_function (func=0x913bc8,
fcinfo=0x7fdfea90) at pl_exec.c:286
#19 0x2c9573f5 in plpgsql_call_handler (fcinfo=0x7fdfea90)
at pl_handler.c:123
#20 0x00501a74 in ExecMakeFunctionResult (fcache=0x90a7f0,
econtext=0x90a6c0, isNull=0x90ae38
\177~\177\177\177\177\177\177!\006,
isDone=0x90aef0) at execQual.c:1095
#21 0x00505543 in ExecProject (projInfo=0x90ae58,
isDone=0x7fdfeef4) at execQual.c:3669
#22 0x0050ff5a in ExecResult (node=0x90a5a8) at nodeResult.c:157
#23 0x0050034d in ExecProcNode (node=0x90a5a8) at
execProcnode.c:306
#24 0x004ff5ea in ExecutorRun (queryDesc=0x90a5a8,
direction=ForwardScanDirection, count=0) at execMain.c:1110
#25 0x0058a5de in PortalRunSelect (portal=0x8e6c68, forward=1
'\001',
count=0, dest=0x8dad30) at pquery.c:794
#26 0x0058abdf in PortalRun (portal=0x8e6c68,
count=9223372036854775807, dest=0x8dad30, altdest=0x8dad30,
completionTag=0x7fdff320 ) at pquery.c:646
#27 0x00588fcb in PostgresMain (argc=9333864, argv=0x8dac18,
username=0x8853f0 jeremyd) at postgres.c:1754
#28 0x0055e20a in ServerLoop () at postmaster.c:2853
#29 0x0055f9f9 in PostmasterMain (argc=3, argv=0x8832e0)
at postmaster.c:943
#30 0x0051fb83 in main (argc=3, argv=0x8832e0) at main.c:256


 Please also look at putting together a test case so others can poke at this.

Re: [HACKERS] catalog corruption bug

2006-01-09 Thread Tom Lane
Jeremy Drake [EMAIL PROTECTED] writes:
 I ran without that function you made, and it got the error, but not a
 crash.  I stuck an Assert(false) right before the ereport for that
 particular error, and I did end up with a core there, but I don't see
 anything out of the ordinary (what little I know of the ordinary ;)

Yeah, that's just the CREATE TEMP TABLE doing what it's supposed to do.
The problem is presumably that a prior DROP operation failed to remove
the pg_type row associated with a previous temp table of the same name
... but why that would happen is still really unclear.

Does your application drop these temp tables explicitly, or leave them
to be dropped automatically during commit?  It might be interesting to
see whether changing that makes any difference.  I'm also curious
whether the transaction that makes the temp table is ever rolled back
instead of committed.

regards, tom lane



Re: [HACKERS] catalog corruption bug

2006-01-09 Thread Jeremy Drake
On Mon, 9 Jan 2006, Tom Lane wrote:

 Does your application drop these temp tables explicitly, or leave them
 to be dropped automatically during commit?  It might be interesting to
 see whether changing that makes any difference.

I drop them explicitly at the end of the function.

 I'm also curious
 whether the transaction that makes the temp table is ever rolled back
 instead of committed.

Not intentionally/explicitly.  The only time it should roll back is if it
gets an error (which tends to be this error).  I do sometimes hit ^C on
the perl scripts to tweak something, which would roll it back if it was in
this particular code, but I don't think I did that on the last run at least.


   regards, tom lane



-- 
Every little picofarad has a nanohenry all its own.
-- Don Vonada



Re: [HACKERS] catalog corruption bug

2006-01-08 Thread Tom Lane
Jeremy Drake [EMAIL PROTECTED] writes:
 On Sat, 7 Jan 2006, Tom Lane wrote:
 A bit of a leap in the dark, but: maybe the triggering event for this
 situation is not a VACUUM pg_amop but a global cache reset due to
 sinval message buffer overrun.

 I tried that function you sent, while running my other code.  It died, but
 not the same way.  None of my processes had the unique constraint error,
 but two had failed during commit.  Both of them died in that same place as
 the last one, on pg_amop.

Yeah, that's not very surprising.  Running the forced-cache-resets
function will definitely expose that catcache bug pretty quickly.
You'd need to apply the patches I put in yesterday to have a system
that has any chance of withstanding that treatment for any length of time.

 I think I am going to just run without the function running this time and
 see if it does the duplicate type error and if it will generate two cores.

Please also look at putting together a test case so others can poke at
this.

regards, tom lane



Re: [HACKERS] catalog corruption bug

2006-01-07 Thread Jeremy Drake
On Fri, 6 Jan 2006, Tom Lane wrote:

 OK, this must be a different issue then.  I think we have seen reports
 like this one before, but not been able to reproduce it.

 Could you rebuild with Asserts enabled and see if any asserts trigger?

I got an assert to fail.  I'm not entirely sure if this is helpful, but I
managed to get a core dump with --enable-debug and --enable-cassert (with
optimizations still on).  Let me know if there is anything else that would
be useful to get out of this core file...

(gdb) bt
#0  0x2b8a0cf9 in kill () from /usr/lib64/libc.so.6
#1  0x2b8a0a3d in raise () from /usr/lib64/libc.so.6
#2  0x2b8a1c82 in abort () from /usr/lib64/libc.so.6
#3  0x005f9838 in ExceptionalCondition (
conditionName=0x1abc <Address 0x1abc out of bounds>,
errorType=0x6 <Address 0x6 out of bounds>, fileName=0x0,
lineNumber=-1)
at assert.c:51
#4  0x005eec5d in AtEOXact_CatCache (isCommit=-68 '')
at catcache.c:575
#5  0x0047464f in CommitTransaction () at xact.c:1573
#6  0x00474d07 in CommitTransactionCommand () at xact.c:2184
#7  0x005581da in AutoVacMain (argc=6844, argv=0x6) at
autovacuum.c:688
#8  0x00558907 in autovac_start () at autovacuum.c:170
#9  0x0055e66b in ServerLoop () at postmaster.c:1269
#10 0x0055f9b9 in PostmasterMain (argc=3, argv=0x8832e0)
at postmaster.c:943
#11 0x0051fb43 in main (argc=3, argv=0x8832e0) at main.c:256
(gdb) frame 4
#4  0x005eec5d in AtEOXact_CatCache (isCommit=-68 '')
at catcache.c:575
575 Assert(!ct->dead);
(gdb) l
570 {
571 CatCTup    *ct = (CatCTup *) DLE_VAL(elt);
572
573 Assert(ct->ct_magic == CT_MAGIC);
574 Assert(ct->refcount == 0);
575 Assert(!ct->dead);
576 }
577 }
578 #endif
579 }
(gdb) set print pretty
(gdb) p *ct
$1 = {
  ct_magic = 1462113538,
  my_cache = 0x2aac3060,
  lrulist_elem = {
dle_next = 0x0,
dle_prev = 0x939ab0,
dle_val = 0x2ab19e18,
dle_list = 0x93b1a8
  },
  cache_elem = {
dle_next = 0x0,
dle_prev = 0x934b58,
dle_val = 0x2ab19e18,
dle_list = 0x2aac36c8
  },
  c_list = 0x0,
  refcount = 0,
  dead = 1 '\001',
  negative = 0 '\0',
  hash_value = 15438,
  tuple = {
t_len = 48,
t_self = {
  ip_blkid = {
bi_hi = 0,
bi_lo = 0
  },
  ip_posid = 70
},
t_tableOid = 2602,
t_datamcxt = 0x914998,
t_data = 0x2ab19f30
  }
}




Re: [HACKERS] catalog corruption bug

2006-01-07 Thread Tom Lane
Jeremy Drake [EMAIL PROTECTED] writes:
 I got an assert to fail.  I'm not entirely sure if this is helpful, but I
 managed to get a core dump with --enable-debug and --enable-cassert (with
 optimizations still on).  Let me know if there is anything else that would
 be useful to get out of this core file...

Fascinating --- that's not anywhere near where I thought your problem
was.  Which cache is this tuple in?  (Print *ct->my_cache)

The tableOid implies it's one of the caches on pg_amop, which makes
the whole thing stranger yet.  pg_amop doesn't change during normal
operation so there's no reason for one of its tuples to become dead.
You aren't creating/deleting operator classes in this database are
you?

regards, tom lane



Re: [HACKERS] catalog corruption bug

2006-01-07 Thread Jeremy Drake
On Sat, 7 Jan 2006, Tom Lane wrote:

 Fascinating --- that's not anywhere near where I thought your problem
 was.  Which cache is this tuple in?  (Print *ct->my_cache)

$2 = {
  id = 3,
  cc_next = 0x2aac1048,
  cc_relname = 0x2ab19df8 pg_amop,
  cc_reloid = 2602,
  cc_indexoid = 2654,
  cc_relisshared = 0 '\0',
  cc_tupdesc = 0x2ab199e0,
  cc_reloidattr = 0,
  cc_ntup = 3,
  cc_nbuckets = 256,
  cc_nkeys = 2,
  cc_key = {5, 1, 0, 0},
  cc_hashfunc = {0x44e1a0 hashoid, 0x44e1a0 hashoid, 0, 0},
  cc_skey = {{
  sk_flags = 0,
  sk_attno = 5,
  sk_strategy = 3,
  sk_subtype = 0,
  sk_func = {
fn_addr = 0x5bb8c0 oideq,
fn_oid = 184,
fn_nargs = 2,
fn_strict = 1 '\001',
fn_retset = 0 '\0',
fn_extra = 0x0,
fn_mcxt = 0x914998,
fn_expr = 0x0
  },
  sk_argument = 0
}, {
  sk_flags = 0,
  sk_attno = 1,
  sk_strategy = 3,
  sk_subtype = 0,
  sk_func = {
fn_addr = 0x5bb8c0 oideq,
fn_oid = 184,
fn_nargs = 2,
fn_strict = 1 '\001',
fn_retset = 0 '\0',
fn_extra = 0x0,
fn_mcxt = 0x914998,
fn_expr = 0x0
  },
  sk_argument = 0
}, {
  sk_flags = 0,
  sk_attno = 0,
  sk_strategy = 0,
  sk_subtype = 0,
  sk_func = {
fn_addr = 0,
fn_oid = 0,
fn_nargs = 0,
fn_strict = 0 '\0',
fn_retset = 0 '\0',
fn_extra = 0x0,
fn_mcxt = 0x0,
fn_expr = 0x0
  },
  sk_argument = 0
}, {
  sk_flags = 0,
  sk_attno = 0,
  sk_strategy = 0,
  sk_subtype = 0,
  sk_func = {
fn_addr = 0,
fn_oid = 0,
fn_nargs = 0,
fn_strict = 0 '\0',
fn_retset = 0 '\0',
fn_extra = 0x0,
fn_mcxt = 0x0,
fn_expr = 0x0
  },
  sk_argument = 0
}},
  cc_isname = \000\000\000,
  cc_lists = {
dll_head = 0x935018,
dll_tail = 0x934c50
  },
  cc_bucket = {{
  dll_head = 0x0,
  dll_tail = 0x0
}}
}

Am I correct in interpreting this as the hash opclass for Oid?  That's
really bizarre.  Definitely didn't change that.

 The tableOid implies it's one of the caches on pg_amop, which makes
 the whole thing stranger yet.  pg_amop doesn't change during normal
 operation so there's no reason for one of its tuples to become dead.
 You aren't creating/deleting operator classes in this database are
 you?

Nope.  As a matter of fact, I never created any custom operator classes in
this database.


   regards, tom lane



-- 
Given my druthers, I'd druther not.



Re: [HACKERS] catalog corruption bug

2006-01-07 Thread Tom Lane
Jeremy Drake [EMAIL PROTECTED] writes:
 Am I correct in interpreting this as the hash opclass for Oid?

No, it's the AMOPOPID catalog cache (containing rows from pg_amop
indexed by amopopr/amopclaid).

After digging around for a bit I noticed that catalog caches get
flushed if someone vacuums the associated catalog.  I have a
theory now: somebody vacuumed pg_amop while this process had an
open CatCList of entries from this catcache.  ResetCatalogCache
would mark both the CatCList and its member entries as dead,
because they can't be released while the CatCList has a positive
refcount.  When ReleaseCatCacheList is called, it would in turn
call CatCacheRemoveCList ... which would remove the list, but
it doesn't remove the member entries.  That's a bug, because the
entries should go away if they're dead and no longer have any
reference counts keeping them in the zombie state.
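
A rough sketch of the kind of fix being described follows: when the list goes
away, sweep its members and remove any that are dead and unreferenced.  The
CatCTup field names (c_list, dead, refcount) match the gdb dump earlier in the
thread; the loop shape and helper calls are assumptions about the 8.1-era
catcache.c, not the committed patch.

static void
CatCacheRemoveCList(CatCache *cache, CatCList *cl)
{
	int			i;

	Assert(cl->refcount == 0);
	Assert(cl->my_cache == cache);

	/* delink each member tuple from the list ... */
	for (i = cl->n_members; --i >= 0;)
	{
		CatCTup    *ct = cl->members[i];

		Assert(ct->c_list == cl);
		ct->c_list = NULL;

		/*
		 * ... and if the member was marked dead while the list kept it
		 * pinned, and nothing else references it, remove it now rather
		 * than leaving the zombie entry that AtEOXact_CatCache trips over.
		 */
		if (ct->dead && ct->refcount == 0)
			CatCacheRemoveCTup(cache, ct);
	}

	/* finally unlink and free the list header itself */
	DLRemove(&cl->cache_elem);
	pfree(cl);
}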

However, AFAICS the only consequence of this bug is to trigger
that Assert failure if you've got Asserts enabled.  Dead catcache
entries aren't actually harmful except for wasting some space.
So I don't think this is related to your pg_type duplicate key
problem.

One weak spot in this theory is the assumption that somebody was
vacuuming pg_amop.  It seems unlikely that autovacuum would do so
since the table never changes (unless you had reached the point
where an anti-XID-wraparound vacuum was needed, which is unlikely
in itself).  Do you have any background processes that do full-database
VACUUMs?

I'll go fix CatCacheRemoveCList, but I think this is not the bug
we're looking for.

regards, tom lane



Re: [HACKERS] catalog corruption bug

2006-01-07 Thread Jeremy Drake
On Sat, 7 Jan 2006, Tom Lane wrote:

 Jeremy Drake [EMAIL PROTECTED] writes:
  Am I correct in interpreting this as the hash opclass for Oid?

 However, AFAICS the only consequence of this bug is to trigger
 that Assert failure if you've got Asserts enabled.  Dead catcache
 entries aren't actually harmful except for wasting some space.
 So I don't think this is related to your pg_type duplicate key
 problem.

 One weak spot in this theory is the assumption that somebody was
 vacuuming pg_amop.  It seems unlikely that autovacuum would do so
 since the table never changes (unless you had reached the point
 where an anti-XID-wraparound vacuum was needed, which is unlikely
 in itself).  Do you have any background processes that do full-database
 VACUUMs?

No.  Just the autovacuum, which is actually the process which had the
assert failure.

This appears to give the current xid
(gdb) p *s
$10 = {
  transactionId = 13568516,
  subTransactionId = 1,
  name = 0x0,
  savepointLevel = 0,
  state = TRANS_COMMIT,
  blockState = TBLOCK_STARTED,
  nestingLevel = 1,
  curTransactionContext = 0x9529c0,
  curTransactionOwner = 0x92eb40,
  childXids = 0x0,
  currentUser = 0,
  prevXactReadOnly = 0 '\0',
  parent = 0x0
}


 I'll go fix CatCacheRemoveCList, but I think this is not the bug
 we're looking for.

Incidentally, one of my processes did get that error at the same time.
All of the other processes had an error
DBD::Pg::st execute failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

But this one had the DBD::Pg::st execute failed: ERROR:  duplicate key
violates unique constraint pg_type_typname_nsp_index

It looks like my kernel did not have the option to append the pid to core
files, so perhaps they both croaked at the same time but only this one got
to write a core file?

I will enable this and try again, see if I can't get it to make 2 cores.

BTW, nothing of any interest made it into the backend log regarding what
assert(s) failed.




Re: [HACKERS] catalog corruption bug

2006-01-07 Thread Tom Lane
Jeremy Drake [EMAIL PROTECTED] writes:
 On Sat, 7 Jan 2006, Tom Lane wrote:
 I'll go fix CatCacheRemoveCList, but I think this is not the bug
 we're looking for.

 Incidentally, one of my processes did get that error at the same time.
 All of the other processes had an error
 DBD::Pg::st execute failed: server closed the connection unexpectedly
 This probably means the server terminated abnormally
 before or while processing the request.
 But this one had the DBD::Pg::st execute failed: ERROR:  duplicate key
 violates unique constraint pg_type_typname_nsp_index

Oh, that's interesting ... maybe there is some relation after all?
Hard to see what ...

 It looks like my kernel did not have the option to append the pid to core
 files, so perhaps they both croaked at the same time but only this one got
 to write a core file?

Yeah, they'd all be dumping into the same directory.  It's reasonable to
suppose that the corefile you have is from the one that aborted last.
That would suggest that this is effect not cause ... hmmm ...

A bit of a leap in the dark, but: maybe the triggering event for this
situation is not a VACUUM pg_amop but a global cache reset due to
sinval message buffer overrun.  It's fairly clear how that would lead
to the CatCacheRemoveCList bug.  The duplicate-key failure could be an
unrelated bug triggered by the same condition.  I have no idea yet what
the mechanism could be, but cache reset is a sufficiently seldom-exercised
code path that it's entirely plausible that there are bugs lurking in it.

If this is correct then we could vastly increase the probability of
seeing the bug by setting up something to force cache resets at a high
rate.  If you're interested I could put together a code patch for that.
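
(The patch itself is not preserved in this thread.  As a hypothetical stand-in
for anyone trying to reproduce the problem, a SQL-callable C function along the
following lines can discard the calling backend's catalog and relation caches,
roughly what a backend does after a sinval reset.  ResetCatalogCaches() and
RelationCacheInvalidate() are the assumed backend entry points here; forcing a
genuine shared sinval-buffer overrun would instead need a server-side patch.)

/*
 * Hypothetical test helper, not the patch referred to above: discard the
 * calling backend's system caches on demand.  Build it as a loadable
 * module, register it with CREATE FUNCTION ... LANGUAGE C, and call it
 * frequently from the same sessions that run the failing workload.
 */
#include "postgres.h"
#include "fmgr.h"
#include "utils/catcache.h"
#include "utils/relcache.h"

PG_FUNCTION_INFO_V1(force_cache_reset);

Datum
force_cache_reset(PG_FUNCTION_ARGS)
{
	/* throw away every catcache entry (pg_type, pg_amop, ... caches) */
	ResetCatalogCaches();

	/* mark all relcache entries to be rebuilt on next access */
	RelationCacheInvalidate();

	PG_RETURN_VOID();
}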

 BTW, nothing of any interest made it into the backend log regarding what
 assert(s) failed.

What you'd be looking for is a line starting TRAP:.

regards, tom lane



Re: [HACKERS] catalog corruption bug

2006-01-07 Thread Jeremy Drake
On Sat, 7 Jan 2006, Tom Lane wrote:

 Jeremy Drake [EMAIL PROTECTED] writes:
  On Sat, 7 Jan 2006, Tom Lane wrote:
  I'll go fix CatCacheRemoveCList, but I think this is not the bug
  we're looking for.

 A bit of a leap in the dark, but: maybe the triggering event for this
 situation is not a VACUUM pg_amop but a global cache reset due to
 sinval message buffer overrun.  It's fairly clear how that would lead
 to the CatCacheRemoveCList bug.  The duplicate-key failure could be an
 unrelated bug triggered by the same condition.  I have no idea yet what
 the mechanism could be, but cache reset is a sufficiently seldom-exercised
 code path that it's entirely plausible that there are bugs lurking in it.

 If this is correct then we could vastly increase the probability of
 seeing the bug by setting up something to force cache resets at a high
 rate.  If you're interested I could put together a code patch for that.

I tried that function you sent, while running my other code.  It died, but
not the same way.  None of my processes had the unique constraint error,
but two had failed during commit.  Both of them died in that same place as
the last one, on pg_amop.

I think I am going to just run without the function running this time and
see if it does the duplicate type error and if it will generate two cores.




-- 
To kick or not to kick...
-- Somewhere on IRC, inspired by Shakespeare



Re: [HACKERS] catalog corruption bug

2006-01-06 Thread Jeremy Drake
On Thu, 5 Jan 2006, Tom Lane wrote:

 The ReadBuffer bug I just fixed could result in disappearance of catalog
 rows, so this observation is consistent with the theory that that's
 what's biting you.  It's not proof though...

Well, I applied that patch that you sent me the link to (the bufmgr.c
one), and rebuilt (PORTDIR_OVERLAY is cool...)

I ran my nine processes which hammer things overnight, and in the
morning one of them was dead.

DBD::Pg::st execute failed: ERROR:  duplicate key violates unique
constraint pg_type_typname_nsp_index
CONTEXT:  SQL statement CREATE TEMPORARY TABLE push_tmp (val text) ON
COMMIT DROP
PL/pgSQL function push_func line 6 at SQL statement
DBD::Pg::st execute failed: ERROR:  duplicate key violates unique
constraint pg_type_typname_nsp_index
CONTEXT:  SQL statement CREATE TEMPORARY TABLE push_tmp (val text) ON
COMMIT DROP
PL/pgSQL function push_func line 6 at SQL statement


I also write out the time as my processes progress, so I know roughly when
it happened too.  It happened at 1136534029 (that's the result of the perl time()
function), which when run through localtime() yields Thu Jan  5 23:53:49
2006.  It looks like I started the processes at about 18:30, so they
lasted a while at least.

Let me know if there is anything else I can try to help debug this
(asserts on?).



Re: [HACKERS] catalog corruption bug

2006-01-06 Thread Tom Lane
Jeremy Drake [EMAIL PROTECTED] writes:
 Well, I applied that patch that you sent me the link to (the bufmgr.c
 one), and rebuilt (PORTDIR_OVERLAY is cool...)

 I ran my nine processes which hammer things overnight, and in the
 morning one of them was dead.

 DBD::Pg::st execute failed: ERROR:  duplicate key violates unique
 constraint pg_type_typname_nsp_index

Hm, did you REINDEX things beforehand?  This could be leftover corruption...

regards, tom lane



Re: [HACKERS] catalog corruption bug

2006-01-06 Thread Jeremy Drake
On Fri, 6 Jan 2006, Tom Lane wrote:

 Jeremy Drake [EMAIL PROTECTED] writes:
  Well, I applied that patch that you sent me the link to (the bufmgr.c
  one), and rebuilt (PORTDIR_OVERLAY is cool...)

  I ran my nine processes which hammer things overnight, and in the
  morning one of them was dead.

  DBD::Pg::st execute failed: ERROR:  duplicate key violates unique
  constraint pg_type_typname_nsp_index

 Hm, did you REINDEX things beforehand?  This could be leftover corruption...

Yes.  I ran that VACUUM FULL ANALYZE VERBOSE which I emailed part of the
excerpt from, and then I started a standalone backend (postgres -D data
-P) and ran REINDEX SYSTEM dbname on the database in question.



Re: [HACKERS] catalog corruption bug

2006-01-06 Thread Tom Lane
Jeremy Drake [EMAIL PROTECTED] writes:
 DBD::Pg::st execute failed: ERROR:  duplicate key violates unique
 constraint pg_type_typname_nsp_index
 
 Hm, did you REINDEX things beforehand?  This could be leftover corruption...

 Yes.  I ran that VACUUM FULL ANALYZE VERBOSE which I emailed part of the
 excerpt from, and then I started a standalone backend (postgres -D data
 -P) and ran REINDEX SYSTEM dbname on the database in question.

OK, this must be a different issue then.  I think we have seen reports
like this one before, but not been able to reproduce it.

Could you rebuild with Asserts enabled and see if any asserts trigger?
Also, at this point it'd be worth trying to boil it down to a test case
you can give to other people.

regards, tom lane



Re: [HACKERS] catalog corruption bug

2006-01-05 Thread Tom Lane
Jeremy Drake [EMAIL PROTECTED] writes:
 We have encountered a very nasty but apparently rare bug which appears to
 result in catalog corruption.

I've been fooling around with this report today.  In several hours of
trying, I've been able to get one Assert failure from running Jeremy's
example on CVS tip.  (I would've given up long ago, except the Assert
happened very soon after I started trying...)  The assert was from this
line in hio.c:

Assert(PageIsNew((PageHeader) pageHeader));

which we've seen before in connection with the
vacuum-vs-relation-extension race condition found last May.  It seems
we still have an issue of that sort :-(.

While fruitlessly waiting for the test to fail again, I've been combing
through the code looking for possible failure paths, and I've found
something that might explain it.  I think this is definitely a bug even
if it isn't what's happening in Jeremy's test:

mdread() is defined to not fail, but silently return a page of zeroes,
if asked to read a page that's at or beyond the end of the table file.
(As noted therein, this seems like pretty sucky design, but there are
various reasons that make it hard to change the behavior.)

Therefore, if for some reason a process tries to read the page just at
EOF, it will leave behind a buffer pool entry that is marked BM_VALID
but contains zeroes.  There are a number of scenarios that could cause
this, but all seem rather low-probability.  One way is if a process'
rd_targblock field for a relation is pointing at the last page of the
file and then VACUUM comes along and truncates off that page because
it's empty.  The next insertion attempt by the process will try to
fetch that page, obtain all-zeroes, decide the page has no free space
(PageGetFreeSpace is carefully coded to be sure that happens), and
go looking for free space elsewhere.

Now suppose someone tries to obtain a new page in the relation by
calling ReadBuffer(rel, P_NEW).  The location of the new page is
determined by asking lseek() how long the file is.  ReadBuffer then
obtains a buffer for that file offset --- and it is going to get a
hit on the all-zeroes buffer previously left behind.  Since the buffer
is marked BM_VALID, the test "if it was already in the buffer pool,
we're done" succeeds and the buffer is returned as-is.  This is fine
as far as the caller knows: it's expecting to get back an all-zero
page, so it goes merrily along.  The problem is that if that code
path is taken, we *have not extended the file on disk*.

That means, until we get around to writing the dirty buffer to disk
(eg via checkpoint), the kernel thinks the file doesn't contain that
block yet.  So if someone else comes along and again does
ReadBuffer(rel, P_NEW), the lseek computation will return the same
offset as before, and we'll wind up handing back the very same buffer
as before.  Now we get the above-mentioned Assert, if we are lucky
enough to be running an assert-enabled build.  Otherwise the code
in hio.c will just wipe and reinitialize the page, leading to loss
of whatever rows were previously placed in it.

Based on this analysis, the logic in ReadBuffer is wrong: if it finds
an existing buffer in the P_NEW case, it still has to zero the page
and do smgrextend() to be sure that the kernel thinks the page has
been added to the file.
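
A sketch of what that shape of fix could look like inside the P_NEW path of
ReadBuffer() in src/backend/storage/buffer/bufmgr.c follows; the variable and
field names (isExtend, bufBlock, rd_istemp) are assumptions about the 8.1-era
code, not the committed diff.

	if (isExtend)
	{
		/* new buffers are zero-filled */
		MemSet((char *) bufBlock, 0, BLCKSZ);

		/*
		 * Do this even if BufferAlloc() reported that it found the block
		 * already in the pool (the stale all-zeroes buffer described
		 * above).  Without the smgrextend() call the file on disk is
		 * never lengthened, so the next lseek-based P_NEW computation
		 * hands out the same block number again.
		 */
		smgrextend(reln->rd_smgr, blockNum, (char *) bufBlock,
				   reln->rd_istemp);
	}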

I'm also thinking that the test for empty page in hio.c ought to be
an actual test and elog, not just an Assert.  Now that we've seen
two different bugs allowing the can't happen case to happen, I'm no
longer satisfied with not having any check there in a non-assert build.
The consequences of not detecting an overwrite are too severe.
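
As an illustration of that suggestion (not necessarily the exact change that
was committed), the check in hio.c's RelationGetBufferForTuple could read:

	pageHeader = (Page) BufferGetPage(buffer);

	/* a hard error instead of an Assert, so non-cassert builds catch it too */
	if (!PageIsNew((PageHeader) pageHeader))
		elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
			 BufferGetBlockNumber(buffer),
			 RelationGetRelationName(relation));

	PageInit(pageHeader, BufferGetPageSize(buffer), 0);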

regards, tom lane



Re: [HACKERS] catalog corruption bug

2006-01-05 Thread Jeremy Drake
Here is some additional information that I have managed to gather today
regarding this.  It is not really what causes it, so much as what does
not.

I removed all plperl from the loading processes.  I did a VACUUM FULL
ANALYZE, and then I reindexed everything in the database (including
starting the backend in standalone mode and running REINDEX SYSTEM
dbname).  They still failed.

So it is apparently not that plperl issue which was being discussed
earlier.

Also, what I said about the corruption not having persisted into other
backends was not quite correct.  It was leaving behind types in pg_type
which were in some of the pg_temp* schemas which corresponded to some of
the temp tables.  But I took those out and still had issues (slightly
different).

Here is some interesting stuff too.  I just stopped my processes to start
up a batch again to copy the error message I got now, but before doing so
I was doing a VACUUM FULL ANALYZE VERBOSE so I could hopefully start from
a relatively clean state.  I got a few warnings I don't remember seeing
before.

INFO:  vacuuming pg_catalog.pg_shdepend
INFO:  pg_shdepend: found 108 removable, 440 nonremovable row versions
in 15 pages
DETAIL:  0 dead row versions cannot be removed yet.
Nonremovable row versions range from 53 to 53 bytes long.
There were 1492 unused item pointers.
Total free space (including removable row versions) is 89780 bytes.
7 pages are or will become empty, including 0 at the end of the table.
12 pages containing 89744 free bytes are potential move destinations.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index pg_shdepend_depender_index now contains 448 row versions in
33 pages
DETAIL:  108 index row versions were removed.
23 index pages have been deleted, 23 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.10 sec.
WARNING:  index pg_shdepend_depender_index contains 448 row versions,
but table contains 440 row versions
HINT:  Rebuild the index with REINDEX.
INFO:  index pg_shdepend_reference_index now contains 448 row versions
in 12 pages
DETAIL:  108 index row versions were removed.
3 index pages have been deleted, 3 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
WARNING:  index pg_shdepend_reference_index contains 448 row versions,
but table contains 440 row versions
HINT:  Rebuild the index with REINDEX.
INFO:  pg_shdepend: moved 4 row versions, truncated 15 to 4 pages
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index pg_shdepend_depender_index now contains 448 row versions in
33 pages
DETAIL:  4 index row versions were removed.
23 index pages have been deleted, 23 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
WARNING:  index pg_shdepend_depender_index contains 448 row versions,
but table contains 440 row versions
HINT:  Rebuild the index with REINDEX.
INFO:  index pg_shdepend_reference_index now contains 448 row versions
in 12 pages
DETAIL:  4 index row versions were removed.
4 index pages have been deleted, 4 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
WARNING:  index pg_shdepend_reference_index contains 448 row versions,
but table contains 440 row versions
HINT:  Rebuild the index with REINDEX.
INFO:  analyzing pg_catalog.pg_shdepend
INFO:  pg_shdepend: scanned 4 of 4 pages, containing 440 live rows and 0
dead rows; 440 rows in sample, 440 estimated total rows



Similar for pg_type, there being 248 index row versions vs 244 row
versions in the table.

1631 vs 1619 in pg_attribute
95 vs 94 in pg_index


Looks like it may be time to start a standalone backend and REINDEX
again...

-- 
Don't smoke the next cigarette.  Repeat.



Re: [HACKERS] catalog corruption bug

2006-01-05 Thread Tom Lane
Jeremy Drake [EMAIL PROTECTED] writes:
 Here is some additional information that I have managed to gather today
 regarding this.  It is not really what causes it, so much as what does
 not.
 ...
 Similar for pg_type, there being 248 index row versions vs 244 row
 versions in the table.

The ReadBuffer bug I just fixed could result in disappearance of catalog
rows, so this observation is consistent with the theory that that's
what's biting you.  It's not proof though...

regards, tom lane



Re: [HACKERS] catalog corruption bug

2006-01-04 Thread Jeremy Drake
On Wed, 21 Dec 2005, Tom Lane wrote:

 Jeremy Drake [EMAIL PROTECTED] writes:
  We have encountered a very nasty but apparently rare bug which appears to
  result in catalog corruption.

 How much of this can you reproduce on 8.1.1?  We've fixed a few issues
 already.

We did not see this problem for a while.  I upgraded the second gentoo box
to show this problem to 8.1.1 basically as soon as the ebuild for it was
out.  It just started acting up today (but we have not stressed it for a
while).  It appears to be acting similarly (although corruption which
persisted into other backends has not appeared).

Here is the error message I currently get on 8.1.1 (names have been
changed):

DBD::Pg::st execute failed: ERROR:  type push_temp already exists
CONTEXT:  SQL statement CREATE TEMPORARY TABLE push_temp (val text) ON
COMMIT DROP
PL/pgSQL function push_func line 6 at SQL statement
DBD::Pg::st execute failed: ERROR:  type push_temp already exists
CONTEXT:  SQL statement CREATE TEMPORARY TABLE push_temp (val text) ON
COMMIT DROP
PL/pgSQL function push_func line 6 at SQL statement

postgres=# select version();
 version
--
 PostgreSQL 8.1.1 on x86_64-pc-linux-gnu, compiled by GCC
x86_64-pc-linux-gnu-gcc (GCC) 3.4.4 (Gentoo 3.4.4-r1, ssp-3.4.4-1.0,
pie-8.7.8)
(1 row)


  This was built from the gentoo ebuild version 8.1.0

 I'd be even more interested if you can reproduce it on a non-gentoo
 machine.  Gentoo is not noted for stability.

This one was also on Gentoo, this time ebuild version 8.1.1.  They are
applying a couple patches it looks like, one of which looks like it just
changes some makefile stuff around, and the other appears to add support
for the SH platform in s_lock.h.

Unfortunately, I don't have any non-gentoo boxes around which are on a par
with these two hardware-wise.  Also, I think the test cases I tried to come
up with were most likely wrong.  The code which is currently croaking
basically amounts to 9 processes calling functions which do SELECT,
INSERT, SELECT FOR UPDATE, DELETE, and UPDATE, as well as CREATE TEMP
TABLE ... ON COMMIT DROP.  ON COMMIT DROP is the only kind of temp table
that this code uses.

I could probably try to re-arrange the code in such a way that I can send
it, if that would be helpful (although I wouldn't want to waste the effort
if it wouldn't be helpful).

Also, what do you figure are the chances of that plperl locale problem
causing this?  I would guess pretty slim seeing as I am only using ASCII
for my schemas, and all of my databases are SQL_ASCII.  I am calling
plperl functions in both of the projects which are breaking...  Also, if I
run the command 'locale', all of the things it prints out are either empty
or POSIX.

-- 
Take your dying with some seriousness, however.  Laughing on the way to
your execution is not generally understood by less advanced life forms,
and they'll call you crazy.
-- Messiah's Handbook: Reminders for the Advanced Soul



Re: [HACKERS] catalog corruption bug

2005-12-21 Thread Tom Lane
Jeremy Drake [EMAIL PROTECTED] writes:
 We have encountered a very nasty but apparently rare bug which appears to
 result in catalog corruption.

How much of this can you reproduce on 8.1.1?  We've fixed a few issues
already.

 This was built from the gentoo ebuild version 8.1.0

I'd be even more interested if you can reproduce it on a non-gentoo
machine.  Gentoo is not noted for stability.

regards, tom lane
