Re: [HACKERS] PG signal handler and non-reentrant malloc/free calls

2011-03-01 Thread Nikhil Sontakke

>> Will try to get the call stack if needed.
>
> Yes, please.


Here is the stack trace:

#0  0xe410 in __kernel_vsyscall ()
#1  0xb7ee676e in __lll_mutex_lock_wait () from /lib/libc.so.6
#2  0xb7e82e41 in _L_lock_4214 () from /lib/libc.so.6
#3  0xb7e80048 in free () from /lib/libc.so.6
#4  0x082f70b1 in AllocSetDelete (context=0x84c7d68) at aset.c:503
#5  0x082f75b2 in MemoryContextDelete (context=0x84c7d68) at mcxt.c:196
#6  0x082f75e9 in MemoryContextDeleteChildren (context=0x84c7ce0) at mcxt.c:215
#7  0x082f7582 in MemoryContextDelete (context=0x84c7ce0) at mcxt.c:169
#8  0x082f75e9 in MemoryContextDeleteChildren (context=0x84c7bd0) at mcxt.c:215
#9  0x082f7582 in MemoryContextDelete (context=0x84c7bd0) at mcxt.c:169
#10 0x080b54fd in CleanupSubTransaction () at xact.c:1444
#11 0x080b5590 in AbortOutOfAnyTransaction () at xact.c:3955
#12 0x082e9b8b in ShutdownPostgres (code=1, arg=0) at postinit.c:655
#13 0x08220f95 in shmem_exit (code=1) at ipc.c:191
#14 0x08221051 in proc_exit (code=1) at ipc.c:119
#15 0x082dd0cd in errfinish (dummy=0) at elog.c:475
#16 0x08231905 in ProcessInterrupts () at postgres.c:2869
#17 0x082dd071 in errfinish (dummy=0) at elog.c:500
#18 0x08231acd in die (postgres_signal_arg=15) at postgres.c:2732
#19 signal handler called
#20 0xb7e8091c in _int_malloc () from /lib/libc.so.6
#21 0xb7e822c6 in malloc () from /lib/libc.so.6
#22 0x082f6bd3 in AllocSetAlloc (context=0x84c6e18, size=20263) at aset.c:533
#23 0x0829ff36 in textout (fcinfo=0xbfa9cba0) at varlena.c:491
#24 0x082e03c2 in FunctionCall1 (flinfo=0xbfa9d0dc, arg1=139801232)
    at fmgr.c:1272
#25 0x082e1495 in OutputFunctionCall (flinfo=0xbfa9d0dc, val=139801232)
    at fmgr.c:1905
#26 0x082e2478 in OidOutputFunctionCall (functionId=47, val=139801232)
    at fmgr.c:2008
#27 0x6f3aab9e in convert_value_to_string (value=139801232,
    valtype=<value optimized out>) at pl_exec.c:5304
#28 0x6f3aac49 in exec_cast_value (value=139801232, valtype=25,
    reqtype=1043, reqinput=0x84ff1b8, reqtypioparam=1043,
    reqtypmod=14, isnull=0 '\0') at pl_exec.c:5346
#29 0x6f3ac0b5 in exec_assign_value (estate=0xbfa9ddc4,
    target=0x84fbd50, value=139801232, valtype=25,
    isNull=0xbfa9d26f ) at pl_exec.c:4130
#30 0x6f3ad7de in exec_assign_expr (estate=0xbfa9ddc4,
    target=0x84fbd50, expr=0x8500cb0) at pl_exec.c:4102
#31 0x6f3afc99 in exec_stmts (estate=0xbfa9ddc4, stmts=<value optimized out>)
    at pl_exec.c:1483
#32 0x6f3b09de in exec_stmt_fori (estate=0xbfa9ddc4, stmt=0x8500b48)
    at pl_exec.c:1891
#33 0x6f3afc24 in exec_stmts (estate=0xbfa9ddc4, stmts=<value optimized out>)
    at pl_exec.c:1381
#34 0x6f3b0dd4 in exec_stmt_loop (estate=0xbfa9ddc4, stmt=0x8500978)
    at pl_exec.c:1681
#35 0x6f3afc0c in exec_stmts (estate=0xbfa9ddc4, stmts=<value optimized out>)
    at pl_exec.c:1373
#36 0x6f3b10f2 in exec_stmt_block (estate=0xbfa9ddc4, block=0x8500f20)
    at pl_exec.c:1241
#37 0x6f3af301 in exec_stmts (estate=0xbfa9ddc4, stmts=<value optimized out>)
    at pl_exec.c:1349
#38 0x6f3b1635 in exec_stmt_block (estate=0xbfa9ddc4, block=0x85006c0)
    at pl_exec.c:1070
#39 0x6f3af301 in exec_stmts (estate=0xbfa9ddc4, stmts=<value optimized out>)
    at pl_exec.c:1349
#40 0x6f3b10f2 in exec_stmt_block (estate=0xbfa9ddc4, block=0x84fff58)
    at pl_exec.c:1241
#41 0x6f3b22c5 in plpgsql_exec_function (func=0x84c6f88,
    fcinfo=0xbfa9dfd0) at pl_exec.c:334
#42 0x6f3a5d8c in plpgsql_call_handler (fcinfo=0xbfa9dfd0) at pl_handler.c:112
#43 0x08185119 in ExecMakeTableFunctionResult (funcexpr=0x84f52c0,
    econtext=0x84f50d0, expectedDesc=0x84f5198,
    returnDesc=0xbfa9e548) at execQual.c:1651
#44 0x081924d0 in FunctionNext (node=0x84f5048) at nodeFunctionscan.c:68
#45 0x08187c34 in ExecScan (node=0x84f5048,
    accessMtd=0x8192460 <FunctionNext>) at execScan.c:68
#46 0x08192459 in ExecFunctionScan (node=0x84f5048) at nodeFunctionscan.c:119
#47 0x08180f57 in ExecProcNode (node=0x84f5048) at execProcnode.c:367
#48 0x08180080 in ExecutorRun (queryDesc=0x84dc640,
    direction=ForwardScanDirection, count=0) at execMain.c:1335
#49 0x082369c0 in PortalRunSelect (portal=0x84d7578,
    forward=<value optimized out>, count=0, dest=0x84ced18)
    at pquery.c:943
#50 0x082379ba in PortalRun (portal=0x84d7578, count=2147483647,
    isTopLevel=1 '\001', dest=0x84ced18,
    altdest=0x84ced18, completionTag=0xbfa9e7da ) at pquery.c:797
#51 0x082323d3 in exec_simple_query (
    query_string=0x84cdd28 "call testsp2('testtab_2', 1000);")
    at postgres.c:1074
#52 0x082345ef in PostgresMain (argc=4, argv=0x8434598,
    username=0x8434550 "sys") at postgres.c:4081
#53 0x081eef68 in PostmasterMain (argc=1, argv=0x8431cc8) at postmaster.c:4191
#54 0x081a3560 in main (argc=1, argv=0x8431cc8) at main.c:188
(gdb)

Regards,
Nikhils

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PG signal handler and non-reentrant malloc/free calls

2011-03-01 Thread Heikki Linnakangas

On 01.03.2011 12:50, Nikhil Sontakke wrote:
>>> Will try to get the call stack if needed.
>>
>> Yes, please.
>
> Here is the stack trace:

Hmm, it looks like ImmediateInterruptOK is set, while we're busy running 
a query. How come? Can you debug that? Where does it get set?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] PG signal handler and non-reentrant malloc/free calls

2011-03-01 Thread Nikhil Sontakke
On Tue, Mar 1, 2011 at 10:17 PM, Heikki Linnakangas
<heikki.linnakan...@enterprisedb.com> wrote:
> On 01.03.2011 12:50, Nikhil Sontakke wrote:
>>>> Will try to get the call stack if needed.
>>>
>>> Yes, please.
>>
>> Here is the stack trace:
>
> Hmm, it looks like ImmediateInterruptOK is set, while we're busy running a
> query. How come? Can you debug that? Where does it get set?


Ah, this is not exactly an easily reproducible problem :( You gotta be
lucky enough to be able to interrupt inside a malloc call.

But adding hold/resume interrupts in mcxt.c (not aset.c, since we
want to be agnostic to the underlying layer) should be good enough to
handle this non-re-entrant issue, no?

Regards,
Nikhils



Re: [HACKERS] PG signal handler and non-reentrant malloc/free calls

2011-03-01 Thread Heikki Linnakangas

On 01.03.2011 16:40, Nikhil Sontakke wrote:
> On Tue, Mar 1, 2011 at 10:17 PM, Heikki Linnakangas
> <heikki.linnakan...@enterprisedb.com> wrote:
>> On 01.03.2011 12:50, Nikhil Sontakke wrote:
>>>>> Will try to get the call stack if needed.
>>>>
>>>> Yes, please.
>>>
>>> Here is the stack trace:
>>
>> Hmm, it looks like ImmediateInterruptOK is set, while we're busy running a
>> query. How come? Can you debug that? Where does it get set?
>
> Ah, this is not exactly an easily reproducible problem :( You gotta be
> lucky enough to be able to interrupt inside a malloc call.


You could put a sleep() just before the malloc(). Even if you can't 
reproduce a crash, we know that we shouldn't be calling malloc() in any 
codepath where ImmediateInterruptOK == true.


Heck, you can just put an Assert(!ImmediateInterruptOK) there, although 
it will fire in the authentication phase because of the issue with 
ClientAuthentication. You can set debug_assertions=off in 
postgresql.conf and enable it again with SET after logging in to get 
around that.



> But adding hold/resume interrupts in mcxt.c (not aset.c, since we
> want to be agnostic to the underlying layer) should be good enough to
> handle this non-re-entrant issue, no?


We shouldn't be running with ImmediateInterruptOK == true to begin with. 
There are many other things beside malloc/free that are not safe to be 
interrupted like that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] PG signal handler and non-reentrant malloc/free calls

2011-03-01 Thread Tom Lane
Nikhil Sontakke <nikhil.sonta...@enterprisedb.com> writes:
> On Tue, Mar 1, 2011 at 10:17 PM, Heikki Linnakangas
> <heikki.linnakan...@enterprisedb.com> wrote:
>> Hmm, it looks like ImmediateInterruptOK is set, while we're busy running a
>> query. How come? Can you debug that? Where does it get set?

> Ah, this is not exactly an easily reproducible problem :( You gotta be
> lucky enough to be able to interrupt inside a malloc call.

No, the question is why is the ImmediateInterruptOK flag set.  Whether
the interrupt actually happens isn't that relevant.  You could try
setting a watchpoint on the flag variable.

> But adding hold/resume interrupts in mcxt.c (not aset.c, since we
> want to be agnostic to the underlying layer) should be good enough to
> handle this non-re-entrant issue, no?

We are not doing that, because that would be only a band-aid patch for
approximately 0.1% of the problems that can arise from running random
code with ImmediateInterruptOK set.  We need to find out what's leaving
that set and fix it.

regards, tom lane



Re: [HACKERS] PG signal handler and non-reentrant malloc/free calls

2011-03-01 Thread Andres Freund
Hi,


On Tuesday, March 01, 2011 11:50:42 AM Nikhil Sontakke wrote:
> > > Will try to get the call stack if needed.
> >
> > Yes, please.
> Here is the stack trace:
That's not a stock postgres, is it?

Andres



Re: [HACKERS] PG signal handler and non-reentrant malloc/free calls

2011-03-01 Thread Nikhil Sontakke

> No, the question is why is the ImmediateInterruptOK flag set.  Whether
> the interrupt actually happens isn't that relevant.  You could try
> setting a watchpoint on the flag variable.
>
>> But adding hold/resume interrupts in mcxt.c (not aset.c, since we
>> want to be agnostic to the underlying layer) should be good enough to
>> handle this non-re-entrant issue, no?
>
> We are not doing that, because that would be only a band-aid patch for
> approximately 0.1% of the problems that can arise from running random
> code with ImmediateInterruptOK set.  We need to find out what's leaving
> that set and fix it.


Got it. Thanks Tom and Heikki. Will investigate this further.

@Andres: Apologies, all. I should have mentioned up front that this is
occurring on 8.3.13 with some custom modifications, though probably
not in this area.

Regards,
Nikhils



Re: [HACKERS] PG signal handler and non-reentrant malloc/free calls

2011-03-01 Thread Greg Stark
On Tue, Mar 1, 2011 at 3:11 PM, Heikki Linnakangas
<heikki.linnakan...@enterprisedb.com> wrote:
> Heck, you can just put an Assert(!ImmediateInterruptOK) there, although it
> will fire in the authentication phase because of the issue with
> ClientAuthentication. You can set debug_assertions=off in postgresql.conf
> and enable it again with SET after logging in to get around that.

That doesn't sound like a bad idea. We could
  Assert(!ImmediateInterruptOK || ImmediateInterruptEnabledInQuestionablePlace)
at the beginning of a bunch of basic low-level routines like AllocSetAlloc.

-- 
greg



[HACKERS] PG signal handler and non-reentrant malloc/free calls

2011-02-28 Thread Nikhil Sontakke
Hi,

I believe we have a case where not holding off interrupts while doing a
malloc() can cause a deadlock due to system or libc level locking. In this
case, a pg_ctl stop in fast mode was resorted to and that caused a backend
to handle the interrupt when it was inside the malloc call. Now as part of
the abort processing, in the subtransaction cleanup code path, this same
backend tried to clear memory contexts, leading to an eventual free() call.
The free() call tried to take the same lock which was already held by
malloc() earlier, resulting in a deadlock! Will try to get the call stack
if needed.

The malloc/free functions are known to be not re-entrant. Doesn't it make
sense to hold off interrupts while doing such calls inside the AllocSet* set
of functions? Thankfully the locations are not very many.
AllocSetContextCreate, AllocSetAlloc and AllocSetFree seem to be the only
candidates.

Comments, thoughts?

Regards,
Nikhils


Re: [HACKERS] PG signal handler and non-reentrant malloc/free calls

2011-02-28 Thread Heikki Linnakangas

On 28.02.2011 14:04, Nikhil Sontakke wrote:
> I believe we have a case where not holding off interrupts while doing a
> malloc() can cause a deadlock due to system or libc level locking. In this
> case, a pg_ctl stop in fast mode was resorted to and that caused a backend
> to handle the interrupt when it was inside the malloc call. Now as part of
> the abort processing, in the subtransaction cleanup code path, this same
> backend tried to clear memory contexts, leading to an eventual free() call.
> The free() call tried to take the same lock which was already held by
> malloc() earlier, resulting in a deadlock!


Our signal handlers shouldn't try to do anything that complicated. 
die(), which handles SIGTERM caused by fast shutdown in backends, 
doesn't do abort processing itself. It just sets a global variable.


Unless ImmediateInterruptOK is set, but it's only set around a few 
blocking system calls where it is safe to do so. (Checks...) Actually, 
md5_crypt_verify() looks suspicious, it does ImmediateInterruptOK = 
true, and then calls palloc() and pfree().



> Will try to get the call stack if needed.


Yes, please.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] PG signal handler and non-reentrant malloc/free calls

2011-02-28 Thread Tom Lane
Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> writes:
> Unless ImmediateInterruptOK is set, but it's only set around a few
> blocking system calls where it is safe to do so. (Checks...) Actually,
> md5_crypt_verify() looks suspicious, it does ImmediateInterruptOK =
> true, and then calls palloc() and pfree().

Hm, yeah, and ClientAuthentication() seems way too optimistic about what
it does with that set too.  I'm not sure what we can do about it though.
The general shape of the problem here is that we're about to go off into
uncooperative third-party libraries like krb5, so if we don't enable
interrupts we're going to have problems honoring authentication timeout.

regards, tom lane
