Re: [PERFORM] Hanging queries on dual CPU windows
> > Is it possible to get a stack trace from the stuck process? I dunno if you've got anything gdb-equivalent under Windows, but that's the first thing I'd be interested in ...
>
> Here ya go:
>
> http://www.devisser-siderius.com/stack1.jpg
> http://www.devisser-siderius.com/stack2.jpg
> http://www.devisser-siderius.com/stack3.jpg
>
> There are three threads in the process. I guess thread 1 (stack1.jpg) is the most interesting.
>
> I also noted that cranking up concurrency in my app reproduces the problem in about 4 minutes ;-)

Actually, stack2 looks very interesting. Does it stay stuck in pg_queue_signal? That's really not supposed to happen.

Also, can you confirm that stack1 actually *stops* in pgwin32_waitforsinglesocket? Or does it go out and come back? ;-)

(A good signal of this is to check the cswitch delta. If it stays at zero, then it's stuck. If it shows any values, that means it's actually going out and coming back.)

And finally, is this 8.0 or 8.1? There have been some significant changes in the handling of the signals between the two...

//Magnus

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq
Re: [PERFORM] Hanging queries on dual CPU windows
On Friday 10 March 2006 04:20, Magnus Hagander wrote:
> > > Is it possible to get a stack trace from the stuck process? I dunno if you've got anything gdb-equivalent under Windows, but that's the first thing I'd be interested in ...
> >
> > Here ya go:
> >
> > http://www.devisser-siderius.com/stack1.jpg
> > http://www.devisser-siderius.com/stack2.jpg
> > http://www.devisser-siderius.com/stack3.jpg
> >
> > There are three threads in the process. I guess thread 1 (stack1.jpg) is the most interesting.
> >
> > I also noted that cranking up concurrency in my app reproduces the problem in about 4 minutes ;-)

Just reproduced again.

> Actually, stack2 looks very interesting. Does it stay stuck in pg_queue_signal? That's really not supposed to happen.

Yes it does.

> Also, can you confirm that stack1 actually *stops* in pgwin32_waitforsinglesocket? Or does it go out and come back? ;-)
>
> (A good signal of this is to check the cswitch delta. If it stays at zero, then it's stuck. If it shows any values, that means it's actually going out and coming back.)

I only see CSwitch change once I click OK on the thread window. Once I do that, it goes up to 3 and back to blank again. The 'context switches' counter does not increase like it does for other processes (e.g. Process Explorer itself).

Another thing which may or may not be of interest: nothing is listed in the 'TCP/IP' tab for the stuck process. I would have expected to see at least the socket of the client connection there?

> And finally, is this 8.0 or 8.1? There have been some significant changes in the handling of the signals between the two...

This is 8.1.3 on Windows 2003 Server. Also reproduced on 8.1.0 and 8.1.1 (also on 2K3).

> //Magnus

jan

--
Jan de Visser [EMAIL PROTECTED]
Baruk Khazad! Khazad ai-menu!
Re: [PERFORM] Hanging queries on dual CPU windows
Hi,

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane
> Sent: Thursday, March 09, 2006 9:11 PM
> To: Jan de Visser
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Hanging queries on dual CPU windows
>
> Jan de Visser [EMAIL PROTECTED] writes:
> > Furthermore, it does not happen on Linux machines, both single CPU and dual CPU, nor on single CPU windows machines. We can only reproduce on a dual CPU windows machine, and if we take one CPU out, it does not happen.
> > ...
> > Which showed me that several transactions were waiting for a particular row which was locked by another transaction. This transaction had no pending locks (so no deadlock), but just does not complete and hence never relinquishes the lock.
>
> Is the stuck transaction still consuming CPU time, or just stopped?
>
> Is it possible to get a stack trace from the stuck process? I dunno if you've got anything gdb-equivalent under Windows, but that's the first thing I'd be interested in ...

Debugging Tools for Windows from Microsoft:
http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx

Additionally, you need a symbol file, or you use
SRV*c:\debug\symbols*http://msdl.microsoft.com/download/symbols;
to load the symbol file dynamically from the net.

Best regards

> regards, tom lane

Hakan Kocaman
Software-Development
digame.de GmbH
Richard-Byrd-Str. 4-8
50829 Köln
Tel.: +49 (0) 221 59 68 88 31
Fax: +49 (0) 221 59 68 88 98
Email: [EMAIL PROTECTED]
Re: [PERFORM] Hanging queries on dual CPU windows
On Friday 10 March 2006 09:03, Jan de Visser wrote:
> On Friday 10 March 2006 04:20, Magnus Hagander wrote:
> > > > Is it possible to get a stack trace from the stuck process? I dunno if you've got anything gdb-equivalent under Windows, but that's the first thing I'd be interested in ...
> > >
> > > Here ya go:
> > >
> > > http://www.devisser-siderius.com/stack1.jpg
> > > http://www.devisser-siderius.com/stack2.jpg
> > > http://www.devisser-siderius.com/stack3.jpg
> > >
> > > There are three threads in the process. I guess thread 1 (stack1.jpg) is the most interesting.
> > >
> > > I also noted that cranking up concurrency in my app reproduces the problem in about 4 minutes ;-)
>
> Just reproduced again.
>
> > Actually, stack2 looks very interesting. Does it stay stuck in pg_queue_signal? That's really not supposed to happen.
>
> Yes it does.

An update on that: there are actually *two* processes in this state, both hanging in pg_queue_signal. I've looked at the source of that, and the obvious candidate for hanging is EnterCriticalSection. I also found this:

http://blogs.msdn.com/larryosterman/archive/2005/03/02/383685.aspx

where they say:

"In addition, for Windows 2003, SP1, the EnterCriticalSection API has a subtle change that's intended to resolve many of the lock convoy issues. Before Win2003 SP1, if 10 threads were blocked on EnterCriticalSection and all 10 threads had the same priority, then EnterCriticalSection would service those threads in a FIFO (first-in, first-out) basis. Starting in Windows 2003 SP1, the EnterCriticalSection will wake up a random thread from the waiting threads. If all the threads are doing the same thing (like a thread pool) this won't make much of a difference, but if the different threads are doing different work (like the critical section protecting a widely accessed object), this will go a long way towards removing lock convoy semantics."

Could it be they broke it when they did that?

> > Also, can you confirm that stack1 actually *stops* in pgwin32_waitforsinglesocket? Or does it go out and come back? ;-)
> >
> > (A good signal of this is to check the cswitch delta. If it stays at zero, then it's stuck. If it shows any values, that means it's actually going out and coming back.)
>
> I only see CSwitch change once I click OK on the thread window. Once I do that, it goes up to 3 and back to blank again. The 'context switches' counter does not increase like it does for other processes (e.g. Process Explorer itself).
>
> Another thing which may or may not be of interest: nothing is listed in the 'TCP/IP' tab for the stuck process. I would have expected to see at least the socket of the client connection there?
>
> > And finally, is this 8.0 or 8.1? There have been some significant changes in the handling of the signals between the two...
>
> This is 8.1.3 on Windows 2003 Server. Also reproduced on 8.1.0 and 8.1.1 (also on 2K3).
>
> > //Magnus
>
> jan

--
Jan de Visser [EMAIL PROTECTED]
Baruk Khazad! Khazad ai-menu!
Re: [PERFORM] Hanging queries on dual CPU windows
On Friday 10 March 2006 09:32, Jan de Visser wrote:
> > > Actually, stack2 looks very interesting. Does it stay stuck in pg_queue_signal? That's really not supposed to happen.
> >
> > Yes it does.
>
> An update on that: there are actually *two* processes in this state, both hanging in pg_queue_signal. I've looked at the source of that, and the obvious candidate for hanging is EnterCriticalSection. I also found this:
>
> http://blogs.msdn.com/larryosterman/archive/2005/03/02/383685.aspx
>
> where they say:
>
> "In addition, for Windows 2003, SP1, the EnterCriticalSection API has a subtle change that's intended to resolve many of the lock convoy issues. Before Win2003 SP1, if 10 threads were blocked on EnterCriticalSection and all 10 threads had the same priority, then EnterCriticalSection would service those threads in a FIFO (first-in, first-out) basis. Starting in Windows 2003 SP1, the EnterCriticalSection will wake up a random thread from the waiting threads. If all the threads are doing the same thing (like a thread pool) this won't make much of a difference, but if the different threads are doing different work (like the critical section protecting a widely accessed object), this will go a long way towards removing lock convoy semantics."
>
> Could it be they broke it when they did that?

See also this:

http://bugs.mysql.com/bug.php?id=12071

It appears the MySQL people ran into this and concluded it is a Windows bug they needed to work around.

jan

--
Jan de Visser [EMAIL PROTECTED]
Baruk Khazad! Khazad ai-menu!
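To make the quoted change concrete, here is a toy model in Python of the two wake-up policies as the blog post describes them: FIFO before Windows 2003 SP1, a randomly chosen waiter from SP1 on. It is not Windows code and nothing in it comes from the PostgreSQL source; the function name `service_order` and the policy strings are invented for illustration.

```python
import random


def service_order(waiters, policy, rng=None):
    """Return the order in which blocked threads would get the lock.

    Toy model only. `waiters` lists thread names in arrival order.
    `policy` is "fifo" (pre-SP1, per the blog post) or "random"
    (SP1, per the blog post). Both names are made up for this sketch.
    """
    if policy not in ("fifo", "random"):
        raise ValueError("policy must be 'fifo' or 'random'")
    rng = rng or random.Random()
    queue = list(waiters)
    served = []
    while queue:
        if policy == "fifo":
            served.append(queue.pop(0))  # oldest waiter is woken first
        else:
            # any waiter may be woken, regardless of arrival order
            served.append(queue.pop(rng.randrange(len(queue))))
    return served


if __name__ == "__main__":
    threads = ["t1", "t2", "t3", "t4"]
    print(service_order(threads, "fifo"))
    print(service_order(threads, "random"))
```

In this model every waiter is eventually served under either policy, so the documented change alone would not be expected to leave a thread blocked in EnterCriticalSection forever. That is consistent with reading the hang as a bug rather than as intended behaviour.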
Re: [PERFORM] Hanging queries on dual CPU windows
> > > > > I dunno if you've got anything gdb-equivalent under Windows, but that's the first thing I'd be interested in ...
> > > >
> > > > Here ya go:
> > > >
> > > > http://www.devisser-siderius.com/stack1.jpg
> > > > http://www.devisser-siderius.com/stack2.jpg
> > > > http://www.devisser-siderius.com/stack3.jpg
> > > >
> > > > There are three threads in the process. I guess thread 1 (stack1.jpg) is the most interesting.
> > > >
> > > > I also noted that cranking up concurrency in my app reproduces the problem in about 4 minutes ;-)
> >
> > Just reproduced again.
> >
> > > Actually, stack2 looks very interesting. Does it stay stuck in pg_queue_signal? That's really not supposed to happen.
> >
> > Yes it does.
>
> An update on that: there are actually *two* processes in this state, both hanging in pg_queue_signal. I've looked at the source of that, and the obvious candidate for hanging is EnterCriticalSection. I also found this:
>
> http://blogs.msdn.com/larryosterman/archive/2005/03/02/383685.aspx
>
> where they say:
>
> "In addition, for Windows 2003, SP1, the EnterCriticalSection API has a subtle change that's intended to resolve many of the lock convoy issues. Before Win2003 SP1, if 10 threads were blocked on EnterCriticalSection and all 10 threads had the same priority, then EnterCriticalSection would service those threads in a FIFO (first-in, first-out) basis. Starting in Windows 2003 SP1, the EnterCriticalSection will wake up a random thread from the waiting threads. If all the threads are doing the same thing (like a thread pool) this won't make much of a difference, but if the different threads are doing different work (like the critical section protecting a widely accessed object), this will go a long way towards removing lock convoy semantics."
>
> Could it be they broke it when they did that?

In theory, yes, but it still seems a bit far-fetched :-(

If you have the env to rebuild, can you try changing the order of the lines:

    ResetEvent(pgwin32_signal_event);
    LeaveCriticalSection(pg_signal_crit_sec);

in backend/port/win32/signal.c

And if not, can you also try disabling the stats collector and see if that makes a difference? (Could be a workaround.)

//Magnus
Re: [PERFORM] Hanging queries on dual CPU windows
On Friday 10 March 2006 10:11, Magnus Hagander wrote:
> > Could it be they broke it when they did that
>
> In theory, yes, but it still seems a bit far fetched :-(

Well, I rolled back SP1 and am running my test again. Looking much better, hasn't locked up in 45mins now, whereas before it would lock up within 5mins.

So I think they broke something.

jan

--
Jan de Visser [EMAIL PROTECTED]
Baruk Khazad! Khazad ai-menu!
Re: [PERFORM] Hanging queries on dual CPU windows
> > > Could it be they broke it when they did that
> >
> > In theory, yes, but it still seems a bit far fetched :-(
>
> Well, I rolled back SP1 and am running my test again. Looking much better, hasn't locked up in 45mins now, whereas before it would lock up within 5mins.
>
> So I think they broke something.

Wow. I guess I was lucky that I didn't say it was impossible :-)

But what is really happening? What other thread is actually holding the critical section at this point, causing us to block? The only place it gets held is while looping the signal queue, but it is released while calling the signal function itself...

But they obviously *have* been messing with critical sections, so maybe they accidentally changed something else as well...

What bothers me is that nobody else has reported this. It could be that this was exposed by the changes to the signal handling done for 8.1, and the people with this level of concurrency are either still on 8.0 or just not on SP1 for their Windows boxes yet...

Do you have any other software installed on the machine that might possibly interfere in some way?

But let's have it run for a bit longer to confirm this does help. If so, we could perhaps recode that part using a Mutex instead of a critical section. Since it's not a performance-critical path, the difference shouldn't be large. If I code up a patch for that, can you re-apply SP1 and test it? Or is this a production system you can't really touch?

//Magnus
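The locking pattern described in this message (hold the critical section only while scanning the signal queue, release it around the call to the signal function) can be sketched as follows. This is a Python model for illustration, not the code in backend/port/win32/signal.c: the class `SignalQueue`, its method names, and the bitmask representation are all assumptions. A `threading.Lock` stands in for the Windows critical section.

```python
import threading


class SignalQueue:
    """Toy model of a pending-signal queue guarded by a single lock."""

    def __init__(self, handlers):
        self._lock = threading.Lock()  # stands in for the critical section
        self._pending = 0              # bitmask of queued signal numbers
        self._handlers = handlers      # signal number -> callable

    def queue_signal(self, signum):
        # Hypothetical counterpart of pg_queue_signal: set one bit under the lock.
        with self._lock:
            self._pending |= 1 << signum

    def dispatch(self):
        # Hold the lock while scanning, but release it around each handler
        # call so a handler can queue further signals without blocking.
        self._lock.acquire()
        try:
            signum = 0
            while self._pending >> signum:
                bit = 1 << signum
                if self._pending & bit:
                    self._pending &= ~bit
                    handler = self._handlers.get(signum)
                    if handler is not None:
                        self._lock.release()
                        try:
                            handler(signum)
                        finally:
                            self._lock.acquire()
                        signum = 0  # rescan: the handler may have queued more
                        continue
                signum += 1
        finally:
            self._lock.release()

    def lock_is_held(self):
        return self._lock.locked()
```

In this model the lock is never held across a handler call, so no thread should wait on it for longer than one scan of the bitmask. Swapping the lock type, as in the Mutex idea above, would touch only the constructor line of the sketch; whether the real change is that small is not something this model can show.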
Re: [PERFORM] Hanging queries on dual CPU windows
On Friday 10 March 2006 13:25, Magnus Hagander wrote:
> > > > Could it be they broke it when they did that
> > >
> > > In theory, yes, but it still seems a bit far fetched :-(
> >
> > Well, I rolled back SP1 and am running my test again. Looking much better, hasn't locked up in 45mins now, whereas before it would lock up within 5mins.
> >
> > So I think they broke something.
>
> Wow. I guess I was lucky that I didn't say it was impossible :-)
>
> But what is really happening? What other thread is actually holding the critical section at this point, causing us to block? The only place it gets held is while looping the signal queue, but it is released while calling the signal function itself...
>
> But they obviously *have* been messing with critical sections, so maybe they accidentally changed something else as well...
>
> What bothers me is that nobody else has reported this. It could be that this was exposed by the changes to the signal handling done for 8.1, and the people with this level of concurrency are either still on 8.0 or just not on SP1 for their Windows boxes yet...
>
> Do you have any other software installed on the machine that might possibly interfere in some way?

Just a JDK, JBoss, cygwin (running sshd), and a VNC Server. I don't think that interferes.

> But let's have it run for a bit longer to confirm this does help.

I turned it off after 2.5hr. The longest I had to wait before, with less load, was 1.45hr.

> If so, we could perhaps recode that part using a Mutex instead of a critical section. Since it's not a performance-critical path, the difference shouldn't be large. If I code up a patch for that, can you re-apply SP1 and test it? Or is this a production system you can't really touch?

I can do whatever the hell I want with it, so if you could cook up a patch that would be great.

As a BTW: I reinstalled SP1 and turned stats collection off. That also seems to work, but is not really a solution since we want to use autovacuuming.

> //Magnus

jan

--
Jan de Visser [EMAIL PROTECTED]
Baruk Khazad! Khazad ai-menu!
Re: [PERFORM] Hanging queries on dual CPU windows
On Friday 10 March 2006 14:27, Jan de Visser wrote:
> As a BTW: I reinstalled SP1 and turned stats collection off. That also seems to work, but is not really a solution since we want to use autovacuuming.

I lied. It hangs now. Just takes a lot longer...

jan

--
Jan de Visser [EMAIL PROTECTED]
Baruk Khazad! Khazad ai-menu!
Re: [PERFORM] Hanging queries on dual CPU windows
I have more information on this issue.

First off, the problem now happens after about 1-2 hours, as opposed to the 6-8 I mentioned earlier. Yay for shorter test cycles.

Furthermore, it does not happen on Linux machines, both single CPU and dual CPU, nor on single CPU windows machines. We can only reproduce on a dual CPU windows machine, and if we take one CPU out, it does not happen.

I executed the following after it hung:

    db=# select l.pid, c.relname, l.mode, l.granted, l.page, l.tuple
         from pg_locks l, pg_class c
         where c.oid = l.relation
         order by l.pid;

Which showed me that several transactions were waiting for a particular row which was locked by another transaction. This transaction had no pending locks (so no deadlock), but just does not complete and hence never relinquishes the lock.

What gives? Has anybody ever heard of problems like this on dual CPU windows machines?

jan

On Monday 06 March 2006 09:38, Jan de Visser wrote:
> Hello,
>
> While doing performance tests on Windows Server 2003 we observed the following two problems.
>
> Environment: J2EE application running in JBoss application server, against pgsql 8.1 database. Load is caused by a smallish number of (very) complex transactions, typically about 5-10 concurrently.
>
> The first one, which bothers me the most, is that after about 6-8 hours the application stops processing. No errors are reported, neither by the JDBC driver nor by the server, but when I kill the application server, I see that all my connections hang in SQL statements (which never seem to return):
>
> 2006-03-03 08:17:12 4504 6632560 LOG: duration: 45087000.000 ms statement: EXECUTE unnamed [PREPARE: SELECT objID FROM objects WHERE objID = $1 FOR UPDATE]
>
> I think I can reliably reproduce this by loading the app, and waiting a couple of hours.

--
Jan de Visser [EMAIL PROTECTED]
Baruk Khazad! Khazad ai-menu!
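One way to read the output of the pg_locks query in this message is to split backends into those with at least one ungranted lock request and those with nothing pending. The sketch below does that in Python over rows shaped like the query's select list (pid, relname, mode, granted, page, tuple). The function name `split_waiters` and the sample rows are made up for illustration; they are not output from the system discussed in this thread.

```python
from collections import defaultdict


def split_waiters(rows):
    """Group pg_locks-style rows by backend pid.

    Each row is a tuple (pid, relname, mode, granted, page, tuple_no).
    Returns (waiting, not_waiting), two sorted lists of pids. A pid counts
    as waiting if at least one of its lock requests has granted == False.
    """
    by_pid = defaultdict(list)
    for pid, _relname, _mode, granted, _page, _tuple_no in rows:
        by_pid[pid].append(granted)
    waiting = sorted(pid for pid, flags in by_pid.items() if not all(flags))
    not_waiting = sorted(pid for pid, flags in by_pid.items() if all(flags))
    return waiting, not_waiting


if __name__ == "__main__":
    # Invented sample rows, for illustration only.
    sample = [
        (101, "objects", "RowShareLock", True, None, None),
        (102, "objects", "RowShareLock", True, None, None),
        (102, "objects", "ExclusiveLock", False, 0, 7),
    ]
    print(split_waiters(sample))
```

A backend that lands in `not_waiting` while others queue behind it matches the situation described here: it holds the lock, has nothing pending itself, and so is not part of a deadlock.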
Re: [PERFORM] Hanging queries on dual CPU windows
Jan de Visser [EMAIL PROTECTED] writes:
> Furthermore, it does not happen on Linux machines, both single CPU and dual CPU, nor on single CPU windows machines. We can only reproduce on a dual CPU windows machine, and if we take one CPU out, it does not happen.
> ...
> Which showed me that several transactions were waiting for a particular row which was locked by another transaction. This transaction had no pending locks (so no deadlock), but just does not complete and hence never relinquishes the lock.

Is the stuck transaction still consuming CPU time, or just stopped?

Is it possible to get a stack trace from the stuck process? I dunno if you've got anything gdb-equivalent under Windows, but that's the first thing I'd be interested in ...

regards, tom lane
Re: [PERFORM] Hanging queries on dual CPU windows
On Thursday 09 March 2006 15:10, Tom Lane wrote:
> Jan de Visser [EMAIL PROTECTED] writes:
> > Furthermore, it does not happen on Linux machines, both single CPU and dual CPU, nor on single CPU windows machines. We can only reproduce on a dual CPU windows machine, and if we take one CPU out, it does not happen.
> > ...
> > Which showed me that several transactions were waiting for a particular row which was locked by another transaction. This transaction had no pending locks (so no deadlock), but just does not complete and hence never relinquishes the lock.
>
> Is the stuck transaction still consuming CPU time, or just stopped?

CPU drops off. In fact, that's my main clue something's wrong ;-)

> Is it possible to get a stack trace from the stuck process? I dunno if you've got anything gdb-equivalent under Windows, but that's the first thing I'd be interested in ...

I wouldn't know. I'm hardly a Windows expert. Prefer not to touch the stuff, myself. Can do some research though...

> regards, tom lane

jan

--
Jan de Visser [EMAIL PROTECTED]
Baruk Khazad! Khazad ai-menu!
Re: [PERFORM] Hanging queries on dual CPU windows
> Is it possible to get a stack trace from the stuck process? I dunno if you've got anything gdb-equivalent under Windows, but that's the first thing I'd be interested in ...

Try Process Explorer from www.sysinternals.com.

//Magnus
Re: [PERFORM] Hanging queries on dual CPU windows
On Thursday 09 March 2006 15:10, Tom Lane wrote:
> Is it possible to get a stack trace from the stuck process? I dunno if you've got anything gdb-equivalent under Windows, but that's the first thing I'd be interested in ...

Here ya go:

http://www.devisser-siderius.com/stack1.jpg
http://www.devisser-siderius.com/stack2.jpg
http://www.devisser-siderius.com/stack3.jpg

There are three threads in the process. I guess thread 1 (stack1.jpg) is the most interesting.

I also noted that cranking up concurrency in my app reproduces the problem in about 4 minutes ;-)

With thanks to Magnus Hagander for the Process Explorer hint.

jan

--
Jan de Visser [EMAIL PROTECTED]
Baruk Khazad! Khazad ai-menu!