Re: [PERFORM] Hanging queries on dual CPU windows

2006-03-10 Thread Magnus Hagander
> > Is it possible to get a stack trace from the stuck process? I dunno
> > if you've got anything gdb-equivalent under Windows, but that's the
> > first thing I'd be interested in ...
>
> Here ya go:
>
> http://www.devisser-siderius.com/stack1.jpg
> http://www.devisser-siderius.com/stack2.jpg
> http://www.devisser-siderius.com/stack3.jpg
>
> There are three threads in the process. I guess thread 1
> (stack1.jpg) is the most interesting.
>
> I also noted that cranking up concurrency in my app
> reproduces the problem in about 4 minutes ;-)

Actually, stack2 looks very interesting. Does it stay stuck in 
pg_queue_signal? That's really not supposed to happen.

Also, can you confirm that stack1 actually *stops* in 
pgwin32_waitforsinglesocket? Or does it go out and come back? ;-)

(A good signal of this is to check the cswitch delta. If it stays at zero, then
it's stuck. If it shows any values, that means it's actually going out and
coming back)
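
As an illustration of that check, here is a minimal Python sketch. It assumes you have written down a thread's cumulative context-switch count at a few successive moments (for example from Process Explorer's thread view); the function name and the sample numbers are made up for this example.

```python
def looks_stuck(cswitch_samples):
    """Return True if the context-switch counter never advanced.

    cswitch_samples: cumulative context-switch counts for one thread,
    read at successive points in time (hypothetical input format).
    """
    if len(cswitch_samples) < 2:
        raise ValueError("need at least two samples to compute a delta")
    # Delta between each pair of consecutive readings.
    deltas = [b - a for a, b in zip(cswitch_samples, cswitch_samples[1:])]
    # A zero delta in every interval means the thread was never scheduled.
    return all(d == 0 for d in deltas)


print(looks_stuck([1520, 1520, 1520]))  # counter never moved
print(looks_stuck([1520, 1523, 1523]))  # counter moved at least once
```

A thread that is merely waiting and waking up again would show a non-zero delta in at least one interval.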

And finally, is this 8.0 or 8.1? There have been some significant changes in 
the handling of the signals between the two...

//Magnus

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Hanging queries on dual CPU windows

2006-03-10 Thread Jan de Visser
On Friday 10 March 2006 04:20, Magnus Hagander wrote:
> > > Is it possible to get a stack trace from the stuck process? I dunno
> > > if you've got anything gdb-equivalent under Windows, but that's the
> > > first thing I'd be interested in ...
> >
> > Here ya go:
> >
> > http://www.devisser-siderius.com/stack1.jpg
> > http://www.devisser-siderius.com/stack2.jpg
> > http://www.devisser-siderius.com/stack3.jpg
> >
> > There are three threads in the process. I guess thread 1
> > (stack1.jpg) is the most interesting.
> >
> > I also noted that cranking up concurrency in my app
> > reproduces the problem in about 4 minutes ;-)

Just reproduced again.

> Actually, stack2 looks very interesting. Does it stay stuck in
> pg_queue_signal? That's really not supposed to happen.

Yes it does.

> Also, can you confirm that stack1 actually *stops* in
> pgwin32_waitforsinglesocket? Or does it go out and come back? ;-)
>
> (A good signal of this is to check the cswitch delta. If it stays at zero,
> then it's stuck. If it shows any values, that means it's actually going out
> and coming back)

I only see CSwitch change once I click OK on the thread window. Once I do
that, it goes up to 3 and back to blank again. The 'context switches' counter
does not increase like it does for other processes (like e.g. process
explorer itself).

Another thing which may or may not be of interest: Nothing is listed in the
'TCP/IP' tab for the stuck process. I would have expected to see at least the
socket of the client connection there??

> And finally, is this 8.0 or 8.1? There have been some significant changes
> in the handling of the signals between the two...

This is 8.1.3 on Windows 2003 Server. Also reproduced on 8.1.0 and 8.1.1 (also
on 2K3).

> //Magnus

jan

-- 
--
Jan de Visser                     [EMAIL PROTECTED]

                Baruk Khazad! Khazad ai-menu!
--



Re: [PERFORM] Hanging queries on dual CPU windows

2006-03-10 Thread Hakan Kocaman
Hi,

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane
> Sent: Thursday, March 09, 2006 9:11 PM
> To: Jan de Visser
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Hanging queries on dual CPU windows
>
> Jan de Visser [EMAIL PROTECTED] writes:
> > Furthermore, it does not happen on Linux machines, both single CPU and
> > dual CPU, nor on single CPU Windows machines. We can only reproduce on a
> > dual CPU Windows machine, and if we take one CPU out, it does not happen.
> > ...
> > Which showed me that several transactions were waiting for a particular
> > row which was locked by another transaction. This transaction had no
> > pending locks (so no deadlock), but just does not complete and hence
> > never relinquishes the lock.
>
> Is the stuck transaction still consuming CPU time, or just stopped?
>
> Is it possible to get a stack trace from the stuck process?  I dunno
> if you've got anything gdb-equivalent under Windows, but that's the
> first thing I'd be interested in ...

Debugging Tools for Windows from Microsoft
http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx

Additionally, you need a symbol file, or you can use
SRV*c:\debug\symbols*http://msdl.microsoft.com/download/symbols;
to load the symbols dynamically from the net.

Best regards

 
>    regards, tom lane
 




Hakan Kocaman
Software-Development

digame.de GmbH
Richard-Byrd-Str. 4-8
50829 Köln

Tel.: +49 (0) 221 59 68 88 31
Fax: +49 (0) 221 59 68 88 98
Email: [EMAIL PROTECTED]

 



Re: [PERFORM] Hanging queries on dual CPU windows

2006-03-10 Thread Jan de Visser
On Friday 10 March 2006 09:03, Jan de Visser wrote:
> On Friday 10 March 2006 04:20, Magnus Hagander wrote:
> > > > Is it possible to get a stack trace from the stuck process? I dunno
> > > > if you've got anything gdb-equivalent under Windows, but that's the
> > > > first thing I'd be interested in ...
> > >
> > > Here ya go:
> > >
> > > http://www.devisser-siderius.com/stack1.jpg
> > > http://www.devisser-siderius.com/stack2.jpg
> > > http://www.devisser-siderius.com/stack3.jpg
> > >
> > > There are three threads in the process. I guess thread 1
> > > (stack1.jpg) is the most interesting.
> > >
> > > I also noted that cranking up concurrency in my app
> > > reproduces the problem in about 4 minutes ;-)
>
> Just reproduced again.
>
> > Actually, stack2 looks very interesting. Does it stay stuck in
> > pg_queue_signal? That's really not supposed to happen.
>
> Yes it does.

An update on that: there are actually *two* processes in this state, both
hanging in pg_queue_signal. I've looked at the source of that, and the
obvious candidate for hanging is EnterCriticalSection. I also found this:

http://blogs.msdn.com/larryosterman/archive/2005/03/02/383685.aspx

where they say:

"In addition, for Windows 2003, SP1, the EnterCriticalSection API has a subtle
change that's intended to resolve many of the lock convoy issues.  Before
Win2003 SP1, if 10 threads were blocked on EnterCriticalSection and all 10
threads had the same priority, then EnterCriticalSection would service those
threads in a FIFO (first-in, first-out) basis.  Starting in Windows 2003
SP1, the EnterCriticalSection will wake up a random thread from the waiting
threads.  If all the threads are doing the same thing (like a thread pool)
this won't make much of a difference, but if the different threads are doing
different work (like the critical section protecting a widely accessed
object), this will go a long way towards removing lock convoy semantics."

Could it be they broke it when they did that?



> > Also, can you confirm that stack1 actually *stops* in
> > pgwin32_waitforsinglesocket? Or does it go out and come back? ;-)
> >
> > (A good signal of this is to check the cswitch delta. If it stays at
> > zero, then it's stuck. If it shows any values, that means it's actually
> > going out and coming back)
>
> I only see CSwitch change once I click OK on the thread window. Once I do
> that, it goes up to 3 and back to blank again. The 'context switches'
> counter does not increase like it does for other processes (like e.g.
> process explorer itself).
>
> Another thing which may or may not be of interest: Nothing is listed in the
> 'TCP/IP' tab for the stuck process. I would have expected to see at least
> the socket of the client connection there??
>
> > And finally, is this 8.0 or 8.1? There have been some significant changes
> > in the handling of the signals between the two...
>
> This is 8.1.3 on Windows 2003 Server. Also reproduced on 8.1.0 and 8.1.1
> (also on 2K3).
>
> > //Magnus
>
> jan




Re: [PERFORM] Hanging queries on dual CPU windows

2006-03-10 Thread Jan de Visser
On Friday 10 March 2006 09:32, Jan de Visser wrote:
> > > Actually, stack2 looks very interesting. Does it stay stuck in
> > > pg_queue_signal? That's really not supposed to happen.
> >
> > Yes it does.
>
> An update on that: there are actually *two* processes in this state, both
> hanging in pg_queue_signal. I've looked at the source of that, and the
> obvious candidate for hanging is EnterCriticalSection. I also found this:
>
> http://blogs.msdn.com/larryosterman/archive/2005/03/02/383685.aspx
>
> where they say:
>
> "In addition, for Windows 2003, SP1, the EnterCriticalSection API has a
> subtle change that's intended to resolve many of the lock convoy issues.
> Before Win2003 SP1, if 10 threads were blocked on EnterCriticalSection and
> all 10 threads had the same priority, then EnterCriticalSection would
> service those threads in a FIFO (first-in, first-out) basis.  Starting in
> Windows 2003 SP1, the EnterCriticalSection will wake up a random thread
> from the waiting threads.  If all the threads are doing the same thing
> (like a thread pool) this won't make much of a difference, but if the
> different threads are doing different work (like the critical section
> protecting a widely accessed object), this will go a long way towards
> removing lock convoy semantics."
>
> Could it be they broke it when they did that?

See also this:

http://bugs.mysql.com/bug.php?id=12071

It appears the MySQL people ran into this and concluded it is a Windows bug
they needed to work around.

jan




Re: [PERFORM] Hanging queries on dual CPU windows

2006-03-10 Thread Magnus Hagander
> > > > > I dunno if you've got anything gdb-equivalent under Windows,
> > > > > but that's the first thing I'd be interested in ...
> > > >
> > > > Here ya go:
> > > >
> > > > http://www.devisser-siderius.com/stack1.jpg
> > > > http://www.devisser-siderius.com/stack2.jpg
> > > > http://www.devisser-siderius.com/stack3.jpg
> > > >
> > > > There are three threads in the process. I guess thread 1
> > > > (stack1.jpg) is the most interesting.
> > > >
> > > > I also noted that cranking up concurrency in my app reproduces the
> > > > problem in about 4 minutes ;-)
> >
> > Just reproduced again.
> >
> > > Actually, stack2 looks very interesting. Does it stay stuck in
> > > pg_queue_signal? That's really not supposed to happen.
> >
> > Yes it does.
>
> An update on that: there are actually *two* processes in this
> state, both hanging in pg_queue_signal. I've looked at the
> source of that, and the obvious candidate for hanging is
> EnterCriticalSection. I also found this:
>
> http://blogs.msdn.com/larryosterman/archive/2005/03/02/383685.aspx
>
> where they say:
>
> "In addition, for Windows 2003, SP1, the EnterCriticalSection
> API has a subtle change that's intended to resolve many of
> the lock convoy issues.  Before Win2003 SP1, if 10 threads
> were blocked on EnterCriticalSection and all 10 threads had
> the same priority, then EnterCriticalSection would service
> those threads in a FIFO (first-in, first-out) basis.  Starting
> in Windows 2003 SP1, the EnterCriticalSection will wake up a
> random thread from the waiting threads.  If all the threads
> are doing the same thing (like a thread pool) this won't make
> much of a difference, but if the different threads are doing
> different work (like the critical section protecting a widely
> accessed object), this will go a long way towards removing
> lock convoy semantics."
>
> Could it be they broke it when they did that?

In theory, yes, but it still seems a bit far fetched :-(

If you have the environment to rebuild, can you try swapping the order of
these two lines in backend/port/win32/signal.c:

    ResetEvent(pgwin32_signal_event);
    LeaveCriticalSection(pg_signal_crit_sec);

And if not, can you also try disabling the stats collector and see if that
makes a difference? (Could be a workaround..)


//Magnus



Re: [PERFORM] Hanging queries on dual CPU windows

2006-03-10 Thread Jan de Visser
On Friday 10 March 2006 10:11, Magnus Hagander wrote:
> > Could it be they broke it when they did that?
>
> In theory, yes, but it still seems a bit far fetched :-(

Well, I rolled back SP1 and am running my test again. Looking much better,
hasn't locked up in 45 minutes now, whereas before it would lock up within
5 minutes.

So I think they broke something.

jan




Re: [PERFORM] Hanging queries on dual CPU windows

2006-03-10 Thread Magnus Hagander
> > > Could it be they broke it when they did that?
> >
> > In theory, yes, but it still seems a bit far fetched :-(
>
> Well, I rolled back SP1 and am running my test again. Looking
> much better, hasn't locked up in 45 minutes now, whereas before
> it would lock up within 5 minutes.
>
> So I think they broke something.

Wow. I guess I was lucky that I didn't say it was impossible :-)


But what is really happening? What other thread is actually holding the
critical section at this point, causing us to block? The only place it
gets held is while looping the signal queue, but it is released while
calling the signal function itself...
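
The locking pattern described above can be modelled in a few lines of Python. This is only an illustrative model, not the code in signal.c: a threading.Lock stands in for the Win32 critical section, a threading.Event for the signal event, and all function and variable names are invented for the example.

```python
import threading

queue_lock = threading.Lock()      # stands in for the critical section
signal_event = threading.Event()   # stands in for the signal event
pending = set()                    # queued signal numbers
handled = []                       # order in which handlers ran


def queue_signal(signum):
    # The lock is held only long enough to record the signal.
    with queue_lock:
        pending.add(signum)
    signal_event.set()


def dispatch_queued_signals(handler):
    queue_lock.acquire()
    while pending:
        signum = pending.pop()
        # Release the lock while the handler runs, so that signals can
        # be queued in the meantime without blocking.
        queue_lock.release()
        handler(signum)
        queue_lock.acquire()
    signal_event.clear()
    queue_lock.release()


def handler(signum):
    handled.append(signum)
    if signum == 1:
        # Would deadlock here if the (non-reentrant) lock were still held.
        queue_signal(2)


queue_signal(1)
dispatch_queued_signals(handler)
print(handled)  # [1, 2]
```

Because the lock is dropped around the handler call, the handler can itself queue a signal without blocking. In this model nothing holds the lock for long, which is why it is hard to see from the pattern alone who could be holding the critical section when pg_queue_signal hangs.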

But they obviously *have* been messing with critical sections, so maybe
they accidentally changed something else as well...

What bothers me is that nobody else has reported this. It could be that
this was exposed by the changes to the signal handling done for 8.1, and
the people with this level of concurrency are either still on 8.0 or just
not on SP1 for their Windows boxes yet... Do you have any other software
installed on the machine? That might possibly interfere in some way?

But let's have it run for a bit longer to confirm this does help. If so,
we could perhaps recode that part using a Mutex instead of a critical
section - since it's not a performance critical path, the difference
shouldn't be large. If I code up a patch for that, can you re-apply SP1
and test it? Or is this a production system you can't really touch?

//Magnus



Re: [PERFORM] Hanging queries on dual CPU windows

2006-03-10 Thread Jan de Visser
On Friday 10 March 2006 13:25, Magnus Hagander wrote:
> > > > Could it be they broke it when they did that?
> > >
> > > In theory, yes, but it still seems a bit far fetched :-(
> >
> > Well, I rolled back SP1 and am running my test again. Looking
> > much better, hasn't locked up in 45 minutes now, whereas before
> > it would lock up within 5 minutes.
> >
> > So I think they broke something.
>
> Wow. I guess I was lucky that I didn't say it was impossible :-)
>
> But what is really happening? What other thread is actually holding the
> critical section at this point, causing us to block? The only place it
> gets held is while looping the signal queue, but it is released while
> calling the signal function itself...
>
> But they obviously *have* been messing with critical sections, so maybe
> they accidentally changed something else as well...
>
> What bothers me is that nobody else has reported this. It could be that
> this was exposed by the changes to the signal handling done for 8.1, and
> the people with this level of concurrency are either still on 8.0 or just
> not on SP1 for their Windows boxes yet... Do you have any other software
> installed on the machine? That might possibly interfere in some way?

Just a JDK, JBoss, cygwin (running sshd), and a VNC Server. I don't think that
interferes.

> But let's have it run for a bit longer to confirm this does help.

I turned it off after 2.5hr. The longest I had to wait before, with less load,
was 1.45hr.

> If so,
> we could perhaps recode that part using a Mutex instead of a critical
> section - since it's not a performance critical path, the difference
> shouldn't be large. If I code up a patch for that, can you re-apply SP1
> and test it? Or is this a production system you can't really touch?

I can do whatever the hell I want with it, so if you could cook up a patch
that would be great.

As a BTW: I reinstalled SP1 and turned stats collection off. That also seems
to work, but is not really a solution since we want to use autovacuuming.

> //Magnus

jan




Re: [PERFORM] Hanging queries on dual CPU windows

2006-03-10 Thread Jan de Visser
On Friday 10 March 2006 14:27, Jan de Visser wrote:
> As a BTW: I reinstalled SP1 and turned stats collection off. That also
> seems to work, but is not really a solution since we want to use
> autovacuuming.

I lied. It hangs now. It just takes a lot longer...

jan




Re: [PERFORM] Hanging queries on dual CPU windows

2006-03-09 Thread Jan de Visser
I have more information on this issue.

First off, the problem now happens after about 1-2 hours, as opposed to the 6-8
I mentioned earlier. Yay for shorter test cycles.

Furthermore, it does not happen on Linux machines, both single CPU and dual
CPU, nor on single CPU Windows machines. We can only reproduce on a dual CPU
Windows machine, and if we take one CPU out, it does not happen.

I executed the following after it hung:

db=# select l.pid, c.relname, l.mode, l.granted, l.page, l.tuple 
from pg_locks l, pg_class c where c.oid = l.relation order by l.pid;

Which showed me that several transactions were waiting for a particular row
which was locked by another transaction. This transaction had no pending
locks (so no deadlock), but just does not complete and hence never
relinquishes the lock.
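
To make the reading of that output concrete, here is a small Python sketch that takes rows in the column order of the query above (pid, relname, mode, granted, page, tuple) and reports which backend holds a lock that others are waiting for while not waiting on anything itself. The function name, pids and lock modes in the sample are invented for illustration, and the sketch matches locks by object only, ignoring lock-mode compatibility.

```python
from collections import defaultdict


def find_blockers(rows):
    """rows: iterable of (pid, relname, mode, granted, page, tuple_no)
    tuples, in the column order of the query above.

    Returns a dict mapping each holder pid that is not itself waiting
    to the sorted list of pids waiting on an object it holds.
    """
    holders = defaultdict(set)   # locked object -> pids holding it
    waiters = defaultdict(set)   # locked object -> pids waiting for it
    waiting_pids = set()
    for pid, relname, mode, granted, page, tuple_no in rows:
        obj = (relname, page, tuple_no)
        if granted:
            holders[obj].add(pid)
        else:
            waiters[obj].add(pid)
            waiting_pids.add(pid)

    blockers = defaultdict(set)
    for obj, pids in waiters.items():
        # Only holders with no ungranted request of their own.
        for holder in holders[obj] - waiting_pids:
            blockers[holder].update(pids)
    return {pid: sorted(w) for pid, w in blockers.items()}


# Made-up sample rows: pid 1001 holds a row that 1002 and 1003 wait for.
sample = [
    (1001, "objects", "RowShareLock", True, None, None),
    (1001, "objects", "ExclusiveLock", True, 12, 7),
    (1002, "objects", "RowShareLock", True, None, None),
    (1002, "objects", "ExclusiveLock", False, 12, 7),
    (1003, "objects", "ExclusiveLock", False, 12, 7),
]
print(find_blockers(sample))  # {1001: [1002, 1003]}
```

A backend that shows up as a key here has every lock it asked for, so it is not part of a deadlock; if it still never finishes, the problem lies somewhere other than the lock manager.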

What gives? Has anybody ever heard of problems like this on dual CPU Windows
machines?

jan



On Monday 06 March 2006 09:38, Jan de Visser wrote:
> Hello,
>
> While doing performance tests on Windows Server 2003 we observed the
> following two problems.
>
> Environment: J2EE application running in JBoss application server, against
> pgsql 8.1 database. Load is caused by a smallish number of (very) complex
> transactions, typically about 5-10 concurrently.
>
> The first one, which bothers me the most, is that after about 6-8 hours the
> application stops processing. No errors are reported, neither by the JDBC
> driver nor by the server, but when I kill the application server, I see
> that all my connections hang in SQL statements (which never seem to
> return):
>
> 2006-03-03 08:17:12 4504 6632560 LOG:  duration: 45087000.000 ms
>  statement: EXECUTE unnamed  [PREPARE:  SELECT objID FROM objects WHERE
> objID = $1 FOR UPDATE]
>
> I think I can reliably reproduce this by loading the app, and waiting a
> couple of hours.
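
As a side calculation, the duration in that log line is given in milliseconds. A small Python sketch, assuming only the "duration: N ms" layout visible in the line above (the function name is hypothetical), converts it to hours, minutes and seconds:

```python
import re
from datetime import timedelta

# The log line quoted above, joined back onto one line.
LOG_LINE = ("2006-03-03 08:17:12 4504 6632560 LOG:  duration: 45087000.000 ms "
            "statement: EXECUTE unnamed  [PREPARE:  SELECT objID FROM objects "
            "WHERE objID = $1 FOR UPDATE]")


def statement_duration(line):
    """Extract the 'duration: N ms' field from a log line as a timedelta."""
    match = re.search(r"duration: ([0-9.]+) ms", line)
    if match is None:
        raise ValueError("no duration field found")
    return timedelta(milliseconds=float(match.group(1)))


print(statement_duration(LOG_LINE))  # 12:31:27
```

So the statement had been outstanding for roughly twelve and a half hours when the line was logged.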




Re: [PERFORM] Hanging queries on dual CPU windows

2006-03-09 Thread Tom Lane
Jan de Visser [EMAIL PROTECTED] writes:
> Furthermore, it does not happen on Linux machines, both single CPU and dual
> CPU, nor on single CPU Windows machines. We can only reproduce on a dual CPU
> Windows machine, and if we take one CPU out, it does not happen.
> ...
> Which showed me that several transactions were waiting for a particular row
> which was locked by another transaction. This transaction had no pending
> locks (so no deadlock), but just does not complete and hence never
> relinquishes the lock.

Is the stuck transaction still consuming CPU time, or just stopped?

Is it possible to get a stack trace from the stuck process?  I dunno
if you've got anything gdb-equivalent under Windows, but that's the
first thing I'd be interested in ...

regards, tom lane



Re: [PERFORM] Hanging queries on dual CPU windows

2006-03-09 Thread Jan de Visser
On Thursday 09 March 2006 15:10, Tom Lane wrote:
> Jan de Visser [EMAIL PROTECTED] writes:
> > Furthermore, it does not happen on Linux machines, both single CPU and
> > dual CPU, nor on single CPU Windows machines. We can only reproduce on a
> > dual CPU Windows machine, and if we take one CPU out, it does not happen.
> > ...
> > Which showed me that several transactions were waiting for a particular
> > row which was locked by another transaction. This transaction had no
> > pending locks (so no deadlock), but just does not complete and hence
> > never relinquishes the lock.
>
> Is the stuck transaction still consuming CPU time, or just stopped?

CPU drops off. In fact, that's my main clue something's wrong ;-)

> Is it possible to get a stack trace from the stuck process?  I dunno
> if you've got anything gdb-equivalent under Windows, but that's the
> first thing I'd be interested in ...

I wouldn't know. I'm hardly a Windows expert. Prefer not to touch the stuff,
myself. Can do some research though...

>    regards, tom lane

jan




Re: [PERFORM] Hanging queries on dual CPU windows

2006-03-09 Thread Magnus Hagander
> Is it possible to get a stack trace from the stuck process?
> I dunno if you've got anything gdb-equivalent under Windows,
> but that's the first thing I'd be interested in ...

Try Process Explorer from www.sysinternals.com.

//Magnus



Re: [PERFORM] Hanging queries on dual CPU windows

2006-03-09 Thread Jan de Visser
On Thursday 09 March 2006 15:10, Tom Lane wrote:
> Is it possible to get a stack trace from the stuck process?  I dunno
> if you've got anything gdb-equivalent under Windows, but that's the
> first thing I'd be interested in ...

Here ya go:

http://www.devisser-siderius.com/stack1.jpg
http://www.devisser-siderius.com/stack2.jpg
http://www.devisser-siderius.com/stack3.jpg

There are three threads in the process. I guess thread 1 (stack1.jpg) is the 
most interesting.

I also noted that cranking up concurrency in my app reproduces the problem in 
about 4 minutes ;-)

With thanks to Magnus Hagander for the Process Explorer hint.

jan

