Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows

2014-10-28 Thread Heikki Linnakangas

On 10/14/2014 03:59 PM, MauMau wrote:

BTW, in LWLockWaitForVar(), the first line of the following code fragment is
not necessary, because lwWaitLink is set to head immediately.  I think it
would be good to eliminate as much unnecessary code as possible from the
spinlock section.

   proc->lwWaitLink = NULL;

   /* waiters are added to the front of the queue */
   proc->lwWaitLink = lock->head;


Thanks, fixed!

- Heikki
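
A minimal sketch of why the first store is redundant. ToyProc, ToyQueueLock and toy_enqueue_front are hypothetical stand-ins, not the PostgreSQL definitions, and the caller is assumed to hold the lock's spinlock:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PGPROC; only the queue link is modeled. */
typedef struct ToyProc
{
    struct ToyProc *wait_link;  /* next waiter in the queue, or NULL */
} ToyProc;

/* Hypothetical stand-in for LWLock; only the wait queue is modeled. */
typedef struct ToyQueueLock
{
    ToyProc *head;  /* first waiter */
    ToyProc *tail;  /* last waiter */
} ToyQueueLock;

/* Add a waiter to the front of the queue. */
static void
toy_enqueue_front(ToyQueueLock *lock, ToyProc *proc)
{
    /*
     * No "proc->wait_link = NULL" is needed first: the store below
     * overwrites the link unconditionally, and it is already NULL
     * when the queue is empty.
     */
    proc->wait_link = lock->head;
    if (lock->head == NULL)
        lock->tail = proc;
    lock->head = proc;
}
```

Dropping the extra store removes one write from the section that runs with the spinlock held.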



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows

2014-10-15 Thread MauMau

From: MauMau maumau...@gmail.com
Thank you very much.  I didn't anticipate such a difficult, complicated 
cause.  The user agreed to try the patch tonight.  I'll report back the 
result as soon as I get it from him.


The test ran successfully without hanging for 24 hours.  It was run with your 
patch + the following:


BTW, in LWLockWaitForVar(), the first line of the following code fragment 
is not necessary, because lwWaitLink is set to head immediately.  I think 
it would be good to eliminate as much unnecessary code as possible from 
the spinlock section.


 proc->lwWaitLink = NULL;

 /* waiters are added to the front of the queue */
 proc->lwWaitLink = lock->head;



Regards
MauMau





Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows

2014-10-14 Thread Heikki Linnakangas

On 10/13/2014 06:57 PM, Heikki Linnakangas wrote:

Hmm, we could set releaseOK in LWLockWaitForVar(), though, when it
(re-)queues the backend. That would be less invasive, for sure
(attached). I like this better.


Committed this.

- Heikki





Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows

2014-10-14 Thread MauMau

From: Heikki Linnakangas hlinnakan...@vmware.com

Committed this.


Thank you very much.  I didn't anticipate such a difficult, complicated 
cause.  The user agreed to try the patch tonight.  I'll report back the 
result as soon as I get it from him.


BTW, in LWLockWaitForVar(), the first line of the following code fragment is 
not necessary, because lwWaitLink is set to head immediately.  I think it 
would be good to eliminate as much unnecessary code as possible from the 
spinlock section.


 proc->lwWaitLink = NULL;

 /* waiters are added to the front of the queue */
 proc->lwWaitLink = lock->head;

Regards
MauMau





Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows

2014-10-13 Thread Heikki Linnakangas
On 10/10/2014 05:08 PM, MauMau wrote:
 From: Craig Ringer cr...@2ndquadrant.com
 It sounds like they've produced a test case, so they should be able to
 with a bit of luck.

 Or even better, send you the test case.
 
 I asked the user about this.  It sounds like the relevant test case consists
 of many scripts.  He explained to me that the simplified test steps are:
 
 1. initdb
 2. pg_ctl start
 3. Create 16 tables.  Each of those tables consists of around 10 columns.
 4. Insert 1000 rows into each of those 16 tables.
 5. Launch 16 psql sessions concurrently.  Each session updates all 1000 rows
 of one table, e.g., session 1 updates table 1, session 2 updates table 2,
 and so on.
 6. Repeat step 5 50 times.
 
 This sounds a bit complicated, but I understood that the core part is 16
 concurrent updates, which should lead to contention on xlog insert slots
 and/or spinlocks.

I was able to reproduce this. I reduced wal_buffers to 64kB, and
NUM_XLOGINSERT_LOCKS to 4 to increase the probability of the deadlock,
and ran a test case as above on my laptop for several hours, and it
finally hung. Will investigate...

- Heikki





Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows

2014-10-13 Thread Andres Freund
On 2014-10-13 17:56:10 +0300, Heikki Linnakangas wrote:
 So the gist of the problem is that LWLockRelease doesn't wake up
 LW_WAIT_UNTIL_FREE waiters, when releaseOK == false. It should, because a
 LW_WAIT_UNTIL_FREE waiter is now free to run if the variable has changed in
 value, and it won't steal the lock from the other backend that's waiting to
 get the lock in exclusive mode, anyway.

I'm not a big fan of that change. Right now we don't iterate the waiters
if releaseOK isn't set. Which is good for the normal lwlock code because
it avoids pointer indirections (of stuff likely residing on another
cpu). Wouldn't it be more sensible to reset releaseOK in *UpdateVar()? I
might just miss something here.

 
 I noticed another potential bug: LWLockAcquireCommon doesn't use a volatile
 pointer when it sets the value of the protected variable:
 
  /* If there's a variable associated with this lock, initialize it */
  if (valptr)
      *valptr = val;
 
  /* We are done updating shared state of the lock itself. */
  SpinLockRelease(&lock->mutex);
 
 If the compiler or CPU decides to reorder those two, so that the variable is
 set after releasing the spinlock, things will break.

Good catch. As Robert says that should be fine with master, but 9.4
obviously needs it.
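
A minimal sketch of the ordering requirement, under stated assumptions: ToySpinLock and the toy_* functions are hypothetical, and the spinlock is a C11 atomic_flag rather than PostgreSQL's s_lock.h. The store goes through a volatile-qualified pointer, and the release-ordered clear keeps that store from moving past the unlock:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical spinlock built on a C11 atomic_flag. */
typedef struct ToySpinLock
{
    atomic_flag mutex;
} ToySpinLock;

static void
toy_spin_acquire(ToySpinLock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->mutex, memory_order_acquire))
        ;                       /* busy-wait until the flag is clear */
}

static void
toy_spin_release(ToySpinLock *l)
{
    /* Release ordering: earlier stores may not be moved after this. */
    atomic_flag_clear_explicit(&l->mutex, memory_order_release);
}

/* Take the lock and initialize the protected variable, if there is one. */
static void
toy_acquire_and_set(ToySpinLock *l, uint64_t *valptr, uint64_t val)
{
    volatile uint64_t *valp = valptr;   /* volatile-qualified access */

    toy_spin_acquire(l);
    if (valp)
        *valp = val;            /* must be visible before the release */
    toy_spin_release(l);
}
```

With a release-ordered unlock the volatile qualifier is not strictly needed for ordering; it is kept here only to mirror the shape of the code under discussion.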

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows

2014-10-13 Thread Heikki Linnakangas

On 10/13/2014 06:26 PM, Andres Freund wrote:

On 2014-10-13 17:56:10 +0300, Heikki Linnakangas wrote:

So the gist of the problem is that LWLockRelease doesn't wake up
LW_WAIT_UNTIL_FREE waiters, when releaseOK == false. It should, because a
LW_WAIT_UNTIL_FREE waiter is now free to run if the variable has changed in
value, and it won't steal the lock from the other backend that's waiting to
get the lock in exclusive mode, anyway.


I'm not a big fan of that change. Right now we don't iterate the waiters
if releaseOK isn't set. Which is good for the normal lwlock code because
it avoids pointer indirections (of stuff likely residing on another
cpu).


I can't get excited about that. It's pretty rare for releaseOK to be 
false, and when it's true, you iterate the waiters anyway.



Wouldn't it be more sensible to reset releaseOK in *UpdateVar()? I
might just miss something here.


That's not enough. There's no LWLockUpdateVar involved in the example I 
gave. And LWLockUpdateVar() already wakes up all LW_WAIT_UNTIL_FREE 
waiters, regardless of releaseOK.


Hmm, we could set releaseOK in LWLockWaitForVar(), though, when it 
(re-)queues the backend. That would be less invasive, for sure 
(attached). I like this better.
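
A toy model of that idea, not the real lwlock.c logic: ToyWakeLock, toy_wait_for_var and toy_release are hypothetical, and the wait queue is reduced to a counter. It only illustrates that a release wakes nobody while releaseOK is false, and that setting the flag when (re-)queueing avoids the missed wakeup:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified lock state. */
typedef struct ToyWakeLock
{
    bool held;        /* is the lock currently held? */
    bool release_ok;  /* may a release wake up queued waiters? */
    int  nwaiters;    /* queued "wait until free" waiters */
} ToyWakeLock;

/* Queue a waiter; set_release_ok = true models the proposed fix. */
static void
toy_wait_for_var(ToyWakeLock *lock, bool set_release_ok)
{
    lock->nwaiters++;
    if (set_release_ok)
        lock->release_ok = true;
}

/* Release the lock; returns how many waiters were woken. */
static int
toy_release(ToyWakeLock *lock)
{
    int woken = 0;

    lock->held = false;
    if (lock->nwaiters > 0 && lock->release_ok)
    {
        woken = lock->nwaiters;
        lock->nwaiters = 0;
    }
    return woken;
}
```

Starting from a held lock with release_ok == false, queueing without the fix and then releasing wakes zero waiters (the hang); queueing with the fix wakes the waiter.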


BTW, attached is a little test program I wrote to reproduce this more 
easily. It exercises the LWLock* calls directly. To run, make and 
install, and do CREATE EXTENSION lwlocktest. Then launch three 
backends concurrently that run select lwlocktest(1), select 
lwlocktest(2) and select lwlocktest(3). It will deadlock within seconds.


- Heikki

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 5453549..cee3f08 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -482,6 +482,7 @@ static inline bool
 LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 {
 	volatile LWLock *lock = l;
+	volatile uint64 *valp = valptr;
 	PGPROC	   *proc = MyProc;
 	bool		retry = false;
 	bool		result = true;
@@ -637,8 +638,8 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 	}
 
 	/* If there's a variable associated with this lock, initialize it */
-	if (valptr)
-		*valptr = val;
+	if (valp)
+		*valp = val;
 
 	/* We are done updating shared state of the lock itself. */
 	SpinLockRelease(&lock->mutex);
@@ -976,6 +977,12 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 			lock->tail = proc;
 		lock->head = proc;
 
+		/*
+		 * Set releaseOK, to make sure we get woken up as soon as the lock is
+		 * released.
+		 */
+		lock->releaseOK = true;
+
 		/* Can release the mutex now */
 		SpinLockRelease(&lock->mutex);
 


lwlocktest.tar.gz
Description: application/gzip



Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows

2014-10-10 Thread MauMau

From: Craig Ringer cr...@2ndquadrant.com

It sounds like they've produced a test case, so they should be able to
with a bit of luck.

Or even better, send you the test case.


I asked the user about this.  It sounds like the relevant test case consists 
of many scripts.  He explained to me that the simplified test steps are:


1. initdb
2. pg_ctl start
3. Create 16 tables.  Each of those tables consists of around 10 columns.
4. Insert 1000 rows into each of those 16 tables.
5. Launch 16 psql sessions concurrently.  Each session updates all 1000 rows 
of one table, e.g., session 1 updates table 1, session 2 updates table 2, 
and so on.

6. Repeat step 5 50 times.

This sounds a bit complicated, but I understood that the core part is 16 
concurrent updates, which should lead to contention on xlog insert slots 
and/or spinlocks.




Your next step here really needs to be to make this reproducible against
a debug build. Then see if reverting the xlog scalability work actually
changes the behaviour, given that you hypothesised that it could be
involved.


Thank you, but that may be labor-intensive and time-consuming.  In addition, 
the user uses a machine with multiple CPU cores, while I only have a desktop 
PC with two CPU cores.  So I doubt I can reproduce the problem on my PC.


I asked the user to change S_UNLOCK to something like the following and run 
the test during this weekend (the next Monday is a national holiday in 
Japan).


#define S_UNLOCK(lock)  InterlockedExchange(lock, 0)

FYI, the user reported today that the problem didn't occur when he ran the 
same test for 24 hours on 9.3.5.  Do you see something relevant in 9.4?


Regards
MauMau





Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows

2014-10-10 Thread Andres Freund
On 2014-10-10 23:08:34 +0900, MauMau wrote:
 From: Craig Ringer cr...@2ndquadrant.com
 It sounds like they've produced a test case, so they should be able to
 with a bit of luck.
 
 Or even better, send you the test case.
 
 I asked the user about this.  It sounds like the relevant test case consists
 of many scripts.  He explained to me that the simplified test steps are:
 
 1. initdb
 2. pg_ctl start
 3. Create 16 tables.  Each of those tables consists of around 10 columns.
 4. Insert 1000 rows into each of those 16 tables.
 5. Launch 16 psql sessions concurrently.  Each session updates all 1000 rows
 of one table, e.g., session 1 updates table 1, session 2 updates table 2,
 and so on.
 6. Repeat step 5 50 times.
 
 This sounds a bit complicated, but I understood that the core part is 16
 concurrent updates, which should lead to contention on xlog insert slots
 and/or spinlocks.

Hm. I've run similar loads on linux for long enough that I'm relatively
sure I'd have seen this.

Could you get them to print out the contents of the lwlock all these
processes are waiting for?

 Your next step here really needs to be to make this reproducible against
 a debug build. Then see if reverting the xlog scalability work actually
 changes the behaviour, given that you hypothesised that it could be
 involved.

I don't think you can trivially revert the xlog scalability stuff.

 Thank you, but that may be labor-intensive and time-consuming.  In addition,
 the user uses a machine with multiple CPU cores, while I only have a desktop
 PC with two CPU cores.  So I doubt I can reproduce the problem on my PC.

Well, it'll also be labor intensive for the community to debug.

 I asked the user to change S_UNLOCK to something like the following and run
 the test during this weekend (the next Monday is a national holiday in
 Japan).
 
 #define S_UNLOCK(lock)  InterlockedExchange(lock, 0)

That shouldn't be required. For one, on 9.4 (not 9.5!) spinlock releases
only need to prevent reordering on the CPU level. As x86 is a TSO
architecture (total store order) that doesn't require doing anything
special. And even if it'd require more, on msvc volatile reads/stores
act as acquire/release fences unless you monkey with the compiler settings.
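
A sketch of the two unlock styles in portable C11 atomics, as an analogy only: toy_slock_t and the toy_* functions are hypothetical and are not PostgreSQL's s_lock.h. A release store is sufficient for unlocking, while a sequentially consistent exchange, comparable in spirit to InterlockedExchange(lock, 0), adds a full barrier:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical spinlock word: 0 = free, 1 = held. */
typedef atomic_int toy_slock_t;

static void
toy_lock(toy_slock_t *lock)
{
    /* Spin until we are the one who changes 0 to 1. */
    while (atomic_exchange_explicit(lock, 1, memory_order_acquire) != 0)
        ;
}

/* Unlock with a release store; no full barrier is requested. */
static void
toy_unlock_release(toy_slock_t *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}

/* Unlock with a sequentially consistent exchange (full barrier). */
static void
toy_unlock_exchange(toy_slock_t *lock)
{
    (void) atomic_exchange(lock, 0);
}
```

On x86 a release store typically compiles to an ordinary store, which matches the point that TSO needs nothing special at the CPU level; the exchange variant is correct too, just more expensive.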

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows

2014-10-09 Thread Craig Ringer
On 10/09/2014 09:47 PM, MauMau wrote:
 
 I heard that the user had run 16 concurrent psql sessions which execute
 INSERT and UPDATE statements, which is a write-intensive stress test. 
 He encountered the hang phenomenon twice, one of which occurred several
 hours after the start of the test, and the other occurred about an hour
 after the test launch.

It'd be interesting and useful to run this test on a debug build of
PostgreSQL, i.e. one compiled against the debug version of the C library
and with full debuginfo not just minimal .pdb.

How were the stacks captured - what tool?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows

2014-10-09 Thread Andres Freund
On 2014-10-09 22:47:48 +0900, MauMau wrote:
 Hello,
 
 One user reported a hang problem with 9.4 beta2 on Windows.  PostgreSQL
 is the 64-bit version.  I couldn't find the cause, but want to solve the
 problem.  Could you help with this?
 
 I heard that the user had run 16 concurrent psql sessions which execute
 INSERT and UPDATE statements, which is a write-intensive stress test.  He
 encountered the hang phenomenon twice, one of which occurred several hours
 after the start of the test, and the other occurred about an hour after the
 test launch.
 
 The user gave me the stack traces, which I attached at the end of this mail.
 The problem appears to be related to the xlog insert scaling.  But I can't
 figure out where the root cause lies --- WAL slot handling, spinlock on
 Windows, or PGSemaphoreLock/UnLock on Windows?
 
 The place I suspect is S_UNLOCK().  It doesn't use any memory barrier.  Is
 this correct on Intel64 processors?

What precisely do you mean with Intel64? 64bit x86 or Itanium?

Also, what's the precise workload? Can you reproduce the problem?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows

2014-10-09 Thread MauMau

From: Craig Ringer cr...@2ndquadrant.com

It'd be interesting and useful to run this test on a debug build of
PostgreSQL, i.e. one compiled against the debug version of the C library
and with full debuginfo not just minimal .pdb.


Although I'm not sure the user can do this now, I'll ask him anyway.


How were the stacks captured - what tool?


According to his mail, Windbg or userdump.exe.  I'll ask him about this.

Regards
MauMau





Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows

2014-10-09 Thread MauMau

From: Andres Freund and...@2ndquadrant.com

What precisely do you mean with Intel64? 64bit x86 or Itanium?


64-bit x86, i.e. x86-64.



Also, what's the precise workload? Can you reproduce the problem?


IIUC, each client inserts 1000 records into one table, then repeats updating 
all those records.  I'll ask him again.


Regards
MauMau





Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows

2014-10-09 Thread Craig Ringer
On 10/10/2014 04:16 AM, MauMau wrote:
 From: Craig Ringer cr...@2ndquadrant.com
 It'd be interesting and useful to run this test on a debug build of
 PostgreSQL, i.e. one compiled against the debug version of the C library
 and with full debuginfo not just minimal .pdb.
 
 Although I'm not sure the user can do this now, I'll ask him anyway.

It sounds like they've produced a test case, so they should be able to
with a bit of luck.

Or even better, send you the test case.

 How were the stacks captured - what tool?
 
 According to his mail, Windbg or userdump.exe.  I'll ask him about this.

Thanks. The stack trace looks fairly sane, i.e. there's nothing
obviously out of whack at a glance, but I tend to get more informative
traces from Visual Studio debugging sessions.

Your next step here really needs to be to make this reproducible against
a debug build. Then see if reverting the xlog scalability work actually
changes the behaviour, given that you hypothesised that it could be
involved.

As I said off-list, if you can narrow the test case down to something
that can be reproduced more quickly, you could also git-bisect to seek
the commit at fault. Even if the test case takes an hour, that's still
viable:

$ git bisect start
$ git bisect bad
$ git bisect good REL9_3_RC1
Bisecting: a merge base must be tested
[e472b921406407794bab911c64655b8b82375196] Avoid deadlocks during
insertion into SP-GiST indexes.
$ git bisect good
Bisecting: 1026 revisions left to test after this (roughly 10 steps)
...


Thanks to the magic of binary search.
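
A rough way to see where "roughly 10 steps" comes from, assuming each test halves the remaining candidates. toy_bisect_steps is a hypothetical helper; git's own estimate may be computed differently:

```c
#include <assert.h>

/* Number of halvings needed to narrow n candidate revisions down to one. */
static int
toy_bisect_steps(unsigned int n)
{
    int steps = 0;

    while (n > 1)
    {
        n /= 2;     /* each good/bad verdict discards about half */
        steps++;
    }
    return steps;
}
```

For the 1026 revisions in the output above this gives 10, so a test case that takes an hour per run means on the order of ten hours of bisecting.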

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

