Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-03-21 Thread Andres Freund
Hi,

I see you've committed this, cool. Sorry for not getting back to the
topic earlier..

On 2014-03-13 22:44:03 +0200, Heikki Linnakangas wrote:
 On 03/12/2014 09:29 PM, Andres Freund wrote:
 On 2014-03-07 17:54:32 +0200, Heikki Linnakangas wrote:
 So there are some unexplained differences there, but based on these results,
 I'm still OK with committing the patch.
 
 So, I am looking at this right now.
 
 I think there are some minor things I'd like to see addressed:
 
 1) I think there needs to be a good sized comment explaining why
 WaitXLogInsertionsToFinish() isn't racy due to the unlocked read at
 the beginning of LWLockWait().
 
 There's a comment inside LWLockWait(). I think that's the right place for
 it; it's LWLockWait() that's cheating by not acquiring the spinlock before
 reading lock->exclusive.

I don't find that argument convincing. After all, it's only correct
because the API user does things in a particular way. So there should be
a comment at the callsite to make sure that's not changed.

 3) I am the wrong one to complain, I know, but the comments above struct
 WALInsertLock are pretty hard to read from the sentence structure.
 
 Hmm, ok. I reworded that, I hope it's more clear now.

Yes, it is.

The committed version doesn't compile with LWLOCK_STATS...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-03-21 Thread Andres Freund
On 2014-03-21 22:52:33 +0100, Andres Freund wrote:
 The committed version doesn't compile with LWLOCK_STATS...

Just noticed that it seems to also break the dtrace stuff:
http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=rover_firefly&dt=2014-03-21%2018%3A04%3A00

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-03-12 Thread Andres Freund
On 2014-03-07 17:54:32 +0200, Heikki Linnakangas wrote:
 So there are some unexplained differences there, but based on these results,
 I'm still OK with committing the patch.

So, I am looking at this right now.

I think there are some minor things I'd like to see addressed:

1) I think there needs to be a good sized comment explaining why
   WaitXLogInsertionsToFinish() isn't racy due to the unlocked read at
   the beginning of LWLockWait(). I think it's safe because we're
   reading Insert->CurrBytePos inside a spinlock, and it will only ever
   increment. As SpinLockAcquire() has to be a read barrier we can
   assume that every skewed read in LWLockWait() will be for a lock
   protecting a newer insertingAt? (See the sketch after this list.)
2) I am not particularly happy about the LWLockWait() & LWLockWakeup()
   function names. They sound too much like a part of the normal lwlock
   implementation to me. But admittedly I don't have a great idea for
   a better naming scheme. Maybe LWLockWaitForVar(),
   LWLockWakeupVarWaiter()?
3) I am the wrong one to complain, I know, but the comments above struct
   WALInsertLock are pretty hard to read from the sentence structure.
4) WALInsertLockAcquire() needs to comment on acquiring/waking all but
   the last slot. Generally the trick of exclusive xlog insertion lock
   acquisition only really using the last lock could use a bit more
   docs.
5) WALInsertLockRelease() comments on the reset of insertingAt being
   optional, but I am not convinced that that's true anymore. If an
   exclusive acquisition isn't seen as 0 or
   INT64CONST(0x) by another backend we're in trouble,
   right? Absolutely not sure without thinking on it for longer than I
   can concentrate right now.
6) Pretty minor, but from a style POV it seems nicer to separate
   exclusive/nonexclusive out of WALInsertLockAcquire(). The cases don't
   share any code now.
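
(To illustrate point 1, a condensed, hypothetical sketch of the pattern in
WaitXLogInsertionsToFinish(); names follow the patch under review, and most
of the real function's bookkeeping is elided:)

	XLogRecPtr	finishedUpto;
	uint64		bytepos;
	int			i;

	/* Read the insert position under a spinlock; CurrBytePos only grows. */
	SpinLockAcquire(&Insert->insertpos_lck);
	bytepos = Insert->CurrBytePos;
	SpinLockRelease(&Insert->insertpos_lck);

	finishedUpto = XLogBytePosToEndRecPtr(bytepos);
	for (i = 0; i < num_xloginsert_locks; i++)
	{
		XLogRecPtr	insertingat = InvalidXLogRecPtr;

		do
		{
			/*
			 * The unlocked read inside LWLockWait() may be stale, but the
			 * SpinLockAcquire() above acted as a read barrier, so any skew
			 * can only come from a lock protecting a newer insertingAt.
			 */
			if (LWLockWait(&WALInsertLocks[i].l.lock,
						   &WALInsertLocks[i].l.insertingAt,
						   insertingat, &insertingat))
			{
				/* lock was free, so no insertion in progress */
				insertingat = InvalidXLogRecPtr;
				break;
			}
		} while (insertingat < upto);

		if (insertingat != InvalidXLogRecPtr && insertingat < finishedUpto)
			finishedUpto = insertingat;
	}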

A patch containing some trivial changes is attached...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 484b9c5..8a55c6b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1628,8 +1628,6 @@ WALInsertLockRelease(void)
 static void
 WALInsertLockWakeup(XLogRecPtr insertingAt)
 {
-	int			i;
-
 	if (holdingAllLocks)
 	{
 		/*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index f88bf76..2695128 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -873,6 +873,9 @@ LWLockWait(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 	int			extraWaits = 0;
 	bool		result = false;
 
+	/* can't be used with shared locks for now */
+	Assert(lock->shared == 0);
+
 	/*
 	 * Quick test first to see if it the slot is free right now.
 	 *
@@ -905,6 +908,8 @@ LWLockWait(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 		SpinLockAcquire(&lock->mutex);
 #endif
 
+		Assert(lock->shared == 0);
+
 		/* Is the lock now free, and if not, does the value match? */
 		if (lock->exclusive == 0)
 		{
@@ -1022,6 +1027,7 @@ LWLockWakeup(LWLock *l, uint64 *valptr, uint64 val)
 	SpinLockAcquire(&lock->mutex);
 
 	/* we should hold the lock */
+	LWLockHeldByMe(l);
 	Assert(lock->exclusive == 1);
 
 	/* Update the lock's value */



Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-18 Thread MauMau

From: Peter Geoghegan p...@heroku.com

You mentioned a hang during a B-Tree insert operation - do you happen
to have a backtrace that relates to that?


Sorry, I may have misunderstood.  The three stack traces I attached are not 
related to btree.  I recall that I saw one stack trace containing 
bt_insert(), but I'm not sure.


When the hang occurred, INSERT/UPDATE/COMMIT statements stopped for at least 
one hour, while SELECT statements ran smoothly.


Regards
MauMau





Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-18 Thread Heikki Linnakangas

On 02/17/2014 10:36 PM, Andres Freund wrote:

On 2014-02-17 22:30:54 +0200, Heikki Linnakangas wrote:

This is what I came up with. I like it, I didn't have to contort lwlocks as
much as I feared. I added one field to LWLock structure, which is used to
store the position of how far a WAL inserter has progressed. The LWLock code
calls it just "value", without caring what's stored in it, and it's used by
new functions LWLockWait and LWLockWakeup to implement the behavior the WAL
insertion slots have, to wake up other processes waiting for the slot
without releasing it.

This passes regression tests, but I'll have to re-run the performance tests
with this. One worry is that if the padded size of the LWLock struct is
smaller than cache line, neighboring WAL insertion locks will compete for
the cache line. Another worry is that since I added a field to LWLock
struct, it might now take 64 bytes on platforms where it used to be 32 bytes
before. That wastes some memory.


Why don't you allocate them in a separate tranche, from xlog.c? Then you
can store them inside whatever bigger object you want, guaranteeing
exactly the alignment you need. Possibly you can even have the extra
value in the enclosing object?


Good idea. New patch attached, doing that.

I'll try to find time on some multi-CPU hardware to performance test 
this against current master, to make sure there's no regression.


- Heikki
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 508970a..3eef968 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -86,7 +86,7 @@ int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
 int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
-int			num_xloginsert_slots = 8;
+int			num_xloginsert_locks = 8;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -289,7 +289,7 @@ XLogRecPtr	XactLastRecEnd = InvalidXLogRecPtr;
  * (which is almost but not quite the same as a pointer to the most recent
  * CHECKPOINT record).	We update this from the shared-memory copy,
 * XLogCtl->Insert.RedoRecPtr, whenever we can safely do so (ie, when we
- * hold an insertion slot).  See XLogInsert for details.  We are also allowed
+ * hold an insertion lock).  See XLogInsert for details.  We are also allowed
 * to update from XLogCtl->RedoRecPtr if we hold the info_lck;
  * see GetRedoRecPtr.  A freshly spawned backend obtains the value during
  * InitXLOGAccess.
@@ -361,63 +361,45 @@ typedef struct XLogwrtResult
 	XLogRecPtr	Flush;			/* last byte + 1 flushed */
 } XLogwrtResult;
 
-
 /*
- * A slot for inserting to the WAL. This is similar to an LWLock, the main
- * difference is that there is an extra xlogInsertingAt field that is protected
- * by the same mutex. Unlike an LWLock, a slot can only be acquired in
- * exclusive mode.
- *
- * The xlogInsertingAt field is used to advertise to other processes how far
- * the slot owner has progressed in inserting the record. When a backend
- * acquires a slot, it initializes xlogInsertingAt to 1, because it doesn't
- * yet know where it's going to insert the record. That's conservative
- * but correct; the new insertion is certainly going to go to a byte position
- * greater than 1. If another backend needs to flush the WAL, it will have to
- * wait for the new insertion. xlogInsertingAt is updated after finishing the
- * insert or when crossing a page boundary, which will wake up anyone waiting
- * for it, whether the wait was necessary in the first place or not.
- *
- * A process can wait on a slot in two modes: LW_EXCLUSIVE or
- * LW_WAIT_UNTIL_FREE. LW_EXCLUSIVE works like in an lwlock; when the slot is
- * released, the first LW_EXCLUSIVE waiter in the queue is woken up. Processes
- * waiting in LW_WAIT_UNTIL_FREE mode are woken up whenever the slot is
- * released, or xlogInsertingAt is updated. In other words, a process in
- * LW_WAIT_UNTIL_FREE mode is woken up whenever the inserter makes any progress
- * copying the record in place. LW_WAIT_UNTIL_FREE waiters are always added to
- * the front of the queue, while LW_EXCLUSIVE waiters are appended to the end.
- *
- * To join the wait queue, a process must set MyProc->lwWaitMode to the mode
- * it wants to wait in, MyProc->lwWaiting to true, and link MyProc to the head
- * or tail of the wait queue. The same mechanism is used to wait on an LWLock,
- * see lwlock.c for details.
+ * Inserting to WAL is protected by a bunch of WALInsertLocks. Each WAL
+ * insertion lock consists of a lightweight lock, plus an indicator of how
+ * far the insertion has progressed (insertingAt).
+ *
+ * The insertingAt value is used when writing the WAL to disk, to avoid
+ * waiting unnecessarily for an insertion that's still in-progress, but has
+ * already finished inserting all WAL beyond the point you're going to write
+ * the WAL up to. This isn't just to optimize, it's necessary to 

Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-17 Thread Robert Haas
On Sat, Feb 15, 2014 at 11:17 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-02-15 16:18:00 +0100, Andres Freund wrote:
 On 2014-02-15 10:06:41 -0500, Tom Lane wrote:
  Andres Freund and...@2ndquadrant.com writes:
   My current conclusion is that backporting barriers.h is by far the most
   reasonable way to go. The compiler problems have been ironed out by
   now...
 
  -1.  IMO that code is still quite unproven, and what's more, the
  problem we're discussing here is completely hypothetical.  If it
  were real, we'd have field evidence of it.  We've not had that
  much trouble seeing instances of even very narrow race-condition
  windows in the past.

 Well, the problem is that few of us have access to interesting !x86
 machines to run tests, and that's where we'd see problems (since x86
 gives enough guarantees to avoid this unless the compiler reorders
 stuff). I am personally fine with just using volatiles to avoid
 reordering in the older branches, but Florian argued against it.

 Here's patches doing that. The 9.3 version also applies to 9.2; the 9.1
 version applies back to 8.4.

I have no confidence that this isn't going to be real bad for performance.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-17 Thread Andres Freund
On 2014-02-17 13:49:01 -0500, Robert Haas wrote:
 On Sat, Feb 15, 2014 at 11:17 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2014-02-15 16:18:00 +0100, Andres Freund wrote:
  On 2014-02-15 10:06:41 -0500, Tom Lane wrote:
   Andres Freund and...@2ndquadrant.com writes:
My current conclusion is that backporting barriers.h is by far the most
reasonable way to go. The compiler problems have been ironed out by
now...
  
   -1.  IMO that code is still quite unproven, and what's more, the
   problem we're discussing here is completely hypothetical.  If it
   were real, we'd have field evidence of it.  We've not had that
   much trouble seeing instances of even very narrow race-condition
   windows in the past.
 
  Well, the problem is that few of us have access to interesting !x86
  machines to run tests, and that's where we'd see problems (since x86
  gives enough guarantees to avoid this unless the compiler reorders
  stuff). I am personally fine with just using volatiles to avoid
  reordering in the older branches, but Florian argued against it.
 
  Here's patches doing that. The 9.3 version also applies to 9.2; the 9.1
  version applies back to 8.4.
 
 I have no confidence that this isn't going to be real bad for performance.

It's just a write barrier which evaluates to a pure compiler barrier on
x86 anyway?
And it's in a loop that's only entered when the kernel is entered anyway
to wake up the other backend.

What should that affect significantly?
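
(For reference, a hedged sketch of what barriers.h resolves this to for gcc
on x86; exact spellings may differ by branch:)

	/* x86 already orders stores, so a write barrier only has to keep the
	 * *compiler* from reordering; it emits no instructions at all. */
	#define pg_compiler_barrier()	__asm__ __volatile__("" : : : "memory")
	#define pg_write_barrier()		pg_compiler_barrier()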

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-17 Thread Robert Haas
On Mon, Feb 17, 2014 at 1:55 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-02-17 13:49:01 -0500, Robert Haas wrote:
 On Sat, Feb 15, 2014 at 11:17 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2014-02-15 16:18:00 +0100, Andres Freund wrote:
  On 2014-02-15 10:06:41 -0500, Tom Lane wrote:
   Andres Freund and...@2ndquadrant.com writes:
My current conclusion is that backporting barriers.h is by far the 
most
reasonable way to go. The compiler problems have been ironed out by
now...
  
   -1.  IMO that code is still quite unproven, and what's more, the
   problem we're discussing here is completely hypothetical.  If it
   were real, we'd have field evidence of it.  We've not had that
   much trouble seeing instances of even very narrow race-condition
   windows in the past.
 
  Well, the problem is that few of us have access to interesting !x86
  machines to run tests, and that's where we'd see problems (since x86
  gives enough guarantees to avoid this unless the compiler reorders
  stuff). I am personally fine with just using volatiles to avoid
  reordering in the older branches, but Florian argued against it.
 
  Here's patches doing that. The 9.3 version also applies to 9.2; the 9.1
  version applies back to 8.4.

 I have no confidence that this isn't going to be real bad for performance.

 It's just a write barrier which evaluates to a pure compiler barrier on
 x86 anyway?
 And it's in a loop that's only entered when the kernel is entered anyway
 to wake up the other backend.

 What should that affect significantly?

On x86, presumably nothing.  On other architectures, I don't know what
the impact is, but I don't accept a hand-wavy assertion that there
shouldn't be any as evidence that there won't be.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-17 Thread Andres Freund
On 2014-02-17 14:06:43 -0500, Robert Haas wrote:
 On Mon, Feb 17, 2014 at 1:55 PM, Andres Freund and...@2ndquadrant.com wrote:
  On 2014-02-17 13:49:01 -0500, Robert Haas wrote:
  It's just a write barrier which evaluates to a pure compiler barrier on
  x86 anyway?
  And it's in a loop that's only entered when the kernel is entered anyway
  to wake up the other backend.
 
  What should that affect significantly?
 
 On x86, presumably nothing.  On other architectures, I don't know what
 the impact is, but I don't accept a hand-wavy assertion that there
 shouldn't be any as evidence that there won't be.

Directly afterwards there's a syscall that needs to do internal locking
(because it's essentially doing IPC). Which combined certainly is much
more expensive than a write barrier.
And any !x86 architecture that has more heavyweight write barriers
really *needs* a barrier there since you only need more heavyweight
write barriers if the architecture doesn't guarantee total store
order. This isn't a performance optimization, it's correctness.

What's the way to resolve this then? I don't have access to any big !x86
machines.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-17 Thread Heikki Linnakangas

On 02/10/2014 08:33 PM, Heikki Linnakangas wrote:

On 02/10/2014 08:03 PM, Tom Lane wrote:

Heikki Linnakangas hlinnakan...@vmware.com writes:

On 02/10/2014 06:41 PM, Andres Freund wrote:

Well, it's not actually using any lwlock.c code, it's a special case
locking logic, just reusing the datastructures. That said, I am not
particularly happy about the amount of code it's duplicating from
lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of
WALInsertSlotAcquireOne() is copied.



I'm not too happy with the amount of copy-paste myself, but there was
enough difference to regular lwlocks that I didn't want to bother all
lwlocks with the xlog-specific stuff either. The WAL insert slots do
share the LWLock-related PGPROC fields though, and semaphore. I'm all
ears if you have ideas on that..


I agree that if the behavior is considerably different, we don't really
want to try to make LWLockAcquire/Release cater to both this and their
standard behavior.  But why not add some additional functions in lwlock.c
that do what xlog wants?  If we're going to have mostly-copy-pasted logic,
it'd at least be better if it was in the same file, and not somewhere
that's not even in the same major subtree.


Ok, I'll try to refactor it that way, so that we can see if it looks better.


This is what I came up with. I like it, I didn't have to contort lwlocks 
as much as I feared. I added one field to LWLock structure, which is 
used to store the position of how far a WAL inserter has progressed. The 
LWLock code calls it just "value", without caring what's stored in it,
and it's used by new functions LWLockWait and LWLockWakeup to implement 
the behavior the WAL insertion slots have, to wake up other processes 
waiting for the slot without releasing it.


This passes regression tests, but I'll have to re-run the performance 
tests with this. One worry is that if the padded size of the LWLock 
struct is smaller than cache line, neighboring WAL insertion locks will 
compete for the cache line. Another worry is that since I added a field 
to LWLock struct, it might now take 64 bytes on platforms where it used 
to be 32 bytes before. That wastes some memory.


- Heikki
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 508970a..b148f70 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -86,7 +86,7 @@ int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
 int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
-int			num_xloginsert_slots = 8;
+int			num_xloginsert_locks = 8;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -289,7 +289,7 @@ XLogRecPtr	XactLastRecEnd = InvalidXLogRecPtr;
  * (which is almost but not quite the same as a pointer to the most recent
  * CHECKPOINT record).	We update this from the shared-memory copy,
 * XLogCtl->Insert.RedoRecPtr, whenever we can safely do so (ie, when we
- * hold an insertion slot).  See XLogInsert for details.  We are also allowed
+ * hold an insertion lock).  See XLogInsert for details.  We are also allowed
 * to update from XLogCtl->RedoRecPtr if we hold the info_lck;
  * see GetRedoRecPtr.  A freshly spawned backend obtains the value during
  * InitXLOGAccess.
@@ -363,63 +363,6 @@ typedef struct XLogwrtResult
 
 
 /*
- * A slot for inserting to the WAL. This is similar to an LWLock, the main
- * difference is that there is an extra xlogInsertingAt field that is protected
- * by the same mutex. Unlike an LWLock, a slot can only be acquired in
- * exclusive mode.
- *
- * The xlogInsertingAt field is used to advertise to other processes how far
- * the slot owner has progressed in inserting the record. When a backend
- * acquires a slot, it initializes xlogInsertingAt to 1, because it doesn't
- * yet know where it's going to insert the record. That's conservative
- * but correct; the new insertion is certainly going to go to a byte position
- * greater than 1. If another backend needs to flush the WAL, it will have to
- * wait for the new insertion. xlogInsertingAt is updated after finishing the
- * insert or when crossing a page boundary, which will wake up anyone waiting
- * for it, whether the wait was necessary in the first place or not.
- *
- * A process can wait on a slot in two modes: LW_EXCLUSIVE or
- * LW_WAIT_UNTIL_FREE. LW_EXCLUSIVE works like in an lwlock; when the slot is
- * released, the first LW_EXCLUSIVE waiter in the queue is woken up. Processes
- * waiting in LW_WAIT_UNTIL_FREE mode are woken up whenever the slot is
- * released, or xlogInsertingAt is updated. In other words, a process in
- * LW_WAIT_UNTIL_FREE mode is woken up whenever the inserter makes any progress
- * copying the record in place. LW_WAIT_UNTIL_FREE waiters are always added to
- * the front of the queue, while LW_EXCLUSIVE waiters are appended to the end.
- *
- * To join the wait queue, a process must set 

Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-17 Thread Andres Freund
On 2014-02-17 22:30:54 +0200, Heikki Linnakangas wrote:
 This is what I came up with. I like it, I didn't have to contort lwlocks as
 much as I feared. I added one field to LWLock structure, which is used to
 store the position of how far a WAL inserter has progressed. The LWLock code
 calls it just "value", without caring what's stored in it, and it's used by
 new functions LWLockWait and LWLockWakeup to implement the behavior the WAL
 insertion slots have, to wake up other processes waiting for the slot
 without releasing it.
 
 This passes regression tests, but I'll have to re-run the performance tests
 with this. One worry is that if the padded size of the LWLock struct is
 smaller than cache line, neighboring WAL insertion locks will compete for
 the cache line. Another worry is that since I added a field to LWLock
 struct, it might now take 64 bytes on platforms where it used to be 32 bytes
 before. That wastes some memory.

Why don't you allocate them in a separate tranche, from xlog.c? Then you
can store them inside whatever bigger object you want, guaranteeing
exactly the alignment you need. Possibly you can even have the extra
value in the enclosing object?

I'd very much like to keep the core lwlock's size from increasing much, I
plan to work on inlining them in the BufferDescriptors and keeping it
smaller does increase cache hit ratio..
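
(For illustration, a minimal sketch of such a tranche-based arrangement,
with assumed names; the padding and field layout are the point, not the
details:)

	typedef struct
	{
		LWLock		lock;			/* the lock proper */
		XLogRecPtr	insertingAt;	/* the extra value lives in the
									 * enclosing object, not in LWLock */
	} WALInsertLock;

	/* Pad to a full cache line so neighboring locks don't false-share. */
	typedef union WALInsertLockPadded
	{
		WALInsertLock	l;
		char			pad[PG_CACHE_LINE_SIZE];
	} WALInsertLockPadded;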

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-17 Thread Peter Geoghegan
On Wed, Feb 12, 2014 at 3:55 AM, MauMau maumau...@gmail.com wrote:
 FYI, the following stack traces are the ones obtained during two instances
 of hang.

You mentioned a hang during a B-Tree insert operation - do you happen
to have a backtrace that relates to that?


-- 
Peter Geoghegan




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-15 Thread Andres Freund
On 2014-02-15 04:20:17 +0100, Florian Pflug wrote:
 Another idea would be to do as you suggest and only mark the PGPROC pointers
 volatile, but to additionally add a check for queue corruption somewhere. We
 should be able to detect that - if we ever hit this issue, LWLockRelease
 should find a PGPROC while traversing the queue whose lwWaitLink is NULL but
 which isn't equal to lock->tail. If that ever happens, we'd simply PANIC.
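
(For illustration, such a check could look like this hedged sketch, inside
LWLockRelease()'s wakeup loop:)

	proc = head;
	head = proc->lwWaitLink;
	if (head == NULL && proc != lock->tail)
		elog(PANIC, "lwlock wait queue corrupted");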

My current conclusion is that backporting barriers.h is by far the most
reasonable way to go. The compiler problems have been ironed out by
now...
Arguments against?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-15 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 My current conclusion is that backporting barriers.h is by far the most
 reasonable way to go. The compiler problems have been ironed out by
 now...

-1.  IMO that code is still quite unproven, and what's more, the
problem we're discussing here is completely hypothetical.  If it
were real, we'd have field evidence of it.  We've not had that
much trouble seeing instances of even very narrow race-condition
windows in the past.

regards, tom lane




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-15 Thread Andres Freund
On 2014-02-15 10:06:41 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  My current conclusion is that backporting barriers.h is by far the most
  reasonable way to go. The compiler problems have been ironed out by
  now...
 
 -1.  IMO that code is still quite unproven, and what's more, the
 problem we're discussing here is completely hypothetical.  If it
 were real, we'd have field evidence of it.  We've not had that
 much trouble seeing instances of even very narrow race-condition
 windows in the past.

Well, the problem is that few of us have access to interesting !x86
machines to run tests, and that's where we'd see problems (since x86
gives enough guarantees to avoid this unless the compiler reorders
stuff). I am personally fine with just using volatiles to avoid
reordering in the older branches, but Florian argued against it.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-15 Thread Andres Freund
On 2014-02-15 16:18:00 +0100, Andres Freund wrote:
 On 2014-02-15 10:06:41 -0500, Tom Lane wrote:
  Andres Freund and...@2ndquadrant.com writes:
   My current conclusion is that backporting barriers.h is by far the most
   reasonable way to go. The compiler problems have been ironed out by
   now...
  
  -1.  IMO that code is still quite unproven, and what's more, the
  problem we're discussing here is completely hypothetical.  If it
  were real, we'd have field evidence of it.  We've not had that
  much trouble seeing instances of even very narrow race-condition
  windows in the past.
 
 Well, the problem is that few of us have access to interesting !x86
 machines to run tests, and that's where we'd see problems (since x86
 gives enough guarantees to avoid this unless the compiler reorders
 stuff). I am personally fine with just using volatiles to avoid
 reordering in the older branches, but Florian argued against it.

Here's patches doing that. The 9.3 version also applies to 9.2; the 9.1
version applies back to 8.4.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 0fe7ce4..a8d5b7f 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -647,12 +647,27 @@ LWLockRelease(LWLockId lockid)
 	 */
 	while (head != NULL)
 	{
+		/*
+		 * Use volatile to prevent the compiler from reordering the store to
+		 * lwWaitLink with the store to lwWaiting which could cause problems
+		 * when the to-be-woken-up backend wakes up spuriously and writes to
+		 * lwWaitLink when acquiring a new lock. That could corrupt the list
+		 * this backend is traversing leading to backends stuck waiting for a
+		 * lock.
+		 *
+		 * That's not necessarily sufficient for out-of-order architectures,
+		 * but there've been no field reports of problems. The proper solution
+		 * would be to use a write barrier, but those are not available in the
+		 * back branches.
+		 */
+		volatile PGPROC *vp = proc;
+
 		LOG_LWDEBUG("LWLockRelease", lockid, "release waiter");
-		proc = head;
-		head = proc->lwWaitLink;
-		proc->lwWaitLink = NULL;
-		proc->lwWaiting = false;
-		PGSemaphoreUnlock(&proc->sem);
+		vp = head;
+		head = vp->lwWaitLink;
+		vp->lwWaitLink = NULL;
+		vp->lwWaiting = false;
+		PGSemaphoreUnlock(&vp->sem);
 	}
 
 	/*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 4f88d3f..cff631d 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -27,6 +27,7 @@
 #include commands/async.h
 #include miscadmin.h
 #include pg_trace.h
+#include storage/barrier.h
 #include storage/ipc.h
 #include storage/predicate.h
 #include storage/proc.h
@@ -831,10 +832,21 @@ LWLockRelease(LWLockId lockid)
 	 */
 	while (head != NULL)
 	{
+		/*
+		 * Use a write barrier to prevent the compiler from reordering the
+		 * store to lwWaitLink with the store to lwWaiting which could cause
+		 * problems when the to-be-woken-up backend wakes up spuriously and
+		 * writes to lwWaitLink when acquiring a new lock. That could corrupt
+		 * the list this backend is traversing leading to backends stuck
+		 * waiting for a lock. A write barrier is sufficient as the read side
+		 * only accesses the data while holding a spinlock which acts as a
+		 * full barrier.
+		 */
 		LOG_LWDEBUG("LWLockRelease", lockid, "release waiter");
 		proc = head;
 		head = proc->lwWaitLink;
 		proc->lwWaitLink = NULL;
+		pg_write_barrier();
 		proc->lwWaiting = false;
 		PGSemaphoreUnlock(&proc->sem);
 	}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 85a0ce9..22f8540 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1872,9 +1872,11 @@ WakeupWaiters(XLogRecPtr EndPos)
 	 */
 	while (head != NULL)
 	{
+		/* check comment in LWLockRelease() about barrier usage */
 		proc = head;
 		head = proc->lwWaitLink;
 		proc->lwWaitLink = NULL;
+		pg_write_barrier();
 		proc->lwWaiting = false;
 		PGSemaphoreUnlock(&proc->sem);
 	}
@@ -1966,9 +1968,11 @@ WALInsertSlotReleaseOne(int slotno)
 	 */
 	while (head != NULL)
 	{
+		/* check comment in LWLockRelease() about barrier usage */
 		proc = head;
 		head = proc->lwWaitLink;
 		proc->lwWaitLink = NULL;
+		pg_write_barrier();
 		proc->lwWaiting = false;
 		PGSemaphoreUnlock(&proc->sem);
 	}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 82ef440..98c4845 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -28,6 +28,7 @@
 #include miscadmin.h
 #include pg_trace.h
 #include replication/slot.h
+#include storage/barrier.h
 #include storage/ipc.h
 #include storage/predicate.h
 #include storage/proc.h
@@ -944,10 +945,21 @@ LWLockRelease(LWLock *l)
 	 */
 	while (head != NULL)
 	{
+		/*
+		 * Use a write barrier to prevent the compiler 

Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-14 Thread Andres Freund
On 2014-02-13 15:34:09 +0100, Florian Pflug wrote:
 On Feb10, 2014, at 17:38 , Andres Freund and...@2ndquadrant.com wrote:
  On 2014-02-10 11:11:28 -0500, Tom Lane wrote:
  Andres Freund and...@2ndquadrant.com writes:
  So what we need to do is to acquire a write barrier between the
  assignments to lwWaitLink and lwWaiting, i.e.
  proc->lwWaitLink = NULL;
  pg_write_barrier();
  proc->lwWaiting = false;
  
  You didn't really explain why you think that ordering is necessary?
  Each proc being awoken will surely see both fields updated, and other
  procs won't be examining these fields at all, since we already delinked
  all these procs from the LWLock's queue.
  
  The problem is that one of the released backends could wake up concurrently
  because of an unrelated, or previous, PGSemaphoreUnlock(). It could see
  lwWaiting = false, and thus wake up and acquire the lock, even if the
  store for lwWaitLink hasn't arrived (or been performed, there's no guaranteed
  ordering here) yet.
  Now, it may well be that there's no practical consequence of that, but I
  am not prepared to bet on it.
 
 AFAICS there is a potential problem if three backends are involved, since
 by the time the waiting backend's lwWaitLink is set to NULL after the
 original lock holder released the lock, the waiting backend might already
 have acquired the lock (due to a spurious wakeup) *and* a third backend
 might have already enqueued behind it.
 
 The exact sequence for backends A,B,C that corrupts the wait queue is:
 
 A: Release lock, set B's lwWaiting to false
 B: Wakes up spuriously, takes the lock
 C: Enqueues behind B
 A: Sets B's lwWaitLink back to NULL, thereby truncating the queue and
causing C and anyone behind it to block indefinitely.

I don't think that can actually happen because the head of the wait list
isn't the lock holder's lwWaitLink, but LWLock->head. I thought the same
for a while...

So, right now I don't see problems without either the compiler reordering
stores or heavily out of order machines with speculative execution.
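
(For context, this is the loop the woken-up backend sits in, condensed from
LWLockAcquire() in 9.3-era lwlock.c. If the store clearing lwWaiting becomes
visible before the store clearing lwWaitLink, the backend can exit this loop,
re-enqueue itself, and then have its fresh lwWaitLink overwritten:)

	for (;;)
	{
		/* "false" means cannot accept cancel/die interrupt here. */
		PGSemaphoreLock(&proc->sem, false);
		if (!proc->lwWaiting)
			break;				/* we were released */
		extraWaits++;			/* spurious wakeup, wait again */
	}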

 I wonder whether LWLockRelease really needs to update lwWaitLink at all.
 We take the backends we awake off the queue by updating the queue's head and
 tail, so the contents of lwWaitLink should only matter once the backend is
 re-inserted into some wait queue. But when doing that, we reset lwWaitLink
 back to NULL anyway.

I don't like that, this stuff is hard to debug already, having stale
pointers around doesn't help. I think we should just add the barrier in
the releases with barrier.h and locally use a volatile var in the
branches before that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-14 Thread Florian Pflug
On Feb14, 2014, at 11:45 , Andres Freund and...@2ndquadrant.com wrote:
 On 2014-02-13 15:34:09 +0100, Florian Pflug wrote:
 On Feb10, 2014, at 17:38 , Andres Freund and...@2ndquadrant.com wrote:
 On 2014-02-10 11:11:28 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
 So what we need to do is to acquire a write barrier between the
 assignments to lwWaitLink and lwWaiting, i.e.
   proc->lwWaitLink = NULL;
   pg_write_barrier();
   proc->lwWaiting = false;
 
 You didn't really explain why you think that ordering is necessary?
 Each proc being awoken will surely see both fields updated, and other
 procs won't be examining these fields at all, since we already delinked
 all these procs from the LWLock's queue.
 
 The problem is that one of the released backends could wake up concurrently
 because of an unrelated, or previous, PGSemaphoreUnlock(). It could see
 lwWaiting = false, and thus wake up and acquire the lock, even if the
 store for lwWaitLink hasn't arrived (or been performed, there's no guaranteed
 ordering here) yet.
 Now, it may well be that there's no practical consequence of that, but I
 am not prepared to bet on it.
 
 AFAICS there is a potential problem if three backends are involved, since
 by the time the waiting backend's lwWaitLink is set to NULL after the
 original lock holder released the lock, the waiting backend might already
 have acquired the lock (due to a spurious wakeup) *and* a third backend
 might have already enqueued behind it.
 
 The exact sequence for backends A,B,C that corrupts the wait queue is:
 
 A: Release lock, set B's lwWaiting to false
 B: Wakes up spuriously, takes the lock
 C: Enqueues behind B
 A: Sets B's lwWaitLink back to NULL, thereby truncating the queue and
   causing C and anyone behind it to block indefinitely.
 
 I don't think that can actually happen because the head of the wait list
 isn't the lock holder's lwWaitLink, but LWLock->head. I thought the same
 for a while...

Hm, true, but does that protect us under all circumstances? If another
backend grabs the lock before B gets a chance to do so, B will become the
wait queue's head, and anyone who enqueues behind B will do so by updating
B's lwWaitLink. That is then undone by the delayed reset of B's lwWaitLink
by the original lock holder.

The specific sequence I have in mind is:

A: Take lock
B: Enqueue
A: Release lock, set B's lwWaiting to false
D: Acquire lock
B: Enqueue after spurious wakeup
   (lock->head := B)
C: Enqueue behind B
   (B->lwWaitLink := C, lock->tail := C)
A: Set B's lwWaitLink back to NULL, thereby corrupting the queue
   (B->lwWaitLink := NULL)
D: Release lock, dequeue and wakeup B
   (lock->head := B->lwWaitLink, i.e. lock->head := NULL)
B: Take and release lock, queue appears empty!
   C blocks indefinitely.

 I wonder whether LWLockRelease really needs to update lwWaitLink at all.
 We take the backends we awake off the queue by updating the queue's head and
 tail, so the contents of lwWaitLink should only matter once the backend is
 re-inserted into some wait queue. But when doing that, we reset lwWaitLink
 back to NULL anyway.
 
 I don't like that, this stuff is hard to debug already, having stale
 pointers around doesn't help. I think we should just add the barrier in
 the releases with barrier.h and locally use a volatile var in the
 branches before that.

I like the idea of updating lwWaiting and lwWaitLink while still holding the
spinlock better. The costs are probably negligible, and it'd make the code much
easier to reason about.
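
(A minimal sketch of that variant, with hypothetical names: clear both PGPROC
fields while the spinlock is still held, and defer only the semaphore ops:)

	PGPROC	   *wakeup[MAX_WAKEUPS];	/* hypothetical bound; the waiter
										 * list is unbounded, so a real
										 * version would need chunking */
	int			nwakeups = 0;
	int			i;

	SpinLockAcquire(&lock->mutex);
	/* ... pick the waiters to release, as LWLockRelease does today ... */
	while (head != NULL)
	{
		PGPROC	   *proc = head;

		head = proc->lwWaitLink;
		proc->lwWaitLink = NULL;
		proc->lwWaiting = false;	/* both stores ordered by the spinlock */
		wakeup[nwakeups++] = proc;
	}
	SpinLockRelease(&lock->mutex);

	for (i = 0; i < nwakeups; i++)	/* kernel calls outside the spinlock */
		PGSemaphoreUnlock(&wakeup[i]->sem);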

best regards,
Florian Pflug






Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-14 Thread Andres Freund
On 2014-02-14 13:28:47 +0100, Florian Pflug wrote:
  I don't think that can actually happen because the head of the wait list
  isn't the lock holder's lwWaitLink, but LWLock->head. I thought the same
  for a while...
 
 Hm, true, but does that protect us under all circumstances? If another
 backend grabs the lock before B gets a chance to do so, B will become the
 wait queue's head, and anyone who enqueues behind B will do so by updating
 B's lwWaitLink. That is then undone by the delayed reset of B's lwWaitLink
 by the original lock holder.
 
 The specific sequence I have in mind is:
 
 A: Take lock
 B: Enqueue
 A: Release lock, set B's lwWaiting to false
 D: Acquire lock
 B: Enqueue after spurious wakeup
    (lock->head := B)
 C: Enqueue behind B
    (B->lwWaitLink := C, lock->tail := C)
 A: Set B's lwWaitLink back to NULL, thereby corrupting the queue
    (B->lwWaitLink := NULL)
 D: Release lock, dequeue and wakeup B
    (lock->head := B->lwWaitLink, i.e. lock->head := NULL)
 B: Take and release lock, queue appears empty!
    C blocks indefinitely.

That's assuming either reordering by the compiler or an out-of-order
store architecture, right?

  I wonder whether LWLockRelease really needs to update lwWaitLink at all.
  We take the backends we awake off the queue by updating the queue's head
  and tail, so the contents of lwWaitLink should only matter once the backend
  is re-inserted into some wait queue. But when doing that, we reset
  lwWaitLink back to NULL anyway.
  
  I don't like that, this stuff is hard to debug already, having stale
  pointers around doesn't help. I think we should just add the barrier in
  the releases with barrier.h and locally use a volatile var in the
  branches before that.
 
 I like the idea of updating lwWaiting and lwWaitLink while still holding the
 spinlock better. The costs are probably negligible, and it'd make the code
 much easier to reason about.

I agree we should do that, but imo not in the backbranches. Anything
more than the minimal fix in that code should be avoided in the
stable branches, this stuff is friggin performance sensitive, and the
spinlock already is a *major* contention point in many workloads.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-14 Thread Florian Pflug
On Feb14, 2014, at 13:36 , Andres Freund and...@2ndquadrant.com wrote:
 On 2014-02-14 13:28:47 +0100, Florian Pflug wrote:
 I don't think that can actually happen because the head of the wait list
 isn't the lock holder's lwWaitLink, but LWLock->head. I thought the same
 for a while...
 
 Hm, true, but does that protect us under all circumstances? If another
 backend grabs the lock before B gets a chance to do so, B will become the
 wait queue's head, and anyone who enqueues behind B will do so by updating
 B's lwWaitLink. That is then undone by the delayed reset of B's lwWaitLink
 by the original lock holder.
 
 The specific sequence I have in mind is:
 
 A: Take lock
 B: Enqueue
 A: Release lock, set B's lwWaiting to false
 D: Acquire lock
 B: Enqueue after spurious wakeup
    (lock->head := B)
 C: Enqueue behind B
    (B->lwWaitLink := C, lock->tail := C)
 A: Set B's lwWaitLink back to NULL, thereby corrupting the queue
    (B->lwWaitLink := NULL)
 D: Release lock, dequeue and wakeup B
    (lock->head := B->lwWaitLink, i.e. lock->head := NULL)
 B: Take and release lock, queue appears empty!
    C blocks indefinitely.
 
 That's assuming either reordering by the compiler or an out-of-order
 store architecture, right?

Yes, it requires that a backend exits out of the PGSemaphoreLock loop in
LWLockAcquire before lwWaitLink has been reset to NULL by the previous lock
holder's LWLockRelease. Only if that happens there is a risk of the PGPROC
being on a wait queue by the time lwWaitLink is finally reset to NULL.

 I wonder whether LWLockRelease really needs to update lwWaitLink at all.
 We take the backends we awake off the queue by updating the queue's head
 and tail, so the contents of lwWaitLink should only matter once the backend
 is re-inserted into some wait queue. But when doing that, we reset lwWaitLink
 back to NULL anyway.
 
 I don't like that, this stuff is hard to debug already, having stale
 pointers around doesn't help. I think we should just add the barrier in
 the releases with barrier.h and locally use a volatile var in the
 branches before that.
 
 I like the idea of updating lwWaiting and lwWaitLink while still holding the
 spinlock better. The costs are probably negligible, and it'd make the code
 much easier to reason about.
 
 I agree we should do that, but imo not in the backbranches. Anything
 more than the minimal fix in that code should be avoided in the
 stable branches, this stuff is friggin performance sensitive, and the
 spinlock already is a *major* contention point in many workloads.

No argument there. But unless I missed something, there actually is a bug
there, and using volatile won't fix it. A barrier would, but what about the
back branches that don't have barrier.h? AFAICS the only other choices we have
on these branches are to either ignore the bug - it's probably *extremely* unlikely
- or to use a spinlock acquire/release cycle to create a barrier. The former
leaves me with a bit of an uneasy feeling, and the latter quite certainly has
a larger performance impact than moving the PGPROC updates within the section
protected by the spinlock and using an array to remember the backends to wakeup.
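
(For the record, the spinlock-cycle alternative would be something like this
hedged sketch; a dummy spinlock acquire/release acts as a full barrier on any
platform whose S_LOCK implementation is correct:)

	static slock_t	dummy_spinlock;		/* assumed initialized at startup */

	/* hypothetical stand-in for pg_write_barrier() on pre-barrier.h branches */
	S_LOCK(&dummy_spinlock);
	S_UNLOCK(&dummy_spinlock);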

best regards,
Florian Pflug





Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-14 Thread Andres Freund
On 2014-02-14 13:52:45 +0100, Florian Pflug wrote:
  I agree we should do that, but imo not in the backbranches. Anything
  more than the minimal fix in that code should be avoided in the
  stable branches, this stuff is friggin performance sensitive, and the
  spinlock already is a *major* contention point in many workloads.
 
 No argument there. But unless I missed something, there actually is a bug
 there, and using volatile won't fix it. A barrier would, but what about the
 back branches that don't have barrier.h?

Yea, but I don't see a better alternative. Realistically the likelihood
of a problem without the compiler reordering stuff is minuscule on any
relevant platform that's supported by the !barrier.h branches. Newer
ARMs are the only realistic suspect, and the support for those in older
releases wasn't so good...

 The former
 leaves me with a bit of an uneasy feeling, and the latter quite certainly has
 a larger performance impact than moving the PGPROC updates within the section
 protected by the spinlock and using an array to remember the backends to 
 wakeup.

I am not so sure, it adds a host of new cacheline references in a piece
of code that's already heavily affected by pipeline stalls, that can
influence performance. I am not saying it's super likely, just more than
I want to do for a theoretical problem in the back branches.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-14 Thread Florian Pflug
On Feb14, 2014, at 14:07 , Andres Freund and...@2ndquadrant.com wrote:
 On 2014-02-14 13:52:45 +0100, Florian Pflug wrote:
 I agree we should do that, but imo not in the backbranches. Anything
 more than the minimal fix in that code should be avoided in the
 stable branches, this stuff is friggin performance sensitive, and the
 spinlock already is a *major* contention point in many workloads.
 
 No argument there. But unless I missed something, there actually is a bug
 there, and using volatile won't fix it. A barrier would, but what about the
 back branches that don't have barrier.h?
 
 Yea, but I don't see a better alternative. Realistically the likelihood
 of a problem without the compiler reordering stuff is minuscule on any
 relevant platform that's supported by the !barrier.h branches. Newer
 ARMs are the only realistic suspect, and the support for those in older
 releases wasn't so good...

Isn't POWER/PowerPC potentially affected by this? 

Also, there is still the alternative of not resetting lwWaitLink in
LWLockRelease. I can understand why you oppose that - it's certainly nicer to
not have stray pointers lying around. But since (at least as far as we know)

a) resetting lwWaitLink is not strictly necessary
b) resetting lwWaitLink introduces a race condition (however unlikely)

we'll trade correctness for cleanliness if we continue to reset lwWaitLink
without protecting against the race. That's a bit of a weird trade-off to make.

Another idea for a fix would be to conflate lwWaiting and lwWaitLink into one
field. We could replace lwWaiting by lwWaitLink != NULL everywhere it's
tested, and set lwWaitLink to some special non-NULL value (say 0x1) when we
enqueue a PGPROC, instead of setting it to NULL and setting lwWaiting to true.

We'd then depend on pointer-sized stores being atomic, which I think we depend
on in other places already.
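
(A hedged sketch of the conflated representation; the macro name is made up:)

	/* Sentinel meaning "enqueued, but no successor yet"; the 0x1 above. */
	#define LW_WAITLINK_TAIL	((PGPROC *) 1)

	/* enqueue, under the lock's spinlock, replacing lwWaiting = true: */
	proc->lwWaitLink = LW_WAITLINK_TAIL;

	/* the waiter's wakeup test, replacing if (!proc->lwWaiting): */
	if (proc->lwWaitLink == NULL)
		break;					/* fully unlinked, safe to proceed */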

best regards,
Florian Pflug





Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-14 Thread Tom Lane
Florian Pflug f...@phlo.org writes:
 Another idea for a fix would be to conflate lwWaiting and lwWaitLink into one
 field. We could replace lwWaiting by lwWaitLink != NULL everywhere it's
 tested, and set lwWaitLink to some special non-NULL value (say 0x1) when we
 enqueue a PGPROC, instead of setting it to NULL and setting lwWaiting to true.

 We'd then depend on pointer-sized stores being atomic, which I think we depend
 on in other places already.

I don't believe that's true; neither that we depend on it now, nor that
it would be safe to do so.

regards, tom lane




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-14 Thread Andres Freund
On 2014-02-14 10:26:07 -0500, Tom Lane wrote:
 Florian Pflug f...@phlo.org writes:
  Another idea for a fix would be to conflate lwWaiting and lwWaitLink into 
  one
  field. We could replace lwWaiting by lwWaitLink != NULL everywhere it's
  tested, and set lwWaitLink to some special non-NULL value (say 0x1) when we
  enqueue a PGPROC, instead of setting it to NULL and setting lwWaiting to 
  true.
 
  We'd then depend on pointer-sized stores being atomic, which I think we 
  depend
  on in other places already.
 
 I don't believe that's true; neither that we depend on it now, nor that
 it would be safe to do so.

Yea, we currently rely on 4 byte stores being atomic (most notably for
xids), but we don't rely on anything bigger. I am not sure if there are
architectures with 64bit pointers that aren't accessed atomically when
aligned? Even if there are, that's certainly nothing that should be introduced
when backpatching.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-14 Thread Florian Pflug
On Feb14, 2014, at 16:32 , Andres Freund and...@2ndquadrant.com wrote:
 On 2014-02-14 10:26:07 -0500, Tom Lane wrote:
 Florian Pflug f...@phlo.org writes:
 Another idea for a fix would be to conflate lwWaiting and lwWaitLink into 
 one
 field. We could replace lwWaiting by lwWaitLink != NULL everywhere it's
 tested, and set lwWaitLink to some special non-NULL value (say 0x1) when we
 enqueue a PGPROC, instead of setting it to NULL and setting lwWaiting to 
 true.
 
 We'd then depend on pointer-sized stores being atomic, which I think we 
 depend
 on in other places already.
 
 I don't believe that's true; neither that we depend on it now, nor that
 it would be safe to do so.
 
 Yea, we currently rely on 4 byte stores being atomic (most notably for
 xids), but we don't rely on anything bigger. I am not sure if there are
 architectures with 64bit pointers that aren't accessed atomically when
  aligned? Even if there are, that's certainly nothing that should be introduced
 when backpatching.

Hm, we could use 4-byte stores instead of 8-byte stores if we turned lwWaitLink
into an index into the proc array instead of a pointer to the PGPROC struct.

We could then use 0x in place of NULL to indicate not waiting,
and PROCARRAY_MAXPROCS to indicate waiting, but no successor in the queue.
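
(A hedged sketch of that representation, with made-up macro names:)

	uint32		lwWaitLink;		/* index into the proc array, 4-byte store */

	#define LW_NOT_WAITING	((uint32) 0xFFFFFFFF)			/* replaces lwWaiting == false */
	#define LW_QUEUE_END	((uint32) PROCARRAY_MAXPROCS)	/* waiting, no successor */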

best regards,
Florian Pflug





Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-14 Thread Andres Freund
On 2014-02-14 15:03:16 +0100, Florian Pflug wrote:
 On Feb14, 2014, at 14:07 , Andres Freund and...@2ndquadrant.com wrote:
  On 2014-02-14 13:52:45 +0100, Florian Pflug wrote:
  I agree we should do that, but imo not in the backbranches. Anything
   more than the minimal fix in that code should be avoided in the
  stable branches, this stuff is friggin performance sensitive, and the
  spinlock already is a *major* contention point in many workloads.
  
  No argument there. But unless I missed something, there actually is a bug
  there, and using volatile won't fix it. A barrier would, but what about the
  back branches that don't have barrier.h?
  
  Yea, but I don't see a better alternative. Realistically the likelihood
  of a problem without the compiler reordering stuff is minuscule on any
  relevant platform that's supported by the !barrier.h branches. Newer
  ARMs are the only realistic suspect, and the support for those in older
  releases wasn't so good...
 
 Isn't POWER/PowerPC potentially affected by this? 

I wouldn't consider it a major architecture... And I am not sure how
much out of order a CPU has to be to be affected by this: the read side
uses spinlocks, which in most spinlock implementations imply a
full memory barrier, which in many cache coherency designs will cause
other CPUs to flush writes. And I think the control dependency in the
PGSemaphoreUnlock() loop will actually cause a flush on ppc...

 Also, there is still the alternative of not resetting lwWaitLink in 
 LWLockRelease.
 I can understand why you oppose that - it's certainly nicer to not have stray
 pointer lying around. But since (at least as far as we know)
 
 a) resetting lwWaitLink is not strictly necessary

I don't want to rely on that in the backbranches, that's an entirely new
assumption. I think anything more than minimal changes are hard to
justify for a theoretical issue that hasn't been observed in the field.

I think the biggest danger here is that writes are reordered by the
compiler, that we definitely need to protect against. Making a variable
volatile or introducing a memory barrier is pretty simple to analyze.

 b) resetting lwWaitLink introduces a race condition (however unlikely)
 
 we'll trade correctness for cleanliness if we continue to reset lwWaitLink
 without protecting against the race. That's a bit of a weird trade-off to 
 make.

It's not just cleanliness, it's being able to actually debug crashes.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-14 Thread knizhnik

On 02/14/2014 07:51 PM, Andres Freund wrote:

On 2014-02-14 15:03:16 +0100, Florian Pflug wrote:

On Feb14, 2014, at 14:07 , Andres Freund and...@2ndquadrant.com wrote:

On 2014-02-14 13:52:45 +0100, Florian Pflug wrote:

I agree we should do that, but imo not in the backbranches. Anything
more than the minimal fix in that code should be avoided in the
stable branches, this stuff is friggin performance sensitive, and the
spinlock already is a *major* contention point in many workloads.


No argument there. But unless I missed something, there actually is a bug
there, and using volatile won't fix it. A barrier would, but what about the
back branches that don't have barrier.h?


Yea, but I don't see a better alternative. Realistically the likelihood
of a problem without the compiler reordering stuff is miniscule on any
relevant platform that's supported by the !barrier.h branches. Newer
ARMs are the only realistic suspect, and the support for them in older
releases wasn't so good...


Isn't POWER/PowerPC potentially affected by this?


I wouldn't consider it a major architecture... And I am not sure how
much out of order a CPU has to be to be affected by this, the read side
uses spinlocks, which in most of the spinlock implementations implies a
full memory barrier which in many cache coherency designs will cause
other CPUs to flush writes. And I think the control dependency in the
PGSemaphoreUnlock() loop will actually cause a flush on ppc...


PGSemaphoreUnlock() should really introduce a memory barrier, but the problem
can happen before PGSemaphoreUnlock() is called.
So the presence of PGSemaphoreUnlock() just reduces the probability of a race
condition on non-TSO architectures (PPC, ARM, IA64, ...) but doesn't completely
eliminate it.




Also, there is still the alternative of not resetting lwWaitLink in 
LWLockRelease.
I can understand why you oppose that - it's certainly nicer to not have stray
pointer lying around. But since (at least as far as we know)

a) resetting lwWaitLink is not strictly necessary


I don't want to rely on that in the backbranches, that's an entirely new
assumption. I think anything more than minimal changes are hard to
justify for a theoretical issue that hasn't been observed in the field.

I think the biggest danger here is that writes are reordered by the
compiler, that we definitely need to protect against. Making a variable
volatile or introducing a memory barrier is pretty simple to analyze.


b) resetting lwWaitLink introduces a race condition (however unlikely)

we'll trade correctness for cleanliness if we continue to reset lwWaitLink
without protecting against the race. That's a bit of a weird trade-off to make.


It's not just cleanliness, it's being able to actually debug crashes.



Frankly speaking, I do not understand why eliminating the resetting of
lwWaitLink was considered to be a bad idea.
Resetting this pointer to NULL will not help much in analyzing crash dumps,
because right now it is possible that lwWaitLink==NULL while the process is
waiting for a lock and linked in the list
(if it is the last element of the list). So the fact that lwWaitLink==NULL
doesn't actually give much useful information.





Greetings,

Andres Freund







Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-14 Thread Andres Freund
On 2014-02-14 20:23:32 +0400, knizhnik wrote:
 we'll trade correctness for cleanliness if we continue to reset lwWaitLink
 without protecting against the race. That's a bit of a weird trade-off to 
 make.
 
 It's not just cleanliness, it's being able to actually debug crashes.
 
 
 Frankly speaking, I do not understand why eliminating the resetting of
 lwWaitLink was considered to be a bad idea.
 Resetting this pointer to NULL will not help much in analyzing crash dumps,
 because right now it is possible that lwWaitLink==NULL while the process is
 waiting for a lock and linked in the list
 (if it is the last element of the list). So the fact that lwWaitLink==NULL
 doesn't actually give much useful information.

At the moment if you connect to a live pg or a crash dump, investigating
the wait links allows you to somewhat sensibly determine which backends
are waiting for a lock and whether they are part of a queue. If they
aren't reset anymore that will tell you nothing, so you'll need to
connect to all pg instances to debug issues.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-14 Thread knizhnik

On 02/14/2014 08:28 PM, Andres Freund wrote:

On 2014-02-14 20:23:32 +0400, knizhnik wrote:

we'll trade correctness for cleanliness if we continue to reset lwWaitLink
without protecting against the race. That's a bit of a weird trade-off to make.


It's not just cleanliness, it's being able to actually debug crashes.



Frankly speaking, I do not understand why eliminating the resetting of
lwWaitLink was considered to be a bad idea.
Resetting this pointer to NULL will not help much in analyzing crash dumps,
because right now it is possible that lwWaitLink==NULL while the process is
waiting for a lock and linked in the list
(if it is the last element of the list). So the fact that lwWaitLink==NULL
doesn't actually give much useful information.


At the moment if you connect to a live pg or a crash dump, investigating
the wait links allows you to somewhat sensibly determine which backends
are waiting for a lock and whether they are part of a queue. If they
aren't reset anymore that will tell you nothing, so you'll need to
connect to all pg instances to debug issues.


Why is it not enough to inspect the lwWaiting flag?
In any case, resetting lwWaitLink can be safely done in the awakened process:


if (!proc->lwWaiting)
{
    proc->lwWaitLink = NULL;
    break;
}





Greetings,

Andres Freund







Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-14 Thread Florian Pflug
On Feb14, 2014, at 16:51 , Andres Freund and...@2ndquadrant.com wrote:
 On 2014-02-14 15:03:16 +0100, Florian Pflug wrote:
 On Feb14, 2014, at 14:07 , Andres Freund and...@2ndquadrant.com wrote:
 On 2014-02-14 13:52:45 +0100, Florian Pflug wrote:
 I agree we should do that, but imo not in the backbranches. Anything
  more than the minimal fix in that code should be avoided in the
 stable branches, this stuff is friggin performance sensitive, and the
 spinlock already is a *major* contention point in many workloads.
 
 No argument there. But unless I missed something, there actually is a bug
 there, and using volatile won't fix it. A barrier would, but what about the
 back branches that don't have barrier.h?
 
 Yea, but I don't see a better alternative. Realistically the likelihood
 of a problem without the compiler reordering stuff is miniscule on any
 relevant platform that's supported by the !barrier.h branches. Newer
  ARMs are the only realistic suspect, and the support for them in older
 releases wasn't so good...
 
 Isn't POWER/PowerPC potentially affected by this? 
 
 I wouldn't consider it a major architecture... And I am not sure how
 much out of order a CPU has to be to be affected by this, the read side
 uses spinlocks, which in most of the spinlock implementations implies a
 full memory barrier which in many cache coherency designs will cause
 other CPUs to flush writes. And I think the control dependency in the
 PGSemaphoreUnlock() loop will actually cause a flush on ppc...

I guess it's sufficient for the store to lwWaitLink to be delayed.
As long as the CPU on which that store is executing hasn't managed to gain
exclusive access to the relevant cache line, memory barriers on the read
side won't help because the store won't be visible to other CPUs.

 Also, there is still the alternative of not resetting lwWaitLink in 
 LWLockRelease.
 I can understand why you oppose that - it's certainly nicer to not have stray
  pointer lying around. But since (at least as far as we know)
 
 a) resetting lwWaitLink is not strictly necessary
 
 I don't want to rely on that in the backbranches, that's an entirely new
 assumption. I think anything more than minimal changes are hard to
 justify for a theoretical issue that hasn't been observed in the field.
 
 I think the biggest danger here is that writes are reordered by the
 compiler, that we definitely need to protect against. Making a variable
 volatile or introducing a memory barrier is pretty simple to analyze.

Well, the assumption isn't all that new. We already have the situation that
a PGPROC may be not on any wait queue, yet its lwWaitLink may be non-NULL.
Currently, the process that took it off the queue is responsible for rectifying
that eventually, with the changed proposed below it'll be the backend who
owns the PGPROC. From the point of view of other backends, nothing really
changes.

 
 b) resetting lwWaitLink introduces a race condition (however unlikely)
 
 we'll trade correctness for cleanliness if we continue to reset lwWaitLink
 without protecting against the race. That's a bit of a weird trade-off to 
 make.
 
 It's not just cleanliness, it's being able to actually debug crashes.

We could still arrange for the stray lwWaitLink to be visible only
momentarily. If a backend is woken up and detects that lwWaiting is false,
it knows that it cannot be on any wait queue - it was just removed, and
hasn't added itself again yet. At that point, it's safe to reset lwWaitLink
back to NULL. The stray value would thus only be visible while a backend
executes the PGSemaphoreLock() loop, and whether or not this is the case can
be deduced from a stack trace. So I'm not convinced that this makes debugging
any harder.

(knizhnik has just suggested the same)

It's interesting, BTW, that updating lwWaitLink after lwWaiting is OK here -
the crucial property is not that it's updated before lwWaiting, but rather that
it is reset before enqueuing the backend again. Which is something that backend
itself can guarantee far more easily than whoever happens to de-queue it due to
spurious wakeups.
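
To sketch that concretely (modeled on the shape of the 9.2-era wait loop
quoted elsewhere in this thread; illustrative only, not the committed code):

    for (;;)
    {
        /* "false" = not interruptible; matches the 9.2-era signature */
        PGSemaphoreLock(&proc->sem, false);
        if (!proc->lwWaiting)
        {
            /*
             * We are off every wait queue and will re-initialize lwWaitLink
             * before enqueuing ourselves again, so the stray value left by
             * LWLockRelease() is visible only inside this loop.
             */
            proc->lwWaitLink = NULL;
            break;
        }
        extraWaits++;    /* spurious wakeup: repay the semaphore later */
    }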

best regards,
Florian Pflug





Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-14 Thread Andres Freund
On 2014-02-14 18:49:33 +0100, Florian Pflug wrote:
  I wouldn't consider it a major architecture... And I am not sure how
  much out of order a CPU has to be to be affected by this, the read side
  uses spinlocks, which in most of the spinlock implementations implies a
  full memory barrier which in many cache coherency designs will cause
  other CPUs to flush writes. And I think the control dependency in the
  PGSemaphoreUnlock() loop will actually cause a flush on ppc...
 
 I guess it's sufficient for the store to lwWaitLink to be delayed.
 As long as the CPU on which that store is executing hasn't managed to gain
 exclusive access to the relevant cache line, memory barriers on the read
 side won't help because the store won't be visible to other CPUs.

The whole lwlock actually should be on the same cacheline on anything
with cachelines >= 32. As the woken up side is doing an atomic op on it
*before* modifying lwWaitLink I think we are actually guaranteed that
any incomplete store on the writer will have completed unless the compiler
reordered. So we are safe against out of order stores, but not out of
order execution. That might have been what prevented the issue from
being noticable on some platforms.

 Well, the assumption isn't all that new. We already have the situation that
 a PGPROC may be not on any wait queue, yet its lwWaitLink may be non-NULL.
 Currently, the process that took it off the queue is responsible for rectifying
 that eventually, with the changed proposed below it'll be the backend who
 owns the PGPROC. From the point of view of other backends, nothing really
 changes.

That window is absolutely tiny tho.

  b) resetting lwWaitLink introduces a race condition (however unlikely)
  
  we'll trade correctness for cleanliness if we continue to reset lwWaitLink
  without protecting against the race. That's a bit of a weird trade-off to 
  make.
  
  It's not just cleanliness, it's being able to actually debug crashes.
 
 We could still arrange for the stray lwWaitLink to be visible only
 momentarily. If a backend is woken up and detects that lwWaiting is false,
 it knows that it cannot be on any wait queue - it was just removed, and
 hasn't added itself again yet. At that point, it's safe to reset lwWaitLink
 back to NULL. The stray value would thus only be visible while a backend
 executes the PGSemaphoreLock() loop, and whether or not this is the case can
 be deduced from a stack trace. So I'm not convinced that this makes debugging
 any harder.

That's not actually safe on an out of order architecture afaics. Without
barriers the store to lwWaitLink in the woken up backend can preempt the
read for the next element in the PGSemaphoreUnlock() loop.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-14 Thread Florian Pflug
On Feb14, 2014, at 19:21 , Andres Freund and...@2ndquadrant.com wrote:
 On 2014-02-14 18:49:33 +0100, Florian Pflug wrote:
 Well, the assumption isn't all that new. We already have the situation that
 a PGPROC may be not on any wait queue, yet its lwWaitLink may be non-NULL.
 Currently, the process who took it off the queue is responsible for 
 rectifying
 that eventually, with the changed proposed below it'll be the backend who
 owns the PGPROC. From the point of view of other backends, nothing really
 changes.
 
 That window is absolutely tiny tho.

True, but it's there, so if anything counts on that never being the case, it's
still already broken.

 b) resetting lwWaitLink introduces a race condition (however unlikely)
 
 we'll trade correctness for cleanliness if we continue to reset lwWaitLink
 without protecting against the race. That's a bit of a weird trade-off to 
 make.
 
 It's not just cleanliness, it's being able to actually debug crashes.
 
 We could still arrange for the stray lwWaitLink to be visible only
 momentarily. If a backend is woken up and detects that lwWaiting is false,
 it knows that it cannot be on any wait queue - it was just removed, and
 hasn't added itself again yet. At that point, it's safe to reset lwWaitLink
 back to NULL. The stray value would thus only be visible while a backend
 executes the PGSemaphoreLock() loop, and whether or not this is the case can
 be deduced from a stack trace. So I'm not convinced that this makes debugging
 any harder.
 
 That's not actually safe on an out of order architecture afaics. Without
 barriers the store to lwWaitLink in the woken up backend can preempt the
 read for the next element in the PGSemaphoreUnlock() loop.

Hm, true. I guess we could prevent that by introducing an artificial dependency
between the read and the write - something like

  proc = head;
  head = proc->lwWaitLink;
  /*
   * We don't ever expect to actually PANIC here, but the test forces the
   * load to execute before the store to lwWaiting. This prevents a race
   * between reading lwWaitLink here and resetting it back to zero in the
   * awoken backend (Note that backends can wake up spuriously, so just
   * reading it before doing PGSemaphoreUnlock is insufficient)
   */
  if (head != MyProc)
      proc->lwWaiting = false;
  else
      elog(PANIC, ...);
  PGSemaphoreUnlock(&proc->sem);

(This assumes that proc is a volatile pointer)

Another idea would be to do as you suggest and only mark the PGPROC pointers
volatile, but to additionally add a check for queue corruption somewhere. We
should be able to detect that - if we ever hit this issue, LWLockRelease should
find a PGPROC while traversing the queue whose lwWaitLink is NULL but which
isn't equal to lock->tail. If that ever happens, we'd simply PANIC.
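
A rough shape such a check could take, using the head/tail names from the
snippets above (a sketch, not an actual patch):

    /* Walk the wait queue and PANIC if it was truncated by the race. */
    PGPROC *proc = lock->head;

    while (proc != NULL)
    {
        if (proc->lwWaitLink == NULL && proc != lock->tail)
            elog(PANIC, "lwlock wait queue corrupted");
        proc = proc->lwWaitLink;
    }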

best regards,
Florian Pflug







Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-13 Thread MauMau

From: Andres Freund and...@2ndquadrant.com

could you try if you get more readable dumps by using disassemble/m?
That might at least print line numbers if you have debug info installed.


Please find the attached file.  I hope this will reveal something.

Regards
MauMau

Dump of assembler code for function LWLockRelease:
   0x00647d40 <+0>:     push   %r12
   0x00647d42 <+2>:     mov    %edi,%r12d
   0x00647d45 <+5>:     shl    $0x5,%r12
   0x00647d49 <+9>:     add    0x4f3a98(%rip),%r12        # 0xb3b7e8 <LWLockArray>
   0x00647d50 <+16>:    push   %rbp
   0x00647d51 <+17>:    mov    %edi,%ebp
   0x00647d53 <+19>:    push   %rbx
   0x00647d54 <+20>:    mov    0x4f3a86(%rip),%ebx        # 0xb3b7e0 <num_held_lwlocks>
   0x00647d5a <+26>:    nopw   0x0(%rax,%rax,1)
   0x00647d60 <+32>:    sub    $0x1,%ebx
   0x00647d63 <+35>:    js     0x647ea4 <LWLockRelease+356>
   0x00647d69 <+41>:    movslq %ebx,%rax
   0x00647d6c <+44>:    cmp    %ebp,0xb3b800(,%rax,4)
   0x00647d73 <+51>:    jne    0x647d60 <LWLockRelease+32>
   0x00647d75 <+53>:    mov    0x4f3a65(%rip),%esi        # 0xb3b7e0 <num_held_lwlocks>
   0x00647d7b <+59>:    sub    $0x1,%esi
   0x00647d7e <+62>:    cmp    %ebx,%esi
   0x00647d80 <+64>:    mov    %esi,0x4f3a5a(%rip)        # 0xb3b7e0 <num_held_lwlocks>
   0x00647d86 <+70>:    jg     0x647d92 <LWLockRelease+82>
   0x00647d88 <+72>:    jmp    0x647dad <LWLockRelease+109>
   0x00647d8a <+74>:    nopw   0x0(%rax,%rax,1)
   0x00647d90 <+80>:    mov    %ecx,%ebx
   0x00647d92 <+82>:    lea    0x1(%rbx),%ecx
   0x00647d95 <+85>:    movslq %ebx,%rax
   0x00647d98 <+88>:    movslq %ecx,%rdx
   0x00647d9b <+91>:    cmp    %ecx,%esi
   0x00647d9d <+93>:    mov    0xb3b800(,%rdx,4),%edx
   0x00647da4 <+100>:   mov    %edx,0xb3b800(,%rax,4)
   0x00647dab <+107>:   jg     0x647d90 <LWLockRelease+80>
   0x00647dad <+109>:   mov    $0x1,%eax
   0x00647db2 <+114>:   lock xchg %al,(%r12)
   0x00647db7 <+119>:   test   %al,%al
   0x00647db9 <+121>:   jne    0x647ee4 <LWLockRelease+420>
   0x00647dbf <+127>:   movzbl 0x2(%r12),%eax
   0x00647dc5 <+133>:   test   %al,%al
   0x00647dc7 <+135>:   jle    0x647f04 <LWLockRelease+452>
   0x00647dcd <+141>:   movzbl 0x2(%r12),%eax
   0x00647dd3 <+147>:   sub    $0x1,%eax
   0x00647dd6 <+150>:   mov    %al,0x2(%r12)
   0x00647ddb <+155>:   mov    0x8(%r12),%rcx
   0x00647de0 <+160>:   test   %rcx,%rcx
   0x00647de3 <+163>:   je     0x647def <LWLockRelease+175>
   0x00647de5 <+165>:   movzbl 0x2(%r12),%eax
   0x00647deb <+171>:   test   %al,%al
   0x00647ded <+173>:   je     0x647e08 <LWLockRelease+200>
   0x00647def <+175>:   movb   $0x0,(%r12)
   0x00647df4 <+180>:   pop    %rbx
   0x00647df5 <+181>:   pop    %rbp
   0x00647df6 <+182>:   mov    0x53d25c(%rip),%eax        # 0xb85058 <InterruptHoldoffCount>
   0x00647dfc <+188>:   pop    %r12
   0x00647dfe <+190>:   sub    $0x1,%eax
   0x00647e01 <+193>:   mov    %eax,0x53d251(%rip)        # 0xb85058 <InterruptHoldoffCount>
   0x00647e07 <+199>:   retq
   0x00647e08 <+200>:   mov    0x4(%r12),%eax
   0x00647e0d <+205>:   test   %eax,%eax
   0x00647e0f <+207>:   jne    0x647def <LWLockRelease+175>
   0x00647e11 <+209>:   movzbl 0x1(%r12),%eax
   0x00647e17 <+215>:   test   %al,%al
   0x00647e19 <+217>:   je     0x647def <LWLockRelease+175>
   0x00647e1b <+219>:   cmpb   $0x2,0x42(%rcx)
   0x00647e1f <+223>:   jne    0x647f66 <LWLockRelease+550>
   0x00647e25 <+229>:   mov    0x48(%rcx),%rax
   0x00647e29 <+233>:   test   %rax,%rax
   0x00647e2c <+236>:   je     0x647f66 <LWLockRelease+550>
   0x00647e32 <+242>:   mov    %rax,%rdx
   0x00647e35 <+245>:   jmp    0x647e44 <LWLockRelease+260>
   0x00647e37 <+247>:   mov    0x48(%rdx),%rdx
   0x00647e3b <+251>:   test   %rdx,%rdx
   0x00647e3e <+254>:   je     0x647f16 <LWLockRelease+470>
   0x00647e44 <+260>:   movzbl 0x42(%rdx),%esi
   0x00647e48 <+264>:   mov    %rdx,%rax
   0x00647e4b <+267>:   cmp    $0x2,%sil
   0x00647e4f <+271>:   je     0x647e37 <LWLockRelease+247>
   0x00647e51 <+273>:   mov    0x48(%rdx),%rdx
   0x00647e55 <+277>:   test   %sil,%sil
   0x00647e58 <+280>:   je     0x647f16 <LWLockRelease+470>
   0x00647e5e <+286>:   test   %rdx,%rdx
   0x00647e61 <+289>:   je     0x647f16 <LWLockRelease+470>
   0x00647e67 <+295>:   cmpb   $0x0,0x42(%rdx)
   0x00647e6b <+299>:   mov    $0x1,%edi
   0x00647e70 <+304>:   jne    0x647e8a <LWLockRelease+330>
   0x00647e72 

Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-13 Thread knizhnik

On 02/12/2014 06:10 PM, Ants Aasma wrote:

On Wed, Feb 12, 2014 at 4:04 PM, knizhnik knizh...@garret.ru wrote:

Even if the reordering was not done by the compiler, it can still happen
during execution.
There is no guarantee that two subsequent assignments will be observed by all
CPU cores in the same order.


The x86 memory model (total store order) provides that guarantee in
this specific case.

Regards,
Ants Aasma



Sorry, I thought that we were talking about the general case, not just the x86
architecture.
Maybe I do not understand something in the LWLock code, but it seems to me that
assigning NULL to proc->lwWaitLink is not needed at all:

while (head != NULL)
{
    LOG_LWDEBUG("LWLockRelease", lockid, "release waiter");
    proc = head;
    head = proc->lwWaitLink;
    proc->lwWaitLink = NULL;
    proc->lwWaiting = false;
    PGSemaphoreUnlock(&proc->sem);
}



This part of the L1 list is not traversed by any other processor, so nobody
will inspect this field. When the awakened process needs to wait for another
lock, it will just assign NULL to this field itself:

proc->lwWaiting = 1;
proc->lwWaitMode = mode;
proc->lwWaitLink = NULL;
if (lock->head == NULL)
    lock->head = proc;
else
    lock->tail->lwWaitLink = proc;
lock->tail = proc;

Without TSO (total store order), such an assignment to lwWaitLink in
LWLockRelease outside the critical section may simply corrupt the L1 list in
which the awakened process is already linked.
But I am not sure that eliminating this assignment would be enough to ensure
the correctness of this code without TSO.





Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-13 Thread Florian Pflug
On Feb10, 2014, at 17:38 , Andres Freund and...@2ndquadrant.com wrote:
 On 2014-02-10 11:11:28 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
 So what we need to do is to acquire a write barrier between the
 assignments to lwWaitLink and lwWaiting, i.e.
proc->lwWaitLink = NULL;
pg_write_barrier();
proc->lwWaiting = false;
 
 You didn't really explain why you think that ordering is necessary?
 Each proc being awoken will surely see both fields updated, and other
 procs won't be examining these fields at all, since we already delinked
 all these procs from the LWLock's queue.
 
 The problem is that one of the released backends could wake up concurrently
 because of an unrelated, or previous, PGSemaphoreUnlock(). It could see
 lwWaiting = false, and thus wake up and acquire the lock, even if the
 store for lwWaitLink hasn't arrived (or been performed, there's no guaranteed
 ordering here) yet.
 Now, it may well be that there's no practical consequence of that, but I
 am not prepared to bet on it.

AFAICS there is a potential problem if three backends are involved, since
by the time the waiting backend's lwWaitLink is set to NULL after the
original lock holder released the lock, the waiting backend might already
have acquired the lock (due to a spurious wakeup) *and* a third backend
might have already enqueued behind it.

The exact sequence for backends A,B,C that corrupts the wait queue is:

A: Release lock, set B's lwWaiting to false
B: Wakes up spuriously, takes the lock
C: Enqueues behind B
A: Sets B's lwWaitLink back to NULL, thereby truncating the queue and
   causing C and anyone behind it to block indefinitely.

I wonder whether LWLockRelease really needs to update lwWaitLink at all.
We take the backends we awake off the queue by updating the queue's head and
tail, so the contents of lwWaitLink should only matter once the backend is
re-inserted into some wait queue. But when doing that, we reset lwWaitLink
back to NULL anyway.

best regards,
Florian Pflug








Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-12 Thread MauMau

From: Andres Freund and...@2ndquadrant.com

It's x86, right? Then it's unlikely to be actual unordered memory
accesses, but if the compiler reordered:
   LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
   proc = head;
   head = proc->lwWaitLink;
   proc->lwWaitLink = NULL;
   proc->lwWaiting = false;
   PGSemaphoreUnlock(&proc->sem);
to
   LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
   proc = head;
   proc->lwWaiting = false;
   head = proc->lwWaitLink;
   proc->lwWaitLink = NULL;
   PGSemaphoreUnlock(&proc->sem);
which it is permitted to do, yes, that could cause symptoms like you
describe.


Yes, the hang occurred with 64-bit PostgreSQL 9.2.4 running on RHEL6 for
x86_64.  PostgreSQL was built with GCC.



Any chance you have the binaries the customer ran back then around?
Disassembling that piece of code might give you a hint whether that's a
possible cause.


I'm sorry I can't provide the module, but I attached the disassembled code
for LWLockRelease and LWLockAcquire in the executable.  I'm not sure whether
this proves anything.


FYI, the following stack traces are the ones obtained during two instances
of the hang.


#0  0x0036102eaf77 in semop () from /lib64/libc.so.6
#1  0x00614707 in PGSemaphoreLock ()
#2  0x00659d5b in LWLockAcquire ()
#3  0x0047983d in RelationGetBufferForTuple ()
#4  0x00477f86 in heap_insert ()
#5  0x005a4a12 in ExecModifyTable ()
#6  0x0058d928 in ExecProcNode ()
#7  0x0058c762 in standard_ExecutorRun ()
#8  0x7f0cb37f99cb in pgss_ExecutorRun () from 
/opt/symfoserver64/lib/pg_stat_statements.so
#9  0x7f0cb357f545 in explain_ExecutorRun () from 
/opt/symfoserver64/lib/auto_explain.so

#10 0x0066a59e in ProcessQuery ()
#11 0x0066a7ef in PortalRunMulti ()
#12 0x0066afd2 in PortalRun ()
#13 0x00666fcb in exec_simple_query ()
#14 0x00668058 in PostgresMain ()
#15 0x00622ef1 in PostmasterMain ()
#16 0x005c0723 in main ()

#0  0x0036102eaf77 in semop () from /lib64/libc.so.6
#1  0x00614707 in PGSemaphoreLock ()
#2  0x00659d5b in LWLockAcquire ()
#3  0x0047983d in RelationGetBufferForTuple ()
#4  0x00477f86 in heap_insert ()
#5  0x005a4a12 in ExecModifyTable ()
#6  0x0058d928 in ExecProcNode ()
#7  0x0058c762 in standard_ExecutorRun ()
#8  0x7f0cb37f99cb in pgss_ExecutorRun () from 
/opt/symfoserver64/lib/pg_stat_statements.so
#9  0x7f0cb357f545 in explain_ExecutorRun () from 
/opt/symfoserver64/lib/auto_explain.so

#10 0x0066a59e in ProcessQuery ()
#11 0x0066a7ef in PortalRunMulti ()
#12 0x0066afd2 in PortalRun ()
#13 0x00666fcb in exec_simple_query ()
#14 0x00668058 in PostgresMain ()
#15 0x00622ef1 in PostmasterMain ()
#16 0x005c0723 in main ()


#0  0x0036102eaf77 in semop () from /lib64/libc.so.6
#1  0x00614707 in PGSemaphoreLock ()
#2  0x00659d5b in LWLockAcquire ()
#3  0x0064bb8c in ProcArrayEndTransaction ()
#4  0x00491216 in CommitTransaction ()
#5  0x004925a5 in CommitTransactionCommand ()
#6  0x00664cf7 in finish_xact_command ()
#7  0x00667145 in exec_simple_query ()
#8  0x00668058 in PostgresMain ()
#9  0x00622ef1 in PostmasterMain ()
#10 0x005c0723 in main ()


Regards
MauMau


Dump of assembler code for function LWLockRelease:
0x00647d40 <LWLockRelease+0>:   push   %r12
0x00647d42 <LWLockRelease+2>:   mov    %edi,%r12d
0x00647d45 <LWLockRelease+5>:   shl    $0x5,%r12
0x00647d49 <LWLockRelease+9>:   add    5192344(%rip),%r12        # 0xb3b7e8 <LWLockArray>
0x00647d50 <LWLockRelease+16>:  push   %rbp
0x00647d51 <LWLockRelease+17>:  mov    %edi,%ebp
0x00647d53 <LWLockRelease+19>:  push   %rbx
0x00647d54 <LWLockRelease+20>:  mov    5192326(%rip),%ebx        # 0xb3b7e0 <num_held_lwlocks>
0x00647d5a <LWLockRelease+26>:  nopw   0x0(%rax,%rax,1)
0x00647d60 <LWLockRelease+32>:  sub    $0x1,%ebx
0x00647d63 <LWLockRelease+35>:  js     0x647ea4 <LWLockRelease+356>
0x00647d69 <LWLockRelease+41>:  movslq %ebx,%rax
0x00647d6c <LWLockRelease+44>:  cmp    %ebp,0xb3b800(,%rax,4)
0x00647d73 <LWLockRelease+51>:  jne    0x647d60 <LWLockRelease+32>
0x00647d75 <LWLockRelease+53>:  mov    5192293(%rip),%esi        # 0xb3b7e0 <num_held_lwlocks>
0x00647d7b <LWLockRelease+59>:  sub    $0x1,%esi
0x00647d7e <LWLockRelease+62>:  cmp    %ebx,%esi
0x00647d80 <LWLockRelease+64>:  mov    %esi,5192282(%rip)        # 0xb3b7e0 <num_held_lwlocks>
0x00647d86 <LWLockRelease+70>:  jg     0x647d92 <LWLockRelease+82>
0x00647d88 <LWLockRelease+72>:  jmp    0x647dad <LWLockRelease+109>
0x00647d8a <LWLockRelease+74>:  nopw   0x0(%rax,%rax,1)
0x00647d90 <LWLockRelease+80>:  mov    %ecx,%ebx

Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-12 Thread Andres Freund
On 2014-02-12 20:55:32 +0900, MauMau wrote:
 Dump of assembler code for function LWLockRelease:

could you try if you get more readable dumps by using disassemble/m?
That might at least print line numbers if you have debug info installed.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-12 Thread MauMau

From: Andres Freund and...@2ndquadrant.com

could you try if you get more readable dumps by using disassemble/m?
That might at least print line numbers if you have debug info installed.


OK, I'll try that tomorrow.  However, the debug info is not available, 
because they use PostgreSQL built by themselves, not the community RPM nor 
EnterpriseDB's installer.


Regards
MauMau





Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-12 Thread Florian Pflug
On Feb12, 2014, at 12:55 , MauMau maumau...@gmail.com wrote:
 From: Andres Freund and...@2ndquadrant.com
 It's x86, right? Then it's unlikely to be actual unordered memory
 accesses, but if the compiler reordered:
   LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
   proc = head;
   head = proc->lwWaitLink;
   proc->lwWaitLink = NULL;
   proc->lwWaiting = false;
   PGSemaphoreUnlock(&proc->sem);
 to
   LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
   proc = head;
   proc->lwWaiting = false;
   head = proc->lwWaitLink;
   proc->lwWaitLink = NULL;
   PGSemaphoreUnlock(&proc->sem);
 which it is permitted to do, yes, that could cause symptoms like you
 describe.
 
 Yes, the hang occurred with 64-bit PostgreSQL 9.2.4 running on RHEL6 for 
 x86_64.
 The PostgreSQL was built with GCC.

The relevant part of the disassembled binary you attached seems to be

Dump of assembler code for function LWLockRelease:
...
0x00647f47 <LWLockRelease+519>: lea    0x10(%rcx),%rdi
0x00647f4b <LWLockRelease+523>: movq   $0x0,0x48(%rcx)
0x00647f53 <LWLockRelease+531>: movb   $0x0,0x41(%rcx)
0x00647f57 <LWLockRelease+535>: callq  0x606210 <PGSemaphoreUnlock>

I haven't checked the offsets, but since lwWaitLink is an 8-byte quantity
and lwWaiting a single-byte quantity, it's pretty much certain that the
first store updates lwWaitLink and the second lwWaiting. Thus, no reordering
seems to have taken place here...

best regards,
Florian Pflug





Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-12 Thread knizhnik

On 02/12/2014 05:42 PM, Florian Pflug wrote:

On Feb12, 2014, at 12:55 , MauMau maumau...@gmail.com wrote:

From: Andres Freund and...@2ndquadrant.com

It's x86, right? Then it's unlikely to be actual unordered memory
accesses, but if the compiler reordered:
   LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
   proc = head;
   head = proc->lwWaitLink;
   proc->lwWaitLink = NULL;
   proc->lwWaiting = false;
   PGSemaphoreUnlock(&proc->sem);
to
   LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
   proc = head;
   proc->lwWaiting = false;
   head = proc->lwWaitLink;
   proc->lwWaitLink = NULL;
   PGSemaphoreUnlock(&proc->sem);
which it is permitted to do, yes, that could cause symptoms like you
describe.


Yes, the hang occurred with 64-bit PostgreSQL 9.2.4 running on RHEL6 for x86_64.
The PostgreSQL was built with GCC.


The relevant part of the disassembled binary you attached seems to be

Dump of assembler code for function LWLockRelease:
...
0x00647f47 <LWLockRelease+519>:   lea    0x10(%rcx),%rdi
0x00647f4b <LWLockRelease+523>:   movq   $0x0,0x48(%rcx)
0x00647f53 <LWLockRelease+531>:   movb   $0x0,0x41(%rcx)
0x00647f57 <LWLockRelease+535>:   callq  0x606210 <PGSemaphoreUnlock>

I haven't checked the offsets, but since lwWaitLink is an 8-byte quantity
and lwWaiting a single-byte quantity, it's pretty much certain that the
first store updates lwWaitLink and the second lwWaiting. Thus, no reordering
seems to have taken place here...

best regards,
Florian Pflug





Even if the reordering was not done by the compiler, it can still happen
during execution.
There is no guarantee that two subsequent assignments will be observed by all
CPU cores in the same order.
So if one thread is performing

p->x = 1;
p->y = 2;
p->x = 3;
p->y = 4;

then another thread can see the following combinations of (x,y):

(1,2)
(1,4)
(3,2)
(3,4)

It is necessary to explicitly insert a write barrier to prevent such
non-deterministic behaviour.
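
With the barrier.h primitives discussed in this thread, the writer/reader
pairing would look roughly like this (a sketch; the struct and function names
are invented for illustration):

    #include "storage/barrier.h"    /* pg_write_barrier() / pg_read_barrier() */

    struct pair { int x; int y; };

    static void
    writer(volatile struct pair *p)
    {
        p->x = 1;
        pg_write_barrier();  /* order the store to x before the store to y */
        p->y = 2;
    }

    static void
    reader(volatile struct pair *p)
    {
        int y = p->y;
        pg_read_barrier();   /* pairs with the writer's barrier */
        int x = p->x;        /* seeing y == 2 now implies x == 1 */
        (void) x; (void) y;
    }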




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-12 Thread Ants Aasma
On Wed, Feb 12, 2014 at 4:04 PM, knizhnik knizh...@garret.ru wrote:
 Even if the reordering was not done by the compiler, it can still happen
 during execution.
 There is no guarantee that two subsequent assignments will be observed by all
 CPU cores in the same order.

The x86 memory model (total store order) provides that guarantee in
this specific case.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-11 Thread MauMau

From: Andres Freund and...@2ndquadrant.com

which means they manipulate the lwWaitLink queue without
protection. That's done intentionally. The code tries to protect against
corruption of the list due to a woken up backend acquiring a lock (this
or an independent one) by only continuing when the lwWaiting flag is set
to false. Unfortunately there's absolutely no guarantee that a) the
assignment to lwWaitLink and lwWaiting are done in that order b) that
the stores are done in-order from the POV of other backends.
So what we need to do is to acquire a write barrier between the
assignments to lwWaitLink and lwWaiting, i.e.
   proc->lwWaitLink = NULL;
   pg_write_barrier();
   proc->lwWaiting = false;
the reader side already uses an implicit barrier by using spinlocks.


I've got a report from one customer that they encountered a hang during
performance benchmarking.  They were using PostgreSQL 9.2.4.  I remember
that the stack trace showed many backends blocked forever at LWLockAcquire()
during a btree insert operation.  I'm not sure this has something to do with
what you are raising, but the release notes for 9.2.5/6 don't suggest any
fixes for this.  So I felt there was something wrong with lwlocks.


Do you think that the issue you are raising could cause my customer's
problem -- backends blocking at an lwlock forever?


Regards
MauMau





Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-11 Thread Andres Freund
On 2014-02-11 21:46:04 +0900, MauMau wrote:
 From: Andres Freund and...@2ndquadrant.com
 which means they manipulate the lwWaitLink queue without
 protection. That's done intentionally. The code tries to protect against
 corruption of the list due to a woken up backend acquiring a lock (this
 or an independent one) by only continuing when the lwWaiting flag is set
 to false. Unfortunately there's absolutely no guarantee that a) the
 assignment to lwWaitLink and lwWaiting are done in that order b) that
 the stores are done in-order from the POV of other backends.
 So what we need to do is to acquire a write barrier between the
 assignments to lwWaitLink and lwWaiting, i.e.
    proc->lwWaitLink = NULL;
    pg_write_barrier();
    proc->lwWaiting = false;
 the reader side already uses an implicit barrier by using spinlocks.
 
 I've got a report from one customer that they encountered a hang during
 performance benchmarking.  They were using PostgreSQL 9.2.4.  I remember
 that the stack trace showed many backends blocked forever at LWLockAcquire()
 during a btree insert operation.  I'm not sure this has something to do with
 what you are raising, but the release notes for 9.2.5/6 don't suggest any
 fixes for this.  So I felt there was something wrong with lwlocks.
 
 Do you think that the issue you are raising could cause my customer's
 problem -- backends blocking at an lwlock forever?

It's x86, right? Then it's unlikely to be actual unordered memory
accesses, but if the compiler reordered:
LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
proc = head;
head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
proc->lwWaiting = false;
PGSemaphoreUnlock(&proc->sem);
to
LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
proc = head;
proc->lwWaiting = false;
head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
PGSemaphoreUnlock(&proc->sem);
which it is permitted to do, yes, that could cause symptoms like you
describe.

Any chance you have the binaries the customer ran back then around?
Disassembling that piece of code might give you a hint whether that's a
possible cause.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-10 Thread Andres Freund
Hi,

During the lwlock scalability work I noticed a longstanding issue with
the lwlock code. LWLockRelease() and the other mentioned locations do
the following to wake up any waiters, without holding the lock's
spinlock:
/*
 * Awaken any waiters I removed from the queue.
 */
while (head != NULL)
{
    LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
    proc = head;
    head = proc->lwWaitLink;
    proc->lwWaitLink = NULL;
    proc->lwWaiting = false;
    PGSemaphoreUnlock(&proc->sem);
}

which means they manipulate the lwWaitLink queue without
protection. That's done intentionally. The code tries to protect against
corruption of the list due to a woken up backend acquiring a lock (this
or an independent one) by only continuing when the lwWaiting flag is set
to false. Unfortunately there's absolutely no guarantee that a) the
assignment to lwWaitLink and lwWaiting are done in that order b) that
the stores are done in-order from the POV of other backends.
So what we need to do is to acquire a write barrier between the
assignments to lwWaitLink and lwWaiting, i.e.
proc->lwWaitLink = NULL;
pg_write_barrier();
proc->lwWaiting = false;
the reader side already uses an implicit barrier by using spinlocks.
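
Applied to the wakeup loop quoted above, the proposed fix would thus look
roughly like this (a sketch of the suggestion, not the committed patch):

    while (head != NULL)
    {
        proc = head;
        head = proc->lwWaitLink;
        proc->lwWaitLink = NULL;
        pg_write_barrier();         /* publish the cleared link ... */
        proc->lwWaiting = false;    /* ... before letting the backend proceed */
        PGSemaphoreUnlock(&proc->sem);
    }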

I've fixed this as part 1 of the lwlock scalability work in [1], but
Heikki rightfully suggested that a) this should be backpatched b) done
in a separate commit.

There is the question what to do about the branches without barriers? I
guess a SpinLockAcquire()/Release() would do? Or do we decide it's not
important enough to matter, since it's not an issue on x86?

[1] 
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=commitdiff;h=2de11eb11bb3e3777f6d384de0af9c2f77960637

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-10 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 So what we need to do is to acquire a write barrier between the
 assignments to lwWaitLink and lwWaiting, i.e.
 proc->lwWaitLink = NULL;
 pg_write_barrier();
 proc->lwWaiting = false;

You didn't really explain why you think that ordering is necessary?
Each proc being awoken will surely see both fields updated, and other
procs won't be examining these fields at all, since we already delinked
all these procs from the LWLock's queue.

 There is the question what to do about the branches without barriers? I
 guess a SpinLockAcquire()/Release() would do? Or do we decide it's not
 important enough to matter, since it's not an issue on x86?

Given the lack of trouble reports that could be traced to this,
I don't feel a need to worry about it in branches that don't
have any barrier support.  But in any case, I'm not convinced
there's a bug here at all.

regards, tom lane




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-10 Thread Andres Freund
On 2014-02-10 11:11:28 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  So what we need to do is to acquire a write barrier between the
  assignments to lwWaitLink and lwWaiting, i.e.
  proc->lwWaitLink = NULL;
  pg_write_barrier();
  proc->lwWaiting = false;
 
 You didn't really explain why you think that ordering is necessary?
 Each proc being awoken will surely see both fields updated, and other
 procs won't be examining these fields at all, since we already delinked
 all these procs from the LWLock's queue.

The problem is that one of the released backends could wake up concurrently
because of an unrelated, or previous, PGSemaphoreUnlock(). It could see
lwWaiting = false, and thus wake up and acquire the lock, even if the
store for lwWaitLink hasn't arrived (or been performed, there's no guaranteed
ordering here) yet.
Now, it may well be that there's no practical consequence of that, but I
am not prepared to bet on it.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-10 Thread Andres Freund
On 2014-02-10 11:20:30 -0500, Tom Lane wrote:
 I wrote:
  You didn't really explain why you think that ordering is necessary?
 
 Actually, after grepping to check my memory of what those fields are
 being used for, I have a bigger question: WTF is xlog.c doing being
 so friendly with the innards of LWLocks?  Surely this needs to get
 refactored so that most of WakeupWaiters() and friends is in lwlock.c.
 Or has all notion of modularity gone out the window while I wasn't
 looking?

Well, it's not actually using any lwlock.c code, it's special-case
locking logic, just reusing the data structures. That said, I am not
particularly happy about the amount of code it's duplicating from
lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of
WALInsertSlotAcquireOne() is copied.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-10 Thread Tom Lane
I wrote:
 You didn't really explain why you think that ordering is necessary?

Actually, after grepping to check my memory of what those fields are
being used for, I have a bigger question: WTF is xlog.c doing being
so friendly with the innards of LWLocks?  Surely this needs to get
refactored so that most of WakeupWaiters() and friends is in lwlock.c.
Or has all notion of modularity gone out the window while I wasn't
looking?

regards, tom lane




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-10 Thread Heikki Linnakangas

On 02/10/2014 06:41 PM, Andres Freund wrote:

On 2014-02-10 11:20:30 -0500, Tom Lane wrote:

I wrote:

You didn't really explain why you think that ordering is necessary?


Actually, after grepping to check my memory of what those fields are
being used for, I have a bigger question: WTF is xlog.c doing being
so friendly with the innards of LWLocks?  Surely this needs to get
refactored so that most of WakeupWaiters() and friends is in lwlock.c.
Or has all notion of modularity gone out the window while I wasn't
looking?


Well, it's not actually using any lwlock.c code, it's special-case
locking logic, just reusing the data structures. That said, I am not
particularly happy about the amount of code it's duplicating from
lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of
WALInsertSlotAcquireOne() is copied.


I'm not too happy with the amount of copy-paste myself, but there was 
enough difference to regular lwlocks that I didn't want to bother all 
lwlocks with the xlog-specific stuff either. The WAL insert slots do 
share the LWLock-related PGPROC fields though, and the semaphore. I'm all 
ears if you have ideas on that.


- Heikki




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-10 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 02/10/2014 06:41 PM, Andres Freund wrote:
 Well, it's not actually using any lwlock.c code, it's special-case
 locking logic, just reusing the data structures. That said, I am not
 particularly happy about the amount of code it's duplicating from
 lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of
 WALInsertSlotAcquireOne() is copied.

 I'm not too happy with the amount of copy-paste myself, but there was 
 enough difference to regular lwlocks that I didn't want to bother all 
 lwlocks with the xlog-specific stuff either. The WAL insert slots do 
 share the LWLock-related PGPROC fields though, and the semaphore. I'm all 
 ears if you have ideas on that.

I agree that if the behavior is considerably different, we don't really
want to try to make LWLockAcquire/Release cater to both this and their
standard behavior.  But why not add some additional functions in lwlock.c
that do what xlog wants?  If we're going to have mostly-copy-pasted logic,
it'd at least be better if it was in the same file, and not somewhere
that's not even in the same major subtree.

Also, the reason that LWLock isn't an abstract struct is because we wanted
to be able to embed it in other structs; *not* as license for other
modules to fool with its contents.  If we were working in C++ we'd
certainly have made all its fields private.

regards, tom lane




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-10 Thread Heikki Linnakangas

On 02/10/2014 03:46 PM, Andres Freund wrote:

Hi,

During the lwlock scalability work I noticed a longstanding issue with
the lwlock code. LWLockRelease() and the other mentioned locations do
the following to wake up any waiters, without holding the lock's
spinlock:
 /*
  * Awaken any waiters I removed from the queue.
  */
 while (head != NULL)
 {
     LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
     proc = head;
     head = proc->lwWaitLink;
     proc->lwWaitLink = NULL;
     proc->lwWaiting = false;
     PGSemaphoreUnlock(&proc->sem);
 }

which means they manipulate the lwWaitLink queue without
protection. That's done intentionally. The code tries to protect against
corruption of the list due to a woken up backend acquiring a lock (this
or an independent one) by only continuing when the lwWaiting flag is set
to false. Unfortunately there's absolutely no guarantee that a) the
assignment to lwWaitLink and lwWaiting are done in that order b) that
the stores are done in-order from the POV of other backends.
So what we need to do is to acquire a write barrier between the
assignments to lwWaitLink and lwWaiting, i.e.
     proc->lwWaitLink = NULL;
     pg_write_barrier();
     proc->lwWaiting = false;
the reader side already uses an implicit barrier by using spinlocks.

I've fixed this as part 1 of the lwlock scalability work in [1], but
Heikki rightfully suggested that a) this should be backpatched b) done
in a separate commit.

There is the question what to do about the branches without barriers? I
guess a SpinLockAcquire()/Release() would do?


The other alternative we discussed on IM is to unlink the waiters from 
the linked list, while still holding the spinlock. We can't do the 
PGSemaphoreUnlock() call to actually wake up the waiters while holding 
the spinlock, but all the shared memory manipulations we could. It would 
move all the modifications of the shared structures under the spinlock, 
which feels comforting.


It would require using some sort of backend-private data structure to 
hold the list of processes to wake up. We don't want to palloc() in 
LWLockRelease(), but we could malloc() a large-enough array once at 
process initialization, and use that on all LWLockRelease() calls.
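
A minimal sketch of that approach (helper names are hypothetical, error
handling omitted):

    static PGPROC **wakeup_list = NULL;     /* allocated once per backend */

    void
    LWLockInitWakeupList(void)
    {
        /* MaxBackends bounds how many procs can ever wait on one lwlock */
        wakeup_list = malloc(MaxBackends * sizeof(PGPROC *));
    }

    /* caller holds lock->mutex on entry */
    static void
    wakeup_waiters(volatile LWLock *lock, PGPROC *head)
    {
        int nwakeup = 0;

        while (head != NULL)        /* delink and mark under the spinlock */
        {
            PGPROC *proc = head;

            head = proc->lwWaitLink;
            proc->lwWaitLink = NULL;
            proc->lwWaiting = false;
            wakeup_list[nwakeup++] = proc;
        }
        SpinLockRelease(&lock->mutex);

        for (int i = 0; i < nwakeup; i++)   /* semaphore ops outside the spinlock */
            PGSemaphoreUnlock(&wakeup_list[i]->sem);
    }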

- Heikki




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-10 Thread Heikki Linnakangas

On 02/10/2014 08:03 PM, Tom Lane wrote:

Heikki Linnakangas hlinnakan...@vmware.com writes:

On 02/10/2014 06:41 PM, Andres Freund wrote:

Well, it's not actually using any lwlock.c code, it's special-case
locking logic, just reusing the data structures. That said, I am not
particularly happy about the amount of code it's duplicating from
lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of
WALInsertSlotAcquireOne() is copied.



I'm not too happy with the amount of copy-paste myself, but there was
enough difference to regular lwlocks that I didn't want to bother all
lwlocks with the xlog-specific stuff either. The WAL insert slots do
share the LWLock-related PGPROC fields though, and the semaphore. I'm all
ears if you have ideas on that.


I agree that if the behavior is considerably different, we don't really
want to try to make LWLockAcquire/Release cater to both this and their
standard behavior.  But why not add some additional functions in lwlock.c
that do what xlog wants?  If we're going to have mostly-copy-pasted logic,
it'd at least be better if it was in the same file, and not somewhere
that's not even in the same major subtree.


Ok, I'll try to refactor it that way, so that we can see if it looks better.


Also, the reason that LWLock isn't an abstract struct is because we wanted
to be able to embed it in other structs; *not* as license for other
modules to fool with its contents.  If we were working in C++ we'd
certainly have made all its fields private.


Um, xlog.c is doing no such thing. The insertion slots use a struct of 
their own, which is also copy-pasted from LWLock (with one additional 
field).


- Heikki




Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-10 Thread Andres Freund
On 2014-02-10 19:48:47 +0200, Heikki Linnakangas wrote:
 On 02/10/2014 06:41 PM, Andres Freund wrote:
 On 2014-02-10 11:20:30 -0500, Tom Lane wrote:
 I wrote:
 You didn't really explain why you think that ordering is necessary?
 
 Actually, after grepping to check my memory of what those fields are
 being used for, I have a bigger question: WTF is xlog.c doing being
 so friendly with the innards of LWLocks?  Surely this needs to get
 refactored so that most of WakeupWaiters() and friends is in lwlock.c.
 Or has all notion of modularity gone out the window while I wasn't
 looking?
 
  Well, it's not actually using any lwlock.c code, it's special-case
  locking logic, just reusing the data structures. That said, I am not
  particularly happy about the amount of code it's duplicating from
  lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of
  WALInsertSlotAcquireOne() is copied.
 
 I'm not too happy with the amount of copy-paste myself, but there was enough
 difference to regular lwlocks that I didn't want to bother all lwlocks with
 the xlog-specific stuff either. The WAL insert slots do share the
  LWLock-related PGPROC fields though, and the semaphore. I'm all ears if you
  have ideas on that.

The lwlock scalability stuff has separated out the enqueue/wakeup code,
that probably should work here as well? And that's a fair portion of the
code. I think it should be doable to make that generic enough that the
actual difference of the struct doesn't matter. It'd also reduce
duplication of LWLockAcquire, ConditionalAcquire, OrWait.
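
The shape such a refactoring might take (names here are purely hypothetical,
sketched for illustration):

    /* A wait queue embeddable in both LWLock and the XLog insert slots. */
    typedef struct LWWaitQueue
    {
        PGPROC  *head;
        PGPROC  *tail;
    } LWWaitQueue;

    /* enqueue the current backend; caller holds the owning spinlock */
    extern void LWQueueEnqueue(LWWaitQueue *q, PGPROC *proc, uint8 waitMode);

    /* delink waiters under the spinlock, wake them after releasing it */
    extern void LWQueueWakeup(LWWaitQueue *q, volatile slock_t *mutex);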

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

