Re: [HACKERS] HS locking broken in HEAD

2013-01-18 Thread Robert Haas
On Thu, Jan 17, 2013 at 9:00 PM, Andres Freund and...@2ndquadrant.com wrote:
  I think it might also be a dangerous assumption for shared objects?

 Locks on shared objects can't be taken via the fast path.  In order to
 take a fast-path lock, a backend must be bound to a database and the
 locktag must be for a relation in that database.

 I assumed it wasn't allowed, but didn't find where that happens at first
 - I found it now though (c.f. SetLocktagRelationOid).

  A patch minimally addressing the is attached, but it only addresses part
  of the problem.

 Isn't the right fix to compare proc-databaseId to
 locktag-locktag_field1 rather than to MyDatabaseId?  The subsequent
 loop assumes that if the relid matches, the lock tag is an exact
 match, which will be correct with that rule but not the one you
 propose.

 I don't know much about the locking code, you're probably right. I also
 didn't look very far after finding the guilty commit.

 (reading code)

 Yes, I think you're right, that seems to be the actually correct fix.

 I am a bit worried that there are more such assumptions in the code, but
 I didn't find any, but I really don't know that code.

I found two instances of this.  See attached patch.

Can you test whether this fixes it for you?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


fix-mydatabaseid-compare.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HS locking broken in HEAD

2013-01-18 Thread Andres Freund
On 2013-01-18 10:15:20 -0500, Robert Haas wrote:
 On Thu, Jan 17, 2013 at 9:00 PM, Andres Freund and...@2ndquadrant.com wrote:
   I think it might also be a dangerous assumption for shared objects?
 
  Locks on shared objects can't be taken via the fast path.  In order to
  take a fast-path lock, a backend must be bound to a database and the
  locktag must be for a relation in that database.
 
  I assumed it wasn't allowed, but didn't find where that happens at first
  - I found it now though (c.f. SetLocktagRelationOid).
 
   A patch minimally addressing the is attached, but it only addresses part
   of the problem.
 
  Isn't the right fix to compare proc-databaseId to
  locktag-locktag_field1 rather than to MyDatabaseId?  The subsequent
  loop assumes that if the relid matches, the lock tag is an exact
  match, which will be correct with that rule but not the one you
  propose.
 
  I don't know much about the locking code, you're probably right. I also
  didn't look very far after finding the guilty commit.
 
  (reading code)
 
  Yes, I think you're right, that seems to be the actually correct fix.
 
  I am a bit worried that there are more such assumptions in the code, but
  I didn't find any, but I really don't know that code.
 
 I found two instances of this.  See attached patch.

Yea, those are the two sites I had fixed (incorrectly) as well. I
looked a bit more yesterday night and I didn't find any additional
locations, so I hope we got this.

Yep, seems to work.

I am still stupefied nobody noticed that locking in HS (where just about
all locks are going to be fast path locks) was completely broken for
nearly two years.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HS locking broken in HEAD

2013-01-18 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 I am still stupefied nobody noticed that locking in HS (where just about
 all locks are going to be fast path locks) was completely broken for
 nearly two years.

IIUC it would only be broken for cases where activity was going on
concurrently in two different databases, which maybe isn't all that
common out in the field.  And for sure it isn't something we test much.

I wonder if it'd be practical to, say, run all the contrib regression
tests concurrently in different databases of one installation.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HS locking broken in HEAD

2013-01-18 Thread Andres Freund
On 2013-01-18 11:16:15 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  I am still stupefied nobody noticed that locking in HS (where just about
  all locks are going to be fast path locks) was completely broken for
  nearly two years.
 
 IIUC it would only be broken for cases where activity was going on
 concurrently in two different databases, which maybe isn't all that
 common out in the field.  And for sure it isn't something we test much.

I think effectively it only was broken in Hot Standby. At least I don't
immediately can think of a scenario where a strong lock is being acquired on a
non-shared object in a different database.

 I wonder if it'd be practical to, say, run all the contrib regression
 tests concurrently in different databases of one installation.

I think it would be a good idea, but I don't immediately have an idea
how to implement it. It seems to me we would need to put the logic for
it into pg_regress? Otherwise the lifetime management of the shared
postmaster seems to get complicated.

What I would really like is to have some basic replication test scenario
in the regression tests. That seems like a dramatically undertested, but
pretty damn essential, part of the code.

The first handwavy guess I have of doing it is something like connecting
a second postmaster to the primary one at the start of the main
regression tests (requires changing the wal level again, yuck) and
fuzzyly comparing a pg_dump of the database remnants in both clusters at
the end of the regression tests.

Better ideas?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HS locking broken in HEAD

2013-01-18 Thread Andres Freund
On 2013-01-18 17:26:00 +0100, Andres Freund wrote:
 On 2013-01-18 11:16:15 -0500, Tom Lane wrote:
  Andres Freund and...@2ndquadrant.com writes:
   I am still stupefied nobody noticed that locking in HS (where just about
   all locks are going to be fast path locks) was completely broken for
   nearly two years.
  
  IIUC it would only be broken for cases where activity was going on
  concurrently in two different databases, which maybe isn't all that
  common out in the field.  And for sure it isn't something we test much.
 
 I think effectively it only was broken in Hot Standby. At least I don't
 immediately can think of a scenario where a strong lock is being acquired on a
 non-shared object in a different database.

Hrmpf, should have mentioned that the problem in HS is that the startup
process is doing exactly that, which is why it is broke there. It
acquires the exclusive locks shipped via WAL so the following
non-concurrency safe actions can be applied. And obviously its not
connected to any database while doing it...

I would have guessed its not that infrequent to run an ALTER TABLE or
somesuch while the standby is still running some longrunning query...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HS locking broken in HEAD

2013-01-18 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-01-18 11:16:15 -0500, Tom Lane wrote:
 I wonder if it'd be practical to, say, run all the contrib regression
 tests concurrently in different databases of one installation.

 I think it would be a good idea, but I don't immediately have an idea
 how to implement it. It seems to me we would need to put the logic for
 it into pg_regress? Otherwise the lifetime management of the shared
 postmaster seems to get complicated.

Seems like it'd be sufficient to make it work for the make
installcheck case, which dodges the postmaster problem.

 What I would really like is to have some basic replication test scenario
 in the regression tests.

Agreed, that's not being tested enough either.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HS locking broken in HEAD

2013-01-18 Thread Amit kapila

On Friday, January 18, 2013 9:56 PM Andres Freund wrote:
On 2013-01-18 11:16:15 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  I am still stupefied nobody noticed that locking in HS (where just about
  all locks are going to be fast path locks) was completely broken for
  nearly two years.

 IIUC it would only be broken for cases where activity was going on
 concurrently in two different databases, which maybe isn't all that
 common out in the field.  And for sure it isn't something we test much.

 I think effectively it only was broken in Hot Standby. At least I don't
 immediately can think of a scenario where a strong lock is being acquired on 
 a
 non-shared object in a different database.

  I wonder if it'd be practical to, say, run all the contrib regression
 tests concurrently in different databases of one installation.

 I think it would be a good idea, but I don't immediately have an idea
 how to implement it. It seems to me we would need to put the logic for
 it into pg_regress? Otherwise the lifetime management of the shared
 postmaster seems to get complicated.

 What I would really like is to have some basic replication test scenario
 in the regression tests. That seems like a dramatically undertested, but
 pretty damn essential, part of the code.

 The first handwavy guess I have of doing it is something like connecting
 a second postmaster to the primary one at the start of the main
 regression tests (requires changing the wal level again, yuck) and
 fuzzyly comparing a pg_dump of the database remnants in both clusters at
 the end of the regression tests.

  I think this is good idea to start testing for replication. 
  In addition to this, I think to test secondary's behavior, there should be 
some kind of controller module
  which should control SQL commands run on  primary and secondary and match the 
expected behavior.
  This can be particulary useful to test locking conflicts and do the more 
specific tests.

With Regards,
Amit Kapila.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] HS locking broken in HEAD

2013-01-17 Thread Andres Freund
Hi,


While checking whether I could reproduce the replication delay reported
by Michael Paquier I found this very nice tidbit:

In a pretty trivial replication setup of only streaming replication I
can currently easily reproduce this:

standby# BEGIN;SELECT * FROM foo;
BEGIN
 id | data 
+--


standby=# SELECT relation::regclass, locktype, mode FROM pg_locks;
 relation |  locktype  |  mode   
--++-
 pg_locks | relation   | AccessShareLock
 foo_pkey | relation   | AccessShareLock
 foo  | relation   | AccessShareLock
  | virtualxid | ExclusiveLock
  | virtualxid | ExclusiveLock
(5 rows)

primary# DROP TABLE foo;
DROP TABLE


standby# SELECT relation::regclass, pid, locktype, mode FROM pg_locks;
 relation |  pid  |  locktype  |mode 
--+---++-
 pg_locks | 28068 | relation   | AccessShareLock
 foo_pkey | 28068 | relation   | AccessShareLock
 foo  | 28068 | relation   | AccessShareLock
  | 28068 | virtualxid | ExclusiveLock
  | 28048 | virtualxid | ExclusiveLock
 foo  | 28048 | relation   | AccessExclusiveLock
(6 rows)

standby# SELECT * FROM foo;
ERROR:  relation foo does not exist
LINE 1: select * from foo;
  ^
Note the conflicting locks held on relation foo by 28048 and 28068.

I don't immediately know which patch to blame here? Looks a bit like
broken fastpath locking, but I don't immediately see anything in
c00dc337b87 that should cause this?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HS locking broken in HEAD

2013-01-17 Thread Andres Freund
On 2013-01-17 22:46:21 +0100, Andres Freund wrote:
 Hi,
 
 
 While checking whether I could reproduce the replication delay reported
 by Michael Paquier I found this very nice tidbit:
 
 In a pretty trivial replication setup of only streaming replication I
 can currently easily reproduce this:
 
 standby# BEGIN;SELECT * FROM foo;
 BEGIN
  id | data 
 +--
 
 
 standby=# SELECT relation::regclass, locktype, mode FROM pg_locks;
  relation |  locktype  |  mode   
 --++-
  pg_locks | relation   | AccessShareLock
  foo_pkey | relation   | AccessShareLock
  foo  | relation   | AccessShareLock
   | virtualxid | ExclusiveLock
   | virtualxid | ExclusiveLock
 (5 rows)
 
 primary# DROP TABLE foo;
 DROP TABLE
 
 
 standby# SELECT relation::regclass, pid, locktype, mode FROM pg_locks;
  relation |  pid  |  locktype  |mode 
 --+---++-
  pg_locks | 28068 | relation   | AccessShareLock
  foo_pkey | 28068 | relation   | AccessShareLock
  foo  | 28068 | relation   | AccessShareLock
   | 28068 | virtualxid | ExclusiveLock
   | 28048 | virtualxid | ExclusiveLock
  foo  | 28048 | relation   | AccessExclusiveLock
 (6 rows)
 
 standby# SELECT * FROM foo;
 ERROR:  relation foo does not exist
 LINE 1: select * from foo;
   ^
 Note the conflicting locks held on relation foo by 28048 and 28068.
 
 I don't immediately know which patch to blame here? Looks a bit like
 broken fastpath locking, but I don't immediately see anything in
 c00dc337b87 that should cause this?

Rather scarily this got broken in
96cc18eef6489cccefe351baa193f32f12018551. Yes, nearly half a year ago,
including in 9.2.1+. How the heck could this go unnoticed so long?

Not sure yet what the cause of this is.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HS locking broken in HEAD

2013-01-17 Thread Andres Freund
On 2013-01-17 23:56:16 +0100, Andres Freund wrote:
 On 2013-01-17 22:46:21 +0100, Andres Freund wrote:
  ^
  Note the conflicting locks held on relation foo by 28048 and 28068.
  
  I don't immediately know which patch to blame here? Looks a bit like
  broken fastpath locking, but I don't immediately see anything in
  c00dc337b87 that should cause this?
 
 Rather scarily this got broken in
 96cc18eef6489cccefe351baa193f32f12018551. Yes, nearly half a year ago,
 including in 9.2.1+. How the heck could this go unnoticed so long?

That only made the bug more visible - the real bug is somewhere
else. Which makes it even scarrier, the bug was in in the fast path
locking patch from the start...

It assumes conflicting fast path locks can only ever be in the same
database as the the backend transfering the locks to itself. But thats
obviously not true for the startup process which is in no specific
database.
I think it might also be a dangerous assumption for shared objects?

A patch minimally addressing the is attached, but it only addresses part
of the problem.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index f2cf5c6..b5240da 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2458,8 +2458,13 @@ FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag
 		 * less clear that our backend is certain to have performed a memory
 		 * fencing operation since the other backend set proc-databaseId.	So
 		 * for now, we test it after acquiring the LWLock just to be safe.
+		 *
+		 * But if we are the startup process we don't belong to a database but
+		 * still need to lock objects in other databases, so we can do this
+		 * optimization only in case we belong to a database.
+		 * XXX: explain why this is safe for shared tables.
 		 */
-		if (proc-databaseId != MyDatabaseId)
+		if (MyDatabaseId != InvalidOid  proc-databaseId != MyDatabaseId)
 		{
 			LWLockRelease(proc-backendLock);
 			continue;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HS locking broken in HEAD

2013-01-17 Thread Robert Haas
On Thu, Jan 17, 2013 at 6:22 PM, Andres Freund and...@2ndquadrant.com wrote:
 That only made the bug more visible - the real bug is somewhere
 else. Which makes it even scarrier, the bug was in in the fast path
 locking patch from the start...

 It assumes conflicting fast path locks can only ever be in the same
 database as the the backend transfering the locks to itself. But thats
 obviously not true for the startup process which is in no specific
 database.

Ugh.

 I think it might also be a dangerous assumption for shared objects?

Locks on shared objects can't be taken via the fast path.  In order to
take a fast-path lock, a backend must be bound to a database and the
locktag must be for a relation in that database.

 A patch minimally addressing the is attached, but it only addresses part
 of the problem.

Isn't the right fix to compare proc-databaseId to
locktag-locktag_field1 rather than to MyDatabaseId?  The subsequent
loop assumes that if the relid matches, the lock tag is an exact
match, which will be correct with that rule but not the one you
propose.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HS locking broken in HEAD

2013-01-17 Thread Andres Freund
On 2013-01-17 20:36:43 -0500, Robert Haas wrote:
 On Thu, Jan 17, 2013 at 6:22 PM, Andres Freund and...@2ndquadrant.com wrote:
  That only made the bug more visible - the real bug is somewhere
  else. Which makes it even scarrier, the bug was in in the fast path
  locking patch from the start...
 
  It assumes conflicting fast path locks can only ever be in the same
  database as the the backend transfering the locks to itself. But thats
  obviously not true for the startup process which is in no specific
  database.

 Ugh.

  I think it might also be a dangerous assumption for shared objects?

 Locks on shared objects can't be taken via the fast path.  In order to
 take a fast-path lock, a backend must be bound to a database and the
 locktag must be for a relation in that database.

I assumed it wasn't allowed, but didn't find where that happens at first
- I found it now though (c.f. SetLocktagRelationOid).

  A patch minimally addressing the is attached, but it only addresses part
  of the problem.

 Isn't the right fix to compare proc-databaseId to
 locktag-locktag_field1 rather than to MyDatabaseId?  The subsequent
 loop assumes that if the relid matches, the lock tag is an exact
 match, which will be correct with that rule but not the one you
 propose.

I don't know much about the locking code, you're probably right. I also
didn't look very far after finding the guilty commit.

(reading code)

Yes, I think you're right, that seems to be the actually correct fix.

I am a bit worried that there are more such assumptions in the code, but
I didn't find any, but I really don't know that code.

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers