Re: [HACKERS] Re: How to reproduce serialization failure for a read only transaction.

2014-01-07 Thread Dan Ports
On Mon, Jan 06, 2014 at 05:14:12PM -0800, AK wrote:
> Also I cannot reproduce a scenario when "applications must not depend on
> results read during a transaction that later aborted". In this example the
> SELECT itself has failed.
> Can you show an example where a SELECT completes, but the COMMIT blows up?

Actually, no, not for a read-only transaction. It happens that the
final serialization failure check executed on COMMIT only affects
read/write transactions, not read-only ones. That's a pretty specific
implementation detail, though, so I wouldn't necessarily rely on it...

Here's an example of why applications must not depend on results read
during a transaction that later aborted:

   W2: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE
   W2: UPDATE t SET count=1 WHERE id=1;
   W1: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE
   W1: SELECT * FROM t WHERE id=1;
   W2: COMMIT;
   R : BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY
   R : SELECT * FROM t;
   R : COMMIT;
 ! W1: UPDATE t SET count=1 WHERE id=2;
   W1: COMMIT;

If you try this, it'll cause a serialization failure on the line marked
with a '!'. W1 saw (1,0) in the table, so W1 appears to have executed
before W2. But R saw both (1,1) and (2,0) in the table, and that has to
be a consistent snapshot of the database state, meaning W2 appears to
have executed before W1. That's an inconsistency, so something has to
be rolled back. This particular anomaly requires all three of the
transactions, and so it can't be detected until W1 does its UPDATE.
Postgres detects the conflict at that point and rolls back W1.

So what does this have to do with relying on the results of read-only
transactions that abort? Well, what if R had issued ROLLBACK
instead of COMMIT -- maybe because you expected ROLLBACK and COMMIT to
be equivalent for transactions that don't modify the database, or maybe
because something else caused the transaction to abort? When W1 does
its update, it will be checked for serialization failures, but aborted
transactions are (intentionally) not included in those checks. W1 is
therefore allowed to commit; the apparent serial order of execution is
W1 followed by W2, and the results of the aborted transaction R aren't
consistent with that.

Dan

-- 
Dan R. K. Ports    UW CSE    http://drkp.net/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI freezing bug

2013-10-07 Thread Dan Ports
On Mon, Oct 07, 2013 at 12:26:37PM +0300, Heikki Linnakangas wrote:
> When updating a tuple, CheckTargetForConflictsIn() only marks a
> conflict if the transaction holding the predicate lock overlapped
> with the updating transaction.

Ah, this is the bit I was forgetting. (I really ought to have
remembered that, but it's been a while...)

I think it's possible, then, to construct a scenario where a slot is
reused before predicate locks on the old tuple are eligible for
cleanup -- but those locks will never cause a conflict.

So I agree: it's correct to just remove the xmin from the key
unconditionally.

And this is also true:

> And if there's a hole in that thinking I can't see right now,
> the worst that will happen is some unnecessary conflicts, ie. it's
> still correct. It surely can't be worse than upgrading the lock to a
> page-level lock, which would also create unnecessary conflicts.

Dan

-- 
Dan R. K. Ports    UW CSE    http://drkp.net/




Re: [HACKERS] SSI freezing bug

2013-10-03 Thread Dan Ports
On Thu, Oct 03, 2013 at 06:19:49AM -0700, Kevin Grittner wrote:
> Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
>> IMHO it would be better to remove xmin from the lock key, and vacuum
>> away the old predicate locks when the corresponding tuple is vacuumed.
>> The xmin field is only required to handle the case that a tuple is
>> vacuumed, and a new unrelated tuple is inserted to the same slot.
>> Removing the lock when the tuple is removed fixes that.

This seems definitely safe: we need the predicate locks to determine if
someone is modifying a tuple we read, and certainly if it's eligible
for vacuum nobody's going to be modifying that tuple anymore.

>> In fact, I cannot even come up with a situation where you would have a
>> problem if we just removed xmin from the key, even if we didn't vacuum
>> away old locks. I don't think the old lock can conflict with anything
>> that would see the new tuple that gets inserted in the same slot. I have
>> a feeling that you could probably prove that if you stare long enough at
>> the diagram of a dangerous structure and the properties required for a
>> conflict.

This would also be safe, in the sense that it's OK to flag a
conflict even if one doesn't exist. I'm not convinced that it isn't
possible to have false positives this way. I think it's possible for a
tuple to be vacuumed away and the ctid reused before the predicate
locks on it are eligible for cleanup. (In fact, isn't this what was
happening in the thread Kevin linked?)

> You are the one who suggested adding xmin to the key:
>
> http://www.postgresql.org/message-id/4d5a36fc.6010...@enterprisedb.com
>
> I will review that thread in light of your recent comments, but the
> fact is that xmin was not originally in the lock key, testing
> uncovered bugs, and adding xmin fixed those bugs.  I know I tried
> some other approach first, which turned out to be complex and quite
> messy -- it may have been similar to what you are proposing now.

At the time, we thought it was necessary for a predicate lock to lock
*all future versions* of a tuple, and so we had a bunch of code to
maintain a version chain. That was fraught with bugs, and turned out
not to be necessary (IIRC, we worked that out at the pub at PGCon).
That made it critical to distinguish different tuples that had the same
ctid because they could wind up in the wrong chain or cause a cycle.
With that code ripped out, that's no longer an issue.

But all this is an exceptionally subtle part of what was an
exceptionally complex patch, so a lot of careful thought is needed
here...

> It seems to me that a change such as you are now suggesting is
> likely to be too invasive to back-patch.  Do you agree that it
> would make sense to apply the patch I have proposed, back to 9.1,
> and then consider any alternative as 9.4 material?

I agree with this.

Dan

-- 
Dan R. K. Ports    UW CSE    http://drkp.net/




Re: [HACKERS] lwlock contention with SSI

2013-04-10 Thread Dan Ports
On Tue, Apr 09, 2013 at 07:49:51PM -0400, Robert Haas wrote:
> These locks are all SSI-related and they're all really hot.  Lock 28
> is SerializableXactHashLock and lock 29 is
> SerializableFinishedListLock; both are acquired an order of magnitude
> more often than any non-SSI lock, and cause two orders of magnitude
> more blocking than any other lock whatsoever.  Lock 30 is
> SerializablePredicateLockListLock, which has no exclusive lock
> acquisitions at all on this test, but the shared acquisitions result
> in significant spinlock contention.

This matches what I saw when I looked into this a while ago. I even
started sketching out some plans of how we might deal with it, but
unfortunately I never had much time to work on it, and that seems
unlikely to change any time soon. :-\

As it is, pretty much any operation involving SSI requires acquiring
SerializableXactHashLock (usually exclusive), except for checking
whether a read or write indicates a conflict. That includes starting
and ending a transaction.

Two things make this hard to fix:
 - SSI is about checking for rw-conflicts, which are inherently about
   *pairs* of transactions. This makes it hard to do fine-grained
   locking, because a lot of operations involve looking at or modifying
   the conflict list of more than one transaction.
 - SerializableXactHashLock protects many things. Besides the 
   SERIALIZABLEXACT structures themselves, there's also the free lists
   for SERIALIZABLEXACTs and RWConflicts, the SerializableXidHash
   table, the latest SxactCommitSeqno and SxactGlobalXmin, etc.

I'm trying to swap back in my notes about how to address this. It is
bound to be a substantial project, however.

Dan

-- 
Dan R. K. Ports    UW CSE    http://drkp.net/




Re: [HACKERS] Question about SSI, subxacts, and aborted read-only xacts

2012-09-11 Thread Dan Ports
On Mon, Sep 10, 2012 at 10:44:57PM -0700, Jeff Davis wrote:
> For the archives, and for those not following the paper in detail, there
> is one situation in which SSI will abort a read-only transaction.
>
> When there are three transactions forming a dangerous pattern where T1
> (read-only) has a conflict out to T2, and T2 has a conflict out to T3;
> and T3 is committed and T2 is prepared (for two-phase commit). In that
> situation, SSI can't roll back the committed or prepared transactions,
> so it must roll back the read-only transaction (T1).

This is true, but it isn't the only situation where a read-only
transaction can be rolled back -- this can happen even without
two-phase commit involved. 

You can have a situation where two read/write transactions T2 and T3
conflict such that T2 appears to have executed first in the serial
order, but T3 commits before T2. If there's a read-only transaction T1
that takes its snapshot between when T3 and T2 commit, it can't be
allowed to read the data that the other two transactions modified: it'd
see the changes made by T3 but not T2, violating the serial order.

Given a choice, we'd prevent this by aborting one of the read/write
transactions. But if they've both already committed by the time the
read-only transaction T1 does its reads, we'd have to abort it instead.

(Note that this is still an improvement over two-phase locking, which
wouldn't allow any of the transactions to execute concurrently!)


What I was getting at in my previous mail was that there aren't any
situations where COMMIT will return a serialization failure for
a read-only transaction.

Dan

-- 
Dan R. K. Ports    UW CSE    http://drkp.net/




Re: [HACKERS] Question about SSI, subxacts, and aborted read-only xacts

2012-09-10 Thread Dan Ports
On Sat, Sep 08, 2012 at 11:34:56AM -0700, Jeff Davis wrote:
> If so, I think we need a documentation update. The serializable
> isolation level docs don't quite make it clear that serializability only
> applies to transactions that commit. It might not be obvious to a user
> that there's a difference between commit and abort for a RO transaction.
> I think that, in S2PL, serializability applies even to aborted
> transactions (though I haven't spent much time thinking about it), so
> users accustomed to other truly-serializable implementations might be

Yes, I agree that this is probably worth mentioning in the
documentation.

It might be worth noting that serializable mode will not cause
read-only transactions to fail to commit (as might be possible in some
optimistic concurrency control systems). However, it might require
other transactions to be aborted to ensure serializability. If the
user aborts the read-only transaction, that won't necessarily happen.

Figure 2 of the aforementioned paper is actually a nice example of
this. The read-only transaction T1 is allowed to commit, but as a
result T2 has to be aborted. If T1 had issued ABORT instead of
COMMIT, T2 would have been allowed to proceed.

Dan

-- 
Dan R. K. Ports    UW CSE    http://drkp.net/




Re: [HACKERS] a slightly stale comment

2012-03-07 Thread Dan Ports
On Wed, Mar 07, 2012 at 07:46:32AM +, Simon Riggs wrote:
> There is much wisdom there and much wisdom in leaving ancient warnings
> as we find them.

The comment is a wise and insightful statement -- about a totally
different system than we have today.

> Are these the words you object to?
>
> we don't need to
>   *  check commit time against the start time of this transaction
>   *  because 2ph locking protects us from doing the wrong thing.

Yes, that clearly isn't true, and the subsequent bit about catalog
accesses isn't right either -- they may not be serializable, but that
isn't the reason why.

I don't particularly object to the warning that the tests in this
routine are correct (although indeed the fact that they've changed
over the years does seem to belie it).

So I'm also in favor of just removing the comment entirely.

Dan

-- 
Dan R. K. Ports    MIT CSAIL    http://drkp.net/



[HACKERS] a slightly stale comment

2012-03-06 Thread Dan Ports
While mucking around in src/backend/utils/time/tqual.c today, I noticed
the following comment attached to HeapTupleSatisfiesNow:

 *  mao says 17 march 1993:  the tests in this routine are correct;
 *  if you think they're not, you're wrong, and you should think
 *  about it again.  i know, it happened to me.  we don't need to
 *  check commit time against the start time of this transaction
 *  because 2ph locking protects us from doing the wrong thing.
 *  if you mess around here, you'll break serializability.  the only
 *  problem with this code is that it does the wrong thing for system
 *  catalog updates, because the catalogs aren't subject to 2ph, so
 *  the serializability guarantees we provide don't extend to xacts
 *  that do catalog accesses.  this is unfortunate, but not critical.

Much as I hate to disturb a comment just before its 19th birthday, the
bit about two-phase locking and serializability hasn't been correct
since around 1999 (when MVCC was added). :-)

Dan

-- 
Dan R. K. Ports    MIT CSAIL    http://drkp.net/



Re: [HACKERS] SSI rw-conflicts and 2PC

2012-02-14 Thread Dan Ports
On Tue, Feb 14, 2012 at 10:04:15AM +0200, Heikki Linnakangas wrote:
> Perhaps it would be simpler to add the extra information to the commit
> records of the transactions that commit after the first transaction is
> prepared. In the commit record, you would include a list of prepared
> transactions that this transaction conflicted with. During recovery, you
> would collect those lists in memory, and use them at the end of recovery
> to flag the conflicts in prepared transactions that are still in
> prepared state.

Yeah, doing it that way might be a better strategy if we wanted to go
that route. I hadn't really considered it because I'm not that familiar
with the xlog code (plus, the commit record already contains a variable
length field, making it that much more difficult to add another).

Dan

-- 
Dan R. K. Ports    MIT CSAIL    http://drkp.net/



Re: [HACKERS] SSI rw-conflicts and 2PC

2012-02-14 Thread Dan Ports
On Tue, Feb 14, 2012 at 09:27:58AM -0600, Kevin Grittner wrote:
> Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> wrote:
>> On 14.02.2012 04:57, Dan Ports wrote:
>>> The easiest answer would be to just treat every prepared
>>> transaction found during recovery as though it had a conflict in
>>> and out. This is roughly a one-line change, and it's certainly
>>> safe.
>
> Dan, could you post such a patch, please?

Sure. See attached.

> Should we add anything to the docs to warn people that if they crash
> with serializable prepared transactions pending, they will see this
> behavior until the prepared transactions are either committed or
> rolled back, either by the transaction manager or through manual
> intervention?

Hmm, it occurs to me that if we have to abort a transaction due to a
serialization failure involving a prepared transaction, we might want
to include the prepared transaction's GID in the errdetail.

This seems like it'd be especially useful for prepared transactions. We
can generally pick the transaction to abort to ensure the safe retry
property -- if that transaction is immediately retried, it won't
fail with the same conflict -- but we can't always guarantee that when
prepared transactions are involved. And prepared transactions already
have a convenient, user-visible ID we can report.

Dan

-- 
Dan R. K. Ports    MIT CSAIL    http://drkp.net/
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index b75b73a..b102e19 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -4733,14 +4733,11 @@ AtPrepare_PredicateLocks(void)
 	xactRecord->flags = MySerializableXact->flags;
 
 	/*
-	 * Tweak the flags. Since we're not going to output the inConflicts and
-	 * outConflicts lists, if they're non-empty we'll represent that by
-	 * setting the appropriate summary conflict flags.
+	 * Note that we don't include the list of conflicts in or out in
+	 * the statefile, because new conflicts can be added even after the
+	 * transaction prepares. We'll just make a conservative assumption
+	 * during recovery instead.
 	 */
-	if (!SHMQueueEmpty(&MySerializableXact->inConflicts))
-		xactRecord->flags |= SXACT_FLAG_SUMMARY_CONFLICT_IN;
-	if (!SHMQueueEmpty(&MySerializableXact->outConflicts))
-		xactRecord->flags |= SXACT_FLAG_SUMMARY_CONFLICT_OUT;
 
 	RegisterTwoPhaseRecord(TWOPHASE_RM_PREDICATELOCK_ID, 0,
 						   &record, sizeof(record));
@@ -4875,15 +4872,6 @@ predicatelock_twophase_recover(TransactionId xid, uint16 info,
 
 		sxact->SeqNo.lastCommitBeforeSnapshot = RecoverySerCommitSeqNo;
 
-
-		/*
-		 * We don't need the details of a prepared transaction's conflicts,
-		 * just whether it had conflicts in or out (which we get from the
-		 * flags)
-		 */
-		SHMQueueInit(&(sxact->outConflicts));
-		SHMQueueInit(&(sxact->inConflicts));
-
 		/*
 		 * Don't need to track this; no transactions running at the time the
 		 * recovered xact started are still active, except possibly other
@@ -4905,6 +4893,17 @@ predicatelock_twophase_recover(TransactionId xid, uint16 info,
 							   (MaxBackends + max_prepared_xacts));
 		}
 
+		/*
+		 * We don't know whether the transaction had any conflicts or
+		 * not, so we'll conservatively assume that it had both a
+		 * conflict in and a conflict out, and represent that with the
+		 * summary conflict flags.
+		 */
+		SHMQueueInit(&(sxact->outConflicts));
+		SHMQueueInit(&(sxact->inConflicts));
+		sxact->flags |= SXACT_FLAG_SUMMARY_CONFLICT_IN;
+		sxact->flags |= SXACT_FLAG_SUMMARY_CONFLICT_OUT;
+
 		/* Register the transaction's xid */
 		sxidtag.xid = xid;
 		sxid = (SERIALIZABLEXID *) hash_search(SerializableXidHash,



[HACKERS] SSI rw-conflicts and 2PC

2012-02-13 Thread Dan Ports
Looking over the SSI 2PC code recently, I noticed that I overlooked a
case that could lead to non-serializable behavior after a crash.

When we PREPARE a serializable transaction, we store part of the
SERIALIZABLEXACT in the statefile (in addition to the list of SIREAD
locks). One of the pieces of information we record is whether the
transaction had any conflicts in or out. The problem is that that can
change if a new conflict occurs after the transaction has prepared.

Here's an example of the problem (based on the receipt-report test):

-- Setup
CREATE TABLE ctl (k text NOT NULL PRIMARY KEY, deposit_date date NOT NULL);
INSERT INTO ctl VALUES ('receipt', DATE '2008-12-22');
CREATE TABLE receipt (receipt_no int NOT NULL PRIMARY KEY, deposit_date date NOT NULL, amount numeric(13,2));

-- T2
BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT INTO receipt VALUES (3, (SELECT deposit_date FROM ctl WHERE k = 'receipt'), 4.00);
PREPARE TRANSACTION 't2';

-- T3
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE ctl SET deposit_date = DATE '2008-12-23' WHERE k = 'receipt';
COMMIT;

-- T1
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM ctl WHERE k = 'receipt';
SELECT * FROM receipt WHERE deposit_date = DATE '2008-12-22';
COMMIT;

Running this sequence of transactions normally, T1 will be rolled back
because of the pattern of conflicts T1 -> T2 -> T3, as we'd expect. This
should still be true even if we restart the database before executing
the last transaction -- but it's not. The problem is that, when T2
prepared, it had no conflicts, so we recorded that in the statefile.
The T2 -> T3 conflict happened later, so we didn't know about it during
recovery.

I discussed this a bit with Kevin and we agreed that this is important
to fix, since it's a false negative that violates serializability. The
question is how to fix it. There are a couple of options...

The easiest answer would be to just treat every prepared transaction
found during recovery as though it had a conflict in and out. This is
roughly a one-line change, and it's certainly safe. But the downside is
that this is pretty restrictive: after recovery, we'd have to abort any
serializable transaction that tries to read anything that a prepared
transaction wrote, or modify anything that it read, until that
transaction is either committed or rolled back.

To do better than that, we want to know accurately whether the prepared
transaction had a conflict with a transaction that prepared or
committed before the crash. We could do this if we had a way to append
a record to the 2PC statefile of an already-prepared transaction --
then we'd just add a new record indicating the conflict. Of course, we
don't have a way to do that. It'd be tricky to add support for this,
since it has to be crash-safe, so the question is whether the improved
precision justifies the complexity it would require.

A third option is to observe that the only conflicts *in* that matter
from a recovered prepared transaction are from other prepared
transactions. So we could have prepared transactions include in their
statefile the xids of any prepared transactions they conflicted with
at prepare time, and match them up during recovery to reconstruct the
graph. This is a middle ground between the other two options. It
doesn't require modifying the statefile after prepare. However, conflicts
*out* to non-prepared transactions do matter, and this doesn't record
those, so we'd have to do the conservative thing -- which means that
after recovery, no one can read anything a prepared transaction wrote.

I thought I'd throw these options out there to see which ones people
think are reasonable (or any better ideas). Of the three, I think the
first (simplest) solution is the only one we could plausibly backpatch
to 9.1. But if the extra aborts after recovery seem too expensive, we
may want to consider one of the other options for future releases.

Dan

-- 
Dan R. K. Ports    MIT CSAIL    http://drkp.net/



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-02-11 Thread Dan Ports
On Fri, Feb 10, 2012 at 01:59:18PM -0500, Robert Haas wrote:
> I guess I'm not particularly excited by the idea of trying to make
> TRUNCATE MVCC-safe.  I notice that the example involves the REPEATABLE
> READ isolation level, which is already known to be busted in a variety
> of ways; that's why we now have SERIALIZABLE, and why most people use
> READ COMMITTED.  Are there examples of this behavior at other
> isolation levels?

Marti's example works for SERIALIZABLE isolation too. In general, when
DDL operations weren't previously MVCC-safe under REPEATABLE READ, we
didn't change that in SERIALIZABLE.

There's some SSI code for TRUNCATE TABLE that tries to do something
reasonable, and it catches some (more subtle) anomalies involving
concurrent truncates -- but it can only do so much when TRUNCATE itself
isn't MVCC-safe. I expect that the combination of that code and this
patch would ensure full serializability for TRUNCATE operations.

Dan

-- 
Dan R. K. Ports    MIT CSAIL    http://drkp.net/



[HACKERS] patch: fix SSI finished list corruption

2012-01-06 Thread Dan Ports
There's a corner case in the SSI cleanup code that isn't handled
correctly. It can arise when running workloads that are composed
mostly (but not 100%) of READ ONLY transactions, and can corrupt the
finished SERIALIZABLEXACT list, potentially causing a segfault. The
attached patch fixes it.

Specifically, when the only remaining active transactions are READ
ONLY, we do a partial cleanup of committed transactions because
certain types of conflicts aren't possible anymore. For committed r/w
transactions, we release the SIREAD locks but keep the
SERIALIZABLEXACT. However, for committed r/o transactions, we can go
further and release the SERIALIZABLEXACT too. The problem was with the
latter case: we were returning the SERIALIZABLEXACT to the free list
without removing it from the finished list.

The only real change in the patch is the SHMQueueDelete line, but I
also reworked some of the surrounding code to make it obvious that r/o
and r/w transactions are handled differently -- the existing code felt
a bit too clever.

Dan

-- 
Dan R. K. Ports    MIT CSAIL    http://drkp.net/
From 39f8462332f998d7363058adabac412c7654befe Mon Sep 17 00:00:00 2001
From: Dan R. K. Ports <d...@csail.mit.edu>
Date: Thu, 29 Dec 2011 23:11:35 -0500
Subject: [PATCH] Read-only SERIALIZABLEXACTs are completely released when
 doing partial cleanup, so remove them from the finished
 list. This prevents the finished list from being corrupted.
 Also make it more clear that read-only transactions are
 treated differently here.

---
 src/backend/storage/lmgr/predicate.c |   26 +++---
 1 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index aefa698..c0b3cb4 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -3531,10 +3531,29 @@ ClearOldPredicateLocks(void)
 		else if (finishedSxact->commitSeqNo > PredXact->HavePartialClearedThrough
 				 && finishedSxact->commitSeqNo <= PredXact->CanPartialClearThrough)
 		{
+			/*
+			 * Any active transactions that took their snapshot before
+			 * this transaction committed are read-only, so we can
+			 * clear part of its state.
+			 */
 			LWLockRelease(SerializableXactHashLock);
-			ReleaseOneSerializableXact(finishedSxact,
-									   !SxactIsReadOnly(finishedSxact),
-									   false);
+
+			if (SxactIsReadOnly(finishedSxact))
+			{
+				/* A read-only transaction can be removed entirely */
+				SHMQueueDelete(&(finishedSxact->finishedLink));
+				ReleaseOneSerializableXact(finishedSxact, false, false);
+			}
+			else
+			{
+				/*
+				 * A read-write transaction can only be partially
+				 * cleared. We need to keep the SERIALIZABLEXACT but
+				 * can release the SIREAD locks and conflicts in.
+				 */
+				ReleaseOneSerializableXact(finishedSxact, true, false);
+			}
+
 			PredXact->HavePartialClearedThrough = finishedSxact->commitSeqNo;
 			LWLockAcquire(SerializableXactHashLock, LW_SHARED);
 		}
@@ -3640,6 +3659,7 @@ ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
 
 	Assert(sxact != NULL);
 	Assert(SxactIsRolledBack(sxact) || SxactIsCommitted(sxact));
+	Assert(partial || !SxactIsOnFinishedList(sxact));
 	Assert(LWLockHeldByMe(SerializableFinishedListLock));
 
 	/*
-- 
1.7.8.2




[HACKERS] autovacuum and default_transaction_isolation

2011-11-29 Thread Dan Ports
I was doing some tests recently with default_transaction_isolation set
to 'serializable' in postgresql.conf when I noticed pg_locks filling up
with SIReadLocks rather more quickly than I expected.

After some investigation, I found that an autovacuum worker was
starting a transaction at the default isolation level. While using a
serializable transaction doesn't affect its behavior (because it's not
using an MVCC snapshot), having a serializable transaction open prevents
other concurrent serializable transactions and their predicate locks
from being cleaned up. Since VACUUM on a large table can take a long
time, this could affect many concurrent transactions.

My one-liner fix for this was to set
  DefaultXactIsoLevel = XACT_READ_COMMITTED;
in AutoVacWorkerMain.

Dan

-- 
Dan R. K. Ports    MIT CSAIL    http://drkp.net/



Re: [HACKERS] autovacuum and default_transaction_isolation

2011-11-29 Thread Dan Ports
On Tue, Nov 29, 2011 at 07:04:23PM -0500, Tom Lane wrote:
> Hmm.  Shouldn't we make the autovac launcher use READ COMMITTED, too?

Yeah, probably. That one doesn't seem so important because its
transactions aren't long-running (IIRC, it only starts a transaction to
scan pg_database). But it wouldn't hurt to keep it from pointlessly
registering a serializable transaction.

Dan

-- 
Dan R. K. Ports    MIT CSAIL    http://drkp.net/



Re: [HACKERS] FlexLocks

2011-11-15 Thread Dan Ports
On Tue, Nov 15, 2011 at 10:55:49AM -0600, Kevin Grittner wrote:
> And I would be
> surprised if some creative thinking didn't yield a far better FL
> scheme for SSI than we can manage with existing LW locks.

One place I could see it being useful is for
SerializableFinishedListLock, which protects the queue of committed
sxacts that can't yet be cleaned up. When committing a transaction, it
gets added to the list, and then scans the queue to find and clean up
any sxacts that are no longer needed. If there's contention, we don't
need multiple backends doing that scan; it's enough for one to complete
it on everybody's behalf.

I haven't thought it through, but it may also help with the other
contention bottleneck on that lock: that every transaction needs to add
itself to the cleanup list when it commits.


Mostly unrelatedly, the other thing that's looking like it would be really
useful would be some support for atomic integer operations. This would
be useful for some SSI things like writableSxactCount, and some things
elsewhere like the strong lock count in the regular lock manager.
I've been toying with the idea of creating an AtomicInteger type with a
few operations like increment-and-get, compare-and-set, swap, etc. This
would be implemented using the appropriate hardware operations on
platforms that support them (x86_64, perhaps others) and fall back on a
spinlock implementation on other platforms. I'll probably give it a try
and see what it looks like, but if anyone has any thoughts, let me know.

Dan

-- 
Dan R. K. Ports    MIT CSAIL    http://drkp.net/



Re: [HACKERS] SSI implementation question

2011-10-20 Thread Dan Ports
On Thu, Oct 20, 2011 at 07:33:59AM -0500, Kevin Grittner wrote:
 Dan Ports  wrote:
  The part that's harder is building the list of potential conflicts
  that's used to identify safe snapshots for r/o transactions. That
  (currently) has to happen atomically with taking the snapshot.
  
 That's not obvious to me; could you elaborate on the reasons?  If the
 commit sequence number is incremented under cover of an existing
 ProcArrayLock, and the current value is assigned to a snapshot under
 cover of same, acquiring SerializableXactHashLock after we get the
 snapshot seems to work for SxactGlobalXmin and WritableSxactCount,
 AFAICS.

Well, whenever a r/w transaction commits or aborts, we need to check
whether that caused any concurrent r/o transactions to have a
known-safe or unsafe snapshot. The way that happens now is that, when a
r/o transaction starts, it scans the list of r/w transactions and adds
a pointer to itself in their sxact->possibleUnsafeConflicts. When one
of them commits, it scans the list of possible conflicts and does the
appropriate thing.

That's not ideal because:

  - it requires modifying another transaction's sxact when registering a
serializable transaction, so that means taking
SerializableXactHashLock exclusive.

  - the set of concurrent transactions used to identify the possible
conflicts needs to match the one used to build the snapshot.
Otherwise, a transaction might commit between when the snapshot is
taken and when we find possible conflicts. (Holding
SerializableXactHashLock prevents this.)

 Yeah, I don't see how we can avoid taking a LW lock to get a
 serializable transaction which needs a SERIALIZABLEXACT set up, but
 it might be possible and worthwhile to split the uses of
 SerializableXactHashLock so that it's not such a hotspot.

Oh, right, one other problem is that the sxact free list is also
protected by SerializableXactHashLock, so allocating from it requires
locking. That one could be fixed by protecting it with its own lock, or
(better yet) eventually moving to a lock-free implementation.

In general, the contention problem is that SerializableXactHashLock
basically protects all SSI state except the predicate locks themselves
(notably, the dependency graph). This isn't partitioned at all, so
looking at or modifying a single sxact requires locking the whole
graph. I'd like to replace this with something finer-grained.

Dan



Re: [HACKERS] SSI implementation question

2011-10-19 Thread Dan Ports
On Wed, Oct 19, 2011 at 12:56:58PM -0400, Tom Lane wrote:
 Is it really necessary for GetSerializableTransactionSnapshotInt to
 acquire an empty SERIALIZABLEXACT before it acquires a snapshot?
 If so, why? 

*That* isn't necessary, no. It is necessary, however, to acquire the
snapshot while SerializableXactHashLock is held. There are a couple
reasons for this: the sxact's lastCommitBeforeSnapshot needs to match
the snapshot, SxactGlobalXmin needs to be set to the correct value,
etc. That's why the call to GetSnapshotData happens from where it does.

 The proposed synchronized-snapshots feature will mean
 that the allegedly-new snapshot actually was taken some time before,
 so it seems to me that either this is not necessary or we cannot use
 a synchronized snapshot in a serializable xact.

There are definitely potential problems here. If the original snapshot
doesn't belong to an active serializable transaction, we may have
discarded the state we need to do SSI, e.g. we might have already
cleaned up SIREAD locks from concurrent committed transactions.

I assume the answer here is going to have to be to either refuse to
start a serializable transaction if that's the case, or make saving a
snapshot inhibit some of the SSI cleanup.

 In the same vein, why is it necessary to be holding
 SerializableXactHashLock (exclusively, yet) while acquiring the
 snapshot?  That seems rather bad from a concurrency standpoint, and
 again it's going to be pretty meaningless if we're just installing a
 pre-existing snapshot.

Yes, it's bad. I'm working on a design to address
SerializableXactHashLock contention, but there needs to be some locking
here for the reasons I mentioned above. I think the best we can do here
is to acquire a lock in shared mode when registering a serializable
transaction and in exclusive mode when committing. (Which is what you'd
expect, I guess; it's the same story as ProcArrayLock, and for most of
the same reasons.) Obviously, we'll also want to minimize the amount of
work we're doing while holding that lock.

Dan



Re: [HACKERS] SSI implementation question

2011-10-19 Thread Dan Ports
On Wed, Oct 19, 2011 at 04:36:41PM -0400, Tom Lane wrote:
 No, the intention of the synchronized-snapshots feature is just to be
 able to start multiple transactions using exactly the same snapshot.
 They're independent after that.  The aspect of it that is worrying me
 is that if xact A starts, gets a snapshot and publishes it, and then
 xact B starts and wants to adopt that snapshot, then
 
 (2) as things stand, xact A need not be running in serializable mode ---
 if B is serializable, does *that* break any assumptions?

[taking these in opposite order]

Yes, I think that's going to be a problem. The obvious case where it's
clearly not going to work is if A is older than the oldest active
serializable xact (i.e. SxactGlobalXmin would have to move backwards).
It's probably possible to make it work when that's not the case, but I
think it's better to require A to be serializable -- if nothing else,
it's a far simpler rule to document!

There is another case that could be problematic, if A was READ ONLY,
and B isn't. It sounds to me like that would also be a reasonable thing
to forbid.

 (1) other transactions may have started or ended meanwhile; does that
 break any of SSI's assumptions?

Mostly, no, if A is still running. There's one case that needs to be
handled a bit carefully, but shouldn't be a problem: if A was
SERIALIZABLE READ ONLY, and its snapshot was found to be safe, then
it's actually running (safely) at REPEATABLE READ. If we start a new
read-only transaction at the same snapshot, we need to make it run at
REPEATABLE READ if it requests SERIALIZABLE.

 We already have to have an interlock to ensure that GlobalXmin doesn't
 go backwards, by means of requiring A to still be running at the instant
 B adopts the snapshot and sets its MyProc->xmin accordingly.  However,
 there is not currently anything that guarantees that A is still running
 by the time B does GetSerializableTransactionSnapshotInt, shortly later.
 So if your answer to question (1) involves an assumption that A is still
 running, we're going to have to figure out how to arrange that without
 deadlocking on ProcArrayLock vs SerializableXactHashLock.

Yep, I think we're going to have to do that. I haven't had a chance to
look at the synchronized snapshots patch yet, so I can't (yet) offer
any suggestions about how to implement it.

 Which might
 be another good reason for changing predicate.c so that we don't hold
 the latter while taking a snapshot ...

It'd be great if we could do that, but I don't really see it being
possible...

Dan



Re: [HACKERS] SSI implementation question

2011-10-19 Thread Dan Ports
On Wed, Oct 19, 2011 at 05:04:52PM -0400, Tom Lane wrote:
 I wonder whether it would be prudent to set the synchronized-snapshots
 patch aside until you've finished that work (assuming you're actively
 working on it).  It's evidently going to require some nontrivial changes
 in predicate.c, and I don't think the feature should take precedence
 over SSI performance improvement.

I wouldn't hold the patch up on my account. Improving the SSI locking
situation looks to be a fairly substantial project. I've been drawing
up a plan to fix it, but I'm also travelling for most of the next two
weeks and probably won't be able to do any serious hacking on it until
I'm back to the office.

Dan



Re: [HACKERS] SSI implementation question

2011-10-19 Thread Dan Ports
I think it would be fairly sensible to push some of this into
ProcArray, actually. The commit sequence numbers just involve assigning/
incrementing a global counter when taking a snapshot and finishing a
transaction -- that's not too much work, doesn't require any additional
locking beyond ProcArrayLock, and isn't too tied to SSI. (I could
imagine it being useful for other purposes, though I'm not going to
make that argument too seriously without actually having one in mind.)
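As a sketch of how little work the counter itself involves (with a pthread mutex standing in for ProcArrayLock; all names here are hypothetical, not from the PostgreSQL source):

```c
#include <stdint.h>
#include <pthread.h>

/* Stand-in for ProcArrayLock. */
static pthread_mutex_t proc_array_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t last_commit_seq_no = 0;

/* At transaction finish: bump the counter under the lock that's
 * already held for other end-of-transaction bookkeeping. */
static uint64_t
assign_commit_seq_no(void)
{
    uint64_t seq;

    pthread_mutex_lock(&proc_array_lock);
    seq = ++last_commit_seq_no;
    pthread_mutex_unlock(&proc_array_lock);
    return seq;
}

/* When taking a snapshot: record the latest commit sequence number,
 * under the same lock so it's consistent with the snapshot contents. */
static uint64_t
snapshot_commit_seq_no(void)
{
    uint64_t seq;

    pthread_mutex_lock(&proc_array_lock);
    seq = last_commit_seq_no;
    pthread_mutex_unlock(&proc_array_lock);
    return seq;
}
```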

SxactGlobalXmin and WritableSxactCount are obviously more SSI-specific,
but I think we can come up with something reasonable to do with them.

The part that's harder is building the list of potential conflicts
that's used to identify safe snapshots for r/o transactions. That
(currently) has to happen atomically with taking the snapshot. We'll
probably have to do this in some significantly different way, but I
haven't quite worked out what it is yet.

On the bright side, if we can address these three issues, we shouldn't
need to take SerializableXactHashLock at all when starting a
transaction. (Realistically, we might have to take it or some other
lock shared to handle one of them -- but I really want starting a
serializable xact to not take any exclusive locks.)

Dan



Re: [HACKERS] Update on documentation builds on OSX w/ macports

2011-10-19 Thread Dan Ports
On Thu, Oct 20, 2011 at 02:02:09AM +0200, Florian Pflug wrote:
 I've patched the ports for openjade, iso8879 and docbook-dsssl,
 and added a new port for docbook-sgml-4.2. These patches are sitting
 in the macports trac now, waiting to be applied.

I'll try to take a look at them in the next couple days (with my
MacPorts hat on), unless someone beats me to it.

Dan



Re: [HACKERS] Overhead cost of Serializable Snapshot Isolation

2011-10-10 Thread Dan Ports
On Mon, Oct 10, 2011 at 02:25:59PM -0400, Greg Sabino Mullane wrote:
 I agree it is better versus SELECT FOR, but what about repeatable read versus
 the new serializable? How much overhead is there in the 'monitoring of
 read/write dependencies'? This is my only concern at the moment. Are we 
 talking insignificant overhead? Minor? Is it measurable? Hard to say without 
 knowing the number of txns, number of locks, etc.?

I'd expect that in most cases the main cost is not going to be overhead
from the lock manager but rather the cost of having transactions
aborted due to conflicts. (But the rollback rate is extremely
workload-dependent.)

We've seen CPU overhead from the lock manager to be a few percent on a
CPU-bound workload (in-memory pgbench). Also, if you're using a system
with many cores and a similar workload, SerializableXactHashLock might
become a scalability bottleneck.

Dan



Re: [HACKERS] Overhead cost of Serializable Snapshot Isolation

2011-10-10 Thread Dan Ports
On Mon, Oct 10, 2011 at 02:59:04PM -0500, Kevin Grittner wrote:
 I do have some concern about whether the performance improvements
 from reduced LW locking contention elsewhere in the code may (in
 whack-a-mole fashion) cause the percentages to go higher in SSI. 
 The biggest performance issues in some of the SSI benchmarks were on
 LW lock contention, so those may become more noticeable as other
 contention is reduced.  I've been trying to follow along on the
 threads regarding Robert's work in that area, with hopes of applying
 some of the same techniques to SSI, but it's not clear whether I'll
 have time to work on that for the 9.2 release.  (It's actually
 looking improbable at this point.)

I spent some time thinking about this a while back, but didn't have
time to get very far. The problem isn't contention in the predicate
lock manager (which is partitioned) but the single lock protecting the
active SerializableXact state.

It would probably help things a great deal if we could make that lock
more fine-grained. However, it's tricky to do this without deadlocking
because the serialization failure checks need to examine a node's
neighbors in the dependency graph.

Dan



Re: [HACKERS] Overhead cost of Serializable Snapshot Isolation

2011-10-10 Thread Dan Ports
On Mon, Oct 10, 2011 at 04:10:18PM -0500, Kevin Grittner wrote:
 Did you ever see much contention on
 SerializablePredicateLockListLock, or was it just
 SerializableXactHashLock?  I think the former might be able to use
 the non-blocking techniques, but I fear the main issue is with the
 latter, which seems like a harder problem.

No, not that I recall -- if SerializablePredicateLockListLock was on
the list of contended locks, it was pretty far down.

SerializableXactHashLock was the main bottleneck, and
SerializableFinishedListLock was a lesser but still significant
one.

Dan



Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Dan Ports
On Thu, Jul 21, 2011 at 02:31:15PM -0400, Robert Haas wrote:
 1. Machines with strong memory ordering.  On this category of machines
 (which include x86), the CPU basically does not perform loads or
 stores out of order.  On some of these machines, it is apparently
 possible for there to be some ordering of stores relative to loads,
 but if the program stores two values or loads two values, those
 operations will be performed in the same order they appear in the
 program. 

This is all correct, but...

 The main thing you need to make your code work reliably on
 these machines is a primitive that keeps the compiler from reordering
 your code during optimization. 

If you're suggesting that hardware memory barriers aren't going to be
needed to implement lock-free code on x86, that isn't true. Because a
read can be reordered with respect to a write to a different memory
location, you can still have problems. So you do still need memory
barriers, just fewer of them.

Dekker's algorithm is the classic example: two threads each set a flag
and then check whether the other thread's flag is set. In any
sequential execution, at least one should see the other's flag set, but
on the x86 that doesn't always happen. One thread's read might be
reordered before its write.
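A runnable sketch of that flag pattern (hypothetical names; `__sync_synchronize()` is GCC's full memory barrier, an mfence on x86). Without the barrier, x86 may reorder each thread's load before its own store, so both threads can read 0; with it, at least one thread must see the other's flag set:

```c
#include <pthread.h>

static volatile int flag[2];
static int saw_other[2];

static void *
worker(void *arg)
{
    int me = *(int *) arg;

    flag[me] = 1;               /* store my flag... */
    __sync_synchronize();       /* ...full barrier: keep the load below
                                 * from being reordered before the store */
    saw_other[me] = flag[1 - me];       /* ...then read the other's flag */
    return NULL;
}

/* Run the two threads once; returns 1 if at least one thread
 * observed the other's flag set. */
static int
run_once(void)
{
    pthread_t t[2];
    int ids[2] = {0, 1};

    flag[0] = flag[1] = 0;
    saw_other[0] = saw_other[1] = 0;
    pthread_create(&t[0], NULL, worker, &ids[0]);
    pthread_create(&t[1], NULL, worker, &ids[1]);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    return saw_other[0] || saw_other[1];
}
```

With the barrier in place, the "both read 0" outcome would require each load to precede the other thread's store, which is impossible once the stores are globally ordered.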

 2. Machines with weak memory ordering.  On this category of machines
 (which includes PowerPC, Dec Alpha, and maybe some others), the CPU
 reorders memory accesses arbitrarily unless you explicitly issue
 instructions that enforce synchronization.  You still need to keep the
 compiler from moving things around, too.  Alpha is particularly
 pernicious, because something like a->b can fetch the pointed-to value
 before loading the pointer itself. This is otherwise known as "we
 have basically no cache coherency circuits on this chip at all." On
 these machines, you need to issue an explicit memory barrier
 instruction at each sequence point, or just acquire and release a
 spinlock.

The Alpha is pretty much unique (thankfully!) in allowing dependent
reads to be reordered. That makes it even weaker than the typical
weak-ordering machine. Since reading a pointer and then dereferencing
it is a pretty reasonable thing to do regularly in RCU code, you
probably don't want to emit barriers in between on architectures where
it's not actually necessary. That argues for another operation that's
defined to be a barrier (mb) on the Alpha but a no-op elsewhere.
Certainly the Linux kernel found it useful to do so
(read_barrier_depends).
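Such an operation can be sketched as a macro, modeled loosely on the kernel's read_barrier_depends(); the helper function name is illustrative:

```c
/* Data-dependency read barrier: a real "mb" instruction only on Alpha,
 * compiled away entirely everywhere else. */
#if defined(__alpha__)
#define read_barrier_depends()  __asm__ __volatile__("mb" : : : "memory")
#else
#define read_barrier_depends()  ((void) 0)
#endif

/* Typical RCU-style use: load a shared pointer, then dereference it.
 * Only Alpha can reorder the dependent read before the pointer load,
 * so only Alpha pays for the barrier. */
static int
read_through_pointer(int **pp)
{
    int *p = *pp;               /* load the pointer */

    read_barrier_depends();     /* needed on Alpha before the dependent read */
    return *p;                  /* dereference it */
}
```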

Alternatively, one might question how important it is to support the
Alpha these days...

Dan



Re: [HACKERS] SSI atomic commit

2011-07-07 Thread Dan Ports
On Thu, Jul 07, 2011 at 04:48:55PM -0400, Tom Lane wrote:
 Seems to me there's a more fundamental reason not to do that, which is
 that once you've done PREPARE it is no longer legitimate to decide to
 roll back the transaction to get out of a dangerous structure --- ie,
 you have to target one of the other xacts involved instead.  Shouldn't
 the assignment of a prepareSeqNo correspond to the point where the xact
 is no longer a target for SSI rollback?

That part is already accomplished by setting SXACT_FLAG_PREPARED (and
choosing a new victim if we think we want to abort a transaction with
that flag set).

prepareSeqNo is being used as a lower bound on the transaction's commit
sequence number. It's currently set at the same time as the PREPARED
flag, but it doesn't have to be.

Dan



Re: [HACKERS] SSI atomic commit

2011-07-07 Thread Dan Ports
We should also apply the attached patch, which corrects a minor issue
with the conditions for flagging transactions that could potentially
make a snapshot unsafe.

There's a small window wherein a transaction is committed but not yet
on the finished list, and we shouldn't flag it as a potential conflict
if so. We can also skip adding a doomed transaction to the list of
possible conflicts because we know it won't commit.

This is not really a related issue, but Kevin and I found it while
looking into this issue, and it was included in the patch we sent out.

Dan

*** a/src/backend/storage/lmgr/predicate.c
--- b/src/backend/storage/lmgr/predicate.c
***
*** 1669,1676  RegisterSerializableTransactionInt(Snapshot snapshot)
  			 othersxact != NULL;
  			 othersxact = NextPredXact(othersxact))
  		{
! 			if (!SxactIsOnFinishedList(othersxact)
! 				&& !SxactIsReadOnly(othersxact))
  			{
  SetPossibleUnsafeConflict(sxact, othersxact);
  			}
--- 1676,1684 
  			 othersxact != NULL;
  			 othersxact = NextPredXact(othersxact))
  		{
! 			if (!SxactIsCommitted(othersxact)
! 				&& !SxactIsDoomed(othersxact)
! 				&& !SxactIsReadOnly(othersxact))
  			{
  SetPossibleUnsafeConflict(sxact, othersxact);
  			}



Re: [HACKERS] SSI atomic commit

2011-07-05 Thread Dan Ports
On Tue, Jul 05, 2011 at 01:15:13PM -0500, Kevin Grittner wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
  
  Hmm, I think it would be simpler to decide that instead of 
  SerializableXactHashLock, you must hold ProcArrayLock to access 
  LastSxactCommitSeqNo, and move the assignment of commitSeqNo to 
  ProcArrayEndTransaction(). It's probably easiest to move 
  LastSxactCommitSeqno to ShmemVariableCache too. There's a few
  places that would then need to acquire ProcArrayLock to read 
  LastSxactCommitSeqno, but I feel it might still be much simpler
  that way.
  
 We considered that.  I think the biggest problem was that when there
 is no XID it wouldn't be covered by the lock on assignment.

One other issue is that after the sequence number is assigned, it still
needs to be stored in MySerializableXact->commitSeqNo. Modifying that
does require taking SerializableXactHashLock.

With the proposed patch, assigning the next commitSeqNo and storing it
in MySerializableXact happen atomically. That makes it possible to say
that a transaction that has a commitSeqNo must have committed before
one that doesn't. If the two steps are separated, that isn't true: two
transactions might get their commitSeqNos in one order and make them
visible in the other. We should be able to deal with that, but it will
make some of the commit ordering checks more complicated.

Dan



Re: [HACKERS] SSI 2PC coverage

2011-07-05 Thread Dan Ports
On Tue, Jul 05, 2011 at 09:14:30PM +0300, Heikki Linnakangas wrote:
 I think that needs some explanation, why only those SxactIsCommitted() 
 tests need to be replaced with SxactIsPrepared()?

Here is the specific problem this patch addresses:

If there's a dangerous structure T0 ---> T1 ---> T2, and T2 commits
first, we need to abort something. If T2 commits before both conflicts
appear, then it should be caught by
OnConflict_CheckForSerializationFailure. If both conflicts appear
before T2 commits, it should be caught by
PreCommit_CheckForSerializationFailure. But that is actually run before
T2 *prepares*.

So the problem occurs if T2 is prepared but not committed when the
second conflict is detected. OnConflict_CFSF deems that OK, because T2
hasn't committed yet. PreCommit_CFSF doesn't see a problem either,
because the conflict didn't exist when it ran (before the transaction
prepared).

This patch fixes that by having OnConflict_CFSF declare a serialization
failure in this circumstance if T2 is already prepared, not just if
it's committed.
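The rule the patch adds boils down to a predicate like the following (illustrative flags and names, not the real predicate.c code, which also has to verify the full dangerous structure exists):

```c
#include <stdbool.h>

#define FLAG_PREPARED   0x01
#define FLAG_COMMITTED  0x02

typedef struct
{
    unsigned flags;
} DemoSxact;

/*
 * When the second conflict edge of T0 ---> T1 ---> T2 appears, decide
 * whether the conflict-time check must raise a serialization failure.
 * After the fix, T2 being merely PREPARED is enough: its own precommit
 * check has already run and can no longer see this conflict.
 */
static bool
must_fail_on_conflict(const DemoSxact *t2)
{
    return (t2->flags & (FLAG_PREPARED | FLAG_COMMITTED)) != 0;
}
```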

Although it fixes the situation described above, I wasn't initially
confident that there weren't other problematic cases. I wrote the
attached test case to convince myself of that. Because it tests all
possible sequences of conflict/prepare/commit that should lead to
serialization failure, the fact that they do all fail (with this patch)
indicates that these are the only checks that need to be changed.

 What about the first 
 SxactIsCommitted() test in OnConflict_CheckForSerializationFailure(), 
 for instance?

That one only comes up if the SERIALIZABLEXACT for one of the
transactions involved has been freed, and the RWConflict that points to
it has been replaced by a flag. That only happens if the writer has
previously called ReleasePredicateLocks.

Dan



Re: [HACKERS] Repeated PredicateLockRelation calls during seqscan

2011-06-22 Thread Dan Ports
On Wed, Jun 22, 2011 at 12:07:04PM +0300, Heikki Linnakangas wrote:
 Hmm, I wonder if we should move this logic to heapam.c. The optimization 
 to acquire a relation lock straight away should apply to all heap scans, 
 not only those coming from ExecSeqScan. The distinction is academic at 
 the moment, because that's the only caller that uses a regular MVCC 
 snapshot, but it seems like a modularity violation for nodeSeqscan.c to 
 reach into the HeapScanDesc to set the flag and grab the whole-relation 
 lock, while heapam.c contains the PredicateLockTuple and 
 CheckForSerializableConflictOut() calls.

On modularity grounds, I think that's a good idea. The other
PredicateLock* calls live in the access methods: heapam, nbtree, and
indexam for the generic index support. heap_beginscan_internal seems
like a reasonable place, as long as we're OK with taking the lock even
if the scan is initialized but never called.

Note that this hadn't been a reasonable option until last week when we
added the check for non-MVCC snapshots, since there are lots of things
that use heap scans but SeqScan is the only (currently-existing) one we
want to lock.

I am rather uneasy about making changes here unless we can be
absolutely certain they're right...

Dan



Re: [HACKERS] Coding style point: const in function parameter declarations

2011-06-21 Thread Dan Ports
On Tue, Jun 21, 2011 at 06:51:20PM -0400, Tom Lane wrote:
 I find this to be poor style, and would like to see if there's any
 support for getting rid of the const keywords. 

I'm in favor of removing them too.

Dan



[HACKERS] Repeated PredicateLockRelation calls during seqscan

2011-06-21 Thread Dan Ports
I was looking at ExecSeqScan today and noticed that it invokes
PredicateLockRelation each time it's called, i.e. for each tuple
returned. Any reason we shouldn't skip that call if
rs_relpredicatelocked is already set, as in the attached patch?

That would save us a bit of overhead, since checking that flag is
cheaper than doing a hash lookup in the local predicate lock table
before bailing out.

Dan

diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index f356874..32a8f56 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -113,9 +113,13 @@ SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
 TupleTableSlot *
 ExecSeqScan(SeqScanState *node)
 {
-	PredicateLockRelation(node->ss_currentRelation,
-		  node->ss_currentScanDesc->rs_snapshot);
-	node->ss_currentScanDesc->rs_relpredicatelocked = true;
+	if (!node->ss_currentScanDesc->rs_relpredicatelocked)
+	{
+		PredicateLockRelation(node->ss_currentRelation,
+			  node->ss_currentScanDesc->rs_snapshot);
+		node->ss_currentScanDesc->rs_relpredicatelocked = true;
+	}
+
 	return ExecScan((ScanState *) node,
 	(ExecScanAccessMtd) SeqNext,
 	(ExecScanRecheckMtd) SeqRecheck);



Re: [HACKERS] pika buildfarm member failure on isolationCheck tests

2011-06-21 Thread Dan Ports
On Tue, Jun 21, 2011 at 03:01:48PM +0300, Heikki Linnakangas wrote:
 Thanks, committed.

Thanks.

 In the long term, I'm not sure this is the best way to handle this. It 
 feels a bit silly to set the flag, release the SerializableXactHashLock, 
 and reacquire it later to remove the transaction from the hash table. 
 Surely that could be done in some more straightforward way. But I don't 
 feel like fiddling with it this late in the release cycle.

Yes, I suspect it can be done better. The reason it's tricky is a lock
ordering issue; part of releasing a SerializableXact has to be done
while holding SerializableXactHashLock and part has to be done without
it (because it involves taking partition locks). Reworking the order of
these things might work, but would require some careful thought since
most of the code is shared with the non-abort cleanup paths. And yes,
it's definitely the time for that.

I've been meaning to take another look at this part of the code anyway,
with an eye towards performance. SerializableXactHashLock can be a
bottleneck on certain in-memory workloads.

  One of the prepared_xacts regression tests actually hits this bug.
  I removed the anomaly from the duplicate-gids test so that it fails in
  the intended way, and added a new test to check serialization failures
  with a prepared transaction.
 
 Hmm, I have run make installcheck with 
 default_transaction_isolation='serializable' earlier. I wonder why I 
 didn't see that.

The affected test was being run at SERIALIZABLE already, so that
wouldn't have made a difference. One reason I didn't notice an issue
when I looked at that test before before, is that it was intended to
fail anyway, just with a different error. I didn't think too hard about
which error would take precedence.

Dan



Re: [HACKERS] pika buildfarm member failure on isolationCheck tests

2011-06-21 Thread Dan Ports
On Wed, Jun 22, 2011 at 01:31:11AM -0400, Dan Ports wrote:
 Yes, I suspect it can be done better. The reason it's tricky is a lock
 ordering issue; part of releasing a SerializableXact has to be done
 while holding SerializableXactHashLock and part has to be done without
 it (because it involves taking partition locks). Reworking the order of
 these things might work, but would require some careful thought since
 most of the code is shared with the non-abort cleanup paths. And yes,
 it's definitely the time for that.

...by which I mean it's definitely *not* the time for that, of course.

Dan



Re: [HACKERS] pika buildfarm member failure on isolationCheck tests

2011-06-20 Thread Dan Ports
On Sun, Jun 19, 2011 at 09:10:02PM -0400, Robert Haas wrote:
 Is this an open item for 9.1beta3?

Yes. I've put it on the list.

The SxactGlobalXmin and its refcount were getting out of sync with the
actual transaction state. This is timing-dependent but I can reproduce it
fairly reliably under concurrent workloads.

It looks like the problem comes from the change a couple of days ago that
removed the SXACT_FLAG_ROLLED_BACK flag and changed the
SxactIsRolledBack checks to SxactIsDoomed. That's the correct thing to
do everywhere else, but gets us in trouble here. We shouldn't be
ignoring doomed transactions in SetNewSxactGlobalXmin, because they're
not eligible for cleanup until the associated backend aborts the
transaction and calls ReleasePredicateLocks.

However, it isn't as simple as just removing the SxactIsDoomed check
from SetNewSxactGlobalXmin. ReleasePredicateLocks calls
SetNewSxactGlobalXmin *before* it releases the aborted transaction's
SerializableXact (it pretty much has to, because we must drop and
reacquire SerializableXactHashLock in between). But we don't want the
aborted transaction included in the SxactGlobalXmin computation.

It seems like we do need that SXACT_FLAG_ROLLED_BACK after all, even
though it's only set for this brief interval. We need to be able to
distinguish a transaction that's just been marked for death (doomed)
from one that's already called ReleasePredicateLocks.

Dan



Re: [HACKERS] pika buildfarm member failure on isolationCheck tests

2011-06-20 Thread Dan Ports
While testing the fix for this one, I found another bug. Patches for
both are attached.

The first patch addresses this bug by re-adding SXACT_FLAG_ROLLED_BACK,
in a more limited form than its previous incarnation.

We need to be able to distinguish transactions that have already
called ReleasePredicateLocks and are thus eligible for cleanup from
those that have been merely marked for abort by other
backends. Transactions that are ROLLED_BACK are excluded from
SxactGlobalXmin calculations, but those that are merely DOOMED need to
be included.

Also update a couple of assertions to ensure we only try to clean up
ROLLED_BACK transactions.


The second patch fixes a bug in PreCommit_CheckForSerializationFailure.
This function checks whether there's a dangerous structure of the form
 far --- near --- me
where neither the far nor the near transaction has committed. If so,
it aborts the near transaction by marking it as DOOMED. However, that
transaction might already be PREPARED. We need to check whether that's
the case and, if so, abort the transaction that's trying to commit
instead.

One of the prepared_xacts regression tests actually hits this bug.
I removed the anomaly from the duplicate-gids test so that it fails in
the intended way, and added a new test to check serialization failures
with a prepared transaction.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 6c55211..3678878 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -246,7 +246,6 @@
 
 #define SxactIsCommitted(sxact) (((sxact)->flags & SXACT_FLAG_COMMITTED) != 0)
 #define SxactIsPrepared(sxact) (((sxact)->flags & SXACT_FLAG_PREPARED) != 0)
-#define SxactIsRolledBack(sxact) (((sxact)->flags & SXACT_FLAG_ROLLED_BACK) != 0)
 #define SxactIsDoomed(sxact) (((sxact)->flags & SXACT_FLAG_DOOMED) != 0)
 #define SxactIsReadOnly(sxact) (((sxact)->flags & SXACT_FLAG_READ_ONLY) != 0)
 #define SxactHasSummaryConflictIn(sxact) (((sxact)->flags & SXACT_FLAG_SUMMARY_CONFLICT_IN) != 0)
@@ -3047,7 +3046,7 @@ SetNewSxactGlobalXmin(void)
 
 	for (sxact = FirstPredXact(); sxact != NULL; sxact = NextPredXact(sxact))
 	{
-		if (!SxactIsRolledBack(sxact)
+		if (!SxactIsDoomed(sxact)
 			&& !SxactIsCommitted(sxact)
 			&& sxact != OldCommittedSxact)
 		{
@@ -3114,7 +3113,6 @@ ReleasePredicateLocks(const bool isCommit)
 	Assert(!isCommit || SxactIsPrepared(MySerializableXact));
 	Assert(!isCommit || !SxactIsDoomed(MySerializableXact));
 	Assert(!SxactIsCommitted(MySerializableXact));
-	Assert(!SxactIsRolledBack(MySerializableXact));
 
 	/* may not be serializable during COMMIT/ROLLBACK PREPARED */
 	if (MySerializableXact->pid != 0)
@@ -3153,22 +3151,7 @@ ReleasePredicateLocks(const bool isCommit)
 			MySerializableXact->flags |= SXACT_FLAG_READ_ONLY;
 	}
 	else
-	{
-		/*
-		 * The DOOMED flag indicates that we intend to roll back this
-		 * transaction and so it should not cause serialization
-		 * failures for other transactions that conflict with
-		 * it. Note that this flag might already be set, if another
-		 * backend marked this transaction for abort.
-		 *
-		 * The ROLLED_BACK flag further indicates that
-		 * ReleasePredicateLocks has been called, and so the
-		 * SerializableXact is eligible for cleanup. This means it
-		 * should not be considered when calculating SxactGlobalXmin.
-		 */
 		MySerializableXact->flags |= SXACT_FLAG_DOOMED;
-		MySerializableXact->flags |= SXACT_FLAG_ROLLED_BACK;
-	}
 
 	if (!topLevelIsDeclaredReadOnly)
 	{
@@ -3544,7 +3527,7 @@ ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
 nextConflict;
 
 	Assert(sxact != NULL);
-	Assert(SxactIsRolledBack(sxact) || SxactIsCommitted(sxact));
+	Assert(SxactIsDoomed(sxact) || SxactIsCommitted(sxact));
 	Assert(LWLockHeldByMe(SerializableFinishedListLock));
 
 	/*
diff --git a/src/include/storage/predicate_internals.h b/src/include/storage/predicate_internals.h
index 34c661d..495983f 100644
--- a/src/include/storage/predicate_internals.h
+++ b/src/include/storage/predicate_internals.h
@@ -90,22 +90,21 @@ typedef struct SERIALIZABLEXACT
 	int			pid;			/* pid of associated process */
 } SERIALIZABLEXACT;
 
-#define SXACT_FLAG_COMMITTED			0x0001	/* already committed */
-#define SXACT_FLAG_PREPARED			0x0002	/* about to commit */
-#define SXACT_FLAG_ROLLED_BACK			0x0004	/* already rolled back */
-#define SXACT_FLAG_DOOMED			0x0008	/* will roll back */
+#define SXACT_FLAG_COMMITTED			0x0001	/* already committed */
+#define SXACT_FLAG_PREPARED			0x0002	/* about to commit */
+#define SXACT_FLAG_DOOMED			0x0004	/* will roll back */
 /*
  * The following flag actually means that the flagged transaction has a
  * conflict out *to a transaction which committed ahead of it*.  It's hard
  * to get that into a name of a reasonable length.
  */
-#define SXACT_FLAG_CONFLICT_OUT			0x0010
-#define 

Re: [HACKERS] patch: update README-SSI

2011-06-16 Thread Dan Ports
On Thu, Jun 16, 2011 at 04:39:09PM +0300, Heikki Linnakangas wrote:
 There's no mention on what T1 is. I believe it's supposed to be Tin, in 
 the terminology used in the graph.

Yes, I changed the naming after I originally wrote it, and missed a
couple spots. T1 should be Tin.

 I don't see how there can be a ww-dependency between T0 and Tin. There 
 can't be a rw-conflict because Tin is read-only, so surely there can't 
 be a ww-conflict either?

Yes, it can only be a wr-conflict. Good catch.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/



Re: [HACKERS] SSI work for 9.1

2011-06-16 Thread Dan Ports
On Thu, Jun 16, 2011 at 11:49:48PM -0400, Robert Haas wrote:
 Does this mean that the open item more SSI loose ends can now be
 marked resolved?

I was just looking at it and contemplating moving it to the non-blockers
list. Of the five items:
 - (1) and (4) are resolved
 - (2) isn't an issue -- maybe we want to add a comment someplace, but
   I'm not convinced even that is necessary
 - (3) is a regression test, and is already on the list separately
 - (5) is a doc issue only

There are no open issues with the code that I'm aware of.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/



Re: [HACKERS] SSI work for 9.1

2011-06-16 Thread Dan Ports
On Fri, Jun 17, 2011 at 12:32:46AM -0400, Robert Haas wrote:
 Perhaps it would be best to remove the general item and replace it
 with a list of more specific things that need doing - which might just
 mean #5.

Done.

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/



[HACKERS] patch: update README-SSI

2011-06-15 Thread Dan Ports
The attached patch updates README-SSI. In addition to some minor edits,
changes include:

 - add a section at the beginning that more clearly describes the SSI
   rule and defines dangerous structure with a diagram. It describes
   the optimizations we use about the relative commit times, and the
   case where one transaction is read-only. It includes a proof for the
   latter (novel) optimization, per Heikki's request.

 - note that heap page locks do not lock gaps the way index page locks do

 - be clear about what's been implemented (parts of the README used the
   future tense, probably because they were written long ago), and
   remove a couple of items from the R&D Issues list that have since
   been addressed.
   
Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/
diff --git a/src/backend/storage/lmgr/README-SSI b/src/backend/storage/lmgr/README-SSI
index c685bee..09a8136 100644
--- a/src/backend/storage/lmgr/README-SSI
+++ b/src/backend/storage/lmgr/README-SSI
@@ -51,13 +51,13 @@ if a transaction can be shown to always do the right thing when it is
 run alone (before or after any other transaction), it will always do
 the right thing in any mix of concurrent serializable transactions.
 Where conflicts with other transactions would result in an
-inconsistent state within the database, or an inconsistent view of
+inconsistent state within the database or an inconsistent view of
 the data, a serializable transaction will block or roll back to
 prevent the anomaly. The SQL standard provides a specific SQLSTATE
 for errors generated when a transaction rolls back for this reason,
 so that transactions can be retried automatically.
 
-Before version 9.1 PostgreSQL did not support a full serializable
+Before version 9.1, PostgreSQL did not support a full serializable
 isolation level. A request for serializable transaction isolation
 actually provided snapshot isolation. This has well known anomalies
 which can allow data corruption or inconsistent views of the data
@@ -77,7 +77,7 @@ Serializable Isolation Implementation Strategies
 
 Techniques for implementing full serializable isolation have been
 published and in use in many database products for decades. The
-primary technique which has been used is Strict 2 Phase Locking
+primary technique which has been used is Strict Two-Phase Locking
 (S2PL), which operates by blocking writes against data which has been
 read by concurrent transactions and blocking any access (read or
 write) against data which has been written by concurrent
@@ -112,54 +112,90 @@ visualize the difference between the serializable implementations
 described above, is to consider that among transactions executing at
 the serializable transaction isolation level, the results are
 required to be consistent with some serial (one-at-a-time) execution
-of the transactions[1]. How is that order determined in each?
+of the transactions [1]. How is that order determined in each?
 
-S2PL locks rows used by the transaction in a way which blocks
-conflicting access, so that at the moment of a successful commit it
-is certain that no conflicting access has occurred. Some transactions
-may have blocked, essentially being partially serialized with the
-committing transaction, to allow this. Some transactions may have
-been rolled back, due to cycles in the blocking. But with S2PL,
-transactions can always be viewed as having occurred serially, in the
-order of successful commit.
+In S2PL, each transaction locks any data it accesses. It holds the
+locks until committing, preventing other transactions from making
+conflicting accesses to the same data in the interim. Some
+transactions may have to be rolled back to prevent deadlock. But
+successful transactions can always be viewed as having occurred
+sequentially, in the order they committed.
 
 With snapshot isolation, reads never block writes, nor vice versa, so
-there is much less actual serialization. The order in which
-transactions appear to have executed is determined by something more
-subtle than in S2PL: read/write dependencies. If a transaction
-attempts to read data which is not visible to it because the
-transaction which wrote it (or will later write it) is concurrent
-(one of them was running when the other acquired its snapshot), then
-the reading transaction appears to have executed first, regardless of
-the actual sequence of transaction starts or commits (since it sees a
-database state prior to that in which the other transaction leaves
-it). If one transaction has both rw-dependencies in (meaning that a
-concurrent transaction attempts to read data it writes) and out
-(meaning it attempts to read data a concurrent transaction writes),
-and a couple other conditions are met, there can appear to be a cycle
-in execution order of the transactions. This is when the anomalies
-occur.
-
-SSI works by watching for the conditions mentioned above, and rolling
-back a transaction when needed to prevent any anomaly. The 

Re: [HACKERS] SSI patch renumbered existing 2PC resource managers??

2011-06-13 Thread Dan Ports
On Mon, Jun 13, 2011 at 10:22:19PM +0300, Heikki Linnakangas wrote:
 As far as I can tell it was for purely cosmetic reasons, to have lock 
 and predicate lock lines together.

Yes, that is the only reason.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/



Re: [HACKERS] SSI patch renumbered existing 2PC resource managers??

2011-06-13 Thread Dan Ports
On Mon, Jun 13, 2011 at 03:33:24PM -0400, Tom Lane wrote:
 We can either change that now, or undo the
 unnecessary change in existing RM IDs.  I vote for the latter.

Sounds good to me. I'd offer a patch, but it'd probably take you longer
to apply than to make the change yourself.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/



Re: [HACKERS] Small SSI issues

2011-06-11 Thread Dan Ports
On Sat, Jun 11, 2011 at 01:38:31PM -0500, Kevin Grittner wrote:
 I'm not concerned about references covered by
 SerializableXactHashLock.  I am more concerned about some of the
 tests for whether the (MySerializableXact == InvalidSerializableXact)
 checks and any other tests not covered by that lock are OK without it
 (and OK with it).  Since my knowledge of weak memory ordering
 behavior is, well, weak I didn't want to try to make that call.

Oh, those checks are definitely not an issue -- MySerializableXact
itself (the pointer, not the thing it's pointing to) is in
backend-local memory, so it won't change under us.

The volatile qualifier (as written) doesn't help with that anyway, it
attaches to the data being pointed to, not the pointer itself.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/



Re: [HACKERS] Small SSI issues

2011-06-10 Thread Dan Ports
On Fri, Jun 10, 2011 at 09:43:58PM +0300, Heikki Linnakangas wrote:
  Do checks such as that argue for keeping the volatile flag, or do
  you think we can drop it if we make those changes?  (That would also
  allow dropping a number of casts which exist just to avoid
  warnings.)
 
 I believe we can drop it, I'll double-check.

Yes, dropping it seems like the thing to do. It's been on my list for a
while. We are not really getting anything out of declaring it volatile
since we cast the volatile qualifier away most of the time.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/



Re: [HACKERS] SSI work for 9.1

2011-06-09 Thread Dan Ports
On Thu, Jun 09, 2011 at 07:06:18AM -0500, Kevin Grittner wrote:
 Sounds reasonable, but why did you pass the snapshot to the
 PredicateLockPage() call but not the PredicateLockRelation() call? 
 Oversight?

Yep, just an oversight; long day yesterday. I'll fix the patch shortly
(unless you can get to it first, in which case I wouldn't complain)

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/



Re: [HACKERS] SSI work for 9.1

2011-06-09 Thread Dan Ports
On Thu, Jun 09, 2011 at 01:30:27PM -0400, Dan Ports wrote:
 On Thu, Jun 09, 2011 at 07:06:18AM -0500, Kevin Grittner wrote:
  Sounds reasonable, but why did you pass the snapshot to the
  PredicateLockPage() call but not the PredicateLockRelation() call? 
  Oversight?
 
 Yep, just an oversight; long day yesterday. I'll fix the patch shortly

Attached. Only change is the missing snapshot argument to that one
call.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6ac25af..bf75ace 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -274,7 +274,7 @@ heapgetpage(HeapScanDesc scan, BlockNumber page)
 			else
 valid = HeapTupleSatisfiesVisibility(loctup, snapshot, buffer);
 
-			CheckForSerializableConflictOut(valid, scan->rs_rd, loctup, buffer);
+			CheckForSerializableConflictOut(valid, scan->rs_rd, loctup, buffer, snapshot);
 
 			if (valid)
 				scan->rs_vistuples[ntup++] = lineoff;
@@ -469,7 +469,7 @@ heapgettup(HeapScanDesc scan,
 													 snapshot,
 													 scan->rs_cbuf);
 
-				CheckForSerializableConflictOut(valid, scan->rs_rd, tuple, scan->rs_cbuf);
+				CheckForSerializableConflictOut(valid, scan->rs_rd, tuple, scan->rs_cbuf, snapshot);
 
 				if (valid && key != NULL)
 					HeapKeyTest(tuple, RelationGetDescr(scan->rs_rd),
@@ -478,7 +478,7 @@ heapgettup(HeapScanDesc scan,
 				if (valid)
 				{
 					if (!scan->rs_relpredicatelocked)
-						PredicateLockTuple(scan->rs_rd, tuple);
+						PredicateLockTuple(scan->rs_rd, tuple, snapshot);
 					LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
 					return;
 				}
@@ -747,7 +747,7 @@ heapgettup_pagemode(HeapScanDesc scan,
 				if (valid)
 				{
 					if (!scan->rs_relpredicatelocked)
-						PredicateLockTuple(scan->rs_rd, tuple);
+						PredicateLockTuple(scan->rs_rd, tuple, scan->rs_snapshot);
 					scan->rs_cindex = lineindex;
 					return;
 				}
@@ -755,7 +755,7 @@ heapgettup_pagemode(HeapScanDesc scan,
 			else
 			{
 				if (!scan->rs_relpredicatelocked)
-					PredicateLockTuple(scan->rs_rd, tuple);
+					PredicateLockTuple(scan->rs_rd, tuple, scan->rs_snapshot);
 				scan->rs_cindex = lineindex;
 return;
 			}
@@ -1470,9 +1470,9 @@ heap_fetch(Relation relation,
 	valid = HeapTupleSatisfiesVisibility(tuple, snapshot, buffer);
 
 	if (valid)
-		PredicateLockTuple(relation, tuple);
+		PredicateLockTuple(relation, tuple, snapshot);
 
-	CheckForSerializableConflictOut(valid, relation, tuple, buffer);
+	CheckForSerializableConflictOut(valid, relation, tuple, buffer, snapshot);
 
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 
@@ -1588,11 +1588,11 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 
 		/* If it's visible per the snapshot, we must return it */
 		valid = HeapTupleSatisfiesVisibility(heapTuple, snapshot, buffer);
-		CheckForSerializableConflictOut(valid, relation, heapTuple, buffer);
+		CheckForSerializableConflictOut(valid, relation, heapTuple, buffer, snapshot);
 		if (valid)
 		{
 			ItemPointerSetOffsetNumber(tid, offnum);
-			PredicateLockTuple(relation, heapTuple);
+			PredicateLockTuple(relation, heapTuple, snapshot);
 			if (all_dead)
 *all_dead = false;
 			return true;
@@ -1750,7 +1750,7 @@ heap_get_latest_tid(Relation relation,
 		 * result candidate.
 		 */
 		valid = HeapTupleSatisfiesVisibility(tp, snapshot, buffer);
-		CheckForSerializableConflictOut(valid, relation, tp, buffer);
+		CheckForSerializableConflictOut(valid, relation, tp, buffer, snapshot);
 		if (valid)
 			*tid = ctid;
 
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 27c37d6..294ab45 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -126,7 +126,7 @@ do { \
 } while(0)
 
 static IndexScanDesc index_beginscan_internal(Relation indexRelation,
-		 int nkeys, int norderbys);
+		 int nkeys, int norderbys, const Snapshot snapshot);
 
 
 /* 
@@ -234,7 +234,7 @@ index_beginscan(Relation heapRelation,
 {
 	IndexScanDesc scan;
 
-	scan = index_beginscan_internal(indexRelation, nkeys, norderbys);
+	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -259,7 +259,7 @@ index_beginscan_bitmap(Relation indexRelation,
 {
 	IndexScanDesc scan;
 
-	scan = index_beginscan_internal(indexRelation, nkeys, 0);
+	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -275,7 +275,7 @@ index_beginscan_bitmap(Relation indexRelation,
  */
 static IndexScanDesc
 index_beginscan_internal(Relation indexRelation,
-		 int nkeys, int norderbys)
+		 int nkeys, int norderbys, const Snapshot snapshot)
 {
 	IndexScanDesc scan;
 	FmgrInfo   *procedure;
@@ -284,7

Re: [HACKERS] reindex creates predicate lock on index root

2011-06-08 Thread Dan Ports
On Tue, Jun 07, 2011 at 10:14:30PM -0400, Tom Lane wrote:
 Do you mean page zero, as in the metapage (for most index types), or do
 you mean the root page?  If the former, how is that not an outright bug,
 since it corresponds to no data?  If the latter, how is that not a
 serious performance problem, since it corresponds to locking the entire
 index?  Any way you slice it, it sounds like a pretty bad bug.

It's a moot point since that isn't actually happening, but FYI, we only
acquire and check for locks on b-tree leaf pages so locking the root
wouldn't have any effect. (This won't be true for other index types;
GiST comes to mind.)

 It's not apparent to me why an index build (regular or reindex) should
 create any predicate locks at all, ever.  Surely it's a basically
 nontransactional operation that SSI should keep its fingers out of.

It shouldn't. What's happening is that when IndexBuildHeapScan reads
the heap tuples, heap_getnext takes a lock if it's running inside a
serializable transaction. It doesn't (yet) know that it doesn't need
the locks for an index build.

We could set a flag in the HeapScanDesc to indicate this. Actually,
setting rs_relpredicatelocked has exactly that effect, so we ought to
be able to use that (but might want to change the name).

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/



Re: [HACKERS] SSI heap_insert and page-level predicate locks

2011-06-08 Thread Dan Ports
On Wed, Jun 08, 2011 at 11:23:48AM +0300, Heikki Linnakangas wrote:
 AFAICS, the check for page lock is actually unnecessary. A page-level 
 lock on a heap only occurs when tuple-level locks are promoted. It is 
 just a coarser-grain representation of holding locks on all tuples on 
 the page, *that exist already*. It is not a gap lock like the index 
 locks are, it doesn't need to conflict with inserting new tuples on the 
 page. In fact, if heap_insert chose to insert the tuple on some other 
 heap page, there would have been no conflict.

Yes, it's certainly unnecessary now, given that we never explicitly
take heap page locks, just tuple- or relation-level. 

The only thing I'd be worried about is that at some future point we
might add heap page locks -- say, for sequential scans that don't read
the entire relation -- and expect inserts to be tested against them.
I'm not sure whether we'd actually do this, but we wanted to keep the
option open during development.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/



Re: [HACKERS] SSI work for 9.1

2011-06-08 Thread Dan Ports
On Wed, Jun 08, 2011 at 05:48:26PM -0500, Kevin Grittner wrote:
 (1)  Pass snapshot in to some predicate.c functions.  The particular
 functions have yet to be determined, but certainly any which acquire
 predicate locks, and probably all which are guarded by the
 SkipSerialization() macro.  Skip processing for non-MVCC snapshots. 
 The goal here is to reduce false positive serialization failures and
 avoid confusion about locks showing in the pg_locks view which are
 hard to explain.

I assume you've already started on this one; let me know if you have a
patch I should take a look at or hit any snags.

 (2)  Check on heap truncation from vacuum.  On the face of it this
 seems unlikely to be a problem since we make every effort to clean
 up predicate locks as soon as there is no transaction which can
 update what a transaction has read, but it merits a re-check.  Once
 confirmed, add a note to lazy_truncate_heap about why it's not an
 issue.

I assume you are worried here that there may be SIREAD locks remaining
on truncated pages/tuples, and these would cause false positives if
those pages are reused?

I don't believe this can happen, because a vacuum will only delete a
formerly-visible dead tuple if its xmax is earlier than OldestXmin. We
remove all SIREAD locks on anything older than GlobalSxactXmin, which
won't be less than OldestXmin. 

 (4)  Add a comment to the docs about how querying tuples by TID
 doesn't lock not-found gaps the way an indexed access would.

I can take care of this one and some other README-SSI changes.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/



Re: [HACKERS] SSI work for 9.1

2011-06-08 Thread Dan Ports
On Wed, Jun 08, 2011 at 09:17:04PM -0500, Kevin Grittner wrote:
 A patch is attached which just covers the predicate lock acquisition,
 where a snapshot is available without too much pain.  There are two
 functions which acquire predicate locks where a snapshot was not
 readily available: _bt_search() and _bt_get_endpoint().  Not only was
 it not clear how to get a snapshot in, it was not entirely clear from
 reading the code that we need to acquire predicate locks here.  Now,
 I suspect that we probably do, because I spent many long hours
 stepping through gdb to pick the spots where they are, but that was
 about a year ago and my memory of the details has faded.

For _bt_search(), the lock calls should move to _bt_first() where the
ScanDesc is available. This also keeps us from trying to take locks
during _bt_pagedel(), which is only called during vacuum and recovery.

The call in _bt_get_endpoint() seems unnecessary, because after it
returns, _bt_endpoint() takes the same lock. The only other callers of
_bt_get_endpoint() are _bt_pagedel() and _bt_insert_parent(), neither
of which should take predicate locks.

I've updated the patch, attached.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6ac25af..bf75ace 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -274,7 +274,7 @@ heapgetpage(HeapScanDesc scan, BlockNumber page)
 			else
 valid = HeapTupleSatisfiesVisibility(loctup, snapshot, buffer);
 
-			CheckForSerializableConflictOut(valid, scan->rs_rd, loctup, buffer);
+			CheckForSerializableConflictOut(valid, scan->rs_rd, loctup, buffer, snapshot);
 
 			if (valid)
 				scan->rs_vistuples[ntup++] = lineoff;
@@ -469,7 +469,7 @@ heapgettup(HeapScanDesc scan,
 													 snapshot,
 													 scan->rs_cbuf);
 
-				CheckForSerializableConflictOut(valid, scan->rs_rd, tuple, scan->rs_cbuf);
+				CheckForSerializableConflictOut(valid, scan->rs_rd, tuple, scan->rs_cbuf, snapshot);
 
 				if (valid && key != NULL)
 					HeapKeyTest(tuple, RelationGetDescr(scan->rs_rd),
@@ -478,7 +478,7 @@ heapgettup(HeapScanDesc scan,
 				if (valid)
 				{
 					if (!scan->rs_relpredicatelocked)
-						PredicateLockTuple(scan->rs_rd, tuple);
+						PredicateLockTuple(scan->rs_rd, tuple, snapshot);
 					LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
 					return;
 				}
@@ -747,7 +747,7 @@ heapgettup_pagemode(HeapScanDesc scan,
 				if (valid)
 				{
 					if (!scan->rs_relpredicatelocked)
-						PredicateLockTuple(scan->rs_rd, tuple);
+						PredicateLockTuple(scan->rs_rd, tuple, scan->rs_snapshot);
 					scan->rs_cindex = lineindex;
 					return;
 				}
@@ -755,7 +755,7 @@ heapgettup_pagemode(HeapScanDesc scan,
 			else
 			{
 				if (!scan->rs_relpredicatelocked)
-					PredicateLockTuple(scan->rs_rd, tuple);
+					PredicateLockTuple(scan->rs_rd, tuple, scan->rs_snapshot);
 				scan->rs_cindex = lineindex;
 return;
 			}
@@ -1470,9 +1470,9 @@ heap_fetch(Relation relation,
 	valid = HeapTupleSatisfiesVisibility(tuple, snapshot, buffer);
 
 	if (valid)
-		PredicateLockTuple(relation, tuple);
+		PredicateLockTuple(relation, tuple, snapshot);
 
-	CheckForSerializableConflictOut(valid, relation, tuple, buffer);
+	CheckForSerializableConflictOut(valid, relation, tuple, buffer, snapshot);
 
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 
@@ -1588,11 +1588,11 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 
 		/* If it's visible per the snapshot, we must return it */
 		valid = HeapTupleSatisfiesVisibility(heapTuple, snapshot, buffer);
-		CheckForSerializableConflictOut(valid, relation, heapTuple, buffer);
+		CheckForSerializableConflictOut(valid, relation, heapTuple, buffer, snapshot);
 		if (valid)
 		{
 			ItemPointerSetOffsetNumber(tid, offnum);
-			PredicateLockTuple(relation, heapTuple);
+			PredicateLockTuple(relation, heapTuple, snapshot);
 			if (all_dead)
 *all_dead = false;
 			return true;
@@ -1750,7 +1750,7 @@ heap_get_latest_tid(Relation relation,
 		 * result candidate.
 		 */
 		valid = HeapTupleSatisfiesVisibility(tp, snapshot, buffer);
-		CheckForSerializableConflictOut(valid, relation, tp, buffer);
+		CheckForSerializableConflictOut(valid, relation, tp, buffer, snapshot);
 		if (valid)
 			*tid = ctid;
 
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 27c37d6..294ab45 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -126,7 +126,7 @@ do { \
 } while(0)
 
 static IndexScanDesc index_beginscan_internal(Relation indexRelation,
-		 int nkeys, int norderbys);
+		 int nkeys, int norderbys, const Snapshot snapshot);
 
 
 /* 
@@ -234,7 +234,7 @@ index_beginscan(Relation heapRelation,
 {
 	IndexScanDesc scan;
 
-	scan = index_beginscan_internal(indexRelation, nkeys, norderbys);
+	scan = 

Re: [HACKERS] reindex creates predicate lock on index root

2011-06-07 Thread Dan Ports
On Tue, Jun 07, 2011 at 07:45:43PM -0500, Kevin Grittner wrote:
 During testing of the SSI DDL changes I noticed that a REINDEX INDEX
 created a predicate lock on page 0 of the index.

Really? That surprises me, and I couldn't reproduce it just now.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/



Re: [HACKERS] SIREAD lock versus ACCESS EXCLUSIVE lock

2011-06-05 Thread Dan Ports
On Sun, Jun 05, 2011 at 12:45:41PM -0500, Kevin Grittner wrote:
 Is this possible?  If a transaction gets its snapshot while OID of N
 is assigned to relation X, can that transaction wind up seeing an OID
 of N as a reference to relation Y?  If not, there aren't any false
 positives possible.

This ought to be possible, assuming that the transaction doesn't try to
read (and take an AccessShareLock on) X.

On the other hand, given all the timing constraints you've noted, I
think it's reasonable to ignore the risk of false positives here. What
you're saying is that we might inadvertently roll back a transaction,
if a table gets dropped and a new table created and assigned the same
OID, and there's an uncommitted transaction that started before the
DROP, *and* certain other conditions hold about the reads and timing of
overlapping transactions.

This combination of conditions seems quite unlikely and I have a hard
time getting too worked up about it. Occasional false positives are
already a fact of life when using SSI.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/



Re: [HACKERS] SSI predicate locking on heap -- tuple or row?

2011-06-02 Thread Dan Ports
On Wed, Jun 01, 2011 at 05:09:09PM -0500, Kevin Grittner wrote:
 I won't be shocked if Dan can come up with a shorter proof, but I'm
 confident this one is solid.

Well, so happens I wrote a proof on the airplane today, before I saw
your mail. It's actually quite straightforward... (well, at least more
so than I was expecting)

 From many academic papers, there is well-established proof that
 serialization anomalies can only occur under snapshot isolation when
 there is a cycle in the graph of apparent order of execution of the
 transactions, and that in such a cycle the following pattern of
 rw-dependencies always occurs:
  
 Tin --rw--> Tpivot --rw--> Tout
  
 A rw-dependency (also called a rw-conflict) exists when a read by
 one transaction doesn't see the write of another transaction because
 the two transactions overlap, regardless of whether the read or the
 write actually happens first.  Since the reader doesn't see the work
 of the writer, the reader appears to have executed first, regardless
 of the actual order of snapshot acquisition or commits.  The arrows
 show the apparent order of execution of the transactions -- Tin
 first, Tout last.  Published papers have further proven that the
 transaction which appears to have executed last of these three must
 actually commit before either of the others for an anomaly to occur.

We can actually say something slightly stronger than that last
sentence: Tout has to commit before *any* other transaction in the
cycle. That doesn't help us implement SSI, because we never try to look
at an entire cycle, but it's still true and useful for proofs like this.

Now, supposing Tin is read-only...

Since there's a cycle, there must also be a transaction that precedes
Tin in the serial order. Call it T0. (T0 might be the same transaction
as Tout, but that doesn't matter.) There's an edge in the graph from
T0 to Tin. It can't be a rw-conflict, because Tin was read-only, so it
must be a ww- or wr-dependency. Either means T0 committed before Tin
started.

Because Tout committed before any other transaction in the cycle, Tout
has to commit before T0 commits -- and thus before Tin starts.

Dan



Re: [HACKERS] SSI predicate locking on heap -- tuple or row?

2011-06-02 Thread Dan Ports
On Thu, Jun 02, 2011 at 01:01:05PM -0500, Kevin Grittner wrote:
 If we're going to put this into the README-SSI as the proof of the
 validity of this optimization, I'd like to have a footnote pointing
 to a paper describing the "first commit in the cycle" aspect of a
 dangerous structure.  Got any favorites, or should I fall back on a
 google search?

Hmm. I don't see any that state that in so many words, but it's an
obvious consequence of the proof of Theorem 2.1 in "Making Snapshot
Isolation Serializable" -- note that T3 is chosen to be the transaction
in the cycle with the earliest commit time.

Dan



Re: [HACKERS] SSI predicate locking on heap -- tuple or row?

2011-05-24 Thread Dan Ports
On Tue, May 24, 2011 at 04:18:37AM -0500, Kevin Grittner wrote:
 These proofs show that
 there is no legitimate cycle which could cause an anomaly which the
 move from row-based to tuple-based logic will miss.  They don't prove
 that the change will generate all the same serialization failures;
 and in fact, some false positives are eliminated by the change. 

Yes, that's correct. That's related to the part in the proof where I
claimed T3 couldn't have a conflict out *to some transaction T0 that
precedes T1*.

I originally tried to show that T3 couldn't have any conflicts out that
T2 didn't have, which would mean we got the same set of serialization
failures, but that's not true. In fact, it's not too hard to come up
with an example where there would be a serialization failure with the
row version links, but not without. However, because the rw-conflict
can't be pointing to a transaction that precedes T1 in the serial
order, it won't create a cycle. In other words, there are serialization
failures that won't happen anymore, but they were false positives.

Dan



Re: [HACKERS] SSI predicate locking on heap -- tuple or row?

2011-05-23 Thread Dan Ports
On Sat, May 21, 2011 at 03:09:12PM -0500, Kevin Grittner wrote:
 I went back to the example which persuaded me and took another look.  On
 review I see that this didn't prove the point because there was a
 dangerous structure with T1 as a pivot which should have caused SSI to
 break the cycle.  Since it didn't, either I got careless in my testing
 methodology (which I will recheck when I get back to Wisconsin) or there
 was a bug -- but either way, this example can't prove that the predicate
 locks need to follow the row to new versions.

I'm still working through the more general question, but I wanted to
see what was going on with this example. It appears there's a bug,
because this doesn't cause a rollback without the version linking.

Specifically, the problem is a missing check in
OnConflict_CheckForSerializationFailure. We check whether the conflict
has caused the writer to become a pivot, but only if neither the reader
or writer is committed. Why is that last condition there? In this case,
the reader (T4) has committed but the writer (T1) hasn't.

It would be worth making sure there aren't any other cases we're
missing here, although I don't see any right now.

Dan



Re: [HACKERS] SSI predicate locking on heap -- tuple or row?

2011-05-23 Thread Dan Ports
I have thought about this quite a bit and am fairly certain we do not need
to track this linkage between row versions. One strong hint to this
is that all the work I've seen on multiversion serializability
theory defines a rw-conflict to be one transaction reading an object
and the other writing *the next version* of the same object.

But maybe that doesn't answer the question conclusively -- after all,
the differences between an object, a tuple, and a row are subtle. So
see the argument below:
[I think it's similar to the argument you were making.]

On Sat, May 21, 2011 at 03:09:12PM -0500, Kevin Grittner wrote:
 The issue can be rephrased to this: If transaction T1 reads a row (thus
 acquiring a predicate lock on it) and a second transaction T2 updates
 that row, must a third transaction T3 which updates the new version of
 the row have a rw-conflict in from T1?

OK, that's a good way to phrase the question. Does it matter whether
this edge T1 -> T3 is there?

If T1 has a conflict in, it certainly doesn't. Adding the edge T1 -> T3
would create a dangerous structure, but we already had one from the
edge T1 -> T2, so we would have aborted something anyway.

Now let's consider the case where T1 doesn't have a conflict in. If
that's the case, for this edge T1 -> T3 to make a difference, T3 must
have a rw-conflict out that induces a cycle in the dependency graph,
i.e. a conflict out to some transaction preceding T1 in the serial
order. (A conflict out to T1 would work too, but that would mean T1 has
a conflict in and we would have rolled back.)

So now we're trying to figure out if there can be an rw-conflict edge
T3 -> T0, where T0 is some transaction that precedes T1. For T0 to
precede T1, there has to be some edge, or sequence of edges, from T0
to T1. At least the last edge has to be a wr-dependency or
ww-dependency rather than a rw-conflict, because T1 doesn't have a
rw-conflict in. And that gives us enough information about the order of
transactions to see that T3 can't have a rw-dependency to T0:
 - T0 committed before T1 started (the wr/ww-dependency implies this)
 - T1 started before T2 committed (the T1 -> T2 rw-conflict implies this)
 - T2 committed before T3 started (otherwise, T3 would be aborted
   because of an update conflict)

That means T0 committed before T3 started, and therefore there can't be
a rw-conflict from T3 to T0.

In both cases, we didn't need the T1 -> T3 edge.


Does that make sense to you? Unless anyone can poke a hole in that
reasoning, I think we can get rid of the code to handle later versions,
which would make me happy for many reasons. It will not be missed.

Dan



Re: [HACKERS] SSI predicate locking on heap -- tuple or row?

2011-05-21 Thread Dan Ports
On Sat, May 21, 2011 at 04:45:15PM -0400, Pavan Deolasee wrote:
 As a first step, it would be great if you can upload the slides on the
 conference website. To expect that the attendees would have understood the
 nitty-gritties of SSI just listening to the presentation is so unhuman :-)

I just posted them at
http://drkp.net/drkp/papers/ssi-pgcon11-slides.pdf

...and they'll be linked from the Serializable wiki page as soon as I
remember how to edit it. :-)

Dan



Re: [HACKERS] patch: fix race in SSI's CheckTargetForConflictsIn

2011-05-08 Thread Dan Ports
On Fri, May 06, 2011 at 10:49:22PM -0400, Dan Ports wrote:
 Will update the patch.

Updated patch (in response to Robert's comments) attached.

Dan

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 419e0d9..9d51f08 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -3846,6 +3846,8 @@ CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag)
 	LWLockId	partitionLock;
 	PREDICATELOCKTARGET *target;
 	PREDICATELOCK *predlock;
+	PREDICATELOCK *mypredlock = NULL;
+	PREDICATELOCKTAG mypredlocktag;
 
 	Assert(MySerializableXact != InvalidSerializableXact);
 
@@ -3891,139 +3893,21 @@ CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag)
 		if (sxact == MySerializableXact)
 		{
 			/*
-			 * If we're getting a write lock on the tuple and we're not in a
-			 * subtransaction, we don't need a predicate (SIREAD) lock.  We
-			 * can't use this optimization within a subtransaction because the
-			 * subtransaction could be rolled back, and we would be left
-			 * without any lock at the top level.
+			 * If we're getting a write lock on a tuple, we don't need
+			 * a predicate (SIREAD) lock on the same tuple. We can
+			 * safely remove our SIREAD lock, but we'll defer doing so
+			 * until after the loop because that requires upgrading to
+			 * an exclusive partition lock.
 			 *
-			 * At this point our transaction already has an ExclusiveRowLock
-			 * on the relation, so we are OK to drop the predicate lock on the
-			 * tuple, if found, without fearing that another write against the
-			 * tuple will occur before the MVCC information makes it to the
-			 * buffer.
+			 * We can't use this optimization within a subtransaction
+			 * because the subtransaction could roll back, and we
+			 * would be left without any lock at the top level.
 			 */
			if (!IsSubTransaction()
				&& GET_PREDICATELOCKTARGETTAG_OFFSET(*targettag))
 			{
-uint32		predlockhashcode;
-PREDICATELOCKTARGET *rmtarget = NULL;
-PREDICATELOCK *rmpredlock;
-LOCALPREDICATELOCK *locallock,
-		   *rmlocallock;
-
-/*
- * This is a tuple on which we have a tuple predicate lock. We
- * only have shared LW locks now; release those, and get
- * exclusive locks only while we modify things.
- */
-LWLockRelease(SerializableXactHashLock);
-LWLockRelease(partitionLock);
-LWLockAcquire(SerializablePredicateLockListLock, LW_SHARED);
-LWLockAcquire(partitionLock, LW_EXCLUSIVE);
-LWLockAcquire(SerializableXactHashLock, LW_EXCLUSIVE);
-
-/*
- * Remove the predicate lock from shared memory, if it wasn't
- * removed while the locks were released.  One way that could
- * happen is from autovacuum cleaning up an index.
- */
-predlockhashcode = PredicateLockHashCodeFromTargetHashCode
-	(&(predlock->tag), targettaghash);
-rmpredlock = (PREDICATELOCK *)
-	hash_search_with_hash_value(PredicateLockHash,
-&(predlock->tag),
-predlockhashcode,
-HASH_FIND, NULL);
-if (rmpredlock)
-{
-	Assert(rmpredlock == predlock);
-
-	SHMQueueDelete(&(predlock->targetLink));
-	SHMQueueDelete(&(predlock->xactLink));
-
-	rmpredlock = (PREDICATELOCK *)
-		hash_search_with_hash_value(PredicateLockHash,
-	&(predlock->tag),
-	predlockhashcode,
-	HASH_REMOVE, NULL);
-	Assert(rmpredlock == predlock);
-
-	RemoveTargetIfNoLongerUsed(target, targettaghash);
-
-	LWLockRelease(SerializableXactHashLock);
-	LWLockRelease(partitionLock);
-	LWLockRelease(SerializablePredicateLockListLock);
-
-	locallock = (LOCALPREDICATELOCK *)
-		hash_search_with_hash_value(LocalPredicateLockHash,
-	targettag, targettaghash,
-	HASH_FIND, NULL);
-
-	/*
-	 * Remove entry in local lock table if it exists and has
-	 * no children. It's OK if it doesn't exist; that means
-	 * the lock was transferred to a new target by a different
-	 * backend.
-	 */
-	if (locallock != NULL)
-	{
-		locallock->held = false;
-
-		if (locallock->childLocks == 0)
-		{
-			rmlocallock = (LOCALPREDICATELOCK *)
-hash_search_with_hash_value(LocalPredicateLockHash,
-	targettag, targettaghash,
-		  HASH_REMOVE, NULL);
-			Assert(rmlocallock == locallock);
-		}
-	}
-
-	DecrementParentLocks(targettag);
-
-	/*
-	 * If we've cleaned up the last of the predicate locks for
-	 * the target, bail out before re-acquiring the locks.
-	 */
-	if (rmtarget)
-		return;
-
-	/*
-	 * The list has been altered.  Start over at the front.
-	 */
-	LWLockAcquire(partitionLock, LW_SHARED);
-	nextpredlock = (PREDICATELOCK *)
-		SHMQueueNext(&(target->predicateLocks),
-	 &(target->predicateLocks),
-	 offsetof(PREDICATELOCK

Re: [HACKERS] patch: fix race in SSI's CheckTargetForConflictsIn

2011-05-06 Thread Dan Ports
On Fri, May 06, 2011 at 09:35:39PM -0400, Robert Haas wrote:
 Why does this HASH_FIND the applicable hash table entries and then
 HASH_REMOVE it as a separate step, instead of just HASH_REMOVE-ing it
 in one go?

For PredicateLockHash, we need to find the lock entry first so that we
can call SHMQueueDelete on its targetLink and xactLink fields. 

For LocalPredicateLockHash, we check the resulting entry to decide
whether to remove it. Having looked at the code some more, however, we do
always remove it because we only apply this optimization to heap tuple
locks, which are not parents of other locks. So we can simplify this
down to a HASH_REMOVE.


 Doesn't this fail to release the locks if rmpredlock == NULL?

Yikes. Indeed it does.

 I believe it's project style to test (rmpredlock != NULL) rather than
 just (rmpredlock).

That works for me (I prefer the != NULL myself). I believe I've seen
both elsewhere, though...

Will update the patch.

Dan



[HACKERS] Re: Why is RegisterPredicateLockingXid called while holding XidGenLock?

2011-05-05 Thread Dan Ports
On Thu, May 05, 2011 at 11:12:40PM -0400, Tom Lane wrote:
 Even if it's actually necessary to set up that data structure while
 holding XidGenLock, I would *really* like the call to not be exactly
 where it is.

Good question.

I don't believe it needs to be while XidGenLock is being held at all;
certainly not in this particular place. All that really matters is that
it happens before any backend starts seeing said xid in tuple headers.

I believe the call can be moved over to AssignTransactionId. I'd
probably put it at the end of that function, but it can go anywhere
between there and where it is now. Do you have any preference?

Dan



[HACKERS] patch: fix race in SSI's CheckTargetForConflictsIn

2011-05-04 Thread Dan Ports
While running some benchmarks to test SSI performance, I found a race
condition that's capable of causing a segfault. A patch is attached.

The bug is in CheckTargetForConflictsIn, which scans the list of SIREAD
locks on a lock target when it's modified. There's an optimization in
there where the writing transaction will remove a SIREAD lock that it
holds itself, because it's being replaced with a (stronger) write lock.
To do that, it needs to drop its shared lwlocks and reacquire them in
exclusive mode. The existing code deals with concurrent modifications
in that interval by redoing checks. However, it misses the case where
some other transaction removes all remaining locks on the target, and
proceeds to remove the lock target itself.

The attached patch fixes this by deferring the SIREAD lock removal
until the end of the function. At that point, there isn't any need to
worry about concurrent updates to the target's lock list. The resulting
code is also simpler.

Dan

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 3309d07..c274bca 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -3905,6 +3905,8 @@ CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag)
 	LWLockId	partitionLock;
 	PREDICATELOCKTARGET *target;
 	PREDICATELOCK *predlock;
+	PREDICATELOCK *mypredlock = NULL;
+	PREDICATELOCKTAG mypredlocktag;
 
 	Assert(MySerializableXact != InvalidSerializableXact);
 
@@ -3950,139 +3952,21 @@ CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag)
 		if (sxact == MySerializableXact)
 		{
 			/*
-			 * If we're getting a write lock on the tuple and we're not in a
-			 * subtransaction, we don't need a predicate (SIREAD) lock.  We
-			 * can't use this optimization within a subtransaction because the
-			 * subtransaction could be rolled back, and we would be left
-			 * without any lock at the top level.
+			 * If we're getting a write lock on a tuple, we don't need
+			 * a predicate (SIREAD) lock on the same tuple. We can
+			 * safely remove our SIREAD lock, but we'll defer doing so
+			 * until after the loop because that requires upgrading to
+			 * an exclusive partition lock.
 			 *
-			 * At this point our transaction already has an ExclusiveRowLock
-			 * on the relation, so we are OK to drop the predicate lock on the
-			 * tuple, if found, without fearing that another write against the
-			 * tuple will occur before the MVCC information makes it to the
-			 * buffer.
+			 * We can't use this optimization within a subtransaction
+			 * because the subtransaction could roll back, and we
+			 * would be left without any lock at the top level.
 			 */
			if (!IsSubTransaction()
				&& GET_PREDICATELOCKTARGETTAG_OFFSET(*targettag))
 			{
-uint32		predlockhashcode;
-PREDICATELOCKTARGET *rmtarget = NULL;
-PREDICATELOCK *rmpredlock;
-LOCALPREDICATELOCK *locallock,
-		   *rmlocallock;
-
-/*
- * This is a tuple on which we have a tuple predicate lock. We
- * only have shared LW locks now; release those, and get
- * exclusive locks only while we modify things.
- */
-LWLockRelease(SerializableXactHashLock);
-LWLockRelease(partitionLock);
-LWLockAcquire(SerializablePredicateLockListLock, LW_SHARED);
-LWLockAcquire(partitionLock, LW_EXCLUSIVE);
-LWLockAcquire(SerializableXactHashLock, LW_EXCLUSIVE);
-
-/*
- * Remove the predicate lock from shared memory, if it wasn't
- * removed while the locks were released.  One way that could
- * happen is from autovacuum cleaning up an index.
- */
-predlockhashcode = PredicateLockHashCodeFromTargetHashCode
-	(&(predlock->tag), targettaghash);
-rmpredlock = (PREDICATELOCK *)
-	hash_search_with_hash_value(PredicateLockHash,
-&(predlock->tag),
-predlockhashcode,
-HASH_FIND, NULL);
-if (rmpredlock)
-{
-	Assert(rmpredlock == predlock);
-
-	SHMQueueDelete(&(predlock->targetLink));
-	SHMQueueDelete(&(predlock->xactLink));
-
-	rmpredlock = (PREDICATELOCK *)
-		hash_search_with_hash_value(PredicateLockHash,
-	&(predlock->tag),
-	predlockhashcode,
-	HASH_REMOVE, NULL);
-	Assert(rmpredlock == predlock);
-
-	RemoveTargetIfNoLongerUsed(target, targettaghash);
-
-	LWLockRelease(SerializableXactHashLock);
-	LWLockRelease(partitionLock);
-	LWLockRelease(SerializablePredicateLockListLock);
-
-	locallock = (LOCALPREDICATELOCK *)
-		hash_search_with_hash_value(LocalPredicateLockHash,
-	targettag, targettaghash,
-	HASH_FIND, NULL);
-
-	/*
-	 * Remove entry in local lock table if it exists and has
-	 * no children. It's OK if it doesn't exist; that means
-	 * the lock was transferred to a new target by a different
-	 * backend.
-	 */
-	if (locallock != NULL)
-		

Re: [HACKERS] Predicate locking

2011-05-03 Thread Dan Ports
On Tue, May 03, 2011 at 01:36:36PM +0900, Vlad Arkhipov wrote:
 Then I commited the both and the second one raised an exception:
 ERROR: could not serialize access due to read/write dependencies among 
 transactions
 SQL state: 40001
 
 However the second transaction does not access the records that the 
 first one does. If I had predicate locks I could avoid this situation by 
 locking the records with the specified id.

Yes, you're right -- the current implementation of SSI only locks
indexes at the granularity of index pages. So although those
transactions don't actually access the same records, they're detected
as a conflict because they're on the same index page. Of course, on a
larger table this might be less likely to happen.

Getting this down to index-key and index-gap lock granularity is on
the todo list. Our focus in the initial SSI development has been to get
something that's functionally correct and stable before optimizing it.
I'm hoping to get some time to work on index-key locking for 9.2, as I
expect it will make a significant performance difference.

Dan



Re: [HACKERS] SSI non-serializable UPDATE performance

2011-04-29 Thread Dan Ports
On Thu, Apr 28, 2011 at 06:45:54PM +0200, Robert Haas wrote:
 Yeah, I think Dan's notes about memory ordering would be good to include.

I left it out initially because I didn't want to make things more
confusing. As far as memory ordering is concerned, this is the same
story as anything else that uses lwlocks: the spinlock memory barrier
prevents memory accesses from being reordered before the lock is
acquired. The only unusual thing here is that the lock in question
isn't the one that protects the variable we're reading.

But I'm OK with adding a comment if you think it helps. Patch attached.

Dan

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index fe0e458..694e87d 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -2298,6 +2298,9 @@ PredicateLockTupleRowVersionLink(const Relation relation,
 	 * locks. Even if a serializable transaction starts concurrently,
 	 * we know it can't take any SIREAD locks on the modified tuple
 	 * because the caller is holding the associated buffer page lock.
+	 * Memory reordering isn't an issue; the memory barrier in the
+	 * LWLock acquisition guarantees that this read occurs while the
+	 * buffer page lock is held.
 	 */
 	if (!TransactionIdIsValid(PredXact->SxactGlobalXmin))
 		return;



Re: [HACKERS] SSI non-serializable UPDATE performance

2011-04-28 Thread Dan Ports
On Thu, Apr 28, 2011 at 08:43:30AM +0100, Simon Riggs wrote:
  We added a quick return which didn't need to check any locks at the
  front of this routine which is taken if there are no active
  serializable transactions on the cluster at the moment of update.
 
 Surprised to hear nobody mentioning memory reordering issues about
 that, but I'm not running Itaniums anywhere.

I did spend a while thinking about it. There aren't any memory
reordering issues with that optimization (even on the Alpha, where just
about anything goes).

The memory barrier when acquiring the buffer page lwlock acts as the
synchronization point we need. When we see that no serializable
transactions are running, that could have been reordered, but that read
still had to come after the lock was taken. That's all we need: even if
another backend starts a serializable transaction after that, we know
it can't take any SIREAD locks on the same target while we're holding
the buffer page lock.

Dan



Re: [HACKERS] SSI non-serializable UPDATE performance

2011-04-27 Thread Dan Ports
On Wed, Apr 27, 2011 at 06:26:52PM +0100, Simon Riggs wrote:
 Reading the code, IIUC, we check for RW conflicts after each write but
 only if the writer is running a serializable transaction.
 
 Am I correct in thinking that there is zero impact of SSI if nobody is
 running a serializable transaction?

That is correct, now.

(Well, other than having to check whether a serializable transaction is
running, the cost of which is truly negligible.)

Dan



Re: [HACKERS] SIREAD lock versus ACCESS EXCLUSIVE lock

2011-04-27 Thread Dan Ports
 

On Wed, Apr 27, 2011 at 02:59:19PM -0500, Kevin Grittner wrote:
 For correct serializable behavior in the face of concurrent DDL
 execution, I think that a request for a heavyweight ACCESS EXCLUSIVE
 lock might need to block until all SIREAD locks on the relation have
 been released.  Picture, for example, what might happen if one

I think you're correct about this, but also agree that it would be
reasonable to document the limitation for now and punt on a fix until
9.2.


On Wed, Apr 27, 2011 at 04:09:38PM -0500, Kevin Grittner wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
  Hmm, could we upgrade all predicate locks to relation-level
  predicate locks instead?
  
 Tied to what backend?  This only comes into play with transactions
 which have committed and are not yet able to release their predicate
 locks because of overlapping READ WRITE serializable transactions
 which are still active.  Until that point the ACCESS SHARE lock (or
 stronger) is still there protecting against this.

I think Heikki was suggesting to upgrade to relation-level SIREAD
locks. This seems like it would work, but might cause a lot of aborts
immediately afterwards. I do wonder if it might be necessary to upgrade
index locks to relation-level locks on the heap relation, in cases of
dropping the index.

 One way we could push this entirely into the heavyweight locking
 system would be for a committing transaction with predicate locks to
 somehow cause all overlapping read write serializable transactions
 which are still active to acquire an ACCESS SHARE lock on each
 relation for which the committing transaction has any predicate
 lock(s).

I'm not entirely clear on how this would work, but it sounds like it
could also have a significant performance cost.

 As an alternative, perhaps we could find a way to leave the relation
 locks for a serializable transaction until it's safe to clean up the
 predicate locks?  They could be downgraded to ACCESS SHARE if they
 are stronger.  They would need to survive beyond not only the commit
 of the transaction, but also the termination of the connection. 
 They would need to be immune to being chosen as deadlock victims.

This sounds like it would require major changes to the heavyweight lock
manager. There's already a notion of keeping locks past backend
termination for 2PC prepared transactions, but it would be hard to
apply here.

The most practical solutions I see here are to, when acquiring an
AccessExclusive lock, either wait for any associated SIREAD locks to go
away, or to promote them to relation level.

Dan



Re: [HACKERS] SSI non-serializable UPDATE performance

2011-04-24 Thread Dan Ports
On Sat, Apr 23, 2011 at 08:54:31AM -0500, Kevin Grittner wrote:
 Even though this didn't show any difference in Dan's performance
 tests, it seems like reasonable insurance against creating a new
 bottleneck in very high concurrency situations.
  
 Dan, do you have a patch for this, or should I create one?

Sure, patch is attached.

Dan

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index f02d5d5..48ff9cc 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -2275,6 +2275,18 @@ PredicateLockTupleRowVersionLink(const Relation relation,
 	TransactionId oldxmin,
 newxmin;
 
+	/*
+	 * Bail out quickly if there are no serializable transactions
+	 * running.
+	 *
+	 * It's safe to do this check without taking any additional
+	 * locks. Even if a serializable transaction starts concurrently,
+	 * we know it can't take any SIREAD locks on the modified tuple
+	 * because the caller is holding the associated buffer page lock.
+	 */
+	if (!TransactionIdIsValid(PredXact->SxactGlobalXmin))
+		return;
+
 	oldblk = ItemPointerGetBlockNumber(&(oldTuple->t_self));
 	oldoff = ItemPointerGetOffsetNumber(&(oldTuple->t_self));
 	oldxmin = HeapTupleHeaderGetXmin(oldTuple->t_data);
@@ -2633,6 +2645,15 @@ PredicateLockPageSplit(const Relation relation, const BlockNumber oldblkno,
 	PREDICATELOCKTARGETTAG newtargettag;
 	bool		success;
 
+	/*
+	 * Bail out quickly if there are no serializable transactions
+	 * running. As with PredicateLockTupleRowVersionLink, it's safe to
+	 * check this without taking locks because the caller is holding
+	 * the buffer page lock.
+	 */
+	if (!TransactionIdIsValid(PredXact->SxactGlobalXmin))
+		return;
+
 	if (SkipSplitTracking(relation))
 		return;
 



[HACKERS] SSI non-serializable UPDATE performance (was: getting to beta)

2011-04-22 Thread Dan Ports
For background, the issue here is that there are three SSI calls that
get invoked even on non-serializable transactions:
  - PredicateLockPageSplit/Combine, during B-tree page splits/combines
  - PredicateLockTupleRowVersionLink, from heap_update

These have to update any matching SIREAD locks to match the new lock
target. If there aren't any serializable transactions, there won't be
any, but it still has to check and this requires taking a LWLock. Every
other SSI function checks XactIsoLevel and bails out immediately if
non-serializable.

Like Kevin said, I tested this by removing these three calls and
comparing under what I see as worst-case conditions. I used pgbench, an
update-mostly workload, in read committed mode. The database (scale
factor 100) fit in shared_buffers and was backed by tmpfs so disk
accesses didn't enter the picture anywhere. I ran it on a 16-core
machine to stress lock contention.

Even under these conditions I couldn't reliably see a slowdown. My
latest batch of results (16 backends, median of three 10 minute runs)
shows a difference well below 1%. In a couple of cases I saw the code
with the SSI checks running faster than with them removed, so this
difference seems in the noise.

Given that result, and considering it's a pretty extreme condition, it
probably isn't worth worrying about this too much, but...

There's a quick fix: we might as well bail out of these functions early
if there are no serializable transactions running. Kevin points out we
can do this by checking if PredXact->SxactGlobalXmin is invalid. I
would add that we can do this safely without taking any locks, even on
weak-memory-ordering machines. Even if a new serializable transaction
starts concurrently, we have the appropriate buffer page locked, so
it's not able to take any relevant SIREAD locks.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/


Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers

2011-04-17 Thread Dan Ports
On Sat, Apr 16, 2011 at 11:26:34PM -0500, Kevin Grittner wrote:
 I have to say, I've been rather mystified by the difficulty
 attributed to running pgindent.  During work on the SSI patch, I ran
 it about once every two weeks on files involved in the patch

Well, as a counterpoint: during work on the SSI patch, I did *not* run
pgindent. I attempted to, at one point, but was discouraged when I
realized that it required BSD indent and my Linux machine only had GNU
indent. That meant I would need to find, build, and install a new
version of indent, and keep it separate from my existing GNU indent.
Hardly impossible, but it's a lot more of a hassle than simply running a
script, and it left me wondering if I was going to run into other
issues even if I did get the right indent installed.

Andrew's instructions upthread would certainly have been helpful to
have in the pgindent README.

(To be fair, I would probably have made much more of an effort to run
pgindent if I didn't already know Kevin was running it periodically on
the SSI code.)


 And I can't help but wonder why, in an off-list discussion with
 Michael Cahill about the SSI technology he commented that he was
 originally intending to implement the technique in PostgreSQL, but
 later chose Oracle Berkeley DB and then latter InnoDB instead. 
 *Maybe* he was looking toward being hired by Oracle, and *maybe* it
 was because the other databases already had predicate locking and
 true serializable transaction isolation levels -- but was part of it
 the reputation of the community?  I keep wondering.

I would discount the first explanation (being hired at Oracle)
entirely. I think the second explanation is the correct one: it's
simply much more difficult to implement SSI atop a database that does
not already have predicate locking (as we know!)

But I am aware of other cases in which people in the academic community
have done work that could well be of interest to the Postgres community
but didn't submit their work here. In part, that was because they did
not have the time/motivation to get the work into a polished,
acceptable state, and in part because of the reputation of the
community. 

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/


Re: [HACKERS] getting to beta

2011-04-06 Thread Dan Ports
On Wed, Apr 06, 2011 at 12:25:26PM -0500, Kevin Grittner wrote:
 By the way, the problem with SSI potentially running out of shared
 memory is rather parallel to how heavyweight locks can run out of
 shared memory.  The SLRU prevents the number of transactions from
 being limited in that way, and multiple locks per table escalate
 granularity, but with a strange enough workload (for example,
 accessing hundreds of tables per transaction) one might need to
 boost max_pred_locks_per_transaction above the default to avoid
 shared memory exhaustion.

In fact, it's exactly the same: if a backend wants to acquire many
heavyweight locks, it doesn't stop at max_locks_per_xact, it just
keeps allocating them until shmem is exhausted.

So it's possible, if less likely, to have the same problem with regular
locks causing the system to run out of shared memory. Which sounds to
me like a good reason to address both problems in one place.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/


Re: [HACKERS] time table for beta1

2011-04-04 Thread Dan Ports
On Mon, Apr 04, 2011 at 10:33:22AM -0500, Kevin Grittner wrote:
 There are patches for all known issues except one.  Dan Ports was
 able to replicate the latest issue uncovered by YAMAMOTO Takashi
 using a particular DBT-2 configuration, found the issue, and posted
 a patch:

Well, it would be good to have confirmation from Takashi that it
actually fixed the problem he was seeing. I expect it did. If so, then
yes, we do have a handle on all open SSI issues.

 In investigating the locks which were not being cleaned up properly,
 Dan noticed that the pid wasn't showing on SIReadLock rows in
 pg_locks.  He submitted a patch here which would always show the pid
 responsible for the lock:
  
 Jeff Davis questioned whether pid should continue to show after the
 end of the transaction or the closing of the connection (and
 therefore the process which the pid identifies).  Not showing it for
 completed transactions would be trivial.  Showing it after the
 transaction completes, until the connection closes should be doable,
 but not trivial.  Of course, we could just leave it alone, but
 leaving the pid out for these rows looks a little funny and reduces
 useful information a bit.

I see Robert committed that one already. If there's a consensus that
omitting the pid for committed transactions is the right thing to do,
I'm happy to put together a patch. I think that is a better approach
than trying to keep it after commit until the connection closes, but
all of this is sufficiently minor that it's probably not worth worrying
much about.

 The one issue without a reasonable patch is that there are now three
 HTABs in shared memory which can grow until shared memory is
 exhausted, rather than the one in heavyweight locks which we had
 prior to 9.1.  I think we're agreed that this is a bad thing, but my
 attempts to address this so far haven't satisfied Heikki.  Heikki
 suggested an approach, but didn't respond as to whether I should try
 to code it up.  I wasn't sure whether he might be going at it
 himself.  I'll happily take a run at it if people want that.

I believe implementing that should just be a matter of making
get_hash_entry bail out before element_alloc if the right flag is set,
because partitioned hash tables already don't split buckets.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/


Re: [HACKERS] time table for beta1

2011-04-04 Thread Dan Ports
On Mon, Apr 04, 2011 at 07:04:59PM -0400, Robert Haas wrote:
 On Mon, Apr 4, 2011 at 6:41 PM, Stephen Frost sfr...@snowman.net wrote:
  What'd be horribly useful would be the pid and the *time* that the lock
  was taken..  Knowing just the pid blows, since the pid could technically
  end up reused (tho not terribly likely) in the time frame you're trying
  to figure out what happened during..
 
 Well, I don't think we're likely to redesign pg_locks at this point,
 so it's a question of making the best use of the fields we have to
 work with.

Agreed. Note that the vxid of the transaction that took the lock is
included in there, so that's at least something you could correlate
with a logfile.

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/


Re: [HACKERS] SSI bug?

2011-04-03 Thread Dan Ports
I think I see what is going on now. We are sometimes failing to set the
commitSeqNo correctly on the lock. In particular, if a lock assigned to
OldCommittedSxact is marked with InvalidSerCommitSeqNo, it will never be
cleared.

The attached patch corrects this:
 TransferPredicateLocksToNewTarget should initialize a new lock
 entry's commitSeqNo to that of the old one being transferred, or take
 the minimum commitSeqNo if it is merging two lock entries.

 Also, CreatePredicateLock should initialize commitSeqNo to
 InvalidSerCommitSeqNo instead of to 0. (I don't think using 0 would
 actually affect anything, but we should be consistent.)

 I also added a couple of assertions I used to track this down: a
 lock's commitSeqNo should never be zero, and it should be
 InvalidSerCommitSeqNo if and only if the lock is not held by
 OldCommittedSxact.

Takashi, does this patch fix your problem with leaked SIReadLocks?

Dan


-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index f6e49fe..d166da3 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -2067,7 +2067,7 @@ CreatePredicateLock(const PREDICATELOCKTARGETTAG *targettag,
 		SHMQueueInsertBefore(&(target->predicateLocks), &(lock->targetLink));
 		SHMQueueInsertBefore(&(sxact->predicateLocks),
 							 &(lock->xactLink));
-		lock->commitSeqNo = 0;
+		lock->commitSeqNo = InvalidSerCommitSeqNo;
 	}
 
 	LWLockRelease(partitionLock);
@@ -2500,6 +2500,7 @@ TransferPredicateLocksToNewTarget(const PREDICATELOCKTARGETTAG oldtargettag,
 			SHM_QUEUE  *predlocktargetlink;
 			PREDICATELOCK *nextpredlock;
 			PREDICATELOCK *newpredlock;
+			SerCommitSeqNo oldCommitSeqNo = oldpredlock->commitSeqNo;
 
 			predlocktargetlink = &(oldpredlock->targetLink);
 			nextpredlock = (PREDICATELOCK *)
@@ -2544,8 +2545,17 @@ TransferPredicateLocksToNewTarget(const PREDICATELOCKTARGETTAG oldtargettag,
 									 &(newpredlock->targetLink));
 				SHMQueueInsertBefore(&(newpredlocktag.myXact->predicateLocks),
 									 &(newpredlock->xactLink));
-				newpredlock->commitSeqNo = InvalidSerCommitSeqNo;
+				newpredlock->commitSeqNo = oldCommitSeqNo;
 			}
+			else
+			{
+				if (newpredlock->commitSeqNo < oldCommitSeqNo)
+					newpredlock->commitSeqNo = oldCommitSeqNo;
+			}
+
+			Assert(newpredlock->commitSeqNo != 0);
+			Assert((newpredlock->commitSeqNo == InvalidSerCommitSeqNo)
+				   || (newpredlock->tag.myXact == OldCommittedSxact));
 
 			oldpredlock = nextpredlock;
 		}
@@ -3130,6 +3140,8 @@ ClearOldPredicateLocks(void)
 		 offsetof(PREDICATELOCK, xactLink));
 
 		LWLockAcquire(SerializableXactHashLock, LW_SHARED);
+		Assert(predlock->commitSeqNo != 0);
+		Assert(predlock->commitSeqNo != InvalidSerCommitSeqNo);
 		canDoPartialCleanup = (predlock->commitSeqNo <= PredXact->CanPartialClearThrough);
 		LWLockRelease(SerializableXactHashLock);
 
@@ -3254,6 +3266,8 @@ ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
 		 errhint("You might need to increase max_pred_locks_per_transaction.")));
 			if (found)
 			{
+				Assert(predlock->commitSeqNo != 0);
+				Assert(predlock->commitSeqNo != InvalidSerCommitSeqNo);
 				if (predlock->commitSeqNo < sxact->commitSeqNo)
 					predlock->commitSeqNo = sxact->commitSeqNo;
 			}



[HACKERS] trivial patch: show SIREAD pids in pg_locks

2011-04-01 Thread Dan Ports
While looking into a SSI bug, I noticed that we don't actually display
the pid of the holding transaction, even though we have that
information available.

The attached patch fixes that.

One note is that it will show the pid of the backend that executed the
transaction, even if that transaction has already committed. I have no
particular opinion about whether it's more useful to do that or return
null, so went with the smallest change. (The pid is null for PREPARED
or summarized transactions).

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index c6c948c..6d7d4f4 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -368,7 +368,10 @@ pg_lock_status(PG_FUNCTION_ARGS)
 		/* lock holder */
 		values[10] = VXIDGetDatum(xact->vxid.backendId,
 								  xact->vxid.localTransactionId);
-		nulls[11] = true;		/* pid */
+		if (xact->pid != 0)
+			values[11] = Int32GetDatum(xact->pid);
+		else
+			nulls[11] = true;
 
 		/*
 		 * Lock mode. Currently all predicate locks are SIReadLocks, which are



Re: [HACKERS] trivial patch: show SIREAD pids in pg_locks

2011-04-01 Thread Dan Ports
On Fri, Apr 01, 2011 at 12:20:25PM -0500, Kevin Grittner wrote:
 I thought we already had that, but clearly I was mistaken.

Yeah, so did I. Turns out we had the vxid but not the pid. IIRC, we
weren't tracking a SERIALIZABLEXACT's pid yet, at the time we wrote the
code for pg_locks.

 I guess the question is whether it's OK to include this during the
 alpha testing phase.  Even though it's a little bit of a stretch to
 call it a bug, the argument could be made that omitting information
 which all the other rows in the view have is an inconsistency which
 borders on being a bug.  The small size and verifiable safety of the
 patch work in its favor.

There's no urgent need to have this, although it's obviously more
correct than the current behavior. It might be useful for debugging. The
patch is also all of four lines. :-)

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/


Re: [HACKERS] SSI bug?

2011-03-31 Thread Dan Ports
On Thu, Mar 31, 2011 at 11:06:30AM -0500, Kevin Grittner wrote:
 The only thing I've been on the fence about is whether it
 makes more sense to allocate it all up front or to continue to allow
 incremental allocation but set a hard limit on the number of entries
 allocated for each shared memory HTAB.  Is there a performance-
 related reason to choose one path or the other?

Seems like it would be marginally better to allocate it up front -- then
you don't have the cost of having to split buckets later as it grows.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/


Re: [HACKERS] SSI bug?

2011-03-26 Thread Dan Ports
On Fri, Mar 25, 2011 at 04:06:30PM -0400, Tom Lane wrote:
 Up to now, I believe the lockmgr's lock table is the only shared hash
 table that is expected to grow past the declared size; that can happen
 anytime a session exceeds max_locks_per_transaction, which we consider
 to be only a soft limit.  I don't know whether SSI has introduced any
 other hash tables that behave similarly.  (If it has, we might want to
 rethink the amount of slop space we leave in shmem...)

I looked into this recently and the two lockmgr tables were indeed the
only ones that could vary in size. IIRC, the only other shared hash
tables were the shared buffer index and the shmem index.

SSI adds two more analogous tables (per-lock-target and per-xact-lock),
both of which are sized according to max_pred_locks_per_transaction,
which is also a soft limit. It also adds a per-transaction shared hash
table, but that has a clear maximum size.

I find the soft limit on htab size a strange model, and I suspect it
might be a source of problems now that we've got more than one table
that can actually exceed its limit. (Particularly so given that once
shmem gets allocated to one htab, there's no getting it back.)

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/


Re: [HACKERS] SSI bug?

2011-03-18 Thread Dan Ports

It would probably also be worth monitoring the size of pg_locks to see
how many predicate locks are being held.

On Fri, Mar 18, 2011 at 12:50:16PM -0500, Kevin Grittner wrote:
 Even with the above information it may be far from clear where
 allocations are going past their maximum, since one HTAB could grab
 more than its share and starve another which is staying below its
 maximum.  I'll take a look at the possibility of adding a warning
 or some such when an HTAB expands past its maximum size.

Yes -- considering how few shared memory HTABs have sizes that are
really dynamic, I'd be inclined to take a close look at SSI and
max_predicate_locks_per_transaction regardless of where the failed
allocation took place. But I am surprised to see that error message
without SSI's hint about increasing max_predicate_locks_per_xact.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/


Re: [HACKERS] SSI bug?

2011-03-01 Thread Dan Ports
On Tue, Mar 01, 2011 at 07:07:42PM +0200, Heikki Linnakangas wrote:
 Was there test cases for any of the issues fixed by this patch that we 
 should add to the suite?

Some of these issues are tricky to test, e.g. some of the code about
transferring predicate locks to a new target doesn't get exercised
unless an index page gets split while there are concurrent
transactions holding locks on that page.

I have not been able to find a good way to test these other than
system-level testing using a concurrent workload (usually the DBT-2
benchmark). I'd certainly be open to suggestions!

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/


Re: [HACKERS] SSI bug?

2011-02-28 Thread Dan Ports
An updated patch to address this issue is attached. It fixes a couple
issues related to use of the backend-local lock table hint:

  - CheckSingleTargetForConflictsIn now correctly handles the case
where a lock that's being held is not reflected in the local lock
table. This fixes the assertion failure reported in this thread.

  - PredicateLockPageCombine now retains locks for the page that is
being removed, rather than removing them. This prevents a
potentially dangerous false-positive inconsistency where the local
lock table believes that a lock is held, but it is actually not.

  - add some more comments documenting the times when the local lock
table can be inconsistent with reality, as reflected in the shared
memory table.

This patch also incorporates Kevin's changes to copy locks when
creating a new version of a tuple rather than trying to maintain a
linkage between different versions. So this is a patch that should
apply against HEAD and addresses all outstanding SSI bugs known to
Kevin or myself.

Besides the usual regression and isolation tests, I have tested this
by running DBT-2 on a 16-core machine to verify that there are no
assertion failures that only show up under concurrent access.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index ba01874..7a0e1a9c 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -824,7 +824,6 @@ restart:
 	if (_bt_page_recyclable(page))
 	{
 		/* Okay to recycle this page */
-		Assert(!PageIsPredicateLocked(rel, blkno));
 		RecordFreeIndexPage(rel, blkno);
 		vstate->totFreePages++;
 		stats->pages_deleted++;
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index d660ce5..580af2a 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -124,10 +124,6 @@
  *	SerializableXactHashLock
  *		- Protects both PredXact and SerializableXidHash.
  *
- *	PredicateLockNextRowLinkLock
- *		- Protects the priorVersionOfRow and nextVersionOfRow fields of
- *			PREDICATELOCKTARGET when linkage is being created or destroyed.
- *
  *
  * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -444,8 +440,6 @@ static void ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
 		   bool summarize);
 static bool XidIsConcurrent(TransactionId xid);
 static void CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag);
-static bool CheckSingleTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag,
-		  PREDICATELOCKTARGETTAG *nexttargettag);
 static void FlagRWConflict(SERIALIZABLEXACT *reader, SERIALIZABLEXACT *writer);
 static void OnConflict_CheckForSerializationFailure(const SERIALIZABLEXACT *reader,
 		SERIALIZABLEXACT *writer);
@@ -1044,7 +1038,6 @@ InitPredicateLocks(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		PredXact->NeedTargetLinkCleanup = false;
 		requestSize = mul_size((Size) max_table_size,
 							   PredXactListElementDataSize);
 		PredXact->element = ShmemAlloc(requestSize);
@@ -1651,9 +1644,11 @@ PageIsPredicateLocked(const Relation relation, const BlockNumber blkno)
  * Important note: this function may return false even if the lock is
  * being held, because it uses the local lock table which is not
  * updated if another transaction modifies our lock list (e.g. to
- * split an index page). However, it will never return true if the
- * lock is not held. We only use this function in circumstances where
- * such false negatives are acceptable.
+ * split an index page). However, it will almost never return true if
+ * the lock is not held; it can only do so in rare circumstances when
+ * a coarser-granularity lock that covers this one is being held.  We
+ * are careful to only use this function in circumstances where such
+ * errors are acceptable.
  */
 static bool
 PredicateLockExists(const PREDICATELOCKTARGETTAG *targettag)
@@ -1717,6 +1712,9 @@ GetParentPredicateLockTag(const PREDICATELOCKTARGETTAG *tag,
 /*
  * Check whether the lock we are considering is already covered by a
  * coarser lock for our transaction.
+ *
+ * Like PredicateLockExists, this function might return a false
+ * negative, but it will never return a false positive.
  */
 static bool
 CoarserLockCovers(const PREDICATELOCKTARGETTAG *newtargettag)
@@ -1747,7 +1745,6 @@ static void
 RemoveTargetIfNoLongerUsed(PREDICATELOCKTARGET *target, uint32 targettaghash)
 {
 	PREDICATELOCKTARGET *rmtarget;
-	PREDICATELOCKTARGET *next;
 
 	Assert(LWLockHeldByMe(SerializablePredicateLockListLock));
 
@@ -1755,33 +1752,6 @@ RemoveTargetIfNoLongerUsed(PREDICATELOCKTARGET *target, uint32 targettaghash)
 	if 

Re: [HACKERS] SSI bug?

2011-02-22 Thread Dan Ports
On Tue, Feb 22, 2011 at 10:51:05AM -0600, Kevin Grittner wrote:
 Dan Ports d...@csail.mit.edu wrote:
  
  It looks like CheckTargetForConflictsIn is making the assumption
  that the backend-local lock table is accurate, which was probably
  even true at the time it was written.
  
 I remember we decided that it could only be false in certain ways
 which allowed us to use it as a lossy first-cut test in a couple
 places.  I doubt that we can count on any of that any longer, and
 should drop those heuristics.

Hmm. The theory was before that the local lock table would only have
false negatives, i.e. if it says we hold a lock then we really do. That
makes it a useful heuristic because we can bail out quickly if we're
trying to re-acquire a lock we already hold. It seems worthwhile to try
to preserve that.

This property holds as long as the only time one backend edits
another's lock list is to *add* a new lock, not remove an existing one.
Fortunately, this is true most of the time (at a glance, it seems to be
the case for the new tuple update code).  

There are two exceptions; I think they're both OK but we need to be
careful here.
 - if we're forced to promote an index page lock's granularity to avoid
   running out of shared memory, we remove the fine-grained one. This
   is fine as long as we relax our expectations to "if the local
   lock table says we hold a lock, then we hold that lock or one that
   covers it", which is acceptable for current users of the table.
 - if we combine two index pages, we remove the lock entry for the page
   being deleted. I think that's OK because the page is being removed
   so we will not make any efforts to lock it.


 Yeah, as far as a I can recall the only divergence was in *page*
 level entries for *indexes* until this latest patch.  We now have
 *tuple* level entries for *heap* relations, too.

*nod*. I'm slightly concerned about the impact of that on granularity
promotion -- the new locks created by heap updates won't get counted
toward the lock promotion thresholds. That's not a fatal problem or
anything, but it could make granularity promotion less effective at
conserving lock entries.

  The solution is only slightly more complicated than just removing
  the assertion.
  
 That's certainly true for that one spot, but we need an overall
 review of where we might be trying to use LocalPredicateLockHash for
 first cut tests as an optimization.

Yes, I'd planned to go through the references to LocalPredicateLockHash
to make sure none of them were making any unwarranted assumptions about
the results. Fortunately, there are not too many of them...

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/


Re: [HACKERS] SSI bug?

2011-02-22 Thread Dan Ports
On Tue, Feb 22, 2011 at 05:54:49PM -0600, Kevin Grittner wrote:
 I'm not sure it's safe to assume that the index page won't get
 reused before the local lock information is cleared.  In the absence
 of a clear proof that it is safe, or some enforcement mechanism to
 ensure that it is, I don't think we should make this assumption. 
 Off-hand I can't think of a clever way to make this safe which would
 cost less than taking out the LW lock and checking the definitive
 shared memory HTAB, but that might be for lack of creative thinking
 at the moment..

Hmm. Yeah, I wasn't sure about that one, and having now thought about
it some more I think it isn't safe -- consider adding a lock on an
index page concurrently with another backend merging that page into
another one.

The obvious solution to me is to just keep the lock on both the old and
new page. The downside is that because this requires allocating a new
lock and is in a context where we're not allowed to fail, we'll need to
fall back on acquiring the relation lock just as we do for page splits.
I was going to bemoan the extra complexity this would add -- but
actually, couldn't we just replace PredicateLockPageCombine with a call
to PredicateLockPageSplit since they'd now do the same thing?

 The only alternative I see would be to use some form of asynchronous
 notification of the new locks so that the local table can be
 maintained.  That seems overkill without some clear evidence that it
 is needed. 

I agree. It is certainly weird and undesirable that the backend-local
lock table is not always accurate, but I don't see a good way to keep
it up to date without the cure being worse than the disease.

 I *really* wouldn't want to go back to needing LW locks
 to maintain this info; as you know (and stated only for the benefit
 of the list), it was a pretty serious contention point in early
 profiling and adding the local table was a big part of getting an
 early benchmark down from a 14+% performance hit for SSI to a 1.8%
 performance hit.

Yes, it's definitely important for a backend to be able to check
whether it's already holding a lock (even if that's just a hint)
without having to take locks.

Let me add one more piece of info for the benefit of the list: a
backend's local lock table contains not just locks held by the backend,
but also an entry and refcount for every parent of a lock it holds.
This is used to determine when to promote to one of the coarser-grained
parent locks. It's both unnecessary and undesirable for that info to be
in shared memory.

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/


Re: [HACKERS] SSI bug?

2011-02-21 Thread Dan Ports
On Mon, Feb 21, 2011 at 11:42:36PM +, YAMAMOTO Takashi wrote:
 i tested ede45e90dd1992bfd3e1e61ce87bad494b81f54d + ssi-multi-update-1.patch
 with my application and got the following assertion failure.

 #4  0x0827977e in CheckTargetForConflictsIn (targettag=0xbfbfce78)
 at predicate.c:3657
 3657Assert(locallock != NULL);

It looks like CheckTargetForConflictsIn is making the assumption that
the backend-local lock table is accurate, which was probably even true
at the time it was written. Unfortunately, it hasn't been for a while,
and the new changes for tuple versions make it more likely that this
will actually come up.

The solution is only slightly more complicated than just removing the
assertion. Unless Kevin beats me to it, I'll put together a patch later
tonight or tomorrow. (I'm at the airport right now.)

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/


Re: [HACKERS] SSI bug?

2011-02-17 Thread Dan Ports
On Wed, Feb 16, 2011 at 10:13:35PM +, YAMAMOTO Takashi wrote:
 i got the following SEGV when runnning vacuum on a table.
 (the line numbers in predicate.c is different as i have local modifications.)
 oldlocktag.myTarget was NULL.
 it seems that TransferPredicateLocksToNewTarget sometimes use stack garbage
 for newpredlocktag.myTarget.  vacuum on the table succeeded with the attached
 patch.  the latter part of the patch was necessary to avoid targetList
 corruption, which later seems to make DeleteChildTargetLocks loop inifinitely.

Oops. Those are both definitely bugs (and my fault). Your patch looks
correct. Thanks for catching that!

Dan

-- 
Dan R. K. Ports  MIT CSAIL  http://drkp.net/


Re: [HACKERS] SSI patch version 14

2011-02-08 Thread Dan Ports
On Tue, Feb 08, 2011 at 11:25:34AM +0200, Heikki Linnakangas wrote:
 On 08.02.2011 10:43, Kevin Grittner wrote:
  I see that at least three BuildFarm critters don't have UINT64_MAX
  defined.
 
 I guess we'll have to just #define it ourselves. Or could we just pick 
 another magic value, do we actually rely on InvalidSerCommitSeqno being 
 higher than all other values anywhere?

As far as I know we don't specifically rely on that anywhere, and
indeed I did have it #defined to 1 at one point (with the other
constants adjusted to match) and I don't recall any problems. But given
that we most often use InvalidSerCommitSeqNo to mean "not committed
yet", it made more sense to set it to UINT64_MAX so that if a
comparison did sneak in it would do the right thing.

I did dust off a copy of the ANSI standard at the time, and it was
pretty explicit that UINT64_MAX is supposed to be defined in stdint.h.
But that may just be a C99 requirement (I didn't have an older copy of
the standard), and it's obviously no guarantee that it actually is
defined.
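For reference, the usual workaround when a platform's stdint.h lacks the macro is a guarded fallback definition along these lines (a sketch only; not necessarily the spelling PostgreSQL adopted):

```c
#include <assert.h>
#include <stdint.h>

/* Fallback for headers that predate C99's requirement that stdint.h
 * define UINT64_MAX. */
#ifndef UINT64_MAX
#define UINT64_MAX ((uint64_t) 0xFFFFFFFFFFFFFFFFULL)
#endif

/* Setting InvalidSerCommitSeqNo ("not committed yet") to the maximum
 * value means a stray comparison still sorts it after every real
 * commit sequence number. */
static const uint64_t InvalidSerCommitSeqNo = UINT64_MAX;

/* Illustrative helper showing the comparison direction that makes the
 * choice of UINT64_MAX safe. */
static int
committed_before(uint64_t seqno_a, uint64_t seqno_b)
{
    return seqno_a < seqno_b;
}
```

With this definition, a not-yet-committed transaction never compares as having committed before a real one.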

Dan



Re: [HACKERS] SSI patch version 14

2011-02-08 Thread Dan Ports
On Tue, Feb 08, 2011 at 10:14:44AM -0600, Kevin Grittner wrote:
 I do have some concern that if this defaults to too low a number,
 those who try SSI without bumping it and restarting the postmaster
 will not like the performance under load very much.  SSI performance
 would not be affected by a low setting under light load when there
 isn't a long-running READ WRITE transaction.

If we're worried about this, we could add a log message the first time
SummarizeOldestCommittedXact is called, to suggest increasing the GUC
for number of SerializableXacts. This also has the potential benefit of
alerting the user that there's a long-running transaction, in case that's
unexpected (say, if it were caused by a wedged client).

I don't have any particular opinion on what the default value of the
GUC should be.

Dan



Re: [HACKERS] SSI patch version 14

2011-02-08 Thread Dan Ports
One other nit re. the predicate lock table size GUCs: the out-of-memory
case in RegisterPredicateLockingXid (predicate.c:1592 in my tree) gives
the hint to increase max_predicate_locks_per_transaction. I don't think
that's correct, since that GUC isn't used to size SerializableXidHash.

In fact, that error shouldn't arise at all, because if there was room in
PredXact to register the transaction, then there should be room to
register its xid in SerializableXidHash. Except that it's possible for
something else to allocate all of our shared memory and thus prevent
SerializableXidHash from reaching its intended max capacity.

In general, it might be worth considering making a HTAB's max_size a
hard limit, but that's a larger issue. Here, it's probably worth just
removing the hint.

Dan



Re: [HACKERS] SSI patch version 14

2011-02-08 Thread Dan Ports
On Tue, Feb 08, 2011 at 04:04:39PM -0600, Kevin Grittner wrote:
 (2)  The predicate lock and lock target initialization code was
 initially copied and modified from the code for heavyweight locks. 
 The heavyweight lock code adds 10% to the calculated maximum size. 
 So I wound up doing that for PredicateLockTargetHash and
 PredicateLockHash, but didn't do it for SerializableXidHash. 
 Should I eliminate this from the first two, add it to the third, or
 leave it alone?

Actually, I think for SerializableXidHash we should probably just
initially allocate it at its maximum size. Then it'll match the
PredXact list which is allocated in full upfront, and there's no risk
of being able to allocate a transaction but not register its xid. In
fact, I believe there would be no way for starting a new serializable
transaction to fail.

Dan



Re: [HACKERS] SSI patch version 14

2011-02-08 Thread Dan Ports
On Tue, Feb 08, 2011 at 09:09:48PM -0500, Robert Haas wrote:
 "No way to fail" is a tall order.

Well, no way to fail due to running out of shared memory in
RegisterPredicateLock/RegisterPredicateLockingXid, but that doesn't
have quite the same ring to it...

 If we don't allocate all the memory up front, does that allow memory
 to be dynamically shared between different hash tables in shared
 memory?  I'm thinking not, but...

Not in a useful way. If we only allocate some of the memory up front,
then the rest goes into the global shmem pool (actually, that has
nothing to do with the hash table per se, just the ShmemSize
calculations), and it's up for grabs for any hash table that wants to
expand, even beyond its declared maximum capacity. But once it's
claimed by a hash table it can't get returned.

This doesn't sound like a feature to me.

In particular, I'd worry that something that allocates a lot of locks
(either of the heavyweight or predicate variety) would fill up the
associated hash table, and then we're out of shared memory for the
other hash tables -- and have no way to get it back short of restarting
the whole system.

 Frankly, I think this is an example of how our current shared memory
 model is a piece of garbage.  Our insistence on using sysv shm, and
 only sysv shm, is making it impossible for us to do things that other
 products can do easily.  My first reaction to this whole discussion
 was "who gives a crap about 2MB of shared memory?" and then I said
 "oh, right, we do, because it might cause someone who was going to get
 24MB of shared buffers to get 16MB instead, and then performance will
 suck even worse than it does already."  But of course the person
 should really be running with 256MB or more, in all likelihood, and
 would be happy to have us do that right out of the box if it didn't
 require them to do a tap-dance around their kernel settings and reboot.

I'm completely with you on this. 

Dan



Re: [HACKERS] SSI patch version 14

2011-01-31 Thread Dan Ports
On Sun, Jan 30, 2011 at 04:01:56PM -0600, Kevin Grittner wrote:
 I'm wondering how this differs from what is discussed in Section 2.7
 (Serialization Graph Testing) of Cahill's doctoral thesis.  That
 discusses a technique for trying to avoid false positives by testing
 the full graph for cycles, with the apparent conclusion that the
 costs of doing so are prohibitive.  The failure of attempts to
 implement that technique seem to have been part of the impetus to
 develop the SSI technique, where a particular structure involving two
 to three transactions has been proven to exist in all graphs which
 form such a cycle.

I'm not sure. My very limited understanding is that people have tried
to do concurrency control via serialization graph testing but it's (at
least thought to be) too expensive to actually use. This seems to be
saying the opposite of that, so there must be some difference...

 I've been able to identify four causes for false positives in the
 current SSI implementation:
  
 (1)  Collateral reads.  In particular, data skew caused by inserts
 past the end of an index between an ANALYZE and subsequent data
 access was a recurring source of performance complaints.  This was
 fixed by having the planner probe the ends of an index to correct the
 costs in such situations.  This has been effective at correcting the
 target problem, but we haven't yet excluded such index probes from
 predicate locking.

I wasn't aware of this one (which probably means you mentioned it at
some point and I dropped that packet). Seems like it would not be too
difficult to exclude these -- for 9.2.

 (3)  Dependencies other than read-write.
[...]
 one has to be very careful about assuming anything
 else; trying to explicitly track these conflicts and consider that T2
 may have appeared to run before T1 can fall apart completely in the
 face of some common and reasonable programming practices.

Yes. If you want to do precise cycle testing you'll have to track these
dependencies also, and I believe that would require quite a different
design from what we're doing.

 (4)  Length of read-write dependency (a/k/a rw-conflict) chains.
[...]
 They also, as it
 happens, provide enough data to fully trace the read-write
 dependencies and avoid some false positives where the dangerous
 structure SSI is looking for exists, but there is neither a complete
 rw-conflict cycle, nor any transaction in the graph which committed
 early enough to make a write-read conflict possible to any
 transaction in the graph.  Whether such rigorous tracing prevents
 enough false positives to justify the programming effort, code
 complexity, and run-time cost is anybody's guess.

I think I understand what you're getting at here, and it does sound
quite complicated for a benefit that is not clear.

 I only raise these to clarify the issue for the Jeff (who is
 reviewing the patch), since he asked.  I strongly feel that none of
 them are issues which need to be addressed for 9.1, nor do I think
 they can be properly addressed within the time frame of this CF.  

Absolutely, no question about it!

Dan



Re: [HACKERS] SSI patch version 14

2011-01-28 Thread Dan Ports
On Thu, Jan 27, 2011 at 09:18:23AM -0800, Jeff Davis wrote:
 On Tue, 2011-01-25 at 05:57 -0500, Dan Ports wrote:
  This summary is right on. I would add one additional detail or
  clarification to the last point, which is that rather than checking for
  a cycle, we're checking for a transaction with both "in" and "out"
  conflicts, which every cycle must contain.
 
 To clarify, this means that it will get some false positives, right?

Yes, this is correct.

 Is there a reason we can't check for a real cycle, which would let T1
 succeed?

I talked today with someone who experimented with doing exactly that in
MySQL as part of a research project ("Perfectly Serializable Snapshot
Isolation", paper forthcoming in ICDE).

It confirmed my intuition that this is possible but not as
straightforward as it sounds. Some complications I thought of in
adapting that to what we're doing:

1) it requires tracking all edges in the serialization graph; besides
   the rw-conflicts we track there are also the more mundane
   wr-dependencies (if T2 sees an update made by T1, then T2 has to
   come after T1) and ww-dependencies (if T2's write modifies a tuple
   created by T1, then T2 has to come after T1).
   
   We are taking advantage of the fact that any cycle must have two
   adjacent rw-conflict edges, but the rest of the cycle can be
   composed of other types. It would certainly be possible to track the
   others, but it would add a bit more complexity and use more memory.

2) it requires doing a depth-first search to check for a cycle, which
   is more expensive than just detecting two edges. That seems OK if
   you only want to check it on commit, but maybe not if you want to
   detect serialization failures at the time of the conflicting
   read/write (as we do).

3) it doesn't seem to interact very well with our memory mitigation
   efforts, wherein we discard lots of data about committed
   transactions that we know we won't need. If we were checking for
   cycles, we'd need to keep more of it. I suspect summarization (when
   we combine two transactions' predicate locks to save memory) would
   also cause problems.

None of these seem particularly insurmountable, but they suggest it
isn't a clear win to try to find an entire cycle.
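To illustrate point 2, a full cycle test over a toy dependency graph amounts to a depth-first search like the sketch below (purely illustrative; nothing like PostgreSQL's actual structures, which only look for two adjacent rw-conflict edges):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_XACTS 16

/* adj[u][v] means "u must appear before v" in the serialization order,
 * whatever the dependency type (rw, wr, or ww). */
static bool adj[MAX_XACTS][MAX_XACTS];

enum { WHITE, GRAY, BLACK };    /* unvisited / on DFS stack / finished */

static bool
dfs_finds_cycle(int n, int u, int *color)
{
    color[u] = GRAY;
    for (int v = 0; v < n; v++)
    {
        if (!adj[u][v])
            continue;
        if (color[v] == GRAY)
            return true;        /* back edge: a cycle exists */
        if (color[v] == WHITE && dfs_finds_cycle(n, v, color))
            return true;
    }
    color[u] = BLACK;
    return false;
}

static bool
graph_has_cycle(int n)
{
    int color[MAX_XACTS] = {WHITE};
    for (int u = 0; u < n; u++)
        if (color[u] == WHITE && dfs_finds_cycle(n, u, color))
            return true;
    return false;
}

/* Demo: T0 -> T1 -> T2, optionally closed back to T0 to form a cycle. */
static bool
demo(bool close_the_loop)
{
    for (int i = 0; i < MAX_XACTS; i++)
        for (int j = 0; j < MAX_XACTS; j++)
            adj[i][j] = false;
    adj[0][1] = adj[1][2] = true;
    adj[2][0] = close_the_loop;
    return graph_has_cycle(3);
}
```

The search visits every reachable edge, which is why it is noticeably more expensive than the constant-time two-edge check, especially if run on every conflicting read or write rather than once at commit.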

Dan



Re: [HACKERS] SSI patch version 14

2011-01-26 Thread Dan Ports
On Wed, Jan 26, 2011 at 10:01:28AM -0600, Kevin Grittner wrote:
 In looking at it just now, I noticed that after trying it in a
 couple different places what was left in the repository was not the
 optimal version for code coverage.  I've put this back to the
 version which did a better job, for reasons described in the commit
 comment:

Isn't this placement the same as the version we had before that didn't
work?

Specifically, aren't we going to have problems running with
TEST_OLDSERXID enabled because CreatePredTran succeeds and links a new
SerializableXact into the active list, but we don't initialize it
before we drop SerializableXactHashLock to call
SummarizeOldestCommittedSxact?

I seem to recall SummarizeOldestCommittedSxact failing before because
of the uninitialized entry, but more generally since we drop the lock
something else might scan the list.

(This isn't a problem in the non-test case because we'd only do that if
CreatePredTran fails.)

Dan



Re: [HACKERS] SSI patch version 14

2011-01-26 Thread Dan Ports
On Wed, Jan 26, 2011 at 01:42:23PM -0600, Kevin Grittner wrote:
 Dan, do you still have access to that machine you were using for the
 DBT-2 runs?  Could we get a coverage run with and without
 TEST_OLDSERXID defined?

Sure, I'll give it a shot (once I figure out how to enable coverage...)

Dan



Re: [HACKERS] SSI patch version 14

2011-01-26 Thread Dan Ports
On Wed, Jan 26, 2011 at 02:36:25PM -0600, Kevin Grittner wrote:
 Same benefit in terms of exercising more lines of code, but
 *without* exposing the uninitialized structure to other threads.

Won't this cause a deadlock because locks are being acquired out of
order?

Dan



Re: [HACKERS] SSI patch version 14

2011-01-25 Thread Dan Ports
Thanks for working your way through this patch. I'm certainly well
aware that that's not a trivial task!

I'm suffering through a bout of insomnia, so I'll respond to some of
your high-level comments in hopes that serializability will help put me
to sleep (as it often does). I'll leave the more detailed code comments
for later when I'm actually looking at the code, or better yet Kevin
will take care of them and I won't have to. ;-)

On Tue, Jan 25, 2011 at 01:07:39AM -0800, Jeff Davis wrote:
 At a high level, there is a nice conceptual simplicity. Let me try to
 summarize it as I understand it:
   * RW dependencies are detected using predicate locking.
   * RW dependencies are tracked from the reading transaction (as an
 "out" conflict); and from the writing transaction (as an "in"
 conflict).
   * Before committing a transaction, then by looking only at the RW 
 dependencies (and predicate locks) for current and past 
 transactions, you can determine if committing this transaction will
 result in a cycle (and therefore a serialization anomaly); and if
 so, abort it.

This summary is right on. I would add one additional detail or
clarification to the last point, which is that rather than checking for
a cycle, we're checking for a transaction with both "in" and "out"
conflicts, which every cycle must contain.
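In toy form, that test is just the following (a deliberately simplified sketch; the real check in predicate.c also reasons about commit ordering before deciding to abort):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of a serializable transaction's conflict flags;
 * not PostgreSQL's actual SerializableXact structure. */
typedef struct
{
    bool has_conflict_in;   /* a concurrent xact read data we wrote */
    bool has_conflict_out;  /* we read data a concurrent xact wrote */
} ToyXact;

/* Every cycle in the serialization graph must pass through a "pivot"
 * transaction with both flags set, so aborting on pivots is sufficient
 * to prevent anomalies -- at the cost of some false positives. */
static bool
is_dangerous_pivot(const ToyXact *x)
{
    return x->has_conflict_in && x->has_conflict_out;
}
```

A transaction with only one of the two conflicts can never be part of a cycle on its own, which is why the check is so cheap.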

 That's where the simplicity ends, however ;)

Indeed!

 Tracking of RW conflicts of current and past transactions is more
 complex. Obviously, it doesn't keep _all_ past transactions, but only
 ones that overlap with a currently-running transaction. It does all of
 this management using SHMQueue. There isn't much of an attempt to
 gracefully handle OOM here as far as I can tell, it just throws an error
 if there's not enough room to track a new transaction (which is
 reasonable, considering that it should be quite rare and can be
 mitigated by increasing max_connections). 

If the OOM condition you're referring to is the same one from the
following comment, then it can't happen: (Apologies if I've
misunderstood what you're referring to.)

 * In RegisterSerializableTransactionInt, if it doesn't get an sxact, it
 goes into summarization. But summarization assumes that it has at least
 one finished xact. Is that always true? If you have enough memory to
 hold a transaction for each connection, plus max_prepared_xacts, plus
 one, I think that's true. But maybe that could be made more clear?

Yes -- the SerializableXact pool is allocated up front and it
definitely has to be bigger than the number of possible active
transactions. In fact, it's much larger: 10 * (MaxBackends +
max_prepared_xacts) to allow some room for the committed transactions
we still have to track.
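A sketch of that sizing arithmetic (the multiplier is as described here; the parameter values in the comment are examples, not defaults):

```c
#include <assert.h>

/* The SerializableXact pool is allocated up front: room for every
 * possible active transaction, times ten, to leave headroom for
 * committed transactions that must still be tracked.
 * E.g. max_backends = 100, max_prepared_xacts = 5 gives 1050 slots. */
static int
sxact_pool_size(int max_backends, int max_prepared_xacts)
{
    return 10 * (max_backends + max_prepared_xacts);
}
```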

 * In RegisterSerializableTransactionInt(), for a RO xact, it considers
 any concurrent RW xact a possible conflict. It seems like it would be
 possible to know whether a RW transaction may have overlapped with any
 committed RW transaction (in finishedLink queue), and only add those as
 potential conflicts. Would that work? If so, that would make more
 snapshots safe.

Interesting idea. That's worth some careful thought. I think it's
related to the condition that the RW xact needs to commit with a
conflict out to a transaction earlier than the RO xact. My first guess
is that this wouldn't make more transactions safe, but could detect
safe snapshots faster.

 * When a transaction finishes, then PID should probably be set to zero.
 You only use it for waking up a deferrable RO xact waiting for a
 snapshot, right?

Correct. It probably wouldn't hurt to clear that field when releasing
the transaction, but we don't use it after.

 * I'm a little unclear on summarization and writing to the SLRU. I don't
 see where it's determining that the predicate locks can be safely
 released. Couldn't the oldest transaction still have relevant predicate
 locks?

When a SerializableXact gets summarized to the SLRU, its predicate
locks aren't released; they're transferred to the dummy
OldCommittedSxact.

 I'll keep working on this patch. I hope I can be of some help getting
 this committed, because I'm looking forward to this feature. And I
 certainly think that you and Dan have applied the amount of planning,
 documentation, and careful implementation necessary for a feature like
 this.

Hopefully my comments here will help clarify the patch. It's not lost
on me that there's no shortage of complexity in the patch, so if you
found anything particularly confusing we should probably add some
documentation to README-SSI.

Dan



Re: [HACKERS] SSI and Hot Standby

2011-01-20 Thread Dan Ports
 What I'm still not clear on is why that HS is different. Whatever rules
 apply on the master must also apply on the standby, immutably. Why is it
 we need to pass explicit snapshot information from master to standby? We
 don't do that, except at startup for normal HS. Why do we need that?
 
 I hear, but do not yet understand, that the SSI transaction sequence on
 the master may differ from the WAL transaction sequence. Is it important
 that the ordering on the master would differ from the standby?

The logical serializable ordering of transactions in SSI doesn't
necessarily match the commit time ordering (i.e. the WAL sequence). For
example, with two concurrent transactions, T1 might commit after T2,
even though it didn't see the changes made by T2 and thus has to be
considered earlier.

It doesn't matter whether T1 committed before T2 or the other way
around, as long as no other transaction can tell the difference. If
someone saw the changes made by T1 but not those made by T2, they'd see
T2 as happening before T1, violating serializability. Our SSI code
ensures that doesn't happen by tracking read dependencies. If it
detects that such a read is happening, it rolls back one of the
transactions involved.

Now, if we extend this to hot standby, if T2 commits before T1 on the
master, it obviously will on the slave too. A transaction run on the
slave at the right time might be able to see that T2 has happened but
not T1, which is unserializable. If that transaction had been run on the
master, then it would have been detected and something would have been
rolled back, but the master has no way to know what data is being read
on the slave.

What Kevin is suggesting is that we already have a mechanism for
identifying snapshots where serialization failures like these will
never happen. If we pass that information to the slave and allow it to
run transactions only on those snapshots, serializability is safe.

Hopefully that made some more sense...

Dan



Re: [HACKERS] SSI and Hot Standby

2011-01-20 Thread Dan Ports
On Fri, Jan 21, 2011 at 08:44:59AM +0200, Heikki Linnakangas wrote:
 We have enough information in the standby to reconstruct all writes done 
 in the master. I gather that's not enough, in order to roll back 
 read-only transaction T3 on the standby which would see an anomaly, we'd 
 also need to know what reads T1 and T2 did in the master. Is that correct?

That's some of the information we need, but it's not enough...

The problem is that the conflict might not be discovered until after T3
(the reader) commits. In that case, it's too late to abort T3, so you'd
need to roll back T2 instead. But that means a read-only transaction on
the slave has to be able to cause a concurrent read-write transaction
on the master to abort, which brings with it no end of problems.

To make that a little more concrete, let me borrow Kevin's favorite
batch processing example...

 [master] T2: BEGIN
 [master] T2: SELECT FROM control
 [master] T1: BEGIN
 [master] T1: UPDATE control
 [master] T1: COMMIT
  [slave] T3: BEGIN
  [slave] T3: SELECT FROM control, receipt
  [slave] T3: COMMIT
 [master] T2: INSERT INTO receipt
 [master] T2: COMMIT

If this all happened at the master, T2 would get rolled back when it
tries to do its INSERT. (I just tried it.) But if T3 happened on the
slave, the master doesn't know that it read both tables, nor does the
slave know at the time it's executing T3 that it's going to conflict
with T2. 

Dan



Re: [HACKERS] SSI and Hot Standby

2011-01-19 Thread Dan Ports
Kevin's suggestion seems eminently reasonable to me and probably the
best approach one can do for SSI and hot standby. Pulling it off in
time for 9.1 would be a stretch; 9.2 seems quite doable.

It's worth noting that one way or another, the semantics of
SERIALIZABLE transactions on hot standby replicas could be surprising
to some. There's no getting around this; serializability in distributed
systems is just a hard problem in general. Either we go with Kevin's
suggestion of treating SERIALIZABLE transactions as DEFERRABLE (whether
now or for 9.2), causing them to have to use an older snapshot or block
until an acceptable snapshot becomes available; or we require them to
be downgraded to REPEATABLE READ either implicitly or explicitly.

Now, neither of these is as alarming as they might sound, given that
replication lag is a fact of life for hot standby systems and
REPEATABLE READ is exactly the same as the current (9.0) SERIALIZABLE
behavior. But it's definitely something that should be addressed in
documentation.

Dan


