Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-04-21 Thread Jeff Davis
On Sun, 2012-03-04 at 16:39 +, Simon Riggs wrote:
  v3 attached.

Added to next commitfest.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-09 Thread Simon Riggs
On Fri, Mar 9, 2012 at 3:46 AM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Mar 7, 2012 at 5:41 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Case #2 is certainly a problem for FrozenXID as well, because anything
 that's marked with FrozenXID is going to look visible to everybody,
 including our older snapshots.  And I gather you're saying it's also a
 problem for HEAP_XMIN_COMMITTED.

 The problem there is that later subtransactions often have xids that
 are greater than xmax, so the xid shows as running when we do
 XidInMVCCSnapshot(), which must then be altered for this one weird
 case. I tried that and was not happy with the result.

 Altering XidInMVCCSnapshot() seems like a good thing to avoid, but I
 confess I don't quite follow what you're describing here otherwise.

 I had assumed that the way we were
 fixing this problem was to disable these optimizations for
 transactions that had more than one snapshot floating around.  I'm not
 sure whether the patch does that or not, but I think it probably needs
 to

 It does. I thought you already read the patch?

 I glanced over it, but did not look through it in detail.  I'll do a
 more careful look at your next version.

I'm not confident about the restrictions this patch imposes and we
aren't close enough to a final version for me to honestly request this
be considered for this release. I think it's time to close this door
for now.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-09 Thread Robert Haas
On Fri, Mar 9, 2012 at 2:59 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Fri, Mar 9, 2012 at 3:46 AM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Mar 7, 2012 at 5:41 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Case #2 is certainly a problem for FrozenXID as well, because anything
 that's marked with FrozenXID is going to look visible to everybody,
 including our older snapshots.  And I gather you're saying it's also a
 problem for HEAP_XMIN_COMMITTED.

 The problem there is that later subtransactions often have xids that
 are greater than xmax, so the xid shows as running when we do
 XidInMVCCSnapshot(), which must then be altered for this one weird
 case. I tried that and was not happy with the result.

 Altering XidInMVCCSnapshot() seems like a good thing to avoid, but I
 confess I don't quite follow what you're describing here otherwise.

 I had assumed that the way we were
 fixing this problem was to disable these optimizations for
 transactions that had more than one snapshot floating around.  I'm not
 sure whether the patch does that or not, but I think it probably needs
 to

 It does. I thought you already read the patch?

 I glanced over it, but did not look through it in detail.  I'll do a
 more careful look at your next version.

 I'm not confident about the restrictions this patch imposes and we
 aren't close enough to a final version for me to honestly request this
 be considered for this release. I think it's time to close this door
 for now.

OK, makes sense.  I was trying to hold my nose, because I really would
like to see this stuff work better than it does, but I had my doubts,
too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-08 Thread Robert Haas
On Wed, Mar 7, 2012 at 5:41 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Case #2 is certainly a problem for FrozenXID as well, because anything
 that's marked with FrozenXID is going to look visible to everybody,
 including our older snapshots.  And I gather you're saying it's also a
 problem for HEAP_XMIN_COMMITTED.

 The problem there is that later subtransactions often have xids that
 are greater than xmax, so the xid shows as running when we do
 XidInMVCCSnapshot(), which must then be altered for this one weird
 case. I tried that and was not happy with the result.

Altering XidInMVCCSnapshot() seems like a good thing to avoid, but I
confess I don't quite follow what you're describing here otherwise.

 I had assumed that the way we were
 fixing this problem was to disable these optimizations for
 transactions that had more than one snapshot floating around.  I'm not
 sure whether the patch does that or not, but I think it probably needs
 to

 It does. I thought you already read the patch?

I glanced over it, but did not look through it in detail.  I'll do a
more careful look at your next version.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-07 Thread Noah Misch
On Tue, Mar 06, 2012 at 08:36:05AM -0500, Robert Haas wrote:
 On Tue, Mar 6, 2012 at 5:43 AM, Noah Misch n...@leadboat.com wrote:
  Well, consider something like CLUSTER.  It's perfectly OK for CLUSTER
  to operate on a table that has been truncated since CLUSTER's snapshot
  was taken, and no serialization anomaly is created that would not have
  already existed as a result of the non-MVCC-safe TRUNCATE.  On the
  other hand, if CLUSTER operates on a table that was created since
  CLUSTER's snapshot was taken, then you have a bona fide serialization
  anomaly.
 
  Core CLUSTER does not use any MVCC snapshot.  We do push one for the benefit
  of functions called during the reindex phase, but it does not appear that you
  speak of that snapshot.  Could you elaborate this example?
 
 Imagine this:
 
 - Transaction #1 acquires a snapshot.
 - Transaction #2 creates table A, inserts a row into table B, and then commits.
 - Transaction #1 tries to CLUSTER A and then select from B.
 
 The only serial execution schedules are T1 < T2, in which case the
 transaction fails, or T2 < T1, in which case the row is seen.  But
 what actually happens is that the row isn't seen and yet the
 transaction doesn't fail.

For the purpose of contemplating this anomaly, one could just as well replace
CLUSTER with GRANT, COMMENT ON TABLE, or any other command that operates on a
table, correct?

I agree this test case is good to keep in mind while designing, but we could
well conclude not to bother improving it.



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-07 Thread Robert Haas
On Wed, Mar 7, 2012 at 7:49 AM, Noah Misch n...@leadboat.com wrote:
 On Tue, Mar 06, 2012 at 08:36:05AM -0500, Robert Haas wrote:
 On Tue, Mar 6, 2012 at 5:43 AM, Noah Misch n...@leadboat.com wrote:
  Well, consider something like CLUSTER.  It's perfectly OK for CLUSTER
  to operate on a table that has been truncated since CLUSTER's snapshot
  was taken, and no serialization anomaly is created that would not have
  already existed as a result of the non-MVCC-safe TRUNCATE.  On the
  other hand, if CLUSTER operates on a table that was created since
  CLUSTER's snapshot was taken, then you have a bona fide serialization
  anomaly.
 
  Core CLUSTER does not use any MVCC snapshot.  We do push one for the benefit
  of functions called during the reindex phase, but it does not appear that you
  speak of that snapshot.  Could you elaborate this example?

 Imagine this:

 - Transaction #1 acquires a snapshot.
 - Transaction #2 creates table A, inserts a row into table B, and then commits.
 - Transaction #1 tries to CLUSTER A and then select from B.

 The only serial execution schedules are T1 < T2, in which case the
 transaction fails, or T2 < T1, in which case the row is seen.  But
 what actually happens is that the row isn't seen and yet the
 transaction doesn't fail.

 For the purpose of contemplating this anomaly, one could just as well replace
 CLUSTER with GRANT, COMMENT ON TABLE, or any other command that operates on a
 table, correct?

 I agree this test case is good to keep in mind while designing, but we could
 well conclude not to bother improving it.

All true.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-07 Thread Simon Riggs
On Wed, Mar 7, 2012 at 2:59 PM, Robert Haas robertmh...@gmail.com wrote:

 All true.

So gentlemen, do we think this is worth pursuing further for this release?

I'm sure usual arguments apply all round, so I'll skip that part.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-07 Thread Robert Haas
On Wed, Mar 7, 2012 at 10:26 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, Mar 7, 2012 at 2:59 PM, Robert Haas robertmh...@gmail.com wrote:
 All true.

 So gentlemen, do we think this is worth pursuing further for this release?

 I'm sure usual arguments apply all round, so I'll skip that part.

This patch is awfully late to the party, but if we can nail it down
reasonably quickly I guess I'd be in favor of slipping something in.
I am not thrilled with the design as it stands, but bulk loading is a
known and serious pain point for us, so it would be awfully nice to
improve it.  I'm not sure whether we should only go as far as setting
HEAP_XMIN_COMMITTED or whether we should actually try to mark the
tuples with FrozenXID.  The former has the advantage of (I think) not
requiring any other changes to preserve MVCC semantics while the
latter is, obviously, a bigger performance improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-07 Thread Simon Riggs
On Wed, Mar 7, 2012 at 5:39 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Mar 7, 2012 at 10:26 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, Mar 7, 2012 at 2:59 PM, Robert Haas robertmh...@gmail.com wrote:
 All true.

 So gentlemen, do we think this is worth pursuing further for this release?

 I'm sure usual arguments apply all round, so I'll skip that part.

 This patch is awfully late to the party, but if we can nail it down
 reasonably quickly I guess I'd be in favor of slipping something in.

Cool

 I am not thrilled with the design as it stands, but bulk loading is a
 known and serious pain point for us, so it would be awfully nice to
 improve it.  I'm not sure whether we should only go as far as setting
 HEAP_XMIN_COMMITTED or whether we should actually try to mark the
 tuples with FrozenXID.  The former has the advantage of (I think) not
 requiring any other changes to preserve MVCC semantics while the
 latter is, obviously, a bigger performance improvement.

It's the other way around. Setting to FrozenTransactionId makes the
test in XidInMVCCSnapshot() pass when accessed by later commands in
the same transaction. If we just set the hint we need to play around
to get it accepted. So the frozen route is both best for performance
and least impact on fastpath visibility code. That part of the code is
solid.

The only problem is the visibility from older snapshots; we just
need/desire a better place to put the test. So I'll finish that off.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-07 Thread Robert Haas
On Wed, Mar 7, 2012 at 2:06 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I am not thrilled with the design as it stands, but bulk loading is a
 known and serious pain point for us, so it would be awfully nice to
 improve it.  I'm not sure whether we should only go as far as setting
 HEAP_XMIN_COMMITTED or whether we should actually try to mark the
 tuples with FrozenXID.  The former has the advantage of (I think) not
 requiring any other changes to preserve MVCC semantics while the
 latter is, obviously, a bigger performance improvement.

 It's the other way around. Setting to FrozenTransactionId makes the
 test in XidInMVCCSnapshot() pass when accessed by later commands in
 the same transaction. If we just set the hint we need to play around
 to get it accepted. So the frozen route is both best for performance
 and least impact on fastpath visibility code. That part of the code is
 solid.

Your comment is reminding me that there are actually two problems
here, or at least I think there are.

1. Some other transaction might look at the tuples.
2. An older snapshot (e.g. cursor) might look at the tuples.

Case #1 can happen when we create a table, insert some data, and
commit, and then some other transaction that took a snapshot before we
committed reads the table.  It's OK if the tuples are marked
HEAP_XMIN_COMMITTED, because if we abort no other transaction will
ever see the new pg_class row as alive, and therefore no other
transaction can examine the table contents.  But using FrozenXID as
the tuple xmin would allow those tuples to be seen by a transaction
that took its snapshot before we committed; this is the problem that
relvalidxid is designed to fix, and what I was thinking of when I said
that we need more infrastructure to handle the FrozenXID case.

Case #2 is certainly a problem for FrozenXID as well, because anything
that's marked with FrozenXID is going to look visible to everybody,
including our older snapshots.  And I gather you're saying it's also a
problem for HEAP_XMIN_COMMITTED.  I had assumed that the way we were
fixing this problem was to disable these optimizations for
transactions that had more than one snapshot floating around.  I'm not
sure whether the patch does that or not, but I think it probably needs
to, unless you have some other idea for how to fix this.  It doesn't
seem like an important restriction in practice because it's unlikely
that anyone would keep a cursor open across a bulk data load - and if
they do, this isn't the only problem they're going to have.
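The two cases can be made concrete with a toy model of snapshot visibility. This is an illustrative sketch, not PostgreSQL's actual XidInMVCCSnapshot()/HeapTupleSatisfiesMVCC code; the simplified snapshot tuple and the XID values are invented for the example:

```python
FROZEN_XID = 2  # PostgreSQL's FrozenTransactionId: "committed forever"

def xid_in_mvcc_snapshot(xid, snap_xmin, snap_xmax, xip):
    """Toy snapshot test: True means the xid counts as still running,
    so its tuples are invisible to this snapshot."""
    if xid >= snap_xmax:
        return True          # began at/after the snapshot: in progress
    if xid < snap_xmin:
        return False         # finished before the snapshot window
    return xid in xip        # was it running at snapshot time?

def tuple_visible(tuple_xmin, snap):
    if tuple_xmin == FROZEN_XID:
        return True          # frozen rows look committed to *every* snapshot
    return not xid_in_mvcc_snapshot(tuple_xmin, *snap)

# Case #1: a snapshot taken before the loading transaction (xid 100)
# committed.  With a real xmin the rows stay invisible, as they should:
old_snap = (90, 95, set())           # (xmin, xmax, in-progress xids)
print(tuple_visible(100, old_snap))          # False
# ...but freezing the rows leaks them to that same old snapshot:
print(tuple_visible(FROZEN_XID, old_snap))   # True
```

This is why setting FrozenXID needs extra machinery (relvalidxid) that a plain HEAP_XMIN_COMMITTED hint does not: the hint still leaves a real xmin for the snapshot test to reject.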

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-07 Thread Simon Riggs
On Wed, Mar 7, 2012 at 8:21 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Mar 7, 2012 at 2:06 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I am not thrilled with the design as it stands, but bulk loading is a
 known and serious pain point for us, so it would be awfully nice to
 improve it.  I'm not sure whether we should only go as far as setting
 HEAP_XMIN_COMMITTED or whether we should actually try to mark the
 tuples with FrozenXID.  The former has the advantage of (I think) not
 requiring any other changes to preserve MVCC semantics while the
 latter is, obviously, a bigger performance improvement.

 It's the other way around. Setting to FrozenTransactionId makes the
 test in XidInMVCCSnapshot() pass when accessed by later commands in
 the same transaction. If we just set the hint we need to play around
 to get it accepted. So the frozen route is both best for performance
 and least impact on fastpath visibility code. That part of the code is
 solid.

 Your comment is reminding me that there are actually two problems
 here, or at least I think there are.

 1. Some other transaction might look at the tuples.
 2. An older snapshot (e.g. cursor) might look at the tuples.

 Case #1 can happen when we create a table, insert some data, and
 commit, and then some other transaction that took a snapshot before we
 committed reads the table.  It's OK if the tuples are marked
 HEAP_XMIN_COMMITTED, because if we abort no other transaction will
 ever see the new pg_class row as alive, and therefore no other
 transaction can examine the table contents.  But using FrozenXID as
 the tuple xmin would allow those tuples to be seen by a transaction
 that took its snapshot before we committed; this is the problem that
 relvalidxid is designed to fix, and what I was thinking of when I said
 that we need more infrastructure to handle the FrozenXID case.

Yes. If your purpose was to summarise, then yes that's where we're at.

 Case #2 is certainly a problem for FrozenXID as well, because anything
 that's marked with FrozenXID is going to look visible to everybody,
 including our older snapshots.  And I gather you're saying it's also a
 problem for HEAP_XMIN_COMMITTED.

The problem there is that later subtransactions often have xids that
are greater than xmax, so the xid shows as running when we do
XidInMVCCSnapshot(), which must then be altered for this one weird
case. I tried that and was not happy with the result.

 I had assumed that the way we were
 fixing this problem was to disable these optimizations for
 transactions that had more than one snapshot floating around.  I'm not
 sure whether the patch does that or not, but I think it probably needs
 to

It does. I thought you already read the patch?

, unless you have some other idea for how to fix this.  It doesn't
 seem like an important restriction in practice because it's unlikely
 that anyone would keep a cursor open across a bulk data load - and if
 they do, this isn't the only problem they're going to have.

Exactly.

So we're all good, apart from deciding the exact place to prevent
other transactions' older snapshots from seeing the newly frozen rows.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-06 Thread Noah Misch
On Sun, Mar 04, 2012 at 01:02:57PM +, Simon Riggs wrote:
 More detailed thoughts show that the test in heap_beginscan_internal()
 is not right enough, i.e. wrong.
 
 We need a specific XidInMVCCSnapshot test on the relvalidxid, so it
 needs to be a specific xid, not an xmin because otherwise we can get
 concurrent transactions failing, not just older transactions.

Good point; I agree.  indcheckxmin's level of pessimism isn't appropriate for
this new check.

 If we're going to freeze tuples on load this needs to be watertight, so
 some minor rework needed.
 
 Of course, if we only have a valid xid on the class we might get new
 columns added when we do repeated SELECT * statements using the same
 snapshot while concurrent DDL occurs. That is impractical, so if we
 define this as applying to rows it can work; if we want it to apply to
 everything its getting more difficult.

Excess columns seem less grave to me than excess or missing rows.  I'm having
difficulty thinking up an explanation for that opinion.



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-06 Thread Noah Misch
On Mon, Mar 05, 2012 at 03:46:16PM -0500, Robert Haas wrote:
 On Mon, Mar 5, 2012 at 2:22 PM, Noah Misch n...@leadboat.com wrote:
  I can see this strategy applying to many relation-pertinent system catalogs.
  Do you foresee applications to non-relation catalogs?
 
 Well, in theory, we have similar issues if, say, a query uses a
 function that didn't exist at the time the snapshot as taken; the
 actual results the user sees may not be consistent with any serial
 execution schedule.  And the same could be true for any other SQL
 object.  It's unclear that those cases are as compelling as this one,
 but then again it's unclear that no one will ever want to fix them,
 either.  For example, suppose we have a view v over a table t that
 calls a function f.  Somebody alters f to give different results and,
 in the same transaction, modifies the contents of t (but no DDL).
 This doesn't strike me as a terribly unlikely scenario; the change to
 t could well be envisioned as a compensating transaction.  But now if
 somebody uses the new definition of f against the old contents of t,
 the user may fail to get what they were hoping for out of bundling
 those changes together in one transaction.

Good example.

 Now, maybe we're never going to fix those kinds of anomalies anyway,
 but if we go with this architecture, then I think the chances of it
 ever being palatable to try are pretty low.

Why?

  But it's not quite the
  same as the xmin of the row itself, because some updates might be
  judged not to matter.  There could also be intermediate cases where
  updates are invalidating for some purposes but not others.  I think
  we'd better get our hands around more of the problem space before we
  start trying to engineer solutions.
 
  I'm not seeing that problem.  Any operation that would update some xmin
  horizon should set it to the greater of its current value and the value the
  operation needs for its own correctness.  If you have something in mind that
  needs more, could you elaborate?

Simon's point about xmin vs. xid probably leads to an example.  One value is
fine for TRUNCATE, because only the most recent TRUNCATE matters.  Not all DDL
is so simple.
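The "greater of its current value and the value the operation needs" rule quoted above amounts to a monotonic, wraparound-aware maximum. A sketch under those assumptions, treating relvalidxid as the hypothetical per-relation horizon the patch proposes:

```python
def xid_precedes(a, b):
    """Sketch of circular 32-bit XID comparison (a la TransactionIdPrecedes):
    compare via a signed 32-bit difference, not plain integer order."""
    diff = (a - b) & 0xFFFFFFFF
    if diff >= 2**31:
        diff -= 2**32
    return diff < 0

def advance_horizon(current, required):
    # Only ever move the stored horizon forward: a later TRUNCATE
    # overrides an earlier one, while a stale update is a no-op.
    return required if xid_precedes(current, required) else current

print(advance_horizon(100, 200))   # 200: the newer operation wins
print(advance_horizon(200, 100))   # 200: an older operation cannot regress it
```

One value per relation suffices only while every qualifying operation wants the same "max" semantics, which is Noah's point: TRUNCATE fits, but not all DDL need be that simple.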

 Well, consider something like CLUSTER.  It's perfectly OK for CLUSTER
 to operate on a table that has been truncated since CLUSTER's snapshot
 was taken, and no serialization anomaly is created that would not have
 already existed as a result of the non-MVCC-safe TRUNCATE.  On the
 other hand, if CLUSTER operates on a table that was created since
 CLUSTER's snapshot was taken, then you have a bona fide serialization
 anomaly.

Core CLUSTER does not use any MVCC snapshot.  We do push one for the benefit
of functions called during the reindex phase, but it does not appear that you
speak of that snapshot.  Could you elaborate this example?

 Maybe not a very important one, but does that prove that
 there's no significant problem of this type in general, or just
 nobody's thought through all the cases yet?  After all, the issues
 with CREATE TABLE/TRUNCATE vs. a concurrent SELECT have been around
 for a very long time, and we're only just getting around to looking at
 them, so I don't have much confidence that there aren't other cases
 floating around out there.

Granted.

Thanks,
nm



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-06 Thread Robert Haas
On Tue, Mar 6, 2012 at 5:43 AM, Noah Misch n...@leadboat.com wrote:
 Now, maybe we're never going to fix those kinds of anomalies anyway,
 but if we go with this architecture, then I think the chances of it
 ever being palatable to try are pretty low.

 Why?

Because it'll require at least one XID column in every system catalog
that represents an SQL catalog to plug all the cases, and I doubt very
much that we want to go there.

 Simon's point about xmin vs. xid probably leads to an example.  One value is
 fine for TRUNCATE, because only the most recent TRUNCATE matters.  Not all DDL
 is so simple.

Yep.

 Well, consider something like CLUSTER.  It's perfectly OK for CLUSTER
 to operate on a table that has been truncated since CLUSTER's snapshot
 was taken, and no serialization anomaly is created that would not have
 already existed as a result of the non-MVCC-safe TRUNCATE.  On the
 other hand, if CLUSTER operates on a table that was created since
 CLUSTER's snapshot was taken, then you have a bona fide serialization
 anomaly.

 Core CLUSTER does not use any MVCC snapshot.  We do push one for the benefit
 of functions called during the reindex phase, but it does not appear that you
 speak of that snapshot.  Could you elaborate this example?

Imagine this:

- Transaction #1 acquires a snapshot.
- Transaction #2 creates table A, inserts a row into table B, and then commits.
- Transaction #1 tries to CLUSTER A and then select from B.

The only serial execution schedules are T1 < T2, in which case the
transaction fails, or T2 < T1, in which case the row is seen.  But
what actually happens is that the row isn't seen and yet the
transaction doesn't fail.
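The anomaly can be simulated with a toy timeline (purely illustrative; the "clock" and object names are invented to show why the observed outcome matches neither serial order):

```python
# Toy timeline: T1's MVCC snapshot is taken at time 0; T2 then creates
# table A and inserts a row into table B, committing both at time 1.
committed_at = {}
clock = 0

def commit(objects):
    global clock
    clock += 1
    for obj in objects:
        committed_at[obj] = clock

t1_snapshot = clock                      # T1 acquires its snapshot
commit(["table A", "row in B"])          # T2 commits both changes

# CLUSTER locates the table with a SnapshotNow-style catalog lookup,
# so it sees A and succeeds even though T1's snapshot predates A:
cluster_succeeds = "table A" in committed_at
# The SELECT uses T1's MVCC snapshot, so a row committed after the
# snapshot is invisible:
select_sees_row = committed_at["row in B"] <= t1_snapshot

# T1 < T2 serially: CLUSTER would fail (A does not exist yet).
# T2 < T1 serially: CLUSTER succeeds AND the row is seen.
# The actual outcome matches neither serial schedule:
print(cluster_succeeds, select_sees_row)   # True False
```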

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-05 Thread Robert Haas
On Sun, Mar 4, 2012 at 11:39 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Marti, please review this latest version which has new isolation tests added.

 This does both TRUNCATE and CREATE TABLE.

I don't see any need for a GUC to control this behavior.  The current
behavior is wrong, so if we're going to choose this path to fix it,
then we should just fix it, period.  The narrow set of circumstances
in which it might be beneficial to disable this behavior doesn't seem
to me to be sufficient to justify a behavior-changing GUC.

It does not seem right that the logic for detecting the serialization
error is in heap_beginscan_internal().  Surely this is just as much of
a problem for an index-scan or index-only-scan.  We don't want to
patch all those places individually, either: I think the check should
happen right around the time we initially lock the relation and build
its relcache entry.

The actual text of the error message could use some work.  Maybe
something like "could not serialize access due to concurrent DDL",
although I think we try to avoid using acronyms like DDL in
translatable strings.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-05 Thread Simon Riggs
On Mon, Mar 5, 2012 at 4:32 PM, Robert Haas robertmh...@gmail.com wrote:
 On Sun, Mar 4, 2012 at 11:39 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Marti, please review this latest version which has new isolation tests added.

 This does both TRUNCATE and CREATE TABLE.

 I don't see any need for a GUC to control this behavior.  The current
 behavior is wrong, so if we're going to choose this path to fix it,
 then we should just fix it, period.  The narrow set of circumstances
 in which it might be beneficial to disable this behavior doesn't seem
 to me to be sufficient to justify a behavior-changing GUC.

I agree the behaviour is wrong; the only question is whether our users
rely in some way on that behaviour. Given the long discussion on that
point earlier, I thought it best to add a GUC. It's easy to remove, now
or later.

 It does not seem right that the logic for detecting the serialization
 error is in heap_beginscan_internal().  Surely this is just as much of
 a problem for an index-scan or index-only-scan.

err, very good point. Doh.

 We don't want to
 patch all those places individually, either: I think the check should
 happen right around the time we initially lock the relation and build
 its relcache entry.

OK, that makes sense and works if we need to rebuild relcache.

 The actual text of the error message could use some work.  Maybe
 something like "could not serialize access due to concurrent DDL",
 although I think we try to avoid using acronyms like DDL in
 translatable strings.

Yeah, that was designed-to-be-replaced text. We do use DDL already
elsewhere without really explaining it; it's also one of those acronyms
that doesn't actually explain what it really means very well. So I
like the phrase you suggest.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-05 Thread Simon Riggs
On Mon, Mar 5, 2012 at 4:46 PM, Simon Riggs si...@2ndquadrant.com wrote:

 It does not seem right that the logic for detecting the serialization
 error is in heap_beginscan_internal().  Surely this is just as much of
 a problem for an index-scan or index-only-scan.

 err, very good point. Doh.

 We don't want to
 patch all those places individually, either: I think the check should
 happen right around the time we initially lock the relation and build
 its relcache entry.

 OK, that makes sense and works if we need to rebuild relcache.

Except the reason to do it at the start of the scan is that it is the
first time a specific snapshot has been associated with a relation, and
also the last point we can apply the check before the errant behaviour
occurs.

If we reject locks against tables that might be used against an
illegal snapshot then we could easily prevent valid snapshot use cases
when a transaction has multiple snapshots, one illegal, one not.

We can certainly have a looser test when we first get the lock and
then another test later, but I don't think we can avoid making all
scans apply this test. And while I'm there, we have to add tests for
things like index build scans.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-05 Thread Robert Haas
On Mon, Mar 5, 2012 at 11:46 AM, Simon Riggs si...@2ndquadrant.com wrote:
 I agree the behaviour is wrong; the only question is whether our users
 rely in some way on that behaviour. Given the long discussion on that
 point earlier, I thought it best to add a GUC. It's easy to remove, now
 or later.

AFAICT, all the discussion upthread was about whether we ought to be
trying to implement this in some entirely-different way that doesn't
rely on storing XIDs in the catalog.  I have a feeling that there are
a whole lot more cases like this and that we're in for a lot of
unpleasant nastiness if we go very far down this route; pg_constraint
is another one, as SnapshotNow can see constraints that may not be
valid under the query's MVCC snapshot.  On the other hand, if someone
comes up with a better way, I suppose we can always rip this out.  In
any case, I don't remember anyone saying that this needed to be
configurable.

Speaking of that, though, I have one further thought on this: we need
to be absolutely certain that autovacuum is going to prevent this XID
value from wrapping around.  I suppose this is safe since, even if
autovacuum is turned off, we'll forcibly kick it off every so often to
advance relfrozenxid, and that will reset relvalidxid while it's
there.  But then again on second thought, what if relvalidxid lags
relfrozenxid?  Then the emergency autovacuum might not kick in until
relvalidxid has already wrapped around.  I think that could happen
after a TRUNCATE, perhaps, since I think that would leave relfrozenxid
alone while advancing relvalidxid.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-05 Thread Robert Haas
On Mon, Mar 5, 2012 at 12:42 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, Mar 5, 2012 at 4:46 PM, Simon Riggs si...@2ndquadrant.com wrote:
 It does not seem right that the logic for detecting the serialization
 error is in heap_beginscan_internal().  Surely this is just as much of
 a problem for an index-scan or index-only-scan.

 err, very good point. Doh.

 We don't want to
 patch all those places individually, either: I think the check should
 happen right around the time we initially lock the relation and build
 its relcache entry.

 OK, that makes sense and works if we need to rebuild relcache.

 Except the reason to do it at the start of the scan is that it is the
 first time a specific snapshot has been associated with a relation, and
 also the last point at which we can apply the check before the errant
 behaviour occurs.

 If we reject locks against tables that might be used against an
 illegal snapshot then we could easily prevent valid snapshot use cases
 when a transaction has multiple snapshots, one illegal, one not.

 We can certainly have a looser test when we first get the lock and
 then another test later, but I don't think we can avoid making all
 scans apply this test. And while I'm there, we have to add tests for
 things like index build scans.

Well, there's no point that I can see in having two checks.  I just
dislike the idea that we have to remember to add this check for every
method of accessing the relation - doesn't seem terribly future-proof.
 It gets even worse if you start adding checks to DDL code paths - if
we're going to do that, we really need to cover them all, and that
doesn't seem very practical if they're going to spread out all over
the place.

I don't understand your comment that a snapshot doesn't get associated
with a relation until scan time.  I believe we associate a snapshot
with each query before we even know what relations are involved; that
snapshot then gets passed down to all the individual scans.  The query
also opens and locks those relations.  We ought to be able to arrange
for the query snapshot to be cross-checked at that point, rather than
waiting until scan-start time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-05 Thread Noah Misch
On Mon, Feb 13, 2012 at 09:29:56AM -0500, Robert Haas wrote:
 On Fri, Feb 10, 2012 at 11:46 PM, Noah Misch n...@leadboat.com wrote:
  I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and
  not at READ COMMITTED.  They tend to be narrow race conditions at READ
  COMMITTED, yet easy to demonstrate at REPEATABLE READ.  Related:
  http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php
 
 Yeah.  Well, that's actually an interesting example, because it
 illustrates how general this problem is.  We could potentially get
 ourselves into a situation where just about every system catalog table
 needs an xmin field to store the point at which the object came into
 existence - or for that matter, was updated.

I can see this strategy applying to many relation-pertinent system catalogs.
Do you foresee applications to non-relation catalogs?

In any event, I think a pg_class.relvalidxmin is the right starting point.
One might imagine a family of relvalidxmin, convalidxmin, indcheckxmin
(already exists), inhvalidxmin, and attvalidxmin.  relvalidxmin is like the
AccessExclusiveLock of that family; it necessarily blocks everything that
might impugn the others.  The value in extending this to more catalogs is the
ability to narrow the impact of failing the check.  A failed indcheckxmin
comparison merely excludes plans involving the index.  A failed inhvalidxmin
check might just skip recursion to the table in question.  Those are further
refinements, much like using weaker heavyweight lock types.

 But it's not quite the
 same as the xmin of the row itself, because some updates might be
 judged not to matter.  There could also be intermediate cases where
 updates are invalidating for some purposes but not others.  I think
 we'd better get our hands around more of the problem space before we
 start trying to engineer solutions.

I'm not seeing that problem.  Any operation that would update some xmin
horizon should set it to the greater of its current value and the value the
operation needs for its own correctness.  If you have something in mind that
needs more, could you elaborate?

  Incidentally, people use READ COMMITTED because they don't question the
  default, not because they know hazards of REPEATABLE READ.  I don't know the
  bustedness you speak of; could we improve the documentation to inform folks?
 
 The example that I remember was related to SELECT FOR UPDATE/SELECT
 FOR SHARE.  The idea of those statements is that you want to prevent
 the row from being updated or deleted until some other concurrent
 action is complete; for example, in the case of a foreign key, we'd
 like to prevent the referenced row from being deleted or updated in
 the relevant columns until the inserting transaction is committed.
 But it doesn't work, because when the updating or deleting process
 gets done with the lock wait, they are still using the same snapshot
  as before, and merrily do exactly the thing that the lock-wait was
 supposed to prevent.  If an actual UPDATE is used, it's safe (I
 think): anyone who was going to UPDATE or DELETE the row will fail
 with some kind of serialization error.  But a SELECT FOR UPDATE that
 commits is treated more like an UPDATE that rolls back: it's as if the
 lock never existed.  Someone (Florian?) proposed a patch to change
 this, but it seemed problematic for reasons I no longer exactly
 remember.

Thanks.  I vaguely remember that thread.

nm



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-05 Thread Simon Riggs
On Mon, Mar 5, 2012 at 5:49 PM, Robert Haas robertmh...@gmail.com wrote:

 Well, there's no point that I can see in having two checks.  I just
 dislike the idea that we have to remember to add this check for every
 method of accessing the relation - doesn't seem terribly future-proof.
  It gets even worse if you start adding checks to DDL code paths - if
 we're going to do that, we really need to cover them all, and that
 doesn't seem very practical if they're going to spread out all over
 the place.

Understood. Will look.

 I don't understand your comment that a snapshot doesn't get associated
 with a relation until scan time.  I believe we associate a snapshot
 with each query before we even know what relations are involved; that
 snapshot then gets passed down to all the individual scans.  The query
 also opens and locks those relations.  We ought to be able to arrange
 for the query snapshot to be cross-checked at that point, rather than
 waiting until scan-start time.

What's to stop other code using an older snapshot explicitly? That
fear may be bogus.

Any suggestions? ISTM we don't know whether we're already locked until
we get to LockAcquire() and there's no easy way to pass down snapshot
information through that, let alone handle RI snapshots. Ideas please.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-05 Thread Robert Haas
On Mon, Mar 5, 2012 at 2:22 PM, Noah Misch n...@leadboat.com wrote:
 On Mon, Feb 13, 2012 at 09:29:56AM -0500, Robert Haas wrote:
 On Fri, Feb 10, 2012 at 11:46 PM, Noah Misch n...@leadboat.com wrote:
  I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and
  not at READ COMMITTED.  They tend to be narrow race conditions at READ
  COMMITTED, yet easy to demonstrate at REPEATABLE READ.  Related:
  http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php

 Yeah.  Well, that's actually an interesting example, because it
 illustrates how general this problem is.  We could potentially get
 ourselves into a situation where just about every system catalog table
 needs an xmin field to store the point at which the object came into
 existence - or for that matter, was updated.

 I can see this strategy applying to many relation-pertinent system catalogs.
 Do you foresee applications to non-relation catalogs?

Well, in theory, we have similar issues if, say, a query uses a
function that didn't exist at the time the snapshot was taken; the
actual results the user sees may not be consistent with any serial
execution schedule.  And the same could be true for any other SQL
object.  It's unclear that those cases are as compelling as this one,
but then again it's unclear that no one will ever want to fix them,
either.  For example, suppose we have a view v over a table t that
calls a function f.  Somebody alters f to give different results and,
in the same transaction, modifies the contents of t (but no DDL).
This doesn't strike me as a terribly unlikely scenario; the change to
t could well be envisioned as a compensating transaction.  But now if
somebody uses the new definition of f against the old contents of t,
the user may fail to get what they were hoping for out of bundling
those changes together in one transaction.

Now, maybe we're never going to fix those kinds of anomalies anyway,
but if we go with this architecture, then I think the chances of it
ever being palatable to try are pretty low.

 In any event, I think a pg_class.relvalidxmin is the right starting point.
 One might imagine a family of relvalidxmin, convalidxmin, indcheckxmin
 (already exists), inhvalidxmin, and attvalidxmin.  relvalidxmin is like the
 AccessExclusiveLock of that family; it necessarily blocks everything that
 might impugn the others.  The value in extending this to more catalogs is the
 ability to narrow the impact of failing the check.  A failed indcheckxmin
 comparison merely excludes plans involving the index.  A failed inhvalidxmin
 check might just skip recursion to the table in question.  Those are further
 refinements, much like using weaker heavyweight lock types.

Yes, good parallel.

 But it's not quite the
 same as the xmin of the row itself, because some updates might be
 judged not to matter.  There could also be intermediate cases where
 updates are invalidating for some purposes but not others.  I think
 we'd better get our hands around more of the problem space before we
 start trying to engineer solutions.

 I'm not seeing that problem.  Any operation that would update some xmin
 horizon should set it to the greater of its current value and the value the
 operation needs for its own correctness.  If you have something in mind that
 needs more, could you elaborate?

Well, consider something like CLUSTER.  It's perfectly OK for CLUSTER
to operate on a table that has been truncated since CLUSTER's snapshot
was taken, and no serialization anomaly is created that would not have
already existed as a result of the non-MVCC-safe TRUNCATE.  On the
other hand, if CLUSTER operates on a table that was created since
CLUSTER's snapshot was taken, then you have a bona fide serialization
anomaly.  Maybe not a very important one, but does that prove that
there's no significant problem of this type in general, or just
nobody's thought through all the cases yet?  After all, the issues
with CREATE TABLE/TRUNCATE vs. a concurrent SELECT have been around
for a very long time, and we're only just getting around to looking at
them, so I don't have much confidence that there aren't other cases
floating around out there.

I guess another way to put this is that you could need locks of a
great number of different strengths to really handle all the cases.
It's going to be unappealing to, say, set the relation xmin when
setting the constraint xmin would do, or to fail for a concurrent
TRUNCATE as well as a concurrent CREATE TABLE when only the latter
logically requires a failure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-05 Thread Simon Riggs
On Mon, Mar 5, 2012 at 8:46 PM, Robert Haas robertmh...@gmail.com wrote:

 In any event, I think a pg_class.relvalidxmin is the right starting point.
 One might imagine a family of relvalidxmin, convalidxmin, indcheckxmin
 (already exists), inhvalidxmin, and attvalidxmin.  relvalidxmin is like the
 AccessExclusiveLock of that family; it necessarily blocks everything that
 might impugn the others.  The value in extending this to more catalogs is the
 ability to narrow the impact of failing the check.  A failed indcheckxmin
 comparison merely excludes plans involving the index.  A failed inhvalidxmin
 check might just skip recursion to the table in question.  Those are further
 refinements, much like using weaker heavyweight lock types.

 Yes, good parallel.

Did you guys get my comment about not being able to use an xmin value?
We have to use a specific xid value and do an XidInMVCCSnapshot() test.
Just checking whether you agree/disagree.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-04 Thread Simon Riggs
On Sun, Mar 4, 2012 at 2:28 AM, Marti Raudsepp ma...@juffo.org wrote:
 On Sat, Mar 3, 2012 at 14:53, Simon Riggs si...@2ndquadrant.com wrote:
 Thanks Noah for drawing attention to this thread. I hadn't been
 watching. As you say, this work would allow me to freeze rows at load
 time and avoid the overhead of hint bit setting, which avoids
 performance issues from hint bit setting in checksum patch.

 I've reviewed this patch and it seems OK to me. Good work Marti.

 Thanks! This approach wasn't universally liked, but if it gets us
 tangible benefits (COPY with frozenxid) then I guess it's a reason to
 reconsider.

Comments I see support this idea. If we did this purely for truncate
correctness I probably wouldn't care either, which is why I didn't
read this thread in the first place.

 v2 patch attached, updated to latest HEAD. Patch adds
 * relvalidxid is reset by VACUUM/ANALYZE to avoid wraparound failure

 Personally I'd rather keep this out of ANALYZE -- since its purpose is
 to collect stats; VACUUM is responsible for correctness (xid
 wraparound etc). But I don't feel strongly about this.

If there were a reason to do it, then I would agree. Later you point
out a reason, so I will make this change.

 A more important consideration is how this interacts with hot standby.
 Currently you compare OldestXmin to relvalidxmin to decide when to
 reset it. But the standby's OldestXmin may be older than the master's.
 (If VACUUM removes rows then this causes a recovery conflict, but
 AFAICT that won't happen if only relvalidxmin changes)

As of 9.1, the standby's oldestxmin is incorporated into the master's
via hot_standby_feedback, so it wouldn't typically be a problem these
days. It's possible that the standby has this set off by choice, in
which case anomalies could exist in the case that a VACUUM doesn't
clean any rows, as you say.

So we'll use vacrelstats->latestRemovedXid instead of OldestXmin when
we call vac_update_relstats(), which means ANALYZE should always pass
InvalidTransactionId.

 It might be more robust to wait until relfrozenxid exceeds
 relvalidxmin -- by then, recovery conflict mechanisms will have taken
 care of killing all older snapshots, or am I mistaken?

It would be better to set it as early as possible to reduce the cost
of the test in heap_beginscan().

 And a few typos in the code...

 + gettext_noop("When enabled viewing a truncated or newly created table "
 + "will throw a serialization error to prevent MVCC "
 + "discrepancies that were possible priot to 9.2.")

 prior not priot

Yep

 + * Reset relvalidxmin if its old enough

 Should be it's in this context.

Cool, thanks for the review.

v3 attached.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c910863..4387f49 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -73,7 +73,7 @@
 
 /* GUC variable */
 bool		synchronize_seqscans = true;
-
+bool		StrictMVCC = true;
 
 static HeapScanDesc heap_beginscan_internal(Relation relation,
 		Snapshot snapshot,
@@ -1175,6 +1175,24 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	HeapScanDesc scan;
 
 	/*
+	 * If the snapshot is older than relvalidxmin then that means either that
+	 * a table has only recently been created or that a TRUNCATE has removed
+	 * data that we should still be able to see. Either way, we aren't allowed
+	 * to view the rows in StrictMVCC mode.
+	 */
+	if (StrictMVCC &&
+		TransactionIdIsValid(relation->rd_rel->relvalidxmin) &&
+		TransactionIdIsValid(snapshot->xmax) &&
+		NormalTransactionIdPrecedes(snapshot->xmax, relation->rd_rel->relvalidxmin))
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+				 errmsg("canceling statement due to conflict with TRUNCATE TABLE on %s",
+						NameStr(relation->rd_rel->relname)),
+				 errdetail("User query attempting to see row versions that have been removed.")));
+	}
+
+	/*
 	 * increment relation ref count while scanning relation
 	 *
 	 * This is just to make really sure the relcache entry won't go away while
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index aef410a..3f9bd5d 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -787,6 +787,7 @@ InsertPgClassTuple(Relation pg_class_desc,
 	values[Anum_pg_class_relhastriggers - 1] = BoolGetDatum(rd_rel->relhastriggers);
 	values[Anum_pg_class_relhassubclass - 1] = BoolGetDatum(rd_rel->relhassubclass);
 	values[Anum_pg_class_relfrozenxid - 1] = TransactionIdGetDatum(rd_rel->relfrozenxid);
+	values[Anum_pg_class_relvalidxmin - 1] = TransactionIdGetDatum(rd_rel->relvalidxmin);
 	if (relacl != (Datum) 0)
 		values[Anum_pg_class_relacl - 1] = relacl;
 	else
@@ -884,6 +885,7 @@ AddNewRelationTuple(Relation pg_class_desc,
 		new_rel_reltup->relfrozenxid = InvalidTransactionId;
 	}
 
+	new_rel_reltup->relvalidxmin = InvalidTransactionId;
 	

Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-04 Thread Simon Riggs
On Sun, Mar 4, 2012 at 9:59 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Sun, Mar 4, 2012 at 2:28 AM, Marti Raudsepp ma...@juffo.org wrote:
 On Sat, Mar 3, 2012 at 14:53, Simon Riggs si...@2ndquadrant.com wrote:
 Thanks Noah for drawing attention to this thread. I hadn't been
 watching. As you say, this work would allow me to freeze rows at load
 time and avoid the overhead of hint bit setting, which avoids
 performance issues from hint bit setting in checksum patch.

 I've reviewed this patch and it seems OK to me. Good work Marti.

...

 v3 attached.

More detailed thoughts show that the test in heap_beginscan_internal()
is not right enough, i.e. wrong.

We need a specific XidInMVCCSnapshot test on the relvalidxid, so it
needs to be a specific xid, not an xmin because otherwise we can get
concurrent transactions failing, not just older transactions.

If we're going freeze tuples on load this needs to be watertight, so
some minor rework needed.

Of course, if we only have a valid xid on the class we might get new
columns added when we do repeated SELECT * statements using the same
snapshot while concurrent DDL occurs. That is impractical, so if we
define this as applying to rows it can work; if we want it to apply to
everything it's getting more difficult.

Longer term we might fix this by making all catalog access use MVCC,
but that suffers the same problems with ALTER TABLEs that rewrite rows
to add columns. I don't see a neat future solution that is worth
waiting for.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-04 Thread Simon Riggs
On Sun, Mar 4, 2012 at 1:02 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Sun, Mar 4, 2012 at 9:59 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Sun, Mar 4, 2012 at 2:28 AM, Marti Raudsepp ma...@juffo.org wrote:
 On Sat, Mar 3, 2012 at 14:53, Simon Riggs si...@2ndquadrant.com wrote:
 Thanks Noah for drawing attention to this thread. I hadn't been
 watching. As you say, this work would allow me to freeze rows at load
 time and avoid the overhead of hint bit setting, which avoids
 performance issues from hint bit setting in checksum patch.

 I've reviewed this patch and it seems OK to me. Good work Marti.

 ...

 v3 attached.

 More detailed thoughts show that the test in heap_beginscan_internal()
 is not right enough, i.e. wrong.

 We need a specific XidInMVCCSnapshot test on the relvalidxid, so it
 needs to be a specific xid, not an xmin because otherwise we can get
 concurrent transactions failing, not just older transactions.

Marti, please review this latest version which has new isolation tests added.

This does both TRUNCATE and CREATE TABLE.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c910863..3d90125 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -73,7 +73,7 @@
 
 /* GUC variable */
 bool		synchronize_seqscans = true;
-
+bool		StrictMVCC = true;
 
 static HeapScanDesc heap_beginscan_internal(Relation relation,
 		Snapshot snapshot,
@@ -1175,6 +1175,24 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	HeapScanDesc scan;
 
 	/*
+	 * If the snapshot can't see relvalidxid then that means either the table
+	 * is newly created or has recently been truncated. Either way, we aren't
+	 * allowed to view the rows in StrictMVCC mode.
+	 */
+	if (IsMVCCSnapshot(snapshot) &&
+		StrictMVCC &&
+		XidInMVCCSnapshot(relation->rd_rel->relvalidxid, snapshot))
+	{
+		/* Unless we created or truncated the table recently ourselves */
+		if (relation->rd_createSubid == InvalidSubTransactionId &&
+			relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("cannot access recently added or recently removed data in %s",
+							NameStr(relation->rd_rel->relname))));
+	}
+
+	/*
 	 * increment relation ref count while scanning relation
 	 *
 	 * This is just to make really sure the relcache entry won't go away while
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index aef410a..eb93a7c 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -787,6 +787,7 @@ InsertPgClassTuple(Relation pg_class_desc,
 	values[Anum_pg_class_relhastriggers - 1] = BoolGetDatum(rd_rel->relhastriggers);
 	values[Anum_pg_class_relhassubclass - 1] = BoolGetDatum(rd_rel->relhassubclass);
 	values[Anum_pg_class_relfrozenxid - 1] = TransactionIdGetDatum(rd_rel->relfrozenxid);
+	values[Anum_pg_class_relvalidxid - 1] = TransactionIdGetDatum(rd_rel->relvalidxid);
 	if (relacl != (Datum) 0)
 		values[Anum_pg_class_relacl - 1] = relacl;
 	else
@@ -872,6 +873,7 @@ AddNewRelationTuple(Relation pg_class_desc,
 		 * that will do.
 		 */
 		new_rel_reltup->relfrozenxid = RecentXmin;
+		new_rel_reltup->relvalidxid = GetCurrentTransactionId();
 	}
 	else
 	{
@@ -882,6 +884,7 @@ AddNewRelationTuple(Relation pg_class_desc,
 		 * commands/sequence.c.)
 		 */
 		new_rel_reltup->relfrozenxid = InvalidTransactionId;
+		new_rel_reltup->relvalidxid = InvalidTransactionId;
 	}
 
 	new_rel_reltup->relowner = relowner;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bfbe642..0d96a6a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2854,7 +2854,7 @@ reindex_index(Oid indexId, bool skip_constraint_checks)
 		}
 
 		/* We'll build a new physical relation for the index */
-		RelationSetNewRelfilenode(iRel, InvalidTransactionId);
+		RelationSetNewRelfilenode(iRel, InvalidTransactionId, InvalidTransactionId);
 
 		/* Initialize the index and rebuild */
 		/* Note: we do not need to re-establish pkey setting */
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index b40e57b..44b891c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -538,6 +538,7 @@ do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, bool inh)
 			totalrows,
 			visibilitymap_count(onerel),
 			hasindex,
+			InvalidTransactionId,
 			InvalidTransactionId);
 
 	/*
@@ -558,6 +559,7 @@ do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, bool inh)
 totalindexrows,
 0,
 false,
+InvalidTransactionId,
 InvalidTransactionId);
 		}
 	}
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 1f95abc..ab4aec2 100644
--- a/src/backend/commands/sequence.c
+++ 

Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-03 Thread Simon Riggs
On Sat, Feb 11, 2012 at 4:46 AM, Noah Misch n...@leadboat.com wrote:

 But I have to admit I'm intrigued by the idea of extending this to
 other cases, if there's a reasonable way to do that.  For example, if
 we could fix things up so that we don't see a table at all if it was
 created after we took our snapshot, that would remove one of the
 obstacles to marking pages bulk-loaded into that table with FrozenXID
 and PD_ALL_VISIBLE from the get-go.  I think a lot of people would be
 mighty happy about that.

 Exactly.

 But the necessary semantics seem somewhat different.  For TRUNCATE,
 you want to throw a serialization error; but is that also what you
 want for CREATE TABLE?  Or do you just want it to appear empty?

 I think an error helps just as much there.  If you create a table and populate
 it with data in the same transaction, letting some snapshot see an empty table
 is an atomicity failure.

 Your comment illustrates a helpful point: this is just another kind of
 ordinary serialization failure, one that can happen at any isolation level.
 No serial transaction sequence can yield one of these errors.

Thanks Noah for drawing attention to this thread. I hadn't been
watching. As you say, this work would allow me to freeze rows at load
time and avoid the overhead of hint bit setting, which avoids
performance issues from hint bit setting in checksum patch.

I've reviewed this patch and it seems OK to me. Good work Marti.

v2 patch attached, updated to latest HEAD. Patch adds
* a GUC called strict_mvcc to isolate the new behaviour if required
* relvalidxid is reset by VACUUM/ANALYZE to avoid wraparound failure

At present this lacks docs for strict_mvcc and doesn't attempt to
handle CREATE TABLE case yet, but should be straightforward to do so.

Hint bit setting is in separate patch on other thread.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c910863..4387f49 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -73,7 +73,7 @@
 
 /* GUC variable */
 bool		synchronize_seqscans = true;
-
+bool		StrictMVCC = true;
 
 static HeapScanDesc heap_beginscan_internal(Relation relation,
 		Snapshot snapshot,
@@ -1175,6 +1175,24 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	HeapScanDesc scan;
 
 	/*
+	 * If the snapshot is older than relvalidxmin then that means either that
+	 * a table has only recently been created or that a TRUNCATE has removed
+	 * data that we should still be able to see. Either way, we aren't allowed
+	 * to view the rows in StrictMVCC mode.
+	 */
+	if (StrictMVCC &&
+		TransactionIdIsValid(relation->rd_rel->relvalidxmin) &&
+		TransactionIdIsValid(snapshot->xmax) &&
+		NormalTransactionIdPrecedes(snapshot->xmax, relation->rd_rel->relvalidxmin))
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+				 errmsg("canceling statement due to conflict with TRUNCATE TABLE on %s",
+						NameStr(relation->rd_rel->relname)),
+				 errdetail("User query attempting to see row versions that have been removed.")));
+	}
+
+	/*
 	 * increment relation ref count while scanning relation
 	 *
 	 * This is just to make really sure the relcache entry won't go away while
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index aef410a..3f9bd5d 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -787,6 +787,7 @@ InsertPgClassTuple(Relation pg_class_desc,
 	values[Anum_pg_class_relhastriggers - 1] = BoolGetDatum(rd_rel->relhastriggers);
 	values[Anum_pg_class_relhassubclass - 1] = BoolGetDatum(rd_rel->relhassubclass);
 	values[Anum_pg_class_relfrozenxid - 1] = TransactionIdGetDatum(rd_rel->relfrozenxid);
+	values[Anum_pg_class_relvalidxmin - 1] = TransactionIdGetDatum(rd_rel->relvalidxmin);
 	if (relacl != (Datum) 0)
 		values[Anum_pg_class_relacl - 1] = relacl;
 	else
@@ -884,6 +885,7 @@ AddNewRelationTuple(Relation pg_class_desc,
 		new_rel_reltup->relfrozenxid = InvalidTransactionId;
 	}
 
+	new_rel_reltup->relvalidxmin = InvalidTransactionId;
 	new_rel_reltup->relowner = relowner;
 	new_rel_reltup->reltype = new_type_oid;
 	new_rel_reltup->reloftype = reloftype;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bfbe642..0d96a6a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2854,7 +2854,7 @@ reindex_index(Oid indexId, bool skip_constraint_checks)
 		}
 
 		/* We'll build a new physical relation for the index */
-		RelationSetNewRelfilenode(iRel, InvalidTransactionId);
+		RelationSetNewRelfilenode(iRel, InvalidTransactionId, InvalidTransactionId);
 
 		/* Initialize the index and rebuild */
 		/* Note: we do not need to re-establish pkey setting */
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index b40e57b..0578241 100644
--- a/src/backend/commands/analyze.c
+++ 

Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-03-03 Thread Marti Raudsepp
On Sat, Mar 3, 2012 at 14:53, Simon Riggs si...@2ndquadrant.com wrote:
 Thanks Noah for drawing attention to this thread. I hadn't been
 watching. As you say, this work would allow me to freeze rows at load
 time and avoid the overhead of hint bit setting, which avoids
 performance issues from hint bit setting in checksum patch.

 I've reviewed this patch and it seems OK to me. Good work Marti.

Thanks! This approach wasn't universally liked, but if it gets us
tangible benefits (COPY with frozenxid) then I guess it's a reason to
reconsider.

 v2 patch attached, updated to latest HEAD. Patch adds
 * relvalidxid is reset by VACUUM/ANALYZE to avoid wraparound failure

Personally I'd rather keep this out of ANALYZE -- since its purpose is
to collect stats; VACUUM is responsible for correctness (xid
wraparound etc). But I don't feel strongly about this.

A more important consideration is how this interacts with hot standby.
Currently you compare OldestXmin to relvalidxmin to decide when to
reset it. But the standby's OldestXmin may be older than the master's.
(If VACUUM removes rows then this causes a recovery conflict, but
AFAICT that won't happen if only relvalidxmin changes)

It might be more robust to wait until relfrozenxid exceeds
relvalidxmin -- by then, recovery conflict mechanisms will have taken
care of killing all older snapshots, or am I mistaken?

And a few typos in the code...

+ gettext_noop("When enabled viewing a truncated or newly created table "
+ "will throw a serialization error to prevent MVCC "
+ "discrepancies that were possible priot to 9.2.")

prior not priot

+ * Reset relvalidxmin if its old enough

Should be it's in this context.

Regards,
Marti



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-02-13 Thread Robert Haas
On Fri, Feb 10, 2012 at 11:46 PM, Noah Misch n...@leadboat.com wrote:
 On Fri, Feb 10, 2012 at 01:59:18PM -0500, Robert Haas wrote:
 On Fri, Feb 10, 2012 at 6:42 AM, Noah Misch n...@leadboat.com wrote:
  I like the design you have chosen.  It would find applications beyond
  TRUNCATE, so your use of non-specific naming is sound.  For example, older
  snapshots will see an empty table t after CREATE TABLE t AS SELECT 1
  commits; that's a comparable MVCC anomaly.  Some of our non-MVCC-safe
  commands should perhaps just become MVCC-safe, but there will always be
  use cases for operations that shortcut MVCC.  When one truly does want
  that, your proposal for keeping behavior consistent makes plenty of sense.

 I guess I'm not particularly excited by the idea of trying to make
 TRUNCATE MVCC-safe.  I notice that the example involves the REPEATABLE
 READ isolation level, which is already known to be busted in a variety
 of ways; that's why we now have SERIALIZABLE, and why most people use
 READ COMMITTED.  Are there examples of this behavior at other
 isolation levels?

 I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and
 not at READ COMMITTED.  They tend to be narrow race conditions at READ
 COMMITTED, yet easy to demonstrate at REPEATABLE READ.  Related:
 http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php

Yeah.  Well, that's actually an interesting example, because it
illustrates how general this problem is.  We could potentially get
ourselves into a situation where just about every system catalog table
needs an xmin field to store the point at which the object came into
existence - or for that matter, was updated.  But it's not quite the
same as the xmin of the row itself, because some updates might be
judged not to matter.  There could also be intermediate cases where
updates are invalidating for some purposes but not others.  I think
we'd better get our hands around more of the problem space before we
start trying to engineer solutions.

 Incidentally, people use READ COMMITTED because they don't question the
 default, not because they know hazards of REPEATABLE READ.  I don't know the
 bustedness you speak of; could we improve the documentation to inform folks?

The example that I remember was related to SELECT FOR UPDATE/SELECT
FOR SHARE.  The idea of those statements is that you want to prevent
the row from being updated or deleted until some other concurrent
action is complete; for example, in the case of a foreign key, we'd
like to prevent the referenced row from being deleted or updated in
the relevant columns until the inserting transaction is committed.
But it doesn't work, because when the updating or deleting process
gets done with the lock wait, they are still using the same snapshot
as before, and merrily do exactly the thing that the lock-wait was
supposed to prevent.  If an actual UPDATE is used, it's safe (I
think): anyone who was going to UPDATE or DELETE the row will fail
with some kind of serialization error.  But a SELECT FOR UPDATE that
commits is treated more like an UPDATE that rolls back: it's as if the
lock never existed.  Someone (Florian?) proposed a patch to change
this, but it seemed problematic for reasons I no longer exactly
remember.

When using an actual foreign key, we work around this by taking a new
snapshot to cross-check that things haven't changed under us, but
user-level code can't do that.  At READ COMMITTED, depending on the
situation, either the fact that we take new snapshots pretty
frequently or the EPQ machinery sometimes make things work sensibly
anyway, and at SERIALIZABLE, SSI prevents these kinds of anomalies.
But REPEATABLE READ has no protection.  I wish I could find the thread
where we discussed this before.
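
A minimal two-session sketch of the hazard Robert describes (table and column
names are illustrative, not taken from the earlier thread):

```sql
-- Setup (session 1)
CREATE TABLE parent (id int PRIMARY KEY);
CREATE TABLE child (parent_id int);
INSERT INTO parent VALUES (1);

-- Session 2: an "inserting" transaction locks the referenced row
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM parent WHERE id = 1 FOR SHARE;
INSERT INTO child VALUES (1);

-- Session 1: tries to delete the referenced row; blocks on the row lock
BEGIN ISOLATION LEVEL REPEATABLE READ;
DELETE FROM parent WHERE id = 1;   -- [blocks]

-- Session 2
COMMIT;

-- Session 1: [unblocks] and the DELETE succeeds using its old snapshot;
-- a lock whose holder merely committed is treated like an update that
-- rolled back, so the child row is now orphaned
COMMIT;
```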

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-02-13 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 The example that I remember was related to SELECT FOR
 UPDATE/SELECT FOR SHARE.  The idea of those statements is that you
 want to prevent the row from being updated or deleted until some
 other concurrent action is complete; for example, in the case of a
 foreign key, we'd like to prevent the referenced row from being
 deleted or updated in the relevant columns until the inserting
 transaction is committed.  But it doesn't work, because when the
 updating or deleting process gets done with the lock wait, they
 are still using the same snapshot as before, and merrily do
 exactly the thing that the lock-wait was supposed to prevent.
 
This issue is one which appears to be a problem for people trying to
migrate from Oracle, where a write conflict would be generated.
 
 If an actual UPDATE is used, it's safe (I think): anyone who was
 going to UPDATE or DELETE the row will fail with some kind of
 serialization error.
 
Right; a write conflict.
 
 But a SELECT FOR UPDATE that commits is treated more like an
 UPDATE that rolls back: it's as if the lock never existed. 
 Someone (Florian?) proposed a patch to change this, but it seemed
 problematic for reasons I no longer exactly remember.
 
It had to do with only having one xmax and how that worked with
subtransactions.
 
Of course, besides the technical obstacles, such a semantic change
could break existing code for PostgreSQL users.  :-(
 
 When using an actual foreign key, we work around this by taking a
 new snapshot to cross-check that things haven't changed under us,
 but user-level code can't do that.  At READ COMMITTED, depending
 on the situation, either the fact that we take new snapshots
 pretty frequently or the EPQ machinery sometimes make things work
 sensibly anyway, and at SERIALIZABLE, SSI prevents these kinds of
 anomalies.  But REPEATABLE READ has no protection.
 
Well, personally I have a hard time calling READ COMMITTED behavior
sensible.  Consider this:
 
-- connection 1
test=# create table t (id int not null primary key);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"t_pkey" for table "t"
CREATE TABLE
test=# insert into t select generate_series(1, 10);
INSERT 0 10
 
-- connection 2
test=# begin;
BEGIN
test=# update t set id = id - 1;
UPDATE 10
 
-- connection 1
test=# select * from t where id = (select min(id) from t) for
update;
[blocks]
 
-- connection 2
test=# commit;
COMMIT
 
-- connection 1
[unblocks]
 id 
----
(0 rows)
 
-Kevin



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-02-13 Thread Robert Haas
On Mon, Feb 13, 2012 at 10:48 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Well, personally I have a hard time calling READ COMMITTED behavior
 sensible.  Consider this:

 [ gigantic pile of fail ]

Yeah, that's bad all right.  I think it's hard to argue that the
current behavior is sensible; the trick is to figure out something
that's better but doesn't impose too much additional overhead.  I
wonder if it's possible to use SSI (or some stripped-down mechanism
along similar lines) to track these kinds of write conflicts, rather
than cluttering the system catalogs with lots more TransactionId
fields.  Like, when we TRUNCATE a table, we could essentially make a
note in memory someplace recording the write conflict.

Unfortunately, the full-blown SSI machinery seems too expensive to use
for this, especially on all-read workloads where there are no actual
conflicts but lots of conflict checks.  But that could probably be
optimized.  The attraction of using something like an in-memory
conflict manager is that it would probably be quite general.  We could
fix problems of this type with no on-disk format changes whenever we
discover them (as we will certainly continue to do) just by adding the
appropriate flag-a-conflict calls to the right places in the code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-02-13 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote: 
 Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Well, personally I have a hard time calling READ COMMITTED
 behavior sensible.  Consider this:

 [ gigantic pile of fail ]
 
 Yeah, that's bad all right.  I think it's hard to argue that the
 current behavior is sensible; the trick is to figure out something
 that's better but doesn't impose too much additional overhead.  I
 wonder if it's possible to use SSI (or some stripped-down
 mechanism along similar lines) to track these kinds of write
 conflicts, rather than cluttering the system catalogs with lots
 more TransactionId fields.  Like, when we TRUNCATE a table, we
 could essentially make a note in memory someplace recording the
 write conflict.
 
Potential additional uses of the predicate locking developed for SSI
keep surfacing.  At some point we should probably pick a couple of
them and try to fashion an API for the non-blocking predicate
locking logic that serves them and SSI.  Since this predicate
locking system is explicitly interested only in read-write
conflicts, it does seem like it could work for SELECT FOR UPDATE
versus writes.
 
As mentioned once or twice before, it was pretty clear that while
predicate locking is required for SSI and is probably 80% of the C
code in the patch, it is really a separate thing -- we just didn't
want to try to create a generalized API based on the one initial
usage.  I think that an API based on registering and listening would
be the ticket.
 
 Unfortunately, the full-blown SSI machinery seems too expensive to
 use for this, especially on all-read workloads where there are no
 actual conflicts but lots of conflict checks.
 
In an all-read workload, if you actually declare the transactions to
be read-only SSI should not introduce much overhead.  If there's
much overhead showing up there at the moment, it would probably be
pretty easy to eliminate.  When there are any read-write
transactions active, it's a different story.
 
 But that could probably be optimized.
 
Undoubtedly.  It's disappointing that neither Dan nor I could find
the round tuits to make the kinds of changes in the SSI locking that
you made in some other areas for 9.2.  I'm not really sure how the
performance impact breaks down between predicate locking and SSI
proper, although I would tend to start from the assumption that,
like the lines of code, it's 80% in the predicate locking.
 
 The attraction of using something like an in-memory conflict
 manager is that it would probably be quite general.  We could fix
 problems of this type with no on-disk format changes whenever we
 discover them (as we will certainly continue to do) just by adding
 the appropriate flag-a-conflict calls to the right places in the
 code.
 
Assuming that the problems could be expressed in terms of read-write
conflicts, that's probably largely true.  I'm not sure that holds
for some of the funny READ COMMITTED behaviors, though.  I think the
only real cure there would be to make subtransactions cheap enough
that we could wrap execution of each SELECT and DML statement in a
subtransaction and roll back for another try with a new snapshot on
conflict.
 
If you want to track something other than read-write conflicts
and/or you want blocking when a conflict is found, you might be
better off looking to bend the heavyweight locks to your purposes.
 
-Kevin



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-02-11 Thread Dan Ports
On Fri, Feb 10, 2012 at 01:59:18PM -0500, Robert Haas wrote:
 I guess I'm not particularly excited by the idea of trying to make
 TRUNCATE MVCC-safe.  I notice that the example involves the REPEATABLE
 READ isolation level, which is already known to be busted in a variety
 of ways; that's why we now have SERIALIZABLE, and why most people use
 READ COMMITTED.  Are there examples of this behavior at other
 isolation levels?

Marti's example works for SERIALIZABLE isolation too. In general, when
DDL operations weren't previously MVCC-safe under REPEATABLE READ, we
didn't change that in SERIALIZABLE.

There's some SSI code for TRUNCATE TABLE that tries to do something
reasonable, and it catches some (more subtle) anomalies involving
concurrent truncates -- but it can only do so much when TRUNCATE itself
isn't MVCC-safe. I expect that the combination of that code and this
patch would ensure full serializability for TRUNCATE operations.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-02-10 Thread Noah Misch
On Thu, Feb 09, 2012 at 11:11:16PM +0200, Marti Raudsepp wrote:
 I've always been a little wary of using the TRUNCATE command due to
 the warning in the documentation about it not being MVCC-safe:
 queries may silently give wrong results and it's hard to tell when
 they are affected.
 
 That got me thinking, why can't we handle this like a standby server
 does -- if some query has data removed from underneath it, it aborts
 with a serialization failure.
 
 Does this solution sound like a good idea?
 
 The attached patch is a lame attempt at implementing this. I added a
 new pg_class.relvalidxmin attribute which tracks the Xid of the last
 TRUNCATE (maybe it should be called reltruncatexid?). Whenever
 starting a relation scan with a snapshot older than relvalidxmin, an
 error is thrown. This seems to work out well since TRUNCATE updates
 pg_class anyway, and anyone who sees the new relfilenode automatically
 knows when it was truncated.

I like the design you have chosen.  It would find applications beyond
TRUNCATE, so your use of non-specific naming is sound.  For example, older
snapshots will see an empty table t after CREATE TABLE t AS SELECT 1
commits; that's a comparable MVCC anomaly.  Some of our non-MVCC-safe commands
should perhaps just become MVCC-safe, but there will always be use cases for
operations that shortcut MVCC.  When one truly does want that, your proposal
for keeping behavior consistent makes plenty of sense.

 Should I also add another counter to pg_stat_database_conflicts?
 Currently this table is only used on standby servers.

 ERROR:  canceling statement due to conflict with TRUNCATE TABLE on foo
 DETAIL:  Rows visible to this transaction have been removed.

My initial reaction is not to portray this like a recovery conflict, since
several aspects distinguish it from all recovery conflict types.

(I have not read your actual patch.)

Thanks,
nm



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-02-10 Thread Robert Haas
On Fri, Feb 10, 2012 at 6:42 AM, Noah Misch n...@leadboat.com wrote:
 I like the design you have chosen.  It would find applications beyond
 TRUNCATE, so your use of non-specific naming is sound.  For example, older
 snapshots will see an empty table t after CREATE TABLE t AS SELECT 1
 commits; that's a comparable MVCC anomaly.  Some of our non-MVCC-safe commands
 should perhaps just become MVCC-safe, but there will always be use cases for
 operations that shortcut MVCC.  When one truly does want that, your proposal
 for keeping behavior consistent makes plenty of sense.

I guess I'm not particularly excited by the idea of trying to make
TRUNCATE MVCC-safe.  I notice that the example involves the REPEATABLE
READ isolation level, which is already known to be busted in a variety
of ways; that's why we now have SERIALIZABLE, and why most people use
READ COMMITTED.  Are there examples of this behavior at other
isolation levels?

But I have to admit I'm intrigued by the idea of extending this to
other cases, if there's a reasonable way to do that.  For example, if
we could fix things up so that we don't see a table at all if it was
created after we took our snapshot, that would remove one of the
obstacles to marking pages bulk-loaded into that table with FrozenXID
and PD_ALL_VISIBLE from the get-go.  I think a lot of people would be
mighty happy about that.

But the necessary semantics seem somewhat different.  For TRUNCATE,
you want to throw a serialization error; but is that also what you
want for CREATE TABLE?  Or do you just want it to appear empty?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-02-10 Thread Noah Misch
On Fri, Feb 10, 2012 at 01:59:18PM -0500, Robert Haas wrote:
 On Fri, Feb 10, 2012 at 6:42 AM, Noah Misch n...@leadboat.com wrote:
  I like the design you have chosen.  It would find applications beyond
  TRUNCATE, so your use of non-specific naming is sound.  For example, older
  snapshots will see an empty table t after CREATE TABLE t AS SELECT 1
  commits; that's a comparable MVCC anomaly.  Some of our non-MVCC-safe
  commands should perhaps just become MVCC-safe, but there will always be use
  cases for operations that shortcut MVCC.  When one truly does want that,
  your proposal for keeping behavior consistent makes plenty of sense.
 
 I guess I'm not particularly excited by the idea of trying to make
 TRUNCATE MVCC-safe.  I notice that the example involves the REPEATABLE
 READ isolation level, which is already known to be busted in a variety
 of ways; that's why we now have SERIALIZABLE, and why most people use
 READ COMMITTED.  Are there examples of this behavior at other
 isolation levels?

I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and
not at READ COMMITTED.  They tend to be narrow race conditions at READ
COMMITTED, yet easy to demonstrate at REPEATABLE READ.  Related:
http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php

Incidentally, people use READ COMMITTED because they don't question the
default, not because they know hazards of REPEATABLE READ.  I don't know the
bustedness you speak of; could we improve the documentation to inform folks?

 But I have to admit I'm intrigued by the idea of extending this to
 other cases, if there's a reasonable way to do that.  For example, if
 we could fix things up so that we don't see a table at all if it was
 created after we took our snapshot, that would remove one of the
 obstacles to marking pages bulk-loaded into that table with FrozenXID
 and PD_ALL_VISIBLE from the get-go.  I think a lot of people would be
 mighty happy about that.

Exactly.

 But the necessary semantics seem somewhat different.  For TRUNCATE,
 you want to throw a serialization error; but is that also what you
 want for CREATE TABLE?  Or do you just want it to appear empty?

I think an error helps just as much there.  If you create a table and populate
it with data in the same transaction, letting some snapshot see an empty table
is an atomicity failure.
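
The CREATE TABLE AS case Noah mentions is easy to reproduce today; a sketch
(session and table names are illustrative):

```sql
-- Session A
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT txid_current();        -- force the snapshot to be taken

-- Session B
BEGIN;
CREATE TABLE t AS SELECT 1 AS x;   -- create and populate atomically
COMMIT;

-- Session A: sees the table (catalog lookups are not MVCC) but none of
-- its rows, because their xmin is newer than A's snapshot
SELECT count(*) FROM t;       -- an empty result, not a serialization error
```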

Your comment illustrates a helpful point: this is just another kind of
ordinary serialization failure, one that can happen at any isolation level.
No serial transaction sequence can yield one of these errors.

Thanks,
nm



[HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-02-09 Thread Marti Raudsepp
Hi!

I've always been a little wary of using the TRUNCATE command due to
the warning in the documentation about it not being MVCC-safe:
queries may silently give wrong results and it's hard to tell when
they are affected.

That got me thinking, why can't we handle this like a standby server
does -- if some query has data removed from underneath it, it aborts
with a serialization failure.

Does this solution sound like a good idea?

The attached patch is a lame attempt at implementing this. I added a
new pg_class.relvalidxmin attribute which tracks the Xid of the last
TRUNCATE (maybe it should be called reltruncatexid?). Whenever
starting a relation scan with a snapshot older than relvalidxmin, an
error is thrown. This seems to work out well since TRUNCATE updates
pg_class anyway, and anyone who sees the new relfilenode automatically
knows when it was truncated.

Am I on the right track? Are there any better ways to attach this
information to a relation?
Should I also add another counter to pg_stat_database_conflicts?
Currently this table is only used on standby servers.

Since I wrote it just this afternoon, there are a few things still
wrong with the patch (it doesn't handle xid wraparound for one), so
don't be too picky about the code yet. :)
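
The xid-wraparound caveat matters because transaction ids live in a circular
2^32 space. A standalone sketch (not PostgreSQL source) of the signed-difference
comparison that TransactionIdPrecedes performs, which a finished relvalidxmin
check would need to use instead of a plain integer comparison:

```python
# Sketch of PostgreSQL's modulo-2^32 xid ordering: the 32-bit difference
# is reinterpreted as a signed value, so comparisons stay correct across
# wraparound. Function name and structure are illustrative.

def xid_precedes(id1: int, id2: int) -> bool:
    """True if xid id1 logically precedes id2 in the circular xid space."""
    diff = (id1 - id2) & 0xFFFFFFFF   # 32-bit difference
    if diff >= 0x80000000:            # reinterpret as signed 32-bit
        diff -= 0x100000000
    return diff < 0

# Straightforward case: 100 comes before 200.
print(xid_precedes(100, 200))          # True
# After wraparound: an xid near 2^32 precedes a small post-wrap xid.
print(xid_precedes(0xFFFFFF00, 100))   # True
```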

Example:
  CREATE TABLE foo (i int);
Session A:
  BEGIN ISOLATION LEVEL REPEATABLE READ;
  SELECT txid_current(); -- Force snapshot open
Session B:
  TRUNCATE TABLE foo;
Session A:
  SELECT * FROM foo;
ERROR:  canceling statement due to conflict with TRUNCATE TABLE on foo
DETAIL:  Rows visible to this transaction have been removed.


Patch also available in my github 'truncate' branch:
https://github.com/intgr/postgres/commits/truncate

Regards,
Marti
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
new file mode 100644
index 99a431a..e909b17
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
*** heap_beginscan_internal(Relation relatio
*** 1175,1180 
--- 1175,1196 
  	HeapScanDesc scan;
  
  	/*
+ 	 * If the snapshot is older than relvalidxmin then that means TRUNCATE has
+ 	 * removed data that we should still see. Abort rather than giving
+ 	 * potentially bogus results.
+ 	 */
+ 	if (relation->rd_rel->relvalidxmin != InvalidTransactionId &&
+ 		snapshot->xmax != InvalidTransactionId &&
+ 		NormalTransactionIdPrecedes(snapshot->xmax, relation->rd_rel->relvalidxmin))
+ 	{
+ 		ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+  errmsg("canceling statement due to conflict with TRUNCATE TABLE on %s",
+ 		 NameStr(relation->rd_rel->relname)),
+  errdetail("Rows visible to this transaction have been removed.")));
+ 	}
+ 
+ 	/*
  	 * increment relation ref count while scanning relation
  	 *
  	 * This is just to make really sure the relcache entry won't go away while
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
new file mode 100644
index aef410a..3f9bd5d
*** a/src/backend/catalog/heap.c
--- b/src/backend/catalog/heap.c
*** InsertPgClassTuple(Relation pg_class_des
*** 787,792 
--- 787,793 
  	values[Anum_pg_class_relhastriggers - 1] = BoolGetDatum(rd_rel->relhastriggers);
  	values[Anum_pg_class_relhassubclass - 1] = BoolGetDatum(rd_rel->relhassubclass);
  	values[Anum_pg_class_relfrozenxid - 1] = TransactionIdGetDatum(rd_rel->relfrozenxid);
+ 	values[Anum_pg_class_relvalidxmin - 1] = TransactionIdGetDatum(rd_rel->relvalidxmin);
  	if (relacl != (Datum) 0)
  		values[Anum_pg_class_relacl - 1] = relacl;
  	else
*** AddNewRelationTuple(Relation pg_class_de
*** 884,889 
--- 885,891 
  		new_rel_reltup->relfrozenxid = InvalidTransactionId;
  	}
  
+ 	new_rel_reltup->relvalidxmin = InvalidTransactionId;
  	new_rel_reltup->relowner = relowner;
  	new_rel_reltup->reltype = new_type_oid;
  	new_rel_reltup->reloftype = reloftype;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
new file mode 100644
index bfbe642..0d96a6a
*** a/src/backend/catalog/index.c
--- b/src/backend/catalog/index.c
*** reindex_index(Oid indexId, bool skip_con
*** 2854,2860 
  		}
  
  		/* We'll build a new physical relation for the index */
! 		RelationSetNewRelfilenode(iRel, InvalidTransactionId);
  
  		/* Initialize the index and rebuild */
  		/* Note: we do not need to re-establish pkey setting */
--- 2854,2860 
  		}
  
  		/* We'll build a new physical relation for the index */
! 		RelationSetNewRelfilenode(iRel, InvalidTransactionId, InvalidTransactionId);
  
  		/* Initialize the index and rebuild */
  		/* Note: we do not need to re-establish pkey setting */
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
new file mode 100644
index 1f95abc..ab4aec2
*** a/src/backend/commands/sequence.c
--- b/src/backend/commands/sequence.c
*** ResetSequence(Oid seq_relid)
*** 287,293 
  	 * Create a new storage file for the sequence.	We want to keep the
  	 * sequence's