Re: [HACKERS] Two-phase commit issues

2005-06-11 Thread Heikki Linnakangas

On Tue, 7 Jun 2005, Alvaro Herrera wrote:


 On Sat, May 21, 2005 at 06:57:24PM +0300, Heikki Linnakangas wrote:

 Heikki,

  I took a closer look at the JTA spec and saw that the Xid, which is
  translated to a gid in the jdbc driver, consists of a format identifier
  (32-bit int), a branch qualifier (max 64 bytes) and a global transaction
  identifier (max 64 bytes).

  That means that gid needs to hold 132 raw bytes minimum.

  Also, it would be nice if the driver could send the gid as a bytea,
  without converting it to a string. Similar to using parameter markers
  and parse / bind messages with regular queries. That would require a
  change in the FE/BE protocol, right?

  The branch qualifier and global transaction id structure comes from
  the OSI CCR specification. Anyone here that knows more about OSI CCR?

 I think I'm going to try to do this by hacking the lexer some (this has
 the added benefit of me learning a little about lexers).  Do you have a
 URL to those specs you mention?  How authoritative are they? I mean,
 they are not the SQL spec, right?


The JTA spec
http://java.sun.com/products/jta/

Relevant X/Open XA documents:
http://www.opengroup.org/bookstore/catalog/tp.htm

See especially page 19 of Distributed Transaction Processing: The XA
Specification; it contains the xa.h header file that specifies the format of
the transaction identifier.


It matches the format in the JTA spec, but the JTA spec also mentions
the OSI CCR format, which I haven't been able to find:

http://java.sun.com/products/jta/jta-1_0_1B-doc/javax/transaction/xa/Xid.html

In addition to those two, I bumped into RFC2371. It basically allows 
any format.


I don't have access to the SQL spec, so I can't comment on that. I'd 
regard the XA spec as the most authoritative standard in the field.


- Heikki

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Two-phase commit issues

2005-06-11 Thread Jochem van Dieten
On 6/11/05, Heikki Linnakangas wrote:
 
 It matches with the format in the JTA spec, but the JTA spec also mentions
 the OSI CCR format

The OSI CCR format, which appears to refer to ISO/IEC 9805-1. 

ISO/IEC 9805-1:1998
15-12-1998
Information technology - Open Systems Interconnection - Protocol for
the Commitment, Concurrency and Recovery service element: Protocol
specification

This standard is to be applied by reference from other specifications.
Specifies a use of the ACSE, Presentation and Session services to
carry the CCR semantics. Specifies the static and dynamic conformance
requirements for systems implementing these procedures. Specifies the
protocol elements that support the following functional units: -
static commitment; - dynamic commitment; - read only; - one-phase
commitment; - cancel; and overlapped recovery.


Unfortunately that standard is not included in my university's
subscription to ISO standards so I can't tell you what it says about
the format.

Jochem



Re: [HACKERS] Two-phase commit issues

2005-06-11 Thread Heikki Linnakangas

On Sat, 11 Jun 2005, Jochem van Dieten wrote:


 The OSI CCR format, which appears to refer to ISO/IEC 9805-1.

 ISO/IEC 9805-1:1998
 15-12-1998
 Information technology - Open Systems Interconnection - Protocol for
 the Commitment, Concurrency and Recovery service element: Protocol
 specification

 This standard is to be applied by reference from other specifications.
 Specifies a use of the ACSE, Presentation and Session services to
 carry the CCR semantics. Specifies the static and dynamic conformance
 requirements for systems implementing these procedures. Specifies the
 protocol elements that support the following functional units: -
 static commitment; - dynamic commitment; - read only; - one-phase
 commitment; - cancel; and overlapped recovery.

 Unfortunately that standard is not included in my university's
 subscription to ISO standards so I can't tell you what it says about
 the format.


Great, thanks anyway! Anyone here with access to the content?

- Heikki



Re: [HACKERS] Two-phase commit issues

2005-05-23 Thread José Orlando Pereira
On Saturday 21 May 2005 03:37, Josh Berkus wrote:
 2PC is a key to supporting 3rd-party replication tools, like C-JDBC.

I don't think C-JDBC requires 2PC for replication. Mixed up acronyms maybe? :)

-- 
Jose Orlando Pereira



Re: [HACKERS] Two-phase commit issues

2005-05-21 Thread José Orlando Pereira
On Friday 20 May 2005 18:14, Tom Lane wrote:
 Bruce Momjian pgman@candle.pha.pa.us writes:
  As I remember, you said two-phase wasn't 100% reliable and we just
  needed a way to report failures.

 [ Shrug... ]  I remain of the opinion that 2PC is a solution in search
 of a problem, because it does not solve the single point of failure
 issue (just moves same from the database to the 2PC controller).

You're right. 2PC to coordinate replicas of the same data is not that
interesting. It is however most interesting when coordinating updates to
different objects such as (i) a central database server and a local staging
area or (ii) a database server and transactional queues in a workflow-style
app.

 But some people want it anyway, and they aren't going to be satisfied
 that we are an enterprise grade database until we can check off this
 particular bullet point.  As long as the implementation doesn't impose
 any significant costs when not being used (which AFAICS Heikki's method
 doesn't), I think we gotta hold our noses and do it.

It is definitely on the checklist if you're shopping for a database to go
with your buzzword-compliant J2EE app server. :-)

-- 
Jose Orlando Pereira



Re: [HACKERS] Two-phase commit issues

2005-05-21 Thread Heikki Linnakangas

On Thu, 19 May 2005, Tom Lane wrote:


 Heikki Linnakangas [EMAIL PROTECTED] writes:

   * I'm inclined to think that the gid identifiers for prepared
   transactions ought to be SQL identifiers (names), not string literals.
   Was there a particular reason for making them strings?

  Sure. No Reason. While you're at it, do you think it's possible to make it
  unlimited size? I couldn't think of a simple way.

 Actually, one reason for wanting them to be identifiers is so that
 there's a principled reason for saying what the max length is ;-)


I took a closer look at the JTA spec and saw that the Xid, which is 
translated to a gid in the jdbc driver, consists of a format identifier 
(32-bit int), a branch qualifier (max 64 bytes) and a global transaction 
identifier (max 64 bytes).


That means that gid needs to hold 132 raw bytes minimum.
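
A minimal sketch of that arithmetic, assuming the limits quoted above (a 32-bit format identifier plus at most 64 bytes each for the global transaction id and the branch qualifier). The function names and the length-prefixed byte layout are hypothetical, chosen only for illustration; they are not the format used by the patch or by the jdbc driver.

```python
import struct

# Limits as quoted in the thread (MAXGTRIDSIZE / MAXBQUALSIZE in xa.h).
MAX_GTRID = 64
MAX_BQUAL = 64


def min_gid_bytes() -> int:
    """Worst-case raw payload: 32-bit format id + both 64-byte fields."""
    return 4 + MAX_GTRID + MAX_BQUAL


def pack_xid(format_id: int, gtrid: bytes, bqual: bytes) -> bytes:
    """Hypothetical encoding: format id, two length bytes, then the data."""
    if len(gtrid) > MAX_GTRID or len(bqual) > MAX_BQUAL:
        raise ValueError("gtrid/bqual longer than 64 bytes")
    header = struct.pack(">iBB", format_id, len(gtrid), len(bqual))
    return header + gtrid + bqual


def unpack_xid(raw: bytes):
    """Inverse of pack_xid: returns (format_id, gtrid, bqual)."""
    format_id, glen, blen = struct.unpack(">iBB", raw[:6])
    data = raw[6:]
    return format_id, data[:glen], data[glen:glen + blen]
```

Note that any encoding that also records the two field lengths needs a little more than 132 bytes (134 in this sketch), and a textual encoding such as hex would need roughly twice that.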

Also, it would be nice if the driver could send the gid as a bytea, 
without converting it to a string. Similar to using parameter markers 
and parse / bind messages with regular queries. That would require a 
change in the FE/BE protocol, right?


The branch qualifier and global transaction id structure comes from 
the OSI CCR specification. Anyone here that knows more about OSI CCR?


- Heikki



Re: [HACKERS] Two-phase commit issues

2005-05-20 Thread Bruce Momjian
Tom Lane wrote:
 I've started to look seriously at Heikki's patch for two-phase commit.
 There are a few issues that probably deserve discussion:
 
 * The major missing issue that I've come across so far is that
 subtransaction and multixact state isn't preserved across a crash.

I am a little confused by this.  How does two-phase commit add extra
requirements on crash recovery?  I understand a crashed server might be
involved in a two-phase commit, but doesn't the transaction just roll
back?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] Two-phase commit issues

2005-05-20 Thread Tom Lane
Bruce Momjian pgman@candle.pha.pa.us writes:
 I am a little confused by this.  How does two-phase commit add extra
 requirements on crash recovery?

Uh, that's more or less the entire *POINT*.  Once an open transaction is
prepared, it's supposed to survive a server crash.

regards, tom lane



Re: [HACKERS] Two-phase commit issues

2005-05-20 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian pgman@candle.pha.pa.us writes:
  I am a little confused by this.  How does two-phase commit add extra
  requirements on crash recovery?
 
 Uh, that's more or less the entire *POINT*.  Once an open transaction is
 prepared, it's supposed to survive a server crash.

Wow.  This is much more than I thought we were going to do.  I thought
if something failed after the prepare we were just going to inform the
administrator and give up.  Because you are writing a status file to the
disk, it seems you are trying to recover from a crash and roll forward.

In what cases would we actually fail to recover from a crash after a
PREPARE?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] Two-phase commit issues

2005-05-20 Thread Tom Lane
Bruce Momjian pgman@candle.pha.pa.us writes:
 Tom Lane wrote:
 Uh, that's more or less the entire *POINT*.  Once an open transaction is
 prepared, it's supposed to survive a server crash.

 Wow.  This is much more than I thought we were going to do.

If we tried to claim that anything less was two-phase commit, we'd be
laughed off the face of the planet ...

regards, tom lane



Re: [HACKERS] Two-phase commit issues

2005-05-20 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian pgman@candle.pha.pa.us writes:
  Tom Lane wrote:
  Uh, that's more or less the entire *POINT*.  Once an open transaction is
  prepared, it's supposed to survive a server crash.
 
  Wow.  This is much more than I thought we were going to do.
 
 If we tried to claim that anything less was two-phase commit, we'd be
 laughed off the face of the planet ...

Well, based on past discussions, our TODO has:

* Add two-phase commit

  This will involve adding a way to respond to commit failure by either
  taking the server into offline/readonly mode or notifying the
  administrator

As I remember, you said two-phase wasn't 100% reliable and we just
needed a way to report failures.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] Two-phase commit issues

2005-05-20 Thread Tom Lane
Bruce Momjian pgman@candle.pha.pa.us writes:
 As I remember, you said two-phase wasn't 100% reliable and we just
 needed a way to report failures.

[ Shrug... ]  I remain of the opinion that 2PC is a solution in search
of a problem, because it does not solve the single point of failure
issue (just moves same from the database to the 2PC controller).
But some people want it anyway, and they aren't going to be satisfied
that we are an enterprise grade database until we can check off this
particular bullet point.  As long as the implementation doesn't impose
any significant costs when not being used (which AFAICS Heikki's method
doesn't), I think we gotta hold our noses and do it.

regards, tom lane



Re: [HACKERS] Two-phase commit issues

2005-05-20 Thread jordan




Exactly. 2PC expects every participant that reaches the prepare-to-commit phase to survive a server restart, controller or otherwise. Anything less is not 2PC.

Jordan Henderson

On Fri, 2005-05-20 at 12:07 -0400, Tom Lane wrote:


Bruce Momjian pgman@candle.pha.pa.us writes:
 I am a little confused by this.  How does two-phase commit add extra
 requirements on crash recovery?

Uh, that's more or less the entire *POINT*.  Once an open transaction is
prepared, it's supposed to survive a server crash.

			regards, tom lane



Re: [HACKERS] Two-phase commit issues

2005-05-20 Thread David Garamond
Tom Lane wrote:
 [ Shrug... ]  I remain of the opinion that 2PC is a solution in search
 of a problem, because it does not solve the single point of failure
 issue (just moves same from the database to the 2PC controller).
 But some people want it anyway, and they aren't going to be satisfied
 that we are an enterprise grade database until we can check off this
 particular bullet point.  As long as the implementation doesn't impose
 any significant costs when not being used (which AFAICS Heikki's method
 doesn't), I think we gotta hold our noses and do it.

I thought the primary reason for having 2PC is to be able to participate
in a heterogeneous transaction, e.g. with a non-Postgres database/other
types of resource managers? 2PC is mostly about how to make these
cross-RM transactions [appear] atomic. Redundancy is not covered by the 2PC
protocol.

--
dave



Re: [HACKERS] Two-phase commit issues

2005-05-20 Thread Josh Berkus
Tom,

  [ Shrug... ]  I remain of the opinion that 2PC is a solution in search
  of a problem, because it does not solve the single point of failure
  issue (just moves same from the database to the 2PC controller).
  But some people want it anyway, and they aren't going to be satisfied
  that we are an enterprise grade database until we can check off this
  particular bullet point.  As long as the implementation doesn't impose
  any significant costs when not being used (which AFAICS Heikki's method
  doesn't), I think we gotta hold our noses and do it.

2PC is a key to supporting 3rd-party replication tools, like C-JDBC.   And is 
useful for some other use cases, like slow-WAN-based financial transactions.  
We know you don't like it, Tom.  ;-)

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [HACKERS] Two-phase commit issues

2005-05-19 Thread Heikki Linnakangas
On Wed, 18 May 2005, Tom Lane wrote:

 * The major missing issue that I've come across so far is that
 subtransaction and multixact state isn't preserved across a crash.
 Assuming that we want to store only top-level XIDs in the shared-memory
 list of prepared XIDs (which I think is important), it is essential that
 crash restart rebuild the pg_subxact status for prepared transactions.
 The subxacts of a prepared xact have to be seen as still running, and
 they won't be unless the subxact links are there.  Since subxact.c is
 designed to wipe all its state on restart, we need to recreate those
 entries.  Fortunately this doesn't seem hard: the state file for a
 prepared xact will include all of its subxact XIDs, and we can just
 do SubTransSetParent() on them while rereading the state file.  (AFAICS
 it's sufficient to make each subxact link directly to the top XID, even
 if there was a more complex hierarchy originally.)  Similarly, we've got
 to reconstruct MultiXactIds that any prepared xacts are members of, else
 row-level locks taken out by prepared xacts won't be enforced correctly.
 I think this can be handled if we add to the state files a list of all
 MultiXactIds that each prepared xact belongs to, and then during restart
 forcibly recreate those MultiXactIds.  (They would only be rebuilt with
 prepared XIDs, not any ordinary XIDs that might originally have been
 members.)  This seems to require some new code in multixact.c, but not
 anything fundamentally difficult --- Alvaro, do you see any likely
 problems in this stuff?

The subtransaction part is in fact there already, and it's done just like 
you described. RecoverPreparedTransactions function reads the subxids from 
the state file and calls SubTransSetParent for them.
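
As a rough illustration of that recovery step, here is a minimal sketch in Python rather than the backend's C. The state-file layout (a dict with "xid" and "subxids") and both function names are assumptions made up for this example; only the idea of linking every subxact directly to the top-level XID comes from the discussion above.

```python
def recover_subxact_links(state_files):
    """Rebuild a child -> parent map from prepared-transaction state files.

    state_files: iterable of dicts like {"xid": 100, "subxids": [101, 102]}
    (a hypothetical layout, not the real on-disk format).
    """
    parent = {}
    for state in state_files:
        for subxid in state["subxids"]:
            # Link straight to the top-level XID; the original nesting
            # is not needed to tell that the subxact is still running.
            parent[subxid] = state["xid"]
    return parent


def top_level_xid(parent, xid):
    """Follow parent links until reaching an XID that has no parent."""
    while xid in parent:
        xid = parent[xid]
    return xid
```

With the links rebuilt this way, a lookup on any subxact of a prepared transaction resolves to the prepared top-level XID in one step.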

As Alvaro pointed out elsewhere, the multixacts are harder because a 
backend doesn't know which multixactids it belongs to. AFAICS, the most 
straightforward solution is to xlog every CreateMultixact call, so that 
the multixact slru files can be completely reconstructed on recovery.

 * The patch is designed to dump state files into WAL as well as onto
 disk.  Why?  Wouldn't it be better just to write and fsync the state
 file before reporting successful prepare?  That would get rid of the
 need for checkpoint-time fsyncs.

Performance and correctness. There mustn't be a valid state file on the
disk before the WAL entries of that transaction are on disk. Otherwise,
the recovery might recover a transaction that in fact aborted right after
it wrote the state file.

If we fsync the WAL prepare record first, and state file second, a crash 
in between would make it impossible to recover the transaction though the 
WAL says it's prepared.

WAL logging the state file completely saves us one fsync. The state files
are usually small, say < 1 kb, so the tradeoff to write it twice and save
one fsync is probably well worth it.
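
To make the ordering argument concrete, here is a toy model in Python. It is only a sketch under simplifying assumptions: the class and method names are hypothetical, durability is modelled with plain in-memory lists, and COMMIT/ROLLBACK PREPARED are left out. It shows the approach described above: the full state goes into the WAL prepare record with a single fsync, the state file is synced only at checkpoint time, and recovery replays prepare records written after the last checkpoint.

```python
class ToyServer:
    """Toy model: 'wal' and 'state_files' are durable, pending_* are not."""

    def __init__(self):
        self.wal = []            # durable WAL records
        self.state_files = {}    # durable state files, gid -> state
        self.pending_wal = []    # WAL written but not yet fsynced
        self.pending_files = {}  # state files written but not yet fsynced
        self.fsyncs = 0

    def _fsync_wal(self):
        self.wal.extend(self.pending_wal)
        self.pending_wal.clear()
        self.fsyncs += 1

    def prepare(self, gid, state):
        # The whole state goes into the WAL prepare record: one fsync.
        self.pending_wal.append(("prepare", gid, state))
        self._fsync_wal()
        # The state file itself is written without an fsync for now.
        self.pending_files[gid] = state

    def checkpoint(self):
        # Checkpoint-time fsync of the outstanding state files.
        self.state_files.update(self.pending_files)
        self.pending_files.clear()
        self.fsyncs += 1
        self.pending_wal.append(("checkpoint",))
        self._fsync_wal()

    def crash(self):
        # Everything that was not fsynced is lost.
        self.pending_wal.clear()
        self.pending_files.clear()

    def recover(self):
        # Start from the synced state files, then replay prepare
        # records written after the last checkpoint.
        prepared = dict(self.state_files)
        start = 0
        for i, rec in enumerate(self.wal):
            if rec[0] == "checkpoint":
                start = i + 1
        for rec in self.wal[start:]:
            if rec[0] == "prepare":
                prepared[rec[1]] = rec[2]
        self.state_files = prepared  # recovery recreates the state files
        return dict(prepared)
```

In this model a prepare costs one fsync, and a transaction prepared just before a crash is still recovered from the WAL even though its state file never reached disk.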

Third, we have to cater for PITR. I haven't given it much thought, but if 
we want to do log shipping and PITR, I believe we must have everything in 
the WAL.

 * I'm inclined to think that the gid identifiers for prepared
 transactions ought to be SQL identifiers (names), not string literals.
 Was there a particular reason for making them strings?

Sure. No Reason. While you're at it, do you think it's possible to make it 
unlimited size? I couldn't think of a simple way.

 * What are we going to do with GUC variables?  My feeling is that
 the only sane answer is that PREPARE is the same as COMMIT as far as
 local GUC variables go, and COMMIT/ROLLBACK PREPARED have no effect
 on GUC state.  Otherwise it's really unclear what to do.  Consider

     SET myvar = foo;
     BEGIN;
     SET myvar = bar;
     PREPARE gid;
     SHOW myvar;       -- what do you see ... foo or bar?
     SET myvar = baz;  -- is this even legal?
     ROLLBACK PREPARED gid;
     SHOW myvar;       -- now what do you see ... foo or baz?

 Since local GUC changes aren't going to be saved/restored across a
 crash anyway, I can't see a point in doing anything really complex.

 * There are some fairly ugly cases associated with creation and deletion
 of temporary tables as well.  I think we might want to just decree that
 you can't PREPARE a transaction that included creating or dropping a
 temp table.  Does anyone have much of a problem with that?

I think the safest way to handle the GUC case as well is to just refuse to 
prepare a transaction that has changed local GUC variables.

Another possibility is to rethink the contract of PREPARE TRANSACTION and
COMMIT/ROLLBACK PREPARED. If PREPARE TRANSACTION put the backend into
a state where you can't do anything other than COMMIT/ROLLBACK the prepared
transaction, we could do more sensible things with GUC and temp tables.
That would have complications of its own though. What would happen if
another backend then tries to COMMIT/ROLLBACK the transaction the original
backend is still tied to?

- Heikki

Re: [HACKERS] Two-phase commit issues

2005-05-19 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 As Alvaro pointed out elsewhere, the multixacts are harder because a 
 backend doesn't know which multixactids it belongs to. AFAICS, the most 
 straightforward solution is to xlog every CreateMultixact call, so that 
 the multixact slru files can be completely reconstructed on recovery.

I realized this morning that in fact it *can't* know that, since even
after a particular transaction commits it's still possible for others to
add it to new multixacts.  In the case of a prepared xact it would
continue to get added to new multixacts indefinitely :-(.  So the idea
of recording info about this in the state files is clearly a loser.
I think we will indeed have to start xlogging multixact operations.

 Third, we have to cater for PITR. I haven't given it much thought, but if 
 we want to do log shipping and PITR, I believe we must have everything in 
 the WAL.

Hmm.  All your other arguments for WAL-logging a prepare are bogus, but
this one seems real.  (It's also a reason why multixact stuff needs to
be xlogged, I guess.)

 * I'm inclined to think that the gid identifiers for prepared
 transactions ought to be SQL identifiers (names), not string literals.
 Was there a particular reason for making them strings?

 Sure. No Reason. While you're at it, do you think it's possible to make it 
 unlimited size? I couldn't think of a simple way.

Actually, one reason for wanting them to be identifiers is so that
there's a principled reason for saying what the max length is ;-)

 I think the safest way to handle the GUC case as well is to just refuse to 
 prepare a transaction that has changed local GUC variables.

That seems unnecessarily restrictive.

 Another possibility is to rethink the contract of PREPARE TRANSACTION and 
 COMMIT/ROLLBACK PREPARED. If PREPARE TRANSACTION would put the backend to 
 a state where you can't do anything else than COMMIT/ROLLBACK the prepared 
 transaction, we could do more sensible things with GUC and temp tables. 
 That would have complications of its own though. What would happen if
 another backend then tries to COMMIT/ROLLBACK the transaction the original 
 backend is still tied to?

Yeah, I do not think this is a useful answer.  Allowing the commit to
happen somewhere else and restricting what a prepared xact can do with
temp tables seems much more useful in practice.

regards, tom lane



[HACKERS] Two-phase commit issues

2005-05-18 Thread Tom Lane
I've started to look seriously at Heikki's patch for two-phase commit.
There are a few issues that probably deserve discussion:

* The major missing issue that I've come across so far is that
subtransaction and multixact state isn't preserved across a crash.
Assuming that we want to store only top-level XIDs in the shared-memory
list of prepared XIDs (which I think is important), it is essential that
crash restart rebuild the pg_subxact status for prepared transactions.
The subxacts of a prepared xact have to be seen as still running, and
they won't be unless the subxact links are there.  Since subxact.c is
designed to wipe all its state on restart, we need to recreate those
entries.  Fortunately this doesn't seem hard: the state file for a
prepared xact will include all of its subxact XIDs, and we can just
do SubTransSetParent() on them while rereading the state file.  (AFAICS
it's sufficient to make each subxact link directly to the top XID, even
if there was a more complex hierarchy originally.)  Similarly, we've got
to reconstruct MultiXactIds that any prepared xacts are members of, else
row-level locks taken out by prepared xacts won't be enforced correctly.
I think this can be handled if we add to the state files a list of all
MultiXactIds that each prepared xact belongs to, and then during restart
forcibly recreate those MultiXactIds.  (They would only be rebuilt with
prepared XIDs, not any ordinary XIDs that might originally have been
members.)  This seems to require some new code in multixact.c, but not
anything fundamentally difficult --- Alvaro, do you see any likely
problems in this stuff?

* The patch is designed to dump state files into WAL as well as onto
disk.  Why?  Wouldn't it be better just to write and fsync the state
file before reporting successful prepare?  That would get rid of the
need for checkpoint-time fsyncs.

* I'm inclined to think that the gid identifiers for prepared
transactions ought to be SQL identifiers (names), not string literals.
Was there a particular reason for making them strings?

* What are we going to do with GUC variables?  My feeling is that
the only sane answer is that PREPARE is the same as COMMIT as far as
local GUC variables go, and COMMIT/ROLLBACK PREPARED have no effect
on GUC state.  Otherwise it's really unclear what to do.  Consider

    SET myvar = foo;
    BEGIN;
    SET myvar = bar;
    PREPARE gid;
    SHOW myvar;       -- what do you see ... foo or bar?
    SET myvar = baz;  -- is this even legal?
    ROLLBACK PREPARED gid;
    SHOW myvar;       -- now what do you see ... foo or baz?

Since local GUC changes aren't going to be saved/restored across a
crash anyway, I can't see a point in doing anything really complex.

* There are some fairly ugly cases associated with creation and deletion
of temporary tables as well.  I think we might want to just decree that
you can't PREPARE a transaction that included creating or dropping a
temp table.  Does anyone have much of a problem with that?
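
For the GUC question in the list above, the proposed rule (PREPARE behaves like COMMIT for local GUC variables, and COMMIT/ROLLBACK PREPARED leave GUC state alone) can be modelled with a toy session. This is a sketch of the proposal only, in Python, with a hypothetical ToySession class; it is not how the backend stores GUC values.

```python
class ToySession:
    """Toy model of session-local settings under the proposed rule."""

    def __init__(self):
        self.gucs = {}
        self.saved = None  # snapshot taken at BEGIN, used by ROLLBACK

    def set(self, name, value):
        self.gucs[name] = value

    def show(self, name):
        return self.gucs[name]

    def begin(self):
        self.saved = dict(self.gucs)

    def rollback(self):
        # An ordinary ROLLBACK restores the values seen at BEGIN.
        self.gucs = self.saved
        self.saved = None

    def prepare(self, gid):
        # Proposed rule: PREPARE keeps local changes, as COMMIT would.
        self.saved = None

    def rollback_prepared(self, gid):
        # Proposed rule: no effect on GUC state.
        pass
```

Running the statement sequence from the example against this model, the first SHOW sees bar and the second sees baz, because neither PREPARE nor ROLLBACK PREPARED undoes a SET.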

regards, tom lane



Re: [HACKERS] Two-phase commit issues

2005-05-18 Thread Joe Chang
Hi,

One thing I would suggest is to start a global transaction in begin, not in
prepare. That is the way to be compliant with XA.

Thanks
Joe


On 5/18/05 2:15 PM, Tom Lane [EMAIL PROTECTED] wrote:

 I've started to look seriously at Heikki's patch for two-phase commit.
 There are a few issues that probably deserve discussion:
 
 * The major missing issue that I've come across so far is that
 subtransaction and multixact state isn't preserved across a crash.
 Assuming that we want to store only top-level XIDs in the shared-memory
 list of prepared XIDs (which I think is important), it is essential that
 crash restart rebuild the pg_subxact status for prepared transactions.
 The subxacts of a prepared xact have to be seen as still running, and
 they won't be unless the subxact links are there.  Since subxact.c is
 designed to wipe all its state on restart, we need to recreate those
 entries.  Fortunately this doesn't seem hard: the state file for a
 prepared xact will include all of its subxact XIDs, and we can just
 do SubTransSetParent() on them while rereading the state file.  (AFAICS
 it's sufficient to make each subxact link directly to the top XID, even
 if there was a more complex hierarchy originally.)  Similarly, we've got
 to reconstruct MultiXactIds that any prepared xacts are members of, else
 row-level locks taken out by prepared xacts won't be enforced correctly.
 I think this can be handled if we add to the state files a list of all
 MultiXactIds that each prepared xact belongs to, and then during restart
 forcibly recreate those MultiXactIds.  (They would only be rebuilt with
 prepared XIDs, not any ordinary XIDs that might originally have been
 members.)  This seems to require some new code in multixact.c, but not
 anything fundamentally difficult --- Alvaro, do you see any likely
 problems in this stuff?
 
 * The patch is designed to dump state files into WAL as well as onto
 disk.  Why?  Wouldn't it be better just to write and fsync the state
 file before reporting successful prepare?  That would get rid of the
 need for checkpoint-time fsyncs.
 
 * I'm inclined to think that the gid identifiers for prepared
 transactions ought to be SQL identifiers (names), not string literals.
 Was there a particular reason for making them strings?
 
 * What are we going to do with GUC variables?  My feeling is that
 the only sane answer is that PREPARE is the same as COMMIT as far as
 local GUC variables go, and COMMIT/ROLLBACK PREPARED have no effect
 on GUC state.  Otherwise it's really unclear what to do.  Consider
 SET myvar = foo;
 BEGIN;
 SET myvar = bar;
 PREPARE gid;
 SHOW myvar;  -- what do you see ... foo or bar?
 SET myvar = baz; -- is this even legal?
 ROLLBACK PREPARED gid;
 SHOW myvar;  -- now what do you see ... foo or baz?
 Since local GUC changes aren't going to be saved/restored across a
 crash anyway, I can't see a point in doing anything really complex.
 
 * There are some fairly ugly cases associated with creation and deletion
 of temporary tables as well.  I think we might want to just decree that
 you can't PREPARE a transaction that included creating or dropping a
 temp table.  Does anyone have much of a problem with that?
 
 regards, tom lane
 
 





Re: [HACKERS] Two-phase commit issues

2005-05-18 Thread Alvaro Herrera
On Wed, May 18, 2005 at 05:15:09PM -0400, Tom Lane wrote:
 I've started to look seriously at Heikki's patch for two-phase commit.

Hum.  I started a few days ago doing some reviewing, with the intention
of correcting some things here and there in order to present it all to
you later, with a pre-filter to get some bugs out.

 There are a few issues that probably deserve discussion:
 
 * The major missing issue that I've come across so far is that
 subtransaction and multixact state isn't preserved across a crash.
[...]
 (AFAICS it's sufficient to make each subxact link directly to the top
 XID, even if there was a more complex hierarchy originally.)

Right, we don't care about the hierarchy; we know all those subXids were
committed.

 Similarly, we've got to reconstruct MultiXactIds that any prepared
 xacts are members of, else row-level locks taken out by prepared xacts
 won't be enforced correctly.  I think this can be handled if we add to
 the state files a list of all MultiXactIds that each prepared xact
 belongs to, and then during restart forcibly recreate those
 MultiXactIds.  (They would only be rebuilt with prepared XIDs, not any
 ordinary XIDs that might originally have been members.)  This seems to
 require some new code in multixact.c, but not anything fundamentally
 difficult --- Alvaro, do you see any likely problems in this stuff?

I'm not sure whether it matters that Xid=1, which participates in a
MultiXactId, is seen as not prepared when Xid=2, which also participates
in the same MultiXactId, prepares; if Xid=1 is prepared later, the
MultiXactId needs to be restored with both Xids as participants.


 * The patch is designed to dump state files into WAL as well as onto
 disk.  Why?  Wouldn't it be better just to write and fsync the state
 file before reporting successful prepare?  That would get rid of the
 need for checkpoint-time fsyncs.

I made the same observation.

 * I'm inclined to think that the gid identifiers for prepared
 transactions ought to be SQL identifiers (names), not string literals.
 Was there a particular reason for making them strings?

Ditto.

 * There are some fairly ugly cases associated with creation and deletion
 of temporary tables as well.  I think we might want to just decree that
 you can't PREPARE a transaction that included creating or dropping a
 temp table.  Does anyone have much of a problem with that?

Does this affect any of the other things that use the direct-fsync-no-WAL
path in the smgr?

-- 
Alvaro Herrera (alvherre[a]surnet.cl)
Having your biases confirmed independently is how scientific progress is
made, and hence made our great society what it is today (Mary Gardiner)



Re: [HACKERS] Two-phase commit issues

2005-05-18 Thread Tom Lane
Alvaro Herrera writes:
 On Wed, May 18, 2005 at 05:15:09PM -0400, Tom Lane wrote:
 Similarly, we've got to reconstruct MultiXactIds that any prepared
 xacts are members of, else row-level locks taken out by prepared xacts
 won't be enforced correctly.  I think this can be handled if we add to
 the state files a list of all MultiXactIds that each prepared xact
 belongs to, and then during restart forcibly recreate those
 MultiXactIds.  (They would only be rebuilt with prepared XIDs, not any
 ordinary XIDs that might originally have been members.)

 I'm not sure whether the following case matters: Xid=1, which participates
 in a MultiXactId, is not yet prepared when Xid=2, which participates in the
 same MultiXactId, prepares.  If Xid=1 is prepared later, the MultiXactId
 needs to be restored with both Xids as participants.

What I had in mind was that each prepared xact's state file would just
list the MultiXactIds it belongs to.  While re-reading the state files
after a crash, we'd construct the opposite lists (ie, which xacts belong
to each MultiXactId) and then write appropriate entries into
pg_multixact at completion of the re-read.  I don't think it matters
what order the xacts got prepared in.
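
The inversion step can be sketched as follows, in Python for brevity. The in-memory `state_files` map stands in for the contents of the on-disk state files, and the function name is hypothetical. Collecting members into sets is what makes the result independent of the order in which the state files are read.

```python
from collections import defaultdict


def rebuild_multixact_members(state_files):
    """Invert {prepared_xid: [multixact_id, ...]} into
    {multixact_id: [prepared_xid, ...]}.

    Only prepared XIDs appear in the result; ordinary XIDs that were
    once members are not recorded in any state file and so are dropped.
    """
    members = defaultdict(set)
    for xid, mxids in state_files.items():
        for mxid in mxids:
            members[mxid].add(xid)
    return {mxid: sorted(xids) for mxid, xids in members.items()}


# Xid 1 and Xid 2 both belong to MultiXactId 100; Xid 2 is also in 101.
state_files = {1: [100], 2: [100, 101]}
rebuilt = rebuild_multixact_members(state_files)

# Re-read the same state files in the opposite order.
reordered = dict(reversed(list(state_files.items())))
rebuilt_reordered = rebuild_multixact_members(reordered)
```

Both runs produce the same membership lists, which matches the claim that the order in which the xacts were prepared does not matter.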

 * There are some fairly ugly cases associated with creation and deletion
 of temporary tables as well.  I think we might want to just decree that
 you can't PREPARE a transaction that included creating or dropping a
 temp table.  Does anyone have much of a problem with that?

 Does this affect any of the other things that use the direct-fsync-no-WAL
 path in the smgr?

I don't think so.  It's not fsync that is at issue, really --- what I'm
concerned about is operations that occur at commit time.  For instance,
consider
    CREATE TEMP TABLE foo (...) ON COMMIT DELETE ROWS;
    BEGIN;
    DROP TABLE foo;
    PREPARE gid;
foo has an entry in the backend's on-commit-actions list, which the DROP
marks for deletion at commit.  It is unclear what to do with that entry
at prepare.  We can't really leave it active since the table can't be
touched afterwards (the DROP took an exclusive lock, which we no longer
own).  But simply forgetting it is not good either; what if the prepared
xact is later rolled back?  Worse, what if some other backend does that
rollback?  If it was us doing the ROLLBACK PREPARED, we could at least
in theory resurrect the ON COMMIT item, but there's no communication
path to tell us to do so when someone else does the rollback.

More generally, anything like this implies that a transaction that is no
longer ours is holding locks on our temp tables.  This is Really Bad.
(Consider what happens if our backend tries to exit --- it'll want to
delete those temp tables.)  I am more than half tempted to put some kind
of test into LockPersistAll to reject attempts to persist any lock of
any kind on a temp table.

I suppose the ideal solution would be something like what I was just
suggesting for GUC: as far as temp tables go, a PREPARE is a COMMIT,
and none of the resources associated with the temp table get assigned
to the prepared xact.  I am not sure how do-able that is, though.

regards, tom lane



Re: [HACKERS] Two-phase commit issues

2005-05-18 Thread Alvaro Herrera
On Wed, May 18, 2005 at 07:29:38PM -0400, Tom Lane wrote:
 Alvaro Herrera writes:
  On Wed, May 18, 2005 at 05:15:09PM -0400, Tom Lane wrote:
  Similarly, we've got to reconstruct MultiXactIds that any prepared
  xacts are members of, else row-level locks taken out by prepared xacts
  won't be enforced correctly.  I think this can be handled if we add to
  the state files a list of all MultiXactIds that each prepared xact
  belongs to, and then during restart forcibly recreate those
  MultiXactIds.  (They would only be rebuilt with prepared XIDs, not any
  ordinary XIDs that might originally have been members.)
 
  I'm not sure whether the following case matters: Xid=1, which participates
  in a MultiXactId, is not yet prepared when Xid=2, which participates in the
  same MultiXactId, prepares.  If Xid=1 is prepared later, the MultiXactId
  needs to be restored with both Xids as participants.
 
 What I had in mind was that each prepared xact's state file would just
 list the MultiXactIds it belongs to.

Hm, this assumes the transaction knows what MultiXactIds it belongs to.
This is not true, is it?  I'm not sure how to find that out.

  * There are some fairly ugly cases associated with creation and deletion
  of temporary tables as well.  I think we might want to just decree that
  you can't PREPARE a transaction that included creating or dropping a
  temp table.  Does anyone have much of a problem with that?
 
  Does this affect any of the other things that use the direct-fsync-no-WAL
  path in the smgr?
 
 I don't think so.  It's not fsync that is at issue, really --- what I'm
 concerned about is operations that occur at commit time.

I think I was confusing this with the use of local buffers instead of
shared buffers, as in the case where btree creation skips WAL.  But that
certainly has nothing to do with the issue here.


 For instance, consider
   CREATE TEMP TABLE foo (...) ON COMMIT DELETE ROWS;
   BEGIN;
   DROP TABLE foo;
   PREPARE gid;
 foo has an entry in the backend's on-commit-actions list, which the DROP
 marks for deletion at commit.  It is unclear what to do with that entry
 at prepare.  We can't really leave it active since the table can't be
 touched afterwards (the DROP took an exclusive lock, which we no longer
 own).  But simply forgetting it is not good either; what if the prepared
 xact is later rolled back?  Worse, what if some other backend does that
 rollback?  If it was us doing the ROLLBACK PREPARED, we could at least
 in theory resurrect the ON COMMIT item, but there's no communication
 path to tell us to do so when someone else does the rollback.

Hmm.  I think not being able to use temp tables is an important
restriction.  I can see the implementation issue though.

 More generally, anything like this implies that a transaction that is no
 longer ours is holding locks on our temp tables.  This is Really Bad.
 (Consider what happens if our backend tries to exit --- it'll want to
 delete those temp tables.)  I am more than half tempted to put some kind
 of test into LockPersistAll to reject attempts to persist any lock of
 any kind on a temp table.

Maybe the restriction could be lighter -- what if the prepared
transaction inserts tuples into a temp table, for example?  It's
inconsistent, I think, that a temp table could have tuples on it that
suddenly appear when some other backend commits my prepared transaction.


 I suppose the ideal solution would be something like what I was just
 suggesting for GUC: as far as temp tables go, a PREPARE is a COMMIT,
 and none of the resources associated with the temp table get assigned
 to the prepared xact.  I am not sure how do-able that is, though.

That'd require labelling tuples in temp tables with a different Xid
than tuples in non-temp tables, no?  It'd get strange very quickly.

-- 
Alvaro Herrera (alvherre[a]surnet.cl)
La persona que no quería pecar / estaba obligada a sentarse
 en duras y empinadas sillas/ desprovistas, por cierto
 de blandos atenuantes  (Patricio Vogel)



Re: [HACKERS] Two-phase commit issues

2005-05-18 Thread Tom Lane
Alvaro Herrera writes:
 On Wed, May 18, 2005 at 07:29:38PM -0400, Tom Lane wrote:
 What I had in mind was that each prepared xact's state file would just
 list the MultiXactIds it belongs to.

 Hm, this assumes the transaction knows what MultiXactIds it belongs to.
 This is not true, is it?  I'm not sure how to find that out.

[ thinks about that for a bit... ]  I had been thinking we could just
track it locally in each backend, but that won't do for the case where
someone adds you to a MultiXactId without your knowledge.  Seems like
we'd have to actually scan the contents of pg_multixact?  Yech.
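
For illustration only, the scan could look something like this in Python. The `multixact_members` dictionary is a stand-in for whatever pg_multixact actually stores, and the function name is hypothetical. The point is that the cost is a pass over every MultiXactId, not just the ones the backend knows about.

```python
def multixacts_containing(xid, multixact_members):
    """Return every MultiXactId that lists xid as a member.

    multixact_members is a hypothetical {multixact_id: [xid, ...]} view
    of the shared multixact data; the whole map has to be scanned
    because another backend may have added xid without its knowledge.
    """
    return sorted(
        mxid for mxid, members in multixact_members.items() if xid in members
    )


multixact_members = {100: [1, 2], 101: [2], 102: [3]}
found = multixacts_containing(2, multixact_members)
```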

 Maybe the restriction could be lighter -- what if the prepared
 transaction inserts tuples into a temp table, for example?  It's
 inconsistent, I think, that a temp table could have tuples on it that
 suddenly appear when some other backend commits my prepared transaction.

Yeah, there are all sorts of interesting problems there :-(.  I think
we'd be best off to punt for the moment.  I think we could enforce that a
transaction being PREPAREd hasn't touched any temp tables at all, by
checking that it holds no locks on such tables.
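
A sketch of that check, written in Python with made-up types. `HeldLock`, its `is_temp` flag and `check_can_prepare` are assumptions for the example and do not correspond to actual lock manager structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeldLock:
    relation: str
    is_temp: bool  # hypothetical flag: the locked relation is a temp table


def check_can_prepare(held_locks):
    """Refuse to prepare if the transaction holds any lock on a temp table."""
    temp_relations = sorted({l.relation for l in held_locks if l.is_temp})
    if temp_relations:
        raise RuntimeError(
            "cannot PREPARE a transaction that has touched temp tables: "
            + ", ".join(temp_relations)
        )


# A transaction that only touched ordinary tables passes the check.
check_can_prepare([HeldLock("accounts", False)])

# One that holds a lock on temp table foo is rejected.
try:
    check_can_prepare([HeldLock("accounts", False), HeldLock("foo", True)])
    rejected = False
except RuntimeError:
    rejected = True
```

Testing locks rather than tracking individual operations catches reads, writes, creation and dropping alike, provided each of them takes some lock on the table.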

regards, tom lane
