Re: [HACKERS] eXtensible Transaction Manager API (v2)

2016-03-19 Thread David Steele
On 3/16/16 7:59 AM, Stas Kelvich wrote:
> On 12 Mar 2016, at 13:19, Michael Paquier  wrote:
>>
>> On Fri, Mar 11, 2016 at 9:35 PM, Tom Lane  wrote:
>>> IMO this is not committable as-is, and I don't think that it's something
>>> that will become committable during this 'fest.  I think we'd be well
>>> advised to boot it to the 2016-09 CF and focus our efforts on other stuff
>>> that has a better chance of getting finished this month.
>>
>> Yeah, I believe a good first step would be to discuss this in depth
>> directly at PGCon, for the folks who will be there and are interested in
>> the subject. It seems like good timing to brainstorm things F2F at the
>> developer unconference, for example, a couple of months before the 1st
>> CF of 9.7. We may perhaps (or not) get to a cleaner picture of what kind
>> of things are wanted in this area.
> 
> To give an overview of XTM coupled with postgres_fdw from a user's
> perspective, I've packaged a patched Postgres with Docker and provided
> a test case in which it is easy to spot a violation of READ COMMITTED
> isolation without XTM.
> 
> The test fills a database with users spread across two shards that are
> connected via postgres_fdw and inherit from the same table. It then
> concurrently transfers money between users on different shards:
> 
> begin;
> update t set v = v - 1 where u=%d; -- this is user from t_fdw1, first shard
> update t set v = v + 1 where u=%d; -- this is user from t_fdw2, second shard
> commit;
> 
> The test also runs a concurrent reader thread that counts all the money in the system:
> 
> select sum(v) from t;
> 
> In a transactional system we expect that sum to be constant at all times
> (zero in our case, since every user starts with a zero balance).
> But we can see that without tsdtm the total amount of money fluctuates
> around zero.
> 
> https://github.com/kelvich/postgres_xtm_docker

This is an interesting example but I don't believe it does much to
address the concerns that were raised in this thread.

As far as I can see the consensus is that this patch should not be
considered for the current CF so I have marked it "returned with feedback".

If possible please follow Michael's advice and create a session at the
PGCon unconference in May.  I'm certain there will be a lot of interest.

-- 
-David
da...@pgmasters.net




Re: [HACKERS] eXtensible Transaction Manager API (v2)

2016-03-19 Thread Stas Kelvich
On 12 Mar 2016, at 13:19, Michael Paquier  wrote:
> 
> On Fri, Mar 11, 2016 at 9:35 PM, Tom Lane  wrote:
>> IMO this is not committable as-is, and I don't think that it's something
>> that will become committable during this 'fest.  I think we'd be well
>> advised to boot it to the 2016-09 CF and focus our efforts on other stuff
>> that has a better chance of getting finished this month.
> 
> Yeah, I believe a good first step would be to discuss this in depth
> directly at PGCon, for the folks who will be there and are interested in
> the subject. It seems like good timing to brainstorm things F2F at the
> developer unconference, for example, a couple of months before the 1st
> CF of 9.7. We may perhaps (or not) get to a cleaner picture of what kind
> of things are wanted in this area.

To give an overview of XTM coupled with postgres_fdw from a user's
perspective, I've packaged a patched Postgres with Docker and provided a
test case in which it is easy to spot a violation of READ COMMITTED
isolation without XTM.

The test fills a database with users spread across two shards that are
connected via postgres_fdw and inherit from the same table. It then
concurrently transfers money between users on different shards:

begin;
update t set v = v - 1 where u=%d; -- this is user from t_fdw1, first shard
update t set v = v + 1 where u=%d; -- this is user from t_fdw2, second shard
commit;

The test also runs a concurrent reader thread that counts all the money in the system:

select sum(v) from t;

In a transactional system we expect that sum to be constant at all times
(zero in our case, since every user starts with a zero balance).
But we can see that without tsdtm the total amount of money fluctuates around zero.

https://github.com/kelvich/postgres_xtm_docker
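
For reference, here is a minimal C++/libpqxx sketch of the same experiment: a
writer thread doing cross-shard transfers while a reader thread checks the
invariant. The connection strings, user ids and loop structure are illustrative
only; the actual test driver is in the repository above.

// Sketch only: writer and reader run concurrently against a node that sees
// both shards through postgres_fdw.  Without a DTM the reader can observe a
// moment where the money has left one shard but not yet arrived on the other.
#include <pqxx/pqxx>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<bool> stop{false};

void writer()
{
    pqxx::connection conn{"dbname=proxy"};               // placeholder conninfo
    while (!stop) {
        pqxx::work tx{conn};
        tx.exec("update t set v = v - 1 where u = 1");    // row on the first shard (t_fdw1)
        tx.exec("update t set v = v + 1 where u = 2");    // row on the second shard (t_fdw2)
        tx.commit();                                      // not atomic across shards without a DTM
    }
}

void reader()
{
    pqxx::connection conn{"dbname=proxy"};                // placeholder conninfo
    while (!stop) {
        pqxx::nontransaction tx{conn};
        auto sum = tx.exec("select coalesce(sum(v), 0) from t")[0][0].as<long long>();
        if (sum != 0)
            std::printf("isolation anomaly: sum(v) = %lld\n", sum);
    }
}

int main()
{
    std::thread w(writer), r(reader);
    std::this_thread::sleep_for(std::chrono::seconds(10));
    stop = true;
    w.join();
    r.join();
}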

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company





Re: [HACKERS] eXtensible Transaction Manager API (v2)

2016-03-14 Thread Kevin Grittner
On Sat, Mar 12, 2016 at 11:21 AM, Robert Haas  wrote:

> I'd also be interested in hearing Kevin Grittner's thoughts about
> serializability in a distributed environment, since he's obviously
> thought about the topic of serializability quite a bit.

I haven't done a thorough search of the academic literature on
this, and I wouldn't be comfortable taking a really solid position
without that; but in thinking about it it seems like there are at
least three distinct problems which may have distinct solutions.

*Physical replication* may be best handled by leveraging the "safe
snapshot" idea already implemented in READ ONLY DEFERRABLE
transactions, and passing through information in the WAL stream to
allow the receiver to identify points where a snapshot can be taken
which cannot see an anomaly.  There should probably be an option to
use the last known safe snapshot or wait for a point in the stream
where one next appears.  This might take as little as a bit or two
per WAL commit record.  It's not clear what the processing overhead
would be -- it wouldn't surprise me if it was "in the noise", nor
would it surprise me if it wasn't.  We would need some careful
benchmarking, and, if performance was an issue, a GUC to control
whether the information was passed along (and, thus, whether
SERIALIZABLE transactions were allowed on the replica).
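
To make the "safe snapshot" idea concrete, this is the existing primary-side
behavior that would be extended to standbys: a SERIALIZABLE READ ONLY
DEFERRABLE transaction waits until it can obtain a snapshot guaranteed to be
free of anomalies and then runs without further SSI overhead.  A minimal
libpqxx sketch (connection string and table name are placeholders):

#include <pqxx/pqxx>
#include <iostream>

int main()
{
    pqxx::connection conn{"dbname=app"};     // placeholder conninfo
    pqxx::nontransaction session{conn};      // lets us manage BEGIN/COMMIT ourselves

    // Blocks until a snapshot is available that cannot observe a serialization
    // anomaly; the transaction then runs without taking SIRead locks.
    session.exec("begin transaction isolation level serializable, read only, deferrable");
    pqxx::result r = session.exec("select count(*) from some_table");   // hypothetical table
    std::cout << "rows: " << r[0][0].as<long long>() << "\n";
    session.exec("commit");
}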

*Logical replication* (considered for the moment in a
unidirectional context) might best be handled by some reordering of
application of the commits on the replica into "apparent order of
execution" -- which is pretty well defined on the primary based on
commit order adjusted by read-write dependencies.  Basically, the
"simple" implementation would be that WAL is applied normally
unless you receive a commit record which is flagged in some way to
indicate that it is for a serializable transaction which wrote data
and at the time of commit was concurrent with at least one other
serializable transaction which had not completed and was not READ
ONLY.  Such a commit would await information in the WAL stream to
tell it when all such concurrent transactions completed, and would
indicate when such a transaction had a read-write dependency *in*
to the transaction with the suspended commit; commits for any such
transactions must be moved ahead of the suspended commit.  This
would allow logical replication, with all the filtering and such,
to avoid ever showing a state on the replica which contained
serialization anomalies.
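
For illustration only, the apply-side bookkeeping for that "simple"
implementation might look roughly like the sketch below.  Every name here is
invented; nothing corresponds to actual WAL record formats or existing code.

#include <cstdint>
#include <map>
#include <set>
#include <vector>

using Xid = uint32_t;

static std::vector<Xid> applied_order;                   // stand-in for really applying commits
static void apply_commit(Xid xid) { applied_order.push_back(xid); }

struct SuspendedCommit {
    std::set<Xid> waiting_for;        // concurrent serializable xacts not yet finished
    std::vector<Xid> must_precede;    // those with a rw-dependency *in* to this xact
};

static std::map<Xid, SuspendedCommit> suspended;

// Called when the stream reports that concurrent xact `done` has completed,
// along with whether it had a rw-dependency into the suspended xact `holder`.
static void concurrent_finished(Xid holder, Xid done, bool rw_dep_into_holder)
{
    auto it = suspended.find(holder);
    if (it == suspended.end())
        return;
    if (rw_dep_into_holder)
        it->second.must_precede.push_back(done);          // must be applied first
    it->second.waiting_for.erase(done);
    if (it->second.waiting_for.empty()) {
        // All concurrent xacts are resolved: apply the dependencies, then the
        // suspended commit itself, yielding "apparent order of execution".
        for (Xid dep : it->second.must_precede)
            apply_commit(dep);
        apply_commit(holder);
        suspended.erase(it);
    }
}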

*Logical replication with cycles* (where there is some path for
cluster A to replicate to cluster B, and some other path for
cluster B to replicate the same or related data to cluster A) has a
few options.  You could opt for "eventual consistency" --
essentially giving up on the I in ACID and managing the anomalies.
In practice this seems to lead to some form of S2PL at the
application coding level, which is very bad for performance and
concurrency, so I tend to think it should not be the only option.
Built-in S2PL would probably perform better than having it pasted
on at the application layer through some locking API, but for most
workloads is still inferior to SSI in both concurrency and
performance.  Unless a search of the literature turns up some new
alternative, I'm inclined to think that if you want to distribute a
"logical" database over multiple clusters and still manage race
conditions through use of SERIALIZABLE transactions, a distributed
SSI implementation may be the best bet.  That requires the
transaction manager (or something like it) to track non-blocking
predicate "locks" (what the implementation calls a SIReadLock)
across the whole environment, as well as tracking rw-conflicts (our
short name for read-write dependencies) across the whole
environment.  Since SSI also looks at the MVCC state, handling
checks of that without falling victim to race conditions would also
need to be handled somehow.

If I remember correctly, the patches to add the SSI implementation
of SERIALIZABLE transactions were about ten times the size of the
patches to remove S2PL and initially replace it with MVCC.  I don't
have even a gut feel as to how much bigger the distributed form is
likely to be.  On the one hand the *fundamental logic* is all there
and should not need to change; on the other hand the *mechanism*
for acquiring the data to be *used* in that logic would be
different and potentially complex.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] eXtensible Transaction Manager API (v2)

2016-03-12 Thread Robert Haas
On Sat, Mar 12, 2016 at 11:06 AM, Tom Lane  wrote:
> Michael Paquier  writes:
>> On Fri, Mar 11, 2016 at 9:35 PM, Tom Lane  wrote:
>>> IMO this is not committable as-is, and I don't think that it's something
>>> that will become committable during this 'fest.  I think we'd be well
>>> advised to boot it to the 2016-09 CF and focus our efforts on other stuff
>>> that has a better chance of getting finished this month.
>
>> Yeah, I believe a good first step would be to discuss this in depth
>> directly at PGCon, for the folks who will be there and are interested in
>> the subject. It seems like good timing to brainstorm things F2F at the
>> developer unconference, for example, a couple of months before the 1st
>> CF of 9.7. We may perhaps (or not) get to a cleaner picture of what kind
>> of things are wanted in this area.
>
> Yeah, the whole area seems like a great topic for some unconference
> sessions.

I agree.  I think this is a problem we really need to solve, and I
think talking about it will help us figure out the best solution.  I'd
also be interested in hearing Kevin Grittner's thoughts about
serializability in a distributed environment, since he's obviously
thought about the topic of serializability quite a bit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] eXtensible Transaction Manager API (v2)

2016-03-12 Thread Tom Lane
Michael Paquier  writes:
> On Fri, Mar 11, 2016 at 9:35 PM, Tom Lane  wrote:
>> IMO this is not committable as-is, and I don't think that it's something
>> that will become committable during this 'fest.  I think we'd be well
>> advised to boot it to the 2016-09 CF and focus our efforts on other stuff
>> that has a better chance of getting finished this month.

> Yeah, I believe a good first step would be to discuss this in depth
> directly at PGCon, for the folks who will be there and are interested in
> the subject. It seems like good timing to brainstorm things F2F at the
> developer unconference, for example, a couple of months before the 1st
> CF of 9.7. We may perhaps (or not) get to a cleaner picture of what kind
> of things are wanted in this area.

Yeah, the whole area seems like a great topic for some unconference
sessions.

regards, tom lane




Re: [HACKERS] eXtensible Transaction Manager API (v2)

2016-03-12 Thread Michael Paquier
On Fri, Mar 11, 2016 at 9:35 PM, Tom Lane  wrote:
> IMO this is not committable as-is, and I don't think that it's something
> that will become committable during this 'fest.  I think we'd be well
> advised to boot it to the 2016-09 CF and focus our efforts on other stuff
> that has a better chance of getting finished this month.

Yeah, I believe a good first step would be to discuss this in depth
directly at PGCon, for the folks who will be there and are interested in
the subject. It seems like good timing to brainstorm things F2F at the
developer unconference, for example, a couple of months before the 1st
CF of 9.7. We may perhaps (or not) get to a cleaner picture of what kind
of things are wanted in this area.
-- 
Michael




Re: [HACKERS] eXtensible Transaction Manager API (v2)

2016-03-11 Thread Konstantin Knizhnik

On 03/11/2016 11:35 PM, Tom Lane wrote:

> Robert Haas  writes:
>> On Fri, Mar 11, 2016 at 1:11 PM, David Steele  wrote:
>>> Is anyone willing to volunteer a review or make an argument for the
>>> importance of this patch?
>
>> There's been a lot of discussion on another thread about this patch.
>> The subject is "The plan for FDW-based sharding", but the thread kind
>> of got partially hijacked by this issue.  The net-net of that is that
>> I don't think we have a clear enough idea about where we're going with
>> global transaction management to make it a good idea to adopt an API
>> like this.  For example, if we later decide we want to put the
>> functionality in core, will we keep the hooks around for the sake of
>> alternative non-core implementations?  I just don't believe this
>> technology is nearly mature enough to commit to at this point.
>> Konstantin does not agree with my assessment, perhaps unsurprisingly.
>
> I re-read the original thread,
>
> http://www.postgresql.org/message-id/flat/f2766b97-555d-424f-b29f-e0ca0f6d1...@postgrespro.ru
>
> I think there is no question that this is an entirely immature patch.
> Not coping with subtransactions is alone sufficient to make it not
> credible for production.


Lack of subtransaction support is not a limitation of the XTM API.
It is a limitation of the current pg_dtm implementation.  And another DTM
implementation, pg_tsdtm, does support subtransactions.



> Even if the extension API were complete and clearly stable, I have doubts
> that there's any great value in integrating it into 9.6, rather than some
> later release series.  The above thread makes it clear that pg_dtm is very
> much WIP and has easily a year's worth of work before anybody would think
> of wanting to deploy it.  So end users don't need this patch in 9.6, and
> developers working on pg_dtm shouldn't really have much of a problem
> applying the patch locally --- how likely is it that they'd be using a
> perfectly stock build of the database apart from this patch?


I agree with you that pg_dtm is very far from production.
But I want to note two things:

1. pg_dtm and pg_tsdtm are not complete cluster solutions; they are just one
(relatively small) part of such a solution.
pg_tsdtm seems to be the more "mature" of the two, maybe because it is simpler
and does not have many of the limitations pg_dtm has (such as the lack of
subtransaction support).

2. They can be integrated quite easily with other (existing) cluster solutions.
We have integrated both of them with postgres_fdw and pg_shard.
postgres_fdw is also not a ready-made solution, just a mechanism which can also
be used for sharding.
But pg_shard & CitusDB are quite popular solutions for distributed query
execution which provide good performance for analytic and single-node OLTP
queries.
Integrating them with a DTM adds ACID semantics for distributed transactions
and makes it possible to support more complex OLTP and OLAP queries involving
multiple nodes.

Such integration is already done and its performance has been evaluated, so it
is not quite correct to say that we need a year or more to make
pg_dtm/pg_tsdtm ready to deploy.




> But my real takeaway from that thread is that there's no great reason
> to believe that this API definition *is* stable.  The single existing
> use-case is very far from completion, and it's hardly unlikely that
> what it needs will change.


Sorry, maybe I am completely wrong, but I do not think it is possible to
develop a stable API if nobody is using it.
It is like "fill the pool with water only after you learn how to swim".



> I also took a very quick look at the patch itself:
>
> 1. No documentation.  For something that purports to be an API
> specification, really the documentation should have been written *first*.


Sorry, that was my fault. I have already written documentation and it will be
included in the next version of the patch.
But please note that when we started work on the DTM we did not have a good
understanding of which PostgreSQL TM features would have to be changed.
The current view of XTM was formed only during the work on pg_dtm, pg_tsdtm,
and multimaster.

One more point: we have not introduced new abstractions in XTM.
We just override existing PostgreSQL functions.
Certainly, when some internal functions become part of an API, they should be
much better documented.



> 2. As noted in the cited thread, it's not clear that
> Get/SetTransactionStatus are a useful cutpoint; they don't provide any
> real atomicity guarantees.


I wonder how such guarantees can be provided at the API level?
Atomicity means that every other transaction sees this transaction either as
committed or as uncommitted.
So transaction commit has to be coordinated with the visibility check.
In the case of pg_dtm, atomicity is simply enforced by the fact that the
decision whether to commit a transaction is taken by the central coordinator.
When it decides that a transaction is committed, it marks it as committed in
all subsequently obtained snapshots.

In the case of pg_tsdtm there is no central arbiter, so we have to introduce an
"in-doubt" state of 

Re: [HACKERS] eXtensible Transaction Manager API (v2)

2016-03-11 Thread Tom Lane
Robert Haas  writes:
> On Fri, Mar 11, 2016 at 1:11 PM, David Steele  wrote:
>> Is anyone willing to volunteer a review or make an argument for the
>> importance of this patch?

> There's been a lot of discussion on another thread about this patch.
> The subject is "The plan for FDW-based sharding", but the thread kind
> of got partially hijacked by this issue.  The net-net of that is that
> I don't think we have a clear enough idea about where we're going with
> global transaction management to make it a good idea to adopt an API
> like this.  For example, if we later decide we want to put the
> functionality in core, will we keep the hooks around for the sake of
> alternative non-core implementations?  I just don't believe this
> technology is nearly mature enough to commit to at this point.
> Konstantin does not agree with my assessment, perhaps unsurprisingly.

I re-read the original thread,

http://www.postgresql.org/message-id/flat/f2766b97-555d-424f-b29f-e0ca0f6d1...@postgrespro.ru

I think there is no question that this is an entirely immature patch.
Not coping with subtransactions is alone sufficient to make it not
credible for production.

Even if the extension API were complete and clearly stable, I have doubts
that there's any great value in integrating it into 9.6, rather than some
later release series.  The above thread makes it clear that pg_dtm is very
much WIP and has easily a year's worth of work before anybody would think
of wanting to deploy it.  So end users don't need this patch in 9.6, and
developers working on pg_dtm shouldn't really have much of a problem
applying the patch locally --- how likely is it that they'd be using a
perfectly stock build of the database apart from this patch?

But my real takeaway from that thread is that there's no great reason
to believe that this API definition *is* stable.  The single existing
use-case is very far from completion, and it's hardly unlikely that
what it needs will change.


I also took a very quick look at the patch itself:

1. No documentation.  For something that purports to be an API
specification, really the documentation should have been written *first*.

2. As noted in the cited thread, it's not clear that
Get/SetTransactionStatus are a useful cutpoint; they don't provide any
real atomicity guarantees.

3. Uh, how can you hook GetNewTransactionId but not ReadNewTransactionId?

4. There seems to be an intention to encapsulate snapshots, but surely
wrapping hooks around GetSnapshotData and XidInMVCCSnapshot is not nearly
enough for that.  Look at all the knowledge snapmgr.c has about snapshot
representation, for example.  And is a function like GetOldestXmin even
meaningful with a different notion of what snapshots are?  (For that
matter, is TransactionId == uint32 still tenable for any other notion
of snapshots?)

5. BTW, why would you hook at XidInMVCCSnapshot rather than making use of
the existing capability to have a custom SnapshotSatisfiesFunc snapshot
checker function?


IMO this is not committable as-is, and I don't think that it's something
that will become committable during this 'fest.  I think we'd be well
advised to boot it to the 2016-09 CF and focus our efforts on other stuff
that has a better chance of getting finished this month.

regards, tom lane




Re: [HACKERS] eXtensible Transaction Manager API (v2)

2016-03-11 Thread David Steele
On 3/11/16 2:00 PM, Oleg Bartunov wrote:
> On Fri, Mar 11, 2016 at 7:11 PM, David Steele  wrote:
>> I'm concerned about the lack of response or reviewers for this patch.
>> It may be because everyone believes they had their say on the original
>> thread, or because it seems like a big change to go into the last CF, or
>> for other reasons altogether.
> 
> 
> We'll prepare an easy setup to play with our solutions, so that any developer
> can see how it works.  We hope to post something about it this weekend.

OK, then for now I'm marking this "waiting for author."  You can switch
it back to "needs review" once you have posted additional material.

-- 
-David
da...@pgmasters.net





Re: [HACKERS] eXtensible Transaction Manager API (v2)

2016-03-11 Thread Oleg Bartunov
On Fri, Mar 11, 2016 at 7:11 PM, David Steele  wrote:

> On 2/10/16 12:50 PM, Konstantin Knizhnik wrote:
>
> > The PostgresProfessional cluster team wants to propose a new version of
> > the eXtensible Transaction Manager API.
> > Previous discussion concerning this patch can be found here:
> >
> >
> http://www.postgresql.org/message-id/f2766b97-555d-424f-b29f-e0ca0f6d1...@postgrespro.ru
>
> I see a lot of discussion on this thread but little in the way of
> consensus.
>
> > The API patch itself is small enough, but we think that it would be
> > strange to provide just the API without examples of its usage.
>
> It's not all that small, though it does apply cleanly even after a few
> months.  At least that indicates there is not a lot of churn in this area.
>
> I'm concerned about the lack of response or reviewers for this patch.
> It may be because everyone believes they had their say on the original
> thread, or because it seems like a big change to go into the last CF, or
> for other reasons altogether.
>

We'll prepare an easy setup to play with our solutions, so that any developer
can see how it works.  We hope to post something about it this weekend.



>
> I think you should try to make it clear why this patch would be a win
> for 9.6.
>

Looks like the discussion has shifted to a different thread; we'll answer here.



>
> Is anyone willing to volunteer a review or make an argument for the
> importance of this patch?
>
> --
> -David
> da...@pgmasters.net
>
>


Re: [HACKERS] eXtensible Transaction Manager API (v2)

2016-03-11 Thread David Steele
On 3/11/16 1:30 PM, Robert Haas wrote:

> There's been a lot of discussion on another thread about this patch.
> The subject is "The plan for FDW-based sharding", but the thread kind
> of got partially hijacked by this issue.  The net-net of that is that
> I don't think we have a clear enough idea about where we're going with
> global transaction management to make it a good idea to adopt an API
> like this.  For example, if we later decide we want to put the
> functionality in core, will we keep the hooks around for the sake of
> alternative non-core implementations?  I just don't believe this
> technology is nearly mature enough to commit to at this point.

Ah yes, I forgot about the related discussion on that thread.  Pasting
here for reference:

http://www.postgresql.org/message-id/20160223164335.ga11...@momjian.us

> Konstantin does not agree with my assessment, perhaps unsurprisingly.

I'm certainly no stranger to feeling strongly about a patch!

-- 
-David
da...@pgmasters.net





Re: [HACKERS] eXtensible Transaction Manager API (v2)

2016-03-11 Thread Robert Haas
On Fri, Mar 11, 2016 at 1:11 PM, David Steele  wrote:
> On 2/10/16 12:50 PM, Konstantin Knizhnik wrote:
>> The PostgresProfessional cluster team wants to propose a new version of
>> the eXtensible Transaction Manager API.
>> Previous discussion concerning this patch can be found here:
>>
>> http://www.postgresql.org/message-id/f2766b97-555d-424f-b29f-e0ca0f6d1...@postgrespro.ru
>
> I see a lot of discussion on this thread but little in the way of consensus.
>
>> The API patch itself is small enough, but we think that it would be
>> strange to provide just the API without examples of its usage.
>
> It's not all that small, though it does apply cleanly even after a few
> months.  At least that indicates there is not a lot of churn in this area.
>
> I'm concerned about the lack of response or reviewers for this patch.
> It may be because everyone believes they had their say on the original
> thread, or because it seems like a big change to go into the last CF, or
> for other reasons altogether.
>
> I think you should try to make it clear why this patch would be a win
> for 9.6.
>
> Is anyone willing to volunteer a review or make an argument for the
> importance of this patch?

There's been a lot of discussion on another thread about this patch.
The subject is "The plan for FDW-based sharding", but the thread kind
of got partially hijacked by this issue.  The net-net of that is that
I don't think we have a clear enough idea about where we're going with
global transaction management to make it a good idea to adopt an API
like this.  For example, if we later decide we want to put the
functionality in core, will we keep the hooks around for the sake of
alternative non-core implementations?  I just don't believe this
technology is nearly mature enough to commit to at this point.

Konstantin does not agree with my assessment, perhaps unsurprisingly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] eXtensible Transaction Manager API (v2)

2016-03-11 Thread David Steele
On 2/10/16 12:50 PM, Konstantin Knizhnik wrote:

> The PostgresProfessional cluster team wants to propose a new version of
> the eXtensible Transaction Manager API.
> Previous discussion concerning this patch can be found here:
> 
> http://www.postgresql.org/message-id/f2766b97-555d-424f-b29f-e0ca0f6d1...@postgrespro.ru

I see a lot of discussion on this thread but little in the way of consensus.

> The API patch itself is small enough, but we think that it would be
> strange to provide just the API without examples of its usage.

It's not all that small, though it does apply cleanly even after a few
months.  At least that indicates there is not a lot of churn in this area.

I'm concerned about the lack of response or reviewers for this patch.
It may be because everyone believes they had their say on the original
thread, or because it seems like a big change to go into the last CF, or
for other reasons altogether.

I think you should try to make it clear why this patch would be a win
for 9.6.

Is anyone willing to volunteer a review or make an argument for the
importance of this patch?

-- 
-David
da...@pgmasters.net





Re: [HACKERS] eXtensible Transaction Manager API

2015-12-03 Thread Robert Haas
On Tue, Dec 1, 2015 at 9:19 AM, Bruce Momjian  wrote:
> On Tue, Nov 17, 2015 at 12:48:38PM -0500, Robert Haas wrote:
>> > At this time, the number of round trips needed particularly for READ
>> > COMMITTED transactions that need a new snapshot for each query was
>> > really a performance killer. We used DBT-1 (TPC-W) which is less
>> > OLTP-like than DBT-2 (TPC-C), still with DBT-1 the scalability limit
>> > was quickly reached with 10-20 nodes..
>>
>> Yeah.  I think this merits a good bit of thought.  Superficially, at
>> least, it seems that every time you need a snapshot - which in the
>> case of READ COMMITTED is for every SQL statement - you need a network
>> roundtrip to the snapshot server.  If multiple backends request a
>> snapshot in very quick succession, you might be able to do a sort of
>> "group commit" thing where you send a single request to the server and
>> they all use the resulting snapshot, but it seems hard to get very far
>> with such optimization.  For example, if backend 1 sends a snapshot
>> request and backend 2 then realizes that it also needs a snapshot, it
>> can't just wait for the reply from backend 1 and use that one.  The
>> user might have committed a transaction someplace else and then kicked
>> off a transaction on backend 2 afterward, expecting it to see the work
>> committed earlier.  But the snapshot returned to backend 1 might have
>> been taken before that.  So, all in all, this seems rather crippling.
>>
>> Things are better if the system has a single coordinator node that is
>> also the arbiter of commits and snapshots.  Then, it can always take a
>> snapshot locally with no network roundtrip, and when it reaches out to
>> a shard, it can pass along the snapshot information with the SQL query
>> (or query plan) it has to send anyway.  But then the single
>> coordinator seems like it becomes a bottleneck.  As soon as you have
>> multiple coordinators, one of them has got to be the arbiter of global
>> ordering, and now all of the other coordinators have to talk to it
>> constantly.
>
> I think the performance benefits of having a single coordinator are
> going to require us to implement different snapshot/transaction code
> paths for single coordinator and multi-coordinator cases.  :-(  That is,
> people who can get by with only a single coordinator are going to want
> to do that to avoid the overhead of multiple coordinators, while those
> using multiple coordinators are going to have to live with that
> overhead.
>
> I think an open question is what workloads can use a single coordinator?
> Read-only?  Long queries?  Are those cases also ones where the
> snapshot/transaction overhead is negligible, meaning we don't need the
> single coordinator optimizations?

Well, the cost of the coordinator is basically a per-snapshot cost,
which means in effect a per-transaction cost.  So if your typical
query runtimes are measured in seconds or minutes, it's probably going
to be really hard to swamp the coordinator.  If they're measured in
milliseconds or microseconds, it's likely to be a huge problem.  I
think this is why a lot of NoSQL systems have abandoned transactions
as a way of doing business.  The overhead of isolating transactions is
significant even on a single-server system when there are many CPUs
and lots of short-running queries (cf.
0e141c0fbb211bdd23783afa731e3eef95c9ad7a and previous efforts in the
same area) but in a multi-server environment it just gets crushingly
expensive.

I think there are ways to reduce the cost of this.  Some distributed
systems have solved it by retreating from snapshot isolation and going
back to using locks.  This can improve scalability if you've got lots
of transactions but they're very short and rarely touch the same data.
Locking individual data elements (or partitions) doesn't require a
central system that knows about all commits - each individual node can
just release locks for each transaction as it completes.  More
generally, if you could avoid having to make a decision about whether
transaction A precedes or follows transaction B unless transaction A
and B actually touch the same data - an idea we already use for SSI -
then you could get much better scalability.  But I doubt that can be
made to work without a deeper rethink of our transaction system.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] eXtensible Transaction Manager API

2015-12-03 Thread Craig Ringer
On 4 December 2015 at 06:41, Robert Haas  wrote:

>
> I think there are ways to reduce the cost of this.  Some distributed
> systems have solved it by retreating from snapshot isolation and going
> back to using locks.  This can improve scalability if you've got lots
> of transactions but they're very short and rarely touch the same data.
> Locking individual data elements (or partitions) doesn't require a
> central system that knows about all commits - each individual node can
> just release locks for each transaction as it completes.  More
> generally, if you could avoid having to make a decision about whether
> transaction A precedes or follows transaction B unless transaction A
> and B actually touch the same data - an idea we already use for SSI -
> then you could get much better scalability.  But I doubt that can be
> made to work without a deeper rethink of our transaction system.
>

Something I've seen thrown about is the idea of lazy snapshots. Using
SSI-like facilities you keep track of transaction dependencies and what's
changed since the last snapshot. A transaction can use an old snapshot if
it doesn't touch anything changed since the snapshot was taken. If it does
you acquire a new snapshot (or switch to a newer existing one) and can swap
it in safely, since you know they're the same for all data the xact has
touched so far.

I suspect that just replaces one expensive problem with one or more
different expensive problems and is something that'd help a few workloads
but hurt others. It's come up when people have asked why we take a new
snapshot for every transaction but I haven't seen it explored much.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] eXtensible Transaction Manager API

2015-12-01 Thread Bruce Momjian
On Tue, Nov 17, 2015 at 12:48:38PM -0500, Robert Haas wrote:
> > At this time, the number of round trips needed particularly for READ
> > COMMITTED transactions that need a new snapshot for each query was
> > really a performance killer. We used DBT-1 (TPC-W) which is less
> > OLTP-like than DBT-2 (TPC-C), still with DBT-1 the scalability limit
> > was quickly reached with 10-20 nodes..
> 
> Yeah.  I think this merits a good bit of thought.  Superficially, at
> least, it seems that every time you need a snapshot - which in the
> case of READ COMMITTED is for every SQL statement - you need a network
> roundtrip to the snapshot server.  If multiple backends request a
> snapshot in very quick succession, you might be able to do a sort of
> "group commit" thing where you send a single request to the server and
> they all use the resulting snapshot, but it seems hard to get very far
> with such optimization.  For example, if backend 1 sends a snapshot
> request and backend 2 then realizes that it also needs a snapshot, it
> can't just wait for the reply from backend 1 and use that one.  The
> user might have committed a transaction someplace else and then kicked
> off a transaction on backend 2 afterward, expecting it to see the work
> committed earlier.  But the snapshot returned to backend 1 might have
> been taken before that.  So, all in all, this seems rather crippling.
> 
> Things are better if the system has a single coordinator node that is
> also the arbiter of commits and snapshots.  Then, it can always take a
> snapshot locally with no network roundtrip, and when it reaches out to
> a shard, it can pass along the snapshot information with the SQL query
> (or query plan) it has to send anyway.  But then the single
> coordinator seems like it becomes a bottleneck.  As soon as you have
> multiple coordinators, one of them has got to be the arbiter of global
> ordering, and now all of the other coordinators have to talk to it
> constantly.

I think the performance benefits of having a single coordinator are
going to require us to implement different snapshot/transaction code
paths for single coordinator and multi-coordinator cases.  :-(  That is,
people who can get by with only a single coordinator are going to want
to do that to avoid the overhead of multiple coordinators, while those
using multiple coordinators are going to have to live with that
overhead.

I think an open question is what workloads can use a single coordinator?
Read-only?  Long queries?  Are those cases also ones where the
snapshot/transaction overhead is negligible, meaning we don't need the
single coordinator optimizations?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +




Re: [HACKERS] eXtensible Transaction Manager API

2015-11-17 Thread Robert Haas
On Fri, Nov 13, 2015 at 8:35 AM, Michael Paquier
 wrote:
> As well as there could be FS, OS, network problems... To come back to
> the point, my point is simply that I found surprising the sentence of
> Konstantin upthread saying that if commit fails on some of the nodes
> we should rollback the prepared transaction on all nodes. In the
> example given, in the phase after calling dtm_end_prepare, say we
> perform COMMIT PREPARED correctly on node 1, but then failed it on
> node 2 because a meteor has hit a server, it seems that we cannot
> rollback, instead we had better rolling in a backup and be sure that
> the transaction gets committed. How would you rollback the transaction
> already committed on node 1? But perhaps I missed something...

Right.  In that case, we have to try to eventually get it committed everywhere.

One thing that's a bit confusing about this XTM interface is what
"COMMIT" actually means.  The idea is that on the standby server we
will call some DTM-provided function and pass it a token.  Then we
will begin and commit a transaction.  But presumably the commit does
not actually commit, because if it's a single transaction on all nodes
then the commit can't be completed until all work is done all nodes.
So my guess is that the COMMIT here is intended to behave more like a
PREPARE, but this is not made explicit.

>> One point I'd like to mention is that it's absolutely critical to
>> design this in a way that minimizes network roundtrips without
>> compromising correctness.  XC's GTM proxy suggests that they failed to
>> do that.  I think we really need to look at what's going to be on the
>> other sides of the proposed APIs and think about whether it's going to
>> be possible to have a strong local caching layer that keeps network
>> roundtrips to a minimum.  We should consider whether the need for such
>> a caching layer has any impact on what the APIs should look like.
>
> At this time, the number of round trips needed particularly for READ
> COMMITTED transactions that need a new snapshot for each query was
> really a performance killer. We used DBT-1 (TPC-W) which is less
> OLTP-like than DBT-2 (TPC-C), still with DBT-1 the scalability limit
> was quickly reached with 10-20 nodes..

Yeah.  I think this merits a good bit of thought.  Superficially, at
least, it seems that every time you need a snapshot - which in the
case of READ COMMITTED is for every SQL statement - you need a network
roundtrip to the snapshot server.  If multiple backends request a
snapshot in very quick succession, you might be able to do a sort of
"group commit" thing where you send a single request to the server and
they all use the resulting snapshot, but it seems hard to get very far
with such optimization.  For example, if backend 1 sends a snapshot
request and backend 2 then realizes that it also needs a snapshot, it
can't just wait for the reply from backend 1 and use that one.  The
user might have committed a transaction someplace else and then kicked
off a transaction on backend 2 afterward, expecting it to see the work
committed earlier.  But the snapshot returned to backend 1 might have
been taken before that.  So, all in all, this seems rather crippling.
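
To spell out the constraint: a shared snapshot is only safe for a backend if
the request for it was sent after that backend's own statement began, which is
why an already-in-flight request cannot simply be reused.  The sketch below is
purely illustrative (all names are invented) and encodes that conservative
rule; note that a late arrival still waits roughly a full roundtrip for the
next request, which is why the optimization buys little.

#include <condition_variable>
#include <cstdint>
#include <mutex>

struct Snapshot { uint64_t csn; };

// Stand-in for the real network roundtrip to the snapshot server.
static Snapshot fetch_snapshot_from_server() { static uint64_t csn; return Snapshot{++csn}; }

class SnapshotBatcher {
    std::mutex mtx;
    std::condition_variable cv;
    uint64_t next_request = 1;   // number of the next request that will be sent
    uint64_t completed = 0;      // number of the last request whose reply arrived
    bool sender_active = false;
    Snapshot latest{};

public:
    Snapshot get_snapshot()
    {
        std::unique_lock<std::mutex> lk(mtx);
        // Any request already sent may carry a snapshot taken before we arrived,
        // so we must wait for the reply to the *next* request to be sent.
        uint64_t needed = next_request;
        while (completed < needed) {
            if (!sender_active) {
                sender_active = true;
                next_request++;        // later arrivals will need a later request
                lk.unlock();
                Snapshot s = fetch_snapshot_from_server();   // one roundtrip, shared
                lk.lock();
                latest = s;
                completed = needed;
                sender_active = false;
                cv.notify_all();
            } else {
                cv.wait(lk);
            }
        }
        return latest;   // shared by everyone who arrived before this request was sent
    }
};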

Things are better if the system has a single coordinator node that is
also the arbiter of commits and snapshots.  Then, it can always take a
snapshot locally with no network roundtrip, and when it reaches out to
a shard, it can pass along the snapshot information with the SQL query
(or query plan) it has to send anyway.  But then the single
coordinator seems like it becomes a bottleneck.  As soon as you have
multiple coordinators, one of them has got to be the arbiter of global
ordering, and now all of the other coordinators have to talk to it
constantly.

Maybe I'm thinking of this too narrowly by talking about snapshots;
perhaps there are other ways of ensuring whatever level of transaction
isolation we want to have here.  But I'm not sure it matters that much
- I don't see any way for the sees-the-effects-of relation on the set
of all transactions to be a total ordering without some kind of
central arbiter of the commit ordering.  Except for
perfectly-synchronized timestamps, but I don't think that's really
physically possible anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] eXtensible Transaction Manager API

2015-11-15 Thread Konstantin Knizhnik

I am sorry, it looks like I incorrectly interpreted Michael's comment:

> Not always. If COMMIT PREPARED fails at some of the nodes but succeeds
> on others, the transaction is already partially acknowledged as
> committed in the cluster. Hence it makes more sense for the
> coordinator to commit transactions on the remaining nodes. Before
> issuing any COMMIT PREPARED queries, I guess that's fine to rollback
> the transactions on all nodes though.

I completely agree that at the second stage of 2PC, once we have made the
decision to commit the transaction, there is no way back: we have to complete
the commit on all nodes. If a node reports that it has successfully prepared
the transaction and is ready to commit, it means that the node should be able
to recover it in case of failure. In my example code in Go illustrating work
with 2PC, I completely excluded the error-handling logic for simplicity.

Actually it should be something like this (now in C++ using pqxx):

// exec() and execQuery() below are assumed to be thin helper wrappers around
// pqxx's exec() that do printf-style formatting of the SQL string.
nontransaction srcTxn(srcCon);
nontransaction dstTxn(dstCon);
string transComplete;
try {
    exec(srcTxn, "prepare transaction '%s'", gtid);
    exec(dstTxn, "prepare transaction '%s'", gtid);
    // At this moment all nodes have prepared the transaction, so in principle
    // we could make the decision to commit here...
    // But it is easier to wait until a CSN is assigned to the transaction and
    // delivered to all nodes.
    exec(srcTxn, "select dtm_begin_prepare('%s')", gtid);
    exec(dstTxn, "select dtm_begin_prepare('%s')", gtid);
    csn = execQuery(srcTxn, "select dtm_prepare('%s', 0)", gtid);
    csn = execQuery(dstTxn, "select dtm_prepare('%s', %lu)", gtid, csn);
    exec(srcTxn, "select dtm_end_prepare('%s', %lu)", gtid, csn);
    exec(dstTxn, "select dtm_end_prepare('%s', %lu)", gtid, csn);
    // If we reach this point, we make the decision to commit...
    transComplete = string("commit prepared '") + gtid + "'";
} catch (pqxx_exception const& x) {
    // ... if not, then roll back the prepared transaction
    transComplete = string("rollback prepared '") + gtid + "'";
}
pipeline srcPipe(srcTxn);
pipeline dstPipe(dstTxn);
srcPipe.insert(transComplete);
dstPipe.insert(transComplete);
srcPipe.complete();
dstPipe.complete();


On 11/15/2015 08:59 PM, Kevin Grittner wrote:

> On Saturday, November 14, 2015 8:41 AM, Craig Ringer  wrote:
>> On 13 November 2015 at 21:35, Michael Paquier  wrote:
>>> On Tue, Nov 10, 2015 at 3:46 AM, Robert Haas  wrote:
>>>> If the database is corrupted, there's no way to guarantee that
>>>> anything works as planned.  This is like saying that criticizing
>>>> somebody's disaster recovery plan on the basis that it will be
>>>> inadequate if the entire planet earth is destroyed.
>
> +1
>
> Once all participating servers return "success" from the "prepare"
> phase, the only sane way to proceed is to commit as much as
> possible.  (I guess you still have some room to argue about what to
> do after an error on the attempt to commit the first prepared
> transaction, but even there you need to move forward if it might
> have been due to a communications failure preventing a success
> indication for a commit which was actually successful.)  The idea
> of two phase commit is that failure to commit after a successful
> prepare should be extremely rare.
>
>>> As well as there could be FS, OS, network problems... To come back to
>>> the point, my point is simply that I found surprising the sentence of
>>> Konstantin upthread saying that if commit fails on some of the nodes
>>> we should rollback the prepared transaction on all nodes. In the
>>> example given, in the phase after calling dtm_end_prepare, say we
>>> perform COMMIT PREPARED correctly on node 1, but then failed it on
>>> node 2 because a meteor has hit a server, it seems that we cannot
>>> rollback, instead we had better rolling in a backup and be sure that
>>> the transaction gets committed. How would you rollback the transaction
>>> already committed on node 1? But perhaps I missed something...
>
> Right.
>
>> The usual way this works in an XA-like model is:
>>
>> In phase 1 (prepare transaction, in Pg's spelling), failure on
>> any node triggers a rollback on all nodes.
>>
>> In phase 2 (commit prepared), failure on any node causes retries
>> until it succeeds, or until the admin intervenes - say, to remove
>> that node from operation. The global xact as a whole isn't
>> considered successful until it's committed on all nodes.
>>
>> 2PC and distributed commit is well studied, including the
>> problems. We don't have to think this up for ourselves. We don't
>> have to invent anything here. There's a lot of distributed
>> systems theory to work with - especially when dealing with well
>> studied relational DBs trying to maintain ACID semantics.
>
> Right; it is silly not to build on decades of work on theory and
> practice in this area.
>
> What is making me nervous as I watch this thread is a bit of loose
> talk about the I in ACID.  There have been some points where it
> seemed to be implied that we 

Re: [HACKERS] eXtensible Transaction Manager API

2015-11-15 Thread Kevin Grittner
On Saturday, November 14, 2015 8:41 AM, Craig Ringer  
wrote:
> On 13 November 2015 at 21:35, Michael Paquier  
> wrote:
>> On Tue, Nov 10, 2015 at 3:46 AM, Robert Haas  wrote:

>>> If the database is corrupted, there's no way to guarantee that
>>> anything works as planned.  This is like saying that criticizing
>>> somebody's disaster recovery plan on the basis that it will be
>>> inadequate if the entire planet earth is destroyed.

+1

Once all participating servers return "success" from the "prepare"
phase, the only sane way to proceed is to commit as much as
possible.  (I guess you still have some room to argue about what to
do after an error on the attempt to commit the first prepared
transaction, but even there you need to move forward if it might
have been due to a communications failure preventing a success
indication for a commit which was actually successful.)  The idea
of two phase commit is that failure to commit after a successful
prepare should be extremely rare.

>> As well as there could be FS, OS, network problems... To come back to
>> the point, my point is simply that I found surprising the sentence of
>> Konstantin upthread saying that if commit fails on some of the nodes
>> we should rollback the prepared transaction on all nodes. In the
>> example given, in the phase after calling dtm_end_prepare, say we
>> perform COMMIT PREPARED correctly on node 1, but then failed it on
>> node 2 because a meteor has hit a server, it seems that we cannot
>> rollback, instead we had better rolling in a backup and be sure that
>> the transaction gets committed. How would you rollback the transaction
>> already committed on node 1? But perhaps I missed something...

Right.

> The usual way this works in an XA-like model is:
>
> In phase 1 (prepare transaction, in Pg's spelling), failure on
> any node triggers a rollback on all nodes.
>
> In phase 2 (commit prepared), failure on any node causes retries
> until it succeeds, or until the admin intervenes - say, to remove
> that node from operation. The global xact as a whole isn't
> considered successful until it's committed on all nodes.
>
> 2PC and distributed commit is well studied, including the
> problems. We don't have to think this up for ourselves. We don't
> have to invent anything here. There's a lot of distributed
> systems theory to work with - especially when dealing with well
> studied relational DBs trying to maintain ACID semantics.

Right; it is silly not to build on decades of work on theory and
practice in this area.

What is making me nervous as I watch this thread is a bit of loose
talk about the I in ACID.  There have been some points where it
seemed to be implied that we had sufficient transaction isolation
if we could get all the participating nodes using the same
snapshot.  I think I've seen some hint that there is an intention
to use distributed strict two-phase locking with a global lock
manager to achieve serializable behavior.  The former is vulnerable
to some well-known serialization anomalies and the latter was
dropped from PostgreSQL when MVCC was implemented because of its
horrible concurrency and performance characteristics, which is not
going to be less of an issue in a distributed system using that
technique.  It may be possible to implement the Serializable
Snapshot Isolation (SSI) technique across multiple cooperating
nodes, but it's not obvious exactly how that would be implemented,
and I have seen no discussion of that.

If we're going to talk about maintaining ACID semantics in this
environment, I think we need to explicitly state what level of
isolation we intend to provide, and how we intend to do that.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] eXtensible Transaction Manager API

2015-11-14 Thread Craig Ringer
On 13 November 2015 at 21:35, Michael Paquier 
wrote:

> On Tue, Nov 10, 2015 at 3:46 AM, Robert Haas 
> wrote:
> > On Sun, Nov 8, 2015 at 6:35 PM, Michael Paquier
> >  wrote:
> >> Sure. Now imagine that the pg_twophase entry is corrupted for this
> >> transaction on one node. This would trigger a PANIC on it, and
> >> transaction would not be committed everywhere.
> >
> > If the database is corrupted, there's no way to guarantee that
> > anything works as planned.  This is like saying that criticizing
> > somebody's disaster recovery plan on the basis that it will be
> > inadequate if the entire planet earth is destroyed.
>
> As well as there could be FS, OS, network problems... To come back to
> the point, my point is simply that I found surprising the sentence of
> Konstantin upthread saying that if commit fails on some of the nodes
> we should rollback the prepared transaction on all nodes. In the
> example given, in the phase after calling dtm_end_prepare, say we
> perform COMMIT PREPARED correctly on node 1, but then failed it on
> node 2 because a meteor has hit a server, it seems that we cannot
> rollback, instead we had better rolling in a backup and be sure that
> the transaction gets committed. How would you rollback the transaction
> already committed on node 1? But perhaps I missed something...
>

The usual way this works in an XA-like model is:

In phase 1 (prepare transaction, in Pg's spelling), failure on any node
triggers a rollback on all nodes.

In phase 2 (commit prepared), failure on any node causes retries until it
succeeds, or until the admin intervenes - say, to remove that node from
operation. The global xact as a whole isn't considered successful until
it's committed on all nodes.
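
As a rough illustration of that shape, here is a minimal libpqxx sketch; the
GID, the connection list, and the assumption that each session already has an
open transaction block containing the actual work are all placeholders, and
the retry policy (a fixed sleep, no admin escalation) is deliberately naive.

#include <pqxx/pqxx>
#include <chrono>
#include <string>
#include <thread>
#include <vector>

// nodes: one open connection per participant, each with an in-progress
// transaction block in which the real work (the UPDATEs etc.) was executed.
void finish_global_xact(std::vector<pqxx::connection *> &nodes, const std::string &gid)
{
    // Phase 1: PREPARE TRANSACTION everywhere; any failure rolls back everywhere.
    std::size_t prepared = 0;
    try {
        for (auto *node : nodes) {
            pqxx::nontransaction tx{*node};
            tx.exec("prepare transaction " + tx.quote(gid));
            ++prepared;
        }
    } catch (const std::exception &) {
        // Roll back the ones that did prepare; the failing node's own open
        // transaction would also need a plain ROLLBACK (not shown).
        for (std::size_t i = 0; i < prepared; ++i) {
            pqxx::nontransaction tx{*nodes[i]};
            tx.exec("rollback prepared " + tx.quote(gid));
        }
        throw;   // the global transaction fails cleanly
    }

    // Phase 2: COMMIT PREPARED everywhere; failures are retried, never rolled
    // back, because some node may already have committed.
    for (auto *node : nodes) {
        for (;;) {
            try {
                pqxx::nontransaction tx{*node};
                tx.exec("commit prepared " + tx.quote(gid));
                break;
            } catch (const std::exception &) {
                std::this_thread::sleep_for(std::chrono::seconds(1));   // crude retry
            }
        }
    }
}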

2PC and distributed commit is well studied, including the problems. We
don't have to think this up for ourselves. We don't have to invent anything
here. There's a lot of distributed systems theory to work with - especially
when dealing with well studied relational DBs trying to maintain ACID
semantics.

Not to say that there aren't problems with the established ways. The XA API
is horrific. Java's JTA follows it too closely, and whoever thought
that HeuristicMixedException was a good idea augh.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-13 Thread Michael Paquier
On Thu, Nov 12, 2015 at 11:23 PM, Robert Haas  wrote:
> On Mon, Nov 9, 2015 at 2:47 PM, Simon Riggs  wrote:
>> On 9 November 2015 at 18:46, Robert Haas  wrote:
>>> One point I'd like to mention is that it's absolutely critical to
>>> design this in a way that minimizes network roundtrips without
>>> compromising correctness.  XC's GTM proxy suggests that they failed to
>>> do that.  I think we really need to look at what's going to be on the
>>> other sides of the proposed APIs and think about whether it's going to
>>> be possible to have a strong local caching layer that keeps network
>>> roundtrips to a minimum.  We should consider whether the need for such
>>> a caching layer has any impact on what the APIs should look like.
>> You mean the caching layer that already exists in XL/XC?
>
> I don't think that's what I mean, no.  If you have an external GTM
> Proxy, then you have missed a trick, because whatever caching it does
> could be done better inside the process that sent the request to the
> proxy.

Definitely. Having a single communication channel between the backends
and the centralized server that communicates with grouped messages
would clearly help enhance the scalability of the system, though
this is not something the DTM should try to solve, IMO.

>>> For example, consider a 10-node cluster where each node has 32 cores
>>> and 32 clients, and each client is running lots of short-running SQL
>>> statements.  The demand for snapshots will be intense.  If every
>>> backend separately requests a snapshot for every SQL statement from
>>> the coordinator, that's probably going to be terrible.  We can make it
>>> the problem of the stuff behind the DTM API to figure out a way to
>>> avoid that, but maybe that's going to result in every DTM needing to
>>> solve the same problems.
>>
>> The whole purpose of that XTM API is to allow different solutions for that
>> to be created. Konstantin has made a very good case for such an API to
>> exist, based around 3 markedly different approaches.
>
> I'm not arguing with that.
>
>> Whether we allow the API into core to be accessible via extensions is a
>> different issue, but it looks fine for its purpose.
>
> I'm not attacking the API.  I'm trying to have a discussion about the
> important design issues in this area.

FWIW, I just looked at the wiki page regarding the DTM, and the set of
routines is actually very close to what the GTM of XC/XL is doing. Where
XC/XL directly forked the code of Postgres to redirect to the GTM when
fetching a snapshot, TXID, whatever, the DTM uses a set of generic
methods to achieve this purpose and replaces the in-core calls with a
set of custom functions that relocate the work to the external instance in
charge of distributing the transaction data. Other differences are that GTM
is extended for global timestamps and global sequence values, but
that's not really related to ACID.

This just reminded me of the following message where I wondered about the
possibility of adding hooks in those hot code paths...
http://www.postgresql.org/message-id/cab7npqtdjf-58wuf-xz01nkj7wf0e+eukggqhd0igvsod4h...@mail.gmail.com
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-13 Thread Michael Paquier
On Tue, Nov 10, 2015 at 3:46 AM, Robert Haas  wrote:
> On Sun, Nov 8, 2015 at 6:35 PM, Michael Paquier
>  wrote:
>> Sure. Now imagine that the pg_twophase entry is corrupted for this
>> transaction on one node. This would trigger a PANIC on it, and
>> transaction would not be committed everywhere.
>
> If the database is corrupted, there's no way to guarantee that
> anything works as planned.  This is like saying that criticizing
> somebody's disaster recovery plan on the basis that it will be
> inadequate if the entire planet earth is destroyed.

As well as there could be FS, OS, or network problems... To come back to
the point, my point is simply that I found surprising Konstantin's sentence
upthread saying that if commit fails on some of the nodes
we should rollback the prepared transaction on all nodes. In the
example given, in the phase after calling dtm_end_prepare, say we
perform COMMIT PREPARED correctly on node 1, but then fail it on
node 2 because a meteor has hit the server; it seems that we cannot
roll back, and instead we had better roll in a backup and be sure that
the transaction gets committed. How would you rollback a transaction
already committed on node 1? But perhaps I missed something...

>> I am aware of the fact
>> that by definition PREPARE TRANSACTION ensures that a transaction will
>> be committed with COMMIT PREPARED, just trying to see any corner cases
>> with the approach proposed. The DTM approach is actually rather close
>> to what a GTM in Postgres-XC does :)
>
> Yes.  I think that we should try to learn as much as possible from the
> XC experience, but that doesn't mean we should incorporate XC's fuzzy
> thinking about 2PC into PG.  We should not.

Yep.

> One point I'd like to mention is that it's absolutely critical to
> design this in a way that minimizes network roundtrips without
> compromising correctness.  XC's GTM proxy suggests that they failed to
> do that.  I think we really need to look at what's going to be on the
> other sides of the proposed APIs and think about whether it's going to
> be possible to have a strong local caching layer that keeps network
> roundtrips to a minimum.  We should consider whether the need for such
> a caching layer has any impact on what the APIs should look like.

At this time, the number of round trips needed, particularly for READ
COMMITTED transactions that need a new snapshot for each query, was
really a performance killer. We used DBT-1 (TPC-W), which is less
OLTP-like than DBT-2 (TPC-C); still, with DBT-1 the scalability limit
was quickly reached with 10-20 nodes.

> For example, consider a 10-node cluster where each node has 32 cores
> and 32 clients, and each client is running lots of short-running SQL
> statements.  The demand for snapshots will be intense.  If every
> backend separately requests a snapshot for every SQL statement from
> the coordinator, that's probably going to be terrible.  We can make it
> the problem of the stuff behind the DTM API to figure out a way to
> avoid that, but maybe that's going to result in every DTM needing to
> solve the same problems.

This recalls a couple of things, though in 2009 I was not playing with
servers of this scale.

> Another thing that I think we need to consider is fault-tolerance.
> For example, suppose that transaction commit and snapshot coordination
> services are being provided by a central server which keeps track of
> the global commit ordering.  When that server gets hit by a freak bolt
> of lightning and melted into a heap of slag, somebody else needs to
> take over.  Again, this would live below the API proposed here, but I
> think it really deserves some thought before we get too far down the
> path.  XC didn't start thinking about how to add fault-tolerance until
> quite late in the project, I think, and the same could be said of
> PostgreSQL itself: some newer systems have easier-to-use fault
> tolerance mechanisms because it was designed in from the beginning.
> Distributed systems by nature need high availability to a far greater
> degree than single systems, because when there are more nodes, node
> failures are more frequent.

Your memories on the matter are right. In the case of XC, the SPOF
that is the GTM has been made somewhat more stable with the creation of a
GTM standby, though that did not solve the scalability limit of having a
centralized snapshot facility. It actually increased the load on the
whole system for short transactions. Alea jacta est.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-12 Thread Robert Haas
On Mon, Nov 9, 2015 at 2:47 PM, Simon Riggs  wrote:
> On 9 November 2015 at 18:46, Robert Haas  wrote:
>> > I am aware of the fact
>> > that by definition PREPARE TRANSACTION ensures that a transaction will
>> > be committed with COMMIT PREPARED, just trying to see any corner cases
>> > with the approach proposed. The DTM approach is actually rather close
>> > to what a GTM in Postgres-XC does :)
>>
>> Yes.  I think that we should try to learn as much as possible from the
>> XC experience, but that doesn't mean we should incorporate XC's fuzzy
>> thinking about 2PC into PG.  We should not.
>
> Fuzzy thinking. Please explain.

Multiple people who have worked on XC have argued to me that we need
to defend against the case where PREPARE TRANSACTION succeeds and
COMMIT PREPARED fails, say because somebody manually mucked with the
files in the data directory.  I think that if people are manually
mucking with files in the data directory, we have no hope of achieving
sane behavior, and there's not much point in expending code on
defending against any individual way that could happen.

>> One point I'd like to mention is that it's absolutely critical to
>> design this in a way that minimizes network roundtrips without
>> compromising correctness.  XC's GTM proxy suggests that they failed to
>> do that.  I think we really need to look at what's going to be on the
>> other sides of the proposed APIs and think about whether it's going to
>> be possible to have a strong local caching layer that keeps network
>> roundtrips to a minimum.  We should consider whether the need for such
>> a caching layer has any impact on what the APIs should look like.
> You mean the caching layer that already exists in XL/XC?

I don't think that's what I mean, no.  If you have an external GTM
Proxy, then you have missed a trick, because whatever caching it does
could be done better inside the process that sent the request to the
proxy.

>> For example, consider a 10-node cluster where each node has 32 cores
>> and 32 clients, and each client is running lots of short-running SQL
>> statements.  The demand for snapshots will be intense.  If every
>> backend separately requests a snapshot for every SQL statement from
>> the coordinator, that's probably going to be terrible.  We can make it
>> the problem of the stuff behind the DTM API to figure out a way to
>> avoid that, but maybe that's going to result in every DTM needing to
>> solve the same problems.
>
> The whole purpose of that XTM API is to allow different solutions for that
> to be created. Konstantin has made a very good case for such an API to
> exist, based around 3 markedly different approaches.

I'm not arguing with that.

> Whether we allow the API into core to be accessible via extensions is a
> different issue, but it looks fine for its purpose.

I'm not attacking the API.  I'm trying to have a discussion about the
important design issues in this area.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-09 Thread Robert Haas
On Sun, Nov 8, 2015 at 6:35 PM, Michael Paquier
 wrote:
> Sure. Now imagine that the pg_twophase entry is corrupted for this
> transaction on one node. This would trigger a PANIC on it, and
> transaction would not be committed everywhere.

If the database is corrupted, there's no way to guarantee that
anything works as planned.  This is like saying that criticizing
somebody's disaster recovery plan on the basis that it will be
inadequate if the entire planet earth is destroyed.

> I am aware of the fact
> that by definition PREPARE TRANSACTION ensures that a transaction will
> be committed with COMMIT PREPARED, just trying to see any corner cases
> with the approach proposed. The DTM approach is actually rather close
> to what a GTM in Postgres-XC does :)

Yes.  I think that we should try to learn as much as possible from the
XC experience, but that doesn't mean we should incorporate XC's fuzzy
thinking about 2PC into PG.  We should not.

One point I'd like to mention is that it's absolutely critical to
design this in a way that minimizes network roundtrips without
compromising correctness.  XC's GTM proxy suggests that they failed to
do that.  I think we really need to look at what's going to be on the
other sides of the proposed APIs and think about whether it's going to
be possible to have a strong local caching layer that keeps network
roundtrips to a minimum.  We should consider whether the need for such
a caching layer has any impact on what the APIs should look like.

For example, consider a 10-node cluster where each node has 32 cores
and 32 clients, and each client is running lots of short-running SQL
statements.  The demand for snapshots will be intense.  If every
backend separately requests a snapshot for every SQL statement from
the coordinator, that's probably going to be terrible.  We can make it
the problem of the stuff behind the DTM API to figure out a way to
avoid that, but maybe that's going to result in every DTM needing to
solve the same problems.

Another thing that I think we need to consider is fault-tolerance.
For example, suppose that transaction commit and snapshot coordination
services are being provided by a central server which keeps track of
the global commit ordering.  When that server gets hit by a freak bolt
of lightning and melted into a heap of slag, somebody else needs to
take over.  Again, this would live below the API proposed here, but I
think it really deserves some thought before we get too far down the
path.  XC didn't start thinking about how to add fault-tolerance until
quite late in the project, I think, and the same could be said of
PostgreSQL itself: some newer systems have easier-to-use fault
tolerance mechanisms because it was designed in from the beginning.
Distributed systems by nature need high availability to a far greater
degree than single systems, because when there are more nodes, node
failures are more frequent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-09 Thread Simon Riggs
On 9 November 2015 at 18:46, Robert Haas  wrote:


> > I am aware of the fact
> > that by definition PREPARE TRANSACTION ensures that a transaction will
> > be committed with COMMIT PREPARED, just trying to see any corner cases
> > with the approach proposed. The DTM approach is actually rather close
> > to what a GTM in Postgres-XC does :)
>
> Yes.  I think that we should try to learn as much as possible from the
> XC experience, but that doesn't mean we should incorporate XC's fuzzy
> thinking about 2PC into PG.  We should not.
>

Fuzzy thinking. Please explain.

One point I'd like to mention is that it's absolutely critical to
> design this in a way that minimizes network roundtrips without
> compromising correctness.  XC's GTM proxy suggests that they failed to
> do that.  I think we really need to look at what's going to be on the
> other sides of the proposed APIs and think about whether it's going to
> be possible to have a strong local caching layer that keeps network
> roundtrips to a minimum.  We should consider whether the need for such
> a caching layer has any impact on what the APIs should look like.
>

You mean the caching layer that already exists in XL/XC?


> For example, consider a 10-node cluster where each node has 32 cores
> and 32 clients, and each client is running lots of short-running SQL
> statements.  The demand for snapshots will be intense.  If every
> backend separately requests a snapshot for every SQL statement from
> the coordinator, that's probably going to be terrible.  We can make it
> the problem of the stuff behind the DTM API to figure out a way to
> avoid that, but maybe that's going to result in every DTM needing to
> solve the same problems.


The whole purpose of that XTM API is to allow different solutions for that
to be created. Konstantin has made a very good case for such an API to
exist, based around 3 markedly different approaches.

Whether we allow the API into core to be accessible via extensions is a
different issue, but it looks fine for its purpose.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-09 Thread Amit Kapila
On Mon, Nov 9, 2015 at 2:08 PM, Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:
>
>
> On 09.11.2015 07:46, Amit Kapila wrote:
>
> I think so.  Basically DLM should be responsible for maintaining
> all the lock information which inturn means that any backend process
> that needs to acquire/release lock needs to interact with DLM, without that
> I don't think even global deadlock detection can work (to detect deadlocks
> among all the nodes, it needs to know the lock info of all nodes).
>
>
> I hope that it will not be needed, otherwise it will add a significant
> performance penalty.
> Unless I missed something, locks can still be managed locally, but we need
> a DLM to detect global deadlocks.
>

How will you check lock conflicts? For example, Process A on node-1
tries to acquire a lock on object-1, but Process B from node-2 already
holds a conflicting lock on object-1.  Now if we think of keeping an
entry for the lock request from node-2 on node-1, then
I think it will be difficult to manage and release locks.

>
>
>
>> 3. Should DLM be implemented by separate process or should it be part of
>> arbiter (dtmd).
>>
>
> That's important decision. I think it will depend on which kind of design
> we choose for distributed transaction manager (arbiter based solution or
> non-arbiter based solution, something like tsDTM).  I think DLM should be
> separate, else arbiter will become hot-spot with respect to contention.
>
>
> There are pros and cons.
> Pros for integrating the DLM in the DTM:
> 1. The DTM arbiter has information about the local to global transaction ID
> mapping, which may be needed by the DLM.
> 2. If my assumptions about the DLM are correct, then it will be accessed
> relatively rarely and should not have a significant
> impact on performance.
>
>
Yeah, if the DLM is used relatively rarely, then it can make sense
to combine them, but otherwise it doesn't make much sense.


> Cons:
> 1. tsDTM doesn't need a centralized arbiter but still needs a DLM.
> 2. Logically, the DLM and the DTM are independent components.
>
>
>
> Can you please explain more about tsDTM approach, how timestamps
> are used, what is exactly CSN (is it Commit Sequence Number) and
> how it is used in prepare phase?  IS CSN used as timestamp?
> Is the coordinator also one of the PostgreSQL instance?
>
>
> In the tsDTM approach, system time (in microseconds) is used as the CSN (commit
> sequence number).
> We also enforce that assigned CSNs are unique and monotonic within a
> PostgreSQL instance.
> CSNs are assigned locally and do not require interaction with other
> cluster nodes.
> This is why, in theory, the tsDTM approach should provide good scalability.
>

Okay, but won't checking the visibility of tuples need a transaction-to-CSN
mapping?


>
> From my point of view there are two different scenarios:
> 1. When most transactions are local and only a few of them are global (for
> example, most operations in a branch of a bank are
> performed with accounts of clients of that branch, but there are a few
> transactions involving accounts from different branches).
> 2. When most or all transactions are global.
>
> It seems to me that the first scenario is more common in real life, and
> good performance of a distributed system can actually be achieved only in
> the case when most transactions are local (involve only one node). There are
> several approaches allowing local transactions to be optimized, for example
> the one used in SAP HANA (
> http://pi3.informatik.uni-mannheim.de/~norman/dsi_jour_2014.pdf)
> We also have a DTM implementation based on this approach, but it is not yet
> working.
>
> If most transactions are global, they affect random subsets of the cluster
> nodes (so it is not possible to logically split the cluster into groups of
> tightly coupled nodes), and if the number of nodes is not very large (<10) then I
> do not think that there can be a better alternative (from the performance point
> of view) than a centralized arbiter.
>
>
I am slightly confused by the above statement; for both
scenarios you seem to be suggesting a centralized
arbiter. I think when most transactions are global, having a centralized
arbiter might not be a good solution, especially if the number of nodes in
the cluster is large.


> But this is only my speculation, and it would be really interesting for
> me to know the access patterns of real customers using distributed systems.
>
>
I think it is better if the solution is optimized for all kinds of scenarios,
because once the solution is adopted by PostgreSQL, it will be very
difficult to change it.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-09 Thread Simon Riggs
On 8 November 2015 at 23:35, Michael Paquier 
wrote:


> > COMMIT PREPARED is a pretty thin layer; the work is all done in the
> PREPARE.
> > With a DTM, the main commit itself is done once only in the DTM, so all
> the
> > COMMIT PREPARED would do is release local node resources.
>
> Sure. Now imagine that the pg_twophase entry is corrupted for this
> transaction on one node. This would trigger a PANIC on it, and
> transaction would not be committed everywhere. I am aware of the fact
> that by definition PREPARE TRANSACTION ensures that a transaction will
> be committed with COMMIT PREPARED, just trying to see any corner cases
> with the approach proposed. The DTM approach is actually rather close
> to what a GTM in Postgres-XC does :)
>

This is wrong.

There is no "approach proposed"; this is just the normal 2PC feature that
PostgreSQL has supported since 8.1. All that is proposed here is that we
have an API that allows this to be exposed.

The whole point of PREPARE is that it allows a database to crash after that
step and it can still be recovered. That has nothing to do with Xanything.

In this case, the role of the GTM is to record the commit. Other nodes
consult the GTM, not local state, to see whether the transaction has
committed; the GTM acts as the transaction manager in an XA sense.
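One way to picture that division of labour (a Go-style sketch; the interface and
its method names are purely illustrative, not the actual GTM protocol):

    package gtmsketch

    // GTM records each global commit exactly once and answers status queries.
    type GTM interface {
        RecordCommit(gtid string)
        IsCommitted(gtid string) bool
    }

    // visible is what a node would ask before treating the effects of a prepared
    // transaction as committed: the authority is the GTM, not the local clog.
    func visible(gtm GTM, gtid string, preparedLocally bool) bool {
        return preparedLocally && gtm.IsCommitted(gtid)
    }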

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-09 Thread Konstantin Knizhnik



On 09.11.2015 07:46, Amit Kapila wrote:
On Sat, Nov 7, 2015 at 10:23 PM, Konstantin Knizhnik 
> wrote:



Lock manager is one of the tasks we are currently working on.
There are still a lot of open questions:
1. Should distributed lock manager (DLM) do something else except
detection of distributed deadlock?


I think so.  Basically the DLM should be responsible for maintaining
all the lock information, which in turn means that any backend process
that needs to acquire/release a lock needs to interact with the DLM; without that
I don't think even global deadlock detection can work (to detect deadlocks
among all the nodes, it needs to know the lock info of all nodes).


I hope that it will not be needed, otherwise it will add a significant
performance penalty.
Unless I missed something, locks can still be managed locally, but we need a
DLM to detect global deadlocks.
Deadlock detection in PostgreSQL is performed only after a timeout
expires while waiting for a lock. Only then is the lock graph analyzed for the
presence of loops. At this moment we can send our local lock graph to the
DLM. If it is really a distributed deadlock, then sooner or later
all nodes involved in this deadlock will send their local graphs to the DLM,
and the DLM will be able to build the global graph and detect the distributed
deadlock. In this scenario the DLM will be accessed very rarely - only when a
lock can not be granted within deadlock_timeout.
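A rough sketch of the DLM side of that idea (Go; the edge type and the shape of
the merged wait-for graph are illustrative assumptions, and the "back edge means
deadlock" answer is exactly the possibly-false positive discussed just below):

    package dlm

    // edge says "waiter is blocked on holder"; both are global transaction ids.
    type edge struct{ waiter, holder string }

    // hasDeadlock merges the local wait-for graphs sent by the nodes after their
    // deadlock_timeout expired and looks for a cycle with a plain DFS.
    func hasDeadlock(localGraphs [][]edge) bool {
        adj := map[string][]string{}
        for _, g := range localGraphs {
            for _, e := range g {
                adj[e.waiter] = append(adj[e.waiter], e.holder)
            }
        }
        const (
            white = iota // not visited yet
            grey         // on the current DFS path
            black        // fully explored
        )
        color := map[string]int{}
        var visit func(x string) bool
        visit = func(x string) bool {
            color[x] = grey
            for _, y := range adj[x] {
                if color[y] == grey {
                    return true // back edge: a (possibly false) global deadlock
                }
                if color[y] == white && visit(y) {
                    return true
                }
            }
            color[x] = black
            return false
        }
        for x := range adj {
            if color[x] == white && visit(x) {
                return true
            }
        }
        return false
    }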


One of the possible problems is false global deadlock detection: because the
local lock graphs correspond to different moments in time, we can find a
"false" loop. I am still thinking about the best solution to this problem.




This is somewhat in line with what we currently do during lock conflicts,
i.e. if the backend incurs a lock conflict and decides to sleep, before
sleeping it tries to detect an early deadlock, so if we make the DLM responsible
for managing locks the same could even be achieved in a distributed system.

2. Should DLM be part of XTM API or it should be separate API?


We might be able to do it either way (make it part of the XTM API or devise a
separate API). I think the more important point here is to first get the
high-level design for Distributed Transactions (which primarily includes consistent
Commit/Rollback, snapshots, distributed lock manager (DLM) and recovery).

3. Should DLM be implemented by separate process or should it be
part of arbiter (dtmd).


That's important decision. I think it will depend on which kind of design
we choose for distributed transaction manager (arbiter based solution or
non-arbiter based solution, something like tsDTM).  I think DLM should be
separate, else arbiter will become hot-spot with respect to contention.


There are pros and cons.
Pros for integrating the DLM in the DTM:
1. The DTM arbiter has information about the local to global transaction ID
mapping, which may be needed by the DLM.
2. If my assumptions about the DLM are correct, then it will be accessed
relatively rarely and should not have a significant
impact on performance.

Cons:
1. tsDTM doesn't need a centralized arbiter but still needs a DLM.
2. Logically, the DLM and the DTM are independent components.




Can you please explain more about tsDTM approach, how timestamps
are used, what is exactly CSN (is it Commit Sequence Number) and
how it is used in prepare phase?  IS CSN used as timestamp?
Is the coordinator also one of the PostgreSQL instance?


In the tsDTM approach, system time (in microseconds) is used as the CSN (commit
sequence number).
We also enforce that assigned CSNs are unique and monotonic within a
PostgreSQL instance.
CSNs are assigned locally and do not require interaction with other
cluster nodes.

This is why, in theory, the tsDTM approach should provide good scalability.
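A small sketch of such a CSN source (Go; the names and the tie-breaking rule are
illustrative assumptions, not the actual tsDTM code):

    package tsdtm

    import (
        "sync"
        "time"
    )

    // csnSource hands out the system clock in microseconds, bumped by one whenever
    // the clock has not advanced, so CSNs stay unique and monotonic within this
    // instance without talking to any other node.
    type csnSource struct {
        mu   sync.Mutex
        last int64
    }

    func (s *csnSource) next() int64 {
        s.mu.Lock()
        defer s.mu.Unlock()
        now := time.Now().UnixMicro()
        if now <= s.last {
            now = s.last + 1
        }
        s.last = now
        return now
    }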



I think in this patch, it is important to see the completeness of all the
API's that need to be exposed for the implementation of distributed
transactions, and the same is difficult to visualize without having a complete
picture of all the components that have some interaction with the distributed
transaction system.  On the other hand we can do it in an incremental fashion
as and when more parts of the design are clear.


That is exactly what we are going to do - we are trying to
integrate the DTM with existing systems (pg_shard, postgres_fdw, BDR)
and find out what is missing and should be added. In parallel we
are trying to compare the efficiency and scalability of different
solutions.


One thing I have noticed is that in the DTM approach you seem to
be considering a centralized (Arbiter) transaction manager and a
centralized Lock manager, which seems to be a workable approach,
but I think such an approach won't scale in write-heavy or mixed
read-write workloads.  Have you thought about distributing the responsibility
of global transaction and lock management?


From my point of view there are two different scenarios:
1. When most transactions are local and 

Re: [HACKERS] eXtensible Transaction Manager API

2015-11-08 Thread Michael Paquier
On Sun, Nov 8, 2015 at 1:53 AM, Konstantin Knizhnik wrote:
> In tsDTM approach two phase commit is performed by coordinator and currently
> is using standard PostgreSQL two phase commit:
>
> Code in GO performing two phase commit:
>
>   exec(conn1, "prepare transaction '" + gtid + "'")
>   exec(conn2, "prepare transaction '" + gtid + "'")
>   exec(conn1, "select dtm_begin_prepare($1)", gtid)
>   exec(conn2, "select dtm_begin_prepare($1)", gtid)
>   csn = _execQuery(conn1, "select dtm_prepare($1, 0)", gtid)
>   csn = _execQuery(conn2, "select dtm_prepare($1, $2)", gtid, csn)
>   exec(conn1, "select dtm_end_prepare($1, $2)", gtid, csn)
>   exec(conn2, "select dtm_end_prepare($1, $2)", gtid, csn)
>   exec(conn1, "commit prepared '" + gtid + "'")
>   exec(conn2, "commit prepared '" + gtid + "'")
>
> If commit at some of the nodes failed, coordinator should rollback prepared
> transaction at all nodes.

Not always. If COMMIT PREPARED fails at some of the nodes but succeeds
on others, the transaction is already partially acknowledged as
committed in the cluster. Hence it makes more sense for the
coordinator to commit the transaction on the remaining nodes. Before
issuing any COMMIT PREPARED queries, I guess it's fine to rollback
the transaction on all nodes though.
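A sketch of that coordinator rule (Go with database/sql; the node list and error
handling are illustrative assumptions): rolling back everywhere is only an option
while nothing has been committed yet; after the first successful COMMIT PREPARED
the only way forward is to keep driving the remaining nodes to commit.

    package coordinator

    import (
        "context"
        "database/sql"
        "fmt"
    )

    func finishPrepared(ctx context.Context, nodes []*sql.DB, gtid string) error {
        committedAny := false
        for _, node := range nodes {
            _, err := node.ExecContext(ctx, "commit prepared '"+gtid+"'")
            if err == nil {
                committedAny = true
                continue
            }
            if !committedAny {
                // Nothing is committed yet, so aborting the whole thing is still safe.
                for _, n := range nodes {
                    n.ExecContext(ctx, "rollback prepared '"+gtid+"'") // best effort
                }
                return fmt.Errorf("aborted %s before any commit: %w", gtid, err)
            }
            // Partially committed: rolling back now would break atomicity, so this
            // node must still be committed, retrying until it succeeds.
            return fmt.Errorf("node still needs commit prepared for %s: %w", gtid, err)
        }
        return nil
    }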
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-08 Thread Konstantin Knizhnik

On 11/08/2015 02:46 PM, Michael Paquier wrote:

On Sun, Nov 8, 2015 at 1:53 AM, Konstantin Knizhnik wrote:

In tsDTM approach two phase commit is performed by coordinator and currently
is using standard PostgreSQL two phase commit:

Code in GO performing two phase commit:

   exec(conn1, "prepare transaction '" + gtid + "'")
   exec(conn2, "prepare transaction '" + gtid + "'")
   exec(conn1, "select dtm_begin_prepare($1)", gtid)
   exec(conn2, "select dtm_begin_prepare($1)", gtid)
   csn = _execQuery(conn1, "select dtm_prepare($1, 0)", gtid)
   csn = _execQuery(conn2, "select dtm_prepare($1, $2)", gtid, csn)
   exec(conn1, "select dtm_end_prepare($1, $2)", gtid, csn)
   exec(conn2, "select dtm_end_prepare($1, $2)", gtid, csn)
   exec(conn1, "commit prepared '" + gtid + "'")
   exec(conn2, "commit prepared '" + gtid + "'")

If commit at some of the nodes failed, coordinator should rollback prepared
transaction at all nodes.

Not always. If COMMIT PREPARED fails at some of the nodes but succeeds
on others, the transaction is already partially acknowledged as
committed in the cluster. Hence it makes more sense for the
coordinator to commit transactions on the remaining nodes. Before
issuing any COMMIT PREPARED queries, I guess that's fine to rollback
the transactions on all nodes though.

We will get an inconsistency if the transaction is committed on only a subset of the
nodes involved in the transaction.
Assume the bank debit-credit example. If some distributed transaction transfers
money from an account at one node to an account at another node,
then committing the transaction at just one node causes an incorrect total balance.
The main goal of the DTM is to preserve ACID semantics for distributed transactions,
so either the transaction is committed at all nodes, or it is not committed at
all.



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-08 Thread Simon Riggs
On 8 November 2015 at 16:59, Konstantin Knizhnik 
wrote:

> On 11/08/2015 02:46 PM, Michael Paquier wrote:
>
>> On Sun, Nov 8, 2015 at 1:53 AM, Konstantin Knizhnik wrote:
>>
>>> In tsDTM approach two phase commit is performed by coordinator and
>>> currently
>>> is using standard PostgreSQL two phase commit:
>>>
>>> Code in GO performing two phase commit:
>>>
>>>exec(conn1, "prepare transaction '" + gtid + "'")
>>>exec(conn2, "prepare transaction '" + gtid + "'")
>>>exec(conn1, "select dtm_begin_prepare($1)", gtid)
>>>exec(conn2, "select dtm_begin_prepare($1)", gtid)
>>>csn = _execQuery(conn1, "select dtm_prepare($1, 0)", gtid)
>>>csn = _execQuery(conn2, "select dtm_prepare($1, $2)", gtid,
>>> csn)
>>>exec(conn1, "select dtm_end_prepare($1, $2)", gtid, csn)
>>>exec(conn2, "select dtm_end_prepare($1, $2)", gtid, csn)
>>>exec(conn1, "commit prepared '" + gtid + "'")
>>>exec(conn2, "commit prepared '" + gtid + "'")
>>>
>>> If commit at some of the nodes failed, coordinator should rollback
>>> prepared
>>> transaction at all nodes.
>>>
>> Not always. If COMMIT PREPARED fails at some of the nodes but succeeds
>> on others, the transaction is already partially acknowledged as
>> committed in the cluster. Hence it makes more sense for the
>> coordinator to commit transactions on the remaining nodes. Before
>> issuing any COMMIT PREPARED queries, I guess that's fine to rollback
>> the transactions on all nodes though.
>>
> We will get inconsistency if  transaction is committed on some subset of
> nodes involved in transaction.
> Assume bank debit-credit example. If some distributed transaction
> transfers money from the account at one node to the account and another
> node,
> then committing transaction just at one node cause incorrect total balance.
> The main goal of DTM is to preserve ACID semantic for distributed
> transaction, so either transaction is committed at all nodes, either it is
> not committed at all.


Agreed.

COMMIT PREPARED is a pretty thin layer; the work is all done in the
PREPARE. With a DTM, the main commit itself is done once only in the DTM,
so all the COMMIT PREPARED would do is release local node resources.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-08 Thread Michael Paquier
On Mon, Nov 9, 2015 at 4:33 AM, Simon Riggs  wrote:
> On 8 November 2015 at 16:59, Konstantin Knizhnik 
> wrote:
>>
>> On 11/08/2015 02:46 PM, Michael Paquier wrote:
>>>
>>> On Sun, Nov 8, 2015 at 1:53 AM, Konstantin Knizhnik wrote:

 In tsDTM approach two phase commit is performed by coordinator and
 currently
 is using standard PostgreSQL two phase commit:

 Code in GO performing two phase commit:

exec(conn1, "prepare transaction '" + gtid + "'")
exec(conn2, "prepare transaction '" + gtid + "'")
exec(conn1, "select dtm_begin_prepare($1)", gtid)
exec(conn2, "select dtm_begin_prepare($1)", gtid)
csn = _execQuery(conn1, "select dtm_prepare($1, 0)", gtid)
csn = _execQuery(conn2, "select dtm_prepare($1, $2)", gtid,
 csn)
exec(conn1, "select dtm_end_prepare($1, $2)", gtid, csn)
exec(conn2, "select dtm_end_prepare($1, $2)", gtid, csn)
exec(conn1, "commit prepared '" + gtid + "'")
exec(conn2, "commit prepared '" + gtid + "'")

 If commit at some of the nodes failed, coordinator should rollback
 prepared
 transaction at all nodes.
>>>
>>> Not always. If COMMIT PREPARED fails at some of the nodes but succeeds
>>> on others, the transaction is already partially acknowledged as
>>> committed in the cluster. Hence it makes more sense for the
>>> coordinator to commit transactions on the remaining nodes. Before
>>> issuing any COMMIT PREPARED queries, I guess that's fine to rollback
>>> the transactions on all nodes though.
>>
>> We will get inconsistency if  transaction is committed on some subset of
>> nodes involved in transaction.
>> Assume bank debit-credit example. If some distributed transaction
>> transfers money from the account at one node to the account and another
>> node,
>> then committing transaction just at one node cause incorrect total
>> balance.
>> The main goal of DTM is to preserve ACID semantic for distributed
>> transaction, so either transaction is committed at all nodes, either it is
>> not committed at all.
>
>
> Agreed.
>
> COMMIT PREPARED is a pretty thin layer; the work is all done in the PREPARE.
> With a DTM, the main commit itself is done once only in the DTM, so all the
> COMMIT PREPARED would do is release local node resources.

Sure. Now imagine that the pg_twophase entry is corrupted for this
transaction on one node. This would trigger a PANIC on it, and
transaction would not be committed everywhere. I am aware of the fact
that by definition PREPARE TRANSACTION ensures that a transaction will
be committed with COMMIT PREPARED, just trying to see any corner cases
with the approach proposed. The DTM approach is actually rather close
to what a GTM in Postgres-XC does :)
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-08 Thread Amit Kapila
On Sat, Nov 7, 2015 at 10:23 PM, Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

>
> Lock manager is one of the tasks we are currently working on.
> There are still a lot of open questions:
> 1. Should distributed lock manager (DLM) do something else except
> detection of distributed deadlock?
>

I think so.  Basically the DLM should be responsible for maintaining
all the lock information, which in turn means that any backend process
that needs to acquire/release a lock needs to interact with the DLM; without that
I don't think even global deadlock detection can work (to detect deadlocks
among all the nodes, it needs to know the lock info of all nodes).
This is somewhat in line with what we currently do during lock conflicts,
i.e. if the backend incurs a lock conflict and decides to sleep, before
sleeping it tries to detect an early deadlock, so if we make the DLM responsible
for managing locks the same could even be achieved in a distributed system.


> 2. Should DLM be part of XTM API or it should be separate API?
>

We might be able to do it either way (make it part of the XTM API or devise a
separate API). I think the more important point here is to first get the
high-level design for Distributed Transactions (which primarily includes consistent
Commit/Rollback, snapshots, distributed lock manager (DLM) and recovery).



> 3. Should DLM be implemented by separate process or should it be part of
> arbiter (dtmd).
>

That's an important decision. I think it will depend on which kind of design
we choose for the distributed transaction manager (an arbiter-based solution or
a non-arbiter-based solution, something like tsDTM).  I think the DLM should be
separate, else the arbiter will become a hot-spot with respect to contention.


> 4. How to globally identify resource owners (transactions) in global lock
> graph. In case of DTM we have global (shared) XIDs,
> and in tsDTM - global transactions IDs, assigned by application (which is
> not so clear how to retrieve).
> In other cases we may need to have local->global transaction id mapping,
> so looks like DLM should be part of DTM...
>
>
I think the DLM should in itself have all the necessary information to find
deadlocks or anything else required by the locking system.  For what kind of
cases do you envision the need to identify global resource owners?
And I think if we require some interaction between the DLM and the DTM, then we
can implement the same, rather than making the DLM part of the DTM.


>
>
> Also I have
> noticed that discussion about Rollback is not there, example how will
> Rollback happen with API's provided in your second approach (tsDTM)?
>
>
> In tsDTM approach two phase commit is performed by coordinator and
> currently is using standard PostgreSQL two phase commit:
>
>
> If commit at some of the nodes failed, coordinator should rollback
> prepared transaction at all nodes.
>
>
Can you please explain more about the tsDTM approach: how timestamps
are used, what exactly CSN is (is it Commit Sequence Number), and
how it is used in the prepare phase?  Is the CSN used as a timestamp?
Is the coordinator also one of the PostgreSQL instances?

>
> I think in this patch, it is important to see the completeness of all the
> API's that needs to be exposed for the implementation of distributed
> transactions and the same is difficult to visualize without having complete
> picture of all the components that has some interaction with the
> distributed
> transaction system.  On the other hand we can do it in incremental fashion
> as and when more parts of the design are clear.
>
>
> That is exactly what we are going to do - we are trying to integrate DTM
> with existed systems (pg_shard, postgres_fdw, BDR) and find out what is
> missed and should be added. In parallel we are trying to compare efficiency
> and scalability of different solutions.
>

One thing I have noticed is that in the DTM approach you seem to
be considering a centralized (Arbiter) transaction manager and a
centralized Lock manager, which seems to be a workable approach,
but I think such an approach won't scale in write-heavy or mixed
read-write workloads.  Have you thought about distributing the responsibility
of global transaction and lock management?  I think such a system
might be somewhat difficult to design, but the scalability will be better.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-08 Thread Simon Riggs
On 7 November 2015 at 16:53, Konstantin Knizhnik 
wrote:


> On 11/07/2015 05:11 PM, Amit Kapila wrote:
>
>
> Today, while studying your proposal and related material, I noticed
> that in both the approaches DTM and tsDTM, you are talking about
> committing a transaction and acquiring the snapshot consistently, but
> not touched upon the how the locks will be managed across nodes and
> how deadlock detection across nodes will work.  This will also be one
> of the crucial points in selecting one of the approaches.
>
>
> Lock manager is one of the tasks we are currently working on.
> There are still a lot of open questions:
> 1. Should distributed lock manager (DLM) do something else except
> detection of distributed deadlock?
> 2. Should DLM be part of XTM API or it should be separate API?
> 3. Should DLM be implemented by separate process or should it be part of
> arbiter (dtmd).
> 4. How to globally identify resource owners (transactions) in global lock
> graph. In case of DTM we have global (shared) XIDs,
> and in tsDTM - global transactions IDs, assigned by application (which is
> not so clear how to retrieve).
> In other cases we may need to have local->global transaction id mapping,
> so looks like DLM should be part of DTM...
>

Yes, we need a Distributed Lock Manager, but I think it's a separate thing
from the DTM.

(I'm loath to use the phrase DLM, which was used within Oracle Parallel
Server for a buffer lock manager, which is not what is being discussed).

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-07 Thread Amit Kapila
On Sat, Oct 31, 2015 at 2:52 PM, konstantin knizhnik <
k.knizh...@postgrespro.ru> wrote:

> Hi,
>
> PostgresPro cluster team wants to announce proposal for eXtensible
> Transaction Manager API and reference implementation of distributed
> transaction manager (pg_dtm).
> pg_dtm is just a standard PostgreSQL extension which should be installed
> in normal way.
>
> Source of pg_dtm and PostgreSQL patches are available here:
> https://github.com/postgrespro/pg_dtm
> WiKi page: https://wiki.postgresql.org/wiki/DTM
>
> xtm.patch patches PostgreSQL core by replacing direct calls of 7 TM
> functions with indirect calls and adding 3 addition events to transaction
> commit callbacks:
>
>
>
>
>
> postgres_fdw.patch patches postgres_fdw extension to work with DTM
>
>
>
>
> We have also pgDTM implementation which is using timestamps (system time)
> as CSNs.
> It is also based on the same XTM transaction API.
> We will publish it later once we clarify problems with recovery and
> performance with this approach.
>
> Also we have patch for pg_shard to work with DTM, which also will be
> published soon.
>
>
Today, while studying your proposal and related material, I noticed
that in both the approaches, DTM and tsDTM, you are talking about
committing a transaction and acquiring the snapshot consistently, but
have not touched upon how the locks will be managed across nodes and
how deadlock detection across nodes will work.  This will also be one
of the crucial points in selecting one of the approaches.  Also I have
noticed that discussion about Rollback is not there; for example, how will
Rollback happen with the API's provided in your second approach (tsDTM)?
Similarly, having some discussion on parts of recovery that could be
affected would be great.

I think in this patch, it is important to see the completeness of all the
API's that need to be exposed for the implementation of distributed
transactions, and the same is difficult to visualize without having a complete
picture of all the components that have some interaction with the distributed
transaction system.  On the other hand we can do it in an incremental fashion
as and when more parts of the design are clear.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-07 Thread Konstantin Knizhnik

Hi,
Thank you for your feedback.
My comments are inside.

On 11/07/2015 05:11 PM, Amit Kapila wrote:


Today, while studying your proposal and related material, I noticed
that in both the approaches DTM and tsDTM, you are talking about
committing a transaction and acquiring the snapshot consistently, but
not touched upon the how the locks will be managed across nodes and
how deadlock detection across nodes will work.  This will also be one
of the crucial points in selecting one of the approaches.


Lock manager is one of the tasks we are currently working on.
There are still a lot of open questions:
1. Should the distributed lock manager (DLM) do something else except detection of
distributed deadlocks?
2. Should the DLM be part of the XTM API or should it be a separate API?
3. Should the DLM be implemented by a separate process or should it be part of the
arbiter (dtmd)?
4. How to globally identify resource owners (transactions) in the global lock
graph? In case of DTM we have global (shared) XIDs,
and in tsDTM - global transaction IDs, assigned by the application (which it is not
so clear how to retrieve).
In other cases we may need to have a local->global transaction id mapping, so it
looks like the DLM should be part of the DTM...




Also I have
noticed that discussion about Rollback is not there, example how will
Rollback happen with API's provided in your second approach (tsDTM)?


In the tsDTM approach, two-phase commit is performed by the coordinator and currently
uses standard PostgreSQL two-phase commit:

Code in GO performing two phase commit:

  exec(conn1, "prepare transaction '" + gtid + "'")
  exec(conn2, "prepare transaction '" + gtid + "'")
  exec(conn1, "select dtm_begin_prepare($1)", gtid)
  exec(conn2, "select dtm_begin_prepare($1)", gtid)
  csn = _execQuery(conn1, "select dtm_prepare($1, 0)", gtid)
  csn = _execQuery(conn2, "select dtm_prepare($1, $2)", gtid, csn)
  exec(conn1, "select dtm_end_prepare($1, $2)", gtid, csn)
  exec(conn2, "select dtm_end_prepare($1, $2)", gtid, csn)
  exec(conn1, "commit prepared '" + gtid + "'")
  exec(conn2, "commit prepared '" + gtid + "'")

If commit at some of the nodes fails, the coordinator should rollback the prepared
transaction at all nodes.


Similarly, having some discussion on parts of recovery that could be affected
would be great.


We are currently implementing fault tolerance and recovery for the DTM approach
(with a centralized arbiter).
There are several replicas of the arbiter, synchronized using the RAFT protocol.
But with the tsDTM approach the recovery model is still obscure...
We are thinking about it.


I think in this patch, it is important to see the completeness of all the
API's that needs to be exposed for the implementation of distributed
transactions and the same is difficult to visualize without having complete
picture of all the components that has some interaction with the distributed
transaction system.  On the other hand we can do it in incremental fashion
as and when more parts of the design are clear.


That is exactly what we are going to do - we are trying to integrate the DTM with
existing systems (pg_shard, postgres_fdw, BDR) and find out what is missing and
should be added. In parallel we are trying to compare the efficiency and
scalability of different solutions.
For example, we are still considering scalability problems with the tsDTM approach:
to provide acceptable performance, it requires very precise clock synchronization
(we have to use PTP instead of NTP). So it may be a waste of time to try to provide
fault tolerance for tsDTM if we finally find out that this approach cannot provide
better scalability than the simpler DTM approach.




With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com 




Re: [HACKERS] eXtensible Transaction Manager API

2015-11-03 Thread Simon Riggs
On 2 November 2015 at 13:24, Konstantin Knizhnik 
wrote:

> PostgreSQL assumes that top-level xid commit is atomic, along with all of
> its subtransactions. So the API having access to only Get/Set at the xid
> level would not work. We would need TransactionIdSetTreeStatus() rather
> than just SetStatus. GetStatus is not atomic.
>
>
> XTM API has function SetTransactionStatus but it actually encapsulate
> TransactionIdSetTreeStatus.
>

OK, thanks.


> At present, I can't tell whether you need subtrans and multixact calls in
> the API as well. I would imagine we probably do, though we might imagine an
> implementation that didn't support those concepts.
>
>
> Postgres-XL doesn't support subtransactions. Neither did we at this moment.
> Support of subtransactions will be our next step.
>

OK. I guess it depends where we want to put the patch; discussion here
means in-core, so for it to go in here, we do need subxacts etc.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-02 Thread Simon Riggs
On 31 October 2015 at 10:22, konstantin knizhnik 
wrote:

> Hi,
>
> PostgresPro cluster team wants to announce proposal for eXtensible
> Transaction Manager API and reference implementation of distributed
> transaction manager (pg_dtm).
> pg_dtm is just a standard PostgreSQL extension which should be installed
> in normal way.
>
> Source of pg_dtm and PostgreSQL patches are available here:
> https://github.com/postgrespro/pg_dtm
> WiKi page: https://wiki.postgresql.org/wiki/DTM


Very interesting work.


> xtm.patch patches PostgreSQL core by replacing direct calls of 7 TM
> functions with indirect calls and


At first I was concerned about recovery, but that looks to be covered.

PostgreSQL assumes that top-level xid commit is atomic, along with all of
its subtransactions. So the API having access to only Get/Set at the xid
level would not work. We would need TransactionIdSetTreeStatus() rather
than just SetStatus. GetStatus is not atomic.


> adding 3 addition events to transaction commit callbacks:
>

Those look uncontentious, so we should add those anyway in a sub-patch.


> We have also pgDTM implementation which is using timestamps (system time)
> as CSNs.
> It is also based on the same XTM transaction API.
> We will publish it later once we clarify problems with recovery and
> performance with this approach.
>

That is the most interesting part, so it needs to be published.

At present, I can't tell whether you need subtrans and multixact calls in
the API as well. I would imagine we probably do, though we might imagine an
implementation that didn't support those concepts.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-02 Thread Konstantin Knizhnik



On 02.11.2015 12:01, Simon Riggs wrote:


At first I was concerned about recovery, but that looks to be covered.

Yes, we have not considered all possible scenarios of working with 
PostgreSQL.
We have tested work with different isolation levels: repeatable read and 
read committed,

but  we have not tested "select for update" and explicit locking.


PostgreSQL assumes that top-level xid commit is atomic, along with all 
of its subtransactions. So the API having access to only Get/Set at 
the xid level would not work. We would 
need TransactionIdSetTreeStatus() rather than just SetStatus. 
GetStatus is not atomic.


The XTM API has the function SetTransactionStatus, but it actually encapsulates
TransactionIdSetTreeStatus.



adding 3 addition events to transaction commit callbacks:


Those look uncontentious, so we should add those anyway in a sub-patch.

We have also pgDTM implementation which is using timestamps
(system time) as CSNs.
It is also based on the same XTM transaction API.
We will publish it later once we clarify problems with recovery
and performance with this approach.


That is most interesting part, so it needs to be published.


Will be available during this week.


At present, I can't tell whether you need subtrans and multixact calls 
in the API as well. I would imagine we probably do, though we might 
imagine an implementation that didn't support those concepts.


Postgres-XL doesn't support subtransactions. Neither do we at this moment.
Support of subtransactions will be our next step.



Re: [HACKERS] eXtensible Transaction Manager API

2015-11-02 Thread Konstantin Knizhnik

On 02.11.2015 06:17, Craig Ringer wrote:

On 31 October 2015 at 17:22, konstantin knizhnik
 wrote:


Waiting for your feedback

For anyone wondering about performance impact, there are some graphs
on page 23 of the PDF presentation. I didn't see anything else, and
the graphs don't seem to cover comparison of Pg with the XTM
transaction manager hooks and no DTM enabled vs Pg without the hooks,
i.e. the hook overhead its self.

Have you done much work on that? Personally I wouldn't expect to see
any meaningful overhead, but I'd really like to have numbers behind
that.


Overhead of indirect call is negligible - see for example
https://gist.github.com/rianhunter/0be8dc116b120ad5fdd4
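The same kind of comparison is easy to reproduce in miniature; a sketch of a Go
benchmark contrasting a direct call with a call through a replaceable function
variable, which is the analogue of the XTM indirection (absolute numbers will of
course differ from the C case):

    package xtmbench

    import "testing"

    var getSnapshot = func() int { return 42 } // call through a function variable

    func getSnapshotDirect() int { return 42 } // ordinary direct call

    func BenchmarkIndirect(b *testing.B) {
        for i := 0; i < b.N; i++ {
            _ = getSnapshot()
        }
    }

    func BenchmarkDirect(b *testing.B) {
        for i := 0; i < b.N; i++ {
            _ = getSnapshotDirect()
        }
    }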

But we have certainly performed a comparison of PostgreSQL with and without the
XTM patch.

Pgbench results are almost the same - within the measurement error:

With XTM:
 transaction type: TPC-B (sort of)
scaling factor: 70
query mode: simple
number of clients: 144
number of threads: 24
duration: 600 s
number of transactions actually processed: 12275179
latency average: 7.037 ms
latency stddev: 46.787 ms
tps = 20456.945469 (including connections establishing)
tps = 20457.164023 (excluding connections establishing)

Without XTM:
transaction type: TPC-B (sort of)
scaling factor: 70
query mode: simple
number of clients: 144
number of threads: 24
duration: 600 s
number of transactions actually processed: 12086367
latency average: 7.156 ms
latency stddev: 48.431 ms
tps = 20114.491785 (including connections establishing)
tps = 20116.074391 (excluding connections establishing)



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-01 Thread Craig Ringer
On 31 October 2015 at 17:22, konstantin knizhnik
 wrote:

> Waiting for your feedback

For anyone wondering about performance impact, there are some graphs
on page 23 of the PDF presentation. I didn't see anything else, and
the graphs don't seem to cover comparison of Pg with the XTM
transaction manager hooks and no DTM enabled vs Pg without the hooks,
i.e. the hook overhead itself.

Have you done much work on that? Personally I wouldn't expect to see
any meaningful overhead, but I'd really like to have numbers behind
that.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers