Re: [HACKERS] logical decoding of two-phase transactions

2017-10-29 Thread Craig Ringer
On 28 October 2017 at 03:53, Sokolov Yura  wrote:
> On 2017-10-26 22:01, Sokolov Yura wrote:

> Small improvement compared to v7:
> - twophase_gid is written with alignment padding in the XactLogCommitRecord
> and XactLogAbortRecord.

I think Nikhils has done some significant work on this patch.
Hopefully he'll be able to share it.



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-10-27 Thread Sokolov Yura

On 2017-10-26 22:01, Sokolov Yura wrote:

On 2017-09-27 14:46, Stas Kelvich wrote:
On 7 Sep 2017, at 18:58, Nikhil Sontakke  
wrote:


Hi,

FYI all, wanted to mention that I am working on an updated version of
the latest patch that I plan to submit to a later CF.



Cool!

So what kind of architecture do you have in mind? The same way as it
was implemented before?
As far as I remember there were two main issues:

* Decoding of aborted prepared transactions.

If such a transaction modified the catalog then we can't read reliable
info with our historic snapshot: since clog already has the aborted bit
set for our tx, it will break visibility logic. There are some ways to
deal with that — by doing a catalog seq scan twice and counting the
number of tuples (details upthread), or by hijacking clog values in the
historic visibility function. But ISTM it is better not to solve this
issue at all =) In most cases the intended usage of decoding a 2PC
transaction is to do some form of distributed commit, so naturally
decoding will happen only for in-progress transactions, and
commit/abort will happen only after the transaction is decoded, sent,
and a response is received. So we can just have an atomic flag that
prevents commit/abort of a tx currently being decoded. And we can
filter interesting prepared transactions based on GID, to prevent
holding this lock for ordinary 2PC.

* Possible deadlocks that Andres was talking about.

I spent some time trying to find one, but didn't find any. If locking
pg_class in a prepared tx is the only example then (IMHO) it is better
to just forbid preparing such transactions. Otherwise, if some
realistic examples that can block decoding actually exist, then we
probably need to reconsider the way a tx is decoded. Anyway, this part
probably needs Andres's blessing.


Just rebased patch logical_twophase_v6 to master.

Fixed small issues:
- XactLogAbortRecord wrote DBINFO twice, but it was decoded in
  ParseAbortRecord only once; the second DBINFO was parsed as ORIGIN.
  Fixed by removing the second write of DBINFO.
- SnapBuildPrepareTxnFinish tried to remove the xid from `running`
  instead of `committed`. And it removed only the xid, without subxids.
- test_decoding skipped returning "COMMIT PREPARED" and "ABORT
  PREPARED".


A big issue was with decoding DDL-containing two-phase transactions:
- prepared.out was misleading. We could not reproduce decoding the body
  of "test_prepared#3" with logical_twophase_v6.diff. It was skipped if
  `pg_logical_slot_get_changes` was called without
  `twophase-decode-with-catalog-changes` set, and only "COMMIT PREPARED
  test_prepared#3" was decoded.
The reason is that "PREPARE TRANSACTION" is passed to `pg_filter_prepare`
twice:
- first, on the "PREPARE" itself;
- second, on "COMMIT PREPARED".
In v6, `pg_filter_prepare` without `with-catalog-changes` answered
"true" the first time (i.e. it should not be decoded), and "false" the
second time, when the transaction became committed (i.e. it should be
decoded). But the second time, in DecodePrepare,
`ctx->snapshot_builder->start_decoding_at`
is already in the future compared to `buf->origptr` (because it is at
the "COMMIT PREPARED" lsn). Therefore DecodePrepare just called
ReorderBufferForget.
If `pg_filter_prepare` is called with `with-catalog-changes`, then it
returns "false" both times; thus DecodePrepare decodes the transaction
the first time, and calls `ReorderBufferForget` the second time.

I didn't find a way to fix it gracefully. I just changed
`pg_filter_prepare` to return the same answer both times: "false" if
called with `with-catalog-changes` (i.e. DecodePrepare needs to be
called), and "true" otherwise. With this change, a catalog-changing
two-phase transaction is decoded as a simple one-phase transaction if
`pg_logical_slot_get_changes` is called without `with-catalog-changes`.
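The change described above amounts to making the filter's answer depend only on the plugin option, not on when the filter is asked. A minimal sketch of that one decision (the real `pg_filter_prepare` callback in test_decoding also receives the decoding context and GID; everything else here is stripped away for illustration):

```c
#include <stdbool.h>

/*
 * Sketch of the v8 behaviour: "true" means "filter out", i.e. do not run
 * DecodePrepare for this transaction at PREPARE time.  Because the answer
 * depends only on the option, the PREPARE-time call and the COMMIT
 * PREPARED-time call can no longer disagree, which was the v6 bug.
 */
static bool
pg_filter_prepare_sketch(bool with_catalog_changes)
{
    return !with_catalog_changes;
}
```

With `with-catalog-changes` set, both calls answer "false" and the transaction is decoded at PREPARE; without it, both answer "true" and the transaction falls back to being decoded as an ordinary one-phase commit.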


Small improvement compared to v7:
- twophase_gid is written with alignment padding in XactLogCommitRecord
  and XactLogAbortRecord.
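For illustration, the padding in question rounds the GID (including its terminating NUL) up to the platform's maximum alignment, so the commit/abort record stays parseable as a series of maxaligned chunks. A stand-alone sketch mirroring PostgreSQL's MAXALIGN macro (the 8-byte alignment is an assumption matching typical 64-bit platforms):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for PostgreSQL's MAXALIGN: round LEN up to a multiple of 8. */
#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) \
    (((uintptr_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & \
     ~((uintptr_t) (MAXIMUM_ALIGNOF - 1)))

/* Space the GID occupies in the record, including the NUL and padding. */
static size_t
gid_record_size(const char *gid)
{
    return MAXALIGN(strlen(gid) + 1);
}
```

So a 5-byte GID consumes 8 bytes in the record, and whatever follows it starts on an aligned boundary.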

--
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

logical_twophase_v8.diff.gz
Description: GNU Zip compressed data



Re: [HACKERS] logical decoding of two-phase transactions

2017-10-26 Thread Sokolov Yura

On 2017-09-27 14:46, Stas Kelvich wrote:
On 7 Sep 2017, at 18:58, Nikhil Sontakke  
wrote:


Hi,

FYI all, wanted to mention that I am working on an updated version of
the latest patch that I plan to submit to a later CF.



Cool!

So what kind of architecture do you have in mind? The same way as it
was implemented before?
As far as I remember there were two main issues:

* Decoding of aborted prepared transactions.

If such a transaction modified the catalog then we can't read reliable
info with our historic snapshot: since clog already has the aborted bit
set for our tx, it will break visibility logic. There are some ways to
deal with that — by doing a catalog seq scan twice and counting the
number of tuples (details upthread), or by hijacking clog values in the
historic visibility function. But ISTM it is better not to solve this
issue at all =) In most cases the intended usage of decoding a 2PC
transaction is to do some form of distributed commit, so naturally
decoding will happen only for in-progress transactions, and
commit/abort will happen only after the transaction is decoded, sent,
and a response is received. So we can just have an atomic flag that
prevents commit/abort of a tx currently being decoded. And we can
filter interesting prepared transactions based on GID, to prevent
holding this lock for ordinary 2PC.

* Possible deadlocks that Andres was talking about.

I spent some time trying to find one, but didn't find any. If locking
pg_class in a prepared tx is the only example then (IMHO) it is better
to just forbid preparing such transactions. Otherwise, if some
realistic examples that can block decoding actually exist, then we
probably need to reconsider the way a tx is decoded. Anyway, this part
probably needs Andres's blessing.


Just rebased patch logical_twophase_v6 to master.

Fixed small issues:
- XactLogAbortRecord wrote DBINFO twice, but it was decoded in
  ParseAbortRecord only once; the second DBINFO was parsed as ORIGIN.
  Fixed by removing the second write of DBINFO.
- SnapBuildPrepareTxnFinish tried to remove the xid from `running`
  instead of `committed`. And it removed only the xid, without subxids.
- test_decoding skipped returning "COMMIT PREPARED" and "ABORT
  PREPARED".


A big issue was with decoding DDL-containing two-phase transactions:
- prepared.out was misleading. We could not reproduce decoding the body
  of "test_prepared#3" with logical_twophase_v6.diff. It was skipped if
  `pg_logical_slot_get_changes` was called without
  `twophase-decode-with-catalog-changes` set, and only "COMMIT PREPARED
  test_prepared#3" was decoded.
The reason is that "PREPARE TRANSACTION" is passed to `pg_filter_prepare`
twice:
- first, on the "PREPARE" itself;
- second, on "COMMIT PREPARED".
In v6, `pg_filter_prepare` without `with-catalog-changes` answered
"true" the first time (i.e. it should not be decoded), and "false" the
second time, when the transaction became committed (i.e. it should be
decoded). But the second time, in DecodePrepare,
`ctx->snapshot_builder->start_decoding_at`
is already in the future compared to `buf->origptr` (because it is at
the "COMMIT PREPARED" lsn). Therefore DecodePrepare just called
ReorderBufferForget.
If `pg_filter_prepare` is called with `with-catalog-changes`, then it
returns "false" both times; thus DecodePrepare decodes the transaction
the first time, and calls `ReorderBufferForget` the second time.

I didn't find a way to fix it gracefully. I just changed
`pg_filter_prepare` to return the same answer both times: "false" if
called with `with-catalog-changes` (i.e. DecodePrepare needs to be
called), and "true" otherwise. With this change, a catalog-changing
two-phase transaction is decoded as a simple one-phase transaction if
`pg_logical_slot_get_changes` is called without `with-catalog-changes`.

--
With regards,
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

logical_twophase_v7.diff.gz
Description: GNU Zip compressed data



Re: [HACKERS] logical decoding of two-phase transactions

2017-09-27 Thread Stas Kelvich

> On 7 Sep 2017, at 18:58, Nikhil Sontakke  wrote:
> 
> Hi,
> 
> FYI all, wanted to mention that I am working on an updated version of
> the latest patch that I plan to submit to a later CF.
> 

Cool!

So what kind of architecture do you have in mind? The same way as it was
implemented before?
As far as I remember there were two main issues:

* Decoding of aborted prepared transactions.

If such a transaction modified the catalog then we can't read reliable info
with our historic snapshot: since clog already has the aborted bit set for
our tx, it will break visibility logic. There are some ways to deal with
that — by doing a catalog seq scan twice and counting the number of tuples
(details upthread), or by hijacking clog values in the historic visibility
function. But ISTM it is better not to solve this issue at all =) In most
cases the intended usage of decoding a 2PC transaction is to do some form
of distributed commit, so naturally decoding will happen only for
in-progress transactions, and commit/abort will happen only after the
transaction is decoded, sent, and a response is received. So we can just
have an atomic flag that prevents commit/abort of a tx currently being
decoded. And we can filter interesting prepared transactions based on GID,
to prevent holding this lock for ordinary 2PC.

* Possible deadlocks that Andres was talking about.

I spent some time trying to find one, but didn't find any. If locking
pg_class in a prepared tx is the only example then (IMHO) it is better to
just forbid preparing such transactions. Otherwise, if some realistic
examples that can block decoding actually exist, then we probably need to
reconsider the way a tx is decoded. Anyway, this part probably needs
Andres's blessing.
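The atomic-flag-plus-GID-filter idea above can be sketched in C roughly as follows. This is a minimal illustration only: `TwoPhaseSlot`, the function names, and the `_distrib_` GID prefix are all invented for the sketch, not taken from any actual patch.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

/*
 * The decoder sets a flag on the prepared transaction's slot before
 * decoding it; COMMIT PREPARED / ROLLBACK PREPARED consult the flag and
 * must wait (here: report "not yet") until decoding finishes.
 */
typedef struct TwoPhaseSlot
{
    atomic_bool being_decoded;
} TwoPhaseSlot;

static void
decode_begin(TwoPhaseSlot *slot)
{
    atomic_store(&slot->being_decoded, true);
}

static void
decode_end(TwoPhaseSlot *slot)
{
    atomic_store(&slot->being_decoded, false);
}

/* true when it is safe to run COMMIT PREPARED / ROLLBACK PREPARED */
static bool
can_finish_prepared(TwoPhaseSlot *slot)
{
    return !atomic_load(&slot->being_decoded);
}

/*
 * GID-based filter: only transactions that opt in (here via an assumed
 * prefix) pay the decode-time interlock; ordinary 2PC is left alone.
 */
static bool
gid_is_interesting(const char *gid)
{
    return strncmp(gid, "_distrib_", 9) == 0;
}
```

A real implementation would live in shared memory and block rather than fail, but the shape of the interlock is the same: the finish path checks the flag that the decode path owns.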



Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






Re: [HACKERS] logical decoding of two-phase transactions

2017-09-07 Thread Nikhil Sontakke
Hi,

FYI all, wanted to mention that I am working on an updated version of
the latest patch that I plan to submit to a later CF.

Regards,
Nikhils

On 14 May 2017 at 04:02, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> On 13 May 2017 at 22:22, Tom Lane  wrote:
>>
>> Apparently you are not testing against current HEAD.  That's been there
>> since d10c626de (a whole two days now ;-))
>
> Indeed, I was working on a more than two-day old antiquity. Unfortunately,
> it's even more complicated
> to apply this patch against the current HEAD, so I'll wait for a rebased
> version.



-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-05-13 Thread Dmitry Dolgov
On 13 May 2017 at 22:22, Tom Lane  wrote:
>
> Apparently you are not testing against current HEAD.  That's been there
> since d10c626de (a whole two days now ;-))

Indeed, I was working on a more than two-day old antiquity. Unfortunately,
it's even more complicated
to apply this patch against the current HEAD, so I'll wait for a rebased
version.


Re: [HACKERS] logical decoding of two-phase transactions

2017-05-13 Thread Tom Lane
Dmitry Dolgov <9erthali...@gmail.com> writes:
> Just a note about this patch. Of course time flies by and it needs a
> rebase, but also there are a few failing tests right now:

> ERROR:  function pg_wal_lsn_diff(pg_lsn, unknown) does not exist

Apparently you are not testing against current HEAD.  That's been there
since d10c626de (a whole two days now ;-)).

regards, tom lane




Re: [HACKERS] logical decoding of two-phase transactions

2017-05-13 Thread Dmitry Dolgov
Hi

> On 4 April 2017 at 19:13, Masahiko Sawada  wrote:
>
> Other than that issue, the current patch still could not pass the 'make
> check' test of contrib/test_decoding.

Just a note about this patch. Of course time flies by and it needs a rebase,
but also there are a few failing tests right now:

* one that was already mentioned by Masahiko
* one from `ddl`, where expected is:

```
SELECT slot_name, plugin, slot_type, active,
NOT catalog_xmin IS NULL AS catalog_xmin_set,
xmin IS NULL  AS data_xmin_not_set,
pg_wal_lsn_diff(restart_lsn, '0/0100') > 0 AS some_wal
FROM pg_replication_slots;
slot_name|plugin | slot_type | active | catalog_xmin_set |
data_xmin_not_set | some_wal
-+---+---++--+---+--
 regression_slot | test_decoding | logical   | f  | t|
t | t
(1 row)
```

but the result is:

```
SELECT slot_name, plugin, slot_type, active,
NOT catalog_xmin IS NULL AS catalog_xmin_set,
xmin IS NULL  AS data_xmin_not_set,
pg_wal_lsn_diff(restart_lsn, '0/0100') > 0 AS some_wal
FROM pg_replication_slots;
ERROR:  function pg_wal_lsn_diff(pg_lsn, unknown) does not exist
LINE 5: pg_wal_lsn_diff(restart_lsn, '0/0100') > 0 AS some_w...
^
HINT:  No function matches the given name and argument types. You might
need to add explicit type casts.
```


Re: [HACKERS] logical decoding of two-phase transactions

2017-04-04 Thread Andres Freund
On 2017-04-04 13:06:13 +0300, Stas Kelvich wrote:
> That is just an argument against Andres's concern that a prepared
> transaction is able to deadlock with the decoding process — at least there
> are no such cases in the regression tests.

There are few longer / adverse xacts in there, so that doesn't say much.


> And that concern is the main thing blocking this patch. Except for
> explicit catalog locks in a prepared tx, nobody has yet found such cases,
> and it is hard to address or argue about.

I doubt that's the case. But even if it were so, it's absolutely not
acceptable that a plain user can cause such deadlocks. So I don't think
this argument buys you anything.

- Andres




Re: [HACKERS] logical decoding of two-phase transactions

2017-04-04 Thread Masahiko Sawada
On Tue, Apr 4, 2017 at 7:06 PM, Stas Kelvich  wrote:
>
>> On 4 Apr 2017, at 04:23, Masahiko Sawada  wrote:
>>
>>
>> I reviewed this patch but when I tried to build contrib/test_decoding
>> I got the following error.
>>
>
> Thanks!
>
> Yes, seems that 18ce3a4a changed ProcessUtility_hook signature.
> Updated.
>
>> There is still some unnecessary code in the v5 patch.
>

Thank you for updating the patch!

> Actually the second diff isn't intended to be part of the patch; I've just
> shared the way I ran the regression test suite through the 2PC decoding,
> changing all commits to prepare/commit, where the commit happens only
> after decoding of the prepare is finished (more details in my previous
> message in this thread).

Understood. Sorry for the noise.

>
> That is just an argument against Andres's concern that a prepared
> transaction is able to deadlock with the decoding process — at least there
> are no such cases in the regression tests.
>
> And that concern is the main thing blocking this patch. Except for
> explicit catalog locks in a prepared tx, nobody has yet found such cases,
> and it is hard to address or argue about.
>

Hmm, I also have not found such a deadlock case yet.

Other than that issue, the current patch still could not pass the 'make
check' test of contrib/test_decoding.

*** 154,167 
  (4 rows)

  :get_with2pc
!   data
! -
!  BEGIN
!  table public.test_prepared1: INSERT: id[integer]:5
!  table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
!  PREPARE TRANSACTION 'test_prepared#3';
!  COMMIT PREPARED 'test_prepared#3';
! (5 rows)

  -- make sure stuff still works
  INSERT INTO test_prepared1 VALUES (8);
--- 154,162 
  (4 rows)

  :get_with2pc
!  data
! --
! (0 rows)

  -- make sure stuff still works
  INSERT INTO test_prepared1 VALUES (8);

I guess that this part is an unexpected result and should be fixed, right?

-

*** 215,222 
  -- If we try to decode it now we'll deadlock
  SET statement_timeout = '10s';
  :get_with2pc_nofilter
! -- FIXME we expect a timeout here, but it actually works...
! ERROR: statement timed out

  RESET statement_timeout;
  -- we can decode past it by skipping xacts with catalog changes
--- 210,222 
  -- If we try to decode it now we'll deadlock
  SET statement_timeout = '10s';
  :get_with2pc_nofilter
! data
! 
!  BEGIN
!  table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
!  table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
!  PREPARE TRANSACTION 'test_prepared_lock'
! (4 rows)

  RESET statement_timeout;
  -- we can decode past it by skipping xacts with catalog changes

Probably we can ignore this part for now.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] logical decoding of two-phase transactions

2017-04-04 Thread Stas Kelvich

> On 4 Apr 2017, at 04:23, Masahiko Sawada  wrote:
> 
> 
> I reviewed this patch but when I tried to build contrib/test_decoding
> I got the following error.
> 

Thanks!

Yes, seems that 18ce3a4a changed ProcessUtility_hook signature.
Updated.

> There is still some unnecessary code in the v5 patch.

Actually the second diff isn't intended to be part of the patch; I've just
shared the way I ran the regression test suite through the 2PC decoding,
changing all commits to prepare/commit, where the commit happens only after
decoding of the prepare is finished (more details in my previous message in
this thread).

That is just an argument against Andres's concern that a prepared
transaction is able to deadlock with the decoding process — at least there
are no such cases in the regression tests.

And that concern is the main thing blocking this patch. Except for explicit
catalog locks in a prepared tx, nobody has yet found such cases, and it is
hard to address or argue about.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



logical_twophase_v6.diff
Description: Binary data


logical_twophase_regresstest.diff
Description: Binary data



Re: [HACKERS] logical decoding of two-phase transactions

2017-04-03 Thread Masahiko Sawada
On Thu, Mar 30, 2017 at 12:55 AM, Stas Kelvich  wrote:
>
>> On 28 Mar 2017, at 18:08, Andres Freund  wrote:
>>
>> On 2017-03-28 15:55:15 +0100, Simon Riggs wrote:
>>>
>>>
>>> That assertion is obviously false... the plugin can resolve this in
>>> various ways, if we allow it.
>>
>> Handling it by breaking replication isn't handling it (e.g. timeouts in
>> decoding etc).  Handling it by rolling back *prepared* transactions
>> (which are supposed to be guaranteed to succeed!), isn't either.
>>
>>
>>> You can say that in your opinion you prefer to see this handled in
>>> some higher level way, though it would be good to hear why and how.
>>
>> It's pretty obvious why: A bit of DDL by the user shouldn't lead to the
>> issues mentioned above.
>>
>>
>>> Bottom line here is we shouldn't reject this patch on this point,
>>
>> I think it definitely has to be rejected because of that.  And I didn't
>> bring this up at the last minute, I repeatedly brought it up before.
>> Both to Craig and Stas.
>
> Okay. In order to find more realistic cases that block replication
> I've created the following setup:
>
> * In the backend: the test_decoding plugin hooks on xact events and
> utility statement hooks and transforms each commit into a prepare, then
> sleeps on a latch. If the transaction contains DDL, the whole statement is
> pushed into wal as a transactional message. If the DDL cannot be prepared
> or disallows execution in a transaction block, then it goes as a
> nontransactional logical message without prepare/decode injection. If the
> transaction didn't issue any DDL and didn't write anything to wal, then it
> skips 2PC too.
>
> * After the prepare is decoded, the output plugin in the walsender unlocks
> the backend, allowing it to proceed with commit prepared. So in the case
> when decoding tries to access a blocked catalog, everything should stop.
>
> * A small python script consumes decoded wal from the walsender (thanks
> Craig and Petr).
>
> After some acrobatics with those hooks I've managed to run the whole
> regression suite in parallel mode through such a setup of test_decoding
> without any deadlocks. I've added two xact_events to postgres and
> allowed prepare of transactions that touched temp tables, since
> they are heavily used in tests and create a lot of noise in diffs.
>
> So it boils down to 3 failed regression tests out of 177, namely:
>
> * transactions.sql — here the commit of a tx gets stuck obtaining
> SafeSnapshot(). I didn't look at what is happening there specifically, but
> just checked that the walsender isn't blocked. I'm going to look more
> closely at this.
>
> * prepared_xacts.sql — here select prepared_xacts() sees our prepared
> tx. It is possible to filter them out, but obviously it works as expected.
>
> * guc.sql — here pendingActions arrives on 'DISCARD ALL', preventing the
> tx from being prepared. I didn't find a way to check the presence of
> pendingActions outside of async.c, so decided to leave it as is.
>
> It seems that at least in the regression tests nothing can block two-phase
> logical decoding. Is that a strong enough argument for the hypothesis that
> the current approach doesn't create deadlocks, except for locks on the
> catalog, which should be disallowed anyway?
>
> Patches attached. logical_twophase_v5 is a slightly modified version of
> the previous patch merged with Craig's changes. The second file is a set
> of patches over the previous one that implements the logic I've just
> described. There is a runtest.sh script that sets up postgres, runs the
> python logical consumer in the background, and starts the regression test.
>
>

I reviewed this patch but when I tried to build contrib/test_decoding
I got the following error.

$ make
gcc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -g -g -fpic -I. -I. -I../../src/include -D_GNU_SOURCE   -c -o
test_decoding.o test_decoding.c -MMD -MP -MF .deps/test_decoding.Po
test_decoding.c: In function '_PG_init':
test_decoding.c:126: warning: assignment from incompatible pointer type
test_decoding.c: In function 'test_decoding_process_utility':
test_decoding.c:271: warning: passing argument 5 of
'PreviousProcessUtilityHook' from incompatible pointer type
test_decoding.c:271: note: expected 'struct QueryEnvironment *' but
argument is of type 'struct DestReceiver *'
test_decoding.c:271: warning: passing argument 6 of
'PreviousProcessUtilityHook' from incompatible pointer type
test_decoding.c:271: note: expected 'struct DestReceiver *' but
argument is of type 'char *'
test_decoding.c:271: error: too few arguments to function
'PreviousProcessUtilityHook'
test_decoding.c:276: warning: passing argument 5 of
'standard_ProcessUtility' from incompatible pointer type
../../src/include/tcop/utility.h:38: note: expected 'struct
QueryEnvironment *' but argument is of type 'struct DestReceiver *'
test_decoding.c:276: warning: passing argument 6 of
'standard_ProcessUtility' from incompatible pointer type

Re: [HACKERS] logical decoding of two-phase transactions

2017-03-29 Thread Stas Kelvich

> On 28 Mar 2017, at 18:08, Andres Freund  wrote:
> 
> On 2017-03-28 15:55:15 +0100, Simon Riggs wrote:
>> 
>> 
>> That assertion is obviously false... the plugin can resolve this in
>> various ways, if we allow it.
> 
> Handling it by breaking replication isn't handling it (e.g. timeouts in
> decoding etc).  Handling it by rolling back *prepared* transactions
> (which are supposed to be guaranteed to succeed!), isn't either.
> 
> 
>> You can say that in your opinion you prefer to see this handled in
>> some higher level way, though it would be good to hear why and how.
> 
> It's pretty obvious why: A bit of DDL by the user shouldn't lead to the
> issues mentioned above.
> 
> 
>> Bottom line here is we shouldn't reject this patch on this point,
> 
> I think it definitely has to be rejected because of that.  And I didn't
> bring this up at the last minute, I repeatedly brought it up before.
> Both to Craig and Stas.

Okay. In order to find more realistic cases that block replication
I've created the following setup:

* In the backend: the test_decoding plugin hooks on xact events and utility
statement hooks and transforms each commit into a prepare, then sleeps on a
latch. If the transaction contains DDL, the whole statement is pushed into
wal as a transactional message. If the DDL cannot be prepared or disallows
execution in a transaction block, then it goes as a nontransactional
logical message without prepare/decode injection. If the transaction didn't
issue any DDL and didn't write anything to wal, then it skips 2PC too.

* After the prepare is decoded, the output plugin in the walsender unlocks
the backend, allowing it to proceed with commit prepared. So in the case
when decoding tries to access a blocked catalog, everything should stop.

* A small python script consumes decoded wal from the walsender (thanks
Craig and Petr).

After some acrobatics with those hooks I've managed to run the whole
regression suite in parallel mode through such a setup of test_decoding
without any deadlocks. I've added two xact_events to postgres and allowed
prepare of transactions that touched temp tables, since they are heavily
used in tests and create a lot of noise in diffs.

So it boils down to 3 failed regression tests out of 177, namely:

* transactions.sql — here the commit of a tx gets stuck obtaining
SafeSnapshot(). I didn't look at what is happening there specifically, but
just checked that the walsender isn't blocked. I'm going to look more
closely at this.

* prepared_xacts.sql — here select prepared_xacts() sees our prepared tx.
It is possible to filter them out, but obviously it works as expected.

* guc.sql — here pendingActions arrives on 'DISCARD ALL', preventing the tx
from being prepared. I didn't find a way to check the presence of
pendingActions outside of async.c, so decided to leave it as is.

It seems that at least in the regression tests nothing can block two-phase
logical decoding. Is that a strong enough argument for the hypothesis that
the current approach doesn't create deadlocks, except for locks on the
catalog, which should be disallowed anyway?

Patches attached. logical_twophase_v5 is a slightly modified version of the
previous patch merged with Craig's changes. The second file is a set of
patches over the previous one that implements the logic I've just
described. There is a runtest.sh script that sets up postgres, runs the
python logical consumer in the background, and starts the regression test.


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



logical_twophase_v5.diff
Description: Binary data


logical_twophase_regresstest.diff
Description: Binary data



Re: [HACKERS] logical decoding of two-phase transactions

2017-03-28 Thread Craig Ringer
On 28 Mar. 2017 23:08, "Andres Freund"  wrote:

>
> > >> I don't think it's for us to say what the plugin is allowed to do. We
> > >> decided on a plugin architecture, so we have to trust that the plugin
> > >> author resolves the issues. We can document them so those choices are
> > >> clear.
> > >
> > > I don't think this is "plugin architecture" related. The output pluging
> > > can't do right here, this has to be solved at a higher level.
> >
> > That assertion is obviously false... the plugin can resolve this in
> > various ways, if we allow it.
>
> Handling it by breaking replication isn't handling it (e.g. timeouts in
> decoding etc).


IMO, if it's a rare condition and we can abort decoding then recover
cleanly and succeed on retry, that's OK. Not dissimilar to the deadlock
detector. But right now that's not the case, it's possible (however
artificially) to create prepared xacts for which decoding will stall and
not succeed.


> Handling it by rolling back *prepared* transactions
> (which are supposed to be guaranteed to succeed!), isn't either.
>

I agree, we can't rely on anything for which the only way to continue is to
rollback a prepared xact.


> > You can say that in your opinion you prefer to see this handled in
> > some higher level way, though it would be good to hear why and how.
>
> It's pretty obvious why: A bit of DDL by the user shouldn't lead to the
> issues mentioned above.
>

I agree that it shouldn't, and in fact DDL is the main part of why I want
2PC decoding.

What has surprised me is that I haven't actually been able to create any
situations where, with test_decoding, we have such a failure. Not unless I
manually LOCK TABLE pg_attribute, anyway.
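For reference, the artificial case described here can be sketched as follows. This is a hypothetical session: the GID and slot name are made up, and it assumes a logical slot on a build with the 2PC decoding patch applied.

```sql
-- Session 1: prepare a transaction holding an exclusive catalog lock.
BEGIN;
LOCK TABLE pg_attribute IN ACCESS EXCLUSIVE MODE;
PREPARE TRANSACTION 'decoding_stall_demo';

-- Session 2: decoding the prepared xact needs catalog access, so this
-- is expected to block until the prepared transaction is resolved.
SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);

-- Session 1, later: release the lock.
COMMIT PREPARED 'decoding_stall_demo';
```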

Notably, we already disallow prepared xacts that make changes to the
relfilenodemap, which covers a lot of the problem cases like CLUSTERing
system tables.


> > Bottom line here is we shouldn't reject this patch on this point,
>
> I think it definitely has to be rejected because of that.  And I didn't
> bring this up at the last minute, I repeatedly brought it up before.
> Both to Craig and Stas.
>

Yes, and I lost track of it while focusing on the catalog tuple visibility
issues. I warned Stas of this issue when he first mentioned an interest in
decoding of 2PC actually, but haven't kept a proper eye on it since.

Andres and I even discussed this back in the early BDR days, it's not new
and is part of why I poked Stas to try some DDL tests etc. The tests in the
initial patch didn't have enough coverage to trigger any issues - they
didn't actually test decoding of a 2pc xact while it was still in-progress
at all. But even once I added more tests I've actually been unable to
reproduce this in a realistic real world example.

Frankly I'm confused by that, since I would expect an AEL on some_table to
cause decoding of some_table to get stuck. It does not.

That doesn't mean we should accept failure cases and commit something with
holes in it. But it might inform our choices about how we solve those
issues.


> One way to fix this would be to allow decoding to acquire such locks
> (i.e. locks held by the prepared xact we're decoding) - there
> unfortunately are some practical issues with that (e.g. the locking code
> doesnt' necessarily expect a second non-exclusive locker, when there's
> an exclusive one),  or we could add an exception to the locking code to
> simply not acquire such locks.
>

I've been meaning to see if we can use the parallel infrastructure's
session leader infrastructure for this, by making the 2pc fake-proc a
leader and making our decoding session inherit its locks. I haven't dug
into it to see if it's even remotely practical yet, and won't be able to
until early pg11.

We could proceed with the caveat that decoding plugins that use 2pc support
must defer decoding of 2pc xacts containing ddl until commit prepared, or
must take responsibility for ensuring (via a command filter, etc) that
xacts are safe to decode and 2pc lock xacts during decoding. But we're
likely to change the interface for all that when we iterate for pg11 and
I'd rather not carry more BC than we have to. Also, the patch has unsolved
issues with how it keeps track of whether an xact was output at prepare
time or not and suppresses output at commit time.

I'm inclined to shelve the patch for Pg 10. We've only got a couple of days
left, the tests are still pretty minimal. We have open issues around
locking, less than totally satisfactory abort handling, and potential to
skip replay of transactions for both prepare and commit prepared. It's not
ready to go. However, it's definitely to the point where with a little more
work it'll be practical to patch into variants of Pg until we can
mainstream it in Pg 11, which is nice.

--
Craig Ringer


Re: [HACKERS] logical decoding of two-phase transactions

2017-03-28 Thread Andres Freund
On 2017-03-28 15:55:15 +0100, Simon Riggs wrote:
> On 28 March 2017 at 15:38, Andres Freund  wrote:
> > On 2017-03-28 15:32:49 +0100, Simon Riggs wrote:
> >> On 28 March 2017 at 03:53, Craig Ringer  wrote:
> >> > On 28 March 2017 at 09:25, Andres Freund  wrote:
> >> >
> >> >> If you actually need separate decoding of 2PC, then you want to wait for
> >> >> the PREPARE to be replicated.  If that replication has to wait for the
> >> >> to-be-replicated prepared transaction to commit prepared, and commit
> >> >> prepare will only happen once replication happened...
> >> >
> >> > In other words, the output plugin cannot decode a transaction at
> >> > PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on
> >> > a catalog relation we must be able to read in order to decode the
> >> > xact.
> >>
> >> Yes, I understand.
> >>
> >> The decoding plugin can choose to enable lock_timeout, or it can
> >> choose to wait for manual resolution, or it could automatically abort
> >> such a transaction to avoid needing to decode it.
> >
> > That doesn't solve the problem. You're still left with replication that
> > can't progress.  I think that's completely unacceptable.  We need a
> > proper solution to this, not throw our hands up in the air and hope that
> > it's not going to hurt a whole lot of people.
>
> Nobody is throwing their hands in the air, nobody is just hoping. The
> concern raised is real and needs to be handled somewhere; the only
> point of discussion is where and how.


> >> I don't think it's for us to say what the plugin is allowed to do. We
> >> decided on a plugin architecture, so we have to trust that the plugin
> >> author resolves the issues. We can document them so those choices are
> >> clear.
> >
> > I don't think this is "plugin architecture" related. The output plugin
> > can't do the right thing here, this has to be solved at a higher level.
>
> That assertion is obviously false... the plugin can resolve this in
> various ways, if we allow it.

Handling it by breaking replication isn't handling it (e.g. timeouts in
decoding etc).  Handling it by rolling back *prepared* transactions
(which are supposed to be guaranteed to succeed!), isn't either.


> You can say that in your opinion you prefer to see this handled in
> some higher level way, though it would be good to hear why and how.

It's pretty obvious why: A bit of DDL by the user shouldn't lead to the
issues mentioned above.


> Bottom line here is we shouldn't reject this patch on this point,

I think it definitely has to be rejected because of that.  And I didn't
bring this up at the last minute, I repeatedly brought it up before.
Both to Craig and Stas.

One way to fix this would be to allow decoding to acquire such locks
(i.e. locks held by the prepared xact we're decoding) - there
unfortunately are some practical issues with that (e.g. the locking code
doesn't necessarily expect a second non-exclusive locker, when there's
an exclusive one),  or we could add an exception to the locking code to
simply not acquire such locks.
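To make the second option concrete, here is a toy Python model (purely illustrative; the real change would live in PostgreSQL's C lock manager, and all names here are invented) of a decoding session that skips acquisition of locks held by the very transaction it is decoding:

```python
# Toy lock-manager model: a decoding session does not block on locks
# held by the transaction it is decoding. Names are invented for
# illustration; this is not PostgreSQL code.

class LockManager:
    def __init__(self):
        self.exclusive = {}  # relation -> xid holding AccessExclusiveLock

    def acquire_exclusive(self, relation, xid):
        assert relation not in self.exclusive
        self.exclusive[relation] = xid

    def acquire_share(self, relation, xid, decoding_xid=None):
        holder = self.exclusive.get(relation)
        if holder is None or holder == xid:
            return True                  # no conflict
        if decoding_xid is not None and holder == decoding_xid:
            return True                  # proposed exception: lock is held
                                         # by the xact we are decoding
        return False                     # would block -> decoding stalls

lm = LockManager()
lm.acquire_exclusive("pg_class", xid=3)  # prepared xact 3 locked a catalog

# An ordinary backend would block:
assert lm.acquire_share("pg_class", xid=7) is False
# A session decoding xact 3 proceeds under the proposed exception:
assert lm.acquire_share("pg_class", xid=7, decoding_xid=3) is True
```

The alternative sketched above (actually acquiring the prepared xact's locks in the decoding session) would instead register the decoder as a second holder, which is where the "second non-exclusive locker" complication comes in.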


> especially since any resource issue found during decoding could
> similarly prevent progress with decoding.

For example?

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-03-28 Thread Simon Riggs
On 28 March 2017 at 15:38, Andres Freund  wrote:
> On 2017-03-28 15:32:49 +0100, Simon Riggs wrote:
>> On 28 March 2017 at 03:53, Craig Ringer  wrote:
>> > On 28 March 2017 at 09:25, Andres Freund  wrote:
>> >
>> >> If you actually need separate decoding of 2PC, then you want to wait for
>> >> the PREPARE to be replicated.  If that replication has to wait for the
>> >> to-be-replicated prepared transaction to commit prepared, and commit
>> >> prepare will only happen once replication happened...
>> >
>> > In other words, the output plugin cannot decode a transaction at
>> > PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on
>> > a catalog relation we must be able to read in order to decode the
>> > xact.
>>
>> Yes, I understand.
>>
>> The decoding plugin can choose to enable lock_timeout, or it can
>> choose to wait for manual resolution, or it could automatically abort
>> such a transaction to avoid needing to decode it.
>
> That doesn't solve the problem. You're still left with replication that
> can't progress.  I think that's completely unacceptable.  We need a
> proper solution to this, not throw our hands up in the air and hope that
> it's not going to hurt a whole lot of people.

Nobody is throwing their hands in the air, nobody is just hoping. The
concern raised is real and needs to be handled somewhere; the only
point of discussion is where and how.

>> I don't think it's for us to say what the plugin is allowed to do. We
>> decided on a plugin architecture, so we have to trust that the plugin
>> author resolves the issues. We can document them so those choices are
>> clear.
>
> I don't think this is "plugin architecture" related. The output plugin
> can't do the right thing here, this has to be solved at a higher level.

That assertion is obviously false... the plugin can resolve this in
various ways, if we allow it.

You can say that in your opinion you prefer to see this handled in
some higher level way, though it would be good to hear why and how.

Bottom line here is we shouldn't reject this patch on this point,
especially since any resource issue found during decoding could
similarly prevent progress with decoding.

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-28 Thread Andres Freund
On 2017-03-28 15:32:49 +0100, Simon Riggs wrote:
> On 28 March 2017 at 03:53, Craig Ringer  wrote:
> > On 28 March 2017 at 09:25, Andres Freund  wrote:
> >
> >> If you actually need separate decoding of 2PC, then you want to wait for
> >> the PREPARE to be replicated.  If that replication has to wait for the
> >> to-be-replicated prepared transaction to commit prepared, and commit
> >> prepare will only happen once replication happened...
> >
> > In other words, the output plugin cannot decode a transaction at
> > PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on
> > a catalog relation we must be able to read in order to decode the
> > xact.
> 
> Yes, I understand.
> 
> The decoding plugin can choose to enable lock_timeout, or it can
> choose to wait for manual resolution, or it could automatically abort
> such a transaction to avoid needing to decode it.

That doesn't solve the problem. You're still left with replication that
can't progress.  I think that's completely unacceptable.  We need a
proper solution to this, not throw our hands up in the air and hope that
it's not going to hurt a whole lot of people.


> I don't think it's for us to say what the plugin is allowed to do. We
> decided on a plugin architecture, so we have to trust that the plugin
> author resolves the issues. We can document them so those choices are
> clear.

I don't think this is "plugin architecture" related. The output plugin
can't do the right thing here, this has to be solved at a higher level.

- Andres




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-28 Thread Simon Riggs
On 28 March 2017 at 03:53, Craig Ringer  wrote:
> On 28 March 2017 at 09:25, Andres Freund  wrote:
>
>> If you actually need separate decoding of 2PC, then you want to wait for
>> the PREPARE to be replicated.  If that replication has to wait for the
>> to-be-replicated prepared transaction to commit prepared, and commit
>> prepare will only happen once replication happened...
>
> In other words, the output plugin cannot decode a transaction at
> PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on
> a catalog relation we must be able to read in order to decode the
> xact.

Yes, I understand.

The decoding plugin can choose to enable lock_timeout, or it can
choose to wait for manual resolution, or it could automatically abort
such a transaction to avoid needing to decode it.

I don't think it's for us to say what the plugin is allowed to do. We
decided on a plugin architecture, so we have to trust that the plugin
author resolves the issues. We can document them so those choices are
clear.

This doesn't differ in any respect from any other resource it might
need yet cannot obtain.

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-27 Thread Craig Ringer
On 28 March 2017 at 10:53, Craig Ringer  wrote:

> However, even once I add an option to force decoding of 2pc xacts with
> catalog changes to test_decoding, I cannot reproduce the expected
> locking issues so far. See tests in attached updated version, in
> contrib/test_decoding/sql/prepare.sql .

I haven't been able to create issues with CLUSTER, any ALTER TABLEs
I've tried, or anything similar.

An explicit AEL on pg_attribute causes the decoding stall, but you
can't do anything much else either, and I don't see how that'd arise
under normal circumstances.

If it's a sufficiently obscure issue I'm willing to document it as
"don't do that" or "use a command filter to prohibit that". But it's
more likely that I'm just not spotting the cases where the issue
arises.

Attempting to CLUSTER a system catalog like pg_class or pg_attribute
causes PREPARE TRANSACTION to fail with

ERROR:  cannot PREPARE a transaction that modified relation mapping

and I didn't find any catalogs I could CLUSTER that'd also block decoding.



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-27 Thread Craig Ringer
On 28 March 2017 at 09:25, Andres Freund  wrote:

> If you actually need separate decoding of 2PC, then you want to wait for
> the PREPARE to be replicated.  If that replication has to wait for the
> to-be-replicated prepared transaction to commit prepared, and commit
> prepare will only happen once replication happened...

In other words, the output plugin cannot decode a transaction at
PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on
a catalog relation we must be able to read in order to decode the
xact.

>> Are there any other scenarios where catalog readers are blocked except an
>> explicit lock on a catalog table? Alters on catalogs seem to be prohibited.
>
> VACUUM FULL on catalog tables (but that can't happen in xact => 2pc)
> CLUSTER on catalog tables (can happen in xact)
> ALTER on tables modified in the same transaction (even of non catalog
> tables!), because a lot of routines will do a heap_open() to get the
> tupledesc etc.

Right, and the latter one is the main issue, since it's by far the
most likely and hard to just work around.

The tests Stas has in place aren't sufficient to cover this, as they
decode only after everything has committed. I'm expanding the
pg_regress coverage to do decoding between prepare and commit (when we
actually care) first, and will add some tests involving strong locks.
I've found one bug where it doesn't decode a 2pc xact at prepare or
commit time, even without restart or strong lock issues. Pretty sure
it's due to assumptions made about the filter callback.

The current code as used by test_decoding won't work correctly. If
txn->has_catalog_changes and if it's still in-progress, the filter
skips decoding at PREPARE time. But it isn't then decoded at COMMIT
PREPARED time either, if we processed past the PREPARE TRANSACTION.
Bug.
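The decision logic at fault can be modelled with a small Python sketch (names and structure invented for illustration; the real code lives in test_decoding and the reorder buffer). It shows how skipping at PREPARE without re-deciding at COMMIT PREPARED loses the transaction entirely:

```python
# Toy model of the 2PC decoding decision (invented names). The bug: if
# the filter skips a catalog-changing xact at PREPARE, the xact must be
# decoded at COMMIT PREPARED instead -- otherwise it is never output.

def filter_prepare(txn):
    # "don't decode in-progress xacts with catalog changes at prepare"
    return txn["has_catalog_changes"] and txn["in_progress"]

def decode(txn, fixed=True):
    events = []
    if not filter_prepare(txn):
        events.append("decoded at PREPARE")
    txn["in_progress"] = False           # COMMIT PREPARED arrives
    if "decoded at PREPARE" not in events:
        if fixed:
            # corrected behaviour: replay the whole xact at commit time
            events.append("decoded at COMMIT PREPARED")
        # buggy behaviour: fall through, emitting nothing at all
    return events

txn = {"has_catalog_changes": True, "in_progress": True}
assert decode(dict(txn), fixed=False) == []   # the bug: xact is lost
assert decode(dict(txn), fixed=True) == ["decoded at COMMIT PREPARED"]
```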

Also, by skipping decoding of 2pc xacts with catalog changes in this
test we also hide the locking issues.

However, even once I add an option to force decoding of 2pc xacts with
catalog changes to test_decoding, I cannot reproduce the expected
locking issues so far. See tests in attached updated version, in
contrib/test_decoding/sql/prepare.sql .

Haven't done any TAP tests yet, since the pg_regress tests are so far
sufficient to turn up issues.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From e23909a9929b561e011f41891825bfb5b1ecb1b3 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Tue, 28 Mar 2017 10:51:48 +0800
Subject: [PATCH] Logical decoding of two-phase commit transactions at PREPARE
 time

---
 contrib/test_decoding/expected/prepared.out | 253 
 contrib/test_decoding/sql/prepared.sql  |  82 +++-
 contrib/test_decoding/test_decoding.c   | 205 ++-
 src/backend/access/rmgrdesc/xactdesc.c  |  37 +++-
 src/backend/access/transam/twophase.c   |  89 -
 src/backend/access/transam/xact.c   |  72 ++-
 src/backend/replication/logical/decode.c| 148 --
 src/backend/replication/logical/logical.c   | 141 +
 src/backend/replication/logical/reorderbuffer.c | 129 +++-
 src/backend/replication/logical/snapbuild.c |  48 -
 src/include/access/twophase.h   |   3 +
 src/include/access/xact.h   |  44 -
 src/include/replication/logical.h   |   6 +-
 src/include/replication/output_plugin.h |  36 
 src/include/replication/reorderbuffer.h |  50 +
 src/include/replication/snapbuild.h |   4 +
 16 files changed, 1254 insertions(+), 93 deletions(-)

diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d..56c6e72 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,84 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
  init
 (1 row)
 
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column? 
+--
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', 

Re: [HACKERS] logical decoding of two-phase transactions

2017-03-27 Thread Andres Freund
On 2017-03-28 03:30:28 +0100, Simon Riggs wrote:
> On 28 March 2017 at 02:25, Andres Freund  wrote:
> > On 2017-03-28 04:12:41 +0300, Stas Kelvich wrote:
> >>
> >> > On 28 Mar 2017, at 00:25, Andres Freund  wrote:
> >> >
> >> > Hi,
> >> >
> >> > On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote:
> >> >> Ok, here it is.
> >> >
> >> > On a very quick skim, this doesn't seem to solve the issues around
> >> > deadlocks of prepared transactions vs. catalog tables.  What if the
> >> > prepared transaction contains something like LOCK pg_class; (there's a
> >> > lot more realistic examples)? Then decoding won't be able to continue,
> >> > until that transaction is committed / aborted?
> >>
> >> But why is that a deadlock? It seems like just a lock.
> >
> > If you actually need separate decoding of 2PC, then you want to wait for
> > the PREPARE to be replicated.  If that replication has to wait for the
> > to-be-replicated prepared transaction to commit prepared, and commit
> > prepare will only happen once replication happened...
> 
> Surely that's up to the decoding plugin?

It can't do much about it, so not really.  A lot of the functions
dealing with datatypes (temporarily) lock relations.  Both the actual
user tables, and system catalog tables (cache lookups...).


> If the plugin takes locks it had better make sure it can get the locks
> or timeout. But that's true of any resource the plugin needs access to
> and can't obtain when needed.


> This issue could occur now if the transaction took a session lock on a
> catalog table.

That's not a self-deadlock, and we don't do session locks outside
of operations like CIC?

Greetings,

Andres Freund




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-27 Thread Simon Riggs
On 28 March 2017 at 02:25, Andres Freund  wrote:
> On 2017-03-28 04:12:41 +0300, Stas Kelvich wrote:
>>
>> > On 28 Mar 2017, at 00:25, Andres Freund  wrote:
>> >
>> > Hi,
>> >
>> > On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote:
>> >> Ok, here it is.
>> >
>> > On a very quick skim, this doesn't seem to solve the issues around
>> > deadlocks of prepared transactions vs. catalog tables.  What if the
>> > prepared transaction contains something like LOCK pg_class; (there's a
>> > lot more realistic examples)? Then decoding won't be able to continue,
>> > until that transaction is committed / aborted?
>>
>> But why is that a deadlock? It seems like just a lock.
>
> If you actually need separate decoding of 2PC, then you want to wait for
> the PREPARE to be replicated.  If that replication has to wait for the
> to-be-replicated prepared transaction to commit prepared, and commit
> prepare will only happen once replication happened...

Surely that's up to the decoding plugin?

If the plugin takes locks it had better make sure it can get the locks
or timeout. But that's true of any resource the plugin needs access to
and can't obtain when needed.

This issue could occur now if the transaction took a session lock on a
catalog table.

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-27 Thread Andres Freund
On 2017-03-28 04:12:41 +0300, Stas Kelvich wrote:
> 
> > On 28 Mar 2017, at 00:25, Andres Freund  wrote:
> > 
> > Hi,
> > 
> > On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote:
> >> Ok, here it is.
> > 
> > On a very quick skim, this doesn't seem to solve the issues around
> > deadlocks of prepared transactions vs. catalog tables.  What if the
> > prepared transaction contains something like LOCK pg_class; (there's a
> > lot more realistic examples)? Then decoding won't be able to continue,
> > until that transaction is committed / aborted?
> 
> But why is that a deadlock? It seems like just a lock.

If you actually need separate decoding of 2PC, then you want to wait for
the PREPARE to be replicated.  If that replication has to wait for the
to-be-replicated prepared transaction to commit prepared, and commit
prepare will only happen once replication happened...


> Are there any other scenarios where catalog readers are blocked except an
> explicit lock on a catalog table? Alters on catalogs seem to be prohibited.

VACUUM FULL on catalog tables (but that can't happen in xact => 2pc)
CLUSTER on catalog tables (can happen in xact)
ALTER on tables modified in the same transaction (even of non catalog
tables!), because a lot of routines will do a heap_open() to get the
tupledesc etc.

Greetings,

Andres Freund




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-27 Thread Stas Kelvich

> On 28 Mar 2017, at 00:25, Andres Freund  wrote:
> 
> Hi,
> 
> On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote:
>> Ok, here it is.
> 
> On a very quick skim, this doesn't seem to solve the issues around
> deadlocks of prepared transactions vs. catalog tables.  What if the
> prepared transaction contains something like LOCK pg_class; (there's a
> lot more realistic examples)? Then decoding won't be able to continue,
> until that transaction is committed / aborted?

But why is that a deadlock? It seems like just a lock.

In the case of a prepared lock on pg_class, decoding will wait until it is
committed and then continue to decode; as will anything in postgres that
accesses pg_class, up to and including the inability to connect to the
database, bricking it if you accidentally disconnected before committing
that tx (as you showed me a while ago :-).

IMO it is an issue of being able to prepare such a lock, rather than of decoding.

Are there any other scenarios where catalog readers are blocked except an
explicit lock on a catalog table? Alters on catalogs seem to be prohibited.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






Re: [HACKERS] logical decoding of two-phase transactions

2017-03-27 Thread Craig Ringer
On 28 March 2017 at 08:50, Stas Kelvich  wrote:
>
>> On 28 Mar 2017, at 00:19, Stas Kelvich  wrote:
>>
>> * It actually doesn't pass one of my regression tests. I've added the
>> expected output as it should be. I'll try to send a follow-up message with
>> the fix, but right now I'm sending it as is, as you asked.
>>
>>
>
> Fixed. I forgot to postpone ReorderBufferTxn cleanup in case of prepare.
>
> So it passes the provided regression tests right now.
>
> I'll give it more testing tomorrow and am going to write a TAP test to check
> behaviour when we lose info about whether the prepare was sent to the
> subscriber or not.

Great, thanks. I'll try to have some TAP tests ready.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-27 Thread Craig Ringer
On 28 March 2017 at 05:25, Andres Freund  wrote:

> On a very quick skim, this doesn't seem to solve the issues around
> deadlocks of prepared transactions vs. catalog tables.  What if the
> prepared transaction contains something like LOCK pg_class; (there's a
> lot more realistic examples)? Then decoding won't be able to continue,
> until that transaction is committed / aborted?

Yeah, that's a problem and one we discussed in the past, though I lost
track of it in amongst the recent work.

I'm currently writing a few TAP tests intended to check this sort of
thing, mixed DDL/DML, overlapping xacts, interleaved prepared xacts,
etc. If they highlight problems they'll be useful for the next
iteration of this patch anyway.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-27 Thread Stas Kelvich

> On 28 Mar 2017, at 00:19, Stas Kelvich  wrote:
> 
> * It actually doesn't pass one of my regression tests. I've added the
> expected output as it should be. I'll try to send a follow-up message with
> the fix, but right now I'm sending it as is, as you asked.
> 
> 

Fixed. I forgot to postpone ReorderBufferTxn cleanup in case of prepare.

So it passes the provided regression tests right now.

I'll give it more testing tomorrow and am going to write a TAP test to check
behaviour when we lose info about whether the prepare was sent to the
subscriber or not.




logical_twophase.diff
Description: Binary data



Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] logical decoding of two-phase transactions

2017-03-27 Thread Andres Freund
Hi,

On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote:
> Ok, here it is.

On a very quick skim, this doesn't seem to solve the issues around
deadlocks of prepared transactions vs. catalog tables.  What if the
prepared transaction contains something like LOCK pg_class; (there's a
lot more realistic examples)? Then decoding won't be able to continue,
until that transaction is committed / aborted?

- Andres




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-27 Thread Stas Kelvich

> On 27 Mar 2017, at 16:29, Craig Ringer  wrote:
> 
> On 27 March 2017 at 17:53, Stas Kelvich  wrote:
> 
>> I heavily underestimated the amount of changes there, but I've almost
>> finished and will send an updated patch in several hours.
> 
> Oh, brilliant! Please post whatever you have before you knock off for
> the day anyway, even if it's just a WIP, so I can pick it up tomorrow
> my time and poke at its tests etc.
> 

Ok, here it is.

Major differences compared to the previous version:

* GID is stored in commit/abort records only when wal_level >= logical.

* More consistency about storing and parsing origin info. Now it
is stored in prepare and abort records when a replication origin session is active.

* Some cleanup, function renames to get rid of the xact_even/gid fields
in ReorderBuffer, which I used only to copy them into ReorderBufferTXN.

* Changed the output plugin interface to the one suggested upthread.
Now prepare/commit prepared/abort prepared are separate callbacks, and if none
of them is set then a 2pc tx will be decoded as 1pc for backward compatibility.

* New callback filter_prepare() that can be used to switch between
1pc/2pc-style decoding of a 2pc tx.

* test_decoding uses the new API and filters out aborted and running prepared tx.
It is actually easy to move the unlock of 2PCState to the prepare callback to
allow decoding of a running tx, but since that extension is an example, ISTM it
is better not to hold that lock during the whole prepare decoding. However, I
left enough information there about this and about the case when those locks
are not needed at all (when we are coordinating this tx).
  Talking about locking of a running prepared tx during decode, I think a
better solution would be to use our own custom lock here and register an
XACT_EVENT_PRE_ABORT callback in the extension to conflict with this lock.
Decode should hold it in shared mode, while commit holds it in exclusive mode.
That will allow locking things granularly and blocking only the tx that is
being decoded.
  However, we don't have XACT_EVENT_PRE_ABORT yet; it is only several LOCs to
add it. Should I?
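A rough Python model of that idea (illustrative only; the real thing would be a PostgreSQL lock plus the proposed XACT_EVENT_PRE_ABORT hook, and all names here are invented): decoding takes a gate in shared mode, and COMMIT/ABORT PREPARED must wait for exclusive access, so a prepared transaction cannot vanish mid-decode:

```python
# Toy reader-writer gate: decoding holds it shared, commit/abort
# prepared must take it exclusively. Purely a conceptual sketch.
import threading

class TwoPhaseGate:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0          # active decoding sessions
        self._writer = False       # commit/abort prepared in progress

    def begin_decode(self):        # shared acquisition
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def end_decode(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def finish_xact(self):         # exclusive: COMMIT/ABORT PREPARED
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def done(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

gate = TwoPhaseGate()
gate.begin_decode()                # decoder starts on the prepared xact
t = threading.Thread(target=lambda: (gate.finish_xact(), gate.done()))
t.start()
t.join(timeout=0.2)
assert t.is_alive()                # abort is blocked while decoding runs
gate.end_decode()
t.join(timeout=5)
assert not t.is_alive()            # abort proceeds once decoding finishes
```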

* It actually doesn't pass one of my regression tests. I've added the expected
output as it should be. I'll try to send a follow-up message with the fix, but
right now I'm sending it as is, as you asked.




logical_twophase.diff
Description: Binary data




Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] logical decoding of two-phase transactions

2017-03-27 Thread Craig Ringer
On 27 March 2017 at 17:53, Stas Kelvich  wrote:

> I heavily underestimated the amount of changes there, but I've almost
> finished and will send an updated patch in several hours.

Oh, brilliant! Please post whatever you have before you knock off for
the day anyway, even if it's just a WIP, so I can pick it up tomorrow
my time and poke at its tests etc.

I'm in Western Australia +0800 time, significantly ahead of you.

> Done.
[snip]
> Also done.

Great, time is short so that's fantastic.

> I’ve extended test, but it is good to have some more.

I don't mind writing tests and I've done quite a bit with TAP now, so
happy to help there.

>> Some special care is needed for the callback that decides whether to
>> process a given xact as 2PC or not. It's called before PREPARE
>> TRANSACTION to decide whether to decode any given xact at prepare time
>> or wait until it commits. It's called again at COMMIT PREPARED time if
>> we crashed after we processed PREPARE TRANSACTION and advanced our
>> confirmed_flush_lsn such that we won't re-process the PREPARE
>> TRANSACTION again. Our restart_lsn might've advanced past it so we
>> never even decode it, so we can't rely on seeing it at all. It has
>> access to the xid, gid and invalidations, all of which we have at both
>> prepare and commit time, to make its decision from. It must have the
>> same result at prepare and commit time for any given xact. We can
>> probably use a cache in the reorder buffer to avoid the 2nd call on
>> commit prepared if we haven't crashed/reconnected between the two.
>
> Good point. I didn't think about restart_lsn in the case when we are skipping
> this particular prepare (filter_prepared() -> true, in my terms). I think
> that should work properly as it uses the same code path as before, but I'll
> look at it.

I suspect that's going to be fragile in the face of interleaving of
xacts if we crash between prepare and commit prepared. (Apologies if
the below is long or disjointed, it's been a long day and I'm trying to
sort my thoughts out.)


Consider ("SSU" = "standby status update"):

0/050 xid 1 BEGIN
0/060 xid 1 INSERT ...

0/070 xid 2 BEGIN
0/080 xid 2 INSERT ...

0/090 xid 3 BEGIN
0/095 xid 3 INSERT ...
0/100 xid 3 PREPARE TRANSACTION 'x' => sent to client [y/n]?
SSU: confirmed_flush_lsn = 0/100, restart_lsn 0/050 (if we sent to client)

0/200 xid 2 COMMIT => sent to client
SSU: confirmed_flush_lsn = 0/200, restart_lsn 0/050

0/250 xl_running_xacts logged, xids = [1,3]

[CRASH or disconnect/reconnect]

Restart decoding at 0/050.

skip output of xid 3 PREPARE TRANSACTION @ 0/100: its LSN is <= confirmed_flush_lsn
skip output of xid 2 COMMIT @ 0/200: its LSN is <= confirmed_flush_lsn

0/300 xid 3 COMMIT PREPARED 'x' => sent to client, since its LSN
is > confirmed_flush_lsn



In the above, our problem is that restart_lsn is held down by some
other xact, so we can't rely on it to tell us if we replayed xid 3 to
the output plugin or not. We can't use confirmed_flush_lsn either,
since it'll advance at xid 2's commit whether or not we replayed xid
3's prepare to the client.

Since xid 3 will still be in xl_running_xacts when prepared, when we
recover SnapBuildProcessChange will return true for its changes and
we'll (re)buffer them, whether or not we ended up sending them to the
client at prepare time. Nothing much to be done about that; we'll just
discard them when we process the prepare or the commit prepared,
depending on where we consult our filter callback again.

We MUST ask our filter callback again though, before we test
SnapBuildXactNeedsSkip when processing the PREPARE TRANSACTION again.
Otherwise we'll discard the buffered changes, and if we *didn't* send
them to the client already ... splat.

We can call the filter callback again on xid 3's prepare to find out
"would you have replayed it when we passed it last time". Or we can
call it when we get to the commit instead, to ask "when called last
time at prepare, did you replay or not?" But we have to consult the
callback. By default we'd just skip ReorderBufferCommit processing for
xid 3 entirely, which we'll do via the SnapBuildXactNeedsSkip call in
DecodeCommit when we process the COMMIT PREPARED.

If there was no other running xact when we decoded the PREPARE
TRANSACTION the first time around (i.e. xid 1 and 2 didn't exist in
the above), and if we do send it to the client at prepare time, I
think we can safely advance restart_lsn to the most recent
xl_running_xacts once we get replay confirmation. So for restart
purposes we can pretend we already committed at PREPARE TRANSACTION
time if we output at PREPARE TRANSACTION time; it just doesn't help us
decide whether to send the buffer contents at COMMIT PREPARED time
or not.

TL;DR: we can't rely on restart_lsn or confirmed_flush_lsn or
xl_running_xacts, we must ask the filter callback when we (re)decode
the PREPARE TRANSACTION record and/or at COMMIT PREPARED time.

This isn't a big deal. We just have to make sure we actually consult
the filter callback at those points.

Re: [HACKERS] logical decoding of two-phase transactions

2017-03-27 Thread Stas Kelvich

> On 27 Mar 2017, at 12:26, Craig Ringer  wrote:
> 
> On 27 March 2017 at 09:31, Craig Ringer  wrote:
> 
>> We're in the last week of the CF. If you have a patch that's nearly
>> ready or getting there, now would be a good time to post it for help
>> and input from others.
>> 
>> I would really like to get this in, but we're running out of time.
>> 
>> Even if you just post your snapshot management work, with the cosmetic
>> changes discussed above, that would be a valuable start.
> 
> I'm going to pick up the last patch and:

I heavily underestimated the amount of changes there, but I'm almost
finished and will send an updated patch in several hours.

> * Ensure we only add the GID to xact records for 2pc commits and aborts

And only when wal_level >= logical. Done.
The patch also adds origin info to prepares and aborts.

> * Add separate callbacks for prepare, abort prepared, and commit
> prepared (of xacts already processed during prepare), so we aren't
> overloading the "commit" callback and don't have to create fake empty
> transactions to pass to the commit callback;

Done.

> * Add another callback to determine whether an xact should be
> processed at PREPARE TRANSACTION or COMMIT PREPARED time.

Also done.

> * Rename the snapshot builder faux-commit stuff in the current patch
> so it's clearer what's going on.

Hm. Okay, I’ll leave that part to you.

> * Write tests covering DDL, abort-during-decode, etc

I’ve extended the tests, but it would be good to have some more.

> Some special care is needed for the callback that decides whether to
> process a given xact as 2PC or not. It's called before PREPARE
> TRANSACTION to decide whether to decode any given xact at prepare time
> or wait until it commits. It's called again at COMMIT PREPARED time if
> we crashed after we processed PREPARE TRANSACTION and advanced our
> confirmed_flush_lsn such that we won't re-process the PREPARE
> TRANSACTION again. Our restart_lsn might've advanced past it so we
> never even decode it, so we can't rely on seeing it at all. It has
> access to the xid, gid and invalidations, all of which we have at both
> prepare and commit time, to make its decision from. It must have the
> same result at prepare and commit time for any given xact. We can
> probably use a cache in the reorder buffer to avoid the 2nd call on
> commit prepared if we haven't crashed/reconnected between the two.

Good point. Didn’t think about restart_lsn in the case where we are skipping
this particular prepare (filter_prepared() -> true, in my terms). I think that
should work properly, as it uses the same code path as before, but I’ll look
at it.

> This proposal does not provide a way to safely decode a 2pc xact that
> made catalog changes which may be aborted while being decoded. The
> plugin must lock such an xact so that it can't be aborted while being
> processed, or defer decoding until commit prepared. It can use the
> invalidations for the commit to decide.

I played with that two-pass catalog scan and it seemed to work, but
after some time I realised that it is not useful for the main case,
where the commit/abort is generated only after the receiver side has
answered the prepare. Also, that two-pass scan is a massive change to
relcache.c and genam.c (FWIW there were no problems with the cache, but
there were some problems with index scans and with handling one-to-many
catalog queries, e.g. a table with its fields).

Finally I decided to throw it away and switched to a filter_prepare
callback, passing the txn structure there to allow access to the
has_catalog_changes field.


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-03-27 Thread Craig Ringer
On 27 March 2017 at 09:31, Craig Ringer  wrote:

> We're in the last week of the CF. If you have a patch that's nearly
> ready or getting there, now would be a good time to post it for help
> and input from others.
>
> I would really like to get this in, but we're running out of time.
>
> Even if you just post your snapshot management work, with the cosmetic
> changes discussed above, that would be a valuable start.

I'm going to pick up the last patch and:


* Ensure we only add the GID to xact records for 2pc commits and aborts

* Add separate callbacks for prepare, abort prepared, and commit
prepared (of xacts already processed during prepare), so we aren't
overloading the "commit" callback and don't have to create fake empty
transactions to pass to the commit callback;

* Add another callback to determine whether an xact should be
processed at PREPARE TRANSACTION or COMMIT PREPARED time.

* Rename the snapshot builder faux-commit stuff in the current patch
so it's clearer what's going on.

* Write tests covering DDL, abort-during-decode, etc


Some special care is needed for the callback that decides whether to
process a given xact as 2PC or not. It's called before PREPARE
TRANSACTION to decide whether to decode any given xact at prepare time
or wait until it commits. It's called again at COMMIT PREPARED time if
we crashed after we processed PREPARE TRANSACTION and advanced our
confirmed_flush_lsn such that we won't re-process the PREPARE
TRANSACTION again. Our restart_lsn might've advanced past it so we
never even decode it, so we can't rely on seeing it at all. It has
access to the xid, gid and invalidations, all of which we have at both
prepare and commit time, to make its decision from. It must have the
same result at prepare and commit time for any given xact. We can
probably use a cache in the reorder buffer to avoid the 2nd call on
commit prepared if we haven't crashed/reconnected between the two.

This proposal does not provide a way to safely decode a 2pc xact that
made catalog changes which may be aborted while being decoded. The
plugin must lock such an xact so that it can't be aborted while being
processed, or defer decoding until commit prepared. It can use the
invalidations for the commit to decide.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-26 Thread Craig Ringer
On 20 March 2017 at 21:47, Stas Kelvich  wrote:
>
>> On 20 Mar 2017, at 16:39, Craig Ringer  wrote:
>>
>> On 20 March 2017 at 20:57, Stas Kelvich  wrote:
>>>
 On 20 Mar 2017, at 15:17, Craig Ringer  wrote:

> I thought about having special field (or reusing one of the existing 
> fields)
> in snapshot struct to force filtering xmax > snap->xmax or xmin = 
> snap->xmin
> as Petr suggested. Then this logic can reside in ReorderBufferCommit().
> However this is not solving problem with catcache, so I'm looking into it 
> right now.

 OK, so this is only an issue if we have xacts that change the schema
 of tables and also insert/update/delete to their heaps. Right?

 So, given that this is CF3 for Pg10, should we take a step back and
 impose the limitation that we can decode 2PC with schema changes or
 data row changes, but not both?
>>>
>>> Yep, time is tight. I’ll try today/tomorrow to proceed with this two-scan
>>> approach. If I fail to do that in time then I’ll just update this patch to
>>> decode only non-DDL 2pc transactions, as you suggested.
>>
>> I wasn't suggesting not decoding them, but giving the plugin the
>> option of whether to proceed with decoding or not.
>>
>> As Simon said, have a pre-decode-prepared callback that lets the
>> plugin get a lock on the 2pc xact if it wants, or say it doesn't want
>> to decode it until it commits.
>>
>> That'd be useful anyway, so we can filter and only do decoding at
>> prepare transaction time of xacts the downstream wants to know about
>> before they commit.
>
> Ah, got that. Okay.

Any news here?

We're in the last week of the CF. If you have a patch that's nearly
ready or getting there, now would be a good time to post it for help
and input from others.

I would really like to get this in, but we're running out of time.

Even if you just post your snapshot management work, with the cosmetic
changes discussed above, that would be a valuable start.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-20 Thread Stas Kelvich

> On 20 Mar 2017, at 16:39, Craig Ringer  wrote:
> 
> On 20 March 2017 at 20:57, Stas Kelvich  wrote:
>> 
>>> On 20 Mar 2017, at 15:17, Craig Ringer  wrote:
>>> 
 I thought about having special field (or reusing one of the existing 
 fields)
 in snapshot struct to force filtering xmax > snap->xmax or xmin = 
 snap->xmin
 as Petr suggested. Then this logic can reside in ReorderBufferCommit().
 However this is not solving problem with catcache, so I'm looking into it 
 right now.
>>> 
>>> OK, so this is only an issue if we have xacts that change the schema
>>> of tables and also insert/update/delete to their heaps. Right?
>>> 
>>> So, given that this is CF3 for Pg10, should we take a step back and
>>> impose the limitation that we can decode 2PC with schema changes or
>>> data row changes, but not both?
>> 
>> Yep, time is tight. I’ll try today/tomorrow to proceed with this two-scan
>> approach. If I fail to do that in time then I’ll just update this patch to
>> decode only non-DDL 2pc transactions, as you suggested.
> 
> I wasn't suggesting not decoding them, but giving the plugin the
> option of whether to proceed with decoding or not.
> 
> As Simon said, have a pre-decode-prepared callback that lets the
> plugin get a lock on the 2pc xact if it wants, or say it doesn't want
> to decode it until it commits.
> 
> That'd be useful anyway, so we can filter and only do decoding at
> prepare transaction time of xacts the downstream wants to know about
> before they commit.

Ah, got that. Okay.

> -- 
> Craig Ringer   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] logical decoding of two-phase transactions

2017-03-20 Thread Craig Ringer
On 20 March 2017 at 20:57, Stas Kelvich  wrote:
>
>> On 20 Mar 2017, at 15:17, Craig Ringer  wrote:
>>
>>> I thought about having a special field (or reusing one of the existing
>>> fields) in the snapshot struct to force filtering xmax > snap->xmax or
>>> xmin = snap->xmin, as Petr suggested. Then this logic can reside in
>>> ReorderBufferCommit(). However this does not solve the problem with the
>>> catcache, so I'm looking into it right now.
>>
>> OK, so this is only an issue if we have xacts that change the schema
>> of tables and also insert/update/delete to their heaps. Right?
>>
>> So, given that this is CF3 for Pg10, should we take a step back and
>> impose the limitation that we can decode 2PC with schema changes or
>> data row changes, but not both?
>
> Yep, time is tight. I’ll try today/tomorrow to proceed with this two-scan
> approach. If I fail to do that in time then I’ll just update this patch to
> decode only non-DDL 2pc transactions, as you suggested.

I wasn't suggesting not decoding them, but giving the plugin the
option of whether to proceed with decoding or not.

As Simon said, have a pre-decode-prepared callback that lets the
plugin get a lock on the 2pc xact if it wants, or say it doesn't want
to decode it until it commits.

That'd be useful anyway, so we can filter and only do decoding at
prepare transaction time of xacts the downstream wants to know about
before they commit.

>>> Just as before, I mark this transaction as committed in the snapbuilder,
>>> but after decoding I delete this transaction from xip (which holds
>>> committed transactions in the case of a historic snapshot).
>>
>> That seems kind of hacky TBH. I didn't much like marking it as
>> committed then un-committing it.
>>
>> I think it's mostly an interface issue though. I'd rather say
>> SnapBuildPushPrepareTransaction and SnapBuildPopPreparedTransaction or
>> something, to make it clear what we're doing.
>
> Yes, that will be less confusing. However, there is no queue of any kind, so
> SnapBuildStartPrepare / SnapBuildFinishPrepare should work too.

Yeah, that's better.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-20 Thread Stas Kelvich

> On 20 Mar 2017, at 15:17, Craig Ringer  wrote:
> 
>> I thought about having a special field (or reusing one of the existing
>> fields) in the snapshot struct to force filtering xmax > snap->xmax or
>> xmin = snap->xmin, as Petr suggested. Then this logic can reside in
>> ReorderBufferCommit(). However this does not solve the problem with the
>> catcache, so I'm looking into it right now.
> 
> OK, so this is only an issue if we have xacts that change the schema
> of tables and also insert/update/delete to their heaps. Right?
> 
> So, given that this is CF3 for Pg10, should we take a step back and
> impose the limitation that we can decode 2PC with schema changes or
> data row changes, but not both?

Yep, time is tight. I’ll try today/tomorrow to proceed with this two-scan
approach. If I fail to do that in time then I’ll just update this patch to
decode only non-DDL 2pc transactions, as you suggested.

>> Just as before, I mark this transaction as committed in the snapbuilder,
>> but after decoding I delete this transaction from xip (which holds
>> committed transactions in the case of a historic snapshot).
> 
> That seems kind of hacky TBH. I didn't much like marking it as
> committed then un-committing it.
> 
> I think it's mostly an interface issue though. I'd rather say
> SnapBuildPushPrepareTransaction and SnapBuildPopPreparedTransaction or
> something, to make it clear what we're doing.

Yes, that will be less confusing. However, there is no queue of any kind, so
SnapBuildStartPrepare / SnapBuildFinishPrepare should work too.

> -- 
> Craig Ringer   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






Re: [HACKERS] logical decoding of two-phase transactions

2017-03-20 Thread Simon Riggs
On 17 March 2017 at 23:59, Robert Haas  wrote:
> On Thu, Mar 16, 2017 at 10:34 PM, Craig Ringer  wrote:
>> On 17 March 2017 at 08:10, Stas Kelvich  wrote:
>>> While working on this I’ve spotted quite a nasty corner case with an
>>> aborted prepared transaction. I have some not-that-great ideas how to
>>> fix it, but maybe I’ve blurred my view and missed something, so I want
>>> to ask here first.
>>>
>>> Suppose we created a table, then in a 2pc tx we alter it and after that
>>> abort the tx. So pg_class will have something like this:
>>>
>>> xmin | xmax | relname
>>> 100  | 200  | mytable
>>> 200  | 0    | mytable
>>>
>>> After that abort, tuple (100,200,mytable) becomes visible again, and if
>>> we alter the table again then the xmax of the first tuple will be set to
>>> the current xid, resulting in the following table:
>>>
>>> xmin | xmax | relname
>>> 100  | 300  | mytable
>>> 200  | 0    | mytable
>>> 300  | 0    | mytable
>>>
>>> At that moment we’ve lost the information that the first tuple was
>>> deleted by our prepared tx.
>>
>> Right. And while the prepared xact has aborted, we don't control when
>> it aborts and when those overwrites can start happening. We can and
>> should check if a 2pc xact is aborted before we start decoding it so
>> we can skip decoding it if it's already aborted, but it could be
>> aborted *while* we're decoding it, then have data needed for its
>> snapshot clobbered.
>>
>> This hasn't mattered in the past because prepared xacts (and
>> especially aborted 2pc xacts) have never needed snapshots, we've never
>> needed to do something from the perspective of a prepared xact.
>>
>> I think we'll probably need to lock the 2PC xact so it cannot be
>> aborted or committed while we're decoding it, until we finish decoding
>> it. So we lock it, then check if it's already aborted/already
>> committed/in progress. If it's aborted, treat it like any normal
>> aborted xact. If it's committed, treat it like any normal committed
>> xact. If it's in progress, keep the lock and decode it.
>
> But that lock could need to be held for an unbounded period of time -
> as long as decoding takes to complete - which seems pretty
> undesirable.

This didn't seem to be too much of a problem when I read it.

Sure, the issue noted by Stas exists, but it requires
Alter-Abort-Alter for it to be a problem. Meaning that normal non-DDL
transactions do not have problems. Neither would a real-time system
that uses the decoded data to decide whether to commit or abort the
transaction; in that case there would never be an abort until after
decoding.

So I suggest we have a pre-prepare callback to ensure that the plugin
can decide whether to decode or not. We can pass information to the
plugin such as whether we have issued DDL in that xact or not. The
plugin can then decide how it wishes to handle it, so if somebody
doesn't like the idea of a lock then don't use one. The plugin is
already responsible for many things, so this is nothing new.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-20 Thread Craig Ringer
> I thought about having a special field (or reusing one of the existing
> fields) in the snapshot struct to force filtering xmax > snap->xmax or
> xmin = snap->xmin, as Petr suggested. Then this logic can reside in
> ReorderBufferCommit(). However this does not solve the problem with the
> catcache, so I'm looking into it right now.

OK, so this is only an issue if we have xacts that change the schema
of tables and also insert/update/delete to their heaps. Right?

So, given that this is CF3 for Pg10, should we take a step back and
impose the limitation that we can decode 2PC with schema changes or
data row changes, but not both?

Applications can record DDL in transactional logical WAL messages for
decoding during 2pc processing. Or apps can do 2pc for DML. They just
can't do both at the same time, in the same xact.

Imperfect, but a lot less invasive. And we can even permit apps to use
the locking-based approach I outlined earlier instead:

All we have to do IMO is add an output plugin callback to filter
whether we want to decode a given 2pc xact at PREPARE TRANSACTION time
or defer until COMMIT PREPARED. It could:

* mark the xact for deferred decoding at commit time (the default if
the callback doesn't exist); or

* Acquire a lock on the 2pc xact and request immediate decoding only
if it gets the lock so concurrent ROLLBACK PREPARED is blocked; or

* inspect the reorder buffer contents for row changes and decide
whether to decode now or later based on that.

It has a few downsides - for example, temp tables will be considered
"catalog changes" for now. But .. eh. We already accept a bunch of
practical limitations for catalog changes and DDL in logical decoding,
most notably regarding practical handling of full table rewrites.

> Just as before, I mark this transaction as committed in the snapbuilder,
> but after decoding I delete this transaction from xip (which holds
> committed transactions in the case of a historic snapshot).

That seems kind of hacky TBH. I didn't much like marking it as
committed then un-committing it.

I think it's mostly an interface issue though. I'd rather say
SnapBuildPushPrepareTransaction and SnapBuildPopPreparedTransaction or
something, to make it clear what we're doing.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-20 Thread Stas Kelvich

> On 20 Mar 2017, at 11:32, Craig Ringer  wrote:
> 
> On 19 March 2017 at 21:26, Petr Jelinek  wrote:
> 
>> I think only genam would need changes to do two-phase scan for this as
>> the catalog scans should ultimately go there. It's going to slow down
>> things but we could limit the impact by doing the two-phase scan only
>> when historical snapshot is in use and the tx being decoded changed
>> catalogs (we already have global knowledge of the first one, and it
>> would be trivial to add the second one as we have local knowledge of
>> that as well).
> 
> 
> TBH, I have no idea how to approach the genam changes for the proposed
> double-scan method. It sounds like Stas has some idea how to proceed
> though (right?)
> 

I thought about having a special field (or reusing one of the existing
fields) in the snapshot struct to force filtering xmax > snap->xmax or
xmin = snap->xmin, as Petr suggested. Then this logic can reside in
ReorderBufferCommit(). However this does not solve the problem with the
catcache, so I'm looking into it right now.


> On 17 Mar 2017, at 05:38, Craig Ringer  wrote:
> 
> On 16 March 2017 at 19:52, Stas Kelvich  wrote:
> 
>> 
>> I’m working right now on issue with building snapshots for decoding prepared 
>> tx.
>> I hope I'll send updated patch later today.
> 
> 
> Great.
> 
> What approach are you taking?


Just as before, I mark this transaction as committed in the snapbuilder,
but after decoding I delete this transaction from xip (which holds
committed transactions in the case of a historic snapshot).


> -- 
> Craig Ringer   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] logical decoding of two-phase transactions

2017-03-20 Thread Petr Jelinek
On 20/03/17 09:32, Craig Ringer wrote:
> On 19 March 2017 at 21:26, Petr Jelinek  wrote:
> 
>> I think only genam would need changes to do two-phase scan for this as
>> the catalog scans should ultimately go there. It's going to slow down
>> things but we could limit the impact by doing the two-phase scan only
>> when historical snapshot is in use and the tx being decoded changed
>> catalogs (we already have global knowledge of the first one, and it
>> would be trivial to add the second one as we have local knowledge of
>> that as well).
> 
> We'll also have to clobber caches after we finish decoding a 2pc xact,
> since we don't know those changes are visible to other xacts and can't
> guarantee they'll ever be (if it aborts).
> 

AFAIK reorder buffer already does that.

> That's going to be "interesting" when trying to decode interleaved
> transaction streams since we can't afford to clobber caches whenever
> we see an xlog record from a different xact. We'll probably have to
> switch to linear decoding with reordering when someone makes catalog
> changes.

We may need something that allows for representing multiple parallel
transactions in a single process, and a cheap way of switching between
them (i.e., similar to what we need for autonomous transactions). But
that's not something the current patch has to deal with.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-20 Thread Craig Ringer
On 19 March 2017 at 21:26, Petr Jelinek  wrote:

> I think only genam would need changes to do two-phase scan for this as
> the catalog scans should ultimately go there. It's going to slow down
> things but we could limit the impact by doing the two-phase scan only
> when historical snapshot is in use and the tx being decoded changed
> catalogs (we already have global knowledge of the first one, and it
> would be trivial to add the second one as we have local knowledge of
> that as well).

We'll also have to clobber caches after we finish decoding a 2pc xact,
since we don't know those changes are visible to other xacts and can't
guarantee they'll ever be (if it aborts).

That's going to be "interesting" when trying to decode interleaved
transaction streams since we can't afford to clobber caches whenever
we see an xlog record from a different xact. We'll probably have to
switch to linear decoding with reordering when someone makes catalog
changes.

TBH, I have no idea how to approach the genam changes for the proposed
double-scan method. It sounds like Stas has some idea how to proceed
though (right?)



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-20 Thread Craig Ringer
On 17 March 2017 at 23:59, Robert Haas  wrote:

> But that lock could need to be held for an unbounded period of time -
> as long as decoding takes to complete - which seems pretty
> undesirable.

Yeah. We could use a recovery-conflict-like mechanism to signal the
decoding session that someone wants to abort the xact, but it gets
messy.

>  Worse still, the same problem will arise if you
> eventually want to start decoding ordinary, non-2PC transactions that
> haven't committed yet, which I think is something we definitely want
> to do eventually; the current handling of bulk loads or bulk updates
> leads to significant latency.

Yeah. If it weren't for that, I'd probably still just pursue locking.
But you're right that we'll have to solve this sooner or later. I'll
admit I hoped for later.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-19 Thread Petr Jelinek
On 17/03/17 03:34, Craig Ringer wrote:
> On 17 March 2017 at 08:10, Stas Kelvich  wrote:
> 
>> While working on this I’ve spotted quite a nasty corner case with an
>> aborted prepared transaction. I have some not-that-great ideas how to
>> fix it, but maybe I’ve blurred my view and missed something, so I want
>> to ask here first.
>>
>> Suppose we created a table, then in a 2pc tx we alter it and after that
>> abort the tx. So pg_class will have something like this:
>>
>> xmin | xmax | relname
>> 100  | 200  | mytable
>> 200  | 0    | mytable
>>
>> After that abort, tuple (100,200,mytable) becomes visible again, and if
>> we alter the table again then the xmax of the first tuple will be set to
>> the current xid, resulting in the following table:
>>
>> xmin | xmax | relname
>> 100  | 300  | mytable
>> 200  | 0    | mytable
>> 300  | 0    | mytable
>>
>> At that moment we’ve lost the information that the first tuple was
>> deleted by our prepared tx.
> 
> Right. And while the prepared xact has aborted, we don't control when
> it aborts and when those overwrites can start happening. We can and
> should check if a 2pc xact is aborted before we start decoding it so
> we can skip decoding it if it's already aborted, but it could be
> aborted *while* we're decoding it, then have data needed for its
> snapshot clobbered.
> 
> This hasn't mattered in the past because prepared xacts (and
> especially aborted 2pc xacts) have never needed snapshots, we've never
> needed to do something from the perspective of a prepared xact.
> 
> I think we'll probably need to lock the 2PC xact so it cannot be
> aborted or committed while we're decoding it, until we finish decoding
> it. So we lock it, then check if it's already aborted/already
> committed/in progress. If it's aborted, treat it like any normal
> aborted xact. If it's committed, treat it like any normal committed
> xact. If it's in progress, keep the lock and decode it.
> 
> People using logical decoding for 2PC will presumably want to control
> 2PC via logical decoding, so they're not so likely to mind such a
> lock.
> 
>> * Try at first to scan catalog filtering out tuples with xmax bigger than 
>> snapshot->xmax
>> as it was possibly deleted by our tx. Than if nothing found scan in a usual 
>> way.
> 
> I don't think that'll be at all viable with the syscache/relcache
> machinery. Way too intrusive.
> 

I think only genam would need changes to do a two-phase scan for this, as
the catalog scans should ultimately go there. It's going to slow things
down, but we could limit the impact by doing the two-phase scan only
when a historic snapshot is in use and the tx being decoded changed
catalogs (we already have global knowledge of the first one, and it
would be trivial to add the second one as we have local knowledge of
that as well).

What I think is a better strategy than filtering out by xmax, though, is
filtering "in" by xmin. Meaning that the first scan would return only
tuples modified by the current tx which are visible in the snapshot, and
the second scan would return the other visible tuples. That way whatever
the decoded tx has seen should always win.
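
The "filter in by xmin first" idea can be sketched as a toy model (hypothetical and heavily simplified — plain Python visibility checks, not the actual genam or HeapTupleSatisfies code):

```python
# Toy model of the proposed two-phase catalog scan. Tuple versions carry
# (xmin, xmax, value); "committed" is the set of xids the historic snapshot
# treats as committed, including the tx currently being decoded.

def visible(tup, committed):
    # Simplified MVCC visibility: inserter committed, deleter not committed.
    xmin, xmax, _ = tup
    return xmin in committed and (xmax == 0 or xmax not in committed)

def two_phase_scan(tuples, decoding_xid, committed):
    # Phase 1: "filter in" -- prefer versions written by the tx being decoded.
    for tup in tuples:
        if tup[0] == decoding_xid and visible(tup, committed):
            return tup
    # Phase 2: fall back to the other visible tuples.
    for tup in tuples:
        if visible(tup, committed):
            return tup
    return None

# The pg_class rows from Stas's example after the second ALTER TABLE:
pg_class = [(100, 300, "mytable"), (200, 0, "mytable"), (300, 0, "mytable")]

# Decoding prepared tx 200: a plain scan would return (100, 300, ...) first,
# because tx 300's overwrite hid the fact that tx 200 deleted that version.
print(two_phase_scan(pg_class, 200, committed={100, 200}))  # -> (200, 0, 'mytable')
```

The point of the two-pass order is exactly what the paragraph above argues: whatever version the decoded tx wrote wins over the other visible versions.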

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-03-17 Thread Robert Haas
On Thu, Mar 16, 2017 at 10:34 PM, Craig Ringer  wrote:
> On 17 March 2017 at 08:10, Stas Kelvich  wrote:
>> While working on this i’ve spotted quite a nasty corner case with aborted 
>> prepared
>> transaction. I have some not that great ideas how to fix it, but maybe i 
>> blurred my
>> view and missed something. So want to ask here at first.
>>
>> Suppose we created a table, then in 2pc tx we are altering it and after that 
>> aborting tx.
>> So pg_class will have something like this:
>>
>> xmin | xmax | relname
>> 100  | 200  | mytable
>> 200  | 0    | mytable
>>
>> After previous abort, tuple (100,200,mytable) becomes visible and if we will 
>> alter table
>> again then xmax of first tuple will be set current xid, resulting in 
>> following table:
>>
>> xmin | xmax | relname
>> 100  | 300  | mytable
>> 200  | 0    | mytable
>> 300  | 0    | mytable
>>
>> In that moment we’ve lost information that first tuple was deleted by our 
>> prepared tx.
>
> Right. And while the prepared xact has aborted, we don't control when
> it aborts and when those overwrites can start happening. We can and
> should check if a 2pc xact is aborted before we start decoding it so
> we can skip decoding it if it's already aborted, but it could be
> aborted *while* we're decoding it, then have data needed for its
> snapshot clobbered.
>
> This hasn't mattered in the past because prepared xacts (and
> especially aborted 2pc xacts) have never needed snapshots, we've never
> needed to do something from the perspective of a prepared xact.
>
> I think we'll probably need to lock the 2PC xact so it cannot be
> aborted or committed while we're decoding it, until we finish decoding
> it. So we lock it, then check if it's already aborted/already
> committed/in progress. If it's aborted, treat it like any normal
> aborted xact. If it's committed, treat it like any normal committed
> xact. If it's in progress, keep the lock and decode it.

But that lock could need to be held for an unbounded period of time -
as long as decoding takes to complete - which seems pretty
undesirable.  Worse still, the same problem will arise if you
eventually want to start decoding ordinary, non-2PC transactions that
haven't committed yet, which I think is something we definitely want
to do eventually; the current handling of bulk loads or bulk updates
leads to significant latency.  You're not going to be able to tell an
active transaction that it isn't allowed to abort until you get done
with it, and I don't really think you should be allowed to lock out
2PC aborts for long periods of time either.  That's going to stink for
users.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-16 Thread Craig Ringer
On 16 March 2017 at 19:52, Stas Kelvich  wrote:

>
> I’m working right now on issue with building snapshots for decoding prepared 
> tx.
> I hope I'll send updated patch later today.


Great.

What approach are you taking?

It looks like the snapshot builder actually does most of the work we
need for this already, maintaining a stack of snapshots we can use. It
might be as simple as invalidating the relcache/syscache when we exit
(and enter?) decoding of a prepared 2pc xact, since it violates the
usual assumption of logical decoding that we decode things strictly in
commit-time order.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-16 Thread Craig Ringer
On 17 March 2017 at 08:10, Stas Kelvich  wrote:

> While working on this i’ve spotted quite a nasty corner case with aborted 
> prepared
> transaction. I have some not that great ideas how to fix it, but maybe i 
> blurred my
> view and missed something. So want to ask here at first.
>
> Suppose we created a table, then in 2pc tx we are altering it and after that 
> aborting tx.
> So pg_class will have something like this:
>
> xmin | xmax | relname
> 100  | 200  | mytable
> 200  | 0    | mytable
>
> After previous abort, tuple (100,200,mytable) becomes visible and if we will 
> alter table
> again then xmax of first tuple will be set current xid, resulting in 
> following table:
>
> xmin | xmax | relname
> 100  | 300  | mytable
> 200  | 0    | mytable
> 300  | 0    | mytable
>
> In that moment we’ve lost information that first tuple was deleted by our 
> prepared tx.

Right. And while the prepared xact has aborted, we don't control when
it aborts and when those overwrites can start happening. We can and
should check if a 2pc xact is aborted before we start decoding it so
we can skip decoding it if it's already aborted, but it could be
aborted *while* we're decoding it, then have data needed for its
snapshot clobbered.

This hasn't mattered in the past because prepared xacts (and
especially aborted 2pc xacts) have never needed snapshots, we've never
needed to do something from the perspective of a prepared xact.

I think we'll probably need to lock the 2PC xact so it cannot be
aborted or committed while we're decoding it, until we finish decoding
it. So we lock it, then check if it's already aborted/already
committed/in progress. If it's aborted, treat it like any normal
aborted xact. If it's committed, treat it like any normal committed
xact. If it's in progress, keep the lock and decode it.

People using logical decoding for 2PC will presumably want to control
2PC via logical decoding, so they're not so likely to mind such a
lock.
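
The lock-then-check sequence described above could be sketched like this (a minimal simulation; the class and method names are invented for illustration and are not PostgreSQL APIs):

```python
import threading

class TwoPhaseXact:
    """Minimal model of a prepared xact that is locked while being decoded."""

    def __init__(self, gid):
        self.gid = gid
        self.state = "in progress"          # prepared, not yet resolved
        self._lock = threading.Lock()       # held for the whole decode

    def start_decoding(self):
        # Lock first, then check the state: aborted and committed xacts can
        # be decoded like ordinary ones; in-progress xacts stay locked.
        self._lock.acquire()
        if self.state != "in progress":
            self._lock.release()
        return self.state

    def finish_decoding(self):
        self._lock.release()

    def resolve(self, new_state):
        # COMMIT PREPARED / ROLLBACK PREPARED block until decoding finishes.
        with self._lock:
            self.state = new_state
```

The obvious design trade-off is that the resolving backend can be made to wait for as long as decoding of an in-progress xact takes.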

> * Try at first to scan catalog filtering out tuples with xmax bigger than 
> snapshot->xmax
> as it was possibly deleted by our tx. Than if nothing found scan in a usual 
> way.

I don't think that'll be at all viable with the syscache/relcache
machinery. Way too intrusive.

> * Do not decode such transaction at all.

Yes, that's what I'd like to do, per above.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-16 Thread Stas Kelvich

>> On 2 Mar 2017, at 11:00, Craig Ringer  wrote:
>> 
>> BTW, I've been reviewing the patch in more detail. Other than a bunch
>> of copy-and-paste that I'm cleaning up, the main issue I've found is
>> that in DecodePrepare, you call:
>> 
>>   SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
>>  parsed->nsubxacts, parsed->subxacts);
>> 
>> but I am not convinced it is correct to call it at PREPARE TRANSACTION
>> time, only at COMMIT PREPARED time. We want to see the 2pc prepared
>> xact's state when decoding it, but there might be later commits that
>> cannot yet see that state and shouldn't have it visible in their
>> snapshots. 
> 
> Agree, that is problem. That allows to decode this PREPARE, but after that
> it is better to mark this transaction as running in snapshot or perform 
> prepare
> decoding with some kind of copied-and-edited snapshot. I’ll have a look at 
> this.
> 

While working on this I’ve spotted quite a nasty corner case with aborted 
prepared transactions. I have some not-that-great ideas for how to fix it, 
but maybe I blurred my view and missed something, so I want to ask here first.

Suppose we created a table, then in a 2pc tx we alter it, and after that we 
abort the tx. So pg_class will have something like this:

xmin | xmax | relname
100  | 200  | mytable
200  | 0    | mytable

After the abort, the tuple (100,200,mytable) becomes visible again, and if we 
alter the table once more then the xmax of the first tuple will be set to the 
current xid, resulting in the following table:

xmin | xmax | relname
100  | 300  | mytable
200  | 0    | mytable
300  | 0    | mytable

At that moment we’ve lost the information that the first tuple was deleted by 
our prepared tx. And from the POV of the historic snapshot that will be 
constructed to decode the prepare, the first tuple is visible, but actually 
the second tuple should be used. Moreover, such a snapshot could see both 
tuples, violating oid uniqueness, but the heapscan stops after finding the 
first one.

I see here two possible workarounds:

* First try to scan the catalog filtering out tuples with xmax bigger than 
snapshot->xmax, as they were possibly deleted by our tx. Then, if nothing is 
found, scan in the usual way.

* Do not decode such transactions at all. If by the time we decode the prepare 
record we already know that it is aborted, then decoding it doesn’t make a lot 
of sense. IMO the intended usage of logical 2pc decoding is to decide about 
commit/abort based on answers from logical subscribers/replicas. So there will 
be a barrier between prepare and commit/abort, and such situations shouldn’t 
happen.
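
The lost-xmax effect can be reproduced in a few lines (a toy model of a single catalog tuple — plain Python, not real heap tuples):

```python
# Replay the xmax history from the example. An abort makes the old version
# visible again, but leaves no trace once another tx overwrites its xmax.

tuple1 = {"xmin": 100, "xmax": 0, "relname": "mytable"}

# 2pc tx 200 alters the table: it marks tuple1 as deleted...
tuple1["xmax"] = 200
deleted_by_prepared_tx = tuple1["xmax"] == 200   # True: info is present

# ...tx 200 aborts, tuple1 is visible again, and tx 300 alters the table,
# reusing the single xmax slot:
tuple1["xmax"] = 300

# The slot now says "deleted by 300"; the fact that the prepared tx 200
# once deleted this version is unrecoverable from the tuple itself.
deleted_by_prepared_tx = tuple1["xmax"] == 200   # False now
```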

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






Re: [HACKERS] logical decoding of two-phase transactions

2017-03-16 Thread Stas Kelvich

> On 16 Mar 2017, at 14:44, Craig Ringer  wrote:
> 
> I'm going to try to pick this patch up and amend its interface per our
> discussion earlier, see if I can get it committable.

I’m working right now on an issue with building snapshots for decoding prepared tx.
I hope I'll send updated patch later today.

> -- 
> Craig Ringer   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] logical decoding of two-phase transactions

2017-03-16 Thread Craig Ringer
On 15 March 2017 at 15:42, Petr Jelinek  wrote:

> Thinking about this some more. Why can't we use the same mechanism
> standby uses, ie, use xid to identify the 2PC?

It pushes work onto the downstream, which has to keep an xid-to-GID
mapping in a crash-safe, persistent form. We'll be doing a flush of
some kind anyway so we can report successful prepare to the upstream,
so an additional flush of an SLRU might not be so bad for a postgres
downstream. And I guess any other clients will have some kind of
downstream persistent mapping to use.

So I think I have a mild preference for recording the gid on 2pc
commit and abort records in the master's WAL, where it's very cheap
and simple.

But I agree that just sending the xid is a viable option if that falls through.

I'm going to try to pick this patch up and amend its interface per our
discussion earlier, see if I can get it committable.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-15 Thread Petr Jelinek


On 02/03/17 17:34, Petr Jelinek wrote:
> On 02/03/17 13:23, Craig Ringer wrote:
>> On 2 March 2017 at 16:20, Stas Kelvich  wrote:
>>>
 On 2 Mar 2017, at 11:00, Craig Ringer  wrote:

 We already have it, because we just decoded the PREPARE TRANSACTION.
 I'm preparing a patch revision to demonstrate this.
>>>
>>> Yes, we already have it, but if server reboots between commit prepared (all
>>> prepared state is gone) and decoding of this commit prepared then we lose
>>> that mapping, isn’t it?
>>
>> I was about to explain how restart_lsn works again, and how that would
>> mean we'd always re-decode the PREPARE TRANSACTION before any COMMIT
>> PREPARED or ROLLBACK PREPARED on crash. But...
>>
>> Actually, the way you've implemented it, that won't be the case. You
>> treat PREPARE TRANSACTION as a special-case of COMMIT, and the client
>> will presumably send replay confirmation after it has applied the
>> PREPARE TRANSACTION. In fact, it has to if we want 2PC to work with
>> synchronous replication. This will allow restart_lsn to advance to
>> after the PREPARE TRANSACTION record if there's no other older xact
>> and we see a suitable xl_running_xacts record. So we wouldn't decode
>> the PREPARE TRANSACTION again after restart.
>>

Thinking about this some more. Why can't we use the same mechanism
standby uses, ie, use xid to identify the 2PC? If output plugin cares
about doing 2PC in two phases, it can send xid as part of its protocol
(like the PG10 logical replication and pglogical do already) and simply
remember on downstream the remote node + remote xid of the 2PC in
progress. That way there is no need for gids in COMMIT PREPARED and this
patch would be much simpler (as the tracking would be left to actual
replication implementation as opposed to decoding). Or am I missing
something?
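
The xid-based scheme could be pictured like this on the downstream (an illustrative sketch only: the class and GID format are invented, and as noted elsewhere in the thread the map would have to be crash-safe and persistent in practice):

```python
class DownstreamTwoPhaseMap:
    """Track in-progress 2PC xacts by (origin node, remote xid).

    Illustrative only: the class name, the GID format, and the SQL strings
    are invented; a real apply worker must persist this map crash-safely.
    """

    def __init__(self):
        self.in_progress = {}   # (node_id, remote_xid) -> locally chosen GID

    def on_prepare(self, node_id, remote_xid):
        # The output plugin only sent us the xid; we choose a local GID.
        gid = "lr_%s_%d" % (node_id, remote_xid)
        self.in_progress[(node_id, remote_xid)] = gid
        return "PREPARE TRANSACTION '%s'" % gid

    def on_commit_prepared(self, node_id, remote_xid):
        # COMMIT PREPARED needs no GID on the wire; it is looked up here.
        gid = self.in_progress.pop((node_id, remote_xid))
        return "COMMIT PREPARED '%s'" % gid
```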

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-14 Thread David Steele
On 3/2/17 11:34 AM, Petr Jelinek wrote:
> On 02/03/17 13:23, Craig Ringer wrote:
>>
>> I particularly dislike calling a commit callback for an abort. So I'd
>> like to look further into the interface side of things. I'm inclined
>> to suggest adding new callbacks for 2pc prepare, commit and rollback,
>> and if the output plugin doesn't set them fall back to the existing
>> behaviour. Plugins that aren't interested in 2PC (think ETL) should
>> probably not have to deal with it, we might as well just send them
>> only the actually committed xacts, when they commit.
>>
> 
> I think this is a good approach to handle it.

It's been a while since there was any activity on this thread and a very
long time since the last patch.  As far as I can see there are far more
questions than answers in this thread.

If you need more time to produce a patch, please post an explanation for
the delay and a schedule for the new patch.  If no patch or explanation
is posted by 2017-03-17 AoE I will mark this submission
"Returned with Feedback".

-- 
-David
da...@pgmasters.net




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-02 Thread Petr Jelinek
On 02/03/17 13:23, Craig Ringer wrote:
> On 2 March 2017 at 16:20, Stas Kelvich  wrote:
>>
>>> On 2 Mar 2017, at 11:00, Craig Ringer  wrote:
>>>
>>> We already have it, because we just decoded the PREPARE TRANSACTION.
>>> I'm preparing a patch revision to demonstrate this.
>>
>> Yes, we already have it, but if server reboots between commit prepared (all
>> prepared state is gone) and decoding of this commit prepared then we lose
>> that mapping, isn’t it?
> 
> I was about to explain how restart_lsn works again, and how that would
> mean we'd always re-decode the PREPARE TRANSACTION before any COMMIT
> PREPARED or ROLLBACK PREPARED on crash. But...
> 
> Actually, the way you've implemented it, that won't be the case. You
> treat PREPARE TRANSACTION as a special-case of COMMIT, and the client
> will presumably send replay confirmation after it has applied the
> PREPARE TRANSACTION. In fact, it has to if we want 2PC to work with
> synchronous replication. This will allow restart_lsn to advance to
> after the PREPARE TRANSACTION record if there's no other older xact
> and we see a suitable xl_running_xacts record. So we wouldn't decode
> the PREPARE TRANSACTION again after restart.
> 

Unless we just don't let restart_lsn go forward if there is a 2pc that
wasn't decoded yet (twopcs store the prepare lsn), but that's probably
too much of a kludge.

> 
> It's also worth noting that with your current approach, 2PC xacts will
> produce two calls to the output plugin's commit() callback, once for
> the PREPARE TRANSACTION and another for the COMMIT PREPARED or
> ROLLBACK PREPARED, the latter two with a faked-up state. I'm not a
> huge fan of that. It's not entirely backward compatible since it
> violates the previously safe assumption that there's a 1:1
> relationship between begin and commit callbacks with no interleaving,
> for one thing, and I think it's also a bit misleading to send a
> PREPARE TRANSACTION to a callback that could previously only receive a
> true commit.
> 
> I particularly dislike calling a commit callback for an abort. So I'd
> like to look further into the interface side of things. I'm inclined
> to suggest adding new callbacks for 2pc prepare, commit and rollback,
> and if the output plugin doesn't set them fall back to the existing
> behaviour. Plugins that aren't interested in 2PC (think ETL) should
> probably not have to deal with it, we might as well just send them
> only the actually committed xacts, when they commit.
> 

I think this is a good approach to handle it.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-02 Thread Craig Ringer
On 2 March 2017 at 16:20, Stas Kelvich  wrote:
>
>> On 2 Mar 2017, at 11:00, Craig Ringer  wrote:
>>
>> We already have it, because we just decoded the PREPARE TRANSACTION.
>> I'm preparing a patch revision to demonstrate this.
>
> Yes, we already have it, but if server reboots between commit prepared (all
> prepared state is gone) and decoding of this commit prepared then we lose
> that mapping, isn’t it?

I was about to explain how restart_lsn works again, and how that would
mean we'd always re-decode the PREPARE TRANSACTION before any COMMIT
PREPARED or ROLLBACK PREPARED on crash. But...

Actually, the way you've implemented it, that won't be the case. You
treat PREPARE TRANSACTION as a special-case of COMMIT, and the client
will presumably send replay confirmation after it has applied the
PREPARE TRANSACTION. In fact, it has to if we want 2PC to work with
synchronous replication. This will allow restart_lsn to advance to
after the PREPARE TRANSACTION record if there's no other older xact
and we see a suitable xl_running_xacts record. So we wouldn't decode
the PREPARE TRANSACTION again after restart.

Hm.

That's actually a pretty good reason to xlog the gid for 2pc rollback
and commit if we're at wal_level >= logical . Being able to advance
restart_lsn and avoid the re-decoding work is a big win.

Come to think of it, we have to advance the client replication
identifier as part of PREPARE TRANSACTION anyway, otherwise we'd try
to repeat and re-prepare the same xact on crash recovery.

Given that, I withdraw my objection to adding the gid to commit and
rollback xlog records, though it should only be done if they're 2pc
commit/abort, and only if XLogLogicalInfoActive().


>> BTW, I've been reviewing the patch in more detail. Other than a bunch
>> of copy-and-paste that I'm cleaning up, the main issue I've found is
>> that in DecodePrepare, you call:
>>
>>SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
>>   parsed->nsubxacts, parsed->subxacts);
>>
>> but I am not convinced it is correct to call it at PREPARE TRANSACTION
>> time, only at COMMIT PREPARED time. We want to see the 2pc prepared
>> xact's state when decoding it, but there might be later commits that
>> cannot yet see that state and shouldn't have it visible in their
>> snapshots.
>
> Agree, that is problem. That allows to decode this PREPARE, but after that
> it is better to mark this transaction as running in snapshot or perform 
> prepare
> decoding with some kind of copied-and-edited snapshot. I’ll have a look at 
> this.

Thanks.

It's also worth noting that with your current approach, 2PC xacts will
produce two calls to the output plugin's commit() callback, once for
the PREPARE TRANSACTION and another for the COMMIT PREPARED or
ROLLBACK PREPARED, the latter two with a faked-up state. I'm not a
huge fan of that. It's not entirely backward compatible since it
violates the previously safe assumption that there's a 1:1
relationship between begin and commit callbacks with no interleaving,
for one thing, and I think it's also a bit misleading to send a
PREPARE TRANSACTION to a callback that could previously only receive a
true commit.

I particularly dislike calling a commit callback for an abort. So I'd
like to look further into the interface side of things. I'm inclined
to suggest adding new callbacks for 2pc prepare, commit and rollback,
and if the output plugin doesn't set them fall back to the existing
behaviour. Plugins that aren't interested in 2PC (think ETL) should
probably not have to deal with it, we might as well just send them
only the actually committed xacts, when they commit.
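
One way to picture the proposed interface — optional 2PC callbacks with a fallback to the existing single commit callback — is the sketch below. The callback names are invented for illustration; nothing here is the actual output plugin API:

```python
# Sketch of optional 2PC callbacks with fallback to existing behaviour.
# Names like commit_prepared_cb are hypothetical, not PostgreSQL's API.

class OutputPlugin:
    def __init__(self, commit_prepared_cb=None, rollback_prepared_cb=None):
        self.events = []
        self.commit_prepared_cb = commit_prepared_cb
        self.rollback_prepared_cb = rollback_prepared_cb

    def commit(self, gid):                 # existing, always-present callback
        self.events.append(("commit", gid))

def on_commit_prepared(plugin, gid):
    if plugin.commit_prepared_cb is not None:
        plugin.commit_prepared_cb(plugin, gid)   # 2PC-aware plugin
    else:
        plugin.commit(gid)     # 2PC-unaware plugin just sees a plain commit

def on_rollback_prepared(plugin, gid):
    if plugin.rollback_prepared_cb is not None:
        plugin.rollback_prepared_cb(plugin, gid)
    # else: deliver nothing -- the xact never committed for this plugin

# An ETL-style plugin with no 2PC callbacks sees only committed xacts:
etl = OutputPlugin()
on_commit_prepared(etl, "x1")
on_rollback_prepared(etl, "x2")
print(etl.events)                          # -> [('commit', 'x1')]
```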

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-02 Thread Stas Kelvich

> On 2 Mar 2017, at 11:00, Craig Ringer  wrote:
> 
> We already have it, because we just decoded the PREPARE TRANSACTION.
> I'm preparing a patch revision to demonstrate this.

Yes, we already have it, but if the server reboots between commit prepared (all
prepared state is gone) and decoding of this commit prepared then we lose
that mapping, don’t we?

> BTW, I've been reviewing the patch in more detail. Other than a bunch
> of copy-and-paste that I'm cleaning up, the main issue I've found is
> that in DecodePrepare, you call:
> 
>SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
>   parsed->nsubxacts, parsed->subxacts);
> 
> but I am not convinced it is correct to call it at PREPARE TRANSACTION
> time, only at COMMIT PREPARED time. We want to see the 2pc prepared
> xact's state when decoding it, but there might be later commits that
> cannot yet see that state and shouldn't have it visible in their
> snapshots. 

Agreed, that is a problem. That allows us to decode this PREPARE, but after that
it is better to mark this transaction as running in the snapshot, or perform
prepare decoding with some kind of copied-and-edited snapshot. I’ll have a look at this.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






Re: [HACKERS] logical decoding of two-phase transactions

2017-03-02 Thread Craig Ringer
On 2 March 2017 at 16:00, Craig Ringer  wrote:

> What about if we ROLLBACK PREPARED after
> we made the snapshot visible?

Yeah, I'm pretty sure that's going to be a problem actually.

You're telling the snapshot builder that an xact committed at PREPARE
TRANSACTION time.

If we then ROLLBACK PREPARED, we're in a mess. It looks like it'll
cause issues with catalogs, user-catalog tables, etc.

I suspect we need to construct a temporary snapshot to decode PREPARE
TRANSACTION then discard it. If we later COMMIT PREPARED we should
perform the current steps to merge the snapshot state in.
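
The temporary-then-merge idea can be sketched as a toy model (invented names, not the SnapBuild API):

```python
class SnapBuilder:
    """Toy model: track which xids later-built snapshots treat as committed."""

    def __init__(self):
        self.committed = set()

    def snapshot_for_prepare(self, xid):
        # Temporary, throwaway snapshot: the prepared xact sees its own
        # catalog changes while being decoded...
        return frozenset(self.committed | {xid})

    def on_commit_prepared(self, xid):
        # ...but only COMMIT PREPARED merges it into the shared state, so a
        # later ROLLBACK PREPARED leaves everyone else's snapshots untouched.
        self.committed.add(xid)

b = SnapBuilder()
b.committed.add(100)                  # some previously committed xact
snap = b.snapshot_for_prepare(200)    # decode PREPARE of xid 200
assert 200 in snap                    # visible while decoding the prepare
assert 200 not in b.committed         # invisible to everyone else
b.on_commit_prepared(200)
assert 200 in b.committed             # merged only at COMMIT PREPARED
```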

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-02 Thread Craig Ringer
On 2 March 2017 at 15:27, Stas Kelvich  wrote:
>
>> On 2 Mar 2017, at 01:20, Petr Jelinek  wrote:
>>
>> The info gets removed on COMMIT PREPARED but at that point
>> there is no real difference between replicating it as 2pc or 1pc since
>> the 2pc behavior is for all intents and purposes lost at that point.
>>
>
> If we are doing 2pc and COMMIT PREPARED happens then we should
> replicate that without transaction body to the receiving servers since tx
> is already prepared on them with some GID. So we need a way to construct
> that GID.

We already have it, because we just decoded the PREPARE TRANSACTION.
I'm preparing a patch revision to demonstrate this.

BTW, I've been reviewing the patch in more detail. Other than a bunch
of copy-and-paste that I'm cleaning up, the main issue I've found is
that in DecodePrepare, you call:

SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
   parsed->nsubxacts, parsed->subxacts);

but I am not convinced it is correct to call it at PREPARE TRANSACTION
time, only at COMMIT PREPARED time. We want to see the 2pc prepared
xact's state when decoding it, but there might be later commits that
cannot yet see that state and shouldn't have it visible in their
snapshots. Imagine, say

BEGIN;
ALTER TABLE t ADD COLUMN ...
INSERT INTO 't' ...
PREPARE TRANSACTION 'x';

BEGIN;
INSERT INTO t ...;
COMMIT;

COMMIT PREPARED 'x';

We want to see the new column when decoding the prepared xact, but
_not_ when decoding the subsequent xact between the prepare and
commit. This particular case cannot occur because the lock held by
ALTER TABLE blocks the INSERT in the other xact, but how sure are you
that there are no other snapshot issues that could arise if we promote
a snapshot to visible early? What about if we ROLLBACK PREPARED after
we made the snapshot visible?

The tests don't appear to cover logical decoding 2PC sessions that do
DDL at all. I emphasised that that would be one of the main problem
areas when we originally discussed this. I'll look at adding some,
since I think this is one of the areas that's most likely to find
issues.


> It seems that for the last ~10 messages I’m failing to explain some points about this
> topic. Or maybe I’m failing to understand some points. Can we maybe set up a
> Skype call to discuss this and post a summary here? Craig? Petr?

Let me prep an updated patch. Time zones make it rather hard to do
voice; I'm in +0800 Western Australia, Petr is in +0200...

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-01 Thread Stas Kelvich

> On 2 Mar 2017, at 01:20, Petr Jelinek  wrote:
> 
> The info gets removed on COMMIT PREPARED but at that point
> there is no real difference between replicating it as 2pc or 1pc since
> the 2pc behavior is for all intents and purposes lost at that point.
> 

If we are doing 2pc and COMMIT PREPARED happens, then we should
replicate that without the transaction body to the receiving servers, since the tx
is already prepared on them with some GID. So we need a way to construct
that GID.

It seems that for the last ~10 messages I’m failing to explain some points about this
topic. Or maybe I’m failing to understand some points. Can we maybe set up a
Skype call to discuss this and post a summary here? Craig? Petr?

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






Re: [HACKERS] logical decoding of two-phase transactions

2017-03-01 Thread Craig Ringer
On 2 March 2017 at 06:20, Petr Jelinek  wrote:

> If I understand you correctly you are saying that if PREPARE is being
> decoded, we can load the GID from the 2pc info in memory about the
> specific 2pc. The info gets removed on COMMIT PREPARED but at that point
> there is no real difference between replicating it as 2pc or 1pc since
> the 2pc behavior is for all intents and purposes lost at that point.
> Works for me. I guess the hard part is knowing if COMMIT PREPARED
> happened at the time PREPARE is decoded, but the existence of the needed
> info could probably be used for that.

Right.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-01 Thread Petr Jelinek
On 01/03/17 10:24, Craig Ringer wrote:
> On 9 February 2017 at 21:23, Stas Kelvich  wrote:
> 
>>> On 2 Feb 2017, at 00:35, Craig Ringer  wrote:
>>>
>>> Stas was concerned about what happens in logical decoding if we crash 
>>> between PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back 
>>> and decode the whole txn again anyway so it doesn't matter.
>>
>> Not exactly. It seems that in previous discussions we were not on the same 
>> page, probably due to unclear arguments by me.
>>
>> From my point of view there are no problems (or at least no new problems 
>> compared to ordinary 2PC) with preparing transactions on slave servers with 
>> something like “#{xid}#{node_id}” instead of a GID if the issuing node is 
>> the coordinator of that transaction. In case of failure, restart, or crash we have 
>> the same options about deciding what to do with uncommitted transactions.
> 
> But we don't *need* to do that. We have access to the GID of the 2PC
> xact from PREPARE TRANSACTION until COMMIT PREPARED, after which we
> have no need for it. So we can always use the user-supplied GID.
> 
>> I performed some tests to understand real impact on size of WAL. I've 
>> compared postgres -master with wal_level = logical, after 3M 2PC 
>> transactions with patched postgres where GID’s are stored inside commit 
>> record too.
> 
> Why do you do this? You don't need to. You can look the GID up from
> the 2pc status table in memory unless the master already did COMMIT
> PREPARED, in which case you can just decode it as a normal xact as if
> it were never 2pc in the first place.
> 
> I don't think I've managed to make this point by description, so I'll
> try to modify your patch to demonstrate.
> 

If I understand you correctly you are saying that if PREPARE is being
decoded, we can load the GID from the 2pc info in memory about the
specific 2pc. The info gets removed on COMMIT PREPARED but at that point
there is no real difference between replicating it as 2pc or 1pc since
the 2pc behavior is for all intents and purposes lost at that point.
Works for me. I guess the hard part is knowing if COMMIT PREPARED
happened at the time PREPARE is decoded, but the existence of the needed
info could probably be used for that.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-03-01 Thread Craig Ringer
On 9 February 2017 at 21:23, Stas Kelvich  wrote:

>> On 2 Feb 2017, at 00:35, Craig Ringer  wrote:
>>
>> Stas was concerned about what happens in logical decoding if we crash 
>> between PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back 
>> and decode the whole txn again anyway so it doesn't matter.
>
> Not exactly. It seems that in previous discussions we were not on the same 
> page, probably due to unclear arguments by me.
>
> From my point of view there is no problems (or at least new problems 
> comparing to ordinary 2PC) with preparing transactions on slave servers with 
> something like “#{xid}#{node_id}” instead of GID if issuing node is 
> coordinator of that transaction. In case of failure, restart, crash we have 
> the same options about deciding what to do with uncommitted transactions.

But we don't *need* to do that. We have access to the GID of the 2PC
xact from PREPARE TRANSACTION until COMMIT PREPARED, after which we
have no need for it. So we can always use the user-supplied GID.

> I performed some tests to understand real impact on size of WAL. I've 
> compared postgres -master with wal_level = logical, after 3M 2PC transactions 
> with patched postgres where GID’s are stored inside commit record too.

Why do you do this? You don't need to. You can look the GID up from
the 2pc status table in memory unless the master already did COMMIT
PREPARED, in which case you can just decode it as a normal xact as if
it were never 2pc in the first place.

I don't think I've managed to make this point by description, so I'll
try to modify your patch to demonstrate.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-02-09 Thread Stas Kelvich

> On 31 Jan 2017, at 12:22, Craig Ringer  wrote:
> 
> Personally I don't think lack of access to the GID justifies blocking 2PC 
> logical decoding. It can be added separately. But it'd be nice to have 
> especially if it's cheap.

Agreed.

> On 2 Feb 2017, at 00:35, Craig Ringer  wrote:
> 
> Stas was concerned about what happens in logical decoding if we crash between 
> PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back and decode 
> the whole txn again anyway so it doesn't matter.

Not exactly. It seems that in previous discussions we were not on the same 
page, probably due to unclear arguments by me.

From my point of view there are no problems (or at least no new problems 
compared to ordinary 2PC) with preparing transactions on slave servers with 
something like “#{xid}#{node_id}” instead of the GID if the issuing node is 
the coordinator of that transaction. In case of failure, restart, or crash we 
have the same options about deciding what to do with uncommitted transactions.

My concern is about the situation with an external coordinator. That scenario is 
quite important for users of postgres native 2PC, notably J2EE users. Suppose a 
user (or his framework) issues "prepare transaction 'mytxname';" to servers 
with ordinary synchronous physical replication. If the master crashes and a 
replica is promoted, the user can reconnect to it and commit/abort that 
transaction using his GID. And it is unclear to me how to achieve the same 
behaviour with logical replication of 2PC without the GID in the commit record. 
If we prepare with “#{xid}#{node_id}” on acceptor nodes, then if the donor node 
crashes we'll lose the mapping between the user's GID and our internal GID; 
conversely, we can prepare with the user's GID on acceptors, but then we will 
not know that GID on the donor during commit decode (by the time decoding 
happens all memory state is already gone and we can't map our xid back to the 
GID).

I performed some tests to understand the real impact on WAL size. I've compared 
postgres master with wal_level = logical, after 3M 2PC transactions, against 
patched postgres where GIDs are stored inside the commit record too, testing 
with 194-byte and 6-byte GIDs. (The GID max size is 200 bytes.)

-master, 6-byte GID after 3M transactions: pg_current_xlog_location = 0/9572CB28
-patched, 6-byte GID after 3M transactions: pg_current_xlog_location = 0/96C442E0

So with 6-byte GIDs the difference in WAL size is less than 1%.

-master, 194-byte GID after 3M transactions: pg_current_xlog_location = 0/B7501578
-patched, 194-byte GID after 3M transactions: pg_current_xlog_location = 0/D8B43E28

And with 194-byte GIDs the difference in WAL size is about 18%.

So using big GIDs (as J2EE does) can cause notable WAL bloat, while small 
GIDs are almost unnoticeable.

Maybe we can introduce a configuration option track_commit_gid, by analogy with 
track_commit_timestamp, and make that behaviour optional? Any objections to that?

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






Re: [HACKERS] logical decoding of two-phase transactions

2017-02-03 Thread Konstantin Knizhnik
On 02/04/2017 03:08 AM, Andres Freund wrote:
> On 2017-02-03 18:47:23 -0500, Robert Haas wrote:
>> On Fri, Feb 3, 2017 at 6:00 PM, Andres Freund  wrote:
>>> I still haven't seen a credible model for being able to apply a stream
>>> of interleaved transactions that can roll back individually; I think we
>>> really need the ability to have multiple transactions alive in one
>>> backend for that.
>> Hmm, yeah, that's a problem.  That smells like autonomous transactions.
> Unfortunately the last few proposals, like spawning backends, to deal
> with autonomous xacts aren't really suitable for replication, unless you
> only have very large ones.  And it really needs to be an implementation
> where ATs can freely be switched inbetween.  On the other hand, a good
> deal of problems (like locking) shouldn't be an issue, since there's
> obviously a possible execution schedule.
>
> I suspect this'd need some low-level implemention close to xact.c that'd
> allow switching between transactions.

Let me add my two coins here:

1. We are using logical decoding in our multimaster and applying transactions 
concurrently by a pool of workers. Unlike asynchronous replication, in 
multimaster we need to perform voting for each transaction commit, so if 
transactions are applied by a single worker, then performance will be awful 
and, moreover, there is a big chance of a "deadlock" in which none of the 
workers can complete voting because different nodes are voting for different 
transactions.

I could not say that there are no problems with this approach. There are 
definitely a lot of challenges. First of all, we need to use a special DTM 
(distributed transaction manager) to provide consistent applying of 
transactions at different nodes. The second problem is once again related to 
the kind of "deadlock" explained above. Even if we apply transactions 
concurrently, it is still possible to get such a deadlock if we do not have 
enough workers. This is why we allow launching extra workers dynamically 
(though ultimately this is limited by the maximum number of configured 
bgworkers).

But in any case, I think that "parallel apply" is a "must have" mode for 
logical replication.

2. We have implemented autonomous transactions in PgPro EE. Unlike the proposal 
currently present at the commitfest, we execute an autonomous transaction 
within the same backend, so we are just storing and restoring the transaction 
context. Unfortunately that is also not a cheap operation. An autonomous 
transaction should not see any changes done by the parent transaction (because 
the parent can be rolled back after commit of the autonomous transaction). But 
there are catalog and relation caches inside the backend, so we have to clean 
these caches before switching to the ATX. That is quite an expensive operation, 
and so execution of a PL/pgSQL function with an autonomous transaction is 
several orders of magnitude slower than without one. So autonomous transactions 
can be used for audit (the primary goal of using ATX in Oracle PL/SQL 
applications), but this mechanism is not efficient for concurrent execution of 
multiple transactions in one backend.


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] logical decoding of two-phase transactions

2017-02-03 Thread Andres Freund
On 2017-02-03 19:09:43 -0500, Robert Haas wrote:
> On Fri, Feb 3, 2017 at 7:08 PM, Andres Freund  wrote:
> > On 2017-02-03 18:47:23 -0500, Robert Haas wrote:
> >> On Fri, Feb 3, 2017 at 6:00 PM, Andres Freund  wrote:
> >> > I still haven't seen a credible model for being able to apply a stream
> >> > of interleaved transactions that can roll back individually; I think we
> >> > really need the ability to have multiple transactions alive in one
> >> > backend for that.
> >>
> >> Hmm, yeah, that's a problem.  That smells like autonomous transactions.
> >
> > Unfortunately the last few proposals, like spawning backends, to deal
> > with autonomous xacts aren't really suitable for replication, unless you
> > only have very large ones.  And it really needs to be an implementation
> > where ATs can freely be switched inbetween.  On the other hand, a good
> > deal of problems (like locking) shouldn't be an issue, since there's
> > obviously a possible execution schedule.
> >
> > I suspect this'd need some low-level implemention close to xact.c that'd
> > allow switching between transactions.
> 
> Yeah.  Well, I still feel like that's also how autonomous transactions
> oughta work, but I realize that's not a unanimous viewpoint.  :-)

Same here ;)




Re: [HACKERS] logical decoding of two-phase transactions

2017-02-03 Thread Robert Haas
On Fri, Feb 3, 2017 at 7:08 PM, Andres Freund  wrote:
> On 2017-02-03 18:47:23 -0500, Robert Haas wrote:
>> On Fri, Feb 3, 2017 at 6:00 PM, Andres Freund  wrote:
>> > I still haven't seen a credible model for being able to apply a stream
>> > of interleaved transactions that can roll back individually; I think we
>> > really need the ability to have multiple transactions alive in one
>> > backend for that.
>>
>> Hmm, yeah, that's a problem.  That smells like autonomous transactions.
>
> Unfortunately the last few proposals, like spawning backends, to deal
> with autonomous xacts aren't really suitable for replication, unless you
> only have very large ones.  And it really needs to be an implementation
> where ATs can freely be switched inbetween.  On the other hand, a good
> deal of problems (like locking) shouldn't be an issue, since there's
> obviously a possible execution schedule.
>
> I suspect this'd need some low-level implemention close to xact.c that'd
> allow switching between transactions.

Yeah.  Well, I still feel like that's also how autonomous transactions
oughta work, but I realize that's not a unanimous viewpoint.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical decoding of two-phase transactions

2017-02-03 Thread Andres Freund
On 2017-02-03 18:47:23 -0500, Robert Haas wrote:
> On Fri, Feb 3, 2017 at 6:00 PM, Andres Freund  wrote:
> > I still haven't seen a credible model for being able to apply a stream
> > of interleaved transactions that can roll back individually; I think we
> > really need the ability to have multiple transactions alive in one
> > backend for that.
> 
> Hmm, yeah, that's a problem.  That smells like autonomous transactions.

Unfortunately the last few proposals, like spawning backends, to deal
with autonomous xacts aren't really suitable for replication, unless you
only have very large ones.  And it really needs to be an implementation
where ATs can freely be switched inbetween.  On the other hand, a good
deal of problems (like locking) shouldn't be an issue, since there's
obviously a possible execution schedule.

I suspect this'd need some low-level implemention close to xact.c that'd
allow switching between transactions.

- Andres




Re: [HACKERS] logical decoding of two-phase transactions

2017-02-03 Thread Robert Haas
On Fri, Feb 3, 2017 at 6:00 PM, Andres Freund  wrote:
> I still haven't seen a credible model for being able to apply a stream
> of interleaved transactions that can roll back individually; I think we
> really need the ability to have multiple transactions alive in one
> backend for that.

Hmm, yeah, that's a problem.  That smells like autonomous transactions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical decoding of two-phase transactions

2017-02-03 Thread Andres Freund
On 2017-02-03 17:47:50 -0500, Robert Haas wrote:
> On Thu, Feb 2, 2017 at 7:14 PM, Craig Ringer  wrote:
> > We could make reorder buffers persistent and shared between decoding
> > sessions but it'd totally change the logical decoding model and create
> > some other problems. It's certainly not a topic for this patch. So we
> > can take it as given that we'll always restart decoding from BEGIN
> > again at a crash.

Sharing them seems unlikely (filtering and such would become a lot more
complicated) and separate from persistency.  I'm not sure however how
it'd "totally change the logical decoding model"?

Even if we'd not always restart decoding, we'd still have the option to
add the information necessary to the spill files, so I'm unclear how
persistency plays a role here?


> OK, thanks for the explanation.  I have never liked this design very
> much, and told Andres so: big transactions are bound to cause
> noticeable replication lag.  But you're certainly right that it's not
> a topic for this patch.

Streaming and persistency of spill files are different topics, no?
Either would have initially complicated things beyond the point of
getting things into core - I'm all for adding them at some point.

Persistent spill files (which'd also require spilling of small transactions at
regular intervals) also have the issue that they make the spill format
something that can't be adapted in bugfixes etc., and that we need to
fsync them.

I still haven't seen a credible model for being able to apply a stream
of interleaved transactions that can roll back individually; I think we
really need the ability to have multiple transactions alive in one
backend for that.

Andres




Re: [HACKERS] logical decoding of two-phase transactions

2017-02-03 Thread Robert Haas
On Thu, Feb 2, 2017 at 7:14 PM, Craig Ringer  wrote:
> We could make reorder buffers persistent and shared between decoding
> sessions but it'd totally change the logical decoding model and create
> some other problems. It's certainly not a topic for this patch. So we
> can take it as given that we'll always restart decoding from BEGIN
> again at a crash.

OK, thanks for the explanation.  I have never liked this design very
much, and told Andres so: big transactions are bound to cause
noticeable replication lag.  But you're certainly right that it's not
a topic for this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical decoding of two-phase transactions

2017-02-02 Thread Craig Ringer
On 3 February 2017 at 03:34, Robert Haas  wrote:
> On Wed, Feb 1, 2017 at 4:35 PM, Craig Ringer  wrote:
>> Right. Per my comments upthread I don't see why we need to add anything more
>> to WAL here.
>>
>> Stas was concerned about what happens in logical decoding if we crash
>> between PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back
>> and decode the whole txn again anyway so it doesn't matter.
>>
>> We can just track it on ReorderBufferTxn when we see it at PREPARE
>> TRANSACTION time.
>
> Oh, hmm.  I guess if that's how it works then we don't need it in WAL
> after all.  I'm not sure that re-decoding the already-prepared
> transaction is a very good plan, but if that's what we're doing anyway
> this patch probably shouldn't change it.

We don't have much choice at the moment.

Logical decoding must restart from the xl_running_xacts most recently
prior to the xid allocation for the oldest xact the client hasn't
confirmed receipt of decoded data + commit for. That's because reorder
buffers are not persistent; if a decoding session crashes we throw
away accumulated reorder buffers, both those in memory and those
spilled to disk. We have to re-create them by restarting decoding from
the beginning of the oldest xact of interest.

We could make reorder buffers persistent and shared between decoding
sessions but it'd totally change the logical decoding model and create
some other problems. It's certainly not a topic for this patch. So we
can take it as given that we'll always restart decoding from BEGIN
again at a crash.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-02-02 Thread Robert Haas
On Wed, Feb 1, 2017 at 4:35 PM, Craig Ringer  wrote:
> Right. Per my comments upthread I don't see why we need to add anything more
> to WAL here.
>
> Stas was concerned about what happens in logical decoding if we crash
> between PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back
> and decode the whole txn again anyway so it doesn't matter.
>
> We can just track it on ReorderBufferTxn when we see it at PREPARE
> TRANSACTION time.

Oh, hmm.  I guess if that's how it works then we don't need it in WAL
after all.  I'm not sure that re-decoding the already-prepared
transaction is a very good plan, but if that's what we're doing anyway
this patch probably shouldn't change it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical decoding of two-phase transactions

2017-02-02 Thread Robert Haas
On Wed, Feb 1, 2017 at 2:32 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> Also, including the GID in the WAL for each COMMIT/ABORT PREPARED
>> doesn't seem inordinately expensive to me.
>
> I'm confused ... isn't it there already?  If not, how do we handle
> reconstructing 2PC state from WAL at all?

By XID.  See xl_xact_twophase, which gets included in xl_xact_commit
or xl_xact_abort.  The GID has got to be there in the XL_XACT_PREPARE
record, but not when actually committing/rolling back.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical decoding of two-phase transactions

2017-02-01 Thread Craig Ringer
On 2 Feb. 2017 08:32, "Tom Lane"  wrote:

Robert Haas  writes:
> Also, including the GID in the WAL for each COMMIT/ABORT PREPARED
> doesn't seem inordinately expensive to me.

I'm confused ... isn't it there already?  If not, how do we handle
reconstructing 2PC state from WAL at all?


Right. Per my comments upthread I don't see why we need to add anything
more to WAL here.

Stas was concerned about what happens in logical decoding if we crash
between PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back
and decode the whole txn again anyway so it doesn't matter.

We can just track it on ReorderBufferTxn when we see it at PREPARE
TRANSACTION time.


Re: [HACKERS] logical decoding of two-phase transactions

2017-02-01 Thread Konstantin Knizhnik
On 02/01/2017 10:32 PM, Tom Lane wrote:
> Robert Haas  writes:
>> Also, including the GID in the WAL for each COMMIT/ABORT PREPARED
>> doesn't seem inordinately expensive to me.
> I'm confused ... isn't it there already?  If not, how do we handle
> reconstructing 2PC state from WAL at all?
>
>   regards, tom lane
>
>
Right now logical decoding ignores prepare and takes into account only "commit 
prepared":

/*
 * Currently decoding ignores PREPARE TRANSACTION and will just
 * decode the transaction when the COMMIT PREPARED is sent or
 * throw away the transaction's contents when a ROLLBACK PREPARED
 * is received. In the future we could add code to expose prepared
 * transactions in the changestream allowing for a kind of
 * distributed 2PC.
 */

For some scenarios it works well, but if we really need the prepared state at 
the replica (as in the case of multimaster), then it is not enough.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] logical decoding of two-phase transactions

2017-02-01 Thread Tom Lane
Robert Haas  writes:
> Also, including the GID in the WAL for each COMMIT/ABORT PREPARED
> doesn't seem inordinately expensive to me.

I'm confused ... isn't it there already?  If not, how do we handle
reconstructing 2PC state from WAL at all?

regards, tom lane




Re: [HACKERS] logical decoding of two-phase transactions

2017-02-01 Thread Robert Haas
On Tue, Jan 31, 2017 at 9:05 PM, Michael Paquier
 wrote:
>> Personally I don't think lack of access to the GID justifies blocking 2PC
>> logical decoding. It can be added separately. But it'd be nice to have
>> especially if it's cheap.
>
> I think it should be added reading this thread.

+1.  If on the logical replication master the user executes PREPARE
TRANSACTION 'mumble', isn't it sensible to want the logical replica to
prepare the same set of changes with the same GID?  To me, that not
only seems like *a* sensible thing to want to do but probably the
*most* sensible thing to want to do.  And then, when the eventual
COMMIT PREPARED 'mumble' comes along, you want to have the replica
run the same command.  If you don't do that, then the alternative is
that the replica has to make up new names based on the master's XID.
But that kinda sucks, because now if replication stops due to a
conflict or whatever and you have to disentangle things by hand, all
the names on the replica are basically meaningless.

Also, including the GID in the WAL for each COMMIT/ABORT PREPARED
doesn't seem inordinately expensive to me.  For that to really add up
to a significant cost, wouldn't you need to be doing LOTS of 2PC
transactions, each with very little work, so that the commit/abort
prepared records weren't swamped by everything else?  That seems like
an unlikely scenario, but if it does happen, that's exactly when
you'll be most grateful for the GID tracking.  I think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical decoding of two-phase transactions

2017-01-31 Thread Michael Paquier
On Tue, Jan 31, 2017 at 6:22 PM, Craig Ringer  wrote:
> That's where you've misunderstood - it isn't committed yet. The point of
> this change is to allow us to do logical decoding at the PREPARE TRANSACTION
> point. The xact is not yet committed or rolled back.

Yes, I got that. I was looking for a why or an actual use-case.

> Stas wants this for a conflict-free logical semi-synchronous replication
> multi master solution.

This sentence is hard to decipher, at least the "multi master" part, as the 
concept applies basically to only one master node.

> At PREPARE TRANSACTION time we replay the xact to
> other nodes, each of which applies it and PREPARE TRANSACTION, then replies
> to confirm it has successfully prepared the xact. When all nodes confirm the
> xact is prepared it is safe for the origin node to COMMIT PREPARED. The
> other nodes then see that the first node has committed and they commit too.

OK, this is the argument I was looking for. So in your scheme the
origin node, the one generating the changes, is itself in charge of
deciding if the 2PC should work or not. There are two channels between
the origin node and the replicas replaying the logical changes, one is
for the logical decoder with a receiver, the second one is used to
communicate the WAL apply status. I thought about something like
postgres_fdw doing this job with a transaction that does writes across
several nodes, that's why I got confused about this feature.
Everything goes through one channel, so the failure handling is just
simplified.

> Alternately if any node replies "could not replay xact" or "could not
> prepare xact" the origin node knows to ROLLBACK PREPARED. All the other
> nodes see that and rollback too.

The origin node could just issue the ROLLBACK or COMMIT and the
logical replicas would just apply this change.

> To really make it rock solid you also have to send the old and new values of
> a row, or have row versions, or send old row hashes. Something I also want
> to have, but we can mostly get that already with REPLICA IDENTITY FULL.

On a primary key (or a unique index), the default replica identity is
enough I think.

> It is of interest to me because schema changes in MM logical replication are
> more challenging, awkward, and restrictive without it. Optimistic conflict
> resolution doesn't work well for schema changes and once the conflicting
> schema changes are committed on different nodes there is no going back. So
> you need your async system to have a global locking model for schema changes
> to stop conflicts arising. Or expect the user not to do anything silly /
> misunderstand anything and know all the relevant system limitations and
> requirements... which we all know works just great in practice. You also
> need a way to ensure that schema changes don't render
> committed-but-not-yet-replayed row changes from other peers nonsensical. The
> safest way is a barrier where all row changes committed on any node before
> committing the schema change on the origin node must be fully replayed on
> every other node, making an async MM system temporarily sync single master
> (and requiring all nodes to be up and reachable). Otherwise you need a way
> to figure out how to conflict-resolve incoming rows with missing columns /
> added columns / changed types / renamed tables etc. which is no fun and
> nearly impossible in the general case.

That's one vision of things; FDW-like approaches would be a second,
but those are not able to pass down utility statements natively,
though this stuff can be done with the utility hook.

> I think the purpose of having the GID available to the decoding output
> plugin at PREPARE TRANSACTION time is that it can co-operate with a global
> transaction manager that way. Each node can tell the GTM "I'm ready to
> commit [X]". It is IMO not crucial since you can otherwise use a (node-id,
> xid) tuple, but it'd be nice for coordinating with external systems,
> simplifying inter node chatter, integrating logical deocding into bigger
> systems with external transaction coordinators/arbitrators etc. It seems
> pretty silly _not_ to have it really.

Well, Postgres-XC/XL save the 2PC GID for this purpose in the GTM; this way 
COMMIT/ABORT PREPARED can be issued from any node, and there is centralized 
conflict resolution, the latter being done at huge cost, causing a major 
bottleneck in scaling performance.

> Personally I don't think lack of access to the GID justifies blocking 2PC
> logical decoding. It can be added separately. But it'd be nice to have
> especially if it's cheap.

I think it should be added reading this thread.
-- 
Michael




Re: [HACKERS] logical decoding of two-phase transactions

2017-01-31 Thread Craig Ringer
On 31 Jan. 2017 22:43, "Konstantin Knizhnik" 
wrote:



On 31.01.2017 09:29, Michael Paquier wrote:

> On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer 
> wrote:
>
>> Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
>> wal_level >= logical I don't think that's the end of the world. But
>> since we already have almost everything we need in memory, why not
>> just stash the gid on ReorderBufferTXN?
>>
> I have been through this thread... And to be honest, I have a hard
> time understanding for which purpose the information of a 2PC
> transaction is useful in the case of logical decoding. The prepare and
> commit prepared have been received by a node which is at the root of
> the cluster tree, a node of the cluster at an upper level, or a
> client, being in charge of issuing all the prepare queries, and then
> issue the commit prepared to finish the transaction across a cluster.
> In short, even if you do logical decoding from the root node, or the
> one at a higher level, you would care just about the fact that it has
> been committed.
>

 in any state. So there are three records in the WAL: PREPARE, PRECOMMIT,
COMMIT_PREPARED, and
recovery can involve either all of them, or PRECOMMIT+COMMIT_PREPARED,
or just COMMIT_PREPARED.


That's your modified Pg though.

This 2pc logical decoding patch proposal is for core and I think it just
confused things to introduce discussion of unrelated changes made by your
product to the codebase.





-- 
Konstantin Knizhnik

Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] logical decoding of two-phase transactions

2017-01-31 Thread Konstantin Knizhnik



On 31.01.2017 09:29, Michael Paquier wrote:

On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer  wrote:

Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
wal_level >= logical I don't think that's the end of the world. But
since we already have almost everything we need in memory, why not
just stash the gid on ReorderBufferTXN?

I have been through this thread... And to be honest, I have a hard
time understanding for which purpose the information of a 2PC
transaction is useful in the case of logical decoding. The prepare and
commit prepared have been received by a node which is at the root of
the cluster tree, a node of the cluster at an upper level, or a
client, being in charge of issuing all the prepare queries, and then
issue the commit prepared to finish the transaction across a cluster.
In short, even if you do logical decoding from the root node, or the
one at a higher level, you would care just about the fact that it has
been committed.

Sorry, maybe I do not completely understand your arguments.
Actually our multimaster is now completely based on logical replication 
and 2PC (more precisely, we are using 3PC now :).
The state of a transaction (prepared, precommitted, committed) should be 
persisted in WAL to make it possible to perform recovery.
Recovery can involve transactions in any state. So there are three records 
in the WAL: PREPARE, PRECOMMIT, COMMIT_PREPARED, and recovery can involve 
either all of them, or PRECOMMIT+COMMIT_PREPARED, or just COMMIT_PREPARED.




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] logical decoding of two-phase transactions

2017-01-31 Thread Craig Ringer
On 31 Jan. 2017 19:29, "Michael Paquier"  wrote:

On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer  wrote:
> Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
> wal_level >= logical I don't think that's the end of the world. But
> since we already have almost everything we need in memory, why not
> just stash the gid on ReorderBufferTXN?

I have been through this thread... And to be honest, I have a hard
time understanding for which purpose the information of a 2PC
transaction is useful in the case of logical decoding.


TL;DR: this lets us decode the xact after prepare but before commit so
decoding/replay outcomes can affect the commit-or-abort decision.


The prepare and
commit prepared have been received by a node which is at the root of
the cluster tree, a node of the cluster at an upper level, or a
client, being in charge of issuing all the prepare queries, and then
issue the commit prepared to finish the transaction across a cluster.
In short, even if you do logical decoding from the root node, or the
one at a higher level, you would care just about the fact that it has
been committed.


That's where you've misunderstood - it isn't committed yet. The point of
this change is to allow us to do logical decoding at the PREPARE
TRANSACTION point. The xact is not yet committed or rolled back.

This allows the results of logical decoding - or more interestingly results
of replay on another node / to another app / whatever to influence the
commit or rollback decision.

Stas wants this for a conflict-free logical semi-synchronous replication
multi master solution. At PREPARE TRANSACTION time we replay the xact to
other nodes, each of which applies it and PREPARE TRANSACTION, then replies
to confirm it has successfully prepared the xact. When all nodes confirm
the xact is prepared it is safe for the origin node to COMMIT PREPARED. The
other nodes then see that the first node has committed and they commit too.

Alternately if any node replies "could not replay xact" or "could not
prepare xact" the origin node knows to ROLLBACK PREPARED. All the other
nodes see that and rollback too.
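The commit-or-abort decision described above can be sketched as a small decision function. This is a hypothetical simulation in Python; none of these names are PostgreSQL APIs, and the real protocol lives in the replication manager, not in a single function like this.

```python
# Sketch of the origin node's decision in the semi-synchronous 2PC
# protocol described above: COMMIT PREPARED only if every peer node
# confirmed a successful PREPARE TRANSACTION, otherwise ROLLBACK
# PREPARED everywhere. Names are illustrative, not PostgreSQL code.

def decide_outcome(peer_replies):
    """peer_replies maps node name -> 'prepared' or 'failed'."""
    if all(reply == "prepared" for reply in peer_replies.values()):
        return "COMMIT PREPARED"    # all peers prepared: safe to commit
    return "ROLLBACK PREPARED"      # any failure: abort on all nodes
```

The other nodes simply follow whichever outcome the origin records, which is what keeps all nodes identical without after-the-fact conflict resolution.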

This makes it possible to be much more confident that what's replicated is
exactly the same on all nodes, with no after-the-fact MM conflict
resolution that apps must be aware of to function correctly.

To really make it rock solid you also have to send the old and new values
of a row, or have row versions, or send old row hashes. Something I also
want to have, but we can mostly get that already with REPLICA IDENTITY FULL.

It is of interest to me because schema changes in MM logical replication
are more challenging, awkward and restrictive without it. Optimistic
conflict resolution doesn't work well for schema changes, and once the
conflicting schema changes are committed on different nodes there is no
going back. So you need your async system to have a global locking model
for schema changes to stop conflicts arising. Or expect the user not to do
anything silly / misunderstand anything and know all the relevant system
limitations and requirements... which we all know works just great in
practice. You also need a way to ensure that schema changes don't render
committed-but-not-yet-replayed row changes from other peers nonsensical.
The safest way is a barrier where all row changes committed on any node
before committing the schema change on the origin node must be fully
replayed on every other node, making an async MM system temporarily sync
single master (and requiring all nodes to be up and reachable). Otherwise
you need a way to figure out how to conflict-resolve incoming rows with
missing columns / added columns / changed types / renamed tables etc., which
is no fun and nearly impossible in the general case.

2PC decoding lets us avoid all this mess by sending all nodes the proposed
schema change and waiting until they all confirm successful prepare before
committing it. It can also be used to solve the row compatibility problems
with some more lazy inter-node chat in logical WAL messages.

I think the purpose of having the GID available to the decoding output
plugin at PREPARE TRANSACTION time is that it can co-operate with a global
transaction manager that way. Each node can tell the GTM "I'm ready to
commit [X]". It is IMO not crucial since you can otherwise use a (node-id,
xid) tuple, but it'd be nice for coordinating with external systems,
simplifying inter-node chatter, integrating logical decoding into bigger
systems with external transaction coordinators/arbitrators etc. It seems
pretty silly _not_ to have it really.

Personally I don't think lack of access to the GID justifies blocking 2PC
logical decoding. It can be added separately. But it'd be nice to have
especially if it's cheap.


Re: [HACKERS] logical decoding of two-phase transactions

2017-01-30 Thread Michael Paquier
On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer  wrote:
> Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
> wal_level >= logical I don't think that's the end of the world. But
> since we already have almost everything we need in memory, why not
> just stash the gid on ReorderBufferTXN?

I have been through this thread... And to be honest, I have a hard
time understanding for which purpose the information of a 2PC
transaction is useful in the case of logical decoding. The prepare and
commit prepared have been received by a node which is at the root of
the cluster tree, a node of the cluster at an upper level, or a
client, being in charge of issuing all the prepare queries, and then
issue the commit prepared to finish the transaction across a cluster.
In short, even if you do logical decoding from the root node, or the
one at a higher level, you would care just about the fact that it has
been committed.
-- 
Michael




Re: [HACKERS] logical decoding of two-phase transactions

2017-01-30 Thread Michael Paquier
On Tue, Jan 31, 2017 at 3:29 PM, Michael Paquier
 wrote:
> On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer  wrote:
>> Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
>> wal_level >= logical I don't think that's the end of the world. But
>> since we already have almost everything we need in memory, why not
>> just stash the gid on ReorderBufferTXN?
>
> I have been through this thread... And to be honest, I have a hard
> time understanding for which purpose the information of a 2PC
> transaction is useful in the case of logical decoding. The prepare and
> commit prepared have been received by a node which is at the root of
> the cluster tree, a node of the cluster at an upper level, or a
> client, being in charge of issuing all the prepare queries, and then
> issue the commit prepared to finish the transaction across a cluster.
> In short, even if you do logical decoding from the root node, or the
> one at a higher level, you would care just about the fact that it has
> been committed.

By the way, I have moved this patch to next CF, you guys seem to make
the discussion move on.
-- 
Michael




Re: [HACKERS] logical decoding of two-phase transactions

2017-01-26 Thread Craig Ringer
On 26 January 2017 at 19:34, Stas Kelvich  wrote:

> Imagine following scenario:
>
> 1. PREPARE happend
> 2. PREPARE decoded and sent where it should be sent
> 3. We got all responses from participating nodes and issuing COMMIT/ABORT
> 4. COMMIT/ABORT decoded and sent
>
> After step 3 there is no more memory state associated with that prepared tx,
> so if we fail between 3 and 4 then we can’t know the GID unless we wrote it
> to the commit record (or a table).


If the decoding session crashes/disconnects and restarts between 3 and
4, we know the xact is now committed or rolled back and we don't care
about its gid anymore, we can decode it as a normal committed xact or
skip over it if aborted. If Pg crashes between 3 and 4 the same
applies, since all decoding sessions must restart.

No decoding session can ever start up between 3 and 4 without passing
through 1 and 2, since we always restart decoding at restart_lsn and
restart_lsn cannot be advanced past the assignment (BEGIN) of a given
xid until we pass its commit record and the downstream confirms it has
flushed the results.

The reorder buffer doesn't even really need to keep track of the gid
between 3 and 4, though it should do to save the output plugin and
downstream the hassle of keeping an xid to gid mapping. All it needs
is to know if we sent a given xact's data to the output plugin at
PREPARE time, so we can suppress sending them again at COMMIT time,
and we can store that info on the ReorderBufferTxn. We can store the
gid there too.

We'll need two new output plugin callbacks

   prepare_cb
   rollback_cb

since an xact can roll back after we decode PREPARE TRANSACTION (or
during it, even) and we have to be able to tell the downstream to
throw the data away.

I don't think the rollback callback should be called
abort_prepared_cb, because we'll later want to add the ability to
decode interleaved xacts' changes as they are made, before commit, and
in that case will also need to know if they abort. We won't care if
they were prepared xacts or not, but we'll know based on the
ReorderBufferTXN anyway.

We don't need a separate commit_prepared_cb, the existing commit_cb is
sufficient. The gid will be accessible on the ReorderBufferTXN.
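The bookkeeping sketched above, remembering on the ReorderBufferTXN whether a txn's changes were already streamed at PREPARE and dispatching to prepare_cb / rollback_cb / commit_cb accordingly, might look roughly like this. A Python simulation with invented names, not the real reorder buffer API.

```python
# Simulation of the dispatch described above: a txn whose changes were
# streamed at PREPARE must not have them re-sent at COMMIT PREPARED,
# and a later rollback must tell the downstream to discard them.
# All names here are invented for illustration.

class Txn:
    def __init__(self, gid):
        self.gid = gid
        self.sent_at_prepare = False

def on_prepare(txn, plugin_calls):
    plugin_calls.append(("prepare_cb", txn.gid))  # stream changes pre-commit
    txn.sent_at_prepare = True

def on_resolve(txn, committed, plugin_calls):
    if not committed:
        plugin_calls.append(("rollback_cb", txn.gid))  # discard downstream
    else:
        # the existing commit_cb is enough; the boolean flag suppresses
        # re-sending data that already went out at PREPARE time
        plugin_calls.append(("commit_cb", txn.gid, not txn.sent_at_prepare))
```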

Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
wal_level >= logical I don't think that's the end of the world. But
since we already have almost everything we need in memory, why not
just stash the gid on ReorderBufferTXN?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-01-26 Thread Stas Kelvich

> On 26 Jan 2017, at 12:51, Craig Ringer  wrote:
> 
> * Tracking xid/gid map in memory also doesn’t help much — if server reboots 
> between prepare
> and commit we’ll lose that mapping.
> 
> Er what? That's why I suggested using the prepared xacts shmem state. It's 
> persistent as you know from your work on prepared transaction files. It has 
> all the required info.

Imagine following scenario:

1. PREPARE happend
2. PREPARE decoded and sent where it should be sent
3. We got all responses from participating nodes and issuing COMMIT/ABORT
4. COMMIT/ABORT decoded and sent

After step 3 there is no more memory state associated with that prepared tx,
so if we fail between 3 and 4 then we can’t know the GID unless we wrote it
to the commit record (or a table).

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






Re: [HACKERS] logical decoding of two-phase transactions

2017-01-26 Thread Craig Ringer
On 26 Jan. 2017 18:43, "Stas Kelvich"  wrote:

>>
>> Yes, that’s also possible but seems to be less flexible restricting us
to some
>> specific GID format.
>>
>> Anyway, I can measure WAL space overhead introduced by the GID’s inside
commit records
>> to know exactly what will be the cost of such approach.
>
> I think the approach of storing just the xid and fetching the GID
> during logical decoding of the PREPARE TRANSACTION is probably the
> best way forward, per my prior mail.

I don’t think that’s possible this way. If we don’t put the GID in the commit
record, then by the time logical decoding happens the transaction will already
be committed/aborted and there will be no easy way to get that GID.


My thinking is that if the 2PC xact is by that point COMMIT PREPARED or
ROLLBACK PREPARED we don't care that it was ever 2pc and should just decode
it as a normal xact. Its gid has ceased to be significant and no longer
holds meaning since the xact is resolved.

The point of logical decoding of 2pc is to allow peers to participate in a
decision on whether to commit or not. Rather than only being able to decode
the xact once committed as is currently the case.

If it's already committed there's no point treating it as anything special.

So when we get to the prepare transaction in xlog we look to see if it's
already committed / rolled back. If so we proceed normally like current
decoding does. Only if it's still prepared do we decode it as 2pc and
supply the gid to a new output plugin callback for prepared xacts.
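A minimal sketch of that decode-time decision, with hypothetical helper names (the real check would consult the prepared-transaction state, e.g. TwoPhaseState):

```python
# When decoding reaches a PREPARE TRANSACTION record, decode via the
# new 2PC path only if the xact is still prepared; if it has since
# been resolved, treat it like any other xact. Hypothetical names only.

def choose_decode_path(xid, gid, still_prepared):
    if still_prepared(xid):
        return ("two_phase", gid)   # invoke prepare_cb with the gid
    return ("normal", None)         # gid no longer meaningful
```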

I thought about several possibilities:

* Tracking xid/gid map in memory also doesn’t help much — if server reboots
between prepare
and commit we’ll lose that mapping.


Er what? That's why I suggested using the prepared xacts shmem state. It's
persistent as you know from your work on prepared transaction files. It has
all the required info.


Re: [HACKERS] logical decoding of two-phase transactions

2017-01-25 Thread Stas Kelvich
>> 
>> Yes, that’s also possible but seems to be less flexible restricting us to 
>> some
>> specific GID format.
>> 
>> Anyway, I can measure WAL space overhead introduced by the GID’s inside 
>> commit records
>> to know exactly what will be the cost of such approach.
> 
> Stas,
> 
> Have you had a chance to look at this further?

Generally I’m okay with Simon’s approach and will send an updated patch. Anyway, 
I want to perform some tests to estimate how much disk space is actually wasted 
by the extra WAL records.
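As a rough back-of-the-envelope for that measurement, assuming the GID is written NUL-terminated and padded to MAXALIGN (8 bytes on most platforms), the per-commit-record overhead would be roughly:

```python
# Rough per-record WAL overhead of storing a GID in a commit record,
# assuming a NUL-terminated string padded to MAXALIGN. This is an
# illustrative estimate, not the actual xlog record layout.

def gid_wal_overhead(gid, maxalign=8):
    raw = len(gid) + 1                       # bytes incl. the NUL
    return -(-raw // maxalign) * maxalign    # round up to maxalign
```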

> I think the approach of storing just the xid and fetching the GID
> during logical decoding of the PREPARE TRANSACTION is probably the
> best way forward, per my prior mail.

I don’t think that’s possible in this way. If we will not put GID in commit 
record, than by the time
when logical decoding will happened transaction will be already 
committed/aborted and there will
be no easy way to get that GID. I thought about several possibilities:

* Tracking xid/gid map in memory also doesn’t help much — if server reboots 
between prepare 
and commit we’ll lose that mapping. 
* We can provide some hooks on prepared tx recovery during startup, but that 
approach also fails
if reboot happened between commit and decoding of that commit.
* Logical messages are WAL-logged, but they don’t have any redo function so 
they don’t help much.

So to support user-accessible 2PC over replication based on 2PC decoding we 
would have to invent something nastier, like writing them into a table.

> That should eliminate Simon's
> objection re the cost of tracking GIDs and still let us have access to
> them when we want them, which is the best of both worlds really.

Having 2PC decoding in core is a good thing anyway even without GID tracking =)

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






Re: [HACKERS] logical decoding of two-phase transactions

2017-01-25 Thread Craig Ringer
On 5 January 2017 at 20:43, Stas Kelvich  wrote:
>
>> On 5 Jan 2017, at 13:49, Simon Riggs  wrote:
>>
>> Surely in this case the master server is acting as the Transaction
>> Manager, and it knows the mapping, so we are good?
>>
>> I guess if you are using >2 nodes then you need to use full 2PC on each node.
>>
>> Please explain precisely how you expect to use this, to check that GID
>> is required.
>>
>
> For example if we are using logical replication just for failover/HA and 
> allowing user
> to be transaction manager itself. Then suppose that user prepared tx on 
> server A and server A
> crashed. After that client may want to reconnect to server B and commit/abort 
> that tx.
> But user only have GID that was used during prepare.
>
>> But even then, if you adopt the naming convention that all in-progress
>> xacts will be called RepOriginId-EPOCH-XID, so they have a fully
>> unique GID on all of the child nodes then we don't need to add the
>> GID.
>
> Yes, that’s also possible but seems to be less flexible restricting us to some
> specific GID format.
>
> Anyway, I can measure WAL space overhead introduced by the GID’s inside 
> commit records
> to know exactly what will be the cost of such approach.

Stas,

Have you had a chance to look at this further?

I think the approach of storing just the xid and fetching the GID
during logical decoding of the PREPARE TRANSACTION is probably the
best way forward, per my prior mail. That should eliminate Simon's
objection re the cost of tracking GIDs and still let us have access to
them when we want them, which is the best of both worlds really.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-01-06 Thread Simon Riggs
On 5 January 2017 at 12:43, Stas Kelvich  wrote:
>
>> On 5 Jan 2017, at 13:49, Simon Riggs  wrote:
>>
>> Surely in this case the master server is acting as the Transaction
>> Manager, and it knows the mapping, so we are good?
>>
>> I guess if you are using >2 nodes then you need to use full 2PC on each node.
>>
>> Please explain precisely how you expect to use this, to check that GID
>> is required.
>>
>
> For example if we are using logical replication just for failover/HA and 
> allowing user
> to be transaction manager itself. Then suppose that user prepared tx on 
> server A and server A
> crashed. After that client may want to reconnect to server B and commit/abort 
> that tx.
> But user only have GID that was used during prepare.

I don't think that's the case you're trying to support, and I don't think
it's a common enough case that we want to pay the price of putting it into
core in a non-optional way.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-01-05 Thread Craig Ringer
On 5 January 2017 at 20:43, Stas Kelvich  wrote:

> Anyway, I can measure WAL space overhead introduced by the GID’s inside 
> commit records
> to know exactly what will be the cost of such approach.

Sounds like a good idea, especially if you remove any attempt to work
with GIDs for !2PC commits at the same time.

I don't think I care about having access to the GID for the use case I
have in mind, since we'd actually be wanting to hijack a normal COMMIT
and internally transform it to PREPARE TRANSACTION, …, COMMIT
PREPARED. But for the more general case of logical decoding of 2PC I
can see the utility of having the xact identifier.

If we presume we're only interested in logically decoding 2PC xacts
that are not yet COMMIT PREPAREd, can we not avoid the WAL overhead of
writing the GID by looking it up in our shmem state at decoding-time
for PREPARE TRANSACTION? If we can't find the prepared transaction in
TwoPhaseState we know to expect a following ROLLBACK PREPARED or
COMMIT PREPARED, so we shouldn't decode it at the PREPARE TRANSACTION
stage.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-01-05 Thread Stas Kelvich

> On 5 Jan 2017, at 13:49, Simon Riggs  wrote:
> 
> Surely in this case the master server is acting as the Transaction
> Manager, and it knows the mapping, so we are good?
> 
> I guess if you are using >2 nodes then you need to use full 2PC on each node.
> 
> Please explain precisely how you expect to use this, to check that GID
> is required.
> 

For example, if we are using logical replication just for failover/HA and 
allowing the user to be the transaction manager itself. Then suppose that 
the user prepared a tx on server A and server A crashed. After that the 
client may want to reconnect to server B and commit/abort that tx. But the 
user only has the GID that was used during prepare.

> But even then, if you adopt the naming convention that all in-progress
> xacts will be called RepOriginId-EPOCH-XID, so they have a fully
> unique GID on all of the child nodes then we don't need to add the
> GID.

Yes, that’s also possible, but it seems less flexible, restricting us to some
specific GID format.

Anyway, I can measure the WAL space overhead introduced by the GIDs inside 
commit records to know exactly what the cost of such an approach would be.

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






Re: [HACKERS] logical decoding of two-phase transactions

2017-01-05 Thread Simon Riggs
On 5 January 2017 at 10:21, Stas Kelvich  wrote:
> Thank you for looking into this.
>
>> On 5 Jan 2017, at 09:43, Simon Riggs  wrote:
>>>
>>> GID is now variable sized. You seem to have added this to every
>>> commit, not just 2PC
>>
>
> Hm, didn’t realise that; I’ll fix it.
>
>> I've just realised that you're adding GID because it allows you to
>> uniquely identify the prepared xact. But then the prepared xact will
>> also have a regular TransactionId, which is also unique. GID exists
>> for users to specify things, but it is not needed internally and we
>> don't need to add it here.
>
> I think we anyway can’t avoid pushing down GID to the client side.
>
> If we push down only the local TransactionId to the remote server, we will
> lose the mapping of GID to TransactionId, and there will be no way for the
> user to identify his transaction on the second server. Also, Open XA and
> lots of libraries (e.g. J2EE) assume that there is the same GID everywhere
> and that it’s the same GID that was issued by the client.
>
> Requirements for two-phase decoding can be different depending on what one
> wants to build around it, and I believe in some situations pushing down the
> xid is enough. But IMO dealing with reconnects, failures and client
> libraries will force the programmer to use the same GID everywhere.

Surely in this case the master server is acting as the Transaction
Manager, and it knows the mapping, so we are good?

I guess if you are using >2 nodes then you need to use full 2PC on each node.

But even then, if you adopt the naming convention that all in-progress
xacts will be called RepOriginId-EPOCH-XID, so they have a fully
unique GID on all of the child nodes then we don't need to add the
GID.
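The convention Simon describes could be as simple as the following sketch. The exact separator and field order are assumptions for illustration; the point is only that (origin id, epoch, xid) is already globally unique, so no GID has to be WAL-logged.

```python
# Simon's suggested convention: derive a globally unique GID from the
# replication origin id, epoch, and xid. Format is an assumption.

def make_gid(rep_origin_id, epoch, xid):
    return "{}-{}-{}".format(rep_origin_id, epoch, xid)
```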

Please explain precisely how you expect to use this, to check that GID
is required.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-01-05 Thread Stas Kelvich
Thank you for looking into this.

> On 5 Jan 2017, at 09:43, Simon Riggs  wrote:
>> 
>> GID is now variable sized. You seem to have added this to every
>> commit, not just 2PC
> 

Hm, didn’t realise that; I’ll fix it.

> I've just realised that you're adding GID because it allows you to
> uniquely identify the prepared xact. But then the prepared xact will
> also have a regular TransactionId, which is also unique. GID exists
> for users to specify things, but it is not needed internally and we
> don't need to add it here.

I think we can’t avoid pushing down the GID to the client side anyway.

If we push down only the local TransactionId to the remote server, we will 
lose the mapping of GID to TransactionId, and there will be no way for the 
user to identify his transaction on the second server. Also, Open XA and 
lots of libraries (e.g. J2EE) assume that there is the same GID everywhere 
and that it’s the same GID that was issued by the client.

Requirements for two-phase decoding can be different depending on what one 
wants to build around it, and I believe in some situations pushing down the 
xid is enough. But IMO dealing with reconnects, failures and client libraries 
will force the programmer to use the same GID everywhere.

> What we do need is for the commit prepared
> message to remember what the xid of the prepare was and then re-find
> it using the commit WAL record's twophase_xid field. So we don't need
> to add GID to any WAL records, nor to any in-memory structures.

The other part of the story is how to find the GID during decoding of the 
commit prepared record. I did that by adding a GID field to the commit WAL 
record, because by the time of decoding all memory structures that were 
holding the xid<->gid correspondence have already been cleaned up.

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






Re: [HACKERS] logical decoding of two-phase transactions

2017-01-04 Thread Simon Riggs
On 4 January 2017 at 21:20, Simon Riggs  wrote:
> On 31 December 2016 at 08:36, Stas Kelvich  wrote:
>> Here is a resubmission of the patch to implement logical decoding of two-phase
>> transactions (instead of treating them as usual transactions at commit) [1].
>> I’ve slightly polished things and used the test_decoding output plugin as the
>> client.
>
> Sounds good.
>
>> General idea quite simple here:
>>
>> * Write gid along with commit/prepare records in case of 2pc
>
> GID is now variable sized. You seem to have added this to every
> commit, not just 2PC

I've just realised that you're adding GID because it allows you to
uniquely identify the prepared xact. But then the prepared xact will
also have a regular TransactionId, which is also unique. GID exists
for users to specify things, but it is not needed internally and we
don't need to add it here. What we do need is for the commit prepared
message to remember what the xid of the prepare was and then re-find
it using the commit WAL record's twophase_xid field. So we don't need
to add GID to any WAL records, nor to any in-memory structures.

Please re-work the patch to include twophase_xid, which should make
the patch smaller and much faster too.

Please add comments to explain how and why patches work. Design
comments allow us to check the design makes sense and if it does
whether all the lines in the patch are needed to follow the design.
Without that patches are much harder to commit and we all want patches
to be easier to commit.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] logical decoding of two-phase transactions

2017-01-04 Thread Simon Riggs
On 31 December 2016 at 08:36, Stas Kelvich  wrote:
> Here is a resubmission of the patch to implement logical decoding of two-phase
> transactions (instead of treating them as usual transactions at commit) [1].
> I’ve slightly polished things and used the test_decoding output plugin as the
> client.

Sounds good.

> General idea quite simple here:
>
> * Write gid along with commit/prepare records in case of 2pc

GID is now variable-sized. You seem to have added this to every commit, not
just 2PC.

> * Add several routines to decode prepare records in the same way as it 
> already happens in logical decoding.
>
> I’ve also added an explicit LOCK statement in the test_decoding regression
> suite to check that it doesn’t break things.

Please explain that in comments in the patch.

> If somebody can create a scenario that blocks decoding because of an existing
> dummy backend lock, that would be a great help. Right now all my tests pass
> (including TAP tests to check recovery of twophase tx in case of failures,
> from the adjacent mail thread).
>
> If we agree on the current approach, then I’m ready to add this stuff to the
> proposed in-core logical replication.
>
> [1] 
> https://www.postgresql.org/message-id/EE7452CA-3C39-4A0E-97EC-17A414972884%40postgrespro.ru

We'll need some measurements of the additional WAL space or memory usage
from these approaches. Thanks.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

