Re: [HACKERS] logical decoding of two-phase transactions
On 28 October 2017 at 03:53, Sokolov Yura wrote:
> On 2017-10-26 22:01, Sokolov Yura wrote:
> Small improvement compared to v7:
> - twophase_gid is written with alignment padding in the XactLogCommitRecord
>   and XactLogAbortRecord.

I think Nikhils has done some significant work on this patch. Hopefully
he'll be able to share it.

--
Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
On 2017-10-26 22:01, Sokolov Yura wrote:
> [...]

Small improvement compared to v7:
- twophase_gid is written with alignment padding in the XactLogCommitRecord
  and XactLogAbortRecord.
--
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

Attachment: logical_twophase_v8.diff.gz (GNU Zip compressed data)
Re: [HACKERS] logical decoding of two-phase transactions
On 2017-09-27 14:46, Stas Kelvich wrote:
> [...]

Just rebased patch logical_twophase_v6 to master. Fixed small issues:

- XactLogAbortRecord wrote DBINFO twice, but it was decoded in
  ParseAbortRecord only once. The second DBINFO was parsed as ORIGIN.
  Fixed by removing the second write of DBINFO.
- SnapBuildPrepareTxnFinish tried to remove the xid from `running` instead
  of `committed`, and it removed only the xid, without its subxids.
- test_decoding skipped returning "COMMIT PREPARED" and "ABORT PREPARED".

The big issue was with decoding of DDL-containing two-phase transactions:

- prepared.out was misleading. We could not reproduce decoding of the body
  of "test_prepared#3" with logical_twophase_v6.diff. It was skipped if
  `pg_logical_slot_get_changes` was called without
  `twophase-decode-with-catalog-changes` set, and only
  "COMMIT PREPARED 'test_prepared#3'" was decoded.

  The reason is that "PREPARE TRANSACTION" is passed to `pg_filter_prepare`
  twice:
  - first, on the "PREPARE" itself,
  - second, on the "COMMIT PREPARED".

  In v6, `pg_filter_prepare` without `with-catalog-changes` answered "true"
  the first time (ie it should not be decoded), and "false" the second time,
  when the transaction had become committed (ie it should be decoded). But
  at the second call, `ctx->snapshot_builder->start_decoding_at` in
  DecodePrepare is already in the future compared to `buf->origptr`
  (because it is at the "COMMIT PREPARED" lsn), so DecodePrepare just
  called ReorderBufferForget.

  If `pg_filter_prepare` is called with `with-catalog-changes`, it returns
  "false" both times, so DecodePrepare decodes the transaction the first
  time and calls `ReorderBufferForget` the second time.

  I didn't find a way to fix it gracefully. I just changed
  `pg_filter_prepare` to return the same answer both times: "false" if
  called `with-catalog-changes` (ie DecodePrepare needs to be called), and
  "true" otherwise. With this change, a catalog-changing two-phase
  transaction is decoded as a simple one-phase transaction if
  `pg_logical_slot_get_changes` is called without `with-catalog-changes`.
--
With regards,
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

Attachment: logical_twophase_v7.diff.gz (GNU Zip compressed data)
Re: [HACKERS] logical decoding of two-phase transactions
> On 7 Sep 2017, at 18:58, Nikhil Sontakke wrote:
>
> Hi,
>
> FYI all, wanted to mention that I am working on an updated version of
> the latest patch that I plan to submit to a later CF.

Cool!

So what kind of architecture do you have in mind? The same way as it was
implemented before? As far as I remember there were two main issues:

* Decoding of an aborted prepared transaction. If such a transaction
modified the catalog then we can't read reliable info with our historic
snapshot: since the clog already has the abort bit for our tx, it will
break the visibility logic. There are some ways to deal with that — by
doing the catalog seq scan two times and counting the number of tuples
(details upthread), or by hijacking the clog values in the historic
visibility function. But ISTM it is better not to solve this issue at
all =) In most cases the intended usage of decoding of a 2PC transaction
is to do some form of distributed commit, so naturally decoding will
happen only on in-progress transactions, and commit/abort will happen
only after the transaction is decoded, sent, and a response is received.
So we can just have an atomic flag that prevents commit/abort of the tx
currently being decoded. And we can filter interesting prepared
transactions based on GID, to prevent holding this lock for ordinary 2PC.

* Possible deadlocks that Andres was talking about. I spent some time
trying to find one, but didn't find any. If locking pg_class in a
prepared tx is the only example, then (imho) it is better to just forbid
preparing such transactions. Otherwise, if realistic examples that can
block decoding actually exist, then we probably need to reconsider the
way the tx is decoded. Anyway, this part probably needs Andres' blessing.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
Hi,

FYI all, wanted to mention that I am working on an updated version of
the latest patch that I plan to submit to a later CF.

Regards,
Nikhils

On 14 May 2017 at 04:02, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> On 13 May 2017 at 22:22, Tom Lane wrote:
>>
>> Apparently you are not testing against current HEAD. That's been there
>> since d10c626de (a whole two days now ;-))
>
> Indeed, I was working on a more than two-day-old antiquity. Unfortunately,
> it's even more complicated to apply this patch against the current HEAD,
> so I'll wait for a rebased version.

--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
On 13 May 2017 at 22:22, Tom Lane wrote:
>
> Apparently you are not testing against current HEAD. That's been there
> since d10c626de (a whole two days now ;-))

Indeed, I was working on a more than two-day-old antiquity. Unfortunately,
it's even more complicated to apply this patch against the current HEAD,
so I'll wait for a rebased version.
Re: [HACKERS] logical decoding of two-phase transactions
Dmitry Dolgov <9erthali...@gmail.com> writes:
> Just a note about this patch. Of course time flies by and it needs rebase,
> but also there are few failing tests right now:
> ERROR: function pg_wal_lsn_diff(pg_lsn, unknown) does not exist

Apparently you are not testing against current HEAD. That's been there
since d10c626de (a whole two days now ;-)).

regards, tom lane
Re: [HACKERS] logical decoding of two-phase transactions
Hi

> On 4 April 2017 at 19:13, Masahiko Sawada wrote:
>
> Other than that issue current patch still could not pass 'make check'
> test of contrib/test_decoding.

Just a note about this patch. Of course time flies by and it needs rebase,
but also there are a few failing tests right now:

* one that was already mentioned by Masahiko
* one from `ddl`, where the expected output is:

```
SELECT slot_name, plugin, slot_type, active,
    NOT catalog_xmin IS NULL AS catalog_xmin_set,
    xmin IS NULl AS data_xmin_not_set,
    pg_wal_lsn_diff(restart_lsn, '0/0100') > 0 AS some_wal
FROM pg_replication_slots;
    slot_name    |    plugin     | slot_type | active | catalog_xmin_set | data_xmin_not_set | some_wal
-----------------+---------------+-----------+--------+------------------+-------------------+----------
 regression_slot | test_decoding | logical   | f      | t                | t                 | t
(1 row)
```

but the actual result is:

```
SELECT slot_name, plugin, slot_type, active,
    NOT catalog_xmin IS NULL AS catalog_xmin_set,
    xmin IS NULl AS data_xmin_not_set,
    pg_wal_lsn_diff(restart_lsn, '0/0100') > 0 AS some_wal
FROM pg_replication_slots;
ERROR:  function pg_wal_lsn_diff(pg_lsn, unknown) does not exist
LINE 5:     pg_wal_lsn_diff(restart_lsn, '0/0100') > 0 AS some_w...
            ^
HINT:  No function matches the given name and argument types. You might
need to add explicit type casts.
```
Re: [HACKERS] logical decoding of two-phase transactions
On 2017-04-04 13:06:13 +0300, Stas Kelvich wrote:
> That is just an argument against Andres' concern that a prepared
> transaction is able to deadlock with the decoding process — at least
> there are no such cases in the regression tests.

There are few longer / adverse xacts there; that doesn't say much.

> And that concern is the main thing blocking this patch. Except for
> explicit catalog locks in a prepared tx, nobody has yet found such cases,
> and it is hard to address or argue about.

I doubt that's the case. But even if it were so, it's absolutely not
acceptable that a plain user can cause such deadlocks. So I don't think
this argument buys you anything.

- Andres
Re: [HACKERS] logical decoding of two-phase transactions
On Tue, Apr 4, 2017 at 7:06 PM, Stas Kelvich wrote:
>
>> On 4 Apr 2017, at 04:23, Masahiko Sawada wrote:
>>
>> I reviewed this patch but when I tried to build contrib/test_decoding
>> I got the following error.
>
> Thanks!
>
> Yes, seems that 18ce3a4a changed the ProcessUtility_hook signature.
> Updated.
>
>> There are still some unnecessary code in v5 patch.

Thank you for updating the patch!

> Actually the second diff isn't intended to be part of the patch; I've just
> shared the way I ran the regression test suite through the 2PC decoding,
> changing all commits to prepare/commits, where commits happen only after
> decoding of the prepare is finished (more details in my previous message
> in this thread).

Understood. Sorry for the noise.

> That is just an argument against Andres' concern that a prepared
> transaction is able to deadlock with the decoding process — at least
> there are no such cases in the regression tests.
>
> And that concern is the main thing blocking this patch. Except for
> explicit catalog locks in a prepared tx, nobody has yet found such cases,
> and it is hard to address or argue about.

Hmm, I also have not found such a deadlock case yet. Other than that issue
the current patch still could not pass the 'make check' test of
contrib/test_decoding:

*** 154,167 ****
  (4 rows)
  
  :get_with2pc
!  data
! ----
! BEGIN
! table public.test_prepared1: INSERT: id[integer]:5
! table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
! PREPARE TRANSACTION 'test_prepared#3';
! COMMIT PREPARED 'test_prepared#3';
! (5 rows)
  
  -- make sure stuff still works
  INSERT INTO test_prepared1 VALUES (8);
--- 154,162 ----
  (4 rows)
  
  :get_with2pc
!  data
! ------
! (0 rows)
  
  -- make sure stuff still works
  INSERT INTO test_prepared1 VALUES (8);

I guess that this part is an unexpected result and should be fixed. Right?

*** 215,222 ****
  -- If we try to decode it now we'll deadlock
  SET statement_timeout = '10s';
  :get_with2pc_nofilter
! -- FIXME we expect a timeout here, but it actually works...
! ERROR:  statement timed out
  RESET statement_timeout;
  -- we can decode past it by skipping xacts with catalog changes
--- 210,222 ----
  -- If we try to decode it now we'll deadlock
  SET statement_timeout = '10s';
  :get_with2pc_nofilter
!  data
! ------
! BEGIN
! table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
! table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
! PREPARE TRANSACTION 'test_prepared_lock'
! (4 rows)
  RESET statement_timeout;
  -- we can decode past it by skipping xacts with catalog changes

Probably we can ignore this part for now.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] logical decoding of two-phase transactions
> On 4 Apr 2017, at 04:23, Masahiko Sawada wrote:
>
> I reviewed this patch but when I tried to build contrib/test_decoding
> I got the following error.

Thanks!

Yes, seems that 18ce3a4a changed the ProcessUtility_hook signature.
Updated.

> There are still some unnecessary code in v5 patch.

Actually the second diff isn't intended to be part of the patch; I've just
shared the way I ran the regression test suite through the 2PC decoding,
changing all commits to prepare/commits, where commits happen only after
decoding of the prepare is finished (more details in my previous message
in this thread).

That is just an argument against Andres' concern that a prepared
transaction is able to deadlock with the decoding process — at least
there are no such cases in the regression tests.

And that concern is the main thing blocking this patch. Except for
explicit catalog locks in a prepared tx, nobody has yet found such cases,
and it is hard to address or argue about.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment: logical_twophase_v6.diff (binary data)
Attachment: logical_twophase_regresstest.diff (binary data)
Re: [HACKERS] logical decoding of two-phase transactions
On Thu, Mar 30, 2017 at 12:55 AM, Stas Kelvich wrote:
> [...]

I reviewed this patch, but when I tried to build contrib/test_decoding I
got the following error:
$ make
gcc -Wall -Wmissing-prototypes -Wpointer-arith
  -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute
  -Wformat-security -fno-strict-aliasing -fwrapv -g -g -fpic -I. -I.
  -I../../src/include -D_GNU_SOURCE -c -o test_decoding.o test_decoding.c
  -MMD -MP -MF .deps/test_decoding.Po
test_decoding.c: In function '_PG_init':
test_decoding.c:126: warning: assignment from incompatible pointer type
test_decoding.c: In function 'test_decoding_process_utility':
test_decoding.c:271: warning: passing argument 5 of 'PreviousProcessUtilityHook' from incompatible pointer type
test_decoding.c:271: note: expected 'struct QueryEnvironment *' but argument is of type 'struct DestReceiver *'
test_decoding.c:271: warning: passing argument 6 of 'PreviousProcessUtilityHook' from incompatible pointer type
test_decoding.c:271: note: expected 'struct DestReceiver *' but argument is of type 'char *'
test_decoding.c:271: error: too few arguments to function 'PreviousProcessUtilityHook'
test_decoding.c:276: warning: passing argument 5 of 'standard_ProcessUtility' from incompatible pointer type
../../src/include/tcop/utility.h:38: note: expected 'struct QueryEnvironment *' but argument is of type 'struct DestReceiver *'
test_decoding.c:276: warning: passing argument 6 of 'standard_ProcessUtility' from incompatible pointer type
Re: [HACKERS] logical decoding of two-phase transactions
> On 28 Mar 2017, at 18:08, Andres Freund wrote:
>
> On 2017-03-28 15:55:15 +0100, Simon Riggs wrote:
>>
>> That assertion is obviously false... the plugin can resolve this in
>> various ways, if we allow it.
>
> Handling it by breaking replication isn't handling it (e.g. timeouts in
> decoding etc). Handling it by rolling back *prepared* transactions
> (which are supposed to be guaranteed to succeed!), isn't either.
>
>> You can say that in your opinion you prefer to see this handled in
>> some higher level way, though it would be good to hear why and how.
>
> It's pretty obvious why: A bit of DDL by the user shouldn't lead to the
> issues mentioned above.
>
>> Bottom line here is we shouldn't reject this patch on this point,
>
> I think it definitely has to be rejected because of that. And I didn't
> bring this up at the last minute, I repeatedly brought it up before.
> Both to Craig and Stas.

Okay. In order to find more realistic cases that block replication I've
created the following setup:

* in the backend: the test_decoding plugin hooks on xact events and on
utility statements and transforms each commit into a prepare, then sleeps
on a latch. If a transaction contains DDL, the whole statement is pushed
into WAL as a transactional message. If the DDL cannot be prepared or
disallows execution in a transaction block, then it goes as a
nontransactional logical message without the prepare/decode injection. If
the transaction didn't issue any DDL and didn't write anything to WAL,
then it skips 2PC too.

* after the prepare is decoded, the output plugin in the walsender unlocks
the backend, allowing it to proceed with COMMIT PREPARED. So in case the
decoding tries to access a blocked catalog, everything should stop.

* a small python script consumes decoded WAL from the walsender (thanks
Craig and Petr).

After some acrobatics with those hooks I've managed to run the whole
regression suite in parallel mode through such a setup of test_decoding
without any deadlocks. I've added two xact_events to postgres and allowed
prepare of transactions that touched temp tables, since they are heavily
used in the tests and create a lot of noise in the diffs.

So it boils down to 3 failed regression tests out of 177, namely:

* transactions.sql — here a commit of a tx gets stuck obtaining a
SafeSnapshot(). I didn't look at what is happening there specifically, but
just checked that the walsender isn't blocked. I'm going to look more
closely at this.

* prepared_xacts.sql — here select prepared_xacts() sees our prepared tx.
It is possible to filter them out, but obviously it works as expected.

* guc.sql — here pendingActions arrives on 'DISCARD ALL', preventing the
tx from being prepared. I didn't find a way to check the presence of
pendingActions outside of async.c, so decided to leave it as is.

It seems that at least in the regression tests nothing can block two-phase
logical decoding. Is that a strong enough argument for the hypothesis that
the current approach doesn't create deadlocks, except for locks on the
catalog, which should be disallowed anyway?

Patches attached. logical_twophase_v5 is a slightly modified version of
the previous patch merged with Craig's changes. The second file is a set
of patches over the previous one that implements the logic I've just
described. There is a runtest.sh script that sets up postgres, runs the
python logical consumer in the background, and starts the regression test.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment: logical_twophase_v5.diff (binary data)
Attachment: logical_twophase_regresstest.diff (binary data)
Re: [HACKERS] logical decoding of two-phase transactions
On 28 Mar. 2017 23:08, "Andres Freund" wrote:

>>>> I don't think its for us to say what the plugin is allowed to do. We
>>>> decided on a plugin architecture, so we have to trust that the plugin
>>>> author resolves the issues. We can document them so those choices are
>>>> clear.
>>>
>>> I don't think this is "plugin architecture" related. The output plugin
>>> can't do right here, this has to be solved at a higher level.
>>
>> That assertion is obviously false... the plugin can resolve this in
>> various ways, if we allow it.
>
> Handling it by breaking replication isn't handling it (e.g. timeouts in
> decoding etc).

IMO, if it's a rare condition and we can abort decoding then recover
cleanly and succeed on retry, that's OK. Not dissimilar to the deadlock
detector. But right now that's not the case: it's possible (however
artificially) to create prepared xacts for which decoding will stall and
not succeed.

> Handling it by rolling back *prepared* transactions
> (which are supposed to be guaranteed to succeed!), isn't either.

I agree, we can't rely on anything for which the only way to continue is
to roll back a prepared xact.

>> You can say that in your opinion you prefer to see this handled in
>> some higher level way, though it would be good to hear why and how.
>
> It's pretty obvious why: A bit of DDL by the user shouldn't lead to the
> issues mentioned above.

I agree that it shouldn't, and in fact DDL is the main part of why I want
2PC decoding.

What's surprised me is that I haven't actually been able to create any
situations where, with test_decoding, we have such a failure. Not unless
I manually LOCK TABLE pg_attribute, anyway.

Notably, we already disallow prepared xacts that make changes to the
relfilenode map, which covers a lot of the problem cases like CLUSTERing
system tables.

>> Bottom line here is we shouldn't reject this patch on this point,
>
> I think it definitely has to be rejected because of that. And I didn't
> bring this up at the last minute, I repeatedly brought it up before.
> Both to Craig and Stas.

Yes, and I lost track of it while focusing on the catalog tuple
visibility issues. I warned Stas of this issue when he first mentioned an
interest in decoding of 2PC actually, but haven't kept a proper eye on it
since. Andres and I even discussed this back in the early BDR days; it's
not new, and is part of why I poked Stas to try some DDL tests etc.

The tests in the initial patch didn't have enough coverage to trigger any
issues - they didn't actually test decoding of a 2PC xact while it was
still in progress at all. But even once I added more tests I've actually
been unable to reproduce this in a realistic real-world example. Frankly
I'm confused by that, since I would expect an AEL on some_table to cause
decoding of some_table to get stuck. It does not.

That doesn't mean we should accept failure cases and commit something
with holes in it. But it might inform our choices about how we solve
those issues.

> One way to fix this would be to allow decoding to acquire such locks
> (i.e. locks held by the prepared xact we're decoding) - there
> unfortunately are some practical issues with that (e.g. the locking code
> doesn't necessarily expect a second non-exclusive locker, when there's
> an exclusive one), or we could add an exception to the locking code to
> simply not acquire such locks.

I've been meaning to see if we can use the parallel infrastructure's
session leader infrastructure for this, by making the 2PC fake-proc a
leader and making our decoding session inherit its locks. I haven't dug
into it to see if it's even remotely practical yet, and won't be able to
until early pg11.

We could proceed with the caveat that decoding plugins that use 2PC
support must defer decoding of 2PC xacts containing DDL until COMMIT
PREPARED, or must take responsibility for ensuring (via a command filter,
etc) that xacts are safe to decode, and 2PC-lock xacts during decoding.
But we're likely to change the interface for all that when we iterate for
pg11, and I'd rather not carry more BC than we have to.

Also, the patch has unsolved issues with how it keeps track of whether an
xact was output at prepare time or not and suppresses output at commit
time.

I'm inclined to shelve the patch for Pg 10. We've only got a couple of
days left, and the tests are still pretty minimal. We have open issues
around locking, less than totally satisfactory abort handling, and the
potential to skip replay of transactions for both prepare and commit
prepared. It's not ready to go.

However, it's definitely to the point where with a little more work it'll
be practical to patch into variants of Pg until we can mainstream it in
Pg 11, which is nice.

--
Craig Ringer
Re: [HACKERS] logical decoding of two-phase transactions
On 2017-03-28 15:55:15 +0100, Simon Riggs wrote: > On 28 March 2017 at 15:38, Andres Freund wrote: > > On 2017-03-28 15:32:49 +0100, Simon Riggs wrote: > >> On 28 March 2017 at 03:53, Craig Ringer wrote: > >> > On 28 March 2017 at 09:25, Andres Freund wrote: > >> > > >> >> If you actually need separate decoding of 2PC, then you want to wait for > >> >> the PREPARE to be replicated. If that replication has to wait for the > >> >> to-be-replicated prepared transaction to commit prepared, and commit > >> >> prepare will only happen once replication happened... > >> > > >> > In other words, the output plugin cannot decode a transaction at > >> > PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on > >> > a catalog relation we must be able to read in order to decode the > >> > xact. > >> > >> Yes, I understand. > >> > >> The decoding plugin can choose to enable lock_timeout, or it can > >> choose to wait for manual resolution, or it could automatically abort > >> such a transaction to avoid needing to decode it. > > > > That doesn't solve the problem. You're still left with replication that > > can't progress. I think that's completely unacceptable. We need a > > proper solution to this, not throw our hands up in the air and hope that > > it's not going to hurt a whole lot of people. > > Nobody is throwing their hands in the air, nobody is just hoping. The > concern raised is real and needs to be handled somewhere; the only > point of discussion is where and how. > >> I don't think it's for us to say what the plugin is allowed to do. We > >> decided on a plugin architecture, so we have to trust that the plugin > >> author resolves the issues. We can document them so those choices are > >> clear. > > > > I don't think this is "plugin architecture" related. The output plugin > > can't do right here, this has to be solved at a higher level. > > That assertion is obviously false... the plugin can resolve this in > various ways, if we allow it. 
Handling it by breaking replication isn't handling it (e.g. timeouts in decoding etc). Handling it by rolling back *prepared* transactions (which are supposed to be guaranteed to succeed!), isn't either. > You can say that in your opinion you prefer to see this handled in > some higher level way, though it would be good to hear why and how. It's pretty obvious why: A bit of DDL by the user shouldn't lead to the issues mentioned above. > Bottom line here is we shouldn't reject this patch on this point, I think it definitely has to be rejected because of that. And I didn't bring this up at the last minute, I repeatedly brought it up before. Both to Craig and Stas. One way to fix this would be to allow decoding to acquire such locks (i.e. locks held by the prepared xact we're decoding) - there unfortunately are some practical issues with that (e.g. the locking code doesn't necessarily expect a second non-exclusive locker, when there's an exclusive one), or we could add an exception to the locking code to simply not acquire such locks. > especially since any resource issue found during decoding could > similarly prevent progress with decoding. For example? - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
On 28 March 2017 at 15:38, Andres Freund wrote: > On 2017-03-28 15:32:49 +0100, Simon Riggs wrote: >> On 28 March 2017 at 03:53, Craig Ringer wrote: >> > On 28 March 2017 at 09:25, Andres Freund wrote: >> > >> >> If you actually need separate decoding of 2PC, then you want to wait for >> >> the PREPARE to be replicated. If that replication has to wait for the >> >> to-be-replicated prepared transaction to commit prepared, and commit >> >> prepare will only happen once replication happened... >> > >> > In other words, the output plugin cannot decode a transaction at >> > PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on >> > a catalog relation we must be able to read in order to decode the >> > xact. >> >> Yes, I understand. >> >> The decoding plugin can choose to enable lock_timeout, or it can >> choose to wait for manual resolution, or it could automatically abort >> such a transaction to avoid needing to decode it. > > That doesn't solve the problem. You're still left with replication that > can't progress. I think that's completely unacceptable. We need a > proper solution to this, not throw our hands up in the air and hope that > it's not going to hurt a whole lot of people. Nobody is throwing their hands in the air, nobody is just hoping. The concern raised is real and needs to be handled somewhere; the only point of discussion is where and how. >> I don't think it's for us to say what the plugin is allowed to do. We >> decided on a plugin architecture, so we have to trust that the plugin >> author resolves the issues. We can document them so those choices are >> clear. > > I don't think this is "plugin architecture" related. The output plugin > can't do right here, this has to be solved at a higher level. That assertion is obviously false... the plugin can resolve this in various ways, if we allow it. You can say that in your opinion you prefer to see this handled in some higher level way, though it would be good to hear why and how. 
Bottom line here is we shouldn't reject this patch on this point, especially since any resource issue found during decoding could similarly prevent progress with decoding. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
On 2017-03-28 15:32:49 +0100, Simon Riggs wrote: > On 28 March 2017 at 03:53, Craig Ringer wrote: > > On 28 March 2017 at 09:25, Andres Freund wrote: > > > >> If you actually need separate decoding of 2PC, then you want to wait for > >> the PREPARE to be replicated. If that replication has to wait for the > >> to-be-replicated prepared transaction to commit prepared, and commit > >> prepare will only happen once replication happened... > > > > In other words, the output plugin cannot decode a transaction at > > PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on > > a catalog relation we must be able to read in order to decode the > > xact. > > Yes, I understand. > > The decoding plugin can choose to enable lock_timeout, or it can > choose to wait for manual resolution, or it could automatically abort > such a transaction to avoid needing to decode it. That doesn't solve the problem. You're still left with replication that can't progress. I think that's completely unacceptable. We need a proper solution to this, not throw our hands up in the air and hope that it's not going to hurt a whole lot of people. > I don't think it's for us to say what the plugin is allowed to do. We > decided on a plugin architecture, so we have to trust that the plugin > author resolves the issues. We can document them so those choices are > clear. I don't think this is "plugin architecture" related. The output plugin can't do right here, this has to be solved at a higher level. - Andres
Re: [HACKERS] logical decoding of two-phase transactions
On 28 March 2017 at 03:53, Craig Ringer wrote: > On 28 March 2017 at 09:25, Andres Freund wrote: > >> If you actually need separate decoding of 2PC, then you want to wait for >> the PREPARE to be replicated. If that replication has to wait for the >> to-be-replicated prepared transaction to commit prepared, and commit >> prepare will only happen once replication happened... > > In other words, the output plugin cannot decode a transaction at > PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on > a catalog relation we must be able to read in order to decode the > xact. Yes, I understand. The decoding plugin can choose to enable lock_timeout, or it can choose to wait for manual resolution, or it could automatically abort such a transaction to avoid needing to decode it. I don't think it's for us to say what the plugin is allowed to do. We decided on a plugin architecture, so we have to trust that the plugin author resolves the issues. We can document them so those choices are clear. This doesn't differ in any respect from any other resource it might need yet cannot obtain. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
On 28 March 2017 at 10:53, Craig Ringer wrote: > However, even once I add an option to force decoding of 2pc xacts with > catalog changes to test_decoding, I cannot reproduce the expected > locking issues so far. See tests in attached updated version, in > contrib/test_decoding/sql/prepare.sql . I haven't been able to create issues with CLUSTER, any ALTER TABLEs I've tried, or anything similar. An explicit AEL on pg_attribute causes the decoding stall, but you can't do anything much else either, and I don't see how that'd arise under normal circumstances. If it's a sufficiently obscure issue I'm willing to document it as "don't do that" or "use a command filter to prohibit that". But it's more likely that I'm just not spotting the cases where the issue arises. Attempting to CLUSTER a system catalog like pg_class or pg_attribute causes PREPARE TRANSACTION to fail with ERROR: cannot PREPARE a transaction that modified relation mapping and I didn't find any catalogs I could CLUSTER that'd also block decoding. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
On 28 March 2017 at 09:25, Andres Freund wrote: > If you actually need separate decoding of 2PC, then you want to wait for > the PREPARE to be replicated. If that replication has to wait for the > to-be-replicated prepared transaction to commit prepared, and commit > prepare will only happen once replication happened... In other words, the output plugin cannot decode a transaction at PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on a catalog relation we must be able to read in order to decode the xact. >> Are there any other scenarios where catalog readers are blocked except >> an explicit lock >> on a catalog table? Alters on catalogs seem to be prohibited. > > VACUUM FULL on catalog tables (but that can't happen in xact => 2pc) > CLUSTER on catalog tables (can happen in xact) > ALTER on tables modified in the same transaction (even of non catalog > tables!), because a lot of routines will do a heap_open() to get the > tupledesc etc. Right, and the latter one is the main issue, since it's by far the most likely and hard to just work around. The tests Stas has in place aren't sufficient to cover this, as they decode only after everything has committed. I'm expanding the pg_regress coverage to do decoding between prepare and commit (when we actually care) first, and will add some tests involving strong locks. I've found one bug where it doesn't decode a 2pc xact at prepare or commit time, even without restart or strong lock issues. Pretty sure it's due to assumptions made about the filter callback. The current code as used by test_decoding won't work correctly. If txn->has_catalog_changes and if it's still in-progress, the filter skips decoding at PREPARE time. But it isn't then decoded at COMMIT PREPARED time either, if we processed past the PREPARE TRANSACTION. Bug. Also, by skipping decoding of 2pc xacts with catalog changes in this test we also hide the locking issues. 
However, even once I add an option to force decoding of 2pc xacts with catalog changes to test_decoding, I cannot reproduce the expected locking issues so far. See tests in attached updated version, in contrib/test_decoding/sql/prepare.sql . Haven't done any TAP tests yet, since the pg_regress tests are so far sufficient to turn up issues. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services From e23909a9929b561e011f41891825bfb5b1ecb1b3 Mon Sep 17 00:00:00 2001 From: Craig Ringer Date: Tue, 28 Mar 2017 10:51:48 +0800 Subject: [PATCH] Logical decoding of two-phase commit transactions at PREPARE time --- contrib/test_decoding/expected/prepared.out | 253 contrib/test_decoding/sql/prepared.sql | 82 +++- contrib/test_decoding/test_decoding.c | 205 ++- src/backend/access/rmgrdesc/xactdesc.c | 37 +++- src/backend/access/transam/twophase.c | 89 - src/backend/access/transam/xact.c | 72 ++- src/backend/replication/logical/decode.c| 148 -- src/backend/replication/logical/logical.c | 141 + src/backend/replication/logical/reorderbuffer.c | 129 +++- src/backend/replication/logical/snapbuild.c | 48 - src/include/access/twophase.h | 3 + src/include/access/xact.h | 44 - src/include/replication/logical.h | 6 +- src/include/replication/output_plugin.h | 36 src/include/replication/reorderbuffer.h | 50 + src/include/replication/snapbuild.h | 4 + 16 files changed, 1254 insertions(+), 93 deletions(-) diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out index 46e915d..56c6e72 100644 --- a/contrib/test_decoding/expected/prepared.out +++ b/contrib/test_decoding/expected/prepared.out @@ -6,19 +6,84 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d init (1 row) -CREATE TABLE test_prepared1(id int); -CREATE TABLE test_prepared2(id int); +SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding'); + ?column? 
+-- + init +(1 row) + +CREATE TABLE test_prepared1(id integer primary key); +CREATE TABLE test_prepared2(id integer primary key); +-- Reused queries +\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');' +\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');' +\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'',
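The filter-callback bug described in the message above (an xact skipped at PREPARE time that is then never decoded at COMMIT PREPARED time either) can be sketched as a small simulation. This is a hypothetical model, not the actual test_decoding code; all names are invented:

```python
# Hypothetical sketch of the filter bug described above: if the filter skips
# a catalog-modifying 2PC xact at PREPARE time, the same xact must still be
# decoded at COMMIT PREPARED time, otherwise it is never output at all.

def decode_stream(events, filter_prepare, buggy=False):
    """events: list of (kind, xid); returns xids actually sent to the client."""
    deferred, output = set(), []
    for kind, xid in events:
        if kind == "PREPARE":
            if filter_prepare(xid):       # plugin asked to defer this xact
                deferred.add(xid)
            else:
                output.append(xid)
        elif kind == "COMMIT PREPARED":
            if buggy:
                continue                  # bug: deferred xact silently dropped
            if xid in deferred:
                output.append(xid)        # fix: replay the deferred changes now
    return output

events = [("PREPARE", 3), ("COMMIT PREPARED", 3)]
has_catalog_changes = lambda xid: True    # pretend xid 3 modified the catalogs
print(decode_stream(events, has_catalog_changes, buggy=True))   # [] -> lost
print(decode_stream(events, has_catalog_changes, buggy=False))  # [3]
```

The point of the sketch is that skipping at PREPARE and skipping at COMMIT PREPARED must be coordinated decisions, not two independent ones.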
Re: [HACKERS] logical decoding of two-phase transactions
On 2017-03-28 03:30:28 +0100, Simon Riggs wrote: > On 28 March 2017 at 02:25, Andres Freund wrote: > > On 2017-03-28 04:12:41 +0300, Stas Kelvich wrote: > >> > >> > On 28 Mar 2017, at 00:25, Andres Freund wrote: > >> > > >> > Hi, > >> > > >> > On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote: > >> >> Ok, here it is. > >> > > >> > On a very quick skim, this doesn't seem to solve the issues around > >> > deadlocks of prepared transactions vs. catalog tables. What if the > >> > prepared transaction contains something like LOCK pg_class; (there's a > >> > lot more realistic examples)? Then decoding won't be able to continue, > >> > until that transaction is committed / aborted? > >> > >> But why is that deadlock? Seems like just a lock. > > > > If you actually need separate decoding of 2PC, then you want to wait for > > the PREPARE to be replicated. If that replication has to wait for the > > to-be-replicated prepared transaction to commit prepared, and commit > > prepare will only happen once replication happened... > > Surely that's up to the decoding plugin? It can't do much about it, so not really. A lot of the functions dealing with datatypes (temporarily) lock relations. Both the actual user tables, and system catalog tables (cache lookups...). > If the plugin takes locks it had better make sure it can get the locks > or timeout. But that's true of any resource the plugin needs access to > and can't obtain when needed. > This issue could occur now if the transaction took a session lock on a > catalog table. That's not a self deadlock, and we don't do session locks outside of operations like CIC? Greetings, Andres Freund
Re: [HACKERS] logical decoding of two-phase transactions
On 28 March 2017 at 02:25, Andres Freund wrote: > On 2017-03-28 04:12:41 +0300, Stas Kelvich wrote: >> >> > On 28 Mar 2017, at 00:25, Andres Freund wrote: >> > >> > Hi, >> > >> > On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote: >> >> Ok, here it is. >> > >> > On a very quick skim, this doesn't seem to solve the issues around >> > deadlocks of prepared transactions vs. catalog tables. What if the >> > prepared transaction contains something like LOCK pg_class; (there's a >> > lot more realistic examples)? Then decoding won't be able to continue, >> > until that transaction is committed / aborted? >> >> But why is that deadlock? Seems like just a lock. > > If you actually need separate decoding of 2PC, then you want to wait for > the PREPARE to be replicated. If that replication has to wait for the > to-be-replicated prepared transaction to commit prepared, and commit > prepare will only happen once replication happened... Surely that's up to the decoding plugin? If the plugin takes locks it had better make sure it can get the locks or timeout. But that's true of any resource the plugin needs access to and can't obtain when needed. This issue could occur now if the transaction took a session lock on a catalog table. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
On 2017-03-28 04:12:41 +0300, Stas Kelvich wrote: > > > On 28 Mar 2017, at 00:25, Andres Freund wrote: > > > > Hi, > > > > On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote: > >> Ok, here it is. > > > > On a very quick skim, this doesn't seem to solve the issues around > > deadlocks of prepared transactions vs. catalog tables. What if the > > prepared transaction contains something like LOCK pg_class; (there's a > > lot more realistic examples)? Then decoding won't be able to continue, > > until that transaction is committed / aborted? > > But why is that deadlock? Seems like just a lock. If you actually need separate decoding of 2PC, then you want to wait for the PREPARE to be replicated. If that replication has to wait for the to-be-replicated prepared transaction to commit prepared, and commit prepare will only happen once replication happened... > Are there any other scenarios where catalog readers are blocked except > an explicit lock > on a catalog table? Alters on catalogs seem to be prohibited. VACUUM FULL on catalog tables (but that can't happen in xact => 2pc) CLUSTER on catalog tables (can happen in xact) ALTER on tables modified in the same transaction (even of non catalog tables!), because a lot of routines will do a heap_open() to get the tupledesc etc. Greetings, Andres Freund
Re: [HACKERS] logical decoding of two-phase transactions
> On 28 Mar 2017, at 00:25, Andres Freund wrote: > > Hi, > > On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote: >> Ok, here it is. > > On a very quick skim, this doesn't seem to solve the issues around > deadlocks of prepared transactions vs. catalog tables. What if the > prepared transaction contains something like LOCK pg_class; (there's a > lot more realistic examples)? Then decoding won't be able to continue, > until that transaction is committed / aborted? But why is that deadlock? Seems like just a lock. In the case of a prepared lock on pg_class, decoding will wait until it is committed and then continue to decode. So will anything else in postgres that accesses pg_class, up to being unable to connect to the database, bricking the database if you accidentally disconnect before committing that tx (as you showed me a while ago :-). IMO it is more an issue of being able to prepare such a lock than of decoding. Are there any other scenarios where catalog readers are blocked except an explicit lock on a catalog table? Alters on catalogs seem to be prohibited. Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
On 28 March 2017 at 08:50, Stas Kelvich wrote: > >> On 28 Mar 2017, at 00:19, Stas Kelvich wrote: >> >> * It actually doesn't pass one of my regression tests. I've added >> the expected output >> as it should be. I'll try to send a follow-up message with a fix, but right now >> sending it >> as is, as you asked. >> >> > > Fixed. I forgot to postpone ReorderBufferTxn cleanup in case of prepare. > > So it passes the provided regression tests right now. > > I'll give it more testing tomorrow and am going to write a TAP test to check > behaviour > when we lose info about whether a prepare was sent to the subscriber or not. Great, thanks. I'll try to have some TAP tests ready. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
On 28 March 2017 at 05:25, Andres Freund wrote: > On a very quick skim, this doesn't seem to solve the issues around > deadlocks of prepared transactions vs. catalog tables. What if the > prepared transaction contains something like LOCK pg_class; (there's a > lot more realistic examples)? Then decoding won't be able to continue, > until that transaction is committed / aborted? Yeah, that's a problem and one we discussed in the past, though I lost track of it in amongst the recent work. I'm currently writing a few TAP tests intended to check this sort of thing, mixed DDL/DML, overlapping xacts, interleaved prepared xacts, etc. If they highlight problems they'll be useful for the next iteration of this patch anyway. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
> On 28 Mar 2017, at 00:19, Stas Kelvich wrote: > > * It actually doesn't pass one of my regression tests. I've added > the expected output > as it should be. I'll try to send a follow-up message with a fix, but right now > sending it > as is, as you asked. > > Fixed. I forgot to postpone ReorderBufferTxn cleanup in case of prepare. So it passes the provided regression tests right now. I'll give it more testing tomorrow and am going to write a TAP test to check behaviour when we lose info about whether a prepare was sent to the subscriber or not. logical_twophase.diff Description: Binary data Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
Hi, On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote: > Ok, here it is. On a very quick skim, this doesn't seem to solve the issues around deadlocks of prepared transactions vs. catalog tables. What if the prepared transaction contains something like LOCK pg_class; (there's a lot more realistic examples)? Then decoding won't be able to continue, until that transaction is committed / aborted? - Andres
Re: [HACKERS] logical decoding of two-phase transactions
> On 27 Mar 2017, at 16:29, Craig Ringer wrote: > > On 27 March 2017 at 17:53, Stas Kelvich wrote: > >> I heavily underestimated the amount of changes there, but I've almost finished >> and will send an updated patch in several hours. > > Oh, brilliant! Please post whatever you have before you knock off for > the day anyway, even if it's just a WIP, so I can pick it up tomorrow > my time and poke at its tests etc. > Ok, here it is. Major differences compared to the previous version: * GID is stored in commit/abort records only when wal_level >= logical. * More consistency about storing and parsing origin info. Now it is stored in prepare and abort records when repsession is active. * Some cleanup, function renames to get rid of the xact_even/gid fields in ReorderBuffer which I used only to copy them to ReorderBufferTXN. * Changed the output plugin interface to the one that was suggested upthread. Now prepare/CP/AP are separate callbacks, and if none of them is set then a 2pc tx will be decoded as 1pc to provide backward compatibility. * New callback filter_prepare() that can be used to switch between 1pc/2pc styles of decoding a 2pc tx. * test_decoding uses the new API and filters out aborted and running prepared tx. It is actually easy to move the unlock of 2PCState to the prepare callback to allow decoding of a running tx, but since that extension is an example, ISTM it is better not to hold that lock during the whole prepare decoding. However, I left enough information there about this and about the case when those locks are not needed at all (when we are coordinating this tx). Talking about locking of a running prepared tx during decode, I think a better solution would be to use our own custom lock here and register a XACT_EVENT_PRE_ABORT callback in the extension to conflict with this lock. Decode should hold it in shared mode, while commit holds it in exclusive mode. That will allow locking things granularly and blocking only the tx that is being decoded. However, we don't have XACT_EVENT_PRE_ABORT, but it is several LOCs to add it. Should I? 
* It actually doesn't pass one of my regression tests. I've added the expected output as it should be. I'll try to send a follow-up message with a fix, but right now sending it as is, as you asked. logical_twophase.diff Description: Binary data Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
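The custom shared/exclusive lock sketched in the message above can be modelled as a tiny state machine. This is a toy sketch, not PostgreSQL's lock manager, and XACT_EVENT_PRE_ABORT does not exist yet, as the message notes:

```python
# Toy sketch (not PostgreSQL's LWLock/heavyweight-lock code) of the custom
# per-xact lock proposed above: decoding takes it in shared mode, so a
# concurrent commit/abort (exclusive mode) blocks only while that specific
# xact is being decoded.

class TwoPhaseDecodeLock:
    def __init__(self):
        self.shared_holders = 0   # decoding sessions reading this xact
        self.exclusive = False    # commit/abort in progress

    def acquire_shared(self):     # called by the decoder
        if self.exclusive:
            return False          # xact already being resolved; skip it
        self.shared_holders += 1
        return True

    def release_shared(self):
        self.shared_holders -= 1

    def try_acquire_exclusive(self):  # called from the commit/abort path
        if self.shared_holders or self.exclusive:
            return False              # must wait until decoding finishes
        self.exclusive = True
        return True

lock = TwoPhaseDecodeLock()
assert lock.acquire_shared()             # decoding starts
assert not lock.try_acquire_exclusive()  # abort blocked mid-decode
lock.release_shared()
assert lock.try_acquire_exclusive()      # abort proceeds once decode is done
```

The granularity is the point: unlike a global "2PC decoding in progress" flag, only the transaction currently being decoded is protected from concurrent resolution.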
Re: [HACKERS] logical decoding of two-phase transactions
On 27 March 2017 at 17:53, Stas Kelvich wrote: > I heavily underestimated the amount of changes there, but I've almost finished > and will send an updated patch in several hours. Oh, brilliant! Please post whatever you have before you knock off for the day anyway, even if it's just a WIP, so I can pick it up tomorrow my time and poke at its tests etc. I'm in Western Australia +0800 time, significantly ahead of you. > Done. [snip] > Also done. Great, time is short so that's fantastic. > I've extended the tests, but it would be good to have some more. I don't mind writing tests and I've done quite a bit with TAP now, so happy to help there. >> Some special care is needed for the callback that decides whether to >> process a given xact as 2PC or not. It's called before PREPARE >> TRANSACTION to decide whether to decode any given xact at prepare time >> or wait until it commits. It's called again at COMMIT PREPARED time if >> we crashed after we processed PREPARE TRANSACTION and advanced our >> confirmed_flush_lsn such that we won't re-process the PREPARE >> TRANSACTION again. Our restart_lsn might've advanced past it so we >> never even decode it, so we can't rely on seeing it at all. It has >> access to the xid, gid and invalidations, all of which we have at both >> prepare and commit time, to make its decision from. It must have the >> same result at prepare and commit time for any given xact. We can >> probably use a cache in the reorder buffer to avoid the 2nd call on >> commit prepared if we haven't crashed/reconnected between the two. > > Good point. Didn't think about restart_lsn in the case when we are skipping this > particular prepare (filter_prepared() -> true, in my terms). I think that should > work properly as it uses the same code path as before, but I'll look at > it. I suspect that's going to be fragile in the face of interleaving of xacts if we crash between prepare and commit prepared. 
(Apologies if the below is long or disjointed; it's been a long day, but I'm trying to sort my thoughts out.)

Consider ("SSU" = "standby status update"):

0/050 xid 1 BEGIN
0/060 xid 1 INSERT ...
0/070 xid 2 BEGIN
0/080 xid 2 INSERT ...
0/090 xid 3 BEGIN
0/095 xid 3 INSERT ...
0/100 xid 3 PREPARE TRANSACTION 'x' => sent to client [y/n]?
      SSU: confirmed_flush_lsn = 0/100, restart_lsn 0/050 (if we sent to client)
0/200 xid 2 COMMIT => sent to client
      SSU: confirmed_flush_lsn = 0/200, restart_lsn 0/050
0/250 xl_running_xacts logged, xids = [1,3]
[CRASH or disconnect/reconnect]
Restart decoding at 0/050.
      skip output of xid 3 PREPARE TRANSACTION @ 0/100: is <= confirmed_flush_lsn
      skip output of xid 2 COMMIT @ 0/200: is <= confirmed_flush_lsn
0/300 xid 2 COMMIT PREPARED 'x' => sent to client, since its LSN is > confirmed_flush_lsn

In the above, our problem is that restart_lsn is held down by some other xact, so we can't rely on it to tell us if we replayed xid 3 to the output plugin or not. We can't use confirmed_flush_lsn either, since it'll advance at xid 2's commit whether or not we replayed xid 3's prepare to the client. Since xid 3 will still be in xl_running_xacts when prepared, when we recover SnapBuildProcessChange will return true for its changes and we'll (re)buffer them, whether or not we landed up sending to the client at prepare time. Nothing much to be done about that, we'll just discard them when we process the prepare or the commit prepared, depending on where we consult our filter callback again. We MUST ask our filter callback again though, before we test SnapBuildXactNeedsSkip when processing the PREPARE TRANSACTION again. Otherwise we'll discard the buffered changes, and if we *didn't* send them to the client already ... splat. We can call the filter callback again on xid 3's prepare to find out "would you have replayed it when we passed it last time". 
Or we can call it when we get to the commit instead, to ask "when called last time at prepare, did you replay or not?" But we have to consult the callback. By default we'd just skip ReorderBufferCommit processing for xid 3 entirely, which we'll do via the SnapBuildXactNeedsSkip call in DecodeCommit when we process the COMMIT PREPARED. If there was no other running xact when we decoded the PREPARE TRANSACTION the first time around (i.e. xid 1 and 2 didn't exist in the above), and if we do send it to the client at prepare time, I think we can safely advance restart_lsn to the most recent xl_running_xacts once we get replay confirmation. So we can pretend we already committed at PREPARE TRANSACTION time for restart purposes if we output at PREPARE TRANSACTION time, it just doesn't help us with deciding whether to send the buffer contents at COMMIT PREPARED time or not. TL;DR: we can't rely on restart_lsn or confirmed_flush_lsn or xl_running_xacts, we must ask the filter callback when we (re)decode the PREPARE TRANSACTION record and/or at COMMIT PREPARED time. This isn't a big deal. We just have to make sure we
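The ambiguity in the crash scenario above can be shown with a small simulation: whichever way the filter decided at PREPARE time, the slot's LSN bookkeeping ends up identical, so the LSNs cannot answer whether xid 3's PREPARE was already replayed. The LSNs are the ones from the example, written as plain integers; this is an illustrative model, not slot code:

```python
# Sketch of the crash-recovery ambiguity above: whether or not xid 3's
# PREPARE was sent to the client, the slot ends up with the same
# confirmed_flush_lsn and restart_lsn; only the filter callback can say
# "did we already replay this PREPARE?".

def replay_state(prepare_sent_to_client):
    confirmed_flush_lsn = 0x050
    restart_lsn = 0x050              # held back by xid 1, still running
    # 0x100: xid 3 PREPARE TRANSACTION 'x' (maybe sent, maybe filtered)
    if prepare_sent_to_client:
        confirmed_flush_lsn = 0x100  # standby confirms the PREPARE
    # 0x200: xid 2 COMMIT, always sent and confirmed
    confirmed_flush_lsn = 0x200
    return (confirmed_flush_lsn, restart_lsn)

# After the crash, both histories leave identical slot state:
print(replay_state(True))    # (512, 80)
print(replay_state(False))   # (512, 80)
assert replay_state(True) == replay_state(False)
```

Because the unrelated commit at 0x200 overwrites confirmed_flush_lsn in both histories, the information "was the PREPARE output?" is simply not recorded anywhere in the slot, which is why the filter callback must be re-consulted.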
Re: [HACKERS] logical decoding of two-phase transactions
> On 27 Mar 2017, at 12:26, Craig Ringer wrote: > > On 27 March 2017 at 09:31, Craig Ringer wrote: > >> We're in the last week of the CF. If you have a patch that's nearly >> ready or getting there, now would be a good time to post it for help >> and input from others. >> >> I would really like to get this in, but we're running out of time. >> >> Even if you just post your snapshot management work, with the cosmetic >> changes discussed above, that would be a valuable start. > > I'm going to pick up the last patch and: I heavily underestimated the amount of changes there, but I'm almost finished and will send an updated patch in several hours. > * Ensure we only add the GID to xact records for 2pc commits and aborts And only when wal_level >= logical. Done. The patch also adds origin info to prepares and aborts. > * Add separate callbacks for prepare, abort prepared, and commit > prepared (of xacts already processed during prepare), so we aren't > overloading the "commit" callback and don't have to create fake empty > transactions to pass to the commit callback; Done. > * Add another callback to determine whether an xact should be > processed at PREPARE TRANSACTION or COMMIT PREPARED time. Also done. > * Rename the snapshot builder faux-commit stuff in the current patch > so it's clearer what's going on. Hm. Okay, I'll leave that part to you. > * Write tests covering DDL, abort-during-decode, etc I've extended the tests, but it would be good to have some more. > Some special care is needed for the callback that decides whether to > process a given xact as 2PC or not. It's called before PREPARE > TRANSACTION to decide whether to decode any given xact at prepare time > or wait until it commits. It's called again at COMMIT PREPARED time if > we crashed after we processed PREPARE TRANSACTION and advanced our > confirmed_flush_lsn such that we won't re-process the PREPARE > TRANSACTION again. 
Our restart_lsn might've advanced past it so we > never even decode it, so we can't rely on seeing it at all. It has > access to the xid, gid and invalidations, all of which we have at both > prepare and commit time, to make its decision from. It must have the > same result at prepare and commit time for any given xact. We can > probably use a cache in the reorder buffer to avoid the 2nd call on > commit prepared if we haven't crashed/reconnected between the two. Good point. I didn't think about restart_lsn in the case when we are skipping this particular prepare (filter_prepared() -> true, in my terms). I think that should work properly, as it uses the same code path as before, but I'll look at it. > This proposal does not provide a way to safely decode a 2pc xact that > made catalog changes which may be aborted while being decoded. The > plugin must lock such an xact so that it can't be aborted while being > processed, or defer decoding until commit prepared. It can use the > invalidations for the commit to decide. I had played with that two-pass catalog scan and it seemed to be working, but after some time I realised that it is not useful for the main case, where the commit/abort is generated only after the receiver side has answered the prepares. Also, that two-pass scan is a massive change in relcache.c and genam.c (FWIW there were no problems with the cache, but some problems with index scans and handling one-to-many queries to the catalog, e.g. a table with its fields). Finally I decided to throw it away and switched to the filter_prepare callback, passing the txn structure there to allow access to the has_catalog_changes field. Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
On 27 March 2017 at 09:31, Craig Ringer wrote: > We're in the last week of the CF. If you have a patch that's nearly > ready or getting there, now would be a good time to post it for help > and input from others. > > I would really like to get this in, but we're running out of time. > > Even if you just post your snapshot management work, with the cosmetic > changes discussed above, that would be a valuable start. I'm going to pick up the last patch and:

* Ensure we only add the GID to xact records for 2pc commits and aborts
* Add separate callbacks for prepare, abort prepared, and commit prepared (of xacts already processed during prepare), so we aren't overloading the "commit" callback and don't have to create fake empty transactions to pass to the commit callback;
* Add another callback to determine whether an xact should be processed at PREPARE TRANSACTION or COMMIT PREPARED time.
* Rename the snapshot builder faux-commit stuff in the current patch so it's clearer what's going on.
* Write tests covering DDL, abort-during-decode, etc

Some special care is needed for the callback that decides whether to process a given xact as 2PC or not. It's called before PREPARE TRANSACTION to decide whether to decode any given xact at prepare time or wait until it commits. It's called again at COMMIT PREPARED time if we crashed after we processed PREPARE TRANSACTION and advanced our confirmed_flush_lsn such that we won't re-process the PREPARE TRANSACTION again. Our restart_lsn might've advanced past it so we never even decode it, so we can't rely on seeing it at all. It has access to the xid, gid and invalidations, all of which we have at both prepare and commit time, to make its decision from. It must have the same result at prepare and commit time for any given xact. We can probably use a cache in the reorder buffer to avoid the 2nd call on commit prepared if we haven't crashed/reconnected between the two. 
This proposal does not provide a way to safely decode a 2pc xact that made catalog changes which may be aborted while being decoded. The plugin must lock such an xact so that it can't be aborted while being processed, or defer decoding until commit prepared. It can use the invalidations for the commit to decide. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
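The callback set sketched in the plan above can be illustrated roughly as follows. This is a hypothetical Python model, not the actual PostgreSQL output-plugin API; all names are invented. The key property it demonstrates is the one stated in the email: the filter decision must be stable between prepare time and commit-prepared time, and a cache can avoid the second call when no reconnect happened.

```python
# Illustrative sketch of the proposed 2PC output-plugin callbacks.
class TwoPhaseDecodingPlugin:
    def __init__(self):
        self._filter_cache = {}  # xid -> bool; survives until disconnect

    def filter_prepare(self, xid, gid):
        """Decide: decode at PREPARE TRANSACTION time (True) or defer the
        whole xact to COMMIT PREPARED (False).

        Must return the same answer for a given xact every time it is
        asked, since it may be consulted again after a crash/reconnect."""
        if xid not in self._filter_cache:
            # Toy policy based on the GID; a real plugin might also look
            # at invalidations or a has_catalog_changes flag.
            self._filter_cache[xid] = gid.startswith("decode_now_")
        return self._filter_cache[xid]

    # Separate callbacks instead of overloading "commit" with fake xacts:
    def prepare_cb(self, xid, gid, changes): ...
    def commit_prepared_cb(self, xid, gid): ...
    def abort_prepared_cb(self, xid, gid): ...

plugin = TwoPhaseDecodingPlugin()
assert plugin.filter_prepare(3, "decode_now_x") is True
assert plugin.filter_prepare(3, "anything") is True  # cached: answer is stable
```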
Re: [HACKERS] logical decoding of two-phase transactions
On 20 March 2017 at 21:47, Stas Kelvich wrote: > >> On 20 Mar 2017, at 16:39, Craig Ringer wrote: >> >> On 20 March 2017 at 20:57, Stas Kelvich wrote: >>> On 20 Mar 2017, at 15:17, Craig Ringer wrote: > I thought about having special field (or reusing one of the existing > fields) > in snapshot struct to force filtering xmax > snap->xmax or xmin = > snap->xmin > as Petr suggested. Then this logic can reside in ReorderBufferCommit(). > However this is not solving problem with catcache, so I'm looking into it > right now. OK, so this is only an issue if we have xacts that change the schema of tables and also insert/update/delete to their heaps. Right? So, given that this is CF3 for Pg10, should we take a step back and impose the limitation that we can decode 2PC with schema changes or data row changes, but not both? >>> >>> Yep, time is tight. I’ll try today/tomorrow to proceed with this two scan >>> approach. >>> If I’ll fail to do that during this time then I’ll just update this patch >>> to decode >>> only non-ddl 2pc transactions as you suggested. >> >> I wasn't suggesting not decoding them, but giving the plugin the >> option of whether to proceed with decoding or not. >> >> As Simon said, have a pre-decode-prepared callback that lets the >> plugin get a lock on the 2pc xact if it wants, or say it doesn't want >> to decode it until it commits. >> >> That'd be useful anyway, so we can filter and only do decoding at >> prepare transaction time of xacts the downstream wants to know about >> before they commit. > > Ah, got that. Okay. Any news here? We're in the last week of the CF. If you have a patch that's nearly ready or getting there, now would be a good time to post it for help and input from others. I would really like to get this in, but we're running out of time. Even if you just post your snapshot management work, with the cosmetic changes discussed above, that would be a valuable start. 
-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
> On 20 Mar 2017, at 16:39, Craig Ringer wrote: > > On 20 March 2017 at 20:57, Stas Kelvich wrote: >> >>> On 20 Mar 2017, at 15:17, Craig Ringer wrote: >>> I thought about having special field (or reusing one of the existing fields) in snapshot struct to force filtering xmax > snap->xmax or xmin = snap->xmin as Petr suggested. Then this logic can reside in ReorderBufferCommit(). However this is not solving problem with catcache, so I'm looking into it right now. >>> >>> OK, so this is only an issue if we have xacts that change the schema >>> of tables and also insert/update/delete to their heaps. Right? >>> >>> So, given that this is CF3 for Pg10, should we take a step back and >>> impose the limitation that we can decode 2PC with schema changes or >>> data row changes, but not both? >> >> Yep, time is tight. I’ll try today/tomorrow to proceed with this two scan >> approach. >> If I’ll fail to do that during this time then I’ll just update this patch to >> decode >> only non-ddl 2pc transactions as you suggested. > > I wasn't suggesting not decoding them, but giving the plugin the > option of whether to proceed with decoding or not. > > As Simon said, have a pre-decode-prepared callback that lets the > plugin get a lock on the 2pc xact if it wants, or say it doesn't want > to decode it until it commits. > > That'd be useful anyway, so we can filter and only do decoding at > prepare transaction time of xacts the downstream wants to know about > before they commit. Ah, got that. Okay. > -- > Craig Ringer http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
On 20 March 2017 at 20:57, Stas Kelvich wrote: > >> On 20 Mar 2017, at 15:17, Craig Ringer wrote: >> >>> I thought about having special field (or reusing one of the existing fields) >>> in snapshot struct to force filtering xmax > snap->xmax or xmin = snap->xmin >>> as Petr suggested. Then this logic can reside in ReorderBufferCommit(). >>> However this is not solving problem with catcache, so I'm looking into it >>> right now. >> >> OK, so this is only an issue if we have xacts that change the schema >> of tables and also insert/update/delete to their heaps. Right? >> >> So, given that this is CF3 for Pg10, should we take a step back and >> impose the limitation that we can decode 2PC with schema changes or >> data row changes, but not both? > > Yep, time is tight. I’ll try today/tomorrow to proceed with this two scan > approach. > If I’ll fail to do that during this time then I’ll just update this patch to > decode > only non-ddl 2pc transactions as you suggested. I wasn't suggesting not decoding them, but giving the plugin the option of whether to proceed with decoding or not. As Simon said, have a pre-decode-prepared callback that lets the plugin get a lock on the 2pc xact if it wants, or say it doesn't want to decode it until it commits. That'd be useful anyway, so we can filter and only do decoding at prepare transaction time of xacts the downstream wants to know about before they commit. >>> Just as before I marking this transaction committed in snapbuilder, but >>> after >>> decoding I delete this transaction from xip (which holds committed >>> transactions >>> in case of historic snapshot). >> >> That seems kind of hacky TBH. I didn't much like marking it as >> committed then un-committing it. >> >> I think it's mostly an interface issue though. I'd rather say >> SnapBuildPushPrepareTransaction and SnapBuildPopPreparedTransaction or >> something, to make it clear what we're doing. > > Yes, that will be less confusing. 
However, there is no queue of any kind, so > SnapBuildStartPrepare / SnapBuildFinishPrepare should work too. Yeah, that's better. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
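The SnapBuildStartPrepare / SnapBuildFinishPrepare naming being settled on above can be pictured with a toy model: the prepared xact is treated as committed only while its own changes are being decoded, then taken back out, so no other xact ever observes it as committed. This is a Python sketch with invented data structures, not the real snapshot builder.

```python
# Toy snapshot builder: tracks which xids the historic snapshot treats
# as committed. start/finish bracket the decoding of one prepared xact.
class SnapshotBuilder:
    def __init__(self):
        self.committed_xids = set()

    def start_prepare(self, xid):
        # Make the prepared xact's catalog changes visible to the
        # historic snapshot while we decode its own changes.
        self.committed_xids.add(xid)

    def finish_prepare(self, xid):
        # Undo the marking once decoding is done; the xact may yet abort.
        self.committed_xids.discard(xid)

    def visible(self, xid):
        return xid in self.committed_xids

sb = SnapshotBuilder()
sb.start_prepare(200)
assert sb.visible(200)       # decoding xid 200 sees its own changes
sb.finish_prepare(200)
assert not sb.visible(200)   # afterwards nothing treats it as committed
```

This is the same mechanism as the earlier "mark committed, then un-commit" description; the paired Start/Finish interface just makes the bracketing explicit.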
Re: [HACKERS] logical decoding of two-phase transactions
> On 20 Mar 2017, at 15:17, Craig Ringer wrote: > >> I thought about having special field (or reusing one of the existing fields) >> in snapshot struct to force filtering xmax > snap->xmax or xmin = snap->xmin >> as Petr suggested. Then this logic can reside in ReorderBufferCommit(). >> However this is not solving problem with catcache, so I'm looking into it >> right now. > > OK, so this is only an issue if we have xacts that change the schema > of tables and also insert/update/delete to their heaps. Right? > > So, given that this is CF3 for Pg10, should we take a step back and > impose the limitation that we can decode 2PC with schema changes or > data row changes, but not both? Yep, time is tight. I'll try today/tomorrow to proceed with this two-scan approach. If I fail to do that in time then I'll just update this patch to decode only non-ddl 2pc transactions as you suggested. >> Just as before I marking this transaction committed in snapbuilder, but after >> decoding I delete this transaction from xip (which holds committed >> transactions >> in case of historic snapshot). > > That seems kind of hacky TBH. I didn't much like marking it as > committed then un-committing it. > > I think it's mostly an interface issue though. I'd rather say > SnapBuildPushPrepareTransaction and SnapBuildPopPreparedTransaction or > something, to make it clear what we're doing. Yes, that will be less confusing. However, there is no queue of any kind, so SnapBuildStartPrepare / SnapBuildFinishPrepare should work too. > -- > Craig Ringer http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
On 17 March 2017 at 23:59, Robert Haas wrote: > On Thu, Mar 16, 2017 at 10:34 PM, Craig Ringer wrote: >> On 17 March 2017 at 08:10, Stas Kelvich wrote: >>> While working on this i’ve spotted quite a nasty corner case with aborted >>> prepared >>> transaction. I have some not that great ideas how to fix it, but maybe i >>> blurred my >>> view and missed something. So want to ask here at first. >>> >>> Suppose we created a table, then in 2pc tx we are altering it and after >>> that aborting tx. >>> So pg_class will have something like this: >>> >>> xmin | xmax | relname >>> 100 | 200 | mytable >>> 200 | 0 | mytable >>> >>> After previous abort, tuple (100,200,mytable) becomes visible and if we >>> will alter table >>> again then xmax of first tuple will be set current xid, resulting in >>> following table: >>> >>> xmin | xmax | relname >>> 100 | 300 | mytable >>> 200 | 0 | mytable >>> 300 | 0 | mytable >>> >>> In that moment we’ve lost information that first tuple was deleted by our >>> prepared tx. >> >> Right. And while the prepared xact has aborted, we don't control when >> it aborts and when those overwrites can start happening. We can and >> should check if a 2pc xact is aborted before we start decoding it so >> we can skip decoding it if it's already aborted, but it could be >> aborted *while* we're decoding it, then have data needed for its >> snapshot clobbered. >> >> This hasn't mattered in the past because prepared xacts (and >> especially aborted 2pc xacts) have never needed snapshots, we've never >> needed to do something from the perspective of a prepared xact. >> >> I think we'll probably need to lock the 2PC xact so it cannot be >> aborted or committed while we're decoding it, until we finish decoding >> it. So we lock it, then check if it's already aborted/already >> committed/in progress. If it's aborted, treat it like any normal >> aborted xact. If it's committed, treat it like any normal committed >> xact. 
If it's in progress, keep the lock and decode it. > > But that lock could need to be held for an unbounded period of time - > as long as decoding takes to complete - which seems pretty > undesirable. This didn't seem to be too much of a problem when I read it. Sure, the issue noted by Stas exists, but it requires Alter-Abort-Alter for it to be a problem. Meaning that normal non-DDL transactions do not have problems. Neither would a real-time system that uses the decoded data to decide whether to commit or abort the transaction; in that case there would never be an abort until after decoding. So I suggest we have a pre-prepare callback to ensure that the plugin can decide whether to decode or not. We can pass information to the plugin such as whether we have issued DDL in that xact or not. The plugin can then decide how it wishes to handle it, so if somebody doesn't like the idea of a lock then don't use one. The plugin is already responsible for many things, so this is nothing new. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
> I thought about having special field (or reusing one of the existing fields) > in snapshot struct to force filtering xmax > snap->xmax or xmin = snap->xmin > as Petr suggested. Then this logic can reside in ReorderBufferCommit(). > However this is not solving problem with catcache, so I'm looking into it > right now. OK, so this is only an issue if we have xacts that change the schema of tables and also insert/update/delete to their heaps. Right? So, given that this is CF3 for Pg10, should we take a step back and impose the limitation that we can decode 2PC with schema changes or data row changes, but not both? Applications can record DDL in transactional logical WAL messages for decoding during 2pc processing. Or apps can do 2pc for DML. They just can't do both at the same time, in the same xact. Imperfect, but a lot less invasive. And we can even permit apps to use the locking-based approach I outlined earlier instead: All we have to do IMO is add an output plugin callback to filter whether we want to decode a given 2pc xact at PREPARE TRANSACTION time or defer until COMMIT PREPARED. It could: * mark the xact for deferred decoding at commit time (the default if the callback doesn't exist); or * Acquire a lock on the 2pc xact and request immediate decoding only if it gets the lock so concurrent ROLLBACK PREPARED is blocked; or * inspect the reorder buffer contents for row changes and decide whether to decode now or later based on that. It has a few downsides - for example, temp tables will be considered "catalog changes" for now. But .. eh. We already accept a bunch of practical limitations for catalog changes and DDL in logical decoding, most notably regarding practical handling of full table rewrites. > Just as before I marking this transaction committed in snapbuilder, but after > decoding I delete this transaction from xip (which holds committed > transactions > in case of historic snapshot). That seems kind of hacky TBH. 
I didn't much like marking it as committed then un-committing it. I think it's mostly an interface issue though. I'd rather say SnapBuildPushPrepareTransaction and SnapBuildPopPreparedTransaction or something, to make it clear what we're doing. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
> On 20 Mar 2017, at 11:32, Craig Ringer wrote: > > On 19 March 2017 at 21:26, Petr Jelinek wrote: > >> I think only genam would need changes to do two-phase scan for this as >> the catalog scans should ultimately go there. It's going to slow down >> things but we could limit the impact by doing the two-phase scan only >> when historical snapshot is in use and the tx being decoded changed >> catalogs (we already have global knowledge of the first one, and it >> would be trivial to add the second one as we have local knowledge of >> that as well). > > > TBH, I have no idea how to approach the genam changes for the proposed > double-scan method. It sounds like Stas has some idea how to proceed > though (right?) > I thought about having a special field (or reusing one of the existing fields) in the snapshot struct to force filtering xmax > snap->xmax or xmin = snap->xmin as Petr suggested. Then this logic can reside in ReorderBufferCommit(). However, this does not solve the problem with the catcache, so I'm looking into it right now. > On 17 Mar 2017, at 05:38, Craig Ringer wrote: > > On 16 March 2017 at 19:52, Stas Kelvich wrote: > >> >> I’m working right now on issue with building snapshots for decoding prepared >> tx. >> I hope I'll send updated patch later today. > > > Great. > > What approach are you taking? Just as before, I mark this transaction committed in the snapbuilder, but after decoding I delete this transaction from xip (which holds committed transactions in the case of a historic snapshot). > -- > Craig Ringer http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
On 20/03/17 09:32, Craig Ringer wrote: > On 19 March 2017 at 21:26, Petr Jelinek wrote: > >> I think only genam would need changes to do two-phase scan for this as >> the catalog scans should ultimately go there. It's going to slow down >> things but we could limit the impact by doing the two-phase scan only >> when historical snapshot is in use and the tx being decoded changed >> catalogs (we already have global knowledge of the first one, and it >> would be trivial to add the second one as we have local knowledge of >> that as well). > > We'll also have to clobber caches after we finish decoding a 2pc xact, > since we don't know those changes are visible to other xacts and can't > guarantee they'll ever be (if it aborts). > AFAIK the reorder buffer already does that. > That's going to be "interesting" when trying to decode interleaved > transaction streams since we can't afford to clobber caches whenever > we see an xlog record from a different xact. We'll probably have to > switch to linear decoding with reordering when someone makes catalog > changes. We may need something that allows for representing multiple parallel transactions in a single process and a cheap way of switching between them (ie, similar things we need for autonomous transactions). But that's not something the current patch has to deal with. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
On 19 March 2017 at 21:26, Petr Jelinek wrote: > I think only genam would need changes to do two-phase scan for this as > the catalog scans should ultimately go there. It's going to slow down > things but we could limit the impact by doing the two-phase scan only > when historical snapshot is in use and the tx being decoded changed > catalogs (we already have global knowledge of the first one, and it > would be trivial to add the second one as we have local knowledge of > that as well). We'll also have to clobber caches after we finish decoding a 2pc xact, since we don't know those changes are visible to other xacts and can't guarantee they'll ever be (if it aborts). That's going to be "interesting" when trying to decode interleaved transaction streams since we can't afford to clobber caches whenever we see an xlog record from a different xact. We'll probably have to switch to linear decoding with reordering when someone makes catalog changes. TBH, I have no idea how to approach the genam changes for the proposed double-scan method. It sounds like Stas has some idea how to proceed though (right?) -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
On 17 March 2017 at 23:59, Robert Haas wrote: > But that lock could need to be held for an unbounded period of time - > as long as decoding takes to complete - which seems pretty > undesirable. Yeah. We could use a recovery-conflict-like mechanism to signal the decoding session that someone wants to abort the xact, but it gets messy. > Worse still, the same problem will arise if you > eventually want to start decoding ordinary, non-2PC transactions that > haven't committed yet, which I think is something we definitely want > to do eventually; the current handling of bulk loads or bulk updates > leads to significant latency. Yeah. If it weren't for that, I'd probably still just pursue locking. But you're right that we'll have to solve this sooner or later. I'll admit I hoped for later. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
On 17/03/17 03:34, Craig Ringer wrote: > On 17 March 2017 at 08:10, Stas Kelvich wrote: > >> While working on this i’ve spotted quite a nasty corner case with aborted >> prepared >> transaction. I have some not that great ideas how to fix it, but maybe i >> blurred my >> view and missed something. So want to ask here at first. >> >> Suppose we created a table, then in 2pc tx we are altering it and after that >> aborting tx. >> So pg_class will have something like this: >> >> xmin | xmax | relname >> 100 | 200 | mytable >> 200 | 0 | mytable >> >> After previous abort, tuple (100,200,mytable) becomes visible and if we will >> alter table >> again then xmax of first tuple will be set current xid, resulting in >> following table: >> >> xmin | xmax | relname >> 100 | 300 | mytable >> 200 | 0 | mytable >> 300 | 0 | mytable >> >> In that moment we’ve lost information that first tuple was deleted by our >> prepared tx. > > Right. And while the prepared xact has aborted, we don't control when > it aborts and when those overwrites can start happening. We can and > should check if a 2pc xact is aborted before we start decoding it so > we can skip decoding it if it's already aborted, but it could be > aborted *while* we're decoding it, then have data needed for its > snapshot clobbered. > > This hasn't mattered in the past because prepared xacts (and > especially aborted 2pc xacts) have never needed snapshots, we've never > needed to do something from the perspective of a prepared xact. > > I think we'll probably need to lock the 2PC xact so it cannot be > aborted or committed while we're decoding it, until we finish decoding > it. So we lock it, then check if it's already aborted/already > committed/in progress. If it's aborted, treat it like any normal > aborted xact. If it's committed, treat it like any normal committed > xact. If it's in progress, keep the lock and decode it. 
> > People using logical decoding for 2PC will presumably want to control > 2PC via logical decoding, so they're not so likely to mind such a > lock. > >> * Try at first to scan catalog filtering out tuples with xmax bigger than >> snapshot->xmax >> as it was possibly deleted by our tx. Than if nothing found scan in a usual >> way. > > I don't think that'll be at all viable with the syscache/relcache > machinery. Way too intrusive. > I think only genam would need changes to do a two-phase scan for this, as the catalog scans should ultimately go there. It's going to slow things down, but we could limit the impact by doing the two-phase scan only when a historical snapshot is in use and the tx being decoded changed catalogs (we already have global knowledge of the first one, and it would be trivial to add the second one, as we have local knowledge of that as well). What I think would be a better strategy than filtering out by xmax is filtering "in" by xmin, though. Meaning that the first scan would return only tuples modified by the current tx which are visible in the snapshot, and the second scan would return the other visible tuples. That way whatever the decoded tx has seen should always win. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
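Petr's "filter in by xmin" idea can be sketched as a toy two-pass scan. This is illustrative Python, not genam.c; tuples are modelled as (xmin, xmax, payload) triples and visibility is crudely reduced to "xmax == 0", which ignores the aborted-deleter subtlety discussed upthread. The point shown is just the ordering guarantee: the decoded xact's own row versions come back first, so they win over newer versions.

```python
# Toy two-pass catalog scan: pass 1 returns tuples the decoded xact
# itself wrote, pass 2 returns the remaining visible tuples.
def two_pass_scan(tuples, decoding_xid):
    first = [t for t in tuples if t[0] == decoding_xid and t[1] == 0]
    second = [t for t in tuples if t[0] != decoding_xid and t[1] == 0]
    return first + second

# The pg_class history from the example upthread, after the later ALTER:
pg_class = [
    (100, 300, "mytable"),  # original row, xmax overwritten by xid 300
    (200, 0, "mytable"),    # version written by the prepared xact 200
    (300, 0, "mytable"),    # version written by the later ALTER
]

# Decoding xid 200: its own version is returned first, shadowing the rest.
assert two_pass_scan(pg_class, 200)[0] == (200, 0, "mytable")
```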
Re: [HACKERS] logical decoding of two-phase transactions
On Thu, Mar 16, 2017 at 10:34 PM, Craig Ringer wrote: > On 17 March 2017 at 08:10, Stas Kelvich wrote: >> While working on this i’ve spotted quite a nasty corner case with aborted >> prepared >> transaction. I have some not that great ideas how to fix it, but maybe i >> blurred my >> view and missed something. So want to ask here at first. >> >> Suppose we created a table, then in 2pc tx we are altering it and after that >> aborting tx. >> So pg_class will have something like this: >> >> xmin | xmax | relname >> 100 | 200 | mytable >> 200 | 0 | mytable >> >> After previous abort, tuple (100,200,mytable) becomes visible and if we will >> alter table >> again then xmax of first tuple will be set current xid, resulting in >> following table: >> >> xmin | xmax | relname >> 100 | 300 | mytable >> 200 | 0 | mytable >> 300 | 0 | mytable >> >> In that moment we’ve lost information that first tuple was deleted by our >> prepared tx. > > Right. And while the prepared xact has aborted, we don't control when > it aborts and when those overwrites can start happening. We can and > should check if a 2pc xact is aborted before we start decoding it so > we can skip decoding it if it's already aborted, but it could be > aborted *while* we're decoding it, then have data needed for its > snapshot clobbered. > > This hasn't mattered in the past because prepared xacts (and > especially aborted 2pc xacts) have never needed snapshots, we've never > needed to do something from the perspective of a prepared xact. > > I think we'll probably need to lock the 2PC xact so it cannot be > aborted or committed while we're decoding it, until we finish decoding > it. So we lock it, then check if it's already aborted/already > committed/in progress. If it's aborted, treat it like any normal > aborted xact. If it's committed, treat it like any normal committed > xact. If it's in progress, keep the lock and decode it. 
But that lock could need to be held for an unbounded period of time - as long as decoding takes to complete - which seems pretty undesirable. Worse still, the same problem will arise if you eventually want to start decoding ordinary, non-2PC transactions that haven't committed yet, which I think is something we definitely want to do eventually; the current handling of bulk loads or bulk updates leads to significant latency. You're not going to be able to tell an active transaction that it isn't allowed to abort until you get done with it, and I don't really think you should be allowed to lock out 2PC aborts for long periods of time either. That's going to stink for users. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] logical decoding of two-phase transactions
On 16 March 2017 at 19:52, Stas Kelvich wrote: > > I’m working right now on an issue with building snapshots for decoding prepared > tx. > I hope I'll send an updated patch later today. Great. What approach are you taking? It looks like the snapshot builder actually does most of the work we need for this already, maintaining a stack of snapshots we can use. It might be as simple as invalidating the relcache/syscache when we exit (and enter?) decoding of a prepared 2pc xact, since it violates the usual assumption of logical decoding that we decode things strictly in commit-time order. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
On 17 March 2017 at 08:10, Stas Kelvich wrote: > While working on this I’ve spotted quite a nasty corner case with aborted prepared transaction. I have some not that great ideas how to fix it, but maybe I blurred my view and missed something. So I want to ask here first. > > Suppose we created a table, then in a 2pc tx we are altering it and after that aborting the tx. So pg_class will have something like this: > > xmin | xmax | relname > 100 | 200 | mytable > 200 | 0 | mytable > > After the previous abort, tuple (100,200,mytable) becomes visible and if we alter the table again then the xmax of the first tuple will be set to the current xid, resulting in the following table: > > xmin | xmax | relname > 100 | 300 | mytable > 200 | 0 | mytable > 300 | 0 | mytable > > At that moment we’ve lost information that the first tuple was deleted by our prepared tx. Right. And while the prepared xact has aborted, we don't control when it aborts and when those overwrites can start happening. We can and should check if a 2pc xact is aborted before we start decoding it so we can skip decoding it if it's already aborted, but it could be aborted *while* we're decoding it, then have data needed for its snapshot clobbered. This hasn't mattered in the past because prepared xacts (and especially aborted 2pc xacts) have never needed snapshots; we've never needed to do something from the perspective of a prepared xact. I think we'll probably need to lock the 2PC xact so it cannot be aborted or committed while we're decoding it, until we finish decoding it. So we lock it, then check if it's already aborted/already committed/in progress. If it's aborted, treat it like any normal aborted xact. If it's committed, treat it like any normal committed xact. If it's in progress, keep the lock and decode it. People using logical decoding for 2PC will presumably want to control 2PC via logical decoding, so they're not so likely to mind such a lock. 
> * Try at first to scan the catalog filtering out tuples with xmax bigger than > snapshot->xmax, as they were possibly deleted by our tx. Then if nothing is found, scan in the > usual way. I don't think that'll be at all viable with the syscache/relcache machinery. Way too intrusive. > * Do not decode such transaction at all. Yes, that's what I'd like to do, per above. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
>> On 2 Mar 2017, at 11:00, Craig Ringer wrote: >> >> BTW, I've been reviewing the patch in more detail. Other than a bunch >> of copy-and-paste that I'm cleaning up, the main issue I've found is >> that in DecodePrepare, you call: >> >> SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid, >> parsed->nsubxacts, parsed->subxacts); >> >> but I am not convinced it is correct to call it at PREPARE TRANSACTION >> time, only at COMMIT PREPARED time. We want to see the 2pc prepared >> xact's state when decoding it, but there might be later commits that >> cannot yet see that state and shouldn't have it visible in their >> snapshots. > > Agree, that is a problem. That allows us to decode this PREPARE, but after that > it is better to mark this transaction as running in the snapshot or perform prepare > decoding with some kind of copied-and-edited snapshot. I’ll have a look at > this. > While working on this I’ve spotted quite a nasty corner case with aborted prepared transaction. I have some not that great ideas how to fix it, but maybe I blurred my view and missed something. So I want to ask here first. Suppose we created a table, then in a 2pc tx we are altering it and after that aborting the tx. So pg_class will have something like this: xmin | xmax | relname 100 | 200 | mytable 200 | 0 | mytable After the previous abort, tuple (100,200,mytable) becomes visible and if we alter the table again then the xmax of the first tuple will be set to the current xid, resulting in the following table: xmin | xmax | relname 100 | 300 | mytable 200 | 0 | mytable 300 | 0 | mytable At that moment we’ve lost information that the first tuple was deleted by our prepared tx. And from the POV of the historic snapshot that will be constructed to decode the prepare, the first tuple is visible, but actually the second tuple should be used. Moreover, such a snapshot could see both tuples, violating oid uniqueness, but the heapscan stops after finding the first one. 
I see two possible workarounds here: * Try at first to scan the catalog filtering out tuples with xmax bigger than snapshot->xmax, as they were possibly deleted by our tx. Then if nothing is found, scan in the usual way. * Do not decode such transaction at all. If by the time we decode the prepare record we already know that it is aborted, then such decoding doesn’t make much sense. IMO the intended usage of logical 2pc decoding is to decide about commit/abort based on answers from logical subscribers/replicas. So there will be a barrier between prepare and commit/abort and such situations shouldn’t happen. -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
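To make the hazard above concrete, here is a tiny standalone model (plain Python with invented names, not PostgreSQL code) of the visibility check a historic snapshot effectively performs. Once the abort's xmax marker is overwritten by a later xid, a snapshot that treats the prepared xid 200 as committed sees two versions of the same pg_class row at once:

```python
def visible(tup, committed):
    """Toy MVCC visibility: a tuple is visible if its inserter committed
    and it was not deleted by a committed transaction."""
    xmin, xmax, _name = tup
    return xmin in committed and (xmax == 0 or xmax not in committed)

# Historic snapshot used to decode the prepared tx: it treats its own
# xid (200) as committed, but knows nothing about the later xid 300.
committed = {100, 200}

# pg_class rows right after the prepared tx altered the table ...
before = [(100, 200, "mytable"), (200, 0, "mytable")]
# ... and after the abort plus a second ALTER overwrote xmax with 300:
after = [(100, 300, "mytable"), (200, 0, "mytable"), (300, 0, "mytable")]

vis_before = [t for t in before if visible(t, committed)]
vis_after = [t for t in after if visible(t, committed)]
# vis_before contains exactly one version of mytable; vis_after contains
# two -- the lost xmax made the stale row visible to the snapshot again
```

This is exactly the duplicate that a syscache lookup would never notice, since the heapscan stops at the first match.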
Re: [HACKERS] logical decoding of two-phase transactions
> On 16 Mar 2017, at 14:44, Craig Ringer wrote: > > I'm going to try to pick this patch up and amend its interface per our > discussion earlier, see if I can get it committable. I’m working right now on an issue with building snapshots for decoding prepared tx. I hope I'll send an updated patch later today. > -- > Craig Ringer http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
On 15 March 2017 at 15:42, Petr Jelinek wrote: > Thinking about this some more. Why can't we use the same mechanism > standby uses, ie, use xid to identify the 2PC? It pushes work onto the downstream, which has to keep a mapping in a crash-safe, persistent form. We'll be doing a flush of some kind anyway so we can report successful prepare to the upstream, so an additional flush of an SLRU might not be so bad for a postgres downstream. And I guess any other clients will have some kind of downstream persistent mapping to use. So I think I have a mild preference for recording the gid on 2pc commit and abort records in the master's WAL, where it's very cheap and simple. But I agree that just sending the xid is a viable option if that falls through. I'm going to try to pick this patch up and amend its interface per our discussion earlier, see if I can get it committable. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
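The downstream-side mapping being weighed here would amount to something like the following minimal sketch (illustrative Python; the class name, file format, and gid scheme are all invented — a real subscriber would need to make the entry durable before confirming the PREPARE to the upstream):

```python
import json
import os


class TwoPCMap:
    """Durable (origin, remote_xid) -> local gid map that a downstream
    would keep between PREPARE and COMMIT/ROLLBACK PREPARED."""

    def __init__(self, path):
        self.path = path
        self.map = {}
        if os.path.exists(path):
            with open(path) as f:
                self.map = json.load(f)

    def remember(self, origin, xid):
        """On applying PREPARE: pick a local gid and persist the mapping."""
        gid = "%s-%d" % (origin, xid)
        self.map["%s:%d" % (origin, xid)] = gid
        self._flush()  # must hit disk before confirming the prepare
        return gid

    def resolve(self, origin, xid):
        """On COMMIT/ROLLBACK PREPARED: look up and forget the gid."""
        gid = self.map.pop("%s:%d" % (origin, xid))
        self._flush()
        return gid

    def _flush(self):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.map, f)
        os.replace(tmp, self.path)  # atomic rename for crash safety
```

After a crash the subscriber reloads the file and can still issue COMMIT PREPARED under the right gid; that persistence cost is what is being compared against simply logging the gid in the upstream's WAL.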
Re: [HACKERS] logical decoding of two-phase transactions
On 02/03/17 17:34, Petr Jelinek wrote: > On 02/03/17 13:23, Craig Ringer wrote: >> On 2 March 2017 at 16:20, Stas Kelvich wrote: >>> On 2 Mar 2017, at 11:00, Craig Ringer wrote: We already have it, because we just decoded the PREPARE TRANSACTION. I'm preparing a patch revision to demonstrate this. >>> >>> Yes, we already have it, but if the server reboots between commit prepared (all >>> prepared state is gone) and decoding of this commit prepared then we lose >>> that mapping, don’t we? >> >> I was about to explain how restart_lsn works again, and how that would >> mean we'd always re-decode the PREPARE TRANSACTION before any COMMIT >> PREPARED or ROLLBACK PREPARED on crash. But... >> >> Actually, the way you've implemented it, that won't be the case. You >> treat PREPARE TRANSACTION as a special-case of COMMIT, and the client >> will presumably send replay confirmation after it has applied the >> PREPARE TRANSACTION. In fact, it has to if we want 2PC to work with >> synchronous replication. This will allow restart_lsn to advance to >> after the PREPARE TRANSACTION record if there's no other older xact >> and we see a suitable xl_running_xacts record. So we wouldn't decode >> the PREPARE TRANSACTION again after restart. >> Thinking about this some more. Why can't we use the same mechanism standby uses, ie, use xid to identify the 2PC? If the output plugin cares about doing 2PC in two phases, it can send the xid as part of its protocol (like the PG10 logical replication and pglogical do already) and simply remember on the downstream the remote node + remote xid of the 2PC in progress. That way there is no need for gids in COMMIT PREPARED and this patch would be much simpler (as the tracking would be left to the actual replication implementation as opposed to decoding). Or am I missing something? 
-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
On 3/2/17 11:34 AM, Petr Jelinek wrote: > On 02/03/17 13:23, Craig Ringer wrote: >> >> I particularly dislike calling a commit callback for an abort. So I'd >> like to look further into the interface side of things. I'm inclined >> to suggest adding new callbacks for 2pc prepare, commit and rollback, >> and if the output plugin doesn't set them fall back to the existing >> behaviour. Plugins that aren't interested in 2PC (think ETL) should >> probably not have to deal with it, we might as well just send them >> only the actually committed xacts, when they commit. >> > > I think this is a good approach to handle it. It's been a while since there was any activity on this thread and a very long time since the last patch. As far as I can see there are far more questions than answers in this thread. If you need more time to produce a patch, please post an explanation for the delay and a schedule for the new patch. If no patch or explanation is posted by 2017-03-17 AoE I will mark this submission "Returned with Feedback". -- -David da...@pgmasters.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
On 02/03/17 13:23, Craig Ringer wrote: > On 2 March 2017 at 16:20, Stas Kelvich wrote: >> >>> On 2 Mar 2017, at 11:00, Craig Ringer wrote: >>> >>> We already have it, because we just decoded the PREPARE TRANSACTION. >>> I'm preparing a patch revision to demonstrate this. >> >> Yes, we already have it, but if the server reboots between commit prepared (all >> prepared state is gone) and decoding of this commit prepared then we lose >> that mapping, don’t we? > > I was about to explain how restart_lsn works again, and how that would > mean we'd always re-decode the PREPARE TRANSACTION before any COMMIT > PREPARED or ROLLBACK PREPARED on crash. But... > > Actually, the way you've implemented it, that won't be the case. You > treat PREPARE TRANSACTION as a special-case of COMMIT, and the client > will presumably send replay confirmation after it has applied the > PREPARE TRANSACTION. In fact, it has to if we want 2PC to work with > synchronous replication. This will allow restart_lsn to advance to > after the PREPARE TRANSACTION record if there's no other older xact > and we see a suitable xl_running_xacts record. So we wouldn't decode > the PREPARE TRANSACTION again after restart. > Unless we just don't let restart_lsn go forward if there is a 2pc that wasn't decoded yet (twopcs store the prepare lsn), but that's probably too much of a kludge. > > It's also worth noting that with your current approach, 2PC xacts will > produce two calls to the output plugin's commit() callback, once for > the PREPARE TRANSACTION and another for the COMMIT PREPARED or > ROLLBACK PREPARED, the latter two with a faked-up state. I'm not a > huge fan of that. It's not entirely backward compatible since it > violates the previously safe assumption that there's a 1:1 > relationship between begin and commit callbacks with no interleaving, > for one thing, and I think it's also a bit misleading to send a > PREPARE TRANSACTION to a callback that could previously only receive a > true commit. 
> > I particularly dislike calling a commit callback for an abort. So I'd > like to look further into the interface side of things. I'm inclined > to suggest adding new callbacks for 2pc prepare, commit and rollback, > and if the output plugin doesn't set them fall back to the existing > behaviour. Plugins that aren't interested in 2PC (think ETL) should > probably not have to deal with it, we might as well just send them > only the actually committed xacts, when they commit. > I think this is a good approach to handle it. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
On 2 March 2017 at 16:20, Stas Kelvich wrote: > >> On 2 Mar 2017, at 11:00, Craig Ringer wrote: >> >> We already have it, because we just decoded the PREPARE TRANSACTION. >> I'm preparing a patch revision to demonstrate this. > > Yes, we already have it, but if the server reboots between commit prepared (all > prepared state is gone) and decoding of this commit prepared then we lose > that mapping, don’t we? I was about to explain how restart_lsn works again, and how that would mean we'd always re-decode the PREPARE TRANSACTION before any COMMIT PREPARED or ROLLBACK PREPARED on crash. But... Actually, the way you've implemented it, that won't be the case. You treat PREPARE TRANSACTION as a special-case of COMMIT, and the client will presumably send replay confirmation after it has applied the PREPARE TRANSACTION. In fact, it has to if we want 2PC to work with synchronous replication. This will allow restart_lsn to advance to after the PREPARE TRANSACTION record if there's no other older xact and we see a suitable xl_running_xacts record. So we wouldn't decode the PREPARE TRANSACTION again after restart. Hm. That's actually a pretty good reason to xlog the gid for 2pc rollback and commit if we're at wal_level >= logical. Being able to advance restart_lsn and avoid the re-decoding work is a big win. Come to think of it, we have to advance the client replication identifier as part of PREPARE TRANSACTION anyway, otherwise we'd try to repeat and re-prepare the same xact on crash recovery. Given that, I withdraw my objection to adding the gid to commit and rollback xlog records, though it should only be done if they're 2pc commit/abort, and only if XLogLogicalInfoActive(). >> BTW, I've been reviewing the patch in more detail. 
Other than a bunch >> of copy-and-paste that I'm cleaning up, the main issue I've found is >> that in DecodePrepare, you call: >> >> SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid, >> parsed->nsubxacts, parsed->subxacts); >> >> but I am not convinced it is correct to call it at PREPARE TRANSACTION >> time, only at COMMIT PREPARED time. We want to see the 2pc prepared >> xact's state when decoding it, but there might be later commits that >> cannot yet see that state and shouldn't have it visible in their >> snapshots. > > Agree, that is a problem. That allows us to decode this PREPARE, but after that > it is better to mark this transaction as running in the snapshot or perform prepare > decoding with some kind of copied-and-edited snapshot. I’ll have a look at > this. Thanks. It's also worth noting that with your current approach, 2PC xacts will produce two calls to the output plugin's commit() callback, once for the PREPARE TRANSACTION and another for the COMMIT PREPARED or ROLLBACK PREPARED, the latter two with a faked-up state. I'm not a huge fan of that. It's not entirely backward compatible since it violates the previously safe assumption that there's a 1:1 relationship between begin and commit callbacks with no interleaving, for one thing, and I think it's also a bit misleading to send a PREPARE TRANSACTION to a callback that could previously only receive a true commit. I particularly dislike calling a commit callback for an abort. So I'd like to look further into the interface side of things. I'm inclined to suggest adding new callbacks for 2pc prepare, commit and rollback, and if the output plugin doesn't set them fall back to the existing behaviour. Plugins that aren't interested in 2PC (think ETL) should probably not have to deal with it, we might as well just send them only the actually committed xacts, when they commit. 
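As a sketch of that interface idea (illustrative Python with hypothetical names; the real output plugin API is C), the decoder could buffer a prepared transaction's changes and fall back to the plain commit callback when no 2PC callbacks are registered:

```python
class OutputPlugin:
    """Hypothetical plugin descriptor: the 2PC callbacks are optional."""
    def __init__(self, commit_cb, prepare_cb=None,
                 commit_prepared_cb=None, abort_prepared_cb=None):
        self.commit_cb = commit_cb
        self.prepare_cb = prepare_cb
        self.commit_prepared_cb = commit_prepared_cb
        self.abort_prepared_cb = abort_prepared_cb


pending = {}  # gid -> buffered changes, for plugins without 2PC support


def decode_prepare(plugin, gid, changes):
    if plugin.prepare_cb:
        plugin.prepare_cb(gid, changes)  # 2PC-aware: stream at PREPARE
    else:
        pending[gid] = changes           # hold until the outcome is known


def decode_commit_prepared(plugin, gid):
    if plugin.commit_prepared_cb:
        plugin.commit_prepared_cb(gid)
    else:
        plugin.commit_cb(pending.pop(gid))  # delivered as a plain commit


def decode_rollback_prepared(plugin, gid):
    if plugin.abort_prepared_cb:
        plugin.abort_prepared_cb(gid)
    else:
        pending.pop(gid, None)  # aborted xact is simply never delivered


# An ETL-style plugin that only knows about commits:
received = []
etl = OutputPlugin(commit_cb=received.append)
decode_prepare(etl, "gid1", ["INSERT ..."])
decode_rollback_prepared(etl, "gid1")   # nothing reaches the plugin
decode_prepare(etl, "gid2", ["UPDATE ..."])
decode_commit_prepared(etl, "gid2")     # now delivered as one commit
```

The ETL plugin never sees the aborted transaction and sees the committed one exactly once, preserving the old 1:1 begin/commit contract.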
-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
> On 2 Mar 2017, at 11:00, Craig Ringer wrote: > > We already have it, because we just decoded the PREPARE TRANSACTION. > I'm preparing a patch revision to demonstrate this. Yes, we already have it, but if the server reboots between commit prepared (all prepared state is gone) and decoding of this commit prepared then we lose that mapping, don’t we? > BTW, I've been reviewing the patch in more detail. Other than a bunch > of copy-and-paste that I'm cleaning up, the main issue I've found is > that in DecodePrepare, you call: > > SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid, > parsed->nsubxacts, parsed->subxacts); > > but I am not convinced it is correct to call it at PREPARE TRANSACTION > time, only at COMMIT PREPARED time. We want to see the 2pc prepared > xact's state when decoding it, but there might be later commits that > cannot yet see that state and shouldn't have it visible in their > snapshots. Agree, that is a problem. That allows us to decode this PREPARE, but after that it is better to mark this transaction as running in the snapshot or perform prepare decoding with some kind of copied-and-edited snapshot. I’ll have a look at this. -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
On 2 March 2017 at 16:00, Craig Ringer wrote: > What about if we ROLLBACK PREPARED after > we made the snapshot visible? Yeah, I'm pretty sure that's going to be a problem actually. You're telling the snapshot builder that an xact committed at PREPARE TRANSACTION time. If we then ROLLBACK PREPARED, we're in a mess. It looks like it'll cause issues with catalogs, user-catalog tables, etc. I suspect we need to construct a temporary snapshot to decode PREPARE TRANSACTION then discard it. If we later COMMIT PREPARED we should perform the current steps to merge the snapshot state in. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
On 2 March 2017 at 15:27, Stas Kelvich wrote: > >> On 2 Mar 2017, at 01:20, Petr Jelinek wrote: >> >> The info gets removed on COMMIT PREPARED but at that point >> there is no real difference between replicating it as 2pc or 1pc since >> the 2pc behavior is for all intents and purposes lost at that point. >> > > If we are doing 2pc and COMMIT PREPARED happens then we should > replicate that without transaction body to the receiving servers since tx > is already prepared on them with some GID. So we need a way to construct > that GID. We already have it, because we just decoded the PREPARE TRANSACTION. I'm preparing a patch revision to demonstrate this. BTW, I've been reviewing the patch in more detail. Other than a bunch of copy-and-paste that I'm cleaning up, the main issue I've found is that in DecodePrepare, you call: SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid, parsed->nsubxacts, parsed->subxacts); but I am not convinced it is correct to call it at PREPARE TRANSACTION time, only at COMMIT PREPARED time. We want to see the 2pc prepared xact's state when decoding it, but there might be later commits that cannot yet see that state and shouldn't have it visible in their snapshots. Imagine, say BEGIN; ALTER TABLE t ADD COLUMN ... INSERT INTO 't' ... PREPARE TRANSACTION 'x'; BEGIN; INSERT INTO t ...; COMMIT; COMMIT PREPARED 'x'; We want to see the new column when decoding the prepared xact, but _not_ when decoding the subsequent xact between the prepare and commit. This particular case cannot occur because the lock held by ALTER TABLE blocks the INSERT in the other xact, but how sure are you that there are no other snapshot issues that could arise if we promote a snapshot to visible early? What about if we ROLLBACK PREPARED after we made the snapshot visible? The tests don't appear to cover logical decoding 2PC sessions that do DDL at all. I emphasised that that would be one of the main problem areas when we originally discussed this. 
I'll look at adding some, since I think this is one of the areas that's most likely to find issues. > It seems that last ~10 messages I’m failing to explain some points about this > topic. Or, maybe, I’m failing to understand some points. Can we maybe setup > skype call to discuss this and post summary here? Craig? Peter? Let me prep an updated patch. Time zones make it rather hard to do voice; I'm in +0800 Western Australia, Petr is in +0200... -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
> On 2 Mar 2017, at 01:20, Petr Jelinek wrote: > > The info gets removed on COMMIT PREPARED but at that point > there is no real difference between replicating it as 2pc or 1pc since > the 2pc behavior is for all intents and purposes lost at that point. > If we are doing 2pc and COMMIT PREPARED happens then we should replicate that without the transaction body to the receiving servers, since the tx is already prepared on them with some GID. So we need a way to construct that GID. It seems that for the last ~10 messages I’ve been failing to explain some points about this topic. Or, maybe, I’m failing to understand some points. Can we maybe set up a Skype call to discuss this and post a summary here? Craig? Peter? -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
On 2 March 2017 at 06:20, Petr Jelinek wrote: > If I understand you correctly you are saying that if PREPARE is being > decoded, we can load the GID from the 2pc info in memory about the > specific 2pc. The info gets removed on COMMIT PREPARED but at that point > there is no real difference between replicating it as 2pc or 1pc since > the 2pc behavior is for all intents and purposes lost at that point. > Works for me. I guess the hard part is knowing if COMMIT PREPARED > happened at the time PREPARE is decoded, but the existence of the needed > info could probably be used for that. Right. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
On 01/03/17 10:24, Craig Ringer wrote: > On 9 February 2017 at 21:23, Stas Kelvich wrote: > >>> On 2 Feb 2017, at 00:35, Craig Ringer wrote: >>> >>> Stas was concerned about what happens in logical decoding if we crash >>> between PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back >>> and decode the whole txn again anyway so it doesn't matter. >> >> Not exactly. It seems that in previous discussions we were not on the same >> page, probably due to unclear arguments by me. >> >> From my point of view there are no problems (or at least no new problems >> compared to ordinary 2PC) with preparing transactions on slave servers with >> something like “#{xid}#{node_id}” instead of GID if the issuing node is the >> coordinator of that transaction. In case of failure, restart, crash we have >> the same options about deciding what to do with uncommitted transactions. > > But we don't *need* to do that. We have access to the GID of the 2PC > xact from PREPARE TRANSACTION until COMMIT PREPARED, after which we > have no need for it. So we can always use the user-supplied GID. > >> I performed some tests to understand the real impact on the size of WAL. I've >> compared postgres master with wal_level = logical, after 3M 2PC >> transactions with patched postgres where GIDs are stored inside the commit >> record too. > > Why do you do this? You don't need to. You can look the GID up from > the 2pc status table in memory unless the master already did COMMIT > PREPARED, in which case you can just decode it as a normal xact as if > it were never 2pc in the first place. > > I don't think I've managed to make this point by description, so I'll > try to modify your patch to demonstrate. > If I understand you correctly you are saying that if PREPARE is being decoded, we can load the GID from the 2pc info in memory about the specific 2pc. 
The info gets removed on COMMIT PREPARED but at that point there is no real difference between replicating it as 2pc or 1pc since the 2pc behavior is for all intents and purposes lost at that point. Works for me. I guess the hard part is knowing if COMMIT PREPARED happened at the time PREPARE is decoded, but the existence of the needed info could probably be used for that. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
On 9 February 2017 at 21:23, Stas Kelvich wrote: >> On 2 Feb 2017, at 00:35, Craig Ringer wrote: >> >> Stas was concerned about what happens in logical decoding if we crash >> between PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back >> and decode the whole txn again anyway so it doesn't matter. > > Not exactly. It seems that in previous discussions we were not on the same > page, probably due to unclear arguments by me. > > From my point of view there are no problems (or at least no new problems > compared to ordinary 2PC) with preparing transactions on slave servers with > something like “#{xid}#{node_id}” instead of GID if the issuing node is the > coordinator of that transaction. In case of failure, restart, crash we have > the same options about deciding what to do with uncommitted transactions. But we don't *need* to do that. We have access to the GID of the 2PC xact from PREPARE TRANSACTION until COMMIT PREPARED, after which we have no need for it. So we can always use the user-supplied GID. > I performed some tests to understand the real impact on the size of WAL. I've > compared postgres master with wal_level = logical, after 3M 2PC transactions > with patched postgres where GIDs are stored inside the commit record too. Why do you do this? You don't need to. You can look the GID up from the 2pc status table in memory unless the master already did COMMIT PREPARED, in which case you can just decode it as a normal xact as if it were never 2pc in the first place. I don't think I've managed to make this point by description, so I'll try to modify your patch to demonstrate. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding of two-phase transactions
> On 31 Jan 2017, at 12:22, Craig Ringer wrote: > > Personally I don't think lack of access to the GID justifies blocking 2PC > logical decoding. It can be added separately. But it'd be nice to have > especially if it's cheap. Agreed. > On 2 Feb 2017, at 00:35, Craig Ringer wrote: > > Stas was concerned about what happens in logical decoding if we crash between > PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back and decode > the whole txn again anyway so it doesn't matter. Not exactly. It seems that in previous discussions we were not on the same page, probably due to unclear arguments by me. From my point of view there are no problems (or at least no new problems compared to ordinary 2PC) with preparing transactions on slave servers with something like “#{xid}#{node_id}” instead of a GID, if the issuing node is the coordinator of that transaction. In case of failure, restart, or crash we have the same options about deciding what to do with uncommitted transactions. My concern is about the situation with an external coordinator. That scenario is quite important for users of postgres native 2pc, notably J2EE users. Suppose a user (or his framework) issues “prepare transaction ‘mytxname’;” to servers with ordinary synchronous physical replication. If the master crashes and a replica is promoted, then the user can reconnect to it and commit/abort that transaction using his GID. And it is unclear to me how to achieve the same behaviour with logical replication of 2pc without the GID in the commit record. If we prepare with “#{xid}#{node_id}” on acceptor nodes, then if the donor node crashes we’ll lose the mapping between the user’s gid and our internal gid; conversely, we can prepare with the user's GID on acceptors, but then we will not know that GID on the donor during commit decode (by the time decoding happens all memory state is already gone and we can’t exchange our xid for the gid). I performed some tests to understand the real impact on the size of WAL. 
I've compared postgres master with wal_level = logical, after 3M 2PC transactions, against patched postgres where GIDs are stored inside the commit record too. Testing with 194-byte and 6-byte GIDs. (GID max size is 200 bytes) -master, 6-byte GID after 3M transactions: pg_current_xlog_location = 0/9572CB28 -patched, 6-byte GID after 3M transactions: pg_current_xlog_location = 0/96C442E0 so with 6-byte GIDs the difference in WAL size is less than 1% -master, 194-byte GID after 3M transactions: pg_current_xlog_location = 0/B7501578 -patched, 194-byte GID after 3M transactions: pg_current_xlog_location = 0/D8B43E28 and with 194-byte GIDs the difference in WAL size is about 18% So using big GIDs (as J2EE does) can cause notable WAL bloat, while small GIDs are almost unnoticeable. Maybe we can introduce a configuration option track_commit_gid, by analogy with track_commit_timestamp, and make that behaviour optional? Any objections to that? -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
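Those percentages can be re-derived from the reported LSNs, since an LSN of the form 'X/Y' is just a 64-bit WAL byte position written as two hex halves (a quick Python check of the arithmetic, not part of the patch):

```python
def lsn_to_bytes(lsn):
    """Parse a PostgreSQL LSN of the form 'X/Y' into a byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_growth_pct(master, patched):
    """Relative WAL growth of the patched run over the master run."""
    m, p = lsn_to_bytes(master), lsn_to_bytes(patched)
    return 100.0 * (p - m) / m


small = wal_growth_pct("0/9572CB28", "0/96C442E0")  # 6-byte GIDs
large = wal_growth_pct("0/B7501578", "0/D8B43E28")  # 194-byte GIDs
# small comes out below 1% and large at roughly 18%,
# matching the figures quoted in the message
```

This assumes both runs started from the same initial LSN; strictly it compares end positions, which is what the message does as well.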
Re: [HACKERS] logical decoding of two-phase transactions
On 02/04/2017 03:08 AM, Andres Freund wrote: > On 2017-02-03 18:47:23 -0500, Robert Haas wrote: >> On Fri, Feb 3, 2017 at 6:00 PM, Andres Freund wrote: >>> I still haven't seen a credible model for being able to apply a stream >>> of interleaved transactions that can roll back individually; I think we >>> really need the ability to have multiple transactions alive in one >>> backend for that. >> Hmm, yeah, that's a problem. That smells like autonomous transactions. > Unfortunately the last few proposals, like spawning backends, to deal > with autonomous xacts aren't really suitable for replication, unless you > only have very large ones. And it really needs to be an implementation > where ATs can freely be switched in between. On the other hand, a good > deal of problems (like locking) shouldn't be an issue, since there's > obviously a possible execution schedule. > > I suspect this'd need some low-level implementation close to xact.c that'd > allow switching between transactions. Let me add my two cents here: 1. We are using logical decoding in our multimaster and applying transactions concurrently with a pool of workers. Unlike asynchronous replication, in multimaster we need to perform voting for each transaction commit, so if transactions are applied by a single worker, then performance will be awful and, moreover, there is a big chance to get a "deadlock" where none of the workers can complete voting because different nodes are performing voting for different transactions. I could not say that there are no problems with this approach. There are definitely a lot of challenges. First of all we need to use a special DTM (distributed transaction manager) to provide consistent applying of transactions at different nodes. The second problem is once again related to the kind of "deadlock" explained above. Even if we apply transactions concurrently, it is still possible to get such a deadlock if we do not have enough workers.
This is why we allow launching extra workers dynamically (though ultimately this is limited by the maximal number of configured bgworkers). But in any case, I think that "parallel apply" is a "must have" mode for logical replication. 2. We have implemented autonomous transactions in PgPro EE. Unlike the proposal currently present at the commitfest, we execute an autonomous transaction within the same backend. So we are just storing and restoring the transaction context. Unfortunately it is also not so cheap an operation. An autonomous transaction should not see any changes done by the parent transaction (because the parent can be rolled back after commit of the autonomous transaction). But there are catalog and relation caches inside the backend, so we have to clean these caches before switching to the ATX. It is quite an expensive operation, and so the speed of execution of a PL/pgSQL function with an autonomous transaction is several orders of magnitude slower than without it. So autonomous transactions can be used for audits (it's the primary goal of using ATX in Oracle PL/SQL applications) but this mechanism is not efficient for concurrent execution of multiple transactions in one backend. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
On 2017-02-03 19:09:43 -0500, Robert Haas wrote: > On Fri, Feb 3, 2017 at 7:08 PM, Andres Freund wrote: > > On 2017-02-03 18:47:23 -0500, Robert Haas wrote: > >> On Fri, Feb 3, 2017 at 6:00 PM, Andres Freund wrote: > >> > I still haven't seen a credible model for being able to apply a stream > >> > of interleaved transactions that can roll back individually; I think we > >> > really need the ability to have multiple transactions alive in one > >> > backend for that. > >> > >> Hmm, yeah, that's a problem. That smells like autonomous transactions. > > > > Unfortunately the last few proposals, like spawning backends, to deal > > with autonomous xacts aren't really suitable for replication, unless you > > only have very large ones. And it really needs to be an implementation > > where ATs can freely be switched in between. On the other hand, a good > > deal of problems (like locking) shouldn't be an issue, since there's > > obviously a possible execution schedule. > > > > I suspect this'd need some low-level implementation close to xact.c that'd > > allow switching between transactions. > > Yeah. Well, I still feel like that's also how autonomous transactions > oughta work, but I realize that's not a unanimous viewpoint. :-) Same here ;)
Re: [HACKERS] logical decoding of two-phase transactions
On Fri, Feb 3, 2017 at 7:08 PM, Andres Freund wrote: > On 2017-02-03 18:47:23 -0500, Robert Haas wrote: >> On Fri, Feb 3, 2017 at 6:00 PM, Andres Freund wrote: >> > I still haven't seen a credible model for being able to apply a stream >> > of interleaved transactions that can roll back individually; I think we >> > really need the ability to have multiple transactions alive in one >> > backend for that. >> >> Hmm, yeah, that's a problem. That smells like autonomous transactions. > > Unfortunately the last few proposals, like spawning backends, to deal > with autonomous xacts aren't really suitable for replication, unless you > only have very large ones. And it really needs to be an implementation > where ATs can freely be switched in between. On the other hand, a good > deal of problems (like locking) shouldn't be an issue, since there's > obviously a possible execution schedule. > > I suspect this'd need some low-level implementation close to xact.c that'd > allow switching between transactions. Yeah. Well, I still feel like that's also how autonomous transactions oughta work, but I realize that's not a unanimous viewpoint. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] logical decoding of two-phase transactions
On 2017-02-03 18:47:23 -0500, Robert Haas wrote: > On Fri, Feb 3, 2017 at 6:00 PM, Andres Freund wrote: > > I still haven't seen a credible model for being able to apply a stream > > of interleaved transactions that can roll back individually; I think we > > really need the ability to have multiple transactions alive in one > > backend for that. > > Hmm, yeah, that's a problem. That smells like autonomous transactions. Unfortunately the last few proposals, like spawning backends, to deal with autonomous xacts aren't really suitable for replication, unless you only have very large ones. And it really needs to be an implementation where ATs can freely be switched in between. On the other hand, a good deal of problems (like locking) shouldn't be an issue, since there's obviously a possible execution schedule. I suspect this'd need some low-level implementation close to xact.c that'd allow switching between transactions. - Andres
Re: [HACKERS] logical decoding of two-phase transactions
On Fri, Feb 3, 2017 at 6:00 PM, Andres Freund wrote: > I still haven't seen a credible model for being able to apply a stream > of interleaved transactions that can roll back individually; I think we > really need the ability to have multiple transactions alive in one > backend for that. Hmm, yeah, that's a problem. That smells like autonomous transactions. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] logical decoding of two-phase transactions
On 2017-02-03 17:47:50 -0500, Robert Haas wrote: > On Thu, Feb 2, 2017 at 7:14 PM, Craig Ringer wrote: > > We could make reorder buffers persistent and shared between decoding > > sessions but it'd totally change the logical decoding model and create > > some other problems. It's certainly not a topic for this patch. So we > > can take it as given that we'll always restart decoding from BEGIN > > again at a crash. Sharing them seems unlikely (filtering and such would become a lot more complicated) and separate from persistency. I'm not sure however how it'd "totally change the logical decoding model"? Even if we'd not always restart decoding, we'd still have the option to add the information necessary to the spill files, so I'm unclear how persistency plays a role here? > OK, thanks for the explanation. I have never liked this design very > much, and told Andres so: big transactions are bound to cause > noticeable replication lag. But you're certainly right that it's not > a topic for this patch. Streaming and persistency of spill files are different topics, no? Either would have initially complicated things beyond the point of getting things into core - I'm all for adding them at some point. Persistent spill files (which'd also mean spilling small transactions at regular intervals) also have the issue that they make the spill format something that can't be adapted in bugfixes etc, and that we need to fsync them. I still haven't seen a credible model for being able to apply a stream of interleaved transactions that can roll back individually; I think we really need the ability to have multiple transactions alive in one backend for that. Andres
Re: [HACKERS] logical decoding of two-phase transactions
On Thu, Feb 2, 2017 at 7:14 PM, Craig Ringer wrote: > We could make reorder buffers persistent and shared between decoding > sessions but it'd totally change the logical decoding model and create > some other problems. It's certainly not a topic for this patch. So we > can take it as given that we'll always restart decoding from BEGIN > again at a crash. OK, thanks for the explanation. I have never liked this design very much, and told Andres so: big transactions are bound to cause noticeable replication lag. But you're certainly right that it's not a topic for this patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] logical decoding of two-phase transactions
On 3 February 2017 at 03:34, Robert Haas wrote: > On Wed, Feb 1, 2017 at 4:35 PM, Craig Ringer wrote: >> Right. Per my comments upthread I don't see why we need to add anything more >> to WAL here. >> >> Stas was concerned about what happens in logical decoding if we crash >> between PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back >> and decode the whole txn again anyway so it doesn't matter. >> >> We can just track it on ReorderBufferTxn when we see it at PREPARE >> TRANSACTION time. > > Oh, hmm. I guess if that's how it works then we don't need it in WAL > after all. I'm not sure that re-decoding the already-prepared > transaction is a very good plan, but if that's what we're doing anyway > this patch probably shouldn't change it. We don't have much choice at the moment. Logical decoding must restart from the xl_running_xacts record most recently prior to the xid allocation for the oldest xact the client hasn't confirmed receipt of decoded data + commit for. That's because reorder buffers are not persistent; if a decoding session crashes we throw away accumulated reorder buffers, both those in memory and those spilled to disk. We have to re-create them by restarting decoding from the beginning of the oldest xact of interest. We could make reorder buffers persistent and shared between decoding sessions but it'd totally change the logical decoding model and create some other problems. It's certainly not a topic for this patch. So we can take it as given that we'll always restart decoding from BEGIN again at a crash. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
On Wed, Feb 1, 2017 at 4:35 PM, Craig Ringer wrote: > Right. Per my comments upthread I don't see why we need to add anything more > to WAL here. > > Stas was concerned about what happens in logical decoding if we crash > between PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back > and decode the whole txn again anyway so it doesn't matter. > > We can just track it on ReorderBufferTxn when we see it at PREPARE > TRANSACTION time. Oh, hmm. I guess if that's how it works then we don't need it in WAL after all. I'm not sure that re-decoding the already-prepared transaction is a very good plan, but if that's what we're doing anyway this patch probably shouldn't change it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] logical decoding of two-phase transactions
On Wed, Feb 1, 2017 at 2:32 PM, Tom Lane wrote: > Robert Haas writes: >> Also, including the GID in the WAL for each COMMIT/ABORT PREPARED >> doesn't seem inordinately expensive to me. > > I'm confused ... isn't it there already? If not, how do we handle > reconstructing 2PC state from WAL at all? By XID. See xl_xact_twophase, which gets included in xl_xact_commit or xl_xact_abort. The GID has got to be there in the XL_XACT_PREPARE record, but not when actually committing/rolling back. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] logical decoding of two-phase transactions
On 2 Feb. 2017 08:32, "Tom Lane" wrote: Robert Haas writes: > Also, including the GID in the WAL for each COMMIT/ABORT PREPARED > doesn't seem inordinately expensive to me. I'm confused ... isn't it there already? If not, how do we handle reconstructing 2PC state from WAL at all? Right. Per my comments upthread I don't see why we need to add anything more to WAL here. Stas was concerned about what happens in logical decoding if we crash between PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back and decode the whole txn again anyway so it doesn't matter. We can just track it on ReorderBufferTxn when we see it at PREPARE TRANSACTION time.
Re: [HACKERS] logical decoding of two-phase transactions
On 02/01/2017 10:32 PM, Tom Lane wrote: > Robert Haas writes: >> Also, including the GID in the WAL for each COMMIT/ABORT PREPARED >> doesn't seem inordinately expensive to me. > I'm confused ... isn't it there already? If not, how do we handle > reconstructing 2PC state from WAL at all? > > regards, tom lane > > Right now logical decoding ignores PREPARE and takes into account only "commit prepared": /* * Currently decoding ignores PREPARE TRANSACTION and will just * decode the transaction when the COMMIT PREPARED is sent or * throw away the transaction's contents when a ROLLBACK PREPARED * is received. In the future we could add code to expose prepared * transactions in the changestream allowing for a kind of * distributed 2PC. */ For some scenarios it works well, but if we really need the prepared state at the replica (as in the case of multimaster), then it is not enough. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
Robert Haas writes: > Also, including the GID in the WAL for each COMMIT/ABORT PREPARED > doesn't seem inordinately expensive to me. I'm confused ... isn't it there already? If not, how do we handle reconstructing 2PC state from WAL at all? regards, tom lane
Re: [HACKERS] logical decoding of two-phase transactions
On Tue, Jan 31, 2017 at 9:05 PM, Michael Paquier wrote: >> Personally I don't think lack of access to the GID justifies blocking 2PC >> logical decoding. It can be added separately. But it'd be nice to have >> especially if it's cheap. > > I think it should be added reading this thread. +1. If on the logical replication master the user executes PREPARE TRANSACTION 'mumble', isn't it sensible to want the logical replica to prepare the same set of changes with the same GID? To me, that not only seems like *a* sensible thing to want to do but probably the *most* sensible thing to want to do. And then, when the eventual COMMIT PREPARED 'mumble' comes along, you want to have the replica run the same command. If you don't do that, then the alternative is that the replica has to make up new names based on the master's XID. But that kinda sucks, because now if replication stops due to a conflict or whatever and you have to disentangle things by hand, all the names on the replica are basically meaningless. Also, including the GID in the WAL for each COMMIT/ABORT PREPARED doesn't seem inordinately expensive to me. For that to really add up to a significant cost, wouldn't you need to be doing LOTS of 2PC transactions, each with very little work, so that the commit/abort prepared records weren't swamped by everything else? That seems like an unlikely scenario, but if it does happen, that's exactly when you'll be most grateful for the GID tracking. I think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] logical decoding of two-phase transactions
On Tue, Jan 31, 2017 at 6:22 PM, Craig Ringer wrote: > That's where you've misunderstood - it isn't committed yet. The point of > this change is to allow us to do logical decoding at the PREPARE TRANSACTION > point. The xact is not yet committed or rolled back. Yes, I got that. I was looking for a why or an actual use-case. > Stas wants this for a conflict-free logical semi-synchronous replication > multi master solution. This sentence is hard to parse, "multi master" even more so, as the concept applies basically to only one master node. > At PREPARE TRANSACTION time we replay the xact to > other nodes, each of which applies it and issues PREPARE TRANSACTION, then replies > to confirm it has successfully prepared the xact. When all nodes confirm the > xact is prepared it is safe for the origin node to COMMIT PREPARED. The > other nodes then see that the first node has committed and they commit too. OK, this is the argument I was looking for. So in your scheme the origin node, the one generating the changes, is itself in charge of deciding if the 2PC should work or not. There are two channels between the origin node and the replicas replaying the logical changes, one is for the logical decoder with a receiver, the second one is used to communicate the WAL apply status. I thought about something like postgres_fdw doing this job with a transaction that does writes across several nodes, that's why I got confused about this feature. Everything goes through one channel, so the failure handling is just simplified. > Alternately if any node replies "could not replay xact" or "could not > prepare xact" the origin node knows to ROLLBACK PREPARED. All the other > nodes see that and rollback too. The origin node could just issue the ROLLBACK or COMMIT and the logical replicas would just apply this change. > To really make it rock solid you also have to send the old and new values of > a row, or have row versions, or send old row hashes.
Something I also want > to have, but we can mostly get that already with REPLICA IDENTITY FULL. On a primary key (or a unique index), the default replica identity is enough I think. > It is of interest to me because schema changes in MM logical replication are > more challenging, awkward and restrictive without it. Optimistic conflict > resolution doesn't work well for schema changes and once the conflicting > schema changes are committed on different nodes there is no going back. So > you need your async system to have a global locking model for schema changes > to stop conflicts arising. Or expect the user not to do anything silly / > misunderstand anything and know all the relevant system limitations and > requirements... which we all know works just great in practice. You also > need a way to ensure that schema changes don't render > committed-but-not-yet-replayed row changes from other peers nonsensical. The > safest way is a barrier where all row changes committed on any node before > committing the schema change on the origin node must be fully replayed on > every other node, making an async MM system temporarily sync single master > (and requiring all nodes to be up and reachable). Otherwise you need a way > to figure out how to conflict-resolve incoming rows with missing columns / > added columns / changed types / renamed tables etc which is no fun and > nearly impossible in the general case. That's one vision of things; FDW-like approaches would be a second, but those are not able to pass down utility statements natively, though this stuff can be done with the utility hook.
It is IMO not crucial since you can otherwise use a (node-id, > xid) tuple, but it'd be nice for coordinating with external systems, > simplifying inter node chatter, integrating logical decoding into bigger > systems with external transaction coordinators/arbitrators etc. It seems > pretty silly _not_ to have it really. Well, Postgres-XC/XL save the 2PC GID for this purpose in the GTM, this way the COMMIT/ABORT PREPARED can be issued from any node, and there is a centralized conflict resolution, the latter being done at a huge cost, causing a significant bottleneck in scaling performance. > Personally I don't think lack of access to the GID justifies blocking 2PC > logical decoding. It can be added separately. But it'd be nice to have > especially if it's cheap. I think it should be added reading this thread. -- Michael
Re: [HACKERS] logical decoding of two-phase transactions
On 31 Jan. 2017 22:43, "Konstantin Knizhnik" wrote: On 31.01.2017 09:29, Michael Paquier wrote: > On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer > wrote: > >> Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when >> wal_level >= logical I don't think that's the end of the world. But >> since we already have almost everything we need in memory, why not >> just stash the gid on ReorderBufferTXN? >> > I have been through this thread... And to be honest, I have a hard > time understanding for which purpose the information of a 2PC > transaction is useful in the case of logical decoding. The prepare and > commit prepared have been received by a node which is at the root of > the cluster tree, a node of the cluster at an upper level, or a > client, being in charge of issuing all the prepare queries, and then > issue the commit prepared to finish the transaction across a cluster. > In short, even if you do logical decoding from the root node, or the > one at a higher level, you would care just about the fact that it has > been committed. > in any state. So there are three records in the WAL: PREPARE, PRECOMMIT, COMMIT_PREPARED, and recovery can involve either all of them, or PRECOMMIT+COMMIT_PREPARED, or just COMMIT_PREPARED. That's your modified Pg though. This 2PC logical decoding patch proposal is for core and I think it just confused things to introduce discussion of unrelated changes made by your product to the codebase. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
On 31.01.2017 09:29, Michael Paquier wrote: On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer wrote: Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when wal_level >= logical I don't think that's the end of the world. But since we already have almost everything we need in memory, why not just stash the gid on ReorderBufferTXN? I have been through this thread... And to be honest, I have a hard time understanding for which purpose the information of a 2PC transaction is useful in the case of logical decoding. The prepare and commit prepared have been received by a node which is at the root of the cluster tree, a node of the cluster at an upper level, or a client, being in charge of issuing all the prepare queries, and then issue the commit prepared to finish the transaction across a cluster. In short, even if you do logical decoding from the root node, or the one at a higher level, you would care just about the fact that it has been committed. Sorry, maybe I do not completely understand your arguments. Actually our multimaster is now completely based on logical replication and 2PC (more precisely we are using 3PC now :) The state of a transaction (prepared, precommitted, committed) should be persisted in WAL to make it possible to perform recovery. Recovery can involve transactions in any state. So there are three records in the WAL: PREPARE, PRECOMMIT, COMMIT_PREPARED, and recovery can involve either all of them, or PRECOMMIT+COMMIT_PREPARED, or just COMMIT_PREPARED. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
On 31 Jan. 2017 19:29, "Michael Paquier" wrote: On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer wrote: > Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when > wal_level >= logical I don't think that's the end of the world. But > since we already have almost everything we need in memory, why not > just stash the gid on ReorderBufferTXN? I have been through this thread... And to be honest, I have a hard time understanding for which purpose the information of a 2PC transaction is useful in the case of logical decoding. TL;DR: this lets us decode the xact after prepare but before commit so decoding/replay outcomes can affect the commit-or-abort decision. The prepare and commit prepared have been received by a node which is at the root of the cluster tree, a node of the cluster at an upper level, or a client, being in charge of issuing all the prepare queries, and then issue the commit prepared to finish the transaction across a cluster. In short, even if you do logical decoding from the root node, or the one at a higher level, you would care just about the fact that it has been committed. That's where you've misunderstood - it isn't committed yet. The point of this change is to allow us to do logical decoding at the PREPARE TRANSACTION point. The xact is not yet committed or rolled back. This allows the results of logical decoding - or more interestingly results of replay on another node / to another app / whatever - to influence the commit or rollback decision. Stas wants this for a conflict-free logical semi-synchronous replication multi master solution. At PREPARE TRANSACTION time we replay the xact to other nodes, each of which applies it and issues PREPARE TRANSACTION, then replies to confirm it has successfully prepared the xact. When all nodes confirm the xact is prepared it is safe for the origin node to COMMIT PREPARED. The other nodes then see that the first node has committed and they commit too.
Alternately if any node replies "could not replay xact" or "could not prepare xact" the origin node knows to ROLLBACK PREPARED. All the other nodes see that and rollback too. This makes it possible to be much more confident that what's replicated is exactly the same on all nodes, with no after-the-fact MM conflict resolution that apps must be aware of to function correctly. To really make it rock solid you also have to send the old and new values of a row, or have row versions, or send old row hashes. Something I also want to have, but we can mostly get that already with REPLICA IDENTITY FULL. It is of interest to me because schema changes in MM logical replication are more challenging, awkward and restrictive without it. Optimistic conflict resolution doesn't work well for schema changes, and once the conflicting schema changes are committed on different nodes there is no going back. So you need your async system to have a global locking model for schema changes to stop conflicts arising. Or expect the user not to do anything silly / misunderstand anything and know all the relevant system limitations and requirements... which we all know works just great in practice. You also need a way to ensure that schema changes don't render committed-but-not-yet-replayed row changes from other peers nonsensical. The safest way is a barrier where all row changes committed on any node before committing the schema change on the origin node must be fully replayed on every other node, making an async MM system temporarily sync single master (and requiring all nodes to be up and reachable). Otherwise you need a way to figure out how to conflict-resolve incoming rows with missing columns / added columns / changed types / renamed tables etc which is no fun and nearly impossible in the general case. 2PC decoding lets us avoid all this mess by sending all nodes the proposed schema change and waiting until they all confirm successful prepare before committing it.
It can also be used to solve the row compatibility problems with some more lazy inter-node chat in logical WAL messages. I think the purpose of having the GID available to the decoding output plugin at PREPARE TRANSACTION time is that it can co-operate with a global transaction manager that way. Each node can tell the GTM "I'm ready to commit [X]". It is IMO not crucial since you can otherwise use a (node-id, xid) tuple, but it'd be nice for coordinating with external systems, simplifying inter node chatter, integrating logical decoding into bigger systems with external transaction coordinators/arbitrators etc. It seems pretty silly _not_ to have it really. Personally I don't think lack of access to the GID justifies blocking 2PC logical decoding. It can be added separately. But it'd be nice to have especially if it's cheap.
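The prepare/confirm/commit-or-rollback dance described above can be sketched as a toy model. This is Python pseudocode with invented names, not PostgreSQL code; a real implementation also has to handle timeouts, node crashes mid-protocol, and recovery of in-doubt transactions:

```python
# Toy model of the coordination scheme described in the thread: the origin
# decodes a transaction at PREPARE TRANSACTION time, fans it out to peers,
# and issues COMMIT PREPARED only if every peer confirmed its prepare,
# otherwise ROLLBACK PREPARED.

class Node:
    def __init__(self, name, fail=False):
        self.name, self.fail = name, fail
        self.prepared = {}   # gid -> changes held in prepared state
        self.state = {}      # gid -> final outcome

    def prepare(self, gid, changes):
        if self.fail:
            return False     # "could not replay xact" / "could not prepare xact"
        self.prepared[gid] = changes
        return True

    def finish(self, gid, commit):
        self.prepared.pop(gid, None)
        self.state[gid] = "committed" if commit else "aborted"

def replicate_2pc(origin, peers, gid, changes):
    # Replay the xact to all peers at PREPARE time and collect confirmations.
    acks = [p.prepare(gid, changes) for p in peers]
    # Commit only when every peer prepared successfully; otherwise roll back.
    commit = all(acks)
    for node in [origin] + peers:
        node.finish(gid, commit)
    return "COMMIT PREPARED" if commit else "ROLLBACK PREPARED"
```

The same barrier trivially covers the schema-change case discussed above: a DDL transaction is just another set of changes that every peer must prepare before the origin commits it.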
Re: [HACKERS] logical decoding of two-phase transactions
On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer wrote: > Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when > wal_level >= logical I don't think that's the end of the world. But > since we already have almost everything we need in memory, why not > just stash the gid on ReorderBufferTXN? I have been through this thread... And to be honest, I have a hard time understanding for which purpose the information of a 2PC transaction is useful in the case of logical decoding. The prepare and commit prepared have been received by a node which is at the root of the cluster tree, a node of the cluster at an upper level, or a client, being in charge of issuing all the prepare queries, and then issue the commit prepared to finish the transaction across a cluster. In short, even if you do logical decoding from the root node, or the one at a higher level, you would care just about the fact that it has been committed. -- Michael
Re: [HACKERS] logical decoding of two-phase transactions
On Tue, Jan 31, 2017 at 3:29 PM, Michael Paquier wrote: > On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer wrote: >> Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when >> wal_level >= logical I don't think that's the end of the world. But >> since we already have almost everything we need in memory, why not >> just stash the gid on ReorderBufferTXN? > > I have been through this thread... And to be honest, I have a hard > time understanding for which purpose the information of a 2PC > transaction is useful in the case of logical decoding. The prepare and > commit prepared have been received by a node which is at the root of > the cluster tree, a node of the cluster at an upper level, or a > client, being in charge of issuing all the prepare queries, and then > issue the commit prepared to finish the transaction across a cluster. > In short, even if you do logical decoding from the root node, or the > one at a higher level, you would care just about the fact that it has > been committed. By the way, I have moved this patch to next CF, you guys seem to make the discussion move on. -- Michael
Re: [HACKERS] logical decoding of two-phase transactions
On 26 January 2017 at 19:34, Stas Kelvich wrote: > Imagine the following scenario: > > 1. PREPARE happened > 2. PREPARE decoded and sent where it should be sent > 3. We got all responses from participating nodes and issue COMMIT/ABORT > 4. COMMIT/ABORT decoded and sent > > After step 3 there is no more memory state associated with that prepared tx, > so if we fail > between 3 and 4 then we can’t know the GID unless we wrote it to the commit record (or > a table). If the decoding session crashes/disconnects and restarts between 3 and 4, we know the xact is now committed or rolled back and we don't care about its gid anymore; we can decode it as a normal committed xact or skip over it if aborted. If Pg crashes between 3 and 4 the same applies, since all decoding sessions must restart. No decoding session can ever start up between 3 and 4 without passing through 1 and 2, since we always restart decoding at restart_lsn, and restart_lsn cannot be advanced past the assignment (BEGIN) of a given xid until we pass its commit record and the downstream confirms it has flushed the results. The reorder buffer doesn't even really need to keep track of the gid between 3 and 4, though it should do so to save the output plugin and downstream the hassle of keeping an xid to gid mapping. All it needs is to know if we sent a given xact's data to the output plugin at PREPARE time, so we can suppress sending them again at COMMIT time, and we can store that info on the ReorderBufferTXN. We can store the gid there too. We'll need two new output plugin callbacks, prepare_cb and rollback_cb, since an xact can roll back after we decode PREPARE TRANSACTION (or during it, even) and we have to be able to tell the downstream to throw the data away. I don't think the rollback callback should be called abort_prepared_cb, because we'll later want to add the ability to decode interleaved xacts' changes as they are made, before commit, and in that case we'll also need to know if they abort. 
We won't care if they were prepared xacts or not, but we'll know based on the ReorderBufferTXN anyway. We don't need a separate commit_prepared_cb, the existing commit_cb is sufficient. The gid will be accessible on the ReorderBufferTXN. Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when wal_level >= logical I don't think that's the end of the world. But since we already have almost everything we need in memory, why not just stash the gid on ReorderBufferTXN? -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
> On 26 Jan 2017, at 12:51, Craig Ringer wrote: > > * Tracking xid/gid map in memory also doesn’t help much — if server reboots > between prepare > and commit we’ll lose that mapping. > > Er what? That's why I suggested using the prepared xacts shmem state. It's > persistent as you know from your work on prepared transaction files. It has > all the required info. Imagine the following scenario: 1. PREPARE happened 2. PREPARE decoded and sent where it should be sent 3. We got all responses from participating nodes and issue COMMIT/ABORT 4. COMMIT/ABORT decoded and sent After step 3 there is no more memory state associated with that prepared tx, so if we fail between 3 and 4 then we can’t know the GID unless we wrote it to the commit record (or a table). -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
On 26 Jan. 2017 18:43, "Stas Kelvich" wrote: >> >> Yes, that’s also possible but seems to be less flexible restricting us to some >> specific GID format. >> >> Anyway, I can measure WAL space overhead introduced by the GID’s inside commit records >> to know exactly what will be the cost of such approach. > > I think the approach of storing just the xid and fetching the GID > during logical decoding of the PREPARE TRANSACTION is probably the > best way forward, per my prior mail. I don’t think that’s possible this way. If we don’t put the GID in the commit record, then by the time logical decoding happens the transaction will already be committed/aborted and there will be no easy way to get that GID. My thinking is that if the 2PC xact is by that point COMMIT PREPARED or ROLLBACK PREPARED we don't care that it was ever 2pc and should just decode it as a normal xact. Its gid has ceased to be significant and no longer holds meaning since the xact is resolved. The point of logical decoding of 2pc is to allow peers to participate in a decision on whether to commit or not, rather than only being able to decode the xact once committed as is currently the case. If it's already committed there's no point treating it as anything special. So when we get to the prepare transaction in xlog we look to see if it's already committed / rolled back. If so we proceed normally like current decoding does. Only if it's still prepared do we decode it as 2pc and supply the gid to a new output plugin callback for prepared xacts. I thought about several possibilities: * Tracking the xid/gid map in memory also doesn’t help much — if the server reboots between prepare and commit we’ll lose that mapping. Er what? That's why I suggested using the prepared xacts shmem state. It's persistent as you know from your work on prepared transaction files. It has all the required info.
Re: [HACKERS] logical decoding of two-phase transactions
>> >> Yes, that’s also possible but seems to be less flexible restricting us to >> some >> specific GID format. >> >> Anyway, I can measure WAL space overhead introduced by the GID’s inside >> commit records >> to know exactly what will be the cost of such approach. > > Stas, > > Have you had a chance to look at this further? Generally I’m okay with Simon’s approach and will send an updated patch. Anyway I want to perform some tests to estimate how much disk space is actually wasted by extra WAL records. > I think the approach of storing just the xid and fetching the GID > during logical decoding of the PREPARE TRANSACTION is probably the > best way forward, per my prior mail. I don’t think that’s possible this way. If we don’t put the GID in the commit record, then by the time logical decoding happens the transaction will already be committed/aborted and there will be no easy way to get that GID. I thought about several possibilities: * Tracking the xid/gid map in memory also doesn’t help much — if the server reboots between prepare and commit we’ll lose that mapping. * We can provide some hooks on prepared tx recovery during startup, but that approach also fails if the reboot happened between commit and decoding of that commit. * Logical messages are WAL-logged, but they don’t have any redo function so they don’t help much. So to support user-accessible 2PC over replication based on 2PC decoding we should invent something more nasty, like writing them into a table. > That should eliminate Simon's > objection re the cost of tracking GIDs and still let us have access to > them when we want them, which is the best of both worlds really. Having 2PC decoding in core is a good thing anyway even without GID tracking =) -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
On 5 January 2017 at 20:43, Stas Kelvich wrote: > >> On 5 Jan 2017, at 13:49, Simon Riggs wrote: >> >> Surely in this case the master server is acting as the Transaction >> Manager, and it knows the mapping, so we are good? >> >> I guess if you are using >2 nodes then you need to use full 2PC on each node. >> >> Please explain precisely how you expect to use this, to check that GID >> is required. >> > > For example if we are using logical replication just for failover/HA and > allowing the user > to be the transaction manager itself. Then suppose the user prepared a tx on > server A and server A > crashed. After that the client may want to reconnect to server B and commit/abort > that tx. > But the user only has the GID that was used during prepare. > >> But even then, if you adopt the naming convention that all in-progress >> xacts will be called RepOriginId-EPOCH-XID, so they have a fully >> unique GID on all of the child nodes then we don't need to add the >> GID. > > Yes, that’s also possible but seems to be less flexible restricting us to some > specific GID format. > > Anyway, I can measure WAL space overhead introduced by the GID’s inside > commit records > to know exactly what will be the cost of such approach. Stas, Have you had a chance to look at this further? I think the approach of storing just the xid and fetching the GID during logical decoding of the PREPARE TRANSACTION is probably the best way forward, per my prior mail. That should eliminate Simon's objection re the cost of tracking GIDs and still let us have access to them when we want them, which is the best of both worlds really. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
On 5 January 2017 at 12:43, Stas Kelvich wrote: > >> On 5 Jan 2017, at 13:49, Simon Riggs wrote: >> >> Surely in this case the master server is acting as the Transaction >> Manager, and it knows the mapping, so we are good? >> >> I guess if you are using >2 nodes then you need to use full 2PC on each node. >> >> Please explain precisely how you expect to use this, to check that GID >> is required. >> > > For example if we are using logical replication just for failover/HA and > allowing the user > to be the transaction manager itself. Then suppose the user prepared a tx on > server A and server A > crashed. After that the client may want to reconnect to server B and commit/abort > that tx. > But the user only has the GID that was used during prepare. I don't think that's the case you're trying to support, and I don't think that's a common case that we want to pay the price to put into core in a non-optional way. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
On 5 January 2017 at 20:43, Stas Kelvich wrote: > Anyway, I can measure WAL space overhead introduced by the GID’s inside > commit records > to know exactly what will be the cost of such approach. Sounds like a good idea, especially if you remove any attempt to work with GIDs for !2PC commits at the same time. I don't think I care about having access to the GID for the use case I have in mind, since we'd actually be wanting to hijack a normal COMMIT and internally transform it to PREPARE TRANSACTION, [...], COMMIT PREPARED. But for the more general case of logical decoding of 2PC I can see the utility of having the xact identifier. If we presume we're only interested in logically decoding 2PC xacts that are not yet COMMIT PREPAREd, can we not avoid the WAL overhead of writing the GID by looking it up in our shmem state at decoding-time for PREPARE TRANSACTION? If we can't find the prepared transaction in TwoPhaseState we know to expect a following ROLLBACK PREPARED or COMMIT PREPARED, so we shouldn't decode it at the PREPARE TRANSACTION stage. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
> On 5 Jan 2017, at 13:49, Simon Riggs wrote: > > Surely in this case the master server is acting as the Transaction > Manager, and it knows the mapping, so we are good? > > I guess if you are using >2 nodes then you need to use full 2PC on each node. > > Please explain precisely how you expect to use this, to check that GID > is required. > For example if we are using logical replication just for failover/HA and allowing the user to be the transaction manager itself. Then suppose the user prepared a tx on server A and server A crashed. After that the client may want to reconnect to server B and commit/abort that tx. But the user only has the GID that was used during prepare. > But even then, if you adopt the naming convention that all in-progress > xacts will be called RepOriginId-EPOCH-XID, so they have a fully > unique GID on all of the child nodes then we don't need to add the > GID. Yes, that’s also possible but seems to be less flexible, restricting us to some specific GID format. Anyway, I can measure the WAL space overhead introduced by the GIDs inside commit records to know exactly what the cost of such an approach will be. -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
On 5 January 2017 at 10:21, Stas Kelvich wrote: > Thank you for looking into this. > >> On 5 Jan 2017, at 09:43, Simon Riggs wrote: >>> >>> GID is now variable sized. You seem to have added this to every >>> commit, not just 2PC >> > > Hm, didn’t realise that, I’ll fix. > >> I've just realised that you're adding GID because it allows you to >> uniquely identify the prepared xact. But then the prepared xact will >> also have a regular TransactionId, which is also unique. GID exists >> for users to specify things, but it is not needed internally and we >> don't need to add it here. > > I think we anyway can’t avoid pushing down the GID to the client side. > > If we push down only the local TransactionId to the remote server then we will > lose the mapping > of GID to TransactionId, and there will be no way for the user to identify his > transaction on > the second server. Also Open XA and lots of libraries (e.g. J2EE) assume that > there is > the same GID everywhere and it’s the same GID that was issued by the client. > > Requirements for two-phase decoding can be different depending on what one > wants > to build around it, and I believe in some situations pushing down the xid is > enough. But IMO > dealing with reconnects, failures and client libraries will force the programmer > to use > the same GID everywhere. Surely in this case the master server is acting as the Transaction Manager, and it knows the mapping, so we are good? I guess if you are using >2 nodes then you need to use full 2PC on each node. But even then, if you adopt the naming convention that all in-progress xacts will be called RepOriginId-EPOCH-XID, so they have a fully unique GID on all of the child nodes, then we don't need to add the GID. Please explain precisely how you expect to use this, to check that GID is required. 
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
Thank you for looking into this. > On 5 Jan 2017, at 09:43, Simon Riggs wrote: >> >> GID is now variable sized. You seem to have added this to every >> commit, not just 2PC > Hm, didn’t realise that, I’ll fix. > I've just realised that you're adding GID because it allows you to > uniquely identify the prepared xact. But then the prepared xact will > also have a regular TransactionId, which is also unique. GID exists > for users to specify things, but it is not needed internally and we > don't need to add it here. I think we anyway can’t avoid pushing down the GID to the client side. If we push down only the local TransactionId to the remote server then we will lose the mapping of GID to TransactionId, and there will be no way for the user to identify his transaction on the second server. Also Open XA and lots of libraries (e.g. J2EE) assume that there is the same GID everywhere and it’s the same GID that was issued by the client. Requirements for two-phase decoding can be different depending on what one wants to build around it, and I believe in some situations pushing down the xid is enough. But IMO dealing with reconnects, failures and client libraries will force the programmer to use the same GID everywhere. > What we do need is for the commit prepared > message to remember what the xid of the prepare was and then re-find > it using the commit WAL record's twophase_xid field. So we don't need > to add GID to any WAL records, nor to any in-memory structures. The other part of the story is how to find the GID during decoding of the commit prepared record. I did that by adding a GID field to the commit WAL record, because by the time of decoding all memory structures that were holding the xid<->gid correspondence are already cleaned up. -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
On 4 January 2017 at 21:20, Simon Riggs wrote: > On 31 December 2016 at 08:36, Stas Kelvich wrote: >> Here is resubmission of patch to implement logical decoding of two-phase >> transactions (instead of treating them >> as usual transaction when commit) [1] I’ve slightly polished things and used >> test_decoding output plugin as client. > > Sounds good. > >> General idea quite simple here: >> >> * Write gid along with commit/prepare records in case of 2pc > > GID is now variable sized. You seem to have added this to every > commit, not just 2PC I've just realised that you're adding GID because it allows you to uniquely identify the prepared xact. But then the prepared xact will also have a regular TransactionId, which is also unique. GID exists for users to specify things, but it is not needed internally and we don't need to add it here. What we do need is for the commit prepared message to remember what the xid of the prepare was and then re-find it using the commit WAL record's twophase_xid field. So we don't need to add GID to any WAL records, nor to any in-memory structures. Please re-work the patch to include twophase_xid, which should make the patch smaller and much faster too. Please add comments to explain how and why patches work. Design comments allow us to check the design makes sense and, if it does, whether all the lines in the patch are needed to follow the design. Without that, patches are much harder to commit, and we all want patches to be easier to commit. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
On 31 December 2016 at 08:36, Stas Kelvich wrote: > Here is a resubmission of the patch to implement logical decoding of two-phase > transactions (instead of treating them > as usual transactions at commit) [1]. I’ve slightly polished things and used > the test_decoding output plugin as a client. Sounds good. > The general idea is quite simple here: > > * Write the gid along with commit/prepare records in case of 2pc GID is now variable sized. You seem to have added this to every commit, not just 2PC > * Add several routines to decode prepare records in the same way as it > already happens in logical decoding. > > I’ve also added an explicit LOCK statement in the test_decoding regression suite to > check that it doesn’t break things. Please explain that in comments in the patch. > If > somebody can create a scenario that will block decoding because of an existing > dummy backend lock that will be a great > help. Right now all my tests are passing (including TAP tests to check recovery > of twophase tx in case of failures from > the adjacent mail thread). > > If we agree on the current approach then I’m ready to add this stuff to > the proposed in-core logical replication. > > [1] > https://www.postgresql.org/message-id/EE7452CA-3C39-4A0E-97EC-17A414972884%40postgrespro.ru We'll need some measurements about additional WAL space or mem usage from these approaches. Thanks. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services