Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Thu, May 6, 2021 at 9:01 AM Amit Kapila wrote:
>
> On Fri, Apr 30, 2021 at 7:43 PM Dilip Kumar wrote:
> >
> > > So, I used the other approach which led to the attached.
> >
> > The patch looks fine to me.
>
> Thanks, pushed!

Thanks!

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, Apr 30, 2021 at 7:43 PM Dilip Kumar wrote:
>
> > So, I used the other approach which led to the attached.
>
> The patch looks fine to me.

Thanks, pushed!

--
With Regards,
Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, Apr 30, 2021 at 3:01 PM Amit Kapila wrote:
>
> On Wed, Apr 28, 2021 at 11:03 AM Dilip Kumar wrote:
> >
> > On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila wrote:
> > >
> > > The idea I have is to additionally check that we are decoding a
> > > streaming or prepared transaction (the same check as we have for
> > > setting curtxn) or we can check if CheckXidAlive is a valid
> > > transaction id. What do you think?
> >
> > I think a check based on CheckXidAlive looks good to me. This will
> > protect against a similar error being raised from any other path, as
> > you mentioned above.
>
> We can't use CheckXidAlive because it is reset by that time.

Right.

> So, I used the other approach which led to the attached.

The patch looks fine to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Apr 28, 2021 at 11:03 AM Dilip Kumar wrote:
>
> On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila wrote:
> >
> > The idea I have is to additionally check that we are decoding a
> > streaming or prepared transaction (the same check as we have for
> > setting curtxn) or we can check if CheckXidAlive is a valid
> > transaction id. What do you think?
>
> I think a check based on CheckXidAlive looks good to me. This will
> protect against a similar error being raised from any other path, as
> you mentioned above.

We can't use CheckXidAlive because it is reset by that time. So, I
used the other approach which led to the attached.

--
With Regards,
Amit Kapila.

v1-0001-Tighten-the-concurrent-abort-check-during-decodin.patch
Description: Binary data
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila wrote:
>
> Tom Lane has raised a complaint on pgsql-committers [1] about one of
> the commits related to this work [2]. The new buildfarm member wrasse
> is showing the warning:
>
> "/export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/reorderbuffer.c",
> line 2510: Warning: Likely null pointer dereference (*(curtxn+272)):
> ReorderBufferProcessTXN
>
> The warning is for the line:
> curtxn->concurrent_abort = true;
>
> Now, we can simply fix this warning by adding an if check like:
> if (curtxn)
>     curtxn->concurrent_abort = true;
>
> However, on further discussion, it seems that is not sufficient here
> because the callbacks can throw the surrounding error code
> (ERRCODE_TRANSACTION_ROLLBACK) where we set the concurrent_abort flag
> for a completely different scenario. I think here we need a
> stronger check to ensure that we set the concurrent-abort flag and do
> other things in that check only when we are decoding non-committed
> xacts.

That makes sense.

> The idea I have is to additionally check that we are decoding a
> streaming or prepared transaction (the same check as we have for
> setting curtxn) or we can check if CheckXidAlive is a valid
> transaction id. What do you think?

I think a check based on CheckXidAlive looks good to me. This will
protect against a similar error being raised from any other path, as
you mentioned above.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Tom Lane has raised a complaint on pgsql-committers [1] about one of
the commits related to this work [2]. The new buildfarm member wrasse
is showing the warning:

"/export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/reorderbuffer.c",
line 2510: Warning: Likely null pointer dereference (*(curtxn+272)):
ReorderBufferProcessTXN

The warning is for the line:
curtxn->concurrent_abort = true;

Now, we can simply fix this warning by adding an if check like:
if (curtxn)
    curtxn->concurrent_abort = true;

However, on further discussion, it seems that is not sufficient here
because the callbacks can throw the surrounding error code
(ERRCODE_TRANSACTION_ROLLBACK) where we set the concurrent_abort flag
for a completely different scenario. I think here we need a stronger
check to ensure that we set the concurrent-abort flag and do other
things in that check only when we are decoding non-committed xacts.

The idea I have is to additionally check that we are decoding a
streaming or prepared transaction (the same check as we have for
setting curtxn), or we can check if CheckXidAlive is a valid
transaction id. What do you think?

[1] - https://www.postgresql.org/message-id/2752962.1619568098%40sss.pgh.pa.us
[2] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7259736a6e5b7c7588fff9578370736a6648acbb

--
With Regards,
Amit Kapila.
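[For illustration: the shape of the guarded flag-setting being discussed can be sketched in a self-contained form. The types and the mark_concurrent_abort() helper below are simplified stand-ins, not PostgreSQL's actual ReorderBufferTXN machinery or the CheckXidAlive variable.]

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;          /* stand-in for PostgreSQL's type */

typedef struct ReorderBufferTXN
{
    bool concurrent_abort;               /* set when a concurrent abort is seen */
} ReorderBufferTXN;

/*
 * Set the concurrent-abort flag only when we are actually decoding an
 * in-progress (streamed or prepared) transaction.  Otherwise the same
 * error code may have come from an unrelated path, and curtxn may well
 * be NULL -- which is exactly the dereference the compiler warned about.
 */
static void
mark_concurrent_abort(ReorderBufferTXN *curtxn, bool streaming, bool prepared)
{
    if ((streaming || prepared) && curtxn != NULL)
        curtxn->concurrent_abort = true;
}
```

The point of the stronger check is that the NULL test alone would silence the warning but still let an unrelated ERRCODE_TRANSACTION_ROLLBACK be misreported as a concurrent abort.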
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
Noah Misch writes:
> On Mon, Dec 14, 2020 at 01:59:03PM -0500, Tom Lane wrote:
>> Here's a rolled-up patch that does some further documentation work
>> and gets rid of the unnecessary memset's as well.

> +1 on removing the memset() calls.  That said, it's not a big deal if more
> creep in over time; it doesn't qualify as a project policy violation.

Right, that part is just neatnik-ism.  Neither the calls with memset
nor the ones without are buggy.

>> + * *infoP and hash_flags should specify at least the entry sizes and key

> s/should/must/

OK; thanks for reviewing!

			regards, tom lane
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
On Mon, Dec 14, 2020 at 01:59:03PM -0500, Tom Lane wrote:
> * Should we just have a blanket insistence that all callers supply
> HASH_ELEM?  The default sizes that dynahash.c uses without that are
> undocumented and basically useless.

+1

> we should just rip out all those memsets as pointless, since there's
> basically no case where you'd use the memset to fill a field that
> you meant to pass as zero.  The fact that hash_create() doesn't
> read fields it's not told to by a flag means we should not need
> the memsets to avoid uninitialized-memory reads.

On Mon, Dec 14, 2020 at 06:55:20PM -0500, Tom Lane wrote:
> Here's a rolled-up patch that does some further documentation work
> and gets rid of the unnecessary memset's as well.

+1 on removing the memset() calls.  That said, it's not a big deal if more
creep in over time; it doesn't qualify as a project policy violation.

> @@ -329,6 +328,11 @@ InitShmemIndex(void)
>   * whose maximum size is certain, this should be equal to max_size; that
>   * ensures that no run-time out-of-shared-memory failures can occur.
>   *
> + * *infoP and hash_flags should specify at least the entry sizes and key

s/should/must/
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
Here's a rolled-up patch that does some further documentation work
and gets rid of the unnecessary memset's as well.

			regards, tom lane

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 2dc9e44ae6..651227f510 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -2607,7 +2607,8 @@ createConnHash(void)
 	ctl.keysize = NAMEDATALEN;
 	ctl.entrysize = sizeof(remoteConnHashEnt);
 
-	return hash_create("Remote Con hash", NUMCONN, &ctl, HASH_ELEM);
+	return hash_create("Remote Con hash", NUMCONN, &ctl,
+					   HASH_ELEM | HASH_STRINGS);
 }
 
 static void
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 70cfdb2c9d..2f00344b7f 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -567,7 +567,6 @@ pgss_shmem_startup(void)
 		pgss->stats.dealloc = 0;
 	}
 
-	memset(&info, 0, sizeof(info));
 	info.keysize = sizeof(pgssHashKey);
 	info.entrysize = sizeof(pgssEntry);
 	pgss_hash = ShmemInitHash("pg_stat_statements hash",
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index ab3226287d..66581e5414 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -119,14 +119,11 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	{
 		HASHCTL		ctl;
 
-		MemSet(&ctl, 0, sizeof(ctl));
 		ctl.keysize = sizeof(ConnCacheKey);
 		ctl.entrysize = sizeof(ConnCacheEntry);
-		/* allocate ConnectionHash in the cache context */
-		ctl.hcxt = CacheMemoryContext;
 		ConnectionHash = hash_create("postgres_fdw connections", 8, &ctl,
-									 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+									 HASH_ELEM | HASH_BLOBS);
 
 		/*
 		 * Register some callback functions that manage connection cleanup.
diff --git a/contrib/postgres_fdw/shippable.c b/contrib/postgres_fdw/shippable.c
index 3433c19712..b4766dc5ff 100644
--- a/contrib/postgres_fdw/shippable.c
+++ b/contrib/postgres_fdw/shippable.c
@@ -93,7 +93,6 @@ InitializeShippableCache(void)
 	HASHCTL		ctl;
 
 	/* Create the hash table. */
-	MemSet(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(ShippableCacheKey);
 	ctl.entrysize = sizeof(ShippableCacheEntry);
 	ShippableCacheHash =
diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c
index 85986ec24a..e9a9741154 100644
--- a/contrib/tablefunc/tablefunc.c
+++ b/contrib/tablefunc/tablefunc.c
@@ -714,7 +714,6 @@ load_categories_hash(char *cats_sql, MemoryContext per_query_ctx)
 	MemoryContext SPIcontext;
 
 	/* initialize the category hash table */
-	MemSet(&ctl, 0, sizeof(ctl));
 	ctl.keysize = MAX_CATNAME_LEN;
 	ctl.entrysize = sizeof(crosstab_HashEnt);
 	ctl.hcxt = per_query_ctx;
@@ -726,7 +725,7 @@ load_categories_hash(char *cats_sql, MemoryContext per_query_ctx)
 	crosstab_hash = hash_create("crosstab hash",
 								INIT_CATS,
 								&ctl,
-								HASH_ELEM | HASH_CONTEXT);
+								HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
 
 	/* Connect to SPI manager */
 	if ((ret = SPI_connect()) < 0)
diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
index 4ad67c88b4..217c199a14 100644
--- a/src/backend/access/gist/gistbuildbuffers.c
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -76,7 +76,6 @@ gistInitBuildBuffers(int pagesPerBuffer, int levelStep, int maxLevel)
 	 * nodeBuffersTab hash is association between index blocks and it's
 	 * buffers.
 	 */
-	memset(&hashCtl, 0, sizeof(hashCtl));
 	hashCtl.keysize = sizeof(BlockNumber);
 	hashCtl.entrysize = sizeof(GISTNodeBuffer);
 	hashCtl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a664ecf494..c77a189907 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -1363,7 +1363,6 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
 	bool		found;
 
 	/* Initialize hash tables used to track TIDs */
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(ItemPointerData);
 	hash_ctl.entrysize = sizeof(ItemPointerData);
 	hash_ctl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 39e33763df..65942cc428 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -266,7 +266,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 	state->rs_cxt = rw_cxt;
 
 	/* Initialize hash tables used to track update chains */
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(TidHashKey);
 	hash_ctl.entrysize = sizeof(UnresolvedTupData);
 	hash_ctl.hcxt = state->rs_cxt;
@@ -824,7 +823,6 @@ logical_begin_heap_rewrite(RewriteState state)
 	state->rs_begin_lsn = GetXLogInsertRecPtr();
 	state->rs_num_rewrite_mappings = 0;
 
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(TransactionId);
 	hash_ctl.entrysize
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
I wrote:
> There are a couple of other API oddities that maybe we should think
> about while we're here:
> * Should we just have a blanket insistence that all callers supply
> HASH_ELEM?  The default sizes that dynahash.c uses without that are
> undocumented and basically useless.  We're already asserting that
> in the HASH_BLOBS path, which is the majority use-case, and this
> patch now asserts it for HASH_STRINGS too.

Here's a follow-up patch for that part, which also tries to respond
a bit to Heikki's complaint about skimpy documentation.  While at it,
I const-ified the HASHCTL argument, since there's no need for
hash_create to modify that.

			regards, tom lane

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 07cae638df..49f21b77bb 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -317,11 +317,20 @@ string_compare(const char *key1, const char *key2, Size keysize)
  * *info: additional table parameters, as indicated by flags
  * flags: bitmask indicating which parameters to take from *info
  *
- * The flags value must include exactly one of HASH_STRINGS, HASH_BLOBS,
+ * The flags value *must* include HASH_ELEM.  (Formerly, this was nominally
+ * optional, but the default keysize and entrysize values were useless.)
+ * The flags value must also include exactly one of HASH_STRINGS, HASH_BLOBS,
  * or HASH_FUNCTION, to define the key hashing semantics (C strings,
  * binary blobs, or custom, respectively).  Callers specifying a custom
  * hash function will likely also want to use HASH_COMPARE, and perhaps
  * also HASH_KEYCOPY, to control key comparison and copying.
+ * Another often-used flag is HASH_CONTEXT, to allocate the hash table
+ * under info->hcxt rather than under TopMemoryContext; the default
+ * behavior is only suitable for session-lifespan hash tables.
+ * Other flags bits are special-purpose and seldom used.
+ *
+ * Fields in *info are read only when the associated flags bit is set.
+ * It is not necessary to initialize other fields of *info.
  *
  * Note: for a shared-memory hashtable, nelem needs to be a pretty good
  * estimate, since we can't expand the table on the fly.  But an unshared
@@ -330,11 +339,19 @@ string_compare(const char *key1, const char *key2, Size keysize)
  * large nelem will penalize hash_seq_search speed without buying much.
  */
 HTAB *
-hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
+hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
 
+	/*
+	 * Hash tables now allocate space for key and data, but you have to say
+	 * how much space to allocate.
+	 */
+	Assert(flags & HASH_ELEM);
+	Assert(info->keysize > 0);
+	Assert(info->entrysize >= info->keysize);
+
 	/*
 	 * For shared hash tables, we have a local hash header (HTAB struct) that
 	 * we allocate in TopMemoryContext; all else is in shared memory.
@@ -385,7 +402,6 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 	{
 		Assert(!(flags & HASH_STRINGS));
 		/* We can optimize hashing for common key sizes */
-		Assert(flags & HASH_ELEM);
 		if (info->keysize == sizeof(uint32))
 			hashp->hash = uint32_hash;
 		else
@@ -402,7 +418,6 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 		 * it's almost certainly an integer or pointer not a string.
 		 */
 		Assert(flags & HASH_STRINGS);
-		Assert(flags & HASH_ELEM);
 		Assert(info->keysize > 8);
 
 		hashp->hash = string_hash;
@@ -529,16 +544,9 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 		hctl->dsize = info->dsize;
 	}
 
-	/*
-	 * hash table now allocates space for key and data but you have to say how
-	 * much space to allocate
-	 */
-	if (flags & HASH_ELEM)
-	{
-		Assert(info->entrysize >= info->keysize);
-		hctl->keysize = info->keysize;
-		hctl->entrysize = info->entrysize;
-	}
+	/* remember the entry sizes, too */
+	hctl->keysize = info->keysize;
+	hctl->entrysize = info->entrysize;
 
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
@@ -617,10 +625,6 @@ hdefault(HTAB *hashp)
 	hctl->dsize = DEF_DIRSIZE;
 	hctl->nsegs = 0;
 
-	/* rather pointless defaults for key & entry size */
-	hctl->keysize = sizeof(char *);
-	hctl->entrysize = 2 * sizeof(char *);
-
 	hctl->num_partitions = 0;	/* not partitioned */
 
 	/* table has no fixed maximum size */
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 666ad33567..c3daaae92b 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -124,7 +124,7 @@ typedef struct
  * one of these.
  */
 extern HTAB *hash_create(const char *tabname, long nelem,
-						 HASHCTL *info, int flags);
+						 const HASHCTL *info, int flags);
 extern void hash_destroy(HTAB *hashp);
 extern void hash_stats(const char *where, HTAB *hashp);
 extern void *hash_search(HTAB *hashp, const void
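[For illustration: the rules the patch enforces with Asserts can be written out as a self-contained predicate. The flag bit values and the hash_flags_ok() helper below are stand-ins for this sketch only; PostgreSQL's real definitions live in utils/hsearch.h and dynahash.c.]

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in flag bits (hypothetical values, for the sketch only). */
#define HASH_ELEM      0x0001
#define HASH_STRINGS   0x0002
#define HASH_BLOBS     0x0004
#define HASH_FUNCTION  0x0008

typedef struct
{
    size_t keysize;
    size_t entrysize;
} HashCtlSketch;

/*
 * True when flags/info satisfy the post-patch contract: HASH_ELEM is
 * mandatory, exactly one key-semantics flag is given, the sizes are sane,
 * and string keys must have keysize > 8 (a tiny "string" key is almost
 * certainly a binary key that forgot HASH_BLOBS).
 */
static bool
hash_flags_ok(const HashCtlSketch *info, int flags)
{
    int kind = flags & (HASH_STRINGS | HASH_BLOBS | HASH_FUNCTION);

    if (!(flags & HASH_ELEM))
        return false;
    if (kind != HASH_STRINGS && kind != HASH_BLOBS && kind != HASH_FUNCTION)
        return false;           /* zero, or more than one, kind flag */
    if (info->keysize == 0 || info->entrysize < info->keysize)
        return false;
    if (kind == HASH_STRINGS && info->keysize <= 8)
        return false;
    return true;
}
```

The same checks appear in the patch as Asserts rather than a boolean result, so misuse fails loudly in assert-enabled builds instead of being silently tolerated.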
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
Noah Misch writes:
> On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote:
>> A quick count of grep hits suggests that the large majority of
>> existing hash_create() calls use HASH_BLOBS, and there might be
>> only order-of-ten calls that would need to be touched if we
>> required an explicit HASH_STRING flag.  So option #2 is seeming
>> kind of attractive.  Maybe that together with an assertion that
>> string keys have to exceed 8 or 16 bytes would be enough protection.

> Agreed.  I expect (2) gives most of the benefit.  Requiring 8-byte capacity
> should be harmless, and most architectures can zero 8 bytes in one
> instruction.  Requiring more bytes trades specificity for sensitivity.

Attached is a proposed patch that requires HASH_STRINGS to be stated
explicitly (in the event, there are 13 callers needing that) and
insists on keysize > 8 for string keys.

In examining the now-easily-visible uses of string keys, almost all of
them are using NAMEDATALEN-sized keys, or in a few places larger
values.  Only two are smaller:

1. ShmemIndex uses SHMEM_INDEX_KEYSIZE, which is only set to 48.

2. ResetUnloggedRelationsInDbspaceDir is using OIDCHARS + 1, because
it stores relfilenode OIDs as strings.  That seems pretty damfool to
me, so I'm inclined to change it to store binary OIDs instead; those'd
be a third the size (or probably a quarter the size after alignment
padding) and likely faster to hash or compare.  But I didn't do that
here, since it's still more than 8.  (I did whack it upside the head
to the extent of not storing its temporary hash table in
CacheMemoryContext.)

So it seems to me that insisting on keysize > 8 is fine.

There are a couple of other API oddities that maybe we should think
about while we're here:

* Should we just have a blanket insistence that all callers supply
HASH_ELEM?  The default sizes that dynahash.c uses without that are
undocumented and basically useless.  We're already asserting that
in the HASH_BLOBS path, which is the majority use-case, and this
patch now asserts it for HASH_STRINGS too.

* The coding convention that the HASHCTL argument struct should be
pre-zeroed seems to have been ignored at a lot of call sites.  I added
a memset call to a couple of callers that I was touching in this
patch, but I'm having second thoughts about that.  Maybe we should
just rip out all those memsets as pointless, since there's basically
no case where you'd use the memset to fill a field that you meant to
pass as zero.  The fact that hash_create() doesn't read fields it's
not told to by a flag means we should not need the memsets to avoid
uninitialized-memory reads.

			regards, tom lane

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 2dc9e44ae6..8b17fb06eb 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -2604,10 +2604,12 @@ createConnHash(void)
 {
 	HASHCTL		ctl;
 
+	memset(&ctl, 0, sizeof(ctl));
 	ctl.keysize = NAMEDATALEN;
 	ctl.entrysize = sizeof(remoteConnHashEnt);
 
-	return hash_create("Remote Con hash", NUMCONN, &ctl, HASH_ELEM);
+	return hash_create("Remote Con hash", NUMCONN, &ctl,
+					   HASH_ELEM | HASH_STRINGS);
 }
 
 static void
diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c
index 85986ec24a..ec7819ca77 100644
--- a/contrib/tablefunc/tablefunc.c
+++ b/contrib/tablefunc/tablefunc.c
@@ -726,7 +726,7 @@ load_categories_hash(char *cats_sql, MemoryContext per_query_ctx)
 	crosstab_hash = hash_create("crosstab hash",
 								INIT_CATS,
 								&ctl,
-								HASH_ELEM | HASH_CONTEXT);
+								HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
 
 	/* Connect to SPI manager */
 	if ((ret = SPI_connect()) < 0)
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 4b18be5b27..5ba7c2eb3c 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -414,7 +414,7 @@ InitQueryHashTable(void)
 	prepared_queries = hash_create("Prepared Queries",
 								   32,
 								   &hash_ctl,
-								   HASH_ELEM);
+								   HASH_ELEM | HASH_STRINGS);
 }
 
 /*
diff --git a/src/backend/nodes/extensible.c b/src/backend/nodes/extensible.c
index ab04459c55..2fe89fd361 100644
--- a/src/backend/nodes/extensible.c
+++ b/src/backend/nodes/extensible.c
@@ -51,7 +51,8 @@ RegisterExtensibleNodeEntry(HTAB **p_htable, const char *htable_label,
 		ctl.keysize = EXTNODENAME_MAX_LEN;
 		ctl.entrysize = sizeof(ExtensibleNodeEntry);
 
-		*p_htable = hash_create(htable_label, 100, &ctl, HASH_ELEM);
+		*p_htable = hash_create(htable_label, 100, &ctl,
+								HASH_ELEM | HASH_STRINGS);
 	}
 
 	if (strlen(extnodename) >= EXTNODENAME_MAX_LEN)
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0c2094f766..f21ab67ae4 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -175,7 +175,9 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
 		memset(&ctl, 0, sizeof(ctl));
 		ctl.keysize =
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
On Mon, Dec 14, 2020 at 1:36 AM Noah Misch wrote:
>
> On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote:
> > But what jumps out at me here is that this sort of error seems way
> > too easy to make, and evidently way too hard to detect.  What can we
> > do to make it more obvious if one has incorrectly used or omitted
> > HASH_BLOBS?  Both directions of error might easily escape notice on
> > little-endian hardware.
> >
> > I thought of a few ideas, all of which have drawbacks:
> >
> > 1. Invert the sense of the flag, ie HASH_BLOBS becomes the default.
> > This seems to just move the problem somewhere else, besides which
> > it'd require touching an awful lot of callers, and would silently
> > break third-party callers.
> >
> > 2. Don't allow a default: invent a new HASH_STRING flag, and
> > require that hash_create() calls specify exactly one of HASH_BLOBS,
> > HASH_STRING, or HASH_FUNCTION.  This doesn't completely fix the
> > hazard of mindless copy-and-paste, but I think it might make it
> > a little more obvious.  Still requires touching a lot of calls.
>
> I like (2), for making the bug harder and for greppability.  Probably
> pluralize it to HASH_STRINGS, for the parallel with HASH_BLOBS.
>
> > 3. Add some sort of heuristic restriction on keysize.  A keysize
> > that's only 4 or 8 bytes almost certainly is not a string.
> > This doesn't give us much traction for larger keysizes, though.
> >
> > 4. Disallow empty string keys, ie something like "Assert(s_len > 0)"
> > in string_hash().  I think we could get away with that given that
> > SQL disallows empty identifiers.  However, it would only help to
> > catch one direction of error (omitting HASH_BLOBS), and it would
> > only help on big-endian hardware, which is getting harder to find.
> > Still, we could hope that the buildfarm would detect errors.
>
> It's nontrivial to confirm that the empty-string key can't happen for a
> given hash table.  (In contrast, what (3) asserts on is usually a
> compile-time constant.)  I would stop short of adding (4), though it
> could be okay.
>
> > A quick count of grep hits suggests that the large majority of
> > existing hash_create() calls use HASH_BLOBS, and there might be
> > only order-of-ten calls that would need to be touched if we
> > required an explicit HASH_STRING flag.  So option #2 is seeming
> > kind of attractive.  Maybe that together with an assertion that
> > string keys have to exceed 8 or 16 bytes would be enough protection.
>
> Agreed.  I expect (2) gives most of the benefit.  Requiring 8-byte capacity
> should be harmless, and most architectures can zero 8 bytes in one
> instruction.  Requiring more bytes trades specificity for sensitivity.

+1.  I also think in most cases (2) would be sufficient to avoid such
bugs.  Adding a restriction on string key size might annoy some
out-of-core users who are already using small strings; however, an
8-byte minimum would still be okay.

--
With Regards,
Amit Kapila.
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
On 2020-12-13 17:49, Tom Lane wrote:
> 2. Don't allow a default: invent a new HASH_STRING flag, and
> require that hash_create() calls specify exactly one of HASH_BLOBS,
> HASH_STRING, or HASH_FUNCTION.  This doesn't completely fix the
> hazard of mindless copy-and-paste, but I think it might make it
> a little more obvious.  Still requires touching a lot of calls.

I think this sounds best, and we should also expand the documentation
of these flags a bit.

--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote:
> But what jumps out at me here is that this sort of error seems way
> too easy to make, and evidently way too hard to detect.  What can we
> do to make it more obvious if one has incorrectly used or omitted
> HASH_BLOBS?  Both directions of error might easily escape notice on
> little-endian hardware.
>
> I thought of a few ideas, all of which have drawbacks:
>
> 1. Invert the sense of the flag, ie HASH_BLOBS becomes the default.
> This seems to just move the problem somewhere else, besides which
> it'd require touching an awful lot of callers, and would silently
> break third-party callers.
>
> 2. Don't allow a default: invent a new HASH_STRING flag, and
> require that hash_create() calls specify exactly one of HASH_BLOBS,
> HASH_STRING, or HASH_FUNCTION.  This doesn't completely fix the
> hazard of mindless copy-and-paste, but I think it might make it
> a little more obvious.  Still requires touching a lot of calls.

I like (2), for making the bug harder and for greppability.  Probably
pluralize it to HASH_STRINGS, for the parallel with HASH_BLOBS.

> 3. Add some sort of heuristic restriction on keysize.  A keysize
> that's only 4 or 8 bytes almost certainly is not a string.
> This doesn't give us much traction for larger keysizes, though.
>
> 4. Disallow empty string keys, ie something like "Assert(s_len > 0)"
> in string_hash().  I think we could get away with that given that
> SQL disallows empty identifiers.  However, it would only help to
> catch one direction of error (omitting HASH_BLOBS), and it would
> only help on big-endian hardware, which is getting harder to find.
> Still, we could hope that the buildfarm would detect errors.

It's nontrivial to confirm that the empty-string key can't happen for a
given hash table.  (In contrast, what (3) asserts on is usually a
compile-time constant.)  I would stop short of adding (4), though it
could be okay.

> A quick count of grep hits suggests that the large majority of
> existing hash_create() calls use HASH_BLOBS, and there might be
> only order-of-ten calls that would need to be touched if we
> required an explicit HASH_STRING flag.  So option #2 is seeming
> kind of attractive.  Maybe that together with an assertion that
> string keys have to exceed 8 or 16 bytes would be enough protection.

Agreed.  I expect (2) gives most of the benefit.  Requiring 8-byte capacity
should be harmless, and most architectures can zero 8 bytes in one
instruction.  Requiring more bytes trades specificity for sensitivity.

> A different angle we could think about is that the name "HASH_BLOBS"
> is kind of un-obvious.  Maybe we should deprecate that spelling in
> favor of something like "HASH_BINARY".

With (2) in place, I wouldn't worry about renaming HASH_BLOBS.  It's
hard to confuse with HASH_STRINGS or HASH_FUNCTION.  If anything,
HASH_BLOBS conveys something more specific.  HASH_FUNCTION cases see
binary data, but that data has structure that promotes it out of
"blob" status.
HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
Amit Kapila writes:
> On Wed, Dec 9, 2020 at 2:56 PM Noah Misch wrote:
>> The problem is xidhash using strcmp() to compare keys; it needs memcmp().

> Your analysis is correct.

Sorry for not having noticed this thread before.  Noah's fix is
clearly correct, and I have no objection to the added test case.

But what jumps out at me here is that this sort of error seems way
too easy to make, and evidently way too hard to detect.  What can we
do to make it more obvious if one has incorrectly used or omitted
HASH_BLOBS?  Both directions of error might easily escape notice on
little-endian hardware.

I thought of a few ideas, all of which have drawbacks:

1. Invert the sense of the flag, ie HASH_BLOBS becomes the default.
This seems to just move the problem somewhere else, besides which
it'd require touching an awful lot of callers, and would silently
break third-party callers.

2. Don't allow a default: invent a new HASH_STRING flag, and
require that hash_create() calls specify exactly one of HASH_BLOBS,
HASH_STRING, or HASH_FUNCTION.  This doesn't completely fix the
hazard of mindless copy-and-paste, but I think it might make it
a little more obvious.  Still requires touching a lot of calls.

3. Add some sort of heuristic restriction on keysize.  A keysize
that's only 4 or 8 bytes almost certainly is not a string.
This doesn't give us much traction for larger keysizes, though.

4. Disallow empty string keys, ie something like "Assert(s_len > 0)"
in string_hash().  I think we could get away with that given that
SQL disallows empty identifiers.  However, it would only help to
catch one direction of error (omitting HASH_BLOBS), and it would
only help on big-endian hardware, which is getting harder to find.
Still, we could hope that the buildfarm would detect errors.

There might be some more options.  Also, some of these ideas could
be applied in combination.

A quick count of grep hits suggests that the large majority of
existing hash_create() calls use HASH_BLOBS, and there might be
only order-of-ten calls that would need to be touched if we
required an explicit HASH_STRING flag.  So option #2 is seeming
kind of attractive.  Maybe that together with an assertion that
string keys have to exceed 8 or 16 bytes would be enough protection.

Also, this census now suggests to me that the opposite problem
(copy-and-paste HASH_BLOBS when you meant string keys) might be a
real hazard, since so many of the existing prototypes that you might
copy have HASH_BLOBS.  I'm not sure if there's much to be done for
this case though.  A small saving grace is that it seems relatively
likely that you'd notice a functional problem pretty quickly with
this type of mistake, since lookups would tend to fail due to
trailing garbage after your lookup string.

A different angle we could think about is that the name "HASH_BLOBS"
is kind of un-obvious.  Maybe we should deprecate that spelling in
favor of something like "HASH_BINARY".

Thoughts?

			regards, tom lane
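[For illustration: the "trailing garbage after your lookup string" hazard is easy to demonstrate outside PostgreSQL. The KEYSIZE constant and fill_key() helper below are hypothetical, not dynahash code; the point is only how a string-view and a blob-view comparison diverge on fixed-width key buffers.]

```c
#include <string.h>

#define KEYSIZE 16              /* hypothetical fixed key width */

/*
 * Fill a fixed-width key buffer from a C string, zeroing the tail first.
 * If the tail is left as stack garbage, a blob-style comparison (memcmp
 * over the full keysize, as HASH_BLOBS does) will see two copies of the
 * same string as different keys, so lookups fail -- even though a
 * string-style comparison (strcmp, stopping at the NUL) finds them equal.
 */
static void
fill_key(char *dst, const char *src)
{
    memset(dst, 0, KEYSIZE);
    strncpy(dst, src, KEYSIZE - 1);
}
```

This is why mistakenly using HASH_BLOBS with string keys tends to surface quickly: any uninitialized padding after the terminator changes the key's identity.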
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Dec 9, 2020 at 2:56 PM Noah Misch wrote:
>
> Further testing showed it was a file location problem, not a deletion problem.
> The worker tried to open
> base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset/16393-510.changes.0, but these
> were the files actually existing:
>
> [nm@power-aix 0:2 2020-12-08T13:56:35 64gcc 0]$ ls -la $(find
> src/test/subscription/tmp_check -name '*sharedfileset*')
> src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.0.sharedfileset:
> total 408
> drwx------  2 nm usr    256 Dec 08 03:20 .
> drwx------  4 nm usr    256 Dec 08 03:20 ..
> -rw-------  1 nm usr 207806 Dec 08 03:20 16393-510.changes.0
>
> src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset:
> total 0
> drwx------  2 nm usr    256 Dec 08 03:20 .
> drwx------  4 nm usr    256 Dec 08 03:20 ..
> -rw-------  1 nm usr      0 Dec 08 03:20 16393-511.changes.0
>
> > > I have executed "make check" in the loop with only this file. I have
> > > repeated it 5000 times but no failure, I am wondering shall we try to
> > > execute in the same machine in a loop where it failed once?
> >
> > Yes, that might help. Noah, would it be possible for you to try that
>
> The problem is xidhash using strcmp() to compare keys; it needs memcmp(). For
> this to matter, xidhash must contain more than one element. Existing tests
> rarely exercise the multi-element scenario. Under heavy load, on this system,
> the test publisher can have two active transactions at once, in which case it
> does exercise multi-element xidhash. (The publisher is sensitive to timing,
> but the subscriber is not; once WAL contains interleaved records of two XIDs,
> the subscriber fails every time.) This would be much harder to reproduce on a
> little-endian system, where strcmp(&xid, &xid_plus_one) != 0. On big-endian,
> every small XID has zero in the first octet; they all look like empty strings.
>

Your analysis is correct.
> The attached patch has the one-line fix and some test suite changes that make
> this reproduce frequently on any big-endian system. I'm currently planning to
> drop the test suite changes from the commit, but I could keep them if folks
> like them. (They'd need more comments and timeout handling.)
>

I think it is better to keep this test which can always test multiple streams on the subscriber. Thanks for working on this.

--
With Regards,
Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Dec 02, 2020 at 01:50:25PM +0530, Amit Kapila wrote: > On Wed, Dec 2, 2020 at 1:20 PM Dilip Kumar wrote: > > > On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila > > > wrote: > > > > On Mon, Nov 30, 2020 at 3:14 AM Noah Misch wrote: > > > > > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote: > > > > > > Thanks, I have pushed the last patch. Let's wait for a day or so to > > > > > > see the buildfarm reports > > > > > > > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer=2020-09-08%2006%3A24%3A14 > > > > > failed the new 015_stream.pl test with the subscriber looping like > > > > > this: > > > > > > > > > > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication > > > > > apply worker for subscription "tap_sub" has started > > > > > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open > > > > > temporary file "16393-510.changes.0" from BufFile > > > > > "16393-510.changes": No such file or directory > > > > > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication > > > > > apply worker for subscription "tap_sub" has started > > > > > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker > > > > > "logical replication worker" (PID 13959252) exited with exit code 1 > > > > > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open > > > > > temporary file "16393-510.changes.0" from BufFile > > > > > "16393-510.changes": No such file or directory > > > > > ... > > > > The above kind of error can happen due to the following reasons: (a) > > > > the first time we sent the stream and created the file and that got > > > > removed before the second stream reached the subscriber. (b) from the > > > > publisher-side, we never sent the indication that it is the first > > > > stream and the subscriber directly tries to open the file thinking it > > > > is already there. Further testing showed it was a file location problem, not a deletion problem. 
The worker tried to open
base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset/16393-510.changes.0, but these
were the files actually existing:

[nm@power-aix 0:2 2020-12-08T13:56:35 64gcc 0]$ ls -la $(find src/test/subscription/tmp_check -name '*sharedfileset*')
src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.0.sharedfileset:
total 408
drwx------  2 nm usr    256 Dec 08 03:20 .
drwx------  4 nm usr    256 Dec 08 03:20 ..
-rw-------  1 nm usr 207806 Dec 08 03:20 16393-510.changes.0

src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset:
total 0
drwx------  2 nm usr    256 Dec 08 03:20 .
drwx------  4 nm usr    256 Dec 08 03:20 ..
-rw-------  1 nm usr      0 Dec 08 03:20 16393-511.changes.0

> > I have executed "make check" in the loop with only this file. I have
> > repeated it 5000 times but no failure, I am wondering shall we try to
> > execute in the same machine in a loop where it failed once?
>
> Yes, that might help. Noah, would it be possible for you to try that

The problem is xidhash using strcmp() to compare keys; it needs memcmp(). For this to matter, xidhash must contain more than one element. Existing tests rarely exercise the multi-element scenario. Under heavy load, on this system, the test publisher can have two active transactions at once, in which case it does exercise multi-element xidhash. (The publisher is sensitive to timing, but the subscriber is not; once WAL contains interleaved records of two XIDs, the subscriber fails every time.) This would be much harder to reproduce on a little-endian system, where strcmp(&xid, &xid_plus_one) != 0. On big-endian, every small XID has zero in the first octet; they all look like empty strings.

The attached patch has the one-line fix and some test suite changes that make this reproduce frequently on any big-endian system. I'm currently planning to drop the test suite changes from the commit, but I could keep them if folks like them.
(They'd need more comments and timeout handling.)

Author:     Noah Misch
Commit:     Noah Misch

    Use HASH_BLOBS for xidhash.

    This caused BufFile errors on buildfarm member sungazer, and SIGSEGV
    was possible.  Conditions for reaching those symptoms were more
    frequent on big-endian systems.

    Reviewed by FIXME.

    Discussion: https://postgr.es/m/20201129214441.ga691...@rfd.leadboat.com

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c37aafe..fce1dee 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -804,7 +804,7 @@ apply_handle_stream_start(StringInfo s)
 	hash_ctl.entrysize = sizeof(StreamXidHash);
 	hash_ctl.hcxt = ApplyContext;
 	xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
-
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Dec 2, 2020 at 1:20 PM Dilip Kumar wrote: > > On Tue, Dec 1, 2020 at 11:38 AM Dilip Kumar wrote: > > > > On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila wrote: > > > > > > On Mon, Nov 30, 2020 at 3:14 AM Noah Misch wrote: > > > > > > > > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote: > > > > > Thanks, I have pushed the last patch. Let's wait for a day or so to > > > > > see the buildfarm reports > > > > > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer=2020-09-08%2006%3A24%3A14 > > > > failed the new 015_stream.pl test with the subscriber looping like this: > > > > > > > > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication > > > > apply worker for subscription "tap_sub" has started > > > > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open > > > > temporary file "16393-510.changes.0" from BufFile "16393-510.changes": > > > > No such file or directory > > > > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication apply > > > > worker for subscription "tap_sub" has started > > > > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker > > > > "logical replication worker" (PID 13959252) exited with exit code 1 > > > > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open > > > > temporary file "16393-510.changes.0" from BufFile "16393-510.changes": > > > > No such file or directory > > > > ... > > > > > > > > What happened there? > > > > > > > > > > What is going on here is that the expected streaming file is missing. > > > Normally, the first time we send a stream of changes (some percentage > > > of transaction changes) we create the streaming file, and then in > > > respective streams we just keep on writing in that file the changes we > > > receive from the publisher, and on commit, we read that file and apply > > > all the changes. 
> > > > > > The above kind of error can happen due to the following reasons: (a) > > > the first time we sent the stream and created the file and that got > > > removed before the second stream reached the subscriber. (b) from the > > > publisher-side, we never sent the indication that it is the first > > > stream and the subscriber directly tries to open the file thinking it > > > is already there. > > > > > I have executed "make check" in the loop with only this file. I have > repeated it 5000 times but no failure, I am wondering shall we try to > execute in the same machine in a loop where it failed once? > Yes, that might help. Noah, would it be possible for you to try that out, and if it failed then probably get the stack trace of subscriber? If we are able to reproduce it then we can add elogs in functions SharedFileSetInit, BufFileCreateShared, BufFileOpenShared, and SharedFileSetDeleteAll to print the paths to see if we are sometimes unintentionally removing some files. I have checked the code and there doesn't appear to be any such problems but I might be missing something. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Dec 1, 2020 at 11:38 AM Dilip Kumar wrote: > > On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila wrote: > > > > On Mon, Nov 30, 2020 at 3:14 AM Noah Misch wrote: > > > > > > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote: > > > > Thanks, I have pushed the last patch. Let's wait for a day or so to > > > > see the buildfarm reports > > > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer=2020-09-08%2006%3A24%3A14 > > > failed the new 015_stream.pl test with the subscriber looping like this: > > > > > > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication apply > > > worker for subscription "tap_sub" has started > > > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open temporary > > > file "16393-510.changes.0" from BufFile "16393-510.changes": No such file > > > or directory > > > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication apply > > > worker for subscription "tap_sub" has started > > > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker "logical > > > replication worker" (PID 13959252) exited with exit code 1 > > > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open temporary > > > file "16393-510.changes.0" from BufFile "16393-510.changes": No such file > > > or directory > > > ... > > > > > > What happened there? > > > > > > > What is going on here is that the expected streaming file is missing. > > Normally, the first time we send a stream of changes (some percentage > > of transaction changes) we create the streaming file, and then in > > respective streams we just keep on writing in that file the changes we > > receive from the publisher, and on commit, we read that file and apply > > all the changes. > > > > The above kind of error can happen due to the following reasons: (a) > > the first time we sent the stream and created the file and that got > > removed before the second stream reached the subscriber. 
(b) from the > > publisher-side, we never sent the indication that it is the first > > stream and the subscriber directly tries to open the file thinking it > > is already there. > > > > Now, the publisher and subscriber log doesn't directly indicate any of > > the above problems but I have some observations. > > > > The subscriber log indicates that before the apply worker exits due to > > an error the new apply worker gets started. We delete the > > streaming-related temporary files on proc_exit, so one possibility > > could have been that the new apply worker has created the streaming > > file which the old apply worker has removed but that is not possible > > because we always create these temp-files by having procid in the > > path. > > Yeah, and I have tried to test on this line, basically, after the > streaming has started I have set the binary=on. Now using gdb I have > made the worker wait before it deletes the temp file and meanwhile the > new worker started and it worked properly as expected. > > > The other thing I observed in the code is that we can mark the > > transaction as streamed (via ReorderBufferTruncateTxn) if we try to > > stream a transaction that has no changes the first time we try to > > stream the transaction. This would lead to symptom (b) because the > > second-time when there are more changes we would stream the changes as > > it is not the first time. However, this shouldn't happen because we > > never pick-up a transaction to stream which has no changes. I can try > > to fix the code here such that we don't mark the transaction as > > streamed unless we have streamed at least one change but I don't see > > how it is related to this particular test failure. > > Yeah, this can be improved but as you mentioned that we never select > an empty transaction for streaming so this case should not occur. I > will perform some testing/review around this and report. I have executed "make check" in the loop with only this file. 
I have repeated it 5000 times but no failure, I am wondering shall we try to execute in the same machine in a loop where it failed once? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Dec 1, 2020 at 11:38 AM Dilip Kumar wrote: > > On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila wrote: > > > > > > What is going on here is that the expected streaming file is missing. > > Normally, the first time we send a stream of changes (some percentage > > of transaction changes) we create the streaming file, and then in > > respective streams we just keep on writing in that file the changes we > > receive from the publisher, and on commit, we read that file and apply > > all the changes. > > > > The above kind of error can happen due to the following reasons: (a) > > the first time we sent the stream and created the file and that got > > removed before the second stream reached the subscriber. (b) from the > > publisher-side, we never sent the indication that it is the first > > stream and the subscriber directly tries to open the file thinking it > > is already there. > > > > Now, the publisher and subscriber log doesn't directly indicate any of > > the above problems but I have some observations. > > > > The subscriber log indicates that before the apply worker exits due to > > an error the new apply worker gets started. We delete the > > streaming-related temporary files on proc_exit, so one possibility > > could have been that the new apply worker has created the streaming > > file which the old apply worker has removed but that is not possible > > because we always create these temp-files by having procid in the > > path. > > Yeah, and I have tried to test on this line, basically, after the > streaming has started I have set the binary=on. Now using gdb I have > made the worker wait before it deletes the temp file and meanwhile the > new worker started and it worked properly as expected. > > > The other thing I observed in the code is that we can mark the > > transaction as streamed (via ReorderBufferTruncateTxn) if we try to > > stream a transaction that has no changes the first time we try to > > stream the transaction. 
This would lead to symptom (b) because the
> > second-time when there are more changes we would stream the changes as
> > it is not the first time. However, this shouldn't happen because we
> > never pick up a transaction to stream which has no changes. I can try
> > to fix the code here such that we don't mark the transaction as
> > streamed unless we have streamed at least one change but I don't see
> > how it is related to this particular test failure.
>
> Yeah, this can be improved but as you mentioned that we never select
> an empty transaction for streaming so this case should not occur. I
> will perform some testing/review around this and report.
>

On further thinking about this point, I think the message seen on the subscriber [1] won't occur if we missed the first stream. This is because we always check the value of fileset from the stream hash table (xidhash) and it won't be there if we directly send the second stream, and that would have led to a different kind of problem (probably a crash). This symptom seems to be due to reason (a) mentioned above unless we are missing something else. Now, I am not sure how the file can be removed while the corresponding entry in the hash table (xidhash) is still present. The only reasons that come to mind are that some other process cleaned the pgsql_tmp directory thinking these temporary files are not required, or that someone removed them manually; neither seems a plausible reason.

[1] - ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory

--
With Regards,
Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila wrote: > > On Mon, Nov 30, 2020 at 3:14 AM Noah Misch wrote: > > > > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote: > > > Thanks, I have pushed the last patch. Let's wait for a day or so to > > > see the buildfarm reports > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer=2020-09-08%2006%3A24%3A14 > > failed the new 015_stream.pl test with the subscriber looping like this: > > > > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication apply > > worker for subscription "tap_sub" has started > > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open temporary > > file "16393-510.changes.0" from BufFile "16393-510.changes": No such file > > or directory > > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication apply > > worker for subscription "tap_sub" has started > > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker "logical > > replication worker" (PID 13959252) exited with exit code 1 > > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open temporary > > file "16393-510.changes.0" from BufFile "16393-510.changes": No such file > > or directory > > ... > > > > What happened there? > > > > What is going on here is that the expected streaming file is missing. > Normally, the first time we send a stream of changes (some percentage > of transaction changes) we create the streaming file, and then in > respective streams we just keep on writing in that file the changes we > receive from the publisher, and on commit, we read that file and apply > all the changes. > > The above kind of error can happen due to the following reasons: (a) > the first time we sent the stream and created the file and that got > removed before the second stream reached the subscriber. (b) from the > publisher-side, we never sent the indication that it is the first > stream and the subscriber directly tries to open the file thinking it > is already there. 
> > Now, the publisher and subscriber log doesn't directly indicate any of > the above problems but I have some observations. > > The subscriber log indicates that before the apply worker exits due to > an error the new apply worker gets started. We delete the > streaming-related temporary files on proc_exit, so one possibility > could have been that the new apply worker has created the streaming > file which the old apply worker has removed but that is not possible > because we always create these temp-files by having procid in the > path. Yeah, and I have tried to test on this line, basically, after the streaming has started I have set the binary=on. Now using gdb I have made the worker wait before it deletes the temp file and meanwhile the new worker started and it worked properly as expected. > The other thing I observed in the code is that we can mark the > transaction as streamed (via ReorderBufferTruncateTxn) if we try to > stream a transaction that has no changes the first time we try to > stream the transaction. This would lead to symptom (b) because the > second-time when there are more changes we would stream the changes as > it is not the first time. However, this shouldn't happen because we > never pick-up a transaction to stream which has no changes. I can try > to fix the code here such that we don't mark the transaction as > streamed unless we have streamed at least one change but I don't see > how it is related to this particular test failure. Yeah, this can be improved but as you mentioned that we never select an empty transaction for streaming so this case should not occur. I will perform some testing/review around this and report. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Nov 30, 2020 at 3:14 AM Noah Misch wrote: > > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote: > > Thanks, I have pushed the last patch. Let's wait for a day or so to > > see the buildfarm reports > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer=2020-09-08%2006%3A24%3A14 > failed the new 015_stream.pl test with the subscriber looping like this: > > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication apply > worker for subscription "tap_sub" has started > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open temporary > file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or > directory > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication apply > worker for subscription "tap_sub" has started > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker "logical > replication worker" (PID 13959252) exited with exit code 1 > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open temporary file > "16393-510.changes.0" from BufFile "16393-510.changes": No such file or > directory > ... > > What happened there? > What is going on here is that the expected streaming file is missing. Normally, the first time we send a stream of changes (some percentage of transaction changes) we create the streaming file, and then in respective streams we just keep on writing in that file the changes we receive from the publisher, and on commit, we read that file and apply all the changes. The above kind of error can happen due to the following reasons: (a) the first time we sent the stream and created the file and that got removed before the second stream reached the subscriber. (b) from the publisher-side, we never sent the indication that it is the first stream and the subscriber directly tries to open the file thinking it is already there. Now, the publisher and subscriber log doesn't directly indicate any of the above problems but I have some observations. 
The subscriber log indicates that before the apply worker exits due to an error the new apply worker gets started. We delete the streaming-related temporary files on proc_exit, so one possibility could have been that the new apply worker has created the streaming file which the old apply worker has removed but that is not possible because we always create these temp-files by having procid in the path. The other thing I observed in the code is that we can mark the transaction as streamed (via ReorderBufferTruncateTxn) if we try to stream a transaction that has no changes the first time we try to stream the transaction. This would lead to symptom (b) because the second-time when there are more changes we would stream the changes as it is not the first time. However, this shouldn't happen because we never pick-up a transaction to stream which has no changes. I can try to fix the code here such that we don't mark the transaction as streamed unless we have streamed at least one change but I don't see how it is related to this particular test failure. I am not sure why this failure is not repeated since it occurred a few months back, it's probably a timing issue. I have few timing issues in the last month or so related to this feature but I am not able to come up with a theory if any of those would have fixed this problem. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Nov 30, 2020 at 3:14 AM Noah Misch wrote:
>
> On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > Thanks, I have pushed the last patch. Let's wait for a day or so to
> > see the buildfarm reports
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> failed the new 015_stream.pl test with the subscriber looping like this:
>

I will look into this.

--
With Regards,
Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> Thanks, I have pushed the last patch. Let's wait for a day or so to
> see the buildfarm reports

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
failed the new 015_stream.pl test with the subscriber looping like this:

2020-09-08 11:22:49.848 UTC [13959252:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-08 11:22:54.045 UTC [13959252:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
2020-09-08 11:22:54.055 UTC [7602182:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-08 11:22:54.101 UTC [31785284:4] LOG:  background worker "logical replication worker" (PID 13959252) exited with exit code 1
2020-09-08 11:23:01.142 UTC [7602182:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
...

What happened there?
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Sep 15, 2020 at 10:17 AM Amit Kapila wrote: > > On Tue, Sep 15, 2020 at 8:38 AM Tom Lane wrote: > > > As far as I can see they are useless in this case but I think they > > > might be required in case the user provides its own allocator function > > > (using HASH_ALLOC). So, we can probably remove those from here? > > > > You could imagine writing a HASH_ALLOC allocator whose behavior > > varies depending on CurrentMemoryContext, but it seems like a > > pretty foolish/fragile way to do it. In any case I can think of, > > the hash table lives in one specific context and you really > > really do not want parts of it spread across other contexts. > > dynahash.c is not going to look kindly on pieces of what it > > is managing disappearing from under it. > > > > I agree that doesn't make sense. I have fixed all the comments > discussed in the attached patch. > Pushed. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Sep 15, 2020 at 8:38 AM Tom Lane wrote: > > Amit Kapila writes: > > On Mon, Sep 14, 2020 at 9:42 PM Tom Lane wrote: > >> BTW, unless someone has changed the behavior of dynahash when I > >> wasn't looking, those MemoryContextSwitchTos shown above are useless. > > > As far as I can see they are useless in this case but I think they > > might be required in case the user provides its own allocator function > > (using HASH_ALLOC). So, we can probably remove those from here? > > You could imagine writing a HASH_ALLOC allocator whose behavior > varies depending on CurrentMemoryContext, but it seems like a > pretty foolish/fragile way to do it. In any case I can think of, > the hash table lives in one specific context and you really > really do not want parts of it spread across other contexts. > dynahash.c is not going to look kindly on pieces of what it > is managing disappearing from under it. > I agree that doesn't make sense. I have fixed all the comments discussed in the attached patch. -- With Regards, Amit Kapila. v2-0001-Fix-initialization-of-RelationSyncEntry-for-strea.patch Description: Binary data
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Amit Kapila writes: > On Mon, Sep 14, 2020 at 9:42 PM Tom Lane wrote: >> BTW, unless someone has changed the behavior of dynahash when I >> wasn't looking, those MemoryContextSwitchTos shown above are useless. > As far as I can see they are useless in this case but I think they > might be required in case the user provides its own allocator function > (using HASH_ALLOC). So, we can probably remove those from here? You could imagine writing a HASH_ALLOC allocator whose behavior varies depending on CurrentMemoryContext, but it seems like a pretty foolish/fragile way to do it. In any case I can think of, the hash table lives in one specific context and you really really do not want parts of it spread across other contexts. dynahash.c is not going to look kindly on pieces of what it is managing disappearing from under it. (To be clear, objects that the hash entries contain pointers to are a different question. But the hash entries themselves have to have exactly the same lifespan as the hash table.) regards, tom lane
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Sep 14, 2020 at 9:42 PM Tom Lane wrote:
>
> Amit Kapila writes:
> > The attached patch will fix the issue. What do you think?
>
> I think it'd be cleaner to separate the initialization of a new entry from
> validation altogether, along the lines of
>
>     /* Find cached function info, creating if not found */
>     oldctx = MemoryContextSwitchTo(CacheMemoryContext);
>     entry = (RelationSyncEntry *) hash_search(RelationSyncCache,
>                                               (void *) &relid,
>                                               HASH_ENTER, &found);
>     MemoryContextSwitchTo(oldctx);
>     Assert(entry != NULL);
>
>     if (!found)
>     {
>         /* immediately make a new entry valid enough to satisfy callbacks */
>         entry->schema_sent = false;
>         entry->streamed_txns = NIL;
>         entry->replicate_valid = false;
>         /* are there any other fields we should clear here for safety??? */
>     }
>

If we want to separate validation then we need to initialize other fields like 'pubactions' and 'publish_as_relid' as well. I think it will be better to arrange it the way you are suggesting. So, I will change it along with other fields that required initialization.

>     /* Fill it in if not valid */
>     if (!entry->replicate_valid)
>     {
>         List *pubids = GetRelationPublications(relid);
>         ...
>
> BTW, unless someone has changed the behavior of dynahash when I
> wasn't looking, those MemoryContextSwitchTos shown above are useless.
>

As far as I can see they are useless in this case but I think they might be required in case the user provides its own allocator function (using HASH_ALLOC). So, we can probably remove those from here?

> Also, why does the comment refer to a "function" entry?
>

It should be "relation" instead. I'll take care of changing this as well.

--
With Regards,
Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Amit Kapila writes: > The attached patch will fix the issue. What do you think? I think it'd be cleaner to separate the initialization of a new entry from validation altogether, along the lines of /* Find cached function info, creating if not found */ oldctx = MemoryContextSwitchTo(CacheMemoryContext); entry = (RelationSyncEntry *) hash_search(RelationSyncCache, (void *) &relid, HASH_ENTER, &found); MemoryContextSwitchTo(oldctx); Assert(entry != NULL); if (!found) { /* immediately make a new entry valid enough to satisfy callbacks */ entry->schema_sent = false; entry->streamed_txns = NIL; entry->replicate_valid = false; /* are there any other fields we should clear here for safety??? */ } /* Fill it in if not valid */ if (!entry->replicate_valid) { List *pubids = GetRelationPublications(relid); ... BTW, unless someone has changed the behavior of dynahash when I wasn't looking, those MemoryContextSwitchTos shown above are useless. Also, why does the comment refer to a "function" entry? regards, tom lane
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Sep 14, 2020 at 4:50 PM Amit Kapila wrote: > > On Mon, Sep 14, 2020 at 1:23 PM Dilip Kumar wrote: > > > > On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila wrote: > > > > > > > > > Yeah, this is right, and here is some initial analysis. It seems to be > > > failing in below code: > > > rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..} > > > > > > This list can have elements only in 'streaming' mode (need to enable > > > 'streaming' with Create Subscription command) whereas none of the > > > tests in 010_truncate.pl is using 'streaming', so this list should be > > > empty (NULL). The two different assertion failures shown in BF reports > > > in list_free code are as below: > > > Assert(list->length > 0); > > > Assert(list->length <= list->max_length); > > > > > > It seems to me that this list is not initialized properly when it is > > > not used or maybe that is true in some special circumstances because > > > we initialize it in get_rel_sync_entry(). I am not sure if CCI build > > > is impacting this in some way. > > > > > > Even I have analyzed this but did not find any reason why the > > streamed_txns list should be anything other than NULL. The only thing > > is we are initializing the entry->streamed_txns to NULL and the list > > free is checking "if (list == NIL)" then return. However IMHO, that > > should not be an issue becase NIL is defined as (List*) NULL. > > > > Yeah, that is not the issue but it is better to initialize it with NIL > for the sake of consistency. The basic issue here was we were trying > to open/lock the relation(s) before initializing this list. Now, when > we process the invalidations during open relation, we try to access > this list in rel_sync_cache_relation_cb and that leads to assertion > failure. I have reproduced the exact scenario of 010_truncate.pl via > debugger. 
Basically, the backend on publisher has sent the > invalidation after truncating the relation 'tab1' and while processing > the truncate message if WALSender receives that message exactly after > creating the RelSyncEntry for 'tab1', the Assertion shown in BF can be > reproduced. Yeah, this is an issue and I am also able to reproduce this manually using gdb. Basically, I have inserted some data into the publication table and after that, I stopped in get_rel_sync_entry after creating the entry and before calling GetRelationPublications. Meanwhile, I truncated this table and then it hit the same issue you pointed here. > The attached patch will fix the issue. What do you think? The patch looks good to me and it fixes the reported issue. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Sep 14, 2020 at 1:23 PM Dilip Kumar wrote: > > On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila wrote: > > > > > > Yeah, this is right, and here is some initial analysis. It seems to be > > failing in below code: > > rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..} > > > > This list can have elements only in 'streaming' mode (need to enable > > 'streaming' with Create Subscription command) whereas none of the > > tests in 010_truncate.pl is using 'streaming', so this list should be > > empty (NULL). The two different assertion failures shown in BF reports > > in list_free code are as below: > > Assert(list->length > 0); > > Assert(list->length <= list->max_length); > > > > It seems to me that this list is not initialized properly when it is > > not used or maybe that is true in some special circumstances because > > we initialize it in get_rel_sync_entry(). I am not sure if CCI build > > is impacting this in some way. > > > Even I have analyzed this but did not find any reason why the > streamed_txns list should be anything other than NULL. The only thing > is we are initializing the entry->streamed_txns to NULL and the list > free is checking "if (list == NIL)" then return. However IMHO, that > should not be an issue becase NIL is defined as (List*) NULL. > Yeah, that is not the issue but it is better to initialize it with NIL for the sake of consistency. The basic issue here was we were trying to open/lock the relation(s) before initializing this list. Now, when we process the invalidations during open relation, we try to access this list in rel_sync_cache_relation_cb and that leads to assertion failure. I have reproduced the exact scenario of 010_truncate.pl via debugger. Basically, the backend on publisher has sent the invalidation after truncating the relation 'tab1' and while processing the truncate message if WALSender receives that message exactly after creating the RelSyncEntry for 'tab1', the Assertion shown in BF can be reproduced. 
The attached patch will fix the issue. What do you think? -- With Regards, Amit Kapila. v1-0001-Fix-initialization-of-RelationSyncEntry-for-strea.patch Description: Binary data
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila wrote: > > On Mon, Sep 14, 2020 at 3:08 AM Tom Lane wrote: > > > > Amit Kapila writes: > > > Pushed. > > > > Observe the following reports: > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus=2020-09-13%2016%3A54%3A03 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes=2020-09-10%2009%3A08%3A03 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis=2020-09-05%2020%3A22%3A02 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet=2020-09-04%2001%3A52%3A03 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet=2020-09-03%2020%3A54%3A04 > > > > These are all on HEAD, and all within the last ten days, and I see > > nothing comparable in any branch before that. So it's hard to avoid > > the conclusion that somebody broke something about ten days ago. > > > > None of these animals provided gdb backtraces; but we do have a built-in > > trace from several, and they all look like pgoutput.so is trying to > > list_free() garbage, somewhere inside a relcache invalidation/rebuild > > scenario: > > > > Yeah, this is right, and here is some initial analysis. It seems to be > failing in below code: > rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..} > > This list can have elements only in 'streaming' mode (need to enable > 'streaming' with Create Subscription command) whereas none of the > tests in 010_truncate.pl is using 'streaming', so this list should be > empty (NULL). The two different assertion failures shown in BF reports > in list_free code are as below: > Assert(list->length > 0); > Assert(list->length <= list->max_length); > > It seems to me that this list is not initialized properly when it is > not used or maybe that is true in some special circumstances because > we initialize it in get_rel_sync_entry(). I am not sure if CCI build > is impacting this in some way. 
I have also analyzed this but did not find any reason why the streamed_txns list should be anything other than NULL. The only thing is we are initializing entry->streamed_txns to NULL and list_free checks "if (list == NIL)" and returns. However IMHO, that should not be an issue because NIL is defined as (List*) NULL. I am doing further testing and investigation. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Sep 14, 2020 at 3:08 AM Tom Lane wrote: > > Amit Kapila writes: > > Pushed. > > Observe the following reports: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus=2020-09-13%2016%3A54%3A03 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes=2020-09-10%2009%3A08%3A03 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis=2020-09-05%2020%3A22%3A02 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet=2020-09-04%2001%3A52%3A03 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet=2020-09-03%2020%3A54%3A04 > > These are all on HEAD, and all within the last ten days, and I see > nothing comparable in any branch before that. So it's hard to avoid > the conclusion that somebody broke something about ten days ago. > > None of these animals provided gdb backtraces; but we do have a built-in > trace from several, and they all look like pgoutput.so is trying to > list_free() garbage, somewhere inside a relcache invalidation/rebuild > scenario: > Yeah, this is right, and here is some initial analysis. It seems to be failing in below code: rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..} This list can have elements only in 'streaming' mode (need to enable 'streaming' with Create Subscription command) whereas none of the tests in 010_truncate.pl is using 'streaming', so this list should be empty (NULL). The two different assertion failures shown in BF reports in list_free code are as below: Assert(list->length > 0); Assert(list->length <= list->max_length); It seems to me that this list is not initialized properly when it is not used or maybe that is true in some special circumstances because we initialize it in get_rel_sync_entry(). I am not sure if CCI build is impacting this in some way. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
I wrote: > * Starting over, it appears that 001_rep_changes.pl almost immediately > gets into an infinite loop. It does not complete the third test step, > rather infinitely waiting for progress to be made. Ah, looking closer, the problem is that wal_receiver_timeout = 60s is too short when the sender is using CCA. It times out before we can get through the needed data transmission. regards, tom lane
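Tom's diagnosis suggests a practical workaround when running the replication TAP tests under CLOBBER_CACHE_ALWAYS, where decoding runs orders of magnitude slower than usual. A hedged sketch of postgresql.conf settings for such a run (the specific values are illustrative, not taken from the thread; 0 disables a timeout entirely):

```
# Illustrative settings for CLOBBER_CACHE_ALWAYS test runs:
wal_receiver_timeout = 0    # subscriber side: don't give up on a slow sender
wal_sender_timeout = 0      # publisher side: don't drop a slow receiver
```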
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Sep 14, 2020 at 3:08 AM Tom Lane wrote: > > Amit Kapila writes: > > Pushed. > > Observe the following reports: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2020-09-10%2009%3A08%3A03 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2020-09-05%2020%3A22%3A02 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-04%2001%3A52%3A03 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-03%2020%3A54%3A04 > > These are all on HEAD, and all within the last ten days, and I see > nothing comparable in any branch before that. So it's hard to avoid > the conclusion that somebody broke something about ten days ago. > I'll analyze these reports. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
I wrote: > Probably this requires a relcache inval at the wrong time; > although we have recent passes from CLOBBER_CACHE_ALWAYS animals, > so that can't be the whole triggering condition. I wonder whether > it is relevant that all of the complaining animals are JIT-enabled. Hmmm ... I take that back. hyrax has indeed passed since this went in, but *it doesn't run any TAP tests*. So the buildfarm offers no information about whether the replication tests work under CLOBBER_CACHE_ALWAYS. Realizing that, I built an installation that way and tried to run the subscription tests. Results so far: * Running 010_truncate.pl by itself passed for me. So there's still some unexplained factor needed to trigger the buildfarm failures. (I'm wondering about concurrent autovacuum activity now...) * Starting over, it appears that 001_rep_changes.pl almost immediately gets into an infinite loop. It does not complete the third test step, rather infinitely waiting for progress to be made. The publisher log shows a repeating loop like 2020-09-13 21:16:05.734 EDT [928529] tap_sub LOG: could not send data to client: Broken pipe 2020-09-13 21:16:05.734 EDT [928529] tap_sub CONTEXT: slot "tap_sub", output plugin "pgoutput", in the commit callback, associated LSN 0/1660628 2020-09-13 21:16:05.843 EDT [928581] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 2020-09-13 21:16:05.861 EDT [928582] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2020-09-13 21:16:05.929 EDT [928582] tap_sub LOG: received replication command: IDENTIFY_SYSTEM 2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"') 2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG: starting logical decoding for slot "tap_sub" 2020-09-13 
21:16:05.930 EDT [928582] tap_sub DETAIL: Streaming transactions committing after 0/1652820, reading WAL from 0/1651B20. 2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG: logical decoding found consistent point at 0/1651B20 2020-09-13 21:16:05.930 EDT [928582] tap_sub DETAIL: There are no running transactions. 2020-09-13 21:16:21.560 EDT [928600] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 2020-09-13 21:16:37.291 EDT [928610] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 2020-09-13 21:16:52.959 EDT [928627] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 2020-09-13 21:17:06.866 EDT [928636] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2020-09-13 21:17:06.934 EDT [928636] tap_sub LOG: received replication command: IDENTIFY_SYSTEM 2020-09-13 21:17:06.934 EDT [928636] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"') 2020-09-13 21:17:06.934 EDT [928636] tap_sub ERROR: replication slot "tap_sub" is active for PID 928582 2020-09-13 21:17:07.811 EDT [928638] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2020-09-13 21:17:07.880 EDT [928638] tap_sub LOG: received replication command: IDENTIFY_SYSTEM 2020-09-13 21:17:07.881 EDT [928638] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"') 2020-09-13 21:17:07.881 EDT [928638] tap_sub ERROR: replication slot "tap_sub" is active for PID 928582 2020-09-13 
21:17:08.618 EDT [928641] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 2020-09-13 21:17:08.753 EDT [928642] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2020-09-13 21:17:08.821 EDT [928642] tap_sub LOG: received replication command: IDENTIFY_SYSTEM 2020-09-13 21:17:08.821 EDT [928642] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"') 2020-09-13 21:17:08.821 EDT [928642] tap_sub ERROR: replication slot "tap_sub" is active for PID 928582 2020-09-13 21:17:09.689 EDT [928645] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2020-09-13 21:17:09.756 EDT [928645] tap_sub LOG: received replication command: IDENTIFY_SYSTEM
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Amit Kapila writes: > Pushed. Observe the following reports: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus=2020-09-13%2016%3A54%3A03 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes=2020-09-10%2009%3A08%3A03 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis=2020-09-05%2020%3A22%3A02 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet=2020-09-04%2001%3A52%3A03 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet=2020-09-03%2020%3A54%3A04 These are all on HEAD, and all within the last ten days, and I see nothing comparable in any branch before that. So it's hard to avoid the conclusion that somebody broke something about ten days ago. None of these animals provided gdb backtraces; but we do have a built-in trace from several, and they all look like pgoutput.so is trying to list_free() garbage, somewhere inside a relcache invalidation/rebuild scenario: TRAP: FailedAssertion("list->length > 0", File: "/home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/../pgsql/src/backend/nodes/list.c", Line: 68) postgres: publisher: walsender bf [local] idle(ExceptionalCondition+0x57)[0x9081f7] postgres: publisher: walsender bf [local] idle[0x6bcc70] postgres: publisher: walsender bf [local] idle(list_free+0x11)[0x6bdc01] /home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/tmp_install/home/bf/build/buildfarm-idiacanthus/HEAD/inst/lib/postgresql/pgoutput.so(+0x35d8)[0x7fa4c5a6f5d8] postgres: publisher: walsender bf [local] idle(LocalExecuteInvalidationMessage+0x15b)[0x8f0cdb] postgres: publisher: walsender bf [local] idle(ReceiveSharedInvalidMessages+0x4b)[0x7bca0b] postgres: publisher: walsender bf [local] idle(LockRelationOid+0x56)[0x7c19e6] postgres: publisher: walsender bf [local] idle(relation_open+0x1c)[0x4a2d0c] postgres: publisher: walsender bf [local] idle(table_open+0x6)[0x524486] postgres: publisher: walsender bf [local] idle[0x9017f2] postgres: publisher: walsender bf [local] 
idle[0x8fabd4] postgres: publisher: walsender bf [local] idle[0x8fa58a] postgres: publisher: walsender bf [local] idle(RelationCacheInvalidateEntry+0xaf)[0x8fbdbf] postgres: publisher: walsender bf [local] idle(LocalExecuteInvalidationMessage+0xec)[0x8f0c6c] postgres: publisher: walsender bf [local] idle(ReceiveSharedInvalidMessages+0xcb)[0x7bca8b] postgres: publisher: walsender bf [local] idle(LockRelationOid+0x56)[0x7c19e6] postgres: publisher: walsender bf [local] idle(relation_open+0x1c)[0x4a2d0c] postgres: publisher: walsender bf [local] idle(table_open+0x6)[0x524486] postgres: publisher: walsender bf [local] idle[0x8ee8b0] 010_truncate.pl itself hasn't changed meaningfully in a good long time. However, I see that 464824323 added a whole boatload of code to pgoutput.c, and the timing is right for that commit to be the culprit, so that's what I'm betting on. Probably this requires a relcache inval at the wrong time; although we have recent passes from CLOBBER_CACHE_ALWAYS animals, so that can't be the whole triggering condition. I wonder whether it is relevant that all of the complaining animals are JIT-enabled. regards, tom lane
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Sep 9, 2020 at 2:26 PM Amit Kapila wrote: > > On Wed, Sep 9, 2020 at 2:13 PM Tomas Vondra > wrote: > > > > Hi, > > > > while looking at the streaming code I noticed two minor issues: > > > > 1) logicalrep_read_stream_stop is never defined/called, so the prototype > > in logicalproto.h is unnecessary > > > > 2) minor typo in one of the comments > > > > Patch attached. > > > > LGTM. > Pushed. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Sep 9, 2020 at 2:13 PM Tomas Vondra wrote: > Hi, > > while looking at the streaming code I noticed two minor issues: > > 1) logicalrep_read_stream_stop is never defined/called, so the prototype > in logicalproto.h is unnecessary > > Yeah, right. > 2) minor typo in one of the comments > > Patch attached. > Looks good to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Sep 9, 2020 at 2:13 PM Tomas Vondra wrote: > > Hi, > > while looking at the streaming code I noticed two minor issues: > > 1) logicalrep_read_stream_stop is never defined/called, so the prototype > in logicalproto.h is unnecessary > > 2) minor typo in one of the comments > > Patch attached. > LGTM. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Hi, while looking at the streaming code I noticed two minor issues: 1) logicalrep_read_stream_stop is never defined/called, so the prototype in logicalproto.h is unnecessary 2) minor typo in one of the comments Patch attached. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c index c29c088813..343f03129f 100644 --- a/src/backend/replication/pgoutput/pgoutput.c +++ b/src/backend/replication/pgoutput/pgoutput.c @@ -77,7 +77,7 @@ static void send_relation_and_attrs(Relation relation, TransactionId xid, * and with streamed transactions the commit order may be different from * the order the transactions are sent in. Also, the (sub) transactions * might get aborted so we need to send the schema for each (sub) transaction - * so that we don't loose the schema information on abort. For handling this, + * so that we don't lose the schema information on abort. For handling this, * we maintain the list of xids (streamed_txns) for those we have already sent * the schema. * diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h index 53905ee608..607a728508 100644 --- a/src/include/replication/logicalproto.h +++ b/src/include/replication/logicalproto.h @@ -133,7 +133,6 @@ extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid, extern TransactionId logicalrep_read_stream_start(StringInfo in, bool *first_segment); extern void logicalrep_write_stream_stop(StringInfo out); -extern TransactionId logicalrep_read_stream_stop(StringInfo in); extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn, XLogRecPtr commit_lsn); extern TransactionId logicalrep_read_stream_commit(StringInfo out,
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Sep 7, 2020 at 12:57 PM Dilip Kumar wrote: > > On Mon, Sep 7, 2020 at 12:00 PM Amit Kapila wrote: >> >> On Sat, Sep 5, 2020 at 8:55 PM Dilip Kumar wrote: >> > >> > >> > I have reviewed the changes and looks fine to me. >> > >> >> Thanks, I have pushed the last patch. Let's wait for a day or so to >> see the buildfarm reports and then we can probably close this CF >> entry. > > > Thanks. > I have updated the status of CF entry as committed now. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Sep 7, 2020 at 12:00 PM Amit Kapila wrote: > On Sat, Sep 5, 2020 at 8:55 PM Dilip Kumar wrote: > > > > > > I have reviewed the changes and looks fine to me. > > > > Thanks, I have pushed the last patch. Let's wait for a day or so to > see the buildfarm reports and then we can probably close this CF > entry. Thanks. > I am aware that we have one patch related to stats still > pending but I think we can tackle it along with the spill stats patch > which is being discussed in a different thread [1]. Do let me know if > I have missed anything? > > [1] - > https://www.postgresql.org/message-id/CAA4eK1JBqQh9cBKjO-nKOOE%3D7f6ONDCZp0TJZfn4VsQqRZ%2BuYA%40mail.gmail.com Sound good to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Sat, Sep 5, 2020 at 8:55 PM Dilip Kumar wrote: > > > I have reviewed the changes and looks fine to me. > Thanks, I have pushed the last patch. Let's wait for a day or so to see the buildfarm reports and then we can probably close this CF entry. I am aware that we have one patch related to stats still pending but I think we can tackle it along with the spill stats patch which is being discussed in a different thread [1]. Do let me know if I have missed anything? [1] - https://www.postgresql.org/message-id/CAA4eK1JBqQh9cBKjO-nKOOE%3D7f6ONDCZp0TJZfn4VsQqRZ%2BuYA%40mail.gmail.com -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Sat, 5 Sep 2020 at 4:02 PM, Amit Kapila wrote: > On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila > wrote: > > > > > > On Tue, Sep 1, 2020 at 9:28 AM Amit Kapila > wrote: > > > > > > I have fixed all the comments except the below comments. > > > 1. verify the size of various tests to ensure that it is above > > > logical_decoding_work_mem. > > > 2. I have checked that in one of the previous patches, we have a test > > > v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case > > > quite similar to what we have in > > > > v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort. > > > If there is any difference that can cover more scenarios then can we > > > consider merging them into one test? > > > > > > > I have compared these two tests and found that the only thing > > additional in the test case present in > > v53-0004-Add-TAP-test-for-streaming-vs.-DDL was that it was performing > > few savepoints and DMLs after doing the first rollback to savepoint > > and I included that in one of the existing tests in > > 018_stream_subxact_abort.pl. I have added one test for Rollback, > > changed few messages, removed one test case which was not making any > > sense in the patch. See attached and let me know what you think about > > it? I have reviewed the changes and looks fine to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila wrote: > > On Tue, Sep 1, 2020 at 9:28 AM Amit Kapila wrote: > > I have fixed all the comments except the below comments. > 1. verify the size of various tests to ensure that it is above > logical_decoding_work_mem. > 2. I have checked that in one of the previous patches, we have a test > v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case > quite similar to what we have in > v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort. > If there is any difference that can cover more scenarios then can we > consider merging them into one test? > I have compared these two tests and found that the only thing additional in the test case present in v53-0004-Add-TAP-test-for-streaming-vs.-DDL was that it was performing few savepoints and DMLs after doing the first rollback to savepoint and I included that in one of the existing tests in 018_stream_subxact_abort.pl. I have added one test for Rollback, changed few messages, removed one test case which was not making any sense in the patch. See attached and let me know what you think about it? -- With Regards, Amit Kapila. v61-0001-Add-additional-tests-to-test-streaming-of-in-pro.patch Description: Binary data
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, Sep 4, 2020 at 3:10 AM Bossart, Nathan wrote: > > I noticed a small compiler warning for this. > > diff --git a/src/backend/replication/logical/worker.c > b/src/backend/replication/logical/worker.c > index 812aca8011..88d3444c39 100644 > --- a/src/backend/replication/logical/worker.c > +++ b/src/backend/replication/logical/worker.c > @@ -199,7 +199,7 @@ typedef struct ApplySubXactData > static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL}; > > static void subxact_filename(char *path, Oid subid, TransactionId xid); > -static void changes_filename(char *path, Oid subid, TransactionId xid); > +static inline void changes_filename(char *path, Oid subid, TransactionId > xid); > Thanks for the report, I'll take care of this. I think the nearby similar function subxact_filename() should also be inline. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
I noticed a small compiler warning for this. diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c index 812aca8011..88d3444c39 100644 --- a/src/backend/replication/logical/worker.c +++ b/src/backend/replication/logical/worker.c @@ -199,7 +199,7 @@ typedef struct ApplySubXactData static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL}; static void subxact_filename(char *path, Oid subid, TransactionId xid); -static void changes_filename(char *path, Oid subid, TransactionId xid); +static inline void changes_filename(char *path, Oid subid, TransactionId xid); /* * Information about subtransactions of a given toplevel transaction. Nathan
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Sep 2, 2020 at 7:19 PM Amit Kapila wrote: > On Wed, Sep 2, 2020 at 3:41 PM Dilip Kumar wrote: > > > > On Wed, Sep 2, 2020 at 10:55 AM Amit Kapila > wrote: > > > > > > > > > > > > > We can combine the tests in 015_stream_simple.pl and > > > 020_stream_binary.pl as I can't see a good reason to keep them > > > separate. Then, I think we can keep only this part with the main patch > > > and extract other tests into a separate patch. Basically, we can > > > commit the basic tests with the main patch and then keep the advanced > > > tests separately. I am afraid that there are some tests that don't add > > > much value so we can review them separately. > > > > Fixed > > > > I have slightly adjusted this test and ran pgindent on the patch. I am > planning to push this tomorrow unless you have more comments. > Looks good to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Sep 2, 2020 at 3:41 PM Dilip Kumar wrote: > > On Wed, Sep 2, 2020 at 10:55 AM Amit Kapila wrote: > > > > > > > > > We can combine the tests in 015_stream_simple.pl and > > 020_stream_binary.pl as I can't see a good reason to keep them > > separate. Then, I think we can keep only this part with the main patch > > and extract other tests into a separate patch. Basically, we can > > commit the basic tests with the main patch and then keep the advanced > > tests separately. I am afraid that there are some tests that don't add > > much value so we can review them separately. > > Fixed > I have slightly adjusted this test and ran pgindent on the patch. I am planning to push this tomorrow unless you have more comments. -- With Regards, Amit Kapila. v60-0001-Add-support-for-streaming-to-built-in-logical-re.patch Description: Binary data
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila wrote: > > On Tue, Sep 1, 2020 at 9:28 AM Amit Kapila wrote: > > > > On Mon, Aug 31, 2020 at 7:28 PM Dilip Kumar wrote: > > > > > > On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila > > > wrote: > > > > In functions cleanup_rel_sync_cache and > > get_schema_sent_in_streamed_txn, lets cast the result of lfirst_int to > > uint32 as suggested by Tom [1]. Also, lets keep the way we compare > > xids consistent in both functions, i.e, if (xid == lfirst_int(lc)). > > > > Fixed this in the attached patch. > > > The behavior tested by the test case added for this is not clear > > primarily because of comments. > > > > +++ b/src/test/subscription/t/021_stream_schema.pl > > @@ -0,0 +1,80 @@ > > +# Test behavior with streaming transaction exceeding > > logical_decoding_work_mem > > ... > > +# large (streamed) transaction with DDL, DML and ROLLBACKs > > +$node_publisher->safe_psql('postgres', q{ > > +BEGIN; > > +ALTER TABLE test_tab ADD COLUMN c INT; > > +INSERT INTO test_tab SELECT i, md5(i::text), i FROM > > generate_series(3,3000) s(i); > > +ALTER TABLE test_tab ADD COLUMN d INT; > > +COMMIT; > > +}); > > + > > +# large (streamed) transaction with DDL, DML and ROLLBACKs > > +$node_publisher->safe_psql('postgres', q{ > > +BEGIN; > > +INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM > > generate_series(3001,3005) s(i); > > +COMMIT; > > +}); > > +wait_for_caught_up($node_publisher, $appname); > > > > I understand that how this test will test the functionality related to > > schema_sent stuff but neither the comments atop of file nor atop the > > test case explains it clearly. > > > > Added comments for this test. > > > > > Few more comments: > > > > > > > > > 2. > > > > 009_stream_simple.pl > > > > +# Insert, update and delete enough rows to exceed the 64kB limit. 
> > > > +$node_publisher->safe_psql('postgres', q{ > > > > +BEGIN; > > > > +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, > > > > 5000) s(i); > > > > +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0; > > > > +DELETE FROM test_tab WHERE mod(a,3) = 0; > > > > +COMMIT; > > > > +}); > > > > > > > > How much above this data is 64kB limit? I just wanted to see that it > > > > should not be on borderline and then due to some alignment issues the > > > > streaming doesn't happen on some machines? Also, how such a test > > > > ensures that the streaming has happened because the way we are > > > > checking results, won't it be the same for the non-streaming case as > > > > well? > > > > > > Only for this case, or you mean for all the tests? > > > > > > > I have not done this yet. Most of the test cases are generating above 100kb and a few are around 72kb, Please find the test case wise data size. 015 - 200kb 016 - 150kb 017 - 72kb 018 - 72kb before first rollback to sb and total ~100kb 019 - 76kb before first rollback to sb and total ~100kb 020 - 150kb 021 - 100kb > > It is better to do it for all tests and I have clarified this in my > > next email sent yesterday [2] where I have raised a few more comments > > as well. I hope you have not missed that email. I saw that I think I replied to this before seeing that. > > > > 3. 
> > > > +# Change the local values of the extra columns on the subscriber, > > > > +# update publisher, and check that subscriber retains the expected > > > > +# values > > > > +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = > > > > 'epoch'::timestamptz + 987654321 * interval '1s'"); > > > > +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = > > > > md5(a::text)"); > > > > + > > > > +wait_for_caught_up($node_publisher, $appname); > > > > + > > > > +$result = > > > > + $node_subscriber->safe_psql('postgres', "SELECT count(*), > > > > count(extract(epoch from c) = 987654321), count(d = 999) FROM > > > > test_tab"); > > > > +is($result, qq(3334|3334|3334), 'check extra columns contain locally > > > > changed data'); > > > > > > > > Again, how this test is relevant to streaming mode? > > > > > > I agree, it is not specific to the streaming. > > > > > I think we can leave this as of now. After committing the stats > patches by Sawada-San and Ajin, we might be able to improve this test. Make sense to me. > > +sub wait_for_caught_up > > +{ > > + my ($node, $appname) = @_; > > + > > + $node->poll_query_until('postgres', > > +"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication > > WHERE application_name = '$appname';" > > + ) or die "Timed ou > > > > The patch has added this in all the test files if it is used in so > > many tests then we need to add this in some generic place > > (PostgresNode.pm) but actually, I am not sure if need this at all. Why > > can't the existing wait_for_catchup in PostgresNode.pm serve the same > > purpose. > > > > Changed as per this suggestion. Okay. > > 2. > > In system_views.sql, > > > > -- All columns of pg_subscription except subconninfo are readable. > > REVOKE ALL ON pg_subscription FROM
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila wrote: > > I have fixed all the comments except .. > 3. +# Change the local values of the extra columns on the subscriber, > +# update publisher, and check that subscriber retains the expected > +# values > +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = > 'epoch'::timestamptz + 987654321 * interval '1s'"); > +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = > md5(a::text)"); > + > +wait_for_caught_up($node_publisher, $appname); > + > +$result = > + $node_subscriber->safe_psql('postgres', "SELECT count(*), > count(extract(epoch from c) = 987654321), count(d = 999) FROM > test_tab"); > +is($result, qq(3334|3334|3334), 'check extra columns contain locally > changed data'); > > Again, how this test is relevant to streaming mode? > I think we can keep this test in one of the newly added tests say in 015_stream_simple.pl to ensure that after streaming transaction, the non-streaming one behaves expectedly. So we can change the comment as "Change the local values of the extra columns on the subscriber, update publisher, and check that subscriber retains the expected values. This is to ensure that non-streaming transactions behave properly after a streaming transaction." We can remove this test from the other two places 016_stream_subxact.pl and 020_stream_binary.pl. > 4. Apart from the above, I think we should think of minimizing the > test cases which can be committed with the base patch. We can later > add more tests. > We can combine the tests in 015_stream_simple.pl and 020_stream_binary.pl as I can't see a good reason to keep them separate. Then, I think we can keep only this part with the main patch and extract other tests into a separate patch. Basically, we can commit the basic tests with the main patch and then keep the advanced tests separately. I am afraid that there are some tests that don't add much value so we can review them separately. 
One minor comment for option 'streaming = on', spacing-wise it should be consistent in all the tests. Similarly, we can combine 017_stream_ddl.pl and 021_stream_schema.pl as both contains similar tests. As per the above suggestion, this will be in a separate patch though. If you agree with the above suggestions then kindly make these adjustments and send the updated patch. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Aug 31, 2020 at 10:27 PM Neha Sharma wrote: > > Hi Amit/Dilip, > > I have tested a few scenarios on top of the v56 patches, where the > replication worker still had few subtransactions in uncommitted state and we > restart the publisher server. > No crash or data discrepancies were observed, attached are the test scenarios > verified. > Thanks, I have pushed the fix (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=4ab77697f67aa5b90b032b9175b46901859da6d7). -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Aug 31, 2020 at 7:28 PM Dilip Kumar wrote: > > On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila wrote: > > > > On Mon, Aug 31, 2020 at 10:49 AM Amit Kapila > > wrote: > > > > > > On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar wrote: > > > > > > > > > > Another comment: > > > > > > +cleanup_rel_sync_cache(TransactionId xid, bool is_commit) > > > +{ > > > + HASH_SEQ_STATUS hash_seq; > > > + RelationSyncEntry *entry; > > > + > > > + Assert(RelationSyncCache != NULL); > > > + > > > + hash_seq_init(&hash_seq, RelationSyncCache); > > > + while ((entry = hash_seq_search(&hash_seq)) != NULL) > > > + { > > > + if (is_commit) > > > + entry->schema_sent = true; > > > > > > How is it correct to set 'entry->schema_sent' for all the entries in > > > RelationSyncCache? Consider a case where due to invalidation in an > > > unrelated transaction we have set the flag schema_sent for a > > > particular relation 'r1' as 'false' and that transaction is executed > > > before the current streamed transaction for which we are performing > > > commit and called this function. It will set the flag for unrelated > > > entry in this case 'r1' which doesn't seem correct to me. Or, if this > > > is correct, it would be a good idea to write some comments about it. > > Yeah, this is wrong, I have fixed this issue in the attached patch > and also added a new test for the same. > In functions cleanup_rel_sync_cache and get_schema_sent_in_streamed_txn, lets cast the result of lfirst_int to uint32 as suggested by Tom [1]. Also, lets keep the way we compare xids consistent in both functions, i.e, if (xid == lfirst_int(lc)). The behavior tested by the test case added for this is not clear primarily because of comments. +++ b/src/test/subscription/t/021_stream_schema.pl @@ -0,0 +1,80 @@ +# Test behavior with streaming transaction exceeding logical_decoding_work_mem ... 
+# large (streamed) transaction with DDL, DML and ROLLBACKs +$node_publisher->safe_psql('postgres', q{ +BEGIN; +ALTER TABLE test_tab ADD COLUMN c INT; +INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(3,3000) s(i); +ALTER TABLE test_tab ADD COLUMN d INT; +COMMIT; +}); + +# large (streamed) transaction with DDL, DML and ROLLBACKs +$node_publisher->safe_psql('postgres', q{ +BEGIN; +INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(3001,3005) s(i); +COMMIT; +}); +wait_for_caught_up($node_publisher, $appname); I understand that how this test will test the functionality related to schema_sent stuff but neither the comments atop of file nor atop the test case explains it clearly. > > Few more comments: > > > 2. > > 009_stream_simple.pl > > +# Insert, update and delete enough rows to exceed the 64kB limit. > > +$node_publisher->safe_psql('postgres', q{ > > +BEGIN; > > +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) > > s(i); > > +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0; > > +DELETE FROM test_tab WHERE mod(a,3) = 0; > > +COMMIT; > > +}); > > > > How much above this data is 64kB limit? I just wanted to see that it > > should not be on borderline and then due to some alignment issues the > > streaming doesn't happen on some machines? Also, how such a test > > ensures that the streaming has happened because the way we are > > checking results, won't it be the same for the non-streaming case as > > well? > > Only for this case, or you mean for all the tests? > It is better to do it for all tests and I have clarified this in my next email sent yesterday [2] where I have raised a few more comments as well. I hope you have not missed that email. > > 3. 
> > +# Change the local values of the extra columns on the subscriber, > > +# update publisher, and check that subscriber retains the expected > > +# values > > +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = > > 'epoch'::timestamptz + 987654321 * interval '1s'"); > > +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = > > md5(a::text)"); > > + > > +wait_for_caught_up($node_publisher, $appname); > > + > > +$result = > > + $node_subscriber->safe_psql('postgres', "SELECT count(*), > > count(extract(epoch from c) = 987654321), count(d = 999) FROM > > test_tab"); > > +is($result, qq(3334|3334|3334), 'check extra columns contain locally > > changed data'); > > > > Again, how this test is relevant to streaming mode? > > I agree, it is not specific to the streaming. > > > Apart from the above, I have made a few changes in the attached patch > > which are mainly to simplify the code at one place, added/edited few > > comments, some other cosmetic changes, and renamed the test case files > > as the initials of their name were matching other tests in the similar > > directory. > > Changes look fine to me except this > > + > > + /* the value must be on/off */ > + if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off")) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("invalid streaming value")));
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Hi Amit/Dilip, I have tested a few scenarios on top of the v56 patches, where the replication worker still had few subtransactions in uncommitted state and we restart the publisher server. No crash or data discrepancies were observed, attached are the test scenarios verified. *Data Setup:* *Publication Server postgresql.conf :* echo "wal_level = logical max_wal_senders = 10 max_replication_slots = 15 wal_log_hints = on hot_standby_feedback = on wal_receiver_status_interval = 1 listen_addresses='*' log_min_messages=debug1 wal_sender_timeout = 0 logical_decoding_work_mem=64kB *Subscription Server postgresql.conf :* wal_level = logical max_wal_senders = 10 max_replication_slots = 15 wal_log_hints = on hot_standby_feedback = on wal_receiver_status_interval = 1 listen_addresses='*' log_min_messages=debug1 wal_sender_timeout = 0 logical_decoding_work_mem=64kB port=5433 *Initial setup:* *Publication Server:* create table t(a int PRIMARY KEY ,b text); CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g'; create publication test_pub for table t with(PUBLISH='insert,delete,update,truncate'); alter table t replica identity FULL ; insert into t values (generate_series(1,20),large_val()) ON CONFLICT (a) DO UPDATE SET a=EXCLUDED.a*300; *Subscription server:* create table t(a int,b text); create subscription test_sub CONNECTION 'host=localhost port=5432 dbname=postgres user=edb' PUBLICATION test_pub WITH ( slot_name = test_slot_sub1,streaming=on); Thanks. 
-- Regards, Neha Sharma On Mon, Aug 31, 2020 at 1:25 PM Amit Kapila wrote: > On Mon, Aug 31, 2020 at 10:49 AM Amit Kapila > wrote: > > > > On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar > wrote: > > > > > > > Another comment: > > > > +cleanup_rel_sync_cache(TransactionId xid, bool is_commit) > > +{ > > + HASH_SEQ_STATUS hash_seq; > > + RelationSyncEntry *entry; > > + > > + Assert(RelationSyncCache != NULL); > > + > > + hash_seq_init(&hash_seq, RelationSyncCache); > > + while ((entry = hash_seq_search(&hash_seq)) != NULL) > > + { > > + if (is_commit) > > + entry->schema_sent = true; > > > > How is it correct to set 'entry->schema_sent' for all the entries in > > RelationSyncCache? Consider a case where due to invalidation in an > > unrelated transaction we have set the flag schema_sent for a > > particular relation 'r1' as 'false' and that transaction is executed > > before the current streamed transaction for which we are performing > > commit and called this function. It will set the flag for unrelated > > entry in this case 'r1' which doesn't seem correct to me. Or, if this > > is correct, it would be a good idea to write some comments about it. > > > > Few more comments: > 1. > +my $appname = 'tap_sub'; > +$node_subscriber->safe_psql('postgres', > +"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr > application_name=$appname' PUBLICATION tap_pub" > +); > > In most of the tests, we are using the above statement to create a > subscription. Don't we need (streaming = 'on') parameter while > creating a subscription? Is there a reason for not doing so in this > patch itself? > > 2. > 009_stream_simple.pl > +# Insert, update and delete enough rows to exceed the 64kB limit. > +$node_publisher->safe_psql('postgres', q{ > +BEGIN; > +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) > s(i); > +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0; > +DELETE FROM test_tab WHERE mod(a,3) = 0; > +COMMIT; > +}); > > How much above this data is 64kB limit? 
I just wanted to see that it > should not be on borderline and then due to some alignment issues the > streaming doesn't happen on some machines? Also, how such a test > ensures that the streaming has happened because the way we are > checking results, won't it be the same for the non-streaming case as > well? > > 3. > +# Change the local values of the extra columns on the subscriber, > +# update publisher, and check that subscriber retains the expected > +# values > +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = > 'epoch'::timestamptz + 987654321 * interval '1s'"); > +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = > md5(a::text)"); > + > +wait_for_caught_up($node_publisher, $appname); > + > +$result = > + $node_subscriber->safe_psql('postgres', "SELECT count(*), > count(extract(epoch from c) = 987654321), count(d = 999) FROM > test_tab"); > +is($result, qq(3334|3334|3334), 'check extra columns contain locally > changed data'); > > Again, how this test is relevant to streaming mode? > > 4. I have checked that in one of the previous patches, we have a test > v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case > quite similar to what we have in > > v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort. > If there is any difference that can cover more scenarios then can we > consider merging them into one test? > > Apart from the
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila wrote: > > > 2. > 009_stream_simple.pl > +# Insert, update and delete enough rows to exceed the 64kB limit. > +$node_publisher->safe_psql('postgres', q{ > +BEGIN; > +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) > s(i); > +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0; > +DELETE FROM test_tab WHERE mod(a,3) = 0; > +COMMIT; > +}); > > How much above this data is 64kB limit? I just wanted to see that it > should not be on borderline and then due to some alignment issues the > streaming doesn't happen on some machines? > I think we should find similar information for other tests added by the patch as well. Few other comments: === +sub wait_for_caught_up +{ + my ($node, $appname) = @_; + + $node->poll_query_until('postgres', +"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';" + ) or die "Timed ou The patch has added this in all the test files if it is used in so many tests then we need to add this in some generic place (PostgresNode.pm) but actually, I am not sure if need this at all. Why can't the existing wait_for_catchup in PostgresNode.pm serve the same purpose. 2. In system_views.sql, -- All columns of pg_subscription except subconninfo are readable. REVOKE ALL ON pg_subscription FROM public; GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, subslotname, subpublications) ON pg_subscription TO public; Here, we need to update for substream column as well. 3. Update describeSubscriptions() to show the 'substream' value in \dRs. 4. Also, lets add few tests in subscription.sql as we have added 'binary' option in commit 9de77b5453. 5. 
I think we can merge pg_dump related changes (the last version posted in mail thread is v53-0005-Add-streaming-option-in-pg_dump) in the main patch, one minor comment on pg_dump related changes @@ -4358,6 +4369,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo) if (strcmp(subinfo->subbinary, "t") == 0) appendPQExpBuffer(query, ", binary = true"); + if (strcmp(subinfo->substream, "f") != 0) + appendPQExpBuffer(query, ", streaming = on"); if (strcmp(subinfo->subsynccommit, "off") != 0) appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit)); Keep one line space between substream and subsynccommit option code to keep it consistent with nearby code. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar wrote: > > On Sat, Aug 29, 2020 at 5:18 PM Amit Kapila wrote: > > > > > One more comment for which I haven't done anything yet. > > +static void > > +set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId > > xid) > > +{ > > + MemoryContext oldctx; > > + > > + oldctx = MemoryContextSwitchTo(CacheMemoryContext); > > + > > + entry->streamed_txns = lappend_int(entry->streamed_txns, xid); > > > Is it a good idea to append xid with lappend_int? Won't we need > > something equivalent for uint32? If so, I think we have a couple of > > options (a) use lcons method and accordingly append the pointer to > > xid, I think we need to allocate memory for xid if we want to use this > > idea or (b) use an array instead. What do you think? > > BTW, OID is internally mapped to uint32, but using lappend_oid might > not look good. So maybe we can provide an option for lappend_uint32? > Using an array is also not a bad idea. Providing lappend_uint32 > option looks more appealing to me. > I thought about this again and I feel it might be okay to use it for our case as after storing it in T_IntList, we primarily fetch it for comparison with TransactionId (uint32), so this shouldn't create any problem. I feel we can just discuss this in a separate thread and check the opinion of others, what do you think? Another comment: +cleanup_rel_sync_cache(TransactionId xid, bool is_commit) +{ + HASH_SEQ_STATUS hash_seq; + RelationSyncEntry *entry; + + Assert(RelationSyncCache != NULL); + + hash_seq_init(&hash_seq, RelationSyncCache); + while ((entry = hash_seq_search(&hash_seq)) != NULL) + { + if (is_commit) + entry->schema_sent = true; How is it correct to set 'entry->schema_sent' for all the entries in RelationSyncCache? 
Consider a case where due to invalidation in an unrelated transaction we have set the flag schema_sent for a particular relation 'r1' as 'false' and that transaction is executed before the current streamed transaction for which we are performing commit and called this function. It will set the flag for unrelated entry in this case 'r1' which doesn't seem correct to me. Or, if this is correct, it would be a good idea to write some comments about it. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, Aug 28, 2020 at 2:18 PM Dilip Kumar wrote: > > As discussed, I have added a another test case for covering the out of > order subtransaction rollback scenario. > +# large (streamed) transaction with out of order subtransaction ROLLBACKs +$node_publisher->safe_psql('postgres', q{ How about writing a comment as: "large (streamed) transaction with subscriber receiving out of order subtransaction ROLLBACKs"? I have reviewed and modified the number of things in the attached patch: 1. In apply_handle_origin, improved the check streamed xacts. 2. In apply_handle_stream_commit() while applying changes in the loop, added CHECK_FOR_INTERRUPTS. 3. In DEBUG messages, print the path with double-quotes as we are doing in all other places. 4. + /* + * Exit if streaming option is changed. The launcher will start new + * worker. + */ + if (newsub->stream != MySubscription->stream) + { + ereport(LOG, + (errmsg("logical replication apply worker for subscription \"%s\" will " + "restart because subscription's streaming option were changed", + MySubscription->name))); + + proc_exit(0); + } + We don't need a separate check like this. I have merged this into one of the existing checks. 5. subxact_info_write() { .. + if (subxact_data.nsubxacts == 0) + { + if (ent->subxact_fileset) + { + cleanup_subxact_info(); + BufFileDeleteShared(ent->subxact_fileset, path); + pfree(ent->subxact_fileset); + ent->subxact_fileset = NULL; + } I don't think it is right to use BufFileDeleteShared interface here because it won't perform SharedFileSetUnregister which means if after above code execution is the server exits it will crash in SharedFileSetDeleteOnProcExit which will try to access already deleted fileset entry. Fixed this by calling SharedFileSetDeleteAll() instead. The another related problem is that in function SharedFileSetDeleteOnProcExit, it tries to delete the list element while traversing the list with 'foreach' construct which makes the behavior of list traversal unpredictable. 
I have fixed this in a separate patch v54-0001-Fix-the-SharedFileSetUnregister-API, if you are fine with this, I would like to commit this as this fixes a problem in the existing commit 808e13b282. 6. Function stream_cleanup_files() contains a missing_ok argument which is not used so removed it. 7. In pgoutput.c, change the ordering of functions to make them consistent with their declaration. 8. typedef struct RelationSyncEntry { Oid relid; /* relation oid */ + TransactionId xid; /* transaction that created the record */ Removed above parameter as this doesn't seem to be required as per the new design in the patch. Apart from above, I have added/changed quite a few comments and a few other cosmetic changes. Kindly review and let me know what do you think about the changes? One more comment for which I haven't done anything yet. +static void +set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid) +{ + MemoryContext oldctx; + + oldctx = MemoryContextSwitchTo(CacheMemoryContext); + + entry->streamed_txns = lappend_int(entry->streamed_txns, xid); Is it a good idea to append xid with lappend_int? Won't we need something equivalent for uint32? If so, I think we have a couple of options (a) use lcons method and accordingly append the pointer to xid, I think we need to allocate memory for xid if we want to use this idea or (b) use an array instead. What do you think? -- With Regards, Amit Kapila. v54-0001-Fix-the-SharedFileSetUnregister-API.patch Description: Binary data v54-0002-Add-support-for-streaming-to-built-in-logical-re.patch Description: Binary data
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Hi, I have done code coverage analysis on the latest patches (v53) and below is the report for the same. Highlighted are the files where the coverage modifications were observed. OS: Ubuntu 18.04 Patch applied on commit: 77c7267c37f7fa8e5e48abda4798afdbecb2b95a

Columns: (A) Without logical decoding patch, (B) On v53 (2,3,4,5) patch, (C) Without v53-0003 patch; each shown as %Line/%Function.

File Name                                                       A           B           C
src/backend/access/transam/xact.c                               86.2/92.9   86.2/92.9   86.2/92.9
src/backend/access/transam/xloginsert.c                         90.2/94.1   90.2/94.1   90.2/94.1
src/backend/access/transam/xlogreader.c                         73.3/93.3   73.8/93.3   73.8/93.3
src/backend/replication/logical/decode.c                        93.4/100    93.4/100    93.4/100
src/backend/access/rmgrdesc/xactdesc.c                          54.4/63.6   54.4/63.6   54.4/63.6
src/backend/replication/logical/reorderbuffer.c                 93.4/96.7   93.4/96.7   93.4/96.7
src/backend/utils/cache/inval.c                                 98.1/100    98.1/100    98.1/100
contrib/test_decoding/test_decoding.c                           86.8/95.2   86.8/95.2   86.8/95.2
src/backend/replication/logical/logical.c                       90.9/93.5   90.9/93.5   91.8/93.5
src/backend/access/heap/heapam.c                                86.1/94.5   86.1/94.5   86.1/94.5
src/backend/access/index/genam.c                                90.7/91.7   91.2/91.7   91.2/91.7
src/backend/access/table/tableam.c                              90.6/100    90.6/100    90.6/100
src/backend/utils/time/snapmgr.c                                81.1/98.1   80.2/98.1   81.1/98.1
src/include/access/tableam.h                                    92.5/100    92.5/100    92.5/100
src/backend/access/heap/heapam_visibility.c                     77.8/100    77.8/100    77.8/100
src/backend/replication/walsender.c                             90.5/97.8   90.5/97.8   90.9/100
src/backend/catalog/pg_subscription.c                           96/100      96/100      96/100
src/backend/commands/subscriptioncmds.c                         93.2/90     92.7/90     92.7/90
src/backend/postmaster/pgstat.c                                 64.2/85.1   63.9/85.1   64.6/86.1
src/backend/replication/libpqwalreceiver/libpqwalreceiver.c     82.4/95     82.5/95     83.6/95
src/backend/replication/logical/proto.c                         93.5/91.3   93.7/93.3   93.7/93.3
src/backend/replication/logical/worker.c                        91.6/96     91.5/97.4   91.9/97.4
src/backend/replication/pgoutput/pgoutput.c                     81.9/100    85.5/100    86.2/100
src/backend/replication/slotfuncs.c                             93/93.8     93/93.8     93/93.8
src/include/pgstat.h                                            100/-       100/-       100/-
src/backend/replication/logical/logicalfuncs.c                  87.1/90     87.1/90     87.1/90
src/backend/storage/file/buffile.c                              68.3/85     69.6/85     69.6/85
src/backend/storage/file/fd.c                                   81.1/93     81.1/93     81.1/93
src/backend/storage/file/sharedfileset.c                        77.7/90.9   93.2/100    93.2/100
src/backend/utils/sort/logtape.c                                94.4/100    94.4/100    94.4/100
src/backend/utils/sort/sharedtuplestore.c                       90.1/90.9   90.1/90.9   90.1/90.9

Thanks. -- Regards, Neha Sharma On Thu, Aug 27, 2020 at 11:16 AM Amit Kapila wrote: > On Wed, Aug 26, 2020 at 11:22 PM Jeff Janes wrote: > > > > > > On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila > wrote: > > > >> > >> I am planning > >> to push the first patch (v53-0001-Extend-the-BufFile-interface) in > >> this series tomorrow unless you have any comments on the same. > > > > > > > > I'm getting compiler warnings now, > src/backend/storage/file/sharedfileset.c line 288 needs to be: > > > > bool found PG_USED_FOR_ASSERTS_ONLY = false; > > > > Thanks for the report. Tom Lane has already fixed this [1]. > > [1] - > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e942af7b8261cd8070d0eeaf518dbc1a664859fd > > -- > With Regards, > Amit Kapila. > > >
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Thu, Aug 27, 2020 at 11:16 AM Amit Kapila wrote: > > On Wed, Aug 26, 2020 at 11:22 PM Jeff Janes wrote: > > > > > > On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila wrote: > > > >> > >> I am planning > >> to push the first patch (v53-0001-Extend-the-BufFile-interface) in > >> this series tomorrow unless you have any comments on the same. > > > > > > > > I'm getting compiler warnings now, src/backend/storage/file/sharedfileset.c > > line 288 needs to be: > > > > boolfound PG_USED_FOR_ASSERTS_ONLY = false; > > > > Thanks for the report. Tom Lane has already fixed this [1]. > > [1] - > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e942af7b8261cd8070d0eeaf518dbc1a664859fd As discussed, I have added a another test case for covering the out of order subtransaction rollback scenario. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com tap_test_for_out_of_order_subxact_abort.patch Description: Binary data
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Aug 26, 2020 at 11:22 PM Jeff Janes wrote: > > > On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila wrote: > >> >> I am planning >> to push the first patch (v53-0001-Extend-the-BufFile-interface) in >> this series tomorrow unless you have any comments on the same. > > > > I'm getting compiler warnings now, src/backend/storage/file/sharedfileset.c > line 288 needs to be: > > bool found PG_USED_FOR_ASSERTS_ONLY = false; > Thanks for the report. Tom Lane has already fixed this [1]. [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e942af7b8261cd8070d0eeaf518dbc1a664859fd -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila wrote: > I am planning > to push the first patch (v53-0001-Extend-the-BufFile-interface) in > this series tomorrow unless you have any comments on the same. > I'm getting compiler warnings now, src/backend/storage/file/sharedfileset.c line 288 needs to be: bool found PG_USED_FOR_ASSERTS_ONLY = false; Cheers, Jeff
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Aug 25, 2020 at 6:27 PM Amit Kapila wrote: > > On Tue, Aug 25, 2020 at 10:41 AM Dilip Kumar wrote: > > > > On Tue, Aug 25, 2020 at 9:31 AM Amit Kapila wrote: > > > > > > > > > I think the existing design is superior as it allows the flexibility > > > to create transaction files in different temp_tablespaces which is > > > quite important to consider as we know the files will be created only > > > for large transactions. Once we fix the sharedfileset for a worker all > > > the files will be created in the temp_tablespaces chosen for the first > > > time apply worker creates it even if it got changed at some later > > > point of time (user can change its value and then do reload config > > > which I think will impact the worker settings as well). This all can > > > happen because we set the tablespaces at the time of > > > SharedFileSetInit. > > > > Yeah, I agree with this point, that if we use the single shared > > fileset then it will always use the same tablespace for all the > > streaming transactions. And, we might get the benefit of concurrent > > I/O if we use different tablespaces as we are not immediately flushing > > the files to the disk. > > > > Okay, so let's retain the original approach then. I have made a few > cosmetic modifications in the first two patches which include updating > docs, comments, slightly modify the commit message, and change the > code to match the nearby code. One change which you might have a > different opinion is below: > > + case WAIT_EVENT_LOGICAL_CHANGES_READ: > + event_name = "ReorderLogicalChangesRead"; > + break; > + case WAIT_EVENT_LOGICAL_CHANGES_WRITE: > + event_name = "ReorderLogicalChangesWrite"; > + break; > + case WAIT_EVENT_LOGICAL_SUBXACT_READ: > + event_name = "ReorderLogicalSubxactRead"; > + break; > + case WAIT_EVENT_LOGICAL_SUBXACT_WRITE: > + event_name = "ReorderLogicalSubxactWrite"; > + break; > > Why do we want to name these events starting with name as Reorder*? 
I > think these are used in subscriber-side, so no need to use the word > Reorder, so I have removed it from the attached patch. I am planning > to push the first patch (v53-0001-Extend-the-BufFile-interface) in > this series tomorrow unless you have any comments on the same. Your changes in 0001 and 0002 look fine to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Aug 25, 2020 at 9:31 AM Amit Kapila wrote: > > On Mon, Aug 24, 2020 at 9:41 PM Dilip Kumar wrote: > > > > On Sat, Aug 22, 2020 at 8:38 AM Amit Kapila wrote: > > > > > > On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar wrote: > > > > > > > > I have reviewed and tested the patch and the changes look fine to me. > > > > > > > > > > Thanks, I will push the next patch early next week (by Tuesday) unless > > > you or someone else has any more comments on it. The summary of the > > > patch (v52-0001-Extend-the-BufFile-interface, attached with my > > > previous email) I am planning to push is: "It extends the BufFile > > > interface to support temporary files that can be used by the single > > > backend when the corresponding files need to be survived across the > > > transaction and need to be opened and closed multiple times. Such > > > files need to be created as a member of a SharedFileSet. We have > > > implemented the interface for BufFileTruncate to allow files to be > > > truncated up to a particular offset and extended the BufFileSeek API > > > to support SEEK_END case. We have also added an option to provide a > > > mode while opening the shared BufFiles instead of always opening in > > > read-only mode. These enhancements in BufFile interface are required > > > for the upcoming patch to allow the replication apply worker, to > > > properly handle streamed in-progress transactions." > > > > While reviewing 0002, I realized that instead of using individual > > shared fileset for each transaction, we can use just one common shared > > file set. We can create individual buffile under one shared fileset > > and whenever a transaction commits/aborts we can just delete its > > buffile and the shared fileset can stay. > > > > I think the existing design is superior as it allows the flexibility > to create transaction files in different temp_tablespaces which is > quite important to consider as we know the files will be created only > for large transactions. 
Once we fix the sharedfileset for a worker all > the files will be created in the temp_tablespaces chosen for the first > time apply worker creates it even if it got changed at some later > point of time (user can change its value and then do reload config > which I think will impact the worker settings as well). This all can > happen because we set the tablespaces at the time of > SharedFileSetInit. Yeah, I agree with this point, that if we use the single shared fileset then it will always use the same tablespace for all the streaming transactions. And, we might get the benefit of concurrent I/O if we use different tablespaces as we are not immediately flushing the files to the disk. > The other relatively smaller thing which I don't like is that we > always need to create a buffile for subxact even though we don't need > it. We might be able to find some solution for this but I guess the > previous point is what bothers me more. Yeah, if we go this way we might need to find some solution to this. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Aug 24, 2020 at 9:41 PM Dilip Kumar wrote: > > On Sat, Aug 22, 2020 at 8:38 AM Amit Kapila wrote: > > > > On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar wrote: > > > > > > I have reviewed and tested the patch and the changes look fine to me. > > > > > > > Thanks, I will push the next patch early next week (by Tuesday) unless > > you or someone else has any more comments on it. The summary of the > > patch (v52-0001-Extend-the-BufFile-interface, attached with my > > previous email) I am planning to push is: "It extends the BufFile > > interface to support temporary files that can be used by the single > > backend when the corresponding files need to be survived across the > > transaction and need to be opened and closed multiple times. Such > > files need to be created as a member of a SharedFileSet. We have > > implemented the interface for BufFileTruncate to allow files to be > > truncated up to a particular offset and extended the BufFileSeek API > > to support SEEK_END case. We have also added an option to provide a > > mode while opening the shared BufFiles instead of always opening in > > read-only mode. These enhancements in BufFile interface are required > > for the upcoming patch to allow the replication apply worker, to > > properly handle streamed in-progress transactions." > > While reviewing 0002, I realized that instead of using individual > shared fileset for each transaction, we can use just one common shared > file set. We can create individual buffile under one shared fileset > and whenever a transaction commits/aborts we can just delete its > buffile and the shared fileset can stay. > I think the existing design is superior as it allows the flexibility to create transaction files in different temp_tablespaces which is quite important to consider as we know the files will be created only for large transactions. 
Once we fix the sharedfileset for a worker all the files will be created in the temp_tablespaces chosen for the first time apply worker creates it even if it got changed at some later point of time (user can change its value and then do reload config which I think will impact the worker settings as well). This all can happen because we set the tablespaces at the time of SharedFileSetInit. The other relatively smaller thing which I don't like is that we always need to create a buffile for subxact even though we don't need it. We might be able to find some solution for this but I guess the previous point is what bothers me more. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Sat, Aug 22, 2020 at 8:38 AM Amit Kapila wrote: > > On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar wrote: > > > > I have reviewed and tested the patch and the changes look fine to me. > > > > Thanks, I will push the next patch early next week (by Tuesday) unless > you or someone else has any more comments on it. The summary of the > patch (v52-0001-Extend-the-BufFile-interface, attached with my > previous email) I am planning to push is: "It extends the BufFile > interface to support temporary files that can be used by the single > backend when the corresponding files need to be survived across the > transaction and need to be opened and closed multiple times. Such > files need to be created as a member of a SharedFileSet. We have > implemented the interface for BufFileTruncate to allow files to be > truncated up to a particular offset and extended the BufFileSeek API > to support SEEK_END case. We have also added an option to provide a > mode while opening the shared BufFiles instead of always opening in > read-only mode. These enhancements in BufFile interface are required > for the upcoming patch to allow the replication apply worker, to > properly handle streamed in-progress transactions." While reviewing 0002, I realized that instead of using individual shared fileset for each transaction, we can use just one common shared file set. We can create individual buffile under one shared fileset and whenever a transaction commits/aborts we can just delete its buffile and the shared fileset can stay. I have attached a POC patch for this idea and if we agree with this approach then I will prepare a final patch in a couple of days. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com buffile_changes.patch Description: Binary data
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar wrote: > > I have reviewed and tested the patch and the changes look fine to me. > Thanks, I will push the next patch early next week (by Tuesday) unless you or someone else has any more comments on it. The summary of the patch (v52-0001-Extend-the-BufFile-interface, attached with my previous email) I am planning to push is: "It extends the BufFile interface to support temporary files that can be used by the single backend when the corresponding files need to be survived across the transaction and need to be opened and closed multiple times. Such files need to be created as a member of a SharedFileSet. We have implemented the interface for BufFileTruncate to allow files to be truncated up to a particular offset and extended the BufFileSeek API to support SEEK_END case. We have also added an option to provide a mode while opening the shared BufFiles instead of always opening in read-only mode. These enhancements in BufFile interface are required for the upcoming patch to allow the replication apply worker, to properly handle streamed in-progress transactions." -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, Aug 21, 2020 at 3:13 PM Amit Kapila wrote: > > On Fri, Aug 21, 2020 at 10:33 AM Dilip Kumar wrote: > > > > On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila wrote: > > > > > > 2. > > > + /* > > > + * If the new location is smaller then the current location in file then > > > + * we need to set the curFile and the curOffset to the new values and > > > also > > > + * reset the pos and nbytes. Otherwise nothing to do. > > > + */ > > > + else if ((newFile < file->curFile) || > > > + newOffset < file->curOffset + file->pos) > > > + { > > > + file->curFile = newFile; > > > + file->curOffset = newOffset; > > > + file->pos = 0; > > > + file->nbytes = 0; > > > + } > > > > > > Shouldn't there be && instead of || because if newFile is greater than > > > curFile then there is no meaning to update it? > > > > I think this condition is wrong it should be, > > > > else if ((newFile < file->curFile) || ((newFile == file->curFile) && > > (newOffset < file->curOffset + file->pos) > > > > Basically, either new file is smaller otherwise if it is the same > > then-new offset should be smaller. > > > > I think we don't need to use file->pos for that as that is required > only for the current buffer, otherwise, such a condition should > suffice the need. However, I was not happy with the way code and > conditions were arranged in BufFileTruncateShared, so I have > re-arranged them and change quite a few comments in that API. Apart > from that I have updated the docs and ran pgindent for the first > patch. Do let me know if you have any more comments on the first > patch? I have reviewed and tested the patch and the changes look fine to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, Aug 21, 2020 at 10:20 AM Amit Kapila wrote: > > On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila wrote: > > > > On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar wrote: > > > > > > On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila > > > wrote: > > > > > > > > > > > > Right, I think this can happen if one has changed those by BufFileSeek > > > > before doing truncate. We should fix that case as well. > > > > > > Right. > > > > > > > > I will work on those along with your other comments and > > > > > submit the updated patch. > > > > > > I have fixed this in the attached patch along with your other > > > comments. I have also attached a contrib module that is just used for > > > testing the truncate API. > > > > > > > Few comments: > > == > > +void > > +BufFileTruncateShared(BufFile *file, int fileno, off_t offset) > > { > > .. > > + if ((i != fileno || offset == 0) && i != 0) > > + { > > + SharedSegmentName(segment_name, file->name, i); > > + FileClose(file->files[i]); > > + if (!SharedFileSetDelete(file->fileset, segment_name, true)) > > + ereport(ERROR, > > + (errcode_for_file_access(), > > + errmsg("could not delete shared fileset \"%s\": %m", > > + segment_name))); > > + numFiles--; > > + newOffset = MAX_PHYSICAL_FILESIZE; > > + > > + if (i == fileno) > > + newFile--; > > + } > > > > Here, shouldn't it be i <= fileno? Because we need to move back the > > curFile up to newFile whenever curFile is greater than newFile > > > > I think now I have understood why you have added this condition but > probably a comment on the lines "This is required to indicate that we > have removed the given fileno" would be better for future readers. Okay. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila wrote: > > On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar wrote: > > > > On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila wrote: > > > > > > > > > Right, I think this can happen if one has changed those by BufFileSeek > > > before doing truncate. We should fix that case as well. > > > > Right. > > > > > > I will work on those along with your other comments and > > > > submit the updated patch. > > > > I have fixed this in the attached patch along with your other > > comments. I have also attached a contrib module that is just used for > > testing the truncate API. > > > > Few comments: > == > +void > +BufFileTruncateShared(BufFile *file, int fileno, off_t offset) > { > .. > + if ((i != fileno || offset == 0) && i != 0) > + { > + SharedSegmentName(segment_name, file->name, i); > + FileClose(file->files[i]); > + if (!SharedFileSetDelete(file->fileset, segment_name, true)) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not delete shared fileset \"%s\": %m", > + segment_name))); > + numFiles--; > + newOffset = MAX_PHYSICAL_FILESIZE; > + > + if (i == fileno) > + newFile--; > + } > > Here, shouldn't it be i <= fileno? Because we need to move back the > curFile up to newFile whenever curFile is greater than newFile +/* Loop over all the files upto the fileno which we want to truncate. */ +for (i = file->numFiles - 1; i >= fileno; i--) Because the above loop is up to the fileno, so I feel there is no point of that check or any assert. > 2. > + /* > + * If the new location is smaller then the current location in file then > + * we need to set the curFile and the curOffset to the new values and also > + * reset the pos and nbytes. Otherwise nothing to do. 
> + */ > + else if ((newFile < file->curFile) || > + newOffset < file->curOffset + file->pos) > + { > + file->curFile = newFile; > + file->curOffset = newOffset; > + file->pos = 0; > + file->nbytes = 0; > + } > > Shouldn't there be && instead of || because if newFile is greater than > curFile then there is no meaning to update it? I think this condition is wrong; it should be: else if ((newFile < file->curFile) || ((newFile == file->curFile) && (newOffset < file->curOffset + file->pos))) Basically, either the new file is smaller, or, if it is the same file, then the new offset should be smaller. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila wrote: > > On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar wrote: > > > > On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila wrote: > > > > > > > > > Right, I think this can happen if one has changed those by BufFileSeek > > > before doing truncate. We should fix that case as well. > > > > Right. > > > > > > I will work on those along with your other comments and > > > > submit the updated patch. > > > > I have fixed this in the attached patch along with your other > > comments. I have also attached a contrib module that is just used for > > testing the truncate API. > > > > Few comments: > == > +void > +BufFileTruncateShared(BufFile *file, int fileno, off_t offset) > { > .. > + if ((i != fileno || offset == 0) && i != 0) > + { > + SharedSegmentName(segment_name, file->name, i); > + FileClose(file->files[i]); > + if (!SharedFileSetDelete(file->fileset, segment_name, true)) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not delete shared fileset \"%s\": %m", > + segment_name))); > + numFiles--; > + newOffset = MAX_PHYSICAL_FILESIZE; > + > + if (i == fileno) > + newFile--; > + } > > Here, shouldn't it be i <= fileno? Because we need to move back the > curFile up to newFile whenever curFile is greater than newFile > I think now I have understood why you have added this condition but probably a comment on the lines "This is required to indicate that we have removed the given fileno" would be better for future readers. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila wrote: > > On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar wrote: > > > > On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila wrote: > > > > > > > > > Right, I think this can happen if one has changed those by BufFileSeek > > > before doing truncate. We should fix that case as well. > > > > Right. > > > > > > I will work on those along with your other comments and > > > > submit the updated patch. > > > > I have fixed this in the attached patch along with your other > > comments. I have also attached a contrib module that is just used for > > testing the truncate API. > > > > Few comments: > == > +void > +BufFileTruncateShared(BufFile *file, int fileno, off_t offset) > { > .. > + if ((i != fileno || offset == 0) && i != 0) > + { > + SharedSegmentName(segment_name, file->name, i); > + FileClose(file->files[i]); > + if (!SharedFileSetDelete(file->fileset, segment_name, true)) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not delete shared fileset \"%s\": %m", > + segment_name))); > + numFiles--; > + newOffset = MAX_PHYSICAL_FILESIZE; > + > + if (i == fileno) > + newFile--; > + } > > Here, shouldn't it be i <= fileno? Because we need to move back the > curFile up to newFile whenever curFile is greater than newFile > > 2. > + /* > + * If the new location is smaller then the current location in file then > + * we need to set the curFile and the curOffset to the new values and also > + * reset the pos and nbytes. Otherwise nothing to do. > + */ > + else if ((newFile < file->curFile) || > + newOffset < file->curOffset + file->pos) > + { > + file->curFile = newFile; > + file->curOffset = newOffset; > + file->pos = 0; > + file->nbytes = 0; > + } > > Shouldn't there be && instead of || because if newFile is greater than > curFile then there is no meaning to update it? 
> Wait, actually, it is not clear to me which case the second condition (newOffset < file->curOffset + file->pos) is trying to cover, so I can't recommend anything for this. Can you please explain to me why you have added the second condition in the above check? -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar wrote: > > On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila wrote: > > > > > > Right, I think this can happen if one has changed those by BufFileSeek > > before doing truncate. We should fix that case as well. > > Right. > > > > I will work on those along with your other comments and > > > submit the updated patch. > > I have fixed this in the attached patch along with your other > comments. I have also attached a contrib module that is just used for > testing the truncate API. > Few comments: == +void +BufFileTruncateShared(BufFile *file, int fileno, off_t offset) { .. + if ((i != fileno || offset == 0) && i != 0) + { + SharedSegmentName(segment_name, file->name, i); + FileClose(file->files[i]); + if (!SharedFileSetDelete(file->fileset, segment_name, true)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not delete shared fileset \"%s\": %m", + segment_name))); + numFiles--; + newOffset = MAX_PHYSICAL_FILESIZE; + + if (i == fileno) + newFile--; + } Here, shouldn't it be i <= fileno? Because we need to move back the curFile up to newFile whenever curFile is greater than newFile 2. + /* + * If the new location is smaller then the current location in file then + * we need to set the curFile and the curOffset to the new values and also + * reset the pos and nbytes. Otherwise nothing to do. + */ + else if ((newFile < file->curFile) || + newOffset < file->curOffset + file->pos) + { + file->curFile = newFile; + file->curOffset = newOffset; + file->pos = 0; + file->nbytes = 0; + } Shouldn't there be && instead of || because if newFile is greater than curFile then there is no meaning to update it? -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Thu, Aug 20, 2020 at 1:41 PM Dilip Kumar wrote: > > On Wed, Aug 19, 2020 at 1:35 PM Dilip Kumar wrote: > > > > On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila > > wrote: > > > > > > On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila > > > wrote: > > > > > > > > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar > > > > wrote: > > > > > > > > > > > > > > > In last patch v49-0001, there is one issue, Basically, I have called > > > > > BufFileFlush in all the cases. But, ideally, we can not call this if > > > > > the underlying files are deleted/truncated because those files/blocks > > > > > might not exist now. So I think if the truncate position is within > > > > > the same buffer we just need to adjust the buffer, otherwise we just > > > > > need to set the currFile and currOffset to the absolute number and set > > > > > the pos and nbytes 0. Attached patch fixes this issue. > > > > > > > > > > > > > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface > > > > 1. > > > > + > > > > + /* > > > > + * If the truncate point is within existing buffer then we can just > > > > + * adjust pos-within-buffer, without flushing buffer. Otherwise, > > > > + * we don't need to do anything because we have already > > > > deleted/truncated > > > > + * the underlying files. > > > > + */ > > > > + if (curFile == file->curFile && > > > > + curOffset >= file->curOffset && > > > > + curOffset <= file->curOffset + file->nbytes) > > > > + { > > > > + file->pos = (int) (curOffset - file->curOffset); > > > > + return; > > > > + } > > > > > > > > I think in this case you have set the position correctly but what > > > > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes' > > > > because the contents of the buffer are still valid but I don't think > > > > the same is true here. > > > > > > > > > > I think you need to set 'nbytes' to curOffset as per your current > > > patch as that is the new size of the file. 
> > > --- a/src/backend/storage/file/buffile.c > > > +++ b/src/backend/storage/file/buffile.c > > > @@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno, > > > off_t offset) > > > curOffset <= file->curOffset + file->nbytes) > > > { > > > file->pos = (int) (curOffset - file->curOffset); > > > + file->nbytes = (int) curOffset; > > > return; > > > } > > > > > > Also, what about file 'numFiles', that can also change due to the > > > removal of certain files, shouldn't that be also set in this case > > > > Right, we need to set the numFile. I will fix this as well. > > I think there are a couple of more problems in the truncate APIs, > basically, if the curFile and curOffset are already smaller than the > truncate location the truncate should not change that. So the > truncate should only change the curFile and curOffset if it is > truncating the part of the file where the curFile or curOffset is > pointing. > Right, I think this can happen if one has changed those by BufFileSeek before doing truncate. We should fix that case as well. > I will work on those along with your other comments and > submit the updated patch. > Thanks. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Aug 19, 2020 at 1:35 PM Dilip Kumar wrote: > > On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila wrote: > > > > On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila > > wrote: > > > > > > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar wrote: > > > > > > > > > > > > In last patch v49-0001, there is one issue, Basically, I have called > > > > BufFileFlush in all the cases. But, ideally, we can not call this if > > > > the underlying files are deleted/truncated because those files/blocks > > > > might not exist now. So I think if the truncate position is within > > > > the same buffer we just need to adjust the buffer, otherwise we just > > > > need to set the currFile and currOffset to the absolute number and set > > > > the pos and nbytes 0. Attached patch fixes this issue. > > > > > > > > > > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface > > > 1. > > > + > > > + /* > > > + * If the truncate point is within existing buffer then we can just > > > + * adjust pos-within-buffer, without flushing buffer. Otherwise, > > > + * we don't need to do anything because we have already deleted/truncated > > > + * the underlying files. > > > + */ > > > + if (curFile == file->curFile && > > > + curOffset >= file->curOffset && > > > + curOffset <= file->curOffset + file->nbytes) > > > + { > > > + file->pos = (int) (curOffset - file->curOffset); > > > + return; > > > + } > > > > > > I think in this case you have set the position correctly but what > > > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes' > > > because the contents of the buffer are still valid but I don't think > > > the same is true here. > > > > > > > I think you need to set 'nbytes' to curOffset as per your current > > patch as that is the new size of the file. 
> > --- a/src/backend/storage/file/buffile.c > > +++ b/src/backend/storage/file/buffile.c > > @@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno, > > off_t offset) > > curOffset <= file->curOffset + file->nbytes) > > { > > file->pos = (int) (curOffset - file->curOffset); > > + file->nbytes = (int) curOffset; > > return; > > } > > > > Also, what about file 'numFiles', that can also change due to the > > removal of certain files, shouldn't that be also set in this case > > Right, we need to set the numFile. I will fix this as well. I think there are a couple more problems in the truncate APIs: basically, if curFile and curOffset are already smaller than the truncate location, the truncate should not change them. So the truncate should only change curFile and curOffset if it is truncating the part of the file where curFile or curOffset is pointing. I will work on those along with your other comments and submit the updated patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila wrote: > > On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila wrote: > > > > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar wrote: > > > > > > > > > In last patch v49-0001, there is one issue, Basically, I have called > > > BufFileFlush in all the cases. But, ideally, we can not call this if > > > the underlying files are deleted/truncated because those files/blocks > > > might not exist now. So I think if the truncate position is within > > > the same buffer we just need to adjust the buffer, otherwise we just > > > need to set the currFile and currOffset to the absolute number and set > > > the pos and nbytes 0. Attached patch fixes this issue. > > > > > > > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface > > 1. > > + > > + /* > > + * If the truncate point is within existing buffer then we can just > > + * adjust pos-within-buffer, without flushing buffer. Otherwise, > > + * we don't need to do anything because we have already deleted/truncated > > + * the underlying files. > > + */ > > + if (curFile == file->curFile && > > + curOffset >= file->curOffset && > > + curOffset <= file->curOffset + file->nbytes) > > + { > > + file->pos = (int) (curOffset - file->curOffset); > > + return; > > + } > > > > I think in this case you have set the position correctly but what > > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes' > > because the contents of the buffer are still valid but I don't think > > the same is true here. > > > > I think you need to set 'nbytes' to curOffset as per your current > patch as that is the new size of the file. 
> --- a/src/backend/storage/file/buffile.c > +++ b/src/backend/storage/file/buffile.c > @@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno, > off_t offset) > curOffset <= file->curOffset + file->nbytes) > { > file->pos = (int) (curOffset - file->curOffset); > + file->nbytes = (int) curOffset; > return; > } > > Also, what about file 'numFiles', that can also change due to the > removal of certain files, shouldn't that be also set in this case Right, we need to set the numFile. I will fix this as well. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Aug 19, 2020 at 10:11 AM Amit Kapila wrote: > > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar wrote: > > > > > > In last patch v49-0001, there is one issue, Basically, I have called > > BufFileFlush in all the cases. But, ideally, we can not call this if > > the underlying files are deleted/truncated because those files/blocks > > might not exist now. So I think if the truncate position is within > > the same buffer we just need to adjust the buffer, otherwise we just > > need to set the currFile and currOffset to the absolute number and set > > the pos and nbytes 0. Attached patch fixes this issue. > > > > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface > 1. > + > + /* > + * If the truncate point is within existing buffer then we can just > + * adjust pos-within-buffer, without flushing buffer. Otherwise, > + * we don't need to do anything because we have already deleted/truncated > + * the underlying files. > + */ > + if (curFile == file->curFile && > + curOffset >= file->curOffset && > + curOffset <= file->curOffset + file->nbytes) > + { > + file->pos = (int) (curOffset - file->curOffset); > + return; > + } > > I think in this case you have set the position correctly but what > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes' > because the contents of the buffer are still valid but I don't think > the same is true here. Right, I think we need to set nbytes to new file->pos as shown below > + file->pos = (int) (curOffset - file->curOffset); > file->nbytes = file->pos > 2. > + int curFile = file->curFile; > + off_t curOffset = file->curOffset; > > I find the previous naming (newFile, newOffset) was better as it > distinguishes them from BufFile variables. Ok > 3. > +void > +SharedFileSetUnregister(SharedFileSet *input_fileset) > +{ > .. > + /* Delete all files in the set */ > + SharedFileSetDeleteAll(input_fileset); > .. 
> }
>
> I am not sure if this is completely correct because we call this
> function (SharedFileSetUnregister) from BufFileDeleteShared which
> would have already removed all the required files. This raises the
> question in my mind whether it is correct to call
> SharedFileSetUnregister from BufFileDeleteShared from the API
> perspective as one might not want to remove the entire fileset at that
> point of time. It will work for your use case (where while removing
> buffile you also want to remove the entire fileset) but not sure if it
> is generic enough. For your case, I wonder if we can directly call
> SharedFileSetDeleteAll and we can have a call like
> SharedFileSetUnregister which will be called from it.

Yeah, this makes more sense to me: we can directly call SharedFileSetDeleteAll instead of calling BufFileDeleteShared, and call SharedFileSetUnregister from SharedFileSetDeleteAll. I will make these changes and send the patch after some testing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila wrote: > > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar wrote: > > > > > > In last patch v49-0001, there is one issue, Basically, I have called > > BufFileFlush in all the cases. But, ideally, we can not call this if > > the underlying files are deleted/truncated because those files/blocks > > might not exist now. So I think if the truncate position is within > > the same buffer we just need to adjust the buffer, otherwise we just > > need to set the currFile and currOffset to the absolute number and set > > the pos and nbytes 0. Attached patch fixes this issue. > > > > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface > 1. > + > + /* > + * If the truncate point is within existing buffer then we can just > + * adjust pos-within-buffer, without flushing buffer. Otherwise, > + * we don't need to do anything because we have already deleted/truncated > + * the underlying files. > + */ > + if (curFile == file->curFile && > + curOffset >= file->curOffset && > + curOffset <= file->curOffset + file->nbytes) > + { > + file->pos = (int) (curOffset - file->curOffset); > + return; > + } > > I think in this case you have set the position correctly but what > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes' > because the contents of the buffer are still valid but I don't think > the same is true here. > I think you need to set 'nbytes' to curOffset as per your current patch as that is the new size of the file. --- a/src/backend/storage/file/buffile.c +++ b/src/backend/storage/file/buffile.c @@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno, off_t offset) curOffset <= file->curOffset + file->nbytes) { file->pos = (int) (curOffset - file->curOffset); + file->nbytes = (int) curOffset; return; } Also, what about file 'numFiles', that can also change due to the removal of certain files, shouldn't that be also set in this case? -- With Regards, Amit Kapila.
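The truncate-within-buffer case being reviewed here can be modeled in isolation. The following is a self-contained sketch, not the real buffile.c code: MiniBufFile and minibuf_truncate are invented names, the segment/numFiles bookkeeping raised at the end of the mail is deliberately left out, and nbytes is set relative to the buffer start (equal to the new pos), which is one variant of the fix discussed in the thread. The point it illustrates is why leaving nbytes alone after an in-buffer truncate would let stale buffer contents past the new end be read back.

```c
#include <assert.h>

/* Toy model of the buffer state; field names mirror buffile.c. */
typedef struct
{
    int  curFile;   /* segment file currently buffered */
    long curOffset; /* offset of the buffer start within that segment */
    int  pos;       /* next read/write position within the buffer */
    int  nbytes;    /* number of valid bytes in the buffer */
} MiniBufFile;

/*
 * Truncate to (trFile, trOffset).  When the truncate point lies inside
 * the buffered range, adjust both pos and nbytes in place; otherwise the
 * underlying files were already truncated, so just reposition with an
 * empty buffer.
 */
static void
minibuf_truncate(MiniBufFile *f, int trFile, long trOffset)
{
    if (trFile == f->curFile &&
        trOffset >= f->curOffset &&
        trOffset <= f->curOffset + f->nbytes)
    {
        f->pos = (int) (trOffset - f->curOffset);
        f->nbytes = f->pos;     /* bytes past the new end no longer exist */
    }
    else
    {
        f->curFile = trFile;
        f->curOffset = trOffset;
        f->pos = 0;
        f->nbytes = 0;
    }
}
```

With only the pos assignment (the version being reviewed), the first branch would leave nbytes at its old value and a subsequent read could return truncated-away bytes still sitting in the buffer.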
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar wrote: > > > In last patch v49-0001, there is one issue, Basically, I have called > BufFileFlush in all the cases. But, ideally, we can not call this if > the underlying files are deleted/truncated because those files/blocks > might not exist now. So I think if the truncate position is within > the same buffer we just need to adjust the buffer, otherwise we just > need to set the currFile and currOffset to the absolute number and set > the pos and nbytes 0. Attached patch fixes this issue. > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface 1. + + /* + * If the truncate point is within existing buffer then we can just + * adjust pos-within-buffer, without flushing buffer. Otherwise, + * we don't need to do anything because we have already deleted/truncated + * the underlying files. + */ + if (curFile == file->curFile && + curOffset >= file->curOffset && + curOffset <= file->curOffset + file->nbytes) + { + file->pos = (int) (curOffset - file->curOffset); + return; + } I think in this case you have set the position correctly but what about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes' because the contents of the buffer are still valid but I don't think the same is true here. 2. + int curFile = file->curFile; + off_t curOffset = file->curOffset; I find the previous naming (newFile, newOffset) was better as it distinguishes them from BufFile variables. 3. +void +SharedFileSetUnregister(SharedFileSet *input_fileset) +{ .. + /* Delete all files in the set */ + SharedFileSetDeleteAll(input_fileset); .. } I am not sure if this is completely correct because we call this function (SharedFileSetUnregister) from BufFileDeleteShared which would have already removed all the required files. 
This raises the question in my mind whether it is correct to call SharedFileSetUnregister from BufFileDeleteShared from the API perspective as one might not want to remove the entire fileset at that point of time. It will work for your use case (where while removing buffile you also want to remove the entire fileset) but not sure if it is generic enough. For your case, I wonder if we can directly call SharedFileSetDeleteAll and we can have a call like SharedFileSetUnregister which will be called from it. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Thu, Aug 13, 2020 at 6:47 PM Amit Kapila wrote: > > On Thu, Aug 13, 2020 at 12:08 PM Amit Kapila wrote: > > > > On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar wrote: > > > > > > On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila > > > wrote: > > > > > > .. > > > > This patch's functionality can be independently verified by SQL APIs > > > > > > Your changes look fine to me. > > > > > > > I have pushed that patch last week and attached are the remaining > > patches. I have made a few changes in the next patch > > 0001-Extend-the-BufFile-interface.patch and have some comments on it > > which are as below: > > > > Few more comments on the latest patches: > v48-0002-Add-support-for-streaming-to-built-in-replicatio > 1. It appears to me that we don't remove the temporary folders created > by the apply worker. So, we have folders like > pgsql_tmp15324.0.sharedfileset in base/pgsql_tmp directory even when > the apply worker exits. I think we can remove these by calling > PathNameDeleteTemporaryDir in SharedFileSetUnregister while removing > the fileset from registered filesetlist. I think we need to call SharedFileSetDeleteAll(input_fileset) from SharedFileSetUnregister, so that all the directories created for this fileset are removed. > 2. > +typedef struct SubXactInfo > +{ > + TransactionId xid; /* XID of the subxact */ > + int fileno; /* file number in the buffile */ > + off_t offset; /* offset in the file */ > +} SubXactInfo; > + > +static uint32 nsubxacts = 0; > +static uint32 nsubxacts_max = 0; > +static SubXactInfo *subxacts = NULL; > +static TransactionId subxact_last = InvalidTransactionId; > > Will it be better if we move all the subxact related variables (like > nsubxacts, nsubxacts_max and subxact_last) inside SubXactInfo struct > as all the information anyway is related to sub-transactions? I have moved them all to a structure. > 3. > + /* > + * If there is no subtransaction then nothing to do, but if already have > + * subxact file then delete that.
> + */ > > extra space before 'but' in the above sentence is not required. Fixed > v48-0001-Extend-the-BufFile-interface > 4. > - * SharedFileSets can also be used by backends when the temporary files need > - * to be opened/closed multiple times and the underlying files need to > survive > + * SharedFileSets can be used by backends when the temporary files need to be > + * opened/closed multiple times and the underlying files need to survive > * across transactions. > * > > No need of 'also' in the above sentence. Fixed -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Sat, Aug 15, 2020 at 4:14 AM Thomas Munro wrote: > > On Fri, Aug 14, 2020 at 6:14 PM Amit Kapila wrote: > > Yeah, that makes sense. I will take care of that later today or > > tomorrow. We have not noticed that because currently none of the > > extensions is using those functions. BTW, I noticed that after > > failure, the next run is green, why so? Is the next run not on > > windows? > > The three cfbot results are for applying the patch, testing on Windows > and testing on Ubuntu in that order. It's not at all clear and I'll > probably find a better way to display it when I get around to adding > some more operating systems, maybe with some OS icons or something > like that... > Good to know, anyway, I have pushed a patch to mark those variables with PGDLLIMPORT. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, Aug 14, 2020 at 6:14 PM Amit Kapila wrote: > Yeah, that makes sense. I will take care of that later today or > tomorrow. We have not noticed that because currently none of the > extensions is using those functions. BTW, I noticed that after > failure, the next run is green, why so? Is the next run not on > windows? The three cfbot results are for applying the patch, testing on Windows and testing on Ubuntu in that order. It's not at all clear and I'll probably find a better way to display it when I get around to adding some more operating systems, maybe with some OS icons or something like that...
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, Aug 14, 2020 at 10:11 AM Thomas Munro wrote: > > On Thu, Aug 13, 2020 at 6:38 PM Amit Kapila wrote: > > I have pushed that patch last week and attached are the remaining > > patches. I have made a few changes in the next patch > > 0001-Extend-the-BufFile-interface.patch and have some comments on it > > which are as below: > > Hi Amit, > > I noticed that Konstantin Knizhnik's CF entry 2386 calls > table_scan_XXX() functions from an extension, namely > contrib/auto_explain, and started failing to build on Windows after > commit 7259736a. This seems to be due to the new global variables > CheckXidAlive and bsysscan, which probably need PGDLLIMPORT if they > are accessed from inline functions that are part of the API that we > expect extensions to be allowed to call. > Yeah, that makes sense. I will take care of that later today or tomorrow. We have not noticed that because currently none of the extensions is using those functions. BTW, I noticed that after failure, the next run is green, why so? Is the next run not on windows? -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Thu, Aug 13, 2020 at 6:38 PM Amit Kapila wrote: > I have pushed that patch last week and attached are the remaining > patches. I have made a few changes in the next patch > 0001-Extend-the-BufFile-interface.patch and have some comments on it > which are as below: Hi Amit, I noticed that Konstantin Knizhnik's CF entry 2386 calls table_scan_XXX() functions from an extension, namely contrib/auto_explain, and started failing to build on Windows after commit 7259736a. This seems to be due to the new global variables CheckXidAlive and bsysscan, which probably need PGDLLIMPORT if they are accessed from inline functions that are part of the API that we expect extensions to be allowed to call.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Thu, Aug 13, 2020 at 12:08 PM Amit Kapila wrote: > > On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar wrote: > > > > On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila wrote: > > > > .. > > > This patch's functionality can be independently verified by SQL APIs > > > > Your changes look fine to me. > > > > I have pushed that patch last week and attached are the remaining > patches. I have made a few changes in the next patch > 0001-Extend-the-BufFile-interface.patch and have some comments on it > which are as below: > Few more comments on the latest patches: v48-0002-Add-support-for-streaming-to-built-in-replicatio 1. It appears to me that we don't remove the temporary folders created by the apply worker. So, we have folders like pgsql_tmp15324.0.sharedfileset in base/pgsql_tmp directory even when the apply worker exits. I think we can remove these by calling PathNameDeleteTemporaryDir in SharedFileSetUnregister while removing the fileset from registered filesetlist. 2. +typedef struct SubXactInfo +{ + TransactionId xid; /* XID of the subxact */ + int fileno; /* file number in the buffile */ + off_t offset; /* offset in the file */ +} SubXactInfo; + +static uint32 nsubxacts = 0; +static uint32 nsubxacts_max = 0; +static SubXactInfo *subxacts = NULL; +static TransactionId subxact_last = InvalidTransactionId; Will it be better if we move all the subxact related variables (like nsubxacts, nsubxacts_max and subxact_last) inside SubXactInfo struct as all the information anyway is related to sub-transactions? 3. + /* + * If there is no subtransaction then nothing to do, but if already have + * subxact file then delete that. + */ extra space before 'but' in the above sentence is not required. v48-0001-Extend-the-BufFile-interface 4. 
- * SharedFileSets can also be used by backends when the temporary files need - * to be opened/closed multiple times and the underlying files need to survive + * SharedFileSets can be used by backends when the temporary files need to be + * opened/closed multiple times and the underlying files need to survive * across transactions. * No need of 'also' in the above sentence. -- With Regards, Amit Kapila.
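The reorganization suggested in point 2 (gathering the file-level subxact globals into one structure) could take roughly the following shape. This is an illustrative sketch only: the wrapper struct name and init helper are invented for the example, and the PostgreSQL typedefs (TransactionId, uint32) are stubbed with standard types so the fragment stands alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>      /* off_t */

/* stand-ins for PostgreSQL typedefs, to keep the sketch self-contained */
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

typedef struct SubXactInfo
{
    TransactionId xid;      /* XID of the subxact */
    int     fileno;         /* file number in the buffile */
    off_t   offset;         /* offset in the file */
} SubXactInfo;

/* all subxact bookkeeping gathered into one struct instead of four globals */
typedef struct ApplySubXactData
{
    uint32_t        nsubxacts;      /* number of subxacts tracked so far */
    uint32_t        nsubxacts_max;  /* allocated size of the array */
    TransactionId   subxact_last;   /* last known subtransaction */
    SubXactInfo    *subxacts;       /* per-subxact file positions */
} ApplySubXactData;

/* reset the bookkeeping to its empty state */
static void
subxact_data_init(ApplySubXactData *d)
{
    d->nsubxacts = 0;
    d->nsubxacts_max = 0;
    d->subxact_last = InvalidTransactionId;
    d->subxacts = NULL;
}
```

The benefit is purely organizational: a single `ApplySubXactData` instance can be reset or handed around as a unit instead of keeping four loosely related static variables in sync.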
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila wrote: > > On Wed, Aug 5, 2020 at 7:37 PM Dilip Kumar wrote: > > > > On Wed, Aug 5, 2020 at 6:25 PM Amit Kapila wrote: > > > > > > > > Can we add a test for incomplete changes (probably with toast > > > insertion but we can do it for spec_insert case as well) in > > > ReorderBuffer such that it needs to first serialize the changes and > > > then stream it? I have manually verified such scenarios but it is > > > good to have the test for the same. > > > > I have added a new test for the same in the stream.sql file. > > > > Thanks, I have slightly changed the test so that we can consume DDL > changes separately. I have made a number of other adjustments like > changing few more comments (to make them consistent with nearby > comments), removed unnecessary inclusion of header file, ran pgindent. > The next patch (v47-0001-Implement-streaming-mode-in-ReorderBuffer) in > this series looks good to me. I am planning to push it after one more > read-through unless you or anyone else has any comments on the same. > The patch I am talking about has the following functionality: > > Implement streaming mode in ReorderBuffer. Instead of serializing the > transaction to disk after reaching the logical_decoding_work_mem limit > in memory, we consume the changes we have in memory and invoke stream > API methods added by commit 45fdc9738b. However, sometimes if we have > incomplete toast or speculative insert we spill to the disk because we > can't stream till we have the complete tuple. And, as soon as we get > the complete tuple we stream the transaction including the serialized > changes. Now that we can stream in-progress transactions, the > concurrent aborts may cause failures when the output plugin consults > catalogs (both system and user-defined). We handle such failures by > returning ERRCODE_TRANSACTION_ROLLBACK sqlerrcode from system table > scan APIs to the backend or WALSender decoding a specific uncommitted > transaction. 
The decoding logic on the receipt of such a sqlerrcode > aborts the decoding of the current transaction and continues with the > decoding of other transactions. We also provide a new option via SQL > APIs to fetch the changes being streamed. > > This patch's functionality can be independently verified by SQL APIs Your changes look fine to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
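The concurrent-abort handling described in the commit message above (the catalog scan reports ERRCODE_TRANSACTION_ROLLBACK, the decoder abandons that transaction and moves on to others) can be mimicked with a tiny setjmp-based model. This is a deliberately simplified stand-in for PostgreSQL's elog/PG_TRY machinery; all names and the "xid 7 was aborted" trigger are invented for the sketch.

```c
#include <assert.h>
#include <setjmp.h>

#define SQLERRCODE_TRANSACTION_ROLLBACK 1   /* stand-in for the real errcode */

static jmp_buf decode_error;

/*
 * Stand-in for a system table scan: pretend the transaction with xid 7
 * was concurrently aborted, so the scan "raises" the rollback errcode
 * instead of returning possibly inconsistent catalog contents.
 */
static void
scan_catalog(unsigned int xid)
{
    if (xid == 7)
        longjmp(decode_error, SQLERRCODE_TRANSACTION_ROLLBACK);
}

/*
 * Decode one in-progress transaction.  On a concurrent abort we stop
 * decoding this transaction (returning 0) rather than failing outright,
 * so the caller can continue with the remaining transactions.
 */
static int
decode_txn(unsigned int xid)
{
    if (setjmp(decode_error) != 0)
        return 0;               /* concurrent abort: skip this txn */
    scan_catalog(xid);
    return 1;                   /* decoded and streamed fully */
}
```

The key property the model shows is that the error is scoped to one transaction's decoding: a caller looping over xids keeps going after an aborted one, which is exactly the behavior the commit message describes.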
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Aug 4, 2020 at 12:42 PM Dilip Kumar wrote: > > On Tue, Aug 4, 2020 at 10:12 AM Amit Kapila wrote: > > > > > 4. I think we can explain the problems (like we can see the wrong > > tuple or see two versions of the same tuple or whatever else wrong can > > happen, if possible with some example) related to concurrent aborts > > somewhere in comments. > > Done > I have slightly modified the comment added for the above point and apart from that added/modified a few comments at other places. I have also slightly edited the commit message. @@ -2196,6 +2778,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid, change->lsn = lsn; change->txn = txn; change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID; + change->txn = txn; This change is not required as the same information is assigned a few lines before. So, I have removed this change as well. Let me know what you think of the above changes? Can we add a test for incomplete changes (probably with toast insertion but we can do it for spec_insert case as well) in ReorderBuffer such that it needs to first serialize the changes and then stream it? I have manually verified such scenarios but it is good to have the test for the same. -- With Regards, Amit Kapila. v46-0001-Implement-streaming-mode-in-ReorderBuffer.patch Description: Binary data
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Jul 29, 2020 at 10:46 AM Dilip Kumar wrote: > > Thanks, please find the rebased patch set. > Few comments on v44-0001-Implement-streaming-mode-in-ReorderBuffer: 1. +-- streaming with subxact, nothing in main +BEGIN; +savepoint s1; +SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50)); +INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i); +TRUNCATE table stream_test; +rollback to s1; +INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i); +COMMIT; Is the above comment true? Because it seems to me that Insert is getting streamed in the main transaction. 2. + +postgres[33712]=#* SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1'); +lsn| xid | data +---+-+-- + 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503 + 0/16B21F8 | 503 | streaming change for TXN 503 + 0/16B2300 | 503 | streaming change for TXN 503 + 0/16B2408 | 503 | streaming change for TXN 503 + 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503 + 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503 + 0/16BECA8 | 503 | streaming change for TXN 503 + 0/16BEDB0 | 503 | streaming change for TXN 503 + 0/16BEEB8 | 503 | streaming change for TXN 503 + 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503 +(10 rows) + + + Is the above example correct? Because we should include XID in the stream message only when include_xids option is specified. 3. /* - * Queue a change into a transaction so it can be replayed upon commit. + * Record the partial change for the streaming of in-progress transactions. We + * can stream only complete changes so if we have a partial change like toast + * table insert or speculative then we mark such a 'txn' so that it can't be + * streamed. /speculative then/speculative insert then 4. 
I think we can explain the problems (like we can see the wrong tuple or see two versions of the same tuple or whatever else wrong can happen, if possible with some example) related to concurrent aborts somewhere in comments. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
> Attaching an updated patch for the stats for streaming based on v2 of
> Sawada-san's replication slot stats framework and v44 of this patch series.
> This is one patch that has both the stats framework from Sawada-san (1)
> as well as my update for streaming, so it can be applied easily on top of
> v44.

regards,
Ajin Cherian
Fujitsu Australia

streaming_stats_update.patch
Description: Binary data
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Thu, Jul 30, 2020 at 12:28 PM Ajin Cherian wrote: > > I was running some tests on this patch. I was generally trying to see how the > patch affects logical replication when doing bulk inserts. This issue has > been raised in the past, for eg: this [1]. > My test setup is: > 1. Two postgres servers running - A and B > 2. Create a pgbench setup on A. (pgbench -i -s 5 postgres) > 3. replicate the 3 tables (schema only) on B. > 4. Three publishers on A for the 3 tables of pgbench; pgbench_accounts, > pgbench_branches and pgbench_tellers; > 5. Three subscribers on B for the same tables. (streaming on and off based on > the scenarios described below) > > run pgbench with : pgbench -c 4 -T 100 postgres > While pgbench is running, Do a bulk insert on some other table not in the > publication list (say t1); INSERT INTO t1 (select i FROM > generate_series(1,1000) i); > > Four scenarios: > 1. Pgbench with logical replication enabled without bulk insert > Avg TPS (out of 10 runs): 641 TPS > 2.Pgbench without logical replication enabled with bulk insert (no pub/sub) > Avg TPS (out of 10 runs): 665 TPS > 3, Pgbench with logical replication enabled with bulk insert > Avg TPS (out of 10 runs): 278 TPS > 4. Pgbench with logical replication streaming on with bulk insert > Avg TPS (out of 10 runs): 440 TPS > > As you can see, the bulk inserts, although on a totally unaffected table, > does impact the TPS. But what is good is that, enabling streaming improves > the TPS (about 58% improvement) > Thanks for doing these tests, it is a good win and probably the reason is that after patch we won't serialize such big transactions (as shown in Konstantin's email [1]) and they will be simply skipped. Basically, it will try to stream such transactions and will skip them as they are not required to be sent. [1] - https://www.postgresql.org/message-id/5f5143cc-9f73-3909-3ef7-d3895cc6cc90%40postgrespro.ru -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Jul 29, 2020 at 3:16 PM Dilip Kumar wrote:
>
> Thanks, please find the rebased patch set.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com

I was running some tests on this patch. I was generally trying to see how the patch affects logical replication when doing bulk inserts. This issue has been raised in the past, for example in [1].

My test setup is:
1. Two postgres servers running - A and B
2. Create a pgbench setup on A. (pgbench -i -s 5 postgres)
3. Replicate the 3 tables (schema only) on B.
4. Three publishers on A for the 3 tables of pgbench: pgbench_accounts, pgbench_branches and pgbench_tellers.
5. Three subscribers on B for the same tables. (streaming on and off based on the scenarios described below)

Run pgbench with: pgbench -c 4 -T 100 postgres

While pgbench is running, do a bulk insert on some other table not in the publication list (say t1): INSERT INTO t1 (select i FROM generate_series(1,1000) i);

Four scenarios:
1. Pgbench with logical replication enabled without bulk insert
   Avg TPS (out of 10 runs): 641 TPS
2. Pgbench without logical replication enabled with bulk insert (no pub/sub)
   Avg TPS (out of 10 runs): 665 TPS
3. Pgbench with logical replication enabled with bulk insert
   Avg TPS (out of 10 runs): 278 TPS
4. Pgbench with logical replication streaming on with bulk insert
   Avg TPS (out of 10 runs): 440 TPS

As you can see, the bulk inserts, although on a totally unaffected table, do impact the TPS. But what is good is that enabling streaming improves the TPS (about 58% improvement).

[1] - https://www.postgresql.org/message-id/flat/CAMsr%2BYE6aE6Re6smrMr-xCabRmCr%3DyzXEf2Yuv5upEDY5nMX8g%40mail.gmail.com#dbe51a181dd735eec8bb36f8a07bacf5

regards,
Ajin Cherian
Fujitsu Australia
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Sun, Jul 26, 2020 at 11:04 AM Dilip Kumar wrote: > > > Today, I have again looked at the first patch > > (v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't > > find any more problems with it so planning to commit the same unless > > you or someone else want to add more to it. Just for ease of others, > > "the next patch extends the logical decoding output plugin API with > > stream methods". It adds seven methods to the output plugin API, > > adding support for streaming changes for large in-progress > > transactions. The methods are stream_start, stream_stop, stream_abort, > > stream_commit, stream_change, stream_message, and stream_truncate. > > Most of this is a simple extension of the existing methods, with the > > semantic difference that the transaction (or subtransaction) is > > incomplete and may be aborted later (which is something the regular > > API does not really need to deal with). > > > > This also extends the 'test_decoding' plugin, implementing these new > > stream methods. The stream_start/start_stop are used to demarcate a > > chunk of changes streamed for a particular toplevel transaction. > > > > This commit simply adds these new APIs and the upcoming patch to > > "allow the streaming mode in ReorderBuffer" will use these APIs. > > LGTM > Pushed. Feel free to submit the remaining patches. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Sat, Jul 25, 2020 at 5:08 PM Amit Kapila wrote: > > On Fri, Jul 24, 2020 at 7:17 PM Dilip Kumar wrote: > > > > Your changes look fine to me. Additionally, I have changed a test > > case of getting the streaming changes in 0002. Instead of just > > showing the count, I am showing that the transaction is actually > > streaming. > > > > If you want to show the changes then there is no need to display 157 > rows probably a few (10-15) should be sufficient. If we can do that > by increasing the size of the row then good, otherwise, I think it is > better to retain the test to display the count. I think in existing test cases too we are displaying multiple rows, e.g. toast.out is showing 235 rows. But maybe I will try to reduce it to a smaller number of rows. > Today, I have again looked at the first patch > (v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't > find any more problems with it so planning to commit the same unless > you or someone else want to add more to it. Just for ease of others, > "the next patch extends the logical decoding output plugin API with > stream methods". It adds seven methods to the output plugin API, > adding support for streaming changes for large in-progress > transactions. The methods are stream_start, stream_stop, stream_abort, > stream_commit, stream_change, stream_message, and stream_truncate. > Most of this is a simple extension of the existing methods, with the > semantic difference that the transaction (or subtransaction) is > incomplete and may be aborted later (which is something the regular > API does not really need to deal with). > > This also extends the 'test_decoding' plugin, implementing these new > stream methods. The stream_start/start_stop are used to demarcate a > chunk of changes streamed for a particular toplevel transaction. > > This commit simply adds these new APIs and the upcoming patch to > "allow the streaming mode in ReorderBuffer" will use these APIs.
LGTM -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, Jul 24, 2020 at 7:17 PM Dilip Kumar wrote:
>
> Your changes look fine to me. Additionally, I have changed a test
> case of getting the streaming changes in 0002. Instead of just
> showing the count, I am showing that the transaction is actually
> streaming.
>

If you want to show the changes then there is no need to display 157 rows; probably a few (10-15) should be sufficient. If we can do that by increasing the size of the row then good; otherwise, I think it is better to retain the test to display the count.

Today, I have again looked at the first patch (v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't find any more problems with it, so planning to commit the same unless you or someone else wants to add more to it. Just for ease of others, "the next patch extends the logical decoding output plugin API with stream methods". It adds seven methods to the output plugin API, adding support for streaming changes for large in-progress transactions. The methods are stream_start, stream_stop, stream_abort, stream_commit, stream_change, stream_message, and stream_truncate. Most of this is a simple extension of the existing methods, with the semantic difference that the transaction (or subtransaction) is incomplete and may be aborted later (which is something the regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these new stream methods. The stream_start/stream_stop methods are used to demarcate a chunk of changes streamed for a particular toplevel transaction.

This commit simply adds these new APIs and the upcoming patch to "allow the streaming mode in ReorderBuffer" will use these APIs.

--
With Regards,
Amit Kapila.
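The seven methods listed above amount to a table of callbacks that the output plugin fills in. A stripped-down sketch follows; it is an invented simplification, not the real OutputPluginCallbacks declaration, whose callbacks take a LogicalDecodingContext, ReorderBufferTXN, LSNs and payload arguments that are omitted here to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* simplified callback table for the streaming additions (toy signatures) */
typedef struct StreamCallbacks
{
    void (*stream_start) (void);    /* demarcate start of a streamed chunk */
    void (*stream_stop) (void);     /* demarcate end of a streamed chunk */
    void (*stream_abort) (void);    /* the (sub)transaction aborted */
    void (*stream_commit) (void);   /* the transaction finally committed */
    void (*stream_change) (void);   /* one change within a chunk */
    void (*stream_message) (void);  /* a logical WAL message */
    void (*stream_truncate) (void); /* a TRUNCATE within the transaction */
} StreamCallbacks;

static void noop(void) {}

/* a plugin supports streaming only if every callback is provided */
static int
streaming_supported(const StreamCallbacks *cb)
{
    return cb->stream_start && cb->stream_stop && cb->stream_abort &&
        cb->stream_commit && cb->stream_change && cb->stream_message &&
        cb->stream_truncate;
}
```

The all-or-nothing check mirrors the semantics described above: since a streamed transaction is incomplete and may still abort, a plugin that implements only some of the chunk-demarcation and abort callbacks could not consume the stream coherently.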
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Jul 22, 2020 at 4:55 PM Dilip Kumar wrote: > > You are right. I have changed it. > Thanks, I have pushed the second patch in this series which is 0001-WAL-Log-invalidations-at-command-end-with-wal_le in your latest patch. I will continue working on remaining patches. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Jul 22, 2020 at 10:20 AM Dilip Kumar wrote: > > On Wed, Jul 22, 2020 at 9:18 AM Amit Kapila wrote: > > > > On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar wrote: > > > > > > There was one warning in release mode in the last version in 0004 so > > > attaching a new version. > > > > > > > Today, I was reviewing patch > > v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a > > small problem with it. > > > > + /* > > + * Execute the invalidations for xid-less transactions, > > + * otherwise, accumulate them so that they can be processed at > > + * the commit time. > > + */ > > + if (!ctx->fast_forward) > > + { > > + if (TransactionIdIsValid(xid)) > > + { > > + ReorderBufferAddInvalidations(reorder, xid, buf->origptr, > > + invals->nmsgs, invals->msgs); > > + ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, > > + buf->origptr); > > + } > > > > I think we need to set ReorderBufferXidSetCatalogChanges even when > > ctx->fast-forward is true because we are dependent on that flag for > > snapshot build (see SnapBuildCommitTxn). We are already doing the > > same way in DecodeCommit where even though we skip adding > > invalidations for fast-forward cases but we do set the flag to > > indicate that this txn has catalog changes. Is there any reason to do > > things differently here? > > I think it is wrong, we should set the > ReorderBufferXidSetCatalogChanges, even if it is in fast-forward mode. > Thanks for the change. I have one more minor comment in the patch 0001-WAL-Log-invalidations-at-command-end-with-wal_le. /* + * Invalidations logged with wal_level=logical. + */ +typedef struct xl_xact_invalidations +{ + int nmsgs; /* number of shared inval msgs */ + SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER]; +} xl_xact_invalidations; I see that we already have a structure xl_xact_invals in the code which has the same members, so I think it is better to use that instead of defining a new one. -- With Regards, Amit Kapila. 
EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar wrote:
> There was one warning in release mode in the last version in 0004 so
> attaching a new version.

Today, I was reviewing patch
v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a
small problem with it.

+    /*
+     * Execute the invalidations for xid-less transactions,
+     * otherwise, accumulate them so that they can be processed at
+     * the commit time.
+     */
+    if (!ctx->fast_forward)
+    {
+        if (TransactionIdIsValid(xid))
+        {
+            ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+                                          invals->nmsgs, invals->msgs);
+            ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+                                              buf->origptr);
+        }

I think we need to set ReorderBufferXidSetCatalogChanges even when
ctx->fast_forward is true because we depend on that flag for the
snapshot build (see SnapBuildCommitTxn). We already do it that way in
DecodeCommit: even though we skip adding invalidations in the
fast-forward case, we still set the flag to indicate that this txn has
catalog changes. Is there any reason to do things differently here?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
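The suggested fix amounts to hoisting the catalog-changes marking out of the fast_forward check. A self-contained toy sketch (stubbed state, not the actual reorderbuffer API) of the proposed control flow:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state recording what the decoding step did; illustrative only. */
typedef struct DecodeState
{
	bool		queued_invalidations;	/* ReorderBufferAddInvalidations */
	bool		marked_catalog_changes; /* ReorderBufferXidSetCatalogChanges */
} DecodeState;

/*
 * Proposed shape of the check: queue the invalidations only when not in
 * fast-forward mode, but mark the transaction as having catalog changes
 * for any valid xid, since the snapshot build depends on that flag.
 */
static void
handle_invalidations(DecodeState *st, bool fast_forward, bool xid_is_valid)
{
	if (!xid_is_valid)
		return;					/* xid-less case handled elsewhere */

	if (!fast_forward)
		st->queued_invalidations = true;

	/* Must happen even during fast-forward decoding. */
	st->marked_catalog_changes = true;
}
```

With this shape, fast-forward decoding still records that the transaction changed the catalog, matching what DecodeCommit already does.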
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Jul 20, 2020 at 11:16 PM Dilip Kumar wrote: > > > There was one warning in release mode in the last version in 0004 so > attaching a new version. > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com Hello, I have reworked the patch that adds stats for the streaming of logical replication, basing it on the new logical replication stats framework developed by Masahiko-san and rebased by Amit in [1]. This uses v38 of the streaming logical replication patch as well as v1 of the stats framework patch as its base. I will rebase this as the stats framework is updated. Let me know if you have any comments. regards, Ajin Cherian Fujitsu Australia [1] - https://www.postgresql.org/message-id/flat/CA%2Bfd4k5_pPAYRTDrO2PbtTOe0eHQpBvuqmCr8ic39uTNmR49Eg%40mail.gmail.com v1_streaming_stats_update.patch Description: Binary data
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Jul 20, 2020 at 12:01 PM Dilip Kumar wrote: > > On Thu, Jul 16, 2020 at 4:25 PM Amit Kapila wrote: > > > > On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar wrote: > > > > > > On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila > > > wrote: > > > > > > > > > > > > Let me know what you think of the changes? > > > > > > I have reviewed the changes and they look fine to me. > > > > > > > Thanks, I am planning to start committing a few of the infrastructure > > patches (especially the first two) by early next week as we have resolved > > all the open issues and done an extensive review of the entire > > patch-set. In the attached version, there is a slight change in one > > of the commit messages as compared to the previous version. I would > > like to briefly describe the first two patches for the sake of > > convenience. Let me know if you or anyone else sees any problems with > > these. > > > > The first patch in the series allows us to WAL-log the subtransaction and > > top-level XID association. The logical decoding infrastructure needs > > to know which top-level > > transaction a subxact belongs to, in order to decode all the > > changes. Until now that might be delayed until commit, due to the > > caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring > > incremental decoding. So we also write the assignment info into WAL > > immediately, as part of the next WAL record (to minimize overhead), > > only when *wal_level=logical*. We cannot remove the existing > > XLOG_XACT_ASSIGNMENT WAL as that is required for avoiding overflow in > > the hot standby snapshot. > > Pushed this patch. > > > > The patch set required a rebase after the binary format > option support was committed for the CREATE SUBSCRIPTION command. I have rebased the > patch set on the latest head and also added a test case to test > streaming in binary format. 
> While going through commit 9de77b5453, I noticed the below change:
>
> @@ -424,6 +424,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
>      PQfreemem(pubnames_literal);
>      pfree(pubnames_str);
>
> +    if (options->proto.logical.binary &&
> +        PQserverVersion(conn->streamConn) >= 140000)
> +        appendStringInfoString(&cmd, ", binary 'true'");
> +

Now, the similar change in this patch series is as below:

@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
     appendStringInfo(&cmd, "proto_version '%u'",
                      options->proto.logical.proto_version);

+    if (options->proto.logical.streaming)
+        appendStringInfo(&cmd, ", streaming 'on'");
+

I think we also need a version check similar to commit 9de77b5453 to
ensure that we send the new option only when connected to a newer
version (>= 14) primary server.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
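The suggested guard can be sketched as a self-contained toy (PQserverVersion reports a v14 server as 140000; the function name and buffer handling here are illustrative, not the actual walreceiver code):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Append the streaming option to the START_REPLICATION option list only
 * when the subscriber requested it AND the primary is new enough to
 * understand it, mirroring the version check commit 9de77b5453 added
 * for the binary option.
 */
static void
append_streaming_option(char *opts, size_t optslen,
						bool streaming_requested, int server_version)
{
	if (streaming_requested && server_version >= 140000)
		strncat(opts, ", streaming 'on'", optslen - strlen(opts) - 1);
}
```

Without the version check, an older primary would reject the unknown option and break the connection, which is exactly what the guard avoids.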
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila wrote:
> On Wed, Jul 15, 2020 at 9:29 AM Dilip Kumar wrote:
> > I have reviewed your changes and those look good to me; please find
> > the latest version of the patch set.
>
> I have done an additional round of review and below are the changes I
> made in the attached patch-set.
> 1. Changed comments in 0002.
> 2. In 0005, apart from changing a few comments and a function name, I
>    have changed the below code:
>
> + if (ReorderBufferCanStream(rb) &&
> +     !SnapBuildXactNeedsSkip(builder, ctx->reader->ReadRecPtr))
>
> Here, I think it is better to compare with EndRecPtr. I feel in the
> boundary case the next record could be the same as start_decoding_at,
> so why avoid streaming in that case?

Makes sense to me.

> 3. In 0006, made the below changes:
>    a. Removed function ReorderBufferFreeChange and added a new
>       parameter in ReorderBufferReturnChange to achieve the same purpose.
>    b. Changed quite a few comments and function names, added additional
>       Asserts, and made a few other cosmetic changes.
> 4. In 0007, made the below changes:
>    a. Removed the unnecessary change in .gitignore.
>    b. Changed the newly added option name to "stream-change".
>
> Apart from the above, I have merged patches 0004, 0005, 0006, and 0007
> as those seem like one piece of functionality to me. For the sake of
> review, the patch-set that contains the merged patches is attached
> separately as v34-combined.
>
> Let me know what you think of the changes?

I have reviewed the changes and they look fine to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
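The EndRecPtr point can be illustrated with a toy version of the skip check (not the real snapbuild code): positions strictly before start_decoding_at are skipped, so for a record that ends exactly at start_decoding_at, passing its start position (ReadRecPtr) would suppress streaming while passing its end position (EndRecPtr) allows it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* simplified stand-in for the real type */

/*
 * Toy stand-in for SnapBuildXactNeedsSkip: anything strictly before the
 * point where decoding is supposed to start gets skipped.
 */
static bool
xact_needs_skip(XLogRecPtr ptr, XLogRecPtr start_decoding_at)
{
	return ptr < start_decoding_at;
}
```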