Re: Extending USING [heap | mytam | yourtam] grammar and behavior
> On Jun 15, 2022, at 8:51 PM, Michael Paquier wrote: > > I am not sure I see why this would be something users would actually > use in prod. That means to pick up something else than what the > server thinks is the best default AM but where somebody does not want > to trust the default, while generating an error if specifying the > default AM in the USING NOT clause. Sorry for the lack of clarity. I do not suggest raising an error. If you say "USING NOT foo", and foo is the default table access method, then you get the same behavior as a "USING heap" would have gotten you; otherwise, you get the same behavior as not providing any USING clause at all. In the future, we might want to create a list of fallback tams rather than just hardcoding "heap" as the one and only fallback, but I haven't run into an actual need for that. If you're wondering what "USING NOT heap" falls back to, I think that could error, or it could just use heap anyway. Whatever. That's why I'm still soliciting comments at this phase rather than posting a patch. > On top of that > default_table_access_method is user-settable. Yeah, but specifying a "USING foo" clause is also open to any user, so I don't see why this matters. "USING NOT foo" is just shorthand for checking the current default_table_access_method, and then either appending a "USING heap" clause or appending no clause. Since the user can do this anyway, what's the security implication in some syntactic sugar? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
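The fallback rule being proposed can be spelled out in a few lines. This is an illustrative Python sketch of the semantics only (PostgreSQL itself is written in C; the function and constant names here are hypothetical, and the table AM names are the made-up ones used in this thread):

```python
# Sketch of the proposed "USING NOT foo" resolution rule: if foo is the
# current default table access method, fall back to the one hardcoded
# fallback ("heap"); otherwise behave as if no USING clause was given,
# i.e. use the default AM.
FALLBACK_AM = "heap"

def resolve_table_am(excluded_am, default_am):
    """Return the table AM a 'USING NOT excluded_am' clause would select."""
    if excluded_am == default_am:
        # Same behavior as an explicit "USING heap" clause.
        return FALLBACK_AM
    # Same behavior as providing no USING clause at all.
    return default_am
```

So a test written with "USING NOT chicago" keeps exercising whatever non-chicago AM is the default in a given build, instead of being pinned to heap forever.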
Re: [PROPOSAL] Detecting plan changes with plan_id in pg_stat_activity
On 15/6/2022 21:45, Imseih (AWS), Sami wrote: Adding a plan_id to pg_stat_activity allows users to determine if a plan for a particular statement has changed and if the new plan is performing better or worse for a particular statement. There are several ways the plan_id in pg_stat_activity [...] In general, your idea is quite useful. But, as discussed earlier [1], extensions could implement many useful things if they could attach some custom data to a plan. Maybe implement your feature with a private list of nodes in the plan structure instead of a single-purpose plan_id field? [1] https://www.postgresql.org/message-id/flat/e0de3423-4bba-1e69-c55a-f76bf18dbd74%40postgrespro.ru -- regards, Andrey Lepikhov Postgres Professional
Re: Modest proposal to extend TableAM API for controlling cluster commands
> On Jun 15, 2022, at 8:18 PM, Andres Freund wrote: > > Hi, > > On 2022-06-15 20:10:30 -0700, Mark Dilger wrote: >>> On Jun 15, 2022, at 7:30 PM, Andres Freund wrote: But it's up to the TAM what it wants to do with that boolean, if in fact it does anything at all based on that. A TAM could decide to do the exact opposite of what Heap-AM does and instead sort on VACUUM FULL but not sort on CLUSTER, or perhaps perform a randomized shuffle, or >>> <... behavior here>. >>> >>> That's bogus. Yes, an AM can do stupid stuff in a callback. But so what, >>> that's possible with all extension APIs. >> >> I don't think it's "stupid stuff". A major motivation, perhaps the only >> useful motivation, for implementing a TAM is to get a non-trivial >> performance boost (relative to heap) for some target workload, almost >> certainly at the expense of worse performance for another workload. To >> optimize any particular workload sufficiently to make it worth the bother, >> you've pretty much got to do something meaningfully different than what heap >> does. > > Sure. I just don't see what that has to do with doing something widely > differing in VACUUM FULL. Which AM can't support that? I guess there's some > where implementing the full visibility semantics isn't feasible, but that's > afaics OK. The problem isn't that the TAM can't do it. Any TAM can probably copy its contents verbatim from the OldHeap to the NewHeap. But it's really stupid to have to do that if you're not going to change anything along the way, and I think if you divorce your thinking from how heap-AM works sufficiently, you might start to see how other TAMs might have nothing useful to do at this step. So there's a strong motivation to not be forced into a "move all the data uselessly" step. I don't really want to discuss the proprietary details of any TAMs I'm developing, so I'll use some made-up examples. Imagine you have decided to trade off the need to vacuum against the cost of storing 64bit xids.
That doesn't mean that compaction isn't maybe still a good thing, but you don't need to vacuum for anti-wraparound purposes anymore. Now imagine you've also decided to trade off insert speed for select speed, and you sort the entire table on every insert, and to keep indexes happy, you keep an "externally visible TID" to "internal actual location" mapping in a separate fork. Let's say you also don't support UPDATE or DELETE at all. Those immediately draw an error, "not implemented by this tam". At this point, you have a fully sorted and completely compacted table at all times. It's simply an invariant of the TAM. So, now what exactly is VACUUM FULL or CLUSTER supposed to do? Seems like the answer is "diddly squat", and yet you seem to propose requiring the TAM to do it. I don't like that. From the point-of-view of a TAM implementor, VACUUM FULL and CLUSTER are synonyms. Or am I missing something? >>> >>> The callback gets passed use_sort. So just implement it use_sort = false and >>> error out if use_sort = true? >> >> I'm not going to say that your idea is unreasonable for a TAM that you might >> choose to implement, but I don't see why that should be required of all TAMs >> anybody might ever implement. > >> The callback that gets use_sort is called from copy_table_data(). By that >> time, it's too late to avoid the >> >>/* >> * Open the relations we need. >> */ >>NewHeap = table_open(OIDNewHeap, AccessExclusiveLock); >>OldHeap = table_open(OIDOldHeap, AccessExclusiveLock); >> >> code that has already happened in cluster.c's copy_table_data() function, >> and unless I raise an error, after returning from my TAM's callback, the >> cluster code will replace the old table with the new one. I'm left no >> choices but to copy my data over, lose my data, or abort the command. None >> of those are OK options for me. > > I think you need to do a bit more explaining of what you're actually trying to > achieve here.
You're just saying "I don't want to", which doesn't really help > me to understand the set of useful options. I'm trying to opt out of cluster/vacfull. >> I'm open to different solutions. If a simple callback like >> relation_supports_cluster(Relation rel) is too simplistic, and more >> parameters with more context information is wanted, then fine, let's do >> that. > > FWIW, I want to go *simpler* if anything, not more complicated. I.e. make the > relation_copy_for_cluster optional. > > I still think it's a terrible idea to silently ignore tables in CLUSTER or > VACUUM FULL. I'm not entirely against you on that, but it makes me cringe that we impose design decisions like that on any and all future TAMs. It seems better to me to let the TAM author decide to emit an error, warning, notice, or whatever, as they see fit. >> But I can't really complete my work with the interface as it stands >> now. > > Since you've not described that work
Re: [PROPOSAL] Detecting plan changes with plan_id in pg_stat_activity
Hi, On Wed, Jun 15, 2022 at 06:45:38PM +, Imseih (AWS), Sami wrote: > Adding a plan_id to pg_stat_activity allows users > to determine if a plan for a particular statement > has changed and if the new plan is performing better > or worse for a particular statement. > [...] > Attached is a POC patch that computes the plan_id > and presents the top-level plan_id in pg_stat_activity. AFAICS you're proposing to add an identifier for a specific plan, but no way to know what that plan was? How are users supposed to use the information if they know something changed but don't know what changed exactly? > - In the POC, the compute_query_id GUC determines if a > plan_id is to be computed. Should this be a separate GUC? Probably, as computing it will likely be quite expensive. Some benchmark on various workloads would be needed here. I only had a quick look at the patch, but I see that you have some code to avoid storing the query text multiple times with different planid. How does it work exactly, and does it ensure that the query text is only removed once the last entry that uses it is removed? It seems that you identify a specific query text by queryid, but that seems wrong as collision can (easily?) happen in different databases. The real identifier of a query text should be (dbid, queryid). Note that this problem already exists, as the query texts are now stored per (userid, dbid, queryid, istoplevel). Maybe this part could be split in a different commit as it could already be useful without a planid.
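The keying and cleanup scheme being asked about could work roughly like this. An illustrative Python sketch of reference-counted query-text storage keyed by (dbid, queryid) — not the actual patch code, and the class and method names are hypothetical:

```python
# Sketch: several plan entries can share one query text; the text is
# keyed by (dbid, queryid) so that equal queryids in different databases
# do not collide, and it is removed only when the last entry
# referencing it is dropped.
class QueryTextStore:
    def __init__(self):
        self._texts = {}  # (dbid, queryid) -> [refcount, text]

    def add_entry(self, dbid, queryid, text):
        key = (dbid, queryid)
        if key in self._texts:
            self._texts[key][0] += 1      # another planid reuses this text
        else:
            self._texts[key] = [1, text]  # first reference stores the text

    def remove_entry(self, dbid, queryid):
        key = (dbid, queryid)
        self._texts[key][0] -= 1
        if self._texts[key][0] == 0:
            del self._texts[key]          # last user gone: drop the text

    def get(self, dbid, queryid):
        entry = self._texts.get((dbid, queryid))
        return entry[1] if entry else None
```

The point of the (dbid, queryid) key is exactly the collision concern raised above: queryid alone is not a safe identifier across databases.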
Re: Extending USING [heap | mytam | yourtam] grammar and behavior
> On Jun 15, 2022, at 8:51 PM, Michael Paquier wrote: > > However, > your problem is basically that you develop multiple AMs, but you want > to have regression tests that do checks across more than one table AM > at the same time. It is true that I test multiple table AMs at the same time, but that's a somewhat different concern. > Am I getting that right? Not exactly. > Why is a grammar > extension necessary for what looks like a test structure problem when > there are interdependencies across multiple AMs developed? Ok, I didn't want to get into my exact process, because it involves other changes that I don't expect -hackers to want. But basically what I do is: ./configure --with-default-tam=chicago && make && make check-world That fails for a few tests, and I manually change the create table statements in tests that are not chicago-compatible to "using not chicago". Then ./configure --with-default-tam=detroit && make && make check-world That fails for some other set of tests, but note that the tests with "using not chicago" are still using detroit in this second run. That wouldn't be true if I'd fixed up the tests in the first run "using heap". Then I can also add my own tests which might make some chicago-backed tables plus some detroit-backed tables and see how they interact. But that's superfluous to the issue of just trying to leverage the existing tests as much as I can without having to reinvent tests to cover "chicago", and then reinvent again to cover "detroit", and so forth. If you develop enough TAMs in parallel, and go with the "using heap" solution, you eventually have zero coverage for any of the TAMs, because you'll eventually be "using heap" in all the tables of all the tests. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
RE: Replica Identity check of partition table on subscriber
On Wed, Jun 15, 2022 8:30 PM Amit Kapila wrote: > > I have pushed the first bug-fix patch today. > Thanks. Attached the remaining patches which are rebased. Regards, Shi yu v9-0002-fix-memory-leak-about-attrmap.patch Description: v9-0002-fix-memory-leak-about-attrmap.patch v9-0001-Check-partition-table-replica-identity-on-subscri.patch Description: v9-0001-Check-partition-table-replica-identity-on-subscri.patch
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, May 25, 2022 at 11:48 AM Masahiko Sawada wrote: > > On Tue, May 10, 2022 at 6:58 PM John Naylor > wrote: > > > > On Tue, May 10, 2022 at 8:52 AM Masahiko Sawada > > wrote: > > > > > > Overall, radix tree implementations have good numbers. Once we got an > > > agreement on moving in this direction, I'll start a new thread for > > > that and move the implementation further; there are many things to do > > > and discuss: deletion, API design, SIMD support, more tests etc. > > > > +1 > > > > Thanks! > > I've attached an updated version patch. It is still WIP but I've > implemented deletion and improved test cases and comments. I've attached an updated version patch that changes the configure script. I'm still studying how to support AVX2 on msvc build. Also, added more regression tests. The integration with lazy vacuum and parallel vacuum is missing for now. In order to support parallel vacuum, we need to have the radix tree support to be created on DSA. Added this item to the next CF. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index d3562d6fee..a56d6e89da 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -676,3 +676,27 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_ARMV8_CRC32C_INTRINSICS
+
+# PGAC_AVX2_INTRINSICS
+#
+# Check if the compiler supports the Intel AVX2 instructions.
+#
+# If the intrinsics are supported, sets pgac_avx2_intrinsics, and CFLAGS_AVX2.
+AC_DEFUN([PGAC_AVX2_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx2_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm256_set1_epi8 _mm256_cmpeq_epi8 _mm256_movemask_epi8 CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [__m256i vec = _mm256_set1_epi8(0);
+   __m256i cmp = _mm256_cmpeq_epi8(vec, vec);
+   return _mm256_movemask_epi8(cmp) > 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_AVX2="$1"
+  pgac_avx2_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX2_INTRINSICS
diff --git a/configure b/configure
index 7dec6b7bf9..6ebc15a8c1 100755
--- a/configure
+++ b/configure
@@ -645,6 +645,7 @@ XGETTEXT
 MSGMERGE
 MSGFMT_FLAGS
 MSGFMT
+CFLAGS_AVX2
 PG_CRC32C_OBJS
 CFLAGS_ARMV8_CRC32C
 CFLAGS_SSE42
@@ -18829,6 +18830,82 @@ $as_echo "slicing-by-8" >&6; }
 fi
 
+# Check for Intel AVX2 intrinsics.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm256i CFLAGS=" >&5
+$as_echo_n "checking for _mm256i CFLAGS=... " >&6; }
+if ${pgac_cv_avx2_intrinsics_+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+__m256i vec = _mm256_set1_epi8(0);
+  __m256i cmp = _mm256_cmpeq_epi8(vec, vec);
+  return _mm256_movemask_epi8(cmp) > 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx2_intrinsics_=yes
+else
+  pgac_cv_avx2_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx2_intrinsics_" >&5
+$as_echo "$pgac_cv_avx2_intrinsics_" >&6; }
+if test x"$pgac_cv_avx2_intrinsics_" = x"yes"; then
+  CFLAGS_AVX2=""
+  pgac_avx2_intrinsics=yes
+fi
+
+if test x"$pgac_avx2_intrinsics" != x"yes"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm256i CFLAGS=-mavx2" >&5
+$as_echo_n "checking for _mm256i CFLAGS=-mavx2... " >&6; }
+if ${pgac_cv_avx2_intrinsics__mavx2+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx2"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+__m256i vec = _mm256_set1_epi8(0);
+  __m256i cmp = _mm256_cmpeq_epi8(vec, vec);
+  return _mm256_movemask_epi8(cmp) > 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx2_intrinsics__mavx2=yes
+else
+  pgac_cv_avx2_intrinsics__mavx2=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx2_intrinsics__mavx2" >&5
+$as_echo "$pgac_cv_avx2_intrinsics__mavx2" >&6; }
+if test x"$pgac_cv_avx2_intrinsics__mavx2" = x"yes"; then
+  CFLAGS_AVX2="-mavx2"
+  pgac_avx2_intrinsics=yes
+fi
+
+fi
+
 # Select semaphore implementation type.
 if test "$PORTNAME" != "win32"; then
diff --git a/configure.ac b/configure.ac
index d093fb88dd..6b6d095306 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2300,6 +2300,12 @@ else
 fi
 AC_SUBST(PG_CRC32C_OBJS)
 
+# Check for Intel AVX2 intrinsics.
+PGAC_AVX2_INTRINSICS([])
+if test x"$pgac_avx2_intrinsics" != x"yes"; then
+
Re: Extending USING [heap | mytam | yourtam] grammar and behavior
On Wed, Jun 15, 2022 at 8:51 PM Michael Paquier wrote: > On top of that > default_table_access_method is user-settable. > > FWIW this proposal acknowledges that and basically leverages it to the hilt, turning it into something like search_path. I strongly dislike the idea of any workflow that depends on a GUC in this manner. The fact that it is user-settable is, IMO, a flaw, not a feature, at least as far as production settings are concerned. It is a novel API for PostgreSQL to rely upon setting a GUC then attaching "unless" configurations to individual objects to ignore it. And what would be chosen (ultimately fallback is heap?), or whether it would simply error, is presently, as you say, undefined. In production this general behavior becomes useful only under the condition that among the various named access methods some of them don't even exist on the server in question, but that a fallback option would be acceptable in that case. But that suggests extending "USING" to accept multiple names, not inventing a "NOT USING". That all said, I can understand that testing presents its own special needs. But testing is probably where GUCs shine. So why not implement this capability as a GUC that is set just before the table is created instead of extending the grammar for it? Add it to "developer options" and call it a day. Dump/Restore no longer has to care about it, and its value once the table exists is basically zero anyway. David J.
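The "extend USING to accept multiple names" alternative suggested above would resolve like ordinary fallback selection. An illustrative Python sketch of first-available-AM selection (the function name and the final "heap" fallback are assumptions, not settled design):

```python
# Sketch: "USING mytam, yourtam" would mean "use the first of these AMs
# that is actually installed on this server", rather than naming which
# AM NOT to use.
def choose_am(preferred, installed, fallback="heap"):
    """Pick the first preferred AM present on the server, else the fallback."""
    for am in preferred:
        if am in installed:
            return am
    return fallback
```

This keeps the DDL self-describing: the statement lists acceptable AMs in order of preference, with no dependency on what default_table_access_method happens to be.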
Re: bogus: logical replication rows/cols combinations
On Mon, Jun 13, 2022 at 8:54 AM Amit Kapila wrote: > > I would like to close the Open item listed corresponding to this > thread [1] as the fix for the reported issue is committed > (fd0b9dcebd). Do let me know if you or others think otherwise? > Seeing no objections, I have closed this item. -- With Regards, Amit Kapila.
Re: Skipping schema changes in publication
On Tue, Jun 14, 2022 at 9:10 AM houzj.f...@fujitsu.com wrote: > > On Wednesday, June 8, 2022 7:04 PM Amit Kapila > wrote: > > > > On Fri, Jun 3, 2022 at 3:37 PM vignesh C wrote: > > > > > > Thanks for the comments, the attached v8 patch has the changes for the > > same. > > > > > > > AFAICS, the summary of this proposal is that we want to support > > exclude of certain objects from publication with two kinds of > > variants. The first variant is to add support to exclude specific > > tables from ALL TABLES PUBLICATION. Without this feature, users need > > to manually add all tables for a database even when she wants to avoid > > only a handful of tables from the database say because they contain > > sensitive information or are not required. We have seen that other > > database like MySQL also provides similar feature [1] (See > > REPLICATE_WILD_IGNORE_TABLE). The proposed syntax for this is as > > follows: > > > > CREATE PUBLICATION pub1 FOR ALL TABLES EXCEPT TABLE t1,t2; > > or > > ALTER PUBLICATION pub1 ADD ALL TABLES EXCEPT TABLE t1,t2; > > > > This will allow us to publish all the tables in the current database > > except t1 and t2. Now, I see that pg_dump has a similar option > > provided by switch --exclude-table but that allows tables matching > > patterns which is not the case here. I am not sure if we need a > > similar variant here. > > > > Then users will be allowed to reset the publication by: > > ALTER PUBLICATION pub1 RESET; > > > > This will reset the publication to the default state which includes > > resetting the publication parameters, setting the ALL TABLES flag to > > false, and dropping the relations and schemas that are associated with > > the publication. I don't know if we want to go further with allowing > > to RESET specific parameters and if so which parameters and what would > > its syntax be? > > > > The second variant is to add support to exclude certain columns of a > > table while publishing a particular table. 
Currently, users need to > > list all required columns' names even if they don't want to hide most > > of the columns in the table (for example Create Publication pub For > > Table t1 (c1, c2)). Consider that a user doesn't want to publish the 'salary' > > or other sensitive information of executives/employees but would like > > to publish all other columns. I feel in such cases it will be a lot of > > work for the user especially when the table has many columns. I see > > that Oracle has a similar feature [2]. I think without this it will be > > difficult for users to use this feature in some cases. The patch for > > this is not proposed but I would imagine syntax for it to be something > > like "Create Publication pub For Table t1 Except (c3)" and similar > > variants for Alter Publication. > > I think the feature to exclude certain columns of a table would be useful. > > In some production scenarios, we usually do not want to replicate > sensitive fields (columns) in the table. Although we already can achieve > this by specifying all replicated columns in the list[1], that seems like > hard work when the table has hundreds of columns. > > [1] > CREATE TABLE test(a int, b int, c int,..., sensitive text); > CREATE PUBLICATION pub FOR TABLE test(a,b,c,...); > > In addition, it's not easy to maintain the column list like above. Because > we sometimes need to add new fields or delete fields due to business > needs. Every time we add a column (or delete a column in the column list), we > need to update the column list. > > If we support Except: > CREATE PUBLICATION pub FOR TABLE test EXCEPT (sensitive); > > We don't need to update the column list in most cases. > Right, this is a valid point and I think it makes sense for me to support such a feature for column list and also to exclude a particular table(s) from the ALL TABLES publication. Peter E., Euler, and others, do you have any objections to supporting the above-mentioned two cases? -- With Regards, Amit Kapila.
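The EXCEPT column-list semantics discussed above reduce to simple set subtraction — an illustrative Python sketch (hypothetical helper, not server code):

```python
# Sketch of "FOR TABLE t EXCEPT (cols)": publish every column except the
# listed ones, preserving table column order. Newly added columns are
# published automatically without touching the publication definition.
def published_columns(all_columns, except_list):
    excluded = set(except_list)
    return [c for c in all_columns if c not in excluded]
```

This is why the maintenance argument above holds: adding a non-sensitive column to the table changes the published set without any ALTER PUBLICATION.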
Re: Extending USING [heap | mytam | yourtam] grammar and behavior
On Wed, Jun 15, 2022 at 06:16:21PM -0700, Mark Dilger wrote: > When developing two or more TAMs, this falls apart. Some tests may > be worth fixing up (perhaps with alternate output files) for > "mytam", but not for "columnar_tam". That might be because the test > is checking fundamentally row-store-ish properties of the table, > which has no applicability to your column-store-ish TAM. In that > case, "USING NOT columnar_tam" fixes the test failure when columnar > is the default, without preventing the test from testing "mytam" > when it happens to be the default. I think that it is very important for the in-core test suite to remain transparent in terms of options used for table AMs (or compression), and this has improved a lot over the last years with options like HIDE_TABLEAM and HIDE_TOAST_COMPRESSION. Tests could actually have more ORDER BY clauses to ensure more ordering of the results, as long as the tests don't want to stress a specific planning path. However, your problem is basically that you develop multiple AMs, but you want to have regression tests that do checks across more than one table AM at the same time. Am I getting that right? Why is a grammar extension necessary for what looks like a test structure problem when there are interdependencies across multiple AMs developed? > Once you have enough TAMs developed and deployed, this USING NOT > business becomes useful in production. You might have different > defaults on different servers, or for different customers, etc., and > for a given piece of DDL that you want to release you only want to > say which TAMs not to use, not to nail down which TAM must be used. I am not sure I see why this would be something users would actually use in prod. That means to pick up something else than what the server thinks is the best default AM but where somebody does not want to trust the default, while generating an error if specifying the default AM in the USING NOT clause.
On top of that default_table_access_method is user-settable. -- Michael signature.asc Description: PGP signature
Re: Modest proposal to extend TableAM API for controlling cluster commands
On Wed, Jun 15, 2022 at 8:18 PM Andres Freund wrote: > > If a simple callback like > > relation_supports_cluster(Relation rel) is too simplistic > Seems like it should be called: relation_supports_compaction[_by_removal_of_interspersed_dead_tuples] Basically, if the user tells the table to make itself smaller on disk by removing dead tuples, should we support the case where the Table AM says: "Sorry, I cannot do that"? If yes, then naming the table explicitly should elicit an error. Having the table chosen implicitly should provoke a warning. For ALTER TABLE CLUSTER there should be an error, which makes the implicit CLUSTER command a non-factor. However, given that, should the table structure change, it is imperative that the Table AM be capable of producing a new physical relation with the correct data, which will have been compacted as a side-effect, it seems like, explicit or implicit, expecting any Table AM to do that when faced with VACUUM FULL is reasonable. Which leaves deciding how to allow a table with a given TAM to prevent itself from being added to the CLUSTER roster. And deciding whether an opt-out feature for implicit VACUUM FULL is something we should offer as well. I'm doubtful that a TAM that is pluggable into the MVCC and WAL architecture of PostgreSQL could avoid this basic contract between the system and its users. David J.
Re: warn if GUC set to an invalid shared library
I've started to think that we should really WARN whenever a (set of) GUC is set in a manner that the server will fail to start - not just for shared libraries. In particular, I think the server should also warn if it's going to fail to start like this: 2022-06-15 22:48:34.279 CDT postmaster[20782] FATAL: WAL streaming (max_wal_senders > 0) requires wal_level "replica" or "logical" -- Justin
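A sketch of what such cross-GUC validation could look like, in illustrative Python (the real check would live in the GUC machinery in C; the settings dictionary, defaults, and function name here are assumptions for the example):

```python
# Sketch: warn at SET/reload time whenever a combination of settings is
# one the postmaster would refuse to start with, instead of only
# discovering the FATAL at the next restart. The wal_level check mirrors
# the startup error quoted above.
def startup_warnings(settings):
    """Return warnings for setting combinations that would prevent startup."""
    warnings = []
    if settings.get("max_wal_senders", 10) > 0 and \
       settings.get("wal_level", "replica") not in ("replica", "logical"):
        warnings.append('WAL streaming (max_wal_senders > 0) requires '
                        'wal_level "replica" or "logical"')
    # ... further inter-GUC consistency checks would accumulate here
    return warnings
```

The same list of checks could then be shared between "validate on SET" and the existing startup-time errors, so the two can't drift apart.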
Re: [Proposal] pg_rewind integration into core
On Wed, Mar 23, 2022 at 05:13:47PM +0530, RKN Sai Krishna wrote: > Considering the scenarios where primary is ahead of sync standbys, upon > promotion of a standby, pg_rewind is needed on the old primary if it has to > be up as a standby. Similarly in the scenarios where async standbys (in > physical replication context) go ahead of sync standbys, and upon promotion > of a standby, there is need for pg_rewind to be performed on the async > standbys which are ahead of the sync standby being promoted. > With these scenarios under consideration, integrating pg_rewind into > postgres core might be a better option IMO. We could optionally choose to > have a pg_rewind dry run performed during the standby startup and, based on > the need, perform the rewind and have the standby in sync with the primary. pg_rewind is already part of the core code as a binary tool, but what you mean is to integrate a portion of it in the backend code, as part of the startup sequence (with the node to rewind using primary_conninfo for the source?). One thing that we would need to be careful about is that no assumptions a rewind relies on are messed up in any way at the step where the rewind begins. One such thing is that the standby has achieved crash recovery correctly, so you would need to complicate the startup sequence further, which is already a complicated and sensitive piece of logic, with more internal dependencies between each piece. I am not really convinced that we need to add more technical debt in this area, particularly now that pg_rewind is able to enforce recovery on the target node first so that it has a clean state when the rewind can begin, so the assumptions around crash recovery and rewind have a clear frontier. -- Michael signature.asc Description: PGP signature
Re: Modest proposal to extend TableAM API for controlling cluster commands
Hi, On 2022-06-15 20:10:30 -0700, Mark Dilger wrote: > > On Jun 15, 2022, at 7:30 PM, Andres Freund wrote: > >> But it's up to the TAM what it wants to do with that boolean, if in fact it > >> does anything at all based on that. A TAM could decide to do the exact > >> opposite of what Heap-AM does and instead sort on VACUUM FULL but not sort > >> on CLUSTER, or perhaps perform a randomized shuffle, or >> <... behavior here>. > > > > That's bogus. Yes, an AM can do stupid stuff in a callback. But so what, > > that's possible with all extension APIs. > > I don't think it's "stupid stuff". A major motivation, perhaps the only > useful motivation, for implementing a TAM is to get a non-trivial > performance boost (relative to heap) for some target workload, almost > certainly at the expense of worse performance for another workload. To > optimize any particular workload sufficiently to make it worth the bother, > you've pretty much got to do something meaningfully different than what heap > does. Sure. I just don't see what that has to do with doing something widely differing in VACUUM FULL. Which AM can't support that? I guess there's some where implementing the full visibility semantics isn't feasible, but that's afaics OK. > >> From the point-of-view of a TAM implementor, VACUUM FULL and CLUSTER are > >> synonyms. Or am I missing something? > > > > The callback gets passed use_sort. So just implement it use_sort = false and > > error out if use_sort = true? > > I'm not going to say that your idea is unreasonable for a TAM that you might > choose to implement, but I don't see why that should be required of all TAMs > anybody might ever implement. > The callback that gets use_sort is called from copy_table_data(). By that > time, it's too late to avoid the > > /* > * Open the relations we need. 
> */ > NewHeap = table_open(OIDNewHeap, AccessExclusiveLock); > OldHeap = table_open(OIDOldHeap, AccessExclusiveLock); > > code that has already happened in cluster.c's copy_table_data() function, > and unless I raise an error, after returning from my TAM's callback, the > cluster code will replace the old table with the new one. I'm left no > choices but to copy my data over, lose my data, or abort the command. None > of those are OK options for me. I think you need to do a bit more explaining of what you're actually trying to achieve here. You're just saying "I don't want to", which doesn't really help me to understand the set of useful options. > I'm open to different solutions. If a simple callback like > relation_supports_cluster(Relation rel) is too simplistic, and more > parameters with more context information is wanted, then fine, let's do > that. FWIW, I want to go *simpler* if anything, not more complicated. I.e. make the relation_copy_for_cluster optional. I still think it's a terrible idea to silently ignore tables in CLUSTER or VACUUM FULL. > But I can't really complete my work with the interface as it stands > now. Since you've not described that work to a meaningful degree... Greetings, Andres Freund
Re: warn if GUC set to an invalid shared library
On Wed, Mar 23, 2022 at 03:02:23PM -0400, Tom Lane wrote: > I agree with Maciek's concerns about these HINTs being emitted > inappropriately, but I also have a stylistic gripe: they're only > halfway hints. Given that we fix things so they only print when they > should, the complaint about the server not starting is not a hint, > it's a fact, which per style guidelines means it should be errdetail. > So I think this ought to look more like > > WARNING: could not access file "xyz" > DETAIL: The server will fail to start with this setting. > HINT: If the server is shut down, it will be necessary to manually edit the > postgresql.auto.conf file to allow it to start again. > > I adjusted the wording a bit too --- YMMV, but I think my text is clearer. It seems to me that there is no objection to the proposed patch, but an update is required. Justin? -- Michael signature.asc Description: PGP signature
Re: Modest proposal to extend TableAM API for controlling cluster commands
> On Jun 15, 2022, at 7:30 PM, Andres Freund wrote: > >> It's effectively a synonym which determines whether the "bool use_sort" >> parameter of the table AM's relation_copy_for_cluster will be set. Heap-AM >> plays along and sorts or not based on that. > > Hardly a synonym if it behaves differently? I don't think this point is really worth arguing. We don't have to call it a synonym, and the rest of the discussion remains the same. >> But it's up to the TAM what it wants to do with that boolean, if in fact it >> does anything at all based on that. A TAM could decide to do the exact >> opposite of what Heap-AM does and instead sort on VACUUM FULL but not sort >> on CLUSTER, or perhaps perform a randomized shuffle, or > <some other behavior here>. > > That's bogus. Yes, an AM can do stupid stuff in a callback. But so what, > that's possible with all extension APIs. I don't think it's "stupid stuff". A major motivation, perhaps the only useful motivation, for implementing a TAM is to get a non-trivial performance boost (relative to heap) for some target workload, almost certainly at the expense of worse performance for another workload. To optimize any particular workload sufficiently to make it worth the bother, you've pretty much got to do something meaningfully different than what heap does. >> From the point-of-view of a TAM implementor, VACUUM FULL and CLUSTER are >> synonyms. Or am I missing something? > > The callback gets passed use_sort. So just implement it use_sort = false and > error out if use_sort = true? I'm not going to say that your idea is unreasonable for a TAM that you might choose to implement, but I don't see why that should be required of all TAMs anybody might ever implement. The callback that gets use_sort is called from copy_table_data(). By that time, it's too late to avoid the /* * Open the relations we need. 
*/ NewHeap = table_open(OIDNewHeap, AccessExclusiveLock); OldHeap = table_open(OIDOldHeap, AccessExclusiveLock); code that has already happened in cluster.c's copy_table_data() function, and unless I raise an error, after returning from my TAM's callback, the cluster code will replace the old table with the new one. I'm left with no choice but to copy my data over, lose my data, or abort the command. None of those are OK options for me. I'm open to different solutions. If a simple callback like relation_supports_cluster(Relation rel) is too simplistic, and more parameters with more context information is wanted, then fine, let's do that. But I can't really complete my work with the interface as it stands now. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
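[Editor's illustration] The skip-instead-of-error behavior proposed in this thread can be sketched as a standalone mock. All names here are invented for illustration; this is not PostgreSQL source code, just a model of a multi-relation CLUSTER loop that skips (with a warning) any relation whose AM declines via an optional relation_supports_cluster-style callback:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical mock of a table AM routine table with the proposed
 * optional callback.  NULL means "never participates in CLUSTER". */
typedef struct TableAmRoutine
{
    const char *name;
    bool      (*relation_supports_cluster) (const char *relname);
} TableAmRoutine;

static bool
heap_supports_cluster(const char *relname)
{
    (void) relname;
    return true;                /* heap always participates */
}

static TableAmRoutine heap_am = {"heap", heap_supports_cluster};
static TableAmRoutine exotic_am = {"exotic", NULL};   /* declines */

typedef struct Relation
{
    const char *relname;
    TableAmRoutine *am;
} Relation;

/* Sketch of the multi-relation CLUSTER loop: skip declining relations
 * with a warning instead of aborting the whole command.  Returns the
 * number of relations actually processed. */
static int
cluster_all(Relation *rels, int nrels)
{
    int         done = 0;

    for (int i = 0; i < nrels; i++)
    {
        Relation   *rel = &rels[i];

        if (rel->am->relation_supports_cluster == NULL ||
            !rel->am->relation_supports_cluster(rel->relname))
        {
            printf("WARNING: skipping \"%s\": AM \"%s\" does not support CLUSTER\n",
                   rel->relname, rel->am->name);
            continue;
        }
        done++;                 /* a real loop would call cluster_rel() */
    }
    return done;
}
```

The open design question in the thread is exactly the `continue` branch: whether a declining relation should be skipped with a warning (as modeled here) or raise an error.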
Re: Using PQexecQuery in pipeline mode produces unexpected Close messages
At Thu, 16 Jun 2022 10:34:22 +0900 (JST), Kyotaro Horiguchi wrote in > PQgetResult() resets the state to IDLE while in pipeline mode. > > fe-exec.c:2171 > > > if (conn->pipelineStatus != PQ_PIPELINE_OFF) > > { > > /* > > * We're about to send the results of the > > current query. Set > > * us idle now, and ... > > */ > > conn->asyncStatus = PGASYNC_IDLE; > > And actually that code let the connection state enter to IDLE before > CloseComplete. In the test case I posted, the following happens. > > PQsendQuery(conn, "SELECT 1;"); > PQsendFlushRequest(conn); > PQgetResult(conn); // state enters IDLE, reads down to > > PQgetResult(conn); // reads > PQpipelineSync(conn); // sync too late > > Pipeline feature seems intending to allow PQgetResult called before > PQpipelineSync. And also seems allowing to call QPpipelineSync() after > PQgetResult(). > > I haven't come up with a valid *fix* of this flow.. The attached is a crude patch to separate the state for PIPELINE-IDLE from PGASYNC_IDLE. I haven't found a better way.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/test/modules/libpq_pipeline/libpq_pipeline.c b/src/test/modules/libpq_pipeline/libpq_pipeline.c index 0ff563f59a..261ccc3ed4 100644 --- a/src/test/modules/libpq_pipeline/libpq_pipeline.c +++ b/src/test/modules/libpq_pipeline/libpq_pipeline.c @@ -968,15 +968,27 @@ test_prepared(PGconn *conn) fprintf(stderr, "ok\n"); } +static int n_notice; +static void +notice_processor(void *arg, const char *message) +{ + n_notice++; + fprintf(stderr, "%s", message); +} + static void test_simple_pipeline(PGconn *conn) { PGresult *res = NULL; const char *dummy_params[1] = {"1"}; Oid dummy_param_oids[1] = {INT4OID}; - + PQnoticeProcessor oldproc; + fprintf(stderr, "simple pipeline... "); + n_notice = 0; + oldproc = PQsetNoticeProcessor(conn, notice_processor, NULL); + /* * Enter pipeline mode and dispatch a set of operations, which we'll then * process the results of as they come in. 
@@ -1052,6 +1064,51 @@ test_simple_pipeline(PGconn *conn) if (PQpipelineStatus(conn) != PQ_PIPELINE_OFF) pg_fatal("Exiting pipeline mode didn't seem to work"); + if (n_notice > 0) + pg_fatal("unexpected notice"); + + /* Try the same thing with PQsendQuery */ + if (PQenterPipelineMode(conn) != 1) + pg_fatal("failed to enter pipeline mode: %s", PQerrorMessage(conn)); + + if (PQsendQuery(conn, "SELECT 1;") != 1) + pg_fatal("failed to send query: %s", PQerrorMessage(conn)); + PQsendFlushRequest(conn); + + res = PQgetResult(conn); + if (res == NULL) + pg_fatal("PQgetResult returned null when there's a pipeline item: %s", + PQerrorMessage(conn)); + if (PQresultStatus(res) != PGRES_TUPLES_OK) + pg_fatal("Unexpected result code %s from first pipeline item", + PQresStatus(PQresultStatus(res))); + PQclear(res); + + res = PQgetResult(conn); + if (res != NULL) + pg_fatal("expected NULL result"); + + if (PQpipelineSync(conn) != 1) + pg_fatal("pipeline sync failed: %s", PQerrorMessage(conn)); + res = PQgetResult(conn); + if (res == NULL) + pg_fatal("PQgetResult returned null when there's a pipeline item: %s", + PQerrorMessage(conn)); + if (PQresultStatus(res) != PGRES_PIPELINE_SYNC) + pg_fatal("Unexpected result code %s instead of PGRES_PIPELINE_SYNC, error: %s", + PQresStatus(PQresultStatus(res)), PQerrorMessage(conn)); + PQclear(res); + res =NULL; + + if (PQexitPipelineMode(conn) != 1) + pg_fatal("attempt to exit pipeline mode failed when it should've succeeded: %s", + PQerrorMessage(conn)); + + if (n_notice > 0) + pg_fatal("unexpected notice"); + + PQsetNoticeProcessor(conn, oldproc, NULL); + fprintf(stderr, "ok\n"); } diff --git a/src/test/modules/libpq_pipeline/traces/simple_pipeline.trace b/src/test/modules/libpq_pipeline/traces/simple_pipeline.trace index 5c94749bc1..1612c371c0 100644 --- a/src/test/modules/libpq_pipeline/traces/simple_pipeline.trace +++ b/src/test/modules/libpq_pipeline/traces/simple_pipeline.trace @@ -9,4 +9,18 @@ B 33 RowDescription 1 "?column?" 
0 4 -1 0 B 11 DataRow 1 1 '1' B 13 CommandComplete "SELECT 1" B 5 ReadyForQuery I +F 17 Parse "" "SELECT 1;" 0 +F 12 Bind "" "" 0 0 0 +F 6 Describe P "" +F 9 Execute "" 0 +F 6 Close P "" +F 4 Flush +B 4 ParseComplete +B 4 BindComplete +B 33 RowDescription 1 "?column?" 0 4 -1 0 +B 11 DataRow 1 1 '1' +B 13 CommandComplete "SELECT 1" +F 4 Sync +B 4 CloseComplete +B 5 ReadyForQuery I F 4 Terminate diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c index 919cf5741d..a811d7 100644 --- a/src/interfaces/libpq/fe-exec.c +++ b/src/interfaces/libpq/fe-exec.c @@ -1380,7 +1380,7 @@ pqAppendCmdQueueEntry(PGconn *conn, PGcmdQueueEntry *entry) * itself
Re: fix stats_fetch_consistency value in postgresql.conf.sample
On Sat, Jun 11, 2022 at 09:41:37AM -0500, Justin Pryzby wrote: > Note that this gives: > > guc.c:7573:9: warning: ‘dst’ may be used uninitialized in this function > [-Wmaybe-uninitialized] > > with gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2) > > I wonder whether you'd consider renaming pg_normalize_config_value() to > pg_pretty_config_value() or similar. I have looked at the patch, and I am not convinced that we need a function that does an integer -> integer-with-unit conversion for the purpose of this test. One thing is that it can be unstable with the unit in the conversion where values are close to a given threshold (aka for cases like 2048kB/2MB). On top of that, this overlaps with the existing system function in charge of converting values with bytes as the size unit, while this stuff handles more unit types and all GUC types. I think that there could be some room for doing the opposite conversion, feeding the value from postgresql.conf.sample to something and comparing it directly with boot_val. That's solvable at SQL level, still a system function may be more user-friendly. Extending the tests to check the values is something worth doing, but I think that I would limit the checks to the parameters that do not have units for now, until we figure out which interface would be more adapted for doing the normalization of the parameter values. -#syslog_facility = 'LOCAL0' +#syslog_facility = 'local0' Those changes should not be necessary in postgresql.conf.sample. The test should be in charge of applying the lower() conversion, in the same way as guc.c does internally, and that's a mode supported by the parameter parsing. Using an upper-case value in the sample file is actually meaningful sometimes (for example, syslog can use upper-case strings to refer to LOCAL0~7). -- Michael
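[Editor's illustration] The threshold instability mentioned above (the 2048kB/2MB case) can be seen in a standalone sketch. This is not guc.c's actual conversion code, just a model of a "pick the largest unit that divides the value evenly" normalization, showing why two adjacent raw values render with different units and why a straight string comparison against postgresql.conf.sample is fragile:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical normalization of a raw GUC value given in kB: use the
 * largest unit that divides it evenly.  Values just on either side of
 * a threshold (2047kB vs 2048kB) come out with different units. */
static void
normalize_kb(long kb, char *buf, size_t buflen)
{
    if (kb != 0 && kb % (1024L * 1024) == 0)
        snprintf(buf, buflen, "%ldGB", kb / (1024L * 1024));
    else if (kb != 0 && kb % 1024 == 0)
        snprintf(buf, buflen, "%ldMB", kb / 1024);
    else
        snprintf(buf, buflen, "%ldkB", kb);
}
```

A sample file spelling a default as "2048kB" would then never match the normalized "2MB", which is the instability being objected to; converting the sample-file string down to base units instead (the "opposite conversion") avoids the problem.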
Re: Modest proposal to extend TableAM API for controlling cluster commands
Hi, On 2022-06-15 19:21:42 -0700, Mark Dilger wrote: > > On Jun 15, 2022, at 7:14 PM, Andres Freund wrote: > > On 2022-06-15 19:07:50 -0700, Mark Dilger wrote: > >>> On Jun 15, 2022, at 6:55 PM, Andres Freund wrote: > >>> > >>> I think nothing would happen in this case - only pre-clustered tables get > >>> clustered in an argumentless CLUSTER. What am I missing? > >> > >> The "VACUUM FULL" synonym of "CLUSTER" doesn't depend on whether the target > >> is pre-clustered > > > > VACUUM FULL isn't a synonym of CLUSTER. While a good bit of the > > implementation > > is shared, VACUUM FULL doesn't order the table contents. I see no reason > > why > > an AM shouldn't support VACUUM FULL? > > It's effectively a synonym which determines whether the "bool use_sort" > parameter of the table AM's relation_copy_for_cluster will be set. Heap-AM > plays along and sorts or not based on that. Hardly a synonym if it behaves differently? > But it's up to the TAM what it wants to do with that boolean, if in fact it > does anything at all based on that. A TAM could decide to do the exact > opposite of what Heap-AM does and instead sort on VACUUM FULL but not sort > on CLUSTER, or perhaps perform a randomized shuffle, or <some other behavior here>. That's bogus. Yes, an AM can do stupid stuff in a callback. But so what, that's possible with all extension APIs. > From the point-of-view of a TAM implementor, VACUUM FULL and CLUSTER are > synonyms. Or am I missing something? The callback gets passed use_sort. So just implement it use_sort = false and error out if use_sort = true? > >> , and both will run against the table if the user has run an ALTER > >> TABLE..CLUSTER ON. > > > > If a user does that for a table that doesn't support clustering, well, I > > don't > > see what's gained by not erroring out. 
> > Perhaps they want to give the TAM information about which index to use for > sorting, on those occasions when the TAM's logic dictates that sorting is > appropriate, but not in response to a cluster command. I have little sympathy to randomly misusing catalog contents and then complaining that those catalog contents have an effect. Greetings, Andres Freund
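[Editor's illustration] The "implement it with use_sort = false and error out if use_sort = true" suggestion from this exchange can be sketched as a standalone mock. This is not real tableam code (in a real AM the refusal would be an ereport(ERROR)); it just models a relation_copy_for_cluster-style callback that performs the unsorted rewrite VACUUM FULL asks for but refuses the sorted rewrite CLUSTER asks for:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical copy-for-cluster callback for an AM ("mytam", invented
 * here) that supports VACUUM FULL (use_sort = false) but not CLUSTER
 * (use_sort = true).  Returns false and sets *errmsg instead of
 * erroring, so the behavior is observable in a standalone program. */
static bool
mytam_copy_for_cluster(bool use_sort, const char **errmsg)
{
    if (use_sort)
    {
        *errmsg = "mytam does not support CLUSTER (ordered rewrite)";
        return false;
    }
    /* ... copy tuples to the new relfilenode without ordering ... */
    *errmsg = NULL;
    return true;
}
```

The counter-argument in the thread is that by the time this callback runs, copy_table_data() has already opened and prepared the new heap, so erroring here aborts the whole (possibly multi-relation) command.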
Re: Modest proposal to extend TableAM API for controlling cluster commands
> On Jun 15, 2022, at 7:21 PM, Mark Dilger wrote: > >> If a user does that for a table that doesn't support clustering, well, I >> don't >> see what's gained by not erroring out. > > Perhaps they want to give the TAM information about which index to use for > sorting, on those occasions when the TAM's logic dictates that sorting is > appropriate, but not in response to a cluster command. I should admit that this is a bit hack-ish, but TAM authors haven't been left a lot of options here. Index AMs allow for custom storage parameters, but Table AMs don't, so getting information to the TAM about how to behave takes more than a little sleight of hand. Simon's proposal from a while back (don't have the link just now) to allow TAMs to define custom storage parameters would go some distance here. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Modest proposal to extend TableAM API for controlling cluster commands
> On Jun 15, 2022, at 7:14 PM, Andres Freund wrote: > > Hi, > > On 2022-06-15 19:07:50 -0700, Mark Dilger wrote: >>> On Jun 15, 2022, at 6:55 PM, Andres Freund wrote: >>> >>> I think nothing would happen in this case - only pre-clustered tables get >>> clustered in an argumentless CLUSTER. What am I missing? >> >> The "VACUUM FULL" synonym of "CLUSTER" doesn't depend on whether the target >> is pre-clustered > > VACUUM FULL isn't a synonym of CLUSTER. While a good bit of the implementation > is shared, VACUUM FULL doesn't order the table contents. I see no reason why > an AM shouldn't support VACUUM FULL? It's effectively a synonym which determines whether the "bool use_sort" parameter of the table AM's relation_copy_for_cluster will be set. Heap-AM plays along and sorts or not based on that. But it's up to the TAM what it wants to do with that boolean, if in fact it does anything at all based on that. A TAM could decide to do the exact opposite of what Heap-AM does and instead sort on VACUUM FULL but not sort on CLUSTER, or perhaps perform a randomized shuffle, or <some other behavior here>. From the point-of-view of a TAM implementor, VACUUM FULL and CLUSTER are synonyms. Or am I missing something? >> , and both will run against the table if the user has run an ALTER >> TABLE..CLUSTER ON. > > If a user does that for a table that doesn't support clustering, well, I don't > see what's gained by not erroring out. Perhaps they want to give the TAM information about which index to use for sorting, on those occasions when the TAM's logic dictates that sorting is appropriate, but not in response to a cluster command. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Modest proposal to extend TableAM API for controlling cluster commands
Hi, On 2022-06-15 19:07:50 -0700, Mark Dilger wrote: > > On Jun 15, 2022, at 6:55 PM, Andres Freund wrote: > > > > I think nothing would happen in this case - only pre-clustered tables get > > clustered in an argumentless CLUSTER. What am I missing? > > The "VACUUM FULL" synonym of "CLUSTER" doesn't depend on whether the target > is pre-clustered VACUUM FULL isn't a synonym of CLUSTER. While a good bit of the implementation is shared, VACUUM FULL doesn't order the table contents. I see no reason why an AM shouldn't support VACUUM FULL? > , and both will run against the table if the user has run an ALTER > TABLE..CLUSTER ON. If a user does that for a table that doesn't support clustering, well, I don't see what's gained by not erroring out. Greetings, Andres Freund
Re: Modest proposal to extend TableAM API for controlling cluster commands
> On Jun 15, 2022, at 6:55 PM, Andres Freund wrote: > > I think nothing would happen in this case - only pre-clustered tables get > clustered in an argumentless CLUSTER. What am I missing? The "VACUUM FULL" synonym of "CLUSTER" doesn't depend on whether the target is pre-clustered, and both will run against the table if the user has run an ALTER TABLE..CLUSTER ON. Now, we could try to catch that latter command with a utility hook, but since the VACUUM FULL is still problematic, it seems cleaner to just add the callback I am proposing. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Modest proposal to extend TableAM API for controlling cluster commands
Hi, On 2022-06-15 18:24:45 -0700, Mark Dilger wrote: > > On Jun 15, 2022, at 6:01 PM, Andres Freund wrote: > > On 2022-06-15 17:21:56 -0700, Mark Dilger wrote: > >> While developing various Table Access Methods, I have wanted a callback for > >> determining if CLUSTER (and VACUUM FULL) should be run against a table > >> backed by a given TAM. The current API contains a callback for doing the > >> guts of the cluster, but by that time, it's a bit too late to cleanly back > >> out. For single relation cluster commands, raising an error from that > >> callback is probably not too bad. For multi-relation cluster commands, > >> that > >> aborts the clustering of other yet to be processed relations, which doesn't > >> seem acceptable. > > > > Why not? What else do you want to do in that case? Silently ignoring > > non-clusterable tables doesn't seem right either. What's the use-case for > > swallowing the error? > > Imagine you develop a TAM for which the concept of "clustering" doesn't have > any defined meaning. Perhaps you've arranged the data in a way that has no > similarity to heap, and now somebody runs a CLUSTER command (with no > arguments.) I think nothing would happen in this case - only pre-clustered tables get clustered in an argumentless CLUSTER. What am I missing? Greetings, Andres Freund
Re: Using PQexecQuery in pipeline mode produces unexpected Close messages
At Thu, 16 Jun 2022 10:34:22 +0900 (JST), Kyotaro Horiguchi wrote in > At Wed, 15 Jun 2022 14:56:42 -0400, Tom Lane wrote in > > Alvaro Herrera writes: > > > So, git archaeology led me to this thread > > > https://postgr.es/m/202106072107.d4i55hdscxqj@alvherre.pgsql > > > which is why we added that message in the first place. > > > > Um. Good thing you looked. I doubt we want to revert that change now. > > > > > Alternatives: > > > - Have the client not complain if it gets CloseComplete in idle state. > > > (After all, it's a pretty useless message, since we already do nothing > > > with it if we get it in BUSY state.) > > > > ISTM the actual problem here is that we're reverting to IDLE state too > > soon. I didn't try to trace down exactly where that's happening, but > > Yes. I once visited that fact but also I thought that in the > comparison with non-pipelined PQsendQuery, the three messages look > extra. Thus I concluded (at the time) that removing Close is enough > here. > > > I notice that in the non-pipeline case we don't go to IDLE till we've > > seen 'Z' (Sync). Something in the pipeline logic must be jumping the > > gun on that state transition. > - PQgetResult() resets the state to IDLE when not in pipeline mode. D... the "not" should not be there. + PQgetResult() resets the state to IDLE while in pipeline mode. > fe-exec.c:2171 > > > if (conn->pipelineStatus != PQ_PIPELINE_OFF) > > { > > /* > > * We're about to send the results of the > > current query. Set > > * us idle now, and ... > > */ > > conn->asyncStatus = PGASYNC_IDLE; > > And actually that code let the connection state enter to IDLE before > CloseComplete. In the test case I posted, the following happens. > > PQsendQuery(conn, "SELECT 1;"); > PQsendFlushRequest(conn); > PQgetResult(conn); // state enters IDLE, reads down to > > PQgetResult(conn); // reads > PQpipelineSync(conn); // sync too late > > Pipeline feature seems intending to allow PQgetResult called before > PQpipelineSync. 
And also seems allowing to call PQpipelineSync() after > PQgetResult(). > > I haven't come up with a valid *fix* of this flow.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Using PQexecQuery in pipeline mode produces unexpected Close messages
At Wed, 15 Jun 2022 14:56:42 -0400, Tom Lane wrote in > Alvaro Herrera writes: > > So, git archaeology led me to this thread > > https://postgr.es/m/202106072107.d4i55hdscxqj@alvherre.pgsql > > which is why we added that message in the first place. > > Um. Good thing you looked. I doubt we want to revert that change now. > > > Alternatives: > > - Have the client not complain if it gets CloseComplete in idle state. > > (After all, it's a pretty useless message, since we already do nothing > > with it if we get it in BUSY state.) > > ISTM the actual problem here is that we're reverting to IDLE state too > soon. I didn't try to trace down exactly where that's happening, but Yes. I once visited that fact but also I thought that in the comparison with non-pipelined PQsendQuery, the three messages look extra. Thus I concluded (at the time) that removing Close is enough here. > I notice that in the non-pipeline case we don't go to IDLE till we've > seen 'Z' (Sync). Something in the pipeline logic must be jumping the > gun on that state transition. PQgetResult() resets the state to IDLE when not in pipeline mode. fe-exec.c:2171 > if (conn->pipelineStatus != PQ_PIPELINE_OFF) > { > /* >* We're about to send the results of the > current query. Set >* us idle now, and ... >*/ > conn->asyncStatus = PGASYNC_IDLE; And actually that code let the connection state enter to IDLE before CloseComplete. In the test case I posted, the following happens. PQsendQuery(conn, "SELECT 1;"); PQsendFlushRequest(conn); PQgetResult(conn); // state enters IDLE, reads down to PQgetResult(conn); // reads PQpipelineSync(conn); // sync too late Pipeline feature seems intending to allow PQgetResult called before PQpipelineSync. And also seems allowing to call PQpipelineSync() after PQgetResult(). I haven't come up with a valid *fix* of this flow.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
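[Editor's illustration] The state problem described in this thread can be reduced to a tiny standalone model. This is not libpq's real enum or logic, just a sketch: if PQgetResult() flips the connection to plain IDLE before the CloseComplete generated by PQsendQuery arrives, that message looks "unexpected"; a dedicated pipeline-idle state, as tried in the crude patch upthread, keeps it expected and silently discarded:

```c
#include <assert.h>

/* Hypothetical reduction of libpq's async-state handling. */
typedef enum
{
    ASYNC_BUSY,                 /* reading a query's results */
    ASYNC_IDLE,                 /* nothing expected from the server */
    ASYNC_PIPELINE_IDLE         /* idle, but pipeline traffic may trail */
} AsyncState;

/* Returns 1 if a CloseComplete ('3') message is consumed quietly,
 * 0 if it would trigger the "unexpected response" notice. */
static int
close_complete_ok(AsyncState state)
{
    return state == ASYNC_BUSY || state == ASYNC_PIPELINE_IDLE;
}
```

In the problem sequence, the second PQgetResult() has already moved the mock state from ASYNC_BUSY to ASYNC_IDLE when CloseComplete arrives; separating out ASYNC_PIPELINE_IDLE is the shape of the fix being floated.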
Re: Modest proposal to extend TableAM API for controlling cluster commands
> On Jun 15, 2022, at 6:01 PM, Andres Freund wrote: > > Hi, > > On 2022-06-15 17:21:56 -0700, Mark Dilger wrote: >> While developing various Table Access Methods, I have wanted a callback for >> determining if CLUSTER (and VACUUM FULL) should be run against a table >> backed by a given TAM. The current API contains a callback for doing the >> guts of the cluster, but by that time, it's a bit too late to cleanly back >> out. For single relation cluster commands, raising an error from that >> callback is probably not too bad. For multi-relation cluster commands, that >> aborts the clustering of other yet to be processed relations, which doesn't >> seem acceptable. > > Why not? What else do you want to do in that case? Silently ignoring > non-clusterable tables doesn't seem right either. What's the use-case for > swallowing the error? Imagine you develop a TAM for which the concept of "clustering" doesn't have any defined meaning. Perhaps you've arranged the data in a way that has no similarity to heap, and now somebody runs a CLUSTER command (with no arguments.) It's reasonable that they want all their heap tables to get the usual cluster_rel treatment, and that the existence of a table using your exotic TAM shouldn't interfere with that. Then what? You are forced to copy all the data from your OldHeap (badly named) to the NewHeap (also badly named), or to raise an error. That doesn't seem ok. >> I tried fixing this with a ProcessUtility_hook, but that >> fires before the multi-relation cluster command has compiled the list of >> relations to cluster, meaning the ProcessUtility_hook doesn't have access to >> the necessary information. (It can be hacked to compile the list of >> relations itself, but that duplicates both code and effort, with the usual >> risks that the code will get out of sync.) >> >> For my personal development, I have declared a new hook, bool >> (*relation_supports_cluster) (Relation rel). 
It gets called from >> commands/cluster.c in both the single-relation and multi-relation code >> paths, with warning or debug log messages output for relations that decline >> to be clustered, respectively. > > Do you actually need to dynamically decide whether CLUSTER is supported? > Otherwise we could just make the existing cluster callback optional and error > out if a table is clustered that doesn't have the callback. Same as above, I don't know why erroring would be the right thing to do. As a comparison, consider that we don't attempt to cluster a partitioned table, but rather just silently skip it. Imagine if, when we introduced the concept of partitioned tables, we made unqualified CLUSTER commands always fail when they encountered a partitioned table. >> Before posting a patch, do people think this sounds useful? Would you like >> the hook function signature to differ in some way? Is a simple "yes this >> relation may be clustered" vs. "no this relation may not be clustered" >> interface overly simplistic? > > It seems overly complicated, if anything ;) For the TAMs I've developed thus far, I don't need the (Relation rel) parameter, and could just as easily have used (void). But that seems to fence in what other TAM authors could do in the future. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Extending USING [heap | mytam | yourtam] grammar and behavior
Hackers, I have extended the grammar to allow "USING NOT method [, ...]" to exclude one or more TAMs in a CREATE TABLE statement. This may sound like a weird thing to do, but it is surprisingly useful when developing new Table Access Methods, particularly when you are developing two or more, not just one. To explain: Developing a new TAM takes an awful lot of testing, and much of it is duplicative of the existing core regression test suite. Leveraging the existing tests saves an awful lot of test development. When developing just one TAM, leveraging the existing tests isn't too hard. Without much work*, you can set default_table_access_method=mytam for the duration of the check-world. You'll get a few test failures this way. Some will be in tests that probe the catalogs to verify that /heap/ is stored there, and instead /mytam/ is found. Others will be tests that are sensitive to the number of rows that fit per page, etc. But a surprising number of tests just pass, at least after you get the TAM itself debugged. When developing two or more TAMs, this falls apart. Some tests may be worth fixing up (perhaps with alternate output files) for "mytam", but not for "columnar_tam". That might be because the test is checking fundamentally row-store-ish properties of the table, which has no applicability to your column-store-ish TAM. In that case, "USING NOT columnar_tam" fixes the test failure when columnar is the default, without preventing the test from testing "mytam" when it happens to be the default. Once you have enough TAMs developed and deployed, this USING NOT business becomes useful in production. You might have different defaults on different servers, or for different customers, etc., and for a given piece of DDL that you want to release you only want to say which TAMs not to use, not to nail down which TAM must be used. Thoughts? I'll hold off posting a patch until the general idea is debated. [*] It takes some extra work to get the TAP tests to play along. 
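[Editor's illustration] Drawing on the semantics clarified later in this thread ("USING NOT foo" falls back to heap when foo is the default, and otherwise behaves as if no USING clause were given), the resolution rule can be sketched standalone. All names here are invented; this is not the actual grammar or DDL code:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical resolution of a "USING NOT a [, b ...]" clause: if the
 * current default_table_access_method appears in the excluded list,
 * fall back to the hardcoded "heap"; otherwise keep the default, i.e.
 * behave as if no USING clause had been given. */
static const char *
resolve_using_not(const char *default_am,
                  const char *const *not_list, int n_not)
{
    for (int i = 0; i < n_not; i++)
    {
        if (strcmp(default_am, not_list[i]) == 0)
            return "heap";      /* the one and only fallback, for now */
    }
    return default_am;
}
```

So a test file carrying "USING NOT columnar_tam" still exercises "mytam" whenever mytam is the configured default, which is the test-reuse property being argued for.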
— Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Modest proposal to extend TableAM API for controlling cluster commands
Hi, On 2022-06-15 17:21:56 -0700, Mark Dilger wrote: > While developing various Table Access Methods, I have wanted a callback for > determining if CLUSTER (and VACUUM FULL) should be run against a table > backed by a given TAM. The current API contains a callback for doing the > guts of the cluster, but by that time, it's a bit too late to cleanly back > out. For single relation cluster commands, raising an error from that > callback is probably not too bad. For multi-relation cluster commands, that > aborts the clustering of other yet to be processed relations, which doesn't > seem acceptable. Why not? What else do you want to do in that case? Silently ignoring non-clusterable tables doesn't seem right either. What's the use-case for swallowing the error? > I tried fixing this with a ProcessUtility_hook, but that > fires before the multi-relation cluster command has compiled the list of > relations to cluster, meaning the ProcessUtility_hook doesn't have access to > the necessary information. (It can be hacked to compile the list of > relations itself, but that duplicates both code and effort, with the usual > risks that the code will get out of sync.) > > For my personal development, I have declared a new hook, bool > (*relation_supports_cluster) (Relation rel). It gets called from > commands/cluster.c in both the single-relation and multi-relation code > paths, with warning or debug log messages output for relations that decline > to be clustered, respectively. Do you actually need to dynamically decide whether CLUSTER is supported? Otherwise we could just make the existing cluster callback optional and error out if a table is clustered that doesn't have the callback. > Before posting a patch, do people think this sounds useful? Would you like > the hook function signature to differ in some way? Is a simple "yes this > relation may be clustered" vs. "no this relation may not be clustered" > interface overly simplistic? 
It seems overly complicated, if anything ;) Greetings, Andres Freund
Re:
On Thu, Jun 9, 2022 at 11:50 PM Tom Lane wrote: > > Peter Eisentraut writes: > > Initially, that chapter did not document any system views. > > Maybe we could make the system views a separate chapter? +1 -- Kind Regards, Peter Smith. Fujitsu Australia
Re: tablesync copy ignores publication actions
On Wed, Jun 15, 2022 at 5:05 PM shiy.f...@fujitsu.com wrote: > ... > Thanks for updating the patch. Two comments: > > 1. > + it means the copied table t3 contains all rows even > when > + they do not patch the row filter of publication > pub3b. > > Typo. I think "they do not patch the row filter" should be "they do not match > the row filter", right? > > 2. > @@ -500,7 +704,6 @@ > > > > - > > > > > It seems we should remove this change. > Thank you for your review comments. Those reported mistakes are fixed in the attached patch v3. -- Kind Regards, Peter Smith. Fujitsu Australia v3-0001-PGDOCS-tablesync-ignores-publish-operation.patch Description: Binary data
Modest proposal to extend TableAM API for controlling cluster commands
Hackers, While developing various Table Access Methods, I have wanted a callback for determining if CLUSTER (and VACUUM FULL) should be run against a table backed by a given TAM. The current API contains a callback for doing the guts of the cluster, but by that time, it's a bit too late to cleanly back out. For single relation cluster commands, raising an error from that callback is probably not too bad. For multi-relation cluster commands, that aborts the clustering of other yet to be processed relations, which doesn't seem acceptable. I tried fixing this with a ProcessUtility_hook, but that fires before the multi-relation cluster command has compiled the list of relations to cluster, meaning the ProcessUtility_hook doesn't have access to the necessary information. (It can be hacked to compile the list of relations itself, but that duplicates both code and effort, with the usual risks that the code will get out of sync.) For my personal development, I have declared a new hook, bool (*relation_supports_cluster) (Relation rel). It gets called from commands/cluster.c in both the single-relation and multi-relation code paths, with warning or debug log messages output for relations that decline to be clustered, respectively. Before posting a patch, do people think this sounds useful? Would you like the hook function signature to differ in some way? Is a simple "yes this relation may be clustered" vs. "no this relation may not be clustered" interface overly simplistic? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Small TAP improvements
On Wed, Jun 15, 2022 at 07:59:10AM -0400, Andrew Dunstan wrote: > Why would we do that? If you want a map don't pass any switches. But as > written you could do: > > my ($incdir, $localedir, $sharedir) = $node->config_data(qw(--includedir > --localedir --sharedir)); > > No map needed to get what you want, in fact a map would get in the > way. Nice, I didn't know you could do that. That's a pattern worth mentioning in the perldoc part as an example, in my opinion. -- Michael
Re: "buffer too small" or "path too long"?
On Wed, Jun 15, 2022 at 02:02:03PM -0400, Tom Lane wrote:
> Yeah, that was what was bugging me about this proposal. Removing
> one function's dependency on MAXPGPATH isn't much of a step forward.

This comes down to out-of-memory vs path length at the end. Changing only the paths of make_outputdirs() without touching all the paths in check.c and the one in function.c does not sound good to me, as this increases the risk of failing pg_upgrade in the middle, and that's what we should avoid, as said upthread.

> I note also that the patch leaks quite a lot of memory (a kilobyte or
> so per pathname, IIRC). That's probably negligible in this particular
> context, but anyplace that was called more than once per program run
> would need to be more tidy.

Surely.
--
Michael
Re: better page-level checksums
On Wed, Jun 15, 2022 at 1:27 PM Robert Haas wrote:
> I think what will happen, depending on the encryption mode, is probably
> that either (a) the page will decrypt to complete garbage or (b) the
> page will fail some kind of verification and you won't be able to
> decrypt it at all. Either way, you won't be able to infer anything about
> what caused the problem. All you'll know is that something is wrong.
> That sucks - a lot - and I don't have a lot of good ideas as to what can
> be done about it. The idea that an encrypted page is unintelligible and
> that small changes to either the encrypted or unencrypted data should
> result in large changes to the other is intrinsic to the nature of
> encryption. It's more or less un-debuggable by design.

It's pretty clear that there must be a lot of truth to that. But that doesn't mean that there aren't meaningful gradations beyond that.

I think that it's worth doing the following exercise (humor me): Why wouldn't it be okay to just encrypt the tuple space and the line pointer array, leaving both the page header and page special area unencrypted? What kind of user would find that trade-off to be unacceptable, and why? What's the nuance of it?

For all I know you're right (about encrypting the whole page, metadata and all). I just want to know why that is. I understand that this whole area is one where in general we may have to live with a certain amount of uncertainty about what really matters.

> With extended checksums, I don't think the issues are anywhere near as
> bad. I'm not deeply opposed to setting a page-level flag but I expect
> nominal benefits.

I also expect only a small benefit. But that isn't a particularly important factor in my mind. Let's suppose that it turns out to be significantly more useful than we originally expected, for whatever reason. Assuming all that, what else can be said about it now? Isn't it now *relatively* likely that including that status bit metadata will be *extremely* valuable, and not merely somewhat more valuable?

I guess it doesn't matter much now (since you have all but conceded that using a bit for this makes sense), but FWIW that's the main reason why I almost took it for granted that we'd need to use a status bit (or bits) for this.

--
Peter Geoghegan
Re: better page-level checksums
On Tue, Jun 14, 2022 at 10:30 PM Peter Geoghegan wrote:
> Basically I think that this is giving up rather a lot. For example,
> isn't it possible that we'd have corruption that could be a bug in
> either the checksum code, or in recovery?
>
> I'd feel a lot better about it if there was some sense of both the
> costs and the benefits.

I think that, if and when we get TDE, debuggability is likely to be a huge issue. Something will go wrong for someone at some point, and when it does, what they'll have is a supposedly-encrypted page that cannot be decrypted, and it will be totally unclear what has gone wrong. Did the page get corrupted on disk by a random bit flip? Is there a bug in the algorithm? Torn page? As things stand today, when a page gets corrupted, a human being can look at the page and make an educated guess about what has gone wrong and whether PostgreSQL or some other system is to blame, and if it's PostgreSQL, perhaps have some ideas as to where to look for the bug. If the pages are encrypted, that's a lot harder.

I think what will happen, depending on the encryption mode, is probably that either (a) the page will decrypt to complete garbage or (b) the page will fail some kind of verification and you won't be able to decrypt it at all. Either way, you won't be able to infer anything about what caused the problem. All you'll know is that something is wrong. That sucks - a lot - and I don't have a lot of good ideas as to what can be done about it. The idea that an encrypted page is unintelligible and that small changes to either the encrypted or unencrypted data should result in large changes to the other is intrinsic to the nature of encryption. It's more or less un-debuggable by design.

With extended checksums, I don't think the issues are anywhere near as bad. I'm not deeply opposed to setting a page-level flag but I expect nominal benefits. A human being looking at the page isn't going to have a ton of trouble figuring out whether or not the extended checksum is present unless the page is horribly, horribly garbled, and even if that happens, will debugging that problem really be any worse than debugging a horribly, horribly garbled page today? I don't think so. I likewise expect that pg_filedump could use heuristics to figure out what's going on just by looking at the page, even if no external information is available. You are probably right when you say that there's no need to be so parsimonious with pd_flags space as all that, but I believe that if we did decide to set no bit in pd_flags, whoever maintains pg_filedump these days would not have huge difficulty inventing a suitable heuristic. A page with an extended checksum is basically still an intelligible page, and we shouldn't understate the value of that.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: pg_upgrade (12->14) fails on aggregate
Justin Pryzby writes:
> But Petr has a point - pg_upgrade should aspire to catch errors in --check,
> rather than starting and then leaving a mess behind for the user to clean up

Agreed; pg_upgrade has historically tried to find problems similar to this. However, it's not just aggregates that are at risk. People might also have built user-defined plain functions, or operators, atop these functions. How far do we want to go in looking?

As for the query, I think it could be simplified quite a bit by relying on regprocedure literals, that is something like

    WHERE ... a.aggtransfn IN
      ('array_append(anyarray,anyelement)'::regprocedure,
       'array_prepend(anyelement,anyarray)'::regprocedure, ...)

Not sure if it's necessary to stick explicit "pg_catalog." schema qualifications into this --- IIRC pg_upgrade runs with restrictive search_path, so that this would be safe as-is.

Also, I think you need to check aggfinalfn too.

Also, I'd be inclined to reject system-provided objects by checking for OID >= 16384 rather than hard-wiring assumptions about things being in pg_catalog or not.

regards, tom lane
Re: Using PQexecQuery in pipeline mode produces unexpected Close messages
Alvaro Herrera writes:
> So, git archaeology led me to this thread
> https://postgr.es/m/202106072107.d4i55hdscxqj@alvherre.pgsql
> which is why we added that message in the first place.

Um. Good thing you looked. I doubt we want to revert that change now.

> Alternatives:
> - Have the client not complain if it gets CloseComplete in idle state.
>   (After all, it's a pretty useless message, since we already do nothing
>   with it if we get it in BUSY state.)

ISTM the actual problem here is that we're reverting to IDLE state too soon. I didn't try to trace down exactly where that's happening, but I notice that in the non-pipeline case we don't go to IDLE till we've seen 'Z' (Sync). Something in the pipeline logic must be jumping the gun on that state transition.

regards, tom lane
Re: Using PQexecQuery in pipeline mode produces unexpected Close messages
On 2022-Jun-10, Kyotaro Horiguchi wrote:

> (Moved to -hackers)
>
> At Wed, 8 Jun 2022 17:08:47 +0200, Alvaro Herrera wrote in
> > What that Close message is doing is closing the unnamed portal, which
> > is otherwise closed implicitly when the next one is opened. That's how
> > single-query mode works: if you run a single portal, it'll be kept open.
> >
> > I believe that the right fix is to not send that Close message in
> > PQsendQuery.
>
> Agreed. At least the Close message in that context is useless and
> PQsendQueryGuts doesn't send it. And removing the Close message surely
> fixes the issue.

So, git archaeology led me to this thread
https://postgr.es/m/202106072107.d4i55hdscxqj@alvherre.pgsql
which is why we added that message in the first place.

I was about to push the attached patch (a merged version of Kyotaro's and mine), but now I'm wondering if that's the right approach. Alternatives:

- Have the client not complain if it gets CloseComplete in idle state.
  (After all, it's a pretty useless message, since we already do nothing
  with it if we get it in BUSY state.)

- Have the server not send CloseComplete at all in the cases where the
  client is not expecting it. Not sure how this would be implemented.

- Other ideas?

--
Álvaro Herrera         48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"That sort of implies that there are Emacs keystrokes which aren't obscure.
I've been using it daily for 2 years now and have yet to discover any key
sequence which makes any sense." (Paul Thomas)

>From d9f0ff57fe8aa7f963a9411741bb1d68082cc31a Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Wed, 15 Jun 2022 19:56:41 +0200
Subject: [PATCH v2] PQsendQuery: Don't send Close in pipeline mode

In commit 4efcf47053ea, we modified PQsendQuery to send a Close message
when in pipeline mode. But now we discover that that's not a good thing:
under certain circumstances, it causes the server to deliver a
CloseComplete message at a time when the client is not expecting it. We
failed to notice it because the tests don't have any scenario where the
problem is hit.

Remove the offending Close, and add a test case that tickles the
problematic scenario.

Co-authored-by: Kyotaro Horiguchi
Reported-by: Daniele Varrazzo
Discussion: https://postgr.es/m/ca+mi_8bvd0_cw3sumgwpvwdnzxy32itog_16tdyru_1s2gv...@mail.gmail.com
---
 src/interfaces/libpq/fe-exec.c                |  5 --
 .../modules/libpq_pipeline/libpq_pipeline.c   | 62 +++
 .../traces/pipeline_abort.trace               |  2 -
 .../traces/simple_pipeline.trace              | 12
 4 files changed, 74 insertions(+), 7 deletions(-)

diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index 4180683194..e2df3a3480 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -1403,11 +1403,6 @@ PQsendQueryInternal(PGconn *conn, const char *query, bool newQuery)
 		pqPutInt(0, 4, conn) < 0 ||
 		pqPutMsgEnd(conn) < 0)
 		goto sendFailed;
-	if (pqPutMsgStart('C', conn) < 0 ||
-		pqPutc('P', conn) < 0 ||
-		pqPuts("", conn) < 0 ||
-		pqPutMsgEnd(conn) < 0)
-		goto sendFailed;
 	entry->queryclass = PGQUERY_EXTENDED;
 	entry->query = strdup(query);
diff --git a/src/test/modules/libpq_pipeline/libpq_pipeline.c b/src/test/modules/libpq_pipeline/libpq_pipeline.c
index c27c4e0ada..e24fbfe1cc 100644
--- a/src/test/modules/libpq_pipeline/libpq_pipeline.c
+++ b/src/test/modules/libpq_pipeline/libpq_pipeline.c
@@ -968,15 +968,29 @@ test_prepared(PGconn *conn)
 	fprintf(stderr, "ok\n");
 }

+/* Notice processor: print them, and count how many we got */
+static void
+notice_processor(void *arg, const char *message)
+{
+	int		   *n_notices = (int *) arg;
+
+	(*n_notices)++;
+	fprintf(stderr, "NOTICE %d: %s", *n_notices, message);
+}
+
 static void
 test_simple_pipeline(PGconn *conn)
 {
 	PGresult   *res = NULL;
 	const char *dummy_params[1] = {"1"};
 	Oid			dummy_param_oids[1] = {INT4OID};
+	PQnoticeProcessor oldproc;
+	int			n_notices = 0;

 	fprintf(stderr, "simple pipeline... ");

+	oldproc = PQsetNoticeProcessor(conn, notice_processor, &n_notices);
+
 	/*
 	 * Enter pipeline mode and dispatch a set of operations, which we'll then
 	 * process the results of as they come in.
@@ -1052,6 +1066,54 @@ test_simple_pipeline(PGconn *conn)
 	if (PQpipelineStatus(conn) != PQ_PIPELINE_OFF)
 		pg_fatal("Exiting pipeline mode didn't seem to work");

+	if (n_notices > 0)
+		pg_fatal("unexpected notice");
+
+	/* Try the same thing with PQsendQuery */
+	if (PQenterPipelineMode(conn) != 1)
+		pg_fatal("failed to enter pipeline mode: %s", PQerrorMessage(conn));
+
+	if (PQsendQuery(conn, "SELECT 1") != 1)
+		pg_fatal("failed to send query: %s", PQerrorMessage(conn));
+	PQsendFlushRequest(conn);
+	res = PQgetResult(conn);
+	if (res == NULL)
+		pg_fatal("PQgetResult returned null when there's a pipeline item: %s",
+				 PQerrorMessage(conn));
+	if (PQresultStatus(res) !=
Re: "buffer too small" or "path too long"?
Robert Haas writes:
> On Wed, Jun 15, 2022 at 2:51 AM Peter Eisentraut wrote:
>> We have this problem of long file names being silently truncated all
>> over the source code. Instead of equipping each one of them with a
>> length check, why don't we get rid of the fixed-size buffers and
>> allocate dynamically, as in the attached patch.

> I don't know how much we gain by fixing one place and not all the
> others, but maybe it would set a trend.

Yeah, that was what was bugging me about this proposal. Removing one function's dependency on MAXPGPATH isn't much of a step forward.

I note also that the patch leaks quite a lot of memory (a kilobyte or so per pathname, IIRC). That's probably negligible in this particular context, but anyplace that was called more than once per program run would need to be more tidy.

regards, tom lane
Re: docs: mention "pg_read_all_stats" in "track_activities" description
On Tue, Jun 07, 2022 at 10:08:21PM +0900, Ian Lawrence Barwick wrote:
> A little late to the party, but as an alternative suggestion for the last
> part:
>
>   "... and users who either own the session being reported on, or who have
>   privileges of the role to which the session belongs,"
>
> so the whole sentence would read:
>
>   Note that even when enabled, this information is only visible to
>   superusers, roles with privileges of the pg_read_all_stats role, and
>   users who either own the session being reported on or who have
>   privileges of the role to which the session belongs, so it should not
>   represent a security risk.

This seems clearer to me.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Skipping logical replication transactions on subscriber side
On Tue, Jun 14, 2022 at 3:54 AM Masahiko Sawada wrote:
> > AFAICS, we could do that by:
> >
> > 1. De-supporting platforms that have this problem, or
> > 2. Introducing new typalign values, as Noah proposed back on April 2, or
> > 3. Somehow forcing values that are sometimes 4-byte aligned and
> >    sometimes 8-byte aligned to be 8-byte alignment on all platforms
>
> Introducing new typalign values seems a good idea to me as it's more
> future-proof. This will be an item for PG16, right? The main concern
> seems to be that what this test case enforces would be a nuisance when
> introducing a new system catalog or a new column to an existing catalog,
> but given we're post PG15-beta1 it is unlikely to happen in PG15.

I agree that we're not likely to introduce a new typalign value any sooner than v16. There are a couple of things that bother me about that solution. One is that I don't know how many different behaviors exist out there in the wild. If we distinguish the alignment of double from the alignment of int8, is that good enough, or are there other data types whose properties aren't necessarily the same as either of those? The other is that 32-bit systems are already relatively rare and probably will become more rare until they disappear completely. It doesn't seem like a ton of fun to engineer solutions to problems that may go away by themselves with the passage of time. On the other hand, if the alternative is to live with this kind of ugliness for another 5 years, maybe the time it takes to craft a solution is effort well spent.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: "buffer too small" or "path too long"?
On Wed, Jun 15, 2022 at 2:51 AM Peter Eisentraut wrote:
> We have this problem of long file names being silently truncated all
> over the source code. Instead of equipping each one of them with a
> length check, why don't we get rid of the fixed-size buffers and
> allocate dynamically, as in the attached patch.

I've always wondered why we rely on MAXPGPATH instead of dynamic allocation. It seems pretty lame. I don't know how much we gain by fixing one place and not all the others, but maybe it would set a trend.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?
On Wed, Jun 15, 2022 at 1:51 AM Michael Paquier wrote:
> Handle is more consistent with the other types of interruptions in the
> SIGUSR1 handler, so the name proposed in the patch is not that
> confusing to me. And so does Process, in spirit with
> ProcessProcSignalBarrier() and ProcessLogMemoryContextInterrupt().
> While on it, is postgres.c the best home for
> HandleRecoveryConflictInterrupt()? That's a very generic file, for
> starters. Not related to the actual bug, just asking.

Yeah, there's existing precedent for this kind of split in, for example, HandleCatchupInterrupt() and ProcessCatchupInterrupt(). I think the idea is that "process" is supposed to sound like the more involved part of the operation, whereas "handle" is supposed to sound like the initial response to the signal. I'm not sure it's the clearest possible naming, but naming things is hard, and this patch is apparently not inventing a new way to do it.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: better page-level checksums
On Wed, Jun 15, 2022 at 4:54 AM Peter Eisentraut wrote:
> It's hard to get any definite information about what size of checksum is
> "good enough", since after all it depends on what kinds of errors you
> expect and what kinds of probabilities you want to accept. But the best
> I could gather so far is that 16-bit CRC are good until about 16 kB
> block size.

Not really. There's a lot of misinformation on this topic floating around on this mailing list, and some of that misinformation is my fault. I keep learning more about this topic. However, I'm pretty confident that, on the one hand, there's no hard limit on the size of the data that can be effectively validated via a CRC, and on the other hand, CRC isn't a particularly great algorithm, although it does have certain interesting advantages for certain purposes.

For example, according to https://en.wikipedia.org/wiki/Mathematics_of_cyclic_redundancy_checks#Error_detection_strength a CRC is guaranteed to detect all single-bit errors. This property is easy to achieve: for example, a parity bit has this property. According to the same source, a CRC is guaranteed to detect two-bit errors only if the distance between them is less than some limit that gets larger as the CRC gets wider.

Imagine that you have a CRC-16 of a message 64k+1 bits in length. Suppose that an error in the first bit changes the result from v to v'. Can we, by flipping a second bit later in the message, change the final result from v' back to v? The calculation only has 64k possible answers, and we have 64k bits we can flip to try to get the desired answer. If every one of those bit flips produces a different answer, then one of those answers must be v -- which means detection of two-bit errors is not guaranteed. If at least two of those bit flips produce the same answer, then consider the messages produced by those two different bit flips. They differ from each other by exactly two bits and yet produce the same CRC, so detection of two-bit errors is still not guaranteed.

On the other hand, it's still highly likely. If a message of length 2^16+1 bits contains two bit errors, one of which is in the first bit, the chances that the other one is in exactly the right place to cancel out the first error are about 2^-16. That's not zero, but it's just as good as our chances of detecting a replacement of the entire message with some other message chosen completely at random.

I think the reason why discussion of CRCs tends to focus on the types of bit errors that it can detect is that the algorithm was designed when people were doing stuff like sending files over a modem. It's easy to understand how individual bits could get garbled without anybody noticing, while large-scale corruption would be less likely, but the risks are not necessarily the same for a PostgreSQL data file. Lower levels of the stack are probably already using checksums to try to detect errors at the level of the physical medium. I'm sure some stuff slips through the cracks, but in practice we also see failure modes where the filesystem substitutes 8kB of data from an unrelated file, or where a torn write in combination with unreliable fsync results in half of the page contents being from an older version of the page. These kinds of large-scale replacements aren't what CRCs are designed to detect, and the chances that we will detect them are roughly 1-2^-bits, whether we use a CRC or something else.

Of course, that partly depends on the algorithm quality. If an algorithm is more likely to generate some results than others, then its actual error detection rate will not be as good as the number of output bits would suggest. If the result doesn't depend equally on every input bit, then the actual error detection rate will not be as good as the number of output bits would suggest.
And CRC-32 is apparently not great by modern standards: https://github.com/rurban/smhasher

Compare the results for CRC-32 with, say, Spooky32. Apparently the latter is faster yet produces better output. So maybe we would've been better off if we'd made Spooky32 the default algorithm for backup manifest checksums rather than CRC-32.

> The benefits of doubling the checksum size are exponential rather than
> linear, so there is no significant benefit of using a 64-bit checksum
> over a 32-bit one, for supported block sizes (current max is 32 kB).

I'm still unconvinced that the block size is very relevant here.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Remove trailing newlines from pg_upgrade's messages
Kyotaro Horiguchi writes:
> Also leading newlines and just "\n" bug me when I edit message
> catalogues with poedit. I might want a line-spacing function like
> pg_log_newline(PG_REPORT) if we need line-breaks in the ends of a
> message.

Yeah, that is sort of the inverse problem. I think those are there to ensure that the text appears on a fresh line even if the current line has transient status on it. We could get rid of those perhaps if we teach pg_log_v to remember whether it ended the last output with a newline or not, and then put out a leading newline only if necessary, rather than hard-wiring one into the message texts.

This might take a little bit of fiddling to make it work, because we'd not want the extra newline when completing an incomplete line by adding status. That would mean that report_status would have to do something special, plus we'd have to be sure that all such cases do go through report_status rather than calling pg_log directly. (I'm fairly sure that the code is sloppy about that today :-(.) It seems probably do-able, though.

regards, tom lane
Re: support for MERGE
Pushed this.

I noticed that the paragraph that described MERGE in the read-committed subsection had been put in the middle of some other paras that describe interactions with other transactions, so I moved it up so that it appears together with all the other paras that describe the behavior of specific commands, which is where I believe it belongs.

Thanks!

--
Álvaro Herrera        PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Thou shalt study thy libraries and strive not to reinvent them without
cause, that thy code may be short and readable and thy days pleasant
and productive. (7th Commandment for C Programmers)
Re: CREATE TABLE ( .. STORAGE ..)
Hi hackers,

I noticed that cfbot is not entirely happy with the patch, so I rebased it.

> I see that COMPRESSION and STORAGE now are handled slightly
> differently in the grammar. Maybe we could standardize that a bit
> more; so that we have only one `STORAGE [kind]` definition in the
> grammar?
>
> As I'm new to the grammar files; would you know the difference between
> `name` and `ColId`, and why you would change from one to the other in
> ALTER COLUMN STORAGE?

Good point, Matthias. I addressed this in 0002. Does it look better now?

--
Best regards,
Aleksander Alekseev

Attachments:
  v3-0001-CREATE-TABLE-.-STORAGE.patch
  v3-0002-Handle-SET-STORAGE-and-SET-COMPRESSION-similarly.patch
Re: [PATCH] Add sortsupport for range types and btree_gist
Hi Christoph!

> On 15 Jun 2022, at 15:45, Christoph Heiss wrote:
>
> By sorting the data before inserting it into the index results in a much
> better index structure, leading to significant performance improvements.

Here's my version of the very similar idea [0]. It lacks range types support. On a quick glance your version lacks support of abbreviated sort, so I think benchmarks can be pushed even further :) Let's merge our efforts and create a combined patch?

Please, create a new entry for the patch on Commitfest. Thank you!

Best regards, Andrey Borodin.

[0] https://commitfest.postgresql.org/37/2824/
Re: replacing role-level NOINHERIT with a grant-level option
On Wed, Jun 15, 2022 at 5:23 AM Peter Eisentraut wrote:
> > Consider a user who in general prefers the NOINHERIT behavior but also
> > wants to use predefined roles. Perhaps user 'peter' is to be granted
> > both 'paul' and 'pg_execute_server_programs'. If role 'peter' is set
> > to INHERIT, Peter will be sad, because his love for NOINHERIT probably
> > means that he doesn't want to exercise Paul's privileges
> > automatically. However, he needs to inherit the privileges of
> > 'pg_execute_server_programs' or they are of no use to him. Peter
> > presumably wants to use COPY TO/FROM program to put data into a table
> > owned by 'peter', not a table owned by 'pg_execute_server_programs'.
> > If so, being able to SET ROLE to 'pg_execute_server_programs' is of no
> > use to him at all, but inheriting the privilege is useful.
>
> That's because our implementation of SET ROLE is bogus. We should have
> a SET ROLE that is separate from SET SESSION AUTHORIZATION, where the
> current user can keep their current user-ness and additionally enable
> (non-inherited) roles.

It would help me to have a better description of what you think the behavior ought to be. I've always thought there was something funny about SET ROLE and SET SESSION AUTHORIZATION, because it seems like they are too similar to each other. But it would surprise me if SET ROLE added additional privileges to my session while leaving the old ones intact, too, much as I'd be surprised if SET work_mem = '8MB' followed by SET work_mem = '1GB' somehow left both values partly in effect at the same time. It feels to me like SET is describing an action that changes the session state, rather than adding to it.

> I'm mainly concerned that (AAIU), you propose to remove the current
> INHERIT/NOINHERIT attribute of roles. I wouldn't like that. If you
> want a feature that allows overriding that per-grant, maybe that's okay.

Yeah, I want to remove it and replace it with something more fine-grained. I don't yet understand why that's a problem for anything you want to do.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: [PATCH] Compression dictionaries for JSONB
Hi Matthias,

> The bulk of the patch should still be usable, but I think that the way
> it interfaces with the CREATE TABLE (column ...) APIs would need
> reworking to build on top of the APIs of the "pluggable toaster"
> patches (so, creating toasters instead of types). I think that would
> allow for an overall better user experience and better performance due
> to decreased need for fully decompressed type casting.

Many thanks for the feedback. The "pluggable TOASTer" patch looks very interesting indeed. I'm currently trying to make heads or tails of it and trying to figure out if it can be used as a base for compression dictionaries, especially for implementing the partial decompression. Hopefully I will be able to contribute to it and to the dependent patch [1] in the upcoming CF, at least as a tester/reviewer. Focusing our efforts on [1] for now seems to be a good strategy.

My current impression of your idea is somewhat mixed at this point, though. Teodor's goal is to allow creating _extensions_ that implement alternative TOAST strategies, which use alternative compression algorithms and/or use the knowledge of the binary representation of the particular type. For sure, this would be a nice thing to have. However, during the discussion of the "compression dictionaries" RFC the consensus was reached that the community wants to see it as a _built_in_ functionality rather than an extension. Otherwise we could simply add ZSON to /contrib/ as it was originally proposed.

So if we are going to keep "compression dictionaries" a built-in functionality, putting artificial constraints on its particular implementation, or adding artificial dependencies between two rather complicated patches, is arguably a controversial idea. Especially considering the fact that it was shown that the feature can be implemented without these dependencies, in a very non-invasive way.

These are just my initial thoughts I would like to share, though. I may change my mind after diving deeper into the "pluggable TOASTer" patch. I cc:'ed Teodor in case he would like to share his insights on the topic.

[1]: https://commitfest.postgresql.org/38/3479/

--
Best regards,
Aleksander Alekseev
Re: Replica Identity check of partition table on subscriber
On Wed, Jun 15, 2022 at 8:52 AM shiy.f...@fujitsu.com wrote:
> On Tue, Jun 14, 2022 8:57 PM Amit Kapila wrote:
> > > v4-0002 looks good btw, except the nitpick about the test comment
> > > similar to my earlier comment regarding v5-0001:
> > >
> > > +# Change the column order of table on publisher
> > >
> > > I think it might be better to say something specific to describe the
> > > test intent, like:
> > >
> > > Test that replication into partitioned target table continues to work
> > > correctly when the published table is altered
> >
> > Okay, changed this and slightly modified the comments and commit
> > message. I am just attaching the HEAD patches for the first two issues.
>
> Thanks for updating the patch.
>
> Attached the new patch set which ran pgindent, and the patches for pg14
> and pg13. (Only the first two patches of the patch set.)

I have pushed the first bug-fix patch today.

--
With Regards,
Amit Kapila.
Re: Perform streaming logical transactions by background workers and parallel apply
On Tue, Jun 14, 2022 at 9:07 AM wangw.f...@fujitsu.com wrote: > > > Attach the new patches. > Only changed patches 0001, 0004 and added new separate patch 0005. > Few questions/comments on 0001 === 1. In the commit message, I see: "We also need to allow stream_stop to complete by the apply background worker to avoid deadlocks because T-1's current stream of changes can update rows in conflicting order with T-2's next stream of changes." Thinking about this, won't the T-1 and T-2 deadlock on the publisher node as well if the above statement is true? 2. + +The apply background workers are taken from the pool defined by +max_logical_replication_workers. + + +The default value is 3. This parameter can only be set in the +postgresql.conf file or on the server command +line. + Is there a reason to choose this number as 3? Why not 2 similar to max_sync_workers_per_subscription? 3. + + + Setting streaming mode to apply could export invalid LSN + as finish LSN of failed transaction. Changing the streaming mode and making + the same conflict writes the finish LSN of the failed transaction in the + server log if required. + How will the user identify that this is an invalid LSN value and she shouldn't use it to SKIP the transaction? Can we change the second sentence to: "User should change the streaming mode to 'on' if they would instead wish to see the finish LSN on error. Users can use finish LSN to SKIP applying the transaction." I think we can give reference to docs where the SKIP feature is explained. 4. + * This file contains routines that are intended to support setting up, using, + * and tearing down a ApplyBgworkerState. + * Refer to the comments in file header of logical/worker.c to see more + * informations about apply background worker. Typo. /informations/information. Consider having an empty line between the above two lines. 5. +ApplyBgworkerState * +apply_bgworker_find_or_start(TransactionId xid, bool start) { ... ... 
+ if (!TransactionIdIsValid(xid)) + return NULL; + + /* + * We don't start new background worker if we are not in streaming apply + * mode. + */ + if (MySubscription->stream != SUBSTREAM_APPLY) + return NULL; + + /* + * We don't start new background worker if user has set skiplsn as it's + * possible that user want to skip the streaming transaction. For + * streaming transaction, we need to spill the transaction to disk so that + * we can get the last LSN of the transaction to judge whether to skip + * before starting to apply the change. + */ + if (start && !XLogRecPtrIsInvalid(MySubscription->skiplsn)) + return NULL; + + /* + * For streaming transactions that are being applied in apply background + * worker, we cannot decide whether to apply the change for a relation + * that is not in the READY state (see should_apply_changes_for_rel) as we + * won't know remote_final_lsn by that time. So, we don't start new apply + * background worker in this case. + */ + if (start && !AllTablesyncsReady()) + return NULL; ... ... } Can we move some of these starting checks to a separate function like canstartapplybgworker()? -- With Regards, Amit Kapila.
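For illustration, the consolidated guard function suggested in point 5 could be structured like this sketch (a Python stand-in for the C logic; the argument names are hypothetical and just mirror the checks quoted above):

```python
def can_start_apply_bgworker(xid_valid, stream_mode, skiplsn_set,
                             all_tablesyncs_ready, start):
    """Consolidated early-exit checks mirroring apply_bgworker_find_or_start().

    Each False return corresponds to one of the 'return NULL' guards
    quoted above; the xid and stream-mode checks apply unconditionally,
    the other two only when asked to start a new worker.
    """
    if not xid_valid:
        return False
    # Only streaming mode 'apply' uses background workers.
    if stream_mode != "apply":
        return False
    # A configured skiplsn means the transaction may need to be skipped,
    # which requires spilling it to disk first.
    if start and skiplsn_set:
        return False
    # Until all tablesyncs are READY, remote_final_lsn is unknown.
    if start and not all_tablesyncs_ready:
        return False
    return True
```

Pulling the guards into one predicate like this would also make it obvious which checks depend on the `start` flag and which do not.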
Re: Small TAP improvements
On 2022-06-14 Tu 19:13, Michael Paquier wrote: > On Tue, Jun 14, 2022 at 12:20:56PM -0400, Tom Lane wrote: >> Andrew Dunstan writes: >>> The second changes the new GUCs TAP test to check against the installed >>> postgresql.conf.sample rather than the one in the original source >>> location. There are probably arguments both ways, but if we ever decided >>> to postprocess the file before installation, this would do the right thing. >> Seems like a good idea, especially since it also makes the test code >> shorter and more robust(-looking). > It seems to me that you did not look at the git history very closely. > The first version of 003_check_guc.pl did exactly what 0002 is > proposing to do, see b0a55f4. That's also why config_data() has been > introduced in the first place. This original logic has been reverted > once shortly after, as of 52377bb, per a complaint by Christoph Berg > because this broke some of the assumptions the custom patches of > Debian relied on: > https://www.postgresql.org/message-id/ygyw25oxv5men...@msg.df7cb.de Quite right, I missed that. Still, it now seems to be moot, given what Christoph said at the bottom of the thread. If I'd seen the thread I would probably have been inclined to say that if Debian can patch pg_config they can also patch the test :-) > > And it was also pointed out that we'd better use the version in the > source tree rather than a logic that depends on finding the path from > the output of pg_config with an installation tree assumed to exist > (there should be one for installcheck anyway), as of: > https://www.postgresql.org/message-id/2023925.1644591...@sss.pgh.pa.us > > If the change of 0002 is applied, we will just loop back to the > original issue with Debian. So I am adding Christoph in CC, as he has > also mentioned that the patch applied to PG for Debian that > manipulates the installation paths has been removed, but I may be > wrong in assuming that it is the case. Honestly, I don't care all that much. 
I noticed these issues when dealing with something for EDB that turned out not to be related to these things. I can see arguments both ways on this one. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Small TAP improvements
On 2022-06-14 Tu 19:24, Michael Paquier wrote: > On Tue, Jun 14, 2022 at 05:08:28PM -0400, Tom Lane wrote: >> Andrew Dunstan writes: >>> OK, here's a more principled couple of patches. For config_data, if you >>> give multiple options it gives you back the list of values. If you don't >>> specify any, in scalar context it just gives you back all of pg_config's >>> output, but in array context it gives you a map, so you should be able >>> to say things like: >>> my %node_config = $node->config_data; >> Might be overkill, but since you wrote it already, looks OK to me. > + # exactly one option: hand back the output (minus LF) > + return $stdout if (@options == 1); > + my @lines = split(/\n/, $stdout); > + # more than one option: hand back the list of values; > + return @lines if (@options); > + # no options, array context: return a map > + my @map; > + foreach my $line (@lines) > + { > + my ($k,$v) = split (/ = /,$line,2); > + push(@map, $k, $v); > + } > > This patch is able to handle the case of no option and one option > specified by the caller of the routine. However, pg_config is able to > return a set of values when specifying multiple switches, respecting > the order of the switches, so wouldn't it be better to return a map > made of ($option, $line)? For example, on a command like `pg_config > --sysconfdir --`, we would get back: > (('--sysconfdir', sysconfdir_val), ('--localedir', localedir_val)) > > If this is not worth the trouble, I think that you'd better die() hard > if the caller specifies more than two option switches. Why would we do that? If you want a map, don't pass any switches. But as written you could do: my ($incdir, $localedir, $sharedir) = $node->config_data(qw(--includedir --localedir --sharedir)); No map needed to get what you want, in fact a map would get in the way. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
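For anyone following along, the three calling conventions being discussed (one switch, several switches, no switches) are easy to see in a Python stand-in for config_data() — this is only an illustration of the parsing, not the real TAP helper:

```python
def parse_pg_config(stdout, options=()):
    """Rough Python analogue of the TAP config_data() helper.

    - exactly one option: pg_config prints just the value, so return it
    - several options: return the list of values, in switch order
    - no options: pg_config prints 'KEY = value' lines, so return a dict
    """
    lines = stdout.rstrip("\n").split("\n")
    if len(options) == 1:
        return lines[0]
    if options:
        return lines
    return dict(line.split(" = ", 1) for line in lines)
```

As in the Perl version, the map form only makes sense when no switches are passed, since single-switch output carries no key to build a map from.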
Re: Add header support to text format and matching feature
Julien Rouhaud wrote: > Maybe that's just me but I understand "not supported" as "this makes > sense, but this is currently a limitation that might be lifted > later". Looking at ProcessCopyOptions(), there are quite a few invalid combinations of options that produce ERRCODE_FEATURE_NOT_SUPPORTED currently: - HEADER in binary mode - FORCE_QUOTE outside of csv - FORCE_QUOTE outside of COPY TO - FORCE_NOT_NULL outside of csv - FORCE_NOT_NULL outside of COPY FROM - ESCAPE outside of csv - delimiter appearing in the NULL specification - csv quote appearing in the NULL specification FORCE_QUOTE and FORCE_NOT_NULL are options that only make sense in one direction, so the errors when using these in the wrong direction are comparable to the "HEADER MATCH outside of COPY FROM" error that we want to add. In that sense, ERRCODE_FEATURE_NOT_SUPPORTED would be consistent. The other errors in the list above are more about the format itself, with options that make sense only for one format. So the way we're supposed to understand ERRCODE_FEATURE_NOT_SUPPORTED in these other cases is that such format does not support such feature, but without implying that it's a limitation. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: [PATCH] Completed unaccent dictionary with many missing characters
Two fixes (bad comment and fixed Latin-ASCII.xml). Michael Paquier wrote on 17.05.2022 09:11: On Thu, May 05, 2022 at 09:44:15PM +0200, Przemysław Sztoch wrote: Tom, I disagree with you because many similar numerical conversions are already taking place, e.g. 1/2, 1/4... This part sounds like a valid argument to me. unaccent.rules does already the conversion of some mathematical signs, and the additions proposed in the patch don't look that weird to me. I agree with Peter and Przemysław that this is reasonable. -- Michael -- Przemysław Sztoch | Mobile +48 509 99 00 66 commit 5ea0f226e1d0d7f0cc7b32fa4bd1bc6ed120c194 Author: Przemyslaw Sztoch Date: Wed May 4 18:28:59 2022 +0200 Update unnaccent rules generator. diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py index c405e231b3..d94a2a3bf6 100644 --- a/contrib/unaccent/generate_unaccent_rules.py +++ b/contrib/unaccent/generate_unaccent_rules.py @@ -40,6 +40,7 @@ sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer) # language knowledge. 
PLAIN_LETTER_RANGES = ((ord('a'), ord('z')), # Latin lower case (ord('A'), ord('Z')), # Latin upper case + (ord('0'), ord('9')), # Digits (0x03b1, 0x03c9), # GREEK SMALL LETTER ALPHA, GREEK SMALL LETTER OMEGA (0x0391, 0x03a9)) # GREEK CAPITAL LETTER ALPHA, GREEK CAPITAL LETTER OMEGA @@ -139,17 +140,17 @@ def get_plain_letter(codepoint, table): return codepoint # Should not come here -assert(False) +assert False, 'Codepoint U+%0.2X' % codepoint.id def is_ligature(codepoint, table): """Return true for letters combined with letters.""" -return all(is_letter(table[i], table) for i in codepoint.combining_ids) +return all(i in table and is_letter(table[i], table) for i in codepoint.combining_ids) def get_plain_letters(codepoint, table): """Return a list of plain letters from a ligature.""" -assert(is_ligature(codepoint, table)) +assert is_ligature(codepoint, table), 'Codepoint U+%0.2X' % codepoint.id return [get_plain_letter(table[id], table) for id in codepoint.combining_ids] @@ -248,7 +249,7 @@ def main(args): # walk through all the codepoints looking for interesting mappings for codepoint in all: if codepoint.general_category.startswith('L') and \ - len(codepoint.combining_ids) > 1: + len(codepoint.combining_ids) > 0: if is_letter_with_marks(codepoint, table): charactersSet.add((codepoint.id, chr(get_plain_letter(codepoint, table).id))) @@ -257,6 +258,13 @@ def main(args): "".join(chr(combining_codepoint.id) for combining_codepoint in get_plain_letters(codepoint, table +elif codepoint.general_category.startswith('N') and \ + len(codepoint.combining_ids) > 0 and \ + args.noLigaturesExpansion is False and is_ligature(codepoint, table): +charactersSet.add((codepoint.id, + "".join(chr(combining_codepoint.id) + for combining_codepoint + in get_plain_letters(codepoint, table elif is_mark_to_remove(codepoint): charactersSet.add((codepoint.id, None)) diff --git a/contrib/unaccent/unaccent.rules b/contrib/unaccent/unaccent.rules index 3030166ed6..ff484d09d7 100644 --- 
a/contrib/unaccent/unaccent.rules +++ b/contrib/unaccent/unaccent.rules @@ -1,9 +1,15 @@ ¡ ! © (C) +ª a « << - ® (R) ± +/- +² 2 +³ 3 +µ μ +¹ 1 +º o » >> ¼ 1/4 ½ 1/2 @@ -402,6 +408,11 @@ ʦ ts ʪ ls ʫ lz +ʰ h +ʲ j +ʳ r +ʷ w +ʸ y ʹ ' ʺ " ʻ ' @@ -417,6 +428,9 @@ ˖ + ˗ - ˜ ~ +ˡ l +ˢ s +ˣ x ̀ ́ ̂ @@ -536,6 +550,17 @@ ό ο ύ υ ώ ω +ϐ β +ϑ θ +ϒ Υ +ϕ φ +ϖ π +ϰ κ +ϱ ρ +ϲ ς +ϴ Θ +ϵ ε +Ϲ Σ Ё Е ё е ᴀ A @@ -556,6 +581,50 @@ ᴠ V ᴡ W ᴢ Z +ᴬ A +ᴮ B +ᴰ D +ᴱ E +ᴳ G +ᴴ H +ᴵ I +ᴶ J +ᴷ K +ᴸ L +ᴹ M +ᴺ N +ᴼ O +ᴾ P +ᴿ R +ᵀ T +ᵁ U +ᵂ W +ᵃ a +ᵇ b +ᵈ d +ᵉ e +ᵍ g +ᵏ k +ᵐ m +ᵒ o +ᵖ p +ᵗ t +ᵘ u +ᵛ v +ᵝ β +ᵞ γ +ᵟ δ +ᵠ φ +ᵡ χ +ᵢ i +ᵣ r +ᵤ u +ᵥ v +ᵦ β +ᵧ γ +ᵨ ρ +ᵩ φ +ᵪ χ ᵫ ue ᵬ b ᵭ d @@ -592,6 +661,10 @@ ᶓ e ᶖ i ᶙ u +ᶜ c +ᶠ f +ᶻ z +ᶿ θ Ḁ A ḁ a Ḃ B @@ -947,12 +1020,19 @@ Ὦ
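As an aside, the `i in table` guard added to is_ligature() above is what prevents a KeyError when a combining codepoint is absent from the table; the idiom in isolation, using a toy table rather than the real UnicodeData (all names below are illustrative):

```python
def is_letter(codepoint, table):
    # Toy stand-in: treat single-character entries as plain letters.
    return len(codepoint["chars"]) == 1

def is_ligature(codepoint, table):
    """Return True for letters combined solely out of letters present
    in the table; an unknown combining id no longer raises KeyError."""
    return all(i in table and is_letter(table[i], table)
               for i in codepoint["combining_ids"])

# Toy table keyed by codepoint id.
table = {
    0x61: {"chars": "a", "combining_ids": []},
    0x65: {"chars": "e", "combining_ids": []},
}
ae = {"chars": "ae", "combining_ids": [0x61, 0x65]}      # both ids known
odd = {"chars": "x", "combining_ids": [0x61, 0x9999]}    # 0x9999 unknown
```

Without the membership check, `odd` would crash the generator instead of simply being rejected as a ligature.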
[PATCH] Add sortsupport for range types and btree_gist
Hi all! The motivation behind this is that incrementally building up a GiST index for certain input data can create a terrible tree structure. Furthermore, exclusion constraints are commonly implemented using GiST indices and for that use case, data is mostly orderable. Sorting the data before inserting it into the index results in a much better index structure, leading to significant performance improvements. Testing was done using the following setup, with about 50 million rows: CREATE EXTENSION btree_gist; CREATE TABLE t (id uuid, block_range int4range); CREATE INDEX ON t USING GIST (id, block_range); COPY t FROM '..' DELIMITER ',' CSV HEADER; using SELECT * FROM t WHERE id = '..' AND block_range && '..' as the test query, using an unpatched instance and one with the patch applied. Some stats for fetching 10,000 random rows using the query above, 100 iterations to get good averages. The benchmarking was done on an unpatched instance compiled using the exact same options as with the patch applied. [ Results are noted in an unpatched -> patched fashion. ] The first set of results is after the initial CREATE TABLE, CREATE INDEX and a COPY to the table, thereby incrementally building the index. Shared Hit Blocks (average): 110.97 -> 78.58 Shared Read Blocks (average): 58.90 -> 47.42 Execution Time (average): 1.10 -> 0.83 ms I/O Read Time (average): 0.19 -> 0.15 ms After a REINDEX on the table, the results improve even more: Shared Hit Blocks (average): 84.24 -> 8.54 Shared Read Blocks (average): 49.89 -> 0.74 Execution Time (average): 0.84 -> 0.065 ms I/O Read Time (average): 0.16 -> 0.004 ms Additionally, the time a REINDEX takes also improves significantly: 672407.584 ms (11:12.408) -> 130670.232 ms (02:10.670) Most of the sortsupport for btree_gist was implemented by re-using already existing infrastructure. For the few remaining types (bit, bool, cash, enum, interval, macaddr8 and time) I manually implemented them directly in btree_gist. 
It might make sense to move them into the backend for uniformity, but I wanted to get other opinions on that first. `make check-world` reports no regressions. Attached below, besides the patch, are also two scripts for benchmarking. `bench-gist.py` to benchmark the actual patch, example usage of this would be e.g. `./bench-gist.py -o results.csv public.table`. This expects a local instance with no authentication and default `postgres` user. The port can be set using the `--port` option. `plot.py` prints average values (as used above) and creates boxplots for each statistic from the result files produced with `bench-gist.py`. Depends on matplotlib and pandas. Additionally, if needed, the sample dataset used to benchmark this is available to independently verify the results [1]. Thanks, Christoph Heiss --- [1] https://drive.google.com/file/d/1SKRiUYd78_zl7CeD8pLDoggzCCh0wj39
From 22e1b60929c39309e06c2477d8d027e60ad49b9d Mon Sep 17 00:00:00 2001 From: Christoph Heiss Date: Thu, 9 Jun 2022 17:07:46 +0200 Subject: [PATCH 1/1] Add sortsupport for range types and btree_gist Incrementally building up a GiST index can result in a sub-optimal index structure for range types. Sorting the data before inserting it into the index results in a much better index. This can provide sizeable improvements in query execution time, I/O read time and shared block hits/reads. 
Signed-off-by: Christoph Heiss --- contrib/btree_gist/Makefile | 3 +- contrib/btree_gist/btree_bit.c | 19 contrib/btree_gist/btree_bool.c | 22 contrib/btree_gist/btree_cash.c | 22 contrib/btree_gist/btree_enum.c | 19 contrib/btree_gist/btree_gist--1.7--1.8.sql | 105 contrib/btree_gist/btree_gist.control | 2 +- contrib/btree_gist/btree_interval.c | 19 contrib/btree_gist/btree_macaddr8.c | 19 contrib/btree_gist/btree_time.c | 19 src/backend/utils/adt/rangetypes_gist.c | 70 + src/include/catalog/pg_amproc.dat | 3 + src/include/catalog/pg_proc.dat | 3 + 13 files changed, 323 insertions(+), 2 deletions(-) create mode 100644 contrib/btree_gist/btree_gist--1.7--1.8.sql diff --git a/contrib/btree_gist/Makefile b/contrib/btree_gist/Makefile index 48997c75f6..4096de73ea 100644 --- a/contrib/btree_gist/Makefile +++ b/contrib/btree_gist/Makefile @@ -33,7 +33,8 @@ EXTENSION = btree_gist DATA = btree_gist--1.0--1.1.sql \ btree_gist--1.1--1.2.sql btree_gist--1.2.sql btree_gist--1.2--1.3.sql \ btree_gist--1.3--1.4.sql btree_gist--1.4--1.5.sql \ - btree_gist--1.5--1.6.sql btree_gist--1.6--1.7.sql + btree_gist--1.5--1.6.sql btree_gist--1.6--1.7.sql \ + btree_gist--1.7--1.8.sql PGFILEDESC = "btree_gist - B-tree equivalent GiST operator classes" REGRESS = init int2 int4 int8 float4 float8 cash oid timestamp
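The averages quoted earlier (shared hits/reads, execution time, I/O read time) can be recomputed from bench-gist.py result files without pandas. A minimal sketch of the averaging step, assuming comma-separated files with a header row — the column names below are made up for the example; pass `open(path)` to read a real file:

```python
import csv
import statistics

def average_stats(csv_lines, columns):
    """Average the named numeric columns of a benchmark result CSV.

    csv_lines may be any iterable of lines (an open file works);
    the column names are assumptions, not bench-gist.py's real ones.
    """
    rows = list(csv.DictReader(csv_lines))
    return {c: statistics.fmean(float(r[c]) for r in rows) for c in columns}
```

This reproduces only the "prints average values" part of plot.py; the boxplots would still need matplotlib.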
Re: replacing role-level NOINHERIT with a grant-level option
On 13.06.22 20:00, Robert Haas wrote: I don't think this creates any more of a conflict than we've already got. In fact, I'd go so far as to say it resolves a problem that we currently have. As far as I can see, we are stuck with a situation where we have to support both the INHERIT behavior and the NOINHERIT behavior. Removing either one would be a pretty major compatibility break. And even if some people were willing to endorse that, it seems clear from previous discussions that there are people who like the NOINHERIT behavior and would object to its removal, and other people (like me!) who like the INHERIT behavior and would object to removing that. If you think it's feasible to get rid of either of these behaviors, I'd be interested in hearing your thoughts on that, but to me it looks like we are stuck with supporting both. From my point of view, the question is how to make the best of that situation. I think we want to keep both. Consider a user who in general prefers the NOINHERIT behavior but also wants to use predefined roles. Perhaps user 'peter' is to be granted both 'paul' and 'pg_execute_server_programs'. If role 'peter' is set to INHERIT, Peter will be sad, because his love for NOINHERIT probably means that he doesn't want to exercise Paul's privileges automatically. However, he needs to inherit the privileges of 'pg_execute_server_programs' or they are of no use to him. Peter presumably wants to use COPY TO/FROM program to put data into a table owned by 'peter', not a table owned by 'pg_execute_server_programs'. If so, being able to SET ROLE to 'pg_execute_server_programs' is of no use to him at all, but inheriting the privilege is useful. That's because our implementation of SET ROLE is bogus. We should have a SET ROLE that is separate from SET SESSION AUTHORIZATION, where the current user can keep their current user-ness and additionally enable (non-inherited) roles. 
I don't think I'm proposing to break anything or go in a totally opposite direction from anything, and to be honest I'm kind of confused as to why you think that what I'm proposing would have that effect. As far as I can see, the changes that I'm proposing are upward-compatible and would permit easy migration to new releases via either pg_dump or pg_upgrade with no behavior changes at all. Some syntax would be a bit different on the new releases and that would unlock some new options we don't currently have, but there's no behavior that you can get today which you wouldn't be able to get any more under this proposal. I'm mainly concerned that (AFAIU) you propose to remove the current INHERIT/NOINHERIT attribute of roles. I wouldn't like that. If you want a feature that allows overriding that per-grant, maybe that's okay.
Re: Add header support to text format and matching feature
On 14.06.22 11:13, Julien Rouhaud wrote: There is no need for the extra comment, and the error code should be ERRCODE_FEATURE_NOT_SUPPORTED. Is there any rule for what error code should be used? Maybe that's just me but I understand "not supported" as "this makes sense, but this is currently a limitation that might be lifted later". I tend to agree with that interpretation. Also, when you consider the way SQL rules and error codes are set up, errors that are detected during parse analysis should be a subclass of "syntax error or access rule violation".
Re: Handle infinite recursion in logical replication setup
PSA a test script that demonstrates all the documented steps for setting up n-way bidirectional replication. These steps are the same as those documented [1] on the new page "Bidirectional logical replication". This script works using the current latest v20* patch set. Each of the sections of 31.11.1 - 31.11.5 (see below) can be run independently (just edit the script and at the bottom uncomment the part you want to test): 31.11.1. Setting bidirectional replication between two nodes 31.11.2. Adding a new node when there is no data in any of the nodes 31.11.3. Adding a new node when data is present in the existing nodes 31.11.4. Adding a new node when data is present in the new node 31.11.5. Generic steps for adding a new node to an existing set of nodes ~~ Some sample output is also attached. -- [1] https://www.postgresql.org/message-id/attachment/134464/v20-0004-Document-bidirectional-logical-replication-steps.patch Kind Regards, Peter Smith. Fujitsu Australia Clean up pg_ctl: directory "data_N1" does not exist pg_ctl: directory "data_N2" does not exist pg_ctl: directory "data_N3" does not exist pg_ctl: directory "data_N4" does not exist rm: cannot remove ‘data_N1’: No such file or directory rm: cannot remove ‘data_N2’: No such file or directory rm: cannot remove ‘data_N3’: No such file or directory rm: cannot remove ‘data_N4’: No such file or directory Set up The files belonging to this database system will be owned by user "postgres". This user must also own the server process. The database cluster will be initialized with locale "en_AU.UTF-8". The default database encoding has accordingly been set to "UTF8". The default text search configuration will be set to "english". Data page checksums are disabled. creating directory data_N1 ... ok creating subdirectories ... ok selecting dynamic shared memory implementation ... posix selecting default max_connections ... 100 selecting default shared_buffers ... 128MB selecting default time zone ... 
Australia/Sydney creating configuration files ... ok running bootstrap script ... ok performing post-bootstrap initialization ... ok syncing data to disk ... ok initdb: warning: enabling "trust" authentication for local connections initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb. Success. You can now start the database server using: pg_ctl -D data_N1 -l logfile start
Re: better page-level checksums
On 13.06.22 20:20, Robert Haas wrote: If the user wants 16-bit checksums, the feature we've already got seems good enough -- and, as you say, it doesn't use any extra disk space. This proposal is just about making people happy if they want a bigger checksum. It's hard to get any definite information about what size of checksum is "good enough", since after all it depends on what kinds of errors you expect and what kinds of probabilities you want to accept. But the best I could gather so far is that a 16-bit CRC is good up to about a 16 kB block size. Which leads to the question of whether there is really a lot of interest in catering to larger block sizes. The recent thread about the performance impact of different block sizes might renew interest in this. But unless we really want to encourage playing with the block sizes (and if my claim above is correct), then a larger checksum size might not be needed. On the topic of which algorithm to use, I'd be inclined to think that it is going to be more useful to offer checksums that are 64 bits or more, since IMHO 32 is not all that much more than 16, and I still think there are going to be alignment issues. Beyond that I don't have anything against your specific suggestions, but I'd like to hear what other people think. Again, gathering some vague information ... The benefits of doubling the checksum size are exponential rather than linear, so there is no significant benefit of using a 64-bit checksum over a 32-bit one, for supported block sizes (current max is 32 kB).
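To make the "exponential rather than linear" point concrete: under the usual idealization that an n-bit checksum behaves like a random hash, the chance that random corruption goes undetected is about 2^-n, so each doubling of the width squares the failure probability rather than halving it (this deliberately ignores CRC-specific burst-error guarantees):

```python
def undetected_error_rate(bits):
    """Probability that random corruption collides with the stored
    checksum, modeling the checksum as a random n-bit hash."""
    return 2.0 ** -bits

# Widening from 16 to 32 bits improves detection by a factor of 65536,
# not by a factor of 2 -- hence "exponential rather than linear".
improvement = undetected_error_rate(16) / undetected_error_rate(32)
```

By the same reasoning, going from 32 to 64 bits buys another factor of about 4 billion, which is why the marginal value of a 64-bit checksum over a 32-bit one is small at these block sizes.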
Re: Typo in ro.po file?
On 14.06.22 05:34, Peter Smith wrote: FWIW, I stumbled on this obscure possible typo (?) in src/pl/plperl/po/ro.po: ~~~ #: plperl.c:788 msgid "while parsing Perl initialization" msgstr "în timpul parsing inițializării Perl" #: plperl.c:793 msgid "while running Perl initialization" msgstr "în timpul rulării intializării Perl" ~~~ (Notice the missing 'i' - "inițializării" versus "intializării") Fixed in translations repository. Thanks.
RE: tablesync copy ignores publication actions
On Tue, Jun 14, 2022 3:36 PM Peter Smith wrote: > > PSA v2 of the patch, based on all feedback received. > > ~~~ > > Main differences from v1: > > * Rewording and more explanatory text. > > * The examples were moved to the "Subscription" [1] page and also > extended to show some normal replication and row filter examples, from > [Amit]. > > * Added some text to CREATE PUBLICATION 'publish' param [2], from > [Euler][Amit]. > > * Added some text to CREATE SUBSCRIPTION Notes [3], from [Shi-san]. > > * Added some text to the "Publication page" [4] to say the 'publish' > is only for DML operations. > > * I changed the note in "Row Filter - Initial Data Synchronization" > [5] to be a warning because I felt users could be surprised to see > data exposed by the initial copy, which a DML operation would not > expose. > Thanks for updating the patch. Two comments: 1. + it means the copied table t3 contains all rows even when + they do not patch the row filter of publication pub3b. Typo. I think "they do not patch the row filter" should be "they do not match the row filter", right? 2. @@ -500,7 +704,6 @@ - It seems we should remove this change. Regards, Shi yu
Re: Remove trailing newlines from pg_upgrade's messages
On 14.06.22 20:57, Tom Lane wrote: I'll stick this in the CF queue, but I wonder if there is any case for squeezing it into v15 instead of waiting for v16. Let's stick this into 16 and use it as a starting point of tidying all this up in pg_upgrade.
Re: "buffer too small" or "path too long"?
On 14.06.22 03:55, Michael Paquier wrote: On Tue, Jun 14, 2022 at 09:52:52AM +0900, Kyotaro Horiguchi wrote: At Tue, 14 Jun 2022 09:48:26 +0900 (JST), Kyotaro Horiguchi wrote in Yeah, I feel so and it is what I wondered about recently when I saw some complete error messages. Is that because of the length of the subject? And I found that it is already done. Thanks! I have noticed this thread and 4e54d23 as a result this morning. If you want to spread this style more, wouldn't it be better to do that in all the places of pg_upgrade where we store paths to files? I can see six code paths with log_opts.basedir that could do the same, as of the attached. The hardcoded file names have various lengths, and some of them are quite long, making the generated paths more exposed to being cut in the middle. We have this problem of long file names being silently truncated all over the source code. Instead of equipping each one of them with a length check, why don't we get rid of the fixed-size buffers and allocate dynamically, as in the attached patch.
From 1a21434c584319b28fd483444aa24b7b54b4f949 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 15 Jun 2022 07:25:07 +0200 Subject: [PATCH] pg_upgrade: Use psprintf in make_outputdirs --- src/bin/pg_upgrade/pg_upgrade.c | 42 - 1 file changed, 10 insertions(+), 32 deletions(-) diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c index 265d829490..f7e3461cfe 100644 --- a/src/bin/pg_upgrade/pg_upgrade.c +++ b/src/bin/pg_upgrade/pg_upgrade.c @@ -216,19 +216,13 @@ main(int argc, char **argv) static void make_outputdirs(char *pgdata) { - FILE *fp; - char **filename; time_t run_time = time(NULL); - char filename_path[MAXPGPATH]; + char *filename_path; char timebuf[128]; struct timeval time; time_t tt; - int len; - log_opts.rootdir = (char *) pg_malloc0(MAXPGPATH); - len = snprintf(log_opts.rootdir, MAXPGPATH, "%s/%s", pgdata, BASE_OUTPUTDIR); - if (len >= MAXPGPATH) - pg_fatal("directory path for new cluster 
is too long\n"); + log_opts.rootdir = psprintf("%s/%s", pgdata, BASE_OUTPUTDIR); /* BASE_OUTPUTDIR/$timestamp/ */ gettimeofday(&time, NULL); @@ -237,25 +231,13 @@ make_outputdirs(char *pgdata) /* append milliseconds */ snprintf(timebuf + strlen(timebuf), sizeof(timebuf) - strlen(timebuf), ".%03d", (int) (time.tv_usec / 1000)); - log_opts.basedir = (char *) pg_malloc0(MAXPGPATH); - len = snprintf(log_opts.basedir, MAXPGPATH, "%s/%s", log_opts.rootdir, - timebuf); - if (len >= MAXPGPATH) - pg_fatal("directory path for new cluster is too long\n"); + log_opts.basedir = psprintf("%s/%s", log_opts.rootdir, timebuf); /* BASE_OUTPUTDIR/$timestamp/dump/ */ - log_opts.dumpdir = (char *) pg_malloc0(MAXPGPATH); - len = snprintf(log_opts.dumpdir, MAXPGPATH, "%s/%s/%s", log_opts.rootdir, - timebuf, DUMP_OUTPUTDIR); - if (len >= MAXPGPATH) - pg_fatal("directory path for new cluster is too long\n"); + log_opts.dumpdir = psprintf("%s/%s/%s", log_opts.rootdir, timebuf, DUMP_OUTPUTDIR); /* BASE_OUTPUTDIR/$timestamp/log/ */ - log_opts.logdir = (char *) pg_malloc0(MAXPGPATH); - len = snprintf(log_opts.logdir, MAXPGPATH, "%s/%s/%s", log_opts.rootdir, - timebuf, LOG_OUTPUTDIR); - if (len >= MAXPGPATH) - pg_fatal("directory path for new cluster is too long\n"); + log_opts.logdir = psprintf("%s/%s/%s", log_opts.rootdir, timebuf, LOG_OUTPUTDIR); /* * Ignore the error case where the root path exists, as it is kept the @@ -270,21 +252,17 @@ make_outputdirs(char *pgdata) if (mkdir(log_opts.logdir, pg_dir_create_mode) < 0) pg_fatal("could not create directory \"%s\": %m\n", log_opts.logdir); - len = snprintf(filename_path, sizeof(filename_path), "%s/%s", - log_opts.logdir, INTERNAL_LOG_FILE); - if (len >= sizeof(filename_path)) - pg_fatal("directory path for new cluster is too long\n"); + filename_path = psprintf("%s/%s", log_opts.logdir, INTERNAL_LOG_FILE); if ((log_opts.internal = fopen_priv(filename_path, "a")) == NULL) pg_fatal("could not open log file \"%s\": %m\n", filename_path); /* label 
start of upgrade in logfiles */ - for (filename = output_files; *filename != NULL; filename++) + for (char **filename = output_files; *filename != NULL; filename++) { - len = snprintf(filename_path, sizeof(filename_path), "%s/%s", -
RE: Support logical replication of DDLs
On Wednesday, June 15, 2022 8:14 AM Zheng Li wrote: > > > Thanks for providing this idea. > > > > I looked at the string that is used for replication: > > > > """ > > {ALTERTABLESTMT :relation {RANGEVAR :schemaname public :relname foo > > :inh true :relpersistence p :alias <> :location 12} :cmds ({ALTERTABLECMD > > :subtype 0 :name <> :num 0 :newowner <> :def {COLUMNDEF :colname b > > :typeName {TYPENAME :names ("public" "timestamptz") :typeOid 0 :setof > > false :pct_type false :typmods <> :typemod -1 :arrayBounds <> :location > > 29} :compression <> :inhcount 0 :is_local true :is_not_null false > > :is_from_type false :storage <> :raw_default <> :cooked_default <> > > :identity <> :identitySequence <> :generated <> :collClause <> :collOid 0 > > :constraints <> :fdwoptions <> :location 27} :behavior 0 :missing_ok > > false}) :objtype 41 :missing_ok false} > > """ > > > > I think the converted parsetree string includes lots of internal > > objects(e.g typeOid/pct_type/objtype/collOid/location/...). These are > > unnecessary stuff for replication and we cannot make sure all the internal > > stuff are consistent among pub/sub. So I am not sure whether replicating > > this string is better. > > > > Besides, replicating the string from nodetostring() means we would need to > > deal with the structure difference between the publisher and the > > subscriber if any related structure has been changed which seems not good. > > Yeah, this existing format is not designed to be portable between different > major versions. So it can't directly be used for replication without > serious modification. 
> > IMO, the advantages of the deparsing approach (as implemented in the POC patch set [1]) are:
> >
> > 1) We can generate a command representation that can be parsed/processed/transformed arbitrarily by the subscriber using generic rules (for example, the user can easily replace the schema name in it), while the result of nodeToString() is not a standard JSON string, so I am not sure the user can reuse it without traversing the parse tree again.
> >
> > 2) With event_trigger + deparser, we can filter the unpublished objects more easily. For example, "DROP TABLE table_pub, table_unpub;" can be deparsed into two commands, "DROP TABLE table_pub" and "DROP TABLE table_unpub", and only the first one published.
> >
> > 3) With the deparser, we are able to query the catalog in the deparser to build a complete command (filled with schemaname...), so the user doesn't need to do any other work for it. We don't need to force the subscriber to set the same search_path as the publisher, which gives the user more flexibility.
> >
> > 4) For CREATE TABLE AS, we can separate out the CREATE TABLE part with the help of the deparser and event trigger. This avoids executing the subquery on the subscriber.
> >
> > 5) For the ALTER TABLE command, we might want to filter out DDL that uses volatile functions, as discussed in [2]. We can achieve this more easily by extending the deparser to check the functions used. We can even rebuild a command without the unsupported functions to replicate, by using the deparser.
> >
> > There may be more cases I am missing, as we are still analyzing other DDLs.
>
> How does the deparser deparse a CREATE FUNCTION statement? Will it schema qualify objects inside the function definition?

The current deparser doesn't schema qualify objects inside the function source, as we won't know the schema of inner objects until the function is executed.
The deparser will only schema qualify the objects around the function declaration, like:

CREATE FUNCTION [public].test_func(i [pg_catalog].int4) RETURNS [pg_catalog].int4 LANGUAGE plpgsql

Best regards,
Hou zj
Re: Handle infinite recursion in logical replication setup
Here are some review comments for patch v20-0004. == 1. General I thought that it is better to refer to the subscription/publications/table "on" the node, rather than "in" the node. Most of the review comments below are related to this point. == 2. Commit message a) Creating a two-node bidirectional replication when there is no data in both nodes. b) Adding a new node when there is no data in any of the nodes. c) Adding a new node when data is present in the existing nodes. "in both nodes" -> "on both nodes" "in any of the nodes" -> "on any of the nodes" "in the existing nodes" -> "on the existing nodes" == 3. doc/src/sgml/logical-replication.sgml - Setting bidirectional replication between two nodes 3a. + +The following steps demonstrate how to create a two-node bidirectional +replication when there is no table data present in both nodes +node1 and node2: + -> "on both nodes" 3b. +Create a publication in node1: -> "on" 3c. +Create a publication in node2: -> "on" 3d. + +Lock the table t1 in node1 and +node2 in EXCLUSIVE mode until the +setup is completed. + -> "on node1" 3e. +Create a subscription in node2 to subscribe to -> "on" 3f. +Create a subscription in node1 to subscribe to +node2: -> "on" ~~~ 4. doc/src/sgml/logical-replication.sgml - Adding a new node when there is no data in any of the nodes 4a. + Adding a new node when there is no data in any of the nodes SUGGESTION Adding a new node when there is no table data on any of the nodes 4b. + +The following steps demonstrate adding a new node node3 +to the existing node1 and node2 when +there is no t1 data in any of the nodes. This requires +creating subscriptions in node1 and +node2 to replicate the data from +node3 and creating subscriptions in +node3 to replicate data from node1 +and node2. Note: These steps assume that the +bidirectional logical replication between node1 and +node2 is already completed. 
+ "data in any of the nodes" -> "data on any of the nodes" "creating subscriptions in node1" -> "creating subscriptions on node1" "creating subscriptions in node3" -> "creating subscriptions on node3" 4c. +Create a publication in node3: -> "on" 4d. +Lock table t1 in all the nodes -> "on" 4e. +Create a subscription in node1 to subscribe to +node3: -> "on" 4f. +Create a subscription in node2 to subscribe to +node3: -> "on" 4g. +Create a subscription in node3 to subscribe to +node1: -> "on" 4h. +Create a subscription in node3 to subscribe to +node2: 4i. +node3. Incremental changes made in any node will be +replicated to the other two nodes. "in any node" -> "on any node" ~~~ 5. doc/src/sgml/logical-replication.sgml - Adding a new node when data is present in the existing nodes 5a. + Adding a new node when data is present in the existing nodes SUGGESTION Adding a new node when table data is present on the existing nodes 5b. + during initial data synchronization. Note: These steps assume that the + bidirectional logical replication between node1 and + node2 is already completed, and the pre-existing data + in table t1 is already synchronized in both those + nodes. + "in both those nodes" -> "on both those nodes" 5c. +Create a publication in node3 -> "on" 5d. +Lock table t1 in node2 and -> "on" 5e. +Create a subscription in node1 to subscribe to +node3: -> "on" 5f. +Create a subscription in node2 to subscribe to +node3: -> "on" 5g. +Create a subscription in node3 to subscribe to +node1. Use copy_data = force so that +the existing table data is copied during initial sync: -> "on" 5h. +Create a subscription in node3 to subscribe to +node2. Use copy_data = off -> "on" 5i. +node3. Incremental changes made in any node will be +replicated to the other two nodes. "in any node" -> "on any node" ~~~ 6. 
doc/src/sgml/logical-replication.sgml - Adding a new node when data is present in the new node + Adding a new node when data is present in the new node SUGGESTION Adding a new node when table data is present on the new node ~~~ 7. doc/src/sgml/logical-replication.sgml - Generic steps for adding a new node to an existing set of nodes 7a. + +Step-2: Lock the required tables of the new node in EXCLUSIVE mode until +the setup is complete. (This lock is necessary to prevent any modifications +from happening in the new node because if data modifications occurred after +Step-3, there is a chance that the modifications will be published to the +first node and then synchronized back to the new node while creating the +subscription in Step-5. This would result in inconsistent data). + "happening in the new node" -> "happening on the new node" 7b. +not be synchronized to the new node. This would result in inconsistent +data. There is no need to lock the
Re: Handle infinite recursion in logical replication setup
Here are some review comments for patch v20-0003. == 1. Commit message In case, table t1 has a unique key, it will lead to a unique key violation and replication won't proceed. SUGGESTION If table t1 has a unique key, this will lead to a unique key violation, and replication won't proceed. ~~~ 2. Commit message This problem can be solved by using... SUGGESTION This problem can be avoided by using... ~~~ 3. Commit message step 3: Create a subscription in node1 to subscribe to node2. Use 'copy_data = on' so that the existing table data is copied during initial sync: node1=# CREATE SUBSCRIPTION sub_node1_node2 CONNECTION '' node1-# PUBLICATION pub_node2 WITH (copy_data = off, origin = local); CREATE SUBSCRIPTION This is wrong information. The table on node2 has no data, so talking about 'copy_data = on' is inappropriate here. == 4. Commit message IMO it might be better to refer to subscription/publication/table "on" nodeXXX, instead of saying "in" nodeXXX. 4a. "the publication tables were also subscribing data in the publisher from other publishers." -> "the publication tables were also subscribing from other publishers. 4b. "After the subscription is created in node2" -> "After the subscription is created on node2" 4c. "step 3: Create a subscription in node1 to subscribe to node2." -> "step 3: Create a subscription on node1 to subscribe to node2." 4d. "step 4: Create a subscription in node2 to subscribe to node1." -> "step 4: Create a subscription on node2 to subscribe to node1." == 5. doc/src/sgml/ref/create_subscription.sgml @@ -383,6 +397,15 @@ CREATE SUBSCRIPTION subscription_name + + If the subscription is created with origin = local and + copy_data = on, it will check if the publisher tables are + being subscribed to any other publisher and, if so, then throw an error to + prevent possible non-local data from being copied. The user can override + this check and continue with the copy operation by specifying + copy_data = force. 
+ Perhaps it is better here to say 'copy_data = true' instead of 'copy_data = on', simply because the value 'true' was mentioned earlier on this page (but this would be the first mention of 'on'). == 6. src/backend/commands/subscriptioncmds.c - check_pub_table_subscribed + errhint("Use CREATE/ALTER SUBSCRIPTION with copy_data = off or force.")); Saying "off or force" is not consistent with the other message wording in this patch, which used "/" for multiple enums. (e.g. "connect = false", "copy_data = true/force"). So perhaps this errhint should be worded similarly: "Use CREATE/ALTER SUBSCRIPTION with copy_data = off/force." -- Kind Regards, Peter Smith. Fujitsu Australia
Re: Handle infinite recursion in logical replication setup
Here are some review comments for patch v20-0002 == 1. General comment Something subtle but significant changed since I last reviewed v18*. Now the describe.c is changed so that the catalog will never display a NULL origin column; it would always be "any". But now I am not sure if it is a good idea to still allow the NULL in this catalog column while at the same time you are pretending it is not there. I felt it might be less confusing, and would simplify the code (e.g. remove quite a few null checks) to have just used a single concept of the default - e.g. Just assign the default as "any" everywhere. The column would be defined as NOT NULL. Most of the following review comments are related to this point. == 2. doc/src/sgml/catalogs.sgml + + Possible origin values are local, + any, or NULL if none is specified. + If local, the subscription will request the + publisher to only send changes that originated locally. If + any (or NULL), the publisher sends + any changes regardless of their origin. + Is NULL still possible? Perhaps it would be better if it was not and the default "any" was always written instead. == 3. src/backend/catalog/pg_subscription.c + if (!isnull) + sub->origin = TextDatumGetCString(datum); + else + sub->origin = NULL; + Maybe better to either disallow NULL in the first place or assign the "any" here instead of NULL. == 4. src/backend/commands/subscriptioncmds.c - parse_subscription_options @@ -137,6 +139,8 @@ parse_subscription_options(ParseState *pstate, List *stmt_options, opts->twophase = false; if (IsSet(supported_opts, SUBOPT_DISABLE_ON_ERR)) opts->disableonerr = false; + if (IsSet(supported_opts, SUBOPT_ORIGIN)) + opts->origin = NULL; If opt->origin was assigned to "any" then other code would be simplified. ~~~ 5. 
src/backend/commands/subscriptioncmds.c - CreateSubscription @@ -607,6 +626,11 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt, LOGICALREP_TWOPHASE_STATE_PENDING : LOGICALREP_TWOPHASE_STATE_DISABLED); values[Anum_pg_subscription_subdisableonerr - 1] = BoolGetDatum(opts.disableonerr); + if (opts.origin) + values[Anum_pg_subscription_suborigin - 1] = + CStringGetTextDatum(opts.origin); + else + values[Anum_pg_subscription_suborigin - 1] = CStringGetTextDatum(LOGICALREP_ORIGIN_ANY); If NULL was not possible then this would just be one line: values[Anum_pg_subscription_suborigin - 1] = CStringGetTextDatum(opts.origin); == 6. src/backend/replication/logical/worker.c @@ -276,6 +276,10 @@ static TransactionId stream_xid = InvalidTransactionId; static XLogRecPtr skip_xact_finish_lsn = InvalidXLogRecPtr; #define is_skipping_changes() (unlikely(!XLogRecPtrIsInvalid(skip_xact_finish_lsn))) +/* Macro for comparing string fields that might be NULL */ +#define equalstr(a, b) \ + (((a) != NULL && (b) != NULL) ? (strcmp((a), (b)) == 0) : (a) == (b)) + If the NULL was not allowed in the first place then I think this macro would just become redundant. 7. src/backend/replication/logical/worker.c - ApplyWorkerMain @@ -3741,6 +3746,11 @@ ApplyWorkerMain(Datum main_arg) options.proto.logical.streaming = MySubscription->stream; options.proto.logical.twophase = false; + if (MySubscription->origin) + options.proto.logical.origin = pstrdup(MySubscription->origin); + else + options.proto.logical.origin = NULL; + Can't the if/else be avoided if you always assigned the "any" default in the first place? == 8. 
src/backend/replication/pgoutput/pgoutput.c - parse_output_parameters @@ -287,11 +289,13 @@ parse_output_parameters(List *options, PGOutputData *data) bool messages_option_given = false; bool streaming_given = false; bool two_phase_option_given = false; + bool origin_option_given = false; data->binary = false; data->streaming = false; data->messages = false; data->two_phase = false; + data->origin = NULL; Consider assigning default "any" here instead of NULL. == 9. src/bin/pg_dump/pg_dump.c - getSubscriptions + /* FIXME: 15 should be changed to 16 later for PG16. */ + if (fout->remoteVersion >= 15) + appendPQExpBufferStr(query, " s.suborigin\n"); + else + appendPQExpBufferStr(query, " NULL AS suborigin\n"); + Maybe say: 'any' AS suborigin? ~~~ 10. src/bin/pg_dump/pg_dump.c - getSubscriptions @@ -4517,6 +4525,11 @@ getSubscriptions(Archive *fout) subinfo[i].subdisableonerr = pg_strdup(PQgetvalue(res, i, i_subdisableonerr)); + if (PQgetisnull(res, i, i_suborigin)) + subinfo[i].suborigin = NULL; + else + subinfo[i].suborigin = pg_strdup(PQgetvalue(res, i, i_suborigin)); + If you disallow the NULL in the first place this condition maybe is no longer needed. ~~~ 11. src/bin/pg_dump/pg_dump.c - dumpSubscription @@ -4589,6 +4602,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo) if (strcmp(subinfo->subdisableonerr, "t") == 0) appendPQExpBufferStr(query, ", disable_on_error = true"); + if
Reply: Remove useless param for create_groupingsets_path
Hi, Richard

You are right, the patch was incorrect, and I have generated the patch once more. It is sent as an attachment named new.patch; please check. Thanks!

Best regards!
Zxuejing

From: Richard Guo
Date: 2022-06-15 12:12
To: XueJing Zhao <zxuej...@vmware.com>
Cc: pgsql-hackers@lists.postgresql.org
Subject: Re: Remove useless param for create_groupingsets_path

On Wed, Jun 15, 2022 at 11:33 AM XueJing Zhao <zxuej...@vmware.com> wrote:

> Recently I work on grouping sets and I find the last param numGroups of create_groupingsets_path is not used. In create_groupingsets_path we use rollup->numGroups to do cost_agg.

Yes indeed. The param 'numGroups' was used originally when we first introduced create_groupingsets_path(), and then all its references inside that function were removed and replaced with the numGroups inside RollupData in b5635948.

> I generate a diff.patch, which is sent as an attachment.

BTW, the patch looks weird to me in that it seems to operate in the inverse direction, i.e. it's adding the param 'numGroups', not removing it.

Thanks
Richard

Attachment: new.diff