Re: WAL Insertion Lock Improvements
On Mon, May 08, 2023 at 08:18:04PM +0530, Bharath Rupireddy wrote:
> On Mon, May 8, 2023 at 5:57 PM Bharath Rupireddy wrote:
>> On Mon, Apr 10, 2023 at 9:38 AM Michael Paquier wrote:
>>> The sensitive change is in LWLockUpdateVar().  I am not completely
>>> sure to understand this removal, though.  Does that influence the case
>>> where there are waiters?
>>
>> I'll send about this in a follow-up email to not overload this
>> response with too much data.
>
> It helps the case when there are no waiters. IOW, it updates the value
> without the waitlist lock when there are no waiters, so no extra waitlist
> lock acquisition/release just to update the value. In turn, it helps
> the other backend wanting to flush the WAL looking for the new updated
> value of insertingAt in WaitXLogInsertionsToFinish(); now the flushing
> backend can get the new value faster.

Sure, which is what the memory barrier given by exchange_u64 guarantees. My thought on this one is that I am not completely sure that we won't miss any waiters that should have been awoken.

> The fastpath exit in LWLockUpdateVar() doesn't seem to influence the
> results much, see below results. However, it avoids waitlist lock
> acquisition when there are no waiters.
>
> test-case 1: -T5, WAL ~16 bytes
>
>   clients    HEAD     PATCHED with fastpath    PATCHED no fast path
>      64     50135        45528                    46653
>     128     94903        89791                    89103
>     256     82289       152915                   152835
>     512     62498       138838                   142084
>     768     57083       125074                   126768
>    1024     51308       113593                   115930
>    2048     41084        88764                    85110
>    4096     19939        42257                    43917

Considering that there could be a few percent of noise mixed into that, that's not really surprising, as the workload is highly concurrent on inserts, so the fast path won't really shine :)

Should we split this patch into two parts, as they aim at tackling two different cases then? One for LWLockConflictsWithVar() and LWLockReleaseClearVar(), which are the straightforward pieces (using one pg_atomic_write_u64() in LWLockUpdateVar instead), then a second for LWLockUpdateVar()?
Also, the fast path treatment in LWLockUpdateVar() may show some better benefits when there are really few backends and a bunch of very small records? Still, even that sounds a bit limited..

> Out of the 3 functions that got rid of the waitlist lock
> (LWLockConflictsWithVar/LWLockWaitForVar, LWLockUpdateVar,
> LWLockReleaseClearVar), perf reports tell that the biggest gain (for
> the use-cases that I've tried) is for
> LWLockConflictsWithVar/LWLockWaitForVar:
>
> test-case 6: -T5, WAL 4096 bytes
>
> HEAD:
> +   29.66%  0.07%  postgres  [.] LWLockWaitForVar
> +   20.94%  0.08%  postgres  [.] LWLockConflictsWithVar
>      0.19%  0.03%  postgres  [.] LWLockUpdateVar
>
> PATCHED:
> +    3.95%  0.08%  postgres  [.] LWLockWaitForVar
>      0.19%  0.03%  postgres  [.] LWLockConflictsWithVar
> +    1.73%  0.00%  postgres  [.] LWLockReleaseClearVar

Indeed.
--
Michael
RE: Subscription statistics are not dropped at DROP SUBSCRIPTION in some cases
Dear Sawada-san,

Thank you for the patch! I confirmed that the problem you raised could occur on HEAD, and that the test you added reproduces it. When the stats entry has been removed but pg_stat_get_subscription_stats() is called, the returned values are set to 0x0. Additionally, I have checked the other pgstat_drop_* functions, and I could not find any similar problems.

A comment:

```
+	/*
+	 * Tell the cumulative stats system that the subscription is getting
+	 * dropped.
+	 */
+	pgstat_drop_subscription(subid);
```

Isn't it better to write down something you said as a comment? Or is it quite trivial?

> There is a chance the
> transaction dropping the subscription fails due to network error etc
> but we don't need to worry about it as reporting the subscription drop
> is transactional.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: WAL Insertion Lock Improvements
On Tue, May 09, 2023 at 09:34:56AM +0530, Bharath Rupireddy wrote:
> Below is the configuration I've been using. I have been keeping the
> checkpoints away so far to get expected numbers. Probably, something
> that I should modify for this long run? Change checkpoint_timeout to
> 15 min or so?
>
> max_wal_size=64GB
> checkpoint_timeout=1d
> shared_buffers=8GB
> max_connections=5000

Noted. Something like that should be OK IMO, with all the checkpoints triggered based on the volume of WAL generated. With records that have a fixed size, this should, I assume, lead to results that can be compared across runs, even if the patched code would lead to more checkpoints being generated.
--
Michael
Re: WAL Insertion Lock Improvements
On Tue, May 9, 2023 at 9:27 AM Michael Paquier wrote:
>
> On Tue, May 09, 2023 at 09:24:14AM +0530, Bharath Rupireddy wrote:
> > I'll pick a test case that generates a reasonable amount of WAL, ~256
> > bytes. What do you think of the following?
> >
> > test-case 2: -T900, WAL ~256 bytes (for c in 1 2 4 8 16 32 64 128 256
> > 512 768 1024 2048 4096 - takes 3.5hrs)
> > test-case 2: -t100, WAL ~256 bytes
> >
> > If okay, I'll fire the tests.
>
> Sounds like a sensible duration, yes. What's your setting for
> min/max_wal_size? I assume that there are still 16GB throttled with
> target_completion at 0.9?

Below is the configuration I've been using. I have been keeping the checkpoints away so far to get expected numbers. Probably something that I should modify for this long run? Change checkpoint_timeout to 15 min or so?

max_wal_size=64GB
checkpoint_timeout=1d
shared_buffers=8GB
max_connections=5000

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: DROP DATABASE is interruptible
Hi,

On 2023-05-09 15:41:36 +1200, Thomas Munro wrote:
> I tried out the patch you posted over at [1].

Thanks!

> $ psql db2
> psql: error: connection to server on socket "/tmp/.s.PGSQL.5432"
> failed: FATAL:  database "db2" is invalid
> DETAIL:  Use DROP DATABASE to drop invalid databases
>
> I suppose it should be a HINT?

Yup.

> +# FIXME: It'd be good to test the actual interruption path. But it's not
> +# immediately obvious how.
>
> I wonder if there is some way to incorporate something based on
> SIGSTOP signals into the test, but I don't know how to do it on
> Windows and maybe that's a bit weird anyway. For a non-OS-specific
> way to do it, I was wondering about having a test module function that
> has a wait loop that accepts ^C but deliberately ignores
> ProcSignalBarrier, and leaving that running in a background psql for a
> similar effect?

Seems a bit too complicated. We really need to work on a framework for this kind of thing.

> Not sure why the test is under src/test/recovery.

Where else? We don't really have a place to put backend-specific tests that aren't about logical replication or recovery right now... It is partially about dealing with crashes etc. in the middle of DROP DATABASE, so it doesn't seem unreasonable to me anyway.

> While a database exists in this state, we get periodic autovacuum
> noise, which I guess we should actually skip?

Yes, good catch. We should also either reset datfrozenxid et al. when invalidating, or ignore it when computing horizons.

> I suppose someone might eventually wonder if autovacuum could complete the
> drop, but it seems a bit of a sudden weird leap in duties and might be
> confusing (perhaps it'd make more sense if 'invalid because creating' and
> 'invalid because dropping' were distinguished).

I'm a bit hesitant to do so for now. Once it's a bit more settled, maybe? Although I wonder if there's something better suited to the task than autovacuum.

Greetings,

Andres Freund
Re: WAL Insertion Lock Improvements
On Tue, May 09, 2023 at 09:24:14AM +0530, Bharath Rupireddy wrote:
> I'll pick a test case that generates a reasonable amount of WAL, ~256
> bytes. What do you think of the following?
>
> test-case 2: -T900, WAL ~256 bytes (for c in 1 2 4 8 16 32 64 128 256
> 512 768 1024 2048 4096 - takes 3.5hrs)
> test-case 2: -t100, WAL ~256 bytes
>
> If okay, I'll fire the tests.

Sounds like a sensible duration, yes. What's your setting for min/max_wal_size? I assume that there are still 16GB throttled with target_completion at 0.9?
--
Michael
Re: WAL Insertion Lock Improvements
On Tue, May 9, 2023 at 9:02 AM Michael Paquier wrote:
>
> On Mon, May 08, 2023 at 04:04:10PM -0700, Nathan Bossart wrote:
> > On Mon, May 08, 2023 at 05:57:09PM +0530, Bharath Rupireddy wrote:
> >> test-case 1: -T5, WAL ~16 bytes
> >> test-case 1: -t1000, WAL ~16 bytes
> >
> > I wonder if it's worth doing a couple of long-running tests, too.
>
> Yes, 5s or 1000 transactions per client is too small, though it shows
> that things are going in the right direction.

I'll pick a test case that generates a reasonable amount of WAL, ~256 bytes. What do you think of the following?

test-case 2: -T900, WAL ~256 bytes (for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096 - takes 3.5hrs)
test-case 2: -t100, WAL ~256 bytes

If okay, I'll fire the tests.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: DROP DATABASE is interruptible
On Tue, May 9, 2023 at 3:41 PM Thomas Munro wrote:
> I tried out the patch you posted over at [1].

I forgot to add: +1, I think this is a good approach.

(I'm still a little embarrassed at how long we spent trying to debug this in the other thread from the supplied clues, when you'd already pointed this failure mechanism out, including the exact error message, a couple of months ago. One thing I've noticed is that new threads posted in the middle of commitfests are hard to see :-D We were getting pretty close, though.)
Re: DROP DATABASE is interruptible
I tried out the patch you posted over at [1].

For those wanting an easy way to test it, or to test the buggy behaviour in master without this patch, you can simply kill -STOP the checkpointer, so that DROP DATABASE hangs in RequestCheckpoint() (or you could SIGSTOP any other backend so it hangs in the barrier thing instead), and then you can just press ^C like this:

postgres=# create database db2;
CREATE DATABASE
postgres=# drop database db2;
^CCancel request sent
ERROR:  canceling statement due to user request

After that you get:

$ psql db2
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL:  database "db2" is invalid
DETAIL:  Use DROP DATABASE to drop invalid databases

I suppose it should be a HINT?

+# FIXME: It'd be good to test the actual interruption path. But it's not
+# immediately obvious how.

I wonder if there is some way to incorporate something based on SIGSTOP signals into the test, but I don't know how to do it on Windows, and maybe that's a bit weird anyway. For a non-OS-specific way to do it, I was wondering about having a test module function that has a wait loop that accepts ^C but deliberately ignores ProcSignalBarrier, and leaving that running in a background psql for a similar effect?

Not sure why the test is under src/test/recovery.

While a database exists in this state, we get periodic autovacuum noise, which I guess we should actually skip?

I suppose someone might eventually wonder if autovacuum could complete the drop, but it seems a bit of a sudden weird leap in duties and might be confusing (perhaps it'd make more sense if 'invalid because creating' and 'invalid because dropping' were distinguished).
2023-05-09 15:24:10.860 NZST [523191] FATAL:  database "db2" is invalid
2023-05-09 15:24:10.860 NZST [523191] DETAIL:  Use DROP DATABASE to drop invalid databases
2023-05-09 15:25:10.883 NZST [523279] FATAL:  database "db2" is invalid
2023-05-09 15:25:10.883 NZST [523279] DETAIL:  Use DROP DATABASE to drop invalid databases
2023-05-09 15:26:10.899 NZST [523361] FATAL:  database "db2" is invalid
2023-05-09 15:26:10.899 NZST [523361] DETAIL:  Use DROP DATABASE to drop invalid databases
2023-05-09 15:27:10.919 NZST [523408] FATAL:  database "db2" is invalid
2023-05-09 15:27:10.919 NZST [523408] DETAIL:  Use DROP DATABASE to drop invalid databases
2023-05-09 15:28:10.938 NZST [523456] FATAL:  database "db2" is invalid
2023-05-09 15:28:10.938 NZST [523456] DETAIL:  Use DROP DATABASE to drop invalid databases

[1] https://www.postgresql.org/message-id/20230509013255.fjrlpitnj3ltur76%40awork3.anarazel.de
Re: WAL Insertion Lock Improvements
On Mon, May 08, 2023 at 04:04:10PM -0700, Nathan Bossart wrote:
> On Mon, May 08, 2023 at 05:57:09PM +0530, Bharath Rupireddy wrote:
>> test-case 1: -T5, WAL ~16 bytes
>> test-case 1: -t1000, WAL ~16 bytes
>
> I wonder if it's worth doing a couple of long-running tests, too.

Yes, 5s or 1000 transactions per client is too small, though it shows that things are going in the right direction.

(Will reply to the rest in a bit..)
--
Michael
Re: Tab completion for CREATE SCHEMA AUTHORIZATION
On Mon, May 08, 2023 at 05:36:27PM +0100, Dagfinn Ilmari Mannsåker wrote:
> Here's an updated v3 patch with that. While adding that, I noticed that
> CREATE UNLOGGED only tab-completes TABLE and MATERIALIZED VIEW, not
> SEQUENCE, so I added that (and removed MATERIALIZED VIEW when part of
> CREATE SCHEMA).

+	/* but not MATVIEW in CREATE SCHEMA */
+	if (HeadMatches("CREATE", "SCHEMA"))
+		COMPLETE_WITH("TABLE", "SEQUENCE");
+	else
+		COMPLETE_WITH("TABLE", "SEQUENCE", "MATERIALIZED VIEW");

This may look strange at first glance, but the grammar is what it is.. Perhaps matviews could be part of that at some point. Or not. :)

+	/* only some object types can be created as part of CREATE SCHEMA */
+	if (HeadMatches("CREATE", "SCHEMA"))
+		COMPLETE_WITH("TABLE", "VIEW", "INDEX", "SEQUENCE", "TRIGGER",
+	/* for INDEX and TABLE/SEQUENCE, respectively */
+					  "UNIQUE", "UNLOGGED");

Not including TEMPORARY is OK here, as the grammar does not allow one to directly create a temporary schema. The (many) code paths that have TailMatches() to cope with CREATE SCHEMA would continue the completion once added, but at least this approach avoids the recommendation where possible.

That looks pretty much OK to me. One tiny comment I have is that this lacks brackets for the inner blocks, so I have added some in the v4 attached.
--
Michael

From d163ead2e81048af130ab4b72642b8e89d0c7e32 Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Tue, 9 May 2023 12:25:03 +0900
Subject: [PATCH v4] Add tab completion for CREATE SCHEMA

- AUTHORIZATION both in addition to and after a schema name
- list of owner roles after AUTHORIZATION
- CREATE and GRANT after any otherwise-complete command
- Only the allowed object types after CREATE

Also adjust completion after CREATE UNLOGGED:

- Add SEQUENCE
- Don't suggest MATERIALIZED VIEW in CREATE TABLE
---
 src/bin/psql/tab-complete.c | 45 ++---
 1 file changed, 37 insertions(+), 8 deletions(-)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index bd04244969..9b0b7ddb33 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1031,9 +1031,13 @@ static const SchemaQuery Query_for_trigger_of_table = {
 	" FROM pg_catalog.pg_roles "\
 	" WHERE rolname LIKE '%s'"
 
+/* add these to Query_for_list_of_roles in OWNER contexts */
+#define Keywords_for_list_of_owner_roles \
+"CURRENT_ROLE", "CURRENT_USER", "SESSION_USER"
+
 /* add these to Query_for_list_of_roles in GRANT contexts */
 #define Keywords_for_list_of_grant_roles \
-"PUBLIC", "CURRENT_ROLE", "CURRENT_USER", "SESSION_USER"
+Keywords_for_list_of_owner_roles, "PUBLIC"
 
 #define Query_for_all_table_constraints \
 	"SELECT conname "\
@@ -1785,8 +1789,15 @@ psql_completion(const char *text, int start, int end)
 	/* CREATE */
 	/* complete with something you can create */
 	else if (TailMatches("CREATE"))
-		matches = rl_completion_matches(text, create_command_generator);
-
+	{
+		/* only some object types can be created as part of CREATE SCHEMA */
+		if (HeadMatches("CREATE", "SCHEMA"))
+			COMPLETE_WITH("TABLE", "VIEW", "INDEX", "SEQUENCE", "TRIGGER",
+		/* for INDEX and TABLE/SEQUENCE, respectively */
+						  "UNIQUE", "UNLOGGED");
+		else
+			matches = rl_completion_matches(text, create_command_generator);
+	}
 	/* complete with something you can create or replace */
 	else if (TailMatches("CREATE", "OR", "REPLACE"))
 		COMPLETE_WITH("FUNCTION", "PROCEDURE", "LANGUAGE", "RULE", "VIEW",
@@ -3154,6 +3165,20 @@ psql_completion(const char *text, int start, int end)
 	else if (TailMatches("AS", "ON", "SELECT|UPDATE|INSERT|DELETE", "TO"))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_tables);
 
+/* CREATE SCHEMA [ ] [ AUTHORIZATION ] */
+	else if (Matches("CREATE", "SCHEMA"))
+		COMPLETE_WITH_QUERY_PLUS(Query_for_list_of_schemas,
+								 "AUTHORIZATION");
+	else if (Matches("CREATE", "SCHEMA", "AUTHORIZATION") ||
+			 Matches("CREATE", "SCHEMA", MatchAny, "AUTHORIZATION"))
+		COMPLETE_WITH_QUERY_PLUS(Query_for_list_of_roles,
+								 Keywords_for_list_of_owner_roles);
+	else if (Matches("CREATE", "SCHEMA", "AUTHORIZATION", MatchAny) ||
+			 Matches("CREATE", "SCHEMA", MatchAny, "AUTHORIZATION", MatchAny))
+		COMPLETE_WITH("CREATE", "GRANT");
+	else if (Matches("CREATE", "SCHEMA", MatchAny))
+		COMPLETE_WITH("AUTHORIZATION", "CREATE", "GRANT");
+
 	/* CREATE SEQUENCE --- is allowed inside CREATE SCHEMA, so use TailMatches */
 	else if (TailMatches("CREATE", "SEQUENCE", MatchAny) ||
 			 TailMatches("CREATE", "TEMP|TEMPORARY", "SEQUENCE", MatchAny))
@@ -3185,9 +3210,15 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE TEMP/TEMPORARY" with the possible temp objects */
 	else if (TailMatches("CREATE", "TEMP|TEMPORARY"))
 		COMPLETE_WITH("SEQUENCE", "TABLE", "VIEW");
-	/* Complete "CREATE UNLOGGED" with TABLE or MATVIEW */
+	/*
Re: Improve list manipulation in several places
On Tue, May 9, 2023 at 1:48 AM Ranier Vilela wrote:
> I think you missed list_nth_xid, it makes perfect sense to exist.

It seems that list_nth_xid is more about simplification. So maybe we should put it in 0001?

Thanks
Richard
Re: Improve list manipulation in several places
On Tue, May 9, 2023 at 1:26 AM Alvaro Herrera wrote:
> The problem I see is that each of these new functions has a single
> caller, and the only one that looks like it could have a performance
> advantage is list_copy_move_nth_to_head() (which is the weirdest of the
> lot). I'm inclined not to have any of these single-use functions unless
> a performance case can be made for them.

Yeah, maybe this is the reason I failed to devise a query that shows any performance gain. I tried with a query which makes the 'all_pathkeys' in sort_inner_and_outer be of length 500, and I still cannot see any notable performance improvement gained by list_copy_move_nth_to_head. Maybe the cost of other parts of planning swamps the performance gain here? Now I agree that maybe 0002 is not worthwhile to do.

Thanks
Richard
Re: [PoC] pg_upgrade: allow to upgrade publisher node
Hi Kuroda-san.

Here are some review comments for the v10-0001 patch.

== General

1. pg_dump option is documented to the user.

I'm not sure about exposing the new pg_dump --logical-replication-slots-only option to the user. I thought this pg_dump option was intended only to be called *internally* by pg_upgrade. But this patch also documents the new option for the user (in case they want to call it independently?).

Maybe exposing it is OK, but if you do that then I thought perhaps there should also be some additional pg_dump tests just for this option (i.e. tested independently of pg_upgrade).

== Commit message

2.
For pg_upgrade, when '--include-logical-replication-slots' is specified, it executes pg_dump with the new "--logical-replication-slots-only" option and restores from the dump. Apart from restoring schema, pg_resetwal must not be called after restoring replication slots. This is because the command discards WAL files and starts from a new segment, even if they are required by replication slots. This leads to an ERROR: "requested WAL segment XXX has already been removed". To avoid this, replication slots are restored at a different time than other objects, after running pg_resetwal.

~~

The "Apart from" sentence maybe could do with some rewording. I noticed there is a code comment (below fragment) that says the same as this, but more clearly. Maybe it is better to use that code-comment wording in the commit message.

+	 * XXX We cannot dump replication slots at the same time as the schema
+	 * dump because we need to separate the timing of restoring
+	 * replication slots and other objects. Replication slots, in
+	 * particular, should not be restored before executing the pg_resetwal
+	 * command because it will remove WALs that are required by the slots.

== src/bin/pg_dump/pg_dump.c

3. main

+	if (dopt.logical_slots_only && !dopt.binary_upgrade)
+		pg_fatal("options --logical-replication-slots-only requires option --binary-upgrade");
+
+	if (dopt.logical_slots_only && dopt.dataOnly)
+		pg_fatal("options --logical-replication-slots-only and -a/--data-only cannot be used together");
+	if (dopt.logical_slots_only && dopt.schemaOnly)
+		pg_fatal("options --logical-replication-slots-only and -s/--schema-only cannot be used together");
+

Consider if it might be simpler to combine together all those dopt.logical_slots_only checks.

SUGGESTION
if (dopt.logical_slots_only)
{
    if (!dopt.binary_upgrade)
        pg_fatal("options --logical-replication-slots-only requires option --binary-upgrade");
    if (dopt.dataOnly)
        pg_fatal("options --logical-replication-slots-only and -a/--data-only cannot be used together");
    if (dopt.schemaOnly)
        pg_fatal("options --logical-replication-slots-only and -s/--schema-only cannot be used together");
}

~~~

4. getLogicalReplicationSlots

+	/* Check whether we should dump or not */
+	if (fout->remoteVersion < 16 || !dopt->logical_slots_only)
+		return;

I'm not sure if this check is necessary. Given the way this function is called, is it possible for this check to fail? Maybe that quick exit would be better coded as an Assert?

~~~

5. dumpLogicalReplicationSlot

+dumpLogicalReplicationSlot(Archive *fout,
+						   const LogicalReplicationSlotInfo *slotinfo)
+{
+	DumpOptions *dopt = fout->dopt;
+
+	if (!dopt->logical_slots_only)
+		return;

(Similar to the previous comment.) Is it even possible to arrive here when dopt->logical_slots_only is false? Maybe that quick exit would be better coded as an Assert?

~

6.
+	PQExpBuffer query = createPQExpBuffer();
+	char	   *slotname = pg_strdup(slotinfo->dobj.name);

I wondered if it was really necessary to strdup/free this slotname. And if it is, then why don't you do this for the slotinfo->plugin field?

== src/bin/pg_upgrade/check.c

7. check_and_dump_old_cluster

 	/* Extract a list of databases and tables from the old cluster */
 	get_db_and_rel_infos(&old_cluster);
+	get_logical_slot_infos(&old_cluster);

Is it correct to associate this new call with that existing comment about "databases and tables"?

~~~

8. check_new_cluster

@@ -188,6 +190,7 @@
 void
 check_new_cluster(void)
 {
 	get_db_and_rel_infos(&new_cluster);
+	get_logical_slot_infos(&new_cluster);
 
 	check_new_cluster_is_empty();

@@ -210,6 +213,9 @@
 	check_for_prepared_transactions(&new_cluster);
 	check_for_new_tablespace_dir(&new_cluster);
+
+	if (user_opts.include_logical_slots)
+		check_for_parameter_settings(&new_cluster);

Can the get_logical_slot_infos() be done later, guarded by the same condition if (user_opts.include_logical_slots)?

~~~

9. check_new_cluster_is_empty

+	 * If --include-logical-replication-slots is required, check the
+	 * existing of slots
+	 */

Did you mean to say "check the existence of slots"?

~~~

10. check_for_parameter_settings

+	if (strcmp(wal_level, "logical") != 0)
+		pg_fatal("wal_level must be \"logical\", but set to \"%s\"",
+				 wal_level);

/but set to/but is set to/
Re: Improve list manipulation in several places
On Mon, May 8, 2023 at 11:22 PM Peter Eisentraut <peter.eisentr...@enterprisedb.com> wrote:
> On 23.04.23 08:42, Richard Guo wrote:
> > Thanks for the suggestion. I've split the patch into two as attached.
> > 0001 is just a minor simplification by replacing lfirst(list_head(list))
> > with linitial(list). 0002 introduces new functions to reduce the
> > movement of list elements in several places so as to gain performance
> > improvement and benefit future callers.
>
> These look sensible to me. If you could show some numbers that support
> the claim that there is a performance advantage, it would be even more
> convincing.

Thanks Peter for looking at those patches. I tried to devise a query to show a performance gain but did not succeed :-(. So I begin to wonder if 0002 is worthwhile to do, as it seems that it does not solve any real problem.

Thanks
Richard
Re: Cleaning up array_in()
Alexander Lakhin writes:
> The only thing that confused me, is the error message (it's not new, too):
> select '{{1}}'::int[];
> or even:
> select '{{'::int[];
> ERROR:  number of array dimensions (7) exceeds the maximum allowed (6)

Yeah, I didn't touch that, but it's pretty bogus, because the first number will always be "7" even if you wrote more than 7 left braces, since the code errors out immediately upon finding that it's seen too many braces. The equivalent message in the PLs just says "number of array dimensions exceeds the maximum allowed (6)". I'm inclined to do likewise in array_in, but didn't touch it here.

> Beside that, I would like to note the following error text changes
> (all of these are feasible, I think):

I'll look into whether we can improve those, unless you had a patch in mind already?

			regards, tom lane
Re: Cleaning up array_in()
02.05.2023 18:41, Tom Lane wrote:
> So, here's a rewrite. Although I view this as a bug fix, AFAICT the
> only effects are to accept input that should be rejected. So again I
> don't advocate back-patching. But should we sneak it into v16, or
> wait for v17?

I've tested the patch from a user perspective and found no interesting cases that were valid before but not accepted with the patch (or vice versa).

The only thing that confused me is the error message (it's not new, too):
select '{{1}}'::int[];
or even:
select '{{'::int[];
ERROR:  number of array dimensions (7) exceeds the maximum allowed (6)

Maybe it could be reworded like that?:
too many opening braces defining dimensions (maximum dimensions allowed: 6)

Beside that, I would like to note the following error text changes (all of these are feasible, I think):

select '{{1},{{'::int[];
Before:
ERROR:  malformed array literal: "{{1},{{"
LINE 1: select '{{1},{{'::int[];
               ^
DETAIL:  Unexpected end of input.
After:
ERROR:  malformed array literal: "{{1},{{"
LINE 1: select '{{1},{{'::int[];
               ^
DETAIL:  Multidimensional arrays must have sub-arrays with matching dimensions.

---

select '{{1},{{'::int[];
Before:
ERROR:  number of array dimensions (7) exceeds the maximum allowed (6)
After:
ERROR:  malformed array literal: "{{1},{{"
LINE 1: select '{{1},{{'::int[];
               ^
DETAIL:  Multidimensional arrays must have sub-arrays with matching dimensions.

---

select '{{1},{}}}'::int[];
Before:
ERROR:  malformed array literal: "{{1},{}}}"
LINE 1: select '{{1},{}}}'::int[];
               ^
DETAIL:  Unexpected "}" character.
After:
ERROR:  malformed array literal: "{{1},{}}}"
LINE 1: select '{{1},{}}}'::int[];
               ^
DETAIL:  Multidimensional arrays must have sub-arrays with matching dimensions.

---

select '{{}}}'::int[];
Before:
ERROR:  malformed array literal: "{{}}}"
LINE 1: select '{{}}}'::int[];
               ^
DETAIL:  Unexpected "}" character.
After:
ERROR:  malformed array literal: "{{}}}"
LINE 1: select '{{}}}'::int[];
               ^
DETAIL:  Junk after closing right brace.

Best regards,
Alexander
Re: Perform streaming logical transactions by background workers and parallel apply
On Mon, May 8, 2023 at 8:09 PM Amit Kapila wrote:
>
> On Fri, May 5, 2023 at 9:14 AM Zhijie Hou (Fujitsu) wrote:
> >
> > On Wednesday, May 3, 2023 3:17 PM Amit Kapila wrote:
> >
> > Attach another patch to fix the problem that pa_shutdown will access invalid
> > MyLogicalRepWorker. I personally want to avoid introducing a new static
> > variable, so I only reorder the callback registration in this version.
> >
> > When testing this, I noticed a rare case where the leader can possibly receive
> > the worker termination message after the leader stops the parallel worker. This
> > is unnecessary and carries a risk that the leader would try to access the
> > detached memory queue. This is more likely to happen, and sometimes causes
> > failures in regression tests, after the registration reorder patch because the
> > dsm is detached earlier after applying the patch.
>
> I think it is only possible for the leader apply worker to try to
> receive the error message from an error queue after your 0002 patch.
> Because another place already detached from the queue before stopping
> the parallel apply workers. So, I combined both the patches and
> changed a few comments and a commit message. Let me know what you
> think of the attached.

I have one comment on the detaching-error-queue part:

+	/*
+	 * Detach from the error_mq_handle for the parallel apply worker before
+	 * stopping it. This prevents the leader apply worker from trying to
+	 * receive the message from the error queue that might already be detached
+	 * by the parallel apply worker.
+	 */
+	shm_mq_detach(winfo->error_mq_handle);
+	winfo->error_mq_handle = NULL;

In pa_detach_all_error_mq(), we try to detach the error queues of all workers in the pool. I think we should check whether the queue is already detached (i.e. is NULL) there. Otherwise, we will end up with a SEGV if an error happens after detaching the error queue and before removing the worker from the pool.
Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Perform streaming logical transactions by background workers and parallel apply
On Mon, May 8, 2023 at 3:34 PM Amit Kapila wrote: > > On Mon, May 8, 2023 at 11:08 AM Masahiko Sawada wrote: > > > > On Mon, May 8, 2023 at 12:52 PM Zhijie Hou (Fujitsu) > > wrote: > > > > > > On Monday, May 8, 2023 11:08 AM Masahiko Sawada > > > > > > Hi, > > > > > > > > > > > On Tue, May 2, 2023 at 12:22 PM Amit Kapila > > > > wrote: > > > > > > > > > > On Fri, Apr 28, 2023 at 11:48 AM Masahiko Sawada > > > > wrote: > > > > > > > > > > > > While investigating this issue, I've reviewed the code around > > > > > > callbacks and worker termination etc and I found a problem. > > > > > > > > > > > > A parallel apply worker calls the before_shmem_exit callbacks in the > > > > > > following order: > > > > > > > > > > > > 1. ShutdownPostgres() > > > > > > 2. logicalrep_worker_onexit() > > > > > > 3. pa_shutdown() > > > > > > > > > > > > Since the worker is detached during logicalrep_worker_onexit(), > > > > > > MyLogicalReplication->leader_pid is an invalid when we call > > > > > > pa_shutdown(): > > > > > > > > > > > > static void > > > > > > pa_shutdown(int code, Datum arg) > > > > > > { > > > > > > Assert(MyLogicalRepWorker->leader_pid != InvalidPid); > > > > > > SendProcSignal(MyLogicalRepWorker->leader_pid, > > > > > >PROCSIG_PARALLEL_APPLY_MESSAGE, > > > > > >InvalidBackendId); > > > > > > > > > > > > Also, if the parallel apply worker fails shm_toc_lookup() during the > > > > > > initialization, it raises an error (because of noError = false) but > > > > > > ends up a SEGV as MyLogicalRepWorker is still NULL. > > > > > > > > > > > > I think that we should not use MyLogicalRepWorker->leader_pid in > > > > > > pa_shutdown() but instead store the leader's pid to a static > > > > > > variable > > > > > > before registering pa_shutdown() callback. > > > > > > > > > > > > > > > > Why not simply move the registration of pa_shutdown() to someplace > > > > > after logicalrep_worker_attach()? 
> > > > > > > > If we do that, the worker won't call dsm_detach() if it raises an > > > > ERROR in logicalrep_worker_attach(), is that okay? It seems that it's > > > > no practically problem since we call dsm_backend_shutdown() in > > > > shmem_exit(), but if so why do we need to call it in pa_shutdown()? > > > > > > I think the dsm_detach in pa_shutdown was intended to fire on_dsm_detach > > > callbacks to give callback a chance to report stat before the stat system > > > is > > > shutdown, following what we do in ParallelWorkerShutdown() (e.g. > > > sharedfileset.c callbacks cause fd.c to do ReportTemporaryFileUsage(), so > > > we > > > need to fire that earlier). > > > > > > But for parallel apply, we currently only have one on_dsm_detach > > > callback(shm_mq_detach_callback) which doesn't report extra stats. So the > > > dsm_detach in pa_shutdown is only used to make it a bit future-proof in > > > case > > > we add some other on_dsm_detach callbacks in the future which need to > > > report > > > stats. > > > > Make sense . Given that it's possible that we add other callbacks that > > report stats in the future, I think it's better not to move the > > position to register pa_shutdown() callback. > > > > Hmm, what kind of stats do we expect to be collected before we > register pa_shutdown? I think if required we can register such a > callback after pa_shutdown. I feel without reordering the callbacks, > the fix would be a bit complicated as explained in my previous email, > so I don't think it is worth complicating this code unless really > required. Fair point. I agree that the issue can be resolved by carefully ordering the callback registration. Another thing I'm concerned about is that since both the leader worker and parallel worker detach DSM before logicalrep_worker_onexit(), cleaning up work that touches DSM cannot be done in logicalrep_worker_onexit(). 
If we need to do something in the future, we would need to have another callback called before detaching DSM. But I'm fine as it's not a problem for now. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
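For readers following along: the ordering at the heart of this thread comes from before_shmem_exit() running its callbacks in reverse (LIFO) registration order. Below is a minimal self-contained simulation of that mechanism — not PostgreSQL's actual ipc.c code, and all names are stand-ins — showing why pa_shutdown(), registered first, fires only after logicalrep_worker_onexit() has already detached the worker:

```c
#include <assert.h>
#include <string.h>

#define MAX_CALLBACKS 8

typedef void (*exit_cb)(void);

static exit_cb callbacks[MAX_CALLBACKS];
static int     ncallbacks = 0;

static const char *call_order[MAX_CALLBACKS];
static int         ncalled = 0;

/* Mimics before_shmem_exit(): callbacks are remembered in registration order. */
static void register_exit_cb(exit_cb cb)
{
    callbacks[ncallbacks++] = cb;
}

/* Mimics shmem_exit(): callbacks run in reverse (LIFO) registration order. */
static void run_exit_cbs(void)
{
    while (ncallbacks > 0)
        callbacks[--ncallbacks]();
}

/* Stand-ins for the three callbacks named in the thread. */
static void shutdown_postgres_sim(void) { call_order[ncalled++] = "ShutdownPostgres"; }
static void worker_onexit_sim(void)     { call_order[ncalled++] = "logicalrep_worker_onexit"; }
static void pa_shutdown_sim(void)       { call_order[ncalled++] = "pa_shutdown"; }

/* Registration order as in a parallel apply worker: pa_shutdown first. */
static void run_worker_exit_demo(void)
{
    ncalled = 0;
    register_exit_cb(pa_shutdown_sim);       /* registered first => runs last */
    register_exit_cb(worker_onexit_sim);
    register_exit_cb(shutdown_postgres_sim); /* registered last => runs first */
    run_exit_cbs();
}
```

As I understand the proposal, registering pa_shutdown() after logicalrep_worker_attach() would place it later in this LIFO stack, so it would run before the worker is detached.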
Re: Subscription statistics are not dropped at DROP SUBSCRIPTION in some cases
On Mon, May 08, 2023 at 04:23:15PM +0900, Masahiko Sawada wrote: > We call pgstat_drop_subscription() at the end of DropSubscription() > but we could leave this function earlier, e.g. when no slot is > associated with the subscription. In this case, the statistics entry > for the subscription remains. To fix it, I think we need to call it > earlier, just after removing the catalog tuple. There is a chance the > transaction dropping the subscription fails due to a network error etc., > but we don't need to worry about it as reporting the subscription drop > is transactional. Looks reasonable to me. IIUC calling pgstat_drop_subscription() earlier makes no real difference (besides avoiding this bug) because it is using pgstat_drop_transactional() behind the scenes. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
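The transactional behavior referred to above can be sketched with a toy model — not the actual pgstat code; the point is only that a drop queued in the style of pgstat_drop_transactional() takes effect at commit and is discarded on abort:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ENTRIES 8

static bool entry_exists[MAX_ENTRIES];  /* stands in for stats entries */
static int  pending_drop = -1;          /* at most one queued drop in this toy */

/*
 * Queue a drop without removing anything yet, in the spirit of
 * pgstat_drop_transactional(): the entry must survive if the
 * surrounding transaction aborts.
 */
static void drop_transactional(int id)
{
    pending_drop = id;
}

/* At transaction end, apply the queued drop on commit, forget it on abort. */
static void xact_end(bool committed)
{
    if (committed && pending_drop >= 0)
        entry_exists[pending_drop] = false;
    pending_drop = -1;
}
```

This is why calling the drop earlier inside the transaction is safe: a failed DROP SUBSCRIPTION leaves the statistics entry in place.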
Re: Cleaning up array_in()
On Tue, May 02, 2023 at 11:41:27AM -0400, Tom Lane wrote: > It looks like back in the dim mists of the > Berkeley era, there was an intentional attempt to allow > non-rectangular array input, with the missing elements automatically > filled out as NULLs. Since that was undocumented, we concluded it was > a bug and plastered on some code to check for rectangularity of the > input. Interesting. > Although I view this as a bug fix, AFAICT the only effects are to > accept input that should be rejected. So again I don't advocate > back-patching. But should we sneak it into v16, or wait for v17? I think it'd be okay to sneak it into v16, given it is technically a bug fix. > (This leaves ArrayGetOffset0() unused, but I'm unsure whether to > remove that.) Why's that? Do you think it is likely to be used again in the future? Otherwise, 0001 LGTM. I haven't had a chance to look at 0002 closely yet. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: evtcache: EventTriggerCache vs Event Trigger Cache
On Mon, May 08, 2023 at 10:39:42AM +0200, Daniel Gustafsson wrote: > On 4 May 2023, at 14:18, Daniel Gustafsson wrote: >> On 4 May 2023, at 14:09, Tom Lane wrote: >>> Hmm, I'm kinda -1 on them having the same name visible in the >>> contexts dump --- that seems very confusing. How about naming >>> the hash "EventTriggerCacheHash" or so? >> >> I think the level is the indicator here, but I have no strong opinions, >> EventTriggerCacheHash is fine by me. > > The attached trivial diff does that, parking this in the next CF. +1 for EventTriggerCacheHash -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: WAL Insertion Lock Improvements
On Mon, May 08, 2023 at 05:57:09PM +0530, Bharath Rupireddy wrote: > I ran performance tests on the patch with different use-cases. Clearly > the patch reduces burden on LWLock's waitlist lock (evident from perf > reports [1]). However, to see visible impact in the output, the txns > must be generating small (between 16 bytes and 2 KB) amounts of WAL in > a highly concurrent manner, check the results below (FWIW, I've zipped > and attached perf images for better illustration along with test > setup). > > When the txns are generating a small amount of WAL i.e. between 16 > bytes and 2 KB in a highly concurrent manner, the benefit is clearly > visible in the TPS: more than a 2.3X improvement. When the txns are > generating more WAL i.e. more than 2 KB, the gain from the reduced burden > on the waitlist lock is offset by the increase in the wait/release for WAL > insertion locks and no visible benefit is seen. > > As the amount of WAL each txn generates increases, it looks like the > benefit gained from the reduced burden on the waitlist lock is offset by > the increase in the wait for WAL insertion locks. Nice. > test-case 1: -T5, WAL ~16 bytes > test-case 1: -t1000, WAL ~16 bytes I wonder if it's worth doing a couple of long-running tests, too. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: PGDOCS - Replica Identity quotes
On Mon, May 8, 2023 at 2:05 PM Michael Paquier wrote: > > On Mon, May 08, 2023 at 01:57:33PM +1000, Peter Smith wrote: > > I agree. Do you want me to make a new v4 patch to also do that, or > > will you handle them in passing? > > I'll just go handle them on the way, no need to send an updated > patch. Thanks for pushing this yesterday, and for handling the other quotes. -- Kind Regards, Peter Smith. Fujitsu Australia.
Re: Add LZ4 compression in pg_dump
On Mon, May 08, 2023 at 08:00:39PM +0200, Tomas Vondra wrote: > The LZ4Stream_write() forgot to move the pointer to the next chunk, so > it was happily compressing the initial chunk over and over. A bit > embarrassing oversight :-( > > The custom format calls WriteDataToArchiveLZ4(), which was correct. > > The attached patch fixes this for me. Ouch. So this was corrupting the dumps and the compression when trying to write more than two chunks at once, not the decompression steps. That addresses the issue here as well, thanks! -- Michael
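A distilled illustration of this bug class — a stand-alone sketch, not the pg_dump code: memcpy() stands in for LZ4F_compressUpdate(), and CHUNK_SIZE for DEFAULT_IO_BUFFER_SIZE:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 4            /* stands in for DEFAULT_IO_BUFFER_SIZE */

/*
 * Copy "size" bytes from ptr to out in CHUNK_SIZE pieces.  The bug was
 * that without the pointer advance at the bottom of the loop, every
 * iteration re-reads the first chunk of the input, so the output is
 * N copies of the first chunk instead of the whole buffer.
 */
static void chunked_write(const void *ptr, size_t size, char *out)
{
    size_t remaining = size;

    while (remaining > 0)
    {
        size_t chunk = remaining < CHUNK_SIZE ? remaining : CHUNK_SIZE;

        memcpy(out, ptr, chunk);            /* stands in for the compress call */
        out += chunk;
        remaining -= chunk;

        /* the fix: advance the input pointer past the consumed chunk */
        ptr = ((const char *) ptr) + chunk;
    }
}
```

It also shows why the tests missed it: inputs smaller than one chunk never exercise the second loop iteration.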
Re: Order changes in PG16 since ICU introduction
On Mon, 2023-05-08 at 17:47 -0400, Tom Lane wrote: > -ERROR: could not convert locale name "C" to language tag: > U_ILLEGAL_ARGUMENT_ERROR > +NOTICE: using standard form "en-US-u-va-posix" for locale "C" ... > I suppose this is environment-dependent. Sadly, the buildfarm > client does not show the prevailing LANG or LC_XXX settings. Looks like it's failing-to-fail on some versions of ICU which automatically perform that conversion. The easiest thing to do is revert it for now, and after we sort out the memcmp() path for the ICU provider, then I can commit it again (after that point it would just be code cleanup and should have no functional impact). Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
Jeff Davis writes: > === 0001: do not convert C to en-US-u-va-posix > I plan to commit this soon. Several buildfarm animals have failed since this went in. The only one showing enough info to diagnose is siskin [1]: @@ -1043,16 +1043,15 @@ ERROR: ICU locale "nonsense-nowhere" has unknown language "nonsense" HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED. CREATE COLLATION testx (provider = icu, locale = 'C'); -- fails -ERROR: could not convert locale name "C" to language tag: U_ILLEGAL_ARGUMENT_ERROR +NOTICE: using standard form "en-US-u-va-posix" for locale "C" CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails ERROR: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR SET icu_validation_level = WARNING; CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx; WARNING: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR +ERROR: collation "testx" already exists CREATE COLLATION testx (provider = icu, locale = 'C'); DROP COLLATION testx; -WARNING: could not convert locale name "C" to language tag: U_ILLEGAL_ARGUMENT_ERROR -WARNING: ICU locale "C" has unknown language "c" -HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED. +NOTICE: using standard form "en-US-u-va-posix" for locale "C" CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx; WARNING: ICU locale "nonsense-nowhere" has unknown language "nonsense" HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED. I suppose this is environment-dependent. Sadly, the buildfarm client does not show the prevailing LANG or LC_XXX settings. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=siskin=2023-05-08%2020%3A09%3A26
Re: pg_stat_io for the startup process
Hi, On 2023-04-26 18:47:14 +0900, Kyotaro Horiguchi wrote: > I see four issues here. > > 1. The current database stats omits buffer fetches that don't > originate from a relation. > > In this case pgstat_relation can't work since recovery isn't conscious > of relids. We might be able to resolve relfilenode into a relid, but > it may not be that simple. Fortunately we already count fetches and > hits process-wide using pgBufferUsage, so we can use this for database > stats. I don't think we need to do anything about that for 16 - they aren't updated at process end either. I think the fix here is to do the architectural change of maintaining most stats keyed by relfilenode as we've discussed in some other threads. Then we also can have relation level write stats etc. > 2. Even if we wanted to report stats for the startup process, > pgstat_report_stats wouldn't permit it since transaction-end > timestamp doesn't advance. > > I'm not certain if it's the correct approach, but perhaps we could use > GetCurrentTimestamp() instead of GetCurrentTransactionStopTimestamp() > specifically for the startup process. What about using GetCurrentTimestamp() when force == true? That'd make sense for other users as well, I think? > 3. When should we call pgstat_report_stats on the startup process? > > During recovery, I think we can call pgstat_report_stats() (or a > subset of it) right before invoking WaitLatch and at segment > boundaries. I've pondered that as well. But I don't think it's great - it's not exactly intuitive that stats reporting gets far less common if you use a 1GB wal_segment_size. Greetings, Andres Freund
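A sketch of the flush decision being discussed — toy code, not the actual pgstat_report_stat() implementation; the constant and function names are stand-ins. The point is that the rate limit only ever expires if the timestamp it compares against advances, which is why a fresh GetCurrentTimestamp() (or force == true) matters for the startup process:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MIN_INTERVAL_MS 1000     /* stands in for PGSTAT_MIN_INTERVAL */

static int64_t last_flush_ms = 0;

/*
 * Decide whether to flush pending stats.  "now_ms" stands in for
 * GetCurrentTimestamp(): in a normal backend the transaction-stop
 * timestamp is fresh enough, but in the startup process it never
 * advances, so a real clock reading (or force == true) is required
 * for the rate limit to ever expire.
 */
static bool should_flush(int64_t now_ms, bool force)
{
    if (!force && now_ms - last_flush_ms < MIN_INTERVAL_MS)
        return false;            /* rate-limited: too soon since last flush */
    last_flush_ms = now_ms;
    return true;
}
```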
Re: pg_stat_io for the startup process
On Wed, May 03, 2023 at 04:11:33PM +0300, Melih Mutlu wrote: > Andres Freund , 27 Nis 2023 Per, 19:27 tarihinde şunu > yazdı: > > #ifdef WAL_DEBUG > > > if (XLOG_DEBUG || > > > (record->xl_rmid == RM_XACT_ID && > > trace_recovery_messages <= DEBUG2) || > > > @@ -3617,6 +3621,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, > > bool randAccess, > > > /* Do background tasks > > that might benefit us later. */ > > > > > KnownAssignedTransactionIdsIdleMaintenance(); > > > > > > + /* > > > + * Need to disable flush > > timeout to avoid unnecessary > > > + * wakeups. Enable it > > again after a WAL record is read > > > + * in PerformWalRecovery. > > > + */ > > > + > > disable_startup_stat_flush_timeout(); > > > + > > > (void) > > WaitLatch(>recoveryWakeupLatch, > > > > > WL_LATCH_SET | WL_TIMEOUT | > > > > > WL_EXIT_ON_PM_DEATH, > > > > I think always disabling the timer here isn't quite right - we want to > > wake up > > *once* in WaitForWALToBecomeAvailable(), otherwise we'll not submit pending > > stats before waiting - potentially for a long time - for WAL. One way > > would be > > just explicitly report before the WaitLatch(). > > > > Another approach, I think better, would be to not use > > enable_timeout_every(), > > and to explicitly rearm the timer in HandleStartupProcInterrupts(). When > > called from WaitForWALToBecomeAvailable(), we'd not rearm, and instead do > > so > > at the end of WaitForWALToBecomeAvailable(). That way we also wouldn't > > repeatedly fire between calls to HandleStartupProcInterrupts(). > > > > Attached patch is probably not doing what you asked. IIUC > HandleStartupProcInterrupts should arm the timer but also shouldn't arm it > if it's called from WaitForWALToBecomeAvailable. But the timer should be > armed again at the very end of WaitForWALToBecomeAvailable. I've been > thinking about how to do this properly. Should HandleStartupProcInterrupts > take a parameter to decide whether the timer needs to be armed? 
Or need to > add an additional global flag to rearm the timer? Any thoughts? I had the same question about how to determine whether or not to rearm. > From 9be7360e49db424c45c53e85efe8a4f5359b5244 Mon Sep 17 00:00:00 2001 > From: Melih Mutlu > Date: Wed, 26 Apr 2023 18:21:32 +0300 > Subject: [PATCH v2] Add timeout to flush stats during startup's main replay > loop > diff --git a/src/backend/postmaster/startup.c > b/src/backend/postmaster/startup.c > index efc2580536..842394bc8f 100644 > --- a/src/backend/postmaster/startup.c > +++ b/src/backend/postmaster/startup.c > @@ -72,6 +72,9 @@ static TimestampTz startup_progress_phase_start_time; > */ > static volatile sig_atomic_t startup_progress_timer_expired = false; > > +/* Indicates whether flushing stats is needed. */ > +static volatile sig_atomic_t idle_stats_update_pending = false; > + > /* > * Time between progress updates for long-running startup operations. > */ > @@ -206,6 +209,18 @@ HandleStartupProcInterrupts(void) > /* Perform logging of memory contexts of this process */ > if (LogMemoryContextPending) > ProcessLogMemoryContextInterrupt(); > + > + if (idle_stats_update_pending) > + { > + /* It's time to report wal stats. */ If we want dbstats to be updated, we'll probably have to call pgstat_report_stat() here and fix the timestamp issue Horiguchi-san points out upthread. Perhaps you could review those changes and consider adding those as preliminary patches before this in a set. I think you will then need to handle the issue he mentions with partial flushes coming from pgstat_report_stat() and remembering to try and flush stats again in case of a partial flush. Though, perhaps we can just pass force=true. > + pgstat_report_wal(true); > + idle_stats_update_pending = false; > + } Good that you thought to check if the timeout was already active. > + else if (!get_timeout_active(IDLE_STATS_UPDATE_TIMEOUT)) > + { > + /* Set the next timeout. 
*/ > + enable_idle_stats_update_timeout(); > + } > } > > > @@ -385,3 +400,22 @@ has_startup_progress_timeout_expired(long *secs, int > *usecs) > > return true; > } > + > +/* Set a flag indicating that it's time to flush wal stats. */ > +void > +idle_stats_update_timeout_handler(void) > +{ > + idle_stats_update_pending = true; Is WakeupRecovery() needed when the timer goes off and the startup process is waiting on a latch? Does this happen in WaitForWalToBecomeAvailable()? > + WakeupRecovery(); > +} > + > +/* Enable the timeout set for wal stat
Re: base backup vs. concurrent truncation
Hi, On 2023-04-25 10:28:58 -0700, Andres Freund wrote: > On 2023-04-25 11:42:43 -0400, Robert Haas wrote: > > On Mon, Apr 24, 2023 at 8:03 PM Andres Freund wrote: > > > What we've discussed somewhere in the past is to always truncate N+1 when > > > creating the first page in N. I.e. if we extend into 23456.1, we truncate > > > 23456.2 to 0 blocks. As far as I can tell, that'd solve this issue? > > > > Yeah, although leaving 23456.2 forever unless and until that happens > > doesn't sound amazing. > > It isn't - but the alternatives aren't great either. It's not that easy to hit > this scenario, so I think something along these lines is more palatable than > adding a pass through the entire data directory. > I think eventually we'll have to make the WAL logging bulletproof enough > against this to avoid the risk of it. I think that is possible. I think we should extend my proposal above with improved WAL logging. Right now the replay of XLOG_SMGR_TRUNCATE records uses the normal smgrtruncate() path - which essentially relies on smgrnblocks() to determine the relation size, which in turn iterates over the segments until it finds one < SEGSIZE. That's fine during normal running, where we are consistent. But it's bogus when we're not consistent - in case of a concurrent truncation, the base backup might have missed intermediate segments, while copying later segments. We should fix this by including not just the "target" length in the WAL record, but also the "current" length. Then during WAL replay of such records we'd not look at the files currently present, we'd just stupidly truncate all the segments mentioned in the range. I think we ought to do the same for mdunlinkfork(). > I suspect we would need to prevent checkpoints from happening in the wrong > moment, if we were to go down that route. 
> > I guess that eventually we'll need to polish the infrastructure for > determining restartpoints so that delayChkptFlags doesn't actually prevent > checkpoints, just moves the restart to a safe LSN. Although I guess that > truncations aren't frequent enough (compared to transaction commits), for that > to be required "now". > Unfortunately the current approach of smgrtruncate records is quite racy with checkpoints. You can end up with a sequence of something like 1) SMGRTRUNCATE record 2) CHECKPOINT; 3) Truncating of segment files. If you crash anywhere in 3), we don't replay the SMGRTRUNCATE record. It's not a great solution, but I suspect we'll just have to set delayChkptFlags during truncations to prevent this. I also think that we need to fix the order in which mdunlink() operates. It seems *quite* bogus that we unlink in "forward" segment order, rather than in reverse order (like mdtruncate()). If we crash after unlinking segment.N, we'll not redo unlinking the later segments after a crash. Nor in WAL replay. While the improved WAL logging would address this as well, it still seems pointlessly risky to iterate forward, rather than backward. Even with those changes, I think we might still need something like the "always truncate the next segment" bit I described in my prior email though. Greetings, Andres Freund
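The crash-safety argument for reverse-order unlinking can be illustrated with a small self-contained simulation — not md.c code; it just models a relation's segment files as a boolean array and checks the invariant that a crash mid-unlink must never leave a "disconnected" higher-numbered segment behind:

```c
#include <assert.h>
#include <stdbool.h>

#define NSEGS 5

/*
 * Simulate unlinking the NSEGS segment files of a relation, crashing
 * after "crash_after" unlinks.  present[] records which segment files
 * survive the crash.
 */
static void unlink_segments(bool reverse, int crash_after, bool present[NSEGS])
{
    int i;

    for (i = 0; i < NSEGS; i++)
        present[i] = true;

    for (i = 0; i < crash_after && i < NSEGS; i++)
        present[reverse ? NSEGS - 1 - i : i] = false;
}

/*
 * Invariant: once some segment is gone, every higher-numbered segment
 * must be gone too; otherwise a later reuse of the relfilenode can make
 * the orphaned segment "reappear".
 */
static bool no_disconnected_segment(const bool present[NSEGS])
{
    int i;

    for (i = 0; i + 1 < NSEGS; i++)
        if (!present[i] && present[i + 1])
            return false;
    return true;
}
```

Iterating backward keeps the invariant at every possible crash point; iterating forward violates it as soon as the first unlink completes.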
Re: base backup vs. concurrent truncation
Hi, On 2023-05-08 08:57:08 -0400, Robert Haas wrote: > On Mon, May 1, 2023 at 12:54 PM Aleksander Alekseev > wrote: > > So I'm still unable to reproduce the described scenario, at least on PG16. > > Well, that proves that either (1) the scenario that I described is > impossible for some unknown reason or (2) something is wrong with your > test scenario. I bet it's (2), but if it's (1), it would be nice to > know what the reason is. One can't feel good about code that appears > on the surface to be broken even if one knows that some unknown > magical thing is preventing disaster. It seems pretty easy to create disconnected segments. You don't even need a basebackup for it. To make it easier, I rebuilt with segsize_blocks=16. This isn't required, it just makes it a lot cheaper to play around. To no one's surprise: I'm not a patient person... Started server with autovacuum=off. DROP TABLE IF EXISTS large; CREATE TABLE large AS SELECT generate_series(1, 10); SELECT current_setting('data_directory') || '/' || pg_relation_filepath('large'); ls -l /srv/dev/pgdev-dev/base/5/24585* shows lots of segments. Attach gdb, set breakpoint on truncate. DROP TABLE large; breakpoint will fire. Continue once. In concurrent session, trigger checkpoint. Due to the checkpoint we'll not replay any WAL record. And the checkpoint will unlink the first segment. Kill the server. After crash recovery, you end up with all but the first segment still present. As the first segment doesn't exist anymore, nothing prevents that oid from being recycled in the future. Once it is recycled and the first segment grows large enough, the later segments will suddenly re-appear. It's not quite so trivial to reproduce issues with partial truncations / concurrent base backups. The problem is that it's hard to guarantee the iteration order of the base backup process. You'd just need to write a manual base backup script though. 
Consider a script mimicking the filesystem returning directory entries in "reverse name order". Recipe includes two sessions. One (BB) doing a base backup, the other (DT) running VACUUM making the table shorter. BB: Copy .2 BB: Copy .1 DT: Truncate relation to < SEGSIZE BB: Copy The replay of the smgrtruncate record will determine the relation size to figure out what segments to remove. Because is < SEGSIZE it'll only truncate , not .N. And boom, a disconnected segment. (I'll post a separate email about an evolved proposal about fixing this set of issues) Greetings, Andres Freund
Re: [PATCH] Add native windows on arm64 support
Andres Freund writes: > I don't really have feelings either way - but haven't we gone further and even > backpatched things like spinlock support for new arches in the past? Mmmm ... don't really think those cases were comparable. We weren't adding support for a whole new OS. Now, you might argue that Windows on arm64 will be just like Windows on x86_64, but I think the jury is still out on that. Microsoft was so Intel-only for so many years that I bet they've had to change quite a bit to make it go on ARM. Also, the cases of back-patched spinlock support that I can find in the last few years were pretty low-risk. I'll grant that c32fcac56 was a bit blue-sky because hardly anybody had RISC-V at that point, but by the same token anybody relying on it at the time would be dealing with a beta-grade OS too. On the other hand, 1c72d82c2 was immediately testable in the buildfarm, and f3bd00c01 was importing code already verified by our OpenBSD packagers. As I said upthread, this seems like something to put in at the beginning of a dev cycle, not post-feature-freeze. regards, tom lane
Re: issue with meson builds on msys2
Hi, On 2023-05-05 07:08:39 -0400, Andrew Dunstan wrote: > On 2023-05-04 Th 19:54, Andres Freund wrote: > > Hm. I can't reproduce this in my test win10 VM, unfortunately. What OS / OS > > version is the host? Any chance to get systeminfo.exe output or something > > like > > that? > > > It's a Windows Server 2019 (v 1809) instance running on AWS. Hm. When I hit the python issue I also couldn't repro it on Windows 10. Cirrus was also using Windows Server 2019... > > I think we ought to do something here. If newer environments cause failures > > like this, it seems likely that this will spread to more and more > > applications > > over time... > > > > Just to reassure myself I have not been hallucinating, I repeated the test. > > > pgrunner@EC2AMAZ-GCB871B UCRT64 ~/bf/root/HEAD/inst > $ /usr/bin/perl -e 'system(qq{"bin/pg_ctl" -D data-C -w -l logfile start > > startlog 2>&1}) ; print $? ? "BANG: $?\n" : "OK\n";' > OK > > pgrunner@EC2AMAZ-GCB871B UCRT64 ~/bf/root/HEAD/inst > $ /usr/bin/perl -e 'system(qq{"bin/pg_ctl" -D data-C -w -l logfile stop > > stoplog 2>&1}) ; print $? ? "BANG: $?\n" : "OK\n";' > BANG: 33280 Oh, so it only happens when stopping, never when starting? That's interesting... > If you want to play I can arrange access. That'd be very helpful. Greetings, Andres Freund
Re: pg_stat_io for the startup process
I've reviewed both your patch versions and responded to the ideas in both of them and the associated emails below. On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi wrote: > > At Tue, 25 Apr 2023 16:04:23 -0700, Andres Freund wrote > in > > key those stats by oid. However, it *does* report the read/write time. But > > only at process exit, of course. The weird part is that the startup process > > does *NOT* increase pg_stat_database.blks_read/blks_hit, because instead of > > basing those on pgBufferUsage.shared_blks_read etc, we compute them based on > > the relation level stats. pgBufferUsage is just used for EXPLAIN. This > > isn't > > recent, afaict. > > I see four issues here. > > 1. The current database stats omits buffer fetches that don't > originate from a relation. > > In this case pgstat_relation can't work since recovery isn't conscious > of relids. We might be able to resolve relfilenode into a relid, but > it may not be that simple. Fortunately we already count fetches and > hits process-wide using pgBufferUsage, so we can use this for database > stats. ... > > TL;DR: Currently the startup process maintains blk_read_time, > > blk_write_time, > > but doesn't maintain blks_read, blks_hit - which doesn't make sense. > > As a result, the attached patch, which is meant for discussion, allows > pg_stat_database to show fetches and reads by the startup process as > the counts for the database with id 0. I would put this in its own patch in a patchset. Of course it relies on having pgstat_report_stat() called at appropriate times by the startup process, but having pg_stat_database show read/hit counts is a separate issue than having pg_stat_io do so. I'm not suggesting we do this, but you could argue that if we fix the startup process stats reporting that pg_stat_database not showing reads and hits for the startup process is a bug that also exists in 15. 
Not directly related, but I do not get why the existing stats counters for pg_stat_database count "fetches" and "hits" and then use subtraction to get the number of reads. I find it confusing and seems like it could lead to subtle inconsistencies with those counters counting reads closer to where they are actually happening. > 2. Even if we wanted to report stats for the startup process, > pgstat_report_stats wouldn't permit it since transaction-end > timestamp doesn't advance. > > I'm not certain if it's the correct approach, but perhaps we could use > GetCurrentTimestamp() instead of GetCurrentTransactionStopTimestamp() > specifically for the startup process. In theory, since all of the approaches proposed in this thread would exercise rigid control over how often we flush stats in the startup process, I think it is okay to use GetCurrentTimestamp() when pgstat_report_stat() is called by the startup process (i.e. we don't have to worry about overhead of doing it). But looking at it implemented in the patch made me feel unsettled for some reason. > 3. When should we call pgstat_report_stats on the startup process? > > During recovery, I think we can call pgstat_report_stats() (or a > subset of it) right before invoking WaitLatch and at segment > boundaries. I see in the patch you call pgstat_report_stat() in XLogPageRead(). Will this only be called on segment boundaries? > 4. In the existing ReadBuffer_common, there's an inconsistency in > counting hits and reads between pgstat_io and pgBufferUsage. > > The difference comes from the case of RBM_ZERO pages. We should simply > align them. I would definitely make this a separate patch and probably a separate thread. It isn't related to the startup process and is worth a separate discussion. 
On Thu, Apr 27, 2023 at 10:43 PM Kyotaro Horiguchi wrote: > > At Fri, 28 Apr 2023 11:15:51 +0900 (JST), Kyotaro Horiguchi > wrote in > > At Thu, 27 Apr 2023 17:30:40 -0400, Melanie Plageman > > wrote in > > > After a quick example implementation of this, I found that it seemed to > > > try and flush the stats less often on an idle standby (good) than using > > > enable_timeout_every(). > > > > Just rearming with the full-interval will work that way. Our existing > > strategy for this is seen in PostgresMain(). > > > >stats_timeout = pgstat_report_stat(false); > >if (stats_timeout > 0) > >{ > > if (!get_timeout_active(BLAH_TIMEOUT)) > > enable_timeout_after(BLAH_TIMEOUT, stats_timeout); > >} > >else > >{ > >if (get_timeout_active(BLAH_TIMEOUT)) > >disable_timeout(BLAH_TIMEOUT, false); > >} > >WaitLatch(); > > Im my example, I left out idle-time flushing, but I realized we don't > need the timeout mechanism here since we're already managing it. So > the following should work (assuming the timestamp updates with > GetCurrentTimestamp() in my last patch). > > @@ -3889,13 +3900,23 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool > randAccess, > /* Update pg_stat_recovery_prefetch
Re: Add LZ4 compression in pg_dump
--- Original Message --- On Monday, May 8th, 2023 at 8:20 PM, Tomas Vondra wrote: > > > > > On 5/8/23 18:19, gkokola...@pm.me wrote: > > > --- Original Message --- > > On Monday, May 8th, 2023 at 3:16 AM, Tom Lane t...@sss.pgh.pa.us wrote: > > > > > I wrote: > > > > > > > Michael Paquier mich...@paquier.xyz writes: > > > > > > > > > While testing this patch, I have triggered an error pointing out that > > > > > the decompression path of LZ4 is broken for table data. I can > > > > > reproduce that with a dump of the regression database, as of: > > > > > make installcheck > > > > > pg_dump --format=d --file=dump_lz4 --compress=lz4 regression > > > > > > > Ugh. Reproduced here ... so we need an open item for this. > > > > > > BTW, it seems to work with --format=c. > > > > Thank you for the extra tests. It seems that exists a gap in the test > > coverage. Please find a patch attached that is addressing the issue > > and attempt to provide tests for it. > > > Seems I'm getting messages with a delay - this is mostly the same fix I > ended up with, not realizing you already posted a fix. Thank you very much for looking. > I don't think we need the local "in" variable - the pointer parameter is > local in the function, so we can modify it directly (with a cast). > WriteDataToArchiveLZ4 does it that way too. Sure, patch updated. > The tests are definitely a good idea. Thank you. > I wonder if we should add a > comment to DEFAULT_IO_BUFFER_SIZE mentioning that if we choose to > increase the value in the future, we needs to tweak the tests too to use > more data in order to exercise the buffering etc. Maybe it's obvious? > You are right. Added a comment both in the header and in the test. I hope v2 gets closer to closing the open item for this. 
Cheers, //Georgios > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL CompanyFrom 89e7066d6c3c6a7eeb147c3f2d345c3046a4d155 Mon Sep 17 00:00:00 2001 From: Georgios Kokolatos Date: Mon, 8 May 2023 19:48:03 + Subject: [PATCH v2] Advance input pointer when LZ4 compressing data LZ4File_write() did not advance the input pointer on subsequent invocations of LZ4F_compressUpdate(). As a result the generated compressed output would be a compressed version of the same input chunk. WriteDataToArchiveLZ4() which is also using LZ4F_compressUpdate() did not suffer from this omission. Tests failed to catch this error because all of their input would comfortably fit within the same input chunk. Tests have been added to provide adequate coverage. --- src/bin/pg_dump/compress_io.h| 7 - src/bin/pg_dump/compress_lz4.c | 2 ++ src/bin/pg_dump/t/002_pg_dump.pl | 46 3 files changed, 54 insertions(+), 1 deletion(-) diff --git a/src/bin/pg_dump/compress_io.h b/src/bin/pg_dump/compress_io.h index fd8752db0d..e8efc57f1a 100644 --- a/src/bin/pg_dump/compress_io.h +++ b/src/bin/pg_dump/compress_io.h @@ -17,7 +17,12 @@ #include "pg_backup_archiver.h" -/* Default size used for IO buffers */ +/* + * Default size used for IO buffers + * + * When altering this value it might be useful to verify that the relevant tests + * cases are meaningfully updated to provide coverage. + */ #define DEFAULT_IO_BUFFER_SIZE 4096 extern char *supports_compression(const pg_compress_specification compression_spec); diff --git a/src/bin/pg_dump/compress_lz4.c b/src/bin/pg_dump/compress_lz4.c index f97b7550d1..b869780c0b 100644 --- a/src/bin/pg_dump/compress_lz4.c +++ b/src/bin/pg_dump/compress_lz4.c @@ -588,6 +588,8 @@ LZ4Stream_write(const void *ptr, size_t size, CompressFileHandle *CFH) errno = (errno) ? 
errno : ENOSPC; return false; } + + ptr = ((const char *) ptr) + chunk; } return true; diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl index 93e24d5145..d66f3b42ea 100644 --- a/src/bin/pg_dump/t/002_pg_dump.pl +++ b/src/bin/pg_dump/t/002_pg_dump.pl @@ -3108,6 +3108,52 @@ my %tests = ( }, }, + 'CREATE TABLE test_compression_method' => { + create_order => 110, + create_sql => 'CREATE TABLE dump_test.test_compression_method ( + col1 text + );', + regexp => qr/^ + \QCREATE TABLE dump_test.test_compression_method (\E\n + \s+\Qcol1 text\E\n + \Q);\E + /xm, + like => { + %full_runs, + %dump_test_schema_runs, + section_pre_data => 1, + }, + unlike => { + exclude_dump_test_schema => 1, + only_dump_measurement=> 1, + }, + }, + + # Insert enough data to surpass DEFAULT_IO_BUFFER_SIZE during + # (de)compression operations + 'COPY test_compression_method' => { + create_order => 111, + create_sql => 'INSERT INTO dump_test.test_compression_method (col1) ' + . 'SELECT string_agg(a::text, \'\') FROM generate_series(1,4096) a;', + regexp => qr/^ + \QCOPY dump_test.test_compression_method (col1) FROM stdin;\E + \n(?:\d{15277}\n){1}\\\.\n + /xm, + like => {
Re: [PATCH] Add native windows on arm64 support
Hi,

On 2023-05-06 00:35:40 -0400, Tom Lane wrote:
> Michael Paquier writes:
>> Seeing how simple this has become, I would be really tempted to get
>> that applied, especially if there's a buildfarm machine able to follow
>> up.. On the one hand, we are in a stability period for v16, so it may
>> not be the best moment ever. On the other hand, waiting one more year
>> sounds like a waste for a patch that just adds new code paths.
>
> Indeed, there's not much in this patch ... but it's impossible to see
> how "run on an entirely new platform" isn't a new feature. Moreover,
> it's a platform that very few of us will have any ability to support
> or debug for. I can't see how it's a good idea to shove this in
> post-feature-freeze. Let's plan to push it in when v17 opens.

I don't really have feelings either way - but haven't we gone further
and even backpatched things like spinlock support for new arches in the
past?

Greetings,

Andres Freund
Re: benchmark results comparing versions 15.2 and 16
Hi,

On 2023-05-08 16:00:01 +0300, Alexander Lakhin wrote:
> This difference is confirmed by multiple test runs. `git bisect` for this
> regression pointed at f193883fc.

I can reproduce a significant regression due to f193883fc of a workload
just running
  SELECT CURRENT_TIMESTAMP;

A single session running it on my workstation via pgbench -Mprepared gets

before: tps = 89359.128359 (without initial connection time)
after:  tps = 83843.585152 (without initial connection time)

Obviously this is an extreme workload, but that nevertheless seems too
large to just accept...

Michael, the commit message notes that there was no measured performance
regression - yet I see one in a trivial test. What were you measuring?

I'm a bit surprised by the magnitude of the regression, but it's not
surprising that there is a performance effect. You're replacing something
that doesn't go through the whole generic function rigamarole with
something that does...

Looking at two perf profiles, the biggest noticeable difference is

Before:

-    5.51%     0.13%  postgres  postgres  [.] ExecInitResult
   - 5.38% ExecInitResult
      + 2.29% ExecInitResultTupleSlotTL
      - 2.22% ExecAssignProjectionInfo
         - 2.19% ExecBuildProjectionInfo
              0.47% ExecReadyInterpretedExpr
            - 0.43% ExecInitExprRec
               - 0.10% palloc
                    AllocSetAlloc.localalias (inlined)
            + 0.32% expression_tree_walker_impl.localalias (inlined)
            + 0.28% get_typlen
              0.09% ExecPushExprSlots
            + 0.06% MemoryContextAllocZeroAligned
         + 0.04% MemoryContextAllocZeroAligned
           0.02% exprType.localalias (inlined)
      + 0.41% ExecAssignExprContext
      + 0.35% MemoryContextAllocZeroAligned
        0.11% ExecInitQual.localalias (inlined)
   + 0.11% _start
   + 0.02% 0x55b89c764d7f

After:

-    6.57%     0.17%  postgres  postgres  [.] ExecInitResult
   - 6.40% ExecInitResult
      - 3.00% ExecAssignProjectionInfo
         - ExecBuildProjectionInfo
            - 0.91% ExecInitExprRec
               - 0.65% ExecInitFunc
                    0.23% fmgr_info_cxt_security
                  + 0.18% palloc0
                  + 0.07% object_aclcheck
                    0.04% fmgr_info
                 0.05% check_stack_depth
               + 0.05% palloc
            + 0.58% expression_tree_walker_impl.localalias (inlined)
            + 0.55% get_typlen
              0.37% ExecReadyInterpretedExpr
            + 0.11% MemoryContextAllocZeroAligned
              0.09% ExecPushExprSlots
              0.04% exprType.localalias (inlined)
      + 2.77% ExecInitResultTupleSlotTL
      + 0.50% ExecAssignExprContext
      + 0.09% MemoryContextAllocZeroAligned
        0.05% ExecInitQual.localalias (inlined)
   + 0.10% _start

I.e. we spend more time building the expression state for expression
evaluation, because we now go through the generic ExecInitFunc(),
instead of something dedicated. We also now need to do permission
checking etc.

I don't think that's the entirety of the regression...

Greetings,

Andres Freund
Re: Add LZ4 compression in pg_dump
On 5/8/23 18:19, gkokola...@pm.me wrote:
>
> --- Original Message ---
> On Monday, May 8th, 2023 at 3:16 AM, Tom Lane wrote:
>
>> I wrote:
>>
>>> Michael Paquier mich...@paquier.xyz writes:
>>> While testing this patch, I have triggered an error pointing out that
>>> the decompression path of LZ4 is broken for table data. I can
>>> reproduce that with a dump of the regression database, as of:
>>> make installcheck
>>> pg_dump --format=d --file=dump_lz4 --compress=lz4 regression
>>
>>> Ugh. Reproduced here ... so we need an open item for this.
>>
>> BTW, it seems to work with --format=c.
>
> Thank you for the extra tests. It seems that a gap exists in the test
> coverage. Please find attached a patch that addresses the issue and
> attempts to provide tests for it.

Seems I'm getting messages with a delay - this is mostly the same fix I
ended up with, not realizing you already posted a fix.

I don't think we need the local "in" variable - the pointer parameter is
local to the function, so we can modify it directly (with a cast).
WriteDataToArchiveLZ4 does it that way too.

The tests are definitely a good idea. I wonder if we should add a
comment to DEFAULT_IO_BUFFER_SIZE mentioning that if we choose to
increase the value in the future, we need to tweak the tests too to use
more data in order to exercise the buffering etc. Maybe it's obvious?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Add LZ4 compression in pg_dump
On 5/8/23 03:16, Tom Lane wrote:
> I wrote:
>> Michael Paquier writes:
>>> While testing this patch, I have triggered an error pointing out that
>>> the decompression path of LZ4 is broken for table data. I can
>>> reproduce that with a dump of the regression database, as of:
>>> make installcheck
>>> pg_dump --format=d --file=dump_lz4 --compress=lz4 regression
>
>> Ugh. Reproduced here ... so we need an open item for this.
>
> BTW, it seems to work with --format=c.

The LZ4Stream_write() forgot to move the pointer to the next chunk, so
it was happily compressing the initial chunk over and over. A bit of an
embarrassing oversight :-(

The custom format calls WriteDataToArchiveLZ4(), which was correct.

The attached patch fixes this for me.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/src/bin/pg_dump/compress_lz4.c b/src/bin/pg_dump/compress_lz4.c
index 423e1b7976f..43c4b9187ef 100644
--- a/src/bin/pg_dump/compress_lz4.c
+++ b/src/bin/pg_dump/compress_lz4.c
@@ -584,6 +584,8 @@ LZ4Stream_write(const void *ptr, size_t size, CompressFileHandle *CFH)
 			errno = (errno) ? errno : ENOSPC;
 			return false;
 		}
+
+		ptr = ((char *) ptr) + chunk;
 	}

 	return true;
Re: Improve list manipulation in several places
On Mon, May 8, 2023 at 2:26 PM Alvaro Herrera wrote:
> On 2023-May-08, Peter Eisentraut wrote:
>
>> On 23.04.23 08:42, Richard Guo wrote:
>>> Thanks for the suggestion. I've split the patch into two as attached.
>>> 0001 is just a minor simplification by replacing lfirst(list_head(list))
>>> with linitial(list). 0002 introduces new functions to reduce the
>>> movement of list elements in several places so as to gain performance
>>> improvement and benefit future callers.
>>
>> These look sensible to me. If you could show some numbers that support
>> the claim that there is a performance advantage, it would be even more
>> convincing.
>
> 0001 looks fine.
>
> The problem I see is that each of these new functions has a single
> caller, and the only one that looks like it could have a performance
> advantage is list_copy_move_nth_to_head() (which is the weirdest of the
> lot). I'm inclined not to have any of these single-use functions unless
> a performance case can be made for them.

I think you missed list_nth_xid; it makes perfect sense to exist.

regards,
Ranier Vilela
Re: Improve list manipulation in several places
On 2023-May-08, Peter Eisentraut wrote:

> On 23.04.23 08:42, Richard Guo wrote:
>> Thanks for the suggestion. I've split the patch into two as attached.
>> 0001 is just a minor simplification by replacing lfirst(list_head(list))
>> with linitial(list). 0002 introduces new functions to reduce the
>> movement of list elements in several places so as to gain performance
>> improvement and benefit future callers.
>
> These look sensible to me. If you could show some numbers that support
> the claim that there is a performance advantage, it would be even more
> convincing.

0001 looks fine.

The problem I see is that each of these new functions has a single
caller, and the only one that looks like it could have a performance
advantage is list_copy_move_nth_to_head() (which is the weirdest of the
lot). I'm inclined not to have any of these single-use functions unless
a performance case can be made for them.

--
Álvaro Herrera
PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: Tab completion for CREATE SCHEMAAUTHORIZATION
Michael Paquier writes:
> On Tue, May 02, 2023 at 01:19:49PM +0100, Dagfinn Ilmari Mannsåker wrote:
>> Dagfinn Ilmari Mannsåker writes:
>>> This is for completing the word CREATE itself after CREATE SCHEMA
>>> [[] AUTHORIZATION] . The things that can come after that
>>> are already handled generically earlier in the function:
>>>
>>> /* CREATE */
>>> /* complete with something you can create */
>>> else if (TailMatches("CREATE"))
>>>     matches = rl_completion_matches(text, create_command_generator);
>>>
>>> create_command_generator uses the words_after_create array, which lists
>>> all the things that can be created.
>
> You are right. I have completely forgotten that this code path would
> append everything that supports CREATE for a CREATE SCHEMA command :)
>
>> But, looking closer at the docs, only tables, views, indexes, sequences
>> and triggers can be created as part of a CREATE SCHEMA statement. Maybe
>> we should add a HeadMatches("CREATE", "SCHEMA") exception in the above?
>
> Yes, it looks like we are going to need an exception and append only
> the keywords that are supported, or we will end up recommending mostly
> things that are not accepted by the parser.

Here's an updated v3 patch with that.

While adding that, I noticed that CREATE UNLOGGED only tab-completes
TABLE and MATERIALIZED VIEW, not SEQUENCE, so I added that (and removed
MATERIALIZED VIEW when part of CREATE SCHEMA).
- ilmari

From 31856bf5397253da76cbce9ccb590855a44da30d Mon Sep 17 00:00:00 2001
From: tanghy
Date: Mon, 9 Aug 2021 18:47:12 +0100
Subject: [PATCH v3] Add tab completion for CREATE SCHEMA

- AUTHORIZATION both in addition to and after a schema name
- list of owner roles after AUTHORIZATION
- CREATE and GRANT after any otherwise-complete command
- Only the allowed object types after CREATE

Also adjust completion after CREATE UNLOGGED:

- Add SEQUENCE
- Don't suggest MATERIALIZED VIEW in CREATE TABLE
---
 src/bin/psql/tab-complete.c | 40 +++++++++++++++++++++++++++++-------
 1 file changed, 33 insertions(+), 7 deletions(-)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index bd04244969..66b3f00c1b 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1031,9 +1031,13 @@ static const SchemaQuery Query_for_trigger_of_table = {
 " FROM pg_catalog.pg_roles "\
 " WHERE rolname LIKE '%s'"

+/* add these to Query_for_list_of_roles in OWNER contexts */
+#define Keywords_for_list_of_owner_roles \
+"CURRENT_ROLE", "CURRENT_USER", "SESSION_USER"
+
 /* add these to Query_for_list_of_roles in GRANT contexts */
 #define Keywords_for_list_of_grant_roles \
-"PUBLIC", "CURRENT_ROLE", "CURRENT_USER", "SESSION_USER"
+Keywords_for_list_of_owner_roles, "PUBLIC"

 #define Query_for_all_table_constraints \
 "SELECT conname "\
@@ -1785,7 +1789,13 @@ psql_completion(const char *text, int start, int end)
 	/* CREATE */
 	/* complete with something you can create */
 	else if (TailMatches("CREATE"))
-		matches = rl_completion_matches(text, create_command_generator);
+		/* only some object types can be created as part of CREATE SCHEMA */
+		if (HeadMatches("CREATE", "SCHEMA"))
+			COMPLETE_WITH("TABLE", "VIEW", "INDEX", "SEQUENCE", "TRIGGER",
+						  /* for INDEX and TABLE/SEQUENCE, respectively */
+						  "UNIQUE", "UNLOGGED");
+		else
+			matches = rl_completion_matches(text, create_command_generator);

 	/* complete with something you can create or replace */
 	else if (TailMatches("CREATE", "OR", "REPLACE"))
@@ -3154,6 +3164,20 @@ psql_completion(const char *text, int start, int end)
 	else if (TailMatches("AS", "ON", "SELECT|UPDATE|INSERT|DELETE", "TO"))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_tables);

+/* CREATE SCHEMA [ ] [ AUTHORIZATION ] */
+	else if (Matches("CREATE", "SCHEMA"))
+		COMPLETE_WITH_QUERY_PLUS(Query_for_list_of_schemas,
+								 "AUTHORIZATION");
+	else if (Matches("CREATE", "SCHEMA", "AUTHORIZATION") ||
+			 Matches("CREATE", "SCHEMA", MatchAny, "AUTHORIZATION"))
+		COMPLETE_WITH_QUERY_PLUS(Query_for_list_of_roles,
+								 Keywords_for_list_of_owner_roles);
+	else if (Matches("CREATE", "SCHEMA", "AUTHORIZATION", MatchAny) ||
+			 Matches("CREATE", "SCHEMA", MatchAny, "AUTHORIZATION", MatchAny))
+		COMPLETE_WITH("CREATE", "GRANT");
+	else if (Matches("CREATE", "SCHEMA", MatchAny))
+		COMPLETE_WITH("AUTHORIZATION", "CREATE", "GRANT");
+
 	/* CREATE SEQUENCE --- is allowed inside CREATE SCHEMA, so use TailMatches */
 	else if (TailMatches("CREATE", "SEQUENCE", MatchAny) ||
 			 TailMatches("CREATE", "TEMP|TEMPORARY", "SEQUENCE", MatchAny))
@@ -3185,9 +3209,13 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE TEMP/TEMPORARY" with the possible temp objects */
 	else if (TailMatches("CREATE", "TEMP|TEMPORARY"))
 		COMPLETE_WITH("SEQUENCE", "TABLE", "VIEW");
-	/* Complete "CREATE UNLOGGED" with TABLE or MATVIEW */
+	/* Complete "CREATE UNLOGGED" with TABLE, SEQUENCE or MATVIEW */
 	else if
Re: Proposal for Prototype Implementation to Enhance C/C++ Interoperability in PostgreSQL
On 08.05.23 04:38, 盏一 wrote:
>> It seems extremely specific to one particular C++ implementation
>
> To perform a force unwind during longjmp, the _Unwind_ForcedUnwind
> function is used. This function is defined in the [Itanium C++ ABI
> Standard](https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html#base-throw),
> which is followed by all C++ implementations. Additionally, the glibc
> [nptl/unwind.c](https://elixir.bootlin.com/glibc/latest/source/nptl/unwind.c#L130)
> file shows that on all platforms, pthread_exit is also implemented
> using _Unwind_ForcedUnwind. Furthermore, the Itanium C++ ABI
> specification also defines _Unwind_RaiseException as the entry point
> for all C++ exceptions thrown.

I ran your patch through Cirrus CI, and it passed on Linux but failed on
FreeBSD, macOS, and Windows. You should fix that if you want to
alleviate the concerns about the portability of this approach.
Re: Add LZ4 compression in pg_dump
--- Original Message ---
On Monday, May 8th, 2023 at 3:16 AM, Tom Lane wrote:

> I wrote:
>
>> Michael Paquier mich...@paquier.xyz writes:
>>
>>> While testing this patch, I have triggered an error pointing out that
>>> the decompression path of LZ4 is broken for table data. I can
>>> reproduce that with a dump of the regression database, as of:
>>> make installcheck
>>> pg_dump --format=d --file=dump_lz4 --compress=lz4 regression
>
>> Ugh. Reproduced here ... so we need an open item for this.
>
> BTW, it seems to work with --format=c.

Thank you for the extra tests. It seems that a gap exists in the test
coverage. Please find attached a patch that addresses the issue and
attempts to provide tests for it.

Cheers,
//Georgios

> regards, tom lane

From 8c6c86c362820e93f066992ede6e5ca23f128807 Mon Sep 17 00:00:00 2001
From: Georgios Kokolatos
Date: Mon, 8 May 2023 15:25:25 +
Subject: [PATCH v1] Advance input pointer when LZ4 compressing data

LZ4File_write() did not advance the input pointer on subsequent
invocations of LZ4F_compressUpdate(). As a result the generated
compressed output would be a compressed version of the same input chunk.

WriteDataToArchiveLZ4() which is also using LZ4F_compressUpdate() did
not suffer from this omission.

Tests failed to catch this error because all of their input would
comfortably fit within the same input chunk. Tests have been added to
provide adequate coverage.
---
 src/bin/pg_dump/compress_lz4.c   |  5 ++++-
 src/bin/pg_dump/t/002_pg_dump.pl | 44 ++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_dump/compress_lz4.c b/src/bin/pg_dump/compress_lz4.c
index f97b7550d1..76211c82f0 100644
--- a/src/bin/pg_dump/compress_lz4.c
+++ b/src/bin/pg_dump/compress_lz4.c
@@ -564,6 +564,7 @@ LZ4Stream_write(const void *ptr, size_t size, CompressFileHandle *CFH)
 	LZ4State   *state = (LZ4State *) CFH->private_data;
 	size_t		status;
 	int			remaining = size;
+	const void *in = ptr;

 	/* Lazy init */
 	if (!LZ4Stream_init(state, size, true))
@@ -576,7 +577,7 @@ LZ4Stream_write(const void *ptr, size_t size, CompressFileHandle *CFH)
 		remaining -= chunk;

 		status = LZ4F_compressUpdate(state->ctx, state->buffer, state->buflen,
-									 ptr, chunk, NULL);
+									 in, chunk, NULL);
 		if (LZ4F_isError(status))
 		{
 			state->errcode = status;
@@ -588,6 +589,8 @@ LZ4Stream_write(const void *ptr, size_t size, CompressFileHandle *CFH)
 			errno = (errno) ? errno : ENOSPC;
 			return false;
 		}
+
+		in = ((const char *) in) + chunk;
 	}

 	return true;
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 93e24d5145..c6b1225815 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -3108,6 +3108,50 @@ my %tests = (
 		},
 	},

+	'CREATE TABLE test_compression_method' => {
+		create_order => 110,
+		create_sql => 'CREATE TABLE dump_test.test_compression_method (
+						col1 text
+					);',
+		regexp => qr/^
+			\QCREATE TABLE dump_test.test_compression_method (\E\n
+			\s+\Qcol1 text\E\n
+			\Q);\E
+			/xm,
+		like => {
+			%full_runs,
+			%dump_test_schema_runs,
+			section_pre_data => 1,
+		},
+		unlike => {
+			exclude_dump_test_schema => 1,
+			only_dump_measurement => 1,
+		},
+	},
+
+	'COPY test_compression_method' => {
+		create_order => 111,
+		create_sql => 'INSERT INTO dump_test.test_compression_method (col1) '
+		  . 'SELECT string_agg(a::text, \'\') FROM generate_series(1,4096) a;',
+		regexp => qr/^
+			\QCOPY dump_test.test_compression_method (col1) FROM stdin;\E
+			\n(?:\d{15277}\n){1}\\\.\n
+			/xm,
+		like => {
+			%full_runs,
+			data_only => 1,
+			section_data => 1,
+			only_dump_test_schema => 1,
+			test_schema_plus_large_objects => 1,
+		},
+		unlike => {
+			binary_upgrade => 1,
+			exclude_dump_test_schema => 1,
+			schema_only => 1,
+			only_dump_measurement => 1,
+		},
+	},
+
 	'CREATE TABLE fk_reference_test_table' => {
 		create_order => 21,
 		create_sql => 'CREATE TABLE dump_test.fk_reference_test_table (
--
2.34.1
Re: Memory leak from ExecutorState context?
Thanks for continuing to work on this!

On Thu, May 04, 2023 at 07:30:06PM +0200, Jehan-Guillaume de Rorthais wrote:
> On Fri, 21 Apr 2023 16:44:48 -0400 Melanie Plageman wrote:
...
>> I think the biggest change that is needed is to implement this memory
>> context usage for parallel hash join.
>
> Indeed.
...
> 4. accessor->read_buffer is currently allocated in accessor->context as well.
>
>    This buffer holds tuple read from the fileset. This is still a buffer, but
>    not related to any file anymore...
>
> Because of 3 and 4, and the gross context granularity of SharedTuplestore
> related-code, I'm now wondering if "fileCxt" shouldn't be renamed "bufCxt".

I think bufCxt is a potentially confusing name. The context contains
pointers to the batch files, and saying those are related to buffers is
confusing. It also sounds like it could include any kind of buffer
related to the hashtable or hash join.

Perhaps we could call this memory context the "spillCxt"--indicating it
is for the memory required for spilling to permanent storage while
executing hash joins. I discuss this more in my code review below.
> From c5ed2ae2c2749af4f5058b012dc5e8a9e1529127 Mon Sep 17 00:00:00 2001
> From: Jehan-Guillaume de Rorthais
> Date: Mon, 27 Mar 2023 15:54:39 +0200
> Subject: [PATCH 2/3] Allocate hash batches related BufFile in a dedicated
>  context

> diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
> index 5fd1c5553b..a4fbf29301 100644
> --- a/src/backend/executor/nodeHash.c
> +++ b/src/backend/executor/nodeHash.c
> @@ -570,15 +574,21 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
>
> 	if (nbatch > 1 && hashtable->parallel_state == NULL)
> 	{
> +		MemoryContext oldctx;
> +
> 		/*
> 		 * allocate and initialize the file arrays in hashCxt (not needed for
> 		 * parallel case which uses shared tuplestores instead of raw files)
> 		 */
> +		oldctx = MemoryContextSwitchTo(hashtable->fileCxt);
> +
> 		hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
> 		hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
> 		/* The files will not be opened until needed... */
> 		/* ... but make sure we have temp tablespaces established for them */

I haven't studied it closely, but I wonder if we shouldn't switch back
into the oldctx before calling PrepareTempTablespaces().
PrepareTempTablespaces() is explicit about what memory context it is
allocating in, however, I'm not sure it makes sense to call it in the
fileCxt. If you have a reason, you should add a comment and do the same
in ExecHashIncreaseNumBatches().
> 		PrepareTempTablespaces();
> +
> +		MemoryContextSwitchTo(oldctx);
> 	}

> @@ -934,13 +943,16 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
> 		   hashtable, nbatch, hashtable->spaceUsed);
> #endif
>
> -	oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
> -
> 	if (hashtable->innerBatchFile == NULL)
> 	{
> +		MemoryContext oldcxt = MemoryContextSwitchTo(hashtable->fileCxt);
> +
> 		/* we had no file arrays before */
> 		hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
> 		hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
> +

As mentioned above, you should likely make ExecHashTableCreate()
consistent with this.

> +		MemoryContextSwitchTo(oldcxt);
> +
> 		/* time to establish the temp tablespaces, too */
> 		PrepareTempTablespaces();
> 	}

> @@ -951,8 +963,6 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)

I don't see a reason to call repalloc0_array() in a different context
than the initial palloc0_array().

> 		hashtable->outerBatchFile = repalloc0_array(hashtable->outerBatchFile, BufFile *, oldnbatch, nbatch);
> 	}
>
> -	MemoryContextSwitchTo(oldcxt);
> -
> 	hashtable->nbatch = nbatch;
>
> 	/*

> diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
> index 920d1831c2..ac72fbfbb6 100644
> --- a/src/backend/executor/nodeHashjoin.c
> +++ b/src/backend/executor/nodeHashjoin.c
> @@ -485,8 +485,10 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
> 		 */
> 		Assert(parallel_state == NULL);
> 		Assert(batchno > hashtable->curbatch);
> +
> 		ExecHashJoinSaveTuple(mintuple, hashvalue,
> -							  &hashtable->outerBatchFile[batchno]);
> +							  &hashtable->outerBatchFile[batchno],
> +
Re: Add standard collation UNICODE
On 27.04.23 13:44, Daniel Verite wrote:
> This collation has an empty pg_collation.collversion column, instead of
> being set to the same value as "und-x-icu" to track its version.
>
> The original patch implements this as an INSERT in which it would be
> easy to fix I guess, but in current HEAD it comes as an entry in
> include/catalog/pg_collation.dat:
>
> { oid => '963',
>   descr => 'sorts using the Unicode Collation Algorithm with default settings',
>   collname => 'unicode', collprovider => 'i', collencoding => '-1',
>   colliculocale => 'und' },
>
> Should it be converted back into an INSERT or better left in this file
> and collversion being updated afterwards?

How about we do it with an UPDATE command?  We already do this for
pg_database in a similar way.  See attached patch.

From 91f2aff04f9bf137e4ac6e7df624dde770503bfd Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Mon, 8 May 2023 17:45:46 +0200
Subject: [PATCH] initdb: Set collversion for standard collation UNICODE

Discussion: https://www.postgresql.org/message-id/49417853-7bdd-4b23-a4e9-04c7aff33...@manitou-mail.org
---
 src/bin/initdb/initdb.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 2c208ead01..632f6c9c72 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1695,6 +1695,13 @@ setup_description(FILE *cmdfd)
 static void
 setup_collation(FILE *cmdfd)
 {
+	/*
+	 * Set the collation version for collations defined in pg_collation.dat,
+	 * except the ones where we know that the collation behavior will never
+	 * change.
+	 */
+	PG_CMD_PUTS("UPDATE pg_collation SET collversion = pg_collation_actual_version(oid) WHERE collname = 'unicode';\n\n");
+
 	/* Import all collations we can find in the operating system */
 	PG_CMD_PUTS("SELECT pg_import_system_collations('pg_catalog');\n\n");
 }
--
2.40.0
Re: Improve list manipulation in several places
On 23.04.23 08:42, Richard Guo wrote:
> Thanks for the suggestion. I've split the patch into two as attached.
> 0001 is just a minor simplification by replacing lfirst(list_head(list))
> with linitial(list). 0002 introduces new functions to reduce the
> movement of list elements in several places so as to gain performance
> improvement and benefit future callers.

These look sensible to me. If you could show some numbers that support
the claim that there is a performance advantage, it would be even more
convincing.
Re: [PATCH] Allow Postgres to pick an unused port to listen
Alvaro Herrera writes:
> This made me wonder if storing the unadorned port number is really the
> best way. Suppose we did extend things so that we listen on different
> ports on different interfaces; how would this scheme work at all?

Yeah, the probability that that will happen someday is one of the things
bothering me about this proposal. I'd rather change the file format to
support that first (it can be dummy for now, with all lines showing the
same port), and then document it second.

			regards, tom lane
Re: WAL Insertion Lock Improvements
On Mon, May 8, 2023 at 5:57 PM Bharath Rupireddy wrote:
>
> On Mon, Apr 10, 2023 at 9:38 AM Michael Paquier wrote:
>
>> -LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
>> +LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
>> [...]
>> 	Assert(pg_atomic_read_u32(&lock->state) & LW_VAL_EXCLUSIVE);
>>
>> -	/* Update the lock's value */
>> -	*valptr = val;
>>
>> The sensitive change is in LWLockUpdateVar(). I am not completely
>> sure to understand this removal, though. Does that influence the case
>> where there are waiters?
>
> I'll send about this in a follow-up email to not overload this
> response with too much data.

It helps the case when there are no waiters. IOW, it updates value
without waitlist lock when there are no waiters, so no extra waitlist
lock acquisition/release just to update the value. In turn, it helps
the other backend wanting to flush the WAL looking for the new updated
value of insertingAt in WaitXLogInsertionsToFinish(); now the flushing
backend can get the new value faster.

>> Another thing I was wondering about: how much does the fast-path used
>> in LWLockUpdateVar() influence the performance numbers? Am I right to
>> guess that it counts for most of the gain seen?
>
> I'll send about this in a follow-up email to not overload this
> response with too much data.

The fastpath exit in LWLockUpdateVar() doesn't seem to influence the
results much, see below results. However, it avoids waitlist lock
acquisition when there are no waiters.
test-case 1: -T5, WAL ~16 bytes

clients    HEAD      PATCHED with fast path   PATCHED no fast path
1          1482      1486                     1457
2          1617      1620                     1569
4          3174      3233                     3031
8          6136      6365                     5725
16         12566     12269                    11685
32         24284     23621                    23177
64         50135     45528                    46653
128        94903     89791                    89103
256        82289     152915                   152835
512        62498     138838                   142084
768        57083     125074                   126768
1024       51308     113593                   115930
2048       41084     88764                    85110
4096       19939     42257                    43917

>> Or could it be that
>> the removal of the spin lock in
>> LWLockConflictsWithVar()/LWLockWaitForVar() the point that has the
>> highest effect?
>
> I'll send about this in a follow-up email to not overload this
> response with too much data.

Out of 3 functions that got rid of waitlist lock
(LWLockConflictsWithVar/LWLockWaitForVar, LWLockUpdateVar,
LWLockReleaseClearVar), perf reports tell that the biggest gain (for
the use-cases that I've tried) is for
LWLockConflictsWithVar/LWLockWaitForVar:

test-case 1: -T5, WAL ~16 bytes

HEAD:
+   61.89%     0.05%  postgres  [.] LWLockWaitForVar
+   43.19%     0.12%  postgres  [.] LWLockConflictsWithVar
+    1.62%     0.00%  postgres  [.] LWLockReleaseClearVar

PATCHED:
+   38.79%     0.11%  postgres  [.] LWLockWaitForVar
     0.40%     0.02%  postgres  [.] LWLockConflictsWithVar
+    2.80%     0.00%  postgres  [.] LWLockReleaseClearVar

test-case 6: -T5, WAL 4096 bytes

HEAD:
+   29.66%     0.07%  postgres  [.] LWLockWaitForVar
+   20.94%     0.08%  postgres  [.] LWLockConflictsWithVar
     0.19%     0.03%  postgres  [.] LWLockUpdateVar

PATCHED:
+    3.95%     0.08%  postgres  [.] LWLockWaitForVar
     0.19%     0.03%  postgres  [.] LWLockConflictsWithVar
+    1.73%     0.00%  postgres  [.] LWLockReleaseClearVar

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: [PATCH] Allow Postgres to pick an unused port to listen
On 2023-Apr-19, Yurii Rashkovskii wrote:

> If we consider this path, then (if we assume the format of the file is
> still to be private), we can make the port line accept multiple ports
> using a delimiter like `:` so that the next line still remains the same.

This made me wonder if storing the unadorned port number is really the
best way. Suppose we did extend things so that we listen on different
ports on different interfaces; how would this scheme work at all?

I suspect it would be simpler to store both the interface address and
the port, perhaps separated by :. You would keep it to one pair per
line, so you'd get the IPv6 address/port separately from the IPv4
address, for example. And if you happen to have multiple addresses, you
know exactly which ones you're listening on.

To match a problem that has been discussed in the lists previously,
suppose you have listen_addresses='localhost' and the resolver does
funny things with that name (say you mess up /etc/hosts after
starting). Things would be much simpler if you knew exactly what the
resolver did at postmaster start time.

> (I consider this feature so small that it doesn't deserve such a
> lengthy discussion. However, I also get Tom's point about how we
> document this

You should see the discussion that led to the addition of psql's 'exit'
command sometime:
https://www.postgresql.org/message-id/flat/CALVFHFb-C_5_94hueWg6Dd0zu7TfbpT7hzsh9Zf0DEDOSaAnfA%40mail.gmail.com#949321e44856b7fa295834d6a3997ab4

--
Álvaro Herrera
PostgreSQL Developer — https://www.EnterpriseDB.com/
"Every machine is a smoke machine if you operate it wrong enough."
https://twitter.com/libseybieda/status/1541673325781196801
Re: benchmark results comparing versions 15.2 and 16
Hello Mark,

05.05.2023 20:45, MARK CALLAGHAN wrote:
> This is mostly a hobby project for me - my other hobby is removing
> invasive weeds. I am happy to answer questions and run more tests, but
> turnaround for answers won't be instant. Getting results from Linux
> perf for these tests is on my TODO list. For now I am just re-running
> a subset of these to get more certainty that the regressions are real
> and not noise.

It's a very interesting topic to me, too. I had developed some scripts
to measure and compare postgres's performance using miscellaneous
public benchmarks (ycsb, tpcds, benchmarksql_tpcc, htapbench, benchbase,
gdprbench, s64da-benchmark, ...).

Having compared 15.3 (56e869a09) with master (58f5edf84), I haven't seen
significant regressions except a few minor ones.

First regression observed with a simple pgbench test:

pgbench -i benchdb
pgbench -c 10 -T 300 benchdb

(with default compilation options and fsync = off)

On master I get:
tps = 10349.826645 (without initial connection time)
On 15.3:
tps = 11296.064992 (without initial connection time)

This difference is confirmed by multiple test runs. `git bisect` for
this regression pointed at f193883fc.

Best regards,
Alexander
Re: base backup vs. concurrent truncation
On Mon, May 1, 2023 at 12:54 PM Aleksander Alekseev wrote:
> So I'm still unable to reproduce the described scenario, at least on PG16.

Well, that proves that either (1) the scenario that I described is
impossible for some unknown reason or (2) something is wrong with your
test scenario. I bet it's (2), but if it's (1), it would be nice to know
what the reason is. One can't feel good about code that appears on the
surface to be broken even if one knows that some unknown magical thing
is preventing disaster.

I find it pretty hard to accept that there's no problem at all here,
especially in view of the fact that Andres independently posted about
the same issue on another thread. It's pretty clear from looking at the
code that mdnblocks() can't open any segments past the first one that
isn't of the maximum possible size. It's also fairly clear that a crash
or a base backup can create such situations. And finally it's pretty
clear that having an old partial segment be rediscovered due to the
relation being re-extended would be quite bad. So how can there not be
a problem?

I admit I haven't done the legwork to nail down a test case where
everything comes together just right to show user-visible breakage, but
your success in finding one where it doesn't is no proof of anything.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: [PATCH] Allow Postgres to pick an unused port to listen
On Mon, Apr 24, 2023 at 10:16 AM Peter Eisentraut wrote: > Right. I'm perfectly content with just allowing port number 0 and > leaving it at that. That seems fine to me, too. If somebody wants to add a pg_ctl feature to extract this or any other information from the postmaster.pid file, that can be a separate patch. But it's not necessarily the case that users would even prefer that interface. Some might, some might not. Or so it seems to me, anyway. -- Robert Haas EDB: http://www.enterprisedb.com
Re: proposal: psql: show current user in prompt
I'm very much in favor of adding a way to support reporting other GUC variables than the current hardcoded list. This can be quite useful to support some amount of session state in connection poolers.

Some general feedback on the patch:

1. I don't think the synchronization mechanism that you added should be part of PQlinkParameterStatus. It seems likely that people will want to turn on reporting for multiple GUCs in one go. Having to synchronize for each would introduce unnecessary round trips. Maybe you also don't care about syncing at all at this point in time.
2. Support for this message should probably require a protocol extension. There is another recent thread that discusses adding message types and protocol extensions: https://www.postgresql.org/message-id/flat/CA%2BTgmoaxfJ3whOqnxTjT-%2BHDgZYbEho7dVHHsuEU2sgRw17OEQ%40mail.gmail.com#acd99fde0c037cc6cb35d565329b6e00
3. Some tests for this new libpq API should be added in src/test/modules/libpq_pipeline
4. s/massages/messages

Finally, I think this patch should be split into two commits:

1. adding custom GUC_REPORT protocol support + libpq API
2. using this libpq API in psql for the user prompt

If you have multiple commits (which are rebased on master), you can very easily create multiple patch files using this command:

git format-patch master --base=master --reroll-count={version_number_here}

On Fri, 28 Apr 2023 at 07:00, Pavel Stehule wrote:
>
> čt 27. 4. 2023 v 7:39 odesílatel Pavel Stehule napsal:
>>
>> Hi
>>
>> rebased version + fix warning possibly uninitialized variable
>
> fix not unique function id
>
> Regards
> Pavel
Re: Perform streaming logical transactions by background workers and parallel apply
On Fri, May 5, 2023 at 9:14 AM Zhijie Hou (Fujitsu) wrote:
>
> On Wednesday, May 3, 2023 3:17 PM Amit Kapila wrote:
>
> Attached another patch to fix the problem that pa_shutdown will access invalid
> MyLogicalRepWorker. I personally want to avoid introducing a new static variable,
> so I only reorder the callback registration in this version.
>
> When testing this, I noticed a rare case where it is possible for the leader to receive
> the worker termination message after the leader stops the parallel worker. This
> is unnecessary and carries a risk that the leader would try to access the detached
> memory queue. This is more likely to happen and sometimes causes failures in
> regression tests after the registration reorder patch, because the dsm is
> detached earlier after applying the patch.

I think it is only possible for the leader apply worker to try to receive the error message from an error queue after your 0002 patch, because the other place already detached from the queue before stopping the parallel apply workers. So, I combined both the patches and changed a few comments and a commit message. Let me know what you think of the attached.

-- 
With Regards,
Amit Kapila.

v2-0001-Fix-invalid-memory-access-during-the-shutdown-of-.patch
Description: Binary data
Re: Support logical replication of DDLs
On Mon, May 8, 2023 at 3:58 PM shveta malik wrote:
>
> On Tue, May 2, 2023 at 8:30 AM shveta malik wrote:
> >
> > On Fri, Apr 28, 2023 at 5:11 PM Amit Kapila wrote:
> > >
> > > Now, I think we can try to eliminate this entire ObjTree machinery and
> > > directly form the JSON blob during deparsing. We have previously also
> > > discussed this in an email chain at [1]. I think now the functionality
> > > of JSONB has also been improved and we should investigate whether it
> > > is feasible to directly use JSONB APIs to form the required blob.
> >
> > +1.
> > I will investigate this and will share my findings.
>
> Please find the PoC patch for create-table after object-tree removal.

Missed mentioning that it is a combined effort by Vignesh and myself, so let us know your feedback.

thanks
Shveta
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Apr 7, 2023 at 4:55 PM John Naylor wrote:
> - Fixed-size PagetableEntry's are pretty large, but the tid compression scheme used in this thread (in addition to being complex) is not a great fit for tidbitmap because it makes it more difficult to track per-block metadata (see also next point).

With the "combined pointer-value slots" technique, if a page's max tid offset is 63 or less, the offsets can be stored directly in the pointer for the exact case. The lowest bit can tag to indicate a pointer to a single-value leaf. That would complicate operations like union/intersection and tracking "needs recheck", but it would reduce memory use and node-traversal in common cases. [just getting some thoughts out there before I have something concrete]

Thinking some more, this needn't be complicated at all. We'd just need to reserve some bits of a bitmapword for the tag, as well as flags for "ischunk" and "recheck". The other bits can be used for offsets. Getting/storing the offsets basically amounts to adjusting the shift by a constant. That way, this "embeddable PTE" could serve as both "PTE embedded in a node pointer" and also the first member of a full PTE. A full PTE is now just an array of embedded PTEs, except only the first one has the flags we need. That reduces the number of places that have to be different. Storing any set of offsets all less than ~60 would save allocation/traversal in a large number of real cases. Furthermore, that would reduce a full PTE to 40 bytes because there would be no padding.

This all assumes the key (block number) is no longer stored in the PTE, whether embedded or not. That would mean this technique:

> - With lazy expansion and single-value leaves, the root of a radix tree can point to a single leaf. That might get rid of the need to track TBMStatus, since setting a single-leaf tree should be cheap.

...is not a good trade off because it requires each leaf to have the key, and would thus reduce the utility of embedded leaves.
We just need to make sure storing a single value is not costly, and I suspect it's not. (Currently the overhead avoided is allocating and zeroing a few kilobytes for a hash table). If it is not, then we don't need a special case in tidbitmap, which would be a great simplification. If it is, there are other ways to mitigate. -- John Naylor EDB: http://www.enterprisedb.com
Re: evtcache: EventTriggerCache vs Event Trigger Cache
> On 4 May 2023, at 14:18, Daniel Gustafsson wrote:
>
>> On 4 May 2023, at 14:09, Tom Lane wrote:
>>
>> Daniel Gustafsson writes:
>>> When reading a memory contexts log I realized that we have this:
>>> LOG: level: 2; EventTriggerCache: 8192 total in 1 blocks; 7928 free (4 chunks); 264 used
>>> LOG: level: 3; Event Trigger Cache: 8192 total in 1 blocks; 2616 free (0 chunks); 5576 used
>>
>>> The reason is that BuildEventTriggerCache sets up a context "EventTriggerCache"
>>> which houses a hash named "Event Trigger Cache" which in turn creates a context
>>> with the table name. I think it makes sense that these share the same name,
>>> but I think it would be less confusing if they also shared the same spelling
>>> whitespace-wise. Any reason to not rename the hash EventTriggerCache to make
>>> the logging a tiny bit easier to read and grep?
>>
>> Hmm, I'm kinda -1 on them having the same name visible in the
>> contexts dump --- that seems very confusing. How about naming
>> the hash "EventTriggerCacheHash" or so?
>
> I think the level is the indicator here, but I have no strong opinions;
> EventTriggerCacheHash is fine by me.

The attached trivial diff does that, parking this in the next CF.

-- 
Daniel Gustafsson

evtcachehash.diff
Description: Binary data
Re: [DOC] Update ALTER SUBSCRIPTION documentation
On Mon, May 8, 2023 at 12:07 PM Amit Kapila wrote:
>
> On Fri, May 5, 2023 at 6:47 PM Robert Sjöblom wrote:
> >
> > We have recently used the PostgreSQL documentation when setting up our
> > logical replication. We noticed there was a step missing in the
> > documentation on how to drop a logical replication subscription with a
> > replication slot attached.
> >
> > We clarify the documentation to include prerequisites for running the
> > DROP SUBSCRIPTION command. Please see attached patch.
>
> Shouldn't we also change the following errhint in the code as well?
> ReportSlotConnectionError()
> {
> ...
> ereport(ERROR,
> (errcode(ERRCODE_CONNECTION_FAILURE),
> errmsg("could not connect to publisher when attempting to drop
> replication slot \"%s\": %s",
> slotname, err),
> /* translator: %s is an SQL ALTER command */
> errhint("Use %s to disassociate the subscription from the slot.",
> "ALTER SUBSCRIPTION ... SET (slot_name = NONE)")));
> ...
> }

Yeah, if the subscription is enabled, it might be helpful for users if the error hint message says something like: Use ALTER SUBSCRIPTION ... SET (slot_name = NONE) to disassociate the subscription from the slot after disabling the subscription.

Apart from the documentation change, given that setting slot_name = NONE always requires the subscription to be disabled beforehand, does it make sense to change ALTER SUBSCRIPTION SET so that we disable the subscription when setting slot_name = NONE? Setting slot_name to a valid slot name doesn't enable the subscription, though.

Regards,
-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
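For reference, the sequence being discussed — disassociating the slot before dropping the subscription when the remote slot cannot be dropped — would look like this (using a hypothetical subscription name):

```sql
-- slot_name = NONE requires the subscription to be disabled first,
-- which is the prerequisite the documentation patch spells out.
ALTER SUBSCRIPTION mysub DISABLE;
ALTER SUBSCRIPTION mysub SET (slot_name = NONE);
DROP SUBSCRIPTION mysub;  -- no longer attempts to drop a remote slot
```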
Re: Add two missing tests in 035_standby_logical_decoding.pl
Hi,

On 5/8/23 4:42 AM, Amit Kapila wrote:
On Sat, May 6, 2023 at 9:33 PM Drouvot, Bertrand wrote:
On 5/6/23 3:28 PM, Amit Kapila wrote:
On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand wrote:

Next steps:
===========
1. The first thing is we should verify this theory by adding some LOG in KeepLogSeg() to see if the _logSegNo is reduced due to the value returned by XLogGetReplicationSlotMinimumLSN().

Yeah, will do that early next week.

It's done with the following changes:
https://github.com/bdrouvot/postgres/commit/79e1bd9ab429a22f876b9364eb8a0da2dacaaef7#diff-c1cb3ab2a19606390c1a7ed00ffe5a45531702ca5faf999d401c548f8951c65bL7454-R7514

With that in place, there is one failing test here: https://cirrus-ci.com/task/5173216310722560

Where we can see:

2023-05-08 07:42:56.301 UTC [18038][checkpointer] LOCATION: UpdateMinRecoveryPoint, xlog.c:2500
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 0: CreateRestartPoint: After XLByteToSeg RedoRecPtr is 0/498, _logSegNo is 4
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: CreateRestartPoint, xlog.c:7271
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 0: KeepLogSeg: segno changed to 4 due to XLByteToSeg
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: KeepLogSeg, xlog.c:7473
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 0: KeepLogSeg: segno changed to 3 due to XLogGetReplicationSlotMinimumLSN()
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: KeepLogSeg, xlog.c:7483
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 0: CreateRestartPoint: After KeepLogSeg RedoRecPtr is 0/498, endptr is 0/4000148, _logSegNo is 3
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: CreateRestartPoint, xlog.c:7284
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 0: BDT1 about to call RemoveOldXlogFiles in CreateRestartPoint
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: CreateRestartPoint, xlog.c:7313
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 0: attempting to remove WAL segments older than log file 0002

So the suspicion about XLogGetReplicationSlotMinimumLSN() was correct (_logSegNo moved from 4 to 3 due to XLogGetReplicationSlotMinimumLSN()).

What about postponing the physical slot creation on the standby and the cascading standby node initialization after this test?

Yeah, that is also possible. But, I have a few questions regarding that: (a) There doesn't seem to be a physical slot on cascading standby, if I am missing something, can you please point me to the relevant part of the test?

That's right. There is a physical slot only on the primary and on the standby. What I meant up-thread is to also postpone the cascading standby node initialization after this test (once the physical slot on the standby is created). Please find attached a proposal doing so.

(b) Which test is currently dependent on the physical slot on standby?

Not a test, but the cascading standby initialization with the "primary_slot_name" parameter.

Also, now I think that it's better to have the physical slot on the standby + hsf (hot_standby_feedback) set to on on the cascading standby (coming from the standby backup). The idea is to avoid any risk of logical slot invalidation on the cascading standby in the standby promotion test. That was not the case before the attached proposal though (hsf was off on the cascading standby).
Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.comFrom 38c67b0fd8f2d83e97a93108484fe23c881dd91c Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Mon, 8 May 2023 07:02:50 + Subject: [PATCH v1] fix retaining WAL test --- .../t/035_standby_logical_decoding.pl | 38 ++- 1 file changed, 21 insertions(+), 17 deletions(-) 100.0% src/test/recovery/t/ diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl index ad1def2474..4bda706eae 100644 --- a/src/test/recovery/t/035_standby_logical_decoding.pl +++ b/src/test/recovery/t/035_standby_logical_decoding.pl @@ -276,20 +276,6 @@ $node_standby->append_conf('postgresql.conf', max_replication_slots = 5]); $node_standby->start; $node_primary->wait_for_replay_catchup($node_standby); -$node_standby->safe_psql('testdb', qq[SELECT * FROM pg_create_physical_replication_slot('$standby_physical_slotname');]); - -### -# Initialize cascading standby node -### -$node_standby->backup($backup_name); -$node_cascading_standby->init_from_backup( - $node_standby, $backup_name, - has_streaming => 1, - has_restoring => 1); -$node_cascading_standby->append_conf('postgresql.conf', - qq[primary_slot_name = '$standby_physical_slotname']); -$node_cascading_standby->start; -$node_standby->wait_for_replay_catchup($node_cascading_standby,
Subscription statistics are not dropped at DROP SUBSCRIPTION in some cases
Hi,

We call pgstat_drop_subscription() at the end of DropSubscription(), but we can return from this function earlier, e.g. when no slot is associated with the subscription. In this case, the statistics entry for the subscription remains. To fix it, I think we need to call it earlier, just after removing the catalog tuple. There is a chance that the transaction dropping the subscription fails due to a network error etc., but we don't need to worry about that since reporting the subscription drop is transactional.

I've attached the patch. Feedback is very welcome.

Regards,
-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

fix_drop_subscription_stats.patch
Description: Binary data
Re: SQL:2011 application time
On 03.05.23 23:02, Paul Jungwirth wrote: Thank you again for the review. Here is a patch with most of your feedback addressed. Sorry it has taken so long! These patches are rebased up to 1ab763fc22adc88e5d779817e7b42b25a9dd7c9e (May 3). Here are a few small fixup patches to get your patch set compiling cleanly. Also, it looks like the patches 0002, 0003, and 0004 are not split up correctly. 0002 contains tests using the FOR PORTION OF syntax introduced in 0003, and 0003 uses the function build_period_range() from 0004. From 6fa819b675255af763e51a67d4d8c88f1d390b6c Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Mon, 8 May 2023 08:45:32 +0200 Subject: [PATCH 1/3] fixup! Add PERIODs --- doc/src/sgml/ref/alter_table.sgml| 2 +- src/test/modules/test_ddl_deparse/test_ddl_deparse.c | 6 ++ 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index 487f09f88a..d6aed3dff8 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -582,7 +582,7 @@ Description - + DROP PERIOD FOR diff --git a/src/test/modules/test_ddl_deparse/test_ddl_deparse.c b/src/test/modules/test_ddl_deparse/test_ddl_deparse.c index b7c6f98577..6f4e44de3f 100644 --- a/src/test/modules/test_ddl_deparse/test_ddl_deparse.c +++ b/src/test/modules/test_ddl_deparse/test_ddl_deparse.c @@ -309,6 +309,12 @@ get_altertable_subcmdinfo(PG_FUNCTION_ARGS) case AT_ReAddStatistics: strtype = "(re) ADD STATS"; break; + case AT_AddPeriod: + strtype = "ADD PERIOD"; + break; + case AT_DropPeriod: + strtype = "DROP PERIOD"; + break; } if (subcmd->recurse) -- 2.40.0 From 809e1fe145896b190aa4c0ec73902071e5ccdccc Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Mon, 8 May 2023 09:04:42 +0200 Subject: [PATCH 2/3] fixup! 
Add PERIODs --- src/backend/catalog/meson.build | 1 + src/include/catalog/meson.build | 1 + 2 files changed, 2 insertions(+) diff --git a/src/backend/catalog/meson.build b/src/backend/catalog/meson.build index fa6609e577..c499cd2f5d 100644 --- a/src/backend/catalog/meson.build +++ b/src/backend/catalog/meson.build @@ -26,6 +26,7 @@ backend_sources += files( 'pg_namespace.c', 'pg_operator.c', 'pg_parameter_acl.c', + 'pg_period.c', 'pg_proc.c', 'pg_publication.c', 'pg_range.c', diff --git a/src/include/catalog/meson.build b/src/include/catalog/meson.build index 3179be09d3..c92d4928a3 100644 --- a/src/include/catalog/meson.build +++ b/src/include/catalog/meson.build @@ -57,6 +57,7 @@ catalog_headers = [ 'pg_collation.h', 'pg_parameter_acl.h', 'pg_partitioned_table.h', + 'pg_period.h', 'pg_range.h', 'pg_transform.h', 'pg_sequence.h', -- 2.40.0 From 94f46deacdeaa3dbac1d3988678981ac8cf5fa9a Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Mon, 8 May 2023 09:05:04 +0200 Subject: [PATCH 3/3] fixup! Add UPDATE/DELETE FOR PORTION OF --- src/backend/utils/adt/meson.build | 1 + 1 file changed, 1 insertion(+) diff --git a/src/backend/utils/adt/meson.build b/src/backend/utils/adt/meson.build index 8515cd9365..9deb26f953 100644 --- a/src/backend/utils/adt/meson.build +++ b/src/backend/utils/adt/meson.build @@ -65,6 +65,7 @@ backend_sources += files( 'oracle_compat.c', 'orderedsetaggs.c', 'partitionfuncs.c', + 'period.c', 'pg_locale.c', 'pg_lsn.c', 'pg_upgrade_support.c', -- 2.40.0
Re: Perform streaming logical transactions by background workers and parallel apply
On Mon, May 8, 2023 at 11:08 AM Masahiko Sawada wrote:
>
> On Mon, May 8, 2023 at 12:52 PM Zhijie Hou (Fujitsu) wrote:
> >
> > On Monday, May 8, 2023 11:08 AM Masahiko Sawada wrote:
> > >
> > > On Tue, May 2, 2023 at 12:22 PM Amit Kapila wrote:
> > > >
> > > > On Fri, Apr 28, 2023 at 11:48 AM Masahiko Sawada wrote:
> > > > >
> > > > > While investigating this issue, I've reviewed the code around
> > > > > callbacks and worker termination etc and I found a problem.
> > > > >
> > > > > A parallel apply worker calls the before_shmem_exit callbacks in the
> > > > > following order:
> > > > >
> > > > > 1. ShutdownPostgres()
> > > > > 2. logicalrep_worker_onexit()
> > > > > 3. pa_shutdown()
> > > > >
> > > > > Since the worker is detached during logicalrep_worker_onexit(),
> > > > > MyLogicalRepWorker->leader_pid is invalid when we call
> > > > > pa_shutdown():
> > > > >
> > > > > static void
> > > > > pa_shutdown(int code, Datum arg)
> > > > > {
> > > > >     Assert(MyLogicalRepWorker->leader_pid != InvalidPid);
> > > > >     SendProcSignal(MyLogicalRepWorker->leader_pid,
> > > > >                    PROCSIG_PARALLEL_APPLY_MESSAGE,
> > > > >                    InvalidBackendId);
> > > > >
> > > > > Also, if the parallel apply worker fails shm_toc_lookup() during the
> > > > > initialization, it raises an error (because of noError = false) but
> > > > > ends up with a SEGV as MyLogicalRepWorker is still NULL.
> > > > >
> > > > > I think that we should not use MyLogicalRepWorker->leader_pid in
> > > > > pa_shutdown() but instead store the leader's pid in a static variable
> > > > > before registering the pa_shutdown() callback.
> > > >
> > > > Why not simply move the registration of pa_shutdown() to someplace
> > > > after logicalrep_worker_attach()?
> > >
> > > If we do that, the worker won't call dsm_detach() if it raises an
> > > ERROR in logicalrep_worker_attach(), is that okay? It seems that it's
> > > not practically a problem since we call dsm_backend_shutdown() in
> > > shmem_exit(), but if so, why do we need to call it in pa_shutdown()?
> >
> > I think the dsm_detach in pa_shutdown was intended to fire on_dsm_detach
> > callbacks to give callbacks a chance to report stats before the stats system is
> > shut down, following what we do in ParallelWorkerShutdown() (e.g.
> > sharedfileset.c callbacks cause fd.c to do ReportTemporaryFileUsage(), so we
> > need to fire that earlier).
> >
> > But for parallel apply, we currently only have one on_dsm_detach
> > callback (shm_mq_detach_callback) which doesn't report extra stats. So the
> > dsm_detach in pa_shutdown is only used to make it a bit future-proof in case
> > we add some other on_dsm_detach callbacks in the future which need to report
> > stats.
>
> Makes sense. Given that it's possible that we add other callbacks that
> report stats in the future, I think it's better not to move the
> position where we register the pa_shutdown() callback.

Hmm, what kind of stats do we expect to be collected before we register pa_shutdown? I think if required we can register such a callback after pa_shutdown. I feel that without reordering the callbacks, the fix would be a bit complicated as explained in my previous email, so I don't think it is worth complicating this code unless really required.

-- 
With Regards,
Amit Kapila.