Re: New committers: Melanie Plageman, Richard Guo
Big congrats to you two!!!

- Jimmy
Re: Should the archiver process always make sure that the timeline history files exist in the archive?
Thanks for the insightful response! I have attached an updated patch that moves the proposed logic to the end of StartupXLOG, where it seems more correct to do this. It also helps with backporting (if that's needed), since the archiver process only has access to shared memory starting from Postgres 14.

Kyotaro Horiguchi wrote:
> A. The OP suggests archiving the timeline history file for the current
> timeline every time the archiver starts. However, I don't think we
> want to keep archiving the same file over and over. (Granted, we're
> not always perfect at avoiding that..)

With the updated patch, we check at the end of StartupXLOG, if archiving is enabled, whether the current timeline history file still needs to be archived. If a .ready or .done file already exists, nothing is done, which will be the common case. I agree this may be an excessive check since it will be a no-op the majority of the time, but it should not execute often and seems like a quick, safe preventive measure (a rough sketch of the idea is at the end of this message). Could you give more details on why this would be too cumbersome?

Kyotaro Horiguchi wrote:
> B. Given that the steps valid, I concur to what is described in the
> test script provided: standbys don't really need that history file
> for the initial TLI (though I have yet to fully verify this). If the
> walreceiver just overlooks a fetch error for this file, the standby
> can successfully start. (Just skipping the first history file seems
> to work, but it feels a tad aggressive to me.)

This was my initial thought as well, but I wasn't sure whether it was okay to overlook the fetch error. Initial testing and brainstorming suggest that it is. The main downside is that these new standbys will not have their initial timeline history files, which can be useful for administration. I've attached a patch that attempts this approach in case we want to adopt it as the solution. That patch also contains an updated TAP test to better showcase the issue and the fix.

Kyotaro Horiguchi wrote:
> C. If those steps aren't valid, we might want to add a note stating
> that -X none basebackups do need the timeline history file for the
> initial TLI.

The difficulty with only documenting this is that it forces the user to manually store and track the timeline history files. It can be cumbersome for WAL archiving users to recognize this scenario when they are just trying to optimize their basebackups with -Xnone. Then again, -Xnone does seem to be designed for advanced users, so this might be okay.

Kyotaro Horiguchi wrote:
> And don't forget to enable archive mode before the latest timeline
> switch if any.

This might not be reasonable: a user could have been using streaming replication and doing failovers/failbacks as part of general high availability without knowing they would enable WAL archiving later on. Such a user would need to configure archiving and then force a failover, which may not be straightforward.

Regards,
Jimmy Yih

v2-0001-Archive-current-timeline-history-file-after-recovery.patch
v1-0001-Allow-recovery-to-proceed-when-initial-timeline-hist.patch
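For reference, a minimal sketch of the check described above (not the attached patch verbatim; the exact placement and the variable holding the end-of-recovery timeline ID, here called currentTLI, are simplifications):

    if (XLogArchivingActive() && currentTLI > 1)
    {
        char        histfname[MAXFNAMELEN];

        TLHistoryFileName(histfname, currentTLI);

        /*
         * Only mark the history file for archival if neither a .ready nor
         * a .done status file exists for it; otherwise this is a no-op.
         */
        if (!XLogArchiveIsReadyOrDone(histfname))
            XLogArchiveNotify(histfname);
    }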
Re: Should the archiver process always make sure that the timeline history files exist in the archive?
Hello pgsql-hackers,

After doing some more debugging on the matter, I believe this issue might be a minor regression from commit 5332b8cec541. Prior to that commit, when the archiver process first started on a previously promoted primary, it would have all the timeline history files marked as ready for immediate archiving. If that still happened, none of the failure scenarios I mentioned would be theoretically possible (barring someone manually deleting the timeline history files).

With that in mind, I decided to look more into my Question 1 and created a patch proposal. The attached patch tries to archive the current timeline history file, if it has not been archived yet, when the archiver process starts up (a rough sketch of the idea follows at the end of this message).

Regards,
Jimmy Yih

From: Jimmy Yih
Sent: Wednesday, August 9, 2023 5:00 PM
To: pgsql-hack...@postgresql.org
Subject: Should the archiver process always make sure that the timeline history files exist in the archive?

Hello pgsql-hackers,

While testing out some WAL archiving and PITR scenarios, we observed that enabling WAL archiving for the first time on a primary that is on a timeline higher than 1 does not initially archive the timeline history file for the current timeline. While this might be okay for most use cases, there are scenarios where it leads to unexpected failures that seem to expose some flaws in the logic.

Scenario 1:
Take a backup of a primary on timeline 2 with `pg_basebackup -Xnone`. Create a standby from that backup that continuously restores from the WAL archives; the standby will not contain the timeline 2 history file. The standby operates normally, but if you try to create a cascading standby off it using streaming replication, the cascading standby's WAL receiver will continuously FATAL while trying to request the timeline 2 history file that the main standby does not have.

Scenario 2:
Take a backup of a primary on timeline 2 with `pg_basebackup -Xnone`. Then try to create a new node by doing PITR with recovery_target_timeline set to 'current' or 'latest', which succeeds. However, PITR with recovery_target_timeline = '2' fails because it is unable to find the timeline 2 history file in the WAL archives. This is a bit contradictory: we allow 'current' and 'latest' to recover, but explicitly setting recovery_target_timeline to the control file's timeline id ends up failing.

Attached is a patch containing two TAP tests that demonstrate the scenarios. My questions are:

1. Why doesn't the archiver process try to archive timeline history files when WAL archiving is first configured, and/or continually check (maybe when the archiver process starts, before its main loop)?

2. Why does explicitly setting recovery_target_timeline to the control file's timeline id not follow the same logic as recovery_target_timeline = 'current'?

3. Why does a cascaded standby require the timeline history file of its control file's timeline id (startTLI) when the main replica is able to operate fine without that history file?

Note that my initial observations came from testing with pgBackRest (copying pg_wal/ during backup is disabled by default), but `pg_basebackup -Xnone` reproduced the issues similarly and is what I present in the TAP tests. At the moment, the only workaround I can think of is to manually run the archive_command on the missing timeline history file(s). Are these valid issues that should be looked into, or are they expected?
Scenario 2 seems like it could be easily fixed if we determine that the numeric recovery_target_timeline value is equal to the control file's timeline id (compare rtli and recoveryTargetTLI in validateRecoveryParameters()?), but I wasn't sure whether the opposite should be true instead: 'current' and 'latest' could be made to require retrieving the timeline history files, which would help prevent Scenario 1.

Regards,
Jimmy Yih

0001-Archive-current-timeline-history-file-on-archiver-st.patch
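For reference, a rough sketch of the archiver-startup idea (hypothetical; the helper name is made up, and it assumes the archiver can read the current timeline ID from shared memory, which is only true on v14 and later):

    /* Called once before the archiver's main loop. */
    static void
    pgarch_archive_timeline_history(void)
    {
        TimeLineID  tli = GetWALInsertionTimeLine();
        char        histfname[MAXFNAMELEN];

        /* Timeline 1 has no history file to archive */
        if (tli <= 1)
            return;

        TLHistoryFileName(histfname, tli);

        /* Mark the history file ready unless a .ready/.done file already exists */
        if (!XLogArchiveIsReadyOrDone(histfname))
            XLogArchiveNotify(histfname);
    }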
Should the archiver process always make sure that the timeline history files exist in the archive?
Hello pgsql-hackers,

While testing out some WAL archiving and PITR scenarios, we observed that enabling WAL archiving for the first time on a primary that is on a timeline higher than 1 does not initially archive the timeline history file for the current timeline. While this might be okay for most use cases, there are scenarios where it leads to unexpected failures that seem to expose some flaws in the logic.

Scenario 1:
Take a backup of a primary on timeline 2 with `pg_basebackup -Xnone`. Create a standby from that backup that continuously restores from the WAL archives; the standby will not contain the timeline 2 history file. The standby operates normally, but if you try to create a cascading standby off it using streaming replication, the cascading standby's WAL receiver will continuously FATAL while trying to request the timeline 2 history file that the main standby does not have.

Scenario 2:
Take a backup of a primary on timeline 2 with `pg_basebackup -Xnone`. Then try to create a new node by doing PITR with recovery_target_timeline set to 'current' or 'latest', which succeeds. However, PITR with recovery_target_timeline = '2' fails because it is unable to find the timeline 2 history file in the WAL archives. This is a bit contradictory: we allow 'current' and 'latest' to recover, but explicitly setting recovery_target_timeline to the control file's timeline id ends up failing.

Attached is a patch containing two TAP tests that demonstrate the scenarios. My questions are:

1. Why doesn't the archiver process try to archive timeline history files when WAL archiving is first configured, and/or continually check (maybe when the archiver process starts, before its main loop)?

2. Why does explicitly setting recovery_target_timeline to the control file's timeline id not follow the same logic as recovery_target_timeline = 'current'?

3. Why does a cascaded standby require the timeline history file of its control file's timeline id (startTLI) when the main replica is able to operate fine without that history file?

Note that my initial observations came from testing with pgBackRest (copying pg_wal/ during backup is disabled by default), but `pg_basebackup -Xnone` reproduced the issues similarly and is what I present in the TAP tests. At the moment, the only workaround I can think of is to manually run the archive_command on the missing timeline history file(s). Are these valid issues that should be looked into, or are they expected?

Scenario 2 seems like it could be easily fixed if we determine that the numeric recovery_target_timeline value is equal to the control file's timeline id (compare rtli and recoveryTargetTLI in validateRecoveryParameters()?); a rough sketch of this idea follows at the end of this message. However, I wasn't sure whether the opposite should be true instead: 'current' and 'latest' could be made to require retrieving the timeline history files, which would help prevent Scenario 1.

Regards,
Jimmy Yih

0001-TAP-tests-to-show-missing-timeline-history-issues.patch
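For question 2 / Scenario 2, a rough sketch of the comparison mentioned above (hypothetical and untested; recoveryTargetTLI has just been initialized from the control file by the time validateRecoveryParameters() runs):

    if (recoveryTargetTimeLineGoal == RECOVERY_TARGET_TIMELINE_NUMERIC)
    {
        TimeLineID  rtli = recoveryTargetTLIRequested;

        /*
         * Timeline 1 never has a history file, and a request for the
         * control file's timeline is treated like 'current'; anything
         * else must have a reachable history file.
         */
        if (rtli != 1 && rtli != recoveryTargetTLI &&
            !existsTimeLineHistory(rtli))
            ereport(FATAL,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                     errmsg("recovery target timeline %u does not exist",
                            rtli)));
        recoveryTargetTLI = rtli;
    }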
Re: Concurrent deadlock scenario with DROP INDEX on partitioned index
Tom Lane wrote:
> Hence, I propose the attached. 0001 is pure refactoring: it hopefully
> clears up the confusion about which "relkind" is which, and it also
> saves a couple of redundant syscache fetches in RemoveRelations.
> Then 0002 adds the actual bug fix as well as a test case that does
> deadlock with unpatched code.

The proposed patches look great and make much more sense. I see you've already squashed and committed in 7b6ec86532c2ca585d671239bba867fe380448ed. Thanks!

--
Jimmy Yih (VMware)
Gaurab Dey (VMware)
Re: Concurrent deadlock scenario with DROP INDEX on partitioned index
Zhihong Yu wrote:
> Hi,
> For RangeVarCallbackForDropRelation():
>
> +    LockRelationOid(indexRelationOid, heap_lockmode);
>
> Since the above is called for both if and else blocks, it can be lifted
> outside.

Thanks for the feedback. Attached is a new v3 patch with the feedback addressed.

--
Jimmy Yih (VMware)
Gaurab Dey (VMware)

v3-0001-Fix-concurrent-deadlock-scenario-with-DROP-INDEX-.patch
Re: Concurrent deadlock scenario with DROP INDEX on partitioned index
Tom Lane wrote:
> 1. RangeVarCallbackForDropRelation can get called more than once
> during a lookup (in case of concurrent rename and suchlike).
> If so it needs to be prepared to drop the lock(s) it got last time.
> You have not implemented any such logic. This doesn't seem hard
> to fix, just store the OID list into struct DropRelationCallbackState.

Fixed in the attached patch. We added the OID list to DropRelationCallbackState as you suggested (a rough sketch of the pattern is at the end of this message).

> 2. I'm not sure you have fully thought through the interaction
> with the subsequent "is_partition" stanza. If the target is
> an intermediate partitioned index, don't we need to acquire lock
> on its parent index before starting to lock children? (It may
> be that that stanza is already in the wrong place relative to
> the table-locking stanza it currently follows, but not sure.)

It's not required to acquire a lock on a possible parent index because of the restriction that DROP INDEX can only be run on the top-most partitioned index.

> 3. Calling find_all_inheritors with lockmode NoLock, and then
> locking those relation OIDs later, is both unsafe and code-wasteful.
> Just tell find_all_inheritors to get the locks you want.

Fixed in the attached patch.

> 4. This code will invoke find_all_inheritors on InvalidOid if
> IndexGetRelation fails. It needs to be within the if-test
> just above.

Fixed in the attached patch.

> 5. Reading classform again at this point in the function is
> not merely inefficient, but outright wrong, because we
> already released the syscache entry. Use the local variable.

Fixed in the attached patch. We added another local variable, is_partitioned_index, to store the classform value. The main reason we need the classform is that the existing relkind and expected_relkind local variables would only show RELKIND_INDEX, whereas we need exactly RELKIND_PARTITIONED_INDEX.

> 6. You've not paid enough attention to updating existing comments,
> particularly the header comment for RangeVarCallbackForDropRelation.

Fixed in the attached patch. We updated the header comment and merged our new comment into another related comment slightly above the proposed locking section.

> Actually though, maybe you *don't* want to do this in
> RangeVarCallbackForDropRelation. Because of point 2, it might be
> better to run find_all_inheritors after we've successfully
> identified and locked the direct target relation, ie do it back
> in RemoveRelations. I've not thought hard about that, but it's
> attractive if only because it'd mean you don't have to fix point 1.

We think RangeVarCallbackForDropRelation is probably the most correct place for the fix. It would look a bit out of place in RemoveRelations, given that there is already DROP INDEX-related code in RangeVarCallbackForDropRelation. With point 2 explained and point 1 addressed, the fix seems to fit well in RangeVarCallbackForDropRelation.

Thanks for the feedback. Attached is a new patch with the feedback addressed.

--
Jimmy Yih (VMware)
Gaurab Dey (VMware)

v2-0001-Fix-concurrent-deadlock-scenario-with-DROP-INDEX-.patch
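A rough sketch of the point 1 handling (field and helper names here are illustrative, not the attached patch verbatim): the callback state remembers which partition OIDs were locked so that a later invocation for a different target can release them first.

    typedef struct DropRelationCallbackState
    {
        /* existing fields abridged */
        char        expected_relkind;
        LOCKMODE    heap_lockmode;
        Oid         heapOid;
        List       *lockedPartitionOids;    /* partitions locked so far */
    } DropRelationCallbackState;

    static void
    release_partition_locks(DropRelationCallbackState *state)
    {
        ListCell   *lc;

        foreach(lc, state->lockedPartitionOids)
            UnlockRelationOid(lfirst_oid(lc), state->heap_lockmode);

        state->lockedPartitionOids = NIL;
    }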
Concurrent deadlock scenario with DROP INDEX on partitioned index
Hello pgsql-hackers,

When dropping a partitioned index, the locking order starts with the partitioned table, then its partitioned index, then the partition indexes dependent on that partitioned index, and finally the dependent partition indexes' parent tables. This order allows a deadlock if, for example, an ANALYZE runs on one of the partition tables: the ANALYZE locks the partition table and then blocks when it attempts to lock the partition's index, while the DROP INDEX holds the index lock but cannot get the partition table lock.

Deadlock Reproduction
=====================

Initial setup:
CREATE TABLE foopart (a int) PARTITION BY RANGE (a);
CREATE TABLE foopart_child PARTITION OF foopart FOR VALUES FROM (1) TO (5);
CREATE INDEX foopart_idx ON foopart USING btree(a);

Attach a debugger to session 1 and put a breakpoint on the function deleteObjectsInList:
1: DROP INDEX foopart_idx;

While session 1 is blocked at the breakpoint, run the following:
2: ANALYZE foopart_child;

After a couple of seconds, detach the debugger from session 1 and see the deadlock.

To prevent this deadlock scenario, we should find and lock all the sub-partition and partition tables beneath the partitioned index's partitioned table before attempting to lock the sub-partition and partition indexes (a rough sketch of this ordering follows at the end of this message). Attached is a patch (plus an isolation test) that fixes the issue. We observed this on the latest head of Postgres.

Regards,
Jimmy Yih and Gaurab Dey

0001-Fix-concurrent-deadlock-scenario-with-DROP-INDEX-on-.patch
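A rough sketch of the proposed lock ordering inside the DROP INDEX lookup (simplified from the attached patch; variable names are approximate): lock the owning table and all of its partitions, table-first, before any partition index locks are taken.

    if (classform->relkind == RELKIND_PARTITIONED_INDEX)
    {
        Oid     heapOid = IndexGetRelation(relOid, true);

        /*
         * Lock the partitioned table and every partition beneath it in the
         * same table-before-index order that ANALYZE and DML use, so the
         * subsequent index locking cannot deadlock against them.
         */
        if (OidIsValid(heapOid))
            (void) find_all_inheritors(heapOid, heap_lockmode, NULL);
    }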
Obtaining a more consistent view definition when a UNION subquery contains undecorated constants
Hello,

A colleague and I were playing around with dumping views and found an inconsistency in the definition of a view whose UNION subquery contains undecorated constants. When we took that view definition and restored it, dumping the view gave different syntax again. Although the slightly different view definitions are semantically the same, the syntax difference was surprising.

Our view SQL, where the 'bar' constant is not decorated:

CREATE VIEW fooview AS SELECT t.a FROM (SELECT 1 AS a, 'foo'::text AS baz UNION SELECT 0 AS a, 'bar')t;

// view definition from pg_get_viewdef()
SELECT t.a FROM ( SELECT 1 AS a, 'foo'::text AS baz UNION SELECT 0 AS a, 'bar'::text) t;

Note that the type decorator is appended to 'bar' in the normal fashion. Then taking the above view definition and creating a view:

CREATE VIEW fooview AS SELECT t.a FROM (SELECT 1 AS a, 'foo'::text AS baz UNION SELECT 0 AS a, 'bar'::text)t;

// view definition from pg_get_viewdef()
SELECT t.a FROM ( SELECT 1 AS a, 'foo'::text AS baz UNION SELECT 0 AS a, 'bar'::text AS text) t;

results in a view definition that has the alias 'AS text' appended to 'bar'::text. Contrast this with creating a view without the subquery:

CREATE OR REPLACE VIEW fooview AS SELECT 1 AS a, 'foo'::text AS baz UNION SELECT 0 AS a, 'bar';

// view definition from pg_get_viewdef()
SELECT 1 AS a, 'foo'::text AS baz UNION SELECT 0 AS a, 'bar'::text AS baz;

We see that this view uses the view's tuple descriptor to name the columns.

Looking at the internal code (mostly the get_from_clause_item() function), we saw that when a subquery is used, no tuple descriptor is passed to the get_query_def() function. Because of this, get_target_list() uses the resnames obtained from the pg_rewrite entry's ev_action field. However, it seems fairly simple to construct a dummy tuple descriptor from the ev_action and pass it down the call stack so that the column names are consistent when deparsing both T_A_Const and T_TypeCast parse tree nodes involving a UNION. Attached is a patch that demonstrates this (a rough sketch of the idea also follows at the end of this message).

Running with the attached patch, it seems to work pretty well:

postgres=# CREATE VIEW fooview AS SELECT t.a FROM (SELECT 1 AS a, 'foo'::text AS baz UNION SELECT 0 AS a, 'bar')t;
postgres=# select pg_get_viewdef('fooview');
                                        pg_get_viewdef
------------------------------------------------------------------------------------------------
 SELECT t.a FROM ( SELECT 1 AS a, 'foo'::text AS baz UNION SELECT 0 AS a, 'bar'::text AS baz) t;
(1 row)

postgres=# CREATE VIEW fooview AS SELECT t.a FROM (SELECT 1 AS a, 'foo'::text AS baz UNION SELECT 0 AS a, 'bar'::text)t;
postgres=# select pg_get_viewdef('fooview');
                                        pg_get_viewdef
------------------------------------------------------------------------------------------------
 SELECT t.a FROM ( SELECT 1 AS a, 'foo'::text AS baz UNION SELECT 0 AS a, 'bar'::text AS baz) t;
(1 row)

Nested subqueries also work with the patch. We're not sure how this could break. Is this an acceptable change that should be pursued?

Regards,
Jimmy Yih and Jim Doty

view_union_subquery_constant.patch
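A rough sketch of the idea (hypothetical and abridged; the attached patch is the actual reference): when get_from_clause_item() reaches an RTE_SUBQUERY entry, derive a throwaway descriptor from the subquery's own target list and pass that down where a NULL resultDesc is passed today, so get_target_list() sees concrete column names instead of falling back to the stored resnames.

    if (rte->rtekind == RTE_SUBQUERY)
    {
        /* Build a dummy descriptor from the subquery's target list */
        TupleDesc   dummyDesc = ExecTypeFromTL(rte->subquery->targetList);

        /*
         * dummyDesc would then be handed down to get_query_def() in place
         * of the NULL resultDesc used today.
         */
    }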
Re: Prevent concurrent DROP SCHEMA when certain objects are being initially created in the namespace
> Something which would be good to have for all those queries is a set of
> isolation tests. No need for multiple specs, you could just use one
> spec with one session defining all the object types you would like to
> work on. How did you find this object list? Did you test all the
> objects available manually?

Attached is the isolation spec file. I originally was only going to fix the simple CREATE TYPE scenario, but decided to look up other objects that can reside in namespaces and ended up finding a handful of others. I tested each one manually before and after adding the AccessShareLock acquire on the schema.

> I think that line of thought leads to an enormous increase in locking
> overhead, for which we'd get little if any gain in usability. So my
> inclination is to make an engineering judgment that we won't fix this.

As I was creating this patch, I had similar feelings about the locking overhead and was curious how others would feel about it as well.

Regards,
Jimmy

On Tue, Sep 4, 2018 at 10:05 PM, Tom Lane wrote:
> Andres Freund writes:
> > On September 4, 2018 9:11:25 PM PDT, Tom Lane wrote:
> >> I think that line of thought leads to an enormous increase in locking
> >> overhead, for which we'd get little if any gain in usability. So my
> >> inclination is to make an engineering judgment that we won't fix this.
>
> > Haven't we already significantly started down this road, to avoid a lot
> > of the "tuple concurrently updated" type errors?
>
> Not that I'm aware of. We do not take locks on schemas, nor functions,
> nor any other of the object types I mentioned.
>
> > Would expanding this a git further really be that noticeable?
>
> Frankly, I think it would be not so much "noticeable" as "disastrous".
> Making the overhead tolerable would require very large compromises
> in coverage, perhaps like "we'll only lock during DDL not DML".
> At which point I'd question why bother. We've seen no field reports
> (that I can recall offhand, anyway) that trace to not locking these
> objects.
>
> regards, tom lane

concurrent-schema-drop.spec
Prevent concurrent DROP SCHEMA when certain objects are being initially created in the namespace
Hello,

When an empty namespace is being initially populated with certain objects, it is possible for a DROP SCHEMA operation to come in and delete the namespace without using CASCADE. The objects are still created but are left unusable. This can leave behind pg_class, pg_type, and other such entries with invalid namespace values that need to be removed manually by the user. To prevent this from happening, we should take an AccessShareLock on the namespace to which the type, function, etc. is being added (a rough sketch of the lock acquisition follows at the end of this message).

Attached is a patch that covers the following CREATE commands: CREATE TYPE, CREATE FUNCTION, CREATE AGGREGATE, CREATE OPERATOR FAMILY, CREATE OPERATOR CLASS, and CREATE OPERATOR.

Example Reproduction
====================

Session1: CREATE SCHEMA testschema;
Session1: BEGIN;
Session1: CREATE TYPE testschema.testtype;
Session2: DROP SCHEMA testschema;
Session1: COMMIT;

The DROP SCHEMA succeeds in session 2, and session 1's COMMIT goes through without error, but the pg_type entry for typname testtype will still be there. This example uses a simple TYPE object, but the other objects can be reproduced the same way in place of CREATE TYPE.

Related Postgres 9.2 commit (note in the commit message that "We need similar protection for all other object types"):
https://github.com/postgres/postgres/commit/1575fbcb795fc331f46588b4520c4bca7e854d5c

The patch itself is relatively trivial, just adding copy/paste lock acquires near other lock acquires. I wasn't sure if this needed a decently sized refactor such as the referenced commit above.

Regards,
Jimmy

concurrent_namespace_drop.patch
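A rough sketch of what the lock acquisition looks like for the CREATE TYPE path (the attached patch is the actual reference; the other object types follow the same pattern at their own creation-namespace lookup):

    /* In DefineType(), right after resolving the creation namespace: */
    typeNamespace = QualifiedNameGetCreationNamespace(names, &typeName);

    /*
     * Lock the schema so a concurrent DROP SCHEMA must wait for this
     * transaction to commit instead of leaving an orphaned pg_type entry.
     */
    LockDatabaseObject(NamespaceRelationId, typeNamespace, 0, AccessShareLock);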
possible issue with array type created for every heap relation composite type
Hello,

In Postgres 8.3, it was decided that an array type would be automatically created for every heap relation's composite type. Reference thread:
https://www.postgresql.org/message-id/flat/20070302234016.GF3665%40fetter.org

The possible issue I would like to note relates to how the array type is named in the makeArrayTypeName() function. The composite type takes the heap relation's relname as its typname, and the array type is usually named with an underscore prepended (after first attempting to use the relname and hitting a typname collision with the composite type). If the typname with the underscore prepended is already taken, the logic continues prepending underscores until there are no typname collisions (truncating the end of the typname if it gets past the NAMEDATALEN limit of 64). It is therefore possible that enough heap relations with similar relnames cause more and more typname collisions, until you end up with typnames that consist primarily of underscores, or until a typname cannot be constructed at all because it has become all underscores (which can cause table creation to fail). This is more likely when heap relnames are similar and 63+ characters long (a toy illustration follows at the end of this message).

I can see this possibly being an issue with table partitioning if a user decides on partition names that reach 63+ characters (assuming the user uses exactly 63 characters or does not hit "relation already exists" from relname truncation). It may also become an issue if we ever decide to do automated partition naming instead of having the user name the partitions manually.

Is this an issue we should be worrying about? When and how often are array types of a heap relation's composite type used (I can only think of doing array_agg on table rows)? Should we consider coding up "CREATE TYPE foo ARRAY OF bar" as suggested in the past in the above referenced hackers thread?

Attached are some SQL examples describing this issue.

Thanks,
Jimmy

typname_collision.sql
typname_collision_with_partitioning.sql
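A small, self-contained toy program (ordinary C, not backend code) that roughly models the behavior described above: each collision prepends one more underscore and truncates to NAMEDATALEN - 1 bytes, so a family of long, similar relnames converges toward underscore-heavy array type names.

    #include <stdio.h>
    #include <string.h>

    #define NAMEDATALEN 64

    /*
     * Toy model of the naming rule: prepend one '_' per collision and
     * truncate the candidate to NAMEDATALEN - 1 bytes.
     */
    static void
    array_type_candidate(const char *relname, int collisions, char *out)
    {
        char    buf[NAMEDATALEN * 2];

        memset(buf, '_', collisions);
        strcpy(buf + collisions, relname);
        buf[NAMEDATALEN - 1] = '\0';    /* truncate like the backend does */
        strcpy(out, buf);
    }

    int
    main(void)
    {
        /* A hypothetical 63-character relname, as in the partitioning example */
        const char *relname =
            "a_partitioned_table_with_a_really_long_sixty_three_char_name_xx";
        char        name[NAMEDATALEN];

        for (int collisions = 1; collisions <= 5; collisions++)
        {
            array_type_candidate(relname, collisions, name);
            printf("%d collision(s): %s\n", collisions, name);
        }
        return 0;
    }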