Re: New committers: Melanie Plageman, Richard Guo

2024-04-29 Thread Jimmy Yih
Big congrats to you two!!!

- Jimmy



Re: Should the archiver process always make sure that the timeline history files exist in the archive?

2023-08-28 Thread Jimmy Yih
Thanks for the insightful response! I have attached an updated patch
that moves the proposed logic to the end of StartupXLOG where it seems
more correct to do this. It also helps with backporting (if it's
needed) since the archiver process only has access to shared memory
starting from Postgres 14.

Kyotaro Horiguchi  wrote:
> A. The OP suggests archiving the timeline history file for the current
>  timeline every time the archiver starts. However, I don't think we
>  want to keep archiving the same file over and over. (Granted, we're
>  not always perfect at avoiding that..)

With the updated proposed patch, we check at the end of StartupXLOG,
when archiving is enabled, whether the current timeline history file
needs to be archived. If a .ready or .done file already exists, the
check does nothing (which will be the common case). I agree, though,
that this may be an excessive check since it'll be a no-op the
majority of the time. However, it shouldn't execute often and seems
like a quick, safe preventive measure. Could you give more details on
why this would be too cumbersome?
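
For anyone wanting to eyeball that state on a live cluster, the marker
files the check looks at live under pg_wal/archive_status and can be
listed from SQL (an illustrative query only, not part of the patch):

-- requires superuser or pg_read_server_files; a 00000002.history.done
-- entry means the timeline 2 history file has already been archived
SELECT pg_ls_dir('pg_wal/archive_status') AS status_file;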

Kyotaro Horiguchi  wrote:
> B. Given that the steps valid, I concur to what is described in the
>  test script provided: standbys don't really need that history file
>  for the initial TLI (though I have yet to fully verify this).  If the
>  walreceiver just overlooks a fetch error for this file, the standby
>  can successfully start. (Just skipping the first history file seems
>  to work, but it feels a tad aggressive to me.)

This was my initial thought as well, but I wasn't sure if it was okay
to overlook the fetch error. Initial testing and brainstorming seem
to show that it's okay. I think the main downside is that these new
standbys will not have their initial timeline history files, which can
be useful for administration. I've attached a patch that attempts this
approach, in case we want to go with it as the solution. The patch
contains an updated TAP test as well to better showcase the issue and
fix.

Kyotaro Horiguchi  wrote:
> C. If those steps aren't valid, we might want to add a note stating
>  that -X none basebackups do need the timeline history file for the
>  initial TLI.

The difficult thing about only documenting this is that it forces the
user to manually store and track the timeline history files. It can be
a bit cumbersome for WAL archiving users to recognize this scenario
when they're just trying to optimize their basebackups by using
-Xnone. But then again -Xnone does seem like it's designed for
advanced users so this might be okay.

Kyotaro Horiguchi  wrote:
> And don't forget to enable archive mode before the latest timeline
> switch if any.

This might not be reasonable since a user could have been using
streaming replication and doing failovers/failbacks as part of general
high availability to manage their Postgres without knowing they were
going to enable WAL archiving later on. The user would need to
configure archiving and force a failover, which may not be
straightforward.

Regards,
Jimmy Yih

v2-0001-Archive-current-timeline-history-file-after-recovery.patch
Description:  v2-0001-Archive-current-timeline-history-file-after-recovery.patch


v1-0001-Allow-recovery-to-proceed-when-initial-timeline-hist.patch
Description:  v1-0001-Allow-recovery-to-proceed-when-initial-timeline-hist.patch


Re: Should the archiver process always make sure that the timeline history files exist in the archive?

2023-08-16 Thread Jimmy Yih
Hello pgsql-hackers,

After doing some more debugging on the matter, I believe this issue might be a
minor regression from commit 5332b8cec541. Prior to that commit, when the
archiver process was first started on a previously promoted primary, all the
timeline history files would be marked as ready for immediate archiving. If that
had happened, none of the failure scenarios I mentioned would be theoretically
possible (barring someone manually deleting the timeline history files). With
that in mind, I decided to look more into my Question 1 and created a patch
proposal. The attached patch will try to archive the current timeline history
file, if it has not been archived yet, when the archiver process starts up.
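
As a quick sanity check with the patch applied, whether the archiver picked the
history file up can be observed from pg_stat_archiver (an illustrative query;
the file name shown assumes a cluster sitting on timeline 2):

SELECT last_archived_wal, last_archived_time FROM pg_stat_archiver;
-- expect something like last_archived_wal = '00000002.history'
-- shortly after the archiver starts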

Regards,
Jimmy Yih


From: Jimmy Yih 
Sent: Wednesday, August 9, 2023 5:00 PM
To: pgsql-hack...@postgresql.org
Subject: Should the archiver process always make sure that the timeline history 
files exist in the archive?

Hello pgsql-hackers,

While testing out some WAL archiving and PITR scenarios, it was observed that
enabling WAL archiving for the first time on a primary that was on a timeline
higher than 1 would not initially archive the timeline history file for the
timeline it was currently on. While this might be okay for most use cases, there
are scenarios where this leads to unexpected failures that seem to expose some
flaws in the logic.

Scenario 1:
Take a backup of a primary on timeline 2 with `pg_basebackup -Xnone`. Create a
standby from that backup that continuously restores from the WAL archives; the
standby will not contain the timeline 2 history file. The standby will operate
normally, but if you try to create a cascading standby off it using streaming
replication, the cascading standby's WAL receiver will repeatedly FATAL while
trying to request the timeline 2 history file that the main standby does not
have.

Scenario 2:
Take a backup of a primary on timeline 2 with `pg_basebackup -Xnone`. Then try
to create a new node by doing PITR with recovery_target_timeline set to
'current' or 'latest', which will succeed. However, doing PITR with
recovery_target_timeline = '2' will fail since recovery is unable to find the
timeline 2 history file in the WAL archives. This is a bit contradictory since
we allow 'current' and 'latest' to recover, but explicitly setting
recovery_target_timeline to the control file's timeline id ends in failure.

Attached is a patch containing two TAP tests that demonstrate the scenarios.

My questions are:
1. Why doesn't the archiver process try to archive timeline history files when
   WAL archiving is first configured, and/or continually check for them (maybe
   when the archiver process starts up, before its main loop)?
2. Why does explicitly setting the recovery_target_timeline to the control
   file's timeline id not follow the same logic as recovery_target_timeline set
   to 'current'?
3. Why does a cascaded standby require the timeline history file of its control
   file's timeline id (startTLI) when the main replica is able to operate fine
   without the timeline history file?

Note that my initial observations came from testing with pgBackRest (copying
pg_wal/ during backup is disabled by default) but using `pg_basebackup -Xnone`
reproduced the issues similarly and is what I present in the TAP tests. At the
moment, the only workaround I can think of is to manually run the
archive_command on the missing timeline history file(s).

Are these valid issues that should be looked into, or are they expected?
Scenario 2 seems like it could be fixed easily if we detect that the numeric
recovery_target_timeline value is equal to the control file's timeline id
(compare rtli and recoveryTargetTLI in validateRecoveryParameters()?). But I
wasn't sure if maybe the opposite is true, and we should instead make 'current'
and 'latest' require retrieving the timeline history files, to help prevent
Scenario 1.

Regards,
Jimmy Yih


0001-Archive-current-timeline-history-file-on-archiver-st.patch
Description:  0001-Archive-current-timeline-history-file-on-archiver-st.patch


Should the archiver process always make sure that the timeline history files exist in the archive?

2023-08-09 Thread Jimmy Yih
Hello pgsql-hackers,

While testing out some WAL archiving and PITR scenarios, it was observed that
enabling WAL archiving for the first time on a primary that was on a timeline
higher than 1 would not initially archive the timeline history file for the
timeline it was currently on. While this might be okay for most use cases, there
are scenarios where this leads to unexpected failures that seem to expose some
flaws in the logic.

Scenario 1:
Take a backup of a primary on timeline 2 with `pg_basebackup -Xnone`. Create a
standby from that backup that continuously restores from the WAL archives; the
standby will not contain the timeline 2 history file. The standby will operate
normally, but if you try to create a cascading standby off it using streaming
replication, the cascading standby's WAL receiver will repeatedly FATAL while
trying to request the timeline 2 history file that the main standby does not
have.

Scenario 2:
Take a backup of a primary on timeline 2 with `pg_basebackup -Xnone`. Then try
to create a new node by doing PITR with recovery_target_timeline set to
'current' or 'latest', which will succeed. However, doing PITR with
recovery_target_timeline = '2' will fail since recovery is unable to find the
timeline 2 history file in the WAL archives. This is a bit contradictory since
we allow 'current' and 'latest' to recover, but explicitly setting
recovery_target_timeline to the control file's timeline id ends in failure.

Attached is a patch containing two TAP tests that demonstrate the scenarios.

My questions are:
1. Why doesn't the archiver process try to archive timeline history files when
   WAL archiving is first configured, and/or continually check for them (maybe
   when the archiver process starts up, before its main loop)?
2. Why does explicitly setting the recovery_target_timeline to the control
   file's timeline id not follow the same logic as recovery_target_timeline set
   to 'current'?
3. Why does a cascaded standby require the timeline history file of its control
   file's timeline id (startTLI) when the main replica is able to operate fine
   without the timeline history file?

Note that my initial observations came from testing with pgBackRest (copying
pg_wal/ during backup is disabled by default) but using `pg_basebackup -Xnone`
reproduced the issues similarly and is what I present in the TAP tests. At the
moment, the only workaround I can think of is to manually run the
archive_command on the missing timeline history file(s).
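
For reference, the timeline whose history file would need to be archived
manually can be read from the control data with a quick query (illustrative
only):

SELECT timeline_id FROM pg_control_checkpoint();
-- e.g. returns 2, meaning 00000002.history is the file missing from the archive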

Are these valid issues that should be looked into, or are they expected?
Scenario 2 seems like it could be fixed easily if we detect that the numeric
recovery_target_timeline value is equal to the control file's timeline id
(compare rtli and recoveryTargetTLI in validateRecoveryParameters()?). But I
wasn't sure if maybe the opposite is true, and we should instead make 'current'
and 'latest' require retrieving the timeline history files, to help prevent
Scenario 1.

Regards,
Jimmy Yih

0001-TAP-tests-to-show-missing-timeline-history-issues.patch
Description:  0001-TAP-tests-to-show-missing-timeline-history-issues.patch


Re: Concurrent deadlock scenario with DROP INDEX on partitioned index

2022-03-21 Thread Jimmy Yih
Tom Lane  wrote:
> Hence, I propose the attached.  0001 is pure refactoring: it hopefully
> clears up the confusion about which "relkind" is which, and it also
> saves a couple of redundant syscache fetches in RemoveRelations.
> Then 0002 adds the actual bug fix as well as a test case that does
> deadlock with unpatched code.

The proposed patches look great and make much more sense. I see you've
already squashed and committed them as
7b6ec86532c2ca585d671239bba867fe380448ed. Thanks!

--
Jimmy Yih (VMware)
Gaurab Dey (VMware)



Re: Concurrent deadlock scenario with DROP INDEX on partitioned index

2022-03-16 Thread Jimmy Yih
Zhihong Yu  wrote:
> Hi,
> For RangeVarCallbackForDropRelation():
>
> +   LockRelationOid(indexRelationOid, heap_lockmode);
>
> Since the above is called for both if and else blocks, it can be lifted 
> outside.

Thanks for the feedback.  Attached is a new v3 patch addressing it.

--
Jimmy Yih (VMware)
Gaurab Dey (VMware)

v3-0001-Fix-concurrent-deadlock-scenario-with-DROP-INDEX-.patch
Description:  v3-0001-Fix-concurrent-deadlock-scenario-with-DROP-INDEX-.patch


Re: Concurrent deadlock scenario with DROP INDEX on partitioned index

2022-03-16 Thread Jimmy Yih
Tom Lane  wrote:
> 1. RangeVarCallbackForDropRelation can get called more than once
> during a lookup (in case of concurrent rename and suchlike).
> If so it needs to be prepared to drop the lock(s) it got last time.
> You have not implemented any such logic.  This doesn't seem hard
> to fix, just store the OID list into struct DropRelationCallbackState.

Fixed in attached patch. We added the OID list to the
DropRelationCallbackState as you suggested.

> 2. I'm not sure you have fully thought through the interaction
> with the subsequent "is_partition" stanza.   If the target is
> an intermediate partitioned index, don't we need to acquire lock
> on its parent index before starting to lock children?  (It may
> be that that stanza is already in the wrong place relative to
> the table-locking stanza it currently follows, but not sure.)

It's not necessary to acquire a lock on a possible parent index because
of the restriction that DROP INDEX can only be run on the top-most
partitioned index.
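
A quick illustration of that restriction (throwaway objects; the child index
name is whatever the system auto-generates, so treat the exact identifiers as
illustrative):

CREATE TABLE foopart (a int) PARTITION BY RANGE (a);
CREATE TABLE foopart_child PARTITION OF foopart FOR VALUES FROM (1) TO (5);
CREATE INDEX foopart_idx ON foopart USING btree(a);
DROP INDEX foopart_child_a_idx;  -- rejected: a partition's index can only be
                                 -- dropped through its partitioned parent
DROP INDEX foopart_idx;          -- allowed: drops the whole index tree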

> 3. Calling find_all_inheritors with lockmode NoLock, and then
> locking those relation OIDs later, is both unsafe and code-wasteful.
> Just tell find_all_inheritors to get the locks you want.

Fixed in attached patch.

> 4. This code will invoke find_all_inheritors on InvalidOid if
> IndexGetRelation fails.  It needs to be within the if-test
> just above.

Fixed in attached patch.

> 5. Reading classform again at this point in the function is
> not merely inefficient, but outright wrong, because we
> already released the syscache entry.  Use the local variable.

Fixed in attached patch. Added another local variable,
is_partitioned_index, to store the classform value. The main reason we
need the classform is that the existing relkind and expected_relkind
local variables would only show RELKIND_INDEX, whereas we need to know
whether the relation is specifically a RELKIND_PARTITIONED_INDEX.

> 6. You've not paid enough attention to updating existing comments,
> particularly the header comment for RangeVarCallbackForDropRelation.

Fixed in attached patch. Updated the header comment and merged our new
comment into a related comment slightly above the proposed locking
section.

> Actually though, maybe you *don't* want to do this in
> RangeVarCallbackForDropRelation.  Because of point 2, it might be
> better to run find_all_inheritors after we've successfully
> identified and locked the direct target relation, ie do it back
> in RemoveRelations.  I've not thought hard about that, but it's
> attractive if only because it'd mean you don't have to fix point 1.

We think that RangeVarCallbackForDropRelation is probably the most
appropriate place for the fix. It would look a bit out of place in
RemoveRelations, seeing how there's already related DROP INDEX code in
RangeVarCallbackForDropRelation. With point 2 explained and point 1
addressed, the fix seems fine where it is in
RangeVarCallbackForDropRelation.

Thanks for the feedback.  Attached is a new patch addressing it.

--
Jimmy Yih (VMware)
Gaurab Dey (VMware)

v2-0001-Fix-concurrent-deadlock-scenario-with-DROP-INDEX-.patch
Description:  v2-0001-Fix-concurrent-deadlock-scenario-with-DROP-INDEX-.patch


Concurrent deadlock scenario with DROP INDEX on partitioned index

2022-01-25 Thread Jimmy Yih
Hello pgsql-hackers,

When dropping a partitioned index, the locking order starts with the
partitioned table, then its partitioned index, then the partition
indexes dependent on that partitioned index, and finally the dependent
partition indexes' parent tables. This order allows a deadlock if, for
example, an ANALYZE runs on one of the partition tables: the ANALYZE
locks the partition table and then blocks when it attempts to lock that
table's index, while the DROP INDEX holds the index lock but cannot get
the partition table lock.

Deadlock Reproduction
=====================
Initial setup:
CREATE TABLE foopart (a int) PARTITION BY RANGE (a);
CREATE TABLE foopart_child PARTITION OF foopart FOR VALUES FROM (1) TO (5);
CREATE INDEX foopart_idx ON foopart USING btree(a);

Attach debugger to session 1 and put breakpoint on function deleteObjectsInList:
1: DROP INDEX foopart_idx;

While session 1 is blocked by debugger breakpoint, run the following:
2: ANALYZE foopart_child;

After a couple seconds, detach the debugger from session 1 and see deadlock.

In order to prevent this deadlock scenario, we should find and lock
all the sub-partition and partition tables beneath the partitioned
index's partitioned table before attempting to lock the sub-partition
and partition indexes.

Attached is a patch (+isolation test) which fixes the issue. We
observed this on latest head of Postgres.

Regards,
Jimmy Yih and Gaurab Dey


0001-Fix-concurrent-deadlock-scenario-with-DROP-INDEX-on-.patch
Description:  0001-Fix-concurrent-deadlock-scenario-with-DROP-INDEX-on-.patch


Obtaining a more consistent view definition when a UNION subquery contains undecorated constants

2018-09-27 Thread Jimmy Yih
Hello,

A colleague and I were playing around with dumping views and found an
inconsistency for a view definition involving subqueries and undecorated
constants in a UNION.  When we took that view definition and restored it,
dumping the view gave different syntax again.  Although the slightly
different view definitions were semantically the same, it was weird to see
the syntax difference.

Our view SQL where 'bar' constant is not decorated:

CREATE VIEW fooview AS SELECT t.a FROM (SELECT 1 AS a, 'foo'::text AS baz
UNION SELECT 0 AS a, 'bar')t;
// view definition from pg_get_viewdef()
 SELECT t.a
   FROM ( SELECT 1 AS a,
            'foo'::text AS baz
        UNION
         SELECT 0 AS a,
            'bar'::text) t;

Note that the type decorator is appended to 'bar' in the normal fashion.

Then taking the above view definition, and creating a view,

CREATE VIEW fooview AS SELECT t.a FROM (SELECT 1 AS a, 'foo'::text AS baz
UNION SELECT 0 AS a, 'bar'::text)t;
// view definition from pg_get_viewdef()
 SELECT t.a
   FROM ( SELECT 1 AS a,
            'foo'::text AS baz
        UNION
         SELECT 0 AS a,
            'bar'::text AS text) t;

results in a view definition that has the alias 'AS text' appended to
'bar'::text.

Contrast this to creating a view without the subquery:

CREATE OR REPLACE VIEW fooview AS SELECT 1 AS a, 'foo'::text AS baz UNION
SELECT 0 AS a, 'bar';
// view definition from pg_get_viewdef()
 SELECT 1 AS a,
    'foo'::text AS baz
UNION
 SELECT 0 AS a,
    'bar'::text AS baz;

We see that this view will use the view's tuple descriptor to name the
columns.
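
One way to confirm that the names come from the view's own tuple descriptor is
to look at its attributes directly (a quick illustrative catalog query):

SELECT attname FROM pg_attribute
WHERE attrelid = 'fooview'::regclass AND attnum > 0;
-- returns a and baz, matching the deparsed output above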

Looking at the internal code (mostly get_from_clause_item() function), we
saw that when a subquery is used, there is no tuple descriptor passed to
get_query_def() function. Because of this, get_target_list() uses the
resnames obtained from the pg_rewrite entry's ev_action field.  However, it
seems to be fairly simple to construct a dummy tuple descriptor from the
ev_action to pass down the call stack so that the column names will be
consistent when deparsing both T_A_Const and T_TypeCast parse tree nodes
involving a UNION.  Attached is a patch that demonstrates this.

Running with the attached patch, it seems to work pretty well:

postgres=# CREATE VIEW fooview AS SELECT t.a FROM (SELECT 1 AS a,
'foo'::text AS baz UNION SELECT 0 AS a, 'bar')t;
postgres=# select pg_get_viewdef('fooview');
          pg_get_viewdef
------------------------------------
  SELECT t.a
    FROM ( SELECT 1 AS a,
             'foo'::text AS baz
         UNION
          SELECT 0 AS a,
             'bar'::text AS baz) t;
(1 row)

postgres=# CREATE VIEW fooview AS SELECT t.a FROM (SELECT 1 AS a,
'foo'::text AS baz UNION SELECT 0 AS a, 'bar'::text)t;
postgres=# select pg_get_viewdef('fooview');
          pg_get_viewdef
------------------------------------
  SELECT t.a
    FROM ( SELECT 1 AS a,
             'foo'::text AS baz
         UNION
          SELECT 0 AS a,
             'bar'::text AS baz) t;
(1 row)

Nested subqueries also work with the patch.  We're not aware of any cases
where this approach would break things.

Is this an acceptable change that should be pursued?

Regards,
Jimmy Yih and Jim Doty


view_union_subquery_constant.patch
Description: Binary data


Re: Prevent concurrent DROP SCHEMA when certain objects are being initially created in the namespace

2018-09-05 Thread Jimmy Yih
>
> Something which would be good to have for all those queries is a set of
> isolation tests.  No need for multiple specs, you could just use one
> spec with one session defining all the object types you would like to
> work on.  How did you find this object list?  Did you test all the
> objects available manually?
>
Attached is the isolation spec file.  I originally was only going to fix the
simple CREATE TYPE scenario but decided to look up other objects that can
reside in namespaces and ended up finding a handful of others.  I tested
each one manually before and after adding the AccessShareLock acquire on
the schema.

> I think that line of thought leads to an enormous increase in locking
> overhead, for which we'd get little if any gain in usability.  So my
> inclination is to make an engineering judgment that we won't fix this.
>
As I was creating this patch, I had similar feelings on the locking
overhead and was curious how others would feel about it as well.

Regards,
Jimmy


On Tue, Sep 4, 2018 at 10:05 PM, Tom Lane  wrote:

> Andres Freund  writes:
> > On September 4, 2018 9:11:25 PM PDT, Tom Lane  wrote:
> >> I think that line of thought leads to an enormous increase in locking
> >> overhead, for which we'd get little if any gain in usability.  So my
> >> inclination is to make an engineering judgment that we won't fix this.
>
> > Haven't we already significantly started down this road, to avoid a lot
> of the "tuple concurrently updated" type errors?
>
> Not that I'm aware of.  We do not take locks on schemas, nor functions,
> nor any other of the object types I mentioned.
>
> > Would expanding this a git further really be that noticeable?
>
> Frankly, I think it would be not so much "noticeable" as "disastrous".
>
> Making the overhead tolerable would require very large compromises
> in coverage, perhaps like "we'll only lock during DDL not DML".
> At which point I'd question why bother.  We've seen no field reports
> (that I can recall offhand, anyway) that trace to not locking these
> objects.
>
> regards, tom lane
>


concurrent-schema-drop.spec
Description: Binary data


Prevent concurrent DROP SCHEMA when certain objects are being initially created in the namespace

2018-09-04 Thread Jimmy Yih
Hello,

When an empty namespace is being initially populated with certain objects,
it is possible for a DROP SCHEMA operation to come in and delete the
namespace without using CASCADE.  The objects would be created but left
unusable.  This can leave behind pg_class, pg_type, and other such entries
with invalid namespace values that need to be manually removed by the
user.  To prevent this from happening, we should take an AccessShareLock
on the namespace to which the type, function, etc. is being added.

Attached is a patch that covers the following CREATE commands:
CREATE TYPE, CREATE FUNCTION, CREATE AGGREGATE, CREATE OPERATOR FAMILY,
CREATE OPERATOR CLASS, and CREATE OPERATOR

Example Reproduction
Session1: CREATE SCHEMA testschema;
Session1: BEGIN;
Session1: CREATE TYPE testschema.testtype;
Session2: DROP SCHEMA testschema;
Session1: COMMIT;
// The DROP SCHEMA succeeds in session 2 and session 1's COMMIT goes
through without error, but the pg_type entry for typname testtype will
still be there.  This example uses just a simple TYPE object, but the
other objects can be reproduced the same way in place of CREATE TYPE.
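
After that sequence, the stranded entry is easy to spot because its namespace
OID no longer has a matching pg_namespace row (illustrative query):

SELECT t.typname, t.typnamespace
FROM pg_type t LEFT JOIN pg_namespace n ON n.oid = t.typnamespace
WHERE t.typname = 'testtype' AND n.oid IS NULL;
-- one row comes back, pointing at the dropped schema's now-invalid OID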

Related Postgres 9.2 commit (note in commit message that "We need similar
protection for all other object types"):
https://github.com/postgres/postgres/commit/1575fbcb795fc331f46588b4520c4bca7e854d5c

The patch itself is relatively trivial, just adding copy/paste lock
acquires near other existing lock acquires.  I wasn't sure if this needed
a decently sized refactor like the referenced commit above.

Regards,
Jimmy


concurrent_namespace_drop.patch
Description: Binary data


possible issue with array type created for every heap relation composite type

2018-07-10 Thread Jimmy Yih
Hello,

In Postgres 8.3, it was decided that an array type would be automatically
created for every heap relation's composite type.

Reference thread:
https://www.postgresql.org/message-id/flat/20070302234016.GF3665%40fetter.org

The possible issue I would like to note is related to how the array type is
named in the makeArrayTypeName() function.  The composite type will take the
heap relation's relname as its typname, and the array type will usually be
named with an underscore prepended (after first attempting to use the
relname and hitting a typname collision with the composite type).  If the
typname with the underscore prepended is already taken, the logic is to
continue prepending underscores until there are no typname collisions
(truncating the end of the typname if it gets past NAMEDATALEN of 64).  It
is possible that enough heap relations with similar relnames could cause
more and more typname collisions, until you end up with typnames that
consist primarily of underscores, or table creation fails outright because
we have reached a typname consisting of all underscores.
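
A small way to see this (the DO block below is just an illustration; the
resulting typnames are what I'd expect from the naming logic described above,
one extra leading underscore per collision):

DO $$
BEGIN
    -- three 63-character relnames that differ only in their last character
    EXECUTE format('CREATE TABLE %I (a int)', repeat('a', 62) || '1');
    EXECUTE format('CREATE TABLE %I (a int)', repeat('a', 62) || '2');
    EXECUTE format('CREATE TABLE %I (a int)', repeat('a', 62) || '3');
END $$;

SELECT typname FROM pg_type WHERE typname ~ '^_+a+$';
-- expected: _aaa..., __aaa..., ___aaa...  (each array type needed one more
-- leading underscore because the previous candidate name was already taken)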

This is more likely to happen when heap relnames are similar and of string
length 63+.  I can see this possibly being an issue with table partitioning
if a user decides on partition names that reach string length 63+ (assuming
the user uses exactly 63 characters or does not hit a "relation already
exists" error from relname truncation).  This may also become an issue if we
ever decide to do automated partition naming instead of having the user name
the partitions manually.

Is this an issue we should be worrying about?  When and how often are array
types of a heap relation's composite type used (I can only think of doing
array_agg on table rows)?  Should we consider coding up "CREATE TYPE foo
ARRAY OF bar" as suggested in the past in the above-referenced hackers
thread?

Attached are some SQL examples describing this issue.

Thanks,
Jimmy


typname_collision.sql
Description: Binary data


typname_collision_with_partitioning.sql
Description: Binary data