Re: Handle infinite recursion in logical replication setup
Hi Vignesh. FYI the v39* patch fails to apply [1]. Can you please rebase it?

=== Applying patches on top of PostgreSQL commit ID 5f858dd3bebd1f3845aef2bff7f4345bfb7b74b3 ===
=== applying patch ./v39-0001-Check-and-throw-an-error-if-publication-tables-w.patch
patching file doc/src/sgml/ref/alter_subscription.sgml
patching file doc/src/sgml/ref/create_subscription.sgml
patching file src/backend/commands/subscriptioncmds.c
Hunk #10 FAILED at 886.
1 out of 14 hunks FAILED -- saving rejects to file src/backend/commands/subscriptioncmds.c.rej
patching file src/test/regress/expected/subscription.out
patching file src/test/regress/sql/subscription.sql
patching file src/test/subscription/t/030_origin.pl
patching file src/tools/pgindent/typedefs.list

[1] http://cfbot.cputube.org/patch_38_3610.log

Kind Regards,
Peter Smith.
Fujitsu Australia
Re: Documentation about PL transforms
On Thu, Jul 28, 2022 at 4:06, the sender wrote:
> On 2022-07-13 00:26, Pavel Stehule wrote:
>
> > 1. I am not sure if get_call_trftypes is a good name - the prefix get_call
> > is used when some runtime data is processed.
>
> I guess I hadn't caught on that the prefix carried that meaning.
> To me, it appeared to be a prefix used to avoid being specific to
> 'function' or 'procedure'.

I am sure this function has been in Postgres significantly longer than Postgres has had support for 'procedures'.

> > This function just returns
> > reformatted data from the system catalogue. Maybe get_func_trftypes_list,
> > or just replace function get_func_trftypes (now, the list is an array, so
> > there should not be performance issues). For consistency, the new function
> > should be used in plperl and plpython too. Probably this function is not
>
> If it is acceptable to replace get_func_trftypes like that, I can
> produce such a patch.
>
> > 2.
> >
> > +It also contains the OID of the intended procedural language and whether
> > +that procedural language is declared as TRUSTED, useful
> > +if a single inline handler is supporting more than one procedural language.
> >
> > I am not sure if this sentence is in the correct place. Maybe it can be
> > mentioned separately, so generally handlers can be used by more than one
> > procedural language. But maybe I don't understand this sentence.
>
> My point was that, if the structure did /not/ include the OID of
> the PL and its TRUSTED property, then it would not be possible
> for a single inline handler to support more than one PL. So that
> is why it is a good thing that those are included in the structure,
> and why it would be a bad thing if they were not.
>
> Would it be clearer to say:
>
> It also contains the OID of the intended procedural language and whether
> that procedural language is declared as TRUSTED.
> While these values are redundant if the inline handler is serving only
> a single procedural language, they are necessary to allow one inline
> handler to serve more than one PL.

ok

Regards

Pavel

> Regards,
> -Chap
RE: Perform streaming logical transactions by background workers and parallel apply
Dear Wang,

I found further comments about the test code.

11. src/test/regress/sql/subscription.sql

```
-- fail - streaming must be boolean
CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = foo);
```

The comment is no longer correct: it should be "streaming must be boolean or 'parallel'".

12. src/test/regress/sql/subscription.sql

```
-- now it works
CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
```

I think we should also test the case of streaming = 'parallel'.

13. 015_stream.pl

I could not find a test for TRUNCATE. IIUC the apply bgworker works well even if it gets a LOGICAL_REP_MSG_TRUNCATE message from the main worker. Can you add that case?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
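For illustration only, here is a minimal C sketch of the kind of option handling being discussed: a "streaming" value accepted either as a boolean or as the special word 'parallel'. The enum and function names are hypothetical stand-ins, not PostgreSQL's actual subscription-option code.

```c
#include <strings.h>

/* Hypothetical streaming modes; illustrative only. */
typedef enum
{
	STREAMING_OFF,
	STREAMING_ON,
	STREAMING_PARALLEL,
	STREAMING_INVALID
} StreamingMode;

/*
 * Accept the usual boolean spellings plus the special value 'parallel';
 * anything else (e.g. streaming = foo) is rejected by the caller.
 */
StreamingMode
parse_streaming_option(const char *value)
{
	if (strcasecmp(value, "true") == 0 || strcasecmp(value, "on") == 0)
		return STREAMING_ON;
	if (strcasecmp(value, "false") == 0 || strcasecmp(value, "off") == 0)
		return STREAMING_OFF;
	if (strcasecmp(value, "parallel") == 0)
		return STREAMING_PARALLEL;
	return STREAMING_INVALID;
}
```

This mirrors why the old regression-test comment "streaming must be boolean" is stale: the value space is now boolean-or-'parallel'.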
Re: Official Windows Installer and Documentation
On Wed, Jul 27, 2022 at 8:22 PM Tom Lane wrote:
> I wrote:
> > If EDB isn't adequately filling in the documentation for the behavior
> > of their packaging, that's on them.
>
> Having now looked more closely at the pg_upgrade documentation,
> I don't think this is exactly EDB's fault; it's text that should
> never have been there to begin with. ISTM we need to simply rip out
> lines 431..448 of pgupgrade.sgml, that is all the Windows-specific
> text starting with
>
> For Windows users, you must be logged into an administrative account, and
>
> That has got nothing to recommend it: we do not generally provide
> platform-specific details in these man pages, and to the extent it
> provides details, those details are likely to be wrong.

I mean, we do provide platform-specific details/examples; it's just that the platform is a source-installed Linux platform (though path-less).

Does the avoidance of dealing with other platforms also apply to NET STOP, or do you find that an acceptable variance? Or are you suggesting that basically all O/S commands should be zapped? If not, then rewriting lines 442 to 446 to just be the command seems worthwhile.

I'd say pg_upgrade warrants an examples section like pg_basebackup has (though obviously pg_upgrade is procedural).

I do have another observation:

https://github.com/postgres/postgres/blob/4fc6b6eefcf98f79211bb790ee890ebcb05c178d/src/bin/pg_upgrade/check.c#L665

    if (PQntuples(res) != 1 ||
        atooid(PQgetvalue(res, 0, 1)) != BOOTSTRAP_SUPERUSERID)
        pg_fatal("database user \"%s\" is not the install user",
                 os_info.user);

Any reason not to inform the DBA of the name of the install user here? Sure, it is almost certainly postgres, but it also seems like an easy win for them, and anyone they may ask for help, to know exactly the name of the install user in the clusters should that end up being the issue.
Additionally, from what I can tell, if that check fails (or any of the checks, really) it is not possible to tell whether the check was being performed against the old or the new server. The user does not know that checks against the old server are performed first and then checks against the new one, and there are no banners saying "checking old/new".

David J.
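As a sketch of the suggestion above, the failing check could report both the user pg_upgrade connected as and the expected install user, so the DBA can see at a glance which name is wrong. This is a hypothetical helper using snprintf into a message buffer, not the actual pg_fatal/check.c code.

```c
#include <stdio.h>

/*
 * Hypothetical sketch: format the install-user error so that it names
 * both the connected user and the expected install user.  In real
 * pg_upgrade this would be a pg_fatal() call; here we just build the
 * message string for illustration.
 */
void
format_install_user_error(char *buf, size_t buflen,
						  const char *connected_user,
						  const char *install_user)
{
	snprintf(buf, buflen,
			 "database user \"%s\" is not the install user (\"%s\")",
			 connected_user, install_user);
}
```

The extra name costs nothing to include and removes one round of guesswork when the check trips.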
Re: Improve description of XLOG_RUNNING_XACTS
Thanks Masahiko for the updated patch. It looks good to me.

I wonder whether the logic should be, similar to ProcArrayApplyRecoveryInfo():

    if (xlrec->subxid_overflow)
        ...
    else if (xlrec->subxcnt > 0)
        ...

But you may ignore it.

--
Best Wishes,
Ashutosh

On Thu, Jul 28, 2022 at 7:41 AM Masahiko Sawada wrote:
> On Thu, Jul 21, 2022 at 10:13 PM Ashutosh Bapat wrote:
> >
> > Hi
> >
> > On Thu, Jul 21, 2022 at 6:44 AM Masahiko Sawada wrote:
> > >
> > > Hi,
> > >
> > > I realized that standby_desc_running_xacts() in standbydesc.c doesn't
> > > describe subtransaction XIDs. I've attached the patch to improve the
> > > description. Here is an example by pg_waldump:
> > >
> > > * HEAD
> > > rmgr: Standby len (rec/tot): 58/58, tx: 0, lsn:
> > > 0/01D0C608, prev 0/01D0C5D8, desc: RUNNING_XACTS nextXid 1050
> > > latestCompletedXid 1047 oldestRunningXid 1048; 1 xacts: 1048
> > >
> > > * w/ patch
> > > rmgr: Standby len (rec/tot): 58/58, tx: 0, lsn:
> > > 0/01D0C608, prev 0/01D0C5D8, desc: RUNNING_XACTS nextXid 1050
> > > latestCompletedXid 1047 oldestRunningXid 1048; 1 xacts: 1048; 1
> > > subxacts: 1049
> > >
> >
> > I think this is a good addition to debugging info. +1
> >
> > If we are going to add 64 subxid numbers then it would help if we
> > could be more verbose and print "subxid overflowed" instead of "subxid
> > ovf".
>
> Yeah, it looks better so agreed. I've attached an updated patch.
>
> Regards,
>
> --
> Masahiko Sawada
> EDB: https://www.enterprisedb.com/
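To make the branching being proposed concrete, here is a simplified C sketch of a description routine that prints the running xacts and then either the subxacts or a "subxid overflowed" marker, in the style suggested above. The struct and function names are illustrative stand-ins, not the actual standbydesc.c code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* Simplified stand-in for xl_running_xacts: xcnt top-level XIDs followed
 * by subxcnt subtransaction XIDs in the same array. */
typedef struct
{
	int            xcnt;
	int            subxcnt;
	bool           subxid_overflow;
	TransactionId *xids;
} RunningXactsRec;

/*
 * Build a description like "1 xacts: 1048; 1 subxacts: 1049", handling
 * the overflow case first, as in ProcArrayApplyRecoveryInfo().
 * Assumes buf is large enough for the record being described.
 */
void
describe_running_xacts(char *buf, size_t buflen, const RunningXactsRec *rec)
{
	int			off = snprintf(buf, buflen, "%d xacts:", rec->xcnt);

	for (int i = 0; i < rec->xcnt; i++)
		off += snprintf(buf + off, buflen - off, " %u", rec->xids[i]);

	if (rec->subxid_overflow)
		off += snprintf(buf + off, buflen - off, "; subxid overflowed");
	else if (rec->subxcnt > 0)
	{
		off += snprintf(buf + off, buflen - off, "; %d subxacts:",
						rec->subxcnt);
		for (int i = 0; i < rec->subxcnt; i++)
			off += snprintf(buf + off, buflen - off, " %u",
							rec->xids[rec->xcnt + i]);
	}
}
```

Checking `subxid_overflow` before `subxcnt > 0` keeps the two conditions mutually exclusive in the output, which is the ordering the message above is asking about.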
Re: [Patch] ALTER SYSTEM READ ONLY
Hi,

On Thu, Jul 28, 2022 at 4:05 AM Jacob Champion wrote:
>
> On Fri, Apr 8, 2022 at 7:27 AM Amul Sul wrote:
> > Attached is a rebased version for the latest master head (#891624f0ec).
>
> Hi Amul,
>
> I'm going through past CF triage emails today; I noticed that this
> patch dropped out of the commitfest when you withdrew it in January,
> but it hasn't been added back with the most recent patchset you
> posted. Was that intended, or did you want to re-register it for
> review?

Yes, there is a plan to re-register it again, but not anytime soon, once we start to rework the design.

Regards,
Amul
Re: Official Windows Installer and Documentation
I wrote:
> If EDB isn't adequately filling in the documentation for the behavior
> of their packaging, that's on them.

Having now looked more closely at the pg_upgrade documentation, I don't think this is exactly EDB's fault; it's text that should never have been there to begin with. ISTM we need to simply rip out lines 431..448 of pgupgrade.sgml, that is all the Windows-specific text starting with

    For Windows users, you must be logged into an administrative account, and

That has got nothing to recommend it: we do not generally provide platform-specific details in these man pages, and to the extent it provides details, those details are likely to be wrong. We need look no further than the references to "9.6" to establish that. Yeah, it says "e.g.", but novices will probably fail to understand which parts of the example are suitable to copy verbatim and which aren't. Meanwhile non-novices don't need the example to begin with.

On top of which, the whole para has been inserted into non-platform-specific text, seemingly with the aid of a dartboard, because it doesn't particularly connect to what's before or after it.

regards, tom lane
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
On Thu, Jul 28, 2022 at 7:18 AM Masahiko Sawada wrote:
> On Wed, Jul 27, 2022 at 8:33 PM Amit Kapila wrote:
> >
> > I have changed accordingly in the attached
> > and apart from that slightly modified the comments and commit message.
> > Do let me know what you think of the attached?
>
> It would be better to remember the initial running xacts after
> SnapBuildRestore() returns true? Because otherwise, we could end up
> allocating InitialRunningXacts multiple times while leaking the old
> ones if there are no serialized snapshots that we are interested in.
>

Right, this makes sense. But note that you can then no longer have the check (builder->state == SNAPBUILD_START), which I believe is not required. We need to do this after restore, in whichever state the snapshot was, as any state other than SNAPBUILD_CONSISTENT can have commits without all their changes. Accordingly, I think the comment: "Remember the transactions and subtransactions that were running when xl_running_xacts record that we decoded first was written." needs to be slightly modified to something like: "Remember the transactions and subtransactions that were running when the xl_running_xacts record that we decoded was written.". Change this if it is used at any other place in the patch.

> ---
> + if (builder->state == SNAPBUILD_START)
> + {
> +     int   nxacts = running->subxcnt + running->xcnt;
> +     Size  sz = sizeof(TransactionId) * nxacts;
> +
> +     NInitialRunningXacts = nxacts;
> +     InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
> +     memcpy(InitialRunningXacts, running->xids, sz);
> +     qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
> + }
>
> We should allocate the memory for InitialRunningXacts only when
> (running->subxcnt + running->xcnt) > 0.
>

There is no harm in doing that but ideally, that case would have been covered by an earlier check "if (running->oldestRunningXid == running->nextXid)" which suggests "No transactions were running, so we can jump to consistent."
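The allocation pattern under discussion can be sketched in standalone C: copy and sort the running (sub)transaction XIDs only when there is at least one, so nothing is allocated (or leaked on repeated calls) for the empty case. This is a simplified stand-in using malloc in place of MemoryContextAlloc, with a comparator equivalent to PostgreSQL's xidComparator for plain numeric order.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;

/* Numeric-order comparator, equivalent in effect to xidComparator. */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId xa = *(const TransactionId *) a;
	TransactionId xb = *(const TransactionId *) b;

	return (xa < xb) ? -1 : (xa > xb) ? 1 : 0;
}

/*
 * Return a sorted copy of the running XIDs, or NULL when none are
 * running (no allocation at all in that case).  Caller frees.
 */
TransactionId *
remember_initial_running_xacts(const TransactionId *xids, int nxacts)
{
	TransactionId *copy;

	if (nxacts <= 0)
		return NULL;			/* nothing running: skip the allocation */

	copy = malloc(sizeof(TransactionId) * nxacts);
	memcpy(copy, xids, sizeof(TransactionId) * nxacts);
	qsort(copy, nxacts, sizeof(TransactionId), xid_cmp);
	return copy;
}
```

Sorting up front is what makes the later membership checks a cheap binary search, which is why the original snippet qsorts immediately after the memcpy.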
Kindly make the required changes and submit the back-branch patches again.

--
With Regards,
Amit Kapila.
Re: Skip partition tuple routing with constant partition key
On Thu, 28 Jul 2022 at 00:50, Amit Langote wrote:
> So, in a way the caching scheme works for
> LIST partitioning only if the same value appears consecutively in the
> input set, whereas it does not for *a set of* values belonging to the
> same partition appearing consecutively. Maybe that's a reasonable
> restriction for now.

I'm not really seeing another cheap enough way of doing that. Any LIST partition could allow any number of values. We've only space to record 1 of those values by way of recording which element in the PartitionBound it was located at.

> if (part_index < 0)
> - part_index = boundinfo->default_index;
> + {
> +     /*
> +      * Since we don't do caching for the default partition or failed
> +      * lookups, we'll just wipe the cache fields back to their initial
> +      * values. The count becomes 0 rather than 1 as 1 means it's the
> +      * first time we've found a partition we're recording for the cache.
> +      */
> +     partdesc->last_found_datum_index = -1;
> +     partdesc->last_found_part_index = -1;
> +     partdesc->last_found_count = 0;
> +
> +     return boundinfo->default_index;
> + }
>
> I wonder why not to leave the cache untouched in this case? It's
> possible that erratic rows only rarely occur in the input sets.

I looked into that and I ended up just removing the code to reset the cache. It now works similarly to a LIST partitioned table's NULL partition.

> Should the comment update above get_partition_for_tuple() mention
> something like the cached path is basically O(1) and the non-cache
> path O(log N), as I can see in comments in some other modules, like
> pairingheap.c?

I adjusted for the other things you mentioned but I didn't add the big O stuff. I thought the comment was clear enough.

I'd quite like to push this patch early next week, so if anyone else is following along that might have any objections, could they do so before then?
David

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index e03ea27299..7ae7496737 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1332,10 +1332,49 @@ FormPartitionKeyDatum(PartitionDispatch pd,
 		elog(ERROR, "wrong number of partition key expressions");
 }
 
+/*
+ * The number of times the same partition must be found in a row before we
+ * switch from a binary search for the given values to just checking if the
+ * values belong to the last found partition.  This must be above 0.
+ */
+#define PARTITION_CACHED_FIND_THRESHOLD		16
+
 /*
  * get_partition_for_tuple
  *		Finds partition of relation which accepts the partition key specified
- *		in values and isnull
+ *		in values and isnull.
+ *
+ * Calling this function can be quite expensive when LIST and RANGE
+ * partitioned tables have many partitions.  This is due to the binary search
+ * that's done to find the correct partition.  Many of the use cases for LIST
+ * and RANGE partitioned tables make it likely that the same partition is
+ * found in subsequent ExecFindPartition() calls.  This is especially true for
+ * cases such as RANGE partitioned tables on a TIMESTAMP column where the
+ * partition key is the current time.  When asked to find a partition for a
+ * RANGE or LIST partitioned table, we record the partition index and datum
+ * offset we've found for the given 'values' in the PartitionDesc (which is
+ * stored in relcache), and if we keep finding the same partition
+ * PARTITION_CACHED_FIND_THRESHOLD times in a row, then we'll enable caching
+ * logic and instead of performing a binary search to find the correct
+ * partition, we'll just double-check that 'values' still belong to the last
+ * found partition, and if so, we'll return that partition index, thus
+ * skipping the need for the binary search.  If we fail to match the last
+ * partition when double checking, then we fall back on doing a binary search.
+ * In this case, unless we find 'values' belong to the DEFAULT partition,
+ * we'll reset the number of times we've hit the same partition so that we
+ * don't attempt to use the cache again until we've found that partition at
+ * least PARTITION_CACHED_FIND_THRESHOLD times in a row.
+ *
+ * For cases where the partition changes on each lookup, the amount of
+ * additional work required just amounts to recording the last found partition
+ * and bound offset then resetting the found counter.  This is cheap and does
+ * not appear to cause any meaningful slowdowns for such cases.
+ *
+ * No caching of partitions is done when the last found partition is the
+ * DEFAULT or NULL partition.  For the case of the DEFAULT partition, there
+ * is no bound offset storing the matching datum, so we cannot confirm the
+ * indexes match.  For the NULL partition, this is just so cheap, there's no
+ * sense in caching.
  *
  * Return value is index of the
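The threshold-based caching described in that comment can be sketched as a tiny self-contained C program. This is not the execPartition.c code: partitions here are simplified to half-open integer ranges ([bounds[i], bounds[i+1])), and the names are illustrative, but the control flow (count consecutive hits, switch to a cached double-check after PARTITION_CACHED_FIND_THRESHOLD, fall back to binary search and reset the streak on a miss) follows the scheme under discussion.

```c
#include <stdlib.h>

#define PARTITION_CACHED_FIND_THRESHOLD 16

/* Simplified partition descriptor: nparts ranges over ascending bounds. */
typedef struct
{
	int  nparts;
	int *bounds;				/* nparts + 1 ascending boundaries */
	int  last_found_part;		/* -1 when nothing cached */
	int  last_found_count;		/* consecutive hits on last_found_part */
} PartDesc;

/* Slow path: O(log N) binary search over the range bounds. */
static int
binary_search_part(const PartDesc *pd, int value)
{
	int			lo = 0,
				hi = pd->nparts - 1;

	while (lo <= hi)
	{
		int			mid = (lo + hi) / 2;

		if (value < pd->bounds[mid])
			hi = mid - 1;
		else if (value >= pd->bounds[mid + 1])
			lo = mid + 1;
		else
			return mid;
	}
	return -1;					/* no partition accepts this value */
}

int
find_partition(PartDesc *pd, int value)
{
	int			part;

	/* Fast path: after enough consecutive hits, recheck the cached bound. */
	if (pd->last_found_part >= 0 &&
		pd->last_found_count >= PARTITION_CACHED_FIND_THRESHOLD)
	{
		part = pd->last_found_part;
		if (value >= pd->bounds[part] && value < pd->bounds[part + 1])
			return part;		/* O(1) cache hit */
	}

	part = binary_search_part(pd, value);
	if (part == pd->last_found_part)
		pd->last_found_count++;	/* same partition again: extend the streak */
	else
	{
		pd->last_found_part = part;	/* new partition: restart the streak */
		pd->last_found_count = 1;
	}
	return part;
}
```

As the patch comment notes, when every lookup lands in a different partition the only extra work is updating the two cache fields, so the slow path stays essentially as cheap as before.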
Re: Official Windows Installer and Documentation
On Wed, Jul 27, 2022 at 07:31:35PM -0700, David G. Johnston wrote:
>
> Ultimately we do our users the best service if when they operate an
> installation using defaults that they have documentation showing how to
> perform something like an upgrade that works with those defaults.

I don't really agree that it's the best service to let users assume that they can blindly copy/paste some commands without trying to understand them, or how to adapt them to their specificities. That's in my opinion particularly true on Windows, since to my knowledge most companies will have the binaries installed in one place (C:\Program Files\PostgreSQL might be frequent), and have the data stored in another place (D: or other). So I don't think the default command will actually work for any non-toy installation.

> So why not assume the user is whatever the EDB installer uses
> and make that the example?

Well, IIUC that used to be the case until EDB changed its installer. Maybe the odds of an impacting change happening again are low, but it's certainly not a great idea to assume that the community will regularly check their installer and update the doc to match what they're doing. So yeah, it may be better for them to provide documentation adapted to their usage.
Re: Official Windows Installer and Documentation
"David G. Johnston" writes:
> On Wed, Jul 27, 2022 at 6:42 PM Julien Rouhaud wrote:
> > Note that there's no "official" Windows installer,

Yeah, that.

> Our technical definition aside, the fact is our users consider the sole EDB
> installer to be official.
> If the ultimate solution is to update:
> https://www.enterprisedb.com/downloads/postgres-postgresql-downloads
> to have its own installation and upgrade supplement to the official
> documentation then I'd be fine with that.

That's what needs to happen. On the Linux side for example, we do not address packaging-specific behaviors of Devrim's builds, or Debian's builds, or Red Hat's builds --- all of which act differently, in ways that are sadly a lot more critical to novices than seasoned users. If EDB isn't adequately filling in the documentation for the behavior of their packaging, that's on them.

regards, tom lane
RE: Support load balancing in libpq
Dear Jelte,

> With plain Postgres this assumption is probably correct. But the main reason
> I'm interested in this patch was because I would like to be able to load
> balance across the workers in a Citus cluster. These workers are all
> primaries.
> Similar usage would likely be possible with BDR (bidirectional replication).

I agree this feature is useful for BDR-like solutions.

> If the user takes such care when building their host list, they could simply
> not add the load_balance_hosts=true option to the connection string.
> If you know there's only one primary in the list and you're looking for
> the primary, then there's no reason to use load_balance_hosts=true.

You mean that it is the user's responsibility to set this correctly, right? That seems reasonable, because libpq is just a library for connecting to the server and should not be too smart.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: Official Windows Installer and Documentation
On Wednesday, July 27, 2022, Julien Rouhaud wrote:
> On Wed, Jul 27, 2022 at 07:02:51PM -0700, David G. Johnston wrote:
> >
> > In the end the problem is ours and cannot be simply assigned to a
> > third-party. So let's resolve it here (on this list, whatever the
> > solution) where representatives from all parties are present.
>
> We could amend the pg_upgrade (and maybe others if needed, but I don't see any
> other occurrences of RUNAS) documentation to be a bit more general, like the
> non-Windows part of it, maybe something like:
>
> For Windows users, you must be logged into an administrative account, and then
> start a shell as the user running the postgres service and set the proper
> path. Assuming a user named postgres and the binaries installed in
> C:\Program Files\PostgreSQL\14:
>
>     RUNAS /USER:postgres "CMD.EXE"
>     SET PATH=%PATH%;C:\Program Files\PostgreSQL\14\bin;
>
> It's ultimately up to the users to adapt the commands to match their
> environment.

Ultimately we do our users the best service if, when they operate an installation using defaults, they have documentation showing how to perform something like an upgrade that works with those defaults. I don't see much point making that change in isolation until it is obvious nothing better is forthcoming.

If the O/S user postgres doesn't exist then you need to supply -U postgres, because the install user for PostgreSQL is still postgres. So why not assume the user is whatever the EDB installer uses and make that the example? If someone has an install on Windows that uses the postgres account, adapting the command for them should be trivial, and the majority of installer users get a command sequence that works.

David J.
Re: Official Windows Installer and Documentation
On Wed, Jul 27, 2022 at 07:02:51PM -0700, David G. Johnston wrote:
>
> In the end the problem is ours and cannot be simply assigned to a
> third-party. So let's resolve it here (on this list, whatever the
> solution) where representatives from all parties are present.

We could amend the pg_upgrade (and maybe others if needed, but I don't see any other occurrences of RUNAS) documentation to be a bit more general, like the non-Windows part of it, maybe something like:

For Windows users, you must be logged into an administrative account, and then start a shell as the user running the postgres service and set the proper path. Assuming a user named postgres and the binaries installed in C:\Program Files\PostgreSQL\14:

    RUNAS /USER:postgres "CMD.EXE"
    SET PATH=%PATH%;C:\Program Files\PostgreSQL\14\bin;

It's ultimately up to the users to adapt the commands to match their environment.
Re: fairywren hung in pg_basebackup tests
On Thu, Jul 28, 2022 at 12:41 PM Thomas Munro wrote:
> I thought about putting it at the top, but don't we really only need
> to make an extra system call if MinGW's stat() told us it saw a
> directory? And what if you asked to look through symlinks? I thought
> about putting it near the S_ISLNK() test, which is the usual pattern,
> but then what if MinGW decides to add d_type support one day? Those
> thoughts led to the attached formulation. Untested. I'll go and try
> to see if I can run this with Melih's proposed MSYS CI support...

Success. I'll push this, and then hopefully those BF animals can be unwedged. (I see some unrelated warnings to look into.)

https://cirrus-ci.com/task/5253183533481984?logs=tests#L835
Re: Improve description of XLOG_RUNNING_XACTS
On Thu, Jul 21, 2022 at 10:13 PM Ashutosh Bapat wrote:
>
> Hi
>
> On Thu, Jul 21, 2022 at 6:44 AM Masahiko Sawada wrote:
> >
> > Hi,
> >
> > I realized that standby_desc_running_xacts() in standbydesc.c doesn't
> > describe subtransaction XIDs. I've attached the patch to improve the
> > description. Here is an example by pg_waldump:
> >
> > * HEAD
> > rmgr: Standby len (rec/tot): 58/58, tx: 0, lsn:
> > 0/01D0C608, prev 0/01D0C5D8, desc: RUNNING_XACTS nextXid 1050
> > latestCompletedXid 1047 oldestRunningXid 1048; 1 xacts: 1048
> >
> > * w/ patch
> > rmgr: Standby len (rec/tot): 58/58, tx: 0, lsn:
> > 0/01D0C608, prev 0/01D0C5D8, desc: RUNNING_XACTS nextXid 1050
> > latestCompletedXid 1047 oldestRunningXid 1048; 1 xacts: 1048; 1
> > subxacts: 1049
> >
>
> I think this is a good addition to debugging info. +1
>
> If we are going to add 64 subxid numbers then it would help if we
> could be more verbose and print "subxid overflowed" instead of "subxid
> ovf".

Yeah, it looks better so agreed. I've attached an updated patch.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachment: v2-0001-Improve-description-of-XLOG_RUNNING_XACTS.patch
RE: [Commitfest 2022-07] Patch Triage: Waiting on Author
On Wednesday, July 27, 2022 3:27 AM Jacob Champion wrote:
>
> - Parallel INSERT SELECT take 2
> https://commitfest.postgresql.org/38/3143/
>
> There was a lot of discussion early on in this patchset's life, and
> then it got caught in a rebase loop without review in August 2021. The
> thread has mostly gone dark since then and the patch does not apply.

Sorry, I think we don't have enough time to work on this recently. So please mark it as RWF and we will get back to this in the future.

Best regards,
Hou zj
Re: Documentation about PL transforms
On 2022-07-13 00:26, Pavel Stehule wrote:
> 1. I am not sure if get_call_trftypes is a good name - the prefix get_call
> is used when some runtime data is processed.

I guess I hadn't caught on that the prefix carried that meaning. To me, it appeared to be a prefix used to avoid being specific to 'function' or 'procedure'.

> This function just returns
> reformatted data from the system catalogue. Maybe get_func_trftypes_list,
> or just replace function get_func_trftypes (now, the list is an array, so
> there should not be performance issues). For consistency, the new function
> should be used in plperl and plpython too. Probably this function is not

If it is acceptable to replace get_func_trftypes like that, I can produce such a patch.

> 2.
>
> +It also contains the OID of the intended procedural language and whether
> +that procedural language is declared as TRUSTED, useful
> +if a single inline handler is supporting more than one procedural language.
>
> I am not sure if this sentence is in the correct place. Maybe can be
> mentioned separately, so generally handlers can be used by more than one
> procedural language. But maybe I don't understand this sentence.

My point was that, if the structure did /not/ include the OID of the PL and its TRUSTED property, then it would not be possible for a single inline handler to support more than one PL. So that is why it is a good thing that those are included in the structure, and why it would be a bad thing if they were not.

Would it be clearer to say:

It also contains the OID of the intended procedural language and whether that procedural language is declared as TRUSTED. While these values are redundant if the inline handler is serving only a single procedural language, they are necessary to allow one inline handler to serve more than one PL.

Regards,
-Chap
Re: Official Windows Installer and Documentation
On Wed, Jul 27, 2022 at 6:42 PM Julien Rouhaud wrote:
> Hi,
>
> On Wed, Jul 27, 2022 at 11:36:11PM +0200, Thomas Kellerer wrote:
> > David G. Johnston wrote on 27.07.2022 at 21:21:
> > > And then there is the issue of file ownership.
> > >
> > > Assuming we want better documentation for this specific issue for
> > > back-patching what would that look like?
> > >
> > > Going forward should our installer be creating the postgres user for
> > > consistency with other platforms or not?
> >
> > Didn't the installer used to do that in earlier releases and that
> > was removed when Postgres was able to "drop privileges" when the
> > service is started?
> >
> > I remember a lot of problems around the specific Postgres service
> > account when that still was the case.
>
> Note that there's no "official" Windows installer, and companies providing
> one are free to implement it the way they want, which can contradict the
> official documentation. The download section of the website clearly says
> that this is a third-party installer.
>
> For now there's only the EDB installer that remains, but I think that some
> time ago there were 2 or 3 different providers.
>
> For the EDB installer, I'm not sure why or when it was changed, but it
> indeed used to have a dedicated local account and now relies on "Local
> System Account" or something like that. But IIRC, when it used to create a
> local account the name could be configured, so there was no guarantee of a
> local "postgres" account by then either.

Our technical definition aside, the fact is our users consider the sole EDB installer to be official. If the ultimate solution is to update:

https://www.enterprisedb.com/downloads/postgres-postgresql-downloads

to have its own installation and upgrade supplement to the official documentation, then I'd be fine with that.

But as of now the "Installation Guide" points back to the official documentation, which has no actual distribution-specific information while simultaneously reinforcing the impression that it is an official installer.

I get sending people to the EDB web services team for download issues, since we don't host the binaries. That aspect of them being third-party doesn't seem to be an issue. But for documentation, given the current state of things, whether we amend our docs or strongly encourage the people who are benefiting financially from being our de facto official Windows installer provider to provide separate documentation to address this apparent shortcoming that is harming our image in the Windows community, I don't really care, as long as something changes.

In the end the problem is ours and cannot be simply assigned to a third party. So let's resolve it here (on this list, whatever the solution) where representatives from all parties are present.

David J.
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
On Wed, Jul 27, 2022 at 8:33 PM Amit Kapila wrote:
>
> On Mon, Jul 25, 2022 at 11:26 AM Masahiko Sawada wrote:
> >
> > On Mon, Jul 25, 2022 at 10:45 AM Masahiko Sawada wrote:
> >
> > I've attached the patch for REL15 that I forgot.
>
> I feel the place to remember running xacts information in
> SnapBuildProcessRunningXacts is not appropriate. Because in cases
> where there are no running xacts or when xl_running_xact is old enough
> that we can't use it, we don't need that information. I feel we need
> it only when we have to reuse the already serialized snapshot, so,
> won't it be better to initialize at that place in
> SnapBuildFindSnapshot()?

Good point, agreed.

> I have changed accordingly in the attached
> and apart from that slightly modified the comments and commit message.
> Do let me know what you think of the attached?

It would be better to remember the initial running xacts after SnapBuildRestore() returns true? Because otherwise, we could end up allocating InitialRunningXacts multiple times while leaking the old ones if there are no serialized snapshots that we are interested in.

---
+ if (builder->state == SNAPBUILD_START)
+ {
+     int   nxacts = running->subxcnt + running->xcnt;
+     Size  sz = sizeof(TransactionId) * nxacts;
+
+     NInitialRunningXacts = nxacts;
+     InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+     memcpy(InitialRunningXacts, running->xids, sz);
+     qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+ }

We should allocate the memory for InitialRunningXacts only when (running->subxcnt + running->xcnt) > 0.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: collect_corrupt_items_vacuum.patch
On Wed, Jul 27, 2022 at 5:56 PM Tom Lane wrote:
> Maybe we need a different function for pg_visibility to call?
> If we want ComputeXidHorizons to serve both these purposes, then it
> has to always deliver exactly the right answer, which seems like
> a definition that will be hard and expensive to achieve.

Yeah, I was thinking along similar lines. I'm also kind of wondering why these calculations use latestCompletedXid. Is that something we do solely to reduce locking? The XIDs of running transactions matter, and their snapshots matter, and the XIDs that could start running in the future matter, but I don't know why it matters what the latest completed XID is.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Official Windows Installer and Documentation
Hi,

On Wed, Jul 27, 2022 at 11:36:11PM +0200, Thomas Kellerer wrote:
> David G. Johnston schrieb am 27.07.2022 um 21:21:
> > And then there is the issue of file ownership.
> >
> > Assuming we want better documentation for this specific issue for
> > back-patching what would that look like?
> >
> > Going forward should our installer be creating the postgres user for
> > consistency with other platforms or not?
>
> Didn't the installer use to do that in earlier releases, and wasn't
> that removed when Postgres became able to "drop privileges" when the
> service is started?
>
> I remember a lot of problems around the specific Postgres service
> account when that still was the case.

Note that there's no "official" Windows installer, and companies
providing one are free to implement it the way they want, which can
contradict the official documentation.  The download section of the
website clearly says that this is a third-party installer.  For now only
the EDB installer remains, but I think that some time ago there were two
or three different providers.

For the EDB installer, I'm not sure why or when it was changed, but it
indeed used to have a dedicated local account and now relies on the
"Local System Account" or something like that.  But IIRC, when it used
to create a local account the name could be configured, so there was no
guarantee of a local "postgres" account back then either.
Re: Introduce wait_for_subscription_sync for TAP tests
On Wed, Jul 27, 2022 at 8:54 PM Amit Kapila wrote:
> On Tue, Jul 26, 2022 at 1:12 PM Masahiko Sawada wrote:
> > On Tue, Jul 26, 2022 at 2:01 PM Amit Kapila wrote:
> > >
> > > 2.
> > > +# wait for the replication to catchup if required.
> > > +if (defined($publisher))
> > > +{
> > > +	croak 'subscription name must be specified' unless defined($subname);
> > > +	$publisher->wait_for_catchup($subname, 'replay');
> > > +}
> > > +
> > > +# then, wait for all table states to be ready.
> > > +print "Waiting for all subscriptions in \"$name\" to synchronize
> > > data\n";
> > > +my $query = qq[SELECT count(1) = 0
> > > +	FROM pg_subscription_rel
> > > +	WHERE srsubstate NOT IN ('r', 's');];
> > > +$self->poll_query_until($dbname, $query)
> > > +	or croak "timed out waiting for subscriber to synchronize data";
> > >
> > > In the tests, I noticed that a few places did wait_for_catchup after
> > > the subscription check, and at other places, we did that check before
> > > as you have it here. Ideally, I think wait_for_catchup should be after
> > > confirming the initial sync is over as without initial sync, the
> > > publisher node won't be completely in sync with the subscriber.
> >
> > What do you mean by the last sentence? I thought the order doesn't
> > matter here. Even if we do wait_for_catchup first then the
> > subscription check, we can make sure that the apply worker caught up
> > and table synchronization has been done, no?
>
> That's right. I thought we should first ensure the subscriber has
> finished operations if possible; like in this case, it can ensure
> table sync has finished and then we can ensure whether publisher and
> subscriber are in sync. That sounds more logical to me.

Makes sense. I've incorporated it in the v3 patch.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: Introduce wait_for_subscription_sync for TAP tests
On Wed, Jul 27, 2022 at 7:08 PM shiy.f...@fujitsu.com wrote: > > On Tue, Jul 26, 2022 3:42 PM Masahiko Sawada wrote: > > > > I've attached an updated patch as well as a patch to remove duplicated > > waits in 007_ddl.pl. > > > > Thanks for your patch. Here are some comments. Thank you for the comments! > > 1. > I think some comments need to be changed in the patch. > For example: > # Also wait for initial table sync to finish > # Wait for initial sync to finish as well > > Words like "Also" and "as well" can be removed now, we originally used them > because we wait for catchup and "also" wait for initial sync. Agreed. > > 2. > In the following places, we can remove wait_for_catchup() and then call it in > wait_for_subscription_sync(). > > 2.1. > 030_origin.pl: > @@ -128,8 +120,7 @@ $node_B->safe_psql( > > $node_C->wait_for_catchup($appname_B2); > > -$node_B->poll_query_until('postgres', $synced_query) > - or die "Timed out while waiting for subscriber to synchronize data"; > +$node_B->wait_for_subscription_sync; > > 2.2. > 031_column_list.pl: > @@ -385,7 +373,7 @@ $node_subscriber->safe_psql( > ALTER SUBSCRIPTION sub1 SET PUBLICATION pub2, pub3 > )); > > -wait_for_subscription_sync($node_subscriber); > +$node_subscriber->wait_for_subscription_sync; > > $node_publisher->wait_for_catchup('sub1'); > > 2.3. > 100_bugs.pl: > @@ -281,8 +276,7 @@ $node_subscriber->safe_psql('postgres', > $node_publisher->wait_for_catchup('tap_sub'); > > # Also wait for initial table sync to finish > -$node_subscriber->poll_query_until('postgres', $synced_query) > - or die "Timed out while waiting for subscriber to synchronize data"; > +$node_subscriber->wait_for_subscription_sync; > > is( $node_subscriber->safe_psql( > 'postgres', "SELECT * FROM tab_replidentity_index"), Agreed. I've attached updated patches that incorporated the above comments as well as the comment from Amit. BTW regarding 0001 patch to remove the duplicated wait, should we backpatch to v15? 
I think we can do that as it's an obvious fix and it seems to be an oversight in 8f2e2bbf145. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/ v3-0001-Remove-duplicated-wait-for-subscription-synchroni.patch Description: Binary data v3-0002-Introduce-wait_for_subscription_sync-for-TAP-test.patch Description: Binary data
Re: fairywren hung in pg_basebackup tests
On Thu, Jul 28, 2022 at 3:21 AM Andrew Dunstan wrote:
> On 2022-07-27 We 10:58, Tom Lane wrote:
> > Andrew Dunstan writes:
> >> The alternative I thought of would be to switch msys to using our
> >> dirent.c. Probably not too hard, but certainly more work than reverting.

Thanks for figuring this out, Andrew.  Previously I thought of MSYS as a
way to use configure+make+gcc/clang but pure Windows C APIs (using
MinGW's replacement Windows headers), but today I learned that
MSYS/MinGW also supplies a small amount of POSIX stuff, including
readdir() etc., so we don't use our own emulation in that case.

I suppose we could consider using our own dirent.h/c with MinGW (and
seeing if there are other similar hazards), to reduce the number of
Windows/POSIX API combinations we have to fret about, but not today.

Another thought for the future is that lstat() + S_ISLNK() could
probably be made to fire for junction points on Windows (all build
variants), and then get_dirent_type()'s fallback code for DT_UNKNOWN
would have Just Worked (no extra system call required), and we could
also probably remove calls to pgwin32_is_junction() everywhere.

> > If you ask me, the shortest-path general-purpose fix is to insert
> >
> > #if MSYS
> > if (pgwin32_is_junction(path))
> > return PGFILETYPE_DIR;
> > #endif
> >
> > at the start of get_dirent_type.  (I'm not sure how to spell the
> > #if test.)  We could look at using dirent.c later, but I think
> > right now it's important to un-break the buildfarm ASAP.
>
> +1. I think you spell it:
>
> #if defined(WIN32) && !defined(_MSC_VER)

I thought about putting it at the top, but don't we really only need to
make an extra system call if MinGW's stat() told us it saw a directory?
And what if you asked to look through symlinks?  I thought about putting
it near the S_ISLNK() test, which is the usual pattern, but then what if
MinGW decides to add d_type support one day?  Those thoughts led to the
attached formulation.  Untested.
I'll go and try to see if I can run this with Melih's proposed MSYS CI
support...

From 33005aaad52a550ba028277dfb634d3d6fbfdd7f Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Thu, 28 Jul 2022 11:45:31 +1200
Subject: [PATCH] Fix get_dirent_type() for symlinks on MinGW/MSYS.

On Windows with MSVC, get_dirent_type() was recently made to return
DT_LNK for junction points by commit 9d3444dc, which fixed some
defective dirent.c code.

On Windows with Cygwin, get_dirent_type() already worked for symlinks,
as it does on POSIX systems, because Cygwin has its own fake symlinks
that behave like POSIX (on closer inspection, Cygwin's dirent has the
BSD d_type extension but it's probably always DT_UNKNOWN, so we fall
back to lstat(), which understands Cygwin symlinks with S_ISLNK()).

On Windows with MinGW/MSYS, we need extra code, because the MinGW
runtime has its own readdir() without d_type, and the lstat()-based
fallback has no knowledge of our convention for treating junctions as
symlinks.

Reported-by: Andrew Dunstan
Discussion: https://postgr.es/m/b9ddf605-6b36-f90d-7c30-7b3e95c46276%40dunslane.net
---
 src/common/file_utils.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 19d308ad1f..210b2c4dfd 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -465,5 +465,19 @@ get_dirent_type(const char *path,
 #endif
 	}
 
+#if defined(WIN32) && !defined(_MSC_VER)
+	/*
+	 * If we're on native Windows (not Cygwin, which has its own POSIX
+	 * symlinks), but not using the MSVC compiler, then we're using a readdir()
+	 * emulation provided by the compiler's runtime that has no concept of
+	 * symlinks.  It sees junction points as directories, so we need an extra
+	 * system call to recognize them as symlinks, following our convention.
+	 */
+	if (result == PGFILETYPE_DIR &&
+		!look_through_symlinks &&
+		pgwin32_is_junction(path))
+		result = PGFILETYPE_LNK;
+#endif
+
 	return result;
 }
-- 
2.36.1
Re: Proposal: add a debug message about using geqo
On Fri, Jul 22, 2022 at 1:20 PM Jacob Champion wrote: > On Wed, Jun 1, 2022 at 11:09 PM KAWAMOTO Masaya > wrote: > > That sounds a nice idea. But I don't think that postgres shows in the > > EXPLAIN output why the plan is selected. Would it be appropriate to > > show that GEQO is used in EXPLAIN output? > > I'm reminded of Greenplum's "Optimizer" line in its EXPLAIN output > [1], so from that perspective I think it's intuitive. > > > As a test, I created a patch that add information about GEQO to > > EXPLAIN output by the GEQO option. The output example is as follows. > > What do you think about the location and content of information about > GEQO? > I am a little surprised to see GeqoDetails being printed for a plan > that didn't use GEQO, but again that's probably because I'm used to > GPDB's Optimizer output. And I don't have a lot of personal experience > using alternative optimizers. > I agree this should be part of explain output. I would not print the current value of geqo_threshold and leave setting display the exclusive purview of the settings option. The presentation of only a single geqo result seems incorrect given that multiple trees can exist. In the first example below the full outer join causes 3 relations to be seen as a single relation at the top level (hence max join nodes = 4) while in the inner join case we see all 6 join nodes. There should be two outputs of GEQO in the first explain, one with join nodes of 3 and the existing one with 4. I also don't see the point of labelling them "max"; "join nodes" seems sufficient. While it can probably be figured out from the rest of the plan, listing the names of the join nodes may be useful (and give join nodes some company). David J. 
postgres=# explain (verbose, geqo) with gs2 (v2) as materialized ( select * from generate_series(1,1) ) select * from gs2 as gs4 full outer join (select gs2a.v2 from gs2 as gs2a, gs2 as gs2b) as gs5 using (v2), generate_series(1, 1) as gs (v1) cross join gs2 as gs3 where v1 IN (select v2 from gs2); QUERY PLAN Nested Loop (cost=0.07..0.21 rows=1 width=12) Output: COALESCE(gs4.v2, gs2a.v2), gs.v1, gs3.v2 CTE gs2 -> Function Scan on pg_catalog.generate_series (cost=0.00..0.01 rows=1 width=4) Output: generate_series.generate_series Function Call: generate_series(1, 1) -> Nested Loop (cost=0.06..0.16 rows=1 width=12) Output: gs.v1, gs4.v2, gs2a.v2 -> Nested Loop (cost=0.02..0.06 rows=1 width=4) Output: gs.v1 Join Filter: (gs.v1 = gs2.v2) -> Function Scan on pg_catalog.generate_series gs (cost=0.00..0.01 rows=1 width=4) Output: gs.v1 Function Call: generate_series(1, 1) -> HashAggregate (cost=0.02..0.03 rows=1 width=4) Output: gs2.v2 Group Key: gs2.v2 -> CTE Scan on gs2 (cost=0.00..0.02 rows=1 width=4) Output: gs2.v2 -> Hash Full Join (cost=0.03..0.10 rows=1 width=8) Output: gs4.v2, gs2a.v2 Hash Cond: (gs2a.v2 = gs4.v2) -> Nested Loop (cost=0.00..0.05 rows=1 width=4) Output: gs2a.v2 -> CTE Scan on gs2 gs2b (cost=0.00..0.02 rows=1 width=0) Output: gs2b.v2 -> CTE Scan on gs2 gs2a (cost=0.00..0.02 rows=1 width=4) Output: gs2a.v2 -> Hash (cost=0.02..0.02 rows=1 width=4) Output: gs4.v2 -> CTE Scan on gs2 gs4 (cost=0.00..0.02 rows=1 width=4) Output: gs4.v2 -> CTE Scan on gs2 gs3 (cost=0.00..0.02 rows=1 width=4) Output: gs3.v2 GeqoDetails: GEQO: used, geqo_threshold: 2, Max join nodes: 4 (35 rows) postgres=# explain (verbose, geqo) with gs2 (v2) as materialized ( select * from generate_series(1,1) ) select * from gs2 as gs4 join (select gs2a.v2 from gs2 as gs2a, gs2 as gs2b) as gs5 using (v2), generate_series(1, 1) as gs (v1) cross join gs2 as gs3 where v1 IN (select v2 from gs2); QUERY PLAN -- Nested Loop (cost=0.02..0.18 rows=1 width=12) Output: gs4.v2, gs.v1, gs3.v2 CTE gs2 
-> Function Scan on pg_catalog.generate_series (cost=0.00..0.01 rows=1 width=4) Output: generate_series.generate_series Function Call: generate_series(1, 1) -> Nested Loop (cost=0.00..0.14 rows=1 width=12) Output: gs.v1, gs4.v2, gs3.v2 -> Nested Loop (cost=0.00..0.11 rows=1 width=8) Output: gs.v1, gs4.v2 -> Nested Loop
Re: Unprivileged user can induce crash by using an SUSET param in PGOPTIONS
I wrote: > =?utf-8?Q?Dagfinn_Ilmari_Manns=C3=A5ker?= writes: >> Thanks! Just one minor nitpick: setting an %ENV entry to `undef` >> doesn't unset the environment variable, it sets it to the empty string. >> To unset a variable it needs to be deleted from %ENV, i.e. `delete >> $ENV{PGUSER};`. > Ah. Still, libpq doesn't distinguish, so the test works anyway. > Not sure if it's worth changing. Meh ... I had to un-break the test for Windows, so did this while at it, using the local-in-block method. Thanks for the suggestion. regards, tom lane
Re: [Patch] ALTER SYSTEM READ ONLY
On Fri, Apr 8, 2022 at 7:27 AM Amul Sul wrote:
> Attached is a rebased version for the latest master head (#891624f0ec).

Hi Amul,

I'm going through past CF triage emails today; I noticed that this patch
dropped out of the commitfest when you withdrew it in January, but it
hasn't been added back with the most recent patchset you posted.  Was
that intended, or did you want to re-register it for review?

--Jacob
Re: collect_corrupt_items_vacuum.patch
Robert Haas writes:
> In reality, the oldest all-visible XID cannot move backward, but
> ComputeXidHorizons() lets it move backward, because it's intended for
> use by a caller who wants to mark pages all-visible, and it's only
> concerned with making sure that the value is old enough to be safe.

Right.

> And that's a problem for the way that pg_visibility is (mis-)using it.
> To say that another way, ComputeXidHorizons() is perfectly fine with
> returning a value that is older than the true answer, as long as it
> never returns a value that is newer than the true answer.
> pg_visibility wants the opposite.  Here, a value that is newer than
> the true value can't do worse than hide corruption, which is sort of
> OK, but a value that's older than the true value can report corruption
> where none exists, which is very bad.

Maybe we need a different function for pg_visibility to call?  If we
want ComputeXidHorizons to serve both these purposes, then it has to
always deliver exactly the right answer, which seems like a definition
that will be hard and expensive to achieve.

			regards, tom lane
Re: collect_corrupt_items_vacuum.patch
On Mon, Apr 4, 2022 at 4:51 AM Daniel Shelepanov wrote:
> I've been working on this
> [https://www.postgresql.org/message-id/flat/cfcca574-6967-c5ab-7dc3-2c82b6723b99%40mail.ru]
> bug. Finally, I've come up with the patch you can find attached.
> Basically, what it does is raise a PROC_IN_VACUUM flag and reset it
> afterwards. I know this seems kinda crunchy and I hope you guys will
> give me some hints on where to continue. This
> [https://www.postgresql.org/message-id/20220218175119.7hwv7ksamfjwijbx%40alap3.anarazel.de]
> message contains a reproduction script. Thank you very much in advance.

I noticed the CommitFest entry for this thread today and decided to
take a look.

I think the general issue here can be stated in this way: suppose a
VACUUM computes an all-visible cutoff X, i.e. it thinks all committed
XIDs < X are all-visible. Then, at a later time, pg_visibility computes
an all-visible cutoff Y, i.e. it thinks all committed XIDs < Y are
all-visible. If Y < X, pg_check_visible() might falsely report
corruption, because VACUUM might have marked as all-visible some page
containing tuples which pg_check_visible() thinks aren't really
all-visible.

In reality, the oldest all-visible XID cannot move backward, but
ComputeXidHorizons() lets it move backward, because it's intended for
use by a caller who wants to mark pages all-visible, and it's only
concerned with making sure that the value is old enough to be safe. And
that's a problem for the way that pg_visibility is (mis-)using it.

To say that another way, ComputeXidHorizons() is perfectly fine with
returning a value that is older than the true answer, as long as it
never returns a value that is newer than the true answer. pg_visibility
wants the opposite. Here, a value that is newer than the true value
can't do worse than hide corruption, which is sort of OK, but a value
that's older than the true value can report corruption where none
exists, which is very bad.
I have a feeling, therefore, that this isn't really a complete fix. I think it might address one way for the horizon reported by ComputeXidHorizons() to move backward, but not all the ways. Unfortunately, I am out of time for today to study this... but will try to find more time on another day. -- Robert Haas EDB: http://www.enterprisedb.com
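[Editor's note: the asymmetry Robert describes — a checking cutoff that must never move backward across calls — can be modeled in a few lines. This sketch is purely illustrative: the function names are invented, and it uses plain integer comparison, whereas real XIDs need wraparound-aware comparison (TransactionIdFollows), not `>`.]

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Illustrative only: keep a cutoff that can only advance.  A
 * conservative (possibly too-old) estimate, like ComputeXidHorizons()
 * may return, is safe for *marking* pages all-visible, but for
 * *checking* visibility we must never use a cutoff older than one we
 * previously relied on, or we may flag sound pages as corrupt.
 * NB: plain '>' is a simplification; real XID space is circular.
 */
static TransactionId
monotonic_cutoff(TransactionId *newest_seen, TransactionId estimate)
{
	if (estimate > *newest_seen)
		*newest_seen = estimate;	/* the horizon advanced */
	return *newest_seen;			/* never older than a prior answer */
}
```

Under this model, the Y < X scenario above simply cannot occur: a later caller is handed back at least the cutoff an earlier VACUUM used.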
Re: Official Windows Installer and Documentation
David G. Johnston schrieb am 27.07.2022 um 21:21:
> And then there is the issue of file ownership.
>
> Assuming we want better documentation for this specific issue for
> back-patching what would that look like?
>
> Going forward should our installer be creating the postgres user for
> consistency with other platforms or not?

Didn't the installer use to do that in earlier releases, and wasn't
that removed when Postgres was able to "drop privileges" when the
service is started?

I remember a lot of problems around the specific Postgres service
account when that still was the case.

As far as I can tell, most of the problems of the Windows installer
stem from the fact that it tries to use icacls to set privileges on the
data directory.  This seems to fail quite frequently, causing the
infamous "Problem running post-install step" error.

The fact that the installer still defaults to using "c:\Program Files"
for the location of the data directory might be related to that.  (But
then I don't know enough of the internals of the installer and Windows.)

Just my 0.02€

Thomas
Re: [PoC] Reducing planning time when tables have many partitions
On Wed, 27 Jul 2022 at 18:07, Yuya Watari wrote: > I agree that introducing a Bitmapset field may solve this problem. I > will try this approach in addition to previous ones. I've attached a very half-done patch that might help you get started on this. There are still 2 failing regression tests which seem to be due to plan changes. I didn't expend any effort looking into why these plans changed. The attached does not contain any actual optimizations to find the minimal set of EMs to loop through by masking the Bitmapsets that I mentioned in my post last night. I just quickly put it together to see if there's some hole in the idea. I don't think there is. I've not really considered all of the places that we'll want to do the bit twiddling to get the minimal set of EquivalenceMember. I did see there's a couple more functions in postgres_fdw.c that could be optimized. One thing I've only partially thought about is what if you want to also find EquivalenceMembers with a constant value. If there's a Const, then you'll lose the bit for that when you mask the ec's ec_member_indexes with the RelOptInfos. If there are some places where we need to keep those then I think we'll need to add another field to EquivalenceClass to mark the index into PlannerInfo's eq_members for the EquivalenceMember with the Const. That bit would have to be bms_add_member()ed back into the Bitmapset of matching EquivalenceMembers after masking out RelOptInfo's ec_member_indexes. When adding the optimizations to find the minimal set of EM bits to search through, you should likely add some functions similar to the get_eclass_indexes_for_relids() and get_common_eclass_indexes() functions to help you find the minimal set of bits. You can also probably get some other inspiration from [1], in general. 
David

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3373c715535

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 048db542d3..0af3943e03 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -7410,11 +7410,11 @@ conversion_error_callback(void *arg)
 EquivalenceMember *
 find_em_for_rel(PlannerInfo *root, EquivalenceClass *ec, RelOptInfo *rel)
 {
-	ListCell   *lc;
+	int			i = -1;
 
-	foreach(lc, ec->ec_members)
+	while ((i = bms_next_member(ec->ec_member_indexes, i)) >= 0)
 	{
-		EquivalenceMember *em = (EquivalenceMember *) lfirst(lc);
+		EquivalenceMember *em = (EquivalenceMember *) list_nth(root->eq_members, i);
 
 		/*
 		 * Note we require !bms_is_empty, else we'd accept constant
@@ -7453,7 +7453,7 @@ find_em_for_rel_target(PlannerInfo *root, EquivalenceClass *ec,
 	{
 		Expr	   *expr = (Expr *) lfirst(lc1);
 		Index		sgref = get_pathtarget_sortgroupref(target, i);
-		ListCell   *lc2;
+		int			i;
 
 		/* Ignore non-sort expressions */
 		if (sgref == 0 ||
@@ -7469,9 +7469,10 @@ find_em_for_rel_target(PlannerInfo *root, EquivalenceClass *ec,
 			expr = ((RelabelType *) expr)->arg;
 
 		/* Locate an EquivalenceClass member matching this expr, if any */
-		foreach(lc2, ec->ec_members)
+		i = -1;
+		while ((i = bms_next_member(ec->ec_member_indexes, i)) >= 0)
 		{
-			EquivalenceMember *em = (EquivalenceMember *) lfirst(lc2);
+			EquivalenceMember *em = (EquivalenceMember *) list_nth(root->eq_members, i);
 			Expr	   *em_expr;
 
 			/* Don't match constants */
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index a96f2ee8c6..2060b73686 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -430,7 +430,7 @@ _outEquivalenceClass(StringInfo str, const EquivalenceClass *node)
 	WRITE_NODE_FIELD(ec_opfamilies);
 	WRITE_OID_FIELD(ec_collation);
-	WRITE_NODE_FIELD(ec_members);
+	WRITE_BITMAPSET_FIELD(ec_member_indexes);
 	WRITE_NODE_FIELD(ec_sources);
 	WRITE_NODE_FIELD(ec_derives);
 	WRITE_BITMAPSET_FIELD(ec_relids);
diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index a5c44adc6c..5953b79fe8 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -423,16 +423,17 @@ print_expr(const Node *expr, const List *rtable)
  * pathkeys list of PathKeys
  */
 void
-print_pathkeys(const List *pathkeys, const List *rtable)
+print_pathkeys(const PlannerInfo *root, const List *pathkeys,
+			   const List *rtable)
 {
-	const ListCell *i;
+	const ListCell *lc;
 
 	printf("(");
-	foreach(i, pathkeys)
+	foreach(lc, pathkeys)
 	{
-		PathKey    *pathkey = (PathKey *) lfirst(lc);
+
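[Editor's note: the core pattern in the patch above — replacing a `foreach` over `ec_members` with a `bms_next_member()` scan over a bitmapset of indexes into a flat `eq_members` array — can be illustrated with a toy single-word set. Everything here is a simplified stand-in: PostgreSQL's Bitmapset spans multiple words, and the hypothetical `next_member()` below merely mimics the `bms_next_member()` iteration contract.]

```c
#include <stdint.h>

/*
 * Toy single-word bitmapset.  next_member(set, prev) returns the
 * smallest member greater than prev, or -1 when none remain -- the
 * same contract as PostgreSQL's bms_next_member(), so callers loop
 * with "while ((i = next_member(set, i)) >= 0)" starting from i = -1.
 */
static int
next_member(uint64_t set, int prev)
{
	for (int i = prev + 1; i < 64; i++)
	{
		if (set & (UINT64_C(1) << i))
			return i;
	}
	return -1;
}
```

With `ec_member_indexes` as such a set, each member lookup becomes `list_nth(root->eq_members, i)`, and the minimal set of candidate members for a relation can be found up front by masking (intersecting) the class's set with a RelOptInfo's set, which is the optimization David describes but has not yet added.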
Re: ExecRTCheckPerms() and many prunable partitions
... One more thing: maybe we should rethink where to put extraUpdatedCols. Between the facts that it's not used for actual permissions checks, and that it's calculated by the rewriter not parser, it doesn't seem like it really belongs in RelPermissionInfo. Should we keep it in RangeTblEntry? Should it go somewhere else entirely? I'm just speculating, but now is a good time to think about it. regards, tom lane
Re: ExecRTCheckPerms() and many prunable partitions
Amit Langote writes:
> [ v16 patches ]

I took a quick look at this ...

I think that the notion behind MergeRelPermissionInfos, ie that a single
RelPermissionInfo can represent *all* the checks for a given table OID,
is fundamentally wrong.  For example, when merging a view into an outer
query that references a table also used by the view, the checkAsUser
fields might be different, and the permissions to check might be
different, and the columns those permissions need to hold for might be
different.  Blindly bms_union'ing the column sets will lead to requiring
far more permissions than the query should require.  Conversely, this
approach could lead to allowing cases we should reject, if you happen to
"merge" checkAsUser in a way that ends in checking as a higher-privilege
user than should be checked.

I'm inclined to think that you should abandon the idea of merging
RelPermissionInfos at all.  It can only buy us much in the case of
self-joins, which ought to be rare.  It'd be better to just say "there
is one RelPermissionInfo for each RTE requiring any sort of permissions
check".  Either that or you need to complicate RelPermissionInfo a lot,
but I don't see the advantage.

It'd likely be better to rename ExecutorCheckPerms_hook, say to
ExecCheckPermissions_hook given the rename of ExecCheckRTPerms.  As it
stands, it *looks* like the API of that hook has not changed, when it
has.  Better to break calling code visibly than to make people debug
their way to an understanding that the List contents are no longer what
they expected.  A different idea could be to pass both the rangetable
and relpermlist, again making the API break obvious (and who's to say a
hook might not still want the rangetable?)

In parsenodes.h:

+	List	   *relpermlist;	/* list of RTEPermissionInfo nodes for
+								 * the RTE_RELATION entries in rtable */

I find this comment not very future-proof, if indeed it's strictly
correct even today.
Maybe better "list of RelPermissionInfo nodes for rangetable entries having perminfoindex > 0". Likewise for the comment in RangeTableEntry: there's no compelling reason to assume that all and only RELATION RTEs will have RelPermissionInfo. Even if that remains true at parse time it's falsified during planning. Also note typo in node name: that comment is the only reference to "RTEPermissionInfo" AFAICS. Although, given the redefinition I suggest above, arguably "RTEPermissionInfo" is the better name? I'm confused as to why RelPermissionInfo.inh exists. It doesn't seem to me that permissions checking should care about child rels. Why did you add checkAsUser to ForeignScan (and not any other scan plan nodes)? At best that's pretty asymmetric, but it seems mighty bogus: under what circumstances would an FDW need to know that but not any of the other RelPermissionInfo fields? This seems to indicate that someplace we should work harder at making the RelPermissionInfo list available to FDWs. (CustomScan providers might have similar issues, btw.) I've not looked at much of the actual code, just the .h file changes. Haven't studied 0002 either. regards, tom lane
Re: out of date comment in commit_ts.c
On Tue, Jul 26, 2022 at 10:33:43AM -0700, Nathan Bossart wrote:
> IIUC the ability for callers to request WAL record generation is no
> longer possible as of 08aa89b [0].  Should the second sentence be
> removed?

Here's a patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 4dc8d402bd..9aa4675cb7 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -7,13 +7,10 @@
  * for each transaction.
  *
  * XLOG interactions: this module generates an XLOG record whenever a new
- * CommitTs page is initialized to zeroes.  Also, one XLOG record is
- * generated for setting of values when the caller requests it; this allows
- * us to support values coming from places other than transaction commit.
- * Other writes of CommitTS come from recording of transaction commit in
- * xact.c, which generates its own XLOG records for these events and will
- * re-perform the status update on redo; so we need make no additional XLOG
- * entry here.
+ * CommitTs page is initialized to zeroes.  Other writes of CommitTS come
+ * from recording of transaction commit in xact.c, which generates its own
+ * XLOG records for these events and will re-perform the status update on
+ * redo; so we need make no additional XLOG entry here.
  *
  * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
Re: pgsql: Remove the restriction that the relmap must be 512 bytes.
Robert Haas writes: > In short, I think Alvaro's idea is unprincipled but solves the problem > of making it clear that a new initdb is required. Your idea is > principled but does not solve any problem. [ shrug... ] Fair enough. regards, tom lane
Re: pgsql: Remove the restriction that the relmap must be 512 bytes.
On Wed, Jul 27, 2022 at 3:17 PM Tom Lane wrote: > Alvaro Herrera writes: > >> Hmm, but that doesn't actually produce nice behavior. It just goes > >> into an infinite loop, like this: > >> So now I'm back to being unsure what to do here. > > > I vote to go for the catversion bump for now. > > What this is showing us is that any form of corruption in the relmapper > file causes very unpleasant behavior. We probably had better do something > about that, independently of this issue. I'm not sure how important that is, but it certainly wouldn't hurt. > In the meantime, I still think bumping the file magic number is a better > answer. It won't make the behavior any worse for un-updated code than > it is already. But it also won't make it any better, so why even bother? The goal of catversion bumps is to replace crashes or unpredictable misbehavior with a nice error message that tells you exactly what the problem is. Here we'd just be replacing an infinite series of crashes with an infinite series of crashes with a slightly different error message. It's probably worth comparing those error messages: FATAL: could not read file "global/pg_filenode.map": read 512 of 524 FATAL: relation mapping file "global/pg_filenode.map" contains invalid data The first message is what you get now. The second message is what you get with the proposed change to the magic number. I would argue that the second message is actually worse than the first one, because the first one actually gives you some hint what the problem is, whereas the second one really just says that an unspecified bad thing happened. In short, I think Alvaro's idea is unprincipled but solves the problem of making it clear that a new initdb is required. Your idea is principled but does not solve any problem. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Allow pageinspect's bt_page_stats function to return a set of rows instead of a single row
Hamid Akhtar writes: > That attached patch is based on the master branch. It makes the following > changes to the pageinspect contrib module: > - Updates bt_page_stats_internal function to accept 3 arguments instead of > 2. > - The function now uses SRF macros to return a set rather than a single > row. The function call now requires specifying column names. FWIW, I think you'd be way better off changing the function name, say to bt_multi_page_stats(). Overloading the name this way is going to lead to great confusion, e.g. somebody who fat-fingers the number of output arguments in a JDBC call could see confusing results due to invoking the wrong one of the two functions. Also, I'm not quite sure what you mean by "The function call now requires specifying column names", but it doesn't sound like an acceptable restriction from a compatibility standpoint. If a different name dodges that issue then it's clearly a better way. regards, tom lane
Official Windows Installer and Documentation
Hey,

Just interacted with a frustrated user on Slack trying to upgrade from v13
to v14 on Windows. Our official download page for the Windows installer
claims the core documentation as its official reference - can someone
responsible for this area please suggest and test some changes to make this
reality more acceptable.

The particular point that was brought up is our documentation for
pg_upgrade says:

RUNAS /USER:postgres "CMD.EXE"
SET PATH=%PATH%;C:\Program Files\PostgreSQL\14\bin;

The problem is apparently (I haven't personally tested) that our official
installer doesn't bother to create the postgres operating system user
account. It is also unclear whether the defaults for pg_hba.conf add some
kind of bad interaction here should one fix this particular problem. And
then there is the issue of file ownership.

Assuming we want better documentation for this specific issue for
back-patching, what would that look like? Going forward, should our
installer be creating the postgres user for consistency with other
platforms or not?

I suggest adding relevant discussion about this particular official binary
distribution to:
https://www.postgresql.org/docs/current/install-binaries.html

David J.
Re: pgsql: Remove the restriction that the relmap must be 512 bytes.
Alvaro Herrera writes:
>> Hmm, but that doesn't actually produce nice behavior. It just goes
>> into an infinite loop, like this:
>> So now I'm back to being unsure what to do here.

> I vote to go for the catversion bump for now.

What this is showing us is that any form of corruption in the relmapper
file causes very unpleasant behavior. We probably had better do something
about that, independently of this issue.

In the meantime, I still think bumping the file magic number is a better
answer. It won't make the behavior any worse for un-updated code than it
is already.

			regards, tom lane
Re: Allow foreign keys to reference a superset of unique columns
Kaiting Chen writes:
> I'd like to propose a change to PostgreSQL to allow the creation of a foreign
> key constraint referencing a superset of uniquely constrained columns.

TBH, I think this is a fundamentally bad idea and should be rejected
outright. It fuzzes the semantics of the FK relationship, and I'm not
convinced that there are legitimate use-cases. Your example schema could
easily be dismissed as bad design that should be done some other way.

For one example of where the semantics get fuzzy, it's not very clear how
the extra-baggage columns ought to participate in CASCADE updates.
Currently, if we have

CREATE TABLE foo (a integer PRIMARY KEY, b integer);

then an update that changes only foo.b doesn't need to update referencing
tables, and I think we even have optimizations that assume that if no
unique-key columns are touched then RI checks need not be made. But if
you did

CREATE TABLE bar (x integer, y integer,
                  FOREIGN KEY (x, y) REFERENCES foo(a, b) ON UPDATE CASCADE);

then perhaps you expect bar.y to be updated ... or maybe you don't?

Another example is that I think the idea is only well-defined when the
subset column(s) are a primary key, or at least all marked NOT NULL.
Otherwise they're not as unique as you're claiming. But then the FK
constraint really has to be dependent on a PK constraint not just an index
definition, since indexes in themselves don't enforce not-nullness. That
gets back to Peter's complaint that referring to an index isn't good
enough.

Anyway, seeing that the patch touches neither ri_triggers.c nor any
regression tests, I find it hard to believe that such semantic questions
have been thought through.

It's also unclear to me how this ought to interact with the
information_schema views concerning foreign keys. We generally feel that
we don't want to present any non-SQL-compatible data in
information_schema, for fear that it will confuse applications that expect
to see SQL-spec behavior there.
So do we leave such FKs out of the views altogether, or show only the columns involving the associated unique constraint? Neither answer seems pleasant. FWIW, the patch is currently failing to apply per the cfbot. I think you don't need to manually update backend/nodes/ anymore, but the gram.y changes look to have been sideswiped by some other recent commit. regards, tom lane
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Okay, I think I'm done with this. Here's v27 for the master branch, where
I fixed some comments as well as thinkos in the test script. The ones on
older branches aren't materially different, they just have tonnes of
conflicts resolved. I'll get this pushed tomorrow morning.

I have run it through CI and it seems ... not completely broken, at least,
but I have no working recipes for Windows on branches 14 and older, so it
doesn't really work fully. If anybody does, please share. You can see
mine here
https://github.com/alvherre/postgres/commits/REL_11_STABLE [etc]
https://cirrus-ci.com/build/5320904228995072
https://cirrus-ci.com/github/alvherre/postgres

-- 
Álvaro Herrera        PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Every machine is a smoke machine if you operate it wrong enough."
https://twitter.com/libseybieda/status/1541673325781196801

>From b84f66975d664c45babd878a43c67601b7f7d2b6 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Wed, 27 Jul 2022 20:22:21 +0200
Subject: [PATCH v27] Fix replay of create database records on standby
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Crash recovery on standby may encounter missing directories when replaying
database-creation WAL records.  Prior to this patch, the standby would
fail to recover in such a case; however, the directories could be
legitimately missing.  Consider the following sequence of commands:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, crash recovery must be able to continue.

A fix for this problem was already attempted in 49d9cfc68bf4, but it was
reverted because of design issues.  This new version is based on Robert
Haas' proposal: any missing tablespaces are created during recovery before
reaching consistency.  Tablespaces are created as real directories, and
should be deleted by later replay.
CheckRecoveryConsistency ensures they have disappeared.

The problems detected by this new code are reported as PANIC, except when
allow_in_place_tablespaces is set to ON, in which case they are WARNING.
Apart from making tests possible, this gives users an escape hatch in case
things don't go as planned.

Author: Kyotaro Horiguchi
Author: Asim R Praveen
Author: Paul Guo
Reviewed-by: Anastasia Lubennikova (older versions)
Reviewed-by: Fujii Masao (older versions)
Reviewed-by: Michaël Paquier
Diagnosed-by: Paul Guo
Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27dn5xwj2p9-roh16e4...@mail.gmail.com
---
 src/backend/access/transam/xlogrecovery.c   |  50 ++
 src/backend/commands/dbcommands.c           |  77 +
 src/backend/commands/tablespace.c           |  40 ++---
 src/test/recovery/t/033_replay_tsp_drops.pl | 169
 4 files changed, 305 insertions(+), 31 deletions(-)
 create mode 100644 src/test/recovery/t/033_replay_tsp_drops.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index e383c2123a..27e02fbfcd 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -42,6 +42,7 @@
 #include "access/xlogutils.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
@@ -2008,6 +2009,47 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI)
 	}
 }
 
+/*
+ * Verify that, in non-test mode, ./pg_tblspc doesn't contain any real
+ * directories.
+ *
+ * Replay of database creation XLOG records for databases that were later
+ * dropped can create fake directories in pg_tblspc.  By the time consistency
+ * is reached these directories should have been removed; here we verify
+ * that this did indeed happen.  This is to be called at the point where
+ * consistent state is reached.
+ *
+ * allow_in_place_tablespaces turns the PANIC into a WARNING, which is
+ * useful for testing purposes, and also allows for an escape hatch in case
+ * things go south.
+ */
+static void
+CheckTablespaceDirectory(void)
+{
+	DIR		   *dir;
+	struct dirent *de;
+
+	dir = AllocateDir("pg_tblspc");
+	while ((de = ReadDir(dir, "pg_tblspc")) != NULL)
+	{
+		char		path[MAXPGPATH + 10];
+
+		/* Skip entries of non-oid names */
+		if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
+			continue;
+
+		snprintf(path, sizeof(path), "pg_tblspc/%s", de->d_name);
+
+		if (get_dirent_type(path, de, false, ERROR) != PGFILETYPE_LNK)
+			ereport(allow_in_place_tablespaces ? WARNING : PANIC,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("unexpected directory entry \"%s\" found in %s",
+							de->d_name, "pg_tblspc/"),
+					 errdetail("All directory entries in pg_tblspc/ should be symbolic links."),
+					 errhint("Remove those directories, or set
Re: pgsql: Remove the restriction that the relmap must be 512 bytes.
On 2022-Jul-27, Robert Haas wrote:

> Hmm, but that doesn't actually produce nice behavior. It just goes
> into an infinite loop, like this:
> 2022-07-27 14:21:12.869 EDT [32853] FATAL: relation mapping file
> "global/pg_filenode.map" contains invalid data

This seems almost identical to what happens without the version number
change, so I wouldn't call it much of an improvement.

> While I agree that changing a version identifier that is more closely
> related to what got changed is better than changing a generic one, I
> think people won't like an infinite loop that spews messages into the
> log at top speed as a way of telling them about the problem.
>
> So now I'm back to being unsure what to do here.

I vote to go for the catversion bump for now.

Maybe it's possible to patch the relmapper code afterwards, so that if a
version mismatch is detected the server is stopped with a reasonable
message. A hypothetical improvement would be to keep the code to read the
old version and upgrade the file automatically, but given the number of
times that this file has changed, it's likely pointless effort. Therefore,
my proposal is to add a comment next to the struct definition suggesting
to bump catversion and call it a day.

-- 
Álvaro Herrera        PostgreSQL Developer  —  https://www.EnterpriseDB.com/
small windows psqlrc re-wording
Howdy folks, The attached patch tweaks the wording around finding the psqlrc file on windows, with the primary goal of removing the generally incorrect statement that windows has no concept of a home directory. Robert Treat https://xzilla.net windows-psqlrc.patch Description: Binary data
Re: pgsql: Remove the restriction that the relmap must be 512 bytes.
On Wed, Jul 27, 2022 at 2:13 PM Robert Haas wrote:
> On Wed, Jul 27, 2022 at 1:42 PM Tom Lane wrote:
> > If there's a magic number, then I'd (a) change that and (b) adjust
> > whatever comments led you to think you shouldn't. Bumping catversion
> > is a good fallback choice when there's not any more-proximate version
> > indicator, but here there is.
>
> Maybe I just got cold feet because it doesn't ever seem to have been
> changed before, because the definition says:
>
> #define RELMAPPER_FILEMAGIC 0x592717 /* version ID value */
>
> And the fact that "version" is in there sure makes it seem like a
> version number, now that I look again.
>
> I'll add 1 to the value.

Hmm, but that doesn't actually produce nice behavior. It just goes into an
infinite loop, like this:

2022-07-27 14:21:12.826 EDT [32849] LOG: database system was interrupted; last known up at 2022-07-27 14:21:12 EDT
2022-07-27 14:21:12.860 EDT [32849] LOG: database system was not properly shut down; automatic recovery in progress
2022-07-27 14:21:12.861 EDT [32849] LOG: invalid record length at 0/14B3BB8: wanted 24, got 0
2022-07-27 14:21:12.861 EDT [32849] LOG: redo is not required
2022-07-27 14:21:12.864 EDT [32850] LOG: checkpoint starting: end-of-recovery immediate wait
2022-07-27 14:21:12.865 EDT [32850] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=2, longest=0.001 s, average=0.001 s; distance=0 kB, estimate=0 kB; lsn=0/14B3BB8, redo lsn=0/14B3BB8
2022-07-27 14:21:12.868 EDT [31930] LOG: database system is ready to accept connections
2022-07-27 14:21:12.869 EDT [32853] FATAL: relation mapping file "global/pg_filenode.map" contains invalid data
2022-07-27 14:21:12.869 EDT [32854] FATAL: relation mapping file "global/pg_filenode.map" contains invalid data
2022-07-27 14:21:12.870 EDT [31930] LOG: autovacuum launcher process (PID 32853) exited with exit code 1
2022-07-27 14:21:12.870 EDT [31930] LOG: terminating any other active server processes
2022-07-27 14:21:12.870 EDT [31930] LOG: background worker "logical replication launcher" (PID 32854) exited with exit code 1
2022-07-27 14:21:12.871 EDT [31930] LOG: all server processes terminated; reinitializing

While I agree that changing a version identifier that is more closely
related to what got changed is better than changing a generic one, I think
people won't like an infinite loop that spews messages into the log at top
speed as a way of telling them about the problem.

So now I'm back to being unsure what to do here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: Hash index build performance tweak from sorting
Simon Riggs writes: > [ hash_sort_by_hash.v2.patch ] The cfbot says this no longer applies --- probably sideswiped by Korotkov's sorting-related commits last night. regards, tom lane
Re: pgsql: Remove the restriction that the relmap must be 512 bytes.
On Wed, Jul 27, 2022 at 1:42 PM Tom Lane wrote:
> If there's a magic number, then I'd (a) change that and (b) adjust
> whatever comments led you to think you shouldn't. Bumping catversion
> is a good fallback choice when there's not any more-proximate version
> indicator, but here there is.

Maybe I just got cold feet because it doesn't ever seem to have been
changed before, because the definition says:

#define RELMAPPER_FILEMAGIC 0x592717 /* version ID value */

And the fact that "version" is in there sure makes it seem like a version
number, now that I look again.

I'll add 1 to the value.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: making relfilenodes 56 bits
On Wed, Jul 27, 2022 at 12:37 PM Dilip Kumar wrote: > Just realised that this should have been BufferTagsEqual instead of > BufferTagEqual > > I will modify this and send an updated patch tomorrow. I changed it and committed. What was formerly 0002 will need minor rebasing. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Use -fvisibility=hidden for shared libraries
Andres Freund writes: > On 2022-07-18 00:05:16 -0700, Andres Freund wrote: >> Given that that's just about all compilers we support using configure, >> perhaps >> we should just move that out of the compiler specific section? Doesn't look >> like there's much precedent for that so far... > Here's a potential patch along those lines. Now that the dust from the main patch is pretty well settled, +1 for trying that. > I wonder if we also should move the -fno-strict-aliasing, -fwrapv tests > out. But that'd be something for later. Those seem less likely to be portable to non-gcc-alike compilers. On the other hand, maybe it'd be interesting to just remove the conditionality temporarily and try ALL the switches on all compilers, just to see what we can learn from the buildfarm. regards, tom lane
Re: pgsql: Remove the restriction that the relmap must be 512 bytes.
Robert Haas writes: > On Wed, Jul 27, 2022 at 1:19 PM Alvaro Herrera > wrote: >> Another thing that seems to have happened here is that catversion ought >> to have been touched and wasn't. > Hmm, interesting. I didn't think about bumping catversion because I > didn't change anything in the catalogs. I did think about changing the > magic number for the file at one point, but unlike some of our other > constants, there's no indication that this one is intended to be used > as a version number. But in retrospect it would have been good to > change something somewhere. If you want me to bump catversion now, I > can. If you or someone else wants to do it, that's also fine. If there's a magic number, then I'd (a) change that and (b) adjust whatever comments led you to think you shouldn't. Bumping catversion is a good fallback choice when there's not any more-proximate version indicator, but here there is. regards, tom lane
Re: pgsql: Remove the restriction that the relmap must be 512 bytes.
On Wed, Jul 27, 2022 at 1:19 PM Alvaro Herrera wrote:
> Another thing that seems to have happened here is that catversion ought
> to have been touched and wasn't. Trying to start a cluster that was
> initdb'd with the previous code enters an infinite loop that dies each
> time with
>
> 2022-07-27 19:17:27.589 CEST [2516547] LOG: database system is ready to accept connections
> 2022-07-27 19:17:27.589 CEST [2516730] FATAL: could not read file "global/pg_filenode.map": read 512 of 524
> 2022-07-27 19:17:27.589 CEST [2516731] FATAL: could not read file "global/pg_filenode.map": read 512 of 524
> 2022-07-27 19:17:27.589 CEST [2516547] LOG: autovacuum launcher process (PID 2516730) exited with exit code 1
> 2022-07-27 19:17:27.589 CEST [2516547] LOG: terminating any other active server processes
>
> Perhaps we should still do a catversion bump now, since one hasn't
> happened since the commit.

Hmm, interesting. I didn't think about bumping catversion because I didn't
change anything in the catalogs. I did think about changing the magic
number for the file at one point, but unlike some of our other constants,
there's no indication that this one is intended to be used as a version
number. But in retrospect it would have been good to change something
somewhere. If you want me to bump catversion now, I can. If you or
someone else wants to do it, that's also fine.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: pgsql: Remove the restriction that the relmap must be 512 bytes.
On 2022-Jul-26, Robert Haas wrote:

> Remove the restriction that the relmap must be 512 bytes.
>
> Instead of relying on the ability to atomically overwrite the
> entire relmap file in one shot, write a new one and durably
> rename it into place. Removing the struct padding and the
> calculation showing why the map is exactly 512 bytes, and change
> the maximum number of entries to a nearby round number.

Another thing that seems to have happened here is that catversion ought to
have been touched and wasn't. Trying to start a cluster that was initdb'd
with the previous code enters an infinite loop that dies each time with

2022-07-27 19:17:27.589 CEST [2516547] LOG: database system is ready to accept connections
2022-07-27 19:17:27.589 CEST [2516730] FATAL: could not read file "global/pg_filenode.map": read 512 of 524
2022-07-27 19:17:27.589 CEST [2516731] FATAL: could not read file "global/pg_filenode.map": read 512 of 524
2022-07-27 19:17:27.589 CEST [2516547] LOG: autovacuum launcher process (PID 2516730) exited with exit code 1
2022-07-27 19:17:27.589 CEST [2516547] LOG: terminating any other active server processes

Perhaps we should still do a catversion bump now, since one hasn't
happened since the commit.

-- 
Álvaro Herrera        48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"El número de instalaciones de UNIX se ha elevado a 10, y se espera que
este número aumente" (UPM, 1972)
Re: Reducing planning time on tables with many indexes
I wrote: > Unfortunately, as things stand today, the planner needs more than the > right to look at the indexes' schemas, because it makes physical accesses > to btree indexes to find out their tree height (and I think there are some > comparable behaviors in other AMs). I've never particularly cared for > that implementation, and would be glad to rip out that behavior if we can > find another way. Maybe we could persuade VACUUM or ANALYZE to store that > info in the index's pg_index row, or some such, and then the planner > could use it with no lock? A first step here could just be to postpone fetching _bt_getrootheight() until we actually need it during cost estimation. That would avoid the need to do it at all for indexes that indxpath.c discards as irrelevant, which is a decision made on considerably more information than the proposed patch uses. Having done that, you could look into revising plancat.c to fill the IndexOptInfo structs from catcache entries instead of opening the index per se. (You'd have to also make sure that the appropriate index locks are acquired eventually, for indexes the query does use at runtime. I think that's the case, but I'm not sure if anything downstream has been optimized on the assumption the planner did it.) This'd probably get us a large part of the way there. Further optimization of acquisition of tree height etc could be an optional follow-up. regards, tom lane
Re: Reducing planning time on tables with many indexes
David Geier writes:
> We tracked down the root cause of this slowdown to lock contention in
> 'get_relation_info()'. The index lock of every single index of every single
> table used in that query is acquired. We attempted a fix by pre-filtering
> out all indexes that anyways cannot be used with a certain query, without
> taking the index locks (credits to Luc Vlaming for idea and
> implementation). The patch does so by caching the columns present in every
> index, inside 'struct Relation', similarly to 'rd_indexlist'.

I wonder how much thought you gave to the costs imposed by that extra
cache space. We have a lot of users who moan about relcache bloat
already.

But more to the point, I do not buy the assumption that an index's set of
columns is a good filter for which indexes are of interest. A trivial
counterexample from the regression database is

regression=# explain select count(*) from tenk1;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Aggregate  (cost=219.28..219.29 rows=1 width=8)
   ->  Index Only Scan using tenk1_hundred on tenk1  (cost=0.29..194.28 rows=10000 width=0)
(2 rows)

It looks to me like the patch also makes unwarranted assumptions about
being able to discard all but the smallest index having a given set of
columns. This would, for example, possibly lead to dropping the index
that has the most useful sort order, or that has the operator class
needed to support a specific WHERE clause.

In short, I'm not sure I buy this concept at all. I think it might be
more useful to attack the locking overhead more directly. I kind of
wonder why we need per-index locks at all during planning --- I think
that we already interpret AccessShareLock on the parent table as being
sufficient to block schema changes on existing indexes.

Unfortunately, as things stand today, the planner needs more than the
right to look at the indexes' schemas, because it makes physical accesses
to btree indexes to find out their tree height (and I think there are
some comparable behaviors in other AMs).
I've never particularly cared for that implementation, and would be glad to rip out that behavior if we can find another way. Maybe we could persuade VACUUM or ANALYZE to store that info in the index's pg_index row, or some such, and then the planner could use it with no lock? regards, tom lane
Re: making relfilenodes 56 bits
On Wed, 27 Jul 2022 at 9:49 PM, Dilip Kumar wrote:
> On Wed, Jul 27, 2022 at 12:07 AM Robert Haas wrote:
> > On Tue, Jul 26, 2022 at 2:07 AM Dilip Kumar wrote:
> > > I have thought about it while doing so but I am not sure whether it is
> > > a good idea or not, because before my change these all were macros
> > > with 2 naming conventions so I just changed to inline function so why
> > > to change the name.
> >
> > Well, the reason to change the name would be for consistency. It feels
> > weird to have some NAMES_LIKETHIS() and other NamesLikeThis().
> >
> > Now, an argument against that is that it will make back-patching more
> > annoying, if any code using these functions/macros is touched. But
> > since the calling sequence is changing anyway (you now have to pass a
> > pointer rather than the object itself) that argument doesn't really
> > carry any weight. So I would favor ClearBufferTag(), InitBufferTag(),
> > etc.
>
> Okay, so I have renamed these 2 functions and BUFFERTAGS_EQUAL as well
> to BufferTagEqual().

Just realised that this should have been BufferTagsEqual instead of
BufferTagEqual.

I will modify this and send an updated patch tomorrow.

—
Dilip

> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
Re: generic plans and "initial" pruning
On Tue, Jul 26, 2022 at 11:01 PM Amit Langote wrote: > Needed to be rebased again, over 2d04277121f this time. 0001 adds es_part_prune_result but does not use it, so maybe the introduction of that field should be deferred until it's needed for something. I wonder whether it's really necessary to added the PartitionPruneInfo objects to a list in PlannerInfo first and then roll them up into PlannerGlobal later. I know we do that for range table entries, but I've never quite understood why we do it that way instead of creating a flat range table in PlannerGlobal from the start. And so by extension I wonder whether this table couldn't be flat from the start also. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Request for assistance to backport CVE-2022-1552 fixes to 9.6 and 9.4
Hello pgsql-hackers,

Is there anyone willing to review the patches that I prepared? I'd have
substantially more confidence in the patches with a review from an
upstream developer who is familiar with the code.

Regards,

-Roberto

On Mon, Jul 04, 2022 at 06:06:58PM -0400, Roberto C. Sánchez wrote:
> On Wed, Jun 08, 2022 at 05:31:22PM -0400, Roberto C. Sánchez wrote:
> > On Wed, Jun 08, 2022 at 04:15:47PM -0400, Tom Lane wrote:
> > > Roberto =?iso-8859-1?Q?C=2E_S=E1nchez?= writes:
> > > > I am investigating backporting the fixes for CVE-2022-1552 to 9.6 and
> > > > 9.4 as part of Debian LTS and Extended LTS. I am aware that these
> > > > releases are no longer supported upstream, but I have made an attempt at
> > > > adapting commits ef792f7856dea2576dcd9cab92b2b05fe955696b and
> > > > f26d5702857a9c027f84850af48b0eea0f3aa15c from the REL_10_STABLE branch.
> > > > I would appreciate a review of the attached patches and any comments on
> > > > things that may have been missed and/or adapted improperly.
> > >
> > > FWIW, I would not recommend being in a huge hurry to back-port those
> > > changes, pending the outcome of this discussion:
> > >
> > > https://www.postgresql.org/message-id/flat/f8a4105f076544c180a87ef0c4822352%40stmuk.bayern.de
> >
> > Thanks for the pointer.
> >
> > > We're going to have to tweak that code somehow, and it's not yet
> > > entirely clear how.
> >
> > I will monitor the discussion to see what comes of it.
>
> Based on the discussion in the other thread, I have made an attempt to
> backport commit 88b39e61486a8925a3861d50c306a51eaa1af8d6 to 9.6 and 9.4.
> The only significant change that I had to make was to add the full
> function signatures for the REVOKE/GRANT in the citext test.
>
> One question that I had about the change as committed is whether a
> REVOKE is needed on s.citext_ne, like so:
>
> REVOKE ALL ON FUNCTION s.citext_ne FROM PUBLIC;
>
> Or (for pre-10):
>
> REVOKE ALL ON FUNCTION s.citext_ne(s.citext, s.citext) FROM PUBLIC;
>
> I ask because the comment immediately preceding the sequence of REVOKEs
> includes the comment "Revoke all conceivably-relevant ACLs within the
> extension." I am not especially knowledgable about deep internals, but
> that function seems like it would belong in the same group with the
> others.
>
> In any event, would someone be willing to review the attached patches
> for correctness? I would like to shortly publish updates to 9.6 and 9.4
> in Debian and a review would be most appreciated.
>
> Regards,
>
> -Roberto
>
> --
> Roberto C. Sánchez

> From ef792f7856dea2576dcd9cab92b2b05fe955696b Mon Sep 17 00:00:00 2001
> From: Noah Misch
> Date: Mon, 9 May 2022 08:35:08 -0700
> Subject: [PATCH] Make relation-enumerating operations be security-restricted
>  operations.
>
> When a feature enumerates relations and runs functions associated with
> all found relations, the feature's user shall not need to trust every
> user having permission to create objects. BRIN-specific functionality
> in autovacuum neglected to account for this, as did pg_amcheck and
> CLUSTER. An attacker having permission to create non-temp objects in at
> least one schema could execute arbitrary SQL functions under the
> identity of the bootstrap superuser. CREATE INDEX (not a
> relation-enumerating operation) and REINDEX protected themselves too
> late. This change extends to the non-enumerating amcheck interface.
> Back-patch to v10 (all supported versions).
>
> Sergey Shinderuk, reviewed (in earlier versions) by Alexander Lakhin.
> Reported by Alexander Lakhin.
>
> Security: CVE-2022-1552
> ---
>  src/backend/access/brin/brin.c           | 30 -
>  src/backend/catalog/index.c              | 41 +--
>  src/backend/commands/cluster.c           | 35
>  src/backend/commands/indexcmds.c         | 53 +--
>  src/backend/utils/init/miscinit.c        | 24 --
>  src/test/regress/expected/privileges.out | 42
>  src/test/regress/sql/privileges.sql      | 36 +
>  7 files changed, 231 insertions(+), 30 deletions(-)
>
> --- a/src/backend/access/brin/brin.c
> +++ b/src/backend/access/brin/brin.c
> @@ -28,6 +28,7 @@
>  #include "pgstat.h"
>  #include "storage/bufmgr.h"
>  #include "storage/freespace.h"
> +#include "utils/guc.h"
>  #include "utils/index_selfuncs.h"
>  #include "utils/memutils.h"
>  #include "utils/rel.h"
> @@ -786,6 +787,9 @@
>  	Oid			heapoid;
>  	Relation	indexRel;
>  	Relation	heapRel;
> +	Oid			save_userid;
> +	int			save_sec_context;
> +	int			save_nestlevel;
>  	double		numSummarized = 0;
>
>  	if (RecoveryInProgress())
> @@ -799,10 +803,28 @@
>  	 * passed indexoid isn't an index then IndexGetRelation() will fail.
>  	 * Rather than emitting a
Re: [Commitfest 2022-07] Patch Triage: Waiting on Author
On Tue, Jul 26, 2022 at 7:47 PM James Coleman wrote: > These are both mine, and I'd hoped to work on them this CF, but I've > been sufficiently busy that that hasn't happened. > > I'd like to just move these to the next CF. Well, if we mark them returned with feedback now, and you get time to work on them, you can always change the status back to something else at that point. That has the advantage that, if you don't get time to work on them, they're not cluttering up the next CF in the meantime. We're not doing a great job kicking things out of the CF when they are non-actionable, and thereby we are making life harder for ourselves collectively. -- Robert Haas EDB: http://www.enterprisedb.com
Re: fairywren hung in pg_basebackup tests
On 2022-07-27 We 10:58, Tom Lane wrote:
> Andrew Dunstan writes:
>> The alternative I thought of would be to switch msys to using our
>> dirent.c. Probably not too hard, but certainly more work than reverting.
> If you ask me, the shortest-path general-purpose fix is to insert
>
> #if MSYS
>     if (pgwin32_is_junction(path))
>         return PGFILETYPE_DIR;
> #endif
>
> at the start of get_dirent_type. (I'm not sure how to spell the
> #if test.) We could look at using dirent.c later, but I think
> right now it's important to un-break the buildfarm ASAP.

+1. I think you spell it:

#if defined(WIN32) && !defined(_MSC_VER)

(c.f. libpq-be.h)

cheers

andrew

-- 
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: autoprewarm worker failing to load
Robins Tharakan writes: > 089480c077056 seems to have broken pg_prewarm. Ugh ... sure would be nice if contrib/pg_prewarm had some regression tests. > Checking pg_prewarm.so the function 'autoprewarm_main' visibility > switched from GLOBAL to LOCAL. Per [1], using PGDLLEXPORT > makes it GLOBAL again, which appears to fix the issue: Right, that's the appropriate fix. I suppose we had better look at everything else that's passed to bgw_function_name anywhere, too. regards, tom lane
Re: fairywren hung in pg_basebackup tests
Andrew Dunstan writes:
> The alternative I thought of would be to switch msys to using our
> dirent.c. Probably not too hard, but certainly more work than reverting.

If you ask me, the shortest-path general-purpose fix is to insert

    #if MSYS
        if (pgwin32_is_junction(path))
            return PGFILETYPE_DIR;
    #endif

at the start of get_dirent_type. (I'm not sure how to spell the #if test.) We could look at using dirent.c later, but I think right now it's important to un-break the buildfarm ASAP.

regards, tom lane
autoprewarm worker failing to load
Hi,

089480c077056 seems to have broken pg_prewarm. When pg_prewarm is added to shared_preload_libraries, each new connection results in thousands of errors such as this:

2022-07-27 04:25:14.325 UTC [2903955] LOG: background worker "autoprewarm leader" (PID 2904146) exited with exit code 1
2022-07-27 04:25:14.325 UTC [2904148] ERROR: could not find function "autoprewarm_main" in file "/home/ubuntu/proj/tempdel/lib/postgresql/pg_prewarm.so"

Checking pg_prewarm.so, the visibility of the function 'autoprewarm_main' switched from GLOBAL to LOCAL. Per [1], using PGDLLEXPORT makes it GLOBAL again, which appears to fix the issue:

Before commit (089480c077056) -
ubuntu:~/proj/tempdel$ readelf -sW lib/postgresql/pg_prewarm.so | grep main
103: 3d79 609 FUNC GLOBAL DEFAULT 14 autoprewarm_main
109: 45ad 873 FUNC GLOBAL DEFAULT 14 autoprewarm_database_main
128: 3d79 609 FUNC GLOBAL DEFAULT 14 autoprewarm_main
187: 45ad 873 FUNC GLOBAL DEFAULT 14 autoprewarm_database_main

After commit (089480c077056) -
78: 2d79 609 FUNC LOCAL DEFAULT 14 autoprewarm_main
85: 35ad 873 FUNC LOCAL DEFAULT 14 autoprewarm_database_main

After applying the attached fix:
103: 3d79 609 FUNC GLOBAL DEFAULT 14 autoprewarm_main
84: 45ad 873 FUNC LOCAL DEFAULT 14 autoprewarm_database_main
129: 3d79 609 FUNC GLOBAL DEFAULT 14 autoprewarm_main

Please let me know your thoughts on this approach.

[1] https://www.postgresql.org/message-id/A737B7A37273E048B164557ADEF4A58B5393038C%40ntex2010a.host.magwien.gv.at

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index b2d6026093..ec619be9f2 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -82,7 +82,7 @@ typedef struct AutoPrewarmSharedState
 	int prewarmed_blocks;
 } AutoPrewarmSharedState;
 
-void autoprewarm_main(Datum main_arg);
+PGDLLEXPORT void autoprewarm_main(Datum main_arg);
 void autoprewarm_database_main(Datum main_arg);
 
 PG_FUNCTION_INFO_V1(autoprewarm_start_worker);

- Robins Tharakan
Amazon Web Services
Re: fairywren hung in pg_basebackup tests
On 2022-07-27 We 10:24, Alvaro Herrera wrote: > On 2022-Jul-27, Andrew Dunstan wrote: > >> The msys dirent.h doesn't have a d_type field at all in a struct dirent. >> I can see a number of ways of dealing with this, but the simplest seems >> to be just to revert 5344723755, at least for msys, along with a comment >> about why it's necessary. > Hmm, what other ways there are? I'm about to push a change that > duplicates the get_dirent_type call pattern and I was happy about not > having that #ifdef there. Not that it's critical, but ... The alternative I thought of would be to switch msys to using our dirent.c. Probably not too hard, but certainly more work than reverting. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: fairywren hung in pg_basebackup tests
On 2022-Jul-27, Andrew Dunstan wrote: > The msys dirent.h doesn't have a d_type field at all in a struct dirent. > I can see a number of ways of dealing with this, but the simplest seems > to be just to revert 5344723755, at least for msys, along with a comment > about why it's necessary. Hmm, what other ways there are? I'm about to push a change that duplicates the get_dirent_type call pattern and I was happy about not having that #ifdef there. Not that it's critical, but ... -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: fairywren hung in pg_basebackup tests
On 2022-07-27 We 09:47, Andrew Dunstan wrote: > On 2022-07-26 Tu 18:31, Thomas Munro wrote: >> On Tue, Jul 26, 2022 at 4:03 AM Andrew Dunstan wrote: >>> On 2022-07-25 Mo 11:24, Thomas Munro wrote: On Tue, Jul 26, 2022 at 3:08 AM Tom Lane wrote: > Hmm ... an alternative theory is that the test is fine, and what > it's telling us is that get_dirent_type() is still wrong on msys. > Would that end in this symptom? Hmm, possibly yes (if it sees a non-symlink, it'll skip it). If someone can run the test on an msys system, perhaps they could put a debugging elog() into the code modified by 9d3444dc to log d_name and the d_type that is returned? I'm struggling to understand why msys would change the answer though. >>> I have no idea either. The link exists and it is a junction. I'll see >>> about logging details. >> >From the clues so far, it seems like pgwin32_is_junction(fullpath) was >> returning true, but the following code in get_dirent_type(), which was >> supposed to be equivalent, is not reached on MSYS (though it >> apparently does on MSVC?): >> >> + if ((fd.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) != 0 && >> + (fd.dwReserved0 == IO_REPARSE_TAG_MOUNT_POINT)) >> d->ret.d_type = DT_LNK; >> >> pgwin32_is_junction() uses GetFileAttributes() and tests (attr & >> FILE_ATTRIBUTE_REPARSE_POINT) == FILE_ATTRIBUTE_REPARSE_POINT, which >> is like the first condition but lacks the dwReserved0 part. What is >> that part doing, and why would it be doing something different in MSVC >> and MSYS builds? That code came from 87e6ed7c, recently I was just >> trying to fix it by reordering the checks; oh, there was some >> discussion about that field[1]. >> >> One idea is that something about dwReserved0 or >> IO_REPARSE_TAG_MOUNT_POINT is different in the open source replacement >> system headers supplied by the MinGW project used by MSYS builds >> (right?), compared to the "real" Windows SDK's headers used by MSVC >> builds. 
>> >> Or perhaps there is some other dumb mistake, or perhaps the reparse >> point really is different, or ... I dunno, I'd probably shove a load >> of log messages in there and see what's going on. >> >> [1] >> https://www.postgresql.org/message-id/flat/CABUevEzURN%3DwC95JHvTKFJtEy0eY9rWO42yU%3D59-q8xSwm-Dug%40mail.gmail.com#ac54acd782fc849c0fe6c2c05db101dc > > dirent.c is not used on msys, only on MSVC. msys is apparently using > opendir and friends supplied by the system. > > What it does if there's a junction I'll try to find out, but it appears > that 5344723755 was conceived under a misapprehension about the > behaviour of msys. > > The msys dirent.h doesn't have a d_type field at all in a struct dirent. I can see a number of ways of dealing with this, but the simplest seems to be just to revert 5344723755, at least for msys, along with a comment about why it's necessary. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: fairywren hung in pg_basebackup tests
On 2022-07-26 Tu 18:31, Thomas Munro wrote:
> On Tue, Jul 26, 2022 at 4:03 AM Andrew Dunstan wrote:
>> On 2022-07-25 Mo 11:24, Thomas Munro wrote:
>>> On Tue, Jul 26, 2022 at 3:08 AM Tom Lane wrote:
>>>> Hmm ... an alternative theory is that the test is fine, and what
>>>> it's telling us is that get_dirent_type() is still wrong on msys.
>>>> Would that end in this symptom?
>>> Hmm, possibly yes (if it sees a non-symlink, it'll skip it). If
>>> someone can run the test on an msys system, perhaps they could put a
>>> debugging elog() into the code modified by 9d3444dc to log d_name and
>>> the d_type that is returned? I'm struggling to understand why msys
>>> would change the answer though.
>> I have no idea either. The link exists and it is a junction. I'll see
>> about logging details.
>
> From the clues so far, it seems like pgwin32_is_junction(fullpath) was
> returning true, but the following code in get_dirent_type(), which was
> supposed to be equivalent, is not reached on MSYS (though it
> apparently does on MSVC?):
>
> +    if ((fd.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) != 0 &&
> +        (fd.dwReserved0 == IO_REPARSE_TAG_MOUNT_POINT))
>          d->ret.d_type = DT_LNK;
>
> pgwin32_is_junction() uses GetFileAttributes() and tests (attr &
> FILE_ATTRIBUTE_REPARSE_POINT) == FILE_ATTRIBUTE_REPARSE_POINT, which
> is like the first condition but lacks the dwReserved0 part. What is
> that part doing, and why would it be doing something different in MSVC
> and MSYS builds? That code came from 87e6ed7c, recently I was just
> trying to fix it by reordering the checks; oh, there was some
> discussion about that field[1].
>
> One idea is that something about dwReserved0 or
> IO_REPARSE_TAG_MOUNT_POINT is different in the open source replacement
> system headers supplied by the MinGW project used by MSYS builds
> (right?), compared to the "real" Windows SDK's headers used by MSVC
> builds.
> > Or perhaps there is some other dumb mistake, or perhaps the reparse > point really is different, or ... I dunno, I'd probably shove a load > of log messages in there and see what's going on. > > [1] > https://www.postgresql.org/message-id/flat/CABUevEzURN%3DwC95JHvTKFJtEy0eY9rWO42yU%3D59-q8xSwm-Dug%40mail.gmail.com#ac54acd782fc849c0fe6c2c05db101dc dirent.c is not used on msys, only on MSVC. msys is apparently using opendir and friends supplied by the system. What it does if there's a junction I'll try to find out, but it appears that 5344723755 was conceived under a misapprehension about the behaviour of msys. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Trying to add more tests to gistbuild.c
Hi Matheus,

Many thanks for hacking on increasing the code coverage! I noticed that this patch was stuck in "Needs Review" state for some time and decided to take a look.

> With these new tests the coverage went from 45.3% to 85.5%, but I have some
> doubts:
> - Does this test make sense?
> - Would there be a way to validate that the buffering was done correctly?
> - Is this test necessary?

I can confirm that the coverage improved as stated. Personally I believe this change makes perfect sense. Although this is arguably not an ideal test for gistInitBuffering(), writing proper tests for `static` procedures is generally not an easy task. Executing the code at least once in order to make sure that it doesn't crash, doesn't throw errors and doesn't violate any Asserts() is better than doing nothing.

Here is a slightly modified patch with an added commit message. Please note that patches created with `git format-patch` are generally preferable to patches created with `git diff`.

I'm going to change the status of this patch to "Ready for Committer" shortly unless anyone has a second opinion.

--
Best regards,
Aleksander Alekseev

v3-0001-Improve-test-coverage-of-gistbuild.c.patch
Description: Binary data
Re: Skip partition tuple routing with constant partition key
On Wed, Jul 27, 2022 at 7:28 AM David Rowley wrote: > On Sat, 23 Jul 2022 at 01:23, Amit Langote wrote: > > + /* > > +* The Datum has changed. Zero the number of times > > we've > > +* found last_found_datum_index in a row. > > +*/ > > + partdesc->last_found_count = 0; > > > > + /* Zero the "winning streak" on the cache hit count */ > > + partdesc->last_found_count = 0; > > > > Might it be better for the two comments to say the same thing? Also, > > I wonder which one do you intend as the resetting of last_found_count: > > setting it to 0 or 1? I can see that the stanza at the end of the > > function sets to 1 to start a new cycle. > > I think I've addressed all of your comments. The above one in > particular caused me to make some larger changes. > > The reason I was zeroing the last_found_count in LIST partitioned > tables when the Datum was not equal to the previous found Datum was > due to the fact that the code at the end of the function was only > checking the partition indexes matched rather than the bound_offset vs > last_found_datum_index. The reason I wanted to zero this was that if > you had a partition FOR VALUES IN(1,2), and you received rows with > values alternating between 1 and 2 then we'd match to the same > partition each time, however the equality test with the current > 'values' and the Datum at last_found_datum_index would have been false > each time. If we didn't zero the last_found_count we'd have kept > using the cache path even though the Datum and last Datum wouldn't > have been equal each time. That would have resulted in always doing > the cache check and failing, then doing the binary search anyway. Thanks for the explanation. So, in a way the caching scheme works for LIST partitioning only if the same value appears consecutively in the input set, whereas it does not for *a set of* values belonging to the same partition appearing consecutively. Maybe that's a reasonable restriction for now. 
> I've now changed the code so that instead of checking the last found
> partition is the same as the last one, I'm now checking if
> bound_offset is the same as last_found_datum_index. This will be
> false in the "values alternating between 1 and 2" case from above.
> This caused me to have to change how the caching works for LIST
> partitions with a NULL partition which is receiving NULL values. I've
> coded things now to just skip the cache for that case. Finding the
> correct LIST partition for a NULL value is cheap and no need to cache
> that. I've also moved all the code which updates the cache fields to
> the bottom of get_partition_for_tuple(). I'm only expecting to do that
> when bound_offset is set by the lookup code in the switch statement.
> Any paths, e.g. HASH partitioning lookup and LIST or RANGE with NULL
> values shouldn't reach the code which updates the partition fields.
> I've added an Assert(bound_offset >= 0) to ensure that stays true.

Looks good.

> There's probably a bit more to optimise here too, but not much. I
> don't think the partdesc->last_found_part_index = -1; is needed when
> we're in the code block that does return boundinfo->default_index;
> However, that only might very slightly speedup the case when we're
> inserting continuously into the DEFAULT partition. That code path is
> also used when we fail to find any matching partition. That's not one
> we need to worry about making go faster.

So this is about:

 	if (part_index < 0)
-		part_index = boundinfo->default_index;
+	{
+		/*
+		 * Since we don't do caching for the default partition or failed
+		 * lookups, we'll just wipe the cache fields back to their initial
+		 * values.  The count becomes 0 rather than 1 as 1 means it's the
+		 * first time we've found a partition we're recording for the cache.
+		 */
+		partdesc->last_found_datum_index = -1;
+		partdesc->last_found_part_index = -1;
+		partdesc->last_found_count = 0;
+
+		return boundinfo->default_index;
+	}

I wonder why not leave the cache untouched in this case? It's possible that erratic rows only rarely occur in the input sets.

> I also ran the benchmarks again and saw that most of the use of
> likely() and unlikely() no longer did what I found them to do earlier.
> So the weirdness we saw there most likely was just down to random code
> layout changes. In this patch, I just dropped the use of either of
> those two macros.

Ah, using either seems to be trying to fit the code to one or the other pattern in the input set anyway, so it seems fine to keep them out for now.

Some minor comments:

+ * The number of times the same partition must be found in a row before we
+ * switch from a search for the given values to just checking if the values

How about: "switch from using a binary search for the given values to..."

Should the comment
Re: [PoC] Reducing planning time on tables with many indexes
Sorry, by accident I sent this one out twice. -- David Geier (ServiceNow) On Wed, Jul 27, 2022 at 2:42 PM David Geier wrote: > Hi hackers, > > We came across a slowdown in planning, where queries use tables with many > indexes. In setups with wide tables it is not uncommon to have easily > 10-100 indexes on a single table. The slowdown is already visible in serial > workloads with just a handful of indexes, but gets drastically amplified > when running queries with more indexes in parallel at high throughput. > > We measured the TPS and planning time of running parallel streams of > simple point look-up queries on a single empty table with 60 columns and 60 > indexes. The query used is 'SELECT * FROM synth_table WHERE col5 = 42'. No > rows are returned because the table is empty. We used a machine with 64 > physical CPU cores. The schema and sysbench script to reproduce these > numbers are attached. We used the TPS as reported by sysbench and obtained > planning time by running 'EXPLAIN ANALYZE' on the same query in a > separately opened connection. We averaged the planning time of 3 successive > 'EXPLAIN ANALYZE' runs. sysbench ran on the same machine with varying > numbers of threads using the following command line: > > sysbench repro.lua --db-driver=pgsql --pgsql-host=localhost > --pgsql-db=postgres --pgsql-port=? --pgsql-user=? --pgsql-password=? > --report-interval=1 --threads=64 run > > The following table shows the results. It is clearly visible that the TPS > flatten out already at 8 parallel streams, while the planning time is > increasing drastically. > > Parallel streams | TPS (before) | Planning time (before) > -|--|--- > 1| 5,486 | 0.13 ms > 2| 8,098 | 0.22 ms > 4| 15,013 | 0.19 ms > 8| 27,107 | 0.29 ms > 16 | 30,938 | 0.43 ms > 32 | 26,330 | 1.68 ms > 64 | 24,314 | 2.48 ms > > We tracked down the root cause of this slowdown to lock contention in > 'get_relation_info()'. 
The index lock of every single index of every single > table used in that query is acquired. We attempted a fix by pre-filtering > out all indexes that anyways cannot be used with a certain query, without > taking the index locks (credits to Luc Vlaming for idea and > implementation). The patch does so by caching the columns present in every > index, inside 'struct Relation', similarly to 'rd_indexlist'. Then, before > opening (= locking) the indexes in 'get_relation_info()', we check if the > index can actually contribute to the query and if not it is discarded right > away. Caching the index info saves considerable work for every query run > subsequently, because less indexes must be inspected and thereby locked. > This way we also save cycles in any code that later on goes over all > relation indexes. > > The work-in-progress version of the patch is attached. It is still fairly > rough (e.g. uses a global variable, selects the best index in scans without > restrictions by column count instead of physical column size, is missing > some renaming, etc.), but shows the principle. > > The following table shows the TPS, planning time and speed-ups after > applying the patch and rerunning above described benchmark. Now, the > planning time remains roughly constant and TPS roughly doubles each time > the number of parallel streams is doubled. The higher the stream count the > more severe the lock contention is and the more pronounced the gained > speed-up gets. Interestingly, even for a single query stream the speed-up > in planning time is already very significant. This applies also for lower > index counts. For example just with 10 indexes the TPS for a single query > stream goes from 9,159 to 12,558. We can do more measurements if there is > interest in details for a lower number of indexes. 
> > Parallel streams | TPS (after) | Planning time (after) | Speed-up TPS | > Speed-up planning > > -|-|---|--|-- > 1| 10,344 | 0.046 | 1.9x| > 2.8x > 2| 20,140 | 0.045 ms | 2.5x| > 4.9x > 4| 40,349 | 0.047 ms | 2.7x| > 4.0x > 8| 80,121 | 0.046 ms | 3.0x| > 6.3x > 16 | 152,632 | 0.051 ms | 4.9x| > 8.4x > 32 | 301,359 | 0.052 ms | 11.4x| > 32.3x > 64 | 525,115 | 0.062 ms | 21.6x| > 40.0x > > We are happy to receive your feedback and polish up the patch. > > -- > David Geier > (ServiceNow) >
[PoC] Reducing planning time on tables with many indexes
Hi hackers,

We came across a slowdown in planning, where queries use tables with many indexes. In setups with wide tables it is not uncommon to have easily 10-100 indexes on a single table. The slowdown is already visible in serial workloads with just a handful of indexes, but gets drastically amplified when running queries with more indexes in parallel at high throughput.

We measured the TPS and planning time of running parallel streams of simple point look-up queries on a single empty table with 60 columns and 60 indexes. The query used is 'SELECT * FROM synth_table WHERE col5 = 42'. No rows are returned because the table is empty. We used a machine with 64 physical CPU cores. The schema and sysbench script to reproduce these numbers are attached. We used the TPS as reported by sysbench and obtained planning time by running 'EXPLAIN ANALYZE' on the same query in a separately opened connection. We averaged the planning time of 3 successive 'EXPLAIN ANALYZE' runs. sysbench ran on the same machine with varying numbers of threads using the following command line:

sysbench repro.lua --db-driver=pgsql --pgsql-host=localhost --pgsql-db=postgres --pgsql-port=? --pgsql-user=? --pgsql-password=? --report-interval=1 --threads=64 run

The following table shows the results. It is clearly visible that the TPS flattens out already at 8 parallel streams, while the planning time increases drastically.

Parallel streams | TPS (before) | Planning time (before)
-----------------|--------------|-----------------------
               1 |        5,486 | 0.13 ms
               2 |        8,098 | 0.22 ms
               4 |       15,013 | 0.19 ms
               8 |       27,107 | 0.29 ms
              16 |       30,938 | 0.43 ms
              32 |       26,330 | 1.68 ms
              64 |       24,314 | 2.48 ms

We tracked down the root cause of this slowdown to lock contention in 'get_relation_info()'. The index lock of every single index of every single table used in that query is acquired. We attempted a fix by pre-filtering out all indexes that anyways cannot be used with a certain query, without taking the index locks (credits to Luc Vlaming for idea and implementation).
The patch does so by caching the columns present in every index, inside 'struct Relation', similarly to 'rd_indexlist'. Then, before opening (= locking) the indexes in 'get_relation_info()', we check if the index can actually contribute to the query and if not it is discarded right away. Caching the index info saves considerable work for every query run subsequently, because fewer indexes must be inspected and thereby locked. This way we also save cycles in any code that later on goes over all relation indexes.

The work-in-progress version of the patch is attached. It is still fairly rough (e.g. uses a global variable, selects the best index in scans without restrictions by column count instead of physical column size, is missing some renaming, etc.), but shows the principle.

The following table shows the TPS, planning time and speed-ups after applying the patch and rerunning the benchmark described above. Now, the planning time remains roughly constant and TPS roughly doubles each time the number of parallel streams is doubled. The higher the stream count, the more severe the lock contention is and the more pronounced the gained speed-up gets. Interestingly, even for a single query stream the speed-up in planning time is already very significant. This also applies to lower index counts. For example, with just 10 indexes the TPS for a single query stream goes from 9,159 to 12,558. We can do more measurements if there is interest in details for a lower number of indexes.

Parallel streams | TPS (after) | Planning time (after) | Speed-up TPS | Speed-up planning
-----------------|-------------|-----------------------|--------------|------------------
               1 |      10,344 | 0.046                 | 1.9x         | 2.8x
               2 |      20,140 | 0.045 ms              | 2.5x         | 4.9x
               4 |      40,349 | 0.047 ms              | 2.7x         | 4.0x
               8 |      80,121 | 0.046 ms              | 3.0x         | 6.3x
              16 |     152,632 | 0.051 ms              | 4.9x         | 8.4x
              32 |     301,359 | 0.052 ms              | 11.4x        | 32.3x
              64 |     525,115 | 0.062 ms              | 21.6x        | 40.0x

We are happy to receive your feedback and polish up the patch.
-- David Geier (ServiceNow) From 60e300b5cac8f527e61483296df81ab783670ac9 Mon Sep 17 00:00:00 2001 From: David Geier Date: Fri, 15 Jul 2022 11:53:27 +0200 Subject: [PATCH] Index filtering --- src/backend/optimizer/path/Makefile | 3 +- src/backend/optimizer/path/index_filtering.c | 220 ++ src/backend/optimizer/plan/planmain.c | 7 + src/backend/optimizer/util/plancat.c | 124 ++ src/backend/utils/cache/relcache.c| 17 +-
RE: Perform streaming logical transactions by background workers and parallel apply
Dear Wang-san,

Hi, I'm also interested in the patch and I started to review it. Following are comments about 0001.

1. terminology

In your patch a new worker, "apply background worker", has been introduced, but I thought it might be confusing because PostgreSQL already has the worker "background worker". Both the apply worker and the apply bgworker are categorized as bgworkers. Do you have any reason not to use "apply parallel worker" or "apply streaming worker"? (Note that I'm not a native English speaker.)

2. logicalrep_worker_stop()

```
-	/* No worker, nothing to do. */
-	if (!worker)
-	{
-		LWLockRelease(LogicalRepWorkerLock);
-		return;
-	}
+	if (worker)
+		logicalrep_worker_stop_internal(worker);
+
+	LWLockRelease(LogicalRepWorkerLock);
+}
```

I thought you could add a comment about the meaning of the if-statement, like "No main apply worker, nothing to do".

3. logicalrep_workers_find()

I thought you could add a description of the difference between this and logicalrep_worker_find() at the top of the function. IIUC logicalrep_workers_find() counts subworkers, but logicalrep_worker_find() does not consider that type of worker.

4. logicalrep_worker_detach()

```
 static void
 logicalrep_worker_detach(void)
 {
+	/*
+	 * If we are the main apply worker, stop all the apply background workers
+	 * we started before.
+	 *
```

I thought "we are" should be "This is", based on other comments.

5. applybgworker.c

```
+/* Apply background workers hash table (initialized on first use) */
+static HTAB *ApplyWorkersHash = NULL;
+static List *ApplyWorkersFreeList = NIL;
+static List *ApplyWorkersList = NIL;
```

I thought they should be ApplyBgWorkersXXX, because they store information related only to apply bgworkers.

6. ApplyBgworkerShared

```
+	TransactionId stream_xid;
+	uint32 n; /* id of apply background worker */
+} ApplyBgworkerShared;
```

I thought the field "n" is too general; how about "proc_id" or "worker_id"?

7. apply_bgworker_wait_for()

```
+	/* If any workers (or the postmaster) have died, we have failed. */
+	if (status == APPLY_BGWORKER_EXIT)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("background worker %u failed to apply transaction %u",
+						wstate->shared->n, wstate->shared->stream_xid)))
```

7.a I thought we should not mention PM death here, because in that case the apply worker will exit at WaitLatch().

7.b The error message should be "apply background worker %u...".

8. apply_bgworker_check_status()

```
+				 errmsg("background worker %u exited unexpectedly",
+						wstate->shared->n)));
```

The error message should be "apply background worker %u...".

9. apply_bgworker_send_data()

```
+	if (result != SHM_MQ_SUCCESS)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("could not send tuples to shared-memory queue")));
```

I thought the error message should be "could not send data to..." because the sent data might not be tuples. For example, in the case of STREAM PREPARE, I think it does not contain tuples.

10. wait_event.h

```
 	WAIT_EVENT_HASH_GROW_BUCKETS_REINSERT,
+	WAIT_EVENT_LOGICAL_APPLY_WORKER_STATE_CHANGE,
 	WAIT_EVENT_LOGICAL_SYNC_DATA,
```

I thought the event should be WAIT_EVENT_LOGICAL_APPLY_BG_WORKER_STATE_CHANGE, because this is used when the apply worker waits until the status of the bgworker changes.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: Refactor to make use of a common function for GetSubscriptionRelations and GetSubscriptionNotReadyRelations.
On Wed, Jul 27, 2022 at 4:22 PM Michael Paquier wrote: > > On Wed, Jul 27, 2022 at 08:47:46AM +0530, vignesh C wrote: > > I feel this would be an overkill, I did not make any changes for this. > > Agreed. Using an extra layer of wrappers for that seems a bit too > much, so I have applied your v5 with a slight tweak to the comment. Thanks for pushing this patch. Regards, Vignesh
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths
On Wed, 27 Jul 2022 at 11:09, Michael Paquier wrote: > > On Tue, Jul 26, 2022 at 06:58:02PM +0200, Matthias van de Meent wrote: > > - Retained the check in XLogRegisterData, so that we check against > > integer overflows in the registerdata code instead of only an assert > > in XLogRecordAssemble where it might be too late. > > Why? The record has not been inserted yet. I would tend to keep only > the check at the bottom of XLogRecordAssemble(), for simplicity, and > call it a day. Because the sum value main_rdatalen can easily overflow in both the current and the previous APIs, which then corrupts the WAL - one of the two issues that I mentioned when I started the thread. We don't re-summarize the lengths of all XLogRecData segments for the main record data when assembling a record, to keep the performance of RecordAssemble (probably to limit the complexity when many data segments are registered). Because I didn't want to add more changes than necessary, this check needs to be done in the place where the overflow may occur, which is in XLogRegisterData. > > - Kept the inline static elog-ing function (as per Andres' suggestion > > on 2022-03-14; this decreases binary sizes) > > I am not really convinced that this one is worth doing. I'm not married to that change, but I also don't see why this can't be updated while this code is already being touched. > +#define MaxXLogRecordSize (1020 * 1024 * 1024) > + > +#define XLogRecordLengthIsValid(len) ((len) >= 0 && (len) < > MaxXLogRecordSize) > > These are used only in xloginsert.c, so we could keep them isolated. They might be only used in xloginsert right now, but that's not the point. This is now advertised as part of the record API spec: A record larger than 1020MB is explicitly not supported. If it was kept internal to xloginsert, that would be implicit and other people might start hitting issues similar to those we're hitting right now - records that are too large to read.
Although PostgreSQL is usually the only one generating WAL, we do support physical replication from arbitrary PG-compatible WAL streams, which means that any compatible WAL source could be the origin of our changes - and those need to be aware of the assumptions we make about the WAL format. I'm fine with also updating xlogreader.c to check this while reading records to clarify the limits there as well, if so desired. > + * To accommodate some overhead, hhis MaxXLogRecordSize value allows for > s/hhis/this/. Will be included in the next update. > For now, I have extracted from the patch the two API changes and the > checks for the block information for uint16, and applied this part. > That's one less thing to worry about. Thanks. Kind regards, Matthias van de Meent
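To make the overflow being discussed concrete: the two macros below are quoted from the patch hunk in this thread, while the accumulator and helper are simplified stand-ins for the xloginsert.c state (not the actual implementation). The idea is to widen to 64 bits at registration time, so the sum is checked before it can ever wrap:

```c
#include <assert.h>
#include <stdint.h>

/* Quoted from the patch hunk discussed above. */
#define MaxXLogRecordSize (1020 * 1024 * 1024)
#define XLogRecordLengthIsValid(len) ((len) >= 0 && (len) < MaxXLogRecordSize)

/* Simplified stand-in for the 32-bit main-data length accumulator. */
static uint32_t mainrdata_len = 0;

/*
 * Check at registration time, as the patch does in XLogRegisterData():
 * compute the new total in 64 bits first, so it cannot silently wrap
 * before XLogRecordAssemble() ever sees it.
 */
static int
register_data_checked(uint32_t chunk_len)
{
    int64_t new_len = (int64_t) mainrdata_len + (int64_t) chunk_len;

    if (!XLogRecordLengthIsValid(new_len))
        return 0;               /* reject: record would be unreplayable */
    mainrdata_len = (uint32_t) new_len;
    return 1;
}
```

A checker only at assembly time would see an already-wrapped 32-bit total, which is exactly the WAL-corrupting case described in the thread.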
Re: Introduce wait_for_subscription_sync for TAP tests
On Tue, Jul 26, 2022 at 1:12 PM Masahiko Sawada wrote: > > On Tue, Jul 26, 2022 at 2:01 PM Amit Kapila wrote: > > > > 2. > > +# wait for the replication to catchup if required. > > +if (defined($publisher)) > > +{ > > + croak 'subscription name must be specified' unless defined($subname); > > + $publisher->wait_for_catchup($subname, 'replay'); > > +} > > + > > +# then, wait for all table states to be ready. > > +print "Waiting for all subscriptions in \"$name\" to synchronize > > data\n"; > > +my $query = qq[SELECT count(1) = 0 > > + FROM pg_subscription_rel > > + WHERE srsubstate NOT IN ('r', 's');]; > > +$self->poll_query_until($dbname, $query) > > + or croak "timed out waiting for subscriber to synchronize data"; > > > > In the tests, I noticed that a few places did wait_for_catchup after > > the subscription check, and at other places, we did that check before > > as you have it here. Ideally, I think wait_for_catchup should be after > > confirming the initial sync is over as without initial sync, the > > publisher node won't be completely in sync with the subscriber. > > What do you mean by the last sentence? I thought the order doesn't > matter here. Even if we do wait_for_catchup first then the > subscription check, we can make sure that the apply worker caught up > and table synchronization has been done, no? > That's right. I thought we should first ensure the subscriber has finished operations if possible, like in this case, it can ensure table sync has finished and then we can ensure whether publisher and subscriber are in sync. That sounds more logical to me. -- With Regards, Amit Kapila.
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
On Mon, Jul 25, 2022 at 11:26 AM Masahiko Sawada wrote: > > On Mon, Jul 25, 2022 at 10:45 AM Masahiko Sawada > wrote: > > I've attached the patch for REl15 that I forgot. > I feel the place to remember running xacts information in SnapBuildProcessRunningXacts is not appropriate. Because in cases where there are no running xacts or when xl_running_xact is old enough that we can't use it, we don't need that information. I feel we need it only when we have to reuse the already serialized snapshot, so, won't it be better to initialize at that place in SnapBuildFindSnapshot()? I have changed accordingly in the attached and apart from that slightly modified the comments and commit message. Do let me know what you think of the attached? -- With Regards, Amit Kapila. REL15_v9-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch Description: Binary data
Re: Refactor to make use of a common function for GetSubscriptionRelations and GetSubscriptionNotReadyRelations.
On Wed, Jul 27, 2022 at 08:47:46AM +0530, vignesh C wrote: > I feel this would be an overkill, I did not make any changes for this. Agreed. Using an extra layer of wrappers for that seems a bit too much, so I have applied your v5 with a slight tweak to the comment. -- Michael signature.asc Description: PGP signature
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Hi Amit, I updated the patch in order to prevent the problems that might be caused by using different replication slots for syncing a table. As suggested in previous emails, replication slot names are stored in the catalog, so slot names can be reached later and it is ensured that the same replication slot is used during the tablesync step of a table. With the current version of the patch:

- A "srrelslotname" column is introduced into the pg_subscription_rel catalog. It stores the slot name for tablesync.
- Tablesync worker logic is now as follows:
  1. Tablesync worker is launched by apply worker for a table.
  2. Worker generates a default replication slot name for itself. Slot name includes subid and worker pid for tracking purposes.
  3. If the table has a slot name value in the catalog:
     i. If the table state is DATASYNC, drop the replication slot from the catalog and proceed with tablesync using a new slot.
     ii. If the table state is FINISHEDCOPY, use the replication slot from the catalog; do not create a new slot.
  4. Before the worker moves to a new table, drop any replication slots that were retrieved from the catalog and used.
  5. If no table is left to sync, drop the replication slot of that sync worker with worker pid, if it exists. (It's possible that a sync worker does not create a replication slot for itself but uses slots read from the catalog in each iteration.)

> I think using worker_pid also has similar risks of mixing slots from
> different workers because after restart same worker_pid could be
> assigned to a totally different worker. Can we think of using a unique
> 64-bit number instead? This will be allocated when each worker
> started for the very first time and after that we can refer catalog to
> find it as suggested in the idea below.

I'm not sure how likely it is to have colliding pids for different tablesync workers in the same subscription. Though, having the pid in the slot name makes it easier to track which slot belongs to which worker. That's why I kept using pid in slot names.
But I think it should be simple to switch to using a unique 64-bit number. So I can remove pid's from slot names, if you think that it would be better. > We should use the same for the origin name as well. > I did not really change anything related to origin names. Origin names are still the same and include relation id. What do you think would be an issue with origin names in this patch? > This sounds tricky. Why not first drop slot/origin and then detach it > from pg_subscription_rel? On restarts, it is possible that we may > error out after dropping the slot or origin but before updating the > catalog entry but in such case we can ignore missing slot/origin and > detach them from pg_subscription_rel. Also, if we use the unique > number as suggested above, I think even if we don't remove it after > the relation state is ready, it should be okay. > Right, I did not add an additional slot cleanup step. The patch now drops the slot when we're done with it and then removes it from the catalog. Thanks, Melih v2-0001-Reuse-Logical-Replication-Background-worker.patch Description: Binary data
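As an aside on the pid-versus-unique-id question above, the two naming schemes under discussion could look roughly like this; the exact name patterns here are hypothetical illustrations, not the format the patch uses:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * pid-based name: easy to trace back to a worker, but a recycled pid
 * after a restart could collide with a leftover slot of an old worker.
 */
static void
slot_name_from_pid(char *buf, size_t len, uint32_t subid, int pid)
{
    snprintf(buf, len, "pg_%u_sync_%d", subid, pid);
}

/*
 * unique-64-bit-id name: the id is assigned once per worker and never
 * reused, so names cannot collide within the cluster's lifetime.
 */
static void
slot_name_from_id(char *buf, size_t len, uint32_t subid, uint64_t id)
{
    snprintf(buf, len, "pg_%u_sync_%" PRIu64, subid, id);
}
```

The trade-off discussed in the thread is exactly this: traceability of the pid-based name versus the collision-freedom of a never-reused counter.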
RE: Introduce wait_for_subscription_sync for TAP tests
On Tue, Jul 26, 2022 3:42 PM Masahiko Sawada wrote: > > I've attached an updated patch as well as a patch to remove duplicated > waits in 007_ddl.pl. > Thanks for your patch. Here are some comments. 1. I think some comments need to be changed in the patch. For example: # Also wait for initial table sync to finish # Wait for initial sync to finish as well Words like "Also" and "as well" can be removed now, we originally used them because we wait for catchup and "also" wait for initial sync. 2. In the following places, we can remove wait_for_catchup() and then call it in wait_for_subscription_sync(). 2.1. 030_origin.pl: @@ -128,8 +120,7 @@ $node_B->safe_psql( $node_C->wait_for_catchup($appname_B2); -$node_B->poll_query_until('postgres', $synced_query) - or die "Timed out while waiting for subscriber to synchronize data"; +$node_B->wait_for_subscription_sync; 2.2. 031_column_list.pl: @@ -385,7 +373,7 @@ $node_subscriber->safe_psql( ALTER SUBSCRIPTION sub1 SET PUBLICATION pub2, pub3 )); -wait_for_subscription_sync($node_subscriber); +$node_subscriber->wait_for_subscription_sync; $node_publisher->wait_for_catchup('sub1'); 2.3. 100_bugs.pl: @@ -281,8 +276,7 @@ $node_subscriber->safe_psql('postgres', $node_publisher->wait_for_catchup('tap_sub'); # Also wait for initial table sync to finish -$node_subscriber->poll_query_until('postgres', $synced_query) - or die "Timed out while waiting for subscriber to synchronize data"; +$node_subscriber->wait_for_subscription_sync; is( $node_subscriber->safe_psql( 'postgres', "SELECT * FROM tab_replidentity_index"), Regards, Shi yu
Re: making relfilenodes 56 bits
On Tue, Jul 26, 2022 at 1:32 PM Dilip Kumar wrote: > > On Thu, Jul 21, 2022 at 9:53 AM Thomas Munro wrote: > > > > On Wed, Jul 20, 2022 at 11:27 PM Dilip Kumar wrote: > > > [v10 patch set] > > > > Hi Dilip, I'm experimenting with these patches and will hopefully have > > more to say soon, but I just wanted to point out that this builds with > > warnings and failed on 3/4 of the CI OSes on cfbot's last run. Maybe > > there is the good kind of uninitialised data on Linux, and the bad > > kind of uninitialised data on those other pesky systems? > > Here is the patch to fix the issue, basically, while asserting for the > file existence it was not setting the relfilenumber in the > relfilelocator before generating the path so it was just checking for > the existence of the random path so it was asserting randomly. Thanks for the updated patch, Few comments: 1) The format specifier should be changed from %u to INT64_FORMAT autoprewarm.c -> apw_load_buffers ... if (fscanf(file, "%u,%u,%u,%u,%u\n", &blkinfo[i].database, &blkinfo[i].tablespace, &blkinfo[i].filenumber, &forknum, &blkinfo[i].blocknum) != 5) ... 2) The format specifier should be changed from %u to INT64_FORMAT autoprewarm.c -> apw_dump_now ... ret = fprintf(file, "%u,%u,%u,%u,%u\n", block_info_array[i].database, block_info_array[i].tablespace, block_info_array[i].filenumber, (uint32) block_info_array[i].forknum, block_info_array[i].blocknum); ...
3) should the comment "entry point for old extension version" be on top of pg_buffercache_pages, as the current version will use pg_buffercache_pages_v1_4 + +Datum +pg_buffercache_pages(PG_FUNCTION_ARGS) +{ + return pg_buffercache_pages_internal(fcinfo, OIDOID); +} + +/* entry point for old extension version */ +Datum +pg_buffercache_pages_v1_4(PG_FUNCTION_ARGS) +{ + return pg_buffercache_pages_internal(fcinfo, INT8OID); +} 4) we could use the new style of ereport by removing the brackets around errcode: + if (fctx->record[i].relfilenumber > OID_MAX) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("relfilenode" INT64_FORMAT " is too large to be represented as an OID", + fctx->record[i].relfilenumber), + errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE"))); like: ereport(ERROR, errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("relfilenode" INT64_FORMAT " is too large to be represented as an OID", fctx->record[i].relfilenumber), errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE")); 5) Similarly in the below code too: + /* check whether the relfilenumber is within a valid range */ + if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), +errmsg("relfilenumber " INT64_FORMAT " is out of range", + (relfilenumber)))); 6) Similarly in the below code too: +#define CHECK_RELFILENUMBER_RANGE(relfilenumber) \ +do { \ + if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \ + ereport(ERROR, \ + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), \ +errmsg("relfilenumber " INT64_FORMAT " is out of range", \ + (relfilenumber)))); \ +} while (0) + 7) This error code looks similar to CHECK_RELFILENUMBER_RANGE, can this macro be used here too: pg_filenode_relation(PG_FUNCTION_ARGS) { Oid reltablespace = PG_GETARG_OID(0); - RelFileNumber relfilenumber = PG_GETARG_OID(1); + RelFileNumber relfilenumber = PG_GETARG_INT64(1); Oid heaprel; + /* check
whether the relfilenumber is within a valid range */ + if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), +errmsg("relfilenumber " INT64_FORMAT " is out of range", + (relfilenumber)))); 8) I felt this include is not required: diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c index 849a7ce..a2f0d35 100644 --- a/src/backend/access/transam/varsup.c +++ b/src/backend/access/transam/varsup.c @@ -13,12 +13,16 @@ #include "postgres.h" +#include + #include "access/clog.h" #include "access/commit_ts.h" 9) should we change elog to ereport to use the new-style error reporting API + /* safety check, we should never get this far in a HS standby */ + if (RecoveryInProgress()) + elog(ERROR, "cannot assign RelFileNumber
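To illustrate the point behind the INT64_FORMAT comments above: passing a 64-bit relfilenumber through a 32-bit `%u` conversion is undefined behaviour and at best truncates the value. A standalone sketch, using C's portable `PRId64` in place of PostgreSQL's `INT64_FORMAT` macro:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Format a relfilenumber that no longer fits in 32 bits. */
static void
format_relfilenumber(char *buf, size_t len, int64_t relfilenumber)
{
    /*
     * "%u" would consume only 32 of the 64 pushed bits (undefined
     * behaviour in C); the 64-bit conversion reads the full value.
     */
    snprintf(buf, len, "%" PRId64, relfilenumber);
}
```

This is why every `%u` that used to print an OID-sized relfilenode must be switched once the field is widened to 56 bits.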
Re: [PATCH] Simple code cleanup in tuplesort.c.
On Wed, Jul 27, 2022 at 5:10 PM Xing Guo wrote: > The bounded heap sorting status flag is set twice in sort_bounded_heap() > and tuplesort_performsort(). This patch helps remove one of them. > +1. Looks good to me. Thanks Richard
[PATCH] Simple code cleanup in tuplesort.c.
Hi hackers, The bounded heap sorting status flag is set twice in sort_bounded_heap() and tuplesort_performsort(). This patch helps remove one of them. Best Regards, Xing diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index d90a1aebdf..84dc0d07f9 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -2709,7 +2709,6 @@ sort_bounded_heap(Tuplesortstate *state) */ reversedirection(state); - state->status = TSS_SORTEDINMEM; state->boundUsed = true; }
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths
On Tue, Jul 26, 2022 at 06:58:02PM +0200, Matthias van de Meent wrote: > - Retained the check in XLogRegisterData, so that we check against > integer overflows in the registerdata code instead of only an assert > in XLogRecordAssemble where it might be too late. Why? The record has not been inserted yet. I would tend to keep only the check at the bottom of XLogRecordAssemble(), for simplicity, and call it a day. > - Kept the inline static elog-ing function (as per Andres' suggestion > on 2022-03-14; this decreases binary sizes) I am not really convinced that this one is worth doing. +#define MaxXLogRecordSize (1020 * 1024 * 1024) + +#define XLogRecordLengthIsValid(len) ((len) >= 0 && (len) < MaxXLogRecordSize) These are used only in xloginsert.c, so we could keep them isolated. + * To accommodate some overhead, hhis MaxXLogRecordSize value allows for s/hhis/this/. For now, I have extracted from the patch the two API changes and the checks for the block information for uint16, and applied this part. That's one less thing to worry about. -- Michael signature.asc Description: PGP signature
Re: [PATCH] Compression dictionaries for JSONB
On Wed, 27 Jul 2022 at 09:36, Simon Riggs wrote: > > On Sun, 17 Jul 2022 at 19:15, Nikita Malakhov wrote: > > > For using in special Toaster for JSON datatype compression dictionaries > > seem to be very valuable addition, but now I > > have to agree that this feature in current state is competing with > > Pluggable TOAST. > > But I don't understand this. > > Why does storing a compression dictionary in the catalog prevent that > dictionary from being used within the toaster? The point is not that compression dictionaries in the catalog are bad - I think it makes a lot of sense - but that the typecast-based usage of those dictionaries in user tables (like the UI provided by zson) effectively competes with the toaster: It tries to store the data in a more compressed manner than the toaster currently can because it has additional knowledge about the values being toasted. The main difference between casting and toasting however is that casting is fairly expensive, because it has a significantly higher memory overhead: both the fully decompressed and the compressed values are stored in memory at the same time at some point when you cast a value, while only the decompressed value is stored in full in memory when (de)toasting. And, considering that there is an open proposal for extending the toaster mechanism, I think that it is not specifically efficient to work with the relatively expensive typecast-based infrastructure if this dictionary compression can instead be added using the proposed extensible toasting mechanism at relatively low overhead. Kind regards, Matthias van de Meent
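For readers unfamiliar with zson, the kind of dictionary substitution being discussed looks roughly like this toy sketch; the dictionary contents and helper are hypothetical, not zson's or the patch's actual format:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical compression dictionary: frequent JSON keys learned from
 * existing data, replaced by their (small) index when a value is stored.
 */
static const char *dictionary[] = { "timestamp", "user_id", "payload" };
#define DICT_SIZE 3

/*
 * Return the dictionary code for a key, or -1 if it must be stored
 * verbatim. A decompressor holding the same dictionary reverses this.
 */
static int
dict_encode(const char *key)
{
    for (int i = 0; i < DICT_SIZE; i++)
        if (strcmp(dictionary[i], key) == 0)
            return i;
    return -1;
}
```

The "additional knowledge" mentioned above is exactly this shared dictionary; the open question in the thread is whether the substitution should happen in a typecast or inside the (extensible) toaster.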
RE: Perform streaming logical transactions by background workers and parallel apply
On Tuesday, July 26, 2022 5:34 PM Dilip Kumar wrote: > On Tue, Jul 26, 2022 at 2:30 PM Dilip Kumar wrote: > > > > On Fri, Jul 22, 2022 at 8:27 AM wangw.f...@fujitsu.com > > wrote: > > > > > > On Tues, Jul 19, 2022 at 10:29 AM I wrote: > > > > Attach the news patches. > > > > > > Not able to apply patches cleanly because the change in HEAD > (366283961a). > > > Therefore, I rebased the patch based on the changes in HEAD. > > > > > > Attach the new patches. > > > > +/* Check the foreign keys. */ > > +fkeys = RelationGetFKeyList(entry->localrel); > > +if (fkeys) > > +entry->parallel_apply = PARALLEL_APPLY_UNSAFE; > > > > So if there is a foreign key on any of the tables which are parts of a > > subscription then we do not allow changes for that subscription to be > > applied in parallel? I think this is a big limitation because having > > foreign key on the table is very normal right? I agree that if we > > allow them then there could be failure due to out of order apply > > right? but IMHO we should not put the restriction instead let it fail > > if there is ever such conflict. Because if there is a conflict the > > transaction will be sent again. Do we see that there could be wrong > > or inconsistent results if we allow such things to be executed in > > parallel. If not then IMHO just to avoid some corner case failure we > > are restricting very normal cases. > > some more comments.. > 1. > +/* > + * If we have found a free worker or if we are already > applying this > + * transaction in an apply background worker, then we > pass the data to > + * that worker. > + */ > +if (first_segment) > +apply_bgworker_send_data(stream_apply_worker, s->len, > + s->data); > > Comment says that if we have found a free worker or we are already applying in > the worker then pass the changes to the worker but actually as per the code > here we are only passing in case of first_segment? > > I think what you are trying to say is that if it is first segment then send > the > > 2. 
> +/* > + * This is the main apply worker. Check if there is any free apply > + * background worker we can use to process this transaction. > + */ > +if (first_segment) > +stream_apply_worker = apply_bgworker_start(stream_xid); > +else > +stream_apply_worker = apply_bgworker_find(stream_xid); > > So currently, whenever we get a new streamed transaction we try to start a new > background worker for that. Why do we need to start/close the background > apply worker every time we get a new streamed transaction. I mean we can > keep the worker in the pool for time being and if there is a new transaction > looking for a worker then we can find from that. Starting a worker is costly > operation and since we are using parallelism for this mean we are expecting > that there would be frequent streamed transaction needing parallel apply > worker so why not to let it wait for a certain amount of time so that if load > is low > it will anyway stop and if the load is high it will be reused for next > streamed > transaction. It seems the function name was a bit misleading. Currently, the started apply bgworker won't exit after applying the transaction. And apply_bgworker_start will first try to choose a free worker. It will start a new worker only if no free worker is available. > 3. > Why are we restricting parallel apply workers only for the streamed > transactions, because streaming depends upon the size of the logical decoding > work mem so making streaming and parallel apply tightly coupled seems too > restrictive to me. Do we see some obvious problems in applying other > transactions in parallel? We thought there could be some conflict failure and deadlock if we parallel apply normal transactions which need transaction dependency checks[1]. But I will do some more research for this and share the result soon. [1] https://www.postgresql.org/message-id/CAA4eK1%2BwyN6zpaHUkCLorEWNx75MG0xhMwcFhvjqm2KURZEAGw%40mail.gmail.com Best regards, Hou zj
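Hou's description of the reuse behaviour of apply_bgworker_start() can be sketched as a simple pool; the names and structure below are simplified illustrations, not the patch's code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 4

typedef struct
{
    int      started;           /* bgworker process exists */
    uint32_t xid;               /* streamed xact being applied; 0 = free */
} ApplyBgworker;

static ApplyBgworker pool[POOL_SIZE];

/*
 * Prefer an already-started free worker; start a new one only when no
 * free worker is available, since starting a worker is a costly
 * operation. Returns NULL when the whole pool is busy.
 */
static ApplyBgworker *
apply_bgworker_acquire(uint32_t xid)
{
    for (int i = 0; i < POOL_SIZE; i++)
        if (pool[i].started && pool[i].xid == 0)
        {
            pool[i].xid = xid;
            return &pool[i];
        }
    for (int i = 0; i < POOL_SIZE; i++)
        if (!pool[i].started)
        {
            pool[i].started = 1;    /* "start" a new worker */
            pool[i].xid = xid;
            return &pool[i];
        }
    return NULL;
}

/* Worker finished its transaction: keep the process, mark it free. */
static void
apply_bgworker_release(ApplyBgworker *w)
{
    w->xid = 0;
}
```

This matches the reply above: workers do not exit after a transaction, so the next streamed transaction reuses an idle process instead of paying the startup cost again.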
Re: making relfilenodes 56 bits
On Wed, Jul 27, 2022 at 1:24 PM Ashutosh Sharma wrote: > > Some more comments: Note: Please don't top post. > == > > Shouldn't we retry for the new relfilenumber if > "ShmemVariableCache->nextRelFileNumber > MAX_RELFILENUMBER". There can be > cases where some of the tables are dropped by the user and the relfilenumber of > those tables can be reused, for which we would need to find the relfilenumber > that can be reused. For e.g. consider below example: > > postgres=# create table t1(a int); > CREATE TABLE > > postgres=# create table t2(a int); > CREATE TABLE > > postgres=# create table t3(a int); > ERROR: relfilenumber is out of bound > > postgres=# drop table t1, t2; > DROP TABLE > > postgres=# checkpoint; > CHECKPOINT > > postgres=# vacuum; > VACUUM > > Now if I try to recreate table t3, it should succeed, shouldn't it? But it > doesn't because we simply error out by seeing the nextRelFileNumber saved in > the shared memory. > > postgres=# create table t1(a int); > ERROR: relfilenumber is out of bound > > I think, above should have worked. No, it should not, the whole point of this design is not to reuse a relfilenumber ever within a cluster's lifetime. You might want to read this mail[1]: by the time we use 2^56 relfilenumbers the cluster will anyway reach its lifetime by other factors. [1] https://www.postgresql.org/message-id/CA%2BhUKG%2BZrDms7gSjckme8YV2tzxgZ0KVfGcsjaFoKyzQX_f_Mw%40mail.gmail.com > == > > > > Note that while a table's filenode often matches its OID, this is > not necessarily the case; some operations, like > TRUNCATE, REINDEX, > CLUSTER and some forms > of ALTER TABLE, can change the filenode while preserving > the OID. > > I think this note needs some improvement in storage.sgml. It says the table's > relfilenode mostly matches its OID, but it doesn't. This will happen only in > case of system table and maybe never in case of user table. Yes, this should be changed.
> postgres=# create table t2(a int); > ERROR: relfilenumber is out of bound > > Since this is a user-visible error, I think it would be good to mention > relfilenode instead of relfilenumber. Elsewhere (including the user manual) > we refer to this as a relfilenode. No this is expected to be an internal error because in general during the cluster lifetime ideally, we should never reach this number. So we are putting this check so that it should not reach this number due to some other computational/programming mistake. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Perform streaming logical transactions by background workers and parallel apply
Here are some review comments for the patch v19-0004: == 1. doc/src/sgml/ref/create_subscription.sgml @@ -244,6 +244,11 @@ CREATE SUBSCRIPTION subscription_nameparallel mode is disregarded when retrying; + instead the transaction will be applied using on + mode. That last sentence starting with lowercase seems odd - that's why I thought saying "The parallel mode..." might be better. IMO "on mode" seems strange too. Hence my previous [1] (#4.3) suggestion for this SUGGESTION The parallel mode is disregarded when retrying; instead the transaction will be applied using streaming = on. == 2. src/backend/replication/logical/worker.c - start_table_sync @@ -3902,20 +3925,28 @@ start_table_sync(XLogRecPtr *origin_startpos, char **myslotname) } PG_CATCH(); { + /* + * Emit the error message, and recover from the error state to an idle + * state + */ + HOLD_INTERRUPTS(); + + EmitErrorReport(); + AbortOutOfAnyTransaction(); + FlushErrorState(); + + RESUME_INTERRUPTS(); + + /* Report the worker failed during table synchronization */ + pgstat_report_subscription_error(MySubscription->oid, false); + + /* Set the retry flag. */ + set_subscription_retry(true); + if (MySubscription->disableonerr) DisableSubscriptionAndExit(); - else - { - /* - * Report the worker failed during table synchronization. Abort - * the current transaction so that the stats message is sent in an - * idle state. - */ - AbortOutOfAnyTransaction(); - pgstat_report_subscription_error(MySubscription->oid, false); - PG_RE_THROW(); - } + proc_exit(0); } But is it correct to set the 'retry' flag even if the MySubscription->disableonerr is true? Won’t that mean even after the user corrects the problem and then re-enabled the subscription it still won't let the streaming=parallel work, because that retry flag is set? 
Also, Something seems wrong to me here - IIUC the patch changed this code because of the potential risk of an error within the set_subscription_retry function, but now if such an error happens the current code will bypass even getting to DisableSubscriptionAndExit, so the subscription won't have a chance to get disabled as the user might have wanted. ~~~ 3. src/backend/replication/logical/worker.c - start_apply @@ -3940,20 +3971,27 @@ start_apply(XLogRecPtr origin_startpos) } PG_CATCH(); { + /* + * Emit the error message, and recover from the error state to an idle + * state + */ + HOLD_INTERRUPTS(); + + EmitErrorReport(); + AbortOutOfAnyTransaction(); + FlushErrorState(); + + RESUME_INTERRUPTS(); + + /* Report the worker failed while applying changes */ + pgstat_report_subscription_error(MySubscription->oid, + !am_tablesync_worker()); + + /* Set the retry flag. */ + set_subscription_retry(true); + if (MySubscription->disableonerr) DisableSubscriptionAndExit(); - else - { - /* - * Report the worker failed while applying changes. Abort the - * current transaction so that the stats message is sent in an - * idle state. - */ - AbortOutOfAnyTransaction(); - pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker()); - - PG_RE_THROW(); - } } (Same as previous review comment #2) But is it correct to set the 'retry' flag even if the MySubscription->disableonerr is true? Won’t that mean even after the user corrects the problem and then re-enabled the subscription it still won't let the streaming=parallel work, because that retry flag is set? Also, Something seems wrong to me here - IIUC the patch changed this code because of the potential risk of an error within the set_subscription_retry function, but now if such an error happens the current code will bypass even getting to DisableSubscriptionAndExit, so the subscription won't have a chance to get disabled as the user might have wanted. ~~~ 4. 
src/backend/replication/logical/worker.c - DisableSubscriptionAndExit /* - * After error recovery, disable the subscription in a new transaction - * and exit cleanly. + * Disable the subscription in a new transaction. */ static void DisableSubscriptionAndExit(void) { - /* - * Emit the error message, and recover from the error state to an idle - * state - */ - HOLD_INTERRUPTS(); - - EmitErrorReport(); - AbortOutOfAnyTransaction(); - FlushErrorState(); - - RESUME_INTERRUPTS(); - - /* Report the worker failed during either table synchronization or apply */ - pgstat_report_subscription_error(MyLogicalRepWorker->subid, - !am_tablesync_worker()); - /* Disable the subscription */ StartTransactionCommand(); DisableSubscription(MySubscription->oid); @@ -4231,8 +4252,6 @@ DisableSubscriptionAndExit(void) ereport(LOG, errmsg("logical replication subscription \"%s\" has been disabled due to an error", MySubscription->name)); - - proc_exit(0); } 4a. Hmm, I think it is a bad idea to remove the "exiting" code from the function but still leave the function name the same as before saying "AndExit". 4b. Also, now the patch is unconditionally doing
Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size
On Fri, Jul 22, 2022 at 2:49 PM Amit Langote wrote: > On Fri, Jul 22, 2022 at 1:13 PM David Rowley wrote: > > BTW, I was working on code inside llvm_compile_expr() a few days ago > > and I thought I'd gotten the new evaluation steps I was adding correct > > as it worked fine with jit_above_cost=0, but on further testing, it > > crashed with jit_inline_above_cost=0. Might be worth doing both to see > > if everything works as intended. > > Thanks for the pointer. > > So I didn't see things going bust on re-testing with all > jit_*_above_cost parameters set to 0, so maybe the > llvm_compile_expression() additions are alright. Here's an updated version of the patch, with mostly cosmetic changes. In particular, I added comments describing the new llvm_compile_expr() blobs. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com v5-0001-Some-improvements-to-JsonExpr-evaluation.patch Description: Binary data
RE: Perform streaming logical transactions by background workers and parallel apply
On Wednesday, July 27, 2022 1:29 PM Dilip Kumar wrote: > > On Wed, Jul 27, 2022 at 10:06 AM Amit Kapila > wrote: > > > > On Tue, Jul 26, 2022 at 2:30 PM Dilip Kumar wrote: > > > > > > On Fri, Jul 22, 2022 at 8:27 AM wangw.f...@fujitsu.com > > > wrote: > > > > > > > > On Tues, Jul 19, 2022 at 10:29 AM I wrote: > > > > > Attach the news patches. > > > > > > > > Not able to apply patches cleanly because the change in HEAD > (366283961a). > > > > Therefore, I rebased the patch based on the changes in HEAD. > > > > > > > > Attach the new patches. > > > > > > +/* Check the foreign keys. */ > > > +fkeys = RelationGetFKeyList(entry->localrel); > > > +if (fkeys) > > > +entry->parallel_apply = PARALLEL_APPLY_UNSAFE; > > > > > > So if there is a foreign key on any of the tables which are parts of > > > a subscription then we do not allow changes for that subscription to > > > be applied in parallel? > > > > > > > I think the above check will just prevent the parallelism for a xact > > operating on the corresponding relation not the relations of the > > entire subscription. Your statement sounds like you are saying that it > > will prevent parallelism for all the other tables in the subscription > > which has a table with FK. > > Okay, got it. I thought we are disallowing parallelism for the entire > subscription. > > > > I think this is a big limitation because having foreign key on the > > > table is very normal right? I agree that if we allow them then > > > there could be failure due to out of order apply right? > > > > > > > What kind of failure do you have in mind and how it can occur? The one > > way it can fail is if the publisher doesn't have a corresponding > > foreign key on the table because then the publisher could have allowed > > an insert into a table (insert into FK table without having the > > corresponding key in PK table) which may not be allowed on the > > subscriber. 
However, I don't see any check that could prevent this > > because for this we need to compare the FK list for a table from the > > publisher with the corresponding one on the subscriber. I am not > > really sure if, due to the risk of such conflicts, we should block > > parallelism of transactions operating on tables with FKs, because those > > conflicts can occur even without parallelism; it is just a matter of > > timing. But I could be missing something due to which the above check > > can be useful? > > Actually, my question starts with this check[1][2]; from this it > appears that if the relation has a foreign key then we are > marking it parallel unsafe[2], and later in [1], while the worker is > applying changes for that relation, if it was marked parallel > unsafe then we are throwing an error. So my question was why we are > putting in this restriction. Although the error only talks about > unique columns and non-immutable functions, it is also thrown if the > target table has a foreign key. So do we really need to restrict this? I mean, why are we restricting this case? > Hi, I think the foreign key check is used to prevent the apply worker from waiting indefinitely, which can be caused by a foreign key difference between publisher and subscriber, like the following example:
-
Publisher:

-- both tables are published
CREATE TABLE PKTABLE ( ptest1 int);
CREATE TABLE FKTABLE ( ftest1 int);
-- initial data
INSERT INTO PKTABLE VALUES(1);

Subscriber:

CREATE TABLE PKTABLE ( ptest1 int PRIMARY KEY);
CREATE TABLE FKTABLE ( ftest1 int REFERENCES PKTABLE);

-- Execute the following transactions on publisher

Tx1:
INSERT ... -- make enough changes to start streaming mode
DELETE FROM PKTABLE;

Tx2:
INSERT INTO FKTABLE VALUES(1);
COMMIT;

COMMIT;
-
The subscriber's apply worker will wait indefinitely, because the main apply worker is waiting for the streaming transaction to finish, which is in another apply bgworker.
BTW, I think the foreign key won't take effect in the subscriber's apply worker by default. That is because we set session_replication_role to 'replica' in the apply worker, which prevents the FK trigger function from being executed (only triggers with the FIRES_ON_REPLICA flag are executed in this mode). A user has to explicitly alter the trigger to enable it in replica mode to make the foreign key work. So, ISTM, we won't hit this ERROR frequently. And based on this, another comment about the patch is that it seems unnecessary to directly check the FK list returned by RelationGetFKeyList; checking the actual FK trigger function seems enough. Best regards, Hou zj
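To illustrate the point about session_replication_role (a hedged sketch; the trigger name below is hypothetical, since the internal RI trigger names are assigned by the system and must be looked up in pg_trigger):

```sql
-- The apply worker runs with this setting internally; shown here only
-- to illustrate why ordinary triggers - including the internal
-- RI_ConstraintTrigger_* triggers that implement foreign keys - do not
-- fire during apply:
SET session_replication_role = 'replica';

-- To make a foreign key checked during apply, its trigger must be
-- switched to fire in replica mode (name is hypothetical):
ALTER TABLE fktable ENABLE REPLICA TRIGGER "RI_ConstraintTrigger_c_16400";
```

Only after such an ALTER TABLE would the indefinite-wait scenario described earlier become reachable, which supports the argument that checking the trigger's fire mode is more precise than checking RelationGetFKeyList().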
Re: making relfilenodes 56 bits
Some more comments:

==

Shouldn't we retry for a new relfilenumber if "ShmemVariableCache->nextRelFileNumber > MAX_RELFILENUMBER"? There can be cases where some of the tables have been dropped by the user and the relfilenumbers of those tables can be reused, for which we would need to find a relfilenumber that can be reused. For example, consider the scenario below:

postgres=# create table t1(a int);
CREATE TABLE
postgres=# create table t2(a int);
CREATE TABLE
postgres=# create table t3(a int);
ERROR: relfilenumber is out of bound
postgres=# drop table t1, t2;
DROP TABLE
postgres=# checkpoint;
CHECKPOINT
postgres=# vacuum;
VACUUM

Now if I try to recreate table t3, it should succeed, shouldn't it? But it doesn't, because we simply error out after seeing the nextRelFileNumber saved in shared memory.

postgres=# create table t1(a int);
ERROR: relfilenumber is out of bound

I think the above should have worked.

==

Note that while a table's filenode often matches its OID, this is not necessarily the case; some operations, like TRUNCATE, REINDEX, CLUSTER and some forms of ALTER TABLE, can change the filenode while preserving the OID.

I think this note in storage.sgml needs some improvement. It says the table's relfilenode mostly matches its OID, but it doesn't: that will hold only for system tables, and maybe never for user tables.

==

postgres=# create table t2(a int);
ERROR: relfilenumber is out of bound

Since this is a user-visible error, I think it would be good to mention relfilenode instead of relfilenumber. Elsewhere (including the user manual) we refer to this as a relfilenode. -- With Regards, Ashutosh Sharma. On Tue, Jul 26, 2022 at 10:36 PM Ashutosh Sharma wrote: > Thanks Dilip. Here are a few comments that I could find upon quickly > reviewing the v11 patch: > > /* > + * Similar to the XLogPutNextOid but instead of writing NEXTOID log > record it > + * writes a NEXT_RELFILENUMBER log record.
If '*prevrecptr' is a valid > + * XLogRecPtrthen flush the wal upto this record pointer otherwise flush > upto > > XLogRecPtrthen -> XLogRecPtr then > > == > > + switch (relpersistence) > + { > + case RELPERSISTENCE_TEMP: > + backend = BackendIdForTempRelations(); > + break; > + case RELPERSISTENCE_UNLOGGED: > + case RELPERSISTENCE_PERMANENT: > + backend = InvalidBackendId; > + break; > + default: > + elog(ERROR, "invalid relpersistence: %c", relpersistence); > + return InvalidRelFileNumber;/* placate compiler */ > + } > > > I think the above check should be added at the beginning of the function > for the reason that if we come to the default switch case we won't be > acquiring the lwlock and do other stuff to get a new relfilenumber. > > == > > - newrelfilenumber = GetNewRelFileNumber(newTableSpace, NULL, > +* Generate a new relfilenumber. We cannot reuse the old relfilenumber > +* because of the possibility that that relation will be moved back to > the > > that that relation -> that relation. > > == > > + * option_parse_relfilenumber > + * > + * Parse relfilenumber value for an option. If the parsing is successful, > + * returns; if parsing fails, returns false. > + */ > > If parsing is successful, returns true; > > -- > With Regards, > Ashutosh Sharma. > > On Tue, Jul 26, 2022 at 7:33 PM Dilip Kumar wrote: > >> On Tue, Jul 26, 2022 at 6:06 PM Ashutosh Sharma >> wrote: >> >> Hi, >> Note: please avoid top posting. >> >> > /* >> > * If relfilenumber is unspecified by the caller then create >> storage >> > -* with oid same as relid. >> > +* with relfilenumber same as relid if it is a system table >> otherwise >> > +* allocate a new relfilenumber. For more details read >> comments atop >> > +* FirstNormalRelFileNumber declaration. >> > */ >> > if (!RelFileNumberIsValid(relfilenumber)) >> > - relfilenumber = relid; >> > + { >> > + relfilenumber = relid < FirstNormalObjectId ? 
>> > + relid : GetNewRelFileNumber(); >> > >> > Above code says that in the case of system table we want relfilenode to >> be the same as object id. This technically means that the relfilenode or >> oid for the system tables would not be exceeding 16383. However in the >> below lines of code added in the patch, it says there is some chance for >> the storage path of the user tables from the old cluster conflicting with >> the storage path of the system tables in the new cluster. Assuming that the >> OIDs for the user tables on the old cluster would start with 16384 (the >> first object ID), I see no reason why there would be a conflict. >> >> >> Basically, the above comment says that the initial system table >> storage will be created with the same relfilenumber as Oid so you are >> right that will not exceed 16383. And below code is explaining the >> reason that
Re: [PATCH] Compression dictionaries for JSONB
On Sun, 17 Jul 2022 at 19:15, Nikita Malakhov wrote: > we suggest that as an improvement compression should be put inside the > Toaster as an option, > thus the Toaster could have maximum benefits from knowledge of data internal > structure (and in future use JSON Schema). Very much agreed. > For using in special Toaster for JSON datatype compression dictionaries seem > to be very valuable addition, but now I > have to agree that this feature in current state is competing with Pluggable > TOAST. But I don't understand this. Why does storing a compression dictionary in the catalog prevent that dictionary from being used within the toaster? --
Simon Riggs
http://www.EnterpriseDB.com/
Re: Slow standby snapshot
At Tue, 26 Jul 2022 23:09:16 +0500, Andrey Borodin wrote in > > > > On 20 Jul 2022, at 02:12, Michail Nikolaev > > wrote: > > > >> I've looked into v5. > > Thanks! > > > > The patch is updated according to your remarks. > > The patch seems Ready for Committer from my POV.

+ * is updated during taking the snapshot. The KnownAssignedXidsNextOffset
+ * contains not an offset to the next valid xid but a number which tends to
+ * the offset to the next valid xid. KnownAssignedXidsNextOffset[] values read

Is there still a reason why the array stores the distance to the next valid element instead of the index of the next valid element? It seems to me that the intention was to reduce the size of the offset array, but it is int32[], which is far wider than TOTAL_MAX_CACHED_SUBXIDS. It seems to me that storing the index itself is simpler, and maybe faster by saving the cycles needed to perform the addition. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Cygwin cleanup
On Tue, Jul 26, 2022 at 04:24:25PM +1200, Thomas Munro wrote: > 3. You can't really run PostgreSQL on Cygwin for real, because its > implementation of signals does not have reliable signal masking, so > unsubtle and probably also subtle breakage occurs. That was reported > upstream by Noah years ago, but they aren't working on a fix. > lorikeet shows random failures, and presumably any CI system will do > the same... Reference: https://www.postgresql.org/message-id/20170321034703.GB2097809%40tornado.leadboat.com On my 2nd try: https://cirrus-ci.com/task/5311911574110208 TRAP: FailedAssertion("mq->mq_sender == NULL", File: "shm_mq.c", Line: 230, PID: 16370) 2022-07-26 06:32:35.525 PDT [15538][postmaster] LOG: background worker "parallel worker" (PID 16370) was terminated by signal 6: Aborted > XXX Doesn't get all the way through yet... Mainly because getopt was causing all TAP tests to fail. I tried to fix that in configure, but ended up changing the callers. This is getting close, but I don't think it has actually managed to pass all tests yet: https://cirrus-ci.com/task/5274721116749824 > 4. When building with Cygwin GCC 11.3 you get a bunch of warnings > that don't show up on other platforms, seemingly indicating that it > interprets -Wimplicit-fallthrough=3 differently. Huh? Evidently due to the same getopt issues. > XXX This should use a canned Docker image with all the right packages > installed Has anyone tried using non-canned images? It sounds like this could reduce the 4 min startup time for Windows. https://cirrus-ci.org/guide/docker-builder-vm/#dockerfile-as-a-ci-environment > XXX configure is sooo slooow, can we cache it?! Compiling is also > insanely slow, but ccache gets it down to a couple of minutes if you're > lucky One reason compiling was slow is that you ended up with -O2. You can cache configure as long as you're willing to re-run it whenever options change. That also applies to the existing headerscheck.
> XXX I don't know how to put variables like BUILD_JOBS into the scripts WDYM? If it's outside of bash and in the Windows shell it's like %var%, right? https://cirrus-ci.org/guide/writing-tasks/#environment-variables I just noticed that cirrus is misbehaving: if there's a variable called CI (which there is), then it expands $CI_FOO like ${CI}_FOO rather than ${CI_FOO}. I've also seen weirdness when variable names or operators appear in the commit message... > XXX Needs some --with-X options Done > XXX We would never want this to run by default in CI, but it'd be nice > to be able to ask for it with ci-os-only! (See commented out line) > only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*' Doesn't this already do what's needed? As long as it doesn't also check CHANGE_MESSAGE !~ 'ci-os-only', the task will run only on request. > XXX I have no idea if crash dump works, and if this should share > elements with the msys work in commitfest #3575 Based on the crash above, it wasn't working. And after some changes ... it still doesn't work. windows_os is probably skipping too many things. -- Justin

>From 174cc603ff951b86794f57fcc8c7d326d948e006 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 25 Jul 2022 23:05:10 +1200
Subject: [PATCH] WIP CI support for Cygwin.
ci-os-only: cygwin, xindows, xinux

https://cirrus-ci.com/task/5145086722834432

XXX This should use a canned Docker image with all the right packages
installed
XXX I have no idea if crash dump works, and if this should share
elements with the msys work in commitfest #3575
---
 .cirrus.yml                                   | 59 +++
 configure                                     | 18 +++---
 configure.ac                                  |  6 +-
 src/interfaces/libpq/t/001_uri.pl             | 11 +++-
 src/interfaces/libpq/t/002_api.pl             | 13 +++-
 .../libpq_pipeline/t/001_libpq_pipeline.pl    | 14 -
 src/test/perl/PostgreSQL/Test/Cluster.pm      |  2 +-
 src/test/perl/PostgreSQL/Test/Utils.pm        | 12 +++-
 src/tools/ci/cores_backtrace.sh               |  6 +-
 9 files changed, 120 insertions(+), 21 deletions(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index f23d6cae552..0389b7758d1 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -456,6 +456,65 @@ task:
     path: "crashlog-*.txt"
     type: text/plain

+task:
+  name: Windows - Cygwin
+
+  env:
+    CPUS: 4
+    BUILD_JOBS: 4
+    TEST_JOBS: 3
+    CCACHE_DIR: /tmp/ccache
+    CCACHE_LOGFILE: ccache.log
+    CONFIGURE_FLAGS: --enable-cassert --enable-debug --enable-tap-tests --with-ldap --with-ssl=openssl --with-gssapi
+    CONFIGURE_CACHE: /tmp/ccache/configure.cache
+    PG_TEST_USE_UNIX_SOCKETS: 1
+
+  only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*'
+
+  windows_container:
+    image: cirrusci/windowsservercore:2019
+    os_version: 2019
+    cpu: $CPUS
+    memory: 4G
+
+  ccache_cache:
+    folder: C:\tools\cygwin\tmp\ccache
+
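The `only_if` gating in the patch above can be sanity-checked locally. This is a hedged sketch: Cirrus applies its own regex engine to the whole commit message, whereas the grep below is a line-anchored approximation of the same `ci-os-only:[^\n]*cygwin` idea; the commit messages are invented for illustration:

```shell
#!/bin/sh
# Emulate the only_if check: a task gated on 'ci-os-only:...cygwin'
# should run only when the commit message asks for it.

msg_with='Fix frobnication

ci-os-only: cygwin'

msg_without='Fix frobnication'

# Returns success when the message contains a matching ci-os-only line.
matches() {
    printf '%s' "$1" | grep -q '^ci-os-only:.*cygwin'
}

matches "$msg_with" && echo "cygwin task runs"
matches "$msg_without" || echo "cygwin task skipped"
```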
Re: [Refactor]Avoid to handle FORCE_NOT_NULL/FORCE_NULL options when COPY TO
On Wed, Jul 27, 2022 at 12:55 PM Kyotaro Horiguchi wrote: > ProcessCopyOptions previously rejects force_quote_all for COPY FROM > and copyfrom.c is not even conscious of the option (that is, there is even no > assertion on it). The two options are rejected for COPY TO by the same > function, so it seems like a thinko of the commit. Getting rid of the > code would be good from the standpoint of code coverage and maintenance. Yeah, ProcessCopyOptions() does check, for force_notnull and force_null, that COPY FROM is being used and that CSV mode is in effect. So the code in copyto.c that processes force_notnull/force_null is actually dead code. > On the other hand I wonder if it is good that we have assertions on the > option values. Agreed. Assertions would be better. Thanks Richard
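For reference, a sketch of the behavior being discussed (table and file names are illustrative, and the error text is approximate): because ProcessCopyOptions() rejects these options up front for COPY TO, any handling of them in copyto.c is unreachable.

```sql
CREATE TABLE tbl (col1 text);

-- Accepted: FORCE_NOT_NULL applies when reading CSV input.
COPY tbl FROM '/tmp/tbl.csv' WITH (FORMAT csv, FORCE_NOT_NULL (col1));

-- Rejected by ProcessCopyOptions() before copyto.c is ever reached
-- (error text approximate):
-- ERROR:  COPY force not null only available using COPY FROM
COPY tbl TO '/tmp/tbl.csv' WITH (FORMAT csv, FORCE_NOT_NULL (col1));
```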
Re: [PoC] Reducing planning time when tables have many partitions
Dear David, Thank you for sharing your new idea. I agree that introducing a Bitmapset field may solve this problem. I will try this approach in addition to previous ones. Thank you again for helping me. -- Best regards, Yuya Watari
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On 2022-Jul-22, Kyotaro Horiguchi wrote: > At Fri, 22 Jul 2022 10:18:58 +0200, Alvaro Herrera > wrote in > > Which commits would we consider? > > > > 7170f2159fb2 Allow "in place" tablespaces. > > f6f0db4d6240 Fix pg_tablespace_location() with in-place tablespaces > > The second one is just to make the function work with in-place > tablespaces. Without it the function yields the following error: > > > ERROR: could not read symbolic link "pg_tblspc/16407": Invalid argument > > This looks odd, but I think there is no need for back-patching because > no actual user of the feature is seen in our test suite. > If we have a test that needs the feature in the future, it would be enough > to back-patch it then. Actually, I found that the new test added by the fix in this thread does depend on this being fixed, so I included an even larger set, which I think makes this more complete:

7170f2159fb2 Allow "in place" tablespaces.
c6f2f01611d4 Fix pg_basebackup with in-place tablespaces.
f6f0db4d6240 Fix pg_tablespace_location() with in-place tablespaces
7a7cd84893e0 doc: Remove mention to in-place tablespaces for pg_tablespace_location()
5344723755bd Remove unnecessary Windows-specific basebackup code.

I didn't include any of the test changes for now. I don't intend to do so unless we see another reason for that; I think the new tests that are going to be added by the recovery bugfix should provide sufficient coverage. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "La fuerza no está en los medios físicos sino que reside en una voluntad indomable" ("Strength does not lie in physical means but resides in an indomitable will") (Gandhi)