Re: Handle infinite recursion in logical replication setup
Hi Vignesh. FYI the v39* patch fails to apply [1]. Can you please rebase it?

=== Applying patches on top of PostgreSQL commit ID 5f858dd3bebd1f3845aef2bff7f4345bfb7b74b3 ===
=== applying patch ./v39-0001-Check-and-throw-an-error-if-publication-tables-w.patch
patching file doc/src/sgml/ref/alter_subscription.sgml
patching file doc/src/sgml/ref/create_subscription.sgml
patching file src/backend/commands/subscriptioncmds.c
Hunk #10 FAILED at 886.
1 out of 14 hunks FAILED -- saving rejects to file src/backend/commands/subscriptioncmds.c.rej
patching file src/test/regress/expected/subscription.out
patching file src/test/regress/sql/subscription.sql
patching file src/test/subscription/t/030_origin.pl
patching file src/tools/pgindent/typedefs.list

[1] http://cfbot.cputube.org/patch_38_3610.log

Kind Regards,
Peter Smith.
Fujitsu Australia
Re: Documentation about PL transforms
On Thu, Jul 28, 2022 at 4:06, the sender wrote:
> On 2022-07-13 00:26, Pavel Stehule wrote:
>
> > 1. I am not sure if get_call_trftypes is a good name - the prefix get_call
> > is used when some runtime data is processed.
>
> I guess I hadn't caught on that the prefix carried that meaning.
> To me, it appeared to be a prefix used to avoid being specific to
> 'function' or 'procedure'.

I am sure this function has been in Postgres significantly longer than Postgres has had support for 'procedures'.

> > This function just returns
> > reformatted data from the system catalogue. Maybe get_func_trftypes_list,
> > or just replace function get_func_trftypes (now, the list is an array, so
> > there should not be performance issues). For consistency, the new function
> > should be used in plperl and plpython too. Probably this function is not
>
> If it is acceptable to replace get_func_trftypes like that, I can
> produce such a patch.
>
> > 2.
> >
> > +It also contains the OID of the intended procedural language and whether
> > +that procedural language is declared as TRUSTED, useful
> > +if a single inline handler is supporting more than one procedural language.
> >
> > I am not sure if this sentence is in the correct place. Maybe it can be
> > mentioned separately, so generally handlers can be used by more than one
> > procedural language. But maybe I don't understand this sentence.
>
> My point was that, if the structure did /not/ include the OID of
> the PL and its TRUSTED property, then it would not be possible
> for a single inline handler to support more than one PL. So that
> is why it is a good thing that those are included in the structure,
> and why it would be a bad thing if they were not.
>
> Would it be clearer to say:
>
> It also contains the OID of the intended procedural language and whether
> that procedural language is declared as TRUSTED.
> While these values are redundant if the inline handler is serving only
> a single procedural language, they are necessary to allow one inline
> handler to serve more than one PL.

ok

Regards

Pavel

> Regards,
> -Chap
RE: Perform streaming logical transactions by background workers and parallel apply
Dear Wang,

I found further comments about the test code.

11. src/test/regress/sql/subscription.sql

```
-- fail - streaming must be boolean
CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = foo);
```

The comment is no longer correct: it should be "streaming must be boolean or 'parallel'".

12. src/test/regress/sql/subscription.sql

```
-- now it works
CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
```

I think we should also test the case of streaming = 'parallel'.

13. 015_stream.pl

I could not find a test for TRUNCATE. IIUC the apply bgworker works well even if it gets a LOGICAL_REP_MSG_TRUNCATE message from the main worker. Can you add that case?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
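For illustration only, here is a minimal C sketch of the kind of option handling being discussed: a "streaming" value accepted either as a boolean or as the special word 'parallel'. The enum and function names are hypothetical stand-ins, not PostgreSQL's actual subscription-option code.

```c
#include <strings.h>

/* Hypothetical streaming modes; illustrative only. */
typedef enum
{
	STREAMING_OFF,
	STREAMING_ON,
	STREAMING_PARALLEL,
	STREAMING_INVALID
} StreamingMode;

/*
 * Accept the usual boolean spellings plus the special value 'parallel';
 * anything else (e.g. streaming = foo) is rejected by the caller.
 */
StreamingMode
parse_streaming_option(const char *value)
{
	if (strcasecmp(value, "true") == 0 || strcasecmp(value, "on") == 0)
		return STREAMING_ON;
	if (strcasecmp(value, "false") == 0 || strcasecmp(value, "off") == 0)
		return STREAMING_OFF;
	if (strcasecmp(value, "parallel") == 0)
		return STREAMING_PARALLEL;
	return STREAMING_INVALID;
}
```

This mirrors why the old regression-test comment "streaming must be boolean" is stale: the value space is now boolean-or-'parallel'.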
Re: Official Windows Installer and Documentation
On Wed, Jul 27, 2022 at 8:22 PM Tom Lane wrote:
> I wrote:
> > If EDB isn't adequately filling in the documentation for the behavior
> > of their packaging, that's on them.
>
> Having now looked more closely at the pg_upgrade documentation,
> I don't think this is exactly EDB's fault; it's text that should
> never have been there to begin with. ISTM we need to simply rip out
> lines 431..448 of pgupgrade.sgml, that is all the Windows-specific
> text starting with
>
> For Windows users, you must be logged into an administrative account, and
>
> That has got nothing to recommend it: we do not generally provide
> platform-specific details in these man pages, and to the extent it
> provides details, those details are likely to be wrong.

I mean, we do provide platform-specific details/examples; it's just that the platform is a source-installed Linux platform (though path-less).

Does the avoidance of dealing with other platforms also apply to NET STOP, or do you find that an acceptable variance? Or are you suggesting that basically all O/S commands should be zapped? If not, then rewriting lines 442 to 446 to just be the command seems worthwhile.

I'd say pg_upgrade warrants an examples section like pg_basebackup has (though obviously pg_upgrade is procedural).

I do have another observation:

https://github.com/postgres/postgres/blob/4fc6b6eefcf98f79211bb790ee890ebcb05c178d/src/bin/pg_upgrade/check.c#L665

    if (PQntuples(res) != 1 ||
        atooid(PQgetvalue(res, 0, 1)) != BOOTSTRAP_SUPERUSERID)
        pg_fatal("database user \"%s\" is not the install user",
                 os_info.user);

Any reason not to inform the DBA of the name of the install user here? Sure, it is almost certainly postgres, but it also seems like an easy win for them, and anyone they may ask for help, to know exactly the name of the install user in the clusters should that end up being the issue.
Additionally, from what I can tell, if that check fails (or any of the checks, really) it is not possible to tell whether the check was being performed against the old or the new server. The user does not know that checks against the old server are performed first and then checks against the new one, and there are no banners saying "checking old/new".

David J.
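As a sketch of the suggestion above, the failing check could report both the user pg_upgrade connected as and the expected install user, so the DBA can see at a glance which name is wrong. This is a hypothetical helper using snprintf into a message buffer, not the actual pg_fatal/check.c code.

```c
#include <stdio.h>

/*
 * Hypothetical sketch: format the install-user error so that it names
 * both the connected user and the expected install user.  In real
 * pg_upgrade this would be a pg_fatal() call; here we just build the
 * message string for illustration.
 */
void
format_install_user_error(char *buf, size_t buflen,
						  const char *connected_user,
						  const char *install_user)
{
	snprintf(buf, buflen,
			 "database user \"%s\" is not the install user (\"%s\")",
			 connected_user, install_user);
}
```

The extra name costs nothing to include and removes one round of guesswork when the check trips.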
Re: Improve description of XLOG_RUNNING_XACTS
Thanks Masahiko for the updated patch. It looks good to me.

I wonder whether the logic should be, similar to ProcArrayApplyRecoveryInfo():

    if (xlrec->subxid_overflow)
        ...
    else if (xlrec->subxcnt > 0)
        ...

But you may ignore it.

--
Best Wishes,
Ashutosh

On Thu, Jul 28, 2022 at 7:41 AM Masahiko Sawada wrote:
> On Thu, Jul 21, 2022 at 10:13 PM Ashutosh Bapat wrote:
> >
> > Hi
> >
> > On Thu, Jul 21, 2022 at 6:44 AM Masahiko Sawada wrote:
> > >
> > > Hi,
> > >
> > > I realized that standby_desc_running_xacts() in standbydesc.c doesn't
> > > describe subtransaction XIDs. I've attached the patch to improve the
> > > description. Here is an example by pg_waldump:
> > >
> > > * HEAD
> > > rmgr: Standby len (rec/tot): 58/58, tx: 0, lsn:
> > > 0/01D0C608, prev 0/01D0C5D8, desc: RUNNING_XACTS nextXid 1050
> > > latestCompletedXid 1047 oldestRunningXid 1048; 1 xacts: 1048
> > >
> > > * w/ patch
> > > rmgr: Standby len (rec/tot): 58/58, tx: 0, lsn:
> > > 0/01D0C608, prev 0/01D0C5D8, desc: RUNNING_XACTS nextXid 1050
> > > latestCompletedXid 1047 oldestRunningXid 1048; 1 xacts: 1048; 1
> > > subxacts: 1049
> > >
> >
> > I think this is a good addition to debugging info. +1
> >
> > If we are going to add 64 subxid numbers then it would help if we
> > could be more verbose and print "subxid overflowed" instead of "subxid
> > ovf".
>
> Yeah, it looks better so agreed. I've attached an updated patch.
>
> Regards,
>
> --
> Masahiko Sawada
> EDB: https://www.enterprisedb.com/
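To make the branching being proposed concrete, here is a simplified C sketch of a description routine that prints the running xacts and then either the subxacts or a "subxid overflowed" marker, in the style suggested above. The struct and function names are illustrative stand-ins, not the actual standbydesc.c code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* Simplified stand-in for xl_running_xacts: xcnt top-level XIDs followed
 * by subxcnt subtransaction XIDs in the same array. */
typedef struct
{
	int            xcnt;
	int            subxcnt;
	bool           subxid_overflow;
	TransactionId *xids;
} RunningXactsRec;

/*
 * Build a description like "1 xacts: 1048; 1 subxacts: 1049", handling
 * the overflow case first, as in ProcArrayApplyRecoveryInfo().
 * Assumes buf is large enough for the record being described.
 */
void
describe_running_xacts(char *buf, size_t buflen, const RunningXactsRec *rec)
{
	int			off = snprintf(buf, buflen, "%d xacts:", rec->xcnt);

	for (int i = 0; i < rec->xcnt; i++)
		off += snprintf(buf + off, buflen - off, " %u", rec->xids[i]);

	if (rec->subxid_overflow)
		off += snprintf(buf + off, buflen - off, "; subxid overflowed");
	else if (rec->subxcnt > 0)
	{
		off += snprintf(buf + off, buflen - off, "; %d subxacts:",
						rec->subxcnt);
		for (int i = 0; i < rec->subxcnt; i++)
			off += snprintf(buf + off, buflen - off, " %u",
							rec->xids[rec->xcnt + i]);
	}
}
```

Checking `subxid_overflow` before `subxcnt > 0` keeps the two conditions mutually exclusive in the output, which is the ordering the message above is asking about.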
Re: [Patch] ALTER SYSTEM READ ONLY
Hi,

On Thu, Jul 28, 2022 at 4:05 AM Jacob Champion wrote:
>
> On Fri, Apr 8, 2022 at 7:27 AM Amul Sul wrote:
> > Attached is a rebased version for the latest master head (#891624f0ec).
>
> Hi Amul,
>
> I'm going through past CF triage emails today; I noticed that this
> patch dropped out of the commitfest when you withdrew it in January,
> but it hasn't been added back with the most recent patchset you
> posted. Was that intended, or did you want to re-register it for
> review?

Yes, there is a plan to re-register it again, but not anytime soon, once we start to rework the design.

Regards,
Amul
Re: Official Windows Installer and Documentation
I wrote:
> If EDB isn't adequately filling in the documentation for the behavior
> of their packaging, that's on them.

Having now looked more closely at the pg_upgrade documentation, I don't think this is exactly EDB's fault; it's text that should never have been there to begin with. ISTM we need to simply rip out lines 431..448 of pgupgrade.sgml, that is all the Windows-specific text starting with

    For Windows users, you must be logged into an administrative account, and

That has got nothing to recommend it: we do not generally provide platform-specific details in these man pages, and to the extent it provides details, those details are likely to be wrong. We need look no further than the references to "9.6" to establish that. Yeah, it says "e.g.", but novices will probably fail to understand which parts of the example are suitable to copy verbatim and which aren't. Meanwhile non-novices don't need the example to begin with.

On top of which, the whole para has been inserted into non-platform-specific text, seemingly with the aid of a dartboard, because it doesn't particularly connect to what's before or after it.

regards, tom lane
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
On Thu, Jul 28, 2022 at 7:18 AM Masahiko Sawada wrote:
> On Wed, Jul 27, 2022 at 8:33 PM Amit Kapila wrote:
> >
> > I have changed accordingly in the attached
> > and apart from that slightly modified the comments and commit message.
> > Do let me know what you think of the attached?
>
> It would be better to remember the initial running xacts after
> SnapBuildRestore() returns true? Because otherwise, we could end up
> allocating InitialRunningXacts multiple times while leaking the old
> ones if there are no serialized snapshots that we are interested in.
>

Right, this makes sense. But note that you can then no longer have the check (builder->state == SNAPBUILD_START), which I believe is not required. We need to do this after restore, in whichever state the snapshot was, as any state other than SNAPBUILD_CONSISTENT can have commits without all their changes. Accordingly, I think the comment: "Remember the transactions and subtransactions that were running when xl_running_xacts record that we decoded first was written." needs to be slightly modified to something like: "Remember the transactions and subtransactions that were running when the xl_running_xacts record that we decoded was written.". Change this if it is used at any other place in the patch.

> ---
> + if (builder->state == SNAPBUILD_START)
> + {
> +     int   nxacts = running->subxcnt + running->xcnt;
> +     Size  sz = sizeof(TransactionId) * nxacts;
> +
> +     NInitialRunningXacts = nxacts;
> +     InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
> +     memcpy(InitialRunningXacts, running->xids, sz);
> +     qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
> + }
>
> We should allocate the memory for InitialRunningXacts only when
> (running->subxcnt + running->xcnt) > 0.
>

There is no harm in doing that but ideally, that case would have been covered by an earlier check "if (running->oldestRunningXid == running->nextXid)" which suggests "No transactions were running, so we can jump to consistent."
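The allocation pattern under discussion can be sketched in standalone C: copy and sort the running (sub)transaction XIDs only when there is at least one, so nothing is allocated (or leaked on repeated calls) for the empty case. This is a simplified stand-in using malloc in place of MemoryContextAlloc, with a comparator equivalent to PostgreSQL's xidComparator for plain numeric order.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;

/* Numeric-order comparator, equivalent in effect to xidComparator. */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId xa = *(const TransactionId *) a;
	TransactionId xb = *(const TransactionId *) b;

	return (xa < xb) ? -1 : (xa > xb) ? 1 : 0;
}

/*
 * Return a sorted copy of the running XIDs, or NULL when none are
 * running (no allocation at all in that case).  Caller frees.
 */
TransactionId *
remember_initial_running_xacts(const TransactionId *xids, int nxacts)
{
	TransactionId *copy;

	if (nxacts <= 0)
		return NULL;			/* nothing running: skip the allocation */

	copy = malloc(sizeof(TransactionId) * nxacts);
	memcpy(copy, xids, sizeof(TransactionId) * nxacts);
	qsort(copy, nxacts, sizeof(TransactionId), xid_cmp);
	return copy;
}
```

Sorting up front is what makes the later membership checks a cheap binary search, which is why the original snippet qsorts immediately after the memcpy.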
Kindly make the required changes and submit the back-branch patches again.

--
With Regards,
Amit Kapila.
Re: Skip partition tuple routing with constant partition key
On Thu, 28 Jul 2022 at 00:50, Amit Langote wrote:
> So, in a way the caching scheme works for
> LIST partitioning only if the same value appears consecutively in the
> input set, whereas it does not for *a set of* values belonging to the
> same partition appearing consecutively. Maybe that's a reasonable
> restriction for now.

I'm not really seeing another cheap enough way of doing that. Any LIST partition could allow any number of values. We've only space to record 1 of those values by way of recording which element in the PartitionBound it was located at.

> if (part_index < 0)
> - part_index = boundinfo->default_index;
> + {
> +     /*
> +      * Since we don't do caching for the default partition or failed
> +      * lookups, we'll just wipe the cache fields back to their initial
> +      * values. The count becomes 0 rather than 1 as 1 means it's the
> +      * first time we've found a partition we're recording for the cache.
> +      */
> +     partdesc->last_found_datum_index = -1;
> +     partdesc->last_found_part_index = -1;
> +     partdesc->last_found_count = 0;
> +
> +     return boundinfo->default_index;
> + }
>
> I wonder why not to leave the cache untouched in this case? It's
> possible that erratic rows only rarely occur in the input sets.

I looked into that and I ended up just removing the code to reset the cache. It now works similarly to a LIST partitioned table's NULL partition.

> Should the comment update above get_partition_for_tuple() mention
> something like the cached path is basically O(1) and the non-cache
> path O(log N), as I can see in comments in some other modules, like
> pairingheap.c?

I adjusted for the other things you mentioned but I didn't add the big O stuff. I thought the comment was clear enough.

I'd quite like to push this patch early next week, so if anyone else is following along that might have any objections, could they do so before then?
David

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index e03ea27299..7ae7496737 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1332,10 +1332,49 @@ FormPartitionKeyDatum(PartitionDispatch pd,
 		elog(ERROR, "wrong number of partition key expressions");
 }
 
+/*
+ * The number of times the same partition must be found in a row before we
+ * switch from a binary search for the given values to just checking if the
+ * values belong to the last found partition.  This must be above 0.
+ */
+#define PARTITION_CACHED_FIND_THRESHOLD		16
+
 /*
  * get_partition_for_tuple
  *		Finds partition of relation which accepts the partition key specified
- *		in values and isnull
+ *		in values and isnull.
+ *
+ * Calling this function can be quite expensive when LIST and RANGE
+ * partitioned tables have many partitions.  This is due to the binary search
+ * that's done to find the correct partition.  Many of the use cases for LIST
+ * and RANGE partitioned tables make it likely that the same partition is
+ * found in subsequent ExecFindPartition() calls.  This is especially true for
+ * cases such as RANGE partitioned tables on a TIMESTAMP column where the
+ * partition key is the current time.  When asked to find a partition for a
+ * RANGE or LIST partitioned table, we record the partition index and datum
+ * offset we've found for the given 'values' in the PartitionDesc (which is
+ * stored in relcache), and if we keep finding the same partition
+ * PARTITION_CACHED_FIND_THRESHOLD times in a row, then we'll enable caching
+ * logic and instead of performing a binary search to find the correct
+ * partition, we'll just double-check that 'values' still belong to the last
+ * found partition, and if so, we'll return that partition index, thus
+ * skipping the need for the binary search.  If we fail to match the last
+ * partition when double checking, then we fall back on doing a binary search.
+ * In this case, unless we find 'values' belong to the DEFAULT partition,
+ * we'll reset the number of times we've hit the same partition so that we
+ * don't attempt to use the cache again until we've found that partition at
+ * least PARTITION_CACHED_FIND_THRESHOLD times in a row.
+ *
+ * For cases where the partition changes on each lookup, the amount of
+ * additional work required just amounts to recording the last found partition
+ * and bound offset then resetting the found counter.  This is cheap and does
+ * not appear to cause any meaningful slowdowns for such cases.
+ *
+ * No caching of partitions is done when the last found partition is the
+ * DEFAULT or NULL partition.  For the case of the DEFAULT partition, there
+ * is no bound offset storing the matching datum, so we cannot confirm the
+ * indexes match.  For the NULL partition, this is just so cheap, there's no
+ * sense in caching.
  *
  * Return value is index of the
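The threshold-based caching described in that comment can be sketched as a tiny self-contained C program. This is not the execPartition.c code: partitions here are simplified to half-open integer ranges ([bounds[i], bounds[i+1])), and the names are illustrative, but the control flow (count consecutive hits, switch to a cached double-check after PARTITION_CACHED_FIND_THRESHOLD, fall back to binary search and reset the streak on a miss) follows the scheme under discussion.

```c
#include <stdlib.h>

#define PARTITION_CACHED_FIND_THRESHOLD 16

/* Simplified partition descriptor: nparts ranges over ascending bounds. */
typedef struct
{
	int  nparts;
	int *bounds;				/* nparts + 1 ascending boundaries */
	int  last_found_part;		/* -1 when nothing cached */
	int  last_found_count;		/* consecutive hits on last_found_part */
} PartDesc;

/* Slow path: O(log N) binary search over the range bounds. */
static int
binary_search_part(const PartDesc *pd, int value)
{
	int			lo = 0,
				hi = pd->nparts - 1;

	while (lo <= hi)
	{
		int			mid = (lo + hi) / 2;

		if (value < pd->bounds[mid])
			hi = mid - 1;
		else if (value >= pd->bounds[mid + 1])
			lo = mid + 1;
		else
			return mid;
	}
	return -1;					/* no partition accepts this value */
}

int
find_partition(PartDesc *pd, int value)
{
	int			part;

	/* Fast path: after enough consecutive hits, recheck the cached bound. */
	if (pd->last_found_part >= 0 &&
		pd->last_found_count >= PARTITION_CACHED_FIND_THRESHOLD)
	{
		part = pd->last_found_part;
		if (value >= pd->bounds[part] && value < pd->bounds[part + 1])
			return part;		/* O(1) cache hit */
	}

	part = binary_search_part(pd, value);
	if (part == pd->last_found_part)
		pd->last_found_count++;	/* same partition again: extend the streak */
	else
	{
		pd->last_found_part = part;	/* new partition: restart the streak */
		pd->last_found_count = 1;
	}
	return part;
}
```

As the patch comment notes, when every lookup lands in a different partition the only extra work is updating the two cache fields, so the slow path stays essentially as cheap as before.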
Re: Official Windows Installer and Documentation
On Wed, Jul 27, 2022 at 07:31:35PM -0700, David G. Johnston wrote:
>
> Ultimately we do our users the best service if when they operate an
> installation using defaults that they have documentation showing how to
> perform something like an upgrade that works with those defaults.

I don't really agree that it's the best service to let users assume that they can blindly copy/paste some commands without trying to understand them, or how to adapt them to their specificities. That's in my opinion particularly true on Windows, since to my knowledge most companies will have the binaries installed in one place (C:\Program Files\PostgreSQL might be frequent), and have the data stored in another place (D: or other). So I don't think the default command will actually work for any non-toy installation.

> So why not assume the user is whatever the EDB installer uses
> and make that the example?

Well, IIUC that used to be the case until EDB changed its installer. Maybe the odds of an impacting change happening again are low, but it's certainly not a great idea to assume that the community will regularly check their installer and update the doc to match what they're doing. So yeah, it may be better for them to provide documentation adapted to their usage.
Re: Official Windows Installer and Documentation
"David G. Johnston" writes:
> On Wed, Jul 27, 2022 at 6:42 PM Julien Rouhaud wrote:
> > Note that there's no "official" Windows installer,

Yeah, that.

> Our technical definition aside, the fact is our users consider the sole EDB
> installer to be official.
> If the ultimate solution is to update:
> https://www.enterprisedb.com/downloads/postgres-postgresql-downloads
> to have its own installation and upgrade supplement to the official
> documentation then I'd be fine with that.

That's what needs to happen. On the Linux side for example, we do not address packaging-specific behaviors of Devrim's builds, or Debian's builds, or Red Hat's builds --- all of which act differently, in ways that are sadly a lot more critical to novices than seasoned users. If EDB isn't adequately filling in the documentation for the behavior of their packaging, that's on them.

regards, tom lane
RE: Support load balancing in libpq
Dear Jelte,

> With plain Postgres this assumption is probably correct. But the main reason
> I'm interested in this patch was because I would like to be able to load
> balance across the workers in a Citus cluster. These workers are all
> primaries.
> Similar usage would likely be possible with BDR (bidirectional replication).

I agree this feature is useful for BDR-like solutions.

> If the user takes such care when building their host list, they could simply
> not add the load_balance_hosts=true option to the connection string.
> If you know there's only one primary in the list and you're looking for
> the primary, then there's no reason to use load_balance_hosts=true.

You mean that it is the user's responsibility to set this correctly, right? That seems reasonable, because libpq is just a library for connecting to the server and should not be too smart.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: Official Windows Installer and Documentation
On Wednesday, July 27, 2022, Julien Rouhaud wrote:
> On Wed, Jul 27, 2022 at 07:02:51PM -0700, David G. Johnston wrote:
> >
> > In the end the problem is ours and cannot be simply assigned to a
> > third-party. So let's resolve it here (on this list, whatever the
> > solution) where representatives from all parties are present.
>
> We could amend the pg_upgrade (and maybe others if needed, but I don't see any
> other occurrences of RUNAS) documentation to be a bit more general, like the
> non-Windows part of it, maybe something like:
>
> For Windows users, you must be logged into an administrative account, and then
> start a shell as the user running the postgres service and set the proper
> path. Assuming a user named postgres and the binaries installed in
> C:\Program Files\PostgreSQL\14:
>
>     RUNAS /USER:postgres "CMD.EXE"
>     SET PATH=%PATH%;C:\Program Files\PostgreSQL\14\bin;
>
> It's ultimately up to the users to adapt the commands to match their
> environment.

Ultimately we do our users the best service if, when they operate an installation using defaults, they have documentation showing how to perform something like an upgrade that works with those defaults. I don't see much point making that change in isolation until it is obvious nothing better is forthcoming.

If the O/S user postgres doesn't exist then you need to supply -U postgres, because the install user for PostgreSQL is still postgres. So why not assume the user is whatever the EDB installer uses and make that the example? If someone has an install on Windows that uses the postgres account, adapting the command for them should be trivial, and the majority of installer users get a command sequence that works.

David J.
Re: Official Windows Installer and Documentation
On Wed, Jul 27, 2022 at 07:02:51PM -0700, David G. Johnston wrote:
>
> In the end the problem is ours and cannot be simply assigned to a
> third-party. So let's resolve it here (on this list, whatever the
> solution) where representatives from all parties are present.

We could amend the pg_upgrade (and maybe others if needed, but I don't see any other occurrences of RUNAS) documentation to be a bit more general, like the non-Windows part of it, maybe something like:

For Windows users, you must be logged into an administrative account, and then start a shell as the user running the postgres service and set the proper path. Assuming a user named postgres and the binaries installed in C:\Program Files\PostgreSQL\14:

    RUNAS /USER:postgres "CMD.EXE"
    SET PATH=%PATH%;C:\Program Files\PostgreSQL\14\bin;

It's ultimately up to the users to adapt the commands to match their environment.
Re: fairywren hung in pg_basebackup tests
On Thu, Jul 28, 2022 at 12:41 PM Thomas Munro wrote:
> I thought about putting it at the top, but don't we really only need
> to make an extra system call if MinGW's stat() told us it saw a
> directory? And what if you asked to look through symlinks? I thought
> about putting it near the S_ISLNK() test, which is the usual pattern,
> but then what if MinGW decides to add d_type support one day? Those
> thoughts led to the attached formulation. Untested. I'll go and try
> to see if I can run this with Melih's proposed MSYS CI support...

Success. I'll push this, and then hopefully those BF animals can be unwedged. (I see some unrelated warnings to look into.)

https://cirrus-ci.com/task/5253183533481984?logs=tests#L835
Re: Improve description of XLOG_RUNNING_XACTS
On Thu, Jul 21, 2022 at 10:13 PM Ashutosh Bapat wrote:
>
> Hi
>
> On Thu, Jul 21, 2022 at 6:44 AM Masahiko Sawada wrote:
> >
> > Hi,
> >
> > I realized that standby_desc_running_xacts() in standbydesc.c doesn't
> > describe subtransaction XIDs. I've attached the patch to improve the
> > description. Here is an example by pg_waldump:
> >
> > * HEAD
> > rmgr: Standby len (rec/tot): 58/58, tx: 0, lsn:
> > 0/01D0C608, prev 0/01D0C5D8, desc: RUNNING_XACTS nextXid 1050
> > latestCompletedXid 1047 oldestRunningXid 1048; 1 xacts: 1048
> >
> > * w/ patch
> > rmgr: Standby len (rec/tot): 58/58, tx: 0, lsn:
> > 0/01D0C608, prev 0/01D0C5D8, desc: RUNNING_XACTS nextXid 1050
> > latestCompletedXid 1047 oldestRunningXid 1048; 1 xacts: 1048; 1
> > subxacts: 1049
> >
>
> I think this is a good addition to debugging info. +1
>
> If we are going to add 64 subxid numbers then it would help if we
> could be more verbose and print "subxid overflowed" instead of "subxid
> ovf".

Yeah, it looks better so agreed. I've attached an updated patch.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachment: v2-0001-Improve-description-of-XLOG_RUNNING_XACTS.patch
RE: [Commitfest 2022-07] Patch Triage: Waiting on Author
On Wednesday, July 27, 2022 3:27 AM Jacob Champion wrote:
>
> - Parallel INSERT SELECT take 2
> https://commitfest.postgresql.org/38/3143/
>
> There was a lot of discussion early on in this patchset's life, and
> then it got caught in a rebase loop without review in August 2021. The
> thread has mostly gone dark since then and the patch does not apply.

Sorry, I think we don't have enough time to work on this recently. So please mark it as RWF and we will get back to this in the future.

Best regards,
Hou zj
Re: Documentation about PL transforms
On 2022-07-13 00:26, Pavel Stehule wrote:
> 1. I am not sure if get_call_trftypes is a good name - the prefix get_call
> is used when some runtime data is processed.

I guess I hadn't caught on that the prefix carried that meaning. To me, it appeared to be a prefix used to avoid being specific to 'function' or 'procedure'.

> This function just returns
> reformatted data from the system catalogue. Maybe get_func_trftypes_list,
> or just replace function get_func_trftypes (now, the list is an array, so
> there should not be performance issues). For consistency, the new function
> should be used in plperl and plpython too. Probably this function is not

If it is acceptable to replace get_func_trftypes like that, I can produce such a patch.

> 2.
>
> +It also contains the OID of the intended procedural language and whether
> +that procedural language is declared as TRUSTED, useful
> +if a single inline handler is supporting more than one procedural language.
>
> I am not sure if this sentence is in the correct place. Maybe can be
> mentioned separately, so generally handlers can be used by more than one
> procedural language. But maybe I don't understand this sentence.

My point was that, if the structure did /not/ include the OID of the PL and its TRUSTED property, then it would not be possible for a single inline handler to support more than one PL. So that is why it is a good thing that those are included in the structure, and why it would be a bad thing if they were not.

Would it be clearer to say:

It also contains the OID of the intended procedural language and whether that procedural language is declared as TRUSTED. While these values are redundant if the inline handler is serving only a single procedural language, they are necessary to allow one inline handler to serve more than one PL.

Regards,
-Chap
Re: Official Windows Installer and Documentation
On Wed, Jul 27, 2022 at 6:42 PM Julien Rouhaud wrote:
> Hi,
>
> On Wed, Jul 27, 2022 at 11:36:11PM +0200, Thomas Kellerer wrote:
> > David G. Johnston wrote on 27.07.2022 at 21:21:
> > > And then there is the issue of file ownership.
> > >
> > > Assuming we want better documentation for this specific issue for
> > > back-patching what would that look like?
> > >
> > > Going forward should our installer be creating the postgres user for
> > > consistency with other platforms or not?
> >
> > Didn't the installer used to do that in earlier releases and that
> > was removed when Postgres was able to "drop privileges" when the
> > service is started?
> >
> > I remember a lot of problems around the specific Postgres service
> > account when that still was the case.
>
> Note that there's no "official" Windows installer, and companies providing
> one are free to implement it the way they want, which can contradict the
> official documentation. The download section of the website clearly says
> that this is a third-party installer.
>
> For now there's only the EDB installer that remains, but I think that some
> time ago there were 2 or 3 different providers.
>
> For the EDB installer, I'm not sure why or when it was changed, but it
> indeed used to have a dedicated local account and now relies on "Local
> System Account" or something like that. But IIRC, when it used to create a
> local account the name could be configured, so there was no guarantee of a
> local "postgres" account by then either.

Our technical definition aside, the fact is our users consider the sole EDB installer to be official. If the ultimate solution is to update:

https://www.enterprisedb.com/downloads/postgres-postgresql-downloads

to have its own installation and upgrade supplement to the official documentation, then I'd be fine with that.

But as of now the "Installation Guide" points back to the official documentation, which has no actual distribution-specific information while simultaneously reinforcing the impression that it is an official installer.

I get sending people to the EDB web services team for download issues, since we don't host the binaries. That aspect of them being third-party doesn't seem to be an issue. But for documentation, given the current state of things, whether we amend our docs or strongly encourage the people who are benefiting financially from being our de facto official Windows installer provider to provide separate documentation to address this apparent shortcoming that is harming our image in the Windows community, I don't really care, as long as something changes.

In the end the problem is ours and cannot be simply assigned to a third party. So let's resolve it here (on this list, whatever the solution) where representatives from all parties are present.

David J.
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
On Wed, Jul 27, 2022 at 8:33 PM Amit Kapila wrote:
>
> On Mon, Jul 25, 2022 at 11:26 AM Masahiko Sawada wrote:
> >
> > On Mon, Jul 25, 2022 at 10:45 AM Masahiko Sawada wrote:
> >
> > I've attached the patch for REL15 that I forgot.
>
> I feel the place to remember running xacts information in
> SnapBuildProcessRunningXacts is not appropriate. Because in cases
> where there are no running xacts or when xl_running_xact is old enough
> that we can't use it, we don't need that information. I feel we need
> it only when we have to reuse the already serialized snapshot, so,
> won't it be better to initialize at that place in
> SnapBuildFindSnapshot()?

Good point, agreed.

> I have changed accordingly in the attached
> and apart from that slightly modified the comments and commit message.
> Do let me know what you think of the attached?

It would be better to remember the initial running xacts after SnapBuildRestore() returns true? Because otherwise, we could end up allocating InitialRunningXacts multiple times while leaking the old ones if there are no serialized snapshots that we are interested in.

---
+ if (builder->state == SNAPBUILD_START)
+ {
+     int   nxacts = running->subxcnt + running->xcnt;
+     Size  sz = sizeof(TransactionId) * nxacts;
+
+     NInitialRunningXacts = nxacts;
+     InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+     memcpy(InitialRunningXacts, running->xids, sz);
+     qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+ }

We should allocate the memory for InitialRunningXacts only when (running->subxcnt + running->xcnt) > 0.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: collect_corrupt_items_vacuum.patch
On Wed, Jul 27, 2022 at 5:56 PM Tom Lane wrote:
> Maybe we need a different function for pg_visibility to call?
> If we want ComputeXidHorizons to serve both these purposes, then it
> has to always deliver exactly the right answer, which seems like
> a definition that will be hard and expensive to achieve.

Yeah, I was thinking along similar lines. I'm also kind of wondering why these calculations use latestCompletedXid. Is that something we do solely to reduce locking? The XIDs of running transactions matter, and their snapshots matter, and the XIDs that could start running in the future matter, but I don't know why it matters what the latest completed XID is.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Official Windows Installer and Documentation
Hi,

On Wed, Jul 27, 2022 at 11:36:11PM +0200, Thomas Kellerer wrote:
> David G. Johnston schrieb am 27.07.2022 um 21:21:
> > And then there is the issue of file ownership.
> >
> > Assuming we want better documentation for this specific issue for
> > back-patching what would that look like?
> >
> > Going forward should our installer be creating the postgres user for
> > consistency with other platforms or not?
>
> Didn't the installer use to do that in earlier releases, and wasn't
> that removed when Postgres became able to "drop privileges" when the
> service is started?
>
> I remember a lot of problems around the specific Postgres service
> account when that still was the case.

Note that there's no "official" Windows installer, and companies
providing one are free to implement it the way they want, which can
contradict the official documentation.  The download section of the
website clearly says that this is a third-party installer.  For now only
the EDB installer remains, but I think that some time ago there were two
or three different providers.

For the EDB installer, I'm not sure why or when it was changed, but it
indeed used to have a dedicated local account and now relies on the
"Local System Account" or something like that.  But IIRC, when it used
to create a local account the name could be configured, so there was no
guarantee of a local "postgres" account back then either.
Re: Introduce wait_for_subscription_sync for TAP tests
On Wed, Jul 27, 2022 at 8:54 PM Amit Kapila wrote:
> On Tue, Jul 26, 2022 at 1:12 PM Masahiko Sawada wrote:
> > On Tue, Jul 26, 2022 at 2:01 PM Amit Kapila wrote:
> > >
> > > 2.
> > > +# wait for the replication to catchup if required.
> > > +if (defined($publisher))
> > > +{
> > > +	croak 'subscription name must be specified' unless defined($subname);
> > > +	$publisher->wait_for_catchup($subname, 'replay');
> > > +}
> > > +
> > > +# then, wait for all table states to be ready.
> > > +print "Waiting for all subscriptions in \"$name\" to synchronize
> > > data\n";
> > > +my $query = qq[SELECT count(1) = 0
> > > +	FROM pg_subscription_rel
> > > +	WHERE srsubstate NOT IN ('r', 's');];
> > > +$self->poll_query_until($dbname, $query)
> > > +	or croak "timed out waiting for subscriber to synchronize data";
> > >
> > > In the tests, I noticed that a few places did wait_for_catchup after
> > > the subscription check, and at other places, we did that check before
> > > as you have it here. Ideally, I think wait_for_catchup should be after
> > > confirming the initial sync is over as without initial sync, the
> > > publisher node won't be completely in sync with the subscriber.
> >
> > What do you mean by the last sentence? I thought the order doesn't
> > matter here. Even if we do wait_for_catchup first then the
> > subscription check, we can make sure that the apply worker caught up
> > and table synchronization has been done, no?
>
> That's right. I thought we should first ensure the subscriber has
> finished operations if possible; like in this case, it can ensure
> table sync has finished and then we can ensure whether publisher and
> subscriber are in sync. That sounds more logical to me.

Makes sense. I've incorporated it in the v3 patch.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: Introduce wait_for_subscription_sync for TAP tests
On Wed, Jul 27, 2022 at 7:08 PM shiy.f...@fujitsu.com wrote: > > On Tue, Jul 26, 2022 3:42 PM Masahiko Sawada wrote: > > > > I've attached an updated patch as well as a patch to remove duplicated > > waits in 007_ddl.pl. > > > > Thanks for your patch. Here are some comments. Thank you for the comments! > > 1. > I think some comments need to be changed in the patch. > For example: > # Also wait for initial table sync to finish > # Wait for initial sync to finish as well > > Words like "Also" and "as well" can be removed now, we originally used them > because we wait for catchup and "also" wait for initial sync. Agreed. > > 2. > In the following places, we can remove wait_for_catchup() and then call it in > wait_for_subscription_sync(). > > 2.1. > 030_origin.pl: > @@ -128,8 +120,7 @@ $node_B->safe_psql( > > $node_C->wait_for_catchup($appname_B2); > > -$node_B->poll_query_until('postgres', $synced_query) > - or die "Timed out while waiting for subscriber to synchronize data"; > +$node_B->wait_for_subscription_sync; > > 2.2. > 031_column_list.pl: > @@ -385,7 +373,7 @@ $node_subscriber->safe_psql( > ALTER SUBSCRIPTION sub1 SET PUBLICATION pub2, pub3 > )); > > -wait_for_subscription_sync($node_subscriber); > +$node_subscriber->wait_for_subscription_sync; > > $node_publisher->wait_for_catchup('sub1'); > > 2.3. > 100_bugs.pl: > @@ -281,8 +276,7 @@ $node_subscriber->safe_psql('postgres', > $node_publisher->wait_for_catchup('tap_sub'); > > # Also wait for initial table sync to finish > -$node_subscriber->poll_query_until('postgres', $synced_query) > - or die "Timed out while waiting for subscriber to synchronize data"; > +$node_subscriber->wait_for_subscription_sync; > > is( $node_subscriber->safe_psql( > 'postgres', "SELECT * FROM tab_replidentity_index"), Agreed. I've attached updated patches that incorporated the above comments as well as the comment from Amit. BTW regarding 0001 patch to remove the duplicated wait, should we backpatch to v15? 
I think we can do that as it's an obvious fix and it seems to be an oversight in 8f2e2bbf145. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/ v3-0001-Remove-duplicated-wait-for-subscription-synchroni.patch Description: Binary data v3-0002-Introduce-wait_for_subscription_sync-for-TAP-test.patch Description: Binary data
Re: fairywren hung in pg_basebackup tests
On Thu, Jul 28, 2022 at 3:21 AM Andrew Dunstan wrote:
> On 2022-07-27 We 10:58, Tom Lane wrote:
> > Andrew Dunstan writes:
> >> The alternative I thought of would be to switch msys to using our
> >> dirent.c. Probably not too hard, but certainly more work than reverting.

Thanks for figuring this out, Andrew.  Previously I thought of MSYS as a
way to use configure+make+gcc/clang but pure Windows C APIs (using
MinGW's replacement Windows headers), but today I learned that
MSYS/MinGW also supplies a small amount of POSIX stuff, including
readdir() etc., so we don't use our own emulation in that case.

I suppose we could consider using our own dirent.h/c with MinGW (and
seeing if there are other similar hazards), to reduce the number of
Windows/POSIX API combinations we have to fret about, but not today.

Another thought for the future is that lstat() + S_ISLNK() could
probably be made to fire for junction points on Windows (all build
variants), and then get_dirent_type()'s fallback code for DT_UNKNOWN
would have Just Worked (no extra system call required), and we could
also probably remove calls to pgwin32_is_junction() everywhere.

> > If you ask me, the shortest-path general-purpose fix is to insert
> >
> > #if MSYS
> > if (pgwin32_is_junction(path))
> > return PGFILETYPE_DIR;
> > #endif
> >
> > at the start of get_dirent_type.  (I'm not sure how to spell the
> > #if test.)  We could look at using dirent.c later, but I think
> > right now it's important to un-break the buildfarm ASAP.
>
> +1. I think you spell it:
>
> #if defined(WIN32) && !defined(_MSC_VER)

I thought about putting it at the top, but don't we really only need to
make an extra system call if MinGW's stat() told us it saw a directory?
And what if you asked to look through symlinks?  I thought about putting
it near the S_ISLNK() test, which is the usual pattern, but then what if
MinGW decides to add d_type support one day?  Those thoughts led to the
attached formulation.  Untested.
I'll go and try to see if I can run this with Melih's proposed MSYS CI
support...

From 33005aaad52a550ba028277dfb634d3d6fbfdd7f Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Thu, 28 Jul 2022 11:45:31 +1200
Subject: [PATCH] Fix get_dirent_type() for symlinks on MinGW/MSYS.

On Windows with MSVC, get_dirent_type() was recently made to return
DT_LNK for junction points by commit 9d3444dc, which fixed some
defective dirent.c code.

On Windows with Cygwin, get_dirent_type() already worked for symlinks,
as it does on POSIX systems, because Cygwin has its own fake symlinks
that behave like POSIX (on closer inspection, Cygwin's dirent has the
BSD d_type extension but it's probably always DT_UNKNOWN, so we fall
back to lstat(), which understands Cygwin symlinks with S_ISLNK()).

On Windows with MinGW/MSYS, we need extra code, because the MinGW
runtime has its own readdir() without d_type, and the lstat()-based
fallback has no knowledge of our convention for treating junctions as
symlinks.

Reported-by: Andrew Dunstan
Discussion: https://postgr.es/m/b9ddf605-6b36-f90d-7c30-7b3e95c46276%40dunslane.net
---
 src/common/file_utils.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 19d308ad1f..210b2c4dfd 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -465,5 +465,19 @@ get_dirent_type(const char *path,
 #endif
 	}
 
+#if defined(WIN32) && !defined(_MSC_VER)
+	/*
+	 * If we're on native Windows (not Cygwin, which has its own POSIX
+	 * symlinks), but not using the MSVC compiler, then we're using a readdir()
+	 * emulation provided by the compiler's runtime that has no concept of
+	 * symlinks.  It sees junction points as directories, so we need an extra
+	 * system call to recognize them as symlinks, following our convention.
+	 */
+	if (result == PGFILETYPE_DIR &&
+		!look_through_symlinks &&
+		pgwin32_is_junction(path))
+		result = PGFILETYPE_LNK;
+#endif
+
 	return result;
 }
-- 
2.36.1
Re: Proposal: add a debug message about using geqo
On Fri, Jul 22, 2022 at 1:20 PM Jacob Champion wrote: > On Wed, Jun 1, 2022 at 11:09 PM KAWAMOTO Masaya > wrote: > > That sounds a nice idea. But I don't think that postgres shows in the > > EXPLAIN output why the plan is selected. Would it be appropriate to > > show that GEQO is used in EXPLAIN output? > > I'm reminded of Greenplum's "Optimizer" line in its EXPLAIN output > [1], so from that perspective I think it's intuitive. > > > As a test, I created a patch that add information about GEQO to > > EXPLAIN output by the GEQO option. The output example is as follows. > > What do you think about the location and content of information about > GEQO? > I am a little surprised to see GeqoDetails being printed for a plan > that didn't use GEQO, but again that's probably because I'm used to > GPDB's Optimizer output. And I don't have a lot of personal experience > using alternative optimizers. > I agree this should be part of explain output. I would not print the current value of geqo_threshold and leave setting display the exclusive purview of the settings option. The presentation of only a single geqo result seems incorrect given that multiple trees can exist. In the first example below the full outer join causes 3 relations to be seen as a single relation at the top level (hence max join nodes = 4) while in the inner join case we see all 6 join nodes. There should be two outputs of GEQO in the first explain, one with join nodes of 3 and the existing one with 4. I also don't see the point of labelling them "max"; "join nodes" seems sufficient. While it can probably be figured out from the rest of the plan, listing the names of the join nodes may be useful (and give join nodes some company). David J. 
postgres=# explain (verbose, geqo) with gs2 (v2) as materialized ( select * from generate_series(1,1) ) select * from gs2 as gs4 full outer join (select gs2a.v2 from gs2 as gs2a, gs2 as gs2b) as gs5 using (v2), generate_series(1, 1) as gs (v1) cross join gs2 as gs3 where v1 IN (select v2 from gs2); QUERY PLAN Nested Loop (cost=0.07..0.21 rows=1 width=12) Output: COALESCE(gs4.v2, gs2a.v2), gs.v1, gs3.v2 CTE gs2 -> Function Scan on pg_catalog.generate_series (cost=0.00..0.01 rows=1 width=4) Output: generate_series.generate_series Function Call: generate_series(1, 1) -> Nested Loop (cost=0.06..0.16 rows=1 width=12) Output: gs.v1, gs4.v2, gs2a.v2 -> Nested Loop (cost=0.02..0.06 rows=1 width=4) Output: gs.v1 Join Filter: (gs.v1 = gs2.v2) -> Function Scan on pg_catalog.generate_series gs (cost=0.00..0.01 rows=1 width=4) Output: gs.v1 Function Call: generate_series(1, 1) -> HashAggregate (cost=0.02..0.03 rows=1 width=4) Output: gs2.v2 Group Key: gs2.v2 -> CTE Scan on gs2 (cost=0.00..0.02 rows=1 width=4) Output: gs2.v2 -> Hash Full Join (cost=0.03..0.10 rows=1 width=8) Output: gs4.v2, gs2a.v2 Hash Cond: (gs2a.v2 = gs4.v2) -> Nested Loop (cost=0.00..0.05 rows=1 width=4) Output: gs2a.v2 -> CTE Scan on gs2 gs2b (cost=0.00..0.02 rows=1 width=0) Output: gs2b.v2 -> CTE Scan on gs2 gs2a (cost=0.00..0.02 rows=1 width=4) Output: gs2a.v2 -> Hash (cost=0.02..0.02 rows=1 width=4) Output: gs4.v2 -> CTE Scan on gs2 gs4 (cost=0.00..0.02 rows=1 width=4) Output: gs4.v2 -> CTE Scan on gs2 gs3 (cost=0.00..0.02 rows=1 width=4) Output: gs3.v2 GeqoDetails: GEQO: used, geqo_threshold: 2, Max join nodes: 4 (35 rows) postgres=# explain (verbose, geqo) with gs2 (v2) as materialized ( select * from generate_series(1,1) ) select * from gs2 as gs4 join (select gs2a.v2 from gs2 as gs2a, gs2 as gs2b) as gs5 using (v2), generate_series(1, 1) as gs (v1) cross join gs2 as gs3 where v1 IN (select v2 from gs2); QUERY PLAN -- Nested Loop (cost=0.02..0.18 rows=1 width=12) Output: gs4.v2, gs.v1, gs3.v2 CTE gs2 
-> Function Scan on pg_catalog.generate_series (cost=0.00..0.01 rows=1 width=4) Output: generate_series.generate_series Function Call: generate_series(1, 1) -> Nested Loop (cost=0.00..0.14 rows=1 width=12) Output: gs.v1, gs4.v2, gs3.v2 -> Nested Loop (cost=0.00..0.11 rows=1 width=8) Output: gs.v1, gs4.v2 -> Nested Loop
Re: Unprivileged user can induce crash by using an SUSET param in PGOPTIONS
I wrote: > =?utf-8?Q?Dagfinn_Ilmari_Manns=C3=A5ker?= writes: >> Thanks! Just one minor nitpick: setting an %ENV entry to `undef` >> doesn't unset the environment variable, it sets it to the empty string. >> To unset a variable it needs to be deleted from %ENV, i.e. `delete >> $ENV{PGUSER};`. > Ah. Still, libpq doesn't distinguish, so the test works anyway. > Not sure if it's worth changing. Meh ... I had to un-break the test for Windows, so did this while at it, using the local-in-block method. Thanks for the suggestion. regards, tom lane
Re: [Patch] ALTER SYSTEM READ ONLY
On Fri, Apr 8, 2022 at 7:27 AM Amul Sul wrote:
> Attached is a rebased version for the latest master head (#891624f0ec).

Hi Amul,

I'm going through past CF triage emails today; I noticed that this patch
dropped out of the commitfest when you withdrew it in January, but it
hasn't been added back with the most recent patchset you posted.  Was
that intended, or did you want to re-register it for review?

--Jacob
Re: collect_corrupt_items_vacuum.patch
Robert Haas writes:
> In reality, the oldest all-visible XID cannot move backward, but
> ComputeXidHorizons() lets it move backward, because it's intended for
> use by a caller who wants to mark pages all-visible, and it's only
> concerned with making sure that the value is old enough to be safe.

Right.

> And that's a problem for the way that pg_visibility is (mis-)using it.
> To say that another way, ComputeXidHorizons() is perfectly fine with
> returning a value that is older than the true answer, as long as it
> never returns a value that is newer than the true answer.
> pg_visibility wants the opposite.  Here, a value that is newer than
> the true value can't do worse than hide corruption, which is sort of
> OK, but a value that's older than the true value can report corruption
> where none exists, which is very bad.

Maybe we need a different function for pg_visibility to call?  If we
want ComputeXidHorizons to serve both these purposes, then it has to
always deliver exactly the right answer, which seems like a definition
that will be hard and expensive to achieve.

			regards, tom lane
Re: collect_corrupt_items_vacuum.patch
On Mon, Apr 4, 2022 at 4:51 AM Daniel Shelepanov wrote:
> I've been working on this
> [https://www.postgresql.org/message-id/flat/cfcca574-6967-c5ab-7dc3-2c82b6723b99%40mail.ru]
> bug. Finally, I've come up with the patch you can find attached.
> Basically, what it does is raise a PROC_IN_VACUUM flag and reset it
> afterwards. I know this seems kinda crunchy and I hope you guys will
> give me some hints on where to continue. This
> [https://www.postgresql.org/message-id/20220218175119.7hwv7ksamfjwijbx%40alap3.anarazel.de]
> message contains a reproduction script. Thank you very much in advance.

I noticed the CommitFest entry for this thread today and decided to
take a look.

I think the general issue here can be stated in this way: suppose a
VACUUM computes an all-visible cutoff X, i.e. it thinks all committed
XIDs < X are all-visible. Then, at a later time, pg_visibility computes
an all-visible cutoff Y, i.e. it thinks all committed XIDs < Y are
all-visible. If Y < X, pg_check_visible() might falsely report
corruption, because VACUUM might have marked as all-visible some page
containing tuples which pg_check_visible() thinks aren't really
all-visible.

In reality, the oldest all-visible XID cannot move backward, but
ComputeXidHorizons() lets it move backward, because it's intended for
use by a caller who wants to mark pages all-visible, and it's only
concerned with making sure that the value is old enough to be safe. And
that's a problem for the way that pg_visibility is (mis-)using it.

To say that another way, ComputeXidHorizons() is perfectly fine with
returning a value that is older than the true answer, as long as it
never returns a value that is newer than the true answer. pg_visibility
wants the opposite. Here, a value that is newer than the true value
can't do worse than hide corruption, which is sort of OK, but a value
that's older than the true value can report corruption where none
exists, which is very bad.
I have a feeling, therefore, that this isn't really a complete fix. I think it might address one way for the horizon reported by ComputeXidHorizons() to move backward, but not all the ways. Unfortunately, I am out of time for today to study this... but will try to find more time on another day. -- Robert Haas EDB: http://www.enterprisedb.com
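[Editor's note: the asymmetry Robert describes — a checking cutoff that must never move backward across calls — can be modeled in a few lines. This sketch is purely illustrative: the function names are invented, and it uses plain integer comparison, whereas real XIDs need wraparound-aware comparison (TransactionIdFollows), not `>`.]

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Illustrative only: keep a cutoff that can only advance.  A
 * conservative (possibly too-old) estimate, like ComputeXidHorizons()
 * may return, is safe for *marking* pages all-visible, but for
 * *checking* visibility we must never use a cutoff older than one we
 * previously relied on, or we may flag sound pages as corrupt.
 * NB: plain '>' is a simplification; real XID space is circular.
 */
static TransactionId
monotonic_cutoff(TransactionId *newest_seen, TransactionId estimate)
{
	if (estimate > *newest_seen)
		*newest_seen = estimate;	/* the horizon advanced */
	return *newest_seen;			/* never older than a prior answer */
}
```

Under this model, the Y < X scenario above simply cannot occur: a later caller is handed back at least the cutoff an earlier VACUUM used.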
Re: Official Windows Installer and Documentation
David G. Johnston schrieb am 27.07.2022 um 21:21:
> And then there is the issue of file ownership.
>
> Assuming we want better documentation for this specific issue for
> back-patching what would that look like?
>
> Going forward should our installer be creating the postgres user for
> consistency with other platforms or not?

Didn't the installer use to do that in earlier releases, and wasn't
that removed when Postgres was able to "drop privileges" when the
service is started?

I remember a lot of problems around the specific Postgres service
account when that still was the case.

As far as I can tell, most of the problems of the Windows installer
stem from the fact that it tries to use icacls to set privileges on the
data directory.  This seems to fail quite frequently, causing the
infamous "Problem running post-install step" error.

The fact that the installer still defaults to using "c:\Program Files"
for the location of the data directory might be related to that.  (But
then I don't know enough of the internals of the installer and Windows.)

Just my 0.02€

Thomas
Re: [PoC] Reducing planning time when tables have many partitions
On Wed, 27 Jul 2022 at 18:07, Yuya Watari wrote: > I agree that introducing a Bitmapset field may solve this problem. I > will try this approach in addition to previous ones. I've attached a very half-done patch that might help you get started on this. There are still 2 failing regression tests which seem to be due to plan changes. I didn't expend any effort looking into why these plans changed. The attached does not contain any actual optimizations to find the minimal set of EMs to loop through by masking the Bitmapsets that I mentioned in my post last night. I just quickly put it together to see if there's some hole in the idea. I don't think there is. I've not really considered all of the places that we'll want to do the bit twiddling to get the minimal set of EquivalenceMember. I did see there's a couple more functions in postgres_fdw.c that could be optimized. One thing I've only partially thought about is what if you want to also find EquivalenceMembers with a constant value. If there's a Const, then you'll lose the bit for that when you mask the ec's ec_member_indexes with the RelOptInfos. If there are some places where we need to keep those then I think we'll need to add another field to EquivalenceClass to mark the index into PlannerInfo's eq_members for the EquivalenceMember with the Const. That bit would have to be bms_add_member()ed back into the Bitmapset of matching EquivalenceMembers after masking out RelOptInfo's ec_member_indexes. When adding the optimizations to find the minimal set of EM bits to search through, you should likely add some functions similar to the get_eclass_indexes_for_relids() and get_common_eclass_indexes() functions to help you find the minimal set of bits. You can also probably get some other inspiration from [1], in general. 
David

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3373c715535

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 048db542d3..0af3943e03 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -7410,11 +7410,11 @@ conversion_error_callback(void *arg)
 EquivalenceMember *
 find_em_for_rel(PlannerInfo *root, EquivalenceClass *ec, RelOptInfo *rel)
 {
-	ListCell   *lc;
+	int			i = -1;
 
-	foreach(lc, ec->ec_members)
+	while ((i = bms_next_member(ec->ec_member_indexes, i)) >= 0)
 	{
-		EquivalenceMember *em = (EquivalenceMember *) lfirst(lc);
+		EquivalenceMember *em = (EquivalenceMember *) list_nth(root->eq_members, i);
 
 		/*
 		 * Note we require !bms_is_empty, else we'd accept constant
@@ -7453,7 +7453,7 @@ find_em_for_rel_target(PlannerInfo *root, EquivalenceClass *ec,
 	{
 		Expr	   *expr = (Expr *) lfirst(lc1);
 		Index		sgref = get_pathtarget_sortgroupref(target, i);
-		ListCell   *lc2;
+		int			i;
 
 		/* Ignore non-sort expressions */
 		if (sgref == 0 ||
@@ -7469,9 +7469,10 @@ find_em_for_rel_target(PlannerInfo *root, EquivalenceClass *ec,
 			expr = ((RelabelType *) expr)->arg;
 
 		/* Locate an EquivalenceClass member matching this expr, if any */
-		foreach(lc2, ec->ec_members)
+		i = -1;
+		while ((i = bms_next_member(ec->ec_member_indexes, i)) >= 0)
 		{
-			EquivalenceMember *em = (EquivalenceMember *) lfirst(lc2);
+			EquivalenceMember *em = (EquivalenceMember *) list_nth(root->eq_members, i);
 			Expr	   *em_expr;
 
 			/* Don't match constants */
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index a96f2ee8c6..2060b73686 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -430,7 +430,7 @@ _outEquivalenceClass(StringInfo str, const EquivalenceClass *node)
 	WRITE_NODE_FIELD(ec_opfamilies);
 	WRITE_OID_FIELD(ec_collation);
-	WRITE_NODE_FIELD(ec_members);
+	WRITE_BITMAPSET_FIELD(ec_member_indexes);
 	WRITE_NODE_FIELD(ec_sources);
 	WRITE_NODE_FIELD(ec_derives);
 	WRITE_BITMAPSET_FIELD(ec_relids);
diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index a5c44adc6c..5953b79fe8 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -423,16 +423,17 @@ print_expr(const Node *expr, const List *rtable)
  * pathkeys list of PathKeys
  */
 void
-print_pathkeys(const List *pathkeys, const List *rtable)
+print_pathkeys(const PlannerInfo *root, const List *pathkeys,
+			   const List *rtable)
 {
-	const ListCell *i;
+	const ListCell *lc;
 
 	printf("(");
-	foreach(i, pathkeys)
+	foreach(lc, pathkeys)
 	{
-		PathKey    *pathkey = (PathKey *) lfirst(lc);
+
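[Editor's note: the core pattern in the patch above — replacing a `foreach` over `ec_members` with a `bms_next_member()` scan over a bitmapset of indexes into a flat `eq_members` array — can be illustrated with a toy single-word set. Everything here is a simplified stand-in: PostgreSQL's Bitmapset spans multiple words, and the hypothetical `next_member()` below merely mimics the `bms_next_member()` iteration contract.]

```c
#include <stdint.h>

/*
 * Toy single-word bitmapset.  next_member(set, prev) returns the
 * smallest member greater than prev, or -1 when none remain -- the
 * same contract as PostgreSQL's bms_next_member(), so callers loop
 * with "while ((i = next_member(set, i)) >= 0)" starting from i = -1.
 */
static int
next_member(uint64_t set, int prev)
{
	for (int i = prev + 1; i < 64; i++)
	{
		if (set & (UINT64_C(1) << i))
			return i;
	}
	return -1;
}
```

With `ec_member_indexes` as such a set, each member lookup becomes `list_nth(root->eq_members, i)`, and the minimal set of candidate members for a relation can be found up front by masking (intersecting) the class's set with a RelOptInfo's set, which is the optimization David describes but has not yet added.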
Re: ExecRTCheckPerms() and many prunable partitions
... One more thing: maybe we should rethink where to put extraUpdatedCols. Between the facts that it's not used for actual permissions checks, and that it's calculated by the rewriter not parser, it doesn't seem like it really belongs in RelPermissionInfo. Should we keep it in RangeTblEntry? Should it go somewhere else entirely? I'm just speculating, but now is a good time to think about it. regards, tom lane
Re: ExecRTCheckPerms() and many prunable partitions
Amit Langote writes:
> [ v16 patches ]

I took a quick look at this ...

I think that the notion behind MergeRelPermissionInfos, ie that a single
RelPermissionInfo can represent *all* the checks for a given table OID,
is fundamentally wrong.  For example, when merging a view into an outer
query that references a table also used by the view, the checkAsUser
fields might be different, and the permissions to check might be
different, and the columns those permissions need to hold for might be
different.  Blindly bms_union'ing the column sets will lead to requiring
far more permissions than the query should require.  Conversely, this
approach could lead to allowing cases we should reject, if you happen to
"merge" checkAsUser in a way that ends in checking as a higher-privilege
user than should be checked.

I'm inclined to think that you should abandon the idea of merging
RelPermissionInfos at all.  It can only buy us much in the case of
self-joins, which ought to be rare.  It'd be better to just say "there
is one RelPermissionInfo for each RTE requiring any sort of permissions
check".  Either that or you need to complicate RelPermissionInfo a lot,
but I don't see the advantage.

It'd likely be better to rename ExecutorCheckPerms_hook, say to
ExecCheckPermissions_hook given the rename of ExecCheckRTPerms.  As it
stands, it *looks* like the API of that hook has not changed, when it
has.  Better to break calling code visibly than to make people debug
their way to an understanding that the List contents are no longer what
they expected.  A different idea could be to pass both the rangetable
and relpermlist, again making the API break obvious (and who's to say a
hook might not still want the rangetable?)

In parsenodes.h:

+	List	   *relpermlist;	/* list of RTEPermissionInfo nodes for
+								 * the RTE_RELATION entries in rtable */

I find this comment not very future-proof, if indeed it's strictly
correct even today.
Maybe better "list of RelPermissionInfo nodes for rangetable entries having perminfoindex > 0". Likewise for the comment in RangeTableEntry: there's no compelling reason to assume that all and only RELATION RTEs will have RelPermissionInfo. Even if that remains true at parse time it's falsified during planning. Also note typo in node name: that comment is the only reference to "RTEPermissionInfo" AFAICS. Although, given the redefinition I suggest above, arguably "RTEPermissionInfo" is the better name? I'm confused as to why RelPermissionInfo.inh exists. It doesn't seem to me that permissions checking should care about child rels. Why did you add checkAsUser to ForeignScan (and not any other scan plan nodes)? At best that's pretty asymmetric, but it seems mighty bogus: under what circumstances would an FDW need to know that but not any of the other RelPermissionInfo fields? This seems to indicate that someplace we should work harder at making the RelPermissionInfo list available to FDWs. (CustomScan providers might have similar issues, btw.) I've not looked at much of the actual code, just the .h file changes. Haven't studied 0002 either. regards, tom lane
Re: out of date comment in commit_ts.c
On Tue, Jul 26, 2022 at 10:33:43AM -0700, Nathan Bossart wrote:
> IIUC the ability for callers to request WAL record generation is no
> longer possible as of 08aa89b [0].  Should the second sentence be
> removed?

Here's a patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 4dc8d402bd..9aa4675cb7 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -7,13 +7,10 @@
  * for each transaction.
  *
  * XLOG interactions: this module generates an XLOG record whenever a new
- * CommitTs page is initialized to zeroes.  Also, one XLOG record is
- * generated for setting of values when the caller requests it; this allows
- * us to support values coming from places other than transaction commit.
- * Other writes of CommitTS come from recording of transaction commit in
- * xact.c, which generates its own XLOG records for these events and will
- * re-perform the status update on redo; so we need make no additional XLOG
- * entry here.
+ * CommitTs page is initialized to zeroes.  Other writes of CommitTS come
+ * from recording of transaction commit in xact.c, which generates its own
+ * XLOG records for these events and will re-perform the status update on
+ * redo; so we need make no additional XLOG entry here.
  *
  * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
Re: pgsql: Remove the restriction that the relmap must be 512 bytes.
Robert Haas writes: > In short, I think Alvaro's idea is unprincipled but solves the problem > of making it clear that a new initdb is required. Your idea is > principled but does not solve any problem. [ shrug... ] Fair enough. regards, tom lane
Re: pgsql: Remove the restriction that the relmap must be 512 bytes.
On Wed, Jul 27, 2022 at 3:17 PM Tom Lane wrote: > Alvaro Herrera writes: > >> Hmm, but that doesn't actually produce nice behavior. It just goes > >> into an infinite loop, like this: > >> So now I'm back to being unsure what to do here. > > > I vote to go for the catversion bump for now. > > What this is showing us is that any form of corruption in the relmapper > file causes very unpleasant behavior. We probably had better do something > about that, independently of this issue. I'm not sure how important that is, but it certainly wouldn't hurt. > In the meantime, I still think bumping the file magic number is a better > answer. It won't make the behavior any worse for un-updated code than > it is already. But it also won't make it any better, so why even bother? The goal of catversion bumps is to replace crashes or unpredictable misbehavior with a nice error message that tells you exactly what the problem is. Here we'd just be replacing an infinite series of crashes with an infinite series of crashes with a slightly different error message. It's probably worth comparing those error messages: FATAL: could not read file "global/pg_filenode.map": read 512 of 524 FATAL: relation mapping file "global/pg_filenode.map" contains invalid data The first message is what you get now. The second message is what you get with the proposed change to the magic number. I would argue that the second message is actually worse than the first one, because the first one actually gives you some hint what the problem is, whereas the second one really just says that an unspecified bad thing happened. In short, I think Alvaro's idea is unprincipled but solves the problem of making it clear that a new initdb is required. Your idea is principled but does not solve any problem. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Allow pageinspect's bt_page_stats function to return a set of rows instead of a single row
Hamid Akhtar writes: > That attached patch is based on the master branch. It makes the following > changes to the pageinspect contrib module: > - Updates bt_page_stats_internal function to accept 3 arguments instead of > 2. > - The function now uses SRF macros to return a set rather than a single > row. The function call now requires specifying column names. FWIW, I think you'd be way better off changing the function name, say to bt_multi_page_stats(). Overloading the name this way is going to lead to great confusion, e.g. somebody who fat-fingers the number of output arguments in a JDBC call could see confusing results due to invoking the wrong one of the two functions. Also, I'm not quite sure what you mean by "The function call now requires specifying column names", but it doesn't sound like an acceptable restriction from a compatibility standpoint. If a different name dodges that issue then it's clearly a better way. regards, tom lane
Official Windows Installer and Documentation
Hey,

Just interacted with a frustrated user on Slack trying to upgrade from v13
to v14 on Windows. Our official download page for the Windows installer
claims the core documentation as its official reference - can someone
responsible for this area please suggest and test some changes to make this
reality more acceptable.

The particular point that was brought up is our documentation for
pg_upgrade says:

RUNAS /USER:postgres "CMD.EXE"
SET PATH=%PATH%;C:\Program Files\PostgreSQL\14\bin;

The problem is apparently (I haven't personally tested) that our official
installer doesn't bother to create the postgres operating system user
account. It is also unclear whether the defaults for pg_hba.conf add some
kind of bad interaction here should one fix this particular problem. And
then there is the issue of file ownership.

Assuming we want better documentation for this specific issue for
back-patching, what would that look like? Going forward, should our
installer be creating the postgres user for consistency with other
platforms or not?

I suggest adding relevant discussion about this particular official binary
distribution to:
https://www.postgresql.org/docs/current/install-binaries.html

David J.
Re: pgsql: Remove the restriction that the relmap must be 512 bytes.
Alvaro Herrera writes:
>> Hmm, but that doesn't actually produce nice behavior. It just goes
>> into an infinite loop, like this:
>> So now I'm back to being unsure what to do here.

> I vote to go for the catversion bump for now.

What this is showing us is that any form of corruption in the relmapper
file causes very unpleasant behavior. We probably had better do something
about that, independently of this issue.

In the meantime, I still think bumping the file magic number is a better
answer. It won't make the behavior any worse for un-updated code than it
is already.

			regards, tom lane
Re: Allow foreign keys to reference a superset of unique columns
Kaiting Chen writes:
> I'd like to propose a change to PostgreSQL to allow the creation of a foreign
> key constraint referencing a superset of uniquely constrained columns.

TBH, I think this is a fundamentally bad idea and should be rejected
outright. It fuzzes the semantics of the FK relationship, and I'm not
convinced that there are legitimate use-cases. Your example schema could
easily be dismissed as bad design that should be done some other way.

For one example of where the semantics get fuzzy, it's not very clear how
the extra-baggage columns ought to participate in CASCADE updates.
Currently, if we have

CREATE TABLE foo (a integer PRIMARY KEY, b integer);

then an update that changes only foo.b doesn't need to update referencing
tables, and I think we even have optimizations that assume that if no
unique-key columns are touched then RI checks need not be made. But if
you did

CREATE TABLE bar (x integer, y integer,
                  FOREIGN KEY (x, y) REFERENCES foo(a, b) ON UPDATE CASCADE);

then perhaps you expect bar.y to be updated ... or maybe you don't?

Another example is that I think the idea is only well-defined when the
subset column(s) are a primary key, or at least all marked NOT NULL.
Otherwise they're not as unique as you're claiming. But then the FK
constraint really has to be dependent on a PK constraint not just an index
definition, since indexes in themselves don't enforce not-nullness. That
gets back to Peter's complaint that referring to an index isn't good
enough.

Anyway, seeing that the patch touches neither ri_triggers.c nor any
regression tests, I find it hard to believe that such semantic questions
have been thought through.

It's also unclear to me how this ought to interact with the
information_schema views concerning foreign keys. We generally feel that
we don't want to present any non-SQL-compatible data in
information_schema, for fear that it will confuse applications that expect
to see SQL-spec behavior there.
So do we leave such FKs out of the views altogether, or show only the columns involving the associated unique constraint? Neither answer seems pleasant. FWIW, the patch is currently failing to apply per the cfbot. I think you don't need to manually update backend/nodes/ anymore, but the gram.y changes look to have been sideswiped by some other recent commit. regards, tom lane
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Okay, I think I'm done with this. Here's v27 for the master branch, where
I fixed some comments as well as thinkos in the test script. The ones on
older branches aren't materially different, they just have tonnes of
conflicts resolved. I'll get this pushed tomorrow morning.

I have run it through CI and it seems ... not completely broken, at least,
but I have no working recipes for Windows on branches 14 and older, so it
doesn't really work fully. If anybody does, please share. You can see
mine here
https://github.com/alvherre/postgres/commits/REL_11_STABLE [etc]
https://cirrus-ci.com/build/5320904228995072
https://cirrus-ci.com/github/alvherre/postgres

-- 
Álvaro Herrera        PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Every machine is a smoke machine if you operate it wrong enough."
https://twitter.com/libseybieda/status/1541673325781196801

>From b84f66975d664c45babd878a43c67601b7f7d2b6 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Wed, 27 Jul 2022 20:22:21 +0200
Subject: [PATCH v27] Fix replay of create database records on standby
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Crash recovery on standby may encounter missing directories when replaying
database-creation WAL records.  Prior to this patch, the standby would
fail to recover in such a case; however, the directories could be
legitimately missing.  Consider the following sequence of commands:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, crash recovery must be able to continue.

A fix for this problem was already attempted in 49d9cfc68bf4, but it was
reverted because of design issues.  This new version is based on Robert
Haas' proposal: any missing tablespaces are created during recovery before
reaching consistency.  Tablespaces are created as real directories, and
should be deleted by later replay.
CheckRecoveryConsistency ensures they have disappeared.

The problems detected by this new code are reported as PANIC, except when
allow_in_place_tablespaces is set to ON, in which case they are WARNING.
Apart from making tests possible, this gives users an escape hatch in case
things don't go as planned.

Author: Kyotaro Horiguchi
Author: Asim R Praveen
Author: Paul Guo
Reviewed-by: Anastasia Lubennikova (older versions)
Reviewed-by: Fujii Masao (older versions)
Reviewed-by: Michaël Paquier
Diagnosed-by: Paul Guo
Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27dn5xwj2p9-roh16e4...@mail.gmail.com
---
 src/backend/access/transam/xlogrecovery.c   |  50 ++
 src/backend/commands/dbcommands.c           |  77 +
 src/backend/commands/tablespace.c           |  40 ++---
 src/test/recovery/t/033_replay_tsp_drops.pl | 169
 4 files changed, 305 insertions(+), 31 deletions(-)
 create mode 100644 src/test/recovery/t/033_replay_tsp_drops.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index e383c2123a..27e02fbfcd 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -42,6 +42,7 @@
 #include "access/xlogutils.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
@@ -2008,6 +2009,47 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI)
 	}
 }
 
+/*
+ * Verify that, in non-test mode, ./pg_tblspc doesn't contain any real
+ * directories.
+ *
+ * Replay of database creation XLOG records for databases that were later
+ * dropped can create fake directories in pg_tblspc.  By the time consistency
+ * is reached these directories should have been removed; here we verify
+ * that this did indeed happen.  This is to be called at the point where
+ * consistent state is reached.
+ *
+ * allow_in_place_tablespaces turns the PANIC into a WARNING, which is
+ * useful for testing purposes, and also allows for an escape hatch in case
+ * things go south.
+ */
+static void
+CheckTablespaceDirectory(void)
+{
+	DIR		   *dir;
+	struct dirent *de;
+
+	dir = AllocateDir("pg_tblspc");
+	while ((de = ReadDir(dir, "pg_tblspc")) != NULL)
+	{
+		char		path[MAXPGPATH + 10];
+
+		/* Skip entries of non-oid names */
+		if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
+			continue;
+
+		snprintf(path, sizeof(path), "pg_tblspc/%s", de->d_name);
+
+		if (get_dirent_type(path, de, false, ERROR) != PGFILETYPE_LNK)
+			ereport(allow_in_place_tablespaces ? WARNING : PANIC,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("unexpected directory entry \"%s\" found in %s",
+							de->d_name, "pg_tblspc/"),
+					 errdetail("All directory entries in pg_tblspc/ should be symbolic links."),
+					 errhint("Remove those directories, or set
Re: pgsql: Remove the restriction that the relmap must be 512 bytes.
On 2022-Jul-27, Robert Haas wrote:

> Hmm, but that doesn't actually produce nice behavior. It just goes
> into an infinite loop, like this:
> 2022-07-27 14:21:12.869 EDT [32853] FATAL: relation mapping file
> "global/pg_filenode.map" contains invalid data

This seems almost identical to what happens without the version number
change, so I wouldn't call it much of an improvement.

> While I agree that changing a version identifier that is more closely
> related to what got changed is better than changing a generic one, I
> think people won't like an infinite loop that spews messages into the
> log at top speed as a way of telling them about the problem.
>
> So now I'm back to being unsure what to do here.

I vote to go for the catversion bump for now.

Maybe it's possible to patch the relmapper code afterwards, so that if a
version mismatch is detected the server is stopped with a reasonable
message. A hypothetical improvement would be to keep the code to read the
old version and upgrade the file automatically, but given the number of
times that this file has changed, it's likely pointless effort. Therefore,
my proposal is to add a comment next to the struct definition suggesting
to bump catversion and call it a day.

-- 
Álvaro Herrera        PostgreSQL Developer  —  https://www.EnterpriseDB.com/
small windows psqlrc re-wording
Howdy folks, The attached patch tweaks the wording around finding the psqlrc file on windows, with the primary goal of removing the generally incorrect statement that windows has no concept of a home directory. Robert Treat https://xzilla.net windows-psqlrc.patch Description: Binary data
Re: pgsql: Remove the restriction that the relmap must be 512 bytes.
On Wed, Jul 27, 2022 at 2:13 PM Robert Haas wrote:
> On Wed, Jul 27, 2022 at 1:42 PM Tom Lane wrote:
> > If there's a magic number, then I'd (a) change that and (b) adjust
> > whatever comments led you to think you shouldn't. Bumping catversion
> > is a good fallback choice when there's not any more-proximate version
> > indicator, but here there is.
>
> Maybe I just got cold feet because it doesn't ever seem to have been
> changed before, because the definition says:
>
> #define RELMAPPER_FILEMAGIC 0x592717 /* version ID value */
>
> And the fact that "version" is in there sure makes it seem like a
> version number, now that I look again.
>
> I'll add 1 to the value.

Hmm, but that doesn't actually produce nice behavior. It just goes into an
infinite loop, like this:

2022-07-27 14:21:12.826 EDT [32849] LOG: database system was interrupted; last known up at 2022-07-27 14:21:12 EDT
2022-07-27 14:21:12.860 EDT [32849] LOG: database system was not properly shut down; automatic recovery in progress
2022-07-27 14:21:12.861 EDT [32849] LOG: invalid record length at 0/14B3BB8: wanted 24, got 0
2022-07-27 14:21:12.861 EDT [32849] LOG: redo is not required
2022-07-27 14:21:12.864 EDT [32850] LOG: checkpoint starting: end-of-recovery immediate wait
2022-07-27 14:21:12.865 EDT [32850] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=2, longest=0.001 s, average=0.001 s; distance=0 kB, estimate=0 kB; lsn=0/14B3BB8, redo lsn=0/14B3BB8
2022-07-27 14:21:12.868 EDT [31930] LOG: database system is ready to accept connections
2022-07-27 14:21:12.869 EDT [32853] FATAL: relation mapping file "global/pg_filenode.map" contains invalid data
2022-07-27 14:21:12.869 EDT [32854] FATAL: relation mapping file "global/pg_filenode.map" contains invalid data
2022-07-27 14:21:12.870 EDT [31930] LOG: autovacuum launcher process (PID 32853) exited with exit code 1
2022-07-27 14:21:12.870 EDT [31930] LOG: terminating any other active server processes
2022-07-27 14:21:12.870 EDT [31930] LOG: background worker "logical replication launcher" (PID 32854) exited with exit code 1
2022-07-27 14:21:12.871 EDT [31930] LOG: all server processes terminated; reinitializing

While I agree that changing a version identifier that is more closely
related to what got changed is better than changing a generic one, I think
people won't like an infinite loop that spews messages into the log at top
speed as a way of telling them about the problem.

So now I'm back to being unsure what to do here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: Hash index build performance tweak from sorting
Simon Riggs writes: > [ hash_sort_by_hash.v2.patch ] The cfbot says this no longer applies --- probably sideswiped by Korotkov's sorting-related commits last night. regards, tom lane
Re: pgsql: Remove the restriction that the relmap must be 512 bytes.
On Wed, Jul 27, 2022 at 1:42 PM Tom Lane wrote:
> If there's a magic number, then I'd (a) change that and (b) adjust
> whatever comments led you to think you shouldn't. Bumping catversion
> is a good fallback choice when there's not any more-proximate version
> indicator, but here there is.

Maybe I just got cold feet because it doesn't ever seem to have been
changed before, because the definition says:

#define RELMAPPER_FILEMAGIC 0x592717 /* version ID value */

And the fact that "version" is in there sure makes it seem like a version
number, now that I look again.

I'll add 1 to the value.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: making relfilenodes 56 bits
On Wed, Jul 27, 2022 at 12:37 PM Dilip Kumar wrote: > Just realised that this should have been BufferTagsEqual instead of > BufferTagEqual > > I will modify this and send an updated patch tomorrow. I changed it and committed. What was formerly 0002 will need minor rebasing. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Use -fvisibility=hidden for shared libraries
Andres Freund writes: > On 2022-07-18 00:05:16 -0700, Andres Freund wrote: >> Given that that's just about all compilers we support using configure, >> perhaps >> we should just move that out of the compiler specific section? Doesn't look >> like there's much precedent for that so far... > Here's a potential patch along those lines. Now that the dust from the main patch is pretty well settled, +1 for trying that. > I wonder if we also should move the -fno-strict-aliasing, -fwrapv tests > out. But that'd be something for later. Those seem less likely to be portable to non-gcc-alike compilers. On the other hand, maybe it'd be interesting to just remove the conditionality temporarily and try ALL the switches on all compilers, just to see what we can learn from the buildfarm. regards, tom lane
Re: pgsql: Remove the restriction that the relmap must be 512 bytes.
Robert Haas writes: > On Wed, Jul 27, 2022 at 1:19 PM Alvaro Herrera > wrote: >> Another thing that seems to have happened here is that catversion ought >> to have been touched and wasn't. > Hmm, interesting. I didn't think about bumping catversion because I > didn't change anything in the catalogs. I did think about changing the > magic number for the file at one point, but unlike some of our other > constants, there's no indication that this one is intended to be used > as a version number. But in retrospect it would have been good to > change something somewhere. If you want me to bump catversion now, I > can. If you or someone else wants to do it, that's also fine. If there's a magic number, then I'd (a) change that and (b) adjust whatever comments led you to think you shouldn't. Bumping catversion is a good fallback choice when there's not any more-proximate version indicator, but here there is. regards, tom lane
Re: pgsql: Remove the restriction that the relmap must be 512 bytes.
On Wed, Jul 27, 2022 at 1:19 PM Alvaro Herrera wrote:
> Another thing that seems to have happened here is that catversion ought
> to have been touched and wasn't. Trying to start a cluster that was
> initdb'd with the previous code enters an infinite loop that dies each
> time with
>
> 2022-07-27 19:17:27.589 CEST [2516547] LOG: database system is ready to accept connections
> 2022-07-27 19:17:27.589 CEST [2516730] FATAL: could not read file "global/pg_filenode.map": read 512 of 524
> 2022-07-27 19:17:27.589 CEST [2516731] FATAL: could not read file "global/pg_filenode.map": read 512 of 524
> 2022-07-27 19:17:27.589 CEST [2516547] LOG: autovacuum launcher process (PID 2516730) exited with exit code 1
> 2022-07-27 19:17:27.589 CEST [2516547] LOG: terminating any other active server processes
>
> Perhaps we should still do a catversion bump now, since one hasn't
> happened since the commit.

Hmm, interesting. I didn't think about bumping catversion because I didn't
change anything in the catalogs. I did think about changing the magic
number for the file at one point, but unlike some of our other constants,
there's no indication that this one is intended to be used as a version
number. But in retrospect it would have been good to change something
somewhere. If you want me to bump catversion now, I can. If you or
someone else wants to do it, that's also fine.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: pgsql: Remove the restriction that the relmap must be 512 bytes.
On 2022-Jul-26, Robert Haas wrote:

> Remove the restriction that the relmap must be 512 bytes.
>
> Instead of relying on the ability to atomically overwrite the
> entire relmap file in one shot, write a new one and durably
> rename it into place. Removing the struct padding and the
> calculation showing why the map is exactly 512 bytes, and change
> the maximum number of entries to a nearby round number.

Another thing that seems to have happened here is that catversion ought to
have been touched and wasn't. Trying to start a cluster that was initdb'd
with the previous code enters an infinite loop that dies each time with

2022-07-27 19:17:27.589 CEST [2516547] LOG: database system is ready to accept connections
2022-07-27 19:17:27.589 CEST [2516730] FATAL: could not read file "global/pg_filenode.map": read 512 of 524
2022-07-27 19:17:27.589 CEST [2516731] FATAL: could not read file "global/pg_filenode.map": read 512 of 524
2022-07-27 19:17:27.589 CEST [2516547] LOG: autovacuum launcher process (PID 2516730) exited with exit code 1
2022-07-27 19:17:27.589 CEST [2516547] LOG: terminating any other active server processes

Perhaps we should still do a catversion bump now, since one hasn't
happened since the commit.

-- 
Álvaro Herrera        48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"El número de instalaciones de UNIX se ha elevado a 10, y se espera que
este número aumente" (UPM, 1972)
Re: Reducing planning time on tables with many indexes
I wrote: > Unfortunately, as things stand today, the planner needs more than the > right to look at the indexes' schemas, because it makes physical accesses > to btree indexes to find out their tree height (and I think there are some > comparable behaviors in other AMs). I've never particularly cared for > that implementation, and would be glad to rip out that behavior if we can > find another way. Maybe we could persuade VACUUM or ANALYZE to store that > info in the index's pg_index row, or some such, and then the planner > could use it with no lock? A first step here could just be to postpone fetching _bt_getrootheight() until we actually need it during cost estimation. That would avoid the need to do it at all for indexes that indxpath.c discards as irrelevant, which is a decision made on considerably more information than the proposed patch uses. Having done that, you could look into revising plancat.c to fill the IndexOptInfo structs from catcache entries instead of opening the index per se. (You'd have to also make sure that the appropriate index locks are acquired eventually, for indexes the query does use at runtime. I think that's the case, but I'm not sure if anything downstream has been optimized on the assumption the planner did it.) This'd probably get us a large part of the way there. Further optimization of acquisition of tree height etc could be an optional follow-up. regards, tom lane
Re: Reducing planning time on tables with many indexes
David Geier writes:
> We tracked down the root cause of this slowdown to lock contention in
> 'get_relation_info()'. The index lock of every single index of every single
> table used in that query is acquired. We attempted a fix by pre-filtering
> out all indexes that anyways cannot be used with a certain query, without
> taking the index locks (credits to Luc Vlaming for idea and
> implementation). The patch does so by caching the columns present in every
> index, inside 'struct Relation', similarly to 'rd_indexlist'.

I wonder how much thought you gave to the costs imposed by that extra
cache space. We have a lot of users who moan about relcache bloat
already.

But more to the point, I do not buy the assumption that an index's set of
columns is a good filter for which indexes are of interest. A trivial
counterexample from the regression database is

regression=# explain select count(*) from tenk1;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Aggregate  (cost=219.28..219.29 rows=1 width=8)
   ->  Index Only Scan using tenk1_hundred on tenk1  (cost=0.29..194.28 rows=10000 width=0)
(2 rows)

It looks to me like the patch also makes unwarranted assumptions about
being able to discard all but the smallest index having a given set of
columns. This would, for example, possibly lead to dropping the index
that has the most useful sort order, or that has the operator class
needed to support a specific WHERE clause.

In short, I'm not sure I buy this concept at all. I think it might be
more useful to attack the locking overhead more directly. I kind of
wonder why we need per-index locks at all during planning --- I think
that we already interpret AccessShareLock on the parent table as being
sufficient to block schema changes on existing indexes.

Unfortunately, as things stand today, the planner needs more than the
right to look at the indexes' schemas, because it makes physical accesses
to btree indexes to find out their tree height (and I think there are
some comparable behaviors in other AMs).
I've never particularly cared for that implementation, and would be glad to rip out that behavior if we can find another way. Maybe we could persuade VACUUM or ANALYZE to store that info in the index's pg_index row, or some such, and then the planner could use it with no lock? regards, tom lane
Re: making relfilenodes 56 bits
On Wed, 27 Jul 2022 at 9:49 PM, Dilip Kumar wrote:
> On Wed, Jul 27, 2022 at 12:07 AM Robert Haas wrote:
> > On Tue, Jul 26, 2022 at 2:07 AM Dilip Kumar wrote:
> > > I have thought about it while doing so but I am not sure whether it is
> > > a good idea or not, because before my change these all were macros
> > > with 2 naming conventions so I just changed to inline function so why
> > > to change the name.
> >
> > Well, the reason to change the name would be for consistency. It feels
> > weird to have some NAMES_LIKETHIS() and other NamesLikeThis().
> >
> > Now, an argument against that is that it will make back-patching more
> > annoying, if any code using these functions/macros is touched. But
> > since the calling sequence is changing anyway (you now have to pass a
> > pointer rather than the object itself) that argument doesn't really
> > carry any weight. So I would favor ClearBufferTag(), InitBufferTag(),
> > etc.
>
> Okay, so I have renamed these 2 functions and BUFFERTAGS_EQUAL as well
> to BufferTagEqual().

Just realised that this should have been BufferTagsEqual instead of
BufferTagEqual.

I will modify this and send an updated patch tomorrow.

—
Dilip

> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
Re: generic plans and "initial" pruning
On Tue, Jul 26, 2022 at 11:01 PM Amit Langote wrote: > Needed to be rebased again, over 2d04277121f this time. 0001 adds es_part_prune_result but does not use it, so maybe the introduction of that field should be deferred until it's needed for something. I wonder whether it's really necessary to added the PartitionPruneInfo objects to a list in PlannerInfo first and then roll them up into PlannerGlobal later. I know we do that for range table entries, but I've never quite understood why we do it that way instead of creating a flat range table in PlannerGlobal from the start. And so by extension I wonder whether this table couldn't be flat from the start also. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Request for assistance to backport CVE-2022-1552 fixes to 9.6 and 9.4
Hello pgsql-hackers,

Is there anyone willing to review the patches that I prepared? I'd have
substantially more confidence in the patches with a review from an
upstream developer who is familiar with the code.

Regards,

-Roberto

On Mon, Jul 04, 2022 at 06:06:58PM -0400, Roberto C. Sánchez wrote:
> On Wed, Jun 08, 2022 at 05:31:22PM -0400, Roberto C. Sánchez wrote:
> > On Wed, Jun 08, 2022 at 04:15:47PM -0400, Tom Lane wrote:
> > > Roberto =?iso-8859-1?Q?C=2E_S=E1nchez?= writes:
> > > > I am investigating backporting the fixes for CVE-2022-1552 to 9.6 and
> > > > 9.4 as part of Debian LTS and Extended LTS. I am aware that these
> > > > releases are no longer supported upstream, but I have made an attempt at
> > > > adapting commits ef792f7856dea2576dcd9cab92b2b05fe955696b and
> > > > f26d5702857a9c027f84850af48b0eea0f3aa15c from the REL_10_STABLE branch.
> > > > I would appreciate a review of the attached patches and any comments on
> > > > things that may have been missed and/or adapted improperly.
> > >
> > > FWIW, I would not recommend being in a huge hurry to back-port those
> > > changes, pending the outcome of this discussion:
> > >
> > > https://www.postgresql.org/message-id/flat/f8a4105f076544c180a87ef0c4822352%40stmuk.bayern.de
> >
> > Thanks for the pointer.
> >
> > > We're going to have to tweak that code somehow, and it's not yet
> > > entirely clear how.
> >
> > I will monitor the discussion to see what comes of it.
>
> Based on the discussion in the other thread, I have made an attempt to
> backport commit 88b39e61486a8925a3861d50c306a51eaa1af8d6 to 9.6 and 9.4.
> The only significant change that I had to make was to add the full
> function signatures for the REVOKE/GRANT in the citext test.
>
> One question that I had about the change as committed is whether a
> REVOKE is needed on s.citext_ne, like so:
>
> REVOKE ALL ON FUNCTION s.citext_ne FROM PUBLIC;
>
> Or (for pre-10):
>
> REVOKE ALL ON FUNCTION s.citext_ne(s.citext, s.citext) FROM PUBLIC;
>
> I ask because the comment immediately preceding the sequence of REVOKEs
> includes the comment "Revoke all conceivably-relevant ACLs within the
> extension." I am not especially knowledgable about deep internals, but
> that function seems like it would belong in the same group with the
> others.
>
> In any event, would someone be willing to review the attached patches
> for correctness? I would like to shortly publish updates to 9.6 and 9.4
> in Debian and a review would be most appreciated.
>
> Regards,
>
> -Roberto
>
> --
> Roberto C. Sánchez

> From ef792f7856dea2576dcd9cab92b2b05fe955696b Mon Sep 17 00:00:00 2001
> From: Noah Misch
> Date: Mon, 9 May 2022 08:35:08 -0700
> Subject: [PATCH] Make relation-enumerating operations be security-restricted
>  operations.
>
> When a feature enumerates relations and runs functions associated with
> all found relations, the feature's user shall not need to trust every
> user having permission to create objects. BRIN-specific functionality
> in autovacuum neglected to account for this, as did pg_amcheck and
> CLUSTER. An attacker having permission to create non-temp objects in at
> least one schema could execute arbitrary SQL functions under the
> identity of the bootstrap superuser. CREATE INDEX (not a
> relation-enumerating operation) and REINDEX protected themselves too
> late. This change extends to the non-enumerating amcheck interface.
> Back-patch to v10 (all supported versions).
>
> Sergey Shinderuk, reviewed (in earlier versions) by Alexander Lakhin.
> Reported by Alexander Lakhin.
>
> Security: CVE-2022-1552
> ---
>  src/backend/access/brin/brin.c           | 30 -
>  src/backend/catalog/index.c              | 41 +--
>  src/backend/commands/cluster.c           | 35
>  src/backend/commands/indexcmds.c         | 53 +--
>  src/backend/utils/init/miscinit.c        | 24 --
>  src/test/regress/expected/privileges.out | 42
>  src/test/regress/sql/privileges.sql      | 36 +
>  7 files changed, 231 insertions(+), 30 deletions(-)
>
> --- a/src/backend/access/brin/brin.c
> +++ b/src/backend/access/brin/brin.c
> @@ -28,6 +28,7 @@
>  #include "pgstat.h"
>  #include "storage/bufmgr.h"
>  #include "storage/freespace.h"
> +#include "utils/guc.h"
>  #include "utils/index_selfuncs.h"
>  #include "utils/memutils.h"
>  #include "utils/rel.h"
> @@ -786,6 +787,9 @@
>  	Oid			heapoid;
>  	Relation	indexRel;
>  	Relation	heapRel;
> +	Oid			save_userid;
> +	int			save_sec_context;
> +	int			save_nestlevel;
>  	double		numSummarized = 0;
>
>  	if (RecoveryInProgress())
> @@ -799,10 +803,28 @@
>  	 * passed indexoid isn't an index then IndexGetRelation() will fail.
>  	 * Rather than emitting a
Re: [Commitfest 2022-07] Patch Triage: Waiting on Author
On Tue, Jul 26, 2022 at 7:47 PM James Coleman wrote: > These are both mine, and I'd hoped to work on them this CF, but I've > been sufficiently busy that that hasn't happened. > > I'd like to just move these to the next CF. Well, if we mark them returned with feedback now, and you get time to work on them, you can always change the status back to something else at that point. That has the advantage that, if you don't get time to work on them, they're not cluttering up the next CF in the meantime. We're not doing a great job kicking things out of the CF when they are non-actionable, and thereby we are making life harder for ourselves collectively. -- Robert Haas EDB: http://www.enterprisedb.com
Re: fairywren hung in pg_basebackup tests
On 2022-07-27 We 10:58, Tom Lane wrote:
> Andrew Dunstan writes:
>> The alternative I thought of would be to switch msys to using our
>> dirent.c. Probably not too hard, but certainly more work than reverting.
> If you ask me, the shortest-path general-purpose fix is to insert
>
> #if MSYS
>     if (pgwin32_is_junction(path))
>         return PGFILETYPE_DIR;
> #endif
>
> at the start of get_dirent_type. (I'm not sure how to spell the
> #if test.) We could look at using dirent.c later, but I think
> right now it's important to un-break the buildfarm ASAP.

+1. I think you spell it:

#if defined(WIN32) && !defined(_MSC_VER)

(c.f. libpq-be.h)

cheers

andrew

-- 
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: autoprewarm worker failing to load
Robins Tharakan writes: > 089480c077056 seems to have broken pg_prewarm. Ugh ... sure would be nice if contrib/pg_prewarm had some regression tests. > Checking pg_prewarm.so the function 'autoprewarm_main' visibility > switched from GLOBAL to LOCAL. Per [1], using PGDLLEXPORT > makes it GLOBAL again, which appears to fix the issue: Right, that's the appropriate fix. I suppose we had better look at everything else that's passed to bgw_function_name anywhere, too. regards, tom lane
Re: fairywren hung in pg_basebackup tests
Andrew Dunstan writes:
> The alternative I thought of would be to switch msys to using our
> dirent.c. Probably not too hard, but certainly more work than reverting.

If you ask me, the shortest-path general-purpose fix is to insert

    #if MSYS
        if (pgwin32_is_junction(path))
            return PGFILETYPE_DIR;
    #endif

at the start of get_dirent_type. (I'm not sure how to spell the #if test.) We could look at using dirent.c later, but I think right now it's important to un-break the buildfarm ASAP.

regards, tom lane
autoprewarm worker failing to load
Hi,

089480c077056 seems to have broken pg_prewarm. When pg_prewarm is added to shared_preload_libraries, each new connection results in thousands of errors such as this:

2022-07-27 04:25:14.325 UTC [2903955] LOG: background worker "autoprewarm leader" (PID 2904146) exited with exit code 1
2022-07-27 04:25:14.325 UTC [2904148] ERROR: could not find function "autoprewarm_main" in file "/home/ubuntu/proj/tempdel/lib/postgresql/pg_prewarm.so"

Checking pg_prewarm.so, the visibility of the function 'autoprewarm_main' switched from GLOBAL to LOCAL. Per [1], using PGDLLEXPORT makes it GLOBAL again, which appears to fix the issue:

Before commit (089480c077056) -
ubuntu:~/proj/tempdel$ readelf -sW lib/postgresql/pg_prewarm.so | grep main
103: 3d79 609 FUNC GLOBAL DEFAULT 14 autoprewarm_main
109: 45ad 873 FUNC GLOBAL DEFAULT 14 autoprewarm_database_main
128: 3d79 609 FUNC GLOBAL DEFAULT 14 autoprewarm_main
187: 45ad 873 FUNC GLOBAL DEFAULT 14 autoprewarm_database_main

After commit (089480c077056) -
78: 2d79 609 FUNC LOCAL DEFAULT 14 autoprewarm_main
85: 35ad 873 FUNC LOCAL DEFAULT 14 autoprewarm_database_main

After applying the attached fix:
103: 3d79 609 FUNC GLOBAL DEFAULT 14 autoprewarm_main
84: 45ad 873 FUNC LOCAL DEFAULT 14 autoprewarm_database_main
129: 3d79 609 FUNC GLOBAL DEFAULT 14 autoprewarm_main

Please let me know your thoughts on this approach.

[1] https://www.postgresql.org/message-id/A737B7A37273E048B164557ADEF4A58B5393038C%40ntex2010a.host.magwien.gv.at

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index b2d6026093..ec619be9f2 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -82,7 +82,7 @@ typedef struct AutoPrewarmSharedState
 	int prewarmed_blocks;
 } AutoPrewarmSharedState;
 
-void autoprewarm_main(Datum main_arg);
+PGDLLEXPORT void autoprewarm_main(Datum main_arg);
 void autoprewarm_database_main(Datum main_arg);
 
 PG_FUNCTION_INFO_V1(autoprewarm_start_worker);

- Robins Tharakan
Amazon Web Services
Re: fairywren hung in pg_basebackup tests
On 2022-07-27 We 10:24, Alvaro Herrera wrote: > On 2022-Jul-27, Andrew Dunstan wrote: > >> The msys dirent.h doesn't have a d_type field at all in a struct dirent. >> I can see a number of ways of dealing with this, but the simplest seems >> to be just to revert 5344723755, at least for msys, along with a comment >> about why it's necessary. > Hmm, what other ways there are? I'm about to push a change that > duplicates the get_dirent_type call pattern and I was happy about not > having that #ifdef there. Not that it's critical, but ... The alternative I thought of would be to switch msys to using our dirent.c. Probably not too hard, but certainly more work than reverting. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: fairywren hung in pg_basebackup tests
On 2022-Jul-27, Andrew Dunstan wrote: > The msys dirent.h doesn't have a d_type field at all in a struct dirent. > I can see a number of ways of dealing with this, but the simplest seems > to be just to revert 5344723755, at least for msys, along with a comment > about why it's necessary. Hmm, what other ways there are? I'm about to push a change that duplicates the get_dirent_type call pattern and I was happy about not having that #ifdef there. Not that it's critical, but ... -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: fairywren hung in pg_basebackup tests
On 2022-07-27 We 09:47, Andrew Dunstan wrote: > On 2022-07-26 Tu 18:31, Thomas Munro wrote: >> On Tue, Jul 26, 2022 at 4:03 AM Andrew Dunstan wrote: >>> On 2022-07-25 Mo 11:24, Thomas Munro wrote: On Tue, Jul 26, 2022 at 3:08 AM Tom Lane wrote: > Hmm ... an alternative theory is that the test is fine, and what > it's telling us is that get_dirent_type() is still wrong on msys. > Would that end in this symptom? Hmm, possibly yes (if it sees a non-symlink, it'll skip it). If someone can run the test on an msys system, perhaps they could put a debugging elog() into the code modified by 9d3444dc to log d_name and the d_type that is returned? I'm struggling to understand why msys would change the answer though. >>> I have no idea either. The link exists and it is a junction. I'll see >>> about logging details. >> >From the clues so far, it seems like pgwin32_is_junction(fullpath) was >> returning true, but the following code in get_dirent_type(), which was >> supposed to be equivalent, is not reached on MSYS (though it >> apparently does on MSVC?): >> >> + if ((fd.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) != 0 && >> + (fd.dwReserved0 == IO_REPARSE_TAG_MOUNT_POINT)) >> d->ret.d_type = DT_LNK; >> >> pgwin32_is_junction() uses GetFileAttributes() and tests (attr & >> FILE_ATTRIBUTE_REPARSE_POINT) == FILE_ATTRIBUTE_REPARSE_POINT, which >> is like the first condition but lacks the dwReserved0 part. What is >> that part doing, and why would it be doing something different in MSVC >> and MSYS builds? That code came from 87e6ed7c, recently I was just >> trying to fix it by reordering the checks; oh, there was some >> discussion about that field[1]. >> >> One idea is that something about dwReserved0 or >> IO_REPARSE_TAG_MOUNT_POINT is different in the open source replacement >> system headers supplied by the MinGW project used by MSYS builds >> (right?), compared to the "real" Windows SDK's headers used by MSVC >> builds. 
>> >> Or perhaps there is some other dumb mistake, or perhaps the reparse >> point really is different, or ... I dunno, I'd probably shove a load >> of log messages in there and see what's going on. >> >> [1] >> https://www.postgresql.org/message-id/flat/CABUevEzURN%3DwC95JHvTKFJtEy0eY9rWO42yU%3D59-q8xSwm-Dug%40mail.gmail.com#ac54acd782fc849c0fe6c2c05db101dc > > dirent.c is not used on msys, only on MSVC. msys is apparently using > opendir and friends supplied by the system. > > What it does if there's a junction I'll try to find out, but it appears > that 5344723755 was conceived under a misapprehension about the > behaviour of msys. > > The msys dirent.h doesn't have a d_type field at all in a struct dirent. I can see a number of ways of dealing with this, but the simplest seems to be just to revert 5344723755, at least for msys, along with a comment about why it's necessary. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: fairywren hung in pg_basebackup tests
On 2022-07-26 Tu 18:31, Thomas Munro wrote:
> On Tue, Jul 26, 2022 at 4:03 AM Andrew Dunstan wrote:
>> On 2022-07-25 Mo 11:24, Thomas Munro wrote:
>>> On Tue, Jul 26, 2022 at 3:08 AM Tom Lane wrote:
>>>> Hmm ... an alternative theory is that the test is fine, and what
>>>> it's telling us is that get_dirent_type() is still wrong on msys.
>>>> Would that end in this symptom?
>>> Hmm, possibly yes (if it sees a non-symlink, it'll skip it). If
>>> someone can run the test on an msys system, perhaps they could put a
>>> debugging elog() into the code modified by 9d3444dc to log d_name and
>>> the d_type that is returned? I'm struggling to understand why msys
>>> would change the answer though.
>> I have no idea either. The link exists and it is a junction. I'll see
>> about logging details.
>
> From the clues so far, it seems like pgwin32_is_junction(fullpath) was
> returning true, but the following code in get_dirent_type(), which was
> supposed to be equivalent, is not reached on MSYS (though it
> apparently does on MSVC?):
>
> +    if ((fd.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) != 0 &&
> +        (fd.dwReserved0 == IO_REPARSE_TAG_MOUNT_POINT))
>          d->ret.d_type = DT_LNK;
>
> pgwin32_is_junction() uses GetFileAttributes() and tests (attr &
> FILE_ATTRIBUTE_REPARSE_POINT) == FILE_ATTRIBUTE_REPARSE_POINT, which
> is like the first condition but lacks the dwReserved0 part. What is
> that part doing, and why would it be doing something different in MSVC
> and MSYS builds? That code came from 87e6ed7c, recently I was just
> trying to fix it by reordering the checks; oh, there was some
> discussion about that field[1].
>
> One idea is that something about dwReserved0 or
> IO_REPARSE_TAG_MOUNT_POINT is different in the open source replacement
> system headers supplied by the MinGW project used by MSYS builds
> (right?), compared to the "real" Windows SDK's headers used by MSVC
> builds.
> > Or perhaps there is some other dumb mistake, or perhaps the reparse > point really is different, or ... I dunno, I'd probably shove a load > of log messages in there and see what's going on. > > [1] > https://www.postgresql.org/message-id/flat/CABUevEzURN%3DwC95JHvTKFJtEy0eY9rWO42yU%3D59-q8xSwm-Dug%40mail.gmail.com#ac54acd782fc849c0fe6c2c05db101dc dirent.c is not used on msys, only on MSVC. msys is apparently using opendir and friends supplied by the system. What it does if there's a junction I'll try to find out, but it appears that 5344723755 was conceived under a misapprehension about the behaviour of msys. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Trying to add more tests to gistbuild.c
Hi Matheus,

Many thanks for hacking on increasing the code coverage! I noticed that this patch was stuck in "Needs Review" state for some time and decided to take a look.

> With these new tests the coverage went from 45.3% to 85.5%, but I have some
> doubts:
> - Does this test make sense?
> - Would there be a way to validate that the buffering was done correctly?
> - Is this test necessary?

I can confirm that the coverage improved as stated. Personally I believe this change makes perfect sense. Although this is arguably not an ideal test for gistInitBuffering(), writing proper tests for `static` procedures is generally not an easy task. Executing the code at least once in order to make sure that it doesn't crash, doesn't throw errors and doesn't violate any Asserts() is better than doing nothing.

Here is a slightly modified patch with an added commit message. Please note that patches created with `git format-patch` are generally preferable to patches created with `git diff`.

I'm going to change the status of this patch to "Ready for Committer" shortly unless anyone has a second opinion.

--
Best regards,
Aleksander Alekseev

v3-0001-Improve-test-coverage-of-gistbuild.c.patch
Description: Binary data
Re: Skip partition tuple routing with constant partition key
On Wed, Jul 27, 2022 at 7:28 AM David Rowley wrote: > On Sat, 23 Jul 2022 at 01:23, Amit Langote wrote: > > + /* > > +* The Datum has changed. Zero the number of times > > we've > > +* found last_found_datum_index in a row. > > +*/ > > + partdesc->last_found_count = 0; > > > > + /* Zero the "winning streak" on the cache hit count */ > > + partdesc->last_found_count = 0; > > > > Might it be better for the two comments to say the same thing? Also, > > I wonder which one do you intend as the resetting of last_found_count: > > setting it to 0 or 1? I can see that the stanza at the end of the > > function sets to 1 to start a new cycle. > > I think I've addressed all of your comments. The above one in > particular caused me to make some larger changes. > > The reason I was zeroing the last_found_count in LIST partitioned > tables when the Datum was not equal to the previous found Datum was > due to the fact that the code at the end of the function was only > checking the partition indexes matched rather than the bound_offset vs > last_found_datum_index. The reason I wanted to zero this was that if > you had a partition FOR VALUES IN(1,2), and you received rows with > values alternating between 1 and 2 then we'd match to the same > partition each time, however the equality test with the current > 'values' and the Datum at last_found_datum_index would have been false > each time. If we didn't zero the last_found_count we'd have kept > using the cache path even though the Datum and last Datum wouldn't > have been equal each time. That would have resulted in always doing > the cache check and failing, then doing the binary search anyway. Thanks for the explanation. So, in a way the caching scheme works for LIST partitioning only if the same value appears consecutively in the input set, whereas it does not for *a set of* values belonging to the same partition appearing consecutively. Maybe that's a reasonable restriction for now. 
> I've now changed the code so that instead of checking the last found
> partition is the same as the last one, I'm now checking if
> bound_offset is the same as last_found_datum_index. This will be
> false in the "values alternating between 1 and 2" case from above.
> This caused me to have to change how the caching works for LIST
> partitions with a NULL partition which is receiving NULL values. I've
> coded things now to just skip the cache for that case. Finding the
> correct LIST partition for a NULL value is cheap and no need to cache
> that. I've also moved all the code which updates the cache fields to
> the bottom of get_partition_for_tuple(). I'm only expecting to do that
> when bound_offset is set by the lookup code in the switch statement.
> Any paths, e.g. HASH partitioning lookup and LIST or RANGE with NULL
> values shouldn't reach the code which updates the partition fields.
> I've added an Assert(bound_offset >= 0) to ensure that stays true.

Looks good.

> There's probably a bit more to optimise here too, but not much. I
> don't think the partdesc->last_found_part_index = -1; is needed when
> we're in the code block that does return boundinfo->default_index;
> However, that only might very slightly speedup the case when we're
> inserting continuously into the DEFAULT partition. That code path is
> also used when we fail to find any matching partition. That's not one
> we need to worry about making go faster.

So this is about:

 	if (part_index < 0)
-		part_index = boundinfo->default_index;
+	{
+		/*
+		 * Since we don't do caching for the default partition or failed
+		 * lookups, we'll just wipe the cache fields back to their initial
+		 * values.  The count becomes 0 rather than 1 as 1 means it's the
+		 * first time we've found a partition we're recording for the cache.
+		 */
+		partdesc->last_found_datum_index = -1;
+		partdesc->last_found_part_index = -1;
+		partdesc->last_found_count = 0;
+
+		return boundinfo->default_index;
+	}

I wonder why not leave the cache untouched in this case? It's possible that erratic rows only rarely occur in the input sets.

> I also ran the benchmarks again and saw that most of the use of
> likely() and unlikely() no longer did what I found them to do earlier.
> So the weirdness we saw there most likely was just down to random code
> layout changes. In this patch, I just dropped the use of either of
> those two macros.

Ah, using either seems to be trying to fit the code to one or the other pattern in the input set anyway, so it seems fine to keep them out for now.

Some minor comments:

+ * The number of times the same partition must be found in a row before we
+ * switch from a search for the given values to just checking if the values

How about: "switch from using a binary search for the given values to..."

Should the comment
Re: [PoC] Reducing planning time on tables with many indexes
Sorry, by accident I sent this one out twice. -- David Geier (ServiceNow) On Wed, Jul 27, 2022 at 2:42 PM David Geier wrote: > Hi hackers, > > We came across a slowdown in planning, where queries use tables with many > indexes. In setups with wide tables it is not uncommon to have easily > 10-100 indexes on a single table. The slowdown is already visible in serial > workloads with just a handful of indexes, but gets drastically amplified > when running queries with more indexes in parallel at high throughput. > > We measured the TPS and planning time of running parallel streams of > simple point look-up queries on a single empty table with 60 columns and 60 > indexes. The query used is 'SELECT * FROM synth_table WHERE col5 = 42'. No > rows are returned because the table is empty. We used a machine with 64 > physical CPU cores. The schema and sysbench script to reproduce these > numbers are attached. We used the TPS as reported by sysbench and obtained > planning time by running 'EXPLAIN ANALYZE' on the same query in a > separately opened connection. We averaged the planning time of 3 successive > 'EXPLAIN ANALYZE' runs. sysbench ran on the same machine with varying > numbers of threads using the following command line: > > sysbench repro.lua --db-driver=pgsql --pgsql-host=localhost > --pgsql-db=postgres --pgsql-port=? --pgsql-user=? --pgsql-password=? > --report-interval=1 --threads=64 run > > The following table shows the results. It is clearly visible that the TPS > flatten out already at 8 parallel streams, while the planning time is > increasing drastically. > > Parallel streams | TPS (before) | Planning time (before) > -|--|--- > 1| 5,486 | 0.13 ms > 2| 8,098 | 0.22 ms > 4| 15,013 | 0.19 ms > 8| 27,107 | 0.29 ms > 16 | 30,938 | 0.43 ms > 32 | 26,330 | 1.68 ms > 64 | 24,314 | 2.48 ms > > We tracked down the root cause of this slowdown to lock contention in > 'get_relation_info()'. 
The index lock of every single index of every single > table used in that query is acquired. We attempted a fix by pre-filtering > out all indexes that anyways cannot be used with a certain query, without > taking the index locks (credits to Luc Vlaming for idea and > implementation). The patch does so by caching the columns present in every > index, inside 'struct Relation', similarly to 'rd_indexlist'. Then, before > opening (= locking) the indexes in 'get_relation_info()', we check if the > index can actually contribute to the query and if not it is discarded right > away. Caching the index info saves considerable work for every query run > subsequently, because less indexes must be inspected and thereby locked. > This way we also save cycles in any code that later on goes over all > relation indexes. > > The work-in-progress version of the patch is attached. It is still fairly > rough (e.g. uses a global variable, selects the best index in scans without > restrictions by column count instead of physical column size, is missing > some renaming, etc.), but shows the principle. > > The following table shows the TPS, planning time and speed-ups after > applying the patch and rerunning above described benchmark. Now, the > planning time remains roughly constant and TPS roughly doubles each time > the number of parallel streams is doubled. The higher the stream count the > more severe the lock contention is and the more pronounced the gained > speed-up gets. Interestingly, even for a single query stream the speed-up > in planning time is already very significant. This applies also for lower > index counts. For example just with 10 indexes the TPS for a single query > stream goes from 9,159 to 12,558. We can do more measurements if there is > interest in details for a lower number of indexes. 
> > Parallel streams | TPS (after) | Planning time (after) | Speed-up TPS | > Speed-up planning > > -|-|---|--|-- > 1| 10,344 | 0.046 | 1.9x| > 2.8x > 2| 20,140 | 0.045 ms | 2.5x| > 4.9x > 4| 40,349 | 0.047 ms | 2.7x| > 4.0x > 8| 80,121 | 0.046 ms | 3.0x| > 6.3x > 16 | 152,632 | 0.051 ms | 4.9x| > 8.4x > 32 | 301,359 | 0.052 ms | 11.4x| > 32.3x > 64 | 525,115 | 0.062 ms | 21.6x| > 40.0x > > We are happy to receive your feedback and polish up the patch. > > -- > David Geier > (ServiceNow) >
[PoC] Reducing planning time on tables with many indexes
Hi hackers,

We came across a slowdown in planning, where queries use tables with many indexes. In setups with wide tables it is not uncommon to have easily 10-100 indexes on a single table. The slowdown is already visible in serial workloads with just a handful of indexes, but gets drastically amplified when running queries with more indexes in parallel at high throughput.

We measured the TPS and planning time of running parallel streams of simple point look-up queries on a single empty table with 60 columns and 60 indexes. The query used is 'SELECT * FROM synth_table WHERE col5 = 42'. No rows are returned because the table is empty. We used a machine with 64 physical CPU cores. The schema and sysbench script to reproduce these numbers are attached. We used the TPS as reported by sysbench and obtained planning time by running 'EXPLAIN ANALYZE' on the same query in a separately opened connection. We averaged the planning time of 3 successive 'EXPLAIN ANALYZE' runs. sysbench ran on the same machine with varying numbers of threads using the following command line:

sysbench repro.lua --db-driver=pgsql --pgsql-host=localhost --pgsql-db=postgres --pgsql-port=? --pgsql-user=? --pgsql-password=? --report-interval=1 --threads=64 run

The following table shows the results. It is clearly visible that the TPS flattens out already at 8 parallel streams, while the planning time increases drastically.

Parallel streams | TPS (before) | Planning time (before)
-----------------|--------------|-----------------------
               1 |        5,486 | 0.13 ms
               2 |        8,098 | 0.22 ms
               4 |       15,013 | 0.19 ms
               8 |       27,107 | 0.29 ms
              16 |       30,938 | 0.43 ms
              32 |       26,330 | 1.68 ms
              64 |       24,314 | 2.48 ms

We tracked down the root cause of this slowdown to lock contention in 'get_relation_info()'. The index lock of every single index of every single table used in that query is acquired. We attempted a fix by pre-filtering out all indexes that anyways cannot be used with a certain query, without taking the index locks (credits to Luc Vlaming for idea and implementation).
The patch does so by caching the columns present in every index, inside 'struct Relation', similarly to 'rd_indexlist'. Then, before opening (= locking) the indexes in 'get_relation_info()', we check if the index can actually contribute to the query and if not it is discarded right away. Caching the index info saves considerable work for every query run subsequently, because fewer indexes must be inspected and thereby locked. This way we also save cycles in any code that later on goes over all relation indexes.

The work-in-progress version of the patch is attached. It is still fairly rough (e.g. uses a global variable, selects the best index in scans without restrictions by column count instead of physical column size, is missing some renaming, etc.), but shows the principle.

The following table shows the TPS, planning time and speed-ups after applying the patch and rerunning the benchmark described above. Now, the planning time remains roughly constant and TPS roughly doubles each time the number of parallel streams is doubled. The higher the stream count, the more severe the lock contention is and the more pronounced the gained speed-up gets. Interestingly, even for a single query stream the speed-up in planning time is already very significant. This also applies to lower index counts. For example, with just 10 indexes the TPS for a single query stream goes from 9,159 to 12,558. We can do more measurements if there is interest in details for a lower number of indexes.

Parallel streams | TPS (after) | Planning time (after) | Speed-up TPS | Speed-up planning
-----------------|-------------|-----------------------|--------------|------------------
               1 |      10,344 | 0.046                 | 1.9x         | 2.8x
               2 |      20,140 | 0.045 ms              | 2.5x         | 4.9x
               4 |      40,349 | 0.047 ms              | 2.7x         | 4.0x
               8 |      80,121 | 0.046 ms              | 3.0x         | 6.3x
              16 |     152,632 | 0.051 ms              | 4.9x         | 8.4x
              32 |     301,359 | 0.052 ms              | 11.4x        | 32.3x
              64 |     525,115 | 0.062 ms              | 21.6x        | 40.0x

We are happy to receive your feedback and polish up the patch.
-- David Geier (ServiceNow) From 60e300b5cac8f527e61483296df81ab783670ac9 Mon Sep 17 00:00:00 2001 From: David Geier Date: Fri, 15 Jul 2022 11:53:27 +0200 Subject: [PATCH] Index filtering --- src/backend/optimizer/path/Makefile | 3 +- src/backend/optimizer/path/index_filtering.c | 220 ++ src/backend/optimizer/plan/planmain.c | 7 + src/backend/optimizer/util/plancat.c | 124 ++ src/backend/utils/cache/relcache.c| 17 +-
RE: Perform streaming logical transactions by background workers and parallel apply
Dear Wang-san,

Hi, I'm also interested in the patch and I started to review it. Following are comments about 0001.

1. terminology

In your patch a new worker, "apply background worker", has been introduced, but I thought it might be confusing because PostgreSQL already has the worker "background worker". Both the apply worker and the apply bgworker are categorized as bgworkers. Do you have any reason not to use "apply parallel worker" or "apply streaming worker"? (Note that I'm not a native English speaker.)

2. logicalrep_worker_stop()

```
-	/* No worker, nothing to do. */
-	if (!worker)
-	{
-		LWLockRelease(LogicalRepWorkerLock);
-		return;
-	}
+	if (worker)
+		logicalrep_worker_stop_internal(worker);
+
+	LWLockRelease(LogicalRepWorkerLock);
+}
```

I thought you could add a comment about the meaning of the if-statement, like "No main apply worker, nothing to do".

3. logicalrep_workers_find()

I thought you could add a description of the difference between this and logicalrep_worker_find() at the top of the function. IIUC logicalrep_workers_find() counts subworkers, but logicalrep_worker_find() does not consider that type of worker.

4. logicalrep_worker_detach()

```
 static void
 logicalrep_worker_detach(void)
 {
+	/*
+	 * If we are the main apply worker, stop all the apply background workers
+	 * we started before.
+	 *
```

I thought "we are" should be "This is", based on other comments.

5. applybgworker.c

```
+/* Apply background workers hash table (initialized on first use) */
+static HTAB *ApplyWorkersHash = NULL;
+static List *ApplyWorkersFreeList = NIL;
+static List *ApplyWorkersList = NIL;
```

I thought they should be ApplyBgWorkersXXX, because they store information related only to apply bgworkers.

6. ApplyBgworkerShared

```
+	TransactionId stream_xid;
+	uint32 n; /* id of apply background worker */
+} ApplyBgworkerShared;
```

I thought the field "n" is too general; how about "proc_id" or "worker_id"?

7. apply_bgworker_wait_for()

```
+	/* If any workers (or the postmaster) have died, we have failed. */
+	if (status == APPLY_BGWORKER_EXIT)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("background worker %u failed to apply transaction %u",
+						wstate->shared->n, wstate->shared->stream_xid)))
```

7.a I thought we should not mention PM death here, because in that case the apply worker will exit at WaitLatch().

7.b The error message should be "apply background worker %u...".

8. apply_bgworker_check_status()

```
+				 errmsg("background worker %u exited unexpectedly",
+						wstate->shared->n)));
```

The error message should be "apply background worker %u...".

9. apply_bgworker_send_data()

```
+	if (result != SHM_MQ_SUCCESS)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("could not send tuples to shared-memory queue")));
```

I thought the error message should be "could not send data to..." because the sent data might not be tuples. For example, in the case of STREAM PREPARE, I think it does not contain tuples.

10. wait_event.h

```
 	WAIT_EVENT_HASH_GROW_BUCKETS_REINSERT,
+	WAIT_EVENT_LOGICAL_APPLY_WORKER_STATE_CHANGE,
 	WAIT_EVENT_LOGICAL_SYNC_DATA,
```

I thought the event should be WAIT_EVENT_LOGICAL_APPLY_BG_WORKER_STATE_CHANGE, because this is used when the apply worker waits until the status of the bgworker changes.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: Refactor to make use of a common function for GetSubscriptionRelations and GetSubscriptionNotReadyRelations.
On Wed, Jul 27, 2022 at 4:22 PM Michael Paquier wrote: > > On Wed, Jul 27, 2022 at 08:47:46AM +0530, vignesh C wrote: > > I feel this would be an overkill, I did not make any changes for this. > > Agreed. Using an extra layer of wrappers for that seems a bit too > much, so I have applied your v5 with a slight tweak to the comment. Thanks for pushing this patch. Regards, Vignesh
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths
On Wed, 27 Jul 2022 at 11:09, Michael Paquier wrote: > > On Tue, Jul 26, 2022 at 06:58:02PM +0200, Matthias van de Meent wrote: > > - Retained the check in XLogRegisterData, so that we check against > > integer overflows in the registerdata code instead of only an assert > > in XLogRecordAssemble where it might be too late. > > Why? The record has not been inserted yet. I would tend to keep only > the check at the bottom of XLogRecordAssemble(), for simplicity, and > call it a day. Because the sum value main_rdatalen can easily overflow in both the current and the previous APIs, which then corrupts the WAL - one of the two issues that I mentioned when I started the thread. We don't re-summarize the lengths of all XLogRecData segments for the main record data when assembling a record, to keep the performance of RecordAssemble (probably to limit the complexity when many data segments are registered). Because I didn't want to add more changes than necessary, this check needs to be done in the place where the overflow may occur, which is in XLogRegisterData. > > - Kept the inline static elog-ing function (as per Andres' suggestion > > on 2022-03-14; this decreases binary sizes) > > I am not really convinced that this one is worth doing. I'm not married to that change, but I also don't see why this can't be updated while this code is already being touched. > +#define MaxXLogRecordSize (1020 * 1024 * 1024) > + > +#define XLogRecordLengthIsValid(len) ((len) >= 0 && (len) < > MaxXLogRecordSize) > > These are used only in xloginsert.c, so we could keep them isolated. They might be only used in xloginsert right now, but that's not the point. This is now advertised as part of the record API spec: A record larger than 1020MB is explicitly not supported. If it was kept internal to xloginsert, that would be implicit and other people might start hitting issues similar to those we're hitting right now - records that are too large to read.
Although PostgreSQL is usually the only one generating WAL, we do support physical replication from arbitrary PG-compatible WAL streams, which means that any compatible WAL source could be the origin of our changes - and those need to be aware of the assumptions we make about the WAL format. I'm fine with also updating xlogreader.c to check this while reading records to clarify the limits there as well, if so desired. > + * To accommodate some overhead, hhis MaxXLogRecordSize value allows for > s/hhis/this/. Will be included in the next update. > For now, I have extracted from the patch the two API changes and the > checks for the block information for uint16, and applied this part. > That's one less thing to worry about. Thanks. Kind regards, Matthias van de Meent
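To make the overflow being discussed concrete: the two macros below are quoted from the patch hunk in this thread, while the accumulator and helper are simplified stand-ins for the xloginsert.c state (not the actual implementation). The idea is to widen to 64 bits at registration time, so the sum is checked before it can ever wrap:

```c
#include <assert.h>
#include <stdint.h>

/* Quoted from the patch hunk discussed above. */
#define MaxXLogRecordSize (1020 * 1024 * 1024)
#define XLogRecordLengthIsValid(len) ((len) >= 0 && (len) < MaxXLogRecordSize)

/* Simplified stand-in for the 32-bit main-data length accumulator. */
static uint32_t mainrdata_len = 0;

/*
 * Check at registration time, as the patch does in XLogRegisterData():
 * compute the new total in 64 bits first, so it cannot silently wrap
 * before XLogRecordAssemble() ever sees it.
 */
static int
register_data_checked(uint32_t chunk_len)
{
    int64_t new_len = (int64_t) mainrdata_len + (int64_t) chunk_len;

    if (!XLogRecordLengthIsValid(new_len))
        return 0;               /* reject: record would be unreplayable */
    mainrdata_len = (uint32_t) new_len;
    return 1;
}
```

A checker only at assembly time would see an already-wrapped 32-bit total, which is exactly the WAL-corrupting case described in the thread.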
Re: Introduce wait_for_subscription_sync for TAP tests
On Tue, Jul 26, 2022 at 1:12 PM Masahiko Sawada wrote: > > On Tue, Jul 26, 2022 at 2:01 PM Amit Kapila wrote: > > > > 2. > > +# wait for the replication to catchup if required. > > +if (defined($publisher)) > > +{ > > + croak 'subscription name must be specified' unless defined($subname); > > + $publisher->wait_for_catchup($subname, 'replay'); > > +} > > + > > +# then, wait for all table states to be ready. > > +print "Waiting for all subscriptions in \"$name\" to synchronize > > data\n"; > > +my $query = qq[SELECT count(1) = 0 > > + FROM pg_subscription_rel > > + WHERE srsubstate NOT IN ('r', 's');]; > > +$self->poll_query_until($dbname, $query) > > + or croak "timed out waiting for subscriber to synchronize data"; > > > > In the tests, I noticed that a few places did wait_for_catchup after > > the subscription check, and at other places, we did that check before > > as you have it here. Ideally, I think wait_for_catchup should be after > > confirming the initial sync is over as without initial sync, the > > publisher node won't be completely in sync with the subscriber. > > What do you mean by the last sentence? I thought the order doesn't > matter here. Even if we do wait_for_catchup first then the > subscription check, we can make sure that the apply worker caught up > and table synchronization has been done, no? > That's right. I thought we should first ensure the subscriber has finished operations if possible, like in this case, it can ensure table sync has finished and then we can ensure whether publisher and subscriber are in sync. That sounds more logical to me. -- With Regards, Amit Kapila.
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
On Mon, Jul 25, 2022 at 11:26 AM Masahiko Sawada wrote: > > On Mon, Jul 25, 2022 at 10:45 AM Masahiko Sawada > wrote: > > I've attached the patch for REl15 that I forgot. > I feel the place to remember running xacts information in SnapBuildProcessRunningXacts is not appropriate. Because in cases where there are no running xacts or when xl_running_xact is old enough that we can't use it, we don't need that information. I feel we need it only when we have to reuse the already serialized snapshot, so, won't it be better to initialize at that place in SnapBuildFindSnapshot()? I have changed accordingly in the attached and apart from that slightly modified the comments and commit message. Do let me know what you think of the attached? -- With Regards, Amit Kapila. REL15_v9-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch Description: Binary data
Re: Refactor to make use of a common function for GetSubscriptionRelations and GetSubscriptionNotReadyRelations.
On Wed, Jul 27, 2022 at 08:47:46AM +0530, vignesh C wrote: > I feel this would be an overkill, I did not make any changes for this. Agreed. Using an extra layer of wrappers for that seems a bit too much, so I have applied your v5 with a slight tweak to the comment. -- Michael signature.asc Description: PGP signature
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Hi Amit, I updated the patch in order to prevent the problems that might be caused by using different replication slots for syncing a table. As suggested in previous emails, replication slot names are stored in the catalog, so slot names can be reached later and it is ensured that the same replication slot is used during the tablesync step of a table. With the current version of the patch:

- A "srrelslotname" column is introduced into the pg_subscription_rel catalog. It stores the slot name for tablesync.
- Tablesync worker logic is now as follows:
  1. Tablesync worker is launched by apply worker for a table.
  2. Worker generates a default replication slot name for itself. Slot name includes subid and worker pid for tracking purposes.
  3. If the table has a slot name value in the catalog:
     i. If the table state is DATASYNC, drop the replication slot from the catalog and proceed with tablesync using a new slot.
     ii. If the table state is FINISHEDCOPY, use the replication slot from the catalog; do not create a new slot.
  4. Before the worker moves to a new table, drop any replication slots that were retrieved from the catalog and used.
  5. If no table is left to sync, drop the replication slot of that sync worker with worker pid, if it exists. (It's possible that a sync worker does not create a replication slot for itself but uses slots read from the catalog in each iteration.)

> I think using worker_pid also has similar risks of mixing slots from
> different workers because after restart same worker_pid could be
> assigned to a totally different worker. Can we think of using a unique
> 64-bit number instead? This will be allocated when each worker
> started for the very first time and after that we can refer catalog to
> find it as suggested in the idea below.

I'm not sure how likely it is to have colliding pids for different tablesync workers in the same subscription. Though, having the pid in the slot name makes it easier to track which slot belongs to which worker. That's why I kept using pid in slot names.
But I think it should be simple to switch to using a unique 64-bit number. So I can remove pid's from slot names, if you think that it would be better. > We should use the same for the origin name as well. > I did not really change anything related to origin names. Origin names are still the same and include relation id. What do you think would be an issue with origin names in this patch? > This sounds tricky. Why not first drop slot/origin and then detach it > from pg_subscription_rel? On restarts, it is possible that we may > error out after dropping the slot or origin but before updating the > catalog entry but in such case we can ignore missing slot/origin and > detach them from pg_subscription_rel. Also, if we use the unique > number as suggested above, I think even if we don't remove it after > the relation state is ready, it should be okay. > Right, I did not add an additional slot cleanup step. The patch now drops the slot when we're done with it and then removes it from the catalog. Thanks, Melih v2-0001-Reuse-Logical-Replication-Background-worker.patch Description: Binary data
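As an aside on the pid-versus-unique-id question above, the two naming schemes under discussion could look roughly like this; the exact name patterns here are hypothetical illustrations, not the format the patch uses:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * pid-based name: easy to trace back to a worker, but a recycled pid
 * after a restart could collide with a leftover slot of an old worker.
 */
static void
slot_name_from_pid(char *buf, size_t len, uint32_t subid, int pid)
{
    snprintf(buf, len, "pg_%u_sync_%d", subid, pid);
}

/*
 * unique-64-bit-id name: the id is assigned once per worker and never
 * reused, so names cannot collide within the cluster's lifetime.
 */
static void
slot_name_from_id(char *buf, size_t len, uint32_t subid, uint64_t id)
{
    snprintf(buf, len, "pg_%u_sync_%" PRIu64, subid, id);
}
```

The trade-off discussed in the thread is exactly this: traceability of the pid-based name versus the collision-freedom of a never-reused counter.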
RE: Introduce wait_for_subscription_sync for TAP tests
On Tue, Jul 26, 2022 3:42 PM Masahiko Sawada wrote: > > I've attached an updated patch as well as a patch to remove duplicated > waits in 007_ddl.pl. > Thanks for your patch. Here are some comments. 1. I think some comments need to be changed in the patch. For example: # Also wait for initial table sync to finish # Wait for initial sync to finish as well Words like "Also" and "as well" can be removed now, we originally used them because we wait for catchup and "also" wait for initial sync. 2. In the following places, we can remove wait_for_catchup() and then call it in wait_for_subscription_sync(). 2.1. 030_origin.pl: @@ -128,8 +120,7 @@ $node_B->safe_psql( $node_C->wait_for_catchup($appname_B2); -$node_B->poll_query_until('postgres', $synced_query) - or die "Timed out while waiting for subscriber to synchronize data"; +$node_B->wait_for_subscription_sync; 2.2. 031_column_list.pl: @@ -385,7 +373,7 @@ $node_subscriber->safe_psql( ALTER SUBSCRIPTION sub1 SET PUBLICATION pub2, pub3 )); -wait_for_subscription_sync($node_subscriber); +$node_subscriber->wait_for_subscription_sync; $node_publisher->wait_for_catchup('sub1'); 2.3. 100_bugs.pl: @@ -281,8 +276,7 @@ $node_subscriber->safe_psql('postgres', $node_publisher->wait_for_catchup('tap_sub'); # Also wait for initial table sync to finish -$node_subscriber->poll_query_until('postgres', $synced_query) - or die "Timed out while waiting for subscriber to synchronize data"; +$node_subscriber->wait_for_subscription_sync; is( $node_subscriber->safe_psql( 'postgres', "SELECT * FROM tab_replidentity_index"), Regards, Shi yu
Re: making relfilenodes 56 bits
On Tue, Jul 26, 2022 at 1:32 PM Dilip Kumar wrote: > > On Thu, Jul 21, 2022 at 9:53 AM Thomas Munro wrote: > > > > On Wed, Jul 20, 2022 at 11:27 PM Dilip Kumar wrote: > > > [v10 patch set] > > > > Hi Dilip, I'm experimenting with these patches and will hopefully have > > more to say soon, but I just wanted to point out that this builds with > > warnings and failed on 3/4 of the CI OSes on cfbot's last run. Maybe > > there is the good kind of uninitialised data on Linux, and the bad > > kind of uninitialised data on those other pesky systems? > > Here is the patch to fix the issue, basically, while asserting for the > file existence it was not setting the relfilenumber in the > relfilelocator before generating the path so it was just checking for > the existence of the random path so it was asserting randomly. Thanks for the updated patch, Few comments: 1) The format specifier should be changed from %u to INT64_FORMAT autoprewarm.c -> apw_load_buffers ... if (fscanf(file, "%u,%u,%u,%u,%u\n", &blkinfo[i].database, &blkinfo[i].tablespace, &blkinfo[i].filenumber, &forknum, &blkinfo[i].blocknum) != 5) ... 2) The format specifier should be changed from %u to INT64_FORMAT autoprewarm.c -> apw_dump_now ... ret = fprintf(file, "%u,%u,%u,%u,%u\n", block_info_array[i].database, block_info_array[i].tablespace, block_info_array[i].filenumber, (uint32) block_info_array[i].forknum, block_info_array[i].blocknum); ...
3) should the comment "entry point for old extension version" be on top of pg_buffercache_pages, as the current version will use pg_buffercache_pages_v1_4 + +Datum +pg_buffercache_pages(PG_FUNCTION_ARGS) +{ + return pg_buffercache_pages_internal(fcinfo, OIDOID); +} + +/* entry point for old extension version */ +Datum +pg_buffercache_pages_v1_4(PG_FUNCTION_ARGS) +{ + return pg_buffercache_pages_internal(fcinfo, INT8OID); +} 4) we could use the new style of ereport by removing the brackets around errcode: + if (fctx->record[i].relfilenumber > OID_MAX) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("relfilenode" INT64_FORMAT " is too large to be represented as an OID", + fctx->record[i].relfilenumber), + errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE"))); like: ereport(ERROR, errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("relfilenode" INT64_FORMAT " is too large to be represented as an OID", fctx->record[i].relfilenumber), errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE")); 5) Similarly in the below code too: + /* check whether the relfilenumber is within a valid range */ + if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), +errmsg("relfilenumber " INT64_FORMAT " is out of range", + (relfilenumber)))); 6) Similarly in the below code too: +#define CHECK_RELFILENUMBER_RANGE(relfilenumber) \ +do { \ + if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \ + ereport(ERROR, \ + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), \ +errmsg("relfilenumber " INT64_FORMAT " is out of range", \ + (relfilenumber)))); \ +} while (0) + 7) This error code looks similar to CHECK_RELFILENUMBER_RANGE, can this macro be used here too: pg_filenode_relation(PG_FUNCTION_ARGS) { Oid reltablespace = PG_GETARG_OID(0); - RelFileNumber relfilenumber = PG_GETARG_OID(1); + RelFileNumber relfilenumber = PG_GETARG_INT64(1); Oid heaprel; + /* check
whether the relfilenumber is within a valid range */ + if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), +errmsg("relfilenumber " INT64_FORMAT " is out of range", + (relfilenumber)))); 8) I felt this include is not required: diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c index 849a7ce..a2f0d35 100644 --- a/src/backend/access/transam/varsup.c +++ b/src/backend/access/transam/varsup.c @@ -13,12 +13,16 @@ #include "postgres.h" +#include + #include "access/clog.h" #include "access/commit_ts.h" 9) should we change elog to ereport to use the new-style error reporting API + /* safety check, we should never get this far in a HS standby */ + if (RecoveryInProgress()) + elog(ERROR, "cannot assign RelFileNumber
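To illustrate the point behind the INT64_FORMAT comments above: passing a 64-bit relfilenumber through a 32-bit `%u` conversion is undefined behaviour and at best truncates the value. A standalone sketch, using C's portable `PRId64` in place of PostgreSQL's `INT64_FORMAT` macro:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Format a relfilenumber that no longer fits in 32 bits. */
static void
format_relfilenumber(char *buf, size_t len, int64_t relfilenumber)
{
    /*
     * "%u" would consume only 32 of the 64 pushed bits (undefined
     * behaviour in C); the 64-bit conversion reads the full value.
     */
    snprintf(buf, len, "%" PRId64, relfilenumber);
}
```

This is why every `%u` that used to print an OID-sized relfilenode must be switched once the field is widened to 56 bits.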
Re: [PATCH] Simple code cleanup in tuplesort.c.
On Wed, Jul 27, 2022 at 5:10 PM Xing Guo wrote: > The bounded heap sorting status flag is set twice in sort_bounded_heap() > and tuplesort_performsort(). This patch helps remove one of them. > +1. Looks good to me. Thanks Richard
[PATCH] Simple code cleanup in tuplesort.c.
Hi hackers, The bounded heap sorting status flag is set twice in sort_bounded_heap() and tuplesort_performsort(). This patch helps remove one of them. Best Regards, Xing diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index d90a1aebdf..84dc0d07f9 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -2709,7 +2709,6 @@ sort_bounded_heap(Tuplesortstate *state) */ reversedirection(state); - state->status = TSS_SORTEDINMEM; state->boundUsed = true; }
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths
On Tue, Jul 26, 2022 at 06:58:02PM +0200, Matthias van de Meent wrote: > - Retained the check in XLogRegisterData, so that we check against > integer overflows in the registerdata code instead of only an assert > in XLogRecordAssemble where it might be too late. Why? The record has not been inserted yet. I would tend to keep only the check at the bottom of XLogRecordAssemble(), for simplicity, and call it a day. > - Kept the inline static elog-ing function (as per Andres' suggestion > on 2022-03-14; this decreases binary sizes) I am not really convinced that this one is worth doing. +#define MaxXLogRecordSize (1020 * 1024 * 1024) + +#define XLogRecordLengthIsValid(len) ((len) >= 0 && (len) < MaxXLogRecordSize) These are used only in xloginsert.c, so we could keep them isolated. + * To accommodate some overhead, hhis MaxXLogRecordSize value allows for s/hhis/this/. For now, I have extracted from the patch the two API changes and the checks for the block information for uint16, and applied this part. That's one less thing to worry about. -- Michael signature.asc Description: PGP signature
Re: [PATCH] Compression dictionaries for JSONB
On Wed, 27 Jul 2022 at 09:36, Simon Riggs wrote: > > On Sun, 17 Jul 2022 at 19:15, Nikita Malakhov wrote: > > > For using in special Toaster for JSON datatype compression dictionaries > > seem to be very valuable addition, but now I > > have to agree that this feature in current state is competing with > > Pluggable TOAST. > > But I don't understand this. > > Why does storing a compression dictionary in the catalog prevent that > dictionary from being used within the toaster? The point is not that compression dictionaries in the catalog are bad - I think it makes a lot of sense - but that the typecast-based usage of those dictionaries in user tables (like the UI provided by zson) effectively competes with the toaster: It tries to store the data in a more compressed manner than the toaster currently can because it has additional knowledge about the values being toasted. The main difference between casting and toasting however is that casting is fairly expensive, because it has a significantly higher memory overhead: both the fully decompressed and the compressed values are stored in memory at the same time at some point when you cast a value, while only the decompressed value is stored in full in memory when (de)toasting. And, considering that there is an open proposal for extending the toaster mechanism, I think that it is not specifically efficient to work with the relatively expensive typecast-based infrastructure if this dictionary compression can instead be added using the proposed extensible toasting mechanism at relatively low overhead. Kind regards, Matthias van de Meent
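For readers unfamiliar with zson, the kind of dictionary substitution being discussed looks roughly like this toy sketch; the dictionary contents and helper are hypothetical, not zson's or the patch's actual format:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical compression dictionary: frequent JSON keys learned from
 * existing data, replaced by their (small) index when a value is stored.
 */
static const char *dictionary[] = { "timestamp", "user_id", "payload" };
#define DICT_SIZE 3

/*
 * Return the dictionary code for a key, or -1 if it must be stored
 * verbatim. A decompressor holding the same dictionary reverses this.
 */
static int
dict_encode(const char *key)
{
    for (int i = 0; i < DICT_SIZE; i++)
        if (strcmp(dictionary[i], key) == 0)
            return i;
    return -1;
}
```

The "additional knowledge" mentioned above is exactly this shared dictionary; the open question in the thread is whether the substitution should happen in a typecast or inside the (extensible) toaster.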
RE: Perform streaming logical transactions by background workers and parallel apply
On Tuesday, July 26, 2022 5:34 PM Dilip Kumar wrote: > On Tue, Jul 26, 2022 at 2:30 PM Dilip Kumar wrote: > > > > On Fri, Jul 22, 2022 at 8:27 AM wangw.f...@fujitsu.com > > wrote: > > > > > > On Tues, Jul 19, 2022 at 10:29 AM I wrote: > > > > Attach the news patches. > > > > > > Not able to apply patches cleanly because the change in HEAD > (366283961a). > > > Therefore, I rebased the patch based on the changes in HEAD. > > > > > > Attach the new patches. > > > > +/* Check the foreign keys. */ > > +fkeys = RelationGetFKeyList(entry->localrel); > > +if (fkeys) > > +entry->parallel_apply = PARALLEL_APPLY_UNSAFE; > > > > So if there is a foreign key on any of the tables which are parts of a > > subscription then we do not allow changes for that subscription to be > > applied in parallel? I think this is a big limitation because having > > foreign key on the table is very normal right? I agree that if we > > allow them then there could be failure due to out of order apply > > right? but IMHO we should not put the restriction instead let it fail > > if there is ever such conflict. Because if there is a conflict the > > transaction will be sent again. Do we see that there could be wrong > > or inconsistent results if we allow such things to be executed in > > parallel. If not then IMHO just to avoid some corner case failure we > > are restricting very normal cases. > > some more comments.. > 1. > +/* > + * If we have found a free worker or if we are already > applying this > + * transaction in an apply background worker, then we > pass the data to > + * that worker. > + */ > +if (first_segment) > +apply_bgworker_send_data(stream_apply_worker, s->len, > + s->data); > > Comment says that if we have found a free worker or we are already applying in > the worker then pass the changes to the worker but actually as per the code > here we are only passing in case of first_segment? > > I think what you are trying to say is that if it is first segment then send > the > > 2. 
> +/* > + * This is the main apply worker. Check if there is any free apply > + * background worker we can use to process this transaction. > + */ > +if (first_segment) > +stream_apply_worker = apply_bgworker_start(stream_xid); > +else > +stream_apply_worker = apply_bgworker_find(stream_xid); > > So currently, whenever we get a new streamed transaction we try to start a new > background worker for that. Why do we need to start/close the background > apply worker every time we get a new streamed transaction. I mean we can > keep the worker in the pool for time being and if there is a new transaction > looking for a worker then we can find from that. Starting a worker is costly > operation and since we are using parallelism for this mean we are expecting > that there would be frequent streamed transaction needing parallel apply > worker so why not to let it wait for a certain amount of time so that if load > is low > it will anyway stop and if the load is high it will be reused for next > streamed > transaction. It seems the function name was a bit misleading. Currently, the started apply bgworker won't exit after applying the transaction. And apply_bgworker_start will first try to choose a free worker. It will start a new worker only if no free worker is available. > 3. > Why are we restricting parallel apply workers only for the streamed > transactions, because streaming depends upon the size of the logical decoding > work mem so making streaming and parallel apply tightly coupled seems too > restrictive to me. Do we see some obvious problems in applying other > transactions in parallel? We thought there could be some conflict failure and deadlock if we parallel apply normal transactions which need transaction dependency checks[1]. But I will do some more research for this and share the result soon. [1] https://www.postgresql.org/message-id/CAA4eK1%2BwyN6zpaHUkCLorEWNx75MG0xhMwcFhvjqm2KURZEAGw%40mail.gmail.com Best regards, Hou zj
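Hou's description of the reuse behaviour of apply_bgworker_start() can be sketched as a simple pool; the names and structure below are simplified illustrations, not the patch's code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 4

typedef struct
{
    int      started;           /* bgworker process exists */
    uint32_t xid;               /* streamed xact being applied; 0 = free */
} ApplyBgworker;

static ApplyBgworker pool[POOL_SIZE];

/*
 * Prefer an already-started free worker; start a new one only when no
 * free worker is available, since starting a worker is a costly
 * operation. Returns NULL when the whole pool is busy.
 */
static ApplyBgworker *
apply_bgworker_acquire(uint32_t xid)
{
    for (int i = 0; i < POOL_SIZE; i++)
        if (pool[i].started && pool[i].xid == 0)
        {
            pool[i].xid = xid;
            return &pool[i];
        }
    for (int i = 0; i < POOL_SIZE; i++)
        if (!pool[i].started)
        {
            pool[i].started = 1;    /* "start" a new worker */
            pool[i].xid = xid;
            return &pool[i];
        }
    return NULL;
}

/* Worker finished its transaction: keep the process, mark it free. */
static void
apply_bgworker_release(ApplyBgworker *w)
{
    w->xid = 0;
}
```

This matches the reply above: workers do not exit after a transaction, so the next streamed transaction reuses an idle process instead of paying the startup cost again.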
Re: making relfilenodes 56 bits
On Wed, Jul 27, 2022 at 1:24 PM Ashutosh Sharma wrote: > > Some more comments: Note: Please don't top post. > == > > Shouldn't we retry for the new relfilenumber if > "ShmemVariableCache->nextRelFileNumber > MAX_RELFILENUMBER". There can be > cases where some of the tables are dropped by the user and the relfilenumber of > those tables can be reused, for which we would need to find the relfilenumber > that can be reused. For e.g. consider below example: > > postgres=# create table t1(a int); > CREATE TABLE > > postgres=# create table t2(a int); > CREATE TABLE > > postgres=# create table t3(a int); > ERROR: relfilenumber is out of bound > > postgres=# drop table t1, t2; > DROP TABLE > > postgres=# checkpoint; > CHECKPOINT > > postgres=# vacuum; > VACUUM > > Now if I try to recreate table t3, it should succeed, shouldn't it? But it > doesn't because we simply error out by seeing the nextRelFileNumber saved in > the shared memory. > > postgres=# create table t1(a int); > ERROR: relfilenumber is out of bound > > I think, above should have worked. No, it should not, the whole point of this design is not to reuse a relfilenumber ever within a cluster's lifetime. You might want to read this mail[1]: by the time we use 2^56 relfilenumbers the cluster will anyway reach its lifetime by other factors. [1] https://www.postgresql.org/message-id/CA%2BhUKG%2BZrDms7gSjckme8YV2tzxgZ0KVfGcsjaFoKyzQX_f_Mw%40mail.gmail.com > == > > > > Note that while a table's filenode often matches its OID, this is > not necessarily the case; some operations, like > TRUNCATE, REINDEX, > CLUSTER and some forms > of ALTER TABLE, can change the filenode while preserving > the OID. > > I think this note needs some improvement in storage.sgml. It says the table's > relfilenode mostly matches its OID, but it doesn't. This will happen only in > case of system table and maybe never in case of user table. Yes, this should be changed.
> postgres=# create table t2(a int); > ERROR: relfilenumber is out of bound > > Since this is a user-visible error, I think it would be good to mention > relfilenode instead of relfilenumber. Elsewhere (including the user manual) > we refer to this as a relfilenode. No this is expected to be an internal error because in general during the cluster lifetime ideally, we should never reach this number. So we are putting this check so that it should not reach this number due to some other computational/programming mistake. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Perform streaming logical transactions by background workers and parallel apply
Here are some review comments for the patch v19-0004: == 1. doc/src/sgml/ref/create_subscription.sgml @@ -244,6 +244,11 @@ CREATE SUBSCRIPTION subscription_nameparallel mode is disregarded when retrying; + instead the transaction will be applied using on + mode. That last sentence starting with lowercase seems odd - that's why I thought saying "The parallel mode..." might be better. IMO "on mode" seems strange too. Hence my previous [1] (#4.3) suggestion for this SUGGESTION The parallel mode is disregarded when retrying; instead the transaction will be applied using streaming = on. == 2. src/backend/replication/logical/worker.c - start_table_sync @@ -3902,20 +3925,28 @@ start_table_sync(XLogRecPtr *origin_startpos, char **myslotname) } PG_CATCH(); { + /* + * Emit the error message, and recover from the error state to an idle + * state + */ + HOLD_INTERRUPTS(); + + EmitErrorReport(); + AbortOutOfAnyTransaction(); + FlushErrorState(); + + RESUME_INTERRUPTS(); + + /* Report the worker failed during table synchronization */ + pgstat_report_subscription_error(MySubscription->oid, false); + + /* Set the retry flag. */ + set_subscription_retry(true); + if (MySubscription->disableonerr) DisableSubscriptionAndExit(); - else - { - /* - * Report the worker failed during table synchronization. Abort - * the current transaction so that the stats message is sent in an - * idle state. - */ - AbortOutOfAnyTransaction(); - pgstat_report_subscription_error(MySubscription->oid, false); - PG_RE_THROW(); - } + proc_exit(0); } But is it correct to set the 'retry' flag even if the MySubscription->disableonerr is true? Won’t that mean even after the user corrects the problem and then re-enabled the subscription it still won't let the streaming=parallel work, because that retry flag is set? 
Also, Something seems wrong to me here - IIUC the patch changed this code because of the potential risk of an error within the set_subscription_retry function, but now if such an error happens the current code will bypass even getting to DisableSubscriptionAndExit, so the subscription won't have a chance to get disabled as the user might have wanted. ~~~ 3. src/backend/replication/logical/worker.c - start_apply @@ -3940,20 +3971,27 @@ start_apply(XLogRecPtr origin_startpos) } PG_CATCH(); { + /* + * Emit the error message, and recover from the error state to an idle + * state + */ + HOLD_INTERRUPTS(); + + EmitErrorReport(); + AbortOutOfAnyTransaction(); + FlushErrorState(); + + RESUME_INTERRUPTS(); + + /* Report the worker failed while applying changes */ + pgstat_report_subscription_error(MySubscription->oid, + !am_tablesync_worker()); + + /* Set the retry flag. */ + set_subscription_retry(true); + if (MySubscription->disableonerr) DisableSubscriptionAndExit(); - else - { - /* - * Report the worker failed while applying changes. Abort the - * current transaction so that the stats message is sent in an - * idle state. - */ - AbortOutOfAnyTransaction(); - pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker()); - - PG_RE_THROW(); - } } (Same as previous review comment #2) But is it correct to set the 'retry' flag even if the MySubscription->disableonerr is true? Won’t that mean even after the user corrects the problem and then re-enabled the subscription it still won't let the streaming=parallel work, because that retry flag is set? Also, Something seems wrong to me here - IIUC the patch changed this code because of the potential risk of an error within the set_subscription_retry function, but now if such an error happens the current code will bypass even getting to DisableSubscriptionAndExit, so the subscription won't have a chance to get disabled as the user might have wanted. ~~~ 4. 
src/backend/replication/logical/worker.c - DisableSubscriptionAndExit /* - * After error recovery, disable the subscription in a new transaction - * and exit cleanly. + * Disable the subscription in a new transaction. */ static void DisableSubscriptionAndExit(void) { - /* - * Emit the error message, and recover from the error state to an idle - * state - */ - HOLD_INTERRUPTS(); - - EmitErrorReport(); - AbortOutOfAnyTransaction(); - FlushErrorState(); - - RESUME_INTERRUPTS(); - - /* Report the worker failed during either table synchronization or apply */ - pgstat_report_subscription_error(MyLogicalRepWorker->subid, - !am_tablesync_worker()); - /* Disable the subscription */ StartTransactionCommand(); DisableSubscription(MySubscription->oid); @@ -4231,8 +4252,6 @@ DisableSubscriptionAndExit(void) ereport(LOG, errmsg("logical replication subscription \"%s\" has been disabled due to an error", MySubscription->name)); - - proc_exit(0); } 4a. Hmm, I think it is a bad idea to remove the "exiting" code from the function but still leave the function name the same as before saying "AndExit". 4b. Also, now the patch is unconditionally doing
Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size
On Fri, Jul 22, 2022 at 2:49 PM Amit Langote wrote: > On Fri, Jul 22, 2022 at 1:13 PM David Rowley wrote: > > BTW, I was working on code inside llvm_compile_expr() a few days ago > > and I thought I'd gotten the new evaluation steps I was adding correct > > as it worked fine with jit_above_cost=0, but on further testing, it > > crashed with jit_inline_above_cost=0. Might be worth doing both to see > > if everything works as intended. > > Thanks for the pointer. > > So I didn't see things going bust on re-testing with all > jit_*_above_cost parameters set to 0, so maybe the > llvm_compile_expression() additions are alright. Here's an updated version of the patch, with mostly cosmetic changes. In particular, I added comments describing the new llvm_compile_expr() blobs. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com v5-0001-Some-improvements-to-JsonExpr-evaluation.patch Description: Binary data
RE: Perform streaming logical transactions by background workers and parallel apply
On Wednesday, July 27, 2022 1:29 PM Dilip Kumar wrote: > > On Wed, Jul 27, 2022 at 10:06 AM Amit Kapila > wrote: > > > > On Tue, Jul 26, 2022 at 2:30 PM Dilip Kumar wrote: > > > > > > On Fri, Jul 22, 2022 at 8:27 AM wangw.f...@fujitsu.com > > > wrote: > > > > > > > > On Tues, Jul 19, 2022 at 10:29 AM I wrote: > > > > > Attach the news patches. > > > > > > > > Not able to apply patches cleanly because the change in HEAD > (366283961a). > > > > Therefore, I rebased the patch based on the changes in HEAD. > > > > > > > > Attach the new patches. > > > > > > +/* Check the foreign keys. */ > > > +fkeys = RelationGetFKeyList(entry->localrel); > > > +if (fkeys) > > > +entry->parallel_apply = PARALLEL_APPLY_UNSAFE; > > > > > > So if there is a foreign key on any of the tables which are parts of > > > a subscription then we do not allow changes for that subscription to > > > be applied in parallel? > > > > > > > I think the above check will just prevent the parallelism for a xact > > operating on the corresponding relation not the relations of the > > entire subscription. Your statement sounds like you are saying that it > > will prevent parallelism for all the other tables in the subscription > > which has a table with FK. > > Okay, got it. I thought we are disallowing parallelism for the entire > subscription. > > > > I think this is a big limitation because having foreign key on the > > > table is very normal right? I agree that if we allow them then > > > there could be failure due to out of order apply right? > > > > > > > What kind of failure do you have in mind and how it can occur? The one > > way it can fail is if the publisher doesn't have a corresponding > > foreign key on the table because then the publisher could have allowed > > an insert into a table (insert into FK table without having the > > corresponding key in PK table) which may not be allowed on the > > subscriber. 
However, I don't see any check that could prevent this > > because for this we need to compare the FK list for a table from the > > publisher with the corresponding one on the subscriber. I am not > > really sure if, due to the risk of such conflicts, we should block > > parallelism of transactions operating on tables with FKs, because those > > conflicts can occur even without parallelism; it is just a matter of > > timing. But I could be missing something due to which the above check > > can be useful? > > Actually, my question starts with this check[1][2]; from this it > appears that if the relation has a foreign key then we are > marking it parallel unsafe[2], and later in [1], while the worker is > applying changes for that relation, if it was marked parallel > unsafe then we are throwing an error. So my question was why we are > putting in this restriction. Although the error only talks about > unique columns and non-immutable functions, it is also thrown if the > target table has a foreign key. So do we really need to restrict this? I mean, why are we restricting this case? > Hi, I think the foreign key check is used to prevent the apply worker from waiting indefinitely, which can be caused by a foreign key difference between publisher and subscriber, like the following example:
-
Publisher:

-- both tables are published
CREATE TABLE PKTABLE ( ptest1 int);
CREATE TABLE FKTABLE ( ftest1 int);
-- initial data
INSERT INTO PKTABLE VALUES(1);

Subscriber:

CREATE TABLE PKTABLE ( ptest1 int PRIMARY KEY);
CREATE TABLE FKTABLE ( ftest1 int REFERENCES PKTABLE);

-- Execute the following transactions on publisher

Tx1:
INSERT ... -- make enough changes to start streaming mode
DELETE FROM PKTABLE;

Tx2:
INSERT INTO FKTABLE VALUES(1);
COMMIT;

COMMIT;
-
The subscriber's apply worker will wait indefinitely, because the main apply worker is waiting for the streaming transaction to finish, which is in another apply bgworker.
BTW, I think the foreign key won't take effect in the subscriber's apply worker by default. That is because we set session_replication_role to 'replica' in the apply worker, which prevents the FK trigger function from being executed (only triggers with the FIRES_ON_REPLICA flag are executed in this mode). A user has to explicitly alter the trigger to enable it in replica mode to make the foreign key work. So, ISTM, we won't hit this ERROR frequently. And based on this, another comment about the patch is that it seems unnecessary to directly check the FK list returned by RelationGetFKeyList; checking the actual FK trigger function seems enough. Best regards, Hou zj
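To illustrate the point about session_replication_role (a hedged sketch; the trigger name below is hypothetical, since the internal RI trigger names are assigned by the system and must be looked up in pg_trigger):

```sql
-- The apply worker runs with this setting internally; shown here only
-- to illustrate why ordinary triggers - including the internal
-- RI_ConstraintTrigger_* triggers that implement foreign keys - do not
-- fire during apply:
SET session_replication_role = 'replica';

-- To make a foreign key checked during apply, its trigger must be
-- switched to fire in replica mode (name is hypothetical):
ALTER TABLE fktable ENABLE REPLICA TRIGGER "RI_ConstraintTrigger_c_16400";
```

Only after such an ALTER TABLE would the indefinite-wait scenario described earlier become reachable, which supports the argument that checking the trigger's fire mode is more precise than checking RelationGetFKeyList().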
Re: making relfilenodes 56 bits
Some more comments:

==

Shouldn't we retry for a new relfilenumber if "ShmemVariableCache->nextRelFileNumber > MAX_RELFILENUMBER"? There can be cases where some of the tables have been dropped by the user and the relfilenumbers of those tables can be reused, for which we would need to find a relfilenumber that can be reused. For example, consider the scenario below:

postgres=# create table t1(a int);
CREATE TABLE
postgres=# create table t2(a int);
CREATE TABLE
postgres=# create table t3(a int);
ERROR: relfilenumber is out of bound
postgres=# drop table t1, t2;
DROP TABLE
postgres=# checkpoint;
CHECKPOINT
postgres=# vacuum;
VACUUM

Now if I try to recreate table t3, it should succeed, shouldn't it? But it doesn't, because we simply error out after seeing the nextRelFileNumber saved in shared memory.

postgres=# create table t1(a int);
ERROR: relfilenumber is out of bound

I think the above should have worked.

==

Note that while a table's filenode often matches its OID, this is not necessarily the case; some operations, like TRUNCATE, REINDEX, CLUSTER and some forms of ALTER TABLE, can change the filenode while preserving the OID.

I think this note in storage.sgml needs some improvement. It says the table's relfilenode mostly matches its OID, but it doesn't: that will hold only for system tables, and maybe never for user tables.

==

postgres=# create table t2(a int);
ERROR: relfilenumber is out of bound

Since this is a user-visible error, I think it would be good to mention relfilenode instead of relfilenumber. Elsewhere (including the user manual) we refer to this as a relfilenode. -- With Regards, Ashutosh Sharma. On Tue, Jul 26, 2022 at 10:36 PM Ashutosh Sharma wrote: > Thanks Dilip. Here are a few comments that I could find upon quickly > reviewing the v11 patch: > > /* > + * Similar to the XLogPutNextOid but instead of writing NEXTOID log > record it > + * writes a NEXT_RELFILENUMBER log record.
If '*prevrecptr' is a valid > + * XLogRecPtrthen flush the wal upto this record pointer otherwise flush > upto > > XLogRecPtrthen -> XLogRecPtr then > > == > > + switch (relpersistence) > + { > + case RELPERSISTENCE_TEMP: > + backend = BackendIdForTempRelations(); > + break; > + case RELPERSISTENCE_UNLOGGED: > + case RELPERSISTENCE_PERMANENT: > + backend = InvalidBackendId; > + break; > + default: > + elog(ERROR, "invalid relpersistence: %c", relpersistence); > + return InvalidRelFileNumber;/* placate compiler */ > + } > > > I think the above check should be added at the beginning of the function > for the reason that if we come to the default switch case we won't be > acquiring the lwlock and do other stuff to get a new relfilenumber. > > == > > - newrelfilenumber = GetNewRelFileNumber(newTableSpace, NULL, > +* Generate a new relfilenumber. We cannot reuse the old relfilenumber > +* because of the possibility that that relation will be moved back to > the > > that that relation -> that relation. > > == > > + * option_parse_relfilenumber > + * > + * Parse relfilenumber value for an option. If the parsing is successful, > + * returns; if parsing fails, returns false. > + */ > > If parsing is successful, returns true; > > -- > With Regards, > Ashutosh Sharma. > > On Tue, Jul 26, 2022 at 7:33 PM Dilip Kumar wrote: > >> On Tue, Jul 26, 2022 at 6:06 PM Ashutosh Sharma >> wrote: >> >> Hi, >> Note: please avoid top posting. >> >> > /* >> > * If relfilenumber is unspecified by the caller then create >> storage >> > -* with oid same as relid. >> > +* with relfilenumber same as relid if it is a system table >> otherwise >> > +* allocate a new relfilenumber. For more details read >> comments atop >> > +* FirstNormalRelFileNumber declaration. >> > */ >> > if (!RelFileNumberIsValid(relfilenumber)) >> > - relfilenumber = relid; >> > + { >> > + relfilenumber = relid < FirstNormalObjectId ? 
>> > + relid : GetNewRelFileNumber(); >> > >> > Above code says that in the case of system table we want relfilenode to >> be the same as object id. This technically means that the relfilenode or >> oid for the system tables would not be exceeding 16383. However in the >> below lines of code added in the patch, it says there is some chance for >> the storage path of the user tables from the old cluster conflicting with >> the storage path of the system tables in the new cluster. Assuming that the >> OIDs for the user tables on the old cluster would start with 16384 (the >> first object ID), I see no reason why there would be a conflict. >> >> >> Basically, the above comment says that the initial system table >> storage will be created with the same relfilenumber as Oid so you are >> right that will not exceed 16383. And below code is explaining the >> reason that
Re: [PATCH] Compression dictionaries for JSONB
On Sun, 17 Jul 2022 at 19:15, Nikita Malakhov wrote: > we suggest that as an improvement compression should be put inside the > Toaster as an option, > thus the Toaster could have maximum benefits from knowledge of data internal > structure (and in future use JSON Schema). Very much agreed. > For using in special Toaster for JSON datatype compression dictionaries seem > to be very valuable addition, but now I > have to agree that this feature in current state is competing with Pluggable > TOAST. But I don't understand this. Why does storing a compression dictionary in the catalog prevent that dictionary from being used within the toaster? --
Simon Riggs
http://www.EnterpriseDB.com/
Re: Slow standby snapshot
At Tue, 26 Jul 2022 23:09:16 +0500, Andrey Borodin wrote in > > > > On 20 Jul 2022, at 02:12, Michail Nikolaev > > wrote: > > > >> I've looked into v5. > > Thanks! > > > > The patch is updated according to your remarks. > > The patch seems Ready for Committer from my POV.

+ * is updated during taking the snapshot. The KnownAssignedXidsNextOffset
+ * contains not an offset to the next valid xid but a number which tends to
+ * the offset to the next valid xid. KnownAssignedXidsNextOffset[] values read

Is there still a reason why the array stores the distance to the next valid element instead of the index of the next valid element? It seems to me that the intention was to reduce the size of the offset array, but it is int32[], which is far wider than TOTAL_MAX_CACHED_SUBXIDS. It seems to me that storing the index itself is simpler, and maybe faster by saving the cycles needed to perform the addition. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Cygwin cleanup
On Tue, Jul 26, 2022 at 04:24:25PM +1200, Thomas Munro wrote: > 3. You can't really run PostgreSQL on Cygwin for real, because its > implementation of signals does not have reliable signal masking, so > unsubtle and probably also subtle breakage occurs. That was reported > upstream by Noah years ago, but they aren't working on a fix. > lorikeet shows random failures, and presumably any CI system will do > the same... Reference: https://www.postgresql.org/message-id/20170321034703.GB2097809%40tornado.leadboat.com On my 2nd try: https://cirrus-ci.com/task/5311911574110208 TRAP: FailedAssertion("mq->mq_sender == NULL", File: "shm_mq.c", Line: 230, PID: 16370) 2022-07-26 06:32:35.525 PDT [15538][postmaster] LOG: background worker "parallel worker" (PID 16370) was terminated by signal 6: Aborted > XXX Doesn't get all the way through yet... Mainly because getopt was causing all TAP tests to fail. I tried to fix that in configure, but ended up changing the callers. This is getting close, but I don't think it has actually managed to pass all tests yet: https://cirrus-ci.com/task/5274721116749824 > 4. When building with Cygwin GCC 11.3 you get a bunch of warnings > that don't show up on other platforms, seemingly indicating that it > interprets -Wimplicit-fallthrough=3 differently. Huh? Evidently due to the same getopt issues. > XXX This should use a canned Docker image with all the right packages > installed Has anyone tried using non-canned images? It sounds like this could reduce the 4 min startup time for Windows. https://cirrus-ci.org/guide/docker-builder-vm/#dockerfile-as-a-ci-environment > XXX configure is sooo slooow, can we cache it?! Compiling is also > insanely slow, but ccache gets it down to a couple of minutes if you're > lucky One reason compiling was slow is that you ended up with -O2. You can cache configure as long as you're willing to re-run it whenever options change. That also applies to the existing headerscheck.
> XXX I don't know how to put variables like BUILD_JOBS into the scripts WDYM? If it's outside of bash and in the Windows shell it's like %var%, right? https://cirrus-ci.org/guide/writing-tasks/#environment-variables I just noticed that cirrus is misbehaving: if there's a variable called CI (which there is), then it expands $CI_FOO like ${CI}_FOO rather than ${CI_FOO}. I've also seen weirdness when variable names or operators appear in the commit message... > XXX Needs some --with-X options Done > XXX We would never want this to run by default in CI, but it'd be nice > to be able to ask for it with ci-os-only! (See commented out line) > only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*' Doesn't this already do what's needed? As long as it doesn't also check CHANGE_MESSAGE !~ 'ci-os-only', the task will run only on request. > XXX I have no idea if crash dump works, and if this should share > elements with the msys work in commitfest #3575 Based on the crash above, it wasn't working. And after some changes ... it still doesn't work. windows_os is probably skipping too many things. -- Justin

>From 174cc603ff951b86794f57fcc8c7d326d948e006 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 25 Jul 2022 23:05:10 +1200
Subject: [PATCH] WIP CI support for Cygwin.
ci-os-only: cygwin, xindows, xinux

https://cirrus-ci.com/task/5145086722834432

XXX This should use a canned Docker image with all the right packages
installed
XXX I have no idea if crash dump works, and if this should share
elements with the msys work in commitfest #3575
---
 .cirrus.yml                                   | 59 +++
 configure                                     | 18 +++---
 configure.ac                                  |  6 +-
 src/interfaces/libpq/t/001_uri.pl             | 11 +++-
 src/interfaces/libpq/t/002_api.pl             | 13 +++-
 .../libpq_pipeline/t/001_libpq_pipeline.pl    | 14 -
 src/test/perl/PostgreSQL/Test/Cluster.pm      |  2 +-
 src/test/perl/PostgreSQL/Test/Utils.pm        | 12 +++-
 src/tools/ci/cores_backtrace.sh               |  6 +-
 9 files changed, 120 insertions(+), 21 deletions(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index f23d6cae552..0389b7758d1 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -456,6 +456,65 @@ task:
     path: "crashlog-*.txt"
     type: text/plain

+task:
+  name: Windows - Cygwin
+
+  env:
+    CPUS: 4
+    BUILD_JOBS: 4
+    TEST_JOBS: 3
+    CCACHE_DIR: /tmp/ccache
+    CCACHE_LOGFILE: ccache.log
+    CONFIGURE_FLAGS: --enable-cassert --enable-debug --enable-tap-tests --with-ldap --with-ssl=openssl --with-gssapi
+    CONFIGURE_CACHE: /tmp/ccache/configure.cache
+    PG_TEST_USE_UNIX_SOCKETS: 1
+
+  only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*'
+
+  windows_container:
+    image: cirrusci/windowsservercore:2019
+    os_version: 2019
+    cpu: $CPUS
+    memory: 4G
+
+  ccache_cache:
+    folder: C:\tools\cygwin\tmp\ccache
+
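The `only_if` gating in the patch above can be sanity-checked locally. This is a hedged sketch: Cirrus applies its own regex engine to the whole commit message, whereas the grep below is a line-anchored approximation of the same `ci-os-only:[^\n]*cygwin` idea; the commit messages are invented for illustration:

```shell
#!/bin/sh
# Emulate the only_if check: a task gated on 'ci-os-only:...cygwin'
# should run only when the commit message asks for it.

msg_with='Fix frobnication

ci-os-only: cygwin'

msg_without='Fix frobnication'

# Returns success when the message contains a matching ci-os-only line.
matches() {
    printf '%s' "$1" | grep -q '^ci-os-only:.*cygwin'
}

matches "$msg_with" && echo "cygwin task runs"
matches "$msg_without" || echo "cygwin task skipped"
```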
Re: [Refactor]Avoid to handle FORCE_NOT_NULL/FORCE_NULL options when COPY TO
On Wed, Jul 27, 2022 at 12:55 PM Kyotaro Horiguchi wrote: > ProcessCopyOptions previously rejects force_quote_all for COPY FROM > and copyfrom.c is not even conscious of the option (that is, there is even no > assertion on it). The two options are rejected for COPY TO by the same > function, so it seems like a thinko of the commit. Getting rid of the > code would be good from the standpoint of code coverage and maintenance. Yeah, ProcessCopyOptions() does check, for force_notnull and force_null, that COPY FROM is being used and that CSV mode is in effect. So the code in copyto.c that processes force_notnull/force_null is actually dead code. > On the other hand I wonder if it is good that we have assertions on the > option values. Agreed. Assertions would be better. Thanks Richard
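For reference, a sketch of the behavior being discussed (table and file names are illustrative, and the error text is approximate): because ProcessCopyOptions() rejects these options up front for COPY TO, any handling of them in copyto.c is unreachable.

```sql
CREATE TABLE tbl (col1 text);

-- Accepted: FORCE_NOT_NULL applies when reading CSV input.
COPY tbl FROM '/tmp/tbl.csv' WITH (FORMAT csv, FORCE_NOT_NULL (col1));

-- Rejected by ProcessCopyOptions() before copyto.c is ever reached
-- (error text approximate):
-- ERROR:  COPY force not null only available using COPY FROM
COPY tbl TO '/tmp/tbl.csv' WITH (FORMAT csv, FORCE_NOT_NULL (col1));
```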
Re: [PoC] Reducing planning time when tables have many partitions
Dear David, Thank you for sharing your new idea. I agree that introducing a Bitmapset field may solve this problem. I will try this approach in addition to previous ones. Thank you again for helping me. -- Best regards, Yuya Watari
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On 2022-Jul-22, Kyotaro Horiguchi wrote: > At Fri, 22 Jul 2022 10:18:58 +0200, Alvaro Herrera > wrote in > > Which commits would we consider? > > > > 7170f2159fb2 Allow "in place" tablespaces. > > f6f0db4d6240 Fix pg_tablespace_location() with in-place tablespaces > > The second one is just to make the function work with in-place > tablespaces. Without it the function yields the following error: > > > ERROR: could not read symbolic link "pg_tblspc/16407": Invalid argument > > This looks odd, but I think there is no need for back-patching because > no actual user of the feature is seen in our test suite. > If we have a test that needs the feature in the future, it would be enough > to back-patch it then. Actually, I found that the new test added by the fix in this thread does depend on this being fixed, so I included an even larger set, which I think makes this more complete:

7170f2159fb2 Allow "in place" tablespaces.
c6f2f01611d4 Fix pg_basebackup with in-place tablespaces.
f6f0db4d6240 Fix pg_tablespace_location() with in-place tablespaces
7a7cd84893e0 doc: Remove mention to in-place tablespaces for pg_tablespace_location()
5344723755bd Remove unnecessary Windows-specific basebackup code.

I didn't include any of the test changes for now. I don't intend to do so unless we see another reason for that; I think the new tests that are going to be added by the recovery bugfix should provide sufficient coverage. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "La fuerza no está en los medios físicos sino que reside en una voluntad indomable" ("Strength does not lie in physical means but resides in an indomitable will") (Gandhi)