Re: Allow escape in application_name

2021-09-07 Thread Kyotaro Horiguchi
At Tue, 7 Sep 2021 11:30:27 +, "kuroda.hay...@fujitsu.com" 
 wrote in 
> I attached the rebased version. Tests and descriptions were added.
> In my understanding Ikeda-san's indication is included.

I have some comments by a quick look.

+* one have higher priority.  See also the previous comment.

Is "the previous comment" "the comment above"?

+   for (i = n -1; i >= 0; i--)

You might want a space between - and 1.

+parse_application_name(StringInfo buf, const char *name)

The name seems a bit too generic as it is a function only for
postgres_fdw.


+   /* must be a '%', so skip to the next char */
+   p++;
+   if (*p == '\0')
+   break;  /* format error - 
ignore it */

I was surprised to find that undefined %-escapes and a stray % behave
differently between archive_command and log_line_prefix. I understand
this patch behaves like the latter.

+   const char *username = 
MyProcPort->user_name;

I'm not sure, but even though user_name doesn't seem to be NULL here,
don't we want to do the same thing as %u of log_line_prefix for safety?
Namely, substituting [unknown] if user_name is NULL or empty. The same
can be said for %d.
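
For reference, a minimal sketch of the substitution I have in mind,
modeled on how elog.c handles %u for log_line_prefix (buf and MyProcPort
are assumed to come from the patch; exact placement is up to you):

	const char *username = MyProcPort->user_name;

	/* fall back the same way %u/%d of log_line_prefix do */
	if (username == NULL || *username == '\0')
		username = _("[unknown]");
	appendStringInfoString(buf, username);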

+ * process_padding --- helper function for processing the format
+ * string in log_line_prefix

Since this is no longer a static helper function for a specific
function, the name and the comment should be more descriptive.

That being said, in the first place the function seems reducible to
little more than a call to atol. By a quick measurement the existing
function is about 38% faster (0.024 us/call for the function vs. 0.039
us/call for strtol), so I'm not sure we want to replace
process_log_prefix_padding() itself, but we don't need to reuse that
function in parse_application_name since it is called only once per
connection.
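
To illustrate, a rough sketch of parsing the optional padding inline with
strtol (p is assumed to point just past the '%' as in the patch, and
<ctype.h>/<stdlib.h> provide isdigit/strtol):

	int			padding = 0;

	if (*p == '-' || isdigit((unsigned char) *p))
	{
		char	   *endptr;
		long		val = strtol(p, &endptr, 10);

		if (endptr != p)
		{
			padding = (int) val;
			p = endptr;		/* continue with the escape character */
		}
	}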

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [BUG]Update Toast data failure in logical replication

2021-09-07 Thread Ajin Cherian
On Wed, Aug 11, 2021 at 10:45 PM Dilip Kumar  wrote:

> Yeah we can avoid that by detecting any toasted replica identity key
> in HeapDetermineModifiedColumns, check the attached patch.
>

I had a second look at this, and I just have a small doubt. The
convention is that for UPDATEs the old tuple/key is written to WAL only
if one of the columns in the key has changed as part of the update, and
we are breaking that convention with this patch by also including the
old key if it is toasted and stored on disk, even when it has not changed.
Why do we not include the detoasted key as part of the new tuple
rather than the old tuple? Then we wouldn't really break this convention.

And one small typo in the patch:

The header above ExtractReplicaIdentity()

Before:
 * key_required should be false if caller knows that no replica identity
 * columns changed value and it doesn't has any external data.
 * It's always true in the DELETE case.

After:
 * key_required should be false if caller knows that no replica identity
 * columns changed value and it doesn't have any external data.
 * It's always true in the DELETE case.

regards,
Ajin Cherian
Fujitsu Australia




Re: Added schema level support for publication.

2021-09-07 Thread vignesh C
On Tue, Sep 7, 2021 at 5:10 PM Amit Kapila  wrote:
>
> On Tue, Sep 7, 2021 at 12:45 PM vignesh C  wrote:
> >
> > On Fri, Sep 3, 2021 at 4:49 PM Amit Kapila  wrote:
> > >
> >
> > > 5.
> > > If I modify the search path to remove public schema then I get the
> > > below error message:
> > > postgres=# Create publication mypub for all tables in schema 
> > > current_schema;
> > > ERROR:  no schema has been selected
> > >
> > > I think this message is not very clear. How about changing to
> > > something like "current_schema doesn't contain any valid schema"? This
> > > message is used in more than one place, so let's keep it the same at
> > > all the places if you agree to change it.
> >
> > I would prefer to use the existing messages as we have used this in a
> > few other places similarly. Thoughts?
> >
>
> Yeah, I also see the same message in code but I think here usage is a
> bit different. If you see a similar SQL statement that causes the same
> error message then can you please give an example?

I was referring to the error message in CREATE TABLE:
postgres=# set search_path='non_existent_schema';
SET
postgres=# create table t1(c1 int);
ERROR:  no schema has been selected to create in
LINE 1: create table t1(c1 int);

If it is not very useful in the case of creating a publication, then I
can change it. Thoughts?

Regards,
Vignesh




Re: The Free Space Map: Problems and Opportunities

2021-09-07 Thread Peter Geoghegan
On Tue, Sep 7, 2021 at 5:25 AM Hannu Krosing  wrote:
> Are you speaking of just heap pages here or also index pages ?

Mostly heap pages, but FWIW I think it could work for index tuples
too, with retail index tuple deletion. Because that allows you to even
remove would-be LP_DEAD item pointers.

> Or are you expecting these to be kept in good-enough shape by your
> earlier index manager work?

It's very workload dependent. Some things were very much improved by
bottom-up index deletion in Postgres 14, for example (non-hot updates
with lots of logically unchanged indexes). Other things weren't helped
at all, or were barely helped. I think it's important to cover all
cases.

> A minimal useful patch emerging from that understanding could be
> something which just adds hysteresis to FSM management. (TBH, I
> actually kind of expected some hysteresis to be there already, as it
> is in my mental model of "how things should be done" for managing
> almost any resource :) )

I think that you need to do the FSM before the aborted-heap-tuple
cleanup. Otherwise you don't really know when or where to apply the
special kind of pruning that the patch invents, which targets aborts
specifically.

> Adding hysteresis to FSM management can hopefully be done independent
> of all the other stuff and also seems to be something that is
> unobtrusive and non-controversial enough to fit in current release and
> possibly be even back-ported .

I don't know about that! Seems kind of an invasive patch to me.

> I did not mean CHECKPOINT as a command, but more the concept of
> writing back / un-dirtying the page. In this sense it *is* special
> because it is the last point in time where you are guaranteed to have
> the page available in buffercache and thus cheap to access for
> modifications plus you will avoid a second full-page writeback because
> of cleanup. Also you do not want to postpone the cleanup to actual
> page eviction, as that is usually in the critical path for some user
> query or command.

I intend to do some kind of batching, but only at the level of small
groups of related transactions. Writing a page that was quickly opened
for new inserts, filled with newly inserted heap tuples, then closed,
and finally cleaned up doesn't seem like it needs to take any direct
responsibility for writeback.

-- 
Peter Geoghegan




Re: automatically generating node support functions

2021-09-07 Thread Noah Misch
On Tue, Sep 07, 2021 at 10:57:02AM +0200, Peter Eisentraut wrote:
> On 02.09.21 20:53, Jacob Champion wrote:
> >>0004-Make-node-output-prefix-match-node-structure-name.patch
> >>
> >>Some nodes' output/read functions use a label that is slightly different
> >>from their node name, e.g., "NOTIFY" instead of "NOTIFYSTMT".  This
> >>cleans that up so that an automated approach doesn't have to deal with
> >>these special cases.
> >
> >Is there any concern about the added serialization length, or is that
> >trivial in practice? The one that particularly caught my eye is
> >RANGETBLENTRY, which was previously RTE. But I'm not very well-versed
> >in all the places these strings can be generated and stored.
> 
> These are just matters of taste.  Let's wait a bit more to see if anyone is
> concerned.

I am not concerned about changing the serialization length this much.  The
format is already quite verbose, and this change is small relative to that
existing verbosity.




Re: pgstat_send_connstats() introduces unnecessary timestamp and UDP overhead

2021-09-07 Thread Laurenz Albe
On Fri, 2021-09-03 at 17:04 -0700, Andres Freund wrote:
> Here's how I think that would look like. While writing up this draft, I found
> two more issues:
> 
> - On windows / 32 bit systems, the session time would overflow if idle for
>   longer than ~4300s. long is only 32 bit. Easy to fix obviously.
> - Right now walsenders, including database connected walsenders, are not
>   reported in connection stats. That doesn't seem quite right to me.
> 
> In the patch I made the message for connecting an explicitly reported message,
> that seems cleaner, because it then happens at a clearly defined point. I
> didn't do the same for disconnecting, but perhaps that would be better? Then
> we could get rid of the whole pgStatSessionEndCause variable.

I have gone over your patch and made the following changes:

- cache the last report time in a static variable pgLastSessionReportTime
- add a comment to explain why we only track normal backends
- pgindent
- an attempt at a commit message

Yours,
Laurenz Albe
From 1fdfac528ec4e75d91b9cf3868eda66a72a41c8f Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Wed, 8 Sep 2021 06:06:46 +0200
Subject: [PATCH] Fix performance regression from session statistics

Session statistics, as introduced by 960869da08, had several
shortcomings:

- an additional GetCurrentTimestamp() call that also impaired
  the accuracy of the data collected

  This can be avoided by passing the current timestamp we already
  have in pgstat_report_stat().

- an additional statistics UDP packet sent every 500ms

  This is solved by adding the new statistics to PgStat_MsgTabstat.
  This is conceptually ugly, because session statistics are not
  table statistics.  But the struct already contains data unrelated
  to tables, so there is not much damage done.

  Connection and disconnection are reported in separate messages,
  which reduces the total overhead to two messages per session.

The time of the last session statistics report is now cached
in the static variable pgLastSessionReportTime.

Discussion: https://postgr.es/m/20210801205501.nyxzxoelqoo4x2qc%40alap3.anarazel.de
---
 src/backend/postmaster/pgstat.c | 177 +---
 src/backend/tcop/postgres.c |   2 +
 src/backend/utils/activity/backend_status.c |   4 +-
 src/include/pgstat.h|  34 ++--
 src/tools/pgindent/typedefs.list|   3 +-
 5 files changed, 141 insertions(+), 79 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3450a10129..7aaf928315 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -246,6 +246,7 @@ static int	pgStatXactCommit = 0;
 static int	pgStatXactRollback = 0;
 PgStat_Counter pgStatBlockReadTime = 0;
 PgStat_Counter pgStatBlockWriteTime = 0;
+static PgStat_Counter pgLastSessionReportTime = 0;
 PgStat_Counter pgStatActiveTime = 0;
 PgStat_Counter pgStatTransactionIdleTime = 0;
 SessionEndType pgStatSessionEndCause = DISCONNECT_NORMAL;
@@ -330,11 +331,11 @@ static bool pgstat_db_requested(Oid databaseid);
 static PgStat_StatReplSlotEntry *pgstat_get_replslot_entry(NameData name, bool create_it);
 static void pgstat_reset_replslot(PgStat_StatReplSlotEntry *slotstats, TimestampTz ts);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
+static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg, TimestampTz now);
 static void pgstat_send_funcstats(void);
 static void pgstat_send_slru(void);
 static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-static void pgstat_send_connstats(bool disconnect, TimestampTz last_report);
+static void pgstat_report_disconnect(Oid dboid);
 
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
@@ -366,7 +367,8 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
 static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
 static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_connstat(PgStat_MsgConn *msg, int len);
+static void pgstat_recv_connect(PgStat_MsgConnect *msg, int len);
+static void pgstat_recv_disconnect(PgStat_MsgDisconnect *msg, int len);
 static void pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len);
 static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
 
@@ -890,12 +892,11 @@ pgstat_report_stat(bool disconnect)
 		!TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
 		return;
 
-	/* for backends, send connection statistics */
-	if (MyBackendType == B_BACKEND)
-		pgstat_send_connstats(disconnect, last_report);
-
 	last_report = now;
 
+	if (disconnect)
+		pgstat_report_disconnect(MyDatabaseId);
+
 	/*
 	 * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
 	 * entries it points to.  (Should we fail partway through the loop below,
@@ -947,7 +948,7 @@ 

Re: Postgres perl module namespace

2021-09-07 Thread Noah Misch
On Tue, Sep 07, 2021 at 07:43:47AM -0400, Andrew Dunstan wrote:
> On 9/6/21 1:08 AM, Michael Paquier wrote:
> > On Sat, Sep 04, 2021 at 09:58:08AM -0400, Andrew Dunstan wrote:
> >> On 9/4/21 2:19 AM, Noah Misch wrote:
> >>> plperl uses PostgreSQL:: as the first component of its Perl module 
> >>> namespace.
> >>> We shouldn't use both PostgreSQL:: and Postgres:: in the same source 
> >>> tree, so
> >>> this change should not use Postgres::. 
> >> Good point. Here's the same thing using PostgreSQL::Test
> > A minor point: this introduces PostgreSQL::Test::PostgresVersion.
> > Could be this stripped down to PostgreSQL::Test::Version instead?
> 
> That name isn't very clear - what is it the version of, PostgreSQL or
> the test?

Fair.

> There's nothing very test-specific about this module - it simply
> encapsulates a Postgres version string. So maybe it should just be
> PostgreSQL::Version.

Could be fine, but that name could be useful as a CPAN module.  These modules
don't belong on CPAN, so I'd keep PostgreSQL::Test::PostgresVersion.  There's
only one reference in the tree, so optimizing that particular name is less
exciting.

(I wondered about using PGXS:: as the namespace for all these modules, since
it's short and "pgxs" is the closest thing to a name for the PostgreSQL build
system.  Overall, I didn't convince myself about it being an improvement.)




Re: Estimating HugePages Requirements?

2021-09-07 Thread Michael Paquier
On Tue, Sep 07, 2021 at 05:08:43PM +, Bossart, Nathan wrote:
> On 9/6/21, 9:00 PM, "Michael Paquier"  wrote:
>> +   sprintf(buf, "%lu MB", size_mb);
>> +   SetConfigOption("shared_memory_size", buf, PGC_INTERNAL, PGC_S_OVERRIDE);
>> One small-ish comment about 0002: there is no need to add the unit
>> into the buffer set as GUC_UNIT_MB would take care of that.  The patch
>> looks fine.
> 
> I fixed this in v7.

Switched the variable name to shared_memory_size_mb for easier
grepping, moved it to a more correct location with the other read-only
GUCs, and applied 0002.  Well, 0001 here.

>> +#ifndef WIN32
>> +#include 
>> +#endif
>> So, this is needed in ipci.c to check for MAP_HUGETLB.  I am not much
>> a fan of moving around platform-specific checks when these have
>> remained local to each shmem implementation.  Could it be cleaner to
>> add GetHugePageSize() to win32_shmem.c and make it always declared in
>> the SysV implementation?
> 
> I don't know if it's really all that much cleaner, but I did it this
> way in v7.  IMO it's a little weird that GetHugePageSize() doesn't
> return the value from GetLargePageMinimum(), but that's what we'd need
> to do to avoid setting huge_pages_required for Windows without any
> platform-specific checks.

Thanks.  Keeping MAP_HUGETLB within the SysV portions is an
improvement IMO.  That's subject to one's taste, perhaps.

After sleeping on it, I'd be fine to live with the logic based on the
new GUC flag called GUC_RUNTIME_COMPUTED to control whether a parameter
can be looked at at an earlier or a later stage of the startup
sequence, with the later stage meaning that such parameters cannot be
checked while a server is running.  This case was originally broken
anyway, so it does not make it worse, just better.

+This can be used on a running server for most parameters.  However,
+the server must be shut down for some runtime-computed parameters
+(e.g., ).
Perhaps we should add a couple of extra parameters here, like
shared_memory_size, and perhaps wal_segment_size?  The list does not
have to be complete, just meaningful enough.
--
Michael


signature.asc
Description: PGP signature


RE: Added missing invalidations for all tables publication

2021-09-07 Thread houzj.f...@fujitsu.com
> From Mon, Sep 6, 2021 1:56 PM Amit Kapila  wrote:
> > On Tue, Aug 31, 2021 at 8:54 PM vignesh C  wrote:
> > > Thanks for the comments, the attached v3 patch has the changes for
> > > the same.
> > >
> >
> > I think this bug should be fixed in back branches (till v10). OTOH, as
> > this is not reported by any user and we have found it during code
> > review so it seems either users don't have an exact use case or they
> > haven't noticed this yet. What do you people think about back-patching?
> 
> Personally, I think it's ok to back-patch.

I found that the patch cannot be applied to the back-branches (v10-v14)
cleanly, so I generated the patches for the back-branches. They are
attached, and all of them have passed the regression tests.

Best regards,
Hou zj


v5-HEAD-0001-Invalidate-relcache-for-publications-defined-for-.patch
Description:  v5-HEAD-0001-Invalidate-relcache-for-publications-defined-for-.patch


v5-PG10-0001-Invalidate-relcache-for-publications-defined-for-.patch
Description:  v5-PG10-0001-Invalidate-relcache-for-publications-defined-for-.patch


v5-PG11-0001-Invalidate-relcache-for-publications-defined-for-.patch
Description:  v5-PG11-0001-Invalidate-relcache-for-publications-defined-for-.patch


v5-PG12-0001-Invalidate-relcache-for-publications-defined-for-.patch
Description:  v5-PG12-0001-Invalidate-relcache-for-publications-defined-for-.patch


v5-PG13-0001-Invalidate-relcache-for-publications-defined-for-.patch
Description:  v5-PG13-0001-Invalidate-relcache-for-publications-defined-for-.patch


v5-PG14-0001-Invalidate-relcache-for-publications-defined-for-.patch
Description:  v5-PG14-0001-Invalidate-relcache-for-publications-defined-for-.patch


Re: Improve logging when using Huge Pages

2021-09-07 Thread Kyotaro Horiguchi
At Tue, 7 Sep 2021 08:16:53 -0500, Justin Pryzby  wrote 
in 
> On Tue, Sep 07, 2021 at 07:12:36PM +0900, Fujii Masao wrote:
> > One big concern about the patch is that log message is always reported
> > when shared memory fails to be allocated with huge pages enabled
> > when huge_pages=try. Since huge_pages=try is the default setting,
> > many users would see this new log message whenever they start
> > the server. Those who don't need huge pages but just use the default
> > setting might think that such log messages would be noisy.
> 
> I don't see this as any issue.  We're only talking about a single message on
> each restart, which would be added in a major release.  If it's a problem, the
> message could be a NOTICE or INFO, and it won't be shown by default.
> 
> I think it should say "with/out huge pages" without "enabled/disabled", 
> without
> "again", and without "The server", like:
> 
> +   (errmsg("could not map anonymous 
> shared memory (%zu bytes)"
> +   " with huge pages.", 
> allocsize),
> +errdetail("Anonymous shared memory 
> will be mapped "
> +   "without huge pages.")));

I don't dislike the message, but I'm not sure about it being this
verbose, especially since it has a DETAIL. It seems to me something
like the following is sufficient, or I'd like to see it even more concise.

"fall back anonymous shared memory to non-huge pages: required %zu bytes for huge pages"

If we emit a message for something other than an mmap failure, it would be
like the following.

"fall back anonymous shared memory to non-huge pages: huge pages not available"

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Remove Value node struct

2021-09-07 Thread Kyotaro Horiguchi
At Tue, 7 Sep 2021 11:22:24 +0200, Peter Eisentraut 
 wrote in 
> On 30.08.21 04:13, Kyotaro Horiguchi wrote:
> > +   else if (IsA(obj, Integer))
> > +   _outInteger(str, (Integer *) obj);
> > +   else if (IsA(obj, Float))
> > +   _outFloat(str, (Float *) obj);
> > I felt that the type names are a bit confusing as they might be too
> > generic, or too close to the corresponding binary types.
> > -   Node   *arg;/* a (Value *) or a (TypeName 
> > *) */
> > +   Node   *arg;
> > Mmm. It's a bit pity that we lose the generic name for the value
> > nodes.
> 
> Not sure what you mean here.

The member arg loses the information on what kinds of nodes are to be
stored there. Concretely, it just removes the comment "a (Value *) or a
(TypeName *)". If the (Value *) were expanded in a straightforward way, the
comment would be "a (Integer *), (Float *), (String *), (BitString *),
or (TypeName *)". I suppose the member loses the comment because it
would become too long.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Possible missing segments in archiving on standby

2021-09-07 Thread Kyotaro Horiguchi
At Tue, 7 Sep 2021 17:03:06 +0900, Fujii Masao  
wrote in 
> > + if (XLogArchivingAlways())
> > + XLogArchiveNotify(xlogfilename, true);
> > +   else
> > + XLogArchiveForceDone(xlogfilename);
> > The path is used both for crash and archive recovery. If we pass there
> > while crash recovery on a primary with archive_mode=on, the file could
> > be marked .done before actually archived. On the other hand when
> > archive_mode=always, the file could be re-marked .ready even after it
> > has been already archived.  Why isn't it XLogArchiveCheckDone?
> 
> Yeah, you're right. ISTM what we should do is to just call
> XLogArchiveCheckDone() for the segment including XLOG_SWITCH record,
> i.e., to create .ready file if the segment has no archive notification
> file yet
> and archive_mode is set to 'always'. Even if we don't do this when we
> reach
> XLOG_SWITCH record, subsequent restartpoints eventually will call
> XLogArchiveCheckDone() for such segments.
> 
> One issue of this approach is that the WAL segment including
> XLOG_SWITCH
> record may be archived before its previous segments are. That is,
> the notification file of current segment is created when it's replayed
> because it includes XLOG_SWITCH, but the notification files of
> its previous segments will be created by subsequent restartpoints
> because they don't have XLOG_SWITCH. Probably we should avoid this?

Anyway, there's no guarantee on the archive ordering. As discussed in
the nearby thread [1], a newer segment is often archived earlier. I
agree that that happens mainly on busy servers, though. The archiver
is designed to handle such "gaps" and/or out-of-order segment
notifications.  We could impose a strict ordering on archiving, but I
think we would rather take total performance over such strictness.

In short, no.

> If yes, one approach for this issue is to call XLogArchiveCheckDone()
> for
> not only the segment including XLOG_SWITCH but also all the segments
> older than that. Thought?

At least currently, recovery fetches segments with a single process and
every file is marked immediately after being filled up, so all files
in pg_wal other than the latest one, including history files, should
have been marked for sure unless the file system gets into trouble. So I
think we don't need to do that even if we want the strictness.

In addition, that would take too long when many segments reside in
pg_wal, so I think we never want to run such a scan at every segment
end that recovery passes.  If I remember correctly, the reason we
fix archive status at checkpoint rather than at startup is to avoid
extra startup time.

> Anyway, I extracted the changes in walreceiver from the patch,
> because it's self-contained and can be applied separately.
> Patch attached.

I'm not sure I like that XLogWalRcvClose hides the segment-switch
condition.  If we do that check in the function, I'd like to name the
function XLogWalRcvCloseIfSwitched or something that indicates the
condition.  I'd like to invert the condition to reduce indentation,
too.

Why don't we call it just after writing data, as my first proposal
did? There's no difference in functionality between doing that and the
patch.  If we do so, recvFile >= 0 is always true and that condition can
be removed, but that would be optional.  Anyway, by doing that, we no
longer need to call the function twice, or we can even live without the
new function.

[1] 
https://www.postgresql.org/message-id/20210504042755.ehuaoz5blcnjw5yk%40alap3.anarazel.de

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: The Free Space Map: Problems and Opportunities

2021-09-07 Thread Peter Geoghegan
On Tue, Sep 7, 2021 at 12:31 PM Robert Haas  wrote:
> Doing work in the background has some advantages, though. In
> particular, it has the possibly-large advantage of not slowing down
> foreground work.

What I really like about the idea of doing the work in the foreground
(when it's clearly not going to be too much of a latency hit) is that
it would act as a natural form of backpressure. Sometimes slowing
things down is a good thing -- fundamentally inefficient and
unsustainable use of the system's resources ought to be slowed down by
imposing the cost on the transaction that performs the relevant insert
queries. To me this seems like a problem of negative externalities.
With small transactions we can pretty much know for sure that the
benefit of not leaving pages in good shape at the moment they become
"closed" is fixed and low, while the cost to the system as a whole
over time is potentially far higher. It seems almost unbounded, in a
way, so we're bound to lose in the long run, regardless of workload.
(The work can be piggybacked across transactions to make it more
efficient, but that's mostly an implementation detail.)

Updaters might actually be much less of a problem than inserters. They
already keep costs down by pruning opportunistically -- they at least
try. Most individual logical rows that are inserted have exactly zero
chance of ever being updated, but a ~100% chance of needing to have
hint bits set sooner or later. We know very little about workload
characteristics, but these are some of the few things that we really
do know for sure.

> For me the key insight here is that HOT-pruning a heap page is a lot
> cheaper if you do it before you write the page. Once you've written
> the page, the eventual HOT-prune is going to need to dirty it, which
> will cause it to be written again. If you prune before writing it the
> first time, that cost is avoided.

Right. The difficulty with applying that insight before now has been
making the prune operation take place at the level of whole pages
systematically. I believe that it's quite valuable to set all the hint
bits on a page as we "close it out" for the first time. But setting
hint bits for all of the heap tuples on the page except one or two is
pretty much useless. You might as well not bother at all.

It follows that concurrent inserters clobbering the same page
senselessly are more or less not okay if we're going to pursue this
technique. You might still have a concurrent update of one of the
inserted rows shortly after the original inserting transaction
commits, but before the page is closed -- that might make the idea not
work out for that one page (obviously it depends on the specifics).
That seems like something we can live with, though.

> I'm not sure that it really matters
> whether the space consumed by aborted tuples gets reused by "the very
> next transaction" or, say, 10 transactions after that, or even 1000
> transactions after that. If you wait for a million transactions,
> you've quite possibly lost enough temporality to matter, but at 10 or
> 1000 that's much less likely.

I didn't mean to suggest that it had to happen in perfect lockstep.
But I do think that it should be pretty close to perfect. What I
actually do right now is prune an open page when it *appears* to be
full inside the loop in RelationGetBufferForTuple(). As crude as that
is, it sets hint bits and prunes away aborted transactions in mostly
the right way, at least as far as TPC-C is concerned. (This only works
because it fits into the wider framework of my much larger patch,
which introduces ownership semantics, open and closed pages, empty
page freelists, etc.)

> The exact threshold is fuzzy: every
> moment you wait makes it less likely that you have sufficient
> locality, but you can always find a workload where even a very long
> wait is acceptable, and another one where even a tiny delay is
> catastrophic, and it's hard to say what the "real world" looks like.

I agree that offloading work in the background can be very useful, and
that technique needs to be possible as future work (I actually really
want this myself). But most individual apps probably won't really
notice the latency hit from pruning if it's well managed -- as with
pruning by updates today. We can make a decision about which strategy
to use (lightweight foreground processing with a potential latency
hit, or background batch cleanup?) at a very late point in the cycle,
based on the specifics in each case. Offhand, I don't think that we
need to characterize this fuzzy threshold in too much detail. I
suspect that the main determinative question can be simple enough.
Something like: Does this transaction (or this group of transactions)
involve only trivial changes to maybe 3 - 5 heap pages at the most, or
not?

There is a separate question about what happens if our background
cleanup process cannot keep up over time. I think that we can handle
that by falling back on foreground processing (within 

Re: Bug in query rewriter - hasModifyingCTE not getting set

2021-09-07 Thread Greg Nancarrow
On Wed, Sep 8, 2021 at 8:00 AM Tom Lane  wrote:
>
> "tsunakawa.ta...@fujitsu.com"  writes:
> > The attached patch is based on your version.  It includes cosmetic
> > changes to use = instead of |= for boolean variable assignments.
>
> Now, we could potentially make this work if we wrote code to run
> through the copied rtable entries (recursively) and increment the
> appropriate ctelevelsup fields by one.  That would essentially
> have to be a variant of IncrementVarSublevelsUp that *only* acts
> on ctelevelsup and not other level-dependent fields.  That's
> what I meant when I spoke of moving mountains: the amount of code
> that would need to go into this seems out of all proportion to
> the value.  I think we should just throw an error, instead.
> At least till such time as we see actual field complaints.
>

[I don't think Tsunakawa-san will be responding to this any time soon]

I proposed a patch for this issue in a separate thread:
https://www.postgresql.org/message-id/CAJcOf-f68DT=26YAMz_i0+Au3TcLO5oiHY5=fl6sfuits6r...@mail.gmail.com

The patch takes your previously-reverted patch for this issue and adds an
error condition, so it does throw an error for that test case in your
previous post.
It also affects one existing regression test, since that uses an
INSERT...SELECT rule action applied to a command with a data-modifying CTE
(and we shouldn't really be allowing that anyway).


Regards,
Greg Nancarrow
Fujitsu Australia


Re: Allow escape in application_name (was: [postgres_fdw] add local pid to fallback_application_name)

2021-09-07 Thread Fujii Masao




On 2021/09/08 7:46, Tom Lane wrote:
> Fujii Masao  writes:
>> Pushed 0001 patch. Thanks!
>
> The buildfarm shows that this test case isn't terribly reliable.

Yes... Thanks for reporting this!

> TBH, I think you should just remove the test case altogether.
> It does not look useful enough to justify a permanent expenditure of
> testing cycles, never mind the amount of effort that would be needed to
> stabilize it.

Ok, I will remove the tests.

Kuroda-san is proposing the additional feature which allows
postgres_fdw.application_name to include escape characters.
If we commit the feature, it's worth adding the similar tests
checking that the escape characters there are replaced with
expected values. So it's better to investigate what made
the tests fail, not to cause the same tests failure later.

The tests use postgres_fdw_disconnect_all() to close the existing
remote connections before establishing new remote connection
with new application_name setting. Then they check that
the expected application_name appears in pg_stat_activity.
The cause of the issue seems to be that it can take a bit of time for
the existing remote connection (i.e., its corresponding backend)
to exit after postgres_fdw_disconnect_all() is called.
So the application_name of the existing remote connection could also
appear in pg_stat_activity when it's checked the next time.

If this analysis is right, the tests that the upcoming patch adds
should handle the case where the existing remote connection
can still appear in pg_stat_activity after postgres_fdw_disconnect_all().

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Data loss when '"json_populate_recorset" with long column name

2021-09-07 Thread Julien Rouhaud
On Tue, Sep 7, 2021 at 10:08 PM Tom Lane  wrote:
>
> Julien Rouhaud  writes:
>
> > Yes, but even if we eventually fix that my impression is that we would
> > still enforce a limit of 128 characters (or bytes) as this is the SQL
> > specification.
>
> Probably not.  I think SQL says that's the minimum expectation;

Ah, I didn't know that.

> and
> even if they say it should be that exactly, there is no reason we'd
> suddenly start slavishly obeying that part of the spec after ignoring
> it for years ;-).

Well, yes, but we ignored it for years due to a technical limitation.
And the result of that is that we make migration to postgres harder.

If we somehow find a way to remove this limitation, ignoring the spec
again (assuming that the spec gives a hard limit) will make migration
from postgres harder and will also probably bring other problems
(allowing identifiers kilobytes long will lead to bigger caches for
instances, which can definitely bite hard).  Is it really worth it?




Re: Allow escape in application_name (was: [postgres_fdw] add local pid to fallback_application_name)

2021-09-07 Thread Tom Lane
Fujii Masao  writes:
> Pushed 0001 patch. Thanks!

The buildfarm shows that this test case isn't terribly reliable.

TBH, I think you should just remove the test case altogether.
It does not look useful enough to justify a permanent expenditure of
testing cycles, never mind the amount of effort that would be needed to
stabilize it.

regards, tom lane




Re: Column Filtering in Logical Replication

2021-09-07 Thread Euler Taveira
On Mon, Sep 6, 2021, at 2:51 PM, Alvaro Herrera wrote:
> I pushed the clerical part of this -- namely the addition of
> PublicationTable node and PublicationRelInfo struct.  I attach the part
> of your v4 patch that I didn't include.  It contains a couple of small
> corrections, but I didn't do anything invasive (such as pgindent)
> because that would perhaps cause you too much merge pain.
While updating the row filter patch [1] (because it also uses these
structures), I noticed that you use PublicationRelInfo as a type name instead
of PublicationRelationInfo. I chose the latter because there is already a data
structure named PublicationRelInfo (pg_dump.h). It is a client-side data
structure, but it doesn't seem a good practice to duplicate data structure
names within the same code base.

[1] 
https://www.postgresql.org/message-id/0c2464d4-65f4-4d91-aeb2-c5584c1350f5%40www.fastmail.com


--
Euler Taveira
EDB   https://www.enterprisedb.com/


parallelizing the archiver

2021-09-07 Thread Bossart, Nathan
Hi hackers,

I'd like to gauge interest in parallelizing the archiver process.
From a quick scan, I was only able to find one recent thread [0] that
brought up this topic, and ISTM the conventional wisdom is to use a
backup utility like pgBackRest that does things in parallel behind-
the-scenes.  My experience is that the generating-more-WAL-than-we-
can-archive problem is pretty common, and parallelization seems to
help quite a bit, so perhaps it's a good time to consider directly
supporting parallel archiving in PostgreSQL.

Based on previous threads I've seen, I believe many in the community
would like to replace archive_command entirely, but what I'm proposing
here would build on the existing tools.  I'm currently thinking of
something a bit like autovacuum_max_workers, but the archive workers
would be created once and would follow a competing consumers model.
Another approach I'm looking at is to use background worker processes,
although I'm not sure if linking such a critical piece of
functionality to max_worker_processes is a good idea.  However, I do
see that logical replication uses background workers.

Anyway, I'm curious what folks think about this.  I think it'd help
simplify server administration for many users.

Nathan

[0] 
https://www.postgresql.org/message-id/flat/20180828060221.x33gokifqi3csjj4%40depesz.com



Re: Bug in query rewriter - hasModifyingCTE not getting set

2021-09-07 Thread Tom Lane
"tsunakawa.ta...@fujitsu.com"  writes:
> The attached patch is based on your version.  It includes cosmetic
> changes to use = instead of |= for boolean variable assignments.

I think that's less "cosmetic" than "gratuitous breakage".  The point
here is that we are combining two rtables, so the query had better
end up with flags that describe the union of the rtables' properties.
Our regression tests are unfortunately not very thorough in this area,
so it doesn't surprise me that they fail to fall over.

After thinking about it for awhile, I'm okay with the concept of
attaching the source query's CTEs to the parent rule_action so far
as the semantics are concerned.  But this patch fails to implement
that correctly.  If we're going to do it like that, then the
ctelevelsup fields of any CTE RTEs that refer to those CTEs have
to be incremented when rule_action is different from sub_action,
because the CTEs are getting attached one level higher in the
query nest than the referencing RTEs are.  The proposed test case
fails to expose this, because the rule action isn't INSERT/SELECT,
so the case of interest isn't being exercised at all.  However,
it's harder than you might think to demonstrate a problem ---
I first tried

CREATE RULE bug6051_3_ins AS ON INSERT TO bug6051_3 DO INSTEAD
  INSERT INTO bug6051_2 SELECT a FROM bug6051_3;

and that failed to fall over with the patch.  Turns out that's
because the SELECT part is simple enough to be pulled up, and
the pull-up moves the CTE that's been put into it one level
higher, causing it to accidentally have the correct ctelevelsup
anyway.  If you use an INSERT with a non-pull-up-able SELECT
then you can see the problem: this script

CREATE TEMP TABLE bug6051_2 (i int);

CREATE TEMP TABLE bug6051_3 AS
  select a from generate_series(11,13) as a;

CREATE RULE bug6051_3_ins AS ON INSERT TO bug6051_3 DO INSTEAD
  INSERT INTO bug6051_2 SELECT sum(a) FROM bug6051_3;

explain verbose
WITH t1 AS ( DELETE FROM bug6051_3 RETURNING * )
  INSERT INTO bug6051_3 SELECT * FROM t1;

causes the patch to fail with

ERROR:  could not find CTE "t1"

Now, we could potentially make this work if we wrote code to run
through the copied rtable entries (recursively) and increment the
appropriate ctelevelsup fields by one.  That would essentially
have to be a variant of IncrementVarSublevelsUp that *only* acts
on ctelevelsup and not other level-dependent fields.  That's
what I meant when I spoke of moving mountains: the amount of code
that would need to go into this seems out of all proportion to
the value.  I think we should just throw an error, instead.
At least till such time as we see actual field complaints.
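
Just to make concrete what such a variant would have to look like --
and I'm not proposing to write it -- here is a rough sketch (assuming
the usual parsenodes.h/pg_list.h declarations).  It only walks range
tables; a real version would also have to reach subqueries buried in
expressions via query_tree_walker, which is where the bulk of the work
would be:

static void
bump_ctelevelsup(List *rtable, int min_levelsup)
{
	ListCell   *lc;

	foreach(lc, rtable)
	{
		RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);

		/* bump CTE references copied one level below their CTEs */
		if (rte->rtekind == RTE_CTE && rte->ctelevelsup >= min_levelsup)
			rte->ctelevelsup++;
		else if (rte->rtekind == RTE_SUBQUERY)
			bump_ctelevelsup(rte->subquery->rtable, min_levelsup + 1);
	}
}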

regards, tom lane




Re: Rewriting the test of pg_upgrade as a TAP test - take three - remastered set

2021-09-07 Thread Rachel Heaton
Hello Michael,

This patch needs the update from 201a76183 -- the function `get_new_node`
no longer exists.

Running check tests in the pg_upgrade folder fails for this reason.

Thank you,
Rachel

On Tue, Sep 7, 2021 at 2:06 PM Michael Paquier  wrote:

> On Tue, May 18, 2021 at 10:49:39AM +0900, Michael Paquier wrote:
> > Makes sense.  For now, I'll update this patch set so as it is possible
> > to use custom dumps, as an option in parallel of pg_regress when
> > specifying a different source code path.  I'll also decouple the
> > business with probin updates and stick with the approach used by the
> > buildfarm code.
>
> This has proved to not be that difficult.  With the updated version
> attached, pg_upgrade has two modes to set up the old instance used for
> the upgrade with older binaries:
> - With oldsrc and oldinstall set, pg_regress gets used, same way as
> HEAD.
> - With olddump and oldinstall set, an old dump is loaded instead in
> the old instance before launching the upgrade.
>
> oldsrc and olddump are exclusive options.  Similarly to HEAD, the
> dumps taken from the old and new instances generate diffs that can be
> inspected manually.  The updates of probin are done without any
> dependencies to the source path of the old instance, copying from the
> buildfarm.
>
> While on it, I have fixed a couple of things that exist in test.sh but
> were not reflected in this new script:
> - Handling of postfix operations with ~13 clusters.
> - Handling oldstyle_length for ~9.6 clusters.
> - Handling of EXTRA_REGRESS_OPT.
>
> This stuff still needs to be expanded depending on how PostgresNode is
> made backward-compatible, but I'll wait for that to happen before
> going further down here.  I have also spent some time testing all that
> with MSVC, and the installation paths used for pg_regress make the
> script a tad more confusing, so I have dropped this part for now.
>
> Thanks,
> --
> Michael
>


Re: Atomic rename feature for Windows.

2021-09-07 Thread Victor Spirin

Thank you,

In this variant:

1) renamed file win10.manifest to windows.manifest

2) renamed function pgrename_win10 to pgrename_windows_posix_semantics

3) The pgrename function returns the result of the
pgrename_windows_posix_semantics function and does not continue to run the
old version of the function.


4) Added a call to GetLastError() after a MultiByteToWideChar() error.


+typedef struct _FILE_RENAME_INFO_VVS {
+   union {
+   BOOLEAN ReplaceIfExists;  // FileRenameInfo
+   DWORD Flags;  // FileRenameInfoEx
+   } DUMMYUNIONNAME;
+   HANDLE RootDirectory;
+   DWORD FileNameLength;
+   WCHAR FileName[MAX_PATH];
+} FILE_RENAME_INFO_VVS;

Why can't we use a system header[2] for this?

I have a dynamic memory allocation version in the first patch.

    len = wcslen(to_w);
    rename_info = (FILE_RENAME_INFO *) malloc(sizeof(FILE_RENAME_INFO) + (len + 1) * sizeof(wchar_t));

    rename_info->ReplaceIfExists = TRUE;
    rename_info->Flags = FILE_RENAME_FLAG_POSIX_SEMANTICS | FILE_RENAME_FLAG_REPLACE_IF_EXISTS;

    rename_info->RootDirectory = NULL;
    rename_info->FileNameLength = len;
    memcpy(rename_info->FileName, to_w, (len + 1) * sizeof(wchar_t));

Is this code better? Maybe there is another correct method?

I checked the pgrename_windows_posix_semantics() function on Windows 7. 
It returns error 87: Parameter is incorrect. Hence, it is necessary to 
check the Windows version and call the old pgrename function for old 
Windows.
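
Something along these lines is what I have in mind for the dispatch
(pgrename_traditional is a hypothetical name for the existing retry-loop
code; IsWindows10OrGreater() comes from <versionhelpers.h>):

int
pgrename(const char *from, const char *to)
{
	/*
	 * Use the POSIX-semantics rename only where FileRenameInfoEx is
	 * supported; otherwise fall back to the traditional MoveFileEx
	 * retry loop, factored out here as pgrename_traditional().
	 */
	if (IsWindows10OrGreater())
		return pgrename_windows_posix_semantics(from, to);

	return pgrename_traditional(from, to);
}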



Victor Spirin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

On 07.09.2021 4:40, Thomas Munro wrote:

On Tue, Sep 7, 2021 at 5:44 AM Victor Spirin  wrote:

I have changed the way I add the manifest to projects. I used the
AdditionalManifestFiles option for a VS project.

Hi Victor,

Thanks for working on this!

I wonder if POSIX-style rename is used automatically on recent
Windows, based on the new clue that DeleteFile() has started
defaulting to POSIX semantics[1] (maybe it would require ReplaceFile()
instead of MoveFileEx(), but I have no idea.)  If so, one question is
whether we'd still want to do this explicit POSIX rename dance, or
whether we should just wait a bit longer for it to happen
automatically on all relevant systems (plus tweak to use ReplaceFile()
if necessary).  If not, we might want to do what you're proposing
anyway, especially if ReplaceFile() is required, because its interface
is weird (it only works if the other file exists?).  Hmm, that alone
would be a good reason to go with your plan regardless, and of course
it would be good to see this fixed everywhere ASAP.

We still have to answer that question for pgunlink().  I was
contemplating that over in that other thread, because unlink() ->
EACCES is blocking something I'm working on.  I found a partial
solution to that that works even on old and non-NTFS systems, and I
was thinking that would be enough for now and we could just patiently
wait until automatic POSIX semantics to arrives on all relevant
systems as the real long term solution, so I didn't need to expend
energy doing an intermediate explicit POSIX-mode wrapper like what
you're proposing.  But then it seems strange to make a different
choice about that for rename() and unlink().  So... do you think it
would make sense to extend your patch to cover unlink() too?

It would be great to have a tool in the tree that tests directory
entry semantics, called something like src/bin/pg_test_dirmod, so that
it becomes very clear when POSIX semantics are being used.  It could
test various interesting unlink and rename scenarios through our
wrappers (concurrent file handles, creating a new file with the name
of the old one, unlinking the containing directory, ...).  It could
run on the build farm animals, and we could even ask people to run it
when they report problems, to try to lift the fog of bizarro Windows
file system semantics.

How exactly does the function fail on a file system that doesn't
support the new POSIX semantics?  Assuming there is something like
ENOSUPP to report "file system says no", do we want to keep trying
every time, or remember that it doesn't work?  I guess the answer may
vary per mount point, which makes it hard to track when you only have
an fd...

If it fails because of a sharing violation, it seems strange that we
immediately fall back to the old code to do the traditional (and
horrible) sleep/retry loop.  That means that in rare conditions we can
still get the old behaviour that leads to problems, just because of a
transient condition.  Hmm.  Would it make more sense to say: fall back
to the traditional behaviour only for ENOSUPP (if there is such a
thing), cope with transient sharing violations without giving up on
POSIX semantics, and report all other failures immediately?

I agree that the existing enum CHECKSUM_TYPE_NONE + friends should be
renamed to something less collision-prone and more consistent with the
name of the enum ("pg_checksum_type"), so I'd vote for adding a PG_
prefix, in a 

Re: Gather performance analysis

2021-09-07 Thread Andres Freund
Hi,

On 2021-08-06 14:00:48 +0530, Dilip Kumar wrote:
> --Setup
> SET parallel_tuple_cost TO 0   -- to test parallelism in the extreme case
> CREATE TABLE t (a int, b varchar);
> INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,2) as 
> i;
> ANALYZE t;
> Test query: EXPLAIN ANALYZE SELECT * FROM t;
> 
> Perf analysis: Gather Node
>- 43.57% shm_mq_receive
>   - 78.94% shm_mq_receive_bytes
>  - 91.27% pg_atomic_read_u64
> - pg_atomic_read_u64_impl
>- apic_timer_interrupt
>  smp_apic_timer_interrupt
> 
> Perf analysis: Worker Node
>   - 99.14% shm_mq_sendv
>  - 74.10% shm_mq_send_bytes
> + 42.35% shm_mq_inc_bytes_written
> - 32.56% pg_atomic_read_u64
>- pg_atomic_read_u64_impl
>   - 86.27% apic_timer_interrupt
> + 17.93% WaitLatch
> 
> From the perf results and also from the code analysis I can think of
> two main problems here

Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls should
be inlined. And while perf can reconstruct inlined functions when using
--call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)" for me.

FWIW, I see times like this

postgres[4144648][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t;
 QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────
 Gather  (cost=1000.00..6716686.33 rows=2 width=208) (actual rows=2 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t  (cost=0.00..6715686.33 rows=8333 width=208) (actual rows=6667 loops=3)
 Planning Time: 0.043 ms
 Execution Time: 24954.012 ms
(6 rows)


Looking at a profile I see the biggest bottleneck in the leader (which is the
bottleneck as soon as the worker count is increased) to be reading the length
word of the message. I do see shm_mq_receive_bytes() in the profile, but the
costly part there is the "read % (uint64) ringsize" - divisions are slow. We
could just compute a mask instead of the size.
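
A sketch of the mask idea, assuming we force the ring size to a power of
two at creation time (mq_ring_mask is a hypothetical field; it doesn't
exist in shm_mq today):

	/* at shm_mq_create() time, with ring_size rounded down to a power of two */
	mq->mq_ring_mask = ring_size - 1;

	/* in shm_mq_receive_bytes(), instead of "read % (uint64) ringsize" */
	offset = read & mq->mq_ring_mask;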

We also should probably split the read-mostly data in shm_mq (ring_size,
detached, ring_offset, receiver, sender) into a separate cacheline from the
read/write data. Or perhaps copy more info into the handle, particularly the
ringsize (or mask).

Greetings,

Andres Freund




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2021-09-07 Thread Melanie Plageman
On Fri, Aug 13, 2021 at 3:08 AM Andres Freund  wrote:
>
> Hi,
>
> On 2021-08-11 16:11:34 -0400, Melanie Plageman wrote:
> > On Tue, Aug 3, 2021 at 2:13 PM Andres Freund  wrote:
> > > > diff --git a/src/backend/catalog/system_views.sql 
> > > > b/src/backend/catalog/system_views.sql
> > > > index 55f6e3711d..96cac0a74e 100644
> > > > --- a/src/backend/catalog/system_views.sql
> > > > +++ b/src/backend/catalog/system_views.sql
> > > > @@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
> > > >  pg_stat_get_bgwriter_buf_written_checkpoints() AS 
> > > > buffers_checkpoint,
> > > >  pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
> > > >  pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
> > > > -pg_stat_get_buf_written_backend() AS buffers_backend,
> > > > -pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
> > > > -pg_stat_get_buf_alloc() AS buffers_alloc,
> > > >  pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
> > >
> > > Material for a separate patch, not this. But if we're going to break
> > > monitoring queries anyway, I think we should consider also renaming
> > > maxwritten_clean (and perhaps a few others), because nobody understands 
> > > what
> > > that is supposed to mean.
>
> > Do you mean I shouldn't remove anything from the pg_stat_bgwriter view?
>
> No - I just meant that now that we're breaking pg_stat_bgwriter queries,
> we should also rename the columns to be easier to understand. But that
> it should be a separate patch / commit...
>

I separated the removal of some redundant stats from pg_stat_bgwriter
into a different commit but haven't removed or clarified any additional
columns in pg_stat_bgwriter.

>
>
> > > > @@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 
> > > > *num_buf_alloc)
> > > >*/
> > > >   *complete_passes += nextVictimBuffer / NBuffers;
> > > >   }
> > > > -
> > > > - if (num_buf_alloc)
> > > > - {
> > > > - *num_buf_alloc = 
> > > > pg_atomic_exchange_u32(>numBufferAllocs, 0);
> > > > - }
> > > >   SpinLockRelease(>buffer_strategy_lock);
> > > >   return result;
> > > >  }
> > >
> > > Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity? I
> > > suspect this patch shouldn't get rid of numBufferAllocs at the same time 
> > > as
> > > overhauling the stats stuff. Perhaps we don't need both - but it's not 
> > > obvious
> > > that that's the case / how we can make that work.
> > >
> > >
> >
> > I initially meant to add a function to the patch like
> > pg_stat_get_buffer_actions() but which took a BufferActionType and
> > BackendType as parameters and returned a single value which is the
> > number of buffer action types of that type for that type of backend.
> >
> > let's say I defined it like this:
> > uint64
> >   pg_stat_get_backend_buffer_actions_stats(BackendType backend_type,
> >   BufferActionType ba_type)
> >
> > Then, I intended to use that in StrategySyncStart() to set num_buf_alloc
> > by subtracting the value of StrategyControl->numBufferAllocs from the
> > value returned by pg_stat_get_backend_buffer_actions_stats(B_BG_WRITER,
> > BA_Alloc), val, then adding that value, val, to
> > StrategyControl->numBufferAllocs.
>
> I don't think you could restrict this to B_BG_WRITER? The whole point of
> this logic is that bgwriter uses the stats for *all* backends to get the
> "usage rate" for buffers, which it then uses to control how many buffers
> to clean.
>
>
> > I think that would have the same behavior as current, though I'm not
> > sure if the performance would end up being better or worse. It wouldn't
> > be atomically incrementing StrategyControl->numBufferAllocs, but it
> > would do a few additional atomic operations in StrategySyncStart() than
> > before. Also, we would do all the work done by
> > pg_stat_get_buffer_actions() in StrategySyncStart().
>
> I think it'd be better to separate changing the bgwriter pacing logic
> (and thus numBufferAllocs) from changing the stats reporting.
>
>
> > But that is called comparatively infrequently, right?
>
> Depending on the workload not that rarely. I'm afraid this might be a
> bit too expensive. It's possible we can work around that however.
>

I've restored StrategyControl->numBufferAllocs.

Attached is v6 of the patchset.

I have made several small updates to the patch, including user docs
updates, comment clarifications, various changes related to how
structures are initialized, code simplications, small details like
alphabetizing of #includes, etc.

Below are details on the remaining TODOs and open questions for this
patch and why I haven't done them yet:

1) performance testing (initial tests done, but need to do some further
investigation before sharing)

2) stats_reset
Because pg_stat_buffer_actions fields were added to the globalStats
structure, they get reset when the 

Re: Read-only vs read only vs readonly

2021-09-07 Thread Magnus Hagander
On Fri, Sep 3, 2021 at 8:10 AM Kyotaro Horiguchi
 wrote:
>
> At Thu, 2 Sep 2021 22:07:02 +, "Bossart, Nathan"  
> wrote in
> > On 9/2/21, 11:30 AM, "Magnus Hagander"  wrote:
> > > I had a customer point out to me that we're inconsistent in how we
> > > spell read-only. Turns out we're not as inconsistent as I initially
> > > thought :), but that they did manage to spot the one actual log
> > > message we have that writes it differently than everything else -- but
> > > that broke their grepping...
> > >
> > > Almost everywhere we use read-only. Attached patch changes the one log
> > > message where we didn't, as well as a few places in the docs for it. I
> > > did not bother with things like comments in the code.
> > >
> > > Two questions:
> > >
> > > 1. Is it worth fixing? Or just silly nitpicking?
> >
> > It seems entirely reasonable to me to consistently use "read-only" in
> > the log messages and documentation.
> >
> > > 2. What about translations? This string exists in translations --
> > > should we just "fix" it there, without touching the translated string?
> > > Or try to fix both? Or leave it for the translators who will get a
> > > diff on it?
> >
> > I don't have a strong opinion, but if I had to choose, I would say to
> > leave it to the translators to decide.
>
> +1 for both.  As a translator, it seems that that kind of changes are
> usual.  Many changes about full-stops, spacings, capitalizing and so
> happen especially at near-release season like now.

Thanks for the input. I've applied this and back-patched to 14 since
it's not out yet and there is translation still to be done. I didn't
backpatch it further back than that to avoid the need for translation
updates there.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: [PATCH] Add tab-complete for backslash commands

2021-09-07 Thread Tom Lane
"tanghy.f...@fujitsu.com"  writes:
> On Sunday, September 5, 2021 1:42 AM, Tom Lane  wrote:
>> I particularly question why we'd offer both
>> single- and multiple-character versions, as the single-character
>> version seems entirely useless from a completion standpoint.

> I generally agreed with your opinion. But I'm not sure if there's someone
> who'd like to see the list of backslash commands and choose one to use.
> I mean, someone may type '\[tab][tab]' to check all supported backslash 
> commands.

Sure, but he'd still get all the commands, just not all the possible
spellings of each one.  And a person who's not sure what's available
is unlikely to be helped by an entry for "\c", because it's not at all
clear which command that's an abbreviation for.

In any case, my main point is that the primary usage of tab-completion
is as a typing aid, not documentation.  I do not think we should make
the behavior less useful for typing in order to be exhaustive on the
documentation side.

regards, tom lane




Re: Don't clean up LLVM state when exiting in a bad way

2021-09-07 Thread Justin Pryzby
On Tue, Sep 07, 2021 at 12:27:27PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2021-08-18 15:00:59 +, Jelte Fennema wrote:
> > I ran into some segfaults when using Postgres that was compiled with LLVM
> > 7. According to the backtraces these crashes happened during the call to
> > llvm_shutdown, during cleanup after another out of memory condition. It
> > seems that calls to LLVMOrcDisposeInstance, can crash (at least on LLVM 7)
> > when LLVM is left in bad state. I attached the relevant part of the
> > stacktrace to this email.
> 
> > With the attached patch these segfaults went away. The patch turns
> > llvm_shutdown into a no-op whenever the backend is exiting with an
> > error. Based on my understanding of the code this should be totally fine. No
> > memory should be leaked, since all memory will be cleaned up anyway once the
> > backend exits shortly after. The only reason this cleanup code even seems to
> > exist at all is to get useful LLVM profiling data. To me it seems be
> > acceptable if the profiling data is incorrect/missing when the backend exits
> > with an error.
> 
> I think this is a tad too strong. We should continue to clean up on exit as
> long as the error didn't happen while we're already inside llvm
> code. Otherwise we lose some ability to find leaks. How about checking in the
> error path whether fatal_new_handler_depth is > 0, and skipping cleanup in
> that case? Because that's precisely when it should be unsafe to reenter LLVM.

This avoids a crash when compiled with llvm7+clang7 on RH7 on master:

python3 -c "import pg; db=pg.DB('dbname=postgres host=/tmp port=5678'); 
db.query('SET jit_above_cost=0; SET jit_inline_above_cost=-1; SET jit=on; SET 
client_min_messages=debug'); db.query('begin'); db.query_formatted('SELECT 1 
FROM generate_series(1,99)a WHERE a=%s', [1], inline=False);"

diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index df691cbf1c..a3869ee700 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -885,6 +885,12 @@ llvm_session_initialize(void)
 static void
 llvm_shutdown(int code, Datum arg)
 {
+   extern int fatal_new_handler_depth;
+
+   // if (code!=0)
+   if (fatal_new_handler_depth > 0)
+   return;
+
 #if LLVM_VERSION_MAJOR > 11
{
if (llvm_opt3_orc)
diff --git a/src/backend/jit/llvm/llvmjit_error.cpp 
b/src/backend/jit/llvm/llvmjit_error.cpp
index 26bc828875..802dc1b058 100644
--- a/src/backend/jit/llvm/llvmjit_error.cpp
+++ b/src/backend/jit/llvm/llvmjit_error.cpp
@@ -24,7 +24,7 @@ extern "C"
 #include "jit/llvmjit.h"
 
 
-static int fatal_new_handler_depth = 0;
+int fatal_new_handler_depth = 0;
 static std::new_handler old_new_handler = NULL;
 
 static void fatal_system_new_handler(void);




Re: The Free Space Map: Problems and Opportunities

2021-09-07 Thread Robert Haas
On Mon, Sep 6, 2021 at 8:29 PM Peter Geoghegan  wrote:
> On Mon, Sep 6, 2021 at 4:33 PM Hannu Krosing  wrote:
> > When I have been thinking of this type of problem it seems that the
> > latest -- and correct :) --  place which should do all kinds of
> > cleanup like removing aborted tuples, freezing committed tuples and
> > setting any needed hint bits would be background writer or CHECKPOINT.
> >
> > This would be more PostgreSQL-like, as it moves any work not
> > immediately needed from the critical path, as an extension of how MVCC
> > for PostgreSQL works in general.
>
> I think it depends. There is no need to do work in the background
> here, with TPC-C. With my patch series each backend can know that it
> just had an aborted transaction that affected a page that it more or
> less still owns. And has very close at hand, for further inserts. It's
> very easy to piggy-back the work once you have that sense of ownership
> of newly allocated heap pages by individual backends/transactions.

Doing work in the background has some advantages, though. In
particular, it has the possibly-large advantage of not slowing down
foreground work.

For me the key insight here is that HOT-pruning a heap page is a lot
cheaper if you do it before you write the page. Once you've written
the page, the eventual HOT-prune is going to need to dirty it, which
will cause it to be written again. If you prune before writing it the
first time, that cost is avoided. I'm not sure that it really matters
whether the space consumed by aborted tuples gets reused by "the very
next transaction" or, say, 10 transactions after that, or even 1000
transactions after that. If you wait for a million transactions,
you've quite possibly lost enough temporality to matter, but at 10 or
1000 that's much less likely. The exact threshold is fuzzy: every
moment you wait makes it less likely that you have sufficient
locality, but you can always find a workload where even a very long
wait is acceptable, and another one where even a tiny delay is
catastrophic, and it's hard to say what the "real world" looks like.

On the other hand, there's nothing fuzzy about the expense incurred by
writing the page before it's HOT-pruned. That is essentially certain
to incur an extra page write, except in the corner case where the
relation gets dropped or truncated before then. So I think you might
find that if you found a way to ensure that HOT-pruning -- and entry
into the FSM -- always happens for every heap page just before it's
written, if it hasn't already happened sooner and might be needed, you
might end up in a pretty good spot. It wouldn't even be ignoring
temporal locality, since at minimum dirty pages are written once per
checkpoint cycle.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Don't clean up LLVM state when exiting in a bad way

2021-09-07 Thread Andres Freund
Hi,

On 2021-08-18 15:00:59 +, Jelte Fennema wrote:
> I ran into some segfaults when using Postgres that was compiled with LLVM
> 7. According to the backtraces these crashes happened during the call to
> llvm_shutdown, during cleanup after another out of memory condition. It
> seems that calls to LLVMOrcDisposeInstance, can crash (at least on LLVM 7)
> when LLVM is left in bad state. I attached the relevant part of the
> stacktrace to this email.

> With the attached patch these segfaults went away. The patch turns
> llvm_shutdown into a no-op whenever the backend is exiting with an
> error. Based on my understanding of the code this should be totally fine. No
> memory should be leaked, since all memory will be cleaned up anyway once the
> backend exits shortly after. The only reason this cleanup code even seems to
> exist at all is to get useful LLVM profiling data. To me it seems be
> acceptable if the profiling data is incorrect/missing when the backend exits
> with an error.

I think this is a tad too strong. We should continue to clean up on exit as
long as the error didn't happen while we're already inside llvm
code. Otherwise we lose some ability to find leaks. How about checking in the
error path whether fatal_new_handler_depth is > 0, and skipping cleanup in
that case? Because that's precisely when it should be unsafe to reenter LLVM.

Greetings,

Andres Freund




Re: Correct handling of blank/commented lines in PSQL interactive-mode history

2021-09-07 Thread Tom Lane
[ this is a digression from the main point of the thread, but ... ]

Alvaro Herrera  writes:
> I am particularly bothered by the uselessness
> that M-# results in -- namely, inserting a # at the start of the buffer.

Fixing that might be as simple as the attached.  I've not beat on
it hard though.

regards, tom lane

diff --git a/src/bin/psql/input.c b/src/bin/psql/input.c
index f926bc98dc..1dcd95a7b9 100644
--- a/src/bin/psql/input.c
+++ b/src/bin/psql/input.c
@@ -353,8 +353,13 @@ initializeInput(int flags)
 
 		useReadline = true;
 
-		/* these two things must be done in this order: */
+		/* set appropriate values for Readline's global variables */
 		initialize_readline();
+
+		/* set comment-begin to a useful value for SQL */
+		(void) rl_variable_bind("comment-begin", "-- ");
+
+		/* this reads ~/.inputrc, so do it after rl_variable_bind */
 		rl_initialize();
 
 		useHistory = true;


Re: Data loss when '"json_populate_recorset" with long column name

2021-09-07 Thread Gavin Flower

On 8/09/21 2:08 am, Tom Lane wrote:

Julien Rouhaud  writes:

On Tue, Sep 7, 2021 at 1:31 PM Michael Paquier  wrote:

Yeah.  We should try to work toward removing the limits on NAMEDATALEN
for the attribute names.  Easier said than done :)

Yes, but even if we eventually fix that my impression is that we would
still enforce a limit of 128 characters (or bytes) as this is the SQL
specification.

Probably not.  I think SQL says that's the minimum expectation; and
even if they say it should be that exactly, there is no reason we'd
suddenly start slavishly obeying that part of the spec after ignoring
it for years ;-).

There would still be a limit of course, but it would stem from the max
tuple width in the associated catalog, so on the order of 7kB or so.
(Hmm ... perhaps it'd be wise to set a limit of say a couple of kB,
just so that the implementation limit is crisp rather than being
a little bit different in each catalog and each release.)

regards, tom lane



How about 4kB (unless there are systems for which this is too large)?

That should be easy to remember.


Cheers,
Gavin





Re: CI/windows docker vs "am a service" autodetection on windows

2021-09-07 Thread Andres Freund
Hi,

On 2021-08-16 15:34:51 +0200, Magnus Hagander wrote:
> It wouldn't surprise me if it does break some *other* weird
> cornercase, but based on the docs page you linked to it doesn't look
> like it would break any of the normal/standard usecases.

Yea, me neither...

I do suspect that it'd have been better to have a --windows-service flag to
postgres. But we can't easily change the past...


> Gaining better testability definitely seems worth it, so I think an
> approach of "push to master and see what explodes" is reasonable :)

Done.

Greetings,

Andres Freund




Re: Correct handling of blank/commented lines in PSQL interactive-mode history

2021-09-07 Thread Tom Lane
Alvaro Herrera  writes:
> On 2021-Sep-06, Laurenz Albe wrote:
>> I agree with Greg that the current behavior is annoying and would
>> welcome the change.  This has bothered me before.

> It has bothered me too.

I'm not here to claim that the current behavior is perfect.  However,
AFAICT the patch as-presented breaks the principle that text goes into
the history at the moment it's sent to the server.  In particular, we
might make an entry for text that *never* got to the server because you
cleared the buffer instead.  I don't find that to be an improvement.
It breaks one of the primary use-cases for history, ie keeping a record
of what you did.

We could perhaps finesse that point by deciding that comment lines
that are handled this way will never be sent to the server --- but
I'm sure people will complain about that, too.  I've definitely heard
people complain because "--" comments are stripped from what's sent
(so I'd look favorably on a patch to undo that).

I think the questions around empty-line handling are largely
orthogonal to this, and we'll just confuse ourselves if we
discuss that at the same time.  Likewise for M-#.

regards, tom lane




Re: CI/windows docker vs "am a service" autodetection on windows

2021-09-07 Thread Andres Freund
Hi,

On 2021-08-15 11:25:01 -0300, Ranier Vilela wrote:
> I found this function on the web, from OpenSSL, but I haven't tested it.
> I think that there is one more way to test if a service is running
> (SECURITY_INTERACTIVE_RID).

I don't think that really addresses the issue. If a service starts postgres
somewhere within, it won't have SECURITY_INTERACTIVE_RID, but postgres isn't
quite running as a service nevertheless. I think that's the situation in the
CI case that triggered me to look into this.

Regards,

Andres




Re: prevent immature WAL streaming

2021-09-07 Thread Bossart, Nathan
On 9/4/21, 10:26 AM, "Alvaro Herrera"  wrote:
> Attached are the same patches as last night, except I added a test for
> XLOG_DEBUG where pertinent.  (The elog(PANIC) is not made conditional on
> that, since it's a cross-check rather than informative.)  Also fixed the
> silly pg_rewind mistake I made.
>
> I'll work on the new xlog record early next week.

Are these patches in a good state for some preliminary testing?  I'd
like to try them out, but I'll hold off if they're not quite ready
yet.

Nathan



Re: .ready and .done files considered harmful

2021-09-07 Thread Bossart, Nathan
On 9/7/21, 11:31 AM, "Robert Haas"  wrote:
> I guess we still have to pick one or the other, but I don't really
> know how to do that, since both methods seem to be relatively fine,
> and the scenarios where one is better than the other all feel a little
> bit contrived. I guess if no clear consensus emerges in the next week
> or so, I'll just pick one and commit it. Not quite sure yet how I'll
> do the picking, but we seem to all agree that something is better than
> nothing, so hopefully nobody will be too sad if I make an arbitrary
> decision. And if some clear agreement emerges before then, even
> better.

I will be happy to see this fixed either way.

Nathan



Re: EXPLAIN(VERBOSE) to CTE with SEARCH BREADTH FIRST fails

2021-09-07 Thread Tom Lane
torikoshia  writes:
> While working on [1], we found that EXPLAIN(VERBOSE) to CTE with SEARCH 
> BREADTH FIRST ends up ERROR.

Yeah.  It's failing here:

 * We're deparsing a Plan tree so we don't have a CTE
 * list.  But the only place we'd see a Var directly
 * referencing a CTE RTE is in a CteScan plan node, and we
 * can look into the subplan's tlist instead.

if (!dpns->inner_plan)
elog(ERROR, "failed to find plan for CTE %s",
 rte->eref->aliasname);

The problematic Var is *not* in a CteScan plan node; it's in a
WorkTableScan node.  It's not clear to me whether this is a bug
in the planner's handling of SEARCH BREADTH FIRST, or if the plan
is as-intended and ruleutils.c is failing to cope.

Either way, this deserves an open item...

regards, tom lane




Re: .ready and .done files considered harmful

2021-09-07 Thread Robert Haas
On Tue, Sep 7, 2021 at 2:13 PM Bossart, Nathan  wrote:
> Right.  The latest patch for that approach [0] does just that.  In
> fact, I think timeline files are the only files for which we need to
> force an immediate directory scan in the multiple-files-per-scan
> approach.  For the keep-trying-the-next-file approach, we have to
> force a directory scan for anything but a regular WAL file that is
> ahead of our archiver state.

Yeah, that makes sense.

> Yeah, I would agree that the approaches basically converge into some
> form of "do fewer directory scans."

I guess we still have to pick one or the other, but I don't really
know how to do that, since both methods seem to be relatively fine,
and the scenarios where one is better than the other all feel a little
bit contrived. I guess if no clear consensus emerges in the next week
or so, I'll just pick one and commit it. Not quite sure yet how I'll
do the picking, but we seem to all agree that something is better than
nothing, so hopefully nobody will be too sad if I make an arbitrary
decision. And if some clear agreement emerges before then, even
better.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: .ready and .done files considered harmful

2021-09-07 Thread Bossart, Nathan
On 9/7/21, 10:54 AM, "Robert Haas"  wrote:
> I guess what I don't understand about the multiple-files-per-directory
> scan implementation is what happens when something happens that would
> require the keep-trying-the-next-file approach to perform a forced
> scan. It seems to me that you still need to force an immediate full
> scan, because if the idea is that you want to, say, prioritize
> archiving of new timeline files over any others, a cached list of
> files that you should archive next doesn't accomplish that, just like
> keeping on trying the next file in sequence doesn't accomplish that.

Right.  The latest patch for that approach [0] does just that.  In
fact, I think timeline files are the only files for which we need to
force an immediate directory scan in the multiple-files-per-scan
approach.  For the keep-trying-the-next-file approach, we have to
force a directory scan for anything but a regular WAL file that is
ahead of our archiver state.

> So I'm wondering if in the end the two approaches converge somewhat,
> so that with either patch you get (1) some kind of optimization to
> scan the directory less often, plus (2) some kind of notification
> mechanism to tell you when you need to avoid applying that
> optimization. If you wanted to, (1) could even include both batching
> and then, when the batch is exhausted, trying files in sequence. I'm
> not saying that's the way to go, but you could. In the end, it seems
> less important that we do any particular thing here and more important
> that we do something - but if prioritizing timeline history files is
> important, then we have to preserve that behavior.

Yeah, I would agree that the approaches basically converge into some
form of "do fewer directory scans."

Nathan

[0] 
https://www.postgresql.org/message-id/attachment/125980/0001-Improve-performance-of-pgarch_readyXlog-with-many-st.patch



Re: VARDATA_COMPRESSED_GET_COMPRESS_METHOD comment?

2021-09-07 Thread Robert Haas
On Tue, Sep 7, 2021 at 11:56 AM Christoph Berg  wrote:
> In postgres.h, there are these macros for working with compressed
> toast:
>
>   
> /* Decompressed size and compression method of an external compressed Datum */
> #define VARDATA_COMPRESSED_GET_EXTSIZE(PTR) \
> (((varattrib_4b *) (PTR))->va_compressed.va_tcinfo & VARLENA_EXTSIZE_MASK)
> #define VARDATA_COMPRESSED_GET_COMPRESS_METHOD(PTR) \
> (((varattrib_4b *) (PTR))->va_compressed.va_tcinfo >> 
> VARLENA_EXTSIZE_BITS)
>
> /* Same, when working directly with a struct varatt_external */
> #define VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer) \
> ((toast_pointer).va_extinfo & VARLENA_EXTSIZE_MASK)
> #define VARATT_EXTERNAL_GET_COMPRESS_METHOD(toast_pointer) \
> ((toast_pointer).va_extinfo >> VARLENA_EXTSIZE_BITS)
>
> On the first line, is the comment "external" correct? It took me quite
> a while to realize that the first two macros are the methods to be
> used on an *inline* compressed Datum, when the second set is used for
> varlenas in toast tables.

Well ... technically the second set are used on a TOAST pointer, which
is not really the same thing as a varlena. The varlena would start
with a 1-byte header identifying it as a TOAST pointer, and then
there'd be a 1-byte saying what kind of TOAST pointer it is, which
would be VARTAG_ONDISK if this is coming from a tuple on disk, and
then the TOAST pointer would start after that. So toast_pointer =
varlena_pointer + 2, if I'm not confused here.

But I agree with you that referring to the argument to
VARDATA_COMPRESSED_GET_EXTSIZE or
VARDATA_COMPRESSED_GET_COMPRESS_METHOD as an "external compressed
Datum" doesn't seem quite right. It is compressed, but it is not
external, at least in the sense that I understand that term.
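
To make the distinction concrete, here is a minimal sketch of how the two
macro families above are meant to be used.  This is only an illustration
(the function name is made up); the macros themselves are the ones quoted
in the original message, from the PostgreSQL 14 postgres.h: the
VARDATA_COMPRESSED_* macros read an inline-compressed datum directly,
while the VARATT_EXTERNAL_* macros operate on a struct varatt_external
that has first been copied out of an on-disk TOAST pointer.

#include "postgres.h"

static int32
datum_compression_method(struct varlena *attr)
{
	if (VARATT_IS_COMPRESSED(attr))
	{
		/* inline-compressed varlena: read the varattrib_4b header directly */
		return VARDATA_COMPRESSED_GET_COMPRESS_METHOD(attr);
	}
	else if (VARATT_IS_EXTERNAL_ONDISK(attr))
	{
		/* TOAST pointer: copy the struct varatt_external out first */
		struct varatt_external toast_pointer;

		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
			return VARATT_EXTERNAL_GET_COMPRESS_METHOD(toast_pointer);
	}
	return -1;					/* not stored compressed */
}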

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: .ready and .done files considered harmful

2021-09-07 Thread Robert Haas
On Tue, Sep 7, 2021 at 1:28 PM Bossart, Nathan  wrote:
> Thanks for chiming in.  The limit of 64 in the multiple-files-per-
> directory-scan approach was mostly arbitrary.  My earlier testing [0]
> with different limits didn't reveal any significant difference, but
> using a higher limit might yield a small improvement when there are
> several hundred thousand .ready files.  IMO increasing the limit isn't
> really worth it for this approach.  For 500,000 .ready files,
> ordinarily you'd need 500,000 directory scans.  When 64 files are
> archived for each directory scan, you need ~8,000 directory scans.
> With 128 files per directory scan, you need ~4,000.  With 256, you
> need ~2000.  The difference between 8,000 directory scans and 500,000
> is quite significant.  The difference between 2,000 and 8,000 isn't
> nearly as significant in comparison.

That's certainly true.

I guess what I don't understand about the multiple-files-per-directory
scan implementation is what happens when something happens that would
require the keep-trying-the-next-file approach to perform a forced
scan. It seems to me that you still need to force an immediate full
scan, because if the idea is that you want to, say, prioritize
archiving of new timeline files over any others, a cached list of
files that you should archive next doesn't accomplish that, just like
keeping on trying the next file in sequence doesn't accomplish that.

So I'm wondering if in the end the two approaches converge somewhat,
so that with either patch you get (1) some kind of optimization to
scan the directory less often, plus (2) some kind of notification
mechanism to tell you when you need to avoid applying that
optimization. If you wanted to, (1) could even include both batching
and then, when the batch is exhausted, trying files in sequence. I'm
not saying that's the way to go, but you could. In the end, it seems
less important that we do any particular thing here and more important
that we do something - but if prioritizing timeline history files is
important, then we have to preserve that behavior.
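
For what it's worth, the converged shape might look roughly like the sketch
below.  It is purely illustrative, not the actual pgarch.c code; the struct,
the function names, and the scan_status_dir() helper (stubbed out here) are
all hypothetical.  A cached batch of .ready names is consumed between
directory scans, and a force flag invalidates the cache whenever something
higher priority, such as a timeline history file, shows up.

#include <stdbool.h>
#include <stdio.h>

#define BATCH_SIZE 64
#define MAX_FNAME  64

typedef struct ArchiverBatch
{
	char	files[BATCH_SIZE][MAX_FNAME];	/* sorted .ready names */
	int		nfiles;
	int		next;
	bool	force_dir_scan;		/* the "notification" mechanism sets this */
} ArchiverBatch;

/* stub: a real version would read archive_status/ and sort the entries */
static int
scan_status_dir(ArchiverBatch *b)
{
	return 0;
}

static bool
next_file_to_archive(ArchiverBatch *b, char *out, size_t outlen)
{
	if (b->force_dir_scan || b->next >= b->nfiles)
	{
		b->nfiles = scan_status_dir(b);	/* the expensive O(dir) step */
		b->next = 0;
		b->force_dir_scan = false;
	}
	if (b->next >= b->nfiles)
		return false;			/* nothing ready right now */
	snprintf(out, outlen, "%s", b->files[b->next++]);
	return true;
}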

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: .ready and .done files considered harmful

2021-09-07 Thread Bossart, Nathan
On 9/7/21, 1:42 AM, "Kyotaro Horiguchi"  wrote:
> I was thinking that the multiple-files approach would work efficiently,
> but the patch still runs directory scans every 64 files.  As
> Robert mentioned it is still O(N^2).  I'm not sure the reason for the
> limit, but if it were to lower memory consumption or the cost to sort,
> we can resolve that issue by taking trying-the-next approach ignoring
> the case of having many gaps (discussed below). If it were to cause
> voluntary checking of out-of-order files, almost the same can be
> achieved by running directory scans every 64 files in the
> trying-the-next approach (and we would suffer O(N^2) again).  On the
> other hand, if archiving is delayed by several segments, the
> multiple-files method might reduce the cost to scan the status
> directory but it won't matter since the directory contains only
> several files.  (I think that it might be better that we don't go to
> trying-the-next path if we found only several files by running a
> directory scan.)  The multiple-files approach reduces the number of
> directory scans if there were many gaps in the WAL file
> sequence. Although theoretically the last max_backend(+alpha?)
> segments could be written out-of-order, I suppose we only have
> gaps among the several latest files in reality. I'm not sure,
> though..
>
> In short, the trying-the-next approach seems to me to be the way to
> go, for the reason that it is simpler but it can cover the possible
> failures by almost the same measures with the muliple-files approach.

Thanks for chiming in.  The limit of 64 in the multiple-files-per-
directory-scan approach was mostly arbitrary.  My earlier testing [0]
with different limits didn't reveal any significant difference, but
using a higher limit might yield a small improvement when there are
several hundred thousand .ready files.  IMO increasing the limit isn't
really worth it for this approach.  For 500,000 .ready files,
ordinarily you'd need 500,000 directory scans.  When 64 files are
archived for each directory scan, you need ~8,000 directory scans.
With 128 files per directory scan, you need ~4,000.  With 256, you
need ~2000.  The difference between 8,000 directory scans and 500,000
is quite significant.  The difference between 2,000 and 8,000 isn't
nearly as significant in comparison.

Nathan

[0] 
https://www.postgresql.org/message-id/3ECC212F-88FD-4FB2-BAF1-C2DD1563E310%40amazon.com



Re: Estimating HugePages Requirements?

2021-09-07 Thread Bossart, Nathan
On 9/6/21, 11:24 PM, "Kyotaro Horiguchi"  wrote:
> At Fri, 3 Sep 2021 17:46:05 +, "Bossart, Nathan"  
> wrote in
>> On 9/2/21, 10:12 PM, "Kyotaro Horiguchi"  wrote:
>> > I noticed that postgres -C shared_memory_size showed 137 (= 144703488)
>> > whereas the error message above showed 148897792 bytes (142MB). So it
>> > seems that something is forgotten while calculating
>> > shared_memory_size.  As the consequence, launching postgres setting
>> > huge_pages_required (69 pages) as vm.nr_hugepages ended up in the
>> > "could not map anonymous shared memory" error.
>>
>> Hm.  I'm not seeing this with the v5 patch set, so maybe I
>> inadvertently fixed it already.  Can you check this again with v5?
>
> Thanks! I confirmed that the numbers match with v5.

Thanks for confirming.

Nathan



Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert

2021-09-07 Thread Tom Lane
Andrew Dunstan  writes:
> On 9/7/21 11:47 AM, Tom Lane wrote:
>> so I'm coming around to the idea
>> that we need to do something.  I don't like the details of Thomas'
>> proposal though; specifically I don't see a need to invent a new sslmode
>> value.  I think it should just be "if ~/.postgresql/root.crt doesn't
>> exist, use the system's default trust store".

> An alternative might be to allow a magic value for sslrootcert, say
> "system" which would make it go and look in the system's store wherever
> that is, without the user having to know exactly where. OTOH it would
> require that the user knows that the system's store is being used, which
> might not be a bad thing.

Yeah, that would mostly fix the usability concern.  I guess what it
comes down to is whether you think that public or private certs are
likely to be the majority use-case in the long run.  The shortage of
previous requests for this feature says that right now, just about
everyone is using self-signed or private-CA certs for Postgres
servers.  So it would likely be a long time, if ever, before public-CA
certs become the majority use-case.

On the other hand, even if I'm using a private CA, there's a lot
to be said for adding its root cert to system-level trust stores
rather than copying it into individual users' home directories.
So I still feel like there's a pretty good case for allowing use
of the system store to happen by default.  (As I said, I'd always
thought that was *already* what would happen.)

regards, tom lane




Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert

2021-09-07 Thread Andrew Dunstan


On 9/7/21 11:47 AM, Tom Lane wrote:
>
> This is not how I supposed it worked, 


That happens to me more than I usually admit :-)


> so I'm coming around to the idea
> that we need to do something.  I don't like the details of Thomas'
> proposal though; specifically I don't see a need to invent a new sslmode
> value.  I think it should just be "if ~/.postgresql/root.crt doesn't
> exist, use the system's default trust store".
>
>   


I agree sslmode is the wrong vehicle.

An alternative might be to allow a magic value for sslrootcert, say
"system" which would make it go and look in the system's store wherever
that is, without the user having to know exactly where. OTOH it would
require that the user knows that the system's store is being used, which
might not be a bad thing.
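
To illustrate, under that hypothetical magic value the connection string
would stay OS-independent, something along these lines (host and dbname
are placeholders):

psql "host=db.example.com dbname=app sslmode=verify-full sslrootcert=system"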


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert

2021-09-07 Thread Tom Lane
Andrew Dunstan  writes:
> You don't have to copy anything to achieve what you want. Just set the
> sslrootcert parameter of your connection to point to the system file. e.g.

> psql "sslmode=verify-full 
> sslrootcert=/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt ..."

While that does work for me, it seems pretty OS-specific and
user-unfriendly.  Why should ordinary users need to know that
much about their platform's OpenSSL installation?

regards, tom lane




Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-07 Thread Amul Sul
On Tue, 7 Sep 2021 at 8:43 PM, Mark Dilger 
wrote:

>
>
> > On Aug 31, 2021, at 5:15 AM, Amul Sul  wrote:
> >
> > Attached is the rebased version for the latest master head.
>
> Hi Amul!
>
> Could you please rebase again?
>

Ok will do that tomorrow, thanks.

Regards,
Amul


VARDATA_COMPRESSED_GET_COMPRESS_METHOD comment?

2021-09-07 Thread Christoph Berg
In postgres.h, there are these macros for working with compressed
toast:

  
/* Decompressed size and compression method of an external compressed Datum */
#define VARDATA_COMPRESSED_GET_EXTSIZE(PTR) \
(((varattrib_4b *) (PTR))->va_compressed.va_tcinfo & VARLENA_EXTSIZE_MASK)
#define VARDATA_COMPRESSED_GET_COMPRESS_METHOD(PTR) \
(((varattrib_4b *) (PTR))->va_compressed.va_tcinfo >> VARLENA_EXTSIZE_BITS)

/* Same, when working directly with a struct varatt_external */
#define VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer) \
((toast_pointer).va_extinfo & VARLENA_EXTSIZE_MASK)
#define VARATT_EXTERNAL_GET_COMPRESS_METHOD(toast_pointer) \
((toast_pointer).va_extinfo >> VARLENA_EXTSIZE_BITS)


On the first line, is the comment "external" correct? It took me quite
a while to realize that the first two macros are the methods to be
used on an *inline* compressed Datum, when the second set is used for
varlenas in toast tables.

Context: Me figuring out 
https://github.com/credativ/toastinfo/blob/master/toastinfo.c#L119-L128

Christoph




Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert

2021-09-07 Thread Andrew Dunstan


On 9/7/21 10:57 AM, tho...@habets.se wrote:
> On Tue, 7 Sep 2021 15:16:51 +0100, Andrew Dunstan  said:
>> can't you specify a CA cert in the system's
>> CA store if necessary? e.g. on my Fedora system I think it's
>> /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt
> I could, but that seems more like a workaround, where I have to change
> things around as LetsEncrypt switches to another root (I believe they
> have in the past, but I'm not sure), or the server decides to switch
> from LetsEncrypt to something else. Then all clients need to update.
>
> Such a decision could actually be made by whoever runs the webserver,
> not the database, and the database just reuses the cert and gets a
> free ride for cert renewals.
>
> So in other words postgresql currently doesn't use the system database
> at all, and the workaround is to find and copy from the system
> database. I agree that is a workaround.
>
> If you think this is enough of a corner case that the workaround is
> acceptable, or the added complexity of another sslmode setting isn't
> worth fixing this edge case, then I assume you have more knowledge
> about how postgres is used in the field than I do.
>
> But it's not just about today. I would hope that now with LE that
> every user of SSL starts using "real" certs. Postgres default settings
> imply that most people who even enable SSL will not verify the CA nor
> the name, which is a shame.


It would be if it were true, but it's not. In talks I give on
PostgreSQL+SSL I highly recommend people use verify-full. And the CA
doesn't have to be one that's publicly known. We cater for both public
and private CAs.

You don't have to copy anything to achieve what you want. Just set the
sslrootcert parameter of your connection to point to the system file. e.g.

psql "sslmode=verify-full 
sslrootcert=/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt ..."


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert

2021-09-07 Thread Tom Lane
tho...@habets.se writes:
> On Tue, 7 Sep 2021 15:16:51 +0100, Andrew Dunstan  said:
>> can't you specify a CA cert in the system's
>> CA store if necessary? e.g. on my Fedora system I think it's
>> /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt

> I could, but that seems more like a workaround, where I have to change
> things around as LetsEncrypt switches to another root (I believe they
> have in the past, but I'm not sure), or the server decides to switch
> from LetsEncrypt to something else. Then all clients need to update.

I experimented with loading a real (not self-signed, not from a private
CA) cert into the server, and I can confirm these results when trying
to use sslmode=verify-ca or sslmode=verify-full:

* libpq fails the connection if ~/.postgresql/root.crt is missing
or empty.

* If I put an irrelevant cert into ~/.postgresql/root.crt, then
libpq reports "SSL error: certificate verify failed".  So the
verification insists that the server's cert chain to whatever
is in root.crt.

This does seem to make it unreasonably painful to use a real SSL cert
for a PG server.  If you've gone to the trouble and expense of getting
a real cert, it should not be on you to persuade the clients that
it's valid.  I agree with Thomas that copying the system trust store
into users' home directories is a completely horrid idea, from both
the ease-of-use and maintenance standpoints.

This is not how I supposed it worked, so I'm coming around to the idea
that we need to do something.  I don't like the details of Thomas'
proposal though; specifically I don't see a need to invent a new sslmode
value.  I think it should just be "if ~/.postgresql/root.crt doesn't
exist, use the system's default trust store".

regards, tom lane




Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-07 Thread Mark Dilger



> On Aug 31, 2021, at 5:15 AM, Amul Sul  wrote:
> 
> Attached is the rebased version for the latest master head. 

Hi Amul!

Could you please rebase again?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Gather performance analysis

2021-09-07 Thread Tomas Vondra
Hi,

The numbers presented in this thread seem very promising - clearly
there's significant potential for improvements. I'll run similar
benchmarks too, to get a better understanding of this.

Can you share some basic details about the hardware you used?
Particularly the CPU model - I guess this might explain some of the
results, e.g. if CPU caches are ~1MB, that'd explain why setting
tup_queue_size to 1MB improves things, but 4MB is a bit slower.
Similarly, number of cores might explain why 4 workers perform better
than 8 or 16 workers.

Now, this is mostly expected, but the consequence is that maybe things
like queue size should be tunable/dynamic, not hard-coded?

As for the patches, I think the proposed changes are sensible, but I
wonder what queries might get slower. For example with the batching
(updating the counter only once every 4kB, that pretty much transfers
data in larger chunks with higher latency. So what if the query needs
only a small chunk, like a LIMIT query? Similarly, this might mean the
upper parts of the plan have to wait for the data for longer, and thus
can't start some async operation (like send them to a FDW, or something
like that). I do admit those are theoretical queries, I haven't tried
creating such query.
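
As a generic illustration of that tradeoff -- this is not the patch
itself, the names are invented, and only the 4kB threshold is taken from
the discussion -- the consumer publishes its progress to the shared
counter only after a threshold of bytes, which cuts per-tuple
synchronization but delays the moment the other side learns that queue
space was freed:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define PUBLISH_THRESHOLD 4096	/* e.g. 4kB, as in the discussion */

typedef struct BatchedReader
{
	_Atomic size_t *shared_bytes_read;	/* counter visible to the writer */
	size_t		local_pending;		/* consumed but not yet published */
} BatchedReader;

static void
reader_consumed(BatchedReader *r, size_t nbytes, bool force_publish)
{
	r->local_pending += nbytes;
	/* force_publish covers e.g. end-of-query, so nothing stays unreported */
	if (r->local_pending >= PUBLISH_THRESHOLD || force_publish)
	{
		atomic_fetch_add(r->shared_bytes_read, r->local_pending);
		r->local_pending = 0;
	}
}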

FWIW I've tried applying both patches at the same time, but there's a
conflict in shm_mq_sendv - not a complex one, but I'm not sure what's
the correct solution. Can you share a "combined" patch?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallelize correlated subqueries that execute within each worker

2021-09-07 Thread Zhihong Yu
On Tue, Sep 7, 2021 at 6:17 AM James Coleman  wrote:

> On Wed, Sep 1, 2021 at 7:06 AM Daniel Gustafsson  wrote:
> >
> > > On 7 May 2021, at 18:30, James Coleman  wrote:
> >
> > > ..here we are now, and I finally have this patch cleaned up
> > > enough to share.
> >
> > This patch no longer applies to HEAD, can you please submit a rebased
> version?
>
> See attached.
>
> Thanks,
> James
>
Hi,
For v2-0002-Parallel-query-support-for-basic-correlated-subqu.patch :

+* is when we're going to execute multiple partial parths in parallel

parths -> paths

if (index->amcanparallel &&
-   rel->consider_parallel && outer_relids == NULL &&
-   scantype != ST_BITMAPSCAN)
+   rel->consider_parallel && outer_relids == NULL &&
+   scantype != ST_BITMAPSCAN)

the change above seems unnecessary since the first line of the if
condition doesn't change.
Similar comment for the next hunk.

+* It's not a partial path; it'a a full path that is executed
as a subquery.

it'a a -> it's a

+   /* rel->consider_parallel_rechecking_params = false; */
+   /* rel->partial_pathlist = NIL; */

The commented code can be taken out.

Cheers


Re: [BUG?] SET TIME ZONE doesn't work with abbreviations

2021-09-07 Thread Aleksander Alekseev
David, Tom,

> Well, given that the limitation is documented I’d have to say it is 
> intentional:
> [...]

> That's intentional, per the fine manual:
> [...]

My bad, I missed this. Many thanks!

-- 
Best regards,
Aleksander Alekseev




Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert

2021-09-07 Thread thomas
On Tue, 7 Sep 2021 15:16:51 +0100, Andrew Dunstan  said:
> can't you specify a CA cert in the system's
> CA store if necessary? e.g. on my Fedora system I think it's
> /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt

I could, but that seems more like a workaround, where I have to change
things around as LetsEncrypt switches to another root (I believe they
have in the past, but I'm not sure), or the server decides to switch
from LetsEncrypt to something else. Then all clients need to update.

Such a decision could actually be made by whoever runs the webserver,
not the database, and the database just reuses the cert and gets a
free ride for cert renewals.

So in other words postgresql currently doesn't use the system database
at all, and the workaround is to find and copy from the system
database. I agree that is a workaround.

If you think this is enough of a corner case that the workaround is
acceptable, or the added complexity of another sslmode setting isn't
worth fixing this edge case, then I assume you have more knowledge
about how postgres is used in the field than I do.

But it's not just about today. I would hope that now with LE that
every user of SSL starts using "real" certs. Postgres default settings
imply that most people who even enable SSL will not verify the CA nor
the name, which is a shame.

--
typedef struct me_s {
  char name[]  = { "Thomas Habets" };
  char email[] = { "tho...@habets.se" };
  char kernel[]= { "Linux" };
  char *pgpKey[]   = { "http://www.habets.pp.se/pubkey.txt" };
  char pgp[] = { "9907 8698 8A24 F52F 1C2E  87F6 39A4 9EEA 460A 0169" };
  char coolcmd[]   = { "echo '. ./_&. ./_'>_;. ./_" };
} me_t;




Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert

2021-09-07 Thread Andrew Dunstan


On 9/6/21 6:21 PM, tho...@habets.se wrote:
> On Mon, 6 Sep 2021 20:47:37 +0100, Tom Lane  said:
>> I'm confused by your description of this patch.  AFAIK, OpenSSL verifies
>> against the system-wide CA pool by default.  Why do we need to do
>> anything?
> Experimentally, no it doesn't. Or if it does, then it doesn't verify
> the CN/altnames of the cert.
>
> sslmode=require allows self-signed and name mismatch.
>
> verify-ca errors out if there is no ~/.postgresql/root.crt. verify-full too.
>
> It seems that currently postgresql verifies the name if and only if
> verify-full is used, and then only against ~/.postgresql/root.crt CA file.
>
> But could be that I missed a config option?



That's my understanding. But can't you specify a CA cert in the system's
CA store if necessary? e.g. on my Fedora system I think it's
/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Data loss when '"json_populate_recorset" with long column name

2021-09-07 Thread Tom Lane
Julien Rouhaud  writes:
> On Tue, Sep 7, 2021 at 1:31 PM Michael Paquier  wrote:
>> Yeah.  We should try to work toward removing the limits on NAMEDATALEN
>> for the attribute names.  Easier said than done :)

> Yes, but even if we eventually fix that my impression is that we would
> still enforce a limit of 128 characters (or bytes) as this is the SQL
> specification.

Probably not.  I think SQL says that's the minimum expectation; and
even if they say it should be that exactly, there is no reason we'd
suddenly start slavishly obeying that part of the spec after ignoring
it for years ;-).

There would still be a limit of course, but it would stem from the max
tuple width in the associated catalog, so on the order of 7kB or so.
(Hmm ... perhaps it'd be wise to set a limit of say a couple of kB,
just so that the implementation limit is crisp rather than being
a little bit different in each catalog and each release.)

regards, tom lane




Re: [BUG?] SET TIME ZONE doesn't work with abbreviations

2021-09-07 Thread Tom Lane
Aleksander Alekseev  writes:
> I noticed that `SET TIME ZONE` / `SET timezone TO` don't work with
> abbreviations:

That's intentional, per the fine manual:

A time zone abbreviation, for example PST.  Such a
specification merely defines a particular offset from UTC, in
contrast to full time zone names which can imply a set of daylight
savings transition rules as well.  The recognized abbreviations
are listed in the pg_timezone_abbrevs view.  You cannot set the
configuration parameters TimeZone or
log_timezone to a time
zone abbreviation, but you can use abbreviations in
date/time input values and with the AT TIME ZONE
operator.

I'm too caffeine-deprived to remember the exact reasoning right now,
but it was likely along the lines of "you don't really want to do
that because it won't track DST changes".
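
A quick illustration of the documented behavior (the exact error wording
may differ by version):

SET TIME ZONE 'PST';                        -- rejected: abbreviations are not accepted here
SET TIME ZONE 'America/Los_Angeles';        -- a full zone name (with DST rules) works
SELECT now() AT TIME ZONE 'PST';            -- abbreviations are fine with AT TIME ZONE
SELECT timestamptz '2021-09-07 12:00 PST';  -- ... and in date/time input values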

regards, tom lane




Re: [BUG?] SET TIME ZONE doesn't work with abbreviations

2021-09-07 Thread David G. Johnston
On Tuesday, September 7, 2021, Aleksander Alekseev 
wrote:

> Hi hackers,
>
> I noticed that `SET TIME ZONE` / `SET timezone TO` don't work with
> abbreviations:
>
> Is it a bug or this behavior is intentional (something to do with SQL
> standard, perhaps)?
>
>
Well, given that the limitation is documented I’d have to say it is
intentional:

You cannot set the configuration parameters TimeZone or log_timezone to
a time zone abbreviation, but you can use abbreviations in date/time input
values and with the AT TIME ZONE operator.

https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-TIMEZONES

David J.


Re: Use "superuser" instead of "super user" in code comments

2021-09-07 Thread Bharath Rupireddy
On Tue, Sep 7, 2021 at 6:40 PM Daniel Gustafsson  wrote:
>
> > On 7 Sep 2021, at 14:44, Bharath Rupireddy 
> >  wrote:
>
> > It seems like we use "superuser" as a standard term across the entire
> > code base i.e. error messages, docs, code comments. But there are
> > still a few code comments that use the term "super user". Can we
> > replace those with "superuser"? Attaching a tiny patch to do that.
>
> Good catch, superuser is the term we should use for this.  There is one
> additional “super user” in src/test/regress/sql/conversion.sql (and its
> corresponding resultfile) which can be included in this.  Unless there are
> objections I’ll apply this with the testfile fixup.

Thanks for picking this up. Here's v2 including the change in
conversion.sql and .out.

Regards,
Bharath Rupireddy.


v2-0001-Use-superuser-instead-of-super-user-in-code-comme.patch
Description: Binary data


Re: Parallelize correlated subqueries that execute within each worker

2021-09-07 Thread James Coleman
On Wed, Sep 1, 2021 at 7:06 AM Daniel Gustafsson  wrote:
>
> > On 7 May 2021, at 18:30, James Coleman  wrote:
>
> > ..here we are now, and I finally have this patch cleaned up
> > enough to share.
>
> This patch no longer applies to HEAD, can you please submit a rebased version?

See attached.

Thanks,
James


v2-0001-Allow-parallel-LATERAL-subqueries-with-LIMIT-OFFS.patch
Description: Binary data


v2-0002-Parallel-query-support-for-basic-correlated-subqu.patch
Description: Binary data


v2-0003-Other-places-to-consider-for-completeness.patch
Description: Binary data


Re: Improve logging when using Huge Pages

2021-09-07 Thread Justin Pryzby
On Tue, Sep 07, 2021 at 07:12:36PM +0900, Fujii Masao wrote:
> One big concern about the patch is that log message is always reported
> when shared memory fails to be allocated with huge pages enabled
> when huge_pages=try. Since huge_pages=try is the default setting,
> many users would see this new log message whenever they start
> the server. Those who don't need huge pages but just use the default
> setting might think that such log messages would be noisy.

I don't see this as any issue.  We're only talking about a single message on
each restart, which would be added in a major release.  If it's a problem, the
message could be a NOTICE or INFO, and it won't be shown by default.

I think it should say "with/out huge pages" without "enabled/disabled", without
"again", and without "The server", like:

+   (errmsg("could not map anonymous shared 
memory (%zu bytes)"
+   " with huge pages.", allocsize),
+errdetail("Anonymous shared memory 
will be mapped "
+   "without huge pages.")));

-- 
Justin




Increase log level in xlogreader.c ?

2021-09-07 Thread Jesper Pedersen

Hi,

I have a 13.4-based setup (physical streaming replication) where the
replica produces the attached log upon startup, and again when the first
message is sent from the primary.


There is the FATAL from when the WAL receiver shuts down, but I think it 
would be a benefit to have report_invalid_record() log at ERROR level 
instead to highlight to the admin that there is a serious problem.


Feel free to contact me off-list for the setup (20Mb).

Thoughts ?

Best regards,
 Jesper
[1412350] [2021-09-07 08:51:27.172 EDT] [] [0] LOG:  starting PostgreSQL 13.4 
on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.2.1 20210728 (Red Hat 
11.2.1-1), 64-bit
[1412350] [2021-09-07 08:51:27.173 EDT] [] [0] LOG:  listening on IPv4 address 
"0.0.0.0", port 5433
[1412350] [2021-09-07 08:51:27.173 EDT] [] [0] LOG:  listening on IPv6 address 
"::", port 5433
[1412350] [2021-09-07 08:51:27.176 EDT] [] [0] LOG:  listening on Unix socket 
"/tmp/.s.PGSQL.5433"
[1412352] [2021-09-07 08:51:27.186 EDT] [] [0] LOG:  database system was shut 
down at 2020-10-01 07:48:52 EDT
[1412352] [2021-09-07 08:51:27.186 EDT] [] [0] LOG:  entering standby mode
[1412352] [2021-09-07 08:51:27.191 EDT] [] [0] LOG:  consistent recovery state 
reached at 0/5000118
[1412352] [2021-09-07 08:51:27.191 EDT] [] [0] LOG:  invalid record length at 
0/5000118: wanted 24, got 0
[1412350] [2021-09-07 08:51:27.192 EDT] [] [0] LOG:  database system is ready 
to accept read only connections
[1412356] [2021-09-07 08:51:27.224 EDT] [] [0] LOG:  started streaming WAL from 
primary at 0/500 on timeline 1
[1412352] [2021-09-07 08:52:54.091 EDT] [] [0] LOG:  record with incorrect 
prev-link 8DE46ADF/30 at 0/5000118
[1412356] [2021-09-07 08:52:54.091 EDT] [] [0] FATAL:  terminating walreceiver 
process due to administrator command
[1412352] [2021-09-07 08:52:54.092 EDT] [] [0] LOG:  record with incorrect 
prev-link 8DE46ADF/30 at 0/5000118
[1412352] [2021-09-07 08:52:54.092 EDT] [] [0] LOG:  record with incorrect 
prev-link 8DE46ADF/30 at 0/5000118
[1412352] [2021-09-07 08:52:59.096 EDT] [] [0] LOG:  record with incorrect 
prev-link 8DE46ADF/30 at 0/5000118
[1412352] [2021-09-07 08:53:04.101 EDT] [] [0] LOG:  record with incorrect 
prev-link 8DE46ADF/30 at 0/5000118
[1412352] [2021-09-07 08:53:09.105 EDT] [] [0] LOG:  record with incorrect 
prev-link 8DE46ADF/30 at 0/5000118
[1412352] [2021-09-07 08:53:14.109 EDT] [] [0] LOG:  record with incorrect 
prev-link 8DE46ADF/30 at 0/5000118
[1412352] [2021-09-07 08:53:19.113 EDT] [] [0] LOG:  record with incorrect 
prev-link 8DE46ADF/30 at 0/5000118
[1412350] [2021-09-07 08:53:22.187 EDT] [] [0] LOG:  received fast shutdown 
request
[1412350] [2021-09-07 08:53:22.194 EDT] [] [0] LOG:  aborting any active 
transactions
[1412353] [2021-09-07 08:53:22.195 EDT] [] [0] LOG:  shutting down
[1412350] [2021-09-07 08:53:22.209 EDT] [] [0] LOG:  database system is shut 
down


Re: Use "superuser" instead of "super user" in code comments

2021-09-07 Thread Daniel Gustafsson
> On 7 Sep 2021, at 14:44, Bharath Rupireddy 
>  wrote:

> It seems like we use "superuser" as a standard term across the entire
> code base i.e. error messages, docs, code comments. But there are
> still a few code comments that use the term "super user". Can we
> replace those with "superuser"? Attaching a tiny patch to do that.

Good catch, superuser is the term we should use for this.  There is one
additional “super user” in src/test/regress/sql/conversion.sql (and its
corresponding resultfile) which can be included in this.  Unless there are
objections I’ll apply this with the testfile fixup.

--
Daniel Gustafsson   https://vmware.com/





Use "superuser" instead of "super user" in code comments

2021-09-07 Thread Bharath Rupireddy
Hi,

It seems like we use "superuser" as a standard term across the entire
code base i.e. error messages, docs, code comments. But there are
still a few code comments that use the term "super user". Can we
replace those with "superuser"? Attaching a tiny patch to do that.

Thoughts?

Regards,
Bharath Rupireddy.


v1-0001-Use-superuser-instead-of-super-user-in-code-comme.patch
Description: Binary data


Re: Diagnostic comment in LogicalIncreaseXminForSlot

2021-09-07 Thread Ashutosh Bapat
Thanks Amit.

On Tue, Sep 7, 2021 at 11:14 AM Amit Kapila  wrote:
>
> On Mon, Sep 6, 2021 at 5:29 PM Ashutosh Bapat
>  wrote:
> >
> > Yeah, I agree. Sorry for missing that.
> >
> > The updated patch looks good to me.
> >
>
> Pushed!
>
> --
> With Regards,
> Amit Kapila.



-- 
Best Wishes,
Ashutosh Bapat




Re: The Free Space Map: Problems and Opportunities

2021-09-07 Thread Hannu Krosing
On Tue, Sep 7, 2021 at 2:29 AM Peter Geoghegan  wrote:
>
> On Mon, Sep 6, 2021 at 4:33 PM Hannu Krosing  wrote:
> > When I have been thinking of this type of problem it seems that the
> > latest -- and correct :) --  place which should do all kinds of
> > cleanup like removing aborted tuples, freezing committed tuples and
> > setting any needed hint bits would be background writer or CHECKPOINT.
> >
> > This would be more PostgreSQL-like, as it moves any work not
> > immediately needed from the critical path, as an extension of how MVCC
> > for PostgreSQL works in general.
>
> I think it depends. There is no need to do work in the background
> here, with TPC-C. With my patch series each backend can know that it
> just had an aborted transaction that affected a page that it more or
> less still owns. And has very close at hand, for further inserts. It's
> very easy to piggy-back the work once you have that sense of ownership
> of newly allocated heap pages by individual backends/transactions.

Are you speaking of just heap pages here, or also index pages?

It seems indeed easy for heap, but index pages can be mixed up by
other parallel work, especially things like Serial Primary Keys.

Or are you expecting these to be kept in good-enough shape by your
earlier index manager work?

> > This would be more PostgreSQL-like, as it moves any work not
> > immediately needed from the critical path, as an extension of how MVCC
> > for PostgreSQL works in general.
>
> I think that it also makes sense to have what I've called "eager
> physical rollback" that runs in the background, as you suggest.
>
> I'm thinking of a specialized form of VACUUM that targets a specific
> aborted transaction's known-dirtied pages. That's my long term goal,
> actually. Originally I wanted to do this as a way of getting rid of
> SLRUs and tuple freezing, by representing that all heap pages must
> only have committed tuples implicitly. That seemed like a good enough
> reason to split VACUUM into specialized "eager physical rollback
> following abort" and "garbage collection" variants.
>
> The insight that making abort-related cleanup special will help free
> space management is totally new to me -- it emerged from working
> directly on this benchmark. But it nicely complements some of my
> existing ideas about improving VACUUM.

A minimal useful patch emerging from that understanding could be
something which just adds hysteresis to FSM management. (TBH, I
actually kind of expected some hysteresis to be there already, as it
is in my mental model of "how things should be done" for managing
almost any resource :) )

Adding hysteresis to FSM management can hopefully be done independently
of all the other stuff, and it also seems to be something that is
unobtrusive and non-controversial enough to fit in the current release and
possibly even be back-ported.

> > But doing it as part of checkpoint probably ends up with less WAL
> > writes in the end.
>
> I don't think that checkpoints are special in any way. They're very
> important in determining the total number of FPIs we'll generate, and
> so have huge importance today. But that seems accidental to me.

I did not mean CHECKPOINT as a command, but more the concept of
writing back / un-dirtying the page. In this sense it *is* special
because it is the last point in time where you are guaranteed to have
the page available in buffercache and thus cheap to access for
modifications, plus you will avoid a second full-page writeback because
of the cleanup. Also, you do not want to postpone the cleanup to actual
page eviction, as that is usually in the critical path for some user
query or command.

Of course this becomes most important for workloads where the active
working set is larger than fits in memory and this is not a typical
case for OLTP any more. But at least freezing the page before
write-out could have a very big impact on the need to freeze "old"
pages available only on disk and thus would be a cheap way to improve
the problems around running out of transaction ids.

> > There could be a possibility to do a small amount of cleanup -- enough
> > for TPC-C-like workloads, but not larger ones -- while waiting for the
> > next command to arrive from the client over the network. This of
> > course assumes that we will not improve our feeder mechanism to have
> > back-to-back incoming commands, which can already be done today, but
> > which I have seen seldom used.
>
> That's what I meant, really. Doing the work of cleaning up a heap page
> that a transaction inserts into (say pruning away aborted tuples or
> setting hint bits) should ideally happen right after commit or abort
> -- at least for OLTP like workloads, which are the common case for
> Postgres. This cleanup doesn't have to be done by exactly the same
> transactions (and can't be in most interesting cases). It should be
> quite possible for the work to be done by approximately the same
> transaction, though -- the current transaction cleans 

Re: OpenSSL 3.0.0 compatibility

2021-09-07 Thread Daniel Gustafsson
> On 10 Aug 2021, at 15:27, Daniel Gustafsson  wrote:

> These have now been committed, when OpenSSL 3.0.0 ships and there is coverage
> in the buildfarm I’ll revisit this for the backbranches.

As an update to this, I’ve tested the tree frozen for the upcoming 3.0.0
release (scheduled for today AFAIK) and postgres still builds and tests clean
with the patches that were applied.

--
Daniel Gustafsson   https://vmware.com/





Re: Small documentation improvement for ALTER SUBSCRIPTION

2021-09-07 Thread Daniel Gustafsson
> On 7 Sep 2021, at 13:36, Peter Eisentraut  
> wrote:
> 
> On 12.08.21 04:52, Masahiko Sawada wrote:
>> On Wed, Aug 11, 2021 at 5:42 PM Daniel Gustafsson  wrote:
>>> 
 On 11 Aug 2021, at 09:57, Masahiko Sawada  wrote:
>>> 
 Additionally, refresh options as described in
 refresh_option of
 REFRESH PUBLICATION may be specified,
 except in the case of DROP PUBLICATION.
>>> 
>>> Since this paragraph is under the literal option “refresh”, which takes a
>>> value, I still find your original patch to be the clearest.
>> Yeah, I prefer my original patch over this idea. On the other hand, I
>> can see the point of review comment on it that Amit pointed out[1].
> 
> How about this:
> 
> -  Additionally, refresh options as described
> -  under REFRESH PUBLICATION may be specified.
> +  Additionally, the options described under REFRESH
> +  PUBLICATION may be specified, to control the implicit refresh
> +  operation.

LGTM.

--
Daniel Gustafsson   https://vmware.com/





Re: Non-decimal integer literals

2021-09-07 Thread Zhihong Yu
On Tue, Sep 7, 2021 at 4:13 AM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 16.08.21 17:32, John Naylor wrote:
> > The one thing that jumped out at me on a cursory reading is
> > the {integer} rule, which seems to be used nowhere except to
> > call process_integer_literal, which must then inspect the token text to
> > figure out what type of integer it is. Maybe consider 4 separate
> > process_*_literal functions?
>
> Agreed, that can be done in a simpler way.  Here is an updated patch.
>
Hi,
Minor comment:

+SELECT int4 '0o112';

Maybe involve digits up to 7 in the octal test case.
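For example (just a suggestion, not taken from the patch's test set):

SELECT int4 '0o01234567';     -- 342391, exercises every octal digit
SELECT int4 '0o17777777777';  -- 2147483647, i.e. INT_MAX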

Thanks


Re: when the startup process doesn't (logging startup delays)

2021-09-07 Thread Justin Pryzby
On Tue, Sep 07, 2021 at 03:07:15PM +0530, Nitin Jadhav wrote:
> > Looking over this version, I realized something I (or you) should have
> > noticed sooner: you've added the RegisterTimeout call to
> > InitPostgres(), but that's for things that are used by all backends,
> > and this is only used by the startup process. So it seems to me that
> > the right place is StartupProcessMain. That would have the further
> > advantage of allowing startup_progress_timeout_handler to be made
> > static. begin_startup_progress_phase() and
> > startup_progress_timeout_has_expired() are the actual API functions
> > though so they will need to remain extern.
> >
> > I agree with Robert that a standby server should not continuously show 
> > timing
> > messages for WAL replay.
> 
> The earlier discussion wrt this point is as follows.

I think you're confusing discussions.

Robert was talking about initdb/bootstrap/single, and I separately and
independently asked about hot standbys.  It seems like Robert and I agreed
about the desired behavior and there was no further discussion.

> > Honestly, I don't see why we should have
> > a GUC for what's proposed here.  A value too low would bloat the logs
> > with entries that are not that meaningful.  A value too large would
> > just prevent access to some information wanted.  Wouldn't it be better
> > to just pick up a value like 10s or 20s?

I don't think bloating the logs is an issue for values > 10sec.

You agreed that it's important to choose the "right" value, but I think that
will vary between users.  Some installations with large checkpoint_timeout
might anticipate taking 15+min to perform recovery, but others might even have
a strict requirement that recovery must not take more than (say) 10sec; someone
might want to use this to verify that, or to optimize the slow parts of
recovery, with an interval that someone else might not care about.

> > +   {"log_startup_progress_interval", PGC_SIGHUP, LOGGING_WHEN,
> > +   gettext_noop("Sets the time interval between each progress 
> > update "
> > +"of the startup process."),
> > +   gettext_noop("0 logs all messages. -1 turns this feature off."),
> > +   GUC_UNIT_MS,
|+   10, -1, INT_MAX,
> > The unit is incorrect here, as that would default to 10ms, contrary to
> > what the documentation says about 10s.
> 
> Kindly refer the previous few discussions wrt this point.

You copied and pasted unrelated emails, which isn't helpful.

Michael is right.  You updated some of the units based on Robert's suggestion
to use MS, but didn't update all of the corresponding parts of the patch.
guc.c says that the units are in MS, which means that unqualified values are
interpreted as such.  But postgresql.conf.sample still says "seconds", and
guc.c says the default value is "10", and you still do:

+   interval_in_ms = log_startup_progress_interval * 1000;

I checked that this currently does not interpret the value as ms:
|./tmp_install/usr/local/pgsql/bin/postgres -D src/test/regress/tmp_check/data/ 
-c log_startup_progress_interval=1
|2021-09-07 06:28:58.694 CDT startup[18865] LOG:  redo in progress, elapsed 
time: 1.00 s, current LSN: 0/E94ED88
|2021-09-07 06:28:59.694 CDT startup[18865] LOG:  redo in progress, elapsed 
time: 2.00 s, current LSN: 0/10808EE0
|2021-09-07 06:29:00.694 CDT startup[18865] LOG:  redo in progress, elapsed 
time: 3.00 s, current LSN: 0/126B8C80

(Also, the GUC value is in the range 0..INT_MAX, so multiplying and storing to
another int could overflow.)

I think the convention is for GUC global vars to be initialized with the
same default as in guc.c, so the two should match, like:

+int log_startup_progress_interval = 10 * 1000;  /* 10 sec */

-- 
Justin




Re: Postgres perl module namespace

2021-09-07 Thread Andrew Dunstan


On 9/6/21 1:08 AM, Michael Paquier wrote:
> On Sat, Sep 04, 2021 at 09:58:08AM -0400, Andrew Dunstan wrote:
>> On 9/4/21 2:19 AM, Noah Misch wrote:
>>> plperl uses PostgreSQL:: as the first component of its Perl module 
>>> namespace.
>>> We shouldn't use both PostgreSQL:: and Postgres:: in the same source tree, 
>>> so
>>> this change should not use Postgres::. 
>> Good point. Here's the same thing using PostgreSQL::Test
> A minor point: this introduces PostgreSQL::Test::PostgresVersion.
> Could be this stripped down to PostgreSQL::Test::Version instead?



That name isn't very clear - what is it the version of, PostgreSQL or
the test?

There's nothing very test-specific about this module - it simply
encapsulates a Postgres version string. So maybe it should just be
PostgreSQL::Version.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Added schema level support for publication.

2021-09-07 Thread Amit Kapila
On Tue, Sep 7, 2021 at 12:45 PM vignesh C  wrote:
>
> On Fri, Sep 3, 2021 at 4:49 PM Amit Kapila  wrote:
> >
>
> > 5.
> > If I modify the search path to remove public schema then I get the
> > below error message:
> > postgres=# Create publication mypub for all tables in schema current_schema;
> > ERROR:  no schema has been selected
> >
> > I think this message is not very clear. How about changing to
> > something like "current_schema doesn't contain any valid schema"? This
> > message is used in more than one place, so let's keep it the same at
> > all the places if you agree to change it.
>
> I would prefer to use the existing messages as we have used this in a
> few other places similarly. Thoughts?
>

Yeah, I also see the same message in the code, but I think the usage here is a
bit different. If you know of a similar SQL statement that causes the same
error message, then can you please give an example?

Few comments on latest patch
(v25-0002-Added-schema-level-support-for-publication):
=
1.
getPublicationSchemaInfo()
..
+ *nspname = get_namespace_name(pnform->pnnspcid);
+ if (!(*nspname))
+ {
+ Oid schemaid = pnform->pnpubid;
+
+ pfree(*pubname);
+ ReleaseSysCache(tup);
+ if (!missing_ok)
+ elog(ERROR, "cache lookup failed for schema %u",
+ schemaid);

Why are you using pnform->pnpubid to get schemaid? I think you need
pnform->pnnspcid here and it was like that in the previous version,
not sure, why you changed it?

2.
@@ -369,15 +531,20 @@ AlterPublicationTables(AlterPublicationStmt
*stmt, Relation rel,
  (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  errmsg("publication \"%s\" is defined as FOR ALL TABLES",
  NameStr(pubform->pubname)),
- errdetail("Tables cannot be added to or dropped from FOR ALL TABLES
publications.")));
+ errdetail("Tables cannot be added to, dropped from, or set on FOR
ALL TABLES publications.")));

Why is this message changed? Have we changed anything related to this
as part of this patch?

3.
+ Oid pnnspcid BKI_LOOKUP(pg_namespace); /* Oid of the schema */
+} FormData_pg_publication_namespace;
+
...
...
+DECLARE_UNIQUE_INDEX(pg_publication_namespace_pnnspcid_pnpubid_index,
8903, PublicationNamespacePnnspcidPnpubidIndexId, on
pg_publication_namespace using btree(pnnspcid oid_ops, pnpubid
oid_ops));

Let's use nspid instead of nspcid in all places, as that is how we refer
to it elsewhere in the code. The way you have done it is also okay, but it is
better to be consistent with the existing code.

4. Need to think of comments in GetSchemaPublicationRelations.
+ /* get all the ordinary tables present in schema schemaid */
..

Let's change the above comment to something like: "get all the
relations present in the given schema"

+
+ /*
+ * Get all relations that inherit from the partition table, directly or
+ * indirectly.
+ */
+ scan = table_beginscan_catalog(classRel, keycount, key);


Let's change the above comment to something like: "It is quite
possible that some of the partitions are in a different schema than
the parent table, so we need to get such partitions separately."

5.
+ if (list_member_oid(schemaidlist, relSchemaId))
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("cannot add relation \"%s.%s\" to publication",
+get_namespace_name(relSchemaId),
+RelationGetRelationName(rel)),
+ errdetail("Table's schema is already included as part of ALL TABLES
IN SCHEMA option."));

I think the better errdetail would be: "Table's schema \"%s\" is
already part of the publication."

+ if (list_member_oid(schemaidlist, relSchemaId))
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("cannot add schema \"%s\" to publication",
+get_namespace_name(get_rel_namespace(tableOid))),
+ errdetail("Table \"%s.%s\" is part of publication, adding same
schema \"%s\" is not supported",
+   get_namespace_name(get_rel_namespace(tableOid)),
+   get_rel_name(tableOid),
+   get_namespace_name(get_rel_namespace(tableOid))));

I think this errdetail message can also be improved. One idea could
be: "Table \"%s\" in schema \"%s\" is already part of the publication,
adding the same schema is not supported.". Do you or anyone else have
better ideas?

6. I think instead of two different functions
CheckRelschemaInSchemaList and CheckSchemaInRels, let's have a single
function RelSchemaIsMemberOfSchemaList and have a boolean variable to
distinguish the two cases.


-- 
With Regards,
Amit Kapila.




Re: Small documentation improvement for ALTER SUBSCRIPTION

2021-09-07 Thread Peter Eisentraut

On 12.08.21 04:52, Masahiko Sawada wrote:

On Wed, Aug 11, 2021 at 5:42 PM Daniel Gustafsson  wrote:



On 11 Aug 2021, at 09:57, Masahiko Sawada  wrote:



Additionally, refresh options as described in
refresh_option of
REFRESH PUBLICATION may be specified,
except in the case of DROP PUBLICATION.


Since this paragraph is under the literal option “refresh”, which takes a
value, I still find your original patch to be the clearest.


Yeah, I prefer my original patch over this idea. On the other hand, I
can see the point of review comment on it that Amit pointed out[1].


How about this:

-  Additionally, refresh options as described
-  under REFRESH PUBLICATION may be specified.
+  Additionally, the options described under REFRESH
+  PUBLICATION may be specified, to control the implicit refresh
+  operation.




RE: Allow escape in application_name (was: [postgres_fdw] add local pid to fallback_application_name)

2021-09-07 Thread kuroda.hay...@fujitsu.com
Dear Fujii-san, Ikeda-san,

> Pushed 0001 patch. Thanks!

I confirmed your commit. Thanks!
I attached the rebased version. Tests and descriptions were added.
In my understanding Ikeda-san's indication is included.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v10_0002_allow_escapes.patch
Description: v10_0002_allow_escapes.patch


Re: psql: \dl+ to list large objects privileges

2021-09-07 Thread Pavel Luzanov

Hi,

On 06.09.2021 14:39, gkokola...@pm.me wrote:


Thanks! The tests look good. A minor nitpick would be to also add a comment on 
the
large object to verify that it is picked up correctly.

Also:

+\lo_unlink 42
+DROP ROLE lo_test;
+

That last empty line can be removed.


The new version adds a comment to a large object and removes the last empty 
line.

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index e0ffb020bf..374515bcb2 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2094,7 +2094,7 @@ REVOKE ALL ON accounts FROM PUBLIC;
   LARGE OBJECT
   rw
   none
-  
+  \dl+
  
  
   SCHEMA
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index fcab5c0d51..de47ef528e 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1681,11 +1681,13 @@ testdb=
 
 
   
-\dl
+\dl[+]
 
 
 This is an alias for \lo_list, which shows a
-list of large objects.
+list of large objects. If + is appended 
+to the command name, each large object is listed with its 
+associated permissions, if any.
 
 
   
@@ -2588,12 +2590,15 @@ lo_import 152801
   
 
   
-\lo_list
+\lo_list[+]
 
 
 Shows a list of all PostgreSQL
 large objects currently stored in the database,
 along with any comments provided for them.
+If + is appended to the command name, 
+each large object is listed with its associated permissions,
+if any. 
 
 
   
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 49d4c0e3ce..f3645cab96 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -807,7 +807,16 @@ exec_command_d(PsqlScanState scan_state, bool active_branch, const char *cmd)
 success = describeRoles(pattern, show_verbose, show_system);
 break;
 			case 'l':
-success = do_lo_list();
+switch (cmd[2])
+{
+	case '\0':
+	case '+':
+		success = listLargeObjects(show_verbose);
+		break;
+	default:
+		status = PSQL_CMD_UNKNOWN;
+		break;
+}
 break;
 			case 'L':
 success = listLanguages(pattern, show_verbose, show_system);
@@ -1901,11 +1910,13 @@ exec_command_lo(PsqlScanState scan_state, bool active_branch, const char *cmd)
 	{
 		char	   *opt1,
    *opt2;
+		bool		show_verbose;
 
 		opt1 = psql_scan_slash_option(scan_state,
 	  OT_NORMAL, NULL, true);
 		opt2 = psql_scan_slash_option(scan_state,
 	  OT_NORMAL, NULL, true);
+		show_verbose = strchr(cmd, '+') ? true : false;
 
 		if (strcmp(cmd + 3, "export") == 0)
 		{
@@ -1935,8 +1946,8 @@ exec_command_lo(PsqlScanState scan_state, bool active_branch, const char *cmd)
 			}
 		}
 
-		else if (strcmp(cmd + 3, "list") == 0)
-			success = do_lo_list();
+		else if (strcmp(cmd + 3, "list") == 0 || strcmp(cmd + 3, "list+") == 0)
+			success = listLargeObjects(show_verbose);
 
 		else if (strcmp(cmd + 3, "unlink") == 0)
 		{
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 90ff649be7..a079ce9e36 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6828,3 +6828,64 @@ listOpFamilyFunctions(const char *access_method_pattern,
 	PQclear(res);
 	return true;
 }
+
+/*
+ * \dl or \lo_list
+ * Lists large objects in database
+ */
+bool
+listLargeObjects(bool verbose)
+{
+	PQExpBufferData buf;
+	PGresult   *res;
+	printQueryOpt myopt = pset.popt;
+
+	initPQExpBuffer(&buf);
+
+	if (pset.sversion >= 90000)
+	{
+		printfPQExpBuffer(&buf,
+			"SELECT oid as \"%s\",\n"
+			"  pg_catalog.pg_get_userbyid(lomowner) as \"%s\",\n  ",
+			gettext_noop("ID"),
+			gettext_noop("Owner"));
+
+		if (verbose)
+		{
+			printACLColumn(&buf, "lomacl");
+			appendPQExpBufferStr(&buf, ",\n  ");
+		}
+
+		appendPQExpBuffer(&buf,
+			"pg_catalog.obj_description(oid, 'pg_largeobject') as \"%s\"\n"
+			"FROM pg_catalog.pg_largeobject_metadata\n"
+			"ORDER BY oid",
+			gettext_noop("Description"));
+
+	}
+	else
+	{
+		printfPQExpBuffer(&buf,
+			"SELECT loid as \"%s\",\n"
+			"  pg_catalog.obj_description(loid, 'pg_largeobject') as \"%s\"\n"
+			"FROM (SELECT DISTINCT loid FROM pg_catalog.pg_largeobject) x\n"
+			"ORDER BY 1",
+			gettext_noop("ID"),
+			gettext_noop("Description"));
+	}
+
+	res = PSQLexec(buf.data);
+	termPQExpBuffer(&buf);
+	if (!res)
+		return false;
+
+	myopt.topt.tuples_only = false;
+	myopt.nullPrint = NULL;
+	myopt.title = _("Large objects");
+	myopt.translate_header = true;
+
+	printQuery(res, &myopt, pset.queryFout, false, pset.logfile);
+
+	PQclear(res);
+	return true;
+}
diff --git a/src/bin/psql/describe.h b/src/bin/psql/describe.h
index 71b320f1fc..53738cc0cb 100644
--- a/src/bin/psql/describe.h
+++ b/src/bin/psql/describe.h
@@ -139,5 +139,8 @@ extern bool listOpFamilyOperators(const char 

Re: What are exactly bootstrap processes, auxiliary processes, standalone backends, normal backends(user sessions)?

2021-09-07 Thread Bharath Rupireddy
On Tue, Sep 7, 2021 at 5:48 AM Alvaro Herrera  wrote:
>
> On 2021-Aug-14, Justin Pryzby wrote:
>
> > I elaborated on your definition and added here.
> > https://commitfest.postgresql.org/34/3285/
>
> Thanks!  This works for me.  After looking at it, it seemed to me that
> listing the autovacuum launcher is perfectly adapted, so we might as
> well do it; and add verbiage about it to the autovacuum entry.  (I was
> first adding a whole new glossary entry for it, but it seemed overkill.)
>
> I also ended up adding an entry for WAL sender -- seems to round things
> nicely.
>
> ... In doing so I noticed that the definition for startup process and
> WAL receiver is slightly wrong.  WAL receiver only receives, it doesn't
> replay; it is always the startup process the one that replays.  So I
> changed that too.

Thanks for the v2 patch, here are some comments on it:

1) How about
   "A set of background processes (autovacuum launcher and autovacuum workers) that routinely perform"
instead of
   "A set of background processes that routinely perform"
?

2) In what way do we call the autovacuum launcher an auxiliary process but
not the autovacuum worker? And autovacuum isn't a background worker, right?
Why can't we call it an auxiliary process?
+ (but not the autovacuum workers),

3) Isn't it "WAL sender" instead of "WAL senders"?
+ (but not the WAL
senders),


4) "replays WAL during replication"? Isn't it "replays WAL during crash
recovery or in standby mode"?
+ An auxiliary process that replays WAL during replication and
+ crash recovery.

5) Should we mention that WAL archiver too is optional similar to
Logger (process)? Also, let us rearrange the text a bit to be in sync.
+ An auxiliary process which (if enabled) saves copies of
+ WAL files

+ An auxiliary process which (if enabled)
  writes information about database events into the current

6) Shouldn't we mention the glossary term "auxiliary process"
instead of just plain "auxiliary process"?

7) Shouldn't we mention "primary" instead of
"primary server"?
+ to receive WAL from the primary server for replay by the

8) I agree with not calling the walsender an auxiliary process, because it is
a type of backend process that understands replication commands only.
Instead of saying "A process that runs...",
why can't we mention that in the description?
+ A process that runs on a server that streams WAL over a
+ network.  The receiving end can be a
+ WAL receiver

Regards,
Bharath Rupireddy.




Re: UNIQUE null treatment option

2021-09-07 Thread Peter Eisentraut

On 27.08.21 14:44, Marko Tiikkaja wrote:
On Fri, Aug 27, 2021 at 3:38 PM Peter Eisentraut 
> wrote:


In the SQL:202x draft, this
has been formalized by making this implementation-defined and adding
an option on unique constraint definitions UNIQUE [ NULLS [NOT]
DISTINCT ] to choose a behavior explicitly.

The CREATE UNIQUE INDEX syntax extension is not from the standard,
it's my own invention.


For the unique index syntax, should this be selectable per 
column/expression, rather than for the entire index as a whole?


Semantically, this would be possible, but the bookkeeping to make it 
work seems out of proportion with the utility.  And you'd have the 
unique index syntax out of sync with the unique constraint syntax, which 
would be pretty confusing.
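(For reference, a rough sketch of the two forms being discussed; this is
proposed syntax from this thread, so the details may still change:)

-- constraint syntax, per the SQL:202x draft:
CREATE TABLE t (a int, b int, UNIQUE NULLS NOT DISTINCT (a, b));
-- index syntax, the extension discussed here, applying to the index as a whole:
CREATE UNIQUE INDEX u_idx ON t (a, b) NULLS NOT DISTINCT;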





Re: Non-decimal integer literals

2021-09-07 Thread Peter Eisentraut

On 16.08.21 17:32, John Naylor wrote:
The one thing that jumped out at me on a cursory reading is 
the {integer} rule, which seems to be used nowhere except to 
call process_integer_literal, which must then inspect the token text to 
figure out what type of integer it is. Maybe consider 4 separate 
process_*_literal functions?


Agreed, that can be done in a simpler way.  Here is an updated patch.
From f90826f77d8067a1641f60dd75d5ea1d83466ea9 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 7 Sep 2021 13:10:18 +0200
Subject: [PATCH v2] Non-decimal integer literals

Add support for hexadecimal, octal, and binary integer literals:

0x42E
0o112
0b100101

per SQL:202x draft.

This adds support in the lexer as well as in the integer type input
functions.

Discussion: 
https://www.postgresql.org/message-id/flat/b239564c-cad0-b23e-c57e-166d883cb...@enterprisedb.com
---
 doc/src/sgml/syntax.sgml | 26 
 src/backend/catalog/sql_features.txt |  1 +
 src/backend/parser/scan.l| 87 ++---
 src/backend/utils/adt/int8.c | 54 
 src/backend/utils/adt/numutils.c | 97 
 src/fe_utils/psqlscan.l  | 55 +++-
 src/interfaces/ecpg/preproc/pgc.l| 64 +++---
 src/test/regress/expected/int2.out   | 19 ++
 src/test/regress/expected/int4.out   | 37 +++
 src/test/regress/expected/int8.out   | 19 ++
 src/test/regress/sql/int2.sql|  7 ++
 src/test/regress/sql/int4.sql| 11 
 src/test/regress/sql/int8.sql|  7 ++
 13 files changed, 422 insertions(+), 62 deletions(-)

diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml
index d66560b587..8fb4b1228d 100644
--- a/doc/src/sgml/syntax.sgml
+++ b/doc/src/sgml/syntax.sgml
@@ -694,6 +694,32 @@ Numeric Constants
 
 
 
+
+ Additionally, non-decimal integer constants can be used in these forms:
+
+0xhexdigits
+0ooctdigits
+0bbindigits
+
+ hexdigits is one or more hexadecimal digits
+ (0-9, A-F), octdigits is one or more octal
+ digits (0-7), bindigits is one or more binary
+ digits (0 or 1).  Hexadecimal digits and the radix prefixes can be in
+ upper or lower case.  Note that only integers can have non-decimal forms,
+ not numbers with fractional parts.
+
+
+
+ These are some examples of this:
+0b100101
+0B10011001
+0o112
+0O755
+0x42e
+0X
+
+
+
 
  integer
  bigint
diff --git a/src/backend/catalog/sql_features.txt 
b/src/backend/catalog/sql_features.txt
index 9f424216e2..d6359503f3 100644
--- a/src/backend/catalog/sql_features.txt
+++ b/src/backend/catalog/sql_features.txt
@@ -526,6 +526,7 @@ T652SQL-dynamic statements in SQL routines  
NO
 T653   SQL-schema statements in external routines  YES 
 T654   SQL-dynamic statements in external routines NO  
 T655   Cyclically dependent routines   YES 
+T661   Non-decimal integer literalsYES SQL:202x draft
 T811   Basic SQL/JSON constructor functionsNO  
 T812   SQL/JSON: JSON_OBJECTAGGNO  
 T813   SQL/JSON: JSON_ARRAYAGG with ORDER BY   NO  
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index 6e6824faeb..a78fe7a2ed 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -124,7 +124,7 @@ static void addlit(char *ytext, int yleng, core_yyscan_t 
yyscanner);
 static void addlitchar(unsigned char ychar, core_yyscan_t yyscanner);
 static char *litbufdup(core_yyscan_t yyscanner);
 static unsigned char unescape_single_char(unsigned char c, core_yyscan_t 
yyscanner);
-static int process_integer_literal(const char *token, YYSTYPE *lval);
+static int process_integer_literal(const char *token, YYSTYPE *lval, int 
base);
 static void addunicode(pg_wchar c, yyscan_t yyscanner);
 
 #define yyerror(msg)  scanner_yyerror(msg, yyscanner)
@@ -262,7 +262,7 @@ quotecontinuefail   {whitespace}*"-"?
 xbstart[bB]{quote}
 xbinside   [^']*
 
-/* Hexadecimal number */
+/* Hexadecimal byte string */
 xhstart[xX]{quote}
 xhinside   [^']*
 
@@ -341,7 +341,7 @@ xcstart \/\*{op_chars}*
 xcstop \*+\/
 xcinside   [^*/]+
 
-digit  [0-9]
+
 ident_start[A-Za-z\200-\377_]
 ident_cont [A-Za-z\200-\377_0-9\$]
 
@@ -380,24 +380,39 @@ self  [,()\[\].;\:\+\-\*\/\%\^\<\>\=]
 op_chars   [\~\!\@\#\^\&\|\`\?\+\-\*\/\%\<\>\=]
 operator   {op_chars}+
 
-/* we no longer allow unary minus in numbers.
- * instead we pass it separately to parser. there it gets
- * coerced via doNegate() -- Leon aug 20 1999
+/*
+ * Numbers
  *
- * {decimalfail} is used because we would like "1..10" to lex 

Re: Patching documentation of ALTER TABLE re column type changes on binary-coercible fields

2021-09-07 Thread Michael Banck
Hi,

I've stumbled over this topic today, and found your patch.

On Thu, Jan 23, 2020 at 11:01:36PM -0800, Mike Lissner wrote:
> Enclosed please find a patch to tweak the documentation of the ALTER TABLE
> page. I believe this patch is ready to be applied to master and backported
> all the way to 9.2.
> 
> On the ALTER TABLE page, it currently notes that if you change the type of
> a column, even to a binary coercible type:
> 
> > any indexes on the affected columns must still be rebuilt.
> 
> It appears this hasn't been true for about eight years, since 367bc426a.
> 
> Here's the discussion of the topic from earlier today and yesterday:
> 
> https://www.postgresql.org/message-id/flat/CAMp9%3DExXtH0NeF%2BLTsNrew_oXycAJTNVKbRYnqgoEAT01t%3D67A%40mail.gmail.com
> 
> I haven't run tests, but I presume they'll be unaffected by a documentation
> change.
> 
> I've made an effort to follow the example of other people's patches I
> looked at, but I haven't contributed here before. Happy to take another
> stab at this if this doesn't hit the mark — though I hope it does. I love
> and appreciate Postgresql and hope that I can do my little part to make it
> better.
> 
> For the moment, I haven't added this to commitfest. I don't know what it
> is, but I suspect this is small enough somebody will just pick it up.
> 
> Mike

> Index: doc/src/sgml/ref/alter_table.sgml
> IDEA additional info:
> Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
> <+>UTF-8
> ===
> --- doc/src/sgml/ref/alter_table.sgml (revision 
> 6de7bcb76f6593dcd107a6bfed645f2142bf3225)
> +++ doc/src/sgml/ref/alter_table.sgml (revision 
> 9a813e0896e828900739d95f78b5e4be10dac365)
> @@ -1225,10 +1225,9 @@
>  existing column, if the USING clause does not change
>  the column contents and the old type is either binary coercible to the 
> new
>  type or an unconstrained domain over the new type, a table rewrite is not
> -needed; but any indexes on the affected columns must still be rebuilt.
> -Table and/or index rebuilds may take a
> -significant amount of time for a large table; and will temporarily 
> require
> -as much as double the disk space.
> +needed. Table and/or index rebuilds may take a significant amount of time
> +for a large table; and will temporarily require as much as double the 
> disk
> +space.
> 

In general, I find the USING part in that paragraph a bit confusing; I
think the main use case for ALTER COLUMN ... TYPE is without it. So I
would suggest separating the two and making the general case (table and
index rewrites are not needed if the type is binary coercible) without
having USING in the same sentence.
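For example (made-up table and column names; the column is assumed to be
varchar, which is binary coercible to text):

-- needs neither a table rewrite nor an index rebuild:
ALTER TABLE accounts ALTER COLUMN description TYPE text;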


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Sascha Heuer, Geoff Richardson,
Peter Lilley

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz




Re: PROXY protocol support

2021-09-07 Thread Magnus Hagander
On Wed, Jul 14, 2021 at 8:24 PM Jacob Champion  wrote:
>
> On Mon, 2021-07-12 at 18:28 +0200, Magnus Hagander wrote:
> > Yeah, I have no problem being stricter than necessary, unless that
> > actually causes any interop problems. It's a lot worse to not be
> > strict enough..
>
> Agreed. Haven't heard back from the HAProxy mailing list yet, so
> staying strict seems reasonable in the meantime. That could always be
> rolled back later.

Any further feedback from them now, two months later? :)

(Sorry, I was out on vacation for the end of the last CF, so didn't
get around to this one, but it seemed there'd be plenty of time in
this CF)


> > > I've been thinking more about your earlier comment:
> > >
> > > > An interesting thing is what to do about
> > > > inet_server_addr/inet_server_port. That sort of loops back up to the
> > > > original question of where/how to expose the information about the
> > > > proxy in general (since right now it just logs). Right now you can
> > > > actually use inet_server_port() to see if the connection was proxied
> > > > (as long as it was over tcp).
> > >
> > > IMO these should return the "virtual" dst_addr/port, instead of
> > > exposing the physical connection information to the client. That way,
> > > if you intend for your proxy to be transparent, you're not advertising
> > > your network internals to connected clients. It also means that clients
> > > can reasonably expect to be able to reconnect to the addr:port that we
> > > give them, and prevents confusion if the proxy is using an address
> > > family that the client doesn't even support (e.g. the client is IPv4-
> > > only but the proxy connects via IPv6).
> >
> > That reasoning I think makes a lot of sense, especially with the
> > comment about being able to connect back to it.
> >
> > The question at that point extends to, would we also add extra
> > functions to get the data on the proxy connection itself? Maybe add a
> > inet_proxy_addr()/inet_proxy_port()? Better names?
>
> What's the intended use case? I have trouble viewing those as anything
> but information disclosure vectors, but I'm jaded. :)

"Covering all the bases"?

I'm not entirely sure what the point is of the *existing* functions
for that though, so I'm definitely not wedded to including it.


> If the goal is to give a last-ditch debugging tool to someone whose
> proxy isn't behaving properly -- though I'd hope the proxy in question
> has its own ways to debug its behavior -- maybe they could be locked
> behind one of the pg_monitor roles, so that they're only available to
> someone who could get that information anyway?

Yeah, agreed.


> > PFA a patch that fixes the above errors, and changes
> > inet_server_addr()/inet_server_port(). Does not yet add anything to
> > receive the actual local port in this case.
>
> Looking good in local testing. I'm going to reread the spec with fresh
> eyes and do a full review pass, but this is shaping up nicely IMO.

Thanks!


> Something that I haven't thought about very hard yet is proxy
> authentication, but I think the simple IP authentication will be enough
> for a first version. For the Unix socket case, it looks like anyone
> currently relying on peer auth will need to switch to a
> unix_socket_group/_permissions model. For now, that sounds like a
> reasonable v1 restriction, though I think not being able to set the
> proxy socket's permissions separately from the "normal" one might lead
> to some complications in more advanced setups.

Agreed in principle, but I think those are some quite uncommon
use cases, so definitely something we don't need to cover in a first
version of the feature.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Improve logging when using Huge Pages

2021-09-07 Thread Fujii Masao




On 2021/09/07 13:09, Shinoda, Noriyoshi (PN Japan FSIP) wrote:

Hello,

Thank you everyone for comments.
In the thread [1] that Horiguchi told me about, there is already a review going 
on about GUC for HugePages memory.
For this reason, I have removed the new GUC implementation and attached a patch 
that changes only the message at instance startup.


Thanks for updating the patch!

Even with the patch, there are still some cases where huge pages are
disabled silently. Should we report something in these cases too?
For example, on a platform where huge pages are not supported,
they are silently disabled when huge_pages=try.

One big concern about the patch is that the log message is always reported
when shared memory fails to be allocated with huge pages
under huge_pages=try. Since huge_pages=try is the default setting,
many users would see this new log message whenever they start
the server. Those who don't need huge pages but just use the default
setting might find such log messages noisy.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: when the startup process doesn't (logging startup delays)

2021-09-07 Thread Nitin Jadhav
> Looking over this version, I realized something I (or you) should have
> noticed sooner: you've added the RegisterTimeout call to
> InitPostgres(), but that's for things that are used by all backends,
> and this is only used by the startup process. So it seems to me that
> the right place is StartupProcessMain. That would have the further
> advantage of allowing startup_progress_timeout_handler to be made
> static. begin_startup_progress_phase() and
> startup_progress_timeout_has_expired() are the actual API functions
> though so they will need to remain extern.
>
> I agree with Robert that a standby server should not continuously show timing
> messages for WAL replay.

The earlier discussion wrt this point is as follows.

> > I also agree that this is the better place to do it. Hence modified
> > the patch accordingly. The condition "!AmStartupProcess()" is added to
> > differentiate whether the call is done from a startup process or some
> > other process. Actually StartupXLOG() gets called in 2 places. one in
> > StartupProcessMain() and the other in InitPostgres(). As the logging
> > of the startup progress is required only during the startup process
> > and not in the other cases,
>
> The InitPostgres() case occurs when the server is started in bootstrap
> mode (during initdb) or in single-user mode (postgres --single). I do
> not see any reason why we shouldn't produce progress messages in at
> least the latter case. I suspect that someone who is in the rather
> desperate scenario of having to use single-user mode would really like
> to know how long the server is going to take to start up.
>
> Perhaps during initdb we don't want messages, but I'm not sure that we
> need to do anything about that here. None of the messages that the
> server normally produces show up when you run initdb, so I guess they
> are getting redirected to /dev/null or something.
>
> So I don't think that using AmStartupProcess() for this purpose is right.

So as per the recent discussion, the RegisterTimeout call should be
removed from InitPostgres(), and the condition "!AmStartupProcess()"
should be added in begin_startup_progress_phase() and
ereport_startup_progress() to differentiate whether the call is from the
startup process or some other process. Kindly correct me if I am
wrong.

> Some doc comments:

Thanks for the suggestions. I will take care in the next patch.

> Clearly.  This should be limited to crash recovery, and maybe there
> could be some checks to make sure that nothing is logged once a
> consistent point is reached.

The purpose here is to show the progress of the operation, if it is
taking longer than the interval set by the user, until the operation
completes. Users should know what operation is happening in the
background, and the progress is shown by displaying the elapsed time. So,
as I see it, the consistent point is simply the end of the
operation. Kindly let me know if you have something else in mind that
could be a better consistent point.

> Honestly, I don't see why we should have
> a GUC for what's proposed here.  A value too low would bloat the logs
> with entries that are not that meaningful.  A value too large would
> just prevent access to some information wanted.  Wouldn't it be better
> to just pick up a value like 10s or 20s?

It is difficult to settle on one value and use it without
providing an option to change it. If we choose one value (say 10s), it
may be too small for some users or too large for others. So I
feel it is better to provide an option so that users can
choose the value according to their needs. Anyway, the default value
for this setting is 10s.
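(For illustration, using the GUC name from the patch; a user could put it in
postgresql.conf, or, since it is PGC_SIGHUP, set it like this and reload:)

ALTER SYSTEM SET log_startup_progress_interval = '30s';
SELECT pg_reload_conf();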

> +   {"log_startup_progress_interval", PGC_SIGHUP, LOGGING_WHEN,
> +   gettext_noop("Sets the time interval between each progress update 
> "
> +"of the startup process."),
> +   gettext_noop("0 logs all messages. -1 turns this feature off."),
> +   GUC_UNIT_MS,
> The unit is incorrect here, as that would default to 10ms, contrary to
> what the documentation says about 10s.

Kindly refer the previous few discussions wrt this point.

On Tue, Sep 7, 2021 at 10:58 AM Michael Paquier  wrote:
>
> On Fri, Sep 03, 2021 at 09:23:27PM -0500, Justin Pryzby wrote:
> > On Fri, Sep 03, 2021 at 01:23:56PM +0530, Nitin Jadhav wrote:
> > > Please find the updated patch attached.
> >
> > Please check 
> > CA+TgmoZtbqxaOLdpNkBcDbz=41tWALA8kpH4M=rwtpyhc7-...@mail.gmail.com
> >
> > I agree with Robert that a standby server should not continuously show 
> > timing
> > messages for WAL replay.
>
> Clearly.  This should be limited to crash recovery, and maybe there
> could be some checks to make sure that nothing is logged once a
> consistent point is reached.  Honestly, I don't see why we should have
> a GUC for what's proposed here.  A value too low would bloat the logs
> with entries that are not that meaningful.  A value too large would
> just prevent access to some 

Re: Avoid stuck of pbgench due to skipped transactions

2021-09-07 Thread Fabien COELHO



Hello Fujii-san,

Stop counting skipped transactions under -T as soon as the timer is 
exceeded. Because otherwise it can take a very long time to count all of 
them especially when quite a lot of them happen with unrealistically 
high rate setting in -R, which would prevent pgbench from ending 
immediately. Because of this behavior, note that there is no guarantee 
that all skipped transactions are counted under -T though there is under 
-t. This is OK in practice because it's very unlikely to happen with 
realistic setting.


Ok, I find this text quite clear.


One question is; which version do we want to back-patch to?


If we consider it a "very minor bug fix" which is triggered only by somewhat
unrealistic options, I'd say 14 & dev, or possibly only dev.


--
Fabien.




Re: Remove Value node struct

2021-09-07 Thread Peter Eisentraut

On 30.08.21 04:13, Kyotaro Horiguchi wrote:

However, the patch adds:


+typedef struct Null
+{
+   NodeTag type;
+   char   *val;
+} Null;


which doesn't seem to be used anywhere. Is that a leftover from an
intermediate development stage?


+1 Looks like so, it can be simply removed.


fixed


0002:
   there's an "integer Value node" in gram.y: 7776.


fixed


-   n = makeFloatConst(v->val.str, location);
+   n = (Node *) makeFloatConst(castNode(Float, v)->val, 
location);

makeFloatConst returns Node *, so the cast doesn't seem needed. The same can
be said for the Int and String Consts.  This looks like a confusion with
makeInteger and friends.


fixed


+   else if (IsA(obj, Integer))
+   _outInteger(str, (Integer *) obj);
+   else if (IsA(obj, Float))
+   _outFloat(str, (Float *) obj);

I felt that the type names are a bit confusing, as they might be too
generic, or too close to the corresponding binary types.


-   Node   *arg;/* a (Value *) or a (TypeName 
*) */
+   Node   *arg;

Mmm. It's a bit of a pity that we lose the generic name for the value nodes.


Not sure what you mean here.
From 01773e4aeedea35fa2f23af08c30cfbda6fe8b27 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 25 Aug 2021 13:41:09 +0200
Subject: [PATCH v2 1/2] Remove useless casts

Casting the argument of strVal() to (Value *) is useless, since
strVal() already does that.

Most code didn't do that anyway; this was apparently just a style that
snuck into certain files.
---
 src/backend/catalog/objectaddress.c | 20 ++--
 src/backend/commands/alter.c| 16 
 src/backend/commands/comment.c  |  2 +-
 src/backend/commands/dropcmds.c | 16 
 src/backend/commands/statscmds.c|  2 +-
 5 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/src/backend/catalog/objectaddress.c 
b/src/backend/catalog/objectaddress.c
index deaabaeae9..bc2a4ccdde 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -2386,7 +2386,7 @@ check_object_ownership(Oid roleid, ObjectType objtype, 
ObjectAddress address,
case OBJECT_DATABASE:
if (!pg_database_ownercheck(address.objectId, roleid))
aclcheck_error(ACLCHECK_NOT_OWNER, objtype,
-  strVal((Value *) 
object));
+  strVal(object));
break;
case OBJECT_TYPE:
case OBJECT_DOMAIN:
@@ -2433,7 +2433,7 @@ check_object_ownership(Oid roleid, ObjectType objtype, 
ObjectAddress address,
case OBJECT_SCHEMA:
if (!pg_namespace_ownercheck(address.objectId, roleid))
aclcheck_error(ACLCHECK_NOT_OWNER, objtype,
-  strVal((Value *) 
object));
+  strVal(object));
break;
case OBJECT_COLLATION:
if (!pg_collation_ownercheck(address.objectId, roleid))
@@ -2448,27 +2448,27 @@ check_object_ownership(Oid roleid, ObjectType objtype, 
ObjectAddress address,
case OBJECT_EXTENSION:
if (!pg_extension_ownercheck(address.objectId, roleid))
aclcheck_error(ACLCHECK_NOT_OWNER, objtype,
-  strVal((Value *) 
object));
+  strVal(object));
break;
case OBJECT_FDW:
if 
(!pg_foreign_data_wrapper_ownercheck(address.objectId, roleid))
aclcheck_error(ACLCHECK_NOT_OWNER, objtype,
-  strVal((Value *) 
object));
+  strVal(object));
break;
case OBJECT_FOREIGN_SERVER:
if (!pg_foreign_server_ownercheck(address.objectId, 
roleid))
aclcheck_error(ACLCHECK_NOT_OWNER, objtype,
-  strVal((Value *) 
object));
+  strVal(object));
break;
case OBJECT_EVENT_TRIGGER:
if (!pg_event_trigger_ownercheck(address.objectId, 
roleid))
aclcheck_error(ACLCHECK_NOT_OWNER, objtype,
-  strVal((Value *) 
object));
+  strVal(object));
break;
  

RE: Added missing invalidations for all tables publication

2021-09-07 Thread houzj.f...@fujitsu.com

From Mon, Sep 6, 2021 1:56 PM Amit Kapila  wrote:
> On Tue, Aug 31, 2021 at 8:54 PM vignesh C  wrote:
> > Thanks for the comments, the attached v3 patch has the changes for the
> > same.
> >
> 
> I think this bug should be fixed in back branches (till v10). OTOH, as this 
> is not
> reported by any user and we have found it during code review so it seems
> either users don't have an exact use case or they haven't noticed this yet. 
> What
> do you people think about back-patching?

Personally, I think it's ok to back-patch.

> Attached, please find a slightly updated patch with minor changes. I have
> slightly changed the test to make it more meaningful.

The patch looks good to me.

Best regards,
Hou zj



Re: Add statistics refresh materialized view

2021-09-07 Thread Seino Yuki

On 2021-09-01 23:15, Fujii Masao wrote:
Why do you want to treat only REFRESH MATERIALIZED VIEW command 
special?

What about other utility commands like TRUNCATE, CLUSTER, etc?


First of all, knowing the update date and time of the matview is
essential for actual operation.

Without that information, users will not be able to trust the matview.

In terms of the reliability of the information in the table,
I think the priority of REFRESH MATERIALIZED VIEW is higher than that of
TRUNCATE and CLUSTER.




It's not good design to add new columns per utility command into
pg_stat_all_tables. Otherwise pg_stat_all_tables will have to have lots of
columns to expose the stats of many utility commands at last. Which is
ugly and very user-unfriendly.

Most entries in pg_stat_all_tables are basically for tables. So the columns
about REFRESH MATERIALIZED VIEW are useless for those most entries.
This is another reason why I think the design is not good.


I agree with this opinion.
Initially, I thought about storing this information in pg_matviews,
but decided against it because of the overhead of adding it to the 
system catalog.



pg_stat_statements reports different records for REFRESH MATERIALIZED VIEW
commands on different views. So ISTM that we can aggregate the information
per view, from pg_stat_statements. No?
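(For illustration, a rough sketch of that kind of aggregation, assuming the
pg_stat_statements extension is installed; note that it only gives call counts
and timings, not the timestamp of the last refresh:)

SELECT query, calls, total_exec_time
FROM pg_stat_statements
WHERE query ILIKE 'REFRESH MATERIALIZED VIEW%'
ORDER BY calls DESC;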


I made this suggestion based on the premise that the last update date
and time of the matview should always be retained.

I think the same concept applies to Oracle Database.
https://docs.oracle.com/cd/F19136_01/refrn/ALL_MVIEWS.html#GUID-8B9432B5-6B66-411A-936E-590D9D7671E9
I thought it would be useless to enable pg_stat_statements and 
log_statement to see this information.



However, as you said, for most use cases, pg_stat_statements and 
log_statement may be sufficient.

I would like to withdraw this proposal.

Regards,




Re: automatically generating node support functions

2021-09-07 Thread Peter Eisentraut

On 02.09.21 20:53, Jacob Champion wrote:

0004-Make-node-output-prefix-match-node-structure-name.patch

Some nodes' output/read functions use a label that is slightly different
from their node name, e.g., "NOTIFY" instead of "NOTIFYSTMT".  This
cleans that up so that an automated approach doesn't have to deal with
these special cases.


Is there any concern about the added serialization length, or is that
trivial in practice? The one that particularly caught my eye is
RANGETBLENTRY, which was previously RTE. But I'm not very well-versed
in all the places these strings can be generated and stored.


These are just matters of taste.  Let's wait a bit more to see if anyone 
is concerned.



0005-Add-Cardinality-typedef.patch

Adds a typedef Cardinality for double fields that store an estimated row
or other count.  Works alongside Cost and Selectivity.


Should RangeTblEntry.enrtuples also be a Cardinality?


Yes, I'll add that.




Re: .ready and .done files considered harmful

2021-09-07 Thread Kyotaro Horiguchi
At Fri, 3 Sep 2021 18:31:46 +0530, Dipesh Pandit  
wrote in 
> Hi,
> 
> Thanks for the feedback.
> 
> > Which approach do you think we should use?  I think we have decent
> > patches for both approaches at this point, so perhaps we should see if
> > we can get some additional feedback from the community on which one we
> > should pursue further.
> 
> In my opinion both the approaches have benefits over current implementation.
> I think in keep-trying-the-next-file approach we have handled all rare and
> specific
> scenarios which requires us to force a directory scan to archive the
> desired files.
> In addition to this with the recent change to force a directory scan at
> checkpoint
> we can avoid an infinite wait for a file which is still being missed out
> despite
> handling the special scenarios. It is also more efficient in extreme
> scenarios
> as discussed in this thread. However, multiple-files-per-readdir approach
> is
> cleaner with resilience of current implementation.
> 
> I agree that we should decide on which approach to pursue further based on
> additional feedback from the community.


I was thinking that the multiple-files approach would work efficiently,
but the patch still runs directory scans every 64 files.  As
Robert mentioned, it is still O(N^2).  I'm not sure of the reason for the
limit, but if it was to lower memory consumption or the cost of sorting,
we can resolve that issue by taking the trying-the-next approach, ignoring
the case of having many gaps (discussed below). If it was to cause
voluntary checking of out-of-order files, almost the same can be
achieved by running directory scans every 64 files in the
trying-the-next approach (and we would suffer O(N^2) again).  On the
other hand, if archiving is delayed by several segments, the
multiple-files method might reduce the cost of scanning the status
directory, but it won't matter since the directory contains only
several files.  (I think it might be better that we don't go to the
trying-the-next path if we found only several files by running a
directory scan.)  The multiple-files approach reduces the number of
directory scans if there were many gaps in the WAL file
sequence. Although theoretically the last max_backend (+alpha?)
segments could be written out of order, I suppose we only have
gaps among the several latest files in reality. I'm not sure,
though..

In short, the trying-the-next approach seems to me to be the way to
go, for the reason that it is simpler but can cover the possible
failures by almost the same measures as the multiple-files approach.

> > The problem I see with this is that pgarch_archiveXlog() might end up
> > failing.  If it does, we won't retry archiving the file until we do a
> > directory scan.  I think we could try to avoid forcing a directory
> > scan outside of these failure cases and archiver startup, but I'm not
> > sure it's really worth it.  When pgarch_readyXlog() returns false, it
> > most likely means that there are no .ready files present, so I'm not
> > sure we are gaining a whole lot by avoiding a directory scan in that
> > case.  I guess it might help a bit if there are a ton of .done files,
> > though.
> 
> Yes, I think it will be useful when we have a bunch of .done files and
> the frequency of .ready files is such that the archiver goes to wait
> state before the next WAL file is ready for archival.
> 
> > I agree, but it should probably be something like DEBUG3 instead of
> > LOG.
> 
> I will update it in the next patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Support tab completion for upper character inputs in psql

2021-09-07 Thread Peter Eisentraut

On 23.06.21 14:43, tanghy.f...@fujitsu.com wrote:

I've updated the patch to V8 since Tom, Kyotaro and Laurenz discussed the lower 
case issue of German/Turkish language at [1].

Differences from V7 are:
* Add a function valid_input_text which checks the input text to see if it only 
contains alphabet letters, numbers etc.
* Delete the flag setting of "completion_case_sensitive=false" which introduced 
in V1 patch and no use now.

As you can see, now the patch limited the lower case transform of the input to 
alphabet letters.
By doing that, language like German/Turkish will not affected by this patch.

Any comment or suggestion on this patch is very welcome.


The coding of valid_input_text() seems a bit bulky.  I think you can do  
the same thing using strspn() without a loop.


The name is also not great.  It's not like other strings are not "valid".

There is also no explanation why that specific set of characters is  
allowed and not others.  Does it have something to do with identifier  
syntax?  This needs to be explained.


Seeing that valid_input_text() is always called together with
pg_string_tolower(), I think those could be combined into one function,
like pg_string_tolower_if_ascii() or whatever.  That would save a lot of
repetition.


There are a couple of queries where the result is *not*  
case-insensitive, namely


Query_for_list_of_enum_values
Query_for_list_of_available_extension_versions

(and their variants).  These are cases where the query result is not  
used as an identifier but as a (single-quoted) string.  So that needs to  
be handled somehow, perhaps by adding a COMPLETE_WITH_QUERY_CS() similar  
to COMPLETE_WITH_CS().


(A test case for the enum case should be doable easily.)
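(For instance, enum labels are single-quoted strings and therefore
case-sensitive, so a mixed-case label shows the difference; this is just an
illustration, not the actual test:)

CREATE TYPE mood AS ENUM ('Sad', 'Ok', 'Happy');
-- completing  ALTER TYPE mood RENAME VALUE 'S  should offer 'Sad';
-- lower-casing the input before matching would find nothing.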




Re: Possible missing segments in archiving on standby

2021-09-07 Thread Fujii Masao



On 2021/09/03 14:56, Kyotaro Horiguchi wrote:

+   if (readSource == XLOG_FROM_STREAM &&
+   record->xl_rmid == RM_XLOG_ID &&
+   (record->xl_info & ~XLR_INFO_MASK) == 
XLOG_SWITCH)

readSource is the source at the time startup reads it and it could be
different from the source at the time the record was written. We
cannot know where the record came from there, so does the readSource
condition work as expected?  If we had some trouble streaming just
before, readSource at the time is likely to be XLOG_FROM_PG_WAL.


Yes.



+   if (XLogArchivingAlways())
+   
XLogArchiveNotify(xlogfilename, true);
+   else
+   
XLogArchiveForceDone(xlogfilename);

The path is used both for crash and archive recovery. If we pass there
while crash recovery on a primary with archive_mode=on, the file could
be marked .done before actually archived. On the other hand when
archive_mode=always, the file could be re-marked .ready even after it
has been already archived.  Why isn't it XLogArchiveCheckDone?


Yeah, you're right. ISTM what we should do is to just call
XLogArchiveCheckDone() for the segment including the XLOG_SWITCH record,
i.e., to create a .ready file if the segment has no archive notification file
yet and archive_mode is set to 'always'. Even if we don't do this when we
reach the XLOG_SWITCH record, subsequent restartpoints will eventually call
XLogArchiveCheckDone() for such segments.

One issue with this approach is that the WAL segment including the XLOG_SWITCH
record may be archived before its previous segments are. That is,
the notification file of the current segment is created when it's replayed
because it includes XLOG_SWITCH, but the notification files of
its previous segments will be created by subsequent restartpoints
because they don't have XLOG_SWITCH. Probably we should avoid this?

If yes, one approach for this issue is to call XLogArchiveCheckDone() for
not only the segment including XLOG_SWITCH but also all the segments
older than that. Thought?


Anyway, I extracted the changes in walreceiver from the patch,
because it's self-contained and can be applied separately.
Patch attached.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/replication/walreceiver.c 
b/src/backend/replication/walreceiver.c
index 9a2bc37fd7..eb9d12adc1 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -125,6 +125,7 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
+static void XLogWalRcvClose(XLogRecPtr recptr);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
@@ -883,42 +884,11 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
{
int segbytes;
 
-   if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, 
wal_segment_size))
+   /* Close the current segment if it's completed */
+   XLogWalRcvClose(recptr);
+
+   if (recvFile < 0)
{
-   /*
-* fsync() and close current file before we switch to 
next one. We
-* would otherwise have to reopen this file to fsync it 
later
-*/
-   if (recvFile >= 0)
-   {
-   charxlogfname[MAXFNAMELEN];
-
-   XLogWalRcvFlush(false);
-
-   XLogFileName(xlogfname, recvFileTLI, recvSegNo, 
wal_segment_size);
-
-   /*
-* XLOG segment files will be re-read by 
recovery in startup
-* process soon, so we don't advise the OS to 
release cache
-* pages associated with the file like 
XLogFileClose() does.
-*/
-   if (close(recvFile) != 0)
-   ereport(PANIC,
-   
(errcode_for_file_access(),
-errmsg("could not 
close log segment %s: %m",
-   
xlogfname)));
-
-   /*
-* Create .done file 

[BUG?] SET TIME ZONE doesn't work with abbreviations

2021-09-07 Thread Aleksander Alekseev
Hi hackers,

I noticed that `SET TIME ZONE` / `SET timezone TO` don't work with
abbreviations:

```
# select * from pg_timezone_names where abbrev = 'MSK';
   name| abbrev | utc_offset | is_dst
---+++
 Europe/Moscow | MSK| 03:00:00   | f
 Europe/Simferopol | MSK| 03:00:00   | f
 W-SU  | MSK| 03:00:00   | f

97394 (master) =# set time zone 'Europe/Moscow';
SET

97394 (master) =# set time zone 'MSK';
ERROR:  invalid value for parameter "TimeZone": "MSK"
```

However, I can use both Europe/Moscow and MSK in timestamptz_in():

```
# select '2021-09-07 12:34:56 Europe/Moscow' :: timestamptz;
  timestamptz

 2021-09-07 12:34:56+03

# select '2021-09-07 12:34:56 MSK' :: timestamptz;
  timestamptz

 2021-09-07 12:34:56+03
```

PostgreSQL was built on MacOS Catalina without the `--with-system-tzdata=` flag.

Is it a bug or this behavior is intentional (something to do with SQL
standard, perhaps)?

-- 
Best regards,
Aleksander Alekseev




RE: [BUG] Unexpected action when publishing partition tables

2021-09-07 Thread tanghy.f...@fujitsu.com
> > > ---
> > > PublicationAddTables
> > > publication_add_relation
> > > /* Invalidate relcache so that publication info is 
> > > rebuilt. */
> > > CacheInvalidateRelcache(targetrel);
> > > ---
> > >
> > > In addition, this problem can happen in both ADD TABLE, DROP TABLE,
> > > and SET TABLE cases, so we need to invalidate the leaf partitions'
> > > recache in all these cases.
> > >
> >
> > Few comments:
> > =
> >   {
> > @@ -664,7 +673,13 @@ PublicationDropTables(Oid pubid, List *rels, bool
> > missing_ok)
> >
> >   ObjectAddressSet(obj, PublicationRelRelationId, prid);
> >   performDeletion(&obj, DROP_CASCADE, 0);
> > +
> > + relids = GetPubPartitionOptionRelations(relids, PUBLICATION_PART_LEAF,
> > + relid);
> >   }
> > +
> > + /* Invalidate relcache so that publication info is rebuilt. */
> > + InvalidatePublicationRels(relids);
> >  }
> >
> > We already register the invalidation for the main table in
> > RemovePublicationRelById which is called via performDeletion. I think it is
> > better to perform invalidation for partitions at that place.
> > Similarly is there a reason for not doing invalidations of partitions in
> > publication_add_relation()?
> 
> Thanks for the comment. I originally intended to reduce the number of
> invalidation messages when adding/dropping several tables, each with lots of
> partitions, which could exceed MAX_RELCACHE_INVAL_MSGS. But that seems a rare
> case, so I changed the code as suggested.
> 
> Attach new version patches which addressed the comment.

Thanks for your patch. I confirmed that the problem I reported was fixed.

Besides, your v2 patch also fixed an existing problem with "DROP PUBLICATION" on
HEAD. (The publication was dropped, but errors about the replica identity were
still reported when trying to update or delete a partition table.)
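
For reference, the kind of change discussed above looks roughly like the
following sketch (the function names are taken from the quoted patch; the exact
placement and the variable names are assumptions):

```c
/*
 * Sketch only: expand a published partitioned table to its leaf partitions
 * and invalidate their relcache entries so publication info gets rebuilt.
 */
List       *relids = NIL;

relids = GetPubPartitionOptionRelations(relids, PUBLICATION_PART_LEAF, relid);
relids = lappend_oid(relids, relid);    /* include the table itself */

InvalidatePublicationRels(relids);
```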

Regards
Tang


Re: using an end-of-recovery record in all cases

2021-09-07 Thread Kyotaro Horiguchi
At Fri, 3 Sep 2021 15:56:35 +0530, Amul Sul  wrote in 
> On Fri, Sep 3, 2021 at 10:23 AM Kyotaro Horiguchi
>  wrote:
> You might need the following change at the end of StartupXLOG():
> 
> - if (promoted)
> - RequestCheckpoint(CHECKPOINT_FORCE);
> + RequestCheckpoint(CHECKPOINT_FORCE);

At Fri, 3 Sep 2021 10:13:53 -0400, Robert Haas  wrote in 
> Did you use the patch I posted, or a different one

Thanks to both of you. That was my mistake.

..But even with the patch (with the 018_..pl test file removed), I
didn't get a check-world failure..  I'll retry later.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: when the startup process doesn't (logging startup delays)

2021-09-07 Thread Nitin Jadhav
> I also agree that this is the better place to do it. Hence modified
> the patch accordingly. The condition "!AmStartupProcess()" is added to
> differentiate whether the call is done from a startup process or some
> other process. Actually StartupXLOG() gets called in two places: one in
> StartupProcessMain() and the other in InitPostgres(). As the logging
> of the startup progress is required only during the startup process
> and not in the other cases,

The InitPostgres() case occurs when the server is started in bootstrap
mode (during initdb) or in single-user mode (postgres --single). I do
not see any reason why we shouldn't produce progress messages in at
least the latter case. I suspect that someone who is in the rather
desperate scenario of having to use single-user mode would really like
to know how long the server is going to take to start up.

Perhaps during initdb we don't want messages, but I'm not sure that we
need to do anything about that here. None of the messages that the
server normally produces show up when you run initdb, so I guess they
are getting redirected to /dev/null or something.

So I don't think that using AmStartupProcess() for this purpose is right.

On Tue, Sep 7, 2021 at 10:58 AM Michael Paquier  wrote:
>
> On Fri, Sep 03, 2021 at 09:23:27PM -0500, Justin Pryzby wrote:
> > On Fri, Sep 03, 2021 at 01:23:56PM +0530, Nitin Jadhav wrote:
> > > Please find the updated patch attached.
> >
> > Please check 
> > CA+TgmoZtbqxaOLdpNkBcDbz=41tWALA8kpH4M=rwtpyhc7-...@mail.gmail.com
> >
> > I agree with Robert that a standby server should not continuously show 
> > timing
> > messages for WAL replay.
>
> Clearly.  This should be limited to crash recovery, and maybe there
> could be some checks to make sure that nothing is logged once a
> consistent point is reached.  Honestly, I don't see why we should have
> a GUC for what's proposed here.  A value too low would bloat the logs
> with entries that are not that meaningful.  A value too large would
> just prevent access to some information wanted.  Wouldn't it be better
> to just pick up a value like 10s or 20s?
>
> Looking at v13..
>
> +   {"log_startup_progress_interval", PGC_SIGHUP, LOGGING_WHEN,
> +   gettext_noop("Sets the time interval between each progress update "
> +"of the startup process."),
> +   gettext_noop("0 logs all messages. -1 turns this feature off."),
> +   GUC_UNIT_MS,
> The unit is incorrect here, as that would default to 10ms, contrary to
> what the documentation says about 10s.
> --
> Michael
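
A minimal sketch of what the entry might look like with a seconds-based unit,
per the point above (the variable name, bounds, and default here are
assumptions, not the actual patch):

```c
{
    {"log_startup_progress_interval", PGC_SIGHUP, LOGGING_WHEN,
        gettext_noop("Sets the time interval between each progress update "
                     "of the startup process."),
        gettext_noop("0 logs all messages. -1 turns this feature off."),
        GUC_UNIT_S              /* seconds, so the documented 10s default matches */
    },
    &log_startup_progress_interval,
    10, -1, INT_MAX,
    NULL, NULL, NULL
},
```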




Re: Added schema level support for publication.

2021-09-07 Thread vignesh C
On Fri, Sep 3, 2021 at 4:06 PM Amit Kapila  wrote:
>
> On Wed, Sep 1, 2021 at 12:05 PM Amit Kapila  wrote:
> >
> > On Wed, Sep 1, 2021 at 8:52 AM Greg Nancarrow  wrote:
> > >
> >
> > > I'd expect a lot of users to naturally think that "ALTER PUBLICATION
> > > pub1 DROP ALL TABLES IN SCHEMA sc1;" would drop from the publication
> > > all tables that are in schema "sc1", which is not what it is currently
> > > doing.
> > > Since the syntax was changed to specifically refer to FOR ALL TABLES
> > > IN SCHEMA rather than FOR SCHEMA, then now it's clear we're referring
> > > to tables only, when specifying "... FOR ALL TABLES in sc1, TABLE
> > > sc1.test", so IMHO it's reasonable to remove duplicates here, rather
> > > than treating these as somehow separate ways of referencing the same
> > > table.
> > >
> >
> > I see your point and if we decide to go this path then it is better to
> > give an error than silently removing duplicates.
> >
>
> Today, I thought about this point again and it seems better to give an error
> in this case and let the user take action, rather than silently removing such
> tables, to avoid any confusion.

I have changed this to throw an error. This is handled as part of the v25
patch attached at [1].
[1] - 
https://www.postgresql.org/message-id/CALDaNm2SytXy2TDnzzYkXWKgNp74ssPBXrkMXEyac1qVYSRkbw%40mail.gmail.com

Regards,
Vignesh




Re: Added schema level support for publication.

2021-09-07 Thread vignesh C
On Fri, Sep 3, 2021 at 4:49 PM Amit Kapila  wrote:
>
> On Thu, Sep 2, 2021 at 5:12 PM Amit Kapila  wrote:
> >
> > On Thu, Sep 2, 2021 at 11:58 AM vignesh C  wrote:
> > >
> >
> > Below are few comments on v23. If you have already addressed anything
> > in v24, then ignore it.
> >
>
> More Review comments (on latest version v24):
> ===
> 1.
> +Oidpsnspcid BKI_LOOKUP(pg_class);/* Oid of the schema */
> +} FormData_pg_publication_schema;
>
> Why in the above structure you have used pg_class instead of pg_namespace?

It should be pg_namespace; I have changed it.

> 2. GetSchemaPublicationRelations() uses two different ways to skip
> non-publishable relations in two while loops. Both are correct but I
> think it would be easier to follow if they use the same way and in
> this case I would prefer a check like if (is_publishable_class(relid,
> relForm)). The comments atop function could also be modified to :"Get
> the list of publishable relation oids for a specified schema.". I
> think it is better to write some comments on why you need to scan and
> loop twice.

Modified

> 3. The other point about GetSchemaPublicationRelations() is that I am
> afraid that it will collect duplicate relation oids in the final list
> when the partitioned table and child tables are in the same schema.

Modified it to prepare a list without duplicate relation ids.
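
As a sketch, the de-duplication can be as simple as collecting the OIDs with
list_append_unique_oid(); the actual patch may do it differently:

```c
/* Avoid adding a relation twice when a partitioned table and its leaf
 * partitions live in the same schema. */
result = list_append_unique_oid(result, relid);
```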

> 4. In GetRelationPublicationActions(), the handling related to
> partitions seems to be missing for the schema. It won't be able to
> take care of child tables when they are in a different schema than the
> parent table.

Modified

> 5.
> If I modify the search path to remove public schema then I get the
> below error message:
> postgres=# Create publication mypub for all tables in schema current_schema;
> ERROR:  no schema has been selected
>
> I think this message is not very clear. How about changing to
> something like "current_schema doesn't contain any valid schema"? This
> message is used in more than one place, so let's keep it the same at
> all the places if you agree to change it.

I would prefer to keep the existing message, as we use it similarly in a
few other places. Thoughts?
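
For context, the case being discussed looks roughly like this (the SET
search_path step is an assumption; the error text is the existing message
referred to above):

```sql
SET search_path TO '';   -- no creatable schema left in the search path
CREATE PUBLICATION mypub FOR ALL TABLES IN SCHEMA CURRENT_SCHEMA;
-- ERROR:  no schema has been selected
```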

> 6. How about naming ConvertPubObjSpecListToOidList() as
> ObjectsInPublicationToOids()? I see somewhat similarly named functions
> in the code like objectsInSchemaToOids, objectNamesToOids.

Modified

> 7.
> + /*
> + * Schema lock is held until the publication is created to prevent
> + * concurrent schema deletion. No need to unlock the schemas, the
> + * locks will be released automatically at the end of create
> + * publication command.
> + */
>
> In this comment no need to say create publication command as that is
> implicit, we can just write " at the end of command".

Modified

> 8. Can you please extract changes like tab-completion, dump/restore in
> separate patches? This can help to review the core (backend) part of
> the patch in more detail.

Modified

This is handled as part of the v25 patch attached at [1]
[1] - 
https://www.postgresql.org/message-id/CALDaNm2SytXy2TDnzzYkXWKgNp74ssPBXrkMXEyac1qVYSRkbw%40mail.gmail.com

Regards,
Vignesh




RE: suboverflowed subtransactions concurrency performance optimize

2021-09-07 Thread Pengchengliu
Hi Andrey,

> I think if we really want to fix exclusive SubtransSLRULock I think best 
> option would be to split SLRU control lock into array of locks
I agree with you. If we can resolve the performance issue with this approach,
it should be a good solution.

> one for each bank (in 
> v17-0002-Divide-SLRU-buffers-into-n-associative-banks.patch)
I have tested with this patch, with NUM_SUBTRANS_BUFFERS modified to 128.
With 500 concurrent sessions it indeed no longer gets stuck, but the
performance is very poor: a sequential scan of the table takes more than one
minute. I think that is unacceptable in a production environment.

postgres=# select count(*) from contend ;
 count 
---
 10127
(1 row)

Time: 86011.593 ms (01:26.012)
postgres=# select count(*) from contend ;
 count 
---
 10254
(1 row)
Time: 79399.949 ms (01:19.400)
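
For readers following along, the per-bank idea being tested here replaces the
single SubtransSLRULock with one lock per bank of SLRU buffers; a minimal
sketch, not the actual v17 patch (the bank count and naming are assumptions):

```c
#include "storage/lwlock.h"

#define NUM_SUBTRANS_BANKS	16

static LWLock *SubtransBankLocks[NUM_SUBTRANS_BANKS];

/*
 * Each page maps to exactly one bank, so lookups on different banks no
 * longer contend on a single control lock.
 */
static inline LWLock *
SubtransBankLock(int pageno)
{
    return SubtransBankLocks[pageno % NUM_SUBTRANS_BANKS];
}
```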


With my local subtrans optimization approach, in the same environment with the
same test script and 500 concurrent sessions, a sequential scan takes less
than 10 seconds.

postgres=# select count(*) from contend ;
 count 
---
 10508
(1 row)

Time: 7104.283 ms (00:07.104)

postgres=# select count(*) from contend ;
count 
---
 13175
(1 row)

Time: 6602.635 ms (00:06.603)
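
A minimal sketch of the kind of per-backend cache described here (names, size,
and the caching policy are assumptions; this is not the actual patch):

```c
#include "access/transam.h"

#define LOCAL_SUBTRANS_CACHE_SIZE	1024

typedef struct SubtransCacheEntry
{
    TransactionId xid;
    TransactionId parent;
} SubtransCacheEntry;

static SubtransCacheEntry local_subtrans_cache[LOCAL_SUBTRANS_CACHE_SIZE];

/* Look up a cached parent; on a miss the caller falls back to SubTransGetParent(). */
static bool
LocalSubTransGetParent(TransactionId xid, TransactionId *parent)
{
    SubtransCacheEntry *e = &local_subtrans_cache[xid % LOCAL_SUBTRANS_CACHE_SIZE];

    if (TransactionIdIsValid(e->xid) && TransactionIdEquals(e->xid, xid))
    {
        *parent = e->parent;
        return true;            /* hit: no SubtransSLRULock traffic at all */
    }
    return false;
}

/* Remember a parent fetched from pg_subtrans for later lookups. */
static void
LocalSubTransCacheParent(TransactionId xid, TransactionId parent)
{
    SubtransCacheEntry *e = &local_subtrans_cache[xid % LOCAL_SUBTRANS_CACHE_SIZE];

    e->xid = xid;
    e->parent = parent;
}
```
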
Thanks
Pengcheng

-Original Message-
From: Andrey Borodin  
Sent: 2021年9月3日 14:51
To: Pengchengliu 
Cc: pgsql-hack...@postgresql.org
Subject: Re: suboverflowed subtransactions concurrency performance optimize

Sorry, for some reason Mail.app converted the message to HTML and the mailing
list mangled that HTML into a mess. I'm resending the previous message as
plain text. Sorry for the noise.

> On 31 Aug 2021, at 11:43, Pengchengliu  wrote:
> 
> Hi Andrey,
>  Thanks a lot for your replay and reference information.
> 
> The default NUM_SUBTRANS_BUFFERS is 32. In my implementation,
> local_cache_subtrans_pages can be adjusted dynamically.
> If we configure local_cache_subtrans_pages as 64, every backend uses only an
> extra 64*8192 = 512KB of memory.
> So the local cache is similar to a first-level cache, and the subtrans SLRU is
> the second-level cache.
> And I think the extra memory is well worth it. It really resolves the massive
> subtrans stall issue which I mentioned in a previous email.
> 
> I have viewed the patch in [0] before. Adding GUC configuration parameters
> for the SLRU buffers is very nice.
> I think that for subtrans it is not enough, though. In
> SubTransGetTopmostTransaction we should take SubtransSLRULock once and then
> call SubTransGetParent in the loop, to avoid acquiring/releasing
> SubtransSLRULock on every SubTransGetTopmostTransaction -> SubTransGetParent
> iteration.
> After I applied this patch with my optimization of
> SubTransGetTopmostTransaction, my test case still got stuck.

SubTransGetParent() acquires only a shared lock on SubtransSLRULock. The problem
can arise only when someone reads a page from disk. But if you have a big enough
cache, this will never happen. And this cache will be much smaller than
512KB*max_connections.

I think if we really want to fix the exclusive SubtransSLRULock, the best option
would be to split the SLRU control lock into an array of locks - one for each
bank (as in v17-0002-Divide-SLRU-buffers-into-n-associative-banks.patch). With
this approach we would have to rename s/bank/partition/g for consistency with
lock and buffer partitions. I really liked having my own banks, but consistency
is worth it anyway.

Thanks!

Best regards, Andrey Borodin.




Re: Added schema level support for publication.

2021-09-07 Thread vignesh C
On Mon, Sep 6, 2021 at 6:56 AM houzj.f...@fujitsu.com
 wrote:
>
> From Thur, Sep 2, 2021 7:42 PM Amit Kapila  wrote:
> > On Thu, Sep 2, 2021 at 11:58 AM vignesh C  wrote:
> > >
> >
> > Below are few comments on v23. If you have already addressed anything in 
> > v24,
> > then ignore it.
> >
> > 1. The commit message says: A new system table "pg_publication_schema"
> > has been added, to maintain the schemas that the user wants to publish
> > through the publication.". The alternative name for this system table could 
> > be
> > "pg_publication_namespace". The reason why this alternative comes to my
> > mind is that the current system table to store schema information is named
> > pg_namespace. So shouldn't we be consistent here?
> > What do others think about this?
>
> Since the new system table refers to pg_namespace, personally I am +1 for
> "pg_publication_namespace".

I have changed it as part of the v25 patch attached at [1]
[1] - 
https://www.postgresql.org/message-id/CALDaNm2SytXy2TDnzzYkXWKgNp74ssPBXrkMXEyac1qVYSRkbw%40mail.gmail.com

Regards,
Vignesh




Re: Added schema level support for publication.

2021-09-07 Thread vignesh C
On Fri, Sep 3, 2021 at 3:12 PM Amit Kapila  wrote:
>
> On Mon, Aug 30, 2021 at 8:56 PM vignesh C  wrote:
> >
> > On Mon, Aug 30, 2021 at 9:10 AM houzj.f...@fujitsu.com
> >  wrote:
> > >
> >
> > > 5)
> > > +   if (list_length(pubobj->name) == 1 &&
> > > +       (strcmp(relname, "CURRENT_SCHEMA") == 0))
> > > +       ereport(ERROR,
> > > +               errcode(ERRCODE_SYNTAX_ERROR),
> > > +               errmsg("invalid relation name at or near"),
> > > +               parser_errposition(pstate, pubobj->location));
> > >
> > > Maybe we don't need this check, because it will report an error in
> > > OpenTableList() anyway, "relation "CURRENT_SCHEMA" does not exist" , and 
> > > that
> > > message seems readable to me.
> >
> > Allowing CURRENT_SCHEMA is required to support the current schema for
> > schema publications. Currently I allow this syntax during parsing, and this
> > error is thrown for relations later; this keeps the error similar to what
> > it was before this feature. I felt we can keep it like this to maintain
> > the same error. Thoughts?
> >
>
> I find this check quite ad-hoc in the code and I am not sure if we
> need to be consistent for the exact message in this case. So, I think
> it is better to remove it.

Modified

> > > About  0003
> > > 7)
> > > The v22-0003 seems simple and can remove lots of code in patch v22-0001, 
> > > so
> > > maybe we can merge 0001 and 0003 into one patch ?
> >
> > I agree that the code becomes simpler; it removes a lot of code. I had
> > kept it separate since the testing effort might be higher, and I was also
> > waiting to see whether anyone objected to that syntax. I will wait for a
> > few more reviews and merge it into 0001 if there are no objections.
> >
>
> +1 to merge the patch as suggested by Hou-San.

Modified

This is handled as part of the v25 patch attached at [1]

[1] - 
https://www.postgresql.org/message-id/CALDaNm2SytXy2TDnzzYkXWKgNp74ssPBXrkMXEyac1qVYSRkbw%40mail.gmail.com

Regards,
Vignesh



