Re: Different compression methods for FPI
On 15.06.21 07:37, Michael Paquier wrote:
>> Actually, I was just thinking that default yes/no/on/off stuff maybe
>> should be defined to mean "lz4" rather than meaning pglz for
>> "backwards compatibility".
>>
>> +1
>
> I am not sure that we have any reasons to be that aggressive about
> this one either, and this would mean that wal_compression=on implies a
> different method depending on the build options.  I would just stick
> with the past, careful practice that we have to use a default
> backward-compatible value as default, while being able to use a new
> option.

If we think this new thing is strictly better than the old thing, then
why not make it the default.  What would be the gain from sticking to
the old default?

The point that the default would depend on build options is a valid
one.  I'm not sure whether it's important enough by itself.
Re: Different compression methods for FPI
On Tue, Jun 15, 2021 at 08:08:54AM +0300, Heikki Linnakangas wrote:
> On 15/06/2021 06:42, Justin Pryzby wrote:
>> Why ?  This is WAL, not table data.  WAL depends on the major version,
>> so I think wal_compression could provide a different set of
>> compression methods at every major release?

That may finish by being annoying to the user, but perhaps you are
right that we could have more flexibility here.  That does not change
the fact that we'd better choose something wisely, able to stick around
for a couple of years at least, rather than revisiting this choice
every year.

>> Actually, I was just thinking that default yes/no/on/off stuff maybe
>> should be defined to mean "lz4" rather than meaning pglz for
>> "backwards compatibility".
>
> +1

I am not sure that we have any reasons to be that aggressive about this
one either, and this would mean that wal_compression=on implies a
different method depending on the build options.  I would just stick
with the past, careful practice that we have to use a default
backward-compatible value as default, while being able to use a new
option.
--
Michael
Re: Use singular number when appropriate
On Tue, Jun 15, 2021 at 04:59:24AM +, David Fetter wrote:
> Hi,
>
> I thought about using the dual, but wasn't sure how many languages
> support it.

I don't think that you can assume that appending something will work in
all languages.  Note that it also doesn't always work in English (e.g.
this/these), as seen in this inconsistent change:

-_(" %d of %d tests failed, %d of these failures ignored. "),
+_(" %d of %d test%s failed, %d of these failures ignored. "),
Re: Error on pgbench logs
On Sun, Jun 13, 2021 at 03:07:51AM +0900, Yugo NAGATA wrote:
> It will not work if the transaction is skipped, in which case latency
> is 0.0.  It would work if we check also "skipped" as below.
>
> +  if (!logged && !skipped && latency == 0.0)
>
> However, it still might not work if the latency is so small that we
> could observe latency == 0.0.  I observed this when I used a script
> that contained only a meta command.  This is not usual and would be a
> corner case, though.

Hmm.  I am not sure to completely follow the idea here.  It would be
good to make this code less confusing than it is now.

> /* log aggregated but not yet reported transactions */
> doLog(thread, state, , false, 0, 0);
> + logAgg(thread->logfile, );
>
> I think we don't have to call doLog() before logAgg().  If we call
> doLog(), we will count an extra transaction that is not actually
> processed because accumStats() is called in this.

Yes, calling both is weird.  Is using logAgg() directly in this context
actually right when it comes to sample_rate?  We may not log anything
on HEAD if sample_rate is enabled, but we would finish by logging
something all the time with this patch.  If I am following this code
correctly, we don't care about accumStats() in the code path of a
thread we are done with, right?
--
Michael
Re: Use singular number when appropriate
On 15/06/2021 07:59, David Fetter wrote:
> Hi,
>
> I thought about using the dual, but wasn't sure how many languages
> support it.
>
> 	if (fail_count == 0 && fail_ignore_count == 0)
> 		snprintf(buf, sizeof(buf),
> 				 _(" %s %d test%s passed. "),
> 				 success_count == 1 ? "The" : "All",
> 				 success_count,
> 				 success_count == 1 ? "" : "s");

Constructing sentences like that is bad practice for translations.  See
https://www.gnu.org/software/gettext/manual/html_node/Plural-forms.html.

- Heikki

Re: Different compression methods for FPI
On 15/06/2021 06:42, Justin Pryzby wrote:
> On Tue, Jun 15, 2021 at 11:39:24AM +0900, Michael Paquier wrote:
>> On Mon, Jun 14, 2021 at 08:42:08PM -0500, Justin Pryzby wrote:
>>> It's USERSET following your own suggestion (which is a good suggestion):
>>>
>>>> +	{"wal_compression_method", PGC_SIGHUP, WAL_SETTINGS,
>>>> +		gettext_noop("Set the method used to compress full page images in the WAL."),
>>>> +		NULL
>>>> +	},
>>>> +	&wal_compression_method,
>>>> +	WAL_COMPRESSION_PGLZ, wal_compression_options,
>>>> +	NULL, NULL, NULL
>>>>
>>>> Any reason to not make that user-settable?  If you merge that with
>>>> wal_compression, that's not an issue.
>>
>> Hmm, yeah.  This can be read as using PGC_USERSET.  With the second
>> part of my sentence, I think that I imply to use PGC_SUSET and be
>> consistent with wal_compression, but I don't recall my mood from one
>> month ago :)  Sorry for the confusion.
>
> Hold on - we're all confused (and I'm to blame).  The patch is
> changing the existing wal_compression GUC, rather than adding
> wal_compression_method.  It's still SUSET, but in earlier messages, I
> called it USERSET, twice.
>
> See prior discussion on the security aspect:
> https://www.postgresql.org/message-id/55269915.1000309%40iki.fi.
> Adding different compression algorithms doesn't change anything from a
> security point of view AFAICS.
>
> The goal of the patch is to give options, and the overhead of adding
> both zlib and lz4 is low.  zlib gives good compression at some CPU
> cost and may be preferable for (some) DWs, and lz4 is almost certainly
> better (than pglz) for OLTPs.
>
>> Anything will be better than pglz.  I am rather confident in that.
>> What I am wondering is if we need to eat more bits than necessary for
>> the WAL record format, because we will end up supporting it until the
>> end of times.
>
> Why ?  This is WAL, not table data.  WAL depends on the major version,
> so I think wal_compression could provide a different set of
> compression methods at every major release?
>
> Actually, I was just thinking that default yes/no/on/off stuff maybe
> should be defined to mean "lz4" rather than meaning pglz for
> "backwards compatibility".

+1

- Heikki
Use singular number when appropriate
Hi,

I thought about using the dual, but wasn't sure how many languages
support it.

Best,
David.
--
David Fetter http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

From 1fd595576c5972d8604adf77d1959008daebacb0 Mon Sep 17 00:00:00 2001
From: David Fetter
Date: Mon, 14 Jun 2021 21:51:20 -0700
Subject: [PATCH v1] Singular number when appropriate

---
 src/test/regress/pg_regress.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git src/test/regress/pg_regress.c src/test/regress/pg_regress.c
index 05296f7ee1..d6f9e829e8 100644
--- src/test/regress/pg_regress.c
+++ src/test/regress/pg_regress.c
@@ -2641,25 +2641,31 @@ regression_main(int argc, char *argv[],
 	 */
 	if (fail_count == 0 && fail_ignore_count == 0)
 		snprintf(buf, sizeof(buf),
-				 _(" All %d tests passed. "),
-				 success_count);
+				 _(" %s %d test%s passed. "),
+				 success_count == 1 ? "The" : "All",
+				 success_count,
+				 success_count == 1 ? "" : "s");
 	else if (fail_count == 0)	/* fail_count=0, fail_ignore_count>0 */
 		snprintf(buf, sizeof(buf),
-				 _(" %d of %d tests passed, %d failed test(s) ignored. "),
+				 _(" %d of %d test%s passed, %d failed test%s ignored. "),
 				 success_count,
 				 success_count + fail_ignore_count,
-				 fail_ignore_count);
+				 success_count == 1 ? "" : "s",
+				 fail_ignore_count,
+				 fail_ignore_count == 1 ? "" : "s");
 	else if (fail_ignore_count == 0)	/* fail_count>0 && fail_ignore_count=0 */
 		snprintf(buf, sizeof(buf),
-				 _(" %d of %d tests failed. "),
+				 _(" %d of %d test%s failed. "),
 				 fail_count,
-				 success_count + fail_count);
+				 success_count + fail_count,
+				 fail_count == 1 ? "" : "s");
 	else	/* fail_count>0 && fail_ignore_count>0 */
 		snprintf(buf, sizeof(buf),
-				 _(" %d of %d tests failed, %d of these failures ignored. "),
+				 _(" %d of %d test%s failed, %d of these failures ignored. "),
 				 fail_count + fail_ignore_count,
 				 success_count + fail_count + fail_ignore_count,
+				 fail_count + fail_ignore_count == 1 ? "" : "s",
 				 fail_ignore_count);

 	putchar('\n');
--
2.31.1
Re: locking [user] catalog tables vs 2pc vs logical rep
On Mon, Jun 14, 2021 at 5:33 PM osumi.takami...@fujitsu.com wrote:
>
> On Friday, June 11, 2021 2:13 PM vignesh C wrote:
> > Thanks for the updated patch:
> > Few comments:
> > 1) We have used Reordering and Clustering for the same command, we
> > could rephrase similarly in both places.
> >
> > +
> > +Reordering pg_class by CLUSTER
> > +command in a transaction.
> > +
> >
> > +
> > +Clustering pg_trigger and decoding PREPARE
> > +TRANSACTION, if any published table have a trigger and any
> > +operations that will be decoded are conducted.
> > +
> > +
> >
> > 2) Here user_catalog_table should be user catalog table
> > +
> > +Executing TRUNCATE on user_catalog_table
> > in a transaction.
> > +
>
> Thanks for your review.
>
> Attached the patch-set that addressed those two comments.
> I also fixed the commit message a bit in the 2PC specific patch to
> HEAD.  No other changes.
>
> Please check.

Thanks for the updated patches, the patch applies cleanly in all
branches.  Please add a commitfest entry for this, so that we don't
miss it.

Regards,
Vignesh
Re: RFC: Logging plan of the running query
On Mon, Jun 14, 2021 at 5:48 PM torikoshia wrote:
> > 1) We could just say "Requests to log query plan of the presently
> > running query of a given backend along with an untruncated query
> > string in the server logs."  Instead of
> > +They will be logged at LOG message level
> > and
> > +will appear in the server log based on the log
> > +configuration set (See <xref linkend="runtime-config-logging"/>
>
> Actually this explanation is almost the same as that of
> pg_log_backend_memory_contexts().
> Do you think we should change both of them?
> I think it may be too detailed but accurate.

I withdraw my comment.  We can keep the explanation similar to
pg_log_backend_memory_contexts as it was agreed upon and committed
text.  If the wordings are similar, then it will be easier for users to
understand the documentation.

> > 5) Instead of just showing the true return value of the function
> > pg_log_current_plan in the sql test, which just shows that the
> > signal is sent, but it doesn't mean that the backend has processed
> > that signal and logged the plan.  I think we can add the test using
> > the TAP framework.
>
> I once made a tap test for pg_log_backend_memory_contexts(), but it
> seemed an overkill and we agreed that it was appropriate just ensuring
> the function working, as in the discussion below.
>
> https://www.postgresql.org/message-id/bbecd722d4f8e261b350186ac4bf68a7%40oss.nttdata.com

Okay.  I withdraw my comment.

> > 6) Do we unnecessarily need to signal the processes such as
> > autovacuum launcher/workers, logical replication launcher/workers,
> > wal senders, wal receivers and so on, only to emit a "PID %d is not
> > executing queries now" message?  Moreover, those processes will be
> > waiting in loops for timeouts to occur, then as soon as they wake up
> > do they need to process this extra uninformative signal?
> > We could choose to not signal those processes at all, which might or
> > might not be possible.  Otherwise, we could just emit messages like
> > " process cannot run a query" in ProcessInterrupts.
>
> Agreed.
>
> However it needs to distinguish backends which can execute queries
> from other processes such as the autovacuum launcher; I haven't come
> up with easy ways to do so.  I'm going to think about it.

I'm not sure if there is any information in the shared memory
accessible to all the backends/sessions that can say a PID is an
autovacuum launcher/worker, logical replication launcher/worker or any
other background or parallel worker.  If we were to invent a new
mechanism just for addressing the above comment, I would rather choose
to not do that, as it seems like an overkill.  We can leave it up to
the user whether or not to unnecessarily signal those processes, which
are bound to print the "PID XXX is not executing queries now" message.

> > 11) What happens if pg_log_current_plan is called for a parallel
> > worker?
>
> AFAIU a parallel worker doesn't have an ActivePortal, so it would
> always emit the message 'PID %d is not executing queries now'.
> As in 6), it would be better to distinguish the parallel worker from a
> normal backend.

As I said above, I think it will be a bit tough to do.  If done, it
seems like an overkill.

With Regards,
Bharath Rupireddy.
Re: Different compression methods for FPI
On Tue, Jun 15, 2021 at 11:39:24AM +0900, Michael Paquier wrote:
> On Mon, Jun 14, 2021 at 08:42:08PM -0500, Justin Pryzby wrote:
>> It's USERSET following your own suggestion (which is a good suggestion):
>>
>>> +	{"wal_compression_method", PGC_SIGHUP, WAL_SETTINGS,
>>> +		gettext_noop("Set the method used to compress full page images in the WAL."),
>>> +		NULL
>>> +	},
>>> +	&wal_compression_method,
>>> +	WAL_COMPRESSION_PGLZ, wal_compression_options,
>>> +	NULL, NULL, NULL
>>>
>>> Any reason to not make that user-settable?  If you merge that with
>>> wal_compression, that's not an issue.
>
> Hmm, yeah.  This can be read as using PGC_USERSET.  With the second
> part of my sentence, I think that I imply to use PGC_SUSET and be
> consistent with wal_compression, but I don't recall my mood from one
> month ago :)  Sorry for the confusion.

Hold on - we're all confused (and I'm to blame).  The patch is changing
the existing wal_compression GUC, rather than adding
wal_compression_method.  It's still SUSET, but in earlier messages, I
called it USERSET, twice.

> You cannot do cross-checks for GUCs in their assign hooks or even rely
> on the order of those parameters, but you can do that in some backend
> code paths.  A recent discussion on the matter is for example what led
> to 79dfa8a for the GUCs controlling the min/max SSL protocols allowed.

Thank you - this is what I was remembering.

>> The goal of the patch is to give options, and the overhead of adding
>> both zlib and lz4 is low.  zlib gives good compression at some CPU
>> cost and may be preferable for (some) DWs, and lz4 is almost
>> certainly better (than pglz) for OLTPs.
>
> Anything will be better than pglz.  I am rather confident in that.
>
> What I am wondering is if we need to eat more bits than necessary for
> the WAL record format, because we will end up supporting it until the
> end of times.

Why ?  This is WAL, not table data.  WAL depends on the major version,
so I think wal_compression could provide a different set of compression
methods at every major release?

Actually, I was just thinking that default yes/no/on/off stuff maybe
should be defined to mean "lz4" rather than meaning pglz for "backwards
compatibility".

> hear here, there are many cases that we may care about depending on
> how much CPU one is ready to pay in order to get more compression,
> knowing that there are no magic solutions for something that's cheap
> in CPU with a very good compression ratio or we could just go with
> that.  So it seems to me that there is still an argument for adding
> only one new compression method with a good range of levels, able to
> support the range of cases we'd care about:
> - High compression ratio but high CPU cost.
> - Low compression ratio but low CPU cost.

I think zlib is too expensive and lz4 doesn't get enough compression,
so neither supports both cases.  In a sample of 1, zlib-1 is ~35%
slower than lz4 and writes half as much.  I think zstd could support
both cases; however, I still see it as this patch's job to provide
options, rather than to definitively conclude which compression
algorithm is going to work best for everyone's data and application.

--
Justin
Re: PG 14 release notes, first draft
On Tue, Jun 15, 2021 at 11:49:21AM +0900, Masahiko Sawada wrote:
> On Tue, Jun 15, 2021 at 10:36 AM Bruce Momjian wrote:
>> OK, but I need more information on how users will see a difference
>> based on this commit:

+1.  That would be good to have in the release notes.

> I think that since with this commit the server on Windows can handle a
> file over 4GB, COPY FROM loading data from an over-4GB file and
> pg_dump dumping a large table work now.

Segment files or WAL files larger than 4GB also gain from that.
Anything for which we may finish by doing a stat() benefits from this
change if running on Windows.  For pg_dump, a workaround in PG <= 13
was to use --no-sync, as the stat() failure came from files with a size
larger than 4GB.  That's rather sad, as that means sacrificing
durability for more usability :(
--
Michael
Improving the isolationtester: fewer failures, less delay
This is a followup to the conversation at [1], in which we speculated
about constraining the isolationtester's behavior by annotating the
specfiles, in order to eliminate common buildfarm failures such as [2]
and reduce the need to use long delays to stabilize the test results.

I've spent a couple days hacking on this idea, and I think it has
worked out really well.  On my machine, the time needed for "make
installcheck" in src/test/isolation drops from ~93 seconds to ~26
seconds, as a result of removing all the multiple-second delays we used
before.  Also, while I'm not fool enough to claim that this will reduce
the rate of bogus failures to zero, I do think it addresses all the
repeating failures we've seen lately.

In the credit-where-credit-is-due department, this owes some
inspiration to the patch Asim Praveen offered at [3], though this takes
the idea a good bit further.

This is still WIP to some extent, as I've not spent much time looking
at specfiles other than the ones with big delays; there may be
additional improvements possible in some places.  Also, I've not
worried about whether the tests pass in serializable mode, since we
have problems there already [4].  But this seemed like a good point at
which to solicit feedback and see what the cfbot thinks of it.

regards, tom lane

[1] https://www.postgresql.org/message-id/flat/2527507.1598237598%40sss.pgh.pa.us
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole=2021-06-13%2016%3A31%3A57
[3] https://www.postgresql.org/message-id/F8DC434A-9141-451C-857F-148CCA1D42AD%40vmware.com
[4] https://www.postgresql.org/message-id/324309.1623722988%40sss.pgh.pa.us

diff --git a/src/test/isolation/README b/src/test/isolation/README
index 6ae7152325..35b40f8345 100644
--- a/src/test/isolation/README
+++ b/src/test/isolation/README
@@ -57,11 +57,16 @@ Test specification
 ==

 Each isolation test is defined by a specification file, stored in the specs
-subdirectory.  A test specification consists of four parts, in this order:
+subdirectory.  A test specification defines some SQL "steps", groups them
+into "sessions" (where all the steps of one session will be run in the
+same backend), and specifies the "permutations" or orderings of the steps
+that are to be run.
+
+A test specification consists of four parts, in this order:

 setup { }

-  The given SQL block is executed once, in one session only, before running
+  The given SQL block is executed once (per permutation) before running
   the test.  Create any test tables or other required objects here.  This
   part is optional.  Multiple setup blocks are allowed if needed; each is
   run separately, in the given order.  (The reason for allowing multiple
@@ -81,8 +86,8 @@ session ""
   session is executed in its own connection.  A session part consists
   of three parts: setup, teardown and one or more "steps".  The per-session
   setup and teardown parts have the same syntax as the per-test setup and
-  teardown described above, but they are executed in each session.  The
-  setup part typically contains a "BEGIN" command to begin a transaction.
+  teardown described above, but they are executed in each session.  The setup
+  part might, for example, contain a "BEGIN" command to begin a transaction.

 Each step has the syntax

@@ -101,7 +106,8 @@ permutation "" ...
   order).  Note that the list of steps in a manually specified
   "permutation" line doesn't actually have to be a permutation of the
   available steps; it could for instance repeat some steps more than once,
-  or leave others out.
+  or leave others out.  Also, each step name can be annotated with some
+  parenthesized markers, which are described below.

 Lines beginning with a # are considered comments.

@@ -110,7 +116,8 @@ specified in the spec file, or automatically generated), the isolation
 tester runs the main setup part, then per-session setup parts, then
 the selected session steps, then per-session teardown, then the main
 teardown script.  Each selected step is sent to the connection associated
-with its session.
+with its session.  The main setup and teardown scripts are run in a
+separate "control" session.


 Support for blocking commands

@@ -129,3 +136,64 @@ tests take a very long time to run, and they serve no useful testing purpose.
 Note that isolationtester recognizes that a command has blocked by looking
 to see if it is shown as waiting in the pg_locks view; therefore, only
 blocks on heavyweight locks will be detected.
+
+
+Dealing with race conditions
+
+
+In some cases, the isolationtester's output for a test script may vary
+due to timing issues.  One way to deal with that is to create variant
+expected-files, which follow the usual PG convention that variants for
+foo.spec are named foo_1.out, foo_2.out, etc.  However, this method is
+discouraged since the extra files are a nuisance for maintenance.
+Instead, it's usually
Re: Isolation tests vs. SERIALIZABLE isolation level
On Tue, Jun 15, 2021 at 2:09 PM Tom Lane wrote:
> * Do we still care about that policy?
> * If so, who's going to fix the above-listed problems?
> * Should we try to get some routine testing of this case in place?

I wondered the same in commit 37929599 (the same problem for
src/test/regress, which now passes, but only in master, not the back
branches).  I doubt it will find real bugs very often, and I doubt many
people would enjoy the slowdown if it were always on, but it might make
sense to have something like PG_TEST_EXTRA that can be used to run the
tests at all three levels, and then turn that on in a few strategic
places like CI and a BF animal or two.
Re: PG 14 release notes, first draft
On Tue, Jun 15, 2021 at 10:36 AM Bruce Momjian wrote:
>
> On Tue, Jun 15, 2021 at 10:06:49AM +0900, Masahiko Sawada wrote:
> > On Mon, May 10, 2021 at 3:03 PM Bruce Momjian wrote:
> > >
> > > I have committed the first draft of the PG 14 release notes.  You
> > > can see the most current build of them here:
> > >
> > >     https://momjian.us/pgsql_docs/release-14.html
> >
> > It might have already been discussed but I think we should mention
> > commit bed90759f in the release note.  The commit seems like a bug
> > fix but it is not back-patched to the older versions, at least for
> > now.  There is a discussion[1] that we will revisit that a couple of
> > months after 14 is released, so as there is some feedback from the
> > field with this change.
>
> OK, but I need more information on how users will see a difference
> based on this commit:

I think that since with this commit the server on Windows can handle a
file over 4GB, COPY FROM loading data from an over-4GB file and pg_dump
dumping a large table work now.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: Duplicate history file?
On Tue, Jun 15, 2021 at 10:20:37AM +0900, Kyotaro Horiguchi wrote:
>
> Actually there's large room for losing data with cp.  Yes, we would
> need additional redundancy of storage and periodical integrity
> inspection of the storage and archives, and maybe copies at other
> sites on the other side of the Earth.  But they are too much for some
> kinds of users.  They have the right and responsibility to decide how
> durable/reliable their archive needs to be.  (Putting aside some
> hardware/geological requirements :p)

Note that most of those considerations are orthogonal to what a proper
archive_command should be responsible for.  Yes, users are responsible
for deciding whether they want a valid and durable backup or not, but
we should assume a sensible default behavior, which is a valid and
durable archive_command.  We don't document a default fsync = off with
a later recommendation explaining why you shouldn't do that, and I
think it should be the same for the archive_command.

The problem with the current documentation is that many users will just
blindly copy/paste whatever is in the documentation without reading any
further.  As an example, a few hours ago some French user on the French
bulletin board said that he fixed his "postmaster.pid already exists"
error with a pg_resetxlog -f, referring to some article explaining how
to start postgres in case of "PANIC: could not locate a valid
checkpoint record".  Arguably that article didn't bother to document
what the implications of executing pg_resetxlog are, but given that the
user's original problem had literally nothing to do with what was
documented, I really doubt that it would have changed anything.

> If we mandate some characteristics on the archive_command, we should
> take them into core.

I agree.

> I remember I saw some discussions on archive command on this line but
> I think it had ended at the point something like that "we cannot
> design one-fits-all interface conforming to the requirements" or
> something (sorry, I don't remember the details..)

I also agree, but this problem is solved by making archive_command
customisable.  Providing something that can reliably work in some
general and limited cases would be a huge improvement.

> Well, barman used rsync/ssh in its documentation in the past and now
> looks like providing barman-wal-archive, so it seems that you're right
> on that point.  So, do we recommend them in our documentation?  (I'm
> not sure they actually conform to the requirements, though..)

We could maybe bless some third party backup solutions, but this will
probably lead to a lot more discussions, so it's better to discuss that
in a different thread.  Note that I don't have a deep knowledge of any
of those tools so I don't have an opinion.

> If we write an example with a pseudo tool name, requiring some
> characteristics on the tool, then not telling about the minimal tools,
> I think that it is equivalent to saying that we are inhibiting certain
> users from using archive_command even if they really don't want such a
> level of durability.

I have already seen customers complaining about losing backups because
their archive_command didn't ensure that the copy was durable.  Some
users may not care about losing their backups in such a case, but I
really think that the majority of users expect a backup to be valid,
durable and everything, without even thinking that it may not be the
case.  It should be the default behavior, not optional.
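[Editor's note: to make the durability point concrete, here is a minimal sketch of an archiving routine that only reports success once the copy is on stable storage. It is an illustration of the properties discussed above, not a blessed recommendation; the function name and paths are invented, and `sync FILE` assumes GNU coreutils >= 8.24.]

```shell
# Copy a WAL segment into an archive directory durably: never overwrite
# an existing archive file, copy to a temporary name, flush the data to
# disk, rename into place, then flush the directory so the rename itself
# survives a crash.  Returns non-zero on any failure, which is what
# tells PostgreSQL the segment must be retried.
archive_wal() {
    src="$1"; destdir="$2"
    dest="$destdir/$(basename "$src")"

    [ ! -e "$dest" ] || return 1      # never silently overwrite an archive
    cp "$src" "$dest.tmp" || return 1
    sync "$dest.tmp" || return 1      # flush file data (coreutils sync FILE)
    mv "$dest.tmp" "$dest" || return 1
    sync "$destdir"                   # flush the directory entry
}

# Hypothetical wiring, via a wrapper script taking %p and %f:
#   archive_command = '/usr/local/bin/archive-wal.sh %p /var/lib/pgarchive'
```

Compare this with the documented bare `cp %p ...%f`, which returns success as soon as the data reaches the page cache, can silently overwrite an existing segment, and can leave a half-copied file that a restore will later trip over.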
Re: Teaching users how they can get the most out of HOT in Postgres 14
On Mon, Jun 14, 2021 at 5:23 PM Michael Paquier wrote:
> > *Why* does it have to work at the system level?  I don't understand
> > what you mean about the system level.
>
> I mean that you lack a GUC that allows to enforce to *not* use this
> optimization for all relations, for all processes.

You've just explained what a GUC is.  This is not helpful.

> > Why is this fundamentally different to those two things?
>
> Because the situation looks completely different to me here.  TRUNCATE
> is thought as an option to be able to avoid an exclusive lock when
> truncating the relation file size at the end of VACUUM.  More
> importantly, the default of TRUNCATE is *false*, meaning that we are
> never going to skip the truncation unless one specifies it at the
> relation level.

Maybe it looks different to you because that's not actually true;
VACUUM *does* skip the truncation when it feels like it, regardless of
the value of the reloption.  In general there is no strict guarantee of
truncation ever happening -- see lazy_truncate_heap().

Again: Why is this fundamentally different?

> Here, what we have is a decision that is enforced to happen by
> default, all the time, with the user not knowing about it.  If there
> is a bug or an issue with it, users, based on your proposal, would be
> forced to change it for each *relation*.  If they miss some of those
> relations, they may still run into problems without knowing about it.
> The change of default behavior and having no way to control it in
> a simple way look incompatible to me.

You've just explained what a reloption is.  Again, this is not helping.

> Perhaps.  What I am really scared about is that you are assuming that
> enforcing this decision will be *always* fine.

I very clearly and repeatedly said that there was uncertainty about
causing issues in rare real-world cases.  Are you always 100% certain
that your code has no bugs before you commit it?  Should I now add a
GUC for every single feature that I commit?

You are just asserting that we must need to add a GUC, without giving
any real reasons -- you're just giving generic reasons that work just
as well in most situations.  I'm baffled by this.

> What I am trying to say here is that it *may not* be fine for
> everybody, and that there should be an easy way to turn it off if that
> proves to be a problem.

As I said, I think that the reloption is both necessary and sufficient.
A GUC is a heavyweight solution that seems quite unnecessary.

> I don't quite see how that's an implementation problem, we have
> already many reloptions that are controlled with GUCs if the
> reloptions have no default.

I never said that there was an implementation problem with a GUC.  Just
that it was unnecessary, and not consistent with existing practice.

Does anyone else have an opinion on this?  Of course I can easily add a
GUC.  But I won't do so in the absence of any real argument in favor of
it.

> I think that a more careful implementation choice would have been to
> turn this optimization off by default, while having an option to allow
> one to turn it on at will.

You have yet to say anything about the implementation.

--
Peter Geoghegan
Re: Different compression methods for FPI
On Mon, Jun 14, 2021 at 08:42:08PM -0500, Justin Pryzby wrote: > On Tue, Jun 15, 2021 at 09:50:41AM +0900, Michael Paquier wrote: >> + {"wal_compression_method", PGC_SIGHUP, WAL_SETTINGS, >> + gettext_noop("Set the method used to compress full page images >> in the WAL."), >> + NULL >> + }, >> + _compression_method, >> + WAL_COMPRESSION_PGLZ, wal_compression_options, >> + NULL, NULL, NULL >> Any reason to not make that user-settable? If you merge that with >> wal_compression, that's not an issue. Hmm, yeah. This can be read as using PGC_USERSET. With the second part of my sentence, I think that I imply to use PGC_SUSET and be consistent with wal_compression, but I don't recall my mood from one month ago :) Sorry for the confusion. > I don't see how restricting it to superusers would mitigate the hazard at all: > If the local admin enables wal compression, then every user's data will be > compressed, and the degree of compression indicatates a bit about their data, > no matter whether it's pglz or lz4. I would vote for having some consistency with wal_compression. Perhaps we could even revisit c2e5f4d, but I'd rather not do that. >> The compression level may be better if specified with a different >> GUC. That's less parsing to have within the GUC machinery. > > I'm not sure about that - then there's an interdependency between GUCs. > If zlib range is 1..9, and zstd is -50..10, then you may have to set the > compression level first, to avoid an error. I believe there's a previous > discussion about inter-dependent GUCs, and maybe a commit fixing a problem > they > caused. But I cannot find it offhand. You cannot do cross-checks for GUCs in their assign hooks or even rely in the order of those parameters, but you can do that in some backend code paths. A recent discussion on the matter is for example what led to 79dfa8a for the GUCs controlling the min/max SSL protocols allowed. 
>> seems to me that if we can get the same amount of compression and CPU >> usage just by tweaking the compression level, there is no need to >> support more than one extra compression algorithm, easing the life of >> packagers and users. > > I don't think it eases it for packagers, since I anticipate the initial patch > would support {none/pglz/lz4/zlib}. I anticipate that zstd may not be in > pg15. Yes, without zstd we have all the infra to track the dependencies. > The goal of the patch is to give options, and the overhead of adding both zlib > and lz4 is low. zlib gives good compression at some CPU cost and may be > preferable for (some) DWs, and lz4 is almost certainly better (than pglz) for > OLTPs. Anything will be better than pglz. I am rather confident in that. What I am wondering is if we need to eat more bits than necessary for the WAL record format, because we will end up supporting it until the end of time. We may have twenty years from now a better solution than what has been integrated, and we may not care much about 1 extra byte for a WAL record at this point, or perhaps we will. From what I hear here, there are many cases that we may care about depending on how much CPU one is ready to pay in order to get more compression, knowing that there are no magic solutions for something that's cheap in CPU with a very good compression ratio, or we could just go with that. So it seems to me that there is still an argument for adding only one new compression method with a good range of levels, able to support the range of cases we'd care about:
- High compression ratio but high CPU cost.
- Low compression ratio but low CPU cost.
So we could also take a decision here based on the range of (compression, CPU) an algorithm is able to cover. -- Michael signature.asc Description: PGP signature
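The constraint mentioned above -- interdependent GUCs cannot be cross-checked in their assign hooks or rely on assignment order, only in backend code paths -- can be sketched as follows. This is an illustration in Python, not PostgreSQL code; the zlib and zstd level ranges are the ones quoted in the message, and the lz4 range and all function names are assumptions.

```python
# Per-method level ranges. zlib 1..9 and zstd -50..10 are the ranges quoted
# upthread; the pglz and lz4 entries are illustrative assumptions.
LEVEL_RANGES = {
    "pglz": (1, 1),
    "lz4": (1, 12),
    "zlib": (1, 9),
    "zstd": (-50, 10),
}

def check_level_guc(level):
    """Assign-hook-style check: this setting is validated in isolation,
    so the only safe rule is one that holds for every possible method."""
    return -50 <= level <= 12

def compression_setup(method, level):
    """Backend-code-path check: both settings are visible here, so the
    pair can be cross-checked right before compression is actually used,
    which is where an invalid combination can be reported cleanly."""
    lo, hi = LEVEL_RANGES[method]
    if not lo <= level <= hi:
        raise ValueError(f"level {level} out of range {lo}..{hi} for {method}")
    return (method, level)
```

The point of the split is that `check_level_guc` can accept values that are only valid for *some* method, while `compression_setup` rejects the invalid pair at use time -- the same shape as the 79dfa8a fix for the min/max SSL protocol GUCs.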
Isolation tests vs. SERIALIZABLE isolation level
In the past people have tried to ensure that the isolation tests would pass regardless of the prevailing default_transaction_isolation setting. (That was sort of the point, in fact, for the earliest tests using that infrastructure.) This seems to have been forgotten about lately, as all of these tests fail with default_transaction_isolation = serializable:

test detach-partition-concurrently-1 ... FAILED      504 ms
test detach-partition-concurrently-3 ... FAILED     2224 ms
test detach-partition-concurrently-4 ... FAILED     1600 ms
test fk-partitioned-2                ... FAILED      133 ms
test lock-update-delete              ... FAILED      538 ms
test tuplelock-update                ... FAILED    10223 ms
test tuplelock-upgrade-no-deadlock   ... FAILED      664 ms
test tuplelock-partition             ... FAILED       49 ms

(drop-index-concurrently-1 also failed until just now, but I resurrected its variant expected-file.) So:

* Do we still care about that policy?
* If so, who's going to fix the above-listed problems?
* Should we try to get some routine testing of this case in place?

regards, tom lane
Re: Different compression methods for FPI
On Tue, Jun 15, 2021 at 09:50:41AM +0900, Michael Paquier wrote: > On Sun, Jun 13, 2021 at 08:24:12PM -0500, Justin Pryzby wrote: > > I think it's more nuanced than just finding the algorithm with the least CPU > > use. The GUC is PGC_USERSET, and it's possible that a data-loading process > > might want to use zlib for better compress ratio, but an interactive OLTP > > process might want to use lz4 or no compression for better responsiveness. > > It seems to me that this should be a PGC_SUSET, at least? We've had > our share of problems with assumptions behind data leaks depending on > data compressibility (see ssl_compression and the kind). It's USERSET following your own suggestion (which is a good suggestion): On Mon, May 17, 2021 at 04:44:11PM +0900, Michael Paquier wrote: > + {"wal_compression_method", PGC_SIGHUP, WAL_SETTINGS, > + gettext_noop("Set the method used to compress full page images in > the WAL."), > + NULL > + }, > + _compression_method, > + WAL_COMPRESSION_PGLZ, wal_compression_options, > + NULL, NULL, NULL > Any reason to not make that user-settable? If you merge that with > wal_compression, that's not an issue. I don't see how restricting it to superusers would mitigate the hazard at all: If the local admin enables wal compression, then every user's data will be compressed, and the degree of compression indicates a bit about their data, no matter whether it's pglz or lz4. It's probably true without compression, too - the fraction of FPW might reveal their usage patterns. > > In this patch series, I added compression information to the errcontext from > > xlog_block_info(), and allow specifying compression levels like zlib-2. > > I'll > > rearrange that commit earlier if we decide that's desirable to include. > > The compression level may be better if specified with a different > GUC. That's less parsing to have within the GUC machinery. I'm not sure about that - then there's an interdependency between GUCs.
If zlib range is 1..9, and zstd is -50..10, then you may have to set the compression level first, to avoid an error. I believe there's a previous discussion about inter-dependent GUCs, and maybe a commit fixing a problem they caused. But I cannot find it offhand. > seems to me that if we can get the same amount of compression and CPU > usage just by tweaking the compression level, there is no need to > support more than one extra compression algorithm, easing the life of > packagers and users. I don't think it eases it for packagers, since I anticipate the initial patch would support {none/pglz/lz4/zlib}. I anticipate that zstd may not be in pg15. The goal of the patch is to give options, and the overhead of adding both zlib and lz4 is low. zlib gives good compression at some CPU cost and may be preferable for (some) DWs, and lz4 is almost certainly better (than pglz) for OLTPs. -- Justin
Re: Delegating superuser tasks to new security roles
On 2021-06-14 23:53, Mark Dilger wrote: On Jun 14, 2021, at 5:51 AM, torikoshia wrote: Thanks for working on this topic, I appreciate it! Thank you for taking a look! BTW, do these patches enable non-superusers to create users with BYPASSRLS? No, I did not break out the ability to create such users. Since I failed to apply the patches and didn't test them, I may have overlooked something, but I didn't find the corresponding code. Do you believe that functionality should be added? I have not thought much about that issue. I just noticed it because I was looking into operations that can only be done by superusers. It might be somewhat inconvenient for PostgreSQL service providers that don't give users superuser privileges, but at least I don't have a specific demand for it. Regards, -- Atsushi Torikoshi NTT DATA CORPORATION
Re: Duplicate history file?
At Fri, 11 Jun 2021 16:08:33 +0900, Michael Paquier wrote in > On Fri, Jun 11, 2021 at 03:32:28PM +0900, Kyotaro Horiguchi wrote: > > I think cp can be an example as far as we explain the limitations. (On > > the other hand "test !-f" cannot since it actually prevents server > > from working correctly.) > > Disagreed. I think that we should not try to change this area until > we can document a reliable solution, and a simple "cp" is not that. Isn't removing cp from the documentation a change in this area? I basically agree not to change anything, but the current example "test ! -f && cp .." and the relevant description have been known to be problematic in a certain situation.

- Do we leave it alone, ignoring the possible problem?
- Just add a description about "the problem"?
- Just remove "test ! -f" and the relevant description?
- Remove "test ! -f" and rewrite the relevant description?
(- Or keep "test ! -f" and rewrite the relevant description?)
- Write the full (known) requirements and use a pseudo tool-name in the example?
- Provide a minimal implementation of the command?
- Recommend some external tools (that we can guarantee conform to the requirements)?
- Not recommend any tools?

> Hmm. A simple command that could be used as reference is for example > "dd" that flushes the file by itself, or we could just revisit the > discussions about having a pg_copy command, or we could document a > small utility in perl that does the job. I think we should do that if pg_copy conforms to the mandatory requirements, but maybe that's for the future. Showing a minimal implementation in perl looks good. However, my main point here is "what should we do for now?", not an ideal solution. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
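The "small utility" idea mentioned above could look like the following sketch (Python here rather than perl; the function name, the pre-existence check mirroring the documented "test ! -f", and the error handling are all illustrative assumptions, not a proposed pg_copy). The requirements it tries to capture are the ones a plain cp misses: flush the data to storage, make the file appear atomically so a concurrent restore_command never sees a half-written copy, and persist the rename itself.

```python
import os
import shutil

def archive_wal(src, dstdir):
    """Durable, atomic archive copy -- a sketch of the requirements
    discussed for archive_command, not a production tool."""
    dst = os.path.join(dstdir, os.path.basename(src))
    if os.path.exists(dst):
        # Mimics the "test ! -f" part of the documented example.
        raise FileExistsError(f"{dst} is already archived")
    tmp = dst + ".tmp"
    with open(src, "rb") as fin, open(tmp, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()
        os.fsync(fout.fileno())      # data reaches stable storage
    os.rename(tmp, dst)              # atomic within a filesystem:
                                     # readers see all of it or none of it
    dirfd = os.open(dstdir, os.O_RDONLY)
    try:
        os.fsync(dirfd)              # persist the rename itself
    finally:
        os.close(dirfd)
```

Even this sketch leaves questions open (the existence check is racy against concurrent archivers, and a leftover `.tmp` file from a crash needs cleanup), which illustrates why documenting the requirements is harder than it first looks.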
Re: PG 14 release notes, first draft
On Tue, Jun 15, 2021 at 10:06:49AM +0900, Masahiko Sawada wrote: > On Mon, May 10, 2021 at 3:03 PM Bruce Momjian wrote: > > > > I have committed the first draft of the PG 14 release notes. You can > > see the most current build of them here: > > > > https://momjian.us/pgsql_docs/release-14.html > > > > It might have already been discussed but I think we should mention > commit bed90759f in the release note. The commit seems like a bug fix > but it is not back-patched to the older versions, at least for now. There > is a discussion[1] that we will revisit that a couple of months after > 14 is released, so that there is some feedback from the field on this > change. OK, but I need more information on how users will see a difference based on this commit:

commit bed90759fc
Author: Tom Lane
Date: Fri Oct 9 16:20:12 2020 -0400

    Fix our Windows stat() emulation to handle file sizes > 4GB.

    Hack things so that our idea of "struct stat" is equivalent to Windows' struct __stat64, allowing it to have a wide enough st_size field. Instead of relying on native stat(), use GetFileInformationByHandle(). This avoids a number of issues with Microsoft's multiple and rather slipshod emulations of stat(). We still need to jump through hoops to deal with ERROR_DELETE_PENDING, though :-(

    Pull the relevant support code out of dirmod.c and put it into its own file, win32stat.c.

    Still TODO: do we need to do something different with lstat(), rather than treating it identically to stat()?

    Juan José Santamaría Flecha, reviewed by Emil Iggland; based on prior work by Michael Paquier, Sergey Zubkovsky, and others

    Discussion: https://postgr.es/m/1803D792815FC24D871C00D17AE95905CF5099@g01jpexmbkw24
    Discussion: https://postgr.es/m/15858-9572469fd3b73...@postgresql.org

-- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Hi, > @@ -4032,6 +4039,24 @@ GlobalVisTestShouldUpdate(GlobalVisState *state) > static void > GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons) > { > + /* assert non-decreasing nature of horizons */ > + Assert(FullTransactionIdFollowsOrEquals( > +FullXidRelativeTo(horizons->latest_completed, > + > horizons->shared_oldest_nonremovable), > +GlobalVisSharedRels.maybe_needed)); > + Assert(FullTransactionIdFollowsOrEquals( > +FullXidRelativeTo(horizons->latest_completed, > + > horizons->catalog_oldest_nonremovable), > +GlobalVisCatalogRels.maybe_needed)); > + Assert(FullTransactionIdFollowsOrEquals( > +FullXidRelativeTo(horizons->latest_completed, > + > horizons->data_oldest_nonremovable), > +GlobalVisDataRels.maybe_needed)); > + Assert(FullTransactionIdFollowsOrEquals( > +FullXidRelativeTo(horizons->latest_completed, > + > horizons->temp_oldest_nonremovable), > +GlobalVisTempRels.maybe_needed)); > + > GlobalVisSharedRels.maybe_needed = > FullXidRelativeTo(horizons->latest_completed, > > horizons->shared_oldest_nonremovable); Thinking more about it, I don't think these are correct. See the following comment in procarray.c: * Note: despite the above, it's possible for the calculated values to move * backwards on repeated calls. The calculated values are conservative, so * that anything older is definitely not considered as running by anyone * anymore, but the exact values calculated depend on a number of things. For * example, if there are no transactions running in the current database, the * horizon for normal tables will be latestCompletedXid. If a transaction * begins after that, its xmin will include in-progress transactions in other * databases that started earlier, so another call will return a lower value. * Nonetheless it is safe to vacuum a table in the current database with the * first result. 
There are also replication-related effects: a walsender * process can set its xmin based on transactions that are no longer running * on the primary but are still being replayed on the standby, thus possibly * making the values go backwards. In this case there is a possibility that * we lose data that the standby would like to have, but unless the standby * uses a replication slot to make its xmin persistent there is little we can * do about that --- data is only protected if the walsender runs continuously * while queries are executed on the standby. (The Hot Standby code deals * with such cases by failing standby queries that needed to access * already-removed data, so there's no integrity bug.) The computed values * are also adjusted with vacuum_defer_cleanup_age, so increasing that setting * on the fly is another easy way to make horizons move backwards, with no * consequences for data integrity. Greetings, Andres Freund
Re: Duplicate history file?
At Fri, 11 Jun 2021 15:18:03 +0800, Julien Rouhaud wrote in > On Fri, Jun 11, 2021 at 03:32:28PM +0900, Kyotaro Horiguchi wrote: > I disagree, cp is probably the worst command that can be used for this > purpose. > On top on the previous problems already mentioned, you also have the fact that > the copy isn't atomic. It means that any concurrent restore_command (or > anything that would consume the archived files) will happily process a half > copied WAL file, and in case of any error during the copy you end up with a > file for which you don't know if it contains valid data or not. I don't see > any case where you would actually want to use that, unless maybe if you want > to > benchmark how long it takes before you lose some data. Actually there's large room for losing data with cp. Yes, we would need additional redundancy of storage, periodical integrity inspection of the storage and archives, and maybe copies at other sites on the other side of the Earth. But those are too much for some kinds of users. They have the right and responsibility to decide how durable/reliable their archive needs to be. (Putting aside some hardware/geological requirements :p) If we mandate some characteristics on the archive_command, we should take them into core. I remember I saw some discussions on archive_command along this line, but I think they ended at a point something like "we cannot design a one-size-fits-all interface conforming to the requirements" (sorry, I don't remember the details..) > I don't know, I'm assuming that barman also provides one, such as wal-e and > wal-g (assuming that the distant providers do their part of the job > correctly). Well, barman used rsync/ssh in its documentation in the past and now looks like it provides barman-wal-archive, so it seems that you're right on that point. So, do we recommend them in our documentation? (I'm not sure they actually conform to the requirements, though..) > Maybe there are other tools too.
But as long as we don't document what > exactly > are the requirements, it's not really a surprise that most people don't > implement them. I strongly agree to describe the requirements. My point is that if all of them are really mandatory, it is mandatory for us to officially provide, or at least recommend, the minimal implementation(s) that covers all of them. If we recommended some external tools, that would mean that we ensure that those tools meet the requirements. If we write an example with a pseudo tool name, requiring some characteristics of the tool, but do not tell about the minimal tools, I think that is equivalent to inhibiting certain users from using archive_command even if they really don't want such a level of durability. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: PG 14 release notes, first draft
On Mon, May 10, 2021 at 3:03 PM Bruce Momjian wrote: > > I have committed the first draft of the PG 14 release notes. You can > see the most current build of them here: > > https://momjian.us/pgsql_docs/release-14.html > It might have already been discussed but I think we should mention commit bed90759f in the release note. The commit seems like a bug fix but it is not back-patched to the older versions, at least for now. There is a discussion[1] that we will revisit that a couple of months after 14 is released, so that there is some feedback from the field on this change. Regards, [1] https://www.postgresql.org/message-id/ycszix2a2ilsv...@paquier.xyz -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Different compression methods for FPI
Hi, On 2021-06-15 09:50:41 +0900, Michael Paquier wrote: > On Sun, Jun 13, 2021 at 08:24:12PM -0500, Justin Pryzby wrote: > > I think it's more nuanced than just finding the algorithm with the least CPU > > use. The GUC is PGC_USERSET, and it's possible that a data-loading process > > might want to use zlib for better compress ratio, but an interactive OLTP > > process might want to use lz4 or no compression for better responsiveness. > > It seems to me that this should be a PGC_SUSET, at least? We've had > our share of problems with assumptions behind data leaks depending on > data compressibility (see ssl_compression and the kind). -1. Currently wal_compression has too much overhead for some workloads, but not for others. Disallowing normal users to set it would break cases where it's set for users that can tolerate the perf impact (which I have done at least). And the scenarios where it can leak information that couldn't otherwise be leaked already don't seem all that realistic? Greetings, Andres Freund
Re: Different compression methods for FPI
On Sun, Jun 13, 2021 at 08:24:12PM -0500, Justin Pryzby wrote: > I think it's more nuanced than just finding the algorithm with the least CPU > use. The GUC is PGC_USERSET, and it's possible that a data-loading process > might want to use zlib for better compress ratio, but an interactive OLTP > process might want to use lz4 or no compression for better responsiveness. It seems to me that this should be a PGC_SUSET, at least? We've had our share of problems with assumptions behind data leaks depending on data compressibility (see ssl_compression and the kind). > In this patch series, I added compression information to the errcontext from > xlog_block_info(), and allow specifying compression levels like zlib-2. I'll > rearrange that commit earlier if we decide that's desirable to include. The compression level may be better if specified with a different GUC. That's less parsing to have within the GUC machinery. So, how does the compression level influence those numbers? The level of compression used by LZ4 here is the fastest-CPU/least-compression, same for zlib and zstd? Could we get some data with getrusage()? It seems to me that if we can get the same amount of compression and CPU usage just by tweaking the compression level, there is no need to support more than one extra compression algorithm, easing the life of packagers and users. -- Michael signature.asc Description: PGP signature
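Michael's question -- how much the compression *level* alone changes the size/CPU picture -- can be illustrated with a small stdlib sketch. This uses zlib only, on a synthetic repetitive 8kB "page"; it is a toy illustration of the level/ratio trade-off being discussed, not a substitute for getrusage() numbers on real full-page images.

```python
import time
import zlib

def compare_levels(payload, levels=(1, 6, 9)):
    """Compress the same payload at several zlib levels and return
    (level, compressed size, elapsed seconds) tuples."""
    results = []
    for lvl in levels:
        start = time.perf_counter()
        out = zlib.compress(payload, lvl)
        results.append((lvl, len(out), time.perf_counter() - start))
    return results

# An 8kB "page" with heavy repetition, vaguely like a heap page full of
# similar rows; real FPIs will compress very differently.
page = (b"tuple-header|some-column-data|" * 300)[:8192]

for lvl, size, secs in compare_levels(page):
    print(f"zlib level {lvl}: {size} bytes in {secs:.6f}s")
```

On highly repetitive data like this the levels barely differ, which is part of the argument above: whether one algorithm with a level range can cover the whole (compression, CPU) space is an empirical question that needs measurements on real WAL.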
Re: Skipping logical replication transactions on subscriber side
On Wed, Jun 2, 2021 at 3:07 PM Amit Kapila wrote: > > On Tue, Jun 1, 2021 at 9:05 PM Peter Eisentraut > wrote: > > > > On 01.06.21 06:01, Amit Kapila wrote: > > > But, won't that be costly in cases where we have errors in the > > > processing of very large transactions? The subscription has to process all > > > the data before it gets an error. I think we can even imagine this > > > feature to be extended to use commitLSN as a skip candidate in which > > > case we can even avoid getting the data of that transaction from the > > > publisher. So if this information is persistent, the user can even set > > > the skip identifier after the restart before the publisher can send > > > all the data. > > > > At least in current practice, skipping parts of the logical replication > > stream on the subscriber is a rare, emergency-level operation when > > something that shouldn't have happened happened. So it doesn't really > > matter how costly it is. It's not going to be more costly than the > > error happening in the first place. All you'd need is one shared memory > > slot per subscription to store a xid to skip. > > > > Leaving aside the performance point, how can we do that by just storing > the skip identifier (XID/commitLSN) in shared memory? How will the apply > worker know about that information after restart? Do you expect the > user to set it again? If so, I think users might not like that. Also, > how will we prohibit users from giving some identifier other than for > failed transactions, and if users provide that, what should be our > action? Without that, if users provide the XID of some in-progress > transaction, we might need to do more work (rollback) than just > skipping it. I think the simplest solution would be to have a fixed-size array in shared memory to store information about skipping transactions on the particular subscription. Given that this feature is meant to be a repair tool for emergency cases, 32 or 64 entries seem enough.
That information should be visible to users via a system view, and each entry is cleared once the worker has skipped the transaction. We would also need to clear the entry if the meta information of the subscription, such as conninfo and slot name, has been changed. The worker reads that information at least when starting logical replication. The worker receives changes from the publication and checks whether the transaction should be skipped when starting to apply those changes. If so, the worker skips applying all changes of the transaction and removes stream files if they exist. Regarding the point of how to check whether the XID specified by the user is valid, I guess it's not easy to do that since XIDs sent from the publisher are in random order. Considering the use case of this tool, the situation seems like the logical replication gets stuck due to a problem transaction and the worker repeatedly restarts and raises an error. So I guess it also would be a good idea that the user can specify to skip the first transaction (or first N transactions) since the subscription starts logical replication. It's less flexible but seems enough to solve such a situation, and it doesn't have the problem of validating the XID. If functionality like letting the subscriber know the oldest XID that could possibly be sent is useful also for other purposes, it would also be a good idea to implement it, but I'm not sure about other use cases. Anyway, it seems to me that we need to consider the user interface first, especially how and what the user specifies as the transaction to skip. My current feeling is that specifying an XID is intuitive and flexible, but the user needs two steps: check the XID and then specify it, and there is a risk that the user mistakenly specifies a wrong XID. On the other hand, the idea of specifying to skip the first transaction doesn't require the user to check and specify an XID but is less flexible, and "the first" transaction might be ambiguous for the user.
Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
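The fixed-size skip table proposed in the message above could be sketched as follows. This is an illustration of the semantics being discussed (set by the user, checked by the apply worker, cleared after the skip or on subscription changes), not code from any patch; the class, its keying by subscription OID, and the 64-entry cap are all assumptions.

```python
MAX_SKIP_ENTRIES = 64  # "32 or 64 entries seem enough" per the proposal

class SkipXidTable:
    """Sketch of a fixed-size, shared skip list for logical replication:
    the user adds XIDs, the apply worker consults them, and entries are
    cleared once a transaction has been skipped."""

    def __init__(self):
        self.entries = {}  # subscription oid -> set of XIDs to skip

    def add(self, suboid, xid):
        if sum(len(s) for s in self.entries.values()) >= MAX_SKIP_ENTRIES:
            raise RuntimeError("skip table full")
        self.entries.setdefault(suboid, set()).add(xid)

    def should_skip(self, suboid, xid):
        # Checked by the worker when it starts applying a transaction.
        return xid in self.entries.get(suboid, set())

    def clear_after_skip(self, suboid, xid):
        # Entry is one-shot: removed once the transaction was skipped.
        self.entries.get(suboid, set()).discard(xid)

    def invalidate_subscription(self, suboid):
        # conninfo/slot-name change: drop all pending skips for safety.
        self.entries.pop(suboid, None)
```

Note what the sketch deliberately cannot do, matching the concern in the thread: nothing validates that an added XID corresponds to a failed transaction, which is exactly why the "skip the first transaction" alternative is attractive.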
Re: A new function to wait for the backend exit after termination
On Mon, Jun 14, 2021 at 07:34:59PM +0530, Bharath Rupireddy wrote: > Thanks. +1 to remove the pg_wait_for_backend_termination function. The > patch basically looks good to me. I'm attaching an updated patch. I > corrected a minor typo in the commit message, took docs and code > comment changes suggested by Justin Pryzby. Pushed as two commits. I used your log message typo fix. Here were the diffs in your v2 and not in an earlier patch:

> -+ Add an optional wait parameter to
> ++ Add an optional timeout parameter to

> -+ int timeout; /* milliseconds */
> ++ int timeout; /* milliseconds */

pgindent chooses a third option, so I ran pgindent instead of using this. > Please note that release notes still have the headline "Add functions > to wait for backend termination" of the original commit that added > pg_wait_for_backend_termination. With the removal of it, I'm not quite > sure if we retain the commit message or tweak it to something like > "Add optional timeout parameter to pg_terminate_backend". > That part is supposed to mirror "git log --pretty=format:%s", no matter what happens later. The next set of release note updates might add my latest commit (5f1df62) to this SGML comment, on another line.
Re: Teaching users how they can get the most out of HOT in Postgres 14
On Fri, Jun 11, 2021 at 02:46:20PM -0700, Peter Geoghegan wrote: > On Thu, Jun 3, 2021 at 11:15 PM Michael Paquier wrote: >> I have read through the patch, and I am surprised to see that this >> only makes possible to control the optimization at relation level. >> The origin of the complaints is that this index cleanup optimization >> has been introduced as a new rule that gets enforced at *system* >> level, so I think that we should have an equivalent with a GUC to >> control the behavior for the whole system. > > *Why* does it have to work at the system level? I don't understand > what you mean about the system level. I mean that you lack a GUC that allows one to enforce *not* using this optimization for all relations, for all processes. > As Masahiko pointed out, adding a GUC isn't what we've done in other > similar cases -- that's how DISABLE_PAGE_SKIPPING works, which was a > defensive option that seems similar enough to what we want to add now. > To give another example, the TRUNCATE VACUUM option (or the related > reloption) can be used to disable relation truncation, a behavior that > sometimes causes *big* issues in production. The truncate behavior is > determined dynamically in most situations -- which is another > similarity to the optimization we've added here. > Why is this fundamentally different to those two things? Because the situation looks completely different to me here. TRUNCATE is thought of as an option to be able to avoid an exclusive lock when truncating the relation file size at the end of VACUUM. More importantly, the default of TRUNCATE is *false*, meaning that we are never going to skip the truncation unless one specifies it at the relation level. Here, what we have is a decision that is enforced to happen by default, all the time, with the user not knowing about it. If there is a bug or an issue with it, users, based on your proposal, would be forced to change it for each *relation*.
If they miss some of those relations, they may still run into problems without knowing about it. The change of default behavior and having no way to control it in a simple way look incompatible to me. >> With what you are >> presenting here, one could only disable the optimization for each >> relation, one-by-one. If this optimization proves to be a problem, >> it's just going to be harder to users to go through all the relations >> and re-tune autovacuum. Am I missing something? > > Why would you expect autovacuum to run even when the optimization is > unavailable (e.g. with Postgres 13)? After all, the specifics of when > the bypass optimization kicks in make it very unlikely that ANALYZE > will ever be able to notice enough dead tuples to trigger an > autovacuum (barring antiwraparound and insert-driven autovacuums). > > There will probably be very few LP_DEAD items remaining. Occasionally > there will be somewhat more LP_DEAD items, that happen to be > concentrated in less than 2% of the table's blocks -- but block-based > sampling by ANALYZE is likely to miss most of them and underestimate > the total number. The sampling approach taken by acquire_sample_rows() > ensures this with larger tables. With small tables the chances of the > optimization kicking in are very low, unless perhaps fillfactor has > been tuned very aggressively. > > There has never been a guarantee that autovacuum will be triggered > (and do index vacuuming) in cases that have very few LP_DEAD items, no > matter how the system has been tuned. The main reason why making the > optimization behavior controllable is for the VACUUM command. > Principally for hackers. I can imagine myself using the VACUUM option > to disable the optimization when I was interested in testing VACUUM or > space utilization in some specific, narrow way. 
> > Of course it's true that there is still some uncertainty about the > optimization harming production workloads -- that is to be expected > with an enhancement like this one. But there is still no actual > example or test case that shows the optimization doing the wrong > thing, or anything like it. Anything is possible, but I am not > expecting there to be even one user complaint about the feature. > Naturally I don't want to add something as heavyweight as a GUC, given > all that. Perhaps. What I am really scared about is that you are assuming that enforcing this decision will *always* be fine. What I am trying to say here is that it *may not* be fine for everybody, and that there should be an easy way to turn it off if that proves to be a problem. I don't quite see how that's an implementation problem; we already have many reloptions that are controlled with GUCs if the reloptions have no default. I think that a more careful implementation choice would have been to turn this optimization off by default, while having an option to allow one to turn it on at will. -- Michael signature.asc Description: PGP signature
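For readers following the thread, the shape of the heuristic being debated is roughly the following. The 2% page threshold is the figure mentioned in the discussion above; the function name, the zero-items shortcut, and everything else is a simplified sketch, not the actual vacuumlazy.c logic (which also caps the absolute number of dead items and never bypasses a failsafe vacuum).

```python
BYPASS_THRESHOLD_PAGES = 0.02  # "less than 2% of the table's blocks"

def should_bypass_index_vacuuming(rel_pages, lpdead_item_pages):
    """Sketch: skip the index-vacuuming passes when LP_DEAD items are
    confined to a tiny fraction of the heap, since a later VACUUM will
    pick them up cheaply. Simplified from the real PG 14 heuristic."""
    if lpdead_item_pages == 0:
        return True  # nothing for the indexes to do at all
    return lpdead_item_pages < rel_pages * BYPASS_THRESHOLD_PAGES
```

Framed this way, the disagreement above is not about the formula but about who can turn it off: per-relation only (reloption/VACUUM option), or also globally via a GUC.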
Re: Signed vs Unsigned (take 2) (src/backend/storage/ipc/procarray.c)
I took it a step further.

Transactions HEAD patched
1000220710586781 1014616710388685 100489191059 10065764,333 10436275 3,55021946687555

TPS HEAD patched
33469,016009  35399,010472
33950,624679  34733,252336
33639,8429    34578,495043
33686,494529  34903,585950
3,48700968070122 3,55%

Is it worth touching procarray.c for real?

With msvc 64 bits, the asm generated:
HEAD     213.731 bytes procarray.asm
patched  212.035 bytes procarray.asm

Patch attached.

regards, Ranier Vilela improve_procarray.patch Description: Binary data
Re: [PATCH] Fix select from wrong table array_op_test
On 2021-06-11T19:00:41+0300, Heikki Linnakangas wrote: > We already have these same queries in the 'arrays' test against the > 'array_op_test' table, though. It sure looks like a copy-paste error to me > as well. That's reason enough, but another reason is that I don't think GIN_CAT_NULL_KEY is covered without this change.
Re: Support for NSS as a libpq TLS backend
On Fri, 2021-05-28 at 11:04 +0200, Daniel Gustafsson wrote:
> Attached is a rebase to keep bitrot at bay.

I get a failure during one of the CRL directory tests due to a missing database -- it looks like the Makefile is missing an entry. (I'm dusting off my build after a few months away, so I don't know if this latest rebase introduced it or not.) Attached is a quick patch; does it work on your machine?

--Jacob

From 7a7b8904ef22212190bb988fab1ef696fe1a59de Mon Sep 17 00:00:00 2001
From: Jacob Champion
Date: Mon, 14 Jun 2021 15:04:26 -0700
Subject: [PATCH] test/ssl: fix NSS server-side CRL test

Make sure the database is created during `make nssfiles`, and expect a
revocation failure message.
---
 src/test/ssl/Makefile          | 2 ++
 src/test/ssl/t/001_ssltests.pl | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/test/ssl/Makefile b/src/test/ssl/Makefile
index 14ca1f8bf3..557cbe223f 100644
--- a/src/test/ssl/Makefile
+++ b/src/test/ssl/Makefile
@@ -45,6 +45,7 @@ NSSFILES := ssl/nss/client_ca.crt.db \
 	ssl/nss/client-revoked.crt__client-revoked.key.db \
 	ssl/nss/server-cn-only.crt__server-password.key.db \
 	ssl/nss/server-cn-only.crt__server-cn-only.key.db \
+	ssl/nss/server-cn-only.crt__server-cn-only.key.crldir.db \
 	ssl/nss/root.crl \
 	ssl/nss/server.crl \
 	ssl/nss/client.crl \
@@ -167,6 +168,7 @@ ssl/nss/server-cn-only.crt__server-cn-only.key.db: ssl/server-cn-only.crt ssl/se
 	pk12util -i ssl/nss/server-cn-only.pfx -d "sql:$@" -W ''

 ssl/nss/server-cn-only.crt__server-cn-only.key.crldir.db: ssl/nss/server-cn-only.crt__server-cn-only.key.db
+	cp -R $< $@
 	for c in $(shell ls ssl/root+client-crldir) ; do \
 		echo $${c} ; \
 		openssl crl -in ssl/root+client-crldir/$${c} -outform der -out ssl/nss/$${c} ; \
diff --git a/src/test/ssl/t/001_ssltests.pl b/src/test/ssl/t/001_ssltests.pl
index 4105a67b94..aec99e7bf6 100644
--- a/src/test/ssl/t/001_ssltests.pl
+++ b/src/test/ssl/t/001_ssltests.pl
@@ -664,7 +664,7 @@ $node->connect_fails(
 	"$common_connstr user=ssltestuser sslcert=ssl/client-revoked.crt sslkey=ssl/client-revoked_tmp.key ssldatabase=ssl/nss/client-revoked.crt__client-revoked.key.db",
 	"certificate authorization fails with revoked client cert with server-side CRL directory",
 	expected_stderr =>
-	qr/SSL error: sslv3 alert certificate revoked|SSL error: Encountered end of file/);
+	qr/SSL error: sslv3 alert certificate revoked|SSL peer rejected your certificate as revoked/);

 # clean up
--
2.25.1
Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Hi,

On 2021-06-14 11:53:47 +0200, Matthias van de Meent wrote:
> On Thu, 10 Jun 2021 at 19:43, Peter Geoghegan wrote:
> > On Thu, Jun 10, 2021 at 10:29 AM Matthias van de Meent wrote:
> > > I see one exit for HEAPTUPLE_DEAD on a potentially recently committed
> > > xvac (?), and we might also check against recently committed
> > > transactions if xmin == xmax, although apparently that is not
> > > implemented right now.
> >
> > I don't follow. Perhaps you can produce a test case?
>
> If you were to delete a tuple in the same transaction that you create
> it (without checkpoints / subtransactions), I would assume that this
> would allow us to vacuum the tuple, as the only snapshot that could
> see the tuple must commit or roll back.

Right now we do not do so, but I think we talked about adding such logic a couple of times.

I think a more robust assertion than aborted-ness could be to assert that repeated retries are not allowed to have the same "oldest xid" as a previous retry. With oldest xid being the older of xmin/xmax?

Greetings,

Andres Freund
Re: [Proposal] Add accumulated statistics for wait event
Hi,

On 2021-06-14 23:20:47 +0200, Jehan-Guillaume de Rorthais wrote:
> > On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote:
> > > In the patch in attachment, I tried to fix this by using kind of an
> > > internal hook for pgstat_report_wait_start and pgstat_report_wait_end.
> > > This allows to "instrument" wait events only when required, on the fly,
> > > dynamically.
> >
> > That's *far worse*. You're adding an indirect function call. Which requires
> > loading a global variable and then a far call to a different function.
> > You're changing a path that's ~2 instructions with minimal dependencies
> > (and no branches, i.e. fully out-of-order executable) to something on the
> > order of ~15 instructions with plenty of dependencies and at least two
> > branches (call, ret).
>
> Oh, I didn't realize it would affect all queries, even when
> log_statement_stats was off. Thank you for your explanation.

Maybe I'm just misunderstanding what you were doing? As far as I can tell your patch changed pgstat_report_wait_start() to be an indirect function call - right? Then yes, this adds overhead to everything.

You *could* add a pgstat_report_wait_(start|end)_with_time() or such and only use that in places that won't have a high frequency. But I just don't quite see the use-case for that.

Greetings,

Andres Freund
Re: Question about StartLogicalReplication() error path
On Mon, 2021-06-14 at 13:13 -0400, Robert Haas wrote:
> I'm happy to hear other opinions, but I think I would be inclined to
> vote in favor of #1 and/or #2 but against #3.

What about upgrading it to, say, LOG? It seems like it would happen pretty infrequently, and in the event something strange happens, it might help rule out some possibilities.

Regards,

Jeff Davis
Re: [Proposal] Add accumulated statistics for wait event
Hi,

On Mon, 14 Jun 2021 11:27:21 -0700 Andres Freund wrote:
> On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote:
> > In the patch in attachment, I tried to fix this by using kind of an internal
> > hook for pgstat_report_wait_start and pgstat_report_wait_end. This allows to
> > "instrument" wait events only when required, on the fly, dynamically.
>
> That's *far worse*. You're adding an indirect function call. Which requires
> loading a global variable and then a far call to a different function. You're
> changing a path that's ~2 instructions with minimal dependencies (and no
> branches, i.e. fully out-of-order executable) to something on the order of ~15
> instructions with plenty of dependencies and at least two branches (call, ret).

Oh, I didn't realize it would affect all queries, even when log_statement_stats was off. Thank you for your explanation.

> I doubt there's a path towards this feature without adding the necessary
> infrastructure to hot-patch the code - which is obviously quite a
> substantial project.

Right. Sadly, this kind of project is far above what I can do. So I suppose it's a dead end for me. I'll study if/how the sampling approach can be done dynamically.

Thank you,
Re: a path towards replacing GEQO with something better
On Mon, Jun 14, 2021 at 06:10:28PM +0200, Tomas Vondra wrote: > On 6/14/21 1:16 PM, John Naylor wrote: > > On Sun, Jun 13, 2021 at 9:50 AM Tomas Vondra > > mailto:tomas.von...@enterprisedb.com>> > > wrote: > > > >> > 2) We can still keep GEQO around (with some huge limit by default) for a > >> > few years as an escape hatch, while we refine the replacement. If there > >> > is some bug that prevents finding a plan, we can emit a WARNING and fall > >> > back to GEQO. Or if a user encounters a regression in a big query, they > >> > can lower the limit to restore the plan they had in an earlier version. > >> > > >> > >> Not sure. Keeping significant amounts of code may not be free - both for > >> maintenance and new features. It'd be a bit sad if someone proposed > >> improvements to join planning, but had to do 2x the work to support it > >> in both the DP and GEQO branches, or risk incompatibility. > > > > Looking back again at the commit history, we did modify geqo to support > > partial paths and partition-wise join, so that's a fair concern. > > Right. I think the question is how complex those changes were. If it was > mostly mechanical, it's not a big deal and we can keep both, but if it > requires deeper knowledge of the GEQO inner workings it may be an issue > (planner changes are already challenging enough). The random plan nature of GEQO, along with its "cliff", make it something I would be glad to get rid of if we can get an improved approach to large planning needs. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Failure in subscription test 004_sync.pl
On 2021-Jun-14, Masahiko Sawada wrote: > The condition should be the opposite; we should raise the error when > 'nowait' is true. I think this is the cause of the test failure. Even > if DROP SUBSCRIPTION tries to drop the slot with the WAIT option, we > don't wait but raise the error. Hi, thanks for diagnosing this and producing a patch! I ended up turning the condition around, so that all three stanzas still test "!nowait"; which seems a bit easier to follow. TBH I'm quite troubled by the fact that this test only failed once on each animal; they all had a lot of successful runs after that. I wonder if this is because coverage is insufficient, or is it just bad luck. I also wonder if this bug is what caused the random failures in the test case I tried to add. I should look at that some more now ... -- Álvaro Herrera Valdivia, Chile Al principio era UNIX, y UNIX habló y dijo: "Hello world\n". No dijo "Hello New Jersey\n", ni "Hello USA\n".
Re: PG 14 release notes, first draft
On Mon, Jun 14, 2021 at 01:29:42PM -0400, John Naylor wrote: > Hi Bruce, > > For this item: > > > > > Allow BRIN indexes to use bloom filters > (Tomas Vondra) > > > > This allows bloom indexes to be used effectively with data that > is not physically localized in the heap. > > > > The text implies that this affects contrib/bloom. I think it should be "This > allows BRIN indexes...". Ah, I see your point. Updated text is: Allow BRIN indexes to use bloom filters (Tomas Vondra) This allows BRIN indexes to be used effectively with data that is not physically localized in the heap. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: [PATCH] expand the units that pg_size_pretty supports on output
> I don't see the need to extend the unit to YB. > What use case do you have in mind? Practical or no, I saw no reason not to support all defined units. I assume we’ll get to a need sooner or later. :) David
Re: PG 14 release notes, first draft
On Mon, Jun 14, 2021 at 06:57:41PM +0200, Tomas Vondra wrote: > Sorry it took me a while to look at the release notes. I have one > suggestion regarding this item: > > > Allow logical replication to stream long in-progress transactions to > subscribers (Tomas Vondra, Dilip Kumar, Amit Kapila, Ajin Cherian, > Nikhil Sontakke, Stas Kelvich) > > AFAICS the authors are generally ordered by how much they contributed to > the feature. In that case I'd move myself down the list - certainly > after Dilip and Amit, perhaps after Ajin. While I posted the original > patch, most of the work after that to get it committed was done by > those two/three people. OK, I moved you after Ajin. Sometimes it isn't clear how much of an original patch was modified by later authors. FYI, the most recent PG 14 relnote doc build is at: https://momjian.us/pgsql_docs/release-14.html -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: PG 14 release notes, first draft
On Mon, Jun 14, 2021 at 11:37:58AM -0500, Justin Pryzby wrote: > Some more: > > | VACUUM now has a PROCESS_TOAST which can be set to false to disable TOAST > processing, and vacuumdb has a --no-process-toast option. > has a process_toast *option Agreed. > | Previously, if the object already exists, EXPLAIN would fail. > already existed Fixed. > | Function pg_stat_reset_replication_slot() resets slot statistics. > *The function. But maybe it should be omitted. OK, I went with "The function". > | New options are read-only, primary, standby, and prefer-standby. > *The new options Agreed. > | Allow reindexdb to change the tablespace of the new index (Michael Paquier) > | This is done by specifying --tablespace. > I think this should be merged with the corresponding server feature, like > this one: > |Add ability to skip vacuuming of TOAST tables (Nathan Bossart) > |VACUUM now has a PROCESS_TOAST which can be set to false to disable TOAST > processing, and vacuumdb has a --no-process-toast option. > > Or, the client-side option could be omitted. This is distinguished from > vacuumdb --no-index-cleanup and --no-truncate, for which the server support > was > added in v12, and the client support was essentially an omission. I am inclined to mention reindexdb because we mention the SQL command option in the text. Here is the updated text: Allow REINDEX to change the tablespace of the new index (Alexey Kondratov, Michael Paquier, Justin Pryzby) This is done by specifying a TABLESPACE clause. A --tablespace option was also added to reindexdb to control this. > | Add documentation for the factorial() function (Peter Eisentraut) > | With the removal of the ! operator in this release, factorial() is the only > built-in way to compute a factorial. > Could be omitted or collapsed into the other item. I know Tom thinks that > it's unnecessary to document changes to documentation. Uh, I think we need both items.
We are removing a feature and asking people to use an existing feature that was previously undocumented. I think having two items makes it clear that the feature existed in previous releases. I just tried merging them into one item and there were just too many changes for it to be clear. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Race condition in recovery?
On 6/14/21 3:32 PM, Andrew Dunstan wrote:
> On 6/14/21 1:50 PM, Andrew Dunstan wrote:
>> On 6/14/21 1:11 PM, Robert Haas wrote:
>>> On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan wrote:
>>>> $^X is not at all broken. The explanation here is pretty simple - the
>>>> argument to perl2host is meant to be a directory. If we're going to
>>>> accommodate plain files then we have some more work to do in TestLib.
>>> This explanation seems to contradict the documentation in TestLib.pm,
>>> which makes no mention of any such restriction.
>> Here's a snippet:
>>
>> sub perl2host
>> {
>>     my ($subject) = @_;
>>     ...
>>     if (chdir $subject)
>>
>> Last time I looked you can't chdir to anything except a directory.
>
> Actually, I take it back, it does work for a file. I'll change it. I
> probably did this when something else wasn't working.

So, will you feel happier with this applied? I haven't tested it yet but I'm confident it will work.

diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl
index e4e58cb8ab..3e19bc4c50 100644
--- a/src/test/recovery/t/025_stuck_on_old_timeline.pl
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -24,11 +24,11 @@ my $node_primary = get_new_node('primary');
 # the timeline history file reaches the archive but before any of the WAL files
 # get there.
 $node_primary->init(allows_streaming => 1, has_archiving => 1);
-my $perlbin = $^X;
-if ($^O eq 'msys')
-{
-	$perlbin = TestLib::perl2host(dirname($^X)) . '\\' . basename($^X);
-}
+
+# Note: consistent use of forward slashes here avoids any escaping problems
+# that arise from use of backslashes. That means we need to double-quote all
+# the paths in the archive_command
+my $perlbin = TestLib::perl2host($^X);
 $perlbin =~ s!\\!/!g if $TestLib::windows_os;
 my $archivedir_primary = $node_primary->archive_dir;
 $archivedir_primary =~ s!\\!/!g if $TestLib::windows_os;
@@ -36,6 +36,8 @@ $node_primary->append_conf('postgresql.conf', qq(
 archive_command = '"$perlbin" "$FindBin::RealBin/cp_history_files" "%p" "$archivedir_primary/%f"'
 wal_keep_size=128MB
 ));
+# make sure that Msys perl doesn't complain about difficulty in setting locale
+# when called this way.
 local $ENV{PERL_BADLANG}=0;
 $node_primary->start;

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: Race condition in recovery?
On 6/14/21 1:50 PM, Andrew Dunstan wrote: > On 6/14/21 1:11 PM, Robert Haas wrote: >> On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan wrote: >>> $^X is not at all broken. >>> >>> The explanation here is pretty simple - the argument to perl2host is >>> meant to be a directory. If we're going to accomodate plain files then >>> we have some more work to do in TestLib. >> This explanation seems to contradict the documentation in TestLib.pm, >> which makes no mention of any such restriction. > > Heres a snippet: > > > sub perl2host > { > my ($subject) = @_; > ... > if (chdir $subject) > > > Last time I looked you can't chdir to anything except a directory. Actually, I take it back, it does work for a file. I'll change it. I probably did this when something else wasn't working. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Race condition in recovery?
On Mon, Jun 14, 2021 at 1:50 PM Andrew Dunstan wrote: > Heres a snippet: > > sub perl2host > { > my ($subject) = @_; > ... > if (chdir $subject) > > Last time I looked you can't chdir to anything except a directory. OK, but like I said, you can't tell that from the documentation. The documentation says: "Translate a virtual file name to a host file name. Currently, this is a no-op except for the case of Perl=msys and host=mingw32. The subject need not exist, but its parent or grandparent directory must exist unless cygpath is available." If you look just at that, there's nothing that would lead you to believe that it has to be a directory name. > I was trying to get the buildfarm green again. There have been plenty of > times when small patches directly for such fixes have been committed > directly. And that's the only circumstance when I do. I wasn't intending to criticize your work on this. I really appreciate it, in fact, as I also said to you off-list. I do think that there were some small things in those patches where a little bit of quick discussion might have been useful: e.g. should the archive_command change have gone in in the first place? Do we need any comments to explain the fixes? But it's not like it's a big deal either. I'm certainly not disagreeing with the goodness of making the buildfarm green as expediently as possible. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Position of ClientAuthentication hook
On Mon, Jun 14, 2021 at 8:51 AM Rafia Sabih wrote: > I have a doubt regarding the positioning of clientAuthentication hook > in function ClientAuthentication. Particularly, in case of hba errors, > i.e. cases uaReject or uaImplicitReject it errors out, leading to no > calls to any functions attached to the authentication hook. Can't we > process the hook function first and then error out...? Maybe. One potential problem is that if the hook errors out, the original error would be lost and only the error thrown by the hook would be logged or visible to the client. Whether or not that's a problem depends, I suppose, on what you're trying to do with the hook. -- Robert Haas EDB: http://www.enterprisedb.com
Re: MultiXact\SLRU buffers configuration
On Mon, 14 Jun 2021 at 15:07, Andrey Borodin wrote:
> PFA patch implementing this idea.

I benchmarked the v17 patches. Testing was done on a 96-core machine, with PGDATA completely placed in tmpfs. PostgreSQL was built with CFLAGS -O2.

for-update pgbench script:

\set aid random_zipfian(1, 100, 2)
begin;
select :aid from pgbench_accounts where aid = :aid for update;
update pgbench_accounts set abalance = abalance + 1 where aid = :aid;
update pgbench_accounts set abalance = abalance * 2 where aid = :aid;
update pgbench_accounts set abalance = abalance - 2 where aid = :aid;
end;

Before each test the sample data was filled with "pgbench -i -s 100"; each test was run 3 times for 1 hour each.

The benchmark results are presented with changing multixact_members_buffers and multixact_offsets_buffers (1:2 respectively):

settings                           tps
multixact_members_buffers_64Kb     693.2
multixact_members_buffers_128Kb    691.4
multixact_members_buffers_192Kb    696.3
multixact_members_buffers_256Kb    694.4
multixact_members_buffers_320Kb    692.3
multixact_members_buffers_448Kb    693.7
multixact_members_buffers_512Kb    693.3
vanilla                            676.1

Best regards,
Dmitry Vasiliev.
Re: PG 14 release notes, first draft
On Fri, Jun 11, 2021 at 10:45:51PM -0500, Justin Pryzby wrote: > | Add Set Server Name Indication (SNI) for SSL connection packets (Peter > Eisentraut) > Remove "Set" Fixed. > | Reduce the default value of vacuum_cost_page_miss from 10 milliseconds to 2 > (Peter Geoghegan) > Peter mentioned that this should not say "milliseconds" (but maybe the page > I'm > looking at is old). It is old. It is now: Reduce the default value of vacuum_cost_page_miss to better reflect current hardware capabilities (Peter Geoghegan) > | Cause vacuum operations to be aggressive if the table is near xid or > multixact wraparound (Masahiko Sawada, Peter Geoghegan) > Say "become aggressive" ? Updated text: Cause vacuum operations to be more aggressive if the table is near xid or multixact wraparound (Masahiko Sawada, Peter Geoghegan) > | Allow the arbitrary collations of partition boundary values (Tom Lane) > Remove "the" Agreed, removed. > | Generate WAL invalidations message during command completion when using > logical replication (Dilip Kumar, Tomas Vondra, Amit Kapila) > invalidation messages Fixed. > | Add support for infinity and -infinity values to the numeric data type > (Tom Lane) > "-infinity" has markup but not "infinity" ? Fixed. > > | Allow vacuum to deallocate space reserved by trailing unused heap line > pointers (Matthias van de Meent, Peter Geoghegan) > say "reclaim space" ? OK, new text is: Allow vacuum to reclaim space used by unused trailing heap line pointers (Matthias van de Meent, Peter Geoghegan) -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: [Proposal] Add accumulated statistics for wait event
Hi,

On 2021-06-14 11:27:21 -0700, Andres Freund wrote:
> On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote:
> > In the patch in attachment, I tried to fix this by using kind of an internal
> > hook for pgstat_report_wait_start and pgstat_report_wait_end. This allows to
> > "instrument" wait events only when required, on the fly, dynamically.
>
> That's *far worse*. You're adding an indirect function call. Which requires
> loading a global variable and then a far call to a different function. You're
> changing a path that's ~2 instructions with minimal dependencies (and no
> branches, i.e. fully out-of-order executable) to something on the order of ~15
> instructions with plenty of dependencies and at least two branches (call, ret).

In the case at hand it might even be worse, because the external function call will require registers to be spilled for the function call. Right now wait events "use" two registers (one for the wait event, one for my_wait_event_info), but otherwise don't cause additional spilling. With your change you'd see register spill/reload around both wait start and end.

Greetings,

Andres Freund
Re: [Proposal] Add accumulated statistics for wait event
Hi,

On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote:
> In the patch in attachment, I tried to fix this by using kind of an internal
> hook for pgstat_report_wait_start and pgstat_report_wait_end. This allows to
> "instrument" wait events only when required, on the fly, dynamically.

That's *far worse*. You're adding an indirect function call. Which requires loading a global variable and then a far call to a different function. You're changing a path that's ~2 instructions with minimal dependencies (and no branches, i.e. fully out-of-order executable) to something on the order of ~15 instructions with plenty of dependencies and at least two branches (call, ret).

I doubt there's a path towards this feature without adding the necessary infrastructure to hot-patch the code - which is obviously quite a substantial project.

Greetings,

Andres Freund
Re: Race condition in recovery?
On 6/14/21 1:11 PM, Robert Haas wrote:
> On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan wrote:
>> $^X is not at all broken.
>>
>> The explanation here is pretty simple - the argument to perl2host is
>> meant to be a directory. If we're going to accommodate plain files then
>> we have some more work to do in TestLib.
> This explanation seems to contradict the documentation in TestLib.pm,
> which makes no mention of any such restriction.

Here's a snippet:

sub perl2host
{
    my ($subject) = @_;
    ...
    if (chdir $subject)

Last time I looked you can't chdir to anything except a directory.

>>> +local $ENV{PERL_BADLANG}=0;
>>>
>>> Similarly here. There's not a single other reference to PERL_BADLANG
>>> in the repository, so if we need this one here, there should be a
>>> comment explaining why this is different from all the places where we
>>> don't need it.
>> Here's why this is different: this is the only place that we invoke the
>> msys perl in this way (i.e. from a non-msys aware environment - the
>> binaries we build are not msys-aware). We need to do that if for no
>> other reason than that it might well be the only perl available. Doing
>> so makes it complain loudly about missing locale info. Setting this
>> variable makes it shut up. I can add a comment on that if you like.
> Yes, please, but perhaps you'd like to post patches for discussion
> first instead of committing directly.

I was trying to get the buildfarm green again. There have been plenty of times when small patches for such fixes have been committed directly. And that's the only circumstance when I do.

cheers

andrew
Re: using extended statistics to improve join estimates
Hi,

Here's a slightly improved / cleaned up version of the PoC patch, removing a bunch of XXX and FIXMEs, adding comments, etc. The approach is sound in principle, I think, although there's still a bunch of things to address:

1) statext_compare_mcvs only really deals with equijoins / inner joins at the moment, as it's based on eqjoinsel_inner. It's probably desirable to add support for additional join types (inequality and outer joins).

2) Some of the steps are performed multiple times - e.g. matching base restrictions to statistics, etc. Those probably can be cached somehow, to reduce the overhead.

3) The logic of picking the statistics to apply is somewhat simplistic, and maybe could be improved in some way. OTOH the number of candidate statistics is likely low, so this is not a big issue.

4) statext_compare_mcvs is based on eqjoinsel_inner and makes a bunch of assumptions similar to the original, but some of those assumptions may be wrong in the multi-column case, particularly when working with a subset of columns. For example (ndistinct - size(MCV)) may not be the number of distinct combinations outside the MCV, when ignoring some columns. Same for nullfract, and so on. I'm not sure we can do much more than pick some reasonable approximation.

5) There are no regression tests at the moment. Clearly a gap.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf082..dca1e7d34e 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -157,6 +157,23 @@ clauselist_selectivity_ext(PlannerInfo *root,
 											&estimatedclauses, false);
 	}
 
+	/*
+	 * Try applying extended statistics to joins. There's not much we can
+	 * do to detect when this makes sense, but we can check that there are
+	 * join clauses, and that at least some of the rels have stats.
+	 *
+	 * XXX Isn't this mutually exclusive with the preceding block which
+	 * calculates estimates for a single relation?
+	 */
+	if (use_extended_stats &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
+								   &estimatedclauses))
+	{
+		s1 *= statext_clauselist_join_selectivity(root, clauses, varRelid,
+												  jointype, sjinfo,
+												  &estimatedclauses);
+	}
+
 	/*
 	 * Apply normal selectivity estimates for remaining clauses. We'll be
 	 * careful to skip any clauses which were already estimated above.
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index b05e818ba9..d4cbbee785 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -30,6 +30,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/optimizer.h"
+#include "optimizer/pathnode.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "statistics/extended_stats_internal.h"
@@ -101,6 +102,16 @@
 static StatsBuildData *make_build_data(Relation onerel, StatExtEntry *stat,
 									   int numrows, HeapTuple *rows,
 									   VacAttrStats **stats, int stattarget);
+static bool stat_covers_expressions(StatisticExtInfo *stat, List *exprs,
+									Bitmapset **expr_idxs);
+
+static List *statext_mcv_get_conditions(PlannerInfo *root,
+										RelOptInfo *rel,
+										StatisticExtInfo *info);
+
+static bool *statext_mcv_eval_conditions(PlannerInfo *root, RelOptInfo *rel,
+										 StatisticExtInfo *stat, MCVList *mcv,
+										 Selectivity *sel);
 
 /*
  * Compute requested extended stats, using the rows sampled for the plain
@@ -1154,6 +1165,89 @@ has_stats_of_kind(List *stats, char requiredkind)
 	return false;
 }
 
+/*
+ * find_matching_mcv
+ *		Search for a MCV covering all the attributes.
+ *
+ * XXX Should consider both attnums and expressions. Also should consider
+ * additional restrictinfos as conditions (but only as optional).
+ *
+ * XXX The requirement that all the attributes need to be covered might be
+ * too strong. Maybe we could relax it a bit, and search for MCVs (on both
+ * sides of the join) with the largest overlap. But we don't really expect
+ * many candidate MCVs, so this simple approach seems sufficient.
+ */
+StatisticExtInfo *
+find_matching_mcv(PlannerInfo *root, RelOptInfo *rel, Bitmapset *attnums, List *exprs)
+{
+	ListCell   *l;
+	StatisticExtInfo *mcv = NULL;
+	List	   *stats = rel->statlist;
+
+	foreach(l, stats)
+	{
+		StatisticExtInfo *stat = (StatisticExtInfo *) lfirst(l);
+		List	   *conditions1 = NIL,
+				   *conditions2 = NIL;
+
+		/* We only care about MCV statistics here. */
+		if (stat->kind != STATS_EXT_MCV)
+			continue;
+
+		/*
+		 * Ignore MCVs not covering all the attributes/expressions.
+		 *
+		 * XXX Maybe we shouldn't be so strict and consider only partial
+		 * matches for join clauses too?
+		 */
+		if (!bms_is_subset(attnums, stat->keys) ||
+			!stat_covers_expressions(stat, exprs,
Re: PG 14 release notes, first draft
Hi Bruce, For this item: Allow BRIN indexes to use bloom filters (Tomas Vondra) This allows bloom indexes to be used effectively with data that is not physically localized in the heap. The text implies that this affects contrib/bloom. I think it should be "This allows BRIN indexes...". -- John Naylor EDB: http://www.enterprisedb.com
Re: recent failures on lorikeet
I wrote: > What that'd have to imply is that get_ps_display() messed up, > returning a bad pointer or a bad length. > A platform-specific problem in get_ps_display() seems plausible > enough. The apparent connection to a concurrent VACUUM FULL seems > pretty hard to explain that way ... but maybe that's a mirage. If I understand correctly that you're only seeing this in v13 and HEAD, then it seems like bf68b79e5 (Refactor ps_status.c API) deserves a hard look. regards, tom lane
Re: recent failures on lorikeet
Andrew Dunstan writes:
> The line in lmgr.c is where the process title gets changed to "waiting".
> I recently stopped setting process title on this animal on REL_13_STABLE
> and its similar errors have largely gone away.

Oooh, that certainly seems like a smoking gun.

> I can do the same on
> HEAD. But it does make me wonder what the heck has changed to make this
> code fragile.

So what we've got there is

    old_status = get_ps_display(&len);
    new_status = (char *) palloc(len + 8 + 1);
    memcpy(new_status, old_status, len);
    strcpy(new_status + len, " waiting");
    set_ps_display(new_status);
    new_status[len] = '\0'; /* truncate off " waiting" */

Line 1831 is the strcpy, but it seems entirely impossible that that could fail, unless palloc has shirked its job. I'm thinking that the crash is really in the memcpy --- looking at the other lines in your trace, fingering the line after the call seems common. What that'd have to imply is that get_ps_display() messed up, returning a bad pointer or a bad length.

A platform-specific problem in get_ps_display() seems plausible enough. The apparent connection to a concurrent VACUUM FULL seems pretty hard to explain that way ... but maybe that's a mirage.

regards, tom lane
Re: Question about StartLogicalReplication() error path
On Mon, Jun 14, 2021 at 12:50 PM Jeff Davis wrote: > It seems that there's not much agreement in a behavior change here. I > suggest one or more of the following: > > 1. change the logical rep protocol docs to match the current behavior > a. also briefly explain in the docs why it's different from > physical replication (which does always start at the provided LSN as > far as I can tell) > > 2. Change the comment to add something like "Starting at a different > LSN than requested might not catch certain kinds of client errors. > Clients should be careful to check confirmed_flush_lsn if starting at > the requested LSN is required." > > 3. upgrade DEBUG1 message to a WARNING > > Can I get agreement on any of the above suggestions? I'm happy to hear other opinions, but I think I would be inclined to vote in favor of #1 and/or #2 but against #3. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Race condition in recovery?
On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan wrote: > $^X is not at all broken. > > The explanation here is pretty simple - the argument to perl2host is > meant to be a directory. If we're going to accomodate plain files then > we have some more work to do in TestLib. This explanation seems to contradict the documentation in TestLib.pm, which makes no mention of any such restriction. > > +local $ENV{PERL_BADLANG}=0; > > > > Similarly here. There's not a single other reference to PERL_BADLANG > > in the repository, so if we need this one here, there should be a > > comment explaining why this is different from all the places where we > > don't need it. > > Here's why this is different: this is the only place that we invoke the > msys perl in this way (i.e. from a non-msys aware environment - the > binaries we build are not msys-aware). We need to do that if for no > other reason than that it might well be the only perl available. Doing > so makes it complain loudly about missing locale info. Setting this > variable makes it shut up. I can add a comment on that if you like. Yes, please, but perhaps you'd like to post patches for discussion first instead of committing directly. > Part of the trouble is that I've been living and breathing some of these > issues so much recently that I forget that what might be fairly obvious > to me isn't to others. I assure you there is not the faintest touch of > pixy dust involved. Every pixie with whom I've spoken today says otherwise! -- Robert Haas EDB: http://www.enterprisedb.com
Re: PG 14 release notes, first draft
On 5/10/21 8:03 AM, Bruce Momjian wrote: > I have committed the first draft of the PG 14 release notes. You can > see the most current build of them here: > > https://momjian.us/pgsql_docs/release-14.html > > I need clarification on many items, and the document still needs its > items properly ordered, and markup added. I also expect a lot of > feedback. > > I plan to work on completing this document this coming week in > preparation for beta next week. > Sorry it took me a while to look at the release notes. I have one suggestion regarding this item: Allow logical replication to stream long in-progress transactions to subscribers (Tomas Vondra, Dilip Kumar, Amit Kapila, Ajin Cherian, Nikhil Sontakke, Stas Kelvich) AFAICS the authors are generally ordered by how much they contributed to the feature. In that case I'd move myself down the list - certainly after Dilip and Amit, perhaps after Ajin. While I posted the original patch, most of the work after that to get it committed was done by those two/three people. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Race condition in recovery?
On 6/14/21 11:52 AM, Robert Haas wrote: > On Sat, Jun 12, 2021 at 10:20 AM Tom Lane wrote: >> Andrew Dunstan writes: >>> I have pushed a fix, tested on a replica of fairywren/drongo, >> This bit seems a bit random: >> >> # WAL segment, this is enough to guarantee that the history file was >> # archived. >> my $archive_wait_query = >> - "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM >> pg_stat_archiver;"; >> + "SELECT coalesce('$walfile_to_be_archived' <= last_archived_wal, false) " >> . >> + "FROM pg_stat_archiver"; >> $node_standby->poll_query_until('postgres', $archive_wait_query) >>or die "Timed out while waiting for WAL segment to be archived"; >> my $last_archived_wal_file = $walfile_to_be_archived; >> >> I wonder whether that is a workaround for the poll_query_until bug >> I proposed to fix at [1]. This has been reverted. > I found that a bit random too, but it wasn't the only part of the > patch I found a bit random. Like, what can this possibly be doing? > > +if ($^O eq 'msys') > +{ > +$perlbin = TestLib::perl2host(dirname($^X)) . '\\' . basename($^X); > +} > > The idea here is apparently that on msys, the directory name that is > part of $^X needs to be passed through perl2host, but the file name > that is part of the same $^X needs to NOT be passed through perl2host. > Is $^X really that broken? If so, I think some comments are in order. $^X is not at all broken. The explanation here is pretty simple - the argument to perl2host is meant to be a directory. If we're going to accommodate plain files then we have some more work to do in TestLib. > +local $ENV{PERL_BADLANG}=0; > > Similarly here. There's not a single other reference to PERL_BADLANG > in the repository, so if we need this one here, there should be a > comment explaining why this is different from all the places where we > don't need it. Here's why this is different: this is the only place that we invoke the msys perl in this way (i.e. 
from a non-msys aware environment - the binaries we build are not msys-aware). We need to do that if for no other reason than that it might well be the only perl available. Doing so makes it complain loudly about missing locale info. Setting this variable makes it shut up. I can add a comment on that if you like. > On those occasions when I commit TAP test cases, I do try to think > about whether they are going to be portable, but I find these kinds of > changes indistinguishable from magic. Part of the trouble is that I've been living and breathing some of these issues so much recently that I forget that what might be fairly obvious to me isn't to others. I assure you there is not the faintest touch of pixy dust involved. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Question about StartLogicalReplication() error path
On Sat, 2021-06-12 at 16:17 +0530, Amit Kapila wrote: > AFAIU, currently, in such a case, the subscriber (client) won't > advance the flush location (confirmed_flush_lsn). So, we won't lose > any data. I think you are talking about the official Logical Replication specifically, rather than an arbitrary client that's using the logical replication protocol based on the protocol docs. It seems that there's not much agreement in a behavior change here. I suggest one or more of the following: 1. change the logical rep protocol docs to match the current behavior a. also briefly explain in the docs why it's different from physical replication (which does always start at the provided LSN as far as I can tell) 2. Change the comment to add something like "Starting at a different LSN than requested might not catch certain kinds of client errors. Clients should be careful to check confirmed_flush_lsn if starting at the requested LSN is required." 3. upgrade DEBUG1 message to a WARNING Can I get agreement on any of the above suggestions? Regards, Jeff Davis
Re: array_cat anycompatible change is breaking xversion upgrade tests (v14 release notes)
On Fri, Jun 11, 2021 at 09:40:07PM -0400, Bruce Momjian wrote: > On Fri, Jun 11, 2021 at 09:17:46PM -0400, Tom Lane wrote: > > Bruce Momjian writes: > > > OK, I used some of your ideas and tried for something more general; > > > patch attached. > > > > I think it's a good idea to mention custom aggregates and operators > > specifically, as otherwise people will look at this and have little > > idea what you're on about. I just want wording like "such as custom > > aggregates and operators", in case somebody has done some other creative > > thing that breaks. > > Agreed, updated patch attached. Patch applied. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: A new function to wait for the backend exit after termination
On Mon, Jun 14, 2021 at 09:40:27AM -0700, Noah Misch wrote: > > > Agreed, these strings generally give less detail. I can revert that to > > > the > > > v13 wording, 'terminate a server process'. ... > > Maybe you'd also update the release notes. > > What part of the notes did you expect to change that the patch did not change? Sorry, I didn't notice that your patch already adjusted the v14 notes. Note that Bharath also corrected your commit message to say "unable *to*", and revert the verbose pg_proc.dat descr change. Thanks, -- Justin
Re: recent failures on lorikeet
On 6/14/21 12:33 PM, Andrew Dunstan wrote: > > The line in lmgr.c is where the process title gets changed to "waiting". > I recently stopped setting process title on this animal on REL_13_STABLE > and its similar errors have largely gone away. I can do the same on > HEAD. But it does make me wonder what the heck has changed to make this > code fragile. Of course I meant the line (1831) in lock.c. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: A new function to wait for the backend exit after termination
On Sat, Jun 12, 2021 at 01:27:43PM -0500, Justin Pryzby wrote: > On Sat, Jun 12, 2021 at 08:21:39AM -0700, Noah Misch wrote: > > On Sat, Jun 12, 2021 at 12:12:12AM -0500, Justin Pryzby wrote: > > > Even if it's not removed, the descriptions should be cleaned up. > > > > > > | src/include/catalog/pg_proc.dat- descr => 'terminate a backend process > > > and if timeout is specified, wait for its exit or until timeout occurs', > > > => I think doesn't need to change or mention the optional timeout at all > > > > Agreed, these strings generally give less detail. I can revert that to the > > v13 wording, 'terminate a server process'. > > Maybe you'd also update the release notes. What part of the notes did you expect to change that the patch did not change? > I suggest some edits from the remaining parts of the original patch. These look good. I will push them when I push the other part.
Re: PG 14 release notes, first draft
On Fri, Jun 11, 2021 at 10:45:51PM -0500, Justin Pryzby wrote: > | Add Set Server Name Indication (SNI) for SSL connection packets (Peter > Eisentraut) > Remove "Set" > > | Reduce the default value of vacuum_cost_page_miss from 10 milliseconds to 2 > (Peter Geoghegan) > Peter mentioned that this should not say "milliseconds" (but maybe the page > I'm > looking at is old). > > | Cause vacuum operations to be aggressive if the table is near xid or > multixact wraparound (Masahiko Sawada, Peter Geoghegan) > Say "become aggressive" ? > > | Allow the arbitrary collations of partition boundary values (Tom Lane) > Remove "the" > > | Generate WAL invalidations message during command completion when using > logical replication (Dilip Kumar, Tomas Vondra, Amit Kapila) > invalidation messages > > | Add support for infinity and -infinity values to the numeric data type > (Tom Lane) > "-infinity" has markup but not "infinity" ? > > | Allow vacuum to deallocate space reserved by trailing unused heap line > pointers (Matthias van de Meent, Peter Geoghegan) > say "reclaim space" ? Some more: | VACUUM now has a PROCESS_TOAST which can be set to false to disable TOAST processing, and vacuumdb has a --no-process-toast option. has a process_toast *option | Previously, if the object already exists, EXPLAIN would fail. already existed | Function pg_stat_reset_replication_slot() resets slot statistics. *The function. But maybe it should be omitted. | New options are read-only, primary, standby, and prefer-standby. *The new options | Allow reindexdb to change the tablespace of the new index (Michael Paquier) | This is done by specifying --tablespace. I think this should be merged with the corresponding server feature, like this one: |Add ability to skip vacuuming of TOAST tables (Nathan Bossart) |VACUUM now has a PROCESS_TOAST which can be set to false to disable TOAST processing, and vacuumdb has a --no-process-toast option. Or, the client-side option could be omitted. 
This is distinguished from vacuumdb --no-index-cleanup and --no-truncate, for which the server support was added in v12, and the client support was essentially an omission. | Add documentation for the factorial() function (Peter Eisentraut) | With the removal of the ! operator in this release, factorial() is the only built-in way to compute a factorial. Could be omitted or collapsed into the other item. I know Tom thinks that it's unnecessary to document changes to documentation. -- Justin
Re: recent failures on lorikeet
On 6/14/21 9:39 AM, Tom Lane wrote: > Andrew Dunstan writes: >> I've been looking at the recent spate of intermittent failures on my >> Cygwin animal lorikeet. Most of them look something like this, where >> there's 'VACUUM FULL pg_class' and an almost simultaneous "CREATE TABLE' >> which fails. > Do you have any idea what "exit code 127" signifies on that platform? > (BTW, not all of them look like that; many are reported as plain > segfaults.) I hadn't spotted the association with a concurrent "VACUUM > FULL pg_class" before, that does seem interesting. > >> Getting stack traces in this platform can be very difficult. I'm going >> to try forcing complete serialization of the regression tests >> (MAX_CONNECTIONS=1) to see if the problem goes away. Any other >> suggestions might be useful. Note that we're not getting the same issue >> on REL_13_STABLE, where the same group of tests run together (inherit >> typed_table, vacuum) > If it does go away, that'd be interesting, but I don't see how it gets > us any closer to a fix. Seems like a stack trace is a necessity to > narrow it down. > > Some have given stack traces and some not, not sure why. The one from June 13 has this:

backtrace
?? ??:0
WaitOnLock src/backend/storage/lmgr/lock.c:1831
LockAcquireExtended src/backend/storage/lmgr/lock.c:1119
LockRelationOid src/backend/storage/lmgr/lmgr.c:135
relation_open src/backend/access/common/relation.c:59
table_open src/backend/access/table/table.c:43
ScanPgRelation src/backend/utils/cache/relcache.c:322
RelationBuildDesc src/backend/utils/cache/relcache.c:1039
RelationIdGetRelation src/backend/utils/cache/relcache.c:2045
relation_open src/backend/access/common/relation.c:59
table_open src/backend/access/table/table.c:43
ExecInitPartitionInfo src/backend/executor/execPartition.c:510
ExecPrepareTupleRouting src/backend/executor/nodeModifyTable.c:2311
ExecModifyTable src/backend/executor/nodeModifyTable.c:2559
ExecutePlan src/backend/executor/execMain.c:1557

The line in lmgr.c is where the process title gets changed to "waiting". I recently stopped setting process title on this animal on REL_13_STABLE and its similar errors have largely gone away. I can do the same on HEAD. But it does make me wonder what the heck has changed to make this code fragile. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Use extended statistics to estimate (Var op Var) clauses
On 6/14/21 5:36 PM, Mark Dilger wrote: > > >> On Jun 13, 2021, at 1:28 PM, Tomas Vondra >> wrote: >> >> Here is a slightly updated version of the patch > > Thanks for taking this up again! > > Applying the new test cases from your patch, multiple estimates have > gotten better. That looks good. I wrote a few extra test cases and > saw no change, which is fine. I was looking for regressions where > the estimates are now worse than before. Do you expect there to be > any such cases? > Not really. These clauses could not be estimated before, we generally used default estimates for them. So WHERE a = b would use 0.5%, while WHERE a < b would use 33%, and so on. OTOH it depends on the accuracy of the extended statistics - particularly the MCV list (what fraction of the data it covers, etc.). So it's possible the default estimate is very accurate by chance, and the MCV list represents only a tiny fraction of the data. Then the new estimate could be worse. Consider for example this:

create table t (a int, b int);
insert into t select 100, 100 from generate_series(1,5000) s(i);
insert into t select i, i+1 from generate_series(1,995000) s(i);

This has exactly 0.5% of rows with (a=b). Without extended stats it's perfect:

explain analyze select * from t where a = b;
Seq Scan on t (cost=0.00..16925.00 rows=5000 width=8) (actual time=0.064..159.928 rows=5000 loops=1)

while with statistics it gets worse:

create statistics s (mcv) on a, b from t;
analyze t;
Seq Scan on t (cost=0.00..16925.00 rows=9810 width=8) (actual time=0.059..160.467 rows=5000 loops=1)

It's not terrible, although we could construct worse examples. But the same issue applies to other clauses, not just to these new ones. And it relies on the regular estimation producing a better estimate by chance. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
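The 0.5% and 33% defaults Tomas mentions correspond to PostgreSQL's DEFAULT_EQ_SEL and DEFAULT_INEQ_SEL constants. As a standalone sketch (the helper function names are invented for illustration; the real computation lives in the planner's selectivity code), this is how those defaults turn into the row counts in the example above:

```c
#include <assert.h>

/*
 * Fallback selectivities used when no usable statistics exist; the
 * values match PostgreSQL's DEFAULT_EQ_SEL and DEFAULT_INEQ_SEL.
 */
#define DEFAULT_EQ_SEL   0.005
#define DEFAULT_INEQ_SEL 0.3333333333333333

/* Estimated row count for WHERE a = b with no extended statistics. */
static long
default_eq_rows(long nrows)
{
    return (long) (nrows * DEFAULT_EQ_SEL + 0.5);
}

/* Estimated row count for WHERE a < b with no extended statistics. */
static long
default_ineq_rows(long nrows)
{
    return (long) (nrows * DEFAULT_INEQ_SEL + 0.5);
}
```

For the 1,000,000-row table above, the a = b default gives exactly 5000 rows, which is why the pre-statistics estimate happens to be perfect.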
Re: a path towards replacing GEQO with something better
On 6/14/21 1:16 PM, John Naylor wrote: > On Sun, Jun 13, 2021 at 9:50 AM Tomas Vondra > mailto:tomas.von...@enterprisedb.com>> > wrote: > >> > 2) We can still keep GEQO around (with some huge limit by default) for a >> > few years as an escape hatch, while we refine the replacement. If there >> > is some bug that prevents finding a plan, we can emit a WARNING and fall >> > back to GEQO. Or if a user encounters a regression in a big query, they >> > can lower the limit to restore the plan they had in an earlier version. >> > >> >> Not sure. Keeping significant amounts of code may not be free - both for >> maintenance and new features. It'd be a bit sad if someone proposed >> improvements to join planning, but had to do 2x the work to support it >> in both the DP and GEQO branches, or risk incompatibility. > > Looking back again at the commit history, we did modify geqo to support > partial paths and partition-wise join, so that's a fair concern. Right. I think the question is how complex those changes were. If it was mostly mechanical, it's not a big deal and we can keep both, but if it requires deeper knowledge of the GEQO inner workings it may be an issue (planner changes are already challenging enough). > My concern is the risk of plan regressions after an upgrade, even if > for a small number of cases. > I don't know. My impression/experience with GEQO is that getting a good join order for queries with many joins is often a matter of luck, and the risk of getting a poor plan just forces me to increase geqo_threshold or disable it altogether. Or just rephrase the query to nudge the planner to use a good join order (those large queries are often star or chain shaped, so it's not that difficult). So I'm not sure "GEQO accidentally produces a better plan for this one query" is a good argument to keep it around. We should probably evaluate the overall behavior, and then make a decision. 
FWIW I think we're facing this very dilemma for every optimizer change, to some extent. Every time we make the planner smarter by adding a new plan variant or heuristics, we're also increasing the opportunity for errors. And every time we look (or should look) at average behavior and worst case behavior ... >> OTOH maybe this concern is unfounded in practice - I don't think we've >> done very many big changes to geqo in the last few years. > > Yeah, I get the feeling that it's already de facto obsolete, and we > could make it a policy not to consider improvements aside from bug fixes > where it can't find a valid plan, or forced API changes. Which I guess > is another way of saying "deprecated". > > (I briefly considered turning it into a contrib module, but that seems > like the worst of both worlds.) > True. I'm fine with deprecating / not improving geqo. What would worry me is incompatibility, i.e. if geqo could not support some features. I'm thinking of storage engines in MySQL not supporting some features, leading to a mine field for users. Producing poor plans is fine, IMO. >> This reminds me the proposals to have a GUC that'd determine how much >> effort should the planner invest into various optimizations. For OLTP it >> might be quite low, for large OLAP queries it'd be economical to spend >> more time trying some more expensive optimizations. >> >> The challenge of course is how / in what units / to define the budget, >> so that it's meaningful and understandable for users. Not sure if >> "number of join rels generated" will be clear enough for users. But it >> seems good enough for PoC / development, and hopefully people won't have >> to tweak it very often. > > I'm also in favor of having some type of "planner effort" or "OLTP to > OLAP spectrum" guc, but I'm not yet sure whether it'd be better to have > it separate or to couple the joinrel budget to it. 
If we go that route, > I imagine there'll be many things that planner_effort changes that we > don't want to give a separate knob for. And, I hope with graceful > degradation and a good enough heuristic search, it won't be necessary to > change in most cases. > Yeah. I'm really worried about having a zillion separate GUC knobs for tiny parts of the code. That's impossible to tune, and it also exposes details about the implementation. And some people just can't resist touching all the available options ;-) The thing I like about JIT tunables is that it's specified in "cost" which the users are fairly familiar with. Having another GUC with an entirely different unit is not great. But as I said - it seems perfectly fine for PoC / development, and we can revisit that later. Adding some sort of "planner effort" or multiple optimization passes is a huge project on its own. >> I haven't read the [Kossmann & Stocker, 2000] paper yet, but the >> [Neumann, 2018] paper seems to build on it, and it seems to work with >> much larger subtrees of the join tree than k=5. > > Right, in particular it builds on "IDP-2" from Kossmann & Stocker. Okay, > so
Re: Transactions involving multiple postgres foreign servers, take 2
On Sun, Jun 13, 2021 at 10:04 PM tsunakawa.ta...@fujitsu.com wrote: > I know sending a commit request may get an error from various underlying > functions, but we're talking about the client side, not the Postgres's server > side that could unexpectedly ereport(ERROR) somewhere. So, the new FDW > commit routine won't lose control and can return an error code as its return > value. For instance, the FDW commit routine for DBMS-X would typically be: > > int > DBMSXCommit(...) > { > int ret; > > /* extract info from the argument to pass to xa_commit() */ > > ret = DBMSX_xa_commit(...); > /* This is the actual commit function which is exposed to the app > server (e.g. Tuxedo) through the xa_commit() interface */ > > /* map xa_commit() return values to the corresponding return values > of the FDW commit routine */ > switch (ret) > { > case XA_RMERR: > ret = ...; > break; > ... > } > > return ret; > } Well, we're talking about running this commit routine from within CommitTransaction(), right? So I think it is in fact running in the server. And if that's so, then you have to worry about how to make it respond to interrupts. You can't just call some functions DBMSX_xa_commit() and wait for infinite time for it to return. Look at pgfdw_get_result() for an example of what real code to do this looks like. > So, we need to design how commit behaves from the user's perspective. That's > the functional design. We should figure out what's the desirable response of > commit first, and then see if we can implement it or have to compromise in > some way. I think we can reference the X/Open TX standard and/or JTS (Java > Transaction Service) specification (I haven't had a chance to read them yet, > though.) Just in case we can't find the requested commit behavior in the > volcano case from those specifications, ... 
(I'm hesitant to say this because > it may be hard,) it's desirable to follow representative products such as > Tuxedo and GlassFish (the reference implementation of Java EE specs.) Honestly, I am not quite sure what any specification has to say about this. We're talking about what happens when a user does something with a foreign table and then types COMMIT. That's all about providing a set of behaviors that are consistent with how PostgreSQL works in other situations. You can't negotiate away the requirement to handle errors in a way that works with PostgreSQL's infrastructure, or the requirement that any lengthy operation handle interrupts properly, by appealing to a specification. > Concurrent transactions are serialized at the resolver. I heard that the > current patch handles 2PC like this: the TM (transaction manager in Postgres > core) requests prepare to the resolver, the resolver sends prepare to the > remote server and wait for reply, the TM gets back control from the resolver, > TM requests commit to the resolver, the resolver sends commit to the remote > server and wait for reply, and TM gets back control. The resolver handles > one transaction at a time. That sounds more like a limitation of the present implementation than a fundamental problem. We shouldn't reject the idea of having a resolver process handle this just because the initial implementation might be slow. If there's no fundamental problem with the idea, parallelism and concurrency can be improved in separate patches at a later time. It's much more important at this stage to reject ideas that are not theoretically sound. -- Robert Haas EDB: http://www.enterprisedb.com
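Robert's pointer to pgfdw_get_result() is about shape, not API: a commit routine run inside CommitTransaction() must not block indefinitely inside a foreign library call, but should poll the remote operation and give interrupts a chance between attempts. A standalone sketch of that pattern (the result codes, callback type, and fake poll function are all invented for illustration; real server code would use WaitLatch() and CHECK_FOR_INTERRUPTS() instead of this simulation):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative result codes for a nonblocking remote-commit attempt. */
typedef enum { RC_WOULDBLOCK, RC_OK, RC_ERROR } rc_t;

typedef rc_t (*poll_fn) (void *state);

/*
 * Sketch of an interruptible commit: poll the remote operation instead of
 * blocking inside it, and check for interrupts between attempts.
 */
static rc_t
commit_with_interrupts(poll_fn poll, void *state, volatile bool *interrupted)
{
    for (;;)
    {
        if (*interrupted)
            return RC_ERROR;    /* real code: process the interrupt, abort cleanly */

        rc_t        rc = poll(state);

        if (rc != RC_WOULDBLOCK)
            return rc;
        /* real code: WaitLatch() on socket readiness rather than spinning */
    }
}

/* A fake remote that needs three polls before the commit completes. */
static rc_t
fake_poll(void *state)
{
    int        *calls = (int *) state;

    return (++(*calls) < 3) ? RC_WOULDBLOCK : RC_OK;
}
```

The key property is that control returns to the loop between polls, so a cancel request cannot leave the backend stuck inside DBMSX_xa_commit() forever.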
Re: Race condition in recovery?
On Sat, Jun 12, 2021 at 10:20 AM Tom Lane wrote: > Andrew Dunstan writes: > > I have pushed a fix, tested on a replica of fairywren/drongo, > > This bit seems a bit random: > > # WAL segment, this is enough to guarantee that the history file was > # archived. > my $archive_wait_query = > - "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM > pg_stat_archiver;"; > + "SELECT coalesce('$walfile_to_be_archived' <= last_archived_wal, false) " . > + "FROM pg_stat_archiver"; > $node_standby->poll_query_until('postgres', $archive_wait_query) >or die "Timed out while waiting for WAL segment to be archived"; > my $last_archived_wal_file = $walfile_to_be_archived; > > I wonder whether that is a workaround for the poll_query_until bug > I proposed to fix at [1]. I found that a bit random too, but it wasn't the only part of the patch I found a bit random. Like, what can this possibly be doing? +if ($^O eq 'msys') +{ +$perlbin = TestLib::perl2host(dirname($^X)) . '\\' . basename($^X); +} The idea here is apparently that on msys, the directory name that is part of $^X needs to be passed through perl2host, but the file name that is part of the same $^X needs to NOT be passed through perl2host. Is $^X really that broken? If so, I think some comments are in order. +local $ENV{PERL_BADLANG}=0; Similarly here. There's not a single other reference to PERL_BADLANG in the repository, so if we need this one here, there should be a comment explaining why this is different from all the places where we don't need it. On those occasions when I commit TAP test cases, I do try to think about whether they are going to be portable, but I find these kinds of changes indistinguishable from magic. -- Robert Haas EDB: http://www.enterprisedb.com
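The coalesce() in the quoted hunk matters because, before any segment has been archived, pg_stat_archiver.last_archived_wal is NULL, and a comparison against NULL yields NULL rather than false. A standalone sketch of the three-valued logic involved (the enum and helper names are invented; this is not PostgreSQL code):

```c
#include <assert.h>
#include <string.h>

/* SQL-style three-valued boolean: comparisons with NULL yield NULL. */
typedef enum { B_FALSE, B_TRUE, B_NULL } bool3;

/* "'$walfile' <= last_archived_wal", where last_archived_wal may be NULL. */
static bool3
le_nullable(const char *walfile, const char *last_archived)
{
    if (last_archived == NULL)
        return B_NULL;
    return strcmp(walfile, last_archived) <= 0 ? B_TRUE : B_FALSE;
}

/* coalesce(x, false): force a definite 'f' instead of a NULL result. */
static bool3
coalesce_false(bool3 x)
{
    return (x == B_NULL) ? B_FALSE : x;
}
```

With the coalesce, the polled query always returns a real boolean, which is what a poll-until-true loop can safely test.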
Re: [bug?] Missed parallel safety checks, and wrong parallel safety
On Mon, Jun 14, 2021 at 2:32 AM Amit Kapila wrote: > Why do you think we don't need to check index AM functions? Say we > have an index expression that uses function and if its parallel safety > is changed then probably that also impacts whether we can do insert in > parallel. Because otherwise, we will end up executing some parallel > unsafe function in parallel mode during index insertion. I'm not saying that we don't need to check index expressions. I agree that we need to check those. The index AM functions are things like btint4cmp(). I don't think that a function like that should ever be parallel-unsafe. > Yeah, this could be one idea but I think even if we use pg_proc OID, > we still need to check all the rel cache entries to find which one > contains the invalidated OID and that could be expensive. I wonder > can't we directly find the relation involved and register invalidation > for the same? We are able to find the relation to which trigger > function is associated during drop function via findDependentObjects > by scanning pg_depend. Assuming, we are able to find the relation for > trigger function by scanning pg_depend, what kinds of problems do we > envision in registering the invalidation for the same? I don't think that finding the relation involved and registering an invalidation for the same will work properly. Suppose there is a concurrently-running transaction which has created a new table and attached a trigger function to it. You can't see any of the catalog entries for that relation yet, so you can't invalidate it, but invalidation needs to happen. Even if you used some snapshot that can see those catalog entries before they are committed, I doubt it fixes the race condition. 
You can't hold any lock on that relation, because the creating transaction holds AccessExclusiveLock, but the whole invalidation mechanism is built around the assumption that the sender puts messages into the shared queue first and then releases locks, while the receiver first acquires a conflicting lock and then processes messages from the queue. Without locks, that synchronization algorithm can't work reliably. As a consequence of all that, I believe that, not just in this particular case but in general, the invalidation message needs to describe the thing that has actually changed, rather than any derived property. We can make invalidations that say "some function's parallel-safety flag has changed" or "this particular function's parallel-safety flag has changed" or "this particular function has changed in some way" (this one, we have already), but anything that involves trying to figure out what the consequences of such a change might be and saying "hey, you, please update XYZ because I changed something somewhere that could affect that" is not going to be correct. > I think we probably need to worry about the additional cost to find > dependent objects and if there are any race conditions in doing so as > pointed out by Tom in his email [1]. The concern related to cost could > be addressed by your idea of registering such an invalidation only > when the user changes the parallel safety of the function which we > don't expect to be a frequent operation. Now, here the race condition, > I could think of could be that by the time we change parallel-safety > (say making it unsafe) of a function, some of the other sessions might > have already started processing insert on a relation where that > function is associated via trigger or check constraint in which case > there could be a problem. I think to avoid that we need to acquire an > Exclusive lock on the relation as we are doing in Rename Policy kind > of operations. 
Well, the big issue here is that we don't actually lock functions while they are in use. So there's absolutely nothing that prevents a function from being altered in any arbitrary way, or even dropped, while code that uses it is running. I don't really know what happens in practice if you do that sort of thing: can you get the same query to run with one function definition for the first part of execution and some other definition for the rest of execution? I tend to doubt it, because I suspect we cache the function definition at some point. If that's the case, caching the parallel-safety marking at the same point seems OK too, or at least no worse than what we're doing already. But on the other hand if it is possible for a query's notion of the function definition to shift while the query is in flight, then this is just another example of that and no worse than any other. Instead of changing the parallel-safety flag, somebody could redefine the function so that it divides by zero or produces a syntax error and kaboom, running queries break. Either way, I don't see what the big deal is. As long as we make the handling of parallel-safety consistent with other ways the function could be concurrently redefined, it won't suck any more than the current system already does, or in any fundamentally
Re: Use extended statistics to estimate (Var op Var) clauses
> On Jun 13, 2021, at 1:28 PM, Tomas Vondra > wrote: > > Here is a slightly updated version of the patch Thanks for taking this up again! Applying the new test cases from your patch, multiple estimates have gotten better. That looks good. I wrote a few extra test cases and saw no change, which is fine. I was looking for regressions where the estimates are now worse than before. Do you expect there to be any such cases? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Delegating superuser tasks to new security roles
> On Jun 14, 2021, at 5:51 AM, torikoshia wrote: > > Thanks for working on this topic, I appreciate it! Thank you for taking a look! > BTW, do these patches enable non-superusers to create user with > bypassrls? No, I did not break out the ability to create such users. > Since I failed to apply the patches and didn't test them, > I may have overlooked something but I didn't find the > corresponding codes. Do you believe that functionality should be added? I have not thought much about that issue. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Replication protocol doc fix
On Fri, Jun 11, 2021 at 6:12 PM Jeff Davis wrote: > On Fri, 2021-06-11 at 16:57 -0400, Robert Haas wrote: > > My impression was that CopyBoth can be initiated either way, > > The docs say: "In either physical replication or logical replication > walsender mode, only the simple query protocol can be used." Is there > some way to initiate CopyBoth outside of walsender? Currently, no, or at least not to my knowledge. I just meant that there seems to be nothing in the protocol specification which prevents CopyBothResponse from being sent in response to a query sent using the extended protocol. > > but if > > you use the extended query protocol, then the result is a hopeless > > mess, because the protocol is badly designed: > > > https://www.postgresql.org/message-id/ca+tgmoa4ea+cpxaigqmebp9xisvd3ze9dbvnbzevx9ucmiw...@mail.gmail.com > > It seems like you're saying that CopyIn and CopyBoth are both equally > broken in extended query mode. Is that right? Yeah. > > But I think you're correct in saying that the discard-until-Sync > > behavior only happens if the extended query protocol is used, so I > > agree that the current text is wrong. > > Should we just document how CopyBoth works with the simple query > protocol, or should we make it match the CopyIn docs? I think it would make sense to make it match the CopyIn docs. Possibly the CopyOut docs should be made more similar as well. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Failure in subscription test 004_sync.pl
On 2021-Jun-14, Amit Kapila wrote:
> On Mon, Jun 14, 2021 at 10:41 AM Masahiko Sawada wrote:
> > The condition should be the opposite; we should raise the error when
> > 'nowait' is true. I think this is the cause of the test failure. Even
> > if DROP SUBSCRIPTION tries to drop the slot with the WAIT option, we
> > don't wait but raise the error.
>
> Yes, this should fix the recent buildfarm failures. Alvaro, would you
> like to take care of this?

Ugh, thanks for CCing me. Yes, I'll get this fixed ASAP.

--
Álvaro Herrera        Valdivia, Chile
"Take the book your religion considers the right one to find the prayer
that brings peace to your soul. Then reboot the computer and see if it
works" (Carlos Duclós)
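The polarity fix Sawada-san describes — raise only when 'nowait' is true — can be sketched in isolation. This is an illustrative sketch with hypothetical names, not the actual slot-drop code:

```python
def wait_until_free():
    # Stand-in for blocking until the slot's owner releases it.
    pass

def acquire_slot(busy, nowait):
    # Correct polarity per the fix: error out only when the caller asked
    # not to wait (nowait=True) and the slot is in use; otherwise block.
    # The reported bug had this condition inverted, erroring on the
    # waiting path instead.
    if busy:
        if nowait:
            raise RuntimeError("replication slot is active for another process")
        wait_until_free()
    return "acquired"
```

With the inverted test, a DROP SUBSCRIPTION that asked to wait would have raised immediately, which matches the buildfarm symptom.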
Re: [Proposal] Add accumulated statistics for wait event
Hi Andres, hi all,

First, thank you for your feedback!

Please find attached a patch implementing accumulated wait event stats from the backend point of view only. As I wrote when I reviewed and rebased the existing patch, I was uncomfortable with the global approach. I still volunteer to work on/review the original approach if required. See below for comments and some more explanations about what I think might be improvements over the previous patch.

On Fri, 11 Jun 2021 12:18:07 -0700 Andres Freund wrote:
> On 2021-06-05 00:53:44 +0200, Jehan-Guillaume de Rorthais wrote:
> > From 88c2779679c5c9625ca5348eec0543daab5ccab4 Mon Sep 17 00:00:00 2001
> > From: Jehan-Guillaume de Rorthais
> > Date: Tue, 1 Jun 2021 13:25:57 +0200
> > Subject: [PATCH 1/2] Add pg_stat_waitaccum view.
> >
> > pg_stat_waitaccum shows counts and duration of each wait events.
> > Each backend/backgrounds counts and measures the time of wait event
> > in every pgstat_report_wait_start and pgstat_report_wait_end. They
> > store those info into their local variables and send to Statistics
> > Collector. We can get those info via Statistics Collector.
> >
> > For reducing overhead, I implemented statistic hash instead of
> > dynamic hash. I also implemented track_wait_timing which
> > determines wait event duration is collected or not.
>
> I object to adding this overhead. The whole selling point for wait
> events was that they are low overhead. I since spent time reducing the
> overhead further, because even just the branches for checking if
> track_activity is enabled are measurable (225a22b19ed).

Agreed. The previous patch I rebased was meant to review it and reopen this discussion; I even added a small FIXME in pgstat_report_wait_end and pgstat_report_wait_start about your work:

  //FIXME: recent patch to speed up this call.

In the attached patch, I tried to fix this by using a kind of internal hook for pgstat_report_wait_start and pgstat_report_wait_end.
This allows wait events to be "instrumented" only when required, on the fly, dynamically. Moreover, I replaced the hash structure with a simple static array for faster access.

> > From ddb1adc5cd9acc9bc9de16d0cf057124b09fe1e3 Mon Sep 17 00:00:00 2001
> > From: Jehan-Guillaume de Rorthais
> > Date: Fri, 4 Jun 2021 18:14:51 +0200
> > Subject: [PATCH 2/2] [POC] Change measuring method of wait event time
> >  from INSTR_TIME to rdtsc.
> >
> > This patch changes measuring method of wait event time from INSTR_TIME
> > (which uses gettimeofday or clock_gettime) to rdtsc. This might reduce
> > the overhead of measuring overhead.
> >
> > Any supports like changing clock cycle to actual time or error handling
> > are not currently implemented.
>
> rdtsc is a serializing (*) instruction - that's the expensive part. On linux
> clock_gettime() doesn't actually need a syscall. While the vdso call
> implies a bit of overhead over a raw rdtsc, it's a relatively small part
> of the overhead. See
> https://www.postgresql.org/message-id/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de

I chose to remove the whole rdtsc part from my patch, as it wasn't clear how much faster it is compared to the simpler vDSO functions, or how to accurately derive a human-readable time from it.

About my take on $subject: for the sake of simplicity of this PoC, I added the instrumentation to log_statement_stats. Despite the per-query context of the reported log, these really are accumulated stats. The patch also updates pg_stat_get_waitaccum() to be able to report the accumulated wait events from your interactive or batch session.

So, using my previous fe-time demo client, you can test it with:

  PGOPTIONS="--log_statement_stats=on" ./fe-time 100

From the logs, I now have (notice the last line):

LOG:  duration: 3837.194 ms  statement: SELECT * FROM pgbench_accounts
LOG:  QUERY STATISTICS
DETAIL:  ! system usage stats:
!       0.087444 s user, 0.002106 s system, 3.837202 s elapsed
!       [0.087444 s user, 0.003974 s system total]
!       25860 kB max resident size
!       0/0 [0/0] filesystem blocks in/out
!       0/303 [0/697] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       4/18 [5/18] voluntary/involuntary context switches
!       Client/ClientWrite 4 calls, 3747102 us elapsed

Using pgbench scale factor 10, the copy query for pgbench_accounts looks like:

LOG:  duration: 2388.081 ms  statement: copy pgbench_accounts from stdin
LOG:  QUERY STATISTICS
DETAIL:  ! system usage stats:
!       1.373756 s user, 0.252860 s system, 2.388100 s elapsed
!       [1.397015 s user, 0.264951 s system total]
!       37788 kB max resident size
!       0/641584 [0/642056] filesystem blocks in/out
!       194/4147 [195/4728] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
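Andres's point in this thread — that clock_gettime() on Linux is typically served from the vDSO rather than a real syscall — can be eyeballed from userspace. A rough micro-timing sketch via Python's wrapper of the same libc call (Unix-only; absolute numbers are machine-dependent assumptions, not measurements from the patch):

```python
import time

# time.clock_gettime() wraps the libc clock_gettime() discussed above.
# On Linux, CLOCK_MONOTONIC reads usually go through the vDSO, so each
# call costs tens of nanoseconds instead of a full syscall round-trip.
N = 100_000
start = time.clock_gettime(time.CLOCK_MONOTONIC)
for _ in range(N):
    time.clock_gettime(time.CLOCK_MONOTONIC)
end = time.clock_gettime(time.CLOCK_MONOTONIC)
per_call_ns = (end - start) / N * 1e9
print(f"~{per_call_ns:.0f} ns per clock_gettime call")
```

A raw rdtsc would still be cheaper per read, but as noted above it needs cycle-to-time conversion and careful handling to be usable.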
Re: A new function to wait for the backend exit after termination
On Sat, Jun 12, 2021 at 10:07 AM Noah Misch wrote:
> On Fri, Jun 11, 2021 at 08:54:08PM -0500, Justin Pryzby wrote:
> > On Sat, Jun 05, 2021 at 12:08:01PM -0700, Noah Misch wrote:
> > > My preference is to remove pg_wait_for_backend_termination(). The
> > > use case that prompted this thread used pg_terminate_backend(pid, 18);
> > > it doesn't need pg_wait_for_backend_termination().
> >
> > Is this an Opened Issue ?
>
> An Open Item? Not really, since there's no objective defect. Nonetheless,
> the attached is what I'd like to use.

Thanks. +1 to remove the pg_wait_for_backend_termination function. The patch basically looks good to me. I'm attaching an updated patch: I corrected a minor typo in the commit message and took the docs and code comment changes suggested by Justin Pryzby.

Please note that the release notes still have the headline "Add functions to wait for backend termination" from the original commit that added pg_wait_for_backend_termination. With its removal, I'm not quite sure whether we retain that commit message or tweak it to something like "Add optional timeout parameter to pg_terminate_backend".

With Regards,
Bharath Rupireddy.

Attachment: v2-0001-Remove-pg_wait_for_backend_termination-function.patch
pg_dumpall --exclude-database case folding
[discussion transferred from psql-performance]

Summary: pg_dumpall and pg_dump fold non-quoted command-line patterns to lower case

Tom Lane writes:

Andrew Dunstan writes:
> On 6/10/21 2:23 PM, Andrew Dunstan wrote:
>> Ouch. That looks like a plain old bug. Let's fix it. IIRC I just used
>> the same logic that we use for pg_dump's --exclude-* options, so we need
>> to check if they have similar issues.

> Peter Eisentraut has pointed out to me that this is documented, albeit a
> bit obscurely for pg_dumpall. But it is visible on the pg_dump page.

Hmm.

> Nevertheless, it's a bit of a POLA violation as we've seen above, and
> I'd like to get it fixed, if there's agreement, both for this pg_dumpall
> option and for pg_dump's pattern matching options.

+1, but the -performance list isn't really where to hold that discussion. Please start a thread on -hackers.

regards, tom lane
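The behavior summarized above follows SQL identifier rules: outside double quotes the pattern is folded to lower case, the way the server folds an unquoted identifier. A simplified sketch of that folding rule — not the actual processSQLNamePattern implementation, which also handles regex metacharacters and doubled quotes:

```python
def fold_pattern(pattern: str) -> str:
    # Letters outside double quotes are downcased, matching how an
    # unquoted SQL identifier is folded; inside "..." case is preserved
    # and the quote characters themselves are consumed.
    out = []
    in_quotes = False
    for ch in pattern:
        if ch == '"':
            in_quotes = not in_quotes
        else:
            out.append(ch if in_quotes else ch.lower())
    return ''.join(out)
```

So `--exclude-database=TESTDB` behaves like `testdb`, while `--exclude-database='"TESTDB"'` keeps the case — the POLA violation under discussion.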
Re: recent failures on lorikeet
Andrew Dunstan writes:
> I've been looking at the recent spate of intermittent failures on my
> Cygwin animal lorikeet. Most of them look something like this, where
> there's 'VACUUM FULL pg_class' and an almost simultaneous 'CREATE TABLE'
> which fails.

Do you have any idea what "exit code 127" signifies on that platform?

(BTW, not all of them look like that; many are reported as plain segfaults.)

I hadn't spotted the association with a concurrent "VACUUM FULL pg_class" before; that does seem interesting.

> Getting stack traces on this platform can be very difficult. I'm going
> to try forcing complete serialization of the regression tests
> (MAX_CONNECTIONS=1) to see if the problem goes away. Any other
> suggestions might be useful. Note that we're not getting the same issue
> on REL_13_STABLE, where the same group of tests runs together (inherit,
> typed_table, vacuum).

If it does go away, that'd be interesting, but I don't see how it gets us any closer to a fix. Seems like a stack trace is a necessity to narrow it down.

regards, tom lane
Re: unnesting multirange data types
On Mon, Jun 14, 2021 at 3:50 PM Jonathan S. Katz wrote:
> On 6/13/21 5:18 PM, Alexander Korotkov wrote:
> >> "Expands an array into a set of rows. The array's elements are read out
> >> in storage order."
> >>
> >> If we tweaked the multirange "unnest" function, we could change it to:
> >>
> >> +
> >> +Expands a multirange into a set of rows.
> >> +The ranges are read out in storage order (ascending).
> >> +
> >>
> >> to match the array "unnest" function docs, or
> >>
> >> +
> >> +Expands a multirange into a set of rows that each
> >> +contain an individual range.
> >> +The ranges are read out in storage order (ascending).
> >> +
> >>
> >> to be a bit more specific. However, I think this is also bordering on
> >> overengineering the text, given there has been a lack of feedback on the
> >> "unnest" array function description being confusing.
> >
> > I think it's not necessary to talk about rows here. Our
> > documentation already has a number of examples where we
> > describe a set of returned values without speaking about rows, including:
> > json_array_elements, json_array_elements_text, json_object_keys,
> > pg_listening_channels, pg_tablespace_databases...
>
> I do agree -- my main point was that I don't think we need to change
> anything. I proposed alternatives just to show some other ways of
> looking at it. But as I mentioned, at this point I think it's
> overengineering the text.
>
> If folks are good with the method + code, I think this is ready.

Cool, thank you for the summary. I'll wait for two days since I've published the last revision of the patch [1] (that comes tomorrow), and push it if no new issues arise.

Links
1. https://www.postgresql.org/message-id/CAPpHfdvG%3DJR5kqmZx7KvTyVgtQePX0QJ09TO1y3sN73WOfJf1Q%40mail.gmail.com

--
Regards,
Alexander Korotkov
Re: contrib/pg_visibility fails regression under CLOBBER_CACHE_ALWAYS
On Mon, Jun 7, 2021 at 10:01 PM Tomas Vondra wrote:
>
> On 6/7/21 2:11 PM, Masahiko Sawada wrote:
> > On Mon, Jun 7, 2021 at 4:30 PM Masahiko Sawada wrote:
> >>
> >> On Mon, Jun 7, 2021 at 11:15 AM Tom Lane wrote:
> >>>
> >>> husky just reported $SUBJECT:
> >>>
> >>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=husky=2021-06-05%2013%3A42%3A17
> >>>
> >>> and I find I can reproduce that locally:
> >>>
> >>> diff -U3 /home/postgres/pgsql/contrib/pg_visibility/expected/pg_visibility.out /home/postgres/pgsql/contrib/pg_visibility/results/pg_visibility.out
> >>> --- /home/postgres/pgsql/contrib/pg_visibility/expected/pg_visibility.out  2021-01-20 11:12:24.854346717 -0500
> >>> +++ /home/postgres/pgsql/contrib/pg_visibility/results/pg_visibility.out   2021-06-06 22:12:07.948890104 -0400
> >>> @@ -215,7 +215,8 @@
> >>>   0 | f | f
> >>>   1 | f | f
> >>>   2 | t | t
> >>> -(3 rows)
> >>> + 3 | t | t
> >>> +(4 rows)
> >>>
> >>>  select * from pg_check_frozen('copyfreeze');
> >>>   t_ctid
> >>> @@ -235,7 +236,8 @@
> >>>   0 | t | t
> >>>   1 | f | f
> >>>   2 | t | t
> >>> -(3 rows)
> >>> + 3 | t | t
> >>> +(4 rows)
> >>>
> >>>  select * from pg_check_frozen('copyfreeze');
> >>>   t_ctid
> >>>
> >>> The test cases that are failing date back to January (7db0cd2145f),
> >>> so I think this is some side-effect of a recent commit, but I have
> >>> no idea which one.
> >>
> >> It seems like the recent revert (8e03eb92e9a) is relevant.
> >>
> >> After committing 7db0cd2145f we had the same regression test failure
> >> in January[1]. Then we fixed that issue by 39b66a91b. But since we
> >> recently reverted most of 39b66a91b, the same issue happened again.
> >
> > So the cause of this failure seems the same as before. The failed test is:
> >
> > begin;
> > truncate copyfreeze;
> > copy copyfreeze from stdin freeze;
> > 1       '1'
> > 2       '2'
> > 3       '3'
> > 4       '4'
> > 5       '5'
> > \.
> > copy copyfreeze from stdin;
> > 6       '6'
> > \.
> > copy copyfreeze from stdin freeze;
> > 7       '7'
> > 8       '8'
> > 9       '9'
> > 10      '10'
> > 11      '11'
> > 12      '12'
> > \.
> > commit;
> >
> > If the target block cache is invalidated before the third COPY, we
> > will start to insert the frozen tuple into a new page, resulting in
> > adding two blocks in total during the third COPY. I think we still
> > need the following part of the reverted code so that we don't leave
> > the page partially empty after relcache invalidation:
> >
> > --- a/src/backend/access/heap/hio.c
> > +++ b/src/backend/access/heap/hio.c
> > @@ -407,19 +407,19 @@ RelationGetBufferForTuple(Relation relation, Size len,
> >        * target.
> >        */
> >       targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
> > -     }
> >
> > -     /*
> > -      * If the FSM knows nothing of the rel, try the last page before we give
> > -      * up and extend. This avoids one-tuple-per-page syndrome during
> > -      * bootstrapping or in a recently-started system.
> > -      */
> > -     if (targetBlock == InvalidBlockNumber)
> > -     {
> > -             BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
> > +             /*
> > +              * If the FSM knows nothing of the rel, try the last page before we
> > +              * give up and extend. This avoids one-tuple-per-page syndrome during
> > +              * bootstrapping or in a recently-started system.
> > +              */
> > +             if (targetBlock == InvalidBlockNumber)
> > +             {
> > +                     BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
> >
> > -     if (nblocks > 0)
> > -             targetBlock = nblocks - 1;
> > +                     if (nblocks > 0)
> > +                             targetBlock = nblocks - 1;
> > +             }
> >       }
> >
> > Attached the patch that brings back the above change.
>
> Thanks for the analysis! I think you're right - this bit should have
> been kept. Partial reverts are tricky :-(
>
> I'll get this fixed / pushed later today, after a bit more testing. I'd
> swear I ran tests with CCA, but it's possible I skipped contrib.

I had missed this mail. Thank you for pushing the fix!

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
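The fallback that the restored hunk implements — consult the FSM first, then try the relation's last existing page before extending — can be sketched in miniature. The simplified signature is hypothetical; the real logic lives in RelationGetBufferForTuple:

```python
def choose_target_block(fsm_suggestion, nblocks):
    # Mirrors the hunk quoted above: if the FSM knows nothing of the rel
    # (no suggestion), try the last existing page before giving up and
    # extending. This avoids one-tuple-per-page syndrome, and here it keeps
    # COPY FREEZE from abandoning a partially filled page after a relcache
    # invalidation discards the cached target block.
    if fsm_suggestion is not None:
        return fsm_suggestion
    if nblocks > 0:
        return nblocks - 1   # last existing page
    return None              # caller must extend the relation
```

Without the fallback, the third COPY in the failing test starts on a fresh page, which is why the visibility-map output gained an extra block.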
Position of ClientAuthentication hook
Hello hackers,

I have a doubt regarding the positioning of the ClientAuthentication hook in the function ClientAuthentication. In particular, in the hba error cases, i.e. uaReject or uaImplicitReject, it errors out before the hook is reached, so no functions attached to the authentication hook are called. Can't we process the hook function first and then error out?

--
Regards,
Rafia Sabih
Re: Delegating superuser tasks to new security roles
On 2021-05-26 05:33, Mark Dilger wrote:
> On May 13, 2021, at 12:30 PM, Mark Dilger wrote:
>> On May 13, 2021, at 12:18 PM, Jacob Champion wrote:
>>> On Thu, 2021-05-13 at 11:42 -0700, Mark Dilger wrote:
>>>> The distinction that Theme+Security would make is that capabilities
>>>> can be categorized by the area of the system:
>>>>   -- planner
>>>>   -- replication
>>>>   -- logging
>>>> ... but also by the security implications of what is being done:
>>>>   -- host
>>>>   -- schema
>>>>   -- network
>>>
>>> Since the "security" buckets are being used for both proposals -- how
>>> would you deal with overlap between them? When a GUC gives you enough
>>> host access to bleed into the schema and network domains, does it get
>>> all three attributes assigned to it, and thus require membership in
>>> all three roles?
>>
>> Yeah, from a security standpoint, pg_host_admin basically gives
>> everything away. I doubt service providers would give the "host" or
>> "network" security to their tenants, but they would probably consider
>> giving "schema" security to the tenants.
>>
>>> (Thanks, by the way, for this thread -- I think a "capability system"
>>> for superuser access is a great idea.)
>>
>> I am happy to work on this, and appreciate feedback.
>
> Please find attached five new patches, each intended to reduce the
> number of administrative tasks that require superuser privileges.
>
> v3-0001 adds a new pg_logical_replication role with permission to
> manage publications and subscriptions.
>
> v3-0002 adds a new pg_host_security role with permission to manage
> extensions, event triggers and tablespaces.
>
> v3-0003 adds a new pg_network_security role with permission to manage
> foreign servers and data wrappers.
>
> v3-0004 adds a new pg_database_security role with permission to
> perform many actions that would otherwise require superuser, so long
> as those actions do not compromise the security of the host or
> network. This role, along with pg_logical_replication, is intended to
> be safe to delegate to the tenant of a database provided as a service.
>
> v3-0005 associates all GUC variables with security roles and allows
> both SET and ALTER SYSTEM SET on those variables by users belonging to
> the necessary security role(s). This patch extends the significance of
> the pg_host_security, pg_network_security, and pg_database_security
> roles added in the previous patches, as those roles are associated
> with GUC variables that implicate the same security concerns.
>
> These patches likely still need some adjustment, as there are a large
> number of security-relevant permission decisions in here which some
> hackers may debate, but I think these are mature enough to solicit
> feedback. I admit right upfront that the regression tests
> guc_priv_admin and guc_priv_tenant in v3-0005 could be made to cover a
> subset of GUC variables rather than the full set of them, but I'm
> delaying pruning them down until I know if the rest of the patches are
> basically acceptable.

Thanks for working on this topic, I appreciate it!

BTW, do these patches enable non-superusers to create a user with bypassrls? Since I failed to apply the patches and didn't test them, I may have overlooked something, but I didn't find the corresponding code.

Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
Re: unnesting multirange data types
On 6/13/21 5:18 PM, Alexander Korotkov wrote:
>> "Expands an array into a set of rows. The array's elements are read out
>> in storage order."
>>
>> If we tweaked the multirange "unnest" function, we could change it to:
>>
>> +
>> +Expands a multirange into a set of rows.
>> +The ranges are read out in storage order (ascending).
>> +
>>
>> to match the array "unnest" function docs, or
>>
>> +
>> +Expands a multirange into a set of rows that each
>> +contain an individual range.
>> +The ranges are read out in storage order (ascending).
>> +
>>
>> to be a bit more specific. However, I think this is also bordering on
>> overengineering the text, given there has been a lack of feedback on the
>> "unnest" array function description being confusing.
>
> I think it's not necessary to talk about rows here. Our
> documentation already has a number of examples where we
> describe a set of returned values without speaking about rows, including:
> json_array_elements, json_array_elements_text, json_object_keys,
> pg_listening_channels, pg_tablespace_databases...

I do agree -- my main point was that I don't think we need to change anything. I proposed alternatives just to show some other ways of looking at it. But as I mentioned, at this point I think it's overengineering the text.

If folks are good with the method + code, I think this is ready.

Thanks,

Jonathan
Re: Fdw batch insert error out when set batch_size > 65535
On Mon, Jun 14, 2021, 5:33 PM Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> wrote:
> Okay. Here are the readings on my dev system:
> 1) on master with the existing test case with inserting 70K rows:
>    4263200 ms (71.05 min)
> 2) with Tomas's patch with the test case modified with 1500 table
>    columns and 50 rows (the majority of the time, ~30 min, it took in the
>    SELECT create_batch_tables(1500); statement; I measured this time
>    manually looking at the start and end time of the statement):
>    6649312 ms (110.8 min)
> 3) with my patch with the test case modified with 1600 table columns and
>    41 rows: 4003007 ms (66.71 min)
> 4) on master without the test case at all: 3770722 ms (62.84 min)

I forgot to mention one thing: I ran the above tests with CLOBBER_CACHE_ALWAYS.

Regards,
Bharath Rupireddy.
recent failures on lorikeet
I've been looking at the recent spate of intermittent failures on my Cygwin animal lorikeet. Most of them look something like this, where there's 'VACUUM FULL pg_class' and an almost simultaneous 'CREATE TABLE' which fails:

2021-06-14 05:04:00.220 EDT [60c71b7f.e8bf:60] pg_regress/vacuum LOG:  statement: VACUUM FULL pg_class;
2021-06-14 05:04:00.222 EDT [60c71b80.e8c0:7] pg_regress/typed_table LOG:  statement: CREATE TABLE persons OF person_type;
2021-06-14 05:04:00.232 EDT [60c71b80.e8c1:3] pg_regress/inherit LOG:  statement: CREATE TABLE a (aa TEXT);
*** starting debugger for pid 59584, tid 9640
2021-06-14 05:04:14.678 EDT [60c71b53.e780:4] LOG:  server process (PID 59584) exited with exit code 127
2021-06-14 05:04:14.678 EDT [60c71b53.e780:5] DETAIL:  Failed process was running: CREATE TABLE persons OF person_type;
2021-06-14 05:04:14.678 EDT [60c71b53.e780:6] LOG:  terminating any other active server processes

Getting stack traces on this platform can be very difficult. I'm going to try forcing complete serialization of the regression tests (MAX_CONNECTIONS=1) to see if the problem goes away. Any other suggestions might be useful. Note that we're not getting the same issue on REL_13_STABLE, where the same group of tests runs together (inherit, typed_table, vacuum).

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
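On the "exit code 127" in the log above: in POSIX shells that status conventionally means "command not found", though for a crashed Cygwin backend it may simply be the raw wait status rather than anything meaningful. A quick illustration of the shell convention (an aside, not a diagnosis):

```python
import subprocess

# POSIX shells reserve exit status 127 for "command not found"; a backend
# exiting with 127, as in the log above, could instead be an artifact of
# how Cygwin reports the wait status of a crashed process.
proc = subprocess.run(["sh", "-c", "definitely_not_a_real_command_xyz"],
                      capture_output=True)
print(proc.returncode)
```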
Re: RFC: Logging plan of the running query
On 2021-06-11 01:20, Bharath Rupireddy wrote:

Thanks for your review!

On Wed, Jun 9, 2021 at 1:14 PM torikoshia wrote:
>> Updated the patch.
>
> Thanks for the patch. Here are some comments on the v3 patch:
>
> 1) We could just say "Requests to log query plan of the presently
> running query of a given backend along with an untruncated query string
> in the server logs." instead of:
>
>   +They will be logged at LOG message level and
>   +will appear in the server log based on the log
>   +configuration set (See linkend="runtime-config-logging"/>

Actually, this explanation is almost the same as that of pg_log_backend_memory_contexts(). Do you think we should change both of them? I think it may be too detailed, but accurate.

> 2) It's better to do the below; for reference see how pg_backend_pid,
> pg_terminate_backend, pg_relpages and so on are used in the tests:
>
>   SELECT pg_log_current_plan(pg_backend_pid());
>
> rather than using the function in the FROM clause:
>
>   SELECT * FROM pg_log_current_plan(pg_backend_pid());
>
> If okay, also change it for pg_log_backend_memory_contexts.

Agreed.

> 3) Since most of the code looks the same in pg_log_backend_memory_contexts
> and pg_log_current_plan, I think we can have a common function, something
> like below:

Agreed. I'll do some refactoring.
> bool
> SendProcSignalForLogInfo(ProcSignalReason reason)
> {
>     Assert(reason == PROCSIG_LOG_MEMORY_CONTEXT ||
>            reason == PROCSIG_LOG_CURRENT_PLAN);
>
>     if (!superuser())
>     {
>         if (reason == PROCSIG_LOG_MEMORY_CONTEXT)
>             errmsg("must be a superuser to log memory contexts")
>         else if (reason == PROCSIG_LOG_CURRENT_PLAN)
>             errmsg("must be a superuser to log plan of the running query")
>     }
>
>     if (SendProcSignal(pid, reason, proc->backendId) < 0)
>     {
>     }
> }
>
> Then we could just do:
>
> Datum
> pg_log_backend_memory_contexts(PG_FUNCTION_ARGS)
> {
>     PG_RETURN_BOOL(SendProcSignalForLogInfo(PROCSIG_LOG_MEMORY_CONTEXT));
> }
>
> Datum
> pg_log_current_plan(PG_FUNCTION_ARGS)
> {
>     PG_RETURN_BOOL(SendProcSignalForLogInfo(PROCSIG_LOG_CURRENT_PLAN));
> }
>
> We can have SendProcSignalForLogInfo defined in procsignal.c and
> procsignal.h.
>
> 4) I think we can have a sample function usage, how it returns a true
> value, and how the plan looks for a simple query (select 1 or some other
> simple/complex generic query, or simply
> select pg_log_current_plan(pg_backend_pid());) in the documentation,
> much like pg_log_backend_memory_contexts.

+1.

> 5) Instead of just showing the true return value of the function
> pg_log_current_plan in the sql test, which just shows that the signal is
> sent but not that the backend has processed that signal and logged the
> plan, I think we can add a test using the TAP framework.

I once made a TAP test for pg_log_backend_memory_contexts(), but it seemed overkill and we agreed that just ensuring the function works was appropriate, as in the discussion below:

https://www.postgresql.org/message-id/bbecd722d4f8e261b350186ac4bf68a7%40oss.nttdata.com

> 6) Do we unnecessarily need to signal processes such as the autovacuum
> launcher/workers, logical replication launcher/workers, wal senders, wal
> receivers and so on, only to emit a "PID %d is not executing queries now"
> message? Moreover, those processes will be waiting in loops for timeouts
> to occur; as soon as they wake up, do they need to process this extra
> uninformative signal? We could choose to not signal those processes at
> all, which might or might not be possible. Otherwise, we could just emit
> messages like "process cannot run a query" in ProcessInterrupts.

Agreed. However, it needs to distinguish backends which can execute queries from other processes such as the autovacuum launcher, and I haven't come up with an easy way to do so. I'm going to think about it.

> 7) Instead of:
>     (errmsg("logging plan of the running query on PID %d\n%s",
> how about:
>     (errmsg("plan of the query running on backend with PID %d is:\n%s",

+1.

> 8) Instead of errmsg("PID %d is not executing queries now"), how about
> errmsg("Backend with PID %d is not running a query")?

+1.

> 9) We could just do:
>
> void
> ProcessLogCurrentPlanInterrupt(void)
> {
>     ExplainState *es;
>
>     LogCurrentPlanPending = false;
>
>     if (!ActivePortal || !ActivePortal->queryDesc)
>         errmsg("PID %d is not executing queries now");
>
>     es = NewExplainState();
>     ExplainQueryText();
>     ExplainPrintPlan();
>
> 10) How about renaming the function pg_log_current_plan to
> pg_log_query_plan or pg_log_current_query_plan?

+1.

> 11) What happens if pg_log_current_plan is called for a parallel worker?

AFAIU a parallel worker doesn't have an ActivePortal, so it would always emit the message 'PID %d is not executing queries now'. As in 6), it would be better to distinguish parallel workers from normal backends.

Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
Re: BUG #15293: Stored Procedure Triggered by Logical Replication is Unable to use Notification Events
Hi all!

I can still reproduce it on version 14beta1. I adapted a patch I found in this thread: https://github.com/seregayoga/postgres/commit/338bc33f2cf77edde7c45bfdfb9f39a92ec57eb8 . It solved this bug for me (tested with a simple Go program using https://github.com/lib/pq ). It would be nice to have this bug fixed. I'm not so familiar with the postgres code base, but I would be glad to help with testing.

> Monday, June 14, 2021 12:39 PM +03:00 from Daniel Danzberger <dan...@dd-wrt.com>:
>
> On Fri, 2019-06-28 at 17:22 -0400, Alvaro Herrera wrote:
>> On 2019-Feb-22, Robert Welin wrote:
>>
>>> I have reproduced this bug on PostgreSQL version 10.7 and 11.2
>>> using the steps described in Michael Powers' original report. The
>>> issue also still seems to be present even with the patch provided
>>> by Sergei Kornilov.
>>>
>>> Are there plans to address this issue any time soon or is there
>>> some way I can assist in fixing it? It would be great to have
>>> notifications from logical replication.
>>
>> This issue seems largely forgotten about :-(
>
> Are there any news on this issue? I can reproduce this bug up to now
> with version 12.
>
> --
> Regards
>
> Daniel Danzberger
> embeDD GmbH, Alter Postplatz 2, CH-6370 Stans

Best regards,
Sergey Fedchenko.
Re: Fdw batch insert error out when set batch_size > 65535
On Sun, Jun 13, 2021 at 9:28 PM Tomas Vondra wrote:
> On 6/13/21 5:25 PM, Bharath Rupireddy wrote:
> > On Sun, Jun 13, 2021 at 6:10 AM Alvaro Herrera wrote:
> >>
> >> On 2021-Jun-12, Tomas Vondra wrote:
> >>
> >>> There's one caveat, though - for regular builds the slowdown is pretty
> >>> much eliminated. But with valgrind it's still considerably slower. For
> >>> postgres_fdw the "make check" used to take ~5 minutes for me, now it
> >>> takes >1h. And yes, this is entirely due to the new test case which is
> >>> generating / inserting 70k rows. So maybe the test case is not worth it
> >>> after all, and we should get rid of it.
> >>
> >> Hmm, what if the table is made 1600 columns wide -- would inserting 41
> >> rows be sufficient to trigger the problem case? If it does, maybe it
> >> would reduce the runtime for valgrind/cache-clobber animals enough that
> >> it's no longer a concern.
> >
> > Yeah, that's a good idea. PSA patch that creates the table of 1600
> > columns and inserts 41 rows into the foreign table. If the batch_size
> > adjustment fix isn't there, we will hit the error. On my dev system,
> > postgres_fdw contrib regression test execution times with and without
> > the attached patch are 4.5 sec and 5.7 sec respectively.
>
> But we're discussing cases with valgrind and/or CLOBBER_CACHE_ALWAYS.

Okay. Here are the readings on my dev system:
1) on master with the existing test case inserting 70K rows: 4263200 ms (71.05 min)
2) with Tomas's patch with the test case modified to 1500 table columns and 50 rows (the majority of the time, ~30 min, was spent in the SELECT create_batch_tables(1500); statement; I measured this manually from the start and end time of the statement): 6649312 ms (110.8 min)
3) with my patch with the test case modified to 1600 table columns and 41 rows: 4003007 ms (66.71 min)
4) on master without the test case at all: 3770722 ms (62.84 min)

With Regards,
Bharath Rupireddy.
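For context on why 1600 columns times 41 rows trips the unfixed code: the protocol's Bind message carries a 16-bit parameter count, so columns times batch size must stay within 65535 (libpq's PQ_QUERY_PARAM_MAX_LIMIT). A sketch of the clamp the batch_size adjustment applies — the function name here is hypothetical, not the actual postgres_fdw code:

```python
PQ_QUERY_PARAM_MAX_LIMIT = 65535  # 16-bit parameter count on the wire

def effective_batch_size(requested, ncolumns):
    # Each batched row binds one parameter per column, so the batch is
    # clamped to keep the total parameter count within the protocol limit.
    if ncolumns <= 0:
        return requested
    return max(1, min(requested, PQ_QUERY_PARAM_MAX_LIMIT // ncolumns))
```

With 1600 columns the clamp yields 40, which is why inserting 41 rows exercises the adjustment while keeping the test tiny.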
RE: locking [user] catalog tables vs 2pc vs logical rep
On Friday, June 11, 2021 2:13 PM vignesh C wrote:
> Thanks for the updated patch:
> Few comments:
> 1) We have used Reordering and Clustering for the same command; we could
> rephrase similarly in both places.
>
>   +Reordering pg_class by CLUSTER
>   +command in a transaction.
>
>   +Clustering pg_trigger and decoding PREPARE
>   +TRANSACTION, if any published table have a trigger and any
>   +operations that will be decoded are conducted.
>
> 2) Here user_catalog_table should be user catalog table
>
>   +Executing TRUNCATE on user_catalog_table
>   +in a transaction.

Thanks for your review. Attached the patch set that addressed those two comments. I also fixed the commit message a bit in the 2PC-specific patch for HEAD. No other changes. Please check.

Best Regards,
Takamichi Osumi

Attachments:
HEAD_deadlock_documentation_of_logical_decoding_v06.patch
HEAD_with_2PC_deadlock_documentation_of_logical_decoding_v06.patch
PG10_deadlock_documentation_of_logical_decoding_v06.patch
PG11_deadlock_documentation_of_logical_decoding_v06.patch
PG12_deadlock_documentation_of_logical_decoding_v06.patch
PG13_deadlock_documentation_of_logical_decoding_v06.patch
PG96_deadlock_documentation_of_logical_decoding_v06.patch
Fwd: BUG #17056: Segmentation fault on altering the type of the foreign table column with a default
reposting to -hackers to get more eyeballs. Summary: only RELKIND_RELATION type relations should have attributes with atthasmissing/attmissingval Forwarded Message Subject:Re: BUG #17056: Segmentation fault on altering the type of the foreign table column with a default Date: Sat, 12 Jun 2021 21:59:24 -0400 From: Andrew Dunstan To: Andres Freund CC: Tom Lane , exclus...@gmail.com, pgsql-b...@lists.postgresql.org On 6/12/21 5:50 PM, Andres Freund wrote: > Hi, > > On 2021-06-12 17:40:38 -0400, Andrew Dunstan wrote: >> Ok, I think the attached is the least we need to do. Given this I >> haven't been able to induce a crash even when the catalog is hacked with >> bogus missing values on a foreign table. But I'm not 100% convinced I >> have fixed all the places that need to be fixed. > Hm. There's a few places that look at atthasmissing and just assume that > there's corresponding information about the missing field. And as far as > I can see the proposed changes in RelationBuildTupleDesc() don't unset > atthasmissing, they just prevent the constraint part of the tuple desc > from being filled. Wouldn't this cause problems if we reach code like > Yes, you're right. This version should take care of things better. Thanks for looking. 
cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index afa830d924..bcf80d865f 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -2161,6 +2161,13 @@ SetAttrMissing(Oid relid, char *attname, char *value) /* lock the table the attribute belongs to */ tablerel = table_open(relid, AccessExclusiveLock); + /* Don't do anything unless it's a RELKIND type relation */ + if (tablerel->rd_rel->relkind != RELKIND_RELATION) + { + table_close(tablerel, AccessExclusiveLock); + return; + } + /* Lock the attribute row and get the data */ attrrel = table_open(AttributeRelationId, RowExclusiveLock); atttup = SearchSysCacheAttName(relid, attname); @@ -2287,7 +2294,8 @@ StoreAttrDefault(Relation rel, AttrNumber attnum, valuesAtt[Anum_pg_attribute_atthasdef - 1] = true; replacesAtt[Anum_pg_attribute_atthasdef - 1] = true; - if (add_column_mode && !attgenerated) + if (rel->rd_rel->relkind == RELKIND_RELATION && add_column_mode && + !attgenerated) { expr2 = expression_planner(expr2); estate = CreateExecutorState(); diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 028e8ac46b..2b18eb89bf 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -12215,9 +12215,11 @@ ATExecAlterColumnType(AlteredTableInfo *tab, Relation rel, /* * Here we go --- change the recorded column type and collation. (Note * heapTup is a copy of the syscache entry, so okay to scribble on.) First - * fix up the missing value if any. + * fix up the missing value if any. There shouldn't be any missing values + * for anything except RELKIND_RELATION relations, but if there are, ignore + * them. 
*/ - if (attTup->atthasmissing) + if (rel->rd_rel->relkind == RELKIND_RELATION && attTup->atthasmissing) { Datum missingval; bool missingNull; diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index fd05615e76..588f9f0067 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -559,6 +559,15 @@ RelationBuildTupleDesc(Relation relation) elog(ERROR, "invalid attribute number %d for relation \"%s\"", attp->attnum, RelationGetRelationName(relation)); + /* + * Fix atthasmissing flag - it's only for RELKIND relations. Others + * should not have missing values set, but there may be some left from + * before when we placed that check, so this code defensively ignores + * such values. + */ + if (relation->rd_rel->relkind != RELKIND_RELATION) + attp->atthasmissing = false; + memcpy(TupleDescAttr(relation->rd_att, attnum - 1), attp, ATTRIBUTE_FIXED_PART_SIZE);
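The relcache hunk above encodes a simple invariant: atthasmissing is meaningful only for plain tables (RELKIND_RELATION), and a stray flag on any other relkind is ignored rather than trusted. A toy model of that defensive step, in Python for brevity (the relkind letters match the catalog; everything else is a hypothetical stand-in for RelationBuildTupleDesc):

```python
RELKIND_RELATION = "r"       # plain table
RELKIND_FOREIGN_TABLE = "f"  # foreign table

def load_attribute(relkind: str, att: dict) -> dict:
    """Mirror of the defensive step: only plain relations may carry
    fast-default missing values, so a stray atthasmissing flag on any
    other relkind is cleared instead of being dereferenced later
    (which is what produced the segfault)."""
    if relkind != RELKIND_RELATION:
        att = dict(att, atthasmissing=False)
    return att

# A corrupted/legacy catalog row claiming a foreign-table column has
# a missing value is defused at tupledesc-build time:
att = load_attribute(RELKIND_FOREIGN_TABLE, {"attname": "a", "atthasmissing": True})
print(att["atthasmissing"])  # -> False
```

The same invariant is enforced on the write side (SetAttrMissing, StoreAttrDefault), so the relcache clearing is a belt-and-braces defense for rows created before the check existed.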
Re: a path towards replacing GEQO with something better
On Sun, Jun 13, 2021 at 9:50 AM Tomas Vondra wrote: > > 2) We can still keep GEQO around (with some huge limit by default) for a > > few years as an escape hatch, while we refine the replacement. If there > > is some bug that prevents finding a plan, we can emit a WARNING and fall > > back to GEQO. Or if a user encounters a regression in a big query, they > > can lower the limit to restore the plan they had in an earlier version. > > > > Not sure. Keeping significant amounts of code may not be free - both for > maintenance and new features. It'd be a bit sad if someone proposed > improvements to join planning, but had to do 2x the work to support it > in both the DP and GEQO branches, or risk incompatibility. Looking back again at the commit history, we did modify geqo to support partial paths and partition-wise join, so that's a fair concern. My concern is the risk of plan regressions after an upgrade, even if for a small number of cases. > OTOH maybe this concern is unfounded in practice - I don't think we've > done very many big changes to geqo in the last few years. Yeah, I get the feeling that it's already de facto obsolete, and we could make it a policy not to consider improvements aside from bug fixes where it can't find a valid plan, or forced API changes. Which I guess is another way of saying "deprecated". (I briefly considered turning it into a contrib module, but that seems like the worst of both worlds.) > This reminds me the proposals to have a GUC that'd determine how much > effort should the planner invest into various optimizations. For OLTP it > might be quite low, for large OLAP queries it'd be economical to spend > more time trying some more expensive optimizations. > > The challenge of course is how / in what units / to define the budget, > so that it's meaningful and understandable for users. Not sure if > "number of join rels generated" will be clear enough for users. 
But it > seems good enough for PoC / development, and hopefully people won't have > to tweak it very often. I'm also in favor of having some type of "planner effort" or "OLTP to OLAP spectrum" guc, but I'm not yet sure whether it'd be better to have it separate or to couple the joinrel budget to it. If we go that route, I imagine there'll be many things that planner_effort changes that we don't want to give a separate knob for. And, I hope with graceful degradation and a good enough heuristic search, it won't be necessary to change in most cases. > I haven't read the [Kossmann & Stocker, 2000] paper yet, but the > [Neumann, 2018] paper seems to build on it, and it seems to work with > much larger subtrees of the join tree than k=5. Right, in particular it builds on "IDP-2" from Kossmann & Stocker. Okay, so Neumann's favorite algorithm stack "Adaptive" is complex, and I believe you are referring to cases where they can iteratively improve up to 100 rels at a time because of linearization. That's a separate algorithm (IKKBZ) that complicates the cost model and also cannot have outer joins. If it has outer joins, they use regular DP on subsets of size up to 10. It's not substantively different from IDP-2, and that's the one I'd like to try to gracefully fall back to. Or something similar. > What I find fairly interesting is the section in [Neumann, 2018] about > cardinality estimates, and quality of query plans when the estimates are > off. The last paragraph in the 6.5 section essentially says that despite > poor estimates, the proposed algorithm performs better than the simple > (and cheap) heuristics. I'm not sure what to think about that, because > my "intuitive" understanding is that the more elaborate the planning is, > the more errors it can make when the estimates are off. Yeah, I'm not sure this part is very useful and seems almost like an afterthought. 
In table 3, all those poor examples are "pure" greedy algorithms and don't have iterative refinement added, so it kind of makes sense that poor estimates would hurt them more. But they don't compare those with *just* a refinement step added. I also don't know how realistic their "estimate fuzzing" is. -- John Naylor EDB: http://www.enterprisedb.com
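To make "number of join rels generated" concrete as a budget unit, here is a toy counter for a Selinger-style dynamic-programming enumeration of an n-way join. It assumes a clique join graph (every pair joinable), counts subsets rather than paths, and has none of the real planner's pruning — purely illustrative:

```python
from math import comb

def count_joinrels(n: int) -> int:
    """Count the join relations a DP join planner generates for an
    n-way join over a clique query graph: every subset of two or more
    base relations becomes one joinrel. Toy model only -- it ignores
    paths, costs, physical properties, and non-clique join graphs."""
    return sum(comb(n, size) for size in range(2, n + 1))

# The count is 2^n - n - 1, which is why a fixed effort budget (and a
# graceful fallback once it is exceeded) is attractive for large n.
for n in (4, 8, 12):
    print(n, count_joinrels(n))  # -> 4 11, 8 247, 12 4083
```

A "planner effort" GUC could then be expressed as a cap on this count, triggering the fallback (IDP-2-style partitioned DP or similar) once the cap is hit, rather than switching algorithms at a fixed relation count the way geqo_threshold does.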
Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
On Thu, 10 Jun 2021 at 19:43, Peter Geoghegan wrote:
>
> On Thu, Jun 10, 2021 at 10:29 AM Matthias van de Meent wrote:
> > I see one exit for HEAPTUPLE_DEAD on a potentially recently committed
> > xvac (?), and we might also check against recently committed
> > transactions if xmin == xmax, although apparently that is not
> > implemented right now.
>
> I don't follow. Perhaps you can produce a test case?

If you were to delete a tuple in the same transaction that you created it (without checkpoints / subtransactions), I would assume that this would allow us to vacuum the tuple, as the only snapshot that could see the tuple must commit or roll back. In any case, inside the transaction the tuple is not visible anymore, and outside the transaction the tuple will never be seen. That being the case, any such tuple that has xmin == xmax should be vacuumable at any time, except that you might want to wait for the transaction to have committed / rolled back to prevent any race conditions with (delayed) index insertions.

Example:

BEGIN;
INSERT INTO tab VALUES (1);
DELETE FROM tab;
-- At this point, the tuple inserted cannot be seen in any
-- current or future snapshot, and could thus be vacuumed.
COMMIT;

Because I am not yet quite well versed in the xid assignment and heapam deletion subsystems, it could very well be that either this case is impossible to reach, or that the heapam tuple delete logic already applies this at tuple delete time.

With regards,

Matthias van de Meent
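The argument above can be stated as a tiny predicate. This is a toy model only — it deliberately ignores subtransactions, combo CIDs, and multixacts, which is exactly where the "I may be missing something" caveat bites (a subtransaction can give xmin == xmax where the insert commits but the delete aborts):

```python
def vacuumable_xmin_eq_xmax(xmin: int, xmax: int, running_xids: set) -> bool:
    """Toy model of the proposed shortcut: if the same (top-level)
    transaction both inserted and deleted the tuple, then once that
    transaction is no longer in progress, no snapshot past or future
    can see the tuple -- it is dead whether the xid committed or
    aborted. Waiting for the xid to finish avoids the race with
    delayed index insertions mentioned above."""
    return xmin == xmax and xmin not in running_xids

print(vacuumable_xmin_eq_xmax(100, 100, running_xids={200}))  # -> True
print(vacuumable_xmin_eq_xmax(100, 100, running_xids={100}))  # -> False (xact still running)
print(vacuumable_xmin_eq_xmax(100, 105, running_xids={200}))  # -> False (normal rules apply)
```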
Re: Different compression methods for FPI
Thanks for benchmarks, Justin!

> On 14 June 2021, at 06:24, Justin Pryzby wrote:
>
> The GUC is PGC_USERSET

Oh, wow, that's neat. I did not realize that we can tune this for each individual client connection. Cool!

> pglz writes ~half as much, but takes twice as long as uncompressed:
> |Time: 3362.912 ms (00:03.363)
> |wal_bytes| 11644224
>
> zlib writes ~4x less than uncompressed, and is still much faster than pglz:
> |Time: 2167.474 ms (00:02.167)
> |wal_bytes| 5611653
>
> lz4 is as fast as uncompressed, and writes a bit more than pglz:
> |Time: 1612.874 ms (00:01.613)
> |wal_bytes| 12397123
>
> zstd(6) is slower than lz4, but compresses better than anything but zlib:
> |Time: 1808.881 ms (00:01.809)
> |wal_bytes| 6395993

I was wrong about zlib: it does have a point on the Pareto frontier, at least for this test. Thanks!

Best regards,
Andrey Borodin.
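The shape of this speed-versus-ratio trade-off is easy to reproduce with standard-library codecs. lz4 and zstd are not in Python's standard library, so the sketch below uses zlib at different levels as a stand-in for the spectrum; the payload is a made-up page-like buffer with some redundancy, not a real full-page image:

```python
import time
import zlib

# A page-like payload with realistic redundancy (stand-in for an FPI).
payload = (b"heap tuple header " * 64 + bytes(range(256))) * 16

for level in (1, 6, 9):  # level 1 ~ "fast codec", 9 ~ "best ratio"
    t0 = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed_ms = (time.perf_counter() - t0) * 1e3
    print(f"level={level} ratio={len(payload) / len(compressed):.2f} "
          f"time={elapsed_ms:.2f} ms")
```

Higher levels buy ratio with CPU time, which mirrors why a fast codec like lz4 keeps up with uncompressed WAL while zlib/zstd shrink it further at a latency cost.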
Re: GiST operator class for bool
> On 8 June 2021, at 19:53, Emre Hasegeli wrote:
>
>> But the patch that you propose does not support the sorting build added in PG14.
>
> It looks like the change to btree_gist is not committed yet. I'll
> rebase my patch once it's committed.

The changes to GiST are committed. There will be no need to rebase anyway :)

> It was a long thread. I couldn't read all of it. Though, the last
> patches felt to me like a part of what's already been committed.
> Shouldn't they also be committed to version 14?

Well, yeah, it would be cool to have the GiST build and the gist_btree support for it together, but there were many parts and we did not finish it before the feature freeze.

Best regards,
Andrey Borodin.
Re: pgbench bug candidate: negative "initial connection time"
Hmmm. Possibly. Another option could be not to report anything after some errors. I'm not sure, because it would depend on the use case. I guess the command returned an error status as well.

I did not know any use cases and decisions, but I vote to report nothing when an error occurs. I would prefer to abort the thread whose connection got an error and report results for the other threads, as is handled when doConnect fails in the CSTATE_START_TX state.

It is unclear to me whether it makes much sense to report performance when things go wrong. At least when a one-connection-per-client bench is run, ISTM that it should not proceed, because the bench could not even start as prescribed. When connections break while the bench has already started, maybe it makes more sense to proceed, although I guess that reattempting connections would also make sense in such a case.

In this case, we have to set the state to CSTATE_ABORTED before going to 'done', as fixed in the attached patch, in order to ensure that the exit status is 2 and the result reports "pgbench: fatal: Run was aborted; the above results are incomplete."

Hmmm. I agree that at least reporting that there was an issue is a good idea.

Otherwise, if we want pgbench to exit immediately when a connection error occurs, we have to call exit(1) to ensure the exit code is 1, of course. Anyway, it is wrong that the current pgbench exits successfully with exit code 0 when doConnect fails.

Indeed, I can only agree on this one.

-- Fabien.
Re: SQLSTATE for replication connection failures
On Sat, Jun 12, 2021 at 9:12 PM Tom Lane wrote: > > So far as I can find, just about everyplace that deals with replication > connections has slipshod error reporting. An example from worker.c is > > LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo, > true, > MySubscription->name, ); > if (LogRepWorkerWalRcvConn == NULL) > ereport(ERROR, > (errmsg("could not connect to the publisher: %s", err))); > > Because of the lack of any errcode() call, this failure will be reported > as XX000 ERRCODE_INTERNAL_ERROR, which is surely not appropriate. > worker.c is in good company though, because EVERY caller of walrcv_connect > is equally slipshod. > > Shall we just use ERRCODE_CONNECTION_FAILURE for these failures, or > would it be better to invent another SQLSTATE code? Arguably, > ERRCODE_CONNECTION_FAILURE is meant for failures of client connections; > but on the other hand, a replication connection is a sort of client. > Your reasoning sounds good to me. So, +1 for using ERRCODE_CONNECTION_FAILURE. -- With Regards, Amit Kapila.
Re: Fix around conn_duration in pgbench
Hello Yugo-san,

TState has a field called "conn_duration" which is, the comment says, "cumulated connection and deconnection delays". This value is summed over threads and reported as "average connection time" under -C/--connect. If this option is not specified, the value is never used.

Yep.

However, I found that conn_duration is calculated even when -C/--connect is not specified, which is wasteful. So we can remove this code, as fixed in the attached patch.

I'm fine with the implied code simplification, but it deserves a comment.

In addition, disconnection delays are not cumulated even under -C/--connect, despite being mentioned in the comment. I also fixed this in the attached patch.

I'm fine with that, even if it only concerns is_connect.

I've updated the command to work whether now is initially set or not. I'm unsure whether closing a pg connection actually waits for anything, so the impact is probably rather small anyway. It cannot be usefully measured when !is_connect, because threads do that whenever they feel like it, whereas on connection we can use barriers to have something which makes sense.

Also, there is the issue of connection failures: the attached version adds an error message and exits for initial connections, consistently. This is not done with is_connect, though, and I'm unsure what we should really do.
-- Fabien.diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index d7479925cb..9d2abbfb68 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -3171,6 +3171,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg) if ((st->con = doConnect()) == NULL) { + /* as the bench is already running, we do not abort */ pg_log_error("client %d aborted while establishing connection", st->id); st->state = CSTATE_ABORTED; break; @@ -3536,8 +3537,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg) if (is_connect) { + pg_time_usec_t start = now; + + pg_time_now_lazy(); finishCon(st); - now = 0; + now = pg_time_now(); + thread->conn_duration += now - start; } if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded) @@ -3561,6 +3566,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg) */ case CSTATE_ABORTED: case CSTATE_FINISHED: +/* per-thread last disconnection time is not measured */ finishCon(st); return; } @@ -4405,7 +4411,10 @@ runInitSteps(const char *initialize_steps) initPQExpBuffer(); if ((con = doConnect()) == NULL) + { + pg_log_fatal("connection for initialization failed"); exit(1); + } setup_cancel_handler(NULL); SetCancelConn(con); @@ -6332,7 +6341,10 @@ main(int argc, char **argv) /* opening connection... 
*/ con = doConnect(); if (con == NULL) + { + pg_log_fatal("setup connection failed"); exit(1); + } pg_log_debug("pghost: %s pgport: %s nclients: %d %s: %d dbName: %s", PQhost(con), PQport(con), nclients, @@ -6548,7 +6560,7 @@ threadRun(void *arg) if (thread->logfile == NULL) { pg_log_fatal("could not open logfile \"%s\": %m", logpath); - goto done; + exit(1); } } @@ -6561,6 +6573,7 @@ threadRun(void *arg) thread_start = pg_time_now(); thread->started_time = thread_start; + thread->conn_duration = 0; last_report = thread_start; next_report = last_report + (int64) 100 * progress; @@ -6572,26 +6585,13 @@ threadRun(void *arg) { if ((state[i].con = doConnect()) == NULL) { -/* - * On connection failure, we meet the barrier here in place of - * GO before proceeding to the "done" path which will cleanup, - * so as to avoid locking the process. - * - * It is unclear whether it is worth doing anything rather - * than coldly exiting with an error message. - */ -THREAD_BARRIER_WAIT(); -goto done; +/* coldly abort on connection failure */ +pg_log_fatal("cannot create connection for thread %d client %d", + thread->tid, i); +exit(1); } } - - /* compute connection delay */ - thread->conn_duration = pg_time_now() - thread->started_time; - } - else - { - /* no connection delay to record */ - thread->conn_duration = 0; + /* connection delay is measured globally between the barriers */ } /* GO */
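What the patch measures can be sketched independently of pgbench: a per-thread accumulator that adds both the connection and the disconnection delay, read from a monotonic clock around each operation. All names below are hypothetical stand-ins for TState and finishCon; a real client would open a database connection instead of sleeping:

```python
import time

class ThreadState:
    """Toy stand-in for pgbench's TState: conn_duration cumulates both
    connection and disconnection delays, as the comment promises."""
    def __init__(self):
        self.conn_duration = 0.0

def run_with_connection(thread, connect, disconnect, work):
    t0 = time.monotonic()
    con = connect()
    thread.conn_duration += time.monotonic() - t0   # connection delay

    work(con)

    t0 = time.monotonic()
    disconnect(con)
    thread.conn_duration += time.monotonic() - t0   # disconnection delay too

t = ThreadState()
run_with_connection(t,
                    connect=lambda: time.sleep(0.01) or object(),  # fake 10 ms connect
                    disconnect=lambda con: time.sleep(0.005),      # fake 5 ms close
                    work=lambda con: None)
print(t.conn_duration >= 0.01)  # -> True
```

The subtlety the patch addresses is the same one as here: `now` must be refreshed lazily before the disconnect (pg_time_now_lazy), otherwise a stale timestamp under- or over-counts the delay.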
Re: Failure in subscription test 004_sync.pl
On Mon, Jun 14, 2021 at 10:41 AM Masahiko Sawada wrote: > > > > > I think it is showing a race condition issue in the code. In > > DropSubscription, we first stop the worker that is receiving the WAL, > > and then in a separate connection with the publisher, it tries to drop > > the slot which leads to this error. The reason is that walsender is > > still active as we just wait for wal receiver (or apply worker) to > > stop. Normally, as soon as the apply worker is stopped the walsender > > detects it and exits but in this case, it took some time to exit, and > > in the meantime, we tried to drop the slot which is still in use by > > walsender. > > There might be possible. > > That's weird since DROP SUBSCRIPTION executes DROP_REPLICATION_SLOT > command with WAIT option. I found a bug that is possibly an oversight > of commit 1632ea4368. > .. > > The condition should be the opposite; we should raise the error when > 'nowait' is true. I think this is the cause of the test failure. Even > if DROP SUBSCRIPTION tries to drop the slot with the WAIT option, we > don't wait but raise the error. > > Attached a small patch fixes it. > Yes, this should fix the recent buildfarm failures. Alvaro, would you like to take care of this? -- With Regards, Amit Kapila.
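The inverted check described above is easy to model: acquiring a slot that is in use should raise an error only in nowait mode, and wait for release otherwise, which is what DROP SUBSCRIPTION's DROP_REPLICATION_SLOT ... WAIT expects. A toy sketch of the intended decision (hypothetical names; the real logic lives in ReplicationSlotAcquire after commit 1632ea4368):

```python
def acquire_slot(active: bool, nowait: bool) -> str:
    """Toy model of the fixed condition: error out on an active slot
    only when the caller asked not to wait; otherwise block until the
    slot is released. The reported bug had this test inverted, so WAIT
    callers got the error meant for NOWAIT callers."""
    if not active:
        return "acquired"
    if nowait:
        raise RuntimeError("replication slot is active for another backend")
    return "waited-then-acquired"

print(acquire_slot(active=True, nowait=False))  # -> waited-then-acquired
```

With the condition inverted, the WAIT path raised the error instead of sleeping on the condition variable, producing the intermittent 004_sync.pl failure when the walsender was slow to exit.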