Re: Fail Fast In CTAS/CMV If Relation Already Exists To Avoid Unnecessary Rewrite, Planning Costs

2020-12-10 Thread Bharath Rupireddy
On Fri, Dec 11, 2020 at 12:13 PM Hou, Zhijie  wrote:
> > IMO, let's not change the 1) behaviour to 3) with the patch. If agreed,
> > I can do the following way in ExplainOneUtility and will add a comment on
> > why we are doing this.
> >
> > if (es->analyze)
> >  (void) CheckRelExistenceInCTAS(ctas, true);
> >
> > Thoughts?
>
> Agreed.

Thanks!

So, I will post an updated patch soon.

> Just in case, I took a look at Oracle 12’s behavior about [explain CTAS].
>
> Oracle 12 will output the plan without throwing any msg in this case.

I'm not quite sure how other databases behave. Going by the main
intention of EXPLAIN without ANALYZE, it should do the planning and
show it in the output, and no execution of the query should happen. For
EXPLAIN CTAS/CMV, the only thing that gets planned is the SELECT part and
no execution happens, so there is no existence check for the CTAS/CMV
relation that would get created if the CTAS/CMV were executed. Having said
that, the existence of the relations in the SELECT part is checked during
planning anyway, even for EXPLAIN without ANALYZE.

IMHO, let's not alter the existing behaviour; if needed, that can be
discussed separately.
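
For reference, a rough sketch of what such a check could look like.  The
helper name is taken from the snippet quoted above; the body below is only
a guess at an obvious implementation, not the actual patch:

    /*
     * Sketch only: fail (or notice) early if the target relation of
     * CTAS/CMV already exists, so we do not rewrite, plan and execute
     * the SELECT part just to fail at the end.
     */
    static bool
    CheckRelExistenceInCTAS(CreateTableAsStmt *ctas, bool is_explain)
    {
        RangeVar   *rel = ctas->into->rel;
        Oid         nspid = RangeVarGetCreationNamespace(rel);

        if (!OidIsValid(get_relname_relid(rel->relname, nspid)))
            return false;       /* does not exist, nothing to do */

        if (!is_explain && ctas->if_not_exists)
            ereport(NOTICE,
                    (errcode(ERRCODE_DUPLICATE_TABLE),
                     errmsg("relation \"%s\" already exists, skipping",
                            rel->relname)));
        else
            ereport(ERROR,
                    (errcode(ERRCODE_DUPLICATE_TABLE),
                     errmsg("relation \"%s\" already exists",
                            rel->relname)));
        return true;
    }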

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: On login trigger: take three

2020-12-10 Thread Pavel Stehule
On Thu, Dec 10, 2020 at 19:09, Pavel Stehule
wrote:

>
>
> On Thu, Dec 10, 2020 at 16:48, Konstantin Knizhnik <
> k.knizh...@postgrespro.ru> wrote:
>
>>
>>
>> On 10.12.2020 18:12, Pavel Stehule wrote:
>>
>>
>> My idea was a little bit different. Inside postinit, initialize some
>> global variable with info about whether there are event triggers or not.
>> Later you can use this variable to decide whether to start transactions and do other things.
>>
>> There will be two accesses to pg_event_trigger, but it can eliminate a
>> useless and probably more expensive start_transaction and end_transaction.
>>
>>
>>
>> Do you mean some variable in shared memory or GUCs?
>> It was my first idea - to use some flag in shared memory to make it
>> possible to quickly check that there are no event triggers.
>> But this flag should be set per database. Maybe I missed something, but
>> there is no per-database shared-memory data structure in Postgres.
>> Certainly it is possible to organize some hash db->event_info, but that
>> makes this patch several times more complex.
>>
>
> My idea was a little bit different - just check the existence of event
> triggers inside process initialization, and use that later to decide when
> it is necessary to start a transaction for trigger execution. This should
> reduce useless empty transactions,
>

but this information can be accessed via shared memory if it is necessary.
Probably it doesn't need a special hash table.  Maybe a special flag per
process.
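
As a sketch of that idea (all names below are illustrative, not the
patch's actual API): cache the presence of client-connection event
triggers once at backend start, while InitPostgres() still has a
transaction open, and only pay for a new transaction when a trigger
actually exists:

    /* set once per backend during InitPostgres(), inside its transaction */
    static bool connection_triggers_exist = false;

    static void
    CacheConnectionTriggerPresence(void)
    {
        /*
         * HaveClientConnectionEventTriggers() is a hypothetical helper
         * that scans pg_event_trigger for client-connection triggers.
         */
        connection_triggers_exist = HaveClientConnectionEventTriggers();
    }

    void
    EventTriggerOnConnect(void)
    {
        if (!connection_triggers_exist)
            return;             /* no extra transaction started at all */

        StartTransactionCommand();
        /* look up and fire the client-connection triggers here */
        CommitTransactionCommand();
    }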


> I am sure this is a problem on computers with slower CPUs (although I
> have an i7), but because this feature is usually unused, the performance
> impact should be minimal in any case.
>
>
>
>>
>> From my point of view it is better to have a separate GUC disabling just
>> client connection events and switch it on by default.
>> So only those who need these events will switch it on; other users will
>> not pay any extra cost for it.
>>
>
> It can work, but this design is not user friendly.  The significant
> bottleneck should be forking new processes, and checking the contents of
> some usually very small tables should be invisible next to that. So if
> there is a possible way to implement this feature so that it can be enabled
> by default, then we should go that way. It can be pretty unfriendly if
> somebody writes a logon trigger and it does not work by default, without any warning.
>
> When I tested the last patch I found a problem (I had disabled assertions
> due to performance testing)
>
> create role xx login;
> alter system set disable_event_triggers to on; -- better should be
> positive form - enable_event_triggers to on, off
> select pg_reload_conf();
>
> psql -U xx
>
> Thread 2.1 "postmaster" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7f2bff438ec0 (LWP 827497)]
> ResourceOwnerEnlargeCatCacheRefs (owner=0x0) at resowner.c:1025
> 1025 ResourceArrayEnlarge(&(owner->catrefarr));
> (gdb) bt
> #0  ResourceOwnerEnlargeCatCacheRefs (owner=0x0) at resowner.c:1025
> #1  0x008a70f8 in SearchCatCacheInternal (cache=,
> nkeys=nkeys@entry=1, v1=v1@entry=13836, v2=v2@entry=0,
> v3=v3@entry=0, v4=v4@entry=0) at catcache.c:1273
> #2  0x008a7575 in SearchCatCache1 (cache=,
> v1=v1@entry=13836) at catcache.c:1167
> #3  0x008b7f80 in SearchSysCache1 (cacheId=cacheId@entry=21,
> key1=key1@entry=13836) at syscache.c:1122
> #4  0x005939cd in pg_database_ownercheck (db_oid=13836,
> roleid=16387) at aclchk.c:5114
> #5  0x00605b42 in EventTriggersDisabled () at event_trigger.c:650
> #6  EventTriggerOnConnect () at event_trigger.c:839
> #7  0x007b46d7 in PostgresMain (argc=argc@entry=1, 
> argv=argv@entry=0x7ffdd6256080,
> dbname=,
> username=0xf82508 "tom") at postgres.c:4021
> #8  0x00741afd in BackendRun (port=,
> port=) at postmaster.c:4490
> #9  BackendStartup (port=) at postmaster.c:4212
> #10 ServerLoop () at postmaster.c:1729
> #11 0x00742881 in PostmasterMain (argc=argc@entry=3,
> argv=argv@entry=0xf44d00) at postmaster.c:1401
> #12 0x004f1ff7 in main (argc=3, argv=0xf44d00) at main.c:209
>
> I checked profiles, and although there is significant slowdown when there
> is one login trigger, it was not related to plpgsql.
>
> There is big overhead of
>
>8,98%  postmaster  postgres  [.] guc_name_compare
>
> Maybe EventTriggerInvoke is not well optimized for very frequent
> execution. Overhead of plpgsql is less than 0.1%, although there is
> significant slowdown (when a very simple logon trigger is executed).
>
> Regards
>
> Pavel
>
>
>
>


Re: pg_basebackup test coverage

2020-12-10 Thread Noah Misch
On Thu, Dec 10, 2020 at 12:32:52PM -0500, Robert Haas wrote:
> It would probably be good to fix as much of this as we can, but there
> are a couple of cases I think would be particularly good to cover. One
> is 'pg_basebackup -Ft -Xnone -D -', which tries to write the output as
> a single tar file on standard output, injecting the backup_manifest
> file into the tar file instead of writing it out separately as we
> normally would. This case requires special handling in a few places
> and it would be good to check that it actually works. The other is the
> -z or -Z option, which produces a compressed tar file.
> 
> Now, there's nothing to prevent us from running commands like this
> from the pg_basebackup tests, but it doesn't seem like we could really
> check anything meaningful. Perhaps we'd notice if the command exited
> non-zero or didn't produce any output, but it would be nice to verify
> that the resulting backups are actually correct. The way to do that
> would seem to be to extract the tar file (and decompress it first, in
> the -z/-Z case) and then run pg_verifybackup on the result and check
> that it passes. However, as far as I can tell there's no guarantee
> that the user has 'tar' or 'gunzip' installed on their system, so I
> don't see a clean way to do this. A short (but perhaps incomplete)
> search didn't turn up similar precedents in the existing tests.
> 
> Any ideas?

I would probe for the commands with "tar -cf- anyfile | tar -xf-" and "echo
foo | gzip | gunzip".  If those fail, skip any test that relies on an
unavailable command.




RE: Fail Fast In CTAS/CMV If Relation Already Exists To Avoid Unnecessary Rewrite, Planning Costs

2020-12-10 Thread Hou, Zhijie
> IMO, let's not change the 1) behaviour to 3) with the patch. If agreed,
> I can do the following way in ExplainOneUtility and will add a comment on
> why we are doing this.
>
> if (es->analyze)
>  (void) CheckRelExistenceInCTAS(ctas, true);
>
> Thoughts?

Agreed.



Just in case, I took a look at Oracle 12’s behavior about [explain CTAS].

Oracle 12 will output the plan without throwing any msg in this case.



Best regards,

houzj










Re: pg_shmem_allocations & documentation

2020-12-10 Thread Michael Paquier
On Fri, Dec 11, 2020 at 11:00:58AM +0900, Kyotaro Horiguchi wrote:
> Although we could just rip some words off, I'd like to propose instead
> to add an explanation why it is not exposed for anonymous allocations,
> like the column allocated_size.

Indeed, there is a hiccup between what the code does and what the docs
say: the offset is not NULL for unused memory.

> -   The offset at which the allocation starts. NULL for anonymous
> -   allocations and unused memory.
> +   The offset at which the allocation starts. For anonymous allocations,
> +   no information about individual allocations is available, so the 
> column
> +   will be NULL in that case.

I'd say: let's be simple and just remove "and unused memory", because
anonymous allocations are... anonymous, so you cannot know details
related to them.  That's something easy to reason about, and the docs
were originally written to remain simple.
--
Michael




Re: pg_waldump error message fix

2020-12-10 Thread Michael Paquier
On Fri, Dec 11, 2020 at 01:30:16PM +0900, Kyotaro Horiguchi wrote:
> currRecPtr is a private member of the struct (see the definition of
> the struct XLogReaderState).  We should instead use EndRecPtr outside
> xlog reader.

Please note that this is documented in xlogreader.h.  Hmm.  I can see
your point here, but I still think that what both of you are suggesting
is not completely correct.  For example, switching to EndRecPtr would
cause DecodeXLogRecord() to report an error with the end of the
current record, but it makes more sense to point to ReadRecPtr in this
case.  On the other hand, it would make sense to report the beginning
of the position we are working on when validating the header, if an
error happens at that point.  So it seems to me that neither EndRecPtr
nor ReadRecPtr is completely correct, and that we may need an extra
LSN position, updated before the validation part, to tell which LSN we
are actually working on, because we update ReadRecPtr and EndRecPtr
only after this initial validation is done.
--
Michael




Re: pg_waldump error message fix

2020-12-10 Thread Kyotaro Horiguchi
At Thu, 10 Dec 2020 18:47:58 +, "Bossart, Nathan"  
wrote in 
> Hi,
> 
> I noticed that when pg_waldump finds an invalid record, the
> corresponding error message seems to point to the last valid record
> read.

Good catch!

> rmgr: ... lsn: 0/090E5AF8, prev 0/090E59D0, ...
> pg_waldump: fatal: error in WAL record at 0/90E5AF8: invalid record 
> length at 0/90E5B30: wanted 24, got 0
> 
> Should pg_waldump report currRecPtr instead of ReadRecPtr in the error
> message?  With that, I see the following.
> 
> rmgr: ... lsn: 0/090E5AF8, prev 0/090E59D0, ...
> pg_waldump: fatal: error in WAL record at 0/90E5B30: invalid record 
> length at 0/90E5B30: wanted 24, got 0
> 
> Here is the patch:
> 
> diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
> index 31e99c2a6d..27da60e6db 100644
> --- a/src/bin/pg_waldump/pg_waldump.c
> +++ b/src/bin/pg_waldump/pg_waldump.c
> @@ -1110,8 +1110,8 @@ main(int argc, char **argv)
> 
> if (errormsg)
> fatal_error("error in WAL record at %X/%X: %s",
> -   (uint32) (xlogreader_state->ReadRecPtr >> 32),
> -   (uint32) xlogreader_state->ReadRecPtr,
> +   (uint32) (xlogreader_state->currRecPtr >> 32),
> +   (uint32) xlogreader_state->currRecPtr,
> errormsg);
> 
> XLogReaderFree(xlogreader_state);

currRecPtr is a private member of the struct (see the definition of
the struct XLogReaderState).  We should instead use EndRecPtr outside
xlog reader.
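
With that, the call in pg_waldump would become something like (a sketch
only):

    if (errormsg)
        fatal_error("error in WAL record at %X/%X: %s",
                    (uint32) (xlogreader_state->EndRecPtr >> 32),
                    (uint32) xlogreader_state->EndRecPtr,
                    errormsg);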

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: please update ps display for recovery checkpoint

2020-12-10 Thread Michael Paquier
On Thu, Dec 10, 2020 at 10:02:10PM -0600, Justin Pryzby wrote:
> Isn't the sense of "reset" inverted ?

It is ;p
--
Michael




Re: please update ps display for recovery checkpoint

2020-12-10 Thread Justin Pryzby
Isn't the sense of "reset" inverted ?

I think maybe you mean to do set_ps_display(""); in the "if reset".
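
i.e. with the sense fixed, roughly:

    static void
    update_checkpoint_display(int flags, bool restartpoint, bool reset)
    {
        if (reset)
            set_ps_display("");
        else
        {
            char        activitymsg[128];

            snprintf(activitymsg, sizeof(activitymsg), "performing %s%s%s",
                     (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
                     (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
                     restartpoint ? "restartpoint" : "checkpoint");
            set_ps_display(activitymsg);
        }
    }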

On Fri, Dec 11, 2020 at 12:54:22PM +0900, Michael Paquier wrote:
> +update_checkpoint_display(int flags, bool restartpoint, bool reset)
> +{
> + if (reset)
> + {
> + char activitymsg[128];
> +
> + snprintf(activitymsg, sizeof(activitymsg), "performing %s%s%s",
> +  (flags & CHECKPOINT_END_OF_RECOVERY) ? 
> "end-of-recovery " : "",
> +  (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " 
> : "",
> +  restartpoint ? "restartpoint" : "checkpoint");
> + set_ps_display(activitymsg);
> + }
> + else
> + set_ps_display("");
> +}
> +
> +
>  /*
>   * Perform a checkpoint --- either during shutdown, or on-the-fly
>   *
> @@ -8905,6 +8937,9 @@ CreateCheckPoint(int flags)
>   if (log_checkpoints)
>   LogCheckpointStart(flags, false);
>  
> + /* Update the process title */
> + update_checkpoint_display(flags, false, false);
> +
>   TRACE_POSTGRESQL_CHECKPOINT_START(flags);
>  
>   /*
> @@ -9120,6 +9155,9 @@ CreateCheckPoint(int flags)
>   /* Real work is done, but log and update stats before releasing lock. */
>   LogCheckpointEnd(false);
>  
> + /* Reset the process title */
> + update_checkpoint_display(flags, false, true);
> +
>   TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
>
> NBuffers,
>
> CheckpointStats.ckpt_segs_added,
> @@ -9374,6 +9412,9 @@ CreateRestartPoint(int flags)
>   if (log_checkpoints)
>   LogCheckpointStart(flags, true);
>  
> + /* Update the process title */
> + update_checkpoint_display(flags, true, false);
> +
>   CheckPointGuts(lastCheckPoint.redo, flags);
>  
>   /*
> @@ -9492,6 +9533,9 @@ CreateRestartPoint(int flags)
>   /* Real work is done, but log and update before releasing lock. */
>   LogCheckpointEnd(true);
>  
> + /* Reset the process title */
> + update_checkpoint_display(flags, true, true);
> +
>   xtime = GetLatestXTime();
>   ereport((log_checkpoints ? LOG : DEBUG2),
>   (errmsg("recovery restart point at %X/%X",




Re: please update ps display for recovery checkpoint

2020-12-10 Thread Michael Paquier
On Wed, Dec 09, 2020 at 06:37:22PM +, Bossart, Nathan wrote:
> On 12/8/20, 10:16 PM, "Michael Paquier"  wrote:
>> Yeah, I'd rather leave out the part of the patch where we have the
>> checkpointer say "idle".  So I still think that what v3 did upthread,
>> by just resetting the ps display in all cases, feels more natural.  So
>> we should remove the parts of v5 from checkpointer.c.
> 
> That seems fine to me.  I think it is most important that the end-of-
> recovery and shutdown checkpoints are shown.  I'm not sure there's
> much value in updating the process title for automatic checkpoints and
> checkpoints triggered via the CHECKPOINT command, so IMO we could skip
> those, too.  I made these changes in the new version of the patch.

It would be possible to use pg_stat_activity in most cases here, so I
agree to settle down to the minimum we can agree on for now, and maybe
discuss separately whether this should be extended in some way or another
in the future if there is a use case for that.  So I'd say that what you
have here is logically fine.

> > +   snprintf(activitymsg, sizeof(activitymsg), "creating %s%scheckpoint",
> > [...]
> > +   snprintf(activitymsg, sizeof(activitymsg), "creating %srestartpoint",
> > FWIW, I still find "running" to sound better than "creating" here.  An
> > extra term I can think of that could be adapted is "performing".
> 
> I think I prefer "performing" over "running" because that's what the
> docs use [0].  I've changed it to "performing" in the new version of
> the patch.

That's also used in the code comments, FWIW.

@@ -9282,6 +9296,7 @@ CreateRestartPoint(int flags)
XLogRecPtr  endptr;
XLogSegNo   _logSegNo;
TimestampTz xtime;
+   boolshutdown = (flags & CHECKPOINT_IS_SHUTDOWN);
You are right that CHECKPOINT_END_OF_RECOVERY should not be set for a
restart point, so that's correct.

However, I think that it would be better to have all those four code
paths controlled by a single routine, where we pass down the
checkpoint flags and fill in the ps string directly depending on what
the caller wants to do.  This way, we will avoid any inconsistent
updates if this stuff gets extended in the future, and there will be
all the information at hand to extend the logic.  So I have simplified
your patch as per the attached.  Thoughts?
--
Michael
From 98440b74b10c522bd485e072c43631ce78d115a7 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 11 Dec 2020 12:53:36 +0900
Subject: [PATCH v7] Add checkpoint/restartpoint status to ps display.

---
 src/backend/access/transam/xlog.c | 44 +++
 1 file changed, 44 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7e81ce4f17..8d0a5e05fc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8687,6 +8687,38 @@ UpdateCheckPointDistanceEstimate(uint64 nbytes)
 			(0.90 * CheckPointDistanceEstimate + 0.10 * (double) nbytes);
 }
 
+/*
+ * Update the ps display for a process running a checkpoint.  Note that
+ * this routine should not do any allocations so that it can be called
+ * from a critical section.
+ */
+static void
+update_checkpoint_display(int flags, bool restartpoint, bool reset)
+{
+	/*
+	 * The status is reported only for end-of-recovery and shutdown checkpoints
+	 * or shutdown restartpoints.  Updating the ps display is useful in those
+	 * situations as it may not be possible to rely on pg_stat_activity to see
+	 * the status of the checkpointer or the startup process.
+	 */
+	if ((flags & (CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IS_SHUTDOWN)) == 0)
+		return;
+
+	if (reset)
+	{
+		char activitymsg[128];
+
+		snprintf(activitymsg, sizeof(activitymsg), "performing %s%s%s",
+ (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
+ (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
+ restartpoint ? "restartpoint" : "checkpoint");
+		set_ps_display(activitymsg);
+	}
+	else
+		set_ps_display("");
+}
+
+
 /*
  * Perform a checkpoint --- either during shutdown, or on-the-fly
  *
@@ -8905,6 +8937,9 @@ CreateCheckPoint(int flags)
 	if (log_checkpoints)
 		LogCheckpointStart(flags, false);
 
+	/* Update the process title */
+	update_checkpoint_display(flags, false, false);
+
 	TRACE_POSTGRESQL_CHECKPOINT_START(flags);
 
 	/*
@@ -9120,6 +9155,9 @@ CreateCheckPoint(int flags)
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
 
+	/* Reset the process title */
+	update_checkpoint_display(flags, false, true);
+
 	TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
 	 NBuffers,
 	 CheckpointStats.ckpt_segs_added,
@@ -9374,6 +9412,9 @@ CreateRestartPoint(int flags)
 	if (log_checkpoints)
 		LogCheckpointStart(flags, true);
 
+	/* Update the process title */
+	update_checkpoint_display(flags, true, false);
+
 	CheckPointGuts(lastCheckPoint.redo, flags);
 
 	/*
@@ -9492,6 +9533,9 @@ CreateRestartPoint(int 

Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-12-10 Thread Thomas Munro
On Fri, Dec 11, 2020 at 8:34 AM Robert Haas  wrote:
> On Thu, Oct 29, 2020 at 5:36 PM Alvaro Herrera  
> wrote:
> > Maybe instead of thinking specifically in terms of vacuum, we could
> > count buffer accesses (read from kernel) and check the latch once every
> > 1000th such, or something like that.  Then a very long query doesn't
> > have to wait until it's run to completion.  The cost is one integer
> > addition per syscall, which should be bearable.
>
> Interesting idea. One related case is where everything is fine on the
> server side but the client has disconnected and we don't notice that
> the socket has changed state until something makes us try to send a
> message to the client, which might be a really long time if the
> server's doing like a lengthy computation before generating any rows.
> It would be really nice if we could find a cheap way to check for both
> postmaster death and client disconnect every now and then, like if a
> single system call could somehow answer both questions.

For the record, an alternative approach was proposed[1] that
periodically checks for disconnected sockets using a timer, which will
then cause the next CFI() to abort.

Doing the check (a syscall) based on elapsed time rather than every
nth CFI() or buffer access or whatever seems better in some ways,
considering the difficulty of knowing what the frequency will be.  One
of the objections was that it added unacceptable setitimer() calls.
We discussed an idea to solve that problem generally, and then later I
prototyped that idea in another thread[2] about idle session timeouts
(not sure about that yet, comments welcome).

I've also wondered about checking postmaster_possibly_dead in CFI() on
platforms where we have it (and working to increase that set of
platforms), instead of just reacting to PM death when sleeping.   But
it seems like the real problem in this specific case is the use of
pg_usleep() where WaitLatch() should be used, no?

The recovery loop is at the opposite end of the spectrum: while vacuum
doesn't check for postmaster death often enough, the recovery loop
checks potentially hundreds of thousands or millions of times per
second, which sucks on systems that don't have parent-death signals
and slows down recovery quite measurably.  In the course of the
discussion about fixing that[3] we spotted other places that are using
a pg_usleep() where they ought to be using WaitLatch() (which comes
with exit-on-PM-death behaviour built-in).  By the way, the patch in
that thread does almost what Robert described, namely check for PM
death every nth time (which in this case means every nth WAL record),
except it's not in the main CFI(), it's in a special variant used just
for recovery.

[1] 
https://www.postgresql.org/message-id/flat/77def86b27e41f0efcba411460e929ae%40postgrespro.ru
[2] 
https://www.postgresql.org/message-id/flat/763a0689-f189-459e-946f-f0ec44589...@hotmail.com
[3] 
https://www.postgresql.org/message-id/flat/CA+hUKGK1607VmtrDUHQXrsooU=ap4g4r2yaobywooa3m8xe...@mail.gmail.com




Re: pg_shmem_allocations & documentation

2020-12-10 Thread Kyotaro Horiguchi
At Thu, 10 Dec 2020 11:07:47 +0100, Benoit Lobréau  
wrote in 
> Hi,
> 
> While reading the documentation of pg_shmem_allocations, I noticed that the
> off column is described as such :
> 
> "The offset at which the allocation starts. NULL for anonymous allocations
> and unused memory."
> 
> Whereas, the view returns a value for unused memory:
> 
> [local]:5433 postgres@postgres=# SELECT * FROM pg_shmem_allocations WHERE
> name IS NULL;
>  name |off|  size   | allocated_size
> --+---+-+
>  ¤| 178095232 | 1923968 |1923968
> (1 row)
> 
> From what I understand, the doc is wrong.
> Am I right ?

Good catch! I think you're right.  It seems to me the conclusion in
the discussion is to expose the offset for free memory.

Although we could just rip some words off, I'd like to propose instead
to add an explanation of why it is not exposed for anonymous allocations,
like the one for the allocated_size column.

> Benoit
> 
> [1] https://www.postgresql.org/docs/13/view-pg-shmem-allocations.html
> [2]
> https://www.postgresql.org/message-id/flat/20140504114417.GM12715%40awork2.anarazel.de
> (original thread)

regards.
-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 62711ee83f..8921907f5e 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -12492,8 +12492,9 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
off int8
   
   
-   The offset at which the allocation starts. NULL for anonymous
-   allocations and unused memory.
+   The offset at which the allocation starts. For anonymous allocations,
+   no information about individual allocations is available, so the column
+   will be NULL in that case.
   
  
 


Re: Fail Fast In CTAS/CMV If Relation Already Exists To Avoid Unnecessary Rewrite, Planning Costs

2020-12-10 Thread Bharath Rupireddy
Thanks for taking a look at this.

On Fri, Dec 11, 2020 at 6:33 AM Hou, Zhijie  wrote:
>
> > Currently, for CTAS or CREATE MATERIALIZED VIEW(CMV) without if-not-exists
> > clause, the existence of the relation gets checked during the execution
> > of the select part and an error is thrown there.
> > All the unnecessary rewrite and planning for the select part would have
> > happened just to fail later. However, if if-not-exists clause is present,
> > then a notice is issued and returned immediately without any further rewrite
> > or planning for the select part. This seems somewhat inconsistent to me.
> >
> > I propose to check the relation existence early in ExecCreateTableAs() as
> > well as in ExplainOneUtility() and throw an error in case it exists already
> > to avoid unnecessary rewrite, planning and execution of the select part.
> >
> > Attaching a patch. Note that I have not added any test cases as the existing
> > test cases in create_table.sql and matview.sql would cover the code.
> >
> > Thoughts?
>
> Personally, I think it makes sense, as other commands (such as CREATE
> EXTENSION/INDEX ...) throw that error before any further operation too.
>
> I am just a little worried about the behavior change of [explain CTAS].
> Maybe someone will complain about the change from normal EXPLAIN info to
> error output.

I think we are clear with the patch for the plain (i.e. non-EXPLAIN) and
EXPLAIN ANALYZE CTAS/CMV cases.

The behaviour for EXPLAIN is as follows:

1) EXPLAIN without ANALYZE, without patch: the select part is planned (note
that the relations in the select part are checked for existence while
planning, and it fails if any of them don't exist); the existence of the
relation being created by CTAS/CMV is not checked, since we will neither
create the relation nor execute the plan.

2) EXPLAIN with ANALYZE, without patch: the select part is planned; since we
execute the plan, the CTAS/CMV relation's existence is checked during
execution, and it fails there if the relation already exists.

3) EXPLAIN without ANALYZE, with patch: the CTAS/CMV relation's existence is
checked before planning, and it fails if the relation already exists,
without going on to plan the select part.

4) EXPLAIN with ANALYZE, with patch: the CTAS/CMV relation's existence is
checked before rewrite and planning, and it fails if the relation already
exists, without going further.

IMO, let's not change the 1) behaviour to 3) with the patch. If
agreed, I can do the following way in ExplainOneUtility and will add a
comment on why we are doing this.

if (es->analyze)
 (void) CheckRelExistenceInCTAS(ctas, true);

Thoughts?

> And I took a look into the patch.
>
> +   StringInfoData emsg;
> +
> +   initStringInfo(&emsg);
> +
> +   if (level == NOTICE)
> +   appendStringInfo(&emsg,
>
> Using variable emsg and level seems a little complicated to me.
> How about just:
>
> if (!is_explain && ctas->if_not_exists)
> ereport(NOTICE,xxx
> else
> ereport(ERROR,xxx

I will modify it in the next version of the patch which I plan to send
once agreed on the above point.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Proposed patch for key managment

2020-12-10 Thread Bruce Momjian
On Thu, Dec 10, 2020 at 07:26:48PM +0800, Neil Chen wrote:
> 
> 
> Hi, everyone
> 
> I have read the patch and did some simple tests. I'm not entirely sure
> about some code segments; e.g.:
> 
> In the BootStrapKmgr() we generate a data encryption key by:
> key = generate_crypto_key(file_encryption_keylen);
> 
> However, I found that the file_encryption_keylen is always 0 in bootstrap
> mode because there exists another variable 
> bootstrap_file_encryption_keylen
> in xlog.c and bootstrap.c.

Oh, good point;  that is very helpful.  I was relying on SetConfigOption
to set file_encryption_keylen, but that happens _after_ we create the
keys, so they were zero length.  I have fixed this by passing
bootstrap_file_encryption_keylen to the boot routines.  The diff URL has
the fix:

https://github.com/postgres/postgres/compare/master...bmomjian:key.diff

> We get the REL/WAL key by KmgrGetKey() call and it works like:
> return (const CryptoKey *) &(KmgrShmem->intlKeys[id]);
> 
> But in bootstrap mode, KmgrShmem is not assigned. So, if we want to
> use it to encrypt something in bootstrap mode, I suggest we make the
> following changes:
> if ( in bootstrap mode)
> return intlKeys[id]; // a static variable which contains key
> else
> return intlKeys[id]; // a static variable which contains the key

Yes, you are also correct here.  I had not gotten to using KmgrGetKey
yet, but it clearly needs your suggestion, so I have done that.
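
For illustration, the adjusted routine could end up looking something like
this (a sketch; "bootKeys" stands for whatever static array the bootstrap
path fills in, and the real function may differ):

    const CryptoKey *
    KmgrGetKey(int id)
    {
        /*
         * In bootstrap mode KmgrShmem has not been set up yet, so hand
         * back the process-local copy generated by BootStrapKmgr().
         */
        if (IsBootstrapProcessingMode())
            return (const CryptoKey *) &bootKeys[id];

        return (const CryptoKey *) &(KmgrShmem->intlKeys[id]);
    }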

Thanks for your help.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-10 Thread tsunakawa.ta...@fujitsu.com
From: tsunakawa.ta...@fujitsu.com 
> What's valuable as a code comment to describe the remaining issue is that the

You can attach XXX or FIXME in front of the issue description for easier 
search.  (XXX appears to be used much more often in Postgres.)


Regards
Takayuki Tsunakawa



Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-10 Thread Amit Kapila
On Fri, Dec 11, 2020 at 5:54 AM k.jami...@fujitsu.com
 wrote:
>
> On Thursday, December 10, 2020 8:12 PM, Amit Kapila wrote:
> > On Thu, Dec 10, 2020 at 1:40 PM k.jami...@fujitsu.com
> >  wrote:
> > >
> > > Yes, I have tested that optimization works for index relations.
> > >
> > > I have attached the V34, following the conditions that we use "cached"
> > > flag for both DropRelFileNodesBuffers() and DropRelFileNodesBuffers()
> > > for consistency.
> > > I added comment in 0004 the limitation of optimization when there are
> > > TOAST relations that use NON-PLAIN strategy. i.e. The optimization
> > > works if the data types used are integers, OID, bytea, etc. But for
> > > TOAST-able data types like text, the optimization will be skipped and 
> > > force a
> > full scan during recovery.
> > >
> >
> > AFAIU, it won't take optimization path only when we have TOAST relation but
> > there is no insertion corresponding to it. If so, then we don't need to 
> > mention
> > it specifically because there are other similar cases where the optimization
> > won't work like when during recovery we have to just perform TRUNCATE.
> >
>
> Right, I forgot to add that there should be an update like insert to the TOAST
> relation for truncate optimization to work. However, that is only limited to
> TOAST relations with PLAIN strategy. I have tested with text data type, with
> Inserts before truncate, and it did not enter the optimization path.
>

I think you are seeing this because the text datatype allows creating toast
storage and your data is small enough to not actually be toasted.

> OTOH,
> It worked for data type like integer.
>

It is not related to any datatype, it can happen whenever we don't
have any operation on any of the forks after recovery.

> So should I still not include that information?
>

I think we can extend your existing comment like: "Otherwise if the
size of a relation fork is not cached, we proceed to a full scan of
the whole buffer pool. This can happen if there is no update to a
particular fork during recovery."

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-10 Thread tsunakawa.ta...@fujitsu.com
From: Jamison, Kirk/ジャミソン カーク 
> On Thursday, December 10, 2020 8:12 PM, Amit Kapila wrote:
> > AFAIU, it won't take optimization path only when we have TOAST relation but
> > there is no insertion corresponding to it. If so, then we don't need to 
> > mention
> > it specifically because there are other similar cases where the optimization
> > won't work like when during recovery we have to just perform TRUNCATE.
> >
> 
> Right, I forgot to add that there should be an update like insert to the TOAST
> relation for truncate optimization to work. However, that is only limited to
> TOAST relations with PLAIN strategy. I have tested with text data type, with
> Inserts before truncate, and it did not enter the optimization path. OTOH,
> It worked for data type like integer. So should I still not include that 
> information?

What's valuable in a code comment describing the remaining issue is that the
reader can find clues as to whether it is related to the problem he/she has
hit, and/or how to solve the issue.  I don't think the current comment is so
bad in that regard, but it seems better to add:

* The condition of the issue: the table's ancillary storage (index, TOAST 
table, FSM, VM, etc.) was not updated during recovery.
(As an aside, "during recovery" here does not mean "after the last checkpoint" 
but "from the start of recovery", because the standby experiences many 
checkpoints (the correct term is restartpoints in case of standby).)

* The cause as a hint to solve the issue: The startup process does not find 
page modification WAL records.  As a result, it won't call 
XLogReadBufferExtended() and smgrnblocks() called therein, so the relation/fork 
size is not cached.
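
In code terms, the optimization hinges on a check along these lines at the
top of the buffer-dropping path (only a sketch of the patch's "cached" flag
idea; the exact API of the patched smgrnblocks() may differ):

    BlockNumber nblocks;
    bool        cached;

    /* ask smgr for the fork size, and whether it came from the cache */
    nblocks = smgrnblocks(smgr_reln, forknum, &cached);

    /*
     * If recovery never touched this fork, nothing was cached and we
     * cannot trust nblocks enough to scan only the matching buffers,
     * so fall back to scanning the whole buffer pool.
     */
    if (!cached)
        return false;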


Regards
Takayuki Tsunakawa




RE: Fail Fast In CTAS/CMV If Relation Already Exists To Avoid Unnecessary Rewrite, Planning Costs

2020-12-10 Thread Hou, Zhijie
> Currently, for CTAS or CREATE MATERIALIZED VIEW(CMV) without if-not-exists
> clause, the existence of the relation gets checked during the execution
> of the select part and an error is thrown there.
> All the unnecessary rewrite and planning for the select part would have
> happened just to fail later. However, if if-not-exists clause is present,
> then a notice is issued and returned immediately without any further rewrite
> or planning for the select part. This seems somewhat inconsistent to me.
> 
> I propose to check the relation existence early in ExecCreateTableAs() as
> well as in ExplainOneUtility() and throw an error in case it exists already
> to avoid unnecessary rewrite, planning and execution of the select part.
> 
> Attaching a patch. Note that I have not added any test cases as the existing
> test cases in create_table.sql and matview.sql would cover the code.
> 
> Thoughts?

Personally, I think it makes sense, as other commands (such as CREATE
EXTENSION/INDEX ...) throw that error before any further operation too.

I am just a little worried about the behavior change of [explain CTAS].
Maybe someone will complain about the change from normal EXPLAIN info to error output.

And I took a look into the patch.

+   StringInfoData emsg;
+
+   initStringInfo(&emsg);
+
+   if (level == NOTICE)
+   appendStringInfo(&emsg,

Using variable emsg and level seems a little complicated to me.
How about just:

if (!is_explain && ctas->if_not_exists)
ereport(NOTICE,xxx
else
ereport(ERROR,xxx

Best regards,
houzj





RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-10 Thread k.jami...@fujitsu.com
On Thursday, December 10, 2020 8:12 PM, Amit Kapila wrote:
> On Thu, Dec 10, 2020 at 1:40 PM k.jami...@fujitsu.com
>  wrote:
> >
> > Yes, I have tested that optimization works for index relations.
> >
> > I have attached the V34, following the conditions that we use "cached"
> > flag for both DropRelFileNodesBuffers() and DropRelFileNodesBuffers()
> > for consistency.
> > I added comment in 0004 the limitation of optimization when there are
> > TOAST relations that use NON-PLAIN strategy. i.e. The optimization
> > works if the data types used are integers, OID, bytea, etc. But for
> > TOAST-able data types like text, the optimization will be skipped and force 
> > a
> full scan during recovery.
> >
> 
> AFAIU, it won't take optimization path only when we have TOAST relation but
> there is no insertion corresponding to it. If so, then we don't need to 
> mention
> it specifically because there are other similar cases where the optimization
> won't work like when during recovery we have to just perform TRUNCATE.
> 

Right, I forgot to add that there should be an update like insert to the TOAST
relation for truncate optimization to work. However, that is only limited to
TOAST relations with PLAIN strategy. I have tested with text data type, with
Inserts before truncate, and it did not enter the optimization path. OTOH,
It worked for data type like integer. So should I still not include that 
information?

Also, I will remove the unnecessary "cached" from the line that Tsunakawa-san
mentioned. I will wait for a few more comments before reuploading what will
hopefully be the final version, including the test for truncate.

Regards,
Kirk Jamison


Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-12-10 Thread David Rowley
Thanks a lot for testing this patch. It's good to see it run through a
benchmark that exercises quite a few join problems.

On Fri, 11 Dec 2020 at 05:44, Konstantin Knizhnik
 wrote:
> For most queries performance is the same, and some queries are executed
> faster, but one query is 150 times slower:
>
>
> explain analyze SELECT MIN(chn.name) AS character,
...
>   Execution Time: 523002.608 ms

> I attach file with times of query execution.

I noticed the time reported in results.csv is exactly the same as the
one in the EXPLAIN ANALYZE above.  One thing to note there is that it
would be a bit fairer if the benchmark were testing the execution time
of the query instead of the time to EXPLAIN ANALYZE it.

One of the reasons that the patch may look less favourable here is
that the timing overhead on EXPLAIN ANALYZE increases with additional
nodes.
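
For context, this is roughly what instrumentation wraps around every node's
ExecProcNode() call (simplified from execProcnode.c): two clock readings per
row per node, which is why plans with more nodes pay a larger EXPLAIN
ANALYZE penalty:

    static TupleTableSlot *
    ExecProcNodeInstr(PlanState *node)
    {
        TupleTableSlot *result;

        InstrStartNode(node->instrument);

        result = node->ExecProcNodeReal(node);

        InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);

        return result;
    }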

To put this to the test, I used the tables and query from [1]:

#  explain (analyze, costs off) select count(*) from hundredk hk inner
# join lookup l on hk.thousand = l.a;
                              QUERY PLAN
------------------------------------------------------------------------
 Aggregate (actual time=1891.262..1891.263 rows=1 loops=1)
   ->  Nested Loop (actual time=0.312..1318.087 rows=999 loops=1)
         ->  Seq Scan on hundredk hk (actual time=0.299..15.753 rows=10 loops=1)
         ->  Result Cache (actual time=0.000..0.004 rows=100 loops=10)
               Cache Key: hk.thousand
               Hits: 99000  Misses: 1000  Evictions: 0  Overflows: 0  Memory Usage: 3579kB
               ->  Index Only Scan using lookup_a_idx on lookup l (actual time=0.003..0.012 rows=100 loops=1000)
                     Index Cond: (a = hk.thousand)
                     Heap Fetches: 0
 Planning Time: 3.471 ms
 Execution Time: 1891.612 ms
(11 rows)

You can see here the query took 1.891 seconds to execute.

Same query without EXPLAIN ANALYZE.

postgres=# \timing
Timing is on.
postgres=# select count(*) from hundredk hk inner
postgres-# join lookup l on hk.thousand = l.a;
  count
-
 999
(1 row)

Time: 539.449 ms

Or is it more accurate to say it took just 0.539 seconds?

Going through the same query after disabling, in turn, enable_resultcache,
enable_mergejoin and enable_nestloop, I can generate the following table,
which compares the EXPLAIN ANALYZE time to the \timing time.

postgres=# select type,ea_time,timing_time, round(ea_time::numeric /
timing_time::numeric,3) as ea_overhead from results order by
timing_time;
      type      | ea_time  | timing_time | ea_overhead
----------------+----------+-------------+-------------
 Nest loop + RC | 1891.612 |     539.449 |       3.507
 Merge join     | 2411.632 |    1008.991 |       2.390
 Nest loop      |  2484.82 |     1049.63 |       2.367
 Hash join      | 4969.284 |    3272.424 |       1.519

Result Cache will be hit a bit harder by this problem due to it having
additional nodes in the plan. The Hash Join query seems to suffer much
less from this problem.

However, having said that, it's certainly not the entire problem here:

Hits: 5  Misses: 156294 Evictions: 0  Overflows: 0  Memory Usage: 9769kB

The planner must have thought there'd be more hits than that or it
wouldn't have thought Result Caching would be a good plan.  Estimating
the cache hit ratio using n_distinct becomes much less reliable when
there are joins and filters. A.K.A the real world.

David

[1] 
https://www.postgresql.org/message-id/caaphdvrpcqyqdwergywx8j+2dlungxu+fosbq1uscxrunyx...@mail.gmail.com




Feature Proposal: Add ssltermination parameter for SNI-based LoadBalancing

2020-12-10 Thread Lukas Meisegeier

Hi,

I am trying to host multiple PostgreSQL servers on the same IP and the same
port through SNI-based load balancing.
Currently this is not possible because of two issues:
1. The psql client won't set the TLS SNI extension correctly
(https://www.postgresql.org/message-id/20181211145240.GL20222%40redhat.com)
2. The psql connection protocol sends an SSLRequest in plain text
before actually opening an SSL connection.

The first issue is easily solvable by calling
`SSL_set_tlsext_host_name(conn->ssl, conn->connhost[conn->whichhost].host)`
before opening the connection.
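
A minimal sketch of that first fix in fe-secure-openssl.c, guarding against
host entries that are not names (the exact placement in the real patch may
differ):

    /* before SSL_connect(): offer SNI only when we have a host name */
    if (conn->connhost[conn->whichhost].type == CHT_HOST_NAME &&
        conn->connhost[conn->whichhost].host != NULL &&
        conn->connhost[conn->whichhost].host[0] != '\0')
        SSL_set_tlsext_host_name(conn->ssl,
                                 conn->connhost[conn->whichhost].host);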

The second issue is also solvable through a new parameter
"ssltermination", which if set to "proxy" will skip the initial
SSLRequest and connect directly through SSL.
The default value would be "server", which changes nothing in the
existing behaviour.

I compiled the psql client with these changes and was able to connect to
two different databases through the same IP and port just by changing the
hostname.

This change is important to allow multiple Postgres instances on one IP
without having to add a port number.

I implemented this change on a fork of the postgres mirror on github:
https://github.com/klg71/mayope_postgres

The  affected files are:
- src/interfaces/libpq/fe-connect.c (added ssltermination parameter)
- src/interfaces/libpq/libpq-int.h (added ssltermination parameter)
- src/interfaces/libpq/fe-secure-openssl.c (added tls-sni-extension)

I appended the relevant diff.

Best Regards
Lukas
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 7d04d3664e..43fcfc2274 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -131,6 +131,7 @@ static int  ldapServiceLookup(const char *purl, PQconninfoOption *options,
 #define DefaultTargetSessionAttrs  "any"
 #ifdef USE_SSL
 #define DefaultSSLMode "prefer"
+#define DefaultSSLTermination "server"
 #else
 #define DefaultSSLMode "disable"
 #endif
@@ -293,6 +294,11 @@ static const internalPQconninfoOption PQconninfoOptions[] = {
        "SSL-Mode", "", 12,     /* sizeof("verify-full") == 12 */
    offsetof(struct pg_conn, sslmode)},
 
+   {"ssltermination", "PGSSLTERMINATION", DefaultSSLMode, NULL,
+       "SSL-Termination-Mode", "", 6,  /* sizeof("server") == 6 */
+   offsetof(struct pg_conn, ssltermination)},
+
+
{"sslcompression", "PGSSLCOMPRESSION", "0", NULL,
"SSL-Compression", "", 1,
offsetof(struct pg_conn, sslcompression)},
@@ -1278,6 +1284,16 @@ connectOptions2(PGconn *conn)
return false;
}
 
+   if (strcmp(conn->ssltermination, "server") != 0
+       && strcmp(conn->ssltermination, "proxy") != 0)
+   {
+       conn->status = CONNECTION_BAD;
+       printfPQExpBuffer(&conn->errorMessage,
+                         libpq_gettext("invalid %s value: \"%s\"\n"),
+                         "ssltermination", conn->ssltermination);
+       return false;
+   }
+
 #ifndef USE_SSL
switch (conn->sslmode[0])
{
@@ -2915,6 +2931,13 @@ keep_going:  /* We will come back to here until there is
            if (conn->allow_ssl_try && !conn->wait_ssl_try &&
                !conn->ssl_in_use)
            {
+               /*
+                * SSL termination is handled by proxy/load-balancer, no need to send SSLRequest
+                */
+               if (conn->ssltermination[0] == 'p') {
+                   conn->status = CONNECTION_SSL_STARTUP;
+                   return PGRES_POLLING_WRITING;
+               }
                ProtocolVersion pv;
 
                /*
@@ -2995,77 +3018,89 @@ keep_going:  /* We will come back to here until there is
            if (!conn->ssl_in_use)
            {
                /*
-                * We use pqReadData here since it has the logic to
-                * distinguish no-data-yet from connection closure. Since
-                * conn->ssl isn't set, a plain recv() will occur.
+                * Skip SSLRequest package and initialize ssl directly with proxy
                 */
-               char        SSLok;
-               int         rdresult;
-
-

extended statistics - functional dependencies vs. MCV lists

2020-12-10 Thread Tomas Vondra
Hi,

over in the pgsql-general channel, Michael Lewis reported [1] a bit
strange behavior switching between good/bad estimates with extended
statistics.

The crux of the issue is that with statistics containing both MCV and
functional dependencies, we prefer applying the MCV. And functional
dependencies are used only for the remaining clauses on columns not
covered by the MCV list.

This works perfectly fine when the clauses match a MCV item (or even
multiple of them). But if there's no matching MCV item, this may be
problematic - statext_mcv_clauselist_selectivity tries to be smart, but
when the MCV represents only a small fraction of the data set the
results may not be far from just a product of selectivities (as if the
clauses were independent).

So I'm wondering about two things:

1) Does it actually make sense to define extended statistics with both
MCV and functional dependencies? ISTM the MCV part will always filter
all the clauses, before we even try to apply the dependencies.

2) Could we consider the functional dependencies when estimating the
part not covered by the MCV list? Of course, this could only help with
equality clauses (as supported by functional dependencies).


regards


[1]
https://www.postgresql.org/message-id/CAMcsB%3Dy%3D3G_%2Bs_zFYPu2-O6OMWOvOkb3t1MU%3D907yk5RC_RaYw%40mail.gmail.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-12-10 Thread Robert Haas
On Thu, Oct 29, 2020 at 5:36 PM Alvaro Herrera  wrote:
> Maybe instead of thinking specifically in terms of vacuum, we could
> count buffer accesses (read from kernel) and check the latch once every
> 1000th such, or something like that.  Then a very long query doesn't
> have to wait until it's run to completion.  The cost is one integer
> addition per syscall, which should be bearable.

Interesting idea. One related case is where everything is fine on the
server side but the client has disconnected and we don't notice that
the socket has changed state until something makes us try to send a
message to the client, which might be a really long time if the
server's doing like a lengthy computation before generating any rows.
It would be really nice if we could find a cheap way to check for both
postmaster death and client disconnect every now and then, like if a
single system call could somehow answer both questions.
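
A minimal sketch of the counting idea quoted above (illustrative names; the
actual check would have to live near the kernel-read path, e.g. in bufmgr.c
or md.c):

    #define PM_DEATH_CHECK_INTERVAL 1000

    static uint32 reads_since_check = 0;

    static inline void
    MaybeCheckPostmasterAlive(void)
    {
        if (++reads_since_check < PM_DEATH_CHECK_INTERVAL)
            return;

        reads_since_check = 0;
        if (!PostmasterIsAlive())
            proc_exit(1);
    }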

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: cutting down the TODO list thread

2020-12-10 Thread John Naylor
Hi,

Continuing with TODO list maintenance, first a couple things to clean up:

- Allow ALTER INDEX ... RENAME concurrently

This was in the wrong section, but it's irrelevant: The lock level was
lowered in commit 1b5d797cd4f, so I went ahead and removed this already.

- Add CORRESPONDING BY to UNION/INTERSECT/EXCEPT

The link titled "how not to write this patch" points to a web archive of
the author's description of how he implemented the rejected patch. That
doesn't seem useful, since it was...rejected. I propose to replace that
with the -hackers thread, where there is discussion of the design problem:

https://www.postgresql.org/message-id/flat/CAJZSWkWN3YwQ01C3%2Bcq0%2BeyZ1DmK%3D69_6vryrsVGMC%3D%2BfWrSZA%40mail.gmail.com

Now, for the proposed items to move to "Not Worth Doing". As before, let me
know of any objections. I plan to move these early next week:

*Views and Rules

- Allow VIEW/RULE recompilation when the underlying tables change

The entry itself says "This is both difficult and controversial." and the
linked threads confirm that.

- Make it possible to use RETURNING together with conditional DO INSTEAD
rules, such as for partitioning setups

This was from before we got native partitioning, so the stated rationale is
outdated.


*SQL Commands (this is a huge section, for now just doing the miscellany at
the top before the various subsections)

- Add a GUC variable to warn about non-standard SQL usage in queries

I don't see the reason for this, and it sounds difficult anyway.

- Add NOVICE output level for helpful messages

This would only be useful if turned on, so it is going to be least used where
it might help the most. It also sounds like a lot of slow menial work to
implement.

- Allow DISTINCT to work in multiple-argument aggregate calls

Tom suggested this in 2006 for the sake of orthogonality. Given the amount
of time passed, it seems not very important.

- Allow DELETE and UPDATE to be used with LIMIT and ORDER BY

Some use cases were mentioned, but nearly all have some kind of workaround
already.

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-12-10 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Alvaro Herrera  writes:
> > On 2020-Oct-29, Stephen Frost wrote:
> >> I do think it'd be good to find a way to check every once in a while
> >> even when we aren't going to delay though.  Not sure what the best
> >> answer there is.
> 
> > Maybe instead of thinking specifically in terms of vacuum, we could
> > count buffer accesses (read from kernel) and check the latch once every
> > 1000th such, or something like that.  Then a very long query doesn't
> > have to wait until it's run to completion.  The cost is one integer
> > addition per syscall, which should be bearable.
> 
> I'm kind of unwilling to add any syscalls at all to normal execution
> code paths for this purpose.  People shouldn't be sig-kill'ing the
> postmaster, or if they do, cleaning up the mess is their responsibility.
> I'd also suggest that adding nearly-untestable code paths for this
> purpose is a fine way to add bugs we'll never catch.
> 
> The if-we're-going-to-delay-anyway path in vacuum_delay_point seems
> OK to add a touch more overhead to, though.

Alright, for this part at least, seems like it'd be something like the
attached.

Only lightly tested, but does seem to address the specific example which
was brought up on this thread.

Thoughts..?

Thanks,

Stephen
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 98270a1049..c90a4edb98 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2069,9 +2069,11 @@ vacuum_delay_point(void)
 		if (msec > VacuumCostDelay * 4)
 			msec = VacuumCostDelay * 4;
 
-		pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
-		pg_usleep((long) (msec * 1000));
-		pgstat_report_wait_end();
+		(void) WaitLatch(MyLatch,
+		 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+		 msec,
+		 WAIT_EVENT_VACUUM_DELAY);
+		ResetLatch(MyLatch);
 
 		VacuumCostBalance = 0;
 




Re: Proposed patch for key managment

2020-12-10 Thread Bruce Momjian
On Wed, Dec  9, 2020 at 05:18:37PM -0500, Stephen Frost wrote:
> Greetings,
> 
> * Daniel Gustafsson (dan...@yesql.se) wrote:
> > > On 2 Dec 2020, at 22:38, Bruce Momjian  wrote:
> > > Attached is a patch for key management, which will eventually be part of
> > > cluster file encryption (CFE), called TDE (Transparent Data Encryption)
> > > by Oracle.
> > 
> > The attack vector protected against seems to be operating system users
> > (*without* any legitimate database access) gaining access to the data files.
> 
> That isn't the only attack vector that this addresses (though it is one
> that this is envisioned to help with- to wit, someone rsync'ing the DB
> directory).  TDE also helps with traditional data at rest requirements,
> in environments where you don't trust the storage layer to handle that
> (for whatever reason), and it's at least imagined that backups with
> pg_basebackup would also be encrypted, helping protect against backup
> theft.
> 
> There's, further, certainly no shortage of folks asking for this.

Can we flesh this out more in the docs?  Any idea on wording compared to
what I have?

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





pg_waldump error message fix

2020-12-10 Thread Bossart, Nathan
Hi,

I noticed that when pg_waldump finds an invalid record, the
corresponding error message seems to point to the last valid record
read.

rmgr: ... lsn: 0/090E5AF8, prev 0/090E59D0, ...
pg_waldump: fatal: error in WAL record at 0/90E5AF8: invalid record length at 0/90E5B30: wanted 24, got 0

Should pg_waldump report currRecPtr instead of ReadRecPtr in the error
message?  With that, I see the following.

rmgr: ... lsn: 0/090E5AF8, prev 0/090E59D0, ...
pg_waldump: fatal: error in WAL record at 0/90E5B30: invalid record length at 0/90E5B30: wanted 24, got 0

Here is the patch:

diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 31e99c2a6d..27da60e6db 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1110,8 +1110,8 @@ main(int argc, char **argv)

if (errormsg)
fatal_error("error in WAL record at %X/%X: %s",
-   (uint32) (xlogreader_state->ReadRecPtr >> 32),
-   (uint32) xlogreader_state->ReadRecPtr,
+   (uint32) (xlogreader_state->currRecPtr >> 32),
+   (uint32) xlogreader_state->currRecPtr,
errormsg);

XLogReaderFree(xlogreader_state);

Nathan



Re: Proposed patch for key managment

2020-12-10 Thread Bruce Momjian
On Wed, Dec  9, 2020 at 10:34:59PM +0100, Daniel Gustafsson wrote:
> > On 2 Dec 2020, at 22:38, Bruce Momjian  wrote:
> > 
> > Attached is a patch for key management, which will eventually be part of
> > cluster file encryption (CFE), called TDE (Transparent Data Encryption)
> > by Oracle.
> 
> The attack vector protected against seems to be operating system users
> (*without* any legitimate database access) gaining access to the data files.
> Such an attacker would also gain access to postgresql.conf and therefore may
> have the cluster passphrase command in case it's stored there.  That would
> imply the attacker is likely (or perhaps better phrased "not entirely
> unlikely") to be able to run that command and decrypt the data *iff* there is
> no interactive/2fa aspect to the passphrase command.  Is that an accurate
> interpretation?  Do Oracle TDE et al. use a similar setup where a database
> restart requires manual intervention?

I have updated the docs to say "read" access more clearly:

   The purpose of cluster file encryption is to prevent users with read
   access to the directories used to store database files from being able to
   access the data stored in those files.  For example, when using cluster
   file encryption, users who have read access to the cluster directories
   for backup purposes will not be able to read the data stored in
   these files.

I already said "read access" in the later part of the paragraph, but
obviously it needs to be mentioned early too.  If we need more text
here, please let me know.

As far as someone hijacking the passphrase command goes, that is a serious
risk.  No matter how you do the authentication, even with TFA, someone will
be able to grab that passphrase as it comes out of the script and is passed
to the server.  It might help to store the script and postgres binaries in a
more secure location than the data, but I am not sure how realistic that is.
I can see that if you are using a NAS for data storage and have local
binaries and a local passphrase script, that would be more secure than
putting the script on the NAS.  Is that something we should explain?
Should we explicitly say that there is no protection against write
access?  What can someone with write access to PGDATA do if
postgresql.conf is not stored there?  I don't know.

> Admittedly I haven't read the other threads on the topic so if it's discussed
> somewhere else then I'd appreciate a whacking with a cluestick.
> 
> A few other comments from skimming the patch:
> 
> +  is data encryption keys, specifically keys zero and one.  Key zero is
> +  the key uses to encrypt database heap and index files which are stored in
> ".. is the key used to .."?

Fixed, thanks.

> +   matches the initdb-supplied passphrase.  If this check fails, and the
> +   the server will refuse to start.
> Seems like something is missing, perhaps just s/and the//?

Fixed.

> +  url="https://tools.ietf.org/html/rfc2315;>RFC2315.
> Tiny nitpick: turns out that RFC 2315 (with a space) is the correct spelling.

Fixed.

> + * are read-only.  All wrapping and unwrapping key routines depends on
> + * the openssl library for now.
> OpenSSL is a name so it should be cased as OpenSSL in text like this.

Fixed, and fixed "depends".

>  #include 
> +#include 
> Why do we need this header in be-secure-openssl.c?  There are no other changes
> to that file?

Not sure, removed, since it compiles fine without it.
> 
> +/* Define to 1 if you have the `OPENSSL_init_crypto' function. */
> +#undef HAVE_OPENSSL_INIT_CRYPTO
> This seems to be unused?

Agreed, removed.

> KmgrSaveCryptoKeys doesn't seem to be explicitly clamping down permissions on
> the generated file, is that something we want for belt-and-suspender level
> paranoia around keys? The same goes for read_one_keyfile etc.

Well, KmgrSaveCryptoKeys is calling BasicOpenFile, which is calling
BasicOpenFilePerm(fileName, fileFlags, pg_file_create_mode).  The
question is whether we should honor the pg_file_create_mode specified on
the data directory, or make it tighter for these keys.  The file is
already owner-rw-only by default:

$ ls -l
drwx-- 2 postgres postgres 4096 Dec 10 13:13 live/
$ cd live/
$ ls -l
total 8
-rw--- 1 postgres postgres 356 Dec 10 13:13 0
-rw--- 1 postgres postgres 356 Dec 10 13:13 1

If the key files were mixed in with a bunch of other files of lesser
value, like with Apache, I could see not honoring it, but for this case,
since all the files are of equal value, I think just honoring it makes
sense.
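
If we did want to clamp the key files down regardless of
pg_file_create_mode, the plain-POSIX shape of that would be something like
the following (illustration only; the real code goes through the
BasicOpenFilePerm() path quoted above):

    #include <fcntl.h>
    #include <sys/stat.h>

    /*
     * Illustration: create a key file with an explicit owner-read/write-only
     * mode (0600) instead of the cluster-wide pg_file_create_mode.
     */
    static int
    open_key_file(const char *path)
    {
        return open(path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    }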

> Also, KmgrSaveCryptoKeys assumes that keys are stored in plain files which is
> true for OpenSSL but won't necessarily be true for other crypto libraries.
> Would it make sense to have an API in be-kmgr.c with the OpenSSL specifics in
> be-kmgr-openssl.c like how the TLS backend support is written?  We have spent 
> a
> lot effort in making the backend not assume a particular TLS library, it seems
> a 

Re: Allow CURRENT_ROLE in GRANTED BY

2020-12-10 Thread Peter Eisentraut

On 2020-06-24 20:21, Peter Eisentraut wrote:

On 2020-06-24 10:12, Vik Fearing wrote:

On 6/24/20 8:35 AM, Peter Eisentraut wrote:

I was checking some loose ends in SQL conformance, when I noticed: We
support GRANT role ... GRANTED BY CURRENT_USER, but we don't support
CURRENT_ROLE in that place, even though in PostgreSQL they are
equivalent.  Here is a trivial patch to add that.



The only thing that isn't dead-obvious about this patch is the commit
message says "[PATCH 1/2]".  What is in the other part?


Hehe.  The second patch is some in-progress work to add the GRANTED BY
clause to the regular GRANT command.  More on that perhaps at a later date.


Here is the highly anticipated and quite underwhelming second part of 
this patch set.


--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/
From 72b1e99337b6fc9e72bcca003eb2d18351079261 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 10 Dec 2020 19:38:21 +0100
Subject: [PATCH] Allow GRANTED BY clause in normal GRANT and REVOKE statements

The SQL standard allows a GRANTED BY clause on GRANT and
REVOKE (privilege) statements that can specify CURRENT_USER or
CURRENT_ROLE.  In PostgreSQL, both of these are the default behavior.
Since we already have all the parsing support for this for the
GRANT (role) statement, we might as well add basic support for this
for the privilege variant as well.  This allows us to check off SQL
feature T332.  In the future, perhaps more interesting things could be
done with this, too.
---
 doc/src/sgml/ref/grant.sgml  | 25 ++---
 doc/src/sgml/ref/revoke.sgml | 13 +
 src/backend/catalog/aclchk.c | 16 
 src/backend/catalog/sql_features.txt |  2 +-
 src/backend/nodes/copyfuncs.c|  1 +
 src/backend/nodes/equalfuncs.c   |  1 +
 src/backend/parser/gram.y| 13 -
 src/include/nodes/parsenodes.h   |  1 +
 8 files changed, 63 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/ref/grant.sgml b/doc/src/sgml/ref/grant.sgml
index c3db393bde..a897712de2 100644
--- a/doc/src/sgml/ref/grant.sgml
+++ b/doc/src/sgml/ref/grant.sgml
@@ -26,58 +26,71 @@
 ON { [ TABLE ] table_name [, 
...]
  | ALL TABLES IN SCHEMA schema_name [, ...] }
 TO role_specification [, ...] 
[ WITH GRANT OPTION ]
+[ GRANTED BY role_specification ]
 
 GRANT { { SELECT | INSERT | UPDATE | REFERENCES } ( column_name [, ...] )
 [, ...] | ALL [ PRIVILEGES ] ( column_name [, ...] ) }
 ON [ TABLE ] table_name [, 
...]
 TO role_specification [, ...] 
[ WITH GRANT OPTION ]
+[ GRANTED BY role_specification ]
 
 GRANT { { USAGE | SELECT | UPDATE }
 [, ...] | ALL [ PRIVILEGES ] }
 ON { SEQUENCE sequence_name 
[, ...]
  | ALL SEQUENCES IN SCHEMA schema_name [, ...] }
 TO role_specification [, ...] 
[ WITH GRANT OPTION ]
+[ GRANTED BY role_specification ]
 
 GRANT { { CREATE | CONNECT | TEMPORARY | TEMP } [, ...] | ALL [ PRIVILEGES ] }
 ON DATABASE database_name [, ...]
 TO role_specification [, ...] 
[ WITH GRANT OPTION ]
+[ GRANTED BY role_specification ]
 
 GRANT { USAGE | ALL [ PRIVILEGES ] }
 ON DOMAIN domain_name [, ...]
 TO role_specification [, ...] 
[ WITH GRANT OPTION ]
+[ GRANTED BY role_specification ]
 
 GRANT { USAGE | ALL [ PRIVILEGES ] }
 ON FOREIGN DATA WRAPPER fdw_name [, ...]
 TO role_specification [, ...] 
[ WITH GRANT OPTION ]
+[ GRANTED BY role_specification ]
 
 GRANT { USAGE | ALL [ PRIVILEGES ] }
 ON FOREIGN SERVER server_name [, ...]
 TO role_specification [, ...] 
[ WITH GRANT OPTION ]
+[ GRANTED BY role_specification ]
 
 GRANT { EXECUTE | ALL [ PRIVILEGES ] }
 ON { { FUNCTION | PROCEDURE | ROUTINE } 
routine_name [ ( [ [ argmode ] [ arg_name ] arg_type [, ...] ] ) ] [, ...]
  | ALL { FUNCTIONS | PROCEDURES | ROUTINES } IN SCHEMA schema_name [, ...] }
 TO role_specification [, ...] 
[ WITH GRANT OPTION ]
+[ GRANTED BY role_specification ]
 
 GRANT { USAGE | ALL [ PRIVILEGES ] }
 ON LANGUAGE lang_name [, ...]
 TO role_specification [, ...] 
[ WITH GRANT OPTION ]
+[ GRANTED BY role_specification ]
 
 GRANT { { SELECT | UPDATE } [, ...] | ALL [ PRIVILEGES ] }
 ON LARGE OBJECT loid [, ...]
 TO role_specification [, ...] 
[ WITH GRANT OPTION ]
+[ GRANTED BY role_specification ]
 
 GRANT { { CREATE | USAGE } [, ...] | ALL [ PRIVILEGES ] }
 ON SCHEMA schema_name [, ...]
 TO role_specification [, ...] 
[ WITH GRANT OPTION ]
+[ GRANTED BY role_specification ]
 
 GRANT { CREATE | ALL [ PRIVILEGES ] }
 ON TABLESPACE tablespace_name [, ...]
 TO role_specification [, ...] 
[ WITH GRANT OPTION ]
+[ GRANTED BY role_specification ]
 
 GRANT { USAGE | ALL [ PRIVILEGES ] }
 ON TYPE type_name [, ...]
 TO role_specification [, ...] 
[ WITH GRANT OPTION ]
+[ GRANTED BY role_specification ]
 
 GRANT role_name [, ...] TO 
role_specification [, ...]

Re: On login trigger: take three

2020-12-10 Thread Pavel Stehule
čt 10. 12. 2020 v 16:48 odesílatel Konstantin Knizhnik <
k.knizh...@postgrespro.ru> napsal:

>
>
> On 10.12.2020 18:12, Pavel Stehule wrote:
>
>
> My idea was a little bit different. Inside postinit initialize some global
> variables with info if there are event triggers or not. And later you can
> use this variable to start transactions and  other things.
>
> There will be two access to pg_event_trigger, but it can eliminate useless
> and probably more expensive start_transaction and end_transaction.
>
>
>
> Do you mean some variable in shared memory or GUCs?
> It was my first idea - to use some flag in shared memory to make it
> possible fast check that there are not event triggers.
> But this flag should be sent per database. May be I missed something, but
> there is no any per-database shared memory  data structure in Postgres.
> Certainly it is possible to organize some hash db->event_info, but it
> makes this patch several times more complex.
>

My idea was a little bit different - just inside process initialization,
check the existence of event triggers, and later start a transaction for
trigger execution only when it is necessary. This should reduce useless
empty transactions.
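
Roughly this shape (only a sketch; DatabaseHasClientConnectionTriggers() is
an invented placeholder for the cheap catalog probe, the other calls already
exist in the backend or in the patch):

    /* remembered once per backend, during process initialization */
    static bool have_connection_triggers = false;

    void
    CacheConnectionTriggerFlag(void)
    {
        /* placeholder helper: probe pg_event_trigger for this database once */
        have_connection_triggers = DatabaseHasClientConnectionTriggers();
    }

    void
    FireConnectionTriggersIfAny(void)
    {
        /* skip the useless empty transaction when no trigger is defined */
        if (!have_connection_triggers)
            return;

        StartTransactionCommand();
        EventTriggerOnConnect();
        CommitTransactionCommand();
    }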

I am sure this is a problem on computers with slower CPUs, although I have
an i7, but because this feature is usually unused, the performance impact
should be minimal every time.



>
> From my point of view it is better to have separate GUC disabling just
> client connection events and switch it on by default.
> So only those who need this events with switch it on, other users will not
> pay any extra cost for it.
>

It can work, but this design is not user friendly.  The significant
bottleneck should be forking new processes, and checking the content of some
usually very small tables should be invisible. So if there is a possible
way to implement this feature so that it can be enabled by default, then we
should go that way. It can be pretty unfriendly if somebody writes a logon
trigger and it does not work by default without any warning.

When I tested the last patch I found a problem (I have disabled assertions
due to performance testing):

create role xx login;
alter system set disable_event_triggers to on; -- better would be the positive
form - enable_event_triggers to on, off
select pg_reload_conf();

psql -U xx

Thread 2.1 "postmaster" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f2bff438ec0 (LWP 827497)]
ResourceOwnerEnlargeCatCacheRefs (owner=0x0) at resowner.c:1025
1025 ResourceArrayEnlarge(&(owner->catrefarr));
(gdb) bt
#0  ResourceOwnerEnlargeCatCacheRefs (owner=0x0) at resowner.c:1025
#1  0x008a70f8 in SearchCatCacheInternal (cache=,
nkeys=nkeys@entry=1, v1=v1@entry=13836, v2=v2@entry=0,
v3=v3@entry=0, v4=v4@entry=0) at catcache.c:1273
#2  0x008a7575 in SearchCatCache1 (cache=,
v1=v1@entry=13836) at catcache.c:1167
#3  0x008b7f80 in SearchSysCache1 (cacheId=cacheId@entry=21,
key1=key1@entry=13836) at syscache.c:1122
#4  0x005939cd in pg_database_ownercheck (db_oid=13836,
roleid=16387) at aclchk.c:5114
#5  0x00605b42 in EventTriggersDisabled () at event_trigger.c:650
#6  EventTriggerOnConnect () at event_trigger.c:839
#7  0x007b46d7 in PostgresMain (argc=argc@entry=1,
argv=argv@entry=0x7ffdd6256080,
dbname=,
username=0xf82508 "tom") at postgres.c:4021
#8  0x00741afd in BackendRun (port=, port=) at postmaster.c:4490
#9  BackendStartup (port=) at postmaster.c:4212
#10 ServerLoop () at postmaster.c:1729
#11 0x00742881 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0xf44d00)
at postmaster.c:1401
#12 0x004f1ff7 in main (argc=3, argv=0xf44d00) at main.c:209

I checked profiles, and although there is significant slowdown when there
is one login trigger, it was not related to plpgsql.

There is a big overhead of

   8,98%  postmaster  postgres  [.] guc_name_compare

Maybe EventTriggerInvoke is not well optimized for very frequent execution.
Overhead of plpgsql is less than 0.1%, although there is significant
slowdown (when a very simple logon trigger is executed).

Regards

Pavel


pg_basebackup test coverage

2020-12-10 Thread Robert Haas
$SUBJECT is not great. The options to pg_basebackup that are not
tested by any part of the regression test suite include the
single-letter options rlzZdUwWvP as well as --no-estimate-size.

It would probably be good to fix as much of this as we can, but there
are a couple of cases I think would be particularly good to cover. One
is 'pg_basebackup -Ft -Xnone -D -', which tries to write the output as
a single tar file on standard output, injecting the backup_manifest
file into the tar file instead of writing it out separately as we
normally would. This case requires special handling in a few places
and it would be good to check that it actually works. The other is the
-z or -Z option, which produces a compressed tar file.

Now, there's nothing to prevent us from running commands like this
from the pg_basebackup tests, but it doesn't seem like we could really
check anything meaningful. Perhaps we'd notice if the command exited
non-zero or didn't produce any output, but it would be nice to verify
that the resulting backups are actually correct. The way to do that
would seem to be to extract the tar file (and decompress it first, in
the -z/-Z case) and then run pg_verifybackup on the result and check
that it passes. However, as far as I can tell there's no guarantee
that the user has 'tar' or 'gunzip' installed on their system, so I
don't see a clean way to do this. A short (but perhaps incomplete)
search didn't turn up similar precedents in the existing tests.

Any ideas?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Change default of checkpoint_completion_target

2020-12-10 Thread Stephen Frost
Greetings,

* Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:
> On 2020-Dec-10, Stephen Frost wrote:
> > * Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> > > On Tue, 2020-12-08 at 17:29 +, Bossart, Nathan wrote:
> > > > +1 to setting checkpoint_completion_target to 0.9 by default.
> > > 
> > > +1 for changing the default or getting rid of it, as Tom suggested.
> > 
> > Attached is a patch to change it from a GUC to a compile-time #define
> > which is set to 0.9, with accompanying documentation updates.
> 
> I think we should leave a doc stub or at least an , to let
> people know the GUC has been removed rather than just making it
> completely invisible.  (Maybe piggyback on the stuff in [1]?)
> 
> [1] 
> https://postgr.es/m/CAGRY4nyA=jmBNa4LVwgGO1GyO-RnFmfkesddpT_uO+3=mot...@mail.gmail.com

Yes, I agree, and am involved in that thread as well - currently waiting on
feedback from others about the proposed approach.

Getting a few more people looking at that thread and commenting on it
would really help us be able to move forward.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Change default of checkpoint_completion_target

2020-12-10 Thread Alvaro Herrera
Howdy,

On 2020-Dec-10, Stephen Frost wrote:

> * Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> > On Tue, 2020-12-08 at 17:29 +, Bossart, Nathan wrote:
> > > +1 to setting checkpoint_completion_target to 0.9 by default.
> > 
> > +1 for changing the default or getting rid of it, as Tom suggested.
> 
> Attached is a patch to change it from a GUC to a compile-time #define
> which is set to 0.9, with accompanying documentation updates.

I think we should leave a doc stub or at least an , to let
people know the GUC has been removed rather than just making it
completely invisible.  (Maybe piggyback on the stuff in [1]?)

[1] 
https://postgr.es/m/CAGRY4nyA=jmBNa4LVwgGO1GyO-RnFmfkesddpT_uO+3=mot...@mail.gmail.com





Re: Change default of checkpoint_completion_target

2020-12-10 Thread Stephen Frost
Greetings,

* Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> On Tue, 2020-12-08 at 17:29 +, Bossart, Nathan wrote:
> > +1 to setting checkpoint_completion_target to 0.9 by default.
> 
> +1 for changing the default or getting rid of it, as Tom suggested.

Attached is a patch to change it from a GUC to a compile-time #define
which is set to 0.9, with accompanying documentation updates.

> While we are at it, could we change the default of "log_lock_waits" to "on"?

While I agree that it'd be good to change quite a few of the log_X items
to be 'on' by default, I'm not planning to work on this.

Thanks,

Stephen
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 42a8ed328d..ed2b27164b 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,10 +857,9 @@ SELECT pg_start_backup('label', false, false);
 
  By default, pg_start_backup can take a long time to finish.
  This is because it performs a checkpoint, and the I/O
- required for the checkpoint will be spread out over a significant
- period of time, by default half your inter-checkpoint interval
- (see the configuration parameter
- ).  This is
+ required for the checkpoint will be spread out over the inter-checkpoint
+ interval (see the configuration parameter
+ ).  This is
  usually what you want, because it minimizes the impact on query
  processing.  If you want to start the backup as soon as
  possible, change the second parameter to true, which will
@@ -1000,10 +999,9 @@ SELECT pg_start_backup('label');
 
  By default, pg_start_backup can take a long time to finish.
  This is because it performs a checkpoint, and the I/O
- required for the checkpoint will be spread out over a significant
- period of time, by default half your inter-checkpoint interval
- (see the configuration parameter
- ).  This is
+ required for the checkpoint will be spread out over the inter-checkpoint
+ interval (see the configuration parameter
+ ).  This is
  usually what you want, because it minimizes the impact on query
  processing.  If you want to start the backup as soon as
  possible, use:
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8cd3d6901c..39f9701959 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3251,22 +3251,6 @@ include_dir 'conf.d'
   
  
 
- 
-  checkpoint_completion_target (floating point)
-  
-   checkpoint_completion_target configuration parameter
-  
-  
-  
-   
-Specifies the target of checkpoint completion, as a fraction of
-total time between checkpoints. The default is 0.5.
-This parameter can only be set in the postgresql.conf
-file or on the server command line.
-   
-  
- 
-
  
   checkpoint_flush_after (integer)
   
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index d1c3893b14..735d0c0661 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -521,28 +521,19 @@
 
   
To avoid flooding the I/O system with a burst of page writes,
-   writing dirty buffers during a checkpoint is spread over a period of time.
-   That period is controlled by
-   , which is
-   given as a fraction of the checkpoint interval.
-   The I/O rate is adjusted so that the checkpoint finishes when the
-   given fraction of
-   checkpoint_timeout seconds have elapsed, or before
+   writing dirty buffers during a checkpoint is spread out across the time between
+   when checkpoints are scheduled to begin, as configured by 
+   .
+   The I/O rate is adjusted so that the checkpoint finishes at approximately the
+   time when the next checkpoint is scheduled to begin, or before 
max_wal_size is exceeded, whichever is sooner.
-   With the default value of 0.5,
-   PostgreSQL can be expected to complete each checkpoint
-   in about half the time before the next checkpoint starts.  On a system
-   that's very close to maximum I/O throughput during normal operation,
-   you might want to increase checkpoint_completion_target
-   to reduce the I/O load from checkpoints.  The disadvantage of this is that
-   prolonging checkpoints affects recovery time, because more WAL segments
-   will need to be kept around for possible use in recovery.  Although
-   checkpoint_completion_target can be set as high as 1.0,
-   it is best to keep it less than that (perhaps 0.9 at most) since
-   checkpoints include some other activities besides writing dirty buffers.
-   A setting of 1.0 is quite likely to result in checkpoints not being
-   completed on time, which would result in performance loss due to
-   unexpected variation in the number of WAL segments needed.
+   This spreads out the I/O as much as possible to have the I/O load be consistent
+   during the checkpoint and generally throughout the operation of the system.  The
+   disadvantage of this is that prolonging 

Re: [HACKERS] [PATCH] Generic type subscripting

2020-12-10 Thread David Fetter
On Wed, Dec 09, 2020 at 12:49:48PM -0500, Tom Lane wrote:
> I've pushed the core patch now.  The jsonb parts now have to be
> rebased onto this design, which I'm assuming Dmitry will tackle
> (I do not intend to).  It's not quite clear to me whether we have
> a meeting of the minds on what the jsonb functionality should be,
> anyway.  Alexander seemed to be thinking about offering an option
> to let the subscript be a jsonpath, but how would we distinguish
> that from a plain-text field name?
> 
> BTW, while reviewing the thread to write the commit message,
> I was reminded of my concerns around the "is it a container"
> business.  As things stand, if type A has a typelem link to
> type B, then the system supposes that A contains B physically;
> this has implications for what's allowed in DDL, for example
> (cf find_composite_type_dependencies() and other places).
> We now have a feature whereby subscripting can yield a type
> that is not contained in the source type in that sense.
> I'd be happier if the "container" terminology were reserved for
> that sort of physical containment, which means that I think a lot
> of the commentary around SubscriptingRef is misleading.  But I do
> not have a better word to suggest offhand.  Thoughts?

Would this be something more along the lines of a "dependent type," or
is that adding too much baggage?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: On login trigger: take three

2020-12-10 Thread Konstantin Knizhnik



On 10.12.2020 18:12, Pavel Stehule wrote:


My idea was a little bit different. Inside postinit initialize some 
global variables with info if there are event triggers or not. And 
later you can use this variable to start transactions and  other things.


There will be two access to pg_event_trigger, but it can eliminate 
useless and probably more expensive start_transaction and end_transaction.





Do you mean some variable in shared memory or GUCs?
It was my first idea - to use some flag in shared memory to make a fast
check possible that there are no event triggers.
But this flag would have to be set per database. Maybe I missed something,
but there is no per-database shared memory data structure in Postgres.
Certainly it is possible to organize some hash db->event_info, but it
makes this patch several times more complex.


From my point of view it is better to have a separate GUC disabling just
client connection events and switch it on by default.
So only those who need these events will switch it on; other users will
not pay any extra cost for it.




Re: Speeding up GIST index creation for tsvectors

2020-12-10 Thread Pavel Borisov
Hi, Amit!
It's really cool to hear about another GiST improvement proposal. I'd like
to connect recently committed GiST ordered build discussion here [1] and
further improvement proposed [2]

I've tested feature [1] and got a 2.5-3 times speed improvement, which is much
better I believe. There is an ongoing activity [2] to build support for
different data types for GiST. Maybe you will consider it interesting to
join.

BTW you may have heard about Gin and Rum [3] indexes which suit text search
much, much better (and faster) than GiST. The idea to process data in
bigger chunks is good. Still, optimizing the index structure, minimizing disc
page accesses, etc. seems better in many cases.

Thank you for your proposal!

[1] https://commitfest.postgresql.org/29/2276/
[2] https://commitfest.postgresql.org/31/2824/
[3] https://github.com/postgrespro/rum

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: On login trigger: take three

2020-12-10 Thread Pavel Stehule
čt 10. 12. 2020 v 14:03 odesílatel Konstantin Knizhnik <
k.knizh...@postgrespro.ru> napsal:

>
>
> On 10.12.2020 10:45, Pavel Stehule wrote:
>
>
>
> st 9. 12. 2020 v 14:28 odesílatel Konstantin Knizhnik <
> k.knizh...@postgrespro.ru> napsal:
>
>>
>>
>> On 09.12.2020 15:34, Pavel Stehule wrote:
>>
>> Hi
>>
>> st 9. 12. 2020 v 13:17 odesílatel Greg Nancarrow 
>> napsal:
>>
>>> On Tue, Dec 8, 2020 at 3:26 PM Pavel Stehule 
>>> wrote:
>>> >
>>> >
>>> > There are two maybe generic questions?
>>> >
>>> > 1. Maybe we can introduce more generic GUC for all event triggers like
>>> disable_event_triggers? This GUC can be checked only by the database owner
>>> or super user. It can be an alternative ALTER TABLE DISABLE TRIGGER ALL. It
>>> can be protection against necessity to restart to single mode to repair the
>>> event trigger. I think so more generic solution is better than special
>>> disable_client_connection_trigger GUC.
>>> >
>>> > 2. I have no objection against client_connection. It is probably
>>> better for the mentioned purpose - possibility to block connection to
>>> database. Can be interesting, and I am not sure how much work it is to
>>> introduce the second event - session_start. This event should be started
>>> after connecting - so the exception there doesn't block connect, and should
>>> be started also after the new statement "DISCARD SESSION", that will be
>>> started automatically after DISCARD ALL.  This feature should not be
>>> implemented in first step, but it can be a plan for support pooled
>>> connections
>>> >
>>>
>>> I've created a separate patch to address question (1), rather than
>>> include it in the main patch, which I've adjusted accordingly. I'll
>>> leave question (2) until another time, as you suggest.
>>> See the attached patches.
>>>
>>
>> I see two possible questions?
>>
>> 1. when you introduce this event, then the new hook is useless ?
>>
>> 2. what is a performance impact for users that want not to use this
>> feature. What is a overhead of EventTriggerOnConnect and is possible to
>> skip this step if database has not any event trigger
>>
>>
>> As far as I understand this are questions to me rather than to Greg.
>> 1. Do you mean client_connection_hook? It is used to implement this new
>> event type. It can be also used for other purposes.
>>
>
> ok. I don't like it, but there are redundant hooks (against event
> triggers) already.
>
>
> 2. Connection overhead is quite large. Supporting on connect hook requires
>> traversal of event trigger relation. But this overhead is negligible
>> comparing with overhead of establishing connection. In any case I did the
>> following test (with local connection):
>>
>> pgbench -C -S -T 100 -P 1 -M prepared postgres
>>
>> without this patch:
>> tps = 424.287889 (including connections establishing)
>> tps = 952.911068 (excluding connections establishing)
>>
>> with this patch (but without any connection trigger defined):
>> tps = 434.642947 (including connections establishing)
>> tps = 995.525383 (excluding connections establishing)
>>
>> As you can see - there is almost now different (patched version is even
>> faster, but it seems to be just "white noise".
>>
>
> This is not the worst case probably. In this patch the
> StartTransactionCommand is executed on every connection, although it is not
> necessary - and for most use cases it will not be used.
>
> I did more tests - see attachments and I see a 5% slowdown - I don't think
> so it is acceptable for this case. This feature is nice, and for some users
> important, but only really few users can use it.
>
> ┌────────────────┬─────────┬────────────┬─────────────┐
> │      test      │ WITH LT │ LT ENABLED │ LT DISABLED │
> ╞════════════════╪═════════╪════════════╪═════════════╡
> │ ro_constant_10 │     539 │        877 │         905 │
> │ ro_index_10    │     562 │        808 │         855 │
> │ ro_constant_50 │     527 │        843 │         863 │
> │ ro_index_50    │     544 │        731 │         742 │
> └────────────────┴─────────┴────────────┴─────────────┘
> (4 rows)
>
> I tested a performance of trigger (results of first column in table):
>
> CREATE OR REPLACE FUNCTION public.foo()
>  RETURNS event_trigger
>  LANGUAGE plpgsql
> AS $function$
> begin
>   if not pg_has_role(session_user, 'postgres', 'member') then
> raise exception 'you are not super user';
>   end if;
> end;
> $function$;
>
> There is an available snapshot in InitPostgres, and then there is possible
> to check if for the current database some connect event trigger exists.This
> can reduce an overhead of this patch, when there are no logon triggers.
>
> I think so implemented and used names are ok, but for this feature the
> performance impact should be really very minimal.
>
> There is other small issue - missing tab-complete support for CREATE
> TRIGGER statement in psql
>
> Regards
>
> Pavel
>
>
> Unfortunately I was not able to reproduce your results.
> I just tried the case "select 1" because for this 

Re: MultiXact\SLRU buffers configuration

2020-12-10 Thread Gilles Darold

Le 08/12/2020 à 18:52, Andrey Borodin a écrit :

Hi Gilles!

Many thanks for your message!


8 дек. 2020 г., в 21:05, Gilles Darold  написал(а):

I know that this report is not really helpful

Quite contrary - this benchmarks prove that controllable reproduction exists. 
I've rebased patches for PG11. Can you please benchmark them (without extending 
SLRU)?

Best regards, Andrey Borodin.


Hi,


Running tests yesterday with the patches reported a lot of failures
with errors on INSERT and UPDATE statements:



   ERROR:  lock MultiXactOffsetControlLock is not held


After a patch review this morning I think I have found what's going 
wrong. In patch 
v6-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch I think 
there is a missing reinitialisation of the lockmode variable to LW_NONE 
inside the retry loop after the call to LWLockRelease() in 
src/backend/access/transam/multixact.c:1392:GetMultiXactIdMembers(). 
I've attached a new version of the patch for master that includes the fix 
I'm using now with PG11 and with which everything works very well now.
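
In other words, the missing piece is just a reset after the lock is
released, before jumping back (sketch only, following the names used in the
attached patch):

			/* Corner case 2: next multixact is still being filled in */
			LWLockRelease(MultiXactOffsetSLRULock);
			lockmode = LW_NONE;		/* reinitialise before retrying */
			CHECK_FOR_INTERRUPTS();
			pg_usleep(1000L);
			goto retry;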



I'm running more tests to see the impact on performance when playing
with multixact_offsets_slru_buffers, multixact_members_slru_buffers and
multixact_local_cache_entries. I will report the results later today.


--
Gilles Darold
http://www.darold.net/

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 034349aa7b..4c372065de 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -640,10 +640,11 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
 	int			lsnindex;
 	char	   *byteptr;
 	XidStatus	status;
+	LWLockMode	lockmode = LW_NONE;
 
 	/* lock is acquired by SimpleLruReadPage_ReadOnly */
 
-	slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid);
+	slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid, &lockmode);
 	byteptr = XactCtl->shared->page_buffer[slotno] + byteno;
 
 	status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
@@ -651,6 +652,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
 	lsnindex = GetLSNIndex(slotno, xid);
 	*lsn = XactCtl->shared->group_lsn[lsnindex];
 
+	Assert(lockmode != LW_NONE);
 	LWLockRelease(XactSLRULock);
 
 	return status;
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 2fe551f17e..2699de033d 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -288,6 +288,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
 	CommitTimestampEntry entry;
 	TransactionId oldestCommitTsXid;
 	TransactionId newestCommitTsXid;
+	LWLockMode	lockmode = LW_NONE;
 
 	if (!TransactionIdIsValid(xid))
 		ereport(ERROR,
@@ -342,7 +343,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
 	}
 
 	/* lock is acquired by SimpleLruReadPage_ReadOnly */
-	slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid);
+	slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid, &lockmode);
	memcpy(&entry,
 		   CommitTsCtl->shared->page_buffer[slotno] +
 		   SizeOfCommitTimestampEntry * entryno,
@@ -352,6 +353,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
 	if (nodeid)
 		*nodeid = entry.nodeid;
 
+	Assert(lockmode != LW_NONE);
 	LWLockRelease(CommitTsSLRULock);
 	return *ts != 0;
 }
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index eb8de7cf32..56bdd04364 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1237,6 +1237,7 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	MultiXactId tmpMXact;
 	MultiXactOffset nextOffset;
 	MultiXactMember *ptr;
+	LWLockMode	lockmode = LW_NONE;
 
 	debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
 
@@ -1340,12 +1341,13 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * time on every multixact creation.
 	 */
 retry:
-	LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
 
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
-	slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
+	/* lock is acquired by SimpleLruReadPage_ReadOnly */
+	slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno,
+		multi, &lockmode);
 	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 	offptr += entryno;
 	offset = *offptr;
@@ -1377,7 +1379,8 @@ retry:
 		entryno = MultiXactIdToOffsetEntry(tmpMXact);
 
 		if (pageno != prev_pageno)
-			slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
+			slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno,
+				tmpMXact, &lockmode);
 
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
@@ -1387,6 +1390,7 @@ retry:
 		{
 			/* Corner case 2: next multixact is still being filled in */
 			LWLockRelease(MultiXactOffsetSLRULock);
+			lockmode = 

Re: Insert Documentation - Returning Clause and Order

2020-12-10 Thread David G. Johnston
On Thursday, December 10, 2020, Ashutosh Bapat 
wrote:

> On Wed, Dec 9, 2020 at 9:10 PM David G. Johnston
>  wrote:
> >
> > Hey,
> >
> > Would it be accurate to add the following sentence to the INSERT
> documentation under "Outputs"?
> >
> > "...inserted or updated by the command."  For a multiple-values
> insertion, the order of output rows will match the order that rows are
> presented in the values or query clause.
>
> Postgres's current implementation may be doing so, but I don't think
> that can be guaranteed in possible implementations. I don't think
> restricting choice of implementation to guarantee that is a good idea
> either.
>
>
Yeah, the ongoing work on parallel inserts would seem to be an issue.  We
should probably document that though.  And maybe as part of the parallel
inserts patch provide a user-specifiable way to ask for such a guarantee if
needed: "Insert returning ordered".

David J.


Re: Parallel Inserts in CREATE TABLE AS

2020-12-10 Thread Bharath Rupireddy
On Thu, Dec 10, 2020 at 5:19 PM Dilip Kumar  wrote:
> > > > +   allow = ps && IsA(ps, GatherState) && !ps->ps_ProjInfo 
> > > > &&
> > > > +   plannedstmt->parallelModeNeeded &&
> > > > +   plannedstmt->planTree &&
> > > > +   IsA(plannedstmt->planTree, Gather) &&
> > > > +   plannedstmt->planTree->lefttree &&
> > > > +   
> > > > plannedstmt->planTree->lefttree->parallel_aware &&
> > > > +   
> > > > plannedstmt->planTree->lefttree->parallel_safe;
> > > >
> > > > I noticed it check both IsA(ps, GatherState) and 
> > > > IsA(plannedstmt->planTree, Gather).
> > > > Does it mean it is possible that IsA(ps, GatherState) is true but 
> > > > IsA(plannedstmt->planTree, Gather) is false ?
> > > >
> > > > I did some test but did not find a case like that.
> > > >
> > > This seems like an extra check.  Apart from that if we combine 0001
> > > and 0002 there should be an additional protection so that it should
> > > not happen that in cost_gather we have ignored the parallel tuple cost
> > > and now we are rejecting the parallel insert. Probably we should add
> > > an assert.
> >
> > Yeah it's an extra check. I don't think we need that extra check 
> > IsA(plannedstmt->planTree, Gather). GatherState check is enough. I verified 
> > it as follows: the gatherstate will be allocated and initialized with the 
> > plan tree in ExecInitGather which are the ones we are checking here. So, 
> > there is no chance that the plan state is GatherState and the plan tree 
> > will not be Gather.  I will remove IsA(plannedstmt->planTree, Gather) check 
> > in the next version of the patch set.
> >
> > Breakpoint 4, ExecInitGather (node=0x5647f98ae994 , 
> > estate=0x1ca8, eflags=730035099) at nodeGather.c:61
> > (gdb) p gatherstate
> > $10 = (GatherState *) 0x5647fac83850
> > (gdb) p gatherstate->ps.plan
> > $11 = (Plan *) 0x5647fac918a0
> >
> > Breakpoint 1, IsParallelInsertInCTASAllowed (into=0x5647fac97580, 
> > queryDesc=0x5647fac835e0) at createas.c:663
> > 663 {
> > (gdb) p ps
> > $13 = (PlanState *) 0x5647fac83850
> > (gdb) p ps->plan
> > $14 = (Plan *) 0x5647fac918a0
> >
> Hope you did not miss the second part of my comment
> "
> > Apart from that if we combine 0001
> > and 0002 there should be additional protection so that it should
> > not happen that in cost_gather we have ignored the parallel tuple cost
> > and now we are rejecting the parallel insert. Probably we should add
> > an assert.
> "

IIUC, we need to set a flag in cost_gather() (in the 0002 patch) whenever we
ignore the parallel tuple cost, and while checking whether to allow or
disallow parallel inserts in IsParallelInsertInCTASAllowed(), we need to add
an assert, something like Assert(cost_ignored_in_cost_gather && allow),
before "return allow;".

This assertion fails 1) if we have not ignored the cost but are allowing
parallel inserts, or 2) if we ignored the cost but are not allowing
parallel inserts.

1) seems to be fine, we can go ahead and perform parallel inserts. 2)
is the concern that the planner would have wrongly chosen the parallel
plan, but in this case also isn't it better to go ahead with the
parallel plan instead of failing the query?

+/*
+ * We allow parallel inserts by the workers only if the Gather node has
+ * no projections to perform and if the upper node is Gather. In case,
+ * the Gather node has projections, which is possible if there are any
+ * subplans in the query, the workers can not do those projections. And
+ * when the upper node is GatherMerge, then the leader has to perform
+ * the final phase i.e. merge the results by workers.
+ */
+allow = ps && IsA(ps, GatherState) && !ps->ps_ProjInfo &&
+plannedstmt->parallelModeNeeded &&
+plannedstmt->planTree &&
+plannedstmt->planTree->lefttree &&
+plannedstmt->planTree->lefttree->parallel_aware &&
+plannedstmt->planTree->lefttree->parallel_safe;
+
+return allow;
+}

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




OpenSSL connection setup debug callback issue

2020-12-10 Thread Daniel Gustafsson
I went looking at the SSL connection state change information callback we
install when setting up connections with OpenSSL, and I wasn't getting the
state changes I expected.  Turns out we install it at the tail end of setting
up the connection so we miss most of the calls.  Moving it to the beginning of
be_tls_open_server allows us to catch the handshake etc.  I also extended it by
printing the human readable state change message available from OpenSSL to make
the logs more detailed (SSL_state_string_long has existed since 0.9.8).
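
For illustration, a minimal sketch of the kind of callback this describes
(using plain stderr output instead of the backend's ereport machinery, which
is an assumption of the sketch, not the actual patch):

    #include <stdio.h>
    #include <openssl/ssl.h>

    static void
    info_cb(const SSL *ssl, int type, int args)
    {
        const char *state = SSL_state_string_long(ssl);

        switch (type)
        {
            case SSL_CB_HANDSHAKE_START:
                fprintf(stderr, "SSL: handshake start: \"%s\"\n", state);
                break;
            case SSL_CB_HANDSHAKE_DONE:
                fprintf(stderr, "SSL: handshake done: \"%s\"\n", state);
                break;
            case SSL_CB_ACCEPT_LOOP:
                fprintf(stderr, "SSL: accept loop: \"%s\"\n", state);
                break;
            case SSL_CB_ACCEPT_EXIT:
                fprintf(stderr, "SSL: accept exit (%d): \"%s\"\n", args, state);
                break;
            default:
                break;
        }
    }

    /* Installed before the handshake starts, so no state change is missed. */
    static void
    install_info_callback(SSL *ssl)
    {
        SSL_set_info_callback(ssl, info_cb);
    }

With the callback installed that early, the per-state lines cover the whole
handshake rather than only its tail end, as the traces below show.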

A randomly selected sequence from a src/test/ssl testrun with the callback
moved but not extended with state information:

LOG:  connection received: host=localhost port=51177
DEBUG:  SSL: handshake start
DEBUG:  SSL: accept loop
DEBUG:  SSL: accept exit (-1)
DEBUG:  SSL: accept loop
DEBUG:  SSL: accept loop
DEBUG:  SSL: accept loop
DEBUG:  SSL: accept loop
DEBUG:  SSL: accept loop
DEBUG:  SSL: accept loop
DEBUG:  SSL: accept exit (-1)
DEBUG:  SSL: accept exit (-1)
DEBUG:  SSL: accept loop
DEBUG:  SSL: accept loop
DEBUG:  SSL: accept loop
DEBUG:  SSL: accept loop
DEBUG:  SSL: accept loop
DEBUG:  SSL: accept loop
DEBUG:  SSL: accept loop
DEBUG:  SSL: handshake done
DEBUG:  SSL: accept exit (1)
DEBUG:  SSL connection from "(anonymous)"

The same sequence with the patch applied:

LOG:  connection received: host=localhost port=51177
DEBUG:  SSL: handshake start: "before/accept initialization"
DEBUG:  SSL: accept loop: "before/accept initialization"
DEBUG:  SSL: accept exit (-1): "SSLv2/v3 read client hello A"
DEBUG:  SSL: accept loop: "SSLv3 read client hello A"
DEBUG:  SSL: accept loop: "SSLv3 write server hello A"
DEBUG:  SSL: accept loop: "SSLv3 write certificate A"
DEBUG:  SSL: accept loop: "SSLv3 write key exchange A"
DEBUG:  SSL: accept loop: "SSLv3 write certificate request A"
DEBUG:  SSL: accept loop: "SSLv3 flush data"
DEBUG:  SSL: accept exit (-1): "SSLv3 read client certificate A"
DEBUG:  SSL: accept exit (-1): "SSLv3 read client certificate A"
DEBUG:  SSL: accept loop: "SSLv3 read client certificate A"
DEBUG:  SSL: accept loop: "SSLv3 read client key exchange A"
DEBUG:  SSL: accept loop: "SSLv3 read certificate verify A"
DEBUG:  SSL: accept loop: "SSLv3 read finished A"
DEBUG:  SSL: accept loop: "SSLv3 write change cipher spec A"
DEBUG:  SSL: accept loop: "SSLv3 write finished A"
DEBUG:  SSL: accept loop: "SSLv3 flush data"
DEBUG:  SSL: handshake done: "SSL negotiation finished successfully"
DEBUG:  SSL: accept exit (1): "SSL negotiation finished successfully"
DEBUG:  SSL connection from "(anonymous)"

The attached contains these two changes as well as comment fixups which Heikki
noticed.

cheers ./daniel



0001-Move-information-callback-earlier-to-capture-connect.patch
Description: Binary data


Re: Proposed patch for key managment

2020-12-10 Thread Neil Chen
Hi, everyone
>
> I have read the patch and did some simple tests. I'm not entirely sure
> about some code segments; e.g.:
>
> In the BootStrapKmgr() we generate a data encryption key by:
> key = generate_crypto_key(file_encryption_keylen);
>
> However, I found that the file_encryption_keylen is always 0 in bootstrap
> mode because there exists another variable bootstrap_file_encryption_keylen
> in xlog.c and bootstrap.c.
>
> We get the REL/WAL key by KmgrGetKey() call and it works like:
> return (const CryptoKey *) &(KmgrShmem->intlKeys[id]);
>
> But in bootstrap mode, KmgrShmem is not assigned. So, if we want to
> use it to encrypt something in bootstrap mode, I suggest we make the
> following changes:
> if ( in bootstrap mode)
> return intlKeys[id]; // a static variable which contains key
> else
> return (const CryptoKey *) &(KmgrShmem->intlKeys[id]);
>
>
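
In compilable form, the suggestion above would look roughly like this (a
sketch only; it assumes a static intlKeys[] array kept by the bootstrap
code, and the KmgrGetKey() signature is simplified):

    const CryptoKey *
    KmgrGetKey(int id)
    {
        /* during bootstrap KmgrShmem is not set up yet, use the static copy */
        if (IsBootstrapProcessingMode())
            return (const CryptoKey *) &intlKeys[id];

        return (const CryptoKey *) &(KmgrShmem->intlKeys[id]);
    }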

-- 
There is no royal road to learning.
Highgo Software Co.


Re: Insert Documentation - Returning Clause and Order

2020-12-10 Thread Ashutosh Bapat
On Wed, Dec 9, 2020 at 9:10 PM David G. Johnston
 wrote:
>
> Hey,
>
> Would it be accurate to add the following sentence to the INSERT 
> documentation under "Outputs"?
>
> "...inserted or updated by the command."  For a multiple-values insertion, 
> the order of output rows will match the order that rows are presented in the 
> values or query clause.

Postgres's current implementation may be doing so, but I don't think
that can be guaranteed in possible implementations. I don't think
restricting choice of implementation to guarantee that is a good idea
either.

-- 
Best Wishes,
Ashutosh Bapat




Re: On login trigger: take three

2020-12-10 Thread Konstantin Knizhnik



On 10.12.2020 10:45, Pavel Stehule wrote:



st 9. 12. 2020 v 14:28 odesílatel Konstantin Knizhnik 
mailto:k.knizh...@postgrespro.ru>> napsal:




On 09.12.2020 15:34, Pavel Stehule wrote:

Hi

st 9. 12. 2020 v 13:17 odesílatel Greg Nancarrow
mailto:gregn4...@gmail.com>> napsal:

On Tue, Dec 8, 2020 at 3:26 PM Pavel Stehule
mailto:pavel.steh...@gmail.com>> wrote:
>
>
> There are two maybe generic questions?
>
> 1. Maybe we can introduce more generic GUC for all event
triggers like disable_event_triggers? This GUC can be checked
only by the database owner or super user. It can be an
alternative ALTER TABLE DISABLE TRIGGER ALL. It can be
protection against necessity to restart to single mode to
repair the event trigger. I think so more generic solution is
better than special disable_client_connection_trigger GUC.
>
> 2. I have no objection against client_connection. It is
probably better for the mentioned purpose - possibility to
block connection to database. Can be interesting, and I am
not sure how much work it is to introduce the second event -
session_start. This event should be started after connecting
- so the exception there doesn't block connect, and should be
started also after the new statement "DISCARD SESSION", that
will be started automatically after DISCARD ALL. This feature
should not be implemented in first step, but it can be a plan
for support pooled connections
>

I've created a separate patch to address question (1), rather
than
include it in the main patch, which I've adjusted
accordingly. I'll
leave question (2) until another time, as you suggest.
See the attached patches.


I see two possible questions?

1. when you introduce this event, then the new hook is useless ?

2. what is a performance impact for users that want not to use
this feature. What is a overhead of EventTriggerOnConnect and is
possible to skip this step if database has not any event trigger


As far as I understand this are questions to me rather than to Greg.
1. Do you mean client_connection_hook? It is used to implement
this new event type. It can be also used for other purposes.


ok. I don't like it, but there are redundant hooks (against event 
triggers) already.



2. Connection overhead is quite large. Supporting on connect hook
requires traversal of event trigger relation. But this overhead is
negligible comparing with overhead of establishing connection. In
any case I did the following test (with local connection):

pgbench -C -S -T 100 -P 1 -M prepared postgres

without this patch:
tps = 424.287889 (including connections establishing)
tps = 952.911068 (excluding connections establishing)

with this patch (but without any connection trigger defined):
tps = 434.642947 (including connections establishing)
tps = 995.525383 (excluding connections establishing)

As you can see - there is almost no difference (the patched version is
even faster, but it seems to be just "white noise").


This is not the worst case probably. In this patch the 
StartTransactionCommand is executed on every connection, although it 
is not necessary - and for most use cases it will not be used.


I did more tests - see attachments and I see a 5% slowdown - I don't 
think so it is acceptable for this case. This feature is nice, and for 
some users important, but only really few users can use it.


┌┬─┬┬─┐
│      test      │ WITH LT │ LT ENABLED │ LT DISABLED │
╞╪═╪╪═╡
│ ro_constant_10 │     539 │        877 │         905 │
│ ro_index_10    │     562 │        808 │         855 │
│ ro_constant_50 │     527 │        843 │         863 │
│ ro_index_50    │     544 │        731 │         742 │
└┴─┴┴─┘
(4 rows)

I tested a performance of trigger (results of first column in table):

CREATE OR REPLACE FUNCTION public.foo()
 RETURNS event_trigger
 LANGUAGE plpgsql
AS $function$
begin
  if not pg_has_role(session_user, 'postgres', 'member') then
    raise exception 'you are not super user';
  end if;
end;
$function$;

There is an available snapshot in InitPostgres, and then there is 
possible to check if for the current database some connect event 
trigger exists.This can reduce an overhead of this patch, when there 
are no logon triggers.


I think so implemented and used names are ok, but for this feature the 
performance impact should be really very minimal.


There is other small issue - missing tab-complete support for CREATE 
TRIGGER statement in psql


Regards

Pavel



Unfortunately I was not able to reproduce your results.
I just tried the case "select 1" 

Re: initscan for MVCC snapshot

2020-12-10 Thread Andy Fan
On Thu, Dec 10, 2020 at 7:31 PM Andy Fan  wrote:

>
>
> On Mon, Dec 7, 2020 at 8:26 PM Andy Fan  wrote:
>
>> Hi:
>>  I see initscan calls RelationGetNumberOfBlocks every time and rescan
>> calls
>>  initscan as well. In my system, RelationGetNumberOfBlocks is expensive
>> (the reason
>>  doesn't deserve a talk.. ),  so in a nest loop + Bitmap heap scan case,
>> the
>> impact will be huge. The comments of initscan are below.
>>
>> /*
>> * Determine the number of blocks we have to scan.
>> *
>> * It is sufficient to do this once at scan start, since any tuples added
>> * while the scan is in progress will be invisible to my snapshot anyway.
>> * (That is not true when using a non-MVCC snapshot.  However, we couldn't
>> * guarantee to return tuples added after scan start anyway, since they
>> * might go into pages we already scanned.  To guarantee consistent
>> * results for a non-MVCC snapshot, the caller must hold some higher-level
>> * lock that ensures the interesting tuple(s) won't change.)
>> */
>>
>> I still do not fully understand the comments. Looks we only need to call
>> multi times for non-MVCC snapshot, IIUC, does the following change
>> reasonable?
>>
>> ===
>>
>> diff --git a/src/backend/access/heap/heapam.c
>> b/src/backend/access/heap/heapam.c
>> index 1b2f70499e..8238eabd8b 100644
>> --- a/src/backend/access/heap/heapam.c
>> +++ b/src/backend/access/heap/heapam.c
>> @@ -211,6 +211,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool
>> keep_startblock)
>> ParallelBlockTableScanDesc bpscan = NULL;
>> boolallow_strat;
>> boolallow_sync;
>> +   boolis_mvcc = scan->rs_base.rs_snapshot &&
>> IsMVCCSnapshot(scan->rs_base.rs_snapshot);
>>
>> /*
>>  * Determine the number of blocks we have to scan.
>> @@ -229,7 +230,8 @@ initscan(HeapScanDesc scan, ScanKey key, bool
>> keep_startblock)
>> scan->rs_nblocks = bpscan->phs_nblocks;
>> }
>> else
>> -   scan->rs_nblocks =
>> RelationGetNumberOfBlocks(scan->rs_base.rs_rd);
>> +   if (scan->rs_nblocks == -1 || !is_mvcc)
>> +   scan->rs_nblocks =
>> RelationGetNumberOfBlocks(scan->rs_base.rs_rd);
>>
>> /*
>>  * If the table is large relative to NBuffers, use a bulk-read
>> access
>> @@ -1210,6 +1212,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
>> else
>> scan->rs_base.rs_key = NULL;
>>
>> +   scan->rs_nblocks = -1;
>> initscan(scan, key, false);
>>
>> --
>> Best Regards
>> Andy Fan
>>
>
> I have tested this with an ext4 file system, and I can get a 7%+
> performance improvement
> for the given test case. Here are the steps:
>
> create table t(a int, b char(8000));
> insert into t select i, i from generate_series(1, 100)i;
> create index on t(a);
> delete from t where a <= 1;
> vacuum t;
> alter system set enable_indexscan to off;
> select pg_reload_conf();
>
> cat 1.sql
> select * from generate_series(1, 1)i, t where i = t.a;
>
> bin/pgbench -f 1.sql postgres -T 300 -c 10
>
> Without this patch:  latency average = 61.806 ms
> with this patch:   latency average = 57.484 ms
>
> I think the result is good and I think we can probably make this change.
> However, I'm not
> sure about it.
>
>
The plan which was used is below. In the rescan of the Bitmap Heap Scan,
mdnblocks will be called 1 times in the current implementation; with my
patch, it will be called only once.

postgres=# explain (costs off) select * from generate_series(1, 1)i, t
where i = t.a;
QUERY PLAN
--
 Nested Loop
   ->  Function Scan on generate_series i
   ->  Bitmap Heap Scan on t
 Recheck Cond: (a = i.i)
 ->  Bitmap Index Scan on t_a_idx
   Index Cond: (a = i.i)
(6 rows)



-- 
Best Regards
Andy Fan


Re: Parallel Inserts in CREATE TABLE AS

2020-12-10 Thread Dilip Kumar
On Thu, Dec 10, 2020 at 5:00 PM Bharath Rupireddy
 wrote:
>
> On Thu, Dec 10, 2020 at 4:49 PM Dilip Kumar  wrote:
> > > +   allow = ps && IsA(ps, GatherState) && !ps->ps_ProjInfo &&
> > > +   plannedstmt->parallelModeNeeded &&
> > > +   plannedstmt->planTree &&
> > > +   IsA(plannedstmt->planTree, Gather) &&
> > > +   plannedstmt->planTree->lefttree &&
> > > +   
> > > plannedstmt->planTree->lefttree->parallel_aware &&
> > > +   
> > > plannedstmt->planTree->lefttree->parallel_safe;
> > >
> > > I noticed it check both IsA(ps, GatherState) and 
> > > IsA(plannedstmt->planTree, Gather).
> > > Does it mean it is possible that IsA(ps, GatherState) is true but 
> > > IsA(plannedstmt->planTree, Gather) is false ?
> > >
> > > I did some test but did not find a case like that.
> > >
> >
> > This seems like an extra check.  Apart from that if we combine 0001
> > and 0002 there should be an additional protection so that it should
> > not happen that in cost_gather we have ignored the parallel tuple cost
> > and now we are rejecting the parallel insert. Probably we should add
> > an assert.
>
> Yeah it's an extra check. I don't think we need that extra check 
> IsA(plannedstmt->planTree, Gather). GatherState check is enough. I verified 
> it as follows: the gatherstate will be allocated and initialized with the 
> plan tree in ExecInitGather which are the ones we are checking here. So, 
> there is no chance that the plan state is GatherState and the plan tree will 
> not be Gather.  I will remove IsA(plannedstmt->planTree, Gather) check in the 
> next version of the patch set.
>
> Breakpoint 4, ExecInitGather (node=0x5647f98ae994 , 
> estate=0x1ca8, eflags=730035099) at nodeGather.c:61
> (gdb) p gatherstate
> $10 = (GatherState *) 0x5647fac83850
> (gdb) p gatherstate->ps.plan
> $11 = (Plan *) 0x5647fac918a0
>
> Breakpoint 1, IsParallelInsertInCTASAllowed (into=0x5647fac97580, 
> queryDesc=0x5647fac835e0) at createas.c:663
> 663 {
> (gdb) p ps
> $13 = (PlanState *) 0x5647fac83850
> (gdb) p ps->plan
> $14 = (Plan *) 0x5647fac918a0
>

Hope you did not miss the second part of my comment
"
> Apart from that if we combine 0001
> and 0002 there should be additional protection so that it should
> not happen that in cost_gather we have ignored the parallel tuple cost
> and now we are rejecting the parallel insert. Probably we should add
> an assert.
"

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Fail Fast In CTAS/CMV If Relation Already Exists To Avoid Unnecessary Rewrite, Planning Costs

2020-12-10 Thread Bharath Rupireddy
Hi,

Currently, for CTAS or CREATE MATERIALIZED VIEW (CMV) without an
if-not-exists clause, the existence of the relation gets checked
during the execution of the select part and an error is thrown there.
All the unnecessary rewrite and planning for the select part would
have happened just to fail later. However, if the if-not-exists clause is
present, then a notice is issued and the command returns immediately without
any further rewrite or planning for the select part. This seems somewhat
inconsistent to me.

I propose to check the relation existence early in ExecCreateTableAs()
as well as in ExplainOneUtility() and throw an error in case it exists
already to avoid unnecessary rewrite, planning and execution of the
select part.
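
For reference, the early check could look roughly like this (sketch only;
the helper name and error wording are illustrative, the if-not-exists notice
case is omitted, and the EXPLAIN path needs the same treatment):

    static void
    CheckRelExistenceInCTAS(CreateTableAsStmt *ctas)
    {
        RangeVar   *rel = ctas->into->rel;
        Oid         nspid = RangeVarGetCreationNamespace(rel);

        if (get_relname_relid(rel->relname, nspid) != InvalidOid)
            ereport(ERROR,
                    (errcode(ERRCODE_DUPLICATE_TABLE),
                     errmsg("relation \"%s\" already exists",
                            rel->relname)));
    }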

Attaching a patch. Note that I have not added any test cases as the
existing test cases in create_table.sql and matview.sql would cover
the code.

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v1-0001-Fail-Fast-In-CTAS-CMV-If-Relation-Already-Exists.patch
Description: Binary data


Re: initscan for MVCC snapshot

2020-12-10 Thread Andy Fan
On Mon, Dec 7, 2020 at 8:26 PM Andy Fan  wrote:

> Hi:
>  I see initscan calls RelationGetNumberOfBlocks every time and rescan
> calls
>  initscan as well. In my system, RelationGetNumberOfBlocks is expensive
> (the reason
>  doesn't deserve a talk.. ),  so in a nest loop + Bitmap heap scan case,
> the
> impact will be huge. The comments of initscan are below.
>
> /*
> * Determine the number of blocks we have to scan.
> *
> * It is sufficient to do this once at scan start, since any tuples added
> * while the scan is in progress will be invisible to my snapshot anyway.
> * (That is not true when using a non-MVCC snapshot.  However, we couldn't
> * guarantee to return tuples added after scan start anyway, since they
> * might go into pages we already scanned.  To guarantee consistent
> * results for a non-MVCC snapshot, the caller must hold some higher-level
> * lock that ensures the interesting tuple(s) won't change.)
> */
>
> I still do not fully understand the comments. Looks we only need to call
> multi times for non-MVCC snapshot, IIUC, does the following change
> reasonable?
>
> ===
>
> diff --git a/src/backend/access/heap/heapam.c
> b/src/backend/access/heap/heapam.c
> index 1b2f70499e..8238eabd8b 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -211,6 +211,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool
> keep_startblock)
> ParallelBlockTableScanDesc bpscan = NULL;
> boolallow_strat;
> boolallow_sync;
> +   boolis_mvcc = scan->rs_base.rs_snapshot &&
> IsMVCCSnapshot(scan->rs_base.rs_snapshot);
>
> /*
>  * Determine the number of blocks we have to scan.
> @@ -229,7 +230,8 @@ initscan(HeapScanDesc scan, ScanKey key, bool
> keep_startblock)
> scan->rs_nblocks = bpscan->phs_nblocks;
> }
> else
> -   scan->rs_nblocks =
> RelationGetNumberOfBlocks(scan->rs_base.rs_rd);
> +   if (scan->rs_nblocks == -1 || !is_mvcc)
> +   scan->rs_nblocks =
> RelationGetNumberOfBlocks(scan->rs_base.rs_rd);
>
> /*
>  * If the table is large relative to NBuffers, use a bulk-read
> access
> @@ -1210,6 +1212,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
> else
> scan->rs_base.rs_key = NULL;
>
> +   scan->rs_nblocks = -1;
> initscan(scan, key, false);
>
> --
> Best Regards
> Andy Fan
>

I have tested this with an ext4 file system, and I can get a 7%+
performance improvement
for the given test case. Here are the steps:

create table t(a int, b char(8000));
insert into t select i, i from generate_series(1, 100)i;
create index on t(a);
delete from t where a <= 1;
vacuum t;
alter system set enable_indexscan to off;
select pg_reload_conf();

cat 1.sql
select * from generate_series(1, 1)i, t where i = t.a;

bin/pgbench -f 1.sql postgres -T 300 -c 10

Without this patch:  latency average = 61.806 ms
with this patch:   latency average = 57.484 ms

I think the result is good and I think we can probably make this change.
However, I'm not
sure about it.



-- 
Best Regards
Andy Fan


Re: Parallel Inserts in CREATE TABLE AS

2020-12-10 Thread Bharath Rupireddy
On Thu, Dec 10, 2020 at 4:49 PM Dilip Kumar  wrote:
> > +   allow = ps && IsA(ps, GatherState) && !ps->ps_ProjInfo &&
> > +           plannedstmt->parallelModeNeeded &&
> > +           plannedstmt->planTree &&
> > +           IsA(plannedstmt->planTree, Gather) &&
> > +           plannedstmt->planTree->lefttree &&
> > +           plannedstmt->planTree->lefttree->parallel_aware &&
> > +           plannedstmt->planTree->lefttree->parallel_safe;
> >
> > I noticed it checks both IsA(ps, GatherState) and
> > IsA(plannedstmt->planTree, Gather).
> > Does it mean it is possible that IsA(ps, GatherState) is true but
> > IsA(plannedstmt->planTree, Gather) is false?
> >
> > I did some tests but did not find a case like that.
> >
>
> This seems like an extra check.  Apart from that, if we combine 0001
> and 0002 there should be additional protection so that it cannot
> happen that cost_gather ignored the parallel tuple cost but we then
> reject the parallel insert. Probably we should add an assert.

Yeah, it's an extra check. I don't think we need the extra
IsA(plannedstmt->planTree, Gather) check; the GatherState check is enough.
I verified it as follows: the gather state is allocated and initialized
with the plan tree in ExecInitGather, and those are exactly what we are
checking here. So there is no chance that the plan state is a GatherState
while the plan tree is not a Gather. I will remove the
IsA(plannedstmt->planTree, Gather) check in the next version of the patch
set.

Breakpoint 4, ExecInitGather (node=0x5647f98ae994 ,
estate=0x1ca8, eflags=730035099) at nodeGather.c:61
(gdb) p gatherstate
$10 = (GatherState *) 0x5647fac83850
(gdb) p gatherstate->ps.plan
$11 = (Plan *) 0x5647fac918a0

Breakpoint 1, IsParallelInsertInCTASAllowed (into=0x5647fac97580,
queryDesc=0x5647fac835e0) at createas.c:663
663 {
(gdb) p ps
$13 = (PlanState *) 0x5647fac83850
(gdb) p ps->plan
$14 = (Plan *) 0x5647fac918a0
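
For illustration, the remaining check minus the redundant test would look
roughly like this (the helper name is made up; the field names are as in
the snippet quoted above, and the patch set remains the authoritative
version):

static bool
is_parallel_insert_in_ctas_allowed(PlanState *ps, PlannedStmt *plannedstmt)
{
	/* Same condition as before, without the IsA(planTree, Gather) test */
	return ps && IsA(ps, GatherState) && !ps->ps_ProjInfo &&
		plannedstmt->parallelModeNeeded &&
		plannedstmt->planTree &&
		plannedstmt->planTree->lefttree &&
		plannedstmt->planTree->lefttree->parallel_aware &&
		plannedstmt->planTree->lefttree->parallel_safe;
}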

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel Inserts in CREATE TABLE AS

2020-12-10 Thread Dilip Kumar
On Thu, Dec 10, 2020 at 3:59 PM Hou, Zhijie  wrote:
>
> Hi
>
> +   allow = ps && IsA(ps, GatherState) && !ps->ps_ProjInfo &&
> +   plannedstmt->parallelModeNeeded &&
> +   plannedstmt->planTree &&
> +   IsA(plannedstmt->planTree, Gather) &&
> +   plannedstmt->planTree->lefttree &&
> +   
> plannedstmt->planTree->lefttree->parallel_aware &&
> +   
> plannedstmt->planTree->lefttree->parallel_safe;
>
> I noticed it checks both IsA(ps, GatherState) and IsA(plannedstmt->planTree,
> Gather).
> Does it mean it is possible that IsA(ps, GatherState) is true but
> IsA(plannedstmt->planTree, Gather) is false?
>
> I did some tests but did not find a case like that.
>

This seems like an extra check.  Apart from that, if we combine 0001
and 0002 there should be additional protection so that it cannot
happen that cost_gather ignored the parallel tuple cost but we then
reject the parallel insert. Probably we should add an assert.
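
As a rough illustration of the kind of protection I mean (the flag name
hypotheticalTupleCostIgnored is made up just for the example; only
IsParallelInsertInCTASAllowed() comes from the posted patches):

	/*
	 * Hypothetical illustration: if cost_gather skipped charging the
	 * parallel tuple cost on the assumption that the leader would not pull
	 * tuples (the insert happens below the Gather), the executor must not
	 * end up rejecting the parallel insert afterwards.
	 */
	if (plannedstmt->hypotheticalTupleCostIgnored)
		Assert(IsParallelInsertInCTASAllowed(into, queryDesc));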


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-10 Thread Amit Kapila
On Thu, Dec 10, 2020 at 1:40 PM k.jami...@fujitsu.com
 wrote:
>
> Yes, I have tested that optimization works for index relations.
>
> I have attached V34, following the condition that we use the "cached" flag
> in both DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers() for
> consistency.
> I added a comment in 0004 about the limitation of the optimization when
> there are TOAST relations that use a NON-PLAIN storage strategy, i.e. the
> optimization works if the data types used are integers, OID, bytea, etc.,
> but for TOAST-able data types like text, the optimization will be skipped
> and a full scan forced during recovery.
>

AFAIU, it won't take the optimization path only when we have a TOAST
relation but there is no insertion corresponding to it. If so, then we
don't need to mention it specifically, because there are other similar
cases where the optimization won't work, like when during recovery we
just have to perform TRUNCATE.

-- 
With Regards,
Amit Kapila.




RE: Parallel Inserts in CREATE TABLE AS

2020-12-10 Thread Hou, Zhijie
Hi

+   allow = ps && IsA(ps, GatherState) && !ps->ps_ProjInfo &&
+   plannedstmt->parallelModeNeeded &&
+   plannedstmt->planTree &&
+   IsA(plannedstmt->planTree, Gather) &&
+   plannedstmt->planTree->lefttree &&
+   plannedstmt->planTree->lefttree->parallel_aware 
&&
+   plannedstmt->planTree->lefttree->parallel_safe;

I noticed it checks both IsA(ps, GatherState) and IsA(plannedstmt->planTree,
Gather).
Does it mean it is possible that IsA(ps, GatherState) is true but
IsA(plannedstmt->planTree, Gather) is false?

I did some tests but did not find a case like that.


Best regards,
houzj






Re: Yet another fast GiST build

2020-12-10 Thread Andrey Borodin


> On 9 Dec 2020, at 14:47, Andrey Borodin  wrote:
> 
> 
> 
>> On 7 Dec 2020, at 23:56, Peter Geoghegan  wrote:
>> 
>> On Mon, Dec 7, 2020 at 2:05 AM Andrey Borodin  wrote:
>>> Here's a version with tests and docs. I still have no idea how to print
>>> some useful information about tuple keys.
>> 
>> I suggest calling BuildIndexValueDescription() from your own custom
>> debug instrumentation code.
> Thanks for the hint, Peter!
> This function does exactly what I want. But I have no Relation inside the
> gist_page_items(bytea) function... so I'll probably add a
> gist_page_items(relname, blockno) overload to fetch the keys.

PFA patch with implementation.

Best regards, Andrey Borodin.



v3-0001-Add-functions-to-pageinspect-to-inspect-GiST-inde.patch
Description: Binary data




Re: Single transaction in the tablesync worker?

2020-12-10 Thread Dilip Kumar
On Thu, Dec 10, 2020 at 3:19 PM Peter Smith  wrote:
>
> On Tue, Dec 8, 2020 at 9:14 PM Amit Kapila  wrote:
> >
> > On Tue, Dec 8, 2020 at 11:53 AM Peter Smith  wrote:
> > >
> > > Yes, I observed this same behavior.
> > >
> > > IIUC the only way for the tablesync worker to go from CATCHUP mode to
> > > SYNCDONE is via the call to process_sync_tables.
> > >
> > > But a side-effect of this is, when messages arrive during this CATCHUP
> > > phase one tx will be getting handled by the tablesync worker before
> > > the process_sync_tables() is ever encountered.
> > >
> > > I have created and attached a simple patch which allows the tablesync
> > > to detect if there is anything to do *before* it enters the apply main
> > > loop. Calling process_sync_tables() before the apply main loop  offers
> > > a quick way out so the message handling will not be split
> > > unnecessarily between the workers.
> > >
> >
> > Yeah, this demonstrates the idea can work but as mentioned in my
> > previous email [1] this needs much more work to make the COPY and then
> > later fetching the changes from the publisher consistently. So, let me
> > summarize the discussion so far. We wanted to enhance the tablesync
> > phase of Logical Replication to enable decoding of prepared
> > transactions [2]. The problem was when we stream prepared transactions
> > in the tablesync worker, it will simply commit the same due to the
> > requirement of maintaining a single transaction for the entire
> > duration of copy and streaming of transactions afterward. We can't
> > simply disable the decoding of prepared xacts for tablesync workers
> > because it can skip some of the prepared xacts forever on subscriber
> > as explained in one of the emails above [3]. Now, while investigating
> > the solutions to enhance tablesync to support decoding at prepare
> > time, I found that due to the current design of tablesync we can see
> > partial data of transactions on subscribers which is also explained in
> > the email above with an example [4]. This problem of visibility has been
> > there since Logical Replication was introduced in PostgreSQL, and
> > the only answer I got till now is that there doesn't seem to be any
> > other alternative which I think is not true and I have provided one
> > alternative as well.
> >
> > Next, we have discussed three different solutions all of which will
> > solve the first problem (allow the tablesync worker to decode
> > transactions at prepare time) and one of which solves both the first
> > and second problem (partial transaction data visibility).
> >
> > Solution-1: Allow the table-sync worker to use multiple transactions.
> > The reason for doing it in a single transaction is that if after
> > initial COPY we commit and then crash while streaming changes of other
> > transactions, the state of the table won't be known after the restart
> > as we are using a temporary slot so we don't know from where to restart
> > syncing the table.
> >
> > IIUC, we need to primarily do two things to achieve multiple
> > transactions, one is to have an additional state in the catalog (say
> > catch up) which will say that the initial copy is done. Then we need
> > to have a permanent slot using which we can track the progress of the
> > slot so that after restart (due to crash, connection break, etc.) we
> > can start from the appropriate position. Now, this will allow us to do
> > less work after recovering from a crash because we will know the
> > restart point. As Craig mentioned, it also allows the sync slot to
> > advance, freeing any held upstream resources before the whole sync is
> > done, which is good if the upstream is busy and generating lots of
> > WAL. Finally, committing as we go means we won't exceed the cid
> > increment limit in a single txn.
> >
> > Solution-2: The next solution we discussed is to make "tablesync"
> > worker just gather messages after COPY in a similar way to how the
> > current streaming of in-progress transaction feature gathers messages
> > into a "changes" file so that they can be replayed later by the apply
> > worker. Now, here as we don't need to replay the individual
> > transactions in tablesync worker in a single transaction, it will
> > allow us to send decode prepared to the subscriber. This has some
> > disadvantages such as each transaction processed by tablesync worker
> > needs to be durably written to file and it can also lead to some apply
> > lag later when we process the same by apply worker.
> >
> > Solution-3: Allow the table-sync workers to just perform initial COPY
> > and then once the COPY is done for all relations the apply worker will
> > stream all the future changes. Now, surely if large copies are
> > required for multiple relations then we would delay a bit to replay
> > transactions partially by the apply worker but don't know how much
> > that matters as compared to transaction visibility issue and anyway we
> > would have achieved the maximum parallelism by allowing copy 

pg_shmem_allocations & documentation

2020-12-10 Thread Benoit Lobréau
Hi,

While reading the documentation of pg_shmem_allocations, I noticed that the
off column is described as follows:

"The offset at which the allocation starts. NULL for anonymous allocations
and unused memory."

Whereas, the view returns a value for unused memory:

[local]:5433 postgres@postgres=# SELECT * FROM pg_shmem_allocations WHERE
name IS NULL;
 name |off|  size   | allocated_size
--+---+-+
 ¤| 178095232 | 1923968 |1923968
(1 row)

From what I understand, the doc is wrong.
Am I right?

Benoit

[1] https://www.postgresql.org/docs/13/view-pg-shmem-allocations.html
[2]
https://www.postgresql.org/message-id/flat/20140504114417.GM12715%40awork2.anarazel.de
(original thread)


Re: Parallel INSERT (INTO ... SELECT ...)

2020-12-10 Thread Dilip Kumar
On Thu, Dec 10, 2020 at 1:50 PM Greg Nancarrow  wrote:
>
> On Thu, Dec 10, 2020 at 5:25 PM Dilip Kumar  wrote:
> >
> >
> >  /*
> > + * PrepareParallelMode
> > + *
> > + * Prepare for entering parallel mode, based on command-type.
> > + */
> > +void
> > +PrepareParallelMode(CmdType commandType)
> > +{
> > + Assert(!IsInParallelMode() || force_parallel_mode != FORCE_PARALLEL_OFF);
> > +
> > + if (IsModifySupportedInParallelMode(commandType))
> > + {
> > + /*
> > + * Prepare for entering parallel mode by assigning a
> > + * FullTransactionId, to be included in the transaction state that is
> > + * serialized in the parallel DSM.
> > + */
> > + (void) GetCurrentTransactionId();
> > + }
> > +}
> >
> > Why do we need to serialize the transaction ID for 0001?  I mean in
> > 0001 we are just allowing the SELECT to be executed in parallel, so why
> > would we need the transaction ID for that?  I agree that we would need
> > this once we try to perform the Insert also from the worker in the
> > remaining patches.
> >
>
> There's a very good reason. It's related to parallel-mode checks for
> Insert and how the XID is lazily acquired if required.
> When allowing SELECT to be executed in parallel, we're in
> parallel-mode and the leader interleaves Inserts with retrieval of the
> tuple data from the workers.
> You will notice that heap_insert() calls GetTransactionId() as the
> very first thing it does. If the FullTransactionId is not valid,
> AssignTransactionId() is then called, which then executes this code:
>
> /*
>  * Workers synchronize transaction state at the beginning of each parallel
>  * operation, so we can't account for new XIDs at this point.
>  */
> if (IsInParallelMode() || IsParallelWorker())
> elog(ERROR, "cannot assign XIDs during a parallel operation");
>
> So that code (currently) has no way of knowing that a XID is being
> (lazily) assigned at the beginning, or somewhere in the middle of, a
> parallel operation.
> This is the reason why PrepareParallelMode() is calling
> GetTransactionId() up-front, to ensure a FullTransactionId is assigned
> up-front, prior to parallel-mode (so then there won't be an attempted
> XID assignment).
>
> If you remove the GetTransactionId() call from PrepareParallelMode()
> and run "make installcheck-world" with "force_parallel_mode=regress"
> in effect, many tests will fail with:
> ERROR:  cannot assign XIDs during a parallel operation

Yeah, got it; I missed the point that the goal is to avoid assigning
the transaction ID when we are in parallel mode.  But IIUC, at least
for the first patch we don't want to serialize the XID in the
transaction state, right, because the workers don't need the XID as they
are only doing a SELECT.  So maybe we can readjust the comment slightly
in the code below:

> > + * Prepare for entering parallel mode by assigning a
> > + * FullTransactionId, to be included in the transaction state that is
> > + * serialized in the parallel DSM.
> > + */
> > + (void) GetCurrentTransactionId();
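
For example, something along these lines (just a suggested rewording on my
side, not text from the patch):

	/*
	 * Assign a FullTransactionId up front so that no XID assignment is
	 * attempted once we are in parallel mode.  For a parallel SELECT the
	 * workers themselves do not need the XID; it is only acquired here in
	 * the leader before entering parallel mode.
	 */
	(void) GetCurrentTransactionId();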

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Single transaction in the tablesync worker?

2020-12-10 Thread Peter Smith
On Tue, Dec 8, 2020 at 9:14 PM Amit Kapila  wrote:
>
> On Tue, Dec 8, 2020 at 11:53 AM Peter Smith  wrote:
> >
> > Yes, I observed this same behavior.
> >
> > IIUC the only way for the tablesync worker to go from CATCHUP mode to
> > SYNCDONE is via the call to process_sync_tables.
> >
> > But a side-effect of this is, when messages arrive during this CATCHUP
> > phase one tx will be getting handled by the tablesync worker before
> > the process_sync_tables() is ever encountered.
> >
> > I have created and attached a simple patch which allows the tablesync
> > to detect if there is anything to do *before* it enters the apply main
> > loop. Calling process_sync_tables() before the apply main loop  offers
> > a quick way out so the message handling will not be split
> > unnecessarily between the workers.
> >
>
> Yeah, this demonstrates the idea can work but as mentioned in my
> previous email [1] this needs much more work to make the COPY and then
> later fetching the changes from the publisher consistently. So, let me
> summarize the discussion so far. We wanted to enhance the tablesync
> phase of Logical Replication to enable decoding of prepared
> transactions [2]. The problem was when we stream prepared transactions
> in the tablesync worker, it will simply commit the same due to the
> requirement of maintaining a single transaction for the entire
> duration of copy and streaming of transactions afterward. We can't
> simply disable the decoding of prepared xacts for tablesync workers
> because it can skip some of the prepared xacts forever on subscriber
> as explained in one of the emails above [3]. Now, while investigating
> the solutions to enhance tablesync to support decoding at prepare
> time, I found that due to the current design of tablesync we can see
> partial data of transactions on subscribers which is also explained in
> the email above with an example [4]. This problem of visibility has been
> there since Logical Replication was introduced in PostgreSQL, and
> the only answer I got till now is that there doesn't seem to be any
> other alternative which I think is not true and I have provided one
> alternative as well.
>
> Next, we have discussed three different solutions all of which will
> solve the first problem (allow the tablesync worker to decode
> transactions at prepare time) and one of which solves both the first
> and second problem (partial transaction data visibility).
>
> Solution-1: Allow the table-sync worker to use multiple transactions.
> The reason for doing it in a single transaction is that if after
> initial COPY we commit and then crash while streaming changes of other
> transactions, the state of the table won't be known after the restart
> as we are using a temporary slot so we don't know from where to restart
> syncing the table.
>
> IIUC, we need to primarily do two things to achieve multiple
> transactions, one is to have an additional state in the catalog (say
> catch up) which will say that the initial copy is done. Then we need
> to have a permanent slot using which we can track the progress of the
> slot so that after restart (due to crash, connection break, etc.) we
> can start from the appropriate position. Now, this will allow us to do
> less work after recovering from a crash because we will know the
> restart point. As Craig mentioned, it also allows the sync slot to
> advance, freeing any held upstream resources before the whole sync is
> done, which is good if the upstream is busy and generating lots of
> WAL. Finally, committing as we go means we won't exceed the cid
> increment limit in a single txn.
>
> Solution-2: The next solution we discussed is to make "tablesync"
> worker just gather messages after COPY in a similar way to how the
> current streaming of in-progress transaction feature gathers messages
> into a "changes" file so that they can be replayed later by the apply
> worker. Now, here as we don't need to replay the individual
> transactions in tablesync worker in a single transaction, it will
> allow us to send decode prepared to the subscriber. This has some
> disadvantages such as each transaction processed by tablesync worker
> needs to be durably written to file and it can also lead to some apply
> lag later when we process the same by apply worker.
>
> Solution-3: Allow the table-sync workers to just perform initial COPY
> and then once the COPY is done for all relations the apply worker will
> stream all the future changes. Now, surely if large copies are
> required for multiple relations then we would delay a bit to replay
> transactions partially by the apply worker but don't know how much
> that matters as compared to transaction visibility issue and anyway we
> would have achieved the maximum parallelism by allowing copy via
> multiple workers. This would reduce the load (both CPU and I/O) on the
> publisher-side by allowing to decode the data only once instead of for
> each table sync worker once and separately for the apply worker. I
> 

Re: Improving spin-lock implementation on ARM.

2020-12-10 Thread Krunal Bauskar
On Tue, 8 Dec 2020 at 14:33, Krunal Bauskar  wrote:

>
>
> On Thu, 3 Dec 2020 at 21:32, Tom Lane  wrote:
>
>> Krunal Bauskar  writes:
>> > Any updates or further inputs on this.
>>
>> As far as LSE goes: my take is that tampering with the
>> compiler/platform's default optimization options requires *very*
>> strong evidence, which we have not got and likely won't get.  Users
>> who are building for specific hardware can choose to supply custom
>> CFLAGS, of course.  But we shouldn't presume to do that for them,
>> because we don't know what they are building for, or with what.
>>
>> I'm very willing to consider the CAS spinlock patch, but it still
>> feels like there's not enough evidence to show that it's a universal
>> win.  The way to move forward on that is to collect more measurements
>> on additional ARM-based platforms.  And I continue to think that
>> pgbench is only a very crude tool for testing spinlock performance;
>> we should look at other tests.
>>
>
> Thanks Tom.
>
> Given pgbench's limited options, I decided to try things with sysbench to
> expose the real contention using the zipfian distribution (the zipfian
> pattern causes only part of the database to get updated, thereby exposing
> the main contention point).
>
>
> 
> Baseline for 256 threads update-index use-case:
> -  44.24%  174935  postgres  postgres  [.] s_lock
> transactions: 5587105 (92988.40 per sec.)
>
> Patched for 256 threads update-index use-case:
>     0.02%      80  postgres  postgres  [.] s_lock
> transactions: 10288781 (171305.24 per sec.)
>
> perf diff
>
>     0.02%  +44.22%  postgres  [.] s_lock
> 
>
> As we can see from the above results, s_lock is exposing major contention
> that could be relieved using the said CAS patch. A performance improvement
> in the range of 80% is observed.
>
> Taking this as a guideline, we decided to run it across all scalability
> levels for both update and non-update use-cases.
> Check the attached graph: a consistent improvement is observed.
>
> I presume this should help re-establish that, for heavily contended cases,
> the existing TAS approach will always fall behind.
>
>
> ---
>
> Unfortunately, I don't have access to ARM architectures other than Kunpeng
> and Graviton2, where we have already proved the value of the patch.
> [ref: on Apple M1, as per your evaluation, the patch doesn't show a
> regression for selects. Maybe, if possible, you could try update scenarios
> too.]
>
> Do you know anyone in the community who has access to other ARM
> architectures whom we could ask to evaluate it?
> But since it has been proven on 2 independent ARM architectures, I am
> pretty confident it will scale on other ARM architectures too.
>
>

Any direction on how we can proceed on this?

* We have tested it with both cloud vendors that provide ARM instances.
* We have tested it with Apple M1 (partially, at least).
* Ampere used to provide instances on packet.com, but that is now an
  evaluation-only program.

No other cloud provider currently offers ARM instances.

Given that our evaluation so far has been positive, can we consider it on
the basis of the available data, which is quite encouraging, with an 80%
improvement seen for heavy-contention use-cases?
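
For reference, the idea behind the CAS-based spinlock is roughly the
following (a minimal sketch using GCC atomics; the patch posted earlier in
this thread is the authoritative version):

/*
 * Sketch of a CAS-based TAS() for ARM: write to the cache line only when
 * the lock actually looks free, instead of unconditionally swapping as a
 * plain test-and-set does.  Returns 0 when the lock was acquired,
 * following the s_lock.h convention.
 */
static inline int
cas_tas(volatile int *lock)
{
	int			expected = 0;

	return !__atomic_compare_exchange_n(lock, &expected, 1,
										false,
										__ATOMIC_ACQUIRE,
										__ATOMIC_RELAXED);
}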



>
>> From a system structural standpoint, I seriously dislike that lwlock.c
>> patch: putting machine-specific variant implementations into that file
>> seems like a disaster for maintainability.  So it would need to show a
>> very significant gain across a range of hardware before I'd want to
>> consider adopting it ... and it has not shown that.
>>
>> regards, tom lane
>>
>
>
> --
> Regards,
> Krunal Bauskar
>


-- 
Regards,
Krunal Bauskar


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-10 Thread tsunakawa.ta...@fujitsu.com
From: Jamison, Kirk/ジャミソン カーク 
> I added a comment in 0004 about the limitation of the optimization when
> there are TOAST relations that use a NON-PLAIN storage strategy, i.e. the
> optimization works if the data types used are integers, OID, bytea, etc.,
> but for TOAST-able data types like text, the optimization will be skipped
> and a full scan forced during recovery.

bytea is a TOAST-able type.


+   /*
+* Enter the optimization if the total number of blocks to be
+* invalidated for all relations is below the full scan threshold.
+*/
+   if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)

Checking cached here doesn't seem to be necessary, because if cached is false,
control goes to the full-scan path, as below:

+   if (!cached)
+   goto buffer_full_scan;
+


Regards
Takayuki Tsunakawa



Re: Parallel INSERT (INTO ... SELECT ...)

2020-12-10 Thread Greg Nancarrow
On Thu, Dec 10, 2020 at 5:25 PM Dilip Kumar  wrote:
>
>
>  /*
> + * PrepareParallelMode
> + *
> + * Prepare for entering parallel mode, based on command-type.
> + */
> +void
> +PrepareParallelMode(CmdType commandType)
> +{
> + Assert(!IsInParallelMode() || force_parallel_mode != FORCE_PARALLEL_OFF);
> +
> + if (IsModifySupportedInParallelMode(commandType))
> + {
> + /*
> + * Prepare for entering parallel mode by assigning a
> + * FullTransactionId, to be included in the transaction state that is
> + * serialized in the parallel DSM.
> + */
> + (void) GetCurrentTransactionId();
> + }
> +}
>
> Why do we need to serialize the transaction ID for 0001?  I mean in
> 0001 we are just allowing the SELECT to be executed in parallel, so why
> would we need the transaction ID for that?  I agree that we would need
> this once we try to perform the Insert also from the worker in the
> remaining patches.
>

There's a very good reason. It's related to parallel-mode checks for
Insert and how the XID is lazily acquired if required.
When allowing SELECT to be executed in parallel, we're in
parallel-mode and the leader interleaves Inserts with retrieval of the
tuple data from the workers.
You will notice that heap_insert() calls GetTransactionId() as the
very first thing it does. If the FullTransactionId is not valid,
AssignTransactionId() is then called, which then executes this code:

/*
 * Workers synchronize transaction state at the beginning of each parallel
 * operation, so we can't account for new XIDs at this point.
 */
if (IsInParallelMode() || IsParallelWorker())
elog(ERROR, "cannot assign XIDs during a parallel operation");

So that code (currently) has no way of knowing that a XID is being
(lazily) assigned at the beginning, or somewhere in the middle of, a
parallel operation.
This is the reason why PrepareParallelMode() is calling
GetTransactionId() up-front, to ensure a FullTransactionId is assigned
up-front, prior to parallel-mode (so then there won't be an attempted
XID assignment).

If you remove the GetTransactionId() call from PrepareParallelMode()
and run "make installcheck-world" with "force_parallel_mode=regress"
in effect, many tests will fail with:
ERROR:  cannot assign XIDs during a parallel operation

Regards,
Greg Nancarrow
Fujitsu Australia




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-10 Thread k.jami...@fujitsu.com
On Thursday, December 10, 2020 12:27 PM, Amit Kapila wrote: 
> On Thu, Dec 10, 2020 at 7:11 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Wed, 9 Dec 2020 16:27:30 +0530, Amit Kapila
> >  wrote in
> > > On Wed, Dec 9, 2020 at 6:32 AM Kyotaro Horiguchi
> > > > Mmm. At least btree doesn't need to call smgrnblocks except at
> > > > expansion, so we cannot get to the optimized path in major cases
> > > > of truncation involving btree (and/or maybe other indexes).
> > > >
> > >
> > > AFAICS, btree insert should call smgrnblocks via
> > >
> btree_xlog_insert->XLogReadBufferForRedo->XLogReadBufferForRedoExte
> nded->XLogReadBufferExtended->smgrnblocks.
> > > Similarly delete should also call smgrnblocks. Can you be bit more
> > > specific related to the btree case you have in mind?
> >
> > Oh, sorry. I wrongly looked to non-recovery path. smgrnblocks is
> > called during buffer loading while recovery. So, smgrnblock is called
> > for indexes if any update happens on the heap relation.
> >
> 
> Okay, so this means that we can get the benefit of optimization in many cases
> in the Truncate code path as well even if we use 'cached'
> flag? If so, then I would prefer to keep the code consistent for both vacuum
> and truncate recovery code path.

Yes, I have tested that optimization works for index relations.

I have attached V34, following the condition that we use the "cached" flag
in both DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers() for
consistency.
I added a comment in 0004 about the limitation of the optimization when
there are TOAST relations that use a NON-PLAIN storage strategy, i.e. the
optimization works if the data types used are integers, OID, bytea, etc.,
but for TOAST-able data types like text, the optimization will be skipped
and a full scan forced during recovery.

Regards,
Kirk Jamison



v34-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description:  v34-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch


v34-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description:  v34-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch


v34-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description:  v34-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v34-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch
Description:  v34-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch


Some more hackery around cryptohashes (some fixes + SHA1)

2020-12-10 Thread Michael Paquier
Hi all,

The remnant work that I have on my agenda to replace the remaining
low-level cryptohash calls of OpenSSL (SHAXXInit and such) by EVP is
the stuff related to SHA1, that gets used in two places: pgcrypto and
uuid-ossp.

First, I got to wonder if it would be better to support SHA1 directly
in cryptohash{_openssl}.c, glue some code to pgcrypto to use EVP
discreetly, or just do nothing.  Contrary to SHA256 and MD5, which are
used for authentication or backup manifests, SHA1 has limited use in
core, so I first wanted to just stick something in pgcrypto or just
let it go, hoping for the day when we'd remove those two modules, but
that's not a call I think we can make now.

But then, my very-recent history with uuid-ossp has made me look at
what kind of tricks we use to pull in SHA1 from pgcrypto to
uuid-ossp, and I did not like much the shortcuts used in ./configure
or uuid-ossp's Makefile to get those files when needed, depending on
the version of libuuid used (grep for UUID_EXTRA_OBJS for example).
So, I got to look at the second option of moving SHA1 directly into
the new cryptohash stuff, and quite liked the cleanup this gives.

Please find attached a set of two patches:
- 0001 is a set of small adjustments for the existing code of
cryptohashes: some cleanup for MD5 in uuid-ossp, and more importantly
one fix to call explicit_bzero() on the context data for the fallback
implementations.  With the existing code, we may leave behind some
context data.  That could become a problem if somebody has access to
this area of the memory even when they should not be able to do so,
something that should not happen, but I see no reason to not play it
safe and eliminate any traces.  If there are no objections, I'd like
to apply this part.
- 0002 is the addition of sha1 in the cryptohash infra, that includes
the cleanup between uuid-ossp and pgcrypto.  This makes any caller of
cryptohash for SHA1 to use EVP when building with OpenSSL, or the
fallback implementation.  I have adapted the fallback implementation
of SHA1 to have some symmetry with src/common/{md5.c,sha2.c}.
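
For callers, the end state would look roughly like this (a sketch only;
PG_SHA1 is the value that 0002 adds, the rest is the existing cryptohash
API):

#include "postgres.h"

#include "common/cryptohash.h"

/* Sketch: compute a 20-byte SHA-1 digest through the cryptohash API */
static void
compute_sha1(const uint8 *data, size_t len, uint8 *digest)
{
	pg_cryptohash_ctx *ctx = pg_cryptohash_create(PG_SHA1);

	if (ctx == NULL ||
		pg_cryptohash_init(ctx) < 0 ||
		pg_cryptohash_update(ctx, data, len) < 0 ||
		pg_cryptohash_final(ctx, digest) < 0)
		elog(ERROR, "could not compute SHA-1 hash");

	pg_cryptohash_free(ctx);
}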

I am adding this patch set to the next commit fest.  Thanks for
reading!
--
Michael
From 9d9a9bb6d9b4eb93ecf3e7e3c5695a2ac2c2a2d7 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 10 Dec 2020 16:34:19 +0900
Subject: [PATCH 1/2] Adjust some code of cryptohash

This adjusts the code around recent changes for cryptohash functions:
- Add a variable in md5.h to track down the size of a digest result,
taken from pgcrypto/.
- Call explicit_bzero() on the context data when freeing the thing for
fallback implementations.
- Clean up some code related to recent changes of uuid-ossp.
---
 src/include/common/md5.h  |  4 
 src/common/cryptohash.c   | 20 
 contrib/pgcrypto/internal.c   |  4 
 contrib/uuid-ossp/.gitignore  |  1 -
 contrib/uuid-ossp/uuid-ossp.c |  4 ++--
 5 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/src/include/common/md5.h b/src/include/common/md5.h
index 53036d2d17..5dac70cbc5 100644
--- a/src/include/common/md5.h
+++ b/src/include/common/md5.h
@@ -16,6 +16,10 @@
 #ifndef PG_MD5_H
 #define PG_MD5_H
 
+/* Size of result generated by MD5 computation */
+#define MD5_DIGEST_LENGTH 16
+
+/* password-related data */
 #define MD5_PASSWD_CHARSET	"0123456789abcdef"
 #define MD5_PASSWD_LEN	35
 
diff --git a/src/common/cryptohash.c b/src/common/cryptohash.c
index 5cc2572eb6..cf4588bad7 100644
--- a/src/common/cryptohash.c
+++ b/src/common/cryptohash.c
@@ -197,6 +197,26 @@ pg_cryptohash_free(pg_cryptohash_ctx *ctx)
 {
 	if (ctx == NULL)
 		return;
+
+	switch (ctx->type)
+	{
+		case PG_MD5:
+			explicit_bzero(ctx->data, sizeof(pg_md5_ctx));
+			break;
+		case PG_SHA224:
+			explicit_bzero(ctx->data, sizeof(pg_sha224_ctx));
+			break;
+		case PG_SHA256:
+			explicit_bzero(ctx->data, sizeof(pg_sha256_ctx));
+			break;
+		case PG_SHA384:
+			explicit_bzero(ctx->data, sizeof(pg_sha384_ctx));
+			break;
+		case PG_SHA512:
+			explicit_bzero(ctx->data, sizeof(pg_sha512_ctx));
+			break;
+	}
+
 	FREE(ctx->data);
 	explicit_bzero(ctx, sizeof(pg_cryptohash_ctx));
 	FREE(ctx);
diff --git a/contrib/pgcrypto/internal.c b/contrib/pgcrypto/internal.c
index e6d90c5656..ea377bdf83 100644
--- a/contrib/pgcrypto/internal.c
+++ b/contrib/pgcrypto/internal.c
@@ -41,10 +41,6 @@
 #include "common/cryptohash.h"
 #include "common/md5.h"
 
-#ifndef MD5_DIGEST_LENGTH
-#define MD5_DIGEST_LENGTH 16
-#endif
-
 #ifndef SHA1_DIGEST_LENGTH
 #ifdef SHA1_RESULTLEN
 #define SHA1_DIGEST_LENGTH SHA1_RESULTLEN
diff --git a/contrib/uuid-ossp/.gitignore b/contrib/uuid-ossp/.gitignore
index 6c989c7872..d7260edc61 100644
--- a/contrib/uuid-ossp/.gitignore
+++ b/contrib/uuid-ossp/.gitignore
@@ -1,4 +1,3 @@
-/md5.c
 /sha1.c
 # Generated subdirectories
 /log/
diff --git a/contrib/uuid-ossp/uuid-ossp.c b/contrib/uuid-ossp/uuid-ossp.c
index 8f81c94e72..2ff7d9448b 100644
--- a/contrib/uuid-ossp/uuid-ossp.c
+++ b/contrib/uuid-ossp/uuid-ossp.c
@@