Re: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.

2021-03-22 Thread Masahiro Ikeda


On 2021/03/23 11:40, Fujii Masao wrote:
> 
> 
> On 2021/03/23 9:05, Masahiro Ikeda wrote:
>> Yes. I attached the v5 patch based on the v3 patch.
>> I renamed SignalHandlerForUnsafeExit() and fixed the following comment.
> 
> Thanks for updating the patch!
> 
> When the startup process exits because of recovery_target_action=shutdown,
> reaper() calls TerminateChildren(SIGTERM). This function sends SIGTERM to
> the stats collector. Currently the stats collector ignores SIGTERM, but with
> the patch it exits normally. This change of behavior might be problematic.
> 
> That is, TerminateChildren(SIGTERM) sends SIGTERM to various processes.
> But currently the stats collector and checkpointer don't exit even when
> SIGTERM arrives because they ignore SIGTERM. After several processes
> other than the stats collector and checkpointer exit by SIGTERM,
> PostmasterStateMachine() and reaper() make the checkpointer exit and then
> the stats collector. The shutdown terminates the processes in this order.
> 
> On the other hand, with the patch, the stats collector exits by SIGTERM
> before the checkpointer exits. This is not the normal order in which
> processes exit during shutdown.
> 
> To address this issue, one idea is to use SIGUSR2 for normal exit of the stats
> collector, instead of SIGTERM. If we do this, TerminateChildren(SIGTERM)
> cannot terminate the stats collector. Thought?
> 
> If we adopt this idea, a detailed comment about why SIGUSR2 is used for that
> needs to be added.

Thanks for your comments. I agree with your concern and suggestion.
Additionally, we need to consider the system shutdown cycle too, as
CheckpointerMain()'s comment says.

```
 * Note: we deliberately ignore SIGTERM, because during a standard Unix
 * system shutdown cycle, init will SIGTERM all processes at once.  We
 * want to wait for the backends to exit, whereupon the postmaster will
 * tell us it's okay to shut down (via SIGUSR2)
```

I changed the signal from SIGTERM to SIGUSR2 and added comments explaining why
SIGUSR2 is used.
(v6-0001-pgstat_avoid_writing_on_sigquit.patch)
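
For illustration, the resulting signal setup in the stats collector's main
routine would look roughly like this. This is a minimal sketch based on the
discussion above, using the patch's handler names; pgstat_setup_signals is a
hypothetical wrapper, not part of the patch:

```
/* Hypothetical sketch of the stats collector's signal setup. */
static void
pgstat_setup_signals(void)
{
	pqsignal(SIGHUP, SignalHandlerForConfigReload);
	pqsignal(SIGTERM, SIG_IGN);	/* ignored: init SIGTERMs everything at OS shutdown */
	pqsignal(SIGQUIT, SignalHandlerForNonCrashExit);	/* exit at once, no statsfile */
	pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);	/* write statsfiles and exit */
}
```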

Regards,

-- 
Masahiro Ikeda
NTT DATA CORPORATION
diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index dd9136a942..e09a7b3cad 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -64,9 +64,27 @@ SignalHandlerForConfigReload(SIGNAL_ARGS)
 }
 
 /*
- * Simple signal handler for exiting quickly as if due to a crash.
+ * Simple signal handler that lets processes which have not yet touched, or
+ * never touch, shared memory exit quickly.
  *
- * Normally, this would be used for handling SIGQUIT.
+ * If a process has already touched shared memory, use
+ * SignalHandlerForCrashExit() instead, because shared memory may be corrupted.
+ */
+void
+SignalHandlerForNonCrashExit(SIGNAL_ARGS)
+{
+	/*
+	 * Since we don't touch shared memory, we can just pull the plug and exit
+	 * without running any atexit handlers.
+	 */
+	_exit(1);
+}
+
+/*
+ * Simple signal handler for processes which have touched shared memory to
+ * exit quickly.
+ *
+ * Normally, this would be used for handling SIGQUIT as if due to a crash.
  */
 void
 SignalHandlerForCrashExit(SIGNAL_ARGS)
@@ -93,9 +111,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS)
  * shut down and exit.
  *
  * Typically, this handler would be used for SIGTERM, but some processes use
- * other signals. In particular, the checkpointer exits on SIGUSR2, the
- * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT
- * or SIGTERM.
+ * other signals. In particular, the checkpointer and the stats collector exit
+ * on SIGUSR2, and the WAL writer exits on either SIGINT or SIGTERM.
  *
  * ShutdownRequestPending should be checked at a convenient place within the
  * main loop, or else the main loop should call HandleMainLoopInterrupts.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b7af7c2707..53b84eda56 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,7 +1,21 @@
 /* --
  * pgstat.c
  *
- *	All the statistics collector stuff hacked up in one big, ugly file.
+ * All the statistics collector stuff hacked up in one big, ugly file.
+ *
+ * The statistics collector is started by the postmaster as soon as the
+ * startup subprocess finishes.  It remains alive until the postmaster
+ * commands it to terminate.  Normal termination is by SIGUSR2, which is
+ * sent after the checkpointer has exited; it instructs the statistics
+ * collector to save the final statistics for reuse at the next startup
+ * and then exit(0).  Emergency termination is by SIGQUIT; like any
+ * backend, the statistics collector will exit quickly without saving the
+ * final statistics.  That's OK because the startup process will remove
+ * all statistics at the next startup after an emergency termination.
+ *
+ * Because the statistics collector doesn't touch shared memory, even if
+ * the statistics collector exits unexpectedly, the 

Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)

2021-03-22 Thread Fujii Masao




On 2021/03/23 10:52, Thomas Munro wrote:

On Tue, Mar 23, 2021 at 2:44 PM Fujii Masao  wrote:

I found that the 0001 patch was committed in de829ddf23, which added the new
wait event WalrcvExit. This name seems inconsistent with other wait
events. I'm thinking it's better to rename it to WalReceiverExit. Thought?
Patch attached.


Agreed, your names are better.  Thanks.


Thanks! I will commit the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Nicer error when connecting to standby with hot_standby=off

2021-03-22 Thread Fujii Masao




On 2021/03/23 3:59, James Coleman wrote:

Are you saying we should only change the message for a single case:
the case where we'd otherwise allow connections but EnableHotStandby
is false?


No. Let me clarify my opinion.

At PM_STARTUP, "the database system is starting up" should be logged
whatever the setting of hot_standby is. This is the same as the original
behavior. During crash recovery, this message is output. Also at archive
recovery or standby server, until the startup process sends
PMSIGNAL_RECOVERY_STARTED, this message is logged.

At PM_RECOVERY, originally "the database system is starting up" was logged
whatever the setting of hot_standby is. My opinion is the same as our
consensus, i.e., "the database system is not accepting connections" and
"Hot standby mode is disabled." are logged if hot_standby is disabled, while
"the database system is not accepting connections" and "Consistent recovery
state has not been yet reached." are logged if hot_standby is enabled.

After the consistent recovery state is reached, if hot_standby is disabled,
the postmaster state is still PM_RECOVERY. So "Hot standby mode is disabled."
is still logged in this case. This is also different behavior from the original.
If hot_standby is enabled, read-only connections can be accepted because
the consistent state is reached. So no message needs to be logged.

Therefore, for now, the only point on which we've not reached consensus is
what message should be logged at PM_STARTUP. I'm thinking it's better to log
"the database system is starting up" in that case, for the reasons that I
explained upthread.
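
(To make that concrete, here is a rough sketch of the message selection being
discussed. The CAC_* result codes are illustrative placeholders, not the
actual patch:)

```
/* Illustrative sketch only: map postmaster state to the reported message. */
switch (pmState)
{
	case PM_STARTUP:
		/* crash recovery, or before PMSIGNAL_RECOVERY_STARTED arrives */
		return CAC_STARTUP;			/* "the database system is starting up" */
	case PM_RECOVERY:
		if (!EnableHotStandby)
			return CAC_NOTHOTSTANDBY;	/* hypothetical code: "... not accepting
										 * connections" plus "Hot standby mode
										 * is disabled." */
		return CAC_NOTCONSISTENT;		/* hypothetical code: "... not accepting
										 * connections" plus "Consistent recovery
										 * state has not been yet reached." */
	default:
		return CAC_OK;				/* consistent state reached: accept */
}
```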

Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: a verbose option for autovacuum

2021-03-22 Thread Masahiko Sawada
On Tue, Mar 23, 2021 at 1:31 PM Michael Paquier  wrote:
>
> On Mon, Mar 22, 2021 at 12:17:37PM +0900, Masahiko Sawada wrote:
> > I've updated the patch. I saved the index names at the beginning of
> > heap_vacuum_rel() for autovacuum logging, and added indstats and
> > nindexes to LVRelStats. Some functions still have nindexes as a
> > function argument, but that seems to make sense since it corresponds to
> > the list of index relations (*Irel). Please review the patch.
>
> Going back to that, the structure of the static APIs in this file makes
> the whole logic a bit hard to follow, but the whole set of changes you
> have done here makes sense.  It took me a moment to recall and
> understand why it is safe to free *stats at the end of
> vacuum_one_index() and if the index stats array actually pointed to
> the DSM segment correctly within the shared stats.
>
> I think that there is more consolidation possible within LVRelStats,
> but let's leave that for another day, if there is any need for such a
> move.

While studying your patch (v5-index_stat_log.patch), I found that we can
polish the parallel vacuum code in some places. I'll try that another
day.

>
> To keep it short.  Sold.

Thank you!

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Change default of checkpoint_completion_target

2021-03-22 Thread Michael Paquier
On Mon, Mar 22, 2021 at 01:11:00PM -0400, Stephen Frost wrote:
> Unless there's anything further on this, I'll plan to commit it tomorrow
> or Wednesday.

Cool, looks fine to me.

This version of the patch has forgotten to update one spot:
src/backend/postmaster/checkpointer.c:double CheckPointCompletionTarget = 0.5;
--
Michael


signature.asc
Description: PGP signature


Re: proposal - psql - use pager for \watch command

2021-03-22 Thread Pavel Stehule
On Tue, Mar 23, 2021 at 0:35, Thomas Munro wrote:

> On Sun, Mar 21, 2021 at 11:43 PM Pavel Stehule wrote:
> > On Sat, Mar 20, 2021 at 23:45, Thomas Munro wrote:
> >> Thoughts?  I put my changes into a separate patch for clarity, but
> >> they need some more tidying up.
> >
> > yes, your solution is much better.
>
> Hmm, there was a problem with it though: it blocked ^C while running
> the query, which is bad.  I fixed that.   I did some polishing of the
> code and some editing on the documentation and comments.  I disabled
> the feature completely on Windows, because it seems unlikely that
> we'll be able to know if it even works, in this cycle.
>
> -   output = PageOutput(158, pager ? &(pset.popt.topt) : NULL);
> +   output = PageOutput(160, pager ? &(pset.popt.topt) : NULL);
>
> What is that change for?
>

This is correct - it is the number of printed rows, used to decide whether
to use a pager for the help output. There are two new rows, so the number
is correctly incremented by 2.
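
(For context, a minimal sketch of how that constant is consumed. PageOutput()
and ClosePager() are the real fe_utils routines; the fragment around them is
illustrative:)

```
/* The line count passed to PageOutput() must track the number of lines the
 * help text prints: the pager is started only when the output would not fit
 * on the screen. */
FILE	   *output = PageOutput(160, pager ? &(pset.popt.topt) : NULL);

/* ... one fprintf(output, ...) per help line, ~160 in total ... */
ClosePager(output);
```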

Pavel


Re: Proposal: Save user's original authenticated identity for logging

2021-03-22 Thread Michael Paquier
On Mon, Mar 22, 2021 at 06:22:52PM +0100, Magnus Hagander wrote:
> The 0002/0001/whateveritisaftertherebase is tracked over at
> https://www.postgresql.org/message-id/flat/92e70110-9273-d93c-5913-0bccb6562...@dunslane.net
> isn't it? I've assumed the expectation is to have that one committed
> from that thread, and then rebase using that.

Independent and useful pieces could just be extracted and applied
separately where it makes sense.  I am not sure if that's the case
here, so I'll do a patch_to_review++.
--
Michael


signature.asc
Description: PGP signature


Re: replication cleanup code incorrect way to use of HTAB HASH_REMOVE ?

2021-03-22 Thread Amit Kapila
On Mon, Mar 22, 2021 at 7:57 AM Amit Kapila  wrote:
>
> On Mon, Mar 22, 2021 at 3:20 AM Peter Smith  wrote:
> >
> > On Sun, Mar 21, 2021 at 8:54 PM Amit Kapila  wrote:
> > >
> > > On Sat, Mar 20, 2021 at 12:54 PM Peter Smith  
> > > wrote:
> > > >
> > > > PSA my patch to correct this by first doing a HASH_FIND, and only
> > > > doing HASH_REMOVE after we've finished using the ent.
> > > >
> > >
> > > Why can't we keep using HASH_REMOVE as it is but get the output (entry
> > > found or not) in the last parameter of hash_search API and then
> > > perform Assert based on that? See similar usage in reorderbuffer.c and
> > > rewriteheap.c.
> > >
> >
> > Changing the Assert doesn't do anything to fix the problem as
> > described, i.e. dereferencing of ent after the HASH_REMOVE.
> >
> > The real problem isn't the Assert. It's all those other usages of ent
> > disobeying the API rule: "(NB: in the case of the REMOVE action, the
> > result is a dangling pointer that shouldn't be dereferenced!)"
> >
>
> Right, that is a problem. I see that your patch will fix it. Thanks.
>

Pushed your patch.
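
(For readers following along, the safe pattern boils down to this minimal
sketch. The entry type, hash table, and helper names are illustrative, not
the exact patch:)

```
StreamXidHashEntry *ent;		/* illustrative entry type */
bool		found;

/* Look the entry up first and finish every use of it... */
ent = (StreamXidHashEntry *) hash_search(xidhash, &xid, HASH_FIND, &found);
if (found)
{
	release_entry_resources(ent);	/* hypothetical helper: last use of ent */

	/* ...and only then remove it: HASH_REMOVE's return value is a
	 * dangling pointer that must not be dereferenced. */
	hash_search(xidhash, &xid, HASH_REMOVE, NULL);
}
```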

-- 
With Regards,
Amit Kapila.




Re: Proposal: Save user's original authenticated identity for logging

2021-03-22 Thread Michael Paquier
On Mon, Mar 22, 2021 at 07:17:26PM +, Jacob Champion wrote:
> v8's test_access lost the in-order log search from v7; I've added it
> back in v9. The increased resistance to entropy seems worth the few
> extra lines. Thoughts?

I am not really sure that we need to bother about the ordering of the
entries here, as long as we check for all of them within the same
fragment of the log file, so I would just go down to the simplest
solution that I posted upthread that is enough to make sure that the
verbosity is protected.  That's what we do elsewhere, like with
command_checks_all() and such.
--
Michael


signature.asc
Description: PGP signature


Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-03-22 Thread Neha Sharma
On Tue, Mar 23, 2021 at 10:08 AM Amul Sul  wrote:

> On Mon, Mar 22, 2021 at 3:03 PM Amit Langote wrote:
> >
> > On Mon, Mar 22, 2021 at 5:26 PM Amul Sul  wrote:
> > > In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
> > > rwstate->rs_new_rel->rd_smgr correctly, but the next line,
> > > tuplesort_begin_cluster(), gets called, which causes a system cache
> > > invalidation and, due to the CCA setting, wipes out
> > > rwstate->rs_new_rel->rd_smgr, which isn't restored for the subsequent
> > > operations, causing a segmentation fault.
> > >
> > > Calling RelationOpenSmgr() before smgrimmedsync() in
> > > end_heap_rewrite() fixes the failure. Did the same in the attached
> > > patch.
> >
> > That makes sense.  I see a few commits in the git history adding
> > RelationOpenSmgr() before a smgr* operation, whenever such a problem
> > would have been discovered: 4942ee656ac, afa8f1971ae, bf347c60bdd7,
> > for example.
> >
>
> Thanks for the confirmation.
>
> > I do wonder if there are still other smgr* operations in the source
> > code that are preceded by operations that would invalidate the
> > SMgrRelation that those smgr* operations would be called with.  For
> > example, the smgrnblocks() in gistBuildCallback() may get done too
> > late relative to a corresponding RelationOpenSmgr() on the index relation.
> >
>
> I checked gistBuildCallback() by adding Assert(index->rd_smgr) before
> smgrnblocks() with the CCA setting, and didn't see any problem there.
>
> I think the easiest way to find those is to run the regression suite with a
> CCA build; there is no guarantee that regression will hit all smgr*
> operations, but it might hit most of them.


Sure, I will do a regression run with CCA enabled.

>
> Regards,
> Amul
>

Regards,
Neha Sharma


Re: Handling of opckeytype / CREATE OPERATOR CLASS (bug?)

2021-03-22 Thread Tom Lane
Tomas Vondra  writes:
> On 3/23/21 3:13 AM, Tom Lane wrote:
>> Hm.  Both catalogs.sgml and pg_opclass.h say specifically that
>> opckeytype should be zero if it's to be the same as the input
>> column type.  I don't think just dropping the enforcement of that
>> is the right answer.

> Yeah, that's possible. I was mostly just demonstrating the difference in
> behavior. Maybe the right fix is to fix the catalog contents and then
> tweak the AM code, or something.

Digging in our git history, the rule about zero opckeytype dates to
2001 (f933766ba), which precedes our invention of polymorphic types
in 2003 (somewhere around 730840c9b).  So I'm pretty sure that that
was a poor man's substitute for polymorphic opclasses, which we
failed to clean up entirely after we got real polymorphic opclasses.

Now, I'd be in favor of cleaning that up and just using standard
polymorphism rules throughout.  But (without having studied the code)
it looks like the immediate issue is that something in the BRIN code is
unfamiliar with the rule for zero opckeytype.  It might be a noticeably
smaller patch to fix that than to get rid of the convention about zero.

regards, tom lane




RE: [POC] Fast COPY FROM command for the table with foreign partitions

2021-03-22 Thread tsunakawa.ta...@fujitsu.com
From: Justin Pryzby 
> On Mon, Mar 22, 2021 at 08:18:56PM -0700, Zhihong Yu wrote:
> > with data_dest_cb callback. It is used for send text representation of a
> > tuple to a custom destination.
> >
> > send text -> sending text
> 
> I would say "It is used to send the text representation ..."

I took Justin-san's suggestion.  (It feels like I'm in a junior English 
class...)


> It's actually a pointer:
> src/include/commands/copy.h:typedef struct CopyToStateData *CopyToState;
> 
> There's many data structures like this, where a structure is typedefed with a
> "Data" suffix and the pointer is typedefed without the "Data"

Yes.  Thank you for the good explanation, Justin-san.



Regards
TakayukiTsunakawa




v22-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch
Description:  v22-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch


Re: Type of wait events WalReceiverWaitStart and WalSenderWaitForWAL

2021-03-22 Thread Kyotaro Horiguchi
(I finally get to catch up here..)

At Mon, 22 Mar 2021 13:59:47 +0900, Fujii Masao wrote in
> 
> 
> On 2021/03/22 12:01, Kyotaro Horiguchi wrote:
> > Mmm. I agree that it waits for WAL in most cases, but the WAL wait is
> > still "activity" to me, because the process is not waiting to be cued by
> > someone; it is waiting for new WAL to come so it can perform its main
> > purpose. If that counts as IPC, shouldn't all waits other than pure
> > sleeps fall into IPC?  (I was confused by the comment of WalSndWait,
> > which doesn't state that it is waiting for a latch..)
> > The other point I'd like to raise is that the client_wait case should be
> > distinct from the WAL wait, since it is a significant sign of what is
> > happening.
> > So I propose two changes here.
> > a. Rewrite the comment of WalSndWait so that it states that it is "also
> >    waiting for latch-set".
> 
> +1

Cool.

> > b. Split the event into two different events.
> > -   WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_MAIN);
> > +   WalSndWait(wakeEvents, sleeptime,
> > +  pq_is_send_pending() ? WAIT_EVENT_WAL_SENDER_WRITE_DATA :
> > + WAIT_EVENT_WAL_SENDER_MAIN);
> > And _WRITE_DATA as client_wait and _SENDER_MAIN as activity.
> > What do you think about this?
> 
> I'm ok with this. What about the attached patch
> (WalSenderWriteData.patch)?

Yeah, that is better. I'm fine with it as a whole.

+ * Overwrite wait_event with WAIT_EVENT_WAL_SENDER_WRITE_DATA
+ * if we have pending data in the output buffer and are waiting to write
+ * data to a client.

Since the function doesn't check for that directly, I'd like to word it
as follows.

Overwrite wait_event with WAIT_EVENT_WAL_SENDER_WRITE_DATA if the
caller asked to wait for WL_SOCKET_WRITEABLE, which means that we have
pending data in the output buffer and are waiting to write data to a
client.


> > Yes. WAIT_EVENT_WAL_SENDER_WAIT_WAL is functionally equivalent to
> > WAIT_EVENT_WAL_SENDER_MAIN.  So I think it should be in
> > the same category as WAIT_EVENT_WAL_SENDER_MAIN. And as in point 1 above,
> > the client_wait case should be distinct from the _MAIN event.
> 
> +1. What about the attached patch (WalSenderWaitForWAL.patch)?

Looks good to me. Thanks.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-03-22 Thread Amul Sul
On Mon, Mar 22, 2021 at 3:03 PM Amit Langote  wrote:
>
> On Mon, Mar 22, 2021 at 5:26 PM Amul Sul  wrote:
> > In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
> > rwstate->rs_new_rel->rd_smgr correctly, but the next line,
> > tuplesort_begin_cluster(), gets called, which causes a system cache
> > invalidation and, due to the CCA setting, wipes out
> > rwstate->rs_new_rel->rd_smgr, which isn't restored for the subsequent
> > operations, causing a segmentation fault.
> >
> > Calling RelationOpenSmgr() before smgrimmedsync() in
> > end_heap_rewrite() fixes the failure. Did the same in the attached
> > patch.
>
> That makes sense.  I see a few commits in the git history adding
> RelationOpenSmgr() before a smgr* operation, whenever such a problem
> would have been discovered: 4942ee656ac, afa8f1971ae, bf347c60bdd7,
> for example.
>

Thanks for the confirmation.

> I do wonder if there are still other smgr* operations in the source
> code that are preceded by operations that would invalidate the
> SMgrRelation that those smgr* operations would be called with.  For
> example, the smgrnblocks() in gistBuildCallback() may get done too
> late relative to a corresponding RelationOpenSmgr() on the index relation.
>

I checked gistBuildCallback() by adding Assert(index->rd_smgr) before
smgrnblocks() with the CCA setting, and didn't see any problem there.

I think the easiest way to find those is to run the regression suite with a
CCA build; there is no guarantee that regression will hit all smgr*
operations, but it might hit most of them.
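
(The fix itself is essentially this two-line pattern in end_heap_rewrite(),
sketched here under the assumption that only the main fork needs syncing:)

```
/* Re-open the smgr relation right before use: a relcache invalidation
 * (forced on every access under CLOBBER_CACHE_ALWAYS) may have closed and
 * cleared rd_smgr since begin_heap_rewrite() set it. */
RelationOpenSmgr(state->rs_new_rel);
smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
```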

Regards,
Amul




Re: a verbose option for autovacuum

2021-03-22 Thread Michael Paquier
On Mon, Mar 22, 2021 at 12:17:37PM +0900, Masahiko Sawada wrote:
> I've updated the patch. I saved the index names at the beginning of
> heap_vacuum_rel() for autovacuum logging, and added indstats and
> nindexes to LVRelStats. Some functions still have nindexes as a
> function argument, but that seems to make sense since it corresponds to
> the list of index relations (*Irel). Please review the patch.

Going back to that, the structure of the static APIs in this file makes
the whole logic a bit hard to follow, but the whole set of changes you
have done here makes sense.  It took me a moment to recall and
understand why it is safe to free *stats at the end of
vacuum_one_index() and if the index stats array actually pointed to
the DSM segment correctly within the shared stats.

I think that there is more consolidation possible within LVRelStats,
but let's leave that for another day, if there is any need for such a
move.

To keep it short.  Sold.
--
Michael


signature.asc
Description: PGP signature


Re: proposal - psql - use pager for \watch command

2021-03-22 Thread Pavel Stehule
On Mon, Mar 22, 2021 at 22:07, Thomas Munro wrote:

> On Tue, Mar 23, 2021 at 1:53 AM Pavel Stehule wrote:
> > On Mon, Mar 22, 2021 at 13:13, Thomas Munro wrote:
> >> The problem is that Apple's /dev/tty device is defective, and doesn't
> >> work in poll().  It always returns immediately with revents=POLLNVAL,
> >> but pspg assumes that data is ready and tries to read the keyboard and
> >> then blocks until I press a key.  This seems to fix it:
> >>
> >> +#ifndef __APPLE__
> >> +   /* macOS can't use poll() on /dev/tty */
> >> state.tty = fopen("/dev/tty", "r+");
> >> +#endif
> >> if (!state.tty)
> >> state.tty = fopen(ttyname(fileno(stdout)), "r");
> >
> >
> > it is hell.
>
> Heh.  I've recently spent many, many hours trying to make AIO work on
> macOS, and nothing surprises me anymore.  BTW I found something from
> years ago on the 'net that fits with my observation about /dev/tty:
>
> https://www.mail-archive.com/bug-gnulib@gnu.org/msg00296.html
>
> Curious, which other OS did you put that fallback case in for?  I'm a
> little confused about why it works, so I'm not sure if it's the best
> possible change, but I'm not planning to dig any further now, too many
> patches, not enough time :-)
>

Unfortunately, I have no exact evidence. My original implementation was
very primitive

if (freopen("/dev/tty", "rw", stdin) == NULL)
{
fprintf(stderr, "cannot to reopen stdin: %s\n", strerror(errno));
exit(1);
}

Some people reported problems, but I don't know if these issues were related
to the tty or to freopen.

In some discussion I found a workaround that reuses stdout and stderr -
that works well, but I have no feedback about these fallback
cases. And because this strategy is used by the "less" pager too, I expect
that this is a common and widely used workaround.

I remember there were problems with Cygwin and some (maybe very old) Unix
platforms with deeper nesting - something like

screen -> psql -> pspg.

pspg worked when run directly, but it didn't work from psql. Probably the
pty implementation was not fully correct somewhere.



>
> > Please, can you verify this fix?
>
> It works perfectly for me on a macOS 11.2 system with that change,
> repainting the screen exactly when it should.  I'm happy about that
> because (1) it means I can confirm that the proposed change to psql is
> working correctly on the 3 Unixes I have access to, and (2) I am sure
> that a lot of Mac users will appreciate being able to use super-duper
> \watch mode when this ships (a high percentage of PostgreSQL users I
> know use a Mac as their client machine).
>

Thank you for the verification. I fixed it in the master branch.


> >> A minor problem is that on macOS, _GNU_SOURCE doesn't seem to imply
> >> NCURSES_WIDECHAR, so I suspect Unicode will be broken unless you
> >> manually add -DNCURSES_WIDECHAR=1, though I didn't check.
> >
> > It is possible -
> >
> > can you run "pspg --version"
>
> Looks like I misunderstood: it is showing "with wide char support",
> it's just that the "num" is 0 rather than 1.  I'm not planning to
> investigate that any further now, but I checked that it can show the
> output of SELECT 'špeĉiäl chârãçtérs' correctly.
>

That is the job of ncursesw - pspg sends data to ncurses in its original
format; it only plays some games with attributes.


Re: UPDATE ... SET (single_column) = row_constructor is a bit broken from V10 906bfcad7ba7c

2021-03-22 Thread Mahendra Singh Thalor
On Tue, 23 Mar 2021 at 02:43, Justin Pryzby  wrote:
>
> On Mon, Mar 22, 2021 at 02:10:49PM +0530, Mahendra Singh Thalor wrote:
> > Hi Hackers,
> >
> > Commit 906bfcad7ba7c has improved handling for "UPDATE ... SET
> > (column_list) = row_constructor", but it broke some cases that
> > worked prior to this commit.
> > After this commit, the query "DO UPDATE SET (t1_col)" gives an error;
> > it worked fine earlier.
>
> See prior discussions:
>
> https://www.postgresql.org/message-id/flat/20170719174507.GA19616%40telsasoft.com
> https://www.postgresql.org/message-id/flat/camjna7cdlzpcs0xnrpkvqmj6vb6g3eh8cygp9zbjxdpfftz...@mail.gmail.com
> https://www.postgresql.org/message-id/flat/87sh5rs74y@news-spur.riddles.org.uk
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=86182b18957b8f9e8045d55b137aeef7c9af9916
>

Thanks, Justin.

Sorry, my mistake: I opened a new thread without checking the prior
discussions. I will take care of this next time.

-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com




Re: Handling of opckeytype / CREATE OPERATOR CLASS (bug?)

2021-03-22 Thread Tomas Vondra



On 3/23/21 3:13 AM, Tom Lane wrote:
> Tomas Vondra  writes:
>> while working on the new BRIN opclasses in [1], I stumbled on something
>> I think is actually a bug in how CREATE OPERATOR CLASS handles the
>> storage type.
> 
> Hm.  Both catalogs.sgml and pg_opclass.h say specifically that
> opckeytype should be zero if it's to be the same as the input
> column type.  I don't think just dropping the enforcement of that
> is the right answer.
> 

Yeah, that's possible. I was mostly just demonstrating the difference in
behavior. Maybe the right fix is to fix the catalog contents and then
tweak the AM code, or something.

> I don't recall for sure what-all might depend on that.  I suspect
> that it's mostly for the benefit of polymorphic opclasses, so
> maybe the right thing is to say that the opckeytype can be
> polymorphic if opcintype is, and then we resolve it as per
> the usual polymorphism rules.
> 

I did an experiment - fixed all the opclasses violating the rule by
removing the opckeytype, and ran make check. The only cases causing
issues were cidr and int4range. Not that it proves anything.

> In any case, it's fairly suspicious that the only opclasses
> violating the existing rule are johnny-come-lately BRIN opclasses.
> 

Right, that seems suspicious.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: New IndexAM API controlling index vacuum strategies

2021-03-22 Thread Peter Geoghegan
On Mon, Mar 22, 2021 at 8:33 PM Peter Geoghegan  wrote:
> More concretely, maybe the new GUC is forced to be 1.05 of
> vacuum_freeze_table_age. Of course that scheme is a bit arbitrary --
> but so is the existing 0.95 scheme.

I meant to write 1.05 of autovacuum_freeze_max_age. So just as
vacuum_freeze_table_age cannot really be greater than 0.95 *
autovacuum_freeze_max_age, your new GUC cannot really be less than
1.05 * autovacuum_freeze_max_age.
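
(Concretely, the silent interpretation could mirror what
vacuum_set_xid_limits() already does for vacuum_freeze_table_age. A minimal
sketch, with the new GUC's name invented here for illustration:)

```
/* Hypothetical clamping of the new GUC: silently raise it to at least
 * 1.05 * autovacuum_freeze_max_age, mirroring how vacuum_freeze_table_age
 * is silently limited to 0.95 * autovacuum_freeze_max_age. */
skip_index_vacuum_age = Max(skip_index_vacuum_age,
							(int) (autovacuum_freeze_max_age * 1.05));
```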

-- 
Peter Geoghegan




Re: New IndexAM API controlling index vacuum strategies

2021-03-22 Thread Peter Geoghegan
On Mon, Mar 22, 2021 at 8:28 PM Peter Geoghegan  wrote:
> You seem to be concerned about a similar contradiction. In fact it's a
> *very* similar contradiction, because this new GUC is naturally a
> "sibling GUC" of both vacuum_freeze_table_age and
> autovacuum_freeze_max_age (the "units" are the same, though the
> behavior that each GUC triggers is different -- but
> vacuum_freeze_table_age and autovacuum_freeze_max_age are both already
> *similar and different* in the same way). So perhaps the solution
> should be similar -- silently interpret the setting of the new GUC to
> resolve the contradiction.

More concretely, maybe the new GUC is forced to be 1.05 of
vacuum_freeze_table_age. Of course that scheme is a bit arbitrary --
but so is the existing 0.95 scheme.

There may be some value in picking a scheme that "advertises" that all
three GUCs are symmetrical, or at least related -- all three divide up
the table's XID space.

-- 
Peter Geoghegan




Re: Disable WAL logging to speed up data loading

2021-03-22 Thread Laurenz Albe
On Mon, 2021-03-22 at 13:22 -0400, Stephen Frost wrote:
> * Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> > On Mon, 2021-03-22 at 11:05 -0400, Stephen Frost wrote:
> > > > Perhaps allowing to set unlogged tables to logged ones without writing 
> > > > WAL
> > > > is a more elegant way to do that, but I cannot see how that would be any
> > > > more crash safe than this patch.  Besides, the feature doesn't exist 
> > > > yet.
> > > 
> > > I'm not suggesting it's somehow more crash safe- but it's at least very
> > > clear what happens in such a case, to wit: the entire table is cleared
> > > on crash recovery.
> > 
> > I didn't look at the patch, but are you saying that after changing the
> > table to LOGGED, it somehow remembers that it is not crash safe and is
> > emptied if there is a crash before the next checkpoint?
> 
> I'm not sure where the confusion is, but certainly once the table has
> been turned into LOGGED and that's been committed then it should be
> crash safe, even under 'minimal' with the optimization to avoid actually
> copying the entire table into the WAL.  I don't see any particular
> reason why that isn't possible to do.

The confusion is probably caused by my ignorance.
If the actual data files of the table are forced out to disk on COMMIT,
that should indeed work.  And if you fsync a file that has not been changed
since it was last persisted, that should be cheap.

So if I got that right, I agree with you that that would be a much better
way to achieve the goal than wal_level = 'none'.

The only drawback is that we don't have that feature yet...

Yours,
Laurenz Albe





Re: New IndexAM API controlling index vacuum strategies

2021-03-22 Thread Peter Geoghegan
On Mon, Mar 22, 2021 at 6:41 PM Masahiko Sawada  wrote:
> But we're not sure when the next anti-wraparound vacuum will take
> place. Since the table is already vacuumed by a non-aggressive vacuum
> with index cleanup disabled, an autovacuum will process the table
> when the table gets modified enough or the table's relfrozenxid gets
> older than autovacuum_freeze_max_age. If the new threshold, probably a
> new GUC, is much lower than autovacuum_freeze_max_age and
> vacuum_freeze_table_age, the table is continuously vacuumed without
> advancing relfrozenxid, leading to unnecessary index bloat. Given that
> the new threshold is for emergency purposes (i.e., advancing
> relfrozenxid faster), I think it might be better to use
> vacuum_freeze_table_age as the lower bound of the new threshold. What
> do you think?

As you know, when the user sets vacuum_freeze_table_age to a value
that is greater than the value of autovacuum_freeze_max_age, the two
GUCs have values that are contradictory. This contradiction is
resolved inside vacuum_set_xid_limits(), which knows that it should
"interpret" the value of vacuum_freeze_table_age as
(autovacuum_freeze_max_age * 0.95) to paper over the user's error.
This 0.95 behavior is documented in the user docs, though it happens
silently.

You seem to be concerned about a similar contradiction. In fact it's a
*very* similar contradiction, because this new GUC is naturally a
"sibling GUC" of both vacuum_freeze_table_age and
autovacuum_freeze_max_age (the "units" are the same, though the
behavior that each GUC triggers is different -- but
vacuum_freeze_table_age and autovacuum_freeze_max_age are both already
*similar and different* in the same way). So perhaps the solution
should be similar -- silently interpret the setting of the new GUC to
resolve the contradiction.

(Maybe I should say "these two new GUCs"? A MultiXact variant might be needed...)

This approach has the following advantages:

* It follows precedent.

* It establishes that the new GUC is a logical extension of the
existing vacuum_freeze_table_age and autovacuum_freeze_max_age GUCs.

* The default value for the new GUC will be so much higher (say 1.8
billion XIDs) than even the default of autovacuum_freeze_max_age that
it won't disrupt anybody's existing postgresql.conf setup.

* For the same reason (the big space between autovacuum_freeze_max_age
and the new GUC with default settings), you can almost set the new GUC
without needing to know about autovacuum_freeze_max_age.

* The overall behavior isn't actually restrictive/paternalistic. That
is, if you know what you're doing (say you're testing the feature) you
can reduce all 3 sibling GUCs to 0 and get the testing behavior that
you desire.

What do you think?

-- 
Peter Geoghegan




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2021-03-22 Thread Justin Pryzby
On Mon, Mar 22, 2021 at 08:18:56PM -0700, Zhihong Yu wrote:
> with data_dest_cb callback. It is used for send text representation of a
> tuple to a custom destination.
> 
> send text -> sending text

I would say "It is used to send the text representation ..."

> struct PgFdwModifyState *aux_fmstate;   /* foreign-insert state, if
>  * created */
> +   CopyToState cstate; /* foreign COPY state, if used */
> 
> Since foreign COPY is optional, should cstate be a pointer? That would be
> in line with aux_fmstate.

It's actually a pointer:
src/include/commands/copy.h:typedef struct CopyToStateData *CopyToState;

There are many data structures like this, where a structure is typedefed with
a "Data" suffix and the pointer is typedefed without the "Data".

-- 
Justin




Re: shared memory stats: high level design decisions: consistency, dropping

2021-03-22 Thread Greg Stark
On Sun, 21 Mar 2021 at 18:16, Stephen Frost  wrote:
>
> Greetings,
>
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > I also believe that the snapshotting behavior has advantages in terms
> > of being able to perform multiple successive queries and get consistent
> > results from them.  Only the most trivial sorts of analysis don't need
> > that.
> >
> > In short, what you are proposing sounds absolutely disastrous for
> > usability of the stats views, and I for one will not sign off on it
> > being acceptable.
> >
> > I do think we could relax the consistency guarantees a little bit,
> > perhaps along the lines of only caching view rows that have already
> > been read, rather than grabbing everything up front.  But we can't
> > just toss the snapshot concept out the window.  It'd be like deciding
> > that nobody needs MVCC, or even any sort of repeatable read.
>
> This isn't the same use-case as traditional tables or relational
> concepts in general- there aren't any foreign keys for the fields that
> would actually be changing across these accesses to the shared memory
> stats- we're talking about gross stats numbers like the number of
> inserts into a table, not an employee_id column.  In short, I don't
> agree that this is a fair comparison.

I use these stats quite a bit and do lots of slicing and dicing with
them. I don't think it's as bad as Tom says, but I also don't think we
can be quite as loosey-goosey as I think Andres or Stephen might be
proposing either (though I note they haven't said they don't want any
consistency at all).

The case where consistency really matters for me is when I'm doing
math involving more than one statistic.

Typically that's ratios. E.g. with pg_stat_*_tables I routinely divide
seq_tup_read by seq_scan or idx_tup_* by idx_scans. I also often look
at the ratio between n_tup_upd and n_tup_hot_upd.

And no, it doesn't help that these are often large numbers after a
long time, because I'm actually working with the first derivative of
these numbers using snapshots or a time series database. So if you
have seq_tup_read incremented but not seq_scan, you could get a
wildly incorrect calculation of "tuples read per seq scan", which
actually matters.

I don't think I've ever done math across stats for different objects.
I mean, I've plotted them together and looked at which was higher but
I don't think that's affected by some plots having peaks slightly out
of sync with the other. I suppose you could look at the ratio of
access patterns between two tables and know that they're only ever
accessed by a single code path at the same time and therefore the
ratios would be meaningful. But I don't think users would be surprised
to find they're not consistent that way either.

So I think we need to ensure that at least all the values for a single
row representing a single object are consistent. Or at least that
there's *some* correct way to retrieve a consistent row and that the
standard views use that. I don't think we need to guarantee that every
possible plan will always be consistent even if you access the row
multiple times in a self-join or use the lookup function on individual
columns separately.

-- 
greg




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2021-03-22 Thread Zhihong Yu
Hi,
In the description:

with data_dest_cb callback. It is used for send text representation of a
tuple to a custom destination.

send text -> sending text

struct PgFdwModifyState *aux_fmstate;   /* foreign-insert state, if
 * created */
+   CopyToState cstate; /* foreign COPY state, if used */

Since foreign COPY is optional, should cstate be a pointer? That would be
in line with aux_fmstate.

Cheers

On Mon, Mar 22, 2021 at 7:02 PM tsunakawa.ta...@fujitsu.com <tsunakawa.ta...@fujitsu.com> wrote:

> From: Andrey Lepikhov 
> > The _() macro in the postgresExecForeignCopy routine:
> > if (PQputCopyEnd(conn, OK ? NULL : _("canceled by server")) <= 0)
> >
> > uses gettext. Under Linux it compiles OK, because (as I understood) it
> > uses the standard implementation of gettext:
> > objdump -t contrib/postgres_fdw/postgres_fdw.so | grep 'gettext'
> > gettext@@GLIBC_2.2.5
> >
> > but on macOS (and maybe elsewhere) we need to explicitly link the
> > libintl library in the Makefile:
> > SHLIB_LINK += $(filter -lintl, $(LIBS))
> >
> > Also, we may not need to use gettext at all in this part of the code.
>
I'm afraid so, because no extension in contrib/ has a po/ directory.  I just
removed _() and rebased the patch on HEAD.
>
>
> Regards
> TakayukiTsunakawa
>
>
>


RE: Disable WAL logging to speed up data loading

2021-03-22 Thread tsunakawa.ta...@fujitsu.com
From: Stephen Frost 
> First- what are you expecting would actually happen during crash recovery in
> this specific case with your proposed new WAL level?
...
> I'm not suggesting it's somehow more crash safe- but it's at least very clear
> what happens in such a case, to wit: the entire table is cleared on crash
> recovery.

As Laurenz-san kindly replied, the database server refuses to start with a 
clear message.  So, it's similarly very clear what happens.  The user will 
never unknowingly resume operation with possibly corrupt data.


> We're talking about two different ways to accomplish essentially the same
> thing- one which introduces a new WAL level, vs. one which adds an
> optimization for a WAL level we already have.  That the second is more elegant
> is more-or-less entirely the point I'm making here, so it seems pretty 
> relevant.

So, I understood the point boils down to elegance.  Could I ask what makes you 
feel ALTER TABLE UNLOGGED/LOGGED is (more) elegant?  I'm purely asking as a 
user.

(I don't want to digress, but if we consider the number of options for 
wal_level as an issue, I feel it's not elegant to have separate "replica" and 
"logical".)


> Under the proposed 'none', you basically have to throw out the entire cluster 
> on
> a crash, all because you don't want to use 'UNLOGGED' when you created the
> tables you want to load data into, or 'TRUNCATE' them in the transaction where
> you start the data load, either of which gives us enough indication and which
> we have infrastructure around dealing with in the event of a crash during the
> load without everything else having to be tossed and everything restored from 
> a
> backup.  That's both a better user experience from the perspective of having
> fewer WAL levels to understand and from just a general administration
> perspective so you don't have to go all the way back to a backup to bring the
> system back up.

The elegance of wal_level = none is that the user doesn't have to remember to
add ALTER TABLE to the data loading job when they add load target
tables/partitions.  If they build and use their own (shell) scripts to load
data, that won't be a burden or be forgotten.  But what would they have to do
when they use ETL tools like Talend, Pentaho, and Informatica Power Center?
Do those tools allow users to add custom processing like ALTER TABLE to the
data loading job steps for each table?  (AFAIK, they don't.)

wal_level = none is convenient and attractive for users who can back up and
restore the entire database instantly with a storage or filesystem snapshot 
feature.


Regards
TakayukiTsunakawa






Re: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.

2021-03-22 Thread Fujii Masao




On 2021/03/23 9:05, Masahiro Ikeda wrote:

Yes. I attached the v5 patch based on the v3 patch.
I renamed SignalHandlerForUnsafeExit() and fixed the following comment.


Thanks for updating the patch!

When the startup process exits because of recovery_target_action=shutdown,
reaper() calls TerminateChildren(SIGTERM). This function sends SIGTERM to
the stats collector. Currently the stats collector ignores SIGTERM, but with
the patch it exits normally. This change of behavior might be problematic.

That is, TerminateChildren(SIGTERM) sends SIGTERM to various processes.
But currently the stats collector and checkpointer don't exit even when
SIGTERM arrives because they ignore SIGTERM. After several processes
other than the stats collector and checkpointer exit by SIGTERM,
PostmasterStateMachine() and reaper() make the checkpointer exit and then
the stats collector. The shutdown terminates the processes in this order.

On the other hand, with the patch, the stats collector exits by SIGTERM
before the checkpointer exits. This is not the normal order in which
processes exit during shutdown.

To address this issue, one idea is to use SIGUSR2 for normal exit of the stats
collector, instead of SIGTERM. If we do this, TerminateChildren(SIGTERM)
cannot terminate the stats collector. Thought?

If we adopt this idea, a detailed comment about why SIGUSR2 is used for that
needs to be added.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: shared-memory based stats collector

2021-03-22 Thread Kyotaro Horiguchi
At Fri, 19 Mar 2021 14:27:38 -0700, Andres Freund  wrote in 
> Hi,
> 
> On 2021-03-10 20:26:56 -0800, Andres Freund wrote:
> > > +  * We expose this shared entry now.  You might think that the entry
> > > +  * can be removed by a concurrent backend, but since we are creating
> > > +  * an stats entry, the object actually exists and used in the upper
> > > +  * layer. Such an object cannot be dropped until the first vacuum
> > > +  * after the current transaction ends.
> > > +  */
> > > + dshash_release_lock(pgStatSharedHash, shhashent);
> >
> > I don't think you can safely release the lock before you incremented the
> > refcount?  What if, once the lock is released, somebody looks up that
> > entry, increments the refcount, and decrements it again? It'll see a
> > refcount of 0 at the end and decide to free the memory. Then the code
> > below will access already freed / reused memory, no?
> 
> Yep, it's not even particularly hard to hit:
>
> S0: CREATE TABLE a_table();
> S0: INSERT INTO a_table();
> S0: disconnect
> S1: set a breakpoint just after the dshash_release_lock(), with a
>     condition of "objid == a_table_oid"
> S1: SELECT pg_stat_get_live_tuples('a_table'::regclass);
>   (will break at above breakpoint, without having incremented the
>   refcount yet)
> S2: DROP TABLE a_table;
> S2: VACUUM pg_class;
> 
> At that point S2's call to pgstat_vacuum_stat() will find the shared
> stats entry for a_table, delete the entry from the shared hash table,
> see that the stats data has a zero refcount, and free it. Once S1 wakes
> up it'll use already freed (and potentially since reused) memory.

Sorry for the delay.  You're right. I actually see a permanent block when
continuing to run S1 after the vacuum.  That happens at LWLockRelease
on the freed block.

Moving the refcount bump before the dshash_release_lock call fixes
that.  One issue with doing that was that get_stat_entry() had a path for
the case where pgStatCacheContext is not available, which is already
dead code. After the early lock release is removed, the comment is no
longer needed either.

While working on this, I noticed that the previous diff
v56-57-func-diff.txt was slightly stale (missing a later bug fix).  So
the attached contains a fix on the amendment patch.

Please find the attached.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3f546afe6a..3e2b90e92b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -884,6 +884,8 @@ get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create,
 
 							create, nowait, create, &shfound);
 	if (shhashent)
 	{
+		bool		lofound;
+
 		if (create && !shfound)
 		{
 			/* Create new stats entry. */
@@ -900,38 +902,27 @@ get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create,
 		else
 			shheader = dsa_get_address(area, shhashent->body);
 
-		/*
-		 * We expose this shared entry now.  You might think that the entry
-		 * can be removed by a concurrent backend, but since we are creating
-		 * an stats entry, the object actually exists and used in the upper
-		 * layer. Such an object cannot be dropped until the first vacuum
-		 * after the current transaction ends.
-		 */
+		pg_atomic_add_fetch_u32(&shheader->refcount, 1);
 		dshash_release_lock(pgStatSharedHash, shhashent);
 
-		/* register to local hash if possible */
-		if (pgStatEntHash || pgStatCacheContext)
+		/* Create local hash if not yet */
+		if (pgStatEntHash == NULL)
 		{
-			bool		lofound;
+			Assert(pgStatCacheContext);
 
-			if (pgStatEntHash == NULL)
-			{
-				pgStatEntHash =
-					pgstat_localhash_create(pgStatCacheContext,
-											PGSTAT_TABLE_HASH_SIZE, NULL);
-				pgStatEntHashAge =
-					pg_atomic_read_u64(&StatsShmem->gc_count);
-			}
-
-			lohashent =
-				pgstat_localhash_insert(pgStatEntHash, key, &lofound);
-
-			Assert(!lofound);
-			lohashent->body = shheader;
-			lohashent->dsapointer = shhashent->body;
-
-			pg_atomic_add_fetch_u32(&shheader->refcount, 1);
+			pgStatEntHash =
+				
Re: Key management with tests

2021-03-22 Thread Bruce Momjian
On Mon, Mar 22, 2021 at 08:38:37PM -0400, Bruce Momjian wrote:
> > This particular patch (introducing the RelationIsPermanent() macro)
> > seems like it'd be a nice thing to commit independently of the rest,
> > reducing the size of this patch set..? 
> 
> Committed as suggested.

Also, I have written a short presentation on where I think we are with
cluster file encryption:

https://momjian.us/main/writings/pgsql/cfe.pdf

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Handling of opckeytype / CREATE OPERATOR CLASS (bug?)

2021-03-22 Thread Tom Lane
Tomas Vondra  writes:
> while working on the new BRIN opclasses in [1], I stumbled on something
> I think is actually a bug in how CREATE OPERATOR CLASS handles the
> storage type.

Hm.  Both catalogs.sgml and pg_opclass.h say specifically that
opckeytype should be zero if it's to be the same as the input
column type.  I don't think just dropping the enforcement of that
is the right answer.

I don't recall for sure what-all might depend on that.  I suspect
that it's mostly for the benefit of polymorphic opclasses, so
maybe the right thing is to say that the opckeytype can be
polymorphic if opcintype is, and then we resolve it as per
the usual polymorphism rules.

In any case, it's fairly suspicious that the only opclasses
violating the existing rule are johnny-come-lately BRIN opclasses.

regards, tom lane




RE: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2021-03-22 Thread kuroda.hay...@fujitsu.com
Dear Andrei,

I think the idea is good, because the pg_stat_statements_info view cannot
distinguish whether a specific statement has been deallocated or not.
But calling GetCurrentTimestamp() multiple times may cause some performance
issues. How about adding a configuration parameter for controlling this
feature? Or do we not have to worry about that?
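
(One way to keep the overhead low, as a minimal sketch: take the timestamp
only when a statement's hash entry is first created, rather than on every
execution. The entry layout and field name below are assumptions about the
patch, not its actual code:)

```
/* Sketch only: record first_seen once, at entry creation time. */
entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_ENTER, &found);
if (!found)
	entry->first_seen = GetCurrentTimestamp();	/* one clock read per new entry */
```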


> + if (api_version >= PGSS_V1_9)
> + {
> + values[i++] = TimestampTzGetDatum(first_seen);
> + }

I think {} is not needed here.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



RE: [POC] Fast COPY FROM command for the table with foreign partitions

2021-03-22 Thread tsunakawa.ta...@fujitsu.com
From: Andrey Lepikhov 
> The _() macro in the postgresExecForeignCopy routine:
> if (PQputCopyEnd(conn, OK ? NULL : _("canceled by server")) <= 0)
> 
> uses gettext. Under Linux it compiles OK, because (as I understood) it
> uses the standard implementation of gettext:
> objdump -t contrib/postgres_fdw/postgres_fdw.so | grep 'gettext'
> gettext@@GLIBC_2.2.5
> 
> but on macOS (and maybe elsewhere) we need to explicitly link the
> libintl library in the Makefile:
> SHLIB_LINK += $(filter -lintl, $(LIBS))
> 
> Also, we may not need to use gettext at all in this part of the code.

I'm afraid so, because no extension in contrib/ has a po/ directory.  I just
removed _() and rebased the patch on HEAD.


Regards
TakayukiTsunakawa




v21-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch
Description:  v21-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch


Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)

2021-03-22 Thread Thomas Munro
On Tue, Mar 23, 2021 at 2:44 PM Fujii Masao  wrote:
> I found that the 0001 patch was committed in de829ddf23, which added the new
> wait event WalrcvExit. This name seems inconsistent with other wait
> events. I'm thinking it's better to rename it to WalReceiverExit. Thought?
> Patch attached.

Agreed, your names are better.  Thanks.




Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)

2021-03-22 Thread Fujii Masao



On 2021/03/02 10:10, Thomas Munro wrote:

On Tue, Mar 2, 2021 at 12:00 AM Thomas Munro  wrote:

On Mon, Nov 16, 2020 at 8:56 PM Michael Paquier  wrote:

No objections with the two changes from pg_usleep() to WaitLatch() so
they could be applied separately first.


I thought about committing that first part, and got as far as
splitting the patch into two (see attached), but then I re-read
Fujii-san's message about the speed of promotion and realised that we
really should have something like a condition variable for walRcvState
changes.  I'll look into that.


Here's an experimental attempt at that, though I'm not sure if it's
the right approach.  Of course it's not necessary to use condition
variables here: we could use recoveryWakeupLatch, because we're not in
any doubt about who needs to be woken up.  But then you could get
constant wakeups while recovery is paused, unless you also suppressed
that somehow.  You could use the startup process's procLatch,
advertised in shmem, but that's almost a condition variable.  With a
condition variable, you get to name it something like
walRcvStateChanged, and then the programming rule is very clear: if
you change walRcvState, you need to broadcast that fact (and you don't
have to worry about who might be interested).  One question I haven't
got to the bottom of: is it a problem for the startup process that CVs
use CHECK_FOR_INTERRUPTS()?


I found that the 0001 patch was committed in de829ddf23, which added the new
wait event WalrcvExit. This name seems inconsistent with other wait
events. I'm thinking it's better to rename it to WalReceiverExit. Thought?
Patch attached.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 19540206f9..43c07da20e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1763,8 +1763,8 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
replication.
  
  
-  WalrcvExit
-  Waiting for the walreceiver to exit.
+  WalReceiverExit
+  Waiting for the WAL receiver to exit.
  
  
   WalReceiverWaitStart
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b7af7c2707..60f45ccc4e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4121,8 +4121,8 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
-   case WAIT_EVENT_WALRCV_EXIT:
-   event_name = "WalrcvExit";
+   case WAIT_EVENT_WAL_RECEIVER_EXIT:
+   event_name = "WalReceiverExit";
break;
case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
event_name = "WalReceiverWaitStart";
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index fff6c54c45..6f0acbfdef 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -224,7 +224,7 @@ ShutdownWalRcv(void)
 	ConditionVariablePrepareToSleep(&WalRcv->walRcvStoppedCV);
 	while (WalRcvRunning())
 		ConditionVariableSleep(&WalRcv->walRcvStoppedCV,
-							   WAIT_EVENT_WALRCV_EXIT);
+							   WAIT_EVENT_WAL_RECEIVER_EXIT);
ConditionVariableCancelSleep();
 }
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2c82313550..87672e6f30 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1008,7 +1008,7 @@ typedef enum
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
-   WAIT_EVENT_WALRCV_EXIT,
+   WAIT_EVENT_WAL_RECEIVER_EXIT,
WAIT_EVENT_WAL_RECEIVER_WAIT_START,
WAIT_EVENT_XACT_GROUP_UPDATE
 } WaitEventIPC;


Re: New IndexAM API controlling index vacuum strategies

2021-03-22 Thread Masahiko Sawada
On Fri, Mar 19, 2021 at 3:36 AM Peter Geoghegan  wrote:
>
> On Thu, Mar 18, 2021 at 3:32 AM Masahiko Sawada  wrote:
> > If we have the constant threshold of 1 billion transactions, a vacuum
> > operation might not be an anti-wraparound vacuum, and might not even be an
> > aggressive vacuum, depending on the autovacuum_freeze_max_age value. Given
> > the purpose of skipping index vacuuming in this case, I think it
> > doesn't make sense to have a non-aggressive vacuum skip index vacuuming,
> > since it might not be able to advance relfrozenxid. If we have a
> > constant threshold, 2 billion transactions (the maximum value of
> > autovacuum_freeze_max_age) seems to work.
>
> I like the idea of not making the behavior a special thing that only
> happens with a certain variety of VACUUM operation (non-aggressive or
> anti-wraparound VACUUMs). Just having a very high threshold should be
> enough.
>
> Even if we're not going to be able to advance relfrozenxid, we'll
> still finish much earlier and let a new anti-wraparound vacuum take
> place that will do that -- and will be able to reuse much of the work
> of the original VACUUM. Of course this anti-wraparound vacuum will
> also skip index vacuuming from the start (whereas the first VACUUM may
> well have done some index vacuuming before deciding to end index
> vacuuming to hurry with finishing).

But we're not sure when the next anti-wraparound vacuum will take
place. Since the table has already been vacuumed by a non-aggressive
vacuum with index cleanup disabled, autovacuum will process the table
again only when it gets modified enough or its relfrozenxid gets
older than autovacuum_freeze_max_age. If the new threshold, probably a
new GUC, is much lower than autovacuum_freeze_max_age and
vacuum_freeze_table_age, the table is continuously vacuumed without
advancing relfrozenxid, leading to unnecessary index bloat. Given
that the new threshold is for emergency purposes (i.e., advancing
relfrozenxid faster), I think it might be better to use
vacuum_freeze_table_age as the lower bound of the new threshold. What
do you think?
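To make the suggestion concrete, the lower bound could be applied like this (a sketch only; the GUC name skip_index_vacuum_age is made up for illustration):

/*
 * Clamp a hypothetical new GUC so the effective threshold never drops
 * below vacuum_freeze_table_age; any VACUUM it triggers is then
 * aggressive enough to be able to advance relfrozenxid.
 */
static int
effective_skip_index_threshold(void)
{
	if (skip_index_vacuum_age < vacuum_freeze_table_age)
		return vacuum_freeze_table_age;
	return skip_index_vacuum_age;
}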

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Handling of opckeytype / CREATE OPERATOR CLASS (bug?)

2021-03-22 Thread Tomas Vondra
Hi,

while working on the new BRIN opclasses in [1], I stumbled on something
I think is actually a bug in how CREATE OPERATOR CLASS handles the
storage type. If you look at built-in opclasses, there's a bunch of
cases where (opcintype == opckeytype), for example the BRIN opclasses
for inet data type:

test=# select oid, opcname, opcfamily, opcintype, opckeytype from
pg_opclass where opcintype = 869 order by oid;

  oid  |      opcname       | opcfamily | opcintype | opckeytype
-------+--------------------+-----------+-----------+------------
 10105 | inet_minmax_ops    |      4075 |       869 |        869
 10106 | inet_inclusion_ops |      4102 |       869 |        869

The fact that opckeytype is set is important, because this allows the
opclass to handle data with cidr data type - there's no opclass for
cidr, but we can do this:

create table t (a cidr);
insert into t values ('127.0.0.1');
insert into t values ('127.0.0.2');
insert into t values ('127.0.0.3');
create index on t using brin (a inet_minmax_ops);

This seems to work fine. Now, if you manually update the opckeytype to
0, you'll get this:

test=# update pg_opclass set opckeytype = 0 where oid = 10105;
UPDATE 1
test=# create index on t using brin (a inet_minmax_ops);
ERROR:  missing operator 1(650,650) in opfamily 4075


Unfortunately, it turns out it's impossible to re-create this opclass
using CREATE OPERATOR CLASS, because opclasscmds.c does this:

/* Just drop the spec if same as column datatype */
if (storageoid == typeoid)
    storageoid = InvalidOid;

So the storageoid is reset to 0. This only works for the built-in
opclasses because those are injected directly into the catalogs.

Either CREATE OPERATOR CLASS is not considering something before
resetting the storageoid, or maybe it should be reset for all opclasses
(including the built-in ones) and the code that's using it should
restore it in those cases (AFAICS opckeytype=0 means it's the same as
opcintype).

Attached is a PoC patch doing the first thing - this does the trick for
me, but I'm not 100% sure it's the right fix.


[1]
https://www.postgresql.org/message-id/54b47e66-bd8a-d44a-2090-fd4b2f49abe6%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>From 4ee5e2bbd0c44c82d1604836db5352f3e0e0fbf2 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Mon, 22 Mar 2021 20:36:59 +0100
Subject: [PATCH 1/5] fixup opclass storage type

---
 src/backend/commands/opclasscmds.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/src/backend/commands/opclasscmds.c b/src/backend/commands/opclasscmds.c
index fad39e2b75..78b2d69782 100644
--- a/src/backend/commands/opclasscmds.c
+++ b/src/backend/commands/opclasscmds.c
@@ -576,10 +576,7 @@ DefineOpClass(CreateOpClassStmt *stmt)
 	 */
 	if (OidIsValid(storageoid))
 	{
-		/* Just drop the spec if same as column datatype */
-		if (storageoid == typeoid)
-			storageoid = InvalidOid;
-		else if (!amstorage)
+		if ((storageoid != typeoid) && (!amstorage))
 			ereport(ERROR,
 	(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 	 errmsg("storage type cannot be different from data type for access method \"%s\"",
-- 
2.30.2



Re: Type of wait events WalReceiverWaitStart and WalSenderWaitForWAL

2021-03-22 Thread Fujii Masao




On 2021/03/22 13:59, Fujii Masao wrote:


Ok, so barring any objection, I will commit the patch that I posted upthread.


Pushed!

I'm waiting for the other two patches to be reviewed :)

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Wrong statistics for size of XLOG_SWITCH during pg_waldump.

2021-03-22 Thread Fujii Masao




On 2021/03/22 17:49, Kyotaro Horiguchi wrote:

At Mon, 22 Mar 2021 14:07:43 +0900, Fujii Masao  
wrote in



On 2021/03/22 14:03, shinya11.k...@nttdata.com wrote:

Barring any objection, I will commit this.

I think it's good except for a typo "thoes four bits"


Thanks for the review! Attached is the updated version of the patch.


Thanks for picking it up. LGTM and applies cleanly.


Pushed. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-22 Thread Bruce Momjian
On Mon, Mar 22, 2021 at 05:17:15PM -0700, Zhihong Yu wrote:
> Hi,
> For queryjumble.c :
> 
> + * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
> 
> The year should be updated.
> Same with queryjumble.h

Thanks, fixed.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: create table like: ACCESS METHOD

2021-03-22 Thread Justin Pryzby
On Tue, Jan 19, 2021 at 03:03:31PM -0600, Justin Pryzby wrote:
> On Wed, Dec 30, 2020 at 12:33:56PM +, Simon Riggs wrote:
> > There are no tests for the new functionality, please could you add some?
> 
> Did you look at the most recent patch?
> 
> +CREATE ACCESS METHOD heapdup TYPE TABLE HANDLER heap_tableam_handler;
> +CREATE TABLE likeam() USING heapdup;
> +CREATE TABLE likeamlike(LIKE likeam INCLUDING ALL);  
> 
> Also, I just realized that Dilip's toast compression patch adds "INCLUDING
> COMPRESSION", which is stored in pg_am.  That's an implementation detail of
> that patch, but it's not intuitive that "including access method" wouldn't
> include the compression stored there.  So I think this should use "INCLUDING
> TABLE ACCESS METHOD" not just ACCESS METHOD.  

Since the TOAST patch ended up not using access methods after all, I renamed
this back to "like ACCESS METHOD" (without table).

For now, I left TableLikeOption un-alphabetized.

-- 
Justin
>From 2da8104a39f65f576d0f221c288502d27211a209 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 15 Nov 2020 16:54:53 -0600
Subject: [PATCH v4] create table (like .. including ACCESS METHOD)

---
 doc/src/sgml/ref/create_table.sgml  | 12 +++-
 src/backend/parser/gram.y   |  1 +
 src/backend/parser/parse_utilcmd.c  | 10 ++
 src/include/nodes/parsenodes.h  |  1 +
 src/test/regress/expected/create_table_like.out | 12 
 src/test/regress/sql/create_table_like.sql  |  8 
 6 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index c6c248f1e9..dd92fd0e9a 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -87,7 +87,7 @@ class="parameter">referential_action ] [ ON UPDATE and like_option is:
 
-{ INCLUDING | EXCLUDING } { COMMENTS | CONSTRAINTS | DEFAULTS | GENERATED | IDENTITY | INDEXES | STATISTICS | STORAGE | ALL }
+{ INCLUDING | EXCLUDING } { ACCESS METHOD | COMMENTS | CONSTRAINTS | DEFAULTS | GENERATED | IDENTITY | INDEXES | STATISTICS | STORAGE | ALL }
 
 and partition_bound_spec is:
 
@@ -613,6 +613,16 @@ WITH ( MODULUS numeric_literal, REM
   available options are:
 
   
+   
+INCLUDING ACCESS METHOD
+
+ 
+  The table's access method will be copied.  By default, the
+  default_table_access_method is used.
+ 
+
+   
+

 INCLUDING COMMENTS
 
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index bc43641ffe..383e0671af 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -3741,6 +3741,7 @@ TableLikeOption:
 | STATISTICS		{ $$ = CREATE_TABLE_LIKE_STATISTICS; }
 | STORAGE			{ $$ = CREATE_TABLE_LIKE_STORAGE; }
 | COMPRESSION		{ $$ = CREATE_TABLE_LIKE_COMPRESSION; }
+| ACCESS METHOD 	{ $$ = CREATE_TABLE_LIKE_ACCESS_METHOD; }
 | ALL{ $$ = CREATE_TABLE_LIKE_ALL; }
 		;
 
diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index aa6c19adad..07e18fa62f 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -97,6 +97,7 @@ typedef struct
 	bool		ispartitioned;	/* true if table is partitioned */
 	PartitionBoundSpec *partbound;	/* transformed FOR VALUES */
 	bool		ofType;			/* true if statement contains OF typename */
+	char		*accessMethod;	/* table access method */
 } CreateStmtContext;
 
 /* State shared by transformCreateSchemaStmt and its subroutines */
@@ -253,6 +254,7 @@ transformCreateStmt(CreateStmt *stmt, const char *queryString)
 	cxt.ispartitioned = stmt->partspec != NULL;
 	cxt.partbound = stmt->partbound;
 	cxt.ofType = (stmt->ofTypename != NULL);
+	cxt.accessMethod = NULL;
 
 	Assert(!stmt->ofTypename || !stmt->inhRelations);	/* grammar enforces */
 
@@ -347,6 +349,9 @@ transformCreateStmt(CreateStmt *stmt, const char *queryString)
 	stmt->tableElts = cxt.columns;
 	stmt->constraints = cxt.ckconstraints;
 
+	if (cxt.accessMethod != NULL)
+		stmt->accessMethod = cxt.accessMethod;
+
 	result = lappend(cxt.blist, stmt);
 	result = list_concat(result, cxt.alist);
 	result = list_concat(result, save_alist);
@@ -1137,6 +1142,11 @@ transformTableLikeClause(CreateStmtContext *cxt, TableLikeClause *table_like_cla
 		cxt->likeclauses = lappend(cxt->likeclauses, table_like_clause);
 	}
 
+	/* ACCESS METHOD doesn't apply and isn't copied for partitioned tables */
+	if ((table_like_clause->options & CREATE_TABLE_LIKE_ACCESS_METHOD) != 0 &&
+		!cxt->ispartitioned)
+		cxt->accessMethod = get_am_name(relation->rd_rel->relam);
+
 	/*
 	 * We may copy extended statistics if requested, since the representation
 	 * of CreateStatsStmt doesn't 

Re: Key management with tests

2021-03-22 Thread Bruce Momjian
On Thu, Mar 18, 2021 at 11:31:34AM -0400, Stephen Frost wrote:
> >  src/backend/access/gist/gistutil.c   |  2 +-
> >  src/backend/access/heap/heapam_handler.c |  2 +-
> >  src/backend/catalog/pg_publication.c |  2 +-
> >  src/backend/commands/tablecmds.c | 10 +-
> >  src/backend/optimizer/util/plancat.c |  3 +--
> >  src/backend/utils/cache/relcache.c   |  2 +-
> >  src/include/utils/rel.h  | 10 --
> >  src/include/utils/snapmgr.h  |  3 +--
> >  8 files changed, 19 insertions(+), 15 deletions(-)
> 
> This particular patch (introducing the RelationIsPermanent() macro)
> seems like it'd be a nice thing to commit independently of the rest,
> reducing the size of this patch set..? 

Committed as suggested.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: shared-memory based stats collector

2021-03-22 Thread Andres Freund
Hi,

On 2021-03-22 16:17:37 -0700, Andres Freund wrote:
> and to reduce unnecessary diff noise

This patch has just tons of stuff like:

-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* --
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * --

and

 typedef struct PgStat_StatTabEntry
 {
-Oid tableid;
+/* Persistent data follow */
+TimestampTz vacuum_timestamp;   /* user initiated vacuum */
+TimestampTz autovac_vacuum_timestamp;   /* autovacuum initiated */
+TimestampTz analyze_timestamp;  /* user initiated */
+TimestampTz autovac_analyze_timestamp;  /* autovacuum initiated */
 
 PgStat_Counter numscans;
 
@@ -773,103 +352,31 @@ typedef struct PgStat_StatTabEntry
 PgStat_Counter blocks_fetched;
 PgStat_Counter blocks_hit;
 
-TimestampTz vacuum_timestamp;   /* user initiated vacuum */
 PgStat_Counter vacuum_count;
-TimestampTz autovac_vacuum_timestamp;   /* autovacuum initiated */
 PgStat_Counter autovac_vacuum_count;
-TimestampTz analyze_timestamp;  /* user initiated */
 PgStat_Counter analyze_count;
-TimestampTz autovac_analyze_timestamp;  /* autovacuum initiated */
 PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;

and changes like s/PgStat_WalStats/PgStat_Wal/.


I can't really discern the pattern of what was changed and what
wasn't.

Mixing this into a 300kb patch really adds a bunch of work.

Greetings,

Andres Freund




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-22 Thread Zhihong Yu
Hi,
For queryjumble.c :

+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group

The year should be updated.
Same with queryjumble.h

Cheers

On Mon, Mar 22, 2021 at 2:56 PM Bruce Momjian  wrote:

> On Sat, Mar 20, 2021 at 02:12:34PM +0800, Julien Rouhaud wrote:
> > On Fri, Mar 19, 2021 at 12:53:18PM -0400, Bruce Momjian wrote:
> > >
> > > Well, given we don't really want to support multiple query id types
> > > being generated or displayed, the "error out" above should fix it.
> > >
> > > Let's do this --- tell extensions to error out if the query id is
> > > already set, either by compute_query_id or another extension.  If an
> > extension wants to generate its own query id and store it internal to
> > > the extension, that is fine, but the server-displayed query id should
> be
> > > generated once and never overwritten by an extension.
> >
> > Agreed, this will ensure that you won't dynamically change the queryid
> source.
> >
> > We should also document that changing it requires a restart and calling
> > pg_stat_statements_reset() afterwards.
> >
> > v19 adds some changes, plus extra documentation for pg_stat_statements
> about
> > the requirement for a queryid to be calculated, and a note that all
> documented
> > details only apply for in-core source.  I'm not sure if this is still
> the best
> > place to document those details anymore though.
>
> OK, after reading the entire thread, I don't think there are any
> remaining open issues with this patch and I think this is ready for
> committing.  I have adjusted the doc section of the patches, attached.
> I have marked myself as committer in the commitfest app and hope to
> apply it in the next few days based on feedback.
>
> --
>   Bruce Momjian  https://momjian.us
>   EDB  https://enterprisedb.com
>
>   If only the physical world exists, free will is an illusion.
>
>


Re: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.

2021-03-22 Thread Masahiro Ikeda

On 2021/03/22 23:59, Fujii Masao wrote:
> 
> 
> On 2021/03/20 13:40, Masahiro Ikeda wrote:
>> Sorry, there is no evidence we should call it.
>> I thought that calling FreeWaitEventSet() after using
>> CreateWaitEventSet() is good manners, and the performance impact of
>> calling it seems to be very small.
>>
>> (Please let me know if my understanding is wrong.)
>> The reason I called this good manners is that I thought there is no
>> problem even without calling FreeWaitEventSet() before the process exits,
>> regardless of fast or immediate shutdown. Since the "WaitEventSet *wes" is
>> a process-local resource, it will be released by the OS even if
>> FreeWaitEventSet() is not called.
>>
>> I'm now changing my mind: skipping it is better because the code can be
>> simpler, and it's unimportant for the above reason, especially when an
>> immediate shutdown is requested, which is bad manners itself.
> 
> Thanks for explaining this! So you're thinking that v3 patch should be chosen?
> Probably some review comments I posted upthread need to be applied to
> v3 patch, e.g., rename of SignalHandlerForUnsafeExit() function.

Yes. I attached the v5 patch based on v3 patch.
I renamed SignalHandlerForUnsafeExit() and fixed the following comment.

> "SIGTERM or SIGQUIT of our parent postmaster" should be
> "SIGTERM, SIGQUIT, or detect ungraceful death of our parent
> postmaster"?


>> BTW, SysLoggerMain() creates a WaitEventSet but doesn't call
>> FreeWaitEventSet().
>> Is it better to call FreeWaitEventSet() before exiting there too?
> 
> Yes, if it could cause an actual issue. Otherwise I don't have a strong
> reason to do that for now.

OK. AFAIK, this doesn't lead to critical problems like a memory leak.

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION
diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index dd9136a942..8621212eda 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -64,9 +64,27 @@ SignalHandlerForConfigReload(SIGNAL_ARGS)
 }
 
 /*
- * Simple signal handler for exiting quickly as if due to a crash.
+ * Simple signal handler for processes which have not yet touched or do not
+ * touch shared memory to exit quickly.
  *
- * Normally, this would be used for handling SIGQUIT.
+ * If processes already touched shared memory, call
+ * SignalHandlerForCrashExit() because shared memory may be corrupted.
+ */
+void
+SignalHandlerForNonCrashExit(SIGNAL_ARGS)
+{
+	/*
+	 * Since we don't touch shared memory, we can just pull the plug and exit
+	 * without running any atexit handlers.
+	 */
+	_exit(1);
+}
+
+/*
+ * Simple signal handler for processes which have touched shared memory to
+ * exit quickly.
+ *
+ * Normally, this would be used for handling SIGQUIT as if due to a crash.
  */
 void
 SignalHandlerForCrashExit(SIGNAL_ARGS)
@@ -93,9 +111,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS)
  * shut down and exit.
  *
  * Typically, this handler would be used for SIGTERM, but some processes use
- * other signals. In particular, the checkpointer exits on SIGUSR2, the
- * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT
- * or SIGTERM.
+ * other signals. In particular, the checkpointer exits on SIGUSR2 and the
+ * WAL writer exits on either SIGINT or SIGTERM.
  *
  * ShutdownRequestPending should be checked at a convenient place within the
  * main loop, or else the main loop should call HandleMainLoopInterrupts.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 208a33692f..44c471937b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -724,6 +724,7 @@ pgstat_reset_remove_files(const char *directory)
 
 		snprintf(fname, sizeof(fname), "%s/%s", directory,
  entry->d_name);
+		elog(DEBUG2, "removing stats file \"%s\"", fname);
 		unlink(fname);
 	}
 	FreeDir(dir);
@@ -4821,13 +4822,19 @@ PgstatCollectorMain(int argc, char *argv[])
 
 	/*
 	 * Ignore all signals usually bound to some action in the postmaster,
-	 * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-	 * support latch operations, because we only use a local latch.
+	 * except SIGHUP, SIGTERM and SIGQUIT.  Note we don't need a SIGUSR1
+	 * handler to support latch operations, because we only use a local latch.
 	 */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGINT, SIG_IGN);
-	pqsignal(SIGTERM, SIG_IGN);
-	pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
+	pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+
+	/*
+	 * If SIGQUIT is received, exit quickly without doing any additional work,
+	 * for example writing stats files. We arrange to do _exit(1) because the
+	 * stats collector doesn't touch shared memory.
+	 */
+	pqsignal(SIGQUIT, SignalHandlerForNonCrashExit);
 	pqsignal(SIGALRM, SIG_IGN);
 	pqsignal(SIGPIPE, SIG_IGN);
 	pqsignal(SIGUSR1, SIG_IGN);
@@ -4852,8 +4859,8 @@ PgstatCollectorMain(int argc, char *argv[])
 	AddWaitEventToSet(wes, 

Re: [HACKERS] Custom compression methods

2021-03-22 Thread Tom Lane
Justin Pryzby  writes:
> On Mon, Mar 22, 2021 at 03:47:58PM -0400, Robert Haas wrote:
>> Are you sure you're looking at the patch I sent,
>> toast-compression-guc-rmh.patch? I can't help wondering if you applied
>> it to a dirty source tree or got the wrong file or something, because
>> otherwise I don't understand why you're seeing things that I'm not
>> seeing.

> I'm guessing Tom read this hunk as being changes to
> check_default_toast_compression() rather than removing the function ?

Yeah, after looking closer, the diff looks like
check_default_toast_compression is being modified in-place,
whereas actually it's getting replaced by CompressionNameToMethod
which does something entirely different.  I'd also not looked
closely enough at where NO_LZ4_SUPPORT() was being moved to.
My apologies --- I can only plead -ENOCAFFEINE.

regards, tom lane




Re: [HACKERS] logical decoding of two-phase transactions

2021-03-22 Thread Peter Smith
On Mon, Mar 22, 2021 at 11:51 PM Amit Kapila  wrote:
>
> I have incorporated all your changes and additionally made few more
> changes (a) got rid of LogicalRepBeginPrepareData and instead used
> LogicalRepPreparedTxnData, (b) made a number of changes in comments
> and docs, (c) ran pgindent, (d) modified tests to use standard
> wait_for_catch function and removed few tests to reduce the time and
> to keep regression tests reliable.

I checked all v65* / v66* differences and found only two trivial comment typos.

PSA patches to fix those.


Kind Regards,
Peter Smith.
Fujitsu Australia


v66-0001-Feedback-apply_handle_rollback_prepared-typo-in-.patch
Description: Binary data


v66-0002-Feedback-AlterSubscription-typo-in-comment.patch
Description: Binary data


Re: proposal - psql - use pager for \watch command

2021-03-22 Thread Thomas Munro
On Sun, Mar 21, 2021 at 11:43 PM Pavel Stehule  wrote:
> On Sat, Mar 20, 2021 at 11:45 PM Thomas Munro wrote:
>> Thoughts?  I put my changes into a separate patch for clarity, but
>> they need some more tidying up.
>
> yes, your solution is much better.

Hmm, there was a problem with it though: it blocked ^C while running
the query, which is bad.  I fixed that.   I did some polishing of the
code and some editing on the documentation and comments.  I disabled
the feature completely on Windows, because it seems unlikely that
we'll be able to know if it even works, in this cycle.

-   output = PageOutput(158, pager ? &(pset.popt.topt) : NULL);
+   output = PageOutput(160, pager ? &(pset.popt.topt) : NULL);

What is that change for?
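In outline, the sigwait()-based timing loop described in the patch below works like this (a standalone POSIX sketch, not the patch code itself):

/*
 * Periodic loop driven by sigwait() instead of sleeping/polling:
 * accurate timing, no busy wakeups, and ^C is noticed immediately.
 */
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

int
main(void)
{
	sigset_t	mask;
	struct itimerval interval = {{2, 0}, {2, 0}};	/* fire every 2 seconds */
	int			sig;

	sigemptyset(&mask);
	sigaddset(&mask, SIGALRM);
	sigaddset(&mask, SIGINT);

	/* Block the signals so sigwait() can consume them without races. */
	sigprocmask(SIG_BLOCK, &mask, NULL);
	setitimer(ITIMER_REAL, &interval, NULL);

	for (;;)
	{
		if (sigwait(&mask, &sig) != 0)
			break;
		if (sig == SIGINT)
			break;				/* react to ^C immediately */
		puts("tick: re-run the query here");
	}
	return 0;
}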
From cf1d23b92e9a7a97d60abdf4c5a8fbe9d9092788 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 20 Mar 2021 21:54:27 +1300
Subject: [PATCH v4] Add PSQL_WATCH_PAGER for psql's \watch command.

Allow a pager to be used by the \watch command, which runs a query
periodically.  It works but isn't very useful with traditional pagers
like "less", so use a different evironment variable to control it.
Pagers that understand the output format of psql such as the popular
open source tool "pspg" (written by the main author of this patch) can
use this to show the results in a user-friendly format.  Example:
PSQL_WATCH_PAGER="pspg --stream".

To make \watch react quickly when the user quits the pager or presses
^C, and also to increase the accuracy of its timing and decrease the
rate of useless context switches, change the main loop of the \watch
command to use sigwait() rather than a sleeping/polling loop, on Unix.

Supported on Unix only for now (like pspg).

Author: Pavel Stehule 
Author: Thomas Munro 
Discussion: https://postgr.es/m/CAFj8pRBfzUUPz-3gN5oAzto9SDuRSq-TQPfXU_P6h0L7hO%2BEhg%40mail.gmail.com
---
 doc/src/sgml/ref/psql-ref.sgml |  28 +++
 src/bin/psql/command.c | 133 ++---
 src/bin/psql/common.c  |  11 ++-
 src/bin/psql/common.h  |   2 +-
 src/bin/psql/help.c|   6 +-
 src/bin/psql/startup.c |  14 
 6 files changed, 179 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 01ec9b8b0a..dfa7bf17a9 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -2993,6 +2993,16 @@ lo_import 152801
   (such as more) is used.
   
 
+  
+  When using the \watch command to execute a query
+  repeatedly, the environment variable PSQL_WATCH_PAGER
+  is used to find the pager program instead, on Unix systems.  This is
+  configured separately because it may confuse traditional pagers, but
+  can be used to send output to tools that understand
+  psql's output format (such as
+  pspg --stream).
+  
+
   
   When the pager option is off, the pager
   program is not used. When the pager option is
@@ -4663,6 +4673,24 @@ PSQL_EDITOR_LINENUMBER_ARG='--line '
 

 
+   
+PSQL_WATCH_PAGER
+
+
+ 
+  When a query is executed repeatedly with the \watch
+  command, a pager is not used by default.  This behavior can be changed
+  by setting PSQL_WATCH_PAGER to a pager command, on Unix
+  systems.  The pspg pager (not part of
+  PostgreSQL but available in many open source
+  software distributions) can display the output of
+  \watch if started with the option
+  --stream.
+ 
+
+
+   
+

 PSQLRC
 
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 838f74..bfdc5a8121 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -13,6 +13,7 @@
 #include 
 #ifndef WIN32
 #include <sys/stat.h>			/* for stat() */
+#include <sys/time.h>			/* for setitimer() */
 #include <fcntl.h>				/* open() flags */
 #include <unistd.h>				/* for geteuid(), getpid(), stat() */
 #else
@@ -4787,8 +4788,17 @@ do_watch(PQExpBuffer query_buf, double sleep)
 	const char *strftime_fmt;
 	const char *user_title;
 	char	   *title;
+	const char *pagerprog = NULL;
+	FILE	   *pagerpipe = NULL;
 	int			title_len;
 	int			res = 0;
+#ifndef WIN32
+	sigset_t	sigalrm_sigchld_sigint;
+	sigset_t	sigalrm_sigchld;
+	sigset_t	sigint;
+	struct itimerval interval;
+	bool		done = false;
+#endif
 
 	if (!query_buf || query_buf->len <= 0)
 	{
@@ -4796,6 +4806,58 @@ do_watch(PQExpBuffer query_buf, double sleep)
 		return false;
 	}
 
+#ifndef WIN32
+	sigemptyset(&sigalrm_sigchld_sigint);
+	sigaddset(&sigalrm_sigchld_sigint, SIGCHLD);
+	sigaddset(&sigalrm_sigchld_sigint, SIGALRM);
+	sigaddset(&sigalrm_sigchld_sigint, SIGINT);
+
+	sigemptyset(&sigalrm_sigchld);
+	sigaddset(&sigalrm_sigchld, SIGCHLD);
+	sigaddset(&sigalrm_sigchld, SIGALRM);
+
+	sigemptyset(&sigint);
+	sigaddset(&sigint, SIGINT);
+
+	/*
+	 * Block SIGALRM and SIGCHLD before we start the timer and the pager (if
+	 * configured), to avoid races.  sigwait() will receive them.
+	 */
+	sigprocmask(SIG_BLOCK, &sigalrm_sigchld, NULL);
+

Re: pg_upgrade failing for 200+ million Large Objects

2021-03-22 Thread Jan Wieck

On 3/22/21 5:36 PM, Zhihong Yu wrote:

Hi,

w.r.t. pg_upgrade_improvements.v2.diff.

+       blobBatchCount = 0;
+       blobInXact = false;

The count and bool flag are always reset in tandem. It seems 
variable blobInXact is not needed.


You are right. I will fix that.
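For illustration, a counter-only version might look like this (a sketch with hypothetical pg_dump-style names; blobBatchCount > 0 then by itself implies an open transaction):

/*
 * The batch counter alone encodes whether a transaction is open, so
 * no separate bool is needed.  BLOB_BATCH_SIZE and dumpOneBlob() are
 * made-up names for this sketch.
 */
#define BLOB_BATCH_SIZE 1000

if (blobBatchCount == 0)
	ExecuteSqlStatement(fout, "BEGIN");	/* first blob of a batch opens the xact */

dumpOneBlob(fout, blobOid);				/* hypothetical per-blob work */

if (++blobBatchCount >= BLOB_BATCH_SIZE)
{
	ExecuteSqlStatement(fout, "COMMIT");
	blobBatchCount = 0;					/* count == 0 <=> no open transaction */
}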


Thanks, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services




Re: Allow an alias to be attached directly to a JOIN ... USING

2021-03-22 Thread Tom Lane
Peter Eisentraut  writes:
> I think Tom's input on the guts of this patch would be most valuable, 
> since it intersects a lot with the parse namespace refactoring he did.

I really didn't like the way you'd done that :-(.  My primary complaint
is that any one ParseNamespaceItem can describe only one table alias,
but here we have the potential for two aliases associated with the same
join:

select * from (t1 join t2 using(a) as tu) tx;

Admittedly that's not hugely useful since tx hides the tu alias, but
it should behave in a sane fashion.  (BTW, after reading the SQL spec
again along the way to reviewing this, I am wondering if hiding the
lower aliases is really what we want; though it may be decades too late
to change that.)

However, ParseNamespaceItem as it stands needs some help for this.
It has a wired-in assumption that p_rte->eref describes the table
and column aliases exposed by the nsitem.  0001 below fixes this by
creating a separate p_names field in an nsitem.  (There are some
comments in 0001 referencing JOIN USING aliases, but no actual code
for the feature.)  That saves one indirection in common code paths,
so it's possibly a win on its own.  Then 0002 is your patch rebased
onto that infrastructure, and with some cleanup of my own.

One thing I ran into is that a whole-row Var for the JOIN USING
alias did the wrong thing.  It should have only the common columns,
but we were getting all the join columns in examples such as the
row_to_json() test case I added.  This is difficult to fix given
the existing whole-row Var infrastructure, unless we want to make a
separate RTE for the JOIN USING alias, which I think is overkill.
What I did about this was to make transformWholeRowRef produce a
ROW() construct --- which is something that a whole-row Var for a
join would be turned into by the planner anyway.  I think this is
semantically OK since the USING construct has already nailed down
the number and types of the join's common columns; there's no
prospect of those changing underneath a stored view query.  It's
slightly ugly because the ROW() construct will be visible in a
decompiled view instead of "tu.*" like you wrote originally,
but I'm willing to live with that.
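Schematically, the transformWholeRowRef change amounts to building a RowExpr over just the common columns instead of a whole-row Var (a sketch; the list variables are placeholders for whatever the parser has at hand):

/*
 * Expand "tu.*" into ROW(...) over the join's USING columns only.
 */
RowExpr    *rowexpr = makeNode(RowExpr);

rowexpr->args = common_column_vars;		/* Vars for the USING columns */
rowexpr->row_typeid = RECORDOID;		/* anonymous record type */
rowexpr->row_format = COERCE_IMPLICIT_CAST;
rowexpr->colnames = common_column_names;
rowexpr->location = location;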

Speaking of decompiled views, I feel like ruleutils.c could do with
a little more work to teach it that these aliases are available.
Right now, it resorts to ugly workarounds:

regression=# create table t1 (a int, b int, c int);
CREATE TABLE
regression=# create table t2 (a int, x int, y int);
CREATE TABLE
regression=# create view vvv as select tj.a, t1.b from t1 full join t2 using(a) 
as tj, t1 as tx;
CREATE VIEW
regression=# \d+ vvv
 View "public.vvv"
 Column |  Type   | Collation | Nullable | Default | Storage | Description 
+-+---+--+-+-+-
 a  | integer |   |  | | plain   | 
 b  | integer |   |  | | plain   | 
View definition:
 SELECT a,
t1.b
   FROM t1
 FULL JOIN t2 USING (a) AS tj,
t1 tx(a_1, b, c);

That's not wrong, but it could likely be done better if ruleutils
realized it could use the tj alias to reference the column, instead
of having to force unqualified "a" to be a globally unique name.

I ran out of steam to look into that, though, and it's probably
something that could be improved later.

One other cosmetic thing is that this:

regression=# select tu.* from (t1 join t2 using(a) as tu) tx;
ERROR:  missing FROM-clause entry for table "tu"
LINE 1: select tu.* from (t1 join t2 using(a) as tu) tx;
   ^

is a relatively dumb error message, compared to

regression=# select t1.* from (t1 join t2 using(a) as tu) tx;
ERROR:  invalid reference to FROM-clause entry for table "t1"
LINE 1: select t1.* from (t1 join t2 using(a) as tu) tx;
   ^
HINT:  There is an entry for table "t1", but it cannot be referenced from this 
part of the query.

I didn't look into why that isn't working, but maybe errorMissingRTE
needs to trawl all of the ParseNamespaceItems not just the RTEs.

Anyway, since these remaining gripes are cosmetic, I'll mark this RFC.

regards, tom lane

diff --git a/src/backend/parser/parse_clause.c b/src/backend/parser/parse_clause.c
index bdf8ec46e2..5dfea46021 100644
--- a/src/backend/parser/parse_clause.c
+++ b/src/backend/parser/parse_clause.c
@@ -1217,9 +1217,9 @@ transformFromClauseItem(ParseState *pstate, Node *n,
 		 * input column numbers more easily.
 		 */
 		l_nscolumns = l_nsitem->p_nscolumns;
-		l_colnames = l_nsitem->p_rte->eref->colnames;
+		l_colnames = l_nsitem->p_names->colnames;
 		r_nscolumns = r_nsitem->p_nscolumns;
-		r_colnames = r_nsitem->p_rte->eref->colnames;
+		r_colnames = r_nsitem->p_names->colnames;
 
 		/*
 		 * Natural join does not explicitly specify columns; must generate
@@ -1469,7 +1469,7 @@ transformFromClauseItem(ParseState *pstate, Node *n,
 		 * Now that we know the join 

Re: shared-memory based stats collector

2021-03-22 Thread Andres Freund
Hi,

On 2021-03-22 12:02:39 +0900, Kyotaro Horiguchi wrote:
> At Mon, 22 Mar 2021 09:55:59 +0900 (JST), Kyotaro Horiguchi 
>  wrote in
> > At Thu, 18 Mar 2021 01:47:20 -0700, Andres Freund  
> > wrote in
> > > Hi,
> > >
> > > On 2021-03-18 16:56:02 +0900, Kyotaro Horiguchi wrote:
> > > > At Tue, 16 Mar 2021 10:27:55 +0900 (JST), Kyotaro Horiguchi 
> > > >  wrote in
> > > > Rebased and fixed two bugs.  Not addressed received comments in this
> > > > version.
> > >
> > > Since I am heavily editing the code, could you submit "functional"
> > > changes (as opposed to fixing rebase issues) as incremental patches?
> >
> > Oh.. please wait for.. a moment, maybe.
>
> This is that.

Thanks! That change shouldn't be necessary on my branch - I did
something to fix this kind of problem too. I decided that there's no
point in doing hash table lookups for the database: It's not going to
change in the life of a backend. So there's now two static "pending"
entries: One for the current DB, one for the shared DB.  There's only
one place that needed to change,
pgstat_report_checksum_failures_in_db(), which now reports the changes
directly instead of going via pending.

I suspect we should actually do that with a number of other DB specific
functions. Things like recovery conflicts, deadlocks, checksum failures
imo should really not be delayed till later. And you should never have
enough of them to make contention a concern.


You can see a somewhat sensible list of changes from your v52 at
https://github.com/anarazel/postgres/compare/master...shmstat-before-split-2021-03-22
(I did fix some of damage from rebase in a non-incremental way, of course)

My branch: https://github.com/anarazel/postgres/tree/shmstat

It would be cool if you could check if there any relevant things between
v52 and v56 that I should include.


I think a lot of the concerns I had with the patch are addressed at the
end of my series of changes.  Please let me know what you think.


My next step is going to be to squash all my changes into the base
patch, and try to extract all the things that I think can be
independently committed, and to reduce unnecessary diff noise.  Once
that's done I plan to post that series to the list.


TODO:

- explain the design at the top of pgstat.c

- figure out a way to deal with the different demands on stats
  consistency / efficiency

- see how hard it'd be to not need collect_oids()

- split pgstat.c

- consider removing PgStatTypes and replacing it with the oid of the
  table the type of stats reside in. So PGSTAT_TYPE_DB would be
  DatabaseRelationId, PGSTAT_TYPE_TABLE would be RelationRelationId, ...

  I think that'd make the system more cleanly extensible going forward?

- I'm not yet happy with the naming schemes in use in pgstat.c. I feel
  like I improved it a bunch, but it's not yet there.

- the replication slot stuff isn't quite right in my branch

- I still don't quite like the reset_offset stuff - I wonder if we can
  find something better there. And if not, whether we can deduplicate
  the code between functions like pgstat_fetch_stat_checkpointer() and
  pgstat_report_checkpointer().

  At the very least it'll need a lot better comments.

- bunch of FIXMEs / XXXs

Greetings,

Andres Freund




Re: [HACKERS] Custom compression methods

2021-03-22 Thread Justin Pryzby
On Mon, Mar 22, 2021 at 03:47:58PM -0400, Robert Haas wrote:
> On Mon, Mar 22, 2021 at 2:10 PM Tom Lane  wrote:
> > Robert Haas  writes:
> > > I think this is significantly cleaner than what we have now, and I
> > > also prefer it to your proposal.
> >
> > +1 in general.  However, I suspect that you did not try to compile
> > this without --with-lz4, because if you had you'd have noticed the
> > other uses of NO_LZ4_SUPPORT() that you broke.  I think you need
> > to leave that macro where it is.
> 
> You're correct that I hadn't tried this without --with-lz4, but I did
> grep for other uses of NO_LZ4_SUPPORT() and found none. I also just
> tried it without --with-lz4 just now, and it worked fine.
> 
> > Also, it's not nice for GUC check
> > functions to throw ereport(ERROR); we prefer the caller to be able
> > to decide if it's a hard error or not.  That usage should be using
> > GUC_check_errdetail() or a cousin, so it can't share the macro anyway.
> 
> I agree that these are valid points about GUC check functions in
> general, but the patch I sent adds 0 GUC check functions and removes
> 1, and it didn't do the stuff you describe here anyway.
> 
> Are you sure you're looking at the patch I sent,
> toast-compression-guc-rmh.patch? I can't help wondering if you applied
> it to a dirty source tree or got the wrong file or something, because
> otherwise I don't understand why you're seeing things that I'm not
> seeing.

I'm guessing Tom read this hunk as being changes to
check_default_toast_compression() rather than removing the function ?


- * Validate a new value for the default_toast_compression GUC.
+ * CompressionNameToMethod - Get compression method from compression name
+ *
+ * Search in the available built-in methods.  If the compression not found
+ * in the built-in methods then return InvalidCompressionMethod.
  */
-bool
-check_default_toast_compression(char **newval, void **extra, GucSource source)
+char
+CompressionNameToMethod(const char *compression)
 {
-   if (**newval == '\0')
+   if (strcmp(compression, "pglz") == 0)
+   return TOAST_PGLZ_COMPRESSION;
+   else if (strcmp(compression, "lz4") == 0)
{
-		GUC_check_errdetail("%s cannot be empty.",
-							"default_toast_compression");
-   return false;
+#ifndef USE_LZ4
+   NO_LZ4_SUPPORT();
+#endif
+   return TOAST_LZ4_COMPRESSION;

-- 
Justin




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-22 Thread Bruce Momjian
On Sat, Mar 20, 2021 at 02:12:34PM +0800, Julien Rouhaud wrote:
> On Fri, Mar 19, 2021 at 12:53:18PM -0400, Bruce Momjian wrote:
> > 
> > Well, given we don't really want to support multiple query id types
> > being generated or displayed, the "error out" above should fix it. 
> > 
> > Let's do this --- tell extensions to error out if the query id is
> > already set, either by compute_query_id or another extension.  If an
> extension wants to generate its own query id and store it internal to
> > the extension, that is fine, but the server-displayed query id should be
> > generated once and never overwritten by an extension.
> 
> Agreed, this will ensure that you won't dynamically change the queryid source.
> 
> We should also document that changing it requires a restart and calling
> pg_stat_statements_reset() afterwards.
> 
> v19 adds some changes, plus extra documentation for pg_stat_statements about
> the requirement for a queryid to be calculated, and a note that all documented
> details only apply for in-core source.  I'm not sure if this is still the best
> place to document those details anymore though.

OK, after reading the entire thread, I don't think there are any
remaining open issues with this patch and I think this is ready for
committing.  I have adjusted the doc section of the patches, attached. 
I have marked myself as committer in the commitfest app and hope to
apply it in the next few days based on feedback.
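For reference, the "error out" convention discussed above might look like this in a hypothetical extension's post_parse_analyze_hook (a sketch; the myext_* names are made up):

/*
 * Refuse to overwrite a query id that compute_query_id or another
 * extension already computed.
 */
static void
myext_post_parse_analyze(ParseState *pstate, Query *query)
{
	if (prev_post_parse_analyze_hook)
		prev_post_parse_analyze_hook(pstate, query);

	if (query->queryId != UINT64CONST(0))
		ereport(ERROR,
				(errmsg("query identifier is already set"),
				 errhint("Disable compute_query_id, or unload the other module computing query identifiers.")));

	query->queryId = myext_compute_query_id(pstate->p_sourcetext,
											query->stmt_location,
											query->stmt_len);
}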

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.

>From 5ab783123fc11e948963de7a7c3e6428051e3315 Mon Sep 17 00:00:00 2001
From: Bruce Momjian 
Date: Mon, 22 Mar 2021 17:43:22 -0400
Subject: [PATCH] qid-01-jumble_over_master squash commit

---
 .../pg_stat_statements/pg_stat_statements.c   | 805 +
 .../pg_stat_statements.conf   |   1 +
 doc/src/sgml/config.sgml  |  25 +
 doc/src/sgml/pgstatstatements.sgml|  20 +-
 src/backend/parser/analyze.c  |  14 +-
 src/backend/tcop/postgres.c   |   6 +-
 src/backend/utils/misc/Makefile   |   1 +
 src/backend/utils/misc/guc.c  |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/backend/utils/misc/queryjumble.c (new)| 834 ++
 src/include/parser/analyze.h  |   4 +-
 src/include/utils/guc.h   |   1 +
 src/include/utils/queryjumble.h (new) |  58 ++
 13 files changed, 995 insertions(+), 785 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 62cccbfa44..bd8c96728c 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -8,24 +8,9 @@
  * a shared hashtable.  (We track only as many distinct queries as will fit
  * in the designated amount of shared memory.)
  *
- * As of Postgres 9.2, this module normalizes query entries.  Normalization
- * is a process whereby similar queries, typically differing only in their
- * constants (though the exact rules are somewhat more subtle than that) are
- * recognized as equivalent, and are tracked as a single entry.  This is
- * particularly useful for non-prepared queries.
- *
- * Normalization is implemented by fingerprinting queries, selectively
- * serializing those fields of each query tree's nodes that are judged to be
- * essential to the query.  This is referred to as a query jumble.  This is
- * distinct from a regular serialization in that various extraneous
- * information is ignored as irrelevant or not essential to the query, such
- * as the collations of Vars and, most notably, the values of constants.
- *
- * This jumble is acquired at the end of parse analysis of each query, and
- * a 64-bit hash of it is stored into the query's Query.queryId field.
- * The server then copies this value around, making it available in plan
- * tree(s) generated from the query.  The executor can then use this value
- * to blame query costs on the proper queryId.
+ * Starting in Postgres 9.2, this module normalized query entries.  As of
+ * Postgres 14, the normalization is done by the core if compute_query_id is
+ * enabled, or optionally by third-party modules.
  *
  * To facilitate presenting entries to users, we create "representative" query
  * strings in which constants are replaced with parameter symbols ($n), to
@@ -114,8 +99,6 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100;
 #define USAGE_DEALLOC_PERCENT	5	/* free this % of entries at once */
 #define IS_STICKY(c)	((c.calls[PGSS_PLAN] + c.calls[PGSS_EXEC]) == 0)
 
-#define JUMBLE_SIZE1024	/* query serialization buffer size */
-
 /*
  * Extension version number, for supporting older extension versions' objects
  */
@@ -235,40 +218,6 @@ typedef struct pgssSharedState
 	

Re: pg_upgrade failing for 200+ million Large Objects

2021-03-22 Thread Zhihong Yu
>
> Hi,
>
w.r.t. pg_upgrade_improvements.v2.diff.

+   blobBatchCount = 0;
+   blobInXact = false;

The count and bool flag are always reset in tandem. It seems
variable blobInXact is not needed.

Cheers


Re: UPDATE ... SET (single_column) = row_constructor is a bit broken from V10 906bfcad7ba7c

2021-03-22 Thread Justin Pryzby
On Mon, Mar 22, 2021 at 02:10:49PM +0530, Mahendra Singh Thalor wrote:
> Hi Hackers,
> 
> Commit 906bfcad7ba7c improved handling for "UPDATE ... SET
> (column_list) = row_constructor", but it broke some cases that worked
> prior to this commit.
> After this commit, the query “DO UPDATE SET (t1_col)” gives an error,
> whereas it worked fine earlier.

See prior discussions:

https://www.postgresql.org/message-id/flat/20170719174507.GA19616%40telsasoft.com
https://www.postgresql.org/message-id/flat/camjna7cdlzpcs0xnrpkvqmj6vb6g3eh8cygp9zbjxdpfftz...@mail.gmail.com
https://www.postgresql.org/message-id/flat/87sh5rs74y@news-spur.riddles.org.uk
https://git.postgresql.org/gitweb/?p=postgresql.git=commit=86182b18957b8f9e8045d55b137aeef7c9af9916

-- 
Justin




Re: proposal - psql - use pager for \watch command

2021-03-22 Thread Thomas Munro
On Tue, Mar 23, 2021 at 1:53 AM Pavel Stehule  wrote:
> On Mon, Mar 22, 2021 at 1:13 PM Thomas Munro wrote:
>> The problem is that Apple's /dev/tty device is defective, and doesn't
>> work in poll().  It always returns immediately with revents=POLLNVAL,
>> but pspg assumes that data is ready and tries to read the keyboard and
>> then blocks until I press a key.  This seems to fix it:
>>
>> +#ifndef __APPLE__
>> +   /* macOS can't use poll() on /dev/tty */
>> state.tty = fopen("/dev/tty", "r+");
>> +#endif
>> if (!state.tty)
>> state.tty = fopen(ttyname(fileno(stdout)), "r");
>
>
> it is hell.

Heh.  I've recently spent many, many hours trying to make AIO work on
macOS, and nothing surprises me anymore.  BTW I found something from
years ago on the 'net that fits with my observation about /dev/tty:

https://www.mail-archive.com/bug-gnulib@gnu.org/msg00296.html
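For anyone wanting to reproduce it, a minimal standalone probe of poll() on /dev/tty looks like this (plain POSIX; on an affected macOS system it returns immediately with POLLNVAL, on Linux it waits for keyboard input or the timeout):

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>

int
main(void)
{
	struct pollfd pfd;
	int			rc;

	pfd.fd = open("/dev/tty", O_RDWR);
	pfd.events = POLLIN;

	rc = poll(&pfd, 1, 2000);	/* wait up to two seconds */
	printf("poll() = %d, revents = 0x%x%s\n", rc, pfd.revents,
		   (pfd.revents & POLLNVAL) ? " (POLLNVAL)" : "");
	return 0;
}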

Curious, which other OS did you put that fallback case in for?  I'm a
little confused about why it works, so I'm not sure if it's the best
possible change, but I'm not planning to dig any further now, too many
patches, not enough time :-)

> Please, can you verify this fix?

It works perfectly for me on a macOS 11.2 system with that change,
repainting the screen exactly when it should.  I'm happy about that
because (1) it means I can confirm that the proposed change to psql is
working correctly on the 3 Unixes I have access to, and (2) I am sure
that a lot of Mac users will appreciate being able to use super-duper
\watch mode when this ships (a high percentage of PostgreSQL users I
know use a Mac as their client machine).

>> A minor problem is that on macOS, _GNU_SOURCE doesn't seem to imply
>> NCURSES_WIDECHAR, so I suspect Unicode will be broken unless you
>> manually add -DNCURSES_WIDECHAR=1, though I didn't check.
>
> It is possible -
>
> can you run "pspg --version"

Looks like I misunderstood: it is showing "with wide char support",
it's just that the "num" is 0 rather than 1.  I'm not planning to
investigate that any further now, but I checked that it can show the
output of SELECT 'špeĉiäl chârãçtérs' correctly.




Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch

2021-03-22 Thread Peter Eisentraut


On 19.03.21 21:06, Tom Lane wrote:

I guess the immediate question is how much of a performance gap there
is now between the float and numeric implementations.


Attached are my test script and the full output.

To summarize, for cases that don't do any interesting computation and 
where the overhead is only the data type passing, the difference is like 
this:


-- old
select date_part('microseconds', current_timestamp + generate_series(0, 
1000) * interval '1 second') \g /dev/null

Time: 2760.966 ms (00:02.761)

-- new
select extract(microseconds from current_timestamp + generate_series(0, 
1000) * interval '1 second') \g /dev/null

Time: 3178.477 ms (00:03.178)
-- date
select date_part('day', current_date + generate_series(0, 1000) * interval 
'1 day') \g /dev/null
Time: 2753.082 ms (00:02.753)
select date_part('month', current_date + generate_series(0, 1000) * 
interval '1 day') \g /dev/null
Time: 2777.257 ms (00:02.777)
select date_part('quarter', current_date + generate_series(0, 1000) * 
interval '1 day') \g /dev/null
Time: 2788.313 ms (00:02.788)
select date_part('week', current_date + generate_series(0, 1000) * interval 
'1 day') \g /dev/null
Time: 2842.797 ms (00:02.843)
select date_part('year', current_date + generate_series(0, 1000) * interval 
'1 day') \g /dev/null
Time: 2848.463 ms (00:02.848)
select date_part('decade', current_date + generate_series(0, 1000) * 
interval '1 day') \g /dev/null
Time: 2844.086 ms (00:02.844)
select date_part('century', current_date + generate_series(0, 1000) * 
interval '1 day') \g /dev/null
Time: 2824.671 ms (00:02.825)
select date_part('millennium', current_date + generate_series(0, 1000) * 
interval '1 day') \g /dev/null
Time: 2846.615 ms (00:02.847)
select date_part('julian', current_date + generate_series(0, 1000) * 
interval '1 day') \g /dev/null
Time: 3011.378 ms (00:03.011)
select date_part('isoyear', current_date + generate_series(0, 1000) * 
interval '1 day') \g /dev/null
Time: 3059.134 ms (00:03.059)
select date_part('dow', current_date + generate_series(0, 1000) * interval 
'1 day') \g /dev/null
Time: 3014.891 ms (00:03.015)
select date_part('isodow', current_date + generate_series(0, 1000) * 
interval '1 day') \g /dev/null
Time: 3054.004 ms (00:03.054)
select date_part('doy', current_date + generate_series(0, 1000) * interval 
'1 day') \g /dev/null
Time: 3101.087 ms (00:03.101)
select date_part('epoch', current_date + generate_series(0, 1000) * 
interval '1 day') \g /dev/null
Time: 3156.491 ms (00:03.156)
select extract(day from current_date + generate_series(0, 1000) * interval 
'1 day') \g /dev/null
Time: 3303.320 ms (00:03.303)
select extract(month from current_date + generate_series(0, 1000) * 
interval '1 day') \g /dev/null
Time: 3289.939 ms (00:03.290)
select extract(quarter from current_date + generate_series(0, 1000) * 
interval '1 day') \g /dev/null
Time: 3301.930 ms (00:03.302)
select extract(week from current_date + generate_series(0, 1000) * interval 
'1 day') \g /dev/null
Time: 3347.644 ms (00:03.348)
select extract(year from current_date + generate_series(0, 1000) * interval 
'1 day') \g /dev/null
Time: 3358.538 ms (00:03.359)
select extract(decade from current_date + generate_series(0, 1000) * 
interval '1 day') \g /dev/null
Time: 3351.724 ms (00:03.352)
select extract(century from current_date + generate_series(0, 1000) * 
interval '1 day') \g /dev/null
Time: 3282.368 ms (00:03.282)
select extract(millennium from current_date + generate_series(0, 1000) * 
interval '1 day') \g /dev/null
Time: 3358.242 ms (00:03.358)
select extract(julian from current_date + generate_series(0, 1000) * 
interval '1 day') \g /dev/null
Time: 5758.773 ms (00:05.759) FIXME
select extract(isoyear from current_date + generate_series(0, 1000) * 
interval '1 day') \g /dev/null
Time: 3780.507 ms (00:03.781)
select extract(dow from current_date + generate_series(0, 1000) * interval 
'1 day') \g /dev/null
Time: 3688.659 ms (00:03.689)
select extract(isodow from current_date + generate_series(0, 1000) * 
interval '1 day') \g /dev/null
Time: 3716.656 ms (00:03.717)
select extract(doy from current_date + generate_series(0, 1000) * interval 
'1 day') \g /dev/null
Time: 3636.682 ms (00:03.637)
select extract(epoch from current_date + generate_series(0, 1000) * 
interval '1 day') \g /dev/null
Time: 6423.341 ms (00:06.423) FIXME
-- time
select date_part('microseconds', localtime + generate_series(0, 1000) * 
interval '1 second') \g /dev/null
Time: 2425.741 ms (00:02.426)
select date_part('milliseconds', localtime + generate_series(0, 1000) * 
interval '1 second') \g /dev/null
Time: 2790.781 ms (00:02.791)
select date_part('second', localtime + generate_series(0, 1000) * interval 
'1 second') \g /dev/null
Time: 2900.706 ms (00:02.901)
select date_part('minute', localtime + generate_series(0, 1000) * interval 
'1 second') \g /dev/null

Re: [HACKERS] Custom compression methods

2021-03-22 Thread Robert Haas
On Mon, Mar 22, 2021 at 4:33 PM Robert Haas  wrote:
> On Mon, Mar 22, 2021 at 1:58 PM Justin Pryzby  wrote:
> > guc.c should not longer define this as extern:
> > default_toast_compression_options
>
> Fixed.

Fixed some more.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v3-0001-Tidy-up-more-loose-ends-related-to-configurable-T.patch
Description: Binary data


Re: [HACKERS] Custom compression methods

2021-03-22 Thread Robert Haas
On Mon, Mar 22, 2021 at 1:58 PM Justin Pryzby  wrote:
> guc.c should not longer define this as extern:
> default_toast_compression_options

Fixed.

> I think you should comment that default_toast_compression is an int as far as
> guc.c is concerned, but storing one of the char value of TOAST_*_COMPRESSION

Done.

> Shouldn't varlena.c pg_column_compression() call GetCompressionMethodName () ?
> I guess it should already have done that.

It has a 0-3 integer, not a char value.

> Maybe pg_dump.c can't use those constants, though (?)

Hmm, toast_compression.h might actually be safe for frontend code now,
or if necessary we could add #ifdef FRONTEND stanzas to make it so. I
don't know if that is really this patch's job, but I guess it could
be.

A couple of other things:

- Upon further reflection, I think the NO_LZ4_SUPPORT() message is
kinda not great. I'm thinking we should change it to say "LZ4 is not
supported by this build" instead of "unsupported LZ4 compression
method" and drop the hint and detail. That seems more like how we've
handled other such cases.

- It is not very nice that the three possible values of attcompression
are TOAST_PGLZ_COMPRESSION, TOAST_LZ4_COMPRESSION, and
InvalidCompressionMethod. One of those three identifiers looks very
little like the other two, and there's no real good reason for that. I
think we should try to standardize on something, but I'm not sure what
it should be. It would also be nice if these names were more visually
distinct from the related but very different enum values
TOAST_PGLZ_COMPRESSION_ID and TOAST_LZ4_COMPRESSION_ID. Really, as the
comments I added explain, we want to minimize the amount of code that
knows about the 0-3 "ID" values, and use the char values whenever we
can.
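As a sketch of that idea, the translation from the char values to the on-disk IDs can be confined to a single helper at the boundary (names follow the patch under discussion, so treat them as illustrative):

/*
 * Only this helper needs to know the numeric IDs; all other code
 * deals in the char values.
 */
static inline ToastCompressionId
compression_method_to_id(char method)
{
	switch (method)
	{
		case TOAST_PGLZ_COMPRESSION:
			return TOAST_PGLZ_COMPRESSION_ID;
		case TOAST_LZ4_COMPRESSION:
			return TOAST_LZ4_COMPRESSION_ID;
		default:
			return TOAST_INVALID_COMPRESSION_ID;
	}
}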

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v2-0001-Tidy-up-more-loose-ends-related-to-configurable-T.patch
Description: Binary data


Re: postgres_fdw: IMPORT FOREIGN SCHEMA ... LIMIT TO (partition)

2021-03-22 Thread Bernd Helmle
On Thursday, 2021-01-21 at 15:56 +0100, Matthias van de Meent wrote:
> Hi,
> 
> Recently I was trying to copy some of the data of one database to
> another through postgres_fdw, and found that it wouldn't import that
> partition through IMPORT FOREIGN SCHEMA, even when I explicitly
> specified the name of the table that contained the data in the LIMIT
> TO clause.
> 
> I realised the reason is that currently, postgres_fdw explicitly
> disallows importing foreign partitions. This is a reasonable default
> when importing a whole schema, but if I wanted to explicitly import
> one of a partitioned tables' partitions, that would now require me to
> manually copy the foreign table's definitions through the use of
> CREATE FOREIGN TABLE, which is a hassle and prone to mistakes.
> 

Hi,

I took a look at this patch.

Patch applies on current master.

Documentation and adjusted regression tests are included.
Regression tests pass without errors.

The patch changes IMPORT FOREIGN SCHEMA to explicitly allow partition
child tables in the LIMIT TO clause of the IMPORT FOREIGN SCHEMA
command by relaxing the checks introduced with commit [1]. The reasons
behind [1] are discussed in [2].

The original behavior this patch wants to change was introduced
intentionally, so what needs to be discussed here is whether we want to
relax it a little. One argument for the original behavior was that it
is cleaner to just automatically import the parent, which allows access
to the children through the foreign table anyway, and to exclude
partition children when querying pg_class.

I haven't seen demand for the implemented feature here myself, but I
could imagine use cases where just a single child or a set of child
tables are candidates. For example, users might query only specific
children and want to import just those on another foreign server.


[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f49bcd4ef3e9a75de210357a4d9bbe3e004db956

[2]
https://www.postgresql.org/message-id/20170309141531.gd9...@tamriel.snowman.net


-- 
Thanks,
Bernd






Re: Autovacuum worker doesn't immediately exit on postmaster death

2021-03-22 Thread Robert Haas
On Mon, Mar 22, 2021 at 1:48 PM Stephen Frost  wrote:
> Thanks for that.  Attached is just a rebased version with a commit
> message added.  If there aren't any other concerns, I'll commit this in
> the next few days and back-patch it.  When it comes to 12 and older,
> does anyone want to opine about the wait event to use?  I was thinking
> PG_WAIT_TIMEOUT or WAIT_EVENT_PG_SLEEP ...

I'm not sure if we should back-patch this, but I think if you do you
should just add a wait event, rather than using a generic one.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [HACKERS] Custom compression methods

2021-03-22 Thread Robert Haas
On Mon, Mar 22, 2021 at 2:10 PM Tom Lane  wrote:
> Robert Haas  writes:
> > I think this is significantly cleaner than what we have now, and I
> > also prefer it to your proposal.
>
> +1 in general.  However, I suspect that you did not try to compile
> this without --with-lz4, because if you had you'd have noticed the
> other uses of NO_LZ4_SUPPORT() that you broke.  I think you need
> to leave that macro where it is.

You're correct that I hadn't tried this without --with-lz4, but I did
grep for other uses of NO_LZ4_SUPPORT() and found none. I also just
tried it without --with-lz4 just now, and it worked fine.

> Also, it's not nice for GUC check
> functions to throw ereport(ERROR); we prefer the caller to be able
> to decide if it's a hard error or not.  That usage should be using
> GUC_check_errdetail() or a cousin, so it can't share the macro anyway.

I agree that these are valid points about GUC check functions in
general, but the patch I sent adds 0 GUC check functions and removes
1, and it didn't do the stuff you describe here anyway.

Are you sure you're looking at the patch I sent,
toast-compression-guc-rmh.patch? I can't help wondering if you applied
it to a dirty source tree or got the wrong file or something, because
otherwise I don't understand why you're seeing things that I'm not
seeing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: shared-memory based stats collector

2021-03-22 Thread Andres Freund
Hi,

On 2021-03-19 14:27:38 -0700, Andres Freund wrote:
> Yep, it's not even particularly hard to hit:
>
> S0: CREATE TABLE a_table();
> S0: INSERT INTO a_table();
> S0: disconnect
> S1: set a breakpoint to just after the dshash_release_lock(), with an
>  if objid == a_table_oid
> S1: SELECT pg_stat_get_live_tuples('a_table'::regclass);
>   (will break at above breakpoint, without having incremented the
>   refcount yet)
> S2: DROP TABLE a_table;
> S2: VACUUM pg_class;
>
> At that point S2's call to pgstat_vacuum_stat() will find the shared
> stats entry for a_table, delete the entry from the shared hash table,
> see that the stats data has a zero refcount, and free it. Once S1 wakes
> up it'll use already freed (and potentially since reused) memory.

I fixed this by initializing / incrementing the refcount while holding the
dshash partition lock. To avoid the potential refcount leak in case the
lookup cache insertion fails due to OOM I changed things so that the
lookup cache is inserted, not just looked up, earlier. That also avoids
needing two hashtable ops in the cache miss case.  The price of an empty
hashtable entry in the !create case doesn't seem high.


Related issue: delete_current_stats_entry() there's the following
comment:

/*
 * Let the referrers drop the entry if any.  Refcount won't be decremented
 * until "dropped" is set true and StatsShmem->gc_count is incremented
 * later. So we can check refcount to set dropped without holding a lock.
 * If no one is referring this entry, free it immediately.
 */

I don't think this explanation is correct. gc_count might have been
incremented by another backend, or cleanup_dropped_stats_entries() might
run. So the whole bit about refcounts seems wrong.

I don't see what prevents a double-free here. Consider what happens if
S1: cleanup_dropped_stats_entries() does
    pg_atomic_sub_fetch_u32(&ent->shared->refcount, 1)
S2: delete_current_stats_entry() does pg_atomic_read_u32(&ent->refcount),
    reading 0
S1: dsa_free(area, ent->dsapointer);
S2: dsa_free(area, pdsa);
World: boom

I think the appropriate fix might be to not have ->dropped (or rather have it
just as a crosscheck), and have every non-dropped entry have an extra
refcount. When dropping the entry the refcount is dropped, and we can safely
free the entry.  That way normal paths don't need to check ->dropped at all.

Greetings,

Andres Freund




Re: Proposal: Save user's original authenticated identity for logging

2021-03-22 Thread Jacob Champion
On Mon, 2021-03-22 at 15:16 +0900, Michael Paquier wrote:
> On Fri, Mar 19, 2021 at 06:37:05PM +, Jacob Champion wrote:
> > The same effect can be had by moving the log rotation to the top of the
> > test that needs it, so I've done it that way in v7.
> 
> After thinking more about 0001, I have come up with an even simpler
> solution that has resulted in 11e1577.  That's similar to what
> PostgresNode::issues_sql_like() does.  This also makes 0003 simpler
> with its changes as this requires to change two lines in test_access.

v8's test_access lost the in-order log search from v7; I've added it
back in v9. The increased resistance to entropy seems worth the few
extra lines. Thoughts?

--Jacob
From 1939c94fe9d21b19a456ce99d03dfe8b4682d5af Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Mon, 8 Feb 2021 10:53:20 -0800
Subject: [PATCH v9 1/2] ssl: store client's DN in port->peer_dn

Original patch by Andrew Dunstan:

https://www.postgresql.org/message-id/fd96ae76-a8e3-ef8e-a642-a592f5b76771%40dunslane.net

but I've taken out the clientname=DN functionality; all that will be
needed for the next patch is the DN string.
---
 src/backend/libpq/be-secure-openssl.c | 53 +++
 src/include/libpq/libpq-be.h  |  1 +
 2 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 5ce3f27855..18321703da 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -551,22 +551,25 @@ aloop:
 	/* Get client certificate, if available. */
 	port->peer = SSL_get_peer_certificate(port->ssl);
 
-	/* and extract the Common Name from it. */
+	/* and extract the Common Name / Distinguished Name from it. */
 	port->peer_cn = NULL;
+	port->peer_dn = NULL;
 	port->peer_cert_valid = false;
 	if (port->peer != NULL)
 	{
 		int			len;
+		X509_NAME  *x509name = X509_get_subject_name(port->peer);
+		char	   *peer_cn;
+		char	   *peer_dn;
+		BIO		   *bio = NULL;
+		BUF_MEM*bio_buf = NULL;
 
-		len = X509_NAME_get_text_by_NID(X509_get_subject_name(port->peer),
-		NID_commonName, NULL, 0);
+		len = X509_NAME_get_text_by_NID(x509name, NID_commonName, NULL, 0);
 		if (len != -1)
 		{
-			char	   *peer_cn;
-
 			peer_cn = MemoryContextAlloc(TopMemoryContext, len + 1);
-			r = X509_NAME_get_text_by_NID(X509_get_subject_name(port->peer),
-		  NID_commonName, peer_cn, len + 1);
+			r = X509_NAME_get_text_by_NID(x509name, NID_commonName, peer_cn,
+		  len + 1);
 			peer_cn[len] = '\0';
 			if (r != len)
 			{
@@ -590,6 +593,36 @@ aloop:
 
 			port->peer_cn = peer_cn;
 		}
+
+		bio = BIO_new(BIO_s_mem());
+		if (!bio)
+		{
+			pfree(port->peer_cn);
+			port->peer_cn = NULL;
+			return -1;
+		}
+		/* use commas instead of slashes */
+		X509_NAME_print_ex(bio, x509name, 0, XN_FLAG_RFC2253);
+		BIO_get_mem_ptr(bio, &bio_buf);
+		peer_dn = MemoryContextAlloc(TopMemoryContext, bio_buf->length + 1);
+		memcpy(peer_dn, bio_buf->data, bio_buf->length);
+		peer_dn[bio_buf->length] = '\0';
+		if (bio_buf->length != strlen(peer_dn))
+		{
+			ereport(COMMERROR,
+	(errcode(ERRCODE_PROTOCOL_VIOLATION),
+	 errmsg("SSL certificate's distinguished name contains embedded null")));
+			BIO_free(bio);
+			pfree(peer_dn);
+			pfree(port->peer_cn);
+			port->peer_cn = NULL;
+			return -1;
+		}
+
+		BIO_free(bio);
+
+		port->peer_dn = peer_dn;
+
 		port->peer_cert_valid = true;
 	}
 
@@ -618,6 +651,12 @@ be_tls_close(Port *port)
 		pfree(port->peer_cn);
 		port->peer_cn = NULL;
 	}
+
+	if (port->peer_dn)
+	{
+		pfree(port->peer_dn);
+		port->peer_dn = NULL;
+	}
 }
 
 ssize_t
diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
index 30fb4e613d..d970277894 100644
--- a/src/include/libpq/libpq-be.h
+++ b/src/include/libpq/libpq-be.h
@@ -190,6 +190,7 @@ typedef struct Port
 	 */
 	bool		ssl_in_use;
 	char	   *peer_cn;
+	char	   *peer_dn;
 	bool		peer_cert_valid;
 
 	/*
-- 
2.25.1

From e315f806dc7895a8c1a51cec4d073c5d3ccea74b Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Wed, 3 Feb 2021 11:42:05 -0800
Subject: [PATCH v9 2/2] Log authenticated identity from all auth backends

The "authenticated identity" is the string used by an auth method to
identify a particular user. In many common cases this is the same as the
Postgres username, but for some third-party auth methods, the identifier
in use may be shortened or otherwise translated (e.g. through pg_ident
mappings) before the server stores it.

To help DBAs see who has actually interacted with the system, store the
original identity when authentication succeeds, and expose it via the
log_connections setting. The log entries look something like this
example (where a local user named "pchampion" is connecting to the
database as the "admin" user):

LOG:  connection received: host=[local]
LOG:  connection authenticated: identity="pchampion" method=peer (/data/pg_hba.conf:88)
LOG:  connection 

Re: Nicer error when connecting to standby with hot_standby=off

2021-03-22 Thread James Coleman
On Mon, Mar 22, 2021 at 2:52 PM Fujii Masao  wrote:
>
>
>
> On 2021/03/23 3:25, James Coleman wrote:
> > On Mon, Mar 22, 2021 at 1:49 PM Fujii Masao  
> > wrote:
> >>
> >>
> >>
> >> On 2021/03/19 23:35, James Coleman wrote:
> >>> Here's an updated patch; I think I've gotten what Alvaro suggested.
> >>
> >> Thanks for updating the patch! But I was thinking that our consensus is
> >> something like the attached patch. Could you check this patch?
> >
> > As far as I can tell (I might be missing something) your v5 patch does
> > the same thing, albeit with different code organization. It did catch
> > though that I'd neglected to add the DETAIL line as separate from the
> > errmsg line.
> >
> > Is the attached (in the style of my v4, since I'm not following why we
> > need to move the standby determination logic into a new
> > CAC_NOCONSISTENT block) what you're thinking? Or is there something
> > else I'm missing?
>
> I just did that to avoid adding more CAC_state. But basically it's
> ok to check hot standby at either canAcceptConnections() or
> ProcessStartupPacket().
>
> case CAC_STARTUP:
> ereport(FATAL,
> (errcode(ERRCODE_CANNOT_CONNECT_NOW),
> -errmsg("the database system is 
> starting up")));
> +errmsg("the database system is not 
> accepting connections"),
> +errdetail("Consistent recovery state 
> has not been yet reached.")));
>
> Do you want to report this message even in crash recovery? Since crash
> recovery is basically not so much related to "consistent recovery state",
> at least for me the original message seems more suitable for crash recovery.
>
> Also if we adopt this message, the server with hot_standby=off reports
> "Consistent recovery state has not been yet reached." in PM_STARTUP,
> but stops reporting this message at PM_RECOVERY even if the consistent
> recovery state has not been reached yet. Instead, it reports "Hot standby
> mode is disabled." at PM_RECOVERY. Isn't this transition of message confusing?

Are you saying we should only change the message for a single case:
the case where we'd otherwise allow connections but EnableHotStandby
is false? I believe that's what the original patch did, but then
Alvaro's proposal included changing additional messages.

James Coleman




Re: Nicer error when connecting to standby with hot_standby=off

2021-03-22 Thread Fujii Masao




On 2021/03/23 3:25, James Coleman wrote:

On Mon, Mar 22, 2021 at 1:49 PM Fujii Masao  wrote:




On 2021/03/19 23:35, James Coleman wrote:

Here's an updated patch; I think I've gotten what Alvaro suggested.


Thanks for updating the patch! But I was thinking that our consensus is
something like the attached patch. Could you check this patch?


As far as I can tell (I might be missing something) your v5 patch does
the same thing, albeit with different code organization. It did catch
though that I'd neglected to add the DETAIL line as separate from the
errmsg line.

Is the attached (in the style of my v4, since I'm not following why we
need to move the standby determination logic into a new
CAC_NOCONSISTENT block) what you're thinking? Or is there something
else I'm missing?


I just did that to avoid adding more CAC_state. But basically it's
ok to check hot standby at either canAcceptConnections() or
ProcessStartupPacket().

case CAC_STARTUP:
ereport(FATAL,
(errcode(ERRCODE_CANNOT_CONNECT_NOW),
-errmsg("the database system is starting 
up")));
+errmsg("the database system is not 
accepting connections"),
+errdetail("Consistent recovery state has 
not been yet reached.")));

Do you want to report this message even in crash recovery? Since crash
recovery is basically not so much related to "consistent recovery state",
at least for me the original message seems more suitable for crash recovery.

Also if we adopt this message, the server with hot_standby=off reports
"Consistent recovery state has not been yet reached." in PM_STARTUP,
but stops reporting this message at PM_RECOVERY even if the consistent
recovery state has not been reached yet. Instead, it reports "Hot standby
mode is disabled." at PM_RECOVERY. Isn't this transition of message confusing?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Proposal: Save user's original authenticated identity for logging

2021-03-22 Thread Jacob Champion
On Mon, 2021-03-22 at 18:22 +0100, Magnus Hagander wrote:
> On Mon, Mar 22, 2021 at 7:16 AM Michael Paquier  wrote:
> > 
> > I have briefly looked at 0002 (0001 in the attached set), and it seems
> > sane to me.  I still need to look at 0003 (well, now 0002) in details,
> > which is very sensible as one mistake would likely be a CVE-class
> > bug.
> 
> The 0002/0001/whateveritisaftertherebase is tracked over at
> https://www.postgresql.org/message-id/flat/92e70110-9273-d93c-5913-0bccb6562740%40dunslane.net
> isn't it? I've assumed the expectation is to have that one committed
> from that thread, and then rebase using that.

I think the primary thing that needs to be greenlit for both is the
idea of using the RFC 2253/4514 format for Subject DNs.

Other than that, the version here should only contain the changes
necessary for both features (that is, port->peer_dn), so there's no
hard dependency between the two. It's just on me to make sure my
version is up-to-date. Which I believe it is, as of today.

--Jacob


Re: [POC] Fast COPY FROM command for the table with foreign partitions

2021-03-22 Thread Andrey Lepikhov

On 5/3/21 21:54, tsunakawa.ta...@fujitsu.com wrote:

I've managed to rebased it, although it took unexpectedly long.  The patch is 
attached.  It passes make check against core and postgres_fdw.  I'll turn the 
CF status back to ready for committer shortly.


The _() macro in the postgresExecForeignCopy routine:

if (PQputCopyEnd(conn, OK ? NULL : _("canceled by server")) <= 0)

uses gettext. Under Linux it compiles OK, because (as I understand it)
it resolves to the standard glibc implementation of gettext:

objdump -t contrib/postgres_fdw/postgres_fdw.so | grep 'gettext'
gettext@@GLIBC_2.2.5

but on macOS (and maybe elsewhere) we need to explicitly link the
libintl library in the Makefile:

SHLIB_LINK += $(filter -lintl, $(LIBS))

Alternatively, we could avoid using gettext entirely in this part of the code.

--
regards,
Andrey Lepikhov
Postgres Professional




Re: New IndexAM API controlling index vacuum strategies

2021-03-22 Thread Peter Geoghegan
On Mon, Mar 22, 2021 at 7:05 AM Robert Haas  wrote:
> I agree. I was having trouble before understanding exactly what you
> are proposing, but this makes sense to me and I agree it's a good
> idea.

Our ambition is for this to be one big multi-release umbrella project,
with several individual enhancements that each deliver a user-visible
benefit on their own. The fact that we're talking about a few things
at once is confusing, but I think that you need a "grand bargain" kind
of discussion for this. I believe that it actually makes sense to do
it that way, difficult though it may be.

Sometimes the goal is to improve performance, other times the goal is
to improve robustness. Although the distinction gets blurry at the
margins. If VACUUM was infinitely fast (say because of sorcery), then
performance would be *unbeatable* -- plus we'd never have to worry
about anti-wraparound vacuums not completing in time!

> I'm not 100% sure whether we need a new GUC for this or not. I think
> that if by default this triggers at the 90% of the hard-shutdown
> limit, it would be unlikely, and perhaps unreasonable, for users to
> want to raise the limit. However, I wonder whether some users will
> want to lower the limit. Would it be reasonable for someone to want to
> trigger this at 50% or 70% of XID exhaustion rather than waiting until
> things get really bad?

I wanted to avoid inventing a GUC for the mechanism in the third patch
(not the emergency mechanism we're discussing right now, which was
posted separately by Masahiko). I think that a GUC to control skipping
index vacuuming purely because there are very few index tuples to
delete in indexes will become a burden before long. In particular, we
should eventually be able to vacuum some indexes but not others (on
the same table) based on the local needs of each index.

As I keep pointing out, bottom-up index deletion has created a
situation where there can be dramatically different needs among
indexes on the same table -- it can literally prevent 100% of all page
splits from version churn in those indexes that are never subject to
logical changes from non-HOT updates. Whereas bottom-up index
deletion does nothing for any index that is logically updated, for the
obvious reason -- there is now frequently a sharp qualitative
difference among indexes that vacuumlazy.c currently imagines have
basically the same needs. Vacuuming these remaining indexes is a cost
that users will actually understand and accept, too.

But that has nothing to do with the emergency mechanism we're talking
about right now. I actually like your idea of making the emergency
mechanism a GUC. It's equivalent to index_cleanup, except that it is
continuous and dynamic (not discrete and static). The fact that this
GUC expresses what VACUUM should do in terms of the age of the target
table's current relfrozenxid age (and nothing else) seems like exactly
the right thing. As I said before: What else could possibly matter? So
I don't see any risk of such a GUC becoming a burden. I also think
that it's a useful knob to be able to tune. It's also going to help a
lot with testing the feature. So let's have a GUC for that.
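
To make that concrete: the signal such a GUC would key off, and the
existing discrete/static knob, look like this today (a sketch; the
table name is made up):

    -- the relfrozenxid age the emergency mechanism would watch
    SELECT relname, age(relfrozenxid) FROM pg_class
    WHERE relname = 'a_table';

    -- the existing static equivalent, per VACUUM invocation
    VACUUM (INDEX_CLEANUP OFF) a_table;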

> Also, one thing that I dislike about the current system is that, from
> a user perspective, when something goes wrong, nothing happens for a
> while and then the whole system goes bananas. It seems desirable to me
> to find ways of gradually ratcheting up the pressure, like cranking up
> the effective cost limit if we can somehow figure out that we're not
> keeping up. If, with your mechanism, there's an abrupt point when we
> switch from never doing this for any table to always doing this for
> every table, that might not be as good as something which does this
> "sometimes" and then, if that isn't enough to avoid disaster, does it
> "more," and eventually ramps up to doing it always, if trouble
> continues. I don't know whether that's possible here, or what it would
> look like, or even whether it's appropriate at all in this particular
> case, so I just offer it as food for thought.

That is exactly the kind of thing that some future highly evolved
version of the broader incremental/dynamic VACUUM design might do.
Your thoughts about the effective delay/throttling lessening as
conditions change is in line with the stuff that we're thinking of
doing. Though I don't believe Masahiko and I have talked about the
delay stuff specifically in any of our private discussions about it.

I am a big believer in the idea that we should have a variety of
strategies that are applied incrementally and dynamically, in response
to an immediate local need (say at the index level). VACUUM should be
able to organically figure out the best strategy (or combination of
strategies) itself, over time. Sometimes it will be very important to
recognize that most indexes have been able to use techniques like
bottom-up index deletion, and so really don't need to be vacuumed at
all. Other times the cost delay 

Re: Nicer error when connecting to standby with hot_standby=off

2021-03-22 Thread James Coleman
On Mon, Mar 22, 2021 at 1:49 PM Fujii Masao  wrote:
>
>
>
> On 2021/03/19 23:35, James Coleman wrote:
> > Here's an updated patch; I think I've gotten what Alvaro suggested.
>
> Thanks for updating the patch! But I was thinking that our consensus is
> something like the attached patch. Could you check this patch?

As far as I can tell (I might be missing something) your v5 patch does
the same thing, albeit with different code organization. It did catch
though that I'd neglected to add the DETAIL line as separate from the
errmsg line.

Is the attached (in the style of my v4, since I'm not following why we
need to move the standby determination logic into a new
CAC_NOCONSISTENT block) what you're thinking? Or is there something
else I'm missing?

Thanks,
James


v6-0001-Improve-standby-connection-denied-error-message.patch
Description: Binary data


Re: [HACKERS] Custom compression methods

2021-03-22 Thread Tom Lane
Robert Haas  writes:
> I think this is significantly cleaner than what we have now, and I
> also prefer it to your proposal.

+1 in general.  However, I suspect that you did not try to compile
this without --with-lz4, because if you had you'd have noticed the
other uses of NO_LZ4_SUPPORT() that you broke.  I think you need
to leave that macro where it is.  Also, it's not nice for GUC check
functions to throw ereport(ERROR); we prefer the caller to be able
to decide if it's a hard error or not.  That usage should be using
GUC_check_errdetail() or a cousin, so it can't share the macro anyway.

regards, tom lane




Re: Fix pg_upgrade to preserve datdba

2021-03-22 Thread Jan Wieck

On 3/21/21 3:56 PM, Tom Lane wrote:

Jan Wieck  writes:
So let's focus on the actual problem of running out of XIDs and memory 
while doing the upgrade involving millions of small large objects.


Right.  So as far as --single-transaction vs. --create goes, that's
mostly a definitional problem.  As long as the contents of a DB are
restored in one transaction, it's not gonna matter if we eat one or
two more XIDs while creating the DB itself.  So we could either
relax pg_restore's complaint, or invent a different switch that's
named to acknowledge that it's not really only one transaction.

That still leaves us with the lots-o-locks problem.  However, once
we've crossed the Rubicon of "it's not really only one transaction",
you could imagine that the switch is "--fewer-transactions", and the
idea is for pg_restore to commit after every (say) 10 operations.
That would both bound its lock requirements and greatly cut its XID
consumption.


It leaves us with three things.

1) tremendous amounts of locks
2) tremendous amounts of memory needed
3) taking forever because it is single threaded.

I created a pathological case here on a VM with 24GB of RAM, 80GB of 
SWAP sitting on NVME. The database has 20 million large objects, each of 
which has 2 GRANTS, 1 COMMENT and 1 SECURITY LABEL (dummy). Each LO only 
contains a string "large object ", so the whole database in 9.5 is 
about 15GB in size.
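
(For anyone who wants to reproduce a scaled-down version of this, the
database can be generated with something like the following sketch --
counts reduced, and the SECURITY LABEL step omitted since it requires a
label provider to be loaded:)

    DO $$
    DECLARE
        loid oid;
    BEGIN
        FOR i IN 1..100000 LOOP
            -- each large object holds only a short string
            loid := lo_from_bytea(0, ('large object ' || i)::bytea);
            EXECUTE format('GRANT SELECT, UPDATE ON LARGE OBJECT %s TO PUBLIC', loid);
            EXECUTE format('COMMENT ON LARGE OBJECT %s IS ''dummy''', loid);
        END LOOP;
    END
    $$;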


A stock pg_upgrade to version 14devel using --link takes about 15 hours. 
This is partly because the pg_dump and pg_restore both grow to something 
like 50GB+ to hold the TOC. Which sounds out of touch considering that 
the entire system catalog on disk is less than 15GB. But aside from the 
ridiculous amount of swapping, the whole thing also suffers from 
consuming about 80 million transactions and apparently having just as 
many network round trips with a single client.




The work you described sounded like it could fit into that paradigm,
with the additional ability to run some parallel restore tasks
that are each consuming a bounded number of locks.


I have attached a POC patch that implements two new options for pg_upgrade.

  --restore-jobs=NUM --jobs parameter passed to pg_restore
  --restore-blob-batch-size=NUM  number of blobs restored in one xact

It does a bit more than just that. It rearranges the way large objects 
are dumped so that most of the commands are all in one TOC entry and the 
entry is emitted into SECTION_DATA when in binary upgrade mode (which 
guarantees that there isn't any actual BLOB data in the dump). This 
greatly reduces the number of network round trips and when using 8 
parallel restore jobs, almost saturates the 4-core VM. Reducing the 
number of TOC entries also reduces the total virtual memory need of 
pg_restore to 15G, so there is a lot less swapping going on.


It cuts down the pg_upgrade time from 15 hours to 1.5 hours. In that run 
I used --restore-jobs=8 and --restore-blob-batch-size=1 (with a 
max_locks_per_transaction=12000).


As said, this isn't a "one size fits all" solution. The pg_upgrade 
parameters for --jobs and --restore-jobs will really depend on the 
situation. Hundreds of small databases want --jobs, but one database 
with millions of large objects wants --restore-jobs.



Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services
diff --git a/src/bin/pg_dump/parallel.c b/src/bin/pg_dump/parallel.c
index c7351a4..4a611d0 100644
--- a/src/bin/pg_dump/parallel.c
+++ b/src/bin/pg_dump/parallel.c
@@ -864,6 +864,11 @@ RunWorker(ArchiveHandle *AH, ParallelSlot *slot)
 	WaitForCommands(AH, pipefd);
 
 	/*
+	 * Close any still-open BLOB batch transaction.
+	 */
+	CommitBlobTransaction((Archive *)AH);
+
+	/*
 	 * Disconnect from database and clean up.
 	 */
 	set_cancel_slot_archive(slot, NULL);
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 0296b9b..cd8a590 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -203,6 +203,8 @@ typedef struct Archive
 	int			numWorkers;		/* number of parallel processes */
 	char	   *sync_snapshot_id;	/* sync snapshot id for parallel operation */
 
+	int			blobBatchSize;	/* # of blobs to restore per transaction */
+
 	/* info needed for string escaping */
 	int			encoding;		/* libpq code for client_encoding */
 	bool		std_strings;	/* standard_conforming_strings */
@@ -269,6 +271,7 @@ extern void WriteData(Archive *AH, const void *data, size_t dLen);
 extern int	StartBlob(Archive *AH, Oid oid);
 extern int	EndBlob(Archive *AH, Oid oid);
 
+extern void	CommitBlobTransaction(Archive *AH);
 extern void CloseArchive(Archive *AH);
 
 extern void SetArchiveOptions(Archive *AH, DumpOptions *dopt, RestoreOptions *ropt);
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 1f82c64..51a862a 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -68,6 +68,8 @@ typedef struct 

Re: [HACKERS] Custom compression methods

2021-03-22 Thread Justin Pryzby
On Mon, Mar 22, 2021 at 01:38:36PM -0400, Robert Haas wrote:
> On Mon, Mar 22, 2021 at 12:16 PM Robert Haas  wrote:
> > But, what about giving the default_toast_compression_method GUC an
> > assign hook that sets a global variable of type "char" to the
> > appropriate value? Then GetDefaultToastCompression() goes away
> > entirely. That might be worth exploring.
> 
> Actually, we can do even better. We should just make the values
> actually assigned to the GUC be TOAST_PGLZ_COMPRESSION etc. rather
> than TOAST_PGLZ_COMPRESSION_ID etc. Then a whole lot of complexity
> just goes away. I added some comments explaining why using
> TOAST_PGLZ_COMPRESSION is the wrong thing anyway. Then I got hacking
> and rearranged a few other things. So the attached patch does these
> things:
> 
> - Changes default_toast_compression to an enum, as in your patch, but
> now with values that are the same as what ultimately gets stored in
> attcompression.
> - Adds a comment warning against incautious use of
> TOAST_PGLZ_COMPRESSION_ID, etc.
> - Moves default_toast_compression_options to guc.c.
> - After doing the above two things, we can remove the #include of
> utils/guc.h into access/toast_compression.h, so the patch does that.
> - Moves NO_LZ4_SUPPORT, GetCompressionMethodName, and
> CompressionNameToMethod to guc.c. Making these inline functions
> doesn't save anything meaningful; it's more important not to export a
> bunch of random identifiers.
> - Removes an unnecessary cast to bool from the definition of
> CompressionMethodIsValid.
> 
> I think this is significantly cleaner than what we have now, and I
> also prefer it to your proposal.

+1

guc.c should no longer define this as extern:
default_toast_compression_options

I think you should comment that default_toast_compression is an int as far as
guc.c is concerned, but stores one of the char values of TOAST_*_COMPRESSION.

Shouldn't varlena.c's pg_column_compression() call GetCompressionMethodName()?
I guess it already should have done that.

Maybe pg_dump.c can't use those constants, though (?)
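
As a quick sanity check of the user-visible surface (a sketch against
the patch set as committed so far, in an --with-lz4 build):

    SET default_toast_compression = 'lz4';
    CREATE TABLE cmtest (payload text);
    INSERT INTO cmtest SELECT repeat('x', 100000);
    SELECT pg_column_compression(payload) FROM cmtest;  -- expect 'lz4'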

-- 
Justin




Re: Nicer error when connecting to standby with hot_standby=off

2021-03-22 Thread Fujii Masao



On 2021/03/19 23:35, James Coleman wrote:

Here's an updated patch; I think I've gotten what Alvaro suggested.


Thanks for updating the patch! But I was thinking that our consensus is
something like the attached patch. Could you check this patch?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ef0be4ca38..a800568d14 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2294,6 +2294,19 @@ retry1:
 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
 					 errmsg("the database system is starting up")));
 			break;
+		case CAC_NOCONSISTENT:
+			if (EnableHotStandby)
+				ereport(FATAL,
+						(errcode(ERRCODE_CANNOT_CONNECT_NOW),
+						 errmsg("the database system is not accepting connections"),
+						 errdetail("Consistent recovery state has not been yet reached.")));
+			else
+				ereport(FATAL,
+						(errcode(ERRCODE_CANNOT_CONNECT_NOW),
+						 errmsg("the database system is not accepting connections"),
+						 errdetail("Hot standby mode is disabled.")));
+			break;
+
 		case CAC_SHUTDOWN:
 			ereport(FATAL,
 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
@@ -2435,10 +2448,11 @@ canAcceptConnections(int backend_type)
 	{
 		if (Shutdown > NoShutdown)
 			return CAC_SHUTDOWN;	/* shutdown is pending */
-		else if (!FatalError &&
-				 (pmState == PM_STARTUP ||
-				  pmState == PM_RECOVERY))
+		else if (!FatalError && pmState == PM_STARTUP)
 			return CAC_STARTUP; /* normal startup */
+		else if (!FatalError && pmState == PM_RECOVERY)
+			return CAC_NOCONSISTENT;	/* consistent recovery state has not
+										 * been yet reached */
 		else
 			return CAC_RECOVERY;	/* else must be crash recovery */
 	}
diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
index 30fb4e613d..299528255c 100644
--- a/src/include/libpq/libpq-be.h
+++ b/src/include/libpq/libpq-be.h
@@ -70,7 +70,12 @@ typedef struct
 
 typedef enum CAC_state
 {
-   CAC_OK, CAC_STARTUP, CAC_SHUTDOWN, CAC_RECOVERY, CAC_TOOMANY,
+   CAC_OK,
+   CAC_STARTUP,
+   CAC_NOCONSISTENT,
+   CAC_SHUTDOWN,
+   CAC_RECOVERY,
+   CAC_TOOMANY,
CAC_SUPERUSER
 } CAC_state;
 


Re: Autovacuum worker doesn't immediately exit on postmaster death

2021-03-22 Thread Stephen Frost
Greetings,

* Thomas Munro (thomas.mu...@gmail.com) wrote:
> On Fri, Dec 11, 2020 at 7:57 AM Stephen Frost  wrote:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > > The if-we're-going-to-delay-anyway path in vacuum_delay_point seems
> > > OK to add a touch more overhead to, though.
> >
> > Alright, for this part at least, seems like it'd be something like the
> > attached.
> >
> > Only lightly tested, but does seem to address the specific example which
> > was brought up on this thread.
> >
> > Thoughts..?
> 
> +1

Thanks for that.  Attached is just a rebased version with a commit
message added.  If there aren't any other concerns, I'll commit this in
the next few days and back-patch it.  When it comes to 12 and older,
does anyone want to opine about the wait event to use?  I was thinking
PG_WAIT_TIMEOUT or WAIT_EVENT_PG_SLEEP ...

Or do folks think this shouldn't be backpatched?  That would mean it
wouldn't help anyone for years, which would be pretty unfortunate, hence
my feeling that it's worthwhile to backpatch.

Thanks!

Stephen
From 9daf52b78d106c86e038dcefdb1d8345d22b9756 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Mon, 22 Mar 2021 13:25:57 -0400
Subject: [PATCH] Use a WaitLatch for vacuum/autovacuum sleeping

Instead of using pg_usleep() in vacuum_delay_point(), use a WaitLatch.
This has the advantage that we will realize if the postmaster has been
killed since the last time we decided to sleep while vacuuming.

Discussion: https://postgr.es/m/CAFh8B=kcdk8k-Y21RfXPu5dX=bgPqJ8TC3p_qxR_ygdBS=j...@mail.gmail.com
Backpatch: 9.6-
---
 src/backend/commands/vacuum.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c064352e23..662aff04b4 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2080,9 +2080,11 @@ vacuum_delay_point(void)
 		if (msec > VacuumCostDelay * 4)
 			msec = VacuumCostDelay * 4;
 
-		pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
-		pg_usleep((long) (msec * 1000));
-		pgstat_report_wait_end();
+		(void) WaitLatch(MyLatch,
+		 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+		 msec,
+		 WAIT_EVENT_VACUUM_DELAY);
+		ResetLatch(MyLatch);
 
 		VacuumCostBalance = 0;
 
-- 
2.27.0



signature.asc
Description: PGP signature


Re: [HACKERS] Custom compression methods

2021-03-22 Thread Robert Haas
On Mon, Mar 22, 2021 at 12:16 PM Robert Haas  wrote:
> But, what about giving the default_toast_compression_method GUC an
> assign hook that sets a global variable of type "char" to the
> appropriate value? Then GetDefaultToastCompression() goes away
> entirely. That might be worth exploring.

Actually, we can do even better. We should just make the values
actually assigned to the GUC be TOAST_PGLZ_COMPRESSION etc. rather
than TOAST_PGLZ_COMPRESSION_ID etc. Then a whole lot of complexity
just goes away. I added some comments explaining why using
TOAST_PGLZ_COMPRESSION is the wrong thing anyway. Then I got hacking
and rearranged a few other things. So the attached patch does these
things:

- Changes default_toast_compression to an enum, as in your patch, but
now with values that are the same as what ultimately gets stored in
attcompression.
- Adds a comment warning against incautious use of
TOAST_PGLZ_COMPRESSION_ID, etc.
- Moves default_toast_compression_options to guc.c.
- After doing the above two things, we can remove the #include of
utils/guc.h into access/toast_compression.h, so the patch does that.
- Moves NO_LZ4_SUPPORT, GetCompressionMethodName, and
CompressionNameToMethod to guc.c. Making these inline functions
doesn't save anything meaningful; it's more important not to export a
bunch of random identifiers.
- Removes an unnecessary cast to bool from the definition of
CompressionMethodIsValid.

I think this is significantly cleaner than what we have now, and I
also prefer it to your proposal.

Comments?

-- 
Robert Haas
EDB: http://www.enterprisedb.com


toast-compression-guc-rmh.patch
Description: Binary data


Re: Wired if-statement in gen_partprune_steps_internal

2021-03-22 Thread Ryan Lambert
Should the status of this patch be updated to ready for committer, to get
in line for the Pg 14 deadline?

*Ryan Lambert*

On Sun, Mar 7, 2021 at 11:38 PM Amit Langote 
wrote:

> On Fri, Mar 5, 2021 at 7:50 AM Ryan Lambert 
> wrote:
> > On Wed, Mar 3, 2021 at 11:03 PM Amit Langote 
> wrote:
> >>
> >> Sorry, this seems to have totally slipped my mind.
> >>
> >> Attached is the patch I had promised.
> >>
> >> Also, I have updated the title of the CF entry to "Some cosmetic
> >> improvements of partition pruning code", which I think is more
> >> appropriate.
> >
> > Thank you.  The updated patch passes installcheck-world.  I ran a
> handful of test queries with a small number of partitions and observed the
> same plans before and after the patch. I cannot speak to the quality of the
> code, though am happy to test any additional use cases that should be
> verified.
>
> Thanks Ryan.
>
> There's no need to test it extensively, because no functionality is
> changed with this patch.
>
> --
> Amit Langote
> EDB: http://www.enterprisedb.com
>


Re: Proposal: Save user's original authenticated identity for logging

2021-03-22 Thread Magnus Hagander
On Mon, Mar 22, 2021 at 7:16 AM Michael Paquier  wrote:
>
> On Fri, Mar 19, 2021 at 06:37:05PM +, Jacob Champion wrote:
> > The same effect can be had by moving the log rotation to the top of the
> > test that needs it, so I've done it that way in v7.
>
> After thinking more about 0001, I have come up with an even simpler
> solution that has resulted in 11e1577.  That's similar to what
> PostgresNode::issues_sql_like() does.  This also makes 0003 simpler
> with its changes as this requires to change two lines in test_access.

Man that renumbering threw me off :)


> > Turns out it's easy now to have our cake and eat it too; a single if
> > statement can implement the same search-forward functionality that was
> > spread across multiple places before. So I've done that too.
>
> I have briefly looked at 0002 (0001 in the attached set), and it seems
> sane to me.  I still need to look at 0003 (well, now 0002) in details,
> which is very sensible as one mistake would likely be a CVE-class
> bug.

The 0002/0001/whateveritisaftertherebase is tracked over at
https://www.postgresql.org/message-id/flat/92e70110-9273-d93c-5913-0bccb6562...@dunslane.net
isn't it? I've assumed the expectation is to have that one committed
from that thread, and then rebase using that.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Disable WAL logging to speed up data loading

2021-03-22 Thread Stephen Frost
Greetings,

* Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> On Mon, 2021-03-22 at 11:05 -0400, Stephen Frost wrote:
> > > Perhaps allowing to set unlogged tables to logged ones without writing WAL
> > > is a more elegant way to do that, but I cannot see how that would be any
> > > more crash safe than this patch.  Besides, the feature doesn't exist yet.
> > 
> > I'm not suggesting it's somehow more crash safe- but it's at least very
> > clear what happens in such a case, to wit: the entire table is cleared
> > on crash recovery.
> 
> I didn't look at the patch, but are you saying that after changing the
> table to LOGGED, it somehow remembers that it is not crash safe and is
> emptied if there is a crash before the next checkpoint?

I'm not sure where the confusion is, but certainly once the table has
been turned into LOGGED and that's been committed then it should be
crash safe, even under 'minimal' with the optimization to avoid actually
copying the entire table into the WAL.  I don't see any particular
reason why that isn't possible to do.

Under the proposed 'none', you basically have to throw out the entire
cluster on a crash, all because you didn't want to use UNLOGGED when
you created the tables you want to load data into, or to TRUNCATE them
in the transaction where you start the data load. Either of those gives
us enough indication, and we already have infrastructure for dealing
with them in the event of a crash during the load, without everything
else having to be tossed and restored from a backup. That's a better
user experience, both from the perspective of having fewer WAL levels
to understand and from a general administration perspective, since you
don't have to go all the way back to a backup to bring the system back
up. The two existing approaches are sketched below.
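
A minimal sketch of those two approaches (table and file names made up):

    -- approach 1: load unlogged, then flip to logged
    CREATE UNLOGGED TABLE load_target (payload text);
    COPY load_target FROM '/tmp/data.csv' WITH (FORMAT csv);
    ALTER TABLE load_target SET LOGGED;

    -- approach 2: TRUNCATE in the loading transaction; under
    -- wal_level = minimal the copied data skips WAL entirely
    BEGIN;
    TRUNCATE load_target;
    COPY load_target FROM '/tmp/data.csv' WITH (FORMAT csv);
    COMMIT;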

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Protect syscache from bloating with negative cache entries

2021-03-22 Thread Bruce Momjian
On Thu, Jan 28, 2021 at 05:16:52PM +0900, Kyotaro Horiguchi wrote:
> At Thu, 28 Jan 2021 16:50:44 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > I was going to write in the doc something like "you can inspect memory
> > consumption by catalog caches using pg_backend_memory_contexts", but
> > all the memory used by catalog cache is in CacheMemoryContext.  Is it
> > sensible for each catalog cache to have their own contexts?
> 
> Something like this.
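
(For reference, that inspection is along these lines -- a sketch using
the pg_backend_memory_contexts view as it exists in master:)

    SELECT name, total_bytes, used_bytes
    FROM pg_backend_memory_contexts
    WHERE name = 'CacheMemoryContext';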

Is this feature not going to make it into PG 14?  It first appeared in
the January, 2017 commitfest:

https://commitfest.postgresql.org/32/931/

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Change default of checkpoint_completion_target

2021-03-22 Thread Stephen Frost
Greetings,

* David Steele (da...@pgmasters.net) wrote:
> I had a look at the patch and the change and new documentation seem sensible
> to me.

Thanks!

> I think this phrase may be a bit too idiomatic:
> 
> +consistent I/O load while also leaving some slop for checkpoint
> 
> Perhaps just:
> 
> +consistent I/O load while also leaving some time for checkpoint

Yeah, good thought, updated.

> It seems to me that the discussion about changing the wording for GUCs not
> changeable after server should be saved for another patch as long as this
> patch follows the current convention.

Agreed.

Unless there's anything further on this, I'll plan to commit it tomorrow
or Wednesday.
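
(For existing clusters, the new default can be adopted ahead of time;
a trivial sketch, since this parameter only needs a reload:)

    ALTER SYSTEM SET checkpoint_completion_target = 0.9;
    SELECT pg_reload_conf();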

Thanks!

Stephen
From 3ebe08dee4b9dfe2dff51fd1bad2eb36834e82ed Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Tue, 19 Jan 2021 13:53:34 -0500
Subject: [PATCH] Change the default of checkpoint_completion_target to 0.9

Common recommendations are that the checkpoint should be spread out as
much as possible, provided we avoid having it take too long.  This
change updates the default to 0.9 (from 0.5) to match that
recommendation.

There was some debate about possibly removing the option entirely but it
seems there may be some corner-cases where having it set much lower to
try to force the checkpoint to be as fast as possible could result in
fewer periods of time of reduced performance due to kernel flushing.
General agreement is that the "spread more" is the preferred approach
though and those who need to tune away from that value are much less
common.

Reviewed-By: Michael Paquier, Peter Eisentraut, Tom Lane, David Steele
Discussion: https://postgr.es/m/20201207175329.GM16415%40tamriel.snowman.net
---
 doc/src/sgml/config.sgml  | 12 ++--
 doc/src/sgml/wal.sgml | 29 ---
 src/backend/utils/misc/guc.c  |  2 +-
 src/backend/utils/misc/postgresql.conf.sample |  2 +-
 src/test/recovery/t/015_promotion_pages.pl|  1 -
 5 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5679b40dd5..44763f0180 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3302,9 +3302,15 @@ include_dir 'conf.d'
   

 Specifies the target of checkpoint completion, as a fraction of
-total time between checkpoints. The default is 0.5.
-This parameter can only be set in the postgresql.conf
-file or on the server command line.
+total time between checkpoints. The default is 0.9, which spreads the
+checkpoint across almost all of the available interval, providing fairly
+consistent I/O load while also leaving some time for checkpoint
+completion overhead.  Reducing this parameter is not recommended as that
+causes the I/O from the checkpoint to have to complete faster, resulting
+in a higher I/O rate, while then having a period of less I/O between the
+completion of the checkpoint and the start of the next scheduled
+checkpoint.  This parameter can only be set in the
+postgresql.conf file or on the server command line.

   
  
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index ae4a3c1293..4354051c7b 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -571,22 +571,29 @@
writing dirty buffers during a checkpoint is spread over a period of time.
That period is controlled by
, which is
-   given as a fraction of the checkpoint interval.
+   given as a fraction of the checkpoint interval (configured by using
+   checkpoint_timeout).
The I/O rate is adjusted so that the checkpoint finishes when the
given fraction of
checkpoint_timeout seconds have elapsed, or before
max_wal_size is exceeded, whichever is sooner.
-   With the default value of 0.5,
+   With the default value of 0.9,
PostgreSQL can be expected to complete each checkpoint
-   in about half the time before the next checkpoint starts.  On a system
-   that's very close to maximum I/O throughput during normal operation,
-   you might want to increase checkpoint_completion_target
-   to reduce the I/O load from checkpoints.  The disadvantage of this is that
-   prolonging checkpoints affects recovery time, because more WAL segments
-   will need to be kept around for possible use in recovery.  Although
-   checkpoint_completion_target can be set as high as 1.0,
-   it is best to keep it less than that (perhaps 0.9 at most) since
-   checkpoints include some other activities besides writing dirty buffers.
+   a bit before the next scheduled checkpoint (at around 90% of the last checkpoint's
+   duration).  This spreads out the I/O as much as possible to have the I/O load be
+   consistent during the checkpoint.  The disadvantage of this is that prolonging
+   checkpoints affects recovery time, because more WAL segments will need to be kept
+   

Re: Disable WAL logging to speed up data loading

2021-03-22 Thread Laurenz Albe
On Mon, 2021-03-22 at 11:05 -0400, Stephen Frost wrote:
> > Perhaps allowing to set unlogged tables to logged ones without writing WAL
> > is a more elegant way to do that, but I cannot see how that would be any
> > more crash safe than this patch.  Besides, the feature doesn't exist yet.
> 
> I'm not suggesting it's somehow more crash safe- but it's at least very
> clear what happens in such a case, to wit: the entire table is cleared
> on crash recovery.

I didn't look at the patch, but are you saying that after changing the
table to LOGGED, it somehow remembers that it is not crash safe and is
emptied if there is a crash before the next checkpoint?

Wouldn't that cause corruption if you restore from an earlier backup?
At least with the feature in this thread we'd get an error on recovery.

Yours,
Laurenz Albe





Re: Add docs stub for recovery.conf

2021-03-22 Thread Stephen Frost
Greetings,

* Craig Ringer (craig.rin...@enterprisedb.com) wrote:
> Pretty good to me. Thanks so much for your help and support with this.

Thanks for helping me move it forward!

> Index entries render as e.g.
> 
> pg_xlogdump, The pg_xlogdump command
> (see also pg_waldump)
> 
whereas with the obsolete subhead they would render as something like:
> 
> obsolete, Obsolete or renamed features, settings and files
> pg_xlogdump, The pg_xlogdump command
> 
The <seealso> spelling is much easier to find in the index, but doesn't
make it as obvious that it's obsoleted/replaced.
> 
A look at the DocBook docs suggests we should use <see> rather than
<seealso> for these.
> 
> A quick
> 
sed -i -e 's/<seealso>/<see>/g' -e 's/<\/seealso>/<\/see>/g'
> doc/src/sgml/appendix-obsolete*
> 
> causes them to render much better:
> 
> pg_receivexlog, The pg_receivexlog command (see pg_receivewal)
> 
It might be worth changing the <title>s too, so I've done so in the
attached. The terms now render as:

pg_receivexlog, pg_receivexlog renamed to pg_receivewal (see
pg_receivewal)
> 
which is good enough in my opinion. The duplication is messy, but an
expected artifact of index generation. I don't see any DocBook
<indexterm> attribute that lets you suppress insertion of the <title>
of the section containing the <indexterm>, and it's not worth fiddling
to try to eliminate it with structural hacks.

Nice, yes, that does look better.

The attached changes the titles, changes <seealso> to <see>, and also
updates the comments in the obsolete entries' SGML docs to specify that
the id must be unchanged, and gives a recommended index term format.

Awesome, attached is just a rebase (not that anything really changed).
Unless someone wants to speak up, I'll commit this soonish (hopefully
tomorrow, but at least sometime later this week).

Thanks!

Stephen
From 000cd577d6dbb9d6d6c571e2302657f1252e6a56 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Mon, 22 Mar 2021 12:45:41 -0400
Subject: [PATCH] Add a docs section for obsoleted and renamed functions and
 settings

The new appendix groups information on renamed or removed settings,
commands, etc into an out-of-the-way part of the docs.

The original id elements are retained in each subsection to ensure that
the same filenames are produced for HTML docs. This prevents /current/
links on the web from breaking, and allows users of the web docs
to follow links from old version pages to info on the changes in the
new version. Prior to this change, a link to /current/ for renamed
sections like the recovery.conf docs would just 404. Similarly if
someone searched for recovery.conf they would find the pg11 docs,
but there would be no /12/ or /current/ link, so they couldn't easily
find out that it was removed in pg12 or how to adapt.

Index entries are also added so that there's a breadcrumb trail for
users to follow when they know the old name, but not what we changed it
to. So a user who is trying to find out how to set standby_mode in
PostgreSQL 12+, or where pg_resetxlog went, now has more chance of
finding that information.

Craig Ringer and Stephen Frost
Discussion: https://postgr.es/m/CAGRY4nzPNOyYQ_1-pWYToUVqQ0ThqP5jdURnJMZPm539fdizOg%40mail.gmail.com
---
 .../sgml/appendix-obsolete-pgreceivexlog.sgml | 24 
 .../sgml/appendix-obsolete-pgresetxlog.sgml   | 24 
 .../sgml/appendix-obsolete-pgxlogdump.sgml| 24 
 .../appendix-obsolete-recovery-config.sgml| 58 +++
 doc/src/sgml/appendix-obsolete.sgml   | 40 +
 doc/src/sgml/config.sgml  |  4 +-
 doc/src/sgml/filelist.sgml|  7 +++
 doc/src/sgml/high-availability.sgml   | 16 -
 doc/src/sgml/postgres.sgml|  1 +
 doc/src/sgml/ref/pg_basebackup.sgml   |  5 +-
 10 files changed, 198 insertions(+), 5 deletions(-)
 create mode 100644 doc/src/sgml/appendix-obsolete-pgreceivexlog.sgml
 create mode 100644 doc/src/sgml/appendix-obsolete-pgresetxlog.sgml
 create mode 100644 doc/src/sgml/appendix-obsolete-pgxlogdump.sgml
 create mode 100644 doc/src/sgml/appendix-obsolete-recovery-config.sgml
 create mode 100644 doc/src/sgml/appendix-obsolete.sgml

diff --git a/doc/src/sgml/appendix-obsolete-pgreceivexlog.sgml b/doc/src/sgml/appendix-obsolete-pgreceivexlog.sgml
new file mode 100644
index 00..30bae2f7a2
--- /dev/null
+++ b/doc/src/sgml/appendix-obsolete-pgreceivexlog.sgml
@@ -0,0 +1,24 @@
+
+
+
+
+  pg_receivexlog renamed to pg_receivewal
+
+   
+ pg_receivexlog
+ pg_receivewal
+   
+
+   
+PostgreSQL 9.6 and below provided a command named
+pg_receivexlog
+pg_receivexlog
+to fetch write-ahead-log (WAL) files.  This command was renamed to pg_receivewal, see
+ for documentation of pg_receivewal and see
+the release notes for PostgreSQL 10 for details
+on this change.
+   
+
+
diff --git a/doc/src/sgml/appendix-obsolete-pgresetxlog.sgml b/doc/src/sgml/appendix-obsolete-pgresetxlog.sgml
new file 

Re: [HACKERS] Custom compression methods

2021-03-22 Thread Robert Haas
On Mon, Mar 22, 2021 at 11:13 AM Justin Pryzby  wrote:
> The first iteration was pretty rough, and there's still some question in my
> mind about where default_toast_compression_options[] should be defined.  If
> it's in the header file, then I could use lengthof() - but then it probably
> gets multiply defined.

What do you want to use lengthof() for?

> In the latest patch, there's multiple "externs".  Maybe
> guc.c doesn't need the extern, since it includes toast_compression.h.  But 
> then
> it's the only "struct config_enum_entry" which has an "extern" outside of
> guc.c.

Oh, yeah, we certainly shouldn't have an extern in guc.c itself, if
we've already got it in the header file.

As to the more general question of where to put stuff, I don't think
there's any conceptual problem with putting it in a header file rather
than in guc.c. It's not very scalable to just keep inventing new
GUCs and sticking all their accoutrement into guc.c. That might have
kind of made sense when guc.c was invented, since there were probably
fewer settings there and guc.c itself was new, but at this point it's
a well-established part of the infrastructure and having other
subsystems cater to what it needs rather than the other way around
seems logical. However, it's not great to have "utils/guc.h" included
in "access/toast_compression.h", because then anything that includes
"access/toast_compression.h" or "access/toast_internals.h" sucks in
"utils/guc.h" even though it's not really topically related to what
they intended to include. We can't avoid that just by choosing to put
this enum in guc.c, because GetDefaultToastCompression() also uses it.
But, what about giving the default_toast_compression_method GUC an
assign hook that sets a global variable of type "char" to the
appropriate value? Then GetDefaultToastCompression() goes away
entirely. That might be worth exploring.

> Also, it looks like you added default_toast_compression out of order, so maybe
> you'd fix that at the same time.

You know, I looked at where you had it and said to myself, "surely
this is a silly place to put this, it would make much more sense to
move this up a bit." Now I feel dumb.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [HACKERS] Custom compression methods (mac+lz4.h)

2021-03-22 Thread Tom Lane
Robert Haas  writes:
> On Mon, Mar 22, 2021 at 11:48 AM Tom Lane  wrote:
>> Possibly the former names should survive and the latter become
>> wrappers around them, not sure.  But we shouldn't be using the "4B"
>> terminology anyplace except this part of postgres.h.

> I would argue that it shouldn't be used any place at all, and that we
> ought to go the other direction and get rid of the existing macros -
> e.g. change #define VARATT_IS_1B_E to #define VARATT_IS_EXTERNAL
> instead of defining the latter as a no-value-added wrapper around the
> former. Maybe at one time somebody thought that the test for
> VARATT_IS_EXTERNAL might someday have more cases than just
> VARATT_IS_1B_E, but that's not looking like a good bet in 2021.

Maybe.  I think the original idea was exactly what the comment says,
to have a layer of macros that'd deal with endianness issues and no more.
That still seems like a reasonable plan to me, though perhaps it wasn't
executed very well.

regards, tom lane




Re: [HACKERS] Custom compression methods (mac+lz4.h)

2021-03-22 Thread Robert Haas
On Mon, Mar 22, 2021 at 11:48 AM Tom Lane  wrote:
> > Anyway, this particular macro name was chosen, it seems, for symmetry
> > with VARDATA_4B_C, but if you want to change it to something else, I'm
> > OK with that, too.
>
> After looking at postgres.h for a bit, I'm thinking that what these
> should have been symmetric with is the considerably-less-terrible
> names used for the corresponding VARATT_EXTERNAL cases.  Thus,
> something like
>
> s/VARRAWSIZE_4B_C/VARDATA_COMPRESSED_GET_RAWSIZE/
> s/VARCOMPRESS_4B_C/VARDATA_COMPRESSED_GET_COMPRESSION/

Works for me.

> Possibly the former names should survive and the latter become
> wrappers around them, not sure.  But we shouldn't be using the "4B"
> terminology anyplace except this part of postgres.h.

I would argue that it shouldn't be used any place at all, and that we
ought to go the other direction and get rid of the existing macros -
e.g. change #define VARATT_IS_1B_E to #define VARATT_IS_EXTERNAL
instead of defining the latter as a no-value-added wrapper around the
former. Maybe at one time somebody thought that the test for
VARATT_IS_EXTERNAL might someday have more cases than just
VARATT_IS_1B_E, but that's not looking like a good bet in 2021.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [HACKERS] Custom compression methods (mac+lz4.h)

2021-03-22 Thread Tom Lane
Robert Haas  writes:
> On Mon, Mar 22, 2021 at 10:41 AM Tom Lane  wrote:
>> Yeah, I thought about that too, but do we want to assume that
>> VARRAWSIZE_4B_C is the correct way to get the decompressed size
>> for all compression methods?

> I think it's OK to assume this.

OK, cool.

>> (If so, I think it would be better style to have a less opaque macro
>> name for the purpose.)

> Complaining about the name of one particular TOAST-related macro
> seems a bit like complaining about the greenhouse gasses emitted by
> one particular car.

Maybe, but that's not a reason to make it worse.  Anyway, my understanding
of that is that the really opaque names are *only* meant to be used in
this very stretch of postgres.h, ie they are just intermediate steps on
the way to the macros below them.  As an example, the only use of
VARDATA_1B_E() is in VARDATA_EXTERNAL().

> Anyway, this particular macro name was chosen, it seems, for symmetry
> with VARDATA_4B_C, but if you want to change it to something else, I'm
> OK with that, too.

After looking at postgres.h for a bit, I'm thinking that what these
should have been symmetric with is the considerably-less-terrible
names used for the corresponding VARATT_EXTERNAL cases.  Thus,
something like

s/VARRAWSIZE_4B_C/VARDATA_COMPRESSED_GET_RAWSIZE/
s/VARCOMPRESS_4B_C/VARDATA_COMPRESSED_GET_COMPRESSION/

Possibly the former names should survive and the latter become
wrappers around them, not sure.  But we shouldn't be using the "4B"
terminology anyplace except this part of postgres.h.

regards, tom lane




Re: [HACKERS] Custom compression methods (mac+lz4.h)

2021-03-22 Thread Robert Haas
On Mon, Mar 22, 2021 at 10:41 AM Tom Lane  wrote:
> > Okay, the fix makes sense.  In fact, IMHO, this fix also looks like
> > an optimization in general: when slicelength >= VARRAWSIZE_4B_C(value),
> > why do we need to allocate extra memory even in the case of pglz?  So
> > shall we put this check directly in toast_decompress_datum_slice
> > instead of handling it at the lz4 level?
>
> Yeah, I thought about that too, but do we want to assume that
> VARRAWSIZE_4B_C is the correct way to get the decompressed size
> for all compression methods?

I think it's OK to assume this. If and when we add a third compression
method, it seems certain to just grab one of the two remaining bit
patterns. Now, things get a bit more complicated if and when we want
to add a fourth method, because at that point you've got to ask
yourself how comfortable you feel about stealing the last bit pattern
for your feature. But, if the solution to that problem were to decide
that whenever that last bit pattern is used, we will add an extra byte
(or word) after va_tcinfo indicating the real compression method, then
using VARRAWSIZE_4B_C here would still be correct. To imagine this
decision being wrong, you have to posit a world in which one of the
two remaining bit patterns for the high 2 bits causes the low 30 bits
to be interpreted as something other than the size, which I guess is
not totally impossible, but my first reaction is to think that such a
design would be (1) hard to make work and (2) unnecessarily painful.
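For concreteness, here is a self-contained sketch of the layout under
discussion (constants and helper names are illustrative; postgres.h has
its own equivalents): the high 2 bits of va_tcinfo select the method,
the low 30 bits carry the raw size, and VARRAWSIZE_4B_C stays correct
exactly as long as that size interpretation of the low bits holds.

```
#include <stdint.h>
#include <stdio.h>

/* Illustrative constants for the 30/2 split described above. */
#define EXTSIZE_BITS 30
#define EXTSIZE_MASK ((1U << EXTSIZE_BITS) - 1)

/* Low 30 bits: raw (decompressed) size. */
static uint32_t
tcinfo_rawsize(uint32_t va_tcinfo)
{
	return va_tcinfo & EXTSIZE_MASK;
}

/* High 2 bits: compression method (0 = pglz, 1 = lz4, two spare). */
static uint32_t
tcinfo_method(uint32_t va_tcinfo)
{
	return va_tcinfo >> EXTSIZE_BITS;
}

int
main(void)
{
	uint32_t	tcinfo = (1U << EXTSIZE_BITS) | 8192;	/* lz4, 8192 bytes raw */

	printf("method=%u rawsize=%u\n",
		   tcinfo_method(tcinfo), tcinfo_rawsize(tcinfo));
	return 0;
}
```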

> (If so, I think it would be better style to have a less opaque macro
> name for the purpose.)

Complaining about the name of one particular TOAST-related macro
seems a bit like complaining about the greenhouse gasses emitted by
one particular car. They're pretty uniformly terrible. Does anyone
really know when to use VARATT_IS_1B_E or VARATT_IS_4B_U or any of
that cruft? Like, who decided that "is this varatt 1B E?" would be a
perfectly reasonable way of asking "is this varlena a TOAST
pointer?". While I'm complaining, it's hard to say enough bad things
about the fact that we have 12 consecutive completely obscure macro
definitions for which the only comments are (a) that they are
endian-dependent - which isn't even true for all of them - and (b)
that they are "considered internal." Apparently, they're SO internal
that they don't even need to be understandable to other developers.

Anyway, this particular macro name was chosen, it seems, for symmetry
with VARDATA_4B_C, but if you want to change it to something else, I'm
OK with that, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [HACKERS] Custom compression methods (mac+lz4.h)

2021-03-22 Thread Dilip Kumar
On Mon, Mar 22, 2021 at 8:11 PM Tom Lane  wrote:
>
> Dilip Kumar  writes:
> > On Mon, Mar 22, 2021 at 5:22 AM Tom Lane  wrote:
> >> Also, after studying the documentation for LZ4_decompress_safe
> >> and LZ4_decompress_safe_partial, I realized that liblz4 is also
> >> counting on the *output* buffer size to not be a lie.  So we
> >> cannot pass it a number larger than the chunk's true decompressed
> >> size.  The attached patch resolves the issue I'm seeing.
>
> > Okay, the fix makes sense.  In fact, IMHO, this fix also looks like
> > an optimization in general: when slicelength >= VARRAWSIZE_4B_C(value),
> > why do we need to allocate extra memory even in the case of pglz?  So
> > shall we put this check directly in toast_decompress_datum_slice
> > instead of handling it at the lz4 level?
>
> Yeah, I thought about that too, but do we want to assume that
> VARRAWSIZE_4B_C is the correct way to get the decompressed size
> for all compression methods?

Yeah, VARRAWSIZE_4B_C is the macro that gets the rawsize of the data
stored in the compressed varlena.

> (If so, I think it would be better style to have a less opaque macro
> name for the purpose.)

Okay, I have added another macro that is less opaque and came up with
this patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v1-0001-fix-slice-decompression.patch
Description: Binary data


Re: [HACKERS] Custom compression methods

2021-03-22 Thread Justin Pryzby
On Mon, Mar 22, 2021 at 11:05:19AM -0400, Robert Haas wrote:
> On Mon, Mar 22, 2021 at 10:44 AM Justin Pryzby  wrote:
> > Thanks.  I just realized that if you also push the GUC change, then the docs
> > should change from "string" to "enum"
> >
> > doc/src/sgml/config.sgml:  
> > default_toast_compression (string)
> 
> I've now also committed your 0005. As for 0006, aside from the note
> above, which is a good one, is there any particular reason why this
> patch is labelled as WIP? I think this change makes sense and we
> should just do it unless there's some problem with it.

The first iteration was pretty rough, and there's still some question in my
mind about where default_toast_compression_options[] should be defined.  If
it's in the header file, then I could use lengthof() - but then it probably
gets multiply defined.  In the latest patch, there are multiple "externs".  Maybe
guc.c doesn't need the extern, since it includes toast_compression.h.  But then
it's the only "struct config_enum_entry" which has an "extern" outside of
guc.c.
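The underlying constraint is the usual C linkage rule, sketched here
with a local copy of the struct shape (illustrative only, not the
patch): a definition in a header would be duplicated in every file that
includes it, so the header can carry only an extern declaration -- and
lengthof() then works only in the one file that holds the definition.

```
#include <stdbool.h>
#include <stdio.h>

/* --- shape the header could carry (extern declaration only) --- */
struct config_enum_entry
{
	const char *name;
	int			val;
	bool		hidden;
};
extern const struct config_enum_entry default_toast_compression_options[];

/* --- the single definition, in exactly one .c file --- */
const struct config_enum_entry default_toast_compression_options[] = {
	{"pglz", 0, false},
	{"lz4", 1, false},
	{NULL, 0, false}			/* terminator, usable from any file */
};

int
main(void)
{
	/* Files without the definition can walk to the NULL terminator. */
	for (const struct config_enum_entry *e = default_toast_compression_options;
		 e->name != NULL; e++)
		printf("%s = %d\n", e->name, e->val);
	return 0;
}
```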

Also, it looks like you added default_toast_compression out of order, so maybe
you'd fix that at the same time.

-- 
Justin




Re: Disable WAL logging to speed up data loading

2021-03-22 Thread Stephen Frost
Greetings,

* Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> On Mon, 2021-03-22 at 09:46 -0400, Stephen Frost wrote:
> > * tsunakawa.ta...@fujitsu.com (tsunakawa.ta...@fujitsu.com) wrote:
> > > From: Stephen Frost 
> > > > The argument here seems to stem from the idea that issuing a 'TRUNCATE'
> > > > inside the transaction before starting the 'COPY' command is 'too hard'.
> > > 
> > > No, we can't ask users to run TRUNCATE, because they want to add data
> > > to a table.
> > 
> > First- what are you expecting would actually happen during crash
> > recovery in this specific case with your proposed new WAL level?
> > 
> > Second, use partitioning, or unlogged tables (with the patch discussed
> > elsewhere to allow flipping them to logged without writing the entire
> > thing to WAL).
> 
> Perhaps allowing to set unlogged tables to logged ones without writing WAL
> is a more elegant way to do that, but I cannot see how that would be any
> more crash safe than this patch.  Besides, the feature doesn't exist yet.

I'm not suggesting it's somehow more crash safe- but it's at least very
clear what happens in such a case, to wit: the entire table is cleared
on crash recovery.

> So I think that argument doesn't carry much weight.

We're talking about two different ways to accomplish essentially the
same thing- one which introduces a new WAL level, vs. one which adds an
optimization for a WAL level we already have.  That the second is more
elegant is more-or-less entirely the point I'm making here, so it seems
pretty relevant.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: [HACKERS] Custom compression methods

2021-03-22 Thread Robert Haas
On Mon, Mar 22, 2021 at 10:44 AM Justin Pryzby  wrote:
> Thanks.  I just realized that if you also push the GUC change, then the docs
> should change from "string" to "enum"
>
> doc/src/sgml/config.sgml:  
> default_toast_compression (string)

I've now also committed your 0005. As for 0006, aside from the note
above, which is a good one, is there any particular reason why this
patch is labelled as WIP? I think this change makes sense and we
should just do it unless there's some problem with it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.

2021-03-22 Thread Fujii Masao




On 2021/03/20 13:40, Masahiro Ikeda wrote:

Sorry, there is no evidence that we should call it.
I thought that calling FreeWaitEventSet() after using CreateWaitEventSet()
is simply good manners, and the performance impact of calling it seems to
be very small.

(Please let me know if my understanding is wrong.)
The reason I called it a matter of manners is that I thought there is no
problem even if FreeWaitEventSet() is not called before the process exits,
regardless of fast or immediate shutdown. Since the "WaitEventSet *wes" is
a process-local resource, it will be released by the OS even if
FreeWaitEventSet() is not called.

I'm now changing my mind: skipping it is better, because the code can be
simpler and it's unimportant for the above reason, especially when an
immediate shutdown is requested, which is itself bad manners.


Thanks for explaining this! So you're thinking that the v3 patch should be chosen?
Probably some review comments I posted upthread need to be applied to
the v3 patch, e.g., the rename of the SignalHandlerForUnsafeExit() function.



BTW, SysLoggerMain() creates the WaitEventSet, but it doesn't call
FreeWaitEventSet(). Is it better to call FreeWaitEventSet() before
exiting there too?


Yes, if it could cause an actual issue. Otherwise I don't have a strong
reason to do that for now.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Disable WAL logging to speed up data loading

2021-03-22 Thread Laurenz Albe
On Mon, 2021-03-22 at 09:46 -0400, Stephen Frost wrote:
> * tsunakawa.ta...@fujitsu.com (tsunakawa.ta...@fujitsu.com) wrote:
> > From: Stephen Frost 
> > > The argument here seems to stem from the idea that issuing a 'TRUNCATE'
> > > inside the transaction before starting the 'COPY' command is 'too hard'.
> > 
> > No, we can't ask users to run TRUNCATE, because they want to add data
> > to a table.
> 
> First- what are you expecting would actually happen during crash
> recovery in this specific case with your proposed new WAL level?
> 
> Second, use partitioning, or unlogged tables (with the patch discussed
> elsewhere to allow flipping them to logged without writing the entire
> thing to WAL).

Perhaps allowing to set unlogged tables to logged ones without writing WAL
is a more elegant way to do that, but I cannot see how that would be any
more crash safe than this patch.  Besides, the feature doesn't exist yet.

So I think that argument doesn't carry much weight.

Yours,
Laurenz Albe





Re: [HACKERS] Custom compression methods

2021-03-22 Thread Justin Pryzby
On Mon, Mar 22, 2021 at 10:41:33AM -0400, Robert Haas wrote:
> On Sun, Mar 21, 2021 at 7:55 PM Justin Pryzby  wrote:
> > Rebased again.
> 
> Thanks, Justin. I committed 0003 and 0004 together as
> 226e2be3876d0bda3dc33d16dfa0bed246b7b74f. I also committed 0001 and
> 0002 together as 24f0e395ac5892cd12e8914646fe921fac5ba23d, but with
> some revisions, because your text was not clear that this is setting
> the default for new tables, not new values; it also implied that this
> only affects out-of-line compression, which is not true. In lieu of
> trying to explain how TOAST works here, I added a link. It looks,
> though, like that documentation also needs to be patched for this
> change. I'll look into that, and your remaining patches, next.

Thanks.  I just realized that if you also push the GUC change, then the docs
should change from "string" to "enum"

doc/src/sgml/config.sgml:  
default_toast_compression (string)

-- 
Justin




Re: [HACKERS] Custom compression methods (mac+lz4.h)

2021-03-22 Thread Tom Lane
Dilip Kumar  writes:
> On Mon, Mar 22, 2021 at 5:22 AM Tom Lane  wrote:
>> Also, after studying the documentation for LZ4_decompress_safe
>> and LZ4_decompress_safe_partial, I realized that liblz4 is also
>> counting on the *output* buffer size to not be a lie.  So we
>> cannot pass it a number larger than the chunk's true decompressed
>> size.  The attached patch resolves the issue I'm seeing.

> Okay, the fix makes sense.  In fact, IMHO, this fix also looks like
> an optimization in general: when slicelength >= VARRAWSIZE_4B_C(value),
> why do we need to allocate extra memory even in the case of pglz?  So
> shall we put this check directly in toast_decompress_datum_slice
> instead of handling it at the lz4 level?

Yeah, I thought about that too, but do we want to assume that
VARRAWSIZE_4B_C is the correct way to get the decompressed size
for all compression methods?

(If so, I think it would be better style to have a less opaque macro
name for the purpose.)

regards, tom lane




Re: [HACKERS] Custom compression methods

2021-03-22 Thread Robert Haas
On Sun, Mar 21, 2021 at 7:55 PM Justin Pryzby  wrote:
> Rebased again.

Thanks, Justin. I committed 0003 and 0004 together as
226e2be3876d0bda3dc33d16dfa0bed246b7b74f. I also committed 0001 and
0002 together as 24f0e395ac5892cd12e8914646fe921fac5ba23d, but with
some revisions, because your text was not clear that this is setting
the default for new tables, not new values; it also implied that this
only affects out-of-line compression, which is not true. In lieu of
trying to explain how TOAST works here, I added a link. It looks,
though, like that documentation also needs to be patched for this
change. I'll look into that, and your remaining patches, next.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: PG13 fails to startup even though the current transaction is equal to the target transaction

2021-03-22 Thread Fujii Masao




On 2021/03/22 21:40, Sean Jezewski wrote:

Hi Kyotaro -

Thanks for the response.

I think it boils down to your comment:

 > I'm not sure.  The direct cause of the "issue" is a promotion trigger
 > that came before reaching recovery target.  That won't happen if the
 > "someone" doesn't do that.

I think the question is 'under what conditions is it safe to do the promotion?'

What is your recommendation in this case? The end of the archive has been 
reached. All transactions have been replayed. And in fact the current 
transaction id is exactly equal to the target recovery transaction id.


I guess that the transaction with this current XID had not been committed
yet at that moment. Right? I thought that because you confirmed the XID
by SELECT pg_catalog.txid_snapshot_xmax(pg_catalog.txid_current_snapshot()).
IIUC this query doesn't return the XID of an already-committed transaction.

The standby considers that recovery has reached the recovery target when
the XID of a *committed* transaction becomes equal to recovery_target_xid.
So in your case, the standby had not reached the recovery target when you
requested the promotion. IMO this is why you got that FATAL error.
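A tiny sketch of that rule (hypothetical helper, loosely modeled on the
recovery-target checks in xlog.c): the recovery_target_xid comparison
happens only when a commit or abort record is replayed, so an XID that
has not committed can never satisfy the target.

```
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/*
 * Sketch: recovery tests the target XID only against the XIDs of
 * commit/abort records it replays.  A transaction that is merely in
 * progress on the primary never produces such a record, so the standby
 * keeps waiting -- and promoting at that point draws the FATAL error.
 */
static bool
xid_target_reached(bool is_commit_or_abort_record,
				   TransactionId record_xid,
				   TransactionId target_xid)
{
	return is_commit_or_abort_record && record_xid == target_xid;
}

int
main(void)
{
	/* XID 1000 seen only in heap records so far: target not reached. */
	printf("%d\n", xid_target_reached(false, 1000, 1000));
	/* Its commit record arrives: now the target is reached. */
	printf("%d\n", xid_target_reached(true, 1000, 1000));
	return 0;
}
```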

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Minimal logical decoding on standbys

2021-03-22 Thread Fabrízio de Royes Mello
On Thu, Mar 18, 2021 at 5:34 AM Drouvot, Bertrand wrote:
>
> Thanks!
>
> Just made minor changes to make it compile again on current master
(mainly had to take care of ResolveRecoveryConflictWithSnapshotFullXid(),
which was introduced in e5d8a99903).
>
> Please find enclosed the new patch version that currently passes "make
check" and the 2 associated TAP tests.
>

Unfortunately it's still not applying to the current master:

$ git am ~/Downloads/v10-000*.patch

Applying: Allow logical decoding on standby.
Applying: Add info in WAL records in preparation for logical slot conflict
handling.
error: patch failed: src/backend/access/nbtree/nbtpage.c:32
error: src/backend/access/nbtree/nbtpage.c: patch does not apply
Patch failed at 0002 Add info in WAL records in preparation for logical
slot conflict handling.
hint: Use 'git am --show-current-patch' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


> I'll have a look at the whole thread to check if there is anything else
waiting in the pipe regarding this feature, unless some of you know off the
top of your head?
>

Will do the same!!! But as far as I remember, the last time I checked,
everything discussed was covered in this patch set.

Regards,

--
   Fabrízio de Royes Mello
   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consulting, Development, 24x7 Support, and Training


Re: New IndexAM API controlling index vacuum strategies

2021-03-22 Thread Robert Haas
On Thu, Mar 18, 2021 at 9:42 PM Peter Geoghegan  wrote:
> The fact that we can *continually* reevaluate if an ongoing VACUUM is
> at risk of taking too long is entirely the point here. We can in
> principle end index vacuuming dynamically, whenever we feel like it
> and for whatever reasons occur to us (hopefully these are good reasons
> -- the point is that we get to pick and choose). We can afford to be
> pretty aggressive about not giving up, while still having the benefit
> of doing that when it *proves* necessary. Because: what are the
> chances of the emergency mechanism ending index vacuuming being the
> wrong thing to do if we only do that when the system clearly and
> measurably has no more than about 10% of the possible XID space to go
> before the system becomes unavailable for writes?

I agree. I was having trouble before understanding exactly what you
are proposing, but this makes sense to me and I agree it's a good
idea.

> > But ... should the thresholds for triggering these kinds of mechanisms
> > really be hard-coded with no possibility of being configured in the
> > field? What if we find out after the release is shipped that the
> > mechanism works better if you make it kick in sooner, or later, or if
> > it depends on other things about the system, which I think it almost
> > certainly does? Thresholds that can't be changed without a recompile
> > are bad news. That's why we have GUCs.
>
> I'm fine with a GUC, though only for the emergency mechanism. The
> default really matters, though -- it shouldn't be necessary to tune
> (since we're trying to address a problem that many people don't know
> they have until it's too late). I still like 1.8 billion XIDs as the
> value -- I propose that that be made the default.

I'm not 100% sure whether we need a new GUC for this or not. I think
that if by default this triggers at the 90% of the hard-shutdown
limit, it would be unlikely, and perhaps unreasonable, for users to
want to raise the limit. However, I wonder whether some users will
want to lower the limit. Would it be reasonable for someone to want to
trigger this at 50% or 70% of XID exhaustion rather than waiting until
things get really bad?

Also, one thing that I dislike about the current system is that, from
a user perspective, when something goes wrong, nothing happens for a
while and then the whole system goes bananas. It seems desirable to me
to find ways of gradually ratcheting up the pressure, like cranking up
the effective cost limit if we can somehow figure out that we're not
keeping up. If, with your mechanism, there's an abrupt point when we
switch from never doing this for any table to always doing this for
every table, that might not be as good as something which does this
"sometimes" and then, if that isn't enough to avoid disaster, does it
"more," and eventually ramps up to doing it always, if trouble
continues. I don't know whether that's possible here, or what it would
look like, or even whether it's appropriate at all in this particular
case, so I just offer it as food for thought.
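As one way to picture that ramp (made-up numbers and names, purely a
sketch of the idea, not anything in the patch): the effective cost limit
could scale smoothly with the consumed fraction of the XID budget
instead of flipping all at once.

```
#include <stdio.h>

/*
 * Sketch of a gradual failsafe: below half of the XID budget, vacuum is
 * throttled normally; from there the effective cost limit ramps up
 * linearly, reaching "essentially unthrottled" (10x here, arbitrarily)
 * as consumption approaches the hard limit.
 */
static int
effective_cost_limit(int base_limit, double xid_budget_used)	/* 0.0..1.0 */
{
	if (xid_budget_used < 0.5)
		return base_limit;

	double		ramp = (xid_budget_used - 0.5) / 0.5;	/* 0.0..1.0 */

	return (int) (base_limit * (1.0 + 9.0 * ramp));
}

int
main(void)
{
	for (double used = 0.3; used <= 0.95; used += 0.15)
		printf("used=%.2f -> limit=%d\n", used,
			   effective_cost_limit(200, used));
	return 0;
}
```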

> > On another note, I cannot say enough bad things about the function
> > name two_pass_strategy(). I sincerely hope that you're not planning to
> > create a function which is a major point of control for VACUUM whose
> > name gives no hint that it has anything to do with vacuum.
>
> You always hate my names for things. But that's fine by me -- I'm
> usually not very attached to them. I'm happy to change it to whatever
> you prefer.

My taste in names may be debatable, but including the subsystem name
in the function name seems like a pretty bare-minimum requirement,
especially when the other words in the function name could refer to
just about anything.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Disable WAL logging to speed up data loading

2021-03-22 Thread Stephen Frost
Greetings,

* tsunakawa.ta...@fujitsu.com (tsunakawa.ta...@fujitsu.com) wrote:
> From: Stephen Frost 
> > The argument here seems to stem from the idea that issuing a 'TRUNCATE'
> > inside the transaction before starting the 'COPY' command is 'too hard'.
> 
> > I could be wrong and perhaps opinions will change in the future, but it 
> > really
> > doesn't seem like the there's much support for adding a new WAL level just 
> > to
> > avoid doing the TRUNCATE.  Appealing to what other database systems
> > support can be helpful in some cases but that's usually when we're looking 
> > at a
> > new capability which multiple other systems have implemented.  This isn't
> > actually a new capability at all- the WAL level that you're looking for 
> > already
> > exists and already gives the optimization you're looking for, as long as 
> > you issue
> > a TRUNCATE at the start of the transaction.
> 
> No, we can't ask users to run TRUNCATE, because they want to add data to a table.

First- what are you expecting would actually happen during crash
recovery in this specific case with your proposed new WAL level?

Second, use partitioning, or unlogged tables (with the patch discussed
elsewhere to allow flipping them to logged without writing the entire
thing to WAL).

Thanks,

Stephen


signature.asc
Description: PGP signature

