Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-16 Thread Daniel Gustafsson
> On 16 Nov 2020, at 01:20, Michael Paquier  wrote:
> 
> On Sun, Nov 15, 2020 at 12:16:56PM -0500, Tom Lane wrote:
>> The obvious problem with this is that if !USE_OPENSSL, we will not have
>> pulled in openssl's headers.
> 
> FWIW, I argued upthread against including this part because it is
> useless: if not building with OpenSSL, we'll never have the base to be
> able to use RAND_poll().

How do you mean?  The OpenSSL PRNG can be used without setting up a context
etc.; the code in pg_strong_random is all we need to use it without USE_OPENSSL
(whether we'd like to is another story), or am I missing something?
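
To illustrate the point, here is a minimal standalone sketch (not the actual
pg_strong_random() code) showing that OpenSSL's PRNG needs nothing beyond
<openssl/rand.h> -- no SSL context or other initialization:

/* illustrative sketch only -- not the pg_strong_random() implementation */
#include <stdio.h>
#include <stdlib.h>
#include <openssl/rand.h>

int
main(void)
{
	unsigned char buf[16];

	/* re-seed the PRNG from the OS; important e.g. in a freshly forked process */
	if (RAND_poll() != 1)
		return EXIT_FAILURE;

	/* RAND_bytes() returns 1 on success */
	if (RAND_bytes(buf, sizeof(buf)) != 1)
		return EXIT_FAILURE;

	for (size_t i = 0; i < sizeof(buf); i++)
		printf("%02x", buf[i]);
	putchar('\n');
	return EXIT_SUCCESS;
}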

>> However ... all these machines are pointing at line 96, which is not
>> that one but the one under "#if defined(USE_OPENSSL)".  So I'm not sure
>> what to make of that, except that a bit more finesse seems required.
> 
> The build scripts of src/tools/msvc/ choose to not use OpenSSL as
> strong random source even if building with OpenSSL.  The top of the
> file only includes openssl/rand.h if using USE_OPENSSL_RANDOM.

The fallout here is precisely the reason why I argued for belts and suspenders
such that PRNG init is performed for (USE_OPENSSL || USE_OPENSSL_RANDOM).  I
don't trust the assumption that if one is there the other will always be there
as well, as long as they are disjoint.  Since we expose this PRNG to users, there
is a vector for spooling the rand state via UUID generation in case the PRNG is
faulty and has predictability, so failing to protect the after-fork case can
be expensive.  Granted, such vulnerabilities are rare but not inconceivable.

Now, this patch didn't get the header inclusion right, which is why this broke.

> Thinking about that afresh, I think that we got that wrong here on
> three points:
> - If attempting to use OpenSSL on Windows, let's just bite the bullet
> and use OpenSSL as random source, using Windows as source only when
> not building with OpenSSL.
> - Instead of using a call to RAND_poll() that we know will never work,
> let's just issue a compilation failure if attempting to use
> USE_OPENSSL_RANDOM without USE_OPENSSL.

Taking a step back, what is the use case of USE_OPENSSL_RANDOM if we force it to
be equal to USE_OPENSSL?  Shouldn't we in that case just have USE_OPENSSL,
adjust the logic and remove the below comment from configure.ac which isn't
really telling the truth?

  # Select random number source
  #
  # You can override this logic by setting the appropriate USE_*RANDOM flag to 1
  # in the template or configure command line.

I might be thick but I'm struggling to see the use for complications when we
don't support any pluggability.  Having said that, it might be the sane way in
the end to forcibly use the TLS library as a randomness source should there be
one (FIPS compliance comes to mind), but then we might as well spell that out.

> - rand.h needs to be included under USE_OPENSSL.


It needs to be included for both USE_OPENSSL and USE_OPENSSL_RANDOM unless we
combine them as per the above.

cheers ./daniel



Heads-up: macOS Big Sur upgrade breaks EDB PostgreSQL installations

2020-11-16 Thread Dave Page
Hi,

This is more of a heads-up than anything else, as I suspect this may come
up in various forums.

The PostgreSQL installers for macOS (from EDB, possibly others too) create
the data directory in /Library/PostgreSQL//data. This has been
the case since the first release, 10+ years ago.

It looks like the Big Sur upgrade has taken it upon itself to "fix" any
filesystem permissions it doesn't like. On my system, this resulted in the
data directory having 0755 permissions, which meant that PostgreSQL refused
to start. Manually changing the permissions back to 0700 (0750 should also
work) fixes the issue.

I'm not sure there's much we can do about this - systems that are likely to
be affected are already out there, and we obviously don't want to relax the
permissions Postgres requires.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com


Re: [PATCH] remove pg_archivecleanup and pg_standby

2020-11-16 Thread Michael Paquier
On Tue, Nov 03, 2020 at 05:28:46PM +0200, Heikki Linnakangas wrote:
> Removing pg_standby has been proposed a couple of times in the past. See 
> https://www.postgresql.org/message-id/20170913064824.rqflkadxwpboa...@alap3.anarazel.de
> for the latest attempt.
> 
> Masao-san, back in 2014 you mentioned "fast failover" as a feature that was
> missing from the built-in standby mode 
> (https://www.postgresql.org/message-id/CAHGQGwEE_8vvpQk0ex6Qa_aXt-OSJ7OdZjX4uM_FtqKfxq5SbQ%40mail.gmail.com).
> I think that's been implemented since, with the recovery_target settings.
> Would you agree?
> 
> I'm pretty sure we can remove pg_standby by now. But if there's something
> crucial missing from the built-in facilities, we need to talk about
> implementing them.

Reading the thread you are mentioning, it seems to me that the
status quo is the same, but I find it rather scary that this tool is used
in exactly zero tests.

Echoing with Robert, I think that pg_archivecleanup is still useful in
many cases, so that's not something we should remove.
--
Michael


signature.asc
Description: PGP signature


Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-11-16 Thread Masahiro Ikeda

On 2020-11-13 12:32, lchch1...@sina.cn wrote:

Now, pg_stat_wal supports resetting all information in WalStats
using the pg_stat_reset_shared('wal') function.
Isn't that enough?

Yes, it's OK; sorry, I missed this information.


OK.


3. I do not think the description in the documentation for
'wal_buffers_full' is correct.



Where should I rewrite the description? If my understanding is not
correct, please let me know.

Sorry, I have not described it clearly: I could not understand the meaning
of this column after reading the description in the documentation.
Now I have read the WAL-write source code and found the place where
'wal_buffers_full' is incremented: it is for a backend that has to wait for
a WAL buffer which is occupied by another WAL page, so the backend flushes
the old page in the WAL buffer (once it can).
So I think the original description in the documentation is not quite on
point; we could describe it as something like
'Total number of times WAL data was written to disk because a backend
needed a WAL buffer for an advanced WAL page'.

Sorry if my understanding is wrong.


Thanks for your comments.

Your understanding is almost the same as mine.
It describes the case where, when not only backends but also other
background processes initialize a new WAL page, the WAL buffers' space is
already used up and there is no room left.


'Total number of times WAL data was written to disk because a backend
needed a WAL buffer for an advanced WAL page'


Thanks for your suggestion.
I wondered whether users may be confused about how to use "wal_buffers_full"
and how to tune parameters.


I thought the reason why the WAL buffers have no space left is
important for users tuning the wal_buffers parameter.

How about the following wording?

'Total number of times WAL data was written to disk because the WAL
buffers became full when initializing a new WAL page'


Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION




Re: [HACKERS] logical decoding of two-phase transactions

2020-11-16 Thread Masahiko Sawada
On Mon, Nov 16, 2020 at 4:25 PM Ajin Cherian  wrote:
>
> Updated with a new test case
> (contrib/test_decoding/t/002_twophase-streaming.pl) that tests
> concurrent aborts during streaming prepare. Had to make a few changes
> to the test_decoding stream_start callbacks to handle
> "check-xid-aborted"
> the same way it was handled in the non stream callbacks. Merged
> Peter's v20-0006 as well.
>

Thank you for updating the patch.

I have a question about the timestamp of PREPARE on a subscriber node,
although this may have already been discussed.

With the current patch, the timestamps of PREPARE are different
between the publisher and the subscriber but the timestamp of their
commits are the same. For example,

-- There is 1 prepared transaction on a publisher node.
=# select * from pg_prepared_xacts;

 transaction | gid |           prepared            |  owner   | database
-------------+-----+-------------------------------+----------+----------
         510 | h1  | 2020-11-16 16:57:13.438633+09 | masahiko | postgres
(1 row)

-- This prepared transaction is replicated to a subscriber.
=# select * from pg_prepared_xacts;

 transaction | gid |           prepared            |  owner   | database
-------------+-----+-------------------------------+----------+----------
         514 | h1  | 2020-11-16 16:57:13.440593+09 | masahiko | postgres
(1 row)

These timestamps are different. Let's commit the prepared transaction
'h1' on the publisher and check the commit timestamps on both nodes.

-- On the publisher node.
=# select pg_xact_commit_timestamp('510'::xid);

   pg_xact_commit_timestamp
-------------------------------
 2020-11-16 16:57:13.474275+09
(1 row)

-- Commit prepared is also replicated to the subscriber node.
=# select pg_xact_commit_timestamp('514'::xid);

   pg_xact_commit_timestamp
-------------------------------
 2020-11-16 16:57:13.474275+09
(1 row)

The commit timestamps are the same. At PREPARE we use the local
timestamp of when PREPARE was executed as the 'prepared' time, while at COMMIT
PREPARED we use the origin's commit timestamp as the commit timestamp
if the commit WAL record has it.

This behaviour made me think of a possibility: if the clock of the
publisher is behind, then on the subscriber node the timestamp of COMMIT
PREPARED (i.e., the return value from pg_xact_commit_timestamp())
could be smaller than the timestamp of PREPARE (i.e., 'prepared_at' in
pg_prepared_xacts). I think it would not be a critical issue but I
think it might be worth discussing the behaviour.

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: Split copy.c

2020-11-16 Thread Heikki Linnakangas

On 16/11/2020 04:28, Justin Pryzby wrote:

On Tue, Nov 03, 2020 at 03:15:27PM +1300, David Rowley wrote:

On Tue, 3 Nov 2020 at 07:35, Andres Freund  wrote:


On 2020-11-02 19:43:38 +0200, Heikki Linnakangas wrote:

On 02/11/2020 19:23, Andres Freund wrote:

On 2020-11-02 11:03:29 +0200, Heikki Linnakangas wrote:

There isn't much common code between COPY FROM and COPY TO, so I propose
that we split copy.c into two: copyfrom.c and copyto.c. See attached. I think
that's much nicer.


Not quite convinced that's the right split - or perhaps there's just
more potential. My feeling is that splitting out all the DML related
code would make the code considerably easier to read.


What do you mean by DML related code?


Basically all the insertion related code (e.g CopyMultiInsert*, lots of
code in CopyFrom()) and perhaps also the type input invocations.


I quite like the fact that those are static and inline-able.  I very
much imagine there'd be a performance hit if we moved them out to
another .c file and made them extern.  Some of those functions can be
quite hot when copying into a partitioned table.


For another patch [0], I moved into copy.h:
+typedef struct CopyMultiInsertBuffer
+typedef struct CopyMultiInsertInfo
+CopyMultiInsertBufferInit(ResultRelInfo *rri)
+CopyMultiInsertInfoSetupBuffer(CopyMultiInsertInfo *miinfo,
+CopyMultiInsertInfoIsFull(CopyMultiInsertInfo *miinfo)
+CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo,
+CopyMultiInsertInfoNextFreeSlot(CopyMultiInsertInfo *miinfo,
+CopyMultiInsertInfoStore(CopyMultiInsertInfo *miinfo, ResultRelInfo *rri,

That's an experimental part 0002 of my patch in response to Simon's suggestion.
Maybe your response will be that variants of those interfaces should be added
to nodeModifyTable.[ch] instead of moving them.


Nice. I don't think that affects this patch too much.

I would suggest renaming the functions and structs to remove
the "Copy"-prefix. COPY uses them, but so does INSERT with the patch.



Currently I'm passing (void*)mtstate as cstate - if there were a
generic interface, that would be a void *state or so.
The functions only need cstate/mtstate to set the line number, for the 
error callback, and to access the transition_capture field. You could 
add a field for transition_capture in CopyMultiInsertInfo. For the line 
number, you could add a line number field in CopyMultiInsertInfo, set 
that in CopyMultiInsertBufferFlush() instead of cstate->cur_lineno, and 
teach CopyFromErrorCallback() to get the line number from there.
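
For illustration, a rough sketch of the struct change suggested above -- this
is not the actual copy.c definition, and the field names are assumptions:

/* sketch only: the existing buffer bookkeeping fields would stay as-is */
typedef struct CopyMultiInsertInfoSketch
{
	struct TransitionCaptureState *transition_capture;	/* held here instead of in cstate */
	unsigned long long cur_lineno;	/* set by CopyMultiInsertBufferFlush(),
									 * read by CopyFromErrorCallback() */
} CopyMultiInsertInfoSketch;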


- Heikki




Tab complete for CREATE OR REPLACE TRIGGER statement

2020-11-16 Thread Shinoda, Noriyoshi (PN Japan FSI)
Hi, Hackers.

 Yesterday, the OR REPLACE clause was added to the CREATE TRIGGER statement, so
I wrote a patch for tab completion in psql.
The patch adds tab completion for CREATE OR REPLACE TRIGGER, so that CREATE
TRIGGER and CREATE OR REPLACE TRIGGER statements are completed in the same way.
I referred to the tab completion code for the CREATE RULE statement.
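
For reference, a fragment illustrating roughly what such a change looks like in
src/bin/psql/tab-complete.c -- illustrative only, not the attached diff, and
assuming the existing TailMatches()/COMPLETE_WITH() macros:

	/* illustrative fragment: complete both spellings the same way */
	else if (TailMatches("CREATE", "TRIGGER", MatchAny) ||
			 TailMatches("CREATE", "OR", "REPLACE", "TRIGGER", MatchAny))
		COMPLETE_WITH("BEFORE", "AFTER", "INSTEAD OF");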

Regards,
Noriyoshi Shinoda


psql_create_or_replace_trigger_tab.diff
Description: psql_create_or_replace_trigger_tab.diff


Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-16 Thread Magnus Hagander
On Mon, Nov 16, 2020 at 10:19 AM Daniel Gustafsson  wrote:

> > On 16 Nov 2020, at 01:20, Michael Paquier  wrote:
> >
> > On Sun, Nov 15, 2020 at 12:16:56PM -0500, Tom Lane wrote:
> >> The obvious problem with this is that if !USE_OPENSSL, we will not have
> >> pulled in openssl's headers.
> >
> > FWIW, I argued upthread against including this part because it is
> > useless: if not building with OpenSSL, we'll never have the base to be
> > able to use RAND_poll().
>
> How do you mean?  The OpenSSL PRNG can be used without setting up a context
> etc, the code in pg_strong_random is all we need to use it without
> USE_OPENSSL
> (whether we'd like to is another story) or am I missing something?
>
> >> However ... all these machines are pointing at line 96, which is not
> >> that one but the one under "#if defined(USE_OPENSSL)".  So I'm not sure
> >> what to make of that, except that a bit more finesse seems required.
> >
> > The build scripts of src/tools/msvc/ choose to not use OpenSSL as
> > strong random source even if building with OpenSSL.  The top of the
> > file only includes openssl/rand.h if using USE_OPENSSL_RANDOM.
>
> The fallout here is precisely the reason why I argued for belts and
> suspenders
> such that PRNG init is performed for (USE_OPENSSL || USE_OPENSSL_RANDOM).
> I
> don't trust the assumption that if one is there the other will always be there
> as
> well as long as they are disjoint.  Since we expose this PRNG to users,
> there
> is a vector for spooling the rand state via UUID generation in case the
> PRNG is
> faulty and have predictability, so failing to protect the after-fork case
> can
> be expensive.  Granted, such vulnerabilities are rare but not
> inconcievable.
>
> Now, this patch didn't get the header inclusion right which is why thise
> broke.
>

> > Thinking about that afresh, I think that we got that wrong here on
> > three points:
> > - If attempting to use OpenSSL on Windows, let's just bite the bullet
> > and use OpenSSL as random source, using Windows as source only when
> > not building with OpenSSL.
> > - Instead of using a call to RAND_poll() that we know will never work,
> > let's just issue a compilation failure if attempting to use
> > USE_OPENSSL_RANDOM without USE_OPENSSL.
>
> Taking a step back, what is the use case of USE_OPENSSL_RANDOM if we force
> it to
> be equal to USE_OPENSSL?  Shouldn't we in that case just have USE_OPENSSL,
> adjust the logic and remove the below comment from configure.ac which
> isn't
> really telling the truth?


>   # Select random number source
>   #
>   # You can override this logic by setting the appropriate USE_*RANDOM
> flag to 1
>   # in the template or configure command line.
>
> I might be thick but I'm struggling to see the use for complications when
> we
> don't support any pluggability.  Having said that, it might be the sane
> way in
> the end to forcibly use the TLS library as a randomness source should
> there be
> one (FIPS compliance comes to mind), but then we might as well spell that
> out.
>
> > - rand.h needs to be included under USE_OPENSSL.
>
>
> It needs to be included for both USE_OPENSSL and USE_OPENSSL_RANDOM unless
> we
> combine them as per the above.
>


I agree with those -- either we remove the ability to choose random source
independently of the SSL library (and then only use the windows crypto
provider or /dev/urandom as platform-specific choices when *no* SSL library
is used), and in that case we should not have separate #ifdef's for them.

Or we fix the includes. Which is obviously easier, but we should take the
time to do what we think is right long-term of course.

Keeping two defines and an extra configure check when they mean the same
thing seems like the worst combination of the two.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Delay of standby shutdown

2020-11-16 Thread Fujii Masao




On 2020/11/13 2:58, Soumyadeep Chakraborty wrote:

Thanks! Marking this as ready for committer.


Pushed. Thanks!

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Cache relation sizes?

2020-11-16 Thread Konstantin Knizhnik




On 16.11.2020 10:11, Thomas Munro wrote:

On Tue, Aug 4, 2020 at 2:21 PM Thomas Munro  wrote:

On Tue, Aug 4, 2020 at 3:54 AM Konstantin Knizhnik
 wrote:

This shared relation cache can easily store relation size as well.
In addition it will solve a lot of other problems:
- noticeable overhead of local relcache warming
- large memory consumption in case of larger number of relations
O(max_connections*n_relations)
- sophisticated invalidation protocol and related performance issues
Certainly access to shared cache requires extra synchronization.But DDL
operations are relatively rare.
So in most cases we will have only shared locks. May be overhead of
locking will not be too large?

Yeah, I would be very happy if we get a high performance shared
sys/rel/plan/... caches in the future, and separately, having the
relation size available in shmem is something that has come up in
discussions about other topics too (tree-based buffer mapping,
multi-relation data files, ...).  ...

After recent discussions about the limitations of relying on SEEK_END
in a nearby thread[1], I decided to try to prototype a system for
tracking relation sizes properly in shared memory.  Earlier in this
thread I was talking about invalidation schemes for backend-local
caches, because I only cared about performance.  In contrast, this new
system has SMgrRelation objects that point to SMgrSharedRelation
objects (better names welcome) that live in a pool in shared memory,
so that all backends agree on the size.  The scheme is described in
the commit message and comments.  The short version is that smgr.c
tracks the "authoritative" size of any relation that has recently been
extended or truncated, until it has been fsync'd.  By authoritative, I
mean that there may be dirty buffers in that range in our buffer pool,
even if the filesystem has vaporised the allocation of disk blocks and
shrunk the file.

That is, it's not really a "cache".  It's also not like a shared
catalog, which Konstantin was talking about... it's more like the pool
of inodes in a kernel's memory.  It holds all currently dirty SRs
(SMgrSharedRelations), plus as many clean ones as it can fit, with
some kind of reclamation scheme, much like buffers.  Here, "dirty"
means the size changed.
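
For illustration, a rough sketch of the kind of shared object described above --
the name SMgrSharedRelation comes from the mail, but the fields here are
assumptions for illustration, not the prototype's actual layout:

#include <stdbool.h>

/* illustrative fields only -- not the prototype's real definition */
typedef struct SMgrSharedRelationSketch
{
	/* identity of the relation fork this pool entry describes */
	unsigned int dbOid;
	unsigned int relOid;
	int			forknum;

	/* authoritative size, maintained from extend/truncate until fsync'd */
	unsigned int nblocks;

	/* "dirty" means the size changed and has not yet reached the filesystem */
	bool		dirty;
} SMgrSharedRelationSketch;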


I will look at your implementation more closely later.
But for now I just want to ask one question: is it reasonable to spend a
substantial amount of effort trying to solve
the problem of redundant calculations of relation size, which seems to
be just part of a more general problem: a shared relation cache?


It is a well-known problem: private caches of the system catalog can cause
unacceptable memory consumption for a large number of backends.
If there are thousands of relations in the database (which is not such a rare
case, especially taking into account ORMs and temporary tables),
then the size of the catalog cache can be several megabytes. If that is
multiplied by a thousand backends, then we get gigabytes of memory used just for
private catalog caches. Thanks to Andres' optimizations of snapshot taking,
Postgres can now handle thousands of connections normally.
So this extremely inefficient use of memory for private catalog caches
becomes more and more critical.
Also, private caches require a sophisticated and error-prone invalidation
mechanism.


If we have such a shared cache, then keeping the shared relation size
seems to be a trivial task, doesn't it?
So maybe, before "diving" into the problem of maintaining a shared pool
of dirtied "inodes", we should first discuss and concentrate on solving the
more generic problem?







Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-11-16 Thread a . pervushina

Hello,

I've changed the BEGIN WAIT FOR LSN statement to core functions 
pg_waitlsn, pg_waitlsn_infinite and pg_waitlsn_no_wait.
Currently the functions work inside repeatable read transactions, but 
waitlsn creates a snapshot if called first in a transaction block, which
can possibly lead to the transaction working incorrectly, so the
function gives a warning.


Usage examples
==
select pg_waitlsn('LSN', timeout);
select pg_waitlsn_infinite('LSN');
select pg_waitlsn_no_wait('LSN');

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9d3f1c12fc5..27e964c8111 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -42,6 +42,7 @@
 #include "catalog/pg_database.h"
 #include "commands/progress.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/controldata_utils.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
@@ -7387,6 +7388,15 @@ StartupXLOG(void)
 	break;
 }
 
+/*
+ * If we replayed an LSN that someone was waiting for,
+ * set latches in shared memory array to notify the waiter.
+ */
+if (XLogCtl->lastReplayedEndRecPtr >= GetMinWaitedLSN())
+{
+	WaitSetLatch(XLogCtl->lastReplayedEndRecPtr);
+}
+
 /* Else, try to fetch the next WAL record */
 record = ReadRecord(xlogreader, LOG, false);
 			} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index d4815d3ce65..9b310926c12 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -57,6 +57,7 @@ OBJS = \
 	user.o \
 	vacuum.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 000..ac18040c362
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,297 @@
+/*-
+ *
+ * wait.c
+ *	  Implements waitlsn, which allows waiting for events such as
+ *	  LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include 
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/backendid.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+
+/* Add to shared memory array */
+static void AddWaitedLSN(XLogRecPtr lsn_to_wait);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64 min_lsn; /* XLogRecPtr of minimal waited for LSN */
+	slock_t		mutex;
+	/* LSNs that different backends are waiting */
+	XLogRecPtr	lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static WaitState *state;
+
+/*
+ * Add the wait event of the current backend to shared memory array
+ */
+static void
+AddWaitedLSN(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < state->min_lsn.value)
+		state->min_lsn.value = lsn_to_wait;
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete wait event of the current backend from the shared memory array.
+ */
+void
+DeleteWaitedLSN(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete;
+
+	SpinLockAcquire(&state->mutex);
+
+	lsn_to_delete = state->lsn[MyBackendId];
+	state->lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	/* If we are deleting the minimal LSN, then choose the next min_lsn */
+	if (lsn_to_delete != InvalidXLogRecPtr &&
+		lsn_to_delete == state->min_lsn.value)
+	{
+		state->min_lsn.value = PG_UINT64_MAX;
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->lsn[i] != InvalidXLogRecPtr &&
+state->lsn[i] < state->min_lsn.value)
+state->min_lsn.value = state->lsn[i];
+	}
+
+	/* If deleting from the end of the array, shorten the array's used part */
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->lsn[i] != InvalidXLogRecPtr)
+			{
+state->backend_maxid = i;
+break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/*
+ * Initialize an array of events to wait for in shared memory
+ */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+

Re: Online verification of checksums

2020-11-16 Thread Magnus Hagander
On Mon, Nov 16, 2020 at 1:23 AM Michael Paquier  wrote:

> On Sun, Nov 15, 2020 at 04:37:36PM +0100, Magnus Hagander wrote:
> > On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier 
> wrote:
> >> On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
> >>> I was referring to the patch I sent on this thread that fixes the
> >>> detection of a corruption for the zero-only case and where pd_lsn
> >>> and/or pg_upper are trashed by a corruption of the page header.  Both
> >>> cases allow a base backup to complete on HEAD, while sending pages
> >>> that could be corrupted, which is wrong.  Once you make the page
> >>> verification rely only on pd_checksum, as the patch does because the
> >>> checksum is the only source of truth in the page header, corrupted
> >>> pages are correctly detected, causing pg_basebackup to complain as it
> >>> should.  However, it has also the risk to cause pg_basebackup to fail
> >>> *and* to report as broken pages that are in the process of being
> >>> written, depending on how slow a disk is able to finish a 8kB write.
> >>> That's a different kind of wrongness, and users have two more reasons
> >>> to be pissed.  Note that if a page is found as torn we have a
> >>> consistent page header, meaning that on HEAD the PageIsNew() and
> >>> PageGetLSN() would pass, but the checksum verification would fail as
> >>> the contents at the end of the page does not match the checksum.
> >>
> >> Magnus, as the original committer of 4eb77d5, do you have an opinion
> >> to share?
> >>
> >
> > I admit that I at some point lost track of the overlapping threads around
> > this, and just figured there was enough different
> checksum-involved-people
> > on those threads to handle it :) Meaning the short answer is "no, I don't
> > really have one at this point".
> >
> > Slightly longer comment is that it does seem reasonable, but I have not
> > read in on all the different issues discussed over the whole thread, so
> > take that as a weak-certainty comment.
>
> Which part are you considering as reasonable?  The removal-feature
> part on a stable branch or perhaps something else?
>

I was referring to the latest patch on the thread. But as I said, I have
not read up on all the different issues raised in the thread, so take it
with a big grain of salt.

And I would also echo the previous comment that this code was adapted from
what the pgbackrest folks do. As such, it would be good to get a comment
from for example David on that -- I don't see any of them having commented
after that was mentioned?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Evaluate expression at planning time for two more cases

2020-11-16 Thread Pavel Borisov
Thank you for working on this!
I looked into this patch a little. I may be wrong, but my opinion is that
planner/optimizer-related changes are generally not without dangers.  So
anyway, they should be justified by a performance increase, or the previous
behavior should be considered totally wrong. Patching something that is just a
little sub-optimal does not seem necessary to me.

So it would be very good to see measurements of a performance gain from this
patch. And I also think tests with partitioned and inherited relations,
demonstrating correct behavior in the cases discussed in the thread, should be
a must-do for this patch.

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

The new status of this patch is: Waiting on Author


Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-11-16 Thread Alexander Korotkov
Hi!

On Mon, Nov 16, 2020 at 1:09 PM  wrote:
> I've changed the BEGIN WAIT FOR LSN statement to core functions
> pg_waitlsn, pg_waitlsn_infinite and pg_waitlsn_no_wait.
> Currently the functions work inside repeatable read transactions, but
> waitlsn creates a snapshot if called first in a transaction block, which
> can possibly lead to the transaction working incorrectly, so the
> function gives a warning.
>
> Usage examples
> ==
> select pg_waitlsn('LSN', timeout);
> select pg_waitlsn_infinite('LSN');
> select pg_waitlsn_no_wait('LSN');

The name pg_waitlsn_no_wait() looks confusing.  I've tried to see how
it's documented, but the patch doesn't contain documentation...

--
Regards,
Alexander Korotkov




doc CREATE INDEX

2020-11-16 Thread Erik Rijkers

This one seems to have fallen by the wayside.

Erik
--- doc/src/sgml/ref/create_index.sgml.orig	2020-11-16 13:04:29.923760413 +0100
+++ doc/src/sgml/ref/create_index.sgml	2020-11-16 13:04:54.260093095 +0100
@@ -746,7 +746,7 @@
   
 
   
-   The regularly system collects statistics on all of a table's
+   The system regularly collects statistics on all of a table's
columns.  Newly-created non-expression indexes can immediately
use these statistics to determine an index's usefulness.
For new expression indexes, it is necessary to run 

Re: More time spending with "delete pending"

2020-11-16 Thread Juan José Santamaría Flecha
On Sun, Nov 15, 2020 at 4:00 PM Alexander Lakhin 
wrote:

> As I've found out, readdir() replacement for Windows in fact gets all
> the needed information (correct file size, attributes...) in
> WIN32_FIND_DATA, but it just leaves it inside and returns only fields of
> the dirent structure. So pg_ls_dir_files() (that calls
> ReadDir()/readdir()) needs to get this information again, that leads to
> opening a file on Windows.
> I think it can be done more effectively by adding a new function
> ReadDirExtendedInfo(), that will additionally accept a pointer to
> "struct dirent_extra" with fields {valid (indicating that the structure
> is filled), attributes, file size, mtime}. Maybe the advanced function
> could also invoke stat() inside (not on Windows).
>
> As to patch I proposed before, I think it's still needed to fully
> support the following usage pattern:
> stat();
> if (no_error) {
> do_something();
> } else if (file_not_found) {
> do_something_else();
> } else {
> error();
> }
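
For illustration, a rough sketch of the ReadDirExtendedInfo() idea described
above -- the struct fields follow the quoted proposal, while the exact
signature is an assumption, not committed code:

#include <stdbool.h>
#include <dirent.h>
#include <sys/types.h>
#include <time.h>

/* sketch of the proposed dirent_extra (fields per the proposal above) */
typedef struct dirent_extra
{
	bool		valid;			/* the fields below have been filled in */
	unsigned int attributes;	/* e.g. FILE_ATTRIBUTE_* bits on Windows */
	off_t		size;			/* file size */
	time_t		mtime;			/* last modification time */
} dirent_extra;

/*
 * Like ReadDir(), but also hands back the metadata that the Windows
 * readdir() port already has in WIN32_FIND_DATA (and that stat() would
 * provide elsewhere), avoiding a second open of the file.
 */
struct dirent *ReadDirExtendedInfo(DIR *dir, const char *dirname,
								   dirent_extra *extra);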


We are currently missing a WIN32 lstat() port. I was thinking about
proposing a patch to implement it using GetFileAttributesEx(). That might
work as fall back to the CreateFile() if the file attribute is not a
FILE_ATTRIBUTE_REPARSE_POINT.

Anyhow, I do not think any retry logic should be in the stat() function,
but in the caller.

Regards,

Juan José Santamaría Flecha



Re: ECPG: proposal for new DECLARE STATEMENT

2020-11-16 Thread Shawn Wang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Recently I have been doing some work on ECPG, so I reviewed this patch. No
problems were found.

The new status of this patch is: Ready for Committer


Re: Get memory contexts of an arbitrary backend process

2020-11-16 Thread torikoshia

On 2020-10-28 15:32, torikoshia wrote:

On 2020-10-23 13:46, Kyotaro Horiguchi wrote:



I think we might need to step-back to basic design of this feature
since this patch seems to have unhandled corner cases that are
difficult to find.


I've written out the basic design below and attached
corresponding patch.

  # Communication flow between the dumper and the requestor (a rough sketch of
the shared state is included after this description)
  - (1) When requesting memory context dumping, the requestor adds an entry
to the shared memory. The entry manages the dump state and it is set to
'REQUESTING'.

  - (2) The requestor sends the signal to the dumper and waits on the latch.
  - (3) The dumper looks into the corresponding shared memory entry and
changes its state to 'DUMPING'.
  - (4) When the dumper completes dumping, it changes the state to
'DONE' and sets the latch.
  - (5) The requestor reads the dump file and shows it to the user.
Finally, the requestor removes the dump file and resets the shared memory
entry.


  # Query cancellation
  - When the requestor cancels dumping, e.g. signaling using ctrl-C, the
requestor changes the status of the shared memory entry to 'CANCELING'.
  - The dumper checks the status when it tries to change the state to
'DONE' at (4), and if the state is 'CANCELING', it removes the dump file
and resets the shared memory entry.


  # Cleanup of the dump file and the shared memory entry
  - In the normal case, the requestor removes the dump file and resets the
shared memory entry as described in (5).
  - When something like query cancellation or process termination
happens on the dumper after (1) and before (3), in other words, while the
state is 'REQUESTING', the requestor does the cleanup.
  - When something happens on the dumper or the requestor after (3) and
before (4), in other words, while the state is 'DUMPING', the dumper does the
cleanup. Specifically, if the requestor cancels the query, it just
changes the state to 'CANCELING' and the dumper notices it and cleans up
things later. OTOH, when the dumper fails to dump, it cleans up the dump
file and deletes the entry in the shared memory.
  - When something happens on the requestor after (4), i.e., while the state
is 'DONE', the requestor does the cleanup.
  - In the case of receiving SIGKILL or a power failure, all dump files
are removed in the crash recovery process.
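
A rough sketch of the shared-memory bookkeeping implied by the description
above -- the names and fields here are assumptions for illustration, not the
patch's actual definitions:

/* illustrative names only, not the patch's definitions */
typedef enum McxtDumpState
{
	MCXTDUMP_REQUESTING,	/* requestor created the entry; dumper not started yet */
	MCXTDUMP_DUMPING,		/* dumper is writing the dump file */
	MCXTDUMP_DONE,			/* dump complete; requestor reads and cleans up */
	MCXTDUMP_CANCELING		/* requestor cancelled; dumper cleans up when it notices */
} McxtDumpState;

typedef struct McxtDumpEntry
{
	int			dumper_pid;		/* PID of the backend being asked to dump */
	McxtDumpState state;		/* decides who is responsible for cleanup */
} McxtDumpEntry;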



Although there was a suggestion that shared memory hash
table should be changed to more efficient structures,
I haven't done it in this patch.
I think it can be treated separately, I'm going to work
on that later.


On 2020-11-11 00:07, Georgios Kokolatos wrote:

Hi,

I noticed that this patch fails on the cfbot.
For this, I changed the status to: 'Waiting on Author'.

Cheers,
//Georgios

The new status of this patch is: Waiting on Author


Thanks for your notification; I have updated the patch.
Changed the status to: 'Waiting on Author'.

Regards,

--
Atsushi TorikoshiFrom c6d06b11d16961acd59bfa022af52cb5fc668b3e Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Mon, 16 Nov 2020 11:49:03 +0900
Subject: [PATCH v4] Enabled pg_get_backend_memory_contexts() to collect
 arbitrary backend process's memory contexts.

Previously, pg_get_backend_memory_contexts() could only get the
local memory contexts. This patch enables it to get the memory contexts
of an arbitrary backend process whose PID is specified by the
argument.
---
 src/backend/access/transam/xlog.c|   7 +
 src/backend/catalog/system_views.sql |   4 +-
 src/backend/postmaster/pgstat.c  |   3 +
 src/backend/replication/basebackup.c |   3 +
 src/backend/storage/ipc/ipci.c   |   2 +
 src/backend/storage/ipc/procsignal.c |   4 +
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/backend/tcop/postgres.c  |   5 +
 src/backend/utils/adt/mcxtfuncs.c| 615 ++-
 src/backend/utils/init/globals.c |   1 +
 src/bin/initdb/initdb.c  |   3 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |   4 +-
 src/bin/pg_rewind/filemap.c  |   3 +
 src/include/catalog/pg_proc.dat  |  11 +-
 src/include/miscadmin.h  |   1 +
 src/include/pgstat.h |   3 +-
 src/include/storage/procsignal.h |   1 +
 src/include/utils/mcxtfuncs.h|  52 ++
 src/test/regress/expected/rules.out  |   2 +-
 19 files changed, 697 insertions(+), 28 deletions(-)
 create mode 100644 src/include/utils/mcxtfuncs.h

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a1078a7cfc..f628fa8b53 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -73,6 +73,7 @@
 #include "storage/sync.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
+#include "utils/mcxtfuncs.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
 #include "utils/relmapper.h"
@@ -6986,6 +6987,12 @@ StartupXLOG(void)
 		 */
 		pgstat_reset_all();
 
+		/*
+		 * Reset 

Re: Zedstore - compressed in-core columnar storage

2020-11-16 Thread Merlin Moncure
On Thu, Nov 12, 2020 at 4:40 PM Tomas Vondra
 wrote:
>                master   zedstore/pglz   zedstore/lz4
>   --------------------------------------------------
>   copy           1855           68092           2131
>   dump            751             905            811
>
> And the size of the lineitem table (as shown by \d+) is:
>
>   master: 64GB
>   zedstore/pglz: 51GB
>   zedstore/lz4: 20GB
>
> It's mostly expected lz4 beats pglz in performance and compression
> ratio, but this seems a bit too extreme I guess. Per past benchmarks
> (e.g. [1] and [2]) the difference in compression/decompression time
> should be maybe 1-2x or something like that, not 35x like here.

I can't speak to the ratio, but in basic backup/restore scenarios pglz
is absolutely killing me; performance is just awful; we are CPU-bound
in backups throughout the department.  Installations defaulting to
pglz will make this feature show very poorly.

merlin




Re: Terminate the idle sessions

2020-11-16 Thread Li Japin
Hi Kuroda,

On Nov 16, 2020, at 1:22 PM, 
kuroda.hay...@fujitsu.com wrote:


@@ -30,6 +30,7 @@ typedef enum TimeoutId
STANDBY_DEADLOCK_TIMEOUT,
STANDBY_TIMEOUT,
STANDBY_LOCK_TIMEOUT,
+ IDLE_SESSION_TIMEOUT,
IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
/* First user-definable timeout reason */
USER_TIMEOUT,

I'm not familiar with timeouts, but I can see that the priority of the
idle-session timeout is set lower than the idle-in-transaction timeout.
Could you explain the reason? In my understanding this timeout sits at the
lowest layer, so it might have the lowest
priority.

My apologies! I just added an enum value for the idle-session timeout and
ignored the comment that says the enum is ordered by priority.
Fixed as follows:

@@ -30,8 +30,8 @@ typedef enum TimeoutId
STANDBY_DEADLOCK_TIMEOUT,
STANDBY_TIMEOUT,
STANDBY_LOCK_TIMEOUT,
-   IDLE_SESSION_TIMEOUT,
IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+   IDLE_SESSION_TIMEOUT,
/* First user-definable timeout reason */
USER_TIMEOUT,
/* Maximum number of timeout reasons */
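
For illustration, a rough sketch of how such a timeout is typically wired up
with the existing timeout.h machinery -- the function names and GUC handling
here are assumptions, not necessarily what the patch does:

#include "postgres.h"
#include "utils/timeout.h"

/* assumed handler; the real patch decides how the idle session is terminated */
static void
IdleSessionTimeoutHandler(void)
{
}

/* called once during backend startup */
static void
InitIdleSessionTimeout(void)
{
	RegisterTimeout(IDLE_SESSION_TIMEOUT, IdleSessionTimeoutHandler);
}

/* called whenever the backend goes idle; disable_timeout() when work arrives */
static void
StartIdleSessionTimeout(int idle_session_timeout_ms)
{
	if (idle_session_timeout_ms > 0)
		enable_timeout_after(IDLE_SESSION_TIMEOUT, idle_session_timeout_ms);
}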

Thanks for your review! Attached.

--
Best regards
Japin Li






v5-0001-Allow-terminating-the-idle-sessions.patch
Description: v5-0001-Allow-terminating-the-idle-sessions.patch


v5-0002-Optimize-setitimer-usage.patch
Description: v5-0002-Optimize-setitimer-usage.patch


Re: doc CREATE INDEX

2020-11-16 Thread Alvaro Herrera
On 2020-Nov-16, Erik Rijkers wrote:

> This one seems to have fallen by the wayside.

So it did.

Pushed to all three branches, thanks.





Re: Add important info about ANALYZE after create Functional Index

2020-11-16 Thread Justin Pryzby
On Thu, Nov 12, 2020 at 06:01:02PM -0500, Bruce Momjian wrote:
> On Thu, Nov 12, 2020 at 03:11:43PM -0600, Justin Pryzby wrote:
> > I guess it should say "The system regularly ..."
> > 
> > Also, the last sentence begins "For new expression indexes" and ends with
> > "about new expression indexes", which I guess could instead say "about the
> > expressions".
> 
> How is this followup patch?

I see Alvaro already patched the first issue at bcbd77133.

The problematic language was recently introduced, and I'd reported at:
https://www.postgresql.org/message-id/20201112211143.GL30691%40telsasoft.com
And Erik at:
https://www.postgresql.org/message-id/e92b3fba98a0c0f7afc0a2a37e765954%40xs4all.nl

-- 
Justin




Re: cutting down the TODO list thread

2020-11-16 Thread John Naylor
On Wed, Nov 11, 2020 at 4:45 PM John Naylor 
wrote:

> Here is the next section on data types, proposed to be moved to the "not
> worth doing" page. As before, if there are any objections, do speak up.
> I'll make the move in a few days.
>

Hearing no objection, these have been moved over.

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-11-16 Thread Paul Guo



> On Nov 13, 2020, at 7:21 PM, Bharath Rupireddy 
>  wrote:
> 
> On Tue, Nov 10, 2020 at 3:47 PM Paul Guo  wrote:
>> 
>> Thanks for doing this. There might be another solution - use raw insert 
>> interfaces (i.e. raw_heap_insert()).
>> Attached is the test (not formal) patch that verifies this idea. 
>> raw_heap_insert() writes the page into the
>> table files directly and also write the FPI xlog when the tuples filled up 
>> the whole page. This seems be
>> more efficient.
>> 
> 
> Thanks. Will the new raw_heap_insert() APIs scale well (i.e. extend
> the table parallelly) with parallelism? The existing
> table_multi_insert() API scales well, see, for instance, the benefit
> with parallel copy[1] and parallel multi inserts in CTAS[2].

Yes, definitely some work needs to be done to make the raw heap insert
interfaces fit the parallel work, but
it seems that there are no hard blocking issues for this?

> 
> [1] - 
> https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com
> [2] - 
> https://www.postgresql.org/message-id/CALj2ACWFq6Z4_jd9RPByURB8-Y8wccQWzLf%2B0-Jg%2BKYT7ZO-Ug%40mail.gmail.com
> 
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com





Re: Skip ExecCheckRTPerms in CTAS with no data

2020-11-16 Thread Peter Eisentraut

On 2020-11-16 04:04, Michael Paquier wrote:

On Fri, Nov 13, 2020 at 12:58:52PM +0530, Bharath Rupireddy wrote:

It's not required to set bistate to null as we have allocated myState
with palloc0 in CreateIntoRelDestReceiver, it will anyways be null.
   if (!into->skipData)
 myState->bistate = GetBulkInsertState();

Attaching v4 patch. Please review it.


I have reviewed this one this morning, and applied it after some
tweaks.  I have reworded some of the comments, fixed some typos, and
largely refactored the test cases to stress all the combinations
possible.  Please note that your patch would have caused failures
in the buildfarm, as any role created needs to be prefixed with
"regress_".


While this patch was nice enough to update the documentation about the 
requirement of the INSERT privilege, this is maybe more confusing now: 
How could a new table not have INSERT privilege?  Yes, you can do that 
with default privileges, but that's not well known and should be 
clarified in the documentation.


The SQL standard says that for CREATE TABLE AS, the INSERT "is 
effectively executed without further Access Rule checking", which means 
the INSERT privilege shouldn't be required at all.  I suggest we 
consider doing that instead.  I don't see a use for the current behavior.





Re: default result formats setting

2020-11-16 Thread Andrew Dunstan


On 11/9/20 5:10 AM, Peter Eisentraut wrote:
> On 2020-11-05 22:03, Peter Eisentraut wrote:
>>> Independently of that, how would you implement "says otherwise" here,
>>> ie do a single-query override of the session's prevailing setting?
>>> Maybe the right thing for that is to define -1 all the way down to the
>>> protocol level as meaning "use the session's per-type default", and
>>> then if you don't want that you can pass 0 or 1.  An advantage of that
>>> is that you couldn't accidentally break an application that wasn't
>>> ready for this feature, because it would not be the default to use it.
>> Yeah, that sounds a lot better.  I'll look into that.
>
> Here is a new patch updated to work that way.  Feels better now.
>

I think this is conceptually OK, although it feels a bit odd.

Might it be better to have the values as typename={binary,text} pairs
instead of oid={0,1} pairs, which are fairly opaque? That might make
things easier for things like UDTs where the oid might not be known or
constant.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-11-16 Thread Bharath Rupireddy
On Mon, Nov 16, 2020 at 8:02 PM Paul Guo  wrote:
>
> > On Nov 13, 2020, at 7:21 PM, Bharath Rupireddy 
> >  wrote:
> >
> > On Tue, Nov 10, 2020 at 3:47 PM Paul Guo  wrote:
> >>
> >> Thanks for doing this. There might be another solution - use raw insert 
> >> interfaces (i.e. raw_heap_insert()).
> >> Attached is the test (not formal) patch that verifies this idea. 
> >> raw_heap_insert() writes the page into the
> >> table files directly and also write the FPI xlog when the tuples filled up 
> >> the whole page. This seems be
> >> more efficient.
> >>
> >
> > Thanks. Will the new raw_heap_insert() APIs scale well (i.e. extend
> > the table parallelly) with parallelism? The existing
> > table_multi_insert() API scales well, see, for instance, the benefit
> > with parallel copy[1] and parallel multi inserts in CTAS[2].
>
> Yes definitely some work needs to be done to make raw heap insert interfaces 
> fit the parallel work, but
> it seems that there is no hard blocking issues for this?
>

I may be wrong here. If we were to allow raw heap insert APIs to
handle parallelism, shouldn't we need some sort of shared memory to
allow coordination among workers? If we do so, at the end, aren't
these raw insert APIs equivalent to current table_multi_insert() API
which uses a separate shared ring buffer(bulk insert state) for
insertions?

And can we think of these raw insert APIs similar to the behaviour of
table_multi_insert() API for unlogged tables?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Add important info about ANALYZE after create Functional Index

2020-11-16 Thread Alvaro Herrera
On 2020-Nov-12, Bruce Momjian wrote:

> For new expression indexes, it is necessary to run <link linkend="sql-analyze"><command>ANALYZE</command></link> or wait for
> the autovacuum daemon to analyze
> -   the table to generate statistics about new expression indexes.
> +   the table to generate statistics for these indexes.

Looks good to me.





Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-16 Thread Tom Lane
Magnus Hagander  writes:
> I agree with those -- either we remove the ability to choose random source
> independently of the SSL library (and then only use the windows crypto
> provider or /dev/urandom as platform-specific choices when *no* SSL library
> is used), and in that case we should not have separate #ifdef's for them.
> Or we fix the includes. Which is obviously easier, but we should take the
> time to do what we think is right long-term of course.

FWIW, I'd vote for the former.  I think the presumption that OpenSSL's
random-number machinery can be used without any other initialization is
shaky as heck.

regards, tom lane




Re: Heads-up: macOS Big Sur upgrade breaks EDB PostgreSQL installations

2020-11-16 Thread Jonathan S. Katz
On 11/16/20 4:27 AM, Dave Page wrote:
> Hi,
> 
> This is more of a head-ups than anything else, as I suspect this may
> come up in various forums.
> 
> The PostgreSQL installers for macOS (from EDB, possibly others too)
> create the data directory in /Library/PostgreSQL//data. This
> has been the case since the first release, 10+ years ago.
> 
> It looks like the Big Sur upgrade has taken it upon itself to "fix" any
> filesystem permissions it doesn't like. On my system, this resulted in
> the data directory having 0755 permissions, which meant that PostgreSQL
> refused to start. Manually changing the permissions back to 0700 (0750
> should also work) fixes the issue.
> 
> I'm not sure there's much we can do about this - systems that are likely
> to be affected are already out there, and we obviously don't want to
> relax the permissions Postgres requires.

Thanks for raising this. We should provide some guidance on this when
upgrading to Big Sur.

Do we know where the other macOS installers place their data
directories? We should reach out to the installer maintainers to see if
they are seeing the same behavior so we know what guidance to issue.

Thanks,

Jonathan


OpenPGP_0xF1049C729F1C6527.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: Add important info about ANALYZE after create Functional Index

2020-11-16 Thread Alvaro Herrera
On 2020-Nov-16, Justin Pryzby wrote:

> I see Alvaro already patched the first issue at bcbd77133.
> 
> The problematic language was recently introduced, and I'd reported at:
> https://www.postgresql.org/message-id/20201112211143.GL30691%40telsasoft.com
> And Erik at:
> https://www.postgresql.org/message-id/e92b3fba98a0c0f7afc0a2a37e765954%40xs4all.nl

Yeah, sorry I didn't notice you had already reported it.




Re: PoC/WIP: Extended statistics on expressions

2020-11-16 Thread Tomas Vondra
On 11/16/20 2:49 PM, Tomas Vondra wrote:
> Hi,
>
> ...
>
> 4) apply the statistics
> 
>This is the hard part, really, and the exact state of the support
>depends on type of statistics.
> 
>For ndistinct coefficients, it generally works. I'm sure there may be
>bugs in estimate_num_groups, etc. but in principle it works.
> 
>For MCV lists, it generally works too - you can define statistics on
>the expressions and the estimates should improve. The main downside
>here is that it requires at least two expressions, otherwise we can't
>build/apply the extended statistics. So for example
> 
>   SELECT * FROM t WHERE mod(a,100) = 10 AND mod(b,11) = 0
> 
>may be estimated "correctly", once you drop any of the conditions it
>gets much worse as we don't have stats for individual expressions.
>That's rather annoying - it does not break the extended MCV, but the
>behavior will certainly cause confusion.
> 
>For functional dependencies, the estimation does not work yet. Also,
>the missing per-column statistics have bigger impact than on MCV,
>because while MCV can work fine without it, the dependencies heavily
>rely on the per-column estimates. We only apply "corrections" based
>on the dependency degree, so we still need (good) per-column
>estimates, which does not quite work with the expressions.
> 
> 
>Of course, the lack of per-expression statistics may be somewhat
>fixed by adding indexes on expressions, but that's kinda expensive.
> 

FWIW after re-reading [1], I think the plan to build pg_statistic rows
for expressions and stash them in pg_statistic_ext_data is the way to
go. I was thinking that maybe we'll need some new statistics type to
request this, e.g.

   CREATE STATISTICS s (expressions) ON ...

but on second thought I think we should just build this whenever there
are expressions in the definition. It'll require some changes (e.g. we
require at least two items in the list, but we'll want to allow building
stats on a single expression too, I guess), but that's doable.

Of course, we don't have any catalogs with composite types yet, so it's
not 100% sure this will work, but it's worth a try.

regards


[1]
https://www.postgresql.org/message-id/flat/6331.1579041473%40sss.pgh.pa.us#5ec6af7583e84cef2ca6a9e8a713511e

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Add important info about ANALYZE after create Functional Index

2020-11-16 Thread Bruce Momjian
On Mon, Nov 16, 2020 at 11:59:03AM -0300, Álvaro Herrera wrote:
> On 2020-Nov-12, Bruce Momjian wrote:
> 
> > For new expression indexes, it is necessary to run <link linkend="sql-analyze"><command>ANALYZE</command></link> or wait for
> > the autovacuum daemon to analyze
> > -   the table to generate statistics about new expression indexes.
> > +   the table to generate statistics for these indexes.
> 
> Looks good to me.

Applied to all branches.  Thanks for the review.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Cache relation sizes?

2020-11-16 Thread Konstantin Knizhnik




On 16.11.2020 10:11, Thomas Munro wrote:

On Tue, Aug 4, 2020 at 2:21 PM Thomas Munro  wrote:

On Tue, Aug 4, 2020 at 3:54 AM Konstantin Knizhnik
 wrote:

This shared relation cache can easily store relation size as well.
In addition it will solve a lot of other problems:
- noticeable overhead of local relcache warming
- large memory consumption in case of larger number of relations
O(max_connections*n_relations)
- sophisticated invalidation protocol and related performance issues
Certainly access to shared cache requires extra synchronization.But DDL
operations are relatively rare.
So in most cases we will have only shared locks. May be overhead of
locking will not be too large?

Yeah, I would be very happy if we get a high performance shared
sys/rel/plan/... caches in the future, and separately, having the
relation size available in shmem is something that has come up in
discussions about other topics too (tree-based buffer mapping,
multi-relation data files, ...).  ...

After recent discussions about the limitations of relying on SEEK_END
in a nearby thread[1], I decided to try to prototype a system for
tracking relation sizes properly in shared memory.  Earlier in this
thread I was talking about invalidation schemes for backend-local
caches, because I only cared about performance.  In contrast, this new
system has SMgrRelation objects that point to SMgrSharedRelation
objects (better names welcome) that live in a pool in shared memory,
so that all backends agree on the size.  The scheme is described in
the commit message and comments.  The short version is that smgr.c
tracks the "authoritative" size of any relation that has recently been
extended or truncated, until it has been fsync'd.  By authoritative, I
mean that there may be dirty buffers in that range in our buffer pool,
even if the filesystem has vaporised the allocation of disk blocks and
shrunk the file.

That is, it's not really a "cache".  It's also not like a shared
catalog, which Konstantin was talking about... it's more like the pool
of inodes in a kernel's memory.  It holds all currently dirty SRs
(SMgrSharedRelations), plus as many clean ones as it can fit, with
some kind of reclamation scheme, much like buffers.  Here, "dirty"
means the size changed.

Attached is an early sketch, not debugged much yet (check undir
contrib/postgres_fdw fails right now for a reason I didn't look into),
and there are clearly many architectural choices one could make
differently, and more things to be done... but it seemed like enough
of a prototype to demonstrate the concept and fuel some discussion
about this and whatever better ideas people might have...

Thoughts?

[1] 
https://www.postgresql.org/message-id/flat/OSBPR01MB3207DCA7EC725FDD661B3EDAEF660%40OSBPR01MB3207.jpnprd01.prod.outlook.com


I noticed that there are several fragments like this:


if (!smgrexists(rel->rd_smgr, FSM_FORKNUM))
     smgrcreate(rel->rd_smgr, FSM_FORKNUM, false);

fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);


I wonder if it would be more efficient, and simplify the code, to add a
"create_if_not_exists" parameter to smgrnblocks?
It would avoid an extra hash lookup and the explicit checks for fork
presence in multiple places.
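
Something like this on the caller side, just to illustrate (the extra
parameter is only a suggested name, not an existing one):

    /* as the code reads today */
    if (!smgrexists(rel->rd_smgr, FSM_FORKNUM))
        smgrcreate(rel->rd_smgr, FSM_FORKNUM, false);
    fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);

    /* with the suggested parameter (hypothetical signature) */
    fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM,
                                  true /* create_if_not_exists */);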



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Zedstore - compressed in-core columnar storage

2020-11-16 Thread Tomas Vondra


On 11/16/20 1:59 PM, Merlin Moncure wrote:
> On Thu, Nov 12, 2020 at 4:40 PM Tomas Vondra
>  wrote:
>>              master    zedstore/pglz    zedstore/lz4
>>   -----------------------------------------------------
>>   copy       1855      68092            2131
>>   dump       751       905              811
>>
>> And the size of the lineitem table (as shown by \d+) is:
>>
>>   master: 64GB
>>   zedstore/pglz: 51GB
>>   zedstore/lz4: 20GB
>>
>> It's mostly expected lz4 beats pglz in performance and compression
>> ratio, but this seems a bit too extreme I guess. Per past benchmarks
>> (e.g. [1] and [2]) the difference in compression/decompression time
>> should be maybe 1-2x or something like that, not 35x like here.
> 
> I can't speak to the ratio, but in basic backup/restore scenarios pglz
> is absolutely killing me; Performance is just awful; we are cpubound
> in backups throughout the department.  Installations defaulting to
> pglz will make this feature show very poorly.
> 

Maybe. I'm not disputing that pglz is considerably slower than lz4, but
judging by previous benchmarks I'd expect the compression to be slower
maybe by a factor of ~2x. So the 30x difference is suspicious. Similarly
for the compression ratio - lz4 is great, but it seems strange it's 1/2
the size of pglz. Which is why I'm speculating that something else is
going on.

As for the "plgz will make this feature show very poorly" I think that
depends. I think we may end up with pglz doing pretty well (compared to
heap), but lz4 will probably outperform that. OTOH for various use cases
it may be more efficient to use something else with worse compression
ratio, but allowing execution on compressed data, etc.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] remove deprecated v8.2 containment operators

2020-11-16 Thread Peter Eisentraut

On 2020-11-13 16:57, Tom Lane wrote:

Peter Eisentraut  writes:

On 2020-11-12 23:28, Tom Lane wrote:

I'm on board with pulling these now --- 8.2 to v14 is plenty of
deprecation notice.  However, the patch seems incomplete in that
the code support for these is still there -- look for
RTOldContainedByStrategyNumber and RTOldContainsStrategyNumber.
Admittedly, there's not much to be removed except some case labels,
but it still seems like we oughta do that to avoid future confusion.



Yeah, the stuff in gistproc.c should be removed now.  But I wonder what
the mentions in brin_inclusion.c are and whether or how they should be
removed.


I think those probably got cargo-culted in there at some point.
They're visibly dead code, because there are no pg_amop entries
for BRIN opclasses with amopstrategy 13 or 14.


I have committed fixes that remove the unused strategy numbers from both 
of these code areas.





Re: [PATCH] Add features to pg_stat_statements

2020-11-16 Thread Fujii Masao



On 2020/11/16 12:22, Seino Yuki wrote:

Thanks for updating the patch!

+    pgss_info->dealloc = 0;
+    SpinLockInit(&pgss_info->mutex);
+    Assert(pgss_info->dealloc == 0);

Why is this assertion check necessary? It seems not necessary.

+    {
+    Assert(found == found_info);

Having pgssSharedState and pgssInfoCounters separately might make
the code a bit more complicated like the above? If this is true, what about
including pgssInfoCounters in pgssSharedState?

PGSS_FILE_HEADER needs to be changed since the patch changes
the format of pgss file?

+    /* Read pgss_info */
+    if (feof(file) == 0)
+    if (fread(pgss_info, sizeof(pgssInfoCounters), 1, file) != 1)
+    goto read_error;

Why does feof(file) need to be called here?

+pgss_info_update(void)
+{
+    {

Why is the second "{" necessary? It seems redundant.

+pgss_info_reset(void)
+{
+    {

Same as above.

+pg_stat_statements_info(PG_FUNCTION_ARGS)
+{
+    int64    d_count = 0;
+    {

Same as above.

+    SpinLockAcquire(&c->mutex);
+    d_count = Int64GetDatum(c->dealloc);
+    SpinLockRelease(&c->mutex);

Why does Int64GetDatum() need to be called here? It seems not necessary.

+   
+    
+ pg_stat_statements_info() returns bigint
+ 
+  pg_stat_statements_info
+ 
+    

Isn't it better not to expose pg_stat_statements_info() function in the
document because pg_stat_statements_info view is enough and there
seems no use case for the function?

Regards,


Thanks for the comment.
I'll post a fixed patch.
Due to similar fixes, we have also merged the patches discussed in the 
following thread.
https://commitfest.postgresql.org/30/2738/


I agree that these two patches should use the same infrastructure
because they both try to add the global stats for pg_stat_statements.
But IMO they should not be merged to one patch. It's better to
develop them one by one for ease of review. Thought?

So I extracted the "dealloc" part from the merged version of your patch.
Also I refactored the code and applied some cosmetic changes into
the patch. Attached is the updated version of the patch that implements
only "dealloc" part. Could you review this version?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/contrib/pg_stat_statements/Makefile 
b/contrib/pg_stat_statements/Makefile
index 081f997d70..3ec627b956 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -6,7 +6,7 @@ OBJS = \
pg_stat_statements.o
 
 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.4.sql \
+DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.8--1.9.sql \
pg_stat_statements--1.7--1.8.sql pg_stat_statements--1.6--1.7.sql \
pg_stat_statements--1.5--1.6.sql pg_stat_statements--1.4--1.5.sql \
pg_stat_statements--1.3--1.4.sql pg_stat_statements--1.2--1.3.sql \
diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out 
b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index 2a303a7f07..16158525ca 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -861,4 +861,19 @@ SELECT query, plans, calls, rows FROM pg_stat_statements 
ORDER BY query COLLATE
  SELECT query, plans, calls, rows FROM pg_stat_statements ORDER BY query 
COLLATE "C" | 1 | 0 |0
 (6 rows)
 
+--
+-- access to pg_stat_statements_info view
+--
+SELECT pg_stat_statements_reset();
+ pg_stat_statements_reset 
+--
+ 
+(1 row)
+
+SELECT dealloc FROM pg_stat_statements_info;
+ dealloc 
+-
+   0
+(1 row)
+
 DROP EXTENSION pg_stat_statements;
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql 
b/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql
new file mode 100644
index 00..2019a4ffe0
--- /dev/null
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql
@@ -0,0 +1,17 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.9'" to load this 
file. \quit
+
+--- Define pg_stat_statements_info
+CREATE FUNCTION pg_stat_statements_info(
+OUT dealloc bigint
+)
+RETURNS bigint
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL SAFE;
+
+CREATE VIEW pg_stat_statements_info AS
+  SELECT * FROM pg_stat_statements_info();
+
+GRANT SELECT ON pg_stat_statements_info TO PUBLIC;
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c 
b/contrib/pg_stat_statements/pg_stat_statements.c
index dd963c4644..9de2d53496 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -98,7 +98,7 @@ PG_MODULE_MAGIC;
 #define PGSS_TEXT_FILE PG_STAT_TMP_DIR "/pgss_query_texts.stat"
 
 /* Magic number identifying the 

Re: planner support functions: handle GROUP BY estimates ?

2020-11-16 Thread Tomas Vondra
On 1/15/20 12:44 AM, Tom Lane wrote:
> Tomas Vondra  writes:
>> On Tue, Jan 14, 2020 at 05:37:53PM -0500, Tom Lane wrote:
>>> I wonder just how messy it would be to add a column to pg_statistic_ext
>>> whose type is the composite type "pg_statistic", and drop the required
>>> data into that.  We've not yet used any composite types in the system
>>> catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
>>> it seems like we might be able to get away with it.
> 
> [ I meant pg_statistic_ext_data, obviously ]
> 
>> I don't know, but feels a bit awkward to store this type of stats into
>> pg_statistic_ext, which was meant for multi-column stats. Maybe it'd
>> work fine, not sure.
> 
> If we wanted to allow a single statistics object to contain data for
> multiple expressions, we'd actually need that to be array-of-pg_statistic
> not just pg_statistic.  Seems do-able, but on the other hand we could
> just prohibit having more than one output column in the "query" for this
> type of extended statistic.  Either way, this seems far less invasive
> than either a new catalog or a new relation relkind (to say nothing of
> needing both, which is where you seemed to be headed).
> 

I've started looking at statistics on expressions too, mostly because it
seems the extended stats improvements (as discussed in [1]) need that.

The "stash pg_statistic records into pg_statistics_ext_data" approach
seems simple, but it's not clear to me how to make it work, so I'd
appreciate some guidance.


1) We don't have any composite types in any catalog yet, and naive
attempts to just use something like

pg_statistic stxdexprs[1];

did not work. So I suppose this will require changes to genbki.pl, but
honestly, my Perl-fu is non-existent :-(


2) Won't it be an issue that pg_statistic contains pseudo-types? That
is, this does not work, for example:

test=# create table t (a pg_statistic[]);
ERROR:  column "stavalues1" has pseudo-type anyarray

and it seems unlikely just using this in a catalog would make it work.


regards


[1]
https://www.postgresql.org/message-id/ad7891d2-e90c-b446-9fe2-7419143847d7%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: doc CREATE INDEX

2020-11-16 Thread Bruce Momjian
On Mon, Nov 16, 2020 at 11:04:57AM -0300, Álvaro Herrera wrote:
> On 2020-Nov-16, Erik Rijkers wrote:
> 
> > This one seems to have fallen by the wayside.
> 
> So it did.
> 
> Pushed to all three branches, thanks.

I have applied it to PG 9.5-11, which also had the odd wording.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Heads-up: macOS Big Sur upgrade breaks EDB PostgreSQL installations

2020-11-16 Thread Dave Page
On Mon, Nov 16, 2020 at 4:45 PM Pavel Borisov 
wrote:

> I suppose there are many ways to have PG on OSX i.e. package managers
> (Homebrew, Macports), App installers etc and so many places anyone can find
> his data directory reside in. Generally I prefer data directory to be
> somewhere inside the user home dir as OSX will take care of possible
> backups and will not generally modify its contents during migration between
> osx versions and/or different machines. It is not only the question of
> permissions.
>
> Any options inside user homedir are equally suitable IMO.
>

It is in the user's homedir - it's just that that isn't under /Users:

hal:~ postgres$ echo $HOME
/Library/PostgreSQL/13

With the EDB installers (unlike postgres.app), PostgreSQL runs as a
service, much as it would on Linux or BSD.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com


Re: Zedstore - compressed in-core columnar storage

2020-11-16 Thread Merlin Moncure
On Mon, Nov 16, 2020 at 10:07 AM Tomas Vondra
 wrote:
>
>
> On 11/16/20 1:59 PM, Merlin Moncure wrote:
> > On Thu, Nov 12, 2020 at 4:40 PM Tomas Vondra
> >  wrote:
> >>              master    zedstore/pglz    zedstore/lz4
> >>   -----------------------------------------------------
> >>   copy       1855      68092            2131
> >>   dump       751       905              811
> >>
> >> And the size of the lineitem table (as shown by \d+) is:
> >>
> >>   master: 64GB
> >>   zedstore/pglz: 51GB
> >>   zedstore/lz4: 20GB
> >>
> >> It's mostly expected lz4 beats pglz in performance and compression
> >> ratio, but this seems a bit too extreme I guess. Per past benchmarks
> >> (e.g. [1] and [2]) the difference in compression/decompression time
> >> should be maybe 1-2x or something like that, not 35x like here.
> >
> > I can't speak to the ratio, but in basic backup/restore scenarios pglz
> > is absolutely killing me; Performance is just awful; we are cpubound
> > in backups throughout the department.  Installations defaulting to
> > pglz will make this feature show very poorly.
> >
>
> Maybe. I'm not disputing that pglz is considerably slower than lz4, but
> judging by previous benchmarks I'd expect the compression to be slower
> maybe by a factor of ~2x. So the 30x difference is suspicious. Similarly
> for the compression ratio - lz4 is great, but it seems strange it's 1/2
> the size of pglz. Which is why I'm speculating that something else is
> going on.
>
> As for the "plgz will make this feature show very poorly" I think that
> depends. I think we may end up with pglz doing pretty well (compared to
> heap), but lz4 will probably outperform that. OTOH for various use cases
> it may be more efficient to use something else with worse compression
> ratio, but allowing execution on compressed data, etc.

hm, you might be right.  Doing some number crunching, I'm getting
about 23mb/sec compression on a 600gb backup image on a pretty typical
aws server.  That's obviously not great, but your numbers are much
worse than that, so maybe something else might be going on.

> I think we may end up with pglz doing pretty well (compared to heap)

I *don't* think so, or at least I'm skeptical as long as insertion
times are part of the overall performance measurement.  Naturally,
with column stores, insertion times are often very peripheral to the
overall performance picture but for cases that aren't I suspect the
results are not going to be pleasant, and advise planning accordingly.

Aside, I am very interested in this work. I may be able to support
testing in an enterprise environment; lmk if interested -- thank you

merlin




Re: Heads-up: macOS Big Sur upgrade breaks EDB PostgreSQL installations

2020-11-16 Thread Dave Page
On Mon, Nov 16, 2020 at 3:55 PM Jonathan S. Katz 
wrote:

> On 11/16/20 4:27 AM, Dave Page wrote:
> > Hi,
> >
> > This is more of a head-ups than anything else, as I suspect this may
> > come up in various forums.
> >
> > The PostgreSQL installers for macOS (from EDB, possibly others too)
> > create the data directory in /Library/PostgreSQL//data. This
> > has been the case since the first release, 10+ years ago.
> >
> > It looks like the Big Sur upgrade has taken it upon itself to "fix" any
> > filesystem permissions it doesn't like. On my system, this resulted in
> > the data directory having 0755 permissions, which meant that PostgreSQL
> > refused to start. Manually changing the permissions back to 0700 (0750
> > should also work) fixes the issue.
> >
> > I'm not sure there's much we can do about this - systems that are likely
> > to be affected are already out there, and we obviously don't want to
> > relax the permissions Postgres requires.
>
> Thanks for raising this. We should provide some guidance on upgrading
> this when upgrading to Big Sur.
>
> Do we know where the other macOS installers place their data
> directories? We should reach out to the installer maintainers to see if
> they are seeing the same behavior so we know what guidance to issue.
>

I believe postgres.app only installs for the current user, and puts its
data under ~/Library/Application Support/Postgres.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com


Re: Heads-up: macOS Big Sur upgrade breaks EDB PostgreSQL installations

2020-11-16 Thread Pavel Borisov
I suppose there are many ways to have PG on OSX i.e. package managers
(Homebrew, Macports), App installers etc and so many places anyone can find
his data directory reside in. Generally I prefer data directory to be
somewhere inside the user home dir as OSX will take care of possible
backups and will not generally modify its contents during migration between
osx versions and/or different machines. It is not only the question of
permissions.

Any options inside user homedir are equally suitable IMO.

---
Best regards,
Pavel Borisov

Postgres professional: http://postgrespro.com


Re: enable_incremental_sort changes query behavior

2020-11-16 Thread luis . roberto




I've pushed the 0001 part, i.e. the main fix. Not sure about the other 
parts (comments and moving the code back to postgres_fdw) yet. 


regards 

-- 
Tomas Vondra http://www.2ndQuadrant.com 
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Hi, Tomas! 

I'm still having trouble with this problem. Tom committed something [1] to avoid 
core dumps, but all I get now is: subplan "SubPlan 1" was not initialized 

I upgraded to 13.1 over the weekend but the problem still occurs. 

This is the topic I started: 
https://www.postgresql.org/message-id/flat/CAAaqYe--qpnBOUiqnHSAc71K2HpXetnGedRAX0Z04zx%3DY0ExHA%40mail.gmail.com#8cb6c7c5e787d2d3d2dc7aaf00d1cae8
 



[1] 
https://git.postgresql.org/pg/commitdiff/936043c9eacb9e9c7356a8190a410d2c4e4ea03a
 



Re: PATCH: Batch/pipelining support for libpq

2020-11-16 Thread David G. Johnston
On Fri, Nov 13, 2020 at 5:38 PM Alvaro Herrera 
wrote:

> Here's a v25.
>
> I made a few more changes to the docs per David's suggestions; I also
> reordered the sections quite a bit.  It's now like this:
>  * Batch Mode
>* Using Batch Mode
>  * Issuing Queries
>  * Processing Results
>  * Error Handling
>  * Interleaving Result Processing and Query Dispatch
>  * Ending Batch Mode
>* Functions Associated with Batch Mode
>* When to Use Batching
>

Thanks!

I like the new flow and changes.  I've attached a patch that covers some
missing commas and typos along with a few parts that made me pause.

The impact of memory came out of nowhere under the non-blocking mode
commentary.  I took a stab at incorporating it more broadly.

The "are error conditions" might be a well-known phrasing to those using
libpq but that sentence reads odd to me.  I did try to make the example
listing flow a bit better and added a needed comma.

Tried to clean up a few phrasings after that.  The error handling part I'm
not especially happy with but I think it's closer and more useful than just
"emitted during error handling" - it gets emitted upon error, handling is a
client concern.

Seems odd to say the new feature was introduced in v14.0, the .0 seems ok
to imply.  I didn't actually fix it in the attached but "the PostgreSQL 14
version of libpq" is going to become outdated quickly, better just to drop
it.

"The batch API was introduced in PostgreSQL 14, but clients can use batches
on servers supporting v3 of the extended query protocol, potentially going
as far back as version 7.4."

David J.


v25-01-libpq-batch-dgj-doc-suggestions.patch


Re: [PATCH] remove deprecated v8.2 containment operators

2020-11-16 Thread Justin Pryzby
On Fri, Nov 13, 2020 at 10:03:43AM -0500, Stephen Frost wrote:
> * Magnus Hagander (mag...@hagander.net) wrote:
> > On Thu, Nov 12, 2020 at 11:28 PM Tom Lane  wrote:
> > > > The changes to the contrib modules appear to be incomplete in some ways.
> > > >   In cube, hstore, and seg, there are no changes to the extension
> > > > scripts to remove the operators.  All you're doing is changing the C
> > > > code to no longer recognize the strategy, but that doesn't explain what
> > > > will happen if the operator is still used.  In intarray, by contrast,
> > > > you're editing an existing extension script, but that should be done by
> > > > an upgrade script instead.
> > >
> > > In the contrib modules, I'm afraid what you gotta do is remove the
> > > SQL operator definitions but leave the opclass code support in place.
> > > That's because there's no guarantee that users will update the extension's
> > > SQL version immediately, so a v14 build of the .so might still be used
> > > with the old SQL definitions.  It's not clear how much window we need
> > > give for people to do that update, but I don't think "zero" is an
> > > acceptable answer.
> > 
> > Based on my experience from the field, the answer is "never".
> > 
> > As in, most people have no idea they are even *supposed* to do such an
> > upgrade, so they don't do it. Until we solve that problem, I think
> > we're basically stuck with keeping them "forever". (and even if/when
> > we do, "zero" is probably not going to cut it, no)
> 
> Yeah, this is a serious problem and one that we should figure out a way
> to fix or at least improve on- maybe by having pg_upgrade say something
> about extensions that could/should be upgraded..?

I think what's needed are:

1) a way to *warn* users about deprecation.  CREATE EXTENSION might give an
elog(WARNING), but it's probably not enough.  It only happens once, and if it's
in pg_restore/pg_upgrade, it may be wrapped by vendor upgrade scripts.  It needs to
be more like a warning on the first function call in every session.  Or more likely, put it in
documentation for 10 years.

2) a way to *enforce* it, like making CREATE EXTENSION fail when run against an
excessively old server, including by pg_restore/pg_upgrade (which ought to also
handle it in --check).

Are there any contrib for which (1) is done and we're anywhere near doing (2) ?

-- 
Justin




Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-16 Thread David G. Johnston
On Mon, Nov 16, 2020 at 1:52 PM Simon Riggs  wrote:

> The docs are misleading for this feature, since they say:
> "This option disables all page-skipping behavior, and is
> intended to be used only when the contents of the visibility map are
> suspect, which should happen only if there is a hardware or software
> issue causing database corruption."
>
[...]

>
> The code is quite clear: DISABLE_PAGE_SKIPPING makes the vacuum into
> an aggressive vacuum. Line 487, heap_vacuum_rel().  Aggressive vacuums
> can still skip a page that is frozen, and rely on the visibility map
> for that information.
>
> So the docs are wrong - we don't disable *all* page-skipping and it is
> not appropriate to warn users away from this feature by saying "is
> intended to be used only when the contents of the visibility map are
> suspect".
>

This patch seems mis-placed, at least in HEAD.  If DISABLE_PAGE_SKIPPING
isn't doing what the documentation says it should, but instead provides
identical behavior to FREEZE, then the bug should be fixed in HEAD.  I'd
argue for back-patching it as well.

David J.


Re: Crash in virtual file descriptor FDDEBUG code

2020-11-16 Thread Tom Lane
Greg Nancarrow  writes:
> While investigating a possible file-descriptor issue, I enabled the
> FDDEBUG code in src/backend/storage/file/fd.c, only to find that it
> resulted in a crash as soon as it was invoked.

Yeah, I can imagine that code doesn't get tested often.

> I've attached a small patch that corrects the issue.

Thanks, will push.

BTW, while looking at this I started to cast an unfriendly eye on the
_dump_lru() debug function that's hidden under that symbol.  That seems
like it's spending a lot of cycles to give you information that will
very possibly be incomplete (since its 2K buffer would fill up fast).
And it could very easily turn into an infinite loop if there were
anything actually wrong with the LRU chain.  So I'm not seeing where
the cost-benefit ratio is there.

Maybe we should change it to just print the first and last entries
and the number of entries --- which it could count using a loop that
knows there shouldn't be more entries than the size of the VFD
array, so as to prevent infinite-looping if the chain is corrupt.
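
Roughly something like this, as a sketch of the idea (names are the existing
ones from fd.c, but this is untested):

static void
_dump_lru(void)
{
    int     mru = VfdCache[0].lruLessRecently;
    int     first = mru;
    int     last = 0;
    Size    count = 0;

    /* bounded walk: never loop more than the number of VFD slots */
    while (mru != 0 && count <= SizeVfdCache)
    {
        last = mru;
        mru = VfdCache[mru].lruLessRecently;
        count++;
    }

    elog(LOG, "LRU: %zu entries, most recent %d, least recent %d%s",
         count, first, last,
         count > SizeVfdCache ? " (chain looks corrupt)" : "");
}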

regards, tom lane




Re: Cache relation sizes?

2020-11-16 Thread Thomas Munro
On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik
 wrote:
> I will look at your implementation more precisely later.

Thanks!  Warning:  I thought about making a thing like this for a
while, but the patch itself is only a one-day prototype, so I am sure
you can find many bugs...  Hmm, I guess the smgrnblocks_fast() code
really needs a nice big comment to explain the theory about why this
value is fresh enough, based on earlier ideas about implicit memory
barriers.  (Something like: if you're extending, you must have
acquired the extension lock; if you're scanning you must have acquired
a snapshot that only needs to see blocks that existed at that time and
concurrent truncations are impossible; I'm wondering about special
snapshots like VACUUM, and whether this is too relaxed for that, or if
it's covered by existing horizon-based interlocking...).
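
To make the shape of that fast path concrete, it presumably boils down to
something like this (sketched for illustration; the field and function names
here are approximations, not necessarily what the attached patch does):

static BlockNumber
smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
{
    SMgrSharedRelation *sr = reln->smgr_shared;     /* assumed field name */

    /*
     * Trust the shared size when we have a shared entry; the memory-ordering
     * argument above is what makes this plain read OK.
     */
    if (sr != NULL)
        return sr->nblocks[forknum];

    return smgrnblocks(reln, forknum);      /* fall back to lseek(SEEK_END) */
}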

> But now I just want to ask one question: is it reasonable to spend a
> substantial amount of effort trying to solve
> the problem of redundant calculations of relation size, which seems to
> be just part of a more general problem: a shared relation cache?

It's a good question.  I don't know exactly how to make a shared
relation cache (I have only some vague ideas and read some of
Ideriha-san's work) but I agree that it would be very nice.  However,
I think that's an independent and higher level thing, because Relation
!= SMgrRelation.

What is an SMgrRelation?  Not much!  It holds file descriptors, which
are different in every backend so you can't share them, and then some
information about sizes and a target block for insertions.  With this
patch, the sizes become shared and more reliable, so there's *almost*
nothing left at all except for the file descriptors and the pointer to
the shared object.  I haven't looked into smgr_targblock, the last
remaining non-shared thing, but there might be an opportunity to do
something clever for parallel inserters (the proposed feature) by
coordinating target blocks in SMgrSharedRelation; I don't know.

> It is a well-known problem: private caches of the system catalog can cause
> unacceptable memory consumption for a large number of backends.

Yeah.

> Also private caches require a sophisticated and error-prone invalidation
> mechanism.
>
> If we have such a shared cache, then keeping the shared relation size
> seems to be a trivial task, doesn't it?
> So maybe, before "diving" into the problem of maintaining a shared pool
> of dirtied "inodes", we can better discuss and concentrate on solving the
> more generic problem?

Sure, we can talk about how to make a shared relation cache (and
syscache etc).  I think it'll be much harder than this shared SMGR
concept.  Even if we figure out how to share palloc'd trees of nodes
with correct invalidation and interlocking (ie like Ideriha-san's work
in [1]), including good performance and space management/replacement,
I guess we'll still need per-backend Relation objects to point to the
per-backend SMgrRelation objects to hold the per-backend file
descriptors.  (Until we have threads...).  But... do you think sharing
SMGR relations to the maximum extent possible conflicts with that?
Are you thinking there should be some general component here, perhaps
a generic system for pools of shmem objects with mapping tables and a
replacement policy?

[1] 
https://www.postgresql.org/message-id/flat/4E72940DA2BF16479384A86D54D0988A567B9245%40G01JPEXMBKW04




RE: Disable WAL logging to speed up data loading

2020-11-16 Thread tsunakawa.ta...@fujitsu.com
# It'd be helpful if you could send mails in text format, not HTML.

From: David G. Johnston 
> For this case the fundamental feature that would seem to be required is an 
> ability for a transaction commit to return only after the system has ensured 
> that all of the new pages added to the relation during the scope of the 
> transaction have made it to disk.  Something like:
>
> BEGIN UNLOGGED TRANSACTION FOR table1, table2;
> -- locking probably allows reads, definitely disallows concurrent writes, to 
> the named tables
> -- Disallow updates and deletes, do not use dead tuple space, for the tables 
> named.  Should be able to do normal stuff for other tables?
> -- Always create new pages
> COPY TO table1;
> COPY TO table2;
> COMMIT; -- wait here until data files for table1 and table2 are completely 
> written and the transaction alive flag is committed to the WAL.
>
> I suppose the above could be written "BEGIN UNLOGGED TRANSACTION FOR ALL 
> TABLES" and you'd get the initial database population optimization capability.
>
> If the commit doesn't complete all of the newly created pages are junk.  
> Otherwise, you have a crash-recoverable state for those tables as regards 
> those specific pages.

As Steven-san said, I don't want to go in this complicated direction.  Plus, 
putting myself in the user's shoes, I want to try to avoid introducing new 
SQL syntax for this kind of performance boost, which requires changes to 
applications, maintenance scripts, and testing.


Regards
Takayuki Tsunakawa



Re: Terminate the idle sessions

2020-11-16 Thread Li Japin

On Nov 17, 2020, at 10:53 AM, David G. Johnston  wrote:

On Monday, November 16, 2020, Li Japin  wrote:


Consider setting this for specific users instead of as a server default.  
Client connections managed by connection poolers, or initiated indirectly like 
those by a remote postgres_fdw  using server, should probably be excluded from 
this timeout.


 
- This parameter should be set to zero if you use postgres_fdw or some
- connection-pooling software, because connections might be closed 
unexpectedly.
+ This parameter should be set to zero if you use some 
connection-pooling software, or
+ PostgreSQL servers used by postgres_fdw, because connections might be 
closed unexpectedly.
 



I prefer mine; “or pg servers used by postgres_fdw” doesn’t flow.

Could you please explain how idle-in-transaction sessions interfere with the 
long-running stability of the server?

From the docs (next section):

This allows any locks held by that session to be released and the connection 
slot to be reused; it also allows tuples visible only to this transaction to be 
vacuumed. See Section 24.1 for more details about this.

Thanks David! Attached.

--
Best regards
Japin Li


v6-0001-Allow-terminating-the-idle-sessions.patch


v6-0002-Optimize-setitimer-usage.patch


Re: pgbench: option delaying queries till connections establishment?

2020-11-16 Thread Andres Freund
Hi,

On 2020-11-17 00:09:34 +0300, Marina Polyakova wrote:
> Sorry I'm not familiar with the internal architecture of snapshots, locks
> etc. in postgres, but I wanted to ask - what exact kind of patch for fair
> lwlocks do you want to offer to the community? I applied the 6-th version of
> the patch for fair lwlocks from [1] to the old master branch (see commit
> [2]), started many threads in pgbench (-M prepared -c 1000 -j 500 -T 10 -P1
> -S) and I did not receive stable first progress reports, which IIUC are one
> of the advantages of the discussed patch for the pgbench (see [3])...

Thanks for running some benchmarks.

I think it's quite unsurprising that you'd see skewed numbers initially,
even with fairer locks. Just by virtue of pgbench starting threads and
each thread immediately starting to perform work, you are bound to get
odd pretty meaningless initial numbers. Even without contention, and
when using fewer connections than there are CPUs. And especially when
starting a larger number of connections, because the main pgbench thread
will get fewer and fewer scheduler slices because of the threads running
benchmarks already.

Regards,

Andres




Re: [PATCH] Covering SPGiST index

2020-11-16 Thread Tom Lane
Pavel Borisov  writes:
> [ v10-0001-Covering-SP-GiST-index-support-for-INCLUDE-colum.patch ]

I've started to review this, and I've got to say that I really disagree
with this choice:

+ * If there are INCLUDE columns, they are stored after a key value, each
+ * starting from its own typalign boundary. Unlike IndexTuple, first INCLUDE
+ * value does not need to start from MAXALIGN boundary, so SPGiST uses private
+ * routines to access them.

This seems to require far more new code than it could possibly be worth,
because most of the time anything you could save here is just going
to disappear into end-of-tuple alignment space anyway -- recall that
the overall index tuple length is going to be MAXALIGN'd no matter what.
I think you should yank this out and try to rely on standard tuple
construction/deconstruction code instead.

I also find it unacceptable that you stuck a tuple descriptor pointer into
the rd_amcache structure.  The relcache only supports that being a flat
blob of memory.  I see that you tried to hack around that by having
spgGetCache reconstruct a new tupdesc every time through, but (a) that's
actually worse than having no cache at all, and (b) spgGetCache doesn't
really know much about the longevity of the memory context it's called in.
This could easily lead to dangling tuple pointers, serious memory bloat
from repeated tupdesc construction, or quite possibly both.  Safer would
be to build the tupdesc during initSpGistState(), or maybe just make it
on-demand.  In view of the previous point, I'm also wondering if there's
any way to get the relcache's regular rd_att tupdesc to be useful here,
so we don't have to build one during scans at all.

(I wondered for a bit about whether you could keep a long-lived private
tupdesc in the relcache's rd_indexcxt context.  But it looks like
relcache.c sometimes resets rd_amcache without also clearing the
rd_indexcxt, so that would probably lead to leakage.)

regards, tom lane




Crash in virtual file descriptor FDDEBUG code

2020-11-16 Thread Greg Nancarrow
Hi Hackers,

While investigating a possible file-descriptor issue, I enabled the
FDDEBUG code in src/backend/storage/file/fd.c, only to find that it
resulted in a crash as soon as it was invoked.
It turns out the crash was in some logging code that was being passed
a NULL fileName.
I've attached a small patch that corrects the issue.

Regards,
Greg Nancarrow
Fujitsu Australia


v1-0001-Fix-crash-in-virtual-file-descriptor-FDDEBUG-code.patch


RE: Cache relation sizes?

2020-11-16 Thread tsunakawa.ta...@fujitsu.com
From: Thomas Munro 
> On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik
>  wrote:
> > I will look at your implementation more precisely later.
> 
> Thanks!  Warning:  I thought about making a thing like this for a
> while, but the patch itself is only a one-day prototype, so I am sure
> you can find many bugs...  Hmm, I guess the smgrnblocks_fast() code
> really needs a nice big comment to explain the theory about why this
> value is fresh enough, based on earlier ideas about implicit memory
> barriers.  (Something like: if you're extending, you must have
> acquired the extension lock; if you're scanning you must have acquired
> a snapshot that only needs to see blocks that existed at that time and
> concurrent truncations are impossible; I'm wondering about special
> snapshots like VACUUM, and whether this is too relaxed for that, or if
> it's covered by existing horizon-based interlocking...).

We'd like to join the discussion and review, too, so that Kirk's optimization 
for VACUUM truncation and TRUNCATE can extend beyond recovery to normal 
operation.  In that regard, would you mind becoming the committer 
for his patch?  We think it's committable now.  I'm fine with using the 
unmodified smgrnblocks() as your devil's suggestion.


> It's a good question.  I don't know exactly how to make a shared
> relation cache (I have only some vague ideas and read some of
> Ideriha-san's work) but I agree that it would be very nice.  However,
> I think that's an independent and higher level thing, because Relation
> != SMgrRelation.

> Sure, we can talk about how to make a shared relation cache (and
> syscache etc).  I think it'll be much harder than this shared SMGR
> concept.  Even if we figure out how to share palloc'd trees of nodes
> with correct invalidation and interlocking (ie like Ideriha-san's work
> in [1]), including good performance and space management/replacement,
> I guess we'll still need per-backend Relation objects to point to the
> per-backend SMgrRelation objects to hold the per-backend file
> descriptors.  (Until we have threads...).  But... do you think sharing
> SMGR relations to the maximum extent possible conflicts with that?
> Are you thinking there should be some general component here, perhaps
> a generic system for pools of shmem objects with mapping tables and a
> replacement policy?

Yes, Ideriha-san and we have completed the feature called global metacache for 
placing relation and catalog caches in shared memory and limiting their sizes, 
and incorporated it in our product to meet our customer's request.  I hope 
we'll submit the patch in the near future when our resource permits.


Regards
Takayuki Tsunakawa



Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-16 Thread Michael Paquier
On Mon, Nov 16, 2020 at 09:23:41PM -0300, Alvaro Herrera wrote:
> I've been wondering if it would be sane/safe to do the WaitForFoo stuff
> outside of any transaction.

This needs to remain within a transaction as CIC holds a session lock
on the parent table, which would not be cleaned up without a
transaction context.
--
Michael




Re: Terminate the idle sessions

2020-11-16 Thread Li Japin

--
Best regards
Japin Li

On Nov 17, 2020, at 7:59 AM, David G. Johnston  wrote:

On Mon, Nov 16, 2020 at 5:41 AM Li Japin  wrote:
Thanks for your review! Attached.

Reading the doc changes:

I'd rather not name postgres_fdw explicitly, or at least not solely, as a 
reason for setting this to zero.  Additionally, using postgres_fdw within the 
server doesn't cause issues; it's using postgres_fdw against a remote server 
that has this setting enabled that causes a problem.


Consider setting this for specific users instead of as a server default.  
Client connections managed by connection poolers, or initiated indirectly like 
those by a remote postgres_fdw using server, should probably be excluded from 
this timeout.

Text within <para> should be indented one space (you missed both under 
listitem).

Thanks for your suggestion! How about changing the document as follows:

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6c4e2a1fdc..23e691a7c5 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8281,17 +8281,17 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH 
csv;
   
   

-   Terminate any session that has been idle for longer than the specified 
amount of time.
+Terminate any session that has been idle for longer than the specified 
amount of time.


-   If this value is specified without units, it is taken as milliseconds.
-   A value of zero (the default) disables the timeout.
+If this value is specified without units, it is taken as milliseconds.
+A value of zero (the default) disables the timeout.



 
- This parameter should be set to zero if you use postgres_fdw or some
- connection-pooling software, because connections might be closed 
unexpectedly.
+ This parameter should be set to zero if you use some 
connection-pooling software, or
+ PostgreSQL servers used by postgres_fdw, because connections might be 
closed unexpectedly.
 


I'd suggest a comment that aside from a bit of resource consumption idle 
sessions do not interfere with the long-running stability of the server, unlike 
idle-in-transaction sessions which are controlled by the other configuration 
setting.

Could you please explain how idle-in-transaction sessions interfere with the 
long-running stability of the server?

--
Best regards
Japin Li



Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-16 Thread Alvaro Herrera
On 2020-Nov-16, Dmitry Dolgov wrote:

> The same with reindex without locks:
> 
>  nsecs   : count distribution
>512 -> 1023   : 0||
>   1024 -> 2047   : 111345   ||
>   2048 -> 4095   : 6997627  ||
>   4096 -> 8191   : 18575||
>   8192 -> 16383  : 586  ||
>  16384 -> 32767  : 312  ||
>  32768 -> 65535  : 18   ||
> 
> The same with reindex with locks:
> 
>  nsecs   : count distribution
>512 -> 1023   : 0||
>   1024 -> 2047   : 59438||
>   2048 -> 4095   : 6901187  ||
>   4096 -> 8191   : 18584||
>   8192 -> 16383  : 581  ||
>  16384 -> 32767  : 280  ||
>  32768 -> 65535  : 84   ||
> 
> Looks like reindex without locks is indeed faster (there are more
> samples in the lower time section), but not particularly significant to the
> whole distribution, especially taking into account the extremity of the
> test.

I didn't analyze these numbers super carefully, but yeah it doesn't look
significant.

I'm looking at these patches now, with intention to push.





VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-16 Thread Simon Riggs
The docs are misleading for this feature, since they say:
"This option disables all page-skipping behavior, and is
intended to be used only when the contents of the visibility map are
suspect, which should happen only if there is a hardware or software
issue causing database corruption."

The docs do correctly say "Pages where all tuples are known to be
frozen can always be skipped". Checking the code, lazy_scan_heap()
comments say
"we can still skip pages that are all-frozen, since such pages do not
need freezing".

The code is quite clear: DISABLE_PAGE_SKIPPING makes the vacuum into
an aggressive vacuum. Line 487, heap_vacuum_rel().  Aggressive vacuums
can still skip a page that is frozen, and rely on the visibility map
for that information.

So the docs are wrong - we don't disable *all* page-skipping and it is
not appropriate to warn users away from this feature by saying "is
intended to be used only when the contents of the visibility map are
suspect".

Reworded docs patch attached.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


aggressive_rewording.v1.patch


Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-16 Thread Masahiko Sawada
On Tue, Nov 17, 2020 at 5:52 AM Simon Riggs  wrote:
>
> The docs are misleading for this feature, since they say:
> "This option disables all page-skipping behavior, and is
> intended to be used only when the contents of the visibility map are
> suspect, which should happen only if there is a hardware or software
> issue causing database corruption."
>
> The docs do correctly say "Pages where all tuples are known to be
> frozen can always be skipped". Checking the code, lazy_scan_heap()
> comments say
> "we can still skip pages that are all-frozen, since such pages do not
> need freezing".
>
> The code is quite clear: DISABLE_PAGE_SKIPPING makes the vacuum into
> an aggressive vacuum. Line 487, heap_vacuum_rel().  Aggressive vacuums
> can still skip a page that is frozen, and rely on the visibility map
> for that information.
>
> So the docs are wrong - we don't disable *all* page-skipping and it is
> not appropriate to warn users away from this feature by saying "is
> intended to be used only when the contents of the visibility map are
> suspect".

I don't think the doc is wrong. If DISABLE_PAGE_SKIPPING is specified,
we not only set aggressive = true but also skip checking the visibility
map. For instance, see line 905 and line 963, lazy_scan_heap().

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: Terminate the idle sessions

2020-11-16 Thread David G. Johnston
On Mon, Nov 16, 2020 at 5:41 AM Li Japin  wrote:

> Thanks for your review! Attached.
>

Reading the doc changes:

I'd rather not name postgres_fdw explicitly, or at least not solely, as a
reason for setting this to zero.  Additionally, using postgres_fdw within
the server doesn't cause issues; it's using postgres_fdw against a remote
server that has this setting enabled that causes a problem.


Consider setting this for specific users instead of as a server default.
Client connections managed by connection poolers, or initiated indirectly
like those by a remote postgres_fdw using server, should probably be
excluded from this timeout.

Text within <para> should be indented one space (you missed both under
listitem).

I'd suggest a comment that aside from a bit of resource consumption idle
sessions do not interfere with the long-running stability of the server,
unlike idle-in-transaction sessions which are controlled by the other
configuration setting.

David J.


Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-16 Thread Alvaro Herrera
I am really unsure about the REINDEX CONCURRENTLY part of this, for two
reasons:

1. It is not as good when reindexing multiple indexes, because we can
only apply the flag if *all* indexes are "safe".  Any unsafe index means
we step down from it for the whole thing.  This is probably not worth
worrying much about, but still.

2. In some of the waiting transactions, we actually do more things than
what we do in CREATE INDEX CONCURRENTLY transactions --- some catalog
updates, but we also do the whole index validation phase.  Is that OK?
It's not as clear to me that it is safe to set the flag in all those
places.

I moved the comments to the new function and made it inline.  I also
changed the way we determine how the function is safe; there's no reason
to build an IndexInfo if we can simply look at rel->rd_indexprs and
rel->indpred.

I've been wondering if it would be sane/safe to do the WaitForFoo stuff
outside of any transaction.

I'll have a look again tomorrow.
From 1e5560d4a3f3e3b4cb83803f90985701fb4dee8c Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 4 Aug 2020 22:04:57 -0400
Subject: [PATCH v5] Avoid spurious CREATE INDEX CONCURRENTLY waits

---
 src/backend/commands/indexcmds.c | 85 +---
 src/include/storage/proc.h   |  6 ++-
 2 files changed, 83 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 35696f9f75..44ea84c54d 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -68,6 +68,7 @@
 
 
 /* non-export function prototypes */
+static bool CompareOpclassOptions(Datum *opts1, Datum *opts2, int natts);
 static void CheckPredicate(Expr *predicate);
 static void ComputeIndexAttrs(IndexInfo *indexInfo,
 			  Oid *typeOidP,
@@ -87,13 +88,12 @@ static char *ChooseIndexNameAddition(List *colnames);
 static List *ChooseIndexColumnNames(List *indexElems);
 static void RangeVarCallbackForReindexIndex(const RangeVar *relation,
 			Oid relId, Oid oldRelId, void *arg);
-static bool ReindexRelationConcurrently(Oid relationOid, int options);
-
+static void reindex_error_callback(void *args);
 static void ReindexPartitions(Oid relid, int options, bool isTopLevel);
 static void ReindexMultipleInternal(List *relids, int options);
-static void reindex_error_callback(void *args);
+static bool ReindexRelationConcurrently(Oid relationOid, int options);
 static void update_relispartition(Oid relationId, bool newval);
-static bool CompareOpclassOptions(Datum *opts1, Datum *opts2, int natts);
+static inline void set_safe_index_flag(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -385,7 +385,10 @@ CompareOpclassOptions(Datum *opts1, Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)
+ * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
+ * on indexes that are neither expressional nor partial are also safe to
+ * ignore, since we know that those processes won't examine any data
+ * outside the table they're indexing.
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -406,7 +409,8 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-		  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
+		  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
+		  | PROC_IN_SAFE_IC,
		  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -426,7 +430,8 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 	true, false,
-	PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
+	PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
+	| PROC_IN_SAFE_IC,
	&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -519,6 +524,7 @@ DefineIndex(Oid relationId,
 	bool		amcanorder;
 	amoptions_function amoptions;
 	bool		partitioned;
+	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1045,6 +1051,10 @@ DefineIndex(Oid relationId,
 		}
 	}
 
+	/* Determine whether we can call set_safe_index_flag */
+	safe_index = indexInfo->ii_Expressions == NIL &&
+		indexInfo->ii_Predicate == NIL;
+
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1431,6 +1441,10 @@ DefineIndex(Oid relationId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_safe_index_flag();
+
 	/*
 	 * The index is now visible, so we can report the OID.
 	 

RE: Cache relation sizes?

2020-11-16 Thread k.jami...@fujitsu.com
On Tuesday, November 17, 2020 9:40 AM, Tsunakawa-san wrote:
> From: Thomas Munro 
> > On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik
> >  wrote:
> > > I will look at your implementation more precisely later.
> >
> > Thanks!  Warning:  I thought about making a thing like this for a
> > while, but the patch itself is only a one-day prototype, so I am sure
> > you can find many bugs...  Hmm, I guess the smgrnblocks_fast() code
> > really needs a nice big comment to explain the theory about why this
> > value is fresh enough, based on earlier ideas about implicit memory
> > barriers.  (Something like: if you're extending, you must have
> > acquired the extension lock; if you're scanning you must have acquired
> > a snapshot that only needs to see blocks that existed at that time and
> > concurrent truncations are impossible; I'm wondering about special
> > snapshots like VACUUM, and whether this is too relaxed for that, or if
> > it's covered by existing horizon-based interlocking...).
> 
> We'd like to join the discussion and review, too, so that Kirk's optimization 
> for
> VACUUM truncation and TRUNCATRE can extend beyond recovery to
> operations during normal operation.  In that regard, would you mind
> becoming a committer for his patch?  We think it's committable now.  I'm
> fine with using the unmodified smgrnblocks() as your devil's suggestion.

Just saw this thread has been bumped. And yes, regarding the vacuum and truncate
optimization during recovery, it would still work even without the unmodified
smgrnblocks.
We'd just need to remove the conditions that check the flag parameter. And if
we explore your latest prototype patch, it can possibly use the new
smgrnblocks_ APIs introduced here.
That said, we'd like to explore the next step of extending the feature to
normal operation.

Regards,
Kirk Jamison


Re: Tab complete for CREATE OR REPLACE TRIGGER statement

2020-11-16 Thread Michael Paquier
On Mon, Nov 16, 2020 at 08:12:05AM +, Shinoda, Noriyoshi (PN Japan FSI) 
wrote:
>  Yesterday, OR REPLACE clause was provided to the CREATE TRIGGER
> statement, so I wrote a patch for tab completion for the psql
> command.

Thanks, the logic looks fine.  I'll apply if there are no objections.
Please note that git diff --check complains, and that the indentation does not
seem quite right, but that's easy enough to fix.
--
Michael




Re: Terminate the idle sessions

2020-11-16 Thread David G. Johnston
On Monday, November 16, 2020, Li Japin  wrote:

>
> 
> Consider setting this for specific users instead of as a server default.
> Client connections managed by connection poolers, or initiated indirectly
> like those by a remote postgres_fdw using server, should probably be
> excluded from this timeout.
>
> 
>
>  
> - This parameter should be set to zero if you use postgres_fdw or
> some
> - connection-pooling software, because connections might be closed
> unexpectedly.
> + This parameter should be set to zero if you use some
> connection-pooling software, or
> + PostgreSQL servers used by postgres_fdw, because connections
> might be closed unexpectedly.
>  
> 
>
>
I prefer mine; “or pg servers used by postgres_fdw” doesn’t flow.


> Could you please explain how idle-in-transaction sessions interfere with the
> long-running stability of the server?
>

From the docs (next section):

This allows any locks held by that session to be released and the
connection slot to be reused; it also allows tuples visible only to this
transaction to be vacuumed. See Section 24.1 for more details about this.

David J.


Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-11-16 Thread Fujii Masao




On 2020/11/16 18:24, Masahiro Ikeda wrote:

On 2020-11-13 12:32, lchch1...@sina.cn wrote:

Now, pg_stat_wal supports reset all informantion in WalStats
using pg_stat_reset_shared('wal') function.
Isn't it enough?

Yes, it's OK, sorry I missed this information.


OK.


3. I do not think the description in the document for
'wal_buffers_full' is correct.



Where should I rewrite the description? If my understanding is not
correct, please let me know.

Sorry I have not described it clearly, because I could not understand
the meaning of this column after reading the description in the document.
Now I have read the source code of the WAL write path and found that the
place where 'wal_buffers_full' is incremented is where a backend has to
wait for a WAL buffer which is occupied by another WAL page, so the
backend flushes the old page out of the WAL buffer (once it can).
So I think the original description in the document is not quite on point;
we could describe it as something like
'Total number of times WAL data was written to the disk because a backend
had to evict a WAL buffer occupied by an old page in order to advance to a
new WAL page.'

Sorry if my understanding is wrong.


Thanks for your comments.

Your understanding is almost the same as mine.
It describes the case where not only backends but also other background
processes initialize a new WAL page while the WAL buffers' space is already
used up and there is no space left.


'Total number of times WAL data was written to the disk because a backend
had to evict a WAL buffer occupied by an old page in order to advance to a new WAL page'


Thanks for your suggestion.
I wondered whether users may be confused about how to use "wal_buffers_full" and how 
to tune parameters.

I thought the reason why the WAL buffers have no space is
important for users when tuning the wal_buffers parameter.

How about the following comments?

'Total number of times WAL data was written to the disk because WAL buffers got 
full when initializing a new WAL page'


Or what about the following?

Total number of times WAL data was written to the disk, to claim the buffer 
page to insert new WAL data when the WAL buffers got filled up with unwritten 
WAL data.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-16 Thread Alvaro Herrera
On 2020-Nov-09, Tom Lane wrote:

> Michael Paquier  writes:
> >> +  LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
> >> +  MyProc->vacuumFlags |= PROC_IN_SAFE_IC;
> >> +  ProcGlobal->vacuumFlags[MyProc->pgxactoff] = 
> >> MyProc->vacuumFlags;
> >> +  LWLockRelease(ProcArrayLock);
> 
> > I can't help noticing that you are repeating the same code pattern
> > eight times.  I think that this should be in its own routine, and that
> > we had better document that this should be called just after starting
> > a transaction, with an assertion enforcing that.
> 
> Do we really need exclusive lock on the ProcArray to make this flag
> change?  That seems pretty bad from a concurrency standpoint.

BTW I now know that the reason for taking ProcArrayLock is not the
vacuumFlags itself, but rather MyProc->pgxactoff, which can move.

On the other hand, if we stopped mirroring the flags in ProcGlobal, it
would mean we would have to access all procs' PGPROC entries in
GetSnapshotData, which is undesirable for performance reason (per commit
5788e258bb26).
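
For illustration, the repeated pattern quoted above could be collected into a
small helper along these lines; the function name is made up here and this is
only a sketch, not the patch itself:

    /* Hypothetical helper; name and placement are illustrative only. */
    static void
    ProcSetVacuumFlag(uint8 flag)
    {
        /*
         * ProcArrayLock is needed because MyProc->pgxactoff can move
         * concurrently (see above), and because the flag is mirrored in
         * ProcGlobal->vacuumFlags for the benefit of GetSnapshotData().
         */
        LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
        MyProc->vacuumFlags |= flag;
        ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
        LWLockRelease(ProcArrayLock);
    }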




Re: Skip ExecCheckRTPerms in CTAS with no data

2020-11-16 Thread Michael Paquier
On Mon, Nov 16, 2020 at 04:01:33PM +0100, Peter Eisentraut wrote:
> While this patch was nice enough to update the documentation about the
> requirement of the INSERT privilege, this is maybe more confusing now: How
> could a new table not have INSERT privilege?  Yes, you can do that with
> default privileges, but that's not well known and should be clarified in the
> documentation.
>
> The SQL standard says that for CREATE TABLE AS, the INSERT "is effectively
> executed without further Access Rule checking", which means the INSERT
> privilege shouldn't be required at all.  I suggest we consider doing that
> instead.  I don't see a use for the current behavior.

Hmm.  Is there anything specific for materialized views?  They've
checked for INSERT privileges at this phase since their introduction
in 3bf3ab8c.
--
Michael


signature.asc
Description: PGP signature


Re: Tab complete for CREATE OR REPLACE TRIGGER statement

2020-11-16 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Nov 16, 2020 at 08:12:05AM +, Shinoda, Noriyoshi (PN Japan FSI) 
> wrote:
>> Yesterday, OR REPLACE clause was provided to the CREATE TRIGGER
>> statement, so I wrote a patch for tab completion for the psql
>> command.

> Thanks, the logic looks fine.  I'll apply if there are no objections.
> Please note that git diff --check complains, and that the indentation does
> not seem quite right, but that's easy enough to fix.

It's kind of depressing how repetitive the patch is.  There's
probably not much to be done about that in the short run, but
it seems to point up the need to start thinking about how to
refactor tab-complete.c more thoroughly.
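
For illustration, this is roughly what the duplication looks like in
tab-complete.c: every existing CREATE TRIGGER rule grows a parallel
CREATE OR REPLACE TRIGGER arm. This excerpt is written in the style of
psql_completion() and is not taken from the patch:

    /* Illustrative excerpt only; each rule has to be spelled out twice. */
    else if (TailMatches("CREATE", "TRIGGER", MatchAny) ||
             TailMatches("CREATE", "OR", "REPLACE", "TRIGGER", MatchAny))
        COMPLETE_WITH("BEFORE", "AFTER", "INSTEAD OF");
    else if (TailMatches("CREATE", "TRIGGER", MatchAny, "BEFORE|AFTER") ||
             TailMatches("CREATE", "OR", "REPLACE", "TRIGGER", MatchAny,
                         "BEFORE|AFTER"))
        COMPLETE_WITH("INSERT", "DELETE", "UPDATE", "TRUNCATE");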

regards, tom lane




Re: pgbench: option delaying queries till connections establishment?

2020-11-16 Thread Marina Polyakova

Hello!

On 2020-11-14 20:07, Alexander Korotkov wrote:

Hmm...  Let's see the big picture.  You've recently committed a
patchset, which greatly improved the performance of GetSnapshotData().
And you're making further improvements in this direction.  But you're
getting trouble in measuring the effect, because Postgres is still
stuck on ProcArrayLock.  And in this thread you propose a workaround
for that implemented on the pgbench side.  My very dumb idea is
following: should we finally give a chance to more fair lwlocks rather
than inventing workarounds?

As I remember, your major argument against more fair lwlocks was the
idea that we should fix lwlocks use-cases rather than lwlock mechanism
themselves.  But can we expect that we fix all the lwlocks use-case in
any reasonable prospect?  My guess is 'no'.

Links
1.
https://www.postgresql.org/message-id/CAPpHfdvJhO1qutziOp%3Ddy8TO8Xb4L38BxgKG4RPa1up1Lzh_UQ%40mail.gmail.com


Sorry, I'm not familiar with the internal architecture of snapshots,
locks, etc. in Postgres, but I wanted to ask: what exact kind of patch
for fair lwlocks do you want to offer to the community? I applied the
6th version of the fair-lwlocks patch from [1] to the old master
branch (see commit [2]), started many threads in pgbench (-M prepared -c
1000 -j 500 -T 10 -P1 -S), and I did not get stable first progress
reports, which IIUC are one of the advantages of the patch discussed in
this thread for pgbench (see [3])...


[1] 
https://www.postgresql.org/message-id/CAPpHfduV3v3EG7K74-9htBZz_mpE993zGz-%3D2k5RNA3tqabUAA%40mail.gmail.com
[2] 
https://github.com/postgres/postgres/commit/84d514887f9ca673ae688d00f8b544e70f1ab270
[3] 
https://www.postgresql.org/message-id/20200227185129.hikscyenomnlrord%40alap3.anarazel.de


--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Tab complete for CREATE OR REPLACE TRIGGER statement

2020-11-16 Thread Michael Paquier
On Mon, Nov 16, 2020 at 09:14:51PM -0500, Tom Lane wrote:
> It's kind of depressing how repetitive the patch is.  There's
> probably not much to be done about that in the short run, but
> it seems to point up the need to start thinking about how to
> refactor tab-complete.c more thoroughly.

Agreed.  One thing that I'd think could help is a new wild card to
make some of the words conditional in the list of items.  But that may
be tricky once you consider the case of a group of words.

I don't think that this is a requirement for this thread, though.
--
Michael


signature.asc
Description: PGP signature


Re: Tab complete for CREATE OR REPLACE TRIGGER statement

2020-11-16 Thread Tom Lane
Michael Paquier  writes:
> I don't think that this is a requirement for this thread, though.

Agreed, I'm not trying to block this patch.  Just wishing
there were a better way.

regards, tom lane




Re: pgbench: option delaying queries till connections establishment?

2020-11-16 Thread Andres Freund
Hi,

On 2020-11-14 20:07:38 +0300, Alexander Korotkov wrote:
> Hmm...  Let's see the big picture.  You've recently committed a
> patchset, which greatly improved the performance of GetSnapshotData().
> And you're making further improvements in this direction.  But you're
> getting trouble in measuring the effect, because Postgres is still
> stuck on ProcArrayLock.

No, the problem was that I couldn't measure the before/after behaviour
reliably, because not all connections actually ever get established
*before* the GetSnapshotData() scalability patchset. That made the
numbers pointless, because we'd often end up with e.g. 80 connections
doing work pre-patch and 800 post-patch, which obviously measures very
different things.

I think the issue really is that, independent of PG lock contention,
it'll take a while to establish all connections, and that starting to
benchmark with only some connections established will create pretty
pointless numbers.


> And in this thread you propose a workaround
> for that implemented on the pgbench side.  My very dumb idea is
> following: should we finally give a chance to more fair lwlocks rather
> than inventing workarounds?

Perhaps - I just don't think it's related to this thread. And it's unclear
how you're going to address the overhead.

Greetings,

Andres Freund




Re: Add Information during standby recovery conflicts

2020-11-16 Thread Masahiko Sawada
On Mon, Nov 16, 2020 at 4:55 PM Drouvot, Bertrand  wrote:
>
> Hi,
>
> On 11/16/20 6:44 AM, Masahiko Sawada wrote:
> > Thank you for updating the patch.
> >
> > Here are review comments.
> >
> > +   if (report_waiting && (!logged_recovery_conflict ||
> > new_status == NULL))
> > +   ts = GetCurrentTimestamp();
> >
> > The condition will always be true if log_recovery_conflict_waits is
> > false and report_waiting is true, leading to unnecessary calls to
> > GetCurrentTimestamp().
> >
> > ---
> > +   
> > +You can control whether a log message is produced when the startup 
> > process
> > +is waiting longer than deadlock_timeout for recovery
> > +conflicts. This is controled by the  > linkend="guc-log-recovery-conflict-waits"/>
> > +parameter.
> > +   
> >
> > s/controled/controlled/
> >
> > ---
> >  if (report_waiting)
> >  waitStart = GetCurrentTimestamp();
> >
> > Similarly, we have the above code but we don't need to call
> > GetCurrentTimestamp() if update_process_title is false, even if
> > report_waiting is true.
> >
> > I've attached the patch that fixes the above comments. It can be
> > applied on top of your v8 patch.
>
> Thanks for the review and the associated fixes!
>
> I've attached a new version that contains your fixes.
>

Thank you for updating the patch.

I have other comments:

+   
+You can control whether a log message is produced when the startup process
+is waiting longer than deadlock_timeout for recovery
+conflicts. This is controlled by the
+ parameter.
+   

It would be better to use 'WAL replay' instead of 'the startup
process' for consistency with circumjacent descriptions. What do you
think?

---
@@ -1260,6 +1262,8 @@ ProcSleep(LOCALLOCK *locallock, LockMethod
lockMethodTable)
  else
  enable_timeout_after(DEADLOCK_TIMEOUT, DeadlockTimeout);
  }
+ else
+   standbyWaitStart = GetCurrentTimestamp();

I think we can add a check of log_recovery_conflict_waits to avoid
unnecessary calls to GetCurrentTimestamp().

I've attached the updated version patch including the above comments
as well as adding some assertions. Please review it.

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/


v9-Log-the-standby-recovery-conflict-waits.patch
Description: Binary data


RE: Disable WAL logging to speed up data loading

2020-11-16 Thread tsunakawa.ta...@fujitsu.com
From: Robert Haas 
> I'm also concerned about the way that this proposed feature interacts with
> incremental backup capabilities that already exist in tools like pgBackRest,
> EDB's BART, pg_probackup, and future things we might want to introduce into
> core, along the lines of what I have previously proposed. Now, I think
> pgBackRest uses only timestamps and checksums, so it probably doesn't care,
> but some of the other solutions rely on WAL-scanning to gather a list of
> changed blocks. I guess there's no reason that they can't notice the wal_level
> being changed and do the right thing; they should probably have that kind of
> capability already. Still, it strikes me that it might be useful if we had a 
> stronger
> mechanism.

Having a quick look, those backup tools seem to require setting wal_level to 
replica or higher.  That's no wonder, because recovering the database needs WAL 
for non-relation resources such as pg_control and relation map.  So, I think 
wal_level = none won't introduce new issues (compared to wal_level = minimal, 
which also can lack WAL records for some data updates.)


> By the way, another problem here is that some AMs - e.g. GiST, IIRC - use LSNs
> to figure out whether a block has changed. For temporary and unlogged tables,
> we use "fake" LSNs that are generated using a counter, but that approach only
> works because such relations are never really WAL-logged. Mixing fake LSNs
> and real LSNs will break stuff, and not bumping the LSN when the page
> changes probably will, too.

Unlogged GiST indexes use fake LSNs that are instance-wide.  Unlogged temporary 
GiST indexes use backend-local sequence values.  Other unlogged and temporary 
relations don't set LSNs on pages.  So, I think it's enough to call 
GetFakeLSNForUnloggedRel() when wal_level = none as well.
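
For illustration, a rough sketch of that idea based on gistGetFakeLSN() in
gistutil.c; the exact condition for the new wal_level is assumed here, not
taken from the patch:

    XLogRecPtr
    gistGetFakeLSN(Relation rel)
    {
        if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
        {
            /* Temporary relations are backend-local; a local counter suffices. */
            static XLogRecPtr counter = FirstNormalUnloggedLSN;

            return counter++;
        }

        /*
         * Unlogged relations -- and, under this proposal, permanent relations
         * while wal_level = none -- would share the instance-wide counter
         * maintained by GetFakeLSNForUnloggedRel().
         */
        return GetFakeLSNForUnloggedRel();
    }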


Regards
Takayuki Tsunakawa



Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-11-16 Thread Fujii Masao




On 2020/11/16 16:35, Masahiro Ikeda wrote:

On 2020-11-12 14:58, Fujii Masao wrote:

On 2020/11/06 10:25, Masahiro Ikeda wrote:

On 2020-10-30 11:50, Fujii Masao wrote:

On 2020/10/29 17:03, Masahiro Ikeda wrote:

Hi,

Thanks for your comments and advice. I updated the patch.

On 2020-10-21 18:03, Kyotaro Horiguchi wrote:

At Tue, 20 Oct 2020 16:11:29 +0900, Masahiro Ikeda
 wrote in

On 2020-10-20 12:46, Amit Kapila wrote:
> I see that we also need to add extra code to capture these stats (some
> of which is in performance-critical path especially in
> XLogInsertRecord) which again makes me a bit uncomfortable. It might
> be that it is all fine as it is very important to collect these stats
> at cluster-level in spite that the same information can be gathered at
> statement-level to help customers but I don't see a very strong case
> for that in your proposal.


We should avoid that duplication as much as possible even if both numbers
are important.


Also, about performance, I thought there would be little impact because it
increments stats in memory. If I can implement it to reuse pgWalUsage's
values, which already collect these stats, there is no impact in
XLogInsertRecord.
For example, how about having pg_stat_wal() calculate the accumulated
values of wal_records, wal_fpi, and wal_bytes from pgWalUsage's
values?


I don't think that works, but it would work that pgstat_send_wal()
takes the difference of that values between two successive calls.

WalUsage prevWalUsage;
...
pgstat_send_wal()
{
..
   /* fill in some values using pgWalUsage */
   WalStats.m_wal_bytes   = pgWalUsage.wal_bytes   - prevWalUsage.wal_bytes;
   WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records;
   WalStats.m_wal_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi;
...
   pgstat_send(&WalStats, sizeof(WalStats));

   /* remember the current numbers */
   prevWalUsage = pgWalUsage;


Thanks to Horiguchi-san's advice, I changed the patch to reuse pgWalUsage,
which is already defined, eliminating the extra overhead.


+    /* fill in some values using pgWalUsage */
+    WalStats.m_wal_bytes = pgWalUsage.wal_bytes - prevWalUsage.wal_bytes;
+    WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records;
+    WalStats.m_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi;

It's better to use WalUsageAccumDiff() here?


Yes, thanks. I fixed it.


prevWalUsage needs to be initialized with pgWalUsage?
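
(For illustration, a minimal sketch of what the WalUsageAccumDiff()-based
version could look like, including the initialization question above; names
follow the pseudo-code quoted earlier, and this is not the actual patch.)

    static WalUsage prevWalUsage;   /* assumed to be set from pgWalUsage at startup */

    static void
    pgstat_send_wal(void)
    {
        WalUsage    walusage;

        /* Accumulate only the difference since the previous report. */
        memset(&walusage, 0, sizeof(walusage));
        WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);

        WalStats.m_wal_records = walusage.wal_records;
        WalStats.m_wal_fpi     = walusage.wal_fpi;
        WalStats.m_wal_bytes   = walusage.wal_bytes;

        /* ... set the message header etc. as in the pseudo-code above ... */
        pgstat_send(&WalStats, sizeof(WalStats));

        /* Remember the current totals for the next call. */
        prevWalUsage = pgWalUsage;
        memset(&WalStats, 0, sizeof(WalStats));
    }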

+    if (AmWalWriterProcess()){
+    WalStats.m_wal_write_walwriter++;
+    }
+    else
+    {
+    WalStats.m_wal_write_backend++;
+    }

I think that it's better not to separate m_wal_write_xxx into two for
walwriter and other processes. Instead, we can use one m_wal_write_xxx
counter and make pgstat_send_wal() send also the process type to
the stats collector. Then the stats collector can accumulate the counters
per process type if necessary. If we adopt this approach, we can easily
extend pg_stat_wal so that any fields can be reported per process type.


I'll remove the above source code because these counters are not useful.


On 2020-10-30 12:00, Fujii Masao wrote:

On 2020/10/20 11:31, Masahiro Ikeda wrote:

Hi,

I think we need to add some statistics to pg_stat_wal view.

Although there are some parameter related WAL,
there are few statistics for tuning them.

I think it's better to provide the following statistics.
Please let me know your comments.

```
postgres=# SELECT * from pg_stat_wal;
-[ RECORD 1 ]---+--
wal_records | 2000224
wal_fpi | 47
wal_bytes   | 248216337
wal_buffers_full    | 20954
wal_init_file   | 8
wal_write_backend   | 20960
wal_write_walwriter | 46
wal_write_time  | 51
wal_sync_backend    | 7
wal_sync_walwriter  | 8
wal_sync_time   | 0
stats_reset | 2020-10-20 11:04:51.307771+09
```

1. Basic statistics of WAL activity

- wal_records: Total number of WAL records generated
- wal_fpi: Total number of WAL full page images generated
- wal_bytes: Total amount of WAL bytes generated

To understand DB's performance, first, we will check the performance
trends for the entire database instance.
For example, if the number of wal_fpi becomes higher, users may tune
"wal_compression", "checkpoint_timeout" and so on.

Although users can check the above statistics via EXPLAIN, auto_explain,
autovacuum and pg_stat_statements now,
if users want to see the performance trends  for the entire database,
they must recalculate the statistics.

I think it is useful to add the sum of the basic statistics.


2.  WAL segment file creation

- wal_init_file: Total number of WAL segment files created.

Creating a new WAL file may have an impact on the performance of
a write-heavy workload generating lots of WAL. If this number is reported as high,
to reduce the number of these initializations, we can tune WAL-related parameters
so that more "recycled" WAL files can 

Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-11-16 Thread lchch1...@sina.cn

>> Thanks for your comments.
>>
>> Your understanding is almost the same as mine.
>> It describes the case where not only backends but also other background
>> processes initialize a new WAL page and find that the WAL buffers' space
>> is already in use, with no free space left.
>>
>>> 'Total number of times WAL data was written to disk because a backend had
>>> to flush a WAL buffer to reuse it for an advanced WAL page'
>>
>> Thanks for your suggestion.
>> I was concerned that users might be confused about how to use
>> "wal_buffers_full" and how to tune the related parameters.
>>
>> I thought the reason why the WAL buffers have no space is
>> important for users tuning the wal_buffers parameter.
>>
>> How about the following wording?
>>
>> 'Total number of times WAL data was written to the disk because the WAL
>> buffers got full when initializing a new WAL page'
>Or what about the following?
>Total number of times WAL data was written to the disk, to claim the buffer 
>page to insert new
>WAL data when the WAL buffers got filled up with unwritten WAL data.
As I understand it, we cannot say 'full' because every WAL page maps to a
specific WAL buffer slot.
When a WAL page needs to be written out but its buffer slot is occupied by
another WAL page, the backend has to wait until that slot is released.
So I think we should say 'occupied' rather than 'full'.

Maybe:
Total number of times WAL data was written to the disk, to claim the buffer
page to insert new WAL data when the target WAL buffer was occupied by
another page.



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca



Re: enable_incremental_sort changes query behavior

2020-11-16 Thread Tomas Vondra
Hmm, I missed that other thread. That indeed seems like a bug in the
same area already tweaked by ebb7ae839d033d0f2 for similar cases.

The attached patch fixes this simply by adding is_parallel_safe to
get_useful_pathkeys_for_relation - that does fix the reproducer, but I'm
not entirely sure that's the right place. Maybe it should be done in
find_em_expr_usable_for_sorting_rel (which might make a difference if an
EC clause can contain a mix of parallel safe and unsafe expressions). Or
maybe we should do it in the caller (which would allow using
get_useful_pathkeys_for_relation in contexts not requiring parallel
safety in the future).

Anyway, while this is not an "incremental sort" bug, it seems like the
root cause is the same as for ebb7ae839d033d0f2 - one of the incremental
sort patches started considering sorting below gather nodes, not
realizing these possible consequences.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 84a69b064a..93db261011 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2826,6 +2826,7 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 
 		foreach(lc, root->query_pathkeys)
 		{
+			Expr   *expr;
 			PathKey*pathkey = (PathKey *) lfirst(lc);
 			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
 
@@ -2840,7 +2841,14 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 			 * enable not just an incremental sort on the entirety of
 			 * query_pathkeys but also incremental sort below a JOIN.
 			 */
-			if (!find_em_expr_usable_for_sorting_rel(pathkey_ec, rel))
+			if (!(expr = find_em_expr_usable_for_sorting_rel(pathkey_ec, rel)))
+break;
+
+			/*
+			 * Also stop when the expression is not parallel-safe. We plan
+			 * to add all of this under a Gather node.
+			 */
+			if (!is_parallel_safe(root, (Node *) expr))
 break;
 
 			npathkeys++;


Re: pgbench: option delaying queries till connections establishment?

2020-11-16 Thread Fabien COELHO




I think the issue really is that, independent of PG lock contention,
it'll take a while to establish all connections, and that starting to
benchmark with only some connections established will create pretty
pointless numbers.


Yes. This is why I think that if we have some way to synchronize it should 
always be used, i.e. not an option.
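
For illustration, the synchronization idea boils down to something like the
following standalone sketch: every thread connects first, waits on a barrier,
and only then starts the clock and runs queries. This shows the concept only,
not pgbench's implementation:

    #include <pthread.h>

    #define NTHREADS 8

    static pthread_barrier_t start_barrier;

    static void *
    worker(void *arg)
    {
        (void) arg;
        /* ... establish the database connection here ... */

        /* Wait until *all* threads have established their connections. */
        pthread_barrier_wait(&start_barrier);

        /* ... record the start time and run the benchmark queries ... */
        return NULL;
    }

    int
    main(void)
    {
        pthread_t   threads[NTHREADS];

        pthread_barrier_init(&start_barrier, NULL, NTHREADS);
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(threads[i], NULL);
        pthread_barrier_destroy(&start_barrier);
        return 0;
    }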


--
Fabien.




Re: list of extended statistics on psql

2020-11-16 Thread Tatsuro Yamada

Hi Tomas,

Thanks for your comments and also revising patches.

On 2020/11/16 3:22, Tomas Vondra wrote:

It's better to always post the whole patch series, so that cfbot can
test it properly. Sending just 0003 separately kind breaks that.


I now understand how "cfbot" works, so I'll take care of that
when I send patches. Thanks.



Also, 0003 seems to only tweak the .sql file, not the expected output,
and there actually seems to be two places that mistakenly use \dx (so
listing extensions) instead of \dX. I've fixed both issues in the
attached patches.


Oops, sorry about that.

 

However, I think the 0002 tests are better/sufficient - I prefer to keep
it compact, not interleaving with the tests testing various other stuff.
So I don't intend to commit 0003, unless there's something that I don't
see for some reason.


Agreed. With 0002 it is easier to modify test cases and check the results
than with 0003. Therefore, I'll go with 0002.

 

The one remaining thing I'm not sure about is naming of the columns with
size of statistics - N_size, D_size and M_size does not seem very clear.
Any clearer naming will however make the tables wider, though :-/


Yeah, I think so too, but I couldn't come up with suitable names for
the columns when I created the patch.
I don't prefer long names, but I'll replace them to make things clearer.
For example, s/N_size/Ndistinct_size/.

Please find the attached patches:
  - 0001: Replace column names
  - 0002: Recreate regression test based on 0001


Regards,
Tatsuro Yamada

From 85fe05c3020cd595ae8d5c2cc6f695b39f4a6e03 Mon Sep 17 00:00:00 2001
From: Tatsuro Yamada 
Date: Tue, 17 Nov 2020 13:30:57 +0900
Subject: [PATCH 2/2] Recreate regression test

---
 src/test/regress/expected/stats_ext.out | 94 +
 src/test/regress/sql/stats_ext.sql  | 31 +++
 2 files changed, 125 insertions(+)

diff --git a/src/test/regress/expected/stats_ext.out 
b/src/test/regress/expected/stats_ext.out
index 4c3edd213f..27ca54a8f3 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -1550,6 +1550,100 @@ INSERT INTO tststats.priv_test_tbl
 CREATE STATISTICS tststats.priv_test_stats (mcv) ON a, b
   FROM tststats.priv_test_tbl;
 ANALYZE tststats.priv_test_tbl;
+-- Check printing info about extended statistics by \dX
+create table stts_t1 (a int, b int);
+create statistics stts_1 (ndistinct) on a, b from stts_t1;
+create statistics stts_2 (ndistinct, dependencies) on a, b from stts_t1;
+create statistics stts_3 (ndistinct, dependencies, mcv) on a, b from stts_t1;
+create table stts_t2 (a int, b int, c int);
+create statistics stts_4 on b, c from stts_t2;
+create table stts_t3 (col1 int, col2 int, col3 int);
+create statistics stts_hoge on col1, col2, col3 from stts_t3;
+create schema stts_s1;
+create schema stts_s2;
+create statistics stts_s1.stts_foo on col1, col2 from stts_t3;
+create statistics stts_s2.stts_yama (dependencies, mcv) on col1, col3 from 
stts_t3;
+insert into stts_t1 select i,i from generate_series(1,100) i;
+analyze stts_t1;
+\dX
+                                        List of extended statistics
+  Schema  |          Name          |              Definition              | Ndistinct | Dependencies |   MCV   
+----------+------------------------+--------------------------------------+-----------+--------------+---------
+ public   | func_deps_stat         | a, b, c FROM functional_dependencies |           | built        | 
+ public   | mcv_lists_arrays_stats | a, b, c FROM mcv_lists_arrays        |           |              | built
+ public   | mcv_lists_bool_stats   | a, b, c FROM mcv_lists_bool          |           |              | built
+ public   | mcv_lists_stats        | a, b, d FROM mcv_lists               |           |              | built
+ public   | stts_1                 | a, b FROM stts_t1                    | built     |              | 
+ public   | stts_2                 | a, b FROM stts_t1                    | built     | built        | 
+ public   | stts_3                 | a, b FROM stts_t1                    | built     | built        | built
+ public   | stts_4                 | b, c FROM stts_t2                    | defined   | defined      | defined
+ public   | stts_hoge              | col1, col2, col3 FROM stts_t3        | defined   | defined      | defined
+ stts_s1  | stts_foo               | col1, col2 FROM stts_t3              | defined   | defined      | defined
+ stts_s2  | stts_yama              | col1, col3 FROM stts_t3              |           | defined      | defined
+ tststats | priv_test_stats        | a, b FROM tststats.priv_test_tbl     |           |              | built
+(12 rows)
+
+\dX stts_?
+   List of extended statistics
+ Schema |  Name  |Definition | Ndistinct | Dependencies |   MCV   
+++---+---+--+-
+ public | stts_1 | a, b FROM stts_t1 | built |  

Re: Deleting older versions in unique indexes to avoid page splits

2020-11-16 Thread Peter Geoghegan
On Sun, Nov 15, 2020 at 2:29 PM Victor Yegorov  wrote:
> TPS
> ---
>  query  | Master TPS | Patched TPS |  diff
> ++-+---
> UPDATE + SELECT |   2413 |2473 | +2.5%
> 3 SELECT in txn |  19737 |   19545 | -0.9%
> 15min SELECT|   0.74 |1.03 |  +39%
>
> Based on the figures and also on the graphs attached, I can tell v8 has no 
> visible regression
> in terms of TPS, IO pattern changes slightly, but the end result is worth it.
> In my view, this patch can be applied from a performance POV.

Great, thanks for testing!

-- 
Peter Geoghegan




RE: Terminate the idle sessions

2020-11-16 Thread kuroda.hay...@fujitsu.com
Dear Li, David,

> Additionally, using postgres_fdw within the server doesn't cause issues,
> its using postgres_fdw and the remote server having this setting set to zero 
> that causes a problem.

I didn't know that postgres_fdw could be used within the server... Thanks.

I read the optimize-setitimer patch, and it looks basically good. I have
written down my understanding below,
so please confirm whether it matches your implementation.
(Maybe I missed some subtle cases, so reviews from anyone are welcome...)

[basic concept]

sigalrm_due_at means the time at which the interval timer will ring, and
sigalrm_delivered records who has called schedule_alarm().
If the fin_time of active_timeouts[0] is later than or equal to sigalrm_due_at,
we can stop calling setitimer() because handle_sig_alarm() will be called soon enough.

[when setitimer is called]

In the attached patch, setitimer() will only be called in the following scenarios:

* when handle_sig_alarm() is called via the pqsignal handler
* when a timeout is registered and its fin_time is later than active_timeouts[0]
* when a timeout is disabled
* when handle_sig_alarm() is interrupted and rescheduled(?)

According to the comments, handle_sig_alarm() may be interrupted because of
ereport.
I think that if handle_sig_alarm() is interrupted before it updates
sigalrm_due_at,
the interval timer will never be set again. Is that correct, or is my assumption wrong?

Lastly, I found that setitimer() is obsolete and should be changed to another API.
According to my man page:

```
POSIX.1-2001, SVr4, 4.4BSD (this call first appeared in 4.2BSD).
POSIX.1-2008 marks getitimer() and setitimer() obsolete,
recommending the use of the POSIX timers API (timer_gettime(2), 
timer_settime(2), etc.) instead.
```

Do you have an opinion on this? I think it should be changed
if all platforms support the timer_settime() system call, but this fix affects
all timeouts,
so more consideration might be needed.
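
For reference, a minimal standalone sketch of the POSIX timer API mentioned
above (timer_create()/timer_settime()); this only shows the shape of the API,
not how timeout.c would have to be restructured. Link with -lrt on some
platforms:

    #include <signal.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static void
    on_alarm(int signo)
    {
        (void) signo;
        (void) write(STDOUT_FILENO, "timer fired\n", 12);
    }

    int
    main(void)
    {
        timer_t             timerid;
        struct sigevent     sev = {0};
        struct itimerspec   its = {0};

        signal(SIGALRM, on_alarm);

        sev.sigev_notify = SIGEV_SIGNAL;
        sev.sigev_signo = SIGALRM;
        if (timer_create(CLOCK_REALTIME, &sev, &timerid) != 0)
        {
            perror("timer_create");
            return 1;
        }

        /* One-shot timer, 200 ms from now; it_interval = 0 means no repeat. */
        its.it_value.tv_nsec = 200 * 1000 * 1000;
        if (timer_settime(timerid, 0, &its, NULL) != 0)
        {
            perror("timer_settime");
            return 1;
        }

        pause();                    /* wait for the signal */
        timer_delete(timerid);
        return 0;
    }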

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Support for NSS as a libpq TLS backend

2020-11-16 Thread Jacob Champion
On Nov 13, 2020, at 4:14 AM, Daniel Gustafsson  wrote:
>> On 12 Nov 2020, at 23:12, Jacob Champion  wrote:
>> 
>> I'm not completely sure why this is exposed so easily with an OpenSSL
>> server -- I'm guessing the implementation slices up its packets
>> differently on the wire, causing a read event before NSS is able to
>> decrypt a full record -- but it's worth noting that this case also shows
>> up during NSS-to-NSS psql connections, when handling notifications at
>> the end of every query. PQconsumeInput() reports a hard failure with the
>> current implementation, but its return value is ignored by
>> PrintNotifications(). Otherwise this probably would have showed up
>> earlier.
> 
> Should there perhaps be an Assert there to catch those?

Hm. From the perspective of helping developers out, perhaps, but from
the standpoint of "don't crash when an endpoint outside our control does
something strange", I think that's a harder sell. Should the error be
bubbled all the way up instead? Or perhaps, if psql isn't supposed to
treat notification errors as "hard" failures, it should at least warn
the user that something is fishy?
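
For illustration, this is the kind of check being discussed, as a standalone
libpq sketch (the function name here is ours, not psql's PrintNotifications()):

    #include <stdio.h>
    #include <libpq-fe.h>

    /* Drain pending notifications, reporting hard failures from the socket. */
    static int
    drain_notifications(PGconn *conn)
    {
        PGnotify   *notify;

        if (!PQconsumeInput(conn))      /* returns 0 on a hard failure */
        {
            fprintf(stderr, "lost connection: %s", PQerrorMessage(conn));
            return -1;
        }

        while ((notify = PQnotifies(conn)) != NULL)
        {
            printf("notification \"%s\" from backend PID %d\n",
                   notify->relname, notify->be_pid);
            PQfreemem(notify);
        }
        return 0;
    }

Whether psql should treat such a failure as fatal or merely warn is exactly the
open question above.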

>> (What's the best way to test this case? Are there lower-level tests for
>> the protocol/network layer somewhere that I'm missing?)
> 
> Not AFAIK.  Having been knee-deep now, do you have any ideas on how to
> implement?

I think that testing these sorts of important edge cases needs a
friendly DSL -- something that doesn't want to make devs tear their hair
out while building tests. I've been playing a little bit with Scapy [1]
to understand more of the libpq v3 protocol; I'll see if that can be
adapted for pieces of the TLS handshake in a way that's easy to
maintain. If it can be, maybe that'd be a good starting example.

> I've incorporated this patch as well as the previous patch for the assertion
> failure on private callback data into the attached v19 patchset.  I also did a
> spellcheck and pgindent run on it for ease of review.

Commit 6be725e70 got rid of some psql error messaging that the tests
were keying off of, so there are a few new failures after a rebase onto
latest master.

I've attached a patch that gets the SCRAM tests a little further
(certificate hashing was caught in an infinite loop). I also added error
checks to those loops, along the lines of the existing OpenSSL
implementation: if a suitable digest can't be found, the user will see
an error like

psql: error: could not find digest for OID 'PKCS #1 SHA-256 With RSA 
Encryption'

It's a little verbose but I don't think this case should come up in
normal practice.

--Jacob

[1] https://scapy.net/



nss-fix-hang-when-hashing-certificates.patch
Description: nss-fix-hang-when-hashing-certificates.patch


Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-16 Thread Dmitry Dolgov
> On Fri, Nov 13, 2020 at 09:25:40AM +0900, Michael Paquier wrote:
> On Thu, Nov 12, 2020 at 04:36:32PM +0100, Dmitry Dolgov wrote:
> > Interesting enough, similar discussion happened about vaccumFlags before
> > with the same conclusion that theoretically it's fine to update without
> > holding the lock, but this assumption could change one day and it's
> > better to avoid such risks. Having said that I believe it makes sense to
> > continue with locking. Are there any other opinions? I'll try to
> > benchmark it in the meantime.
>
> Thanks for planning some benchmarking for this specific patch.  I have
> to admit that the possibility of switching vacuumFlags to use atomics
> is very appealing in the long term, with or without considering this
> patch, even if we had better be sure that this patch has no actual
> effect on concurrency first if atomics are not used in worst-case
> scenarios.

I've tried first to test scenarios where GetSnapshotData produces
significant lock contention and "reindex concurrently" implementation
with locks interferes with it. The idea I had is to create a test
function that constantly calls GetSnapshotData (perf indeed shows
a significant portion of time spent on the contended lock), and run it
against a stream of "reindex concurrently" on an empty relation (which still
reaches the safe_index check). I guess it could be considered an
artificial extreme case. Measuring GetSnapshotData (or rather the
surrounding wrapper, to distinguish calls from the test function from
everything else) latency without reindex, with reindex and locks, with
reindex without locks should produce different "modes" and comparing
them we can make some conclusions.
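
(For reference, a sketch of the kind of test function described above -- a C
extension function that repeatedly takes snapshots so that GetSnapshotData()
dominates the profile. The names here are illustrative, not the actual
benchmark harness.)

    #include "postgres.h"
    #include "fmgr.h"
    #include "miscadmin.h"
    #include "utils/snapmgr.h"

    PG_MODULE_MAGIC;

    PG_FUNCTION_INFO_V1(snapshot_stress);

    /* SQL: CREATE FUNCTION snapshot_stress(bigint) RETURNS void ... */
    Datum
    snapshot_stress(PG_FUNCTION_ARGS)
    {
        int64       iterations = PG_GETARG_INT64(0);

        for (int64 i = 0; i < iterations; i++)
        {
            /* GetLatestSnapshot() calls GetSnapshotData() internally. */
            (void) GetLatestSnapshot();
            CHECK_FOR_INTERRUPTS();
        }

        PG_RETURN_VOID();
    }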

Latency histograms without reindex (nanoseconds):

 nsecs   : count distribution
   512 -> 1023   : 0||
  1024 -> 2047   : 10001209 ||
  2048 -> 4095   : 76936||
  4096 -> 8191   : 1468 ||
  8192 -> 16383  : 98   ||
 16384 -> 32767  : 39   ||
 32768 -> 65535  : 6||

The same with reindex without locks:

 nsecs   : count distribution
   512 -> 1023   : 0||
  1024 -> 2047   : 111345   ||
  2048 -> 4095   : 6997627  ||
  4096 -> 8191   : 18575||
  8192 -> 16383  : 586  ||
 16384 -> 32767  : 312  ||
 32768 -> 65535  : 18   ||

The same with reindex with locks:

 nsecs   : count distribution
   512 -> 1023   : 0||
  1024 -> 2047   : 59438||
  2048 -> 4095   : 6901187  ||
  4096 -> 8191   : 18584||
  8192 -> 16383  : 581  ||
 16384 -> 32767  : 280  ||
 32768 -> 65535  : 84   ||

It looks like reindex without locks is indeed faster (there are more
samples in the lower time sections), but the difference is not particularly
significant for the whole distribution, especially taking into account the
extremity of the test.

I'll take a look at benchmarking of switching vacuumFlags to use
atomics, but as it's probably a bit off topic I'm going to attach
another version of the patch with locks and suggested changes. To which
I have one question:

> Michael Paquier  writes:

> I think that this should be in its own routine, and that we had better
> document that this should be called just after starting a transaction,
> with an assertion enforcing that.

I'm not sure exactly which assertion condition you mean?
>From 07c4705a22e6cdd5717df46a974ce00a69fc901f Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Wed, 11 Nov 2020 15:19:48 +0100
Subject: [PATCH v4 1/2] Rename vacuumFlags to statusFlags

With more flags associated to a PGPROC entry that are not related to
vacuum (currently existing or planned), the name "statusFlags" describes
its purpose better.
---
 src/backend/access/transam/twophase.c |  2 +-
 src/backend/commands/vacuum.c |  6 +--
 src/backend/postmaster/autovacuum.c   |  6 +--
 src/backend/replication/logical/logical.c |  4 +-
 src/backend/replication/slot.c|  4 +-
 

Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-16 Thread Simon Riggs
On Tue, 10 Nov 2020 at 03:02, Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > Yeah ... it would be much better if we can make it use atomics instead.
>
> I was thinking more like "do we need any locking at all".
>
> Assuming that a proc's vacuumFlags can be set by only the process itself,
> there's no write conflicts to worry about.  On the read side, there's a
> hazard that onlookers will not see the PROC_IN_SAFE_IC flag set; but
> that's not any different from what the outcome would be if they looked
> just before this stanza executes.  And even if they don't see it, at worst
> we lose the optimization being proposed.
>
> There is a question of whether it's important that both copies of the flag
> appear to update atomically ... but that just begs the question "why in
> heaven's name are there two copies?"

Agreed with all of the above, but I think the issue is minuscule because
ProcArrayLock is acquired in LW_EXCLUSIVE mode at transaction commit.
So it doesn't seem like there is much need to optimize this particular
aspect of the patch.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Tracking cluster upgrade and configuration history

2020-11-16 Thread Mark Dilger



> On Nov 15, 2020, at 11:23 PM, Ian Lawrence Barwick  wrote:
> 
> On Mon, Nov 16, 2020 at 15:48 Bharath Rupireddy
>  wrote:
>> 
>> On Thu, Nov 12, 2020 at 4:01 AM Mark Dilger
>>  wrote:
>>> 
>>> While supporting customers, it would frequently be useful to have more 
>>> information about the history of a cluster.  For example, which prior 
>>> versions were ever installed and running on the cluster?  Has the cluster 
>>> ever been run with fsync=off?  When did the server last enter recovery, if 
>>> ever?  Was a backup_label file present at that time?
>>> 
>> 
>> +1 for the idea. The information will be useful at times for debugging 
>> purposes.
> 
> It's certainly something which would be nice to have.

Thanks for the feedback.

>>> Would it make sense to alternately, or additionally, store some of this 
>>> information in a flat text file in pg_data, say a new file named 
>>> "cluster_history" or such?
>>> 
>> 
>> IMHO, this is also a good idea. We need to think of the APIs to
>> open/read/write/close that history file? How often and which processes
>> and what type of data they write? Is it that the postmaster alone will
>> write into that file? If multiple processes are allowed to write, how
>> to deal with concurrent writers? Will users have to open manually and
>> read that file? or Will we have some program similar to
>> pg_controldata? Will we have some maximum limit to the size of this
>> file?
> 
> pg_stat_statements might be worth looking at as one way of handling that kind
> of file.
> 
> However the problem with keeping a separate file which is not WAL-logged would
> mean it doesn't get propagated to standbys, and there's also the question
> of how it could be maintained across upgrades via pg_upgrade.

Hmmm.  I was not expecting the file to be propagated to standbys.  The 
information could legitimately be different for a primary and a standby.  As a 
very simple example, there may be a flag bit for whether the cluster has 
operated as a standby.  That does raise questions about what sort of 
information about a primary that a standby should track, in case they get 
promoted to primary and information about the old primary would be useful for 
troubleshooting.  Ideas welcome

> 
> FWIW I did once create a background worker extension [1] which logs
> configuration changes to a table, though it's not particularly maintained or
> recommended for production use.

I'm happy to change course if the consensus on the list favors using something 
larger, like log files or logging to a table, but for now I'm still thinking 
about this in terms of something smaller than that.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Tracking cluster upgrade and configuration history

2020-11-16 Thread Mark Dilger



> On Nov 15, 2020, at 10:47 PM, Bharath Rupireddy 
>  wrote:
> 
> On Thu, Nov 12, 2020 at 4:01 AM Mark Dilger
>  wrote:
>> 
>> While supporting customers, it would frequently be useful to have more 
>> information about the history of a cluster.  For example, which prior 
>> versions were ever installed and running on the cluster?  Has the cluster 
>> ever been run with fsync=off?  When did the server last enter recovery, if 
>> ever?  Was a backup_label file present at that time?
>> 
> 
> +1 for the idea. The information will be useful at times for debugging 
> purposes.

Thanks for the feedback.

> 
>> 
>> Some of this type of information could strictly be fixed size, such as a 
>> fixed set of timestamps for the time at which a fixed set of things last 
>> occurred, or a fixed set of bits indicating whether a fixed set of things 
>> ever happened.
>> 
>> Some other types would be variable size, but hopefully short in practice, 
>> like a list of all postgres versions that have ever been run on the cluster.
>> 
>> Logging the information via the usual log mechanism seems insufficient, as 
>> log files may get rotated and this information lost.
>> 
> 
> True. Just a thought: can we use the existing logging mechanism and APIs
> to write to a new file that never gets rotated by the syslogger (of
> course, we need to think about the maximum file size that's allowed)? The
> idea is like this: we use elog/ereport and so on with a new debug
> level; when it is specified, instead of logging into the standard log files,
> we log to the new file.

That's going in a very different direction from what I had in mind.  I was 
imagining something like a single binary or text file of either fixed size or 
something very short but not fixed.  The "very short but not fixed" part of 
that seems a bit too hand-waving on reflection.  Any variable length list, such 
as the list of postgres versions started on the cluster, could be made fixed 
length by only tracking the most recent N of them, perhaps with a flag bit to 
indicate if the list has overflowed.

Using elog/ereport with a new log level that gets directed into a different log 
file is an interesting idea, but it is not clear how to use elog/ereport in a 
principled way to write files that need never get too large.

>> Would it be acceptable to store some fixed set of flag bits and timestamps 
>> in pg_control?  Space there is at a premium.
>> 
> 
> Since we allocate ControlFileData in shared memory and also we may
> have some data with timestamps, variable texts and so on, having this
> included in pg_control data structure would not seem a good idea to
> me.

Variable length texts seem completely out of scope for this.  I would expect 
the data to be a collection of integer types and flag bits.  Fixed length text 
might also be possible, but I don't have any examples in mind of text that we'd 
want to track.
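
For illustration, the sort of fixed-size record being discussed could look
something like this; every field name here is hypothetical, not a concrete
proposal:

    /* Purely illustrative sketch of a fixed-size cluster history record. */
    typedef struct ClusterHistoryData
    {
        uint32      version_history[8];    /* most recent N server versions seen */
        bool        version_overflowed;    /* more versions than we could record */
        uint32      flags;                  /* e.g. "ever ran with fsync = off" */
        TimestampTz last_recovery_time;     /* last time the server entered recovery */
        TimestampTz last_promotion_time;
        pg_crc32c   crc;                    /* same CRC discipline as pg_control */
    } ClusterHistoryData;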

>> Would it make sense to alternately, or additionally, store some of this 
>> information in a flat text file in pg_data, say a new file named 
>> "cluster_history" or such?
>> 
> 
> IMHO, this is also a good idea. We need to think of the APIs to
> open/read/write/close that history file? How often and which processes
> and what type of data they write? Is it that the postmaster alone will
> write into that file? If multiple processes are allowed to write, how
> to deal with concurrent writers? Will users have to open manually and
> read that file? or Will we have some program similar to
> pg_controldata? Will we have some maximum limit to the size of this
> file?

This depends in part on feedback about which information others on this list 
would like to see included, but I was imagining something similar to how 
pg_control works, or using pg_control itself.  The maximum size for pg_control 
is 512 bytes, and on my system sizeof(ControlFileData) = 296, which leaves 216 
bytes free.  I didn't check how much that might change on systems with 
different alignments.  We could either use some of the ~200 bytes currently 
available in pg_control, or use another file, "pg_history" or such, following 
the design pattern already used for pg_control.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company