Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-16 Thread Daniel Gustafsson
> On 16 Nov 2020, at 01:20, Michael Paquier  wrote:
> 
> On Sun, Nov 15, 2020 at 12:16:56PM -0500, Tom Lane wrote:
>> The obvious problem with this is that if !USE_OPENSSL, we will not have
>> pulled in openssl's headers.
> 
> FWIW, I argued upthread against including this part because it is
> useless: if not building with OpenSSL, we'll never have the base to be
> able to use RAND_poll().

How do you mean?  The OpenSSL PRNG can be used without setting up a context
etc.; the code in pg_strong_random is all we need to use it without USE_OPENSSL
(whether we'd like to is another story), or am I missing something?
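
To illustrate the point, here is a minimal standalone sketch (not the actual
pg_strong_random() code) showing that OpenSSL's PRNG needs nothing beyond
<openssl/rand.h> -- no SSL context or other initialization:

/* illustrative sketch only -- not the pg_strong_random() implementation */
#include <stdio.h>
#include <stdlib.h>
#include <openssl/rand.h>

int
main(void)
{
	unsigned char buf[16];

	/* re-seed the PRNG from the OS; important e.g. in a freshly forked process */
	if (RAND_poll() != 1)
		return EXIT_FAILURE;

	/* RAND_bytes() returns 1 on success */
	if (RAND_bytes(buf, sizeof(buf)) != 1)
		return EXIT_FAILURE;

	for (size_t i = 0; i < sizeof(buf); i++)
		printf("%02x", buf[i]);
	putchar('\n');
	return EXIT_SUCCESS;
}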

>> However ... all these machines are pointing at line 96, which is not
>> that one but the one under "#if defined(USE_OPENSSL)".  So I'm not sure
>> what to make of that, except that a bit more finesse seems required.
> 
> The build scripts of src/tools/msvc/ choose to not use OpenSSL as
> strong random source even if building with OpenSSL.  The top of the
> file only includes openssl/rand.h if using USE_OPENSSL_RANDOM.

The fallout here is precisely the reason why I argued for belts and suspenders
such that PRNG init is performed for (USE_OPENSSL || USE_OPENSSL_RANDOM).  I
don't trust the assumption that if one is there the other will always be there
as well, as long as they are disjoint.  Since we expose this PRNG to users, there
is a vector for spooling the rand state via UUID generation in case the PRNG is
faulty and has predictability, so failing to protect the after-fork case can
be expensive.  Granted, such vulnerabilities are rare but not inconceivable.

Now, this patch didn't get the header inclusion right, which is why this broke.

> Thinking about that afresh, I think that we got that wrong here on
> three points:
> - If attempting to use OpenSSL on Windows, let's just bite the bullet
> and use OpenSSL as random source, using Windows as source only when
> not building with OpenSSL.
> - Instead of using a call to RAND_poll() that we know will never work,
> let's just issue a compilation failure if attempting to use
> USE_OPENSSL_RANDOM without USE_OPENSSL.

Taking a step back, what is the use case of USE_OPENSSL_RANDOM if we force it to
be equal to USE_OPENSSL?  Shouldn't we in that case just have USE_OPENSSL,
adjust the logic and remove the below comment from configure.ac which isn't
really telling the truth?

  # Select random number source
  #
  # You can override this logic by setting the appropriate USE_*RANDOM flag to 1
  # in the template or configure command line.

I might be thick but I'm struggling to see the use for complications when we
don't support any pluggability.  Having said that, it might be the sane way in
the end to forcibly use the TLS library as a randomness source should there be
one (FIPS compliance comes to mind), but then we might as well spell that out.

> - rand.h needs to be included under USE_OPENSSL.


It needs to be included for both USE_OPENSSL and USE_OPENSSL_RANDOM unless we
combine them as per the above.

cheers ./daniel



Heads-up: macOS Big Sur upgrade breaks EDB PostgreSQL installations

2020-11-16 Thread Dave Page
Hi,

This is more of a heads-up than anything else, as I suspect this may come
up in various forums.

The PostgreSQL installers for macOS (from EDB, possibly others too) create
the data directory in /Library/PostgreSQL//data. This has been
the case since the first release, 10+ years ago.

It looks like the Big Sur upgrade has taken it upon itself to "fix" any
filesystem permissions it doesn't like. On my system, this resulted in the
data directory having 0755 permissions, which meant that PostgreSQL refused
to start. Manually changing the permissions back to 0700 (0750 should also
work) fixes the issue.

I'm not sure there's much we can do about this - systems that are likely to
be affected are already out there, and we obviously don't want to relax the
permissions Postgres requires.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com


Re: [PATCH] remove pg_archivecleanup and pg_standby

2020-11-16 Thread Michael Paquier
On Tue, Nov 03, 2020 at 05:28:46PM +0200, Heikki Linnakangas wrote:
> Removing pg_standby has been proposed a couple of times in the past. See 
> https://www.postgresql.org/message-id/20170913064824.rqflkadxwpboa...@alap3.anarazel.de
> for the latest attempt.
> 
> Masao-san, back in 2014 you mentioned "fast failover" as a feature that was
> missing from the built-in standby mode 
> (https://www.postgresql.org/message-id/CAHGQGwEE_8vvpQk0ex6Qa_aXt-OSJ7OdZjX4uM_FtqKfxq5SbQ%40mail.gmail.com).
> I think that's been implemented since, with the recovery_target settings.
> Would you agree?
> 
> I'm pretty sure we can remove pg_standby by now. But if there's something
> crucial missing from the built-in facilities, we need to talk about
> implementing them.

Reading the thread you are mentioning, it seems to me that the
status quo is the same, but I find it rather scary that this tool is used
in exactly zero tests.

Echoing with Robert, I think that pg_archivecleanup is still useful in
many cases, so that's not something we should remove.
--
Michael


signature.asc
Description: PGP signature


Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-11-16 Thread Masahiro Ikeda

On 2020-11-13 12:32, lchch1...@sina.cn wrote:

Now, pg_stat_wal supports resetting all information in WalStats
using the pg_stat_reset_shared('wal') function.
Isn't that enough?

Yes, it's OK; sorry, I missed this information.


OK.


3. I do not think the description in the documentation for
'wal_buffers_full' is correct.



Where should I rewrite the description? If my understanding is not
correct, please let me know.

Sorry, I have not described it clearly: I could not understand the meaning
of this column after reading the description in the documentation.
Now I have read the WAL-write source code and found the place where
'wal_buffers_full' is incremented: it is for a backend that has to wait for
a WAL buffer which is occupied by another WAL page, so the backend flushes
the old page in the WAL buffer (once it can).
So I think the original description in the documentation is not quite on
point; we could describe it as something like
'Total number of times WAL data was written to disk because a backend
needed a WAL buffer for an advanced WAL page'.

Sorry if my understanding is wrong.


Thanks for your comments.

Your understanding is almost the same as mine.
It describes the case where, when not only backends but also other
background processes initialize a new WAL page, the WAL buffers' space is
already used up and there is no room left.


'Total number of times WAL data was written to disk because a backend
needed a WAL buffer for an advanced WAL page'


Thanks for your suggestion.
I wondered whether users may be confused about how to use "wal_buffers_full"
and how to tune parameters.


I thought the reason why the WAL buffers have no space left is
important for users tuning the wal_buffers parameter.

How about the following wording?

'Total number of times WAL data was written to disk because the WAL
buffers became full when initializing a new WAL page'


Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION




Re: [HACKERS] logical decoding of two-phase transactions

2020-11-16 Thread Masahiko Sawada
On Mon, Nov 16, 2020 at 4:25 PM Ajin Cherian  wrote:
>
> Updated with a new test case
> (contrib/test_decoding/t/002_twophase-streaming.pl) that tests
> concurrent aborts during streaming prepare. Had to make a few changes
> to the test_decoding stream_start callbacks to handle
> "check-xid-aborted"
> the same way it was handled in the non stream callbacks. Merged
> Peter's v20-0006 as well.
>

Thank you for updating the patch.

I have a question about the timestamp of PREPARE on a subscriber node,
although this may have already been discussed.

With the current patch, the timestamps of PREPARE are different
between the publisher and the subscriber but the timestamp of their
commits are the same. For example,

-- There is 1 prepared transaction on a publisher node.
=# select * from pg_prepared_xacts;

 transaction | gid |           prepared            |  owner   | database
-------------+-----+-------------------------------+----------+----------
         510 | h1  | 2020-11-16 16:57:13.438633+09 | masahiko | postgres
(1 row)

-- This prepared transaction is replicated to a subscriber.
=# select * from pg_prepared_xacts;

 transaction | gid |           prepared            |  owner   | database
-------------+-----+-------------------------------+----------+----------
         514 | h1  | 2020-11-16 16:57:13.440593+09 | masahiko | postgres
(1 row)

These timestamps are different. Let's commit the prepared transaction
'h1' on the publisher and check the commit timestamps on both nodes.

-- On the publisher node.
=# select pg_xact_commit_timestamp('510'::xid);

   pg_xact_commit_timestamp
-------------------------------
 2020-11-16 16:57:13.474275+09
(1 row)

-- Commit prepared is also replicated to the subscriber node.
=# select pg_xact_commit_timestamp('514'::xid);

   pg_xact_commit_timestamp
-------------------------------
 2020-11-16 16:57:13.474275+09
(1 row)

The commit timestamps are the same. At PREPARE we use the local
timestamp of when PREPARE was executed as the 'prepared' time, while at COMMIT
PREPARED we use the origin's commit timestamp as the commit timestamp
if the commit WAL record has it.

This behaviour made me think of a possibility: if the clock of the
publisher is behind, then on the subscriber node the timestamp of COMMIT
PREPARED (i.e., the return value from pg_xact_commit_timestamp())
could be smaller than the timestamp of PREPARE (i.e., 'prepared_at' in
pg_prepared_xacts). I think it would not be a critical issue but I
think it might be worth discussing the behaviour.

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: Split copy.c

2020-11-16 Thread Heikki Linnakangas

On 16/11/2020 04:28, Justin Pryzby wrote:

On Tue, Nov 03, 2020 at 03:15:27PM +1300, David Rowley wrote:

On Tue, 3 Nov 2020 at 07:35, Andres Freund  wrote:


On 2020-11-02 19:43:38 +0200, Heikki Linnakangas wrote:

On 02/11/2020 19:23, Andres Freund wrote:

On 2020-11-02 11:03:29 +0200, Heikki Linnakangas wrote:

There isn't much common code between COPY FROM and COPY TO, so I propose
that we split copy.c into two: copyfrom.c and copyto.c. See attached. I think
that's much nicer.


Not quite convinced that's the right split - or perhaps there's just
more potential. My feeling is that splitting out all the DML related
code would make the code considerably easier to read.


What do you mean by DML related code?


Basically all the insertion related code (e.g CopyMultiInsert*, lots of
code in CopyFrom()) and perhaps also the type input invocations.


I quite like the fact that those are static and inline-able.  I very
much imagine there'd be a performance hit if we moved them out to
another .c file and made them extern.  Some of those functions can be
quite hot when copying into a partitioned table.


For another patch [0], I moved into copy.h:
+typedef struct CopyMultiInsertBuffer
+typedef struct CopyMultiInsertInfo
+CopyMultiInsertBufferInit(ResultRelInfo *rri)
+CopyMultiInsertInfoSetupBuffer(CopyMultiInsertInfo *miinfo,
+CopyMultiInsertInfoIsFull(CopyMultiInsertInfo *miinfo)
+CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo,
+CopyMultiInsertInfoNextFreeSlot(CopyMultiInsertInfo *miinfo,
+CopyMultiInsertInfoStore(CopyMultiInsertInfo *miinfo, ResultRelInfo *rri,

That's an experimental part 0002 of my patch in response to Simon's suggestion.
Maybe your response will be that variants of those interfaces should be added
to nodeModifyTable.[ch] instead of moving them.


Nice. I don't think that affects this patch too much.

I would suggest renaming the functions and structs to remove
the "Copy"-prefix. COPY uses them, but so does INSERT with the patch.



Currently I'm passing (void*)mtstate as cstate - if there were a
generic interface, that would be a void *state or so.
The functions only need cstate/mtstate to set the line number, for the 
error callback, and to access the transition_capture field. You could 
add a field for transition_capture in CopyMultiInsertInfo. For the line 
number, you could add a line number field in CopyMultiInsertInfo, set 
that in CopyMultiInsertBufferFlush() instead of cstate->cur_lineno, and 
teach CopyFromErrorCallback() to get the line number from there.
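
For illustration, a rough sketch of the struct change suggested above -- this
is not the actual copy.c definition, and the field names are assumptions:

/* sketch only: the existing buffer bookkeeping fields would stay as-is */
typedef struct CopyMultiInsertInfoSketch
{
	struct TransitionCaptureState *transition_capture;	/* held here instead of in cstate */
	unsigned long long cur_lineno;	/* set by CopyMultiInsertBufferFlush(),
									 * read by CopyFromErrorCallback() */
} CopyMultiInsertInfoSketch;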


- Heikki




Tab complete for CREATE OR REPLACE TRIGGER statement

2020-11-16 Thread Shinoda, Noriyoshi (PN Japan FSI)
Hi, Hackers.

 Yesterday, the OR REPLACE clause was added to the CREATE TRIGGER statement, so
I wrote a patch for tab completion in psql.
The patch adds tab completion for CREATE OR REPLACE TRIGGER, so that CREATE
TRIGGER and CREATE OR REPLACE TRIGGER statements are completed in the same way.
I referred to the tab completion code for the CREATE RULE statement.
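
For reference, a fragment illustrating roughly what such a change looks like in
src/bin/psql/tab-complete.c -- illustrative only, not the attached diff, and
assuming the existing TailMatches()/COMPLETE_WITH() macros:

	/* illustrative fragment: complete both spellings the same way */
	else if (TailMatches("CREATE", "TRIGGER", MatchAny) ||
			 TailMatches("CREATE", "OR", "REPLACE", "TRIGGER", MatchAny))
		COMPLETE_WITH("BEFORE", "AFTER", "INSTEAD OF");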

Regards,
Noriyoshi Shinoda


psql_create_or_replace_trigger_tab.diff
Description: psql_create_or_replace_trigger_tab.diff


Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-16 Thread Magnus Hagander
On Mon, Nov 16, 2020 at 10:19 AM Daniel Gustafsson  wrote:

> > On 16 Nov 2020, at 01:20, Michael Paquier  wrote:
> >
> > On Sun, Nov 15, 2020 at 12:16:56PM -0500, Tom Lane wrote:
> >> The obvious problem with this is that if !USE_OPENSSL, we will not have
> >> pulled in openssl's headers.
> >
> > FWIW, I argued upthread against including this part because it is
> > useless: if not building with OpenSSL, we'll never have the base to be
> > able to use RAND_poll().
>
> How do you mean?  The OpenSSL PRNG can be used without setting up a context
> etc, the code in pg_strong_random is all we need to use it without
> USE_OPENSSL
> (whether we'd like to is another story) or am I missing something?
>
> >> However ... all these machines are pointing at line 96, which is not
> >> that one but the one under "#if defined(USE_OPENSSL)".  So I'm not sure
> >> what to make of that, except that a bit more finesse seems required.
> >
> > The build scripts of src/tools/msvc/ choose to not use OpenSSL as
> > strong random source even if building with OpenSSL.  The top of the
> > file only includes openssl/rand.h if using USE_OPENSSL_RANDOM.
>
> The fallout here is precisely the reason why I argued for belts and
> suspenders
> such that PRNG init is performed for (USE_OPENSSL || USE_OPENSSL_RANDOM).
> I
> don't trust the assumption that if one is there the other will always be there
> as
> well as long as they are disjoint.  Since we expose this PRNG to users,
> there
> is a vector for spooling the rand state via UUID generation in case the
> PRNG is
> faulty and have predictability, so failing to protect the after-fork case
> can
> be expensive.  Granted, such vulnerabilities are rare but not
> inconcievable.
>
> Now, this patch didn't get the header inclusion right which is why thise
> broke.
>

> > Thinking about that afresh, I think that we got that wrong here on
> > three points:
> > - If attempting to use OpenSSL on Windows, let's just bite the bullet
> > and use OpenSSL as random source, using Windows as source only when
> > not building with OpenSSL.
> > - Instead of using a call to RAND_poll() that we know will never work,
> > let's just issue a compilation failure if attempting to use
> > USE_OPENSSL_RANDOM without USE_OPENSSL.
>
> Taking a step back, what is the use case of USE_OPENSSL_RANDOM if we force
> it to
> be equal to USE_OPENSSL?  Shouldn't we in that case just have USE_OPENSSL,
> adjust the logic and remove the below comment from configure.ac which
> isn't
> really telling the truth?


>   # Select random number source
>   #
>   # You can override this logic by setting the appropriate USE_*RANDOM
> flag to 1
>   # in the template or configure command line.
>
> I might be thick but I'm struggling to see the use for complications when
> we
> don't support any pluggability.  Having said that, it might be the sane
> way in
> the end to forcibly use the TLS library as a randomness source should
> there be
> one (FIPS compliance comes to mind), but then we might as well spell that
> out.
>
> > - rand.h needs to be included under USE_OPENSSL.
>
>
> It needs to be included for both USE_OPENSSL and USE_OPENSSL_RANDOM unless
> we
> combine them as per the above.
>


I agree with those -- either we remove the ability to choose random source
independently of the SSL library (and then only use the windows crypto
provider or /dev/urandom as platform-specific choices when *no* SSL library
is used), and in that case we should not have separate #ifdef's for them.

Or we fix the includes. Which is obviously easier, but we should take the
time to do what we think is right long-term of course.

Keeping two defines and an extra configure check when they mean the same
thing seems like the worst combination of the two.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Delay of standby shutdown

2020-11-16 Thread Fujii Masao




On 2020/11/13 2:58, Soumyadeep Chakraborty wrote:

Thanks! Marking this as ready for committer.


Pushed. Thanks!

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Cache relation sizes?

2020-11-16 Thread Konstantin Knizhnik




On 16.11.2020 10:11, Thomas Munro wrote:

On Tue, Aug 4, 2020 at 2:21 PM Thomas Munro  wrote:

On Tue, Aug 4, 2020 at 3:54 AM Konstantin Knizhnik
 wrote:

This shared relation cache can easily store relation size as well.
In addition it will solve a lot of other problems:
- noticeable overhead of local relcache warming
- large memory consumption in case of larger number of relations
O(max_connections*n_relations)
- sophisticated invalidation protocol and related performance issues
Certainly access to shared cache requires extra synchronization.But DDL
operations are relatively rare.
So in most cases we will have only shared locks. May be overhead of
locking will not be too large?

Yeah, I would be very happy if we get a high performance shared
sys/rel/plan/... caches in the future, and separately, having the
relation size available in shmem is something that has come up in
discussions about other topics too (tree-based buffer mapping,
multi-relation data files, ...).  ...

After recent discussions about the limitations of relying on SEEK_END
in a nearby thread[1], I decided to try to prototype a system for
tracking relation sizes properly in shared memory.  Earlier in this
thread I was talking about invalidation schemes for backend-local
caches, because I only cared about performance.  In contrast, this new
system has SMgrRelation objects that point to SMgrSharedRelation
objects (better names welcome) that live in a pool in shared memory,
so that all backends agree on the size.  The scheme is described in
the commit message and comments.  The short version is that smgr.c
tracks the "authoritative" size of any relation that has recently been
extended or truncated, until it has been fsync'd.  By authoritative, I
mean that there may be dirty buffers in that range in our buffer pool,
even if the filesystem has vaporised the allocation of disk blocks and
shrunk the file.

That is, it's not really a "cache".  It's also not like a shared
catalog, which Konstantin was talking about... it's more like the pool
of inodes in a kernel's memory.  It holds all currently dirty SRs
(SMgrSharedRelations), plus as many clean ones as it can fit, with
some kind of reclamation scheme, much like buffers.  Here, "dirty"
means the size changed.
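
For illustration, a rough sketch of the kind of shared object described above --
the name SMgrSharedRelation comes from the mail, but the fields here are
assumptions for illustration, not the prototype's actual layout:

#include <stdbool.h>

/* illustrative fields only -- not the prototype's real definition */
typedef struct SMgrSharedRelationSketch
{
	/* identity of the relation fork this pool entry describes */
	unsigned int dbOid;
	unsigned int relOid;
	int			forknum;

	/* authoritative size, maintained from extend/truncate until fsync'd */
	unsigned int nblocks;

	/* "dirty" means the size changed and has not yet reached the filesystem */
	bool		dirty;
} SMgrSharedRelationSketch;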


I will look at your implementation more closely later.
But for now I just want to ask one question: is it reasonable to spend a
substantial amount of effort trying to solve
the problem of redundant calculations of relation size, which seems to
be just part of a more general problem: a shared relation cache?


It is a well-known problem: private caches of the system catalog can cause
unacceptable memory consumption for a large number of backends.
If there are thousands of relations in the database (which is not such a rare
case, especially taking into account ORMs and temporary tables),
then the size of the catalog cache can be several megabytes. If that is
multiplied by a thousand backends, then we get gigabytes of memory used just for
private catalog caches. Thanks to Andres' optimizations of snapshot taking,
Postgres can now handle thousands of connections normally.
So this extremely inefficient use of memory for private catalog caches
becomes more and more critical.
Also, private caches require a sophisticated and error-prone invalidation
mechanism.


If we have such a shared cache, then keeping the shared relation size
seems to be a trivial task, doesn't it?
So maybe, before "diving" into the problem of maintaining a shared pool
of dirtied "inodes", we should first discuss and concentrate on solving the
more generic problem?







Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-11-16 Thread a . pervushina

Hello,

I've changed the BEGIN WAIT FOR LSN statement to core functions 
pg_waitlsn, pg_waitlsn_infinite and pg_waitlsn_no_wait.
Currently the functions work inside repeatable read transactions, but 
waitlsn creates a snapshot if called first in a transaction block, which
can possibly lead to the transaction working incorrectly, so the
function gives a warning.


Usage examples
==
select pg_waitlsn('LSN', timeout);
select pg_waitlsn_infinite('LSN');
select pg_waitlsn_no_wait('LSN');

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9d3f1c12fc5..27e964c8111 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -42,6 +42,7 @@
 #include "catalog/pg_database.h"
 #include "commands/progress.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/controldata_utils.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
@@ -7387,6 +7388,15 @@ StartupXLOG(void)
 	break;
 }
 
+/*
+ * If we replayed an LSN that someone was waiting for,
+ * set latches in shared memory array to notify the waiter.
+ */
+if (XLogCtl->lastReplayedEndRecPtr >= GetMinWaitedLSN())
+{
+	WaitSetLatch(XLogCtl->lastReplayedEndRecPtr);
+}
+
 /* Else, try to fetch the next WAL record */
 record = ReadRecord(xlogreader, LOG, false);
 			} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index d4815d3ce65..9b310926c12 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -57,6 +57,7 @@ OBJS = \
 	user.o \
 	vacuum.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 000..ac18040c362
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,297 @@
+/*-
+ *
+ * wait.c
+ *	  Implements waitlsn, which allows waiting for events such as
+ *	  LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include 
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/backendid.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+
+/* Add to shared memory array */
+static void AddWaitedLSN(XLogRecPtr lsn_to_wait);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64 min_lsn; /* XLogRecPtr of minimal waited for LSN */
+	slock_t		mutex;
+	/* LSNs that different backends are waiting */
+	XLogRecPtr	lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static WaitState *state;
+
+/*
+ * Add the wait event of the current backend to shared memory array
+ */
+static void
+AddWaitedLSN(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < state->min_lsn.value)
+		state->min_lsn.value = lsn_to_wait;
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete wait event of the current backend from the shared memory array.
+ */
+void
+DeleteWaitedLSN(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete;
+
+	SpinLockAcquire(&state->mutex);
+
+	lsn_to_delete = state->lsn[MyBackendId];
+	state->lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	/* If we are deleting the minimal LSN, then choose the next min_lsn */
+	if (lsn_to_delete != InvalidXLogRecPtr &&
+		lsn_to_delete == state->min_lsn.value)
+	{
+		state->min_lsn.value = PG_UINT64_MAX;
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->lsn[i] != InvalidXLogRecPtr &&
+state->lsn[i] < state->min_lsn.value)
+state->min_lsn.value = state->lsn[i];
+	}
+
+	/* If deleting from the end of the array, shorten the array's used part */
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->lsn[i] != InvalidXLogRecPtr)
+			{
+state->backend_maxid = i;
+break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/*
+ * Initialize an array of events to wait for in shared memory
+ */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+

Re: Online verification of checksums

2020-11-16 Thread Magnus Hagander
On Mon, Nov 16, 2020 at 1:23 AM Michael Paquier  wrote:

> On Sun, Nov 15, 2020 at 04:37:36PM +0100, Magnus Hagander wrote:
> > On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier 
> wrote:
> >> On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
> >>> I was referring to the patch I sent on this thread that fixes the
> >>> detection of a corruption for the zero-only case and where pd_lsn
> >>> and/or pg_upper are trashed by a corruption of the page header.  Both
> >>> cases allow a base backup to complete on HEAD, while sending pages
> >>> that could be corrupted, which is wrong.  Once you make the page
> >>> verification rely only on pd_checksum, as the patch does because the
> >>> checksum is the only source of truth in the page header, corrupted
> >>> pages are correctly detected, causing pg_basebackup to complain as it
> >>> should.  However, it has also the risk to cause pg_basebackup to fail
> >>> *and* to report as broken pages that are in the process of being
> >>> written, depending on how slow a disk is able to finish a 8kB write.
> >>> That's a different kind of wrongness, and users have two more reasons
> >>> to be pissed.  Note that if a page is found as torn we have a
> >>> consistent page header, meaning that on HEAD the PageIsNew() and
> >>> PageGetLSN() would pass, but the checksum verification would fail as
> >>> the contents at the end of the page does not match the checksum.
> >>
> >> Magnus, as the original committer of 4eb77d5, do you have an opinion
> >> to share?
> >>
> >
> > I admit that I at some point lost track of the overlapping threads around
> > this, and just figured there was enough different
> checksum-involved-people
> > on those threads to handle it :) Meaning the short answer is "no, I don't
> > really have one at this point".
> >
> > Slightly longer comment is that it does seem reasonable, but I have not
> > read in on all the different issues discussed over the whole thread, so
> > take that as a weak-certainty comment.
>
> Which part are you considering as reasonable?  The removal-feature
> part on a stable branch or perhaps something else?
>

I was referring to the latest patch on the thread. But as I said, I have
not read up on all the different issues raised in the thread, so take it
with a big grain of salt.

And I would also echo the previous comment that this code was adapted from
what the pgbackrest folks do. As such, it would be good to get a comment
from for example David on that -- I don't see any of them having commented
after that was mentioned?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Evaluate expression at planning time for two more cases

2020-11-16 Thread Pavel Borisov
Thank you for working on this!
I looked into this patch a little. I may be wrong, but my opinion is that
planner/optimizer-related changes are generally not without dangers.  So
anyway, they should be justified by a performance increase, or the previous
behavior should be considered totally wrong. Patching something that is just a
little sub-optimal does not seem necessary to me.

So it would be very good to see measurements of a performance gain from this
patch. And I also think tests with partitioned and inherited relations,
demonstrating correct behavior in the cases discussed in the thread, should be
a must-do for this patch.

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

The new status of this patch is: Waiting on Author


Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-11-16 Thread Alexander Korotkov
Hi!

On Mon, Nov 16, 2020 at 1:09 PM  wrote:
> I've changed the BEGIN WAIT FOR LSN statement to core functions
> pg_waitlsn, pg_waitlsn_infinite and pg_waitlsn_no_wait.
> Currently the functions work inside repeatable read transactions, but
> waitlsn creates a snapshot if called first in a transaction block, which
> can possibly lead to the transaction working incorrectly, so the
> function gives a warning.
>
> Usage examples
> ==
> select pg_waitlsn('LSN', timeout);
> select pg_waitlsn_infinite('LSN');
> select pg_waitlsn_no_wait('LSN');

The name pg_waitlsn_no_wait() looks confusing.  I've tried to see how
it's documented, but the patch doesn't contain documentation...

--
Regards,
Alexander Korotkov




doc CREATE INDEX

2020-11-16 Thread Erik Rijkers

This one seems to have fallen by the wayside.

Erik
--- doc/src/sgml/ref/create_index.sgml.orig	2020-11-16 13:04:29.923760413 +0100
+++ doc/src/sgml/ref/create_index.sgml	2020-11-16 13:04:54.260093095 +0100
@@ -746,7 +746,7 @@
   
 
   
-   The regularly system collects statistics on all of a table's
+   The system regularly collects statistics on all of a table's
columns.  Newly-created non-expression indexes can immediately
use these statistics to determine an index's usefulness.
For new expression indexes, it is necessary to run 

Re: More time spending with "delete pending"

2020-11-16 Thread Juan José Santamaría Flecha
On Sun, Nov 15, 2020 at 4:00 PM Alexander Lakhin 
wrote:

> As I've found out, readdir() replacement for Windows in fact gets all
> the needed information (correct file size, attributes...) in
> WIN32_FIND_DATA, but it just leaves it inside and returns only fields of
> the dirent structure. So pg_ls_dir_files() (that calls
> ReadDir()/readdir()) needs to get this information again, that leads to
> opening a file on Windows.
> I think it can be done more effectively by adding a new function
> ReadDirExtendedInfo(), that will additionally accept a pointer to
> "struct dirent_extra" with fields {valid (indicating that the structure
> is filled), attributes, file size, mtime}. Maybe the advanced function
> could also invoke stat() inside (not on Windows).
>
> As to patch I proposed before, I think it's still needed to fully
> support the following usage pattern:
> stat();
> if (no_error) {
> do_something();
> } else if (file_not_found) {
> do_something_else();
> } else {
> error();
> }
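
For illustration, a rough sketch of the ReadDirExtendedInfo() idea described
above -- the struct fields follow the quoted proposal, while the exact
signature is an assumption, not committed code:

#include <stdbool.h>
#include <dirent.h>
#include <sys/types.h>
#include <time.h>

/* sketch of the proposed dirent_extra (fields per the proposal above) */
typedef struct dirent_extra
{
	bool		valid;			/* the fields below have been filled in */
	unsigned int attributes;	/* e.g. FILE_ATTRIBUTE_* bits on Windows */
	off_t		size;			/* file size */
	time_t		mtime;			/* last modification time */
} dirent_extra;

/*
 * Like ReadDir(), but also hands back the metadata that the Windows
 * readdir() port already has in WIN32_FIND_DATA (and that stat() would
 * provide elsewhere), avoiding a second open of the file.
 */
struct dirent *ReadDirExtendedInfo(DIR *dir, const char *dirname,
								   dirent_extra *extra);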


We are currently missing a WIN32 lstat() port. I was thinking about
proposing a patch to implement it using GetFileAttributesEx(). That might
work as fall back to the CreateFile() if the file attribute is not a
FILE_ATTRIBUTE_REPARSE_POINT.

Anyhow, I do not think any retry logic should be in the stat() function,
but in the caller.

Regards,

Juan José Santamaría Flecha



Re: ECPG: proposal for new DECLARE STATEMENT

2020-11-16 Thread Shawn Wang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Recently I have been doing some work on ECPG, so I reviewed this patch. No
problems were found.

The new status of this patch is: Ready for Committer


Re: Get memory contexts of an arbitrary backend process

2020-11-16 Thread torikoshia

On 2020-10-28 15:32, torikoshia wrote:

On 2020-10-23 13:46, Kyotaro Horiguchi wrote:



I think we might need to step-back to basic design of this feature
since this patch seems to have unhandled corner cases that are
difficult to find.


I've written out the basic design below and attached
corresponding patch.

  # Communication flow between the dumper and the requestor (a rough sketch of
the shared state is included after this description)
  - (1) When requesting memory context dumping, the requestor adds an entry
to the shared memory. The entry manages the dump state and it is set to
'REQUESTING'.

  - (2) The requestor sends the signal to the dumper and waits on the latch.
  - (3) The dumper looks into the corresponding shared memory entry and
changes its state to 'DUMPING'.
  - (4) When the dumper completes dumping, it changes the state to
'DONE' and sets the latch.
  - (5) The requestor reads the dump file and shows it to the user.
Finally, the requestor removes the dump file and resets the shared memory
entry.


  # Query cancellation
  - When the requestor cancels dumping, e.g. signaling using ctrl-C, the
requestor changes the status of the shared memory entry to 'CANCELING'.
  - The dumper checks the status when it tries to change the state to
'DONE' at (4), and if the state is 'CANCELING', it removes the dump file
and resets the shared memory entry.


  # Cleanup of the dump file and the shared memory entry
  - In the normal case, the requestor removes the dump file and resets the
shared memory entry as described in (5).
  - When something like query cancellation or process termination
happens on the dumper after (1) and before (3), in other words, while the
state is 'REQUESTING', the requestor does the cleanup.
  - When something happens on the dumper or the requestor after (3) and
before (4), in other words, while the state is 'DUMPING', the dumper does the
cleanup. Specifically, if the requestor cancels the query, it just
changes the state to 'CANCELING' and the dumper notices it and cleans up
things later. OTOH, when the dumper fails to dump, it cleans up the dump
file and deletes the entry in the shared memory.
  - When something happens on the requestor after (4), i.e., while the state
is 'DONE', the requestor does the cleanup.
  - In the case of receiving SIGKILL or a power failure, all dump files
are removed in the crash recovery process.
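
A rough sketch of the shared-memory bookkeeping implied by the description
above -- the names and fields here are assumptions for illustration, not the
patch's actual definitions:

/* illustrative names only, not the patch's definitions */
typedef enum McxtDumpState
{
	MCXTDUMP_REQUESTING,	/* requestor created the entry; dumper not started yet */
	MCXTDUMP_DUMPING,		/* dumper is writing the dump file */
	MCXTDUMP_DONE,			/* dump complete; requestor reads and cleans up */
	MCXTDUMP_CANCELING		/* requestor cancelled; dumper cleans up when it notices */
} McxtDumpState;

typedef struct McxtDumpEntry
{
	int			dumper_pid;		/* PID of the backend being asked to dump */
	McxtDumpState state;		/* decides who is responsible for cleanup */
} McxtDumpEntry;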



Although there was a suggestion that shared memory hash
table should be changed to more efficient structures,
I haven't done it in this patch.
I think it can be treated separately, I'm going to work
on that later.


On 2020-11-11 00:07, Georgios Kokolatos wrote:

Hi,

I noticed that this patch fails on the cfbot.
For this, I changed the status to: 'Waiting on Author'.

Cheers,
//Georgios

The new status of this patch is: Waiting on Author


Thanks for your notification; I have updated the patch.
Changed the status to: 'Waiting on Author'.

Regards,

--
Atsushi TorikoshiFrom c6d06b11d16961acd59bfa022af52cb5fc668b3e Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Mon, 16 Nov 2020 11:49:03 +0900
Subject: [PATCH v4] Enabled pg_get_backend_memory_contexts() to collect
 arbitrary backend process's memory contexts.

Previously, pg_get_backend_memory_contexts() could only get the
local memory contexts. This patch enables it to get the memory contexts
of an arbitrary backend process whose PID is specified by the
argument.
---
 src/backend/access/transam/xlog.c|   7 +
 src/backend/catalog/system_views.sql |   4 +-
 src/backend/postmaster/pgstat.c  |   3 +
 src/backend/replication/basebackup.c |   3 +
 src/backend/storage/ipc/ipci.c   |   2 +
 src/backend/storage/ipc/procsignal.c |   4 +
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/backend/tcop/postgres.c  |   5 +
 src/backend/utils/adt/mcxtfuncs.c| 615 ++-
 src/backend/utils/init/globals.c |   1 +
 src/bin/initdb/initdb.c  |   3 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |   4 +-
 src/bin/pg_rewind/filemap.c  |   3 +
 src/include/catalog/pg_proc.dat  |  11 +-
 src/include/miscadmin.h  |   1 +
 src/include/pgstat.h |   3 +-
 src/include/storage/procsignal.h |   1 +
 src/include/utils/mcxtfuncs.h|  52 ++
 src/test/regress/expected/rules.out  |   2 +-
 19 files changed, 697 insertions(+), 28 deletions(-)
 create mode 100644 src/include/utils/mcxtfuncs.h

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a1078a7cfc..f628fa8b53 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -73,6 +73,7 @@
 #include "storage/sync.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
+#include "utils/mcxtfuncs.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
 #include "utils/relmapper.h"
@@ -6986,6 +6987,12 @@ StartupXLOG(void)
 		 */
 		pgstat_reset_all();
 
+		/*
+		 * Reset 

Re: Zedstore - compressed in-core columnar storage

2020-11-16 Thread Merlin Moncure
On Thu, Nov 12, 2020 at 4:40 PM Tomas Vondra
 wrote:
>                master   zedstore/pglz   zedstore/lz4
>   --------------------------------------------------
>   copy           1855           68092           2131
>   dump            751             905            811
>
> And the size of the lineitem table (as shown by \d+) is:
>
>   master: 64GB
>   zedstore/pglz: 51GB
>   zedstore/lz4: 20GB
>
> It's mostly expected lz4 beats pglz in performance and compression
> ratio, but this seems a bit too extreme I guess. Per past benchmarks
> (e.g. [1] and [2]) the difference in compression/decompression time
> should be maybe 1-2x or something like that, not 35x like here.

I can't speak to the ratio, but in basic backup/restore scenarios pglz
is absolutely killing me; performance is just awful; we are CPU-bound
in backups throughout the department.  Installations defaulting to
pglz will make this feature show very poorly.

merlin




Re: Terminate the idle sessions

2020-11-16 Thread Li Japin
Hi Kuroda,

On Nov 16, 2020, at 1:22 PM, 
kuroda.hay...@fujitsu.com wrote:


@@ -30,6 +30,7 @@ typedef enum TimeoutId
STANDBY_DEADLOCK_TIMEOUT,
STANDBY_TIMEOUT,
STANDBY_LOCK_TIMEOUT,
+ IDLE_SESSION_TIMEOUT,
IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
/* First user-definable timeout reason */
USER_TIMEOUT,

I'm not familiar with timeouts, but I can see that the priority of the
idle-session timeout is set lower than the idle-in-transaction timeout.
Could you explain the reason? In my understanding this timeout sits at the
lowest layer, so it might have the lowest
priority.

My apologies! I just added an enum value for the idle-session timeout and
ignored the comment that says the enum is ordered by priority.
Fixed as follows:

@@ -30,8 +30,8 @@ typedef enum TimeoutId
STANDBY_DEADLOCK_TIMEOUT,
STANDBY_TIMEOUT,
STANDBY_LOCK_TIMEOUT,
-   IDLE_SESSION_TIMEOUT,
IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+   IDLE_SESSION_TIMEOUT,
/* First user-definable timeout reason */
USER_TIMEOUT,
/* Maximum number of timeout reasons */
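
For illustration, a rough sketch of how such a timeout is typically wired up
with the existing timeout.h machinery -- the function names and GUC handling
here are assumptions, not necessarily what the patch does:

#include "postgres.h"
#include "utils/timeout.h"

/* assumed handler; the real patch decides how the idle session is terminated */
static void
IdleSessionTimeoutHandler(void)
{
}

/* called once during backend startup */
static void
InitIdleSessionTimeout(void)
{
	RegisterTimeout(IDLE_SESSION_TIMEOUT, IdleSessionTimeoutHandler);
}

/* called whenever the backend goes idle; disable_timeout() when work arrives */
static void
StartIdleSessionTimeout(int idle_session_timeout_ms)
{
	if (idle_session_timeout_ms > 0)
		enable_timeout_after(IDLE_SESSION_TIMEOUT, idle_session_timeout_ms);
}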

Thanks for your review! Attached.

--
Best regards
Japin Li






v5-0001-Allow-terminating-the-idle-sessions.patch
Description: v5-0001-Allow-terminating-the-idle-sessions.patch


v5-0002-Optimize-setitimer-usage.patch
Description: v5-0002-Optimize-setitimer-usage.patch


Re: doc CREATE INDEX

2020-11-16 Thread Alvaro Herrera
On 2020-Nov-16, Erik Rijkers wrote:

> This one seems to have fallen by the wayside.

So it did.

Pushed to all three branches, thanks.





Re: Add important info about ANALYZE after create Functional Index

2020-11-16 Thread Justin Pryzby
On Thu, Nov 12, 2020 at 06:01:02PM -0500, Bruce Momjian wrote:
> On Thu, Nov 12, 2020 at 03:11:43PM -0600, Justin Pryzby wrote:
> > I guess it should say "The system regularly ..."
> > 
> > Also, the last sentence begins "For new expression indexes" and ends with
> > "about new expression indexes", which I guess could instead say "about the
> > expressions".
> 
> How is this followup patch?

I see Alvaro already patched the first issue at bcbd77133.

The problematic language was recently introduced, and I'd reported at:
https://www.postgresql.org/message-id/20201112211143.GL30691%40telsasoft.com
And Erik at:
https://www.postgresql.org/message-id/e92b3fba98a0c0f7afc0a2a37e765954%40xs4all.nl

-- 
Justin




Re: cutting down the TODO list thread

2020-11-16 Thread John Naylor
On Wed, Nov 11, 2020 at 4:45 PM John Naylor 
wrote:

> Here is the next section on data types, proposed to be moved to the "not
> worth doing" page. As before, if there are any objections, do speak up.
> I'll make the move in a few days.
>

Hearing no objection, these have been moved over.

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-11-16 Thread Paul Guo



> On Nov 13, 2020, at 7:21 PM, Bharath Rupireddy 
>  wrote:
> 
> On Tue, Nov 10, 2020 at 3:47 PM Paul Guo  wrote:
>> 
>> Thanks for doing this. There might be another solution - use raw insert 
>> interfaces (i.e. raw_heap_insert()).
>> Attached is the test (not formal) patch that verifies this idea. 
>> raw_heap_insert() writes the page into the
>> table files directly and also write the FPI xlog when the tuples filled up 
>> the whole page. This seems be
>> more efficient.
>> 
> 
> Thanks. Will the new raw_heap_insert() APIs scale well (i.e. extend
> the table parallelly) with parallelism? The existing
> table_multi_insert() API scales well, see, for instance, the benefit
> with parallel copy[1] and parallel multi inserts in CTAS[2].

Yes, definitely some work needs to be done to make the raw heap insert
interfaces fit the parallel work, but
it seems that there are no hard blocking issues for this?

> 
> [1] - 
> https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com
> [2] - 
> https://www.postgresql.org/message-id/CALj2ACWFq6Z4_jd9RPByURB8-Y8wccQWzLf%2B0-Jg%2BKYT7ZO-Ug%40mail.gmail.com
> 
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com





Re: Skip ExecCheckRTPerms in CTAS with no data

2020-11-16 Thread Peter Eisentraut

On 2020-11-16 04:04, Michael Paquier wrote:

On Fri, Nov 13, 2020 at 12:58:52PM +0530, Bharath Rupireddy wrote:

It's not required to set bistate to null as we have allocated myState
with palloc0 in CreateIntoRelDestReceiver, it will anyways be null.
   if (!into->skipData)
 myState->bistate = GetBulkInsertState();

Attaching v4 patch. Please review it.


I have reviewed this one this morning, and applied it after some
tweaks.  I have reworded some of the comments, fixed some typos, and
largely refactored the test cases to stress all the combinations
possible.  Please note that your patch would have caused failures
in the buildfarm, as any role created needs to be prefixed with
"regress_".


While this patch was nice enough to update the documentation about the 
requirement of the INSERT privilege, this is maybe more confusing now: 
How could a new table not have INSERT privilege?  Yes, you can do that 
with default privileges, but that's not well known and should be 
clarified in the documentation.


The SQL standard says that for CREATE TABLE AS, the INSERT "is 
effectively executed without further Access Rule checking", which means 
the INSERT privilege shouldn't be required at all.  I suggest we 
consider doing that instead.  I don't see a use for the current behavior.





Re: default result formats setting

2020-11-16 Thread Andrew Dunstan


On 11/9/20 5:10 AM, Peter Eisentraut wrote:
> On 2020-11-05 22:03, Peter Eisentraut wrote:
>>> Independently of that, how would you implement "says otherwise" here,
>>> ie do a single-query override of the session's prevailing setting?
>>> Maybe the right thing for that is to define -1 all the way down to the
>>> protocol level as meaning "use the session's per-type default", and
>>> then if you don't want that you can pass 0 or 1.  An advantage of that
>>> is that you couldn't accidentally break an application that wasn't
>>> ready for this feature, because it would not be the default to use it.
>> Yeah, that sounds a lot better.  I'll look into that.
>
> Here is a new patch updated to work that way.  Feels better now.
>

I think this is conceptually OK, although it feels a bit odd.

Might it be better to have the values as typename={binary,text} pairs
instead of oid={0,1} pairs, which are fairly opaque? That might make
things easier for things like UDTs where the oid might not be known or
constant.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-11-16 Thread Bharath Rupireddy
On Mon, Nov 16, 2020 at 8:02 PM Paul Guo  wrote:
>
> > On Nov 13, 2020, at 7:21 PM, Bharath Rupireddy 
> >  wrote:
> >
> > On Tue, Nov 10, 2020 at 3:47 PM Paul Guo  wrote:
> >>
> >> Thanks for doing this. There might be another solution - use raw insert 
> >> interfaces (i.e. raw_heap_insert()).
> >> Attached is the test (not formal) patch that verifies this idea. 
> >> raw_heap_insert() writes the page into the
> >> table files directly and also write the FPI xlog when the tuples filled up 
> >> the whole page. This seems be
> >> more efficient.
> >>
> >
> > Thanks. Will the new raw_heap_insert() APIs scale well (i.e. extend
> > the table parallelly) with parallelism? The existing
> > table_multi_insert() API scales well, see, for instance, the benefit
> > with parallel copy[1] and parallel multi inserts in CTAS[2].
>
> Yes definitely some work needs to be done to make raw heap insert interfaces 
> fit the parallel work, but
> it seems that there is no hard blocking issues for this?
>

I may be wrong here. If we were to allow raw heap insert APIs to
handle parallelism, shouldn't we need some sort of shared memory to
allow coordination among workers? If we do so, at the end, aren't
these raw insert APIs equivalent to current table_multi_insert() API
which uses a separate shared ring buffer(bulk insert state) for
insertions?

And can we think of these raw insert APIs similar to the behaviour of
table_multi_insert() API for unlogged tables?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Add important info about ANALYZE after create Functional Index

2020-11-16 Thread Alvaro Herrera
On 2020-Nov-12, Bruce Momjian wrote:

> For new expression indexes, it is necessary to run <link linkend="sql-analyze"><command>ANALYZE</command></link> or wait for
> the autovacuum daemon to analyze
> -   the table to generate statistics about new expression indexes.
> +   the table to generate statistics for these indexes.

Looks good to me.





Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-16 Thread Tom Lane
Magnus Hagander  writes:
> I agree with those -- either we remove the ability to choose random source
> independently of the SSL library (and then only use the windows crypto
> provider or /dev/urandom as platform-specific choices when *no* SSL library
> is used), and in that case we should not have separate #ifdef's for them.
> Or we fix the includes. Which is obviously easier, but we should take the
> time to do what we think is right long-term of course.

FWIW, I'd vote for the former.  I think the presumption that OpenSSL's
random-number machinery can be used without any other initialization is
shaky as heck.

regards, tom lane




Re: Heads-up: macOS Big Sur upgrade breaks EDB PostgreSQL installations

2020-11-16 Thread Jonathan S. Katz
On 11/16/20 4:27 AM, Dave Page wrote:
> Hi,
> 
> This is more of a head-ups than anything else, as I suspect this may
> come up in various forums.
> 
> The PostgreSQL installers for macOS (from EDB, possibly others too)
> create the data directory in /Library/PostgreSQL//data. This
> has been the case since the first release, 10+ years ago.
> 
> It looks like the Big Sur upgrade has taken it upon itself to "fix" any
> filesystem permissions it doesn't like. On my system, this resulted in
> the data directory having 0755 permissions, which meant that PostgreSQL
> refused to start. Manually changing the permissions back to 0700 (0750
> should also work) fixes the issue.
> 
> I'm not sure there's much we can do about this - systems that are likely
> to be affected are already out there, and we obviously don't want to
> relax the permissions Postgres requires.

Thanks for raising this. We should provide some guidance on this when
upgrading to Big Sur.

Do we know where the other macOS installers place their data
directories? We should reach out to the installer maintainers to see if
they are seeing the same behavior so we know what guidance to issue.

Thanks,

Jonathan


OpenPGP_0xF1049C729F1C6527.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: Add important info about ANALYZE after create Functional Index

2020-11-16 Thread Alvaro Herrera
On 2020-Nov-16, Justin Pryzby wrote:

> I see Alvaro already patched the first issue at bcbd77133.
> 
> The problematic language was recently introduced, and I'd reported at:
> https://www.postgresql.org/message-id/20201112211143.GL30691%40telsasoft.com
> And Erik at:
> https://www.postgresql.org/message-id/e92b3fba98a0c0f7afc0a2a37e765954%40xs4all.nl

Yeah, sorry I didn't notice you had already reported it.




Re: PoC/WIP: Extended statistics on expressions

2020-11-16 Thread Tomas Vondra
On 11/16/20 2:49 PM, Tomas Vondra wrote:
> Hi,
>
> ...
>
> 4) apply the statistics
> 
>This is the hard part, really, and the exact state of the support
>depends on type of statistics.
> 
>For ndistinct coefficients, it generally works. I'm sure there may be
>bugs in estimate_num_groups, etc. but in principle it works.
> 
>For MCV lists, it generally works too - you can define statistics on
>the expressions and the estimates should improve. The main downside
>here is that it requires at least two expressions, otherwise we can't
>build/apply the extended statistics. So for example
> 
>   SELECT * FROM t WHERE mod(a,100) = 10 AND mod(b,11) = 0
> 
>may be estimated "correctly", once you drop any of the conditions it
>gets much worse as we don't have stats for individual expressions.
>That's rather annoying - it does not break the extended MCV, but the
>behavior will certainly cause confusion.
> 
>For functional dependencies, the estimation does not work yet. Also,
>the missing per-column statistics have bigger impact than on MCV,
>because while MCV can work fine without it, the dependencies heavily
>rely on the per-column estimates. We only apply "corrections" based
>on the dependency degree, so we still need (good) per-column
>estimates, which does not quite work with the expressions.
> 
> 
>Of course, the lack of per-expression statistics may be somewhat
>fixed by adding indexes on expressions, but that's kinda expensive.
> 

FWIW after re-reading [1], I think the plan to build pg_statistic rows
for expressions and stash them in pg_statistic_ext_data is the way to
go. I was thinking that maybe we'll need some new statistics type to
request this, e.g.

   CREATE STATISTICS s (expressions) ON ...

but on second thought I think we should just build this whenever there
are expressions in the definition. It'll require some changes (e.g. we
require at least two items in the list, but we'll want to allow building
stats on a single expression too, I guess), but that's doable.

Of course, we don't have any catalogs with composite types yet, so it's
not 100% sure this will work, but it's worth a try.

regards


[1]
https://www.postgresql.org/message-id/flat/6331.1579041473%40sss.pgh.pa.us#5ec6af7583e84cef2ca6a9e8a713511e

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Add important info about ANALYZE after create Functional Index

2020-11-16 Thread Bruce Momjian
On Mon, Nov 16, 2020 at 11:59:03AM -0300, Álvaro Herrera wrote:
> On 2020-Nov-12, Bruce Momjian wrote:
> 
> > For new expression indexes, it is necessary to run <link linkend="sql-analyze"><command>ANALYZE</command></link> or wait for
> > the autovacuum daemon to analyze
> > -   the table to generate statistics about new expression indexes.
> > +   the table to generate statistics for these indexes.
> 
> Looks good to me.

Applied to all branches.  Thanks for the review.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Cache relation sizes?

2020-11-16 Thread Konstantin Knizhnik




On 16.11.2020 10:11, Thomas Munro wrote:

On Tue, Aug 4, 2020 at 2:21 PM Thomas Munro  wrote:

On Tue, Aug 4, 2020 at 3:54 AM Konstantin Knizhnik
 wrote:

This shared relation cache can easily store relation size as well.
In addition it will solve a lot of other problems:
- noticeable overhead of local relcache warming
- large memory consumption in case of larger number of relations
O(max_connections*n_relations)
- sophisticated invalidation protocol and related performance issues
Certainly access to shared cache requires extra synchronization.But DDL
operations are relatively rare.
So in most cases we will have only shared locks. May be overhead of
locking will not be too large?

Yeah, I would be very happy if we get a high performance shared
sys/rel/plan/... caches in the future, and separately, having the
relation size available in shmem is something that has come up in
discussions about other topics too (tree-based buffer mapping,
multi-relation data files, ...).  ...

After recent discussions about the limitations of relying on SEEK_END
in a nearby thread[1], I decided to try to prototype a system for
tracking relation sizes properly in shared memory.  Earlier in this
thread I was talking about invalidation schemes for backend-local
caches, because I only cared about performance.  In contrast, this new
system has SMgrRelation objects that point to SMgrSharedRelation
objects (better names welcome) that live in a pool in shared memory,
so that all backends agree on the size.  The scheme is described in
the commit message and comments.  The short version is that smgr.c
tracks the "authoritative" size of any relation that has recently been
extended or truncated, until it has been fsync'd.  By authoritative, I
mean that there may be dirty buffers in that range in our buffer pool,
even if the filesystem has vaporised the allocation of disk blocks and
shrunk the file.

That is, it's not really a "cache".  It's also not like a shared
catalog, which Konstantin was talking about... it's more like the pool
of inodes in a kernel's memory.  It holds all currently dirty SRs
(SMgrSharedRelations), plus as many clean ones as it can fit, with
some kind of reclamation scheme, much like buffers.  Here, "dirty"
means the size changed.

Attached is an early sketch, not debugged much yet (check undir
contrib/postgres_fdw fails right now for a reason I didn't look into),
and there are clearly many architectural choices one could make
differently, and more things to be done... but it seemed like enough
of a prototype to demonstrate the concept and fuel some discussion
about this and whatever better ideas people might have...

Thoughts?

[1] 
https://www.postgresql.org/message-id/flat/OSBPR01MB3207DCA7EC725FDD661B3EDAEF660%40OSBPR01MB3207.jpnprd01.prod.outlook.com


I noticed that there are several fragments like this:


if (!smgrexists(rel->rd_smgr, FSM_FORKNUM))
     smgrcreate(rel->rd_smgr, FSM_FORKNUM, false);

fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);


I wonder if it would be more efficient, and simplify the code, to add a
"create_if_not_exists" parameter to smgrnblocks?
It would avoid an extra hash lookup and the explicit checks for fork
presence in multiple places.
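
Something like this on the caller side, just to illustrate (the extra
parameter is only a suggested name, not an existing one):

    /* as the code reads today */
    if (!smgrexists(rel->rd_smgr, FSM_FORKNUM))
        smgrcreate(rel->rd_smgr, FSM_FORKNUM, false);
    fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);

    /* with the suggested parameter (hypothetical signature) */
    fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM,
                                  true /* create_if_not_exists */);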



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Zedstore - compressed in-core columnar storage

2020-11-16 Thread Tomas Vondra


On 11/16/20 1:59 PM, Merlin Moncure wrote:
> On Thu, Nov 12, 2020 at 4:40 PM Tomas Vondra
>  wrote:
>>              master    zedstore/pglz    zedstore/lz4
>>   -----------------------------------------------------
>>   copy       1855      68092            2131
>>   dump       751       905              811
>>
>> And the size of the lineitem table (as shown by \d+) is:
>>
>>   master: 64GB
>>   zedstore/pglz: 51GB
>>   zedstore/lz4: 20GB
>>
>> It's mostly expected lz4 beats pglz in performance and compression
>> ratio, but this seems a bit too extreme I guess. Per past benchmarks
>> (e.g. [1] and [2]) the difference in compression/decompression time
>> should be maybe 1-2x or something like that, not 35x like here.
> 
> I can't speak to the ratio, but in basic backup/restore scenarios pglz
> is absolutely killing me; Performance is just awful; we are cpubound
> in backups throughout the department.  Installations defaulting to
> pglz will make this feature show very poorly.
> 

Maybe. I'm not disputing that pglz is considerably slower than lz4, but
judging by previous benchmarks I'd expect the compression to be slower
maybe by a factor of ~2x. So the 30x difference is suspicious. Similarly
for the compression ratio - lz4 is great, but it seems strange it's 1/2
the size of pglz. Which is why I'm speculating that something else is
going on.

As for the "plgz will make this feature show very poorly" I think that
depends. I think we may end up with pglz doing pretty well (compared to
heap), but lz4 will probably outperform that. OTOH for various use cases
it may be more efficient to use something else with worse compression
ratio, but allowing execution on compressed data, etc.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] remove deprecated v8.2 containment operators

2020-11-16 Thread Peter Eisentraut

On 2020-11-13 16:57, Tom Lane wrote:

Peter Eisentraut  writes:

On 2020-11-12 23:28, Tom Lane wrote:

I'm on board with pulling these now --- 8.2 to v14 is plenty of
deprecation notice.  However, the patch seems incomplete in that
the code support for these is still there -- look for
RTOldContainedByStrategyNumber and RTOldContainsStrategyNumber.
Admittedly, there's not much to be removed except some case labels,
but it still seems like we oughta do that to avoid future confusion.



Yeah, the stuff in gistproc.c should be removed now.  But I wonder what
the mentions in brin_inclusion.c are and whether or how they should be
removed.


I think those probably got cargo-culted in there at some point.
They're visibly dead code, because there are no pg_amop entries
for BRIN opclasses with amopstrategy 13 or 14.


I have committed fixes that remove the unused strategy numbers from both 
of these code areas.





Re: [PATCH] Add features to pg_stat_statements

2020-11-16 Thread Fujii Masao



On 2020/11/16 12:22, Seino Yuki wrote:

Thanks for updating the patch!

+    pgss_info->dealloc = 0;
+    SpinLockInit(&pgss_info->mutex);
+    Assert(pgss_info->dealloc == 0);

Why is this assertion check necessary? It seems not necessary.

+    {
+    Assert(found == found_info);

Having pgssSharedState and pgssInfoCounters separately might make
the code a bit more complicated like the above? If this is true, what about
including pgssInfoCounters in pgssSharedState?

PGSS_FILE_HEADER needs to be changed since the patch changes
the format of pgss file?

+    /* Read pgss_info */
+    if (feof(file) == 0)
+    if (fread(pgss_info, sizeof(pgssInfoCounters), 1, file) != 1)
+    goto read_error;

Why does feof(file) need to be called here?

+pgss_info_update(void)
+{
+    {

Why is the second "{" necessary? It seems redundant.

+pgss_info_reset(void)
+{
+    {

Same as above.

+pg_stat_statements_info(PG_FUNCTION_ARGS)
+{
+    int64    d_count = 0;
+    {

Same as above.

+    SpinLockAcquire(&c->mutex);
+    d_count = Int64GetDatum(c->dealloc);
+    SpinLockRelease(&c->mutex);

Why does Int64GetDatum() need to be called here? It seems not necessary.

+   
+    
+ pg_stat_statements_info() returns bigint
+ 
+  pg_stat_statements_info
+ 
+    

Isn't it better not to expose pg_stat_statements_info() function in the
document because pg_stat_statements_info view is enough and there
seems no use case for the function?

Regards,


Thanks for the comment.
I'll post a fixed patch.
Due to similar fixes, we have also merged the patches discussed in the 
following thread.
https://commitfest.postgresql.org/30/2738/


I agree that these two patches should use the same infrastructure
because they both try to add the global stats for pg_stat_statements.
But IMO they should not be merged to one patch. It's better to
develop them one by one for ease of review. Thought?

So I extracted the "dealloc" part from the merged version of your patch.
Also I refactored the code and applied some cosmetic changes into
the patch. Attached is the updated version of the patch that implements
only "dealloc" part. Could you review this version?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/contrib/pg_stat_statements/Makefile 
b/contrib/pg_stat_statements/Makefile
index 081f997d70..3ec627b956 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -6,7 +6,7 @@ OBJS = \
pg_stat_statements.o
 
 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.4.sql \
+DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.8--1.9.sql \
pg_stat_statements--1.7--1.8.sql pg_stat_statements--1.6--1.7.sql \
pg_stat_statements--1.5--1.6.sql pg_stat_statements--1.4--1.5.sql \
pg_stat_statements--1.3--1.4.sql pg_stat_statements--1.2--1.3.sql \
diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out 
b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index 2a303a7f07..16158525ca 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -861,4 +861,19 @@ SELECT query, plans, calls, rows FROM pg_stat_statements 
ORDER BY query COLLATE
  SELECT query, plans, calls, rows FROM pg_stat_statements ORDER BY query 
COLLATE "C" | 1 | 0 |0
 (6 rows)
 
+--
+-- access to pg_stat_statements_info view
+--
+SELECT pg_stat_statements_reset();
+ pg_stat_statements_reset 
+--
+ 
+(1 row)
+
+SELECT dealloc FROM pg_stat_statements_info;
+ dealloc 
+-
+   0
+(1 row)
+
 DROP EXTENSION pg_stat_statements;
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql 
b/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql
new file mode 100644
index 00..2019a4ffe0
--- /dev/null
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql
@@ -0,0 +1,17 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.9'" to load this 
file. \quit
+
+--- Define pg_stat_statements_info
+CREATE FUNCTION pg_stat_statements_info(
+OUT dealloc bigint
+)
+RETURNS bigint
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL SAFE;
+
+CREATE VIEW pg_stat_statements_info AS
+  SELECT * FROM pg_stat_statements_info();
+
+GRANT SELECT ON pg_stat_statements_info TO PUBLIC;
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c 
b/contrib/pg_stat_statements/pg_stat_statements.c
index dd963c4644..9de2d53496 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -98,7 +98,7 @@ PG_MODULE_MAGIC;
 #define PGSS_TEXT_FILE PG_STAT_TMP_DIR "/pgss_query_texts.stat"
 
 /* Magic number identifying the 

Re: planner support functions: handle GROUP BY estimates ?

2020-11-16 Thread Tomas Vondra
On 1/15/20 12:44 AM, Tom Lane wrote:
> Tomas Vondra  writes:
>> On Tue, Jan 14, 2020 at 05:37:53PM -0500, Tom Lane wrote:
>>> I wonder just how messy it would be to add a column to pg_statistic_ext
>>> whose type is the composite type "pg_statistic", and drop the required
>>> data into that.  We've not yet used any composite types in the system
>>> catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
>>> it seems like we might be able to get away with it.
> 
> [ I meant pg_statistic_ext_data, obviously ]
> 
>> I don't know, but feels a bit awkward to store this type of stats into
>> pg_statistic_ext, which was meant for multi-column stats. Maybe it'd
>> work fine, not sure.
> 
> If we wanted to allow a single statistics object to contain data for
> multiple expressions, we'd actually need that to be array-of-pg_statistic
> not just pg_statistic.  Seems do-able, but on the other hand we could
> just prohibit having more than one output column in the "query" for this
> type of extended statistic.  Either way, this seems far less invasive
> than either a new catalog or a new relation relkind (to say nothing of
> needing both, which is where you seemed to be headed).
> 

I've started looking at statistics on expressions too, mostly because it
seems the extended stats improvements (as discussed in [1]) need that.

The "stash pg_statistic records into pg_statistics_ext_data" approach
seems simple, but it's not clear to me how to make it work, so I'd
appreciate some guidance.


1) We don't have any composite types in any catalog yet, and naive
attempts to just use something like

pg_statistic stxdexprs[1];

did not work. So I suppose this will require changes to genbki.pl, but
honestly, my Perl-fu is non-existent :-(


2) Won't it be an issue that pg_statistic contains pseudo-types? That
is, this does not work, for example:

test=# create table t (a pg_statistic[]);
ERROR:  column "stavalues1" has pseudo-type anyarray

and it seems unlikely just using this in a catalog would make it work.


regards


[1]
https://www.postgresql.org/message-id/ad7891d2-e90c-b446-9fe2-7419143847d7%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: doc CREATE INDEX

2020-11-16 Thread Bruce Momjian
On Mon, Nov 16, 2020 at 11:04:57AM -0300, Álvaro Herrera wrote:
> On 2020-Nov-16, Erik Rijkers wrote:
> 
> > This one seems to have fallen by the wayside.
> 
> So it did.
> 
> Pushed to all three branches, thanks.

I have applied it to PG 9.5-11, which also had the odd wording.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Heads-up: macOS Big Sur upgrade breaks EDB PostgreSQL installations

2020-11-16 Thread Dave Page
On Mon, Nov 16, 2020 at 4:45 PM Pavel Borisov 
wrote:

> I suppose there are many ways to have PG on OSX i.e. package managers
> (Homebrew, Macports), App installers etc and so many places anyone can find
> his data directory reside in. Generally I prefer data directory to be
> somewhere inside the user home dir as OSX will take care of possible
> backups and will not generally modify its contents during migration between
> osx versions and/or different machines. It is not only the question of
> permissions.
>
> Any options inside user homedir are equally suitable IMO.
>

It is in the user's homedir - it's just that that isn't under /Users:

hal:~ postgres$ echo $HOME
/Library/PostgreSQL/13

With the EDB installers (unlike postgres.app), PostgreSQL runs as a
service, much as it would on Linux or BSD.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com


Re: Zedstore - compressed in-core columnar storage

2020-11-16 Thread Merlin Moncure
On Mon, Nov 16, 2020 at 10:07 AM Tomas Vondra
 wrote:
>
>
> On 11/16/20 1:59 PM, Merlin Moncure wrote:
> > On Thu, Nov 12, 2020 at 4:40 PM Tomas Vondra
> >  wrote:
> >>              master    zedstore/pglz    zedstore/lz4
> >>   -----------------------------------------------------
> >>   copy       1855      68092            2131
> >>   dump       751       905              811
> >>
> >> And the size of the lineitem table (as shown by \d+) is:
> >>
> >>   master: 64GB
> >>   zedstore/pglz: 51GB
> >>   zedstore/lz4: 20GB
> >>
> >> It's mostly expected lz4 beats pglz in performance and compression
> >> ratio, but this seems a bit too extreme I guess. Per past benchmarks
> >> (e.g. [1] and [2]) the difference in compression/decompression time
> >> should be maybe 1-2x or something like that, not 35x like here.
> >
> > I can't speak to the ratio, but in basic backup/restore scenarios pglz
> > is absolutely killing me; Performance is just awful; we are cpubound
> > in backups throughout the department.  Installations defaulting to
> > pglz will make this feature show very poorly.
> >
>
> Maybe. I'm not disputing that pglz is considerably slower than lz4, but
> judging by previous benchmarks I'd expect the compression to be slower
> maybe by a factor of ~2x. So the 30x difference is suspicious. Similarly
> for the compression ratio - lz4 is great, but it seems strange it's 1/2
> the size of pglz. Which is why I'm speculating that something else is
> going on.
>
> As for the "plgz will make this feature show very poorly" I think that
> depends. I think we may end up with pglz doing pretty well (compared to
> heap), but lz4 will probably outperform that. OTOH for various use cases
> it may be more efficient to use something else with worse compression
> ratio, but allowing execution on compressed data, etc.

hm, you might be right.  Doing some number crunching, I'm getting
about 23mb/sec compression on a 600gb backup image on a pretty typical
aws server.  That's obviously not great, but your numbers are much
worse than that, so maybe something else might be going on.

> I think we may end up with pglz doing pretty well (compared to heap)

I *don't* think so, or at least I'm skeptical as long as insertion
times are part of the overall performance measurement.  Naturally,
with column stores, insertion times are often very peripheral to the
overall performance picture but for cases that aren't I suspect the
results are not going to be pleasant, and advise planning accordingly.

Aside, I am very interested in this work. I may be able to support
testing in an enterprise environment; lmk if interested -- thank you

merlin




Re: Heads-up: macOS Big Sur upgrade breaks EDB PostgreSQL installations

2020-11-16 Thread Dave Page
On Mon, Nov 16, 2020 at 3:55 PM Jonathan S. Katz 
wrote:

> On 11/16/20 4:27 AM, Dave Page wrote:
> > Hi,
> >
> > This is more of a head-ups than anything else, as I suspect this may
> > come up in various forums.
> >
> > The PostgreSQL installers for macOS (from EDB, possibly others too)
> > create the data directory in /Library/PostgreSQL//data. This
> > has been the case since the first release, 10+ years ago.
> >
> > It looks like the Big Sur upgrade has taken it upon itself to "fix" any
> > filesystem permissions it doesn't like. On my system, this resulted in
> > the data directory having 0755 permissions, which meant that PostgreSQL
> > refused to start. Manually changing the permissions back to 0700 (0750
> > should also work) fixes the issue.
> >
> > I'm not sure there's much we can do about this - systems that are likely
> > to be affected are already out there, and we obviously don't want to
> > relax the permissions Postgres requires.
>
> Thanks for raising this. We should provide some guidance on upgrading
> this when upgrading to Big Sur.
>
> Do we know where the other macOS installers place their data
> directories? We should reach out to the installer maintainers to see if
> they are seeing the same behavior so we know what guidance to issue.
>

I believe postgres.app only installs for the current user, and puts its
data under ~/Library/Application Support/Postgres.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com


Re: Heads-up: macOS Big Sur upgrade breaks EDB PostgreSQL installations

2020-11-16 Thread Pavel Borisov
I suppose there are many ways to have PG on OSX i.e. package managers
(Homebrew, Macports), App installers etc and so many places anyone can find
his data directory reside in. Generally I prefer data directory to be
somewhere inside the user home dir as OSX will take care of possible
backups and will not generally modify its contents during migration between
osx versions and/or different machines. It is not only the question of
permissions.

Any options inside user homedir are equally suitable IMO.

---
Best regards,
Pavel Borisov

Postgres professional: http://postgrespro.com


Re: enable_incremental_sort changes query behavior

2020-11-16 Thread luis . roberto




I've pushed the 0001 part, i.e. the main fix. Not sure about the other 
parts (comments and moving the code back to postgres_fdw) yet. 


regards 

-- 
Tomas Vondra http://www.2ndQuadrant.com 
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Hi, Tomas! 

I'm still having trouble with this problem. Tom committed something [1] to avoid 
core dumps, but all I get now is: subplan "SubPlan 1" was not initialized 

I upgraded to 13.1 over the weekend but the problem still occurs. 

This is the topic I started: 
https://www.postgresql.org/message-id/flat/CAAaqYe--qpnBOUiqnHSAc71K2HpXetnGedRAX0Z04zx%3DY0ExHA%40mail.gmail.com#8cb6c7c5e787d2d3d2dc7aaf00d1cae8
 



[1] 
https://git.postgresql.org/pg/commitdiff/936043c9eacb9e9c7356a8190a410d2c4e4ea03a
 



Re: PATCH: Batch/pipelining support for libpq

2020-11-16 Thread David G. Johnston
On Fri, Nov 13, 2020 at 5:38 PM Alvaro Herrera 
wrote:

> Here's a v25.
>
> I made a few more changes to the docs per David's suggestions; I also
> reordered the sections quite a bit.  It's now like this:
>  * Batch Mode
>* Using Batch Mode
>  * Issuing Queries
>  * Processing Results
>  * Error Handling
>  * Interleaving Result Processing and Query Dispatch
>  * Ending Batch Mode
>* Functions Associated with Batch Mode
>* When to Use Batching
>

Thanks!

I like the new flow and changes.  I've attached a patch that covers some
missing commas and typos along with a few parts that made me pause.

The impact of memory came out of nowhere under the non-blocking mode
commentary.  I took a stab at incorporating it more broadly.

The "are error conditions" might be a well-known phrasing to those using
libpq but that sentence reads odd to me.  I did try to make the example
listing flow a bit better and added a needed comma.

Tried to clean up a few phrasings after that.  The error handling part I'm
not especially happy with but I think it's closer and more useful than just
"emitted during error handling" - it gets emitted upon error, handling is a
client concern.

Seems odd to say the new feature was introduced in v14.0, the .0 seems ok
to imply.  I didn't actually fix it in the attached but "the PostgreSQL 14
version of libpq" is going to become outdated quickly, better just to drop
it.

"The batch API was introduced in PostgreSQL 14, but clients can use batches
on servers supporting v3 of the extended query protocol, potentially going
as far back as version 7.4."

David J.


v25-01-libpq-batch-dgj-doc-suggestions.patch


Re: [PATCH] remove deprecated v8.2 containment operators

2020-11-16 Thread Justin Pryzby
On Fri, Nov 13, 2020 at 10:03:43AM -0500, Stephen Frost wrote:
> * Magnus Hagander (mag...@hagander.net) wrote:
> > On Thu, Nov 12, 2020 at 11:28 PM Tom Lane  wrote:
> > > > The changes to the contrib modules appear to be incomplete in some ways.
> > > >   In cube, hstore, and seg, there are no changes to the extension
> > > > scripts to remove the operators.  All you're doing is changing the C
> > > > code to no longer recognize the strategy, but that doesn't explain what
> > > > will happen if the operator is still used.  In intarray, by contrast,
> > > > you're editing an existing extension script, but that should be done by
> > > > an upgrade script instead.
> > >
> > > In the contrib modules, I'm afraid what you gotta do is remove the
> > > SQL operator definitions but leave the opclass code support in place.
> > > That's because there's no guarantee that users will update the extension's
> > > SQL version immediately, so a v14 build of the .so might still be used
> > > with the old SQL definitions.  It's not clear how much window we need
> > > give for people to do that update, but I don't think "zero" is an
> > > acceptable answer.
> > 
> > Based on my experience from the field, the answer is "never".
> > 
> > As in, most people have no idea they are even *supposed* to do such an
> > upgrade, so they don't do it. Until we solve that problem, I think
> > we're basically stuck with keeping them "forever". (and even if/when
> > we do, "zero" is probably not going to cut it, no)
> 
> Yeah, this is a serious problem and one that we should figure out a way
> to fix or at least improve on- maybe by having pg_upgrade say something
> about extensions that could/should be upgraded..?

I think what's needed are:

1) a way to *warn* users about deprecation.  CREATE EXTENSION might give an
elog(WARNING), but it's probably not enough.  It only happens once, and if it's
in pg_restore/pg_upgrade, it may be wrapped by vendor upgrade scripts.  It needs to
be more like a warning on the first function call in every session.  Or more likely, put it in
documentation for 10 years.

2) a way to *enforce* it, like making CREATE EXTENSION fail when run against an
excessively old server, including by pg_restore/pg_upgrade (which ought to also
handle it in --check).

Are there any contrib for which (1) is done and we're anywhere near doing (2) ?

-- 
Justin




Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-16 Thread David G. Johnston
On Mon, Nov 16, 2020 at 1:52 PM Simon Riggs  wrote:

> The docs are misleading for this feature, since they say:
> "This option disables all page-skipping behavior, and is
> intended to be used only when the contents of the visibility map are
> suspect, which should happen only if there is a hardware or software
> issue causing database corruption."
>
[...]

>
> The code is quite clear: DISABLE_PAGE_SKIPPING makes the vacuum into
> an aggressive vacuum. Line 487, heap_vacuum_rel().  Aggressive vacuums
> can still skip a page that is frozen, and rely on the visibility map
> for that information.
>
> So the docs are wrong - we don't disable *all* page-skipping and it is
> not appropriate to warn users away from this feature by saying "is
> intended to be used only when the contents of the visibility map are
> suspect".
>

This patch seems mis-placed, at least in HEAD.  If DISABLE_PAGE_SKIPPING
isn't doing what the documentation says it should, but instead provides
identical behavior to FREEZE, then the bug should be fixed in HEAD.  I'd
argue for back-patching it as well.

David J.


Re: Crash in virtual file descriptor FDDEBUG code

2020-11-16 Thread Tom Lane
Greg Nancarrow  writes:
> While investigating a possible file-descriptor issue, I enabled the
> FDDEBUG code in src/backend/storage/file/fd.c, only to find that it
> resulted in a crash as soon as it was invoked.

Yeah, I can imagine that code doesn't get tested often.

> I've attached a small patch that corrects the issue.

Thanks, will push.

BTW, while looking at this I started to cast an unfriendly eye on the
_dump_lru() debug function that's hidden under that symbol.  That seems
like it's spending a lot of cycles to give you information that will
very possibly be incomplete (since its 2K buffer would fill up fast).
And it could very easily turn into an infinite loop if there were
anything actually wrong with the LRU chain.  So I'm not seeing where
the cost-benefit ratio is there.

Maybe we should change it to just print the first and last entries
and the number of entries --- which it could count using a loop that
knows there shouldn't be more entries than the size of the VFD
array, so as to prevent infinite-looping if the chain is corrupt.
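
Roughly something like this, as a sketch of the idea (names are the existing
ones from fd.c, but this is untested):

static void
_dump_lru(void)
{
    int     mru = VfdCache[0].lruLessRecently;
    int     first = mru;
    int     last = 0;
    Size    count = 0;

    /* bounded walk: never loop more than the number of VFD slots */
    while (mru != 0 && count <= SizeVfdCache)
    {
        last = mru;
        mru = VfdCache[mru].lruLessRecently;
        count++;
    }

    elog(LOG, "LRU: %zu entries, most recent %d, least recent %d%s",
         count, first, last,
         count > SizeVfdCache ? " (chain looks corrupt)" : "");
}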

regards, tom lane




Re: Cache relation sizes?

2020-11-16 Thread Thomas Munro
On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik
 wrote:
> I will look at your implementation more precisely later.

Thanks!  Warning:  I thought about making a thing like this for a
while, but the patch itself is only a one-day prototype, so I am sure
you can find many bugs...  Hmm, I guess the smgrnblocks_fast() code
really needs a nice big comment to explain the theory about why this
value is fresh enough, based on earlier ideas about implicit memory
barriers.  (Something like: if you're extending, you must have
acquired the extension lock; if you're scanning you must have acquired
a snapshot that only needs to see blocks that existed at that time and
concurrent truncations are impossible; I'm wondering about special
snapshots like VACUUM, and whether this is too relaxed for that, or if
it's covered by existing horizon-based interlocking...).
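
To make the shape of that fast path concrete, it presumably boils down to
something like this (sketched for illustration; the field and function names
here are approximations, not necessarily what the attached patch does):

static BlockNumber
smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
{
    SMgrSharedRelation *sr = reln->smgr_shared;     /* assumed field name */

    /*
     * Trust the shared size when we have a shared entry; the memory-ordering
     * argument above is what makes this plain read OK.
     */
    if (sr != NULL)
        return sr->nblocks[forknum];

    return smgrnblocks(reln, forknum);      /* fall back to lseek(SEEK_END) */
}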

> But now I just want to ask one question: is it reasonable to spend a
> substantial amount of effort trying to solve
> the problem of redundant calculations of relation size, which seems to
> be just part of a more general problem: a shared relation cache?

It's a good question.  I don't know exactly how to make a shared
relation cache (I have only some vague ideas and read some of
Ideriha-san's work) but I agree that it would be very nice.  However,
I think that's an independent and higher level thing, because Relation
!= SMgrRelation.

What is an SMgrRelation?  Not much!  It holds file descriptors, which
are different in every backend so you can't share them, and then some
information about sizes and a target block for insertions.  With this
patch, the sizes become shared and more reliable, so there's *almost*
nothing left at all except for the file descriptors and the pointer to
the shared object.  I haven't looked into smgr_targblock, the last
remaining non-shared thing, but there might be an opportunity to do
something clever for parallel inserters (the proposed feature) by
coordinating target blocks in SMgrSharedRelation; I don't know.

> It is a well-known problem: private caches of the system catalog can cause
> unacceptable memory consumption for a large number of backends.

Yeah.

> Also private caches require a sophisticated and error-prone invalidation
> mechanism.
>
> If we have such a shared cache, then keeping the shared relation size
> seems to be a trivial task, doesn't it?
> So maybe, before "diving" into the problem of maintaining a shared pool
> of dirtied "inodes", we can better discuss and concentrate on solving the
> more generic problem?

Sure, we can talk about how to make a shared relation cache (and
syscache etc).  I think it'll be much harder than this shared SMGR
concept.  Even if we figure out how to share palloc'd trees of nodes
with correct invalidation and interlocking (ie like Ideriha-san's work
in [1]), including good performance and space management/replacement,
I guess we'll still need per-backend Relation objects to point to the
per-backend SMgrRelation objects to hold the per-backend file
descriptors.  (Until we have threads...).  But... do you think sharing
SMGR relations to the maximum extent possible conflicts with that?
Are you thinking there should be some general component here, perhaps
a generic system for pools of shmem objects with mapping tables and a
replacement policy?

[1] 
https://www.postgresql.org/message-id/flat/4E72940DA2BF16479384A86D54D0988A567B9245%40G01JPEXMBKW04




RE: Disable WAL logging to speed up data loading

2020-11-16 Thread tsunakawa.ta...@fujitsu.com
# It'd be helpful if you could send mails in text format, not HTML.

From: David G. Johnston 
> For this case the fundamental feature that would seem to be required is an 
> ability for a transaction commit to return only after the system has ensured 
> that all of the new pages added to the relation during the scope of the 
> transaction have made it to disk.  Something like:
>
> BEGIN UNLOGGED TRANSACTION FOR table1, table2;
> -- locking probably allows reads, definitely disallows concurrent writes, to 
> the named tables
> -- Disallow updates and deletes, do not use dead tuple space, for the tables 
> named.  Should be able to do normal stuff for other tables?
> -- Always create new pages
> COPY TO table1;
> COPY TO table2;
> COMMIT; -- wait here until data files for table1 and table2 are completely 
> written and the transaction alive flag is committed to the WAL.
>
> I suppose the above could be written "BEGIN UNLOGGED TRANSACTION FOR ALL 
> TABLES" and you'd get the initial database population optimization capability.
>
> If the commit doesn't complete all of the newly created pages are junk.  
> Otherwise, you have a crash-recoverable state for those tables as regards 
> those specific pages.

As Steven-san said, I don't want to go in this complicated direction.  Plus, 
putting myself in the user's shoes, I want to try to avoid introducing new 
SQL syntax for this kind of performance boost, which requires changes to 
applications, maintenance scripts, and testing.


Regards
Takayuki Tsunakawa



Re: Terminate the idle sessions

2020-11-16 Thread Li Japin

On Nov 17, 2020, at 10:53 AM, David G. Johnston  wrote:

On Monday, November 16, 2020, Li Japin  wrote:


Consider setting this for specific users instead of as a server default.  
Client connections managed by connection poolers, or initiated indirectly like 
those by a remote postgres_fdw  using server, should probably be excluded from 
this timeout.


 
- This parameter should be set to zero if you use postgres_fdw or some
- connection-pooling software, because connections might be closed 
unexpectedly.
+ This parameter should be set to zero if you use some 
connection-pooling software, or
+ PostgreSQL servers used by postgres_fdw, because connections might be 
closed unexpectedly.
 



I prefer mine; “or pg servers used by postgres_fdw” doesn’t flow.

Could you please explain how idle-in-transaction sessions interfere with the 
long-running stability of the server?

From the docs (next section):

This allows any locks held by that session to be released and the connection 
slot to be reused; it also allows tuples visible only to this transaction to be 
vacuumed. See Section 24.1 for more details about this.

Thanks David! Attached.

--
Best regards
Japin Li


v6-0001-Allow-terminating-the-idle-sessions.patch


v6-0002-Optimize-setitimer-usage.patch


Re: pgbench: option delaying queries till connections establishment?

2020-11-16 Thread Andres Freund
Hi,

On 2020-11-17 00:09:34 +0300, Marina Polyakova wrote:
> Sorry I'm not familiar with the internal architecture of snapshots, locks
> etc. in postgres, but I wanted to ask - what exact kind of patch for fair
> lwlocks do you want to offer to the community? I applied the 6-th version of
> the patch for fair lwlocks from [1] to the old master branch (see commit
> [2]), started many threads in pgbench (-M prepared -c 1000 -j 500 -T 10 -P1
> -S) and I did not receive stable first progress reports, which IIUC are one
> of the advantages of the discussed patch for the pgbench (see [3])...

Thanks for running some benchmarks.

I think it's quite unsurprising that you'd see skewed numbers initially,
even with fairer locks. Just by virtue of pgbench starting threads and
each thread immediately starting to perform work, you are bound to get
odd pretty meaningless initial numbers. Even without contention, and
when using fewer connections than there are CPUs. And especially when
starting a larger number of connections, because the main pgbench thread
will get fewer and fewer scheduler slices because of the threads running
benchmarks already.

Regards,

Andres




Re: [PATCH] Covering SPGiST index

2020-11-16 Thread Tom Lane
Pavel Borisov  writes:
> [ v10-0001-Covering-SP-GiST-index-support-for-INCLUDE-colum.patch ]

I've started to review this, and I've got to say that I really disagree
with this choice:

+ * If there are INCLUDE columns, they are stored after a key value, each
+ * starting from its own typalign boundary. Unlike IndexTuple, first INCLUDE
+ * value does not need to start from MAXALIGN boundary, so SPGiST uses private
+ * routines to access them.

This seems to require far more new code than it could possibly be worth,
because most of the time anything you could save here is just going
to disappear into end-of-tuple alignment space anyway -- recall that
the overall index tuple length is going to be MAXALIGN'd no matter what.
I think you should yank this out and try to rely on standard tuple
construction/deconstruction code instead.

I also find it unacceptable that you stuck a tuple descriptor pointer into
the rd_amcache structure.  The relcache only supports that being a flat
blob of memory.  I see that you tried to hack around that by having
spgGetCache reconstruct a new tupdesc every time through, but (a) that's
actually worse than having no cache at all, and (b) spgGetCache doesn't
really know much about the longevity of the memory context it's called in.
This could easily lead to dangling tuple pointers, serious memory bloat
from repeated tupdesc construction, or quite possibly both.  Safer would
be to build the tupdesc during initSpGistState(), or maybe just make it
on-demand.  In view of the previous point, I'm also wondering if there's
any way to get the relcache's regular rd_att tupdesc to be useful here,
so we don't have to build one during scans at all.

(I wondered for a bit about whether you could keep a long-lived private
tupdesc in the relcache's rd_indexcxt context.  But it looks like
relcache.c sometimes resets rd_amcache without also clearing the
rd_indexcxt, so that would probably lead to leakage.)

regards, tom lane




Crash in virtual file descriptor FDDEBUG code

2020-11-16 Thread Greg Nancarrow
Hi Hackers,

While investigating a possible file-descriptor issue, I enabled the
FDDEBUG code in src/backend/storage/file/fd.c, only to find that it
resulted in a crash as soon as it was invoked.
It turns out the crash was in some logging code that was being passed
a NULL fileName.
I've attached a small patch that corrects the issue.

Regards,
Greg Nancarrow
Fujitsu Australia


v1-0001-Fix-crash-in-virtual-file-descriptor-FDDEBUG-code.patch


RE: Cache relation sizes?

2020-11-16 Thread tsunakawa.ta...@fujitsu.com
From: Thomas Munro 
> On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik
>  wrote:
> > I will look at your implementation more precisely later.
> 
> Thanks!  Warning:  I thought about making a thing like this for a
> while, but the patch itself is only a one-day prototype, so I am sure
> you can find many bugs...  Hmm, I guess the smgrnblocks_fast() code
> really needs a nice big comment to explain the theory about why this
> value is fresh enough, based on earlier ideas about implicit memory
> barriers.  (Something like: if you're extending, you must have
> acquired the extension lock; if you're scanning you must have acquired
> a snapshot that only needs to see blocks that existed at that time and
> concurrent truncations are impossible; I'm wondering about special
> snapshots like VACUUM, and whether this is too relaxed for that, or if
> it's covered by existing horizon-based interlocking...).

We'd like to join the discussion and review, too, so that Kirk's optimization 
for VACUUM truncation and TRUNCATE can extend beyond recovery to normal 
operation.  In that regard, would you mind becoming the committer 
for his patch?  We think it's committable now.  I'm fine with using the 
unmodified smgrnblocks() as your devil's suggestion.


> It's a good question.  I don't know exactly how to make a shared
> relation cache (I have only some vague ideas and read some of
> Ideriha-san's work) but I agree that it would be very nice.  However,
> I think that's an independent and higher level thing, because Relation
> != SMgrRelation.

> Sure, we can talk about how to make a shared relation cache (and
> syscache etc).  I think it'll be much harder than this shared SMGR
> concept.  Even if we figure out how to share palloc'd trees of nodes
> with correct invalidation and interlocking (ie like Ideriha-san's work
> in [1]), including good performance and space management/replacement,
> I guess we'll still need per-backend Relation objects to point to the
> per-backend SMgrRelation objects to hold the per-backend file
> descriptors.  (Until we have threads...).  But... do you think sharing
> SMGR relations to the maximum extent possible conflicts with that?
> Are you thinking there should be some general component here, perhaps
> a generic system for pools of shmem objects with mapping tables and a
> replacement policy?

Yes, Ideriha-san and we have completed the feature called global metacache for 
placing relation and catalog caches in shared memory and limiting their sizes, 
and incorporated it in our product to meet our customer's request.  I hope 
we'll submit the patch in the near future when our resource permits.


Regards
Takayuki Tsunakawa



Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-16 Thread Michael Paquier
On Mon, Nov 16, 2020 at 09:23:41PM -0300, Alvaro Herrera wrote:
> I've been wondering if it would be sane/safe to do the WaitForFoo stuff
> outside of any transaction.

This needs to remain within a transaction as CIC holds a session lock
on the parent table, which would not be cleaned up without a
transaction context.
--
Michael




Re: Terminate the idle sessions

2020-11-16 Thread Li Japin

--
Best regards
Japin Li

On Nov 17, 2020, at 7:59 AM, David G. Johnston  wrote:

On Mon, Nov 16, 2020 at 5:41 AM Li Japin  wrote:
Thanks for your review! Attached.

Reading the doc changes:

I'd rather not name postgres_fdw explicitly, or at least not solely, as a 
reason for setting this to zero.  Additionally, using postgres_fdw within the 
server doesn't cause issues; it's using postgres_fdw against a remote server 
that has this setting enabled that causes a problem.


Consider setting this for specific users instead of as a server default.  
Client connections managed by connection poolers, or initiated indirectly like 
those by a remote postgres_fdw using server, should probably be excluded from 
this timeout.

Text within <para> should be indented one space (you missed both under 
listitem).

Thanks for your suggestion! How about changing the document as follows:

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6c4e2a1fdc..23e691a7c5 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8281,17 +8281,17 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH 
csv;
   
   

-   Terminate any session that has been idle for longer than the specified 
amount of time.
+Terminate any session that has been idle for longer than the specified 
amount of time.


-   If this value is specified without units, it is taken as milliseconds.
-   A value of zero (the default) disables the timeout.
+If this value is specified without units, it is taken as milliseconds.
+A value of zero (the default) disables the timeout.



 
- This parameter should be set to zero if you use postgres_fdw or some
- connection-pooling software, because connections might be closed 
unexpectedly.
+ This parameter should be set to zero if you use some 
connection-pooling software, or
+ PostgreSQL servers used by postgres_fdw, because connections might be 
closed unexpectedly.
 


I'd suggest a comment that aside from a bit of resource consumption idle 
sessions do not interfere with the long-running stability of the server, unlike 
idle-in-transaction sessions which are controlled by the other configuration 
setting.

Could you please explain how idle-in-transaction sessions interfere with the 
long-running stability of the server?

--
Best regards
Japin Li



Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-16 Thread Alvaro Herrera
On 2020-Nov-16, Dmitry Dolgov wrote:

> The same with reindex without locks:
> 
>  nsecs   : count distribution
>512 -> 1023   : 0||
>   1024 -> 2047   : 111345   ||
>   2048 -> 4095   : 6997627  ||
>   4096 -> 8191   : 18575||
>   8192 -> 16383  : 586  ||
>  16384 -> 32767  : 312  ||
>  32768 -> 65535  : 18   ||
> 
> The same with reindex with locks:
> 
>  nsecs   : count distribution
>512 -> 1023   : 0||
>   1024 -> 2047   : 59438||
>   2048 -> 4095   : 6901187  ||
>   4096 -> 8191   : 18584||
>   8192 -> 16383  : 581  ||
>  16384 -> 32767  : 280  ||
>  32768 -> 65535  : 84   ||
> 
> Looks like reindex without locks is indeed faster (there are more
> samples in the lower time section), but not particularly significant to the
> whole distribution, especially taking into account the extremity of the
> test.

I didn't analyze these numbers super carefully, but yeah it doesn't look
significant.

I'm looking at these patches now, with intention to push.





VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-16 Thread Simon Riggs
The docs are misleading for this feature, since they say:
"This option disables all page-skipping behavior, and is
intended to be used only when the contents of the visibility map are
suspect, which should happen only if there is a hardware or software
issue causing database corruption."

The docs do correctly say "Pages where all tuples are known to be
frozen can always be skipped". Checking the code, lazy_scan_heap()
comments say
"we can still skip pages that are all-frozen, since such pages do not
need freezing".

The code is quite clear: DISABLE_PAGE_SKIPPING makes the vacuum into
an aggressive vacuum. Line 487, heap_vacuum_rel().  Aggressive vacuums
can still skip a page that is frozen, and rely on the visibility map
for that information.

So the docs are wrong - we don't disable *all* page-skipping and it is
not appropriate to warn users away from this feature by saying "is
intended to be used only when the contents of the visibility map are
suspect".

Reworded docs patch attached.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


aggressive_rewording.v1.patch


Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-16 Thread Masahiko Sawada
On Tue, Nov 17, 2020 at 5:52 AM Simon Riggs  wrote:
>
> The docs are misleading for this feature, since they say:
> "This option disables all page-skipping behavior, and is
> intended to be used only when the contents of the visibility map are
> suspect, which should happen only if there is a hardware or software
> issue causing database corruption."
>
> The docs do correctly say "Pages where all tuples are known to be
> frozen can always be skipped". Checking the code, lazy_scan_heap()
> comments say
> "we can still skip pages that are all-frozen, since such pages do not
> need freezing".
>
> The code is quite clear: DISABLE_PAGE_SKIPPING makes the vacuum into
> an aggressive vacuum. Line 487, heap_vacuum_rel().  Aggressive vacuums
> can still skip a page that is frozen, and rely on the visibility map
> for that information.
>
> So the docs are wrong - we don't disable *all* page-skipping and it is
> not appropriate to warn users away from this feature by saying "is
> intended to be used only when the contents of the visibility map are
> suspect".

I don't think the doc is wrong. If DISABLE_PAGE_SKIPPING is specified,
we not only set aggressive = true but also skip checking the visibility
map. For instance, see line 905 and line 963, lazy_scan_heap().

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: Terminate the idle sessions

2020-11-16 Thread David G. Johnston
On Mon, Nov 16, 2020 at 5:41 AM Li Japin  wrote:

> Thanks for your review! Attached.
>

Reading the doc changes:

I'd rather not name postgres_fdw explicitly, or at least not solely, as a
reason for setting this to zero.  Additionally, using postgres_fdw within
the server doesn't cause issues; it's using postgres_fdw against a remote
server that has this setting enabled that causes a problem.


Consider setting this for specific users instead of as a server default.
Client connections managed by connection poolers, or initiated indirectly
like those by a remote postgres_fdw using server, should probably be
excluded from this timeout.

Text within <para> should be indented one space (you missed both under
listitem).

I'd suggest a comment that aside from a bit of resource consumption idle
sessions do not interfere with the long-running stability of the server,
unlike idle-in-transaction sessions which are controlled by the other
configuration setting.

David J.


Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-16 Thread Alvaro Herrera
I am really unsure about the REINDEX CONCURRENTLY part of this, for two
reasons:

1. It is not as good when reindexing multiple indexes, because we can
only apply the flag if *all* indexes are "safe".  Any unsafe index means
we step down from it for the whole thing.  This is probably not worth
worrying much about, but still.

2. In some of the waiting transactions, we actually do more things than
what we do in CREATE INDEX CONCURRENTLY transactions --- some catalog
updates, but we also do the whole index validation phase.  Is that OK?
It's not as clear to me that it is safe to set the flag in all those
places.

I moved the comments to the new function and made it inline.  I also
changed the way we determine how the function is safe; there's no reason
to build an IndexInfo if we can simply look at rel->rd_indexprs and
rel->indpred.

I've been wondering if it would be sane/safe to do the WaitForFoo stuff
outside of any transaction.

I'll have a look again tomorrow.
From 1e5560d4a3f3e3b4cb83803f90985701fb4dee8c Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 4 Aug 2020 22:04:57 -0400
Subject: [PATCH v5] Avoid spurious CREATE INDEX CONCURRENTLY waits

---
 src/backend/commands/indexcmds.c | 85 +---
 src/include/storage/proc.h   |  6 ++-
 2 files changed, 83 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 35696f9f75..44ea84c54d 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -68,6 +68,7 @@
 
 
 /* non-export function prototypes */
+static bool CompareOpclassOptions(Datum *opts1, Datum *opts2, int natts);
 static void CheckPredicate(Expr *predicate);
 static void ComputeIndexAttrs(IndexInfo *indexInfo,
 			  Oid *typeOidP,
@@ -87,13 +88,12 @@ static char *ChooseIndexNameAddition(List *colnames);
 static List *ChooseIndexColumnNames(List *indexElems);
 static void RangeVarCallbackForReindexIndex(const RangeVar *relation,
 			Oid relId, Oid oldRelId, void *arg);
-static bool ReindexRelationConcurrently(Oid relationOid, int options);
-
+static void reindex_error_callback(void *args);
 static void ReindexPartitions(Oid relid, int options, bool isTopLevel);
 static void ReindexMultipleInternal(List *relids, int options);
-static void reindex_error_callback(void *args);
+static bool ReindexRelationConcurrently(Oid relationOid, int options);
 static void update_relispartition(Oid relationId, bool newval);
-static bool CompareOpclassOptions(Datum *opts1, Datum *opts2, int natts);
+static inline void set_safe_index_flag(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -385,7 +385,10 @@ CompareOpclassOptions(Datum *opts1, Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)
+ * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
+ * on indexes that are neither expressional nor partial are also safe to
+ * ignore, since we know that those processes won't examine any data
+ * outside the table they're indexing.
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -406,7 +409,8 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-		  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
+		  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
+		  | PROC_IN_SAFE_IC,
		  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -426,7 +430,8 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 	true, false,
-	PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
+	PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
+	| PROC_IN_SAFE_IC,
	&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -519,6 +524,7 @@ DefineIndex(Oid relationId,
 	bool		amcanorder;
 	amoptions_function amoptions;
 	bool		partitioned;
+	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1045,6 +1051,10 @@ DefineIndex(Oid relationId,
 		}
 	}
 
+	/* Determine whether we can call set_safe_index_flag */
+	safe_index = indexInfo->ii_Expressions == NIL &&
+		indexInfo->ii_Predicate == NIL;
+
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1431,6 +1441,10 @@ DefineIndex(Oid relationId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_safe_index_flag();
+
 	/*
 	 * The index is now visible, so we can report the OID.
 	 

RE: Cache relation sizes?

2020-11-16 Thread k.jami...@fujitsu.com
On Tuesday, November 17, 2020 9:40 AM, Tsunakawa-san wrote:
> From: Thomas Munro 
> > On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik
> >  wrote:
> > > I will look at your implementation more precisely later.
> >
> > Thanks!  Warning:  I thought about making a thing like this for a
> > while, but the patch itself is only a one-day prototype, so I am sure
> > you can find many bugs...  Hmm, I guess the smgrnblocks_fast() code
> > really needs a nice big comment to explain the theory about why this
> > value is fresh enough, based on earlier ideas about implicit memory
> > barriers.  (Something like: if you're extending, you must have
> > acquired the extension lock; if you're scanning you must have acquired
> > a snapshot that only needs to see blocks that existed at that time and
> > concurrent truncations are impossible; I'm wondering about special
> > snapshots like VACUUM, and whether this is too relaxed for that, or if
> > it's covered by existing horizon-based interlocking...).
> 
> We'd like to join the discussion and review, too, so that Kirk's optimization 
> for
> VACUUM truncation and TRUNCATRE can extend beyond recovery to
> operations during normal operation.  In that regard, would you mind
> becoming a committer for his patch?  We think it's committable now.  I'm
> fine with using the unmodified smgrnblocks() as your devil's suggestion.

Just saw this thread has been bumped. And yes, regarding the vacuum and truncate
optimization during recovery, it would still work even without the unmodified
smgrnblocks.
We'd just need to remove the conditions that check the flag parameter. And if
we explore your latest prototype patch, it can possibly use the new
smgrnblocks_ APIs introduced here.
That said, we'd like to explore the next step of extending the feature to
normal operation.

Regards,
Kirk Jamison


Re: Tab complete for CREATE OR REPLACE TRIGGER statement

2020-11-16 Thread Michael Paquier
On Mon, Nov 16, 2020 at 08:12:05AM +, Shinoda, Noriyoshi (PN Japan FSI) 
wrote:
>  Yesterday, OR REPLACE clause was provided to the CREATE TRIGGER
> statement, so I wrote a patch for tab completion for the psql
> command.

Thanks, the logic looks fine.  I'll apply if there are no objections.
Please note that git diff --check complains, and that the indentation does not
seem quite right, but that's easy enough to fix.
--
Michael




Re: Terminate the idle sessions

2020-11-16 Thread David G. Johnston
On Monday, November 16, 2020, Li Japin  wrote:

>
> 
> Consider setting this for specific users instead of as a server default.
> Client connections managed by connection poolers, or initiated indirectly
> like those by a remote postgres_fdw using server, should probably be
> excluded from this timeout.
>
> 
>
>  
> - This parameter should be set to zero if you use postgres_fdw or
> some
> - connection-pooling software, because connections might be closed
> unexpectedly.
> + This parameter should be set to zero if you use some
> connection-pooling software, or
> + PostgreSQL servers used by postgres_fdw, because connections
> might be closed unexpectedly.
>  
> 
>
>
I prefer mine; “or pg servers used by postgres_fdw” doesn’t flow.


> Could you please explain how idle-in-transaction sessions interfere with the
> long-running stability of the server?
>

From the docs (next section):

This allows any locks held by that session to be released and the
connection slot to be reused; it also allows tuples visible only to this
transaction to be vacuumed. See Section 24.1 for more details about this.

David J.


Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-11-16 Thread Fujii Masao




On 2020/11/16 18:24, Masahiro Ikeda wrote:

On 2020-11-13 12:32, lchch1...@sina.cn wrote:

Now, pg_stat_wal supports reset all informantion in WalStats
using pg_stat_reset_shared('wal') function.
Isn't it enough?

Yes, it's OK, sorry I missed this information.


OK.


3. I do not think the description in the document for
'wal_buffers_full' is correct.



Where should I rewrite the description? If my understanding is not
correct, please let me know.

Sorry I have not described it clearly, because I could not understand
the meaning of this column after reading the description in the document.
Now I have read the source code of the WAL write path and found that the
place where 'wal_buffers_full' is incremented is where a backend has to
wait for a WAL buffer which is occupied by another WAL page, so the
backend flushes the old page out of the WAL buffer (once it can).
So I think the original description in the document is not quite on point;
we could describe it as something like
'Total number of times WAL data was written to the disk because a backend
had to evict a WAL buffer occupied by an old page in order to advance to a
new WAL page.'

Sorry if my understanding is wrong.


Thanks for your comments.

Your understanding is almost the same as mine.
It describes the case where not only backends but also other background
processes initialize a new WAL page while the WAL buffers' space is already
used up and there is no space left.


'Total number of times WAL data was written to the disk because a backend
had to evict a WAL buffer occupied by an old page in order to advance to a new WAL page'


Thanks for your suggestion.
I wondered whether users may be confused about how to use "wal_buffers_full" and how 
to tune parameters.

I thought the reason why the WAL buffers have no space is
important for users when tuning the wal_buffers parameter.

How about the following comments?

'Total number of times WAL data was written to the disk because WAL buffers got 
full when initializing a new WAL page'


Or what about the following?

Total number of times WAL data was written to the disk, to claim the buffer 
page to insert new WAL data when the WAL buffers got filled up with unwritten 
WAL data.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-16 Thread Alvaro Herrera
On 2020-Nov-09, Tom Lane wrote:

> Michael Paquier  writes:
> >> +  LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
> >> +  MyProc->vacuumFlags |= PROC_IN_SAFE_IC;
> >> +  ProcGlobal->vacuumFlags[MyProc->pgxactoff] = 
> >> MyProc->vacuumFlags;
> >> +  LWLockRelease(ProcArrayLock);
> 
> > I can't help noticing that you are repeating the same code pattern
> > eight times.  I think that this should be in its own routine, and that
> > we had better document that this should be called just after starting
> > a transaction, with an assertion enforcing that.
> 
> Do we really need exclusive lock on the ProcArray to make this flag
> change?  That seems pretty bad from a concurrency standpoint.

BTW I now know that the reason for taking ProcArrayLock is not the
vacuumFlags itself, but rather MyProc->pgxactoff, which can move.

On the other hand, if we stopped mirroring the flags in ProcGlobal, it
would mean we would have to access all procs' PGPROC entries in
GetSnapshotData, which is undesirable for performance reason (per commit
5788e258bb26).
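
For illustration, the repeated pattern quoted above could be collected into a
small helper along these lines; the function name is made up here and this is
only a sketch, not the patch itself:

    /* Hypothetical helper; name and placement are illustrative only. */
    static void
    ProcSetVacuumFlag(uint8 flag)
    {
        /*
         * ProcArrayLock is needed because MyProc->pgxactoff can move
         * concurrently (see above), and because the flag is mirrored in
         * ProcGlobal->vacuumFlags for the benefit of GetSnapshotData().
         */
        LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
        MyProc->vacuumFlags |= flag;
        ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
        LWLockRelease(ProcArrayLock);
    }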




Re: Skip ExecCheckRTPerms in CTAS with no data

2020-11-16 Thread Michael Paquier
On Mon, Nov 16, 2020 at 04:01:33PM +0100, Peter Eisentraut wrote:
> While this patch was nice enough to update the documentation about the
> requirement of the INSERT privilege, this is maybe more confusing now: How
> could a new table not have INSERT privilege?  Yes, you can do that with
> default privileges, but that's not well known and should be clarified in the
> documentation.
>
> The SQL standard says that for CREATE TABLE AS, the INSERT "is effectively
> executed without further Access Rule checking", which means the INSERT
> privilege shouldn't be required at all.  I suggest we consider doing that
> instead.  I don't see a use for the current behavior.

Hmm.  Is there anything specific for materialized views?  They've
checked for INSERT privileges at this phase since their introduction
in 3bf3ab8c.
--
Michael


signature.asc
Description: PGP signature


Re: Tab complete for CREATE OR REPLACE TRIGGER statement

2020-11-16 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Nov 16, 2020 at 08:12:05AM +, Shinoda, Noriyoshi (PN Japan FSI) 
> wrote:
>> Yesterday, OR REPLACE clause was provided to the CREATE TRIGGER
>> statement, so I wrote a patch for tab completion for the psql
>> command.

> Thanks, the logic looks fine.  I'll apply if there are no objections.
> Please note that git diff --check complains, and that the indentation does
> not seem quite right, but that's easy enough to fix.

It's kind of depressing how repetitive the patch is.  There's
probably not much to be done about that in the short run, but
it seems to point up the need to start thinking about how to
refactor tab-complete.c more thoroughly.
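
For illustration, this is roughly what the duplication looks like in
tab-complete.c: every existing CREATE TRIGGER rule grows a parallel
CREATE OR REPLACE TRIGGER arm. This excerpt is written in the style of
psql_completion() and is not taken from the patch:

    /* Illustrative excerpt only; each rule has to be spelled out twice. */
    else if (TailMatches("CREATE", "TRIGGER", MatchAny) ||
             TailMatches("CREATE", "OR", "REPLACE", "TRIGGER", MatchAny))
        COMPLETE_WITH("BEFORE", "AFTER", "INSTEAD OF");
    else if (TailMatches("CREATE", "TRIGGER", MatchAny, "BEFORE|AFTER") ||
             TailMatches("CREATE", "OR", "REPLACE", "TRIGGER", MatchAny,
                         "BEFORE|AFTER"))
        COMPLETE_WITH("INSERT", "DELETE", "UPDATE", "TRUNCATE");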

regards, tom lane




Re: pgbench: option delaying queries till connections establishment?

2020-11-16 Thread Marina Polyakova

Hello!

On 2020-11-14 20:07, Alexander Korotkov wrote:

Hmm...  Let's see the big picture.  You've recently committed a
patchset, which greatly improved the performance of GetSnapshotData().
And you're making further improvements in this direction.  But you're
getting trouble in measuring the effect, because Postgres is still
stuck on ProcArrayLock.  And in this thread you propose a workaround
for that implemented on the pgbench side.  My very dumb idea is
following: should we finally give a chance to more fair lwlocks rather
than inventing workarounds?

As I remember, your major argument against more fair lwlocks was the
idea that we should fix lwlocks use-cases rather than lwlock mechanism
themselves.  But can we expect that we fix all the lwlocks use-case in
any reasonable prospect?  My guess is 'no'.

Links
1.
https://www.postgresql.org/message-id/CAPpHfdvJhO1qutziOp%3Ddy8TO8Xb4L38BxgKG4RPa1up1Lzh_UQ%40mail.gmail.com


Sorry, I'm not familiar with the internal architecture of snapshots,
locks, etc. in Postgres, but I wanted to ask: what exact kind of patch
for fair lwlocks do you want to offer to the community? I applied the
6th version of the fair-lwlocks patch from [1] to the old master
branch (see commit [2]), started many threads in pgbench (-M prepared -c
1000 -j 500 -T 10 -P1 -S), and I did not get stable first progress
reports, which IIUC are one of the advantages of the patch discussed in
this thread for pgbench (see [3])...


[1] 
https://www.postgresql.org/message-id/CAPpHfduV3v3EG7K74-9htBZz_mpE993zGz-%3D2k5RNA3tqabUAA%40mail.gmail.com
[2] 
https://github.com/postgres/postgres/commit/84d514887f9ca673ae688d00f8b544e70f1ab270
[3] 
https://www.postgresql.org/message-id/20200227185129.hikscyenomnlrord%40alap3.anarazel.de


--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Tab complete for CREATE OR REPLACE TRIGGER statement

2020-11-16 Thread Michael Paquier
On Mon, Nov 16, 2020 at 09:14:51PM -0500, Tom Lane wrote:
> It's kind of depressing how repetitive the patch is.  There's
> probably not much to be done about that in the short run, but
> it seems to point up the need to start thinking about how to
> refactor tab-complete.c more thoroughly.

Agreed.  One thing that I'd think could help is a new wild card to
make some of the words conditional in the list of items.  But that may
be tricky once you consider the case of a group of words.

I don't think that this is a requirement for this thread, though.
--
Michael


signature.asc
Description: PGP signature


Re: Tab complete for CREATE OR REPLACE TRIGGER statement

2020-11-16 Thread Tom Lane
Michael Paquier  writes:
> I don't think that this is a requirement for this thread, though.

Agreed, I'm not trying to block this patch.  Just wishing
there were a better way.

regards, tom lane




Re: pgbench: option delaying queries till connections establishment?

2020-11-16 Thread Andres Freund
Hi,

On 2020-11-14 20:07:38 +0300, Alexander Korotkov wrote:
> Hmm...  Let's see the big picture.  You've recently committed a
> patchset, which greatly improved the performance of GetSnapshotData().
> And you're making further improvements in this direction.  But you're
> getting trouble in measuring the effect, because Postgres is still
> stuck on ProcArrayLock.

No, the problem was that I couldn't measure the before/after behaviour
reliably, because not all connections actually ever get established
*before* the GetSnapshotData() scalability patchset. That made the
numbers pointless, because we'd often end up with e.g. 80 connections
doing work pre-patch and 800 post-patch, which obviously measures very
different things.

I think the issue really is that, independent of PG lock contention,
it'll take a while to establish all connections, and that starting to
benchmark with only some connections established will create pretty
pointless numbers.


> And in this thread you propose a workaround
> for that implemented on the pgbench side.  My very dumb idea is
> following: should we finally give a chance to more fair lwlocks rather
> than inventing workarounds?

Perhaps - I just don't think it's related to this thread. And it's unclear
how you're going to address the overhead.

Greetings,

Andres Freund




Re: Add Information during standby recovery conflicts

2020-11-16 Thread Masahiko Sawada
On Mon, Nov 16, 2020 at 4:55 PM Drouvot, Bertrand  wrote:
>
> Hi,
>
> On 11/16/20 6:44 AM, Masahiko Sawada wrote:
> > Thank you for updating the patch.
> >
> > Here are review comments.
> >
> > +   if (report_waiting && (!logged_recovery_conflict ||
> > new_status == NULL))
> > +   ts = GetCurrentTimestamp();
> >
> > The condition will always be true if log_recovery_conflict_waits is
> > false and report_waiting is true, leading to unnecessary calls to
> > GetCurrentTimestamp().
> >
> > ---
> > +   
> > +You can control whether a log message is produced when the startup 
> > process
> > +is waiting longer than deadlock_timeout for recovery
> > +conflicts. This is controled by the  > linkend="guc-log-recovery-conflict-waits"/>
> > +parameter.
> > +   
> >
> > s/controled/controlled/
> >
> > ---
> >  if (report_waiting)
> >  waitStart = GetCurrentTimestamp();
> >
> > Similarly, we have the above code but we don't need to call
> > GetCurrentTimestamp() if update_process_title is false, even if
> > report_waiting is true.
> >
> > I've attached the patch that fixes the above comments. It can be
> > applied on top of your v8 patch.
>
> Thanks for the review and the associated fixes!
>
> I've attached a new version that contains your fixes.
>

Thank you for updating the patch.

I have other comments:

+   
+You can control whether a log message is produced when the startup process
+is waiting longer than deadlock_timeout for recovery
+conflicts. This is controlled by the
+ parameter.
+   

It would be better to use 'WAL replay' instead of 'the startup
process' for consistency with circumjacent descriptions. What do you
think?

---
@@ -1260,6 +1262,8 @@ ProcSleep(LOCALLOCK *locallock, LockMethod
lockMethodTable)
  else
  enable_timeout_after(DEADLOCK_TIMEOUT, DeadlockTimeout);
  }
+ else
+   standbyWaitStart = GetCurrentTimestamp();

I think we can add a check of log_recovery_conflict_waits to avoid
unnecessary calls to GetCurrentTimestamp().

I've attached the updated version patch including the above comments
as well as adding some assertions. Please review it.

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/


v9-Log-the-standby-recovery-conflict-waits.patch
Description: Binary data


RE: Disable WAL logging to speed up data loading

2020-11-16 Thread tsunakawa.ta...@fujitsu.com
From: Robert Haas 
> I'm also concerned about the way that this proposed feature interacts with
> incremental backup capabilities that already exist in tools like pgBackRest,
> EDB's BART, pg_probackup, and future things we might want to introduce into
> core, along the lines of what I have previously proposed. Now, I think
> pgBackRest uses only timestamps and checksums, so it probably doesn't care,
> but some of the other solutions rely on WAL-scanning to gather a list of
> changed blocks. I guess there's no reason that they can't notice the wal_level
> being changed and do the right thing; they should probably have that kind of
> capability already. Still, it strikes me that it might be useful if we had a 
> stronger
> mechanism.

Having a quick look, those backup tools seem to require setting wal_level to 
replica or higher.  That's no wonder, because recovering the database needs WAL 
for non-relation resources such as pg_control and relation map.  So, I think 
wal_level = none won't introduce new issues (compared to wal_level = minimal, 
which also can lack WAL records for some data updates.)


> By the way, another problem here is that some AMs - e.g. GiST, IIRC - use LSNs
> to figure out whether a block has changed. For temporary and unlogged tables,
> we use "fake" LSNs that are generated using a counter, but that approach only
> works because such relations are never really WAL-logged. Mixing fake LSNs
> and real LSNs will break stuff, and not bumping the LSN when the page
> changes probably will, too.

Unlogged GiST indexes use fake LSNs that are instance-wide.  Unlogged temporary 
GiST indexes use backend-local sequence values.  Other unlogged and temporary 
relations don't set LSNs on pages.  So, I think it's enough to call 
GetFakeLSNForUnloggedRel() when wal_level = none as well.
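
For illustration, a rough sketch of that idea based on gistGetFakeLSN() in
gistutil.c; the exact condition for the new wal_level is assumed here, not
taken from the patch:

    XLogRecPtr
    gistGetFakeLSN(Relation rel)
    {
        if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
        {
            /* Temporary relations are backend-local; a local counter suffices. */
            static XLogRecPtr counter = FirstNormalUnloggedLSN;

            return counter++;
        }

        /*
         * Unlogged relations -- and, under this proposal, permanent relations
         * while wal_level = none -- would share the instance-wide counter
         * maintained by GetFakeLSNForUnloggedRel().
         */
        return GetFakeLSNForUnloggedRel();
    }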


Regards
Takayuki Tsunakawa



Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-11-16 Thread Fujii Masao




On 2020/11/16 16:35, Masahiro Ikeda wrote:

On 2020-11-12 14:58, Fujii Masao wrote:

On 2020/11/06 10:25, Masahiro Ikeda wrote:

On 2020-10-30 11:50, Fujii Masao wrote:

On 2020/10/29 17:03, Masahiro Ikeda wrote:

Hi,

Thanks for your comments and advice. I updated the patch.

On 2020-10-21 18:03, Kyotaro Horiguchi wrote:

At Tue, 20 Oct 2020 16:11:29 +0900, Masahiro Ikeda
 wrote in

On 2020-10-20 12:46, Amit Kapila wrote:
> I see that we also need to add extra code to capture these stats (some
> of which is in performance-critical path especially in
> XLogInsertRecord) which again makes me a bit uncomfortable. It might
> be that it is all fine as it is very important to collect these stats
> at cluster-level in spite that the same information can be gathered at
> statement-level to help customers but I don't see a very strong case
> for that in your proposal.


We should avoid that duplication as much as possible even if both numbers
are important.


Also, about performance, I thought there would be little impact because it
increments stats in memory. If I can implement it to reuse pgWalUsage's
values, which already collect these stats, there is no impact in
XLogInsertRecord.
For example, how about having pg_stat_wal() calculate the accumulated
values of wal_records, wal_fpi, and wal_bytes from pgWalUsage's
values?


I don't think that works, but it would work that pgstat_send_wal()
takes the difference of that values between two successive calls.

WalUsage prevWalUsage;
...
pgstat_send_wal()
{
..
   /* fill in some values using pgWalUsage */
   WalStats.m_wal_bytes   = pgWalUsage.wal_bytes   - prevWalUsage.wal_bytes;
   WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records;
   WalStats.m_wal_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi;
...
   pgstat_send(&WalStats, sizeof(WalStats));

   /* remember the current numbers */
   prevWalUsage = pgWalUsage;


Thanks to Horiguchi-san's advice, I changed the patch to reuse pgWalUsage,
which is already defined, eliminating the extra overhead.


+    /* fill in some values using pgWalUsage */
+    WalStats.m_wal_bytes = pgWalUsage.wal_bytes - prevWalUsage.wal_bytes;
+    WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records;
+    WalStats.m_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi;

It's better to use WalUsageAccumDiff() here?


Yes, thanks. I fixed it.


prevWalUsage needs to be initialized with pgWalUsage?
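
(For illustration, a minimal sketch of what the WalUsageAccumDiff()-based
version could look like, including the initialization question above; names
follow the pseudo-code quoted earlier, and this is not the actual patch.)

    static WalUsage prevWalUsage;   /* assumed to be set from pgWalUsage at startup */

    static void
    pgstat_send_wal(void)
    {
        WalUsage    walusage;

        /* Accumulate only the difference since the previous report. */
        memset(&walusage, 0, sizeof(walusage));
        WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);

        WalStats.m_wal_records = walusage.wal_records;
        WalStats.m_wal_fpi     = walusage.wal_fpi;
        WalStats.m_wal_bytes   = walusage.wal_bytes;

        /* ... set the message header etc. as in the pseudo-code above ... */
        pgstat_send(&WalStats, sizeof(WalStats));

        /* Remember the current totals for the next call. */
        prevWalUsage = pgWalUsage;
        memset(&WalStats, 0, sizeof(WalStats));
    }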

+    if (AmWalWriterProcess()){
+    WalStats.m_wal_write_walwriter++;
+    }
+    else
+    {
+    WalStats.m_wal_write_backend++;
+    }

I think that it's better not to separate m_wal_write_xxx into two for
walwriter and other processes. Instead, we can use one m_wal_write_xxx
counter and make pgstat_send_wal() send also the process type to
the stats collector. Then the stats collector can accumulate the counters
per process type if necessary. If we adopt this approach, we can easily
extend pg_stat_wal so that any fields can be reported per process type.


I'll remove the above source code because these counters are not useful.


On 2020-10-30 12:00, Fujii Masao wrote:

On 2020/10/20 11:31, Masahiro Ikeda wrote:

Hi,

I think we need to add some statistics to pg_stat_wal view.

Although there are some parameter related WAL,
there are few statistics for tuning them.

I think it's better to provide the following statistics.
Please let me know your comments.

```
postgres=# SELECT * from pg_stat_wal;
-[ RECORD 1 ]---+--
wal_records | 2000224
wal_fpi | 47
wal_bytes   | 248216337
wal_buffers_full    | 20954
wal_init_file   | 8
wal_write_backend   | 20960
wal_write_walwriter | 46
wal_write_time  | 51
wal_sync_backend    | 7
wal_sync_walwriter  | 8
wal_sync_time   | 0
stats_reset | 2020-10-20 11:04:51.307771+09
```

1. Basic statistics of WAL activity

- wal_records: Total number of WAL records generated
- wal_fpi: Total number of WAL full page images generated
- wal_bytes: Total amount of WAL bytes generated

To understand DB's performance, first, we will check the performance
trends for the entire database instance.
For example, if the number of wal_fpi becomes higher, users may tune
"wal_compression", "checkpoint_timeout" and so on.

Although users can check the above statistics via EXPLAIN, auto_explain,
autovacuum and pg_stat_statements now,
if users want to see the performance trends  for the entire database,
they must recalculate the statistics.

I think it is useful to add the sum of the basic statistics.


2.  WAL segment file creation

- wal_init_file: Total number of WAL segment files created.

Creating a new WAL file may have an impact on the performance of
a write-heavy workload generating lots of WAL. If this number is reported as high,
to reduce the number of these initializations, we can tune WAL-related parameters
so that more "recycled" WAL files can 

Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-11-16 Thread lchch1...@sina.cn

>> Thanks for your comments.
>>
>> Your understanding is almost the same as mine.
>> It describes the case where not only backends but also other background
>> processes initialize a new WAL page and find that the WAL buffers' space
>> is already in use, with no free space left.
>>
>>> 'Total number of times WAL data was written to disk because a backend had
>>> to flush a WAL buffer to reuse it for an advanced WAL page'
>>
>> Thanks for your suggestion.
>> I was concerned that users might be confused about how to use
>> "wal_buffers_full" and how to tune the related parameters.
>>
>> I thought the reason why the WAL buffers have no space is
>> important for users tuning the wal_buffers parameter.
>>
>> How about the following wording?
>>
>> 'Total number of times WAL data was written to the disk because the WAL
>> buffers got full when initializing a new WAL page'
>Or what about the following?
>Total number of times WAL data was written to the disk, to claim the buffer 
>page to insert new
>WAL data when the WAL buffers got filled up with unwritten WAL data.
As I understand it, we cannot say 'full' because every WAL page maps to a
specific WAL buffer slot.
When a WAL page needs to be written out but its buffer slot is occupied by
another WAL page, the backend has to wait until that slot is released.
So I think we should say 'occupied' rather than 'full'.

Maybe:
Total number of times WAL data was written to the disk, to claim the buffer
page to insert new WAL data when the target WAL buffer was occupied by
another page.



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca



Re: enable_incremental_sort changes query behavior

2020-11-16 Thread Tomas Vondra
Hmm, I missed that other thread. That indeed seems like a bug in the
same area already tweaked by ebb7ae839d033d0f2 for similar cases.

The attached patch fixes this simply by adding is_parallel_safe to
get_useful_pathkeys_for_relation - that does fix the reproducer, but I'm
not entirely sure that's the right place. Maybe it should be done in
find_em_expr_usable_for_sorting_rel (which might make a difference if an
EC clause can contain a mix of parallel safe and unsafe expressions). Or
maybe we should do it in the caller (which would allow using
get_useful_pathkeys_for_relation in contexts not requiring parallel
safety in the future).

Anyway, while this is not an "incremental sort" bug, it seems like the
root cause is the same as for ebb7ae839d033d0f2 - one of the incremental
sort patches started considering sorting below gather nodes, not
realizing these possible consequences.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 84a69b064a..93db261011 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2826,6 +2826,7 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 
 		foreach(lc, root->query_pathkeys)
 		{
+			Expr   *expr;
 			PathKey*pathkey = (PathKey *) lfirst(lc);
 			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
 
@@ -2840,7 +2841,14 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 			 * enable not just an incremental sort on the entirety of
 			 * query_pathkeys but also incremental sort below a JOIN.
 			 */
-			if (!find_em_expr_usable_for_sorting_rel(pathkey_ec, rel))
+			if (!(expr = find_em_expr_usable_for_sorting_rel(pathkey_ec, rel)))
+break;
+
+			/*
+			 * Also stop when the expression is not parallel-safe. We plan
+			 * to add all of this under a Gather node.
+			 */
+			if (!is_parallel_safe(root, (Node *) expr))
 break;
 
 			npathkeys++;


Re: pgbench: option delaying queries till connections establishment?

2020-11-16 Thread Fabien COELHO




I think the issue really is that, independent of PG lock contention,
it'll take a while to establish all connections, and that starting to
benchmark with only some connections established will create pretty
pointless numbers.


Yes. This is why I think that if we have some way to synchronize it should 
always be used, i.e. not an option.
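
For illustration, the synchronization idea boils down to something like the
following standalone sketch: every thread connects first, waits on a barrier,
and only then starts the clock and runs queries. This shows the concept only,
not pgbench's implementation:

    #include <pthread.h>

    #define NTHREADS 8

    static pthread_barrier_t start_barrier;

    static void *
    worker(void *arg)
    {
        (void) arg;
        /* ... establish the database connection here ... */

        /* Wait until *all* threads have established their connections. */
        pthread_barrier_wait(&start_barrier);

        /* ... record the start time and run the benchmark queries ... */
        return NULL;
    }

    int
    main(void)
    {
        pthread_t   threads[NTHREADS];

        pthread_barrier_init(&start_barrier, NULL, NTHREADS);
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(threads[i], NULL);
        pthread_barrier_destroy(&start_barrier);
        return 0;
    }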


--
Fabien.




Re: list of extended statistics on psql

2020-11-16 Thread Tatsuro Yamada

Hi Tomas,

Thanks for your comments and also revising patches.

On 2020/11/16 3:22, Tomas Vondra wrote:

It's better to always post the whole patch series, so that cfbot can
test it properly. Sending just 0003 separately kind breaks that.


I now understand how "cfbot" works, so I'll take care of that
when I send patches. Thanks.



Also, 0003 seems to only tweak the .sql file, not the expected output,
and there actually seems to be two places that mistakenly use \dx (so
listing extensions) instead of \dX. I've fixed both issues in the
attached patches.


Oops, sorry about that.

 

However, I think the 0002 tests are better/sufficient - I prefer to keep
it compact, not interleaving with the tests testing various other stuff.
So I don't intend to commit 0003, unless there's something that I don't
see for some reason.


Agreed. With 0002 it is easier to modify test cases and check the results
than with 0003. Therefore, I'll go with 0002.

 

The one remaining thing I'm not sure about is naming of the columns with
size of statistics - N_size, D_size and M_size does not seem very clear.
Any clearer naming will however make the tables wider, though :-/


Yeah, I think so too, but I couldn't come up with suitable names for
the columns when I created the patch.
I don't prefer long names, but I'll replace them to make things clearer.
For example, s/N_size/Ndistinct_size/.

Please find the attached patches:
  - 0001: Replace column names
  - 0002: Recreate regression test based on 0001


Regards,
Tatsuro Yamada

From 85fe05c3020cd595ae8d5c2cc6f695b39f4a6e03 Mon Sep 17 00:00:00 2001
From: Tatsuro Yamada 
Date: Tue, 17 Nov 2020 13:30:57 +0900
Subject: [PATCH 2/2] Recreate regression test

---
 src/test/regress/expected/stats_ext.out | 94 +
 src/test/regress/sql/stats_ext.sql  | 31 +++
 2 files changed, 125 insertions(+)

diff --git a/src/test/regress/expected/stats_ext.out 
b/src/test/regress/expected/stats_ext.out
index 4c3edd213f..27ca54a8f3 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -1550,6 +1550,100 @@ INSERT INTO tststats.priv_test_tbl
 CREATE STATISTICS tststats.priv_test_stats (mcv) ON a, b
   FROM tststats.priv_test_tbl;
 ANALYZE tststats.priv_test_tbl;
+-- Check printing info about extended statistics by \dX
+create table stts_t1 (a int, b int);
+create statistics stts_1 (ndistinct) on a, b from stts_t1;
+create statistics stts_2 (ndistinct, dependencies) on a, b from stts_t1;
+create statistics stts_3 (ndistinct, dependencies, mcv) on a, b from stts_t1;
+create table stts_t2 (a int, b int, c int);
+create statistics stts_4 on b, c from stts_t2;
+create table stts_t3 (col1 int, col2 int, col3 int);
+create statistics stts_hoge on col1, col2, col3 from stts_t3;
+create schema stts_s1;
+create schema stts_s2;
+create statistics stts_s1.stts_foo on col1, col2 from stts_t3;
+create statistics stts_s2.stts_yama (dependencies, mcv) on col1, col3 from 
stts_t3;
+insert into stts_t1 select i,i from generate_series(1,100) i;
+analyze stts_t1;
+\dX
+                                        List of extended statistics
+  Schema  |          Name          |              Definition              | Ndistinct | Dependencies |   MCV   
+----------+------------------------+--------------------------------------+-----------+--------------+---------
+ public   | func_deps_stat         | a, b, c FROM functional_dependencies |           | built        | 
+ public   | mcv_lists_arrays_stats | a, b, c FROM mcv_lists_arrays        |           |              | built
+ public   | mcv_lists_bool_stats   | a, b, c FROM mcv_lists_bool          |           |              | built
+ public   | mcv_lists_stats        | a, b, d FROM mcv_lists               |           |              | built
+ public   | stts_1                 | a, b FROM stts_t1                    | built     |              | 
+ public   | stts_2                 | a, b FROM stts_t1                    | built     | built        | 
+ public   | stts_3                 | a, b FROM stts_t1                    | built     | built        | built
+ public   | stts_4                 | b, c FROM stts_t2                    | defined   | defined      | defined
+ public   | stts_hoge              | col1, col2, col3 FROM stts_t3        | defined   | defined      | defined
+ stts_s1  | stts_foo               | col1, col2 FROM stts_t3              | defined   | defined      | defined
+ stts_s2  | stts_yama              | col1, col3 FROM stts_t3              |           | defined      | defined
+ tststats | priv_test_stats        | a, b FROM tststats.priv_test_tbl     |           |              | built
+(12 rows)
+
+\dX stts_?
+   List of extended statistics
+ Schema |  Name  |Definition | Ndistinct | Dependencies |   MCV   
+++---+---+--+-
+ public | stts_1 | a, b FROM stts_t1 | built |  

Re: Deleting older versions in unique indexes to avoid page splits

2020-11-16 Thread Peter Geoghegan
On Sun, Nov 15, 2020 at 2:29 PM Victor Yegorov  wrote:
> TPS
> ---
>  query  | Master TPS | Patched TPS |  diff
> ++-+---
> UPDATE + SELECT |   2413 |2473 | +2.5%
> 3 SELECT in txn |  19737 |   19545 | -0.9%
> 15min SELECT|   0.74 |1.03 |  +39%
>
> Based on the figures and also on the graphs attached, I can tell v8 has no 
> visible regression
> in terms of TPS, IO pattern changes slightly, but the end result is worth it.
> In my view, this patch can be applied from a performance POV.

Great, thanks for testing!

-- 
Peter Geoghegan




RE: Terminate the idle sessions

2020-11-16 Thread kuroda.hay...@fujitsu.com
Dear Li, David,

> Additionally, using postgres_fdw within the server doesn't cause issues,
> its using postgres_fdw and the remote server having this setting set to zero 
> that causes a problem.

I didn't know that postgres_fdw could be used within the server... Thanks.

I read the optimize-setitimer patch, and it looks basically good. I have
written down my understanding below,
so please confirm whether it matches your implementation.
(Maybe I missed some subtle cases, so reviews from anyone are welcome...)

[basic concept]

sigalrm_due_at means the time at which the interval timer will ring, and
sigalrm_delivered records who has called schedule_alarm().
If the fin_time of active_timeouts[0] is later than or equal to sigalrm_due_at,
we can stop calling setitimer() because handle_sig_alarm() will be called soon enough.

[when setitimer is called]

In the attached patch, setitimer() will only be called in the following scenarios:

* when handle_sig_alarm() is called via the pqsignal handler
* when a timeout is registered and its fin_time is later than active_timeouts[0]
* when a timeout is disabled
* when handle_sig_alarm() is interrupted and rescheduled(?)

According to the comments, handle_sig_alarm() may be interrupted because of
ereport.
I think that if handle_sig_alarm() is interrupted before it updates
sigalrm_due_at,
the interval timer will never be set again. Is that correct, or is my assumption wrong?

Lastly, I found that setitimer() is obsolete and should be changed to another API.
According to my man page:

```
POSIX.1-2001, SVr4, 4.4BSD (this call first appeared in 4.2BSD).
POSIX.1-2008 marks getitimer() and setitimer() obsolete,
recommending the use of the POSIX timers API (timer_gettime(2), 
timer_settime(2), etc.) instead.
```

Do you have an opinion on this? I think it should be changed
if all platforms support the timer_settime() system call, but this fix affects
all timeouts,
so more consideration might be needed.
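
For reference, a minimal standalone sketch of the POSIX timer API mentioned
above (timer_create()/timer_settime()); this only shows the shape of the API,
not how timeout.c would have to be restructured. Link with -lrt on some
platforms:

    #include <signal.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static void
    on_alarm(int signo)
    {
        (void) signo;
        (void) write(STDOUT_FILENO, "timer fired\n", 12);
    }

    int
    main(void)
    {
        timer_t             timerid;
        struct sigevent     sev = {0};
        struct itimerspec   its = {0};

        signal(SIGALRM, on_alarm);

        sev.sigev_notify = SIGEV_SIGNAL;
        sev.sigev_signo = SIGALRM;
        if (timer_create(CLOCK_REALTIME, &sev, &timerid) != 0)
        {
            perror("timer_create");
            return 1;
        }

        /* One-shot timer, 200 ms from now; it_interval = 0 means no repeat. */
        its.it_value.tv_nsec = 200 * 1000 * 1000;
        if (timer_settime(timerid, 0, &its, NULL) != 0)
        {
            perror("timer_settime");
            return 1;
        }

        pause();                    /* wait for the signal */
        timer_delete(timerid);
        return 0;
    }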

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Support for NSS as a libpq TLS backend

2020-11-16 Thread Jacob Champion
On Nov 13, 2020, at 4:14 AM, Daniel Gustafsson  wrote:
>> On 12 Nov 2020, at 23:12, Jacob Champion  wrote:
>> 
>> I'm not completely sure why this is exposed so easily with an OpenSSL
>> server -- I'm guessing the implementation slices up its packets
>> differently on the wire, causing a read event before NSS is able to
>> decrypt a full record -- but it's worth noting that this case also shows
>> up during NSS-to-NSS psql connections, when handling notifications at
>> the end of every query. PQconsumeInput() reports a hard failure with the
>> current implementation, but its return value is ignored by
>> PrintNotifications(). Otherwise this probably would have showed up
>> earlier.
> 
> Should there perhaps be an Assert there to catch those?

Hm. From the perspective of helping developers out, perhaps, but from
the standpoint of "don't crash when an endpoint outside our control does
something strange", I think that's a harder sell. Should the error be
bubbled all the way up instead? Or perhaps, if psql isn't supposed to
treat notification errors as "hard" failures, it should at least warn
the user that something is fishy?
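
For illustration, this is the kind of check being discussed, as a standalone
libpq sketch (the function name here is ours, not psql's PrintNotifications()):

    #include <stdio.h>
    #include <libpq-fe.h>

    /* Drain pending notifications, reporting hard failures from the socket. */
    static int
    drain_notifications(PGconn *conn)
    {
        PGnotify   *notify;

        if (!PQconsumeInput(conn))      /* returns 0 on a hard failure */
        {
            fprintf(stderr, "lost connection: %s", PQerrorMessage(conn));
            return -1;
        }

        while ((notify = PQnotifies(conn)) != NULL)
        {
            printf("notification \"%s\" from backend PID %d\n",
                   notify->relname, notify->be_pid);
            PQfreemem(notify);
        }
        return 0;
    }

Whether psql should treat such a failure as fatal or merely warn is exactly the
open question above.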

>> (What's the best way to test this case? Are there lower-level tests for
>> the protocol/network layer somewhere that I'm missing?)
> 
> Not AFAIK.  Having been knee-deep now, do you have any ideas on how to
> implement?

I think that testing these sorts of important edge cases needs a
friendly DSL -- something that doesn't want to make devs tear their hair
out while building tests. I've been playing a little bit with Scapy [1]
to understand more of the libpq v3 protocol; I'll see if that can be
adapted for pieces of the TLS handshake in a way that's easy to
maintain. If it can be, maybe that'd be a good starting example.

> I've incorporated this patch as well as the previous patch for the assertion
> failure on private callback data into the attached v19 patchset.  I also did a
> spellcheck and pgindent run on it for ease of review.

Commit 6be725e70 got rid of some psql error messaging that the tests
were keying off of, so there are a few new failures after a rebase onto
latest master.

I've attached a patch that gets the SCRAM tests a little further
(certificate hashing was caught in an infinite loop). I also added error
checks to those loops, along the lines of the existing OpenSSL
implementation: if a suitable digest can't be found, the user will see
an error like

psql: error: could not find digest for OID 'PKCS #1 SHA-256 With RSA 
Encryption'

It's a little verbose but I don't think this case should come up in
normal practice.

--Jacob

[1] https://scapy.net/



nss-fix-hang-when-hashing-certificates.patch
Description: nss-fix-hang-when-hashing-certificates.patch


Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-16 Thread Dmitry Dolgov
> On Fri, Nov 13, 2020 at 09:25:40AM +0900, Michael Paquier wrote:
> On Thu, Nov 12, 2020 at 04:36:32PM +0100, Dmitry Dolgov wrote:
> > Interesting enough, similar discussion happened about vaccumFlags before
> > with the same conclusion that theoretically it's fine to update without
> > holding the lock, but this assumption could change one day and it's
> > better to avoid such risks. Having said that I believe it makes sense to
> > continue with locking. Are there any other opinions? I'll try to
> > benchmark it in the meantime.
>
> Thanks for planning some benchmarking for this specific patch.  I have
> to admit that the possibility of switching vacuumFlags to use atomics
> is very appealing in the long term, with or without considering this
> patch, even if we had better be sure that this patch has no actual
> effect on concurrency first if atomics are not used in worst-case
> scenarios.

I've tried first to test scenarios where GetSnapshotData produces
significant lock contention and "reindex concurrently" implementation
with locks interferes with it. The idea I had is to create a test
function that constantly calls GetSnapshotData (perf indeed shows
a significant portion of time spent on the contended lock), and run it
against a stream of "reindex concurrently" on an empty relation (which still
reaches the safe_index check). I guess it could be considered an
artificial extreme case. Measuring GetSnapshotData (or rather the
surrounding wrapper, to distinguish calls from the test function from
everything else) latency without reindex, with reindex and locks, with
reindex without locks should produce different "modes" and comparing
them we can make some conclusions.
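
(For reference, a sketch of the kind of test function described above -- a C
extension function that repeatedly takes snapshots so that GetSnapshotData()
dominates the profile. The names here are illustrative, not the actual
benchmark harness.)

    #include "postgres.h"
    #include "fmgr.h"
    #include "miscadmin.h"
    #include "utils/snapmgr.h"

    PG_MODULE_MAGIC;

    PG_FUNCTION_INFO_V1(snapshot_stress);

    /* SQL: CREATE FUNCTION snapshot_stress(bigint) RETURNS void ... */
    Datum
    snapshot_stress(PG_FUNCTION_ARGS)
    {
        int64       iterations = PG_GETARG_INT64(0);

        for (int64 i = 0; i < iterations; i++)
        {
            /* GetLatestSnapshot() calls GetSnapshotData() internally. */
            (void) GetLatestSnapshot();
            CHECK_FOR_INTERRUPTS();
        }

        PG_RETURN_VOID();
    }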

Latency histograms without reindex (nanoseconds):

 nsecs   : count distribution
   512 -> 1023   : 0||
  1024 -> 2047   : 10001209 ||
  2048 -> 4095   : 76936||
  4096 -> 8191   : 1468 ||
  8192 -> 16383  : 98   ||
 16384 -> 32767  : 39   ||
 32768 -> 65535  : 6||

The same with reindex without locks:

 nsecs   : count distribution
   512 -> 1023   : 0||
  1024 -> 2047   : 111345   ||
  2048 -> 4095   : 6997627  ||
  4096 -> 8191   : 18575||
  8192 -> 16383  : 586  ||
 16384 -> 32767  : 312  ||
 32768 -> 65535  : 18   ||

The same with reindex with locks:

 nsecs   : count distribution
   512 -> 1023   : 0||
  1024 -> 2047   : 59438||
  2048 -> 4095   : 6901187  ||
  4096 -> 8191   : 18584||
  8192 -> 16383  : 581  ||
 16384 -> 32767  : 280  ||
 32768 -> 65535  : 84   ||

It looks like reindex without locks is indeed faster (there are more
samples in the lower time sections), but the difference is not particularly
significant for the whole distribution, especially taking into account the
extremity of the test.

I'll take a look at benchmarking of switching vacuumFlags to use
atomics, but as it's probably a bit off topic I'm going to attach
another version of the patch with locks and suggested changes. To which
I have one question:

> Michael Paquier  writes:

> I think that this should be in its own routine, and that we had better
> document that this should be called just after starting a transaction,
> with an assertion enforcing that.

I'm not sure exactly which assertion condition you mean?
>From 07c4705a22e6cdd5717df46a974ce00a69fc901f Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Wed, 11 Nov 2020 15:19:48 +0100
Subject: [PATCH v4 1/2] Rename vacuumFlags to statusFlags

With more flags associated to a PGPROC entry that are not related to
vacuum (currently existing or planned), the name "statusFlags" describes
its purpose better.
---
 src/backend/access/transam/twophase.c |  2 +-
 src/backend/commands/vacuum.c |  6 +--
 src/backend/postmaster/autovacuum.c   |  6 +--
 src/backend/replication/logical/logical.c |  4 +-
 src/backend/replication/slot.c|  4 +-
 

Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-16 Thread Simon Riggs
On Tue, 10 Nov 2020 at 03:02, Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > Yeah ... it would be much better if we can make it use atomics instead.
>
> I was thinking more like "do we need any locking at all".
>
> Assuming that a proc's vacuumFlags can be set by only the process itself,
> there's no write conflicts to worry about.  On the read side, there's a
> hazard that onlookers will not see the PROC_IN_SAFE_IC flag set; but
> that's not any different from what the outcome would be if they looked
> just before this stanza executes.  And even if they don't see it, at worst
> we lose the optimization being proposed.
>
> There is a question of whether it's important that both copies of the flag
> appear to update atomically ... but that just begs the question "why in
> heaven's name are there two copies?"

Agreed with all of the above, but I think the issue is minuscule because
ProcArrayLock is acquired in LW_EXCLUSIVE mode at transaction commit.
So it doesn't seem like there is much need to optimize this particular
aspect of the patch.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Tracking cluster upgrade and configuration history

2020-11-16 Thread Mark Dilger



> On Nov 15, 2020, at 11:23 PM, Ian Lawrence Barwick  wrote:
> 
> On Mon, Nov 16, 2020 at 15:48 Bharath Rupireddy
>  wrote:
>> 
>> On Thu, Nov 12, 2020 at 4:01 AM Mark Dilger
>>  wrote:
>>> 
>>> While supporting customers, it would frequently be useful to have more 
>>> information about the history of a cluster.  For example, which prior 
>>> versions were ever installed and running on the cluster?  Has the cluster 
>>> ever been run with fsync=off?  When did the server last enter recovery, if 
>>> ever?  Was a backup_label file present at that time?
>>> 
>> 
>> +1 for the idea. The information will be useful at times for debugging 
>> purposes.
> 
> It's certainly something which would be nice to have.

Thanks for the feedback.

>>> Would it make sense to alternately, or additionally, store some of this 
>>> information in a flat text file in pg_data, say a new file named 
>>> "cluster_history" or such?
>>> 
>> 
>> IMHO, this is also a good idea. We need to think of the APIs to
>> open/read/write/close that history file? How often and which processes
>> and what type of data they write? Is it that the postmaster alone will
>> write into that file? If multiple processes are allowed to write, how
>> to deal with concurrent writers? Will users have to open manually and
>> read that file? or Will we have some program similar to
>> pg_controldata? Will we have some maximum limit to the size of this
>> file?
> 
> pg_stat_statements might be worth looking at as one way of handling that kind
> of file.
> 
> However the problem with keeping a separate file which is not WAL-logged would
> mean it doesn't get propagated to standbys, and there's also the question
> of how it could be maintained across upgrades via pg_upgrade.

Hmmm.  I was not expecting the file to be propagated to standbys.  The 
information could legitimately be different for a primary and a standby.  As a 
very simple example, there may be a flag bit for whether the cluster has 
operated as a standby.  That does raise questions about what sort of 
information about a primary that a standby should track, in case they get 
promoted to primary and information about the old primary would be useful for 
troubleshooting.  Ideas welcome

> 
> FWIW I did once create a background worker extension [1] which logs
> configuration changes to a table, though it's not particularly maintained or
> recommended for production use.

I'm happy to change course if the consensus on the list favors using something 
larger, like log files or logging to a table, but for now I'm still thinking 
about this in terms of something smaller than that.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Tracking cluster upgrade and configuration history

2020-11-16 Thread Mark Dilger



> On Nov 15, 2020, at 10:47 PM, Bharath Rupireddy 
>  wrote:
> 
> On Thu, Nov 12, 2020 at 4:01 AM Mark Dilger
>  wrote:
>> 
>> While supporting customers, it would frequently be useful to have more 
>> information about the history of a cluster.  For example, which prior 
>> versions were ever installed and running on the cluster?  Has the cluster 
>> ever been run with fsync=off?  When did the server last enter recovery, if 
>> ever?  Was a backup_label file present at that time?
>> 
> 
> +1 for the idea. The information will be useful at times for debugging 
> purposes.

Thanks for the feedback.

> 
>> 
>> Some of this type of information could strictly be fixed size, such as a 
>> fixed set of timestamps for the time at which a fixed set of things last 
>> occurred, or a fixed set of bits indicating whether a fixed set of things 
>> ever happened.
>> 
>> Some other types would be variable size, but hopefully short in practice, 
>> like a list of all postgres versions that have ever been run on the cluster.
>> 
>> Logging the information via the usual log mechanism seems insufficient, as 
>> log files may get rotated and this information lost.
>> 
> 
> True. Just a thought: can we use the existing logging mechanism and APIs
> to write to a new file that never gets rotated by the syslogger (of
> course, we need to think about the maximum file size that's allowed)? The
> idea is like this: we use elog/ereport and so on with a new debug
> level; when it is specified, instead of logging into the standard log files,
> we log to the new file.

That's going in a very different direction from what I had in mind.  I was 
imagining something like a single binary or text file of either fixed size or 
something very short but not fixed.  The "very short but not fixed" part of 
that seems a bit too hand-waving on reflection.  Any variable length list, such 
as the list of postgres versions started on the cluster, could be made fixed 
length by only tracking the most recent N of them, perhaps with a flag bit to 
indicate if the list has overflowed.

Using elog/ereport with a new log level that gets directed into a different log 
file is an interesting idea, but it is not clear how to use elog/ereport in a 
principled way to write files that need never get too large.

>> Would it be acceptable to store some fixed set of flag bits and timestamps 
>> in pg_control?  Space there is at a premium.
>> 
> 
> Since we allocate ControlFileData in shared memory and also we may
> have some data with timestamps, variable texts and so on, having this
> included in pg_control data structure would not seem a good idea to
> me.

Variable length texts seem completely out of scope for this.  I would expect 
the data to be a collection of integer types and flag bits.  Fixed length text 
might also be possible, but I don't have any examples in mind of text that we'd 
want to track.
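
For illustration, the sort of fixed-size record being discussed could look
something like this; every field name here is hypothetical, not a concrete
proposal:

    /* Purely illustrative sketch of a fixed-size cluster history record. */
    typedef struct ClusterHistoryData
    {
        uint32      version_history[8];    /* most recent N server versions seen */
        bool        version_overflowed;    /* more versions than we could record */
        uint32      flags;                  /* e.g. "ever ran with fsync = off" */
        TimestampTz last_recovery_time;     /* last time the server entered recovery */
        TimestampTz last_promotion_time;
        pg_crc32c   crc;                    /* same CRC discipline as pg_control */
    } ClusterHistoryData;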

>> Would it make sense to alternately, or additionally, store some of this 
>> information in a flat text file in pg_data, say a new file named 
>> "cluster_history" or such?
>> 
> 
> IMHO, this is also a good idea. We need to think of the APIs to
> open/read/write/close that history file? How often and which processes
> and what type of data they write? Is it that the postmaster alone will
> write into that file? If multiple processes are allowed to write, how
> to deal with concurrent writers? Will users have to open manually and
> read that file? or Will we have some program similar to
> pg_controldata? Will we have some maximum limit to the size of this
> file?

This depends in part on feedback about which information others on this list 
would like to see included, but I was imagining something similar to how 
pg_control works, or using pg_control itself.  The maximum size for pg_control 
is 512 bytes, and on my system sizeof(ControlFileData) = 296, which leaves 216 
bytes free.  I didn't check how much that might change on systems with 
different alignments.  We could either use some of the ~200 bytes currently 
available in pg_control, or use another file, "pg_history" or such, following 
the design pattern already used for pg_control.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company