Re: Improving LWLock wait events

2020-12-22 Thread Craig Ringer
On Wed, 23 Dec 2020 at 15:51, Craig Ringer wrote:

>
> I've struggled with this quite a bit myself.
>
>
By the way, I sent in a patch to enhance the static tracepoints available
for LWLocks. See
https://www.postgresql.org/message-id/cagry4nxjo+-hcc2i5h93ttsz4gzo-fsddcwvkb-qafq1zdx...@mail.gmail.com
.

It'd benefit significantly from the sort of changes you mentioned in #4.
For most purposes I've been able to just use the raw LWLock* but having a
nice neat (tranche,index) value would be ideal.

The trace patch has helped me identify some excessively long LWLock waits
in tools I work on. I'll share another of the systemtap scripts I used with
it soon.


Re: Improving LWLock wait events

2020-12-22 Thread Craig Ringer
On Mon, 21 Dec 2020 at 05:27, Andres Freund  wrote:

> Hi,
>
> The current wait events are already pretty useful. But I think we could
> make them more informative without adding real runtime overhead.
>
>
All 1-3 sound pretty sensible to me.

> I also think there's a 4, but I think the tradeoffs are a bit more
> complicated:
>

> 4) For a few types of lwlock just knowing the tranche isn't
> sufficient. E.g. knowing whether it's one or different buffer mapping locks
> are being waited on is important to judge contention.
>

I've struggled with this quite a bit myself.

In particular, for tools that validate acquire-ordering safety it's
desirable to be able to identify a specific lock in a backend-independent
way.

> The hardest part would be to know how to identify individual locks. The
> easiest would probably be to just mask in a parts of the lwlock address
> (e.g. shift it right by INTALIGN, and then mask in the result into the
> eventId). That seems a bit unsatisfying.
>

It also won't work reliably for locks in dsm segments, since the lock can
be mapped to a different address in different backends.

> We could probably do a bit better: We could just store the information about
> tranche / offset within tranche at LWLockInitialize() time, instead of
> computing something just before waiting.  While LWLock.tranche is only
> 16bits
> right now, the following two bytes are currently padding...
>
> That'd allow us to have proper numerical identification for nearly all
> tranches, without needing to go back to the complexity of having tranches
> specify base & stride.
>

That sounds appealing. It'd work for any lock in MainLWLockArray - all
built-in individual LWLocks, LWTRANCHE_BUFFER_MAPPING,
LWTRANCHE_LOCK_MANAGER, LWTRANCHE_PREDICATE_LOCK_MANAGER, any lock
allocated by RequestNamedLWLockTranche().

Some of the other tranches allocate locks in contiguous fixed blocks or in
ways that would let them maintain a counter.

We'd need some kind of "unknown" placeholder value for LWLocks where that
doesn't make sense, though, like most locks allocated by callers that make
their own LWLockNewTrancheId() call and locks in some of the built-in
tranches not allocated in MainLWLockArray.

So I suggest retaining the current LWLockInitialize() and making it a
wrapper for LWLockInitializeWithIndex() or similar. Use a 1-based index and
keep 0 as "unknown", or use a 0-based index and use (max-1) as "unknown".
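
To make the packing concrete, here is a minimal, self-contained sketch (not
PostgreSQL code; the helper name and the "unknown" sentinel are illustrative
assumptions) of how a 16-bit tranche id and a 16-bit per-tranche index could
be combined into a single wait-event payload:

#include <stdint.h>
#include <stdio.h>

#define LWLOCK_INDEX_UNKNOWN 0	/* 1-based index; 0 means "unknown" */

/* Pack a 16-bit tranche id and a 16-bit per-tranche index into one value. */
static inline uint32_t
pack_lwlock_ident(uint16_t tranche, uint16_t index)
{
	return ((uint32_t) tranche << 16) | index;
}

int
main(void)
{
	uint32_t	ident = pack_lwlock_ident(42, 7);	/* tranche 42, 7th lock */

	printf("tranche=%u index=%u\n", ident >> 16, ident & 0xFFFF);
	return 0;
}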


Re: Perform COPY FROM encoding conversions in larger chunks

2020-12-22 Thread Heikki Linnakangas

On 22/12/2020 22:01, John Naylor wrote:
In 0004, it seems you have some doubts about upgrade compatibility. Is 
that because user-defined conversions would no longer have the right 
signature?


Exactly. If you have an extension that adds a custom conversion function 
and does CREATE CONVERSION, the old installation script will fail on the 
new version. That causes trouble for pg_dump+restore and pg_upgrade.


Perhaps we could accept the old signature in the server when you do 
CREATE CONVERSION, but somehow mark the conversion as broken in the 
catalog so that you would get a runtime error if you tried to use it. 
That would be enough to make pg_dump+restore (and pg_upgrade) not throw 
an error, and you could then upgrade the extension later (ALTER 
EXTENSION UPDATE).


I'm not sure it's worth the trouble, though. Custom conversions are very 
rare. And I don't think any other object can depend on a conversion, so 
you can always drop the conversion before upgrade, and re-create it with 
the new function signature afterwards. A note in the release notes and a 
check in pg_upgrade, with instructions to drop and recreate the 
conversion, are probably enough.


- Heikki




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-22 Thread Michael Paquier
On Tue, Dec 22, 2020 at 03:15:37PM -0600, Justin Pryzby wrote:
> Now, I really think utility.c ought to pass in a pointer to a local
> ReindexOptions variable to avoid all the memory context, which is unnecessary
> and prone to error.

Yeah, it sounds right to me to just bite the bullet and do this 
refactoring, limiting the manipulations of the options as much as
possible across contexts.  So +1 from me to merge 0001 and 0002
together.

I have adjusted a couple of comments and simplified the code in
utility.c a bit more.  I think that this is committable, but let's wait
more than a couple of days for Alvaro and Peter first.  This is a
period of vacations for a lot of people, and there is no point in
applying something that would need more work at the end.  Using
hexadecimal values for the flags with bitmasks is the right conclusion
IMO, but we are not alone.

> ExecReindex() will set options.tablespaceOid, not a pointer.  Like
> this.

OK.  Good to know, I have not looked at this part of the patch yet.
--
Michael
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index c041628049..89394b648e 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -30,13 +30,16 @@ typedef enum
 } IndexStateFlagsAction;
 
 /* options for REINDEX */
-typedef enum ReindexOption
+typedef struct ReindexOptions
 {
-	REINDEXOPT_VERBOSE = 1 << 0,	/* print progress info */
-	REINDEXOPT_REPORT_PROGRESS = 1 << 1,	/* report pgstat progress */
-	REINDEXOPT_MISSING_OK = 1 << 2, /* skip missing relations */
-	REINDEXOPT_CONCURRENTLY = 1 << 3	/* concurrent mode */
-} ReindexOption;
+	bits32		flags;			/* bitmask of REINDEXOPT_* */
+} ReindexOptions;
+
+/* flag bits for ReindexOptions->flags */
+#define REINDEXOPT_VERBOSE		0x01	/* print progress info */
+#define REINDEXOPT_REPORT_PROGRESS 0x02 /* report pgstat progress */
+#define REINDEXOPT_MISSING_OK 	0x04	/* skip missing relations */
+#define REINDEXOPT_CONCURRENTLY	0x08	/* concurrent mode */
 
 /* state info for validate_index bulkdelete callback */
 typedef struct ValidateIndexState
@@ -146,7 +149,7 @@ extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 extern Oid	IndexGetRelation(Oid indexId, bool missing_ok);
 
 extern void reindex_index(Oid indexId, bool skip_constraint_checks,
-		  char relpersistence, int options);
+		  char relpersistence, ReindexOptions *options);
 
 /* Flag bits for reindex_relation(): */
 #define REINDEX_REL_PROCESS_TOAST			0x01
@@ -155,7 +158,7 @@ extern void reindex_index(Oid indexId, bool skip_constraint_checks,
 #define REINDEX_REL_FORCE_INDEXES_UNLOGGED	0x08
 #define REINDEX_REL_FORCE_INDEXES_PERMANENT 0x10
 
-extern bool reindex_relation(Oid relid, int flags, int options);
+extern bool reindex_relation(Oid relid, int flags, ReindexOptions *options);
 
 extern bool ReindexIsProcessingHeap(Oid heapOid);
 extern bool ReindexIsProcessingIndex(Oid indexOid);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 7cfb37c9b2..c66629cf73 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -18,16 +18,17 @@
 #include "storage/lock.h"
 #include "utils/relcache.h"
 
-
 /* options for CLUSTER */
-typedef enum ClusterOption
+#define CLUOPT_RECHECK 0x01		/* recheck relation state */
+#define CLUOPT_VERBOSE 0x02		/* print progress info */
+
+typedef struct ClusterOptions
 {
-	CLUOPT_RECHECK = 1 << 0,	/* recheck relation state */
-	CLUOPT_VERBOSE = 1 << 1		/* print progress info */
-} ClusterOption;
+	bits32		flags;			/* bitmask of CLUSTEROPT_* */
+} ClusterOptions;
 
 extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Oid tableOid, Oid indexOid, int options);
+extern void cluster_rel(Oid tableOid, Oid indexOid, ClusterOptions *options);
 extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
 	   bool recheck, LOCKMODE lockmode);
 extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
diff --git a/src/include/commands/defrem.h b/src/include/commands/defrem.h
index 1133ae1143..d4ea57e757 100644
--- a/src/include/commands/defrem.h
+++ b/src/include/commands/defrem.h
@@ -14,6 +14,7 @@
 #ifndef DEFREM_H
 #define DEFREM_H
 
+#include "catalog/index.h"
 #include "catalog/objectaddress.h"
 #include "nodes/params.h"
 #include "parser/parse_node.h"
@@ -34,11 +35,7 @@ extern ObjectAddress DefineIndex(Oid relationId,
  bool check_not_in_use,
  bool skip_build,
  bool quiet);
-extern int	ReindexParseOptions(ParseState *pstate, ReindexStmt *stmt);
-extern void ReindexIndex(RangeVar *indexRelation, int options, bool isTopLevel);
-extern Oid	ReindexTable(RangeVar *relation, int options, bool isTopLevel);
-extern void ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind,
-  int options);
+extern void ExecReindex(ParseState *pstate, ReindexStmt *stmt, bool isTopLevel);
 extern char *makeObjectName(const char *name1, 

RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread k.jami...@fujitsu.com
On Tuesday, December 22, 2020 9:11 PM, Amit Kapila wrote:
> On Tue, Dec 22, 2020 at 2:55 PM Amit Kapila 
> wrote:
> > Next, I'll look into DropRelFileNodesAllBuffers()
> > optimization patch.
> >
> 
> Review of v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery [1]
> =
> ===
> 1.
> DropRelFileNodesAllBuffers()
> {
> ..
> +buffer_full_scan:
> + pfree(block);
> + nodes = palloc(sizeof(RelFileNode) * n); /* non-local relations */
> +for (i = 0; i < n; i++)  nodes[i] = smgr_reln[i]->smgr_rnode.node;
> +
> ..
> }
> 
> How is it correct to assign nodes array directly from smgr_reln? There is no
> one-to-one correspondence. If you see the code before patch, the passed
> array can have mixed of temp and non-temp relation information.

You are right. I mistakenly removed the nodes array that should have been
allocated for non-local relations. So I fixed that by doing:

SMgrRelation *rels;

rels = palloc(sizeof(SMgrRelation) * nnodes);	/* non-local relations */

/* If it's a local relation, it's localbuf.c's problem. */
for (i = 0; i < nnodes; i++)
{
	if (RelFileNodeBackendIsTemp(smgr_reln[i]->smgr_rnode))
	{
		if (smgr_reln[i]->smgr_rnode.backend == MyBackendId)
			DropRelFileNodeAllLocalBuffers(smgr_reln[i]->smgr_rnode.node);
	}
	else
		rels[n++] = smgr_reln[i];
}
...
if (n == 0)
{
	pfree(rels);
	return;
}
...
// traditional path:
pfree(block);
nodes = palloc(sizeof(RelFileNode) * n);	/* non-local relations */
for (i = 0; i < n; i++)
	nodes[i] = rels[i]->smgr_rnode.node;

> 2.
> + for (i = 0; i < n; i++)
>   {
> - pfree(nodes);
> + for (j = 0; j <= MAX_FORKNUM; j++)
> + {
> + /*
> + * Assign InvalidblockNumber to a block if a relation
> + * fork does not exist, so that we can skip it later
> + * when dropping the relation buffers.
> + */
> + if (!smgrexists(smgr_reln[i], j))
> + {
> + block[i][j] = InvalidBlockNumber;
> + continue;
> + }
> +
> + /* Get the number of blocks for a relation's fork */
> + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> 
> Similar to above, how can we assume smgr_reln array has all non-local
> relations? Have we tried the case with mix of temp and non-temp relations?

Similar to above reply.

> In this code, I am slightly worried about the additional cost of each time
> checking smgrexists. Consider a case where there are many relations and only
> one or few of them have not cached the information, in such a case we will
> pay the cost of smgrexists for many relations without even going to the
> optimized path. Can we avoid that in some way or at least reduce its usage to
> only when it is required? One idea could be that we first check if the nblocks
> information is cached and if so then we don't need to call smgrnblocks,
> otherwise, check if it exists. For this, we need an API like
> smgrnblocks_cached, something we discussed earlier but preferred the
> current API. Do you have any better ideas?

Right. I understand the point: let's say there are 100 relations, and the
first 99 non-local relations happen to enter the optimization path but the
last one does not; calling smgrexists() would then be too costly and a waste
of time.
The only solution I could think of, as you mentioned, is to reintroduce the
new API which we discussed before: smgrnblocks_cached().
It's possible that we call smgrexists() only if smgrnblocks_cached() returns
InvalidBlockNumber.
So if everyone agrees, we can add that API smgrnblocks_cached(), which will
include the "cached" flag parameter, and remove the additional flag
modifications from smgrnblocks(). In this case, both DropRelFileNodeBuffers()
and DropRelFileNodesAllBuffers() will use the new API.
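
For discussion, a rough sketch of what such an smgrnblocks_cached() might
look like (the smgr_cached_nblocks field name is an assumption based on the
patch being discussed, not committed code):

/*
 * Sketch only: return the cached size of a relation fork if we have it,
 * otherwise InvalidBlockNumber so the caller can decide whether it is
 * worth paying for smgrexists() + smgrnblocks().
 */
BlockNumber
smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
{
	if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
		return reln->smgr_cached_nblocks[forknum];

	return InvalidBlockNumber;
}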

Thoughts?


Regards,
Kirk Jamison


TAP PostgresNode function to gdb stacks and optional cores for all backends

2020-12-22 Thread Craig Ringer
Hi all

I recently wrote a utility that adds a $node->gdb_backends() method to
PostgresNode instances - figured I'd share it here in case anyone finds it
useful, or wants to adopt it into the features of the TAP tools.

This function provides a one-line way to dump stacks for all running
backends to per-pid files or to the main test log, as well as the values of
various global variables that are potentially of interest. A default set of
globals will be dumped for each backend and the caller can specify
additional expressions of interest.

If requested, cores will be dumped for each running backend.

A subset of backends may be passed by pid instead, so you can easily target
specific backends you're interested in.

I initially wrote this to help debug a variety of issues with shutdown,
where I hacked the PostgresNode stop() method to trap failed shutdowns and
report stacks for all surviving processes + the postmaster in my wrapper
class for PostgresNode:

sub stop {
	my ($self, $mode) = @_;
	local($@);
	eval {
		PostgresNode::stop($self, $mode);
	};
	if ($@) {
		$self->gdb_backends(want_cores => 1);
		die $@;
	}
}
#
# This is an excerpt from a subclass of PostgresNode
#

# Generate backtraces and optionally core files for all user backends and
# walsenders associated with this node. Requires gdb to be present. Cores
# will be labeled by node name.
sub gdb_backends {
my ($self, %kwargs) = @_;
$kwargs{backtrace_timeout_s} //= '60';
$kwargs{core_timeout_s} //= '60';
$kwargs{want_cores} //= 0;
$kwargs{core_name_pattern} //= 'core.{{pid}}';
$kwargs{gdb_logfile_pattern} //= '';

my $postmaster_pid = $self->{_pid};
my $pgname = $self->name;

# Globals
# TODO make these conditional on an expression to filter them.
# TODO handle statics that vary across files
# TODO add typecasts for when we don't have debuginfo
# TODO useful GUCs
#
my @print_exprs = (
# All backends
'IsPostmasterEnvironment',
'IsUnderPostmaster',
'PostmasterPid',
'LocalRecoveryInProgress',
'*MyProc',
'MyAuxProcType',
'*XLogCtl',
'*ControlFile',

# Generic signal handling
'InterruptPending',
'ProcDiePending',
'ShutdownRequestPending',
'ConfigReloadPending',

# user backend / postgres
'xact_started',
'doing_extended_query_message',
'ignore_till_sync',

# startup process
'ThisTimeLineID',
'LastRec',
'ArchiveRecoveryRequested',
'InArchiveRecovery',
'PrimaryConnInfo',
'PrimarySlotName',
'StandbyMode',

# autovac
'am_autovacuum_launcher',
'am_autovacuum_worker',
'got_SIGHUP',
'got_SIGUSR2',
'got_SIGTERM',
"'autovacuum.c':got_SIGTERM",

# for walsenders
'am_walsender',
'am_cascading_walsender',
'am_db_walsender',
'*MyWalSnd',
'*xlogreader',
'sendTimeLine',
'sentPtr',
'streamingDoneSending',
'streamingDoneReceiving',
"'walsender.c':got_SIGTERM",
'got_STOPPING',
'got_SIGUSR2',
'replication_active',
'*logical_decoding_ctx',
'logical_startptr',

# walreceiver
'recvFileTLI',
'*wrconn',

# checkpointer
'*CheckpointerShmem',
'last_checkpoint_time',
'ckpt_active',

# for bgworkers
'IsBackgroundWorker',

# for pgl backends
'*MyPGLogicalWorker',
'*MyPGLSubscription',

# for bdr backends
'*MyBdrSubscription',

# postmaster
'pmState',
);

# Add your own print expressions by passing print_exprs => ['var1', 'var2']
push @print_exprs, @{$kwargs{print_exprs}}
	if (defined($kwargs{print_exprs}));

my @pids;
if (defined($kwargs{pids})) {
if (ref($kwargs{pids}) eq 'ARRAY') {
# arrayref pid-list
@pids = @{$kwargs{pids}};
} elsif (ref($kwargs{pids}) eq '') {
# Scalar pid-list
@pids = split(qr/[\r\n]/, $kwargs{pids});
} else {
die("keyword argument 'pids' must be undef, an 
arrayref, 

Logical decoding without slots: decoding in lockstep with recovery

2020-12-22 Thread Craig Ringer
Hi all

I want to share an idea I've looked at a few times where I've run into
situations where logical slots were inadvertently dropped, or where it
became necessary to decode changes in the past on a slot.

As most of you will know you can't just create a logical slot in the past.
Even if it was permitted, it'd be unsafe due to catalog_xmin retention
requirements and missing WAL.

But if we can arrange a physical replica to replay the WAL of interest and
decode each commit as soon as it's replayed by the startup process, we know
the needed catalog rows must all exist, so it's safe to decode the change.

So it should be feasible to run logical decoding in standby, even without a
replication slot, so long as we:

* pause startup process after each xl_xact_commit
* wake the walsender running logical decoding
* decode and process until ReorderBufferCommit for the just-committed xact
returns
* wake the startup process to replay up to the next commit

Can anyone see any obvious problem with this?

I don't think the potential issues with WAL commit visibility order vs
shmem commit visibility order should be a concern.

I see this as potentially useful in data recovery, where you might want to
be able to extract a change stream for a subset of tables from PITR
recovery, for example. Also for audit use.


Re: Better client reporting for "immediate stop" shutdowns

2020-12-22 Thread Bharath Rupireddy
On Tue, Dec 22, 2020 at 11:02 PM Tom Lane  wrote:
> Magnus Hagander  writes:
> > On Tue, Dec 22, 2020 at 2:29 AM Bharath Rupireddy
> >  wrote:
> >> If I'm correct, quickdie() doesn't access any shared memory because
> >> one of the reason we can be in quickdie() is when the shared memory
> >> itself is corrupted (the comment down below on why we don't call
> >> proc_exit() or atexit() says), in such case, will GetQuitSignalReason()
> >> have some problem in accessing the shared memory i.e. +return
> >> PMSignalState->sigquit_reason;?
>
> It couldn't really have any problem in physically accessing the field;
> we never detach from the main shared memory block.  The risk that needs
> to be thought about is that shared memory contains garbage --- for
> example, imagine that a failing process scribbled in the wrong part of
> shared memory before crashing.  So the hazard here is that there's a
> small chance that sigquit_reason will contain the wrong value, which
> would cause the patch to print a misleading message, or more likely
> not print anything (since I didn't put a default case in that switch).
> That seems fine to me.  Also, because the sequence of events would be
> (1) failing process scribbles and crashes, (2) postmaster updates
> sigquit_reason, (3) other child processes examine sigquit_reason,
> it's fairly likely that we'd get the right answer even if the field
> got clobbered during (1).

Hmm.

> There might be an argument for emitting the "unexpected SIGQUIT"
> text if we find garbage in sigquit_reason.  Any thoughts about that?

Although I can't think of any such case now, IMHO we can still have a
default case (which we may or may not hit) in the switch, with a message
something like "terminating connection due to unexpected SIGQUIT".

> >> AFAIK, errmsg(terminating connection due to administrator command") is
> >> emitted when there's no specific reason. But we know exactly why we
> >> are terminating in this case, how about having an error message like
> >> errmsg("terminating connection due to immediate shutdown request")));
> >> ? There are other places where errmsg("terminating connection due to
> >>  reasons"); is used. We also log messages when an immediate
> >> shutdown request is received errmsg("received immediate shutdown
> >> request").
>
> > +1. I definitely think having this message be different can be useful.
>
> OK, will use "terminating connection due to immediate shutdown
> request".

Thanks.

I don't have any further comments on the patch.

> > See also the thread about tracking shutdown reasons (connection
> > statistics) -- not the same thing, but the same concepts apply.
>
> Hm.  I wondered for a bit if that patch could make use of this one
> to improve its results.  For the specific case of SIGQUIT it seems
> moot because we aren't going to let backends send any shutdown
> statistics during an emergency stop.

Yeah.

> But maybe the idea could be extended to let more-accurate termination reasons 
> be provided in
> some other cases.

Yeah. For instance, the idea can be extended to the following scenario:
currently, for smart and fast shutdown the postmaster sends a single signal,
SIGTERM, so the backend cannot know what the exact reason for the shutdown
was. If the postmaster updates the sigterm reason (the way this patch does,
just before signalling children with SIGTERM), then the backend would know
that information and could report better.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Single transaction in the tablesync worker?

2020-12-22 Thread Peter Smith
Hi Amit.

PSA my v7 WIP patch for the Solution1.

This patch still applies onto the v30 patch set [1] from other 2PC thread:
[1] 
https://www.postgresql.org/message-id/CAFPTHDYA8yE6tEmQ2USYS68kNt%2BkM%3DSwKgj%3Djy4AvFD5e9-UTQ%40mail.gmail.com

(I understand you would like this to be delivered as a separate patch
independent of v30. I will convert it ASAP)



Coded / WIP:

* tablesync slot is now permanent instead of temporary. The tablesync
slot name is no longer tied to the Subscription slot name.

* the tablesync slot cleanup (drop) code is added for DropSubscription
and for finish_sync_worker functions

* tablesync worker now allows multiple txs instead of a single tx

* a new state (SUBREL_STATE_COPYDONE) is persisted after a successful
copy_table in LogicalRepSyncTableStart.

* if a relaunched tablesync finds the state is SUBREL_STATE_COPYDONE
then it will bypass the initial copy_table phase.

* tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar as done for the apply worker). The
origin is advanced when first created.

* tablesync replication origin tracking is cleaned up during
DropSubscription and/or process_syncing_tables_for_apply

* The v7 DropSubscription cleanup code has been rewritten since v6.
The subscription TAP tests have been executed many (7) times now
without observing any of the race problems that I previously reported
seeing when using the v6 patch.

TODO / Known Issues:

* Help / comments / cleanup

* There is temporary "!!>>" excessive logging scattered around which I
added to help my testing during development

* Address review comments

---

Kind Regards,
Peter Smith.
Fujitsu Australia


v7-0002-WIP-patch-for-the-Solution1.patch
Description: Binary data


v7-0001-2PC-change-tablesync-slot-to-use-same-two_phase-m.patch
Description: Binary data


proposal - support tsv output format for psql

2020-12-22 Thread Pavel Stehule
Hi

I am playing with the clipboard on Linux, and I found a way to redirect
psql output to the clipboard via wl-copy or xclip and then into LibreOffice.
It looks like the best format for this is tsv:

select * from pg_database \g (format=tsv) | wl-copy -t
application/x-libreoffice-tsvc

Implementation of tsv format should not be hard.

What do you think about this?

Regards

Pavel


Re: libpq compression

2020-12-22 Thread Konstantin Knizhnik




On 22.12.2020 22:03, Tom Lane wrote:

Tomas Vondra  writes:

I don't see aby benchmark results in this thread, allowing me to make
that conclusion, and I find it hard to believe that 200MB/client is a
sensible trade-off.
It assumes you have that much memory, and it may allow easy DoS attack
(although maybe it's not worse than e.g. generating a lot of I/O or
running expensive function). Maybe allowing limiting the compression
level / decompression buffer size in postgresql.conf would be enough. Or
maybe allow disabling such compression algorithms altogether.

The link Ken pointed at suggests that restricting the window size to
8MB is a common compromise.  It's not clear to me what that does to
the achievable compression ratio.  Even 8MB could be an annoying cost
if it's being paid per-process, on both the server and client sides.

regards, tom lane
Please notice that my original intention was to not give the user
(client) the possibility to choose the compression algorithm and
compression level at all.
All my previous experiments demonstrate that using a compression level
larger than the default only significantly decreases speed without
improving the compression ratio, especially for compressing protocol
messages.
Moreover, on some dummy data (like that generated by pgbench), zstd with
the default compression level (1) shows a better compression ratio than
with higher levels.

I had to add the possibility to specify the compression level and suggested
compression algorithms because it was requested by reviewers.
But I still think that it was the wrong idea, and these results just prove
it.

More flexibility is not always good...

Now there is a discussion concerning a way to switch the compression
algorithm on the fly (a particular case: toggling compression for
individual libpq messages). IMHO it is once again excessive flexibility
which just increases complexity and gives nothing good in practice.









Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread Zhihong Yu
Hi,
It is possible to come out of the nested loop without goto.

+   bool        cached = true;
...
+    * to that fork during recovery.
+    */
+   for (i = 0; i < n && cached; i++)
...
+   if (!cached)
+       break;

Here I changed the initial value for cached to true so that we enter the
outer loop.
In place of the original goto, we break out of the inner loop and exit the
outer loop.

Cheers

On Tue, Dec 22, 2020 at 8:22 PM tsunakawa.ta...@fujitsu.com <
tsunakawa.ta...@fujitsu.com> wrote:

> From: Amit Kapila 
> > + /* Get the number of blocks for a relation's fork */
> > + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> > +
> > + if (!cached)
> > + goto buffer_full_scan;
> >
> > Why do we need to use goto here? We can simply break from the loop and
> > then check if (cached && nBlocksToInvalidate <
> > BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
> > possible without much complexity.
>
> That's because two for loops are nested -- breaking there only exits the
> inner loop.  (I thought the same as you at first... And I understand some
> > people have an allergy to goto, I think modest use of goto makes the code
> readable.)
>
>
> Regards
> Takayuki Tsunakawa
>
>
>
>
>


Re: Reduce the number of special cases to build contrib modules on windows

2020-12-22 Thread Michael Paquier
On Tue, Dec 22, 2020 at 11:24:40PM +1300, David Rowley wrote:
> On Wed, 11 Nov 2020 at 13:44, Michael Paquier  wrote:
>> It seems to me that your patch is doing the right thing for adminpack
>> and that its Makefile has no need to include a reference to libpq
>> source path, no?
> 
> Yeah.  Likely a separate commit should remove the -I$(libpq_srcdir)
> from adminpack and old_snapshot

I have begun a new thread about this point as that's a separate
topic.  I did not see other places in need of a similar cleanup:
https://www.postgresql.org/message-id/x+lqpflyk7jgz...@paquier.xyz

> I didn't look in detail, but it looks like if we define LOWER_NODE on
> Windows that it might break pg_upgrade.  I guess you could say it's
> partially broken now as the behaviour there will depend on if you
> build using Visual Studio or cygwin.  We'd define LOWER_NODE on cygwin
> but not on VS.  Looks like a pg_upgrade might be problematic there
> today.
> 
> It feels a bit annoying to add some special case to the script to
> maintain the status quo there.  An alternative to that would be to
> modify the .c code at #ifdef LOWER_NODE to also check we're not
> building on VS. Neither option seems nice.

Hmm.  It seems that you are right here.  This influences lquery
parsing so it may be nasty, and this has existed since ltree was added
to the tree (2002).  I think that I would choose to update the C code
and remove LOWER_NODE while keeping the scripts clean, and documenting
directly in the code why this compatibility issue exists.
REFINT_VERBOSE is no problem, fortunately.

> I've attached the updated patch and also a diff showing the changes in
> the *.vcxproj files.

Thanks!

> There are quite a few places where the hash table code for includes
> and references gets rid of duplicates that already exist today. For
> example pgbench.vcxproj references libpgport.vcxproj and
> libpgcommon.vcxproj twice.

The diffs look clean.  dblink has lost src/backend/, there are the
additions of REFINT_VERBOSE and LOWER_NODE but the bulk of the diffs
comes from a change in the order of items listed, while removing
duplicates.

I have tested your patch, and this is causing compilation failures for
hstore_plpython, jsonb_plpython and ltree_plpython.  So
AddTransformModule is missing something here when compiling with
Python.
--
Michael




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-22 Thread Justin Pryzby
On Tue, Dec 22, 2020 at 03:22:19PM -0800, Zhihong Yu wrote:
> Justin:
> For reindex_index() :
> 
> +   if (options->tablespaceOid == MyDatabaseTableSpace)
> +   options->tablespaceOid = InvalidOid;
> ...
> +   oldTablespaceOid = iRel->rd_rel->reltablespace;
> +   if (set_tablespace &&
> +   (options->tablespaceOid != oldTablespaceOid ||
> +   (options->tablespaceOid == MyDatabaseTableSpace &&
> OidIsValid(oldTablespaceOid
> 
> I wonder why the options->tablespaceOid == MyDatabaseTableSpace clause
> appears again in the second if statement.
> Since the first if statement would assign InvalidOid
> to options->tablespaceOid when the first if condition is satisfied.

Good question.  Alexey mentioned on Sept 23 that he added the first stanza to
avoid storing the DB's tablespace OID (rather than InvalidOid).

I think the 2nd half of the "or" is unnecessary since the first stanza now
sets options->tablespaceOid = InvalidOid.
If requesting to move to the DB's default tablespace, it'll now hit the first
part of the OR:

> +   (options->tablespaceOid != oldTablespaceOid ||

Without the first stanza setting, it would've hit the 2nd condition:

> +   (options->tablespaceOid == MyDatabaseTableSpace && 
> OidIsValid(oldTablespaceOid

which means: "user requested to move a table to the DB's default tblspace, and
it was previously on a nondefault space".

So I think we can drop the 2nd half of the OR.  Thanks for noticing.
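
In other words (an illustrative sketch only, reusing the variable names from
the quoted patch), the check could be reduced to:

if (options->tablespaceOid == MyDatabaseTableSpace)
	options->tablespaceOid = InvalidOid;
...
oldTablespaceOid = iRel->rd_rel->reltablespace;
if (set_tablespace &&
	options->tablespaceOid != oldTablespaceOid)
{
	/* move the index to the requested tablespace */
}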

-- 
Justin




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread Kyotaro Horiguchi
At Wed, 23 Dec 2020 04:22:19 +, "tsunakawa.ta...@fujitsu.com" 
 wrote in 
> From: Amit Kapila 
> > + /* Get the number of blocks for a relation's fork */
> > + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> > +
> > + if (!cached)
> > + goto buffer_full_scan;
> > 
> > Why do we need to use goto here? We can simply break from the loop and
> > then check if (cached && nBlocksToInvalidate <
> > BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
> > possible without much complexity.
> 
> That's because two for loops are nested -- breaking there only exits the 
> inner loop.  (I thought the same as you at first... And I understand some 
> people have an allergy to goto, I think modest use of goto makes the code 
> readable.)

I don't strongly oppose gotos, but in this case the outer loop can
break on the same condition as the inner loop, since cached is true
whenever the inner loop runs to the end. It is necessary to initialize
the variable cached with true, instead of false, though.
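
For clarity, a sketch of that shape (variable names follow the quoted patch;
this is illustrative, not the actual code):

cached = true;
for (i = 0; i < n && cached; i++)
{
	for (j = 0; j <= MAX_FORKNUM; j++)
	{
		/* Get the number of blocks for a relation's fork */
		block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
		if (!cached)
			break;		/* exits the inner loop ... */
	}
	/* ... and the "&& cached" condition then ends the outer loop too */
}

if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
{
	/* optimized path: invalidate only the cached blocks */
}
else
{
	/* fall back to the full buffer pool scan */
}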

The same pattern is seen in the tree.

Regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Cleanup some -I$(libpq_srcdir) in makefiles

2020-12-22 Thread Michael Paquier
Hi all,

While looking at a patch from David, I have noticed $subject:
https://www.postgresql.org/message-id/CAApHDvpgB+vxk=w6opkidwzzeo6knifqidnomzr8p4rotyk...@mail.gmail.com

adminpack and old_snapshot have no need for those references as they
don't use libpq.  Any objections to a small-ish cleanup that removes
those references, as per the attached?

Thanks,
--
Michael
diff --git a/contrib/adminpack/Makefile b/contrib/adminpack/Makefile
index 630fea7726..851504f4ae 100644
--- a/contrib/adminpack/Makefile
+++ b/contrib/adminpack/Makefile
@@ -4,7 +4,6 @@ MODULE_big = adminpack
 OBJS = \
 	$(WIN32RES) \
 	adminpack.o
-PG_CPPFLAGS = -I$(libpq_srcdir)
 
 EXTENSION = adminpack
 DATA = adminpack--1.0.sql adminpack--1.0--1.1.sql adminpack--1.1--2.0.sql\
diff --git a/contrib/old_snapshot/Makefile b/contrib/old_snapshot/Makefile
index 77c85df322..adb557532f 100644
--- a/contrib/old_snapshot/Makefile
+++ b/contrib/old_snapshot/Makefile
@@ -4,7 +4,6 @@ MODULE_big = old_snapshot
 OBJS = \
 	$(WIN32RES) \
 	time_mapping.o
-PG_CPPFLAGS = -I$(libpq_srcdir)
 
 EXTENSION = old_snapshot
 DATA = old_snapshot--1.0.sql




Re: Preventing hangups in bgworker start/stop during DB shutdown

2020-12-22 Thread Craig Ringer
On Wed, 23 Dec 2020 at 05:40, Tom Lane  wrote:

> Here's an attempt at closing the race condition discussed in [1]
> (and in some earlier threads, though I'm too lazy to find them).
>
> The core problem is that the bgworker management APIs were designed
> without any thought for exception conditions, notably "we're not
> gonna launch any more workers because we're shutting down the database".
> A process waiting for a worker in WaitForBackgroundWorkerStartup or
> WaitForBackgroundWorkerShutdown will wait forever, so that the database
> fails to shut down without manual intervention.
>
> I'd supposed that we would need some incompatible changes in those APIs
> in order to fix this, but after further study it seems like we could
> hack things by just acting as though a request that won't be serviced
> has already run to completion.  I'm not terribly satisfied with that
> as a long-term solution --- it seems to me that callers should be told
> that there was a failure.  But this looks to be enough to solve the
> lockup condition for existing callers, and it seems like it'd be okay
> to backpatch.
>
> Thoughts?
>

Callers who launch bgworkers already have to cope with conditions such as
the worker failing immediately after launch, or before attaching to the
shmem segment used for worker management by whatever extension is launching
it.

So I think it's reasonable to lie and say we launched it. The caller must
already cope with this case to behave correctly.

Patch specifics:

> This function should only be called from the postmaster

It'd be good to

Assert(IsPostmasterEnvironment && !IsUnderPostmaster)

in these functions.

Otherwise at first read the patch and rationale looks sensible to me.

(When it comes to the bgw APIs in general I have a laundry list of things
I'd like to change or improve around signal handling, error trapping and
recovery, and lots more, but that's for another thread.)


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> + /* Get the number of blocks for a relation's fork */
> + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> +
> + if (!cached)
> + goto buffer_full_scan;
> 
> Why do we need to use goto here? We can simply break from the loop and
> then check if (cached && nBlocksToInvalidate <
> BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
> possible without much complexity.

That's because two for loops are nested -- breaking there only exits the inner
loop.  (I thought the same as you at first... And I understand some people have
an allergy to goto; I think modest use of goto makes the code readable.)


Regards
Takayuki Tsunakawa






Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread Amit Kapila
On Tue, Dec 22, 2020 at 5:41 PM Amit Kapila  wrote:
>
> On Tue, Dec 22, 2020 at 2:55 PM Amit Kapila  wrote:
> >
> > Apart from tests, do let me know if you are happy with the changes in
> > the patch? Next, I'll look into DropRelFileNodesAllBuffers()
> > optimization patch.
> >
>
> Review of v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery [1]
> 
>
> In this code, I am slightly worried about the additional cost of each
> time checking smgrexists. Consider a case where there are many
> relations and only one or few of them have not cached the information,
> in such a case we will pay the cost of smgrexists for many relations
> without even going to the optimized path. Can we avoid that in some
> way or at least reduce its usage to only when it is required? One idea
> could be that we first check if the nblocks information is cached and
> if so then we don't need to call smgrnblocks, otherwise, check if it
> exists. For this, we need an API like smgrnblocks_cached, something we
> discussed earlier but preferred the current API. Do you have any
> better ideas?
>

One more idea which is not better than what I mentioned above is that
we completely avoid calling smgrexists and rely on smgrnblocks. It
will throw an error in case the particular fork doesn't exist and we
can use try .. catch to handle it. I just mentioned it as it came
across my mind but I don't think it is better than the previous one.

One more thing about patch:
+ /* Get the number of blocks for a relation's fork */
+ block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
+
+ if (!cached)
+ goto buffer_full_scan;

Why do we need to use goto here? We can simply break from the loop and
then check if (cached && nBlocksToInvalidate <
BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
possible without much complexity.

-- 
With Regards,
Amit Kapila.




RE: Consider Parallelism While Planning For REFRESH MATERIALIZED VIEW

2020-12-22 Thread Hou, Zhijie
> Thanks for taking a look at the patch.
> 
> The intention of the patch is to just enable the parallel mode while planning
> the select part of the materialized view, but the insertions do happen in
> the leader backend itself. That way even if there's temporary tablespace
> gets created, we have no problem.
> 
> This patch can be thought as a precursor to parallelizing inserts in refresh
> matview. I'm thinking to follow the design of parallel inserts in ctas [1]
> i.e. pushing the dest receiver down to the workers once that gets reviewed
> and finalized. At that time, we should take care of not pushing inserts
> down to workers if temporary tablespace gets created.
> 
> In summary, as far as this patch is considered we don't have any problem
> with temporary tablespace getting created with CONCURRENTLY option.
> 
> I'm planning to add a few test cases to cover this patch in matview.sql
> and post a v2 patch soon.

Thanks for your explanation!
You are right that temporary tablespace does not affect the current patch.

For the test case:
I noticed that you have posted a mail about adding explain support for
REFRESH MATERIALIZED VIEW.
Do you think we can combine these two features in one thread?

Personally, I think the test case with explain info would be better.

Best regards,
houzj




Re: Parallel bitmap index scan

2020-12-22 Thread Dilip Kumar
On Wed, 23 Dec 2020 at 4:15 AM, Tomas Vondra wrote:

> On 11/11/20 8:52 PM, Tomas Vondra wrote:
> > Hi,
> >
> > I took a look at this today, doing a bit of stress-testing, and I can
> > get it to crash because of segfaults in pagetable_create (not sure if
> > the issue is there, it might be just a symptom of an issue elsewhere).
> >
> > Attached is a shell script I use to run the stress test - it's using
> > 'test' database, generates tables of different size and then runs
> > queries with various parameter combinations. It takes a while to trigger
> > the crash, so it might depend on timing or something like that.
> >
> > I've also attached two examples of backtraces. I've also seen infinite
> > loop in pagetable_create, but the crashes are much more common.
> >
>
> Hi Dilip,
>
> Do you plan to work on this for PG14? I haven't noticed any response in
> this thread, dealing with the crashes I reported a while ago. Also, it
> doesn't seem to be added to any of the commitfests.



Hi Tomas,

Thanks for testing this.  Actually we have noticed a lot of performance
drop in many cases due to the tbm_merge.  So off list we are discussing
different approaches and testing the performance.  So basically, in the
current approach all the workers are first preparing their bitmap hash and
then they are merging into the common bitmap hash under a lock.  So based
on the off list discussion with Robert, the next approach I am trying is to
directly insert into the shared bitmap hash while scanning the index
itself.  So now instead of preparing a separate bitmap, all the workers
will directly insert into the shared bitmap hash.  I agree that for getting
each page from the bitmaphash we need to acquire the lock and this also
might generate a lot of lock contention but we want to try the POC and
check the performance.  In fact I have already implemented the POC and
results aren't great.  But I am still experimenting with it to see whether
the lock can be more granular than I have now.  I will share my findings
soon along with the POC patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Proposed patch for key managment

2020-12-22 Thread Bruce Momjian
On Tue, Dec 22, 2020 at 10:40:17AM -0500, Bruce Momjian wrote:
> On Mon, Dec 21, 2020 at 10:07:48PM -0500, Bruce Momjian wrote:
> > Attached is the script patch.  It is also at:
> > 
> > 
> > https://github.com/postgres/postgres/compare/master...bmomjian:cfe-sh.diff
> > 
> > I think it still needs docs but those will have to be done after the
> > code doc patch is added.
> 
> Here is an updated patch.  Are people happy with the Makefile, its
> location in the source tree, and the install directory name?  I used the
> directory name 'auth_commands' because I thought 'auth' was too easily
> misinterpreted. I put the scripts in /src/backend/utils/auth_commands. 
> It also contains a script that can be used for SSL passphrase prompting,
> but I haven't written the C code for that yet.

Here is a new patch, built on previous patches, which allows for the SSL
passphrase to be prompted from the terminal.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index 639c623..850813e
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*** include_dir 'conf.d'
*** 1452,1469 
  mechanism is used.
 
 
! The command must print the passphrase to the standard output and exit
! with code 0.  In the parameter value, %p is
! replaced by a prompt string.  (Write %% for a
! literal %.)  Note that the prompt string will
! probably contain whitespace, so be sure to quote adequately.  A single
! newlines is stripped from the end of the output if present.
!
!
! The command does not actually have to prompt the user for a
! passphrase.  It can read it from a file, obtain it from a keychain
! facility, or similar.  It is up to the user to make sure the chosen
! mechanism is adequately secure.
 
 
  This parameter can only be set in the postgresql.conf
--- 1452,1469 
  mechanism is used.
 
 
! The command must print the passphrase to the standard output
! and exit with code 0.  It can prompt from the terminal if
! --authprompt is used.  In the parameter value,
! %R represents the file descriptor number opened
! to the terminal that started the server.  A file descriptor is only
! available if enabled at server start.  If %R
! is used and no file descriptor is available, the server will not
! start.  Value %p is replaced by a pre-defined
! prompt string.  (Write %% for a literal
! %.)  Note that the prompt string will probably
! contain whitespace, so be sure to quote its use adequately.
! Newlines are stripped from the end of the output if present.
 
 
  This parameter can only be set in the postgresql.conf
*** include_dir 'conf.d'
*** 1486,1495 
  parameter is off (the default), then
  ssl_passphrase_command will be ignored during a
  reload and the SSL configuration will not be reloaded if a passphrase
! is needed.  That setting is appropriate for a command that requires a
! TTY for prompting, which might not be available when the server is
! running.  Setting this parameter to on might be appropriate if the
! passphrase is obtained from a file, for example.
 
 
  This parameter can only be set in the postgresql.conf
--- 1486,1495 
  parameter is off (the default), then
  ssl_passphrase_command will be ignored during a
  reload and the SSL configuration will not be reloaded if a passphrase
! is needed.  This setting is appropriate for a command that requires a
! terminal for prompting, which might not be available when the server is
! running.  Setting this parameter on might be appropriate, for
! example, if the passphrase is obtained from a file.
 
 
  This parameter can only be set in the postgresql.conf
diff --git a/doc/src/sgml/ref/pg_ctl-ref.sgml b/doc/src/sgml/ref/pg_ctl-ref.sgml
new file mode 100644
index f04e417..0662ae0
*** a/doc/src/sgml/ref/pg_ctl-ref.sgml
--- b/doc/src/sgml/ref/pg_ctl-ref.sgml
*** PostgreSQL documentation
*** 380,387 
--authprompt

 
! Allows the --cluster-key-command command
! to prompt for a passphrase or PIN.
 

   
--- 380,388 
--authprompt

 
! Allows ssl_passphrase_command or
! cluster_key_command to prompt for a passphrase
! or PIN.
 

   
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
new file mode 100644
index 

Re: Parallel INSERT (INTO ... SELECT ...)

2020-12-22 Thread Amit Kapila
On Wed, Dec 23, 2020 at 7:52 AM Hou, Zhijie  wrote:
>
> Hi
>
> > > I may be wrong, and if I miss sth in previous mails, please give me some
> > hints.
> > > IMO, serial insertion with underlying parallel SELECT can be
> > > considered for foreign table or temporary table, as the insertions only
> > happened in the leader process.
> > >
> >
> > I don't think we support parallel scan for temporary tables. Can you please
> > try once both of these operations without Insert being involved? If you
> > are able to produce a parallel plan without Insert then we can see why it
> > is not supported with Insert.
>
> Sorry, may be I did not express it clearly, I actually means the case when 
> insert's target(not in select part) table is temporary.
> And you are right that parallel select is not enabled when temporary table is 
> in select part.
>

I think the Select can be parallel for this case and we should support it.

-- 
With Regards,
Amit Kapila.




RE: Parallel INSERT (INTO ... SELECT ...)

2020-12-22 Thread Hou, Zhijie
Hi

> > I may be wrong, and if I miss sth in previous mails, please give me some
> hints.
> > IMO, serial insertion with underlying parallel SELECT can be
> > considered for foreign table or temporary table, as the insertions only
> happened in the leader process.
> >
> 
> I don't think we support parallel scan for temporary tables. Can you please
> try once both of these operations without Insert being involved? If you
> are able to produce a parallel plan without Insert then we can see why it
> is not supported with Insert.

Sorry, maybe I did not express it clearly; I actually mean the case when the
insert's target table (not in the select part) is temporary.
And you are right that parallel select is not enabled when a temporary table
is in the select part.

I tested the cases where the insert's target table is temporary and where it
is not.
--insert into not temporary table---
postgres=# explain (costs off) insert into notemp select * from test where i < 600;
              QUERY PLAN
--------------------------------------
 Gather
   Workers Planned: 4
   ->  Insert on notemp
         ->  Parallel Seq Scan on test
               Filter: (i < 600)

--insert into temporary table---
postgres=# explain (costs off) insert into temp select * from test where i < 600;
        QUERY PLAN
---------------------------
 Insert on temp
   ->  Seq Scan on test
         Filter: (i < 600)

---without insert part---
postgres=# explain (costs off) select * from test where i < 600;
           QUERY PLAN
---------------------------------
 Gather
   Workers Planned: 4
   ->  Parallel Seq Scan on test
         Filter: (i < 600)

Best regards,
houzj




Re: Confused about stream replication protocol documentation

2020-12-22 Thread Li Japin

On Dec 22, 2020, at 11:13 PM, Fujii Masao wrote:

‘B’ means a backend and ‘F’ means a frontend. Maybe as [1] does, we should
add the note like "Each is marked to indicate that it can be sent by
a frontend (F) and a backend (B)" into the description about each message
format for START_REPLICATION.

[1]
https://www.postgresql.org/docs/devel/protocol-message-formats.html

Thanks for your clarification.  Maybe we should move the "protocol message
formats" section before "stream replication protocol", or reference it in
"stream replication protocol".

--
Best regards
Japin Li



Re: Parallel INSERT (INTO ... SELECT ...)

2020-12-22 Thread Amit Kapila
On Wed, Dec 23, 2020 at 7:15 AM Hou, Zhijie  wrote:
>
> Hi
>
> I have an issue about the parallel-safety checks.
>
> If target table is foreign table or temporary table,
> rel_max_parallel_hazard_for_modify will return PROPARALLEL_UNSAFE,
> which not only disable parallel insert but also disable underlying parallel 
> SELECT.
>
> +create temporary table temp_names (like names);
> +explain (costs off) insert into temp_names select * from names;
> +   QUERY PLAN
> +-
> + Insert on temp_names
> +   ->  Seq Scan on names
> +(2 rows)
>
> I may be wrong, and if I miss sth in previous mails, please give me some 
> hints.
> IMO, serial insertion with underlying parallel SELECT can be considered for 
> foreign table or temporary table,
> as the insertions only happened in the leader process.
>

I don't think we support parallel scan for temporary tables. Can you
please try once both of these operations without Insert being
involved? If you are able to produce a parallel plan without Insert
then we can see why it is not supported with Insert.

-- 
With Regards,
Amit Kapila.




RE: Parallel INSERT (INTO ... SELECT ...)

2020-12-22 Thread Hou, Zhijie
Hi

I have an issue about the parallel-safety checks.

If the target table is a foreign table or temporary table,
rel_max_parallel_hazard_for_modify will return PROPARALLEL_UNSAFE,
which not only disables parallel insert but also disables the underlying
parallel SELECT.

+create temporary table temp_names (like names);
+explain (costs off) insert into temp_names select * from names;
+   QUERY PLAN
+-
+ Insert on temp_names
+   ->  Seq Scan on names
+(2 rows)

I may be wrong, and if I missed something in previous mails, please give me
some hints.
IMO, serial insertion with an underlying parallel SELECT can be considered
for a foreign table or temporary table, as the insertions only happen in the
leader process.

Are there any special considerations for this case ?

Best regards,
houzj




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread Amit Kapila
On Wed, Dec 23, 2020 at 6:30 AM k.jami...@fujitsu.com
 wrote:
>
> On Tuesday, December 22, 2020 6:25 PM, Amit Kapila wrote:
>
> > Apart from tests, do let me know if you are happy with the changes in the
> > patch? Next, I'll look into DropRelFileNodesAllBuffers() optimization patch.
>
> Thank you, Amit.
> That looks more neat, combining the previous patches 0002-0003, so I am +1
> with the changes because of the clearer explanations for the threshold and
> optimization path in DropRelFileNodeBuffers. Thanks for cleaning my patch 
> sets.
> Hope we don't forget the 0001 patch's assertion in smgrextend() to ensure 
> that we
> do it safely too and that we are not InRecovery.
>

I think the 0001 is mostly for test purposes but we will see once the
main patches are ready.

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread k.jami...@fujitsu.com
On Tuesday, December 22, 2020 6:25 PM, Amit Kapila wrote: 
> Attached, please find the updated patch with the following modifications, (a)
> updated comments at various places especially to tell why this is a safe
> optimization, (b) merged the patch for extending the smgrnblocks and
> vacuum optimization patch, (c) made minor cosmetic changes and ran
> pgindent, and (d) updated commit message. BTW, this optimization will help
> not only vacuum but also truncate when it is done in the same transaction in
> which the relation is created.  I would like to see certain tests to ensure 
> that
> the value we choose for BUF_DROP_FULL_SCAN_THRESHOLD is correct. I
> see that some testing has been done earlier [1] for this threshold but I am 
> not
> still able to conclude. The criteria to find the right threshold should be 
> what is
> the maximum size of relation to be truncated above which we don't get
> benefit with this optimization.
> 
> One idea could be to remove "nBlocksToInvalidate <
> BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached &&
> nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it
> always use optimized path for the tests. Then use the relation size as
> NBuffers/128, NBuffers/256, NBuffers/512 for different values of shared
> buffers as 128MB, 1GB, 20GB, 100GB.

Alright. I will also repeat the tests with the different threshold settings, 
and thank you for the tip.

> Apart from tests, do let me know if you are happy with the changes in the
> patch? Next, I'll look into DropRelFileNodesAllBuffers() optimization patch.

Thank you, Amit.
That looks neater, combining the previous patches 0002-0003, so I am +1
with the changes because of the clearer explanations for the threshold and
optimization path in DropRelFileNodeBuffers. Thanks for cleaning up my patch
sets.
I hope we don't forget the 0001 patch's assertion in smgrextend() to ensure
that we do it safely too and that we are not InRecovery.

> [1] -
> https://www.postgresql.org/message-id/OSBPR01MB234176B1829AECFE9
> FDDFCC2EFE90%40OSBPR01MB2341.jpnprd01.prod.outlook.com

Regards,
Kirk Jamison


Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-22 Thread Zhihong Yu
Justin:
For reindex_index() :

+   if (options->tablespaceOid == MyDatabaseTableSpace)
+   options->tablespaceOid = InvalidOid;
...
+   if (set_tablespace &&
+   (options->tablespaceOid != oldTablespaceOid ||
+   (options->tablespaceOid == MyDatabaseTableSpace &&
OidIsValid(oldTablespaceOid

I wonder why the options->tablespaceOid == MyDatabaseTableSpace clause
appears again in the second if statement.
Since the first if statement would assign InvalidOid
to options->tablespaceOid when the first if condition is satisfied.

Cheers


On Tue, Dec 22, 2020 at 1:15 PM Justin Pryzby  wrote:

> On Tue, Dec 22, 2020 at 06:57:41PM +0900, Michael Paquier wrote:
> > On Tue, Dec 22, 2020 at 02:32:05AM -0600, Justin Pryzby wrote:
> > > Also, this one is going to be subsumed by ExecReindex(), so the palloc
> will go
> > > away (otherwise I would ask to pass it in from the caller):
> >
> > Yeah, maybe.  Still you need to be very careful if you have any
> > allocated variables like a tablespace or a path which requires to be
> > in the private context used by ReindexMultipleInternal() or even
> > ReindexRelationConcurrently(), so I am not sure you can avoid that
> > completely.  For now, we could choose the option to still use a
> > palloc(), and then save the options in the private contexts.  Forgot
> > that in the previous version actually.
>
> I can't see why this still uses memset instead of structure assignment.
>
> Now, I really think utility.c ought to pass in a pointer to a local
> ReindexOptions variable to avoid all the memory context, which is
> unnecessary
> and prone to error.
>
> ExecReindex() will set options.tablesapceOid, not a pointer.  Like this.
>
> I also changed the callback to be a ReindexOptions and not a pointer.
>
> --
> Justin
>


New IndexAM API controlling index vacuum strategies

2020-12-22 Thread Masahiko Sawada
Hi all,

I've started this separate thread from [1] for discussing the general
API design of index vacuum.

Summary:

* Call ambulkdelete and amvacuumcleanup even when INDEX_CLEANUP is
false, and leave it to the index AM whether or not skip them.
* Add a new index AM API amvacuumstrategy(), asking the index AM the
strategy before calling to ambulkdelete.
* Whether or not remove garbage tuples from heap depends on multiple
factors including INDEX_CLEANUP option and the answers of
amvacuumstrategy() for each index AM.

The first point is to fix the inappropriate behavior discussed on the thread[1].

The second and third points are to introduce a general framework for
future extensibility. User-visible behavior is not changed by this
change.

The new index AM API, amvacuumstrategy(), is called before bulkdelete()
for each index and asks the index AM for its bulk-deletion strategy.
On this API, lazy vacuum asks, "Hey index X, I collected garbage heap
tuples during heap scanning, how urgent is vacuuming for you?", and
the index answers either "it's urgent" when it wants to do
bulk-deletion or "it's not urgent, I can skip it". The point of this
proposal is to isolate heap vacuum and index vacuum for each index so
that we can employ different strategies for each index. Lazy vacuum
can decide whether or not to do heap clean based on the answers from
the indexes.

By default, if all indexes answer 'yes' (meaning they will do
bulkdelete()), lazy vacuum can do the heap clean. On the other hand, if
even one index answers 'no' (meaning it will not do bulkdelete()),
lazy vacuum doesn't do the heap clean. Lazy vacuum would also be able to
require indexes to do bulkdelete() for some reason, such as the user
specifying the INDEX_CLEANUP option. It’s something like saying "Hey
index X, you answered not to do bulkdelete(), but since heap clean is
necessary for me please don't skip bulkdelete()".

Currently, if the INDEX_CLEANUP option is not set (i.e.
VACOPT_TERNARY_DEFAULT in the code), it's treated as true and we will do
the heap clean. But with this patch we use the default as a neutral state
('smart' mode). This neutral state could resolve to "on" or "off" depending
on several factors, including the answers of amvacuumstrategy(), the
table status, and the user's request. In this context, specifying
INDEX_CLEANUP would mean forcing the neutral state to "on" or "off" by
user request. The table status that could influence the decision
could concretely be, for instance:

* Removing LP_DEAD accumulation due to skipping bulkdelete() for a long time.
* Making pages all-visible for index-only scan.

Also there are potential enhancements using this API:

* If the bottom-up index deletion feature[2] is introduced, individual
indexes on a table could be in different situations in terms of dead tuple
accumulation; some indexes can delete their garbage index
tuples without bulkdelete(). A problem would then be that doing
bulkdelete() for such indexes is not efficient. This problem is
solved by this proposal because we can do bulkdelete() for only a subset of
the indexes on the table.

* If the retail index deletion feature[3] is introduced, we can make the
return value of amvacuumstrategy() a ternary value: "do_bulkdelete",
"do_indexscandelete", and "no".

* We probably can introduce a threshold on the number of dead tuples
to control whether or not to do index tuple bulk-deletion (like a
bulkdelete() version of vacuum_cleanup_index_scale_factor). In the
case where the amount of dead tuples is slightly larger than
maintenance_work_mem, the second call to bulkdelete() will be made
with only a small number of dead tuples, which is inefficient. This
problem is also solved by this proposal by allowing a subset of
indexes to skip bulkdelete() if the number of dead tuples doesn't
exceed the threshold.

I’ve attached the PoC patch for the above idea. By default, lazy vacuum
chooses the bulkdelete strategy based on the answers of
amvacuumstrategy(), so it can be either true or false (although it’s
always true in the current patch). But for amvacuumcleanup() there is
no neutral state; lazy vacuum treats the default as true.
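
To make the shape of the API a bit more concrete, here is a minimal sketch
of what such a callback could look like for an index AM. The enum, the
names, and the trivial heuristic are only illustrative -- they are not
taken from the attached PoC patch, which may use a different signature:

#include "access/genam.h"		/* for IndexVacuumInfo */

typedef enum IndexVacuumStrategy
{
	INDEX_VACUUM_STRATEGY_NONE,			/* "not urgent, skip bulkdelete()" */
	INDEX_VACUUM_STRATEGY_BULKDELETE	/* "urgent, please call bulkdelete()" */
} IndexVacuumStrategy;

/* hypothetical callback for some index AM, registered in IndexAmRoutine */
static IndexVacuumStrategy
foo_vacuumstrategy(IndexVacuumInfo *info)
{
	/*
	 * An AM could consult whatever it finds relevant here: its own size,
	 * statistics about accumulated garbage, whether it supports retail or
	 * bottom-up deletion, and so on.  This sketch always asks for bulk
	 * deletion, which matches what the current PoC effectively does.
	 */
	return INDEX_VACUUM_STRATEGY_BULKDELETE;
}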

Comment and feedback are very welcome.

Regards,

[1] 
https://www.postgresql.org/message-id/20200415233848.saqp72pcjv2y6ryi%40alap3.anarazel.de
[2] 
https://www.postgresql.org/message-id/CAH2-Wzm%2BmaE3apHB8NOtmM%3Dp-DO65j2V5GzAWCOEEuy3JZgb2g%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/425db134-8bba-005c-b59d-56e50de3b41e%40postgrespro.ru

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/


poc_vacuumstrategy.patch
Description: Binary data


Re: Parallel bitmap index scan

2020-12-22 Thread Tomas Vondra
On 11/11/20 8:52 PM, Tomas Vondra wrote:
> Hi,
> 
> I took a look at this today, doing a bit of stress-testing, and I can
> get it to crash because of segfaults in pagetable_create (not sure if
> the issue is there, it might be just a symptom of an issue elsewhere).
> 
> Attached is a shell script I use to run the stress test - it's using
> 'test' database, generates tables of different size and then runs
> queries with various parameter combinations. It takes a while to trigger
> the crash, so it might depend on timing or something like that.
> 
> I've also attached two examples of backtraces. I've also seen infinite
> loop in pagetable_create, but the crashes are much more common.
> 

Hi Dilip,

Do you plan to work on this for PG14? I haven't noticed any response in
this thread, dealing with the crashes I reported a while ago. Also, it
doesn't seem to be added to any of the commitfests.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Proposed patch for key managment

2020-12-22 Thread Bruce Momjian
On Tue, Dec 22, 2020 at 04:13:06PM -0500, Bruce Momjian wrote:
> On Tue, Dec 22, 2020 at 08:15:27PM +, Alastair Turner wrote:
> > Hi Bruce
> > 
> > In ckey_passphrase.sh.sample
> > 
> > +
> > +echo "$PASS" | sha256sum | cut -d' ' -f1
> > +
> > 
> > Under the threat model discussed, a copy of the keyfile could be
> > attacked offline. So getting from passphrase to DEKs should be as
> > resource intensive as possible to slow down brute-force attempts.
> > Instead of just a SHA hash, this should be at least a PBKDF2 (PKCS#5)
> 
> I am satisfied with the security of SHA256.

Sorry, I should have said I am happy with a SHA512 HMAC in a 256-bit
keyspace.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: libpq compression

2020-12-22 Thread Tom Lane
I wrote:
> Robert Haas  writes:
>> But there is a privilege boundary between the sender and the receiver.
>> What's alleged here is that the sender can do a thing which causes the
>> receiver to burn through tons of memory. It doesn't help anything to
>> say, well, the sender ought to use a window size of N or less. What if
>> they don't?

> The receiver rejects the data as though it were corrupt.

(Having said that, I don't know whether it's possible for the user of
libzstd to specify such behavior.  But if it isn't, that's a CVE-worthy
problem in libzstd.)

regards, tom lane




Preventing hangups in bgworker start/stop during DB shutdown

2020-12-22 Thread Tom Lane
Here's an attempt at closing the race condition discussed in [1]
(and in some earlier threads, though I'm too lazy to find them).

The core problem is that the bgworker management APIs were designed
without any thought for exception conditions, notably "we're not
gonna launch any more workers because we're shutting down the database".
A process waiting for a worker in WaitForBackgroundWorkerStartup or
WaitForBackgroundWorkerShutdown will wait forever, so that the database
fails to shut down without manual intervention.

I'd supposed that we would need some incompatible changes in those APIs
in order to fix this, but after further study it seems like we could
hack things by just acting as though a request that won't be serviced
has already run to completion.  I'm not terribly satisfied with that
as a long-term solution --- it seems to me that callers should be told
that there was a failure.  But this looks to be enough to solve the
lockup condition for existing callers, and it seems like it'd be okay
to backpatch.

Thoughts?

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/16785-c0207d8c67fb5f25%40postgresql.org

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 7ef6259eb5..b3ab8b0fa3 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -231,13 +231,15 @@ FindRegisteredWorkerBySlotNumber(int slotno)
 }
 
 /*
- * Notice changes to shared memory made by other backends.  This code
- * runs in the postmaster, so we must be very careful not to assume that
- * shared memory contents are sane.  Otherwise, a rogue backend could take
- * out the postmaster.
+ * Notice changes to shared memory made by other backends.
+ * Accept new dynamic worker requests only if allow_new_workers is true.
+ *
+ * This code runs in the postmaster, so we must be very careful not to assume
+ * that shared memory contents are sane.  Otherwise, a rogue backend could
+ * take out the postmaster.
  */
 void
-BackgroundWorkerStateChange(void)
+BackgroundWorkerStateChange(bool allow_new_workers)
 {
 	int			slotno;
 
@@ -297,6 +299,15 @@ BackgroundWorkerStateChange(void)
 			continue;
 		}
 
+		/*
+		 * If this is a dynamic worker request, and we aren't allowing new
+		 * dynamic workers, then immediately mark it for termination; the next
+		 * stanza will take care of cleaning it up.
+		 */
+		if (slot->worker.bgw_restart_time == BGW_NEVER_RESTART &&
+			!allow_new_workers)
+			slot->terminate = true;
+
 		/*
 		 * If the worker is marked for termination, we don't need to add it to
 		 * the registered workers list; we can just free the slot. However, if
@@ -503,12 +514,54 @@ BackgroundWorkerStopNotifications(pid_t pid)
 	}
 }
 
+/*
+ * Cancel any not-yet-started worker requests that have BGW_NEVER_RESTART.
+ *
+ * This is called during a normal ("smart" or "fast") database shutdown.
+ * After this point, no new background workers will be started, so any
+ * dynamic worker requests should be killed off, allowing anything that
+ * might be waiting for them to clean up and exit.
+ *
+ * This function should only be called from the postmaster.
+ */
+void
+ForgetUnstartedBackgroundWorkers(void)
+{
+	slist_mutable_iter iter;
+
+	slist_foreach_modify(iter, &BackgroundWorkerList)
+	{
+		RegisteredBgWorker *rw;
+		BackgroundWorkerSlot *slot;
+
+		rw = slist_container(RegisteredBgWorker, rw_lnode, iter.cur);
+		Assert(rw->rw_shmem_slot < max_worker_processes);
+		slot = &BackgroundWorkerData->slot[rw->rw_shmem_slot];
+
+		/* If it's not yet started, and it's a dynamic worker ... */
+		if (slot->pid == InvalidPid &&
+			rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART)
+		{
+			/* ... then zap it, and notify anything that was waiting */
+			int			notify_pid = rw->rw_worker.bgw_notify_pid;
+
+			ForgetBackgroundWorker(&iter);
+			if (notify_pid != 0)
+kill(notify_pid, SIGUSR1);
+		}
+	}
+}
+
 /*
  * Reset background worker crash state.
  *
  * We assume that, after a crash-and-restart cycle, background workers without
  * the never-restart flag should be restarted immediately, instead of waiting
- * for bgw_restart_time to elapse.
+ * for bgw_restart_time to elapse.  On the other hand, workers with that flag
+ * should be forgotten, because they are dynamic requests from processes that
+ * no longer exist.
+ *
+ * This function should only be called from the postmaster.
  */
 void
 ResetBackgroundWorkerCrashTimes(void)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5d09822c81..fa35bf4369 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -3796,6 +3796,13 @@ PostmasterStateMachine(void)
 	 */
 	if (pmState == PM_STOP_BACKENDS)
 	{
+		/*
+		 * Forget any pending requests for dynamic background workers, since
+		 * we're no longer willing to launch any new workers.  (If additional
+		 * requests arrive, BackgroundWorkerStateChange will reject them.)
+		 */
+		

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-22 Thread Justin Pryzby
On Tue, Dec 22, 2020 at 06:57:41PM +0900, Michael Paquier wrote:
> On Tue, Dec 22, 2020 at 02:32:05AM -0600, Justin Pryzby wrote:
> > Also, this one is going to be subsumed by ExecReindex(), so the palloc will 
> > go
> > away (otherwise I would ask to pass it in from the caller):
> 
> Yeah, maybe.  Still you need to be very careful if you have any
> allocated variables like a tablespace or a path which requires to be
> in the private context used by ReindexMultipleInternal() or even
> ReindexRelationConcurrently(), so I am not sure you can avoid that
> completely.  For now, we could choose the option to still use a
> palloc(), and then save the options in the private contexts.  Forgot
> that in the previous version actually.

I can't see why this still uses memset instead of structure assignment.

Now, I really think utility.c ought to pass in a pointer to a local
ReindexOptions variable to avoid all the memory context, which is unnecessary
and prone to error.

ExecReindex() will set options.tablesapceOid, not a pointer.  Like this.

I also changed the callback to be a ReindexOptions and not a pointer.

-- 
Justin
>From f12723a8e4ad550c009935fa5397d1d4a968c7bf Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 22 Dec 2020 18:57:41 +0900
Subject: [PATCH 1/3] refactor-utility-opts-michael-2.patch - hacked on by
 Justin

---
 src/backend/catalog/index.c  | 19 ---
 src/backend/commands/cluster.c   | 19 ---
 src/backend/commands/indexcmds.c | 97 +---
 src/backend/commands/tablecmds.c |  3 +-
 src/backend/commands/vacuum.c|  6 +-
 src/backend/tcop/utility.c   | 12 ++--
 src/include/catalog/index.h  | 19 ---
 src/include/commands/cluster.h   | 13 +++--
 src/include/commands/defrem.h| 17 --
 src/include/commands/vacuum.h| 20 +++
 src/tools/pgindent/typedefs.list |  2 +
 11 files changed, 124 insertions(+), 103 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 731610c701..d776c661d7 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3594,7 +3594,7 @@ IndexGetRelation(Oid indexId, bool missing_ok)
  */
 void
 reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
-			  int options)
+			  ReindexOptions *options)
 {
 	Relation	iRel,
 heapRelation;
@@ -3602,7 +3602,7 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 	IndexInfo  *indexInfo;
 	volatile bool skipped_constraint = false;
 	PGRUsage	ru0;
-	bool		progress = (options & REINDEXOPT_REPORT_PROGRESS) != 0;
+	bool		progress = ((options->flags & REINDEXOPT_REPORT_PROGRESS) != 0);
 
	pg_rusage_init(&ru0);
 
@@ -3611,12 +3611,12 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 	 * we only need to be sure no schema or data changes are going on.
 	 */
 	heapId = IndexGetRelation(indexId,
-			  (options & REINDEXOPT_MISSING_OK) != 0);
+			  (options->flags & REINDEXOPT_MISSING_OK) != 0);
 	/* if relation is missing, leave */
 	if (!OidIsValid(heapId))
 		return;
 
-	if ((options & REINDEXOPT_MISSING_OK) != 0)
+	if ((options->flags & REINDEXOPT_MISSING_OK) != 0)
 		heapRelation = try_table_open(heapId, ShareLock);
 	else
 		heapRelation = table_open(heapId, ShareLock);
@@ -3792,7 +3792,7 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 	}
 
 	/* Log what we did */
-	if (options & REINDEXOPT_VERBOSE)
+	if ((options->flags & REINDEXOPT_VERBOSE) != 0)
 		ereport(INFO,
 (errmsg("index \"%s\" was reindexed",
 		get_rel_name(indexId)),
@@ -3846,7 +3846,7 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
  * index rebuild.
  */
 bool
-reindex_relation(Oid relid, int flags, int options)
+reindex_relation(Oid relid, int flags, ReindexOptions *options)
 {
 	Relation	rel;
 	Oid			toast_relid;
@@ -3861,7 +3861,7 @@ reindex_relation(Oid relid, int flags, int options)
 	 * to prevent schema and data changes in it.  The lock level used here
 	 * should match ReindexTable().
 	 */
-	if ((options & REINDEXOPT_MISSING_OK) != 0)
+	if ((options->flags & REINDEXOPT_MISSING_OK) != 0)
 		rel = try_table_open(relid, ShareLock);
 	else
 		rel = table_open(relid, ShareLock);
@@ -3965,8 +3965,9 @@ reindex_relation(Oid relid, int flags, int options)
 		 * Note that this should fail if the toast relation is missing, so
 		 * reset REINDEXOPT_MISSING_OK.
 		 */
-		result |= reindex_relation(toast_relid, flags,
-   options & ~(REINDEXOPT_MISSING_OK));
+		ReindexOptions newoptions = *options;
+		newoptions.flags &= ~(REINDEXOPT_MISSING_OK);
+		result |= reindex_relation(toast_relid, flags, &newoptions);
 	}
 
 	return result;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index fd5a6eec86..32f5ae848f 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -103,7 +103,7 @@ void
 cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
 {
 	ListCell 

Re: Proposed patch for key managment

2020-12-22 Thread Bruce Momjian
On Tue, Dec 22, 2020 at 08:15:27PM +, Alastair Turner wrote:
> Hi Bruce
> 
> In ckey_passphrase.sh.sample
> 
> +
> +echo "$PASS" | sha256sum | cut -d' ' -f1
> +
> 
> Under the threat model discussed, a copy of the keyfile could be
> attacked offline. So getting from passphrase to DEKs should be as
> resource intensive as possible to slow down brute-force attempts.
> Instead of just a SHA hash, this should be at least a PBKDF2 (PKCS#5)

I am satisfied with the security of SHA256.

> On Tue, 22 Dec 2020 at 15:40, Bruce Momjian  wrote:
> >
> > Here is an updated patch.  Are people happy with the Makefile, its
> > location in the source tree, and the install directory name?  I used the
> > directory name 'auth_commands' because I thought 'auth' was too easily
> > misinterpreted. I put the scripts in /src/backend/utils/auth_commands.
> >
> 
> What's implemented in these patches is an internal keystore, wrapped
> with a key derived from a passphrase. I'd think that the scripts
> directory should reflect what they interact with, so
> 'keystore_commands' or 'local_keystore_command' sounds more specific
> and therefore better than 'auth_commands'.

The point is that some commands are used for keystore and some for SSL
certificate passphrase entry, hence "auth".

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: [PATCH] Automatic HASH and LIST partition creation

2020-12-22 Thread Fabien COELHO




BTW could you tell me a couple of words about the pros and cons of C-code
syntax parsing compared to parsing using gram.y trees?


I'd rather use an automatic tool (lexer/parser) than do it by hand if I can.
If you want a really nice syntax with clever tricks, then you may need to
switch to manual parsing, but pg/sql is not in that class.


I think both are possible, but my predisposition was that we'd better use
the latter if possible.


I agree.

--
Fabien.




Re: libpq compression

2020-12-22 Thread Tom Lane
Robert Haas  writes:
> On Tue, Dec 22, 2020 at 2:33 PM Tom Lane  wrote:
>> I'd assume that there's a direct correlation between the compression level
>> setting and the window size; but I've not studied the libzstd docs in
>> enough detail to know what it is.

> But there is a privilege boundary between the sender and the receiver.
> What's alleged here is that the sender can do a thing which causes the
> receiver to burn through tons of memory. It doesn't help anything to
> say, well, the sender ought to use a window size of N or less. What if
> they don't?

The receiver rejects the data as though it were corrupt.

regards, tom lane




Re: libpq compression

2020-12-22 Thread Robert Haas
On Tue, Dec 22, 2020 at 2:33 PM Tom Lane  wrote:
> I'd assume that there's a direct correlation between the compression level
> setting and the window size; but I've not studied the libzstd docs in
> enough detail to know what it is.

But there is a privilege boundary between the sender and the receiver.
What's alleged here is that the sender can do a thing which causes the
receiver to burn through tons of memory. It doesn't help anything to
say, well, the sender ought to use a window size of N or less. What if
they don't?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Proposed patch for key managment

2020-12-22 Thread Alastair Turner
Hi Bruce

In ckey_passphrase.sh.sample

+
+echo "$PASS" | sha256sum | cut -d' ' -f1
+

Under the threat model discussed, a copy of the keyfile could be
attacked offline. So getting from passphrase to DEKs should be as
resource intensive as possible to slow down brute-force attempts.
Instead of just a SHA hash, this should be at least a PBKDF2 (PKCS#5)
operation or preferably scrypt (which is memory intensive as well as
computationally intensive). While OpenSSL provides these key
derivation options for it's encoding operations, it does not offer a
command line option for key derivation using them. What's your view on
including third party utilities like nettlepbkdf in the sample or
providing this utility as a compiled program?
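
For illustration, a minimal C sketch of deriving a 32-byte key with
OpenSSL's PKCS5_PBKDF2_HMAC(). The passphrase, salt, and iteration count
are placeholders, and this is not meant as a drop-in replacement for the
sample script -- it just shows the kind of derivation I have in mind:

#include <stdio.h>
#include <string.h>
#include <openssl/evp.h>

int
main(void)
{
	const char *pass = "cluster passphrase";			/* placeholder */
	const unsigned char salt[16] = "fixed-demo-salt";	/* store alongside the keyfile */
	unsigned char key[32];
	int			i;

	if (PKCS5_PBKDF2_HMAC(pass, (int) strlen(pass),
						  salt, (int) sizeof(salt),
						  200000,			/* iteration count: tune upward */
						  EVP_sha256(),
						  (int) sizeof(key), key) != 1)
		return 1;

	for (i = 0; i < (int) sizeof(key); i++)
		printf("%02x", key[i]);
	printf("\n");
	return 0;
}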

In the same theme - in ossl_cipher_ctx_create and the functions using
that context should rather be using the key-wrapping variants of the
cipher. As well as lower throughput, with the resulting effects on
brute force attempts, the key wrap variants provide more robust
ciphertext output
"
KW, KWP, and TKW are each robust in the sense that each bit of output
can be expected to depend in a nontrivial fashion on each bit of
input, even when the length of the input data is greater than one
block. This property is achieved at the cost of a considerably lower
throughput rate, compared to other authenticated-encryption modes, but
the tradeoff may be appropriate for some key management applications.
For example, a robust method may be desired when the length of the
keys to be protected is greater than the block size of the underlying
block cipher, or when the value of the protected data is very high.
" - NIST SP 800-38F

and aim to manage the risk of short or predictable IVs ([1] final para
of page 1)

On Tue, 22 Dec 2020 at 15:40, Bruce Momjian  wrote:
>
> Here is an updated patch.  Are people happy with the Makefile, its
> location in the source tree, and the install directory name?  I used the
> directory name 'auth_commands' because I thought 'auth' was too easily
> misinterpreted. I put the scripts in /src/backend/utils/auth_commands.
>

What's implemented in these patches is an internal keystore, wrapped
with a key derived from a passphrase. I'd think that the scripts
directory should reflect what they interact with, so
'keystore_commands' or 'local_keystore_command' sounds more specific
and therefore better than 'auth_commands'.

Regards
Alastair

[1] https://web.cs.ucdavis.edu/~rogaway/papers/keywrap.pdf




Re: Perform COPY FROM encoding conversions in larger chunks

2020-12-22 Thread John Naylor
On Wed, Dec 16, 2020 at 8:18 AM Heikki Linnakangas  wrote:
>
> Currently, COPY FROM parses the input one line at a time. Each line is
> converted to the database encoding separately, or if the file encoding
> matches the database encoding, we just check that the input is valid for
> the encoding. It would be more efficient to do the encoding
> conversion/verification in larger chunks. At least potentially; the
> current conversion/verification implementations work one byte a time so
> it doesn't matter too much, but there are faster algorithms out there
> that use SIMD instructions or lookup tables that benefit from larger
inputs.

Hi Heikki,

This is great news. I've seen examples of such algorithms and that'd be
nice to have. I haven't studied the patch in detail, but it looks fine on
the whole.

In 0004, it seems you have some doubts about upgrade compatibility. Is that
because user-defined conversions would no longer have the right signature?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: libpq compression

2020-12-22 Thread Tom Lane
Tomas Vondra  writes:
> On 12/22/20 8:03 PM, Tom Lane wrote:
>> The link Ken pointed at suggests that restricting the window size to
>> 8MB is a common compromise.  It's not clear to me what that does to
>> the achievable compression ratio.  Even 8MB could be an annoying cost
>> if it's being paid per-process, on both the server and client sides.

> Possibly, but my understanding is that's merely a recommendation for the 
> decoder library (e.g. libzstd), and it's not clear to me if/how that 
> relates to the compression level or how to influence it.
>  From the results shared by Daniil, the per-client overhead seems way 
> higher than 8MB, so either libzstd does not respect this recommendation 
> or maybe there's something else going on.

I'd assume that there's a direct correlation between the compression level
setting and the window size; but I've not studied the libzstd docs in
enough detail to know what it is.

regards, tom lane




Re: libpq compression

2020-12-22 Thread Tomas Vondra




On 12/22/20 8:03 PM, Tom Lane wrote:

Tomas Vondra  writes:

I don't see aby benchmark results in this thread, allowing me to make
that conclusion, and I find it hard to believe that 200MB/client is a
sensible trade-off.



It assumes you have that much memory, and it may allow easy DoS attack
(although maybe it's not worse than e.g. generating a lot of I/O or
running expensive function). Maybe allowing limiting the compression
level / decompression buffer size in postgresql.conf would be enough. Or
maybe allow disabling such compression algorithms altogether.


The link Ken pointed at suggests that restricting the window size to
8MB is a common compromise.  It's not clear to me what that does to
the achievable compression ratio.  Even 8MB could be an annoying cost
if it's being paid per-process, on both the server and client sides.



Possibly, but my understanding is that's merely a recommendation for the 
decoder library (e.g. libzstd), and it's not clear to me if/how that 
relates to the compression level or how to influence it.


From the results shared by Daniil, the per-client overhead seems way 
higher than 8MB, so either libzstd does not respect this recommendation 
or maybe there's something else going on.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH] Generic type subscripting

2020-12-22 Thread Tom Lane
Pavel Stehule  writes:
> But maybe we are trying to design something that has already been designed.
> Is there some info about index specification in SQL/JSON?

We do have precedent for this, it's the rules about resolving argument
types for overloaded functions.  But the conclusion that that precedent
leads to is that we should check whether the subscript expression can
be *implicitly* coerced to either integer or text, and fail if neither
coercion or both coercions succeed.  I'd be okay with that from a system
design standpoint, but it's hard to say without trying it whether it
will work out nicely from a usability standpoint.  In a quick trial
it seems it might be okay:

regression=# create function mysub(int) returns text language sql
regression-# as $$select 'int'$$;
CREATE FUNCTION
regression=# create function mysub(text) returns text language sql
as $$select 'text'$$;
CREATE FUNCTION
regression=# select mysub(42);
 mysub 
---
 int
(1 row)

regression=# select mysub('foo');
 mysub 
---
 text
(1 row)

But there are definitely cases that will fail when an assignment
coercion would have succeeded, eg

regression=# select mysub(42::bigint);
ERROR:  function mysub(bigint) does not exist

Maybe that's okay.  (As I said earlier, we can't use assignment
coercion when there's two possible coercion targets, because it'd
be too likely that they both succeed.)

regards, tom lane




Re: [HACKERS] [PATCH] Generic type subscripting

2020-12-22 Thread Pavel Stehule
On Tue, Dec 22, 2020 at 18:35, Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > On Tue, Dec 22, 2020 at 11:57:13AM -0500, Tom Lane wrote:
> > Dmitry Dolgov <9erthali...@gmail.com> writes:
> > > On Tue, Dec 22, 2020 at 12:19:26PM +0100, Pavel Stehule wrote:
> > >> I expect behave like
> > >>
> > >> update x set test[1] = 10; --> "[10]";
> > >> update x set test['1'] = 10; --> "{"1": 10}"
> >
> > > Yes, I also was thinking about this because such behaviour is more
> > > natural.
> >
> > I continue to feel that this is a fundamentally bad idea that will
> > lead to much more pain than benefit.  People are going to want to
> > know why "test[1.0]" doesn't act like "test[1]".  They are going
> > to complain because "test[$1]" acts so much differently depending
> > on whether they assigned a type to the $1 parameter or not.  And
> > they are going to bitch because dumping and reloading a rule causes
> > it to do something different than it did before --- or at least we'd
> > be at horrid risk of that; only if we hide the injected cast-to-text
> > doesd the dumped rule look the way it needs to.  Even then, the whole
> > thing is critically dependent on the fact that integer-type constants
> > are written and displayed differently from other constants, so it
> > won't scale to any other type that someone might want to treat specially.
> > So you're just leading datatype designers down a garden path that will be
> > a dead end for many of them.
> >
> > IMO this isn't actually any saner than your previous iterations
> > on the idea.
>
> Ok. While I don't have any preferences here, we can disregard the last
> posted patch (extended-with-subscript-type) and consider only
> v38-0001-Subscripting-for-jsonb version.
>

There are two parts - fetching and setting.

Probably there can be agreement on the fetching part: if the index text is a
JSONPath expression, use jsonb_path_query, else use jsonb_extract_path.
Setting should work the same way in the inverse direction.

I like the behavior of jsonb_extract_path - it is intuitive and
we should use it.


Re: libpq compression

2020-12-22 Thread Tom Lane
Tomas Vondra  writes:
> I don't see aby benchmark results in this thread, allowing me to make 
> that conclusion, and I find it hard to believe that 200MB/client is a 
> sensible trade-off.

> It assumes you have that much memory, and it may allow easy DoS attack 
> (although maybe it's not worse than e.g. generating a lot of I/O or 
> running expensive function). Maybe allowing limiting the compression 
> level / decompression buffer size in postgresql.conf would be enough. Or 
> maybe allow disabling such compression algorithms altogether.

The link Ken pointed at suggests that restricting the window size to
8MB is a common compromise.  It's not clear to me what that does to
the achievable compression ratio.  Even 8MB could be an annoying cost
if it's being paid per-process, on both the server and client sides.

regards, tom lane




Re: libpq compression

2020-12-22 Thread Tomas Vondra




On 12/22/20 7:31 PM, Andrey Borodin wrote:




On 22 Dec 2020, at 23:15, Tomas Vondra wrote:



On 12/22/20 6:56 PM, Robert Haas wrote:

On Tue, Dec 22, 2020 at 6:24 AM Daniil Zakhlystov
 wrote:

When using bidirectional compression, Postgres resource usage correlates with 
the selected compression level. For example, here is the Postgresql application 
memory usage:

No compression - 1.2 GiB

ZSTD
zstd:1 - 1.4 GiB
zstd:7 - 4.0 GiB
zstd:13 - 17.7 GiB
zstd:19 - 56.3 GiB
zstd:20 - 109.8 GiB - did not succeed
zstd:21, zstd:22  > 140 GiB
Postgres process crashes (out of memory)

Good grief. So, suppose we add compression and support zstd. Then, can an
unprivileged user capable of connecting to the database negotiate
for zstd level 1 and then choose to actually send data compressed at
zstd level 22, crashing the server if it doesn't have a crapton of
memory? Honestly, I wouldn't blame somebody for filing a CVE if we
allowed that sort of thing to happen. I'm not sure what the solution
is, but we can't leave a way for a malicious client to consume 140GB
of memory on the server *per connection*. I assumed decompression
memory was going to measured in kB or MB, not GB. Honestly, even at
say L7, if you've got max_connections=100 and a user who wants to make
trouble, you have a really big problem.
Perhaps I'm being too pessimistic here, but man that's a lot of memory.


Maybe I'm just confused, but my assumption was this means there's a memory leak 
somewhere - that we're not resetting/freeing some piece of memory, or so. Why 
would zstd need so much memory? It seems like a pretty serious disadvantage, so 
how could it become so popular?


AFAIK it's 700 clients. Does not seem like super high price for big 
traffic\latency reduction.



I don't see aby benchmark results in this thread, allowing me to make 
that conclusion, and I find it hard to believe that 200MB/client is a 
sensible trade-off.


It assumes you have that much memory, and it may allow easy DoS attack 
(although maybe it's not worse than e.g. generating a lot of I/O or 
running expensive function). Maybe allowing limiting the compression 
level / decompression buffer size in postgresql.conf would be enough. Or 
maybe allow disabling such compression algorithms altogether.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: libpq compression

2020-12-22 Thread Andrey Borodin



> On 22 Dec 2020, at 23:15, Tomas Vondra wrote:
> 
> 
> 
> On 12/22/20 6:56 PM, Robert Haas wrote:
>> On Tue, Dec 22, 2020 at 6:24 AM Daniil Zakhlystov
>>  wrote:
>>> When using bidirectional compression, Postgres resource usage correlates 
>>> with the selected compression level. For example, here is the Postgresql 
>>> application memory usage:
>>> 
>>> No compression - 1.2 GiB
>>> 
>>> ZSTD
>>> zstd:1 - 1.4 GiB
>>> zstd:7 - 4.0 GiB
>>> zstd:13 - 17.7 GiB
>>> zstd:19 - 56.3 GiB
>>> zstd:20 - 109.8 GiB - did not succeed
>>> zstd:21, zstd:22  > 140 GiB
>>> Postgres process crashes (out of memory)
>> Good grief. So, suppose we add compression and support zstd. Then, can an
>> unprivileged user capable of connecting to the database negotiate
>> for zstd level 1 and then choose to actually send data compressed at
>> zstd level 22, crashing the server if it doesn't have a crapton of
>> memory? Honestly, I wouldn't blame somebody for filing a CVE if we
>> allowed that sort of thing to happen. I'm not sure what the solution
>> is, but we can't leave a way for a malicious client to consume 140GB
>> of memory on the server *per connection*. I assumed decompression
>> memory was going to measured in kB or MB, not GB. Honestly, even at
>> say L7, if you've got max_connections=100 and a user who wants to make
>> trouble, you have a really big problem.
>> Perhaps I'm being too pessimistic here, but man that's a lot of memory.
> 
> Maybe I'm just confused, but my assumption was this means there's a memory 
> leak somewhere - that we're not resetting/freeing some piece of memory, or 
> so. Why would zstd need so much memory? It seems like a pretty serious 
> disadvantage, so how could it become so popular?

AFAIK it's 700 clients. Does not seem like super high price for big 
traffic\latency reduction.

Best regards, Andrey Borodin.



Re: Fix typo about generate_gather_paths

2020-12-22 Thread Tomas Vondra

On 12/9/20 3:21 AM, Hou, Zhijie wrote:

Hi

Since ba3e76c,
the optimizer calls generate_useful_gather_paths() instead of
generate_gather_paths() from the outside.

But I noticed that some comments still talk about generate_gather_paths, not
generate_useful_gather_paths.
I think we should fix these comments, so I tried to replace "generate_gather_paths"
with "generate_useful_gather_paths" in them.



Thanks. I started looking at this a bit more closely, and I think most 
of the changes are fine - the code was changed to call a different 
function, but the comments still reference generate_gather_paths().


The one exception seems to be create_ordered_paths(), because that 
comment also makes statements about what generate_gather_paths is
doing. And some of it does not apply to generate_useful_gather_paths.
For example it says it generates order-preserving Gather Merge paths, 
but generate_useful_gather_paths also generates paths with sorts (which 
are clearly not order-preserving).


So I think this comment will need a bit more work to update ...


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: libpq compression

2020-12-22 Thread Kenneth Marshall
On Tue, Dec 22, 2020 at 07:15:23PM +0100, Tomas Vondra wrote:
> 
> 
> On 12/22/20 6:56 PM, Robert Haas wrote:
> >On Tue, Dec 22, 2020 at 6:24 AM Daniil Zakhlystov
> > wrote:
> >>When using bidirectional compression, Postgres resource usage correlates 
> >>with the selected compression level. For example, here is the Postgresql 
> >>application memory usage:
> >>
> >>No compression - 1.2 GiB
> >>
> >>ZSTD
> >>zstd:1 - 1.4 GiB
> >>zstd:7 - 4.0 GiB
> >>zstd:13 - 17.7 GiB
> >>zstd:19 - 56.3 GiB
> >>zstd:20 - 109.8 GiB - did not succeed
> >>zstd:21, zstd:22  > 140 GiB
> >>Postgres process crashes (out of memory)
> >
> >Good grief. So, suppose we add compression and support zstd. Then, can an
> >unprivileged user capable of connecting to the database negotiate
> >for zstd level 1 and then choose to actually send data compressed at
> >zstd level 22, crashing the server if it doesn't have a crapton of
> >memory? Honestly, I wouldn't blame somebody for filing a CVE if we
> >allowed that sort of thing to happen. I'm not sure what the solution
> >is, but we can't leave a way for a malicious client to consume 140GB
> >of memory on the server *per connection*. I assumed decompression
> >memory was going to measured in kB or MB, not GB. Honestly, even at
> >say L7, if you've got max_connections=100 and a user who wants to make
> >trouble, you have a really big problem.
> >
> >Perhaps I'm being too pessimistic here, but man that's a lot of memory.
> >
> 
> Maybe I'm just confused, but my assumption was this means there's a
> memory leak somewhere - that we're not resetting/freeing some piece
> of memory, or so. Why would zstd need so much memory? It seems like
> a pretty serious disadvantage, so how could it become so popular?
> 
> 
> regards
> 

Hi,

It looks like the space needed for decompression is between 1kb and
3.75tb:

https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#window_descriptor

Sheesh! Looks like it would definitely need to be bounded to control
resource use.
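
For what it's worth, a minimal sketch of how the receiving side could
enforce such a bound, assuming libzstd's advanced API (zstd >= 1.4).
Frames that require a larger window then fail to decode instead of
allocating the memory:

#include <zstd.h>

/* create a decompression context that refuses windows larger than 2^23 (8MB) */
static ZSTD_DCtx *
create_bounded_dctx(void)
{
	ZSTD_DCtx  *dctx = ZSTD_createDCtx();

	if (dctx != NULL)
		ZSTD_DCtx_setParameter(dctx, ZSTD_d_windowLogMax, 23);
	return dctx;
}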

Regards,
Ken




Re: On login trigger: take three

2020-12-22 Thread Pavel Stehule
On Tue, Dec 22, 2020 at 12:42, Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

>
>
> On 22.12.2020 12:25, Pavel Stehule wrote:
>
>
> regress tests fails
>
>  sysviews ... FAILED  112 ms
> test event_trigger... FAILED (test process exited with
> exit code 2)  447 ms
> test fast_default ... FAILED  392 ms
> test stats... FAILED  626 ms
> == shutting down postmaster   ==
>
>
> Sorry, fixed.
>

no problem

I had to fix the doc:

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6ff783273f..7aded1848f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1008,7 +1008,7 @@ include_dir 'conf.d'
 trigger when a client connects. This parameter is switched on by
default.
 Errors in trigger code can prevent user to login to the system.
 I this case disabling this parameter in connection string can
solve the problem:
-psql "dbname=postgres options='-c
enable_client_connection_trigger=false'".
+psql "dbname=postgres options='-c
enable_client_connection_trigger=false'".

   
  

I am thinking again about enable_client_connection_trigger, and although it
can look useless (because errors are ignored for superusers), it can be useful
for some debugging and administration purposes. We probably don't want to
fire the client_connection trigger from backup tools, and maybe not from some
monitoring tools either. Maybe the possibility of setting this GUC could be
restricted to some special role (like pg_signal_backend).

+ 
+  enable_client_connection_trigger
(boolean)
+  
+   enable_client_connection_trigger
configuration parameter
+  
+  
+  
+   
+Enables firing the client_connection
+trigger when a client connects. This parameter is switched on by
default.
+Errors in trigger code can prevent user to login to the system.
+I this case disabling this parameter in connection string can
solve the problem:
+psql "dbname=postgres options='-c
enable_client_connection_trigger=false'".
+   
+  
+ 

There should be a note that only a superuser can change this value.

There should also be tab-complete support:

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 3a43c09bf6..08f00d8fc4 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2970,7 +2970,8 @@ psql_completion(const char *text, int start, int end)
COMPLETE_WITH("ON");
/* Complete CREATE EVENT TRIGGER  ON with event_type */
else if (Matches("CREATE", "EVENT", "TRIGGER", MatchAny, "ON"))
-   COMPLETE_WITH("ddl_command_start", "ddl_command_end", "sql_drop");
+   COMPLETE_WITH("ddl_command_start", "ddl_command_end",
+ "client_connection", "sql_drop");

/*
 * Complete CREATE EVENT TRIGGER  ON .  EXECUTE
FUNCTION

Regards

Pavel




>
> --
>
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: libpq compression

2020-12-22 Thread Tomas Vondra




On 12/22/20 6:56 PM, Robert Haas wrote:

On Tue, Dec 22, 2020 at 6:24 AM Daniil Zakhlystov
 wrote:

When using bidirectional compression, Postgres resource usage correlates with 
the selected compression level. For example, here is the Postgresql application 
memory usage:

No compression - 1.2 GiB

ZSTD
zstd:1 - 1.4 GiB
zstd:7 - 4.0 GiB
zstd:13 - 17.7 GiB
zstd:19 - 56.3 GiB
zstd:20 - 109.8 GiB - did not succeed
zstd:21, zstd:22  > 140 GiB
Postgres process crashes (out of memory)


Good grief. So, suppose we add compression and support zstd. Then, can an
unprivileged user capable of connecting to the database negotiate
for zstd level 1 and then choose to actually send data compressed at
zstd level 22, crashing the server if it doesn't have a crapton of
memory? Honestly, I wouldn't blame somebody for filing a CVE if we
allowed that sort of thing to happen. I'm not sure what the solution
is, but we can't leave a way for a malicious client to consume 140GB
of memory on the server *per connection*. I assumed decompression
memory was going to measured in kB or MB, not GB. Honestly, even at
say L7, if you've got max_connections=100 and a user who wants to make
trouble, you have a really big problem.

Perhaps I'm being too pessimistic here, but man that's a lot of memory.



Maybe I'm just confused, but my assumption was this means there's a 
memory leak somewhere - that we're not resetting/freeing some piece of 
memory, or so. Why would zstd need so much memory? It seems like a 
pretty serious disadvantage, so how could it become so popular?



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: libpq compression

2020-12-22 Thread Robert Haas
On Tue, Dec 22, 2020 at 6:24 AM Daniil Zakhlystov
 wrote:
> When using bidirectional compression, Postgres resource usage correlates with 
> the selected compression level. For example, here is the Postgresql 
> application memory usage:
>
> No compression - 1.2 GiB
>
> ZSTD
> zstd:1 - 1.4 GiB
> zstd:7 - 4.0 GiB
> zstd:13 - 17.7 GiB
> zstd:19 - 56.3 GiB
> zstd:20 - 109.8 GiB - did not succeed
> zstd:21, zstd:22  > 140 GiB
> Postgres process crashes (out of memory)

Good grief. So, suppose we add compression and support zstd. Then, can an
unprivileged user capable of connecting to the database negotiate
for zstd level 1 and then choose to actually send data compressed at
zstd level 22, crashing the server if it doesn't have a crapton of
memory? Honestly, I wouldn't blame somebody for filing a CVE if we
allowed that sort of thing to happen. I'm not sure what the solution
is, but we can't leave a way for a malicious client to consume 140GB
of memory on the server *per connection*. I assumed decompression
memory was going to measured in kB or MB, not GB. Honestly, even at
say L7, if you've got max_connections=100 and a user who wants to make
trouble, you have a really big problem.

Perhaps I'm being too pessimistic here, but man that's a lot of memory.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




HOT chain bug in latestRemovedXid calculation

2020-12-22 Thread Peter Geoghegan
ISTM that heap_compute_xid_horizon_for_tuples() calculates
latestRemovedXid for index deletion callers without sufficient care.
The function only follows line pointer redirects, which is necessary
but not sufficient to visit all relevant heap tuple headers -- it also
needs to traverse HOT chains, but that doesn't happen. AFAICT
heap_compute_xid_horizon_for_tuples() might therefore fail to produce
a sufficiently recent latestRemovedXid value for the index deletion
operation as a whole. This might in turn lead to the REDO routine
(e.g. btree_xlog_delete()) doing conflict processing incorrectly
during hot standby.
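
For illustration, a rough sketch of the kind of per-TID traversal that
seems to be needed. This is not taken from the attached instrumentation
patch; it just mirrors how other heap code walks HOT chains:

#include "postgres.h"
#include "access/heapam.h"
#include "access/htup_details.h"
#include "storage/bufpage.h"

/* fold a redirect plus the whole HOT chain behind it into latestRemovedXid */
static void
advance_horizon_for_hot_chain(Page page, OffsetNumber offnum,
							  TransactionId *latestRemovedXid)
{
	for (;;)
	{
		ItemId			lp = PageGetItemId(page, offnum);
		HeapTupleHeader htup;

		if (ItemIdIsRedirected(lp))
		{
			offnum = ItemIdGetRedirect(lp);
			continue;
		}
		if (!ItemIdIsNormal(lp))
			break;

		htup = (HeapTupleHeader) PageGetItem(page, lp);
		HeapTupleHeaderAdvanceLatestRemovedXid(htup, latestRemovedXid);

		if (!HeapTupleHeaderIsHotUpdated(htup))
			break;
		/* the next member of the HOT chain lives on the same page */
		offnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
	}
}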

Attached is an instrumentation patch. If I run "make check" with the
patch applied, I get test output failures that can be used to get a
general sense of the problem:

$ cat /code/postgresql/patch/build/src/test/regress/regression.diffs |
grep "works okay this time" | wc -l
382

$ cat /code/postgresql/patch/build/src/test/regress/regression.diffs |
grep "hot chain bug"
+WARNING:  hot chain bug, latestRemovedXid: 2307,
latestRemovedXidWithHotChain: 2316
+WARNING:  hot chain bug, latestRemovedXid: 4468,
latestRemovedXidWithHotChain: 4538
+WARNING:  hot chain bug, latestRemovedXid: 4756,
latestRemovedXidWithHotChain: 4809
+WARNING:  hot chain bug, latestRemovedXid: 5000,
latestRemovedXidWithHotChain: 5001
+WARNING:  hot chain bug, latestRemovedXid: 7683,
latestRemovedXidWithHotChain: 7995
+WARNING:  hot chain bug, latestRemovedXid: 13450,
latestRemovedXidWithHotChain: 13453
+WARNING:  hot chain bug, latestRemovedXid: 10040,
latestRemovedXidWithHotChain: 10041

So out of 389 calls, we see 7 failures on this occasion, which is
typical. Heap pruning usually saves us in practice (since it is highly
correlated with setting LP_DEAD bits on index pages in the first
place), and even when it doesn't it's not particularly likely that the
issue will make the crucial difference for the deletion operation as a
whole.

The code that is now heap_compute_xid_horizon_for_tuples() ran in REDO
routines directly prior to Postgres 12.
heap_compute_xid_horizon_for_tuples() is a descendant of code added by
Simon’s commit a760893d in 2010 -- pretty close to HOT’s initial
introduction. So this has been around for a long time.

-- 
Peter Geoghegan


0001-Instrument-heap_compute_xid_horizon_for_tuples.patch
Description: Binary data


Re: libpq compression

2020-12-22 Thread Robert Haas
On Mon, Dec 14, 2020 at 12:53 PM Daniil Zakhlystov
 wrote:
> > On Dec 10, 2020, at 1:39 AM, Robert Haas  wrote:
> > Good points. I guess you need to arrange to "flush" at the compression
> > layer as well as the libpq layer so that you don't end up with data
> > stuck in the compression buffers.
>
> I think that “flushing” the libpq and compression buffers before setting the 
> new compression method will help to solve issues only at the compressing 
> (sender) side
> but won't help much on the decompressing (receiver) side.

Hmm, I assumed that if the compression buffers were flushed on the
sending side, and if all the data produced on the sending side were
transmitted to the receiver, the receiving side would then return
everything up to the point of the flush. However, now that I think
about it, there's no guarantee that any particular compression library
would actually behave that way. I wonder what actually happens in
practice with the libraries we care about?
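
For zstd at least, zstd.h describes ZSTD_e_flush as creating a block that
can be decoded immediately on reception, so everything submitted so far
should become readable on the other end. A minimal sketch, with
send_to_socket() as a placeholder for however we'd actually push the
bytes out (error handling mostly elided):

#include <zstd.h>

extern void send_to_socket(const char *data, size_t len);	/* placeholder */

static void
zstd_flush(ZSTD_CCtx *cctx, char *buf, size_t buflen)
{
	ZSTD_inBuffer in = {NULL, 0, 0};
	size_t		remaining;

	do
	{
		ZSTD_outBuffer out = {buf, buflen, 0};

		remaining = ZSTD_compressStream2(cctx, &out, &in, ZSTD_e_flush);
		if (ZSTD_isError(remaining))
			break;				/* real code would report the error */
		send_to_socket(buf, out.pos);
	} while (remaining != 0);
}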

> This may help to solve the above issue. For example, we may introduce the 
> CompressedData message:
>
> CompressedData (F & B)
>
> Byte1(‘m’) // I am not so sure about the ‘m’ identifier :)
> Identifies the message as compressed data.
>
> Int32
> Length of message contents in bytes, including self.
>
> Byten
> Data that forms part of a compressed data stream.
>
> Basically, it wraps some chunk of compressed data (like the CopyData message).
>
> On the sender side, the compressor will wrap all outgoing message chunks into 
> the CopyData messages.
>
> On the receiver side, some intermediate component between the secure_read and 
> the decompressor will do the following:
> 1. Read the next 5 bytes (type and length) from the buffer
> 2.1 If the message type is other than CompressedData, forward it straight to 
> the PqRecvBuffer /  conn->inBuffer.
> 2.2 If the message type is CompressedData, forward its contents to the 
> current decompressor.
>
> What do you think of this approach?

I'm not sure about the details, but the general idea seems like it
might be worth considering. If we choose a compression method that is
intended for streaming compression and decompression and whose library
handles compression flushes sensibly, then we might not really need to
go this way to make it work. But, on the other hand, this method has a
certain elegance that just compressing everything lacks, and might
allow some useful flexibility. On the third hand, restarting
compression for every new set of messages might really hurt the
compression ratio in some scenarios. I'm not sure what is best.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [HACKERS] [PATCH] Generic type subscripting

2020-12-22 Thread Dmitry Dolgov
> On Tue, Dec 22, 2020 at 11:57:13AM -0500, Tom Lane wrote:
> Dmitry Dolgov <9erthali...@gmail.com> writes:
> > On Tue, Dec 22, 2020 at 12:19:26PM +0100, Pavel Stehule wrote:
> >> I expect behave like
> >>
> >> update x set test[1] = 10; --> "[10]";
> >> update x set test['1'] = 10; --> "{"1": 10}"
>
> > Yes, I also was thinking about this because such behaviour is more
> > natural.
>
> I continue to feel that this is a fundamentally bad idea that will
> lead to much more pain than benefit.  People are going to want to
> know why "test[1.0]" doesn't act like "test[1]".  They are going
> to complain because "test[$1]" acts so much differently depending
> on whether they assigned a type to the $1 parameter or not.  And
> they are going to bitch because dumping and reloading a rule causes
> it to do something different than it did before --- or at least we'd
> be at horrid risk of that; only if we hide the injected cast-to-text
> doesd the dumped rule look the way it needs to.  Even then, the whole
> thing is critically dependent on the fact that integer-type constants
> are written and displayed differently from other constants, so it
> won't scale to any other type that someone might want to treat specially.
> So you're just leading datatype designers down a garden path that will be
> a dead end for many of them.
>
> IMO this isn't actually any saner than your previous iterations
> on the idea.

Ok. While I don't have any preferences here, we can disregard the last
posted patch (extended-with-subscript-type) and consider only
v38-0001-Subscripting-for-jsonb version.




Re: [HACKERS] [PATCH] Generic type subscripting

2020-12-22 Thread Pavel Stehule
On Tue, Dec 22, 2020 at 17:57, Tom Lane wrote:

> Dmitry Dolgov <9erthali...@gmail.com> writes:
> > On Tue, Dec 22, 2020 at 12:19:26PM +0100, Pavel Stehule wrote:
> >> I expect behave like
> >>
> >> update x set test[1] = 10; --> "[10]";
> >> update x set test['1'] = 10; --> "{"1": 10}"
>
> > Yes, I also was thinking about this because such behaviour is more
> > natural.
>
> I continue to feel that this is a fundamentally bad idea that will
> lead to much more pain than benefit.  People are going to want to
> know why "test[1.0]" doesn't act like "test[1]".  They are going
> to complain because "test[$1]" acts so much differently depending
> on whether they assigned a type to the $1 parameter or not.  And
> they are going to bitch because dumping and reloading a rule causes
> it to do something different than it did before --- or at least we'd
> be at horrid risk of that; only if we hide the injected cast-to-text
> doesd the dumped rule look the way it needs to.  Even then, the whole
> thing is critically dependent on the fact that integer-type constants
> are written and displayed differently from other constants, so it
> won't scale to any other type that someone might want to treat specially.
> So you're just leading datatype designers down a garden path that will be
> a dead end for many of them.
>
> IMO this isn't actually any saner than your previous iterations
> on the idea.
>

I think there can be some logic to it. But json has two kinds of very
different containers - objects and arrays, and we should support both.

Could a good solution be based on the initial source value?

DECLARE v jsonb;
BEGIN
  v := '[]';
  v[1] := 10; v[2] := 20; -- v = "[10,20]"
  v['a'] := 30; --> should to raise an error
  v := '{}';
  v[1] := 10; v[2] := 20; -- v = "{"1": 10, "2":20}"
  v['a'] := 30; -- v = "{"1": 10, "2":20, "a": 30}"

When the source variable is null, the default behavior can be the same as
for json objects. But that doesn't handle numeric indexes well.

because

v := '[]'
v[1.5] := 100; -- it is nonsense

but
v := '{}'
v[1.5] := 100; -- v = "{"1.5":100}" -- and this can have good benefit, but
"1" and "1.0" are different keys.

But maybe we are trying to design something that has already been designed.
Is there some info about index specification in SQL/JSON?

Regards

Pavel







regards, tom lane
>


Re: Better client reporting for "immediate stop" shutdowns

2020-12-22 Thread Tom Lane
Magnus Hagander  writes:
> On Tue, Dec 22, 2020 at 2:29 AM Bharath Rupireddy
>  wrote:
>> If I'm correct, quickdie() doesn't access any shared memory because
>> one of the reason we can be in quickdie() is when the shared memory
>> itself is corrupted(the comment down below on why we don't call
>> roc_exit() or atexit() says), in such case, will GetQuitSignalReason()
>> have some problem in accessing the shared memory i.e. +return
>> PMSignalState->sigquit_reason;?

It couldn't really have any problem in physically accessing the field;
we never detach from the main shared memory block.  The risk that needs
to be thought about is that shared memory contains garbage --- for
example, imagine that a failing process scribbled in the wrong part of
shared memory before crashing.  So the hazard here is that there's a
small chance that sigquit_reason will contain the wrong value, which
would cause the patch to print a misleading message, or more likely
not print anything (since I didn't put a default case in that switch).
That seems fine to me.  Also, because the sequence of events would be
(1) failing process scribbles and crashes, (2) postmaster updates
sigquit_reason, (3) other child processes examine sigquit_reason,
it's fairly likely that we'd get the right answer even if the field
got clobbered during (1).

There might be an argument for emitting the "unexpected SIGQUIT"
text if we find garbage in sigquit_reason.  Any thoughts about that?

>> AFAIK, errmsg(terminating connection due to administrator command") is
>> emitted when there's no specific reason. But we know exactly why we
>> are terminating in this case, how about having an error message like
>> errmsg("terminating connection due to immediate shutdown request")));
>> ? There are other places where errmsg("terminating connection due to
>>  reasons"); is used. We also log messages when an immediate
>> shutdown request is received errmsg("received immediate shutdown
>> request").

> +1. I definitely think having this message be different can be useful.

OK, will use "terminating connection due to immediate shutdown
request".

> See also the thread about tracking shutdown reasons (connection
> statistics) -- not the same thing, but the same concepts apply.

Hm.  I wondered for a bit if that patch could make use of this one
to improve its results.  For the specific case of SIGQUIT it seems
moot because we aren't going to let backends send any shutdown
statistics during an emergency stop.  But maybe the idea could be
extended to let more-accurate termination reasons be provided in
some other cases.

regards, tom lane




Re: [HACKERS] [PATCH] Generic type subscripting

2020-12-22 Thread Tom Lane
Dmitry Dolgov <9erthali...@gmail.com> writes:
> On Tue, Dec 22, 2020 at 12:19:26PM +0100, Pavel Stehule wrote:
>> I expect behave like
>> 
>> update x set test[1] = 10; --> "[10]";
>> update x set test['1'] = 10; --> "{"1": 10}"

> Yes, I also was thinking about this because such behaviour is more
> natural.

I continue to feel that this is a fundamentally bad idea that will
lead to much more pain than benefit.  People are going to want to
know why "test[1.0]" doesn't act like "test[1]".  They are going
to complain because "test[$1]" acts so much differently depending
on whether they assigned a type to the $1 parameter or not.  And
they are going to bitch because dumping and reloading a rule causes
it to do something different than it did before --- or at least we'd
be at horrid risk of that; only if we hide the injected cast-to-text
does the dumped rule look the way it needs to. Even then, the whole
thing is critically dependent on the fact that integer-type constants
are written and displayed differently from other constants, so it
won't scale to any other type that someone might want to treat specially.
So you're just leading datatype designers down a garden path that will be
a dead end for many of them.
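
Concretely, the sort of surprise being described would look something like
this (hypothetical statements, assuming a jsonb column "test" and the
subscripting behaviour proposed upthread -- not anything that exists today):

    UPDATE x SET test[1]   = '10';  -- integer literal: treated as an array index
    UPDATE x SET test[1.0] = '10';  -- numeric literal: would fall back to a text key
    UPDATE x SET test[$1]  = '10';  -- result depends on what type $1 resolves to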

IMO this isn't actually any saner than your previous iterations
on the idea.

regards, tom lane




Re: [HACKERS] [PATCH] Generic type subscripting

2020-12-22 Thread Dmitry Dolgov
> On Tue, Dec 22, 2020 at 12:19:26PM +0100, Pavel Stehule wrote:
>
> > Here is the new version of jsonb subscripting rebased on the committed
> > infrastructure patch. I hope it will not introduce any confusion with
> > the previously posted patched in this thread (about alter type subscript
> > and hstore) as they are independent.
> >
> > There are few differences from the previous version:
> >
> > * No limit on number of subscripts for jsonb (as there is no intrinsic
> >   limitation of this kind for jsonb).
> >
> > * In case of assignment via subscript now it expects the replace value
> >   to be of jsonb type.
> >
> > * Similar to the implementation for arrays, if the source jsonb is NULL,
> >   it will be replaced by an empty jsonb and the new value will be
> >   assigned to it. This means:
> >
> > =# select * from test_jsonb_subscript where id = 3;
> >  id | test_json
> > +---
> >   3 | NULL
> >
> > =# update test_jsonb_subscript set test_json['a'] = '1' where id =
> > 3;
> > UPDATE 1
> >
> > =# select * from test_jsonb_subscript where id = 3;
> >  id | test_json
> > +---
> >   3 | {"a": 1}
> >
> >   and similar:
> >
> > =# select * from test_jsonb_subscript where id = 3;
> >  id | test_json
> > +---
> >   3 | NULL
> >
> > =# update test_jsonb_subscript set test_json[1] = '1' where id = 3;
> > UPDATE 1
> >
> > =# select * from test_jsonb_subscript where id = 3;
> >  id | test_json
> > +---
> >   3 | {"1": 1}
> >
> >   The latter is probably a bit strange looking, but if there are any
> > concerns
> >   about this part (and in general about an assignment to jsonb which is
> > NULL)
> >   of the implementation it could be easily changed.
> >
>
> What is the possibility to make an array instead of a record?
>
> I expect behave like
>
> update x set test[1] = 10; --> "[10]";
> update x set test['1'] = 10; --> "{"1": 10}"

Yes, I was also thinking about this because such behaviour is more
natural. To implement this we need to provide a possibility for type-specific
code to remember the original subscript expression type (something
like in the attached version), which could also be useful for the future
work on jsonpath. I'm just not sure whether some important bits are again
missing in this idea, so if someone can take a look I would appreciate it.
In case there are any issues, I would just suggest keeping it
simple and returning NULL.
From dc7fc5fff7b1597861b950138e1084f4ac04e321 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Fri, 18 Dec 2020 17:19:51 +0100
Subject: [PATCH v38] Subscripting for jsonb (extended with subscript type)

Subscripting implementation for jsonb. It does not support slices, does
not have a limit on the number of subscripts, and for assignment expects the
replacement value to be of jsonb type. There is also one functional
difference in assignment via subscripting compared to jsonb_set: when the
original jsonb container is NULL, subscripting replaces it with an empty
jsonb and proceeds with the assignment. The type of such an empty jsonb would
be array if the first subscript was an integer, and object in all other
cases.

For the sake of code reuse, some parts of the jsonb functionality were
rearranged to allow using the same functions for jsonb_set and the
subscript assignment operation.

The patch also extends SubscriptingRef and SubscriptingRefState with
information about original types of subscript expressions. This could be
useful if type specific subscript implementation needs to distinguish
between different data types in subscripting.

The original idea belongs to Oleg Bartunov.

Reviewed-by: Tom Lane, Arthur Zakirov, Pavel Stehule
---
 doc/src/sgml/json.sgml  |  48 +
 src/backend/executor/execExpr.c |  21 +-
 src/backend/nodes/copyfuncs.c   |   2 +
 src/backend/nodes/outfuncs.c|   2 +
 src/backend/nodes/readfuncs.c   |   2 +
 src/backend/utils/adt/Makefile  |   1 +
 src/backend/utils/adt/jsonb_util.c  |  76 ++-
 src/backend/utils/adt/jsonbsubs.c   | 299 
 src/backend/utils/adt/jsonfuncs.c   | 180 +
 src/include/catalog/pg_proc.dat |   4 +
 src/include/catalog/pg_type.dat |   3 +-
 src/include/executor/execExpr.h |   2 +
 src/include/nodes/primnodes.h   |   2 +
 src/include/utils/jsonb.h   |   6 +-
 src/test/regress/expected/jsonb.out | 273 -
 src/test/regress/sql/jsonb.sql  |  84 +++-
 16 files changed, 899 insertions(+), 106 deletions(-)
 create mode 100644 src/backend/utils/adt/jsonbsubs.c

diff --git a/doc/src/sgml/json.sgml b/doc/src/sgml/json.sgml
index 5b9a5557a4..cad7b02559 100644
--- a/doc/src/sgml/json.sgml
+++ b/doc/src/sgml/json.sgml
@@ -602,6 +602,54 @@ SELECT jdoc-'guid', jdoc-'name' FROM api WHERE jdoc @ '{"tags": ["qu
   
  
 
+ 

Re: [PATCH] Automatic HASH and LIST partition creation

2020-12-22 Thread Pavel Borisov
>
> Why? We could accept anything in the list? i.e.:
>
> (ident =? value[, ident =? value]*)
>
> > I'm not against this, but as far as I've heard there is some
> > opposition in the PG community against new keywords. Maybe I am wrong.
>
> the ident is a keyword that can be interpreted later on, not a "reserved
> keyword" from a parser perspective, which is the only real issue?
>
> The parser does not need to know about it, only the command interpreter
> which will have to interpret it. AUTOMATIC is a nice parser cue to
> introduce such an ident-value list.
>
> > 2. The existing syntax for declarative partitioning is different to your
> > proposal.
>
> Yep. I think that it was not so good a design choice from a
> language/extensibility perspective.
>
Thank you very much, Fabien. It is clear enough.
BTW, could you say a couple of words about the pros and cons of parsing the
syntax in C code compared to parsing it via gram.y? I think both are
possible, but my predisposition was that we'd better use the latter if
possible.

Best regards,
Pavel Borisov

>


Re: Proposed patch for key managment

2020-12-22 Thread Bruce Momjian
On Mon, Dec 21, 2020 at 10:07:48PM -0500, Bruce Momjian wrote:
> Attached is the script patch.  It is also at:
> 
>   
> https://github.com/postgres/postgres/compare/master...bmomjian:cfe-sh.diff
> 
> I think it still needs docs but those will have to be done after the
> code doc patch is added.

Here is an updated patch.  Are people happy with the Makefile, its
location in the source tree, and the install directory name?  I used the
directory name 'auth_commands' because I thought 'auth' was too easily
misinterpreted. I put the scripts in /src/backend/utils/auth_commands. 
It also contains a script that can be used for SSL passphrase prompting,
but I haven't written the C code for that yet.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee



cfe-sh.diff.gz
Description: application/gzip


Re: [PATCH] Automatic HASH and LIST partition creation

2020-12-22 Thread Fabien COELHO


Hello.


CREATE TABLE foo(a int) PARTITION BY LIST(a) CONFIGURATION (FOR VALUES

IN

(1,2),(3,4) DEFAULT PARTITION foo_def);


I would like to disagree with this syntactic approach because it would be
very specific to each partition method. IMHO the syntax should be as
generic as possible. I'd suggest (probably again) a keyword/value list,
which would be quite adaptable without putting any pressure on
the parser.


If I remember your proposal correctly it is something like
CREATE TABLE foo(...) PARTITION BY HASH AUTOMATIC (MODULUS 10);


Yep, that would be the spirit.

It is still possible but there are some caveats: 1. We'll need to add 
keyword MODULUS (and probably AUTOMATIC) to the parser's list.


Why? We could accept anything in the list? i.e.:

   (ident =? value[, ident =? value]*)


I'm not against this, but as far as I've heard there is some
opposition in the PG community against new keywords. Maybe I am wrong.


the ident is a keyword that can be interpreted later on, not a "reserved 
keyword" from a parser perspective, which is the only real issue?


The parser does not need to know about it, only the command interpreter,
which will have to interpret it. AUTOMATIC is a nice parser cue to
introduce such an ident-value list.
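
For illustration only, such an ident-value list might look something like
this (purely hypothetical syntax and option names, not a concrete proposal):

    CREATE TABLE foo (a int)
        PARTITION BY HASH (a)
        AUTOMATIC (modulus = 10);

    CREATE TABLE bar (a int)
        PARTITION BY LIST (a)
        AUTOMATIC (values = '(1,2),(3,4)', default_partition = 'bar_def');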



2. The existing syntax for declarative partitioning is different to your
proposal.


Yep. I think that it was not so good a design choice from a 
language/extensibility perspective.



It is still not a big problem, and your proposal makes the query
shorter by several words. I'd just like to see some consensus on the
syntax. Now I must admit there are too many contradictions in opinions,
which makes progress slow. Also I think it is important to have a really
convenient syntax.





2a Maybe all of us who participated in the thread can vote for some variant?
2b Maybe the existing syntax for declarative partitioning should be given
some priority, as it is already committed as CREATE TABLE ... PARTITION OF
... FOR VALUES IN ... etc.



I'd be happy if everyone would get behind some version of the syntax proposed in
this thread and in the previous discussion [1]. If we have a variant with
more than one supporter, we can surely develop a patch based on it.
Thank you very much
and Merry Christmas!

[1]
https://www.postgresql.org/message-id/flat/alpine.DEB.2.21.1907150711080.22273%40lancre




--
Fabien.

Re: Confused about stream replication protocol documentation

2020-12-22 Thread Fujii Masao




On 2020/12/22 18:07, Li Japin wrote:

Hi, all

In Stream Replication Protocol [1], the documentation of `START_REPLICATION` 
message is

XLogData (B)
   …
Primary keepalive message (B)
   …
Standby status update (F)
   …
Hot Standby feedback message (F)
   ...

I’m confused about the meaning of ‘B’ and ‘F’. If they don't mean anything, why do we
document them here?
However, if they do mean something, shouldn't we explain it?
Can someone help me out?


‘B’ means a backend and ‘F’ means a frontend. Maybe, as [1] does, we should
add a note like "Each is marked to indicate that it can be sent by
a frontend (F) or a backend (B)" to the description of each message
format for START_REPLICATION.

[1]
https://www.postgresql.org/docs/devel/protocol-message-formats.html

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [PATCH] Automatic HASH and LIST partition creation

2020-12-22 Thread Pavel Borisov
>
> > CREATE TABLE foo(a int) PARTITION BY LIST(a) CONFIGURATION (FOR VALUES
> IN
> > (1,2),(3,4) DEFAULT PARTITION foo_def);
>
> I would like to disagree with this syntactic approach because it would be
> very specific to each partition method. IMHO the syntax should be as
> generic as possible. I'd suggest (probably again) a keyword/value list,
> which would be quite adaptable without putting any pressure on
> the parser.
>
If I remember your proposal correctly it is something like
CREATE TABLE foo(...) PARTITION BY HASH AUTOMATIC (MODULUS 10);

It is still possible but there are some caveats:
1. We'll need to add the keyword MODULUS (and probably AUTOMATIC) to the
parser's list. I'm not against this, but as far as I've heard there is some
opposition in the PG community against new keywords. Maybe I am wrong.
2. The existing syntax for declarative partitioning is different to your
proposal. It is still not a big problem, and your proposal makes the query
shorter by several words. I'd just like to see some consensus on the
syntax. Now I must admit there are too many contradictions in opinions,
which makes progress slow. Also I think it is important to have a really
convenient syntax.
2a Maybe all of us who participated in the thread can vote for some variant?
2b Maybe the existing syntax for declarative partitioning should be given
some priority, as it is already committed as CREATE TABLE ... PARTITION OF
... FOR VALUES IN ... etc. (the existing form is sketched just below for comparison).
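
For comparison, the already-committed declarative form looks like this
(plain existing syntax, nothing new):

    CREATE TABLE foo (a int) PARTITION BY LIST (a);
    CREATE TABLE foo_12  PARTITION OF foo FOR VALUES IN (1, 2);
    CREATE TABLE foo_34  PARTITION OF foo FOR VALUES IN (3, 4);
    CREATE TABLE foo_def PARTITION OF foo DEFAULT;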

I'd be happy if everyone would get behind some version of the syntax proposed in
this thread and in the previous discussion [1]. If we have a variant with
more than one supporter, we can surely develop a patch based on it.
Thank you very much,
and Merry Christmas!

[1]
https://www.postgresql.org/message-id/flat/alpine.DEB.2.21.1907150711080.22273%40lancre

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: Deadlock between backend and recovery may not be detected

2020-12-22 Thread Fujii Masao



On 2020/12/22 20:42, Fujii Masao wrote:



On 2020/12/22 10:25, Masahiko Sawada wrote:

On Fri, Dec 18, 2020 at 6:36 PM Fujii Masao  wrote:




On 2020/12/17 2:15, Fujii Masao wrote:



On 2020/12/16 23:28, Drouvot, Bertrand wrote:

Hi,

On 12/16/20 2:36 PM, Victor Yegorov wrote:




Wed, 16 Dec 2020 at 13:49, Fujii Masao :

 After doing this procedure, you can see the startup process and backend
 wait for each other's table lock, i.e., a deadlock. But this deadlock 
remains
 even after deadlock_timeout passes.

 This seems a bug to me.


+1



 > * Deadlocks involving the Startup process and an ordinary backend process
 > * will be detected by the deadlock detector within the ordinary backend.

 The cause of this issue seems that ResolveRecoveryConflictWithLock() that
 the startup process calls when recovery conflict on lock happens doesn't
 take care of deadlock case at all. You can see this fact by reading the 
above
 source code comment for ResolveRecoveryConflictWithLock().

 To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
 timer in ResolveRecoveryConflictWithLock() so that the startup process can
 send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
 Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
 the backend should check whether the deadlock actually happens or not.
 Attached is the POC patch implementing this.


good catch!

I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT shouldn't be 
set in ResolveRecoveryConflictWithLock() too (it is already set in 
ResolveRecoveryConflictWithBufferPin()).

So + 1 to consider this as a bug and for the way the patch proposes to fix it.


Thanks Victor and Bertrand for agreeing!
Attached is the updated version of the patch.


Attached is v3 of the patch. Could you review this version?

While the startup process is waiting for recovery conflict on buffer pin,
it repeats sending the request for deadlock check to all the backends
every deadlock_timeout. This may increase the workload in the startup
process and backends, but since this is the original behavior, the patch
doesn't change that. Also in practice this may not be so harmful because
the period that the buffer is kept pinned is basically not so long.



@@ -529,6 +603,26 @@ ResolveRecoveryConflictWithBufferPin(void)
  */
 ProcWaitForSignal(PG_WAIT_BUFFER_PIN);

+   if (got_standby_deadlock_timeout)
+   {
+   /*
+    * Send out a request for hot-standby backends to check themselves for
+    * deadlocks.
+    *
+    * XXX The subsequent ResolveRecoveryConflictWithBufferPin() will wait
+    * to be signaled by UnpinBuffer() again and send a request for
+    * deadlocks check if deadlock_timeout happens. This causes the
+    * request to continue to be sent every deadlock_timeout until the
+    * buffer is unpinned or ltime is reached. This would increase the
+    * workload in the startup process and backends. In practice it may
+    * not be so harmful because the period that the buffer is kept pinned
+    * is basically not so long. But should we fix this?
+    */
+   SendRecoveryConflictWithBufferPin(
+
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+   got_standby_deadlock_timeout = false;
+   }
+

Since SendRecoveryConflictWithBufferPin() sends the signal to all
backends, every backend that is waiting on a lock at ProcSleep() and not
holding a buffer pin blocking the startup process will end up doing a
deadlock check, which seems expensive.
deadlock will not be detected because the deadlock involving a buffer
pin isn't detected by CheckDeadLock(). I thought we can replace
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK with
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN but it’s not good because the
backend who has a buffer pin blocking the startup process and not
waiting on a lock is also canceled after deadlock_timeout. We can have
the backend return from RecoveryConflictInterrupt() when it received
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN and is not waiting on any lock,
but it’s also not good because we cannot cancel the backend after
max_standby_streaming_delay that has a buffer pin blocking the startup
process. So I wonder if we can have a new signal. When the backend
received this signal it returns from RecoveryConflictInterrupt()
without deadlock check either if it’s not waiting on any lock or if it
doesn’t have a buffer pin blocking the startup process. Otherwise it's
cancelled.


Thanks for pointing out that issue! Using a new signal is an idea. Another idea
is to make a backend skip the deadlock check if GetStartupBufferPinWaitBufId()
returns -1, i.e., the startup process is not 

Re: [PATCH] Automatic HASH and LIST partition creation

2020-12-22 Thread Fabien COELHO



CREATE TABLE foo(a int) PARTITION BY LIST(a) CONFIGURATION (FOR VALUES IN 
(1,2),(3,4) DEFAULT PARTITION foo_def);


I would like to disagree with this syntactic approach because it would be
very specific to each partition method. IMHO the syntax should be as
generic as possible. I'd suggest (probably again) a keyword/value list,
which would be quite adaptable without putting any pressure on
the parser.


--
Fabien.




EXPLAIN/EXPLAIN ANALYZE REFRESH MATERIALIZED VIEW

2020-12-22 Thread Bharath Rupireddy
Hi,

Currently, $subject is not allowed. We do plan the mat view query
before every refresh. I propose to show the explain/explain analyze output of
the select part of the mat view in the case of Refresh Mat View (RMV). It
will be useful for the user to know what exactly is being planned and
executed as part of RMV. Please note that we already have
explain/explain analyze for CTAS/Create Mat View (CMV), where we show the
explain/explain analyze of the select part. This proposal will do the
same thing.

The behaviour can be like this:
EXPLAIN REFRESH MATERIALIZED VIEW mv1;   --> will not refresh the mat
view, but shows the select part's plan of mat view.
EXPLAIN ANALYZE REFRESH MATERIALIZED VIEW mv1;   --> will refresh the
mat view and shows the select part's plan of mat view.
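
A minimal usage sketch could look like this (mv1 and its base table are
made-up names; the EXPLAIN forms are the proposed ones, not existing syntax):

    CREATE TABLE t (a int);
    CREATE MATERIALIZED VIEW mv1 AS SELECT count(*) FROM t;

    EXPLAIN REFRESH MATERIALIZED VIEW mv1;          -- plan of the select part, no refresh
    EXPLAIN ANALYZE REFRESH MATERIALIZED VIEW mv1;  -- refreshes mv1 and shows the executed plan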

Thoughts? If okay, I will post a patch later.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: proposal: schema variables

2020-12-22 Thread Pavel Stehule
ne 20. 12. 2020 v 21:43 odesílatel Zhihong Yu  napsal:

> Hi,
> This is continuation of the previous review.
>
> +* We should to use schema variable buffer,
> when
> +* it is available.
>
> 'should use' (no to)
>

fixed


> +   /* When buffer of used schema variables loaded from shared memory
> */
>
> A verb seems missing in the above comment.
>

I changed this comment

        /*
         * link shared memory with working copy of schema variable's values
         * used in this query. This access is used by parallel query executor's
         * workers.
         */


> +   elog(ERROR, "unexpected non-SELECT command in LET ... SELECT");
>
> Since non-SELECT is mentioned, I wonder if the trailing SELECT can be
> omitted.
>

done


> +* some collision can be solved simply here to reduce errors
> based
> +* on simply existence of some variables. Often error can be
> using
>
> simply occurred twice above - I think one should be enough.
> If you want to keep the second, it should be 'simple'.
>

I rewrote this comment

updated patch attached

Regards

Pavel



> Cheers
>
> On Sun, Dec 20, 2020 at 11:25 AM Zhihong Yu  wrote:
>
>> Hi,
>> I took a look at the rebased patch.
>>
>> +  varisnotnull
>> +  boolean
>> +  
>> +  
>> +   True if the schema variable doesn't allow null value. The default
>> value is false.
>>
>> I wonder whether the field can be named in positive tense: e.g.
>> varallowsnull with default of true.
>>
>> +  vareoxaction
>> +   n = no action, d = drop the
>> variable,
>> +   r = reset the variable to its default value.
>>
>> Looks like there is only one action allowed. I wonder if there is a
>> possibility of having two actions at the same time in the future.
>>
>> + The PL/pgSQL language has not packages
>> + and then it has not package variables and package constants. The
>>
>> 'has not' -> 'has no'
>>
>> +  a null value. A variable created as NOT NULL and without an
>> explicitely
>>
>> explicitely -> explicitly
>>
>> +   int nnewmembers;
>> +   Oid*oldmembers;
>> +   Oid*newmembers;
>>
>> I wonder if naming nnewmembers newmembercount would be more readable.
>>
>> For pg_variable_aclcheck:
>>
>> +   return ACLCHECK_OK;
>> +   else
>> +   return ACLCHECK_NO_PRIV;
>>
>> The 'else' can be omitted.
>>
>> + * Ownership check for a schema variables (specified by OID).
>>
>> 'a schema variable' (no s)
>>
>> For NamesFromList():
>>
>> +   if (IsA(n, String))
>> +   {
>> +   result = lappend(result, n);
>> +   }
>> +   else
>> +   break;
>>
>> There would be no more name if current n is not a String ?
>>
>> +* both variants, and returns InvalidOid with not_uniq flag,
>> when
>>
>> 'and return' (no s)
>>
>> +   return InvalidOid;
>> +   }
>> +   else if (OidIsValid(varoid_without_attr))
>>
>> 'else' is not needed (since the if block ends with return).
>>
>> For clean_cache_callback(),
>>
>> +   if (hash_search(schemavarhashtab,
>> +   (void *) >varid,
>> +   HASH_REMOVE,
>> +   NULL) == NULL)
>> +   elog(DEBUG1, "hash table corrupted");
>>
>> Maybe add more information to the debug, such as var name.
>> Should we come out of the while loop in this scenario ?
>>
>> Cheers
>>
>


schema-variables-20201222.patch.gz
Description: application/gzip


Re: Implementing Incremental View Maintenance

2020-12-22 Thread Yugo NAGATA
Hi hackers,

I heard the opinion that this patch is too big and hard to review.
So, I wonder whether we should downsize the patch by eliminating some
features and keeping only the basic ones.

If there are more opinions that this would make it easier for reviewers to look
at this patch, I would like to do it. If so, we plan to support only
selection, projection, inner joins, and some aggregates in the first
release and leave sub-queries, outer joins, and CTE support to the
next release.

Regards,
Yugo Nagata

On Tue, 22 Dec 2020 21:51:36 +0900
Yugo NAGATA  wrote:
> Hi,
> 
> Attached is the revised patch (v20) to add support for Incremental
> Materialized View Maintenance (IVM).
> 
> In accordance with Konstantin's suggestion, I made a few optimizations.
> 
> 1. Creating an index on the matview automatically
> 
> When creating incremental maintainable materialized view (IMMV)s,
> a unique index on IMMV is created automatically if possible.
> 
> If the view definition query has a GROUP BY clause, the index is created
> on the columns of GROUP BY expressions. Otherwise, if the view contains
> all primary key attributes of its base tables in the target list, the index
> is created on these attributes.  Also, if the view has DISTINCT,
> a unique index is created on all columns in the target list.
> In other cases, no index is created.
> 
> In all cases, a NOTICE message is output to inform users that an index is
> created or that an appropriate index is necessary for efficient IVM.
> 
> 2. Use a weaker lock on the matview if possible
> 
> If the view has only one base table in this query, RowExclusiveLock is
> held on the view instead of AccessExclusiveLock, because we don't
> need to wait for other concurrent transactions' results in order to
> maintain the view in this case. When the same row in the view is
> affected due to concurrent maintenances, a row level lock will
> protect it.
> 
> On Tue, 24 Nov 2020 12:46:57 +0300
> Konstantin Knizhnik  wrote:
> 
> > The most obvious optimization is not to use exclusive table lock if view 
> > depends just on one table (contains no joins).
> > Looks like there are no any anomalies in this case, are there?
> 
> I confirmed the effect of this optimizations.
> 
> First, when I performed pgbench (SF=100) without any materialized views,
> the result is:
>  
>  pgbench test4 -T 300 -c 8 -j 4
>  latency average = 6.493 ms
>  tps = 1232.146229 (including connections establishing)
> 
> Next, after creating the view below, I performed the same pgbench run.
>  CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm2 AS
> SELECT bid, count(abalance), sum(abalance), avg(abalance)
> FROM pgbench_accounts GROUP BY bid;
> 
> The result is here:
> 
> [the previous version (v19 with exclusive table lock)]
>  - latency average = 77.677 ms
>  - tps = 102.990159 (including connections establishing)
> 
> [In the latest version (v20 with weaker lock)]
>  - latency average = 17.576 ms
>  - tps = 455.159644 (including connections establishing)
> 
> There is still substantial overhead, but we can see the effect
> of the optimization.
> 
> Regards,
> Yugo Nagata
> 
> -- 
> Yugo NAGATA 


-- 
Yugo NAGATA 




Re: Implementing Incremental View Maintenance

2020-12-22 Thread Yugo NAGATA
Hi,

Attached is the revised patch (v20) to add support for Incremental
Materialized View Maintenance (IVM).

In accordance with Konstantin's suggestion, I made a few optimizations.

1. Creating an index on the matview automatically

When creating incremental maintainable materialized view (IMMV)s,
a unique index on IMMV is created automatically if possible.

If the view definition query has a GROUP BY clause, the index is created
on the columns of GROUP BY expressions. Otherwise, if the view contains
all primary key attributes of its base tables in the target list, the index
is created on these attributes.  Also, if the view has DISTINCT,
a unique index is created on all columns in the target list.
In other cases, no index is created.

In all cases, a NOTICE message is output to inform users that an index is
created or that an appropriate index is necessary for efficient IVM.
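
For example (a sketch using the same kind of view as in the pgbench test
below; the CREATE INCREMENTAL MATERIALIZED VIEW syntax is the one added by
this patch):

    CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm AS
        SELECT bid, count(abalance) FROM pgbench_accounts GROUP BY bid;
    -- a unique index on (bid) is created automatically here, because bid is
    -- the GROUP BY expression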

2. Use a weaker lock on the matview if possible

If the view has only one base table in this query, RowExclusiveLock is
held on the view instead of AccessExclusiveLock, because we don't
need to wait for other concurrent transactions' results in order to
maintain the view in this case. When the same row in the view is
affected due to concurrent maintenances, a row level lock will
protect it.

On Tue, 24 Nov 2020 12:46:57 +0300
Konstantin Knizhnik  wrote:

> The most obvious optimization is not to use exclusive table lock if view 
> depends just on one table (contains no joins).
> Looks like there are no any anomalies in this case, are there?

I confirmed the effect of this optimizations.

First, when I performed pgbench (SF=100) without any materialized views,
the result is:
 
 pgbench test4 -T 300 -c 8 -j 4
 latency average = 6.493 ms
 tps = 1232.146229 (including connections establishing)

Next, after creating the view below, I performed the same pgbench run.
 CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm2 AS
SELECT bid, count(abalance), sum(abalance), avg(abalance)
FROM pgbench_accounts GROUP BY bid;

The result is here:

[the previous version (v19 with exclusive table lock)]
 - latency average = 77.677 ms
 - tps = 102.990159 (including connections establishing)

[In the latest version (v20 with weaker lock)]
 - latency average = 17.576 ms
 - tps = 455.159644 (including connections establishing)

There is still substantial overhead, but we can see the effect
of the optimization.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 


IVM_patches_v20.tar.gz
Description: application/gzip


Re: Consider Parallelism While Planning For REFRESH MATERIALIZED VIEW

2020-12-22 Thread Bharath Rupireddy
On Tue, Dec 22, 2020 at 4:53 PM Hou, Zhijie  wrote:
> I have an issue about the safety of enabling parallel select.
>
> I checked the [parallel insert into select] patch.
> https://commitfest.postgresql.org/31/2844/
> It seems parallel select is not allowed when the target table is a temporary table.
>
> +   /*
> +* We can't support table modification in parallel-mode if it's a 
> foreign
> +* table/partition (no FDW API for supporting parallel access) or a
> +* temporary table.
> +*/
> +   if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE ||
> +   RelationUsesLocalBuffers(rel))
> +   {
> +   table_close(rel, lockmode);
> +   context->max_hazard = PROPARALLEL_UNSAFE;
> +   return context->max_hazard;
> +   }
>
> For Refresh MatView,
> if CONCURRENTLY is specified, it will build new data in a temp tablespace:
> if (concurrent)
> {
> tableSpace = GetDefaultTablespace(RELPERSISTENCE_TEMP, false);
> relpersistence = RELPERSISTENCE_TEMP;
> }
>
> For the above case, should we still consider parallelism ?

Thanks for taking a look at the patch.

The intention of the patch is to just enable the parallel mode while
planning the select part of the materialized view, but the insertions
do happen in the leader backend itself. That way, even if the
temporary tablespace gets created, we have no problem.

This patch can be thought of as a precursor to parallelizing inserts in
refresh matview. I'm thinking of following the design of parallel inserts
in CTAS [1], i.e. pushing the dest receiver down to the workers, once
that gets reviewed and finalized. At that time, we should take care
not to push inserts down to workers if a temporary tablespace gets
created.

In summary, as far as this patch is concerned, we don't have any
problem with the temporary tablespace getting created by the CONCURRENTLY
option.

I'm planning to add a few test cases to cover this patch in
matview.sql and post a v2 patch soon.
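
For instance, a test along these lines (a rough sketch; table and view names
are made up) should exercise the new behaviour, with the select part eligible
for a parallel plan while the inserts still happen in the leader:

    CREATE TABLE para_test (a int);
    INSERT INTO para_test SELECT generate_series(1, 1000000);
    CREATE MATERIALIZED VIEW para_mv AS SELECT count(*) FROM para_test;
    SET max_parallel_workers_per_gather = 2;
    REFRESH MATERIALIZED VIEW para_mv;  -- with this patch, the SELECT may use a parallel plan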

[1] -  
https://www.postgresql.org/message-id/CALj2ACWbQ95gS0z1viQC3qFVnMGAz7dcLjno9GdZv%2Bu9RAU9eQ%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Re: Cache relation sizes?

2020-12-22 Thread 陈佳昕(步真)
Hi Thomas:

I studied your patch these days and found there might be a problem.
When executing 'drop database', the smgr shared pool will not be removed because
'smgr_drop_sr' is never called. The function 'dropdb' in dbcommands.c removes the buffers
from the buffer pool and unlinks the real files by 'rmtree'; it doesn't call
smgrdounlinkall, so the smgr shared cache will not be dropped although the
table has been removed. This will cause some errors in smgr_alloc_str ->
smgropen and in smgrimmedsync: the table file has been removed, so smgropen and
smgrimmedsync will get an unexpected result.
 
Buzhen


 -- Original message --
From: Thomas Munro 
Sent: Tue Dec 22 19:57:35 2020
To: Amit Kapila 
Cc: Konstantin Knizhnik , PostgreSQL Hackers 

Subject: Re: Cache relation sizes?
On Tue, Nov 17, 2020 at 10:48 PM Amit Kapila  wrote:
> Yeah, it is good to verify VACUUM stuff but I have another question
> here. What about queries having functions that access the same
> relation (SELECT c1 FROM t1 WHERE c1 <= func(); assuming here function
> access the relation t1)? Now, here I think because the relation 't1'
> is already opened, it might use the same value of blocks from the
> shared cache even though the snapshot for relation 't1' when accessed
> from func() might be different. Am, I missing something, or is it
> dealt in some way?

I think it should be covered by the theory about implicit memory
barriers snapshots, but to simplify things I have now removed the
lock-free stuff from the main patch (0001), because it was a case of
premature optimisation and it distracted from the main concept.  The
main patch has 128-way partitioned LWLocks for the mapping table, and
then per-relfilenode spinlocks for modifying the size.  There are
still concurrency considerations, which I think are probably handled
with the dirty-update-wins algorithm you see in the patch.  In short:
due to extension and exclusive locks, size changes AKA dirty updates
are serialised, but clean updates can run concurrently, so we just
have to make sure that clean updates never clobber dirty updates -- do
you think that is on the right path?


Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread Amit Kapila
On Tue, Dec 22, 2020 at 2:55 PM Amit Kapila  wrote:
>
> Apart from tests, do let me know if you are happy with the changes in
> the patch? Next, I'll look into DropRelFileNodesAllBuffers()
> optimization patch.
>

Review of v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery [1]

1.
DropRelFileNodesAllBuffers()
{
..
+buffer_full_scan:
+ pfree(block);
+ nodes = palloc(sizeof(RelFileNode) * n); /* non-local relations */
+ for (i = 0; i < n; i++)
+ nodes[i] = smgr_reln[i]->smgr_rnode.node;
+
..
}

How is it correct to assign the nodes array directly from smgr_reln? There
is no one-to-one correspondence. If you see the code before the patch, the
passed array can have a mix of temp and non-temp relation information.

2.
+ for (i = 0; i < n; i++)
  {
- pfree(nodes);
+ for (j = 0; j <= MAX_FORKNUM; j++)
+ {
+ /*
+ * Assign InvalidblockNumber to a block if a relation
+ * fork does not exist, so that we can skip it later
+ * when dropping the relation buffers.
+ */
+ if (!smgrexists(smgr_reln[i], j))
+ {
+ block[i][j] = InvalidBlockNumber;
+ continue;
+ }
+
+ /* Get the number of blocks for a relation's fork */
+ block[i][j] = smgrnblocks(smgr_reln[i], j, );

Similar to above, how can we assume the smgr_reln array has only non-local
relations? Have we tried the case with a mix of temp and non-temp
relations?

In this code, I am slightly worried about the additional cost of
checking smgrexists each time. Consider a case where there are many
relations and only one or a few of them have not cached the information;
in such a case we will pay the cost of smgrexists for many relations
without even going to the optimized path. Can we avoid that in some
way or at least reduce its usage to only when it is required? One idea
could be that we first check if the nblocks information is cached and
if so then we don't need to call smgrnblocks, otherwise, check if it
exists. For this, we need an API like smgrnblocks_cached, something we
discussed earlier but preferred the current API. Do you have any
better ideas?


[1] - 
https://www.postgresql.org/message-id/OSBPR01MB2341882F416A282C3F7D769DEFC70%40OSBPR01MB2341.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.




Re: proposal: schema variables

2020-12-22 Thread Pavel Stehule
Hi

ne 20. 12. 2020 v 20:24 odesílatel Zhihong Yu  napsal:

> Hi,
> I took a look at the rebased patch.
>
> +  varisnotnull
> +  boolean
> +  
> +  
> +   True if the schema variable doesn't allow null value. The default
> value is false.
>
> I wonder whether the field can be named in positive tense: e.g.
> varallowsnull with default of true.
>

although I prefer positively designed variables, in this case the negative
form is better due to consistency with other system tables

postgres=# select table_name, column_name from information_schema.columns
where column_name like '%null';
┌──┬──┐
│  table_name  │ column_name  │
╞══╪══╡
│ pg_type  │ typnotnull   │
│ pg_attribute │ attnotnull   │
│ pg_variable  │ varisnotnull │
└──┴──┘
(3 rows)



> +  vareoxaction
> +   n = no action, d = drop the
> variable,
> +   r = reset the variable to its default value.
>

> Looks like there is only one action allowed. I wonder if there is a
> possibility of having two actions at the same time in the future.
>


Surely not - these three possibilities are supported and tested

CREATE VARIABLE t1 AS int DEFAULT -1 ON TRANSACTION END RESET
CREATE TEMP VARIABLE g AS int ON COMMIT DROP;


>
> + The PL/pgSQL language has not packages
> + and then it has not package variables and package constants. The
>
> 'has not' -> 'has no'
>

fixed


> +  a null value. A variable created as NOT NULL and without an
> explicitely
>
> explicitely -> explicitly
>

fixed


> +   int nnewmembers;
> +   Oid*oldmembers;
> +   Oid*newmembers;
>
> I wonder if naming nnewmembers newmembercount would be more readable.
>

It is not a bad idea, but nnewmembers is used in more places, so this
refactoring should be done in an independent patch


> For pg_variable_aclcheck:
>
> +   return ACLCHECK_OK;
> +   else
> +   return ACLCHECK_NO_PRIV;
>
> The 'else' can be omitted.
>

again - this pattern is used often in the related source file, and I used
it for consistency with other functions. It is not visible from the patch,
but when you check the related file, it will be clear.


> + * Ownership check for a schema variables (specified by OID).
>
> 'a schema variable' (no s)
>
> For NamesFromList():
>
> +   if (IsA(n, String))
> +   {
> +   result = lappend(result, n);
> +   }
> +   else
> +   break;
>
> There would be no more name if current n is not a String ?
>

It tries to derive the name of a variable from the target list. Usually
there can be only strings, but there can be array subscripting too
(an A_Indices node).
I wrote a comment there.


>
> +* both variants, and returns InvalidOid with not_uniq flag,
> when
>
> 'and return' (no s)
>
> +   return InvalidOid;
> +   }
> +   else if (OidIsValid(varoid_without_attr))
>
> 'else' is not needed (since the if block ends with return).
>

I understand. The `else` is not necessary, but I think it is better due
to symmetry

if ()
{
  return ..
}
else if ()
{
  return ..
}
else
{
  return ..
}

This style is used in more places in Postgres, and I usually prefer it when I
want to emphasize that all branches follow a similar pattern. My opinion about
it is not too strong. The functionality is the same, and I think the readability
is a little bit better (due to the symmetry), but I understand that this can be
subjective.



> For clean_cache_callback(),
>
> +   if (hash_search(schemavarhashtab,
> +   (void *) >varid,
> +   HASH_REMOVE,
> +   NULL) == NULL)
> +   elog(DEBUG1, "hash table corrupted");
>
> Maybe add more information to the debug, such as var name.
> Should we come out of the while loop in this scenario ?
>

I checked other places, and the same pattern is used in this context. If there
are problems, then the problem is not with some specific schema variable,
but in the implementation of the hash table.

Maybe it is better to ignore the result (I did that), like it is done in some
other places. This part is the implementation of a simple garbage collector,
and it is not important whether the value was removed one way or the other.
I changed the comments for this routine.

Regards

Pavel


> Cheers
>


Re: On login trigger: take three

2020-12-22 Thread Konstantin Knizhnik



On 22.12.2020 12:25, Pavel Stehule wrote:


regression tests fail

     sysviews                     ... FAILED      112 ms
test event_trigger                ... FAILED (test process exited with 
exit code 2)      447 ms

test fast_default                 ... FAILED      392 ms
test stats                        ... FAILED      626 ms
== shutting down postmaster ==



Sorry, fixed.



--


Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/bki.sgml b/doc/src/sgml/bki.sgml
index 036a72c..bb62f25 100644
--- a/doc/src/sgml/bki.sgml
+++ b/doc/src/sgml/bki.sgml
@@ -182,7 +182,7 @@
 { oid = '1', oid_symbol = 'TemplateDbOid',
   descr = 'database\'s default template',
   datname = 'template1', encoding = 'ENCODING', datcollate = 'LC_COLLATE',
-  datctype = 'LC_CTYPE', datistemplate = 't', datallowconn = 't',
+  datctype = 'LC_CTYPE', datistemplate = 't', datallowconn = 't', dathaslogontriggers = 'f',
   datconnlimit = '-1', datlastsysoid = '0', datfrozenxid = '0',
   datminmxid = '1', dattablespace = 'pg_default', datacl = '_null_' },
 
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 5698413..010bff5 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2946,6 +2946,17 @@ SCRAM-SHA-256$iteration count:
 
  
   
+   dathaslogontriggers bool
+  
+  
+Indicates that there are client connection triggers defined for this database.
+This flag is used to avoid an extra lookup of the pg_event_trigger table on each backend startup.
+This flag is used internally by Postgres and should not be manually changed by the DBA or an application.
+  
+ 
+
+ 
+  
datconnlimit int4
   
   
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f810789..45d30b3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -996,6 +996,23 @@ include_dir 'conf.d'
   
  
 
+ 
+  enable_client_connection_trigger (boolean)
+  
+   enable_client_connection_trigger configuration parameter
+  
+  
+  
+   
+Enables firing the client_connection
+trigger when a client connects. This parameter is switched on by default.
+Errors in the trigger code can prevent the user from logging in to the system.
+In this case, disabling this parameter in the connection string can solve the problem:
+psql "dbname=postgres options='-c enable_client_connection_trigger=false'". 
+   
+  
+ 
+
  
  
 
diff --git a/doc/src/sgml/ecpg.sgml b/doc/src/sgml/ecpg.sgml
index 14dcbdb..383f782 100644
--- a/doc/src/sgml/ecpg.sgml
+++ b/doc/src/sgml/ecpg.sgml
@@ -4673,6 +4673,7 @@ datdba = 10 (type: 1)
 encoding = 0 (type: 5)
 datistemplate = t (type: 1)
 datallowconn = t (type: 1)
+dathaslogontriggers = f (type: 1)
 datconnlimit = -1 (type: 5)
 datlastsysoid = 11510 (type: 1)
 datfrozenxid = 379 (type: 1)
@@ -4698,6 +4699,7 @@ datdba = 10 (type: 1)
 encoding = 0 (type: 5)
 datistemplate = f (type: 1)
 datallowconn = t (type: 1)
+dathaslogontriggers = f (type: 1)
 datconnlimit = -1 (type: 5)
 datlastsysoid = 11510 (type: 1)
 datfrozenxid = 379 (type: 1)
diff --git a/doc/src/sgml/event-trigger.sgml b/doc/src/sgml/event-trigger.sgml
index 60366a9..ae40a8e 100644
--- a/doc/src/sgml/event-trigger.sgml
+++ b/doc/src/sgml/event-trigger.sgml
@@ -28,6 +28,7 @@
  An event trigger fires whenever the event with which it is associated
  occurs in the database in which it is defined. Currently, the only
  supported events are
+ client_connection,
  ddl_command_start,
  ddl_command_end,
  table_rewrite
@@ -36,6 +37,29 @@

 

+ The client_connection event occurs when a client connection
+ to the server is established.
+ There are two mechanisms for dealing with any bugs in a trigger procedure for
+ this event which might prevent successful login to the system:
+ 
+  
+   
+ The configuration parameter enable_client_connection_trigger
+ makes it possible to disable firing the client_connection
+ trigger when a client connects.
+   
+  
+  
+   
+ Errors in the client_connection trigger procedure are
+ ignored for superuser. An error message is delivered to the client as
+ NOTICE in this case.
+   
+  
+ 
+   
+
+   
  The ddl_command_start event occurs just before the
  execution of a CREATE, ALTER, DROP,
  SECURITY LABEL,
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f27c3fe..8646db7 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -560,6 +560,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	new_record[Anum_pg_database_datctype - 1] =
 		DirectFunctionCall1(namein, CStringGetDatum(dbctype));
 	new_record[Anum_pg_database_datistemplate - 

Re: Deadlock between backend and recovery may not be detected

2020-12-22 Thread Fujii Masao




On 2020/12/22 10:25, Masahiko Sawada wrote:

On Fri, Dec 18, 2020 at 6:36 PM Fujii Masao  wrote:




On 2020/12/17 2:15, Fujii Masao wrote:



On 2020/12/16 23:28, Drouvot, Bertrand wrote:

Hi,

On 12/16/20 2:36 PM, Victor Yegorov wrote:




Wed, 16 Dec 2020 at 13:49, Fujii Masao :

 After doing this procedure, you can see the startup process and backend
 wait for each other's table lock, i.e., a deadlock. But this deadlock 
remains
 even after deadlock_timeout passes.

 This seems a bug to me.


+1



 > * Deadlocks involving the Startup process and an ordinary backend process
 > * will be detected by the deadlock detector within the ordinary backend.

 The cause of this issue seems that ResolveRecoveryConflictWithLock() that
 the startup process calls when recovery conflict on lock happens doesn't
 take care of deadlock case at all. You can see this fact by reading the 
above
 source code comment for ResolveRecoveryConflictWithLock().

 To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
 timer in ResolveRecoveryConflictWithLock() so that the startup process can
 send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
 Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
 the backend should check whether the deadlock actually happens or not.
 Attached is the POC patch implementing this.


good catch!

I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT shouldn't be 
set in ResolveRecoveryConflictWithLock() too (it is already set in 
ResolveRecoveryConflictWithBufferPin()).

So + 1 to consider this as a bug and for the way the patch proposes to fix it.


Thanks Victor and Bertrand for agreeing!
Attached is the updated version of the patch.


Attached is v3 of the patch. Could you review this version?

While the startup process is waiting for recovery conflict on buffer pin,
it repeats sending the request for deadlock check to all the backends
every deadlock_timeout. This may increase the workload in the startup
process and backends, but since this is the original behavior, the patch
doesn't change that. Also in practice this may not be so harmful because
the period that the buffer is kept pinned is basically not so long.



@@ -529,6 +603,26 @@ ResolveRecoveryConflictWithBufferPin(void)
  */
 ProcWaitForSignal(PG_WAIT_BUFFER_PIN);

+   if (got_standby_deadlock_timeout)
+   {
+   /*
+* Send out a request for hot-standby backends to check themselves for
+* deadlocks.
+*
+* XXX The subsequent ResolveRecoveryConflictWithBufferPin() will wait
+* to be signaled by UnpinBuffer() again and send a request for
+* deadlocks check if deadlock_timeout happens. This causes the
+* request to continue to be sent every deadlock_timeout until the
+* buffer is unpinned or ltime is reached. This would increase the
+* workload in the startup process and backends. In practice it may
+* not be so harmful because the period that the buffer is kept pinned
+* is basically not so long. But should we fix this?
+*/
+   SendRecoveryConflictWithBufferPin(
+
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+   got_standby_deadlock_timeout = false;
+   }
+

Since SendRecoveryConflictWithBufferPin() sends the signal to all
backends, every backend that is waiting on a lock at ProcSleep() and not
holding a buffer pin blocking the startup process will end up doing a
deadlock check, which seems expensive.
deadlock will not be detected because the deadlock involving a buffer
pin isn't detected by CheckDeadLock(). I thought we can replace
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK with
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN but it’s not good because the
backend who has a buffer pin blocking the startup process and not
waiting on a lock is also canceled after deadlock_timeout. We can have
the backend return from RecoveryConflictInterrupt() when it received
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN and is not waiting on any lock,
but it’s also not good because we cannot cancel the backend after
max_standby_streaming_delay that has a buffer pin blocking the startup
process. So I wonder if we can have a new signal. When the backend
received this signal it returns from RecoveryConflictInterrupt()
without deadlock check either if it’s not waiting on any lock or if it
doesn’t have a buffer pin blocking the startup process. Otherwise it's
cancelled.


Thanks for pointing out that issue! Using a new signal is an idea. Another idea
is to make a backend skip the deadlock check if GetStartupBufferPinWaitBufId()
returns -1, i.e., the startup process is not waiting for a buffer pin. So,
what I'm 

Re: Deadlock between backend and recovery may not be detected

2020-12-22 Thread Fujii Masao




On 2020/12/19 1:43, Drouvot, Bertrand wrote:

Hi,

On 12/18/20 10:35 AM, Fujii Masao wrote:




On 2020/12/17 2:15, Fujii Masao wrote:



On 2020/12/16 23:28, Drouvot, Bertrand wrote:

Hi,

On 12/16/20 2:36 PM, Victor Yegorov wrote:




Wed, 16 Dec 2020 at 13:49, Fujii Masao :

    After doing this procedure, you can see the startup process and backend
     wait for each other's table lock, i.e., a deadlock. But this deadlock 
remains
    even after deadlock_timeout passes.

    This seems a bug to me.


+1



    > * Deadlocks involving the Startup process and an ordinary backend process
    > * will be detected by the deadlock detector within the ordinary backend.

    The cause of this issue seems that ResolveRecoveryConflictWithLock() that
    the startup process calls when recovery conflict on lock happens doesn't
    take care of deadlock case at all. You can see this fact by reading the 
above
    source code comment for ResolveRecoveryConflictWithLock().

    To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
    timer in ResolveRecoveryConflictWithLock() so that the startup process can
    send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
    Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
    the backend should check whether the deadlock actually happens or not.
     Attached is the POC patch implementing this.


good catch!

I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT shouldn't be 
set in ResolveRecoveryConflictWithLock() too (it is already set in 
ResolveRecoveryConflictWithBufferPin()).

So + 1 to consider this as a bug and for the way the patch proposes to fix it.


Thanks Victor and Bertrand for agreeing!
Attached is the updated version of the patch.


Attached is v3 of the patch. Could you review this version?

While the startup process is waiting for recovery conflict on buffer pin,
it repeats sending the request for deadlock check to all the backends
every deadlock_timeout. This may increase the workload in the startup
process and backends, but since this is the original behavior, the patch
doesn't change that. 


Agree.

IMHO that may need to be rethought (as we are already in a conflict situation, I
am tempted to say that the less that is being done the better), but I think
that's outside the scope of this patch.


Yes, I agree that's better. I think that we should improve that as a separate
patch only for the master branch, after first fixing the bug and back-patching
it.





Also in practice this may not be so harmful because
the period that the buffer is kept pinned is basically not so long.


Agreed that the chances of being in this mode for a "long" duration are lower
(as compared to the lock conflict).



On the other hand, IMO we should avoid this issue while the startup
process is waiting for recovery conflict on locks since it can take
a long time to release the locks. So the patch tries to fix it.

Agree


Or I'm overthinking this? If this doesn't need to be handled,
the patch can be simplified more. Thought?


I do think it's good to handle it that way for the lock conflict: the less
work is done the better (especially in a conflict situation).

The patch does look good to me.


Thanks for the review!

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Single transaction in the tablesync worker?

2020-12-22 Thread Peter Smith
On Mon, Dec 21, 2020 at 11:36 PM Amit Kapila  wrote:
>
> On Mon, Dec 21, 2020 at 3:17 PM Peter Smith  wrote:
> >
> > On Mon, Dec 21, 2020 at 4:23 PM Amit Kapila  wrote:
> >
> > > Few other comments:
> > > ==
> >
> > Thanks for your feedback.
> >
> > > 1.
> > > * FIXME 3 - Crashed tablesync workers may also have remaining slots
> > > because I don't think
> > > + * such workers are even iterated by this loop, and nobody else is
> > > removing them.
> > > + */
> > > + if (slotname)
> > > + {
> > >
> > > The above FIXME is not clear to me. Actually, the crashed workers
> > > should restart, finish their work, and drop the slots. So not sure
> > > what exactly this FIXME refers to?
> >
> > Yes, normally if the tablesync can complete it should behave like that.
> > But I think there are other scenarios where it may be unable to
> > clean-up after itself. For example:
> >
> > i) Maybe the crashed tablesync worker cannot finish. e.g. A row insert
> > handled by tablesync can give a PK violation which also will crash
> > again and again for each re-launched/replacement tablesync worker.
> > This can be reproduced in the debugger. If the DropSubscription
> > doesn't clean-up the tablesync's slot then nobody will.
> >
> > ii) Also DROP SUBSCRIPTION code has locking (see code commit) "to
> > ensure that the launcher doesn't restart new worker during dropping
> > the subscription".
> >
>
> Yeah, I have also read that comment but do you know how it is
> preventing relaunch? How does the subscription lock help?

Hmmm. I did see there is a matching lock in get_subscription_list of
launcher.c, which may be what that code comment was referring to. But
I also am currently unsure how this lock prevents anybody (e.g.
process_syncing_tables_for_apply) from executing another
logicalrep_worker_launch.

>
> > So executing DROP SUBSCRIPTION will prevent a newly
> > crashed tablesync from re-launching, so it won’t be able to take care
> > of its own slot. If the DropSubscription doesn't clean-up that
> > tablesync's slot then nobody will.
> >
>
>
> > >
> > > 2.
> > > DropSubscription()
> > > {
> > > ..
> > > ReplicationSlotDropAtPubNode(
> > > + NULL,
> > > + conninfo, /* use conninfo to make a new connection. */
> > > + subname,
> > > + syncslotname);
> > > ..
> > > }
> > >
> > > With the above call, it will form a connection with the publisher and
> > > drop the required slots. I think we need to save the connection info
> > > so that we don't need to connect/disconnect for each slot to be
> > > dropped. Later in this function, we again connect and drop the apply
> > > worker slot. I think we should connect just once drop the apply and
> > > table sync slots if any.
> >
> > OK. IIUC this is a suggestion for more efficient connection usage,
> > rather than an actual bug, right?
> >
>
> Yes, it is for effective connection usage.
>

I have addressed this in the latest patch [v6]

> >
> > >
> > > 3.
> > > ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn_given, char
> > > *conninfo, char *subname, char *slotname)
> > > {
> > > ..
> > > + PG_TRY();
> > > ..
> > > + PG_CATCH();
> > > + {
> > > + /* NOP. Just gobble any ERROR. */
> > > + }
> > > + PG_END_TRY();
> > >
> > > Why are we suppressing the error instead of handling it the error in
> > > the same way as we do while dropping the apply worker slot in
> > > DropSubscription?
> >
> > This function is common - it is also called from the tablesync
> > finish_sync_worker. But in the finish_sync_worker case I wanted to
> > avoid throwing an ERROR which would cause the tablesync to crash and
> > relaunch (and crash/relaunch/repeat...) when all it was trying to do
> > in the first place was just cleanup and exit the process. Perhaps the
> > error suppression should be conditional depending where this function
> > is called from?
> >
>
> Yeah, that could be one way and if you follow my previous suggestion
> this function might change a bit more.

I have addressed this in the latest patch [v6]

---
[v6] 
https://www.postgresql.org/message-id/CAHut%2BPuCLty2HGNT6neyOcUmBNxOLo%3DybQ2Yv-nTR4kFY-8QLw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia.




Re: libpq compression

2020-12-22 Thread Daniil Zakhlystov
Hi!

I’ve fixed an issue with compression level parsing in this PR 
https://github.com/postgrespro/libpq_compression/pull/4

Also, I did a couple of pgbench benchmarks to measure database resource usage
with different compression levels.

Firstly, I measured the bidirectional compression scenario, i.e. database had 
to do both compression and decompression:

Database setup:
pgbench "host=xxx dbname=xxx  port=5432 user=xxx” -i -s 500

Test run:
pgbench "host=xxx dbname=xxx  port=5432 user=xxx 
compression=zstd:(1/3/5/7/9/11/13/15/17/19/20)" --builtin tpcb-like -t 50 
--jobs=64 --client=700

When using bidirectional compression, Postgres resource usage correlates with 
the selected compression level. For example, here is the PostgreSQL application 
memory usage:

No compression - 1.2 GiB

ZSTD
zstd:1 - 1.4 GiB
zstd:7 - 4.0 GiB
zstd:13 - 17.7 GiB
zstd:19 - 56.3 GiB  

zstd:20 - 109.8 GiB - did not succeed
zstd:21, zstd:22  > 140 GiB
Postgres process crashes (out of memory)

ZLIB
zlib:1 - 1.35 GiB
zlib:5 - 1.35 GiB
zlib:9 - 1.35 GiB

Full report with CPU/Memory/Network consumption graph is available here:
https://docs.google.com/document/d/1qakHcsabZhV70GfSEOjFxmlUDBe21p7DRoPrDPAjKNg

Then, I’ve disabled the compression for the backend and decompression for the 
frontend, and measured the resource usage for the single-directional 
compression scenario (frontend compression only, backend decompression only):

ZSTD
For all ZSTD compression levels, database host resource usage was roughly the 
same, except the Committed Memory (Committed_AS):

no compression - 44.4 GiB
zstd:1 - 45.0 GiB
zstd:3 - 46.1 GiB
zstd:5 - 46.1 GiB
zstd:7 - 46.0 GiB
zstd:9 - 46.0 GiB
zstd:11 - 47.4 GiB
zstd:13 - 47.4 GiB
zstd:15 - 47.4 GiB
zstd:17 - 50.3 GiB
zstd:19 - 50.1 GiB  
zstd:20 - 66.8 GiB  
zstd:21 - 88.7 GiB  
zstd:22 - 123.9 GiB 

ZLIB
For all ZLIB compression levels, database host resource usage was roughly the 
same.

Full report with CPU/Memory/Network consumption graph is available here:
https://docs.google.com/document/d/1gI7c3_YvcL5-PzeK65P0pIY-4BI9KBDwlfPpGhYxrNg

To sum up, there is almost no difference on the decompression side across the 
different compression levels, except for the Committed_AS size. 
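
For reference, here is a minimal libpq sketch showing how a client would request
the same compression setting as the pgbench runs above (this assumes a build with
the libpq compression patch from this thread; host/dbname/user are placeholders,
just as in the pgbench commands):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	/* Same connection-string syntax as used in the pgbench commands above. */
	PGconn	   *conn = PQconnectdb("host=xxx dbname=xxx port=5432 user=xxx "
								   "compression=zstd:5");

	if (PQstatus(conn) != CONNECTION_OK)
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
	else
		printf("connected with compression requested\n");

	PQfinish(conn);
	return 0;
}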

—
Daniil Zakhlystov





RE: Consider Parallelism While Planning For REFRESH MATERIALIZED VIEW

2020-12-22 Thread Hou, Zhijie
Hi

> Added this to commitfest, in case it is useful -
> https://commitfest.postgresql.org/31/2856/

I have a question about the safety of enabling parallel select.

I checked the [parallel insert into select] patch.
https://commitfest.postgresql.org/31/2844/
It seems parallel select is not allowed when the target table is a temporary table.

+   /*
+* We can't support table modification in parallel-mode if it's a 
foreign
+* table/partition (no FDW API for supporting parallel access) or a
+* temporary table.
+*/
+   if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE ||
+   RelationUsesLocalBuffers(rel))
+   {
+   table_close(rel, lockmode);
+   context->max_hazard = PROPARALLEL_UNSAFE;
+   return context->max_hazard;
+   }

For REFRESH MATERIALIZED VIEW:
if CONCURRENTLY is specified, it builds the new data in a temp tablespace:
if (concurrent)
{
tableSpace = GetDefaultTablespace(RELPERSISTENCE_TEMP, false);
relpersistence = RELPERSISTENCE_TEMP;
}

For the above case, should we still consider parallelism?

Best regards,
houzj





Re: [HACKERS] [PATCH] Generic type subscripting

2020-12-22 Thread Pavel Stehule
út 22. 12. 2020 v 11:24 odesílatel Dmitry Dolgov <9erthali...@gmail.com>
napsal:

> > On Fri, Dec 18, 2020 at 08:59:25PM +0100, Dmitry Dolgov wrote:
> > > On Thu, Dec 17, 2020 at 03:29:35PM -0500, Tom Lane wrote:
> > > Dmitry Dolgov <9erthali...@gmail.com> writes:
> > > > On Thu, Dec 17, 2020 at 01:49:17PM -0500, Tom Lane wrote:
> > > >> We can certainly reconsider the API for the parsing hook if there's
> > > >> really a good reason for these to be different types, but it seems
> > > >> like that would just be encouraging poor design.
> > >
> > > > To be more specific, this is the current behaviour (an example from
> the
> > > > tests) and it doesn't seem right:
> > >
> > > > =# update test_jsonb_subscript
> > > >set test_json['a'] = 3 where id = 1;
> > > > UPDATE 1
> > > > =# select jsonb_typeof(test_json->'a')
> > > >from test_jsonb_subscript where id = 1;
> > > >  jsonb_typeof
> > > >  --
> > > >   string
> > >
> > >
> > > I'm rather inclined to think that the result of subscripting a
> > > jsonb (and therefore also the required source type for assignment)
> > > should be jsonb, not just text.  In that case, something like
> > > update ... set jsoncol['a'] = 3
> > > would fail, because there's no cast from integer to jsonb.  You'd
> > > have to write one of
> > > update ... set jsoncol['a'] = '3'
> > > update ... set jsoncol['a'] = '"3"'
> > > to clarify how you wanted the input to be interpreted.
> > > But that seems like a good thing, just as it is for jsonb_in.
> >
> > Yep, that makes sense, will go with this idea.
>
> Here is the new version of jsonb subscripting rebased on the committed
> infrastructure patch. I hope it will not introduce any confusion with
> the previously posted patched in this thread (about alter type subscript
> and hstore) as they are independent.
>
> There are few differences from the previous version:
>
> * No limit on number of subscripts for jsonb (as there is no intrinsic
>   limitation of this kind for jsonb).
>
> * In case of assignment via subscript now it expects the replace value
>   to be of jsonb type.
>
> * Similar to the implementation for arrays, if the source jsonb is NULL,
>   it will be replaced by an empty jsonb and the new value will be
>   assigned to it. This means:
>
> =# select * from test_jsonb_subscript where id = 3;
>  id | test_json
> +---
>   3 | NULL
>
> =# update test_jsonb_subscript set test_json['a'] = '1' where id =
> 3;
> UPDATE 1
>
> =# select * from test_jsonb_subscript where id = 3;
>  id | test_json
> +---
>   3 | {"a": 1}
>
>   and similar:
>
> =# select * from test_jsonb_subscript where id = 3;
>  id | test_json
> +---
>   3 | NULL
>
> =# update test_jsonb_subscript set test_json[1] = '1' where id = 3;
> UPDATE 1
>
> =# select * from test_jsonb_subscript where id = 3;
>  id | test_json
> +---
>   3 | {"1": 1}
>
>   The latter is probably a bit strange looking, but if there are any
> concerns
>   about this part (and in general about an assignment to jsonb which is
> NULL)
>   of the implementation it could be easily changed.
>

Is there a possibility to make an array instead of a record?

I would expect behaviour like

update x set test[1] = 10; --> "[10]";
update x set test['1'] = 10; --> "{"1": 10}"

Regards

Pavel


> * There is nothing to address question about distinguishing a regular text
>   subscript and jsonpath in the patch yet. I guess the idea would be to
> save
>   the original subscript value type before coercing it into text and allow
> a
>   type specific code to convert it back. But I'll probably do it as a
> separate
>   patch when we finish with this one.
>


Re: Single transaction in the tablesync worker?

2020-12-22 Thread Peter Smith
Hi Amit.

PSA my v6 WIP patch for the Solution1.

This patch still applies on top of the v30 patch set [1] from the other 2PC thread:
[1] 
https://www.postgresql.org/message-id/CAFPTHDYA8yE6tEmQ2USYS68kNt%2BkM%3DSwKgj%3Djy4AvFD5e9-UTQ%40mail.gmail.com

(I understand you would like this to be delivered as a separate patch
independent of v30. I will convert it ASAP)



Coded / WIP:

* tablesync slot is now permanent instead of temporary. The tablesync
slot name is no longer tied to the Subscription slot name.

* the tablesync slot cleanup (drop) code is added for DropSubscription
and for finish_sync_worker functions

* tablesync worker now allows multiple transactions instead of a single transaction

* a new state (SUBREL_STATE_COPYDONE) is persisted after a successful
copy_table in LogicalRepSyncTableStart.

* if a relaunched tablesync finds the state is SUBREL_STATE_COPYDONE
then it will bypass the initial copy_table phase.

* tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar to what is done for the apply worker).
The origin is advanced when first created.

* tablesync replication origin tracking is cleaned up during
DropSubscription and/or process_syncing_tables_for_apply

TODO / Known Issues:

* Crashed tablesync workers may not be known to the current
DropSubscription code. This might be a problem for cleaning up slots
and/or origin tracking belonging to those unknown workers.

* There seems to be a race condition during DROP SUBSCRIPTION. It
manifests as the TAP test 007 hanging. Logging shows it seems to be
during replorigin_drop when called from DropSubscription. It is timing
related and quite rare - e.g. it only happens about once in every 10
runs of the subscription TAP tests.

* Help / comments / cleanup

* There is temporary excessive "!!>>" logging of mine scattered around,
which I added to help my testing during development

* Address review comments

---

Kind Regards,
Peter Smith.
Fujitsu Australia


v6-0002-WIP-patch-for-the-Solution1.patch
Description: Binary data


v6-0001-2PC-change-tablesync-slot-to-use-same-two_phase-m.patch
Description: Binary data


Re: pg_preadv() and pg_pwritev()

2020-12-22 Thread Thomas Munro
On Mon, Dec 21, 2020 at 11:40 AM Andres Freund  wrote:
> On 2020-12-20 16:26:42 +1300, Thomas Munro wrote:
> > > 1. port.h cannot assume that  has already been included;
> > > nor do I want to fix that by including  there.  Do we
> > > really need to define a fallback value of IOV_MAX?  If so,
> > > maybe the answer is to put the replacement struct iovec and
> > > IOV_MAX in some new header.
> >
> > Ok, I moved all this stuff into port/pg_uio.h.
>
> Can we come up with a better name than 'uio'? I find that a not exactly
> meaningful name.

Ok, let's try port/pg_iovec.h.

> Or perhaps we could just leave the functions in port/port.h, but extract
> the value of the define at configure time? That way only pread.c /
> pwrite.c would need it.

That won't work for the struct definition, so client code would need
to remember to do:

#ifdef HAVE_SYS_UIO_H
#include <sys/uio.h>
#endif

... which is a little tedious, or port.h would need to do that and
incur an overhead in every translation unit, which Tom objected to.
That's why I liked the separate header idea.

> > > 3. The patch as given won't prove anything except that the code
> > > compiles.  Is it worth fixing at least one code path to make
> > > use of pg_preadv and pg_pwritev, so we can make sure this code
> > > is tested before there's layers of other new code on top?
> >
> > OK, here's a patch to zero-fill fresh WAL segments with pwritev().
>
> I think that's a good idea. However, I see two small issues: 1) If we
> write larger amounts at once, we need to handle partial writes. That's a
> large enough amount of IO that it's much more likely to hit a memory
> shortage or such in the kernel - we had to do that a while a go for WAL
> writes (which can also be larger), if memory serves.
>
> Perhaps we should have pg_pwritev/readv unconditionally go through
> pwrite.c/pread.c and add support for "continuing" partial writes/reads
> in one central place?

Ok, here's a new version with the following changes:

1.  Define PG_IOV_MAX, a reasonable size for local variables, not
larger than IOV_MAX.
2.  Use 32 rather than 1024, based on an off-list complaint about 1024
potentially swamping the IO system unfairly.
3.  Add a wrapper pg_pwritev_retry() to retry automatically on short
writes.  (I didn't write pg_preadv_retry(), because I don't currently
need it for anything so I didn't want to have to think about EOF vs
short-reads-for-implementation-reasons.)
4.  I considered whether pg_pwrite() should have built-in retry
instead of a separate wrapper, but I thought of an argument against
hiding the "raw" version: the AIO patch set already understands short
reads/writes and knows how to retry at a higher level, as that's
needed for native AIO too, so I think it makes sense to be able to
keep things the same and not solve the same problem twice.  A counter
argument would be that you could get the retry underway sooner with a
tight loop, but I'm not expecting this to be common.
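
To make the short-write handling concrete, here is a minimal sketch of the kind
of retry loop described in point 3 -- an illustration only, not the code from
the attached patch; it assumes POSIX pwritev() and that the caller's iovec
array may be modified:

#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

static ssize_t
pwritev_retry_sketch(int fd, struct iovec *iov, int iovcnt, off_t offset)
{
	ssize_t		total = 0;

	for (;;)
	{
		ssize_t		n = pwritev(fd, iov, iovcnt, offset);

		if (n < 0)
			return -1;			/* caller inspects errno */

		total += n;
		offset += n;

		/* Skip over any iovecs that were written completely. */
		while (iovcnt > 0 && (size_t) n >= iov->iov_len)
		{
			n -= (ssize_t) iov->iov_len;
			iov++;
			iovcnt--;
		}

		if (iovcnt == 0)
			return total;		/* everything was written */

		/* Partial write within one iovec: advance into it and retry. */
		iov->iov_base = (char *) iov->iov_base + n;
		iov->iov_len -= (size_t) n;
	}
}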

> > I'm drawing a blank on trivial candidate uses for preadv(), without
> > infrastructure from later patches.
>
> Can't immediately think of something either.

One idea I had for the future is for xlogreader.c to read the WAL into
a large multi-page circular buffer, which could wrap around using a
pair of iovecs, but that requires lots more work.
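
Just to illustrate that idea (a hypothetical helper, not part of the attached
patch): a read that wraps past the end of a ring buffer maps naturally onto two
iovecs, which preadv() can then consume in a single call.

#include <stddef.h>
#include <sys/uio.h>

/*
 * Fill at most two iovecs covering 'nbytes' of a circular buffer 'buf' of
 * size 'size', starting at position 'pos'.  Returns the number of iovecs
 * filled (1, or 2 if the range wraps past the end of the buffer).
 */
static int
fill_wrapping_iov(struct iovec *iov, char *buf, size_t size,
				  size_t pos, size_t nbytes)
{
	size_t		first = size - pos; /* room before the wrap point */

	iov[0].iov_base = buf + pos;
	if (nbytes <= first)
	{
		iov[0].iov_len = nbytes;
		return 1;
	}
	iov[0].iov_len = first;
	iov[1].iov_base = buf;		/* wrap around to the start */
	iov[1].iov_len = nbytes - first;
	return 2;
}
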
From 3da0b25b4ce0804355f912bd026ef9c9eee146f3 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 26 Nov 2020 15:48:31 +1300
Subject: [PATCH v3 1/2] Add pg_preadv() and pg_pwritev().

Provide synchronous scatter/gather I/O routines.  These map to preadv(),
pwritev() with various fallbacks for systems that don't have them.  Also
provide a wrapper pg_pwritev_retry() which automatically retries on
short write.

Reviewed-by: Tom Lane 
Reviewed-by: Andres Freund 
Discussion: https://postgr.es/m/CA%2BhUKGJA%2Bu-220VONeoREBXJ9P3S94Y7J%2BkqCnTYmahvZJwM%3Dg%40mail.gmail.com
---
 configure   |  30 +-
 configure.ac|   9 ++-
 src/include/pg_config.h.in  |  15 +
 src/include/port.h  |   2 +
 src/include/port/pg_iovec.h |  57 +++
 src/port/Makefile   |   2 +
 src/port/pread.c|  43 +-
 src/port/pwrite.c   | 109 +++-
 src/tools/msvc/Solution.pm  |   5 ++
 9 files changed, 238 insertions(+), 34 deletions(-)
 create mode 100644 src/include/port/pg_iovec.h

diff --git a/configure b/configure
index 11a4284e5b..5887c712cc 100755
--- a/configure
+++ b/configure
@@ -13061,7 +13061,7 @@ $as_echo "#define HAVE_STDBOOL_H 1" >>confdefs.h
 fi
 
 
-for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/event.h sys/ipc.h sys/prctl.h sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/un.h termios.h ucred.h wctype.h
+for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/event.h sys/ipc.h 

Re: postgres_fdw - cached connection leaks if the associated user mapping/foreign server is dropped

2020-12-22 Thread Bharath Rupireddy
On Fri, Dec 18, 2020 at 6:46 PM Bharath Rupireddy
 wrote:
> On Fri, Dec 18, 2020 at 6:39 PM Bharath Rupireddy
>  wrote:
> >
> > On Fri, Dec 18, 2020 at 5:06 PM Hou, Zhijie  
> > wrote:
> > > I have an issue about the existing testcase.
> > >
> > > """
> > > -- Test that alteration of server options causes reconnection
> > > SELECT c3, c4 FROM ft1 ORDER BY c3, c1 LIMIT 1;  -- should work
> > > ALTER SERVER loopback OPTIONS (SET dbname 'no such database');
> > > SELECT c3, c4 FROM ft1 ORDER BY c3, c1 LIMIT 1;  -- should fail
> > > DO $d$
> > > BEGIN
> > > EXECUTE $$ALTER SERVER loopback
> > > OPTIONS (SET dbname '$$||current_database()||$$')$$;
> > > END;
> > > $d$;
> > > SELECT c3, c4 FROM ft1 ORDER BY c3, c1 LIMIT 1;  -- should work again
> > > """
> > >
> > > IMO, the above case is designed to test the following code[1]:
> > > With the patch, it seems the following code[1] will not work for this 
> > > case, right?
> > > (It seems the connection will be disconnect in pgfdw_xact_callback)
> > >
> > > I do not know does it matter, or should we add a testcase to cover that?
> > >
> > > [1] /*
> > >  * If the connection needs to be remade due to invalidation, 
> > > disconnect as
> > >  * soon as we're out of all transactions.
> > >  */
> > > if (entry->conn != NULL && entry->invalidated && 
> > > entry->xact_depth == 0)
> > > {
> > > elog(DEBUG3, "closing connection %p for option changes to 
> > > take effect",
> > >  entry->conn);
> > > disconnect_pg_server(entry);
> > > }
> >
> > Yes you are right. With the patch if the server is altered in the same
> > session in which the connection exists, the connection gets closed at
> > the end of that alter query txn, not at the beginning of the next
> > foreign tbl query. So, that part of the code in GetConnection()
> > doesn't get hit. Having said that, this code gets hit when the alter
> > query is run in another session and the connection in the current
> > session gets disconnected at the start of the next foreign tbl query.
> >
> > If we want to cover it with a test case, we might have to alter the
> > foreign server in another session. I'm not sure whether we can open
> > another session from the current psql session and run a sql command.

I further checked how we can add/move the test case (that is, altering
server options in a different session and seeing whether the connection
gets disconnected at the start of the next foreign query in the current
session) to cover the above code. Upon some initial analysis, it looks
like it is possible to add that under the src/test/isolation tests.
Another way could be to add it using the TAP framework under
contrib/postgres_fdw. Having said that, currently neither of these two
places has any postgres_fdw related tests, so we would be the first to
add them.

I'm not quite sure whether that's okay or whether there is a better way
of doing it. Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Reduce the number of special cases to build contrib modules on windows

2020-12-22 Thread David Rowley
On Wed, 11 Nov 2020 at 13:44, Michael Paquier  wrote:
>
> On Wed, Nov 11, 2020 at 11:01:57AM +1300, David Rowley wrote:
> > I'm still working through some small differences in some of the
> > .vcxproj files.  I've been comparing these by copying *.vcxproj out to
> > another directory with patched and unpatched then diffing the
> > directory. See attached txt file with those diffs. Here's a summary of
> > some of them:
>
> Thanks.  It would be good to not have those diffs if not necessary.
>
> > 1. There are a few places that libpq gets linked where it previously did 
> > not.
>
> It seems to me that your patch is doing the right thing for adminpack
> and that its Makefile has no need to include a reference to libpq
> source path, no?

Yeah.  Likely a separate commit should remove the -I$(libpq_srcdir)
from adminpack and old_snapshot

> For dblink and postgres_fdw, the duplication comes from PG_CPPFLAGS.
> It does not matter much in practice, but it would be nice to not have
> unnecessary data in the project files.  One thing that could be done
> is to make Project.pm more aware of the uniqueness of the elements
> included.  But, do we really need -I$(libpq_srcdir) there anyway?
> From what I can see, we have all the paths in -I we'd actually need
> with or without USE_PGXS.

I've changed the patch to do that for the includes. I'm now putting
the list of include directories in a hash table to get rid of the
duplicates.  This does shuffle the order of them around a bit. I've
done the same for references too.

> > 3. LOWER_NODE gets defined in ltree now where it wasn't before.  It's
> > defined on Linux. Unsure why it wasn't before on Windows.
>
> Your patch is grabbing the value of PG_CPPFLAGS from ltree's
> Makefile, which is fine.  We may be able to remove this flag and rely
> on pg_tolower() instead in the long run?  I am not sure about
> FLG_CANLOOKSIGN() though.

I didn't look in detail, but it looks like defining LOWER_NODE on
Windows might break pg_upgrade.  I guess you could say it's partially
broken now, as the behaviour there depends on whether you build using
Visual Studio or cygwin.  We'd define LOWER_NODE on cygwin but not on
VS.  Looks like pg_upgrade might be problematic there today.

It feels a bit annoying to add some special case to the script to
maintain the status quo there.  An alternative to that would be to
modify the .c code at #ifdef LOWER_NODE to also check we're not
building on VS. Neither option seems nice.

I've attached the updated patch and also a diff showing the changes in
the *.vcxproj files.

There are quite a few places where the hash table code for includes
and references gets rid of duplicates that already exist today. For
example pgbench.vcxproj references libpgport.vcxproj and
libpgcommon.vcxproj twice.

David
diff --git a/src/tools/msvc/MSBuildProject.pm b/src/tools/msvc/MSBuildProject.pm
index ebb169e201..2cf3fee65d 100644
--- a/src/tools/msvc/MSBuildProject.pm
+++ b/src/tools/msvc/MSBuildProject.pm
@@ -109,15 +109,13 @@ sub AddDefine
 sub WriteReferences
 {
my ($self, $f) = @_;
-
-   my @references = @{ $self->{references} };
-
-   if (scalar(@references))
+   # Add referenced projects, if any exist.
+   if (scalar(keys % { $self->{references} }) > 0)
{
print $f <
 EOF
-   foreach my $ref (@references)
+   foreach my $ref (values % { $self->{references} } )
{
print $f <
@@ -310,11 +308,12 @@ sub WriteItemDefinitionGroup
my $targetmachine =
  $self->{platform} eq 'Win32' ? 'MachineX86' : 'MachineX64';
 
-   my $includes = $self->{includes};
-   unless ($includes eq '' or $includes =~ /;$/)
+   my $includes = "";
+   foreach my $inc (keys %{ $self->{includes} } )
{
-   $includes .= ';';
+   $includes .= $inc . ";";
}
+
print $f <
 
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index f92c14030d..8db5def9ce 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -32,16 +32,13 @@ my $libpq;
 my @unlink_on_exit;
 
 # Set of variables for modules in contrib/ and src/test/modules/
-my $contrib_defines = { 'refint' => 'REFINT_VERBOSE' };
-my @contrib_uselibpq = ('dblink', 'oid2name', 'postgres_fdw', 'vacuumlo');
-my @contrib_uselibpgport   = ('oid2name', 'pg_standby', 'vacuumlo');
-my @contrib_uselibpgcommon = ('oid2name', 'pg_standby', 'vacuumlo');
+my $contrib_defines = {};
+my @contrib_uselibpq = ();
+my @contrib_uselibpgport   = ('pg_standby');
+my @contrib_uselibpgcommon = ('pg_standby');
 my $contrib_extralibs  = undef;
-my $contrib_extraincludes = { 'dblink' => ['src/backend'] };
-my $contrib_extrasource = {
-   'cube' => [ 'contrib/cube/cubescan.l', 'contrib/cube/cubeparse.y' ],
-   'seg'  => [ 'contrib/seg/segscan.l',   'contrib/seg/segparse.y' ],
-};
+my $contrib_extraincludes = {};
+my 

Re: [HACKERS] [PATCH] Generic type subscripting

2020-12-22 Thread Dmitry Dolgov
> On Fri, Dec 18, 2020 at 08:59:25PM +0100, Dmitry Dolgov wrote:
> > On Thu, Dec 17, 2020 at 03:29:35PM -0500, Tom Lane wrote:
> > Dmitry Dolgov <9erthali...@gmail.com> writes:
> > > On Thu, Dec 17, 2020 at 01:49:17PM -0500, Tom Lane wrote:
> > >> We can certainly reconsider the API for the parsing hook if there's
> > >> really a good reason for these to be different types, but it seems
> > >> like that would just be encouraging poor design.
> >
> > > To be more specific, this is the current behaviour (an example from the
> > > tests) and it doesn't seem right:
> >
> > > =# update test_jsonb_subscript
> > >set test_json['a'] = 3 where id = 1;
> > > UPDATE 1
> > > =# select jsonb_typeof(test_json->'a')
> > >from test_jsonb_subscript where id = 1;
> > >  jsonb_typeof
> > >  --
> > >   string
> >
> >
> > I'm rather inclined to think that the result of subscripting a
> > jsonb (and therefore also the required source type for assignment)
> > should be jsonb, not just text.  In that case, something like
> > update ... set jsoncol['a'] = 3
> > would fail, because there's no cast from integer to jsonb.  You'd
> > have to write one of
> > update ... set jsoncol['a'] = '3'
> > update ... set jsoncol['a'] = '"3"'
> > to clarify how you wanted the input to be interpreted.
> > But that seems like a good thing, just as it is for jsonb_in.
>
> Yep, that makes sense, will go with this idea.

Here is the new version of jsonb subscripting rebased on the committed
infrastructure patch. I hope it will not introduce any confusion with
the previously posted patched in this thread (about alter type subscript
and hstore) as they are independent.

There are a few differences from the previous version:

* No limit on number of subscripts for jsonb (as there is no intrinsic
  limitation of this kind for jsonb).

* In case of assignment via subscript now it expects the replace value
  to be of jsonb type.

* Similar to the implementation for arrays, if the source jsonb is NULL,
  it will be replaced by an empty jsonb and the new value will be
  assigned to it. This means:

=# select * from test_jsonb_subscript where id = 3;
 id | test_json
+---
  3 | NULL

=# update test_jsonb_subscript set test_json['a'] = '1' where id = 3;
UPDATE 1

=# select * from test_jsonb_subscript where id = 3;
 id | test_json
+---
  3 | {"a": 1}

  and similar:

=# select * from test_jsonb_subscript where id = 3;
 id | test_json
+---
  3 | NULL

=# update test_jsonb_subscript set test_json[1] = '1' where id = 3;
UPDATE 1

=# select * from test_jsonb_subscript where id = 3;
 id | test_json
+---
  3 | {"1": 1}

  The latter is probably a bit strange looking, but if there are any concerns
  about this part (and in general about an assignment to jsonb which is NULL)
  of the implementation it could be easily changed.

* There is nothing to address question about distinguishing a regular text
  subscript and jsonpath in the patch yet. I guess the idea would be to save
  the original subscript value type before coercing it into text and allow a
  type specific code to convert it back. But I'll probably do it as a separate
  patch when we finish with this one.
>From d2aab172a4e70af0684a937b99c426652231f456 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Fri, 18 Dec 2020 17:19:51 +0100
Subject: [PATCH v38] Subscripting for jsonb

Subscripting implementation for jsonb. It does not support slices, does
not have a limit on the number of subscripts, and for assignment expects
the replace value to be of jsonb type. There is also one functional
difference in assignment via subscripting from jsonb_set: when the
original jsonb container is NULL, subscripting replaces it with an empty
jsonb of type object and proceeds with the assignment.

For the sake of code reuse, some parts of the jsonb functionality were
rearranged to allow using the same functions for jsonb_set and the
subscripting assignment operation.

The original idea belongs to Oleg Bartunov.

Reviewed-by: Tom Lane, Arthur Zakirov, Pavel Stehule
---
 doc/src/sgml/json.sgml  |  48 +
 src/backend/utils/adt/Makefile  |   1 +
 src/backend/utils/adt/jsonb_util.c  |  76 +++-
 src/backend/utils/adt/jsonbsubs.c   | 282 
 src/backend/utils/adt/jsonfuncs.c   | 180 +-
 src/include/catalog/pg_proc.dat |   4 +
 src/include/catalog/pg_type.dat |   3 +-
 src/include/utils/jsonb.h   |   6 +-
 src/test/regress/expected/jsonb.out | 273 ++-
 src/test/regress/sql/jsonb.sql  |  84 -
 10 files changed, 852 insertions(+), 105 deletions(-)
 create mode 100644 src/backend/utils/adt/jsonbsubs.c

diff --git a/doc/src/sgml/json.sgml 

Re: [HACKERS] logical decoding of two-phase transactions

2020-12-22 Thread Amit Kapila
On Tue, Dec 22, 2020 at 2:51 PM Ajin Cherian  wrote:
>
> On Sat, Dec 19, 2020 at 2:13 PM Amit Kapila  wrote:
>
> > Okay, I have changed the rollback_prepare API as discussed above and
> > accordingly handle the case where rollback is received without prepare
> > in apply_handle_rollback_prepared.
>
>
> I have reviewed and tested your new patchset, I agree with all the
> changes that you have made and have tested quite a few scenarios and
> they seem to be working as expected.
> No major comments but some minor observations:
>
> Patch 1:
> logical.c: 984
> Comment should be "rollback prepared" rather than "abort prepared".
>

Agreed.

> Patch 2:
> decode.c: 737: The comments in the header of DecodePrepare seem out of
> place, I think here it should describe what the function does rather
> than what it does not.
>

Hmm, I have written it that way because it is important to explain the
theory of concurrent aborts, as that is not quite obvious. Also, the
functionality is quite similar to DecodeCommit, and the comments inside
the function clearly explain any differences, so I am not sure what else
we could write. Do you have any suggestions?

> reorderbuffer.c: 2422: It looks like pg_indent has mangled the
> comments, the numbering is no longer aligned.
>

Yeah, I had also noticed that, but I am not sure there is a better
alternative because we don't want to change it after each pgindent
run. We might want to use (a), (b), ... notation instead, but otherwise
there is no big problem with how it is.

> Patch 5:
> worker.c: 753: Type: change "dont" to "don't"
>

Okay.

> Patch 6: logicaldecoding.sgml
> logicaldecoding example is no longer correct. This was true prior to
> the changes done to replay prepared transactions after a restart. Now
> the whole transaction will get decoded again after the commit
> prepared.
>
> postgres=# COMMIT PREPARED 'test_prepared1';
> COMMIT PREPARED
> postgres=# select * from
> pg_logical_slot_get_changes('regression_slot', NULL, NULL,
> 'two-phase-commit', '1');
> lsn| xid |data
> ---+-+
>  0/168A060 | 529 | COMMIT PREPARED 'test_prepared1', txid 529
> (1 row)
>

Agreed.

> Patch 8:
> worker.c: 2798 :
> worker.c: 3445 : disabling two-phase in tablesync worker.
>  considering new design of multiple commits in tablesync, do we need
> to disable two-phase in tablesync?
>

No, but let Peter's patch get committed then we can change it.

> Other than this I've noticed a few typos that are not in the patch but
> in the surrounding code.
> logical.c: 1383: Comment should mention stream_commit_cb not stream_abort_cb.
> decode.c: 686 - Extra "it's" here:  "because it's it happened"
>

Anything not related to this patch, please post in a separate email.

Can you please update the patch for the points we agreed upon?

-- 
With Regards,
Amit Kapila.




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-22 Thread Michael Paquier
On Tue, Dec 22, 2020 at 02:32:05AM -0600, Justin Pryzby wrote:
> Also, this one is going to be subsumed by ExecReindex(), so the palloc will go
> away (otherwise I would ask to pass it in from the caller):

Yeah, maybe.  Still, you need to be very careful if you have any
allocated variables, like a tablespace or a path, which need to be
in the private context used by ReindexMultipleInternal() or even
ReindexRelationConcurrently(), so I am not sure you can avoid that
completely.  For now, we could choose the option of still using a
palloc(), and then save the options in the private contexts.  I
actually forgot that in the previous version.
--
Michael
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index c041628049..89394b648e 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -30,13 +30,16 @@ typedef enum
 } IndexStateFlagsAction;
 
 /* options for REINDEX */
-typedef enum ReindexOption
+typedef struct ReindexOptions
 {
-	REINDEXOPT_VERBOSE = 1 << 0,	/* print progress info */
-	REINDEXOPT_REPORT_PROGRESS = 1 << 1,	/* report pgstat progress */
-	REINDEXOPT_MISSING_OK = 1 << 2, /* skip missing relations */
-	REINDEXOPT_CONCURRENTLY = 1 << 3	/* concurrent mode */
-} ReindexOption;
+	bits32		flags;			/* bitmask of REINDEXOPT_* */
+} ReindexOptions;
+
+/* flag bits for ReindexOptions->flags */
+#define REINDEXOPT_VERBOSE		0x01	/* print progress info */
+#define REINDEXOPT_REPORT_PROGRESS 0x02 /* report pgstat progress */
+#define REINDEXOPT_MISSING_OK 	0x04	/* skip missing relations */
+#define REINDEXOPT_CONCURRENTLY	0x08	/* concurrent mode */
 
 /* state info for validate_index bulkdelete callback */
 typedef struct ValidateIndexState
@@ -146,7 +149,7 @@ extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 extern Oid	IndexGetRelation(Oid indexId, bool missing_ok);
 
 extern void reindex_index(Oid indexId, bool skip_constraint_checks,
-		  char relpersistence, int options);
+		  char relpersistence, ReindexOptions *options);
 
 /* Flag bits for reindex_relation(): */
 #define REINDEX_REL_PROCESS_TOAST			0x01
@@ -155,7 +158,7 @@ extern void reindex_index(Oid indexId, bool skip_constraint_checks,
 #define REINDEX_REL_FORCE_INDEXES_UNLOGGED	0x08
 #define REINDEX_REL_FORCE_INDEXES_PERMANENT 0x10
 
-extern bool reindex_relation(Oid relid, int flags, int options);
+extern bool reindex_relation(Oid relid, int flags, ReindexOptions *options);
 
 extern bool ReindexIsProcessingHeap(Oid heapOid);
 extern bool ReindexIsProcessingIndex(Oid indexOid);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 7cfb37c9b2..c66629cf73 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -18,16 +18,17 @@
 #include "storage/lock.h"
 #include "utils/relcache.h"
 
-
 /* options for CLUSTER */
-typedef enum ClusterOption
+#define CLUOPT_RECHECK 0x01		/* recheck relation state */
+#define CLUOPT_VERBOSE 0x02		/* print progress info */
+
+typedef struct ClusterOptions
 {
-	CLUOPT_RECHECK = 1 << 0,	/* recheck relation state */
-	CLUOPT_VERBOSE = 1 << 1		/* print progress info */
-} ClusterOption;
+	bits32		flags;			/* bitmask of CLUSTEROPT_* */
+} ClusterOptions;
 
 extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Oid tableOid, Oid indexOid, int options);
+extern void cluster_rel(Oid tableOid, Oid indexOid, ClusterOptions *options);
 extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
 	   bool recheck, LOCKMODE lockmode);
 extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
diff --git a/src/include/commands/defrem.h b/src/include/commands/defrem.h
index 1133ae1143..43d5480c20 100644
--- a/src/include/commands/defrem.h
+++ b/src/include/commands/defrem.h
@@ -14,6 +14,7 @@
 #ifndef DEFREM_H
 #define DEFREM_H
 
+#include "catalog/index.h"
 #include "catalog/objectaddress.h"
 #include "nodes/params.h"
 #include "parser/parse_node.h"
@@ -34,11 +35,16 @@ extern ObjectAddress DefineIndex(Oid relationId,
  bool check_not_in_use,
  bool skip_build,
  bool quiet);
-extern int	ReindexParseOptions(ParseState *pstate, ReindexStmt *stmt);
-extern void ReindexIndex(RangeVar *indexRelation, int options, bool isTopLevel);
-extern Oid	ReindexTable(RangeVar *relation, int options, bool isTopLevel);
-extern void ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind,
-  int options);
+extern ReindexOptions *ReindexParseOptions(ParseState *pstate, ReindexStmt *stmt);
+extern void ReindexIndex(RangeVar *indexRelation,
+		 ReindexOptions *options,
+		 bool isTopLevel);
+extern Oid	ReindexTable(RangeVar *relation,
+		 ReindexOptions *options,
+		 bool isTopLevel);
+extern void ReindexMultipleTables(const char *objectName,
+  ReindexObjectType objectKind,
+  ReindexOptions *options);
 extern char *makeObjectName(const char *name1, 

Re: Fail Fast In CTAS/CMV If Relation Already Exists To Avoid Unnecessary Rewrite, Planning Costs

2020-12-22 Thread Bharath Rupireddy
On Tue, Dec 22, 2020 at 2:07 PM Michael Paquier  wrote:
> I was looking at your patch today, and I actually found the conclusion
> to output an empty plan while issuing a NOTICE to be quite intuitive
> if the caller uses IF NOT EXISTS with EXPLAIN.

Thanks!

> Thanks for adding some test cases!  Some of them were exact
> duplicates, so it is possible to reduce the number of queries without
> impacting the coverage.  I have also chosen a query that forces an
> error within the planner.
> Please see the attached.  IF NOT EXISTS implies that CTAS or CREATE
> MATVIEW will never ERROR if the relation already exists, with or
> without EXPLAIN, EXECUTE or WITH NO DATA, so that gets us a consistent
> behavior across all the patterns.

LGTM.

> Note: I'd like to think that we could choose a better name for
> CheckRelExistenceInCTAS().

I changed it to IsCTASRelCreationAllowed() and attached a v5 patch.
Please let me know if this is okay.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From fb96bbaa5fb5797517902e367dc134733da6d76c Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 22 Dec 2020 14:51:25 +0530
Subject: [PATCH v5] Fail Fast In CTAS/CMV If Relation Already Exists

Currently, for CTAS or CREATE MATERIALIZED VIEW(CMV) without
IF-NOT-EXISTS clause, the existence of the relation (either
table or materialized view) gets checked during execution and an
error is thrown there. All the unnecessary rewrite and planning
for the SELECT part of the query have happened just to fail
later. However, if IF-NOT-EXISTS clause is present, then a notice
is issued and returned immediately without rewrite and planning
further. This seems somewhat inconsistent.

This patch proposes to check the relation existence early in
ExecCreateTableAs() as well as in ExplainOneUtility() and
throw an error in case it exists already to avoid unnecessary
rewrite, planning and execution of the SELECT part.
---
 src/backend/commands/createas.c   | 59 +--
 src/backend/commands/explain.c|  8 +++
 src/include/commands/createas.h   |  2 +
 src/test/regress/expected/matview.out | 38 +++
 src/test/regress/expected/select_into.out | 31 
 src/test/regress/sql/matview.sql  | 23 +
 src/test/regress/sql/select_into.sql  | 16 ++
 7 files changed, 162 insertions(+), 15 deletions(-)

diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 6bf6c5a310..49feadfb9a 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -239,21 +239,9 @@ ExecCreateTableAs(ParseState *pstate, CreateTableAsStmt *stmt,
 	PlannedStmt *plan;
 	QueryDesc  *queryDesc;
 
-	if (stmt->if_not_exists)
-	{
-		Oid			nspid;
-
-		nspid = RangeVarGetCreationNamespace(stmt->into->rel);
-
-		if (get_relname_relid(stmt->into->rel->relname, nspid))
-		{
-			ereport(NOTICE,
-	(errcode(ERRCODE_DUPLICATE_TABLE),
-	 errmsg("relation \"%s\" already exists, skipping",
-			stmt->into->rel->relname)));
-			return InvalidObjectAddress;
-		}
-	}
+	/* Check if the to-be-created relation exists or not */
+	if (!IsCTASRelCreationAllowed(stmt))
+		return InvalidObjectAddress;
 
 	/*
 	 * Create the tuple receiver object and insert info it will need
@@ -400,6 +388,47 @@ GetIntoRelEFlags(IntoClause *intoClause)
 	return flags;
 }
 
+/*
+ * IsCTASRelCreationAllowed --- check existence of relation in CREATE TABLE AS
+ *
+ * This function checks if the relation CTAS is trying to create already
+ * exists. If so, with if-not-exists clause it issues a notice and returns
+ * false, without if-not-exists clause it throws an error. Returns true if the
+ * relation can be created. This routine can be used as a fast-exit path when
+ * checking if a CTAS query needs to be executed or not.
+ */
+bool
+IsCTASRelCreationAllowed(CreateTableAsStmt *ctas)
+{
+	Oid nspid;
+	IntoClause *into = ctas->into;
+
+	nspid = RangeVarGetCreationNamespace(into->rel);
+
+	if (get_relname_relid(into->rel->relname, nspid))
+	{
+		if (!ctas->if_not_exists)
+			ereport(ERROR,
+	(errcode(ERRCODE_DUPLICATE_TABLE),
+	 errmsg("relation \"%s\" already exists",
+			into->rel->relname)));
+
+		/*
+		 * The relation exists and IF NOT EXISTS has been specified,
+		 * so let the caller know about that and let the processing
+		 * continue.
+		 */
+		ereport(NOTICE,
+(errcode(ERRCODE_DUPLICATE_TABLE),
+ errmsg("relation \"%s\" already exists, skipping",
+		into->rel->relname)));
+		return false;
+	}
+
+	/* Relation does not exist, so it can be created */
+	return true;
+}
+
 /*
  * CreateIntoRelDestReceiver -- create a suitable DestReceiver object
  *
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 43f9b01e83..372c6a196c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -435,6 +435,14 @@ ExplainOneUtility(Node *utilityStmt, IntoClause *into, ExplainState *es,
 		

Re: On login trigger: take three

2020-12-22 Thread Pavel Stehule
Hi

po 21. 12. 2020 v 11:06 odesílatel Konstantin Knizhnik <
k.knizh...@postgrespro.ru> napsal:

>
>
> On 20.12.2020 10:04, Pavel Stehule wrote:
>
>
>
> čt 17. 12. 2020 v 19:30 odesílatel Pavel Stehule 
> napsal:
>
>>
>>
>> čt 17. 12. 2020 v 14:04 odesílatel Konstantin Knizhnik <
>> k.knizh...@postgrespro.ru> napsal:
>>
>>>
>>>
>>> On 17.12.2020 9:31, Pavel Stehule wrote:
>>>
>>>
>>>
>>> st 16. 12. 2020 v 20:38 odesílatel Pavel Stehule <
>>> pavel.steh...@gmail.com> napsal:
>>>
 Attached please find new versoin of the patch based on
 on_connect_event_trigger_WITH_SUGGESTED_UPDATES.patch

> So there is still only  "disable_client_connection_trigger" GUC?
> because we need possibility to disable client connect triggers and there 
> is
> no such need for other event types.
>
> As you suggested  have added "dathaslogontriggers" flag which is set
> when client connection trigger is created.
> This flag is never cleaned (to avoid visibility issues mentioned in my
> previous mail).
>

 This is much better - I don't see any slowdown when logon trigger is
 not defined

 I did some benchmarks and looks so starting language handler is
 relatively expensive - it is about 25% and starting handler like event
 trigger has about 35%. But this problem can be solved later and elsewhere.

 I prefer the inverse form of disable_connection_trigger. Almost all GUC
 are in positive form - so enable_connection_triggger is better

 I don't think so current handling dathaslogontriggers is good for
 production usage. The protection against race condition can be solved by
 lock on pg_event_trigger

>>>
>>> I thought about it, and probably the counter of connect triggers will be
>>> better there. The implementation will be simpler and more robust.
>>>
>>>
>>>
>>>
>>> I prefer to implement different approach: unset dathaslogontriggers
>>> flag in event trigger itself when no triggers are returned by
>>> EventTriggerCommonSetup.
>>> I am using double checking to prevent race condition.
>>> The main reason for this approach is that dropping of triggers is not
>>> done in event_trigger.c: it is done by generic code for all resources.
>>> It seems to be there is no right place where decrementing of trigger
>>> counters can be inserted.
>>> Also I prefer to keep all logon-trigger specific code in one file.
>>>
>>> Also, as you suggested, I renamed disable_connection_trigger to
>>> enable_connection_trigger.
>>>
>>>
>>> New version of the patch is attached.
>>>
>>
>> yes, it can work
>>
>
> for complete review the new field in pg_doc should be documented
>
>
> Done.
> Also I fixed some examples in documentation which involves pg_database.
>

regression tests fail

 sysviews ... FAILED  112 ms
test event_trigger... FAILED (test process exited with exit
code 2)  447 ms
test fast_default ... FAILED  392 ms
test stats... FAILED  626 ms
== shutting down postmaster   ==

Regards

Pavel


> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread Amit Kapila
On Tue, Dec 22, 2020 at 8:30 AM Kyotaro Horiguchi
 wrote:
>
> At Tue, 22 Dec 2020 02:48:22 +, "tsunakawa.ta...@fujitsu.com" 
>  wrote in
> > From: Amit Kapila 
> > > Why would all client backends wait for AccessExclusive lock on this
> > > relation? Say, a client needs a buffer for some other relation and
> > > that might evict this buffer after we release the lock on the
> > > partition. In StrategyGetBuffer, it is important to either have a pin
> > > on the buffer or the buffer header itself must be locked to avoid
> > > getting picked as victim buffer. Am I missing something?
> >
> > Ouch, right.  (The year-end business must be making me crazy...)
> >
> > So, there are two choices here:
> >
> > 1) The current patch.
> > 2) Acquire the buffer header spinlock before releasing the buffer mapping 
> > lwlock, and eliminate the buffer tag comparison as follows:
> >
> >   BufTableLookup();
> >   LockBufHdr();
> >   LWLockRelease();
> >   InvalidateBuffer();
> >
> > I think both are okay.  If I must choose either, I kind of prefer 1), 
> > because LWLockRelease() could take longer time to wake up other processes 
> > waiting on the lwlock, which is not very good to do while holding a 
> > spinlock.
>
> I like, as said before, the current patch.
>

Attached, please find the updated patch with the following
modifications: (a) updated comments at various places, especially to
explain why this is a safe optimization, (b) merged the patch for
extending smgrnblocks and the vacuum optimization patch, (c) made
minor cosmetic changes and ran pgindent, and (d) updated the commit
message. BTW, this optimization will help not only vacuum but also
truncate when it is done in the same transaction in which the relation
is created.  I would like to see some tests to ensure that the value
we choose for BUF_DROP_FULL_SCAN_THRESHOLD is correct. I see that some
testing has been done earlier [1] for this threshold, but I am still
not able to conclude. The criterion for finding the right threshold
should be the maximum size of a relation to be truncated above which
we don't get a benefit from this optimization.

One idea could be to remove the "nBlocksToInvalidate <
BUF_DROP_FULL_SCAN_THRESHOLD" part of the check "if (cached &&
nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it always
uses the optimized path for the tests. Then use relation sizes of
NBuffers/128, NBuffers/256, and NBuffers/512 for different
shared_buffers values of 128MB, 1GB, 20GB, and 100GB.

Apart from the tests, do let me know if you are happy with the changes
in the patch. Next, I'll look into the DropRelFileNodesAllBuffers()
optimization patch.

[1] - 
https://www.postgresql.org/message-id/OSBPR01MB234176B1829AECFE9FDDFCC2EFE90%40OSBPR01MB2341.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.


v36-0001-Optimize-DropRelFileNodeBuffers-for-recovery.patch
Description: Binary data


Re: [HACKERS] logical decoding of two-phase transactions

2020-12-22 Thread Ajin Cherian
On Sat, Dec 19, 2020 at 2:13 PM Amit Kapila  wrote:

> Okay, I have changed the rollback_prepare API as discussed above and
> accordingly handle the case where rollback is received without prepare
> in apply_handle_rollback_prepared.


I have reviewed and tested your new patchset. I agree with all the
changes that you have made, and I have tested quite a few scenarios and
they seem to be working as expected.
No major comments but some minor observations:

Patch 1:
logical.c: 984
Comment should be "rollback prepared" rather than "abort prepared".

Patch 2:
decode.c: 737: The comments in the header of DecodePrepare seem out of
place; I think they should describe what the function does rather than
what it does not.
reorderbuffer.c: 2422: It looks like pg_indent has mangled the
comments, the numbering is no longer aligned.

Patch 5:
worker.c: 753: Typo: change "dont" to "don't"

Patch 6: logicaldecoding.sgml
logicaldecoding example is no longer correct. This was true prior to
the changes done to replay prepared transactions after a restart. Now
the whole transaction will get decoded again after the commit
prepared.

postgres=# COMMIT PREPARED 'test_prepared1';
COMMIT PREPARED
postgres=# select * from
pg_logical_slot_get_changes('regression_slot', NULL, NULL,
'two-phase-commit', '1');
lsn| xid |data
---+-+
 0/168A060 | 529 | COMMIT PREPARED 'test_prepared1', txid 529
(1 row)

Patch 8:
worker.c: 2798 :
worker.c: 3445 : disabling two-phase in tablesync worker.
 Considering the new design of multiple commits in tablesync, do we need
to disable two-phase in tablesync?

Other than this I've noticed a few typos that are not in the patch but
in the surrounding code.
logical.c: 1383: Comment should mention stream_commit_cb not stream_abort_cb.
decode.c: 686 - Extra "it's" here:  "because it's it happened"

regards,
Ajin Cherian
Fujitsu Australia




Re: Feature Proposal: Add ssltermination parameter for SNI-based LoadBalancing

2020-12-22 Thread Lukas Meisegeier



Hey,

what's the state of this? Can we start working out a plan to remove the
initial SSLRequest from the connection protocol, or is there any reason to
keep it?

I would start by removing the need for the SSLRequest in the PostgreSQL server
if it is started with a special parameter (ssl-only or so).
Simultaneously I would add a parameter to disable the SSLRequest in
the client as well.

Later we could make this behaviour the default for SSL-enabled servers and
clients, and some time further ahead we could remove the
implementation of SSLRequest in both server and client.

What are your thoughts about this?

Best Regards




Confused about stream replication protocol documentation

2020-12-22 Thread Li Japin
Hi, all

In Stream Replication Protocol [1], the documentation of `START_REPLICATION` 
message is

XLogData (B)
  …
Primary keepalive message (B)
  …
Standby status update (F)
  …
Hot Standby feedback message (F)
  ...

I’m confused about the meaning of ‘B’ and ‘F’. If they don't mean anything, why 
do we document them here?
However, if they do mean something, shouldn't we explain it?
Can someone help me out?

Anyway, thanks in advance!

[1] https://www.postgresql.org/docs/devel/protocol-replication.html

--
Best regards
Japin Li



Re: Better client reporting for "immediate stop" shutdowns

2020-12-22 Thread Magnus Hagander
On Tue, Dec 22, 2020 at 2:29 AM Bharath Rupireddy
 wrote:
>
> On Tue, Dec 22, 2020 at 3:13 AM Tom Lane  wrote:
> > Up to now, if you shut down the database with "pg_ctl stop -m immediate"
> > then clients get a scary message claiming that something has crashed,
> > because backends can't tell whether the SIGQUIT they got was sent for
> > a crash-and-restart situation or because of an immediate-stop command.
> >
> > This isn't great from a fit-and-finish perspective, and it occurs to me
> > that it's really easy to do better: the postmaster can stick a flag
> > into shared memory explaining the reason for SIGQUIT.  While we don't
> > like the postmaster touching shared memory, there doesn't seem to be
> > any need for interlocking or anything like that, so there is no risk
> > involved that's greater than those already taken by pmsignal.c.
>
> +1 to improve the message.
>
> > So, here's a very simple proposed patch.  Some issues for possible
> > bikeshedding:
> >
> > * Up to now, pmsignal.c has only been for child-to-postmaster
> > communication, so maybe there is some better place to put the
> > support code.  I can't think of one though.
>
> +1 to have it here as we already have the required shared memory
> initialization code to add in new flags there -
> PMSignalState->sigquit_reason.
>
> If I'm correct, quickdie() doesn't access any shared memory because
> one of the reason we can be in quickdie() is when the shared memory
> itself is corrupted(the comment down below on why we don't call
> roc_exit() or atexit() says), in such case, will GetQuitSignalReason()
> have some problem in accessing the shared memory i.e. +return
> PMSignalState->sigquit_reason;?
>
> > * I chose to report the same message for immediate shutdown as we
> > already use for SIGTERM (fast shutdown or pg_terminate_backend()).
> > Should it be different, and if so what?
>
> AFAIK, errmsg(terminating connection due to administrator command") is
> emitted when there's no specific reason. But we know exactly why we
> are terminating in this case, how about having an error message like
> errmsg("terminating connection due to immediate shutdown request")));
> ? There are other places where errmsg("terminating connection due to
>  reasons"); is used. We also log messages when an immediate
> shutdown request is received errmsg("received immediate shutdown
> request").

+1. I definitely think having this message be different can be useful.

See also the thread about tracking shutdown reasons (connection
statistics) -- not the same thing, but the same concepts apply.
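
For what it's worth, a rough sketch of the shared-memory flag being discussed
above: GetQuitSignalReason() and PMSignalState->sigquit_reason come from the
quoted description, while the enum values and SetQuitSignalReason() are
hypothetical names for illustration, and the real patch may differ.

#include <signal.h>

typedef enum QuitSignalReason
{
	PMQUIT_NOT_SENT = 0,		/* postmaster has not sent SIGQUIT */
	PMQUIT_FOR_CRASH,			/* crash-and-restart cycle */
	PMQUIT_FOR_STOP				/* immediate shutdown requested */
} QuitSignalReason;

/* stand-in for the existing pmsignal.c shared structure */
typedef struct PMSignalData
{
	sig_atomic_t sigquit_reason;	/* written only by the postmaster */
} PMSignalData;

static volatile PMSignalData *PMSignalState;

/* postmaster: record the reason just before signalling the children */
static void
SetQuitSignalReason(QuitSignalReason reason)
{
	PMSignalState->sigquit_reason = reason;
}

/*
 * backend: safe to call from quickdie(), since it is only a plain read of
 * one flag in an already-attached segment -- no locks, no allocation.
 */
static QuitSignalReason
GetQuitSignalReason(void)
{
	return (QuitSignalReason) PMSignalState->sigquit_reason;
}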

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Fail Fast In CTAS/CMV If Relation Already Exists To Avoid Unnecessary Rewrite, Planning Costs

2020-12-22 Thread Michael Paquier
On Mon, Dec 21, 2020 at 12:01:38PM +0530, Bharath Rupireddy wrote:
> On Fri, Dec 18, 2020 at 8:15 AM Bharath Rupireddy
>> I tried to make it consistent by issuing NOTICE (not an error) even
>> for EXPLAIN/EXPLAIN ANALYZE IF NOT EXISTS case. If issue notice and
>> exit from the ExplainOneUtility, we could output an empty plan to the
>> user because, by now ExplainResultDesc would have been called at the
>> start of the explain via PortalStart(). I didn't find a clean way of
>> coding if we are not okay to show notice and empty plan to the user.
>>
>> Any suggestions on achieving above?

I was looking at your patch today, and I actually found the conclusion
to output an empty plan while issuing a NOTICE to be quite intuitive
if the caller uses IF NOT EXISTS with EXPLAIN.

> Attaching v3 patch that also contains test cases. Please review it further.

Thanks for adding some test cases!  Some of them were exact
duplicates, so it is possible to reduce the number of queries without
impacting the coverage.  I have also chosen a query that forces an
error within the planner.

Please see the attached.  IF NOT EXISTS implies that CTAS or CREATE
MATVIEW will never ERROR if the relation already exists, with or
without EXPLAIN, EXECUTE or WITH NO DATA, so that gets us a consistent
behavior across all the patterns.

Note: I'd like to think that we could choose a better name for
CheckRelExistenceInCTAS().
--
Michael
From 606520d6f023163024ec5f215b45e9a89f093884 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 22 Dec 2020 17:35:22 +0900
Subject: [PATCH v4] Fail Fast In CTAS/CMV If Relation Already Exists

Currently, for CTAS or CREATE MATERIALIZED VIEW(CMV) without
IF-NOT-EXISTS clause, the existence of the relation (either
table or materialized view) gets checked during execution and an
error is thrown there. All the unnecessary rewrite and planning
for the SELECT part of the query have happened just to fail
later. However, if IF-NOT-EXISTS clause is present, then a notice
is issued and returned immediately without rewrite and planning
further. This seems somewhat inconsistent.

This patch proposes to check the relation existence early in
ExecCreateTableAs() as well as in ExplainOneUtility() and
throw an error in case it exists already to avoid unnecessary
rewrite, planning and execution of the SELECT part.
---
 src/include/commands/createas.h   |  2 +
 src/backend/commands/createas.c   | 56 +--
 src/backend/commands/explain.c|  8 
 src/test/regress/expected/matview.out | 38 +++
 src/test/regress/expected/select_into.out | 31 +
 src/test/regress/sql/matview.sql  | 23 ++
 src/test/regress/sql/select_into.sql  | 16 +++
 7 files changed, 159 insertions(+), 15 deletions(-)

diff --git a/src/include/commands/createas.h b/src/include/commands/createas.h
index 7629230254..404adf814b 100644
--- a/src/include/commands/createas.h
+++ b/src/include/commands/createas.h
@@ -29,4 +29,6 @@ extern int	GetIntoRelEFlags(IntoClause *intoClause);
 
 extern DestReceiver *CreateIntoRelDestReceiver(IntoClause *intoClause);
 
+extern bool CheckRelExistenceInCTAS(CreateTableAsStmt *ctas);
+
 #endif			/* CREATEAS_H */
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 6bf6c5a310..2c21b1b87f 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -239,21 +239,9 @@ ExecCreateTableAs(ParseState *pstate, CreateTableAsStmt *stmt,
 	PlannedStmt *plan;
 	QueryDesc  *queryDesc;
 
-	if (stmt->if_not_exists)
-	{
-		Oid			nspid;
-
-		nspid = RangeVarGetCreationNamespace(stmt->into->rel);
-
-		if (get_relname_relid(stmt->into->rel->relname, nspid))
-		{
-			ereport(NOTICE,
-	(errcode(ERRCODE_DUPLICATE_TABLE),
-	 errmsg("relation \"%s\" already exists, skipping",
-			stmt->into->rel->relname)));
-			return InvalidObjectAddress;
-		}
-	}
+	/* Check if the to-be-created relation exists or not */
+	if (!CheckRelExistenceInCTAS(stmt))
+		return InvalidObjectAddress;
 
 	/*
 	 * Create the tuple receiver object and insert info it will need
@@ -400,6 +388,44 @@ GetIntoRelEFlags(IntoClause *intoClause)
 	return flags;
 }
 
+/*
+ * CheckRelExistenceInCTAS --- check existence of relation in CREATE TABLE AS
+ *
+ * This routine can be used as a fast-exit path when checking if a CTAS query
+ * needs to be executed or not.  Returns true if the relation can be created.
+ */
+bool
+CheckRelExistenceInCTAS(CreateTableAsStmt *ctas)
+{
+	Oid nspid;
+	IntoClause *into = ctas->into;
+
+	nspid = RangeVarGetCreationNamespace(into->rel);
+
+	if (get_relname_relid(into->rel->relname, nspid))
+	{
+		if (!ctas->if_not_exists)
+			ereport(ERROR,
+	(errcode(ERRCODE_DUPLICATE_TABLE),
+	 errmsg("relation \"%s\" already exists",
+			into->rel->relname)));
+
+		/*
+		 * The relation exists and IF NOT EXISTS has been specified,
+		 * so let the caller know about that and let the processing

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-22 Thread Justin Pryzby
On Tue, Dec 22, 2020 at 03:47:57PM +0900, Michael Paquier wrote:
> On Wed, Dec 16, 2020 at 10:01:11AM +0900, Michael Paquier wrote:
> > On Tue, Dec 15, 2020 at 09:45:17PM -0300, Alvaro Herrera wrote:
> > > I don't like this idea too much, because adding an option causes an ABI
> > > break.  I don't think we commonly add options in backbranches, but it
> > > has happened.  The bitmask is much easier to work with in that regard.
> > 
> > ABI flexibility is a good point here.  I did not consider this point
> > of view.  Thanks!
> 
> FWIW, I have taken a shot at this part of the patch, and finished with
> the attached.  This uses bits32 for the bitmask options and a hex
> style for the bitmask params, while bundling all the flags into
> dedicated structures for all the options that can be extended for the
> tablespace case (or some filtering for REINDEX).
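
(For readers skimming the thread, the option style being described
boils down to something like the sketch below.  Only ReindexOptions,
its flags field and REINDEXOPT_MISSING_OK appear in the hunks quoted
further down, so the remaining names and values are purely
illustrative.)

/*
 * Illustrative sketch: a bits32 bitmask with hex-style flag values,
 * bundled into a per-command struct that can later grow non-boolean
 * members (e.g. a target tablespace for the REINDEX/CLUSTER/VACUUM
 * FULL case this thread is about).
 */
#include "postgres.h"			/* provides bits32 */

#define REINDEXOPT_VERBOSE		0x01	/* values are illustrative */
#define REINDEXOPT_MISSING_OK	0x02

typedef struct ReindexOptions
{
	bits32		flags;			/* bitmask of REINDEXOPT_* */
	/* room for e.g. a tablespace OID later */
} ReindexOptions;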

Seems fine, but why do you do memcpy() instead of a structure assignment?

> @@ -3965,8 +3965,11 @@ reindex_relation(Oid relid, int flags, int options)
> 	 * Note that this should fail if the toast relation is missing, so
> 	 * reset REINDEXOPT_MISSING_OK.
> 	 */
> -	result |= reindex_relation(toast_relid, flags,
> -							   options & ~(REINDEXOPT_MISSING_OK));
> +	ReindexOptions newoptions;
> +
> +	memcpy(&newoptions, options, sizeof(ReindexOptions));
> +	newoptions.flags &= ~(REINDEXOPT_MISSING_OK);
> +	result |= reindex_relation(toast_relid, flags, &newoptions);

Could be newoptions = *options;
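
Spelled out against the quoted hunk, that suggestion amounts to the
following (same behavior, sketch only):

	ReindexOptions newoptions;

	/* plain struct assignment copies every member, same as the memcpy */
	newoptions = *options;
	newoptions.flags &= ~(REINDEXOPT_MISSING_OK);
	result |= reindex_relation(toast_relid, flags, &newoptions);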

Also, this one is going to be subsumed by ExecReindex(), so the palloc will go
away (otherwise I would ask to pass it in from the caller):

> +ReindexOptions *
>  ReindexParseOptions(ParseState *pstate, ReindexStmt *stmt)
>  {
>   ListCell   *lc;
> - int options = 0;
> + ReindexOptions *options;
> 	bool		concurrently = false;
> 	bool		verbose = false;
>  
> + options = (ReindexOptions *) palloc0(sizeof(ReindexOptions));
> +
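
For what it's worth, the caller-provided variant alluded to above would
look roughly like the sketch below; it is purely illustrative, since
ExecReindex() is expected to make the question moot:

/*
 * Hypothetical caller-allocated variant: the caller owns the
 * ReindexOptions storage and the parse helper only fills it in,
 * so no palloc is needed here.
 */
void
ReindexParseOptions(ParseState *pstate, ReindexStmt *stmt,
					ReindexOptions *options)
{
	memset(options, 0, sizeof(ReindexOptions));

	/* ... walk the statement's option list and set options->flags ... */
}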

-- 
Justin 




Re: ModifyTable overheads in generic plans

2020-12-22 Thread Amit Langote
On Mon, Dec 7, 2020 at 3:53 PM Amit Langote  wrote:
>
> On Thu, Nov 12, 2020 at 5:04 PM Amit Langote  wrote:
> > Attached new 0002 which does these adjustments.  I went with
> > ri_RootTargetDesc to go along with ri_RelationDesc.
> >
> > Also, I have updated the original 0002 (now 0003) to make
> > GetChildToRootMap() use ri_RootTargetDesc instead of
> > ModifyTableState.rootResultRelInfo.ri_RelationDesc, so that even
> > AfterTriggerSaveEvent() can now use that function.  This allows us to
> > avoid having to initialize ri_ChildToRootMap anywhere but inside
> GetChildToRootMap(), with that long comment defending doing so. :-)
>
> These needed to be rebased due to recent copy.c upheavals.  Attached.
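
For readers catching up, the GetChildToRootMap() arrangement described
above amounts to roughly the sketch below, assuming ri_RootTargetDesc
is the root target relation's TupleDesc; the validity flag and helper
calls are illustrative, and the real code is in the attached patches:

/*
 * Build the child-to-root tuple conversion map lazily, and only here,
 * from the ri_RootTargetDesc saved at ResultRelInfo initialization
 * time (sketch; field names other than ri_RelationDesc,
 * ri_RootTargetDesc and ri_ChildToRootMap are hypothetical).
 */
TupleConversionMap *
GetChildToRootMap(ResultRelInfo *resultRelInfo)
{
	if (!resultRelInfo->ri_ChildToRootMapValid)
	{
		resultRelInfo->ri_ChildToRootMap =
			convert_tuples_by_name(RelationGetDescr(resultRelInfo->ri_RelationDesc),
								   resultRelInfo->ri_RootTargetDesc);
		resultRelInfo->ri_ChildToRootMapValid = true;
	}

	return resultRelInfo->ri_ChildToRootMap;
}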

Needed to be rebased again.

-- 
Amit Langote
EDB: http://www.enterprisedb.com


v12-0001-Set-ForeignScanState.resultRelInfo-lazily.patch
Description: Binary data


v12-0002-Rethink-ResultRelInfo.ri_PartitionRoot.patch
Description: Binary data


v12-0003-Initialize-result-relation-information-lazily.patch
Description: Binary data

